This Domino Data Science Field Note covers evaluation metrics for supervised learning models and offline evaluation mechanisms, drawing on highlights from Zheng's report. The full report also covers offline vs. online evaluation mechanisms, hyperparameter tuning, and potential A/B testing pitfalls. A complement to the report is also available.
Why does model evaluation matter?
Data scientists build models. Whether they describe that work as building a model as a product or as building a set of models that inform business strategy, evaluating model performance is a fundamental part of machine learning model development. The distribution of the data a model sees may change over time. By evaluating the model and detecting distribution drift, people can identify when the model needs to be retrained. Considering model evaluation at the start of any project helps answer questions like "how can I measure success for this project?" and avoids "working on ill-formulated projects where good measurement is vague".
Evaluation metrics for supervised learning models
A machine learning model goes through multiple stages of development, and there are multiple places where one needs to evaluate it. Zheng advocates considering model evaluation during the prototyping stage, when we try out different models to find the best one. Evaluation metrics are tied to machine learning tasks, so different tasks call for different metrics. The report covers evaluation metrics for three supervised learning tasks: classification, regression, and ranking. Two packages to consider are R's Metrics package and scikit-learn's model evaluation module.
Classification
The most popular metrics for measuring classification performance include accuracy, confusion matrix, log-loss, and AUC (area under the curve). Accuracy is the ratio between the number of correct predictions and the total number of predictions, while a confusion matrix shows a more detailed breakdown of correct and incorrect classifications for each class. A confusion matrix is useful when the cost of misclassification differs between the two classes, or when the test data contains far more examples of one class than the other. In a cancer diagnosis, for example, a false positive and a false negative have very different consequences.
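The difference between accuracy and a confusion matrix can be sketched in a few lines of plain Python. The labels below are hypothetical and the helper names `accuracy` and `confusion_counts` are chosen for illustration; scikit-learn's model evaluation module offers equivalent functionality.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))


# Hypothetical data: 1 = positive (e.g. disease present), 0 = negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

acc = accuracy(y_true, y_pred)
cm = confusion_counts(y_true, y_pred)
true_pos = cm[(1, 1)]
false_neg = cm[(1, 0)]
false_pos = cm[(0, 1)]
true_neg = cm[(0, 0)]

print(f"accuracy={acc:.2f} TP={true_pos} FN={false_neg} FP={false_pos} TN={true_neg}")
```

In this made-up example the single accuracy figure hides the fact that two of the three positive cases were missed, which is exactly the kind of detail the confusion matrix exposes.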
Log-loss can be used when the raw output of the classifier is a numeric probability instead of a class label. The probability serves as a gauge of confidence, and log-loss incorporates this idea of probabilistic confidence into the measurement of accuracy. AUC is one way to summarize the ROC curve into a single number so that it can be compared easily and automatically. The ROC curve itself is a whole curve, not a single number. More detailed explanations of AUC and ROC are available.
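As a rough illustration of both metrics, the sketch below computes binary log-loss and AUC in plain Python. The AUC here uses the pairwise-ranking formulation (the probability that a randomly chosen positive is scored above a randomly chosen negative), which is equivalent to the area under the ROC curve. The two sets of predicted probabilities are hypothetical.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Average negative log-likelihood of the true labels (binary case)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


def auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 0, 0]
confident = [0.9, 0.8, 0.2, 0.1]   # hypothetical, well-separated probabilities
hesitant = [0.6, 0.55, 0.45, 0.4]  # hypothetical, same ranking but less confident

print(log_loss(y_true, confident), log_loss(y_true, hesitant))
print(auc(y_true, confident), auc(y_true, hesitant))
```

Both classifiers rank every positive above every negative, so their AUC is identical, but log-loss rewards the one that is more confident in its correct answers.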
Ranking
One of the primary ranking metrics is precision-recall. These are two metrics that are typically used together. Zheng defines precision and recall mathematically as follows:
- precision = (# happy correct answers) / (# total items returned by ranker)
- recall = (# happy correct answers) / (# total relevant items)
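The two ratios above can be sketched for a ranker that returns its top k items. The cutoff `k`, the function name `precision_recall_at_k`, and the item identifiers are assumptions made for illustration; they are not taken from the report.

```python
def precision_recall_at_k(ranked_items, relevant_items, k):
    """Precision and recall over the top-k items returned by a ranker."""
    returned = ranked_items[:k]
    relevant = set(relevant_items)
    # "Happy correct answers": returned items that are actually relevant.
    hits = sum(1 for item in returned if item in relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ranked output and ground-truth relevant set.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "f", "g"}

p, r = precision_recall_at_k(ranked, relevant, k=3)
print(f"precision@3={p:.3f} recall@3={r:.3f}")
```

Returning more items can only keep recall the same or raise it, while precision may fall, which is one reason the two numbers are read together.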
In an underlying implementation, the ranker may simply order the items by a raw numeric score. Personalized recommendation is an example of a problem that can be framed as either ranking or regression, because the recommender can act as a ranker or as a score predictor. In the first case, the output is a ranked list of items for each user. In the case of score prediction, the recommender needs to return a predicted score for each user-item pair, which makes it a regression model.
Regression
According to the report, in a regression task the model learns to predict numeric scores. Personalized recommendation, where we try to predict a user's rating for an item, is one example. One of the most commonly used metrics for regression tasks is the root-mean-square error (RMSE). Zheng cautions that RMSE comes with some challenges. RMSE is sensitive to large outliers: because the mean is not robust, a very bad prediction on a single data point can make the average error very large. Real data will likely contain outliers, and the model will probably not perform very well on them. It is therefore important to also look at robust estimates of performance that are unaffected by large outliers. The median absolute percentage error is one such estimate, and it gives a relative measure of the typical error.
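The sensitivity of RMSE to a single outlier, and the robustness of the median absolute percentage error, can be seen in a small sketch. The target values and predictions are hypothetical, and the function name `median_ape` is chosen for illustration.

```python
import math
import statistics


def rmse(y_true, y_pred):
    """Root-mean-square error."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def median_ape(y_true, y_pred):
    """Median absolute percentage error (assumes no true value is zero)."""
    return statistics.median(abs((t - p) / t) for t, p in zip(y_true, y_pred))


y_true = [10.0, 20.0, 30.0, 40.0, 50.0]
y_good = [11.0, 19.0, 31.0, 39.0, 51.0]      # every prediction off by 1
y_outlier = [110.0, 19.0, 31.0, 39.0, 51.0]  # one prediction off by 100

print("RMSE:      ", rmse(y_true, y_good), rmse(y_true, y_outlier))
print("Median APE:", median_ape(y_true, y_good), median_ape(y_true, y_outlier))
```

One bad prediction moves RMSE from 1 to more than 40 here, while the median absolute percentage error does not change at all.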
Offline evaluation mechanisms
In the report, Zheng advocates the following:
The model must be evaluated on a dataset that is statistically independent from the one it was trained on. Its performance on the training set is an overly optimistic estimate of its true performance on new data, because the training process has already adapted the model to the training data. A fairer evaluation measures the model's performance on data that it has not yet seen. This gives an estimate of how well the model generalizes to new data.
Researchers can use hold-out validation as a way to obtain data the model has not seen. Assuming that all data points are independent and identically distributed (i.i.d.), we randomly hold out part of the data for validation. We train the model on the larger portion of the data and evaluate the validation metrics on the smaller hold-out set. When there is a need for a mechanism to generate additional datasets, resampling techniques such as bootstrapping or cross-validation may be used. Bootstrapping generates multiple datasets by sampling, with replacement, from a single original dataset. Each of the new datasets can be used to estimate a quantity of interest. Since there are multiple datasets and therefore multiple estimates, one can also calculate things like a confidence interval for the estimate. Cross-validation is useful when the training dataset is so small that one cannot afford to hold out part of the data just for validation purposes. One of the most commonly used cross-validation techniques is k-fold cross-validation.
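Hold-out validation and bootstrapping can both be sketched with the standard library alone. The 20% hold-out fraction, the percentile method for the confidence interval, the helper names, and the per-example error values are all assumptions made for illustration.

```python
import random
import statistics


def holdout_split(data, holdout_fraction=0.2, seed=0):
    """Randomly hold out a fraction of the data for validation."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    # Larger portion for training, smaller portion for validation.
    return shuffled[n_holdout:], shuffled[:n_holdout]


def bootstrap_ci(values, stat=statistics.mean, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic."""
    rng = random.Random(seed)
    n = len(values)
    # Each resample draws n points with replacement from the original data.
    estimates = sorted(
        stat([rng.choice(values) for _ in range(n)]) for _ in range(n_resamples)
    )
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


data = list(range(100))
train, holdout = holdout_split(data)

# Hypothetical per-example validation errors.
errors = [0.1, 0.3, 0.2, 0.4, 0.25, 0.15, 0.35, 0.3, 0.2, 0.25]
lo, hi = bootstrap_ci(errors)
print(len(train), len(holdout), lo, hi)
```

The fixed seeds only make the sketch repeatable; in practice the split and the resampling would usually be driven by whatever randomness policy the project already uses.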
In k-fold cross-validation, the training dataset is divided into k folds. Each of the k folds takes turns being the hold-out validation set; a model is trained on the remaining k - 1 folds and measured on the held-out fold. The overall performance is taken to be the average of the performance on all k folds. Repeat this procedure for every hyperparameter setting that needs to be evaluated, then pick the hyperparameters that resulted in the highest k-fold average.
The scikit-learn cross-validation module may be useful.
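The k-fold procedure, including the hyperparameter selection step, might look like the following sketch. The model is a deliberately simple one-dimensional nearest-neighbors regressor, the data is synthetic, and the names `k_fold_indices`, `knn_predict`, and `k_fold_average` are hypothetical. The score is the negative mean squared error so that a higher k-fold average is better.

```python
import random
import statistics


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def knn_predict(train, x, n_neighbors):
    """Predict y as the mean target of the nearest training points."""
    nearest = sorted(train, key=lambda pt: abs(pt[0] - x))[:n_neighbors]
    return statistics.mean(y for _, y in nearest)


def k_fold_average(data, k, n_neighbors):
    """Average validation score (negative MSE) over all k folds."""
    scores = []
    for held_out in k_fold_indices(len(data), k):
        held = set(held_out)
        train = [data[i] for i in range(len(data)) if i not in held]
        valid = [data[i] for i in held_out]
        mse = statistics.mean(
            (y - knn_predict(train, x, n_neighbors)) ** 2 for x, y in valid
        )
        scores.append(-mse)
    return statistics.mean(scores)


# Synthetic data: y = x^2 plus Gaussian noise.
rng = random.Random(42)
data = [(x / 10, (x / 10) ** 2 + rng.gauss(0, 0.5)) for x in range(60)]

# Evaluate each candidate hyperparameter and keep the best k-fold average.
candidates = [1, 3, 5, 15]
results = {n: k_fold_average(data, k=5, n_neighbors=n) for n in candidates}
best = max(results, key=results.get)
print(results, "best n_neighbors:", best)
```

Every data point is used for validation exactly once and for training k - 1 times, which is what makes the approach attractive when data is scarce.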
Because data scientists spend so much time building models, considering evaluation metrics early on may help them accelerate their work and set their projects up for success. Evaluating machine learning models is nonetheless a known challenge. This Domino Data Science Field Note provides only a few insights from Zheng's report; the full report is available for download.
Domino Data Science Field Notes provide highlights of data science research, trends, techniques, and more. If you are interested in having your data science work covered in this series, please send us an email.