
In this article, we discuss strategies for effective machine learning testing. You will learn how to test and evaluate models.

What is the goal of ML testing?

What are we trying to achieve when performing software testing?

  • The software system needs to work according to the requirements: all the parameters you test the program against should be stated in the technical specification document.
  • Software testing points out defects and flaws during development. You don't want your clients to encounter bugs after the software is released, and different kinds of testing allow us to catch bugs that would otherwise stay hidden.

In machine learning, a programmer usually inputs the data and the desired behavior, and the logic is worked out by the machine itself; this is especially true for deep learning. The purpose of machine learning testing is therefore to make sure that this learned logic stays consistent no matter how many times we call the program.

Model evaluation in machine learning testing

Software testing usually includes:

  • Unit tests. Each element of the program is tested separately.
  • Regression tests. They verify that previously working functionality doesn't suddenly break after changes.
  • Integration tests. Multiple components of the program are tested together.

Some of the rules people follow: don't merge code before it passes all the tests, always test newly introduced blocks of code, and when you fix a bug, write a test that captures it.

Machine learning adds more actions to your to-do list. Beyond following the usual best practices, every model needs to be evaluated: you want to make sure it generalizes and that its performance is satisfactory. Evaluation is not quite the same as what we usually understand by testing.

Machine learning data set division

You split the dataset into three sets. The training set is used to train the model, and the other two are used to evaluate its performance.

  • Validation set. It is not enough to have only a training set and a testing set: tuning the model against the test set can lead to overfitting. Instead, a small validation set is used to evaluate the model during development. The testing set comes into play only after you have reached the maximum accuracy on the validation set.
  • Test (holdout) set. Your model might fit the training dataset perfectly yet fail on new data. To check for that, you select samples for a testing set that the machine hasn't seen before. It is important to draw these samples at random, and to avoid training on your test data, you should not reuse the same set many times. The test set should be large enough to provide statistically meaningful results and representative of the dataset as a whole.
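The split described above can be sketched in a few lines of Python. The 70/15/15 fractions and the fixed seed are illustrative choices, not prescribed by the article; the key points are shuffling before splitting and holding the test set out until the end.

```python
import random

def split_dataset(samples, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle and split samples into train/validation/test sets.
    The test set receives whatever remains after the first two splits."""
    rng = random.Random(seed)
    shuffled = samples[:]           # copy so the original order is preserved
    rng.shuffle(shuffled)           # drawing at random avoids ordering bias
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]   # held out until the very end
    return train, val, test

train, val, test = split_dataset(list(range(100)))
print(len(train), len(val), len(test))  # 70 15 15
```

Every sample lands in exactly one of the three sets, so no test example leaks into training.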

How to use the different parts of a machine learning dataset

Validation sets wear out when used repeatedly: the more times you use the same data to make decisions, the less confident you can be that the model will generalize. Where possible, collect more data to refresh the test set and validation set.


Cross-validation can be performed even on a limited dataset: the model is trained on different subsets of the data, with a different part held out for evaluation each time.

k-fold cross-validation

The most common method of cross-validation is k-fold. To use it, you divide the dataset into k subsets: if you break the dataset into 10 subsets, you will perform a 10-fold cross-validation. Each subset is used as the validation set exactly once.
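A minimal sketch of generating k-fold splits, written in plain Python to show the mechanics (a library such as scikit-learn provides this out of the box):

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) pairs; each sample appears
    in the validation fold exactly once across the k iterations."""
    indices = list(range(n_samples))
    # Distribute any remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

folds = list(k_fold_indices(10, 5))
print(len(folds))  # 5 train/validation splits
```

Each of the 5 folds trains on 8 samples and validates on the remaining 2, and the validation folds together cover the whole dataset.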


This way, the machine learning model is tested on unseen data. The method is popular because it is easy to apply, works well even with relatively small datasets, and the results are generally quite accurate. You can find detailed explanations of how to cross-validate a model on Medium.

Leave-one-out cross-validation

In this method, we train the model on all the data samples in the set except for one data point, which is used to test the model. By leaving out a different data point as the testing set each time, you get to test the performance on all the data.

The method has low bias, since almost all the data points are used for training. However, because we test the model against just one data point each time, it leads to higher variance in the test results.

Cross-validation helps to better assess the accuracy of the model and provides for more efficient use of the data.

Evaluate models using metrics

Different data science projects use different metrics to evaluate model performance. Here are the ones you should keep an eye on.


Accuracy is the fraction of the model's predictions that are correct. The higher the accuracy, the better, but it is not the only metric to watch when estimating performance.

$$Accuracy \equiv \frac{True\ Positives + True\ Negatives}{True\ Positives + False\ Positives + True\ Negatives + False\ Negatives}$$


Loss quantifies how bad the model's predictions are. If the model is perfect, the loss is zero.


Precision marks how exact the model's positive predictions are. A model with low precision raises frequent false alarms, for example diagnosing cancer in patients who don't have it.

$$Precision \equiv \frac{True\ Positives}{True\ Positives + False\ Positives}$$


Recall is the number of correct positive predictions divided by the number of results that should have been predicted as positive. In other words, it is the percentage of total relevant results correctly classified.

$$Recall \equiv \frac{True\ Positives}{True\ Positives + False\ Negatives}$$
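The three formulas above can be computed directly from the four confusion-matrix counts. The counts in the example are made up for illustration:

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, and recall from true/false positive/negative counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0  # guard against 0/0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Hypothetical counts: 80 true positives, 10 false positives,
# 95 true negatives, 15 false negatives.
acc, prec, rec = classification_metrics(80, 10, 95, 15)
print(round(acc, 3), round(prec, 3), round(rec, 3))  # 0.875 0.889 0.842
```

Note that precision and recall pull in different directions: lowering the decision threshold typically raises recall while hurting precision.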

Confusion matrix

A confusion matrix is a square table whose size equals the number of classes the model needs to distinguish; it is usually applied to classification problems. If you need to categorize fruits into three categories, you draw a 3×3 table: one axis shows the actual label and the other the predicted one.
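A sketch of building such a table for the three-fruit example; the labels and predictions here are invented for illustration:

```python
from collections import defaultdict

def confusion_matrix(actual, predicted, labels):
    """Rows correspond to the actual label, columns to the predicted label."""
    counts = defaultdict(int)
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return [[counts[(a, p)] for p in labels] for a in labels]

labels = ["apple", "orange", "pear"]
actual    = ["apple", "apple", "orange", "pear", "pear", "orange"]
predicted = ["apple", "orange", "orange", "pear", "apple", "orange"]
matrix = confusion_matrix(actual, predicted, labels)
for row in matrix:
    print(row)
```

The diagonal holds the correct predictions; everything off the diagonal is a misclassification, and reading along a row shows which classes a given fruit is confused with.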


Best practices for ML model debugging

After evaluating performance, we have to figure out why the errors occur.

Debugging an ML model is not the same as debugging any other software system. If the predictions made by the model are poor, there isn't necessarily a bug in the code; you have to look at a broader range of causes than you would in traditional programming. This is what makes models difficult to debug.

Data debugging

The accuracy of predictions made by the model depends on the quality of data itself, so you should start with data debugging.

Database schema

The data schema can be used to check whether the data contains the expected statistical values.

A schema is like a map that describes the logic of the database: how the data is organized and what the relationships between the samples are. It may include rules such as:

  • Ensure that submitted ratings are within the allowed range.
  • Check that all the images are in the same format.
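The two rules above can be expressed as simple validation code. The field names (`rating`, `image`) and the 1–5 rating range are hypothetical choices for this sketch:

```python
def validate_record(record):
    """Schema-style checks on one data record: rating in range,
    image file in the expected format. Returns a list of violations."""
    errors = []
    if not (1 <= record.get("rating", 0) <= 5):   # assumed allowed range
        errors.append("rating out of range")
    if not record.get("image", "").endswith(".png"):  # assumed format
        errors.append("unexpected image format")
    return errors

print(validate_record({"rating": 4, "image": "cat.png"}))  # []
print(validate_record({"rating": 9, "image": "cat.gif"}))
```

Running such checks over the whole dataset before training catches malformed samples early, when they are cheap to fix.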

There are two types of schema:

  • Physical. It describes how the data will be stored.
  • Logical. It describes the logical components of the database: tables, tags, or schemes.


Engineered data should be checked separately: the raw data can be fine while the engineered features look different. You can write tests to verify that outliers are handled and that missing values were replaced by mean or default values.
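A small sketch of such a check for mean imputation; the imputation function stands in for whatever feature-engineering step your pipeline actually uses:

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

# Test: after imputation there are no missing values, and the gap
# really was filled with the mean of the remaining entries.
filled = impute_mean([1.0, None, 3.0])
assert None not in filled
assert filled[1] == 2.0
print("engineered-data check passed")
```

Similar assertions can verify outlier clipping, scaling ranges, or one-hot encodings.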

Model debugging

Model debugging can be done once you have tested your data.

Establish a baseline

Setting a baseline and comparing the model against it lets you quickly test the model's quality. A baseline is a simple heuristic used to predict the label; if your model performs worse than its baseline, it needs to be improved. For a classification problem, a common baseline is predicting the most common label.

When you update your model, the previous version can serve as the baseline for newer versions: the new model must perform better than the old one.
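The most-common-label baseline mentioned above takes only a few lines. The tiny label lists are made up for illustration:

```python
from collections import Counter

def majority_baseline(train_labels):
    """Predict the most frequent training label for every input."""
    return Counter(train_labels).most_common(1)[0][0]

def accuracy(predictions, labels):
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

train_labels = ["cat", "cat", "dog", "cat"]
test_labels  = ["cat", "dog", "cat", "cat"]

baseline = majority_baseline(train_labels)            # "cat"
baseline_acc = accuracy([baseline] * len(test_labels), test_labels)
print(baseline_acc)  # 0.75 -- a trained model should beat this number
```

If your model's test accuracy doesn't clear this bar, it has learned nothing beyond class frequencies.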

Write ML unit tests

Similar to traditional testing, you write and run tests to check the performance of the program, catching bugs in its different parts; for example, you can test the hidden layers in a neural network. Datacamp can help you dive deeper into unit testing for different models.
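As an illustration of an ML unit test, here is a check on one small model component, a softmax layer, verifying properties that must hold regardless of training. The component under test is an assumed example, not something from the article:

```python
import math
import unittest

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)                      # subtract max to avoid overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

class TestSoftmax(unittest.TestCase):
    def test_output_is_distribution(self):
        out = softmax([2.0, 1.0, 0.1])
        self.assertEqual(len(out), 3)            # shape is preserved
        self.assertAlmostEqual(sum(out), 1.0)    # probabilities sum to 1
        self.assertTrue(all(p > 0 for p in out)) # all strictly positive

unittest.main(argv=["softmax_test"], exit=False)
```

The same pattern applies to any layer or preprocessing step: assert on invariants (shapes, ranges, sums) rather than on exact learned values.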

Adjust hyperparameters

Poor model performance can often be attributed to ill-adjusted hyperparameters. Here are the ones you should check.

  • Learning rate. Libraries usually set a default learning rate, but it might not be the best one for your model. The best option is to set it manually and experiment to see what gives you the best loss without taking hours to train.
  • Regularization. First, make sure the model can make predictions without any regularization. L1 regularization is useful if you need to reduce your model's size; apply L2 regularization if you prefer increased model stability; and use dropout regularization in neural networks.
  • Batch size. Models trained on smaller batches tend to generalize better; the minimal workable batch size depends on the model.
  • Depth of layers. The deeper a neural network is, the more layers it has. If you feel the model should be deeper to solve your problem, start from 1 layer and gradually increase the number. This approach helps to keep the model simple.
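The learning-rate advice above can be demonstrated on a toy problem: gradient descent on the one-dimensional loss f(w) = (w - 3)², which stands in for a real training run. The specific rates tried are arbitrary:

```python
def final_loss(lr, steps=50):
    """Run gradient descent on f(w) = (w - 3)^2 from w = 0
    and return the loss after a fixed budget of steps."""
    w = 0.0
    for _ in range(steps):
        w -= lr * 2 * (w - 3)        # gradient of (w - 3)^2 is 2(w - 3)
        if abs(w) > 1e6:             # the iterates blew up: diverged
            return float("inf")
    return (w - 3) ** 2

for lr in [0.001, 0.01, 0.1, 1.5]:
    print(lr, final_loss(lr))
```

Too small a rate (0.001) leaves the loss high after the step budget, a moderate rate (0.1) converges, and too large a rate (1.5) diverges; this is exactly the sweep you would run (more carefully) on a real model.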

How to write model tests?

To write model tests, we need to cover several issues:

  • Check the general logic of the model (this is not possible in the case of deep neural networks).
  • Control the model performance with manual testing.
  • Evaluate the accuracy of the model.
  • Make sure the achieved loss is acceptable.
  • If you get good results, check the model's performance on real data.


There are two types of tests.

Pre-train tests

This type of test allows you to catch bugs before you run the model; they don't need trained parameters to be run. A program that checks for missing labels is an example of a pre-train test.
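A sketch of the missing-labels check mentioned above; the dataset format (a list of dicts with a `label` key) is an assumption for the example:

```python
def check_labels(dataset):
    """Pre-train test: fail loudly if any sample is missing its label.
    Runs before training, so no model parameters are needed."""
    missing = [i for i, sample in enumerate(dataset)
               if sample.get("label") in (None, "")]
    assert not missing, f"samples with missing labels: {missing}"

# A clean dataset passes silently.
check_labels([{"x": 1, "label": "cat"}, {"x": 2, "label": "dog"}])
print("pre-train label check passed")
```

The same shape of test works for other pre-train checks: duplicate samples, label values outside the known class set, or empty feature vectors.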

Post-train tests

These tests are performed on a trained model and let us investigate the logic behind the algorithm and see whether there are any bugs. Three types of tests report on the program's behavior:

  • Invariance tests. These check how much we can change the input without affecting the model's performance: we expect consistent predictions across equivalent input examples. If we run a pattern recognition model on two different photos of red apples, we expect the result to be the same.
  • Directional expectation tests. These check how changes in the input change the model's behavior. When building a regression model that estimates house prices and takes square meters as one of the parameters, we want to see that adding extra space makes the predicted price go up.
  • Minimum functionality tests. Just like traditional unit tests, these allow us to examine components of the program in isolation and look at the model's behavior on specific cases.
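The first two test types can be sketched with a toy price model standing in for a trained regressor. The pricing function and the irrelevant `listing_id` feature are invented for the example:

```python
def predict_price(sqm, listing_id=0):
    """Toy stand-in for a trained regression model. The listing_id
    argument is an irrelevant feature the prediction should ignore."""
    return 1500.0 * sqm   # assumed price per square meter

# Invariance test: changing an irrelevant input leaves the prediction alone.
assert predict_price(80, listing_id=1) == predict_price(80, listing_id=99)

# Directional expectation test: adding square meters raises the price.
assert predict_price(90) > predict_price(80)

print("behavioral tests passed")
```

Against a real model the assertions would use tolerances rather than exact equality, but the structure of the tests is the same.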

Model development pipeline

Evaluation, pre-train tests, and post-train tests should all be included in your model development pipeline, with the stages organized roughly like this:



If you care about the quality of your model, you need to perform ML tests: test the quality of the data as well as the model itself, and adjust the hyperparameters to get the best results. If you perform all the necessary procedures, you can be confident in your model's performance.

Source: https://nhadep247.net