What is generalization?
In machine learning, generalization describes how well a trained model classifies or forecasts data it has never seen. A well-generalized model performs well on any data drawn from the same problem. Consider a model trained to classify dogs versus cats. If it is trained on images of only two dog breeds, it may perform well on those breeds but obtain a low classification score when tested on other breeds, perhaps even misclassifying a dog image as a cat. Data diversity is therefore important for good predictions: a model trained on only two breeds might reach an 85% score when tested on those two breeds, while a model trained on all breeds might score 70%. But when both are evaluated on an unseen dataset containing dogs of every breed, the first model may get a very low score, while the model trained with high data diversity keeps roughly the same performance.
Data diversity, however, is not the only factor behind a generalized model; poor generalization can also result from the chosen machine learning algorithm or from a poor model configuration. In this post we explain the determinant factors, including regularization, which can be applied during training to improve generalization, as well as bias, variance, underfitting, and overfitting.
Variance and bias (overfitting and underfitting)
Bias and variance are two central concepts in machine learning. A trained model produces a range of prediction values. Bias is the distance between the predictions and the actual values: a high-bias model predicts values far from the truth. Variance measures how spread out the predictions are: a high-variance model produces highly varied predictions.
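As a toy illustration (with made-up numbers, not from any real model), bias and variance can be computed directly from a set of predictions:

```python
# Hypothetical numbers: bias and variance of a model's predictions.
actual = [10.0, 10.0, 10.0, 10.0]       # true values
predictions = [12.0, 13.0, 11.0, 12.0]  # model outputs

# Bias: average distance between predictions and actual values.
bias = sum(p - a for p, a in zip(predictions, actual)) / len(actual)

# Variance: how spread out the predictions are around their own mean.
mean_pred = sum(predictions) / len(predictions)
variance = sum((p - mean_pred) ** 2 for p in predictions) / len(predictions)

print(bias)      # 2.0 -> predictions are systematically 2 units too high
print(variance)  # 0.5 -> predictions are tightly clustered (low variance)
```

Here the model has noticeable bias but low variance, the typical profile of an underfit model.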
The predictions of a machine learning model can therefore be low-bias/high-variance or high-bias/low-variance. A low-bias, high-variance model is called overfit, and a high-bias, low-variance model is called underfit. The goal is to find the best trade-off between underfitting and overfitting so that the trained model gets the best performance. An overfit model obtains a high prediction score on seen data but a low one on unseen data; an underfit model has low performance on both seen and unseen datasets.
Three models with underfitting (left), good fit (middle), and overfitting (right). Credit: https://scikit-learn.org/
Overfitting/overtraining in supervised learning (e.g., neural network). Training error is shown in blue, validation error in red, both as a function of the number of training cycles. If the validation error increases (positive slope) while the training error steadily decreases (negative slope), then a situation of overfitting may have occurred. The best predictive and fitted model would be where the validation error has its global minimum. Credit: Wikipedia user Gringer. Source: https://en.wikipedia.org/wiki/Overfitting
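The rule described in the caption can be sketched with hypothetical error curves: keep the model from the training cycle where the validation error reaches its global minimum, even though the training error keeps falling afterwards.

```python
# Hypothetical error curves: training error keeps decreasing while
# validation error reaches a minimum and then rises, the signature
# of overfitting.
train_error = [0.90, 0.60, 0.40, 0.30, 0.25, 0.22, 0.20, 0.19]
val_error   = [0.95, 0.70, 0.55, 0.50, 0.48, 0.52, 0.60, 0.70]

# The best model is the one saved at the global minimum of validation error.
best_epoch = min(range(len(val_error)), key=lambda i: val_error[i])
print(best_epoch)             # 4
print(val_error[best_epoch])  # 0.48
```

This is the idea behind early stopping: training beyond `best_epoch` only improves the score on seen data.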
Determinant factors to train generalized models
There are several ways to protect a machine learning model from underfitting and overfitting. We explain them below.
Dataset
The dataset used to train a machine learning model should be diverse. This does not mean a huge dataset, but one containing all the different kinds of samples the model will face. Generalization suffers if the classifier is trained on only a specific subset of the data. It is recommended to use cross-validation techniques during training; these techniques exploit the available data fully and help avoid producing an overfit model.
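As a minimal sketch of the idea in plain Python (a real project would use a library utility such as scikit-learn's `KFold`), k-fold cross-validation splits the data so that every sample serves as validation data exactly once:

```python
# Minimal k-fold cross-validation sketch (illustrative, not a library
# implementation).
def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any remainder.
        end = start + fold_size if fold < k - 1 else n_samples
        val_idx = indices[start:end]
        train_idx = indices[:start] + indices[end:]
        yield train_idx, val_idx

# Every sample is validated on exactly once, so the score is less tied
# to one lucky (or unlucky) train/test split.
for train_idx, val_idx in k_fold_splits(10, 5):
    print(len(train_idx), len(val_idx))  # 8 2 on every fold
```

Averaging the model's score over the k folds gives a more honest estimate of how it will perform on unseen data.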
Machine Learning algorithm
The choice of machine learning algorithm also works for or against overfitting and underfitting. Complex, highly flexible models are more prone to overfitting; the decision tree is a classic example. Other models are too simple to capture the structure of the data, which is why they tend to produce underfit models; linear regression and logistic regression are examples.
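A toy sketch of this trade-off, using invented data: a 1-nearest-neighbour regressor stands in for a highly flexible model (like a fully grown decision tree) that memorizes the training set, while a constant mean predictor stands in for a model that is too simple to fit anything:

```python
# Invented toy data: roughly linear relationship between x and y.
train = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.1), (4.0, 4.2)]

def nn_predict(x, data):
    # Predict the target of the closest training point (pure memorization,
    # very flexible -> prone to overfitting).
    return min(data, key=lambda pair: abs(pair[0] - x))[1]

def mean_predict(x, data):
    # Ignore x entirely: one constant prediction for every input
    # (too simple -> prone to underfitting).
    return sum(y for _, y in data) / len(data)

def mse(model, data, train_data):
    return sum((model(x, train_data) - y) ** 2 for x, y in data) / len(data)

print(mse(nn_predict, train, train))    # 0.0: perfect on seen data
print(mse(mean_predict, train, train))  # large error even on seen data
```

The memorizing model scores perfectly on the data it has seen, which says nothing about unseen data; the constant model cannot even fit the seen data. A good algorithm choice sits between these extremes.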
Machine learning models are prone to overfitting when they become too complex. There are ways to make a model simpler; they are called regularization methods, which we explain next.
Regularization makes a machine learning model simpler. It takes different forms depending on the algorithm: pruning for decision trees, dropout techniques for neural networks, and adding a penalty parameter to the cost function, as in ridge and lasso regression.
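As a minimal sketch of the penalty-term idea, here is a one-dimensional ridge-style regression on assumed toy data (no intercept term, for simplicity). Minimizing the cost sum((y - w*x)^2) + lam * w^2 has the closed-form solution below, and increasing the penalty `lam` shrinks the learned weight:

```python
# Toy data (invented): y is roughly 2*x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.2, 5.8, 8.1]

def fit_ridge(xs, ys, lam):
    # Minimizer of sum((y - w*x)^2) + lam * w^2 for a 1-D model
    # without intercept: w = sum(x*y) / (sum(x^2) + lam).
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)  # lam = 0 recovers ordinary least squares

w_ols = fit_ridge(xs, ys, lam=0.0)
w_reg = fit_ridge(xs, ys, lam=10.0)
print(w_ols, w_reg)  # the penalty shrinks the weight toward zero
```

The penalty discourages large weights, which constrains the model and reduces its tendency to fit noise in the training data.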