
Last updated on March 19, 2020.

Predicting a class label for a given set of inputs is a classification predictive modeling problem.

It is a challenging problem if little is known about the dataset, as there are tens, if not hundreds, of machine learning algorithms to choose from. The problem becomes even more difficult when the distribution of examples across the classes is skewed, as a skewed class distribution may require specialized methods that change the composition of the dataset or the behavior of the learning algorithm.

One way to deal with the feeling of overwhelm on a new classification project is to reach for a favorite algorithm, such as Random Forest, or a popular data sampling method, such as SMOTE. Another approach is to look for similar problems in the research literature and try to re-implement what was reported.

Although hit-or-miss, these approaches can be effective. The shortest path to a good result on a new classification task is to systematically evaluate a suite of machine learning algorithms in order to discover what works well, then double down on it. The same approach can be used for imbalanced classification problems, tailored to the range of data sampling, cost-sensitive, and one-class classification algorithms that are available.

In this tutorial, you will discover a systematic framework for working through an imbalanced classification dataset.

After completing this tutorial, you will know:

  • The challenge of choosing an algorithm for imbalanced classification.
  • A systematic framework for working through an imbalanced classification project.
  • Specific algorithm suggestions to try at each step of the framework.

Kick-start your project with my new book Imbalanced Classification with Python, which includes step-by-step tutorials and the Python source code files for all examples.

Let's get started.

Step-By-Step Framework for Imbalanced Classification Projects

Overview

This tutorial is divided into three parts; they are:

  1. What Algorithm Should You Use?
  2. Use a Systematic Framework
  3. Detailed Framework: Select a Metric, Spot Check Algorithms, Spot Check Imbalanced Algorithms, and Hyperparameter Tuning

What Algorithm Should You Use?

You have acquired or been handed an imbalanced classification dataset. Now what?

There are so many machine learning techniques to choose from.

How do you choose ?

This is the challenge faced at the beginning of every new imbalanced classification project. It is this challenge that makes applied machine learning both thrilling and terrifying.

There are at least two ways to solve this problem.

  • Use a favorite algorithm.
  • Use what has worked before.

One approach is to grab a favorite algorithm and start tuning its hyperparameters. This is only effective if your favorite algorithm happens to be the best solution for your specific dataset, which you cannot know in advance.

Another approach is to look at the literature and see what techniques have been used on problems like yours. This can be effective if many people have studied and reported results on similar datasets.

Research publications are often biased towards showing promise for a pet algorithm rather than presenting an honest comparison of methods, so the literature is best used as a source of ideas to try rather than as a prescription to follow.

If little is known about the problem, the shortest path to a good result is to test a suite of different algorithms.

Would you like to get started with Imbalanced Classification?

I have a free 7-day email crash course with sample code.

You can get a free PDF Ebook version of the course if you sign up.

Use a Systematic Framework

Consider a balanced classification task.

You have the same challenge of choosing which algorithms to use.

There are many solutions to this problem, but perhaps the most robust is to use empirical results to choose a solution.

Biases like “what has worked in the past” can feed ideas into the study, but they can lead you astray if relied upon blindly. You need to let the results of systematic empirical experiments tell you what works best for your classification dataset.

Once you have a dataset, the process involves selecting a metric, spot-checking a suite of algorithms, and then tuning the hyperparameters of the best-performing models. It is the simplest and most reliable way to get from “I have a new dataset” to “I have good results” very quickly.

The process can be summarized as follows.

  1. Select a Metric
  2. Spot Check Algorithms
  3. Hyperparameter Tuning

Spot-checking can be more involved than it sounds, as each algorithm may require specialized data preparation, such as scaling or the removal of outliers. It also requires the design of a test harness to estimate the performance of a model on unseen data.

This process can be used for imbalanced classification.

It is still important to spot-check standard machine learning algorithms. Standard algorithms often don't perform well when the class distribution is skewed, but testing them first provides a baseline in performance by which more specialized models can be compared.

It is also still important to tune the hyperparameters of well-performing models. In addition, there are techniques specifically designed for imbalanced classification that should be evaluated as well.

The same three-step procedure can be used, with an additional step inserted to spot-check imbalanced classification techniques.

This process can be summarized as follows.

  1. Select a Metric
  2. Spot Check Algorithms
  3. Spot Check Imbalanced Algorithms
  4. Hyperparameter Tuning

This provides a high-level framework for working through an imbalanced classification problem.

Nevertheless, with so many machine learning algorithms to choose from, we also need a lower-level framework for each step of the process.

Just as such a framework can be laid out for balanced classification, a similar one can be developed for each step of an imbalanced classification project, from selecting a metric to tuning hyperparameters.

Select a Metric

The most important step in the project is selecting a metric.

The metric is the measuring stick by which all models are evaluated and compared. Choosing the wrong metric can mean choosing the wrong algorithm, that is, a model that solves a different problem from the one you actually want solved.

The metric must capture the properties of the model or its predictions that matter most to the project or project stakeholders.

This is challenging, as project stakeholders are often unsure of exactly what they want, and there are many metrics to choose from. It can be beneficial to explore a few different framings of the problem and different metrics to see what makes sense to stakeholders.

As a first step, you must decide whether you want to predict probabilities or crisp class labels. Recall that in imbalanced classification the majority class is the normal case, called the negative class, and the minority class is the exception, called the positive class.

The uncertainty of the prediction is captured by probabilities, whereas crisp class labels can be used immediately.

  • Predict the probability of class membership for each example.
  • Predict a crisp class label for each example.

Predict Probabilities

If probabilities are intended to be used directly, good metrics are the Brier Score and the Brier Skill Score.

Alternately, you may want to predict probabilities and allow the user to map them to class labels via a threshold of their choosing. In that case, choose a measure that summarizes the performance of the model across the range of possible thresholds.

If the positive class is the most important, the area under the precision-recall curve (PR AUC) can be used; optimizing it ensures that both precision and recall are maximized.

If both classes are equally important, the area under the ROC curve (ROC AUC) can be used; optimizing it maximizes the true positive rate while minimizing the false positive rate.

Predict Class Labels

If class labels are required and both classes are equally important, a good default metric is classification accuracy. This only makes sense if the majority class makes up less than about 80 percent of the data; with a more severe skew, the accuracy metric loses its meaning.

If the class distribution is more severely skewed, the G-mean metric can be used instead.

If the positive class is more important, variations of the F-measure can be used. The F1-measure is appropriate if false positives and false negatives are equally costly, the F2-measure if false negatives are more costly, and the F0.5-measure if false positives are more costly.
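As a rough illustration, all of these metrics can be computed with scikit-learn and imbalanced-learn (the geometric_mean_score function comes from imbalanced-learn). The toy labels and probabilities below are made up purely to show the function calls; this is a sketch, not part of the original tutorial.

```python
# Sketch: computing the candidate metrics with scikit-learn and imbalanced-learn.
# y_true are 0/1 labels, y_prob are predicted positive-class probabilities,
# and y_pred are crisp labels obtained with a default 0.5 threshold.
from sklearn.metrics import (brier_score_loss, roc_auc_score, accuracy_score,
                             f1_score, fbeta_score, precision_recall_curve, auc)
from imblearn.metrics import geometric_mean_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.1, 0.2, 0.1, 0.3, 0.4, 0.2, 0.6, 0.8, 0.4, 0.9]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

print('Brier Score:', brier_score_loss(y_true, y_prob))
print('ROC AUC:', roc_auc_score(y_true, y_prob))
precision, recall, _ = precision_recall_curve(y_true, y_prob)
print('PR AUC:', auc(recall, precision))
print('Accuracy:', accuracy_score(y_true, y_pred))
print('G-Mean:', geometric_mean_score(y_true, y_pred))
print('F1:', f1_score(y_true, y_pred))
print('F2:', fbeta_score(y_true, y_pred, beta=2))
print('F0.5:', fbeta_score(y_true, y_pred, beta=0.5))
```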

Framework for Choosing a Metric

If you feel lost when choosing a metric for your imbalanced classification task, these suggestions provide a useful starting point.

The framework we can use is as follows:

  • Are you predicting probabilities, or do you need crisp class labels?
  • Is the positive class more important, or are both classes equally important?
  • If class labels are needed and both classes matter equally, does the majority class make up less than 80 percent of the data?

These decisions can be transformed into a decision tree.

How to Choose a Metric for Imbalanced Classification
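The decision logic described above can also be sketched as a simple helper function. The question names and the 80 percent cut-off mirror the prose; the function itself is only an illustration, not a definitive rule.

```python
# Sketch: the metric-selection decisions above expressed as a simple function.
def choose_metric(need_labels, positive_more_important,
                  false_negatives_more_costly=False,
                  false_positives_more_costly=False,
                  majority_share=0.5, need_probabilities=False):
    if not need_labels:
        if need_probabilities:
            return 'Brier Score / Brier Skill Score'
        # probabilities will be thresholded later by the user
        return 'PR AUC' if positive_more_important else 'ROC AUC'
    if positive_more_important:
        if false_negatives_more_costly:
            return 'F2-Measure'
        if false_positives_more_costly:
            return 'F0.5-Measure'
        return 'F1-Measure'
    # both classes equally important
    return 'Accuracy' if majority_share < 0.8 else 'G-Mean'

print(choose_metric(need_labels=True, positive_more_important=True))  # F1-Measure
```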

You can start evaluating machine learning algorithms once a metric has been chosen.

Spot Check Algorithms

Spot-checking machine learning algorithms means quickly evaluating a suite of different types of algorithms.

It means giving each algorithm a chance to learn about the problem, including performing any required data preparation expected by the algorithm and using best-practice configuration options or defaults.

The objective is to quickly test a range of standard machine learning algorithms and provide a baseline in performance that techniques specialized for imbalanced classification must outperform in order to be considered skillful. The idea is that if the specialized imbalanced algorithms can't out-perform these standard baselines, there isn't much point in using them.

A robust test harness needs to be defined. This often involves k-fold cross-validation, with each fold stratified so that it has the same class distribution as the original dataset. The cross-validation procedure is often repeated multiple times, such as 3, 10, or 30, in order to effectively capture a sample of model performance on the dataset.
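A minimal sketch of such a test harness, assuming scikit-learn is available; the synthetic dataset, the 99:1 skew, and the choice of ROC AUC are illustrative placeholders for your own data and chosen metric.

```python
# Sketch: a repeated, stratified k-fold test harness with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# synthetic imbalanced dataset (about 99:1) standing in for your own data
X, y = make_classification(n_samples=10000, n_features=10, weights=[0.99],
                           flip_y=0, random_state=1)
model = DecisionTreeClassifier()
# stratified folds preserve the class distribution; repeats stabilize the estimate
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC AUC: %.3f (%.3f)' % (scores.mean(), scores.std()))
```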

There are perhaps four levels of algorithms to spot-check; they are:

  1. Naive Algorithms
  2. Linear Algorithms
  3. Nonlinear Algorithms
  4. Ensemble Algorithms

Naive Algorithms

First, a naive classification algorithm must be evaluated.

This provides a rock-bottom baseline in performance that any algorithm must overcome in order to be considered to have skill on the dataset.

A naive algorithm uses no logic other than an if-statement or predicting a constant value. The choice of naive algorithm is based on the choice of performance metric.

Predicting the majority class in all cases is a suitable naive algorithm for classification accuracy. Predicting the prior probability of each class in the training dataset is a suitable naive algorithm for the Brier Score.

Suggested mappings of performance metrics to naive algorithms are as follows:

  • Accuracy: predict the majority class.
  • G-Mean: predict a uniformly random class.
  • F-Measure: predict the minority class.
  • ROC AUC: predict a stratified random class.
  • PR AUC: predict a stratified random class.
  • Brier Score: predict the majority class prior.

If you are unsure of the best naive algorithm for your metric, try a few and discover which gives the best performance; use that as your baseline.

Some options include the following (see the sketch after this list):

  • Predict the majority class in all cases.
  • Predict the minority class in all cases.
  • Predict a uniformly random class.
  • Predict a randomly selected class.
  • Predict the class prior probabilities.
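A minimal sketch of these baselines using scikit-learn's DummyClassifier; the synthetic dataset and the accuracy metric are illustrative, and you would swap in the metric chosen in the previous step.

```python
# Sketch: evaluating naive baselines with scikit-learn's DummyClassifier.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# 'most_frequent' predicts the majority class, 'constant' a chosen (minority) class,
# 'uniform' a uniformly random class, 'stratified' a randomly selected class by prior,
# and 'prior' the class prior probabilities
baselines = {
    'majority class': DummyClassifier(strategy='most_frequent'),
    'minority class': DummyClassifier(strategy='constant', constant=1),
    'uniformly random': DummyClassifier(strategy='uniform'),
    'stratified random': DummyClassifier(strategy='stratified'),
    'class prior': DummyClassifier(strategy='prior'),
}
for name, model in baselines.items():
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    print('%s: %.3f' % (name, scores.mean()))
```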

Linear Algorithms

Linear algorithms are often drawn from the field of statistics and make strong assumptions about the functional form of the problem.

We can refer to them as linear because the output is a linear combination of the inputs, although this definition is somewhat stretched. Many of them are also fit under a probabilistic framework.

They are generally fast to train and often perform well. Examples of linear algorithms you should consider trying include:

  • Logistic Regression.
  • Linear Discriminant Analysis.
  • Naive Bayes.

Nonlinear Algorithms

Nonlinear algorithms make fewer assumptions about the functional form of the problem.

We can refer to them as nonlinear because the output is often a nonlinear mapping of inputs to outputs.

They are often slower to train and require more data. Examples of nonlinear algorithms you should consider trying include:

  • Decision Tree.
  • k-Nearest Neighbors.
  • Artificial Neural Networks.
  • Support Vector Machine.

Ensemble Algorithms

Ensemble algorithms combine the predictions from two or more models.

It is a good idea to focus on ensembles of decision tree algorithms when spot-checking, given that they are known to perform well in practice on a wide range of problems.

Examples of ensemble decision tree algorithms you should consider trying include:

  • Bagged Decision Trees.
  • Random Forest.
  • Extra Trees.
  • Stochastic Gradient Boosting.

Framework for Spot-Checking Machine Learning Algorithms

These suggestions can be summarized into a framework for spot-checking machine learning algorithms, as follows:

  1. Naive Algorithms: Majority Class, Minority Class.
  2. Linear Algorithms: Logistic Regression, Linear Discriminant Analysis, Naive Bayes.
  3. Nonlinear Algorithms: Decision Tree, k-Nearest Neighbors, Artificial Neural Networks, Support Vector Machine.
  4. Ensemble Algorithms: Bagged Decision Trees, Random Forest, Extra Trees, Stochastic Gradient Boosting.

The order of the steps is likely not flexible; think of the order of the algorithms as increasing in complexity and, in turn, capability.

The order of the algorithms within each step is flexible, and the list of algorithms is not complete; it is a good idea to limit the tests to a subset of the most widely used implementations. It is also a good idea to follow standard data-preparation recommendations and use sensible hyperparameter defaults, as in the sketch below.
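A sketch of such a spot-check loop, assuming scikit-learn; the dataset, the subset of algorithms, and the ROC AUC metric are illustrative choices, and each model is left at its default configuration apart from input scaling.

```python
# Sketch: spot-checking a suite of standard algorithms with a shared test harness.
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              ExtraTreesClassifier, GradientBoostingClassifier)

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.95],
                           flip_y=0, random_state=1)
models = {
    'LR': LogisticRegression(max_iter=1000),
    'LDA': LinearDiscriminantAnalysis(),
    'NB': GaussianNB(),
    'CART': DecisionTreeClassifier(),
    'KNN': KNeighborsClassifier(),
    'SVM': SVC(),
    'BAG': BaggingClassifier(),
    'RF': RandomForestClassifier(),
    'ET': ExtraTreesClassifier(),
    'GBM': GradientBoostingClassifier(),
}
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
for name, model in models.items():
    # scale inputs for the algorithms that are sensitive to feature scale
    pipeline = Pipeline([('scale', StandardScaler()), ('model', model)])
    scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
    print('%s: %.3f' % (name, scores.mean()))
```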

This step of the framework is summarized in the figure below.

How to Spot-Check Machine Learning Algorithms

Spot Check Imbalanced Algorithms.

Spot-checking imbalanced algorithms is much like spot-checking standard machine learning algorithms.

The goal is to quickly test a large number of techniques in order to discover what shows promise so that you can focus more attention on it later.

The spot-checking performed in the previous section provides both naive and modestly skillful models against which all imbalanced techniques can be compared. This lets you focus on methods that truly show promise, rather than getting excited about results that only appear effective relative to other imbalanced techniques, which is an easy trap to fall into.

There are at least four different types of imbalanced classification techniques.

  1. Data Sampling Algorithms
  2. Cost-Sensitive Algorithms
  3. One-Class Algorithms
  4. Probability Tuning Algorithms

Data Sampling Algorithms

Data sampling algorithms change the composition of the training dataset to improve the performance of a standard machine learning algorithm on an imbalanced classification problem.

There are at least three main types of data sampling.

  • Data Oversampling.
  • Data Undersampling.
  • Combined Oversampling and Undersampling.

Data oversampling involves duplicating examples of the minority class or synthesizing new minority-class examples from existing ones. SMOTE and its variants are among the most popular methods, and the amount of oversampling to perform is one of the most important hyperparameters to tune.

Examples of data oversampling methods include:

  • Random Oversampling.
  • SMOTE.
  • Borderline SMOTE.
  • SVM SMOTE.
  • k-Means SMOTE.

Data undersampling involves deleting examples from the majority class, either randomly or by using an algorithm to carefully choose which examples to remove. Popular editing methods include Edited Nearest Neighbors and Tomek Links.

Examples of data undersampling methods include:

  • Random Undersampling.
  • Condensed Nearest Neighbor.
  • Tomek Links.
  • Edited Nearest Neighbors.
  • Neighborhood Cleaning Rule.
  • One-Sided Selection.

Any oversampling method can be combined with any undersampling method. It would be beneficial to test different combinations of oversampling and undersampling techniques.

Popular combinations of oversampling and undersampling include:

  • SMOTE and Random Undersampling.
  • SMOTE and Tomek Links.
  • SMOTE and Edited Nearest Neighbors.

Depending on the choice of machine learning algorithm, data sampling may perform differently.

It may be beneficial to test a suite of standard machine learning algorithms, such as all or a subset of those used when spot checking in the previous section.

Further, many data sampling algorithms (such as SMOTE) use the k-nearest neighbor algorithm internally, which is very sensitive to the data types and scale of the input variables. It may therefore be important to at least normalize input variables with differing scales before testing these methods, and to use specialized variants if some input variables are categorical instead of numerical. A sketch combining oversampling and undersampling in a pipeline follows.
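A sketch of one such combination, assuming imbalanced-learn is installed; the imblearn Pipeline ensures sampling is applied only to the training folds, and the sampling ratios shown are illustrative, not recommendations.

```python
# Sketch: SMOTE oversampling combined with random undersampling inside an
# imbalanced-learn Pipeline, so sampling touches only the training folds.
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)
steps = [
    ('scale', StandardScaler()),             # SMOTE uses k-NN internally, so scale inputs
    ('over', SMOTE(sampling_strategy=0.1)),  # oversample minority to 10% of majority
    ('under', RandomUnderSampler(sampling_strategy=0.5)),  # then trim the majority
    ('model', DecisionTreeClassifier()),
]
pipeline = Pipeline(steps=steps)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC AUC: %.3f' % scores.mean())
```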

Cost-Sensitive Algorithms

Cost-sensitive algorithms are modified versions of machine learning algorithms that take the different costs of misclassification into account when fitting the model on the training dataset.

These algorithms can be effective for imbalanced classification when the cost of misclassification is configured to be inversely proportional to the distribution of examples in the training dataset.

It is possible to test a range of cost-sensitive versions of linear, nonlinear, and ensemble algorithms.

Examples of machine learning algorithms that can be configured with cost-sensitive training include the following (a sketch follows the list):

  • Logistic Regression.
  • Decision Trees.
  • Support Vector Machines.
  • Artificial Neural Networks.
  • Bagged Decision Trees.
  • Random Forest.
  • Stochastic Gradient Boosting.
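A sketch of cost-sensitive training using scikit-learn's class_weight argument; setting it to 'balanced' weights misclassification costs inversely to class frequency. The dataset, the subset of models, and the F1 metric are illustrative.

```python
# Sketch: cost-sensitive versions of standard algorithms via class_weight.
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)
models = {
    'LR': LogisticRegression(class_weight='balanced', max_iter=1000),
    'CART': DecisionTreeClassifier(class_weight='balanced'),
    'SVM': SVC(class_weight='balanced'),
    'RF': RandomForestClassifier(class_weight='balanced'),
}
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
for name, model in models.items():
    scores = cross_val_score(model, X, y, scoring='f1', cv=cv, n_jobs=-1)
    print('%s: %.3f' % (name, scores.mean()))
```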

One-Class Algorithms

Algorithms developed for outlier detection and anomaly detection can also be used for classification tasks.

When used in this way, they are often referred to as one-class classification.

When there is a severe class imbalance with very few examples of the positive class, one-class classification can be very effective.

Examples of one-class classification algorithms to try include the following (a sketch follows the list):

  • One-Class Support Vector Machines.
  • Isolation Forest.
  • Minimum Covariance Determinant.
  • Local Outlier Factor.
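A sketch of one-class classification with a One-Class SVM, fit on majority-class examples only and treating detected outliers as the minority class; the dataset, the nu value, and the F1 metric are illustrative.

```python
# Sketch: one-class classification for severe imbalance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.svm import OneClassSVM

X, y = make_classification(n_samples=10000, weights=[0.999], flip_y=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    stratify=y, random_state=1)
# fit on majority-class (negative) examples only
model = OneClassSVM(gamma='scale', nu=0.01)
model.fit(X_train[y_train == 0])
# OneClassSVM labels inliers +1 and outliers -1; map -1 to the positive class
y_hat = model.predict(X_test)
y_pred = np.where(y_hat == -1, 1, 0)
print('F1: %.3f' % f1_score(y_test, y_pred))
```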

Probability Tuning Algorithms

Predicted probabilities can be improved in two ways.

  • Calibrating Probabilities.
  • Tuning the Classification Threshold.

Calibrating Probabilities

Some algorithms are fit using a probabilistic framework and, in turn, produce calibrated probabilities.

When 100 examples are predicted to have a positive class label with a probability of 80 percent, the algorithm will predict the correct class label 80 percent of the time.

A model needs to produce calibrated probabilities when probabilities are either required as the output or used to evaluate the model, for example with ROC AUC or PR AUC.

Examples of machine learning algorithms that predict calibrated probabilities by design include:

  • Logistic Regression.
  • Linear Discriminant Analysis.
  • Naive Bayes.
  • Artificial Neural Networks.

Most other algorithms do not predict calibrated probabilities, so post-processing methods can be applied to the predicted probabilities in order to calibrate them.

It is important to calibrate the predicted probabilities when they are used to evaluate a model.

Examples of probability calibration methods include the following (a sketch follows the list):

  • Platt Scaling.
  • Isotonic Regression.
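A sketch of probability calibration with scikit-learn's CalibratedClassifierCV, wrapping an SVM whose raw scores are not calibrated probabilities; the dataset and the Brier score evaluation are illustrative.

```python
# Sketch: calibrating predicted probabilities with CalibratedClassifierCV,
# using Platt scaling (method='sigmoid') or isotonic regression (method='isotonic').
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=5000, weights=[0.99], flip_y=0, random_state=1)
# wrap a model whose raw scores are not calibrated probabilities
calibrated = CalibratedClassifierCV(SVC(), method='sigmoid', cv=3)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(calibrated, X, y, scoring='neg_brier_score', cv=cv, n_jobs=-1)
print('Mean Brier Score: %.4f' % -scores.mean())
```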

Tuning the Classification Threshold

Some algorithms are designed to predict probabilities that later need to be mapped to crisp class labels.

This is the case if class labels are required as output for the problem, or if the model is evaluated using class labels.

Examples of machine learning algorithms that predict a probability include:

  • Logistic Regression.
  • Linear Discriminant Analysis.
  • Naive Bayes.
  • Artificial Neural Networks.

The threshold probability value is used to map probabilities to class labels. All probabilities below the threshold are mapped to class 0, and all probabilities equal to or above the threshold are mapped to class 1.

The default threshold is 0.5, yet different thresholds can dramatically change the class labels and, in turn, the performance of a machine learning model that predicts probabilities.

If a model is used that predicts a probability and class labels are required as output, it is a good idea to try tuning the classification threshold.
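A sketch of threshold tuning: predict probabilities, then search a grid of thresholds for the one that maximizes the chosen metric. The dataset, the logistic regression model, and the F1 metric are illustrative, and in practice the threshold should be chosen on a validation set rather than the final test set.

```python
# Sketch: tuning the classification threshold to maximize the F-measure.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5,
                                                    stratify=y, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]  # positive-class probabilities
# evaluate a grid of candidate thresholds (use a validation set in practice)
thresholds = np.arange(0.0, 1.0, 0.01)
scores = [f1_score(y_test, (probs >= t).astype(int)) for t in thresholds]
best = int(np.argmax(scores))
print('Best threshold=%.2f, F1=%.3f' % (thresholds[best], scores[best]))
```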

Framework for Spot-Checking Imbalanced Algorithms

These suggestions can be summarized into a framework for spot-checking imbalanced machine learning algorithms, as follows:

  1. Data Sampling Algorithms (e.g. Random Oversampling, SMOTE, Borderline SMOTE, Random Undersampling, Tomek Links, Edited Nearest Neighbors).
  2. Cost-Sensitive Algorithms (e.g. Logistic Regression, Decision Trees, Support Vector Machines, Artificial Neural Networks, ensemble algorithms).
  3. One-Class Algorithms (e.g. One-Class SVM, Isolation Forests, Minimum Covariance Determinant, Local Outlier Factor).
  4. Probability Tuning Algorithms (e.g. Calibrating Probabilities with Platt Scaling or Isotonic Regression, and Tuning the Classification Threshold).

The order of the steps is probably flexible, the order of the algorithms within each step is flexible as well, and the list of algorithms is not complete.

The structure is designed to make you think about what you are evaluating.

The framework is summarized in the figure below.

How to Spot-Check Imbalanced Machine Learning Algorithms

Hyperparameter Tuning

After spot-checking machine learning algorithms and imbalanced classification techniques, you will have an idea of what works and what does not on your dataset.

Pick the top five or 10 combinations that performed well and tune the hyperparameters for them.

You can choose from three popular hyperparameter tuning algorithms.

  • Random Search.
  • Grid Search.
  • Bayesian Optimization.

Grid search is a good choice if you already know what values to try; otherwise, random search is a good default. Bayesian optimization can perform well but can be more difficult to set up and run.
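A sketch of hyperparameter tuning with a random search in scikit-learn; the logistic regression model, its search space, and the ROC AUC metric are illustrative stand-ins for whichever models performed best in your spot-checking.

```python
# Sketch: tuning hyperparameters of a promising model with a random search.
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0, random_state=1)
model = LogisticRegression(class_weight='balanced', max_iter=1000)
param_distributions = {
    'C': loguniform(1e-4, 100),          # regularization strength
    'solver': ['lbfgs', 'liblinear'],
}
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
search = RandomizedSearchCV(model, param_distributions, n_iter=50,
                            scoring='roc_auc', cv=cv, n_jobs=-1, random_state=1)
search.fit(X, y)
print('Best score: %.3f' % search.best_score_)
print('Best parameters:', search.best_params_)
```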

Tuning the best-performing methods is not the only approach.

There may be algorithms that perform well on their own but have not yet been combined with data sampling or probability calibration; applying these imbalanced-classification augmentations to them, then tuning, may reveal whether better performance can be achieved.

Similarly, a data sampling method that results in a dramatic lift in performance for one or more algorithms provides an interesting basis for further tuning to see if additional lifts can be achieved.


Summary.

In this tutorial, you discovered a systematic framework for working through an imbalanced classification dataset.

Specifically, you learned:

  • The challenge of choosing an algorithm for imbalanced classification.
  • A systematic framework for working through an imbalanced classification project.
  • Specific algorithm suggestions to try at each step of the framework.

I will do my best to answer your questions if you ask them in the comments.

Get a handle on imbalanced classification.

Imbalanced Classification with Python

Develop imbalanced learning models in minutes, with just a few lines of Python code.

In my new book, Imbalanced Classification with Python, you can discover how.

It covers Performance Metrics, Undersampling Methods, SMOTE, Threshold Moving, Probability Calibration, and much more.

Bring imbalanced classification methods to your machine learning projects.

See what's inside.
