Cartoon Infographic on Building the Machine Learning Model. (Drawn by Chanin Nantasenamat)
Data science and machine learning are related.
How to Build a Machine Learning Model
A Visual Guide to Learning Data Science
Learning data science doesn’t have to be intimidating; it can be fun. So how do we make learning data science fun and easy?
Since a picture is worth a thousand words, it makes sense to draw a cartoon about data science. After a few days I came up with an infographic, which was also published on the Data Professor’s website.
There is a dataset.
Your journey of building a machine learning model begins with a dataset. The dataset is an M × N matrix where M represents the columns (features) and N the rows (samples).
The columns can be broken down into X and Y. X is synonymous with several terms such as features, independent variables and input variables. Y is synonymous with class label, dependent variable and output variable.
Cartoon illustration of a dataset. (Drawn by Chanin Nantasenamat)
A dataset that can be used for supervised learning has both X and Y in it, whereas a dataset that can be used for unsupervised learning only has X.
The dataset (comprising X and Y) can be used for regression tasks if Y contains quantitative values, and for classification tasks if Y contains qualitative values.
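As a minimal sketch of the X and Y terminology above, the snippet below splits a small matrix into input features and an output variable. The numbers are a made-up toy dataset (an assumption for illustration only), with the last column playing the role of Y.

```python
import numpy as np

# Hypothetical toy dataset: 6 samples (rows), 3 input features plus 1 output column.
data = np.array([
    [5.1, 3.5, 1.4, 0.0],
    [4.9, 3.0, 1.4, 0.0],
    [6.3, 3.3, 6.0, 1.0],
    [5.8, 2.7, 5.1, 1.0],
    [7.0, 3.2, 4.7, 2.0],
    [6.4, 3.2, 4.5, 2.0],
])

X = data[:, :-1]  # input features (independent variables)
y = data[:, -1]   # output variable (class label / dependent variable)

n_samples, n_features = X.shape
print(f"{n_samples} samples, {n_features} features")
```

A supervised learning method would use both `X` and `y`, whereas an unsupervised method would use `X` alone.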
There is an exploratory data analysis.
Exploratory data analysis (EDA) is performed in order to get a preliminary understanding of the data. One of the first things I do in a data science project is to eyeball the data so as to gain a better understanding of it.
I usually use three major EDA approaches.
- Descriptive statistics include mean, median, mode, and standard deviation.
- Data visualization includes heat maps, box plots, scatter plots, and principal component analysis.
- Data shaping includes grouping data, filtering data, and pivoting data.
Example box plot of NBA player stats data. Plot obtained from the Jupyter notebook on Data Professor GitHub.
Example correlation heatmap of NBA player stats data. Plot obtained from the Jupyter notebook on Data Professor GitHub.
Example histogram plot of NBA player stats data. Plot obtained from the Jupyter notebook on Data Professor GitHub.
Example scatter plot of NBA player stats data. Plot obtained from the Jupyter notebook on Data Professor GitHub.
The Data Professor made a video on the subject of performing exploratory data analysis in Python.
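The three EDA approaches listed above can be sketched with pandas as follows. The player statistics here are invented placeholder values, not the NBA data from the notebook, and the column names are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical player stats (made-up values for illustration).
df = pd.DataFrame({
    "position": ["G", "F", "C", "G", "F", "C"],
    "points": [25.0, 18.5, 12.0, 30.1, 9.4, 15.2],
    "rebounds": [4.0, 7.5, 11.2, 5.1, 6.3, 9.8],
})

# Descriptive statistics: mean, median and standard deviation per column.
summary = df[["points", "rebounds"]].agg(["mean", "median", "std"])

# Data shaping: grouping, filtering and pivoting.
by_position = df.groupby("position")["points"].mean()
high_scorers = df[df["points"] > 15]
pivot = df.pivot_table(index="position", values=["points", "rebounds"], aggfunc="mean")

# The correlation matrix is the table a correlation heat map would display.
corr = df[["points", "rebounds"]].corr()

print(summary)
print(by_position)
```

Plotting is left out to keep the sketch dependency-light; the `corr` table could be passed to a heat map function in a plotting library of your choice.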
Data is pre-processed.
Data pre-processing is the process by which the data is subjected to various checks and scrutiny in order to remedy issues such as missing values and spelling errors. It can also involve transforming the data, for example with a logarithmic transformation.
“Garbage in, garbage out.” George Fuechsel
The quality of the data has a big impact on the quality of the model, so significant effort should be spent in the data pre-processing phase to achieve the highest model quality. By comparison, the actual model building phase and subsequent post-model analysis account for only about 20% of the time spent on data science projects.
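A minimal pre-processing sketch, assuming a small hypothetical table with one inconsistent label and one missing value. Median imputation and a base-10 log transform are just one reasonable set of choices here, not a prescription.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with an inconsistently typed label and a missing value.
raw = pd.DataFrame({
    "species": ["Adelie", "adelie ", "Gentoo", "Gentoo", "Chinstrap"],
    "body_mass_g": [3750.0, np.nan, 5000.0, 5400.0, 3800.0],
})

clean = raw.copy()

# Fix label inconsistencies: strip stray whitespace and normalize capitalization.
clean["species"] = clean["species"].str.strip().str.capitalize()

# Fill the missing value with the column median (one of several possible strategies).
clean["body_mass_g"] = clean["body_mass_g"].fillna(clean["body_mass_g"].median())

# Logarithmic transformation of a skewed quantitative variable.
clean["log_body_mass"] = np.log10(clean["body_mass_g"])

print(clean)
```

Whether to impute, drop or flag missing values depends on the dataset; dropping incomplete rows with `dropna()` is an equally common alternative.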
There is a train-test split.
It is important that the trained model performs well on new, unseen data. To mimic such new data, the available data is split into 2 portions in what is known as a train-test split. The first portion is the larger subset, used as the training set, and the second is usually a smaller subset, used as the testing set. The data split is performed only once.
The training set is used to build a predictive model, and the trained model is then applied to the testing set, which serves as the new, unseen data, to make predictions. The best model is selected on the basis of its performance on the testing set, and in efforts to obtain the best model, hyperparameter optimization may also be performed.
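The train-test split described above can be sketched with scikit-learn's `train_test_split`. The data is synthetic, and both the 80/20 ratio and the choice of logistic regression are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 4 features, binary class label.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Split once: 80% for training, 20% held out as the testing set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Build the model on the training set, then apply it to the unseen testing set.
model = LogisticRegression().fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.2f}")
```

Fixing `random_state` makes the split reproducible, which helps when comparing models on the same held-out samples.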
There is a train-validation-test split.
Alternatively, the data can be split into 3 portions: a training set, a validation set and a testing set. The training set is used to build a predictive model, which is then evaluated on the validation set: predictions are made, the model can be tuned, and the best performing model is selected based on the validation results. This is the same procedure described above for the testing set, performed on the validation set instead. The testing set is not involved in any of the model building, so it can truly act as the new, unseen data. The Machine Learning Crash Course provides a more in-depth treatment of this topic.
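A three-way split can be sketched by calling `train_test_split` twice. The 60/20/20 proportions, the synthetic data and the k-nearest neighbors model are all assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# First hold out 20% as the testing set, then split the rest 75/25
# so the overall proportions are 60% train, 20% validation, 20% test.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

# Tune a hyperparameter using the validation set only.
candidate_k = [1, 3, 5, 7]
val_scores = {}
for k in candidate_k:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    val_scores[k] = model.score(X_val, y_val)

best_k = max(val_scores, key=val_scores.get)

# The testing set is touched once, at the very end.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print(best_k, test_accuracy)
```

Because the testing set plays no part in choosing `best_k`, its score is a fairer estimate of performance on new data than the validation score.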
There is cross-validation.
In order to make the most economical use of the available data, an N-fold cross-validation (CV) is normally used, whereby the dataset is partitioned into N folds (commonly 5-fold or 10-fold CV). In such an N-fold CV, one of the folds is left out as the testing data while the remaining folds are used as the training data for model building.
For example, in a 5-fold CV, 1 fold is left out and used as the testing data while the remaining 4 folds are pooled together and used as the training data for model building. The trained model is then applied to the left-out fold (i.e. the test data). This process is carried out iteratively until every fold has had a chance to be left out as the testing data. As a result, we will have built 5 models (one for each of the 5 folds left out as the testing set), each with its own associated performance metrics. Finally, the metric values are calculated as the average performance of the 5 models.
When N is equal to the number of data samples, we call this leave-one-out cross-validation. In this type of CV, each data sample represents a fold. For example, if the dataset has 30 samples (N = 30), there are 30 folds with 1 sample per fold. As in any other N-fold CV, 1 fold is left out as the testing set while the remaining 29 folds are used to build the model, which is then applied to the left-out fold. As before, this process is performed iteratively for a total of 30 times, and the average performance from the 30 models is computed and used as the CV performance metric.
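Both schemes described above can be sketched with scikit-learn's `KFold`, `LeaveOneOut` and `cross_val_score`. The 30-sample synthetic dataset and the logistic regression model are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

# Synthetic data: 30 samples, 2 features, balanced binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = (X[:, 0] > np.median(X[:, 0])).astype(int)

# 5-fold CV: 5 models are built, each tested on a different left-out fold.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = cross_val_score(LogisticRegression(), X, y, cv=kfold)
cv_accuracy = fold_scores.mean()

# Leave-one-out CV: N equals the number of samples, so 30 models are built.
loo_scores = cross_val_score(LogisticRegression(), X, y, cv=LeaveOneOut())
loo_accuracy = loo_scores.mean()

print(f"5-fold CV accuracy: {cv_accuracy:.2f}")
print(f"LOO CV accuracy:    {loo_accuracy:.2f}")
```

Each entry of `loo_scores` is either 0 or 1 because every left-out fold contains a single sample; only the average is meaningful.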
Model building.
Now comes the fun part, where we finally get to use the data for model building. Depending on the data type of the Y variable (qualitative or quantitative), we build either a classification model or a regression model.
There are three types of machine learning.
- Supervised learning is a machine learning task that establishes a mathematical relationship between input X and output Y variables. The labeled data is used to learn how to predict the output from the input.
- A machine learning task that uses only the input X variables is called unsupervised learning. Unlabeled data is used in the modeling of the inherent structure of the data.
- Reinforcement learning is a machine learning task that learns through trial and error in order to maximize the reward.
There is Hyperparameter Optimization.
Hyperparameters are parameters of the machine learning algorithm that directly impact the learning process and prediction performance. As there are no one-size-fits-all hyperparameter settings, one needs to perform hyperparameter optimization (also known as hyperparameter tuning or model tuning).
Take random forest as an example. When using the randomForest R package, two common hyperparameters are subjected to optimization: the mtry and ntree parameters. The mtry parameter is the number of variables that are randomly sampled as candidates at each split, and the ntree parameter is the number of trees to grow.
Support vector machine is another popular machine learning method. The hyperparameters to be optimized are the C and gamma parameters for the radial basis function (RBF) kernel; only the C parameter is needed for the linear kernel, and the C parameter and the exponential number for the polynomial kernel. The C parameter is a penalty term that limits overfitting, while the gamma parameter controls the width of the RBF kernel. Tuning is typically performed so as to arrive at the optimal set of values for the hyperparameters, and in spite of this there is research directed towards finding good starting values for the C and gamma parameters (2014).
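A minimal grid search sketch for both methods, using scikit-learn rather than the R package mentioned above. In scikit-learn's `RandomForestClassifier`, `max_features` and `n_estimators` play the roles of mtry and ntree. The grids and the synthetic data are arbitrary assumptions, kept small so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic classification data (illustrative only).
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

# Random forest: max_features ~ mtry, n_estimators ~ ntree.
rf_grid = {"max_features": [2, 4, 8], "n_estimators": [25, 50]}
rf_search = GridSearchCV(RandomForestClassifier(random_state=0), rf_grid, cv=5)
rf_search.fit(X, y)

# SVM with RBF kernel: tune C and gamma together.
svm_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
svm_search = GridSearchCV(SVC(kernel="rbf"), svm_grid, cv=5)
svm_search.fit(X, y)

print("Random forest:", rf_search.best_params_, round(rf_search.best_score_, 3))
print("SVM (RBF):    ", svm_search.best_params_, round(svm_search.best_score_, 3))
```

Each candidate combination is scored by 5-fold cross-validation, so the selected values depend on the data and on the grid that was searched.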
There is a feature selection.
Feature selection is the process of selecting a subset of features from an initially large number of features. Aside from achieving highly accurate models, one of the most important aspects of machine learning model building is to obtain actionable insights, and to achieve that it is important to be able to select a subset of important features.
The task of feature selection can in itself constitute a whole area of research, where intense efforts are made to come up with novel algorithms and approaches. Among the many available feature selection algorithms, some of the classical methods are based on simulated annealing and genetic algorithms. In addition, there are many approaches based on evolutionary algorithms (e.g. particle swarm optimization, ant colony optimization, etc.) and stochastic approaches (e.g. Monte Carlo).
The use of Monte Carlo simulation for feature selection was explored by our own research group (2014). We have also come up with a novel feature selection approach based on combining two popular evolutionary algorithms, namely the genetic algorithm and particle swarm optimization (2013).
Schematic diagram of the principles of the genetic algorithm search space splicing particle swarms optimization (GA-SSS-PSO) approach as illustrated using the Schwefel function in 2 dimensions. “The original search space (a) x∈[–500,0] is spliced into sub-spaces at fixed interval of 2 at each dimension (a dimension equals an horizontal axis in the picture). This resulted in four subspaces (b–e) where the range of x at each dimension is half that of the original. Each string of GA encodes the indexes for one subspace. Then, GA heuristically selects a subspace (e) and PSO is initiated there (particles are shown as red dots). PSO searches for the global minimum of the subspaces and the best particle fitness is used as fitness of the GA string encoding the indexes for that subspace. Finally, GA undergoes evolution and selects a new subspace to explore. The whole process is repeated until satisfactory error level is reached.” (Reprinted from Chemometrics and Intelligent Laboratory Systems, Volume 128, Genetic algorithm search space splicing particle swarm optimization as general-purpose optimizer, Pages 153–159, Copyright (2013), with permission from Elsevier)
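To make the idea of a stochastic feature selection search concrete, here is a deliberately simple random-subset sketch. It is not the Monte Carlo or GA-SSS-PSO method from the cited studies; it only illustrates the general pattern of sampling feature subsets and scoring each one by cross-validation. The data, the number of trials and the 50% inclusion probability are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: 10 features, of which only 3 are informative.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3, n_redundant=0, random_state=0
)
rng = np.random.default_rng(0)

best_score, best_subset = -np.inf, None
for _ in range(30):
    # Randomly include each feature with probability 0.5.
    mask = rng.random(X.shape[1]) < 0.5
    if not mask.any():
        continue  # skip the empty subset
    score = cross_val_score(
        LogisticRegression(max_iter=1000), X[:, mask], y, cv=5
    ).mean()
    if score > best_score:
        best_score, best_subset = score, np.flatnonzero(mask)

print("Best subset (column indices):", best_subset)
print("Mean CV accuracy:", round(best_score, 3))
```

Evolutionary approaches replace the blind random sampling here with a guided search, in which good subsets influence which subsets are tried next.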
There are machine learning tasks.
Classification and regression are two common machine learning tasks.
There is a classification.
A trained classification model takes input from a set of variables and predicts the output class label. Three classes are indicated by the different colors and labels. Each colored sphere is a data sample.
Schematic illustration of a multi-class classification problem. Three classes of data samples are shown in 2-dimensions. This drawing shows a hypothetical distribution of data samples. Such visualisation plot can be created by performing PCA analysis and displaying the first two principal components (PCs); alternatively a simple scatter plot of two variables can also be selected and visualized. (Drawn by Chanin Nantasenamat)
There is an example dataset.
Take the penguins dataset as an example. Its input features include quantitative features (bill length, bill depth, flipper length and body mass) as well as qualitative features (sex and island). There are 344 rows and 8 columns in the dataset. The dataset contains 333 complete cases, with 19 missing values present in the remaining 11 incomplete cases.
Artwork by @allison_horst
There are performance metrics.
Performance metrics such as accuracy, sensitivity, specificity and the Matthews correlation coefficient (MCC) can be used to assess classification performance.
Equation for calculating the Accuracy.
Equation for calculating the Sensitivity.
Equation for calculating the Specificity.
Equation for calculating the Matthews Correlation Coefficient.
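For reference, the standard textbook forms of these four metrics are written out below, with TP, TN, FP and FN standing for the counts of true positives, true negatives, false positives and false negatives.

```latex
\begin{align}
\text{Accuracy}    &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Sensitivity} &= \frac{TP}{TP + FN} \\
\text{Specificity} &= \frac{TN}{TN + FP} \\
\text{MCC}         &= \frac{TP \times TN - FP \times FN}
                           {\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\end{align}
```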
Here, TP, TN, FP and FN denote the instances of true positives, true negatives, false positives and false negatives, respectively. The MCC ranges from -1 to 1: a value of 1 indicates the best possible prediction scenario while a value of -1 indicates the worst one. An MCC of 0 is indicative of random prediction.
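As a small worked example, the function below computes the four metrics from confusion matrix counts using their standard definitions. The function name and the counts are hypothetical; returning an MCC of 0 when the denominator is zero is a common convention that I adopt here as an assumption.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, sensitivity, specificity and MCC from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical confusion matrix counts for a binary classifier.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For multi-class problems such as the penguins example, sensitivity and specificity are usually computed per class by treating each class in turn as the positive one.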
There is regression.
A trained regression model can be summed up by a simple equation: Y = f(X)
Here, Y corresponds to the quantitative output variable, X refers to the input variables and f refers to the mapping function (obtained from the trained model) for computing the output values as a function of the input features. In other words, if X is known, Y can be deduced from the equation. Once Y is calculated (predicted), a popular way to visualize the results is a scatter plot of the actual values versus the predicted values.
Simple scatter plot of actual versus predicted value. (Drawn by Chanin Nantasenamat)
There is an example dataset.
The Boston Housing dataset is a popular example. There are 506 rows and 14 columns in the dataset. The first 4 rows of the dataset are shown below.
The first 13 variables are used as input variables while the median house price is used as the output variable. All 14 variables have quantitative values and are suitable for regression analysis. I made a video on how to build a linear regression model in Python.
In the video, I showed you how to read in the Boston Housing dataset, split the data into two parts, build a linear regression model using the 80% subset, and apply the trained model to make predictions on the 20% subset. The scatter plot of the actual versus predicted medv values is shown.
Scatter plot of actual vs predicted medv values of the test set (20% subset). Plot taken from the Jupyter notebook on the Data Professor GitHub.
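The workflow from the video can be sketched as follows. Since the Boston Housing dataset is no longer bundled with recent scikit-learn releases, this sketch substitutes synthetic data with the same shape (506 rows, 13 input variables); the coefficients and noise level are made-up assumptions, so the numbers will not match the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 506 rows, 13 input variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(506, 13))
true_coef = rng.normal(size=13)
y = X @ true_coef + rng.normal(scale=0.5, size=506)

# 80/20 split, then fit on the 80% subset and predict on the 20% subset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Summarize how close the predicted values are to the actual values.
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"R2 = {r2:.3f}, RMSE = {rmse:.3f}")
```

Plotting `y_test` against `y_pred` gives the actual versus predicted scatter plot; points close to the diagonal indicate good predictions.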
There are performance metrics.
The performance of regression models is evaluated to assess the degree to which a fitted model can accurately predict the values of the input data.
A common metric for evaluating the performance of regression models is the coefficient of determination (R²).
R² is simply 1 minus the ratio of the residual sum of squares to the total sum of squares. It can be said to represent the relative measure of explained variance. For example, if R² = 0.6, the model can explain 60% of the variance (i.e. 60% of the data fit the regression model), whereas the unexplained variance accounts for the remaining 40%.
The mean squared error (MSE) and the root mean squared error (RMSE) are also common measures of the residuals, or error of prediction.
As the name implies, the MSE is easily calculated by taking the mean of the squared errors. Taking the square root of the MSE yields the RMSE.
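Written out in their standard forms, the regression metrics look as follows. The notation is an assumption on my part: y_i is the actual value, ŷ_i the predicted value, ȳ the mean of the actual values and n the number of samples.

```latex
\begin{align}
R^2 &= 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}
     = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \\
\text{MSE}  &= \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \\
\text{RMSE} &= \sqrt{\text{MSE}}
\end{align}
```

Because the RMSE is expressed in the same units as Y, it is often easier to interpret than the MSE.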
There is a visual explanation of the classification process.
Let’s take a look at the entire process of building a classification model. Using the penguins dataset as an example, we can see that penguins can be characterized by 4 quantitative features and 2 qualitative features, which are used as input for training a classification model. Several issues need to be considered in training the model.
- What is the best machine learning method to use?
- What search space should be explored for hyperparameter optimization?
- Which data splitting scheme should be used?
Once trained, the model can be used to make predictions on the class label (in our case the penguin species), which can be one of three penguin species: Adelie, Chinstrap or Gentoo.
Aside from performing only classification modeling, one could also perform principal component analysis, which will use only the X variables to discern the underlying structure of the data and in doing so would allow the visualization of the inherent data clusters.
Schematic illustration of the process of building a classification model. (Drawn by Chanin Nantasenamat)
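The principal component analysis step mentioned above can be sketched with scikit-learn. The data is synthetic: three hypothetical clusters in 4 quantitative features stand in for the three penguin species, and standardizing before PCA is a common choice rather than a requirement.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Three hypothetical clusters of 50 samples each, described by 4 features.
rng = np.random.default_rng(0)
centers = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [5.0, 5.0, 0.0, 0.0],
    [0.0, 5.0, 5.0, 0.0],
])
X = np.vstack([center + rng.normal(size=(50, 4)) for center in centers])

# PCA uses only the X variables; no class labels are involved.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
scores = pca.fit_transform(X_scaled)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("First two PC scores of sample 0:", scores[0])
```

A scatter plot of the two columns of `scores` would show the inherent clusters, which can then be colored by class label to see how well the classes separate.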