Data normalization in machine learning
What is it, how does it help, tools used and an experiment
Photo by Patrick Tomasso on Unsplash
In the previous article I wrote about cluster analysis and briefly discussed data normalization, in particular how it impacts clustering.
I didn’t get the chance to go into more detail, since that wasn’t the focus of that article, so I want to cover it today.
What is the difference between normal and normalized?
So what exactly is normalization?
Say we have a dataset with two variables: time traveled and distance covered. Time is measured in hours (e.g. 5, 10, 25 hours) and distance in miles (e.g. 500, 800, 1200 miles). Do you see the problem?
The first problem is that the variables are measured in two different units, one in hours and the other in miles. The other problem is that the distribution of the data is quite different in the two variables.
Normalization is the process of transforming data so that the variables are either dimensionless or have similar distributions. The process is also known by other names, such as standardization and feature scaling. Normalization is an important step in data pre-processing.
Does it help?
So how does this transformation help?
The short answer is that it can improve model accuracy, sometimes dramatically.
Normalization gives equal weight to each variable, so that no single variable steers model performance in one direction just because its values are bigger numbers.
Clustering algorithms, for example, use distance measures to determine whether an observation belongs to a certain cluster, and Euclidean distance is commonly used for this. Without normalization, variables with large values can suppress variables with small values.
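To make the suppression effect concrete, here is a minimal sketch using the hours and miles values from the example earlier. Pairing the values row by row, and the helper names `min_max` and `hours_share`, are my own assumptions for illustration. It compares how much of the squared Euclidean distance between the first two observations comes from the hours variable, before and after rescaling both variables to the 0 to 1 range.

```python
import math

# Values from the travel example; pairing them row by row is an assumption.
hours = [5, 10, 25]
miles = [500, 800, 1200]

def min_max(values):
    """Rescale a list of numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def hours_share(p, q):
    """Fraction of the squared Euclidean distance contributed by the first coordinate."""
    d_hours = (p[0] - q[0]) ** 2
    d_miles = (p[1] - q[1]) ** 2
    return d_hours / (d_hours + d_miles)

raw = list(zip(hours, miles))
scaled = list(zip(min_max(hours), min_max(miles)))

# Distance between the first two observations, before and after rescaling
raw_dist = math.dist(raw[0], raw[1])
scaled_dist = math.dist(scaled[0], scaled[1])
raw_share = hours_share(raw[0], raw[1])
scaled_share = hours_share(scaled[0], scaled[1])

print(f"raw distance:    {raw_dist:.3f}, hours share: {raw_share:.4%}")
print(f"scaled distance: {scaled_dist:.3f}, hours share: {scaled_share:.4%}")
```

On the raw values the distance is almost entirely the difference in miles; after rescaling, hours account for a meaningful part of it.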
What are the techniques used?
Three popular and widely used techniques are applied for normalization.
- Rescaling: the simplest of all methods, also known as min-max normalization. x' = (x − min(x)) / (max(x) − min(x))
- Mean normalization: this method uses the mean of the observations in the transformation. x' = (x − mean(x)) / (max(x) − min(x))
- Z-score normalization: also known as the standard score, and widely used in machine learning. z = (x − μ) / σ
Here z is the standard score, μ is the population mean and σ is the population standard deviation.
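The three techniques can be sketched in a few lines of plain Python. This is an illustration under my own assumptions rather than how any library implements them: the function names are made up, the input is the miles column from the earlier example, and the sketch does not handle a constant column (which would cause a division by zero).

```python
import statistics

def min_max_scale(values):
    """Rescaling (min-max normalization): maps values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def mean_normalize(values):
    """Mean normalization: centers on the mean, divides by the range."""
    lo, hi = min(values), max(values)
    mean = statistics.mean(values)
    return [(x - mean) / (hi - lo) for x in values]

def z_score(values):
    """Z-score normalization using the population mean and standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [(x - mu) / sigma for x in values]

miles = [500, 800, 1200]
rescaled = min_max_scale(miles)
mean_normed = mean_normalize(miles)
standardized = z_score(miles)

print(rescaled)
print(mean_normed)
print(standardized)
```

Rescaled values run from 0 to 1, mean-normalized values are centered on 0, and z-scores have mean 0 and standard deviation 1.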
An example
It is always good to see an algorithm in action. The example is not going to be dramatic; I am just showing it as an illustration to give an intuition.
First we import the libraries: pandas for data wrangling, matplotlib for visualization, and the preprocessing module and KMeans from the sklearn library.
The data is imported as a csv file from a GitHub repo. It is the iris dataset, which is already cleaned, so you can import it and follow along right away.
I will use two features for clustering data points.
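    # import libraries
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn import preprocessing
    from sklearn.cluster import KMeans
    # import clean data
    df = pd.read_csv("iris.csv")  # hypothetical local path; the repo URL is not given here
    # feature selection
    df = df[["petal_length", "petal_width"]]  # assumed column names; the text does not say which two features are used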
After importing the libraries and data, we will run the K-Means clustering algorithm. I have annotated each line of code so you know what is going on.
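    # inputs (NOT normalized)
    X_not_norm = df.values
    # instantiate model
    model = KMeans(n_clusters=3)  # number of clusters is an assumption; it is not stated in the text
    # fit and predict
    y_model = model.fit_predict(X_not_norm)
    # visualize clusters
    plt.scatter(X_not_norm[:, 0], X_not_norm[:, 1], c=y_model, cmap='viridis')
    # counts per cluster
    print(pd.Series(y_model).value_counts())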
The visualization of the clusters and the count of data points are the outputs.
The model will be re-run after the inputs are normalized.
    # normalize inputs
    X_norm = preprocessing.scale(df)
    # instantiate model
    model = KMeans(n_clusters=3)  # number of clusters is an assumption; it is not stated in the text
    # fit and predict
    y_model = model.fit_predict(X_norm)
    # counts per cluster
    print("Value Counts")
    print(pd.Series(y_model).value_counts())
    # visualize clusters
    plt.scatter(X_norm[:, 0], X_norm[:, 1], c=y_model, cmap='viridis')
Below are the outputs before and after the data is normalized. If you compare the value counts, cluster 0 has 4 fewer members after normalization.
You can see which data points shifted from pre-normalized to post-normalized by looking at the data points in the left and right figures. The changes are usually at the boundaries rather than at either end of the distribution. It is not too dramatic, but you get the point.
Differences of clustering before and after normalization (source: Author)
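If you would rather list the shifted points than spot them by eye, one caveat applies: K-Means cluster IDs are arbitrary, so cluster 0 in one run need not be cluster 0 in the next. The sketch below is one possible approach, not part of the original experiment. The helper `shifted_points` is hypothetical, the labels are toy values made up for illustration, and matching clusters by largest overlap is a simple heuristic rather than a guaranteed one-to-one matching.

```python
from collections import Counter

def shifted_points(labels_before, labels_after):
    """Return indices whose cluster changed, after matching cluster IDs by overlap."""
    # For each "before" cluster, count the "after" labels of its members.
    overlap = {}
    for before, after in zip(labels_before, labels_after):
        overlap.setdefault(before, Counter())[after] += 1
    # Map each "before" cluster to the "after" cluster it overlaps with most.
    mapping = {b: counts.most_common(1)[0][0] for b, counts in overlap.items()}
    # A point has shifted if its "after" label differs from the mapped label.
    return [
        i
        for i, (b, a) in enumerate(zip(labels_before, labels_after))
        if mapping[b] != a
    ]

# Toy labels (made up for illustration, not the iris results)
before = [0, 0, 0, 1, 1, 1, 2, 2]
after = [1, 1, 1, 0, 0, 2, 2, 2]
moved = shifted_points(before, after)
print(moved)  # [5]
```

To use this with the experiment above, the predictions from the first run would need to be saved under a separate name, since both runs store their result in the same variable.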
The other side of the coin…
We tend to think that normalization is always a great thing to do in data science, but it has some side effects.
Normalization applies equal weight to all features, and important information can be lost in the process.
One example is what happens to outliers. We tend to think of outliers as the bad guys that we need to get rid of. But when you drop outliers to get a better model, you also lose the information they carry.
The variables also lose their units of measurement in the process, so at the end of modeling you can’t tell what the key differences between the variables are.
I hope it was useful. Feel free to write your comments down. You can follow me on social media.