# The Need for Feature Scaling in Machine Learning

Collecting and pre-processing raw data is a fundamental step in any machine learning pipeline. Big organizations in the data science and machine learning domains record many attributes/properties to avoid losing critical information. Every attribute has its own properties and a valid range in which it can lie. For example, the speed of a motorbike can be in the range of 0–200 km/h, but the speed of a car can be in the range of 0–400 km/h. Machine learning and deep learning models expect these ranges to be on the same scale so that they can decide the importance of each property without any bias.

In this article, we will learn about one of the essential topics in preparing attributes for machine learning: normalization and standardization. Even among machine learning professionals, confusion about choosing between normalization and standardization persists. Through this article, we will try to clear this confusion for good.

• What is Normalization?
• Why do we need scaling (normalization or Standardization)?
• What are the different normalization techniques?
• What is Standardization?
• When to normalize and when to standardize?

## What is Normalization?

In machine learning, a feature is an individual measurable property or characteristic of an observed phenomenon. Based on the availability of essential and independent observations, we train our model on a combination of input features. For example, suppose we want to train a machine learning model to predict the price of a flat. We can train our model with just the size of the flat as a feature, but including the locality of the flat in our input feature set will improve the model’s performance. Hence, we use various observable and independent features to make our model more confident in its predictions.

As the features differ, the ranges of their numerical values will also differ. Scaling all the features into the same definite range is known as normalization. But shouldn’t we ask why? Why scale the features? Why not use the features directly and train the model?
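For instance, the motorbike and car speeds from the introduction can be brought onto a common scale; a minimal sketch, where the sample speeds are made up and each divisor is the assumed maximum of that attribute’s range:

```python
# Speeds recorded on different scales (motorbike: 0-200 km/h, car: 0-400 km/h)
# mapped into the same [0, 1] range by dividing by each range's maximum.
bike_speeds = [50.0, 100.0, 200.0]
car_speeds = [100.0, 200.0, 400.0]

bike_scaled = [s / 200.0 for s in bike_speeds]
car_scaled = [s / 400.0 for s in car_speeds]

print(bike_scaled)  # [0.25, 0.5, 1.0]
print(car_scaled)   # [0.25, 0.5, 1.0]
```

After scaling, a motorbike at half of its top speed and a car at half of its top speed look identical to the model, even though the raw values (100 vs 200) differ.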

## Why scale the features?

Let’s go through an example to answer this question; it will reveal the mathematical reasoning behind normalization and standardization.

### Mathematical Intuition

Suppose we want a machine learning model to learn the function Y = θ1*X + θ0. We have a supervised dataset (input X and output Y). During the learning process, the machine starts from randomly selected values (or hard-coded manual values) for θ1 and θ0, then iteratively reduces the error between the predicted value of Y (i.e., Ŷ) and the actual value of Y. Our overall goal is to minimize this error function.

Let’s choose Mean Squared Error (MSE) as our error function, also called a cost function. The formula for MSE is given below, where n is the number of training samples:

`J(θ1, θ0) = (1/n) * Σ (Ŷi − Yi)²`

As Ŷ depends on the two parameters θ1 and θ0, the cost function also depends on these two parameters. We can picture the cost as a surface with one dimension for the cost value and the other two dimensions for θ1 and θ0. Suppose we start at some position A on this surface; reaching position B, the minimum of the cost function, is our ultimate goal. For that, the machine will tweak the values of θ1 and θ0.
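The cost can be computed directly; a small sketch with made-up data, where the function name and sample values are illustrative:

```python
# Mean Squared Error for the hypothesis Y_hat = theta1 * X + theta0
def mse(theta1, theta0, X, Y):
    n = len(X)
    return sum((theta1 * x + theta0 - y) ** 2 for x, y in zip(X, Y)) / n

# Toy dataset where the true relationship is Y = 2 * X
X = [1.0, 2.0, 3.0]
Y = [2.0, 4.0, 6.0]

print(mse(2.0, 0.0, X, Y))  # perfect fit -> 0.0
print(mse(1.0, 0.0, X, Y))  # errors -1, -2, -3 -> (1 + 4 + 9) / 3
```

Gradient descent’s job is to move (θ1, θ0) from a high-cost point like (1.0, 0.0) towards the zero-cost point (2.0, 0.0).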

But the machine could try infinitely many values of θ1 and θ0 if it selected them randomly at each step. We use optimizers to help the machine choose the next values of θ1 and θ0 so that it reaches the minimum quickly. Let’s select gradient descent as our optimizer to learn the function Y = θ1*X + θ0. In gradient descent, we update each parameter θj using the rule below, where α is the learning rate and J is the cost function:

`θj = θj − α * ẟJ/ẟθj`

So the updated values will be `θ1 = θ1 − α * ẟJ/ẟθ1` and `θ0 = θ0 − α * ẟJ/ẟθ0`. Let’s calculate `ẟJ/ẟθ1` and `ẟJ/ẟθ0`. The prediction error can be written as error = (Ŷ − Y).

Now let’s calculate the partial derivatives of the cost function with respect to the two parameters, θ1 and θ0. With error = (Ŷ − Y), they come out to:

`ẟJ/ẟθ1 = (2/n) * Σ (Ŷ − Y) * X` and `ẟJ/ẟθ0 = (2/n) * Σ (Ŷ − Y)`

Combining everything with the gradient descent update rule:

`θ1 = θ1 − α * (2/n) * Σ (Ŷ − Y) * X` and `θ0 = θ0 − α * (2/n) * Σ (Ŷ − Y)`

The presence of the feature value X in the update formula for θ1 affects the step size of gradient descent. If the features lie in different ranges, every feature gets a different step size. Suppose X1 and X2 are two attributes that constitute the input variable X, i.e., X = [X1, X2], and treat X1 and X2 as two dimensions. To ensure that gradient descent moves smoothly towards the minimum and that the steps get updated at the same rate in every dimension, we scale the data before feeding it to the model.
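The effect can be sketched numerically: with two features on very different scales, the gradient components differ by the same factor as the feature ranges. The data below is illustrative, and the factor of 2 comes from differentiating the squared error:

```python
# Gradients of MSE J = (1/n) * sum((w1*x1 + w2*x2 + b - y)^2)
# for a model with two input features of very different ranges.
def gradients(w1, w2, b, data):
    n = len(data)
    g1 = g2 = gb = 0.0
    for x1, x2, y in data:
        err = (w1 * x1 + w2 * x2 + b) - y   # prediction error (Y_hat - Y)
        g1 += 2 * err * x1 / n              # dJ/dw1 -- scaled by x1
        g2 += 2 * err * x2 / n              # dJ/dw2 -- scaled by x2
        gb += 2 * err / n                   # dJ/db
    return g1, g2, gb

# Hypothetical samples: x1 in [0, 1], x2 in [0, 1000], y = 10 * x1
data = [(0.5, 500.0, 5.0), (0.8, 800.0, 8.0), (0.2, 200.0, 2.0)]
g1, g2, _ = gradients(0.0, 0.0, 0.0, data)
print(abs(g2 / g1))  # ~1000: the wide-range feature dominates the step
```

With one shared learning rate, a step that is reasonable along the x1 direction is a thousand times too large along x2 (or vice versa), which is exactly why we scale first.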

### Example Intuition

Some machine learning algorithms are sensitive to normalization or standardization, and some are insensitive. Algorithms like SVM, K-NN, K-means, neural networks, and other deep learning models are sensitive to normalization/standardization because they use the spatial relationships (space-dependent relations) present among the data samples. For example, suppose we compare students using their marks in subjects that have different maximum marks. If we apply a scaling technique and use the percentage of marks instead of the raw marks, the scaled distances become comparable across subjects.
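As a hypothetical illustration, compare two students using marks from two subjects with different maximum marks; after converting to percentages, both subjects contribute comparably to the distance:

```python
import math

# Euclidean distance between two samples, as used by K-NN or K-means
def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical students: (marks out of 500, marks out of 10)
s1, s2 = (450.0, 9.0), (300.0, 3.0)

# Unscaled: the out-of-500 subject completely dominates the distance
print(dist(s1, s2))

# Scaled to fractions of the maximum: both subjects contribute comparably
p1 = (s1[0] / 500, s1[1] / 10)
p2 = (s2[0] / 500, s2[1] / 10)
print(dist(p1, p2))
```

In the unscaled version the second subject’s difference (9 vs 3) is invisible next to the first subject’s (450 vs 300); after scaling, both differences (0.3 and 0.6) matter.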

Algorithms like decision trees, random forests, or other tree-based algorithms are insensitive to normalization or standardization because they split on each feature individually and are not influenced by the scale of any other feature.
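To see why, consider a simple decision stump (a one-split tree): rescaling a feature rescales the best split threshold by the same factor, leaving the predictions unchanged. The stump finder and the data below are a hypothetical sketch:

```python
def best_stump(xs, ys):
    """Exhaustively pick the threshold t minimizing errors of rule: x > t -> 1."""
    best_t, best_err = None, float("inf")
    for t in sorted(set(xs)):
        err = sum(int((x > t) != y) for x, y in zip(xs, ys))
        if err < best_err:
            best_t, best_err = t, err
    return best_t

xs = [10.0, 20.0, 30.0, 40.0]
ys = [0, 0, 1, 1]

t1 = best_stump(xs, ys)                    # threshold on the raw feature
t2 = best_stump([x / 100 for x in xs], ys) # threshold after rescaling

print(t1, t2)  # 20.0 and 0.2: the threshold rescales, the split is the same
```

The same samples fall on the same side of the split either way, so the tree’s predictions are identical with or without scaling.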

### The two reasons that support the need for scaling are:

• Scaling the features makes the flow of gradient descent smooth and helps the algorithm quickly reach the minimum of the cost function.
• Without scaling, the algorithm may become biased towards features with values of higher magnitude. Scaling brings every feature into the same range so that the model uses every feature fairly.

Now that we know why we scale, let’s see some popular techniques used to bring all the features into the same range.

## Popular Scaling techniques

### Min-Max Normalization

This is the most widely used normalization technique in the machine learning industry. We bring every attribute into a defined range, starting at a and ending at b, by mapping each feature’s minimum value to a and its maximum value to b. The ranges [0, 1] and [-1, 1] are the most popular. The formulae for the three cases are:

`X' = a + (X − Xmin) * (b − a) / (Xmax − Xmin)` (general range [a, b])

`X' = (X − Xmin) / (Xmax − Xmin)` (range [0, 1])

`X' = 2 * (X − Xmin) / (Xmax − Xmin) − 1` (range [-1, 1])

### Logistic Normalization

Here, we transform the features so that they start contributing proportionally in the update step of gradient descent. As the formula below shows, we pass each value of X through an exponential (sigmoid) function that squashes it into the range (0, 1):

`X' = 1 / (1 + e^(−X))`

### Standardization

Standardization is another scaling technique in which we transform each feature such that the transformed feature has mean (μ) = 0 and standard deviation (σ) = 1. The formula to standardize the features in the data samples is:

`X' = (X − μ) / σ`

This scaling technique is also known as Z-score normalization or Z-mean normalization. Unlike normalization, standardization is not much affected by the presence of outliers (think about why!).

In normalization, the formula involves min and max operations. What if some outliers have significantly higher or lower magnitudes? They directly distort the min and max, so normalization is heavily influenced by the presence of outliers. But if the dataset is large and contains only a few outliers, the mean and standard deviation calculations are affected by a much smaller margin. Hence standardization is less affected by outliers.
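The scaling techniques above can be sketched in plain Python; a minimal illustration with made-up data (in practice, libraries such as scikit-learn provide `MinMaxScaler` and `StandardScaler` for this):

```python
import math

def min_max_scale(values, a=0.0, b=1.0):
    """Map values linearly so the minimum maps to a and the maximum to b."""
    lo, hi = min(values), max(values)
    return [a + (v - lo) * (b - a) / (hi - lo) for v in values]

def logistic_scale(values):
    """Squash each value into (0, 1) using the sigmoid 1 / (1 + e^-x)."""
    return [1.0 / (1.0 + math.exp(-v)) for v in values]

def standardize(values):
    """Z-score: shift and scale so the result has mean 0 and std dev 1."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    return [(v - mu) / sigma for v in values]

data = [10.0, 20.0, 30.0, 40.0]
print(min_max_scale(data))           # [0.0, 0.33..., 0.66..., 1.0]
print(min_max_scale(data, -1, 1))    # [-1.0, -0.33..., 0.33..., 1.0]
print(logistic_scale([-2.0, 0.0, 2.0]))
print(standardize(data))             # mean 0, std dev 1

# One extreme outlier squeezes the min-max output of the normal points
# into a narrow band near 0, illustrating the sensitivity discussed above.
with_outlier = data + [10000.0]
print(min_max_scale(with_outlier)[:4])
```

Running the last line shows all four original points crammed below 0.01 on the [0, 1] scale, while their z-scores change far less, which is the sense in which standardization is more robust to outliers.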

Now we know two different scaling techniques. But sometimes, knowing more or having more options brings another challenge: the challenge of choice. So we have a new question for ourselves:

## When to Normalize and When to Standardize?

Let’s list down the use cases where Normalization and Standardization would be beneficial.

### Normalization would be more beneficial when:

• The data samples are NOT normally distributed.
• The dataset is clean and free from outliers.
• The dataset covers the full range (minimum to maximum) of each feature.
• It is often used with algorithms like neural networks, K-NN, and K-means.

### Standardization would be more beneficial when:

• The data samples come from a normal distribution. This is not always the case, but standardization is most effective when it holds.
• The dataset contains outliers that would distort the min/max calculations.

## Summary

• Scaling features helps optimization algorithms reach the minimum of the cost function quickly.
• Scaling features restricts models from becoming biased towards features with values of higher or lower magnitude.
• Normalization and standardization are two popular scaling techniques.
• With Gaussian (normally) distributed data samples, standardization works perfectly.

## Possible Interview Questions on Normalization

• What is data normalization, and why do we need it?
• Do we need to normalize the output/target variable as well?
• What is Standardization? When is Standardization preferred?
• Why will the model become biased if we do not scale the variables?
• Why does standardization often work better in real-life scenarios?

## Conclusion

In this article, we saw why we need to scale different attributes in machine learning. Machine learning models expect all the features or attributes to be on the same scale so that they can decide the importance of those features without any bias. We showed how scaling helps in building machine learning models using two different examples, and we discussed the challenge of choosing between normalization and standardization, which confuses even machine learning professionals. We hope you have enjoyed the article.

Enjoy Learning, Enjoy Algorithms!