In Machine Learning, we work with train, validation, and test sets. We train the ML algorithm on the train set so that it learns the patterns present there, and we expect it to generalize that learning to the test set. But there are scenarios where the ML model learns everything in the training set yet fails to generalize to test data. This is part of what makes Machine Learning a challenging domain in practice. Regularization is a cure for this scenario.
The importance of this topic is high as the chances of interview questions being asked are incredibly high; in fact, it is as high as 95%. In this article, we will try to understand Regularization thoroughly by knowing how it works.
Regularization is a cure; to understand this cure, let's first understand the problem. We will use the terms below frequently in the discussion, so let's define them:
We split our dataset into two sets of train and test data. We want to train a model using train data and then check the learning performance on the test dataset. The training error will be the error made by the machine learning model on the train data, and similarly, the test error will be the error made upon test data. The reference error will be taken from the labeled dataset, which will be zero in most cases. Let's call it zero error.
In Underfitting, the machine learning model performs poorly even on the training data. To check whether a model is underfitting, we look at the gap between the zero error and the training error. If this gap is large, the ML model has not learned the patterns in the training dataset, and we say the model is underfitting.
The ML model is not able to learn well enough. The reasons could be:
In the case of Underfitting, increasing data samples does not help. We can use either of the two methods:
Overfitting is a problem in machine learning where the model performs well on training data but poorly on test data. Here, the gap between the zero error and the training error is small, but the gap between the training and test errors becomes large.
In the case of Overfitting, we can use either of the two methods:
An appropriate fit is considered the best case: the training error is slightly higher than in the overfitting scenario, but the test error drops significantly. In this case, the gap between the zero error and the training error is equal to or larger than it was under Overfitting, while the gap between the training and test errors becomes much smaller.
Let's take an example to understand it better. Suppose we have been given data for the price of a vehicle vs. the size of the vehicle. In the figure below, each red X represents a data point in the 2D plane. A quadratic function fits the given data well, but suppose we try to fit a higher-degree polynomial to it.
The overall objective of a machine learning algorithm is to reduce the error between predicted and actual values. Mathematically, we can express this objective as minimizing the sum of squares of all the prediction errors, which defines the cost function:
$$\min_{\theta} J(\theta) = \min_{\theta} \sum_{i=1}^{m} \left(h_\theta(X_i) - Y_i\right)^2$$
Here, hθ(Xi) is the model's prediction for the i-th input sample, and we have m such samples. Suppose our model fits a higher-degree polynomial (4th order) to the given data. In that case, the training error will be lower than that of the lower-degree model, because the extra terms let the model pass closer to every sample in the train data.
But since the data follows a quadratic trend, the 4th-order polynomial will produce larger errors on the test dataset. Hence, the 4th-order model is Overfitting. So, we need a method that limits the model fit to the 2nd degree. How can we do that?
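The effect described above can be tried out numerically. The following is a minimal sketch, not the article's own experiment: the data generator, noise level, sample sizes, and helper names (`make_data`, `mse`) are all assumptions made for illustration. It fits polynomials of degree 1, 2, and 4 to noisy quadratic data and reports the train and test errors.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n_samples):
    """Hypothetical 'size vs. price' data: a quadratic trend plus noise."""
    x = rng.uniform(-2.0, 2.0, n_samples)
    y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(0.0, 1.0, n_samples)
    return x, y

def mse(coeffs, x, y):
    """Mean squared error of a polynomial (np.polyfit coefficient order)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

x_train, y_train = make_data(12)    # small train set, easy to overfit
x_test, y_test = make_data(200)

errors = {}
for degree in (1, 2, 4):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares fit
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree {degree}: train MSE = {errors[degree][0]:.3f}, "
          f"test MSE = {errors[degree][1]:.3f}")
```

On the training data, the error can only go down as the degree goes up, because every lower-degree polynomial is a special case of the higher-degree one. Whether the test error of the 4th-degree fit ends up above that of the quadratic fit depends on the particular sample; with few, noisy training points it often does.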
In the figure above, what if we make the values of θ3 and θ4 extremely small so that X³ and X⁴ do not contribute significantly to the final fit? But how? Let's modify our cost function so that the contributions of X³ and X⁴ become negligible. We add the extra terms 1000·θ3² + 1000·θ4² like this:
$$\min_{\theta} J(\theta) = \min_{\theta} \left[ \sum_{i=1}^{m} \left(h_\theta(X_i) - Y_i\right)^2 + 1000\,\theta_3^2 + 1000\,\theta_4^2 \right]$$
Here, 1000 is just an arbitrarily large number. The ML algorithm looks for the minimum of the cost function, and to reach the minimum of the modified cost function it has to explore regions where θ3 and θ4 are very small:
$$\theta_3 \approx 0 \quad \text{and} \quad \theta_4 \approx 0$$
This eventually turns our 4th-order polynomial into a 2nd-order polynomial, and the problem of Overfitting is resolved. In simple terms, we penalized the irrelevant parameters that were creating problems.
$$f(x) = \theta_0 + \theta_1 X + \theta_2 X^2 + \theta_3 X^3 + \theta_4 X^4 \approx \theta_0 + \theta_1 X + \theta_2 X^2$$
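To see the penalty at work, here is a small sketch under stated assumptions: the data is synthetic, the helper `fit` is hypothetical, and the closed-form solve is used instead of an iterative optimizer purely for brevity. It minimizes the sum of squared errors with and without a penalty of 1000 on the coefficients of X³ and X⁴.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data with a quadratic trend plus noise.
x = rng.uniform(-2.0, 2.0, 15)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(0.0, 1.0, 15)

# Design matrix with columns 1, X, X^2, X^3, X^4, so theta[j] multiplies X^j.
X = np.vander(x, N=5, increasing=True)

def fit(X, y, penalty):
    """Minimize sum((X @ theta - y)^2) + sum(penalty[j] * theta[j]^2).

    Setting the gradient to zero gives (X^T X + diag(penalty)) theta = X^T y.
    """
    return np.linalg.solve(X.T @ X + np.diag(penalty), X.T @ y)

theta_plain = fit(X, y, np.zeros(5))
theta_penalized = fit(X, y, np.array([0.0, 0.0, 0.0, 1000.0, 1000.0]))

print("no penalty:          ", np.round(theta_plain, 4))
print("X^3 and X^4 penalized:", np.round(theta_penalized, 4))
```

The combined size of the two penalized coefficients can never be larger than in the unpenalized fit, while the coefficients of 1, X, and X² remain free to fit the data.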
In the previous example, penalizing the irrelevant parameters fulfilled two main objectives:
Regularization is the concept mainly used to fulfill these two objectives. Suppose there are a total of n features in the data. The ML model will then learn a total of n + 1 parameters (one weight for each of the n features and 1 bias), i.e.
Features = X1, X2, X3, X4, ..., Xn
Parameters = θ0, θ1, θ2, θ3, ..., θn
If we knew which features were irrelevant, we could penalize just their parameters, and Overfitting would be reduced. But finding that set of irrelevant features is difficult in most cases. To overcome this problem, penalization is applied to every parameter. If we do so, the new cost function will become:
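$$J(\theta) = \frac{1}{2m} \Bigg[ \underbrace{\sum_{i=1}^{m} \left(h_\theta(X_i) - Y_i\right)^2}_{\text{First Term}} + \underbrace{\lambda \sum_{j=1}^{n} \theta_j^2}_{\text{Second Term}} \Bigg]$$
Here, m is the number of samples and n is the number of features. By the usual convention, the second sum covers only the weights θ1, ..., θn and leaves the bias θ0 unpenalized.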
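As a concrete reading of this cost function, the sketch below computes both terms with NumPy. The function name `regularized_cost`, the layout of `X` (a leading column of ones for the bias), and the choice to leave `theta[0]` out of the penalty are assumptions made for illustration.

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J(theta) = (1 / (2m)) * [sum of squared errors + lam * sum(theta_j^2, j >= 1)].

    X is an (m, n + 1) matrix whose first column is all ones, so theta[0] is
    the bias. Leaving theta[0] out of the penalty is an assumed convention.
    """
    m = X.shape[0]
    errors = X @ theta - y                       # h_theta(X_i) - Y_i for every sample
    first_term = np.sum(errors ** 2)             # how well the model fits the data
    second_term = lam * np.sum(theta[1:] ** 2)   # how large the weights are
    return (first_term + second_term) / (2 * m)

# Tiny hand-checkable example: 2 samples, 1 feature.
X = np.array([[1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 2.0])
theta = np.array([0.0, 2.0])   # predictions are 2 and 4, so the errors are 1 and 2

print(regularized_cost(theta, X, y, lam=0.0))  # (1 + 4) / 4 = 1.25
print(regularized_cost(theta, X, y, lam=1.0))  # (5 + 1 * 4) / 4 = 2.25
```

With λ = 0 only the first term counts; raising λ makes the same weights more expensive, which is what pushes the optimizer toward smaller weights.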
This new cost function controls the tradeoff between two goals:
One question that must come to our mind now: what if we set a huge value of λ in our cost function, let's say 10¹⁰? To understand this, let's take the same curve-fitting example where our model fitted a 4th-order polynomial:
$$f(x) = \theta_0 + \theta_1 X + \theta_2 X^2 + \theta_3 X^3 + \theta_4 X^4$$
If we make the λ value very high, then every penalized parameter will become very small, i.e.
$$\theta_1 \approx \theta_2 \approx \theta_3 \approx \theta_4 \approx 0$$
The model would then fit a flat, constant line, f(x) = θ0, a perfect case of Underfitting. Hence, we need to choose the value of λ wisely.
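This limiting behavior is easy to check numerically. The sketch below is an illustration under assumptions: synthetic quadratic data, a hypothetical helper `ridge_fit` that solves the penalized least-squares problem in closed form, and an unpenalized bias θ0. The 1/(2m) factor is dropped because it does not change where the minimum lies.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical quadratic data, as in the curve-fitting example.
x = rng.uniform(-2.0, 2.0, 30)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(0.0, 0.5, 30)
X = np.vander(x, N=5, increasing=True)   # columns 1, X, X^2, X^3, X^4

def ridge_fit(X, y, lam):
    """Closed-form minimizer of sum((X @ theta - y)^2) + lam * sum(theta[1:]^2)."""
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0                  # assumed convention: bias not penalized
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

for lam in (0.0, 1.0, 1e10):
    theta = ridge_fit(X, y, lam)
    print(f"lambda = {lam:g}: theta = {np.round(theta, 4)}")

theta_huge = ridge_fit(X, y, 1e10)
```

With λ = 10¹⁰ the weights θ1 to θ4 are pushed to practically zero, and θ0 settles at the mean of y, so the model predicts the same value for every input.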
To reach the minimum of the cost function faster, we use optimizers that guide the parameter update steps. We repeatedly update the values of the parameters and check the value of the cost function. Let's use the Gradient Descent optimizer and see how the regularization parameter works.
To observe closely, rearrange the second term:
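As a sketch, assuming the regularized cost function above with m samples, learning rate α, and an unpenalized bias θ0, one Gradient Descent step for a weight θj (j ≥ 1) and its rearranged form can be written as follows. X_{ij} denotes feature j of sample i, a notation introduced only for this sketch.

$$\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(X_i) - Y_i\right) X_{ij} + \frac{\lambda}{m}\,\theta_j \right]$$

$$\theta_j := \theta_j \left(1 - \frac{\alpha\lambda}{m}\right) - \frac{\alpha}{m} \sum_{i=1}^{m} \left(h_\theta(X_i) - Y_i\right) X_{ij}$$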
α (learning rate), λ, and m are all positive. The learning rate cannot be too high, as that increases the chance of overshooting the minimum; typical values are 0.01 or 0.001. λ also cannot be very high, as that causes Underfitting, and the number of samples (m) is usually large in Machine Learning. As a result, αλ/m is a small positive number, so the regularization part of the update multiplies θj by a factor (1 - αλ/m) that is slightly below 1. Hence, in every iteration, θj is reduced.
As the regularization parameter (λ) affects both the cost function and the parameter updates, it penalizes the parameters that drive the model in the wrong direction. Those θj values shrink the most, and the dimensions corresponding to them automatically become less significant. In this way, Regularization solves the problem of Overfitting.
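The shrinking effect can be sketched in a few lines of NumPy. This is an illustrative implementation under assumptions, not a reference one: the data is synthetic, the function name `gradient_descent` and its default values are hypothetical, and the bias θ0 is assumed to be left out of the penalty.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: m samples, n features, only the first feature matters.
m, n = 100, 3
features = rng.normal(size=(m, n))
y = 4.0 + 3.0 * features[:, 0] + rng.normal(0.0, 0.5, m)
X = np.column_stack([np.ones(m), features])   # first column of ones for the bias

def gradient_descent(X, y, lam, alpha=0.01, n_iters=5000):
    """Batch gradient descent on the L2-regularized squared-error cost."""
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / m      # gradient of the first term
        shrink = np.full(theta.shape, 1.0 - alpha * lam / m)
        shrink[0] = 1.0                       # assumed: bias is not shrunk
        theta = theta * shrink - alpha * grad
    return theta

theta_no_reg = gradient_descent(X, y, lam=0.0)
theta_reg = gradient_descent(X, y, lam=50.0)

print("lambda = 0: ", np.round(theta_no_reg, 3))
print("lambda = 50:", np.round(theta_reg, 3))
```

In each iteration the weights are first multiplied by (1 - αλ/m) and then moved along the usual gradient, so the run with λ = 50 ends with smaller weights than the run with λ = 0.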
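In L1 regularization, the loss function adds the absolute value of each coefficient as a penalty term:
$$J(\theta) = \frac{1}{2m} \Bigg[ \sum_{i=1}^{m} \left(h_\theta(X_i) - Y_i\right)^2 + \underbrace{\lambda \sum_{j=1}^{n} |\theta_j|}_{\text{L1 penalty}} \Bigg]$$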
This is also known as the LASSO (Least Absolute Shrinkage and Selection Operator) regression. Lasso shrinks the coefficient of irrelevant or less important features to zero and eventually helps in feature selection.
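In L2 regularization, the loss function adds the squared magnitude of each coefficient as a penalty term:
$$J(\theta) = \frac{1}{2m} \Bigg[ \sum_{i=1}^{m} \left(h_\theta(X_i) - Y_i\right)^2 + \underbrace{\lambda \sum_{j=1}^{n} \theta_j^2}_{\text{L2 penalty}} \Bigg]$$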
Regression models that use L2-regularization are also known as Ridge regression.
Which of the two to use depends on our requirements. Both methods help reduce the effects of irrelevant features, but they do so in different ways.
L1 Regularization reduces the effect of irrelevant features by making their coefficients exactly zero. It can be helpful when we have constraints on the number of features used to build the model. In simple terms, L1 regularization techniques are widely used for feature selection.
L2 Regularization reduces the effect of irrelevant features by shrinking their coefficients toward zero (constraining their norm) while keeping every feature in the model. This is different from making a coefficient exactly 0. It can be useful when we have fewer features and want to retain all of them.
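The difference between the two penalties can be observed on a small synthetic problem. The sketch below rests on several assumptions: the data is made up, there is no bias term (the data has none), Ridge is solved in closed form, and Lasso is solved with ISTA (iterative soft-thresholding), which is just one of several possible solvers and was picked here for brevity. The helper names `ridge` and `lasso` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data: 5 features, but only the first two influence y.
m = 500
X = rng.normal(size=(m, 5))
true_w = np.array([3.0, -2.0, 0.0, 0.0, 0.0])
y = X @ true_w + rng.normal(0.0, 0.1, m)

def ridge(X, y, lam):
    """L2: closed-form minimizer of (1/2m) * [||Xw - y||^2 + lam * sum(w^2)]."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def lasso(X, y, lam, n_iters=2000):
    """L1: minimize (1/2m) * [||Xw - y||^2 + lam * sum(|w|)] with ISTA."""
    m = X.shape[0]
    step = m / np.linalg.norm(X, 2) ** 2          # 1 / Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        z = w - step * (X.T @ (X @ w - y) / m)    # plain gradient step
        # Soft-thresholding: anything within the threshold becomes exactly 0.
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam / (2 * m), 0.0)
    return w

w_l1 = lasso(X, y, lam=100.0)
w_l2 = ridge(X, y, lam=100.0)

print("L1 (Lasso):", np.round(w_l1, 4))
print("L2 (Ridge):", np.round(w_l2, 4))
print("exact zeros with L1:", int(np.sum(w_l1 == 0.0)))
print("exact zeros with L2:", int(np.sum(w_l2 == 0.0)))
```

In this setup, the L1 fit sets the coefficients of the three irrelevant features to exactly zero, while the L2 fit keeps all five coefficients nonzero and only shrinks them.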
This is one of the hottest topics in machine learning, on which you will be questioned in machine learning interviews. Some possible ones are:
Overfitting is one of the most commonly encountered problems when addressing a problem statement with machine learning models, and regularization techniques are the cure for it. In this article, we discussed how regularization techniques cure the problem of Overfitting. We also discussed the possible reasons for Underfitting and Overfitting and what can be done to eliminate these problems. This is one of the hottest topics in machine learning interviews, and we hope we have clarified it.
Next Blog: Bias-Variance tradeoff