In machine learning, we generally split the available dataset into three sets: a train set, a validation set, and a test set. ML algorithms learn the patterns present in the training set, with the validation set helping to tune that learning, and we expect our models to generalize what they have learned to the test set. But what if a model learns the complexities present in the train data but fails to generalize to unseen test data?
This is one of the main concerns of almost every machine learning professional, and this problem exists in nearly every tech sector using machine learning. This problem is called overfitting, and regularization is one fix for it. In this article, we will learn the mathematical aspects of regularization and see how it fixes the overfitting problem.
Note: This topic is extremely important, as interviewers love to ask questions about it.
After going through this blog, we will be able to understand the following things:
To cure any problem, we first need to identify whether or not our machine learning models are suffering from that problem. So, let's first understand how to identify the issues in ML models.
After training a machine learning model on the train set, we evaluate its performance on the train and test datasets. The total errors it makes on the train and test data are called the training and testing errors, respectively. To compare these errors, we need a reference error. This reference is calculated from the labeled dataset and is zero in most cases, hence it is called the zero error.
Based on these training and testing errors, an ML model can be in one of three states:
Let's understand each of them and a method to detect these states.
In this state, training error becomes huge with respect to the zero error. We can interpret that the model is not learning patterns in the dataset and is "underfitting".
Possible reasons for Underfitting can be
ML models are failing to learn patterns present in the dataset, and the possible reasons could be:
In the case of Underfitting, increasing data samples does not help. We can use either of the two methods:
Here, the training error is small with respect to the zero error, but the testing error is huge. The model learns patterns from the train data but fails to generalize these learnings to unseen datasets. Models with overfitting problems are useless, as real-life data will always differ from the train set.
Possible reasons for Overfitting are:
Possible Cures of Overfitting
In the case of Overfitting, we can use either of the two methods:
This is not a problem but the state we want our ML models to be in. Here, the training error is slightly higher than in the overfitting scenario, but the testing error is significantly lower. In simple terms, the model has started to generalize its learning to unseen datasets.
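The three states above can be told apart by comparing the training and testing errors against the zero error. The following is a minimal sketch of that comparison; the function name `diagnose_fit` and the `tolerance` threshold are hypothetical choices made for illustration, and what counts as a "huge" error depends on the problem.

```python
def diagnose_fit(train_error, test_error, zero_error=0.0, tolerance=0.1):
    """Label a model's state from its training and testing errors.

    `tolerance` is an assumed, problem-specific threshold for what counts
    as a "huge" gap; it is not a standard value.
    """
    train_gap = train_error - zero_error            # distance of training error from the reference
    generalization_gap = test_error - train_error   # how much worse the model is on unseen data

    if train_gap > tolerance:
        return "underfitting"    # the model has not even learned the training data
    if generalization_gap > tolerance:
        return "overfitting"     # learned the training data, fails on unseen data
    return "appropriate fitting"


print(diagnose_fit(train_error=0.45, test_error=0.50))
print(diagnose_fit(train_error=0.02, test_error=0.40))
print(diagnose_fit(train_error=0.05, test_error=0.08))
```

The training error is checked first because a model that has not learned the training data cannot meaningfully be called overfitted, whatever its testing error is.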
Suppose we have a machine learning problem statement where we need to build an ML model that predicts a vehicle's price from the vehicle's size. In the figure below, the red X marks represent the given data in a 2D plane. A quadratic function can fit the given data appropriately, but we want to fit a higher-degree polynomial on this data.
The overall objective of a machine learning algorithm is to reduce the error between predicted and actual values, as we always do in the case of supervised learning. Mathematically, we can represent our objective as minimizing the sum of squares of all the prediction errors, and the same can be defined via the cost function as:
$$J(\theta) = \min_{\theta} \sum_{i=1}^{m} \left(h_\theta(X_i) - Y_i\right)^2$$
hθ(Xi) represents the model's prediction on the i-th input sample, and we have m such samples. Suppose our model fits a higher-degree polynomial (4th order) to the given data. In that case, the training error will be lower than for the lower-degree polynomial (2nd order), as the model will try to fit a curve that passes through all the data samples.
But, as most of the samples came from a lower-order polynomial, the 4th-order polynomial will produce more errors on the test dataset. We can say that the 4th-order model is overfitting. A 2nd-degree polynomial fit was enough for this problem statement. So, we need a method that limits the model fitting to the 2nd degree. But how can we do that?
From the questions in the figure above, if we find some method to make the values of θ3 and θ4 extremely small (tending to zero) so that X³ and X⁴ do not contribute significantly to the final fitting, our job will be done. Right? But how?
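The overfitting described above can be reproduced with a small experiment. The sketch below is an illustration under assumed values: the synthetic data, the noise level, and the sample sizes are all made up, and NumPy's `polyfit` is used as a stand-in for the model's least-squares fitting.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n_samples):
    """Synthetic 'size -> price' data: a quadratic trend plus noise (assumed)."""
    x = rng.uniform(-1.0, 1.0, n_samples)
    y = 2.0 + 3.0 * x + 1.5 * x**2 + rng.normal(0.0, 0.3, n_samples)
    return x, y

x_train, y_train = make_data(8)      # small training set, easy to overfit
x_test, y_test = make_data(200)      # unseen data from the same process

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on the given samples."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (2, 4):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree {degree}: train MSE = {results[degree][0]:.4f}, "
          f"test MSE = {results[degree][1]:.4f}")
```

The 4th-order fit can never have a larger training error than the 2nd-order fit, since every quadratic is also a 4th-order polynomial whose two highest coefficients are zero. Its testing error, however, is often larger when the training set is small.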
This can be done by modifying our cost function definition per the new requirements and making contributions of X³ and X⁴ negligible. We will add extra components of 1000*θ3²+ 1000*θ4² like this:
$$J(\theta) = \min_{\theta} \sum_{i=1}^{m} \left(h_\theta(X_i) - Y_i\right)^2 + 1000\,\theta_3^2 + 1000\,\theta_4^2$$
Here, 1000 is just an arbitrarily large number. ML algorithms aim to find the minimum of the cost function, and to achieve the minimum of the modified cost function, the optimization algorithm will explore regions where θ3 and θ4 have very small values.
θ3 ≈ 0 and θ4 ≈ 0
If we convert our 4th-order polynomial to a 2nd-order polynomial, the problem of overfitting will be resolved. Simply put, our process penalized the irrelevant parameters that created problems.
f(x) = θ0 + θ1.X + θ2.X² + θ3.X³ + θ4.X⁴ ≈ θ0 + θ1.X + θ2.X²
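A minimal sketch of this idea follows, assuming synthetic quadratic data and a closed-form least-squares solution instead of an iterative optimizer. The data, the helper name `fit`, and the use of the normal equations are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 30)
y = 2.0 + 3.0 * x + 1.5 * x**2 + rng.normal(0.0, 0.1, 30)   # quadratic trend + noise

# Design matrix with columns 1, X, X^2, X^3, X^4, so theta[j] multiplies X^j.
X = np.vander(x, N=5, increasing=True)

def fit(penalty_weights):
    """Minimize sum((X @ theta - y)^2) + sum(penalty_weights * theta^2).

    Setting the gradient to zero gives (X^T X + diag(penalty)) theta = X^T y.
    """
    P = np.diag(penalty_weights)
    return np.linalg.solve(X.T @ X + P, X.T @ y)

theta_plain = fit(np.zeros(5))                                      # no penalty
theta_penalized = fit(np.array([0.0, 0.0, 0.0, 1000.0, 1000.0]))    # penalize theta3, theta4

print("unpenalized:", np.round(theta_plain, 3))
print("penalized:  ", np.round(theta_penalized, 3))
```

With the penalty in place, θ3 and θ4 are pushed very close to zero, while θ0, θ1, and θ2 remain free to fit the quadratic trend.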
From the previous example, two main objectives were fulfilled after penalizing the irrelevant parameters,
Regularization is the concept used to fulfill these two objectives. Suppose there are a total of n features in the data, so the ML model will learn a total of n + 1 parameters (one weight for each of the n features and 1 bias), i.e.
Features = X1, X2, X3, X4, ...., Xn
Parameters = θ0, θ1, θ2, θ3, ...., θn
If we knew which features were irrelevant, we could simply penalize the corresponding parameters, and overfitting would be reduced. However, finding that set of irrelevant features is not simple, and we might end up penalizing some important features if we do it manually.
To overcome this problem, penalization is applied to every parameter involved in learning. If we do so, the new cost function will become:
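$$J(\theta) = \frac{1}{2m}\Bigg[\underbrace{\sum_{i=1}^{m}\left(h_\theta(X_i) - Y_i\right)^2}_{\text{First Term}} + \underbrace{\lambda \sum_{j=1}^{n} \theta_j^2}_{\text{Second Term}}\Bigg]$$
Here, m is the number of samples, n is the number of features, and λ is the regularization parameter.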
This new cost function controls the tradeoff between two goals:
One question that must come to our mind now: What if we set a huge value of λ in our cost function, say 10¹⁰?
To understand this, let's take the same curve-fitting example where our model fitted a 4th-order polynomial:
f(x) = θ0 + θ1.X + θ2.X² + θ3.X³ + θ4.X⁴
If we make the λ value very high, then every parameter will become very small, i.e.,
θ1 ≈ θ2 ≈ θ3 ≈ θ4 ≈ 0
Our ML model would then fit a constant (horizontal) line, f(x) = θ0, a perfect case of underfitting. Hence, we need to choose the value of λ wisely.
To reach the minimum of the cost function faster, we use optimization algorithms that guide the parameter update steps. We repeatedly update the values of the parameters and check the value of the cost function. Let's use the gradient descent algorithm and see how the regularization parameter works.
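The effect of λ can be checked numerically. The sketch below assumes synthetic quadratic data, a 4th-order polynomial model, a cost of the form (1/2m)[squared errors + λ Σ θj²] with the bias θ0 left unpenalized, and a closed-form minimizer; the helper names `regularized_cost` and `fit` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 30)
y = 2.0 + 3.0 * x + 1.5 * x**2 + rng.normal(0.0, 0.1, 30)
X = np.vander(x, N=5, increasing=True)    # columns: 1, X, X^2, X^3, X^4
m = len(y)

def regularized_cost(theta, lam):
    """J(theta) = 1/(2m) * [sum of squared errors + lam * sum_{j>=1} theta_j^2]."""
    errors = X @ theta - y
    return (errors @ errors + lam * np.sum(theta[1:] ** 2)) / (2 * m)

def fit(lam):
    """Closed-form minimizer of the cost above; the bias theta_0 is not penalized."""
    D = np.eye(5)
    D[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * D, X.T @ y)

for lam in (0.0, 1.0, 1e10):
    theta = fit(lam)
    print(f"lambda = {lam:g}: theta = {np.round(theta, 4)}, "
          f"J = {regularized_cost(theta, lam):.4f}")
```

For λ = 10¹⁰, the weights θ1 to θ4 collapse to practically zero and θ0 settles at the mean of the targets, so the model predicts the same value for every input.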
The pseudocode for the gradient descent algorithm on the updated cost function will look like this:
To observe closely, let's rearrange the second term:
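Written out for the regularized cost function, the gradient descent update can be sketched as follows. This assumes a hypothesis that is linear in its parameters, the convention X_{i0} = 1 for the bias term, and that the bias θ0 is not penalized; these conventions are assumptions and may differ from the original pseudocode. Repeat until convergence:

$$
\begin{aligned}
\theta_0 &:= \theta_0 - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(X_i) - Y_i\right) X_{i0} \\
\theta_j &:= \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(X_i) - Y_i\right) X_{ij} + \frac{\lambda}{m}\,\theta_j \right], \quad j = 1, \dots, n
\end{aligned}
$$

Collecting the θj terms in the second update gives the rearranged form:

$$\theta_j := \theta_j \left(1 - \alpha \frac{\lambda}{m}\right) - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(X_i) - Y_i\right) X_{ij}$$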
α (learning rate), λ, and m are all positive. We know that the learning rate cannot be too high, as that increases the chances of missing the global minimum, and typical values are 0.01 or 0.001. λ also cannot be very high, as that would cause underfitting. In general, the number of samples (m) is very high in machine learning. Hence, the factor (1 − α*λ/m) is slightly less than 1, and by observing the term θj (1 − α*λ/m), we can say that, in every iteration, θj is shrunk a little, and we can infer that λ affects the parameter update.
Every θ in the cost function corresponds to one feature or one dimension of the data. If we increase the weightage of one feature by increasing the value of θ corresponding to that feature, model predictions will move toward that feature. As the regularization parameter (λ) affects the cost function and the update of parameters, it penalizes those parameters that drive the model in the wrong direction. Eventually, those θj values are reduced more, and the dimensions corresponding to them automatically become less significant. In this way, regularization solves the problem of overfitting.
Now that we have learned how regularization solves the problem of overfitting, let's see some of its most popular variations.
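The shrinking effect of the (1 − α*λ/m) factor can be seen in a small experiment. The following is a minimal sketch of batch gradient descent for linear regression with this factor applied to every weight except the bias; the synthetic data, the function name `gradient_descent`, and the chosen values of α, λ, and the iteration count are assumptions for illustration.

```python
import numpy as np

def gradient_descent(X, y, lam, alpha=0.01, n_iters=5000):
    """Batch gradient descent for linear regression with an L2 penalty.

    X is assumed to already contain a leading column of ones (bias feature).
    """
    m, n_params = X.shape
    theta = np.zeros(n_params)
    shrink = 1.0 - alpha * lam / m            # the (1 - alpha*lambda/m) factor
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / m      # gradient of the squared-error term
        new_theta = theta * shrink - alpha * grad
        new_theta[0] = theta[0] - alpha * grad[0]   # bias theta_0 is not shrunk
        theta = new_theta
    return theta

rng = np.random.default_rng(0)
m = 200
features = rng.normal(size=(m, 3))
X = np.column_stack([np.ones(m), features])
# The third feature is irrelevant to y by construction.
y = 1.0 + 2.0 * features[:, 0] - 1.0 * features[:, 1] + rng.normal(0.0, 0.1, m)

theta_no_reg = gradient_descent(X, y, lam=0.0)
theta_reg = gradient_descent(X, y, lam=50.0)
print("lambda = 0: ", np.round(theta_no_reg, 3))
print("lambda = 50:", np.round(theta_reg, 3))
```

With λ = 50 and m = 200, the factor is 0.9975 per iteration, which looks negligible but compounds over thousands of iterations into visibly smaller weights.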
In this regularization technique, we add the "absolute value of magnitude" of the coefficient as a penalty term in the cost function.
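$$J(\theta) = \frac{1}{2m}\Bigg[\sum_{i=1}^{m}\left(h_\theta(X_i) - Y_i\right)^2 + \underbrace{\lambda \sum_{j=1}^{n} \lvert\theta_j\rvert}_{\text{L1 penalty}}\Bigg]$$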
L1 regularization is also known as LASSO (Least Absolute Shrinkage and Selection Operator) regression. Lasso shrinks the coefficients of irrelevant or less important features to zero and eventually helps in feature selection.
In this regularization technique, we add the "squared magnitude" of the coefficient as a penalty term in the cost function.
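$$J(\theta) = \frac{1}{2m}\Bigg[\sum_{i=1}^{m}\left(h_\theta(X_i) - Y_i\right)^2 + \underbrace{\lambda \sum_{j=1}^{n} \theta_j^2}_{\text{L2 penalty}}\Bigg]$$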
Regression models that use L2 regularization are also known as Ridge regression.
Which of the two should we use? The answer depends upon our requirements. Both methods help reduce the effect of irrelevant features, but in different ways.
L1 Regularization reduces the effect of irrelevant features by shrinking their coefficients to zero. If the coefficient is zero, that feature is effectively left out of the model. This can be helpful when we have constraints on the number of features that can be used to build the machine learning model. In simple terms, L1 regularization techniques are widely used for feature selection.
L2 Regularization reduces the effect of the irrelevant features by constraining the magnitudes of their coefficients. It keeps all the features in the model, which is different from making a coefficient precisely 0 and removing the feature completely. This can be useful when we have fewer features and want to retain all of them but also want to cure overfitting.
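To see why L1 can set a coefficient exactly to zero while L2 only shrinks it, consider a deliberately simplified case with a single parameter θ and a single target value z, so that the two costs become ½[(θ − z)² + λ|θ|] and ½[(θ − z)² + λθ²]. This one-dimensional setup and the function names below are assumptions for illustration; in this case the minimizers have simple closed forms (soft-thresholding for L1, proportional shrinkage for L2).

```python
def l1_solution(z, lam):
    """Minimizer of 0.5 * ((theta - z)**2 + lam * abs(theta)): soft-thresholding."""
    threshold = lam / 2.0
    if z > threshold:
        return z - threshold
    if z < -threshold:
        return z + threshold
    return 0.0    # small coefficients are set exactly to zero

def l2_solution(z, lam):
    """Minimizer of 0.5 * ((theta - z)**2 + lam * theta**2): proportional shrinkage."""
    return z / (1.0 + lam)

lam = 1.0
for z in (3.0, 1.0, 0.3, -0.2):
    print(f"z = {z:+.1f}: L1 -> {l1_solution(z, lam):+.3f}, "
          f"L2 -> {l2_solution(z, lam):+.3f}")
```

Whenever |z| is at most λ/2, the L1 solution is exactly zero, which is the feature-selection behavior described above. The L2 solution is scaled down by the same factor for every z and reaches zero only when z itself is zero.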
This is one of the hottest topics in machine learning, on which you will be questioned in machine learning interviews. Some possible ones are:
Overfitting is one of the most commonly encountered problems in the machine learning industry, and regularization techniques are used to cure it. But as this is a cure for a problem, we first need to find out whether our model is suffering from it. In this article, we have discussed how to identify whether a model is overfitting and how regularization helps fix it. This is one of the hottest topics in machine learning interviews, and we hope it is thoroughly covered here.
Enjoy Learning, Enjoy Algorithms!