Regularization in Machine Learning

Introduction

In machine learning, we generally split the available data into three sets: a train set, a validation set, and a test set. ML algorithms learn the patterns present in the train set, and the validation set helps us tune that learning. We expect our ML models to generalize what they have learned to the test set. But what if a model learns the complexities present in the train data and still fails to generalize to unseen test data?

This is one of the main concerns of almost every machine learning professional, and the problem exists in nearly every tech sector that uses machine learning. The term for this problem is overfitting, and regularization is one fix for it. In this article, we will learn the mathematical aspects of regularization and see how it fixes the overfitting problem.

Note: This topic is extremely important, as interviewers love to ask questions about it.

Key takeaways from this blog

After going through this blog, we will be able to understand the following things:

What are Underfitting, Overfitting, and Accurate fitting?
What is Regularization?
How does regularization cure overfitting?
Mathematical logic behind regularization.
What are L1 and L2 Regularization?

To cure any problem, we first need to identify whether or not our machine-learning models are suffering from that problem. So, let's first understand how to identify the issues in ML models.

States of Machine Learning Models

After training a machine learning model on the train set, we evaluate its performance on both the train and test datasets. The total errors it makes on the train and test data are called the training error and the testing error, respectively. To compare these errors, we need a reference error. This reference comes from the labeled dataset: the labels are treated as the ground truth, so the reference is zero in most cases and is hence called the zero error.

Based on these training and testing errors, an ML model can be in one of these three states:

Underfitting
Overfitting
Appropriate fitting

Let's understand each of them and a method to detect these states.

What is Underfitting, and how to detect it?

In this state, the training error is huge with respect to the zero error. We can interpret this as the model not learning the patterns in the dataset, i.e., it is "underfitting".

What is underfitting and what are the possible cures for it?

Possible reasons for Underfitting

An underfitting model fails to learn the patterns present in the dataset. The possible reasons could be:

The dataset has no pattern: Input features and the target variable are highly uncorrelated. If there is no pattern in the data, the ML model cannot find one.
The pattern is too complex for the model to learn: The ML algorithm we are using is not powerful enough to capture the pattern.

Possible cure for Underfitting

In the case of Underfitting, increasing data samples does not help. We can use either of the two methods:

Feature addition, extraction, and engineering: If we engineer some features by transforming existing ones or extract additional features from the available dataset, we provide more information to the ML algorithm, making it more capable of finding patterns.
Increase the complexity of the ML Algorithm: The data may be too complex for one type of ML algorithm, which then fails to find the pattern. In that case, we can move to more complex algorithms that are more capable of learning patterns. For example, in the case of neural networks, the complexity of the model increases with the addition of hidden layers.
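To make the feature-engineering cure concrete, here is a minimal sketch. The synthetic data, the helper name training_mse, and the choice of ordinary least squares are all assumptions made for illustration. A straight line underfits data that follows a curve, and adding an engineered x² feature lets the same linear algorithm capture the pattern.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): the target depends on x squared.
x = rng.uniform(-3, 3, size=100)
y = 2.0 * x**2 + 1.0 + rng.normal(0, 0.5, size=100)

def training_mse(features, targets):
    """Fit ordinary least squares and return the mean squared training error."""
    theta, *_ = np.linalg.lstsq(features, targets, rcond=None)
    residuals = features @ theta - targets
    return float(np.mean(residuals**2))

# Model 1: bias + x only. A straight line cannot follow the curve.
X_linear = np.column_stack([np.ones_like(x), x])
# Model 2: bias + x + the engineered feature x^2.
X_engineered = np.column_stack([np.ones_like(x), x, x**2])

mse_linear = training_mse(X_linear, y)
mse_engineered = training_mse(X_engineered, y)
print(f"training MSE without x^2: {mse_linear:.3f}")
print(f"training MSE with x^2:    {mse_engineered:.3f}")
```

The training error of the second model can never be higher than that of the first, because the first model is a special case of the second (with the x² weight set to zero).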

What is Overfitting, and how to detect it?

Here, the training error is small with respect to the zero error (it stays close to it), but the testing error is huge. The model learns patterns from the train data but fails to generalize these learnings to unseen datasets. Models with overfitting problems are of little use, as real-life data will always differ from the train set.

What is overfitting and how to cure it?

Possible reasons for Overfitting are:

Too much learning: The ML model is learning everything present in the data, even the noise. Data that deviates even slightly from the train set then throws the learned hypothesis off, and the model fails to generalize.
The model is too strong for the given problem: ML algorithms with higher complexity are not preferred for solving simple problems, because performance is not the only criterion that matters for ML models. The resources a model uses to attain that performance also play a crucial role. Stronger ML algorithms try to learn more patterns, which may be unnecessary.

Possible Cures of Overfitting

In the case of Overfitting, we can use either of the two methods:

More Data: Increasing the amount of data can bring diversity to the data samples, and the model will learn to generalize. Sometimes this is impractical, as data collection and labeling require human resources and additional costs.
Reduce model complexity: We can reduce the model complexity, since the model was learning too much from the dataset. Regularization is one technique that helps reduce the model's complexity, and we will discuss it in detail.

What is Appropriate Fitting, and how can we sense it?

This is not a problem but the best state an ML model can be in. Here, the training error is slightly higher than in the overfitting scenario, but the testing error drops significantly. In simple terms, the model starts to generalize its learning to unseen datasets.

What is appropriate fitting and how to detect it?
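The three states can be summarized as a small rule of thumb. The sketch below is only an illustration: the function name diagnose_fit and the tolerance threshold are hypothetical, and what counts as a "huge" error depends on the problem and on the error metric.

```python
def diagnose_fit(train_error, test_error, reference_error=0.0, tolerance=0.1):
    """Label a model's state from its training and testing errors.

    `tolerance` is a hypothetical threshold for "close enough"; in practice
    it depends on the problem and on the scale of the error metric.
    """
    if train_error - reference_error > tolerance:
        return "underfitting"        # cannot even fit the training data
    if test_error - train_error > tolerance:
        return "overfitting"         # fits the training data, fails on unseen data
    return "appropriate fitting"     # both errors are low and close together

print(diagnose_fit(train_error=0.90, test_error=0.95))
print(diagnose_fit(train_error=0.02, test_error=0.80))
print(diagnose_fit(train_error=0.05, test_error=0.08))
```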

Example to understand the Overfitting and the method to solve it

Suppose we have a machine learning problem statement where we need to build an ML model that predicts a vehicle's price from the vehicle's size. In the figure below, the red X marks represent the given data in a 2D plane. A quadratic function can fit the given data appropriately, but suppose we want to fit a higher-degree polynomial to this data.

Fitting 4th order and second order polynomial on the same data samples

The overall objective of a machine learning algorithm is to reduce the error between predicted and actual values, as we always do in the case of supervised learning. Mathematically, we can represent our objective as minimizing the sum of squares of all the prediction errors, and the same can be defined via the cost function as:

$$
J(\theta) = \min_{\theta} \sum_{i=1}^{m} \left(h_\theta(X_i) - Y_i\right)^2
$$

hθ(Xi) represents the prediction of the ML model on the ith input sample, and we have m such samples. Suppose our model fits a higher-degree polynomial (4th order) to the given data. In that case, the training error will be lower than that of the lower-degree polynomial (2nd order), as the model will try to bend the curve so that all the data samples fall on it.

But, as most of the samples were from a lower-order polynomial, the 4th-order polynomial will produce more errors for the test dataset. We can say that the 4th-order model is Overfitting. 2nd-degree polynomial fitting was enough for this problem statement. So, we need a method limiting the model fitting to 2nd degree. But how can we do that?
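We can try this experiment ourselves. The sketch below assumes synthetic "size to price" data generated from a quadratic plus noise (the coefficients and sample counts are made up) and uses NumPy's polyfit to fit both polynomials. The training error of the 4th-order fit can never exceed that of the 2nd-order fit. On small noisy samples its test error is often higher, although this depends on the particular draw.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_data(n_samples):
    """Synthetic 'size -> price' data drawn from a quadratic plus noise (assumed)."""
    size = rng.uniform(0, 3, size=n_samples)
    price = 1.0 + 2.0 * size + 1.5 * size**2 + rng.normal(0, 1.0, size=n_samples)
    return size, price

def mse(coeffs, x, y):
    """Squared prediction error from the cost function, averaged over the samples."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

x_train, y_train = make_data(8)      # few samples, so a flexible model can chase noise
x_test, y_test = make_data(200)

errors = {}
for degree in (2, 4):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree {degree}: train MSE = {errors[degree][0]:.3f}, "
          f"test MSE = {errors[degree][1]:.3f}")
```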

Looking at the two polynomials in the figure above, if we find some method to make the values of θ3 and θ4 extremely small (tending to zero) so that X³ and X⁴ do not contribute significantly to the final fit, our job will be done. Right? But how?

This can be done by modifying our cost function definition per the new requirements and making contributions of X³ and X⁴ negligible. We will add extra components of 1000*θ3²+ 1000*θ4² like this:

$$
J(\theta) = \min_{\theta} \sum_{i=1}^{m} \left(h_\theta(X_i) - Y_i\right)^2 + 1000\,\theta_3^2 + 1000\,\theta_4^2
$$

1000 is just an arbitrary large number. ML algorithms aim to find the minimum of the cost function, and to reach the minimum of this modified cost function, the optimization algorithm will explore regions where θ3 and θ4 have very small values.

θ3 ≈ 0 and θ4 ≈ 0
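Here is a minimal sketch of this modified cost function in action. It assumes synthetic quadratic data and, instead of an iterative optimizer, solves the penalized least-squares problem in closed form (a swapped-in technique chosen to keep the example short). The helper name fit and the data are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic quadratic data (assumed for illustration).
x = rng.uniform(-2, 2, size=10)
y = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(0, 1.0, size=10)

# Design matrix with columns 1, X, X^2, X^3, X^4, so theta = [θ0, θ1, θ2, θ3, θ4].
X = np.vander(x, N=5, increasing=True)

def fit(penalty_weights):
    """Minimize sum((X @ theta - y)^2) + sum(penalty_weights * theta^2) in closed form."""
    P = np.diag(penalty_weights)
    return np.linalg.solve(X.T @ X + P, X.T @ y)

theta_plain = fit(np.zeros(5))                                # ordinary 4th-order fit
theta_penalized = fit(np.array([0.0, 0.0, 0.0, 1000.0, 1000.0]))  # + 1000*θ3² + 1000*θ4²

print("without penalty:", np.round(theta_plain, 3))
print("with penalty:   ", np.round(theta_penalized, 3))
```

Comparing the last two entries of each printed vector shows how strongly the penalty pushes θ3 and θ4 toward zero, while θ0, θ1, and θ2 remain free to fit the data.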

With θ3 and θ4 close to zero, our 4th-order polynomial effectively becomes a 2nd-order polynomial, and the problem of Overfitting is resolved. Simply put, our process penalized the irrelevant parameters that created problems.

f(x) = θ0 + θ1.X + θ2.X² + θ3.X³ + θ4.X⁴ ≈ θ0 + θ1.X + θ2.X²

Observations as Objective for Regularization

From the previous example, two main objectives were fulfilled after penalizing the irrelevant parameters:

The order of the curve to be fitted became simpler. This is similar to making the ML algorithm weaker.
Overfitting is reduced, which can be confirmed by a decrease in the test error.

Regularization is the concept used to fulfill these two objectives. Suppose there are a total of n features in the data. The ML model will then learn a total of n + 1 parameters (one weight for each of the n features and 1 bias), i.e.

Features = X1, X2, X3, X4, ...., Xn

Parameters = θ0, θ1, θ2, θ3, ...., θn

If we knew which features were irrelevant, we could simply penalize the corresponding parameters, and Overfitting would be reduced. However, finding that set of irrelevant features is not simple, and we might end up penalizing some important features if we do it manually.

To overcome this problem, penalization is applied to every parameter involved in learning. If we do so, the new cost function will become:

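$$
J(\theta) = \frac{1}{2m}\left[\underbrace{\sum_{i=1}^{m}\left(h_\theta(X_i) - Y_i\right)^2}_{\text{First term}} + \underbrace{\lambda\sum_{j=1}^{n}\theta_j^2}_{\text{Second term}}\right]
$$

Here, m is the number of training samples and n is the number of features.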
This new cost function controls the tradeoff between two goals:

Fitting the training set well, which is ensured by the first term in the formula.
Keeping the parameters small so that the hypothesis becomes simpler, which is ensured by the second term (the regularization term). λ is the regularization parameter, and in the example above, we arbitrarily chose its value as 1000.
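The two terms can be written as a small function. This is a sketch under stated assumptions: the name regularized_cost is made up, m denotes the number of training samples, and the bias θ0 is left out of the penalty (a common convention, and consistent with the later example where only θ0 survives a huge λ).

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J(θ) = (1 / 2m) * [ sum((Xθ - y)^2) + λ * sum(θ_j^2 for j >= 1) ].

    X is an (m, n + 1) matrix whose first column is all ones (the bias input).
    Assumption: the bias θ0 is left out of the penalty.
    """
    m = len(y)
    fit_term = np.sum((X @ theta - y) ** 2)        # first term: fit the training set
    penalty_term = lam * np.sum(theta[1:] ** 2)    # second term: keep parameters small
    return (fit_term + penalty_term) / (2 * m)

# Tiny hand-made example (values are illustrative only).
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = np.array([0.0, 2.0])                       # fits y = 2x exactly

cost_without_penalty = regularized_cost(theta, X, y, lam=0.0)
cost_with_penalty = regularized_cost(theta, X, y, lam=3.0)
print(cost_without_penalty, cost_with_penalty)
```

In this example the fit is perfect, so the first term is zero and the whole cost at λ = 3 comes from the penalty: 3 * 2² / (2 * 3) = 2.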

How will the value of the regularization parameter λ affect model fitting?

One question that must come to our mind now: What if we set a huge value of λ in our cost function, say 10¹⁰?

To understand this, let's take the same curve-fitting example where our model fitted a 4th-order polynomial:

f(x) = θ0 + θ1.X + θ2.X² + θ3.X³ + θ4.X⁴

If we make the λ value very high, then every parameter will become very small, i.e.,

θ1 ≈ θ2 ≈ θ3 ≈ θ4 ≈ 0

Our ML model would then fit a flat horizontal line, f(x) ≈ θ0, which is a perfect case of Underfitting. Hence, we need to choose the value of λ wisely.

For a larger value of the regularization parameter, the model underfits
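The effect of λ can be checked numerically. The sketch below assumes synthetic quadratic data and a closed-form solver that penalizes θ1 to θ4 but not the bias θ0 (both are assumptions made for illustration). As λ grows, the coefficients shrink and the training error rises, and at λ = 10¹⁰ only θ0 is left.

```python
import numpy as np

rng = np.random.default_rng(7)

x = rng.uniform(-2, 2, size=30)
y = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(0, 1.0, size=30)
X = np.vander(x, N=5, increasing=True)           # columns: 1, X, X^2, X^3, X^4

# Penalize θ1..θ4 but not the bias θ0 (an assumed convention).
D = np.diag([0.0, 1.0, 1.0, 1.0, 1.0])

def ridge_fit(lam):
    """Minimize sum((X @ theta - y)^2) + lam * sum(theta[1:]^2) in closed form."""
    return np.linalg.solve(X.T @ X + lam * D, X.T @ y)

results = {}
for lam in (0.0, 1.0, 100.0, 1e10):
    theta = ridge_fit(lam)
    train_mse = float(np.mean((X @ theta - y) ** 2))
    results[lam] = (theta, train_mse)
    print(f"lambda = {lam:g}: theta = {np.round(theta, 3)}, train MSE = {train_mse:.3f}")
```

With the huge λ, the fitted θ0 is simply the mean of the target values, which is the best a constant prediction can do.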

How does regularization help?

To reach the minimum of the cost function faster, we use optimization algorithms that guide the parameter update steps. We repeatedly update the values of the parameters and check the value of the cost function. Let's use the Gradient Descent algorithm and see how the regularization parameter works.

The pseudocode for the gradient descent algorithm on the updated cost function will look like this:

Gradient descent algorithm in addition to regularization term
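The original figure is not reproduced here. As a reconstruction, assuming the regularized cost function with the squared penalty, m training samples, and X_{i,j} denoting the jth feature of the ith sample, the standard gradient descent update would read:

$$
\text{repeat until convergence:} \quad \theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(X_i) - Y_i\right)X_{i,j} + \frac{\lambda}{m}\,\theta_j\right]
$$

The first part inside the brackets is the gradient of the fitting term, and the second part, (λ/m) θj, is the gradient of the regularization term.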

To observe closely, let's rearrange the second term:

Rearranging the terms in the gradient descent with regularization
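This figure is not reproduced here either. Assuming the update rule θj := θj − α[(1/m) Σ (hθ(Xi) − Yi) X_{i,j} + (λ/m) θj], collecting the two θj terms gives the following reconstruction:

$$
\theta_j := \theta_j\left(1 - \alpha\,\frac{\lambda}{m}\right) - \frac{\alpha}{m}\sum_{i=1}^{m}\left(h_\theta(X_i) - Y_i\right)X_{i,j}
$$

The second part is the usual unregularized gradient step, so the only change regularization makes is the multiplication of θj by the factor (1 − αλ/m).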

α (learning rate), λ, and m are all positive. The learning rate cannot be too high, as that increases the chances of missing the global minimum; typical values are 0.01 or 0.001. λ also cannot be very high, as it will cause Underfitting. In general, the number of samples (m) is very high in machine learning. Hence, the factor (1 - α*λ/m) is slightly smaller than 1, and by observing the term θj (1 - α*λ/m), we can say that θj is shrunk a little in every iteration. We can infer that λ directly affects the parameter updates.

Every θ in the cost function corresponds to one feature or one dimension of the data. If we increase the weight of one feature by increasing the value of the corresponding θ, model predictions will lean more on that feature. As the regularization parameter (λ) affects the cost function and the parameter updates, it penalizes those parameters that drive the model in the wrong direction. Eventually, those θj values shrink the most, and the dimensions corresponding to them automatically become less significant. In this way, regularization solves the problem of Overfitting.
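The update with the shrink factor θj (1 - α*λ/m) can be sketched in a few lines. Everything here is an assumption made for illustration: the synthetic data, the function name gradient_descent_l2, the hyperparameter values, and the choice not to shrink the bias θ0. With α = 0.1, λ = 100, and m = 200, the factor is 1 - 0.1 * 100 / 200 = 0.95.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data (assumed): feature 0 drives the target, feature 1 is pure noise.
m = 200
features = rng.normal(size=(m, 2))
y = 4.0 * features[:, 0] + rng.normal(0, 0.5, size=m)
X = np.column_stack([np.ones(m), features])      # prepend the bias column

def gradient_descent_l2(X, y, lam, alpha=0.1, n_iters=2000):
    """Gradient descent on (1/2m) * [sum((Xθ - y)^2) + λ * sum(θ_j^2, j >= 1)]."""
    m, n_params = X.shape
    theta = np.zeros(n_params)
    shrink = np.ones(n_params)
    shrink[1:] = 1.0 - alpha * lam / m           # shrink factor; the bias is not shrunk
    for _ in range(n_iters):
        gradient = X.T @ (X @ theta - y) / m     # gradient of the fit term only
        theta = theta * shrink - alpha * gradient
    return theta

theta_no_reg = gradient_descent_l2(X, y, lam=0.0)
theta_reg = gradient_descent_l2(X, y, lam=100.0)
print("lambda = 0:  ", np.round(theta_no_reg, 3))
print("lambda = 100:", np.round(theta_reg, 3))
```

The regularized weights come out smaller in magnitude than the unregularized ones, because every iteration multiplies them by 0.95 before the usual gradient step.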

Now that we have learned how regularization solves the problem of overfitting, let's see some of its most popular variations.

The most common regularization methods used in Machine Learning

L1-Regularization or LASSO

In this regularization technique, we add the "absolute value of magnitude" of the coefficient as a penalty term in the cost function.

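$$
J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(X_i) - Y_i\right)^2 + \underbrace{\lambda\sum_{j=1}^{n}\left|\theta_j\right|}_{\text{L1 penalty term}}\right]
$$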

L1 regularization is also known as LASSO (Least Absolute Shrinkage and Selection Operator) regression. Lasso shrinks the coefficients of irrelevant or less important features to zero and thereby helps in feature selection.
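To see the coefficients of irrelevant features become exactly zero, here is a minimal sketch of L1-regularized regression solved by coordinate descent with soft-thresholding. This is one common way to solve the Lasso problem, not necessarily what any particular library does. The synthetic data, function names, and λ value are assumptions, and no bias term is fitted to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic data (assumed): only feature 0 matters; features 1 and 2 are irrelevant.
m = 200
X = rng.normal(size=(m, 3))
y = 3.0 * X[:, 0] + rng.normal(0, 0.5, size=m)

def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold`, clipping to exactly zero."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimize (1/2m) * [sum((Xθ - y)^2) + λ * sum(|θ_j|)] one coordinate at a time.

    No bias term is fitted here, to keep the sketch short.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_sweeps):
        for j in range(X.shape[1]):
            # Residual with feature j's current contribution removed.
            partial_residual = y - X @ theta + X[:, j] * theta[j]
            rho = X[:, j] @ partial_residual
            theta[j] = soft_threshold(rho, lam / 2.0) / (X[:, j] @ X[:, j])
    return theta

theta_lasso = lasso_coordinate_descent(X, y, lam=100.0)
print("L1 coefficients:", theta_lasso)
```

The soft-threshold step is what produces exact zeros: whenever a feature's correlation with the remaining residual falls below λ/2, its coefficient is set to zero rather than to a small value.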

L2-Regularization or Ridge Regression

In this regularization technique, we add the "squared magnitude" of the coefficient as a penalty term in the cost function.

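$$
J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_\theta(X_i) - Y_i\right)^2 + \underbrace{\lambda\sum_{j=1}^{n}\theta_j^2}_{\text{L2 penalty term}}\right]
$$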

Regression models that use L2-regularization are also known as Ridge regression.
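For a linear model, the L2-regularized cost function has a closed-form minimizer, θ = (XᵀX + λI)⁻¹ Xᵀy. The sketch below uses it on synthetic data in which only one of three features matters. The data, the function name ridge_closed_form, and the λ value are assumptions, and no bias term is fitted to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic data (assumed): only feature 0 matters; features 1 and 2 are irrelevant.
m = 200
X = rng.normal(size=(m, 3))
y = 3.0 * X[:, 0] + rng.normal(0, 0.5, size=m)

def ridge_closed_form(X, y, lam):
    """Minimize (1/2m) * [sum((Xθ - y)^2) + λ * sum(θ_j^2)] via the normal equations.

    No bias term is fitted here, to keep the sketch short.
    """
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

theta_ols = ridge_closed_form(X, y, lam=0.0)
theta_ridge = ridge_closed_form(X, y, lam=100.0)
print("lambda = 0:  ", theta_ols)
print("lambda = 100:", theta_ridge)
```

The coefficients shrink as λ grows, but in general none of them lands exactly on zero, so every feature stays in the model.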

How do we decide between L1 (LASSO) and L2 (Ridge)?

The answer to this question depends upon our requirements. Both methods help reduce the effect of irrelevant features, but they do so in different ways.

L1 Regularization reduces the effect of irrelevant features by shrinking their coefficients to zero. If a coefficient is zero, that feature is effectively left out of the model. This can be helpful when we have constraints on the number of features that can be used to build the machine learning model. In simple terms, L1 regularization techniques are widely used for feature selection.

L2 Regularization reduces the effect of irrelevant features by constraining the magnitudes of their coefficients. It keeps all the features in the model: the coefficients become small but not precisely 0, so no feature is removed completely. This can be useful when we have fewer features and want to retain all of them but also want to cure overfitting.

Possible interview questions on regularization

This is one of the hottest topics in machine learning, on which you will be questioned in machine learning interviews. Some possible ones are:

What is Underfitting or Overfitting?
How will you decide whether your model is Underfitting or Overfitting?
How will you cure the problem of Underfitting?
How will you fix the problem of Overfitting?
What are the regularization techniques used in ML?
What's the difference between L1 and L2 regularization techniques?

Conclusion

Overfitting is one of the most commonly encountered problems in the machine learning industry, and regularization techniques are used to cure it. But as this is a cure for a problem, we first need to find out whether our model is suffering from it. In this article, we have discussed methods to identify whether a model is overfitting and how regularization helps fix it. This is one of the hottest topics in machine learning interviews, and we hope we have covered it thoroughly here.

Enjoy Learning, Enjoy Algorithms!
