Regularization: A Fix to Overfitting in Machine Learning

Introduction

In Machine Learning, we work with train, validation, and test sets. We use the train set to teach the model the patterns present in the data and expect it to generalize that learning to the test set. But there are scenarios where the model learns everything in the training set, even its quirks, yet fails to generalize to the test data. This is what makes Machine Learning challenging, and practitioners run into it constantly. Regularization is a cure for this scenario.

This topic is also important for interviews: questions on regularization come up in almost every machine learning interview. In this article, we will try to understand Regularization thoroughly by looking at how it works.

Key takeaways from this blog

  • What are Underfitting, Overfitting, and Accurate fitting?
  • What is Regularization?
  • How does regularization cure overfitting?
  • Mathematical logic behind Regularization.
  • What are L1 and L2 Regularization?

Regularization is a cure; to understand the cure, let's first understand the problem. We will use the terms below frequently in this discussion, so let's define them:

  • Underfitting
  • Overfitting
  • Appropriate fitting

We split our dataset into two sets: train data and test data. We train a model on the train data and then check how well the learning holds up on the test data. The training error is the error made by the model on the train data, and similarly, the test error is the error made on the test data. As a reference point, we take the error of a model that predicts every label perfectly, which is zero. Let's call it the zero error.
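The sketch below shows how these error gaps can be measured in practice, assuming a simple regression setup in Python with scikit-learn; the synthetic data and the linear model are only illustrative placeholders, not part of the original example.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))                        # illustrative input feature
y = 3.0 * X[:, 0] + rng.normal(0, 1.0, size=200)             # noisy target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_train, y_train)

train_error = mean_squared_error(y_train, model.predict(X_train))   # gap vs. the zero error
test_error = mean_squared_error(y_test, model.predict(X_test))      # gap vs. the training error
print(f"train MSE = {train_error:.3f}, test MSE = {test_error:.3f}")

Comparing these two numbers against each other (and against the zero-error reference) is exactly how the gaps described in the next sections are diagnosed.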

What is Underfitting, and how to detect it?

Here, the machine learning model performs poorly on the training data itself. To check whether a model is underfitting, we look at the gap between the zero error and the training error. If this gap is enormous, the model has not learned the patterns in the training dataset, and we say the model is underfitting.

Figure: What is underfitting and what are the possible cures for it?

Possible reasons for Underfitting can be:

The ML model is not able to learn the patterns well enough. The reasons could be:

  • The input features and the target variable are highly uncorrelated, so finding the relationship between input and output is extremely difficult.
  • The function between the input and output variables that the machine is trying to learn is too complex for the chosen model to capture.

Possible cure for Underfitting

In the case of Underfitting, increasing the number of data samples does not help. We can use either of the two methods:

  • Add, extract, or engineer features so the model has enough signal to learn from.
  • Increase the complexity of the model; for example, in the case of neural networks, model complexity increases as we add layers (see the sketch after this list).
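As a rough illustration of the second cure, the sketch below (assuming Python with scikit-learn and a made-up quadratic target) shows how adding polynomial features lets an otherwise underfitting linear model capture the pattern:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(150, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.2, size=150)              # quadratic relationship

simple = LinearRegression().fit(X, y)                        # a straight line: underfits
richer = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print("linear model    train MSE:", round(mean_squared_error(y, simple.predict(X)), 3))
print("with x² feature train MSE:", round(mean_squared_error(y, richer.predict(X)), 3))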

What is Overfitting, and how to detect it?

Overfitting is a problem in machine learning where the model performs well on training data but poorly on test data. Here, the gap between the zero error and the training error is small, but the gap between the training and test errors becomes enormous.

Figure: What is overfitting and how to cure it?

Possible reasons for Overfitting are:

  • The ML model is learning everything present in the data, even the noise.
  • The model we have chosen is unnecessarily powerful for the problem.

Possible Cures of Overfitting

In the case of Overfitting, we can use either of the two methods:

  • Increasing the amount of data brings diversity to the samples, and the model learns to generalize. But getting more data can take significant time and effort.
  • We can reduce the model's complexity, since it was learning too much from the dataset. Regularization is one such technique for reducing model complexity, and we will discuss it in detail.

What is Appropriate Fitting, and how to detect it?

This is considered the best fitting: the training error is slightly higher than in the overfitting scenario, but the test error reduces significantly. In this case, the gap between the zero error and the training error is equal to or slightly larger than in the overfitting case, but the gap between the training and test errors becomes significantly smaller.

Figure: What is appropriate fitting and how to detect it?

Example

Let's take an example to understand this better. Suppose we have been given data for the price of a vehicle vs. the size of the vehicle. In the figure below, the red X marks represent the given data in a 2D plane. A quadratic function can fit the given data appropriately, but suppose we fit a higher-degree polynomial to this data.

Figure: Fitting a 4th-order and a 2nd-order polynomial on the same data samples

The overall objective of a machine learning algorithm is to reduce the error between predicted and actual values. Mathematically, we can represent our objective as minimizing the sum of squares of all the prediction errors, and the same can be defined via the cost function as:

          m
Jθ = min  Σ  (hθ(Xi) - Yi)²
      θ  i=1

hθ(Xi) is the model's prediction for the ith input sample, and we have m such samples. Suppose our model fits a higher-degree polynomial (4th order) to the given data. In that case, the training error will be lower than with the lower-degree model, because we are forcing the model to pass through all the samples in the train data.

But, since the dataset follows a quadratic nature, the 4th-order polynomial will produce larger errors on the test dataset. Hence, the 4th-order model is overfitting. So, we need a method that limits the model fitting to the 2nd degree. How can we do that?
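A minimal sketch of this behaviour, assuming synthetic data that is roughly quadratic (the vehicle-price numbers here are invented for illustration): the degree-4 fit typically shows a lower training error but a larger test error than the degree-2 fit.

import numpy as np

rng = np.random.default_rng(2)
size = np.sort(rng.uniform(1, 5, 12))                        # vehicle size (train)
price = 2 + 1.5 * size + 0.8 * size**2 + rng.normal(0, 0.6, 12)

size_test = np.sort(rng.uniform(1, 5, 12))                   # held-out data
price_test = 2 + 1.5 * size_test + 0.8 * size_test**2 + rng.normal(0, 0.6, 12)

for degree in (2, 4):
    coeffs = np.polyfit(size, price, deg=degree)             # fit a polynomial of this degree
    train_mse = np.mean((np.polyval(coeffs, size) - price) ** 2)
    test_mse = np.mean((np.polyval(coeffs, size_test) - price_test) ** 2)
    print(f"degree {degree}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")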

In the figure above, what if we make the values of θ3 and θ4 extremely small so that X³ and X⁴ do not contribute significantly to the final fitting? But how? Let's modify our cost function as per the new requirement and make the contributions of X³ and X⁴ negligible. We will add the extra components 1000*θ3² + 1000*θ4², like this:

          m
Jθ = min  Σ  (hθ(Xi) - Yi)²  +  1000*θ3²  +  1000*θ4²
      θ  i=1

1000 is just an arbitrarily large number. The objective of the ML algorithm is to find the minimum of the cost function, and to reach the minimum of this modified cost function, the machine will explore zones where θ3 and θ4 have minimal values:

θ3 ≈ 0 and θ4 ≈ 0

This eventually converts our 4th-order polynomial into a 2nd-order polynomial, and the problem of Overfitting gets resolved. In simple terms, we penalized the irrelevant parameters that were creating problems.

f(x) = θ0 + θ1.X + θ2.X² + θ3.X³ + θ4.X⁴ ≈ θ0 + θ1.X + θ2.X²
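The sketch below reproduces this penalty idea numerically, assuming the same kind of roughly quadratic data and using scipy.optimize.minimize as a generic optimizer; it is only an illustration of the argument above, not the exact procedure of any particular library.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(1, 5, 30))
y = 2 + 1.5 * x + 0.8 * x**2 + rng.normal(0, 0.3, 30)        # roughly quadratic data
X = np.vander(x, N=5, increasing=True)                       # columns: 1, x, x², x³, x⁴

def penalized_cost(theta):
    residual = X @ theta - y
    # squared error plus heavy penalties on θ3 and θ4, as in the modified cost above
    return np.sum(residual ** 2) + 1000 * theta[3] ** 2 + 1000 * theta[4] ** 2

result = minimize(penalized_cost, np.zeros(5))
print(np.round(result.x, 4))          # θ3 and θ4 come out close to 0 relative to θ1 and θ2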

If we observe the previous example, two main objectives got fulfilled after penalizing the irrelevant parameters:

  1. The curve being fitted became simpler (lower order).
  2. Overfitting got reduced.

Regularization is the concept used mainly to fulfill these two objectives. Suppose there are a total of n features in the data; the ML model will then learn a total of n + 1 parameters (one weight for every feature and one bias), i.e.,

Features = X1, X2, X3, X4, ...., Xn

Parameters = θ0, θ1, θ2, θ3, ...., θn

We could simply penalize the parameters corresponding to the irrelevant features, and Overfitting would be reduced. But finding that set of irrelevant features is difficult in most cases. To overcome this problem, the penalty is applied to every parameter. If we do so, the new cost function becomes:

       1     m                        n
Jθ = ----  [ Σ (hθ(Xi) - Yi)²   +   λ Σ θj² ]
      2m    i=1                      j=1

             ---First Term---     ---Second Term---

This new cost function controls the tradeoff between two goals:

  • Fitting the training set well, which is ensured by the first term in the formula.
  • Keeping the parameters small so that the hypothesis becomes simpler, which is ensured by the regularization term (the second term). λ is the regularization parameter; for the example above, we selected its value as 1000. (A short code transcription of this cost function follows below.)
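Here is a direct NumPy transcription of this cost function, assuming a linear hypothesis hθ(x) = θ·x and following the common convention of leaving the bias θ0 out of the penalty (consistent with the second sum starting at j = 1); the example call uses hypothetical numbers.

import numpy as np

def regularized_cost(theta, X, y, lam):
    m = len(y)
    errors = X @ theta - y                       # hθ(Xi) - Yi for every sample
    first_term = np.sum(errors ** 2)             # fits the training set well
    second_term = lam * np.sum(theta[1:] ** 2)   # keeps parameters small (bias θ0 excluded)
    return (first_term + second_term) / (2 * m)

# Hypothetical usage: the first column of X is all ones and acts as the bias feature.
X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([5.0, 7.0, 9.0])
theta = np.array([1.0, 2.0])
print(regularized_cost(theta, X, y, lam=10.0))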

How will the value of the regularization parameter λ affect model fitting?

One question must come to mind now: what if we set a huge value of λ in our cost function, say 10¹⁰? To understand this, let's take the same curve-fitting example where our model fitted a 4th-order polynomial:

f(x) = θ0 + θ1.X + θ2.X² + θ3.X³ + θ4.X⁴

If we make the λ value very high, then every parameter will become very small, i.e.,

θ1 ≈ θ2 ≈ θ3 ≈ θ4 ≈ 0

And the ML model would fit a constant line, f(x) = θ0, a perfect case of Underfitting. Hence, we need to choose the value of lambda wisely.

Figure: For a larger value of the regularization parameter, the model gets underfit
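A quick way to see this tradeoff is to sweep λ on the same kind of curve-fitting setup. The sketch below uses scikit-learn's Ridge, whose alpha parameter plays the role of λ; the data is synthetic and the alpha values are arbitrary illustrations.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(1, 5, 40)).reshape(-1, 1)
y = 2 + 1.5 * x[:, 0] + 0.8 * x[:, 0] ** 2 + rng.normal(0, 0.4, 40)

for alpha in (1e-3, 1.0, 1e10):                  # small, moderate, and absurdly large λ
    model = make_pipeline(PolynomialFeatures(degree=4), Ridge(alpha=alpha)).fit(x, y)
    print(f"alpha = {alpha:g}: train MSE = {mean_squared_error(y, model.predict(x)):.3f}")

# With the huge alpha, every coefficient is shrunk toward zero, the train error
# blows up, and the model underfits.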

How does regularization help?

To reach the minimum of the cost function faster, we use optimizers that guide the parameter update steps. We repeatedly update the values of the different parameters and check the value of the cost function. Let's use the Gradient Descent optimizer and see how the regularization parameter works.

Gradient descent update with the regularization term, applied to every parameter θj (j = 1, ..., n):

                    m
θj := θj - α [ (1/m)Σ (hθ(Xi) - Yi)·Xij  +  (λ/m)·θj ]
                   i=1

To observe closely, rearrange the second term:

Rearranging the θj terms in the gradient descent update with regularization:

                                m
θj := θj·(1 - α·λ/m) - α·(1/m)  Σ (hθ(Xi) - Yi)·Xij
                               i=1

α (the learning rate), λ, and m are all positive. We know the learning rate cannot be too high, as that increases the chances of overshooting the global minimum; typical values are 0.01 or 0.001. λ also cannot be very high, as that causes Underfitting, and the number of samples m is usually very large in Machine Learning. Hence, the factor (1 - α·λ/m) is slightly less than 1, and in every iteration, θj shrinks.

As the regularization parameter (λ) affects the cost function and the parameter updates, it penalizes those parameters that drive the model in the wrong direction. Those θj values get reduced more, and the dimensions corresponding to them automatically become less significant. In this way, Regularization solves the problem of Overfitting.
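A minimal NumPy sketch of a single gradient-descent step with the regularization term included, matching the rearranged update above; the function name, the choice to leave the bias θ0 unregularized, and the default values are assumptions for illustration.

import numpy as np

def gradient_step(theta, X, y, alpha=0.01, lam=1.0):
    m = len(y)
    grad = X.T @ (X @ theta - y) / m              # gradient of the squared-error term
    reg = (lam / m) * theta                       # gradient of the regularization term
    reg[0] = 0.0                                  # the bias θ0 is not shrunk
    # Equivalent to θj := θj·(1 - α·λ/m) - α·(1/m)·Σ(hθ(Xi) - Yi)·Xij for j ≥ 1
    return theta - alpha * (grad + reg)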

Most common regularization methods used in Machine Learning

L1-Regularization

The loss function adds the absolute values (magnitudes) of the coefficients as a penalty term.

       1     m                       n
Jθ = ----  [ Σ (hθ(Xi) - Yi)²   +  λ Σ |θj| ]
      2m    i=1                     j=1
                                   (penalty term)

This is also known as LASSO (Least Absolute Shrinkage and Selection Operator) regression. Lasso shrinks the coefficients of irrelevant or less important features to zero and thus helps in feature selection.
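A small illustration of this shrinkage, assuming synthetic data in which only two of ten features matter and using scikit-learn's Lasso (its alpha again plays the role of λ; the value 0.1 is arbitrary):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 10))                               # 10 features, mostly irrelevant
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 200)      # only the first two matter

lasso = Lasso(alpha=0.1).fit(X, y)
print(np.round(lasso.coef_, 3))      # most of the irrelevant coefficients are exactly 0.0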

L2-Regularization

The loss function adds the squared magnitudes of the coefficients as a penalty term.

       1     m                       n
Jθ = ----  [ Σ (hθ(Xi) - Yi)²   +  λ Σ θj² ]
      2m    i=1                     j=1
                                   (penalty term)

Regression models that use L2-regularization are also known as Ridge regression.

How to decide which Regularization (L1 and L2) to use?

The answer to this question depends upon our requirements. Both methods help reduce irrelevant features' effects, but their way of doing this differs.

L1 Regularization reduces the effect of irrelevant features by making their coefficients zero. It can be helpful when we have constraints on the number of features used to build the model. In simple terms, L1 regularization techniques are widely used for feature selection.

L2 Regularization reduces the effect of irrelevant features by shrinking their coefficients toward zero (constraining their norms) while keeping all the features in the model. This is different from making a coefficient exactly 0. It can be useful when we have fewer features and want to retain all of them.
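The contrast can be seen directly by fitting both models on the same synthetic data (the alpha values are arbitrary illustrations): Lasso produces exact zeros, while Ridge only shrinks.

import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)
print("L1 (Lasso) coefficients set exactly to zero:", int(np.sum(lasso.coef_ == 0)))
print("L2 (Ridge) coefficients set exactly to zero:", int(np.sum(ridge.coef_ == 0)))  # typically 0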

Possible interview questions on Regularization

This is one of the hottest topics in machine learning interviews. Some possible questions are:

  • What is Underfitting or Overfitting?
  • How will you decide whether your model is Underfitting or Overfitting?
  • How will you cure the problem of Underfitting?
  • How will you fix the problem of Overfitting?
  • What are the regularization techniques used in ML?
  • What's the difference between L1 and L2 regularization techniques?

Conclusion

Overfitting is one of the most commonly encountered problems when solving a problem statement with machine learning models, and Regularization techniques are its cure. In this article, we discussed how regularization cures the problem of Overfitting. We also discussed the possible reasons for Underfitting and Overfitting and what can be done to eliminate these problems. This is one of the hottest topics in machine learning interviews, and we hope we have clarified it.

Next Blog: Bias-Variance tradeoff

Enjoy Learning, Enjoy Algorithms!
