Parameter Learning and Gradient Descent in Machine Learning

In our previous article on the intuition behind the cost function and the process of "learning" in machine learning, we saw that machines optimize the cost function by trying out various values of the parameters and memorizing the set of parameters for which the cost function becomes minimum. But there is still a catch!

If a machine has to try all possible combinations of parameters, what are the chances that it will find the ones for which the cost is minimum? Let's explore this problem and its possible solutions in greater detail.

Key takeaways from this blog

After going through this blog, we should be able to understand the following things:

  1. Limitations of computations in Machine Learning.
  2. What are optimization algorithms, and why do we need them?
  3. What is the problem with multiple minima in the cost function?
  4. What is Gradient Descent, and what is the intuition behind it?
  5. What is the need for the learning parameter α?

Let's begin understanding the problem first, i.e.,

Limitations of Computations in Machine Learning

Let's take the scenario of two parameters and plot the contour plot of the cost function with respect to the two parameters θ0 and θ1. We know that the cost function is minimum at the pink star, and the machine will memorize the corresponding values of θ0 and θ1.

Selecting possible values of parameters using random selection

Machines randomly try combinations of (θ0, θ1) to learn the optimal values of these parameters. But if we ask the machine to take random values of these parameters every time, it could choose combinations of θ0 and θ1 from either of the two circular regions shown in the above image. It could take a nearly infinite amount of time to determine which (θ0, θ1) takes us to the pink star, and we do not have the computing power to perform infinite computations. Hence, we need to take advantage of optimization algorithms. But before proceeding, let's first understand:
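The brute-force random search described above can be sketched in a few lines of Python. The bowl-shaped cost J(θ0, θ1) = θ0² + θ1², the sampling range, and the sample count are all assumptions made purely for illustration:

```python
import random

def cost(theta0, theta1):
    # A toy bowl-shaped cost with its minimum at (0, 0).
    return theta0 ** 2 + theta1 ** 2

random.seed(42)
best_params, best_cost = None, float("inf")

# Blindly sample random (theta0, theta1) pairs and memorize the best one.
for _ in range(1000):
    t0 = random.uniform(-10, 10)
    t1 = random.uniform(-10, 10)
    c = cost(t0, t1)
    if c < best_cost:
        best_params, best_cost = (t0, t1), c

print(best_params, best_cost)
```

Even a thousand blind guesses rarely land exactly on the minimum at (0, 0), and the remaining gap shrinks painfully slowly as we add more samples, which is exactly the computational limitation discussed here.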

What is an Optimization Algorithm in Machine Learning?

Optimization algorithms guide our machines to take sensible steps (not always random ones) and speed up the process of finding the optimal values of the parameters.

To understand this optimization thoroughly, let's first define a cost function with respect to the parameters and then understand the optimization steps.

Cost function and objective

In the equations above, suppose the cost function with respect to all the parameters we need to learn is represented as J(θ0, θ1, …, θn). Our objective is to minimize this cost function with respect to those parameters. Let's select only two parameters to keep things simple and easy to visualize.

Cost function and objective for two parameters

Outline of Optimization Algorithms In Machine Learning

The primary objective is already defined: optimize the cost. For that, any optimization algorithm follows the standard outline:

  • Step 1: Start with some random values of parameters θ0 and θ1.
  • Step 2: Change the values of parameters θ0 and θ1 such that J(θ0, θ1) becomes minimum.

Understanding this outline

3D plot of updating parameters

As shown in the above image, let's take a basic 3D plot of the cost function vs. the parameters θ0 and θ1. Notice that there can be multiple local minima. Following the basic outline of optimization algorithms, our first step would be to randomly select values for the parameters, say θ0 = 0 and θ1 = 0. This is shown as the "First initialization of parameters" point in the image above. If we then slowly change the parameter values in the direction in which the cost decreases, we might reach local minima position 1.

What is the problem with multiple minima in cost function?

Please note that if we select different initial values for the parameters, say θ0 = 0.5 and θ1 = 0.5, we might find some other local minimum (local minima 2).

3D plot of updating parameters with different initial values

We might be wondering: which one is correct, then? That's a valid question, but by now it should be clear that the initial selection of parameter values significantly affects which minimum we find. For this reason, we prefer the cost function not to have multiple minima. But suppose we have no choice over the cost function. In that case, whichever initial random selection of parameters produces the lower value of the cost function is the better initialization.
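To see this effect concretely, here is a minimal sketch with an assumed one-parameter cost that has two minima of different depths; the function, learning rate, and step count are illustrative choices, not anything prescribed above:

```python
def cost(theta):
    # An assumed 1-D cost with two minima of different depths.
    return (theta ** 2 - 1) ** 2 + 0.3 * theta

def gradient(theta):
    # Derivative of the cost above.
    return 4 * theta * (theta ** 2 - 1) + 0.3

def gradient_descent(theta, alpha=0.05, steps=200):
    # Repeatedly step downhill from the given initialization.
    for _ in range(steps):
        theta = theta - alpha * gradient(theta)
    return theta

# Two different initializations slide into two different minima.
theta_a = gradient_descent(0.5)    # ends near the right-hand minimum
theta_b = gradient_descent(-0.5)   # ends near the left-hand (deeper) minimum

print(theta_a, cost(theta_a))
print(theta_b, cost(theta_b))
```

Here the initialization at −0.5 is the better one, since it reaches the minimum with the lower cost.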

Local and global minima

Various optimization algorithms try to make sure that we reach the global minimum while minimizing the cost function. Not only this, optimization algorithms also speed up the process of finding minima. To understand this better, let's look at the most famous and fundamental optimization algorithm: Gradient Descent.

What is Gradient Descent?

Gradient Descent is one of the most basic and most widely used methods for optimizing cost functions in machine learning. To understand it better, let's have a look at the pseudo-code for this algorithm.

Gradient descent pseudo code
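The pseudo-code's update rule, θj := θj − α · ∂J/∂θj, can be sketched directly in Python; the quadratic cost J = θ0² + θ1² and its gradients here are assumed toy choices:

```python
def gradient_descent_step(theta0, theta1, alpha, dJ_dtheta0, dJ_dtheta1):
    # theta_j := theta_j - alpha * dJ/dtheta_j, applied simultaneously for j = 0, 1.
    new_theta0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    new_theta1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    return new_theta0, new_theta1

# Assumed toy cost J = theta0^2 + theta1^2, so dJ/dtheta0 = 2*theta0 and
# dJ/dtheta1 = 2*theta1, with the minimum at (0, 0).
t0, t1 = 4.0, -3.0
for _ in range(100):
    t0, t1 = gradient_descent_step(t0, t1, 0.1,
                                   lambda a, b: 2 * a,
                                   lambda a, b: 2 * b)

print(t0, t1)  # both parameters approach 0, the minimum of J
```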

Defining key terms in the gradient descent equation

Assignment Operator ":="

The ":=" symbol assigns the right-hand side value to the left-hand side variable. If we are familiar with programming, ":=" works like "=": we assign the RHS value to the LHS variable. Writing a = a + 1 in a program is identical to writing a := a + 1 in this notation, but in ordinary mathematics we cannot say a is equal to a + 1, as a can never equal a + 1. For example, if a has the value 4, then 4 can never be equal to 5.
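In Python, the ":=" of the pseudo-code is just ordinary assignment:

```python
a = 4
a = a + 1   # evaluate the RHS first (4 + 1 = 5), then store it back into a
print(a)    # 5
```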

Learning parameter (α)

The learning parameter controls the size of the change in the parameter values. In the update step, it decides by how much we change the parameters while minimizing the cost function.

Partial derivative

In the above equation, the cost function J varies with both parameters, θ0 and θ1. Taking the partial derivative of J with respect to one parameter (say θ0) means differentiating J as a function of θ0 while treating θ1 as a constant.
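We can sanity-check this with a small numerical example; the cost J = θ0² + 3·θ0·θ1 is an assumed toy function:

```python
def J(theta0, theta1):
    # Assumed toy cost: J = theta0^2 + 3 * theta0 * theta1
    return theta0 ** 2 + 3 * theta0 * theta1

def dJ_dtheta0(theta0, theta1):
    # Partial derivative w.r.t. theta0, treating theta1 as a constant.
    return 2 * theta0 + 3 * theta1

# Numerical check via a central finite difference, holding theta1 fixed.
h = 1e-6
t0, t1 = 2.0, 1.0
numeric = (J(t0 + h, t1) - J(t0 - h, t1)) / (2 * h)

print(numeric, dJ_dtheta0(t0, t1))  # both are approximately 7.0
```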

In simple terms, we update the parameter values iteratively until the cost reaches a minimum. But this update is also tricky. Let's see a common mistake that can be made while implementing gradient descent.

Common error while implementing Gradient Descent

If we implement gradient descent as shown on the left side of the image below, the implementation will be wrong. Can we guess why?

Incorrect and correct implementation of GD

The pink line shows the reason. In the left-side implementation of gradient descent, the update of θ1 uses the already-updated value of θ0, which does not match the pseudo-code. In the correct implementation, we first compute the updated values for all the parameters without changing any of them. Only once we have computed the updated values for all the parameters do we change the value of every parameter, simultaneously.
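A minimal sketch of both versions, using assumed toy gradients for J = θ0² + θ0·θ1 + θ1², shows how the wrong one leaks the updated θ0 into θ1's update:

```python
def dJ_dtheta0(t0, t1):
    # Assumed toy gradients for J = t0^2 + t0*t1 + t1^2.
    return 2 * t0 + t1

def dJ_dtheta1(t0, t1):
    return t0 + 2 * t1

alpha, t0, t1 = 0.1, 1.0, 2.0

# WRONG: theta1's update accidentally sees the already-updated theta0.
w0 = t0 - alpha * dJ_dtheta0(t0, t1)
w1 = t1 - alpha * dJ_dtheta1(w0, t1)   # uses w0 instead of the old t0

# CORRECT: compute every update from the old values, then assign together.
temp0 = t0 - alpha * dJ_dtheta0(t0, t1)
temp1 = t1 - alpha * dJ_dtheta1(t0, t1)
c0, c1 = temp0, temp1

print((w0, w1), (c0, c1))  # theta0 matches, but the theta1 values differ
```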

Now, if we look carefully, the update equation in the pseudo-code is designed in a specific way. We might wonder: why this particular form of update?

To answer this, let's look at the intuition behind the gradient descent algorithm.

Intuition behind Gradient Descent algorithm

Let's take the example of learning a single parameter (just θ1) and see how this update rule helps find the optimal parameter value. Suppose the cost varies with parameter θ1 as shown in the image below.

Cost function with respect to single parameter

We first select the random initial value, and here, this random value can be present at either of these three locations: 

  1. At the yellow star.
  2. On the left side of the yellow star.
  3. On the right side of the yellow star.

Case 1 is very rare and would require considerable luck to hit the minimum on the first try; if we do, we will not need to update the parameter at all. Otherwise, if our machine selects a random initial value on either side of the yellow star, our job is done if the optimization algorithm can somehow indicate that decreasing/increasing the θ1 value moves the cost toward the yellow star position. Right?

Parameter update with respect to sign of derivative

In gradient descent, we do exactly this by using the property of the partial derivative. As there is only one variable, the partial derivative is the same as the ordinary derivative. The derivative is also the slope of the cost function at that particular value of θ1. If the random initial value is on the right side of the yellow star, the derivative or slope will be positive, and if it is on the left side of the yellow star, the slope will be negative. Please note that the derivative/slope will be zero at the yellow star itself.

Our learning parameter α, which controls the size of the change, is a positive constant in gradient descent. Hence, the sign of the derivative tells us whether we are increasing or decreasing the value of parameter θ1 in the update step.
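A one-step sketch makes this concrete; the cost J = (θ1 − 3)², with its minimum at θ1 = 3, is an assumed example:

```python
def derivative(theta):
    # Assumed cost J = (theta - 3)^2, minimum at theta = 3; dJ/dtheta = 2*(theta - 3).
    return 2 * (theta - 3)

alpha = 0.1

# Starting to the RIGHT of the minimum: the slope is positive, so theta decreases.
right = 5.0
right_after = right - alpha * derivative(right)

# Starting to the LEFT of the minimum: the slope is negative, so theta increases.
left = 1.0
left_after = left - alpha * derivative(left)

print(right_after, left_after)  # both values moved toward theta = 3
```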

But then we should raise another question: what is the need for α? Let's see this in detail.

What is the need for learning parameter α?

This is a valid question, as the sign of the derivative alone can decide whether we need to increase or decrease the value of θ1. But there is a catch! Even after choosing to increase or decrease, the updates could take an infinite amount of time to reach the yellow star if we have no control over the magnitude of each update. Hence, the learning parameter α scales each update, deciding whether we change the parameter's value by a greater or smaller margin.

The tradeoff between large and small values of learning parameter α

  • Condition 1: If we select a very small value for α, the gradient descent algorithm will become very slow and may take a long time to reach the minimum.
  • Condition 2: If we select a very large value for α, the gradient descent algorithm will overshoot the minimum or may never reach it at all.
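Both conditions can be reproduced with a toy quadratic cost J = θ² (an assumed example); only α changes between the three runs:

```python
def gradient(theta):
    # Assumed cost J = theta^2 with its minimum at 0; dJ/dtheta = 2*theta.
    return 2 * theta

def run(alpha, steps=50, theta=5.0):
    for _ in range(steps):
        theta = theta - alpha * gradient(theta)
    return theta

tiny = run(alpha=0.001)  # too small: barely moves in 50 steps
good = run(alpha=0.1)    # reasonable: converges close to 0
huge = run(alpha=1.1)    # too large: overshoots and diverges

print(tiny, good, huge)
```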

Effect of small and large learning parameter

We might be wondering:

Should we change the value of α during the process, e.g., choosing a larger value when θ1 is far from the yellow star and a smaller value when θ1 is near it?

In the gradient descent algorithm, we keep the value of α constant throughout the training process. Based on the derivative values, the algorithm automatically starts taking smaller steps as θ1 approaches the optimal value θ1'.
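This self-slowing behavior is easy to verify with an assumed quadratic cost: even with a constant α, the step size α · dJ/dθ shrinks on its own because the derivative gets smaller near the minimum:

```python
def gradient(theta):
    # Assumed cost J = theta^2; the slope shrinks as theta approaches 0.
    return 2 * theta

alpha, theta = 0.1, 5.0
step_sizes = []
for _ in range(10):
    step = alpha * gradient(theta)
    step_sizes.append(abs(step))
    theta = theta - step

# With a CONSTANT alpha, the steps still shrink automatically,
# because the derivative itself gets smaller near the minimum.
print(step_sizes)
```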

Slower steps when parameter approaches optimal values

Some algorithms do change the value of α based on the parameter values, but we are not discussing them here. Optimization algorithms do not guarantee that we reach the global minimum, but they speed up the process of finding a local minimum and, possibly, the global one.

Possible Interview Questions On Gradient Descent

Gradient descent is a fundamental topic every ML engineer should know, and any interviewer expects us to know about it. Possible interview questions on this topic could be:

  1. What is the role of gradient descent in making machines learn?
  2. Questions on the correct and incorrect ways of implementing the gradient descent algorithm.
  3. How do we decide the value of the learning parameter?
  4. Will gradient descent always help in finding the global minimum?
  5. Do we need to change the value of the learning parameter in gradient descent?

Conclusion

In this article, we discussed one of the most basic cost optimization algorithms: Gradient Descent. We learned how this algorithm helps us find the right set of parameters during learning in machine learning, and we saw the effect of the learning parameter by choosing smaller and larger values for it. We hope you learned something new and enjoyed the article.


Enjoy learning, Enjoy algorithms!
