In machine learning, Gradient descent is an optimization algorithm that helps to find the minimum of a cost function. It is a simple and efficient algorithm widely used for training various types of models like linear regression, logistic regression, and neural networks.
After going through this blog, you will be able to understand the following things:
Before learning gradient descent in detail, let's start with a simple perspective! Machines try to optimize the cost function by trying out various values of parameters. For this, machine iterates through different values of parameters to achieve this. But if a machine has to try all possible combinations of parameters, What are the chances that the machine will find those parameters for which the cost will be minimum? Let's explore this problem and its possible solutions.
Consider a scenario with two parameters (θ0 and θ1) and imagine a contour plot of the cost function with respect to these parameters. We can see that the cost function will be minimum at the location marked by the pink star on the following plot. The machine will then memorize the values of θ0 and θ1 corresponding to this minimum value.
Machine initially starts with a random value for the parameters θ0 and θ1 in order to determine the optimal values. But if we ask machine to take random values of these parameters every time, machine could choose all the combinations of θ0 and θ1 from either of the two circular regions shown in the above image.
It would take nearly infinite time to determine which (θ0, θ1) will take us to the pink star. And we don't have that much computing power to perform infinite computations. So we need to take advantage of optimization algorithms.
An optimization algorithm in machine learning is a method for finding the set of parameters that minimize a cost function. Such algorithms help machines to efficiently search for the optimal parameters by updating to parameters in each iteration based on the gradient of the cost function with respect to the parameters.
So optimization algorithms will guide our machines to take sensible steps (not always random) and speed up finding optimal values of parameters. To understand this optimization thoroughly, let's first define one cost problem with respect to the parameters and then understand the optimization steps.
In the above equations, the cost function (J) is represented for all the parameters we need to learn i.e. J(θ0, θ1, …, θn). Our goal is to minimize this cost function with regard to these parameters. For simplicity, we will consider only two parameters in this example to make it easier to visualize.
Here our main goal is to optimize the cost, and to achieve this, any optimization algorithm will follow a standard process which is as follows:
As shown in the above image, let's take a basic 3D plot of the Cost function vs. parameters θ0 and θ1. If we notice, there can be multiple local minimum values. If we follow the basic outline of optimization algorithms, our first step would be randomly selecting values for parameters: θ0 = 0 and θ1 = 0. This is shown as the "First initialization of parameters" point in the image above. If we slowly change the values of parameters in the direction in which the cost decreases, we might achieve the local minimum position 1.
Please note that if we select some different values for the parameters at our initialization step, θ0 = 0.5 and θ1 = 0.5, we might find some other local minima (local minima 2).
One might ask, "Which minimum is correct?" This is a valid concern, but it's important to note that the initial selection of parameter values can greatly impact finding the minimum value. Ideally, the cost function should not have multiple minima. However, if we are given a cost function with multiple minima, the better initialization of parameters would be the one that results in a lower value of the cost function.
Various optimization algorithms work to ensure that the global minimum value of the cost function is reached while finding the minimum. Additionally, these algorithms also speed up the process of finding the minimum. To gain a better understanding, let's look at the most well-known and fundamental optimization algorithm, Gradient Descent.
Gradient Descent is one of the most basic and famous methods used to optimize cost functions in machine learning. Its name is "Gradient Descent" because it uses the gradient of the cost function to update parameters in each iteration so that the value of the cost function moves downhill towards the minimum.
Let's take the example of one-parameter learning (Just θ1) and see how gradient descent works? Suppose cost function varies with parameter θ1, as shown in the image below.
Initially, we choose a random starting value. This value can be located in one of three places:
Case 1 is rare and requires a lot of luck to reach the minimum at the first attempt. So, if our first choice is this case, we won't have to change the parameters. For cases 2 and 3, if our machine randomly selects a starting value on either side of the yellow star, our task is complete if the optimization algorithm shows that we can reach the yellow star by adjusting the value of θ1.
In gradient descent, we use the concept of partial derivatives. Since there is only one variable, the partial derivative is equivalent to the derivative. The derivative represents the slope of the cost function at a specific value of θ1. If the random starting value is to the right of the yellow star, slope (or derivative) will be positive. If it is to the left of the yellow star, slope will be negative. It's important to remember that the derivative/slope will be zero at the yellow star.
Our learning parameter α controls the magnitude of the change, which is a positive constant in gradient descent. So the sign of the derivative will indicate whether we should increase or decrease the value of θ1 in the next step. To understand it better, let's have a look at the pseudo-code for this algorithm.
Assignment Operator :=
This is used to assign the Right-hand side value to the Left-hand side value. If we are familiar with the programming, then ":=" is similar to "=" as we assign the RHS value to the LHS variable. Writing a = a + 1 in programming is identical to writing a := a + 1. But we can not say a is equal to a+1, because a can never be equal to a + 1. For example, if a has a value of 4, then 4 can never be equal to 5.
Learning parameter (α)
The learning parameter is a value that determines the size of the adjustments made to the model's parameters during the training process. It is used in the "update step" to decide the amount by which the parameters should be changed to minimize the cost function. In other words, the learning rate controls the step size taken towards finding the optimal values for the model parameters.
Partial derivative
In the above equation, the cost function (J) depends on both parameters (θ0 and θ1). So taking partial derivative of J with respect to one parameter (for example, θ0) means calculating the derivative of J while treating the other parameter (θ1) as a constant. In other words, partial differentiation of the cost function with respect to one parameter calculates how the cost changes with respect to just that one parameter, while holding the value of the other parameter constant.
In simple terms, we update the parameters values iteratively until the cost reaches a minimum value. But this update is also tricky. Let's see a common mistake that can be made while implementing gradient descent.
If we implement gradient descent like what is shown on the left side of the image below, it will be the wrong implementation. Can we guess why?
In the incorrect implementation of the gradient descent algorithm on the left-side, update of parameter θ1 uses "new" values of θ0 before they have actually been updated. On other side, the correct implementation of gradient descent involves first calculating the updated values for all parameters without actually changing their values. Once updated values have been found for all parameters, all parameters are updated simultaneously.
In simpler terms, the correct implementation involves calculating all changes to the parameters first and then updating all of them at once, rather than updating one parameter and using its updated value to update the next parameter. This ensures that the optimization process is performed correctly and efficiently.
Sign of the derivative determines whether we need to increase or decrease the value of θ0 or θ1. However, there is a challenge. Even if we choose to increase or decrease, the update may take an infinite amount of time to reach the optimal value, as we have no control over the magnitude of the update.
To overcome this issue, the "learning rate" parameter (α) helps us control the step size of the update based on the current position of θ0 or θ1. It determines whether we need to make a larger or smaller adjustment to the parameter's value.
One may think that the learning rate (α) should be adjusted during the optimization process, for example, by choosing a larger value when θ1 is far from the optimal value and a smaller value when θ1 is close to the optimal value.
However, in gradient descent algorithms, the learning rate is kept constant throughout the training process. As the optimization progresses and the derivative values change, the algorithm automatically adjusts the step size and takes smaller steps as θ1 approaches the optimal value (θ1').
Some optimization algorithms adjust the value of the learning rate (α) based on the current parameter values. While these algorithms may speed up the optimization process by finding a local minimum faster, they do not guarantee that the global minimum will be found. They may still converge to a local minimum rather than the global minimum. Explore and think!
Gradient descent is the most fundamental topic every ML engineer should know. Any interviewer expects us to know about it. Possible interview questions on this topic could be:
In this article, we explored one of the fundamental optimization algorithms, Gradient Descent. We discussed its role in helping find the optimal parameters for machine learning models and highlighted the impact of the learning rate on the optimization process. We hope that you found this article informative. Enjoy learning, Enjoy algorithms!
References: Andrew Ng Lectures