Cost Function and Learning Process in Machine Learning

As humans, we learn through practice, study, experience, discussion, and so on. Modern computers, on the other hand, use machine learning to simulate this human ability to learn. So it is natural to be curious about how exactly a machine learns something. In this blog, we'll dive into the concept of cost functions and the complete learning process of computers via machine learning.

Key takeaways from this blog

After reading this blog, we will have an understanding of the following:

  • How to evaluate the intelligence of a machine?
  • The basic concept behind the cost function.
  • The steps involved in the learning process for machine learning algorithms.
  • How a machine stores its learned information.

Now, let's start with the first fundamental question.

How do we check the intelligence of a machine?

In machine learning, we expect machines to mimic the human behavior of learning from historical data. In other words, a machine learns the mapping function from input data to output data based on the historical data provided.

The critical question is: How do we assess the learning progress of a machine? Most of us may think: Isn't it when machines stop making errors or mistakes? Yes! Precisely. But this learning process is divided into multiple steps. To understand it thoroughly, let's take one dataset and visualize the learning steps in detail.

Suppose we have a historical dataset that consists of input-output pairs from the straight-line function f(X) = 2X. We want our machine to learn this function automatically by analyzing the data, where X represents the input and f(X) = Y represents the output.

(X,Y) = [(0,0), (1,2), (2,4), (3,6), (4, 8), ..., (50, 100)]
X = [0, 1, 2, ..., 50]
Y = [0, 2, 4, ..., 100]

# There are 51 samples.

If we represent these (X, Y) points in the 2D Cartesian plane, the curve will look like the one shown in the image below.

[Figure: The line Y = 2X]

Now, suppose we need the machine to learn a "linear" function of the form:

Y = θ1 * X (a line passing through the origin)

In simpler terms, our task is complete if the machine determines the optimal value for θ1. For our data samples, the ideal value of θ1 is 2. But the question is, how will the machine find this value? Let's explore.

Steps involved in the learning process of machine learning algorithms

Step 1: Unintentionally make the mistake

The machine will randomly select a value for θ1; let's say it chooses θ1 = 1. It will then estimate the output as Y' = 1 * X, which is different from the actual output Y. Now we have two outputs: Y (actual output) and Y' (predicted output).

X = [0, 1, 2, ..., 50]
Y'= [0, 1, 2, ..., 50]
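
As a rough sketch of this first step in Python (assuming NumPy is available; the variable names are our own, not from the blog):

import numpy as np

X = np.arange(51)        # inputs 0, 1, ..., 50
Y = 2 * X                # actual outputs from f(X) = 2X

theta1 = 1.0             # the machine's random initial guess for θ1
Y_pred = theta1 * X      # predicted outputs Y' = 1 * X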

Step 2: Realization of the mistake

To determine how incorrect its initial guess of θ1 was, the machine will calculate the difference between the estimated outputs (Y') and the actual outputs (Y). This difference, or error, is used to gauge the accuracy of the initial guess.

Error = Y' − Y (computed for each sample)

[Figure: Error calculation visualization]

We have 51 data samples, so to take account of all of them, we define an average error over the whole dataset. A common concrete form of this average error is the mean squared error:

Cost(θ1) = (1/n) * Σ (Y'_i − Y_i)^2, summed over the n = 51 samples

It's important to understand that our goal is to minimize this error as much as possible. In other words, our objective is to minimize the average error (cost function). Explore this blog: Loss and Cost Function in Machine Learning

[Figure: Cost calculation example]
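
Continuing the sketch above, the cost for any guess of θ1 can be computed as follows (we use the mean squared error as an assumed concrete form of the average error):

def cost(theta1, X, Y):
    # average (mean squared) error over all n samples
    return np.mean((theta1 * X - Y) ** 2)

print(cost(1.0, X, Y))   # large error for the initial guess theta1 = 1
print(cost(2.0, X, Y))   # 0.0 at the ideal value theta1 = 2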

Step 3: Rectifying the mistake

Now the machine will keep adjusting the value of θ1 to reduce the average error. If we plot the average error (or cost function) against the various values of θ1 that the machine selects or guesses, we obtain a curve like the one shown in the image below.

[Figure: Minima of the cost function, with the minimum at θ1 = 2]

Here our objective is to minimize the cost function, and from the above graph, we can sense that the value of the cost function is minimal at θ1 = 2. The critical question is: How will the machine adjust the value of θ1? For this, it uses an optimization algorithm like gradient descent. In gradient descent, the machine calculates the gradient of the cost function with respect to θ1, which represents the rate of change of the cost function.

  • If the gradient is positive, the machine decreases the value of θ1.
  • If the gradient is negative, the machine increases the value of θ1.
  • This process continues until the cost function reaches its minimum value. At that point, the machine has found the optimal value of θ1, as the sketch after this list illustrates.
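
Here is a minimal gradient descent sketch for this one-parameter problem, continuing the code above (the learning rate and iteration count are assumed values, and the gradient formula follows from the mean squared error):

theta1 = 1.0                                   # random initial guess
lr = 0.0005                                    # learning rate (step size)

for step in range(100):
    grad = 2 * np.mean((theta1 * X - Y) * X)   # d(cost)/d(theta1) for the MSE
    theta1 -= lr * grad                        # positive gradient -> decrease theta1, negative -> increase
print(theta1)                                  # converges towards the optimal value 2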

Step 4: Learning from mistakes

The optimization process described above enables the machine to continually adjust the value of θ1 and reduce the average error (ultimately finding the best-fit line for the data). When the machine determines that θ1 = 2 minimizes the error/cost function, it stores this value in its memory. At this point, we can say that the machine has learned.

Step 5: Applying this learning

Now, if we provide the machine with a new value of X that it hasn't seen before, such as X = 500, it will simply plug the input into the learned equation θ1 * X, i.e., 2 * 500 = 1000.
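
In code, applying the learning is just a matter of reusing the stored parameter (continuing the earlier sketch; X_new is a hypothetical unseen input):

learned_theta1 = theta1          # value stored in memory after training, close to 2
X_new = 500                      # an input the machine has never seen
print(learned_theta1 * X_new)    # approximately 1000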

It's important to note that this problem is relatively simple, as we only need to learn one parameter by minimizing the cost function. However, things become more complex when we have multiple parameters to learn. Let's explore that scenario next.

Learning two parameters at the same time

The general straight-line equation has the form:

Y = θ1 * X + θ0

Where θ1 corresponds to the slope and θ0 is the intercept.

[Figure: Slope (θ1) and intercept (θ0) of a straight line]

Suppose we want to estimate a linear function from our historical data coming from a linear equation Y = X + 1 (θ1 = 1 and θ0 = 1). Note: we have selected a basic example so learners can follow the process easily.

X = [0,1,2, ...., 50]
Y = [1,2,3, ...., 51]

In comparison to the previous scenario, we now need to determine the values of both θ1 and θ0. Let's assume that the cost function is still similar to what we saw before. To better understand the relationship, take a look at the figure below: a three-dimensional plot showing how the parameters θ1 and θ0 impact the cost function.

[Figure: 3D representation of the cost function over θ0 and θ1]

Let’s again go step-wise through the complete process.

Step 1: Unintentionally make the mistake

Now the machine will randomly choose values for θ0 and θ1 (let's say it selects θ0 = 6 and θ1 = -6). Using these values, it will calculate Y' = -6X + 6. This corresponds to the location of point A in the above image.

Step 2: Realization of the mistake

Now the machine has both the actual values (Y) and the estimated values (Y') based on its initial random guess of the parameters. To evaluate the accuracy of its prediction, the machine calculates the average error, or cost function, over all input samples. This process is similar to what was described in the previous example.

Step 3: Rectifying the mistake

Now the machine will use an optimization algorithm like gradient descent to adjust the parameters θ0 and θ1 so that the cost function becomes minimal. In other words, the machine is trying to reach point B in the image. A minimal sketch of this step is shown below.
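
A minimal NumPy sketch of this two-parameter gradient descent (again assuming the mean squared error cost; the learning rate and iteration count are our assumptions):

X = np.arange(51)
Y = X + 1                          # samples from Y = X + 1

theta0, theta1 = 6.0, -6.0         # random initial guess (point A)
lr = 0.001                         # learning rate

for step in range(20000):
    error = theta1 * X + theta0 - Y       # Y' - Y for every sample
    grad0 = 2 * np.mean(error)            # d(cost)/d(theta0)
    grad1 = 2 * np.mean(error * X)        # d(cost)/d(theta1)
    theta0 -= lr * grad0
    theta1 -= lr * grad1

print(theta0, theta1)                     # close to (1, 1), but typically not exact

Note how, even after many iterations, the values land near (1, 1) rather than exactly on it, which is precisely the situation described next.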

Suppose, after trying several combinations of θ0 and θ1, the machine found that θ0 = 0.9999 and θ1 = 1.0001 produced the minimum cost. Unfortunately, it missed checking the cost function at θ0 = 1 and θ1 = 1. Why do such situations occur in the optimization of a cost function? Here are some possible reasons:

  • The algorithm may only be able to check a limited number of parameter combinations, or it may have limited precision in its calculations. As a result, the machine may miss a combination of parameters that would have produced a lower value of the cost function.
  • Another reason (not applicable to the above scenario, because there is only one minimum) is that the optimization algorithm may stop at a local minimum rather than the global minimum, which is the actual minimum value of the cost function across all possible combinations of parameters.

For a better understanding, explore the working of the gradient descent algorithm.

Step 4: Learning from mistakes

Now the machine will store the two parameters (θ0 = 0.9999 and θ1 = 1.0001) as its "learning" after trying various combinations of these parameters. We know these values are not perfectly correct, but they are the best the machine could learn within the given time limit.

Now that the learning is complete, let's discuss how machines store these learnings, just as humans do in their memories.

How does a machine store its learnings?

Machines store the values of their parameters as weight and bias matrices in their memory. The size of these matrices depends on the specific problem at hand and how many parameters the machine needs to learn to accurately map inputs to outputs.

[Figure: Relation of the weight and bias matrices to the output]

We know that a matrix with a single row (or column) can be considered a vector. So, we can also summarize the above learning as "the machine has learned a weight vector θ1 and a bias vector θ0", which can be stored for later use as sketched below.
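
As a simple illustration of such storage (one possible approach, assuming NumPy; the file name is hypothetical):

import numpy as np

weight = np.array([[1.0001]])      # learned weight matrix (theta1)
bias = np.array([[0.9999]])        # learned bias matrix (theta0)
np.savez("model_params.npz", weight=weight, bias=bias)   # persist the learning

params = np.load("model_params.npz")      # reload for future predictions
print(params["weight"], params["bias"])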

Additional insights about finding the minimum

Let’s represent the above cost plot as a contour plot. But first, let’s define these two terms:

What is a contour line?

Contour lines are lines on which a function's value remains constant despite changes in the variables. When the variables (θ1 and θ0) are changed, the cost function value remains the same along these contour lines.

What is a contour plot?

A contour plot consists of many contour lines like in the image shown below.

  • In the 2D contour plot, we have oval lines along which the cost function value is constant: it is the same for all the red-X points, and, in the same manner, the same for all the red-O points.
  • If you observe the 3D contour image, the value of the cost function at the innermost center is the minimum, and our objective, remember, was to minimize the cost function. The machine will try to reach the pink-star position by trying various values of θ1 and θ0.

[Figure: 3D contour vs. 2D contour of the cost function]
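
For the curious, a 2D contour plot like the one above can be generated as follows (assuming NumPy and Matplotlib; the grid ranges are arbitrary choices):

import numpy as np
import matplotlib.pyplot as plt

X = np.arange(51)
Y = X + 1

t0 = np.linspace(-4, 6, 100)      # candidate theta0 values
t1 = np.linspace(-1, 3, 100)      # candidate theta1 values
T0, T1 = np.meshgrid(t0, t1)

# evaluate the MSE cost on the whole (theta0, theta1) grid
J = np.mean((T1[..., None] * X + T0[..., None] - Y) ** 2, axis=-1)

plt.contour(T0, T1, J, levels=30)  # lines of constant cost
plt.xlabel("theta0")
plt.ylabel("theta1")
plt.title("Contour plot of the cost function")
plt.show()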

Once the machine determines that θ1 = 1 and θ0 = 1 will minimize the cost function, it stores these values as the learned parameters. These learned parameters will then be used for future predictions. This is how the machine learns in the case of two parameters!

What if there are more than two parameters to learn?

There can be various scenarios where we need to learn many parameters. Let’s take one example. Suppose we have to learn a function of this format.

Y = θ1*X1 + θ2*X2 + … + θm*Xm + θ0

To correlate this with a real-life example, suppose we want to predict the prices of various houses. The price of a house depends mainly on its size (2-BHK, 3-BHK, etc.). But suppose we need to include other important factors that affect the price, like location, number of floors, connectivity distance from the railway station and airport, and many more. In that case, the machine will have to learn a parameter for every factor. These parameters are treated as the weightage of each factor in determining the price of the house.

Let's assume that X1, X2, …, Xm are m such factors that affect the price of the house, and that we collected n historical data samples for each factor. In the image below, X1 is not a single number but an n-dimensional vector.

[Figure: Representing X as the input, where each Xi is an n-dimensional vector]

We can represent the equation between the input matrix X and the output Y as Y = weight.T * X + Bias, where weight.T is the transpose of the weight matrix. For a single sample, the input matrix X will have dimensions (m x 1), where m represents the number of parameters or dimensions. Similarly, the Bias matrix will have the same dimensions as the output, which is (1 x 1) for a single sample.

To make sure that the addition in the equation is valid, the product weight.T * X should result in a (1 x 1) dimension. Since X has dimensions (m x 1), with all factors represented as a single column matrix, the weight matrix must also have dimensions (m x 1). The transpose of the weight matrix (weight.T) then has dimensions (1 x m), allowing the product weight.T * X to have dimensions (1 x 1).
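
We can verify these dimensions with a quick NumPy check (m = 4 is an arbitrary choice for illustration):

import numpy as np

m = 4                            # number of factors (assumed)
X = np.random.rand(m, 1)         # one sample: all m factors as a single column, (m x 1)
weight = np.random.rand(m, 1)    # weight matrix, (m x 1)
bias = np.random.rand(1, 1)      # bias, same shape as the output, (1 x 1)

Y = weight.T @ X + bias          # (1 x m) @ (m x 1) + (1 x 1) -> (1 x 1)
print(Y.shape)                   # (1, 1)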

If we consider all n samples in one go, the input matrix X becomes (m x n) and the output becomes (1 x n), while the weight matrix keeps its (m x 1) shape, as shown below.

Representing weight vector matrix

So, the machine will learn all these parameters (θ1, θ2, …, θm) of the weight matrix, together with the bias.

That’s how the machine learns multiple parameters in representing a function.

Possible Interview Questions

In machine learning interviews, interviewers often assess the candidate's foundational knowledge by asking basic concept questions. Some of the most common questions from this article include:

  • What is a cost function in machine learning and why is it defined?
  • What are the odds of the machine finding the minimum of the defined cost function if it selects and updates parameters randomly?
  • How do machines store and use their learnings for new inputs?
  • What happens if the machine doesn't reach the perfect minimum?
  • What are contour plots and how are they used in machine learning?

Conclusion

In this article, we developed a basic intuition for the role of the cost function in machine learning. Through a simple example, we demonstrated the step-wise learning process of machines and analyzed how exactly a machine learns something and how it memorizes these learnings. In the end, we saw the contour plot of the cost function over the two parameters involved. We hope you enjoyed the article.
