Cost Function and Learning Process in Machine Learning

Understanding human intelligence is still an area of ongoing research, yet in machine learning and artificial intelligence, we say that machines try to mimic human intelligence. This naturally raises curiosity about how exactly a machine learns something. We have methods to check the learning of humans, like exams, quizzes, etc., but how do we decide that a machine has learned something?

In this article, we will discuss the complete process of machine learning and understand how exactly a machine learns something.

Key Takeaways from this blog

After going through this blog, we will be able to understand the following things:

  1. How do we check the intelligence of any machine?
  2. What is the basic intuition behind cost function?
  3. What are the steps involved in the learning process for Machine Learning algorithms?
  4. How does a machine store its learnings?

Let's begin with our very first fundamental question,

How do we check the intelligence of any machine?

We must have passed many examinations to justify that we have learned something. In Machine Learning, we expect our machines to mimic this human behavior and learn from historical data. But how do we check this learning in the case of machines? Most of us would think: isn't it when machines stop making errors or mistakes?

Yes! Precisely that. But this learning process is divided into multiple steps. To understand it thoroughly, let's take one dataset and visualize the learning steps in greater detail.

Machine learning process on an example dataset

As we know, machines learn the mapping function from the input data to the output data based on the historical data provided. But what exactly does it mean to learn a function?

Suppose we have historical data instances (as input & output) from a straight line f(X) = 2*X. Here f(X) is the function, and we want our machine to learn it automatically by looking into the historical data (X, f(X) = Y). Let's represent our data instances in X-Y coordinate form.

(X,Y) = [(0,0), (1,2), (2,4), (3,6), (4, 8), ..., (50, 100)]
X = [0, 1, 2, ..., 50]
Y = [0, 2, 4, ..., 100]

# There are 51 samples. How? (Think!)
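If we want to play with this dataset ourselves, a quick Python sketch (just one possible way to generate it) looks like this:

```python
# Generate the toy dataset from the function f(X) = 2*X.
# X takes the integer values 0, 1, ..., 50, giving 51 samples.
X = list(range(51))
Y = [2 * x for x in X]

print(len(X))        # 51 samples
print(X[:5], Y[:5])  # [0, 1, 2, 3, 4] [0, 2, 4, 6, 8]
```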

If we represent these (X, Y) points in the cartesian 2D plane, then the curve will be similar to what is shown in the image below.

[Figure: Data points plotted on the line Y = 2*X]

Now, suppose we need the machine to learn a "linear" function of the form,

Y = θ1*X (a straight line passing through the origin)

In simple terms, our job is done if the machine finds the perfect value of this θ1, right? For our data samples, the perfect value of θ1 will be 2. But how will the machine find it?

Steps involved in the learning process for machine learning algorithms

Let's explore the process of this learning in simple steps.

Step 1: Unintentionally make the mistake

The machine will choose a random value for θ1. Suppose our machine picked θ1 = 1 and started estimating the output Y', which is different from Y and calculated from the equation Y' = 1*X. Now we have two sets of outputs, Y and Y'.

X = [0, 1, 2, ..., 50]
Y' = [0, 1, 2, ..., 50]

Step 2: Realization of the mistake

Our machine knows that Y is accurate and Y' is estimated. So now, it will calculate the error between the estimated Ys (Y') and the actual Ys (Y) to sense how far the initial guess of θ1 was off.
Let's define this error as a simple (absolute) difference between these two Ys.

Error_i = |Y_i - Y'_i|  (the error for the i-th sample)

[Figure: Visualization of the error between the actual Y and the estimated Y']

We have 51 data samples, so to take all the samples into account, we define an average error over all the data samples.

Cost = (1/n) * Σ |Y_i - Y'_i|, where n is the total number of samples (here, n = 51)

Please realize that our objective is to make this error as low as possible. From this objective, we can also say that our "average error" acts like a "Cost" function, where our goal is to minimize the cost. Let's calculate it.

For the initial guess θ1 = 1, we get Y' = X, so:

Cost = (|0-0| + |2-1| + |4-2| + … + |100-50|) / 51 = (0 + 1 + 2 + … + 50) / 51 = 1275/51 = 25
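To verify this number ourselves, here is a minimal Python sketch (assuming the absolute-difference cost defined above) that computes the cost for the guess θ1 = 1:

```python
X = list(range(51))
Y = [2 * x for x in X]            # actual outputs from f(X) = 2*X

theta1 = 1                        # the machine's initial random guess
Y_pred = [theta1 * x for x in X]  # estimated outputs Y'

# Average absolute error over all 51 samples -- our cost.
cost = sum(abs(y - y_p) for y, y_p in zip(Y, Y_pred)) / len(X)
print(cost)  # 25.0
```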

Step 3: Rectifying the mistake

In the next run, the machine will update the value of θ1 so that this average error gets reduced. Suppose we plot the average error (or cost function) with respect to various values of θ1 that our machine is guessing or choosing randomly. Then we will get a curve, as shown in the plot below.

[Figure: Cost plotted against different values of θ1; the curve attains its minimum at θ1 = 2]

Our objective was to minimize the cost function, and from the above graph, we can sense that for θ1 = 2, the cost would be minimal. Right?

Step 4: Learn from Mistakes

Now the machine knows that for θ1 = 2, the error/cost function is minimum. So it will store this value of θ1 in its memory, and we will express this phenomenon as, "Machine has learned!!"
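As one illustrative way to mimic this search (not necessarily how a real library does it), the sketch below simply tries many candidate values of θ1 and keeps the one with the lowest cost:

```python
X = list(range(51))
Y = [2 * x for x in X]

def cost(theta1):
    """Average absolute error between the actual Y and the estimate theta1*X."""
    return sum(abs(y - theta1 * x) for x, y in zip(X, Y)) / len(X)

# Try candidate values of theta1 and keep the one with minimum cost.
candidates = [i * 0.1 for i in range(-50, 51)]  # -5.0, -4.9, ..., 5.0
best_theta1 = min(candidates, key=cost)
print(best_theta1, cost(best_theta1))           # 2.0 with cost 0.0
```

Real learning algorithms update θ1 far more cleverly than blind guessing, but the objective of minimizing the cost is exactly the same.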

Step 5: Use this learning

Now, if we provide any new value of X that was not seen earlier by the machine, let's say X = 500, it will just take the input X and give the result as θ1*X, i.e., 2*500 = 1000.
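Once θ1 is stored, prediction is just a function evaluation. A tiny sketch:

```python
theta1 = 2.0  # the learned parameter retrieved from memory

def predict(x):
    """Estimate the output for a new, unseen input x."""
    return theta1 * x

print(predict(500))  # 1000.0
```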

Please note that we have solved a fundamental problem where we had to learn just one parameter, which we estimated by minimizing the cost function. But what if we increase the complexity of the problem statement, where we have two parameters to learn? Let's explore.

Learning two parameters at the same time

From our fundamental knowledge of linear algebra, we know that the straight-line equation takes the form

Y = θ1*X + θ0

where θ1 corresponds to the slope and θ0 to the intercept.

[Figure: Slope (θ1) and intercept (θ0) of a straight line]

Here, we want to estimate a linear function from our historical data coming from the linear equation Y = X + 1 (θ1 = 1 and θ0 = 1). Note: we have selected elementary examples so learners can follow the process easily.

X = [0,1,2, ...., 50]
Y = [1,2,3, ...., 51]

Correlating this case with our earlier scenario, we now have to find the values of both θ1 and θ0. For consistency, let's assume the same form of cost function as in the earlier case. To visualize it better, see the figure below: it has three dimensions, where we are trying to imagine the effect of the parameters θ1 and θ0 on the cost function.

[Figure: 3D representation of the cost function over θ0 and θ1]

Let's again go step-wise through the complete process:

Step 1: Unintentionally make the mistake

The machine will select some random values of θ0 and θ1 (let's say θ0 = 6 and θ1 = -6), and based on these, it will calculate Y' = -6*X + 6. This corresponds to the position of point A in the above figure.

Step 2: Realization of the mistake

Now, the machine knows the actual values Y and the estimated values Y' based on a random guess of parameters. Using these, it will calculate the average error or cost function over all the input samples, similar to the previous example.

Step 3: Rectifying the mistake

It will update the parameters θ0 and θ1 so that the cost function becomes as low as possible. In simple terms, it will try to reach point B in the above figure. Suppose, after trying several combinations of θ0 and θ1, the machine was only able to find that θ0 = 0.9999 and θ1 = 1.0001 gives the minimum cost function. It somehow missed checking the cost function at θ0 = 1 and θ1 = 1.

Step 4: Learn from mistakes

The machine will now store two parameters as its "learning" (θ0 = 0.9999 and θ1 = 1.0001) that it found after trying various combinations of these parameters. We know that these parameters are not perfectly correct, but our machine could only learn these values within the given time limit.
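To make this concrete, here is a minimal two-parameter version of the earlier search sketch. With a coarser grid that does not contain the exact pair (1, 1), it would settle for a nearby pair with the lowest cost, exactly like the 0.9999/1.0001 situation above:

```python
X = list(range(51))
Y = [x + 1 for x in X]  # historical data from Y = X + 1

def cost(theta0, theta1):
    """Average absolute error for the estimate Y' = theta1*X + theta0."""
    return sum(abs(y - (theta1 * x + theta0)) for x, y in zip(X, Y)) / len(X)

# Try a grid of candidate (theta0, theta1) pairs and keep the best one.
steps = [i * 0.25 for i in range(-20, 21)]  # -5.0, -4.75, ..., 5.0
best = min(((t0, t1) for t0 in steps for t1 in steps),
           key=lambda pair: cost(*pair))
print(best, cost(*best))  # (1.0, 1.0) with cost 0.0
```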

As the learning is complete, let's discuss how machines store these learnings as humans do in their memories.

How does a machine store its learnings?

Machines save the values of different parameters in the form of weight and bias matrices in their memories. The size of these matrices varies with the problem statement and with how many parameters the machine needs to learn to map the input to the output accurately.

[Figure: Relation of the Weight and Bias matrices with the output]

We know that when a matrix has a single row (or column), it can be considered a vector. So, we can also summarize the above learning as: "the machine has learned a weight vector θ1 and a bias vector θ0".
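As a tiny illustration (using NumPy, with the parameter values borrowed from our example), the stored learning and its use could look like this:

```python
import numpy as np

# Learned parameters stored as weight and bias matrices (here 1x1 each,
# since our example has one input feature and one output).
weight = np.array([[1.0001]])  # theta1
bias = np.array([[0.9999]])    # theta0

def predict(x):
    """Estimate Y for a new input x using the stored learning."""
    return weight.T @ np.array([[x]]) + bias

print(predict(500))  # [[501.0499]]
```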

Additional Insights about finding the minima

Let's represent the above cost plot as a contour plot. But first, let's define two terms:

What is a contour line?

Contour lines are lines along which a function's value does not change when its variables are changed. Here, even as the two variables (θ1 and θ0) vary along a contour line, the value of the cost function remains constant.

What is a contour plot?

A contour plot consists of many contour lines, like in the image shown below:

  • In the 2D contour plot, we have oval lines on which the cost function value for all the red-X points will be constant. In the same manner, the cost function values for all the red-O points will be the same.

[Figure: 3D contour vs. 2D contour of the cost function]

If you observe the 3D contour image, the value of the cost function at the innermost center is the minimum, and if we remember, our objective was to minimize the cost function. The machine will try to reach the pink star position by trying various values of θ1 and θ0.
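For readers who want to reproduce such a contour plot themselves, here is a minimal sketch using NumPy and Matplotlib (our choice of libraries, grid range, and resolution is arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

X = np.arange(51)
Y = X + 1  # data from Y = X + 1

# Evaluate the cost (mean absolute error) over a grid of (theta0, theta1).
theta0, theta1 = np.meshgrid(np.linspace(-2, 4, 200), np.linspace(-2, 4, 200))
cost = np.mean(
    np.abs(Y - (theta1[..., None] * X + theta0[..., None])), axis=-1
)

plt.contour(theta0, theta1, cost, levels=20)
plt.xlabel("theta0 (intercept)")
plt.ylabel("theta1 (slope)")
plt.title("Contour plot of the cost function")
plt.show()
```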

Once our machine finds that θ1 = 1 and θ0 = 1 will minimize our cost function, it will store these values as learned parameters and use them later for predictions.

By now, we must have a sense of how exactly a machine learns. To understand it more deeply, let's increase the complexity of learning further.

Additional section

What if there are more than two parameters to learn?

There can be various scenarios where we need to learn many parameters. Let's take one example. Suppose we have to learn a function of this format,

Y = θ1*X1 + θ2*X2 + … + θm*Xm + θ0

Correlating this with a real-life example: the price of a house majorly depends upon the size of the house (2-BHK, 3-BHK, etc.). But suppose we need to include other important factors that affect the price, like location, number of floors, connectivity distance from the railway station and airport, and many more. In that case, the machine will have to learn a parameter for every factor, and this parameter will be treated as the weightage of that factor in deciding the house price.

Let's assume that X1, X2, …, Xm are m such factors that affect the price of the house. And we collected n historical data samples for each factor. In the image below, X1 is not a number but an n-dimensional vector.

[Figure: Representing the inputs X1, X2, …, Xm, each as an n-dimensional vector]

Let's say we are analyzing just one sample. We have "m" factors, so if we place all of them along a single column, X becomes a column vector of dimension (m x 1). The Bias will have the same dimension as the output, and for a single sample, the output is a single value (i.e., 1 x 1). So the Bias has dimension (1 x 1).

We know the equation between X and Y is: Y = weight.T*X + Bias. For the addition to be valid, the product weight.T*X should also result in a (1 x 1) dimension. Since X is (m x 1), the weight matrix must be (m x 1) so that the transpose weight.T has dimension (1 x m) and the product weight.T*X is (1 x 1).
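We can sanity-check this dimension reasoning with a quick NumPy sketch (m = 4 features is an arbitrary choice):

```python
import numpy as np

m = 4                          # number of factors, e.g., size, location, ...
x = np.random.rand(m, 1)       # one sample: column vector, shape (m, 1)
weight = np.random.rand(m, 1)  # weight vector, shape (m, 1)
bias = np.random.rand(1, 1)    # bias, shape (1, 1) -- same as the output

y = weight.T @ x + bias        # (1, m) @ (m, 1) + (1, 1) -> (1, 1)
print(y.shape)                 # (1, 1): a single predicted value
```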

If we consider all n samples in one go, the input becomes a matrix of dimension (m x n), as shown below, while the same (m x 1) weight vector is shared across all the samples.

[Figure: Representing the input matrix and the weight vector]

So, machines will learn all these parameters (θ1, θ2, …, θm) of the weight vector, along with the bias.

That's how a machine learns multiple parameters to represent a function.

Possible Interview Questions

In machine learning interviews, interviewers often ask basic conceptual questions to check a candidate's fundamentals. Some of the most frequent questions from this article could be:

  1. What is a Cost function in Machine Learning? Why do we define it?
  2. If the machine selects random values and updates the parameter, what are the chances of hitting the minimum of the defined cost function?
  3. How do machines store the learnings and utilize them for new input values?
  4. What if Machines do not achieve the perfect minima?
  5. What are contour plots?

Conclusion

In this article, we developed a basic intuition behind the role of the cost function in machine learning. Through a simple example, we demonstrated the step-wise learning process of machines and analyzed how exactly a machine learns something and how it memorizes these learnings. In the end, we saw the contour plot of the two variables involved. We hope you enjoyed the article.

Enjoy learning, enjoy algorithms!
