Understanding human intelligence is still ongoing research, but in machine learning and artificial intelligence, we say that machines try to mimic human intelligence. This naturally raises curiosity about how exactly a machine learns something. We have methods to check what humans have learned, like exams and quizzes, but how do we decide that a machine has learned something?
In this article, we will discuss the complete process of machine learning and understand how exactly a machine learns something.
After going through this blog, we will be able to understand the following things:
Let's begin with our very first fundamental question,
We have all passed examinations to prove that we learned something. In machine learning, we expect our machines to mimic this human behavior and learn from historical data. But how do we check this learning in the case of machines? Most of us would guess: isn't it when machines stop making errors or mistakes?
Yes! Precisely the same way. But this learning process is divided into multiple steps. To understand this process thoroughly, let's take one data set and visualize the learning steps in greater detail.
As we know, machines learn the mapping function from input data to output data based on the historical data provided. But what exactly does it mean to learn a function?
Suppose we have historical data instances (as input and output pairs) from the straight line f(X) = 2*X. Here f(X) is the function, and we want our machine to learn it automatically by looking at the historical data (X, f(X) = Y). Let's represent our data instances in X-Y coordinate form.
(X, Y) = [(0,0), (1,2), (2,4), (3,6), (4,8), ..., (50,100)]
X = [0, 1, 2, ..., 50]
Y = [0, 2, 4, ..., 100]
# There are 51 samples. How? (Think!)
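The listing above can be generated with a few lines of Python (a quick sketch; the variable names are our own):

```python
# Generate the 51 historical samples from the function f(X) = 2*X.
X = list(range(51))           # inputs 0, 1, ..., 50 (51 values, since both endpoints are included)
Y = [2 * x for x in X]        # outputs produced by f(X) = 2*X

print(len(X))                 # 51 samples
```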
If we represent these (X, Y) points in the cartesian 2D plane, then the curve will be similar to what is shown in the image below.
Now, suppose we need the machine to learn a "linear" function of the form f(X) = θ1*X,
In simple terms, our job is done if the machine finds the perfect value of this θ1, right? For our data samples, the perfect value of θ1 is 2. But how will the machine find it?
Let's explore the process of this learning in simple steps.
The machine will choose a random value for θ1. Suppose our machine picked θ1 = 1 and started estimating the output Y', which is different from Y and is calculated from the equation Y' = 1*X. Now we have two sets of outputs: the actual Y and the estimated Y'.
X = [0, 1, 2, ..., 50]
Y' = [0, 1, 2, ..., 50]
Our machine knows that Y is the actual output and Y' is the estimate. So now it will calculate the error between the estimated Ys (Y') and the actual Ys (Y) to sense how wrong the initial guess of θ1 was.
Let's define this error as a simple difference between these two Ys.
We have 51 data samples, so to account for all of them, we define an average error over all the data samples.
Please note that our objective is to make this error as low as possible. From this objective, we can also say that our "average error" acts like a "cost" function, where the goal is to minimize the cost. Let's calculate it.
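As a sketch (the exact error formula is a design choice; here we use the absolute difference, which is one common option), the average error for the initial guess θ1 = 1 can be computed as:

```python
# Average error (cost) over all 51 samples for the guess theta1 = 1.
X = list(range(51))
Y = [2 * x for x in X]                    # actual outputs

theta1 = 1                                # machine's random initial guess
Y_pred = [theta1 * x for x in X]          # estimated outputs Y'

# Average absolute difference between actual and estimated outputs.
cost = sum(abs(y - yp) for y, yp in zip(Y, Y_pred)) / len(X)
print(cost)                               # 25.0 for this guess
```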
In the next run, the machine will update the value of θ1 so that this average error gets reduced. Suppose we plot the average error (or cost function) with respect to various values of θ1 that our machine is guessing or choosing randomly. Then we will get a curve, as shown in the plot below.
Our objective was to minimize the cost function, and from the above graph, we can sense that for θ1 = 2, the cost would be minimal. Right?
Now the machine knows that for θ1 = 2, the error/cost function is minimum. So it will store this value of θ1 in memory, and we express this phenomenon as: "The machine has learned!"
Now, suppose we provide a new value of X that the machine has not seen earlier, say X = 500. It will simply take the input X and return θ1*X, i.e., 2*500 = 1000.
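The whole single-parameter loop can be sketched as follows. Here the machine simply tries a handful of candidate values for θ1, a simplification of how real optimizers search:

```python
X = list(range(51))
Y = [2 * x for x in X]

def cost(theta1):
    """Average absolute error for a given guess of theta1."""
    return sum(abs(y - theta1 * x) for x, y in zip(X, Y)) / len(X)

candidates = [0, 0.5, 1, 1.5, 2, 2.5, 3]   # values the machine "guesses"
best = min(candidates, key=cost)            # theta1 = 2 gives zero cost

print(best)                                 # 2
print(best * 500)                           # prediction for unseen X = 500 -> 1000
```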
Please note that we have solved a fundamental problem where we had to learn just one parameter, which we estimated by minimizing the cost function. But what if we increase the complexity of the problem statement so that there are two parameters to learn? Let's explore.
From our fundamental knowledge of linear algebra, we know that a straight-line equation takes the form Y = θ1*X + θ0,
where θ1 corresponds to the slope and θ0 to the intercept.
Here, we want to estimate a linear function from our historical data coming from a linear equation Y = X + 1 (θ1 = 1 and θ0 = 1). Note: we have selected elementary examples so learners can follow the process easily.
X = [0, 1, 2, ..., 50]
Y = [1, 2, 3, ..., 51]
Correlating this case with our earlier scenario, we now have to find the values of both θ1 and θ0. Let's assume the cost function is similar to the earlier case. To visualize it better, see the figure below: there are three dimensions, where we try to visualize the effect of the parameters θ1 and θ0 on the cost function.
Let's again go step-wise through the complete process:
The machine will select some random values of θ0 and θ1 (let's say θ0 = 6 and θ1 = -6), and based on this, it will calculate Y', where Y' = -6*X + 6. This corresponds to the position of point A in the above figure.
Now, the machine knows actual values Y and estimated value Y' based on a random guess of parameters. Using these, it will calculate the average error or cost function for all the input samples, similar to the previous example.
It will update the parameters θ0 and θ1 so that the cost function becomes as low as possible. In simple terms, it will try to reach point B in the above GIF. Suppose, after trying several combinations of θ0 and θ1, the machine was only able to find that θ0 = 0.9999 and θ1 = 1.0001 give the minimum cost function; it somehow missed checking the cost function at θ0 = 1 and θ1 = 1.
The machine will now store these two parameters as its "learning" (θ0 = 0.9999 and θ1 = 1.0001), found after trying various combinations. We know these parameters are not perfectly correct, but our machine could only learn these values within the given time limit.
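The two-parameter search can be sketched with a naive random search (an illustration only; practical systems use smarter strategies such as gradient descent, which we do not cover here). Note how, within a limited budget of guesses, the machine typically lands near (1, 1) but rarely exactly on it, just as described above:

```python
import random

random.seed(0)                              # reproducible guesses
X = list(range(51))
Y = [x + 1 for x in X]                      # data from Y = X + 1

def cost(theta0, theta1):
    """Average absolute error for a given (theta0, theta1) guess."""
    return sum(abs(y - (theta1 * x + theta0)) for x, y in zip(X, Y)) / len(X)

best_params, best_cost = (6, -6), cost(6, -6)   # initial random guess (point A)
for _ in range(20000):                          # limited "time budget"
    t0 = random.uniform(-10, 10)
    t1 = random.uniform(-10, 10)
    c = cost(t0, t1)
    if c < best_cost:
        best_params, best_cost = (t0, t1), c

print(best_params, best_cost)               # near (1, 1), but not exact
```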
As the learning is complete, let's discuss how machines store these learnings as humans do in their memories.
Machines save the values of the different parameters in the form of weight and bias matrices in their memories. The size of these matrices varies with the problem statement and with the number of parameters the machine needs to learn to map the input data to the output data accurately.
We know that a matrix with a single row (or a single column) can be considered a vector. So we can also summarize the above learning as: "the machine has learned a weight vector θ1 and a bias vector θ0".
Let's represent the above cost plot GIF as a contour plot. But first, let's define the two terms:
Contour lines are lines along which a function's value does not change as its variables change. Along a contour line, the value of the cost function remains constant even as the two variables (θ1 and θ0) vary.
A contour plot consists of many such contour lines, like in the image shown below.
If you observe the 3D contour image, the value of the cost function at the innermost center is the minimum, and as we remember, our objective was to minimize the cost function. The machine will try to reach the pink-star position by trying various values of θ1 and θ0.
Once our machine finds that θ1 = 1 and θ0 = 1 minimize our cost function, it will store these values as learned parameters and use them later for predictions.
By now, we should have a sense of how exactly a machine learns. To understand it more deeply, let's increase the complexity of the learning further.
There can be various scenarios where we need to learn many parameters. Let's take one example. Suppose we have to learn a function of this format,
Correlating this with a real-life example: the price of a house depends mainly on the size of the house (2-BHK, 3-BHK, etc.). But suppose we need to include other important factors that affect the price, like location, number of floors, distance from the railway station and airport, and many more. In that case, the machine will have to learn a parameter for every factor, and each parameter will be treated as the weightage of that factor in deciding the house price.
Let's assume that X1, X2, …, Xm are m such factors that affect the price of the house. And we collected n historical data samples for each factor. In the image below, X1 is not a number but an n-dimensional vector.
Let's say we are analyzing just one sample. We have m factors, so if we place all of them along a single column, X becomes a column vector of dimension (m × 1). The bias will have the same dimension as the output, and for a single sample, the output is a single value (i.e., 1 × 1). So the bias has dimension (1 × 1).
We know the equation between X and Y is Y = weight.T*X + bias. For the addition to be valid, the product weight.T*X must have dimension (1 × 1). Since X is (m × 1), the weight matrix must be (m × 1), so that the transpose weight.T has dimension (1 × m) and the product weight.T*X is (1 × 1).
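We can verify this dimension analysis with a quick NumPy shape check (m = 4 is an arbitrary choice for illustration):

```python
import numpy as np

m = 4                               # number of factors (hypothetical)
x = np.ones((m, 1))                 # one sample: factors stacked as a column vector
weight = np.ones((m, 1))            # weight vector, shape (m, 1)
bias = np.ones((1, 1))              # bias, same shape as the single output

y = weight.T @ x + bias             # (1, m) @ (m, 1) + (1, 1) -> (1, 1)
print(y.shape)                      # (1, 1)
```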
If we consider all n samples in one go, the input becomes an (m × n) matrix, as shown below, while the weight vector stays (m × 1) and is shared across samples, making the output weight.T*X + bias a (1 × n) matrix.
So, the machine will learn all the parameters (θ1, θ2, …, θm) of the weight vector along with the bias.
That's how the machine learns multiple parameters in representing a function.
In machine learning interviews, interviewers often ask basic questions to check a candidate's foundational knowledge. Some of the most frequent basic questions from this article could be:
In this article, we developed a basic intuition behind the role of the cost function in machine learning. Through a simple example, we demonstrated the step-wise learning process of machines, analyzed how exactly a machine learns something, and saw how it memorizes these learnings. Finally, we looked at the contour plot of the two variables involved. We hope you enjoyed the article.
Enjoy learning, enjoy algorithms!