As humans, we learn through practice, study, experience, and discussion. Modern computers, on the other hand, use machine learning to learn in a similar way and so simulate human intelligence. Naturally, we are curious to know how exactly a machine learns something. In this blog, we'll dive into the concept of cost functions and the complete learning process of computers via machine learning.

After reading this blog, we will have an understanding of the following:

- How to evaluate the intelligence of a machine?
- The basic concept behind the cost function.
- The steps involved in the learning process for machine learning algorithms.
- How a machine stores its learned information.

Now, let's start with the first fundamental question.

In machine learning, we expect machines to mimic the human way of learning by learning from historical data. In other words, a machine learns the mapping function from input data to output data based on the historical data provided.

The critical question is: how do we assess the learning progress of a machine? Most of us may think: isn’t it when the machine stops making errors or mistakes? Yes, precisely. But this learning process is divided into multiple steps. To understand it thoroughly, let’s take one dataset and visualize the learning steps in detail.

Suppose we have a historical dataset that consists of input and output pairs from a straight line function, f(X) = 2X. We want our machine to learn this function automatically by analyzing the data, where X represents the input and f(X) = Y represents the output.

```
(X,Y) = [(0,0), (1,2), (2,4), (3,6), (4, 8), ..., (50, 100)]
X = [0, 1, 2, ..., 50]
Y = [0, 2, 4, ..., 100]
#There are 51 samples.
```

If we plot these (X, Y) points in the 2D Cartesian plane, they will lie along a straight line, similar to what is shown in the image below.

Now, suppose we need the machine to learn a "linear" function of the form Y' = θ1 * X.

In simpler terms, our task is complete if the machine determines the optimal value for θ1. For our data samples, the ideal value of θ1 is 2. But the question is: how will the machine find this value? Let's explore.

The machine will randomly select a value for θ1. Let's say the machine chooses θ1 = 1. It will then estimate the output, Y' = 1 * X, which differs from the actual output Y. Now, we have two outputs: Y (actual output) and Y' (predicted output).

```
X = [0, 1, 2, ..., 50]
Y'= [0, 1, 2, ..., 50]
```

To determine how incorrect its initial guess of θ1 was, the machine will calculate the difference between the estimated outputs (Y') and the actual outputs (Y). This difference, or error, is used to gauge the accuracy of the initial guess.

We have 51 data samples, so to account for all of them, we define an average error over all the data samples.
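As a minimal sketch of this step, the snippet below computes the average error for the initial guess θ1 = 1 and for the ideal value θ1 = 2. It assumes the squared difference between Y' and Y as the per-sample error (mean squared error); the article's own error measure may differ, and the name `average_error` is purely illustrative.

```python
# Historical data generated from f(X) = 2X (51 samples).
X = list(range(51))
Y = [2 * x for x in X]


def average_error(theta1, xs, ys):
    """Average of the squared differences between Y' = theta1 * X and Y.

    Assumption: squared difference is used as the per-sample error.
    """
    squared_errors = [(theta1 * x - y) ** 2 for x, y in zip(xs, ys)]
    return sum(squared_errors) / len(xs)


initial_cost = average_error(1, X, Y)  # cost of the initial guess theta1 = 1
ideal_cost = average_error(2, X, Y)    # cost of the ideal value theta1 = 2
print("cost at theta1 = 1:", initial_cost)
print("cost at theta1 = 2:", ideal_cost)
```

Under this assumption, the error is large for the initial guess and exactly zero when θ1 = 2, because every prediction then matches the actual output.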

It's important to understand that our goal is to minimize this error as much as possible. In other words, our objective is to minimize the average error (**cost function**). For more details, explore this blog: Loss and Cost Function in Machine Learning.

Now the machine will keep adjusting the value of θ1 to reduce the average error. If we plot the average error (or cost function) against the various values of θ1 that the machine tries, we will obtain a curve like the one shown in the image below.
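To make the shape of that curve concrete, here is a small sketch that evaluates the cost for a handful of candidate θ1 values. The candidate list and the mean-squared-error cost are assumptions made for illustration, not values taken from the article.

```python
X = list(range(51))
Y = [2 * x for x in X]


def cost(theta1):
    """Mean squared error for the model Y' = theta1 * X (assumed cost)."""
    return sum((theta1 * x - y) ** 2 for x, y in zip(X, Y)) / len(X)


# Hypothetical candidate values: 0.0, 0.5, ..., 4.0
candidates = [i * 0.5 for i in range(9)]
costs = {theta1: cost(theta1) for theta1 in candidates}

for theta1, value in costs.items():
    print(f"theta1 = {theta1:.1f}  cost = {value:.2f}")

best_theta1 = min(costs, key=costs.get)
print("lowest cost at theta1 =", best_theta1)
```

The printed costs fall as θ1 approaches 2 and rise again beyond it, which is the bowl-shaped curve described above.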

Here our objective is to minimize the cost function, and from the above graph, we can see that the cost function is at its minimum for **θ1 = 2**. The critical question is: how will the machine adjust the value of θ1? For this, the machine will use an optimization algorithm like gradient descent. In the gradient descent algorithm, the machine calculates the gradient of the cost function with respect to θ1, which represents the rate of change of the cost function.

- If the gradient is positive, the machine decreases the value of θ1.
- If the gradient is negative, the machine increases the value of θ1.
- This process continues until the cost function reaches its minimum value. At this point, the machine has found the optimal value of θ1.
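The update rule in the list above can be sketched as follows. This is an illustrative implementation, assuming a mean-squared-error cost; the learning rate and the number of steps are hypothetical choices that happen to be stable for this dataset.

```python
X = list(range(51))
Y = [2 * x for x in X]
n = len(X)


def gradient(theta1):
    """Derivative of the mean squared error with respect to theta1."""
    return sum(2 * x * (theta1 * x - y) for x, y in zip(X, Y)) / n


theta1 = 1.0            # initial guess, as in the text
learning_rate = 0.0005  # hypothetical step size
num_steps = 100         # hypothetical iteration budget

for _ in range(num_steps):
    grad = gradient(theta1)
    # Positive gradient -> theta1 decreases; negative gradient -> theta1 increases.
    theta1 -= learning_rate * grad

print("learned theta1:", theta1)
```

Subtracting the gradient scaled by the learning rate covers both bullet points at once: the sign of the gradient decides the direction of the adjustment.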

The optimization process described above enables the machine to continually adjust the value of θ1 and reduce the average error, ultimately finding the best-fit line for the data. When the machine determines that θ1 = 2 results in the minimum of the error/cost function, it will store this value in its memory. At this point, we can say that the machine has learned.

Now, if we provide the machine with a new value of X that it hasn't seen before, such as X = 500, it will simply plug X into the equation θ1 * X, i.e. 2 * 500 = 1000.
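A tiny sketch of this prediction step, assuming the learned value is kept in a plain dictionary (the names `learned_parameters` and `predict` are hypothetical):

```python
# The stored "learning": a single parameter for the model Y' = theta1 * X.
learned_parameters = {"theta1": 2.0}


def predict(x, params):
    """Apply the learned parameter to a new, unseen input."""
    return params["theta1"] * x


prediction = predict(500, learned_parameters)
print(prediction)
```

No retraining is needed for the new input; the stored parameter is all the machine uses.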

It's important to note that this problem is relatively simple, as we only need to learn one parameter and estimate it by minimizing the cost function. However, things become more complex when we have multiple parameters to learn. Let's explore that scenario in the next steps.

The general straight-line equation has the form Y = θ1 * X + θ0, where θ1 corresponds to the **slope** and θ0 is the **intercept**.

Suppose we want to estimate a linear function from our historical data coming from a linear equation **Y = X + 1** (θ1 = 1 and θ0 = 1). Note: we have selected a basic example so learners can follow the process easily.

```
X = [0,1,2, ...., 50]
Y = [1,2,3, ...., 51]
```

In comparison to the previous scenario, we now need to determine both the values of θ1 and θ0. Let's assume that the cost function is still similar to what we saw before. To better understand the relationship, take a look at the figure below. It shows three dimensions, representing how parameters θ1 and θ0 impact the cost function.

Let’s again go step-wise through the complete process.

Now the machine will randomly choose values for θ0 and θ1 (let's say it selects θ0 = 6 and θ1 = -6). Using these values, it will calculate Y' = -6X + 6. This will be the location of point A in the above image.

Now the machine has both the actual value (Y) and the estimated value (Y') based on its initial random guess of the parameters. To evaluate the accuracy of its prediction, the machine will calculate the average error, or cost function, over all input samples. This process is similar to what was described in the previous example.

Now the machine will use an optimization algorithm like gradient descent to adjust the parameters θ0 and θ1 so that the cost function reaches its minimum. In other words, the machine is trying to reach point B in the image.
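Here is a hedged sketch of that two-parameter search, starting from the guess θ0 = 6, θ1 = -6. It assumes a mean-squared-error cost; the learning rate and iteration budget are hypothetical values chosen so that the updates stay stable on this dataset.

```python
X = list(range(51))
Y = [x + 1 for x in X]   # data from Y = X + 1
n = len(X)

theta0, theta1 = 6.0, -6.0  # initial guess (point A in the text)
learning_rate = 0.001       # hypothetical step size
num_steps = 40_000          # hypothetical iteration budget

for _ in range(num_steps):
    # Prediction error for every sample under the current parameters.
    errors = [theta1 * x + theta0 - y for x, y in zip(X, Y)]
    # Partial derivatives of the mean squared error.
    grad_theta0 = 2 * sum(errors) / n
    grad_theta1 = 2 * sum(e * x for e, x in zip(errors, X)) / n
    # Move both parameters against their gradients.
    theta0 -= learning_rate * grad_theta0
    theta1 -= learning_rate * grad_theta1

print("theta0:", theta0, "theta1:", theta1)
```

With a finite number of steps, the result only approaches θ0 = 1 and θ1 = 1; a smaller iteration budget leaves the parameters further from the ideal values.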

Suppose, after trying several combinations of θ0 and θ1, the machine found that θ0 = 0.9999 and θ1 = 1.0001 produced the minimum value of the cost function. Unfortunately, it missed checking the cost function at θ0 = 1 and θ1 = 1. Why do such situations occur in the optimization of a cost function? Here are some possible reasons:

- The algorithm may only be able to check a limited number of parameter combinations, or it may have limited precision in its calculations. As a result, the machine may miss a combination of parameters that would have produced a lower value of the cost function.
- Another possible reason (not applicable to the above scenario, because it has only one minimum): the optimization algorithm may stop at a local minimum rather than the global minimum, which is the actual minimum value of the cost function across all possible combinations of parameters.

For a better understanding, explore the working of the gradient descent algorithm.

Now the machine will store the two parameters (θ0 = 0.9999 and θ1 = 1.0001) as its “learning”, which it arrived at after trying various combinations of these parameters. We know these parameters are not perfectly correct, but our machine could only learn these values within the given time limit.

As the learning is complete, let’s discuss how machines store these learnings as humans do in their memories.

Machines store the values of their parameters as weight and bias matrices in their memory. The size of these matrices depends on the specific problem at hand and how many parameters the machine needs to learn to accurately map input and output data.

We know that when we have a single row in a matrix, it can be considered a vector. So, we can also summarize the above learning as “the machine has learned a **Weight Vector** θ1 and a **Bias Vector** θ0”.

Let’s represent the above cost plot as a contour plot. But first, let’s define these two terms:

Contour lines are lines on which a function's value remains constant despite changes in the variables. When the variables (θ1 and θ0) are changed, the cost function value remains the same along these contour lines.

A contour plot consists of many contour lines like in the image shown below.

- In the 2D contour plot, we have oval lines. The cost function value is the same at all the red-X points, and likewise the cost function value is the same at all the red-O points.
- In the 3D contour image, the value of the cost function is lowest at the innermost center, and, as we recall, our objective was to minimize the cost function. The machine will try to reach the pink star position by trying various values for θ1 and θ0.
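The two observations above can be checked numerically with a short sketch. It assumes a mean-squared-error cost on the Y = X + 1 data; the two parameter pairs and the integer grid are hypothetical values picked for illustration.

```python
X = list(range(51))
Y = [x + 1 for x in X]


def cost(theta0, theta1):
    """Mean squared error for the model Y' = theta1 * X + theta0 (assumed cost)."""
    return sum((theta1 * x + theta0 - y) ** 2 for x, y in zip(X, Y)) / len(X)


# Two different parameter pairs, mirrored around the minimum (theta0 = 1, theta1 = 1).
cost_a = cost(3, 2)
cost_b = cost(-1, 0)
print("same contour line:", cost_a == cost_b)

# Search a small integer grid for the point with the lowest cost.
grid = [(t0, t1) for t0 in range(-3, 6) for t1 in range(-3, 6)]
best = min(grid, key=lambda point: cost(*point))
print("lowest cost on the grid at (theta0, theta1) =", best)
```

The two pairs give the same cost even though the parameters differ, so they sit on one contour line, while the grid search lands at the center of the contours.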

Once the machine determines that θ1 = 1 and θ0 = 1 will minimize the cost function, it stores these values as the learned parameters. These learned parameters will then be used for future predictions. This is how the machine learns in the case of two parameters!

There can be various scenarios where we need to learn many parameters. Let’s take one example. Suppose we have to learn a function of this format.

To relate this to a real-life example, suppose we want to predict the price of various houses. The price of a house depends mainly on its size (2-BHK, 3-BHK, etc.). But suppose we need to include other important factors that affect the price, like location, number of floors, distance from the railway station and airport, and many more. In that case, the machine will have to learn a parameter for every factor. These parameters will be treated as the weightage of each factor in determining the price of the house.

Let’s assume that X1, X2, …, Xm are **m** such factors that affect the price of the house, and we collected **n** historical data samples for each factor. In the image below, X1 is not a number but an n-dimensional vector.

We can represent the relationship between the input matrix X and the output Y as Y = weight.T * X + Bias. Here, weight.T is the transpose of the weight matrix. For a single sample, the input matrix X will have dimensions of (m X 1), where m represents the number of factors (input dimensions). The Bias matrix will have the same dimensions as the output, which is (1 X 1) for a single sample.

To make sure that the addition in the equation is valid, the product of weight.T * X should result in (1 X 1) dimension. Since X has dimensions of (m X 1) with all factors represented as a single column matrix, the weight matrix must have dimensions of (m X 1). If we observe, the transpose of weight matrix (weight.T) will result in (1 X m), allowing the product of weight.T * X to have (1 X 1) dimensions.
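The dimension argument above can be verified with a short sketch using plain nested lists. The helper functions and all numeric values (three factors, their weights, and the bias) are hypothetical and only serve to check the shapes.

```python
def transpose(matrix):
    """Swap rows and columns of a nested-list matrix."""
    return [list(row) for row in zip(*matrix)]


def matmul(a, b):
    """Multiply two nested-list matrices."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def shape(matrix):
    return (len(matrix), len(matrix[0]))


# Hypothetical example with m = 3 factors for a single sample.
weight = [[0.5], [2.0], [-1.0]]  # (m X 1)
x = [[4.0], [1.0], [3.0]]        # (m X 1), one column of factor values
bias = [[10.0]]                  # (1 X 1)

product = matmul(transpose(weight), x)       # (1 X m) times (m X 1)
y = [[product[0][0] + bias[0][0]]]           # addition is valid: both are (1 X 1)

print("weight.T shape:", shape(transpose(weight)))
print("product shape:", shape(product))
print("Y:", y)
```

Because weight.T is (1 X m) and X is (m X 1), the product collapses to a single number that can be added to the (1 X 1) bias.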

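If we consider all n samples in one go, the input matrix X will be m X n (one column per sample), while the weight matrix remains m X 1, because the same weights apply to every sample.

So, the machine will learn all the parameters (θ1, θ2, …, θm) of the weight matrix, along with the bias.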

That’s how the machine learns multiple parameters in representing a function.

In machine learning interviews, interviewers often assess the candidate's foundational knowledge by asking basic concept questions. Some of the most common questions from this article include:

- What is a cost function in machine learning and why is it defined?
- What are the odds of the machine finding the minimum of the defined cost function if it selects and updates parameters randomly?
- How do machines store and use their learnings for new inputs?
- What happens if the machine doesn't reach the perfect minimum?
- What are contour plots and how are they used in machine learning?

In this article, we developed a basic intuition for the role of the cost function in machine learning. Through a simple example, we demonstrated the step-wise learning process of machines and analyzed how exactly a machine learns something and how it memorizes these learnings. Finally, we saw the contour plot of the cost function for the two parameters involved. We hope you enjoyed the article.