In Machine Learning, computers provide solutions to various problem statements, ranging from recommending better products to driving fully autonomous vehicles. But how do we convey these problem statements to computers that only understand binary? If we break down the exact problem a machine solves in ML, we find that it is nothing but an optimization problem.

Optimization is the soul of machine learning algorithms, as most ML problems reduce to optimizing functions. **But what exactly do they optimize?** In this article, we will try to answer this question.

After going through this blog, we will be familiar with

- What is the loss function, and why is it important?
- What are the various loss functions used for regression tasks? Absolute loss, Square loss, and Huber loss.
- What are the various loss functions used for binary classification tasks? Binary cross-entropy and Hinge loss.
- What are the loss functions used for multi-class classification tasks? Categorical cross-entropy.

Before moving any further, let's define the term loss function and quickly review one famous optimization algorithm, gradient descent.

In the supervised learning approach, whenever we build any machine learning model, we try to minimize the error between the predictions made by our ML model and the corresponding true values. Computers understand this error through defined loss functions.

For example, suppose we are designing an email spam classifier using machine learning. The goal is to ensure that any new email in our inbox is accurately tagged as spam or non-spam. **But how do we check that?** That's where the loss function comes into the picture. We frame our problem statement in the form of numbers so that computers understand what they need to do.

This error is calculated for every data sample, so which sample's loss should we target? That is where the concept of the cost function comes in.

In Machine Learning, we train our machines on multiple observations to solve a particular problem statement. The cost function is nothing but the average of the loss values coming from all the data samples.

The two terms are often treated as synonyms and used interchangeably. However, the loss function is defined for a single training example, while the cost function is the average of the loss values over all the data samples. In machine learning, we usually optimize the cost function rather than the individual loss values.
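The distinction between loss and cost can be sketched in a few lines of Python. This is only an illustration: the function names and the toy numbers are assumptions, and squared error is used as the per-example loss purely for convenience.

```python
def squared_loss(y_true, y_pred):
    """Loss for ONE training example (squared error, chosen only for illustration)."""
    return (y_true - y_pred) ** 2


def cost(y_true_all, y_pred_all, loss_fn):
    """Cost: the average of the per-example losses over the whole dataset."""
    losses = [loss_fn(y, p) for y, p in zip(y_true_all, y_pred_all)]
    return sum(losses) / len(losses)


# Hypothetical true values and model predictions
y_true = [3.0, 5.0, 2.0]
y_pred = [2.0, 5.0, 4.0]

per_sample = [squared_loss(y, p) for y, p in zip(y_true, y_pred)]
total_cost = cost(y_true, y_pred, squared_loss)

print(per_sample)   # one loss value per training example
print(total_cost)   # a single number summarizing the whole dataset
```

Passing the loss as an argument keeps the averaging step independent of which loss we pick, which mirrors how the same "average the losses" idea applies to every loss in this article.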

There is a famous quote in the book **Neural Smithing:**

It is important, therefore, that the function faithfully represent our design goals. If we choose a poor error function and obtain unsatisfactory results, the fault is ours for badly specifying the goal of the search.

Loss functions translate what we need from machine learning into a mathematical or statistical form. If we know exactly what we want to achieve, choosing or designing the loss function becomes easier.

In computer science, whenever we say that we need to reduce the cost, it becomes an optimization problem. Machine Learning is no different. Hence, before moving forward, let's quickly review the gradient descent algorithm.

Suppose J(**θ**) is the cost function and **θ** is the set of parameters that the machine learning model will learn. Suppose we choose a continuous cost function, and the model wants to learn the parameters for which the cost is minimal. **But how will the model reach those parameter values?** That's where we need an optimization algorithm to minimize our cost function.

Gradient descent is quite a famous optimization algorithm in machine learning, so let's see how it works.

**The overall process of the Gradient Descent algorithm:**

- Initialize the weight values randomly.
- Partially differentiate the cost function with respect to the parameters it depends on: **G = ∂J(θ)/∂θ**.
- Update the weights by an amount proportional to the gradient, moving in the direction that reduces the cost: **θ = θ - α.G**. Here **α** is the **learning rate**, a vital hyperparameter that decides how large each parameter update should be.
- Repeat the updates until the difference in the cost function values between two consecutive iterations (e.g., the 100th and 101st iterations) goes below a pre-defined threshold. For example, suppose we set that threshold to 0.0005. The cost function after the 100th update gives a value of 1.0071, and after the 101st update, it gives a value of 1.0070. The difference between the two consecutive values is 0.0001, which is below the threshold; hence we can stop updating.
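The steps above can be sketched as a small Python loop. This is a minimal sketch under stated assumptions, not a production optimizer: the toy cost J(θ) = (θ - 3)², the function names, and the fixed starting point (used instead of a random one so the run is reproducible) are all illustrative choices.

```python
def gradient_descent(grad_fn, cost_fn, theta, alpha=0.1, threshold=0.0005,
                     max_iters=10_000):
    """Minimize cost_fn from a starting theta.

    Stops when the change in cost between two consecutive iterations
    falls below `threshold`, or after `max_iters` updates.
    """
    prev_cost = cost_fn(theta)
    for iteration in range(1, max_iters + 1):
        G = grad_fn(theta)             # G = dJ/dtheta
        theta = theta - alpha * G      # theta = theta - alpha * G
        curr_cost = cost_fn(theta)
        if abs(prev_cost - curr_cost) < threshold:
            break                      # cost has stopped changing meaningfully
        prev_cost = curr_cost
    return theta, curr_cost, iteration


# Toy cost (an assumption for illustration): J(theta) = (theta - 3)^2,
# whose gradient is 2 * (theta - 3) and whose minimum is at theta = 3.
def cost_fn(t):
    return (t - 3.0) ** 2


def grad_fn(t):
    return 2.0 * (t - 3.0)


theta, final_cost, n_iters = gradient_descent(grad_fn, cost_fn, theta=0.0)
print(theta, final_cost, n_iters)
```

With a learning rate that is too large, the same loop can overshoot the minimum and diverge, which is why α is treated as a hyperparameter to tune.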

Now that we know about this optimization algorithm, let's continue learning about cost functions. As stated earlier, the cost function reflects our requirements in numbers, and we can customize it to suit them. But designing a new cost function for every problem statement would become a bottleneck. Hence, there are some standard built-in cost functions in ML.

Based on the nature of the problem statement, we categorize machine learning models into three classes,

- Classification
- Regression
- Clustering

But clustering is not a supervised learning approach, so we are left with two cases: regression and classification. Let's discuss some ready-made loss functions available in these categories.

In regression tasks, we try to predict continuous target variables. Suppose we are trying to fit a function **f** using machine learning on the training data X = [X1, X2,…, Xn] so that **f(x)** fits Y = [Y1, Y2,…, Yn]. But this function **f** cannot be perfect, and there will be errors in the fit. To quantify this error, we have these loss functions:

Square error is also called **L2 loss.** We square the difference between Yi (the actual value) and f(xi) (the corresponding predicted value):

**L(Yi, f(xi)) = (Yi - f(xi))²**

If we observe this loss function, it is quadratic in the prediction error, so it has a single global minimum and no other local minima, which is a mathematical advantage. For models whose predictions are linear in the parameters, such as linear regression, the resulting cost is convex, so gradient descent cannot get trapped in a local minimum. (For non-linear models such as neural networks, the cost can still have local minima in the parameter space.) Hence, the square error is one of the most popular loss functions among data scientists and machine learning professionals.

The corresponding cost function is the average of these losses over all n data samples, called **M**ean **S**quared **E**rror **(MSE):**

**MSE = (1/n) Σ (Yi - f(xi))²**, where the sum runs over i = 1 to n.

If we look at the update term of gradient descent, **θ = θ - α.G,** we can infer that higher error values give larger gradient values (∂J/∂θ). It means the higher the loss, the larger the change in the parameters. Or we can also say: more penalization occurs when the error is higher. But this can be problematic sometimes. Some data samples can be outliers, and they can adversely affect MSE.
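To see how the gradient grows with the error, here is a small hedged sketch. It differentiates the squared loss with respect to the prediction (the factor that the chain rule then carries into ∂J/∂θ); the function names and numbers are illustrative assumptions.

```python
def squared_loss(y_true, y_pred):
    """Squared error for a single example."""
    return (y_true - y_pred) ** 2


def squared_loss_grad(y_true, y_pred):
    """Derivative of (y_true - y_pred)^2 with respect to y_pred."""
    return -2.0 * (y_true - y_pred)


y_true = 10.0
# The last prediction mimics an outlier-sized error.
for y_pred in (9.0, 5.0, -40.0):
    loss = squared_loss(y_true, y_pred)
    grad = squared_loss_grad(y_true, y_pred)
    print(f"prediction={y_pred:6.1f}  loss={loss:8.1f}  gradient={grad:7.1f}")
```

The gradient magnitude is proportional to the error, so a single sample with a very large error can dominate the parameter update.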

**Limitations of MSE:** For an outlier, the difference between the actual and predicted values is large, and squaring that difference makes it even more prominent. So MSE is less robust to the presence of outliers, and it is advised not to use MSE when there are too many outliers in the dataset.

Absolute error is also called **L1 loss.** Unlike the square error, here we take the absolute value of the error rather than squaring it:

**L(Yi, f(xi)) = |Yi - f(xi)|**

The corresponding cost function is the average of the absolute losses over the training samples, called the **M**ean **A**bsolute **E**rror **(MAE).** It is also widely used in industry, especially when the training data is prone to outliers.

**Limitations of MAE:** Although MAE is frequently used, it is not differentiable at zero error, so we need workarounds such as subgradients to compute the updates there. Its gradient also has the same magnitude for small and large errors, so the updates do not shrink as the cost approaches its minimum, which can make convergence unstable with a fixed learning rate.
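The different sensitivity of MSE and MAE to outliers can be checked with a quick sketch. The data below is made up for illustration, and the helper names `mse` and `mae` are assumptions rather than functions from any particular library.

```python
def mse(y_true, y_pred):
    """Mean squared error over all samples."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error over all samples."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


y_pred = [2.0, 4.0, 6.0, 8.0]
clean = [2.5, 3.5, 6.5, 7.5]       # every error is 0.5
outlier = [2.5, 3.5, 6.5, 28.0]    # last target is an outlier (error of 20)

print("clean:  ", mse(clean, y_pred), mae(clean, y_pred))
print("outlier:", mse(outlier, y_pred), mae(outlier, y_pred))
```

One outlier inflates MSE by a factor of several hundred here, while MAE grows only about tenfold, which is the robustness trade-off described above.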

L1 and L2 losses are both prevalent, and each is strong where the other is weak:

- L1 loss is more robust to outliers than L2, or we can say that when the difference between prediction and actual is large, L1 is more stable than L2.
- L2 loss is more stable than L1 when the difference between prediction and actual is small, because its gradient shrinks smoothly as the error approaches zero.

Huber loss takes the good from L1 and L2 and avoids their shortcomings. It is defined as:

**L = ½ (Yi - f(xi))² if |Yi - f(xi)| ≤ δ, otherwise δ (|Yi - f(xi)| - ½ δ)**

The Huber loss function is characterized by the parameter **δ.** It is quadratic for smaller errors and becomes linear for larger errors. The error size at which the behavior switches is controlled by δ.
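A minimal sketch of the Huber loss for a single sample is shown below, assuming the standard piecewise definition (quadratic up to δ, linear beyond it). The function name and the default `delta=1.0` are illustrative assumptions.

```python
def huber_loss(y_true, y_pred, delta=1.0):
    """Huber loss for one example.

    Quadratic (like L2) when |error| <= delta,
    linear (like L1) when |error| > delta.
    """
    error = y_true - y_pred
    if abs(error) <= delta:
        return 0.5 * error ** 2
    return delta * (abs(error) - 0.5 * delta)


print(huber_loss(2.0, 1.5))    # small error: quadratic branch
print(huber_loss(5.0, 2.0))    # large error: linear branch
print(huber_loss(3.0, 2.0))    # error exactly delta: both branches agree
```

The two branches give the same value at |error| = δ, so the loss stays continuous where it switches from quadratic to linear.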

In classification tasks, we try to predict the categorical values instead of continuous ones for the target variables. For example, suppose we want to classify incoming emails as spam or non-spam.

To predict categorical variables, we take the help of probability theory. Suppose our ML classification model says that an email is spam with a probability of 0.9. In that case, we can say that the model is 90% confident about its prediction; hence this mail should be categorized as spam. If the true value is also spam, the error will be zero. But if the model misclassifies it, the error value will be 1. This kind of error calculation (often called the 0-1 loss) gives only discrete values, so the cost function is flat almost everywhere and not differentiable at the jumps, which leaves gradient descent with no useful gradient.

To tackle this situation, we treat the predicted probabilities as coming from one probability distribution and the actual labels as coming from another. The objective is to make these two distributions match as closely as possible. But before going any further, let's understand the term **entropy** first.

Entropy signifies uncertainty. In thermodynamics, entropy describes systems of particles whose exact state is not certain. Similarly, for a random variable X coming from a probability distribution p(X), entropy is defined as:

**H(X) = - Σ p(x) log p(x)**, where the sum runs over all possible values x of X.

The higher the entropy, the less certain we are about the value the random variable X will take. When the entropy is lower, the outcome is more predictable, and our confidence about it is higher.
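As a quick illustration of how entropy tracks uncertainty, the sketch below computes it for a few two-outcome distributions. The natural logarithm is an assumption here (the base only changes the unit), and the function name is illustrative.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; outcomes with p = 0 contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


print(entropy([0.5, 0.5]))   # fair coin: the most uncertain two-outcome case
print(entropy([0.9, 0.1]))   # biased coin: less uncertain
print(entropy([1.0, 0.0]))   # certain outcome: no uncertainty
```

The fair coin has the highest entropy of the three, and the certain outcome has zero entropy.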

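Cross-entropy loss builds on this idea to compare the predicted distribution with the actual one: it is small when the model assigns a high probability to the class that is actually correct. Based on the number of categories present in the data, we have two types of cross-entropy loss, binary and categorical: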

In binary cross-entropy, the target variable is in binary format, or we can say that only two classes are present. If the probability of being in class 1 is **P,** then the probability of being in class 2 is **(1-P).**

Cross-entropy loss for the actual label Y (which can take values of either 0 or 1) and the predicted probability P can be defined as:

**L(Y, P) = - [Y log(P) + (1 - Y) log(1 - P)]**

This loss is also known as the **Log loss.** For example, we can use the sigmoid function to calculate P: **P = 1 / (1 + e^(-Z))**, where Z is the raw score produced by the model (for example, a weighted sum of the input features).

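We can define the corresponding cost function as the average of this loss over all n samples:

**J = -(1/n) Σ [Yi log(Pi) + (1 - Yi) log(1 - Pi)]**, where the sum runs over i = 1 to n.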
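The binary cross-entropy (log loss) and the sigmoid can be sketched as follows. This is a minimal illustration: the function names are assumptions, and the small `eps` clipping is a common practical guard against log(0), not something the article's definition requires.

```python
import math


def sigmoid(z):
    """Map a raw score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(y, p, eps=1e-12):
    """Log loss for one example: -[y*log(p) + (1 - y)*log(1 - p)]."""
    p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def bce_cost(ys, ps):
    """Cost: average log loss over all examples."""
    return sum(binary_cross_entropy(y, p) for y, p in zip(ys, ps)) / len(ys)


print(sigmoid(0.0))                    # a raw score of 0 maps to probability 0.5
print(binary_cross_entropy(1, 0.9))    # confident and correct: small loss
print(binary_cross_entropy(0, 0.9))    # confident and wrong: large loss
print(bce_cost([1, 0], [0.9, 0.1]))
```

Unlike the 0-1 error, this loss changes smoothly with P, so a confident wrong prediction is penalized far more than a hesitant one.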

In categorical cross-entropy, the target variable has multiple classes. The entropy calculation remains the same, but the Y vector is represented as a one-hot encoded vector. Suppose there are three classes: Cat, Dog, and no_cat_no_dog. The one-hot representation of these classes can be,

**Cat = [1,0,0], Dog = [0,1,0], no_cat_no_dog = [0,0,1].**

The categorical cross-entropy cost over n samples is:

**J = -(1/n) Σi Σj Yij log(Pij)**, where i runs over the n samples, j runs over the C classes, and C = total classes/categories.

Here, **Yij** is the j-th entry of the one-hot vector for the i-th sample, and **Pij** is the predicted probability of class j when the **i-th** input sample (Xi) is provided. Unlike the binary case, here we use the **softmax function** to calculate Pij. To understand the softmax function, let's assume that there are three categories, as in the earlier example. If our model says that it is 90% sure the input is a cat, 70% sure it is a dog, and 50% sure it is neither a cat nor a dog, then what should be our final prediction?

To answer this, we apply the softmax function to these confidence values (the Zi's in the formula below) so that they sum to 1 and can be read as probabilities:

**σi = e^(Zi) / Σj e^(Zj)**, where the sum in the denominator runs over all C classes.

Once "softmaxed," the class with the highest probability is the model's final predicted category.

**Note:** The sum of all these values of **σi** will be 1 because, when summed over all classes, the numerators add up to exactly the denominator.
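The softmax step and the categorical cross-entropy for one sample can be sketched as below, using the three confidence values from the cat/dog example. The function names are illustrative assumptions; subtracting the maximum score before exponentiating is a common numerical-stability trick that does not change the result.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)                          # guards against overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(y_one_hot, probs, eps=1e-12):
    """Loss for one sample: -sum_j Y_j * log(P_j)."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_one_hot, probs))


classes = ["cat", "dog", "no_cat_no_dog"]
scores = [0.9, 0.7, 0.5]              # confidences from the example above
probs = softmax(scores)

predicted = probs.index(max(probs))   # class with the highest probability
loss = categorical_cross_entropy([1, 0, 0], probs)   # true class: cat

print(probs, sum(probs))
print(classes[predicted], loss)
```

Because the label vector is one-hot, only the predicted probability of the true class contributes to the loss.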

Hinge loss is used mainly with **Support Vector Machines or Maximal Margin Classifiers,** with class labels -1 and 1 (not 0 and 1). SVM is a machine learning algorithm commonly used for binary classification; it uses a decision boundary to separate the two classes.

For this algorithm, hinge loss penalizes the wrong predictions as well as the correct predictions for which the model is less confident:

**L(Y, f(x)) = max(0, 1 - Y.f(x))**, where f(x) is the raw score from the classifier.

It is not differentiable at Y.f(x) = 1, but it is convex, which helps in finding the optimal parameters.
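A minimal sketch of the hinge loss for one sample follows, assuming the standard form max(0, 1 - y·f(x)) with labels in {-1, +1}. The function name and the example scores are illustrative.

```python
def hinge_loss(y, score):
    """Hinge loss for one example.

    y must be -1 or +1; score is the raw classifier output f(x).
    """
    return max(0.0, 1.0 - y * score)


print(hinge_loss(+1, 2.5))    # correct and confident: no penalty
print(hinge_loss(+1, 0.3))    # correct but inside the margin: small penalty
print(hinge_loss(+1, -1.0))   # wrong side of the boundary: larger penalty
print(hinge_loss(-1, -2.0))   # correct and confident for the negative class
```

The second case shows the behavior described above: the prediction has the right sign, yet it is still penalized because the score falls short of the margin.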

Cost and loss functions are among the most frequently asked topics in machine learning interviews. Interviewers mainly focus on checking the understanding of how ML works:

- What different loss functions did you try in your regression problem statement, and why did you finalize the one you chose?
- What is Hinge loss, and why is it different from others?
- What is the difference between binary cross-entropy and categorical cross-entropy?
- What is the difference between the cost function and the loss function?
- If Huber loss is better, why do we generally see MSE and MAE as cost functions?

Answer: Huber loss adds some computational overhead to our algorithm, and it brings an additional hyperparameter δ that needs tuning.

In this article, we learned about several loss functions that are highly popular in machine learning. For classification, we covered binary cross-entropy, categorical cross-entropy, and the hinge loss used in Support Vector Machines. For regression, we covered square loss, absolute loss, and Huber loss. We hope this helps you decide when to choose which loss function.
