In our previous article, we classified machine learning on five different bases. While discussing the classification of ML based on the nature of the problem statement, we divided ML problems into three different categories.
While classifying machine learning based on the nature of the input data, we defined supervised learning as follows: supervised learning is where we have an input variable (X) and an output variable (Y), and we use a machine learning algorithm to learn the mapping function from the input to the output variable.
Based on the nature of the output data, we further categorize supervised learning into two different classes: regression and classification.
Both problems deal with the case of learning a mapping function from the input to the output data.
Let's dive deeper into these two problems, one after the other.
Formal definition: Regression is a type of problem that uses machine learning algorithms to learn a continuous mapping function.

Taking the example shown in the image above, suppose we want our machine learning algorithm to predict today's temperature. If we solve this problem as a regression problem, the output will be continuous. It means our ML model will give exact temperature values, e.g., 24°C, 24.5°C, etc.

To measure the learned mapping function's performance, we measure how close the predictions are to the actual labels on the validation/test data. In the figure below, blue is the regression model's predicted values, and red is the actual labeled function. The blue line's closeness to the red line gives us a measure of how good our model is.
While building the regression model, we define our cost function, which measures the deviation of the predicted values from the actual values. Optimizers make sure that this error reduces over the progressive iterations, also called epochs. The most common error functions (or cost functions) used for regression problems are the Mean Absolute Error (MAE) and the Mean Squared Error (MSE):

MAE = (1/N) * Σ |Yi - Yi'|

MSE = (1/N) * Σ (Yi - Yi')²

Note: Yi' is the predicted value, Yi is the actual value, and N is the total number of samples over which the prediction is made.
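As a quick sketch of these two cost functions in Python with NumPy (the example temperature values are made up for illustration):

```python
import numpy as np

def mae(y_actual, y_pred):
    # Mean Absolute Error: average of |Yi - Yi'| over the N samples
    return np.mean(np.abs(y_actual - y_pred))

def mse(y_actual, y_pred):
    # Mean Squared Error: average of (Yi - Yi')^2 over the N samples
    return np.mean((y_actual - y_pred) ** 2)

# Hypothetical actual vs. predicted temperatures (in °C)
y_actual = np.array([24.0, 22.5, 26.0])
y_pred = np.array([24.5, 21.0, 26.5])
print(mae(y_actual, y_pred))  # 0.833...
print(mse(y_actual, y_pred))  # 0.916...
```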
Classification is a type of problem that requires machine learning algorithms to learn how to assign a class label to the input data. In classification problems, the mapping function that the algorithm wants to learn is discrete. The objective is to find the decision boundary (or boundaries) dividing the dataset into different categories.
For example, suppose there are three class labels: [Apple, Banana, Cherry]. But the problem is that machines don't understand these labels. That's why we need to convert the labels into a machine-readable format. For the above example, we can define Apple = [1,0,0], Banana = [0,1,0], Cherry = [0,0,1].
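As a minimal sketch of this one-hot encoding (the helper one_hot below is our own illustration, not a library function):

```python
import numpy as np

labels = ["Apple", "Banana", "Cherry"]

def one_hot(label, classes):
    # Build a zero vector and set a 1 at the position of the given label
    vec = np.zeros(len(classes))
    vec[classes.index(label)] = 1
    return vec

print(one_hot("Apple", labels))   # [1. 0. 0.]
print(one_hot("Banana", labels))  # [0. 1. 0.]
```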
Once the machine learns from these labeled training datasets, it will give probabilities of different classes on the test dataset like this: [P(Apple), P(Banana), P(Cherry)]
These predicted probabilities come from one probability distribution function (PDF), and the actual (true) labels come from another. If the predicted distribution follows the actual distribution, the model is learning accurately. Note: these PDF functions are continuous, and this is the similarity between classification and regression: if the predicted PDF follows the actual PDF, we can say the model has learned the trends.
Categorical Cross-Entropy
Suppose there are M class labels, and the predicted distribution for the i-th data sample is:

P(Y) = [Yi1', Yi2', …, YiM']

And the actual distribution for that sample is:

A(Y) = [Yi1, Yi2, …, YiM]

Cross-Entropy (CEi) = -(Yi1*log(Yi1') + Yi2*log(Yi2') + … + YiM*log(YiM'))
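As a sketch, the same formula in NumPy (a small epsilon is added inside the log to avoid log(0); the example probabilities are made up):

```python
import numpy as np

def categorical_cross_entropy(y_actual, y_pred):
    # CEi = -(Yi1*log(Yi1') + Yi2*log(Yi2') + ... + YiM*log(YiM'))
    eps = 1e-12  # guards against log(0)
    return -np.sum(y_actual * np.log(y_pred + eps))

# True class is Banana; the model leans toward Banana with probability 0.7
y_actual = np.array([0.0, 1.0, 0.0])  # [Apple, Banana, Cherry]
y_pred = np.array([0.2, 0.7, 0.1])    # predicted probabilities
print(categorical_cross_entropy(y_actual, y_pred))  # ≈ 0.357, i.e., -log(0.7)
```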
Binary Cross-Entropy
This is a special case of categorical cross-entropy, where there is only one output that can take two values, either 0 or 1. For example, suppose we want to predict whether a cat is present in an image or not.

Here, the cross-entropy function varies with the true value of Y:

CEi = -Yi1*log(Yi1'), if Yi1 = 1

CEi = -(1-Yi1)*log(1-Yi1'), if Yi1 = 0

And similarly, the Binary Cross-Entropy would be averaged over all the samples in the dataset.
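A minimal sketch of this averaged binary cross-entropy in NumPy (the cat-vs-no-cat labels and probabilities below are made up):

```python
import numpy as np

def binary_cross_entropy(y_actual, y_pred):
    # Per sample: -[Y*log(Y') + (1-Y)*log(1-Y')], then averaged over N samples
    eps = 1e-12
    y_pred = np.clip(y_pred, eps, 1 - eps)  # keep log() finite
    return -np.mean(y_actual * np.log(y_pred)
                    + (1 - y_actual) * np.log(1 - y_pred))

y_actual = np.array([1.0, 0.0, 1.0])  # 1 = cat present, 0 = no cat
y_pred = np.array([0.9, 0.2, 0.6])    # predicted probabilities of "cat"
print(binary_cross_entropy(y_actual, y_pred))  # ≈ 0.280
```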
Now, the primary question that we should ask ourselves is: If PDFs (probability distribution functions) are continuous in the range of [0,1], why can't MAE/MSE be chosen here? Take a pause and think!
Reason: MAE and MSE do well when the predicted probability is close to the actual value, or when the confidence of a wrong prediction is not that high. To understand the term confidence of prediction, let's take one example:
Suppose our ML model predicts that the female patient in the figure below is pregnant, and it predicts this with a probability of 0.9. We can say that our model is very confident. Now let's consider the scenario where the ML model says the male patient in the figure below is pregnant, also with a probability of 0.9. This is a case where the model predicts something wrong and is confident about the prediction. To address such cases, the model needs to be penalized more for these predictions. Right?
Let's calculate the cross-entropy (CE), MAE, and MSE for the case where the ML model predicts that a man is pregnant with high confidence (probability Y' = 0.8). Obviously, the actual output Y will be 0 here.
CE = -(1-Y)*log(1-Y') = -(1 - 0)*log(1 - 0.8) ≈ 1.61

MAE = |Y-Y'| = |0 - 0.8| = 0.8

MSE = (Y-Y')² = (0 - 0.8)² = 0.64
As you can see, MAE and MSE have lower values than CE. The cross-entropy cost function produces a larger error value for this confident wrong prediction, so the model gets penalized more heavily, which is exactly what we want.
That's why we need different cost functions for classification problems.
Can we pose a regression problem as a classification problem? Yes, we can! Let's take one example.
Problem statement: Predict the steering angle of an autonomous vehicle based on the image data.
Constraints: The steering angle can take any value between -50° and 50°, with a precision of ±5°.
Regression Solution: This solution is simple: we map the images to a continuous function of the steering angle, which gives a continuous output, like steering angle = 20.7° or steering angle = 5.0°.
Classification Solution: We stated that the precision is ±5°, so we can divide the entire range of -50° to 50° into 20 different classes by grouping every 5° together:
Class 1 = -50° to -46°

Class 2 = -45° to -41°

…

Class 20 = 46° to 50°
Now we just have to classify the input image into one of these 20 classes. This way, the problem is converted into a classification problem.
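As an illustration, here is one way this discretization could look in Python; the helper angle_to_class and the exact bin edges are our assumptions for this sketch:

```python
def angle_to_class(angle):
    # Map a steering angle in [-50, 50] degrees to one of 20 classes,
    # with each class covering a 5-degree-wide group of angles.
    angle = max(-50.0, min(50.0, angle))     # clamp to the valid range
    class_id = int((angle + 50.0) // 5) + 1  # 5-degree-wide bins, classes 1..20
    return min(class_id, 20)                 # angle = 50 falls into class 20

print(angle_to_class(-50.0))  # Class 1
print(angle_to_class(20.7))   # Class 15
print(angle_to_class(50.0))   # Class 20
```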
In this article, we discussed the concepts of classification and regression problems in detail. We also discussed the critical difference in their cost functions: MAE and MSE for regression versus the cross-entropies for classification. In the end, we walked through a famous problem statement that can be solved both as a regression and as a classification problem. We hope you have enjoyed the article and learned something new.
Enjoy Thinking, Enjoy Algorithms!