Classification problems are among the most common problem statements in Machine Learning and Data Science. When we explore real-life industry applications of Machine Learning, we find classification widely used by tech giants like Google, Apple, Tesla, Microsoft, and Facebook. A large share of the problem statements we come across are classification problems. Because of this popularity, new methods appear regularly and challenge the existing ones. **But challenge on what basis?**

The answer is simple: we compare different approaches by their performance on common grounds, which we call evaluation metrics. Research papers also compare their results with benchmark papers on standard evaluation metrics. In this article, we will discuss some of the most common and popular evaluation metrics used to evaluate classification models.

- Accuracy and its limitations
- Confusion Matrix
- Precision & Recall
- F1-Score
- Specificity
- Receiver Operating Characteristic Curve (ROC)
- Area Under Curve (AUC)

Let's start understanding every method in detail.

Accuracy for a classification problem is a straightforward calculation that is widely used in industry. As we know, in classification problems, our machine learning model assigns each input to one of several classes. We count the total number of predictions made by our model and how many of those predictions are correct. In mathematical representation,

$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} \times 100$$

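```
from sklearn.metrics import accuracy_score

# Y_act is the true target variable and Y_pred is the predicted one.
# The labels below are illustrative sample values.
Y_act = ["Cat", "No-Cat", "Cat", "Cat", "No-Cat"]
Y_pred = ["Cat", "No-Cat", "No-Cat", "Cat", "No-Cat"]
# accuracy_score returns the fraction of correct predictions (between 0 and 1)
print("Accuracy = ", accuracy_score(Y_act, Y_pred))
```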

Although it is a widely used metric, it has some severe limitations. Suppose we have trained a model to classify images into two classes, "Cat" or "No-Cat". We tested our model on 100 images containing cats, and our model predicted the "Cat" class every time. So the accuracy from the above formula is (100/100)*100 = 100%. **Wow!!!**

**But what if our model always predicts the "Cat" class, whatever the input?** Here is the catch! If we test that same model on a different set of 100 images with no cats, the accuracy would be (0/100)*100 = 0%. So we must not judge a model by accuracy alone: it is informative on a balanced dataset (with roughly equal samples for all classes) but misleading on an unbalanced one.
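To make this limitation concrete, here is a minimal sketch of a "model" that always answers "Cat", scored on an unbalanced and on a balanced test set. The 95/5 and 50/50 splits are made-up numbers chosen only for illustration.

```python
from sklearn.metrics import accuracy_score

# A "model" that ignores its input and always predicts "Cat"
y_pred = ["Cat"] * 100

# Hypothetical unbalanced test set: 95 cat images, 5 images without cats
y_true_unbalanced = ["Cat"] * 95 + ["No-Cat"] * 5
unbalanced_accuracy = accuracy_score(y_true_unbalanced, y_pred)

# Hypothetical balanced test set: 50 cat images, 50 images without cats
y_true_balanced = ["Cat"] * 50 + ["No-Cat"] * 50
balanced_accuracy = accuracy_score(y_true_balanced, y_pred)

print("Accuracy on unbalanced set =", unbalanced_accuracy)  # 0.95
print("Accuracy on balanced set   =", balanced_accuracy)    # 0.5
```

The same useless model looks excellent on the unbalanced set and no better than a coin flip on the balanced one, which is why accuracy alone can mislead us.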

The Confusion Matrix is the base for most other evaluation metrics. It is called a **confusion matrix** because it shows where the model confuses one class with another, although many people also find its theory confusing at first. So let's understand it thoroughly. The image below shows the components of the confusion matrix.

Let's take an example to learn these four terms, **True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN).** All other evaluation metrics will be defined using these terms.

Suppose we want to eat an apple, and we are health conscious. Luckily, we have our own farm, and we know which apples belong to the fresh category and which don't. But for the customers, we built a machine learning classification model that takes an apple's image as input and predicts whether it is fresh or old. Here, "fresh" is the positive class and "old" is the negative class. Before deploying this model for customers, we want to check its performance. Based on this, let's define four important terms.

**True Positive (TP):** This is when our model says an apple is fresh and that apple is actually fresh.

**True Negative (TN):** This is when our model says an apple is old and that apple is actually old.

**False Positive (FP):** This is when our machine learning model says an apple is fresh, but in reality, that apple is old. This case is popularly known as **Type I Error**.

**False Negative (FN):** This is when our machine learning model says an apple is old, but in reality, that apple is fresh. This case is popularly known as **Type II Error**.

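```
from sklearn.metrics import confusion_matrix

y_true = ["fresh", "old", "old", "fresh", "fresh", "fresh"]
y_pred = ["fresh", "old", "fresh", "old", "fresh", "old"]
# Rows are actual classes and columns are predicted classes, in `labels` order.
# Listing the negative class ("old") first makes ravel() yield tn, fp, fn, tp.
arr = confusion_matrix(y_true, y_pred, labels=["old", "fresh"])
print("Confusion Matrix = ", arr)
tn, fp, fn, tp = arr.ravel()
print("TN, FP, FN, TP = ", tn, fp, fn, tp)
```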

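Now, if we re-define accuracy using the four terms above,

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

which states, **out of all the apples we had, how many were correctly predicted as fresh or old.** Multiplying by 100 gives the same value as a percentage.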

Now, suppose we have 100 apples. We predicted the type of each apple using two different classification models, which segregated the apples into "**fresh**" and "**old**". When we observed the predictions of the two models, we found,

**Model 1:** TP = 68, FN = 22, FP = 0, TN = 10.

**Model 2:** TP = 90, FN = 0, FP = 4, TN = 6.
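Before comparing the two models on other metrics, we can check their accuracy from the four counts. The snippet below is a small sketch; the helper name `accuracy_from_counts` is ours and not part of any library.

```python
def accuracy_from_counts(tp, tn, fp, fn):
    """Accuracy = correct predictions / all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

# Counts taken from the two models described above
model_1 = {"tp": 68, "fn": 22, "fp": 0, "tn": 10}
model_2 = {"tp": 90, "fn": 0, "fp": 4, "tn": 6}

accuracy_model_1 = accuracy_from_counts(**model_1)
accuracy_model_2 = accuracy_from_counts(**model_2)

print("Model 1 accuracy:", accuracy_model_1)  # 0.78
print("Model 2 accuracy:", accuracy_model_2)  # 0.96
```

Accuracy alone would favour Model 2, but the next scenarios show that the right choice depends on which kind of error costs us more.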

Now, suppose it's our farm, and we want to store the apples. The problem is that if we store old apples together with fresh ones, the old ones will spoil the fresh ones. In such a scenario, we must penalize **False Positives**, as we don't want our model to put old apples into the fresh category. Precision is a measure for that: out of all the apples predicted as fresh, how many are actually fresh. In mathematical terms,

$$\text{Precision} = \frac{TP}{TP + FP}$$

For the two models above, Model 1 has a Precision of 68/(68+0) = 1, while Model 2 has 90/(90+4) ≈ 0.957. Model 1 is preferred here because the higher the Precision, the better the model.
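Precision is the share of apples predicted as fresh that are actually fresh, TP / (TP + FP). As a quick sketch, we can compute it for both models; the helper `precision_from_counts` is a name we made up, and returning 0.0 when nothing is predicted positive is our own convention.

```python
def precision_from_counts(tp, fp):
    """Share of apples predicted fresh that are actually fresh."""
    predicted_positive = tp + fp
    # Assumption: report 0.0 when the model predicts no positives at all
    return tp / predicted_positive if predicted_positive else 0.0

precision_model_1 = precision_from_counts(tp=68, fp=0)
precision_model_2 = precision_from_counts(tp=90, fp=4)

print("Model 1 precision:", round(precision_model_1, 3))  # 1.0
print("Model 2 precision:", round(precision_model_2, 3))  # 0.957
```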

Now take another scenario: we want to sell apples as soon as possible to make extra profit. There is no problem of mixing up, but the goal is to penalize **False Negatives**, because fresh apples categorized as old ones will hamper the profit. Recall is a measure of that: out of all the apples that are actually fresh, how many were predicted as fresh.

$$\text{Recall} = \frac{TP}{TP + FN}$$

For the two models above, Model 1 has a Recall of 68/(68+22) ≈ 0.756, while Model 2 has 90/(90+0) = 1. Model 2 is preferred here because the higher the Recall, the better the model.
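Recall is the share of actually fresh apples that the model predicted as fresh, TP / (TP + FN). The sketch below computes it for both models and picks the one with the higher value; `recall_from_counts` is a hypothetical helper, not a library function.

```python
def recall_from_counts(tp, fn):
    """Share of actually fresh apples that were predicted fresh."""
    actual_positive = tp + fn
    # Assumption: report 0.0 when the data contains no positives at all
    return tp / actual_positive if actual_positive else 0.0

recalls = {
    "Model 1": recall_from_counts(tp=68, fn=22),
    "Model 2": recall_from_counts(tp=90, fn=0),
}
best_model = max(recalls, key=recalls.get)

for name, value in recalls.items():
    print(name, "recall:", round(value, 3))
print("Preferred when False Negatives are costly:", best_model)
```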

Recall is also called **Sensitivity** or **True Positive Rate** (TPR). There can be scenarios where we want both Precision and Recall to be high. Then how will we decide? Let's see!

Suppose we have just started a supermarket, where initially there are few customers. We want a balance between the storage of apples and the sales of apples because we don't know how many days it will take to finish all the apple stock. In this scenario, we need to pick the model with the higher F1-Score, which is the harmonic mean of Precision and Recall.

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

The higher the F1-Score, the better the model. This metric is used when we want both Precision and Recall to be high.
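The F1-Score is the harmonic mean of Precision and Recall, 2 × Precision × Recall / (Precision + Recall). Here is a minimal sketch for the two models from the apple example; `f1_from_counts` is our own helper name, and it assumes that TP is greater than zero so that no division by zero occurs.

```python
def f1_from_counts(tp, fp, fn):
    """Harmonic mean of precision and recall (assumes tp > 0)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

f1_model_1 = f1_from_counts(tp=68, fp=0, fn=22)
f1_model_2 = f1_from_counts(tp=90, fp=4, fn=0)

print("Model 1 F1-Score:", round(f1_model_1, 3))  # 0.861
print("Model 2 F1-Score:", round(f1_model_2, 3))  # 0.978
```

With these counts, Model 2 has the higher F1-Score, so it offers the better balance between Precision and Recall.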

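Suppose we want our model to be perfectly sure about the old apples so that we can eliminate them from our stock. We will again penalize **False Positives**, as we don't want any old apple to be predicted as fresh. Specificity gives us a measure of that: out of all the apples that are actually old, how many were predicted as old.

$$\text{Specificity} = \frac{TN}{TN + FP}$$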
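Specificity is TN / (TN + FP), which is simply the recall of the negative class. scikit-learn has no dedicated specificity function, so one possible sketch is to call `recall_score` with `pos_label=0`. The label lists below are rebuilt from the counts of the two models, with 1 meaning fresh and 0 meaning old; that encoding is our assumption.

```python
from sklearn.metrics import recall_score

# 10 apples are actually old (0) and 90 are actually fresh (1)
y_true = [0] * 10 + [1] * 90

# Model 1: FP = 0, TN = 10, TP = 68, FN = 22
y_pred_model_1 = [0] * 10 + [1] * 68 + [0] * 22
# Model 2: FP = 4, TN = 6, TP = 90, FN = 0
y_pred_model_2 = [1] * 4 + [0] * 6 + [1] * 90

# Recall of the negative class (pos_label=0) equals specificity
specificity_model_1 = recall_score(y_true, y_pred_model_1, pos_label=0)
specificity_model_2 = recall_score(y_true, y_pred_model_2, pos_label=0)

print("Model 1 specificity:", specificity_model_1)  # 1.0
print("Model 2 specificity:", specificity_model_2)  # 0.6
```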

ROC is a widely used evaluation tool and a frequent topic in machine learning interviews. We are familiar with binary classification tasks, where we have to decide between two categories (Yes/No, Fresh/Old, etc.). Here, our model outputs a probability value that shows its confidence in predicting a particular class. In our earlier example, suppose the model receives an apple's image as input and predicts that it is "**fresh**" with a probability of M% (0 ≤ M ≤ 100). If M is greater than the threshold value (the default threshold is usually 50%), we say that the model predicted the apple as "**fresh**"; otherwise, we say that the model predicted the apple as "**old**". But **what if we change that threshold?**

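We plot the **T**rue **P**ositive **R**ate (TPR, also known as Recall or Sensitivity) on the Y-axis and the False Positive Rate (FPR) on the X-axis for varying threshold values.

$$\text{TPR} = \frac{TP}{TP + FN}, \qquad \text{FPR} = \frac{FP}{FP + TN}$$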

In the diagram below, every dot represents the (FPR, TPR) pair calculated for a certain threshold value. Let's take one example to understand it better. Suppose we first decide to categorize an apple as fresh when the model is ≥ 60% confident that it is fresh. If we decrease this threshold to ≥ 50%, the number of apples classified as positive can only increase or stay the same. From the equations of TPR and FPR, both of these values will therefore increase (or at least not decrease) when we lower the threshold. A better model gains TPR much faster than FPR as the threshold is lowered, so its curve stays closer to the top-left corner of the plot.

With the above logic, we can say that **Model 3** is better than **Model 2**, and Model 2 is better than **Model 1**: as we change the threshold values, the curve of Model 3 stays closer to the top-left corner than that of Model 2, and the curve of Model 2 stays closer than that of Model 1.
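To see how the threshold moves a point along the ROC curve, here is a small sketch. The ten labels and scores are invented for illustration (1 means fresh, 0 means old, and each score is the model's confidence that the apple is fresh). We compute TPR and FPR by hand at the 60% and 50% thresholds, then let scikit-learn's `roc_curve` sweep all thresholds.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical data: true labels and the model's confidence for "fresh"
y_true = np.array([1, 1, 1, 1, 0, 1, 0, 0, 1, 0])
y_score = np.array([0.95, 0.90, 0.80, 0.65, 0.62, 0.58, 0.52, 0.30, 0.25, 0.10])

def tpr_fpr_at(threshold):
    """Classify as fresh when score >= threshold, then return (TPR, FPR)."""
    y_pred = (y_score >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return tp / (tp + fn), fp / (fp + tn)

tpr_60, fpr_60 = tpr_fpr_at(0.60)
tpr_50, fpr_50 = tpr_fpr_at(0.50)
print("Threshold 60%: TPR =", round(tpr_60, 3), "FPR =", fpr_60)
print("Threshold 50%: TPR =", round(tpr_50, 3), "FPR =", fpr_50)

# roc_curve evaluates every useful threshold and returns the curve's points
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print("threshold", th, "-> FPR", round(f, 3), "TPR", round(t, 3))
```

Lowering the threshold from 60% to 50% raises both TPR and FPR for this data. Plotting `fpr` against `tpr` gives the ROC curve.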

Suppose we built a random model to classify our apples as fresh or old. We know that 0 ≤ TPR ≤ 1 and 0 ≤ FPR ≤ 1, so the area under the curve can be at most 1. The ROC curve of a random model follows the line x = y (TPR = FPR) in the above image, which has an area under the curve of 0.5. A perfect model separates fresh and old apples completely, so at some threshold it reaches TPR = 1 with FPR = 0, which gives AUC = 1. So we can say that "**The more the area under the curve, the better the model**".

For a better understanding, we can look at the below image. In image 1, if two classes (positive and negative) are perfectly separable, then AUC will be 1, and in image 3, if classes are perfectly mixed and not separable, then AUC = 0.5.
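These cases can be checked with scikit-learn's `roc_auc_score`. The sketch below uses small invented score lists (1 means fresh, 0 means old): one where the classes are perfectly separable, one where the model gives every apple the same score and so cannot separate them at all, and one in between.

```python
from sklearn.metrics import roc_auc_score

y_true = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0]

# Perfectly separable: every fresh apple scores higher than every old one
scores_perfect = [0.9, 0.8, 0.9, 0.7, 0.2, 0.8, 0.1, 0.3, 0.6, 0.2]
# Uninformative: the same score for every apple
scores_constant = [0.5] * 10
# Partially separable: most fresh apples score higher, but not all
scores_partial = [0.95, 0.90, 0.80, 0.65, 0.62, 0.58, 0.52, 0.30, 0.25, 0.10]

auc_perfect = roc_auc_score(y_true, scores_perfect)
auc_constant = roc_auc_score(y_true, scores_constant)
auc_partial = roc_auc_score(y_true, scores_partial)

print("AUC, perfectly separable:", auc_perfect)            # 1.0
print("AUC, uninformative scores:", auc_constant)          # 0.5
print("AUC, partially separable:", round(auc_partial, 3))  # 0.833
```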

Generally, questions on evaluation metrics are asked when we have reported our models' performance using any of the above metrics. Knowing the answers to the following questions will surely help.

- What evaluation metrics should be used for your project?
- What are the limitations of accuracy as a metric?
- What is the confusion matrix, and why is it considered a base for all other metrics?
- Is it always preferred to have a better F1 score rather than better Precision or recall?
- What is the ROC plot? Name its X and Y axes. Note: This is frequently asked.

This article has covered Accuracy, Confusion Matrix, Precision, Recall, F1-Score, Specificity, ROC, and AUC, which are the most frequently used evaluation metrics for classification models. There can be other evaluation methods, but we have tried to cover the most frequent ones. We have explained every metric with a story of fresh and old apples. We hope you enjoyed it.

Enjoy Learning, Enjoy Algorithms!
