Evaluation Metrics for Classification Models

Classification problems are among the most common problem statements in Machine Learning and Data Science. When we explore real-life industry applications of Machine Learning, classification is widely used by tech giants like Google, Apple, Tesla, Microsoft, Facebook, etc., and a large share of practical problem statements are classification problems. Because of this popularity, new methods appear all the time and challenge the existing ones. But challenge on what basis?

The answer is simple: we compare different approaches by their performance on common grounds, which we call evaluation metrics. Research papers likewise report their results on standard evaluation metrics and compare them against published benchmarks. This article will discuss some of the most common and popular evaluation metrics used to evaluate classification models.

Popular methods covered in this article

  1. Accuracy and its limitations
  2. Confusion Matrix
  3. Precision & Recall
  4. F1-Score
  5. Specificity
  6. Receiver Operating Characteristic Curve (ROC)
  7. Area Under Curve (AUC)

Let's start understanding every method in detail.


Accuracy

Accuracy is a straightforward calculation that is widely used in industry. In classification problems, our machine learning model assigns each input to one of several classes. We count the total number of predictions made by the model and how many of them are correct. In mathematical representation,

             Number of correct predictions
Accuracy = ---------------------------------
              Number of total predictions

from sklearn.metrics import accuracy_score

# Y_act holds the true labels and Y_pred the model's predictions.
# Small illustrative lists; replace them with your own data.
Y_act = [1, 0, 1, 1, 0]
Y_pred = [1, 0, 0, 1, 0]
print("Accuracy = ", accuracy_score(Y_act, Y_pred))

Limitations of Accuracy

Although accuracy is a widely used metric, it has some severe limitations. Suppose we have trained a model to classify images into two classes, "Cat" or "No-Cat". We tested the model on 100 images that all contain cats, and it predicted the "Cat" class every time. So the accuracy from the above formula is (100/100)*100 = 100%. Wow!

But what if our model predicts the "Cat" class for every input? Here is the catch! If we test the same model on a different set of 100 images with no cats, its accuracy would be (0/100)*100 = 0%. On a test set where most images contain cats, such a model would still score high without ever recognizing a "No-Cat" image. So accuracy is reliable on a balanced dataset (roughly equal samples for all classes) and misleading on an unbalanced one, where it must not be the only metric we judge the model by.
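To see the limitation in numbers, here is a minimal sketch in plain Python. The 95/5 class split and the helper name `accuracy` are assumptions made for illustration; they are not from the example above.

```python
# Hypothetical unbalanced test set: 95 "Cat" images and 5 "No-Cat" images.
y_true = ["Cat"] * 95 + ["No-Cat"] * 5

# A "model" that ignores its input and always answers "Cat".
y_pred = ["Cat"] * len(y_true)


def accuracy(actual, predicted):
    """Fraction of predictions that match the true labels."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


overall = accuracy(y_true, y_pred)

# Accuracy restricted to the images that really are "No-Cat".
no_cat_true = [a for a in y_true if a == "No-Cat"]
no_cat_pred = [p for a, p in zip(y_true, y_pred) if a == "No-Cat"]
no_cat_accuracy = accuracy(no_cat_true, no_cat_pred)

print("Overall accuracy:", overall)                   # 0.95
print("Accuracy on No-Cat images:", no_cat_accuracy)  # 0.0
```

The overall score looks excellent even though the model never identifies a single "No-Cat" image, which is why the remaining metrics in this article look at each kind of error separately.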

Confusion Matrix

The Confusion Matrix is one of the best evaluation tools and is considered the basis for all other metrics. Its name comes from what it shows: how often the model confuses one class with another. Its theory can feel confusing at first, so let's understand it thoroughly. The image below shows the components of the confusion matrix.

Confusion matrix used to evaluate the performance of classification models in Machine Learning

Let's take an example to learn these four terms, True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). All other evaluation metrics will be defined using these terms.

Suppose we want to eat an apple, and we are health conscious. Luckily, we have our own farm, and we know which apples are fresh and which are not. But for the customers, we built a machine learning classification model that takes an apple's image as input and predicts whether it is fresh or old. Before deploying this model for customers, we want to check its performance. Treating "fresh" as the positive class, let's define four important terms.

True Positive (TP): This is when our model says an apple is fresh and that the apple is actually fresh.

True Negative (TN): This is when our model says an apple is old and that the apple is actually old.

False Positive (FP): This is when our machine learning model says an apple is fresh, but in reality, that apple is old. This case is popularly known as Type I Error.

False Negative (FN): This is when our machine learning model says an apple is old but in reality, that apple is fresh. This case is popularly known as Type II Error.

from sklearn.metrics import confusion_matrix

y_true = ["fresh", "old", "old", "fresh", "fresh", "fresh"]
y_pred = ["fresh", "old", "fresh", "old", "fresh", "old"]

# List the negative class ("old") first so that ravel() yields
# tn, fp, fn, tp with "fresh" as the positive class.
arr = confusion_matrix(y_true, y_pred, labels=["old", "fresh"])

print("Confusion Matrix =\n", arr)

tn, fp, fn, tp = arr.ravel()

Now, if we have to re-define the accuracy using the same terms above,

                 (TP + TN)
Accuracy = ---------------------
            (TP + TN + FP + FN)

which answers the question: out of all the apples we had, how many were correctly predicted as fresh or old?

We can also use accuracy_score from sklearn.metrics to calculate the accuracy of predictions.
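The four counts and the accuracy formula can also be tallied by hand. The sketch below uses plain Python on the same six apples as the snippet above and assumes "fresh" is the positive class; the constant `POSITIVE` is a name introduced here for illustration.

```python
y_true = ["fresh", "old", "old", "fresh", "fresh", "fresh"]
y_pred = ["fresh", "old", "fresh", "old", "fresh", "old"]

POSITIVE = "fresh"  # assumption: "fresh" is treated as the positive class

tp = tn = fp = fn = 0
for actual, predicted in zip(y_true, y_pred):
    if predicted == POSITIVE and actual == POSITIVE:
        tp += 1  # said fresh, really fresh
    elif predicted != POSITIVE and actual != POSITIVE:
        tn += 1  # said old, really old
    elif predicted == POSITIVE and actual != POSITIVE:
        fp += 1  # said fresh, really old (Type I error)
    else:
        fn += 1  # said old, really fresh (Type II error)

accuracy = (tp + tn) / (tp + tn + fp + fn)

print("TP, TN, FP, FN =", tp, tn, fp, fn)  # 2 1 1 2
print("Accuracy =", accuracy)              # 0.5
```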

Precision and Recall

Now, suppose we have 100 apples. We classified them as "fresh" or "old" using two different classification models. When we compared the predictions of the two models with the true labels, we found,

Model 1 : TP = 68, FN = 22, FP = 0, TN = 10.

Model 2 : TP = 90, FN = 0, FP = 4, TN = 6.

Suppose it's our farm, and we want to store the apples. The problem is that if we store old apples together with fresh ones, the old ones will turn the fresh ones old as well. In such a scenario, we must penalize False Positives, as we don't want our model to put old apples into the fresh category. Precision is a measure for that: out of all the apples predicted as fresh, how many are actually fresh. In the above case of two models, Model 1 is preferred because its Precision is higher (it has no False Positives). In mathematical terms,

                TP
Precision = -----------
              TP + FP

Now take another scenario: we want to sell the apples as soon as possible to make extra profit. Mixing is no longer a problem, but the goal is to penalize False Negatives, because every fresh apple categorized as old hampers the profit. Recall is a measure for that: out of all the apples that are actually fresh, how many the model predicted as fresh. In the above case of two models, Model 2 is preferred because its recall is higher (it has no False Negatives).

             TP
Recall = -----------
           TP + FN

Recall is also called Sensitivity or True Positive Rate (TPR). We can also use the inbuilt functions in Scikit-learn to calculate the Precision and Recall values. The corresponding functions would be:

from sklearn.metrics import precision_score, recall_score

# Y_act holds the true labels and Y_pred the predictions (1 = fresh, 0 = old).
# Small illustrative lists; replace them with your own data.
Y_act = [1, 0, 0, 1, 1, 1]
Y_pred = [1, 0, 1, 0, 1, 0]
print("Precision = ", precision_score(Y_act, Y_pred))
print("Recall = ", recall_score(Y_act, Y_pred))
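As a quick check of the two-model comparison, the sketch below plugs the counts quoted above into the precision and recall formulas. It is plain Python, and the dictionary layout and function names are choices made here for illustration.

```python
# Confusion-matrix counts quoted in the text for the two apple models.
models = {
    "Model 1": {"tp": 68, "fn": 22, "fp": 0, "tn": 10},
    "Model 2": {"tp": 90, "fn": 0, "fp": 4, "tn": 6},
}


def precision(counts):
    """Of all apples predicted fresh, the fraction that are really fresh."""
    return counts["tp"] / (counts["tp"] + counts["fp"])


def recall(counts):
    """Of all apples that are really fresh, the fraction predicted fresh."""
    return counts["tp"] / (counts["tp"] + counts["fn"])


for name, counts in models.items():
    print(f"{name}: precision = {precision(counts):.3f}, "
          f"recall = {recall(counts):.3f}")
# Model 1: precision = 1.000, recall = 0.756
# Model 2: precision = 0.957, recall = 1.000
```

Model 1 wins on precision and Model 2 wins on recall, which matches the storage and selling scenarios described above.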

There could be scenarios where we want both precision and recall to be high. How do we decide then? Let's see!

F1-Score

Suppose we have just started a supermarket, where initially the customers are fewer. We want a balance between the storage of apples and the sales of apples because we are still determining how many days it will take to finish all the apple stocks. In this scenario, we need to pick the model with the higher F1-Score, which is the harmonic mean of the precision and recall values.

           Precision * Recall
F1 = 2 * -----------------------
          (Precision + Recall)

The higher the F1 score, the better the model, and this metric is used when we want both precision and recall to be high. The inbuilt function in Scikit-learn to calculate the F1 score would be:

from sklearn.metrics import f1_score

# Y_act holds the true labels and Y_pred the predictions (1 = fresh, 0 = old).
# Small illustrative lists; replace them with your own data.
Y_act = [1, 0, 0, 1, 1, 1]
Y_pred = [1, 0, 1, 0, 1, 0]
print("F1 Score = ", f1_score(Y_act, Y_pred))
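For the two apple models from the Precision and Recall section, the F1 score can be worked out directly from their counts. This is a small sketch in plain Python; the helper name `f1_from_counts` is introduced here for illustration.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Counts quoted earlier: Model 1 (TP=68, FN=22, FP=0), Model 2 (TP=90, FN=0, FP=4).
f1_model_1 = f1_from_counts(tp=68, fp=0, fn=22)
f1_model_2 = f1_from_counts(tp=90, fp=4, fn=0)

print(f"Model 1 F1 = {f1_model_1:.3f}")  # 0.861
print(f"Model 2 F1 = {f1_model_2:.3f}")  # 0.978
```

Under this balanced view Model 2 comes out ahead, because its small loss in precision costs less than Model 1's larger loss in recall.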


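Specificity

Suppose we want our model to be perfectly sure about the old apples so that we can eliminate them from our stock. We will try to penalize the False Positives as we want to avoid any old apple being predicted as fresh. Specificity, also called the True Negative Rate, gives us a measure of that: out of all the apples that are actually old, how many the model predicted as old.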

                  TN
Specificity = -----------
                TN + FP

There is no dedicated specificity function in the Scikit-learn library, but we can get the value of specificity using a confusion matrix like this:

from sklearn.metrics import confusion_matrix

# Y_act holds the true labels and Y_pred the predictions (1 = fresh, 0 = old).
# Small illustrative lists; replace them with your own data.
Y_act = [1, 0, 0, 1, 1, 1]
Y_pred = [1, 0, 1, 0, 1, 0]
tn, fp, fn, tp = confusion_matrix(Y_act, Y_pred).ravel()

print("Specificity = ", tn / (tn + fp))
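Applied to the two apple models from the Precision and Recall section, the specificity formula gives the following. This is a plain-Python sketch, and the helper name `specificity` is introduced here for illustration.

```python
def specificity(tn, fp):
    """Of all apples that are really old, the fraction predicted old."""
    return tn / (tn + fp)


# Counts quoted earlier: Model 1 (TN=10, FP=0), Model 2 (TN=6, FP=4).
spec_model_1 = specificity(tn=10, fp=0)
spec_model_2 = specificity(tn=6, fp=4)

print("Model 1 specificity =", spec_model_1)  # 1.0
print("Model 2 specificity =", spec_model_2)  # 0.6
```

Model 1 never lets an old apple through as fresh, so it is the safer choice when removing old apples from the stock is the priority.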

ROC: Receiver Operating Characteristic curve

The ROC curve is an essential evaluation tool and a frequent topic in machine learning interviews. We are familiar with binary classification tasks, where we have to decide between two categories (Yes/No, Fresh/Old, etc.). Here, our model outputs a probability value showing its confidence in predicting a particular class. In our earlier example, suppose the model receives an apple's image and predicts that it is "fresh" with a probability of M% (0 ≤ M ≤ 100). If M is greater than the threshold value (a common default is 50%), we say that the model predicted the apple as "fresh"; otherwise, we say that it predicted the apple as "old". But what if we change that threshold?

We plot True Positive Rate (TPR)/Recall/Sensitivity as our Y-axis and False Positive Rate (FPR) as our X-axis for varying threshold values.

                                                       TN          FP
False Positive Rate (FPR) = 1 - Specificity = 1 -  ---------- = ---------
                                                     TN + FP     FP + TN

In the diagram below, every dot represents the (FPR, TPR) pair calculated for a certain threshold value. Let's take one example to understand it better. Suppose we first categorize an apple as fresh only when the model is ≥ 60% confident that it is fresh. If we decrease this threshold to ≥ 50%, the number of apples classified as positive can only stay the same or increase. From the equations of TPR and FPR, both rates therefore stay the same or increase as we decrease the threshold. Models that are least affected by the change in the threshold value, that is, models whose curve reaches a high TPR while the FPR is still low, will be considered better.

How to plot the roc curve for the classification models in machine learning?

With the above logic, we can say that Model 3 is better than Model 2, and Model 2 is better than Model 1: as we change the threshold values, Model 3 varies less than Model 2, and Model 2 varies less than Model 1.
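The effect of moving the threshold can be sketched in a few lines of plain Python. The labels and scores below are made-up values for eight apples (1 = fresh, 0 = old), and `tpr_fpr` is a helper name introduced here for illustration.

```python
# Hypothetical data: true labels and the model's predicted probability of "fresh".
y_true = [1, 1, 1, 1, 0, 1, 0, 0]
y_score = [0.95, 0.85, 0.70, 0.55, 0.52, 0.40, 0.30, 0.10]


def tpr_fpr(y_true, y_score, threshold):
    """Predict positive when score >= threshold, then return (TPR, FPR)."""
    pairs = list(zip(y_true, y_score))
    tp = sum(1 for t, s in pairs if t == 1 and s >= threshold)
    fn = sum(1 for t, s in pairs if t == 1 and s < threshold)
    fp = sum(1 for t, s in pairs if t == 0 and s >= threshold)
    tn = sum(1 for t, s in pairs if t == 0 and s < threshold)
    return tp / (tp + fn), fp / (fp + tn)


# Each threshold produces one point (FPR, TPR) on the ROC curve.
for threshold in (0.6, 0.5, 0.2):
    tpr, fpr = tpr_fpr(y_true, y_score, threshold)
    print(f"threshold = {threshold}: TPR = {tpr:.2f}, FPR = {fpr:.2f}")
# threshold = 0.6: TPR = 0.60, FPR = 0.00
# threshold = 0.5: TPR = 0.80, FPR = 0.33
# threshold = 0.2: TPR = 1.00, FPR = 0.67
```

Lowering the threshold never lowers either rate; each step trades some extra false positives for extra true positives.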

To plot the ROC curve for the predictions provided by our ML model, we can use the RocCurveDisplay class from the scikit-learn library.

from matplotlib import pyplot as plt
from sklearn.metrics import RocCurveDisplay

# y_act holds the true labels (1 = fresh, 0 = old) and y_score the predicted
# probability of the positive class. Illustrative values; use your own data.
y_act = [1, 1, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.55, 0.3, 0.1]

RocCurveDisplay.from_predictions(y_act, y_score)
plt.show()


AUC: Area Under Curve

Suppose we built a random model to classify our apples as fresh or old. We know that 0 ≤ TPR ≤ 1 and 0 ≤ FPR ≤ 1, so the ROC curve lies inside a unit square and the area under it can be at most 1. A random model has roughly equal TPR and FPR at every threshold, which is the line x = y (TPR = FPR) in the above image, with an area under the curve of 0.5. If we have built a perfect model, it scores every fresh apple above every old one, and eventually, it will be a case where AUC = 1. So we can say, "The larger the area under the curve, the better the model".

AUC (Area under curve) used to evaluate the classification models in machine learning

For a better understanding, we can look at the image below. In image 1, the two classes (positive and negative) are perfectly separable, so AUC is 1; in image 3, the classes are completely mixed and not separable, so AUC = 0.5.

Correlation of AUC with the predictions to showcase the working of AUC and ROC curves
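One way to check these two extremes numerically is to use the ranking interpretation of AUC: it equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The sketch below is plain Python on made-up scores, and `auc_by_ranking` is a helper name introduced here for illustration; it is not a scikit-learn function.

```python
def auc_by_ranking(y_true, y_score):
    """AUC as the share of positive/negative pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 1, 0, 0, 0]  # hypothetical: 1 = fresh, 0 = old

# Perfectly separable: every fresh apple scores above every old apple.
auc_separable = auc_by_ranking(labels, [0.9, 0.8, 0.7, 0.3, 0.2, 0.1])

# No separation at all: the model gives every apple the same score.
auc_mixed = auc_by_ranking(labels, [0.5] * 6)

print("Separable classes AUC =", auc_separable)  # 1.0
print("Inseparable classes AUC =", auc_mixed)    # 0.5
```

The pairwise loop is quadratic in the number of samples, so it is meant only as an explanation of what the area represents, not as an efficient implementation.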

Possible Interview Questions

Interviewers generally ask about evaluation metrics once we have reported our model's performance using any of the metrics above. Knowing the answers to the following questions will surely help.

  • What evaluation metrics should be used for your project?
  • What problems can the accuracy metric suffer from?
  • What is the confusion matrix, and why is it considered a base for all other metrics?
  • Is it always preferred to have a better F1 score rather than better Precision or recall?
  • What is the ROC plot, and what are its X and Y axes? Note: This is frequently asked.


This article has covered Accuracy, Confusion Matrix, Precision, Recall, F1-Score, Specificity, ROC, and AUC, which are among the most frequently used evaluation metrics for classification models. There are other evaluation methods, but we have tried to cover the most frequent ones. We have explained every metric through a story of fresh and old apples. We hope you enjoyed it.

Enjoy Learning, Enjoy Algorithms!
