Logistic Regression in Machine Learning

Logistic Regression is one of the most widely used machine learning algorithms in both industry and academia. It is a supervised learning algorithm in which the target variable is categorical, such as positive or negative, or Type A, B, or C. Although its name contains the term "regression", it is used to solve classification problems.

According to the Kaggle survey of 2021, Logistic Regression is the most used algorithm for solving classification problems, and there are some practical reasons for that. In this article, we will discuss this algorithm and the reason for its popularity.

Key takeaways from this blog

  • What is Logistic Regression?
  • Why can we not fit a linear regression model on the classification problems?
  • How to tweak Linear Regression to form Logistic Regression?
  • What is a decision boundary in Logistic Regression?
  • What is the mathematics behind the loss function of Logistic Regression?
  • What are the types of Logistic Regression?
  • Why is Logistic Regression the most used algorithm?
  • Python-based implementation of Logistic Regression.
  • Real-life industrial applications of Logistic Regression.
  • Possible interview questions on Logistic Regression.

What is Logistic Regression?

Logistic Regression is similar in nature to the Linear Regression algorithm, except that it predicts categorical target variables instead of the continuous ones used in Linear Regression. It is a classical Machine Learning algorithm that requires supervised (labeled) data to solve classification problems. The target values Y in Logistic Regression are treated as probability values, and it is a parametric method because it learns a fixed set of parameters (weights and a bias) whose number does not grow with the amount of training data.

Why is Linear Regression not used for classification problems?

Suppose we have two classes (class 1 and class 2), as in the image below. The actual values of the target variable are binary, where Y = 0 indicates class 1 and Y = 1 indicates class 2. There are two main reasons why we cannot fit a linear regression model to classification tasks:

  • When we fit a linear regression model on this dataset, its output is not confined to the range between 0 and 1. But the target values are probabilities (let's say p(X)), so we cannot allow the model to produce values greater than 1 or less than 0 (as probability values lie in the range [0, 1]). That's why we cannot use Linear Regression here.
  • Suppose the data is highly imbalanced towards one class, i.e., the number of samples of class 1 >> the number of samples of class 2. In such a case, the fitted line will be pulled towards class 1, and accuracy will suffer.

So, we do not prefer to use Linear Regression for classification problems.
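
The first point can be checked with a small sketch. The code below fits an ordinary least-squares line to a made-up binary dataset (the numbers and the helper name fit_line are assumptions for illustration, not part of the article's dataset) and shows that the fitted line already leaves the [0, 1] range on the training points themselves.

```python
# Ordinary least squares on a binary target, written out by hand for one feature.
# The data is invented purely for illustration.

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return intercept, slope

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]   # binary labels

intercept, slope = fit_line(xs, ys)

# Predictions of the straight line at the smallest and largest training inputs
low_pred = intercept + slope * xs[0]
high_pred = intercept + slope * xs[-1]

print("prediction at x=1:", low_pred)    # falls below 0
print("prediction at x=8:", high_pred)   # rises above 1
```

Neither value can be read as a probability, which is the problem the logistic function addresses in the next section.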

Linear vs logistic regression for a binary classification

How to change Linear Regression to Logistic Regression?

If we remember, in Linear Regression, we try to learn the weight and bias parameters that represent the output variable in the form of:

Linear equation analogy

In Logistic Regression, we will use the same analogy of learning the parameters. To avoid the failures of Linear Regression, we fit the probability function p(X) that can only have values between 0 and 1. There are many such functions, but in "Logistic Regression", we use the logistic function.

Probability of X with logistic function

After arranging a little bit:

Rearrangement of equation

Taking the logarithm of both sides:

Logarithm on both sides
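
Since the equation images are not reproduced here, the three steps above can be written out as follows. This is a reconstruction under assumptions: a single feature X, with β0 and β1 standing for the bias and weight (the article only names the parameter β later on).

```latex
\begin{align}
p(X) &= \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}} \\
\frac{p(X)}{1 - p(X)} &= e^{\beta_0 + \beta_1 X} \\
\log\!\left(\frac{p(X)}{1 - p(X)}\right) &= \beta_0 + \beta_1 X
\end{align}
```

The left-hand side of the last line is the log of the odds, and the right-hand side is the same linear form used in Linear Regression.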

This looks more like a Linear Regression problem, where we can fit the logit function. The probability values are transformed using the logit function (also known as the log-odds function) so that the problem takes the form of a linear regression problem.

logit function

If X = [x1, x2, …, xn], then we are trying to map

Target function

We can say that Linear Regression fits a linear function, whereas Logistic Regression fits a sigmoid function.

Representation of logit
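
To make the relationship between the two functions concrete, here is a minimal sketch in plain Python. The function names sigmoid and logit are our own choice; the point is only that the logit undoes the sigmoid, so a linear score z and a probability p carry the same information.

```python
import math

def sigmoid(z):
    """Map any real-valued score z to a value strictly between 0 and 1."""
    # Fine for moderate z; very large negative z would overflow math.exp.
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log of the odds p / (1 - p); defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

for z in [-3.0, 0.0, 0.5, 2.0]:
    p = sigmoid(z)
    # logit(sigmoid(z)) recovers z up to floating-point rounding
    print(f"z = {z:5.1f}  ->  p = {p:.4f}  ->  logit(p) = {logit(p):.4f}")
```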

What is a decision boundary in Logistic Regression?

We said that Logistic Regression predicts categorical variables, but the equations above predict the sigmoid function, which is continuous. How, then, do we map these predictions to classes?

Sigmoid

Here comes the role of the decision boundary. Suppose our logistic regression model is trying to fit a categorical variable with values 0 and 1. If we set the decision threshold to 0.5, then whenever the probability p(X) ≥ 0.5, the sample is mapped to "1", otherwise to "0". This 0.5 is the default threshold for Logistic Regression, and we can change it depending on the problem statement and our requirements.
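
The thresholding step can be sketched in a few lines. The helper name classify and the probability values are hypothetical, chosen only to show how moving the threshold changes the predicted labels.

```python
def classify(probabilities, threshold=0.5):
    """Turn predicted probabilities into 0/1 labels using a decision threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

# Made-up predicted probabilities p(X) for five samples
probs = [0.10, 0.45, 0.50, 0.62, 0.91]

default_labels = classify(probs)                 # threshold 0.5
strict_labels = classify(probs, threshold=0.7)   # a stricter threshold

print(default_labels)
print(strict_labels)
```

A stricter threshold labels fewer samples as "1", which can be useful when false positives are more costly than false negatives.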

Loss Function of Logistic Regression

Why can't we use RMSE or MSE in Logistic Regression?

One significant difference between Linear and Logistic Regression is the loss function. Linear Regression uses RMSE (Root Mean Squared Error) or the sum of squared errors. Logistic Regression cannot use the same, because when the squared error is applied to the sigmoid output, the loss function becomes non-convex in the parameters, and gradient-based optimization can get stuck in a local optimum.
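
A small numerical check can illustrate the non-convexity. The sketch below assumes the simplest possible setting, a single training point (x = 1, y = 1) and a model with one weight w and no bias, and uses the fact that a convex function never lies above the chord joining two of its points. All names here are our own.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def squared_error(w, x=1.0, y=1.0):
    """Squared error of the sigmoid prediction for one training point."""
    return (y - sigmoid(w * x)) ** 2

def log_loss(w, x=1.0, y=1.0):
    """Negative log-likelihood of the same prediction."""
    p = sigmoid(w * x)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def midpoint_gap(f, a, b):
    """f at the midpoint minus the average of f at the endpoints.

    For a convex function this is never positive.
    """
    return f((a + b) / 2) - (f(a) + f(b)) / 2

mse_gap = midpoint_gap(squared_error, -4.0, 0.0)
ll_gap = midpoint_gap(log_loss, -4.0, 0.0)

print("squared error gap:", mse_gap)  # positive: convexity is violated
print("log loss gap:", ll_gap)        # negative: consistent with convexity
```

On this interval the squared-error curve bulges above its chord, while the log loss stays below it.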

Source: Pinterest

Why Maximum-likelihood?

To avoid the problems of RMSE and MSE, we adopt maximum likelihood for this type of problem. In Maximum Likelihood Estimation (MLE), we first assume a probability distribution for our observed data. If we recall the Gaussian distribution, the mean and variance were the parameters controlling the probability of the observed data under the Gaussian PDF.

Similarly, some parameters will be involved in our "assumed PDF". In MLE, we try to find the values of those parameters that make the observed data most probable under the assumed PDF.
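
As a toy illustration of this idea, assume the observed data are binary outcomes drawn from a Bernoulli distribution with one unknown parameter p. The sketch below (made-up observations, a simple grid search rather than any particular optimizer) looks for the p that makes the observations most probable.

```python
import math

# Made-up binary observations: five 1s and three 0s
observations = [1, 1, 0, 1, 0, 1, 1, 0]

def log_likelihood(p, data):
    """Log-probability of the data under a Bernoulli distribution with parameter p."""
    return sum(math.log(p) if y == 1 else math.log(1 - p) for y in data)

# Try p = 0.001, 0.002, ..., 0.999 and keep the most likely one
candidates = [i / 1000 for i in range(1, 1000)]
best_p = max(candidates, key=lambda p: log_likelihood(p, observations))

print("most likely p:", best_p)
print("fraction of 1s:", sum(observations) / len(observations))
```

The best parameter coincides with the observed fraction of 1s. Logistic Regression follows the same principle, except that p depends on the input through the sigmoid, so the parameters being searched for are the weights.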

Each point in the (Y*-x) scale is mapped to the (Y-x) scale in maximum likelihood.

Y predicted

The y values (in the y-X scale) can be computed using the equation above, and the likelihood of the y-values (or the log-likelihood) can be calculated. The value y gives the probability of the observation belonging to the positive class, and consequently the negative class will have a probability of (1-y).

With the name of MLE, it is clear that we need to maximize something, but what if we multiply it with -1? Then we need to minimize it, and with this hypothesis, we design our cost function for Logistic Regression.

Maximum likelihood

The negative of the likelihood (or of the log-likelihood) defined above is the cost function to be minimized, and the -ve sign in the expression above makes sure of that. In simpler terms, if we focus on the MLE part (without the -ve sign),

Maximum likelihood 2

P(Y | X, β) is a conditional probability that represents the probability of Y given that the values of the input X and the parameter β are already known. If we take the logarithm on both sides and then multiply it with -1, then

logarithm arrangements

final cost
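
Because the equation images are missing here, the steps can be written out in their standard form. The notation is assumed: N training samples, Y_i for the true label (0 or 1) of sample i, and p_i = p(X_i) for the predicted probability of the positive class. Averaging over N in the last line is a common convention rather than something the text above states.

```latex
\begin{align}
P(Y \mid X, \beta) &= \prod_{i=1}^{N} p_i^{\,Y_i} \, (1 - p_i)^{\,1 - Y_i} \\
-\log P(Y \mid X, \beta) &= -\sum_{i=1}^{N} \Big[ Y_i \log p_i + (1 - Y_i) \log (1 - p_i) \Big] \\
J(\beta) &= -\frac{1}{N} \sum_{i=1}^{N} \Big[ Y_i \log p_i + (1 - Y_i) \log (1 - p_i) \Big]
\end{align}
```

For a sample with Y_i = 1 only the first term survives, and for Y_i = 0 only the second, so each sample is penalized by the negative log of the probability assigned to its true class.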

Loss function source researchgate

The loss function for the logistic regression algorithm is unique and essential to understanding. That's why we emphasized this section mathematically.
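
The cost function can also be sketched in plain Python. This is a minimal version under assumptions: labels are 0 or 1, predictions are probabilities of the positive class, and a small eps (our own addition) clips predictions away from exactly 0 and 1 so the logarithm stays finite.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood of binary labels given predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

labels = [1, 0, 1]

confident_and_right = log_loss(labels, [0.9, 0.1, 0.8])
confident_and_wrong = log_loss(labels, [0.2, 0.7, 0.3])

print("loss for good predictions:", confident_and_right)
print("loss for poor predictions:", confident_and_wrong)
```

Predictions that put high probability on the true class give a small loss, while confident mistakes are penalized heavily.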

Types of Logistic Regression

Based on the nature of target variables, we can categorize logistic Regression into three categories:

  • Binary/Binomial Logistic Regression: The target variable can have the values in the binary format. E.g., (+ve and -ve), (email spam, non-spam), (black and white).
  • Multinomial Logistic Regression: The target variable can have more than two output classes with no inherent order. E.g., (+ve, -ve, 0), (black, white, gray).
  • Ordinal Logistic Regression: The target variables can have >2 types of output classes but in an ordered manner. E.g., (Movie rating from 1 to 5).

Inherently, Logistic Regression solves a binary classification problem, but we can also use it to solve classification problems with more than two classes. In such a scenario, we treat every class label as a separate binary classification problem. Let's take one example where the task is to classify the image of a ball into three color classes: red, green, and blue.

To solve this problem, we solve a binary classification problem for each of the three classes. We take the input image of the ball and predict the probability of the image being "red" or "not red", "green" or "not green", and "blue" or "not blue". We treat the predicted probabilities as the model's confidence. The class with the highest probability/confidence is the class predicted by the model.
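
The final selection step can be sketched as follows. The probabilities are hypothetical outputs of three separate binary classifiers (one per color), and the function name is our own.

```python
def predict_one_vs_rest(class_probabilities):
    """Pick the class whose binary classifier reports the highest probability."""
    return max(class_probabilities, key=class_probabilities.get)

# Hypothetical outputs of the three binary models for one image.
# They come from independent models, so they need not sum to 1.
ball_scores = {"red": 0.15, "green": 0.70, "blue": 0.40}

predicted_color = predict_one_vs_rest(ball_scores)
print(predicted_color)
```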

Why is Logistic Regression the most used algorithm?

There are many advanced algorithms in ML, but still, people love to use Logistic Regression for classification or Linear Regression for regression problems. Reasons for that are:

  • These models are easy to explain to customers or stakeholders. The more explainable algorithm gains more trust.
  • These models are less complex as compared to other high-level algorithms. They provide the predictions in real-time and hence can be deployed on smaller footprint devices.

Let's move towards the implementation.

Python implementation of Logistic Regression

Step 1: Importing necessary libraries

  • pandas for creating a data frame used to train and test the model.
  • matplotlib for plotting scattered data points and fitted curves.
  • train_test_split from sklearn.model_selection for splitting the dataset into train and test sets.
  • LogisticRegression from sklearn.linear_model for performing the classification operations.
  • confusion_matrix from sklearn.metrics to evaluate the correctness of the model.

# Importing the required libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

Step 2: Dataset loading and explanation

The dataset used for this project is a college_admit dataset, which gives observations of students who were and weren't admitted to a college based on their 'sat' score, 'gpa', and the number of 'recommendations' they have. There are 55 observations and three features used to decide whether a student gets admitted or not.

path = 'college_admit.csv'
def data_set(path):
    data = pd.read_csv(path, header = 0)
    df = pd.DataFrame(data, columns= ['sat','gpa',
                                    'recommendations','admitted'])
    
    X = df[['sat','gpa', 'recommendations']]
    y = df['admitted']
    print("DataFrame : ",df)
    X_train,X_test,y_train,y_test = train_test_split(X,y, test_size=
                                         0.2, random_state=0)
    # Split the data into train and test sets in an 80:20 ratio,
    # which gives us the splits required for training and testing
    return (X_train,X_test,y_train,y_test)

The data frame can be printed using the function data_set( ) above, which returns the training and testing dataset.

Data snippet logistic regression concept

Step 3: Training of Logistic Regression model

The model can be trained and returned using the function logistic_reg( ), which takes the output from the function data_set( ) as input, and produces a fully trained logistic regression model.

# logistic regression model building
def logistic_reg(dataset):
  
    X_train,X_test,y_train,y_test = dataset
    
    model = LogisticRegression()
    model.fit(X_train,y_train)
    
    return model

The function can return the model with its specifications.

Model formation

Step 4: Evaluation of the trained model

As stated earlier, logistic regression is a classification algorithm, and popular metrics for evaluating classification models include accuracy, precision, and recall. A complete list can be found here. We will compute and plot the confusion matrix to evaluate the classification performance. The confusion_matrix function is imported from the sklearn.metrics module. It takes the actual values of the test data (i.e., y_test) and the values predicted by the model on the test data (i.e., y_pred) and returns a 2x2 confusion matrix.

if __name__ == "__main__":
  # call the data setter function created above
  dataset = data_set(path)
  
  #split the data into required training and testing sets
  X_train,X_test,y_train,y_test = dataset
  
  #train the model
  model = logistic_reg(dataset)
  
  #prediction using the trained model
  y_pred = model.predict(X_test)
  
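  # calculate the confusion matrix and plot it
  confusion_mat = confusion_matrix(y_test,y_pred, labels=None)
  print("Confusion Matrix = ",confusion_mat)
  
  # plotting the confusion matrix
  fig,ax = plt.subplots()
  ax.set_title("Confusion Matrix")
  cm_ax = ax.matshow(confusion_mat)
  fig.colorbar(cm_ax)
  # pin the tick positions before labelling them; confusion_matrix orders
  # the classes by sorted label value, so check that this label order
  # matches how 'admitted' is encoded in your data
  ax.set_xticks([0, 1])
  ax.set_yticks([0, 1])
  ax.set_xticklabels(['yes','no'])
  ax.set_yticklabels(['yes','no'])
  plt.show()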

The confusion matrix can be used to compute the model accuracy as:

Confusion matrix

Accuracy

Accuracy for our model is 9/11 ≈ 0.818
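
The accuracy calculation itself is just the sum of the diagonal of the confusion matrix divided by the total number of test samples. The sketch below uses hypothetical counts: the article's actual matrix is not reproduced here, so the values are chosen only to be consistent with 9 correct predictions out of 11.

```python
def accuracy_from_confusion(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Hypothetical 2x2 confusion matrix: rows are actual classes, columns are predictions
example_matrix = [[4, 1],
                  [1, 5]]

accuracy = accuracy_from_confusion(example_matrix)
print(round(accuracy, 4))
```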

Real-life industrial applications of Logistic Regression

There are many industrial applications of Logistic Regression. Some popular ones are:

  • Predicting the rating from the sentiment of the textual movie reviews.
  • Predicting the probability of any patient developing a particular disease.
  • Predicting the handwritten digits using images.

Quick Note

  • Logistic Regression can predict the categorical dependent variable using a given set of independent variables.
  • Logistic Regression is used to solve classification problems by predicting the probability that each observation belongs to a class.
  • Maximum likelihood estimation is used to derive the objective function.
  • Logistic Regression does not require a linear relationship between the dependent and independent variables, but it does assume a linear relationship between the independent variables and the log-odds of the outcome.

Possible Interview Questions

Logistic Regression is the most used classification algorithm, and hence it is prevalent across the machine learning industry. Interviewers love to check the basic concepts around this algorithm. Some interview questions on this topic can be:

  • Why is Logistic Regression a classification algorithm?
  • Can we solve the classification problem using Linear Regression? If Yes, How? If No, what can be the technical challenges?
  • What are the types of Logistic Regression?
  • What is the cost function associated with Logistic Regression?
  • Why can't we use MSE or other traditional cost functions instead of the log loss functions?
  • What is the default value of the decision threshold? When do we need to change it?

Conclusion

This blog presented a detailed look at Logistic Regression, one of the most used algorithms in industry. We learned how it differs from the conventional linear regression algorithm. After that, we focused on Logistic Regression's loss function (cost function), which sets it apart from other machine learning algorithms. Finally, we worked through a hands-on Logistic Regression example and built a model to predict the probability of getting admitted. We hope you have enjoyed the article.

Enjoy Learning! Enjoy Algorithms!

© 2020 EnjoyAlgorithms Inc.

All rights reserved.