Linear Regression: A complete understanding

In the machine learning and data science field, linear regression falls under the category of supervised learning. It is one of the most popular algorithms and has both a statistical and a machine learning nature. It models the linear relationship between a dependent variable and a set of independent variables. The dependent variable (y) is the quantity to be predicted, and the independent variables (X) are the quantities used to build that relationship.

Key takeaways from this blog

In this article, we will cover the following points in detail:

  1. What is Linear regression, and how does it belong to both statistics and machine learning?
  2. What are the types of Linear regression?
  3. How to form your dataset to fit the Linear Regression model best?
  4. A Python-based implementation of Linear regression.
  5. Possible interview questions on linear regression.

So, let’s start without any further delay.

A model is linear when it is linear in the parameters relating the input to the output. The dependency need not be linear in terms of the inputs for the model to be a linear model. For example, all the equations below are linear regression models, because each expresses the output as a linear combination of the model parameters.

Linear regression examples
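For instance (illustrative equations; the original figure is not reproduced here), all of the following are linear models because each is linear in the parameters βi, even when it is non-linear in x:

  • y = β0 + β1x
  • y = β0 + β1x + β2x²
  • y = β0 + β1·log(x)

By contrast, y = β0·e^(β1x) is not a linear model, since it is non-linear in the parameter β1.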

In a machine learning problem, linearity means the dependent quantity is a linear combination of a set of independent features. Let X = [x1, x2, …, xn] be the set of independent features and y the dependent output quantity; the objective (for multiple linear regression) is to map X → y as,

y = β0 + β1x1 + β2x2 + … + βnxn + ξ

The βi’s are the coefficients (also called weights) parameterizing the space of linear functions mapping X to y, and ξ is the error due to fitting imperfection. The problem dealt with by linear regression is finding the weight vector [β0, β1, β2, …, βn] that minimizes a loss function given the quantities X and y.

Residuals

In the figure, the fitted line has an error with respect to each data point. Linear regression varies the parameters to minimize the cumulative error, captured by a loss function (also called an error function or cost function). Several loss functions have been defined in the machine learning literature, and the most widely used is the Sum of Squared Residuals (SSR). To get the best set of weights, the sum of squared residuals over all observations (j = 1, 2, …, m) should be minimum. This approach is termed the method of Ordinary Least Squares (OLS).

SSR = Σ (Yj − f(Xj))², where the sum runs over all observations j = 1, 2, …, m

The differences (Yj − f(Xj)) for all observations j = 1, 2, …, m are called the residuals. Regression is about determining the weights consistent with the smallest sum of squared residuals. Thus, taking the SSR as the objective function to be minimized, we compute the weight vector [β0, β1, β2, …, βn] corresponding to the minimum SSR value. This is an optimization problem that many approaches can solve, such as gradient descent, linear algebra, and differential calculus.
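As a minimal sketch of the linear-algebra route (not part of the original article; the data below is made up purely for illustration), the least-squares weights can be computed directly with np.linalg.lstsq:

import numpy as np

# Illustrative data: y depends roughly linearly on two features plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                      # independent features x1, x2
y = 3.0 + 2.0*X[:, 0] - 1.5*X[:, 1] + 0.1*rng.normal(size=100)

# Prepend a column of ones so the intercept β0 is learned as well
X_design = np.column_stack([np.ones(len(X)), X])

# Solve the least-squares problem: minimize ||y - X_design @ beta||²
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta)                      # approximately [3.0, 2.0, -1.5]

residuals = y - X_design @ beta
print((residuals**2).sum())      # the SSR at the optimum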

The goodness of fit of linear regression is usually determined by the R² score. R² is a statistical measure of how close the data points are to the fitted regression curve. It can also be interpreted as the percentage of the variation in the response variable that is explained by the linear model. The higher the R², the better the model fits the data.
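In terms of the SSR defined above, R² compares the model's squared error with the error of always predicting the mean ȳ of the response:

R² = 1 − SSR / SST, where SST = Σ (Yj − ȳ)²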

What should be kept in mind while preparing the data for Linear Regression

As this algorithm has existed for more than 200 years, much research has been done on it. The literature suggests several heuristics to keep in mind while preparing the dataset. As discussed, Ordinary Least Squares is the most common technique for fitting linear regression models, so one can try the methods below and check whether the R² score improves.

  • Gaussian-distributed data is more reliable: One can use different transformation techniques to bring the input data closer to a Gaussian distribution, as the predictions made by linear regression in such a case are more reliable.

Gaussian distribution

  • Outliers should be removed: In OLS (ordinary least squares), we sum the squared residuals over all the data samples. When outliers are present, they bias the fitted weights, and eventually the model performs poorly.

Outlier removal

  • Input should be rescaled: Providing rescaled inputs, either via standardization or normalization, can produce more accurate results. This is especially useful in multiple linear regression, where features often live on very different scales (see the short sketch after this list).
  • Transformation of data for linearity between input and output: It should be ensured that the input and output are linearly related; otherwise, a data transformation such as logarithmic scaling can make them linear.
  • Collinearity should be removed: The model can overfit when input features are collinear, so with the help of correlation values, the most correlated features can be removed for better generalization.
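As a small illustrative sketch of the rescaling point (assuming scikit-learn is available; the feature values below are made up), standardization can be applied with StandardScaler before fitting:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Made-up features on very different scales (e.g., age in years, income in dollars)
X = np.array([[25, 40000], [32, 65000], [47, 120000], [51, 90000]], dtype=float)
y = np.array([1.2, 2.3, 4.1, 3.6])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # each column now has zero mean and unit variance

model = LinearRegression().fit(X_scaled, y)
print(model.coef_, model.intercept_)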

Let’s get some hands-on experience and implement the Linear Regression model.

Python Implementation of Linear Regression

Step 1

The necessary libraries imported are

  • numpy for algebraic operations
  • matplotlib for plotting scattered data points and fitted curve
  • LinearRegression from sklearn.linear_model for performing the regression operations
  • Metrics such as mean_squared_error and r2_score to evaluate the goodness of fit of the performed regression
  • Preprocessing features such as PolynomialFeatures to obtain higher dimensions of the input to process polynomial regression.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures

Step 2

A dataset of curvilinear nature is created to demonstrate the regression analysis. A total of 50 data points are generated from a polynomial equation, to which normally distributed random noise is added so that some fitting error remains even when a higher-degree model is chosen.

Fabricated data

The term *np.random.normal(mean, standard deviation, number of points)* draws samples from a normal distribution with the chosen mean and standard deviation.

# creating and plotting a dataset with a curvilinear relationship
np.random.seed(0)
x = np.arange(-5, 5, 0.2) + 0.1*np.random.normal(0, 1, 50)
y = -1*(x**4) - 2*(x**3) + 13*(x**2) + 14*(x) - 24 + 10*np.random.normal(-1, 1, 50)
plt.figure(figsize=(10, 5))
plt.scatter(x, y, color='red', s=25, label='data')
plt.xlabel('x', fontsize=16)
plt.ylabel('y', fontsize=16)
plt.legend()
plt.grid()
plt.show()

Scatter plot

Step 3

To perform a polynomial regression analysis using the linear regression function, the dimensionality of the input needs to be changed. The function poly_data takes the original data and the degree of the polynomial and generates additional features per the degree to perform the regression.

def poly_data(x,y,degree):
    x = x[:,np.newaxis]
    y = y[:,np.newaxis]
    polynomial_features = PolynomialFeatures(degree=degree)
    x_poly = polynomial_features.fit_transform(x)
    return (x_poly,y)

For example, poly_data(x,y,2) will generate a second order of the polynomial by mapping X → [1,x, x²], increasing the number of features for performing multiple regression.
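As a quick illustrative check (using the x and y arrays created in Step 2), the transformed feature matrix gains one column per polynomial term plus the bias column:

x_poly, y_col = poly_data(x, y, 2)
print(x_poly.shape)   # (50, 3): columns correspond to [1, x, x²]
print(y_col.shape)    # (50, 1)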

Step 4

PolyLinearRegression takes the data generated by the poly_data function, unpacks it, fits a LinearRegression model, and returns the model. The RMSE and R² scores computed for the predicted output are attached to the model: since the model is just a Python object, it can store these values as attributes quite easily.

def PolyLinearRegression(data, degree=1):
    x_poly,y = data
    model = LinearRegression()
    model.fit(x_poly, y) #it will fit the model
    y_poly_pred = model.predict(x_poly) #prediction
    rmse = np.sqrt(mean_squared_error(y,y_poly_pred))
    r2 = r2_score(y,y_poly_pred)    
    model.degree = degree
    model.RMSE = rmse
    model.r2 = r2
    return model
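For example (an illustrative call using the x and y arrays from Step 2; the full driver script appears in Step 6), a third-degree fit could be inspected as:

model_3 = PolyLinearRegression(poly_data(x, y, 3), degree=3)
print(model_3.degree, model_3.RMSE, model_3.r2)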

Step 5

Plotting the result of the regression analysis is done using the function Regression_plots, which takes the input data, the model, and the axis of the subplot and draws the fitted curve for a regression model of a given order.

def Regression_plots(data, model, axis):
    x, y = data
    # column 1 of x holds the original feature values; column 0 is the bias term
    axis.plot(x[:, 1], model.predict(x), color=color[model.degree - 1],
              label=str("Model Degree: %d" % model.degree)
                    + str("; RMSE: %.3f" % model.RMSE)
                    + str("; R2 Score: %.3f" % model.r2))
    axis.legend()

Step 6

Finally, we compile the functions and the plots to perform the regression analysis of the fourth-order polynomial data by varying the degree of the fitted polynomial. The degree is varied from 1 to 4, and the RMSE and R² score corresponding to each degree are shown in the plot.

if __name__ == "__main__":
    np.random.seed(0)
    x = np.arange(-5, 5, 0.2) + 0.1*np.random.normal(0, 1, 50)
    y = -1*(x**4) - 2*(x**3) + 13*(x**2) + 14*(x) - 24 + 10*np.random.normal(-1, 1, 50)
    _, axis = plt.subplots()
    color = ['black', 'green', 'blue', 'purple']   # one curve color per degree (1 to 4)
    axis.grid()
    axis.set_xlabel('x')
    axis.set_ylabel('y')
    axis.scatter(x, y, color='red', s=25, label='data')
    axis.set_title('Linear Regression')
    for degree in range(1, 5):
        data = poly_data(x, y, degree=degree)
        model = PolyLinearRegression(data, degree=degree)
        Regression_plots(data, model, axis)
    plt.show()

The RMSE and R² scores are shown in the plot. We observe that as the model degree increases, the RMSE decreases and the R² score improves, meaning the higher-degree models fit the data better than the lower-degree ones.

Comparison of RMSE for different degree polynomials

Hyperparameter for Linear regression

If we keep increasing the degree of the polynomial, we will eventually run into the overfitting problem. In our case we fabricated the data, so we know that a polynomial of degree 4 fits it well. In real scenarios, however, the right degree cannot be guessed just by looking at the scatter plot of the data. We therefore keep increasing the degree, and when the model overfits, we use regularization techniques to cure it. To apply regularization, a hyperparameter lambda needs to be tuned.
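As a minimal sketch (not part of the original walkthrough), one option is Ridge regression from scikit-learn, whose alpha parameter plays the role of lambda; the degree 8 used below is a deliberately over-flexible choice for illustration:

from sklearn.linear_model import Ridge

x_poly, y_col = poly_data(x, y, 8)   # deliberately too high a degree
ridge = Ridge(alpha=1.0)             # alpha acts as lambda: larger means stronger shrinkage
ridge.fit(x_poly, y_col)
print(r2_score(y_col, ridge.predict(x_poly)))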

Possible Interview Questions

Linear regression is one of the most used algorithms in the machine learning and data science field. In almost every interview for machine learning engineer or data scientist positions, interviewers ask questions about it to test the candidate's basic understanding of how the algorithm actually works. Some of those questions can be:

  1. What is a Linear Regression algorithm, and Why do we say it is a Machine Learning algorithm?
  2. What is the Ordinary Least Square method?
  3. What are residuals?
  4. If input variables are not linearly related, can it still be a linear regression algorithm?
  5. How to ensure the better performance of Linear regression models?

Conclusion

In this article, we discussed one of the most famous algorithms in machine learning, i.e., linear regression. We implemented the linear regression model on our constructed data step by step to understand all the pieces involved. We also discussed methods we can use to improve the performance of a linear regression model. We hope you enjoyed it.

Enjoy Learning! Enjoy Algorithms!
