In machine learning and data science, linear regression falls under the category of supervised learning. It is one of the most popular algorithms, with roots in both statistics and machine learning. It models the linear relationship between a dependent variable and a set of independent variables. The dependent variable (y) is the quantity to be predicted, and the independent variables (X) are the quantities used to build that relationship.
In this article, we will discuss the following points in detail.
So, let’s start without any further delay.
A model is linear when it is linear in the parameters relating the input to the output; the dependency need not be linear in the inputs themselves. For example, y = β0 + β1x + β2x² is still a linear regression model: it is nonlinear in the input x, but linear in the parameters βi.
In a machine learning problem, linearity refers to a linear dependence of the output on a set of features. Let X = [x1, x2, …, xn] be a set of independent features and y the dependent output quantity; the objective (for multiple linear regression) is to map X → y as

y = β0 + β1x1 + β2x2 + … + βnxn + ξ

The βi's are the coefficients (also called weights) parameterizing the space of linear functions mapping X to y, and ξ is the error due to fitting imperfection. The problem linear regression solves is finding the weight vector [β0, β1, β2, …, βn] that minimizes a loss function, given the quantities X and y.
In the figure, the fitted line has an error with respect to each data point. Linear regression adjusts the parameters to minimize the cumulative error, measured by a loss function (also called an error function or cost function). Several loss functions have been defined in the machine learning literature, and the most widely used is the Sum of Squared Residuals (SSR). To get the best set of weights, the SSR over all observations (j = 1, 2, …, m) should be minimized. This approach is known as the method of Ordinary Least Squares (OLS).
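The SSR computation above can be sketched directly. This is a minimal illustration on hypothetical 1-D data, where `ssr` evaluates the loss for a candidate line y = b0 + b1·x:

```python
import numpy as np

# Hypothetical data points roughly following y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

def ssr(b0, b1, x, y):
    residuals = y - (b0 + b1 * x)   # Y_j - f(X_j) for each observation j
    return np.sum(residuals ** 2)   # Sum of Squared Residuals

print(ssr(0.0, 2.0, x, y))  # a slope close to the data gives a small SSR
print(ssr(0.0, 1.0, x, y))  # a worse slope gives a much larger SSR
```

Regression searches over (b0, b1) for the pair that makes this quantity as small as possible.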
The differences (Yj − f(Xj)) for all observations j = 1, 2, …, m are called the residuals. Regression is about determining the weights that yield the smallest sum of squared residuals. Taking the SSR as the objective function to be minimized, we compute the weight vector [β0, β1, β2, …, βn] corresponding to its minimum. This is an optimization problem that can be solved by several approaches, such as gradient descent, linear algebra (the normal equations), and differential calculus.
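As a sketch of the linear-algebra route: the OLS minimizer solves the normal equations βˆ = (XᵀX)⁻¹Xᵀy, which NumPy can do via a least-squares solve. The data here are hypothetical, chosen to lie exactly on the line y = 3 + 2x:

```python
import numpy as np

# Noise-free data on the line y = 3 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 3.0 + 2.0 * x

X = np.column_stack([np.ones_like(x), x])     # design matrix [1, x]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares solution
print(beta)  # recovers [intercept, slope] = [3., 2.]
```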
The goodness of fit of linear regression is usually determined by the R² score. R² is a statistical measure of how much of the variation in the dependent variable is explained by the fitted regression curve; it can also be described as the percentage of response-variable variation explained by the linear model. The higher the R², the better the model fits the data.
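Concretely, R² = 1 − SSR/SST, where SST is the total sum of squares around the mean of the observed values. A small sketch on hypothetical values, checked against scikit-learn's `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.9])

ssr = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
sst = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ssr / sst
print(r2_manual, r2_score(y_true, y_pred))    # the two values agree
```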
As this algorithm has existed for more than 200 years, a great deal of research has been done on it. The literature suggests various heuristics to keep in mind while preparing the dataset for it. As discussed, Ordinary Least Squares is the most common technique for implementing linear regression models, so one can try the methods below and check whether the R² score improves.
Let’s get some hands-on and implement the Linear Regression model.
The necessary libraries imported are
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures
A dataset with a curvilinear relationship is created to demonstrate the regression analysis. A total of 50 data points are drawn from a polynomial equation, to which normally distributed random noise is added so that fitting error appears when a higher-degree model is chosen.
The call *np.random.normal(mean, std, size)* draws *size* samples from a normal distribution with the chosen mean and standard deviation.
#creating and plotting a dataset with a curvilinear relationship
np.random.seed(0)
x = np.arange(-5,5,0.2) + 0.1*np.random.normal(0,1,50)
y = -1*(x**4) - 2*(x**3) + 13*(x**2) + 14*(x) - 24 + 10*np.random.normal(-1,1,50)
plt.figure(figsize=(10,5))
plt.scatter(x, y, color='red', s=25, label='data')
plt.xlabel('x', fontsize=16)
plt.ylabel('y', fontsize=16)
plt.grid()
plt.show()
To perform a polynomial regression analysis with the linear regression function, the dimensionality of the input must be expanded. The function poly_data takes the original data and the degree of the polynomial, and generates additional features according to that degree to perform the regression.
def poly_data(x, y, degree):
    x = x[:, np.newaxis]
    y = y[:, np.newaxis]
    polynomial_features = PolynomialFeatures(degree=degree)
    x_poly = polynomial_features.fit_transform(x)
    return (x_poly, y)
For example, poly_data(x,y,2) will generate a second order of the polynomial by mapping X → [1,x, x²], increasing the number of features for performing multiple regression.
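The feature expansion can be seen directly with a tiny hypothetical input: `PolynomialFeatures(degree=2)` maps each value x to the row [1, x, x²]:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[1.0], [2.0], [3.0]])               # column vector of inputs
x_poly = PolynomialFeatures(degree=2).fit_transform(x)
print(x_poly)
# [[1. 1. 1.]
#  [1. 2. 4.]
#  [1. 3. 9.]]
```

The leading column of ones corresponds to the intercept term β0.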
PolyLinearRegression takes the data generated by the poly_data function, unpacks it, fits a LinearRegression model, and returns the model. The RMSE and R² scores computed for the predicted output are attached to the model object; since the model is just a Python object, it can store these values easily.
def PolyLinearRegression(data, degree=1):
    x_poly, y = data
    model = LinearRegression()
    model.fit(x_poly, y)                 # fit the model
    y_poly_pred = model.predict(x_poly)  # prediction
    rmse = np.sqrt(mean_squared_error(y, y_poly_pred))
    r2 = r2_score(y, y_poly_pred)
    model.degree = degree
    model.RMSE = rmse
    model.r2 = r2
    return model
Plotting the results of the regression analysis is done with the function Regression_plots, which takes the input data, the fitted model, and the subplot axis, and draws the fit for a regression model of a given order.
def Regression_plots(data, model, axis):
    x, y = data
    axis.plot(x[:,1], model.predict(x), color=color[model.degree-1],
              label=str("Model Degree: %d" % model.degree)
                    + str("; RMSE: %.3f" % model.RMSE)
                    + str("; R2 Score: %.3f" % model.r2))
    axis.legend()
We now compile the functions and plots to perform the regression analysis of the fourth-order polynomial, varying the degree of the fitted polynomial from 1 to 4 and plotting the RMSE and R² score corresponding to each degree.
if __name__ == "__main__":
    np.random.seed(0)
    x = np.arange(-5,5,0.2) + 0.1*np.random.normal(0,1,50)
    y = -1*(x**4) - 2*(x**3) + 13*(x**2) + 14*(x) - 24 + 10*np.random.normal(-1,1,50)
    _, axis = plt.subplots()
    color = ['black','green','blue','purple']
    axis.grid()
    axis.set_xlabel('x')
    axis.set_ylabel('y')
    axis.scatter(x[:,np.newaxis], y[:,np.newaxis], color='red', s=25, label='data')
    axis.set_title('LinearRegression')
    for degree in range(1,5):
        data = poly_data(x, y, degree=degree)
        model = PolyLinearRegression(data, degree=degree)
        Regression_plots(data, model, axis)
    plt.show()
The RMSE and R² score are shown in the plot. As the model degree increases, the RMSE decreases and the R² score improves, meaning the higher-degree models fit the data better than the lower-degree ones.
If we keep increasing the degree of the polynomial, we eventually run into the overfitting problem. In our case we fabricated the data, so we know that a degree-4 polynomial fits it well; in real scenarios, the right degree cannot be guessed just by looking at a scatter plot of the samples. In practice, we keep increasing the degree, and when the model overfits, we apply regularization techniques to cure it. Applying regularization requires tuning a hyperparameter, lambda (λ).
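As a hedged sketch of this idea (using ridge, i.e., L2 regularization, which the article does not implement itself): scikit-learn's `Ridge` exposes the λ hyperparameter as `alpha`, and a larger `alpha` shrinks the weight vector, taming an over-flexible high-degree polynomial fit on our fabricated data:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

# Same fabricated dataset as in the article
np.random.seed(0)
x = np.arange(-5, 5, 0.2) + 0.1*np.random.normal(0, 1, 50)
y = -1*(x**4) - 2*(x**3) + 13*(x**2) + 14*x - 24 + 10*np.random.normal(-1, 1, 50)

# Deliberately over-flexible degree-9 feature expansion
x_poly = PolynomialFeatures(degree=9).fit_transform(x[:, np.newaxis])

norms = []
for alpha in (0.001, 1.0, 1000.0):      # alpha plays the role of lambda
    model = Ridge(alpha=alpha).fit(x_poly, y)
    norms.append(np.linalg.norm(model.coef_))
print(norms)  # the weight norm shrinks as alpha grows
```

Choosing `alpha` is typically done by cross-validation rather than by hand.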
Linear regression is among the most used algorithms in the machine learning and data science field. In interviews for machine learning engineer or data scientist positions, interviewers love to ask questions that probe a basic understanding of how this algorithm actually works. Some of those questions can be:
In this article, we discussed one of the most famous algorithms in machine learning, i.e., linear regression. We implemented the linear regression model on our constructed data step by step to understand all the pieces involved. We also discussed methods for improving the performance of a linear regression model. We hope you enjoyed it.