In regression problems, we map input variables to one or more continuous output variables. Examples include predicting a share price in the stock market or predicting atmospheric temperature. Because regression has so many applications, a lot of research goes into building more accurate models. When we build a solution for a regression problem, we compare its performance with existing work. But to compare two pieces of work, we need a standard measure, just as we measure distance in meters or plot size in square feet. Similarly, we need standard evaluation metrics to compare two regression models.
But before moving ahead, let's understand one crucial question.
Whenever we say that we have built a model, the first question that comes up is, "What is the accuracy of our model?" Accuracy is a general term that can be formulated as "Out of all the predictions our model made, how many were exactly right?" As regression problems use supervised data, we know Yactual, and a prediction counts as accurate only when Ypredicted is exactly equal to Yactual. But in regression problems the target variable is continuous, so a prediction almost never matches the actual value exactly, and accuracy would be close to zero even for a very good model. If we still tried to tune our model for exact matches, we would end up overfitting it.
To avoid that, we use other evaluation metrics under which a model is considered good when its predictions are very close to the actual values, even if they are not exactly equal to them. Hence we do not measure accuracy here. Instead, there are some well-defined metrics for comparing regression models and deciding which one performs better. So let's understand the most common metrics for regression problems.
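To make the point about accuracy concrete, here is a small sketch with made-up temperature values (the numbers are purely illustrative, not taken from any dataset). Every prediction is within a fraction of a degree of the actual value, yet exact-match accuracy comes out as zero.

```python
# Hypothetical actual and predicted temperatures (made-up numbers).
y_true = [21.4, 19.8, 25.1, 23.0]
y_pred = [21.3, 20.0, 25.0, 23.2]

# "Accuracy" in the classification sense: fraction of exact matches.
exact_matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = exact_matches / len(y_true)

# Largest absolute gap between a prediction and its actual value.
largest_gap = max(abs(t - p) for t, p in zip(y_true, y_pred))

print("Exact-match accuracy =", accuracy)
print("Largest absolute gap =", round(largest_gap, 2))
```

A metric that rewards closeness rather than exact equality is clearly a better fit for this kind of output.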
Mean Absolute Error (MAE) is a fundamental and widely used evaluation metric for regression problems. Here we calculate the difference between the actual and predicted values. This difference is termed the error. Let's say Ŷi is the predicted value and Yi is the actual value for the i-th sample. Then the error in prediction is Error = Yi-Ŷi. This error can be either positive or negative, but we are more concerned with its magnitude. Hence we take the modulus, Error = |Yi-Ŷi|.
If we have N such samples in the data, the total error is the sum of errors over all those samples, i.e., Total error = Σ |Yi-Ŷi|. But we cannot report the total error directly, as the number of samples can differ between experiments. Hence, we use the mean of this error, MAE = (1/N) Σ |Yi-Ŷi|. The mean says that whenever we do inference using this model, we can expect the actual value to lie, on average, about one MAE away from the prediction, i.e., roughly in the range (Ypredicted-MAE) ≤ Yactual ≤ (Ypredicted+MAE). This is a typical error size, not a guaranteed bound.
from sklearn.metrics import mean_absolute_error
print("MAE = ", mean_absolute_error(y_true, y_pred))
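For readers who want to see the arithmetic behind the library call, the following is a minimal pure-Python sketch of the same calculation. The function name `mean_absolute_error_manual` and the sample values are assumptions made for illustration only.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of |actual - predicted| over all samples."""
    total_error = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    return total_error / len(y_true)


# Made-up actual and predicted values.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

# Absolute errors are 1, 0, 2 and 1, so the total error is 4 over 4 samples.
mae = mean_absolute_error_manual(y_true, y_pred)
print("MAE =", mae)
```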
Across research works, we can observe that some normalize a one-dimensional target variable and some don't. For example, suppose our target variable can take values in [0, 100]. One method keeps the target as it is, and a second method normalizes it to the range [0, 1], where 0 maps to 0 and 100 maps to 1. In such a scenario, the value of MAE differs even for the same model: the error in the first method is 100 times higher than the error in the second, purely because of the scale.
To take care of these situations, we can define the error as a percentage deviation from the actual values. This is the Mean Absolute Percentage Error, MAPE = (100/N) Σ |Yi-Ŷi| / |Yi|, where Yi is the actual value, Ŷi is the predicted value, and N is the total number of samples. Note that it is undefined when any Yi is 0.
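The effect of scale on MAE can be checked with a few lines of Python. This is only a sketch with made-up values in [0, 100]; the helper `mae` is defined here for illustration and is not a library function.

```python
def mae(y_true, y_pred):
    """Mean absolute error over paired lists of values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up targets and predictions on the original [0, 100] scale.
y_true = [20.0, 50.0, 80.0]
y_pred = [25.0, 45.0, 90.0]

# The same values after normalizing to [0, 1] (divide by 100).
y_true_scaled = [v / 100 for v in y_true]
y_pred_scaled = [v / 100 for v in y_pred]

mae_raw = mae(y_true, y_pred)
mae_scaled = mae(y_true_scaled, y_pred_scaled)

print("MAE on [0, 100] scale =", mae_raw)
print("MAE on [0, 1] scale   =", mae_scaled)
```

The predictions are equally good in both cases, but the two MAE values differ by the scaling factor of 100, so they cannot be compared directly.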
from sklearn.metrics import mean_absolute_percentage_error
# scikit-learn returns a fraction (0.1 means 10%); multiply by 100 for a percentage.
print("MAPE = ", mean_absolute_percentage_error(y_true, y_pred))
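As a rough illustration of why a percentage error helps, the sketch below computes the mean absolute percentage error by hand on made-up values, once on a [0, 100] scale and once after dividing everything by 100. The helper `mape` is an assumption for this example; it returns a fraction and requires non-zero actual values, so it does not reproduce every detail of the library function.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction; actual values must be non-zero."""
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up targets and predictions on the original [0, 100] scale.
y_true = [20.0, 50.0, 80.0]
y_pred = [25.0, 45.0, 90.0]

# The same values normalized to [0, 1].
y_true_scaled = [v / 100 for v in y_true]
y_pred_scaled = [v / 100 for v in y_pred]

mape_raw = mape(y_true, y_pred)
mape_scaled = mape(y_true_scaled, y_pred_scaled)

print("MAPE on [0, 100] scale =", round(100 * mape_raw, 2), "%")
print("MAPE on [0, 1] scale   =", round(100 * mape_scaled, 2), "%")
```

Because each error is divided by its own actual value, rescaling the target leaves the result unchanged.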
Mean Squared Error (MSE) is a very popular evaluation metric for regression problems. It is similar to the mean absolute error, but the error is squared here, Error = |Yi-Ŷi|². Similarly, when this squared error is calculated for N samples, the total error is Σ |Yi-Ŷi|². The Mean Squared Error is the average squared error per sample, MSE = (1/N) Σ |Yi-Ŷi|².
from sklearn.metrics import mean_squared_error
print("MSE = ", mean_squared_error(y_true, y_pred))
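One practical consequence of squaring is that a single large error weighs more than several small ones. The sketch below uses two made-up sets of predictions with the same MAE to show how MSE tells them apart; the helper names `mae` and `mse` are illustrative only.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [10.0, 12.0, 14.0, 16.0]

# Model A is off by 1 on every sample.
y_pred_a = [11.0, 13.0, 15.0, 17.0]
# Model B is exact on three samples and off by 4 on the last one.
y_pred_b = [10.0, 12.0, 14.0, 20.0]

mae_a, mae_b = mae(y_true, y_pred_a), mae(y_true, y_pred_b)
mse_a, mse_b = mse(y_true, y_pred_a), mse(y_true, y_pred_b)

print("MAE: model A =", mae_a, ", model B =", mae_b)
print("MSE: model A =", mse_a, ", model B =", mse_b)
```

Both models have the same total absolute error, but the one large miss makes model B's MSE four times higher.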
Root Mean Squared Error (RMSE) is one of the best-known evaluation metrics for regression models. Its calculation is the same as MSE, except that we take the square root of the final value, RMSE = √MSE. We learned with MAE that, at inference time, the actual value is expected to lie roughly in the range [Ypredicted-Error, Ypredicted+Error]. In MSE we squared the error, so its value is in squared units of the target; taking the square root brings it back to the original units. That's RMSE for us.
from sklearn.metrics import mean_squared_error
import numpy as np
print("RMSE = ", np.sqrt(mean_squared_error(y_true, y_pred)))
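The relationship between MSE, RMSE and MAE can be seen in a short standard-library sketch. The values below are made up and the variable names are chosen only for this example.

```python
import math

# Made-up actual and predicted values.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 320.0, 380.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / n   # same units as the target
mse = sum(e ** 2 for e in errors) / n   # squared units of the target
rmse = math.sqrt(mse)                   # back to the units of the target

print("MAE  =", mae)
print("MSE  =", mse)
print("RMSE =", round(rmse, 3))
```

RMSE is on the same scale as MAE, and it is never smaller than MAE because the larger errors are weighted more heavily before the square root is taken.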
Correlation between two variables describes the strength of their relationship. R-squared (R²), in contrast, describes to what extent the variance of one variable explains the variance of the other. It is also known as the coefficient of determination. This metric is interesting and important, so let's understand it with an example. Suppose we know the salaries of 10 government employees and want to guess the salary of the 11th employee. Assume that we are beginners and have no idea about machine learning techniques. What would be our most reasonable guess?
The mean of the salaries of the 10 employees, right? We take this mean value as our baseline prediction. The baseline error is the sum of squared differences between each actual Y and the mean value, Σ |Yi-Y̅|². Let's call this error TSS (Total Sum of Squares).
Now suppose we know ML and have built a machine learning model to predict the salary. Having learned better techniques, we assume that we can improve on the naive guess. Hence the total squared prediction error Σ |Yi-Ŷi|² should be less than TSS.
R-squared can be calculated as R² = 1 - Σ |Yi-Ŷi|² / Σ |Yi-Y̅|², in which Y̅ is the mean of the actual values, and Ŷi is the predicted value. The numerator is the model's total squared error, and the denominator is TSS.
from sklearn.metrics import r2_score
print("R_Squared = ", r2_score(y_true, y_pred))
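The salary example can be worked through by hand. The sketch below uses four made-up salaries (in thousands) instead of ten to keep the arithmetic short; `tss` and `rss` are names chosen here for the baseline error and the model's total squared error.

```python
# Made-up salaries (in thousands) and hypothetical model predictions.
y_true = [30.0, 40.0, 50.0, 60.0]
y_pred = [32.0, 38.0, 52.0, 58.0]

mean_salary = sum(y_true) / len(y_true)  # the naive guess: 45.0

# Baseline error: squared distance of each salary from the mean.
tss = sum((t - mean_salary) ** 2 for t in y_true)

# Model error: squared distance of each salary from its prediction.
rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))

r_squared = 1 - rss / tss

print("TSS =", tss)
print("Model squared error =", rss)
print("R_Squared =", r_squared)
```

Here the model's squared error is a small fraction of the baseline error, so R² is close to 1.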
In theory, under the assumption above, the R² value lies in the range [0, 1], while in practice, values lie in (-∞, 1]. Can you guess when we will have a negative R²?
The problem is with the assumption. We thought the ML model would beat the naive guess of the average, but it did not. R² becomes negative whenever the model's total squared error is larger than TSS, that is, whenever the model predicts worse than simply guessing the mean.
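A negative R² is easy to produce on purpose. In this sketch, a deliberately bad set of made-up predictions (the salaries in reverse order) is compared against the mean baseline; the helper `r2_manual` is written for this example only.

```python
def r2_manual(y_true, y_pred):
    """R² = 1 - (model squared error) / (squared error of the mean baseline)."""
    mean_y = sum(y_true) / len(y_true)
    tss = sum((t - mean_y) ** 2 for t in y_true)
    rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return 1 - rss / tss


y_true = [30.0, 40.0, 50.0, 60.0]

# Always predicting the mean reproduces the baseline exactly.
y_pred_mean = [45.0, 45.0, 45.0, 45.0]
# A deliberately bad model: predictions run in the opposite direction.
y_pred_bad = [60.0, 50.0, 40.0, 30.0]

r2_mean = r2_manual(y_true, y_pred_mean)
r2_bad = r2_manual(y_true, y_pred_bad)

print("R² of the mean baseline =", r2_mean)
print("R² of the bad model     =", r2_bad)
```

The baseline scores exactly 0, and any model whose squared error exceeds the baseline's drops below it.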
R² is a good measure and is widely used in the industry to evaluate regression models. But it has a serious problem that can mislead machine learning engineers and researchers. If we look carefully, we can raise the R² value without making the model any better. Can we guess how?
We can add more input features. The baseline error TSS does not depend on the features, but on the training data each additional independent variable, even an irrelevant one, gives a linear least-squares model more freedom to fit, so its error cannot go up and R² cannot go down. If there are too many independent variables, the model can overfit, and R² would be high. But on the test data, it will perform poorly.
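The following sketch illustrates the simplest version of this effect, under stated assumptions: the target values are random, the single added feature is pure noise with no relation to the target, and the model is an ordinary least-squares line fitted with the closed-form slope and intercept. The names and the seed are arbitrary choices for this example.

```python
import random


def r2_manual(y_true, y_pred):
    """R² = 1 - (model squared error) / (squared error of the mean baseline)."""
    mean_y = sum(y_true) / len(y_true)
    tss = sum((t - mean_y) ** 2 for t in y_true)
    rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return 1 - rss / tss


random.seed(0)
n = 10
y = [random.gauss(50, 10) for _ in range(n)]            # made-up targets
noise_feature = [random.gauss(0, 1) for _ in range(n)]  # unrelated to y

# Model with no features: always predict the mean.
mean_y = sum(y) / n
r2_mean = r2_manual(y, [mean_y] * n)

# Model with one noise feature: least-squares line y = intercept + slope * x.
mean_x = sum(noise_feature) / n
slope = (sum((x - mean_x) * (t - mean_y) for x, t in zip(noise_feature, y))
         / sum((x - mean_x) ** 2 for x in noise_feature))
intercept = mean_y - slope * mean_x
fitted = [intercept + slope * x for x in noise_feature]
r2_noise = r2_manual(y, fitted)

print("Training R² with no features     =", round(r2_mean, 4))
print("Training R² with a noise feature =", round(r2_noise, 4))
```

The noise feature carries no information about the target, yet the training R² does not fall below that of the mean-only model. With many such features and few samples, the same effect can push the training R² close to 1.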
To tackle this problem, researchers introduced a metric that is considered an improvement over R², known as adjusted R². It is defined as Adjusted R² = 1 - (1-R²)(N-1)/(N-k-1), where N is the total number of data samples, and k is the number of independent variables.
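from sklearn.metrics import r2_score
r_sqr = r2_score(y_true, y_pred)
N = len(y_true)   # number of samples
k = X.shape[1]    # number of independent variables; X is assumed to be the input feature matrix
# Adjusted R² = 1 - (1 - R²)(N - 1)/(N - k - 1)
print("Adjusted R_Squared = ", 1 - (1 - r_sqr) * (N - 1) / (N - k - 1))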
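Given N > k + 1, the adjusted R² value is never greater than the traditional R² value (for k ≥ 1, it is strictly smaller unless R² = 1). Whenever we add a new independent variable, k grows and the penalty increases, so the score improves only if the new variable raises R² enough to offset it. This makes the score much harder to inflate by adding irrelevant features.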
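To see the penalty at work, the sketch below wraps the adjusted R² formula in a small helper and evaluates it for the same R² with different feature counts. The function name `adjusted_r2` and the numbers (R² of 0.90 on 50 samples) are assumptions chosen for illustration.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R² = 1 - (1 - R²)(N - 1)/(N - k - 1)."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need more samples than features plus one")
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


r2 = 0.90   # hypothetical R² reported by two different models
n = 50      # hypothetical number of samples

adj_5 = adjusted_r2(r2, n, 5)     # model using 5 independent variables
adj_20 = adjusted_r2(r2, n, 20)   # model using 20 independent variables

print("Adjusted R² with 5 features  =", round(adj_5, 4))
print("Adjusted R² with 20 features =", round(adj_20, 4))
```

Both models report the same R², but the one that needed 20 features to reach it receives the lower adjusted score.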
Industry and research papers are more inclined toward RMSE or MSE values, so we should report these when comparing our results with existing work. Additionally, there is a slight inclination toward R² values, as they can be read somewhat like an accuracy score: a value near 1 means the model explains most of the variance in the target. Among the metrics discussed here, adjusted R² is the only one that accounts for the overfitting problem. But because it depends on the number of independent features, most frameworks do not provide a ready-made function to calculate it. We hope you enjoyed the article.
Enjoy Learning, Enjoy Algorithms!