Predicting Life Expectancy Using Linear Regression

With advances in Machine Learning and Data Science, we can now estimate life expectancy with reasonable accuracy from a handful of essential parameters. In this blog post, we explore the parameters that affect the life expectancy of people living in different countries and show how a machine learning model can be used to estimate it. We also look at which parameters have the most significant impact on life expectancy.

Key takeaways from this blog

  • Effective Data Visualization techniques
  • The intuition behind the Linear Regression model
  • Application of Linear Regression in predicting Life Expectancy
  • Evaluation techniques for the linear regression model
  • Use of life-expectancy prediction by different industries

Let’s start by understanding Life Expectancy.

What is Life Expectancy?

The term “life expectancy” refers to the number of years a person can expect to live. By definition, life expectancy is based on an estimate of the average age members of a particular population group will be when they die.

Life expectancy depends on several factors, the most important being gender and birth year. Generally, females have a slightly higher life expectancy than males, a gap that is partly attributed to biological differences. Other factors that influence life expectancy include:

  • Race and Ethnicity
  • Family Medical History
  • Risky Lifestyle choices

However, that’s hardly the entire list! As we work our way through the data analysis, we will explore additional hidden factors that influence the life expectancy of an individual.

Data Analysis for WHO dataset

We use the Life Expectancy (WHO) Kaggle dataset for this demonstration.

Let’s start by loading the data.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

life_exp = pd.read_csv('Life Expectancy Data.csv')
life_exp.head()

WHO dataset columns for predicting life expectancy
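Before plotting, it can help to check the loaded table for untidy column names and missing values, since both affect the later steps. The following is a minimal sketch under stated assumptions: `summarize_missing` is a hypothetical helper (not part of pandas), and the small `demo` frame is made-up data standing in for the real CSV.

```python
import pandas as pd

def summarize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    cleaned = df.copy()
    # Strip stray spaces so columns can be referenced by a predictable name.
    cleaned.columns = cleaned.columns.str.strip()
    missing = cleaned.isna().sum()
    summary = pd.DataFrame({
        "missing": missing,
        "percent": (100 * missing / len(cleaned)).round(1),
    })
    # Keep only the columns that actually have gaps, worst first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Made-up rows for illustration only (not taken from the WHO dataset).
demo = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "Life expectancy ": [65.0, None, 71.2, 80.1],
    " GDP": [500.0, None, None, 42000.0],
})
report = summarize_missing(demo)
print(report)
```

On the real data, the same call would be `summarize_missing(life_exp)`. How the gaps are then handled (dropping rows, imputing a mean, and so on) is a separate modelling decision.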

Distribution of Life Expectancy

Most life expectancy values lie between 45 and 90 years, with a mean of about 69 years.

sns.histplot(life_exp['Life expectancy'].dropna(), kde=True, color='orange')

Life expectancy distribution with respect to age in the WHO dataset

Correlation Heatmap

A correlation heatmap is a graphical representation of a correlation matrix, which holds the pairwise correlations between variables. It helps in understanding how strongly the variables depend linearly on one another. Correlation is always calculated between two variables and lies in the range [-1, 1].

  • A correlation value close to zero means there is little or no linear relationship between the two variables (they may still be related non-linearly).
  • An absolute correlation value close to 1 means the two variables have a strong linear relationship; the sign gives its direction.
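To make the [-1, 1] range concrete, here is a small sketch that computes the Pearson correlation coefficient by hand with NumPy. The function name `pearson_r` and the toy arrays are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def pearson_r(x, y) -> float:
    """Pearson correlation: covariance scaled by both standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return float(numerator / denominator)

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
r_increasing = pearson_r(x, 3 * x + 1)  # perfectly linear, increasing
r_decreasing = pearson_r(x, -x)         # perfectly linear, decreasing
r_quadratic = pearson_r(x, x ** 2)      # fully determined by x, but not linearly

print(r_increasing, r_decreasing, r_quadratic)
```

The last case is the one to remember when reading a heatmap: `x ** 2` depends entirely on `x`, yet the linear correlation is zero because the relationship is symmetric rather than linear.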

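Let’s implement the heatmap in Python to visualize these dependencies:

# Diverging palette anchored at two hue angles (expected range: 0 to 359)
cmap = sns.diverging_palette(140, 10, as_cmap=True)
# numeric_only=True skips text columns such as Country, which pandas 2.x no longer drops silently
sns.heatmap(life_exp.corr(numeric_only=True), cmap=cmap, center=0, annot=False, square=True);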

Correlation heatmap plot using seaborn library on WHO dataset

Life expectancy correlates considerably with Adult Mortality, BMI, Schooling, HIV/AIDS, Income composition of resources (ICOR), and GDP. The following insights can be drawn from the correlation heatmap:

  • Life expectancy and Adult Mortality have a high negative correlation, as one would anticipate.
  • BMI has a positive correlation with Life expectancy.
  • GDP also has a positive correlation with Life expectancy: countries with a higher GDP tend to have a higher life expectancy.
  • Not surprisingly, Schooling years have a high positive correlation with Life expectancy. One plausible explanation is that more schooling encourages healthier habits and discipline, although correlation alone does not establish the cause.
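The same ranking can be read off numerically instead of by eye. The sketch below sorts features by the strength of their correlation with a target column; `rank_by_correlation` is a hypothetical helper, and the `demo` values are invented for illustration.

```python
import pandas as pd

def rank_by_correlation(df: pd.DataFrame, target: str) -> pd.Series:
    """Correlation of every numeric column with the target, strongest first."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    # Order by absolute value so strong negative correlations rank high too.
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Made-up numbers, not taken from the WHO dataset.
demo = pd.DataFrame({
    "Life expectancy": [50.0, 60.0, 70.0, 80.0],
    "Adult Mortality": [400.0, 300.0, 200.0, 100.0],
    "Schooling": [8.0, 9.0, 13.0, 14.0],
    "Country": ["A", "B", "C", "D"],
})
ranked = rank_by_correlation(demo, "Life expectancy")
print(ranked)
```

With the real data, `rank_by_correlation(life_exp, 'Life expectancy')` would list the columns discussed above in order of correlation strength, with the sign showing the direction.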

What is adult mortality?

The adult mortality rate is the probability that a person who has reached age 15 will die before age 60, expressed per 1,000 persons.

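# to_bubble is not defined in this post; it is assumed to be a DataFrame with
# GDP, Life expectancy, Population, Continent and Country columns.
fig = px.scatter(to_bubble, x='GDP', y='Life expectancy',
                 size='Population', color='Continent',
                 hover_name='Country', log_x=True, size_max=40)
fig.show()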

Continent-wise life expectancy vs GDP plotting on the WHO dataset
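The post does not show how the `to_bubble` frame passed to the plot is built. One plausible way, sketched here purely as an assumption, is to average each country's values over all years so that every country contributes a single bubble. The helper name `build_bubble_frame` and the `demo` rows are hypothetical.

```python
import pandas as pd

def build_bubble_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Average GDP, life expectancy and population per country."""
    value_cols = ["GDP", "Life expectancy", "Population"]
    per_country = df.groupby(["Country", "Continent"], as_index=False)[value_cols].mean()
    # A bubble needs all three values, so drop countries with any of them missing.
    return per_country.dropna(subset=value_cols)

# Made-up rows: two yearly records for A and B, one incomplete record for C.
demo = pd.DataFrame({
    "Country": ["A", "A", "B", "B", "C"],
    "Continent": ["Asia", "Asia", "Europe", "Europe", "Africa"],
    "GDP": [100.0, 300.0, 4000.0, 6000.0, 50.0],
    "Life expectancy": [60.0, 62.0, 80.0, 82.0, 55.0],
    "Population": [1e6, 3e6, 5e5, 7e5, None],
})
bubble = build_bubble_frame(demo)
print(bubble)
```

Averaging over years is only one choice; filtering to a single year would work equally well and would show a snapshot rather than a long-run average.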

This bubble plot is very informative for understanding how life expectancy trends differ across continents. The size of each bubble represents the population of the respective country. We can safely make the following inferences from the bubble plot:

  • African countries have a lower life expectancy than countries on other continents.
  • In most Asian countries, life expectancy ranges between 60 and 75 years. Asian countries also have large populations.
  • South America and North America have similar life expectancy trends.
  • European countries with a high GDP have a remarkably high life expectancy.

Let’s analyze the impact of GDP for different continents versus Life expectancy.

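# One panel per continent: a 2 x 3 grid covers the 6 continents in the data
fig, axs = plt.subplots(2, 3, figsize=(15, 8))
for continent, ax in zip(life_exp['Continent'].dropna().unique(), axs.flat):
    continents = life_exp[life_exp['Continent'] == continent]
    # Scatter plot with a fitted regression line for this continent
    sns.regplot(x=continents['GDP'], y=continents['Life expectancy'], color='red', ax=ax).set_title(continent)
plt.tight_layout()
plt.show()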

GDP vs Life expectancy of all 6 continents present in the WHO dataset

High GDP is strongly associated with higher life expectancy! In other words, a person living in a developed country with a high GDP can expect to live longer, on average, than a person living in a developing country.

What is ICOR (Income Composition of Resources)?

In the WHO dataset, ICOR is the Human Development Index in terms of income composition of resources; this post reads it as a measure of how well a country utilizes its resources. ICOR ranges from 0 to 1, and a higher value indicates better utilization of available resources. ICOR has a considerably high correlation with Life expectancy. Let’s visualize the impact of ICOR on Life expectancy continent-wise.

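# One panel per continent: a 2 x 3 grid covers the 6 continents in the data
fig, axs = plt.subplots(2, 3, figsize=(15, 8))
for continent, ax in zip(life_exp['Continent'].dropna().unique(), axs.flat):
    continents = life_exp[life_exp['Continent'] == continent]
    # Column name written without a trailing space, matching the other snippets
    sns.regplot(x=continents['Income composition of resources'], y=continents['Life expectancy'], color='blue', ax=ax).set_title(continent)
plt.tight_layout()
plt.show()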

ICOR vs life expectancy for all 6 continents present in the WHO dataset

As expected, higher ICOR goes hand in hand with higher Life expectancy. If a country utilizes its resources productively, its citizens are more likely to live longer.

Life Expectancy Prediction

The question arises: is there a way to predict life expectancy from the 22 independent features in the dataset? The answer is yes, but first we must choose a suitable supervised regression algorithm for the task.

There are many algorithms available for regression tasks, and each has its own advantages and disadvantages. One algorithm might produce better results than the others but be harder to interpret. Even if interpretability is not an issue, deploying complex algorithms can be difficult. In other words, there is a trade-off between accuracy, model complexity, and model interpretability. An ideal algorithm would be interpretable, accurate, and easy to deploy, but no algorithm is perfect on all three counts.

For example, Linear Regression is a relatively simple and interpretable algorithm. It requires minimal effort to deploy, but its accuracy can be limited when the data is non-linear. Complex algorithms may perform better on non-linear datasets, but the model may lack interpretability.

Let’s proceed with Linear Regression for this task.

Linear Regression

Linear Regression is a supervised learning algorithm that predicts a continuous value for a given data point by generalizing from the data we have in hand. The “linear” part means the model assumes a linear relationship between the inputs and the output.

The idea is to predict the dependent variable (Y) from one or more independent variables (X). This is accomplished by fitting a best-fit line to the data. The best-fit line, or regression line, is the one that minimizes the sum of squared residual errors.

What is a residual error?

A residual error measures how far a point lies vertically from the regression line. Simply put, it is the difference between the observed actual value and the predicted value. Because positive and negative residuals would cancel out in a plain sum, the best-fit line is chosen to minimize the sum of squared residuals.
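For a single feature, the best-fit line has a closed-form solution, which makes the idea easy to check by hand. The sketch below is a minimal illustration with made-up points; `fit_line` and `sum_squared_residuals` are hypothetical helper names, and scikit-learn is not involved here.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares slope and intercept for one feature."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    slope = (x_centered * (y - y.mean())).sum() / (x_centered ** 2).sum()
    intercept = y.mean() - slope * x.mean()
    return float(slope), float(intercept)

def sum_squared_residuals(x, y, slope, intercept) -> float:
    """Sum of squared vertical distances between the points and the line."""
    residuals = np.asarray(y, dtype=float) - (slope * np.asarray(x, dtype=float) + intercept)
    return float((residuals ** 2).sum())

# Made-up points that roughly follow y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

slope, intercept = fit_line(x, y)
best = sum_squared_residuals(x, y, slope, intercept)
# Any other line, for example one with a slightly steeper slope, fits worse.
nudged = sum_squared_residuals(x, y, slope + 0.1, intercept)
print(slope, intercept, best, nudged)
```

Comparing `best` with `nudged` shows what "best fit" means in practice: moving the line away from the least-squares solution can only increase the sum of squared residuals.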

Let’s predict Life expectancy by using Linear Regression. Before building the model, we need to split the dataset into training and testing sets. We will use this test set to evaluate the model's performance.

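from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Target is life expectancy; every other column except Year is a feature.
# LinearRegression cannot handle NaN values, so this assumes missing values
# were imputed or dropped beforehand.
target = life_exp['Life expectancy']
features = life_exp[life_exp.columns.difference(['Life expectancy', 'Year'])]

#----- Splitting the dataset (70% train, 30% test) -----#
# get_dummies one-hot encodes categorical columns such as Country
x_train, x_test, y_train, y_test = train_test_split(pd.get_dummies(features), target, test_size=0.3)

#----- Linear Regression -----#
lr = LinearRegression()

#----- Fitting model over training data -----#
lr.fit(x_train, y_train)

#----- Evaluating the model over test data -----#
# score() returns the coefficient of determination (R-squared), not a confidence level
lr_confidence = lr.score(x_test, y_test)
print("lr confidence: ", lr_confidence)
# Example output (varies between runs because the split is random):
#lr confidence:  0.9538309850283277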

The coefficient of determination (R-squared) came out at about 0.95, close to 1, indicating that the model explains roughly 95% of the variance in Life expectancy on the test set.
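R-squared is a relative measure, so it is often reported together with error metrics expressed in the target's own units (years, in this case). The sketch below computes R-squared, mean absolute error, and root mean squared error by hand; `regression_metrics` is a hypothetical helper and the numbers are made up.

```python
import numpy as np

def regression_metrics(y_true, y_pred) -> dict:
    """R-squared, MAE and RMSE computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = (residuals ** 2).sum()                  # unexplained variation
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()   # total variation
    return {
        "r2": float(1 - ss_res / ss_tot),
        "mae": float(np.abs(residuals).mean()),
        "rmse": float(np.sqrt((residuals ** 2).mean())),
    }

# Made-up life expectancy values and predictions, in years.
metrics = regression_metrics([60.0, 70.0, 80.0], [62.0, 70.0, 78.0])
print(metrics)
```

For the model above, the call would be `regression_metrics(y_test, lr.predict(x_test))`; its `r2` entry follows the same definition that `lr.score` uses, while MAE and RMSE tell you how many years a typical prediction is off by.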

For validation of the model, let’s check the distribution of residuals.

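# Residuals on the test set, using the same sign convention as the x-axis label
residuals = lr.predict(x_test) - y_test
# stat="density" makes the y-axis match its 'Density' label
sns.histplot(residuals, kde=True, stat="density", color="orange")
plt.title('Residual Plot')
plt.xlabel('Residuals: (Predictions - Actual)')
plt.ylabel('Density');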

Residual plot for representing the error in prediction of linear regression model

Residual distribution is approximately normal, having a mean close to zero. This is precisely what we are looking for. Let’s visualize the residuals in a scatter plot!

Scatter plot of the residuals of the linear regression model's predictions on the test dataset
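The code for this scatter plot is not shown in the post. A plot of this kind could be produced along the following lines; this is a sketch under assumptions, with `plot_residuals` as a hypothetical helper, residuals defined as predictions minus actual values (the convention used in the histogram's axis label), and made-up demo values.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_residuals(y_true, y_pred, ax=None):
    """Scatter residuals against predictions, with a reference line at zero."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_pred - y_true  # predictions minus actual values
    if ax is None:
        _, ax = plt.subplots(figsize=(7, 4))
    ax.scatter(y_pred, residuals, alpha=0.6)
    # A well-behaved model scatters evenly around this line.
    ax.axhline(0, color="black", linewidth=1)
    ax.set_xlabel("Predicted life expectancy")
    ax.set_ylabel("Residuals: (Predictions - Actual)")
    ax.set_title("Residuals vs predictions")
    return ax

# Made-up values for illustration only.
ax = plot_residuals([60.0, 70.0, 80.0, 75.0], [62.0, 69.0, 80.0, 73.5])
```

With the fitted model, the call would be `plot_residuals(y_test, lr.predict(x_test))` followed by `plt.show()`. A funnel or curve in this plot would suggest that a straight line is missing some structure in the data.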

Residuals are centered around zero, and the coefficient of determination (R-squared) is close to 1, which indicates a good fit on the test dataset. With these results, our model predicts Life Expectancy well on this dataset.

Industrial Application of Linear Regression

WHO (Public Health Standards)

The World Health Organization (WHO) keeps track of the health status of all countries and many other related factors. It is responsible for monitoring public health risks, promoting human health, and coordinating responses to health emergencies. WHO relies heavily on statistical algorithms like Linear Regression for studying life expectancy and the impact of pandemics on lifespan.

Blue Shield of California (Healthcare Insurance)

Blue Shield of California (BSOC) is a non-profit health insurance organization known for innovation in the healthcare domain. BSOC uses the Linear Regression model to estimate medical expenses based on insurance data.

JPMorgan Chase & Co. (Finance)

JPMorgan Chase is a global leader in financial services offering solutions to the world’s most important corporations and government institutions. JPMC applies Linear Regression in the Capital Asset Pricing Model (CAPM), where regressing an asset's returns on market returns estimates its systematic risk (beta). Moreover, JPMC uses Linear Regression for forecasting and financial analysis.

Johnson & Johnson (Pharmaceutical)

Johnson & Johnson (J&J) is an American multinational corporation that develops medical devices, pharmaceuticals, and consumer packaged goods. J&J uses linear regression to estimate the remaining shelf life of medicine stocks.

Walmart (Retail Corporation)

Walmart is a well-known retail corporation that operates various hypermarkets, department stores, grocery stores, and garment buying houses. Walmart relies on regression analysis for sales forecasting and improved decision making.

Possible Interview Questions

Predicting life expectancy is a popular machine learning project that is commonly found on the resumes of freshers. Interviewers may therefore ask questions such as:

  • How did you approach linear regression? Why not use more sophisticated algorithms for higher accuracy?
  • What is R-squared? How does it reflect the accuracy of your model? What is the range of R-squared?
  • What can be done to further increase the performance of this machine learning model?
  • In which industrial sectors can this machine learning project be beneficial?
  • Can you name some of the hyperparameters involved in this project?

Conclusion

We began by understanding life expectancy and the factors that affect it. We then visualized these affecting parameters and correlated them to make inferences. Finally, we covered linear regression and used it to predict life expectancy.

One can potentially increase their lifespan by adopting a healthy lifestyle, getting a proper education, and getting vaccinated. Geographic location also plays a significant role: our analysis found that people living in Europe have a higher life expectancy than people on other continents. A country's GDP and income composition are also strongly related to life expectancy. Some parameters, such as pollution and environmental index, were not included in this analysis but are expected to correlate strongly with life expectancy.

Next Blog: Introduction to Logistic Regression

Enjoy learning, Enjoy algorithms!
