Predicting Life Expectancy Using Linear Regression

Everything has an expiration date, and humans are no exception. With ongoing advancements in Machine Learning and Data Science, we can estimate life expectancy reasonably well given the essential parameters. In this blog, we explore the parameters affecting the life span of people living in different countries and learn how life expectancy can be estimated with the help of machine learning models. We will also highlight the parameters that have the greatest impact on life span.

Key takeaways from this blog

  • Effective Data Visualization techniques
  • The intuition behind the Linear Regression model 
  • Application of Linear Regression in predicting Life Expectancy
  • Evaluation techniques for the linear regression model
  • Use of Linear Regression by different industries

Let’s start by understanding Life Expectancy.

Life Expectancy

The term “life expectancy” refers to the number of years a person can expect to live. By definition, life expectancy is based on an estimate of the average age that members of a particular population group will be when they die.

Life expectancy depends on several factors, the two most important being gender and birth year. Generally, females have a slightly higher life expectancy than males due to biological differences. Other factors that influence life expectancy include:

  • Race and Ethnicity
  • Family Medical History 
  • Risky Lifestyle choices

However, that’s hardly the entire list! As we work our way through the data analysis, we will explore additional hidden factors that influence the life expectancy of an individual.

Data Analysis

For this demonstration, we make use of the Life Expectancy (WHO) Kaggle dataset.  

Let’s start by loading the data.

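import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
life_exp = pd.read_csv('Life Expectancy Data.csv').rename(columns=str.strip)  # trim stray spaces in column names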

Dataset snippet

Distribution of Life Expectancy

Most life expectancy values lie between 45 and 90 years, with an average of about 69 years.

sns.histplot(life_exp['Life expectancy'].dropna(), kde=True, color='orange')

Data distribution

Correlation Heatmap

A correlation heat map is a graphical representation of a correlation matrix, which holds the correlation between each pair of variables. This helps in understanding the linear dependencies between variables. Correlation is always calculated between two variables, and it has a range of [-1, 1].

  • A correlation value close to zero means the two variables have little or no linear relationship.
  • An absolute correlation value close to 1 means the two variables are almost perfectly linearly related; the sign gives the direction of the relationship.
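
To make the [-1, 1] range concrete, here is a minimal sketch that computes the Pearson correlation coefficient by hand. The helper name `pearson_r` and all the numbers are made up for illustration; they are not taken from the WHO dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y scaled by their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data (hypothetical values, for illustration only)
schooling = [8, 10, 12, 14, 16]
life_up = [60, 64, 68, 72, 76]    # rises in lockstep with schooling
life_down = [76, 72, 68, 64, 60]  # falls in lockstep with schooling
u_shaped = [4, 1, 0, 1, 4]        # clearly related to schooling, but not linearly

r_up = pearson_r(schooling, life_up)
r_down = pearson_r(schooling, life_down)
r_u = pearson_r(schooling, u_shaped)
print(r_up, r_down, r_u)
```

The last case is worth noting: the U-shaped values depend on the first list, yet their correlation is zero, because correlation only captures linear relationships.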

Let’s implement the heat map in Python for dependency visualization:

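cmap = sns.diverging_palette(500, 10, as_cmap=True)
# numeric_only=True skips text columns (such as country names), which corr() cannot process
sns.heatmap(life_exp.corr(numeric_only=True), cmap=cmap, center=0, annot=False, square=True)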

Data Heatmap for cross-correlation of features

Life expectancy has a considerable correlation with Adult Mortality, BMI, Schooling, HIV/AIDS, ICOR, and GDP. The following insights can be drawn from the correlation heatmap:

  • Life expectancy and Adult Mortality have a high negative correlation, which is expected.
  • BMI has a positive correlation with Life expectancy.
  • GDP also has a positive correlation with Life expectancy: as a country's GDP increases, its life expectancy tends to increase as well.
  • Not surprisingly, Schooling years have a high positive correlation with Life expectancy. A plausible explanation is that proper schooling encourages healthy habits and discipline.
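
Colors on a heatmap can be hard to compare, so it may help to read the correlations with the target as numbers. The following is a minimal sketch, assuming pandas 1.5 or later (for the `numeric_only` argument). The `toy` frame and its values are made up to stand in for `life_exp`.

```python
import pandas as pd

# Toy frame (hypothetical numbers) standing in for life_exp
toy = pd.DataFrame({
    "Life expectancy": [60.0, 64.0, 68.0, 72.0, 76.0],
    "Adult Mortality": [300.0, 250.0, 210.0, 150.0, 100.0],
    "Schooling": [8.0, 10.0, 12.0, 14.0, 16.0],
    "Country": ["A", "B", "C", "D", "E"],  # non-numeric, skipped by numeric_only
})

# Correlation of every numeric feature with the target, sorted from most negative to most positive
corr_with_target = (
    toy.corr(numeric_only=True)["Life expectancy"]
    .drop("Life expectancy")
    .sort_values()
)
print(corr_with_target)
```

On the real data, the same three calls applied to `life_exp` would list the features in order of their correlation with Life expectancy.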

What is adult mortality?

The adult mortality rate is the probability that a person who has reached age 15 will die before age 60, expressed per 1,000 persons. For example, a value of 150 corresponds to a 15% probability.

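import plotly.express as px
# to_bubble is assumed to be a DataFrame prepared beforehand with the columns used below
fig = px.scatter(to_bubble, x='GDP', y='Life expectancy',
                 size='Population', color='Continent',
                 hover_name='Country', log_x=True, size_max=40)
fig.show()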

Continent wise life expectancy

This bubble plot is very informative for understanding the trend of life expectancies across continents. The size of each bubble represents the population of the respective country. We can safely draw the following inferences from the bubble plot:

  • African countries have a lower life expectancy compared to other continents.
  • Life expectancy for most Asian countries lies between 60 and 75 years. Asian countries also have large populations.
  • South America and North America have similar life expectancy trends.
  • European countries with high GDP have a remarkably high life expectancy.

Let’s analyze the impact of GDP for different continents versus Life expectancy. 

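# Assumes life_exp has a 'Continent' column; a 2x3 grid has room for up to six continents
fig, axs = plt.subplots(2, 3, figsize=(18, 10))
for continent, ax in zip(set(life_exp['Continent']), axs.flat):
    continents = life_exp[life_exp['Continent'] == continent]
    sns.regplot(x=continents['GDP'], y=continents['Life expectancy'], color='red', ax=ax).set_title(continent)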

Continent wise GDP feature

High GDP is strongly associated with higher life expectancy!

In other words, a person residing in a developed country with a high GDP can expect to live relatively longer than a person living in a developing country.

What is ICOR (Income Composition of Resources)? 

ICOR is a measure of how well a country utilizes its resources. ICOR ranges from 0 to 1, and a higher ICOR indicates better utilization of available resources. ICOR has a considerably high correlation with Life expectancy. Let’s visualize the relationship between ICOR and Life expectancy continent-wise.

# Assumes life_exp has a 'Continent' column; a 2x3 grid has room for up to six continents
fig, axs = plt.subplots(2, 3, figsize=(18, 10))
for continent, ax in zip(set(life_exp['Continent']), axs.flat):
    continents = life_exp[life_exp['Continent'] == continent]
    sns.regplot(x=continents['Income composition of resources'], y=continents['Life expectancy'], color='blue', ax=ax).set_title(continent)

Continent wise income

As expected, higher ICOR goes along with higher Life expectancy. If a country utilizes its resources productively, its citizens are more likely to live longer.

Life Expectancy Prediction

Now the question is: can we predict Life expectancy from the 22 independent features discussed above? Yes, but first, we need to choose a supervised regression algorithm that fits our task.

There are many algorithms for regression tasks, and each has its pros and cons. One algorithm might deliver superior results compared to others but lack explainability. Even if explainability is not compromised, deploying a complex algorithm can be tedious. In other words, there is a trade-off between accuracy, model complexity, and model explainability. An ideal algorithm would be explainable, accurate, and easy to deploy, but no such algorithm exists.

For instance, Linear Regression is a comparatively simple and explainable algorithm. Deploying Linear Regression requires minimal effort, but it loses accuracy when the relationship in the data is non-linear. Complex algorithms perform better on non-linear datasets, but they lack explainability.

Let’s proceed with Linear Regression for this task. 

Linear Regression

Linear Regression is a supervised regression algorithm that predicts a continuous value for a given data point by generalizing from the data we have in hand. The linear part means that the prediction is modeled as a linear combination of the input features.

Linear regression intuition

The idea is to predict the dependent variable (Y) using a given independent variable (X). This can be accomplished by fitting a best-fit line to the data. The line with the least sum of squared residual errors is the best-fit line, or regression line.

What is a residual error?

A residual error is a measure of how far away a point is vertically from the regression line. Simply put, it is the difference between a predicted value and the observed actual value. Residuals are squared before being summed so that positive and negative errors do not cancel each other out.
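
To see the best-fit line and its residuals in action, here is a minimal sketch of a least-squares fit for a single feature, written without any library. The function name `fit_line` and the toy data are assumptions for illustration. Residuals are taken here as actual minus predicted; flipping the sign does not change the squared sum.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_squared_errors(xs, ys, slope, intercept):
    """Sum of squared vertical distances between the points and a given line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Toy data (hypothetical values, for illustration only)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
best_sse = sum_squared_errors(xs, ys, slope, intercept)

# Any other line, such as this arbitrary one, gives a larger squared error
other_sse = sum_squared_errors(xs, ys, slope=0.7, intercept=2.0)
print(slope, intercept, best_sse, other_sse)
```

Note that the plain sum of the residuals of the fitted line is zero, which is why the squared sum, not the plain sum, is the quantity being minimized.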

Let’s predict Life expectancy by using Linear Regression. 

Before building the model, we need to split the dataset into training and testing sets. We will make use of this test set for evaluating the performance of the model.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Note: LinearRegression cannot handle missing values; this assumes NaNs were dropped or imputed beforehand
target = life_exp['Life expectancy']
features = life_exp[life_exp.columns.difference(['Life expectancy', 'Year'])]
#----- Splitting the dataset -----#
x_train, x_test, y_train, y_test = train_test_split(pd.get_dummies(features), target, test_size=0.3)
#----- Linear Regression -----#
lr = LinearRegression()
#----- Fitting model over training data -----#
lr.fit(x_train, y_train)
#----- Evaluating the model over test data (score returns R-squared) -----#
lr_confidence = lr.score(x_test, y_test)
print("lr confidence: ", lr_confidence)
#lr confidence:  0.9538309850283277


The coefficient of determination, R-squared, came out close to 1 (about 0.95), which indicates that the model explains most of the variance in Life expectancy on the test set.
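
For intuition about the number returned by `lr.score`, here is a minimal sketch that computes R-squared by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The helper name `r_squared` and the toy values are made up for illustration.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Toy values (hypothetical, for illustration only)
actual = [60.0, 65.0, 70.0, 75.0, 80.0]
good_preds = [61.0, 64.0, 70.0, 76.0, 79.0]      # close to the actual values
mean_preds = [70.0, 70.0, 70.0, 70.0, 70.0]      # always predicts the mean
reversed_preds = [80.0, 75.0, 70.0, 65.0, 60.0]  # worse than predicting the mean

r2_good = r_squared(actual, good_preds)
r2_mean = r_squared(actual, mean_preds)
r2_bad = r_squared(actual, reversed_preds)
print(r2_good, r2_mean, r2_bad)
```

A model that always predicts the mean scores 0, and a model that does worse than that scores below 0, so on held-out data R-squared is bounded above by 1 but not below by 0.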

For validation of the model, let’s check the distribution of residuals.

residuals = lr.predict(x_test) - y_test  # predictions minus actual values
sns.histplot(residuals, kde=True, color="orange")
plt.title('Residual Plot')
plt.xlabel('Residuals: (Predictions - Actual)')

Error distribution for performance evaluation

The residual distribution is approximately normal, with a mean close to zero. This is exactly what we are looking for.

Let’s visualize the residuals in a scatter plot!
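
The plotting code for this step is not shown, so the following is only a sketch of one way to draw such a plot with matplotlib. The helper name `plot_residuals` and the toy values are assumptions for illustration; with the model trained earlier, you could pass `lr.predict(x_test)` and `y_test` instead.

```python
import matplotlib.pyplot as plt

def plot_residuals(predictions, actual):
    """Scatter residuals (predictions - actual) against the predictions."""
    residuals = [p - a for p, a in zip(predictions, actual)]
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.scatter(predictions, residuals, alpha=0.6, color="orange")
    ax.axhline(0, color="black", linewidth=1)  # reference line at zero error
    ax.set_title("Residuals vs. Predictions")
    ax.set_xlabel("Predicted Life expectancy")
    ax.set_ylabel("Residuals: (Predictions - Actual)")
    return ax, residuals

# Toy values (hypothetical, for illustration only)
toy_predictions = [61.0, 64.0, 70.0, 76.0, 79.0]
toy_actual = [60.0, 65.0, 70.0, 75.0, 80.0]
ax, toy_residuals = plot_residuals(toy_predictions, toy_actual)
```

In a well-behaved fit, the points scatter evenly above and below the zero line with no visible curve or funnel shape.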

Error distribution for performance evaluation  2

Residuals are centered around zero, and the coefficient of determination R-squared is close to 1. An R-squared value close to 1 indicates a good fit on the test dataset. With these results, we can conclude that our model predicts Life Expectancy well on this dataset.

Industrial Application of Linear Regression

WHO (Public Health Standards)

The World Health Organization (WHO) keeps track of the health status as well as many other related factors for all countries. It is responsible for monitoring public health risks, promoting human health, and coordinating responses to health emergencies. WHO relies heavily on statistical algorithms like Linear Regression for studying Life expectancy and the impact of pandemics on life span.

Blue Shield of California (Healthcare Insurance)

Blue Shield of California (BSOC) is at the forefront of innovation in the healthcare domain. BSOC is a non-profit health insurance organization focused on providing the best possible medical treatment and insurance plans. BSOC takes advantage of the Linear Regression model for estimating medical expenses based on insurance data.

JPMorgan Chase & Co. (Finance)

JPMorgan Chase is a global leader in financial services, offering solutions to the world’s most important corporations and government institutions. JPMC uses Linear Regression with the Capital Asset Pricing Model (CAPM), in which an asset's sensitivity to market risk (its beta) is estimated by regressing the asset's returns on market returns. Moreover, JPMC uses Linear Regression for forecasting and financial analysis.

Johnson & Johnson (Pharmaceutical)

Johnson & Johnson (J&J) is an American multinational corporation that develops medical devices, pharmaceuticals, and consumer packaged goods. J&J uses Linear Regression for estimating the remaining shelf life of medicine stocks. 

Walmart (Retail Corporation)

Walmart is a renowned retail corporation that operates hypermarkets, department stores, grocery stores, and garment buying houses. Walmart relies on Regression analysis for sales forecasting and better decision-making.

Possible Interview Questions

Life expectancy prediction is one of the most popular machine learning projects that interviewers find on freshers' resumes. Possible interview questions on this project include:

  • How did you approach the Linear Regression? Why not more sophisticated algorithms for higher accuracy?
  • What is R-squared? How does it reflect the accuracy of your model? What is the range of R-squared?
  • What can be done further to improve the performance of this machine learning model?
  • What are the industrial sectors where this machine learning project can be beneficial?
  • Can you name some hyperparameters involved in this project?


We started by understanding Life Expectancy and the factors affecting it. We then visualized these factors and examined their correlations to derive inferences. Finally, we covered Linear Regression and implemented it to predict Life expectancy.

Our analysis suggests that a healthy lifestyle, proper education, and vaccination are associated with a longer life span. Of course, geographic location also plays an important role. In our analysis, we found that people living in Europe have a higher life expectancy than those on other continents. A country’s GDP and Income composition are also strongly related to Life Expectancy. Some parameters, like pollution and environmental index, are missing from this analysis and are expected to be highly related to Life Expectancy.

Enjoy learning, Enjoy algorithms!


© 2022 Code Algorithms Pvt. Ltd.

All rights reserved.