Everything has an expiration date; humans are no exception either. With ongoing advancements in Machine Learning and Data Science, we can precisely predict the remaining life span of a person given the essential parameters. In this blog, we aim to explore the parameters affecting the life span of individuals living in distant countries and learn how the life span can be estimated with the help of machine learning models. We will also focus on parameters that greatly impact the life span of an individual.
Let’s start by understanding Life Expectancy.
The term “life expectancy” refers to the number of years a person can expect to live. By definition, life expectancy is based on an estimate of the average age that members of a particular population group will be when they die.
Life expectancy depends on several factors, the two most important being gender and birth year. Generally, females have a slightly higher life expectancy than males due to biological differences. Other factors that influence life expectancy include:
However, that’s hardly the entire list! As we work our way through the data analysis, we will explore additional hidden factors that influence the life expectancy of an individual.
For this demonstration, we make use of the Life Expectancy (WHO) Kaggle dataset.
Let’s start by loading the data.
import pandas as pd life_exp = pd.read_csv('Life Expectancy Data.csv') life_exp.head()
The majority of lifespan lies between 45 to 90 years, with an average lifespan of 69 years.
sns.histplot(life_exp['Life expectancy'].dropna(), kde=True, color='orange')
A correlation heat map is a graphical representation of a correlation matrix representing the correlation between different variables. This helps in understanding the linear dependencies of variables over each other. Correlation is always calculated between two variables, and it has a range of [-1, 1].
Let’s implement the heat map in python for dependency visualization:
cmap = sns.diverging_palette(500, 10, as_cmap=True) sns.heatmap(life_exp.corr(), cmap=cmap, center=0, annot=False, square=True);
Life expectancy has a considerable correlation with Adult Mortality, BMI, Schooling, HIV/AIDS, ICOR, and GDP. Following insights can be drawn based on the correlation heatmap:
What is adult mortality?
The adult mortality rate is shown in the probability that those who have reached age 15 will die before age 60 (shown per 1,000 persons).
fig = px.scatter(to_bubble, x='GDP', y='Life expectancy', size='Population', color='Continent', hover_name='Country', log_x=True, size_max=40) fig.show()
This bubble plot is very informative in understanding the trend of life expectancies for different continents. The size of the bubble defines the population in the respective countries. Following are the safe inferences we can make based on the bubble plot:
Let’s analyze the impact of GDP for different continents versus Life expectancy.
for continent, ax in zip(set(life_exp['Continent']), axs.flat): continents = life_exp[life_exp['Continent'] == continent] sns.regplot(x = continents['GDP'],y = continents['Life expectancy'], color = 'red', ax = ax).set_title(continent) plt.tight_layout() plt.show()
High GDP has a strong positive impact on life expectancy!
In other words, If someone is residing in a developed country that has a high GDP, then his life expectancy is expected to be relatively higher than a person living in a developing country.
What is ICOR (Income Composition of Resources)?
ICOR is the measure of how good a country is at utilizing its resources. ICOR is graded between 0 to 1, and higher ICOR indicates optimal utilization of available resources. ICOR has a considerably high correlation with Life expectancy. Let’s visualize the impact of ICOR on Life expectancy continent-wise.
for continent, ax in zip(set(life_exp["Continent"]), axs.flat): continents = life_exp[life_exp['Continent'] == continent] sns.regplot(x = continents['Income composition of resources'],y = continents["Life expectancy "], color = 'blue', ax = ax).set_title(continent) plt.tight_layout() plt.show()
As expected, higher ICOR yields higher Life expectancy. If a country utilizes its resources productively, it is more likely to see its citizens live longer than expected.
Now the question comes, Is there any way to predict the Life expectancy based on the discussed 22 independent features? Yes, but first, we need to finalize a supervised regression algorithm that fits our task.
We have a bunch of algorithms for regression tasks, and each algorithm has its pros and cons. One algorithm might fetch superior results as compared to others but might lack in terms of explainability. Even if explainability is not compromised, the deployment of such complex algorithms is a tedious task. In other words, there is a trade-off between accuracy, model complexity and, model explainability. An optimal algorithm must be explainable, accurate, and easy to deploy, but there’s nothing like an ideal algorithm.
For instance, Linear Regression is a comparatively simple and explainable algorithm. Deployment of Linear Regression requires minimal efforts, but on the contrary, it lacks accuracy when the data is non-linear. Complex algorithms perform better on non-linear datasets, but then the model lacks explainability.
Let’s proceed with Linear Regression for this task.
Linear Regression is a regression algorithm with a linear approach. It’s a supervised regression algorithm where we try to predict a continuous value of a given data point by generalizing the data we have in hand. The linear part indicates the linear approach for the generalization of data.
The idea is to predict the dependent variable (Y) using a given independent variable (X). This can be accomplished by fitting a best fit line in the data. A line providing the least sum of residual error is the best fit line or regression line.
What is a residual error?
A residual error is a measure of how far away a point is vertically from the regression line. Simply, it is the error between a predicted value and the observed actual value. A line providing the least sum of residual error is the best fit line or regression line.
Let’s predict Life expectancy by using Linear Regression.
Before building the model, we need to split the dataset into training and testing sets. We will make use of this test set for evaluating the performance of the model.
from sklearn.model_selection import train_test_split from sklearn.linear_model import LinearRegression target = life_exp['Life expectancy'] features = life_exp[life_exp.columns.difference(['Life expectancy', 'Year'])] #----- Splitting the dataset -----# x_train, x_test, y_train, y_test = train_test_split(pd.get_dummies(features), target, test_size=0.3) #----- Linear Regression -----# lr = LinearRegression() #----- Fitting model over training data -----# lr.fit(x_train, y_train) #----- Evaluating the model over test data -----# lr_confidence = lr.score(x_test, y_test) print("lr confidence: ", lr_confidence) #lr confidence: 0.9538309850283277
The coefficient of determination R-square came out to be closer to 1, which indicates the model optimally predicts the Life expectancies.
For validation of the model, let’s check the distribution of residuals.
sns.histplot(residuals, kde=True, color="orange") plt.title('Residual Plot') plt.xlabel('Residuals: (Predictions - Actual)') plt.ylabel('Density');
Residual distribution is approximately normal, having a mean close to zero. This is exactly what we are looking for.
Let’s visualize the residuals in a scatter plot!
Residuals are centered around zero, and the coefficient of determination R-square is close to 1. Close to 1 R-squared value indicates a good fit over the test dataset. With these results, we can conclude that our model is highly efficient in predicting Life Expectancy.
WHO (Public Health Standards)
World Health Organization (WHO) keeps track of the health status as well as many other related factors for all countries. They are responsible for monitoring the public health risk, promoting human health, and coordinating responses to health emergencies. WHO highly relies on statistical algorithms like Linear Regression for studying the Life expectancy and impact of pandemics over life-span.
Blue Shield of California (Healthcare Insurance)
Blue Shield of California(BSOC) is at the forefront of innovation in the healthcare domain. BSOC is a non-profit health insurance organization focused on providing the best possible medical treatment and insurance plans. BSOC takes advantage of the Linear Regression model for the estimation of medical expenses based on insurance data.
JPMorgan Chase & Co. (Finance)
JPMorgan Chase is a global leader in financial services offering solutions to the world’s most important corporations and government institutions. JPMC has generalized the use of Linear Regression in their Capital Asset Pricing Model (CAPM), where risky assets are merged with non-risky assets to reduce the unsystematic risk. Moreover, JPMC uses Linear Regression for Forecasting and Financial Analysis.
Johnson & Johnson (Pharmaceutical)
Johnson & Johnson (J&J) is an American multinational corporation that develops medical devices, pharmaceuticals, and consumer packaged goods. J&J uses Linear Regression for estimating the remaining shelf life of medicine stocks.
Walmart (Retail Corporation)
Walmart is a renowned retailing corporation that operates as different hypermarkets, departmental stores, grocery stores, and garments buying houses. Walmart relies on Regression analysis for sales forecasting and better decision-making.
Life expectancy prediction is one of the most popular machine learning projects that interviewers commonly find in fresher resumes. So, the possible interview questions that interviewers can ask are:
We started with understanding Life Expectancy and learned the factors affecting it. We further visualized the affecting parameters and correlated them to derive inferences. Finally, we covered the Linear Regression and implemented it to predict Life expectancy.
One can extend their life span by adopting a healthy lifestyle, proper education, and getting vaccinated. Of course, Demographic location plays an important role. In our analysis, we found that people living in Europe have a higher lifespan than other continents. A country’s GDP and Income composition affect Life Expectancy more broadly.
Some parameters like pollution and environmental index have been missing in this analysis and are expected to be highly related to Life Expectancy.
Subscribe to get well-designed content on data structures and algorithms, machine learning, system design, oops, and mathematics. enjoy learning!
Based on the nature of input that we provide to a machine learning algorithm, machine learning can be classified into 4 major categories - Supervised Learning, Unsupervised Learning, Semi-Supervised Learning, Reinforcement Learning.
Many major companies like Google, IBM, etc., are exploring machine learning potential in this domain. Cancer classification is one such area where ML can deliver a robust predictive model to identify the cancer possibility based on given observations.