Drug Discovery Using XGBoost Regressor Model

Drug Discovery is a very long process as it involves countless experiments. Due to this, making a new drug takes a lot of time and money. Scientists have been looking for alternative methods to reduce the costs and funds required for a long time. Nowadays, Machine learning has become so advanced that it is being used for Drug Discovery, thus reducing the time needed to produce a new drug.

Key takeaways from this article

This blog focuses on the effectiveness of different compounds on Acetylcholinesterase. We will understand the answer to the following questions in detail:

  1. Practical use of this problem
  2. What are the various steps involved in drug discovery
  3. What steps are involved in building the XGBoost regressor
  4. Active and Inactive Compounds
  5. Need of fingerprints
  6. Possible Interview Questions on this topic

Problem Understanding and its Practical Use

We want to identify the compounds with the best effect on a particular target protein, here in our case, which is Acetylcholinesterase. Acetylcholinesterase, also known as AChE, AChase, or acetylhydrolase, is the primary cholinesterase in the body. It is an enzyme that catalyzes the breakdown of acetylcholine and some other choline esters that function as neurotransmitters. A commonly known disease is Alzheimer’s, where drugs must act on AChE to keep the condition in control.

What are the various steps involved in drug discovery

Drug Discovery project can have multiple steps, like:

  1. Target Discovery
  2. Lead compound discovery and synthesis pathway
  3. Optimization
  4. Data Accumulation
  5. Repeat

Machine learning and deep learning algorithms can solve any of the mentioned steps, e.g., mining proteomics in target discovery, optimizing lead structures for better bioactivity, and analyzing accumulated data at the end of experiments to get a conclusion. In this article, we will build an XGBoost regressor model to predict the effect of compounds on AChE. So let’s begin with the implementation steps:

Model Implementation Steps

Step 1: Importing libraries and loading the dataset

The database we use is the ‘Chembl’ database which is a manually curated database of bioactive compounds with drug-like properties. It contains data on more than 2 million compounds. We use the python library chemblwebresourceclient to get the data.

import pandas as pd
from chembl_webresource_client.new_client import new_client

data = new_client.target
data_query = data.search('acetylcholinesterase')

targets = pd.DataFrame.from_dict(data_query)

Step 2: Understanding the data


# Index(['cross_references', 'organism', 'pref_name', 'score',
# 'species_group_flag', 'target_chembl_id', 'target_components', 
# 'target_type', tax_id'], dtype='object')

Here ‘prefname’ contains our target name. ‘organism’ refers to a living being within which this target is present so that it can be a human or a bird. ‘targetchemblid’ refers to the unique id corresponding to our target. From the above ‘target’ data frame, we get that id will be ‘CHEMBL220’ for Human Acetylcholinesterase. We use these details (‘targetchembl_id’ that we get) to retrieve the activity data of different compounds on Human AChE.

data_new = new_client.activity
data1 = data_new.filter(target_chembl_id='CHEMBL220').filter(standard_type="IC50")

df = pd.DataFrame.from_dict(data1)

'activity_id', 'activity_properties'
'assay_description', 'assay type'
'assav variant accession'
'assay_Vartant _mutation','bao endpoint',
'bao _format',
'Thao label'
'canonical smiles'. 'data validity comment',
'document liter descriertofffichenumen',
'document chembl id', 'document Tournal',
'molecule chembl id'
'molecule _pref_name',
'potential duplicate',
'audt units',
'record id',
'pcherbi value',
'standard flag'
'standard relation',
'standard text value',
'standard type standard units',
'standard upper value',
'standard _value',
'target_Chembl_id', 'target _organisn',
'target tax id'
'text value' , 'toid', 'type',
'uo units',
'upper value',

IC50, a value in the column ‘standard_type’, is filtered out to make our dataset uniform so that we do not have a mixture of varying biotype units. So, we will only use the data with the biotype unit as IC50.

Some of the fields present in the data which is of importance to us are:

  • ‘canonical_smiles’ → stores formulas for compounds.
  • ‘Moleculechemblid’ → represents the chembl_id of compounds.
  • ‘Standard_value’ → represents potency; the lower the number, the higher the potency. It means the percentage of protein required in water to produce the same inhibition. Hence lower the concentration required, the better.

We try to analyze raw data by various techniques. So, we draw a histogram of the DataFrame to see the distribution of values. We see that srcid is left-skewed while standardflag is right-skewed.

import matplotlib.pyplot as plt

Hist plot to check the skewness of the features present in the chembl dataset used for drug discovery project

Now we draw a heatmap to analyze whether any values are collinearly related.

import seaborn as sns
##convert data type from object to categorical to view further

Heatmap of subset features from the chembl dataset used for drug discovery project

We might be wondering why only six features were considered in the heatmap above when there were 45 features available in the dataset. It’s because many of the “not-included” features are present as “object” types and need to be changed to the “category” type. After that, all features will be considered in the heatmap. The following code tells us how to make changes and rerun the heatmap.

for col in df.columns:
  if(df[col].dtype == 'object'):
    df[col] = df[col].astype('category')
    df[col] = df[col].cat.codes

Heatmap of all the features in chembl dataset used for drug discovery project

The data we have cannot be used directly, so let us see how we can preprocess this data in our next section.

Step 3: Data Preprocessing for the AChE dataset

#dropping records which donot have values in columns standard_value and canonical_smiles
df2 = df[df.standard_value.notna()]
df2 = df2[df2.canonical_smiles.notna()]
#dropping records with duplicate canonical_smiles values to keep them unique
df2_unique = df2.drop_duplicates(['canonical_smiles'])
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2_unique[selection]

Here we select three columns that we need for our model: ‘canonicalsmiles’, ‘standardvalue’, and ‘moleculechemblid’. We remove records where values are not present in the mentioned columns. We take unique on ‘canonicalsmiles’ as we do not want duplicate records. Dropping duplicate ‘canonicalsmiles’ automatically ensures that final records are unique, as the other two fields are set by default when ‘canonical_smiles’ is fixed.

We convert IC50 to pIC50 to make data values in ‘standard_value’ more uniformly distributed. We use the following formula for conversion:

pIC50 = -log10(IC50)

We also added a field called ‘bioactivity_threshold’ to use in EDA. The following conditions decide:

Class is inactive if standard_value > 10000

Class is active if standard_value < 1000

Active Compound means a small molecule that explicitly inhibits, stimulates, or otherwise alters the production or activity of a Target.

Inactive Compound means any Compound, whether or not covered under claims of a Collaboration Patent, which does not meet the level of activity specified in the definition of Active Compound. So for practical drug purposes, we want active compounds.

We still do not know the relation between different columns or ratios of data present. In the next section, we perform EDA to see the ratio of active and inactive compounds.

Step 4: Exploratory Data Analysis

import seaborn as sns
sns.countplot(x='bioactivity_class', data=df_2class)

Active vs inactive bioactivity classes in chembl dataset

Seaborn is a library of python used for plotting graphs

The above chart shows that the dataset is balanced for inactive and active compounds.

The machine does not understand categories or alphabets. It only understands binary language or numbers. So in the next section, we convert molecular compounds into features that computers can understand.

Step 5: Getting Fingerprints of compounds

We know that the machine only understands the binary language. Hence we need to convert terms like ‘canonicalsmiles’ into numbers that the computer understands. Now we convert ‘canonicalsmiles’ into fingerprints based on PaDel Descriptor. PaDel Descriptor is software to calculate molecular descriptors and fingerprints.

For each term, we get a length of 881 fingerprints which consists of 0 and 1. For more details, please see the link here.

How to form the features vectors from our dataset?

We get fingerprints as our feature vector, i.e., X, and the ‘pIC50’ field as our Y, which we predict in our model.

We have prepared our data and divided it into X and Y. In the next section, we see how to train a model which can help us predict the effect of compounds on AChE.

Step 6: Model Formation

XGBoost is a decision-tree-based ensemble Machine Learning algorithm that uses a gradient-boosting framework. We have an open-source library in python for this. We can use it for both classification and regression.

Please note that XGBoost is an extreme form of gradient boosting, but if we have gradient boosting, why do we need to go to XGBoost? It is so because XGBoost uses an advanced form of regularization, has high performance, has high training speed, and can be parallelized across clusters. To know more about XGBoost, please read our blog here. We can see below the code for building the model using XGBoost.

import xgboost as xg
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
xgb_r = xg.XGBRegressor(objective ='reg:squarederror',n_estimators = 10, seed = 123)

xgb_r.fit(X_train, Y_train)
Y_pred = xgb_r.predict(X_test)

traintestsplit(test_size=0.2) splits a dataset into test and train in a ratio of 20:80.

XGBRegressor creates an instance of a regression model based on XGBoost

Step 7: Scatter Plot of Experimental vs. Predicted Values

import seaborn as sns
import matplotlib.pyplot as plt
ax = sns.regplot(Y_test, Y_pred)

We must note that this problem is a regression problem, not a classification problem. Hence we do not have the concept of accuracy. We get an error, and we try to minimize it. Mean Squared error would be an example of one such error. Here we see the scatter plot of predicted and actual values to see how the result of the model looks.

Predicted vs Actual scatter plot for XGBoost Regression model

Case Studies Of Companies Use-case


It is a startup founded in 2012. It developed an AI-powered platform, Centaur Chemist, which identifies potential drug targets and sends them for clinical trials. It currently has 14 patents to its name, of which seven are in the AI domain.


It is a South Korean startup founded in 2015. It developed an AI platform, Standigm BEST, which explores latent chemical space to generate novel compounds with desired properties. Also, it analyses literature to identify new drug compounds.

Possible Interview Questions

If we mention this project in our resume, then these are the following questions are likely to be asked in the interview:

  • What are regression problems?
  • Why XGBoost? What other algorithms can be tried in place of XGBoost?
  • What features were used for the final feature set?
  • What is the use of fingerprints?
  • Do Regression problems have accuracy?
  • What is the standard value, and how does it vary wrt compounds’ effect on protein?


Machine learning is acting as a lifesaver. It is helping us to reduce the time needed to identify the correct drugs for different diseases. The timely covid vaccine discovery itself is a prominent example of this fact. Today, machine learning and deep learning are involved in every step of drug discovery, from identification to analyzing experimental results. The future scope is tremendous, as we have barely scratched the surface.

Enjoy learning, Enjoy algorithms!

Share feedback with us

More blogs to explore

Our weekly newsletter

Subscribe to get weekly content on data structure and algorithms, machine learning, system design and oops.

© 2022 Code Algorithms Pvt. Ltd.

All rights reserved.