Machine learning in Drug Discovery

Drug Discovery is a very long process as it involves countless experiments. Due to this, making a new drug takes a lot of time and money. Scientists have been looking for alternative methods to reduce the costs and funds required for a long time. Nowadays, Machine learning has become so advanced that it is being used for Drug Discovery, thus reducing the time needed to produce a new drug.

Key takeaways from this article

This blog focuses on the effectiveness of different compounds on Acetylcholinesterase. We will understand the answer to the following questions in detail:

  1. Practical use of this problem
  2. What are the various steps involved in drug discovery
  3. What steps are involved in building the XGBoost regressor
  4. Active and Inactive Compounds
  5. Need of fingerprints
  6. Possible Interview Questions on this topic

Problem Understanding and its Practical Use

We want to identify the compounds with the best effect on a particular target protein, here in our case, which is Acetylcholinesterase. Acetylcholinesterase, also known as AChE, AChase, or acetylhydrolase, is the primary cholinesterase in the body. It is an enzyme that catalyzes the breakdown of acetylcholine and some other choline esters that function as neurotransmitters. A commonly known disease is Alzheimer’s, where drugs must act on AChE to keep the condition in control.

What are the various steps involved in drug discovery

Drug Discovery has multiple steps, as can be seen in the figure :

Machine learning and deep learning algorithms can solve any of the mentioned steps, e.g., mining proteomic in target discovery, optimizing lead structures for better bioactivity, and analyzing accumulated data at the end of experiments to get the conclusion. In this article, we will build an XGBoost regressor model to predict the effect of compounds on AChE. We would recommend having a look at our XGBoost and Gradient Boosting blogs to understand the algorithms better. So let’s begin with the implementation steps:

Model Implementation Steps

Step 1: Importing libraries and loading the dataset

The database we use is the ‘Chembl’ database which is a manually curated database of bioactive compounds with drug-like properties. It contains data on more than 2 million compounds. We use the python library chemblwebresourceclient to get the data.

import pandas as pd
from chembl_webresource_client.new_client import new_client

data =
data_query ='acetylcholinesterase')

targets = pd.DataFrame.from_dict(data_query)

Step 2: Understanding the data


Target chembl_webresource_client dataset columns

Here ‘prefname’ contains our target name. ‘organism’ refers to a living being within which this target is present so that it can be a human or a bird. ‘targetchemblid’ refers to the unique id corresponding to our target. From the above ‘target’ data frame, we get that id will be ‘CHEMBL220’ for Human Acetylcholinesterase. We use these details (‘targetchembl_id’ that we get) to retrieve the activity data of different compounds on Human AChE.

data_new = new_client.activity
data1 = data_new.filter(target_chembl_id='CHEMBL220').filter(standard_type="IC50")

df = pd.DataFrame.from_dict(data1)

All chembl_webresource_client dataset columns

IC50, a value in the column ‘standard_type’ is filtered out to make our dataset uniform so that we do not have a mixture of varying biotype units. So from now onwards, we only use the data with the biotype unit as IC50.

Some of the fields present in the data which is of importance to us are:

  • ‘canonical_smiles’ → stores formulas for compounds.
  • ‘Moleculechemblid’ → represents the chembl_id of compounds.
  • ‘Standard_value’ → represents potency; the lower the number, the higher the potency. It means the percentage of protein required in water to produce the same inhibition. Hence lower the concentration required, the better.

Step 3: Data Preprocessing for the AChE dataset

#dropping records which donot have values in columns standard_value and canonical_smiles

df2 = df[df.standard_value.notna()]
df2 = df2[df2.canonical_smiles.notna()]

#dropping records with duplicate canonical_smiles values to keep them unique
df2_unique = df2.drop_duplicates(['canonical_smiles'])

selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2_unique[selection]

Here we select three columns that we need for our model: ‘canonicalsmiles’, ‘standardvalue’, and ‘moleculechemblid’. We remove records where values are not present in the mentioned columns. We take unique on ‘canonicalsmiles’ as we do not want duplicate records. Dropping duplicate ‘canonicalsmiles’ automatically ensures that final records are unique as the other two fields are set by default when ‘canonical_smiles’ is fixed.

We convert IC50 to pIC50 to make data values in ‘standard_value’ more uniformly distributed. We use the following formula for conversion:

pIC50 = -log10(IC50)

We also added a field called ‘bioactivity_threshold’ to use in EDA. The following conditions decide:

Class is inactive if standard_value > 10000

Class is active if standard_value < 1000

Active Compound means a small molecule that explicitly inhibits, stimulates, or otherwise alters the production or activity of a Target.

Inactive Compound means any Compound, whether or not covered under claims of a Collaboration Patent, which does not meet the level of activity specified in the definition of Active Compound.

So for practical drug purposes, we want active compounds. The final data sample is in the figure below.

Final dataset snippet

Step 4: Exploratory Data Analysis

import seaborn as sns
sns.countplot(x='bioactivity_class', data=df_2class)

Active vs inactive bioactivity classes in chembl dataset

Seaborn is a library of python used for plotting graphs

The above chart shows that the dataset is balanced for inactive and active compounds.

Step 5: Getting Fingerprints of compounds

We know that the machine only understands binary language. Hence we need to convert terms like ‘canonicalsmiles’ into numbers that the computer understands. Now we convert ‘canonicalsmiles’ into fingerprints based on PaDel Descriptor. PaDel Descriptor is software to calculate molecular descriptors and fingerprints. For each term, we get a length of 881 fingerprints which consists of 0 and 1. For more details, please see the link.

We get fingerprints as our feature vector, i.e., X, and the ‘pIC50’ field as our Y, which we predict in our model.

Step 6: Model Formation

XGBoost is a decision-tree-based ensemble Machine Learning algorithm that uses a gradient boosting framework. We have an open-source library in python for this. We can use it for both classification and regression.

import xgboost as xg
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
xgb_r = xg.XGBRegressor(objective ='reg:squarederror',n_estimators = 10, seed = 123), Y_train)
Y_pred = xgb_r.predict(X_test)

traintestsplit splits a dataset into test and train in a ratio of 20:80.

XGBRegressor creates an instance of a regression model based on XGBoost.

Step 7: Scatter Plot of Experimental vs. Predicted Values

import seaborn as sns
import matplotlib.pyplot as plt
ax = sns.regplot(Y_test, Y_pred)

We must note that this problem is a regression problem, not a classification problem. Hence we do not have the concept of accuracy. We get an error, and we try to minimize it. Mean Squared error would be an example of one such error. Here we see the scatter plot of predicted and actual values to see how the result of the model looks.

Predicted vs Actual values of drug discovery

Case Studies Of Companies Use-case


It is a startup founded in 2012. It developed an AI-powered platform, Centaur Chemist, which identifies potential drug targets and sends them for clinical trials. It currently has 14 patents to its name, of which seven are in the AI domain.


It is a South Korean startup founded in 2015. It developed an AI platform, Standigm BEST, which explores latent chemical space to generate novel compounds with desired properties. Also, it analyses literature to identify new drug compounds.

Possible Interview Questions

If we mention this project in our resume, then these are the following questions are likely to be asked in the interview:

  • What are regression problems?
  • Why XGBoost? What other algorithms can be tried in place of XGBoost?
  • What features were used for the final feature set?
  • What is the use of fingerprints?
  • Do Regression problems have accuracy?
  • What is the standard value, and how does it vary wrt compounds’ effect on protein?


Machine learning is acting as a lifesaver. It is helping us to reduce the time needed to identify the correct drugs for different diseases. The timely covid vaccine discovery itself is a prominent example of this fact. Today, machine learning and deep learning are involved in every step of drug discovery, from identification to analyzing experimental results. The future scope is tremendous as we have barely scratched the surface.

Enjoy Learning! Enjoy Algorithms!

Share on social media:

More blogs to explore

Our weekly newsletter

Subscribe to get weekly content on data structure and algorithms, machine learning, system design and oops.

© 2022 Code Algorithms Pvt. Ltd.

All rights reserved.