Drug Discovery is a very long process as it involves countless experiments. Due to this, making a new drug takes a lot of time and money. Scientists have been looking for alternative methods to reduce the costs and funds required for a long time. Nowadays, Machine learning has become so advanced that it is being used for Drug Discovery, thus reducing the time needed to produce a new drug.
This blog focuses on the effectiveness of different compounds on Acetylcholinesterase. We will understand the answer to the following questions in detail:
We want to identify the compounds with the best effect on a particular target protein, here in our case, which is Acetylcholinesterase. Acetylcholinesterase, also known as AChE, AChase, or acetylhydrolase, is the primary cholinesterase in the body. It is an enzyme that catalyzes the breakdown of acetylcholine and some other choline esters that function as neurotransmitters. A commonly known disease is Alzheimer’s, where drugs must act on AChE to keep the condition in control.
Drug Discovery project can have multiple steps, like:
Machine learning and deep learning algorithms can solve any of the mentioned steps, e.g., mining proteomics in target discovery, optimizing lead structures for better bioactivity, and analyzing accumulated data at the end of experiments to get a conclusion. In this article, we will build an XGBoost regressor model to predict the effect of compounds on AChE. So let’s begin with the implementation steps:
The database we use is the ‘Chembl’ database which is a manually curated database of bioactive compounds with drug-like properties. It contains data on more than 2 million compounds. We use the python library chemblwebresourceclient to get the data.
import pandas as pd
from chembl_webresource_client.new_client import new_client
data = new_client.target
data_query = data.search('acetylcholinesterase')
targets = pd.DataFrame.from_dict(data_query)
targets.columns
# Index(['cross_references', 'organism', 'pref_name', 'score',
# 'species_group_flag', 'target_chembl_id', 'target_components',
# 'target_type', tax_id'], dtype='object')
Here ‘prefname’ contains our target name. ‘organism’ refers to a living being within which this target is present so that it can be a human or a bird. ‘targetchemblid’ refers to the unique id corresponding to our target. From the above ‘target’ data frame, we get that id will be ‘CHEMBL220’ for Human Acetylcholinesterase. We use these details (‘targetchembl_id’ that we get) to retrieve the activity data of different compounds on Human AChE.
data_new = new_client.activity
data1 = data_new.filter(target_chembl_id='CHEMBL220').filter(standard_type="IC50")
df = pd.DataFrame.from_dict(data1)
df.columns
'''
Index(['activity_comment',
'activity_id', 'activity_properties'
'assay_chembl_id',
'assay_description', 'assay type'
'assav variant accession'
'assay_Vartant _mutation','bao endpoint',
'bao _format',
'Thao label'
'canonical smiles'. 'data validity comment',
'document liter descriertofffichenumen',
'document chembl id', 'document Tournal',
'molecule chembl id'
'molecule _pref_name',
'potential duplicate',
'parent_molecule_chembl_id',
'audt units',
'record id',
'pcherbi value',
'relation',
'src_id'
'standard flag'
'standard relation',
'standard text value',
'standard type standard units',
'standard upper value',
'standard _value',
'target_Chembl_id', 'target _organisn',
'target_pref_name'
'target tax id'
'text value' , 'toid', 'type',
'units'
'uo units',
'upper value',
'value'],
type='object")
'''
IC50, a value in the column ‘standard_type’, is filtered out to make our dataset uniform so that we do not have a mixture of varying biotype units. So, we will only use the data with the biotype unit as IC50.
Some of the fields present in the data which is of importance to us are:
We try to analyze raw data by various techniques. So, we draw a histogram of the DataFrame to see the distribution of values. We see that srcid is left-skewed while standardflag is right-skewed.
import matplotlib.pyplot as plt
df.hist()
plt.tight_layout()
plt.show()
Now we draw a heatmap to analyze whether any values are collinearly related.
import seaborn as sns
cmap=sns.diverging_palette(500,10,as_cmap=True)
sns.heatmap(df.corr(),cmap=cmap,center=0,annot=False,square=True)
plt.show()
##convert data type from object to categorical to view further
We might be wondering why only six features were considered in the heatmap above when there were 45 features available in the dataset. It’s because many of the “not-included” features are present as “object” types and need to be changed to the “category” type. After that, all features will be considered in the heatmap. The following code tells us how to make changes and rerun the heatmap.
for col in df.columns:
if(df[col].dtype == 'object'):
df[col] = df[col].astype('category')
df[col] = df[col].cat.codes
The data we have cannot be used directly, so let us see how we can preprocess this data in our next section.
#dropping records which donot have values in columns standard_value and canonical_smiles
df2 = df[df.standard_value.notna()]
df2 = df2[df2.canonical_smiles.notna()]
#dropping records with duplicate canonical_smiles values to keep them unique
df2_unique = df2.drop_duplicates(['canonical_smiles'])
selection = ['molecule_chembl_id','canonical_smiles','standard_value']
df3 = df2_unique[selection]
Here we select three columns that we need for our model: ‘canonicalsmiles’, ‘standardvalue’, and ‘moleculechemblid’. We remove records where values are not present in the mentioned columns. We take unique on ‘canonicalsmiles’ as we do not want duplicate records. Dropping duplicate ‘canonicalsmiles’ automatically ensures that final records are unique, as the other two fields are set by default when ‘canonical_smiles’ is fixed.
We convert IC50 to pIC50 to make data values in ‘standard_value’ more uniformly distributed. We use the following formula for conversion:
pIC50 = -log10(IC50)
We also added a field called ‘bioactivity_threshold’ to use in EDA. The following conditions decide:
Class is inactive if standard_value > 10000
Class is active if standard_value < 1000
Active Compound means a small molecule that explicitly inhibits, stimulates, or otherwise alters the production or activity of a Target.
Inactive Compound means any Compound, whether or not covered under claims of a Collaboration Patent, which does not meet the level of activity specified in the definition of Active Compound. So for practical drug purposes, we want active compounds.
We still do not know the relation between different columns or ratios of data present. In the next section, we perform EDA to see the ratio of active and inactive compounds.
import seaborn as sns
sns.countplot(x='bioactivity_class', data=df_2class)
Seaborn is a library of python used for plotting graphs
The above chart shows that the dataset is balanced for inactive and active compounds.
The machine does not understand categories or alphabets. It only understands binary language or numbers. So in the next section, we convert molecular compounds into features that computers can understand.
We know that the machine only understands the binary language. Hence we need to convert terms like ‘canonicalsmiles’ into numbers that the computer understands. Now we convert ‘canonicalsmiles’ into fingerprints based on PaDel Descriptor. PaDel Descriptor is software to calculate molecular descriptors and fingerprints.
For each term, we get a length of 881 fingerprints which consists of 0 and 1. For more details, please see the link here.
We get fingerprints as our feature vector, i.e., X, and the ‘pIC50’ field as our Y, which we predict in our model.
We have prepared our data and divided it into X and Y. In the next section, we see how to train a model which can help us predict the effect of compounds on AChE.
XGBoost is a decision-tree-based ensemble Machine Learning algorithm that uses a gradient-boosting framework. We have an open-source library in python for this. We can use it for both classification and regression.
Please note that XGBoost is an extreme form of gradient boosting, but if we have gradient boosting, why do we need to go to XGBoost? It is so because XGBoost uses an advanced form of regularization, has high performance, has high training speed, and can be parallelized across clusters. To know more about XGBoost, please read our blog here. We can see below the code for building the model using XGBoost.
import xgboost as xg
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
xgb_r = xg.XGBRegressor(objective ='reg:squarederror',n_estimators = 10, seed = 123)
xgb_r.fit(X_train, Y_train)
Y_pred = xgb_r.predict(X_test)
traintestsplit(test_size=0.2) splits a dataset into test and train in a ratio of 20:80.
XGBRegressor creates an instance of a regression model based on XGBoost
import seaborn as sns
import matplotlib.pyplot as plt
ax = sns.regplot(Y_test, Y_pred)
ax.set_xlabel('Experimental_pIC50')
ax.set_ylabel('Predicted_pIC50')
plt.savefig('prediction_plot2.pdf')
We must note that this problem is a regression problem, not a classification problem. Hence we do not have the concept of accuracy. We get an error, and we try to minimize it. Mean Squared error would be an example of one such error. Here we see the scatter plot of predicted and actual values to see how the result of the model looks.
It is a startup founded in 2012. It developed an AI-powered platform, Centaur Chemist, which identifies potential drug targets and sends them for clinical trials. It currently has 14 patents to its name, of which seven are in the AI domain.
It is a South Korean startup founded in 2015. It developed an AI platform, Standigm BEST, which explores latent chemical space to generate novel compounds with desired properties. Also, it analyses literature to identify new drug compounds.
If we mention this project in our resume, then these are the following questions are likely to be asked in the interview:
Machine learning is acting as a lifesaver. It is helping us to reduce the time needed to identify the correct drugs for different diseases. The timely covid vaccine discovery itself is a prominent example of this fact. Today, machine learning and deep learning are involved in every step of drug discovery, from identification to analyzing experimental results. The future scope is tremendous, as we have barely scratched the surface.
Subscribe to get weekly content on data structure and algorithms, machine learning, system design and oops.