How AI could accelerate screening of potential inhibitors for SARS-CoV-2

Expleo
May 11, 2021

Blaise Mbunga Mputu (Data Scientist)

Expleo is involved in a wide range of life-sciences activities; this article focuses on a use case in drug discovery.

Coronavirus disease 2019 (COVID-19) is an infectious disease caused by a newly emerged, highly pathogenic virus, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Targeting the main protease (3CLMpro) is an appealing approach for drug development because this protein plays a significant role in viral replication and transcription[1].

After providing a brief overview of 3CLMpro, we introduce a computer-aided drug discovery (CADD) tool that could be applied to accelerate the identification (screening) of potential hit candidates for this protease.

SARS-CoV-2 and 3CLMpro

Figure 1: Structure of SARS-CoV-2 [2]

Coronaviruses are single-stranded RNA viruses that can be found in various animal species. The most recent one, SARS-CoV-2, has a genome of 29,891 nucleotides, which encode 9,860 amino acids[2].

Experimental high-throughput screening (HTS) programs for drug discovery are generally considered time-consuming and laborious[3]. To reduce the time and cost of experimental HTS, machine-learning-based virtual screening can help eliminate unlikely drug-target pairs so that only potentially active combinations are tested in the laboratory. This makes machine learning a natural choice for the task: recent studies have reported that AI techniques are effective for exploring the known chemical space and screening small molecules against 3CLMpro[4,5].

In the current work, we applied machine learning algorithms to build predictive models intended to accelerate the screening of potential 3CLMpro inhibitors.

An overview of data collection and pre-processing

To achieve this objective, bioactivity data were collected from two public resources:

  1. ChEMBL[6] is a database that maintains a comprehensive collection of drug-like small molecules. By querying ChEMBL, we extracted 374 compounds.
  2. Diamond Light Source[7], the UK's national synchrotron, provides data that also cover a collection of drug-like small molecules. From it, we extracted 280 compounds.

After cleaning and removal of duplicates, the final dataset consists of 584 compounds with their corresponding pIC50 values. The pIC50 is the negative base-10 logarithm of the IC50 (half-maximal inhibitory concentration, expressed in mol/L). The IC50 indicates how much of a particular drug or other substance is needed to inhibit a given biological process by half, so a higher pIC50 corresponds to a more potent compound.
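As a brief illustration of this definition, the sketch below converts an IC50 into a pIC50. It assumes the IC50 is given in nanomolar; the function name and the unit handling are illustrative assumptions, not details of the pipeline used in this work.

```python
import math


def ic50_nm_to_pic50(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive concentration")
    ic50_molar = ic50_nm * 1e-9  # 1 nM = 1e-9 mol/L
    return -math.log10(ic50_molar)


# An IC50 of 10 micromolar (10,000 nM) corresponds to a pIC50 of 5.
print(round(ic50_nm_to_pic50(10_000), 3))
```

Because the scale is logarithmic, each additional pIC50 unit means a tenfold lower concentration is needed to reach half inhibition.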

Molecular encoding and data labelling

Selected compounds were converted to SMILES[8,9] (simplified molecular-input line-entry system), a specification for describing the structure of chemical species. For machine learning purposes, each SMILES string must be converted into a list of features, called a fingerprint. Additionally, the pIC50 value is used as the label with a cut-off of 5: values greater than 5 are considered active, whereas those smaller than 5 are considered inactive[10].
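The labelling rule can be sketched in a few lines. The records below are invented for illustration only, and the handling of a value exactly equal to the cut-off (treated here as inactive) is an assumption, since the text does not specify that boundary case.

```python
ACTIVITY_CUTOFF = 5.0  # pIC50 cut-off stated in the article


def label_activity(pic50: float, cutoff: float = ACTIVITY_CUTOFF) -> int:
    """Return 1 (active) if pIC50 > cutoff, otherwise 0 (inactive).

    Assumption: a value exactly equal to the cut-off is labelled inactive.
    """
    return 1 if pic50 > cutoff else 0


# Hypothetical (SMILES, pIC50) records, not taken from the real dataset.
records = [
    ("CCO", 4.2),
    ("c1ccccc1", 5.0),
    ("CC(=O)Nc1ccc(O)cc1", 6.3),
]

labels = [label_activity(pic50) for _, pic50 in records]
print(labels)
```

Turning a continuous pIC50 into a binary label is what makes the problem a classification task rather than a regression task.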

Machine learning models

We trained a machine learning pipeline using the Python library PyCaret. Based on accuracy and the area under the ROC curve (AUC), the Gradient Boosting Classifier (GBC) was the best-performing algorithm. The model was then tuned automatically by random grid search over its hyperparameters with 10-fold cross-validation. The resulting model reached an accuracy of 67% and an AUC of 70%. The AUC is shown below (Figure 2); it means that there is a 70% chance that the model ranks a randomly chosen active compound above a randomly chosen inactive one.
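To make the two metrics concrete, here is a minimal pure-Python sketch of accuracy and of AUC computed as a pairwise ranking probability. The labels and scores are hypothetical, and the 0.5 decision threshold is an assumption; in practice a library such as PyCaret or scikit-learn computes these values.

```python
from itertools import product


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """AUC as the probability that a random active outscores a random inactive.

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one active and one inactive")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Hypothetical true labels and predicted probabilities of being active.
y_true = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.4, 0.6, 0.7, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(f"accuracy = {accuracy(y_true, y_pred):.2f}")
print(f"AUC      = {roc_auc(y_true, scores):.2f}")
```

Accuracy depends on the chosen threshold, whereas AUC only depends on how the scores rank actives relative to inactives, which is why the two numbers are usually reported together.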

Figure 2: AUC of Gradient Boosting Model

Final Thoughts

The machine learning model presented here illustrates how AI could accelerate the screening of potential inhibitors of SARS-CoV-2 (3CLMpro). With an AUC of 70% and an accuracy of 67%, the GBC model could be acceptable for this purpose. We think that, with more bioactivity data, the model could learn a better representation of small molecules and improve. Furthermore, data augmentation techniques could be used to improve the model, as implemented by Cortes-Ciriano and Bender[11]. The authors showed that data augmentation can increase both the generalization ability of predictive models and their robustness to changes in the structure of the data.

References

1. Li, Z. et al. Identify potent SARS-CoV-2 main protease inhibitors via accelerated free energy perturbation-based virtual screening of existing drugs. Proc. Natl. Acad. Sci. 117, 27381–27387 (2020).

2. Cascella, M., Rajnik, M., Cuomo, A., Dulebohn, S. C. & Di Napoli, R. Features, Evaluation, and Treatment of Coronavirus (COVID-19). in StatPearls (StatPearls Publishing, 2021).

3. Janzen, W. P. Screening Technologies for Small Molecule Discovery: The State of the Art. Chem. Biol. 21, 1162–1170 (2014).

4. Xu, Z. et al. Discovery of Potential Flavonoid Inhibitors Against COVID-19 3CL Proteinase Based on Virtual Screening Strategy. Front. Mol. Biosci. 7, 556481 (2020).

5. Zhavoronkov, A. et al. Potential non-covalent SARS-CoV-2 3C-like protease inhibitors designed using generative deep learning approaches and reviewed by human medicinal chemist in virtual reality. (2020) doi:10.13140/RG.2.2.13846.98881.

6. Gaulton, A. et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 40, D1100–D1107 (2012).

7. Walsh, M. A., Grimes, J. M. & Stuart, D. I. Diamond Light Source: contributions to SARS-CoV-2 biology and therapeutics. Biochem. Biophys. Res. Commun. 538, 40–46 (2021).

8. Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36 (1988).

9. Hirohara, M., Saito, Y., Koda, Y., Sato, K. & Sakakibara, Y. Convolutional neural network based on SMILES representation of compounds for detecting chemical motif. BMC Bioinformatics 19, 526 (2018).

10. Allen, B. K. et al. Large-Scale Computational Screening Identifies First in Class Multitarget Inhibitor of EGFR Kinase and BRD4. Sci. Rep. 5, 16924 (2015).

11. Cortes-Ciriano, I. & Bender, A. Improved Chemical Structure–Activity Modeling Through Data Augmentation. J. Chem. Inf. Model. 55, 2682–2692 (2015).
