Dual-Event Machine Learning Models to Accelerate Drug Discovery
Sean Ekins(1,2)*, Robert C. Reynolds(3,4)*, Hiyun Kim(5), Mi-Sun Koo(5), Marilyn Ekonomidis(5), Meliza Talaue(5), Steve D. Paget(5), Lisa K. Woolhiser(6), Anne J. Lenaerts(6), Barry A. Bunin(1), Nancy Connell(5) and Joel S. Freundlich(5,7)*
(1) Collaborative Drug Discovery, 1633 Bayshore Highway, Suite 342, Burlingame, CA 94010, USA.
(2) Collaborations in Chemistry, 5616 Hilltop Needmore Road, Fuquay-Varina, NC 27526, USA.
(3) Southern Research Institute, 2000 Ninth Avenue South, Birmingham, AL 35205, USA.
(4) Current address: University of Alabama at Birmingham, College of Arts and Sciences, Department of Chemistry, 1530 3rd Avenue South, Birmingham, Alabama 35294-1240, USA.
(5) Department of Medicine, Center for Emerging and Reemerging Pathogens, UMDNJ – New Jersey Medical School, 185 South Orange Avenue, Newark, NJ 07103, USA.
(6) Department of Microbiology, Immunology and Pathology, Colorado State University, 200 West Lake Street, CO 80523, USA.
(7) Department of Pharmacology & Physiology, UMDNJ – New Jersey Medical School, 185 South Orange Avenue, Newark, NJ 07103, USA.
Tuberculosis kills 1.6-1.7 million people per year (~1 every 19 seconds)
One third of the world's population is infected
Multidrug resistance in 4.3% of cases
Extensively drug-resistant TB is increasing in incidence
Only one new drug (bedaquiline) in 40 years
Drug-drug interactions and Co-morbidity with HIV
Collaboration between groups is rare
These groups may work on existing or new targets
Use of computational methods with TB is rare
TB facts
streptomycin (1943)
para-aminosalicylic acid (1949)
isoniazid (1952) (Bayer, Roche, Squibb)
pyrazinamide (1954)
cycloserine (1955)
ethambutol (1962)
rifampicin (1967)
~ 20 public datasets for TB
Including Novartis data on TB hits
>300,000 cpds
Patents, Papers Annotated by CDD
Open to browse by anyone
http://www.collaborativedrug.com/register
Bayesian Model Construction: Mtb Whole-Cell HTS
• Learning from 3,779 compounds from an NIAID library
- active: MIC < 5 µM
- inactive: MIC ≥ 5 µM
Bayesian machine learning
Ekins, Williams and Xu, Drug Metab Dispos 38: 2302-2308, 2010
Bayesian classification is a simple probabilistic classification model. It is based on
Bayes’ theorem:
p(h|d) = p(d|h) p(h) / p(d)
where
h is the hypothesis or model
d is the observed data
p(h) is the prior belief (probability of hypothesis h before observing any data)
p(d) is the data evidence (marginal probability of the data)
p(d|h) is the likelihood (probability of data d if hypothesis h is true)
p(h|d) is the posterior probability (probability of hypothesis h being true given the
observed data d)
A weight is calculated for each feature using a Laplacian-adjusted probability
estimate to account for the different sampling frequencies of different features.
The weights are summed to provide a probability estimate
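As a rough illustration of the weighting step, here is a minimal sketch of one commonly described form of the Laplacian correction: for a feature seen in N_F molecules, A_F of them active, with overall active fraction P, the weight is log((A_F + 1) / (N_F * P + 1)). This is an assumption about the general technique, not the Discovery Studio implementation; the function names and toy data are hypothetical.

```python
import math
from collections import Counter

def laplacian_weights(samples, labels):
    """Per-feature weights from a Laplacian-corrected estimate (assumed form).

    samples: list of sets of feature identifiers (e.g. fingerprint bits)
    labels:  list of booleans, True for an active molecule
    """
    base_rate = sum(labels) / len(samples)   # overall fraction of actives
    n_with = Counter()                       # molecules containing each feature
    a_with = Counter()                       # actives containing each feature
    for features, active in zip(samples, labels):
        for f in features:
            n_with[f] += 1
            if active:
                a_with[f] += 1
    # Rarely sampled features are pulled toward a weight of zero.
    return {f: math.log((a_with[f] + 1) / (n_with[f] * base_rate + 1))
            for f in n_with}

def bayesian_score(features, weights):
    """Sum the weights of the features present; unseen features add nothing."""
    return sum(weights.get(f, 0.0) for f in features)

# Hypothetical toy data: four molecules described by letter-coded features.
toy_samples = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"c", "d"}]
toy_labels = [True, True, False, False]
toy_weights = laplacian_weights(toy_samples, toy_labels)
print(bayesian_score({"a", "b"}, toy_weights))
print(bayesian_score({"c", "d"}, toy_weights))
```

The summed score is a relative ranking value rather than a calibrated probability, which is consistent with the negative and positive score ranges reported later in the slides.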
Novel Bayesian Models for Mtb Whole-Cell Efficacy
SRI MLSMR 220K single point model
active: ≥90% inhibition @ 10 µM; inactive: <90% inhibition @ 10 µM
SRI MLSMR 2.5K dose response model
active: IC50 ≤ 5 µM; inactive: IC50 > 5 µM
Ekins, S. et al., Mol. Biosyst. 2010, 6, 840-51; Ekins, S. et al., Mol. Biosyst. 2010, 6, 2316-2324.
• Laplacian-corrected Bayesian classifier models (Accelrys Discovery Studio)
• Molecular function class fingerprints of maximum diameter 6 (FCFP_6)
• Simple molecular descriptors chosen including AlogP, molecular weight,
# rotatable bonds, # rings, # hydrogen bond acceptors, # hydrogen bond
donors, and polar surface area
• Validated w/ leave-one-out cross-validation & leave-50%-out cross-validation
Model Building and Validation
Bayesian Classification TB Models
Dataset (number of molecules)                 External ROC    Internal ROC    Concordance     Specificity     Sensitivity
MLSMR all single-point screen (N = 220463)    0.86 ± 0        0.86 ± 0        78.56 ± 1.86    78.59 ± 1.94    77.13 ± 2.26
MLSMR dose-response set (N = 2273)            0.73 ± 0.01     0.75 ± 0.01     66.85 ± 4.06    67.21 ± 7.05    65.47 ± 7.96
We can use the public data for machine learning model building
Using Discovery Studio Bayesian model
Leave out 50% x 100
Ekins et al., Mol BioSyst, 6: 840-851, 2010
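The "leave out 50% x 100" validation and the concordance, specificity and sensitivity columns can be sketched as follows. This is a minimal illustration that assumes the reported values are a mean and standard deviation over 100 random half splits; the threshold "model", the function names and the toy data are hypothetical stand-ins for the Bayesian classifier.

```python
import random
import statistics

def classification_stats(y_true, y_pred):
    """Concordance, sensitivity and specificity, each as a percentage."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    return {
        "concordance": 100.0 * (tp + tn) / len(pairs),
        "sensitivity": 100.0 * tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": 100.0 * tn / (tn + fp) if (tn + fp) else float("nan"),
    }

def leave_half_out(xs, ys, fit, predict, repeats=100, seed=0):
    """Repeat a random 50/50 train/test split and summarise as (mean, SD)."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    runs = []
    for _ in range(repeats):
        rng.shuffle(indices)
        half = len(indices) // 2
        train, test = indices[:half], indices[half:]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        preds = [predict(model, xs[i]) for i in test]
        runs.append(classification_stats([ys[i] for i in test], preds))
    return {key: (statistics.mean([r[key] for r in runs]),
                  statistics.stdev([r[key] for r in runs]))
            for key in runs[0]}

# Hypothetical stand-in model: a threshold at the mean of the training values.
def fit_threshold(train_x, train_y):
    return sum(train_x) / len(train_x)

def predict_threshold(threshold, x):
    return x > threshold

toy_x = list(range(40))
toy_y = [x >= 20 for x in toy_x]
summary = leave_half_out(toy_x, toy_y, fit_threshold, predict_threshold)
print(summary["concordance"])
```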
Bayesian Classification Models for TB
Figure panels: Good, Bad (active compounds: MIC < 5 µM)
Laplacian-corrected Bayesian classifier models were generated using FCFP_6 and
simple descriptors. Two models were built, on approximately 220,000 and more than 2,000 compounds.
Ekins et al., Mol BioSyst, 6: 840-851, 2010
Additional test sets: 100K library; Novartis data; FDA drugs
Suggests models can predict data from the same and independent labs
Enrichments 4-10 fold
Initial enrichment – enables screening few compounds to find actives
21 hits in 2108 cpds; 34 hits in 248 cpds; 1702 hits in >100K cpds
Ekins and Freundlich, Pharm Res, 28, 1859-1869, 2011. Ekins et al., Mol BioSyst, 6: 840-851, 2010
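Fold enrichment compares the hit rate among the top-ranked compounds with the hit rate of the whole set. A minimal sketch follows, assuming the usual definition of an enrichment factor; the function name and the toy ranking are hypothetical and are not taken from the cited papers.

```python
def enrichment_factor(ranked_labels, fraction):
    """Hit rate in the top `fraction` of a ranked list divided by the overall hit rate.

    ranked_labels: booleans ordered from best to worst model score,
                   True where the compound is an experimental hit.
    """
    n_total = len(ranked_labels)
    hits_total = sum(ranked_labels)
    if hits_total == 0:
        raise ValueError("enrichment is undefined when the set contains no hits")
    n_top = max(1, round(n_total * fraction))
    hits_top = sum(ranked_labels[:n_top])
    return (hits_top / n_top) / (hits_total / n_total)

# Hypothetical toy ranking: 100 compounds, 10 hits, 5 of them in the top 10.
toy_ranking = [True] * 5 + [False] * 5 + [True] * 5 + [False] * 85
print(enrichment_factor(toy_ranking, 0.10))
```

In this toy case the top 10% has a 50% hit rate against a 10% baseline, which is a five-fold enrichment and falls within the 4-10 fold range quoted on the slide.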
Testing to date has been retrospective
Can we use our models to select compounds and influence
design?
Prospective prediction
Do it enough times to show robustness
Testing prospectively
Ranked Asinex 25K library with MLSMR dose response model –
Bayesian score range -28.4 – 15.3
99 compounds screened (Bayesian score 9.4 – 15.3).
12 cpds were identified with IC90 < 30 ug/mL
~12% hit rate
Most active: SYN 22269076, a pyrazolo[1,5-a]pyrimidine
IC50 1.1 µg/mL (3.2 µM)
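The prospective test above amounts to ranking a library by Bayesian score, sending the top of the list for assay, and reporting the fraction that turn out active. A minimal sketch, with hypothetical compound identifiers and function names; only the 12-of-99 count and the score range endpoints come from the slide.

```python
import heapq

def select_top(scored_library, n):
    """Return the n (compound_id, score) pairs with the highest scores."""
    return heapq.nlargest(n, scored_library, key=lambda pair: pair[1])

def hit_rate(n_hits, n_tested):
    """Hit rate as a percentage of the compounds actually assayed."""
    return 100.0 * n_hits / n_tested

# Hypothetical scored library; the real one held about 25K Asinex compounds.
toy_library = [("cpd-A", 15.3), ("cpd-B", -28.4), ("cpd-C", 9.4), ("cpd-D", 2.0)]
picked = select_top(toy_library, 2)
print(picked)

# 12 compounds with IC90 < 30 ug/mL out of 99 screened.
print(round(hit_rate(12, 99), 1))
```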
Bayesian Machine Learning Models – testing
Bayesian scores: 14.9, 10.6, 9.8
Bob Reynolds (SRI)
Principal component analysis (PCA) of all SRI datasets illustrates the overlap in
chemistry space among the datasets from this study (red = TAACF-CB2, green = MLSMR,
black = kinase dataset). Three principal components explain 72% of the variance.
Workflow (figure):
1. High-throughput phenotypic Mtb screening populates an Mtb screening molecule database
2. Descriptors + bioactivity (+ cytotoxicity) are used to build a Bayesian machine learning Mtb model
3. A molecule database (e.g. GSK malaria actives) is virtually scored using the Bayesian models
4. Top-scoring molecules are assayed for Mtb growth inhibition to identify in vitro hits
5. New bioactivity data may enhance the models
Outcome: increased hit/lead discovery efficiency
Dual-Event models
A more stringent definition of ACTIVE:
IC90 < 10 µg/mL (CB2) or < 10 µM (MLSMR), and a selectivity index (SI)
greater than ten.
SI = CC50/IC90, where CC50 is the concentration that
resulted in 50% inhibition of Vero cells.
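The dual-event labelling rule can be sketched as below. This is a minimal illustration that assumes IC90 and CC50 are expressed in the same units; the class, function names and toy compounds are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Compound:
    name: str
    ic90: float  # whole-cell Mtb IC90
    cc50: float  # Vero cell CC50, same units as ic90

def selectivity_index(cc50, ic90):
    """SI = CC50 / IC90."""
    return cc50 / ic90

def is_dual_event_active(compound, ic90_cutoff=10.0, si_cutoff=10.0):
    """Active only when potent (IC90 below cutoff) AND selective (SI above cutoff)."""
    return (compound.ic90 < ic90_cutoff
            and selectivity_index(compound.cc50, compound.ic90) > si_cutoff)

# Hypothetical compounds covering the three interesting cases.
toy_compounds = [
    Compound("potent-selective", ic90=2.0, cc50=50.0),   # SI = 25
    Compound("potent-cytotoxic", ic90=2.0, cc50=10.0),   # SI = 5
    Compound("weak-selective", ic90=20.0, cc50=400.0),   # SI = 20, but IC90 too high
]
labels = {c.name: is_dual_event_active(c) for c in toy_compounds}
print(labels)
```

A potent but cytotoxic compound counts as active under a single-event (potency only) rule and as inactive here, which is the distinction the dual-event model is trained on.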
Bayesian Classification TB Models
Dataset (number of molecules)                           External ROC    Internal ROC    Concordance     Specificity     Sensitivity
MLSMR all single-point screen (N = 220463)              0.86 ± 0        0.86 ± 0        78.56 ± 1.86    78.59 ± 1.94    77.13 ± 2.26
MLSMR dose-response set (N = 2273)                      0.73 ± 0.01     0.75 ± 0.01     66.85 ± 4.06    67.21 ± 7.05    65.47 ± 7.96
NEW: dose response and cytotoxicity (N = 2273)          0.82 ± 0.02     0.84 ± 0.02     82.61 ± 4.68    83.91 ± 5.48    65.99 ± 7.47
Cross-validated ROC AUC (ROC XV AUC):
single point = 0.88; dose response = 0.78; dose response + cytotoxicity = 0.86
Ekins et al., PLOSONE, in press 2013
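For readers unfamiliar with the ROC scores quoted above, the area under the ROC curve equals the probability that a randomly chosen active is scored higher than a randomly chosen inactive. A minimal pairwise sketch of that definition follows; it is not the Discovery Studio routine, and the function name and toy scores are hypothetical.

```python
def roc_auc(scores, labels):
    """ROC AUC as the fraction of (active, inactive) pairs ranked correctly.

    Ties count as half a correct pair. Runs in O(actives * inactives),
    which is fine for an illustration but slow for very large sets.
    """
    active_scores = [s for s, is_active in zip(scores, labels) if is_active]
    inactive_scores = [s for s, is_active in zip(scores, labels) if not is_active]
    if not active_scores or not inactive_scores:
        raise ValueError("need at least one active and one inactive")
    correct = 0.0
    for a in active_scores:
        for i in inactive_scores:
            if a > i:
                correct += 1.0
            elif a == i:
                correct += 0.5
    return correct / (len(active_scores) * len(inactive_scores))

# Hypothetical scores: one inactive outranks one active.
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_labels = [True, False, True, False]
print(roc_auc(toy_scores, toy_labels))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation.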
Bayesian Machine Learning Models – blind testing
Dual event model shows increased enrichment
Ekins et al.,Chem Biol 20, 370–378, 2013
1. Virtually screen 13,533-member GSK antimalarial hit library
2. Model = SRI TAACF-CB dose response + cytotoxicity model
3. Top 46 commercially available compounds visually inspected
4. 7 compounds chosen for Mtb testing based on
- drug-likeness
- chemotype diversity
Prospective prediction of antimalarial compounds vs Mtb
Dataset (number of molecules); External ROC Score; Internal ROC Score; Concordance; Specificity; Sensitivity
TAACF-CB2 IC90 and cytotoxicity (N = 1783): 0.64; 0.59 ± 0.01; 0.63 ± 0.02; 55.74 ± 1.31; 61.61 ± 8.96
7 tested, 5 active (~71% hit rate)
Ekins et al.,Chem Biol 20, 370–378, 2013
Bayesian Model Follow-up: Do we have a lead?
• BAS00521003/ TCMDC-125802 reported to be a
P. falciparum lactate dehydrogenase inhibitor
• Only one report (that we were unaware of when
picking the compound) of antitubercular activity
from 1969
- solid agar MIC = 1 µg/mL (“wild strain”)
- “no activity” in mouse model up to 400 mg/kg
- however, activity was solely judged by
extension of survival!
Bruhin, H. et al., J. Pharm. Pharmac. 1969, 21, 423-433.
SRI MLSMR 220K library contains:
107 hits with this substructure
- 3 nitrofuryl hydrazones
- 10 furyl hydrazones
- 19 nitrophenyl hydrazones
32 inactives with this substructure
Maddry et al., Tuberculosis 2009, 89, 354.
MIC of 0.0625 ug/mL
Efficacy Profiling of TCMDC-125802
• 64X MIC affords 6 logs of kill
• Resistance and/or drug
instability beyond 14 d
Vero cells: CC50 = 4.0 µg/mL
Selectivity index: SI = CC50/MIC(Mtb) = 16–64
Ekins et al.,Chem Biol 20, 370–378, 2013
In vivo Evaluation of TCMDC-125802
Goal: Evaluate the in vivo safety and efficacy of JSF-2019 in mouse
models of TB infection
Step #2: 7-day Maximum Tolerated Dose study in mice
- formulated in 0.5% methyl cellulose
- single dose p.o. @ 30, 100, and 300 mg/kg in B6D2F1 mice
- no overt toxicity
Lisa Woolhiser and Anne Lenaerts (CSU)
Step #3: evaluation in GKO mouse model of TB infection
- Five 12 week-old female C57BL/6 mice infected with Mtb Erdman via
low-dose aerosol exposure
- Days 16 – 23 : dosed w/ 300 mg/kg JSF-2019 p.o. OR 25 mg/kg INH
OR untreated
- Sacrificed day 24 and lung and spleen homogenates were cultured
- no difference in lungs and spleens vs. control
http://goo.gl/UujRX Ballell et al., Fueling Open-Source drug discovery: 177 small-molecule leads against tuberculosis, ChemMedChem 2013.
GSK screened 2M compounds 3 years ago
Bayesian predictions for 14,000 cpds: 11 / 15 (73%) correct when the paper was published
Further prospective validation example
Why screen cpds?
Conclusions
>38,000 molecules screened through Bayesian models
106 molecules were tested in vitro
17 actives were identified (22.5 % hit rate)
Identified several novel potent lead series with good cytotoxicity & selectivity
Some series have been missed in SRI screening data
Took a nontoxic molecule quickly into in vivo testing; analogs have been made in an attempt to
overcome the in vivo efficacy failure
All Bayesian models shared with Abbott and Merck in TB Accelerator project
All Bayesian models are freely available to researchers
Ekins et al.,Chem Biol 20, 370–378, 2013
Acknowledgments
The project described was supported by Award Number R43 LM011152-01
“Biocomputation across distributed private datasets to enhance drug
discovery” from the National Library of Medicine (PI: S. Ekins)
Accelrys
The CDD TB has been developed thanks to funding from the Bill and
Melinda Gates Foundation (Grant#49852 “Collaborative drug discovery for
TB through a novel database of SAR data optimized to promote data
archiving and sharing”)
Allen Casey (IDRI)
Joel Freundlich Lab
You can find me @... CDD Booth 205
PAPER ID: 13433
PAPER TITLE: “Dispensing processes profoundly impact biological assays and computational and statistical analyses”
April 8th 8.35am Room 349
PAPER ID: 14750
PAPER TITLE: “Enhancing High Throughput Screening For Mycobacterium tuberculosis Drug Discovery Using Bayesian Models”
April 9th 1.30pm Room 353
PAPER ID: 21524
PAPER TITLE: “Navigating between patents, papers, abstracts and databases using public sources and tools”
April 9th 3.50pm Room 350
PAPER ID: 13358
PAPER TITLE: “TB Mobile: Appifying Data on Anti-tuberculosis Molecule Targets”
April 10th 8.30am Room 357
PAPER ID: 13382
PAPER TITLE: “Challenges and recommendations for obtaining chemical structures of industry-provided repurposing candidates”
April 10th 10.20am Room 350
PAPER ID: 13438
PAPER TITLE: “Dual-event machine learning models to accelerate drug discovery”
April 10th 3.05pm Room 350