Exploring Chemical Space with Computers—Challenges and Opportunities
Pierre Baldi, UCI
Chemical Informatics
Historical perspective: physics, chemistry and biology
Understanding chemical space
Small molecules (systems biology, chemical synthesis, drug design, nanotechnology)
Chemical Space
            Stars        Small Mol.
Existing    10^22        10^7
Virtual     0            10^60 (?)
Access      Difficult    “Easy”
Mode        Individual   Combinatorial
Chemical Space
Chemical Informatics
Historical perspective: physics, chemistry and biology
Understanding chemical space
Small molecules (systems biology, chemical synthesis, drug design, nanotechnology)
Predict physical, chemical, biological properties (classification/regression)
Build filters/tools to efficiently navigate chemical space to discover new drugs, new galaxies, etc.
Methods
Spectrum: Schrödinger equation, molecular dynamics, machine learning (e.g. secondary-structure (SS) prediction)
Chemical Informatics
Informatics must be able to deal with variable-size structured data
Graphical Models
(Recursive) Neural Networks
ILP
GA
SGs
Kernels
Two Essential Ingredients
1. Data
2. Similarity Measures
Bioinformatics analogy and differences:
Data (GenBank, Swissprot, PDB)
Similarity (BLAST)
Data
Mutag (Mutagenicity): 200 compounds (125/63), mutagenicity in Salmonella
PTC (Predictive Toxicity Challenge): a few hundred compounds, carcinogenicity (FM, MM, FR, MR)
NCI (Anti-cancer activity): 70,000 compounds screened for ability to inhibit growth in 60 human tumor cell lines
Alkanes (Boiling points): all 150 non-cyclic alkanes ($C_nH_{2n+2}$) with n < 11 and their boiling points (range [-164, 174] °C)
Benzodiazepines (QSAR): 79 1,4-benzodiazepin-2-ones, affinity towards GABA_A
ChemDB: 7M compounds
Similarity
Rapid Searches of Large Databases
Predictive Methods (Kernel Methods)
Why it is not hopeless?
Similarity
Rapid Search of Large Databases
  Protein Receptor (Docking)
  Small Molecule/Ligand (Similarity)
Predictive Methods (Kernel Methods)
Why it is not hopeless
Organic Chemicals
Linear Classifiers
Classification
Learning to Classify
  Limited number of training examples (molecules, patients, sequences, etc.)
  Learning algorithm (how to build the classifier?)
  Generalization: should correctly classify test data.
Formalization
  X is the input space
  Y (e.g. toxic/non toxic, or {1,-1}) is the target class
  f: X→Y is the classifier.
Classification
Fundamental Point: f is entirely determined by the dot products $\langle x_i, x_j \rangle$ measuring the similarity between pairs of data points
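The point about dot products can be illustrated with a perceptron written in its dual form. This is a minimal sketch, not the classifier used in the talk: the toy 2D points, the labels, and the function names are assumptions made for illustration. Training and prediction touch the inputs only through pairwise dot products, and the result agrees with the usual weight-vector form.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def train_dual_perceptron(xs, ys, max_epochs=100):
    """Learn one coefficient alpha_i per training point.

    The inputs are used only through the matrix of pairwise dot products.
    """
    n = len(xs)
    gram = [[dot(a, b) for b in xs] for a in xs]
    alpha = [0] * n
    for _ in range(max_epochs):
        mistakes = 0
        for j in range(n):
            score = sum(alpha[i] * ys[i] * gram[i][j] for i in range(n))
            if ys[j] * score <= 0:  # misclassified (or on the boundary)
                alpha[j] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alpha


def dual_score(x, xs, ys, alpha):
    """Decision value for x, computed from dot products with training points."""
    return sum(a * y * dot(xi, x) for a, y, xi in zip(alpha, ys, xs))


# Hypothetical toy data: two points per class.
xs = [(2.0, 1.0), (1.0, 3.0), (-1.0, -2.0), (-2.0, -1.0)]
ys = [1, 1, -1, -1]

alpha = train_dual_perceptron(xs, ys)
# Equivalent primal weight vector: w = sum_i alpha_i * y_i * x_i
w = [sum(a * y * x[k] for a, y, x in zip(alpha, ys, xs)) for k in range(2)]
predictions = [1 if dual_score(x, xs, ys, alpha) > 0 else -1 for x in xs]
```

Because `dual_score` never looks inside a data point except through `dot`, swapping `dot` for another similarity function changes the classifier without changing the algorithm.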
Non Linear Classification (Kernel Methods)
We can transform a nonlinear problem into a linear one using a kernel K.
Fundamental property: the linear decision surface depends on $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$.
All we need is the Gram similarity matrix K.
K defines the local metric of the embedding space.
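As a small worked example of the kernel property, the sketch below uses the degree-2 polynomial kernel on 2D inputs, a standard textbook case chosen here as an assumption (the talk does not name it). It checks that the kernel value equals the dot product of an explicit feature map and builds the Gram matrix from pairwise kernel values. The points and function names are hypothetical.

```python
import math


def poly2_kernel(x, z):
    """Degree-2 polynomial kernel on 2D inputs: K(x, z) = (x . z)^2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def phi(x):
    """Explicit feature map whose dot product reproduces poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def explicit_kernel(x, z):
    """Dot product computed in the 3D embedding space."""
    return sum(p * q for p, q in zip(phi(x), phi(z)))


def gram_matrix(points, kernel):
    """Matrix of kernel values between all pairs of points."""
    return [[kernel(a, b) for b in points] for a in points]


# Hypothetical sample points.
points = [(1.0, 2.0), (3.0, -1.0), (0.5, 0.5)]
K = gram_matrix(points, poly2_kernel)
K_explicit = gram_matrix(points, explicit_kernel)
```

The two matrices agree up to floating-point rounding, which is why a linear method can work in the embedding space without ever computing `phi`.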
Similarity: Data Representations
NC(O)C(=O)O
[Figure: structure drawing of the molecule NC(O)C(=O)O]
Molecular Representations
1D: SMILES strings
2D: Graph of bonds
2.5D: Surfaces
3D: Atomic coordinates
4D: Temporal evolution
1D SMILES Kernel
Molecule 1: CCCCCCc1ccc(cc1O)O
Molecule 2: CCCCCc1ccc(cc1)CO
[Figure: structure drawings of the two molecules]

Kmer counts (k = 4), molecule 1: CCCC 3, CCCc 1, CCc1 1, Cc1c 1, c1cc 1, 1ccc 1, ccc( 1, cc(c 1, c(cc 1, (cc1 1, cc1O 1, c1O) 1, 1O)O 1
Kmer counts (k = 4), molecule 2: CCCC 2, CCCc 1, CCc1 1, Cc1c 1, c1cc 1, 1ccc 1, ccc( 1, cc(c 1, c(cc 1, (cc1 1, cc1) 1, c1)C 1, 1)CO 1

Kmer   Count1  Count2  Product
(cc1   1       1       1
1)CO   0       1       0
1O)O   1       0       0
1ccc   1       1       1
CCCC   3       2       6
CCCc   1       1       1
CCc1   1       1       1
Cc1c   1       1       1
c(cc   1       1       1
c1)C   0       1       0
c1O)   1       0       0
c1cc   1       1       1
cc(c   1       1       1
cc1)   0       1       0
cc1O   1       0       0
ccc(   1       1       1
Total:                 15
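The k-mer counting on this slide can be reproduced in a few lines. The sketch below assumes k = 4 (the length of the k-mers listed) and treats each SMILES string as a plain character sequence; the function names are illustrative and not from the talk. The kernel value is the sum over shared k-mers of the product of their counts.

```python
from collections import Counter


def kmer_counts(smiles, k=4):
    """Count every length-k substring of a SMILES string."""
    return Counter(smiles[i:i + k] for i in range(len(smiles) - k + 1))


def spectrum_kernel(s1, s2, k=4):
    """Sum, over k-mers present in both strings, of the product of counts."""
    c1, c2 = kmer_counts(s1, k), kmer_counts(s2, k)
    return sum(c1[m] * c2[m] for m in c1.keys() & c2.keys())


mol1 = "CCCCCCc1ccc(cc1O)O"
mol2 = "CCCCCc1ccc(cc1)CO"
value = spectrum_kernel(mol1, mol2)
print(value)  # 15, matching the total on the slide
```

Note that this compares raw strings, so two different SMILES spellings of the same molecule would give different counts; some canonical form is presumably needed before applying it.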
2D Molecule Graph Kernel
For chemical compounds
  atom/node labels: A = {C, N, O, H, …}
  bond/edge labels: B = {s, d, t, ar, …}
Count labeled paths
Fingerprints
Example labeled path: (CsNsCdO)
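One way to read "count labeled paths" is to enumerate all simple paths in the bond graph up to a fixed length and record each as a string of alternating atom and bond labels, such as CsNsCdO. The sketch below does that for two tiny heavy-atom graphs and compares the resulting path sets with a Tanimoto ratio. The graph encoding, the maximum path length, the two example molecules, and the set-based (rather than count-based or hashed) fingerprint are all assumptions for illustration.

```python
def labeled_paths(atoms, bonds, max_bonds=3):
    """Label strings of all simple paths with at most max_bonds bonds.

    atoms: list of atom labels, indexed by atom id.
    bonds: list of (atom_id, atom_id, bond_label) tuples, undirected.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b, label in bonds:
        neighbors[a].append((b, label))
        neighbors[b].append((a, label))

    paths = set()

    def extend(current, visited, text, depth):
        paths.add(text)
        if depth == max_bonds:
            return
        for nxt, bond in neighbors[current]:
            if nxt not in visited:  # simple paths only: no atom revisited
                extend(nxt, visited | {nxt}, text + bond + atoms[nxt], depth + 1)

    for start in range(len(atoms)):
        extend(start, {start}, atoms[start], 0)
    return paths


def tanimoto(paths_a, paths_b):
    """Shared paths divided by total distinct paths."""
    union = paths_a | paths_b
    return len(paths_a & paths_b) / len(union) if union else 1.0


# Hypothetical heavy-atom skeletons: C-N-C=O and C-C=O.
paths_a = labeled_paths(["C", "N", "C", "O"], [(0, 1, "s"), (1, 2, "s"), (2, 3, "d")])
paths_b = labeled_paths(["C", "C", "O"], [(0, 1, "s"), (1, 2, "d")])
similarity = tanimoto(paths_a, paths_b)
```

Each path is recorded in both directions (CdO and OdC), which keeps the code short; a production fingerprint would more likely canonicalize direction and hash paths into a fixed-length bit vector.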
Similarity Measures
3D Coordinate Kernel
Example interatomic distances: 1.4 Å, 2.0 Å, 2.8 Å, 3.4 Å, 4.2 Å
[Figure: Atom Distance Histogram; x-axis: Distance (Angstroms), 0 to 5; y-axis: Count, 0 to 8]
Distance  Count
0         0
1         5
2         7
3         3
4         1
5         0
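A histogram like the one above can be built from 3D coordinates by binning all pairwise atom distances, and two molecules can then be compared through their histograms. The sketch below assumes 1 Å bins assigned by rounding down and a plain dot product between histograms; the slide does not specify either choice, and the three-atom coordinates are hypothetical.

```python
import math
from itertools import combinations


def distance_histogram(coords, bin_width=1.0, n_bins=6):
    """Count pairwise atom distances per bin; distances past the last bin are dropped."""
    counts = [0] * n_bins
    for a, b in combinations(coords, 2):
        index = int(math.dist(a, b) // bin_width)
        if index < n_bins:
            counts[index] += 1
    return counts


def histogram_kernel(h1, h2):
    """Dot product of two distance histograms of equal length."""
    return sum(a * b for a, b in zip(h1, h2))


# Hypothetical three-atom molecule (coordinates in Angstroms).
coords = [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0), (1.4, 2.0, 0.0)]
hist = distance_histogram(coords)

# Histogram from the slide's table (distance bins 0 to 5).
slide_hist = [0, 5, 7, 3, 1, 0]
value = histogram_kernel(hist, slide_hist)
```

Distances are invariant to rotation and translation, so this representation needs no alignment of the two molecules, at the cost of discarding which atoms are involved.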
Example of Results
Results
[Figure: prediction accuracy per cell line; x-axis: Cell Line; y-axis: Prediction Accuracy, 0.65 to 0.75]
1D SMILES (71.7% avg, 1.17% stdev)
2D Molecule Graph (72.3% avg, 0.99% stdev)
3D Coordinates (69.8% avg, 1.27% stdev)
Example of Results
Summary
Derived a variety of kernels for small molecules
State-of-the-art performance on several benchmark datasets
2D kernels slightly better than 1D and 3D kernels
Many possible extensions: 2.5D kernels, isomers, etc.
Need for larger data sets and new models of cooperation in the chemistry community
Many open (ML) questions (e.g. clustering and visualizing 10^7 compounds, intelligent recognition of useful molecules, information retrieval from literature, docking, prediction of reaction rates, matching table of all proteins against all known compounds, origin of life)
Chemistry version of the Turing test
ChemDB
7M compounds (3.5M unique)
Commercially available
PostgreSQL/Oracle
Annotation (Experimental, Computational)
Searchable Web interface
Similarity, in silico reactions
Acknowledgements
Informatics: Liva Ralaivola, J. Chen, S. J. Swamidass, Yimeng Dou, Peter Phung, Jocelyne Bruand
Pharmacology: Daniele Piomelli
Chemistry: G. Weiss, J. S. Nowick, R. Chamberlin
Funding: NIH, NSF, IGB
New Questions
Predict drug-like molecules? toxicity?
  New Strategies
How can we search efficiently? Intelligently?
  New data structures and algorithms
  Optimizing old structures
How can we understand this much data?
  Cluster and visualize millions of data points
  Define commercially accessible space.
Are there other useful things we can do with this?
  Discover new polymers, etc.
  Wonder about the origin of life.
  Combinatorially combine all known chemicals.
Acknowledgements
Jocelyne Bruand, Peter Phung, Liva Ralaivola, S. Joshua Swamidass, Yimeng Dou
NIH/NSF/IGB
Questions
Docking
Database of potential drugs: 6 million small molecules …
Query: Binding Site of Protein
Scoring Function & Efficient Minimizer
Some Targets
P53 (Luecke)
ACCD5 (Tsai)
IMPDH, PPAR, etc. (Luecke)
HIV Integrase (Robinson)
P53
Drug Rescue of P53 Mutants
Docking → ChemDB
~6 million commercially available compounds
Searchable, annotated, downloadable.
Other Databases: Cambridge Structural Database, ChemBank, PubChem
Chemical Toxicity Prediction
By Kernel Methods
Jonathan Chen, S. Joshua Swamidass
The Baldi Lab
Data Flow
[Figure: data flow diagram linking molecule structure drawings, Kernel, Gram Matrix, Toxicity State List, Linear Classifier and Predictions]

Toxicity State List
ID  Toxic?
1   No
2   No
3   Yes
4   Yes

Kernel

Gram Matrix
ID   1   2   3   4   …
1    21  4   5   10  …
2    4   14  5   3   …
3    5   5   15  6   …
4    10  3   6   23  …
…    …   …   …   …   …

Linear Classifier
Predictions
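The data flow on this slide can be sketched end to end using the 4 x 4 corner of the Gram matrix and the four toxicity labels shown. A kernel perceptron with a bias term stands in for the linear classifier here; that choice, and the function names, are assumptions, since the transcript does not say which linear classifier was used. The classifier sees only Gram-matrix entries, never the molecules themselves.

```python
# Gram matrix entries and toxicity labels as shown on the slide (IDs 1 to 4).
gram = [
    [21, 4, 5, 10],
    [4, 14, 5, 3],
    [5, 5, 15, 6],
    [10, 3, 6, 23],
]
toxic = ["No", "No", "Yes", "Yes"]
labels = [1 if t == "Yes" else -1 for t in toxic]


def train_kernel_perceptron(gram, labels, max_epochs=100):
    """Fit per-example coefficients and a bias from a precomputed Gram matrix."""
    n = len(labels)
    alpha = [0] * n
    bias = 0
    for _ in range(max_epochs):
        mistakes = 0
        for j in range(n):
            score = bias + sum(alpha[i] * labels[i] * gram[i][j] for i in range(n))
            if labels[j] * score <= 0:  # misclassified (or on the boundary)
                alpha[j] += 1
                bias += labels[j]
                mistakes += 1
        if mistakes == 0:
            break
    return alpha, bias


def predict(kernel_values, labels, alpha, bias):
    """Classify one molecule from its kernel values against the training set."""
    score = bias + sum(a * y * k for a, y, k in zip(alpha, labels, kernel_values))
    return "Yes" if score > 0 else "No"


alpha, bias = train_kernel_perceptron(gram, labels)
predictions = [predict(gram[j], labels, alpha, bias) for j in range(len(labels))]
```

To classify a new molecule, one would compute its kernel value against each of the four training molecules and pass that list to `predict`. Fitting four training points says nothing about generalization; the accuracy figures in the talk come from the full datasets.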
Results
[Figure: prediction accuracy per cell line; x-axis: Cell Line; y-axis: Prediction Accuracy, 0.50 to 1.00]
1D SMILES (71.7% avg, 1.17% stdev)
2D Molecule Graph (72.3% avg, 0.99% stdev)
3D Coordinates (69.8% avg, 1.27% stdev)
Default (54.2% avg, 3.49% stdev)
Example of Results
Kernel/Method                  Mutag  MM    FM    MR    FR
Kashima (2003)                 89.1   61.0  61.0  62.8  66.7
Kashima (2003)                 85.1   64.3  63.4  58.4  66.1
1D SMILES spec.                84.0   66.1  61.3  57.3  66.1
1D SMILES spec+                85.6   66.4  63.0  57.6  67.0
2D Tanimoto                    87.8   66.4  64.2  63.7  66.7
2D MinMax                      86.2   64.0  64.5  64.5  66.4
2D Tanimoto, l = 1024, b = 1   87.2   66.1  62.4  65.7  66.9
2D Hybrid l = 1024, b = 1      87.2   65.2  61.9  64.2  65.8
2D Tanimoto, l = 512, b = 1    84.6   66.4  59.9  59.9  66.1
2D Hybrid l = 512, b = 1       86.7   65.2  61.0  60.7  64.7
2D Tanimoto, l = 1024 + MI     84.6   63.1  63.0  61.9  66.7
2D Hybrid l = 1024 + MI        84.6   62.8  63.7  61.9  65.5
2D Tanimoto, l = 512 + MI      85.6   60.1  61.0  61.3  62.4
2D Hybrid l = 512 + MI         86.2   63.7  62.7  62.2  64.4
3D Histogram                   81.9   59.8  61.0  60.8  64.4
Chemical Informatics
Historical perspective: physics, chemistry and biology
Understanding chemical space
Small molecules (systems biology, chemical synthesis, drug design, nanotechnology)
Catalog
Predict physical, chemical, biological properties
Build filters/tools to efficiently navigate chemical space to discover new drugs, new galaxies, etc.
Datasets
Small Molecules as Undirected Labeled Graphs of Bonds
atom/node labels: A = {C, N, O, H, …}
bond/edge labels: B = {s, d, t, ar, …}
Chemical Informatics
Historical perspective: physics, chemistry and biology
Understanding chemical space
Small molecules (systems biology, chemical synthesis, drug design, nanotechnology)
Bioinformatics analogy:
  Catalog (GenBank)
  Search (BLAST)
Predict physical, chemical, biological properties
Build filters/tools to efficiently navigate chemical space to discover new drugs, new galaxies, etc.