Exploring Chemical Space with Computers—Challenges and Opportunities

Pierre Baldi, UCI

Chemical Informatics

Historical perspective: physics, chemistry and biology

Understanding chemical space

Small molecules (systems biology, chemical synthesis, drug design, nanotechnology)

Chemical Space

            Stars        Small Mol.
Existing    10^22        10^7
Virtual     0            10^60 (?)
Access      Difficult    “Easy”
Mode        Individual   Combinatorial

Chemical Space



Understanding chemical space

Small molecules (systems biology, chemical synthesis, drug design, nanotechnology)

Predict physical, chemical, biological properties (classification/regression)

Build filters/tools to efficiently navigate chemical space to discover new drugs, new galaxies, etc.

Methods

Spectrum: Schrödinger equation, molecular dynamics, machine learning (e.g. SS prediction)

Informatics must be able to deal with variable-size structured data:

Graphical Models
(Recursive) Neural Networks
ILP
GA
SGs
Kernels

Two Essential Ingredients

1. Data
2. Similarity Measures

Bioinformatics analogy and differences:

Data (GenBank, Swissprot, PDB)
Similarity (BLAST)

Data

Mutag (Mutagenicity): 200 compounds (125/63), mutagenicity in Salmonella

PTC (Predictive Toxicity Challenge): a few hundred compounds, carcinogenicity (FM, MM, FR, MR)

NCI (Anti-cancer activity): 70,000 compounds screened for ability to inhibit growth in 60 human tumor cell lines

Alkanes (Boiling points): all 150 non-cyclic alkanes (C_nH_{2n+2}) with n < 11 and their boiling points ([-164, 174] °C)

Benzodiazepines (QSAR): 79 1,4-benzodiazepin-2-ones, affinity towards GABA_A

ChemDB: 7M compounds

Similarity

Rapid Searches of Large Databases

Predictive Methods (Kernel Methods)

Why it is not hopeless?

Similarity

Rapid Search of Large Databases
Protein Receptor (Docking)
Small Molecule/Ligand (Similarity)

Predictive Methods (Kernel Methods)
Why it is not hopeless

Organic Chemicals

Linear Classifiers

Classification

Learning to Classify

Limited number of training examples (molecules, patients, sequences, etc.)
Learning algorithm (how to build the classifier?)
Generalization: should correctly classify test data.

Formalization

X is the input space
Y (e.g. toxic/non toxic, or {1,-1}) is the target class
f: X→Y is the classifier.

Classification

Fundamental Point: f is entirely determined by the dot products $\langle x_i, x_j \rangle$ measuring the similarity between pairs of data points.
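One way to see why only dot products matter is sketched below. It assumes the common case where the weight vector of a linear classifier is a combination of the training points, as in the perceptron or a support vector machine. The slides do not name the learning algorithm, and the coefficients $\alpha_i$, labels $y_i \in \{1,-1\}$, weight vector $w$ and bias $b$ are notation introduced here for illustration.

```latex
w = \sum_{i=1}^{n} \alpha_i y_i x_i
\quad\Longrightarrow\quad
f(x) = \operatorname{sign}\bigl(\langle w, x \rangle + b\bigr)
     = \operatorname{sign}\Bigl(\sum_{i=1}^{n} \alpha_i y_i \langle x_i, x \rangle + b\Bigr)
```

Under this assumption, both training and prediction touch the data only through dot products between pairs of points.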

Nonlinear Classification (Kernel Methods)

We can transform a nonlinear problem into a linear one using a kernel K.

Fundamental property: the linear decision surface depends on $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$. All we need is the Gram similarity matrix K. K defines the local metric of the embedding space.
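A minimal sketch of the idea that the Gram matrix is all a learner needs, using a kernel perceptron on the XOR problem. The choice of a perceptron, the degree-2 polynomial kernel, and all function names are assumptions made for illustration; the slides do not specify which kernel classifier was used.

```python
import numpy as np

def poly_kernel(a, b, degree=2):
    """Polynomial kernel (1 + <a, b>)^degree, chosen here only as an example."""
    return (1.0 + np.dot(a, b)) ** degree

def gram_matrix(points, kernel):
    """Gram matrix with K[i, j] = kernel(points[i], points[j])."""
    n = len(points)
    return np.array([[kernel(points[i], points[j]) for j in range(n)]
                     for i in range(n)])

def train_kernel_perceptron(K, y, epochs=100):
    """Learn dual coefficients alpha from the Gram matrix K and labels y in {+1, -1}.

    The raw input vectors are never used here, only K.
    """
    alpha = np.zeros(len(y))
    for _ in range(epochs):
        mistakes = 0
        for i in range(len(y)):
            score = np.sum(alpha * y * K[:, i])
            if y[i] * score <= 0:      # misclassified (or on the boundary)
                alpha[i] += 1.0
                mistakes += 1
        if mistakes == 0:              # every training point is classified correctly
            break
    return alpha

def predict(alpha, y, k_column):
    """Classify a point from its kernel values against the training points."""
    return 1 if np.sum(alpha * y * k_column) > 0 else -1

# XOR is not linearly separable in the input space,
# but it is separable in the embedding space of the polynomial kernel.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1, 1, 1, -1])
K = gram_matrix(X, poly_kernel)
alpha = train_kernel_perceptron(K, y)
predictions = [predict(alpha, y, K[:, i]) for i in range(len(y))]
print(predictions)
```

Swapping `poly_kernel` for a molecular kernel changes only how `K` is filled in; the training loop stays the same.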

Similarity: Data Representations

NC(O)C(=O)O

[Figure: 2D structure drawing of this molecule]

Molecular Representations

1D: SMILES strings
2D: Graph of bonds
2.5D: Surfaces
3D: Atomic coordinates
4D: Temporal evolution

1D SMILES Kernel

Molecule 1: CCCCCCc1ccc(cc1O)O
Molecule 2: CCCCCc1ccc(cc1)CO

[Figure: structure drawings of the two molecules]

Molecule 1:

Kmer   Count
CCCC   3
CCCc   1
CCc1   1
Cc1c   1
c1cc   1
1ccc   1
ccc(   1
cc(c   1
c(cc   1
(cc1   1
cc1O   1
c1O)   1
1O)O   1

Molecule 2:

Kmer   Count
CCCC   2
CCCc   1
CCc1   1
Cc1c   1
c1cc   1
1ccc   1
ccc(   1
cc(c   1
c(cc   1
(cc1   1
cc1)   1
c1)C   1
1)CO   1

Kmer   Count1  Count2  Product
(cc1   1       1       1
1)CO   0       1       0
1O)O   1       0       0
1ccc   1       1       1
CCCC   3       2       6
CCCc   1       1       1
CCc1   1       1       1
Cc1c   1       1       1
c(cc   1       1       1
c1)C   0       1       0
c1O)   1       0       0
c1cc   1       1       1
cc(c   1       1       1
cc1)   0       1       0
cc1O   1       0       0
ccc(   1       1       1
Total:                 15
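The k-mer tables above can be reproduced with a short sketch: count every substring of length 4 in each SMILES string, then sum the products of matching counts. The function names are illustrative and not taken from the slides.

```python
from collections import Counter

def kmer_counts(smiles, k=4):
    """Count every substring of length k in a SMILES string."""
    return Counter(smiles[i:i + k] for i in range(len(smiles) - k + 1))

def spectrum_kernel(s1, s2, k=4):
    """Dot product of the two k-mer count vectors."""
    c1, c2 = kmer_counts(s1, k), kmer_counts(s2, k)
    # Only k-mers present in both strings contribute a non-zero product.
    return sum(c1[kmer] * c2[kmer] for kmer in c1.keys() & c2.keys())

mol1 = "CCCCCCc1ccc(cc1O)O"
mol2 = "CCCCCc1ccc(cc1)CO"
value = spectrum_kernel(mol1, mol2)
print(value)  # 15, the total of the Product column
```

Note that this treats the SMILES string as plain text, so two different SMILES spellings of the same molecule would give different counts.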

2D Molecule Graph Kernel

For chemical compounds:
atom/node labels: A = {C, N, O, H, …}
bond/edge labels: B = {s, d, t, ar, …}

Count labeled paths
Fingerprints

Labeled path example: (CsNsCdO)
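A minimal sketch of counting labeled paths on a molecular graph and comparing two molecules by their path fingerprints. The graph encoding, the depth limit, the omission of hydrogens, and the two toy molecules are assumptions for illustration. The Tanimoto and MinMax formulas are the usual set-based and count-based definitions, and the slides do not spell out the exact variants used.

```python
from collections import Counter

def labeled_paths(atoms, bonds, max_bonds=3):
    """Count labeled paths with at most max_bonds bonds, starting from every atom.

    atoms: list of atom labels, e.g. ["C", "N", "C", "O"]
    bonds: dict mapping (i, j) atom index pairs to a bond label ("s", "d", "t", "ar")
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for (i, j), label in bonds.items():
        neighbors[i].append((j, label))
        neighbors[j].append((i, label))

    counts = Counter()

    def extend(node, visited, path):
        counts[path] += 1
        if len(visited) > max_bonds:   # path already uses max_bonds bonds
            return
        for nxt, label in neighbors[node]:
            if nxt not in visited:     # no atom is visited twice on one path
                extend(nxt, visited | {nxt}, path + label + atoms[nxt])

    for start in range(len(atoms)):
        extend(start, {start}, atoms[start])
    return counts

def tanimoto(c1, c2):
    """Shared distinct paths divided by all distinct paths (binary fingerprint)."""
    a, b = set(c1), set(c2)
    return len(a & b) / len(a | b)

def minmax(c1, c2):
    """Sum of minimum counts divided by sum of maximum counts."""
    keys = set(c1) | set(c2)
    return sum(min(c1[k], c2[k]) for k in keys) / sum(max(c1[k], c2[k]) for k in keys)

# Hypothetical toy molecules, heavy atoms only: C-N-C=O and C-N.
paths_a = labeled_paths(["C", "N", "C", "O"], {(0, 1): "s", (1, 2): "s", (2, 3): "d"})
paths_b = labeled_paths(["C", "N"], {(0, 1): "s"})
print(paths_a["CsNsCdO"], tanimoto(paths_a, paths_b), minmax(paths_a, paths_b))
```

In practice the path set is usually hashed into a fixed-length bit vector (the fingerprint); the sketch keeps the paths as strings to stay readable.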

Similarity Measures

3D Coordinate Kernel

[Figure: molecule annotated with interatomic distances of 1.4 Å, 2.0 Å, 2.8 Å, 3.4 Å and 4.2 Å]

Atom Distance Histogram

[Figure: histogram of count versus distance (Angstroms)]

Distance  Count
0         0
1         5
2         7
3         3
4         1
5         0
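A minimal sketch of a histogram-based 3D kernel: bin all pairwise atom distances, then take the dot product of the two histograms. The binning rule (nearest whole Angstrom), the use of a plain dot product, and the coordinates are assumptions for illustration; the slides show only the resulting histogram.

```python
import math
from collections import Counter
from itertools import combinations

def distance_histogram(coords, bin_width=1.0):
    """Histogram of pairwise atom distances, binned to the nearest multiple of bin_width."""
    hist = Counter()
    for p, q in combinations(coords, 2):
        d = math.dist(p, q)                            # Euclidean distance (Python 3.8+)
        hist[math.floor(d / bin_width + 0.5)] += 1     # bin index, halves round up
    return hist

def histogram_kernel(h1, h2):
    """Dot product of two distance histograms."""
    return sum(h1[b] * h2[b] for b in h1.keys() & h2.keys())

# Hypothetical coordinates in Angstroms, not taken from the slides.
coords_a = [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0), (2.8, 0.0, 0.0)]
coords_b = [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0), (0.0, 2.0, 0.0)]
hist_a = distance_histogram(coords_a)
hist_b = distance_histogram(coords_b)
kernel_value = histogram_kernel(hist_a, hist_b)
print(dict(hist_a), dict(hist_b), kernel_value)
```

Because it uses only distances, this representation does not change when the molecule is rotated or translated, but it does depend on the conformation chosen.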

Example of Results

[Figure: prediction accuracy by cell line]

1D SMILES (71.7% avg, 1.17% stdev)
2D Molecule Graph (72.3% avg, 0.99% stdev)
3D Coordinates (69.8% avg, 1.27% stdev)

Summary

Derived a variety of kernels for small molecules

State-of-the-art performance on several benchmark datasets

2D kernels slightly better than 1D and 3D kernels

Many possible extensions: 2.5D kernels, isomers, etc.

Need for larger data sets and new models of cooperation in the chemistry community

Many open (ML) questions (e.g. clustering and visualizing 10^7 compounds, intelligent recognition of useful molecules, information retrieval from literature, docking, prediction of reaction rates, matching table of all proteins against all known compounds, origin of life)

Chemistry version of the Turing test

ChemDB

7M compounds (3.5M unique)
Commercially available
PostgreSQL/Oracle
Annotation (Experimental, Computational)
Searchable Web interface
Similarity, in silico reactions

Acknowledgements

Informatics: Liva Ralaivola, J. Chen, S. J. Swamidass, Yimeng Dou, Peter Phung, Jocelyne Bruand

Funding: NIH, NSF, IGB

Pharmacology: Daniele Piomelli

Chemistry: G. Weiss, J. S. Nowick, R. Chamberlin

New Questions

Predict drug-like molecules? Toxicity?

New Strategies

How can we search efficiently? Intelligently?
New data structures and algorithms
Optimizing old structures

How can we understand this much data?
Cluster and visualize millions of data points
Define commercially accessible space.

Are there other useful things we can do with this?
Discover new polymers, etc.
Wonder about the origin of life.
Combinatorially combine all known chemicals.

Questions

Docking

Database of potential drugs: 6 million small molecules
Query: binding site of protein
Scoring function and efficient minimizer

Some Targets

P53 (Luecke)
ACCD5 (Tsai)
IMPDH, PPAR, etc. (Luecke)
HIV Integrase (Robinson)

Drug Rescue of P53 Mutants

Docking → ChemDB

~6 million commercially available compounds

Searchable, annotated, downloadable.

Other Databases: Cambridge Structural Database, ChemBank, PubChem

Chemical Toxicity Prediction by Kernel Methods

Jonathan Chen, S. Joshua Swamidass

The Baldi Lab

Data Flow

[Figure: data flow diagram with stages labeled Kernel, Gram Matrix, Toxicity State List, Linear Classifier and Predictions; molecule structure drawings omitted]

Gram Matrix

ID   1    2    3    4    …
1    21   4    5    10   …
2    4    14   5    3    …
3    5    5    15   6    …
4    10   3    6    23   …
…    …    …    …    …    …

Toxicity State List

ID   Toxic?
1    No
2    No
3    Yes
4    Yes
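The Gram matrix in this data flow is the only view of the molecules that the classifier receives. As a small illustration of how K also fixes distances in the embedding space, the sketch below uses the identity d(i, j)^2 = K_ii + K_jj - 2 K_ij on the 4 × 4 matrix shown on the slide. The function name is hypothetical, and the slides do not say that distances were computed this way.

```python
import numpy as np

# Gram matrix for compound IDs 1-4, copied from the Data Flow slide.
K = np.array([
    [21.0,  4.0,  5.0, 10.0],
    [ 4.0, 14.0,  5.0,  3.0],
    [ 5.0,  5.0, 15.0,  6.0],
    [10.0,  3.0,  6.0, 23.0],
])

def kernel_distance_squared(K):
    """Squared distances in the embedding space: K_ii + K_jj - 2 * K_ij."""
    diag = np.diag(K)
    return diag[:, None] + diag[None, :] - 2.0 * K

D2 = kernel_distance_squared(K)
print(D2)
```

A symmetric matrix with zeros on the diagonal comes out, so any distance-based method (nearest neighbors, clustering) can run on kernel values alone.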

Results

[Figure: prediction accuracy by cell line]

1D SMILES (71.7% avg, 1.17% stdev)
2D Molecule Graph (72.3% avg, 0.99% stdev)
3D Coordinates (69.8% avg, 1.27% stdev)
Default (54.2% avg, 3.49% stdev)

Example of Results

Kernel/Method                   Mutag   MM     FM     MR     FR
Kashima (2003)                  89.1    61.0   61.0   62.8   66.7
Kashima (2003)                  85.1    64.3   63.4   58.4   66.1
1D SMILES spec.                 84.0    66.1   61.3   57.3   66.1
1D SMILES spec+                 85.6    66.4   63.0   57.6   67.0
2D Tanimoto                     87.8    66.4   64.2   63.7   66.7
2D MinMax                       86.2    64.0   64.5   64.5   66.4
2D Tanimoto, l = 1024, b = 1    87.2    66.1   62.4   65.7   66.9
2D Hybrid l = 1024, b = 1       87.2    65.2   61.9   64.2   65.8
2D Tanimoto, l = 512, b = 1     84.6    66.4   59.9   59.9   66.1
2D Hybrid l = 512, b = 1        86.7    65.2   61.0   60.7   64.7
2D Tanimoto, l = 1024 + MI      84.6    63.1   63.0   61.9   66.7
2D Hybrid l = 1024 + MI         84.6    62.8   63.7   61.9   65.5
2D Tanimoto, l = 512 + MI       85.6    60.1   61.0   61.3   62.4
2D Hybrid l = 512 + MI          86.2    63.7   62.7   62.2   64.4
3D Histogram                    81.9    59.8   61.0   60.8   64.4





Datasets

Small Molecules as Undirected Labeled Graphs of Bonds

atom/node labels: A = {C, N, O, H, …}
bond/edge labels: B = {s, d, t, ar, …}




