navigating molecular haystacks: tools & applications

76
Outlier Detection Using R-NN Curves Searching for HIV Integrase Inhibitors Summary Navigating Molecular Haystacks: Tools & Applications Finding Interesting Molecules & Doing it Fast Rajarshi Guha Department of Chemistry Pennsylvania State University 20 th April, 2006

Upload: rguha

Post on 10-May-2015

807 views

Category:

Education


3 download

TRANSCRIPT

Page 1: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Navigating Molecular Haystacks:Tools & Applications

Finding Interesting Molecules & Doing it Fast

Rajarshi Guha

Department of ChemistryPennsylvania State University

20th April, 2006

Page 2: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Past Work

QSAR Applications

Artemisinin analogs

PDGFR inhibitors

Bleaching agents

Linear & non-linear methods

QSAR Methods

Representative QSAR sets

Model interpretation

Model applicability

Page 3: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Past Work

Cheminformatics

Automated QSAR pipeline

Contributions to the CDK

Cheminformatics webservices

R packages and snippets

Page 4: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Past Work

Chemical Data Mining

Approximate k-NN

Local regression

Outlier detection

VS protocol

Page 5: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Past Work

Chemical Data Mining

Approximate k-NN

Local regression

Outlier detection

VS protocol

Page 6: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Outline

1 Outlier Detection Using R-NN CurvesMethods for Diversity AnalysisGenerating & Summarizing R-NN CurvesUsing R-NN Curves

2 Searching for HIV Integrase InhibitorsPrevious Work on HIV IntegraseA Tiered Virtual Screening ProtocolWhat Does the Pipeline Give Us?

3 Summary

Page 7: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Methods for Diversity AnalysisGenerating & Summarizing R-NN CurvesUsing R-NN Curves

Outline

1 Outlier Detection Using R-NN CurvesMethods for Diversity AnalysisGenerating & Summarizing R-NN CurvesUsing R-NN Curves

2 Searching for HIV Integrase InhibitorsPrevious Work on HIV IntegraseA Tiered Virtual Screening ProtocolWhat Does the Pipeline Give Us?

3 Summary

Page 8: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Methods for Diversity AnalysisGenerating & Summarizing R-NN CurvesUsing R-NN Curves

Diversity Analysis

Why is it Important?

Compound acquisition

Lead hopping

Knowledge of the distribution of compounds in a descriptorspace may improve predictive models

Page 9: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Methods for Diversity AnalysisGenerating & Summarizing R-NN CurvesUsing R-NN Curves

Approaches to Diversity Analysis

Cell based

Divide space into bins

Compounds are mapped to bins

Disadvantages

Not useful for high dimensionaldata

Choosing the bin size can be tricky

● ●

● ●

●●

●●

●●

●●

●●

●●

● ●

●●●

●●

●●

●●

● ●

● ●

●●

●●

●●

●●

●●

●●

● ●

●●●

●●

●●

●●

Schnur, D.; J. Chem. Inf. Comput. Sci. 1999, 39, 36–45

Agrafiotis, D.; Rassokhin, D.; J. Chem. Inf. Comput. Sci. 2002, 42, 117–12

Page 10: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Methods for Diversity AnalysisGenerating & Summarizing R-NN CurvesUsing R-NN Curves

Approaches to Diversity Analysis

Distance based

Considers distance betweencompounds in a space

Generally requires pairwise distancecalculation

Can be sped up by kD trees, MVPtrees etc.

●●

Agrafiotis, D. K.; Lobanov, V. S.; J. Chem. Inf. Comput. Sci. 1999, 39, 51–58

Page 11: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Methods for Diversity AnalysisGenerating & Summarizing R-NN CurvesUsing R-NN Curves

Generating an R-NN Curve

Observations

Consider a query point with a hypersphere, of radius R,centered on it

For small R, the hypersphere will contain very few or noneighbors

For larger R, the number of neighbors will increase

When R ≥ Dmax , the neighbor set is the whole dataset

The question is . . .

Does the variation of nearest neighbor count with radius allow usto characterize the location of a query point in a dataset?

Page 12: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Methods for Diversity AnalysisGenerating & Summarizing R-NN CurvesUsing R-NN Curves

Generating an R-NN Curve

Observations

Consider a query point with a hypersphere, of radius R,centered on it

For small R, the hypersphere will contain very few or noneighbors

For larger R, the number of neighbors will increase

When R ≥ Dmax , the neighbor set is the whole dataset

The question is . . .

Does the variation of nearest neighbor count with radius allow usto characterize the location of a query point in a dataset?

Page 13: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Methods for Diversity AnalysisGenerating & Summarizing R-NN CurvesUsing R-NN Curves

Generating an R-NN Curve

Algorithm

Dmax ← max pairwise distancefor molecule in dataset do

R ← 0.01× Dmax

while R ≤ Dmax doFind NN’s within radius RIncrement R

end whileend forPlot NN count vs. R

●●

●●

● ●

●●

●●

● ●

●●

●●

●●

●●

5%

10%

30%

50%

Guha, R.; Dutta, D.; Jurs, P.C.; Chen, T.; J. Chem. Inf. Model., submitted

Page 14: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Methods for Diversity AnalysisGenerating & Summarizing R-NN CurvesUsing R-NN Curves

Generating an R-NN Curve

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●

0 20 40 60 80 100

050

100

150

200

250

R

Num

ber

of N

eigh

bors

Sparse

●●●●●

●●●

●●

●●●

●●

●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

0 20 40 60 80 1000

5010

015

020

025

0R

Num

ber

of N

eigh

bors

Dense

Page 15: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Methods for Diversity AnalysisGenerating & Summarizing R-NN CurvesUsing R-NN Curves

Characterizing an R-NN Curve

Converting the Plot to Numbers

Since R-NN curves are sigmoidal, fit them to the logisticequation

NN = a · 1 + m e−R/τ

1 + n e−R/τ

m, n should characterize the curve

Problems

Two parametersNon-linear fitting is dependent on the starting pointFor some starting points, the fit does not converge andrequires repetition

Page 16: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Methods for Diversity AnalysisGenerating & Summarizing R-NN CurvesUsing R-NN Curves

Characterizing an R-NN Curve

Converting the Plot to a Number

Determine the value of R where thelower tail transitions to the linearportion of the curve

Solution

Determine the slope at variouspoints on the curve

Find R for the first occurence ofthe maximal slope (Rmax(S))

Can be achieved using a finitedifference approach

S’’

S = 0

S = 0

S’

Page 17: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Methods for Diversity AnalysisGenerating & Summarizing R-NN CurvesUsing R-NN Curves

Characterizing an R-NN Curve

Converting the Plot to a Number

Determine the value of R where thelower tail transitions to the linearportion of the curve

Solution

Determine the slope at variouspoints on the curve

Find R for the first occurence ofthe maximal slope (Rmax(S))

Can be achieved using a finitedifference approach

S’’

S = 0

S = 0

S’

Page 18: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Methods for Diversity AnalysisGenerating & Summarizing R-NN CurvesUsing R-NN Curves

Characterizing Multiple R-NN Curves

Problem

Visual inspection of curves is usefulfor a few molecules

For larger datasets we need tosummarize R-NN curves

Solution

Plot Rmax(S) values for eachmolecule in the dataset

Points at the top of the plot arelocated in the sparsest regions

Points at the bottom are located inthe densest regions

0 50 100 150 200 250

020

4060

80

Serial Number

R max

S

Page 19: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Methods for Diversity AnalysisGenerating & Summarizing R-NN CurvesUsing R-NN Curves

How Can We Use It For Large Datasets?

Breaking the O(n2) barrier

Traditional NN detection has a time complexity of O(n2)

Modern NN algorithms such as kD-trees

have lower time complexityrestricted to the exact NN problem

Solution is to use approximate NN algorithms such as LocalitySensitive Hashing (LSH)

Bentley, J.; Commun. ACM 1980, 23, 214–229

Datar, M. et al.; SCG ’04: Proc. 20th Symp. Comp. Geom.; ACM Press, 2004

Dutta, D.; Guha, R.; Jurs, P.; Chen, T.; J. Chem. Inf. Model. 2006, 46, 321–333

Page 20: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Methods for Diversity AnalysisGenerating & Summarizing R-NN CurvesUsing R-NN Curves

How Can We Use It For Large Datasets?

Why LSH?

Theoretically sublinear

Shown to be 3 orders ofmagnitude faster thantraditional kNN

Very accurate (> 94%)0.02 0.03 0.04 0.05 0.06 0.07 0.08

−3.5

−3.0

−2.5

−2.0

−1.5

−1.0

Radiuslo

g [M

ean

Que

ry T

ime

(sec

)]

LSH (142 descriptors)LSH (20 descriptors)kNN (142 descriptors)kNN (20 descriptors)

Comparison of NN detection speed

on a 42,000 compound dataset

using a 200 point query set

Dutta, D.; Guha, R.; Jurs, P.; Chen, T.; J. Chem. Inf. Model. 2006, 46, 321–333

Page 21: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Methods for Diversity AnalysisGenerating & Summarizing R-NN CurvesUsing R-NN Curves

Alternatives?

Why not use PCA?

R-NN curves are fundamentally aform of dimension reduction

Principal Components Analysis isalso a form of dimension reduction

Disadvantages

Eigendecomposition via SVD isO(n3)

Difficult to visualize more than 2 or3 PC’s at the same time

0 5 10

−6−4

−20

24

Principal Component 1

Prin

cipal

Com

pone

nt 2

Page 22: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Methods for Diversity AnalysisGenerating & Summarizing R-NN CurvesUsing R-NN Curves

Datasets

Boiling point dataset

277 molecules

Average: MW = 115, Stanimoto = 0.20

Calculated 214 descriptors, reduced to 64

Kazius-AMES dataset

4337 molecules

Average MW = 240, Stanimoto = 0.21

Known to have a number of significant outliers

Calculated 142 MOE descriptors, reduced to 45

Goll, E.; Jurs, P.; J. Chem. Inf. Comput. Sci. 1999, 39, 974–983

Kazius, J.; McGuire, R.; Bursi, R.; J. Med. Chem. 2005, 48, 312–320

Page 23: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Methods for Diversity AnalysisGenerating & Summarizing R-NN CurvesUsing R-NN Curves

Choosing a Descriptor Space

Boiling point dataset

We had previously generated linear regression models using aGA to search for subsets

Best model had 4 descriptors

Kazius-AMES dataset

No models were generated for this dataset

Considered random descriptor subsets

Used a 5-descriptor subset to represent the results

The R-NN curve approach focuses on the distribution of moleculesin a given descriptor space. Hence a good or random descriptorsubset should be able to highlight the utility of the method

Page 24: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Methods for Diversity AnalysisGenerating & Summarizing R-NN CurvesUsing R-NN Curves

Choosing a Descriptor Space

Boiling point dataset

We had previously generated linear regression models using aGA to search for subsets

Best model had 4 descriptors

Kazius-AMES dataset

No models were generated for this dataset

Considered random descriptor subsets

Used a 5-descriptor subset to represent the results

The R-NN curve approach focuses on the distribution of moleculesin a given descriptor space. Hence a good or random descriptorsubset should be able to highlight the utility of the method

Page 25: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Methods for Diversity AnalysisGenerating & Summarizing R-NN CurvesUsing R-NN Curves

R-NN Curves for the BP Dataset

●●●●●●●●

●●●●

●●

●●●

●●

●●

●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

6

R

Num

ber

of N

eigh

bors

●●●●●●●●

●●●●●

●●

●●

●●●●

●●

●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

18

R

Num

ber

of N

eigh

bors

●●●●●●

●●●●●

●●●●

●●

●●

●●●●

●●●●●

●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

21

R

Num

ber

of N

eigh

bors

●●●●●●●●●●●●●●●

●●

●●●●

●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

22

R

Num

ber

of N

eigh

bors

●●●●●●●●●●●●●●●

●●

●●●●

●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

24

R

Num

ber

of N

eigh

bors

●●●●●●●●●●●●●●●

●●●●●●●●●●

●●

●●●

●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

37

R

Num

ber

of N

eigh

bors

●●●●●●●●●●●●●●●

●●

●●●●

●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

42

R

Num

ber

of N

eigh

bors

●●●●●●●●●●●●●●●

●●●

●●

●●

●●

●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

85

R

Num

ber

of N

eigh

bors

●●●●

●●●

●●●●●●

●●

●●●

●●●●●

●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●

90

R

Num

ber

of N

eigh

bors

●●●●●●●●●

●●●●●

●●

●●●●●●

●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●

111

R

Num

ber

of N

eigh

bors

●●●●

●●●

●●●●●●

●●

●●

●●●

●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

141

R

Num

ber

of N

eigh

bors

●●●●

●●●●●●

●●●

●●

●●●

●●●●●

●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

148

R

Num

ber

of N

eigh

bors

●●●●

●●●●●●

●●●

●●

●●●

●●●●●

●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

149

R

Num

ber

of N

eigh

bors

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●

●●●●●

●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

151

R

Num

ber

of N

eigh

bors

●●●●●●●●●●●●●●●●●●●●●●●●

●●

●●●

●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

165

R

Num

ber

of N

eigh

bors

●●●

●●

●●●

●●

●●●●●

●●

●●●●●●●

●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●

190

R

Num

ber

of N

eigh

bors

●●●●●●●●●

●●●●●

●●

●●●●●●

●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●

194

R

Num

ber

of N

eigh

bors

●●●●●●

●●●●●

●●●●

●●

●●

●●●●

●●●●●

●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

198

R

Num

ber

of N

eigh

bors

●●●●●●●●●●●●●●●●

●●●●●●●

●●

●●

●●

●●

●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

200

R

Num

ber

of N

eigh

bors

●●●●●●●●●●●●●●●

●●

●●●●

●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

201

R

Num

ber

of N

eigh

bors

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●

●●●●●●●●●●●●●●

226

R

Num

ber

of N

eigh

bors

●●●

●●●●

●●

●●

●●●

●●

●●●●●

●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

251

R

Num

ber

of N

eigh

bors

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●

●●

●●●

●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

259

R

Num

ber

of N

eigh

bors

●●●●●●●

●●●●●●

●●●

●●●●●

●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●

268

R

Num

ber

of N

eigh

bors

●●●●●●●

●●●●●●

●●●

●●●●●

●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●

275

R

Num

ber

of N

eigh

bors

Page 26: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Methods for Diversity AnalysisGenerating & Summarizing R-NN CurvesUsing R-NN Curves

Outliers for the Kazius-AMES Dataset

Defining outliers

R = 0.5× Dmax

NN count ≤ 10% of the dataset

Outliers detected

Molecule 3904 had 2 neighbors

Molecule 1861 had 41 neighbors

R = 0.4× Dmax → 6 outliers

Clearly, this process is subjective

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●

0 20 40 60 80 100

010

0020

0030

0040

00

R

Num

ber

of N

eigh

bors

3904

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

0 20 40 60 80 100

010

0020

0030

0040

00

RN

umbe

r of

Nei

ghbo

rs

1861

Page 27: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Methods for Diversity AnalysisGenerating & Summarizing R-NN CurvesUsing R-NN Curves

Outlier Structures

NH

HN OO

NH

NH

O

HO

O

HN

NH

OHO

H2N

N•

O H2N

O

HN

S

HO

S

O

HN

HN

O

HN

O

HN

O

NH

O

HN

OOH

HN

O NH

O

NH

O

OH

HN

O

HO

NH

O

HN

O

NH

O

NH

NH

O

HN

O

NH

H2N

O

O

NH2

•N

HN

HN

NH

O

HN

O

NH

S

H2N

N•

O

NH2

•N

NH2

O

NHO

HO

NH

OH

O

O

NH

H2N

•N

ONH

N

NHO

O

NH

O

O

N

HO

HO

O

HO

OH

N

NH2

OH

HN

NHO

O

HO

OH

NH2

O

H2N

O

H2N

HNO

OH

H2NO

N

S

N

S HN

O

N

N

N

O

O

O

O

3904 1861

Page 28: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Methods for Diversity AnalysisGenerating & Summarizing R-NN CurvesUsing R-NN Curves

Summary of the Kazius-AMES Dataset

3904 & 1861 are immediatelyidentifiable as outliers

Excluding these two, there areno significant outliers

The dataset appears to berelatively dense

One plot summarizes the locationcharacteristics of all the molecules inthe dataset

0 1000 2000 3000 400020

4060

80Serial Number

R max

S

1861

3904

Page 29: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Methods for Diversity AnalysisGenerating & Summarizing R-NN CurvesUsing R-NN Curves

R-NN Curves & Clustering

−5 0 5 10 15 20 25

05

1015

84

88

137

251

0 20 40 60 80 100

050

100

150

200

250

300

251

0 20 40 60 80 100

050

100

150

200

250

300

84

0 20 40 60 80 100

050

100

150

200

250

300

88

0 20 40 60 80 100

050

100

150

200

250

300

137

Page 30: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Methods for Diversity AnalysisGenerating & Summarizing R-NN CurvesUsing R-NN Curves

R-NN Curves & Clustering

0 20 40 60 80 100

050

100

150

200

250

300

251

0 20 40 60 80 100

050

100

150

200

250

300

84

0 20 40 60 80 100

050

100

150

200

250

300

88

0 20 40 60 80 100

050

100

150

200

250

300

137

0 20 40 60 80 100

02

46

810

1214

251

Slo

pe o

f the

RN

N C

urve

0 20 40 60 80 100

05

1015

20

84

Slo

pe o

f the

RN

N C

urve

0 20 40 60 80 100

05

1015

88

Slo

pe o

f the

RN

N C

urve

0 20 40 60 80 100

05

1015

137

Slo

pe o

f the

RN

N C

urve

Smoothed with Friedmans Super Smoother

Friedman, J.H.; Laboratory for Computational Statistics, Stanford University, Technical Report 5, 1984

Page 31: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Methods for Diversity AnalysisGenerating & Summarizing R-NN CurvesUsing R-NN Curves

Conclusions

Future Work

Allow comparison over multiple datasets

Automated detection of clustering

Conclusions

Applicable to arbitrary dimensions

Can be easily extended to large datasets

Summarizing a dataset does not require any user definedparameters

Simple and intuitive approach to visualizing distribution ofmolecules in a descriptor space

Page 32: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Methods for Diversity AnalysisGenerating & Summarizing R-NN CurvesUsing R-NN Curves

Conclusions

Future Work

Allow comparison over multiple datasets

Automated detection of clustering

Conclusions

Applicable to arbitrary dimensions

Can be easily extended to large datasets

Summarizing a dataset does not require any user definedparameters

Simple and intuitive approach to visualizing distribution ofmolecules in a descriptor space

Page 33: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Previous Work on HIV IntegraseA Tiered Virtual Screening ProtocolWhat Does the Pipeline Give Us?

Outline

1 Outlier Detection Using R-NN CurvesMethods for Diversity AnalysisGenerating & Summarizing R-NN CurvesUsing R-NN Curves

2 Searching for HIV Integrase InhibitorsPrevious Work on HIV IntegraseA Tiered Virtual Screening ProtocolWhat Does the Pipeline Give Us?

3 Summary

Page 34: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Previous Work on HIV IntegraseA Tiered Virtual Screening ProtocolWhat Does the Pipeline Give Us?

Background

Most drugs for HIV treatmenttarget the protease or reversetranscriptase enzymes

Integrase is essential for thereplication of HIV

There are no FDA approved drugsthat target HIV integrase

1BIS

Page 35: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Previous Work on HIV IntegraseA Tiered Virtual Screening ProtocolWhat Does the Pipeline Give Us?

Previous Work

Previous approaches

Pharmacophore based models

Docking

Disadvantages

Requires a reliable receptor structure

Computationally intensive

Does not necessarily lead to a diverse hit list

Nair, V; Chi, G.; Neamati, N.; J. Med. Chem., 2006, 49, 445–447

Dayam, R; Sanchez, T.; Neamati, N.; J. Med. Chem., 2005, 48, 8009–8015

Deng, J. et al.; J. Med. Chem., 2005, 48, 1496–1505

Page 36: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Previous Work on HIV IntegraseA Tiered Virtual Screening ProtocolWhat Does the Pipeline Give Us?

Goals

High speed

Be able to process large libraries rapidly

Avoid docking till required

Try and use connectivity information only

Reliability

Use a consensus approach for predictions

Use molecular similarity to further prioritize predictions

Novelty

Try to obtain hits suitable for lead hopping

Try to obtain a diverse set of hits

Page 37: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Previous Work on HIV IntegraseA Tiered Virtual Screening ProtocolWhat Does the Pipeline Give Us?

Goals

High speed

Be able to process large libraries rapidly

Avoid docking till required

Try and use connectivity information only

Reliability

Use a consensus approach for predictions

Use molecular similarity to further prioritize predictions

Novelty

Try to obtain hits suitable for lead hopping

Try to obtain a diverse set of hits

Page 38: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Previous Work on HIV IntegraseA Tiered Virtual Screening ProtocolWhat Does the Pipeline Give Us?

Goals

High speed

Be able to process large libraries rapidly

Avoid docking till required

Try and use connectivity information only

Reliability

Use a consensus approach for predictions

Use molecular similarity to further prioritize predictions

Novelty

Try to obtain hits suitable for lead hopping

Try to obtain a diverse set of hits

Page 39: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Previous Work on HIV IntegraseA Tiered Virtual Screening ProtocolWhat Does the Pipeline Give Us?

Datasets & Tools

Training data

Curated dataset

529 inactives

395 actives

Binary valued activity

Vendor database

50,000 compounds

2D structures

Software

MOE for descriptors

R for the VS protocol

MASSrandomForestfingerprint

Page 40: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Previous Work on HIV IntegraseA Tiered Virtual Screening ProtocolWhat Does the Pipeline Give Us?

Overview of the Screening Protocol

Descriptor Calculationand Reduction

RF Model LDA Model

VendorDB

Keep PredictedActives with

Majority Vote > M

Keep PredictedActives withPosterior > P

Similarity filterSimilarity filter

Consensus Hit List

Hit List

Training Dataset

Hit List

Page 41: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Previous Work on HIV IntegraseA Tiered Virtual Screening ProtocolWhat Does the Pipeline Give Us?

Descriptor Calculations

Restricted ourselves to topological& 2.5D descriptors

147 descriptors were evaluated

Correlated and low variancedescriptors were removed resultingin a reduced pool of 45 descriptors

Page 42: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Previous Work on HIV IntegraseA Tiered Virtual Screening ProtocolWhat Does the Pipeline Give Us?

Predictive Models - LDA

Why?

Simple

May be sufficient

How?

Use a GA to search for good descriptor subsets

Used a 6-descriptor model

Accuracy?

Percent Correct

On whole dataset 72On TSET/PSET 72 / 71With leave-10%-out 71

Page 43: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Previous Work on HIV IntegraseA Tiered Virtual Screening ProtocolWhat Does the Pipeline Give Us?

Predictive Models - Random Forest

Why?

No feature selection required

Does not overfit

Useful for non-linearities in the dataset

How?

Number of trees = 500

Number of features sampled = 6

Accuracy

Percent Correct

On whole dataset 72On TSET/PSET 75 / 70

Page 44: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Previous Work on HIV IntegraseA Tiered Virtual Screening ProtocolWhat Does the Pipeline Give Us?

Consensus Predictions

Goal

Try and be sure that a predicted active isactive

Consider compounds predicted active byboth models

LDA: consider compounds withposterior > PRF: consider compounds withmajority vote > M

Apply similarity filter

We now have 2 hit lists

A final hit list is obtained from theintersection of the individual hitlists

Vendor Database

LDA RF

Posterior & MajorityConstrains

Similarity Filter

Predicted Active

Page 45: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Previous Work on HIV IntegraseA Tiered Virtual Screening ProtocolWhat Does the Pipeline Give Us?

Consensus Predictions

Goal

Try and be sure that a predicted active isactive

Consider compounds predicted active byboth models

LDA: consider compounds withposterior > PRF: consider compounds withmajority vote > M

Apply similarity filter

We now have 2 hit lists

A final hit list is obtained from theintersection of the individual hitlists

Vendor Database

LDA RF

Posterior & MajorityConstrains

Similarity Filter

Predicted Active

Page 46: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Previous Work on HIV IntegraseA Tiered Virtual Screening ProtocolWhat Does the Pipeline Give Us?

Similarity Filter

Goal

Predict actives from vendor database that are more similar toactives than inactives

Evaluate 166 bit MACCS fingerprints

Evaluate average Tanimoto similarity between

Vendor compound and TSET actives (S1)Vendor compound and TSET inactives (S2)

Select compounds where

S1 − S2 ≥ ε

Page 47: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Previous Work on HIV IntegraseA Tiered Virtual Screening ProtocolWhat Does the Pipeline Give Us?

Similarity Filter

Goal

Predict actives from vendor database that are more similar toactives than inactives

Evaluate 166 bit MACCS fingerprints

Evaluate average Tanimoto similarity between

Vendor compound and TSET actives (S1)Vendor compound and TSET inactives (S2)

Select compounds where

S1 − S2 ≥ ε

Page 48: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Previous Work on HIV IntegraseA Tiered Virtual Screening ProtocolWhat Does the Pipeline Give Us?

Suggesting Good Leads

Can we prioritize further?

Look for TSET actives that areoutliers

These compounds may be goodstarting points for furthermodifications

Evaluate similarity of hit listcompounds to these isolated TSETcompounds

Hit list compounds most similar tothe most isolated TSET active maybe good leads

0 200 400 600 800

2040

6080

Compound Serial No.

R max

S

The points marked red are

the most outlying

compounds in the TSET and

are also active

Saeh, J.C.; Lyne, P.D.; Takasaki, B.K.; Cosgrove, D.A.; J. Chem. Inf. Comput. Sci., 2005, 45, 1122-1133

Guha, R.; Dutta, D.;Jurs, P.C.; Chen, T; J. Chem. Inf. Model., submitted

Page 49: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Previous Work on HIV IntegraseA Tiered Virtual Screening ProtocolWhat Does the Pipeline Give Us?

Results

Summary of the hits

Parameter settings

P,M > 0.7ε ≥ 0.01

Of the 34 hits, 7 had asimilarity > 0.5 with themost outlying TSET active

We obtained 66 more hitsfrom an ensemble of LDAmodels

LDA RF

313 hits 57 hits

34 hits

Average similarity = 0.64

Not significantly diverse

None were in common withthose predicted by thepharmacophore models

Page 50: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Previous Work on HIV IntegraseA Tiered Virtual Screening ProtocolWhat Does the Pipeline Give Us?

The Search for β-diketones

A recently published inhibitor exhibited a β-diketone moiety

None of our hits had this moiety

R [C,N] R

O O

Searched the vendor DB for structures with this feature andpredicted them

LDA model → 244 actives (posterior > 0.8)RF model → 4 actives (score > 0.85)The intersection set contained 4 compounds

We missed these hits due to our similarity constraints

Nair, V; Chi, G.; Neamati, N.; J. Med. Chem., 2006, 49, 445–447

Page 51: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Previous Work on HIV IntegraseA Tiered Virtual Screening ProtocolWhat Does the Pipeline Give Us?

The Search for β-diketones

A recently published inhibitor exhibited a β-diketone moiety

None of our hits had this moiety

R [C,N] R

O O

Searched the vendor DB for structures with this feature andpredicted them

LDA model → 244 actives (posterior > 0.8)RF model → 4 actives (score > 0.85)The intersection set contained 4 compounds

We missed these hits due to our similarity constraints

Nair, V; Chi, G.; Neamati, N.; J. Med. Chem., 2006, 49, 445–447

Page 52: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Previous Work on HIV IntegraseA Tiered Virtual Screening ProtocolWhat Does the Pipeline Give Us?

Similarity to the Published Compound

N

N

OH

O

OO

O

OH

Published

NHN

O

O

O

O

HO

F

O

PredictedNair, V; Chi, G.; Neamati, N.; J. Med. Chem., 2006, 49, 445–447

Page 53: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Previous Work on HIV IntegraseA Tiered Virtual Screening ProtocolWhat Does the Pipeline Give Us?

Future Work

Reliability

Consider similarity to known actives in terms ofpharmacophores

Dock our best hits (may not be conclusive)

Improve performance with local methods

Diversity

Investigate distribution of vendor compounds in the descriptorspace

Cluster the vendor DB and predict representative members ofthe clusters

Assay!

Page 54: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Previous Work on HIV IntegraseA Tiered Virtual Screening ProtocolWhat Does the Pipeline Give Us?

Future Work

Reliability

Consider similarity to known actives in terms ofpharmacophores

Dock our best hits (may not be conclusive)

Improve performance with local methods

Diversity

Investigate distribution of vendor compounds in the descriptorspace

Cluster the vendor DB and predict representative members ofthe clusters

Assay!

Page 55: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Previous Work on HIV IntegraseA Tiered Virtual Screening ProtocolWhat Does the Pipeline Give Us?

Future Work

Reliability

Consider similarity to known actives in terms ofpharmacophores

Dock our best hits (may not be conclusive)

Improve performance with local methods

Diversity

Investigate distribution of vendor compounds in the descriptorspace

Cluster the vendor DB and predict representative members ofthe clusters

Assay!

Page 56: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Outline

1 Outlier Detection Using R-NN CurvesMethods for Diversity AnalysisGenerating & Summarizing R-NN CurvesUsing R-NN Curves

2 Searching for HIV Integrase InhibitorsPrevious Work on HIV IntegraseA Tiered Virtual Screening ProtocolWhat Does the Pipeline Give Us?

3 Summary

Page 57: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Summary

The outlier detection scheme provides a way to analyze spatialdistributions of molecules in arbitrary descriptor spaces

The method can be extended to large datasets usingapproximate NN algorithms

The technique results in a single intuitive plot that can beused to easily identify outliers

A consensus approach to predictive models allows us toincrease confidence in activity predictions

Prioritization based on similarity is intuitive, but may not workwell with homogenous datasets

The protocol appears to have identified possible leads whichawait confirmation

Page 58: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Summary

The outlier detection scheme provides a way to analyze spatialdistributions of molecules in arbitrary descriptor spaces

The method can be extended to large datasets usingapproximate NN algorithms

The technique results in a single intuitive plot that can beused to easily identify outliers

A consensus approach to predictive models allows us toincrease confidence in activity predictions

Prioritization based on similarity is intuitive, but may not workwell with homogenous datasets

The protocol appears to have identified possible leads whichawait confirmation

Page 59: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Acknowledgements

Prof. P. C. Jurs

Dr. D. Dutta

Prof. N. Neamati

Dr. R. Dayam

Page 60: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Page 61: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Why are R-NN Curves Sigmoidal?

Let the neighbor density be ρ0

At some critical radius, r0, the neighbor density changes toρ1, ρ1 >> ρ0

After a certain radius, R, the number of NN’s becomesconstant, N

For a radius r < r0, consider a small change dr

The number of NN’s on the strip is 2πρ0dr which onintegration gives 2πρ0r

20

Page 62: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Why are R-NN Curves Sigmoidal?

Let the neighbor density be ρ0

At some critical radius, r0, the neighbor density changes toρ1, ρ1 >> ρ0

After a certain radius, R, the number of NN’s becomesconstant, N

For a radius r < r0, consider a small change dr

The number of NN’s on the strip is 2πρ0dr which onintegration gives 2πρ0r

20

Page 63: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Why are R-NN Curves Sigmoidal?

For a radius r0 < r < R, the number of NN’s is given by

2πρ0r20 +

∫ r

r0

2πrρ1dr

The number of nearest neighbors as a function of r is

NN(r) =

0 if r ≤ 02πρ0r

2 if 0 < r ≤ r02πρ0r

20 + 2πρ1(r

2 − r20 ) if r0 < r < R

N if r ≥ R

From Zhang et al., the above function is an approximation ofthe logistic function

Zhang, M.; Vassiliadis, S.; Delgado-Frias, J.; IEEE Trans. Comput., 1996, 45, 1045–1049

Page 64: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Why are R-NN Curves Sigmoidal?

For a radius r0 < r < R, the number of NN’s is given by

2πρ0r20 +

∫ r

r0

2πrρ1dr

The number of nearest neighbors as a function of r is

NN(r) =

0 if r ≤ 02πρ0r

2 if 0 < r ≤ r02πρ0r

20 + 2πρ1(r

2 − r20 ) if r0 < r < R

N if r ≥ R

From Zhang et al., the above function is an approximation ofthe logistic function

Zhang, M.; Vassiliadis, S.; Delgado-Frias, J.; IEEE Trans. Comput., 1996, 45, 1045–1049

Page 65: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Why are R-NN Curves Sigmoidal?

For a radius r0 < r < R, the number of NN’s is given by

2πρ0r20 +

∫ r

r0

2πrρ1dr

The number of nearest neighbors as a function of r is

NN(r) =

0 if r ≤ 02πρ0r

2 if 0 < r ≤ r02πρ0r

20 + 2πρ1(r

2 − r20 ) if r0 < r < R

N if r ≥ R

From Zhang et al., the above function is an approximation ofthe logistic function

Zhang, M.; Vassiliadis, S.; Delgado-Frias, J.; IEEE Trans. Comput., 1996, 45, 1045–1049

Page 66: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

R-NN Curves & Clustering

How Do We Characterize Curves?

7 RMSE

7 Generating a mean curve

3 Examine slopes

3 Piecewise linear approximation

3 Compare distance matrices

3 Shape matching

Hausdorff distanceFrechet distance

Page 67: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

R-NN Curves & Clustering

0 20 40 60 80 100

050

100

150

200

250

300

251

0 20 40 60 80 100

050

100

150

200

250

300

84

0 20 40 60 80 100

050

100

150

200

250

300

88

0 20 40 60 80 100

050

100

150

200

250

300

137

0 20 40 60 80 100

02

46

810

1214

251

Slo

pe o

f the

RN

N C

urve

0 20 40 60 80 100

05

1015

20

84

Slo

pe o

f the

RN

N C

urve

0 20 40 60 80 100

05

1015

88

Slo

pe o

f the

RN

N C

urve

0 20 40 60 80 100

05

1015

137

Slo

pe o

f the

RN

N C

urve

Smoothed with Friedmans Super Smoother

Friedman, J.H.; Laboratory for Computational Statistics, Stanford University, Technical Report 5, 1984

Page 68: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Genetic Algorithms

Features

Stochastic optimization procedure

Based on evolutionary principles

Applicable to large search spaces

Not always guaranteed to find theoptimal solution

Fitness is evaluated by theobjective function

Page 69: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Genetic Algorithms

Initialization

Initially random chromosomes

Chromosome length is the size of thedescriptor subset

Genes are the descriptors

Larger pool sizes allow for a wider search

1 12 14 34 35

8 12 23 29 34

1 27 33 42 43

5 22 28 35 41

12 17 21 32 48

Page 70: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Genetic Algorithms

Mutation

Mutation frequency is relatively low

Mutations randomly change a singledescriptor

Helps to get out of local minima

1 27 33

1 27 33

43

43

2

15

Page 71: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Genetic Algorithms

Crossover

Crossover results in swapping of genes

Crossover between fit individuals shouldlead to children with good aspects of bothparents

In single point crossover

Choose crosspointSwap corresponding sectionsResults in two new children

42 43

32

1 27 33

48

42 431 27 33

12 17 21

32 48

12 17 21

Page 72: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Random Forest

Features

Built in estimation of prediction accuracy

Measure of descriptor importance

Measure of similarity

Random Forest vs. Decision Tree

RF is faster because it searches fewer descriptors at each split

A decision tree must be pruned via cross validation. RF treesare grown to full depth

RF is much more accurate than a decision tree

Page 73: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Random Forest

TrainingData

N moleculebootstrap sample

Grow tree tomaximum size

Store tree

Is the Forest Big Enough?

YES

NO

STOP

Page 74: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

OOB Samples

What is OOB?

Due to bootstrap sampling, each tree uses ≈ 2/3 of the TSET

The remaining 1/3 is the Out Of Bag sample

The OOB sample can be used to estimate performance duringtraining or measure descriptor importance

Descriptor Importance

Decision trees create explicit relationships between descriptorsand predictions

These relationships are hidden in a RF model

A randomization procedure on OOB samples can be used torank descriptors in importance

Page 75: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Model Complexity

Very simple models or very complex models may performpoorly

The optimal performance and complexity are obtained from atrade off between bias and variance

Mean Squared Error = Var(y) + Bias2(y), where

Var ∝ 1

complexityand Bias ∝ complexity

Page 76: Navigating Molecular Haystacks: Tools & Applications

Outlier Detection Using R-NN CurvesSearching for HIV Integrase Inhibitors

Summary

Model Complexity

http://www.ece.wisc.edu/∼nowak/ece901/lecture3.pdf