Page 1: Amit Satsangi amit@cs.ualberta

Faculty of Computer Science

CMPUT 605

December 06, 2007; February 04, 2008

© 2006

Novel Approaches for Small Bio-molecule Classification and Structural Similarity Search

Karakoc E, Cherkasov A., and Sahinalp S.C.

Amit Satsangi

Page 2: Amit Satsangi amit@cs.ualberta

© 2006

Department of Computing Science

CMPUT 605

Background and Focus

Identification of molecules that play an active role in regulation of biological processes or disease states (Aspirin)

Structural similarity → similar biological and/or physico-chemical properties (Maggiora et al.)

Classification of probe compound (unknown bioactivity)

Similarity search amongst compounds with known bioactivity

Page 3: Amit Satsangi amit@cs.ualberta

Background and Focus

Determining similarity distance measures (SDM)

Using SDM for classification of compounds: k-NN classification

Efficient data structures for fast similarity search: DMVP trees (an improvement over the SCVP trees used previously)

Page 4: Amit Satsangi amit@cs.ualberta

Outline

Similarity measures

Classification techniques

k-NN classifier

DMVP tree

Results, Observations and Conclusion

Page 5: Amit Satsangi amit@cs.ualberta

Similarity between Molecules

Structural Similarity—doubly bonded C pair, existence of aromatic atom etc. (Used in structural similarity search engines)

Similarity of chemical descriptors—atomic wt., hydrophobicity, charge, density etc. (Used in QSAR* tools)

*Quantitative Structure-Activity Relationship

Page 6: Amit Satsangi amit@cs.ualberta

Similarity Measures

Tanimoto coefficient T(X,Y): given two descriptor sets X and Y, T(X,Y) = |X ∩ Y| / |X ∪ Y|

X and Y: n-dimensional bit-vectors (representation used by PubChem and some other databases)

Range of Tanimoto coefficient: [0, 1]
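
A minimal sketch of the Tanimoto coefficient on bit-vectors, assuming the vectors are packed into non-negative Python integers (an illustrative storage choice, not something the slides specify). The function name `tanimoto` and the convention that two all-zero vectors have similarity 1.0 are assumptions made here.

```python
def tanimoto(x: int, y: int) -> float:
    """Tanimoto coefficient of two bit-vectors packed into non-negative ints."""
    # Bits set in either vector (size of the union).
    union = bin(x | y).count("1")
    if union == 0:
        # Assumption: two all-zero vectors are treated as identical.
        return 1.0
    # Bits set in both vectors (size of the intersection).
    shared = bin(x & y).count("1")
    return shared / union
```

Identical vectors give 1.0 and vectors with no shared bits give 0.0, which matches the stated range [0, 1].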

Page 7: Amit Satsangi amit@cs.ualberta

Similarity measures

Tanimoto distance measure: D_T(X,Y) = 1 - T(X,Y)

Minkowski distance (L_p): L_p(X,Y) = (Σ_i |X_i - Y_i|^p)^(1/p)

Real-valued data possible
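
A small sketch of the Minkowski distance of order p on real-valued descriptor vectors. The function name `minkowski` and the input checks are illustrative assumptions; p = 1 gives the Manhattan distance and p = 2 the Euclidean distance.

```python
from typing import Sequence


def minkowski(x: Sequence[float], y: Sequence[float], p: float = 2.0) -> float:
    """Minkowski (L_p) distance between two equal-length real-valued vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if p < 1:
        # For p < 1 the triangle inequality fails, so it is not a metric.
        raise ValueError("p must be >= 1")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)
```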

Page 8: Amit Satsangi amit@cs.ualberta

Classification Techniques

Multiple Linear Regression (MLR)

Linear Discriminant Analysis (LDA)

Artificial Neural Networks (ANN)

Support Vector Machines (SVM)

k-nearest neighbor (k-NN) classification: not used previously

Page 9: Amit Satsangi amit@cs.ualberta

Distance-based Classification

Compounds: s and r

S and R: their respective descriptor arrays

If D(S,R) is small, then the bioactivity levels of s and r are similar

Notion of distance → classification of new compounds

Distance measure == metric (conditions), e.g. Hamming distance, Tanimoto distance etc.
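
The "conditions" for a distance measure to be a metric are the standard metric axioms. They are written out here for a distance D over descriptor arrays S, R and Q, where Q is an extra symbol introduced only for this illustration:

```latex
\begin{align*}
D(S,R) &\ge 0 && \text{(non-negativity)}\\
D(S,R) &= 0 \iff S = R && \text{(identity)}\\
D(S,R) &= D(R,S) && \text{(symmetry)}\\
D(S,Q) &\le D(S,R) + D(R,Q) && \text{(triangle inequality)}
\end{align*}
```

The triangle inequality is the property that tree-based indexes rely on to discard whole partitions without computing every distance.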

Page 10: Amit Satsangi amit@cs.ualberta

k-NN Classification

Given: bioactivity

To find: a distance measure that separates active and inactive compounds for the training set (N-dimensional plane)

Problem: Easy

Page 11: Amit Satsangi amit@cs.ualberta

k-NN Classification

Given: bioactivity

To find: a distance measure that separates active and inactive compounds for the training set (N-dimensional plane)

Problem: NP-hard

Solution: use genetic algorithms, heuristic linear search to find the plane
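
To make the idea of a heuristic search over distance weights concrete, here is a deliberately simple sketch. It is not the genetic algorithm or the search used by the authors: it swaps in a coordinate-wise grid search that keeps any per-dimension weight that improves leave-one-out 1-NN accuracy on the training set. All names (`wl1`, `loo_accuracy`, `coordinate_search`) and the weight grid are assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def wl1(x: Sequence[float], y: Sequence[float], w: Sequence[float]) -> float:
    """Weighted L1 distance between two descriptor vectors."""
    return sum(wi * abs(a - b) for wi, a, b in zip(w, x, y))


def loo_accuracy(X: Sequence[Sequence[float]], labels: Sequence[int],
                 w: Sequence[float]) -> float:
    """Leave-one-out accuracy of a 1-NN classifier under weights w."""
    correct = 0
    for i, xi in enumerate(X):
        # Nearest other training point under the weighted distance.
        nearest = min((j for j in range(len(X)) if j != i),
                      key=lambda j: wl1(xi, X[j], w))
        correct += labels[nearest] == labels[i]
    return correct / len(X)


def coordinate_search(X: Sequence[Sequence[float]], labels: Sequence[int],
                      grid: Sequence[float] = (0.0, 0.5, 1.0),
                      sweeps: int = 2) -> Tuple[List[float], float]:
    """Greedy per-dimension search for weights in [0, 1] (illustrative only)."""
    w = [1.0] * len(X[0])
    best = loo_accuracy(X, labels, w)
    for _ in range(sweeps):
        for d in range(len(w)):
            for candidate in grid:
                trial = w[:d] + [candidate] + w[d + 1:]
                score = loo_accuracy(X, labels, trial)
                if score > best:  # keep only strict improvements
                    best, w = score, trial
    return w, best
```

On a toy set where only the first descriptor separates the classes, the search drives the weight of the noisy second descriptor to zero.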

Page 12: Amit Satsangi amit@cs.ualberta

QSAR approach

• Uses a linear combination of descriptors

• Assigns a weight to each dimension: W_i ∈ [0, 1]

• Weighted Minkowski distance of order 1

• Only binary classification considered (A/I)

• Methods are general
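
A minimal sketch of the weighted Minkowski distance of order 1 (wL1), assuming one weight per descriptor dimension with each weight in [0, 1]. The function name `weighted_l1` and the validation are illustrative assumptions.

```python
from typing import Sequence


def weighted_l1(x: Sequence[float], y: Sequence[float],
                w: Sequence[float]) -> float:
    """Weighted L1 distance: sum over dimensions of w_i * |x_i - y_i|."""
    if not (len(x) == len(y) == len(w)):
        raise ValueError("x, y and w must have the same length")
    if any(not 0.0 <= wi <= 1.0 for wi in w):
        raise ValueError("each weight must lie in [0, 1]")
    return sum(wi * abs(a - b) for wi, a, b in zip(w, x, y))
```

A weight of 0 removes a descriptor from the comparison entirely, while a weight of 1 lets it contribute its full absolute difference.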

Page 13: Amit Satsangi amit@cs.ualberta

Parameter Optimization

Page 14: Amit Satsangi amit@cs.ualberta

k-NN Classifier

Set of data elements: {X1, … Xn}

Query element: Y

Range query: find all Xi such that D(Y, Xi) < R1 (R1 is user defined)

k-NN query: find the k items whose distance to Y is as small as possible
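
The two query types can be sketched as brute-force scans, which is the baseline that an index tree tries to beat. The function names and the generic `dist` callback are assumptions for illustration; any metric can be passed in.

```python
import heapq
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def range_query(data: Sequence[T], query: T, radius: float,
                dist: Callable[[T, T], float]) -> List[T]:
    """Return every element whose distance to the query is below radius."""
    return [x for x in data if dist(query, x) < radius]


def knn_query(data: Sequence[T], query: T, k: int,
              dist: Callable[[T, T], float]) -> List[T]:
    """Return the k elements closest to the query, nearest first."""
    return heapq.nsmallest(k, data, key=lambda x: dist(query, x))
```

Both functions compute the distance to every element, so their cost grows linearly with the dataset size.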

Page 15: Amit Satsangi amit@cs.ualberta

Data structures: VP-Trees

Vantage Point (VP) tree

Choose an arbitrary data point (called the vantage point)

Binary tree: recursively partitions the dataset into two equal-sized subsets

Zero in on the nearest neighbor
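
A compact sketch of a vantage point tree with nearest-neighbor search, under stated assumptions: the first point of each subset is taken as the "arbitrary" vantage point, the median distance is the split radius, and the names `Node`, `build` and `nearest` are invented here. It illustrates the general technique rather than the presenters' implementation.

```python
import statistics
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

Point = Tuple[float, ...]
Dist = Callable[[Point, Point], float]


@dataclass
class Node:
    vantage: Point
    radius: float = 0.0             # median distance from the vantage point
    inner: Optional["Node"] = None  # points with distance <= radius
    outer: Optional["Node"] = None  # points with distance > radius


def build(points: Sequence[Point], dist: Dist) -> Optional[Node]:
    """Recursively split the points around a vantage point at the median."""
    if not points:
        return None
    vantage, rest = points[0], list(points[1:])  # assumption: first point
    node = Node(vantage)
    if rest:
        dists = [dist(vantage, p) for p in rest]
        node.radius = statistics.median(dists)
        node.inner = build([p for p, d in zip(rest, dists) if d <= node.radius], dist)
        node.outer = build([p for p, d in zip(rest, dists) if d > node.radius], dist)
    return node


def nearest(node: Optional[Node], query: Point, dist: Dist,
            best: Optional[Tuple[float, Point]] = None):
    """Return (distance, point) for the nearest neighbor of query."""
    if node is None:
        return best
    d = dist(query, node.vantage)
    if best is None or d < best[0]:
        best = (d, node.vantage)
    # Search the side the query falls in first.
    if d <= node.radius:
        first, second = node.inner, node.outer
    else:
        first, second = node.outer, node.inner
    best = nearest(first, query, dist, best)
    # Triangle inequality: the other side can hold a closer point only if
    # the current best ball around the query reaches the split boundary.
    if abs(d - node.radius) <= best[0]:
        best = nearest(second, query, dist, best)
    return best
```

The pruning test is where the metric property pays off: when the query is far from the split radius, an entire subtree is skipped.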

Page 16: Amit Satsangi amit@cs.ualberta

Efficient data structures: SCVP Trees

Space Covering Vantage Point tree

Multiple vantage points chosen at each level

No longer a binary tree: multiple branches at each internal node

Multiple inner partitions: the hope is that each data point lies in at least one inner partition

Page 17: Amit Satsangi amit@cs.ualberta

DMVP Tree

Memory requirements of the SCVP tree can be large: redundancy of data elements

Deterministic selection of vantage points

VP minimization: NP-hard

Minimization == weighted set cover problem

Greedy algorithm: O(log l) approximation; l < n

Approximates the minimum number of VPs
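
The standard greedy heuristic for weighted set cover can be sketched as below. The slides do not spell out how vantage points map to sets, so the sketch assumes each candidate (for example, a candidate vantage point) covers a known set of elements at a known cost; the function and parameter names are hypothetical.

```python
from typing import Dict, Hashable, List, Set


def greedy_weighted_set_cover(universe: Set[Hashable],
                              candidates: Dict[Hashable, Set[Hashable]],
                              weights: Dict[Hashable, float]) -> List[Hashable]:
    """Repeatedly pick the candidate with the lowest cost per new element."""
    uncovered = set(universe)
    chosen: List[Hashable] = []
    while uncovered:
        best_name, best_ratio = None, float("inf")
        for name, members in candidates.items():
            gain = len(members & uncovered)  # newly covered elements
            if gain == 0:
                continue
            ratio = weights[name] / gain
            if ratio < best_ratio:
                best_name, best_ratio = name, ratio
        if best_name is None:
            raise ValueError("the candidates cannot cover the universe")
        chosen.append(best_name)
        uncovered -= candidates[best_name]
    return chosen
```

The greedy choice does not guarantee the minimum-cost cover, but its cost stays within a logarithmic factor of the optimum.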

Page 18: Amit Satsangi amit@cs.ualberta

Experiments

Five types of bioactivity: antibiotic (520), bacterial metabolite (562), human metabolite (1104), drug (958), drug-like (1202)

62-dimensional descriptor array (30 QSAR & 32 physico-chemical properties)

k = 1, i.e. one nearest neighbor

Comparison with LDA, MLR, ANN

70% of the data used for training

Weighted L1 (wL1) distance is calculated in all cases

Page 19: Amit Satsangi amit@cs.ualberta

Experimental Results

Table 1 shows that in almost all cases k-NN does better than LDA and MLR in terms of accuracy, T_P, T_N, F_P etc.

ANN beats k-NN on almost all counts

Pruning: more than 80% for each kind of bioactivity (relative to brute-force search)

Key point: the k-NN classifier is faster

More than 100 times faster than ANN
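
As a reminder of how the reported quantities relate, here is a small sketch that derives confusion counts and accuracy from binary labels. It assumes T_P, T_N and F_P denote true positives, true negatives and false positives, and that "active" is encoded as True; the function name is hypothetical.

```python
from typing import Dict, Sequence


def confusion_counts(actual: Sequence[bool],
                     predicted: Sequence[bool]) -> Dict[str, float]:
    """Count TP, TN, FP, FN and compute accuracy for binary labels."""
    if len(actual) != len(predicted):
        raise ValueError("label sequences must have the same length")
    if not actual:
        raise ValueError("need at least one compound")
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # active, predicted active
    tn = sum(1 for a, p in pairs if not a and not p)  # inactive, predicted inactive
    fp = sum(1 for a, p in pairs if not a and p)      # inactive, predicted active
    fn = sum(1 for a, p in pairs if a and not p)      # active, predicted inactive
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn,
            "accuracy": (tp + tn) / len(pairs)}
```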

Page 20: Amit Satsangi amit@cs.ualberta

Experimental Results

Can calculate the level of bioactivity instead of a YES/NO

The values of the weights provide insight into the importance of descriptors for each bioactivity

Page 21: Amit Satsangi amit@cs.ualberta

Observations & Conclusion

Bacterial metabolites & antimicrobial drugs overlap (confirmation)

Human metabolites display distinctive properties

QSAR models for drugs + human metabolites are dominated by a few descriptors

These descriptors are favored by drug developers and natural evolution

Page 22: Amit Satsangi amit@cs.ualberta

Observations & Conclusion

Classification results from k-NN can help rationalize the design and discovery of drugs

DMVP tree improves the space utilization of the program

Provides a means for fast similarity search

Data structure can be applied to any metric distance, such as wLp and Tanimoto distance

Page 23: Amit Satsangi amit@cs.ualberta

Thank You For Your Attention!