Contemporary QSAR Classifiers Compared
Craig Bruce, School of Chemistry
Craig Bruce, HPC User Meeting, 17th January 2007
Introduction
QSAR: Quantitative Structure-Activity Relationship
Similar Property Principle: similar structure » similar properties
Methods
Support Vector Machine
Decision Tree
Random Forest
Ensemble
  Bagging
  Boosting
Parameter Tuning
Datasets
Dataset  Compound type                        No. compounds  No. descriptors
                                                             2.5D  Fragments
ACE      Angiotensin converting enzyme        114            56    1024
AchE     Acetyl-cholinesterase inhibitors     111            63    774
BZR      Benzodiazepine receptor              163            75    832
COX2     Cyclooxygenase-2 inhibitors          322            74    660
DHFR     Dihydrofolate reductase inhibitors   397            70    952
GPB      Glycogen phosphorylase b             66             70    692
THER     Thermolysin inhibitors               76             64    575
THR      Thrombin inhibitors                  88             66    527
Sutherland, J. J.; O'Brien, L. A.; Weaver, D. F. J. Med. Chem. 2004, 47(22), 5541-5554.
Cross-Validation
Final model trained on the full dataset
Cross-validation (CV) used to measure classifier performance
[Figure: Dataset]
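The idea on this slide can be illustrated with a minimal k-fold sketch. It is not the procedure used in the study: the fold count, the toy data and the majority-class stand-in classifier are all assumptions chosen to keep the example self-contained.

```python
import random

def k_fold_indices(n_samples, k, seed):
    """Shuffle indices 0..n_samples-1 and deal them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(X, y, train_fn, predict_fn, k=10, seed=0):
    """Return the fraction of held-out samples that are classified correctly."""
    folds = k_fold_indices(len(X), k, seed)
    correct = 0
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in range(len(X)) if i not in held]
        model = train_fn([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct += sum(predict_fn(model, X[i]) == y[i] for i in held_out)
    return correct / len(X)

# Stand-in classifier (assumption): always predict the majority training class.
def train_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)

def predict_majority(model, x):
    return model

# Hypothetical toy data: 20 one-descriptor "compounds", 15 active (1), 5 inactive (0).
X = [[float(i)] for i in range(20)]
y = [1] * 15 + [0] * 5

# Repeating CV with different shuffles averages out the noise of any single split.
repeat_scores = [cross_validate(X, y, train_majority, predict_majority, k=5, seed=s)
                 for s in range(10)]
mean_accuracy = sum(repeat_scores) / len(repeat_scores)
```

Any classifier can be plugged in through `train_fn` and `predict_fn`. The held-out estimate describes how the classifier generalises, while the model that is finally used is still fitted to all of the data.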
Need for HPC
8 datasets
2 descriptor sets
7 classifiers
10 repeats of CV
1120 models to generate
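The model count is the product of the four factors above. A short sketch that enumerates the job grid, using the dataset and classifier labels from the results tables (the job tuple layout is an assumption for illustration):

```python
from itertools import product

datasets = ["ACE", "AchE", "BZR", "COX2", "DHFR", "GPB", "THER", "THR"]
descriptor_sets = ["2.5D", "Fragments"]
classifiers = ["Tree", "Bagged Tree", "Boosted Tree", "Random Forest",
               "SVM", "Tuned Forest", "Tuned SVM"]
cv_repeats = range(1, 11)

# One job per (dataset, descriptor set, classifier, CV repeat) combination.
jobs = list(product(datasets, descriptor_sets, classifiers, cv_repeats))
print(len(jobs))  # 8 * 2 * 7 * 10 = 1120
```

Each tuple is independent of the others, which is what makes the workload a natural fit for a batch queue.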
Results - 2.5D
Dataset  Tree  Bagged Tree  Boosted Tree  Random Forest  SVM   Tuned Forest (a)  Tuned SVM (b)
ACE      86.9  86.5         86.6          85.4           90.3  89.3              89.9
AchE     70.6  71.6         72.7          72.6           72.0  79.5              74.3
BZR      71.7  75.5         75.4          74.0           77.4  79.5              81.6
COX2     75.6  75.7         76.1          73.4           75.4  75.7              75.2
DHFR     78.8  83.2         83.4          83.1           79.6  84.9              82.2
GPB      70.6  74.5         76.2          74.1           73.9  76.7              75.3
THER     67.2  69.2         67.8          69.7           69.5  74.6              74.6
THR      66.5  69.1         68.0          69.1           67.2  72.5              69.0
(a) 100 trees
(b) Polynomial kernel; exponent = 2; complexity constant = 0.05
Results - Fragments
Dataset  Tree  Bagged Tree  Boosted Tree  Random Forest  SVM   Tuned Forest (a)  Tuned SVM (b)
ACE      80.4  82.0         81.0          80.5           78.9  80.0              82.2
AchE     64.1  68.0         68.8          70.5           69.4  70.5              77.1
BZR      74.0  75.0         69.8          67.3           74.0  68.7              75.8
COX2     71.1  71.5         71.0          68.1           72.6  68.7              71.1
DHFR     84.4  85.4         83.1          84.9           83.5  85.5              86.5
GPB      73.8  75.6         76.2          74.5           77.4  75.2              76.7
THER     72.2  75.8         75.5          75.4           75.3  76.7              73.4
THR      71.5  69.2         68.8          66.7           71.1  68.4              69.8
(a) 100 trees
(b) RBF kernel; width = 0.1; complexity constant = 1
Statistics
Paired t-test
Multiple comparison tests
  Nonparametric Friedman test (with Iman & Davenport correction)
  Post-hoc Nemenyi test
Demšar, J. Statistical Comparisons of Classifiers over Multiple Data Sets. J. Mach. Learn. Res. 2006, 7, 1-30.
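A minimal sketch of the Friedman test, the Iman & Davenport correction and the Nemenyi critical difference, following the formulas in Demšar (2006). As an assumption for illustration it is applied only to the values in the 2.5D results table, so it is not necessarily the exact analysis behind the talk. The constant 2.949 is the tabulated Nemenyi critical value for seven classifiers at alpha = 0.05.

```python
import math

classifiers = ["Tree", "Bagged Tree", "Boosted Tree", "Random Forest",
               "SVM", "Tuned Forest", "Tuned SVM"]
# Rows: ACE, AchE, BZR, COX2, DHFR, GPB, THER, THR (2.5D results table).
scores = [
    [86.9, 86.5, 86.6, 85.4, 90.3, 89.3, 89.9],
    [70.6, 71.6, 72.7, 72.6, 72.0, 79.5, 74.3],
    [71.7, 75.5, 75.4, 74.0, 77.4, 79.5, 81.6],
    [75.6, 75.7, 76.1, 73.4, 75.4, 75.7, 75.2],
    [78.8, 83.2, 83.4, 83.1, 79.6, 84.9, 82.2],
    [70.6, 74.5, 76.2, 74.1, 73.9, 76.7, 75.3],
    [67.2, 69.2, 67.8, 69.7, 69.5, 74.6, 74.6],
    [66.5, 69.1, 68.0, 69.1, 67.2, 72.5, 69.0],
]

def ranks_descending(row):
    """Rank 1 = highest score; tied scores share the average of their ranks."""
    order = sorted(range(len(row)), key=lambda j: -row[j])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks

N, k = len(scores), len(classifiers)
rank_rows = [ranks_descending(row) for row in scores]
avg_rank = [sum(r[j] for r in rank_rows) / N for j in range(k)]

# Friedman statistic (without tie correction).
chi2_f = 12 * N / (k * (k + 1)) * (sum(R * R for R in avg_rank) - k * (k + 1) ** 2 / 4)
# Iman & Davenport: F-distributed with (k - 1) and (k - 1)(N - 1) degrees of freedom.
f_f = (N - 1) * chi2_f / (N * (k - 1) - chi2_f)

# Nemenyi: two classifiers differ if their average ranks differ by at least cd.
Q_ALPHA_K7 = 2.949
cd = Q_ALPHA_K7 * math.sqrt(k * (k + 1) / (6 * N))

for name, rank in sorted(zip(classifiers, avg_rank), key=lambda pair: pair[1]):
    print(f"{name:14s} average rank {rank:.3f}")
print(f"Friedman chi2 = {chi2_f:.2f}, Iman-Davenport F = {f_f:.2f}, Nemenyi CD = {cd:.2f}")
```

The F statistic still has to be compared against an F-distribution critical value, which is left out here to avoid a dependency on a statistics library.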
Statistical Results
10 vs 100 trees in random forest tuning in 2.5D
Across classifiers, a statistically significant difference was detected
  Tuned SVM and RF better than decision tree
  Other differences not significant
Problems
Datasets are large
2GB RAM quickly used (unfairly)
Although larger amounts of RAM can be supported, it is very expensive
Problem for larger datasets and running ensemble classifiers
HPC solutions
Split task over many nodes
Parallel
  Random Forest
  Bagging
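Bagging parallelises well because every tree is fitted to its own bootstrap sample, independently of the others, and the results are only combined at the end by a vote. The sketch below is a toy illustration under stated assumptions: one-split decision stumps stand in for full trees, a local thread pool stands in for cluster nodes, and the data are invented.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def train_stump(X, y):
    """Fit a one-split stump: choose the (feature, threshold) with fewest training errors."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [label for row, label in zip(X, y) if row[f] <= t]
            right = [label for row, label in zip(X, y) if row[f] > t]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = (sum(label != left_label for label in left)
                      + sum(label != right_label for label in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, left_label, right_label)
    return best[1:]

def predict_stump(stump, row):
    f, t, left_label, right_label = stump
    return left_label if row[f] <= t else right_label

def train_on_bootstrap(task):
    """One independent unit of work: draw a bootstrap sample and fit a stump to it."""
    X, y, seed = task
    rng = random.Random(seed)
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    return train_stump([X[i] for i in idx], [y[i] for i in idx])

def bagged_predict(stumps, row):
    """Final classification: majority vote over all fitted stumps."""
    votes = Counter(predict_stump(s, row) for s in stumps)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: one descriptor, class 1 when the value exceeds 5.
X = [[float(v)] for v in range(11)]
y = [1 if v > 5 else 0 for v in range(11)]

# Each task needs only the data and a seed, so tasks can be farmed out freely.
with ThreadPoolExecutor(max_workers=4) as pool:
    stumps = list(pool.map(train_on_bootstrap, [(X, y, seed) for seed in range(25)]))

predictions = [bagged_predict(stumps, row) for row in X]
```

A random forest follows the same pattern, with the extra step of restricting each split to a random subset of descriptors. On a cluster the thread pool would be replaced by separate jobs or processes, each returning its fitted trees for the final vote.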
Tree computation
[Figure: Final Classification]
Interpretation
QSAR needs good accuracy and interpretability
SVMs transform the data
Decision trees produce instant classification rules
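To show why a decision tree reads directly as classification rules, here is a small sketch that walks a tree and prints one if-then rule per leaf. The nested-dictionary tree layout, the descriptor names and the thresholds are all hypothetical and not taken from the study.

```python
def extract_rules(node, conditions=()):
    """Walk a binary decision tree and return one if-then rule per leaf."""
    if "label" in node:  # leaf node: emit the accumulated path as a rule
        premise = " and ".join(conditions) if conditions else "always"
        return [f"if {premise} then {node['label']}"]
    test = f"{node['feature']} <= {node['threshold']}"
    negated = f"{node['feature']} > {node['threshold']}"
    return (extract_rules(node["left"], conditions + (test,))
            + extract_rules(node["right"], conditions + (negated,)))

# Hypothetical tree; descriptor names and thresholds are invented for illustration.
tree = {
    "feature": "logP", "threshold": 3.0,
    "left": {"label": "inactive"},
    "right": {
        "feature": "n_hbond_donors", "threshold": 2,
        "left": {"label": "active"},
        "right": {"label": "inactive"},
    },
}

rules = extract_rules(tree)
for rule in rules:
    print(rule)
```

Each rule is stated in terms of the original descriptors, whereas an SVM decision function is expressed in a transformed feature space and has no comparably direct reading.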
Trees
Conclusions
SVM excellent classifier
Ensemble of trees very competitive
Universal parameters for random forest; SVM more dataset specific
Trees have interpretability advantage
Future work
  Extraction of information from ensembles
Bruce, C. L.; Melville, J. L.; Pickett, S. D.; Hirst, J. D. Contemporary QSAR Classifiers Compared. J. Chem. Inf. Model. 2007, 47, 219-227.
Acknowledgements
Jonathan Hirst
James Melville
Stephen Pickett
Chris Luscombe
Gavin Harper
Any Questions?