AtomNet: A Deep Convolutional Neural Network for Bioactivity Prediction in Structure-based Drug Discovery
Izhar Wallach, Michael Dzamba, Abraham Heifets
Nithin Kannan, CS371 Presentation
February 1, 2018
Current Approaches to Drug Discovery
• Structure-based algorithms look at the structure of target proteins to design ligands that bind
Ø Structure-based algorithms have too many false positive examples
Ø These are expensive to experimentally verify
• ML algorithms attempt to learn motifs and structural similarities between ligands to predict ligand-receptor interactions
Ø Training set not sufficient for accurate ligand-based ML algorithms
AtomNet
• Convolutional Neural Networks recognize the local structures and patterns that make successful ligands
• The neural network is able to pick up on aspects such as hydrogen bonding and aromaticity
Lee et al. 2009: A visualization of increasingly complex layers of a CNN
Wallach et al. 2015: Sulfonyl detection
AtomNet Architecture
• The input to AtomNet is a 3D input convolved over a stack of hundreds of filters
Ø The input can be thought of as an image of the 3D structure with each grid cell associated with a structure vector
• The output is a probability describing whether the ligand will bind strongly to the target protein
• The model has four convolutional layers and two fully connected hidden layers with 1024 neurons each
Ø 128 × 5^3, 256 × 3^3, 256 × 3^3, 256 × 3^3 (number of filters × filter dimension)
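A minimal sketch of how the four convolutional layers listed above change the size of the input grid and how many parameters they hold. The slide gives only the filter counts and filter sizes; the stride of 1, the absence of padding, the 20-cell grid edge, and the 8 features per grid cell are assumptions made for illustration.

```python
# Layer spec from the slide: (number of filters, cubic filter edge length).
CONV_LAYERS = [(128, 5), (256, 3), (256, 3), (256, 3)]


def conv_stack_shapes(grid_edge, layers=CONV_LAYERS):
    """Return [(channels, edge)] after each layer, assuming stride 1 and no padding."""
    shapes = []
    edge = grid_edge
    for n_filters, k in layers:
        edge = edge - k + 1  # an unpadded convolution shrinks each axis by k - 1
        shapes.append((n_filters, edge))
    return shapes


def conv_stack_params(in_channels, layers=CONV_LAYERS):
    """Count weights plus biases in the convolutional layers."""
    total, channels = 0, in_channels
    for n_filters, k in layers:
        total += n_filters * (channels * k ** 3 + 1)
        channels = n_filters
    return total


# Hypothetical input: a 20 x 20 x 20 grid with 8 features per grid cell.
shapes = conv_stack_shapes(grid_edge=20)
n_params = conv_stack_params(in_channels=8)
print(shapes)
print(n_params)
```

Under these assumptions most of the convolutional parameters sit in the last two layers, because each of their filters spans 256 input channels.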
The Directory of Useful Decoys Enhanced
(DUDE) Dataset
• The dataset contains a diverse set of active molecules (ligands) for certain target sets of proteins
Ø For every active molecule, there is a set of property-matched decoys that are inactive
• Similar active molecules are removed by clustering using scaffold similarity in order to reduce analogue bias when training/testing AtomNet
Ø Scaffold similarity: structural similarity (number of ring and chain atoms)
• Used for training and testing of AtomNet
The ChEMBL-20 PMD Dataset
• The dataset is a set of active ligands compiled by the European Molecular Biology Laboratory
• Constructed in a manner analogous to the DUDE dataset
• 30 Property Matched Decoys (PMD) associated with each active molecule
• Bemis-Murcko scaffolds were used to cluster the molecules to avoid analogue bias
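One way scaffold clustering can guard against analogue bias is to keep every scaffold cluster entirely on one side of the train/test split. The sketch below assumes the scaffold key of each molecule has already been computed (for example a Bemis-Murcko scaffold string); the molecule IDs, scaffold labels, and the smallest-clusters-first heuristic are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def split_by_scaffold(molecules, test_fraction=0.2):
    """Assign whole scaffold clusters to train or test.

    `molecules` is a list of (molecule_id, scaffold_key) pairs.
    """
    clusters = defaultdict(list)
    for mol_id, scaffold in molecules:
        clusters[scaffold].append(mol_id)

    # Fill the test set with the smallest clusters first (a heuristic choice).
    ordered = sorted(clusters.values(), key=lambda c: (len(c), c))
    target = test_fraction * len(molecules)
    train, test = [], []
    for cluster in ordered:
        if len(test) + len(cluster) <= target:
            test.extend(cluster)
        else:
            train.extend(cluster)
    return train, test


# Hypothetical molecule IDs and scaffold labels.
mols = [("m1", "benzene"), ("m2", "benzene"), ("m3", "indole"),
        ("m4", "pyridine"), ("m5", "benzene")]
train, test = split_by_scaffold(mols, test_fraction=0.4)
print(train, test)
```

Because no scaffold appears in both lists, a model cannot score well on the test set simply by recognising close analogues of training molecules.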
Experimentally Verified Inactives
• The problem with PMD datasets is that they require decoys to have a sufficiently different 2D fingerprint
• This allows the construction of many decoys without expensive experimental validation
• However, this in itself creates a bias in what our model is learning
• Therefore, these experimentally verified inactives served to provide a more representative dataset and force our model to properly classify activity on adversarial examples
AUC metric
• This curve looks at the true positive rate and false positive rate at different binding thresholds
• Ideally, we would like to have our model only predict true positives, giving us an AUC of 1
• A similarly important metric is the logAUC, which places more emphasis on regions of the ROC curve with lower false positive rates
UNMC 2010: Varying AUC curves
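A minimal sketch of the AUC computation, using the identity that the area under the ROC curve equals the probability that a randomly chosen active is scored above a randomly chosen inactive (ties count as one half). The labels and scores are made-up illustration values, and logAUC is not covered here.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of actives and inactives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0, 0]                                # 1 = active, 0 = inactive
perfect = roc_auc(labels, [0.9, 0.8, 0.3, 0.2, 0.1])    # every active outranks every inactive
mixed = roc_auc(labels, [0.9, 0.2, 0.3, 0.1, 0.1])      # one active ranked below an inactive
print(perfect, mixed)
```

A model that ranks at random would land near 0.5 on this scale, which is why AUC values are usually read relative to 0.5 rather than 0.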
Results of the AtomNet Model
• The receiver operating characteristic is a graph of the true positive rate vs the false positive rate.
• Discovering true positives with as few false positives as possible streamlines the drug discovery process
Wallach et al. 2015: AtomNet’s performance on various datasets
Strengths
• The convolutional layers pick up on local chemical structures, using interactions between these to notice increasingly complex relations in our model
• The model is robust, working on a variety of different proteins
• Does well compared to other structure-based models on the DUDE dataset
Limitations
• No attempt made to alter hyperparameters of the model
• The code is proprietary, meaning others can’t improve the model
• Did not try different input representations
• The model doesn’t use physics
Learning Deep Architectures for Interaction Prediction in Structure-based Virtual Screening
Critique by Daniel Hsu
Virtual Screening
• Virtual screening - computational drug discovery
• Identifies candidates for ligands to bind proteins
• Structure-based: Use binding capacity and structure
Difficulties
• Complexity of chemical space: 10^60 [1]
• Commercially available compounds: 10^7 [2]
• High false positive rate of identified ligands [3]
• Limited datasets for structure-based virtual screening
[1] R. S. Bohacek, C. McMartin, and W. C. Guida. The art and practice of structure-based drug design: A molecular modeling perspective. Medicinal Research Reviews, 16(1):3–50, 1996.
[2] J. J. Irwin, T. Sterling, M. M. Mysinger, E. S. Bolstad, and R. G. Coleman. ZINC: A free tool to discover chemistry for biology. Journal of Chemical Information and Modeling, 52(7):1757–1768, 2012.
[3] N. Deng, S. Forli, P. He, A. Perryman, L. Wickstrom, R. S. Vijayan, T. Tiefenbrunn, D. Stout, E. Gallicchio, A. J. Olson, and R. M. Levy. Distinguishing binders from false positives by free energy calculations: Fragment screening against the flap site of HIV protease. Journal of Physical Chemistry B, 119:976–988, 2015.
Approach
• Use deep learning for structure-based virtual screening
• Predict binding capacity of protein-molecule pair
• Propose new benchmark dataset more suitable for structure-based virtual screening
Model Pipeline
• Process protein and small molecule separately into two fixed-size descriptions (fingerprint vectors)
• Use neural nets to further transform fingerprints
Ligand Representation: Fingerprints
• Convert molecule into binary vector
• 1D representation of a molecule
ECFP Fingerprints
• Hashing - hash concatenated features of neighborhood of atom
• Indexing - set the bit at the hashed index of the feature vector to 1
• Sensitive to small perturbations in molecular structure
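The hashing and indexing steps can be illustrated with a toy circular fingerprint. This is a simplified stand-in and not the actual ECFP algorithm: atom identifiers are plain strings built from element symbols, the neighbourhood is grown by string concatenation, and SHA-256 is used only to get a deterministic hash. The function names and the two example molecules are hypothetical.

```python
import hashlib


def stable_hash(text, n_bits):
    """Deterministic hash of a string into the range [0, n_bits)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_bits


def toy_circular_fingerprint(atoms, bonds, radius=2, n_bits=1024):
    """Toy ECFP-like fingerprint.

    atoms: list of element symbols; bonds: list of (i, j) atom index pairs.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    bits = [0] * n_bits
    ids = list(atoms)  # radius 0: each identifier is just the element symbol
    for _ in range(radius + 1):
        for ident in ids:
            bits[stable_hash(ident, n_bits)] = 1  # indexing step
        # hashing input for the next radius: own id plus sorted neighbor ids
        ids = [ids[i] + "(" + ",".join(sorted(ids[j] for j in neighbors[i])) + ")"
               for i in range(len(atoms))]
    return bits


# Heavy-atom skeletons of two small molecules that differ in one atom.
fp_ethanol = toy_circular_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
fp_ethylamine = toy_circular_fingerprint(["C", "C", "N"], [(0, 1), (1, 2)])
print(sum(fp_ethanol), sum(fp_ethylamine))
```

Changing a single atom changes every neighbourhood string that contains it, so several bits move at once; this is the sensitivity to small perturbations noted in the last bullet.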
ECFP Fingerprints
https://chemaxon.com/products/screen-suite
Atom Convolution Fingerprints
• Initialize the vector representation of each atom with its element, connectivity, number of hydrogen bonds, etc.
• Update vector with convolution of weight matrix on neighbor atoms
• Apply non-linearity to updated vector
[Figure: atom convolution diagram with nodes labeled N1, N2, N3, and H]
Atom Convolution Fingerprints
• Obtain fingerprint of ligand by convolving another matrix with the final atom vectors to obtain combined sum
• Traditional ECFP fingerprint is binary vector, so approximate this with a softmax operation
• Softmax also makes this operation differentiable
[Figure: fingerprint readout diagram with labels A1, A2, A3, V, Softmax, and F]
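A minimal pure-Python sketch of the atom convolution fingerprint described on these two slides: atom vectors are updated from their neighbors through a weight matrix and a non-linearity, and the final atom vectors are read out through a softmax and summed. The tanh non-linearity, the choice to add each atom's own vector to its neighbors' vectors, the number of layers, and all weights and feature values are assumptions for illustration.

```python
import math


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]


def atom_conv_fingerprint(atom_vecs, neighbors, W_update, W_out, n_layers=2):
    """Update atoms from their neighbors, then sum softmax readouts of the final atoms."""
    for _ in range(n_layers):
        updated = []
        for i, a in enumerate(atom_vecs):
            pooled = list(a)  # start from the atom's own vector
            for j in neighbors[i]:
                pooled = [p + q for p, q in zip(pooled, atom_vecs[j])]
            # weight matrix followed by a non-linearity
            updated.append([math.tanh(x) for x in matvec(W_update, pooled)])
        atom_vecs = updated
    fp = [0.0] * len(W_out)
    for a in atom_vecs:
        # soft version of "set one bit": each atom adds a softmax vector
        fp = [f + s for f, s in zip(fp, softmax(matvec(W_out, a)))]
    return fp


# Hypothetical 3-atom chain with 2 features per atom and a 4-entry fingerprint.
atom_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = {0: [1], 1: [0, 2], 2: [1]}
W_update = [[0.5, -0.2], [0.1, 0.3]]
W_out = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5], [0.3, 0.3]]
fp = atom_conv_fingerprint(atom_vecs, neighbors, W_update, W_out)
print(fp)
```

Each atom contributes a vector that sums to 1, so the fingerprint entries add up to the number of atoms, mirroring how each substructure sets one bit in a binary ECFP vector.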
DUD-E experiment
• Used model to classify active ligands vs decoys using ECFP only: 0.904 AUC
• Perhaps dataset mostly contains information on differences between ligands and decoys, instead of interactions between ligand and proteins
• Suspicious because model makes conclusions on binding of ligands and proteins without using protein information
PDBBind + DUD-E
• Use PDBBind for training and DUD-E for testing
• No decoys, negative examples from random sampling
• Fingerprint with network achieved 0.714 AUC, better than fingerprint with ECFP and other algorithms.
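The slide says only that negative examples come from random sampling. One plausible reading, sketched below, pairs each protein with randomly chosen ligands that are not among its known binders. The function name, the seed, and the protein and ligand identifiers are hypothetical, and the paper's actual sampling procedure may differ.

```python
import random


def sample_negatives(known_pairs, n_per_positive=1, seed=0):
    """Pair each protein with random ligands that are not its known binders."""
    rng = random.Random(seed)
    positives = set(known_pairs)
    ligands = sorted({lig for _, lig in known_pairs})
    negatives = []
    for protein, _ in known_pairs:
        candidates = [l for l in ligands if (protein, l) not in positives]
        k = min(n_per_positive, len(candidates))
        negatives.extend((protein, l) for l in rng.sample(candidates, k))
    return negatives


# Hypothetical protein-ligand complexes standing in for PDBBind entries.
known = [("P1", "L1"), ("P2", "L2"), ("P3", "L3")]
negatives = sample_negatives(known)
print(negatives)
```

Negatives built this way are assumed non-binders rather than measured ones, so some label noise is expected.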
Conclusion
• Deep learning architecture for predicting binding potential
• Form fingerprints using atom convolution instead of ECFP
• Propose new benchmark with PDBBind and DUD-E
• Criticism: Deep learning may not be “cheating” if it can make predictions on binding potential using only ligand information
Machine learning for structure-based virtual screening
What are Neural Networks?
$h_j = \sigma\left(\sum_{i} w_{ij} x_i + b_j\right)$
[Figure: network diagram with Input Layer, Hidden Layer, and Output Layer]
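The hidden-unit formula on this slide, read as h_j = σ(Σ_i w_ij x_i + b_j), can be sketched in a few lines. The sigmoid activation and the example weights are assumptions; the slide does not say which activation function is used.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hidden_layer(x, W, b, activation=sigmoid):
    """Compute h_j = activation(sum_i W[i][j] * x[i] + b[j]) for every hidden unit j."""
    return [activation(sum(W[i][j] * x[i] for i in range(len(x))) + b[j])
            for j in range(len(b))]


# Two inputs feeding three hidden units; all values are illustrative.
x = [1.0, 2.0]
W = [[0.5, -1.0, 0.0],
     [0.25, 0.5, 0.0]]
b = [0.0, 0.0, 0.0]
h = hidden_layer(x, W, b)
print(h)
```

Stacking several such layers, each taking the previous layer's output as its input, gives the input, hidden, and output structure of the network.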
Convolutional Neural Networks
Local connectivity in 2D space. Similar to human vision.
Krizhevsky, A. et al. Proc. Advances in Neural Information Processing Systems 25 1090–1098 (2012)
Convolutional Neural Networks
Each filter shares the same weights and biases at every position, so it identifies a certain pattern regardless of where the pattern appears in the image.
Krizhevsky, A. et al. Proc. Advances in Neural Information Processing Systems 25 1090–1098 (2012)
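A small sketch of weight sharing: one kernel is slid over the whole image, so the same pattern produces the same peak response wherever it sits. As in most deep learning code, the kernel is applied without flipping (strictly a cross-correlation). The 2×2 pattern, the 5×5 images, and the helper names are illustrative.

```python
def conv2d_valid(image, kernel):
    """Slide one shared kernel over a 2D image (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(kernel[a][b] * image[r + a][c + b]
                 for a in range(kh) for b in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def argmax2d(grid):
    """Return the (row, col) of the largest entry."""
    cells = ((r, c) for r in range(len(grid)) for c in range(len(grid[0])))
    return max(cells, key=lambda rc: grid[rc[0]][rc[1]])


def blank(n):
    return [[0] * n for _ in range(n)]


# The same 2x2 block of ones placed at two different positions.
kernel = [[1, 1], [1, 1]]
img_a, img_b = blank(5), blank(5)
for dr in range(2):
    for dc in range(2):
        img_a[0 + dr][0 + dc] = 1
        img_b[2 + dr][3 + dc] = 1

peak_a = argmax2d(conv2d_valid(img_a, kernel))
peak_b = argmax2d(conv2d_valid(img_b, kernel))
print(peak_a, peak_b)
```

The peak of the feature map moves with the pattern while the kernel weights stay fixed, which is what lets a single filter detect a motif anywhere in the input.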
Convolutional Neural Networks
Each feature map learns to identify a certain pattern in the image
Krizhevsky, A. et al. Proc. Advances in Neural Information Processing Systems 25 1090–1098 (2012)
Convolutional Neural Networks
Krizhevsky, A. et al. Proc. Advances in Neural Information Processing Systems 25 1090–1098 (2012)
Convolutional Neural Networks
3D mesh of a molecule where each point in the grid contains information about the atom types.
Protein-Ligand Scoring with Convolutional Neural Networks
Ragoza M. et al. J. Chem. Inf. Model. 57 942−957 (2017)
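A minimal sketch of turning atom coordinates into a 3D grid with one channel per atom type. It stores simple per-cell counts in a sparse dictionary, which is a simplification; the box size, resolution, atom type names, and coordinates are hypothetical values, not the settings used in the paper.

```python
def voxelize(atoms, atom_types, center, box_edge=24.0, resolution=1.0):
    """Build a sparse grid {(channel, ix, iy, iz): count} and its dense shape.

    atoms: list of (type_name, x, y, z) in angstroms; one channel per atom type.
    """
    n = int(box_edge / resolution)
    origin = [c - box_edge / 2.0 for c in center]
    grid = {}
    for type_name, x, y, z in atoms:
        if type_name not in atom_types:
            continue  # skip atom types without a channel
        idx = [int((v - o) // resolution) for v, o in zip((x, y, z), origin)]
        if all(0 <= i < n for i in idx):  # drop atoms outside the box
            key = (atom_types.index(type_name), *idx)
            grid[key] = grid.get(key, 0) + 1
    return grid, (len(atom_types), n, n, n)


# Hypothetical atoms: two inside a 4 angstrom box, one far outside it.
atoms = [("C", 0.2, 0.2, 0.2), ("O", 1.5, 0.0, -0.5), ("C", 100.0, 0.0, 0.0)]
grid, shape = voxelize(atoms, ["C", "O"], center=(0.0, 0.0, 0.0),
                       box_edge=4.0, resolution=1.0)
print(grid, shape)
```

The resulting array plays the role of a multi-channel image, with atom types as channels in place of colours.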
Improvements from AtomNet
• More detailed methodology.
• Includes descriptors which better describe the binding site, including aromaticity, protonation state, etc.
• Improved visualization method enabling mutation analysis.
• Stringent model evaluation to check for overfitting.
Generating a training set
DUD-E is a dataset with 102 proteins, more than 20,000 active ligands, and over 1 million decoy molecules. Co-crystal structures are not available for every ligand, but a reference complex is given for each target.
The ligand is removed from the reference receptor and a box is drawn with 8 Å of padding around the ligand's position.
Ligands and decoys are docked using smina and the Autodock Vina scoring function.
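The docking box can be sketched as an axis-aligned bounding box around the reference ligand's atoms, extended by 8 Å on every side. Reading "8 Å around the ligand" as padding on each face of the box is an assumption, and the coordinates below are made up.

```python
def padded_box(ligand_coords, padding=8.0):
    """Axis-aligned box around the ligand atoms, extended by `padding` angstroms per side."""
    mins = [min(c[k] for c in ligand_coords) - padding for k in range(3)]
    maxs = [max(c[k] for c in ligand_coords) + padding for k in range(3)]
    center = [(lo + hi) / 2.0 for lo, hi in zip(mins, maxs)]
    size = [hi - lo for lo, hi in zip(mins, maxs)]
    return center, size


# Hypothetical ligand atom coordinates in angstroms.
coords = [(0.0, 0.0, 0.0), (2.0, 1.0, -1.0), (4.0, 0.5, 0.0)]
center, size = padded_box(coords)
print(center, size)
```

A box center and per-axis size of this kind is the usual way a search region is handed to a docking program.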
About Autodock Vina and smina
• Given a bounded box, smina will run a docking algorithm to generate protein-ligand conformations.
• Each structure is then given a score, where the score is constructed from pairwise interactions including sterics, hydrogen bonding etc. These parameters are optimized using the PDBbind dataset.
• Assumptions
Ø Receptor is rigid.
Ø Protonation state of atoms remains unchanged.
Trott et al. J Comput Chem. 31 455-461 (2010)
Results
Ragoza M. et al. J. Chem. Inf. Model. 57 942−957 (2017)
Outperforms Vina on 90% of the targets.
Independent Test Sets
• How much is our model overfitting our data?
Ragoza M. et al. J. Chem. Inf. Model. 57 942−957 (2017)
Visualization
Ragoza M. et al. J. Chem. Inf. Model. 57 942−957 (2017)
Limitations
• Ligand screening takes Autodock Vina prediction of ligand conformation as the “truth”.
• Assumption that receptors only have one binding site.
• Co-crystal structures are not always readily available.
• Entropy and enthalpy not fully encapsulated within the descriptors.
• Feature maps are potentially interpretable.
• Paper also carried out pose prediction, but the performance was around the same as Vina.