topological analysis and prediction of biomolecular datatopological analysis and prediction of...

1
Topological Analysis and Prediction of Biomolecular Data Zixuan Cang 1 , Lin Mu 3 , Kedi Wu 1 , Kristopher Opron 2 , Kelin Xia 1 , and Guo-wei Wei 1,2 1 Department of Mathematics, Michigan State University, MI 48824, USA 2 Department of Biochemistry and Molecular Biology, Michigan State University, MI 48824, USA 3 Oak Ridge National Laboratory, TN 37831, USA Introduction Protein function and dynamics are closely related to its sequence and structure. However, the prediction of protein function and dynamics from its sequence and structure is still a fundamental challenge in molecular biology. Prediction of protein related observables provides advices for experiments and sheds light on how protein functions. Persistent homology is a new branch of algebraic topology that has found its success in the topological data analysis in a variety of disciplines, including molecular biophysics. Filtration of simplicial complex Filtration of Vietoris-Rips complex built on α-carbon point cloud of protein (ID:2LJC) We explore the potential of using persistent homology as an independent tool for protein structure classification and protein-ligand/drug binding anity prediction. From persistent homology computations, we extract protein topological fingerprints which are topological invariants generated during a filtration process. We develop topological machine learning in which feature vectors are generated solely based on the output of topological fingerprints. Persistent Homology In the past decade, persistent homology has been developed as a new multiscale representation of topological features. Simplex A k-simplex denoted by σ k is a convex hull of k + 1 vertices which is represented by a set of points σ k = {λ 0 u 0 + λ 1 u 1 + ... + λ k u k | X λ i = 1, λ i > 0, i = 0, 1, ..., k}, where {u 0 , u 1 , ..., u k } R n is a set of anely independent points. Simplicial complex A simplicial complex K is a finite collection of simplices satisfying two conditions. First, faces of a simplex in K are also in K; Secondly, intersection of any two simplices in K is a face of both the simplices. The highest dimension of simplices in K determines dimension of K. Homology For a simplicial complex K,a k-chain is a formal sum of the form N i=1 c i [σ k i ], where [σ k i ] is oriented k-simplex from K. A boundary operator k over a k-simplex σ k is defined as, k σ k = k X i=0 (-1) i [u 0 , u 1 , ..., b u i , ..., u k ], where [u 0 , u 1 , ..., b u i , ..., u k ] denotes the face obtained by deleting the ith vertex in the simplex. The boundary operator induces a boundary homomorphism k : C k (K) C k-1 (K). The composition operator k-1 k is a zero map, k-1 k (σ k )= X j<i (-1) i (-1) j [u 0 , ..., b u i , ... b u j , ...u k ]+ X j>i (-1) i (-1) j-1 [u 0 , ..., b u j , ... b u i , ...u k ] = 0 A sequence of chain groups connected by boundary operation form a chain complex, ··· -→ C n (K) n -→ C n-1 (K) n-1 -→ ··· 1 -→ C 0 (K) 0 -→ 0. The equation k k+1 = 0 is equivalent to the inclusion Imk+1 Ker k , where Im and Ker denote image and kernel. Elements of Kerk are called kth cycle group, and denoted as Z k =Kerk . Elements of Imk+1 are called kth boundary group, and denoted as B k =Imk+1 .A kth homology group is defined as the quotient group of Z k and B k . H k = Z k /B k . The kth Betti number of simplicial complex K is the rank of H k , β k = rank(H k )= rank(Z k )- rank(B k ). Betti number β k is finite number, since rank(B p ) 6 rank(Z p ) < . Betti numbers computed from a homology group are used to describe the corresponding space. Feature construction for structure classification tasks A number of features describing properties of the samples in dierent scales are extracted from persistent homology bar codes. Illustrated below are some examples of features used in machine learning. I The length of the second longest Betti 0 bar. I The summation of lengths of all Betti 0 bars except for those exceed the max filtration value. I The average length of Betti 0 bars except for those exceed the max filtration value. I The onset value of the longest Betti 1 bar. I The number of Betti 1 bars that locate at [4.5, 5.5], divided by the number of atoms. I The onset value of the first Betti 2 bar that ends after a given number. Bar codes in dierent dimension with dierent lifespan, birth time, and death time show characteristics of the sample from dierent scales. With persistent homology, we are able to describe local properties like alpha helices or beta sheets and global properties like size of the cavity of a spherical structure or size of tunnel of a cylindrical structure. Performance on Classification T asks Performance on Binding Affinity Prediction The persistent homology based protein-ligand/drug binding anity predictor named T-Score is tested on the PDBBind v2007 core set with the v2007 refined set as a training set where the testing set has been excluded from the training set. A high Pearson correlation of 0.80 is achieved and our T-Score outperforms all the other eminent methods in computational biophysics. References I Zixuan Cang, Lin Mu, Kedi Wu, Kristopher Opron, Kelin Xia and Guo-Wei Wei, “A topological approach to protein classification”, Molecular Based Mathematical Biology, 3, 140-62 (2015). I Kelin Xia and Guo-Wei Wei, “Persistent homology analysis of protein structure, flexibility and folding”, International Journal for Numerical Methods in Biomedical Engineering, 30(8):814-844 (2014). I Kelin Xia, Zhixiong Zhao and Guo-Wei Wei,“Multiresolution persistent homology for excessively large biomolecular datasets”, Journal of Chemical Physics, 143, 134103 (2015). Protein-Ligand/Drug Binding Free Energy Structure-based drug design relies on computational methods to identify and optimize potential drugs. Molecule docking is the most widely used approach which predicts the location and orientation (pose) of a ligand bound to a protein to form a stable complex. In the process of search for a pose for the ligand, a scoring function which measures the binding anity between the two molecules is needed to distinguish the favorable poses from the unfavorable ones. An accurate and ecient protein-ligand/drug binding anity predictor is therefore the key of molecule docking process. As the dominating forces that regulates protein-ligand binding are mainly weak forces which heavily depends on spacial arrangements, persistent homology becomes a competitive candidate for this job. Flowchart of Persistent Homology Based Protein-Ligand Binding Affinity Prediction Conclusion We test the performance of persistent homology in various protein structure classification tasks as well as protein-ligand binding/drug anity prediction. It is found that persistent homology is able to oer a power representation of proteins and capture their intrinsic interactions. Our persistent homology based T-Score outperforms all the other eminent methods in computational biophysics on the blind prediction of protein-ligand/drug binding anities. Acknowledgment This work was supported in part by NSF grants IIS-1302285, and DMS-1160352, and NIH Grant R01GM-090208. http://www.math.msu.edu/˜wei/ [email protected] [email protected] [email protected] [email protected] [email protected] [email protected]

Upload: others

Post on 04-Aug-2020

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Topological Analysis and Prediction of Biomolecular DataTopological Analysis and Prediction of Biomolecular Data Zixuan Cang1, Lin Mu3, Kedi Wu1, Kristopher Opron2, Kelin Xia1, and

Topological Analysis and Prediction of Biomolecular Data

Zixuan Cang1, Lin Mu3, Kedi Wu1, Kristopher Opron2, Kelin Xia1, and Guo-wei Wei1,2

1Department of Mathematics, Michigan State University, MI 48824, USA2Department of Biochemistry and Molecular Biology, Michigan State University, MI 48824, USA

3Oak Ridge National Laboratory, TN 37831, USA

IntroductionProtein function and dynamics are closely related to its sequence and structure. However, theprediction of protein function and dynamics from its sequence and structure is still a fundamentalchallenge in molecular biology. Prediction of protein related observables provides advices forexperiments and sheds light on how protein functions. Persistent homology is a new branch ofalgebraic topology that has found its success in the topological data analysis in a variety ofdisciplines, including molecular biophysics.

Filtrationofsimplicialcomplex

Filtration ofVietoris-Ripscomplex builton α-carbonpoint cloud ofprotein(ID:2LJC)

We explore the potential of using persistent homology as an independent tool for protein structureclassification and protein-ligand/drug binding affinity prediction. From persistent homologycomputations, we extract protein topological fingerprints which are topological invariantsgenerated during a filtration process. We develop topological machine learning in which featurevectors are generated solely based on the output of topological fingerprints.

Persistent HomologyIn the past decade, persistent homology has been developed as a new multiscale representation oftopological features.Simplex A k-simplex denoted by σk is a convex hull of k + 1 vertices which is represented by a setof points

σk = {λ0u0 + λ1u1 + ... + λkuk|∑

λi = 1, λi > 0, i = 0, 1, ..., k},

where {u0, u1, ..., uk} ⊂ Rn is a set of affinely independent points.Simplicial complex A simplicial complex K is a finite collection of simplices satisfying twoconditions. First, faces of a simplex in K are also in K; Secondly, intersection of any two simplicesin K is a face of both the simplices. The highest dimension of simplices in K determines dimensionof K.Homology For a simplicial complex K, a k-chain is a formal sum of the form

∑Ni=1 ci[σ

ki ], where [σk

i ]

is oriented k-simplex from K. A boundary operator ∂k over a k-simplex σk is defined as,

∂kσk =

k∑i=0

(−1)i[u0, u1, ..., ui, ..., uk],

where [u0, u1, ..., ui, ..., uk] denotes the face obtained by deleting the ith vertex in the simplex. Theboundary operator induces a boundary homomorphism ∂k : Ck(K)→ Ck−1(K). The compositionoperator ∂k−1 ◦ ∂k is a zero map,

∂k−1∂k(σk) =

∑j<i

(−1)i(−1)j[u0, ..., ui, ...uj, ...uk] +∑j>i

(−1)i(−1)j−1[u0, ..., uj, ...ui, ...uk]

= 0

A sequence of chain groups connected by boundary operation form a chain complex,

· · · −→ Cn(K)∂n−→ Cn−1(K)

∂n−1−→ · · ·∂1−→ C0(K)

∂0−→ 0.The equation ∂k ◦ ∂k+1 = 0 is equivalent to the inclusion Im∂k+1 ⊂ Ker ∂k, where Im and Kerdenote image and kernel. Elements of Ker∂k are called kth cycle group, and denoted as Zk=Ker∂k.Elements of Im∂k+1 are called kth boundary group, and denoted as Bk=Im∂k+1. A kth homologygroup is defined as the quotient group of Zk and Bk.

Hk = Zk/Bk.

The kth Betti number of simplicial complex K is the rank of Hk,

βk = rank(Hk) = rank(Zk) − rank(Bk).

Betti number βk is finite number, since rank(Bp) 6 rank(Zp) <∞. Betti numbers computed from ahomology group are used to describe the corresponding space.

Feature construction for structure classification tasksA number of features describing properties of the samples in different scales are extracted frompersistent homology bar codes. Illustrated below are some examples of features used in machinelearning.I The length of the second longest Betti 0 bar.I The summation of lengths of all Betti 0 bars except for those exceed the max filtration value.I The average length of Betti 0 bars except for those exceed the max filtration value.I The onset value of the longest Betti 1 bar.I The number of Betti 1 bars that locate at [4.5, 5.5], divided by the number of atoms.I The onset value of the first Betti 2 bar that ends after a given number.Bar codes in different dimension with different lifespan, birth time, and death time showcharacteristics of the sample from different scales. With persistent homology, we are able todescribe local properties like alpha helices or beta sheets and global properties like size of thecavity of a spherical structure or size of tunnel of a cylindrical structure.

Performance on Classification Tasks

Performance on Binding Affinity Prediction

The persistent homologybased protein-ligand/drugbinding affinity predictornamed T-Score is tested onthe PDBBind v2007 core setwith the v2007 refined set as atraining set where the testingset has been excluded fromthe training set. A highPearson correlation of 0.80 isachieved and our T-Scoreoutperforms all the othereminent methods incomputational biophysics.

References

I Zixuan Cang, Lin Mu, Kedi Wu, Kristopher Opron, Kelin Xia and Guo-Wei Wei, “A topologicalapproach to protein classification”, Molecular Based Mathematical Biology, 3, 140-62 (2015).

I Kelin Xia and Guo-Wei Wei, “Persistent homology analysis of protein structure, flexibility andfolding”, International Journal for Numerical Methods in Biomedical Engineering, 30(8):814-844(2014).

I Kelin Xia, Zhixiong Zhao and Guo-Wei Wei,“Multiresolution persistent homology for excessivelylarge biomolecular datasets”, Journal of Chemical Physics, 143, 134103 (2015).

Protein-Ligand/Drug Binding Free EnergyStructure-based drug design relies oncomputational methods to identify andoptimize potential drugs. Moleculedocking is the most widely usedapproach which predicts the locationand orientation (pose) of a ligandbound to a protein to form a stablecomplex. In the process of search for apose for the ligand, a scoring functionwhich measures the binding affinitybetween the two molecules is needed todistinguish the favorable poses fromthe unfavorable ones. An accurate andefficient protein-ligand/drug bindingaffinity predictor is therefore the key ofmolecule docking process. As thedominating forces that regulatesprotein-ligand binding are mainlyweak forces which heavily depends onspacial arrangements, persistenthomology becomes a competitivecandidate for this job.

Flowchart of Persistent Homology Based Protein-LigandBinding Affinity Prediction

ConclusionWe test the performance of persistent homology in various protein structure classification tasks aswell as protein-ligand binding/drug affinity prediction. It is found that persistent homology is ableto offer a power representation of proteins and capture their intrinsic interactions. Our persistenthomology based T-Score outperforms all the other eminent methods in computational biophysicson the blind prediction of protein-ligand/drug binding affinities.

Acknowledgment

This work was supported in part by NSF grants IIS-1302285, andDMS-1160352, and NIH Grant R01GM-090208.

http://www.math.msu.edu/˜wei/ [email protected] [email protected] [email protected] [email protected] [email protected] [email protected]