phycmap: predicting protein contact map using evolutionary and physical constraints by integer...

36
PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological Institute at Chicago Web server at http:// raptorx.uchicago.edu See http :// arxiv.org/abs/1308.1975 for an extended version

Upload: macey-hollister

Post on 14-Jan-2016

230 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological

PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming

Zhiyong Wang and Jinbo XuToyota Technological Institute at Chicago

Web server at http://raptorx.uchicago.eduSee http://arxiv.org/abs/1308.1975 for an extended version

Page 2: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological

Problem DefinitionContact : Distance between two Cα or Cβ atoms < 8Å

short range: 6-12 AAs apartmedium range: 12-24 AAs long range: >24 AAs apart

1J8B

Page 3: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological

Existing WorkResidue co-evolution method: mutual information (MI), PSICOV, Evfold Needs a large number of homologous sequences PSICOV and Evfold better than MI since they differentiate direct and indirect residue

couplings (Residues A and C indirect coupling if it is due to direct A-B and B-C couplings)

PSICOV and Evfold also enforce sparsity

Supervised learning method: NNcon, SVMcon, CMAPpro Mutual information, sequence profile and others Predicts contacts one by one, ignoring their correlation Do not differentiate direct and indirect residue couplings

First-principle method: Astro-Fold No evolutionary information Minimize contact potential Enforce physical feasibility including sparsity

Page 4: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological

Our Method: PhyCMAP1. Focus on proteins with few sequence homologs proteins with many sequence homologs

very likely have similar templates in PDB

2. Integrate by machine learning seq profile, residue co-evolution and

non-evolutionary info (implicitly) differentiate direct and

indirect residue couplings through feature engineering

3. Enforce physical constraints, which imply sparsity

Page 5: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological

Info used by Random Forests• Evolution info from a single protein family– sequence profile – co-evolution: 2 types of mutual information (MI)

• Non-evolution info from the whole structure space: residue contact potential

• Mixed info from the above 2 sources– homologous pairwise contact score– EPAD: context-specific evolutionary-based distance-

dependent statistical potential• amino acid physic-chemical properties

Page 6: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological

Mutual Information

1. Contrastive Mutual Information (CMI): remove local background by measuring the MI difference of one pair with its neighbors.

2. Chaining effect of residue couplings: MI, MI2, MI3, MI4, equivalent to (1-MI), (1-MI)2, (1-MI)3, (1-MI)4 (see http://arxiv.org/abs/1308.1975 for more details)

Page 7: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological

CMI Example: 1J8B• Upper triangle: mutual information• Lower triangle: contrastive mutual information• Blue boxes: native contacts

Page 8: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological

Homologous Pairwise Contact Score

Probability of a residue pair forming a contact between 2 secondary structures.

PSbeta (a, b): prob of two AAs a and b forming a beta contactPShelix (a, b): prob of two AAs a and b forming a helix contactH: the set of sequence homologs in a multiple seq alignment

𝑃𝑆 (𝑖 , 𝑗 )= 1|𝐻|

(∑h∈𝐻

𝑃𝑆𝑏𝑒𝑡𝑎 (h 𝑖 , h 𝑗 )𝑜𝑟 ∑h∈𝐻

𝑃𝑆h𝑒𝑙𝑖𝑥 (h𝑖 , h 𝑗))

Page 9: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological

Training Random Forests• Training dataset– Chosen before CASP10 started– 900 non-redundant protein structures– <25% sequence identity– All contacts and 20% of non-contacts

• Model parameters– Number of features: 300– Number of trees: 500– 5 fold cross validation

Page 10: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological

Select Physically Feasible Contacts by Integer Linear Programming

Maximize accumulative contact probability while minimize violation of physical constraints

Xi,j Indicate one contact between two residues i and j

Rr a relaxation variable of the rth soft constraint

g(R) penalty for violation of physical constraints

6

,,,

)(maxij

jijiRX

RgPX

Page 11: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological

Soft Constraints 1

# contacts between two secondary structure segments is limited

2)(:2,11,

,1)(,

sjSStypejssji bRX

siSStypei

s1,s2 95% MaxH,H 5 12H,E 3 10H,C 4 11E,H 4 12E,E 9 13E,C 6 15C,H 3 12C,E 5 12C,C 6 20

Page 12: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological

Soft Constraints 2Upper and lower bounds for #contacts between two beta strands

))(),(min(3 ,

)(),(2,

vLenuLenS

RX

vu

uSSegjvSSegiji

3

)(),(,

))(),(max(3.3 RvLenuLen

XuSSegjvSSegiji

Page 13: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological

Soft Constraints 3

Statistics shows that only 3.4% of loop segments that have a contact between the start and end residues.

Page 14: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological

Hard Constraints 1

• For parallel contacts between two β strands, the contacts of neighboring residue pairs should satisfy the following constraints

• For anti-parallel contacts

11,11,1, jijiji XXX

11,11,1, jijiji XXX

Page 15: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological

Hard Constraints 2

1) One residue cannot form contacts with both j and j+2 when j and j+2 are in the same alpha helix

12,, jiji XX

2) One beta-strand can form beta-sheets with up to 2 other beta-strands.

Page 16: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological

Test Datasets• CASP10: 123 proteins– 36 are “hard”, i.e., no similar templates in PDB – low sequence identity (<25%) among them– low seq id with the training data, which were chosen

before CASP10 started

• Set600: 601 proteins– share <25% seq ID with the training proteins – each has ≥50 AAs and an X-ray structure with resolution

<1.9Å– each has ≥5 AAs with predicted secondary structure

being alpha-helix or beta-strand

Page 17: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological

Accuracy w.r.t. #sequence homologs1. Meff: #non-redundant sequence homologs of a protein

2. Divide the CASP10 targets into groups by Meff

3. Top L/10 predicted medium- and long-range contacts

logMeff

accuracy

Page 18: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological

Results on CASP10 – Medium RangeOverall accuracy on top L/5 predicted Cβ contacts: PhyCMAP 0.465, CMAPpro 0.370, PSICOV 0.316

CMAPpro PSICOV

PhyC

MAP

PhyC

MAP

Page 19: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological

Results on CASP10 – Long RangeOverall accuracy on top L/5 predicted Cβ contacts: PhyCMAP: 0.373, CMAPpro: 0.313, PSICOV: 0.315

CMAPpro PSICOV

PhyC

MAP

PhyC

MAP

Page 20: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological

Results on 36 hard CASP10 targetsaccuracy on top L/5 medium and long-range Cβ contacts:

PhyCMAP: 0.363, CMAPpro: 0.308, PSICOV: 0.180

CMAPpro PSICOV

PhyC

MAP

PhyC

MAP

Page 21: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological

CMAPproPSICOV

PhyC

MAP

PhyC

MAP

Results on Set600 with few homologs (Meff ≤ 100)

top L/5 predicted medium and long Cβ contacts: PhyCMAP: 0.345, CMAPpro: 0.287, PSICOV: 0.059

Page 22: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological

Example: T0677-D2Dozens of sequence homologs Meff=31

Upper triangle: native Cβ contacts

Left lower triangle: PhyCMAP accuracy 0.357Right lower triangle: Evfold accuracy ~0

Note contacts between alpha helices are not continuous

Page 23: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological

Example: T0693-D2Many sequence homologs Meff=2208

Upper triangles: native Cβ contactsLeft lower triangle: PhyCMAP accuracy 0.744

Right lower triangle: Evfold accuracy 0.419

Page 24: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological

Example: T0701-D1Many sequence homologs Meff=3300

Upper triangle: native Cβ contactsLeft lower triangle: PhyCMAP accuracy 0.794

Right lower triangle: Evfold accuracy 0.444

Page 25: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological

Example: T0756-D1

Many sequence homologs Meff=1824Upper triangles: native Cβ contacts

Left lower triangle: PhyCMAP accuracy 0.944Right lower triangle: Evfold accuracy 0.500

Page 26: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological

Summary

Combining seq profile, residue co-evolution, non-evolutionary info can result in good accuracy even for proteins with 10--100 non-redundant seq homologs

Physical constraints are helpful for proteins with few sequence homologs

L/10 L/

5

L/10 L/

5

Short-range

contacts

Medium and long-

range

0.2

0.3

0.4

0.5

with physical constraintsno physical constraints

Cβ accuracy on 130 proteins Meff ≤ 100

Page 27: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological

Acknowledgements• Student: Zhiyong Wang• Funding – NIH R01GM0897532– NSF CAREER award– Alfred P. Sloan Research Fellowship

• Computational resources– University of Chicago Beagle team– TeraGrid

Web server at http://raptorx.uchicago.edu

Page 28: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological

Protein contact Contact : Distance between two Cα or Cβ atoms < 8Å; or Distance between the closest atoms of 2 residues.

1J8B

short range: 6-12 AAs apartmedium range: 12-24 AAs long range: >24 AAs apart

Page 29: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological

Why contact prediction?

• Contacts describe spatial and functional relationship of residues

• Contains key information for 3D structure• Useful for protein structure prediction• Used for protein structure alignment and

classification

Page 30: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological

Contrastive Mutual Information

Contrastive Mutual Information (CMI) removes local background, by measuring the MI difference between one pair of residues and neighboring pairs.

Page 31: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological

Integer Linear Programming

• Objective function:• g(R): penalty for violation of physical constraints

Variables ExplanationsXi,j equal to 1 if there is a contact between

two residues i and j.APu,v equal to 1 if two beta-strands u and v

form an anti-parallel beta-sheet.Pu,v equal to 1 if two beta-strands u and v

form a parallel beta-sheet.Su,v equal to 1 if two beta-strands u and v

form a beta-sheet. Tu,v equal to 1 if there is an alpha-bridge

between two helices u and v.Rr a non-negative integral relaxation

variable of the rth soft constraint.

6

,,,

)(maxij

jijiRX

RgPX

Page 32: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological

Hard Constraints 3

One beta-strand can form beta-sheets with up to 2 other beta-strands.

2)(:

, betavSStypev

vuS

Page 33: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological

Global constraints

• Antiparallel and parallel contacts

• A residue contact implies a segment-wise contact

• Put a limit of total number of contacts

– k is the number of top contacts we want to predict.

vuvuvu SPAP ,,,

)(),(,,, vSSegjuSSegiSX vuji

6,1,

ijLjiji kX

Page 34: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological

Results on Set600 with many sequence homologs (Meff > 100)

CMAPpro PSICOV

PhyC

MAP

PhyC

MAP

top L/5 predicted medium and long Cβ contacts: PhyCMAP: 0.611, CMAPpro: 0.515, PSICOV: 0.569

Page 35: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological

Contribution of HPS and CMI featuresAverage Cβ accuracy the 471 proteins with Meff >100

L/10 L/

5

L/10 L/

5Short-range contacts Medium and long-

range

0.4

0.5

0.6

0.7

with CMI and HPS no CMI and HPS

Page 36: PhyCMAP: Predicting protein contact map using evolutionary and physical constraints by integer programming Zhiyong Wang and Jinbo Xu Toyota Technological

Contribution of physical constraints Average Cβ accuracy on 130 proteins with Meff ≤ 100

L/10 L/

5

L/10 L/

5

Short-range contacts Medium and long-range

0.2

0.3

0.4

0.5

with physical constraintsno physical constraints