Inferring Functional Information from Domain Co-evolution
Yohan Kim, Mehmet Koyuturk, Umut Topkara, Ananth Grama and Shankar Subramaniam
Gaurav Chadha
Deepak Desore
Motivation (1 of 2..)
Prior work focused on understanding protein function at the level of entire protein sequences.
Assumption: the complete sequence follows a single evolutionary trajectory.
It is well known that a domain can exist in various contexts, which invalidates this assumption for multi-domain protein sequences.
Motivation (2 of 2 ..)
Our approach:
Improves on the multiple profile method.
Constructs a co-evolutionary matrix to assign a phylogenetic similarity score to each protein pair.
Identifies co-evolving regions using residue-level conservation.
Computational Methods & Algorithms
Constructing phylogenetic profiles:
  Protein (single) phylogenetic profiles
  Segment (multiple) phylogenetic profiles
  Residue phylogenetic profiles
Computing co-evolutionary matrices
Deriving phylogenetic similarity scores
Protein phylogenetic profiles
A phylogenetic profile is a vector that records whether a protein is present in each genome.
Let $P = \{P_1, P_2, \ldots, P_n\}$ be the set of proteins and $G = \{G_1, G_2, \ldots, G_m\}$ be the set of genomes.
Every row represents the binary phylogenetic profile of a protein.
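To make the binary profile concrete, here is a minimal Python sketch. The protein names, genome names, and the `presence` table are made-up illustration data, and `binary_profiles` is a hypothetical helper, not something taken from the slides.

```python
# Hypothetical data: for each protein, the genomes in which a homolog was found.
presence = {
    "P1": {"G1", "G2", "G4"},
    "P2": {"G2", "G3"},
    "P3": {"G1", "G2", "G4"},
}
genomes = ["G1", "G2", "G3", "G4"]


def binary_profiles(presence, genomes):
    """Return one 0/1 row per protein; entry j is 1 if the protein occurs in genomes[j]."""
    return {
        protein: [1 if g in found else 0 for g in genomes]
        for protein, found in presence.items()
    }


profiles = binary_profiles(presence, genomes)
for protein, row in profiles.items():
    print(protein, row)
```

In this toy table P1 and P3 have identical rows, which is the kind of agreement that profile-based methods read as a hint of shared function.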
Protein phylogenetic profiles(contd.)
The single phylogenetic profile $\psi_i$ for protein $P_i$ is
\[ \psi_i(j) = -\frac{1}{\log(E_{ij})}, \quad 1 \le j \le m, \]
where $E_{ij}$ is the minimum BLAST E-value of a local alignment between $P_i$ and $G_j$.
Advantage: gives degree of sequence divergence
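The mapping from E-value to profile entry can be sketched as follows. The slide does not state the logarithm base or how E-values of 0 or at least 1 are treated, so the base-10 logarithm, the cap at 1.0, and the function name `profile_entry` are all assumptions made only for this illustration.

```python
import math


def profile_entry(e_value, cap=1.0):
    """Map a BLAST E-value to -1/log10(E), capped at `cap` (assumed convention)."""
    if e_value <= 0.0:
        # Assumed handling: an E-value reported as 0 is treated as the limit of -1/log10(E).
        return 0.0
    log_e = math.log10(e_value)
    if log_e >= 0.0:
        # Assumed handling: E >= 1 means no meaningful hit, so use the cap.
        return cap
    return min(cap, -1.0 / log_e)


# Smaller values indicate a more significant (less diverged) hit.
for e in (1e-50, 1e-5, 0.1, 10.0):
    print(e, profile_entry(e))
```

Under these assumptions the entry is a graded value rather than a 0/1 flag, which is how the profile can reflect the degree of sequence divergence.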
Protein phylogenetic profiles(contd.)
Mutual information $I(X,Y)$ is defined as
\[ I(X,Y) = H(X) + H(Y) - H(X,Y), \]
where $H(X)$, the Shannon entropy of $X$, is defined as
\[ H(X) = -\sum_{x \in X} p_x \log(p_x), \quad p_x = P[X = x]. \]
The phylogenetic similarity between $\psi_i$ and $\psi_j$ is
\[ \mu_S(P_i, P_j) = I(\psi_i, \psi_j). \]
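A minimal Python sketch of mutual information between two profiles follows. Real-valued profile entries must be discretized before entropies can be estimated; the equal-width binning on [0, 1], the default of 10 bins, the base-2 logarithm, and the function names are assumptions for illustration, not details given on the slides.

```python
import math
from collections import Counter


def entropy(symbols):
    """Shannon entropy (in bits) of the empirical distribution of `symbols`."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())


def discretize(profile, bins=10):
    """Assumed scheme: equal-width bins on [0, 1]; the value 1.0 falls in the last bin."""
    return [min(int(v * bins), bins - 1) for v in profile]


def mutual_information(profile_a, profile_b, bins=10):
    """I(X,Y) = H(X) + H(Y) - H(X,Y) on the discretized profiles."""
    a = discretize(profile_a, bins)
    b = discretize(profile_b, bins)
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))


x = [0.0, 0.0, 1.0, 1.0]
y = [0.0, 1.0, 0.0, 1.0]
print(mutual_information(x, x))  # identical profiles share all of their information
print(mutual_information(x, y))  # these two toy profiles are statistically independent
```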
Segment phylogenetic profiles
Single-profile-based methods can miss significant interactions.
Domain D12 of P2 follows an evolutionary trajectory similar to P1 and P3, which the single profile method did not capture.
Segment phylogen. profiles(contd.)
Divide each protein $P_i$ into fixed-size segments $S_i^1, S_i^2, \ldots, S_i^k$.
The phylogenetic similarity between two proteins is
\[ \mu_M(P_i, P_j) = \max_{s,t} I(\psi_i^s, \psi_j^t), \]
where $\psi_i^s$ is the phylogenetic profile of segment $S_i^s$ of protein $P_i$.
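The segment-level score can be sketched in Python as a maximum over all segment pairs. The helper names, the non-overlapping split, and the already-discretized toy profiles are assumptions for illustration; how the original method places segment boundaries is not specified on this slide.

```python
import math
from collections import Counter


def entropy(xs):
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in Counter(xs).values())


def mutual_information(x, y):
    """I(X,Y) = H(X) + H(Y) - H(X,Y) for already-discretized profiles."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def split_into_segments(length, size):
    """1-based inclusive residue intervals of fixed size (the last one may be shorter)."""
    return [(s, min(s + size - 1, length)) for s in range(1, length + 1, size)]


def best_segment_pair(profiles_i, profiles_j):
    """Return (score, s, t): the maximum MI over segment pairs, with 0-based segment indices."""
    return max(
        (mutual_information(a, b), s, t)
        for s, a in enumerate(profiles_i)
        for t, b in enumerate(profiles_j)
    )


# Toy discretized segment profiles over six genomes (made-up data).
segments_p1 = [[1, 1, 0, 0, 1, 0]]
segments_p2 = [[1, 1, 1, 1, 1, 1], [1, 1, 0, 0, 1, 0]]
print(split_into_segments(130, 60))
print(best_segment_pair(segments_p1, segments_p2))
```

In the toy data only the second segment of the second protein tracks the first protein, and taking the maximum over segment pairs picks that pair out.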
Residue phylogenetic profiles
Problem with multiple phylogenetic profiles: both domains are covered together by the segment $S_2^2$, overriding their individual phylogenetic profiles.
A significant local alignment between two proteins corresponds to the residues covered in the alignment rather than to the whole sequences.
Residue phylog. profiles(contd.)
$\mathcal{A}(P_i, G_j)$: the set of significant local alignments between protein $P_i$ and genome $G_j$.
$T(A) = [r_b, r_e]$: the interval of residues on $P_i$ corresponding to each alignment $A \in \mathcal{A}(P_i, G_j)$.
For each residue $r$ on $P_i$, the phylogenetic profile is
\[ \psi_i^r(j) = \min_{A \in \mathcal{A}_r} \left( -\frac{1}{\log(E(A))} \right), \quad 1 \le j \le m, \]
where $\mathcal{A}_r = \{A \in \mathcal{A}(P_i, G_j) : r \in T(A)\}$ is the set of local alignments that contain $r$.
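A small Python sketch of the residue-level entry for one genome is given below. The `(start, end, e_value)` tuple layout, the base-10 logarithm, the cap at 1.0, and the value 1.0 for residues covered by no alignment are assumptions chosen for illustration; the slide does not say how uncovered residues are scored.

```python
import math


def score(e_value, cap=1.0):
    """-1/log10(E), capped at `cap`; edge cases follow an assumed convention."""
    if e_value <= 0.0:
        return 0.0
    log_e = math.log10(e_value)
    return cap if log_e >= 0.0 else min(cap, -1.0 / log_e)


def residue_profile_entry(residue, alignments, cap=1.0):
    """Minimum score over the alignments whose residue interval contains `residue`.

    `alignments` is a list of (start, end, e_value) for one genome, with
    1-based inclusive residue positions on the query protein.
    """
    covering = [e for (start, end, e) in alignments if start <= residue <= end]
    if not covering:
        return cap  # assumed: no covering alignment is scored like "no hit"
    return min(score(e, cap) for e in covering)


# Two overlapping hypothetical alignments against the same genome.
alignments = [(1, 100, 1e-20), (80, 200, 1e-5)]
for r in (50, 90, 150, 250):
    print(r, residue_profile_entry(r, alignments))
```

Residue 90 lies in both intervals and takes the more significant of the two scores, while residue 150 is only covered by the weaker alignment.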
Computing co-evolutionary matrices
For each protein pair $P_i$ and $P_j$ with lengths $l_i$ and $l_j$, the co-evolutionary matrix entry $M_{ij}(r,s)$ is
\[ M_{ij}(r,s) = I(\psi_i^r, \psi_j^s), \quad 1 \le r \le l_i, \; 1 \le s \le l_j. \]
The co-evolutionary matrix contains information about which regions of the two proteins co-evolved.
The co-evolved domain(s) appear as a block of high mutual information scores in the matrix.
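As a rough illustration, the matrix can be built by scoring every residue profile of one protein against every residue profile of the other. The sketch below is plain Python on tiny, already-discretized toy profiles; the function names and the data are hypothetical.

```python
import math
from collections import Counter


def entropy(xs):
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in Counter(xs).values())


def mutual_information(x, y):
    """I(X,Y) = H(X) + H(Y) - H(X,Y) for already-discretized profiles."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))


def coevolution_matrix(res_profiles_i, res_profiles_j):
    """M[r][s] = I(profile of residue r on protein i, profile of residue s on protein j)."""
    return [
        [mutual_information(pr, ps) for ps in res_profiles_j]
        for pr in res_profiles_i
    ]


# Toy residue profiles over four genomes: 3 residues on protein i, 2 on protein j.
res_i = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]]
res_j = [[1, 1, 0, 0], [0, 1, 0, 1]]
M = coevolution_matrix(res_i, res_j)
for row in M:
    print(row)
```

The first two residues of protein i score highly against the first residue of protein j, forming a small block of the kind the slide describes. Note that mutual information is also high for perfectly anti-correlated profiles, as in this toy case.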
Deriving phylogenetic similarity scores
The phylogenetic similarity score between two proteins $P_i$ and $P_j$ is
\[ \mu_C(P_i, P_j) = \max_{\substack{1 \le r \le l_i \\ 1 \le s \le l_j}} \; \min_{\substack{r \le a \le r + W \\ s \le b \le s + W}} M_{ij}(a,b), \]
where $W$ is the window parameter that quantifies the minimum size of the region on a protein to be considered as a conserved domain.
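The max-min rule can be sketched directly on a small matrix. The sketch follows the inclusive bounds of the formula, so each window spans W + 1 rows and columns, and it only considers windows that fit entirely inside the matrix; both choices, along with the function name `window_score` and the toy matrix, are assumptions for illustration.

```python
def window_score(matrix, w):
    """Maximum over window positions of the minimum entry inside the window.

    Each window covers rows r..r+w and columns s..s+w (0-based, inclusive).
    """
    rows, cols = len(matrix), len(matrix[0])
    if rows <= w or cols <= w:
        raise ValueError("matrix is smaller than the window")
    best = float("-inf")
    for r in range(rows - w):
        for s in range(cols - w):
            block_min = min(
                matrix[a][b]
                for a in range(r, r + w + 1)
                for b in range(s, s + w + 1)
            )
            best = max(best, block_min)
    return best


# Toy matrix with a 2x2 block of high scores and one isolated high entry.
M = [
    [0.1, 0.2, 0.1, 0.0],
    [0.0, 0.9, 0.8, 0.1],
    [0.2, 0.7, 0.9, 0.0],
    [0.1, 0.1, 0.2, 1.0],
]
print(window_score(M, 1))  # 0.7: the weakest entry of the strongest 2x2 block
print(window_score(M, 0))  # 1.0: with single-entry windows this is the plain maximum
```

Taking the minimum inside each window means an isolated high entry (the 1.0 in the corner) does not count as a co-evolved region unless its neighbours are also high.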
Results
Implemented and tested on 4311 E. coli proteins.
152 genomes (131 Bacteria, 17 Archaea, 4 Eukaryota).
Down-sampling factor f = 30 and W = 2; these values translate into overlapping segments 60 residues long.
Homologous proteins were excluded from the analysis.
The p-value is defined as the fraction of non-homologous protein pairs (N)
Results (contd.)
MIS – Mutual Information Score
PP – No. of predicted protein pairs
PPV = TP / (TP + FP)
For all μ*, coverage = TP + FP
TN and FN are the no. of protein pairs that do not meet the threshold
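These quantities can be computed from a table of scored pairs and a reference set of interacting pairs. In the sketch below the pair names, scores, threshold, and the use of "greater than or equal to" for the cut-off are made up for illustration; sensitivity uses the standard definition TP / (TP + FN), which is not spelled out on this slide.

```python
def evaluate(scores, interacting, threshold):
    """Count TP/FP/FN/TN for pairs scored at or above `threshold` and derive metrics."""
    tp = fp = fn = tn = 0
    for pair, score in scores.items():
        predicted = score >= threshold
        known = pair in interacting
        if predicted and known:
            tp += 1
        elif predicted:
            fp += 1
        elif known:
            fn += 1
        else:
            tn += 1
    coverage = tp + fp  # number of predicted pairs
    return {
        "TP": tp, "FP": fp, "FN": fn, "TN": tn,
        "coverage": coverage,
        "PPV": tp / coverage if coverage else None,
        "sensitivity": tp / (tp + fn) if (tp + fn) else None,
    }


# Hypothetical scored pairs and reference interactions.
scores = {("A", "B"): 0.9, ("A", "C"): 0.8, ("B", "C"): 0.3, ("C", "D"): 0.1}
interacting = {("A", "B"), ("B", "C")}
result = evaluate(scores, interacting, threshold=0.5)
print(result)
```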
Results (contd.)
Co-evolutionary matrix has 1.5 times greater coverage at PPV = 0.7 than the single profile method
At same no. of PP, Co-evolutionary matrix has better PPV and sensitivity values than single profile method
Results (contd.)
Mutual information score (MIS) distribution for interacting and non-interacting protein pairs.
At an MIS of 0, SP (single profile) shows a peak while CM (co-evolutionary matrix) does not. In other words, at low MIS, SP scores over CM.
Results (contd.)
Shows p-values of the single profile method versus the co-evolutionary matrix method.
Scattered circles show that the two methods can predict very differently.
Results (contd.) – Phosphotransferase system
Domain IIA (residues 1-170) and domain IIB (residues 170-320).
The darker region shows that the domains have co-evolved, so we can conclude that IIB evolved with IIC rather than IIA.
Top-20 predicted interacting partners of protein IIAB for both methods
Results (contd.) - Chemotaxis
The N-terminus of CheA (residues 1-200) and the C-terminus of CheA (residues 540-670) co-evolved with the C-terminal region of CheB (residues 170-340).
Top-20 predicted interacting partners of protein CheA using both methods
Results (contd.) – Kdp System
N-terminal domain of KdpD (residues 1-395) co-evolved with KdpC
Top-10 predicted interacting partners of protein KdpD using both methods
Conclusion
Results in this paper strongly suggest that co-evolution of proteins should be captured at the domain level, because domains with conflicting evolutionary histories can co-exist in a single protein sequence.
Regions that are important for supporting both functional and physical interactions between proteins can be detected.