7. lecture ws 2007/08bioinformatics iii1 v7 molecular decomposition of graphs – functional nets...

7. Lecture WS 2007/08

Bioinformatics III 1

V7 Molecular decomposition of graphs – functional nets

Most cellular processes result from a cascade of events mediated by proteins that act in a cooperative manner.

Protein complexes can share components: proteins can be reused and participate in several complexes (Cellzome data).

Methods for analyzing high-throughput protein interaction data have mainly used clustering techniques.

They have been applied to assign protein function by inference from the biological context as given by their interactors, and to identify complexes as dense regions of the network (see V5).

The logical organization into shared and specific components, and its representation, remain elusive.

Gagneur et al. Genome Biology 5, R57 (2004)



Shared components

Shared components, i.e. proteins or groups of proteins occurring in different complexes, are fairly common. A shared component may be a small part of many complexes, acting as a unit that is constantly reused for its function.

Alternatively, it may be the main part of the complex, e.g. in a family of variant complexes that differ from each other by distinct proteins that provide functional specificity.

Aim: identify and properly represent the modularity of protein-protein interaction networks by identifying the shared components and the way they are arranged to generate complexes.

Gagneur et al. Genome Biology 5, R57 (2004)

Georg Casari, Cellzome (Heidelberg)



Modules

A graph and its modules.

Nodes connected by a link are called neighbors.

In graph theory, a module is a set of nodes that have the same neighbors outside the module.

In addition to the trivial modules {a},{b},...,{g} and {a,b,c,..,g}, this graph contains the modules {a,b,c}, {a,b},{a,c},{b,c} and {e,f}.
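The module definition can be checked mechanically. The following is a minimal sketch, assuming a hypothetical edge list (the lecture's figure is not reproduced here, so `EDGES` is an illustrative graph chosen so that {a,b,c} and {e,f} are modules); `build_adjacency` and `is_module` are names introduced only for this example.

```python
def build_adjacency(edges):
    """Build an undirected adjacency dict {node: set of neighbors}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def is_module(adj, nodes):
    """True if all members of `nodes` have the same neighbors outside `nodes`."""
    nodes = set(nodes)
    if not nodes:
        return True
    outside = [adj[v] - nodes for v in nodes]
    return all(o == outside[0] for o in outside)


# Hypothetical example graph (an assumption, not the graph from the slide).
EDGES = [("a", "d"), ("b", "d"), ("c", "d"),
         ("d", "e"), ("d", "f"), ("e", "g"), ("f", "g")]
ADJ = build_adjacency(EDGES)
```

With this graph, `is_module(ADJ, {"a", "b", "c"})` and `is_module(ADJ, {"e", "f"})` hold, whereas `{"d", "e"}` is not a module because d and e see different neighbors outside the set.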




Quotient

Because the elements of a module have exactly the same neighbors outside the module, one can substitute a single representative node for all of them.

In a quotient, all elements of the module are replaced by the representative node, and the edges with the neighbors are replaced by edges to the representative.

Quotients can be iterated until the entire graph is merged into a final representative node.

Iterated quotients can be captured in a tree, where each node represents a module, which is a subset of its parent and the set of its descendant leaves.
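A quotient step can be sketched in a few lines. This is a minimal illustration under stated assumptions: the adjacency dict `ADJ` is a hypothetical graph (not the one in the figure), and `quotient` together with the representative names `M1` and `M2` are introduced only for this example.

```python
def quotient(adj, module, rep):
    """Replace all nodes of `module` by one representative node `rep`.

    `adj` maps each node to the set of its neighbors. The function raises
    ValueError if `module` is not a module of the graph.
    """
    module = set(module)
    outside = [adj[v] - module for v in module]
    if any(o != outside[0] for o in outside):
        raise ValueError("not a module: members differ in outside neighbors")
    new_adj = {}
    for u, nbrs in adj.items():
        if u in module:
            continue
        # Edges into the module become edges to the representative.
        new_adj[u] = {rep if v in module else v for v in nbrs}
    new_adj[rep] = set(outside[0])
    return new_adj


# Hypothetical graph in which {a,b,c} and {e,f} are modules.
ADJ = {
    "a": {"d"}, "b": {"d"}, "c": {"d"},
    "d": {"a", "b", "c", "e", "f"},
    "e": {"d", "g"}, "f": {"d", "g"},
    "g": {"e", "f"},
}
Q1 = quotient(ADJ, {"a", "b", "c"}, "M1")
Q2 = quotient(Q1, {"e", "f"}, "M2")
```

After the two successive quotients, `Q2` is the four-node path M1, d, M2, g; recording which nodes were merged at each step gives the tree of iterated quotients.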




Modular decomposition

Modular decomposition of the example graph shown before.

Modular decomposition results in a labeled tree that represents iterations of particular quotients, here the successive quotients on the modules {a,b,c} and {e,f}.

The modular decomposition is a unique, canonical tree of iterated quotients (a formal proof exists: Möhring 1985).




Nodes

The nodes of the modular decomposition are categorized in 3 ways:

series: the direct descendants are all neighbors of each other (labelled by an asterisk within a circle)

parallel: the direct descendants are all non-neighbors of each other (labelled by two parallel lines within a circle)

prime: neither of the above; the relationships between the direct descendants are given by the structure of the module itself (labelled by a P within a circle)


The graph can be retrieved from the tree on the right by recursively expanding the modules using the information in the labels. Therefore, the labeled tree can be seen as an exact alternative representation of the graph.
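The recursive expansion can be illustrated as follows. This is a sketch under assumptions: the nested-tuple encoding of the tree (with an explicit list of linked child pairs for prime nodes) is a representation chosen for this example, and `TREE` encodes a hypothetical graph rather than the one in the figure.

```python
def expand(tree):
    """Return (leaves, edges) of the graph encoded by a labeled tree.

    A leaf is a string. An inner node is a tuple:
      ("series", children), ("parallel", children), or
      ("prime", children, links), where `links` lists the index pairs
      of children that are connected in the prime module.
    """
    if isinstance(tree, str):
        return [tree], set()
    label, children = tree[0], tree[1]
    parts = [expand(child) for child in children]
    leaves = [leaf for part_leaves, _ in parts for leaf in part_leaves]
    edges = set()
    for _, part_edges in parts:
        edges |= part_edges
    if label == "series":      # every pair of children is linked
        links = [(i, j) for i in range(len(parts))
                 for j in range(i + 1, len(parts))]
    elif label == "parallel":  # no pair of children is linked
        links = []
    else:                      # prime: links are stored explicitly
        links = tree[2]
    for i, j in links:
        # Linking two modules links all their leaves pairwise.
        for u in parts[i][0]:
            for v in parts[j][0]:
                edges.add(frozenset((u, v)))
    return leaves, edges


# Hypothetical tree: a prime root over parallel{a,b,c}, d, parallel{e,f}, g.
TREE = ("prime",
        [("parallel", ["a", "b", "c"]), "d", ("parallel", ["e", "f"]), "g"],
        [(0, 1), (1, 2), (2, 3)])
LEAVES, EDGES = expand(TREE)
```

Expanding `TREE` yields seven edges (a-d, b-d, c-d, d-e, d-f, e-g, f-g), which shows how the labels alone suffice to rebuild the graph.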



Results from protein complex purifications (PCP), e.g. TAP

Different types of data:

- Y2H: detects direct physical interactions between proteins

- PCP by tandem affinity purification with mass-spectrometric identification of the protein components identifies multi-protein complexes

Modular decomposition will have a different meaning for each data type because the semantics of the resulting graphs differ.

Here, the analysis focuses on PCP data. In a PCP experiment, a bait protein is selected and a TAP label is attached to it. The bait is then co-purified with those proteins that co-occur with it in at least one complex.




Clique and maximal clique

A clique is a fully connected sub-graph, that is, a set of nodes that are all neighbors of each other.

In this example, the whole graph is a clique and consequently any subset of it is also a clique, for example {a,c,d,e} or {b,e}. A maximal clique is a clique that is not contained in any larger clique. Here only {a,b,c,d,e} is a maximal clique.


Assuming complete datasets and ideal results, a permanent complex will appear as a clique. The converse is not true: not every clique in the network necessarily derives from an existing complex. E.g. 3 connected proteins can be the outcome of a single trimer, 3 heterodimers or combinations thereof.



Results from protein complex purifications (PCP), e.g. TAP

Interpretation of graph and module labels for systematic PCP experiments. (a) Two neighbors in the network are proteins occurring in the same complex. (b) Several potential sets of complexes can be the origin of the same observed network. Restricting interpretation to the simplest model (top right), the series module reads as a logical AND between its members. (c) A module labeled 'parallel' corresponds to proteins or modules working as strict alternatives with respect to their common neighbors. (d) The 'prime' case is a structure where neither of the two previous cases occurs.




Obtain maximal cliques

Modular decomposition provides an instruction set to deliver all maximal cliques of a graph.

In particular, when the decomposition has only series and parallels, the maximal cliques are straightforwardly retrieved by traversing the tree recursively from top to bottom.

A series module acts as a product: the maximal cliques are all the combinations made up of one maximal clique from each „child“ node.

A parallel module acts as a sum: the set of maximal cliques is the union of all maximal cliques from the „child“ nodes.
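The product/sum rule translates directly into a recursion over the tree. The sketch below assumes a nested-tuple encoding of the decomposition tree and uses hypothetical protein names A to E; it covers only trees built from series and parallel nodes, as described above.

```python
from itertools import product


def maximal_cliques(tree):
    """List the maximal cliques encoded by a series/parallel tree.

    A leaf is a string; an inner node is ("series", children) or
    ("parallel", children).
    """
    if isinstance(tree, str):
        return [frozenset([tree])]
    label, children = tree
    child_cliques = [maximal_cliques(child) for child in children]
    if label == "parallel":
        # Sum: union of the children's maximal cliques.
        return [clique for cliques in child_cliques for clique in cliques]
    if label == "series":
        # Product: one maximal clique from each child, merged.
        return [frozenset().union(*combo) for combo in product(*child_cliques)]
    raise ValueError("prime nodes are not handled by this sketch")


# Hypothetical tree: A in series with two pairs of alternatives.
TREE = ("series", ["A", ("parallel", ["B", "C"]), ("parallel", ["D", "E"])])
CLIQUES = maximal_cliques(TREE)
```

For this tree the recursion returns four maximal cliques, {A,B,D}, {A,B,E}, {A,C,D} and {A,C,E}: one choice from each parallel module, always combined with A.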




Interpretation for PCP protein interaction networks

In the modular decomposition tree, the leaves are proteins and the root represents the whole network.

In between, each node is a module that is a sub-part of its parent. The label of a node gives the nature of the relationship between its direct children.

Proteins or modules in a parallel module can be seen as alternatives. If A is a neighbor of B and C, which are not neighbors of each other, then A can belong to a complex together with either B or C, but not with both at the same time.

B and C define a parallel module and thus are alternative partners in a complex with their common neighbor A. This situation corresponds to a logical "exclusive OR" between B and C.




Interpretation for PCP protein interaction networks

Proteins or modules in a series module can be seen as potentially combined in any way.

If A is a neighbor of B and C, and B and C are also neighbors, then A can belong to a complex together with B or C, or with both at the same time. This corresponds to a logical "OR" between B and C.

A series module can be seen as a unit: a set of proteins (modules) that function together.

A 'prime' is a graph where neither of these cases occurs.




Back to the real world …

Two examples of modular decomposition of protein-protein interaction networks. In each case from top to bottom: schema of complexes, the corresponding protein-protein interaction network as determined from PCP experiments, and its modular decomposition (MOD).

(a) Protein phosphatase 2A. Parallel modules group proteins that do not interact but are functionally equivalent. Here these are the catalytic Pph21 and Pph22 (module 2) and the regulatory Cdc55 and Rts1 (module 3).





RNA polymerases I, II and III



Conclusions

Modular decomposition of graphs is a well-defined concept.

It can be thoroughly proven for which graphs a modular decomposition exists.

Efficient O(m + n) algorithms exist to compute the decomposition.

However, experiments have shown that biological complexes are not strictly disjoint. They often share components, meaning that separate complexes do not fulfil the strict requirements of modular graph decomposition.

Also, there exists a "danger" of false-positive or false-negative interactions.

Other methods, e.g. for detecting communities (Girvan & Newman) or clusters (Spirin & Mirny), are more suitable for identification of complexes because they are more sensitive.



Transferring functional annotation in interaction maps

Even the best-studied model organisms contain a large number of proteins whose functions are currently unknown. E.g. about one-third of the proteins in Saccharomyces cerevisiae remain uncharacterized.

Traditionally, computational methods to assign protein function have relied largely on sequence homology. The recent emergence of high-throughput experimental datasets has led to a number of alternative, non-homology-based methods for functional annotation. These methods have generally exploited the concept of guilt by association, where proteins are functionally linked through either experimental or computational means.



Associating protein function from proteomic data

Large-scale experiments have linked proteins that

- physically interact,

- are synthetic lethals,

- are coexpressed or

- are coregulated.

In addition, computational techniques linking pairs of proteins include

- phylogenetic profiles,

- gene clusters,

- conserved gene neighbors and

- gene fusion analysis.

Integrating the information from several sources provides the best method for linking proteins functionally.

Nabieva et al., Bioinformatics 21, i1 (2005)

If two protein-coding genes are found to be separate in one species (Sp1, Sp4, Sp5) and fused to form a single gene in another (Sp2, Sp3), a physical interaction is probable. This is termed the Rosetta Stone or gene-fusion method.

The Gene neighbourhood method analyzes the gene order in different evolutionarily related organisms. Genes that always occur in the same order (here: A, B, C) are likely to form an operon meaning that they would be jointly regulated and are quite likely to interact. Gene D may occur at different locations making it less likely to be part of the same operon.

The Gene cluster method. Here, genes A, B, and C are arranged linearly as one operon. When transcription is activated at promoter P, all three genes are simultaneously transcribed.



Features of integrated networks

Aim: partition interaction networks into functional modules. These are sets of proteins that are part of the same cellular function or take part in the same protein complex.

These functional modules, or clusters, are useful for annotating uncharacterized proteins, as the most common functional annotation within a cluster can be transferred to uncharacterized proteins.

Proteins in experimentally and computationally determined interaction graphs have been grouped together based on

- shared interactions,

- the similarity between shortest path vectors to all other proteins in the network and

- shared membership within highly connected components or cliques.




Algorithms: Majority

As in Schwikowski et al. (2000): for each protein, consider all neighboring proteins and count the number of times each annotation occurs among them.

„The annotated functions of all neighbors of P are ordered in a list, from the most frequent to the least frequent. Functions that occur the same number of times are ordered arbitrarily. Everything after the third entry in the list is discarded, and the remaining three or fewer functions are declared as predictions for the function of P.“

In the case of weighted interaction graphs take a weighted sum.
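The Majority rule as described above can be sketched as follows. The function name `majority`, the data layout (adjacency dict, annotation dict, optional edge-weight dict keyed by `frozenset` pairs) and the example proteins are assumptions made for illustration only.

```python
from collections import Counter


def majority(adj, annotations, protein, top=3, weights=None):
    """Predict up to `top` functions for `protein` from its direct neighbors.

    adj:          {protein: set of neighboring proteins}
    annotations:  {protein: set of known functions} (unannotated proteins may be absent)
    weights:      optional {frozenset({u, v}): edge weight}; unit weights if None
    """
    counts = Counter()
    for neighbor in adj[protein]:
        w = 1.0 if weights is None else weights[frozenset((protein, neighbor))]
        for function in annotations.get(neighbor, ()):
            counts[function] += w
    # Ties are broken arbitrarily, as in the description above.
    return [function for function, _ in counts.most_common(top)]


# Hypothetical toy network around an uncharacterized protein P.
ADJ = {"P": {"x", "y", "z"}, "x": {"P"}, "y": {"P"}, "z": {"P"}}
ANNOT = {
    "x": {"transport"},
    "y": {"transport", "repair"},
    "z": {"transport", "repair", "signaling"},
}
PREDICTED = majority(ADJ, ANNOT, "P")
```

Here "transport" occurs three times, "repair" twice and "signaling" once among the neighbors of P, so the predictions come out in that order. A protein without any annotated neighbor receives an empty list.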






The simple majority vote approach, named Majority, has clear predictive value.

However, it takes only limited advantage of the underlying graph structure of the network.

In the interaction network (left), 'Majority' would assign functions to proteins d and f, but not to protein e, even though our intuition might indicate that protein e has the same function as proteins d and f.




Algorithms: Neighborhood algorithm

Hishigaki et al. (2001) extended the Majority algorithm by predicting a protein's function by looking at all proteins within a particular radius and finding over-represented functional annotations.

This approach, named Neighborhood, does not consider any aspect of network topology within the local neighborhood. E.g., Fig. 2 shows two interaction networks that are treated equivalently when considering a radius of 2 and annotating protein a.

However, in the first case, there is a single link that connects protein a to the annotated proteins, and in the second case, there are several independent paths between a and the annotated proteins, and moreover, two of these proteins are directly adjacent to a.

Predict function of each protein in the interaction map (black circle in Figure C), based on the functions of ‘n-neighbouring proteins’, which are defined as a set of proteins reached via n physical interactions at most (n is an integer parameter).

E.g. all proteins enclosed by the inner dashed circle are '1-neighbouring proteins', and those enclosed by the outer circle are '2-neighbouring proteins'. The protein of interest is assigned the function with the highest $\chi^2$ value among functions of all n-neighbouring proteins:

$$\chi^2_i = \frac{(n_i - e_i)^2}{e_i}$$

where $i$ denotes a protein function, e.g. 'Golgi', 'DNA repair' and 'transcription factor', $e_i$ denotes the number of occurrences of $i$ among the n-neighbouring proteins expected from the distribution on the total map, and $n_i$ denotes the observed number of occurrences of $i$ among the n-neighbouring proteins.

Then, the function of a query protein is predicted to be the function $i$ with the maximum $\chi^2$ value. When several functions share the largest $\chi^2$ value, all of them are assigned. The optimal n value is determined by a so-called self-consistency test, where the predicted functions of all proteins in the map are compared with their annotated functions for each n.

Here: consider neighborhoods of radius 1, 2 and 3. This method does not extend naturally to the case of weighted interaction graphs.
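A minimal sketch of the Neighborhood scoring is given below. It assumes that the expected count for a function is the neighborhood size multiplied by the fraction of proteins on the whole map that carry the function; this estimate, the function names `n_neighbors` and `neighborhood_predict`, and the toy network are assumptions for illustration. As written, the score does not distinguish over-representation from under-representation, so the example is chosen such that the top-scoring function is over-represented.

```python
from collections import Counter, deque


def n_neighbors(adj, start, n):
    """Proteins reachable from `start` via at most n interactions (BFS)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if dist[u] == n:
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist) - {start}


def neighborhood_predict(adj, annotations, protein, n):
    """Return (functions with maximal chi-square score, all scores)."""
    nbrs = n_neighbors(adj, protein, n)
    if not nbrs:
        return [], {}
    total = len(adj)
    map_counts = Counter(f for fs in annotations.values() for f in fs)
    observed = Counter(f for v in nbrs for f in annotations.get(v, ()))
    scores = {}
    for function, count in map_counts.items():
        expected = len(nbrs) * count / total  # assumed estimate of e_i
        scores[function] = (observed[function] - expected) ** 2 / expected
    best = max(scores.values())
    return sorted(f for f, s in scores.items() if s == best), scores


# Hypothetical toy map: q is the query protein.
ADJ = {
    "q": {"a", "b"}, "a": {"q"}, "b": {"q", "c"},
    "c": {"b", "d"}, "d": {"c", "e"}, "e": {"d"},
}
ANNOT = {"a": {"Golgi"}, "b": {"Golgi"},
         "c": {"DNA repair"}, "d": {"DNA repair"}, "e": {"DNA repair"}}
PREDICTED, SCORES = neighborhood_predict(ADJ, ANNOT, "q", n=1)
```

With radius 1, both neighbors of q are annotated 'Golgi' although only a third of the map carries that function, so 'Golgi' obtains the highest score.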




Algorithms: GenMultiCut

It was suggested that functional annotations on interaction networks should be made in order to minimize the number of times different annotations are associated with neighboring proteins.

This task is similar to the minimum multiway k-cut problem.

In multiway k-cut, the task is to partition a graph in such a way that each of k terminal nodes belongs to a different subset of the partition and so that the (weighted) number of edges that are ‘cut’ in the process is minimized.

In the more general version of the multiway k-cut problem considered here, the goal is to assign a unique function to all the unannotated nodes so as to minimize the sum of the costs of the edges joining nodes with no function in common.





Although minimum multiway k-cut is NP-hard (Dahlhaus et al., 1994), it was found that the particular instances of minimum multiway cut arising here can, in practice, be solved exactly when stated as an integer linear program (ILP).

Introduce a node variable $x_{u,a}$ for each protein u and function a.

Set $x_{u,a} = 1$ if protein u is predicted to have function a.

If a protein u has known functional annotations, variable $x_{u,a}$ is fixed as 1 for its known annotations a and as 0 for all other annotations.

We also introduce an edge variable $x_{u,v,a}$ for each function a and each pair of adjacent proteins u and v. This variable is set to 1 if both proteins u and v are annotated with function a.

Minimizing the weighted number of neighboring proteins with different annotations is the same as maximizing the number with the same annotation, and so we have the following ILP:





The first constraint specifies that exactly one functional annotation is made for any protein. The second and third constraints ensure that if protein u is annotated with function a, $x_{u,a}$ is set as a constant to 1, and if protein u is annotated but not with function a, $x_{u,a}$ is set as a constant to 0. The fourth and fifth constraints ensure that a particular function is picked for an edge only if it is also chosen for the corresponding proteins.

annot(u): set of known annotations for protein u; $\mathrm{FUNC} = \bigcup_u \mathrm{annot}(u)$: set of all functional annotations.
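An ILP consistent with the variable definitions and the five constraints described above could be written as follows. This is a reconstruction from the text, not a copy of the original formula: the edge-weight symbol $w_{u,v}$ and the restriction of the first constraint to proteins without known annotation (so that proteins with several known annotations remain feasible) are assumptions.

```latex
\begin{aligned}
\text{maximize}\quad
  & \sum_{(u,v)\in E}\ \sum_{a\in \mathrm{FUNC}} w_{u,v}\, x_{u,v,a} \\
\text{subject to}\quad
  & \sum_{a\in \mathrm{FUNC}} x_{u,a} = 1
      && \text{if } \mathrm{annot}(u)=\emptyset \\
  & x_{u,a} = 1
      && \text{if } a \in \mathrm{annot}(u) \\
  & x_{u,a} = 0
      && \text{if } \mathrm{annot}(u)\neq\emptyset \text{ and } a\notin \mathrm{annot}(u) \\
  & x_{u,v,a} \le x_{u,a}
      && \text{for all } (u,v)\in E,\ a\in\mathrm{FUNC} \\
  & x_{u,v,a} \le x_{v,a}
      && \text{for all } (u,v)\in E,\ a\in\mathrm{FUNC} \\
  & x_{u,a},\ x_{u,v,a} \in \{0,1\}
\end{aligned}
```

Because the objective is maximized, $x_{u,v,a}$ is driven to 1 whenever both endpoint variables allow it, so the sum counts the weighted edges whose endpoints share a function.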



Considering multiple GenMultiCut optimal solutions

An important consideration in this framework is the existence of multiple optimal solutions. E.g. the network in Figure 3 has seven minimum cuts of value 1, and while the GenMultiCut criterion does not favor any one cut over the other, if we find all optimal cuts for this graph, we observe that x2 is in fact annotated with F1 more often than with F2 in the assignments made by these cuts.

Thus, a sense of distance to annotated nodes is in fact present in the set of all optimal solutions.
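The effect can be reproduced on a small instance by brute force. The sketch below enumerates all assignments instead of solving an ILP, which is only feasible for tiny graphs; the path network, the node names and the function `gen_multicut` are hypothetical and do not correspond to Figure 3.

```python
from itertools import product


def gen_multicut(edges, known, functions):
    """Enumerate all minimum-cost assignments by exhaustive search.

    edges:     list of (u, v, weight)
    known:     {node: function} for annotated nodes (one function each)
    functions: list of candidate functions for unannotated nodes
    Returns (minimum cost, list of optimal {node: function} assignments).
    """
    nodes = sorted({n for u, v, _ in edges for n in (u, v)})
    free = [n for n in nodes if n not in known]
    best_cost, best = None, []
    for choice in product(functions, repeat=len(free)):
        label = dict(known)
        label.update(zip(free, choice))
        # Cost: total weight of edges joining differently annotated nodes.
        cost = sum(w for u, v, w in edges if label[u] != label[v])
        if best_cost is None or cost < best_cost:
            best_cost, best = cost, [label]
        elif cost == best_cost:
            best.append(label)
    return best_cost, best


# Hypothetical path: s1 (F1) - x1 - x2 - x3 - x4 - s2 (F2), unit weights.
EDGES = [("s1", "x1", 1), ("x1", "x2", 1), ("x2", "x3", 1),
         ("x3", "x4", 1), ("x4", "s2", 1)]
KNOWN = {"s1": "F1", "s2": "F2"}
COST, SOLUTIONS = gen_multicut(EDGES, KNOWN, ["F1", "F2"])
F1_VOTES_X2 = sum(1 for s in SOLUTIONS if s["x2"] == "F1")
```

On this path there are five optimal cuts of value 1, and x2, which lies closer to the F1 source, receives F1 in three of them and F2 in two.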




Algorithms: FunctionalFlow

The functional flow algorithm generalizes the principle of ‘guilt by association’ to groups of proteins that may or may not interact with each other physically.

Each protein of known functional annotation is treated as a ‘source’ of ‘functional flow’ for that function. After simulating the spread over time of this functional flow through the neighborhoods surrounding the sources, the ‘functional score’ is obtained for each protein in the neighborhood; this score corresponds to the amount of ‘flow’ that the protein has received for that function, over the course of the simulation.

The functional flow-based model allows us to incorporate a distance effect, i.e. the effect of each annotated protein on any other protein depends on the distance separating these two proteins.

Running this process for each biological function in turn, we obtain, for each protein, the score for each function (the score may be 0 if the ‘flow’ for a function did not reach that protein during the simulation).

Each protein is then assigned, as its predicted functions, the functions for which it obtained the highest score.




Algorithms: FunctionalFlow

For each protein u in the interaction network, a variable $R_t^a(u)$ is defined that corresponds to the amount in the reservoir for function a that node u has at time t. For each edge (u, v) in the interaction network, we define variables $g_t^a(u, v)$ and $g_t^a(v, u)$ that represent the flow of function a at time t from protein u to protein v, and from protein v to protein u.

We will run the algorithm for d time steps or iterations. At time 0, we only have reservoirs of function a at annotated nodes:


At each subsequent time step, we recompute the reservoir of each protein by considering the amount of flow that has entered the node and the amount that has left:



Algorithms: FunctionalFlow

Initially, at time 0, there is no flow on the edges, and $g_0^a(u, v) = 0$.

At each subsequent time step, the flow proceeds downhill while satisfying the capacity constraints:


Finally, the functional score for node u and function a over d iterations is calculated as the total amount of flow that has entered the node:

Why don‘t we simply take the current value of the reservoir?
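The simulation for a single function can be sketched as follows. The concrete update rules are assumptions made for this sketch, since the formulas are not given in the text above: annotated nodes start with an infinite reservoir, flow goes from u to v only if u's reservoir is strictly larger than v's, and the flow on an edge is capped by the edge weight and by u's reservoir split proportionally to the weights of u's edges. The function name `functional_flow` and the toy path are hypothetical.

```python
import math


def functional_flow(edges, sources, d):
    """Simulate d time steps of flow for one function.

    edges:   list of (u, v, weight) with positive weights (capacities)
    sources: set of nodes annotated with the function
    Returns (score, reservoir) dicts keyed by node.
    """
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    # Assumption: annotated nodes act as infinite reservoirs.
    reservoir = {u: (math.inf if u in sources else 0.0) for u in adj}
    score = {u: 0.0 for u in adj}
    for _ in range(d):
        flow = {}
        for u in adj:
            total_w = sum(adj[u].values())
            for v, w in adj[u].items():
                if reservoir[u] > reservoir[v]:  # downhill only
                    flow[(u, v)] = min(w, reservoir[u] * w / total_w)
                else:
                    flow[(u, v)] = 0.0
        # Update reservoirs from the flows computed on the previous state.
        for (u, v), g in flow.items():
            reservoir[u] -= g
            reservoir[v] += g
            score[v] += g  # score = total flow that has entered v
    return score, reservoir


# Hypothetical path s - x - y with unit weights; s is annotated.
SCORE, RESERVOIR = functional_flow([("s", "x", 1.0), ("x", "y", 1.0)], {"s"}, d=2)
```

After two steps on this path, x has received a total flow of 2.0 and y of 0.5, while the reservoir of x holds only 1.5 because part of what entered has already been passed on to y. The score and the current reservoir content are therefore different quantities.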



Comparison of the four methods

Comparison of four basic methods on the unweighted physical interaction map

Fig. 4 plots the number of true positives (TPs) each method predicts as a function of the number of false positives (FPs). The FunctionalFlow algorithm identifies more TPs over the entire range of FPs than either GenMultiCut or Neighborhood using radius 1, 2 or 3. FunctionalFlow performs better than Majority when proteins are not directly interacting with at least 3 proteins of the same function. Thus, FunctionalFlow is the method of choice when considering proteins that do not interact with many annotated proteins. Even in well-characterized proteomes, such as baker's yeast, there are ca. 1200 proteins that have fewer than three annotated neighbors.


The Neighborhood algorithm performs similarly with either radius 1 or 2 in the high-confidence region. However, radius 1 (i.e. considering just direct interactions) has better overall performance than radius 2 or 3, demonstrating that Neighborhood’s strategy of ignoring topology is not optimal.



Reliability and data integration

To evaluate the approach for modeling physical interaction reliability as edge weights, we test the performance of FunctionalFlow using three ways of assigning physical interaction weights: (1) Assign each edge a unit weight; this corresponds to the unweighted physical interaction map. (2) Assign each experimental source a reliability score of 0.5; this rewards interactions that are found by more than one experiment. (3) Assign each experimental source its predictive value (estimated in cross-validation); here, edges obtained from multiple, more reliable experiments are given higher weights.
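How per-source reliabilities could turn into an edge weight is sketched below. The combination rule is an assumption, not stated in the text: sources are treated as independent, and an edge is considered reliable unless every supporting source is wrong. The function name `edge_weight` is hypothetical.

```python
def edge_weight(source_reliabilities):
    """Combine the reliabilities of all sources that report an interaction.

    Assumed rule: 1 - product of (1 - r) over the supporting sources,
    i.e. the sources are treated as independent.
    """
    p_all_wrong = 1.0
    for r in source_reliabilities:
        p_all_wrong *= (1.0 - r)
    return 1.0 - p_all_wrong


SINGLE = edge_weight([0.5])      # interaction seen in one experiment
DOUBLE = edge_weight([0.5, 0.5]) # interaction seen in two experiments
```

Under this rule, scheme (2) gives an interaction found by one experiment a weight of 0.5 and one found by two experiments a weight of 0.75, which is how a uniform score of 0.5 still rewards repeated observation.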


Rewarding multiple experimental evidence is beneficial. The main advantage comes from taking into account the actual reliability values for the different experiments.



All methods perform better on weighted map

The figure shows how Majority, GenMultiCut and FunctionalFlow perform on the yeast physical interaction map, where edges are weighted by individual experimental reliability.

The baseline performance of Majority on the unweighted physical interaction graph is also shown.

There is a substantial improvement in predictions using all three methods when incorporating edges weighted by reliability.




Is it useful to provide more information?

Will the network analysis algorithms perform better when other types of experimental information are added?

Here, genetic linkages (synthetic lethals) are added to the graph.

The figure shows that adding genetic interaction data significantly improves prediction quality.




Conclusions

The network analysis algorithm FunctionalFlow provides an effective means for predicting protein function from protein interaction maps.

The algorithm utilizes indirect network interactions, network topology, network distances and edges weighted by reliability estimated from multiple data sources.

The simplest methods, such as Majority, perform well if there are enough direct neighbors with known function.

