DEVELOPING EFFICIENT ALGORITHMS TO IDENTIFY PATTERNS OF BIOLOGICAL NETWORKS

By

RASHA ELHESHA

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT

OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2018

ACKNOWLEDGMENTS

Firstly, I would like to express my sincere gratitude to my advisor Prof. Tamer Kahveci

for his continuous support throughout my Ph.D. study and related research. I owe my deepest

gratitude to him for his patience, motivation, and his faithful guidance which helped me

accomplish my research and my dissertation writing. I really appreciate all his hard efforts

with me. In addition, I would like to thank my Ph.D. committee members, Prof. Sartaj Sahni,

Prof. Alin Dobra, Prof. Ye Xia and Prof. Benjamin Baiser, for their insightful comments and

encouragement. They helped me achieve my thesis objectives with outstanding efficiency and

directed me along the right track.

I would like to thank my family for their continuous support and encouragement, which

were worth more than I can express on paper. I would like to thank my husband, Mohamed,

for his sincere and faithful support and help. I would like to thank my two awesome children

who were able to draw a smile on my face during tough times. Last but not least, I would

like to acknowledge my father and my mother. Without their enthusiasm, encouragement and

support, this thesis would hardly have been completed. I am grateful to my sister, Shereen, for

always being there for me as a friend.

TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

CHAPTER

1 INTRODUCTION

2 IDENTIFICATION OF LARGE DISJOINT MOTIFS IN BIOLOGICAL NETWORKS
  2.1 Preface
  2.2 Background
    2.2.1 Definitions and Notation
    2.2.2 Summary of Existing Methods
  2.3 Method
    2.3.1 Algorithm Overview
    2.3.2 Joining Patterns to Find Larger Patterns
    2.3.3 Finding MIS: Going from F1 to F2
    2.3.4 Accelerating Our Algorithm Through Efficient Filters
    2.3.5 Complexity Analysis
  2.4 Experimental Results
    2.4.1 Evaluation of Running Time
      2.4.1.1 Effect of Graph and Motif Size
      2.4.1.2 Effect of Graph Size and Density
    2.4.2 Comparison with Existing Methods
      2.4.2.1 Comparison with SUBDUE
      2.4.2.2 Comparison with FSG
    2.4.3 Evaluation of Statistical Significance
    2.4.4 Case Study on Human Herpesvirus
  2.5 Discussion

3 APPLICATION OF MOTIFS IDENTIFICATION
  3.1 Motifs in The Assembly of Food Web Networks
    3.1.1 Preface
    3.1.2 Method
    3.1.3 Experimental Results
    3.1.4 Discussion
  3.2 Motif Centrality in Food Web Networks
    3.2.1 Preface
    3.2.2 Background
    3.2.3 Method
    3.2.4 Experimental Results
    3.2.5 Discussion

4 IDENTIFICATION OF CO-EVOLVING TEMPORAL NETWORKS
  4.1 Preface
  4.2 Related Work
  4.3 Problem Formulation
  4.4 Methods
    4.4.1 Proof of NP-hardness
    4.4.2 Complexity Analysis
    4.4.3 Adopting Pairwise Alignment Methods to Generate Similarity Scores for Temporal Networks
  4.5 Results and Discussion
    4.5.1 Evaluation of Recovered Region
    4.5.2 Evaluation of Induced Conserved Structure
    4.5.3 Evaluation of Edge Correctness
    4.5.4 Evaluation of Statistical Significance of The Alignment
    4.5.5 Evaluation of Running Time
    4.5.6 Evaluation of Recovered Genes in Real Dataset
    4.5.7 Evaluation on Real Data
  4.6 Discussion

5 IDENTIFICATION OF CO-EVOLVING TEMPORAL NETWORKS WITH UNCERTAIN TIMELINE
  5.1 Preface
  5.2 Related Work and Notations
  5.3 Method
  5.4 Results
    5.4.1 Comparing Against Other Strategies
    5.4.2 Comparing Stress Response Against Time Points Matching
    5.4.3 Hierarchical Clustering of Conditions
    5.4.4 Evaluation of Running Time
    5.4.5 Evaluation of Alignment Quality
  5.5 Discussion

6 CONCLUSION

REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1 PPI networks selected from the MINT database

2-2 The significance of the most abundant motif of PPI networks, first approach

2-3 The significance of the most abundant motif of PPI networks, second approach

2-4 Uniprot IDs of the proteins in an embedding of the most abundant motif in hhv-8 PPI network

3-1 Approaches used to calculate motif centrality significance

4-1 Percentage of recovered query genes from gene aging dataset when using Alzheimer’s phenotype as query

4-2 Percentage of recovered query genes from gene aging dataset when using Huntington’s phenotype as query

4-3 Percentage of recovered query genes from gene aging dataset when using Type II diabetes phenotype as query

4-4 Number and significance of functional pathways associated with the underlying disease observed among the aligned genes of target network

LIST OF FIGURES

2-1 A hypothetical graph to represent motifs

2-2 The four basic patterns used to find motifs

2-3 All patterns which can be constructed with four undirected edges

2-4 Construct patterns with k + 1 edges

2-5 Algebraic calculation of the frequency of one basic pattern

2-6 The overlap graph based on F2 and F3 frequency measures

2-7 The running time of our motif discovery method using synthetic data varying graph and motif sizes

2-8 The total running time of our motif discovery method using real data

2-9 The running time of our motif discovery method using synthetic data varying graph density

2-10 Comparison between our motif discovery algorithm and SUBDUE, motif size = 5

2-11 Comparison between our motif discovery algorithm and SUBDUE, motif size = 10

2-12 Comparison between our motif discovery algorithm and SUBDUE, motif size = 15

2-13 Comparison of running time between our motif discovery algorithm and FSG

2-14 Motifs discovered in Human herpesvirus PPI

3-1 The three-node motifs we explore to analyze the Assembly of Food Web Networks

3-2 Schematic of the three levels of hierarchy for pitcher plant network assembly

3-3 The percentage of sites for which motif representation matches the continental network

3-4 The percentage of pitchers for which motif representation matches the site networks

3-5 All 13 motifs of 3-node subgraphs

3-6 Distribution of motif abundance over two classes of motif centrality significance

3-7 Correlation probabilities (p-values) between motif abundance and motif centrality

4-1 Comparison between different network alignment problems

4-2 Illustrating the alignment problem using hypothetical between two networks

4-3 The percentage of recovered query in the resulting alignment varying evolution rates

4-4 The induced conserved structure (ICS) score of the resulting alignment varying evolution rates

4-5 The Edge correctness (EC) score of the resulting alignment varying evolution rates

4-6 The average z-score of Tempo varying network sizes

4-7 The average z-score of Tempo against IsoRank

4-8 The total running time of IsoRank and Tempo for synthetic networks varying target network size

4-9 The average z-score of our method using real data of three different diseases; Alzheimer’s, Huntington’s and Type-II diabetes

4-10 The percentage of genes that contributes to each pathway of the resulting aligned genes

5-1 The statistical significance (z-score) of the resulting alignment varying the number of time points

5-2 The significance of the overlaps between different conditions through time points post-perturbation

5-3 The hierarchical clustering of z-score of the alignment between the five stress conditions

5-4 The total running time of our method for synthetic networks varying the number of time points

5-5 The edge correctness (EC) score of the resulting alignment varying the number of time points

5-6 The induced conserved structure (ICS) score of the resulting alignment varying the number of time points

5-7 The percentage of recovered query of the resulting alignment

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

DEVELOPING EFFICIENT ALGORITHMS TO IDENTIFY PATTERNS OF BIOLOGICAL NETWORKS

By

Rasha Elhesha

August 2018

Chair: Tamer Kahveci

Major: Computer Engineering

Studying biological networks provides great potential to help understand how cells

function and how they respond to extra-cellular stimulants. The majority of the previous work

on biological networks assumes that the network topology is static and does not change. However,

it is well understood that the interactions between molecules are dynamic and change over time.

Assuming a static topology may lead to biased or incorrect analysis. We consider analyzing

both static and temporal biological networks. Studying the temporal progression of network

topologies is of utmost importance since it uncovers how a network evolves and how it resists

external stimuli and internal variations.

In this work, we address three main problems of biological networks. The first problem is

identifying large disjoint motifs, frequent topological patterns, in a static biological network.

We present a scalable algorithm for finding network motifs that counts independent copies

of each motif topology, unlike most existing studies. We show two case studies of

food webs when applying our algorithm. The second problem is the identification of co-evolving

temporal networks when the time point information is known. Two temporal networks

have co-evolving subnetworks if the topologies of these subnetworks remain similar to each

other as the network topology evolves over a period of time. Here, we consider

the problem of identifying a co-evolving pair of temporal networks, which capture the

evolution of molecules and their interactions over time. Although this problem shares some

characteristics of the well-known network alignment problems, it differs from existing network

alignment formulations as it seeks a mapping of the two network topologies that is invariant

to temporal evolution of the given networks. This is a computationally challenging problem as

it requires capturing not only similar topologies between two networks but also their similar

evolution patterns. We develop an efficient algorithm, Tempo, for identifying coevolving

subnetworks in two given temporal networks. We formally prove the correctness of our

method. We experimentally demonstrate that Tempo scales efficiently with the size of the network

as well as the number of time points, and generates statistically significant alignments,

even when the evolution rates of the given networks are high. Our results on a human aging dataset

demonstrate that Tempo identifies novel genes contributing to the progression of Alzheimer’s,

Huntington’s and Type II diabetes, while existing methods fail to do so. The third problem

addresses the drawbacks in the second problem by considering the uncertainty of time points

in both temporal networks when identifying their co-evolving topology. More specifically, the time

points of the observed network topologies are uncertain: which time point in one sequence

corresponds to which time point in the other sequence is not known in advance.

For this problem, we develop a novel method, Tempo++, which identifies coevolving subnetworks

between subsequences of a given pair of temporal networks. We use a gene expression dataset

that contains the time-resolved response of E. coli to five different environmental perturbation

conditions (cold, heat, oxidative stress, lactose diauxie, and stationary phase). Using our

method, we could find similar response behavior of gene expressions between heat and

oxidative stress. Using Tempo++ to generate alignment significance, we could co-cluster these

five conditions into groups. These clusters also confirmed that E. coli has a similar response to

heat and oxidative stress conditions. We compare the statistical significance of the alignments

found by Tempo++ against those of other possible strategies to tackle this problem.

CHAPTER 1

INTRODUCTION

Biological networks describe the interaction between molecules, and are frequently

represented as graphs, where the nodes correspond to the molecules (e.g., proteins or genes)

and the edges correspond to the interactions (1). More formally, we denote a biological

network as G = (V,E), where V and E represent the set of nodes and the set of edges,

respectively. Analysis of these networks enables the elucidation of cellular functions (2), the

identification of variations in cancer networks (3), and the characterization of variations in drug

resistance (4). In addition, this analysis has led to the formulation of numerous computational

challenges, as well as methods that address these challenges. Among these challenges,

identifying motifs (5; 6; 7) (i.e., local network properties) and network alignment (8) (i.e.,

global network properties) are arguably two of the most important.

The majority of the previous work on biological networks assumes that the network topology

is static and does not change (9; 10). However, in many cases, the interaction between

molecules is dynamic (11; 12). For example, genetic and epigenetic mutations can alter

molecular interactions (13), and variation in gene copy number can affect the existence of

interactions (14; 15). Due to this dynamic behavior, the topology of the network that models

the molecular interaction will evolve and change over time (16; 17; 18), and assuming a static

topology may lead to biased or incorrect analysis. In this work, we consider analyzing both

static and temporal networks.

The first problem we address in this dissertation is the problem of identifying large disjoint

motifs in biological networks (Chapter 2). Motifs are frequent topological patterns in a given

network (19). Given a target network and a motif size (i.e., number of nodes in the motif), we

aim to find the motifs of that size which have a frequency above a user-specified threshold in

that target network. Unlike most of the methods in the literature, we count independent copies

of each motif where no two copies of the same motif share an edge. Counting motif frequency

(i.e., the number of occurrences of this motif) requires solving the subgraph isomorphism

problem, which is NP-Complete (20).

We develop a novel and scalable algorithm to solve the motif identification problem. We

introduce a set of small patterns and prove that we can construct any larger pattern by joining

those patterns iteratively. By iteratively joining already identified motifs with those patterns,

our algorithm avoids (i) constructing topologies that do not exist in the target network and

(ii) repeatedly counting the frequency of the motifs generated in subsequent iterations. Our

experiments on both protein-protein interaction (PPI) and synthetic networks demonstrate

that our method is significantly faster and more accurate than the existing methods. In

addition, the increase in the running time of our algorithm is dramatically less than that of the

competing methods as the motif size grows.

The motif identification applications we address in this work were mainly developed to analyze

food web networks (Chapter 3). The first application is Motifs in the Assembly of Food Web

Networks. In this application, we compute the significance of three-node motifs across

a hierarchy of scales to explore the assembly of food web networks found in the leaves of

the northern pitcher plant (Sarracenia purpurea (21; 22)). The second application is Motif

Centrality in Food Web Networks. We explored the relationship between motif abundance

and motif centrality to better understand why some motifs are found at high abundances

(i.e., over-represented) and some are found at low abundances (i.e., under-represented). We

developed a suite of methods for calculating the centrality of entire motifs and then analyzed

the relationship between motif centrality and motif abundance in published aquatic food

webs (23).

The second problem we address in this dissertation is identifying coevolving subnetworks

in a given pair of temporal networks (Chapter 4). The majority of the previous work on alignment

of biological networks assumes that the network topology is static (10), an assumption that ignores

the history of network evolution, and may lead to biased or incorrect analysis. To address

the dynamic changes of biological networks, we define a biological network using a model

that describes the evolution of an underlying network at consecutive time points. We refer

to this model as a temporal network (24; 25). Informally, we view this model as containing a

single snapshot of the network at each time point and thus, the sequence of snapshots as a

time series network. Hence, we assume the topology of the biological network is observed at

t consecutive time points. Given two input temporal networks, we let the smaller one be the

query network and the other be the target network. Our algorithm captures the fact

that network topologies evolve over time and seeks the alignment that persists through

this evolution. More specifically, the aligned nodes do not change from one time point to

another. The temporal network alignment problem is dramatically different from

existing network alignment problems.
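To make the temporal network model concrete, the following minimal sketch represents each temporal network as a list of snapshots (one edge list per time point) and evaluates a single node mapping that stays fixed across all time points. The function name, the node labels, and the use of a conserved-edge count as the quality measure are assumptions made for illustration only; this is not the scoring function proposed in this dissertation.

```python
def conserved_edges_over_time(query_snapshots, target_snapshots, mapping):
    """Count, per time point, the query edges whose images exist in the target.

    Each snapshot is a list of undirected edges (pairs of node labels).
    The same `mapping` (query node -> target node) is used at every time
    point, mirroring the requirement that aligned nodes do not change.
    """
    if len(query_snapshots) != len(target_snapshots):
        raise ValueError("both temporal networks need the same number of time points")
    counts = []
    for query_edges, target_edges in zip(query_snapshots, target_snapshots):
        target = {frozenset(edge) for edge in target_edges}
        conserved = sum(
            1 for (u, v) in query_edges
            if frozenset((mapping[u], mapping[v])) in target
        )
        counts.append(conserved)
    return counts


# Hypothetical toy data: a query with two time points and a larger target.
query = [[("q1", "q2"), ("q2", "q3")], [("q1", "q2")]]
target = [[("x", "y"), ("y", "z"), ("z", "w")], [("x", "y"), ("x", "w")]]
mapping = {"q1": "x", "q2": "y", "q3": "z"}

print(conserved_edges_over_time(query, target, mapping))  # [2, 1]
```

A mapping that scores well at every time point, rather than at a single snapshot, is the kind of alignment that persists through the evolution of both networks.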

We present a novel algorithm to identify coevolving subnetworks in a given pair of

temporal networks. We propose a new scoring function that integrates the similarities of the

aligned nodes and their network topologies. Our algorithm works in two phases. In the first

phase, our algorithm first finds an initial alignment between the input networks G1 and G2

using the homological and topological similarities of their nodes. This phase ignores the penalty

arising from disconnected subnetworks in the alignment. The second phase of our algorithm

aims to maximize the alignment score by repeatedly altering the aligned nodes in the target

network using a dynamic programming strategy. We solve the problem of connecting subgraphs

using a dynamic programming approach that selects a minimum number of swapping pairs

from the gap node and aligned node sets to ensure the maximum profit in the scoring

function. This problem is reduced to the set cover problem, which is NP-complete (26).

We demonstrate the efficiency and accuracy of Tempo using both real and synthetic data. We

compare the running time and the quality of the alignments found by Tempo against those

of three existing alignment algorithms. We show that Tempo has competitive running time and

generates significantly better alignments. We could predict disease-related genes based on the

alignments generated by Tempo, which suggests that Tempo generates alignments that reflect

the evolution of node topologies through time as well as their homological similarities, while

other methods focus only on static and independent topologies.

The third problem we address in this dissertation is aligning two temporal networks

with uncertain time points (Chapter 5). More specifically, the time point information

of both networks is unknown in advance (or uncertain). Furthermore, G1 and G2 may have

different numbers of time points. Without loss of generality, we let G1 be the

temporal network with fewer time points. Various factors affect the evolution

process of a biological network and thus, introduce uncertainty when capturing such

evolution. For example, the evolution rate of interacting molecules differs between people

with different disorders (i.e., diseases) or people with the same disorder but at different stages

of this disorder (27). Consequently, the observed interactions of humans may vary even if

they are measured at the same time. Thus, the interaction networks constructed for those

measurements may correspond to different stages of the evolution. In this problem, we consider

the uncertainty of the time points in each topological network. This is a very challenging

problem since it requires not only aligning the temporal networks, but also finding the corresponding

time points at which the alignment yields the highest alignment score.

We develop a novel method, Tempo++, to identify coevolving subnetworks in a given

pair of temporal networks with uncertain timelines. Our method adopts a dynamic

time warping strategy to find the optimal matching between the two input temporal

networks by shifting and stretching the time points of G1 based on the alignment quality.

For instance, omitting the first two networks in G2 in the alignment corresponds to the case

where G1 denotes a later stage of evolution by two time points as compared to G2. Similarly,

omitting intermediate networks in G2 corresponds to the case when G1 is evolving slower

than G2. We demonstrate the efficiency and accuracy of Tempo++ using both real and

synthetic data. We use a gene expression dataset that contains the time-resolved response of

E. coli to five different environmental perturbation conditions (cold, heat, oxidative stress,

lactose diauxie, and stationary phase). Using our method, we could find similar response

behavior of gene expressions between heat and oxidative stress. Using Tempo++ to generate

alignment significance, we could co-cluster these five conditions into groups. These clusters

also confirmed that E. coli has a similar response to heat and oxidative stress conditions. We

compare the statistical significance of the alignments found by Tempo++ against those of

other possible strategies to tackle this problem.
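The time point matching idea described above can be illustrated with a small dynamic program. The sketch below assumes a precomputed matrix `scores`, where `scores[i][j]` is a hypothetical alignment quality between time point i of G1 and time point j of G2, and it matches every time point of G1 to a distinct time point of G2 in increasing order, so that leading or intermediate time points of G2 may be omitted. The function name, the score matrix, and this restricted form of matching are assumptions for illustration; classic dynamic time warping also permits one-to-many matches, and this is not presented as the Tempo++ algorithm itself.

```python
def match_time_points(scores):
    """Order-preserving matching of G1 time points to G2 time points.

    scores[i][j]: assumed alignment quality of time point i of G1 against
    time point j of G2. Returns (best_total, columns), where columns[i] is
    the G2 time point matched to time point i of G1.
    """
    t1, t2 = len(scores), len(scores[0])
    if t1 > t2:
        raise ValueError("G1 must not have more time points than G2")
    neg_inf = float("-inf")
    # best[i][j]: best total when the first i time points of G1 are matched
    # using only the first j time points of G2.
    best = [[neg_inf] * (t2 + 1) for _ in range(t1 + 1)]
    for j in range(t2 + 1):
        best[0][j] = 0
    for i in range(1, t1 + 1):
        for j in range(i, t2 + 1):
            skip = best[i][j - 1]                            # omit G2 time point j-1
            use = best[i - 1][j - 1] + scores[i - 1][j - 1]  # match i-1 with j-1
            best[i][j] = max(skip, use)
    # Trace back to recover which G2 time points were used.
    columns = []
    i, j = t1, t2
    while i > 0:
        if best[i][j] == best[i][j - 1]:
            j -= 1
        else:
            columns.append(j - 1)
            i -= 1
            j -= 1
    columns.reverse()
    return best[t1][t2], columns


# Hypothetical scores: G1 has two time points, G2 has four.
example_scores = [[1, 2, 9, 3],
                  [0, 1, 2, 8]]
print(match_time_points(example_scores))  # (17, [2, 3])
```

In this toy input the best matching omits the first two time points of G2, which corresponds to the case where G1 denotes a later stage of evolution than G2.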

CHAPTER 2

IDENTIFICATION OF LARGE DISJOINT MOTIFS IN BIOLOGICAL NETWORKS

2.1 Preface

Studying biological networks has great potential to help understand how cells function and

how they respond to extra-cellular stimulants. Such studies have already been used successfully

in many applications. Characterizing the variations in drug resistance of different cell lines (4),

or identifying the pathways serving similar functions across different organisms (28; 29) are

only a few examples among many.

Motifs are frequent topological patterns in a given network (19). Identifying motifs has

been one of the key steps in understanding the functions served by biological networks such as

gene regulatory or protein interaction networks (5; 6; 7). Motifs can be used to uncover the

basic structure and design principles of a network (30). They are also often considered as the

basic building blocks of a network (19) and one of the network local properties (31). Thus,

they can be used to classify networks (32) into functional sub-units. It is worth noting that

motifs have been used in various applications like prediction of regulatory elements in genomic

sequences (33).

Despite the fact that studying motifs is of utmost importance for network analysis, motif

identification remains a computationally hard problem (34). The challenges

behind motif discovery arise from several sources. First, even when the motif topology is given,

counting motif frequency (i.e., the number of occurrences of this motif) requires solving the

subgraph isomorphism problem, which is NP-Complete (20). Furthermore, when the motif

topology is not known in advance, trying out all alternative topologies is infeasible as the

number of such topologies increases exponentially with the number of edges in the motif.

There are two ways to formulate motif frequency: (i) allow different copies of

the same motif to overlap (i.e., share nodes or edges) or (ii) count disjoint copies of the

motif under consideration. Most of the existing methods in the literature on motif counting

follow the first formulation. This formulation, however, has a fundamental drawback arising

from the fact that it does not have the downward closure property. Briefly, this means that the

motif frequency does not decrease monotonically as the motif size increases. We discuss

this drawback in detail in Section 2.2, along with why it makes it impossible to determine

the largest motif in a given network. Several algorithms use the second formulation

to compute the frequency of a given motif (e.g., (35)). Those algorithms, however, do not

scale to large networks. Also, they are limited to small motifs as their time complexities grow

exponentially with motif size. We elaborate on these methods in Section 2.2 as well.
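The difference between the two formulations can be checked on the small graph of Figure 2-1. The sketch below copies the embeddings of patterns B and C from the caption of that figure and counts them under both formulations, taking "disjoint" to mean that no two copies share an edge. The helper names are hypothetical, and the exhaustive search over subsets is for illustration on tiny inputs only; it is not the algorithm developed in this chapter.

```python
from itertools import combinations


def edges(*pairs):
    """Build an embedding as a frozenset of undirected edges, e.g. edges("ab", "bc")."""
    return frozenset(frozenset(pair) for pair in pairs)


# Embeddings listed in the caption of Figure 2-1.
PATTERN_B = [edges("ab", "ac", "bc"), edges("ef", "fg", "eg")]
PATTERN_C = [edges("ab", "ac", "bc", "be"),
             edges("ef", "fg", "eg", "ed"),
             edges("ef", "fg", "eg", "be")]


def overlapping_count(embeddings):
    """Formulation (i): every embedding counts, even if copies share edges."""
    return len(embeddings)


def edge_disjoint_count(embeddings):
    """Formulation (ii): size of the largest set of pairwise edge-disjoint copies.

    Brute force over subsets, largest first; only suitable for tiny inputs.
    """
    for size in range(len(embeddings), 0, -1):
        for subset in combinations(embeddings, size):
            if all(not (x & y) for x, y in combinations(subset, 2)):
                return size
    return 0


print(overlapping_count(PATTERN_B), overlapping_count(PATTERN_C))      # 2 3
print(edge_disjoint_count(PATTERN_B), edge_disjoint_count(PATTERN_C))  # 2 2
```

Pattern C contains pattern B as a subpattern, yet under the first formulation its count (3) exceeds that of pattern B (2), which is the violation of downward closure mentioned above. Under the edge-disjoint formulation the count of the larger pattern does not exceed that of the smaller one in this example.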

In this chapter, we address the problem of finding motifs in a given network. More

specifically, given a target network and a motif size (i.e., number of nodes in the motif), we

aim to find the motifs of that size which have a frequency above a user-specified threshold

in that target network. Unlike most of the methods in the literature, we use the second

formulation of motif counting described above, where no two copies of the same motif share an

edge, to compute the frequency.

Contributions: We develop a novel and scalable algorithm to solve the motif identification

problem. The central idea of our method, which stands out among the existing literature, is

to use a small set of patterns, called the basic building patterns. We prove that any motif

with four or more edges can be constructed as a combination of these patterns. Following

from this observation, our method first finds instances of these patterns. It then iteratively

grows motifs by joining known motifs at that iteration with the instances of these patterns.

Our algorithm develops efficient mechanisms to avoid a significant fraction of the costly

isomorphism tests while growing new motifs. Counting non-overlapping instances of a given

motif is a computationally challenging task that requires solving the maximum independent set

(MIS) problem, which is known to be NP-complete (34). We introduce a new and efficient

strategy for this purpose. This strategy avoids enumerating the overlapping motif instances.

It does this by algebraically computing the overlap count based on the neighbors of the motif

nodes in the target network. Our experiments on both protein-protein interaction (PPI) and

synthetic networks demonstrate that our method is significantly faster and more accurate

than the existing methods. In addition, the increase in the running time of our algorithm is

dramatically less than that of the competing methods as the motif size grows.

Figure 2-1. A hypothetical graph to illustrate motifs. A) A graph G that contains seven nodes {a, b, c, d, e, f, g} and eight edges {(a,b), (a,c), (b,c), (b,e), (e,d), (e,f), (f,g), (e,g)}. B) A pattern with two embeddings in G, {(a,b), (a,c), (b,c)} and {(e,f), (f,g), (e,g)}. C) A pattern with three embeddings in G, {(a,b), (a,c), (b,c), (b,e)}, {(e,f), (f,g), (e,g), (e,d)}, and {(e,f), (f,g), (e,g), (b,e)}. D) A pattern that has one copy in G, {(b,e), (e,d), (e,f), (f,g), (e,g)}.

The rest of this chapter is organized as follows. We present the key definitions needed to

discuss our method and the related literature in Section 2.2. We describe our motif discovery

algorithm in Section 2.3. We experimentally evaluate our method and compare it to the

existing algorithms in Section 2.4. We end with a brief conclusion in Section 2.5.

2.2 Background

In this section, we provide the definitions and the terminology needed to describe our

method (Section 2.2.1). We then summarize the key literature tackling similar problems to the

one considered in this chapter (Section 2.2.2).

2.2.1 Definitions and Notation

We represent a given biological network using a graph denoted with G = (V,E). Here,

the set of nodes V denotes the set of interacting molecules, and the set of edges E denotes

the interactions among them. In the rest of this chapter, we use the term graph to denote a

biological network. Here, we focus on undirected graphs. Figure 2-1A represents a graph that

contains seven nodes and eight edges.

We say that a graph is connected if there is a path between all pairs of its nodes. We say

that a graph S = (VS, ES) is a subgraph of G if VS ⊆ V and ES ⊆ E. In the rest of this

chapter, we only consider connected subgraphs. Thus, to simplify our terminology, we use the


term subgraph instead of connected subgraph. Notice that a subgraph of a given graph can be

uniquely determined by the set of edges ES of that subgraph as all of its nodes are connected.

We say that two subgraphs S1 = (VS1 , ES1) and S2 = (VS2 , ES2) of G are identical

if they have the same set of edges. A less constrained association between two subgraphs is

isomorphism. Two subgraphs S1 and S2 are isomorphic if there exists a bijection

f : VS1 → VS2 such that, for all u, v ∈ VS1, (u, v) ∈ ES1 ⇐⇒ (f(u), f(v)) ∈ ES2.

We say that two subgraphs S1 and S2 overlap if they share at least one edge (i.e.,

ES1 ∩ ES2 ≠ ∅). In Figure 2-1A, consider the four subgraphs S1, S2, S3, and S4 defined by

the set of edges {(a,b), (a,c), (b,c), (b,e)}, {(e,f), (f,g), (e,g), (e,d)}, {(e,f), (f,g), (e,g),

(b,e)} , and {(b,e), (d,e), (e,f), (e,g)} respectively. S1 and S2 are disjoint as they do not share

any edges. S1 and S3 overlap as they share the edge (b,e). Similarly S2 and S3 overlap. All

three subgraphs S1, S2, and S3 are isomorphic as they have the same topology. S1 and S4 are

non-isomorphic as no bijection between their node sets satisfies the condition above.

Notice that isomorphism is an equivalence relation (it is reflexive, symmetric, and transitive). Thus, for a given subgraph S of G, the

set of all subgraphs of G which are isomorphic to S defines an equivalence class. We represent

the subgraphs in each equivalence class with a graph isomorphic to those in that equivalence

class and call it a pattern. Figure 2-1C shows the pattern that represents the equivalence class

{S1, S2, S3}.

There are alternative definitions of the frequency of a pattern in a given graph. The

classical frequency definition is the number of all subgraphs of the target graph which are

isomorphic to the given pattern. This definition, also known as the F1 measure (36), counts

all the subgraphs regardless of whether they overlap with each other or not. There are two

other frequency definitions which avoid overlaps between different subgraphs. The F2 measure

counts the size of the largest subset of subgraphs in a given equivalence class such that no two

subgraphs in that subset share an edge. It however allows them to share nodes. The F3

measure is more stringent as it requires that no two subgraphs share a node. Consider the

pattern in Figure 2-1C and the target graph in Figure 2-1A. The frequency of this pattern in


the target graph according to the F1 measure is three as it has three embeddings ({S1, S2,

S3}). On the other hand F2 is two {S1, S2}, and F3 is one (S1 or S2 or S3). From here on,

we denote the F1, F2, and F3 counts of a motif M in graph G using the notations F1G(M),

F2G(M), and F3G(M) respectively.
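The three counts in this example can be checked with a small brute-force sketch. It is only an illustration for the three embeddings listed above, not the counting procedure developed later in this chapter; the helper names (`share_edge`, `share_node`, `largest_disjoint_subset`) are hypothetical, and the exhaustive search is feasible only because there are three embeddings.

```python
from itertools import combinations

# Embeddings of the pattern in Figure 2-1C, as listed in the text.
S1 = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "e")]
S2 = [("e", "f"), ("f", "g"), ("e", "g"), ("e", "d")]
S3 = [("e", "f"), ("f", "g"), ("e", "g"), ("b", "e")]
EMBEDDINGS = [S1, S2, S3]


def edge_set(sub):
    # Undirected edges: (u, v) and (v, u) are the same edge.
    return {frozenset(e) for e in sub}


def node_set(sub):
    return {v for e in sub for v in e}


def share_edge(s, t):
    # Overlap criterion of the F2 measure.
    return bool(edge_set(s) & edge_set(t))


def share_node(s, t):
    # Overlap criterion of the F3 measure.
    return bool(node_set(s) & node_set(t))


def largest_disjoint_subset(embeddings, conflict):
    # Exhaustive search from the largest subset size downwards.
    for k in range(len(embeddings), 0, -1):
        for subset in combinations(embeddings, k):
            if not any(conflict(s, t) for s, t in combinations(subset, 2)):
                return k
    return 0


f1 = len(EMBEDDINGS)
f2 = largest_disjoint_subset(EMBEDDINGS, share_edge)
f3 = largest_disjoint_subset(EMBEDDINGS, share_node)
print(f1, f2, f3)
```

All three embeddings contain node e, which is why the node-disjoint count drops to one.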

The downward closure property states that the frequency of a pattern should never increase

as this pattern grows (by inserting new nodes or edges to it). More specifically,

consider a function f() that operates on a pattern and returns a real number. Let us denote

two patterns with P1 and P2. We say that the function f() has downward closure property if

and only if f(P2) ≤ f(P1) for all (P1, P2) pairs where P1 is a subgraph of P2.

In light of these definitions, we next show that the F1 measure is not downward

closed. Consider the pattern P1 in Figure 2-1B. The frequency of P1 is two in the target graph

in Figure 2-1A. Now consider the pattern P2 in Figure 2-1C which contains P1. Although P1 is

a subgraph of P2, the frequency of P2 is three in the same graph (i.e., more than that of P1).

Next, consider the pattern P3 in Figure 2-1D. P3 contains P2, and its frequency is only one

(i.e., less than that of P2). This example demonstrates that the F1 measure not only fails to

monotonically decrease, but also fluctuates (i.e., its value may go up or down) as we grow

the pattern (see (37; 38) for further discussion of this issue).
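The fluctuation of F1 on the graph of Figure 2-1A can be reproduced with the following brute-force sketch. It assumes the pattern topologies implied by the embeddings in the figure caption (a triangle, a triangle with one pendant edge, and a triangle with two pendant edges on the same node). The function names are hypothetical, and the permutation-based isomorphism test is practical only for very small patterns.

```python
from itertools import combinations, permutations

# Graph G of Figure 2-1A.
G_EDGES = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "e"),
           ("e", "d"), ("e", "f"), ("f", "g"), ("e", "g")]

# Pattern topologies, with arbitrary integer node names.
P1 = [(1, 2), (2, 3), (1, 3)]   # Figure 2-1B
P2 = P1 + [(3, 4)]              # Figure 2-1C
P3 = P2 + [(3, 5)]              # Figure 2-1D


def normalize(edges):
    return frozenset(frozenset(e) for e in edges)


def nodes_of(edges):
    return sorted({v for e in edges for v in e})


def isomorphic(edges1, edges2):
    # Try every bijection between the two node sets.
    e1, e2 = normalize(edges1), normalize(edges2)
    n1, n2 = nodes_of(edges1), nodes_of(edges2)
    if len(e1) != len(e2) or len(n1) != len(n2):
        return False
    for perm in permutations(n2):
        mapping = dict(zip(n1, perm))
        if frozenset(frozenset(mapping[v] for v in e) for e in e1) == e2:
            return True
    return False


def f1_count(pattern):
    # F1: number of edge subsets of G that are isomorphic to the pattern.
    return sum(1 for sub in combinations(G_EDGES, len(pattern))
               if isomorphic(sub, pattern))


counts = [f1_count(p) for p in (P1, P2, P3)]
print(counts)
```

The count first rises and then falls as the pattern grows, which is the behavior described above.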

Unlike the F1 measure, F2 is downward closed. In the following, we formally prove this.

Theorem 2.1. Assume that we are given a graph G. Given two patterns M and M′ where M

⊂ M′, we have F2G(M) ≥ F2G(M′).

Proof. To prove this, we consider the placement of each embedding of M′ in G according to

the F2 measure (i.e., non-overlapping embeddings). Notice that each embedding of M′ contains M

as M ⊂ M′. From each of these embeddings, we remove the edges that are in M′ − M. This

leads to one embedding of M for each embedding of M′. Removing edges cannot make two

edge-disjoint embeddings share an edge, so the resulting embeddings of M are also

non-overlapping. Thus, the number of non-overlapping embeddings of M in G is at least as

large as that of M′ in G. Therefore, F2G(M) ≥ F2G(M′).


Similarly, the F3 measure, which also counts non-overlapping embeddings, is

downward closed.

Failure to satisfy the downward closure property has major implications on the correctness

of motif identification. Traditional motif identification algorithms often grow a motif starting

from an initial motif of a small number of edges (Section 2.2.2). Should they employ the F1

measure, these algorithms cannot have an early stopping criterion as they grow motifs. This is

because the frequency can go up as we grow a motif even when the current motif frequency is

low. Next, we formally define the problem considered in this chapter.

Problem definition. Given an input graph G = (V,E), the number of nodes in the

target motif µ, and frequency threshold α, we aim to find all patterns of µ nodes which have

frequency at least α in G under the frequency measure F2. The method we develop in this

chapter can however be easily extended to F3 as well (Section 2.3.3).

2.2.2 Summary of Existing Methods

We classify the literature on motif identification and counting, based on the underlying

frequency measure. This is because the frequency measure dramatically changes the cost of

counting motifs as well as how we can interpret the frequency of the underlying pattern. Most

of the existing studies use F1 frequency measure to count the embeddings of a pattern in

a given graph (e.g., (39; 40; 41; 42; 43; 44)). These methods carry the drawbacks inherent

in the F1 measure. First, F1 ignores the fact that different copies of the same motif can

overlap due to the nodes and the edges they share. This can lead to an artificially large

number of motif embeddings as the same node or edge can participate in multiple embeddings.

To understand this better, consider the pattern and the graph in Figures 2-1C and 2-1A

respectively. F1 counts three copies of the pattern (S1, S2, and S3). Different nodes and

edges however contribute to this count unequally. The edge (a, b) appears only in

S1 while (b, e) appears in both S1 and S3.

Second and more importantly, the F1 measure is not downward closed. This is because

as we grow a pattern by including new edges or nodes, its count as computed by F1 is not

monotonic; it may decrease, stay the same, or increase. The lack of the downward closure property

makes it nearly impossible to decide, while growing a pattern, whether the motif found is the

largest one. Thus, using F2 is essential for the tractability of identifying frequent patterns. We use

the F2 measure in this chapter. Thus, the studies limited to the F1 measure are out of the

scope of this chapter.

Figure 2-2. The four basic patterns used by our algorithm.

Several algorithms tackle the problem of finding frequent patterns in multiple graphs.

FSG (45) is one of the key methods in this class. These methods, however, do not count the

number of occurrences of a pattern in each graph. They rather check if the given pattern

appears at least once in each graph. Vanetik et al. (37) also addressed the same problem.

Finding frequent patterns or counting them without overlaps (i.e., using the F2 or F3

measures) has received little attention in the literature. One of the existing algorithms

in this category is SUBDUE (35). The Flexible Pattern Finder (FPF) algorithm (36) detects

frequent patterns using both F2 and F3. Kuramochi and Karypis (46) proposed two

algorithms, named hSiGraM and vSiGraM. However, these algorithms are computationally

expensive and do not scale to large graphs or motifs. We evaluate SUBDUE and FSG

experimentally in Section 2.4.

2.3 Method

In this section we describe our method. Section 2.3.1 presents an overview of our

algorithm. Section 2.3.2 explains the mechanism we use to grow motifs by joining smaller

motifs. Section 2.3.3 describes how we count disjoint motif instances. Section 2.3.4 presents

filtering techniques we implement to avoid costly isomorphism tests. Section 2.3.5 discusses

the complexity analysis of our method.


2.3.1 Algorithm Overview

In this section, we provide an overview of our method for discovering motifs. At the heart

of our method lie four unique graph patterns. We call them the basic building patterns because

we use them as a guide to construct larger motifs of arbitrary sizes and topologies. Figure 2-2

presents these basic building patterns. We explain why we use these four specific patterns in

Section 2.3.2 in detail.

Algorithm 2.1 presents the pseudo-code of our method. We elaborate on each key step

of our method in subsequent sections. The algorithm takes a graph G, the number of nodes

of the target motif µ, and the minimum acceptable motif frequency α as input. For each

of the four basic building patterns, it first locates all subgraphs in G that are isomorphic to

that pattern (Line 1). Let us denote the set of instances of the ith pattern (i ∈ {1, 2, 3,

4}) with Si. In each set Si, it is possible to have overlapping subgraphs. It then extracts the

maximum set of edge-disjoint subgraphs in each set Si (Line 2; see Section 2.3.3 for details).

Let us denote the resulting set with S′i for the ith pattern. Notice that the cardinalities of

the sets Si and S′i are the F1 and F2 measures of the ith pattern respectively. The union of

all the sets S′i constitutes the current motif instances as well as the basic building pattern

instances at this point (Line 3). The algorithm then iteratively grows the current motif set.

At each iteration, it joins the current motif set with the basic building pattern set (Line 9).

More specifically, a motif instance and a basic building pattern instance join if they share at least one

edge. Joining two such subgraphs either creates an instance of a pattern which already exists in the current

set (Line 10) or a new pattern (Line 12). At each iteration, after growing the current set, it

filters the overlapping subgraphs to identify the MIS for each pattern (Line 18). The algorithm

removes all patterns with frequency lower than the user-supplied cutoff (Line 21). It reports

the frequent patterns that have as many nodes as the target motif size (Line 23). The

algorithm terminates when the current set cannot be grown to yield any other patterns which

satisfy the target motif (i.e., each pattern in the current set is either larger than the target

motif size or its frequency is lower than the user-specified frequency).


Algorithm 2.1. Motif discovery algorithm

Input:

• Target motif size µ

• Frequency threshold α

• Input graph G = (V,E)

Output:

• Motif topologies, and their instance subgraphs, each of which has µ nodes and F2 ≥ α

1: BPSf1 = getAllSubgraphs-Isomorphic-to-BasicPatterns()
2: BPS = extract-maxDisjointSubgraphs-PerPattern(BPSf1)
3: CurrentSet (CS) = BPS
4: newSet (NS) = ∅
5: while CS has new patterns and at least one of them has fewer than µ nodes and F2 ≥ α do
6:     for each pattern p1 in CS do
7:         for each pattern p2 in BPS where p2 ≠ p1 do
8:             for each subgraph s1 ∈ p1 and s2 ∈ p2 do
9:                 s3 = join(s1, s2)
10:                if s3 ∈ existing pattern P then
11:                    add s3 to P in NS if not duplicate
12:                else
13:                    create Pnew with the topology of s3, add s3 to Pnew in NS
14:                end if
15:            end for
16:        end for
17:    end for
18:    CS = extract-maxDisjointSubgraphs-PerPattern(NS)
19:    for each pattern p1 ∈ CS do
20:        if F2 of p1 < α then
21:            delete p1 and all subgraphs ∈ p1
22:        else if number of nodes of p1 = µ then
23:            put p1 and all subgraphs ∈ p1 in the output
24:        end if
25:    end for
26:    NS = ∅
27: end while

2.3.2 Joining Patterns to Find Larger Patterns

Here, we describe one join iteration of our method, that is, the process of joining the subgraphs

of current set of patterns with the subgraphs of the basic building patterns to construct larger

patterns. At the end of the iteration, the resulting set of subgraphs becomes the current set of

subgraphs for the next join iteration.

Recall that we join two subgraphs only if they share at least one edge. Joining two such

subgraphs either yields a pattern that is isomorphic to one of the existing patterns or a new

one. In the former case, we consider the set of subgraphs S isomorphic to that pattern. We

check if the new subgraph is already in S. If it is in S, we discard it. Otherwise, we store it in

S. In the latter case (i.e., the pattern is observed the first time), we save this as a new pattern

and also keep the corresponding subgraph.
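The join and the duplicate check described in this paragraph can be sketched as follows. This is a minimal illustration that represents a subgraph by its edge set; the names `as_subgraph`, `join`, and `add_if_new` are hypothetical, and the assignment of the joined subgraph to a pattern (which requires an isomorphism test) is deliberately left out.

```python
def as_subgraph(edges):
    # A connected subgraph is determined by its set of undirected edges.
    return frozenset(frozenset(e) for e in edges)


def join(s1, s2):
    # Two subgraphs join only if they share at least one edge.
    if s1 & s2:
        return s1 | s2
    return None


def add_if_new(instances, subgraph):
    # Discard the subgraph if an identical one (same edge set) is stored.
    if subgraph in instances:
        return False
    instances.add(subgraph)
    return True


# Example on the graph of Figure 2-1A.
triangle = as_subgraph([("e", "f"), ("f", "g"), ("e", "g")])
m1_first = as_subgraph([("e", "g"), ("e", "d")])
m1_second = as_subgraph([("e", "f"), ("d", "e")])
m1_far = as_subgraph([("a", "b"), ("a", "c")])

instances = set()
first = join(triangle, m1_first)     # triangle plus edge (e, d)
second = join(triangle, m1_second)   # the same subgraph via another join
stored_first = add_if_new(instances, first)
stored_second = add_if_new(instances, second)
print(stored_first, stored_second, join(triangle, m1_far))
```

Two different joins can produce the identical subgraph, which is why the membership check precedes storage.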

Notice that, although the subgraphs in S do not overlap prior to join, this may no longer

hold after new subgraphs are inserted into S. At the end of each join iteration, we select the

MIS for each pattern. We defer the discussion on how we do this to Section 2.3.3. We then

remove the patterns with F2 values below the user-supplied frequency threshold, α. This

eliminates non-promising patterns, and thus, reduces the number of candidate patterns for the

next join iteration. Using the F2 measure ensures that patterns maintain the downward closure

property. Thus, non-frequent patterns will never grow to yield frequent patterns.


Why do we need different equivalence classes? If the motif frequency is measured using

F1, it is sufficient to join the subgraphs belonging to existing patterns with only those which

belong to the same equivalence class of the simple pattern with two edges (see Figure 2-2A)

to construct any larger pattern. This however is not true when F2 (or F3) is used to count

the motif frequency. To understand the rationale behind this, recall that each equivalence

class represents a set of disjoint isomorphic subgraphs. As a result, no two subgraphs from the

same equivalence class join because they do not share any edges. Therefore we need more than one

equivalence class to construct new and larger patterns.

Given that we need multiple patterns, next, we seek the answer to the following question:

What is the smallest set of patterns which can be used to produce arbitrarily large topologies by

joining them? Here we outline the key steps of the proof that the four basic building patterns,

presented in Figure 2-2, suffice to construct any larger pattern. That said, we do not guarantee

to find all copies of such patterns in the target network.

Figure 2-3. All patterns which can be constructed with four undirected edges.

Before we discuss our induction steps, we explain our strategy on a specific motif size of

four to improve the clarity of the discussion on induction. Figure 2-3 shows all the possible

patterns which can be constructed with four undirected edges. A careful inspection shows that

each one is an overlapping combination of two of the basic building patterns. For instance,

the pattern in Figure 2-3A can result from joining the basic pattern in Figure 2-2A with the

basic pattern in Figure 2-2C. It is worth noting that we can construct some of the patterns in

Figure 2-3 by joining two different pairs of basic building patterns. This redundancy ensures

we can still locate a specific pattern even if one of those pairs does not exist. Therefore, our

method can construct any pattern with four edges from patterns with three or two edges.

We conduct our proof for the arbitrary pattern size by induction.


Basis. The four basic patterns in Figure 2-2 constitute all possible graph topologies with

two or three edges.

Induction step. We assume that our method can construct any pattern with up to k

edges (k ≥ 3). We next show that any pattern with k + 1 edges can be constructed by joining

a pattern with k edges with one of the basic building patterns.

Recall that the downward closure property states that smaller patterns are at

least as frequent as the larger ones that contain them according to F2 (Theorem 2.1). This means that

if a pattern with k + 1 edges is frequent, then so is any of the k edge patterns obtained by

removing an edge from that pattern.

Consider a graph G and a copy of a pattern P1 of size k edges in G, S1. Also, consider a

copy of a pattern P2 with k + 1 edges such that P2 contains P1 and one additional edge. Let

us denote this additional edge with (a, b). We need to show that P2 can be obtained from P1

by joining it with at least one of the basic patterns.

Figure 2-4. Constructing patterns with k + 1 edges. A) A subgraph S2 in a hypothetical graph G. S2 is isomorphic to a pattern P2 of size k + 1 edges. If we remove the additional edge (a, b), we obtain S1, which is isomorphic to P1 where P1 ⊂ P2. Notice that S1 could have k − 1 arbitrary edges other than (b, c). Here we obtain S2 by joining S1 with the subgraph {(a, b), (b, c)}, which belongs to the M1 equivalence class (Figure 2-2A). B) If the join in A) fails, we inspect deg(c) and deg(b) in S1. The first possibility is that deg(c) > 1. This means that the subgraph {(b, c), (c, d)} exists. We can then join S1 with the subgraph {(a, b), (b, c), (c, d)}, which belongs to the M4 equivalence class (Figure 2-2D), to obtain S2, which is isomorphic to a pattern P2 of size k + 1 edges. C) The second possibility is that deg(b) > 1. This means that the subgraph {(b, c), (b, d)} exists. We can then join S1 with the subgraph {(a, b), (b, c), (b, d)}, which belongs to the M3 equivalence class (Figure 2-2C), to obtain S2.


Since both P1 and P2 are connected graphs, at least one of the two nodes a and b has

an edge in P1. Without loss of generality, let us assume that b has an edge

(b, c) in P1. Figure 2-4A illustrates the two edges (a, b) and (b, c).

First, we consider using the basic pattern M1 in Figure 2-2A in the join operation. In

this case, a copy of M1, {(a, b), (b, c)}, joins with S1 through the common edge (b, c), which

results in the pattern P2 with k + 1 edges. This join however occurs only if the subgraph

{(a, b), (b, c)} is included in the F2 count of M1 (i.e., within the chosen non-overlapping

copies of M1).

If this condition fails, we consider the degrees of the two nodes b and c in pattern P1. We

start with node c. Let us denote the degree of a node with function deg() (e.g. deg(c) is the

degree of node c in pattern P1).

If deg(c) > 1, then c has at least one more edge on top of (b, c). Let us denote this edge

with (c, d) (Figure 2-4B). In this scenario, we join a copy of the motif M4 (Figure 2-2D),

{(a, b), (b, c), (c, d)} (if this copy exists in the F2 count of M4) to obtain P2.

Finally, if deg(c) = 1, it is guaranteed that deg(b) > 1. This is because S1 has k ≥ 3 edges: if both

nodes b and c had degree one, the edge (b, c) would be disconnected from the rest of S1, and S1 could not be a connected subgraph. Let us denote one of

the additional edges of b with (b, d) (Figure 2-4C). In this case, we join the subgraph that is

isomorphic to the pattern M3, {(a, b), (b, c), (b, d)}, with S1 to obtain P2. We can do this if

this copy exists in the F2 count of M3.

In summary, we conclude that any pattern P2 with k + 1 edges can be constructed by

joining a pattern P1 with k edges (or k − 1 edges) and one of the basic building patterns to

obtain the additional edge (or edges) if at least one of the many possible scenarios holds. We

however cannot guarantee that the joins will find all of the instances of the k + 1 edge pattern

in the target graph.
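The case analysis of the induction step can be illustrated with a short sketch that lists, for a subgraph S1 and an additional edge (a, b) with b in S1, the copies of M1, M4, and M3 whose join with S1 yields S1 plus (a, b). The function `candidate_joins` and its helpers are hypothetical names, the sketch assumes (a, b) is not already in S1, and it ignores whether a candidate copy actually survives in the F2 count of its pattern.

```python
def edge(u, v):
    return frozenset((u, v))


def neighbors(subgraph, v):
    # Nodes adjacent to v inside the given subgraph.
    return sorted(next(iter(e - {v})) for e in subgraph if v in e)


def candidate_joins(s1, a, b):
    new_edge = edge(a, b)
    found = set()
    for c in neighbors(s1, b):
        bc = edge(b, c)
        # Case 1: the two-edge pattern M1, {(a, b), (b, c)}.
        found.add(("M1", frozenset({new_edge, bc})))
        # Case 2: deg(c) > 1, the path pattern M4, {(a, b), (b, c), (c, d)}.
        for d in neighbors(s1, c):
            if d not in (a, b):  # d == a would close a triangle instead
                found.add(("M4", frozenset({new_edge, bc, edge(c, d)})))
        # Case 3: deg(b) > 1, the star pattern M3, {(a, b), (b, c), (b, d)}.
        for d in neighbors(s1, b):
            if d != c:
                found.add(("M3", frozenset({new_edge, bc, edge(b, d)})))
    return found


# S1 with k = 3 edges; the additional edge is (a, b).
S1 = frozenset({edge("b", "c"), edge("c", "d"), edge("b", "f")})
target = S1 | {edge("a", "b")}
candidates = candidate_joins(S1, "a", "b")
for name, copy in sorted(candidates, key=lambda item: (item[0], sorted(map(sorted, item[1])))):
    print(name, sorted(sorted(e) for e in copy))
```

Every candidate shares at least one edge with S1 and adds only the edge (a, b), so any one of them suffices to construct the larger subgraph.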

Recall that as we aim to calculate the frequency of a given motif using F2, there is no self

join of any pattern. Thus, the basic building patterns set is the smallest such set of patterns, as we

cannot construct one of those four patterns using the three other patterns. More specifically,

this means that we cannot use only one of those four basic building patterns to construct

larger patterns by joining pairs of subgraphs belonging to that pattern’s equivalence class. This

is because if we join the embeddings of a single motif topology (such as the first pattern in

Figure 2-2A) we cannot get any larger pattern as they do not share any edges.

2.3.3 Finding MIS: Going from F1 to F2

Here, we explain how we compute the F2 frequency for a given pattern. We use two

algorithms for this purpose. We explain why we have two separate algorithms later in

this section after describing the two algorithms. The first one is a heuristic used in the

literature (36). This algorithm constructs a new graph, called the overlap graph for each

pattern as follows. Each node in the overlap graph of a pattern denotes an embedding of that

pattern in the target graph. We add an edge between two nodes of the overlap graph if the

corresponding embeddings represented by those nodes overlap in the original graph. Once the

overlap graph is constructed, the algorithm starts by selecting the node with the minimum

degree (i.e. overlaps with the minimum number of embeddings) in the overlap graph. We

include the subgraph represented by this node in the edge-disjoint set. We then delete that

node along with all of its neighboring nodes in the overlap graph. We update the degree of the

neighbors of the deleted nodes. We repeat this process of picking the smallest degree node and

shrinking the overlap graph until the overlap graph is empty.
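The heuristic described above can be sketched in a few lines. This is an illustrative version under simple assumptions, not the implementation of (36): embeddings are given as edge lists, ties between nodes of equal degree are broken by input order, and the name `greedy_disjoint` is hypothetical.

```python
def greedy_disjoint(embeddings):
    # Normalize each embedding to a set of undirected edges.
    embs = [frozenset(frozenset(e) for e in s) for s in embeddings]

    # Build the overlap graph: one node per embedding, one edge per
    # pair of embeddings that share at least one edge.
    adj = {i: set() for i in range(len(embs))}
    for i in range(len(embs)):
        for j in range(i + 1, len(embs)):
            if embs[i] & embs[j]:
                adj[i].add(j)
                adj[j].add(i)

    chosen = []
    while adj:
        # Pick the node that overlaps with the fewest remaining embeddings.
        v = min(adj, key=lambda u: (len(adj[u]), u))
        chosen.append(v)
        removed = adj[v] | {v}
        # Delete the node and its neighbors, then update remaining degrees.
        for u in removed:
            del adj[u]
        for u in adj:
            adj[u] -= removed
    return [embeddings[i] for i in chosen]


# Embeddings of the pattern in Figure 2-1C.
S1 = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "e")]
S2 = [("e", "f"), ("f", "g"), ("e", "g"), ("e", "d")]
S3 = [("e", "f"), ("f", "g"), ("e", "g"), ("b", "e")]

selected = greedy_disjoint([S1, S2, S3])
print(len(selected))
```

The pairwise overlap test in the construction loop is quadratic in the number of embeddings, which is the cost discussed in the next paragraph.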

The algorithm described above works well for patterns with a small number of embeddings.

It however becomes computationally impractical as the number of embeddings of the

underlying pattern gets large. This is because both constructing the overlap graph (particularly

identifying its edges) and updating it are computationally expensive tasks. Therefore, we

use this algorithm for all patterns except for the basic building patterns (whose number of

embeddings is often too large).

Figure 2-5. Algebraic calculation of the frequency of one basic pattern. A) One of the basic building patterns. B) A hypothetical graph that contains subgraphs isomorphic to the pattern M1 in A).

The second algorithm addresses the scalability issue of the first one. This scalability

issue is imposed by the expensive task of calculating the degree of each node in the overlap

graph (i.e., the number of overlaps of each embedding). Recall from the previous algorithm

that this number is considered as a loss value when selecting the node (i.e., embedding)

with the minimum degree (i.e., number of overlaps) to include in the final MIS of the pattern

under consideration. Briefly, the second algorithm we introduce here avoids the expensive

task of explicitly testing each embedding for overlaps. It does this by

algebraically computing the number of overlaps instead of performing actual overlap tests. Once we

compute the node degrees of the overlap graph, this algorithm selects the disjoint embeddings the

same way as the former algorithm. More specifically, the algorithm selects the

node with the minimum degree and includes its corresponding embedding in the final MIS. It

then removes the neighboring nodes of that node from the overlap graph. It repeats this process

until the overlap graph is empty. Next, we explain how we compute the degree of a node in

the overlap graph for the pattern M1 in Figure 2-2A. Our computation is similar for the other

three basic building patterns, yet tailored towards their specific topologies (the derivations are shown

in the appendix). Figure 2-5 shows a hypothetical subgraph S1 = {(a, c), (b, c)} in the input graph

G which is isomorphic to M1. This subgraph is represented by a node in the overlap graph of

M1’s embeddings. Let us denote the degree of a node in the original graph G with function

d() (e.g., d(vi) is the degree of node vi). Another embedding of M1 in G overlaps with S1

only if it contains the edge (a, c) or (b, c). Any other edge in G connected to the middle node c

forms two overlapping embeddings, one with the edge (a, c) and the

other with the edge (b, c). We exclude the edges that belong to S1 itself (i.e., the

embedding whose number of overlaps we want to calculate) from the edges of G

considered for the overlapping embeddings. Thus, by excluding the two edges

(a, c) and (b, c) from c’s degree, node c yields 2 × (d(c) − 2) overlaps. In addition, any edge

incident to node a forms an embedding when combined with the edge (a, c). Excluding

the edge (a, c), node a yields d(a) − 1 overlaps. Similarly, node b produces d(b) − 1 overlaps.

Thus, the total number of overlaps for the embedding S1 = {(a, c), (b, c)}, combined over the

edges of its three nodes {a, b, c}, is

2(d(c) − 2) + (d(a) − 1) + (d(b) − 1) = 2d(c) + d(a) + d(b) − 6.
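As a sanity check, the closed-form count for M1 can be compared against explicit enumeration on the small graph of Figure 2-1A. The sketch below assumes a simple undirected graph; the function names are hypothetical, and the enumeration is included only to validate the formula, since avoiding it is the whole point of the algebraic approach.

```python
from itertools import combinations

# Graph G of Figure 2-1A.
EDGES = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "e"),
         ("e", "d"), ("e", "f"), ("f", "g"), ("e", "g")]
EDGE_SETS = [frozenset(e) for e in EDGES]

# Node degrees d() in G.
degree = {}
for u, v in EDGES:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1

# Every embedding of M1 is a pair of edges that share exactly one node.
M1_EMBEDDINGS = [frozenset((e1, e2))
                 for e1, e2 in combinations(EDGE_SETS, 2)
                 if len(e1 & e2) == 1]


def overlaps_by_formula(embedding):
    e1, e2 = tuple(embedding)
    (c,) = e1 & e2        # middle node
    (a,) = e1 - {c}       # first end node
    (b,) = e2 - {c}       # second end node
    return 2 * (degree[c] - 2) + (degree[a] - 1) + (degree[b] - 1)


def overlaps_by_enumeration(embedding):
    # Count the other embeddings that share at least one edge with this one.
    return sum(1 for other in M1_EMBEDDINGS
               if other != embedding and other & embedding)


mismatches = [emb for emb in M1_EMBEDDINGS
              if overlaps_by_formula(emb) != overlaps_by_enumeration(emb)]
print(len(M1_EMBEDDINGS), len(mismatches))
```

The formula needs only three degree lookups per embedding, whereas the enumeration compares each embedding against all others.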

Notice that unlike the first algorithm, the second one requires a unique derivation for

each pattern. Thus, we apply it only to the basic building patterns, for their topologies do not

depend on the input graph. Also, it is worth noting that the basic building patterns typically

have a much larger number of embeddings than the patterns derived by joining

them. Thus, the efficiency of the second algorithm matters more for them than for the patterns

obtained in subsequent iterations (see the experimental results in Section 2.4).

Figure 2-6. The overlap graph based on F2 and F3 frequency measures. A) The overlap graph of the pattern in Figure 2-1C based on the F2 measure of this pattern in the graph in Figure 2-1A. B) The overlap graph of the same pattern based on the F3 measure.

To adapt our method to count non-overlapping embeddings of each pattern according to

F3 instead of F2, we only need to change how we calculate the MIS of this pattern. More

specifically, we change the criterion that two subgraphs overlap if they share at

least one edge to the criterion that two subgraphs overlap if they share at least one node (Section 2.2.1). This

changes the overlap graph constructed using the first method we explain in

this section. In addition, it slightly changes the calculation of the total number of

overlaps of each embedding using the second method we discuss in this section. Practically,

we expect the overlap graph to be denser when we use the F3 measure as compared to that

for the F2 measure. To illustrate this, consider the graph G in Figure 2-1A and the pattern

in Figure 2-1C. This pattern has three embeddings in G, namely S1, S2, and S3, defined by the

sets of edges {(a,b), (a,c), (b,c), (b,e)}, {(e,f), (f,g), (e,g), (e,d)}, and {(e,f), (f,g), (e,g), (b,e)}

respectively. Figure 2-6A and Figure 2-6B represent the overlap graph of this pattern based on

the F2 and F3 measures respectively.

2.3.4 Accelerating Our Algorithm Through Efficient Filters

Recall that at each iteration, our algorithm generates new subgraphs. For each of these

subgraphs, it checks if this subgraph is isomorphic to one of the patterns constructed up to that

iteration. The isomorphism test is a computationally expensive task. Next, we describe how we

avoid a large fraction of these tests.

We develop two canonical labeling strategies for patterns. Canonical labeling assigns

unique labels to the nodes of a given pattern (47). If two patterns are isomorphic, then they

have the same canonical labeling. The converse is however not true. Unlike the isomorphism test,

comparing the canonical labeling is a trivial task. Following from this observation, when

we construct a new subgraph, we first compare its canonical labeling to those of existing

patterns. We then limit the costly isomorphism test to only those patterns which have the

same canonical labeling as the new subgraph.

The first canonical labeling counts the degree (i.e. number of incident edges) of each

node in the given pattern. It then sorts those degrees and keeps them as a vector we call the

degree vector. If two patterns have different degree vectors, then they are guaranteed to have

different topologies. Despite its simplicity, this labeling filters out a large fraction of patterns.

To test its efficiency, we have tested it on random graphs generated using the Barabási-Albert

model (48). We generate 1000 pairs of graphs where the graphs in each pair are non-isomorphic and have

the same number of nodes and edges. The degree vector successfully distinguishes the two graphs in 85% of the 1000

pairs.
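The degree-vector filter is simple enough to sketch directly. The example below applies it to the subgraphs S1, S3, and S4 of Section 2.2.1; `degree_vector` and `may_be_isomorphic` are hypothetical names, and a positive answer from the filter still has to be confirmed by a full isomorphism test.

```python
from collections import Counter


def degree_vector(edges):
    # Sorted list of node degrees within the given subgraph.
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return sorted(deg.values())


def may_be_isomorphic(edges1, edges2):
    # Different degree vectors guarantee different topologies;
    # equal vectors only mean the costly test cannot be skipped.
    return degree_vector(edges1) == degree_vector(edges2)


S1 = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "e")]
S3 = [("e", "f"), ("f", "g"), ("e", "g"), ("b", "e")]
S4 = [("b", "e"), ("d", "e"), ("e", "f"), ("e", "g")]

print(degree_vector(S1), degree_vector(S4))
print(may_be_isomorphic(S1, S3), may_be_isomorphic(S1, S4))
```

Here S1 and S4 are separated without any isomorphism test, while S1 and S3 pass the filter and would proceed to the next check.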


The second canonical labeling extends the first one. It was first introduced in (49).

Consider a pattern P = (V,E). Let us define the distance between two nodes vi, vj ∈ V

as the number of edges on the shortest path that connects vi and vj and denote it with

xij. Let us define the diameter of P as the maximum distance between any two nodes,

and denote it with x. Using this notation, we assign a label to node vi as $\sum_{j \in V} 2^{x - x_{ij} - d(v_j)}$.

Once we compute the labels of all the nodes in the given pattern, we sort them. We call the

resulting vector the nodes vector. Similar to the first labeling above, two isomorphic graphs are

guaranteed to yield the same labeling. We compute and compare the nodes vector with only

the patterns which cannot be eliminated using the first canonical labeling. We then consider

the patterns with identical canonical labels for graph isomorphism.

2.3.5 Complexity Analysis

Here we analyze the complexity of our method. We refer to Algorithm 2.1 as we discuss

the steps of our method. For each step, we explain its complexity. We then combine the

complexities of all steps to obtain the overall complexity of our method. These steps are as follows.

Find all subgraphs isomorphic to each of the four basic patterns (Line 1): In this

step, we analyze each of the four basic patterns separately since they have different topologies.

For the pattern M1 in Figure 2-2A, to get all subgraphs isomorphic to this pattern, we consider all edges connected to each node in the underlying network. We select every combination of two edges connected to each node. Here, we denote the degree of a node with the function d() (e.g., d(vi) is the degree of node vi). Thus, the complexity of collecting subgraphs that are isomorphic to M1 is $\sum_{v_i \in V} \binom{d(v_i)}{2}$. Similarly, for the pattern M3 in Figure 2-2C, we select every combination of three edges connected to each node in G. Thus, the complexity of constructing subgraphs which are isomorphic to M3 is $\sum_{v_i \in V} \binom{d(v_i)}{3}$. For the pattern M2 in Figure 2-2B, we consider each edge eij in G with two nodes vi and vj. We collect the edges of both nodes. We then select one edge connected to vi and one edge connected to vj (on the condition that these two edges are connected at the other end) along with eij to form a subgraph isomorphic to M2. Thus, the complexity of constructing subgraphs that are isomorphic to M2 is $\sum_{e_{ij} \in E} d(v_i) d(v_j)$. Similarly to M2, we perform the same operation to get the subgraphs isomorphic to the pattern M4 in Figure 2-2D. Only this time we make sure that the two edges belonging to vi and vj are not connected with each other at the other end. Thus, the complexity of constructing subgraphs that are isomorphic to M4 is $\sum_{e_{ij} \in E} d(v_i) d(v_j)$. Collectively, the complexity of performing this step is $O(\sum_{v_i \in V} d(v_i)^3 + \sum_{e_{ij} \in E} d(v_i) d(v_j))$. Notice that, theoretically, the worst case scenario happens when $d(v_i) = O(n)$. In this scenario, the complexity of this step becomes $O(n^4)$.
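To make the cost terms of this step concrete, the following Python sketch evaluates them for a small graph. It is illustrative only and is not the C++ implementation used in our experiments. The function name `basic_pattern_candidate_counts` and the edge-list input format are assumptions made for this example. The sketch only counts the edge combinations that the step examines; it does not build the subgraphs or test the triangle condition that separates M2 from M4.

```python
from math import comb


def basic_pattern_candidate_counts(edges):
    """Count the edge combinations examined for the four basic patterns.

    `edges` is a list of undirected edges (u, v) of a simple graph.
    The returned numbers are the cost terms of the step, not the
    number of subgraphs that survive the disjointness filtering.
    """
    # Degree d(v) of every node that appears in at least one edge.
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1

    # M1: every pair of edges incident to the same node.
    m1 = sum(comb(d, 2) for d in degree.values())
    # M3: every triple of edges incident to the same node.
    m3 = sum(comb(d, 3) for d in degree.values())
    # M2 and M4: for every edge (vi, vj), one edge at vi times one edge at vj.
    m2_m4 = sum(degree[u] * degree[v] for u, v in edges)

    return {"M1": m1, "M3": m3, "M2_M4": m2_m4}


if __name__ == "__main__":
    # A star with center 0 and four leaves.
    star = [(0, 1), (0, 2), (0, 3), (0, 4)]
    print(basic_pattern_candidate_counts(star))
```

For the star graph in the usage example, the center has degree four, so the M1 term is 6, the M3 term is 4, and the edge-based term shared by M2 and M4 is 16.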

Extract maximum disjoint set for basic patterns (Line 2): In this step, we use the algebraic

algorithm described in Section 2.3.3 (second one) to calculate the number of overlaps of each

subgraph belonging to each pattern equivalence class. This process takes constant time. We

calculate these algebraic equations as we construct subgraphs in the previous step. We then sort those subgraphs within each equivalence class in decreasing order of their number of overlaps. This process has complexity equal to $O(m \log m)$, where m is the number of subgraphs in each equivalence class. Recall from the previous step that this number is $O(\sum_{v_i \in V} d(v_i)^3 + \sum_{e_{ij} \in E} d(v_i) d(v_j))$. Thus, the complexity of this step is $O((\sum_{v_i \in V} d(v_i)^3) \log(\sum_{v_i \in V} d(v_i)^3) + (\sum_{e_{ij} \in E} d(v_i) d(v_j)) \log(\sum_{e_{ij} \in E} d(v_i) d(v_j)))$.

Join Iterations (Lines 5-27): In this step, we analyze the complexity of one join iteration.

We then summarize the complexity of all join iterations. Let us denote the number of current

patterns in iteration i with xi. Notice that, for the first iteration, x1 = 4. Recall that in

each join iteration, we increase the size of each of the current patterns with one or two

edges. In addition, the patterns of the first join iteration are at least of size 2. Thus, the size

(i.e., number of edges) of each of the current patterns in iteration i is at least i + 2. The number of subgraphs isomorphic to each of the current patterns is at most $|E|/(i+2)$ since they are non-overlapping subgraphs. Recall that the subgraphs of the basic patterns are non-overlapping within each pattern. Thus, the numbers of subgraphs of the patterns M1, M2, M3, and M4 are at most $|E|/2$, $|E|/3$, $|E|/3$, and $|E|/3$ respectively. Collectively, the number of subgraphs of the basic patterns is $O(|E|)$.


In the join iteration, we start by joining subgraphs of current patterns with the subgraphs

of the basic patterns (Lines 6-9). Thus, the total number of joins we perform at iteration i is $O(|E| \cdot \frac{|E|}{i+2} \cdot x_i)$. For each join, we compare the resulting subgraph against all patterns (Line 10). Recall that we use filters to avoid this costly isomorphism check (Section 2.3.4). Thus, the complexity of this operation is $O(x_i)$. If this subgraph is isomorphic to one of the current patterns, we check whether this subgraph is a duplicate of one of the subgraphs which already exist in this equivalence class (Line 11). We search an indexed list of those subgraphs in $O(\log(\frac{|E|}{i+2}))$. Collectively, we obtain the complexity of performing all joins at iteration i by multiplying the three complexities above and get $O(|E| \cdot \frac{|E|}{i+2} \cdot x_i \cdot x_i \cdot \log(\frac{|E|}{i+2}))$, which equals $O(x_i^2 \frac{|E|^2}{i+2} \log(\frac{|E|}{i+2}))$.
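The logarithmic duplicate check in the join step can be illustrated with a sorted list and binary search. The Python sketch below is an assumption about how such an index could look; this chapter does not specify the index structure used by our C++ implementation. The names `canonical_key` and `add_if_new` are hypothetical, and a subgraph is represented simply by its set of edges.

```python
from bisect import bisect_left


def canonical_key(subgraph_edges):
    """Return an order-independent key for a subgraph given as a list of edges.

    Each undirected edge is stored with its endpoints sorted, and the edges
    themselves are sorted, so the same edge set always yields the same key.
    """
    return tuple(sorted(tuple(sorted(edge)) for edge in subgraph_edges))


def add_if_new(index, subgraph_edges):
    """Insert the subgraph into the sorted list `index` unless it is a duplicate.

    The lookup is a binary search, i.e. logarithmic in the number of stored
    subgraphs. Returns True if the subgraph was inserted, False otherwise.
    """
    key = canonical_key(subgraph_edges)
    position = bisect_left(index, key)
    if position < len(index) and index[position] == key:
        return False  # the same edge set is already in this equivalence class
    index.insert(position, key)  # keeps the list sorted
    return True


if __name__ == "__main__":
    index = []
    print(add_if_new(index, [(1, 2), (2, 3)]))  # new subgraph
    print(add_if_new(index, [(3, 2), (2, 1)]))  # same edges, different order
```

Only the lookup is logarithmic here. Inserting into a Python list shifts elements and is linear in the worst case, so a balanced search tree would be needed to keep insertions logarithmic as well.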

Upon completing all join operations, our algorithm extracts the MIS for each pattern (Line

18) using the overlap graph algorithm described in Section 2.3.3 (first one). Notice that we

perform this operation for the new set of patterns, $x_{i+1}$ (the current patterns of the next iteration), for which the number of patterns is at most $|E|/(i+3)$ (this is because each pattern is of size i + 3 and no two patterns overlap). For each pattern, we collect the overlapped subgraphs of each subgraph in $O((\frac{|E|}{i+3})^2)$. We then sort the subgraphs in decreasing order of their number of overlaps in $O(\frac{|E|}{i+3} \log(\frac{|E|}{i+3}))$ time. Thus, we extract the MIS for all patterns in $O(x_{i+1} (\frac{|E|}{i+3})^3 \log(\frac{|E|}{i+3}))$.

Finally, we check each resulting pattern (Lines 19-25) and delete it if its frequency is less

than the threshold α. We perform this step in O(xi+1) time.

Recall that in each join iteration, we increase the size of each of the current patterns with

one or two edges. Also recall that we start with patterns of size at least 2. Thus, the total number of join iterations we perform until all patterns reach at least the target motif size is µ − 2. Thus, the complexity of all join iterations is $O(\sum_{i=1}^{\mu-2} (x_i^2 \frac{|E|^2}{i+2} \log(\frac{|E|}{i+2}) + x_{i+1} (\frac{|E|}{i+3})^3 \log(\frac{|E|}{i+3}) + x_{i+1}))$, or simply $O(\sum_{i=1}^{\mu-2} ([x_i \frac{|E|^2}{i} \log(\frac{|E|}{i+2})] [x_i + \frac{|E|}{i^2}] + x_{i+1}))$.

In summary, the complexity of our method considering all the previous steps is

$$O\Big(\Big(\sum_{v_i \in V} d(v_i)^3\Big)\Big(1 + \log\Big(\sum_{v_i \in V} d(v_i)^3\Big)\Big) + \Big(\sum_{e_{ij} \in E} d(v_i) d(v_j)\Big)\Big(1 + \log\Big(\sum_{e_{ij} \in E} d(v_i) d(v_j)\Big)\Big) + \sum_{i=1}^{\mu-2} \Big(\Big[x_i \frac{|E|^2}{i} \log\Big(\frac{|E|}{i+2}\Big)\Big]\Big[x_i + \frac{|E|}{i^2}\Big] + x_{i+1}\Big)\Big)$$

Notice that $x_i$ here depends significantly on the topology and the density of the given network G. To the best of our knowledge, there is no closed formula that calculates $x_i$ (i.e., the number of unique topologies of a certain size in a given graph G).

2.4 Experimental Results

In this section, we experimentally evaluate the performance of our motif discovery

algorithm on synthetic and real graphs (Section 2.4.1). We measure the running time and

accuracy of our algorithm. We compare our algorithm to two state-of-the-art algorithms, FSG (45) and SUBDUE (35) (Section 2.4.2). We evaluate the statistical significance of the most abundant motif in each of the real graphs (Section 2.4.3). We present a case study of the

motifs identified by our method on Human herpesvirus PPI network (Section 2.4.4). In all of

our experiments, we report the motif frequency using the F2 measure.

Data set. We use real and synthetic datasets in our experiments. The real graphs are the PPI networks of seven organisms taken from the MINT database (50) (see Table 2-1 for

details). We first remove the nodes and edges of these graphs which are guaranteed to not be

a part of the motif to be found. To do that, we filter a subset of the nodes of each network

as follows. We first identify connected subgraphs of each graph. Let us denote the size of the

motif we aim to find with µ. We remove the connected subgraphs with less than µ nodes.

Table 2-1 lists these networks and their sizes after filtering them for µ = 5 (which is the

smallest motif size in all of our experiments).
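The filtering step described above can be sketched in a few lines of Python. This is an illustrative sketch under the assumption that the network is given as a node list and an undirected edge list; the function name `filter_small_components` is hypothetical and the code is not taken from our C++ implementation.

```python
from collections import deque


def filter_small_components(nodes, edges, mu):
    """Drop every connected component that has fewer than `mu` nodes.

    Returns the set of kept nodes and the list of kept edges.
    """
    # Build an adjacency list; nodes without edges stay as isolated entries.
    adjacency = {node: set() for node in nodes}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    kept_nodes = set()
    visited = set()
    for start in adjacency:
        if start in visited:
            continue
        # Breadth-first search collects the component containing `start`.
        component = {start}
        visited.add(start)
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for neighbor in adjacency[current]:
                if neighbor not in visited:
                    visited.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        # A component with fewer than mu nodes cannot contain a motif of mu nodes.
        if len(component) >= mu:
            kept_nodes |= component

    kept_edges = [(u, v) for u, v in edges if u in kept_nodes and v in kept_nodes]
    return kept_nodes, kept_edges


if __name__ == "__main__":
    nodes = [1, 2, 3, 4, 5, 6, 7, 8]
    edges = [(1, 2), (2, 3), (3, 4), (4, 5), (6, 7)]
    print(filter_small_components(nodes, edges, mu=5))
```

In the usage example, the path on nodes 1 to 5 is kept for µ = 5, while the two-node component and the isolated node are removed.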

In addition to the real dataset, we construct synthetic graphs. The purpose of having a synthetic dataset is to systematically evaluate our method by varying network characteristics (network size and density) in a controlled environment. We build this dataset using the Barabási-Albert model (48) because it captures the connectivity patterns of real networks (51; 52; 53). Moreover, this model has been frequently used in the literature to simulate real networks.
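For readers unfamiliar with the model, the following Python sketch generates a graph by preferential attachment in the spirit of the Barabási-Albert model. It is a minimal illustration, not the generator used to build our dataset, which is not specified here. The function name `barabasi_albert_edges`, the seeding scheme (the first new node connects to m initial nodes), and the parameter names are assumptions of this example.

```python
import random


def barabasi_albert_edges(n, m, seed=None):
    """Return the edge list of an n-node preferential-attachment graph.

    Every node added after the first m nodes attaches to m distinct existing
    nodes, chosen with probability proportional to their current degree.
    """
    if not 1 <= m < n:
        raise ValueError("require 1 <= m < n")
    rng = random.Random(seed)

    edges = []
    # Each node appears in `endpoints` once per incident edge, so a uniform
    # draw from this list is a degree-proportional draw over nodes.
    endpoints = []
    targets = list(range(m))  # the first new node links to the m initial nodes
    for new_node in range(m, n):
        for target in targets:
            edges.append((target, new_node))
            endpoints.extend((target, new_node))
        # Choose m distinct targets for the next node.
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(endpoints))
        targets = list(chosen)
    return edges


if __name__ == "__main__":
    graph = barabasi_albert_edges(n=100, m=2, seed=0)
    mean_degree = 2 * len(graph) / 100
    print(len(graph), mean_degree)
```

With m = 2 the graph has (n − 2) · 2 edges, so the mean node degree approaches four as n grows, which corresponds to a density of about two edges per node.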

Table 2-1. The size (number of proteins and interactions) of the PPI networks selected from the MINT database.

Network name             Network code   Number of proteins   Number of interactions
Human herpesvirus 8      hhv-8          48                   82
Campylobacter jejuni     cje            109                  117
Treponema pallidum       tpa            108                  173
Rattus norvegicus        rno            535                  643
Helicobacter pylori      hpy            717                  1472
Escherichia coli         eco            616                  1561
Plasmodium falciparum    pfa            1221                 2577

Implementation and environment. We implement our algorithm in C++ and perform experiments on a computer equipped with a 1.4 GHz AMD Opteron(tm) processor and 500 GB of main memory, running the Linux operating system.

2.4.1 Evaluation of Running Time

In this experiment, we evaluate the running time of our motif discovery algorithm. Our

goal here is to observe the effect of three parameters (graph size, graph density, and motif size) on the running time of our algorithm.

2.4.1.1 Effect of Graph and Motif Size

We evaluate the running time of our method under varying graph and motif sizes using

both synthetic and real datasets.

Results on synthetic graphs. We generate synthetic graphs of varying size (i.e., number of nodes) from 100 to 1000 at increments of 100. We fix the graph density to two edges per node on average (i.e., the mean node degree is set to four). We set the minimum

desired motif frequency, α = 10. We run experiments for motif sizes µ = 5, 10, and 15 and

report the running time. Figure 2-7 presents the results.

Figure 2-7. The total running time of our method for varying graph sizes (number of nodes) and motif sizes. Motif size varies from 5 to 15. The x-axis shows the input graph sizes varying from 100 to 1000. The y-axis shows the total running time in seconds.

The results demonstrate that our method scales well with growing graph and motif sizes.

The running time grows with increasing graph and motif sizes, yet it remains practical for very

large graphs. For motif sizes of 5 and 10, it runs in only several minutes even for the largest

input graph. As the motif size grows, the cost increases. However, our method can identify

very large motifs in a little over a day for massive networks. We observe that the motif size has

more influence on the performance of our method than the input graph size. This is because

the number of alternative motif topologies grows exponentially with the motif size. This is an inherent characteristic of the underlying computational problem. However, even when the motif size is 15, our method retains a practical running time.

Results on real graphs. Next, we test our method on the real dataset. We set the minimum desired motif frequency, α = 5. We run experiments for motif sizes µ = 5, 10, and 15 and report the running time. Figure 2-8 presents the results. Similar to the synthetic dataset results, our method scales to large graph and motif sizes on the real dataset. Note that the number of alternative motif topologies grows exponentially with the motif size. Furthermore, the cost of subgraph isomorphism also grows exponentially with the motif size. Despite these two major complicating factors, the running time of our method increases only by about an order of magnitude when we increase the motif size by five. Finally, the parallel between these results and those in Figure 2-7 suggests that synthetic graphs generated by the Barabási-Albert model have structural properties similar to those of the real PPI graphs.

Figure 2-8. The total running time of our method for the real PPI networks. Network numbers 1 to 7 on the x-axis correspond to the hhv-8, cje, tpa, rno, hpy, eco, and pfa PPI networks respectively. The positions of the PPI networks on the x-axis indicate the sizes of the input graphs (Table 2-1). The y-axis shows the running time in seconds.

2.4.1.2 Effect of Graph Size and Density

Here, we evaluate the effect of varying input graph size and density on the running

time of our algorithm. We use a synthetic dataset in order to control the graph density in this experiment. We generate synthetic graphs with network sizes varying from 100 to 1000 at increments of 100. We set the desired motif frequency α = 5 and the motif size µ = 10. We vary the graph density from one to four, which covers a broad range of biological networks (54). For each input

graph and density value, we report the total running time. Figure 2-9 presents the results.

We observe that the running time increases with growing graph density. As the graph

density increases, the number of alternative embeddings of a given motif grows as well. This

also increases the number of overlapping subgraph pairs, which in turn increases the cost

of finding MIS for each pattern to calculate its F2 frequency (Section 2.3.3). Despite these major complications inherent in the nature of the motif counting problem, our method remains scalable with respect to growing density. These results suggest that our method is reliable and computationally feasible for a broad range of networks with different sizes and densities.

Figure 2-9. The total running time of our method for the synthetic graphs with different graph sizes (number of nodes) and graph densities varying from 1 to 4. The x-axis shows the input graph sizes. The y-axis shows the total running time in seconds.

2.4.2 Comparison with Existing Methods

Here, we compare our method against two methods in the literature which are tailored

towards a problem similar to the one considered in this chapter, namely SUBDUE and

FSG. We measure the running time and accuracy. We compute accuracy in terms of three

parameters, the number of unique motifs found, the average frequency per motif in the target

graph, and the frequency of the most abundant motif.

Of these two methods, for SUBDUE, we only report the accuracy of the result as we

observe that for most datasets and motif sizes, SUBDUE fails to identify motifs (results shown

later in this section). For FSG, we only report the running time. This is because FSG finds

motifs in multiple graphs, limited to at most one embedding per graph. In other words, it

cannot find multiple embeddings of the same motif in a single graph. Therefore, FSG would

yield very low accuracy when applied to a single graph. In the rest of the chapter, we will refer

to our method as MD (Motif Discovery) for simplicity.


2.4.2.1 Comparison with SUBDUE

In this experiment, we analyze the effect of varying input graph and motif sizes on

the accuracy of our method as compared to that of SUBDUE. We use the real dataset in this experiment (Table 2-1). SUBDUE does not allow the user to set a minimum allowable motif frequency parameter. It finds all subgraph topologies of a given size, even those that appear only once. Due to this limitation of SUBDUE, to have a fair comparison, we set α = 1 for our method as well. We follow our earlier definition (see Section 2.2.1), and use motif size

µ to denote the number of nodes in the given motif topology. We run both methods on each

input graph using motif sizes µ = 5, 10, and 15. We report the accuracy of our method as well

as SUBDUE. Figures 2-10, 2-11, and 2-12 present the results of µ = 5, 10, and 15 respectively.

Figure 2-10. The accuracy of our method (MD) and SUBDUE in terms of three measures: A) the number of unique motif topologies found, B) the average frequency per motif in the target graph, and C) the frequency of the most abundant motif. Results are for the motif size µ = 5 on the real dataset (Table 2-1).

Our results for µ = 5 (Figure 2-10) demonstrate that both methods identify a similar number of unique motifs, yet our method outperforms SUBDUE significantly in terms of the average frequency per motif in all cases (Figure 2-10B). When we focus on the most abundant topology of each method, we observe a similar pattern; our method always finds patterns with much higher frequency than SUBDUE in all the experiments (Figure 2-10C). It is worth noting that the motif discovery problem gets exponentially harder with growing motif size. As a result, we expect most algorithms tailored for motif identification to perform well for small motif sizes such as µ = 5. Next, we observe how our method and SUBDUE perform for large values of µ.

[Figure panels: number of motifs, average frequency per motif, and frequency of the most abundant motif, each plotted against network code for SUBDUE and MD.]

As we grow the motif size to µ = 10 (Figure 2-11), the results suggest that the gap between our method and SUBDUE grows rapidly in terms of all three accuracy measures. More importantly, the results also show that in half of the cases, particularly where the input graph size is large, SUBDUE could not find any motifs while our method continues to locate patterns with high frequency. For example, our method was capable of finding motif topologies with frequency over 100 while SUBDUE could not locate any motif (Figure 2-11C).

[Figure panels: number of motifs, average frequency per motif, and frequency of the most abundant motif, each plotted against network code for SUBDUE and MD.]

For a few cases (hhv-8, cje, and tpa; Figure 2-11B), the average frequency per motif of SUBDUE is slightly higher than that of our method. This is because we set the minimum frequency α = 1. Our method locates many topologies which exist only once while SUBDUE fails to locate them. For example, our algorithm finds thousands of unique motif topologies while SUBDUE outputs only 8 motif topologies for the hhv-8 organism (Figure 2-11A). As a

result, these unique topologies pull the average frequency down. That said, Figure 2-11C

confirms that our method can identify motifs which are more frequent than those found by

SUBDUE even for those organisms.

As we further increase the motif size to µ = 15 (Figure 2-12), the advantage of our method becomes more evident. We observe that SUBDUE could not find any motifs in any of the graphs except for tpa’s PPI network. On the other hand, our algorithm not only identifies a massive number of patterns (Figure 2-12A), but some of these patterns also have very large frequencies (Figure 2-12C).

In summary, the results demonstrate that our method scales to large input graph and

motif sizes and continues to locate patterns with high frequency for a broad range of motif and

input graph sizes while SUBDUE fails to do so.

2.4.2.2 Comparison with FSG

In this experiment, we compare the effect of different input graph and motif sizes on the running time of our algorithm and that of FSG. We use the real dataset in this experiment (Table 2-1). The FSG method requires multiple graphs as input. It defines the frequency of a motif topology as the number of different graphs in which this motif appears. Since our method operates on one input graph, we set the desired motif frequency α = 1 to be consistent with FSG. FSG defines motif size as the number of edges in the given motif. To be consistent with FSG, we use µ to denote the number of edges in the motif in this experiment. We run both methods on each input graph using motif sizes µ = 7, 8, and 9. We report the running time of our method (MD) as well as FSG. We do not run experiments for µ > 9 as FSG, unlike our method, fails to scale to large motif sizes. Figure 2-13 presents the results.

We observe that our method (MD) is orders of magnitude faster than FSG, particularly

in large motif sizes. The running time of our method increases slowly with both motif size and

the graph size. On the other hand, the running time of FSG increases slowly with the input

graph size, but very rapidly with the motif size. Only for a few cases of small motif sizes (i.e., ≤ 7 edges) does FSG perform better than our method. This is due to the overhead of calculating F2 for the basic building patterns, where the number of overlapped embeddings is huge. That said, the running time difference in those cases is negligible. These results suggest that our method outperforms FSG in terms of running time for a broad range of real biological input networks with different sizes. This performance advantage is further magnified by the fact that our method can find multiple embeddings of each motif while FSG finds only one. The two main reasons why our method is significantly faster than FSG are that our method (i) does not calculate the frequency of each new pattern by locating the copies of this pattern in the network using subgraph isomorphism as FSG does, and (ii) ensures that every generated pattern exists at least once in the underlying graph.

Figure 2-13. The total running time of our method (MD) and FSG for the real PPI networks (Table 2-1) and µ = 7 (top left), 8 (top right), and 9 (bottom). The y-axis shows the running time in seconds.

2.4.3 Evaluation of Statistical Significance

In this experiment, we evaluate the statistical significance of the most abundant motif

identified by our method in each of the six PPI networks (Table 2-1). We compute the

statistical significance of the abundance of the most frequent motif of a given size using two alternative approaches. Each of these two approaches measures a different aspect of the

significance.

The first approach measures the statistical significance of the frequency of the most abundant motif with respect to the abundances of all motifs with the same size in the same graph. More specifically, given a target graph G = (V,E) and motif size µ, we first find all motifs of size µ in G. Assume that there are a total of m such motifs. Let us denote the frequencies of these motifs with $x_1, x_2, \ldots, x_m$, with $x_1$ being the largest among all. Let us denote the mean and standard deviation of these m frequency values with $\bar{x}$ and $\sigma$. We report the z-score of the frequency of the most abundant motif as $(x_1 - \bar{x})/\sigma$.
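The first z-score can be computed directly from the list of motif frequencies, as the following Python sketch shows. The function name `top_motif_zscore` is hypothetical, and the use of the population standard deviation is an assumption of this example; the text above does not state whether the population or the sample standard deviation is used.

```python
from statistics import mean, pstdev


def top_motif_zscore(frequencies):
    """Z-score of the largest frequency relative to all given frequencies.

    `frequencies` holds the frequency of every motif of the same size found
    in one graph. The population standard deviation is assumed here.
    """
    top = max(frequencies)
    average = mean(frequencies)
    sigma = pstdev(frequencies)
    if sigma == 0:
        raise ValueError("all frequencies are equal; the z-score is undefined")
    return (top - average) / sigma


if __name__ == "__main__":
    # One motif with frequency 10 and three motifs with frequency 2.
    print(top_motif_zscore([10, 2, 2, 2]))
```

In the usage example the mean is 4 and the population standard deviation is the square root of 12, so the z-score equals the square root of 3, which is below the threshold of 2.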

The second approach measures the statistical significance of the frequency of the most

abundant motif in the original graph with respect to those in the random ensemble of graphs

of the same size and degree distributions. More specifically, given a target graph G = (V,E)

and motif size µ, let us denote the frequency of the most abundant motif of this size in G

with x. We construct a set of n random networks from G through degree preserved edge

shuffling (55; 56). Note that degree preserved edge shuffling is an iterative technique, which is

often used in the literature to construct random network topologies with same size and degrees

as a given target graph G = (V,E). At each iteration of this technique, we randomly pick two

edges from E. Let us denote these edges with (v1, v2) and (u1, u2), where v1, v2, u1, u2 ∈ V .

We remove these two edges from E and insert two new edges (v1, u2) and (u1, v2). This

way as the network topology evolves randomly, we ensure that the degrees of all the nodes

remain unchanged. We repeat these iterations a large number of times (exactly 10 × |E| times) to randomize the entire network. Using the strategy above, we generate 100 random graphs, denoted with $G_1, G_2, \ldots, G_{100}$. For each random graph $G_i$, we measure the frequency of the most abundant motif of size µ. Let us denote this number as $x_i$. Let us denote the mean and standard deviation of these 100 frequency values with $\bar{x}$ and $\sigma$. We report the z-score of the frequency of the most abundant motif as $(x - \bar{x})/\sigma$.
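The degree preserved edge shuffling described above can be sketched as follows in Python. This is an illustration rather than the code used in our experiments. The function name `degree_preserving_shuffle` and its parameters are hypothetical. The sketch additionally assumes that a swap is skipped when it would create a self-loop or a duplicate edge, and that skipped attempts still count toward the 10 × |E| iterations; neither detail is stated in the description above.

```python
import random


def degree_preserving_shuffle(edges, swaps_per_edge=10, seed=None):
    """Randomize a simple undirected graph while keeping every node degree.

    Each iteration picks two edges (v1, v2) and (u1, u2) and replaces them
    with (v1, u2) and (u1, v2). The input needs at least two edges.
    """
    rng = random.Random(seed)
    edge_list = [tuple(edge) for edge in edges]
    present = {frozenset(edge) for edge in edge_list}

    for _ in range(swaps_per_edge * len(edge_list)):
        a, b = rng.sample(range(len(edge_list)), 2)
        v1, v2 = edge_list[a]
        u1, u2 = edge_list[b]
        # Assumption: reject swaps that would produce a self-loop.
        if v1 == u2 or u1 == v2:
            continue
        new_a = frozenset((v1, u2))
        new_b = frozenset((u1, v2))
        # Assumption: reject swaps that would produce a duplicate edge.
        if new_a in present or new_b in present:
            continue
        present.discard(frozenset((v1, v2)))
        present.discard(frozenset((u1, u2)))
        present.add(new_a)
        present.add(new_b)
        edge_list[a] = (v1, u2)
        edge_list[b] = (u1, v2)
    return edge_list


if __name__ == "__main__":
    original = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 3)]
    print(degree_preserving_shuffle(original, seed=1))
```

Each accepted swap removes one edge end from each of the four nodes involved and gives one back, so every node keeps its degree. Repeating the function 100 times with different seeds would yield an ensemble like the one used for the second z-score.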

For both of the approaches above, we assume that a z-score above 2 or below -2 implies

high statistical significance (i.e., two standard deviations away from the mean). The larger

the magnitude of the z-score is, the more significant the result is. Tables 2-2 and 2-3 present the z-score for each combination of the six PPI networks and three motif sizes (µ = 5, 10, 15) using the first and the second approach described above respectively.


Table 2-2. The z-scores that represent the significance of the most abundant motif against the other motifs in the same network, for each PPI network using three motif sizes.

Network code   Motif size = 5   Motif size = 10   Motif size = 15
hhv-8          1.52             14.00             4.67
cje            1.41             5.53              12.12
tpa            1.45             7.19              3.36
rno            1.58             4.31              9.74
hpy            1.54             13.70             9.003
pfa            1.87             35.32             7.43

Table 2-2 suggests that, for small motif size (i.e. µ = 5), the most abundant motif is not

significantly more frequent than other motifs of the same size. However, as motifs get large

in size (i.e. µ = 10 and 15), the gap between the frequency of the most abundant motif and

the rest of the motifs becomes highly significant. This implies that larger motifs characterize

topological properties of PPI networks better than small motifs. This is because when the motif size is small, different motifs have similar frequency values, and thus they cannot be statistically different in abundance from each other. On the other hand, for large motif sizes, although

the number of unique motif topologies is large, they vary a lot in their abundances; the most

frequent one gets significantly more abundant than the rest.

Table 2-3. The z-scores that represent the significance of the most abundant motif against the most abundant motifs in 100 random networks, for each PPI network using three motif sizes.

Network code   Motif size = 5   Motif size = 10   Motif size = 15
hhv-8          2.79             -0.54             -2.83
cje            2.32             0.99              -0.82
tpa            3.21             5.27              2.83
rno            -0.49            -4.02             -4.83
hpy            22.42            8.61              6.15
pfa            10.53            5.16              4.80

Table 2-3 shows that, for most of the PPI network and motif size combinations, the

most abundant motif is highly over-represented in the original network compared to random

networks. In three cases (Rattus norvegicus, µ = 10 and 15, and Human herpesvirus 8, µ = 15), we observe that the most abundant motif is significantly under-represented. These

results demonstrate that the motif abundance in PPI networks is not random for nearly all


combinations we tested. Thus, studying these structures has great potential to help understand

how these networks function. Among the six PPI networks, Rattus norvegicus stands out

to be the one with consistently under-represented or random motif abundance. The PPI network of Helicobacter pylori consistently has the most significant motif abundance for all motif

sizes. This indicates that the interactions in this network follow a regular pattern repeating

themselves at different locations of the network. Finally, notice that the two z-score values

reported in Tables 2-2 and 2-3 do not follow the same pattern (that is a high z-score according

to one measure does not imply a high value for the other). This implies that the frequencies

of different motifs (i.e., including the ones which are not most abundant) in these PPIs differ

from those in random networks. In other words, the PPI networks topologically deviate from

random networks.

2.4.4 Case Study on Human Herpesvirus

Table 2-4. Each row lists the UniProt IDs of the proteins in an embedding of the most abundant motif of size 10 found by our method in the hhv-8 PPI network.

O40944 P88947 P88935 P88951 P88960 P88940 P90489 P88918 P90495 P88902
O40910 O40944 P88947 P88929 P88920 P88925 P88927 P90486 P88918 P88954
P88918 P88919 P88929 P88948 P88920 P88950 O36551 P88942 Q98141 P88954
O40944 Q98141 P88920 P88951 P88954 P88947 P88948 P88958 P88939 P88944

Here we briefly analyze the motifs identified by our method on the PPI network of hhv-8, the virus which causes Kaposi’s sarcoma. We choose this organism in our case study as it has the smallest PPI network among the organisms in our dataset (Table 2-1). Notice from Figure 2-11C that despite its small size (48 nodes and 82 edges), hhv-8 has four disjoint embeddings of a very large motif with 10 nodes, covering a significant fraction of its PPI network. This raises the question of whether there is a fundamental recurring function that hhv-8 serves and is covered through the evolutionary process with high redundancy. Figure 2-14 presents the structure of those four embeddings. Each row of Table 2-4 lists the UniProt IDs of the ten proteins that contribute to each of those embeddings. Analysis of these proteins in the Gene Ontology database (57) reveals that three of those four embeddings each contain two proteins, one responsible for viral DNA packaging (O40944 or P88919) and one responsible for virion assembly (P88954). Without either process, no infectious progeny virus could be formed (58). Several studies use these two processes as targets to identify effective inhibitors. The existence of these two processes in each of the three instances reflects the functional importance of the motif topology found. These results suggest that our algorithm can find significant and valuable motifs which can be used to detect key functions governed by the network processes.

Figure 2-14. The organization of the four isomorphic subgraphs of 10 nodes in the hhv-8 PPI network. Each subgraph has a different color and pattern.

2.5 Discussion

In this chapter, we developed a scalable method to solve the motif identification problem

given an input graph, desired motif size µ, and minimum frequency of desired motif α. We

proposed a set of small patterns, which we call basic building patterns, each containing two or three edges. We proved that any motif with four or more edges can be constructed as a join of these patterns. Our method first locates instances of the basic building patterns. It then iteratively grows the motifs known at that iteration by joining them with the instances of these patterns. We developed efficient mechanisms to avoid a significant fraction of the costly isomorphism tests. We also introduced a new and efficient strategy for solving the MIS extraction problem. We analyzed the time complexity of our method based on the number of nodes and edges in the target network and the number of frequent motifs at each iteration. Our experiments on PPI networks from MINT comprehensively demonstrated that our method is significantly faster and more accurate than the existing methods. Furthermore, we observed using synthetic networks that the running time of our algorithm remains reasonable as the size and the density of the target network grow. We also showed using PPI networks that the increase in the running time of our algorithm is dramatically less than that of the competing methods as the motif size grows. We evaluated the statistical significance of the most abundant motif of PPI networks resulting from our algorithm.


CHAPTER 3
APPLICATION OF MOTIFS IDENTIFICATION

In this chapter, we address two applications of the motif identification problem. The

first application is Motifs in the Assembly of Food Web Networks (Section 3.1). The second

application is Motif Centrality in Food Web Networks (Section 3.2).

3.1 Motifs in The Assembly of Food Web Networks

3.1.1 Preface

The assembly of local communities from regional pools is a multifaceted process that

involves the confluence of interactions and environmental conditions at the local scale and

biogeographic and evolutionary history at the regional scale (59). Understanding the relative

influence of these factors on community structure has remained a challenge and mechanisms

driving community assembly are often inferred from patterns of taxonomic, functional, and

phylogenetic diversity. Moreover, community assembly is often viewed through the lens of

competition and rarely includes trophic interactions or entire food webs. Motifs provide a

novel framework for exploring community assembly by explicitly including interactions as

opposed to inferring them from patterns of taxonomic or phylogenetic composition (60).

Focusing on community assembly through the lens of motifs can be thought of as interaction

assembly. Here, we use motifs, i.e., subgraphs of nodes (e.g., species) and links (e.g., predation)

whose abundance within a network deviates significantly from that in a random network

topology, to explore the assembly of food web networks found in the leaves of the northern

pitcher plant (Sarracenia purpurea). We compared counts of three-node motifs (Figure 3-1)

across a hierarchy of scales to a suite of null models to determine if motifs are over-, under-,

or randomly represented (19). We then assessed if the pattern of representation of a motif in a

given network matched that of the network it was assembled from.

3.1.2 Method

In this section, we explain the dataset we analyze. We then discuss the methods we

develop to identify the assembly behaviors.

Figure 3-1. Four of the thirteen possible three-node motifs: apparent competition, exploitative competition, tri-trophic chain, and omnivory. These four motifs have been explored both theoretically and empirically in ecological networks and are the only motifs found in the pitcher plant dataset we analyzed.

Dataset. The pitcher plant (Sarracenia purpurea) is a carnivorous plant that inhabits bogs

and fens along the east coast of North America from the panhandle of Florida to Canada and

across southern Canada to British Columbia (21). S. purpurea forms tube-shaped leaves that

fill with rainwater. The leaves produce a nectar around the rim of the pitcher that attracts

invertebrate prey (e.g., ants, wasps) which subsequently drown in the pitcher liquid. An entire

food web consisting of bacteria, protozoa, rotifers, and dipteran larvae among other taxa (21)

resides within the pitcher and serves to decompose prey items releasing nutrients to the plant.

We used pitcher plant data from 39 sites across North America to explore motif assembly

(Figure 3-2). This dataset contains abundance data and feeding interactions for 20 pitcher

plant food webs at each site for a total of 769 food webs (11 pitchers were dropped due to

missing data). We based the interaction structure (i.e., who eats whom) on previous studies

and direct observation of feeding interactions. We constructed food web networks at three

levels of hierarchy as follows. At the first level lie the food web networks for individual pitcher

plants. We consider these as the local networks. The second level in the hierarchy of networks

lies at the site scale. We created networks for each of the 39 sites by combining the local

network of every pitcher plant at that site. We combined a set of networks by taking the

union of all the nodes and the union of all the links of those networks. We designated the

resulting networks as site networks. Finally, at the top of the hierarchy lies the continental

network which summarizes all the 39 site food webs. We obtained this network by combining

the 39 site food webs. In summary, the local networks (n=769) were assembled from their

site networks (n=39) which were assembled from the continental network (n=1) (Figure 3-2).

Because of this hierarchical design, we designated the higher level network from which a

network is assembled as the parent network and a network that is assembled from the parent

network as a daughter network.
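The union-based combination used to build site and continental networks can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this study: the `combine_networks` helper, the (nodes, links) tuple representation, and the species names are hypothetical.

```python
from typing import Iterable, Set, Tuple

# A network is a pair (nodes, links); each link is a directed (prey, predator) pair.
Network = Tuple[Set[str], Set[Tuple[str, str]]]


def combine_networks(networks: Iterable[Network]) -> Network:
    """Combine networks by taking the union of their nodes and the union of their links."""
    nodes: Set[str] = set()
    links: Set[Tuple[str, str]] = set()
    for net_nodes, net_links in networks:
        nodes |= net_nodes
        links |= net_links
    return nodes, links


# Hypothetical local networks from two pitchers at the same site.
pitcher_a: Network = ({"bacteria", "protozoa"}, {("bacteria", "protozoa")})
pitcher_b: Network = (
    {"bacteria", "rotifers", "mosquito_larvae"},
    {("bacteria", "rotifers"), ("rotifers", "mosquito_larvae")},
)

# The site network is the parent of both local (daughter) networks.
site_network = combine_networks([pitcher_a, pitcher_b])
```

Applying the same function to the 39 site networks would yield the continental network, since the union operation is the same at every level of the hierarchy.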

Figure 3-2. Schematic of the three levels of hierarchy for pitcher plant network assembly. The continental scale network (A) contains all of the species and interactions found across the 39 North American site networks (B) (sites are indicated by black circles; we only show networks for three sites here for clarity). Species assemble from the continental network to the site networks. Within each site there are 20 local food web networks (C) found in individual pitcher plant leaves (we show only three local networks here for clarity). Species from the site networks assemble into the local networks.

Analysis. We took a four-step approach to analyzing motifs in the assembly of food

web networks. First, we counted motifs in empirical networks. Second, we developed null

models and counted motif representation in the null models. Next, we compared empirical

motif counts to those of the null models using z-scores and p-values. We used six null

models, each of which describes a different random scenario: Erdős-Rényi (61), niche (62),

nested-hierarchy (NH) (63), generalized cascade (GC) (64), and two co-occurrence

null models (CO1 and CO2) (60). Using each of the null models, we created 1000 networks

to get a distribution of null motif counts. We consider a z-score greater than two as evidence

that a particular motif is over-represented, a z-score less than negative two as evidence that a

motif is under-represented, and a z-score between negative two and two as an indication that

a motif appears no more or less often than we would expect under the null model (i.e., randomly).

In addition to calculating z-scores, we also calculated p-values to determine the probability

of obtaining a motif count equal to or more extreme than the observed count, under the null

model. A p-value > 0.975 is evidence of under-representation, a p-value < 0.025 is evidence

of over-representation, and the in between values indicate random representation. When motifs

are over- or under-represented, they represent a non-random selection of a given motif in a

network. Finally, we compared the motif representation of parent networks to their daughter

networks.
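The comparison of an empirical motif count against a null distribution can be sketched as follows. This is an illustrative sketch, not the code used in this study. It assumes the sample standard deviation for the z-score and a one-sided p-value defined as the fraction of null counts that are at least as large as the observed count, which is consistent with the thresholds stated above; the function name `classify_motif` and the example counts are hypothetical.

```python
import math
from statistics import mean, stdev
from typing import List, Tuple


def classify_motif(observed: int, null_counts: List[int]) -> Tuple[float, float, str]:
    """Return (z-score, p-value, label) for one motif against its null counts."""
    mu = mean(null_counts)
    sigma = stdev(null_counts)  # assumption: sample standard deviation
    if sigma > 0:
        z = (observed - mu) / sigma
    elif observed == mu:
        z = 0.0
    else:
        # Degenerate null distribution: any deviation is infinitely many sigmas away.
        z = math.copysign(math.inf, observed - mu)

    # Assumed one-sided p-value: share of null networks with count >= observed.
    # Small p (< 0.025) points to over-representation, large p (> 0.975) to under-representation.
    p = sum(count >= observed for count in null_counts) / len(null_counts)

    if z > 2:
        label = "over-represented"
    elif z < -2:
        label = "under-represented"
    else:
        label = "random"
    return z, p, label


# Hypothetical null distribution (in the study, 1000 null networks per model).
null_counts = [8, 9, 10, 11, 12]
high = classify_motif(20, null_counts)
middle = classify_motif(10, null_counts)
low = classify_motif(0, null_counts)
```

The label here is derived from the z-score only; the p-value is returned alongside it so that both criteria can be inspected.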

3.1.3 Experimental Results

Our main interest lies in determining if the pattern of representation of a motif (i.e.

over-represented, under-represented, or random) in a set of daughter networks matches that

of the parent networks they are assembled from. We therefore calculated the proportion of

daughter networks that matched the parent network they were assembled from for all motifs.

Figures 3-3 and 3-4 present the results.

We found that the motif representation in daughter networks generally matched that of

their parent network regardless of motif for both continental-to-site and site-to-local network

assembly. While different null models showed different representation for a given motif, the

general pattern of agreement in motif representation between daughter and parent networks

was consistent. The consistency across motifs and null models shows that the assembly process

results in daughter networks that are structurally representative samples of the parent network

in terms of motif representation. The ultimate mechanism driving the assembly of daughter

networks (or community assembly in general) is the sampling of the parent network. In the

case of matching parent and daughter networks, proportional sampling from each trophic group

Figure 3-3. The percentage of sites for which motif representation (over-represented (black fill), under-represented (cross hatch fill), and random (white fill)) matches the continental network under six different null models.

(loosely defined as species that have the same or similar prey and predators) produces daughter

networks with fewer nodes, but representative motif structure.

3.1.4 Discussion

In this application, we compared counts of three-node motifs across a hierarchy of scales

to a suite of null models to determine if motifs are over-, under-, or randomly represented. We

then assessed if the pattern of representation of a motif in a given network matched that of

the network it was assembled from. We found that motif representation in over 70% of site

networks matched the continental network they were assembled from and over 75% of local

networks matched the site networks they were assembled from for the majority of null models.

This suggests that the same processes are shaping networks across scales.

Figure 3-4. The percentage of pitchers for which motif representation (over-represented (black fill), under-represented (cross hatch fill), and random (white fill)) matches the site networks under six different null models.

3.2 Motif Centrality in Food Web Networks

3.2.1 Preface

The complexity of ecological networks has inspired an approach to network analysis

that reduces networks into meaningful subnetworks to better characterize the structure

and function of these systems. Motifs, i.e., subnetworks whose abundance in the given network

differs significantly from that in a random network topology, have in particular captured the

interest of network ecologists due to the ecological theory that has been developed for several

three-node motifs (65). To better understand why some motifs are found at high abundances

(i.e., over-represented) and some are found at low abundances (i.e., under-represented), we

explored the relationship between motif abundance and motif centrality. In order to assess this

relationship, we developed a suite of methods for calculating the centrality of entire motifs and

then analyzed the relationship between motif centrality and motif abundance in 44 published

aquatic food webs. Our eight approaches for calculating motif centrality differed in three

aspects: the calculation of the centrality of a single node in a motif, the strategy of combining

the centrality of the nodes that make up a motif into a single centrality value, and the null

model used to test the significance of motif centrality.

3.2.2 Background

Integrating the concept of centrality with motifs, which also influence the functioning

and structure of networks, has the potential to increase our understanding of the variation

in abundance across different motifs (i.e., why a specific motif is under-represented or

over-represented in a food web). There have been several attempts to integrate the concept

of centrality with network motifs (66; 67). These studies have predominantly focused on

calculating a measure of node-centrality based on the location, frequency, or role of a node

within a given motif in a network [30-32]. The approach of quantifying the centrality of an

entire motif to assess its importance is uncommon. Li et al. (68) investigated the functional

potential behind central motifs in a cancer related human signaling network. They identified

central motifs by ranking the motifs based on their centrality values according to six different

centrality measures and choosing the top 5%. One of the centrality measures they use is the

in-coming degree of the underlying node (see Li et al. (68) for the other centralities). Piraveenan

et al. (69) averaged the centrality (betweenness and closeness) of four-node motifs in

Prokaryotic and Eukaryotic metabolic networks and found that the nodes that participated

in over-represented motifs (i.e. occur more frequently than randomly expected) had a greater

average centrality than the average of all nodes in the network. Motif centrality has not been

explored in food web networks.

3.2.3 Method

Dataset. We explored motif centrality in 44 food web networks contained in the enaR

package in R (networks 15-58 in Borrett and Lau et al. (23)). These networks describe aquatic

food webs ranging in size from 14 to 125 nodes (mean = 45.73, sd = 29.41) and in connectance

(C = edges/nodes^2) from 0.05 to 0.37 (mean = 0.17, sd = 0.08). Nodes depict species (e.g.,

Fundulus heteroclitus) or trophic-species and edges depict weighted biomass or energy flows

from prey to consumer that result from a feeding interaction. Each food web network also

contains information on node boundary loss and inputs, and node biomass.

Analysis. We analyzed motif centrality using a three-step process. First, we calculated the

statistical significance of motif abundance for each of the 13 three-node motifs (Figure 3-5)

using the niche null model (62). Second, we calculated the centrality of each motif (explained

later). Finally, we analyzed the relationship between motif abundance and centrality.

Figure 3-5. All 13 motifs of 3-node subgraphs. The first four motifs have specific ecological terminology.

Two attributes define how we measured the centrality of a motif: (1) the calculation of

the centrality of a single node in the motif and (2) the strategy to combine the centrality

of the nodes that make up that motif into a single centrality value. We used two measures

to quantify the centrality of each node in each of the 44 food web networks we analyzed.

The first measure is betweenness centrality (70). Briefly, a species is considered to have high

betweenness centrality in a given network if it is located on the shortest paths connecting

many pairs of species in that network. Our second measure of centrality, called throughflow

centrality (71), is the total energy entering or exiting a node. This method was developed

to specifically capture energy flow through nodes in a food web network. Conceptually,

throughflow centrality measures the contribution of a given node to energy exchanged across

the entire food web.
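As a rough illustration of throughflow centrality, the sketch below sums the energy entering each node. It is a simplified reading of the description above and not the enaR implementation: the function name `throughflow_centrality`, the dictionary-based inputs, and the example flows are assumptions, and the inflow side is used on the assumption that inflow and outflow balance at steady state.

```python
from collections import defaultdict
from typing import Dict, Tuple


def throughflow_centrality(
    flows: Dict[Tuple[str, str], float],
    boundary_input: Dict[str, float],
) -> Dict[str, float]:
    """Total energy entering each node: internal inflows plus boundary inputs."""
    total: Dict[str, float] = defaultdict(float)
    for (prey, consumer), amount in flows.items():
        total[prey] += 0.0  # ensure nodes with no internal inflow are still listed
        total[consumer] += amount
    for node, amount in boundary_input.items():
        total[node] += amount
    return dict(total)


# Hypothetical weighted flows from prey to consumer, with one boundary input.
flows = {("detritus", "bacteria"): 5.0, ("bacteria", "protozoa"): 2.0}
boundary_input = {"detritus": 5.0}
node_throughflow = throughflow_centrality(flows, boundary_input)
```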

So far, we have described two alternative strategies for calculating the centrality of single

nodes. A motif, however, is made up of multiple nodes (i.e., three in our study; Figure 3-5).

More importantly, a given motif topology typically has many possible occurrences in a given

network. We used two approaches to compute the centrality of a given motif from node

centrality. We call the first approach redundant and the second approach non-redundant.

In the redundant approach, each node contributes to all occurrences of a given motif

independently regardless of the number of such occurrences (i.e., a node can contribute

more than once to the centrality of the same motif). The redundant approach has been

used by Li et al. (68). In the non-redundant approach, if a node appears in a given motif,

it only contributes once regardless of the number of instances of the motif it appears in.

The non-redundant approach has been used by Piraveenan et al. (69). In the redundant

approach, we first computed the centrality of each node in the given network. Next, given a

motif topology P , for each occurrence of P in the given network, we calculated the centrality

of that occurrence as the average of the centralities of the nodes in that occurrence. Once

we did this for all the occurrences of that motif, we computed the centrality of that motif

as the average of the centralities of all of its occurrences. We denoted the number of nodes

in the given motif P with n (here n = 3) and represented the number of occurrences of

that motif in the given network with t. Also, the ith node in the jth occurrence of P is

denoted with $v_{ij}$, and $C(v)$ denotes the centrality of node v. We calculated the redundant

centrality of P, denoted with $MC_r(P)$, as follows:

$$MC_r(P) = \frac{1}{t} \sum_{j=1}^{t} \frac{1}{n} \sum_{i=1}^{n} C(v_{ij})$$

In our second approach, we aimed to circumvent any bias

introduced by such multiple-counting of nodes by calculating non-redundant motif centrality.

This approach allows each node in the given network to contribute once if it appears in

at least one occurrence of the underlying motif. We defined an indicator function for each

motif P operating on the nodes v of the given network as $\delta_P(v)$, where $\delta_P(v) = 1$ if v appears

in at least one occurrence of P, and 0 otherwise. We computed the non-redundant centrality

value of the motif P, denoted with $MC_{nr}(P)$, as follows:

$$MC_{nr}(P) = \frac{\sum_{v} C(v)\,\delta_P(v)}{\sum_{v} \delta_P(v)}$$
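The two ways of combining node centralities into a motif centrality can be sketched as follows. This is a minimal illustration, not the code used in this study; the function names, the list-of-tuples representation of motif occurrences, and the centrality values are hypothetical.

```python
from typing import Dict, List, Tuple


def redundant_centrality(
    occurrences: List[Tuple[str, ...]], centrality: Dict[str, float]
) -> float:
    """Average, over all occurrences, of the mean node centrality within each occurrence.

    A node that appears in several occurrences contributes once per occurrence.
    """
    per_occurrence = [
        sum(centrality[v] for v in occurrence) / len(occurrence)
        for occurrence in occurrences
    ]
    return sum(per_occurrence) / len(per_occurrence)


def non_redundant_centrality(
    occurrences: List[Tuple[str, ...]], centrality: Dict[str, float]
) -> float:
    """Mean centrality of the distinct nodes that appear in at least one occurrence."""
    members = {v for occurrence in occurrences for v in occurrence}
    return sum(centrality[v] for v in members) / len(members)


# Hypothetical node centralities and three occurrences of one motif topology.
centrality = {"a": 4.0, "b": 1.0, "c": 1.0, "d": 2.0}
occurrences = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d")]

mc_r = redundant_centrality(occurrences, centrality)
mc_nr = non_redundant_centrality(occurrences, centrality)
```

In this toy example node `a` has the highest centrality and appears in every occurrence, so it is counted three times in the redundant value and only once in the non-redundant value.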

Once we calculated motif centrality, we tested its statistical significance by comparing

it to a null distribution of centrality constructed from random subnetworks of the same

size chosen from the observed network. We took two different approaches to defining the null

model: constrained and unconstrained. In the constrained null model, we randomly selected

a three-node subnetwork (i.e. of the same size as the given motif) with the condition that

this subnetwork is connected. Connected subnetwork here means that there is an undirected

path between any pair of the three nodes in this subnetwork. This approach randomly selects

a subnetwork which matches one of the 13 motifs in Figure 3-5 since they are all possible

three-node topologies. In the unconstrained null model, we randomly selected a three-node

subnetwork (i.e., also of the same size as the given motif) but did not require it to be

connected. For each type of null model, we repeated this process 1000 times for each motif in

each network. We then computed the p-value of the observed motif centrality as the fraction

of random subnetworks which have higher centrality than the underlying motif. We summarize

all the approaches we use to calculate motif centrality significance in Table 3-1.
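The constrained and unconstrained null models can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code used in this study: the centrality of a random subnetwork is assumed to be the mean centrality of its three nodes, sampling is uniform over the eligible triples, and the names `is_connected_triple` and `centrality_p_value` are hypothetical.

```python
import random
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def is_connected_triple(triple: Tuple[str, str, str], adj: Dict[str, Set[str]]) -> bool:
    """Three nodes are connected (ignoring direction) iff at least two of the
    three possible node pairs are linked."""
    a, b, c = triple
    linked_pairs = (b in adj[a]) + (c in adj[a]) + (c in adj[b])
    return linked_pairs >= 2


def centrality_p_value(
    motif_centrality: float,
    nodes: Iterable[str],
    edges: Iterable[Tuple[str, str]],
    centrality: Dict[str, float],
    constrained: bool,
    samples: int = 1000,
    seed: int = 0,
) -> float:
    """Fraction of random three-node subnetworks whose centrality exceeds the motif's."""
    rng = random.Random(seed)
    node_list = sorted(nodes)
    adj: Dict[str, Set[str]] = {v: set() for v in node_list}
    for u, v in edges:  # undirected view of the directed food web
        adj[u].add(v)
        adj[v].add(u)
    pool = [
        triple
        for triple in combinations(node_list, 3)
        if not constrained or is_connected_triple(triple, adj)
    ]
    if not pool:
        raise ValueError("no eligible three-node subnetwork")
    higher = 0
    for _ in range(samples):
        triple = rng.choice(pool)
        if sum(centrality[v] for v in triple) / 3 > motif_centrality:
            higher += 1
    return higher / samples


# Hypothetical four-species chain a -> b -> c -> d.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d")]
centrality = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0}
p_constrained = centrality_p_value(2.5, nodes, edges, centrality, constrained=True)
```

Enumerating all triples is affordable for networks of the size analyzed here (at most 125 nodes); for larger networks, rejection sampling would be a natural alternative.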

Table 3-1. Eight approaches used to calculate motif centrality significance. Each approach varies in the combination of methods used to calculate node centrality, motif centrality, and the null model used to assess the significance of motif centrality.

Approach  Node centrality  Motif centrality  Null model
1         Throughflow      Redundant         Constrained
2         Throughflow      Redundant         Unconstrained
3         Throughflow      Non-redundant     Constrained
4         Throughflow      Non-redundant     Unconstrained
5         Betweenness      Redundant         Constrained
6         Betweenness      Redundant         Unconstrained
7         Betweenness      Non-redundant     Constrained
8         Betweenness      Non-redundant     Unconstrained

3.2.4 Experimental Results

The over-arching goal of our analysis is to determine if there is a relationship between

the abundance of a given motif and its centrality. We show the results in Figure 3-6. Focusing

on approach 5 (Figure 3-6), for six of the 13 motifs (motifs 2, 5, 7, 8, 9, and 10), the networks in

which the motif was highly central also showed it as over-represented in abundance, whereas for

three motifs (6, 11, and 12) such networks showed it as under-represented.

In order to compute the statistical significance of these results in a systematic manner,

we computed the probability of obtaining the observed split between networks of different

centrality classes (e.g., highly central or non-central) across motif abundance classes (i.e.,

Figure 3-6. Distribution of motif abundance over two classes of motif centrality significance for approach five. White fill indicates random representation (0.025 < p < 0.975), black fill shows the number of networks in which a motif is over-represented (p < 0.025), and cross-hatch fill indicates under-representation (p > 0.975).

over-represented, under-represented, random). The heat map in Figure 3-7A represents the

positive correlation probabilities (p-values) calculated between motif abundance and motif

centrality. Similarly, the heat map in Figure 3-7B represents the negative correlation probabilities

(p-values) calculated between motif abundance and motif centrality. Generally, we found

support that highly central motifs were over-represented and non-central motifs were

under-represented for several of the motifs. We found no support for the opposite hypothesis, that

highly central motifs were under-represented and non-central motifs were over-represented.

3.2.5 Discussion

In this application, we explored the relationship between motif abundance and motif

centrality. In order to assess this relationship, we developed a suite of methods for calculating

Figure 3-7. Correlation probabilities (p-values) calculated between motif abundance and motif centrality: (A) positive correlation, (B) negative correlation. Only approaches that yielded two centrality classes could be used in these calculations. Row and column cluster trees are shown to illustrate the relation of different approaches based on (A) positive correlation significance and (B) negative correlation probabilities. Significant p-values (≤ 0.05) are emphasized with stars.

the centrality of entire motifs and then analyzed the relationship between motif centrality and

motif abundance in 44 published aquatic food webs. We found that highly central motifs are

over-represented and non-central motifs are under-represented for six of the thirteen motifs.

This pattern suggests that high energy flow is associated with the persistence of certain motifs

in food webs. Further research on well resolved food web networks and integration of motif

centrality with new approaches to stability analysis will help determine the generality of our

results and provide further evidence of the mechanism driving them.

CHAPTER 4
IDENTIFICATION OF CO-EVOLVING TEMPORAL NETWORKS

4.1 Preface

Biological networks describe the interaction between molecules. They are frequently

represented as graphs, where the nodes correspond to the molecules (e.g., proteins or genes)

and the edges correspond to their interactions (1). Formally, we denote a biological network

as G = (V,E) where V and E represent the set of nodes and the set of edges, respectively.

Analysis of these networks enables the elucidation of cellular functions (2), the identification of

variations in cancer networks (3), and the characterization of variations in drug resistance (4).

Studying biological networks has led to numerous computational challenges as well as methods

which address these challenges. Network alignment is one of the most important of these

challenges (8) as it has a profound set of applications ranging from the detection of conserved

motifs to the prediction of protein functions (72). This problem aims to find a mapping of

the nodes of two given networks in which nodes that are similar in terms of content (i.e.

homology) and interaction structure (i.e. topology) are mapped to each other. Hence, we

represent the alignment between two given networks G1 = (V1, E1) and G2 = (V2, E2) as a

bijection function ψ : V1 → V2, and the score resulting from alignment ψ as score(G1, G2|ψ).

The network alignment problem seeks the function ψ that maximizes this score. We note that

there are various ways to calculate the scoring function.

There are two categories of network alignment problem: local and global alignment. The

former problem aims to find pairs of highly-conserved sub-networks in two given networks in

which a sub-network of the query network is mapped to multiple sub-networks in the target

network. Global network alignment aims to maximize the similarity in the networks in which

all nodes in the query network are mapped to a set of nodes in the target network. Network

alignment is a challenging task, as the graph and subgraph isomorphism problems, which are

known to be GI-complete and NP-hard respectively (20), reduce to them. In Section 4.2, we give a brief review of

the methods addressing the global network alignment problem as the problem we consider in

this paper is associated with that problem.

Biological networks have dynamic topologies (11). There are various reasons behind

this dynamic behavior. For example, genetic and epigenetic mutations can alter molecular

interactions (13), and variation in gene copy number can affect the existence of interactions (14).

Due to this dynamic behavior, the topology of the network that models the molecular

interactions evolves over time (16). The majority of the previous work on alignment of biological

networks assumes the network topology is static (10), an assumption that ignores the history

of network evolution and may lead to biased or incorrect analysis. For example, identifying

causes and consequences of the influence of external stimuli is impossible when analyzing

static topologies. To address this oversight, we define a biological network using a model that

accounts for the evolution of the underlying network at consecutive time points. We refer to

this model as a temporal network (24). We view this model as containing a single snapshot

of the network at each time point in a sequence of time points and thus, as a time series

network. More formally, we denote a temporal network with t consecutive time points as

$\mathcal{G} = [G_1, G_2, \ldots, G_t]$, where $G_i = (V, E_i)$ represents the topology of the network at the ith time

point.

In this paper, we consider the problem of identifying coevolving subnetworks in a given

pair of temporal networks. We say that two subnetworks are coevolving if their topologies

remain similar even though their topologies evolve over time. We define this more formally

as follows. We consider two input temporal networks $\mathcal{G}^1 = [G^1_1, G^1_2, \ldots, G^1_t]$ and

$\mathcal{G}^2 = [G^2_1, G^2_2, \ldots, G^2_t]$, where $\forall i \in \{1, 2, \ldots, t\}$, $G^1_i = (V^1, E^1_i)$ and $G^2_i = (V^2, E^2_i)$

represent $\mathcal{G}^1$ and $\mathcal{G}^2$ respectively at the time point i. Without loss of generality, let $\mathcal{G}^1$ be the query (smaller)

network and $\mathcal{G}^2$ be the target network, i.e., $|V^1| \le |V^2|$. An alignment of $\mathcal{G}^1$ and $\mathcal{G}^2$ maps $G^1_i$

to $G^2_i$ across all time points i. Thus, we represent the alignment of the two temporal networks

$\mathcal{G}^1$ and $\mathcal{G}^2$ as a bijection of their nodes and denote it as a function $\psi : V^1 \to V^2$. We compute

the score of the alignment $\psi$ of $\mathcal{G}^1$ and $\mathcal{G}^2$, denoted with $\mathrm{score}(\mathcal{G}^1, \mathcal{G}^2 \mid \psi)$, as the sum of the

Figure 4-1. Different network alignment problems in different types of biological networks. A) Static: the alignment between two input static networks. B) Multiple: the alignment between multiple networks, where each network represents a different organism. C) Dynamic networks: the alignment between two input networks where one is static and the other is dynamic. Here, a different alignment exists between the static network and each version of the dynamic network. D) Temporal networks: the alignment between two input temporal networks, each of which has time-specific snapshots taken at three time points. Here, the alignment persists across all time points.

scores of the alignment at all time points. Hence,

$$\mathrm{score}(\mathcal{G}^1, \mathcal{G}^2 \mid \psi) = \sum_{i=1}^{t} \mathrm{score}(G^1_i, G^2_i \mid \psi).$$

We assume G1 is connected at all time points, but it may be impossible to find an alignment

that is connected in the target network at all time points.
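The summation of per-time-point scores under a single alignment can be sketched as follows. This is only an illustration of how the temporal score is assembled: the per-snapshot score used here (the number of conserved edges) is a stand-in assumption and is not the scoring function developed in this chapter, and the names `snapshot_score` and `temporal_score` as well as the toy networks are hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def snapshot_score(edges1: List[Edge], edges2: List[Edge], psi: Dict[str, str]) -> int:
    """Stand-in score for one time point: number of query edges whose mapped
    endpoints are also linked in the target snapshot (edges are undirected)."""
    target = {frozenset(edge) for edge in edges2}
    return sum(1 for u, v in edges1 if frozenset((psi[u], psi[v])) in target)


def temporal_score(
    snapshots1: List[List[Edge]], snapshots2: List[List[Edge]], psi: Dict[str, str]
) -> int:
    """Sum of the per-time-point scores under one alignment psi shared by all time points."""
    if len(snapshots1) != len(snapshots2):
        raise ValueError("both temporal networks need the same number of time points")
    return sum(snapshot_score(e1, e2, psi) for e1, e2 in zip(snapshots1, snapshots2))


# Hypothetical query and target temporal networks with t = 2 time points.
query = [[("a", "b"), ("b", "c")], [("a", "b"), ("a", "c")]]
target = [[("x", "y"), ("y", "z"), ("z", "w")], [("y", "x"), ("z", "w")]]
psi = {"a": "x", "b": "y", "c": "z"}  # one mapping used at every time point

total = temporal_score(query, target, psi)
```

The key point is that the same mapping is evaluated at every time point, in contrast to dynamic alignment, which would allow a different mapping per snapshot.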

It is worth emphasizing that the temporal network alignment problem described above

is dramatically different from existing network alignment problems, which can be categorized

as follows: (i) pairwise alignment, (ii) multiple network alignment, and (iii) dynamic network

alignment. We illustrate these problems as well as the temporal one in Figure 4-1. The

pairwise network alignment problem (Figure 4-1A) ignores that the network topology evolves.

Although the multiple alignment problem (Figure 4-1B) can consider more than two networks

at once, it lacks the ability to capture the temporal changes since it treats all networks as

having static topologies. The dynamic network alignment problem (Figure 4-1C) considers

topological changes over time. However, it seeks a different solution to the alignment

problem at each time point. Thus, it cannot identify coevolving subnetworks. A new algorithm

is needed to capture such evolving characteristics. Unlike these alignment problems, temporal

network alignment (Figure 4-1D) captures that network topologies coevolve over time.

Contributions in this paper. We develop an efficient algorithm, Tempo, to identify

coevolving subnetworks in a given pair of temporal networks. Briefly, our algorithm first

finds an initial alignment between the input networks G1 and G2 using the similarity score

between pairs of aligned nodes across all time points. It then performs a dynamic programming

strategy that maximizes the alignment quality (i.e. score) by repeatedly altering the aligned

nodes in the target network. We demonstrate the efficiency and accuracy of Tempo using both

real and synthetic data. We compare the running time and the quality of the alignments found

by Tempo against those of three existing alignment algorithms, IsoRank (10), MAGNA++ (73)

and GHOST (74). Note that all these methods are tailored towards optimizing alignment at

a single time point. To have a fair comparison, we allow each of these methods to align

each time point independently, then apply each resulting alignment to all other time points and

take the average. We show Tempo has competitive running time and generates significantly

better alignments. We use a human brain aging dataset (75), and integrate this dataset to

analyze three phenotypes: two age-related diseases (Alzheimer’s and Huntington’s) and one

disease that is less prone to aging (Type II diabetes). We perform gene ontology analysis

on the aligned genes reported by our algorithm and by the competing algorithms. Unlike the other

algorithms, our algorithm successfully aligned genes of the phenotype query (i.e. the underlying

disease) to strongly related genes in the target network despite their evolving topologies.

Consequently, we could predict disease-related genes based on the alignment generated by

Tempo, which suggests that Tempo generates alignments that reflect the evolution of node

topologies through time as well as their homological similarities, while other methods only

focus on static and independent topologies. Lastly, we observe that the alignments of age-related

phenotypes score significantly higher than the alignment of the non-age-related phenotype, which reflects their high

evolution rates and shows that Tempo can distinguish between different queries.

4.2 Related Work

One of the key studies on pairwise global network alignment is IsoRank (10), which

is based on the conjecture that two nodes should be matched if their respective neighbors

can also be matched. It formulates the alignment as an eigenvalue problem and computes

the similarity between pairs of nodes from two given networks as a combination of their

homological and topological similarities. It obtains the global alignment of the two given

networks using their maximum weight bipartite match with the scores as the weights.

The GRAAL (GRAph ALigner) family (76) of global network alignment methods uses the

graphlet degree similarity to align two networks. Briefly, the graphlet-degree of a node

counts the number of graphlets (i.e., induced subgraphs) that this node touches, for all

graphlets on 2 to 5 nodes. GRAAL (77) first selects a pair of nodes (one from each of

the two given networks) with high graphlet degree signature similarity as the seed of

the alignment, and greedily expands the alignment by iteratively including new pairs of

similar nodes. H-GRAAL (78), MI-GRAAL, and L-GRAAL algorithms also belong to the

same family. The SPINAL algorithm (79) iteratively grows the alignment based on a priori

computed node similarity score. MAGNA (80) optimizes the edge conservation between two

networks using a genetic algorithm. There are several other methods for pairwise network

alignment (81; 82; 83; 84; 74; 85; 86; 87). Although the underlying algorithms of these

methods vary, the end goal is similar to those discussed above.

Several algorithms address multiple network alignment (88; 89; 90). IsoRankN (91)

extends IsoRank. It adopts spectral clustering on the induced graph of pairwise alignment

scores. The algorithm developed by Shih et al. (92) is a seed-expansion heuristic that first

selects a set of node pairs with high similarity scores using a clustering algorithm, and then

expands these pairs by aligning nodes that maximize the total number of conserved edges

of aligned nodes.

INQ (93) aligns a dynamically evolving query network with one static target network. It

uses ColT (94) to find an initial alignment of the initial query, then it observes the differences

between the topologies of the already aligned query network and the new query network, and

finally, uses these differences to refine the alignment found for the previous query and generate

alignment of the current query network. DynaMAGNA++ (95) aligns two dynamic networks.

It assigns a value to each node based on how its incident edges and graphlets change through

dynamic events. Specifically, this value is based on the dynamic graphlet degree vector

(DGDV) of graphlets up to size four. It considers a pair of nodes from two networks similar if


their DGDVs are similar. This algorithm starts by constructing an initial population of random

dynamic network alignments and then evolves this population to maximize the node similarities.

4.3 Problem Formulation

In this section, we develop a new scoring function, $score(G^1_i, G^2_i \mid \psi)$, that integrates the

similarities of the aligned nodes and their evolving topologies, and includes a penalty for each

disconnected component in the aligned subnetworks of the target network at each time point.

Next, we introduce the terminology and discuss how we derive our scoring function.

Given a network G = (V, E) and a subset of nodes V′ ⊆ V, we define the induced subnetwork
of V′ in G as the nodes in V′ and all edges of G between them (i.e., E′ = {V′ × V′} ∩ E). We
denote this induced network as G′ = (V′ | G). We say two nodes u and v in G are connected if
there exists a path between u and v in G. We say a subset of nodes in G forms a connected
component if all pairs of nodes in that subset are connected in G. We define a subset of nodes
V′ in G as a maximum connected component if the following conditions hold: (i) V′ is a
connected component in G, and (ii) there is no node in V − V′ which is connected to a node in
V′. In the rest of the paper, we use the term connected component instead of maximum
connected component. We denote the number of connected components of a given network G
with NCC(G).
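The definitions above can be illustrated with a minimal Python sketch. The function names `induced_edges` and `ncc`, the edge-list representation, and the toy node labels are assumptions made for illustration; they do not come from the Tempo implementation.

```python
def induced_edges(nodes, edges):
    """Edges of G with both endpoints in `nodes`, i.e. E' = (V' x V') ∩ E."""
    keep = set(nodes)
    return [(a, b) for a, b in edges if a in keep and b in keep]


def ncc(nodes, edges):
    """Number of connected components of the subnetwork induced by `nodes`."""
    adj = {v: set() for v in nodes}
    for a, b in induced_edges(nodes, edges):
        adj[a].add(b)
        adj[b].add(a)
    seen, count = set(), 0
    for start in adj:
        if start in seen:
            continue
        count += 1              # a new component starts at an unvisited node
        seen.add(start)
        stack = [start]
        while stack:            # iterative depth-first search
            x = stack.pop()
            for y in adj[x] - seen:
                seen.add(y)
                stack.append(y)
    return count


# Hypothetical toy network: b12 is the only bridge between {b1, b2, b3} and {b5, b6}.
edges = [("b1", "b2"), ("b2", "b3"), ("b3", "b12"), ("b12", "b5"), ("b5", "b6")]
print(ncc(["b1", "b2", "b3", "b5", "b6"], edges))
print(ncc(["b1", "b2", "b3", "b5", "b6", "b12"], edges))
```

Leaving the bridging node out of the node subset splits the induced subnetwork, which is the situation the penalty term in the scoring function is meant to capture.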

Given two temporal networks with t time points, $G^1 = [G^1_1, G^1_2, \ldots, G^1_t]$ and
$G^2 = [G^2_1, G^2_2, \ldots, G^2_t]$, we denote the similarity between a pair of nodes $u \in V^1$ and $v \in V^2$ at time

point i (1 ≤ i ≤ t) with Si(u, v). We use an existing pairwise alignment method to calculate

Si(u, v). The alignment function ψ maps all nodes in V 1 to a subset of the nodes in V 2. We

denote this subset with Ψ(V 1) (i.e. Ψ(V 1) = {ψ(u)|∀u ∈ V 1}). We note that ψ yields an

induced subnetwork $(\Psi(V^1) \mid G^2_i)$ of $G^2_i$ for each time point i, and each induced subnetwork
$(\Psi(V^1) \mid G^2_i)$ forms one or more connected components. Figure 4-2A illustrates
this latter point. We denote the number of connected components of the induced subnetwork
$(\Psi(V^1) \mid G^2_i)$ at time point i as $NCC(\Psi(V^1) \mid G^2_i)$. If the number of connected components

at time point i is greater than one then the corresponding induced subnetwork is disconnected.


We incur a penalty to account for the missing edges which would connect the disconnected

components, and apply this penalty to each disconnected component.

The minimum number of edges needed to join $NCC(\Psi(V^1) \mid G^2_i)$ connected components
is $NCC(\Psi(V^1) \mid G^2_i) - 1$. We penalize each edge insertion with a constant value denoted with
δ, where $\delta \geq S_i(u, v)$, $\forall\, u \in V^1$, $v \in V^2$ and $i \in \{1, 2, \ldots, t\}$. We define the score of the
alignment ψ at time point i as

$$score(G^1_i, G^2_i \mid \psi) = \sum_{u \in V^1} S_i(u, \psi(u)) - \delta\,\big(NCC(\Psi(V^1) \mid G^2_i) - 1\big).$$

We define the temporal network alignment as

$$\arg\max_{\psi} \left\{ \sum_{i=1}^{t} \left( \sum_{u \in V^1} S_i(u, \psi(u)) - \delta\,\big(NCC(\Psi(V^1) \mid G^2_i) - 1\big) \right) \right\}. \quad (4\text{-}1)$$

4.4 Methods

Overview. Our algorithm for solving the temporal network alignment problem has two phases.

The first phase finds an initial alignment between the input networks G1 and G2 using the

similarity score between pairs of aligned nodes across all time points. The induced subnetwork

of G2 obtained by this alignment may be disconnected since this phase ignores the penalty

incurred by edge insertions. The second phase reduces the number of connected components,

improving the alignment score. In the second phase, we improve the alignment between the

input networks by swapping a subset of the nodes in G2 that are aligned with nodes in G1 with

other nodes in G2. In order to swap a node vi ∈ Ψ(V 1) with vj ∈ V 2 − Ψ(V 1), we update

the alignment function ψ() to ψ′() such that ∀ u ∈ V 1 one of the two conditions is satisfied:
(i) ψ′(u) = vj if ψ(u) = vi; and (ii) ψ′(u) = ψ(u) if ψ(u) ≠ vi. Figure 4-2 illustrates this.

Here, initially b11 is aligned to a11 (Figure 4-2A). Swapping b11 with b14 updates the alignment

function so that b14 is aligned to a11 (Figure 4-2B). We observe that this swapping reduces the

number of connected components in the induced subnetwork of G2 by one. Notice that if we

swap b8 with b14 (instead of b11 with b14) then the number of connected components increases

(Figure 4-2C).
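The swap operation described above amounts to a small update of the alignment map. The sketch below assumes the alignment is stored as a Python dictionary from query nodes to target nodes; the function name `swap` is hypothetical.

```python
def swap(psi, v_i, v_j):
    """Return psi' where the query node mapped to v_i is remapped to gap node v_j.

    Every other query node keeps its image, matching conditions (i) and (ii).
    """
    if v_j in psi.values():
        raise ValueError("v_j must be a gap node (not in the image of psi)")
    return {u: (v_j if v == v_i else v) for u, v in psi.items()}


# Mirrors the Figure 4-2 narrative: a11 is initially aligned to b11.
psi = {"a10": "b10", "a11": "b11"}
psi_new = swap(psi, "b11", "b14")
print(psi_new)
```

Returning a new dictionary rather than mutating `psi` keeps the previous alignment available, which is convenient when several candidate swaps are scored against the same starting alignment.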

We note that the number of connected components may simultaneously decrease at

one time point and increase at other time points when we swap two nodes. We prove that


Figure 4-2. An alignment between two networks G1 and G2. Panels: A) initial alignment; B) after swapping b11 with b14; C) after swapping b8 with b14. Each node in the query network G1 has a one-to-one mapping with a node in the network G2. The dashed line between two nodes emphasizes that they are mapped to each other. A) This represents a hypothetical alignment where ai is aligned with bi for all 1 ≤ i ≤ 11. The induced subnetwork of the aligned nodes in G2 forms three connected components; C1 = {b1, b2, b3, b4}, C2 = {b5, b6, b7}, and C3 = {b8, b9, b10, b11}. Gap nodes are {b12, b13, b14}. B) After swapping b11 with b14. This swapping results in two connected components in G2. C) After swapping b8 with b14. The aligned nodes in G2 form four connected components.

the problem of finding the subset of node swaps that minimizes the number of connected

components across all time points is NP-hard. We give a reduction from the Maximum

Coverage problem (96) to this problem later in this section.

Algorithm details. Tempo takes two networks (G1 and G2) and the maximum number of

allowed swaps (denoted as k) as input. In the following, we explain the two phases of our

method in detail.

Phase 1 (Initialization). Here, we construct an initial alignment of G1 and G2.

There exist several algorithms to perform pairwise alignment of two static networks at a single
time point. Each of these methods assigns similarity scores to all node pairs (one from the first
network and one from the second) and then chooses the alignment that maximizes the total
score of all aligned node pairs. We adopt one of these methods to obtain the similarity scores
for each network pair $(G^1_i, G^2_i)$ at each time point i, and use the resulting scores to calculate
an initial alignment. We denote the similarity of the node pair (u, v), u ∈ V 1 and v ∈ V 2,
generated by such a method at the ith time point with Si(u, v).


We generate an initial alignment ψ0 as follows. We first construct a weighted bipartite
network Gbp = (V 1, V 2, E) by inserting an edge in Gbp between each pair of nodes
(u, v) such that u ∈ V 1 and v ∈ V 2. We set the weight of the edge (u, v) as the similarity
between nodes u and v aggregated over all time points. We denote this similarity as
$S(u, v) = \sum_{i=1}^{t} S_i(u, v)$. The maximum-weight bipartite matching algorithm maps each node in V 1 to a
node in V 2 (97). This mapping represents the initial alignment, ψ0. We call the nodes in V 2
that are not mapped to any node in V 1 gap nodes and denote them with F = V 2 − Ψ(V 1).
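As a rough illustration of Phase 1, the sketch below aggregates the per-time-point similarities and picks the maximum-weight assignment. For brevity it uses exhaustive search over assignments, which is only feasible for tiny inputs and stands in for the polynomial-time matching algorithm cited in the text. All names (`aggregate`, `initial_alignment`) and the toy scores are assumptions.

```python
from itertools import permutations


def aggregate(sims):
    """S(u, v) = sum over time points of S_i(u, v)."""
    total = {}
    for s_i in sims:
        for pair, value in s_i.items():
            total[pair] = total.get(pair, 0.0) + value
    return total


def initial_alignment(v1, v2, sims):
    """Brute-force maximum-weight matching of every node in v1 to a distinct node in v2."""
    s = aggregate(sims)
    best = max(
        permutations(v2, len(v1)),
        key=lambda image: sum(s[(u, v)] for u, v in zip(v1, image)),
    )
    psi0 = dict(zip(v1, best))
    gaps = [v for v in v2 if v not in best]     # F = V^2 - Psi(V^1)
    return psi0, gaps


# Hypothetical scores, identical at two time points.
s1 = {("a1", "b1"): 0.9, ("a1", "b2"): 0.1, ("a1", "b3"): 0.2,
      ("a2", "b1"): 0.8, ("a2", "b2"): 0.3, ("a2", "b3"): 0.1}
psi0, gaps = initial_alignment(["a1", "a2"], ["b1", "b2", "b3"], [s1, s1])
print(psi0, gaps)
```

Note that a2 is not given its individually best partner b1, because assigning b1 to a1 yields a larger total weight; this is the difference between a matching and a per-node greedy choice.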

Phase 2 (select k swapping pairs). Here, we describe our dynamic programming

algorithm that selects a set of k swaps which maximize the alignment score by reducing

the number of connected components in the induced alignment across all time points of G2

(Equation 4-1).

We denote a set of r swaps with ∆ = {(u1, v1), (u2, v2), . . . , (ur, vr)}, where ∀ i ≠ j,
ui ≠ uj and vi ≠ vj. We denote the alignment after applying the swaps in a given set ∆ as

ψ∆. Let us denote the optimal set of r swaps for the alignment ψ with solution(r, ψ,G1,G2).

Also, for a given ui ∈ Ψ(V 1), we denote the optimal set of r swaps for the alignment ψ which

contains the swap pair (ui, vi), ∃vi ∈ F , with solution(r, ui, ψ,G1,G2).

Our algorithm works iteratively. In the first iteration, our algorithm selects one swapping

pair for each aligned node ui ∈ Ψ(V 1) as

$$solution(1, u_i, \psi, G^1, G^2) = \arg\max_{\Delta = \{(u_i, v_i)\},\ v_i \in F} \{score(G^1, G^2 \mid \psi_\Delta)\}.$$

At each subsequent iteration r where 2 ≤ r ≤ k, for each aligned node ui ∈ Ψ(V 1), our

algorithm selects a set of r swapping pairs denoted with solution(r, ui, ψ,G1,G2) by adding

one swapping pair (ui, vi), ∃vi ∈ F , to the previously selected r − 1 pairs as follows.

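$$solution(r, u_i, \psi, G^1, G^2) = \arg\max_{\Delta = \{(u_i, v_i)\} \cup solution(r-1, u_j, \psi, G^1, G^2),\ \Theta} \{score(G^1, G^2 \mid \psi_\Delta)\}. \quad (4\text{-}2)$$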

Here Θ represents the necessary conditions to include the (ui, vi) swap pair with a set of
r − 1 swap pairs as

$$\Theta = (v_i \in F)\ \text{AND}\ (u_j \in \Psi(V^1))\ \text{AND}\ (\nexists\, v \in F \text{ such that } (u_i, v) \in solution(r-1, u_j, \psi, G^1, G^2))\ \text{AND}\ (\nexists\, u \in \Psi(V^1) \text{ such that } (u, v_i) \in solution(r-1, u_j, \psi, G^1, G^2)).$$

The first condition above ensures that node ui is swapped with a gap node and the

second ensures the dynamic programming iterates over all size r − 1 swap sets for all aligned

nodes of G2. The third condition ensures that the aligned node ui has not already been

swapped in the r − 1 sized swap set. The final condition is the dual of the previous one, as it

ensures that the gap node vi has not already been swapped in the r − 1 sized swap set. When

these conditions hold, the two nodes ui and vi can be swapped and included into the existing

set of r − 1 swaps without conflicting with any of the existing swaps.
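The iterative selection described by Equation 4-2 and the conditions in Θ can be sketched as follows. This is an illustrative reading of the recurrence, not the authors' C++ code: `score` is a hypothetical callable that returns the alignment score after applying a swap set, and the toy additive `gain` table replaces the real score of Equation 4-1.

```python
def select_swaps(aligned, gaps, k, score):
    """Keep, for each aligned node u_i, the best swap set of the current size containing u_i."""
    # Iteration 1: best single swap (u_i, v_i) with v_i in F, for every aligned node.
    best = {u: max((frozenset({(u, v)}) for v in gaps), key=score) for u in aligned}
    for r in range(2, k + 1):
        nxt = {}
        for u in aligned:
            candidates = []
            for prev in best.values():          # size r-1 sets of every aligned node u_j
                if any(a == u for a, _b in prev):
                    continue                    # u_i already swapped in this set
                used_gaps = {b for _a, b in prev}
                candidates += [prev | {(u, v)} for v in gaps if v not in used_gaps]
            if candidates:
                nxt[u] = max(candidates, key=score)
        if not nxt:
            break                               # no conflict-free extension exists
        best = nxt
    return max(best.values(), key=score)        # final selection over all u_i


# Toy additive score, used only to exercise the recurrence.
gain = {("a", "x"): 3, ("a", "y"): 1, ("b", "x"): 5, ("b", "y"): 0}
toy_score = lambda swaps: sum(gain[p] for p in swaps)
print(select_swaps(["a", "b"], ["x", "y"], 2, toy_score))
```

Because only one size r − 1 set is retained per aligned node, the procedure explores far fewer combinations than exhaustive search, which is consistent with the problem being NP-hard and the method being a heuristic for it.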

We report the output of the algorithm at the end of the kth iteration as the set of k swaps with
the highest alignment score using the equation

$$\Delta = solution(k, \psi, G^1, G^2) = \arg\max_{u_i \in \Psi(V^1),\ \Delta_i = solution(k, u_i, \psi, G^1, G^2)} \{score(G^1, G^2 \mid \psi_{\Delta_i})\}. \quad (4\text{-}3)$$

We represent the set cardinalities |V 1|, |V 2|, and |F| with m, n, l, respectively. The
complexity of our algorithm is $O(m^2n^2) + O(mn \log m) + O(ml \sum_{i=1}^{t} |E^2_i|) + O(k^2l^2m)$. We provide
the derivation of this complexity in Section 4.4.2. We note that $k \leq NCC(\Psi(V^1) \mid G^2) - 1$.
This value is either given as input or we set it to $NCC(\Psi(V^1) \mid G^2) - 1$.

Proof of correctness. Here, we formally prove the correctness of our algorithm. We say that
swapping the pair of nodes (ui, vi) is proper if the swap does not increase the number
of connected components of the aligned nodes. We first prove that our algorithm will always
find a proper swapping node ui from the set of aligned nodes for each gap node vi. We first
present a lemma which is necessary for the proof of our first theorem. Let us denote the degree
of a node v (i.e., the number of edges connected to this node) within a component Ci = (Vc, Ec)
of the induced subnetwork $(\Psi(V^1) \mid G^2_i)$ at time point i with the function deg(v|Ci).


Lemma 1. Given a connected component Ci = (Vc, Ec) of the undirected induced subnetwork
$(\Psi(V^1) \mid G^2_i)$, where |Vc| = z and Ci is acyclic (has no cycle within its topology),
$\sum_{v \in C_i} deg(v \mid C_i) = 2(z - 1)$.

Proof. Since Ci is a connected subnetwork with no cycles, the number of edges in Ci equals
z − 1. Each edge of an undirected network increases the sum of the node degrees
by two. Thus, $\sum_{v \in C_i} deg(v \mid C_i) = 2(z - 1)$.

Lemma 2. Given a gap node vi that connects at least two connected components, there exists
at least one aligned node ui which we can swap with vi without increasing the number of
connected components.

Proof. We prove this by induction on the size of the connected component that ui
belongs to.

Base case. We consider a component Ci = (Vc, Ec) where |Vc| = 2 and vi is connected
to Ci through uj, and assume ui belongs to Ci. If we swap vi with ui, then Ci will contain uj
and vi, which form one component. Thus, the number of connected components of
Ci is still one after swapping.

Induction hypothesis. We assume that every component of size q nodes contains a node ui
that can be swapped without disconnecting its component. We consider two cases for
a component Ci to which vi is connected through uj. The first case is when Ci contains
at least one cycle with the set of nodes Vc1 = {v1, v2, . . . , vn}. It follows that each node
ui ∈ Vc1 with ui ≠ uj can be swapped with vi without disconnecting Ci. In the second
case, Ci is an acyclic network. Next, we prove our claim in this case by
contradiction. First, we assume that the number of nodes in Ci with degree equal to 1 is less
than 2. Consequently, $\sum_{v \in C_i} deg(v \mid C_i) \geq 2(z - 1) + 1$, which contradicts Lemma 1. Thus,
the number of nodes in Ci with degree equal to 1 is at least 2, and thus ∃ v, w ∈ Ci such that
deg(v|Ci) = 1, deg(w|Ci) = 1, and v ≠ w. Therefore, we can swap vi with either v or w.
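The degree-sum identity of Lemma 1 and the existence of at least two degree-1 nodes can be checked on a small acyclic component. The tree below and the helper name `degrees` are hypothetical and serve only as a numerical illustration of the argument.

```python
def degrees(nodes, edges):
    """deg(v | C) for every node v of a component given as an edge list."""
    deg = {v: 0 for v in nodes}
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg


# A connected acyclic component with z = 5 nodes and z - 1 = 4 edges.
tree_nodes = ["u1", "u2", "u3", "u4", "u5"]
tree_edges = [("u1", "u2"), ("u2", "u3"), ("u2", "u4"), ("u4", "u5")]

deg = degrees(tree_nodes, tree_edges)
leaves = [v for v, d in deg.items() if d == 1]

print(sum(deg.values()), 2 * (len(tree_nodes) - 1))   # both sides of Lemma 1
print(leaves)                                         # candidates for a safe swap
```

Removing a degree-1 node never disconnects the rest of the component, which is why the proof looks for such nodes in the acyclic case.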

Next, we prove that swapping a gap node vi with an aligned node ui at each iteration will

increase the alignment score score(G1,G2|ψ), showing that the alignment score will always

improve by our dynamic programming algorithm.


Theorem 4.1. Given a value of δ that is greater than or equal to S(ψ(ui), ui) for all
ui ∈ V 2, score(G1,G2|ψ) monotonically increases (i.e., never decreases) at each iteration of our algorithm.

Proof. We assume that our algorithm chooses one pair of nodes to swap: a gap node vi and
an aligned node ui, which will connect x components. We note that the condition x ≥
2 must be satisfied for vi to be considered for swapping. Also, it follows from Lemma 2 that
if we swap vi and ui then the number of connected components will not increase. Thus, the
difference in the score equals D = δ(x − 1) − puv, where puv is the difference in pairwise score
from swapping (i.e., puv = S(ψ(ui), ui) − S(ψ(ui), vi)). Since δ is greater than or equal to
S(u, v) ∀ u ∈ V 1 and v ∈ V 2, then δ(x − 1) ≥ puv. Consequently, D ≥ 0 and score(G1,G2|ψ)
will not decrease.

4.4.1 Proof of NP-hardness

Here, we prove that our problem is NP-hard. To do that, we reduce the Maximum

Coverage Problem (MCP), which is known to be NP-hard (26), to our problem. Given a

positive integer k and a collection of sets, S = {S1, S2, . . . , Sm}, MCP seeks the subset S′ ⊆ S
such that |S′| ≤ k and the number of covered elements $|\cup_{S_i \in S'} S_i|$ is maximized.

We reduce MCP to an instance of our problem. Let U = {x1, x2, . . . , xn} be the union

of elements in S (i.e., $U = \cup_{S_i \in S} S_i$). We construct a target temporal network G2 with one

time point G2 = (V 2, E2) as follows. We initialize G2 as V 2 = ∅ and E2 = ∅. Next, we add a

node aj in G2 for each element xj ∈ U . Also, for each set Si ∈ S, we add two nodes fi and bi

in V 2. Formally, V 2 = {a1, a2, . . . , an} ∪ {b1, b2, . . . , bm} ∪ {f1, f2, . . . , fm}. Next, we populate

the set of edges E2. To do that, for all Si ∈ S and xj ∈ Si, we insert the edge (fi, aj) in

E2. In addition, for all pairs of sets Si, Sj ∈ S, where i < j, we insert the edge (fi, fj) in E2.

Finally, for a given query network G1 = (V 1, E1), we construct the set of nodes in G2 aligned

to those in G1 as Ψ(V 1) = {a1, a2, . . . , an} ∪ {b1, b2, . . . , bm}. Thus, the set of gap nodes is

{f1, f2, . . . , fm}. Notice that, the subnetwork of G2 induced by Ψ(V 1) has m + n nodes but

it contains no edges as all the edges in G2 are connected to a gap node by our construction.


Thus, the alignment yields n +m connected components as each node in Ψ(V 1) represents a

component.
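The construction above can be sketched in Python as follows. The function name `build_mccp_instance`, the node naming scheme, and the example sets are assumptions for illustration; the component counter is a plain union-find.

```python
def build_mccp_instance(sets):
    """Build (aligned nodes, gap nodes, edges) of the target network from an MCP instance."""
    universe = sorted(set().union(*sets))
    a = {x: f"a{j}" for j, x in enumerate(universe, start=1)}   # one node per element
    m = len(sets)
    f = [f"f{i}" for i in range(1, m + 1)]                      # gap nodes, one per set
    b = [f"b{i}" for i in range(1, m + 1)]                      # aligned, isolated nodes
    edges = [(f[i], a[x]) for i, s in enumerate(sets) for x in sorted(s)]
    edges += [(f[i], f[j]) for i in range(m) for j in range(i + 1, m)]
    return list(a.values()) + b, f, edges


def ncc(nodes, edges):
    """Connected components of the subnetwork induced by `nodes`."""
    parent = {v: v for v in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        if u in parent and v in parent:
            parent[find(u)] = find(v)
    return len({find(v) for v in parent})


sets = [{1, 2}, {2, 3}, {4, 5}]                 # m = 3 sets over n = 5 elements
aligned, gaps, edges = build_mccp_instance(sets)
print(ncc(aligned, edges))                      # n + m isolated aligned nodes

# Swap f1 with b1 and f3 with b3 (i.e., pick S1 and S3).
after = [v for v in aligned if v not in ("b1", "b3")] + ["f1", "f3"]
print(ncc(after, edges))
```

Before any swap every aligned node is its own component; after the two swaps the covered element nodes and the chosen gap nodes merge into a single component.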

Recall that each swapping operation swaps an aligned node with a gap node. Also,

recall that the optimization problem we solve for aligning temporal networks aims to find at

most k swaps, such that after applying those swaps, the number of connected components

NCC(Ψ(V 1) | G2) is minimized (Section 4.3). We call this optimization problem minimum

Connected Component Problem (mCCP) in the rest of this proof. Next, we prove that MCP is

maximized if and only if mCCP is minimized.

First, we prove that if there exists a solution to mCCP, then there exists a solution to

MCP. In other words, we prove that minimizing mCCP maximizes MCP. Let us denote the

nodes corresponding to the elements in a set Si with $A_i = \cup_{x_j \in S_i} \{a_j\}$. In our problem
instance, a swap operation swaps fi with a node in the set V 2 − Ai − {fi}. This is because
all nodes in Ai are connected to fi, and thus swapping fi with a node not in Ai ensures that
all nodes in Ai ∪ {fi} form one connected component. Therefore, to minimize the number
of connected components, we swap fi with one of the nodes which is not a part of this
connected component. To ensure that, we swap fi with a node in the set {b1, b2, . . . , bm}.
Since all nodes in this set are isolated, swapping fi with any node in this set will yield the
same number of connected components. Let us assume that the solution to mCCP performs
k swaps. Following from the discussion above, without loss of generality, we assume that
these swaps are {(f1, b1), (f2, b2), . . . , (fk, bk)}. Notice that after these swaps, the nodes in
$(\cup_{i=1}^{k} A_i) \cup \{f_1, f_2, \ldots, f_k\}$ form one connected component, and all remaining nodes are
isolated. Let us denote the number of connected components after these swaps with β. Let
us denote the number of nodes in $(\cup_{i=1}^{k} A_i)$ with τ. Notice that τ also equals the number of
elements covered in $(\cup_{i=1}^{k} S_i)$. We have β = (m − k) + (n − τ) + 1.

In the formulation above, the first term (m − k) is the number of nodes bj which are

not swapped with a gap node. Since all those nodes are isolated, each one forms a connected

component by itself. The second term (n−τ) is the number of nodes aj which are not included


in the set $(\cup_{i=1}^{k} A_i)$. These nodes remain isolated even after swapping of nodes. The last term
(i.e., 1) is the connected component containing the nodes in $(\cup_{i=1}^{k} A_i) \cup \{f_1, f_2, \ldots, f_k\}$. After
minor algebraic manipulation, we rewrite the equation above as β = (m + n − k + 1) − τ. In
this equation, the parameters m, n, and k are input to the given mCCP problem, and thus we
denote the first term above with the constant c = m + n − k + 1. Therefore, we have β = c − τ.
In this equality the smaller the value of β is, the larger τ gets. Thus, minimizing the number of
connected components β in mCCP maximizes the number of elements covered in MCP.
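A small worked instance makes the relation between β and τ concrete. The sets below are hypothetical and chosen only for illustration: take m = 3 sets S1 = {x1, x2}, S2 = {x2, x3}, S3 = {x4, x5} over n = 5 elements, and k = 2 swaps, so that c = m + n − k + 1 = 7.

```latex
\begin{align*}
\text{choose } S_1, S_2:&\quad \tau = |\{x_1, x_2, x_3\}| = 3,
  &\beta &= (3-2) + (5-3) + 1 = 4 = c - \tau,\\
\text{choose } S_1, S_3:&\quad \tau = |\{x_1, x_2, x_4, x_5\}| = 4,
  &\beta &= (3-2) + (5-4) + 1 = 3 = c - \tau.
\end{align*}
```

The choice that covers more elements leaves fewer connected components, as the equality β = c − τ predicts.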

Second, we prove that if there exists a solution to MCP, then there exists a solution to

mCCP. In other words, we prove that maximizing MCP minimizes mCCP. Let us assume that

the solution to MCP is S′ = {S1, S2, . . . , Sk}. The number of elements covered by this solution
is $\tau = |\cup_{S_i \in S'} S_i|$. By constructing an instance of mCCP as described above, we have k swaps
denoted with the set {(f1, b1), (f2, b2), . . . , (fk, bk)}. Consequently, after performing these
swaps, the nodes in $(\cup_{i=1}^{k} A_i) \cup \{f_1, f_2, \ldots, f_k\}$ form one connected component, and all the
remaining nodes are isolated. Let us denote the number of connected components with β. We
have β = (m − k) + (n − τ) + 1.

After minor algebraic manipulation, we rewrite the equation above as τ = (m + n − k +
1) − β. Since m, n, and k are input parameters, we have τ = c − β, where c is a constant
(c = m + n − k + 1). In this equality, the larger the value of τ is, the smaller β gets. Thus,
maximizing τ in MCP results in minimizing β in mCCP.

Lastly, the proof we describe above reduces an instance of MCP to an instance of mCCP

in polynomial time and space as it requires only building a network with O(n + m) nodes and
$O(mn + m^2)$ edges. Thus, we conclude that the mCCP problem is NP-hard.

4.4.2 Complexity Analysis

Here we analyze the complexity of our method. Recall that we represent |V 1|, |V 2|, and

|F | with m, n, l respectively. We refer to Section 2 as we discuss the phases of our method.

For each phase, we explain its complexity. We then summarize the complexity of all phases to

denote the overall complexity of our method. These phases are:


(1) Phase 1 (construct initial alignment). In this phase, we calculate the
similarity score between node pairs of the two input networks based on their homology and
their topology. First, to calculate the topology matrix Ai, we need to trace the neighbors of all
node pairs, which is performed in $O(m^2n^2)$. Thus, the complexity of calculating the topology
score for all time points is $O(m^2n^2t)$. We then integrate the homology and topology scores
by multiplying the topology matrix and the homology vector in $O(m^2n^2)$. The algorithm repeats the
previous step, let us say c times, until it converges ($O(m^2n^2c)$). We select the initial alignment
using the weighted bipartite matching algorithm in $O(mn \log m)$. Thus, in this scenario, the
complexity of this phase becomes $O(m^2n^2) + O(mn \log m)$.

(2) Phase 2 (select k swapping pairs). This phase is performed in two steps. The
first step performs the initialization process of the dynamic programming algorithm, in which
we calculate the profit of swapping a gap node fl with an aligned node vj. In order to do this,
we calculate the number of components that fl can connect if swapped with vj using
depth-first search through all time points in $O(ml \sum_{i=1}^{t} |E^2_i|)$. The second step performs the iterative
process of selecting k swapping pairs where the maximum number of iterations is (k − 1).
The process combines a gap node fl (i.e., 1 ≤ l ≤ |F|) with a set of swapping pairs from
the previous iteration where the maximum number of sets is l. To resolve conflicting
nodes, each combination may trace all profits of all gap nodes in the current combination.
This process is performed in O(km). Thus, the complexity of the second step of phase 2 is
$O((k - 1)\,l^2\,km) = O(k^2l^2m)$. Hence, the complexity of phase 2 is $O(ml \sum_{i=1}^{t} |E^2_i|) + O(k^2l^2m)$.

In summary, the complexity of our method considering all the previous phases is
$O(m^2n^2) + O(mn \log m) + O(ml \sum_{i=1}^{t} |E^2_i|) + O(k^2l^2m)$.

4.4.3 Adopting Pairwise Alignment Methods to Generate Similarity Scores for Temporal Networks

In this section, we describe how we adopt pairwise alignment methods to generate

similarity scores in temporal networks that are needed to calculate an initial alignment. For

that purpose, we consider adopting IsoRank. We note that our choice of such method has no


impact on our method. Recall that IsoRank performs pairwise network alignment. Thus, our
modifications of IsoRank are meant to adapt it to temporal networks. First, we calculate the

homology score between all pairs of nodes (u, v) where u ∈ V 1 and v ∈ V 2 as the similarity

score of their sequences using BLAST (98). We denote the homology score between u and v

as H[u, v]. Next, we calculate the topological similarity matrix at the ith time point, denoted

as $A_i$, as follows. First, we initialize $A_i$ to be the zero matrix. Next, for $u, w \in V^1$ and
$v, z \in V^2$ we let

$$A_i[(u, v), (w, z)] = \frac{1}{|N(w \mid G^1_i)|\,|N(z \mid G^2_i)|} \quad \text{if } w \in N(u \mid G^1_i) \text{ and } z \in N(v \mid G^2_i),$$

where N(v|G) denotes the neighbours of v in network G. Conceptually, $A_i[(u, v), (w, z)]$ models the
topological support that the node pair (u, v) gives to the alignment of their neighbouring pair
(w, z) at the ith time point. We integrate the homology and the topology scores for $G^1_i$ and
$G^2_i$ at the ith time point iteratively using a mixing parameter α. We initialize $H^0_i = H$. We
then update the similarity between node pairs at iteration r as
$H^r_i = \alpha A_i H^{r-1}_i + (1 - \alpha) H^0_i$.
We stop this iterative process when $H^r_i = H^{r-1}_i$.

We note that in subsequent iterations of the above formulation, the homological similarity

of each node pair (w, z) propagates to their neighboring pairs (u, v) by a function governed

by the topology matrix and the mixing parameter α. We explain three issues arising from

these iterations. First, as the number of neighbors of w and z increases, the similarity

propagating to each neighbor pair decreases because the number of ways to align nodes w and

z without altering the topological similarity grows with increasing number of their neighbors.

Secondly, as the value of α decreases, the contribution of the homological similarity to the final

similarity value between each node pair grows and the contribution of the topological similarity

decreases. In the extreme case when α = 0, the topological similarity has no contribution.

Lastly, the iterations above are guaranteed to converge since Ai is a column stochastic matrix

(i.e., the values at each column add up to one). We denote the converged vector at the ith

time point with Si and call it a score vector. Each entry Si[u, v] in this vector shows the

similarity (homology and topology combined) between nodes u and v.
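The update rule above can be sketched in pure Python for one time point. The dictionary-of-dictionaries layout, the function names, and the numerical tolerance used in place of exact equality as the stopping test are assumptions of this sketch, not details taken from the source.

```python
def topology_matrix(nbr1, nbr2):
    """Sparse A with A[(u, v)][(w, z)] = 1 / (|N(w)| |N(z)|) for w in N(u), z in N(v)."""
    A = {}
    for u in nbr1:
        for v in nbr2:
            A[(u, v)] = {
                (w, z): 1.0 / (len(nbr1[w]) * len(nbr2[z]))
                for w in nbr1[u]
                for z in nbr2[v]
            }
    return A


def propagate(A, H0, alpha=0.7, tol=1e-9, max_iter=1000):
    """Iterate H^r = alpha * A * H^(r-1) + (1 - alpha) * H^0 until it stabilizes."""
    H = dict(H0)
    for _ in range(max_iter):
        new = {
            p: alpha * sum(w * H[q] for q, w in A[p].items()) + (1 - alpha) * H0[p]
            for p in H0
        }
        if max(abs(new[p] - H[p]) for p in H0) < tol:   # tolerance replaces exact equality
            return new
        H = new
    return H


# Hypothetical two-node query and target networks (neighbour sets per node).
nbr1 = {"a1": {"a2"}, "a2": {"a1"}}
nbr2 = {"b1": {"b2"}, "b2": {"b1"}}
H0 = {("a1", "b1"): 1.0, ("a1", "b2"): 0.0, ("a2", "b1"): 0.0, ("a2", "b2"): 1.0}

A = topology_matrix(nbr1, nbr2)
S = propagate(A, H0)
print(S)
```

In undirected networks every neighbour w of u has at least one neighbour itself, so the denominators are never zero. Setting `alpha=0` returns the homology scores unchanged, matching the extreme case discussed in the text.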


4.5 Results and Discussion

We evaluate the performance of our algorithm on synthetic and real data. Next, we

describe both datasets in detail.

Real Dataset. We obtain our real dataset from two sources. The first one is the human brain

aging dataset (75). Recall that this dataset contains gene expressions of 173 samples obtained

from 55 individuals spanning 37 ages from 20 to 99 years. The ages in this dataset are not

uniformly spaced. To make consecutive time gaps more uniform, we remove

two data points which have an age gap of more than 5 years from their successive age values,

leading to 35 ages. We select five temporal networks each having seven time points. Next, we

explain how we do that for the first temporal network. We start with the first (i.e., youngest)

time point in the aging data. We then skip the next four time points and take the sixth time

point in aging data iteratively until we have seven time points. Similarly, for 1 < j ≤ 5,

we select the jth temporal network starting from the jth time point. In this manner, we

form five non-overlapping and interleaved temporal networks. In order to integrate static

PPI network with gene expression data to form age-specific PPI networks, we set a cut-off

on the gene-expression value. All the interactions that have a lower transcription value for

either or both the proteins are removed from the corresponding age-specific network. We

use the protein-protein interaction (PPI) network data from BioGRID (99). For the second

source, we select phenotype specific query temporal networks from this dataset. We use

two neurodegenerative disorders which are conjectured to be age-related (Alzheimer’s and

Huntington’s) and a third one which we expect to be less prone to aging (Type II diabetes).

We retrieve the gene sets specific to these three diseases from KEGG database (100). We form

three query PPI temporal networks by keeping only the interactions where both the interactors

are from each of the three phenotype-specific (Alzheimer’s, Huntington’s or Type II Diabetes)

gene set.
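The construction of the five interleaved temporal networks and of the age-specific networks described in this paragraph can be sketched as below. The function names, the treatment of proteins without an expression value as unexpressed, and the toy inputs are assumptions; the actual cut-off value is not given in the text.

```python
def interleaved_series(time_points, n_series=5, length=7):
    """Series j takes time points j, j + n_series, j + 2 * n_series, ... (0-based j)."""
    return [time_points[j::n_series][:length] for j in range(n_series)]


def age_specific_edges(ppi_edges, expression, cutoff):
    """Keep an interaction only if both proteins reach the expression cut-off."""
    return [
        (p, q)
        for p, q in ppi_edges
        if expression.get(p, 0.0) >= cutoff and expression.get(q, 0.0) >= cutoff
    ]


# 35 ages, indexed 1..35 as placeholders for the real age values.
series = interleaved_series(list(range(1, 36)))
print(series[0])
print(series[4])

# Toy static PPI network filtered with a hypothetical cut-off of 1.0.
print(age_specific_edges([("p", "q"), ("q", "r")], {"p": 5.0, "q": 3.0, "r": 0.5}, 1.0))
```

With 35 time points and a stride of five, the five series are disjoint and each contains exactly seven time points, which matches the counts stated above.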

Synthetic dataset. We generate synthetic networks to observe the performance of our method

under a wide spectrum of parameters classified under two categories; (i) network size and


(ii) temporal model parameters, namely number of time points, temporal rate, and cold rate.

We vary the target network size to take values from {100, 250, 500, 750, 1000}. We fix the

network density to two edges per node on the average (i.e., mean node degree is set to four).

We randomly select $G^1_1$ as a connected subnetwork of $G^2_1$. We set the size of the query network
to 50 nodes. We generate the target network $G^2_1$ using the Barabási-Albert (BA) (48) model as this
model produces scale-free networks. In order to explain the parameters in the second category,
we describe how we generate the query and target networks $G^1_1$ and $G^2_1$ at the first time point.

We then explain how we use the parameters in this category to build the query and target

networks at the remaining time points.

We generate the subsequent networks for the remaining time points using the three

parameters in the second category above as follows. The first parameter is the number of

time points t in G1 and G2. We use 5, 10, 15, and 20 time points in our experiments. Recall

that we select a subnetwork of the target network $G^2_1$ as the first query network $G^1_1$. We
mark all nodes and edges in $G^2_1$ within this subnetwork as cold nodes and edges respectively.
We mark all other nodes and edges in $G^2_1$ as hot. Next, we iteratively generate the networks
$G^1_i$ and $G^2_i$ at the ith time point (i > 1) from $G^1_{i-1}$ and $G^2_{i-1}$ respectively as follows. Let
us denote temporal and cold rates (two real numbers) with ϵ and ϵc respectively such that
0 ≤ ϵc ≤ ϵ ≤ 1. Let us denote the ratio of cold edges to the total number of edges in the
target network $G^2_1$ with γ. We calculate the hot rate, denoted with ϵh, from temporal rate
and cold rate as ϵh = (ϵ − ϵcγ)/(1 − γ). Conceptually, hot and cold rates model the rate
of evolution of hot and cold edges between two consecutive time points respectively. More
specifically, for each subsequent time point i, we generate $G^2_i$ by randomizing $G^2_{i-1}$ as follows.
We iterate over all edges in $G^2_{i-1}$. For each edge e, if it is a cold edge we remove it with

probability ϵc and insert a new edge between two randomly chosen cold nodes. If e is a hot

edge, we remove it with probability ϵh and insert a new hot edge between two random nodes

(with at least one being a hot node). We generate query networks at subsequent time points

using almost the same procedure with the only difference being that all edges are cold. We


generate datasets by varying ϵ and ϵc to take the values {0.05, 0.1, 0.2, 0.4, 0.8} and {0.05,

0.1, 0.2} respectively. For each parameter setting we generate 10 target and query temporal

networks.
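One step of the synthetic evolution procedure can be sketched as follows. The sketch treats an edge as cold when both of its endpoints are cold nodes, and it does not prevent duplicate edges; both are simplifying assumptions, as are the function names and the toy network.

```python
import random


def hot_rate(eps, eps_c, gamma):
    """eps_h = (eps - eps_c * gamma) / (1 - gamma)."""
    return (eps - eps_c * gamma) / (1.0 - gamma)


def evolve(edges, cold_nodes, all_nodes, eps_c, eps_h, rng):
    """Generate the edge list at time point i from the edge list at time point i - 1."""
    cold = set(cold_nodes)
    cold_list = sorted(cold)
    hot_nodes = [v for v in all_nodes if v not in cold]
    out = []
    for a, b in edges:
        is_cold = a in cold and b in cold
        if rng.random() >= (eps_c if is_cold else eps_h):
            out.append((a, b))                            # edge survives
        elif is_cold:
            out.append(tuple(rng.sample(cold_list, 2)))   # new cold edge
        else:
            x = rng.choice(hot_nodes)                     # at least one hot endpoint
            y = rng.choice([v for v in all_nodes if v != x])
            out.append((x, y))
    return out


nodes = ["n1", "n2", "n3", "n4", "n5"]
cold = ["n1", "n2", "n3"]
edges = [("n1", "n2"), ("n2", "n3"), ("n3", "n4"), ("n4", "n5")]

gamma = 2 / len(edges)                  # two of the four edges are cold
eps_h = hot_rate(0.2, 0.05, gamma)
print(eps_h)
print(evolve(edges, cold, nodes, 0.05, eps_h, random.Random(0)))
```

The hot rate is chosen so that the edge-weighted average of the cold and hot rates, γϵc + (1 − γ)ϵh, equals the temporal rate ϵ.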

Recall that we generate the scoring matrix based on both homology and topology
similarities. We generate the homology score between a pair of nodes u ∈ V 1 and v ∈
V 2 as follows. If v was originally selected as a cold node and u is the same as v, then we
generate a homology score between u and v from a log-normal distribution (101) with mean
2µ and standard deviation σ. Otherwise, we randomly generate the homology score between
u and v from a log-normal distribution with mean µ and standard deviation σ. In this way, we
make nodes in the query network likely to align to the nodes in the target network from which
they were originally extracted. In this paper, we set µ and σ to be 2 and 0.25 respectively. Notice
that the homology scores do not change through time points, although topology scores do.
Thus, evolution through time points of query and target networks may affect how the query
is aligned to the cold region in the target network. We set the edge insertion penalty δ to be
$\max_{u \in V^1, v \in V^2} S(u, v)$.
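The synthetic homology scores can be drawn with Python's standard library as sketched below. An important assumption: the sketch interprets µ and σ as the parameters of the underlying normal distribution, which is how `random.lognormvariate` is parameterized; the text does not state which parameterization was used. The `origin` dictionary, which records the target node each query node was extracted from, is also hypothetical.

```python
import random


def homology_scores(v1, v2, origin, mu=2.0, sigma=0.25, rng=None):
    """Draw H[u, v]: location 2 * mu for a query node and its origin, mu otherwise."""
    rng = rng or random.Random()
    H = {}
    for u in v1:
        for v in v2:
            location = 2 * mu if origin.get(u) == v else mu
            H[(u, v)] = rng.lognormvariate(location, sigma)
    return H


origin = {"a1": "b1", "a2": "b2"}
H = homology_scores(["a1", "a2"], ["b1", "b2", "b3"], origin, rng=random.Random(7))
for pair in sorted(H):
    print(pair, round(H[pair], 2))
```

Under this reading, scores for matching pairs are centred around exp(4) while all other scores are centred around exp(2), so the true counterpart usually, but not always, receives the highest homology score.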

We compare the accuracy and running time of our algorithm against IsoRank, MAGNA++

and GHOST. IsoRank, MAGNA++ and GHOST are designed to align two networks at a

single time point. We therefore find the alignment using each of these methods at each time

point, impose the alignment to all the time points and report the average. We analyze the

biological significance of our results on real data by performing gene ontology analysis and

exploring publication evidence. We implemented Tempo in C++, performed all experiments on

a computer equipped with AMD FX(tm)-8320 Eight-core Processor 1.4 GHz CPU, 32 GB of

RAM running Linux operating system, and used α = 0.7 unless otherwise stated.

4.5.1 Evaluation of Recovered Region

In this experiment, we compare the accuracy of the alignment generated by Tempo

against that of IsoRank, MAGNA++, and GHOST. We recall that we select the original query

network from a subset of nodes and their edges from the target network, and then evolve the

[Figure 4-3 plot. y-axis: recovered query (%); x-axis: cold rate (0.05, 0.1, 0.2) within one panel per temporal rate (0.05, 0.1, 0.2, 0.4, 0.8); legend: Tempo, IsoRank, MAGNA, GHOST.]

Figure 4-3. The percentage of recovered query in the resulting alignment varying ϵ and ϵc to take the values {0.05, 0.1, 0.2, 0.4, 0.8} and {0.05, 0.1, 0.2} respectively. The x-axis shows temporal rate, ϵ, and cold rate, ϵc (these are the parameters used for constructing synthetic temporal networks with varying evolution rates). The y-axis shows the percentage of recovered query of IsoRank, MAGNA++, and GHOST against Tempo. The error bars show the 80-percentile of the recovered query based on the 10 repetitions of each parameter setting.

query through time points. Here, we evaluate the accuracy by calculating the percentage of the

aligned nodes from query network that are paired with the same nodes of the target network

that they were originally selected from. We refer to this percentage as recovered region. We

illustrate the results in Figure 5-7, which demonstrate that Tempo recovers high percentage of

the query networks compared to other methods. As the temporal rate increases, the accuracy

of Tempo improves dramatically while that of IsoRank remains nearly stagnant and while

MAGNA++ and GHOST continue to generate alignments with low recovery rates. Growing

the temporal rate while keeping the cold rate unchanged means that the topology of the query

network (i.e., cold edges) is evolving slower than the rest of the temporal network (i.e., hot

edges). This implies that Tempo can capture the variation in such evolutionary rate while

competing alignment strategies which fail to do so.
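The recovered region measure can be illustrated with a minimal Python sketch. The function name `recovered_region` and the dictionary representation of the alignment and of the ground truth are assumptions made for illustration only; they are not taken from the Tempo implementation.

```python
def recovered_region(alignment, origin):
    """Percentage of aligned query nodes paired with the target node they came from.

    alignment: dict mapping each aligned query node to a target node (hypothetical layout).
    origin:    dict mapping each query node to the target node it was selected from.
    """
    if not alignment:
        raise ValueError("alignment is empty")
    hits = sum(1 for q, t in alignment.items() if origin.get(q) == t)
    return 100.0 * hits / len(alignment)


# Toy example: three of the four query nodes are mapped back to their source node.
origin = {"q1": "a", "q2": "b", "q3": "c", "q4": "d"}
alignment = {"q1": "a", "q2": "b", "q3": "c", "q4": "x"}
print(recovered_region(alignment, origin))
```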

4.5.2 Evaluation of Induced Conserved Structure

Next, we evaluate the topological quality of the alignment generated by Tempo through comparison with IsoRank, MAGNA++, and GHOST. For this purpose, we measure the shared topological structure between $G^1_i$ and $G^2_i$ which is preserved under the alignment function ψ through all time points i. Induced conserved structure (ICS) measures the percentage of edges from $G^1_i$ that are aligned to edges in $G^2_i$ to the total edges of the induced subnetwork $\Psi(V^1|G^2_i)$, and is one of the most common measures of topological quality (73).

Figure 4-4. The induced conserved structure (ICS) score of the resulting alignment, varying ϵ and ϵc to take the values {0.05, 0.1, 0.2, 0.4, 0.8} and {0.05, 0.1, 0.2}, respectively. The x-axis shows the temporal rate, ϵ, and the cold rate, ϵc. The y-axis shows the ICS score of GHOST, MAGNA++, and IsoRank against our method (Tempo).

Formally, $\mathrm{ICS}(\mathcal{G}^1, \mathcal{G}^2, \psi) = \sum_{i=1}^{t} \frac{|E^1_i \cap E^2_i[\Psi(V^1|G^2_i)]|}{|E^2_i[\Psi(V^1|G^2_i)]|}$. Figure 4-4 presents the results, which demonstrate that Tempo generates alignments with high quality based on ICS compared to other algorithms. We note that although GHOST was created to optimize ICS, Tempo outperforms GHOST on this measure, especially when the temporal rate is high, since the performance of GHOST degrades.

4.5.3 Evaluation of Edge Correctness

In this experiment, we evaluate the topological quality of the alignment generated by our method against IsoRank, MAGNA++, and GHOST. For this purpose, we measure the shared topological structure between $G^1_i$ and $G^2_i$ which is preserved under the alignment function ψ through all time points i. Edge correctness (EC) is one of the most common measures of topological quality (73; 74). Its computation is similar to that of ICS. Basically, it measures the percentage of edges from $G^1_i$ that are aligned to edges in $G^2_i$ to the total edges of the smaller network. More specifically, $\mathrm{EC}(\mathcal{G}^1, \mathcal{G}^2, \psi) = \sum_{i=1}^{t} \frac{|E^1_i \cap E^2_i[\Psi(V^1|G^2_i)]|}{|E^1_i|}$. Figure 4-5 presents the results, which demonstrate that our algorithm generates alignments with high quality based on EC compared to other algorithms.
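The two topological measures can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the evaluation code used in the experiments: each temporal network is assumed to be a list of undirected edge lists (one per time point), the alignment ψ is assumed to be a dictionary `psi`, and the helper name `ics_and_ec` is hypothetical. Time points with an empty denominator are assumed to contribute zero.

```python
def ics_and_ec(E1, E2, psi):
    """Sum over time points of the per-time-point ICS and EC ratios.

    E1, E2: lists of edge lists, one per time point (undirected edges as pairs).
    psi:    dict mapping nodes of the query network to nodes of the target network.
    """
    image = set(psi.values())  # target nodes covered by the alignment
    ics = 0.0
    ec = 0.0
    for e1_i, e2_i in zip(E1, E2):
        target_edges = {frozenset(e) for e in e2_i}
        # Query edges carried over to the target through psi.
        mapped = {frozenset((psi[u], psi[v])) for u, v in e1_i if u in psi and v in psi}
        conserved = mapped & target_edges
        # Target edges with both endpoints among the aligned nodes (induced subnetwork).
        induced = {e for e in target_edges if e <= image}
        if induced:
            ics += len(conserved) / len(induced)
        if e1_i:
            ec += len(conserved) / len(e1_i)
    return ics, ec


# Toy example with a single time point.
E1 = [[("a", "b"), ("b", "c")]]
E2 = [[(1, 2), (2, 3), (1, 3), (3, 4)]]
psi = {"a": 1, "b": 2, "c": 3}
print(ics_and_ec(E1, E2, psi))
```

In the toy example both query edges are conserved, the induced target subnetwork has three edges, and the query has two edges, so the ICS term is 2/3 and the EC term is 1.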

4.5.4 Evaluation of Statistical Significance of The Alignment

We compare the statistical significance of the alignments generated by Tempo against

that of existing methods. In order to ensure that our experiments do not give any advantage to

our algorithm, we use IsoRank to generate initial alignments for Tempo and thus, compare the

statistical significance against IsoRank only.

Figure 4-5. The edge correctness (EC) score of the resulting alignment, varying ϵ and ϵc to take the values {0.05, 0.1, 0.2, 0.4, 0.8} and {0.05, 0.1, 0.2}, respectively. The x-axis shows the temporal rate, ϵ, and the cold rate, ϵc. The y-axis shows the EC score of GHOST, MAGNA++, and IsoRank against our method (Tempo).

Figure 4-6. The average z-score of Tempo across network sizes {100, 250, 500, 750, 1000}, varying ϵ and ϵc to take the values {0.05, 0.1, 0.2, 0.4, 0.8} and {0.05, 0.1, 0.2}, respectively. The x-axis shows the temporal rate, ϵ, and the cold rate, ϵc. The y-axis shows the z-score of IsoRank (white) against Tempo (black).

Varying evolution rate. In this experiment, we evaluate the effect of varying the temporal

rate (ϵ) and cold rate (ϵc) on the significance of the score of the alignments produced by

Tempo and that of IsoRank. We generate synthetic networks of sizes {100, 250, 500, 750,

1000} and 20 time points. We fix the network density to two edges per node on the average,

and vary ϵ and ϵc (ϵc ≤ ϵ) to take the values {0.05, 0.1, 0.2, 0.4, 0.8} and {0.05, 0.1, 0.2},

respectively. Next, we randomly select 50 nodes from the target network 1,000 times, and calculate the alignment score of each selection, i.e., each random selection corresponds to an alignment. We calculate the mean and standard deviation of these 1,000 scores and generate the z-score of the alignment generated by Tempo using this mean and standard deviation. We denote the score generated by our method by S∗, and denote the mean and standard deviation of the 1,000 scores generated from the random selections by Sµ and σ, respectively. We calculate the z-score of our method as (S∗ − Sµ)/σ. We calculate the z-score of the IsoRank method in a similar manner. Figure 4-6 presents the average z-score values across all target network sizes. The results show that as we increase the temporal rate, the z-score of Tempo increases significantly while the z-score of IsoRank increases by a small amount. As the evolution rate increases, the topology of the alignment found by Tempo differs significantly from the topology of the rest of the network, and thus, it becomes more challenging to find the correct alignment. However, Tempo continues to generate accurate and significant results, especially for large evolution rates, unlike IsoRank, which considers each single time point independently. We observe the same pattern as we increase the cold rate.

Figure 4-7. The average z-score of Tempo (black) against IsoRank (white) (A) varying target time points, where the x-axis shows the time points selected, and (B) varying network size, where the x-axis shows network sizes in terms of number of nodes.
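The z-score procedure described above can be sketched as follows. This is a minimal illustration, not the experimental code: `score_fn` stands in for the alignment score function, the helper names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which estimator is used.

```python
import random
import statistics


def random_baseline(target_nodes, k, trials, score_fn, seed=0):
    """Scores of `trials` random selections of k nodes from the target network."""
    rng = random.Random(seed)
    return [score_fn(rng.sample(target_nodes, k)) for _ in range(trials)]


def z_score(observed, baseline_scores):
    """(S* - S_mu) / sigma, with S_mu and sigma taken over the random baseline."""
    mu = statistics.mean(baseline_scores)
    sigma = statistics.pstdev(baseline_scores)  # assumption: population std. dev.
    if sigma == 0:
        raise ValueError("baseline scores have zero variance")
    return (observed - mu) / sigma


# Toy usage: the "score" of a selection is just the sum of its node ids (placeholder).
nodes = list(range(100))
baseline = random_baseline(nodes, k=5, trials=50, score_fn=sum)
print(z_score(450, baseline))
```

In the experiments described in the text, k would be 50 and the number of trials 1,000; the toy values above are chosen only to keep the example small.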

Varying time points. In this experiment, we evaluate how the z-scores of Tempo and IsoRank

differ as the input networks evolve and deviate from each other. More specifically, we consider

aligning the query network with each of the four target sets we have which have evolving

time points (i.e. older ages) as we move to later target sets. First, we measure the z-score

of aligning the query to the first target set (i.e., containing time points 2, 7, 12, . . . ) then

we measure the z-score of aligning the query to the second target set (i.e., containing time

points 3, 8, 13, . . . ) and so on. We present the average z-score across all temporal and cold

rates. Figure 4-7A presents the results, which show that Tempo continues to generate alignments with highly significant scores as the network evolves. We observe the same pattern for IsoRank; however, Tempo outperforms IsoRank, especially when the time points are distant. This confirms that as the target and query networks evolve and deviate from each other, Tempo is able to take into account the evolution through consecutive time points and generate accurate alignments that persist.


Varying network size. In this experiment, we compare the significance of the alignment

generated by Tempo against IsoRank as the target network size increases and the query

becomes small with respect to the target. We average the z-score across all evolution rates

and vary target network size to take values {100, 250, 500, 750, 1000}. Figure 4-7B presents

the results, which show that the significance of the alignment (best alignment) increases as we

increase the size of the underlying target network. We expect this behavior since we compare

the aligned nodes (50 nodes) to a random selection of 50 nodes from the underlying target

network. Thus, the chance of selecting the best alignment at random decreases. That said, Tempo was able to identify the accurate alignment, which results in highly significant values.

4.5.5 Evaluation of Running Time


In this experiment, we evaluate the running time of our algorithm using synthetic dataset

for network sizes as well as number of time points (t). We report the average running time

over all values of ϵ and ϵc with each parameter combination tested 10 times. We also report

the running time for IsoRank, MAGNA++, and GHOST for aligning two networks at a single

time point. Figure 4-8 presents the results, which demonstrate that Tempo successfully scales to large target networks. The running times of both Tempo and IsoRank grow linearly with increasing target network size and number of time points (t). We notice that MAGNA++ behaves similarly to IsoRank, while GHOST has an exponential running time.

The running time of Tempo is more than that of IsoRank, which is unsurprising since Tempo

computes alignment across multiple time points. That said, Tempo has practical running time

even for large networks with many time points. More importantly, unlike IsoRank, Tempo

considers the network topology at all time points while aligning networks. As we present later

in this section, as a natural consequence of the extra effort our method puts to consider all

time points, the alignment it finds is significantly more accurate than that of IsoRank which

considers only one time point at a time.

Figure 4-8. The total running time of IsoRank and Tempo for synthetic networks, varying target network size from {100, 250, 500, 750, 1000}, and varying t from 5 to 20. The x-axis shows the input network sizes. The y-axis shows the total running time in seconds. (Panels: IsoRank, MAGNA++, Tempo, GHOST; series: t = 20, 15, 10, 5.)

4.5.6 Evaluation of Recovered Genes in Real Dataset

In this experiment, we evaluate the recovered query region from gene aging dataset by

our algorithm, Tempo, against MAGNA++ and GHOST. Recall that we discussed the values

of IsoRank in the main paper since it reports high recovered rates. The recovered region

computes the percentage of genes in the query network that were mapped to themselves in the

target network despite their evolving topologies. Tables 4-1, 4-2, and 4-3 present the results

for Alzheimer’s, Huntington’s, and Type II diabetes, respectively. The results show that our algorithm significantly outperforms both MAGNA++ and GHOST by aligning similar genes despite their evolving topologies. In contrast, MAGNA++ and GHOST align only a small portion of the query genes to themselves. This suggests that our algorithm successfully captures the evolving topologies of the genes through time points, while the other algorithms fail to do so since they align each time point independently.

| Target time points | Tempo | MAGNA++ | GHOST |
|---|---|---|---|
| First 7 | 94.87 | 2.56 | 0 |
| Second 7 | 97.43 | 5.13 | 0.36 |
| Third 7 | 97.43 | 2.56 | 0 |
| Fourth 7 | 97.43 | 2.56 | 0 |

Table 4-1. Percentage of recovered query genes from gene aging dataset when using Alzheimer’s phenotype as query.

| Target time points | Tempo | MAGNA++ | GHOST |
|---|---|---|---|
| First 7 | 90.9 | 0.36 | 0 |
| Second 7 | 86.36 | 0 | 0 |
| Third 7 | 95.45 | 0.73 | 0 |
| Fourth 7 | 95.45 | 0.73 | 0 |

Table 4-2. Percentage of recovered query genes from gene aging dataset when using Huntington’s phenotype as query.

| Target time points | Tempo | MAGNA++ | GHOST |
|---|---|---|---|
| First 7 | 97.22 | 2.56 | 0 |
| Second 7 | 97.22 | 2.56 | 0 |
| Third 7 | 97.22 | 5.12 | 0 |
| Fourth 7 | 97.22 | 2.56 | 0 |

Table 4-3. Percentage of recovered query genes from gene aging dataset when using Type II diabetes phenotype as query.

4.5.7 Evaluation on Real Data

Next, we evaluate Tempo on the real data. We first evaluate the significance of alignment

score using Tempo. We calculate the z-score by comparing the score of aligned nodes to the

score of 1,000 randomly selected alignments of the same number of nodes. We compare our

results to those of IsoRank. We repeat this experiment for three different disease network queries: Alzheimer’s, Huntington’s and Type-II diabetes. Figure 4-9 shows the results. Our results demonstrate that Tempo yields highly significant alignments, and outperforms IsoRank in terms of z-score. We also observe that the z-scores of the non-age-related disease (diabetes) are lower than those of the age-related diseases (i.e., Alzheimer’s and Huntington’s). Although there are some fluctuations in the z-score with growing time gap between query and target networks, we observe that the z-score tends to increase for Alzheimer’s and Huntington’s disease, unlike Type-II diabetes. This suggests that age-related pathways have a higher evolution rate than other pathways. Thus, we conjecture that Tempo, which takes all time points into consideration, is suitable for capturing evolving topologies.

Next, we consider the biological significance of our results by identifying aligned gene pairs in which the aligned genes are different, and determining prior evidence that these gene pairs are biologically relevant. We use Tempo to identify 4, 4 and 6 such pairs for Alzheimer’s, Huntington’s and Type-II diabetes, respectively. We note that the Alzheimer’s, Huntington’s and Type-II diabetes query sizes are 39, 36, and 23. Thus, the differently mapped genes account for 10% to 26% of all the genes in the alignment. IsoRank only mapped genes to themselves, suggesting that IsoRank only considers static topologies, while our algorithm maps genes based on homological similarities as well as evolving topologies. MAGNA++ and GHOST mapped only a few genes to themselves, while the other mapped genes were poorly related.

Figure 4-9. The average z-score of our method using real data of three different diseases: Alzheimer’s, Huntington’s and Type-II diabetes. The x-axis shows which time points were selected to represent the target network. The y-axis shows the z-score of IsoRank (white bars) against our method (black bars).

For each combination of disease and differently mapped gene pairs identified by Tempo, we first search PubMed for publication evidence specific to that disease. For instance, in the case of Alzheimer’s disease, the gene DAB1, which was selected by Tempo, has been identified as a potential gene that encodes proteins related to functions in biological pathways relevant to the disease (102). Among the genes found by Tempo for type II diabetes, the gene ACTA1, for example, shows a remarkable change in gene expression value in diabetic samples compared to non-diabetic samples (103). Moreover, significant up-regulation of GRB2 is observed in transgenic samples compared to controls (104).

In order to determine the biological processes of the aligned genes found by Tempo in the gene aging dataset, we perform gene ontology analysis of the aligned genes in the target network using Gene Ontology Consortium (105). We identify the biological processes or signaling pathways that play significant roles in the disorder. We calculate how many related pathways are found by our method (Tempo) compared with MAGNA and GHOST, as well as their significance.

| Disease | Tempo | MAGNA++ | GHOST |
|---|---|---|---|
| Alzheimer | 2 / 4 / 2.29E-14 | 1 / 2 / 2.14E-03 | 1 / 2 / 3.32E-04 |
| Huntington’s | 1 / 4 / 1.15E-22 | 0 | 0 |
| Diabetes | 2 / 4 / 2.29E-09 | 1 / 1 / 2.2E-01 | 0 |

Table 4-4. Number and significance of functional pathways associated with the underlying disease observed among the aligned genes of the target network. Each cell lists the results in the form x/y/z. Here, x represents the number of pathways identified, y denotes the number of time points at which these pathways are observed, and z is the statistical significance (p-value) of the least significant of these pathways. A cell with the value 0 implies that no pathways were found.

We also counted the frequency of those pathways when using different ranges of time points. Table 4-4 presents the results. We find references to certain pathways that are related to specific neurodegenerative disorders (Alzheimer’s and Huntington’s diseases). For the genes we identify when we use Alzheimer’s disease as a query network, we find two pathways, namely Alzheimer disease-amyloid secretase and Alzheimer disease-presenilin, that are related to Alzheimer’s disease (106). Various growth factors alter the brain development process at a younger age, which manifests as a variety of risk factors at an older age and eventually results in aging-related diseases such as Alzheimer’s and Huntington’s diseases (107). For the genes we identify when we use the type II diabetes phenotype as a query, we find two pathways that are commonly associated with type II diabetes (108), namely Insulin/IGF pathway-protein kinase B signaling cascade and Insulin/IGF pathway-mitogen activated protein kinase kinase/MAP kinase cascade. On the other hand, MAGNA and GHOST each found at most one pathway, with very low significance, and these pathways did not appear in all tested target networks (Table 4-4). In conclusion, studying temporal networks in general, and human aging specifically, using Tempo enables us to distinguish age-related genes from non-age-related genes successfully. More importantly, Tempo takes the network alignment problem one huge step forward by moving beyond the classical static network models.

Significance of disease relevance. In this experiment, we perform gene ontology analysis on the aligned genes from the target network that result from our method, Tempo, as well as from MAGNA and GHOST. Here, we present the percentage of genes that contribute to the significant pathways which are related to the query disease. We show the results for Alzheimer’s disease. Results are similar for the other two queries. Figure 4-10 presents the results, which demonstrate that our method finds alignments in the target network with a substantial fraction of genes that contribute to the pathways associated with the query disease. On the other hand, the resulting alignments of MAGNA and GHOST contribute only a very small fraction to pathways associated with Alzheimer’s. Notice that the aligned genes resulting from our method have two pathways that are associated with Alzheimer’s, while those of MAGNA and GHOST have only one.

Figure 4-10. The percentage of genes that contribute to each pathway of the resulting aligned genes in the target network, for (A) Tempo, (B) MAGNA, and (C) GHOST. We point to the significant related pathways of the query disease (Alzheimer).

4.6 Discussion

In this chapter, we developed a novel and scalable method to solve the problem of network alignment between two given temporal networks. Our method seeks a persistent alignment through all time points of the input networks. We proposed a new alignment score function to increase the similarity between aligned nodes and reduce the number of disconnected components of the aligned nodes in the target network. We proposed a dynamic programming solution to this problem, which refines the alignment by selecting a maximum of k (user-specified) swapping pairs of nodes from the larger network, where each pair represents an aligned node and a gap node. The selection process monotonically decreases the number of disconnected components and thus increases the alignment score. Our method first identifies an initial alignment between the two input networks based on their node similarities. Our algorithm then iteratively selects k swapping pairs. We prove the correctness of our algorithm. Our experiments on both synthetic and real datasets comprehensively demonstrated that our method is both fast and efficient. We observed using synthetic networks that the running time of our algorithm remains reasonable as the size of the target network and the number of time points, t, grow. Comparing our algorithm to a classical network alignment algorithm shows that our method generates more significant alignments and captures the temporal evolution of the two input networks. Moreover, we performed gene ontology analysis on the genes reported by our algorithm after the swapping mechanism and observed that they are of biological significance as well.


CHAPTER 5
IDENTIFICATION OF CO-EVOLVING TEMPORAL NETWORKS WITH UNCERTAIN TIMELINE

5.1 Preface

Biological networks describe the interaction between molecules. They are frequently

represented as graphs, where the nodes correspond to the molecules (e.g., proteins or genes)

and the edges correspond to their interactions (1). Formally, we denote a biological network

as G = (V,E) where V and E represent the set of nodes and the set of edges, respectively.

The topology of the interactions of biological networks is not static. Genetic and epigenetic mutations, errors in DNA replication, and aging can alter molecular interactions (13). Due to this dynamic behavior, the topology of the network that models the molecular interactions evolves and changes over time (16). The majority of the previous work on alignment of biological networks assumes the network topology is static (10) (Section 5.2 includes further details).

This assumption ignores the history of network evolution, and may lead to biased or incorrect

analysis. For example, identifying causes and consequences of the influence of external stimuli

is impossible when analyzing static topologies. In this paper, we define a biological network

using a model that accounts for the evolution of the underlying network at consecutive time

points. We refer to this model as a temporal network (24). We denote a temporal network

with t consecutive time points as G = [G1, G2, . . . , Gt], where Gi = (V,Ei) represents the

topology of the network at the ith time point.

Various factors affect the evolution process of a biological network and thus, introduce

uncertainty when capturing such evolution. For example, the evolution rate of interacting

molecules differs between people with different disorders (i.e. diseases) or people with same

disorder but at different stages of this disorder (27). Furthermore, the reaction to a specific

medication differs from one person to another depending on their resistance levels and

immune systems (109). Consequently, the observed interactions of humans may vary even if

they are measured at the same time. Thus, the interaction networks constructed for those

measurements may correspond to different stages of the evolution.


In this work, we consider the problem of identifying coevolving subnetworks between subsequences of a given pair of temporal networks. We say that two subnetworks are coevolving if their topologies remain similar even though their topologies evolve over time. We define this more formally as follows. We consider two input temporal networks $\mathcal{G}^1 = [G^1_1, G^1_2, \ldots, G^1_m]$ and $\mathcal{G}^2 = [G^2_1, G^2_2, \ldots, G^2_n]$. We let $t^1_i$ and $t^2_j$ represent the ith time point of $\mathcal{G}^1$ and the jth time point of $\mathcal{G}^2$, respectively, where $i \in \{1, 2, \ldots, m\}$ and $j \in \{1, 2, \ldots, n\}$. Notice that the time point numbers only show the order of consecutive snapshots of the network, such that for all i and j with $1 \le i < m$ and $1 \le j < n$, we have $t^1_i < t^1_{i+1}$ and $t^2_j < t^2_{j+1}$. These numbers do not reflect actual timing information. More specifically, the time points of the observed network topologies are uncertain, in that the information of which time point in one sequence corresponds to which time point in the other sequence is not known in advance. Furthermore, $\mathcal{G}^1$ and $\mathcal{G}^2$ possibly have different numbers of time points (i.e., $m \neq n$). Without loss of generality, we let $\mathcal{G}^1$ be the temporal network with the smaller number of time points (i.e., $m \le n$). Another version of the temporal alignment problem exists where the time points of $\mathcal{G}^1$ and $\mathcal{G}^2$ are known (110). Having the knowledge of the time points implies that the time values govern which network in $\mathcal{G}^1$ gets aligned with which network in $\mathcal{G}^2$. However, this assumes that both networks co-evolve at the same speed. Here, we consider the uncertainty of the time points in each topological network. This is a very challenging problem since it requires not only aligning the temporal networks, but also finding the corresponding time points at which the alignment yields the highest score.

In this paper, we aim to find a subsequence S of $\mathcal{G}^2$ with m time points that corresponds to $\mathcal{G}^1$ and at which the alignment yields the highest alignment quality (i.e., topological and biological similarities). Finding such a subsequence is very challenging, since the naive strategy would be to exhaustively search among all possible subsequences of $\mathcal{G}^2$. However, this is computationally too expensive, as the number of candidate subsequences S is $C^n_m$ (here $C^i_j$ is the combinatorial i choose j function), and thus grows exponentially with m and n. To avoid this exponential cost, we apply a dynamic time warping algorithm. In this algorithm, we find the optimal matching between the two input temporal networks by shifting and stretching the time points of $\mathcal{G}^1$ based on the alignment quality. For instance, omitting the first two networks in $\mathcal{G}^2$ in the alignment corresponds to the case where $\mathcal{G}^1$ denotes a later stage of evolution by two time points as compared to $\mathcal{G}^2$. Similarly, omitting intermediate networks in $\mathcal{G}^2$ corresponds to the case when $\mathcal{G}^1$ is evolving slower than $\mathcal{G}^2$.
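To give a feel for how quickly the exhaustive search space grows, the short Python sketch below counts the candidate subsequences for a few illustrative values of n and m. The helper name `num_subsequences` and the chosen sizes are assumptions made for this example only.

```python
import math


def num_subsequences(n, m):
    """Number of ways to choose m ordered time points out of n (n choose m)."""
    return math.comb(n, m)


# Illustrative sizes only; they are not taken from the experiments.
for n, m in [(10, 5), (20, 10), (40, 20)]:
    print(n, m, num_subsequences(n, m))
```

Each of these candidates would require a full temporal alignment in the exhaustive strategy, which motivates the dynamic time warping formulation.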

Contributions. In this paper, we address the problem of identifying coevolving subnetworks in a given pair of temporal networks with uncertain timelines. This is the first work to tackle this problem. We introduce a novel method, Tempo++, using a dynamic time warping algorithm. Our solution is efficient and scalable for a wide range of network sizes, numbers of time points and evolution rates. We demonstrate the efficiency and accuracy of Tempo++ using both real and synthetic data. For the real dataset, we use a gene expression dataset which contains the time-resolved response of E. coli to five different environmental perturbation conditions (cold, heat, oxidative stress, lactose diauxie, and stationary phase). Using our




heat and oxidative stress conditions. We compare the statistical significance of the alignments


5.2 Related Work and Notations

In this section, we discuss the literature of the biological network alignment problem and

introduce mathematical notations that we use throughout the paper.

Related work. Existing network alignment problems can be categorized as follows: (i) pairwise alignment, (ii) multiple network alignment, (iii) dynamic network alignment, and (iv) temporal alignment with certain timeline. The pairwise network alignment problem ignores that the network topology evolves (10; 76; 79; 81; 82; 83; 84; 74; 85; 86; 87). Although the multiple alignment problem can consider more than two networks at once, it lacks the ability to capture the temporal changes since it treats all networks as having static topologies (88; 89; 90; 91). The dynamic network alignment problem considers topological changes over time. However, it seeks a different solution to the alignment problem at each time point. Thus, it cannot identify coevolving subnetworks. Unlike these alignment problems, temporal network alignment captures that network topologies coevolve over time.

Notations. We represent the alignment of the two temporal networks $\mathcal{G}^1$ and $\mathcal{G}^2$ as a bijection of their nodes and denote it as a function $\psi : V^1 \to V^2$. Notice that our goal is to identify coevolving subnetworks within the input temporal networks. Thus, the alignment of a temporal network persists across all time points in both input networks, and describes a mapping of the nodes which does not change from one time point to another. Next, we define the quality score of the alignment. We compute the score of the alignment ψ of $\mathcal{G}^1$ and $\mathcal{G}^2$, denoted with $\mathrm{score}(\mathcal{G}^1, \mathcal{G}^2 \mid \psi)$, as the sum of the scores of the alignment at all time points. Hence, $\mathrm{score}(\mathcal{G}^1, \mathcal{G}^2 \mid \psi) = \sum_{i=1}^{t} \mathrm{score}(G^1_i, G^2_i \mid \psi)$. We assume $\mathcal{G}^1$ is connected at all time points, but it may be impossible to find an alignment that is connected in the target network at all time points. Notice that $\mathrm{score}(G^1_i, G^2_i \mid \psi)$ integrates the similarities of the aligned nodes and their evolving topologies, and includes a penalty for disconnectedness of the aligned subnetworks of the target network at each time point (see (110) for more details).

Our goal in this paper is to identify a subsequence of m networks from the temporal

network with the longer sequence of networks, G2 such that this subsequence yields the highest

alignment score when aligned with G1. Let us denote a subset of {1, 2, …, n} of size m with

S = {s1, s2, …, sm} with ∀i, si < si+1. We will call S a subsequence from now on as it

contains ordered numbers. The challenge in this paper is to identify the subsequence S of size

m and the alignment denoted with the mapping function ψ() : V 1 → V 2 which maximize the

alignment score as follows

$$\operatorname*{argmax}_{S, \psi} \Big\{ \sum_{1 \le i \le m} \mathrm{score}(G^1_i, G^2_{s_i} \mid \psi) \Big\}.$$

5.3 Method

Solving for the optimal alignment function for a specific S subsequence reduces to the

problem of temporal alignment with known time points information (110). Thus, we focus

next on describing our solution to identify a subsequence S of G2 which yields the maximum

alignment score.

We adopt the dynamic time warping (DTW) algorithm to solve this problem. DTW has been used for comparing two time series with varying numbers of time points (111). Here, we only allow stretching and/or shifting of time points in the network with the longer sequence, $\mathcal{G}^2$. Also, we ignore time points from the longer temporal network $\mathcal{G}^2$ that do not belong to the subsequence S. The DTW algorithm iteratively aligns the ith time point of $\mathcal{G}^1$ to a time point of $\mathcal{G}^2$, where $1 \le i \le m$. At each iteration i, there exists a window of possible time points of $\mathcal{G}^2$ that could be aligned with i. This window is defined as $[i : (n - m + i)]$. Thus, there exist $(n - m + 1)$ identified alignments for each time point of $\mathcal{G}^1$. Note that the optimal alignment of the first x time points of $\mathcal{G}^1$, where $1 \le x < m$, is not necessarily part of the optimal alignment of all time points in $\mathcal{G}^1$. This is because the score of the alignment depends on both the functional and topological similarities, and the topological similarities change from one time point to another. Thus, we need to keep all $(n - m + 1)$ identified alignments at each iteration until we reach the final iteration. Let us assume that the algorithm identifies the alignment of the first i time points of $\mathcal{G}^1$ (there are $(n - m + 1)$ such alignments).

Let us denote the dynamic time warping alignment of the first i time points in G1 to the first j time points of G2 with a doubly indexed indicator function δ() such that δ(r, s) = 1 if $G^1_r$ is aligned with $G^2_s$, and δ(r, s) = 0 otherwise. Also, let us denote the node mapping obtained when the first i time points in G1 are aligned to the first j time points of G2 with the function $\psi_{i,j}()$. Let us define w as (n − m), so that the window of iteration i is [i : (i + w)]. Also, let us denote the score of the dynamic time warping alignment of the first i time points in G1 to the first j time points of G2 with f(i, j), which represents the total score of the alignments at those mapped time points under $\psi_{i,j}()$:

\[
f(i, j) = \sum_{1 \le r \le i,\; i \le s \le i+w} \mathrm{score}(G^1_r, G^2_s \mid \psi_{i,j})\, \delta(r, s).
\]

We calculate the alignment for the ith time point of G1 iteratively based on the alignment score as

\[
f(i, j) = \mathrm{score}(G^1_i, G^2_j, \psi_{i,j}) + \max_{(i-1) \le k \le (i+w-1)} \{ f(i-1, k) \}.
\]

The final solution chooses the alignment of all m time points from the last iteration as

\[
\mathrm{solution}(\delta(m,n), \psi_{m,n}, G^1, G^2) = \operatorname*{argmax}_{m \le j \le n} \{ f(m, j) \}. \tag{5-1}
\]
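As an illustration only, the recurrence can be sketched in Python as follows. The sketch makes several simplifying assumptions that are not part of the method description: the pairwise scores are given as a precomputed matrix `score[i][j]` (in Tempo++ they depend on the node mapping computed for each pair of time points), indices are 0-based, w is taken as n − m to match the window [i : (n − m + i)], and the predecessor k is additionally restricted to k < j so that the selected time points of G2 are strictly increasing. The function name `best_subsequence` is hypothetical.

```python
def best_subsequence(score):
    """Pick one time point of G2 per time point of G1, maximizing the total score.

    score[i][j] is the (assumed precomputed) score of aligning time point i
    of G1 with time point j of G2, using 0-based indices.
    Returns (best total score, list of selected G2 indices).
    """
    m, n = len(score), len(score[0])
    if m > n:
        raise ValueError("G1 must not have more time points than G2")
    w = n - m  # width of the sliding window
    neg_inf = float("-inf")
    f = [[neg_inf] * n for _ in range(m)]
    back = [[-1] * n for _ in range(m)]

    # First time point of G1 may match G2 time points 0..w.
    for j in range(w + 1):
        f[0][j] = score[0][j]

    for i in range(1, m):
        for j in range(i, i + w + 1):
            # Assumption: predecessor k lies in the window of i-1 and k < j.
            best_k = max(range(i - 1, j), key=lambda k: f[i - 1][k])
            f[i][j] = score[i][j] + f[i - 1][best_k]
            back[i][j] = best_k

    # Final step: best end point among the window of the last time point.
    j = max(range(m - 1, n), key=lambda col: f[m - 1][col])
    total = f[m - 1][j]
    selected = [j]
    for i in range(m - 1, 0, -1):
        j = back[i][j]
        selected.append(j)
    selected.reverse()
    return total, selected
```

For example, `best_subsequence([[1, 5, 0, 0], [0, 0, 2, 9]])` selects the second and fourth time points of the longer sequence.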

Complexity analysis. We analyze the complexity of dynamic time warping for aligning two input temporal networks with m and n time points, where m ≤ n. In the first step, the algorithm aligns only the first time point of G1 to a time point in G2. Notice that the available time points of G2 to match with the first time point in G1 are 1 to (n − m + 1), since there have to be at least m − 1 remaining points in G2 to match the rest of the points in G1. In each consecutive step, our algorithm iteratively adds to the current alignment a new pair of time points, one from each network. It inspects (n − m + 1) (or simply (n − m)) possible alignments for each new pair and chooses the alignment of the previous points based on the best fit ((n − m) options) when combined with the new pair. Summing over all pairs, the algorithm tries (n − m)^2 cases at each iteration. The cost of alignment increases as we increase the number of time points within the alignment. For example, in the first iteration we align one time point, in the second iteration we align two time points, and so on until we align m time points in the final iteration. Notice that we only analyze the time point matching algorithm, since the cost of alignment when time points are known was analyzed before (110). Thus, the total cost is

\[
\sum_{i=1}^{m} i\,(n-m)^2 = \frac{m(m+1)}{2}\,(n-m)^2 = O\big((n-m)^2 m^2\big).
\]


5.4 Results

We evaluate the performance of our algorithm on synthetic and real data. Next, we

describe both datasets in detail.

Real Dataset. We analyze E. coli expression data using our method. We use the E. coli gene expression dataset GSE20305, obtained from the GEO database (112). This dataset contains the time-resolved response of E. coli to five different environmental perturbation conditions (cold, heat, oxidative stress, lactose diauxie, and stationary phase). Samples and expression values were calculated to form eight time points for each condition. Each experimental condition was independently repeated three times. We average the expression values of the three replicates at each time point. In order to integrate the static PPI network with the gene expression data to form time point/group specific PPI networks, we set a cut-off on the gene expression value. All interactions for which either or both proteins have a transcription value below the cut-off are removed from the corresponding time point specific network. We select the same cut-off for the five conditions. We use the protein-protein interaction (PPI) network data from the STRING database (113).
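The construction of time point specific networks described above can be sketched as follows. This is a minimal illustration under stated assumptions: the data layout (`expression[gene][t]` holding the replicate values at time point t), the function name `time_point_networks`, and the treatment of values exactly equal to the cut-off (kept) are hypothetical choices, not details of our pipeline.

```python
from statistics import mean


def time_point_networks(ppi_edges, expression, cutoff):
    """Return one filtered edge list per time point.

    ppi_edges  : list of (protein_a, protein_b) pairs of the static PPI network
    expression : dict gene -> list over time points of replicate value lists
    cutoff     : expression threshold shared by all conditions
    """
    num_points = len(next(iter(expression.values())))
    networks = []
    for t in range(num_points):
        # Average the replicates of every gene at this time point.
        level = {gene: mean(reps[t]) for gene, reps in expression.items()}
        # Keep an interaction only if both proteins reach the cut-off.
        kept = [
            (a, b)
            for a, b in ppi_edges
            if level.get(a, float("-inf")) >= cutoff
            and level.get(b, float("-inf")) >= cutoff
        ]
        networks.append(kept)
    return networks
```

Genes without expression measurements are treated as below the cut-off in this sketch, which is again an assumption.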

Synthetic dataset. We generate synthetic networks to observe the performance of our method under a wide spectrum of parameters classified under two categories: (i) network size and (ii) temporal model parameters, namely number of time points, temporal rate, and cold rate. We vary the target network size to take values from {100, 250, 500, 750, 1000}. We fix the network density to two edges per node on average (i.e., mean node degree is set to four). We randomly select $G^1_1$ as a connected subnetwork of $G^2_1$. We set the size of the query network to 50 nodes. We generate the target network $G^2_1$ using the Barabási-Albert (BA) model (48), as this model produces scale-free networks. In order to explain the parameters in the second category, we describe how we generate the query and target networks $G^1_1$ and $G^2_1$ at the first time point. We then explain how we use the parameters in this category to build the query and target networks at the remaining time points.


We generate the subsequent networks for the remaining time points using the three parameters in the second category above as follows. The first parameter is the number of time points t in G1 and G2. We use 5, 10, 15, and 20 time points in our experiments. Recall that we select a subnetwork of the target network $G^2_1$ as the first query network $G^1_1$. We mark all nodes and edges in $G^2_1$ within this subnetwork as cold nodes and edges respectively. We mark all other nodes and edges in $G^2_1$ as hot. Next, we iteratively generate the networks $G^1_i$ and $G^2_i$ at the ith time point (i > 1) from $G^1_{i-1}$ and $G^2_{i-1}$ respectively as follows. Let us denote the temporal and cold rates (two real numbers) with ϵ and ϵc respectively such that 0 ≤ ϵc ≤ ϵ ≤ 1. Let us denote the ratio of cold edges to the total number of edges in the target network $G^2_1$ with γ. We calculate the hot rate, denoted with ϵh, from the temporal rate and cold rate as ϵh = (ϵ − ϵcγ)/(1 − γ). Conceptually, hot and cold rates model the rate of evolution of hot and cold edges between two consecutive time points respectively. More specifically, for each subsequent time point i, we generate $G^2_i$ by randomizing $G^2_{i-1}$ as follows. We iterate over all edges in $G^2_{i-1}$. For each edge e, if it is a cold edge, we remove it with probability ϵc and insert a new edge between two randomly chosen cold nodes. If e is a hot edge, we remove it with probability ϵh and insert a new hot edge between two random nodes (with at least one being a hot node). We generate query networks at subsequent time points using almost the same procedure, with the only difference being that all edges are cold. We generate datasets by varying ϵ and ϵc to take the values {0.05, 0.1, 0.2, 0.4, 0.8} and {0.05, 0.1, 0.2} respectively. For each parameter setting, we generate 10 target and query temporal networks.
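One evolution step of this generator can be sketched as below. The sketch is illustrative and rests on assumptions of ours: an edge is treated as cold when both of its endpoints are cold nodes, self-loops and duplicate edges are rejected with a bounded number of retries, and the names `hot_rate` and `evolve` are hypothetical.

```python
import random


def hot_rate(eps, eps_c, gamma):
    """Hot rate from the temporal rate, cold rate and cold-edge ratio."""
    return (eps - eps_c * gamma) / (1.0 - gamma)


def evolve(edges, cold_nodes, nodes, eps_c, eps_h, rng):
    """Rewire the edges of one time point to produce the next time point.

    edges is an iterable of node pairs; the result is a set of frozensets.
    Assumption: an edge is cold when both endpoints are cold nodes.
    """
    nodes = list(nodes)
    cold = set(cold_nodes)
    cold_list = sorted(cold)
    hot_list = [v for v in nodes if v not in cold]
    current = {frozenset(e) for e in edges}
    for e in sorted(current, key=sorted):  # fixed order for reproducibility
        is_cold = e <= cold
        if rng.random() >= (eps_c if is_cold else eps_h):
            continue  # the edge survives unchanged
        current.discard(e)
        for _ in range(1000):  # bounded retry to avoid duplicates
            if is_cold:
                u, v = rng.sample(cold_list, 2)
            else:
                u = rng.choice(hot_list)  # at least one hot endpoint
                v = rng.choice(nodes)
            candidate = frozenset((u, v))
            if u != v and candidate not in current:
                current.add(candidate)
                break
    return current
```

A call such as `evolve(edges, cold_nodes, nodes, 0.05, hot_rate(0.2, 0.05, gamma), random.Random(0))` would produce the edge set of the next time point; a query network corresponds to the special case in which every node is cold.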

Recall that we generate the scoring matrix based on both homology and topology similarities. We generate the homology score between a pair of nodes u ∈ V^1 and v ∈ V^2 as follows. If v was originally selected as a cold node and u is the same as v, then we generate the homology score between u and v from a log-normal distribution (101) with mean 2µ and standard deviation σ. Otherwise, we randomly generate the homology score between u and v from a log-normal distribution with mean µ and standard deviation σ. In this way, we make nodes in the query network likely to align to the nodes in the target network that they were originally extracted from. In this work, we set µ and σ to be 2 and 0.25 respectively. Notice that the homology scores do not change through time points, although topology scores do. Thus, evolution through time points of query and target networks may affect how the query is aligned to the cold region in the target network. We set the edge insertion penalty δ to be $\max_{u \in V^1, v \in V^2} S(u, v)$.
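The homology score generation can be sketched as follows. We assume here, for illustration, that µ and σ parameterize the underlying normal distribution, which is how Python's `random.lognormvariate` interprets its arguments; the text does not settle this point. The function name `homology_scores` and the fixed seed are hypothetical.

```python
import random


def homology_scores(query_nodes, target_nodes, mu=2.0, sigma=0.25, seed=0):
    """Draw a homology score for every (query node, target node) pair.

    A query node u and a target node v with u == v form a planted pair,
    since the query nodes were extracted from the target network.
    Assumption: mu and sigma describe the underlying normal distribution.
    """
    rng = random.Random(seed)
    scores = {}
    for u in query_nodes:
        for v in target_nodes:
            location = 2 * mu if u == v else mu  # planted pairs score higher
            scores[(u, v)] = rng.lognormvariate(location, sigma)
    return scores
```

Under this reading, the edge insertion penalty would simply be `max(scores.values())`.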

5.4.1 Comparing Against Other Strategies

In this section, we compare the statistical significance of the alignments generated by Tempo++ against that of other possible strategies to approach the problem. The first strategy is Exact matching, which matches each time point $t^1_i$ with $t^2_i$. The second strategy is Contiguous matching. This strategy matches all time points of the shorter sequence to a contiguous block of the longer sequence network with an equal number of time points. It tries all possible blocks using a sliding window and then selects the matching with the best alignment score. The third strategy is Gap preservation matching, which preserves the gaps between the time points of the shorter network when matching them. It also uses a sliding window technique to get the best fit alignment.
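To make the three baseline strategies concrete, the following sketch enumerates the candidate time point selections each of them considers and picks the best one under a given score matrix. It is an illustration under assumptions: indices are 0-based, `score[i][j]` is an assumed precomputed pairwise score, and for gap preservation we assume the shorter network carries its own time indices `query_times` whose spacing must be kept. All function names are hypothetical.

```python
def exact(m, n):
    """Exact matching: time point i of G1 is matched to time point i of G2."""
    return [list(range(m))]


def contiguous(m, n):
    """Contiguous matching: every block of m consecutive time points of G2."""
    return [list(range(start, start + m)) for start in range(n - m + 1)]


def gap_preserving(query_times, n):
    """Gap preservation: slide the query's own spacing along G2.

    query_times is the (assumed) increasing list of time indices of G1.
    """
    offsets = [t - query_times[0] for t in query_times]
    span = offsets[-1]
    return [[start + off for off in offsets] for start in range(n - span)]


def best(score, candidates):
    """Select the candidate with the highest total alignment score."""
    return max(
        candidates,
        key=lambda sel: sum(score[i][j] for i, j in enumerate(sel)),
    )
```

For instance, `best(score, contiguous(2, 4))` evaluates the three blocks [0, 1], [1, 2] and [2, 3] and returns the highest scoring one.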

In this experiment, we use the real dataset. We fix N to be 8, which is the number of time points in each real network. Then we vary M to take values from {1, 2, 3, 4, 5, 6, 7}. We repeat this for all combinations of two networks of the five stress conditions {cold, heat, lactose, oxidative, control}. For each combination, we calculate the statistical significance using the z-score as follows. We randomly select time points/aligned nodes from the target network 1,000 times and calculate the alignment score of each, i.e., each random selection corresponds to an alignment. We calculate the mean and standard deviation of these 1,000 scores and generate the z-score of the alignment generated by Tempo++ using this mean and standard deviation. We denote the score generated by our method with S∗, and denote the mean and standard deviation of the 1,000 scores generated from the random selections with Sµ and σ, respectively. We calculate the z-score of our method as (S∗ − Sµ)/σ. We calculate the z-score of the other three strategies in a similar manner. We use a z-score significance cut-off of 2 (z-score ≥ 2). Figure 5-1 presents the average z-score values across all network combinations.

Figure 5-1. The z-score of the resulting alignment, varying the number of time points in the shorter sequence M to take the values {1, 2, 3, 4, 5, 6, 7} and keeping N = 8, where M < N. The x-axis shows M. The y-axis shows the z-score of our method (Tempo++) against the other strategies (Contiguous, Gap, Exact). The dashed grey line marks the z-score significance cut-off (z-score ≥ 2).
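The z-score computation itself is small enough to sketch directly. The use of the sample standard deviation (`statistics.stdev`) is an assumption; the text does not state whether the sample or population form was used, and the function name `z_score` is hypothetical.

```python
from statistics import mean, stdev


def z_score(observed, random_scores):
    """z-score of an observed alignment score against random alignments.

    observed      : score S* of the alignment under evaluation
    random_scores : scores of the randomly generated alignments
    """
    mu = mean(random_scores)
    sigma = stdev(random_scores)  # assumption: sample standard deviation
    if sigma == 0:
        raise ValueError("random scores have zero variance")
    return (observed - mu) / sigma
```

An alignment would then be called significant when `z_score(s_star, random_scores) >= 2`.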

The results demonstrate that our method generates more significant alignments compared to the other strategies. When matching only one point (i.e., M = 1), the contiguous and gap matching strategies have the same z-score as Tempo++, as they all perform a sliding window search. However, exact matching has a lower z-score, which is expected as it forces matching exact time points. As we grow M, our method continues to generate significant (z-score ≥ 2) alignments in most of the cases, while the performance of the other methods degrades. We notice that exact matching ranks the lowest among the strategies. This is because it assumes that both input networks evolve at the same speed, which is an incorrect and biased assumption. Similarly, contiguous and gap matching imply certain assumptions which may be misleading.

5.4.2 Comparing Stress Response Against Time Points Matching

In this experiment, we evaluate the quality of aligning time points from different stress conditions using our method against exact matching. To measure the quality of aligning two time points, one from each condition, we calculate the significance of the overlap between the response patterns of the two compared conditions. To quantify this significance, we calculate a p-value as follows. All transcription values were normalized to the average of the time points taken before stress (the first two time points). To estimate the changes between neighboring time points, we calculate the t-test and fold change (FC) between the time point of interest and the directly preceding one. We consider a gene to be significant if the p-value from the t-test is ≤ 0.05 and its FC is ≥ 3. To determine the overlap of responses between different conditions at two time points, one from each condition, we test the significance of the overlapped changes between those two time points using the Fisher exact test (R software package). A p-value ≤ 0.05 reflects significant overlap between the two tested time points. Notice that we only consider time points post-perturbation (t3, t4, . . . ). Figure 5-2 presents the significant overlaps between conditions and time points as well as the matching of time points generated using our method.

Figure 5-2. The significance of the overlaps between different conditions through time points post-perturbation ({t3, t4, t5, t6}). The significance of the overlaps between conditions was calculated based on the Fisher exact test. The significant overlaps (p-value ≤ 0.05) are colored red, whereas non-significant overlaps are colored yellow. The matchings (alignments) of time points generated using our method are marked with ∗.
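The overlap test can be illustrated with a small standard-library sketch. We used the Fisher exact test from R; the Python function below is only an approximation of that step under an explicit assumption, namely a one-sided test for enrichment of the overlap (the hypergeometric upper tail). The name `overlap_p_value` and its parameters are hypothetical.

```python
from math import comb


def overlap_p_value(total, sig_a, sig_b, overlap):
    """One-sided Fisher exact p-value for the overlap of two gene sets.

    total   : number of genes measured
    sig_a   : genes significantly changed at the time point of condition A
    sig_b   : genes significantly changed at the time point of condition B
    overlap : genes significantly changed in both
    Returns the probability of an overlap at least this large by chance.
    """
    upper = min(sig_a, sig_b)
    favorable = sum(
        comb(sig_a, x) * comb(total - sig_a, sig_b - x)
        for x in range(overlap, upper + 1)
    )
    return favorable / comb(total, sig_b)
```

Two time points would be marked as significantly overlapping when the returned value is at most 0.05.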

Matching exact time points (i.e., $t^1_i$ matches $t^2_i$) generates fewer overlapped points than using our algorithm. For example, aligning the lactose and oxidative conditions using our method results in four overlapped time points, whereas exact matching results in only two overlapped time points. This suggests that our method could match time points with significantly similar responses to different stress conditions. Furthermore, the number of overlapped responses decreases with increasing time. This was observed in earlier studies as well (114). In addition, the results show that there exists significant similarity between heat and oxidative stress, which is reflected by both our method and the significant overlap test (every $t^1_i$ matches and overlaps with $t^2_i$).

5.4.3 Hierarchical Clustering of Conditions

In this section, we analyze the similarity between different stress conditions with respect to the changes in gene expression and network topology. For that purpose, we apply hierarchical clustering to these conditions based on the z-scores of pairwise alignments. We first align the networks of two stress conditions using our method. We fix N to be 8, which is the number of time points in each real network. Then we vary M to take values from {1, 2, 3, 4, 5, 6, 7}. We calculate the statistical significance using the z-score as described in Section 5.4.1. We report the average z-score over all M values for each pair of stress conditions. We then use the hierarchical cluster analysis of the R package based on the distribution of z-scores of each condition when aligned with the other conditions. We repeat this for all combinations of two networks of the five stress conditions {cold, heat, lactose, oxidative, control}. Figure 5-3 presents the results.

The results illustrate that the heat and oxidative conditions are clustered together, which confirms the results of the previous experiment. In addition, aligning the networks of both conditions yields a highly significant z-score (2.4671). Similarly, we notice that control and lactose also exhibit similar distributions. On the other hand, the cold and heat conditions are separated, which suggests that genes respond differently to these two temperature stress conditions.


Figure 5-3. The hierarchical clustering of the z-scores of the resulting alignments between two networks of the five stress conditions {cold, heat, lactose, oxidative, control}. White color represents NA values for self alignment, as it was not tested.

In this experiment, we evaluate the running time of our algorithm using the synthetic dataset for all network sizes, varying the number of time points of the longer (N) and shorter (M) networks. We report the average running time over all values of ϵ and ϵc as well as network sizes, with each parameter combination tested 10 times. Figure 5-4 presents the results. The results demonstrate that our algorithm is fast and scales to networks with a large number of time points. The running time of Tempo++ grows linearly with the number of time points in the longer sequence (N). We also notice that the running time of aligning a network with M = x time points (e.g., N = 14 and M = 2) is close to that of aligning a network with M = N − x time points (e.g., N = 14 and M = 12). This is expected, as the complexity of choosing x points out of N points is the same as that of choosing N − x points. However, the running time of our method is somewhat higher when aligning N − x points than when aligning x points, which is unsurprising since Tempo++ uses Tempo, whose running time increases with the number of time points in the network with the shorter sequence. That said, our method has practical running time even for networks with many time points.

5.4.5 Evaluation of Alignment Quality

Figure 5-4. The total running time of our method for synthetic networks, varying the numbers of time points of the input networks M and N to take values from {2, 4, 6, 8, 10, 12, 14, 16, 18, 20} where M < N. The x-axis shows the number of time points of the longer sequence N and the shorter sequence M. The y-axis shows the total running time in seconds (cpu-s).

In this section, we evaluate the topological quality of the alignment generated by our method using the synthetic dataset. For this purpose, we measure the shared topological structure between $G^1_i$ and $G^2_i$ which is preserved under the alignment function ψ through all time points i. Induced conserved structure (ICS) measures the percentage of edges from $G^1_i$ that are aligned to edges in $G^2_i$ to the total edges of the induced subnetwork $\Psi(V^1 \mid G^2_i)$, and is one of the most common measures of topological quality (73). Formally,

\[
ICS(G^1, G^2, \psi) = \sum_{i=1}^{t} \frac{|E^1_i \cap E^2_i[\Psi(V^1 \mid G^2_i)]|}{|E^2_i[\Psi(V^1 \mid G^2_i)]|}.
\]

We also evaluate our algorithm against other algorithms using the edge correctness (EC) measure, which has a similar computation to ICS. Basically, it measures the percentage of edges from $G^1_i$ that are aligned to edges in $G^2_i$ to the total edges of the smaller network. More specifically,

\[
EC(G^1, G^2, \psi) = \sum_{i=1}^{t} \frac{|E^1_i \cap E^2_i[\Psi(V^1 \mid G^2_i)]|}{|E^1_i|}.
\]

In this experiment, we vary the numbers of time points of the two input biological networks G1 (m) and G2 (n) such that n and m take values from {2, 4, 6, 8, 10, 12, 14, 16, 18, 20} where m < n, while keeping the network size at 500. Figure 5-5 and Figure 5-6 present the results for EC and ICS scores respectively.
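The two measures can be sketched in a few lines. This is an illustrative reading of the definitions, assuming undirected edges given as node pairs, a node mapping `psi` stored as a dictionary from query nodes to target nodes, and a list of per time point edge list pairs; the names `aligned_and_induced` and `ec_ics` are hypothetical. As in the formulas, the scores are summed over time points.

```python
def aligned_and_induced(e1, e2, psi):
    """Count aligned edges and edges of the subnetwork induced by psi.

    e1, e2 : edge lists (node pairs) of the query and target at one time point
    psi    : dict mapping query nodes to target nodes
    """
    e2_set = {frozenset(e) for e in e2}
    image = set(psi.values())
    # Query edges whose mapped endpoints are connected in the target.
    aligned = sum(
        1
        for u, v in e1
        if u in psi and v in psi and frozenset((psi[u], psi[v])) in e2_set
    )
    # Target edges with both endpoints in the image of psi.
    induced = sum(1 for e in e2_set if e <= image)
    return aligned, induced


def ec_ics(snapshots, psi):
    """Sum EC and ICS over time points; snapshots is a list of (e1, e2)."""
    ec = ics = 0.0
    for e1, e2 in snapshots:
        aligned, induced = aligned_and_induced(e1, e2, psi)
        ec += aligned / len(e1) if e1 else 0.0
        ics += aligned / induced if induced else 0.0
    return ec, ics
```

Dividing both sums by the number of time points would give per time point averages, which is how scores in the range [0, 1] are obtained.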

The results demonstrate that our algorithm generates alignments with reasonable quality based on ICS and EC. We notice that the quality score of the generated alignment decreases when increasing the number of time points m while keeping n unchanged. This reflects the fact that the more the two networks evolve, the more their topologies change and deviate from each other, which causes the average quality score across time points to decrease. It might also reflect that as we decrease m, there are more distinct alignments of those m time points of the shorter sequence network within the n time points of the longer sequence and, consequently, we have more options from which to choose the best alignment.

Figure 5-5. The edge correctness (EC) score of the resulting alignment, varying the number of time points of the longer sequence n and the shorter sequence m to take the values {2, 4, 6, 8, 10, 12, 14, 16, 18, 20} where m < n. The x-axis shows n and m. The y-axis shows the EC score.

Figure 5-6. The induced conserved structure (ICS) score of the resulting alignment, varying the number of time points of the longer sequence n and the shorter sequence m to take the values {2, 4, 6, 8, 10, 12, 14, 16, 18, 20} where m < n. The x-axis shows n and m. The y-axis shows the ICS score.

In addition to EC and ICS scores, we evaluate the accuracy of the alignment generated by our method. Recall that we select the original query network as a subset of nodes and their edges from the target network, and then evolve the query through time points. Here, we evaluate the accuracy by calculating the percentage of the aligned nodes from the query network that are paired with the same nodes of the target network that they were originally selected from. We refer to this percentage as the recovered region. We illustrate the results in Figure 5-7, which demonstrate that our algorithm recovers a high percentage of the query networks.

The recovered region results show that our method can capture the coevolving topologies and recover a high percentage (∼70%) of the query network that was planted in the target network. We also notice a behavior similar to that of ICS and EC as the number of time points increases. That being said, our method continues to recover a high percentage of the query network, with an average of ∼64%.
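The recovered region measure reduces to counting query nodes that are mapped back to themselves. The following minimal sketch assumes that query and target nodes share identifiers (since the query was extracted from the target) and that the alignment is given as a dictionary; the name `recovered_region` is hypothetical.

```python
def recovered_region(psi):
    """Percentage of aligned query nodes mapped to their original target node.

    psi maps each aligned query node to a target node; a query node is
    recovered when it is mapped to the node it was extracted from, which
    is assumed here to carry the same identifier.
    """
    if not psi:
        return 0.0
    hits = sum(1 for u, v in psi.items() if u == v)
    return 100.0 * hits / len(psi)
```

For example, a mapping in which three of four query nodes return to their origin yields a recovered region of 75%.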

Figure 5-7. The percentage of recovered query in the resulting alignment, varying the number of time points of the longer sequence n and the shorter sequence m to take the values {2, 4, 6, 8, 10, 12, 14, 16, 18, 20} where m < n. The x-axis shows n and m. The y-axis shows the percentage of recovered query of our method.

5.5 Discussion

In this chapter, we addressed the problem of identifying coevolving subnetworks between subsequences of a given pair of temporal networks. We developed a novel method, Tempo++, using a dynamic time warping algorithm. We showed that our solution is efficient and scalable for a wide range of network sizes, numbers of time points, and evolution rates. We demonstrated the efficiency and accuracy of Tempo++ using both real and synthetic data. Using our




heat and oxidative stress conditions. We compared the statistical significance of the alignments



CHAPTER 6
CONCLUSION

Biological networks help us understand cellular function. Interactions between molecules are dynamic. Temporal networks describe the evolution of molecules and their interactions over time. In this dissertation, we addressed three modeling and characterization problems of biological networks, both static and temporal. In addition, we presented two applications on ecological networks.

In the first problem, we developed a scalable method to solve the motif identification problem given an input graph, desired motif size µ, and minimum frequency of desired motif α. Our experiments on synthetic data and PPI networks from MINT comprehensively demonstrated the statistical and biological significance of the motifs resulting from our algorithm.

Following the first problem, we developed two applications of the motif identification problem in ecological networks. The first application employs motifs to identify the assembly of food web networks across hierarchical scales. We found that the motif representation of daughter networks highly matched that of the parent network they were assembled from. The second application identifies the relationship between motif centrality and motif abundance in aquatic food webs. We found that highly central motifs are over-represented and non-central motifs are under-represented for six of the thirteen motifs. This pattern suggests that high energy flow is associated with the persistence of certain motifs in food webs.

In the second problem, we developed a novel and scalable method, Tempo, to solve the problem of identifying co-evolving subnetworks between two given temporal networks. We proved the correctness of our algorithm. We compared our algorithm to a classical network alignment algorithm and showed that our method generates more significant alignments and could capture the temporal evolution of the two input networks. We performed analysis on the genes (of different phenotypes) reported by our algorithm and observed that they are of biological significance.


In the third problem, we aimed to align two temporal networks with uncertain evolution timelines. This is the first work to tackle this problem. We developed a novel method, Tempo++, using a dynamic time warping algorithm, which is efficient and scalable for a wide range of numbers of time points. We demonstrated the efficiency and accuracy of Tempo++ using both real and synthetic data. We used a gene expression dataset which contains the time-resolved response of E. coli to five different environmental perturbation conditions. Using

our method, we could find similar response behavior of gene expressions between heat and



heat and oxidative stress conditions. We compared the statistical significance of the alignments

found by Tempo++ against those of other possible strategies to tackle this problem and found that Tempo++ outperformed all those strategies.


REFERENCES

[1] X. Zhu, M. Gerstein, and M. Snyder, “Getting connected: analysis and principles of biological networks,” Genes & Development, vol. 21, no. 9, pp. 1010–1024, 2007.

[2] J. A. Freyre-González, J. A. Alonso-Pavón, L. G. Treviño-Quintanilla, and J. Collado-Vides, “Functional architecture of escherichia coli: new insights provided by a natural decomposition approach,” Genome biology, vol. 9, no. 10, p. R154, 2008.

[3] M. D. Leiserson, F. Vandin, H.-T. Wu, J. R. Dobson, J. V. Eldridge, J. L. Thomas, A. Papoutsaki, Y. Kim, B. Niu, M. McLellan et al., “Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes,” Nature genetics, vol. 47, no. 2, pp. 106–114, 2015.

[4] D. A. Charlebois, G. Balázsi, and M. Kærn, “Coherent feedforward transcriptional regulatory motifs enhance drug resistance,” Physical Review E, vol. 89, no. 5, p. 052708, 2014.

[5] S. S. Shen-Orr, R. Milo, S. Mangan, and U. Alon, “Network motifs in the transcriptional regulation network of escherichia coli,” Nature Genetics, vol. 31, no. 1, pp. 64–68, 2002.

[6] P. Wang, J. Lü, and X. Yu, “Identification of important nodes in directed biological networks: A network motif approach,” PLOS ONE, vol. 9, no. 8, 2014.

[7] S. Wuchty, Z. N. Oltvai, and A.-L. Barabási, “Evolutionary conservation of motif constituents in the yeast protein interaction network,” Nature Genetics, vol. 35, no. 2, pp. 176–179, 2003.

[8] J. Flannick, A. Novak, B. S. Srinivasan, H. H. McAdams, and S. Batzoglou, “Graemlin: general and robust alignment of multiple large interaction networks,” Genome research, vol. 16, no. 9, pp. 1169–1181, 2006.

[9] T. I. Lee, N. J. Rinaldi, F. Robert, D. T. Odom, Z. Bar-Joseph, G. K. Gerber, N. M. Hannett, C. T. Harbison, C. M. Thompson, I. Simon et al., “Transcriptional regulatory networks in saccharomyces cerevisiae,” science, vol. 298, no. 5594, pp. 799–804, 2002.

[10] R. Singh, J. Xu, and B. Berger, “Pairwise global alignment of protein interaction networks by matching neighborhood topology,” in Annual International Conference on Research in Computational Molecular Biology. Springer, 2007, pp. 16–31.

[11] T. M. Przytycka, M. Singh, and D. K. Slonim, “Toward the dynamic interactome: it’s about time,” Briefings in bioinformatics, p. bbp057, 2010.

[12] J.-D. J. Han, N. Bertin, T. Hao, D. S. Goldberg, G. F. Berriz, L. V. Zhang, D. Dupuy, A. J. Walhout, M. E. Cusick, F. P. Roth et al., “Evidence for dynamically organized modularity in the yeast protein–protein interaction network,” Nature, vol. 430, no. 6995, pp. 88–93, 2004.


[13] B. Sadikovic, K. Al-Romaih, J. Squire, and M. Zielenska, “Cause and consequences of genetic and epigenetic alterations in human cancer,” Current genomics, vol. 9, no. 6, pp. 394–408, 2008.

[14] A. De Smith, R. Walters, P. Froguel, and A. Blakemore, “Human genes involved in copy number variation: mechanisms of origin, functional effects and implications for disease,” Cytogenetic and genome research, vol. 123, no. 1-4, pp. 17–26, 2008.

[15] J. R. Pollack et al., “Genome-wide analysis of dna copy-number changes using cdna microarrays,” Nature genetics, vol. 23, no. 1, pp. 41–46, 1999.

[16] P. Holme and J. Saramäki, “Temporal networks,” Physics reports, vol. 519, no. 3, pp. 97–125, 2012.

[17] N. M. Luscombe, M. M. Babu, H. Yu, M. Snyder, S. A. Teichmann, and M. Gerstein, “Genomic analysis of regulatory network dynamics reveals large topological changes,” Nature, vol. 431, no. 7006, pp. 308–312, 2004.

[18] A. Rao, A. O. Hero III, J. D. Engel et al., “Inferring time-varying network topologies from gene expression data,” EURASIP Journal on Bioinformatics and Systems Biology, vol. 2007, pp. 7–7, 2007.

[19] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, “Network motifs: simple building blocks of complex networks,” Science, vol. 298, no. 5594, pp. 824–827, 2002.

[20] S. A. Cook, “The complexity of theorem-proving procedures,” in ACM Symposium on Theory of Computing. ACM, 1971, pp. 151–158.

[21] H. L. Buckley, T. E. Miller, A. M. Ellison, and N. J. Gotelli, “Local-to continental-scale variation in the richness and composition of an aquatic food web,” Global Ecology and Biogeography, vol. 19, no. 5, pp. 711–723, 2010.

[22] J. F. Addicott, “Predation and prey community structure: an experimental study of the effect of mosquito larvae on the protozoan communities of pitcher plants,” Ecology, vol. 55, no. 3, pp. 475–492, 1974.

[23] S. R. Borrett and M. K. Lau, “enar: an r package for ecosystem network analysis,” Methods in Ecology and Evolution, vol. 5, no. 11, pp. 1206–1213, 2014.

[24] Y. Hulovatyy, H. Chen, and T. Milenković, “Exploring the structure and function of temporal networks with dynamic graphlets,” Bioinformatics, vol. 31, no. 12, pp. i171–i180, 2015.

[25] J. Leskovec, J. Kleinberg, and C. Faloutsos, “Graphs over time: densification laws, shrinking diameters and possible explanations,” in Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. ACM, 2005, pp. 177–187.


[26] R. M. Karp, “Reducibility among combinatorial problems,” in Complexity of computer computations. Springer, 1972, pp. 85–103.

[27] I. Tomlinson, M. Novelli, and W. Bodmer, “The mutation rate and cancer,” Proceedings of the National Academy of Sciences, vol. 93, no. 25, pp. 14 800–14 803, 1996.

[28] F. Ay, M. Kellis, and T. Kahveci, “SubMAP: aligning metabolic pathways with subnetwork mappings,” Journal of Computational Biology, vol. 18, no. 3, pp. 219–235, 2011.

[29] S. Wuchty and P. F. Stadler, “Centers of complex networks,” Journal of Theoretical Biology, vol. 223, no. 1, pp. 45–53, 2003.

[30] A. Masoudi-Nejad, F. Schreiber, and Z. Kashani, “Building blocks of biological networks: a review on major network motif discovery algorithms,” IET Systems Biology, vol. 6, no. 5, pp. 164–174, 2012.

[31] T. Milenković, J. Lai, and N. Pržulj, “Graphcrunch: a tool for large network analyses,” BMC Bioinformatics, vol. 9, no. 1, p. 70, 2008.

[32] M. Deshpande, M. Kuramochi, N. Wale, and G. Karypis, “Frequent substructure-based approaches for classifying chemical compounds,” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 8, pp. 1036–1050, 2005.

[33] C. Yanover, M. Singh, and E. Zaslavsky, “M are better than one: an ensemble-based motif finder and its application to regulatory element prediction,” Bioinformatics, vol. 25, no. 7, pp. 868–874, 2009.

[34] M. R. Garey and D. S. Johnson, “Computers and Intractability: A Guide to the Theory of NP-Completeness,” 1979.

[35] L. B. Holder, D. J. Cook, S. Djoko et al., “Substucture discovery in the subdue system.” in KDD workshop, 1994, pp. 169–180.

[36] F. Schreiber and H. Schwöbbermeyer, “Frequency concepts and pattern detection for the analysis of motifs in networks,” in Transactions on Computational Systems Biology, 2005, pp. 89–104.

[37] N. Vanetik, E. Gudes, and S. E. Shimony, “Computing frequent graph patterns from semistructured data,” in ICDM. IEEE, 2002, pp. 458–465.

[38] X. Yan, X. Zhou, and J. Han, “Mining closed relational graphs with connectivity constraints,” in ACM SIGKDD, 2005, pp. 324–333.

[39] J. A. Grochow and M. Kellis, “Network motif discovery using subgraph enumeration and symmetry-breaking,” in Research in Computational Molecular Biology. Springer, 2007, pp. 92–106.


[40] N. Kashtan, S. Itzkovitz, R. Milo, and U. Alon, “Efficient sampling algorithm forestimating subgraph concentrations and detecting network motifs,” Bioinformatics,vol. 20, no. 11, pp. 1746–1758, 2004.

[41] S. Omidi, F. Schreiber, and A. Masoudi-Nejad, “Moda: an efficient algorithm for network motif discovery in biological networks,” Genes & Genetic Systems, vol. 84, no. 5, pp. 385–395, 2009.

[42] S. Wernicke, “Efficient detection of network motifs,” IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), vol. 3, no. 4, pp. 347–359, 2006.

[43] J. Chen, W. Hsu, M. L. Lee, and S.-K. Ng, “Nemofinder: Dissecting genome-wide protein-protein interactions with meso-scale network motifs,” in ACM SIGKDD, 2006, pp. 106–115.

[44] Z. R. Kashani, H. Ahrabian, E. Elahi, A. Nowzari-Dalini, E. S. Ansari, S. Asadi, S. Mohammadi, F. Schreiber, and A. Masoudi-Nejad, “Kavosh: a new algorithm for finding network motifs,” BMC bioinformatics, vol. 10, no. 1, p. 318, 2009.

[45] M. Kuramochi and G. Karypis, “An efficient algorithm for discovering frequent subgraphs,” IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 9, pp. 1038–1051, 2004.

[46] ——, “Finding frequent patterns in a large sparse graph,” Data Mining and Knowledge Discovery, vol. 11, no. 3, pp. 243–271, 2005.

[47] L. Babai and E. M. Luks, “Canonical labeling of graphs,” in ACM Symposium on Theory of Computing, 1983, pp. 171–183.

[48] A.-L. Barabási and R. Albert, “Emergence of scaling in random networks,” Science, vol. 286, no. 5439, pp. 509–512, 1999.

[49] K. Baskerville and M. Paczuski, “Subgraph ensembles and motif discovery using a new heuristic for graph isomorphism,” Physical Review E, vol. 74, p. 051903, 2006.

[50] A. Chatr-Aryamontri, A. Ceol, L. M. Palazzi, G. Nardelli, M. V. Schneider, L. Castagnoli, and G. Cesareni, “MINT: the Molecular INTeraction database,” Nucleic Acids Research, vol. 35, no. suppl 1, pp. D572–D574, 2007.

[51] S. N. Dorogovtsev, J. F. F. Mendes, and A. N. Samukhin, “Structure of growing networks with preferential linking,” Physical review letters, vol. 85, no. 21, p. 4633, 2000.

[52] H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, and A.-L. Barabási, “The large-scale organization of metabolic networks,” Nature, vol. 407, no. 6804, pp. 651–654, 2000.

[53] S. Redner, “How popular is your paper? an empirical study of the citation distribution,” The European Physical Journal B-Condensed Matter and Complex Systems, vol. 4, no. 2, pp. 131–134, 1998.

[54] R. D. Leclerc, “Survival of the sparsest: robust gene networks are parsimonious,” Molecular Systems Biology, vol. 4, no. 1, p. 213, 2008.

[55] R. Milo, N. Kashtan, S. Itzkovitz, M. E. Newman, and U. Alon, “On the uniform generation of random graphs with prescribed degree sequences,” arXiv preprint cond-mat/0312028, 2003.

[56] D. Gale et al., “A theorem on flows in networks,” Pacific J. Math, vol. 7, no. 2, pp. 1073–1082, 1957.

[57] M. Ashburner, C. A. Ball et al., “Gene ontology: tool for the unification of biology,” Nature genetics, vol. 25, no. 1, pp. 25–29, 2000.

[58] F. L. Homa and J. C. Brown, “Capsid assembly and dna packaging in herpes simplex virus,” Reviews in Medical Virology, vol. 7, no. 2, p. 107, 1997.

[59] H. V. Cornell and J. H. Lawton, “Species interactions, local and regional processes, and limits to the richness of ecological communities: a theoretical perspective,” Journal of Animal Ecology, pp. 1–12, 1992.

[60] N. J. Gotelli, “Null model analysis of species co-occurrence patterns,” Ecology, vol. 81, no. 9, pp. 2606–2621, 2000.

[61] P. Erdős and A. Rényi, “On random graphs i,” Publ. Math. Debrecen, vol. 6, pp. 290–297, 1959.

[62] R. J. Williams and N. D. Martinez, “Simple rules yield complex food webs,” Nature, vol. 404, no. 6774, p. 180, 2000.

[63] M.-F. Cattin, L.-F. Bersier, C. Banašek-Richter, R. Baltensperger, and J.-P. Gabriel, “Phylogenetic constraints and adaptation explain food-web structure,” in Nature, vol. 427, 2004, pp. 835–839.

[64] D. Stouffer, J. Camacho, R. Guimera, C. Ng, and L. Nunes Amaral, “Quantitative patterns in the structure of model and empirical food webs,” Ecology, vol. 86, no. 5, pp. 1301–1311, 2005.

[65] D. B. Stouffer and J. Bascompte, “Compartmentalization increases food-web persistence,” Proceedings of the National Academy of Sciences, vol. 108, no. 9, pp. 3648–3652, 2011.

[66] D. Koschützki, H. Schwöbbermeyer, and F. Schreiber, “Ranking of network elements based on functional substructures,” Journal of theoretical biology, vol. 248, no. 3, pp. 471–479, 2007.

[67] W. Kim, M. Li, J. Wang, and Y. Pan, “Essential protein discovery based on network motif and gene ontology,” in Bioinformatics and Biomedicine (BIBM), 2011 IEEE International Conference on. IEEE, 2011, pp. 470–475.

[68] W. Li, L. Chen, X. Li, X. Jia, C. Feng, L. Zhang, W. He, J. Lv, Y. He, W. Li et al., “Cancer-related marketing centrality motifs acting as pivot units in the human signaling network and mediating cross-talk between biological pathways,” Molecular BioSystems, vol. 9, no. 12, pp. 3026–3035, 2013.

[69] M. Piraveenan, K. Wimalawarne, and D. Kasthurirathn, “Centrality and composition of four-node motifs in metabolic networks,” Procedia Computer Science, vol. 18, pp. 409–418, 2013.

[70] L. C. Freeman, “A set of measures of centrality based on betweenness,” Sociometry, pp. 35–41, 1977.

[71] S. R. Borrett, “Throughflow centrality is a global indicator of the functional importance of species in ecosystems,” Ecological indicators, vol. 32, pp. 182–196, 2013.

[72] J. Clemente, K. Satou, and G. Valiente, “Finding conserved and non-conserved reactions using a metabolic pathway alignment algorithm,” Genome Informatics, vol. 17, no. 2, pp. 46–56, 2006.

[73] V. Vijayan, V. Saraph, and T. Milenković, “MAGNA++: Maximizing Accuracy in Global Network Alignment via both node and edge conservation,” Bioinformatics, vol. 31, no. 14, pp. 2409–2411, 2015.

[74] R. Patro and C. Kingsford, “Global network alignment using multiscale spectral signatures,” Bioinformatics, vol. 28, no. 23, pp. 3105–3114, 2012.

[75] N. C. Berchtold, D. H. Cribbs, P. D. Coleman, J. Rogers, E. Head, R. Kim, T. Beach, C. Miller, J. Troncoso, J. Q. Trojanowski et al., “Gene expression changes in the course of normal brain aging are sexually dimorphic,” Proceedings of the National Academy of Sciences, vol. 105, no. 40, pp. 15 605–15 610, 2008.

[76] O. Kuchaiev and N. Pržulj, “Integrative network alignment reveals large regions of global network similarity in yeast and human,” Bioinformatics, vol. 27, no. 10, pp. 1390–1396, 2011.

[77] O. Kuchaiev, T. Milenković, V. Memišević, W. Hayes, and N. Pržulj, “Topological network alignment uncovers biological function and phylogeny,” Journal of the Royal Society Interface, p. rsif20100063, 2010.

[78] T. Milenković, W. L. Ng, W. Hayes, and N. Pržulj, “Optimal network alignment with graphlet degree vectors,” Cancer informatics, vol. 9, p. 121, 2010.

[79] A. E. Aladağ and C. Erten, “Spinal: scalable protein interaction network alignment,” Bioinformatics, vol. 29, no. 7, pp. 917–924, 2013.

[80] V. Saraph and T. Milenković, “Magna: maximizing accuracy in global network alignment,” Bioinformatics, vol. 30, no. 20, pp. 2931–2940, 2014.

[81] B. P. Kelley, B. Yuan, F. Lewitter, R. Sharan, B. R. Stockwell, and T. Ideker, “Pathblast: a tool for alignment of protein interaction networks,” Nucleic acids research, vol. 32, no. suppl_2, pp. W83–W88, 2004.

[82] H. T. Phan and M. J. Sternberg, “Pinalog: a novel approach to align protein interaction networks—implications for complex detection and function prediction,” Bioinformatics, vol. 28, no. 9, pp. 1239–1245, 2012.

[83] G. Guelsoy, B. Gandhi, and T. Kahveci, “Topac: alignment of gene regulatory networks using topology-aware coloring,” Journal of bioinformatics and computational biology, vol. 10, no. 01, p. 1240001, 2012.

[84] M. M. Hasan and T. Kahveci, “Indexing a protein-protein interaction network expedites network alignment,” BMC bioinformatics, vol. 16, no. 1, p. 326, 2015.

[85] B. Neyshabur, A. Khadem, S. Hashemifar, and S. Arab, “NETAL: a new graph-based method for global alignment of protein–protein interaction networks,” Bioinformatics, vol. 29, no. 13, pp. 1654–1662, 2013.

[86] Y. Sun, J. Crawford, J. Tang, and T. Milenković, “Simultaneous optimization of both node and edge conservation in network alignment via WAVE,” in International Workshop on Algorithms in Bioinformatics. Springer, 2015, pp. 16–39.

[87] J. Hu, B. Kehr, and K. Reinert, “NetCoffee: a fast and accurate global alignment approach to identify functionally conserved proteins in multiple networks,” Bioinformatics, vol. 30, no. 4, pp. 540–548, 2013.

[88] F. Alkan and C. Erten, “BEAMS: backbone extraction and merge strategy for the global many-to-many alignment of multiple PPI networks,” Bioinformatics, vol. 30, no. 4, pp. 531–539, 2013.

[89] R. Ibragimov, M. Malek, J. Baumbach, and J. Guo, “Multiple graph edit distance: simultaneous topological alignment of multiple protein-protein interaction networks with an evolutionary algorithm,” in Proceedings of the 2014 Conference on Genetic and Evolutionary Computation. ACM, 2014, pp. 277–284.

[90] S. Sahraeian and B. Yoon, “SMETANA: accurate and scalable algorithm for probabilistic alignment of large-scale biological networks,” PloS One, vol. 8, no. 7, p. e67995, 2013.

[91] C.-S. Liao, K. Lu, M. Baym, R. Singh, and B. Berger, “Isorankn: spectral methods for global alignment of multiple protein networks,” Bioinformatics, vol. 25, no. 12, pp. i253–i258, 2009.

[92] Y.-K. Shih and S. Parthasarathy, “Scalable global alignment for multiple biological networks,” BMC bioinformatics, vol. 13, no. 3, p. S11, 2012.

[93] M. M. Hasan and T. Kahveci, “Incremental network querying in biological networks,” in Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, ser. BCB ’14. ACM, 2014, pp. 752–759.

[94] ——, “Color distribution can accelerate network alignment,” in Proceedings of the international conference on bioinformatics, computational biology and biomedical informatics. ACM, 2013, p. 52.

[95] V. Vijayan, D. Critchlow, and T. Milenković, “Alignment of dynamic networks,” arXiv preprint arXiv:1701.08842, 2017.

[96] U. Feige, “A threshold of ln n for approximating set cover,” Journal of the ACM (JACM), vol. 45, no. 4, pp. 634–652, 1998.

[97] C. H. Papadimitriou and K. Steiglitz, “Combinatorial optimization: Algorithms and complexity,” 1998.

[98] Z. Zhang, S. Schwartz, L. Wagner, and W. Miller, “A greedy algorithm for aligning dna sequences,” Journal of Computational biology, vol. 7, no. 1-2, pp. 203–214, 2000.

[99] B.-J. Breitkreutz, C. Stark, T. Reguly, L. Boucher, A. Breitkreutz, M. Livstone, R. Oughtred, D. H. Lackner, J. Bähler, V. Wood et al., “The biogrid interaction database: 2008 update,” Nucleic acids research, vol. 36, no. suppl_1, pp. D637–D640, 2007.

[100] H. Ogata, S. Goto, K. Sato, W. Fujibuchi, H. Bono, and M. Kanehisa, “KEGG: Kyoto encyclopedia of genes and genomes,” Nucleic Acids Research, vol. 27, no. 1, pp. 29–34, 1999.

[101] N. Johnson, S. Kotz, and N. Balakrishnan, “Continuous univariate probability distributions, (vol. 1),” 1994.

[102] H. Gao, Y. Tao, Q. He, F. Song, and D. Saffen, “Functional enrichment analysis of three alzheimer’s disease genome-wide association studies identifies dab1 as a novel candidate liability/protective gene,” Biochemical and biophysical research communications, vol. 463, no. 4, pp. 490–495, 2015.

[103] T. Nashida et al., “Atrophy of myoepithelial cells in parotid glands of diabetic mice; detection using skeletal muscle actin, a novel marker,” FEBS open bio, vol. 3, no. 1, pp. 130–134, 2013.

[104] K. P. Burdon et al., “Genome-wide association study for sight-threatening diabetic retinopathy reveals association with genetic variation near the grb2 gene,” Diabetologia, vol. 58, no. 10, pp. 2288–2297, 2015.

[105] G. O. Consortium et al., “The gene ontology (go) database and informatics resource,” Nucleic acids research, vol. 32, no. suppl 1, pp. D258–D261, 2004.

[106] M. P. Mattson, “Pathways towards and away from Alzheimer’s disease,” Nature, vol. 430, no. 7000, pp. 631–639, 2004.

[107] G. Bartzokis, “Age-related myelin breakdown: a developmental model of cognitive decline and alzheimer’s disease,” Neurobiology of aging, vol. 25, no. 1, pp. 5–18, 2004.

[108] M. Liu et al., “Network-based analysis of affected biological processes in type 2 diabetes models,” PLoS genetics, vol. 3, no. 6, p. e96, 2007.

[109] J. Davies and D. Davies, “Origins and evolution of antibiotic resistance,” Microbiology and Molecular Biology Reviews, vol. 74, no. 3, pp. 417–433, 2010.

[110] R. Elhesha, A. Sarkar, C. Boucher, and T. Kahveci, “Identification of co-evolving temporal networks,” bioRxiv, 2018.

[111] L. Gupta, D. Molfese, R. Tammana, and P. Simos, “Nonlinear alignment and averaging for estimating the evoked potential,” IEEE Transactions on Biomedical Engineering, vol. 43, no. 4, pp. 348–356, 1996.

[112] T. Barrett, S. Wilhite, P. Ledoux, C. Evangelista, I. Kim, M. Tomashevsky, K. Marshall, K. Phillippy, P. Sherman, M. Holko et al., “NCBI GEO: archive for functional genomics data sets—update,” Nucleic Acids Research, vol. 41, no. D1, pp. D991–D995, 2012.

[113] D. Szklarczyk, J. H. Morris, H. Cook, M. Kuhn, S. Wyder, M. Simonovic, A. Santos, N. T. Doncheva, A. Roth, P. Bork et al., “The string database in 2017: quality-controlled protein–protein association networks, made broadly accessible,” Nucleic Acids Research, vol. 45, pp. 362–368, 2016.

[114] S. Jozefczuk, S. Klie, G. Catchpole, J. Szymanski, A. Cuadros-Inostroza, D. Steinhauser, J. Selbig, and L. Willmitzer, “Metabolomic and transcriptomic stress response of escherichia coli,” Molecular systems biology, vol. 6, no. 1, p. 364, 2010.

BIOGRAPHICAL SKETCH

Rasha Elhesha received her B.Sc. in computer and systems engineering from Alexandria University, Egypt, in 2011. She worked as a software developer in Egypt. In August 2014, she started her Ph.D. program in the Computer and Information Science and Engineering Department at the University of Florida and joined Tamer Kahveci’s Bioinformatics lab. She received the 2016 HWCO Outstanding International Student Award. In 2018, she received the Gartner CISE scholarship. Her research area is bioinformatics in general and biological network analysis in particular.
