DEVELOPING EFFICIENT ALGORITHMS TO IDENTIFY PATTERNS OF BIOLOGICAL NETWORKS
By
RASHA ELHESHA
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2018
© 2018 Rasha Elhesha
ACKNOWLEDGMENTS
Firstly, I would like to express my sincere gratitude to my advisor, Prof. Tamer Kahveci,
for his continuous support throughout my Ph.D. study and related research. I owe my deepest
gratitude to him for his patience, motivation, and faithful guidance, which helped me
accomplish my research and write my dissertation. I truly appreciate all the effort he devoted
to me. In addition, I would like to thank my Ph.D. committee members, Prof. Sartaj Sahni,
Prof. Alin Dobra, Prof. Ye Xia, and Prof. Benjamin Baiser, for their insightful comments and
encouragement. They helped me achieve my thesis objectives with outstanding efficiency and
directed me along the right track.
I would like to thank my family for their continuous support and encouragement, which
were worth more than I can express on paper. I would like to thank my husband, Mohamed,
for his sincere and faithful support and help. I would like to thank my two awesome children,
who were able to draw a smile on my face during tough times. Last but not least, I would
like to acknowledge my father and my mother. Without their enthusiasm, encouragement, and
support, this thesis would hardly have been completed. I am grateful to my sister, Shereen, for
always being there for me as a friend.
TABLE OF CONTENTS
ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER
1 INTRODUCTION
2 IDENTIFICATION OF LARGE DISJOINT MOTIFS IN BIOLOGICAL NETWORKS
  2.1 Preface
  2.2 Background
    2.2.1 Definitions and Notation
    2.2.2 Summary of Existing Methods
  2.3 Method
    2.3.1 Algorithm Overview
    2.3.2 Joining Patterns to Find Larger Patterns
    2.3.3 Finding MIS: Going from F1 to F2
    2.3.4 Accelerating Our Algorithm Through Efficient Filters
    2.3.5 Complexity Analysis
  2.4 Experimental Results
    2.4.1 Evaluation of Running Time
      2.4.1.1 Effect of Graph and Motif Size
      2.4.1.2 Effect of Graph Size and Density
    2.4.2 Comparison with Existing Methods
      2.4.2.1 Comparison with SUBDUE
      2.4.2.2 Comparison with FSG
    2.4.3 Evaluation of Statistical Significance
    2.4.4 Case Study on Human Herpesvirus
  2.5 Discussion
3 APPLICATION OF MOTIFS IDENTIFICATION
  3.1 Motifs in The Assembly of Food Web Networks
    3.1.1 Preface
    3.1.2 Method
    3.1.3 Experimental Results
    3.1.4 Discussion
  3.2 Motif Centrality in Food Web Networks
    3.2.1 Preface
    3.2.2 Background
    3.2.3 Method
    3.2.4 Experimental Results
    3.2.5 Discussion
4 IDENTIFICATION OF CO-EVOLVING TEMPORAL NETWORKS
  4.1 Preface
  4.2 Related Work
  4.3 Problem Formulation
  4.4 Methods
    4.4.1 Proof of NP-hardness
    4.4.2 Complexity Analysis
    4.4.3 Adopting Pairwise Alignment Methods to Generate Similarity Scores for Temporal Networks
  4.5 Results and Discussion
    4.5.1 Evaluation of Recovered Region
    4.5.2 Evaluation of Induced Conserved Structure
    4.5.3 Evaluation of Edge Correctness
    4.5.4 Evaluation of Statistical Significance of The Alignment
    4.5.5 Evaluation of Running Time
    4.5.6 Evaluation of Recovered Genes in Real Dataset
    4.5.7 Evaluation on Real Data
  4.6 Discussion
5 IDENTIFICATION OF CO-EVOLVING TEMPORAL NETWORKS WITH UNCERTAIN TIMELINE
  5.1 Preface
  5.2 Related Work and Notations
  5.3 Method
  5.4 Results
    5.4.1 Comparing Against Other Strategies
    5.4.2 Comparing Stress Response Against Time Points Matching
    5.4.3 Hierarchical Clustering of Conditions
    5.4.4 Evaluation of Running Time
    5.4.5 Evaluation of Alignment Quality
  5.5 Discussion
6 CONCLUSION
REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES
Table
2-1 PPI networks selected from the MINT database
2-2 The significance of the most abundant motif of PPI networks, first approach
2-3 The significance of the most abundant motif of PPI networks, second approach
2-4 Uniprot IDs of the proteins in an embedding of the most abundant motif in the HHV-8 PPI network
3-1 Approaches used to calculate motif centrality significance
4-1 Percentage of recovered query genes from gene aging dataset when using Alzheimer's phenotype as query
4-2 Percentage of recovered query genes from gene aging dataset when using Huntington's phenotype as query
4-3 Percentage of recovered query genes from gene aging dataset when using Type II diabetes phenotype as query
4-4 Number and significance of functional pathways associated with the underlying disease observed among the aligned genes of target network
LIST OF FIGURES
Figure
2-1 A hypothetical graph to represent motifs
2-2 The four basic patterns used to find motifs
2-3 All patterns which can be constructed with four undirected edges
2-4 Construct patterns with k + 1 edges
2-5 Algebraic calculation of the frequency of one basic pattern
2-6 The overlap graph based on F2 and F3 frequency measures
2-7 The running time of our motif discovery method using synthetic data varying graph and motif sizes
2-8 The total running time of our motif discovery method using real data
2-9 The running time of our motif discovery method using synthetic data varying graph density
2-10 Comparison between our motif discovery algorithm and SUBDUE, motif size = 5
2-11 Comparison between our motif discovery algorithm and SUBDUE, motif size = 10
2-12 Comparison between our motif discovery algorithm and SUBDUE, motif size = 15
2-13 Comparison of running time between our motif discovery algorithm and FSG
2-14 Motifs discovered in Human herpesvirus PPI
3-1 The three-node motifs we explore to analyze the Assembly of Food Web Networks
3-2 Schematic of the three levels of hierarchy for pitcher plant network assembly
3-3 The percentage of sites for which motif representation matches the continental network
3-4 The percentage of pitchers for which motif representation matches the site networks
3-5 All 13 motifs of 3-node subgraphs
3-6 Distribution of motif abundance over two classes of motif centrality significance
3-7 Correlation probabilities (p-values) between motif abundance and motif centrality
4-1 Comparison between different network alignment problems
4-2 Illustrating the alignment problem using hypothetical between two networks
4-3 The percentage of recovered query in the resulting alignment varying evolution rates
4-4 The induced conserved structure (ICS) score of the resulting alignment varying evolution rates
4-5 The edge correctness (EC) score of the resulting alignment varying evolution rates
4-6 The average z-score of Tempo varying network sizes
4-7 The average z-score of Tempo against IsoRank
4-8 The total running time of IsoRank and Tempo for synthetic networks varying target network size
4-9 The average z-score of our method using real data of three different diseases: Alzheimer's, Huntington's and Type-II diabetes
4-10 The percentage of genes that contributes to each pathway of the resulting aligned genes
5-1 The statistical significance (z-score) of the resulting alignment varying the number of time points
5-2 The significance of the overlaps between different conditions through time points post-perturbation
5-3 The hierarchical clustering of z-score of the alignment between the five stress conditions
5-4 The total running time of our method for synthetic networks varying the number of time points
5-5 The edge correctness (EC) score of the resulting alignment varying the number of time points
5-6 The induced conserved structure (ICS) score of the resulting alignment varying the number of time points
5-7 The percentage of recovered query of the resulting alignment
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
DEVELOPING EFFICIENT ALGORITHMS TO IDENTIFY PATTERNS OF BIOLOGICAL NETWORKS
By
Rasha Elhesha
August 2018
Chair: Tamer Kahveci
Major: Computer Engineering
Studying biological networks provides great potential to help understand how cells
function and how they respond to extra-cellular stimulants. The majority of the previous work
on biological networks assumes that the network topology is static and does not change. However,
it is well understood that the interactions between molecules are dynamic and change over time.
Assuming a static topology may lead to biased or incorrect analysis. We consider analyzing
both static and temporal biological networks. Studying the temporal progression of network
topologies is of utmost importance since it uncovers how a network evolves and how it resists
external stimuli and internal variations.
In this work, we address three main problems of biological networks. The first problem is
identifying large disjoint motifs, that is, frequent topological patterns, in a static biological
network. We present a scalable algorithm for finding network motifs which, unlike most of the
existing studies, counts independent copies of each motif topology. We show two case studies of
food webs that apply our algorithm. The second problem is the identification of co-evolving
temporal networks when the time point information is known. Two temporal networks
have co-evolving subnetworks if the topologies of these subnetworks remain similar to each
other as the network topology evolves over a period of time. In this problem, we consider
identifying a co-evolving pair of temporal networks, which aim to capture the
evolution of molecules and their interactions over time. Although this problem shares some
characteristics with the well-known network alignment problems, it differs from existing network
alignment formulations as it seeks a mapping of the two network topologies that is invariant
to the temporal evolution of the given networks. This is a computationally challenging problem as
it requires capturing not only similar topologies between two networks but also their similar
evolution patterns. We develop an efficient algorithm, Tempo, for identifying coevolving
subnetworks in two given temporal networks. We formally prove the correctness of our
method. We experimentally demonstrate that Tempo scales efficiently with the size of the network
as well as the number of time points, and generates statistically significant alignments,
even when the evolution rates of the given networks are high. Our results on a human aging dataset
demonstrate that Tempo identifies novel genes contributing to the progression of Alzheimer's,
Huntington's and Type II diabetes, while existing methods fail to do so. The third problem
addresses a drawback of the second problem by considering the uncertainty of time points
in both temporal networks when identifying their co-evolving topology. More specifically, the time
points of the observed network topologies are uncertain, such that it is not known in advance
which time point in one sequence corresponds to which time point in the other sequence.
In this problem, we develop a novel method, Tempo++, which identifies coevolving subnetworks
between subsequences of a given pair of temporal networks. We use a gene expression dataset
which contains the time-resolved response of E. coli to five different environmental perturbation
conditions (cold, heat, oxidative stress, lactose diauxie, and stationary phase). Using our
method, we found similar response behavior of gene expression between heat and
oxidative stress. Using Tempo++ to generate alignment significance, we could co-cluster these
five conditions into groups. These clusters also confirmed that E. coli has a similar response to
heat and oxidative stress conditions. We compare the statistical significance of the alignments
found by Tempo++ against those of other possible strategies to tackle this problem.
CHAPTER 1
INTRODUCTION
Biological networks describe the interactions between molecules and are frequently
represented as graphs, where the nodes correspond to the molecules (e.g., proteins or genes)
and the edges correspond to the interactions (1). More formally, we denote a biological
network as G = (V,E), where V and E represent the set of nodes and the set of edges,
respectively. Analysis of these networks enables the elucidation of cellular functions (2), the
identification of variations in cancer networks (3), and the characterization of variations in drug
resistance (4). In addition, this analysis has led to the formulation of numerous computational
challenges, as well as methods which address these challenges. Among these challenges,
identifying motifs (5; 6; 7) (i.e., local network properties) and network alignment (8) (i.e.,
global network properties) are arguably two of the most important.
The majority of the previous work on biological networks assumes that the network topology
is static and does not change (9; 10). However, in many cases, the interactions between
molecules are dynamic (11; 12). For example, genetic and epigenetic mutations can alter
molecular interactions (13), and variation in gene copy number can affect the existence of
interactions (14; 15). Due to this dynamic behavior, the topology of the network that models
the molecular interactions evolves and changes over time (16; 17; 18), and assuming a static
topology may lead to biased or incorrect analysis. In this work, we consider analyzing both
static and temporal networks.
The first problem we address in this dissertation is the problem of identifying large disjoint
motifs in biological networks (Chapter 2). Motifs are frequent topological patterns in a given
network (19). Given a target network and a motif size (i.e., number of nodes in the motif), we
aim to find the motifs of that size which have a frequency above a user-specified threshold in
that target network. Unlike most of the methods in the literature, we count independent copies
of each motif, where no two copies of the same motif share an edge. Counting motif frequency
(i.e., the number of occurrences of the motif) requires solving the subgraph isomorphism
problem, which is NP-complete (20).
We develop a novel and scalable algorithm to solve the motif identification problem. We
introduce a set of small patterns and prove that we can construct any larger pattern by joining
those patterns iteratively. By iteratively joining already identified motifs with those patterns,
our algorithm avoids (i) constructing topologies which do not exist in the target network and
(ii) repeatedly counting the frequency of the motifs generated in subsequent iterations. Our
experiments on both protein-protein interaction (PPI) and synthetic networks demonstrate
that our method is significantly faster and more accurate than the existing methods. In
addition, the increase in the running time of our algorithm is dramatically less than that of the
competing methods as the motif size grows.
The motif identification applications we address in this work are mainly developed to analyze
food web networks (Chapter 3). The first application is Motifs in the Assembly of Food Web
Networks. In this application, we compute the significance of three-node motifs across
a hierarchy of scales to explore the assembly of food web networks found in the leaves of
the northern pitcher plant (Sarracenia purpurea (21; 22)). The second application is Motif
Centrality in Food Web Networks. We explored the relationship between motif abundance
and motif centrality to better understand why some motifs are found at high abundances
(i.e., over-represented) and some are found at low abundances (i.e., under-represented). We
developed a suite of methods for calculating the centrality of entire motifs and then analyzed
the relationship between motif centrality and motif abundance in published aquatic food
webs (23).
The second problem we address in this dissertation is identifying coevolving subnetworks
in a given pair of temporal networks (Chapter 4). The majority of the previous work on alignment
of biological networks assumes that the network topology is static (10), an assumption that ignores
the history of network evolution and may lead to biased or incorrect analysis. To address
the dynamic changes of biological networks, we define a biological network using a model
that describes the evolution of an underlying network at consecutive time points. We refer
to this model as a temporal network (24; 25). Informally, we view this model as containing a
single snapshot of the network at each time point and thus view the sequence of snapshots as a
time series network. Hence, we assume the topology of the biological network is observed at
t consecutive time points. Given two input temporal networks, we let the smaller one be the
query network and the other be the target network. Our algorithm captures the fact
that network topologies evolve over time and seeks the alignment that persists through
this evolution. More specifically, the aligned nodes do not change from one time point to
another. The temporal network alignment problem is dramatically different from known and
existing network alignment problems.
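To make the snapshot view concrete, the following minimal Python sketch stores a temporal network as a list of edge sets (one per time point) and counts, for one fixed node mapping, how many query edges are conserved in the target at each time point. The names `edge`, `conserved_edges`, `query`, `target`, and `mapping`, as well as the toy data, are hypothetical and used only for illustration; the count is not the scoring function used by our algorithm.

```python
from typing import Dict, FrozenSet, List, Set

Edge = FrozenSet[str]      # an undirected edge as an unordered pair of node names
Snapshot = Set[Edge]       # the topology observed at one time point


def edge(u: str, v: str) -> Edge:
    """Build an undirected edge between two distinct nodes."""
    return frozenset((u, v))


def conserved_edges(query: List[Snapshot],
                    target: List[Snapshot],
                    mapping: Dict[str, str]) -> List[int]:
    """For each time point, count query edges whose images exist in the target.

    The same mapping is applied at every time point, mirroring the requirement
    that aligned nodes do not change from one time point to another.
    """
    if len(query) != len(target):
        raise ValueError("both temporal networks need the same number of time points")
    counts = []
    for q_snap, t_snap in zip(query, target):
        kept = 0
        for e in q_snap:
            u, v = tuple(e)
            if edge(mapping[u], mapping[v]) in t_snap:
                kept += 1
        counts.append(kept)
    return counts


# Toy data (assumed): two time points for each network.
query = [{edge("a", "b"), edge("b", "c")}, {edge("a", "b")}]
target = [{edge("x", "y"), edge("y", "z"), edge("z", "w")},
          {edge("x", "y"), edge("z", "w")}]
mapping = {"a": "x", "b": "y", "c": "z"}

counts = conserved_edges(query, target, mapping)
print(counts)
```

A mapping that conserves many edges at every time point, rather than at a single snapshot, is the kind of alignment that persists through the evolution of both networks.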
We present a novel algorithm to identify coevolving subnetworks in a given pair of
temporal networks. We propose a new scoring function that integrates the similarities of the
aligned nodes and their network topologies. Our algorithm works in two phases. In the first
phase, our algorithm finds an initial alignment between the input networks G1 and G2
using the homological and topological similarities of their nodes. This phase ignores the penalty
arising from disconnected subnetworks in the alignment. The second phase of our algorithm
aims to maximize the alignment score by repeatedly altering the aligned nodes in the target
network using a dynamic programming strategy. We solve the problem of connecting subgraphs
using a dynamic programming approach which selects a minimum number of swapping pairs
from the sets of gap nodes and aligned nodes to ensure the maximum profit in the scoring
function. This problem is reduced to the set cover problem, which is NP-complete (26).
We demonstrate the efficiency and accuracy of Tempo using both real and synthetic data. We
compare the running time and the quality of the alignments found by Tempo against those
of three existing alignment algorithms. We show that Tempo has competitive running time and
generates significantly better alignments. We could predict disease-related genes based on the
alignment generated by Tempo, which suggests that Tempo generates alignments that reflect
the evolution of node topologies through time as well as their homological similarities, while
other methods focus only on static and independent topologies.
The third problem we address in this dissertation is aligning two temporal networks
with uncertain time points (Chapter 5). More specifically, the time point information
in both networks is unknown in advance (or uncertain). Furthermore, G1 and G2 may have
different numbers of time points. Without loss of generality, we let G1 be the
temporal network with the smaller number of time points. Various factors affect the evolution
process of a biological network and thus introduce uncertainty when capturing such
evolution. For example, the evolution rate of interacting molecules differs between people
with different disorders (i.e., diseases) or people with the same disorder but at different stages
of this disorder (27). Consequently, the observed interactions of humans may vary even if
they are measured at the same time. Thus, the interaction networks constructed for those
measurements may correspond to different stages of the evolution. In this problem, we consider
the uncertainty of the time points in each topological network. This is a very challenging
problem since it requires not only aligning the temporal networks but also finding the corresponding
time points at which the alignment yields the highest alignment score.
We develop a novel method, Tempo++, to identify coevolving subnetworks in a given
pair of temporal networks with uncertain timelines. Our method adopts a dynamic
time warping strategy to find the optimal matching between the two input temporal
networks by shifting and stretching the time points of G1 based on the alignment quality.
For instance, omitting the first two networks in G2 in the alignment corresponds to the case
where G1 denotes a later stage of evolution by two time points as compared to G2. Similarly,
omitting intermediate networks in G2 corresponds to the case when G1 is evolving slower
than G2. We demonstrate the efficiency and accuracy of Tempo++ using both real and
synthetic data. We use a gene expression dataset which contains the time-resolved response of
E. coli to five different environmental perturbation conditions (cold, heat, oxidative stress,
lactose diauxie, and stationary phase). Using our method, we found similar response
behavior of gene expression between heat and oxidative stress. Using Tempo++ to generate
alignment significance, we could co-cluster these five conditions into groups. These clusters
also confirmed that E. coli has a similar response to heat and oxidative stress conditions. We
compare the statistical significance of the alignments found by Tempo++ against those of
other possible strategies to tackle this problem.
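The idea of shifting and stretching time points can be illustrated with a generic dynamic-programming matching in the spirit of dynamic time warping. The sketch below is a simplified stand-in, not the Tempo++ procedure: it assumes a hypothetical matrix `score[i][j]` holding the alignment quality between time point i of G1 and time point j of G2, and it selects strictly increasing time points of G2, one per time point of G1, that maximize the total score. The function name `best_time_matching` and the toy scores are assumptions.

```python
from typing import List, Tuple


def best_time_matching(score: List[List[float]]) -> Tuple[float, List[int]]:
    """Match every time point of G1 to a distinct, increasing time point of G2.

    score[i][j] is an assumed, precomputed alignment quality between snapshot i
    of G1 and snapshot j of G2. Skipped columns are omitted G2 time points.
    """
    if not score or not score[0]:
        raise ValueError("score matrix must be non-empty")
    t1, t2 = len(score), len(score[0])
    if t1 > t2:
        raise ValueError("G1 must not have more time points than G2")

    neg_inf = float("-inf")
    # best[i][j]: best total score when snapshot i of G1 is matched to snapshot j of G2
    best = [[neg_inf] * t2 for _ in range(t1)]
    back = [[-1] * t2 for _ in range(t1)]
    for j in range(t2):
        best[0][j] = score[0][j]

    for i in range(1, t1):
        run_best, run_arg = neg_inf, -1      # maximum of best[i-1][0..j-1]
        for j in range(i, t2):
            if best[i - 1][j - 1] > run_best:
                run_best, run_arg = best[i - 1][j - 1], j - 1
            best[i][j] = run_best + score[i][j]
            back[i][j] = run_arg

    # Pick the best end point, then walk the back-pointers to recover the matching.
    j = max(range(t1 - 1, t2), key=lambda c: best[t1 - 1][c])
    total = best[t1 - 1][j]
    path = [j]
    for i in range(t1 - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return total, path


# Toy scores (assumed): G1 has 2 time points, G2 has 5.
score = [[1, 5, 2, 0, 0],
         [0, 1, 1, 4, 3]]
total, path = best_time_matching(score)
print(total, path)
```

In this toy instance, the selected matching skips the first time point of G2 and one intermediate time point, which corresponds to the shifting and stretching cases described above.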
CHAPTER 2
IDENTIFICATION OF LARGE DISJOINT MOTIFS IN BIOLOGICAL NETWORKS
2.1 Preface
Studying biological networks has great potential to help understand how cells function and
how they respond to extra-cellular stimulants. Such studies have already been used successfully
in many applications. Characterizing the variations in drug resistance of different cell lines (4)
and identifying the pathways serving similar functions across different organisms (28; 29) are
only a few examples among many.
Motifs are frequent topological patterns in a given network (19). Identifying motifs has
been one of the key steps in understanding the functions served by biological networks such as
gene regulatory or protein interaction networks (5; 6; 7). Motifs can be used to uncover the
basic structure and design principles of a network (30). They are also often considered to be the
basic building blocks of a network (19) and one of its local properties (31). Thus,
they can be used to classify networks (32) into functional sub-units. It is worth noting that
motifs have been used in various applications, such as the prediction of regulatory elements in
genomic sequences (33).
Despite the fact that studying motifs is of utmost importance for network analysis, motif
identification remains a computationally hard problem (34). The challenges
behind motif discovery arise for several reasons. First, even when the motif topology is given,
counting motif frequency (i.e., the number of occurrences of the motif) requires solving the
subgraph isomorphism problem, which is NP-complete (20). Furthermore, when the motif
topology is not known in advance, trying out all alternative topologies is infeasible as the
number of such topologies increases exponentially with the number of edges in the motif.
There are two ways to formulate motif frequency: (i) allow different copies of
the same motif to overlap (i.e., share nodes or edges), or (ii) count disjoint copies of the
motif under consideration. Most of the existing methods in the literature on motif counting
follow the first formulation. This formulation, however, has a fundamental drawback:
it does not have the downward closure property. Briefly, this means that the
motif frequency does not decrease monotonically as the motif size increases. We discuss
this drawback in detail in Section 2.2, along with why it makes it impossible to determine
the largest motif in a given network. Several algorithms use the second formulation
to compute the frequency of a given motif (e.g., (35)). Those algorithms, however, do not
scale to large networks. Also, they are limited to small motifs as their time complexities grow
exponentially with motif size. We elaborate on these methods in Section 2.2 as well.
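The difference between the two formulations can be checked on the embeddings listed in the caption of Figure 2-1. The Python sketch below counts embeddings with overlaps allowed and, by brute force, the largest set of pairwise edge-disjoint embeddings. The helper names `emb` and `max_edge_disjoint` are hypothetical, and the exhaustive search is only meant for tiny inputs; it is not the counting strategy of our algorithm.

```python
from itertools import combinations
from typing import FrozenSet, List, Tuple

Edge = Tuple[str, str]
Embedding = FrozenSet[Edge]


def emb(*edges: Edge) -> Embedding:
    """Store an embedding as a set of undirected edges (endpoints sorted)."""
    return frozenset(tuple(sorted(e)) for e in edges)


def max_edge_disjoint(embeddings: List[Embedding]) -> int:
    """Size of the largest set of pairwise edge-disjoint embeddings (brute force)."""
    for k in range(len(embeddings), 0, -1):
        for subset in combinations(embeddings, k):
            if all(x.isdisjoint(y) for x, y in combinations(subset, 2)):
                return k
    return 0


# Pattern B of Figure 2-1: a triangle, with two embeddings in G.
triangle = [
    emb(("a", "b"), ("a", "c"), ("b", "c")),
    emb(("e", "f"), ("f", "g"), ("e", "g")),
]
# Pattern C of Figure 2-1: a triangle plus one extra edge, with three embeddings in G.
tailed_triangle = [
    emb(("a", "b"), ("a", "c"), ("b", "c"), ("b", "e")),
    emb(("e", "f"), ("f", "g"), ("e", "g"), ("e", "d")),
    emb(("e", "f"), ("f", "g"), ("e", "g"), ("b", "e")),
]

overlapping = (len(triangle), len(tailed_triangle))
disjoint = (max_edge_disjoint(triangle), max_edge_disjoint(tailed_triangle))
print("overlapping counts:", overlapping)
print("edge-disjoint counts:", disjoint)
```

With overlaps allowed, the larger pattern C has more embeddings (3) than its sub-pattern B (2), so the frequency grows with the pattern size. With edge-disjoint counting, both patterns have frequency 2.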
In this chapter, we address the problem of finding motifs in a given network. More
specifically, given a target network and a motif size (i.e., number of nodes in the motif), we
aim to find the motifs of that size which have a frequency above a user-specified threshold
in that target network. Unlike most of the methods in the literature, we use the second
formulation of motif counting described above, where no two copies of the same motif share an
edge, to compute the frequency.
Contributions: We develop a novel and scalable algorithm to solve the motif identification
problem. The central idea of our method, which stands out among the existing literature, is
to use a small set of patterns, called the basic building patterns. We prove that any motif
with four or more edges can be constructed as a combination of these patterns. Following
from this observation, our method first finds instances of these patterns. It then iteratively
grows motifs by joining known motifs at that iteration with the instances of these patterns.
Our algorithm develops efficient mechanisms to avoid a significant fraction of the costly
isomorphism tests while growing new motifs. Counting non-overlapping instances of a given
motif is a computationally challenging task that requires solving the maximum independent set
(MIS) problem, which is known to be NP-complete (34). We introduce a new and efficient
strategy for this purpose. This strategy avoids enumerating the overlapping motif instances.
It does this by algebraically computing the overlap count based on the neighbors of the motif
nodes in the target network. Our experiments on both protein-protein interaction (PPI) and
synthetic networks demonstrate that our method is significantly faster and more accurate
than the existing methods. In addition, the increase in the running time of our algorithm is
dramatically less than that of the competing methods as the motif size grows.
Figure 2-1. A hypothetical graph to illustrate motifs. A) A graph G that contains seven nodes
{a, b, c, d, e, f, g} and eight edges {(a,b), (a,c), (b,c), (b,e), (e,d), (e,f), (f,g), (e,g)}.
B) A pattern with two embeddings in G, {(a,b), (a,c), (b,c)} and {(e,f), (f,g), (e,g)}.
C) A pattern with three embeddings in G, {(a,b), (a,c), (b,c), (b,e)}, {(e,f), (f,g), (e,g), (e,d)},
and {(e,f), (f,g), (e,g), (b,e)}. D) A pattern that has one copy in G,
{(b,e), (e,d), (e,f), (f,g), (e,g)}.
The rest of this chapter is organized as follows. We present the key definitions needed to
discuss our method and the related literature in Section 2.2. We describe our motif discovery
algorithm in Section 2.3. We experimentally evaluate our method and compare it to the
existing algorithms in Section 2.4. We end with a brief conclusion in Section 2.5.
2.2 Background
In this section, we provide the definitions and the terminology needed to describe our
method (Section 2.2.1). We then summarize the key literature tackling similar problems to the
one considered in this chapter (Section 2.2.2).
2.2.1 Definitions and Notation
We represent a given biological network using a graph denoted with G = (V,E). Here,
the set of nodes V denotes the set of interacting molecules, and the set of edges E denotes
the interactions among them. In the rest of this chapter, we use the term graph to denote a
biological network. Here, we focus on undirected graphs. Figure 2-1A represents a graph that
contains seven nodes and eight edges.
We say that a graph is connected if there is a path between all pairs of its nodes. We say
that a graph S = (VS, ES) is a subgraph of G if VS ⊆ V and ES ⊆ E. In the rest of this
chapter, we only consider connected subgraphs. Thus, to simplify our terminology, we use the
term subgraph instead of connected subgraph. Notice that a subgraph of a given graph can be
uniquely determined by the set of edges ES of that subgraph as all of its nodes are connected.
We say that two subgraphs S1 = (VS1, ES1) and S2 = (VS2, ES2) of G are identical
if they have the same set of edges. A less constrained association between two subgraphs is
isomorphism. Two subgraphs S1 and S2 are isomorphic if the following condition holds: There
exists a bijection f : VS1 → VS2 such that, for all u, v ∈ VS1, (u, v) ∈ ES1 ⇐⇒ (f(u), f(v)) ∈ ES2.
We say that two subgraphs S1 and S2 overlap if they share at least one edge (i.e.,
ES1 ∩ ES2 ≠ ∅). In Figure 2-1A, consider the four subgraphs S1, S2, S3, and S4 defined by
the sets of edges {(a,b), (a,c), (b,c), (b,e)}, {(e,f), (f,g), (e,g), (e,d)}, {(e,f), (f,g), (e,g),
(b,e)}, and {(b,e), (d,e), (e,f), (e,g)} respectively. S1 and S2 are disjoint as they do not share
any edges. S1 and S3 overlap as they share the edge (b,e). Similarly, S2 and S3 overlap. All
three subgraphs S1, S2, and S3 are isomorphic as they have the same topology. S1 and S4 are
non-isomorphic as no bijection between their node sets satisfies the condition defined above.
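The overlap and isomorphism relations in this example can be checked mechanically. The following Python sketch is only an illustration and not part of our implementation: the helper names `edge_set`, `overlap`, and `isomorphic` are hypothetical, and the isomorphism test simply tries every bijection, which is feasible only for very small subgraphs.

```python
from itertools import permutations

def edge_set(edges):
    """Represent an undirected subgraph by its set of unordered edges."""
    return frozenset(frozenset(e) for e in edges)

def overlap(s1, s2):
    """Two subgraphs overlap if they share at least one edge."""
    return bool(edge_set(s1) & edge_set(s2))

def isomorphic(s1, s2):
    """Brute-force test: search for a bijection that maps edges onto edges."""
    e1, e2 = edge_set(s1), edge_set(s2)
    n1 = sorted({v for e in e1 for v in e})
    n2 = sorted({v for e in e2 for v in e})
    if len(n1) != len(n2) or len(e1) != len(e2):
        return False
    for perm in permutations(n2):
        f = dict(zip(n1, perm))
        if {frozenset(f[v] for v in e) for e in e1} == e2:
            return True
    return False

# The four subgraphs of the graph in Figure 2-1A discussed above.
S1 = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "e")]
S2 = [("e", "f"), ("f", "g"), ("e", "g"), ("e", "d")]
S3 = [("e", "f"), ("f", "g"), ("e", "g"), ("b", "e")]
S4 = [("b", "e"), ("d", "e"), ("e", "f"), ("e", "g")]

print(overlap(S1, S2), overlap(S1, S3), overlap(S2, S3))
print(isomorphic(S1, S2), isomorphic(S1, S3), isomorphic(S1, S4))
```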
Notice that isomorphism is an equivalence relation (it is reflexive, symmetric, and
transitive). Thus, for a given subgraph S of G, the
set of all subgraphs of G which are isomorphic to S defines an equivalence class. We represent
the subgraphs in each equivalence class with a graph isomorphic to those in that equivalence
class and call it a pattern. Figure 2-1C shows the pattern that represents the equivalence class
{S1, S2, S3}.
There are alternative definitions of the frequency of a pattern in a given graph. The
classical frequency definition is the number of all subgraphs of the target graph which are
isomorphic to the given pattern. This definition, also known as the F1 measure (36), counts
all the subgraphs regardless of whether they overlap with each other or not. There are two
other frequency definitions which avoid overlaps between different subgraphs. The F2 measure
is the size of the largest subset of subgraphs in a given equivalence class in which no two
subgraphs share an edge. It however allows them to share nodes. The F3
measure is more stringent as it requires that no two subgraphs share a node. Consider the
pattern in Figure 2-1C and the target graph in Figure 2-1A. The frequency of this pattern in
the target graph according to the F1 measure is three as it has three embeddings ({S1, S2,
S3}). On the other hand F2 is two {S1, S2}, and F3 is one (S1 or S2 or S3). From here on,
we denote the F1, F2, and F3 counts of a motif M in graph G using the notations F1G(M),
F2G(M), and F3G(M) respectively.
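The three counts in this example can be reproduced with a short sketch. It is illustrative only: the function `largest_disjoint` is a hypothetical helper that exhaustively tries every subset of embeddings, which is exponential in general and is used here only because the example has three embeddings.

```python
from itertools import combinations

S1 = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "e")]
S2 = [("e", "f"), ("f", "g"), ("e", "g"), ("e", "d")]
S3 = [("e", "f"), ("f", "g"), ("e", "g"), ("b", "e")]

def edges(s):
    """Footprint used by F2: the set of unordered edges."""
    return frozenset(frozenset(e) for e in s)

def nodes(s):
    """Footprint used by F3: the set of nodes."""
    return frozenset(v for e in s for v in e)

def largest_disjoint(embeddings, footprint):
    """Size of the largest subset whose footprints are pairwise disjoint."""
    prints = [footprint(s) for s in embeddings]
    for size in range(len(prints), 0, -1):
        for subset in combinations(prints, size):
            if all(not (x & y) for x, y in combinations(subset, 2)):
                return size
    return 0

embeddings = [S1, S2, S3]
F1 = len(embeddings)                      # all embeddings, overlaps allowed
F2 = largest_disjoint(embeddings, edges)  # edge-disjoint embeddings
F3 = largest_disjoint(embeddings, nodes)  # node-disjoint embeddings
print(F1, F2, F3)  # 3 2 1
```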
The downward closure property states that the frequency of a pattern should never
increase as this pattern grows (by inserting new nodes or edges to it). More specifically,
consider a function f() that operates on a pattern and returns a real number. Let us denote
two patterns with P1 and P2. We say that the function f() has downward closure property if
and only if f(P2) ≤ f(P1) for all (P1, P2) pairs where P1 is a subgraph of P2.
Under the light of these definitions, next we show that F1 measure is not downward
closed. Consider the pattern P1 in Figure 2-1B. The frequency of P1 is two in the target graph
in Figure 2-1A. Now consider the pattern P2 in Figure 2-1C which contains P1. Although P1 is
a subgraph of P2, the frequency of P2 is three in the same graph (i.e., more than that of P1).
Next, consider the pattern P3 in Figure 2-1D. P3 contains P2, and its frequency is only one
(i.e., less than that of P2). This example demonstrates that the F1 measure not only fails to
monotonically decrease, but it also fluctuates (i.e., its value may go up or down) as we grow
the pattern (see (37; 38) for further discussion of this issue).
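The fluctuation of F1 on this example can be verified by exhaustive enumeration. The sketch below is an illustration under simplifying assumptions: the function `f1` is hypothetical, it enumerates every edge subset of the right size, and it relies on a brute-force isomorphism test, so it is only suitable for a graph as small as the one in Figure 2-1A.

```python
from itertools import combinations, permutations

# The graph of Figure 2-1A.
G = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "e"),
     ("e", "d"), ("e", "f"), ("f", "g"), ("e", "g")]

def edge_set(edges):
    return frozenset(frozenset(e) for e in edges)

def isomorphic(s1, s2):
    """Brute-force isomorphism test for small edge sets."""
    e1, e2 = edge_set(s1), edge_set(s2)
    n1 = sorted({v for e in e1 for v in e})
    n2 = sorted({v for e in e2 for v in e})
    if len(n1) != len(n2) or len(e1) != len(e2):
        return False
    return any(
        {frozenset(dict(zip(n1, p))[v] for v in e) for e in e1} == e2
        for p in permutations(n2)
    )

def f1(pattern, graph):
    """F1: number of edge subsets of the graph isomorphic to the pattern."""
    return sum(isomorphic(sub, pattern)
               for sub in combinations(graph, len(pattern)))

P1 = [(1, 2), (2, 3), (1, 3)]   # triangle (Figure 2-1B)
P2 = P1 + [(3, 4)]              # triangle with one extra edge (Figure 2-1C)
P3 = P2 + [(3, 5)]              # two extra edges at the same node (Figure 2-1D)

counts = [f1(P, G) for P in (P1, P2, P3)]
print(counts)  # [2, 3, 1]
```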
Unlike the F1 measure, F2 is downward closed. In the following, we formally prove this.
Theorem 2.1. Assume that we are given a graph G. Given two patterns M and M′ where M
⊂ M′, we have F2G(M) ≥ F2G(M′).
Proof. To prove this, we consider the placement of each embedding of M′ in G according to
the F2 measure (i.e. non-overlapping embeddings). Notice that each embedding of M′ contains M
as M ⊂ M′. From each of these embeddings, we remove the edges that are in M′ − M. This
leads to one embedding of M for each embedding of M′. These embeddings of M do not
overlap, as each is contained in a distinct non-overlapping embedding of M′. Thus, the number of non-overlapping
embeddings of M in G is at least as much as that of M′ in G. Therefore, F2G(M) ≥
F2G(M′).
Similarly, the F3 measure, which also counts non-overlapping embeddings, is
downward closed.
Failure to satisfy the downward closure property has major implications on the correctness
of motif identification. Traditional motif identification algorithms often grow a motif starting
from an initial motif of a small number of edges (Section 2.2.2). Should they employ the F1
measure, these algorithms cannot have an early stopping criterion as they grow motifs. This is
because the frequency can go up as we grow a motif even when the current motif frequency is
low. Next, we formally define the problem considered in this chapter.
Problem definition. Given an input graph G = (V,E), the number of nodes in the
target motif µ, and frequency threshold α, we aim to find all patterns of µ nodes which have
frequency at least α in G under the frequency measure F2. The method we develop in this
chapter can however be easily extended to F3 as well (Section 2.3.3).
2.2.2 Summary of Existing Methods
We classify the literature on motif identification and counting, based on the underlying
frequency measure. This is because the frequency measure dramatically changes the cost of
counting motifs as well as how we can interpret the frequency of the underlying pattern. Most
of the existing studies use F1 frequency measure to count the embeddings of a pattern in
a given graph (e.g., (39; 40; 41; 42; 43; 44)). These methods carry the drawbacks inherent
in the F1 measure. First, F1 ignores the fact that different copies of the same motif can
overlap due to the nodes and the edges they share. This can lead to an artificially large
number of motif embeddings as the same node or edge can participate in multiple embeddings.
To understand this better, consider the pattern and the graph in Figures 2-1C and 2-1A
respectively. F1 counts three copies of the pattern (S1, S2, and S3). Different nodes and
edges however contribute to this count at different numbers. The edge (a, b) appears only in
S1 while (b, e) appears in both S1 and S3.
Figure 2-2. The four basic patterns used by our algorithm.
Second and more importantly, the F1 measure is not downward closed. This is because
as we grow a pattern by including new edges or nodes, its count as computed by F1 is not
monotonic; it may decrease, stay the same, or increase. Lack of the downward closure property
makes it nearly impossible to decide if the motif found is the largest one in size while growing a
pattern. Thus, using F2 is essential for the tractability of identifying frequent patterns. We use
the F2 measure in this chapter. Thus, the studies limited to the F1 measure are out of the
scope of this chapter.
Several algorithms tackle the problem of finding frequent patterns in multiple graphs.
FSG (45) is one of the key methods in this class. These methods, however, do not count the
number of occurrences of a pattern in each graph. They rather check if the given pattern
appears at least once in each graph. Vanetik et al. (37) also addressed the same problem.
Finding frequent patterns or counting them without overlaps (i.e., using the F2 or F3
measures) has received little attention in the literature. One of the existing algorithms
in this category is SUBDUE (35). Flexible Pattern Finder Algorithm (FPF) (36) detects
frequent patterns using both F2 and F3. Two algorithms were proposed by Kuramochi and
Karypis (46), named hSiGraM, vSiGraM. However, these algorithms are computationally
expensive and do not scale to large graphs or motifs. We evaluate SUBDUE and FSG
experimentally in Section 2.4.
2.3 Method
In this section we describe our method. Section 2.3.1 presents an overview of our
algorithm. Section 2.3.2 explains the mechanism we use to grow motifs by joining smaller
motifs. Section 2.3.3 describes how we count disjoint motif instances. Section 2.3.4 presents
filtering techniques we implement to avoid costly isomorphism tests. Section 2.3.5 discusses
the complexity analysis of our method.
2.3.1 Algorithm Overview
In this section, we provide an overview of our method for discovering motifs. At the heart
of our method lie four unique graph patterns. We call them the basic building patterns for
we use them as guide to construct larger motifs of arbitrary sizes and topologies. Figure 2-2
presents these basic building patterns. We explain why we use these four specific patterns in
Section 2.3.2 in detail.
Algorithm 2.1 presents the pseudo-code of our method. We elaborate on each key step
of our method in subsequent sections. The algorithm takes a graph G, the number of nodes
of the target motif µ, and the minimum acceptable motif frequency α as input. For each
of the four basic building patterns, it first locates all subgraphs in G that are isomorphic to
that pattern (Line 1). Let us denote the set of instances of the ith pattern (i ∈ {1, 2, 3,
4}) with Si. In each set Si, it is possible to have overlapping subgraphs. It then extracts the
maximum set of edge-disjoint subgraphs in each set Si (Line 2) (see Section 2.3.3 for details).
Let us denote the resulting set with S ′i for the ith pattern. Notice that the cardinalities of
the sets Si and S ′i are the F1 and F2 measures of the ith pattern respectively. The union of
all the sets S ′i constitutes the current motif instances as well as the basic building pattern
instances at this point (Line 3). The algorithm then iteratively grows the current motif set.
At each iteration, it joins the current motif set with the basic building pattern set (Line 9).
More specifically, a motif instance and a basic building pattern join if they share at least one
edge. Joining two such subgraphs either creates a pattern which already exists in the current
set (Line 10) or a new pattern (Line 12). At each iteration, after growing the current set, it
filters the overlapping subgraphs to identify the MIS for each pattern (Line 18). The algorithm
removes all patterns with frequency lower than the user supplied cutoff (Line 21). It reports
the frequent subgraphs that have as many edges as the target motif size (Line 23). The
algorithm terminates when the current set cannot be grown to have any other patterns which
satisfy the target motif (i.e. each pattern in the current set is either larger than the target
motif size or its frequency is lower than the user specified frequency).
Algorithm 2.1. Motif discovery algorithm
Input:
• Target motif size µ
• Frequency threshold α
• Input graph G = (V,E)
Output:
• Motif topologies, and their instance subgraphs, each of which has the same number of nodes as µ and F2 ≥ α
1: BPSf1 = getAllSubgraphs-Isomorphic-to-BasicPatterns()
2: BPS = extract-maxDisjointSubgraphs-PerPattern(BPSf1)
3: CurrentSet (CS) = BPS
4: newSet (NS) = ϕ
5: while CS has new patterns and at least one of them has number of nodes < µ and F2 ≥ α do
6:     for each pattern p1 in CS do
7:         for each pattern p2 in BPS where p2 ≠ p1 do
8:             for each subgraph s1 ∈ p1 and s2 ∈ p2 do
9:                 s3 = join(s1, s2)
10:                if s3 ∈ existing pattern P then
11:                    add s3 ∈ P in NS if not duplicate
12:                else
13:                    Create Pnew with s3 topology, add s3 ∈ Pnew in NS
14:                end if
15:            end for
16:        end for
17:    end for
18:    CS = extract-maxDisjointSubgraphs-PerPattern(NS)
19:    for each pattern p1 ∈ CS do
20:        if F2 of p1 < α then
21:            Delete p1 and all subgraphs ∈ p1
22:        else if number of nodes of p1 = µ then
23:            put p1 and all subgraphs ∈ p1 in the output
24:        end if
25:    end for
26:    NS = ϕ
27: end while
2.3.2 Joining Patterns to Find Larger Patterns
Here, we describe one join iteration of our method; the process of joining the subgraphs
of current set of patterns with the subgraphs of the basic building patterns to construct larger
patterns. At the end of the iteration, the resulting set of subgraphs becomes the current set of
subgraphs for the next join iteration.
Recall that we join two subgraphs only if they share at least one edge. Joining two such
subgraphs either yields a pattern that is isomorphic to one of the existing patterns or a new
one. In the former case, we consider the set of subgraphs S isomorphic to that pattern. We
check if the new subgraph is already in S. If it is in S, we discard it. Otherwise, we store it in
S. In the latter case (i.e., the pattern is observed the first time), we save this as a new pattern
and also keep the corresponding subgraph.
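The join of two subgraphs can be sketched as follows. This is a minimal illustration rather than our implementation: the function `join` and the variable names are hypothetical, a subgraph is represented only by its edge set, and the isomorphism-based grouping into patterns is omitted.

```python
def edge_set(edges):
    """Represent an undirected subgraph by its set of unordered edges."""
    return frozenset(frozenset(e) for e in edges)

def join(s1, s2):
    """Return the union of two subgraphs if they share an edge, else None."""
    s1, s2 = edge_set(s1), edge_set(s2)
    return s1 | s2 if s1 & s2 else None

# A triangle and a two-edge path from the graph in Figure 2-1A.
triangle = [("a", "b"), ("a", "c"), ("b", "c")]
m1_copy = [("b", "c"), ("b", "e")]   # shares the edge (b, c) with the triangle
far_copy = [("e", "f"), ("f", "g")]  # shares no edge with the triangle

joined = join(triangle, m1_copy)     # four-edge subgraph S1 of Section 2.2.1
not_joined = join(triangle, far_copy)

# Within one pattern, a set of edge sets discards duplicates automatically.
instances = set()
instances.add(joined)
instances.add(join(m1_copy, triangle))
print(len(joined), not_joined, len(instances))
```

Storing each instance as a frozen set of edges makes the duplicate check described above a constant-time set lookup on average.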
Notice that, although the subgraphs in S do not overlap prior to join, this may no longer
hold after new subgraphs are inserted into S. At the end of each join iteration, we select the
MIS for each pattern. We defer the discussion on how we do this to Section 2.3.3. We then
remove the patterns with F2 values below the user supplied frequency threshold, α. This
eliminates non-promising patterns, and thus, reduces the number of candidate patterns for the
next join iteration. Using the F2 measure ensures that patterns maintain downward closure
property. Thus, non-frequent patterns will never grow to yield frequent patterns.
Why do we need different equivalence classes? If the motif frequency is measured using
F1, it is sufficient to join the subgraphs belonging to existing patterns with only those which
belong to the same equivalence class of the simple pattern with two edges (see Figure 2-2A)
to construct any larger pattern. This however is not true when F2 (or F3) is used to count
the motif frequency. To understand the rationale behind this, recall that each equivalence
class represents a set of disjoint isomorphic subgraphs. As a result, no two subgraphs from the
same equivalence class join for they do not share any edges. Therefore we need more than one
equivalence class to construct new and larger patterns.
Given that we need multiple patterns, next, we seek the answer to the following question:
What is the smallest set of patterns which can be used to produce arbitrary large topologies by
joining them? Here we outline the key steps of the proof that the four basic building patterns,
presented in Figure 2-2, suffice to construct any larger pattern. That said, we do not guarantee
to find all copies of such patterns in the target network.
Figure 2-3. All patterns which can be constructed with four undirected edges.
Before we discuss our induction steps, we explain our strategy on a specific motif size of
four to improve the clarity of the discussion on induction. Figure 2-3 shows all the possible
patterns which can be constructed with four undirected edges. A careful inspection shows that
each one is an overlapping combination of two of the basic building patterns. For instance,
the pattern in Figure 2-3A can result from joining the basic pattern in Figure 2-2A with the
basic pattern in Figure 2-2C. It is worth noting that we can construct some of the patterns in
Figure 2-3 by joining two different pairs of basic building patterns. This redundancy ensures
we can still locate a specific pattern even if one of those pairs does not exist. Therefore, our
method can construct any pattern with four edges from patterns with three or two edges.
We conduct our proof for the arbitrary pattern size by induction.
Basis. The four basic patterns in Figure 2-2 constitute all possible connected graph topologies with
two or three edges.
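The basis can be checked by enumerating all connected topologies with a given number of edges. The sketch below is illustrative only: the helpers `is_connected`, `canonical`, and `connected_patterns` are hypothetical names, and the canonical form is computed by trying every node relabeling, which is practical only for the very small sizes considered here.

```python
from itertools import combinations, permutations

def is_connected(edges):
    """Depth-first search over the nodes touched by the given edges."""
    nodes = {v for e in edges for v in e}
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        stack.extend(w for e in edges if u in e for w in e if w != u)
    return seen == nodes

def canonical(edges):
    """Smallest sorted edge list over all relabelings of the nodes."""
    nodes = sorted({v for e in edges for v in e})
    best = None
    for perm in permutations(range(len(nodes))):
        f = dict(zip(nodes, perm))
        form = tuple(sorted(tuple(sorted((f[u], f[v]))) for u, v in edges))
        if best is None or form < best:
            best = form
    return best

def connected_patterns(m):
    """All connected topologies with m edges (they use at most m + 1 nodes)."""
    all_edges = list(combinations(range(m + 1), 2))
    return {canonical(s) for s in combinations(all_edges, m) if is_connected(s)}

basic = connected_patterns(2) | connected_patterns(3)
print(len(connected_patterns(2)), len(connected_patterns(3)), len(basic))
```

Under these assumptions the enumeration yields one topology with two edges and three with three edges, which agrees with the four basic building patterns.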
Induction step. We assume that our method can construct any pattern with up to k
edges (k ≥ 3). We next show that any pattern with k + 1 edges can be constructed by joining
a pattern with k edges with one of the basic building patterns.
Recall that the downward closure property states that those smaller patterns have at
least as much frequency as the larger one according to F2 (Theorem 2.1). This means that
if a pattern with k + 1 edges is frequent, then so is any of the k edge patterns obtained by
removing an edge from that pattern.
Consider a graph G and a copy of a pattern P1 of size k edges in G, S1. Also, consider a
copy of a pattern P2 with k + 1 edges such that P2 contains P1 and one additional edge. Let
us denote this additional edge with (a, b). We need to show that P2 can be obtained from P1
by joining it with at least one of the basic patterns.
Figure 2-4. Constructing patterns with k + 1 edges. A) A subgraph S2 in a hypothetical graph G. S2 is isomorphic to a pattern P2 of size k + 1 edges. If we remove the additional edge (a, b) we obtain S1 which is isomorphic to P1 where P1 ⊂ P2. Notice that S1 could have arbitrary k − 1 edges other than (b, c). Here we obtain S2 as a result of joining S1 with the subgraph {(a, b), (b, c)} which belongs to the M1 equivalence class (Figure 2-2A). B) If the join in A) cannot be accomplished, we inspect deg(c) and deg(b) in S1. The first possibility is that deg(c) > 1. This means that the subgraph {(b, c), (c, d)} exists. We then can join S1 with the subgraph {(a, b), (b, c), (c, d)} which belongs to the M4 equivalence class (Figure 2-2D) to obtain S2 which is isomorphic to a pattern P2 of size k + 1 edges. C) The second possibility is that deg(b) > 1. This means that the subgraph {(b, c), (b, d)} exists. We then can join S1 with the subgraph {(a, b), (b, c), (b, d)} which belongs to the M3 equivalence class (Figure 2-2C) to obtain S2.
Since both P1 and P2 are connected graphs, at least one of the two nodes a and b has
an edge in P1. Without loss of generality, let us assume that b has an edge
(b, c) in P1. Figure 2-4A illustrates the two edges (a, b) and (b, c).
First, we consider using the basic pattern M1 in Figure 2-2A in the join operation. In
this case, a copy of M1, {(a, b), (b, c)} will join with S1 having a common edge (b, c) which
will result in the pattern P2 with k + 1 edges. This join however occurs only if the subgraph
{(a, b), (b, c)} is included in the F2 counts of M1 (i.e. within the chosen non-overlapping
copies of M1).
If this condition fails, we consider the degrees of the two nodes b and c in pattern P1. We
start with node c. Let us denote the degree of a node with function deg() (e.g. deg(c) is the
degree of node c in pattern P1).
If deg(c) > 1, then c has at least one more edge on top of (b, c). Let us denote this edge
with (c, d) (Figure 2-4B). In this scenario, we join a copy of the motif M4 (Figure 2-2D),
{(a, b), (b, c), (c, d)} (if this copy exists in the F2 count of M4) to obtain P2.
Finally, if deg(c) = 1, it is guaranteed that deg(b) > 1. This is because if both
nodes b and c have degree one, S1 cannot be a connected subgraph. Let us denote one of
the additional edges of b with (b, d) (Figure 2-4C). In this case, we join the subgraph that is
isomorphic to the pattern M3, {(a, b), (b, c), (b, d)}, with S1 to obtain P2. We can do this if
this copy exists in the F2 count of M3.
In summary, we conclude that any pattern P2 with k + 1 edges can be constructed by
joining a pattern P1 with k edges (or k − 1 edges) and one of the basic building patterns to
obtain the additional edge (or edges) if at least one of the many possible scenarios holds. We
however cannot guarantee that the joins will find all of the instances of the k + 1 edge pattern
on the target graph.
Recall that as we aim to calculate the frequency of a given motif using F2, there is no self
join of any pattern. Thus, the basic building patterns set is the smallest set of patterns as we
cannot construct one of those four patterns using the three other patterns. More specifically,
this means that we cannot use only one of those four basic building patterns to construct
larger patterns by joining pairs of subgraphs belonging to that pattern’s equivalence class. This
is because if we join the embeddings of a single motif topology (such as the first pattern in
Figure 2-2A) we cannot get any larger pattern as they do not share any edge(s).
2.3.3 Finding MIS: Going from F1 to F2
Here, we explain how we compute the F2 frequency for a given pattern. We use two
algorithms for this purpose. We explain why we have two separate algorithms later in
this section after describing the two algorithms. The first one is a heuristic used in the
literature (36). This algorithm constructs a new graph, called the overlap graph for each
pattern as follows. Each node in the overlap graph of a pattern denotes an embedding of that
pattern in the target graph. We add an edge between two nodes of the overlap graph if the
corresponding embeddings represented by those nodes overlap in the original graph. Once the
overlap graph is constructed, the algorithm starts by selecting the node with the minimum
degree (i.e. overlaps with the minimum number of embeddings) in the overlap graph. We
include the subgraph represented by this node in the edge-disjoint set. We then delete that
node along with all of its neighboring nodes in the overlap graph. We update the degree of the
neighbors of the deleted nodes. We repeat this process of picking the smallest degree node and
shrinking the overlap graph until the overlap graph is empty.
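The heuristic described above can be sketched in a few lines. This is an illustration under simplifying assumptions, not the code of (36) or of our implementation: the function name `greedy_disjoint_set` is hypothetical, overlaps are tested pairwise, and ties in degree are broken by the order of the input list.

```python
def greedy_disjoint_set(embeddings):
    """Minimum-degree greedy selection on the overlap graph of one pattern."""
    sets = [frozenset(frozenset(e) for e in s) for s in embeddings]
    # Overlap graph: node i is adjacent to j if the embeddings share an edge.
    nbrs = {
        i: {j for j in range(len(sets)) if j != i and sets[i] & sets[j]}
        for i in range(len(sets))
    }
    chosen = []
    while nbrs:
        # Pick the embedding that overlaps with the fewest remaining ones.
        i = min(nbrs, key=lambda k: (len(nbrs[k]), k))
        chosen.append(i)
        removed = nbrs[i] | {i}
        for r in removed:
            del nbrs[r]
        # Update the degrees of the surviving nodes.
        for k in nbrs:
            nbrs[k] -= removed
    return [embeddings[i] for i in chosen]

# The three embeddings of the pattern in Figure 2-1C.
S1 = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "e")]
S2 = [("e", "f"), ("f", "g"), ("e", "g"), ("e", "d")]
S3 = [("e", "f"), ("f", "g"), ("e", "g"), ("b", "e")]

selected = greedy_disjoint_set([S1, S2, S3])
print(len(selected))  # 2
```

On this example the heuristic selects S1 and S2, which matches the F2 count of two given in Section 2.2.1; in general a greedy choice is not guaranteed to be maximum.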
The algorithm described above works well for patterns with a small number of embeddings.
It however becomes computationally impractical as the number of embeddings of the
underlying pattern gets large. This is because both constructing the overlap graph (particularly
identifying its edges) and updating it are computationally expensive tasks. Therefore, we
use this algorithm for all patterns except for the basic building patterns (whose numbers of
embeddings are often too large).
Figure 2-5. Algebraic calculation of the frequency of one basic pattern. A) One of the basic building patterns. B) A hypothetical graph that contains subgraphs isomorphic to the pattern M1 in A).
The second algorithm addresses the scalability issue of the first one. This scalability
issue is imposed by the expensive task of calculating the degree of each node in the overlap
graph (i.e. the number of overlaps of each embedding). Recall from the previous algorithm
that this number is considered as a loss value when selecting the node (i.e. embedding)
with minimum degree (i.e. number of overlaps) to include in the final MIS of the pattern
under consideration. Briefly, the second algorithm we introduce here avoids the expensive
task of calculating the number of overlaps for each embedding through actual overlap tests. Instead, it
computes these numbers algebraically. Once we
compute node degrees of the overlap graph, this algorithm selects the disjoint embeddings the
same way as the former algorithm described before. More specifically, the algorithm selects the
node with the minimum degree and includes its corresponding embedding in the final MIS. It
then removes neighboring nodes to that node from the overlap graph. It repeats this process
until the overlap graph is empty. Next, we explain how we compute the degree of a node in
the overlap graph for the pattern M1 in Figure 2-2A. Our computation is similar for the other
three basic building patterns, yet tailored towards their specific topologies (derivation is shown
in appendix). Figure 2-5 shows a hypothetical subgraph S1 ={(a, c), (b, c)} in the input graph
G which is isomorphic to M1. This subgraph is represented by a node in the overlap graph of
M1’s embeddings. Let us denote the degree of a node in the original graph G with function
d() (e.g. d(vi) is the degree of node vi). Another embedding of M1 in G overlaps with S1
only if it contains the edge (a, c) or (b, c). Any other edge in G connected to the middle node c
forms two overlapping embeddings, one with the edge (a, c) and the
other with the edge (b, c). We exclude the edges belonging to S1 (i.e. the
embedding whose number of overlaps we want to calculate) itself from the potential edges of G
that are considered in the overlapping embeddings with S1. Thus, by excluding the two edges
(a, c) and (b, c) from c’s degree, node c yields 2 × (d(c) − 2) overlaps. In addition, any edge
incident to node a forms an embedding when combined with the edge (a, c). Excluding
the edge (a, c), node a yields d(a) − 1 overlaps. Similarly, node b produces d(b) − 1 overlaps.
Thus, the total number of overlaps for the embedding S1 = {(a, c), (b, c)} combined from
edges of its three nodes {a, b, c} is
2(d(c) − 2) + (d(a) − 1) + (d(b) − 1) = 2d(c) + d(a) + d(b) − 6
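The formula above can be compared against explicit overlap tests on a small graph. The following sketch is illustrative and assumes a simple undirected graph (here the one in Figure 2-1A); the names `overlaps_algebraic` and `overlaps_by_enumeration` are hypothetical, and each M1 embedding is written as a triple (a, c, b) with c as the middle node.

```python
from collections import defaultdict
from itertools import combinations

# The graph of Figure 2-1A, used here as a test input.
G = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "e"),
     ("e", "d"), ("e", "f"), ("f", "g"), ("e", "g")]

adj = defaultdict(set)
for u, v in G:
    adj[u].add(v)
    adj[v].add(u)

def d(v):
    """Degree of node v in the input graph."""
    return len(adj[v])

# Every embedding of M1: a middle node c with two distinct neighbors a and b.
embeddings = [(a, c, b)
              for c in sorted(adj)
              for a, b in combinations(sorted(adj[c]), 2)]

def edges_of(emb):
    a, c, b = emb
    return {frozenset((a, c)), frozenset((b, c))}

def overlaps_algebraic(emb):
    """Number of overlaps computed from node degrees only."""
    a, c, b = emb
    return 2 * d(c) + d(a) + d(b) - 6

def overlaps_by_enumeration(emb):
    """Number of overlaps found by testing every other embedding."""
    return sum(1 for other in embeddings
               if other != emb and edges_of(other) & edges_of(emb))

agree = all(overlaps_algebraic(e) == overlaps_by_enumeration(e)
            for e in embeddings)
print(len(embeddings), agree, overlaps_algebraic(("b", "e", "d")))
```

The algebraic version needs only three degree lookups per embedding, whereas the enumeration compares each embedding with all others.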
Notice that unlike the first algorithm, the second one requires a unique derivation for
each pattern. Thus, we apply it only to the basic building patterns, for their topologies do not
depend on the input graph. Also, it is worth noting that typically the basic building blocks
have much larger number of embeddings as compared to the patterns derived by joining
them. Thus, the efficiency of the second algorithm is needed for them more than the patterns
obtained in subsequent iterations (experimental results).
Figure 2-6. The overlap graph based on the F2 and F3 frequency measures. A) The overlap graph of the pattern in Figure 2-1C based on the F2 measure of this pattern in the graph in Figure 2-1A. B) The overlap graph of the same pattern based on the F3 measure.
To adapt our method to count non-overlapping embeddings of each pattern according to
F3 instead of F2, we only need to change how we calculate the MIS of this pattern. More
specifically, we change the criteria which states that two subgraphs overlap if they share at
least one edge to two subgraphs overlap if they share at least one node (Section 2.2.1). This
will result in changing the overlap graph constructed using the first method we explain in
this section. In addition, it will also have slight change in calculating the total number of
overlap of each embedding using the second method we discuss in this section. Practically,
we expect the overlap graph to be denser when we use the F3 measure as compared to that
for the F2 measure. To illustrate this, consider the graph G in Figure 2-1A and the pattern
in Figure 2-1C. This pattern has three embeddings in G which are S1, S2, and S3 defined by the
set of edges {(a,b), (a,c), (b,c), (b,e)}, {(e,f), (f,g), (e,g), (e,d)}, {(e,f), (f,g), (e,g), (b,e)}
respectively. Figure 2-6A and Figure 2-6B represent the overlap graph of this pattern based on
F2 and F3 measures respectively.
2.3.4 Accelerating Our Algorithm Through Efficient Filters
Recall that at each iteration, our algorithm generates new subgraphs. For each of these
subgraphs, it checks if this subgraph is isomorphic to one of the patterns constructed till that
iteration. Isomorphism test is a computationally expensive task. Next, we describe how we
avoid a large fraction of these tests.
We develop two canonical labeling strategies for patterns. Canonical labeling assigns
unique labels to the nodes of a given pattern (47). If two patterns are isomorphic, then they
have the same canonical labeling. The converse is however not true. Unlike isomorphism test,
comparing the canonical labeling is a trivial task. Following from this observation, when
we construct a new subgraph, we first compare its canonical labeling to those of existing
patterns. We then limit the costly isomorphism test to only those patterns which have the
same canonical labeling as the new subgraph.
The first canonical labeling counts the degree (i.e. number of incident edges) of each
node in the given pattern. It then sorts those degrees and keeps them as a vector we call the
degree vector. If two patterns have different degree vectors, then they are guaranteed to have
different topologies. Despite its simplicity, this labeling filters out a large fraction of patterns.
To test its efficiency, we have tested it on random graphs generated using the Barabási-Albert
model (48). We generate 1000 pairs of graphs where the two graphs in each pair are non-isomorphic and have
the same number of nodes and edges. The degree vector successfully distinguishes the two graphs in 85% of the 1000
pairs.
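The degree vector filter can be sketched as follows. This is an illustration only: `degree_vector` is a hypothetical helper name, and the two trees at the end are our own small example of non-isomorphic patterns that the filter cannot separate, so they would still require a further test.

```python
from collections import Counter

def degree_vector(edges):
    """Sorted list of node degrees of a pattern given by its edge list."""
    deg = Counter(v for e in edges for v in e)
    return sorted(deg.values())

# S1 and S4 from Section 2.2.1: different vectors, so no isomorphism test.
S1 = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "e")]
S4 = [("b", "e"), ("d", "e"), ("e", "f"), ("e", "g")]
print(degree_vector(S1), degree_vector(S4))

# Two trees with equal degree vectors that are not isomorphic: in tree_a the
# degree-3 node has two leaf neighbors, in tree_b it has only one.
tree_a = [(1, 2), (2, 3), (3, 4), (4, 5), (2, 6)]
tree_b = [(1, 2), (2, 3), (3, 4), (4, 5), (3, 6)]
print(degree_vector(tree_a) == degree_vector(tree_b))
```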
The second canonical labeling extends the first one. It was first introduced by (49).
Consider a pattern P = (V,E). Let us define the distance between two nodes vi, vj ∈ V
as the number of edges on the shortest path that connects vi and vj and denote it with
xij. Let us define the diameter of P as the maximum distance between any two nodes,
and denote it with x. Using this notation, we assign a label to node vi as
$\sum_{v_j \in V} 2^{x - x_{ij} - d(v_j)}$.
Once we compute the labels of all the nodes in the given pattern, we sort them. We call the
resulting vector the nodes vector. Similar to the first labeling above, two isomorphic graphs are
guaranteed to yield the same labeling. We compute and compare the nodes vector with only
the patterns which cannot be eliminated using the first canonical labeling. We then consider
the patterns with identical canonical labels for graph isomorphism.
2.3.5 Complexity Analysis
Here we analyze the complexity of our method. We refer to Algorithm 2.1 as we discuss
the steps of our method. For each step, we explain its complexity. We then summarize the
complexity of all steps to denote the overall complexity of our method. These steps are as follows.
Find all subgraphs isomorphic to each of the four basic patterns (Line 1): In this
step, we analyze each of the four basic patterns separately since they have different topologies.
For the pattern M1 in Figure 2-2A, to get all subgraphs isomorphic to this pattern, we
consider all edges connected to each node in the underlying network. We select every
combination of two edges connected to each node. Here, we denote the degree of a node with
function d() (e.g. d(vi) is the degree of node vi). Thus, the complexity of collecting subgraphs
that are isomorphic to M1 is $\sum_{v_i \in V} \binom{d(v_i)}{2}$. Similarly, for the pattern M3
in Figure 2-2C, we select every combination of three edges connected to each node in G. Thus,
the complexity of constructing subgraphs which are isomorphic to M3 is
$\sum_{v_i \in V} \binom{d(v_i)}{3}$. For the pattern M2 in Figure 2-2B, we consider each edge
eij in G with two nodes vi and vj. We collect the edges of both nodes. We then select one edge
connected to vi and one edge connected to vj (on the condition that these two edges are
connected at the other end) along with eij to form a subgraph isomorphic to M2. Thus, the
complexity of constructing subgraphs that are isomorphic to M2 is
$\sum_{e_{ij} \in E} d(v_i)\,d(v_j)$. Similarly to M2, we perform the same operation to get the
subgraphs isomorphic to the pattern M4 in Figure 2-2D. Only this time we make sure that the two
edges that belong to vi and vj are not connected with each other at the other end. Thus,
the complexity of constructing subgraphs that are isomorphic to M4 is
$\sum_{e_{ij} \in E} d(v_i)\,d(v_j)$. Collectively, the complexity of performing this step is
$O\left(\sum_{v_i \in V} d(v_i)^3 + \sum_{e_{ij} \in E} d(v_i)\,d(v_j)\right)$.
Notice that, theoretically, the worst case scenario happens when d(vi) = O(n). In this scenario,
the complexity of this step becomes $O(n^4)$.
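To make the three sums above concrete, the following minimal Python sketch evaluates them for a small graph given as an edge list. The function name `basic_pattern_costs` and the example graph are hypothetical and chosen only for illustration; the sketch counts candidate combinations and does not enumerate the subgraphs themselves.

```python
from collections import defaultdict
from math import comb


def basic_pattern_costs(edges):
    """Return the three sums that bound the enumeration cost of the basic patterns.

    edges: iterable of undirected edges (u, v) with u != v and no duplicates.
    """
    edge_list = list(edges)
    degree = defaultdict(int)
    for u, v in edge_list:
        degree[u] += 1
        degree[v] += 1

    # M1: every pair of edges incident to the same node.
    m1 = sum(comb(d, 2) for d in degree.values())
    # M3: every triple of edges incident to the same node.
    m3 = sum(comb(d, 3) for d in degree.values())
    # M2 and M4: for each edge (u, v), one edge at u is paired with one edge at v.
    m2_m4 = sum(degree[u] * degree[v] for u, v in edge_list)
    return m1, m3, m2_m4


# A star with centre 0 and four leaves (hypothetical example).
star = [(0, 1), (0, 2), (0, 3), (0, 4)]
print(basic_pattern_costs(star))  # (6, 4, 16)
```

For the star, the centre has degree four, so there are six edge pairs and four edge triples, and each of the four edges contributes 4 x 1 to the last sum.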
Extract maximum disjoint set for basic patterns (Line 2): In this step, we use the algebraic
algorithm described in Section 2.3.3 (second one) to calculate the number of overlaps of each
subgraph belonging to each pattern equivalence class. This process takes constant time. We
evaluate these algebraic equations as we construct subgraphs in the previous step. We then sort
those subgraphs within each equivalence class in decreasing order of their number of overlaps.
This process has complexity $O(m \log m)$, where m is the number of subgraphs in
each equivalence class. Recall from the previous step that this number is
$O\left(\sum_{v_i \in V} d(v_i)^3 + \sum_{e_{ij} \in E} d(v_i)\,d(v_j)\right)$. Thus, the
complexity of this step is
$O\left(\left(\sum_{v_i \in V} d(v_i)^3\right) \log\left(\sum_{v_i \in V} d(v_i)^3\right) + \left(\sum_{e_{ij} \in E} d(v_i)\,d(v_j)\right) \log\left(\sum_{e_{ij} \in E} d(v_i)\,d(v_j)\right)\right)$.
Join Iterations (Lines 5-27): In this step, we analyze the complexity of one join iteration.
We then combine the complexities of all join iterations. Let us denote the number of current
patterns in iteration i with $x_i$. Notice that, for the first iteration, $x_1 = 4$. Recall that in
each join iteration, we increase the size of each of the current patterns by one or two
edges. In addition, the patterns of the first join iteration are at least of size 2. Thus, the size
(i.e. number of edges) of each of the current patterns in iteration i is at least i + 2. The
number of subgraphs isomorphic to each of the current patterns is at most $\frac{|E|}{i+2}$
since they are non-overlapping subgraphs. Recall that the subgraphs of the basic patterns are
non-overlapping within each pattern. Thus, the numbers of subgraphs of the patterns M1, M2, M3,
and M4 are at most $\frac{|E|}{2}$, $\frac{|E|}{3}$, $\frac{|E|}{3}$, and $\frac{|E|}{3}$
respectively. Collectively, the number of subgraphs of the basic patterns is $O(|E|)$.
In the join iteration, we start by joining subgraphs of current patterns with the subgraphs
of the basic patterns (Lines 6-9). Thus, the total number of joins we perform at iteration
i is $O\left(|E| \cdot \frac{|E|}{i+2} \cdot x_i\right)$. For each join, we compare the resulting
subgraph against all patterns (Line 10). Recall that we use filters to avoid this costly
isomorphism check (Section 2.3.4). Thus, the complexity of this operation is $O(x_i)$. If this
subgraph is isomorphic to one of the current patterns, we check whether this subgraph is a
duplicate of one of the subgraphs which already exist in this equivalence class (Line 11). We
search an indexed list of those subgraphs in $O\left(\log \frac{|E|}{i+2}\right)$. Collectively,
we obtain the complexity of performing all joins at iteration i by multiplying the three
complexities above and get
$O\left(|E| \cdot \frac{|E|}{i+2} \cdot x_i \cdot x_i \cdot \log \frac{|E|}{i+2}\right)$, which equals
$O\left(x_i^2 \, \frac{|E|^2}{i+2} \, \log \frac{|E|}{i+2}\right)$.
Upon completing all join operations, our algorithm extracts the MIS for each pattern (Line
18) using the overlap graph algorithm described in Section 2.3.3 (first one). Notice that we
perform this operation for the new set of patterns, $x_{i+1}$ (current patterns of the next
iteration), for each of which the number of subgraphs is at most $\frac{|E|}{i+3}$ (this is
because each subgraph is of size i + 3 and no two subgraphs overlap). For each pattern, we
collect the overlapped subgraphs of each subgraph in $O\left(\left(\frac{|E|}{i+3}\right)^2\right)$.
We then sort the subgraphs in decreasing order of their number of overlaps in
$O\left(\frac{|E|}{i+3} \log \frac{|E|}{i+3}\right)$ time. Thus, we extract the MIS for all
patterns in $O\left(x_{i+1} \left(\frac{|E|}{i+3}\right)^3 \log \frac{|E|}{i+3}\right)$.
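The two operations costed above, collecting the overlaps of every subgraph and sorting by overlap count, can be sketched in Python as follows. This is a minimal illustration and not the implementation used in our experiments. It assumes that two subgraphs overlap when they share at least one edge, in line with the F2 measure, and the names `overlap_counts`, `order_by_overlap`, and `edge` are hypothetical. The sketch stops after the sort and does not reproduce the selection rule of Section 2.3.3.

```python
def overlap_counts(subgraphs):
    """Count, for each subgraph, how many other subgraphs share an edge with it.

    subgraphs: list of edge sets; each edge is a frozenset({u, v}) so that
    (u, v) and (v, u) compare equal. The double loop is quadratic in the
    number of subgraphs, matching the pairwise collection step in the text.
    """
    counts = [0] * len(subgraphs)
    for a in range(len(subgraphs)):
        for b in range(a + 1, len(subgraphs)):
            if subgraphs[a] & subgraphs[b]:  # at least one shared edge
                counts[a] += 1
                counts[b] += 1
    return counts


def order_by_overlap(subgraphs):
    """Return subgraph indices sorted in decreasing order of overlap count."""
    counts = overlap_counts(subgraphs)
    return sorted(range(len(subgraphs)), key=lambda k: counts[k], reverse=True)


def edge(u, v):
    """Undirected edge as a hashable, order-independent value."""
    return frozenset((u, v))


# Four hypothetical embeddings of a two-edge path.
embeddings = [
    {edge(1, 2), edge(2, 3)},
    {edge(2, 3), edge(3, 4)},
    {edge(3, 4), edge(4, 5)},
    {edge(6, 7), edge(7, 8)},
]
print(overlap_counts(embeddings))    # [1, 2, 1, 0]
print(order_by_overlap(embeddings))  # [1, 0, 2, 3]
```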
Finally, we check each resulting pattern (Lines 19-25) and delete it if its frequency is less
than the threshold α. We perform this step in $O(x_{i+1})$ time.
Recall that in each join iteration, we increase the size of each of the current patterns by
one or two edges. Also recall that we start with patterns of size at least 2. Thus, the total
number of join iterations we perform until all patterns reach at least the target
motif size is µ − 2. Thus, the complexity of all join iterations is
$O\left(\sum_{i=1}^{\mu-2} \left( x_i^2 \, \frac{|E|^2}{i+2} \, \log \frac{|E|}{i+2} + x_{i+1} \left(\frac{|E|}{i+3}\right)^3 \log \frac{|E|}{i+3} + x_{i+1} \right)\right)$
or simply
$O\left(\sum_{i=1}^{\mu-2} \left( \left[ x_i \, \frac{|E|^2}{i} \, \log \frac{|E|}{i+2} \right] \left[ x_i + \frac{|E|}{i^2} \right] + x_{i+1} \right)\right)$.
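As a worked illustration of how the three per-iteration terms combine, the following Python sketch evaluates the unsimplified sum for given values. The function name `join_iterations_bound`, the pattern counts in `x`, and the use of the natural logarithm are assumptions made for this example only. The result is the value of the expression inside the O-notation, not a running time prediction.

```python
from math import log


def join_iterations_bound(num_edges, mu, x):
    """Evaluate the unsimplified sum over join iterations i = 1 .. mu - 2.

    x[i] is the number of patterns at iteration i (x[0] is unused), so x must
    have at least mu entries to provide x[mu - 1].
    """
    total = 0.0
    for i in range(1, mu - 1):
        per_pattern_now = num_edges / (i + 2)   # subgraphs per current pattern
        per_pattern_next = num_edges / (i + 3)  # subgraphs per new pattern
        # x_i^2 * |E|^2 / (i + 2) * log(|E| / (i + 2))
        joins = x[i] ** 2 * num_edges * per_pattern_now * log(per_pattern_now)
        # x_{i+1} * (|E| / (i + 3))^3 * log(|E| / (i + 3))
        mis = x[i + 1] * per_pattern_next ** 3 * log(per_pattern_next)
        # x_{i+1} for the frequency check
        prune = x[i + 1]
        total += joins + mis + prune
    return total


# Hypothetical input: 12 edges, target size 3, x_1 = 4 and x_2 = 2.
print(join_iterations_bound(12, 3, [0, 4, 2]))
```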
In summary, the complexity of our method considering all the previous steps is
$$O\Bigg( \Big(\sum_{v_i \in V} d(v_i)^3\Big)\Big(1 + \log \sum_{v_i \in V} d(v_i)^3\Big)
+ \Big(\sum_{e_{ij} \in E} d(v_i)\,d(v_j)\Big)\Big(1 + \log \sum_{e_{ij} \in E} d(v_i)\,d(v_j)\Big)
+ \sum_{i=1}^{\mu-2} \Big( \Big[ x_i \, \frac{|E|^2}{i} \, \log \frac{|E|}{i+2} \Big] \Big[ x_i + \frac{|E|}{i^2} \Big] + x_{i+1} \Big) \Bigg)$$
Notice that $x_i$ here depends significantly on the topology and the density of the given
network G. To the best of our knowledge, there is no closed formula that calculates $x_i$ (i.e.
the number of unique topologies of certain size in a given graph G).
2.4 Experimental Results
In this section, we experimentally evaluate the performance of our motif discovery
algorithm on synthetic and real graphs (Section 2.4.1). We measure the running time and
accuracy of our algorithm. We compare our algorithm to two state-of-the-art algorithms,
FSG (45) and SUBDUE (35) (Section 2.4.2). We evaluate the statistical significance of the
most abundant motif in each of the real graphs (Section 2.4.3). We present a case study of the
motifs identified by our method on the Human herpesvirus PPI network (Section 2.4.4). In all of
our experiments, we report the motif frequency using the F2 measure.
Data set. We use real and synthetic datasets in our experiments. The real graphs are
the PPI networks of seven organisms taken from the MINT database (50) (see Table 2-1 for
details). We first remove the nodes and edges of these graphs which are guaranteed not to be
a part of the motif to be found. To do that, we filter a subset of the nodes of each network
as follows. We first identify the connected components of each graph. Let us denote the size of
the motif we aim to find with µ. We remove the connected components with fewer than µ nodes.
Table 2-1 lists these networks and their sizes after filtering them for µ = 5 (which is the
smallest motif size in all of our experiments).
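The filtering step can be sketched in a few lines of Python. This is a minimal illustration under the assumption that the network is given as a list of undirected edges; the function name `drop_small_components` is hypothetical.

```python
from collections import defaultdict, deque


def drop_small_components(edges, mu):
    """Keep only the edges that lie in connected components with at least mu nodes."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    keep = set()
    seen = set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects one connected component.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        if len(component) >= mu:
            keep |= component

    return [(u, v) for u, v in edges if u in keep and v in keep]


# Hypothetical network: a five-node path and a separate two-node component.
network = [(1, 2), (2, 3), (3, 4), (4, 5), (10, 11)]
print(drop_small_components(network, mu=5))  # [(1, 2), (2, 3), (3, 4), (4, 5)]
```

A component with fewer than µ nodes cannot contain any embedding of a motif with µ nodes, which is why removing it does not change the result.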
In addition to the real dataset, we construct synthetic graphs. The purpose of having a
synthetic dataset is to systematically evaluate our method by varying network characteristics
(network size and density) in a controlled environment. We build this dataset using the
Barabási-Albert model (48) for it captures the connectivity patterns of real networks (51; 52;
53). Moreover, this model has been frequently used in the literature to simulate real networks.
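For readers unfamiliar with the model, the following Python sketch shows the standard preferential attachment growth procedure. It is an illustrative sketch and not the generator used to build our dataset; the function name `barabasi_albert_edges`, the choice of m isolated seed nodes, and the `seed` parameter are assumptions of this example.

```python
import random


def barabasi_albert_edges(n, m, seed=None):
    """Grow a graph to n nodes; each new node attaches to m distinct existing
    nodes chosen with probability proportional to their current degree."""
    if not 1 <= m < n:
        raise ValueError("need 1 <= m < n")
    rng = random.Random(seed)
    edges = []
    targets = list(range(m))  # the first new node links to the m seed nodes
    endpoint_pool = []        # each node appears once per incident edge
    for new_node in range(m, n):
        edges.extend((new_node, t) for t in targets)
        endpoint_pool.extend(targets)
        endpoint_pool.extend([new_node] * m)
        # Sampling uniformly from the pool samples proportionally to degree.
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(endpoint_pool))
        targets = list(chosen)
    return edges


graph = barabasi_albert_edges(100, 2, seed=7)
print(len(graph))  # 196 edges, i.e. a mean node degree close to four
```

With m = 2 the graph has m(n − m) edges, which gives roughly two edges per node.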
Table 2-1. The size (number of proteins and interactions) of the PPI networks selected from the MINT database.
Network name             Network code   Number of proteins   Number of interactions
Human herpesvirus 8      hhv-8          48                   82
Campylobacter jejuni     cje            109                  117
Treponema pallidum       tpa            108                  173
Rattus norvegicus        rno            535                  643
Helicobacter pylori      hpy            717                  1472
Escherichia coli         eco            616                  1561
Plasmodium falciparum    pfa            1221                 2577
Implementation and environment. We implement our algorithm in C++ and perform
experiments on a computer equipped with a 1.4 GHz AMD Opteron(tm) processor and 500 GB
of main memory, running the Linux operating system.
2.4.1 Evaluation of Running Time
In this experiment, we evaluate the running time of our motif discovery algorithm. Our
goal here is to observe the effect of varying three parameters (graph size, graph density, and
motif size) on the running time of our algorithm.
2.4.1.1 Effect of Graph and Motif Size
We evaluate the running time of our method under varying graph and motif sizes using
both synthetic and real datasets.
Results on synthetic graphs. We generate synthetic graphs of varying size (i.e.
number of nodes) from 100 to 1000 at increments of 100. We fix the graph density to two
edges per node on average (i.e., the mean node degree is set to four). We set the minimum
desired motif frequency to α = 10. We run experiments for motif sizes µ = 5, 10, and 15 and
report the running time. Figure 2-7 presents the results.
Figure 2-7. The total running time of our method for varying graph and motif sizes (number of nodes). Motif size varies from 5 to 15 (one series each for motif sizes 5, 10, and 15). The x-axis shows the input graph sizes varying from 100 to 1000. The y-axis shows the total running time in seconds on a logarithmic scale.
The results demonstrate that our method scales well with growing graph and motif sizes.
The running time grows with increasing graph and motif sizes, yet it remains practical for the
largest graphs we tested. For motif sizes of 5 and 10, it runs in only several minutes even for
the largest input graph. As the motif size grows, the cost increases. However, our method can
identify very large motifs in a little over a day for the largest networks. We observe that the
motif size has more influence on the performance of our method than the input graph size. This
is because the number of alternative motif topologies grows exponentially with the motif size.
This is an inherent characteristic of the underlying computational problem. However, even when
the motif size is 15, our method retains a practical running time.
Results on real graphs. Next, we test our method on the real dataset. We set the
minimum desired motif frequency to α = 5. We run experiments for motif sizes µ = 5, 10,
and 15 and report the running time. Figure 2-8 presents the results. Similar to the synthetic
dataset results, our method scales to large graph and motif sizes on the real dataset. Note
that the number of alternative motif topologies grows exponentially with the motif size.
Furthermore, the cost of subgraph isomorphism also grows exponentially with the motif size.
Despite these two major complicating factors, the running time of our method increases only
by about an order of magnitude when we increase the motif size by five. Finally, the parallel
between these results and those in Figure 2-7 suggests that synthetic graphs generated by the
Barabási-Albert model have structural properties similar to those of the real PPI graphs.
Figure 2-8. The total running time of our method for the real PPI networks (one series each for motif sizes 5, 10, and 15). Network numbers 1 to 7 on the x-axis correspond to the hhv-8, cje, tpa, rno, hpy, eco, and pfa PPI networks respectively. The positions of the PPI networks on the x-axis indicate the sizes of the input graphs (Table 2-1). The y-axis shows the running time in seconds on a logarithmic scale.
2.4.1.2 Effect of Graph Size and Density
Here, we evaluate the effect of varying input graph size and density on the running
time of our algorithm. We use the synthetic dataset in order to control the graph density in this
experiment. We generate synthetic graphs with network sizes from 100 to 1000 at increments
of 100. We set the desired motif frequency α = 5 and the motif size µ = 10. We vary the graph
density from one to four, which covers a broad range of biological networks (54). For each input
graph and density value, we report the total running time. Figure 2-9 presents the results.
We observe that the running time increases with growing graph density. As the graph
density increases, the number of alternative embeddings of a given motif grows as well. This
also increases the number of overlapping subgraph pairs, which in turn increases the cost
of finding the MIS for each pattern to calculate its F2 frequency (Section 2.3.3). Despite these
major complications inherent in the nature of the motif counting problem, our method remains
scalable with respect to growing density. These results suggest that our method is reliable and
computationally feasible for a broad range of networks with different sizes and densities.
Figure 2-9. The total running time of our method for the synthetic graphs with different graph sizes (number of nodes) and varying graph densities from 1 to 4 (one series per density). The x-axis shows the input graph sizes. The y-axis shows the total running time in seconds on a logarithmic scale.
2.4.2 Comparison with Existing Methods
Here, we compare our method against two methods in the literature which are tailored
towards a problem similar to the one considered in this chapter, namely SUBDUE and
FSG. We measure the running time and accuracy. We compute accuracy in terms of three
parameters: the number of unique motifs found, the average frequency per motif in the target
graph, and the frequency of the most abundant motif.
Of these two methods, for SUBDUE, we only report the accuracy of the result as we
observe that for most datasets and motif sizes, SUBDUE fails to identify motifs (results shown
later in this section). For FSG, we only report the running time. This is because FSG finds
motifs in multiple graphs, limited to at most one embedding per graph. In other words, it
cannot find multiple embeddings of the same motif in a single graph. Therefore, FSG would
yield very low accuracy when applied to a single graph. In the rest of the chapter, we refer
to our method as MD (Motif Discovery) for simplicity.
2.4.2.1 Comparison with SUBDUE
In this experiment, we analyze the effect of varying input graph and motif sizes on
the accuracy of our method as compared to that of SUBDUE. We use the real dataset in this
experiment (Table 2-1). SUBDUE does not allow the user to set a minimum allowable motif
frequency parameter. It finds all subgraph topologies of a given size, even those subgraphs
that appear only once. Due to this limitation of SUBDUE, to have a fair comparison, we set
α = 1 for our method as well. We follow our earlier definition (see Section 2.2.1), and use motif
size µ to denote the number of nodes in the given motif topology. We run both methods on each
input graph using motif sizes µ = 5, 10, and 15. We report the accuracy of our method as well
as SUBDUE. Figures 2-10, 2-11, and 2-12 present the results for µ = 5, 10, and 15 respectively.
Figure 2-10. The accuracy of our method (MD) and SUBDUE in terms of three measures: A) the number of unique motif topologies found, B) the average frequency per motif in the target graph, and C) the frequency of the most abundant motif. Results are for the motif size µ = 5 on the real dataset (Table 2-1). The x-axis of each panel shows the network code (hhv-8, cje, tpa, rno, hpy, pfa); the y-axes are on a logarithmic scale.
Our results for µ = 5 (Figure 2-10) demonstrate that both methods identify a similar
number of unique motifs, yet our method outperforms SUBDUE significantly in terms of the
average frequency per motif in all cases (Figure 2-10B). When we focus on the most abundant
topology of each method, we observe a similar pattern; our method always finds patterns
with much higher frequency than SUBDUE in all the experiments (Figure 2-10C). It is worth
noting that the motif discovery problem gets exponentially harder with growing motif size. As
a result, we expect most algorithms tailored for motif identification to perform well for small
motif sizes such as µ = 5. Next, we observe how our method and SUBDUE perform for large
values of µ.
Figure 2-11. The accuracy of our method (MD) and SUBDUE in terms of three measures: A) the number of unique motif topologies found, B) the average frequency per motif in the target graph, and C) the frequency of the most abundant motif. Results are for the motif size µ = 10 on the real dataset (Table 2-1). The x-axis of each panel shows the network code (hhv-8, cje, tpa, rno, hpy, pfa); the y-axes are on a logarithmic scale.
As we grow the motif size to µ = 10 (Figure 2-11), the results suggest that the gap
between our method and SUBDUE grows rapidly in terms of all three accuracy measures. More
importantly, the results also show that in half of the cases, particularly where the input graph
size is large, SUBDUE could not find any motifs while our method continues to locate patterns
with high frequency. For example, our method was capable of finding motif topologies with
frequency over 100 while SUBDUE could not locate any motif (Figure 2-11C).
Figure 2-12. The accuracy of our method (MD) and SUBDUE in terms of three measures: A) the number of unique motif topologies found, B) the average frequency per motif in the target graph, and C) the frequency of the most abundant motif. Results are for the motif size µ = 15 on the real dataset (Table 2-1). The x-axis of each panel shows the network code (hhv-8, cje, tpa, rno, hpy, pfa); the y-axes are on a logarithmic scale.
For a few cases (hhv-8, cje, and tpa in Figure 2-11B), the average frequency per motif of
SUBDUE is slightly higher than that of our method. This is because we set the minimum
frequency α = 1. Our method locates many topologies which exist only once while SUBDUE
fails to locate them. For example, our algorithm finds thousands of unique motif topologies
while SUBDUE outputs only 8 motif topologies for the hhv-8 organism (Figure 2-11A). As a
result, these unique topologies pull the average frequency down. That said, Figure 2-11C
confirms that our method can identify motifs which are more frequent than those found by
SUBDUE even for those organisms.
As we further increase the motif size to µ = 15 (Figure 2-12), the advantage of our
method becomes more pronounced. We observe that SUBDUE could not find any motifs in
any of the graphs except for tpa's PPI network. On the other hand, our algorithm not only
identifies a massive number of patterns (Figure 2-12A), but some of these patterns also have
very large frequencies (Figure 2-12C).
In summary, the results demonstrate that our method scales to large input graph and
motif sizes and continues to locate patterns with high frequency for a broad range of motif and
input graph sizes while SUBDUE fails to do so.
2.4.2.2 Comparison with FSG
In this experiment, we compare the effect of different input graph and motif sizes on
the running time of our algorithm and that of FSG. We use the real dataset in this experiment
(Table 2-1). The FSG method requires multiple graphs as input. It defines the frequency of a
motif topology as the number of different graphs in which this motif appears. Since our method
operates on one input graph, we set the desired motif frequency α = 1 to be consistent with
FSG. FSG defines motif size as the number of edges in the given motif. To be consistent with
FSG, we use µ to denote the number of edges in the motif in this experiment. We run both
methods on each input graph using motif sizes µ = 7, 8, and 9. We report the running time of
our method (MD) as well as FSG. We do not run experiments for µ > 9 as FSG, unlike our
method, fails to scale to large motif sizes. Figure 2-13 presents the results.
We observe that our method (MD) is orders of magnitude faster than FSG, particularly
for large motif sizes. The running time of our method increases slowly with both the motif size
and the graph size. On the other hand, the running time of FSG increases slowly with the input
graph size, but very rapidly with the motif size. Only for a few cases of small motif sizes (i.e.
≤ 7 edges) does FSG perform better than our method. This is due to the overhead of calculating
F2 for the basic building patterns, where the number of overlapped embeddings is huge. That said,
the running time difference in those cases is negligible. These results suggest that our method
outperforms FSG in terms of running time for a broad range of real biological networks
with different sizes. This performance advantage is further magnified by the fact that our
method can find multiple embeddings of each motif while FSG finds only one. The two main
reasons behind the fact that our method is significantly faster than FSG are that our method
(i) does not calculate the frequency of each new pattern by locating the copies of this
pattern in the network using subgraph isomorphism as FSG does, and (ii) ensures that every
generated pattern exists at least once in the underlying graph.
Figure 2-13. The total running time of our method (MD) and FSG for the real PPI networks (Table 2-1) and µ = 7 (top left), 8 (top right), and 9 (bottom). The x-axis of each panel shows the network code (cje, hhv-8, tpa, rno, hpy, pfa). The y-axis shows the running time in seconds on a logarithmic scale.
2.4.3 Evaluation of Statistical Significance
In this experiment, we evaluate the statistical significance of the most abundant motif
identified by our method in each of the six PPI networks (Table 2-1). We compute the
statistical significance of the abundance of the most frequent motif of a given size using two
alternative approaches. Each of these two approaches measures a different aspect of the
significance.
The first approach measures the statistical significance of the frequency of the most abundant
motif with respect to the abundances of all motifs with the same size in the same graph. More
specifically, given a target graph G = (V,E) and motif size µ, we first find all motifs of size
µ in G. Assume that there are m such motifs in total. Let us denote the frequencies of these
motifs with $x_1, x_2, \ldots, x_m$, with $x_1$ being the largest among all. Let us denote the
mean and standard deviation of these m frequency values with $\bar{x}$ and $\sigma$. We report
the z-score of the frequency of the most abundant motif as $\frac{x_1 - \bar{x}}{\sigma}$.
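The first z-score can be computed directly from the list of motif frequencies, as in the following Python sketch. The function name `top_motif_z_score` and the example frequencies are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) rather than the sample standard deviation is an assumption of this example.

```python
from statistics import mean, pstdev


def top_motif_z_score(frequencies):
    """z-score of the largest frequency relative to all frequencies in the list."""
    sigma = pstdev(frequencies)
    if sigma == 0:
        raise ValueError("all frequencies are equal; the z-score is undefined")
    return (max(frequencies) - mean(frequencies)) / sigma


# Hypothetical frequencies of four motifs of the same size in one graph.
print(top_motif_z_score([10, 2, 2, 2]))  # about 1.73
```

Here the mean is 4 and the population standard deviation is the square root of 12, so the z-score is 6 divided by that value, which is below the threshold of 2 used in this section.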
The second approach measures the statistical significance of the frequency of the most
abundant motif in the original graph with respect to those in a random ensemble of graphs
of the same size and degree distribution. More specifically, given a target graph G = (V,E)
and motif size µ, let us denote the frequency of the most abundant motif of this size in G
with x. We construct a set of n random networks from G through degree-preserving edge
shuffling (55; 56). Note that degree-preserving edge shuffling is an iterative technique, which is
often used in the literature to construct random network topologies with the same size and
degrees as a given target graph G = (V,E). At each iteration of this technique, we randomly pick
two edges from E. Let us denote these edges with $(v_1, v_2)$ and $(u_1, u_2)$, where
$v_1, v_2, u_1, u_2 \in V$. We remove these two edges from E and insert two new edges
$(v_1, u_2)$ and $(u_1, v_2)$. This way, as the network topology evolves randomly, we ensure that
the degrees of all the nodes remain unchanged. We repeat these iterations a large number of times
(exactly 10 × |E| times) to randomize the entire network. Using the strategy above, we generate
n = 100 random graphs, denoted with $G_1, G_2, \ldots, G_{100}$. For each random graph $G_i$, we
measure the frequency of the most abundant motif of size µ. Let us denote this number as $x_i$.
Let us denote the mean and standard deviation of these 100 frequency values with $\bar{x}$ and
$\sigma$. We report the z-score of the frequency of the most abundant motif as
$\frac{x - \bar{x}}{\sigma}$.
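The edge shuffling procedure can be sketched in Python as follows. This is a minimal illustration rather than the code used in our experiments. The function names are hypothetical, and the sketch makes one assumption that the description above does not state: swaps that would create a self-loop or a duplicate edge are skipped, so that the randomized graph stays simple.

```python
import random
from collections import Counter


def degree_preserving_shuffle(edges, swaps_per_edge=10, seed=None):
    """Randomize an undirected simple graph by repeated edge swaps."""
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    if len(edge_list) < 2:
        return edge_list
    present = {frozenset(e) for e in edge_list}
    for _ in range(swaps_per_edge * len(edge_list)):
        a, b = rng.sample(range(len(edge_list)), 2)
        v1, v2 = edge_list[a]
        u1, u2 = edge_list[b]
        new_a, new_b = (v1, u2), (u1, v2)
        # Assumption of this sketch: skip swaps that would create a self-loop
        # or a duplicate edge, so the graph stays simple.
        if v1 == u2 or u1 == v2:
            continue
        if frozenset(new_a) in present or frozenset(new_b) in present:
            continue
        present -= {frozenset(edge_list[a]), frozenset(edge_list[b])}
        present |= {frozenset(new_a), frozenset(new_b)}
        edge_list[a], edge_list[b] = new_a, new_b
    return edge_list


def degrees(edges):
    """Degree of every node in an edge list."""
    count = Counter()
    for u, v in edges:
        count[u] += 1
        count[v] += 1
    return count


# Hypothetical graph: a ring of eight nodes with two chords.
ring = [(k, (k + 1) % 8) for k in range(8)] + [(0, 4), (1, 5)]
shuffled = degree_preserving_shuffle(ring, seed=1)
print(degrees(shuffled) == degrees(ring))  # True
```

Each swap replaces one endpoint of each of the two selected edges, so every node keeps exactly the number of incident edges it had before the swap.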
For both of the approaches above, we assume that a z-score above 2 or below -2 implies
high statistical significance (i.e., two standard deviations away from the mean). The larger
the magnitude of the z-score is, the more significant the result is. Tables 2-2 and 2-3 present
the z-scores for each combination of the six PPI networks and three motif sizes (µ = 5, 10, 15)
using the first and the second approach described above respectively.
Table 2-2. The z-scores that represent the significance of the most abundant motif against other motifs in the same network, for each PPI network and three motif sizes.
Network code   Motif size = 5   Motif size = 10   Motif size = 15
hhv-8          1.52             14.00             4.67
cje            1.41             5.53              12.12
tpa            1.45             7.19              3.36
rno            1.58             4.31              9.74
hpy            1.54             13.70             9.003
pfa            1.87             35.32             7.43
Table 2-2 suggests that, for small motif size (i.e. µ = 5), the most abundant motif is not
significantly more frequent than other motifs of the same size. However, as motifs get larger
in size (i.e. µ = 10 and 15), the gap between the frequency of the most abundant motif and
the rest of the motifs becomes highly significant. This implies that larger motifs characterize
topological properties of PPI networks better than small motifs. This is because when the motif
size is small, different motifs have similar frequency values, and thus they cannot be
statistically different in abundance from each other. On the other hand, for large motif size,
although the number of unique motif topologies is large, they vary a lot in their abundances; the
most frequent one gets significantly more abundant than the rest.
Table 2-3. The z-scores that represent the significance of the most abundant motif against the most abundant motifs in 100 random networks, for each PPI network and three motif sizes.
Network code   Motif size = 5   Motif size = 10   Motif size = 15
hhv-8          2.79             -0.54             -2.83
cje            2.32             0.99              -0.82
tpa            3.21             5.27              2.83
rno            -0.49            -4.02             -4.83
hpy            22.42            8.61              6.15
pfa            10.53            5.16              4.80
Table 2-3 shows that, for most of the PPI network and motif size combinations, the
most abundant motif is highly over-represented in the original network compared to random
networks. In three cases (Rattus norvegicus, µ = 10 and 15, and Human herpesvirus 8,
µ = 15), we observe that the most abundant motif is significantly under-represented. These
results demonstrate that the motif abundance in PPI networks is not random for most of the
combinations we tested. Thus, studying these structures has great potential to help understand
how these networks function. Among the six PPI networks, Rattus norvegicus stands out
as the one with consistently under-represented or random motif abundance. The PPI network
of Helicobacter pylori consistently has the most significant motif abundance for all motif
sizes. This indicates that the interactions in this network follow a regular pattern that repeats
itself at different locations of the network. Finally, notice that the two z-score values
reported in Tables 2-2 and 2-3 do not follow the same pattern (that is, a high z-score according
to one measure does not imply a high value for the other). This implies that the frequencies
of different motifs (i.e., including the ones which are not most abundant) in these PPI networks
differ from those in random networks. In other words, the PPI networks topologically deviate
from random networks.
2.4.4 Case Study on Human Herpesvirus
Table 2-4. Each row lists the Uniprot IDs of the proteins in an embedding of the most abundant motif of size 10 found by our method in the hhv-8 PPI network.
O40944 P88947 P88935 P88951 P88960 P88940 P90489 P88918 P90495 P88902
O40910 O40944 P88947 P88929 P88920 P88925 P88927 P90486 P88918 P88954
P88918 P88919 P88929 P88948 P88920 P88950 O36551 P88942 Q98141 P88954
O40944 Q98141 P88920 P88951 P88954 P88947 P88948 P88958 P88939 P88944
Here we briefly analyze the motifs identified by our method on the PPI network of hhv-8,
the virus which causes Kaposi's sarcoma. We choose this organism in our case study as it has
the smallest PPI network among the organisms in our dataset (Table 2-1). Notice from
Figure 2-11C that despite its small size (48 nodes and 82 edges), hhv-8 has four disjoint
embeddings of a very large motif with 10 nodes, covering a significant fraction of its PPI
network. This raises the question of whether there is a fundamental recurring function that
hhv-8 serves and that has been preserved through the evolutionary process with high redundancy.
Figure 2-14 presents the structure of those four embeddings. Each row of Table 2-4 lists the
Uniprot IDs of the ten proteins that contribute to each of those embeddings. Analysis of these
proteins in the Gene Ontology database (57) reveals that three of those four embeddings each
contain two proteins, one responsible for viral DNA packaging (O40944 or P88919) and one
responsible for virion assembly (P88954). Without either process, no infectious progeny virus
could be formed (58). Several studies use these two processes as targets to identify effective
inhibitors. The existence of these two processes in each of the three instances reflects the
functional importance of the motif topology found. These results suggest that our algorithm can
find significant and valuable motifs which can be used to detect key functions governed by the
network processes.
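The observation about the three embeddings can be checked directly against the rows of Table 2-4, as in the following short Python sketch. The protein IDs and their functional roles are taken from the table and the text above; the variable names are hypothetical.

```python
# Rows of Table 2-4: Uniprot IDs of the ten proteins in each embedding.
embeddings = [
    ["O40944", "P88947", "P88935", "P88951", "P88960",
     "P88940", "P90489", "P88918", "P90495", "P88902"],
    ["O40910", "O40944", "P88947", "P88929", "P88920",
     "P88925", "P88927", "P90486", "P88918", "P88954"],
    ["P88918", "P88919", "P88929", "P88948", "P88920",
     "P88950", "O36551", "P88942", "Q98141", "P88954"],
    ["O40944", "Q98141", "P88920", "P88951", "P88954",
     "P88947", "P88948", "P88958", "P88939", "P88944"],
]

packaging = {"O40944", "P88919"}  # viral DNA packaging
assembly = {"P88954"}             # virion assembly

# Row numbers (1-based) of embeddings that contain a protein of each kind.
rows_with_both = [
    index + 1
    for index, proteins in enumerate(embeddings)
    if packaging & set(proteins) and assembly & set(proteins)
]
print(rows_with_both)  # [2, 3, 4]
```

The first row contains a packaging protein (O40944) but no virion assembly protein, which is why only three of the four embeddings qualify.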
Figure 2-14. The organization of the four isomorphic subgraphs of 10 nodes in the hhv-8 PPI network. Each subgraph has a different color and pattern.
2.5 Discussion
In this chapter, we developed a scalable method to solve the motif identification problem
given an input graph, desired motif size µ, and minimum frequency of desired motif α. We
proposed a set of small patterns, which we call basic building patterns, each containing two or
three edges. We proved that any motif with four or more edges can be constructed as a join of
these patterns. Our method first locates instances of the basic building patterns. It then
iteratively grows the motifs known at that iteration by joining them with the instances of these
patterns. We developed efficient mechanisms to avoid a significant fraction of the costly
isomorphism tests. We also introduced a new and efficient strategy for solving the MIS extraction
problem. We analyzed the time complexity of our method based on the number of nodes and edges
in the target network and the number of frequent motifs at each iteration. Our experiments
on PPI networks from MINT demonstrated that our method is significantly
faster and more accurate than the existing methods. Furthermore, we observed using synthetic
networks that the running time of our algorithm remains practical as the size and density of the
target network grow. We also showed using PPI networks that the increase in
the running time of our algorithm is dramatically less than that of the competing methods as
the motif size grows. We evaluated the statistical significance of the most abundant motif of
PPI networks resulting from our algorithm.
CHAPTER 3
APPLICATION OF MOTIFS IDENTIFICATION
In this chapter, we address two applications of the motif identification problem. The
first application is Motifs in the Assembly of Food Web Networks (Section 3.1). The second
application is Motif Centrality in Food Web Networks (Section 3.2).
3.1 Motifs in The Assembly of Food Web Networks
3.1.1 Preface
The assembly of local communities from regional pools is a multifaceted process that
involves the confluence of interactions and environmental conditions at the local scale and
biogeographic and evolutionary history at the regional scale (59). Understanding the relative
influence of these factors on community structure has remained a challenge and mechanisms
driving community assembly are often inferred from patterns of taxonomic, functional, and
phylogenetic diversity. Moreover, community assembly is often viewed through the lens of
competition and rarely includes trophic interactions or entire food webs. Motifs provide a
novel framework for exploring community assembly by explicitly including interactions as
opposed to inferring them from patterns of taxonomic or phylogenetic composition (60).
Focusing on community assembly through the lens of motifs can be thought of as interaction
assembly. Here, we use motifs, i.e., subgraphs of nodes (e.g., species) and links (e.g.,
predation) whose abundance within a network deviates significantly from that in a random network
topology, to explore the assembly of food web networks found in the leaves of the northern
pitcher plant (Sarracenia purpurea). We compared counts of three-node motifs (Figure 3-1)
across a hierarchy of scales to a suite of null models to determine if motifs are over-, under-,
or randomly represented (19). We then assessed if the pattern of representation of a motif in a
given network matched that of the network it was assembled from.
3.1.2 Method
In this section, we explain the dataset we analyze. We then discuss the methods we
develop to identify the assembly behaviors.
Figure 3-1. Four of the thirteen possible three-node motifs: apparent competition, exploitative competition, tri-trophic chain, and omnivory. These four motifs have been explored both theoretically and empirically in ecological networks and are the only motifs found in the pitcher plant dataset we analyzed.
Dataset. The pitcher plant (Sarracenia purpurea) is a carnivorous plant that inhabits bogs
and fens along the east coast of North America from the panhandle of Florida to Canada and
across southern Canada to British Columbia (21). S. purpurea forms tube-shaped leaves that
fill with rainwater. The leaves produce a nectar around the rim of the pitcher that attracts
invertebrate prey (e.g., ants, wasps) which subsequently drown in the pitcher liquid. An entire
food web consisting of bacteria, protozoa, rotifers, and dipteran larvae among other taxa (21)
resides within the pitcher and serves to decompose prey items releasing nutrients to the plant.
We used pitcher plant data from 39 sites across North America to explore motif assembly
(Figure 3-2). This dataset contains abundance data and feeding interactions for 20 pitcher
plant food webs at each site for a total of 769 food webs (11 pitchers were dropped due to
missing data). We based the interaction structure (i.e., who eats whom) on previous studies
and direct observation of feeding interactions. We constructed food web networks at three
levels of hierarchy as follows. At the first level lie the food web networks for individual pitcher
plants. We consider these as the local networks. The second level in the hierarchy of networks
lies at the site scale. We created networks for each of the 39 sites by combining the local
network of every pitcher plant at that site. We combined a set of networks by taking the
union of all the nodes and the union of all the links of those networks. We designated the
resulting networks as site networks. Finally, at the top of the hierarchy lies the continental
network which summarizes all the 39 site food webs. We obtained this network by combining
the 39 site food webs. In summary, the local networks (n=769) were assembled from their
site networks (n=39) which were assembled from the continental network (n=1) (Figure 3-2).
Because of this hierarchical design, we designated the higher level network from which a
network is assembled as the parent network and a network that is assembled from the parent
network as a daughter network.
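The union-based construction of the hierarchy described above can be sketched as follows. This is a minimal illustration rather than the code used in the study: the function name `combine_networks`, the `(nodes, links)` representation, and the toy taxa and links are all assumptions made for the example.

```python
def combine_networks(networks):
    """Combine networks by taking the union of their nodes and of their links.

    Each network is a (nodes, links) pair, where links are (prey, predator)
    tuples. This representation is an assumption made for illustration.
    """
    nodes, links = set(), set()
    for net_nodes, net_links in networks:
        nodes |= set(net_nodes)
        links |= set(net_links)
    return nodes, links


# Hypothetical toy data: two sites with two local (pitcher) networks each.
local_networks = {
    "site_a": [
        ({"bacteria", "rotifer"}, {("bacteria", "rotifer")}),
        ({"bacteria", "mosquito"}, {("bacteria", "mosquito")}),
    ],
    "site_b": [
        ({"bacteria", "protozoa"}, {("bacteria", "protozoa")}),
        ({"protozoa", "mosquito"}, {("protozoa", "mosquito")}),
    ],
}

# Site networks are the parents of the local networks at that site.
site_networks = {
    site: combine_networks(nets) for site, nets in local_networks.items()
}

# The continental network is the parent of all site networks.
continental_network = combine_networks(site_networks.values())
```

Because both levels use the same union operation, every daughter network is by construction a subnetwork of its parent.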
Figure 3-2. Schematic of the three levels of hierarchy for pitcher plant network assembly. The continental scale network (A) contains all of the species and interactions found across the 39 North American site networks (B) (sites are indicated by black circles; we only show networks for three sites here for clarity). Species assemble from the continental network to the site networks. Within each site there are 20 local food web networks (C) found in individual pitcher plant leaves (we show only three local networks here for clarity). Species from the site networks assemble into the local networks.
Analysis. We took a four-step approach to analyzing motifs in the assembly of food
web networks. First, we counted motifs in empirical networks. Second, we developed null
models and counted motif representation in the null models. Next, we compared empirical
motif counts to those of the null models using z-scores and p-values. We use different null
models, each of which describes a different random scenario, namely: Erdős-Rényi (61), niche (62),
nested-hierarchy (NH) (63), generalized cascade (GC) (64), and two co-occurrence
null models (CO1 and CO2) (60). Using each of the null models, we created 1000 networks
to get a distribution of null motif counts. We consider a z-score greater than two as evidence
that a particular motif is over-represented and a z-score less than negative two as evidence that a
motif is under-represented; a z-score that falls between negative two and two suggests that
a motif appears no more or less often than we would expect under the null model (i.e., randomly).
In addition to calculating z-scores, we also calculated p-values to determine the probability
of obtaining a motif count equal to or more extreme than the observed count, under the null
model. A p-value > 0.975 is evidence of under-representation, a p-value < 0.025 is evidence
of over-representation, and the in between values indicate random representation. When motifs
are over- or under-represented, they represent a non-random selection of a given motif in a
network. Finally, we compared the motif representation of parent networks to their daughter
networks.
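The classification rules above can be illustrated with a short sketch. It assumes that the p-value is the upper-tail fraction of null counts that are at least as large as the observed count, which is consistent with the thresholds stated above (small p for over-representation, large p for under-representation); the function names and the simulated null counts are hypothetical and stand in for counts obtained from real null-model networks.

```python
import random
import statistics


def z_score(observed, null_counts):
    """Standardize the observed count against the null distribution."""
    mean = statistics.fmean(null_counts)
    sd = statistics.stdev(null_counts)
    if sd == 0:
        raise ValueError("null counts have zero variance; z-score is undefined")
    return (observed - mean) / sd


def upper_tail_p_value(observed, null_counts):
    """Fraction of null counts that are at least as large as the observed count."""
    return sum(count >= observed for count in null_counts) / len(null_counts)


def classify_by_z(z):
    """Apply the z-score thresholds of two and negative two."""
    if z > 2:
        return "over"
    if z < -2:
        return "under"
    return "random"


def classify_by_p(p):
    """Apply the p-value thresholds of 0.025 and 0.975."""
    if p < 0.025:
        return "over"
    if p > 0.975:
        return "under"
    return "random"


# Stand-in for motif counts from 1000 null-model networks (made-up numbers).
rng = random.Random(0)
null_counts = [rng.randint(40, 60) for _ in range(1000)]

for observed in (75, 50, 30):
    z = z_score(observed, null_counts)
    p = upper_tail_p_value(observed, null_counts)
    print(observed, classify_by_z(z), classify_by_p(p))
```

Comparing the labels of a daughter network with those of its parent, motif by motif and null model by null model, then gives the match used in the final step.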
3.1.3 Experimental Results
Our main interest lies in determining if the pattern of representation of a motif (i.e.
over-represented, under-represented, or random) in a set of daughter networks matches that
of the parent networks they are assembled from. So we mainly calculated the proportion of
daughter networks that matched the parent network they were assembled from for all motifs.
Figures 3-3 and 3-4 present the results.
We found that the motif representation in daughter networks generally matched that of
their parent network regardless of motif for both continental-to-site and site-to-local network
assembly. While different null models showed different representation for a given motif, the
general pattern of agreement in motif representation between daughter and parent networks
was consistent. The consistency across motifs and null models shows that the assembly process
results in daughter networks that are structurally representative samples of the parent network
in terms of motif representation. The ultimate mechanism driving the assembly of daughter
networks (or community assembly in general) is the sampling of the parent network. In the
case of matching parent and daughter networks, proportional sampling from each trophic group
Figure 3-3. The percentage of sites for which motif representation (over-represented (black fill), under-represented (cross hatch fill), and random (white fill)) matches the continental network under six different null models.
(loosely defined as species that have the same or similar prey and predators) produces daughter
networks with fewer nodes, but representative motif structure.
3.1.4 Discussion
In this application, we compared counts of three-node motifs across a hierarchy of scales
to a suite of null models to determine if motifs are over-, under-, or randomly represented. We
then assessed if the pattern of representation of a motif in a given network matched that of
the network it was assembled from. We found that motif representation in over 70% of site
networks matched the continental network they were assembled from and over 75% of local
networks matched the site networks they were assembled from for the majority of null models.
This suggests that the same processes are shaping networks across scales.
Figure 3-4. The percentage of pitchers for which motif representation (over-represented (black fill), under-represented (cross hatch fill), and random (white fill)) matches the site networks under six different null models.
3.2 Motif Centrality in Food Web Networks
3.2.1 Preface
The complexity of ecological networks has inspired an approach to network analysis
that reduces networks into meaningful subnetworks to better characterize the structure
and function of these systems. Motifs, i.e., subnetworks whose abundance in the given network
differs significantly from that in a random network topology, have in particular captured the
interest of network ecologists due to the ecological theory that has been developed for several
three-node motifs (65). To better understand why some motifs are found at high abundances
(i.e., over-represented) and some are found at low abundances (i.e., under-represented), we
explored the relationship between motif abundance and motif centrality. In order to assess this
relationship, we developed a suite of methods for calculating the centrality of entire motifs and
then analyzed the relationship between motif centrality and motif abundance in 44 published
aquatic food webs. Our eight approaches for calculating motif centrality differed in three
aspects: the calculation of the centrality of a single node in a motif, the strategy of combining
the centrality of the nodes that make up a motif into a single centrality value, and the null
model used to test the significance of motif centrality.
3.2.2 Background
Integrating the concept of centrality with motifs, which also influence the functioning
and structure of networks, has the potential to increase our understanding of the variation
in abundance across different motifs (i.e., why a specific motif is under-represented or
over-represented in a food web). There have been several attempts to integrate the concept
of centrality with network motifs (66; 67). These studies have predominantly focused on
calculating a measure of node-centrality based on the location, frequency, or role of a node
within a given motif in a network [30-32]. The approach of quantifying the centrality of an
entire motif to assess its importance is uncommon. Li et al. (68) investigated the functional
potential behind central motifs in a cancer related human signaling network. They identified
central motifs by ranking the motifs based on their centrality values according to six different
centrality measures and choosing the top 5%. One of the centrality measures they use is the
in-coming degree of the underlying node (see Li et al. (68) for the other centralities). Piraveenan
et al. (69) averaged the centrality (betweenness and closeness) of four-node motifs in
Prokaryotic and Eukaryotic metabolic networks and found that the nodes that participated
in over-represented motifs (i.e. occur more frequently than randomly expected) had a greater
average centrality than the average of all nodes in the network. Motif centrality has not been
explored in food web networks.
3.2.3 Method
Dataset. We explored motif centrality in 44 food web networks contained in the enaR
package in R (networks 15-58 in Borrett and Lau et al. (23)). These networks describe aquatic
food webs ranging in size from 14 to 125 nodes (mean = 45.73, sd = 29.41) and connectance
($C = \text{edges}/\text{nodes}^2$) from 0.05 to 0.37 (mean = 0.17, sd = 0.08). Nodes depict species (e.g.,
Fundulus heteroclitus) or trophic-species and edges depict weighted biomass or energy flows
from prey to consumer that result from a feeding interaction. Each food web network also
contains information on node boundary loss and inputs, and node biomass.
Analysis. We analyzed motif centrality using a three-step process. First, we calculated the
statistical significance of motif abundance for each of the 13 three-node motifs (Figure 3-5)
using the niche null model (62). Second, we calculated the centrality of each motif (explained
later). Finally, we analyzed the relationship between motif abundance and centrality.
Figure 3-5. All 13 motifs of 3-node subgraphs. The first four motifs have specific ecological terminology.
Two attributes define how we measured the centrality of a motif; (1) the calculation of
the centrality of a single node in the motif and (2) the strategy to combine the centrality
of the nodes that make up that motif into a single centrality value. We used two measures
to quantify the centrality of each node in each of the 44 food web networks we analyzed.
The first measure is betweenness centrality (70). Briefly, a species is considered to have high
betweenness centrality in a given network if it is located on the shortest paths connecting
many pairs of species in that network. Our second measure of centrality, called throughflow
centrality (71), is the total energy entering or exiting a node. This method was developed
to specifically capture energy flow through nodes in a food web network. Conceptually,
throughflow centrality measures the contribution of a given node to energy exchanged across
the entire food web.
So far, we have described two alternative strategies for calculating the centrality of single
nodes. A motif however, is made up of multiple nodes (i.e., three in our study; Figure 3-5).
More importantly, a given motif topology typically has many possible occurrences in a given
network. We used two approaches to compute the centrality of a given motif from node
centrality. We call the first approach redundant and the second approach non-redundant.
In the redundant approach, each node contributes to all occurrences of a given motif
independently regardless of the number of such occurrences (i.e., a node can contribute
more than once to the centrality of the same motif). The redundant approach has been
used by Li et al. (68). In the non-redundant approach, if a node appears in a given motif,
it only contributes once regardless of the number of instances of the motif it appears in.
The non-redundant approach has been used by Piraveenan et al. (69). In the redundant
approach, we first computed the centrality of each node in the given network. Next, given a
motif topology P , for each occurrence of P in the given network, we calculated the centrality
of that occurrence as the average of the centralities of the nodes in that occurrence. Once
we did this for all the occurrences of that motif, we computed the centrality of that motif
as the average of the centralities of all of its occurrences. We denoted the number of nodes
in the given motif P with n (here n = 3) and represented the number of occurrences of
that motif in the given network with t. Also, the ith node in the jth occurrence of P is
denoted with $v_{ij}$, and $C(v)$ denotes the centrality of node $v$. We calculated the redundant centrality of P, denoted with $MC_r(P)$, as
follows:

$$MC_r(P) = \frac{\sum_{j=1}^{t} \frac{\sum_{i=1}^{n} C(v_{ij})}{n}}{t}$$

In our second approach, we aimed to circumvent any bias
introduced by such multiple-counting of nodes by calculating non-redundant motif centrality.
This approach allows each node in the given network to contribute once if it appears in
at least one occurrence of the underlying motif. We defined an indicator function for each
motif P operating on the nodes v of the given network as $\delta_P(v)$, where $\delta_P(v) = 1$ if v appears
in at least one occurrence of P, and 0 otherwise. We computed the non-redundant centrality
value of the motif P, denoted with $MC_{nr}(P)$, as follows:

$$MC_{nr}(P) = \frac{\sum_{v} C(v)\,\delta_P(v)}{\sum_{v} \delta_P(v)}$$
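To make the difference between the two combination strategies concrete, the following sketch computes both values for a toy example. It is an illustration under stated assumptions: occurrences are given as tuples of node names, node centralities are given as a dictionary, and the function names and numbers are hypothetical rather than taken from the study.

```python
def redundant_motif_centrality(occurrences, centrality):
    """Redundant centrality: average over occurrences of the average node
    centrality within each occurrence. A node that appears in several
    occurrences contributes once per occurrence."""
    if not occurrences:
        raise ValueError("the motif has no occurrences")
    per_occurrence = [
        sum(centrality[v] for v in occ) / len(occ) for occ in occurrences
    ]
    return sum(per_occurrence) / len(per_occurrence)


def non_redundant_motif_centrality(occurrences, centrality):
    """Non-redundant centrality: average centrality over the distinct nodes
    that appear in at least one occurrence. Each node contributes once."""
    if not occurrences:
        raise ValueError("the motif has no occurrences")
    participating = set().union(*occurrences)
    return sum(centrality[v] for v in participating) / len(participating)


# Hypothetical node centralities and three occurrences of one motif topology.
centrality = {"a": 0.9, "b": 0.3, "c": 0.0, "d": 0.6}
occurrences = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d")]

mc_r = redundant_motif_centrality(occurrences, centrality)
mc_nr = non_redundant_motif_centrality(occurrences, centrality)
```

In this toy example node `a` has the highest centrality and appears in all three occurrences, so it is counted three times by the redundant approach and only once by the non-redundant approach.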
Once we calculated motif centrality, we tested its statistical significance by comparing
it to a null distribution of centrality constructed from random subnetworks of the same
size chosen from the observed network. We took two different approaches to defining the null
model: constrained and unconstrained. In the constrained null model, we randomly selected
a three-node subnetwork (i.e. of the same size as the given motif) with the condition that
this subnetwork is connected. Connected subnetwork here means that there is an undirected
path between any pair of the three nodes in this subnetwork. This approach randomly selects
a subnetwork which matches one of the 13 motifs in Figure 3-5, since these are all the possible
connected three-node topologies. In the unconstrained null model, we randomly selected a three-node
subnetwork (i.e. also of the same size as the given motif) but did not require it to be
connected. For each type of null model, we repeated this process 1000 times for each motif in
each network. We then computed the p-value of the observed motif centrality as the fraction
of random subnetworks which have higher centrality than the underlying motif. We summarize
all the approaches we use to calculate motif centrality significance in Table 3-1.
Table 3-1. Eight approaches used to calculate motif centrality significance. Each approach varies in the combination of methods used to calculate node centrality, motif centrality, and the null model used to assess the significance of motif centrality.
Approach   Node centrality   Motif centrality   Null model
1          Throughflow       Redundant          Constrained
2          Throughflow       Redundant          Unconstrained
3          Throughflow       Non-redundant      Constrained
4          Throughflow       Non-redundant      Unconstrained
5          Betweenness       Redundant          Constrained
6          Betweenness       Redundant          Unconstrained
7          Betweenness       Non-redundant      Constrained
8          Betweenness       Non-redundant      Unconstrained
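The constrained and unconstrained null models described above can be sketched as follows. This is only an illustration: it assumes that the centrality of a random three-node subnetwork is the average of its node centralities, it uses simple rejection sampling to enforce connectivity, and all function names and the toy network are hypothetical.

```python
import random


def is_connected_triple(nodes, undirected_adj):
    """Three nodes are connected (ignoring edge direction) when at least two
    of the three possible node pairs are linked."""
    a, b, c = nodes
    links = sum(
        1 for u, v in ((a, b), (a, c), (b, c)) if v in undirected_adj[u]
    )
    return links >= 2


def sample_triple(all_nodes, undirected_adj, constrained, rng, max_tries=10000):
    """Draw three distinct nodes; in the constrained model, redraw until the
    induced subnetwork is connected."""
    for _ in range(max_tries):
        triple = rng.sample(all_nodes, 3)
        if not constrained or is_connected_triple(triple, undirected_adj):
            return triple
    raise RuntimeError("no connected three-node subnetwork found")


def centrality_p_value(motif_centrality, edges, centrality, constrained,
                       n_samples=1000, seed=0):
    """Fraction of random three-node subnetworks whose average node
    centrality is higher than the observed motif centrality."""
    rng = random.Random(seed)
    undirected_adj = {v: set() for v in centrality}
    for u, v in edges:
        if u != v:
            undirected_adj[u].add(v)
            undirected_adj[v].add(u)
    all_nodes = sorted(centrality)
    higher = 0
    for _ in range(n_samples):
        triple = sample_triple(all_nodes, undirected_adj, constrained, rng)
        if sum(centrality[v] for v in triple) / 3 > motif_centrality:
            higher += 1
    return higher / n_samples


# Hypothetical toy network: a directed chain of six nodes.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
centrality = {node: node / 10 for node in range(6)}

p_constrained = centrality_p_value(0.25, edges, centrality, constrained=True)
p_unconstrained = centrality_p_value(0.25, edges, centrality, constrained=False)
```

Rejection sampling is the simplest way to honor the connectivity constraint; for large sparse networks, growing a connected triple from a random edge would be faster, although it samples connected triples with different probabilities.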
3.2.4 Experimental Results
The over-arching goal of our analysis is to determine if there is a relationship between
the abundance of a given motif and its centrality. We show the results in Figure 3-6. Focusing
on approach 5, in the networks where a motif was found to be highly central (Figure 3-6), that motif was
over-represented in abundance for six of the 13 motifs (motifs 2, 5, 7, 8, 9, and 10) and
under-represented for three motifs (6, 11, and 12).
In order to compute the statistical significance of these results in a systematic manner,
we computed the probability of obtaining the observed split between networks of different
centrality classes (e.g., highly central or non-central) across motif abundance class (i.e.,
Figure 3-6. Distribution of motif abundance over two classes of motif centrality significance for approach five. White fill indicates random representation (0.025 < p < 0.975), black fill shows the number of networks in which a motif is over-represented (p < 0.025), and cross-hatch fill indicates under-representation (p > 0.975).
over-represented, under-represented, random). The heat map in Figure 3-7A shows the
p-values for a positive correlation between motif abundance and motif centrality.
Similarly, the heat map in Figure 3-7B shows the p-values for a negative correlation
between motif abundance and motif centrality. Generally, we found
support that highly central motifs were over-represented and non-central motifs were
under-represented for several of the motifs. We found no support for the opposite hypothesis, that
highly central motifs were under-represented and non-central motifs were over-represented.
3.2.5 Discussion
In this application, we explored the relationship between motif abundance and motif
centrality. In order to assess this relationship, we developed a suite of methods for calculating
Figure 3-7. Correlation probabilities (p-values) between motif abundance and motif centrality: (A) positive correlation, (B) negative correlation. Only approaches that yielded two centrality classes could be used in these calculations. Row and column cluster trees are shown to illustrate the relation of different approaches based on (A) positive correlation significance and (B) negative correlation probabilities. Significant p-values (≤ 0.05) are emphasized with stars.
the centrality of entire motifs and then analyzed the relationship between motif centrality and
motif abundance in 44 published aquatic food webs. We found that highly central motifs are
over-represented and non-central motifs are under-represented for six of the thirteen motifs.
This pattern suggests that high energy flow is associated with the persistence of certain motifs
in food webs. Further research on well resolved food web networks and integration of motif
centrality with new approaches to stability analysis will help determine the generality of our
results and provide further evidence of the mechanism driving them.
CHAPTER 4
IDENTIFICATION OF CO-EVOLVING TEMPORAL NETWORKS
4.1 Preface
Biological networks describe the interaction between molecules. They are frequently
represented as graphs, where the nodes correspond to the molecules (e.g., proteins or genes)
and the edges correspond to their interactions (1). Formally, we denote a biological network
as G = (V,E) where V and E represent the set of nodes and the set of edges, respectively.
Analysis of these networks enables the elucidation of cellular functions (2), the identification of
variations in cancer networks (3), and the characterization of variations in drug resistance (4).
Studying biological networks has led to numerous computational challenges as well as methods
that address these challenges. Network alignment is one of the most important of these
challenges (8) as it has a profound set of applications ranging from the detection of conserved
motifs to the prediction of protein functions (72). This problem aims to find a mapping of
the nodes of two given networks in which nodes that are similar in terms of content (i.e.
homology) and interaction structure (i.e. topology) are mapped to each other. Hence, we
represent the alignment between two given networks G1 = (V1, E1) and G2 = (V2, E2) as a
bijection function ψ : V1 → V2, and the score resulting from alignment ψ as score(G1, G2|ψ).
The network alignment problem seeks the function ψ that maximizes this score. We note that
there are various ways to calculate the scoring function.
There are two categories of network alignment problem: local and global alignment. The
former problem aims to find pairs of highly-conserved sub-networks in two given networks in
which a sub-network of the query network is mapped to multiple sub-networks in the target
network. Global network alignment aims to maximize the similarity in the networks in which
all nodes in the query network are mapped to a set of nodes in the target network. Network
alignment is a challenging task, as the graph and subgraph isomorphism problems, which are
known to be GI-complete and NP-hard, respectively (20), reduce to it. In Section 4.2, we give a brief review of
the methods addressing the global network alignment problem as the problem we consider in
this paper is associated with that problem.
Biological networks have dynamic topologies (11). There are various reasons behind
this dynamic behavior. For example, genetic and epigenetic mutations can alter molecular
interactions (13), and variation in gene copy number can affect the existence of interactions (14).
Due to this dynamic behavior, the topology of the network that models the molecular
interactions evolves over time (16). The majority of the previous work on alignment of biological
networks assumes the network topology is static (10), an assumption that ignores the history
of network evolution, and may lead to biased or incorrect analysis. For example, identifying
causes and consequences of the influence of external stimuli is impossible when analyzing
static topologies. To address this oversight, we define a biological network using a model that
accounts for the evolution of the underlying network at consecutive time points. We refer to
this model as a temporal network (24). We view this model as containing a single snapshot
of the network at each time point in a sequence of time points and thus, as a time series
network. More formally, we denote a temporal network with t consecutive time points as
$\mathcal{G} = [G_1, G_2, \ldots, G_t]$, where $G_i = (V, E_i)$ represents the topology of the network at the ith time
point.
In this paper, we consider the problem of identifying coevolving subnetworks in a given
pair of temporal networks. We say that two subnetworks are coevolving if their topologies
remain similar even though their topologies evolve over time. We define this more formally
as follows. We consider two input temporal networks $\mathcal{G}^1 = [G^1_1, G^1_2, \ldots, G^1_t]$ and
$\mathcal{G}^2 = [G^2_1, G^2_2, \ldots, G^2_t]$, where $\forall i \in \{1, 2, \ldots, t\}$, $G^1_i = (V^1, E^1_i)$ and $G^2_i = (V^2, E^2_i)$ represent $\mathcal{G}^1$
and $\mathcal{G}^2$ respectively at the time point $i$. Without losing generality, let $\mathcal{G}^1$ be the query (smaller)
network and $\mathcal{G}^2$ be the target network, i.e., $|V^1| \le |V^2|$. An alignment of $\mathcal{G}^1$ and $\mathcal{G}^2$ maps $G^1_i$
to $G^2_i$ across all time points $i$. Thus, we represent the alignment of the two temporal networks
$\mathcal{G}^1$ and $\mathcal{G}^2$ as a bijection of their nodes and denote it as a function $\psi : V^1 \to V^2$. We compute
the score of the alignment $\psi$ of $\mathcal{G}^1$ and $\mathcal{G}^2$, denoted with $\mathrm{score}(\mathcal{G}^1, \mathcal{G}^2 \mid \psi)$, as the sum of the
Figure 4-1. Different network alignment problems in different types of biological networks. A) Static: the alignment between two input static networks. B) Multiple: the alignment between multiple networks, where each network represents a different organism. C) Dynamic networks: the alignment between two input networks where one of them is static and the other is dynamic. Here, there exists a different alignment between the static network and each version of the dynamic network. D) Temporal networks: the alignment between two input temporal networks, each of which has time-specific snapshots taken at three specific time points. Here, the alignment persists across all time points.
scores of the alignment at all time points. Hence, $\mathrm{score}(\mathcal{G}^1, \mathcal{G}^2 \mid \psi) = \sum_{i=1}^{t} \mathrm{score}(G^1_i, G^2_i \mid \psi)$.
We assume $\mathcal{G}^1$ is connected at all time points, but it may be impossible to find an alignment
that is connected in the target network at all time points.
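The summation over time points can be sketched as below. The per-snapshot score used here, the number of conserved edges, is only a placeholder chosen for illustration; it is not the scoring function developed in Section 4.3, and the function names and toy snapshots are hypothetical.

```python
def conserved_edges(query_edges, target_edges, psi):
    """Placeholder per-snapshot score: the number of query edges whose
    aligned endpoints are also linked in the target snapshot."""
    target = {frozenset(edge) for edge in target_edges}
    return sum(
        1 for u, v in query_edges if frozenset((psi[u], psi[v])) in target
    )


def temporal_score(query_snapshots, target_snapshots, psi,
                   snapshot_score=conserved_edges):
    """Score one alignment psi as the sum of its per-snapshot scores.

    The same mapping psi is applied at every time point."""
    if len(query_snapshots) != len(target_snapshots):
        raise ValueError("both temporal networks need the same number of time points")
    if len(set(psi.values())) != len(psi):
        raise ValueError("psi must map distinct query nodes to distinct target nodes")
    return sum(
        snapshot_score(q_edges, t_edges, psi)
        for q_edges, t_edges in zip(query_snapshots, target_snapshots)
    )


# Hypothetical temporal networks with two time points each (edge lists).
query = [[("a", "b"), ("b", "c")], [("a", "b")]]
target = [[("x", "y"), ("y", "z")], [("x", "y"), ("x", "z")]]

psi_good = {"a": "x", "b": "y", "c": "z"}
psi_other = {"a": "y", "b": "x", "c": "z"}

score_good = temporal_score(query, target, psi_good)
score_other = temporal_score(query, target, psi_other)
```

Passing the per-snapshot score as a parameter keeps the temporal summation independent of how a single time point is scored.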
It is worth emphasizing that the temporal network alignment problem described above
is dramatically different than existing network alignment problems, which can be categorized
as follows: (i) pairwise alignment, (ii) multiple network alignment, and (iii) dynamic network
alignment. We illustrate these problems as well as the temporal one in Figure 4-1. The
pairwise network alignment problem (Figure 4-1A) ignores that the network topology evolves.
Although the multiple alignment problem (Figure 4-1B) can consider more than two networks
at once, it lacks the ability to capture the temporal changes since it treats all networks as
having static topologies. The dynamic network alignment problem (Figure 4-1C) considers
topological changes over time. However, it seeks a different solution to the alignment
problem at each time point. Thus, it cannot identify coevolving subnetworks. A new algorithm
is needed to capture such evolving characteristics. Unlike these alignment problems, temporal
network alignment (Figure 4-1D) captures that network topologies coevolve over time.
Contributions in this paper. We develop an efficient algorithm, Tempo, to identify
coevolving subnetworks in a given pair of temporal networks. Briefly, our algorithm first
finds an initial alignment between the input networks G1 and G2 using the similarity score
between pairs of aligned nodes across all time points. It then performs a dynamic programming
strategy that maximizes the alignment quality (i.e. score) by repeatedly altering the aligned
nodes in the target network. We demonstrate the efficiency and accuracy of Tempo using both
real and synthetic data. We compare the running time and the quality of the alignments found
by Tempo against those of three existing alignment algorithms, IsoRank (10), MAGNA++ (73)
and GHOST (74). Note that all these methods are tailored towards optimizing alignment at
a single time point. To have a fair comparison, we allow each of these methods to consider
each time point independently, then apply the resulting alignment to all other time points and
take the average. We show Tempo has competitive running time and generates significantly
better alignments. We use a human brain aging (75) dataset, and integrate this dataset to
analyze three phenotypes—two age related diseases (Alzheimer’s and Huntington’s) and one
disease that is less prone to aging (Type II diabetes). We perform gene ontology analysis
on the aligned genes reported by our algorithm and the compared algorithms. Unlike the other
algorithms, our algorithm could successfully align genes of the phenotype query (i.e. the underlying disease) to strongly
related genes in the target network despite their evolving topologies.
Consequently, we could predict disease-related genes based on the alignment generated by
Tempo, which suggests that Tempo generates alignments that reflect the evolution of node
topologies through time as well as their homological similarities, while other methods only
focus on static and independent topologies. Lastly, we observe that the alignment scores of the age-related
phenotypes are significantly higher than those of the non-age-related phenotype, which reflects their high
evolution rates and shows that Tempo can distinguish between different queries.
4.2 Related Work
One of the key studies on pairwise global network alignment is IsoRank (10), which
is based on the conjecture that two nodes should be matched if their respective neighbors
can also be matched. It formulates the alignment as an eigenvalue problem and computes
the similarity between pairs of nodes from two given networks as a combination of their
homological and topological similarities. It obtains the global alignment of the two given
networks using their maximum weight bipartite match with the scores as the weights.
The GRAAL (GRAph ALigner) family (76) of global network alignment methods uses the
graphlet degree similarity to align two networks. Briefly, the graphlet degree of a node
counts the number of graphlets (i.e. induced subgraphs) that this node touches, for all
graphlets on 2 to 5 nodes. GRAAL (77) first selects a pair of nodes (one from each of
the two given networks) with high graphlet degree signature similarity as the seed of
the alignment, and greedily expands the alignment by iteratively including new pairs of
similar nodes. H-GRAAL (78), MI-GRAAL, and L-GRAAL algorithms also belong to the
same family. The SPINAL algorithm (79) iteratively grows the alignment based on a priori
computed node similarity scores. MAGNA (80) optimizes the edge conservation between two
networks using a genetic algorithm. There are several other methods for pairwise network
alignment (81; 82; 83; 84; 74; 85; 86; 87). Although the underlying algorithms of these
methods vary, the end goal is similar to those discussed above.
Several algorithms address multiple network alignment (88; 89; 90). IsoRankN (91)
extends IsoRank. It adopts spectral clustering on the induced graph of pairwise alignment
scores. The algorithm developed by Shih et al. (92) is a seed-expansion heuristic that first
selects a set of node pairs with high similarity scores using a clustering algorithm, and then
expands these pairs by aligning nodes that maximize the total number of conserved edges
of aligned nodes.
INQ (93) aligns a dynamically evolving query network with one static target network. It
uses ColT (94) to find an initial alignment of the initial query, then it observes the differences
between the topologies of the already aligned query network and the new query network, and
finally, uses these differences to refine the alignment found for the previous query and generate
alignment of the current query network. DynaMAGNA++ (95) aligns two dynamic networks.
It assigns a value to each node based on how the incident edges and graphlets change through
dynamic events, using the dynamic graphlet degree vector
(DGDV) of graphlets up to size four. It considers a pair of nodes from two networks similar if
their DGDVs are similar. This algorithm starts by constructing an initial population of random
dynamic network alignments and then evolves this alignment to maximize the node similarities.
4.3 Problem Formulation
In this section, we develop a new scoring function, $\mathrm{score}(G^1_i, G^2_i \mid \psi)$, that integrates the
similarities of the aligned nodes and their evolving topologies, and includes a penalty for each
disconnected component in the aligned subnetworks of the target network at each time point.
Next, we introduce the terminology and discuss how we derive our scoring function.
Given a network $G = (V, E)$ and a subset of nodes $V' \subseteq V$, we define the induced subnetwork
of $V'$ in $G$ as the nodes in $V'$ and all edges between them (i.e., $E' = \{V' \times V'\} \cap E$). We denote this
induced network as $G' = (V' \mid G)$. We say two nodes $u$ and $v$ in $G$ are connected if there exists
a path between $u$ and $v$ in $G$. We say a subset of nodes in $G$ forms a connected component
if all pairs of nodes in that subset are connected in $G$. We define a subset of nodes $V'$ in $G$
as a maximum connected component if the following conditions hold: (i) $V'$ is a connected
component in $G$, and (ii) there is no node in $V - V'$ which is connected to a node in $V'$. In
the rest of the paper, we use the term connected component instead of maximum connected
component. We denote the number of connected components of a given network $G$ with
$\mathrm{NCC}(G)$.
Given two temporal networks with $t$ time points, $\mathcal{G}^1 = [G^1_1, G^1_2, \ldots, G^1_t]$ and
$\mathcal{G}^2 = [G^2_1, G^2_2, \ldots, G^2_t]$, we denote the similarity between a pair of nodes $u \in V^1$ and
$v \in V^2$ at time point $i$ ($1 \le i \le t$) with $S_i(u, v)$. We use an existing pairwise alignment method
to calculate $S_i(u, v)$. The alignment function $\psi$ maps all nodes in $V^1$ to a subset of the nodes in $V^2$. We
denote this subset with $\Psi(V^1)$ (i.e., $\Psi(V^1) = \{\psi(u) \mid \forall u \in V^1\}$). We note that $\psi$ yields an
induced subnetwork $(\Psi(V^1) \mid G^2_i)$ of $G^2_i$ for each time point $i$, and each induced subnetwork
$(\Psi(V^1) \mid G^2_i)$ forms one or more connected components. Figure 4-2A shows an illustration of
this latter point. We denote the number of connected components of the induced subnetwork
$(\Psi(V^1) \mid G^2_i)$ at time point $i$ as $\mathrm{NCC}(\Psi(V^1) \mid G^2_i)$. If the number of connected components
at time point $i$ is greater than one then the corresponding induced subnetwork is disconnected.
We incur a penalty to account for the missing edges which would connect the disconnected
components, and apply this penalty to each disconnected component.
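The minimum number of edges needed to join $\mathrm{NCC}(\Psi(V^1) \mid G^2_i)$ connected components
is $\mathrm{NCC}(\Psi(V^1) \mid G^2_i) - 1$. We penalize each edge insertion with a constant value denoted with
$\delta$, where $\delta \ge S_i(u, v)$, $\forall u \in V^1$, $v \in V^2$ and $i \in \{1, 2, \ldots, t\}$. We define the score of the
alignment $\psi()$ at time point $i$ as
$$\mathrm{score}(G^1_i, G^2_i \mid \psi) = \sum_{u \in V^1} S_i(u, \psi(u)) - \delta \left( \mathrm{NCC}(\Psi(V^1) \mid G^2_i) - 1 \right).$$
We define the temporal network alignment as
$$\operatorname*{argmax}_{\psi} \left\{ \sum_{i=1}^{t} \left( \sum_{u \in V^1} S_i(u, \psi(u)) - \delta \left( \mathrm{NCC}(\Psi(V^1) \mid G^2_i) - 1 \right) \right) \right\}. \qquad \text{(4-1)}$$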
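The scoring function can be illustrated with a minimal Python sketch. The function names, the dictionary-based representation of $S_i$ and $\psi$, and the edge-list representation of each $G^2_i$ are assumptions made for illustration only; they are not taken from the Tempo implementation.

```python
from collections import defaultdict

def count_components(nodes, edges):
    """Number of connected components of the subnetwork induced by `nodes`."""
    nodes = set(nodes)
    adj = defaultdict(set)
    for a, b in edges:
        if a in nodes and b in nodes:  # keep only the induced edges
            adj[a].add(b)
            adj[b].add(a)
    seen, components = set(), 0
    for start in nodes:
        if start in seen:
            continue
        components += 1
        stack = [start]
        seen.add(start)
        while stack:  # depth-first traversal of one component
            x = stack.pop()
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
    return components

def score_at_time(sim, psi, target_edges, delta):
    """Summed similarity of aligned pairs minus delta per missing joining edge."""
    similarity = sum(sim[(u, v)] for u, v in psi.items())
    ncc = count_components(psi.values(), target_edges)
    return similarity - delta * (ncc - 1)

def temporal_score(sims, psi, target_edges_per_time, delta):
    """Objective of Equation 4-1 for a fixed alignment psi."""
    return sum(score_at_time(s, psi, e, delta)
               for s, e in zip(sims, target_edges_per_time))

# Hypothetical example: three aligned pairs, two time points.
psi = {"a1": "b1", "a2": "b2", "a3": "b3"}
sim = {("a1", "b1"): 1.0, ("a2", "b2"): 1.0, ("a3", "b3"): 1.0}
edges_t1 = [("b1", "b2")]                # b3 is isolated: two components
edges_t2 = [("b1", "b2"), ("b2", "b3")]  # one component
total = temporal_score([sim, sim], psi, [edges_t1, edges_t2], delta=1.0)
```

In this example the first time point contributes 3 - 1 = 2 and the second contributes 3, so the alignment scores 5 in total.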
4.4 Methods
Overview. Our algorithm for solving the temporal network alignment problem has two phases.
The first phase finds an initial alignment between the input networks G1 and G2 using the
similarity score between pairs of aligned nodes across all time points. The induced subnetwork
of G2 obtained by this alignment may be disconnected since this phase ignores the penalty
incurred by edge insertions. The second phase reduces the number of connected components,
improving the alignment score. In the second phase, we improve the alignment between the
input networks by swapping a subset of the nodes in G2 that are aligned with nodes in G1 with
other nodes in G2. In order to swap a node $v_i \in \Psi(V^1)$ with $v_j \in V^2 - \Psi(V^1)$, we update
the alignment function $\psi()$ to $\psi'()$ such that $\forall u \in V^1$ one of the two conditions is satisfied:
(i) $\psi'(u) = v_j$ if $\psi(u) = v_i$; and (ii) $\psi'(u) = \psi(u)$ if $\psi(u) \ne v_i$. Figure 4-2 illustrates this.
Here, initially b11 is aligned to a11 (Figure 4-2A). Swapping b11 with b14 updates the alignment
function so that b14 is aligned to a11 (Figure 4-2B). We observe that this swapping reduces the
number of connected components in the induced subnetwork of G2 by one. Notice that if we
swap b8 with b14 (instead of b11 with b14) then the number of connected components increases
(Figure 4-2C).
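The effect of a swap on the number of connected components can be sketched as follows. The four-node target network below is a hypothetical toy example (it is not the network of Figure 4-2), and the helper names are assumptions made for illustration.

```python
def count_components(nodes, edges):
    """Connected components of the subnetwork induced by `nodes` (union-find)."""
    parent = {x: x for x in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        if a in parent and b in parent:
            parent[find(a)] = find(b)
    return len({find(x) for x in parent})

def swap(psi, aligned_node, gap_node):
    """Return psi' in which the query node mapped to aligned_node maps to gap_node."""
    if aligned_node not in psi.values():
        raise ValueError("aligned_node is not in the current alignment")
    if gap_node in psi.values():
        raise ValueError("gap_node is already aligned")
    return {u: (gap_node if v == aligned_node else v) for u, v in psi.items()}

# Hypothetical target network: b1 - b2 - b4, with b3 isolated; b4 is a gap node.
target_edges = [("b1", "b2"), ("b2", "b4")]
psi = {"a1": "b1", "a2": "b2", "a3": "b3"}

before = count_components(psi.values(), target_edges)                   # {b1,b2}, {b3}
good = count_components(swap(psi, "b3", "b4").values(), target_edges)   # {b1,b2,b4}
bad = count_components(swap(psi, "b2", "b4").values(), target_edges)    # b1, b4, b3 apart
```

Swapping the isolated node `b3` with the gap node `b4` merges the aligned nodes into one component, whereas swapping `b2` with `b4` splits them into three, mirroring the contrast between panels B and C of Figure 4-2.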
We note that the number of connected components may simultaneously decrease at
one time point and increase at other time points when we swap two nodes. We prove that
the problem of finding the subset of node swaps that minimizes the number of connected
components across all time points is NP-hard. We give a reduction from the Maximum
Coverage problem (96) to this problem later in this section.
Figure 4-2. An alignment between two networks $G^1$ and $G^2$. Each node in the query network
$G^1$ has a one-to-one mapping with a node in the network $G^2$. The dashed line between two nodes
emphasizes that they are mapped to each other. A) Initial alignment: a hypothetical alignment
where $a_i$ is aligned with $b_i$ for all $1 \le i \le 11$. The induced subnetwork of the aligned nodes in
$G^2$ forms three connected components: $C_1 = \{b_1, b_2, b_3, b_4\}$, $C_2 = \{b_5, b_6, b_7\}$, and
$C_3 = \{b_8, b_9, b_{10}, b_{11}\}$. Gap nodes are $\{b_{12}, b_{13}, b_{14}\}$. B) After swapping $b_{11}$ with $b_{14}$.
This swapping results in two connected components in $G^2$. C) After swapping $b_8$ with $b_{14}$.
The aligned nodes in $G^2$ form four connected components.
Algorithm details. Tempo takes two networks (G1 and G2) and the maximum number of
allowed swaps (denoted as k) as input. In the following, we explain the two phases of our
method in detail.
Phase 1 (Initialization). Here, we construct an initial alignment of $\mathcal{G}^1$ and $\mathcal{G}^2$.
Several algorithms exist for pairwise alignment of two static networks at a single
time point. Each of these methods assigns similarity scores to all node pairs (one from the first
network and one from the second) and then chooses the alignment that maximizes the total
score of all aligned node pairs. We adopt one of these methods to obtain the similarity scores
of each network pair $(G^1_i, G^2_i)$ at each time point $i$, and use the resulting scores to calculate
an initial alignment. We denote the similarity of the node pair $(u, v)$, $u \in V^1$ and $v \in V^2$,
generated by such a method at the $i$th time point with $S_i(u, v)$.
We generate an initial alignment $\psi_0$ as follows. We first construct a weighted bipartite
network $G_{bp} = (V^1, V^2, E)$ by inserting an edge between each pair of nodes
$(u, v)$ such that $u \in V^1$ and $v \in V^2$. We set the weight of the edge $(u, v)$ as the similarity
between nodes $u$ and $v$ aggregated over all time points, $S(u, v) = \sum_{i=1}^{t} S_i(u, v)$. The
maximum-weight bipartite matching algorithm maps each node in $V^1$ to a
node in $V^2$ (97). This mapping represents the initial alignment, $\psi_0$. We call the nodes in $V^2$
that are not mapped to any node in $V^1$ gap nodes and denote them with $F = V^2 - \Psi(V^1)$.
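The initialization can be sketched in Python as below. For brevity, this sketch finds the maximum-weight assignment by exhaustive search, which is only a stand-in for a proper maximum-weight bipartite matching algorithm and is feasible for tiny inputs only; all names and the example scores are hypothetical.

```python
from itertools import permutations

def aggregate_similarity(sims):
    """S(u, v): sum of the per-time-point similarities S_i(u, v)."""
    total = {}
    for s in sims:
        for pair, value in s.items():
            total[pair] = total.get(pair, 0.0) + value
    return total

def initial_alignment(query_nodes, target_nodes, sims):
    """Return (psi_0, gap nodes) maximizing the aggregated similarity.

    Exhaustive search over injective mappings; missing pairs count as 0.
    """
    S = aggregate_similarity(sims)
    best_psi, best_weight = None, float("-inf")
    for chosen in permutations(target_nodes, len(query_nodes)):
        weight = sum(S.get((u, v), 0.0) for u, v in zip(query_nodes, chosen))
        if weight > best_weight:
            best_psi, best_weight = dict(zip(query_nodes, chosen)), weight
    gap_nodes = set(target_nodes) - set(best_psi.values())
    return best_psi, gap_nodes

# Hypothetical similarity scores at two time points.
sims = [
    {("a1", "b1"): 0.9, ("a1", "b2"): 0.2, ("a2", "b2"): 0.4, ("a2", "b3"): 0.5},
    {("a1", "b1"): 0.8, ("a2", "b2"): 0.7, ("a2", "b3"): 0.1},
]
psi0, gaps = initial_alignment(["a1", "a2"], ["b1", "b2", "b3"], sims)
```

Note that `a2` prefers `b3` at the first time point alone, but the aggregated score favors `b2`, which is why the alignment is computed on $S(u, v)$ rather than on any single $S_i(u, v)$.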
Phase 2 (select k swapping pairs). Here, we describe our dynamic programming
algorithm that selects a set of k swaps which maximize the alignment score by reducing
the number of connected components in the induced alignment across all time points of G2
(Equation 4-1).
We denote a set of $r$ swaps with $\Delta = \{(u_1, v_1), (u_2, v_2), \ldots, (u_r, v_r)\}$ where $\forall i \ne j$,
$u_i \ne u_j$ and $v_i \ne v_j$. We denote the alignment after applying the swaps in a given set $\Delta$ as
$\psi_\Delta$. Let us denote the optimal set of $r$ swaps for the alignment $\psi$ with $\mathrm{solution}(r, \psi, \mathcal{G}^1, \mathcal{G}^2)$.
Also, for a given $u_i \in \Psi(V^1)$, we denote the optimal set of $r$ swaps for the alignment $\psi$ which
contains the swap pair $(u_i, v_i)$, $\exists v_i \in F$, with $\mathrm{solution}(r, u_i, \psi, \mathcal{G}^1, \mathcal{G}^2)$.
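Our algorithm works iteratively. In the first iteration, our algorithm selects one swapping
pair for each aligned node $u_i \in \Psi(V^1)$ as
$$\mathrm{solution}(1, u_i, \psi, \mathcal{G}^1, \mathcal{G}^2) = \operatorname*{argmax}_{\Delta = \{(u_i, v_i)\},\ v_i \in F} \left\{ \mathrm{score}(\mathcal{G}^1, \mathcal{G}^2 \mid \psi_\Delta) \right\}.$$
At each subsequent iteration $r$ where $2 \le r \le k$, for each aligned node $u_i \in \Psi(V^1)$, our
algorithm selects a set of $r$ swapping pairs denoted with $\mathrm{solution}(r, u_i, \psi, \mathcal{G}^1, \mathcal{G}^2)$ by adding
one swapping pair $(u_i, v_i)$, $\exists v_i \in F$, to the previously selected $r - 1$ pairs as follows.
$$\mathrm{solution}(r, u_i, \psi, \mathcal{G}^1, \mathcal{G}^2) = \operatorname*{argmax}_{\Delta = \{(u_i, v_i)\} \cup \mathrm{solution}(r-1, u_j, \psi, \mathcal{G}^1, \mathcal{G}^2),\ \Theta} \left\{ \mathrm{score}(\mathcal{G}^1, \mathcal{G}^2 \mid \psi_\Delta) \right\}. \qquad \text{(4-2)}$$
Here $\Theta$ represents the necessary conditions to include the $(u_i, v_i)$ swap pair with a set of
$r - 1$ swap pairs as
$$\Theta = (v_i \in F) \ \text{AND}\ (u_j \in \Psi(V^1)) \ \text{AND}\ (\nexists v \in F \text{ such that } (u_i, v) \in \mathrm{solution}(r-1, u_j, \psi, \mathcal{G}^1, \mathcal{G}^2)) \ \text{AND}\ (\nexists u \in \Psi(V^1) \text{ such that } (u, v_i) \in \mathrm{solution}(r-1, u_j, \psi, \mathcal{G}^1, \mathcal{G}^2)).$$
The first condition above ensures that node $u_i$ is swapped with a gap node and the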
second ensures the dynamic programming iterates over all size r − 1 swap sets for all aligned
nodes of G2. The third condition ensures that the aligned node ui has not already been
swapped in the r − 1 sized swap set. The final condition is the dual of the previous one, as it
ensures that the gap node vi has not already been swapped in the r − 1 sized swap set. When
these conditions hold, the two nodes ui and vi can be swapped and included into the existing
set of r − 1 swaps without conflicting with any of the existing swaps.
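The iteration above can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the Tempo implementation: the alignment score is supplied as a callback `score_of` (in practice it would evaluate Equation 4-1 on $\psi_\Delta$), swap sets are represented as frozensets of `(aligned node, gap node)` pairs, and the toy gain table is hypothetical.

```python
def select_swaps(aligned, gaps, k, score_of):
    """Dynamic-programming sketch of Phase 2.

    solution[u] is the best swap set found so far that was built by adding a
    swap of aligned node u; each round extends the sets of the previous round
    by one pair that satisfies the conflict conditions (Theta).
    """
    # Round 1: best single swap for every aligned node.
    solution = {}
    for u in aligned:
        candidates = [frozenset({(u, v)}) for v in gaps]
        if candidates:
            solution[u] = max(candidates, key=score_of)

    for _ in range(2, k + 1):
        extended = {}
        for u in aligned:
            candidates = []
            for prev in solution.values():
                used_aligned = {a for a, _ in prev}
                used_gaps = {g for _, g in prev}
                if u in used_aligned:
                    continue  # u is already swapped in this set
                for v in gaps:
                    if v not in used_gaps:  # gap node still free
                        candidates.append(prev | {(u, v)})
            if candidates:
                extended[u] = max(candidates, key=score_of)
        if not extended:
            break  # no conflict-free extension exists
        solution = extended

    return max(solution.values(), key=score_of) if solution else frozenset()

# Toy additive score: each swap pair contributes a fixed, hypothetical gain.
gain = {("b1", "f1"): 3, ("b1", "f2"): 1, ("b2", "f1"): 2,
        ("b2", "f2"): 2, ("b3", "f1"): 0, ("b3", "f2"): 0}
best = select_swaps(["b1", "b2", "b3"], ["f1", "f2"], k=2,
                    score_of=lambda swaps: sum(gain[p] for p in swaps))
```

Because only one swap set is retained per aligned node in each round, the number of candidate sets stays bounded by the number of aligned nodes, which is what keeps the iteration polynomial.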
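We report the output of the algorithm at the end of the $k$th iteration as the set of $k$ swaps with
the highest alignment score using the equation
$$\Delta = \mathrm{solution}(k, \psi, \mathcal{G}^1, \mathcal{G}^2) = \operatorname*{argmax}_{u_i \in \Psi(V^1),\ \Delta_i = \mathrm{solution}(k, u_i, \psi, \mathcal{G}^1, \mathcal{G}^2)} \left\{ \mathrm{score}(\mathcal{G}^1, \mathcal{G}^2 \mid \psi_{\Delta_i}) \right\}. \qquad \text{(4-3)}$$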
We represent the set cardinalities $|V^1|$, $|V^2|$, and $|F|$ with $m$, $n$, and $l$, respectively. The
complexity of our algorithm is $O(m^2 n^2) + O(mn \log m) + ml \sum_{i=1}^{t} |E^2_i| + O(k^2 l^2 m)$. We provide
the derivation of this complexity in Section 4.4.2. We note that $k \le \mathrm{NCC}(\Psi(V^1) \mid \mathcal{G}^2) - 1$.
This value is either given as input or we set it to $\mathrm{NCC}(\Psi(V^1) \mid \mathcal{G}^2) - 1$.
Proof of correctness. Here, we formally prove the correctness of our algorithm. We say that
swapping the pair of nodes $(u_i, v_i)$ is proper if the swap does not increase the number
of connected components of the aligned nodes. We first prove that our algorithm will always
find a proper swapping node $u_i$ from the set of aligned nodes for each gap node $v_i$. To do so,
we first present a lemma which is necessary for the proof. Let us denote the degree
of a node $v$ (i.e., the number of edges connected to this node) within a component $C_i = (V_c, E_c)$
of the induced subnetwork $\bar{G}^2_i = (\Psi(V^1) \mid G^2_i)$ at time point $i$ with the function $\deg(v \mid C_i)$.
Lemma 1. Given an undirected subnetwork $\bar{G}^2_i = (\Psi(V^1) \mid G^2_i)$ of $G^2_i$, where $|V_c| = z$ and
$\bar{G}^2_i$ is an acyclic network (has no cycle within its topology), then
$\sum_{v \in C_i} \deg(v \mid C_i) = 2(z - 1)$.
Proof. Since $C_i$ is a connected subnetwork with no cycles, the number of edges in $C_i$ equals
$z - 1$. Each edge of an undirected network increases the sum of the node
degrees by two. Thus, $\sum_{v \in C_i} \deg(v \mid C_i) = 2(z - 1)$.
Lemma 2. Given a gap node $v_i$ that connects at least two connected components, there exists
at least one aligned node $u_i$ which we can swap with $v_i$ without increasing the number of
connected components.
Proof. We formally prove this by induction on the size of the connected component that $u_i$
belongs to.
Base case. We consider a component $C_i = (V_c, E_c)$ where $|V_c| = 2$ and $v_i$ is connected
to $C_i$ through $u_j$, and assume $u_i$ belongs to $C_i$. If we swap $v_i$ with $u_i$, then $C_i$ will contain $u_j$
and $v_i$, which form one component. Thus, the number of connected components of
$C_i$ is still one after swapping.
Induction hypothesis. We assume there exists a node $u_i$ for all components of size
$q$ nodes that can be swapped without disconnecting its component. We consider two cases of
one component $C_i$ to which $v_i$ is connected through $u_j$. The first case is when $C_i$ contains
at least one cycle with the set of nodes $V_{c1} = \{v_1, v_2, \ldots, v_n\}$. It follows that each node
$u_i \in V_{c1}$ with $u_i \ne u_j$ can be swapped with $v_i$ without disconnecting $C_i$. In the second
case, $C_i$ is an acyclic network. Next, we prove the lemma in this case by
contradiction. First, we assume that the number of nodes in $C_i$ with degree equal to 1 is less
than 2. Consequently, $\sum_{v \in C_i} \deg(v \mid C_i) \ge 2(z - 1) + 1$, which contradicts Lemma 1. Thus,
the number of nodes in $C_i$ with degree equal to 1 is at least 2, and thus $\exists v, w \in C_i$ such that
$\deg(v \mid C_i) = 1$, $\deg(w \mid C_i) = 1$, and $v \ne w$. Therefore, we can swap $v_i$ with either $v$ or $w$.
Next, we prove that swapping a gap node vi with an aligned node ui at each iteration will
increase the alignment score score(G1,G2|ψ), showing that the alignment score will always
improve by our dynamic programming algorithm.
Theorem 4.1. Given a value of $\delta$ that is greater than or equal to $S(\psi(u_i), u_i)$ for all
$u_i \in V^2$, $\mathrm{score}(\mathcal{G}^1, \mathcal{G}^2 \mid \psi)$ is monotonically non-decreasing over the iterations of our algorithm.
Proof. We assume that our algorithm chooses one pair of nodes to swap: a gap node $v_i$ and
an aligned node $u_i$, which will connect $x$ components. We note that the condition $x \ge 2$
must be satisfied for $v_i$ to be considered for swapping. Also, it follows from Lemma 2 that
if we swap $v_i$ and $u_i$ then the number of connected components will not increase. Thus, the
difference in the score equals $D = \delta(x - 1) - p_{uv}$, where $p_{uv}$ is the difference in pairwise score
from swapping (i.e., $p_{uv} = S(\psi(u_i), u_i) - S(\psi(u_i), v_i)$). Since $\delta$ is greater than or equal to
$S(u, v)$ $\forall u \in V^1$ and $v \in V^2$, then $\delta(x - 1) \ge p_{uv}$. Consequently, $D \ge 0$ and $\mathrm{score}(\mathcal{G}^1, \mathcal{G}^2 \mid \psi)$
will not decrease.
4.4.1 Proof of NP-hardness
Here, we prove that our problem is NP-hard. To do that, we reduce the Maximum
Coverage Problem (MCP), which is known to be NP-hard (26), to our problem. Given a
positive integer $k$ and a collection of sets, $\mathcal{S} = \{S_1, S_2, \ldots, S_m\}$, MCP seeks the subset $\mathcal{S}' \subseteq \mathcal{S}$
such that $|\mathcal{S}'| \le k$ and the number of covered elements $|\bigcup_{S_i \in \mathcal{S}'} S_i|$ is maximized.
We reduce MCP to an instance of our problem. Let $U = \{x_1, x_2, \ldots, x_n\}$ be the union
of elements in $\mathcal{S}$ (i.e., $U = \bigcup_{S_i \in \mathcal{S}} S_i$). We construct a target temporal network $\mathcal{G}^2$ with one
time point $G^2 = (V^2, E^2)$ as follows. We initialize $G^2$ as $V^2 = \emptyset$ and $E^2 = \emptyset$. Next, we add a
node $a_j$ in $G^2$ for each element $x_j \in U$. Also, for each set $S_i \in \mathcal{S}$, we add two nodes $f_i$ and $b_i$
in $V^2$. Formally, $V^2 = \{a_1, a_2, \ldots, a_n\} \cup \{b_1, b_2, \ldots, b_m\} \cup \{f_1, f_2, \ldots, f_m\}$. Next, we populate
the set of edges $E^2$. To do that, for all $S_i \in \mathcal{S}$ and $x_j \in S_i$, we insert the edge $(f_i, a_j)$ in
$E^2$. In addition, for all pairs of sets $S_i, S_j \in \mathcal{S}$, where $i < j$, we insert the edge $(f_i, f_j)$ in $E^2$.
Finally, for a given query network $G^1 = (V^1, E^1)$, we construct the set of nodes in $G^2$ aligned
to those in $G^1$ as $\Psi(V^1) = \{a_1, a_2, \ldots, a_n\} \cup \{b_1, b_2, \ldots, b_m\}$. Thus, the set of gap nodes is
$\{f_1, f_2, \ldots, f_m\}$. Notice that the subnetwork of $G^2$ induced by $\Psi(V^1)$ has $m + n$ nodes but
it contains no edges, as all the edges in $G^2$ are connected to a gap node by our construction.
Thus, the alignment yields n +m connected components as each node in Ψ(V 1) represents a
component.
Recall that each swapping operation swaps an aligned node with a gap node. Also,
recall that the optimization problem we solve for aligning temporal networks aims to find at
most k swaps, such that after applying those swaps, the number of connected components
NCC(Ψ(V 1) | G2) is minimized (Section 4.3). We call this optimization problem minimum
Connected Component Problem (mCCP) in the rest of this proof. Next, we prove that MCP is
maximized if and only if mCCP is minimized.
First, we prove that if there exists a solution to mCCP, then there exists a solution to
MCP. In other words, we prove that minimizing mCCP maximizes MCP. Let us denote the
nodes corresponding to the elements in a set $S_i$ with $A_i = \bigcup_{x_j \in S_i} \{a_j\}$. In our problem
instance, a swap operation swaps $f_i$ with a node in the set $V^2 - A_i - \{f_i\}$. This is because
all nodes in $A_i$ are connected to $f_i$, and thus swapping $f_i$ with a node not in $A_i$ ensures that
all nodes in $A_i \cup \{f_i\}$ form one connected component. Therefore, to minimize the number
of connected components, we swap $f_i$ with one of the nodes which is not a part of this
connected component. To ensure that, we swap $f_i$ with a node in the set $\{b_1, b_2, \ldots, b_m\}$.
Since all nodes in this set are disconnected, swapping $f_i$ with any node in this set will yield the
same number of connected components. Let us assume that the solution to mCCP performs
$k$ swaps. Following from the discussion above, without loss of generality, we assume that
these swaps are $\{(f_1, b_1), (f_2, b_2), \ldots, (f_k, b_k)\}$. Notice that after these swaps, the nodes in
$(\bigcup_{i=1}^{k} A_i) \cup \{f_1, f_2, \ldots, f_k\}$ form one connected component, and all remaining nodes are
isolated. Let us denote the number of connected components after these swaps with $\beta$. Let
us denote the number of nodes in $\bigcup_{i=1}^{k} A_i$ with $\tau$. Notice that $\tau$ also reflects the number of
elements covered in $\bigcup_{i=1}^{k} S_i$. We have $\beta = (m - k) + (n - \tau) + 1$.
In the formulation above, the first term $(m - k)$ is the number of nodes $b_j$ which are
not swapped with a gap node. Since all those nodes are isolated, each one forms a connected
component by itself. The second term $(n - \tau)$ is the number of nodes $a_j$ which are not included
in the set $\bigcup_{i=1}^{k} A_i$. These nodes remain isolated even after swapping of nodes. The last term
(i.e., 1) is the connected component containing the nodes in $(\bigcup_{i=1}^{k} A_i) \cup \{f_1, f_2, \ldots, f_k\}$. After
minor algebraic manipulation, we rewrite the equation above as $\beta = (m + n - k + 1) - \tau$. In
this equation, the parameters $m$, $n$, and $k$ are input to the given mCCP problem, and thus we
denote the first term above with the constant $c = m + n - k + 1$. Therefore, we have $\beta = c - \tau$.
In this equality, the smaller the value of $\beta$ is, the larger $\tau$ gets. Thus, minimizing the number of
connected components $\beta$ in mCCP maximizes the number of elements covered in MCP.
Second, we prove that if there exists a solution to MCP, then there exists a solution to
mCCP. In other words, we prove that maximizing MCP minimizes mCCP. Let us assume that
the solution to MCP is $\mathcal{S}' = \{S_1, S_2, \ldots, S_k\}$. The number of elements covered by this solution
is $\tau = |\bigcup_{S_i \in \mathcal{S}'} S_i|$. By constructing an instance of mCCP as described above, we have $k$ swaps
denoted with the set $\{(f_1, b_1), (f_2, b_2), \ldots, (f_k, b_k)\}$. Consequently, after performing these
swaps, the nodes in $(\bigcup_{i=1}^{k} A_i) \cup \{f_1, f_2, \ldots, f_k\}$ form one connected component, and all the
remaining nodes are isolated. Let us denote the number of connected components with $\beta$. We
have $\beta = (m - k) + (n - \tau) + 1$.
After minor algebraic manipulation, we rewrite the equation above as $\tau = (m + n - k + 1) - \beta$.
Since $m$, $n$, and $k$ are input parameters, we have $\tau = c - \beta$, where $c$ is a constant
($c = m + n - k + 1$). In this equality, the larger the value of $\tau$ is, the smaller $\beta$ gets. Thus,
maximizing $\tau$ in MCP results in minimizing $\beta$ in mCCP.
Lastly, the proof we describe above reduces an instance of MCP to an instance of mCCP
in polynomial time and space, as it requires only building a network with $O(n + m)$ nodes and
$O(m^2 + mn)$ edges. Thus, we conclude that the mCCP problem is NP-hard.
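The construction used in the reduction can be checked numerically with a short Python sketch. The node labels (tuples such as `("f", i)`), the function names, and the example set collection are hypothetical choices made for this illustration.

```python
from itertools import combinations

def count_components(nodes, edges):
    """Connected components of the subnetwork induced by `nodes` (union-find)."""
    parent = {x: x for x in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        if a in parent and b in parent:
            parent[find(a)] = find(b)
    return len({find(x) for x in parent})

def build_reduction(sets, universe):
    """Target network of the reduction: nodes a_x, b_i, f_i; gap nodes are the f_i."""
    m = len(sets)
    edges = [(("f", i), ("a", x)) for i, s in enumerate(sets) for x in s]
    edges += [(("f", i), ("f", j)) for i, j in combinations(range(m), 2)]
    aligned = {("a", x) for x in universe} | {("b", i) for i in range(m)}
    return edges, aligned

def components_after_swaps(sets, universe, chosen):
    """Swap f_i with b_i for every chosen set index, then count components."""
    edges, aligned = build_reduction(sets, universe)
    for i in chosen:
        aligned.remove(("b", i))
        aligned.add(("f", i))
    return count_components(aligned, edges)

# Hypothetical MCP instance with m = 3 sets over n = 5 elements, k = 2.
sets = [{1, 2}, {2, 3}, {4, 5}]
universe = {1, 2, 3, 4, 5}
beta_a = components_after_swaps(sets, universe, [0, 1])  # covers tau = 3 elements
beta_b = components_after_swaps(sets, universe, [0, 2])  # covers tau = 4 elements
```

With $m = 3$, $n = 5$ and $k = 2$, the relation $\beta = (m - k) + (n - \tau) + 1$ gives 4 components for the first choice and 3 for the second, so the choice that covers more elements leaves fewer components.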
4.4.2 Complexity Analysis
Here we analyze the complexity of our method. Recall that we represent |V 1|, |V 2|, and
|F | with m, n, l respectively. We refer to Section 2 as we discuss the phases of our method.
For each phase, we explain its complexity. We then combine the complexities of all phases to
obtain the overall complexity of our method. These phases are:
(1) Phase 1 (construct initial alignment). In this phase, we calculate the
similarity score between node pairs of the two input networks based on their homology and
their topology. First, to calculate the topology matrix $A_i$, we need to trace the neighbors of all
node pairs, which is performed in $O(m^2 n^2)$. Thus, the complexity of calculating the topology
score for all time points is $O(m^2 n^2 t)$. We then integrate the homology and topology scores
by multiplying the topology matrix and the homology vector in $O(m^2 n^2)$. The algorithm repeats the
previous step, say $c$ times, until convergence ($O(m^2 n^2 c)$). We select the initial alignment
using the weighted bipartite matching algorithm in $O(mn \log m)$. Thus, in this scenario, the
complexity of this phase becomes $O(m^2 n^2) + O(mn \log m)$.
(2) Phase 2 (select k swapping pairs). This phase is performed in two steps. The
first step performs the initialization process of the dynamic programming algorithm, in which
we calculate the profit of swapping a gap node $f_l$ with an aligned node $v_j$. In order to do this,
we calculate the number of components that $f_l$ can connect if swapped with $v_j$ using depth-first
search through all time points in $ml \sum_{i=1}^{t} |E^2_i|$. The second step performs the iterative
process of selecting $k$ swapping pairs, where the maximum number of iterations is $(k - 1)$.
The process combines a gap node $f_l$ (i.e., $1 \le l \le |F|$) with a set of swapping pairs from
the previous iteration, where the maximum number of sets is $l$. To resolve conflicting
nodes, each combination may trace the profits of all gap nodes in the current combination.
This process is performed in $O(km)$. Thus, the complexity of the second step of phase 2 is
$O((k - 1) l^2 k m) = O(k^2 l^2 m)$. Hence, the complexity of phase 2 is $ml \sum_{i=1}^{t} |E^2_i| + O(k^2 l^2 m)$.
In summary, the complexity of our method considering all the previous phases is
$O(m^2 n^2) + O(mn \log m) + ml \sum_{i=1}^{t} |E^2_i| + O(k^2 l^2 m)$.
4.4.3 Adopting Pairwise Alignment Methods to Generate Similarity Scores for Temporal Networks
In this section, we describe how we adopt pairwise alignment methods to generate
similarity scores in temporal networks that are needed to calculate an initial alignment. For
that purpose, we consider adopting IsoRank. We note that our method does not depend on
this particular choice. Recall that IsoRank performs pairwise network alignment. Thus, our
modifications of IsoRank are meant to adapt it to temporal networks. First, we calculate the
homology score between all pairs of nodes $(u, v)$ where $u \in V^1$ and $v \in V^2$ as the similarity
score of their sequences using BLAST (98). We denote the homology score between $u$ and $v$
as $H[u, v]$. Next, we calculate the topological similarity matrix at the $i$th time point, denoted
as $A_i$, as follows. First, we initialize $A_i$ to be the zero matrix. Next, for $u, w \in V^1$ and
$v, z \in V^2$ we let
$$A_i[(u, v), (w, z)] = \frac{1}{|N(w \mid G^1_i)|\,|N(z \mid G^2_i)|} \quad \text{if } w \in N(u \mid G^1_i),\ z \in N(v \mid G^2_i),$$
where $N(v \mid G)$ denotes the neighbours of $v$ in network $G$. Conceptually, $A_i[(u, v), (w, z)]$ models the
topological support that the node pair $(u, v)$ gives to the alignment of their neighbouring pair
$(w, z)$ at the $i$th time point. We integrate the homology and the topology scores for $G^1_i$ and
$G^2_i$ at the $i$th time point iteratively using a mixing parameter $\alpha$. We initialize $H^0_i = H$. We
then update the similarity between node pairs at iteration $r$ as
$$H^r_i = \alpha A_i H^{r-1}_i + (1 - \alpha) H^0_i.$$
We stop this iterative process when $H^r_i = H^{r-1}_i$.
We note that in subsequent iterations of the above formulation, the homological similarity
of each node pair (w, z) propagates to its neighboring pairs (u, v) through a function governed
by the topology matrix and the mixing parameter α. We explain three issues arising from
these iterations. First, as the number of neighbors of w and z increases, the similarity
propagating to each neighbor pair decreases because the number of ways to align nodes w and
z without altering the topological similarity grows with increasing number of their neighbors.
Secondly, as the value of α decreases, the contribution of the homological similarity to the final
similarity value between each node pair grows and the contribution of the topological similarity
decreases. In the extreme case when α = 0, the topological similarity has no contribution.
Lastly, the iterations above are guaranteed to converge since Ai is a column stochastic matrix
(i.e., the values at each column add up to one). We denote the converged vector at the ith
time point with Si and call it a score vector. Each entry Si[u, v] in this vector shows the
similarity (homology and topology combined) between nodes u and v.
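The iterative update for a single time point can be sketched in plain Python as follows. The adjacency-dictionary representation, the function names, and the numerical tolerance used in place of the exact stopping test $H^r_i = H^{r-1}_i$ are assumptions made for this illustration; the homology values in the example are hypothetical.

```python
def topology_matrix(adj1, adj2):
    """Sparse A: A[(u, v)][(w, z)] = 1 / (|N(w)| |N(z)|) if w in N(u), z in N(v)."""
    A = {}
    for u in adj1:
        for v in adj2:
            A[(u, v)] = {(w, z): 1.0 / (len(adj1[w]) * len(adj2[z]))
                         for w in adj1[u] for z in adj2[v]}
    return A

def similarity_scores(adj1, adj2, H0, alpha=0.7, tol=1e-9, max_iter=1000):
    """Iterate H^r = alpha * A * H^(r-1) + (1 - alpha) * H^0 until it stabilizes."""
    A = topology_matrix(adj1, adj2)
    H = dict(H0)
    for _ in range(max_iter):
        H_next = {p: alpha * sum(a * H[q] for q, a in A[p].items())
                     + (1 - alpha) * H0[p]
                  for p in A}
        if max(abs(H_next[p] - H[p]) for p in A) < tol:
            return H_next
        H = H_next
    return H

# Hypothetical query (an edge) and target (a path of three nodes).
adj1 = {"a1": ["a2"], "a2": ["a1"]}
adj2 = {"b1": ["b2"], "b2": ["b1", "b3"], "b3": ["b2"]}
H0 = {(u, v): 0.1 for u in adj1 for v in adj2}
H0[("a1", "b1")] = 1.0  # one strongly homologous pair
S = similarity_scores(adj1, adj2, H0)
```

Because every column of $A_i$ sums to one in this example, the total similarity mass is preserved across iterations, and with $\alpha = 0$ the result reduces to the homology scores alone.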
4.5 Results and Discussion
We evaluate the performance of our algorithm on synthetic and real data. Next, we
describe both datasets in detail.
Real Dataset. We obtain our real dataset from two sources. The first one is the human brain
aging dataset (75). Recall that this dataset contains gene expressions of 173 samples obtained
from 55 individuals spanning 37 ages from 20 to 99 years. The ages in this dataset are not
uniformly spaced. To make the gaps between consecutive time points more uniform, we remove
two data points which have an age gap of more than 5 years from their successive age values,
leading to 35 ages. We select five temporal networks each having seven time points. Next, we
explain how we do that for the first temporal network. We start with the first (i.e., youngest)
time point in the aging data. We then skip the next four time points and take the sixth time
point in aging data iteratively until we have seven time points. Similarly, for 1 < j ≤ 5,
we select the jth temporal network starting from the jth time point. In this manner, we
form five non-overlapping and interleaved temporal networks. In order to integrate static
PPI network with gene expression data to form age-specific PPI networks, we set a cut-off
on the gene-expression value. All the interactions that have a lower transcription value for
either or both the proteins are removed from the corresponding age-specific network. We
use the protein-protein interaction (PPI) network data from BioGRID (99). For the second
source, we select phenotype-specific query temporal networks from this dataset. We use
two neurodegenerative disorders which are conjectured to be age-related (Alzheimer's and
Huntington's) and a third one which we expect to be less prone to aging (Type II diabetes).
We retrieve the gene sets specific to these three diseases from the KEGG database (100). We form
three query PPI temporal networks by keeping only the interactions where both interactors
are from the corresponding phenotype-specific (Alzheimer's, Huntington's or Type II diabetes)
gene set.
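The interleaved selection of time points and the expression cut-off described above can be sketched as follows. The function names, the 0-based indexing, and the treatment of proteins without an expression value (taken as 0.0) are assumptions made for illustration.

```python
def interleaved_series(num_points=35, num_series=5, length=7):
    """Series j takes time points j, j + num_series, j + 2 * num_series, ... (0-based)."""
    if num_series * length > num_points:
        raise ValueError("not enough time points for the requested series")
    return [[j + step * num_series for step in range(length)]
            for j in range(num_series)]

def age_specific_edges(ppi_edges, expression, cutoff):
    """Keep an interaction only if both proteins are expressed at or above cutoff."""
    return [(p, q) for p, q in ppi_edges
            if expression.get(p, 0.0) >= cutoff and expression.get(q, 0.0) >= cutoff]

series = interleaved_series()
# series[0] is the first temporal network, series[1] starts at the second age, etc.
edges = age_specific_edges([("p", "q"), ("q", "r")],
                           {"p": 2.0, "q": 3.0, "r": 0.5}, cutoff=1.0)
```

The five series are disjoint and together use all 35 retained ages, which matches the description of non-overlapping, interleaved temporal networks.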
Synthetic dataset. We generate synthetic networks to observe the performance of our method
under a wide spectrum of parameters classified under two categories: (i) network size and
(ii) temporal model parameters, namely number of time points, temporal rate, and cold rate.
We vary the target network size to take values from {100, 250, 500, 750, 1000}. We fix the
network density to two edges per node on average (i.e., mean node degree is set to four).
We randomly select $G^1_1$ as a connected subnetwork of $G^2_1$. We set the size of the query network
to 50 nodes. We generate the target network $G^2_1$ using the Barabási-Albert (BA) (48) model as this
model produces scale-free networks. In order to explain the parameters in the second category,
we describe how we generate the query and target networks $G^1_1$ and $G^2_1$ at the first time point.
We then explain how we use the parameters in this category to build the query and target
networks at the remaining time points.
We generate the subsequent networks for the remaining time points using the three
parameters in the second category above as follows. The first parameter is the number of
time points $t$ in $\mathcal{G}^1$ and $\mathcal{G}^2$. We use 5, 10, 15, and 20 time points in our experiments. Recall
that we select a subnetwork of the target network $G^2_1$ as the first query network $G^1_1$. We
mark all nodes and edges in $G^2_1$ within this subnetwork as cold nodes and edges respectively.
We mark all other nodes and edges in $G^2_1$ as hot. Next, we iteratively generate the networks
$G^1_i$ and $G^2_i$ at the $i$th time point ($i > 1$) from $G^1_{i-1}$ and $G^2_{i-1}$ respectively as follows. Let
us denote the temporal and cold rates (two real numbers) with $\epsilon$ and $\epsilon_c$ respectively such that
$0 \le \epsilon_c \le \epsilon \le 1$. Let us denote the ratio of cold edges to the total number of edges in the
target network $G^2_1$ with $\gamma$. We calculate the hot rate, denoted with $\epsilon_h$, from the temporal rate
and cold rate as $\epsilon_h = (\epsilon - \epsilon_c \gamma)/(1 - \gamma)$. Conceptually, hot and cold rates model the rate
of evolution of hot and cold edges between two consecutive time points respectively. More
specifically, for each subsequent time point $i$, we generate $G^2_i$ by randomizing $G^2_{i-1}$ as follows.
We iterate over all edges in $G^2_{i-1}$. For each edge $e$, if it is a cold edge we remove it with
probability ϵc and insert a new edge between two randomly chosen cold nodes. If e is a hot
edge, we remove it with probability ϵh and insert a new hot edge between two random nodes
(with at least one being a hot node). We generate query networks at subsequent time points
using almost the same procedure with the only difference being that all edges are cold. We
generate datasets by varying ϵ and ϵc to take the values {0.05, 0.1, 0.2, 0.4, 0.8} and {0.05,
0.1, 0.2} respectively. For each parameter setting we generate 10 target and query temporal
networks.
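The hot rate and one evolution step of the target network can be sketched in Python as follows. This is an illustrative approximation, not the generator used in the experiments: it treats an edge as cold when both of its endpoints are cold, it does not prevent duplicate edges, and all names are hypothetical.

```python
import random

def hot_rate(eps, eps_c, gamma):
    """eps_h = (eps - eps_c * gamma) / (1 - gamma), gamma = fraction of cold edges."""
    return (eps - eps_c * gamma) / (1.0 - gamma)

def evolve(edges, cold_nodes, hot_nodes, eps_c, eps_h, rng):
    """One time step: a removed edge is replaced by a random edge of the same kind."""
    cold = sorted(cold_nodes)
    hot = sorted(hot_nodes)
    everyone = cold + hot
    cold_set = set(cold)
    evolved = []
    for a, b in edges:
        is_cold = a in cold_set and b in cold_set  # assumption: induced cold region
        if rng.random() >= (eps_c if is_cold else eps_h):
            evolved.append((a, b))                       # edge survives
        elif is_cold:
            evolved.append(tuple(rng.sample(cold, 2)))   # new cold edge
        else:
            h = rng.choice(hot)                          # at least one hot endpoint
            other = rng.choice([x for x in everyone if x != h])
            evolved.append((h, other))
    return evolved

# Worked example: eps = 0.2, eps_c = 0.05 and half of the edges cold.
eps_h = hot_rate(0.2, 0.05, 0.5)  # (0.2 - 0.025) / 0.5, approximately 0.35
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
next_edges = evolve(edges, cold_nodes={0, 1, 2}, hot_nodes={3, 4},
                    eps_c=0.05, eps_h=eps_h, rng=random.Random(0))
```

The hot rate is chosen so that the edge-weighted average of the cold and hot rates equals the temporal rate, i.e. $\epsilon_c \gamma + \epsilon_h (1 - \gamma) = \epsilon$.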
Recall that we generate the scoring matrix based on both homology and topology
similarities. We generate the homology score between a pair of nodes $u \in V^1$ and
$v \in V^2$ as follows. If $v$ was originally selected as a cold node and $u$ is the same as $v$, then we
generate a homology score between $u$ and $v$ from a log-normal distribution (101) with mean
$2\mu$ and standard deviation $\sigma$. Otherwise, we randomly generate the homology score between
$u$ and $v$ from a log-normal distribution with mean $\mu$ and standard deviation $\sigma$. In this way, we
make nodes in the query network likely to align to the nodes in the target network from which
they were originally extracted. In this paper, we set $\mu$ and $\sigma$ to be 2 and 0.25 respectively. Notice
that the homology scores do not change through time points, although topology scores do.
Thus, evolution through time points of query and target networks may affect how the query
is aligned to the cold region in the target network. We set the edge insertion penalty $\delta$ to be
$\max_{u \in V^1, v \in V^2} S(u, v)$.
We compare the accuracy and running time of our algorithm against IsoRank, MAGNA++
and GHOST. IsoRank, MAGNA++ and GHOST are designed to align two networks at a
single time point. We therefore find the alignment using each of these methods at each time
point, impose the alignment to all the time points and report the average. We analyze the
biological significance of our results on real data by performing gene ontology analysis and
exploring publication evidence. We implemented Tempo in C++, performed all experiments on
a computer equipped with AMD FX(tm)-8320 Eight-core Processor 1.4 GHz CPU, 32 GB of
RAM running Linux operating system, and used α = 0.7 unless otherwise stated.
4.5.1 Evaluation of Recovered Region
In this experiment, we compare the accuracy of the alignment generated by Tempo
against that of IsoRank, MAGNA++, and GHOST. We recall that we select the original query
network from a subset of nodes and their edges from the target network, and then evolve the
query through time points. Here, we evaluate the accuracy by calculating the percentage of the
aligned nodes from the query network that are paired with the same nodes of the target network
that they were originally selected from. We refer to this percentage as recovered region. We
illustrate the results in Figure 4-3, which demonstrate that Tempo recovers a higher percentage of
the query networks than the other methods. As the temporal rate increases, the accuracy
of Tempo improves dramatically, while that of IsoRank remains nearly stagnant and
MAGNA++ and GHOST continue to generate alignments with low recovery rates. Growing
the temporal rate while keeping the cold rate unchanged means that the topology of the query
network (i.e., cold edges) is evolving slower than the rest of the temporal network (i.e., hot
edges). This implies that Tempo can capture the variation in such evolutionary rates, whereas
competing alignment strategies fail to do so.
[Figure 4-3 plot omitted; legend: Tempo, IsoRank, MAGNA, GHOST.]
Figure 4-3. The percentage of recovered query in the resulting alignment when varying $\epsilon$ and $\epsilon_c$
over the values {0.05, 0.1, 0.2, 0.4, 0.8} and {0.05, 0.1, 0.2} respectively. The x-axis shows the
temporal rate $\epsilon$ and the cold rate $\epsilon_c$ (the parameters used for constructing synthetic temporal
networks with varying evolution rates). The y-axis shows the percentage of recovered query of
IsoRank, MAGNA++, and GHOST against Tempo. The error bars show the 80-percentile of the
recovered query based on the 10 repetitions of each parameter setting.
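The recovered-region measure defined in this subsection can be computed with a few lines of Python. The dictionary `origin`, which records the target node each query node was extracted from, and the function name are hypothetical.

```python
def recovered_region(psi, origin):
    """Percentage of query nodes aligned to the target node they were extracted from."""
    if not psi:
        return 0.0
    hits = sum(1 for u, v in psi.items() if origin.get(u) == v)
    return 100.0 * hits / len(psi)

# Hypothetical example: three of four query nodes are mapped back to their origin.
origin = {"q1": "t1", "q2": "t2", "q3": "t3", "q4": "t4"}
psi = {"q1": "t1", "q2": "t2", "q3": "t3", "q4": "t9"}
recovered = recovered_region(psi, origin)
```

In this example the alignment recovers 75% of the query.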
4.5.2 Evaluation of Induced Conserved Structure
Next, we evaluate the topological quality of the alignment generated by Tempo through
comparison with IsoRank, MAGNA++, and GHOST. For this purpose, we measure the
shared topological structure between $G^1_i$ and $G^2_i$ which is preserved under the alignment
function ψ through all time points i. Induced conserved structure (ICS) measures the
percentage of edges from $G^1_i$ that are aligned to edges in $G^2_i$ to the total edges of the
induced subnetwork $\Psi(V^1 \mid G^2_i)$, and is one of the most common measures of topological
quality (73).
Figure 4-4. The induced conserved structure (ICS) score of the resulting alignment varying ϵ and ϵc to take the values {0.05, 0.1, 0.2, 0.4, 0.8} and {0.05, 0.1, 0.2} respectively. The x-axis shows temporal rate, ϵ and cold rate, ϵc. The y-axis shows the ICS score of GHOST, MAGNA++, and IsoRank against our method (Tempo).
Formally, $\mathrm{ICS}(\mathcal{G}^1, \mathcal{G}^2, \psi) = \sum_{i=1}^{t} \frac{|E^1_i \cap E^2_i[\Psi(V^1 \mid G^2_i)]|}{|E^2_i[\Psi(V^1 \mid G^2_i)]|}$.
Figure 4-4 presents the results, which demonstrate that Tempo generates alignments of higher
quality based on ICS than the other algorithms. We note that GHOST was designed to optimize
ICS; nevertheless, Tempo outperforms GHOST on this measure, especially when the temporal
rate is high, since the performance of GHOST degrades.
4.5.3 Evaluation of Edge Correctness
In this experiment, we evaluate the topological quality of the alignment generated by our
method against IsoRank, MAGNA++, and GHOST. For this purpose, we measure the shared
topological structure between $G^1_i$ and $G^2_i$ which is preserved under the alignment function
ψ through all time points i. Edge correctness (EC) is one of the most common measures of
topological quality (73; 74). Its computation is similar to that of ICS. It measures the
percentage of edges from $G^1_i$ that are aligned to edges in $G^2_i$ relative to the total number
of edges of the smaller network. More specifically,
$\mathrm{EC}(\mathcal{G}^1, \mathcal{G}^2, \psi) = \sum_{i=1}^{t} \frac{|E^1_i \cap E^2_i[\Psi(V^1 \mid G^2_i)]|}{|E^1_i|}$.
Figure 4-5 presents the results, which demonstrate that our algorithm generates alignments of
higher quality based on EC than the other algorithms.
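To make the difference between the two denominators concrete, the following sketch computes ICS and EC for a fixed node mapping across time points. It is a minimal illustration under stated assumptions: edges are treated as undirected node pairs, the alignment ψ is a plain dictionary, and the function names are hypothetical. Following the formulas as written, the per-time-point ratios are summed; dividing the result by the number of time points would give an average instead.

```python
def _conserved(e1, e2, psi):
    """Number of edges of G^1_i whose image under psi is an edge of G^2_i."""
    mapped = {frozenset((psi[u], psi[v])) for u, v in e1 if u in psi and v in psi}
    return len(mapped & {frozenset(e) for e in e2})


def _induced(e2, psi):
    """Edges of G^2_i with both endpoints in the image of psi."""
    image = set(psi.values())
    return {frozenset(e) for e in e2 if set(e) <= image}


def ics(E1, E2, psi):
    """Sum over time points of conserved edges / induced target edges."""
    total = 0.0
    for e1, e2 in zip(E1, E2):
        denom = len(_induced(e2, psi))
        if denom:  # skip time points with an empty induced subnetwork
            total += _conserved(e1, e2, psi) / denom
    return total


def ec(E1, E2, psi):
    """Sum over time points of conserved edges / edges of the smaller network."""
    return sum(_conserved(e1, e2, psi) / len(e1) for e1, e2 in zip(E1, E2) if e1)


# Hypothetical two-time-point example.
psi = {"a": "x", "b": "y", "c": "z"}
E1 = [[("a", "b"), ("b", "c")], [("a", "b"), ("b", "c"), ("a", "c")]]
E2 = [[("x", "y"), ("x", "z"), ("y", "w")], [("x", "y"), ("x", "z")]]
print(ics(E1, E2, psi), ec(E1, E2, psi))
```

ICS penalizes alignments that land in dense regions of the target (many induced edges, few of them conserved), whereas EC only looks at how many query edges survive.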
4.5.4 Evaluation of Statistical Significance of The Alignment
We compare the statistical significance of the alignments generated by Tempo against
that of existing methods. In order to ensure that our experiments do not give any advantage to
our algorithm, we use IsoRank to generate initial alignments for Tempo and thus, compare the
statistical significance against IsoRank only.
Figure 4-5. The Edge correctness (EC) score of the resulting alignment varying ϵ and ϵc to take the values {0.05, 0.1, 0.2, 0.4, 0.8} and {0.05, 0.1, 0.2} respectively. The x-axis shows temporal rate, ϵ and cold rate, ϵc. The y-axis shows the EC score of GHOST, MAGNA++, and IsoRank against our method (Tempo).
Figure 4-6. The average z-score of Tempo across network sizes {100, 250, 500, 750, 1000} varying ϵ and ϵc to take the values {0.05, 0.1, 0.2, 0.4, 0.8} and {0.05, 0.1, 0.2} respectively. The x-axis shows temporal rate, ϵ and cold rate, ϵc. The y-axis shows the z-score of IsoRank (white) against Tempo (black).
Varying evolution rate. In this experiment, we evaluate the effect of varying the temporal
rate (ϵ) and cold rate (ϵc) on the significance of the score of the alignments produced by
Tempo and that of IsoRank. We generate synthetic networks of sizes {100, 250, 500, 750,
1000} with 20 time points. We fix the network density to two edges per node on average,
and vary ϵ and ϵc (ϵc ≤ ϵ) over the values {0.05, 0.1, 0.2, 0.4, 0.8} and {0.05, 0.1, 0.2},
respectively. Next, we randomly select 50 nodes from the target network 1,000 times, and
calculate the alignment score of each selection, i.e., each random selection corresponds to an
alignment. We calculate the mean and standard deviation of these 1,000 scores and generate the
z-score of the alignment generated by Tempo using this mean and standard deviation. We
denote the score generated by our method by S∗, and the mean and standard deviation of the
1,000 scores generated from the random selections by Sµ and σ, respectively.
We calculate the z-score of our method as (S∗ − Sµ)/σ. We calculate the z-score of the
IsoRank method in a similar manner. Figure 4-6 presents the average z-score values across all
target network sizes. The results show that as we increase the temporal rate, the z-score of
Tempo increases significantly while the z-score of IsoRank increases by only a small amount. As
the evolution rate increases, the topology of the alignment found by Tempo differs significantly
from the topology of the rest of the network, and thus it becomes more challenging to find the
correct alignment. However, Tempo continues to generate accurate and significant results,
especially for large evolution rates, unlike IsoRank, which considers each time point
independently. We observe the same pattern as we increase the cold rate.
Figure 4-7. The average z-score of Tempo (black) against IsoRank (white) (A) varying target time points, where the x-axis shows the time points selected, and (B) varying network size, where the x-axis shows network sizes in terms of number of nodes.
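The z-score computation described above can be sketched as follows. This is an illustrative sketch, not the code used in the experiments: the scoring function `score_fn` stands in for the alignment score, the helper names are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator was used.

```python
import random
import statistics


def z_score(observed_score, random_scores):
    """(S* - S_mu) / sigma for an observed score against a random baseline."""
    mu = statistics.mean(random_scores)
    sigma = statistics.stdev(random_scores)  # assumption: sample standard deviation
    if sigma == 0:
        raise ValueError("baseline scores have zero variance")
    return (observed_score - mu) / sigma


def random_baseline(target_nodes, score_fn, k=50, trials=1000, seed=0):
    """Scores of `trials` random selections of k target nodes.

    score_fn is a placeholder for the alignment score of a node selection.
    """
    rng = random.Random(seed)
    nodes = list(target_nodes)
    return [score_fn(rng.sample(nodes, k)) for _ in range(trials)]


# Hypothetical usage with a toy scoring function (sum of node ids).
baseline = random_baseline(range(1000), score_fn=sum, k=50, trials=1000)
print(z_score(40000, baseline))
```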
Varying time points. In this experiment, we evaluate how the z-scores of Tempo and IsoRank
differ as the input networks evolve and deviate from each other. More specifically, we align
the query network with each of the four target sets, whose time points correspond to
progressively later stages (i.e., older ages) as we move to later target sets. First, we measure
the z-score of aligning the query to the first target set (i.e., containing time points 2, 7, 12,
. . . ), then we measure the z-score of aligning the query to the second target set (i.e.,
containing time points 3, 8, 13, . . . ), and so on. We present the average z-score across all
temporal and cold rates. Figure 4-7A presents the results, which show that Tempo continues to
generate alignments with high score significance as we evolve the network. We observe the same
pattern for IsoRank; however, Tempo outperforms IsoRank, especially when the time points are
distant. This confirms that as the target and query networks evolve and deviate from
each other, Tempo is able to take into account the evolution through consecutive time points
and generate accurate alignments that persist.
Varying network size. In this experiment, we compare the significance of the alignment
generated by Tempo against IsoRank as the target network size increases and the query
becomes small relative to the target. We average the z-score across all evolution rates
and vary the target network size over the values {100, 250, 500, 750, 1000}. Figure 4-7B
presents the results, which show that the significance of the best alignment increases as we
increase the size of the underlying target network. We expect this behavior since we compare
the aligned nodes (50 nodes) to a random selection of 50 nodes from the underlying target
network; as the target grows, the chance that a random selection matches the best alignment
decreases. That said, Tempo was able to identify the accurate alignment, which results in
highly significant values.
4.5.5 Evaluation of Running Time
In this experiment, we evaluate the running time of our algorithm on the synthetic dataset
for varying network sizes and numbers of time points (t). We report the average running time
over all values of ϵ and ϵc, with each parameter combination tested 10 times. We also report
the running time of IsoRank, MAGNA++, and GHOST for aligning two networks at a single
time point. Figure 4-8 presents the results, which demonstrate that Tempo successfully
scales to large target networks. The running times of both Tempo and IsoRank grow linearly
with increasing target network size and number of time points (t). We notice that
MAGNA++ behaves similarly to IsoRank, while GHOST has an exponential running time.
The running time of Tempo is higher than that of IsoRank, which is unsurprising since Tempo
computes the alignment across multiple time points. That said, Tempo has a practical running
time even for large networks with many time points. More importantly, unlike IsoRank, Tempo
considers the network topology at all time points while aligning networks. As the other
experiments in this section show, a natural consequence of the extra effort our method puts
into considering all time points is that the alignment it finds is significantly more accurate
than that of IsoRank, which considers only one time point at a time.
Figure 4-8. The total running time of IsoRank and Tempo for synthetic networks varying target network size from {100, 250, 500, 750, 1000}, and varying t from 5 to 20. The x-axis shows the input network sizes. The y-axis shows the total running time in seconds. Panels: IsoRank, MAGNA++, Tempo, GHOST; series: t = 20, t = 15, t = 10, t = 5.
4.5.6 Evaluation of Recovered Genes in Real Dataset
In this experiment, we evaluate the query region recovered from the gene aging dataset by
our algorithm, Tempo, against MAGNA++ and GHOST. Recall that we discussed the values
of IsoRank in the main paper since it reports high recovered rates. The recovered region
is the percentage of genes in the query network that were mapped to themselves in the
target network despite their evolving topologies. Tables 4-1, 4-2, and 4-3 present the results
for Alzheimer’s, Huntington’s, and Type II diabetes, respectively. The results show that our
algorithm significantly outperforms both MAGNA++ and GHOST by aligning similar genes
despite their evolving topologies. In contrast, MAGNA++ and GHOST align only a small
portion of the query genes to themselves. This suggests that our algorithm successfully
captures the evolving topologies of the genes through time points, while the other
algorithms fail to do so since they align each time point independently.
Target time points   Tempo   MAGNA++   GHOST
First 7              94.87   2.56      0
Second 7             97.43   5.13      0.36
Third 7              97.43   2.56      0
Fourth 7             97.43   2.56      0
Table 4-1. Percentage of recovered query genes from gene aging dataset when using Alzheimer’s phenotype as query.
Target time points   Tempo   MAGNA++   GHOST
First 7              90.9    0.36      0
Second 7             86.36   0         0
Third 7              95.45   0.73      0
Fourth 7             95.45   0.73      0
Table 4-2. Percentage of recovered query genes from gene aging dataset when using Huntington’s phenotype as query.
Target time points   Tempo   MAGNA++   GHOST
First 7              97.22   2.56      0
Second 7             97.22   2.56      0
Third 7              97.22   5.12      0
Fourth 7             97.22   2.56      0
Table 4-3. Percentage of recovered query genes from gene aging dataset when using Type II diabetes phenotype as query.
4.5.7 Evaluation on Real Data
Next, we evaluate Tempo on real data. We first evaluate the significance of the alignment
score obtained using Tempo. We calculate the z-score by comparing the score of the aligned
nodes to the scores of 1,000 randomly selected alignments with the same number of nodes. We
compare our results to those of IsoRank. We repeat this experiment for three different disease
network queries: Alzheimer’s, Huntington’s, and Type-II diabetes. Figure 4-9 shows the results.
Our results demonstrate that Tempo yields highly significant alignments and outperforms IsoRank
in terms of z-score. We also observe that the z-scores of the non-age-related disease (diabetes)
are lower than those of the age-related diseases (i.e., Alzheimer’s and Huntington’s). Although
there are some fluctuations in the z-score with growing time gap between the query and target
networks, we observe that the z-score tends to increase for Alzheimer’s and Huntington’s
disease, unlike for Type-II diabetes. This suggests that age-related pathways have a higher
evolution rate than other pathways. Thus, we conjecture that Tempo, which takes all time points
into consideration, is suitable for capturing evolving topologies.
Figure 4-9. The average z-score of our method using real data of three different diseases: Alzheimer’s, Huntington’s and Type-II diabetes. The x-axis shows which time points were selected to represent the target network. The y-axis shows the z-score of IsoRank (white bars) against our method (black bars).
Next, we consider the biological significance of our results by identifying aligned gene
pairs in which the aligned genes are different, and determining prior evidence that these gene
pairs are biologically relevant. We use Tempo to identify 4, 4 and 6 such pairs for Alzheimer’s,
Huntington’s and Type-II diabetes, respectively. We note that the Alzheimer’s, Huntington’s
and Type-II diabetes query sizes are 39, 36, and 23. Thus, the different genes make up 10% to
26% of all the genes in the alignment. IsoRank only mapped genes to themselves, suggesting
that IsoRank only considers static topologies, while our algorithm could map genes based on
homological similarities as well as evolving topologies. MAGNA++ and GHOST could map only
a few genes to themselves, while the other mapped genes were poorly related.
For each combination of disease and differently mapped gene pairs identified by Tempo,
we first search PubMed for publication evidence specific to that disease. For instance, in the
case of Alzheimer’s disease, the gene DAB1, which was selected by Tempo, has been identified
as a potential gene that encodes proteins with functions in biological pathways relevant to
the disease (102). Among the genes found by Tempo for type II diabetes, the gene ACTA1, for
example, shows a remarkable change in gene expression in diabetic samples
compared to non-diabetic samples (103). Moreover, significant up-regulation of GRB2 is
observed in transgenic samples compared to controls (104).
In order to determine the biological processes of the aligned genes found by Tempo in the
gene aging dataset, we perform gene ontology analysis of the aligned genes in the target
network using the Gene Ontology Consortium (105). We identify the biological processes or
signaling pathways that play significant roles in the disorder. We count the related
pathways found by our method (Tempo) and by MAGNA++ and GHOST, and report their significance.
Disease        Tempo               MAGNA++             GHOST
Alzheimer’s    2 / 4 / 2.29E-14    1 / 2 / 2.14E-03    1 / 2 / 3.32E-04
Huntington’s   1 / 4 / 1.15E-22    0                   0
Diabetes       2 / 4 / 2.29E-09    1 / 1 / 2.2E-01     0
Table 4-4. Number and significance of functional pathways associated with the underlying disease observed among the aligned genes of the target network. Each cell lists the results in the form x / y / z. Here, x represents the number of pathways identified, y denotes the number of time points at which these pathways are observed, and z is the statistical significance (p-value) of the least significant of these pathways. A cell with the value 0 implies that no pathways were found.
We also counted the frequency of those pathways when using different ranges of time points.
Table 4-4 presents the results. We find references for certain pathways that are related to
specific neurodegenerative disorders (Alzheimer’s and Huntington’s diseases). For the genes we
identify when we use Alzheimer’s disease as the query network, we find two pathways, namely
Alzheimer disease-amyloid secretase and Alzheimer disease-presenilin, that are related to
Alzheimer’s disease (106). Various growth factors alter the brain development process at a
younger age, which manifests as a variety of risk factors at an older age and eventually results
in aging-related diseases such as Alzheimer’s and Huntington’s diseases (107). For the genes we
identify when we use the type II diabetes phenotype as the query, we find two pathways that are
commonly associated with type II diabetes (108), namely Insulin/IGF pathway-protein kinase B
signaling cascade and Insulin/IGF pathway-mitogen activated protein kinase kinase/MAP kinase
cascade. On the other hand, MAGNA++ and GHOST each found at most one pathway, with very low
significance, which did not appear in all tested target networks (Table 4-4). In conclusion,
studying temporal networks in general, and human aging specifically, using Tempo enables us to
distinguish age-related genes from non-age-related genes. More importantly, Tempo takes
the network alignment problem a significant step forward by moving beyond the classical static
network models.
Significance of disease relevance. In this experiment, we perform gene ontology analysis
on the aligned genes from the target network that result from our method, Tempo, and from
MAGNA++ and GHOST. Here, we present the percentage of genes that contribute to the significant
pathways which are related to the query disease. We show the results for Alzheimer’s disease.
Results are similar for the other two queries. Figure 4-10 presents the results, which
demonstrate that our method finds alignments in the target network with a substantial fraction
of genes that contribute to the pathways associated with the query disease. On the
other hand, the alignments produced by MAGNA++ and GHOST contribute only a very small fraction
of genes to pathways associated with Alzheimer’s. Notice that the aligned genes resulting from
our method have two pathways that are associated with Alzheimer’s, while those of MAGNA++ and
GHOST have only one.
Figure 4-10. The percentage of genes that contribute to each pathway of the resulting aligned genes in the target network, for (A) Tempo, (B) MAGNA, and (C) GHOST. We point to the significant related pathways of the query disease (Alzheimer).
4.6 Discussion
In this chapter, we developed a novel and scalable method to solve the problem of network
alignment between two given temporal networks. Our method seeks a persistent alignment
through all time points of the input networks. We proposed a new alignment score function
to increase the similarity between aligned nodes and reduce the number of disconnected
components of the aligned nodes in the target network. We proposed a dynamic programming
solution to this problem which refines the alignment by selecting a maximum of k (user
specified) swapping pairs of nodes from the larger network, where each pair represents an
aligned node and a gap node. The selection process monotonically decreases the number of
disconnected components and thus increases the alignment score. Our method first identifies an
initial alignment between the two input networks based on their node similarities. Our
algorithm then iteratively selects k swapping pairs. We prove the correctness of our algorithm.
Our experiments on both synthetic and real datasets demonstrated that our method is both fast
and efficient. We observed using synthetic networks that the running time of our algorithm
remains reasonable as the size of the target network and the number of time points, t, grow.
Comparing our algorithm to a classical network alignment algorithm shows that our method
generates more significant alignments and can capture the temporal evolution of the two input
networks. Moreover, we performed gene ontology analysis on the genes reported by our algorithm
after the swapping mechanism and observed that they are of biological significance as well.
CHAPTER 5
IDENTIFICATION OF CO-EVOLVING TEMPORAL NETWORKS WITH UNCERTAIN TIMELINE
5.1 Preface
Biological networks describe the interactions between molecules. They are frequently
represented as graphs, where the nodes correspond to the molecules (e.g., proteins or genes)
and the edges correspond to their interactions (1). Formally, we denote a biological network
as G = (V, E) where V and E represent the set of nodes and the set of edges, respectively.
The topology of the interactions of biological networks is not static. Genetic and epigenetic
mutations, errors in DNA replication, and aging can alter molecular interactions (13). Due to
this dynamic behavior, the topology of the network that models the molecular interactions
evolves and changes over time (16). The majority of the previous work on the alignment of
biological networks assumes that the network topology is static (10) (Section 5.2 includes
further details). This assumption ignores the history of network evolution, and may lead to
biased or incorrect analysis. For example, identifying causes and consequences of the influence
of external stimuli is impossible when analyzing static topologies. In this paper, we define a
biological network using a model that accounts for the evolution of the underlying network at
consecutive time points. We refer to this model as a temporal network (24). We denote a
temporal network with t consecutive time points as $\mathcal{G} = [G_1, G_2, \ldots, G_t]$, where
$G_i = (V, E_i)$ represents the topology of the network at the ith time point.
Various factors affect the evolution process of a biological network and thus introduce
uncertainty when capturing such evolution. For example, the evolution rate of interacting
molecules differs between people with different disorders (i.e., diseases) or people with the
same disorder but at different stages of this disorder (27). Furthermore, the reaction to a
specific medication differs from one person to another depending on their resistance levels and
immune systems (109). Consequently, the observed interactions of humans may vary even if
they are measured at the same time. Thus, the interaction networks constructed from those
measurements may correspond to different stages of the evolution.
In this work, we consider the problem of identifying coevolving subnetworks between
subsequences of a given pair of temporal networks. We say that two subnetworks are coevolving
if their topologies remain similar even though their topologies evolve over time. We define this
more formally as follows. We consider two input temporal networks
$\mathcal{G}^1 = [G^1_1, G^1_2, \ldots, G^1_m]$ and $\mathcal{G}^2 = [G^2_1, G^2_2, \ldots, G^2_n]$.
We let $t^1_i$ and $t^2_j$ represent the ith time point of $\mathcal{G}^1$ and the jth time
point of $\mathcal{G}^2$, respectively, where i ∈ {1, 2, . . . , m} and j ∈ {1, 2, . . . , n}.
Notice that the time point numbers only show the order of consecutive snapshots of the network,
such that for all i and j with 1 ≤ i < m and 1 ≤ j < n, we have $t^1_i < t^1_{i+1}$ and
$t^2_j < t^2_{j+1}$. These numbers do not reflect actual timing information. More specifically,
the time points of the observed network topologies are uncertain, such that the information of
which time point in one sequence corresponds to which time point in the other sequence is not
known in advance. Furthermore, $\mathcal{G}^1$ and $\mathcal{G}^2$ may have different numbers of
time points (i.e., m ≠ n). Without loss of generality, we let $\mathcal{G}^1$ be the
temporal network with fewer time points (i.e., m ≤ n). Another version of the
temporal alignment problem exists where the time points of $\mathcal{G}^1$ and $\mathcal{G}^2$ are
known (110). Knowing the time points implies that the time values govern which network in
$\mathcal{G}^1$ gets aligned with which network in $\mathcal{G}^2$. However, this assumes that both
networks co-evolve at the same speed. Here, we consider the uncertainty of the time points in
each temporal network. This is a very challenging problem since it requires not only aligning
the temporal networks, but also finding the corresponding time points at which the alignment
yields the highest score.
In this paper, we aim to find a subsequence S of $\mathcal{G}^2$ with m time points that
corresponds to $\mathcal{G}^1$ and at which the alignment yields the highest alignment quality
(i.e., topological and biological similarities). Finding such a subsequence is very challenging,
since the naive strategy is to exhaustively search among all possible subsequences of
$\mathcal{G}^2$. However, this is computationally too expensive, as the number of candidate
subsequences S is $C^n_m$ (here $C^i_j$ is the combinatorial i choose j function), and thus
grows exponentially with m and n. To avoid this exponential cost, we apply a dynamic time
warping algorithm. In this algorithm, we find the optimal matching between the two input
temporal networks by shifting and stretching the time points of $\mathcal{G}^1$ based on the
alignment quality. For instance, omitting the first two networks in $\mathcal{G}^2$ in the
alignment corresponds to the case where $\mathcal{G}^1$ denotes a later stage of evolution by
two time points as compared to $\mathcal{G}^2$. Similarly, omitting intermediate networks in
$\mathcal{G}^2$ corresponds to the case when $\mathcal{G}^1$ is evolving slower than
$\mathcal{G}^2$.
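The naive strategy mentioned above can be written down directly, which also makes its cost visible. The sketch below is illustrative only and rests on a simplifying assumption: a precomputed matrix `score[i][j]` (a hypothetical input) holds the alignment score of the ith snapshot of the shorter network against the jth snapshot of the longer one, whereas in the actual problem the node mapping is optimized jointly with the choice of time points.

```python
from itertools import combinations
from math import comb


def exhaustive_best(score):
    """Try every increasing choice of m columns out of n and keep the best.

    score[i][j] is an assumed, precomputed per-snapshot alignment score.
    """
    m, n = len(score), len(score[0])

    def total(S):
        return sum(score[i][s] for i, s in enumerate(S))

    # combinations() yields index tuples in increasing order, i.e. subsequences.
    best = max(combinations(range(n), m), key=total)
    return total(best), list(best)


# Hypothetical toy scores: m = 2 snapshots against n = 4 snapshots.
score = [[1, 5, 2, 0],
         [4, 1, 3, 9]]
print(comb(4, 2), exhaustive_best(score))
```

Already for m = 20 and n = 40 the number of candidates, `comb(40, 20)`, exceeds 10^11, which is why an enumeration of this kind is only usable for very short sequences.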
Contributions. In this paper, we address the problem of identifying coevolving subnetworks
in a given pair of temporal networks with uncertain time lines. This is the first work
to tackle this problem. We introduce a novel method, Tempo++, which uses a dynamic time
warping algorithm. Our solution is efficient and scalable for a wide range of network sizes,
numbers of time points, and evolution rates. We demonstrate the efficiency and accuracy of
Tempo++ using both real and synthetic data. For the real dataset, we use a gene expression
dataset which contains the time-resolved response of E. coli to five different environmental
perturbation conditions (cold, heat, oxidative stress, lactose diauxie, and stationary phase).
Using our method, we could find similar response behavior of gene expression between heat and
oxidative stress. Using the alignment significance generated by Tempo++, we could co-cluster
these five conditions into groups. These clusters also confirmed that E. coli has a similar
response to heat and oxidative stress conditions. We compare the statistical significance of
the alignments found by Tempo++ against those of other possible strategies to tackle this
problem.
5.2 Related Work and Notations
In this section, we discuss the literature of the biological network alignment problem and
introduce mathematical notations that we use throughout the paper.
Related work. Existing network alignment problems can be categorized as follows: (i)
pairwise alignment, (ii) multiple network alignment, (iii) dynamic network alignment, and
(iv) temporal alignment with a certain time line. The pairwise network alignment problem
ignores that the network topology evolves (10; 76; 79; 81; 82; 83; 84; 74; 85; 86; 87).
Although the multiple alignment problem can consider more than two networks at once, it
lacks the ability to capture temporal changes since it treats all networks as having static
topologies (88; 89; 90; 91). The dynamic network alignment problem considers topological
changes over time. However, it seeks a different solution to the alignment problem at each
time point. Thus, it cannot identify coevolving subnetworks. Unlike these alignment problems,
temporal network alignment captures that network topologies coevolve over time.
Notations. We represent the alignment of the two temporal networks $\mathcal{G}^1$ and
$\mathcal{G}^2$ as a bijection of their nodes and denote it as a function $\psi : V^1 \to V^2$.
Notice that our goal is to identify coevolving subnetworks within the input temporal networks.
Thus, the alignment of a temporal network persists across all time points in both input
networks, and therefore describes a mapping of the nodes which does not change from one time
point to another. Next, we define the quality score of the alignment. We compute the score of
the alignment ψ of $\mathcal{G}^1$ and $\mathcal{G}^2$, denoted with
$\mathrm{score}(\mathcal{G}^1, \mathcal{G}^2 \mid \psi)$, as the sum of the scores of the
alignment at all time points. Hence,
$\mathrm{score}(\mathcal{G}^1, \mathcal{G}^2 \mid \psi) = \sum_{i=1}^{t} \mathrm{score}(G^1_i, G^2_i \mid \psi)$.
We assume $\mathcal{G}^1$ is connected at all time points, but it may be impossible to find an
alignment that is connected in the target network at all time points. Notice that
$\mathrm{score}(G^1_i, G^2_i \mid \psi)$ integrates the similarities of the aligned
nodes and their evolving topologies, and includes a penalty for the disconnectedness of the
aligned subnetworks of the target network at each time point (see (110) for more details).
Our goal in this paper is to identify a subsequence of m networks from the temporal
network with the longer sequence of networks, $\mathcal{G}^2$, such that this subsequence yields
the highest alignment score when aligned with $\mathcal{G}^1$. Let us denote a subset of
{1, 2, …, n} of size m with $S = \{s_1, s_2, \ldots, s_m\}$, where $s_i < s_{i+1}$ for all i. We
will call S a subsequence from now on as it contains ordered numbers. The challenge in this
paper is to identify the subsequence S of size m and the alignment, denoted with the mapping
function $\psi() : V^1 \to V^2$, which maximize the alignment score as follows:
$$\operatorname*{argmax}_{S,\psi} \Big\{ \sum_{1 \le i \le m} \mathrm{score}(G^1_i, G^2_{s_i} \mid \psi) \Big\}.$$
5.3 Method
Solving for the optimal alignment function for a specific subsequence S reduces to the
problem of temporal alignment with known time point information (110). Thus, we focus
next on describing our solution to identify a subsequence S of $\mathcal{G}^2$ which yields the
maximum alignment score.
We adopt the dynamic time warping (DTW) algorithm to solve this problem. DTW has been
used for comparing two time series with varying numbers of time points (111). Here, we
only allow stretching and/or shifting of time points in the network with the longer sequence,
$\mathcal{G}^2$. Also, we ignore time points from the longer temporal network $\mathcal{G}^2$
that do not belong to the subsequence S. The DTW algorithm iteratively aligns the ith time point
of $\mathcal{G}^1$ to a time point of $\mathcal{G}^2$, where 1 ≤ i ≤ m. At each iteration i, there
exists a window of possible time points of $\mathcal{G}^2$ that could be aligned with i. This
window is defined as [i : (n − m + i)]. Thus, there exist (n − m + 1) candidate alignments for
each time point of $\mathcal{G}^1$. Notice that the optimal alignment of the first x time points
of $\mathcal{G}^1$, where 1 ≤ x < m, is not necessarily part of the optimal alignment of the
entire sequence $\mathcal{G}^1$. This is because the score of the alignment depends on both
the functional and topological similarities, and the topological similarities change from one
time point to another. Thus, we need to keep all (n − m + 1) identified alignments at each
iteration until we reach the final iteration. Let us assume that the algorithm has identified
the alignments of the first i time points of $\mathcal{G}^1$ (there are (n − m + 1) such
alignments).
Let us denote the dynamic time warping alignment of the first i time points of
$\mathcal{G}^1$ to the first j time points of $\mathcal{G}^2$ with a doubly indexed indicator
function δ() such that δ(r, s) = 1 if $G^1_r$ is aligned with $G^2_s$, and δ(r, s) = 0
otherwise. Also, let us denote the node mapping obtained when the i time points of
$\mathcal{G}^1$ are aligned to the j time points of $\mathcal{G}^2$ with the function
$\psi_{i,j}()$. Let us define w as (n − m), so that the window at iteration i is [i : (i + w)].
Also, let us denote the score of the dynamic time warping alignment of the first i time points
of $\mathcal{G}^1$ to the first j time points of $\mathcal{G}^2$ with f(i, j), which represents
the total score of the alignments at the mapped time points under $\psi_{i,j}()$:
$$f(i,j) = \sum_{1 \le r \le i,\; i \le s \le i+w} \mathrm{score}(G^1_r, G^2_s \mid \psi_{i,j})\,\delta(r,s).$$
We calculate the alignment for the ith time point of $\mathcal{G}^1$ iteratively based on the
alignment score as
$$f(i,j) = \mathrm{score}(G^1_i, G^2_j \mid \psi_{i,j}) + \max_{(i-1) \le k \le (i+w-1)} \{ f(i-1,k) \}.$$
The final solution chooses the alignment of all m time points from the last iteration as
$$\mathrm{solution}(\delta(m,n), \psi_{m,n}, \mathcal{G}^1, \mathcal{G}^2) = \operatorname*{argmax}_{m \le j \le n} \{ f(m,j) \}. \qquad \text{(5-1)}$$
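The recurrence above can be illustrated with a small dynamic program. This is a sketch under explicit assumptions rather than the Tempo++ implementation: it takes a precomputed matrix `score[i][j]` (hypothetical) instead of re-solving the node mapping for every pair of time points, it uses 0-based indices, and it additionally restricts the predecessor k to k < j so that the selected time points form a strictly increasing subsequence.

```python
def dtw_best_subsequence(score):
    """Windowed DP over time points; score[i][j] is an assumed precomputed score.

    Row i (snapshot i of the shorter network) may only use columns
    i .. i + w of the longer network, where w = n - m.
    """
    m, n = len(score), len(score[0])
    if m > n:
        raise ValueError("the first network must not have more time points")
    w = n - m
    neg_inf = float("-inf")
    f = [[neg_inf] * n for _ in range(m)]
    back = [[-1] * n for _ in range(m)]

    for j in range(w + 1):  # first snapshot: columns 0 .. w
        f[0][j] = score[0][j]

    for i in range(1, m):
        for j in range(i, i + w + 1):
            # Best predecessor in the previous window, kept strictly before j.
            k = max(range(i - 1, j), key=lambda c: f[i - 1][c])
            f[i][j] = score[i][j] + f[i - 1][k]
            back[i][j] = k

    # Final step: best column for the last snapshot, then trace back.
    j = max(range(m - 1, n), key=lambda c: f[m - 1][c])
    total, path = f[m - 1][j], [j]
    for i in range(m - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    return total, path[::-1]


# Hypothetical toy scores: m = 2 snapshots against n = 4 snapshots.
score = [[1, 5, 2, 0],
         [4, 1, 3, 9]]
print(dtw_best_subsequence(score))
```

The returned list plays the role of the subsequence S: its ith entry is the time point of the longer network matched to the ith time point of the shorter one.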
Complexity analysis. We analyze the complexity of dynamic time warping for aligning
two input temporal networks with m and n time points, where m ≤ n. In the first step, the
algorithm aligns only the first time point of $\mathcal{G}^1$ to a time point in $\mathcal{G}^2$.
Notice that the time points of $\mathcal{G}^2$ available to match with the first time point in
$\mathcal{G}^1$ are 1 to (n − m + 1), since there have to be at least m − 1 remaining points in
$\mathcal{G}^2$ to match the rest of the points in $\mathcal{G}^1$. In each subsequent step,
our algorithm iteratively adds to the current alignment a new pair of time points, one from
each network. It inspects (n − m + 1) (or simply (n − m)) possible alignments for each new
pair and chooses the alignments of the previous points based on the best fit ((n − m) options)
when combined with the new pair. Summing over all pairs, the algorithm tries $(n-m)^2$
cases at each iteration. The cost of alignment increases as we increase the number of time
points within the alignment. For example, in the first iteration we align one time point, in
the second iteration we align two time points, and so on until we align m time points in the
final iteration. Notice that we only analyze the time point matching algorithm, since the cost
of alignment when time points are known has been analyzed before (110). Thus, the total cost is
$\sum_{i=1}^{m} i\,(n-m)^2 = \frac{m(m+1)}{2}(n-m)^2 = O((n-m)^2 m^2)$.
5.4 Results
We evaluate the performance of our algorithm on synthetic and real data. Next, we
describe both datasets in detail.
Real Dataset. We analyze E. coli expression data using our method. We use the E. coli gene
expression dataset GSE20305, obtained from the GEO database (112). This dataset contains the
time-resolved response of E. coli to five different environmental perturbation conditions (cold,
heat, oxidative stress, lactose diauxie, and stationary phase). Samples and expression values
are organized into eight time points for each condition. Each experimental condition
was independently repeated three times. We average the expression values of the three replicates
at each time point. In order to integrate the static PPI network with the gene expression data to
form time point/group-specific PPI networks, we set a cut-off on the gene expression value.
All interactions for which either or both proteins have a transcription value below the cut-off are
removed from the corresponding time point-specific network. We select the same cut-off for
the five conditions. We use the protein-protein interaction (PPI) network data from the STRING
database (113).
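The construction of time point-specific networks described above can be sketched as follows. This is an illustrative sketch only: the function names, the dictionary-based data layout, and the choice to drop interactions involving genes without expression data are assumptions, not details taken from our pipeline.

```python
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]


def average_replicates(replicates: List[List[float]]) -> List[float]:
    """Average expression over replicates; replicates[r][t] is the value of
    replicate r at time point t."""
    return [sum(values) / len(values) for values in zip(*replicates)]


def time_point_networks(
    ppi_edges: Iterable[Edge],
    expression: Dict[str, List[float]],
    cutoff: float,
) -> List[Set[Edge]]:
    """Build one PPI network per time point by keeping only the interactions
    whose two proteins are both expressed at or above the cut-off."""
    edges = list(ppi_edges)
    num_points = len(next(iter(expression.values())))
    networks: List[Set[Edge]] = []
    for t in range(num_points):
        kept: Set[Edge] = set()
        for u, v in edges:
            # Assumption: genes without expression data count as not expressed.
            if u not in expression or v not in expression:
                continue
            if expression[u][t] >= cutoff and expression[v][t] >= cutoff:
                kept.add((u, v))
        networks.append(kept)
    return networks
```

Using the same `cutoff` for every condition keeps the resulting networks comparable across conditions.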
Synthetic dataset. We generate synthetic networks to observe the performance of our method
under a wide spectrum of parameters classified under two categories: (i) network size and
(ii) temporal model parameters, namely the number of time points, temporal rate, and cold rate.
We vary the target network size to take values from {100, 250, 500, 750, 1000}. We fix the
network density to two edges per node on average (i.e., the mean node degree is set to four).
We randomly select $G^1_1$ as a connected subnetwork of $G^2_1$. We set the size of the query network
to 50 nodes. We generate the target network $G^2_1$ using the Barabási-Albert (BA) model (48), as this
model produces scale-free networks. In order to explain the parameters in the second category,
we first described how we generate the query and target networks $G^1_1$ and $G^2_1$ at the first time point.
We next explain how we use the parameters in this category to build the query and target
networks at the remaining time points.
We generate the networks for the remaining time points using the three
parameters in the second category above as follows. The first parameter is the number of
time points t in G1 and G2. We use 5, 10, 15, and 20 time points in our experiments. Recall
that we select a subnetwork of the target network $G^2_1$ as the first query network $G^1_1$. We
mark all nodes and edges in $G^2_1$ within this subnetwork as cold nodes and cold edges, respectively.
We mark all other nodes and edges in $G^2_1$ as hot. Next, we iteratively generate the networks
$G^1_i$ and $G^2_i$ at the ith time point (i > 1) from $G^1_{i-1}$ and $G^2_{i-1}$, respectively, as follows. Let
us denote the temporal and cold rates (two real numbers) with ϵ and ϵc, respectively, such that
0 ≤ ϵc ≤ ϵ ≤ 1. Let us denote the ratio of cold edges to the total number of edges in the
target network $G^2_1$ with γ. We calculate the hot rate, denoted with ϵh, from the temporal rate
and cold rate as ϵh = (ϵ − ϵcγ)/(1 − γ), so that the overall rate satisfies ϵ = γϵc + (1 − γ)ϵh.
Conceptually, hot and cold rates model the rate of evolution of hot and cold edges, respectively,
between two consecutive time points. More specifically, for each subsequent time point i, we
generate $G^2_i$ by randomizing $G^2_{i-1}$ as follows.
We iterate over all edges in $G^2_{i-1}$. For each edge e, if it is a cold edge, we remove it with
probability ϵc and insert a new edge between two randomly chosen cold nodes. If e is a hot
edge, we remove it with probability ϵh and insert a new hot edge between two random nodes
(at least one of which is a hot node). We generate query networks at subsequent time points
using almost the same procedure, the only difference being that all edges are cold. We
generate datasets by varying ϵ and ϵc to take values from {0.05, 0.1, 0.2, 0.4, 0.8} and {0.05,
0.1, 0.2}, respectively. For each parameter setting, we generate 10 target and query temporal
networks.
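One evolution step of the target network can be sketched as below. The sketch makes several assumptions that the description above leaves open: an edge counts as cold when both of its endpoints are cold, a new edge is inserted only when an edge is removed, self-loops and duplicate edges are rejected, and edges are stored as `(min, max)` integer pairs. The function names are hypothetical.

```python
import random
from typing import List, Set, Tuple

Edge = Tuple[int, int]


def hot_rate(eps: float, eps_c: float, gamma: float) -> float:
    """Hot rate derived from the temporal rate, cold rate, and cold-edge ratio
    (requires gamma < 1)."""
    return (eps - eps_c * gamma) / (1.0 - gamma)


def is_cold(edge: Edge, cold_nodes: Set[int]) -> bool:
    """Assumption: an edge is cold when both endpoints are cold nodes."""
    return edge[0] in cold_nodes and edge[1] in cold_nodes


def evolve_target(
    edges: Set[Edge],
    nodes: List[int],
    cold_nodes: Set[int],
    eps_c: float,
    eps_h: float,
    rng: random.Random,
) -> Set[Edge]:
    """Generate the next time point by rewiring cold edges with probability
    eps_c and hot edges with probability eps_h."""
    cold_list = sorted(cold_nodes)
    new_edges = set(edges)
    for edge in sorted(edges):
        cold = is_cold(edge, cold_nodes)
        if rng.random() >= (eps_c if cold else eps_h):
            continue  # this edge survives unchanged
        new_edges.discard(edge)
        while True:
            # Cold edges are replaced among cold nodes, hot edges among all.
            a, b = rng.sample(cold_list if cold else nodes, 2)
            candidate = (min(a, b), max(a, b))
            if candidate in new_edges:
                continue  # reject duplicates
            if not cold and is_cold(candidate, cold_nodes):
                continue  # a hot edge needs at least one hot endpoint
            new_edges.add(candidate)
            break
    return new_edges
```

Because every removed edge is replaced by an edge of the same kind, the numbers of cold and hot edges stay constant across time points in this sketch.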
Recall that we generate the scoring matrix based on both homology and topology
similarities. We generate the homology score between a pair of nodes u ∈ $V^1$ and v ∈
$V^2$ as follows. If v was originally selected as a cold node and u is the same as v, then we
generate the homology score between u and v from a log-normal distribution (101) with mean
2µ and standard deviation σ. Otherwise, we randomly generate the homology score between
u and v from a log-normal distribution with mean µ and standard deviation σ. In this way, we
allow nodes in the query network to be likely to align to the nodes in the target network that they were
originally extracted from. In our experiments, we set µ and σ to 2 and 0.25, respectively. Notice
that the homology scores do not change across time points, although the topology scores do.
Thus, the evolution of the query and target networks through time points may affect how the query
is aligned to the cold region in the target network. We set the edge insertion penalty δ to
\[ \max_{u \in V^1,\, v \in V^2} S(u, v). \]
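The homology score generation can be illustrated with the sketch below. It assumes that µ and σ parametrize the underlying normal distribution, which is the convention of Python's `random.lognormvariate`; the text above does not state which parametrization is meant. The `origin` mapping (query node to the target node it was extracted from) and the function names are hypothetical.

```python
import random
from typing import Dict, Hashable, Iterable, Optional, Tuple

Pair = Tuple[Hashable, Hashable]


def homology_scores(
    query_nodes: Iterable[Hashable],
    target_nodes: Iterable[Hashable],
    origin: Dict[Hashable, Hashable],
    mu: float = 2.0,
    sigma: float = 0.25,
    rng: Optional[random.Random] = None,
) -> Dict[Pair, float]:
    """Draw a log-normal homology score for every (query, target) pair.

    Pairs where the query node was extracted from the target node use 2 * mu;
    all other pairs use mu.
    """
    rng = rng or random.Random()
    targets = list(target_nodes)
    scores: Dict[Pair, float] = {}
    for u in query_nodes:
        for v in targets:
            location = 2.0 * mu if origin.get(u) == v else mu
            scores[(u, v)] = rng.lognormvariate(location, sigma)
    return scores


def edge_insertion_penalty(scores: Dict[Pair, float]) -> float:
    """Edge insertion penalty: the largest homology score over all pairs."""
    return max(scores.values())
```

With µ = 2 and σ = 0.25, planted pairs receive clearly larger scores than background pairs under this parametrization, which biases the alignment toward the cold region without fixing it.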
5.4.1 Comparing Against Other Strategies
In this section, we compare the statistical significance of the alignments generated by
Tempo++ against that of three alternative strategies for approaching the problem. The first strategy is
Exact matching, which matches each time point $t^1_i$ with $t^2_i$. The second strategy is Contiguous
matching. This strategy matches all time points of the shorter sequence to a contiguous block
of the longer sequence with an equal number of time points. It tries all possible blocks
using a sliding window and then selects the matching with the best alignment
score. The third strategy is Gap preservation matching, which preserves the gaps between the time
points of the shorter network when matching them. It also uses a sliding window technique to find the
best-fitting alignment.
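The candidate matchings explored by the three baseline strategies can be enumerated as in the sketch below. It reflects one reading of the descriptions above: for gap preservation, `times` is assumed to hold the original (0-based) time indices of the shorter network, and the whole pattern is shifted along the longer sequence. The function names and the precomputed `score` matrix are hypothetical.

```python
from typing import List, Sequence


def exact_matching(m: int) -> List[int]:
    """Exact matching: time point i is matched to time point i."""
    return list(range(m))


def contiguous_candidates(m: int, n: int) -> List[List[int]]:
    """All contiguous blocks of length m inside a sequence of length n."""
    return [list(range(start, start + m)) for start in range(n - m + 1)]


def gap_preserving_candidates(times: Sequence[int], n: int) -> List[List[int]]:
    """All shifts of the shorter network's time pattern that fit into a
    sequence of length n while keeping the gaps between its time points."""
    offsets = [t - times[0] for t in times]
    return [[start + o for o in offsets] for start in range(n - offsets[-1])]


def best_by_score(candidates: List[List[int]], score: List[List[float]]) -> List[int]:
    """Pick the candidate with the highest total score, where score[i][j] is
    an assumed similarity between time point i and time point j."""
    return max(candidates, key=lambda c: sum(score[i][j] for i, j in enumerate(c)))
```

Both sliding-window strategies inspect only a linear number of candidates, whereas the dynamic time warping formulation also considers matchings whose gaps differ from those of the shorter sequence.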
In this experiment, we use the real dataset. We fix N to 8, which is the number of time points in
each real network. Then we vary M to take values from {1, 2, 3, 4, 5, 6, 7}. We repeat this for
all combinations of two networks of the five stress conditions {cold, heat, lactose, oxidative,
control}. For each combination, we calculate the statistical significance using the z-score as
follows. We randomly select time points and aligned nodes from the target network 1,000 times and
calculate the alignment score of each selection, i.e., each random selection corresponds to an alignment.
We calculate the mean and standard deviation of these 1,000 scores and compute the z-score
of the alignment generated by Tempo++ using this mean and standard deviation. Specifically,
we denote the score generated by our method with S∗, and the mean and standard
deviation of the 1,000 scores generated from the random selections with Sµ and σ, respectively.
We calculate the z-score of our method as (S∗ − Sµ)/σ. We calculate the z-score of the
Figure 5-1. The z-score of the resulting alignment when varying the number of time points in the shorter sequence M over {1, 2, 3, 4, 5, 6, 7} while keeping N = 8, where M < N. The x-axis shows M. The y-axis shows the z-score of our method (Tempo++) against the other strategies (Contiguous, Gap, and Exact). The dashed grey line marks the z-score significance cut-off (z-score ≥ 2).
other three strategies in a similar manner. We use a z-score significance cut-off of 2 (z-score ≥ 2).
Figure 5-1 presents the average z-score values across all network combinations.
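The z-score computation can be sketched as follows. The sketch uses the sample standard deviation, which is an assumption since the text does not say whether the sample or population form was used, and `sample_random_score` is a hypothetical callback standing in for scoring one random selection of time points and aligned nodes.

```python
import random
import statistics
from typing import Callable, List


def z_score(observed: float, random_scores: List[float]) -> float:
    """z-score of an observed alignment score against random alignment scores."""
    mean = statistics.mean(random_scores)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    sd = statistics.stdev(random_scores)
    return (observed - mean) / sd


def empirical_z_score(
    observed: float,
    sample_random_score: Callable[[random.Random], float],
    trials: int = 1000,
    seed: int = 0,
) -> float:
    """Draw `trials` random alignment scores and return the z-score of
    `observed` against them."""
    rng = random.Random(seed)
    scores = [sample_random_score(rng) for _ in range(trials)]
    return z_score(observed, scores)
```

An alignment would then be called significant when the returned value is at least 2, matching the cut-off used in this section.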
The results demonstrate that our method generates more significant alignments than
the other strategies. When matching only one time point (i.e., M = 1), the contiguous and gap matching
strategies have the same z-score as Tempo++, as they also use a sliding window technique.
However, exact matching has a lower z-score, which is expected as it forces the matching of identical
time points. As M grows, our method continues to generate significant (z-score ≥ 2)
alignments in most cases, while the performance of the other methods degrades. We notice
that exact matching ranks the lowest among the strategies. This is because it assumes
that both input networks evolve at the same speed, an assumption that is incorrect and introduces bias. Similarly,
contiguous and gap matching impose assumptions that may be misleading.
5.4.2 Comparing Stress Response Against Time Points Matching
In this experiment, we evaluate the quality of aligning time points from different
stress conditions using our method against exact matching. To assess the quality of
aligning two time points, one from each condition, we calculate the significance
of the overlap between the response patterns of the two compared conditions. To quantify
this significance, we calculate a p-value as follows. All transcription values are normalized to the
average of the time points taken before stress (the first two time points). To estimate the changes
between neighboring time points, we calculate a t-test and the fold change (FC) between the time
Figure 5-2. The significance of the overlaps between different conditions through the post-perturbation time points ({t3, t4, t5, t6}). The significance of the overlaps between conditions was calculated based on the Fisher exact test. Significant overlaps (p-value ≤ 0.05) are colored red, whereas non-significant overlaps are colored yellow. The matchings (alignments) of time points generated using our method are marked with ∗.
point of interest and the directly preceding one. We consider a gene to be
significant if its p-value from the t-test is ≤ 0.05 and its FC is ≥ 3. To determine the overlap
of responses between different conditions at two time points, one from each condition, we test
the significance of the overlapping changes between those two time points using the Fisher exact
test (R software package). A p-value ≤ 0.05 reflects a significant overlap between the two tested
time points. Notice that we only consider post-perturbation time points (t3, t4, . . . ). Figure 5-2
presents the significant overlaps between conditions and time points, as well as
the matching of time points generated using our method.
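The overlap test can be illustrated with a small pure-Python sketch. It computes a one-sided (over-representation) Fisher exact p-value from the hypergeometric tail; the one-sided choice is an assumption of this sketch, and the analysis above relied on the Fisher exact test in R rather than on this code. The function names and the `universe_size` parameter (total number of genes tested) are hypothetical.

```python
from math import comb
from typing import Dict, Set


def significant_genes(
    p_values: Dict[str, float],
    fold_changes: Dict[str, float],
    p_cut: float = 0.05,
    fc_cut: float = 3.0,
) -> Set[str]:
    """Genes whose t-test p-value and fold change both pass the cut-offs."""
    return {
        gene
        for gene, p in p_values.items()
        if p <= p_cut and fold_changes[gene] >= fc_cut
    }


def overlap_p_value(set_a: Set[str], set_b: Set[str], universe_size: int) -> float:
    """One-sided Fisher exact p-value for the overlap of two gene sets:
    the probability of an overlap at least as large as the observed one when
    |set_b| genes are drawn at random from `universe_size` genes."""
    overlap = len(set_a & set_b)
    size_a, size_b = len(set_a), len(set_b)
    total = comb(universe_size, size_b)
    tail = sum(
        comb(size_a, x) * comb(universe_size - size_a, size_b - x)
        for x in range(overlap, min(size_a, size_b) + 1)
    )
    return tail / total
```

Two time points, one from each condition, would then be reported as overlapping significantly when the returned p-value is at most 0.05.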
Matching exact time points (i.e., $t^1_i$ matches $t^2_i$) yields fewer overlapping time points than
using our algorithm. For example, aligning the lactose and oxidative conditions using our method
results in four overlapping time points, whereas exact matching results in only two overlapping time
points. This suggests that our method can match time points with significantly similar
responses to different stress conditions. Furthermore, the number of overlapping responses
decreases with increasing time. This was observed in earlier studies as well (114). In addition,
the results show that there exists significant similarity between heat and oxidative stress, which
is reflected by both our method and the overlap significance test (every $t^1_i$ matches and overlaps
with $t^2_i$).
5.4.3 Hierarchical Clustering of Conditions
In this section, we analyze the similarity between different stress
conditions with respect to the changes in gene expression and network topology. For that
purpose, we apply hierarchical clustering to these conditions based on the z-scores of their pairwise
alignments. We first align the networks of two stress conditions using our method. We fix
N to 8, which is the number of time points in each real network. Then we vary M to take values
from {1, 2, 3, 4, 5, 6, 7}. We calculate the statistical significance using the z-score as described in
Section 5.4.1. We report the average z-score over all M values for each pair of stress conditions. We
then apply the hierarchical cluster analysis available in R based on the distribution of the z-scores of
each condition when aligned with the other conditions. We repeat this for all combinations of
two networks of the five stress conditions {cold, heat, lactose, oxidative, control}. Figure 5-3
presents the results.
The results illustrate that the heat and oxidative conditions are clustered together, which
confirms the results of the previous experiment. In addition, aligning the networks of these two
conditions yields a significant z-score (2.4671), above the cut-off of 2. Similarly, we notice that control
and lactose also exhibit similar distributions. On the other hand, the cold and heat conditions
are separated, which suggests that genes respond differently to these two temperature stress
conditions.
5.4.4 Evaluation of Running Time
In this experiment, we evaluate the running time of our algorithm using the synthetic dataset
for all network sizes while varying the number of time points of the longer (N) and shorter (M)
networks. We report the average running time over all values of ϵ and ϵc as well as network
sizes, with each parameter combination tested 10 times. Figure 5-4 presents the results. The
results demonstrate that our algorithm scales to all tested numbers of time points
(up to N = 20). The running time of Tempo++ grows linearly with increasing
Figure 5-3. The hierarchical clustering of the z-scores of the resulting alignments between pairs of networks of the five stress conditions {cold, heat, lactose, oxidative, control}. White color represents NA values for self-alignment, which was not tested.
the number of time points in the longer sequence (N). We also notice that the running time of
aligning a network with M = x time points (e.g., N = 14 and M = 2) is almost the
same as that of aligning a network with M = N − x time points (e.g., N = 14 and M = 12). This is expected,
as the number of ways of choosing x points out of N points is the same as that of choosing N − x points.
However, the running time of our method is higher when aligning N − x points than when aligning
x points, which is unsurprising since Tempo++ uses Tempo, whose running time increases with the
number of time points in the shorter network. That said, our method has a
practical running time even for networks with many time points.
5.4.5 Evaluation of Alignment Quality
In this section, we evaluate the topological quality of the alignment generated by our
method using the synthetic dataset. For this purpose, we measure the shared topological structure
between $G^1_i$ and $G^2_i$ that is preserved under the alignment function ψ through all time
Figure 5-4. The total running time of our method for synthetic networks when varying the numbers of time points of the input networks, M and N, over {2, 4, 6, 8, 10, 12, 14, 16, 18, 20} where M < N. The x-axis shows the number of time points of the longer sequence N and the shorter sequence M. The y-axis shows the total running time in CPU seconds.
points i. Induced conserved structure (ICS) measures the ratio of the number of edges of $G^1_i$ that
are aligned to edges in $G^2_i$ to the total number of edges of the induced subnetwork $\Psi(V^1 \mid G^2_i)$, and is
one of the most common measures of topological quality (73). Formally,
\[ \mathrm{ICS}(G^1, G^2, \psi) = \sum_{i=1}^{t} \frac{|E^1_i \cap E^2_i[\Psi(V^1 \mid G^2_i)]|}{|E^2_i[\Psi(V^1 \mid G^2_i)]|}. \]
We also evaluate our algorithm against other algorithms using the edge
correctness (EC) measure, which is computed similarly to ICS. It measures the
ratio of the number of edges of $G^1_i$ that are aligned to edges in $G^2_i$ to the total number of edges of the smaller
network. More specifically,
\[ \mathrm{EC}(G^1, G^2, \psi) = \sum_{i=1}^{t} \frac{|E^1_i \cap E^2_i[\Psi(V^1 \mid G^2_i)]|}{|E^1_i|}. \]
In this experiment, we vary
the numbers of time points of the two input biological networks G1 (m) and G2 (n) such that n and m
take values from {2, 4, 6, 8, 10, 12, 14, 16, 18, 20}, where m < n, while keeping the network size
at 500. Figure 5-5 and Figure 5-6 present the results for the EC and ICS scores, respectively.
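For a single time point, the two measures can be computed as in the sketch below. It assumes undirected edges, an injective node mapping given as a dictionary from query nodes to target nodes, and a value of 0 when a denominator is empty; the function name is hypothetical. Summing the per-time-point values over all matched time points gives the quantities in the formulas above.

```python
from typing import Dict, Hashable, Iterable, Tuple

Edge = Tuple[Hashable, Hashable]


def ec_and_ics(
    edges1: Iterable[Edge],
    edges2: Iterable[Edge],
    mapping: Dict[Hashable, Hashable],
) -> Tuple[float, float]:
    """Edge correctness and induced conserved structure for one time point."""
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    # Edges of the query network translated into target-node names.
    mapped = {
        frozenset(mapping[u] for u in e)
        for e in e1
        if all(u in mapping for u in e)
    }
    conserved = len(mapped & e2)
    # Target edges induced by the image of the mapping.
    image = set(mapping.values())
    induced = sum(1 for e in e2 if e <= image)
    ec = conserved / len(e1) if e1 else 0.0
    ics = conserved / induced if induced else 0.0
    return ec, ics
```

EC normalizes by the number of query edges, whereas ICS normalizes by the number of target edges among the aligned nodes, so ICS penalizes mapping a sparse query onto a dense region of the target.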
The results demonstrate that our algorithm generates alignments of reasonable quality
in terms of both ICS and EC. We notice that the quality score of the generated alignment decreases when
the number of time points m increases while n is kept unchanged. This reflects the fact that the
more the two networks evolve, the more their topologies change and deviate from each other, which causes
the average quality score across time points to decrease. It might also reflect that, as we
decrease m, there are more distinct ways to place those m time points of the shorter sequence
within the n time points of the longer sequence, and consequently we have more options
from which to choose the best alignment.
Figure 5-5. The edge correctness (EC) score of the resulting alignment when varying the number of time points of the longer sequence n and the shorter sequence m over {2, 4, 6, 8, 10, 12, 14, 16, 18, 20} where m < n. The x-axis shows n and m. The y-axis shows the EC score.
Figure 5-6. The induced conserved structure (ICS) score of the resulting alignment when varying the number of time points of the longer sequence n and the shorter sequence m over {2, 4, 6, 8, 10, 12, 14, 16, 18, 20} where m < n. The x-axis shows n and m. The y-axis shows the ICS score.
In addition to the EC and ICS scores, we evaluate the accuracy of the alignment generated by
our method. Recall that we select the original query network as a subset of nodes and
their edges from the target network, and then evolve the query through time points. Here, we
evaluate the accuracy by calculating the percentage of the aligned nodes of the query network
that are paired with the same nodes of the target network that they were originally selected
from. We refer to this percentage as the recovered region. We illustrate the results in Figure 5-7,
which demonstrate that our algorithm recovers a high percentage of the query networks.
The recovered region results show that our method can capture the
coevolving topologies and recover a high percentage (∼70%) of the query network that was
planted in the target network. We also notice a behavior similar to that of ICS and EC as
the number of time points increases. That being said, our method continues to recover a high
percentage of the query network, with an average of ∼64%.
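The recovered region measure reduces to a simple count, sketched below. The `origin` dictionary (query node to the target node it was originally selected from) and the function name are hypothetical, and returning 0 for an empty alignment is an assumption of this sketch.

```python
from typing import Dict, Hashable


def recovered_region(
    mapping: Dict[Hashable, Hashable],
    origin: Dict[Hashable, Hashable],
) -> float:
    """Percentage of aligned query nodes that are paired with the target node
    they were originally selected from."""
    if not mapping:
        return 0.0  # assumption: an empty alignment recovers nothing
    hits = sum(1 for u, v in mapping.items() if origin.get(u) == v)
    return 100.0 * hits / len(mapping)
```

For example, an alignment that pairs three of four query nodes with their original target nodes has a recovered region of 75%.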
Figure 5-7. The percentage of recovered query in the resulting alignment when varying the number of time points of the longer sequence n and the shorter sequence m over {2, 4, 6, 8, 10, 12, 14, 16, 18, 20} where m < n. The x-axis shows n and m. The y-axis shows the percentage of recovered query of our method.
5.5 Discussion
In this chapter, we addressed the problem of identifying coevolving subnetworks between
subsequences of a given pair of temporal networks. We developed a novel method, Tempo++,
using a dynamic time warping algorithm. We showed that our solution is efficient and scalable
for a wide range of network sizes, numbers of time points, and evolution rates. We demonstrated
the efficiency and accuracy of Tempo++ using both real and synthetic data. Using our
method, we found similar gene expression response behavior between heat and
oxidative stress. Using Tempo++ to generate alignment significance, we co-clustered the
five conditions into groups. These clusters also confirmed that E. coli has a similar response to
heat and oxidative stress conditions. We compared the statistical significance of the alignments
found by Tempo++ against those of other possible strategies to tackle this problem.
CHAPTER 6
CONCLUSION
Biological networks help us understand cellular function. Interaction between molecules
is dynamic. Temporal networks describe the evolution of molecules and their interactions over
time. In this dissertation, we addressed three modeling and characterization problems
of biological networks, both static and temporal. In addition, we presented two applications on
ecological networks.
In the first problem, we developed a scalable method to solve the motif identification
problem given an input graph, a desired motif size µ, and a minimum frequency of the desired
motif α. Our experiments on synthetic data and PPI networks from MINT comprehensively
demonstrated the statistical and biological significance of the motifs resulting from our
algorithm.
Following the first problem, we developed two applications of the motif identification problem
in ecological networks. The first application employs motifs to identify the assembly of
food web networks across hierarchical scales. We found that the motif representation of daughter
networks closely matched that of the parent network they were assembled from. The second application
identifies the relationship between motif centrality and motif abundance in aquatic food
webs. We found that highly central motifs are over-represented and non-central motifs are
under-represented for six of the thirteen motifs. This pattern suggests that high energy flow is
associated with the persistence of certain motifs in food webs.
In the second problem, we developed a novel and scalable method, Tempo, to solve the
problem of identifying co-evolving subnetworks between two given temporal networks. We
proved the correctness of our algorithm. We compared our algorithm to a classical network
alignment algorithm and showed that our method generates more significant alignments and can
capture the temporal evolution of the two input networks. We analyzed the genes (of
different phenotypes) reported by our algorithm and observed that they are of biological significance.
In the third problem, we aimed to align two temporal networks with uncertain evolution
timelines. This is the first work to tackle this problem. We developed a novel method,
Tempo++, using a dynamic time warping algorithm, which is efficient and scalable for a
wide range of numbers of time points. We demonstrated the efficiency and accuracy of
Tempo++ using both real and synthetic data. We used a gene expression dataset that contains the
time-resolved response of E. coli to five different environmental perturbation conditions. Using
our method, we found similar gene expression response behavior between heat and
oxidative stress. Using Tempo++ to generate alignment significance, we co-clustered these
five conditions into groups. These clusters also confirmed that E. coli has a similar response to
heat and oxidative stress conditions. We compared the statistical significance of the alignments
found by Tempo++ against those of other possible strategies to tackle this problem and found
that Tempo++ outperformed all those strategies.
REFERENCES
[1] X. Zhu, M. Gerstein, and M. Snyder, “Getting connected: analysis and principles of biological networks,” Genes & Development, vol. 21, no. 9, pp. 1010–1024, 2007.
[2] J. A. Freyre-González, J. A. Alonso-Pavón, L. G. Treviño-Quintanilla, and J. Collado-Vides, “Functional architecture of escherichia coli: new insights provided by a natural decomposition approach,” Genome biology, vol. 9, no. 10, p. R154, 2008.
[3] M. D. Leiserson, F. Vandin, H.-T. Wu, J. R. Dobson, J. V. Eldridge, J. L. Thomas, A. Papoutsaki, Y. Kim, B. Niu, M. McLellan et al., “Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes,” Nature genetics, vol. 47, no. 2, pp. 106–114, 2015.
[4] D. A. Charlebois, G. Balázsi, and M. Kærn, “Coherent feedforward transcriptional regulatory motifs enhance drug resistance,” Physical Review E, vol. 89, no. 5, p. 052708, 2014.
[5] S. S. Shen-Orr, R. Milo, S. Mangan, and U. Alon, “Network motifs in the transcriptional regulation network of escherichia coli,” Nature Genetics, vol. 31, no. 1, pp. 64–68, 2002.
[6] P. Wang, J. Lü, and X. Yu, “Identification of important nodes in directed biological networks: A network motif approach,” PLOS ONE, vol. 9, no. 8, 2014.
[7] S. Wuchty, Z. N. Oltvai, and A.-L. Barabási, “Evolutionary conservation of motif constituents in the yeast protein interaction network,” Nature Genetics, vol. 35, no. 2, pp. 176–179, 2003.
[8] J. Flannick, A. Novak, B. S. Srinivasan, H. H. McAdams, and S. Batzoglou, “Graemlin: general and robust alignment of multiple large interaction networks,” Genome research, vol. 16, no. 9, pp. 1169–1181, 2006.
[9] T. I. Lee, N. J. Rinaldi, F. Robert, D. T. Odom, Z. Bar-Joseph, G. K. Gerber, N. M. Hannett, C. T. Harbison, C. M. Thompson, I. Simon et al., “Transcriptional regulatory networks in saccharomyces cerevisiae,” science, vol. 298, no. 5594, pp. 799–804, 2002.
[10] R. Singh, J. Xu, and B. Berger, “Pairwise global alignment of protein interaction networks by matching neighborhood topology,” in Annual International Conference on Research in Computational Molecular Biology. Springer, 2007, pp. 16–31.
[11] T. M. Przytycka, M. Singh, and D. K. Slonim, “Toward the dynamic interactome: it’s about time,” Briefings in bioinformatics, p. bbp057, 2010.
[12] J.-D. J. Han, N. Bertin, T. Hao, D. S. Goldberg, G. F. Berriz, L. V. Zhang, D. Dupuy, A. J. Walhout, M. E. Cusick, F. P. Roth et al., “Evidence for dynamically organized modularity in the yeast protein–protein interaction network,” Nature, vol. 430, no. 6995, pp. 88–93, 2004.
[13] B. Sadikovic, K. Al-Romaih, J. Squire, and M. Zielenska, “Cause and consequences of genetic and epigenetic alterations in human cancer,” Current genomics, vol. 9, no. 6, pp. 394–408, 2008.
[14] A. De Smith, R. Walters, P. Froguel, and A. Blakemore, “Human genes involved in copy number variation: mechanisms of origin, functional effects and implications for disease,” Cytogenetic and genome research, vol. 123, no. 1-4, pp. 17–26, 2008.
[15] J. R. Pollack et al., “Genome-wide analysis of dna copy-number changes using cdna microarrays,” Nature genetics, vol. 23, no. 1, pp. 41–46, 1999.
[16] P. Holme and J. Saramäki, “Temporal networks,” Physics reports, vol. 519, no. 3, pp. 97–125, 2012.
[17] N. M. Luscombe, M. M. Babu, H. Yu, M. Snyder, S. A. Teichmann, and M. Gerstein, “Genomic analysis of regulatory network dynamics reveals large topological changes,” Nature, vol. 431, no. 7006, pp. 308–312, 2004.
[18] A. Rao, A. O. Hero III, J. D. Engel et al., “Inferring time-varying network topologies from gene expression data,” EURASIP Journal on Bioinformatics and Systems Biology, vol. 2007, pp. 7–7, 2007.
[19] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, “Network motifs: simple building blocks of complex networks,” Science, vol. 298, no. 5594, pp. 824–827, 2002.
[20] S. A. Cook, “The complexity of theorem-proving procedures,” in ACM Symposium on Theory of Computing. ACM, 1971, pp. 151–158.
[21] H. L. Buckley, T. E. Miller, A. M. Ellison, and N. J. Gotelli, “Local-to continental-scale variation in the richness and composition of an aquatic food web,” Global Ecology and Biogeography, vol. 19, no. 5, pp. 711–723, 2010.
[22] J. F. Addicott, “Predation and prey community structure: an experimental study of the effect of mosquito larvae on the protozoan communities of pitcher plants,” Ecology, vol. 55, no. 3, pp. 475–492, 1974.
[23] S. R. Borrett and M. K. Lau, “enar: an r package for ecosystem network analysis,” Methods in Ecology and Evolution, vol. 5, no. 11, pp. 1206–1213, 2014.
[24] Y. Hulovatyy, H. Chen, and T. Milenković, “Exploring the structure and function of temporal networks with dynamic graphlets,” Bioinformatics, vol. 31, no. 12, pp. i171–i180, 2015.
[25] J. Leskovec, J. Kleinberg, and C. Faloutsos, “Graphs over time: densification laws, shrinking diameters and possible explanations,” in Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. ACM, 2005, pp. 177–187.
[26] R. M. Karp, “Reducibility among combinatorial problems,” in Complexity of computer computations. Springer, 1972, pp. 85–103.
[27] I. Tomlinson, M. Novelli, and W. Bodmer, “The mutation rate and cancer,” Proceedings of the National Academy of Sciences, vol. 93, no. 25, pp. 14 800–14 803, 1996.
[28] F. Ay, M. Kellis, and T. Kahveci, “SubMAP: aligning metabolic pathways with subnetwork mappings,” Journal of Computational Biology, vol. 18, no. 3, pp. 219–235, 2011.
[29] S. Wuchty and P. F. Stadler, “Centers of complex networks,” Journal of Theoretical Biology, vol. 223, no. 1, pp. 45–53, 2003.
[30] A. Masoudi-Nejad, F. Schreiber, and Z. Kashani, “Building blocks of biological networks: a review on major network motif discovery algorithms,” IET Systems Biology, vol. 6, no. 5, pp. 164–174, 2012.
[31] T. Milenković, J. Lai, and N. Pržulj, “Graphcrunch: a tool for large network analyses,” BMC Bioinformatics, vol. 9, no. 1, p. 70, 2008.
[32] M. Deshpande, M. Kuramochi, N. Wale, and G. Karypis, “Frequent substructure-based approaches for classifying chemical compounds,” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 8, pp. 1036–1050, 2005.
[33] C. Yanover, M. Singh, and E. Zaslavsky, “M are better than one: an ensemble-based motif finder and its application to regulatory element prediction,” Bioinformatics, vol. 25, no. 7, pp. 868–874, 2009.
[34] M. R. Garey and D. S. Johnson, “Computers and Intractability: A Guide to the Theory of NP-Completeness,” 1979.
[35] L. B. Holder, D. J. Cook, S. Djoko et al., “Substructure discovery in the subdue system,” in KDD workshop, 1994, pp. 169–180.
[36] F. Schreiber and H. Schwöbbermeyer, “Frequency concepts and pattern detection for the analysis of motifs in networks,” in Transactions on Computational Systems Biology, 2005, pp. 89–104.
[37] N. Vanetik, E. Gudes, and S. E. Shimony, “Computing frequent graph patterns from semistructured data,” in ICDM. IEEE, 2002, pp. 458–465.
[38] X. Yan, X. Zhou, and J. Han, “Mining closed relational graphs with connectivity constraints,” in ACM SIGKDD, 2005, pp. 324–333.
[39] J. A. Grochow and M. Kellis, “Network motif discovery using subgraph enumeration and symmetry-breaking,” in Research in Computational Molecular Biology. Springer, 2007, pp. 92–106.
[40] N. Kashtan, S. Itzkovitz, R. Milo, and U. Alon, “Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs,” Bioinformatics, vol. 20, no. 11, pp. 1746–1758, 2004.
[41] S. Omidi, F. Schreiber, and A. Masoudi-Nejad, “Moda: an efficient algorithm for network motif discovery in biological networks,” Genes & Genetic Systems, vol. 84, no. 5, pp. 385–395, 2009.
[42] S. Wernicke, “Efficient detection of network motifs,” IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), vol. 3, no. 4, pp. 347–359, 2006.
[43] J. Chen, W. Hsu, M. L. Lee, and S.-K. Ng, “Nemofinder: Dissecting genome-wide protein-protein interactions with meso-scale network motifs,” in ACM SIGKDD, 2006, pp. 106–115.
[44] Z. R. Kashani, H. Ahrabian, E. Elahi, A. Nowzari-Dalini, E. S. Ansari, S. Asadi, S. Mohammadi, F. Schreiber, and A. Masoudi-Nejad, “Kavosh: a new algorithm for finding network motifs,” BMC bioinformatics, vol. 10, no. 1, p. 318, 2009.
[45] M. Kuramochi and G. Karypis, “An efficient algorithm for discovering frequent subgraphs,” IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 9, pp. 1038–1051, 2004.
[46] M. Kuramochi and G. Karypis, “Finding frequent patterns in a large sparse graph,” Data Mining and Knowledge Discovery, vol. 11, no. 3, pp. 243–271, 2005.
[47] L. Babai and E. M. Luks, “Canonical labeling of graphs,” in ACM Symposium on Theory of Computing, 1983, pp. 171–183.
[48] A.-L. Barabási and R. Albert, “Emergence of scaling in random networks,” Science, vol. 286, no. 5439, pp. 509–512, 1999.
[49] K. Baskerville and M. Paczuski, “Subgraph ensembles and motif discovery using a new heuristic for graph isomorphism,” Physical Review E, vol. 74, p. 051903, 2006.
[50] A. Chatr-Aryamontri, A. Ceol, L. M. Palazzi, G. Nardelli, M. V. Schneider, L. Castagnoli, and G. Cesareni, “MINT: the Molecular INTeraction database,” Nucleic Acids Research, vol. 35, no. suppl 1, pp. D572–D574, 2007.
[51] S. N. Dorogovtsev, J. F. F. Mendes, and A. N. Samukhin, “Structure of growing networks with preferential linking,” Physical review letters, vol. 85, no. 21, p. 4633, 2000.
[52] H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, and A.-L. Barabási, “The large-scale organization of metabolic networks,” Nature, vol. 407, no. 6804, pp. 651–654, 2000.
[53] S. Redner, “How popular is your paper? an empirical study of the citation distribution,” The European Physical Journal B-Condensed Matter and Complex Systems, vol. 4, no. 2, pp. 131–134, 1998.
[54] R. D. Leclerc, “Survival of the sparsest: robust gene networks are parsimonious,” Molecular Systems Biology, vol. 4, no. 1, p. 213, 2008.
[55] R. Milo, N. Kashtan, S. Itzkovitz, M. E. Newman, and U. Alon, “On the uniform generation of random graphs with prescribed degree sequences,” arXiv preprint cond-mat/0312028, 2003.
[56] D. Gale et al., “A theorem on flows in networks,” Pacific J. Math, vol. 7, no. 2, pp. 1073–1082, 1957.
[57] M. Ashburner, C. A. Ball et al., “Gene ontology: tool for the unification of biology,” Nature genetics, vol. 25, no. 1, pp. 25–29, 2000.
[58] F. L. Homa and J. C. Brown, “Capsid assembly and dna packaging in herpes simplex virus,” Reviews in Medical Virology, vol. 7, no. 2, p. 107, 1997.
[59] H. V. Cornell and J. H. Lawton, “Species interactions, local and regional processes, and limits to the richness of ecological communities: a theoretical perspective,” Journal of Animal Ecology, pp. 1–12, 1992.
[60] N. J. Gotelli, “Null model analysis of species co-occurrence patterns,” Ecology, vol. 81, no. 9, pp. 2606–2621, 2000.
[61] P. Erdős and A. Rényi, “On random graphs i,” Publ. Math. Debrecen, vol. 6, pp. 290–297, 1959.
[62] R. J. Williams and N. D. Martinez, “Simple rules yield complex food webs,” Nature, vol. 404, no. 6774, p. 180, 2000.
[63] M.-F. Cattin, L.-F. Bersier, C. Banašek-Richter, R. Baltensperger, and J.-P. Gabriel, “Phylogenetic constraints and adaptation explain food-web structure,” in Nature, vol. 427, 2004, pp. 835–839.
[64] D. Stouffer, J. Camacho, R. Guimera, C. Ng, and L. Nunes Amaral, “Quantitative patterns in the structure of model and empirical food webs,” Ecology, vol. 86, no. 5, pp. 1301–1311, 2005.
[65] D. B. Stouffer and J. Bascompte, “Compartmentalization increases food-web persistence,” Proceedings of the National Academy of Sciences, vol. 108, no. 9, pp. 3648–3652, 2011.
[66] D. Koschützki, H. Schwöbbermeyer, and F. Schreiber, “Ranking of network elements based on functional substructures,” Journal of theoretical biology, vol. 248, no. 3, pp. 471–479, 2007.
[67] W. Kim, M. Li, J. Wang, and Y. Pan, “Essential protein discovery based on network motif and gene ontology,” in Bioinformatics and Biomedicine (BIBM), 2011 IEEE International Conference on. IEEE, 2011, pp. 470–475.
[68] W. Li, L. Chen, X. Li, X. Jia, C. Feng, L. Zhang, W. He, J. Lv, Y. He, W. Li et al., “Cancer-related marketing centrality motifs acting as pivot units in the human signaling network and mediating cross-talk between biological pathways,” Molecular BioSystems, vol. 9, no. 12, pp. 3026–3035, 2013.
[69] M. Piraveenan, K. Wimalawarne, and D. Kasthurirathn, “Centrality and composition of four-node motifs in metabolic networks,” Procedia Computer Science, vol. 18, pp. 409–418, 2013.
[70] L. C. Freeman, “A set of measures of centrality based on betweenness,” Sociometry, pp. 35–41, 1977.
[71] S. R. Borrett, “Throughflow centrality is a global indicator of the functional importance of species in ecosystems,” Ecological indicators, vol. 32, pp. 182–196, 2013.
[72] J. Clemente, K. Satou, and G. Valiente, “Finding conserved and non-conserved reactions using a metabolic pathway alignment algorithm,” Genome Informatics, vol. 17, no. 2, pp. 46–56, 2006.
[73] V. Vijayan, V. Saraph, and T. Milenković, “MAGNA++: Maximizing Accuracy in Global Network Alignment via both node and edge conservation,” Bioinformatics, vol. 31, no. 14, pp. 2409–2411, 2015.
[74] R. Patro and C. Kingsford, “Global network alignment using multiscale spectral signatures,” Bioinformatics, vol. 28, no. 23, pp. 3105–3114, 2012.
[75] N. C. Berchtold, D. H. Cribbs, P. D. Coleman, J. Rogers, E. Head, R. Kim, T. Beach, C. Miller, J. Troncoso, J. Q. Trojanowski et al., “Gene expression changes in the course of normal brain aging are sexually dimorphic,” Proceedings of the National Academy of Sciences, vol. 105, no. 40, pp. 15 605–15 610, 2008.
[76] O. Kuchaiev and N. Pržulj, “Integrative network alignment reveals large regions of global network similarity in yeast and human,” Bioinformatics, vol. 27, no. 10, pp. 1390–1396, 2011.
[77] O. Kuchaiev, T. Milenković, V. Memišević, W. Hayes, and N. Pržulj, “Topological network alignment uncovers biological function and phylogeny,” Journal of the Royal Society Interface, p. rsif20100063, 2010.
[78] T. Milenković, W. L. Ng, W. Hayes, and N. Pržulj, “Optimal network alignment with graphlet degree vectors,” Cancer informatics, vol. 9, p. 121, 2010.
[79] A. E. Aladağ and C. Erten, “Spinal: scalable protein interaction network alignment,” Bioinformatics, vol. 29, no. 7, pp. 917–924, 2013.
[80] V. Saraph and T. Milenković, “Magna: maximizing accuracy in global network alignment,” Bioinformatics, vol. 30, no. 20, pp. 2931–2940, 2014.
[81] B. P. Kelley, B. Yuan, F. Lewitter, R. Sharan, B. R. Stockwell, and T. Ideker, “Pathblast: a tool for alignment of protein interaction networks,” Nucleic acids research, vol. 32, no. suppl_2, pp. W83–W88, 2004.
[82] H. T. Phan and M. J. Sternberg, “Pinalog: a novel approach to align protein interaction networks—implications for complex detection and function prediction,” Bioinformatics, vol. 28, no. 9, pp. 1239–1245, 2012.
[83] G. Guelsoy, B. Gandhi, and T. Kahveci, “Topac: alignment of gene regulatory networks using topology-aware coloring,” Journal of bioinformatics and computational biology, vol. 10, no. 01, p. 1240001, 2012.
[84] M. M. Hasan and T. Kahveci, “Indexing a protein-protein interaction network expedites network alignment,” BMC bioinformatics, vol. 16, no. 1, p. 326, 2015.
[85] B. Neyshabur, A. Khadem, S. Hashemifar, and S. Arab, “NETAL: a new graph-based method for global alignment of protein–protein interaction networks,” Bioinformatics, vol. 29, no. 13, pp. 1654–1662, 2013.
[86] Y. Sun, J. Crawford, J. Tang, and T. Milenković, “Simultaneous optimization of both node and edge conservation in network alignment via WAVE,” in International Workshop on Algorithms in Bioinformatics. Springer, 2015, pp. 16–39.
[87] J. Hu, B. Kehr, and K. Reinert, “NetCoffee: a fast and accurate global alignment approach to identify functionally conserved proteins in multiple networks,” Bioinformatics, vol. 30, no. 4, pp. 540–548, 2013.
[88] F. Alkan and C. Erten, “BEAMS: backbone extraction and merge strategy for the global many-to-many alignment of multiple PPI networks,” Bioinformatics, vol. 30, no. 4, pp. 531–539, 2013.
[89] R. Ibragimov, M. Malek, J. Baumbach, and J. Guo, “Multiple graph edit distance: simultaneous topological alignment of multiple protein-protein interaction networks with an evolutionary algorithm,” in Proceedings of the 2014 Conference on Genetic and Evolutionary Computation. ACM, 2014, pp. 277–284.
[90] S. Sahraeian and B. Yoon, “SMETANA: accurate and scalable algorithm for probabilistic alignment of large-scale biological networks,” PloS One, vol. 8, no. 7, p. e67995, 2013.
[91] C.-S. Liao, K. Lu, M. Baym, R. Singh, and B. Berger, “Isorankn: spectral methods for global alignment of multiple protein networks,” Bioinformatics, vol. 25, no. 12, pp. i253–i258, 2009.
[92] Y.-K. Shih and S. Parthasarathy, “Scalable global alignment for multiple biological networks,” BMC bioinformatics, vol. 13, no. 3, p. S11, 2012.
[93] M. M. Hasan and T. Kahveci, “Incremental network querying in biological networks,” in Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, ser. BCB ’14. ACM, 2014, pp. 752–759.
[94] M. M. Hasan and T. Kahveci, “Color distribution can accelerate network alignment,” in Proceedings of the international conference on bioinformatics, computational biology and biomedical informatics. ACM, 2013, p. 52.
[95] V. Vijayan, D. Critchlow, and T. Milenković, “Alignment of dynamic networks,” arXiv preprint arXiv:1701.08842, 2017.
[96] U. Feige, “A threshold of ln n for approximating set cover,” Journal of the ACM (JACM), vol. 45, no. 4, pp. 634–652, 1998.
[97] C. H. Papadimitriou and K. Steiglitz, “Combinatorial optimization: Algorithms and complexity,” 1998.
[98] Z. Zhang, S. Schwartz, L. Wagner, and W. Miller, “A greedy algorithm for aligning dna sequences,” Journal of Computational biology, vol. 7, no. 1-2, pp. 203–214, 2000.
[99] B.-J. Breitkreutz, C. Stark, T. Reguly, L. Boucher, A. Breitkreutz, M. Livstone, R. Oughtred, D. H. Lackner, J. Bähler, V. Wood et al., “The biogrid interaction database: 2008 update,” Nucleic acids research, vol. 36, no. suppl_1, pp. D637–D640, 2007.
[100] H. Ogata, S. Goto, K. Sato, W. Fujibuchi, H. Bono, and M. Kanehisa, “KEGG: Kyoto encyclopedia of genes and genomes,” Nucleic Acids Research, vol. 27, no. 1, pp. 29–34, 1999.
[101] N. Johnson, S. Kotz, and N. Balakrishnan, “Continuous univariate probability distributions (vol. 1),” 1994.
[102] H. Gao, Y. Tao, Q. He, F. Song, and D. Saffen, “Functional enrichment analysis of three alzheimer’s disease genome-wide association studies identities dab1 as a novel candidate liability/protective gene,” Biochemical and biophysical research communications, vol. 463, no. 4, pp. 490–495, 2015.
[103] T. Nashida et al., “Atrophy of myoepithelial cells in parotid glands of diabetic mice; detection using skeletal muscle actin, a novel marker,” FEBS open bio, vol. 3, no. 1, pp. 130–134, 2013.
[104] K. P. Burdon et al., “Genome-wide association study for sight-threatening diabetic retinopathy reveals association with genetic variation near the grb2 gene,” Diabetologia, vol. 58, no. 10, pp. 2288–2297, 2015.
[105] G. O. Consortium et al., “The gene ontology (go) database and informatics resource,” Nucleic acids research, vol. 32, no. suppl 1, pp. D258–D261, 2004.
[106] M. P. Mattson, “Pathways towards and away from Alzheimer’s disease,” Nature, vol. 430, no. 7000, pp. 631–639, 2004.
[107] G. Bartzokis, “Age-related myelin breakdown: a developmental model of cognitive decline and alzheimer’s disease,” Neurobiology of aging, vol. 25, no. 1, pp. 5–18, 2004.
[108] M. Liu et al., “Network-based analysis of affected biological processes in type 2 diabetes models,” PLoS genetics, vol. 3, no. 6, p. e96, 2007.
[109] J. Davies and D. Davies, “Origins and evolution of antibiotic resistance,” Microbiology and Molecular Biology Reviews, vol. 74, no. 3, pp. 417–433, 2010.
[110] R. Elhesha, A. Sarkar, C. Boucher, and T. Kahveci, “Identification of co-evolving temporal networks,” bioRxiv, 2018.
[111] L. Gupta, D. Molfese, R. Tammana, and P. Simos, “Nonlinear alignment and averaging for estimating the evoked potential,” IEEE Transactions on Biomedical Engineering, vol. 43, no. 4, pp. 348–356, 1996.
[112] T. Barrett, S. Wilhite, P. Ledoux, C. Evangelista, I. Kim, M. Tomashevsky, K. Marshall, K. Phillippy, P. Sherman, M. Holko et al., “NCBI GEO: archive for functional genomics data sets—update,” Nucleic Acids Research, vol. 41, no. D1, pp. D991–D995, 2012.
[113] D. Szklarczyk, J. H. Morris, H. Cook, M. Kuhn, S. Wyder, M. Simonovic, A. Santos, N. T. Doncheva, A. Roth, P. Bork et al., “The string database in 2017: quality-controlled protein–protein association networks, made broadly accessible,” Nucleic Acids Research, vol. 45, pp. 362–368, 2016.
[114] S. Jozefczuk, S. Klie, G. Catchpole, J. Szymanski, A. Cuadros-Inostroza, D. Steinhauser, J. Selbig, and L. Willmitzer, “Metabolomic and transcriptomic stress response of escherichia coli,” Molecular systems biology, vol. 6, no. 1, p. 364, 2010.
BIOGRAPHICAL SKETCH
Rasha Elhesha received her B.Sc. in computer and systems engineering from Alexandria
University, Egypt, in 2011. She worked as a software developer in Egypt. In August 2014,
she started her Ph.D. program in the Computer and Information Science and Engineering
Department at the University of Florida and joined Tamer Kahveci’s Bioinformatics lab. She
received the 2016 HWCO Outstanding International Student Award and, in 2018, the Gartner
CISE scholarship. Her research area is bioinformatics in general and biological network
analysis in particular.