
System Identification methods for Reverse

Engineering Gene Regulatory Networks

by

Zhen Wang

A thesis submitted to the

School of Computing

in conformity with the requirements for

the degree of Master of Science

Queen’s University

Kingston, Ontario, Canada

October 2010

Copyright © Zhen Wang, 2010

Abstract

With the advent of high-throughput measurement technologies, large-scale gene expression data are available for analysis. Various computational methods have been introduced to analyze gene expression data and predict meaningful molecular interactions from them. Such patterns can provide an understanding of the regulatory mechanisms in cells. In the past, system identification algorithms have been extensively developed for engineering systems. These methods capture the dynamic input/output relationship of a system, provide a deterministic model of its function, and have reasonable computational requirements [68].

In this work, two system identification methods are applied to reverse engineering of gene regulatory networks. The first method is based on an orthogonal search; it selects terms from a predefined set of gene expression profiles to best fit the expression levels of a given output gene. The second method consists of a few cascades, each of which includes a dynamic component and a static component. Multiple cascades are added in parallel to reduce the difference between the estimated expression profiles and the actual ones. Gene regulatory networks can be constructed by defining the selected inputs as the regulators of the output. To assess the performance of the approaches, a temporal synthetic dataset is developed. The methods are then applied to this dataset as well as to the Brainsim dataset, a popular simulated temporal gene expression dataset [73]. Furthermore, the methods are also applied to a biological dataset from the yeast Saccharomyces cerevisiae [74]. This dataset includes 14 cell-cycle-regulated genes; their known cell cycle pathway is used as the target network structure, and the criteria 'sensitivity', 'precision', and 'specificity' are calculated to evaluate the networks inferred through these two methods. The resulting networks are also compared with two previous studies in the literature on the same dataset.


Acknowledgments

I have been extremely fortunate to have had Professor Parvin Mousavi as my supervisor during my master's studies. I sincerely thank her for her great guidance, advice, and support in both my professional and personal development. During these two years, she has been not just a supervisor but also a friend and mentor to me. Without her help, I would not have become interested in Bioinformatics or finished this thesis.

I am grateful to my committee members, Professor Janice Glasgow and Professor Dongsheng Tu, for reading and evaluating my thesis. I thank all my friends and colleagues for their support, good cheer, and the excellent atmosphere in the laboratory.

Finally, I am deeply thankful to my dear family for their unconditional love and support.


Contents

Abstract

Acknowledgments

Contents

List of Tables

List of Figures

1 Introduction
  1.1 Gene Regulatory Networks
  1.2 Motivation
  1.3 Objectives
  1.4 Contribution
  1.5 Organization of thesis

2 Background
  2.1 Basic Concepts in Molecular Biology
  2.2 Microarray Gene Expression Measurement
  2.3 Processing Microarray Gene Expression Data
  2.4 Network Reconstruction Algorithms
    2.4.1 Association Networks
    2.4.2 Boolean Networks
    2.4.3 Bayesian Networks
  2.5 System Identification Methods

3 Data and Preprocessing
  3.1 Data
    3.1.1 Temporal Synthetic Data
    3.1.2 Brainsim Songbird Dataset
    3.1.3 Yeast Saccharomyces cerevisiae Dataset
  3.2 Preprocessing
    3.2.1 Outlier Correction
    3.2.2 Missing Values

4 Methods
  4.1 Fast Orthogonal Search
    4.1.1 Orthogonal Search
    4.1.2 Fast Orthogonal Search
    4.1.3 Network Construction using FOS
  4.2 Parallel Cascade Identification
    4.2.1 Network Construction using PCI
  4.3 Assessment of Network Inferences

5 Implementation and Results
  5.1 Analysis of the Temporal Synthetic Dataset
    5.1.1 Network Inference
  5.2 Analysis of the Brainsim Songbird Dataset
    5.2.1 Network Inference for Songbird data
  5.3 Analysis of Yeast Saccharomyces cerevisiae Dataset
    5.3.1 Network Inference

6 Summary and Conclusions
  6.1 Further directions

Bibliography


List of Tables

4.1 Confusion matrix

5.1 Interaction matrix summed over 100 synthetic datasets by FOS: for the target gene in the jth column, the ijth entry denotes the number of times that the regulation from the regulator gene in the ith row was discovered across the 100 synthetic datasets. Entries in bold are the actual regulations.

5.2 Interaction matrix summed over 100 synthetic datasets by PCI: for the target gene in the jth column, the ijth entry denotes the number of times that the regulation from the regulator gene in the ith row was discovered across the 100 synthetic datasets. Entries in bold are the actual regulations.

5.3 Comparison of the networks inferred from the synthetic data using FOS and PCI

5.4 Comparison of the networks inferred from the Brainsim simulated data using FOS and PCI

5.5 Comparison of the networks inferred from the yeast Saccharomyces cerevisiae data using FOS, PCI, and two other available studies


List of Figures

2.1 (a) Double helix structure of deoxyribonucleic acid; (b) pairing rules for A, T, C, G [82]

2.2 Brief illustration of gene expression

2.3 Schematic illustration of one simple gene regulatory network

2.4 Steps of a cDNA microarray experiment

2.5 A simple Bayesian network model of five genes; there is an edge directed from A to D, so A is the parent of D and D is its child

3.1 A simple example explaining the relationship between the regulation weight matrix and the GRN

3.2 Predefined network structure for the synthetic data

3.3 Network structure of the GRN simulated in the Brainsim Songbird data

3.4 The target pathways of the 14 genes, available from KEGG

4.1 Structure of a PCI model

4.2 Structure of a multiple-input/single-output PCI model

4.3 Structure of the modified PCI model

5.1 System identification by FOS: starred points are actual system outputs, and solid lines denote the estimated system output using the identified model

5.2 System identification by PCI: starred points are actual system outputs, and solid lines denote the estimated system output using the identified model

5.3 Histograms of the number of times each regulation pair was discovered from the 100 synthetic datasets by (a) FOS and (b) PCI

5.4 The final estimated networks of the synthetic data by (a) FOS and (b) PCI; solid links are correctly discovered (TP), and dashed links are missing ones (FN)

5.5 Histogram of the top 50 significant regulations discovered from the 750 Brainsim Songbird datasets by FOS

5.6 The final estimated network of the Brainsim Songbird data using FOS; dashed lines denote regulations that FOS could not recover

5.7 Histogram of the top 50 significant regulations discovered from the 750 Brainsim Songbird datasets by PCI

5.8 The final estimated network of the Brainsim Songbird data using PCI; dashed lines denote regulations that PCI could not recover

5.9 The yeast cell cycle pathway inferred from the Spellman data using different methods: (a) FOS, (b) PCI, (c) Kim [35], and (d) Zhang [85]


Chapter 1

Introduction

1.1 Gene Regulatory Networks

Genes are the basic physical and functional units of heredity. They carry all the information relevant to what an organism is like, how it survives, and how it behaves in an environment [67]. Proteins are the building blocks that form essential parts of living cells. They are the products of genes: a gene is first transcribed into an intermediate messenger ribonucleic acid (mRNA), and the mRNA molecule is then translated into a specific protein. Genes in cells do not function individually; they are controlled through intricate interconnections of cellular components, such as proteins. The gene transcription process is controlled by a collection of proteins called Transcription Factors (TFs), which determine when and how much specific genes are expressed; it is also affected by different types of enzymes, a group of proteins that catalyze reactions [82]. These proteins are themselves the products of corresponding genes, and in turn serve as TFs or enzymes that participate in the gene expression processes of their target genes. The process of genes interacting with each other can be described as a Gene Regulatory Network (GRN). Research on GRNs can provide useful explanations of why the behavior of one gene coincides with the variations of some other genes.

GRNs are likely the most important organizational level in the cell where inter-

nal signals and the external environment are integrated in terms of corresponding

timed expression levels of genes [10]. They act as biochemical computers in cellular

processes, organizing the level of expression of each gene in the network by controlling whether and at what rate that gene will be transcribed. As a result, the types and amounts of proteins produced differ between cells, enabling the corresponding cells to function properly.

Temporal gene expression data are observations of genetic activity levels over a number of points in time. The advent of new high-throughput technologies, such as Microarrays, for acquiring gene expression data has made a wealth of molecular data available. Reverse engineering GRNs refers to the discovery of the principles and structures of GRNs using gene expression data; it has received a great deal of attention in recent years, and computational methods have been applied to mine meaningful interactions between genes.

1.2 Motivation

Reverse engineering GRNs is an important problem in Bioinformatics, and can yield remarkable improvements in the understanding of biological systems on several fronts: (i) clarification of the complex mechanisms of development and evolution in living organisms [13]; (ii) description of the underlying network structure of gene regulation pathways [78]; (iii) detection of pathway initiators that are potential causes of particular genetic diseases, and extraction of possible drug targets [26]; and (iv) provision of information on possible novel regulations for future research. Deriving a GRN from gene expression data, however, is often difficult, owing to the lack of complete knowledge of the processes and parameters of the biological system and its environment.

Numerous computational methods have been developed and investigated to construct GRNs from gene expression data. Popular reverse engineering methods include Association Networks [19, 5], Boolean Networks [31], and Bayesian Networks and Dynamic Bayesian Networks [22, 62]. These methods build upon mathematical or statistical algorithms to reconstruct networks using correlation, mutual information, or conditional dependence between genes, respectively. System identification algorithms are a category of reverse engineering methods that have been applied mainly in the engineering domain [57]. GRNs are biological systems that reflect the interconnected relationships of genes, and temporal measurements of gene expression can be obtained as time series signals. Therefore, system identification algorithms have the ability to build models that reveal the dynamic behaviors of gene regulation. They fit models of dynamic systems to temporal data, and typically represent quantitative aspects. These data-driven approaches can construct models from measured input-output data, giving the best fit to the gene expression data. The inferred models use the target gene in a network as the output and its regulating genes as the inputs. As a result, a structural gene network is obtained. Several system identification approaches using different models, including linear models [18, 79] and models consisting of ordinary differential equations [15, 64], have recently been discussed for inferring gene regulatory networks.


1.3 Objectives

In this thesis, two system identification algorithms, Fast Orthogonal Search (FOS)

and Parallel Cascade Identification (PCI), are discussed and implemented to build

dynamic models of GRNs. Both FOS and PCI were originally developed for nonlinear

system identification [37, 39], and have been applied in other engineering fields.

Interactive dynamic models of a synthetic dataset, a simulated songbird dataset, and a real biological dataset are devised through FOS and PCI. GRNs that capture the time course variations of genes based on their regulators' expressions are built for all the models. The performance of the two approaches is compared with each other, as well as with other published methods in the literature for verification.

1.4 Contribution

The primary contributions of this work are reported here:

• Two system identification algorithms, FOS and PCI, are presented for building

dynamic models that can capture genetic regulation information. To the best

of the author’s knowledge, neither FOS nor PCI has been used for this purpose

before in the literature.

• A modification of the PCI algorithm is proposed. For the case of a multiple-input/single-output system, the original PCI algorithm considers only one input signal for the dynamic system at a time, and multiple input signals are added with equal weights. The modified method, in contrast, is able to treat multiple input signals simultaneously starting from the dynamic system.


• A method for building a sparse model of gene regulation from PCI is proposed. As gene regulatory networks are known to be sparse [48], a fully connected model does not capture the biological system well.

• Three datasets are used to evaluate and compare the algorithms' performance in capturing GRNs.

– A time-delayed gene regulatory pathway of arbitrary structure was designed. Its corresponding temporal artificial dataset was generated through a stochastic function.

– A simulated temporal gene expression dataset was produced using the Brainsim simulator introduced by [73]. It has 100 genes plus another term named activity, and represents gene interactions in response to the singing behavior of a songbird.

– A biological dataset, comprising a subset of yeast Saccharomyces cerevisiae expression data that includes the levels of 14 cell-cycle-regulated genes over time, was also used.

1.5 Organization of thesis

This thesis is organized as follows. Chapter 2 reviews the fundamental concepts of molecular biology underlying GRNs. Microarray gene expression measurements and the required preprocessing approaches are discussed, and a review of related network inference algorithms is provided. Chapter 3 introduces the datasets used in this study and their required preprocessing steps. The following two chapters then give a complete description of the theory and implementation of the discussed approaches, Fast Orthogonal Search and Parallel Cascade Identification. The statistical criteria used for the evaluation of each method are also introduced, and the resulting networks are studied to illustrate the performance of the discussed algorithms. Conclusions and future directions of this research are presented in Chapter 6.

Chapter 2

Background

2.1 Basic Concepts in Molecular Biology

A cell is the most basic unit of an organism and also the smallest unit making up our bodies. There are tens of thousands of different types of cells, each of which has unique functions; however, all cells share similarities. The most important shared feature of cells is that, for almost all species¹, they contain hereditary information in the form of Deoxyribonucleic acid (DNA) molecules, and have the basic mechanisms for translating genetic messages into proteins. Proteins are the fundamental structural and functional units in cells and can act as structural components, enzyme catalysts, and antibodies [82].

DNA has a double-helix structure, shown in Figure 2.1(a), and consists of two long polymers made from repeating units called nucleotides [82]. These two polymers are complementary, and the sequence of nucleotides in one strand is completely determined by the sequence in the other strand. This feature was captured in one of science's most famous statements when Watson and Crick first presented the structure of the DNA helix in 1953. The four nucleotides of DNA, adenine (A), guanine (G), cytosine (C), and thymine (T), bond only to their complementary bases [82]: adenine in one strand can only bond with thymine in the other strand, and similarly guanine has to bond with cytosine, Figure 2.1(b) [82].

¹Some viruses have been discovered to have RNA genomes.

Figure 2.1: (a) Double helix structure of deoxyribonucleic acid; (b) pairing rules for A, T, C, G [82]

A segment of DNA, called a gene, stores genetic code. A gene consists of a long combination of the four different nucleotide bases, and the sequence of nucleotides in a gene determines the structures of its protein products. According to the central dogma of molecular biology, producing a protein from the information in a gene is a two-step process: transcription and translation. Figure 2.2 summarizes the process of expressing a protein-encoding gene [82].

The transcription process creates an equivalent messenger RNA (mRNA) copy of a portion of DNA; hence, the information in a gene is transcribed into an mRNA molecule. An RNA polymerase enzyme can recognize and bind to a specific site of the DNA molecule, which signals the initiation of transcription. In the translation step, the mRNA produced by transcription is decoded by the ribosome to make a specific amino acid chain, which later folds into a protein [82]. This complete process, in which a gene gives rise to a protein, is called gene expression.

Figure 2.2: Brief illustration of gene expression.

In the gene expression process, DNA can be compared to a recipe, since it stores the code that instructs other components of the cell. Different portions of the genes are active in different cells; as a result, their protein products can be drastically different. The types and amounts of proteins produced in each particular cell are extremely important for the cell to function properly.

The process of gene expression is controlled by a collection of proteins named transcription factors (TFs). These TFs can decide when, where, and at what rate a particular gene is expressed. Because of the involvement of different TFs, which themselves are protein products of expressed genes, genes are under regulatory control and comprise complex interactions known as Gene Regulatory Networks (GRNs) [78]. A brief description is shown in Figure 2.3: Gene1 is first transcribed into mRNA1 and then translated into Protein1, which serves as the TF of Gene2. The expression process of Gene2 is therefore determined by the product of the expression of Gene1, and Gene1 is defined as its regulator. Furthermore, the expression processes of both Gene2 and Gene3 are controlled by their common TF, Protein2, which is the protein product of the expression of Gene2. Therefore, Gene2 has a self-regulation relationship in this network, and it also functions as the regulator of Gene3. Once Protein2 binds to the specific site of DNA, transcription of Gene3 is activated.

Figure 2.3: Schematic illustration of one simple gene regulatory network.

2.2 Microarray Gene Expression Measurement

Microarrays are collections of single-stranded DNA segments deposited or synthesized on a solid surface. They can monitor the mRNA abundance of genes in a high-throughput fashion [69]. The single-stranded DNA segments are called probes and are complementary to specific RNA species based on the central dogma of molecular biology [78]. Studies have discovered that the amount of mRNA is proportional to the transcription rate of its corresponding gene [66]; therefore, the relative transcription rates of genes can be calculated through the measurement of their corresponding mRNA levels. In this section, DNA Microarray experiments are briefly reviewed, because gene expression data have been an important element in the advance of reverse engineering GRNs.

Based on the type of probes used in experiments, Microarrays can be categorized into two classes: cDNA Microarrays and oligonucleotide Microarrays [70]. The cDNA Microarray is a widely used technology in which two samples are usually analyzed simultaneously in a comparative fashion. To measure the expression levels of genes using a cDNA Microarray, mRNA is extracted from a test cell and a reference cell, then reverse transcribed into cDNA and labeled with fluorescent dyes. The test and reference samples are labeled with dyes that are activated at different frequencies, referred to as red and green respectively. The two fluorescently labeled samples are then mixed, and the mixture is hybridized on Microarray chips. Finally, the Microarrays are scanned and the resulting images are analyzed to calculate gene expression values. The steps of a cDNA microarray experiment are shown in Figure 2.4.

Figure 2.4: Steps of a cDNA microarray experiment

In oligonucleotide Microarray technology, each gene on the microarray is represented by a set of 14 to 20 short DNA sequences, called oligonucleotides, each of which consists of two probes named perfect match (PM) and mismatch (MM). The DNA sequences in every pair of PM and MM are identical, except for one nucleotide in the center of each sequence; the PM is the exact sequence of the selected fragment of the gene. In this approach, there is no need for reference samples. First, the oligonucleotide arrays are built onto microarray chips. Then mRNA is converted to fluorescently labeled cDNA, followed by hybridization of the labeled cDNA samples to the Microarray. Finally, the microarray is scanned and the resulting images are analyzed. Because the correct gene will only hybridize to the PM, while incorrect hybridization affects both PM and MM, the expression level of each gene is taken as the average difference between PM and MM [20]. The Affymetrix GeneChip is one of the most widely adopted oligonucleotide microarray technologies.
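The average-difference measure described above can be sketched in a few lines. The probe values below are illustrative, not from any real chip:

```python
# Sketch of the PM/MM average-difference expression measure: each gene
# is probed by pairs of perfect-match (PM) and mismatch (MM) sequences,
# and its expression level is taken as the mean of PM - MM over pairs.
# Probe intensities here are hypothetical.

def average_difference(pm_values, mm_values):
    """Expression level as the average of PM - MM over probe pairs."""
    if len(pm_values) != len(mm_values):
        raise ValueError("PM and MM probe lists must have equal length")
    diffs = [pm - mm for pm, mm in zip(pm_values, mm_values)]
    return sum(diffs) / len(diffs)

# A hypothetical probe set of four pairs for one gene:
pm = [120.0, 95.0, 110.0, 105.0]
mm = [40.0, 35.0, 50.0, 45.0]
print(average_difference(pm, mm))  # 65.0
```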

2.3 Processing Microarray Gene Expression Data

Due to effects arising from variations in Microarray technologies and experimental setups, preprocessing of gene expression measurements is required for more reliable data analysis. Accurate preprocessing procedures improve the comparability of expression data. Microarray data preprocessing usually includes the following steps [25]:

• Missing Values:

It is estimated that a microarray dataset has more than 5% missing values, affecting more than 60% of the genes [14]. Since many data analysis methods, such as principal component analysis, support vector machines, and artificial neural networks, require complete datasets, accurate estimation of missing values is an important preprocessing step in microarray analysis. Obviously, repetition of identical experiments can be adopted to solve the missing value issue; however, this approach is costly and time consuming [77]. A series of numerical methods have been developed to estimate missing values: (1) replacing missing values with constants; (2) replacing missing values with averages over time [3]; (3) the K-nearest neighbor replacement method [77]; (4) the Bayesian principal component analysis replacement method [59]; (5) the support vector regression impute method [80]; and (6) the least squares formulation based replacement method [34]. Considering the complexities of the different missing value estimation algorithms, simple averaging is utilized in this thesis.
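The simple-averaging strategy adopted here can be sketched as follows, assuming missing measurements in a profile are marked as `None` (the marker and profile values are illustrative):

```python
def impute_with_average(profile):
    """Replace missing entries (None) with the mean of the observed
    values in the same expression profile (simple averaging)."""
    observed = [v for v in profile if v is not None]
    if not observed:
        raise ValueError("profile has no observed values to average")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in profile]

# A hypothetical profile over five time points with one missing value:
profile = [2.0, 4.0, None, 6.0, 8.0]
print(impute_with_average(profile))  # [2.0, 4.0, 5.0, 6.0, 8.0]
```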

• Gene Selection:

Gene expression data analysis usually focuses on differentially expressed genes (DEGs). In a microarray experiment, the majority of genes have constant expression levels across time. These genes do not convey any significant information; on the contrary, they decrease efficiency and increase computational cost. As such, several methods have been developed to select significant genes. The simplest way to identify DEGs is to set a threshold value for detecting the variation of genes; statistical hypothesis tests, such as the t-test [11] and maximum likelihood analysis [27], can also be used to detect DEGs; and in fold change analysis, significant genes can be determined based on the relative increase or decrease in their expression profiles [56].
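As a minimal sketch of the fold-change style of selection (the threshold and the profiles below are arbitrary, not values used in this thesis):

```python
def fold_change(profile):
    """Ratio of the largest to the smallest expression value in a profile."""
    lo, hi = min(profile), max(profile)
    if lo <= 0:
        raise ValueError("fold change assumes strictly positive values")
    return hi / lo

def select_degs(profiles, threshold=2.0):
    """Keep genes whose fold change across time exceeds the threshold."""
    return {gene: p for gene, p in profiles.items()
            if fold_change(p) > threshold}

# Hypothetical profiles: geneA varies strongly, geneB is nearly constant.
profiles = {
    "geneA": [1.0, 2.5, 4.0, 3.0],
    "geneB": [1.0, 1.1, 0.9, 1.0],
}
print(sorted(select_degs(profiles)))  # ['geneA']
```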

• Interpolation:

A microarray gene expression dataset usually contains a much smaller number of time points than of genes. This is partly due to the time-consuming nature and cost of designing experiments and acquiring data. The accuracy of many temporal data analysis methods depends on the availability of training samples in time. Interpolation can increase the number of samples by adding new data points within the range of the original known measurements. Many interpolation methods are available in numerical analysis [53]: nearest neighbor interpolation, linear interpolation, spline interpolation, and polynomial interpolation. Appropriate interpolation can provide more reasonable data samples for analysis.
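The linear variant, for instance, can be sketched as inserting one midpoint between each pair of consecutive measurements (the series below is illustrative):

```python
def linear_interpolate(profile):
    """Insert one linearly interpolated sample between each pair of
    consecutive time points, roughly doubling the series length."""
    result = []
    for a, b in zip(profile, profile[1:]):
        result.append(a)
        result.append((a + b) / 2.0)  # midpoint on the line from a to b
    result.append(profile[-1])
    return result

samples = [1.0, 3.0, 2.0]
print(linear_interpolate(samples))  # [1.0, 2.0, 3.0, 2.5, 2.0]
```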

2.4 Network Reconstruction Algorithms

Given temporal gene expression data acquired under different experimental conditions, a model of the gene interactions can be built through different reverse engineering methods, and a gene regulatory network is thereby constructed. The GRN is represented as a graphical model whose nodes stand for a set of genes and whose connections take on different meanings in different models. Providing an accurate reverse engineering tool that captures a global view of gene regulation is a challenging topic in Systems Biology.


Many reverse engineering techniques have been proposed for building gene regulatory networks, and following different criteria they can be grouped in several ways. Gardner and Faith [23] classified them by their underlying mathematical graphical models into four categories: Association Networks, Boolean Networks, Bayesian Networks, and Differential Equations. Karlebach and Shamir [29] roughly divided the various computational models for reverse engineering GRNs into three classes based on their learning strategies: logical models, which allow one to obtain a basic understanding; continuous models, which capture behaviors that depend on finer timing and exact molecular concentrations; and single-molecule level models, which follow the observation that the functionality of regulatory networks is often affected by noise. Another broad classification, into deterministic models and stochastic models, has also been proposed [68]. Sima [72] reviewed different network inference methods in two classes based on whether or not they can infer dynamical interactions between genes. In this section, representative reverse engineering methods, namely Association Networks, Boolean Networks, and Bayesian Networks, and their advantages and disadvantages are briefly reviewed. Throughout, the notation Genei describes a gene associated with a random variable Xi, whose expression level at time point t is denoted Xi(t), t = 0, . . . , T.

2.4.1 Association Networks

Association networks are amongst the simplest models for reverse engineering GRNs.

They represent GRNs using an undirected graph with edges weighted by similarities

or relevances. Popular relevance measures are covariance-based measures such as

Pearson correlation, and entropy-based measures such as mutual information.


Pearson correlation, developed by Karl Pearson, is one of the most common and

most useful measures of the linear dependence between two time series variables.

It is a coefficient calculated by dividing the covariance of the two variables by the

product of their standard deviations. The value of the coefficient ranges between −1

and 1. The closer the coefficient is to either −1 or 1, the stronger the correlation

between the variables. If Pearson correlation coefficient is 0, these two variables are

linearly independent. To calculate the Pearson correlation coefficient between two genes Gene1 and Gene2, the following formula is used:

$$\rho(X_1, X_2) = \frac{\sum_{t=0}^{T}\left(X_1(t) - \bar{X}_1\right)\left(X_2(t) - \bar{X}_2\right)}{T\,\sigma_{X_1}\sigma_{X_2}}, \qquad (2.1)$$

where $\bar{X}_i$ and $\sigma_{X_i}$ are the mean and the standard deviation of the random variable $X_i$, $i = 1, 2$.

Pearson correlation only attains its extreme values when the two variables are linearly related. In contrast, mutual information can detect nonlinear dependencies, and it is frequently adopted as an index to quantify the mutual dependence of two variables. The mutual information of two random variables $X_1$ and $X_2$ associated with two genes is

$$I(X_1; X_2) = \sum_{t=0}^{T} p\left(X_1(t), X_2(t)\right)\log\frac{p\left(X_1(t), X_2(t)\right)}{p\left(X_1(t)\right)p\left(X_2(t)\right)}, \qquad (2.2)$$

where $p(\cdot)$ is the probability, estimated from the frequencies of the corresponding variable. The greater the mutual information, the stronger the dependence between the two variables; if the mutual information is zero, the two variables are independent.

Both Pearson correlation and mutual information have long been used in Systems Biology to infer gene regulatory networks. D'haeseleer et al. [19] defined a distance measure based on residual variance as $d(X_1, X_2) = 1 - \rho(X_1, X_2)^2$, where $d = 0$ if the variables are perfectly correlated and $d = 1$ if they are uncorrelated. Based on mutual information, a method called ARACNE was proposed by Basso et al. [5], and it has been used for inferring genetic networks in human B cells. Simplicity and low computational cost are the major advantages of association networks. The limitations of such models are that they cannot reflect causality and do not take into account that multiple genes may be jointly involved in the regulation.
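As a minimal illustration of the two relevance measures (a sketch assuming already-discretized expression levels for the mutual-information estimate; the helper names are our own, not from the thesis):

```python
import math

def pearson(x, y):
    # Pearson correlation: covariance divided by the product of standard deviations.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

def mutual_information(x, y):
    # Plug-in estimate of eq (2.2): probabilities are the observed frequencies
    # of the (discretized) expression levels.
    n = len(x)
    px, py, pxy = {}, {}, {}
    for a, b in zip(x, y):
        px[a] = px.get(a, 0) + 1 / n
        py[b] = py.get(b, 0) + 1 / n
        pxy[(a, b)] = pxy.get((a, b), 0) + 1 / n
    return sum(p * math.log(p / (px[a] * py[b])) for (a, b), p in pxy.items())

x = [0, 1, 2, 3, 4]
y = [1, 3, 5, 7, 9]           # perfectly linear in x
print(pearson(x, y))           # 1.0 for a perfect linear relationship
```

Note that the mutual information of two identical binary profiles is positive, while two independent-looking profiles give a value near zero, matching the interpretation in the text.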

2.4.2 Boolean Networks

Boolean Networks were first proposed by Kauffman [31, 30] for the purpose of modeling gene regulation, and since then they have been extensively investigated in Systems Biology: (1) the use of logical structure to study the qualitative properties of continuous biochemical control networks was further discussed in [33, 32]; (2) a model based on Boolean genetic networks was built as a conceptual framework to identify new drug targets for cancer treatment [26]; and (3) Liang et al. [50] described an algorithm for inferring a genetic network from time series of gene expression patterns using the Boolean network model, and Akutsu et al. devised a simpler algorithm for the same problem [2].

A Boolean network uses binary variables $X_i \in \{0, 1\}$ that denote the transcript level of $Gene_i$ as "off" or "on", and update functions $F^B_i$ built from the simple Boolean operations "AND", "OR" and "NOT". A simple example is $X_i(t+1) = F^B_i(X_1(t), \ldots, X_N(t))$. The goal of reverse engineering a Boolean network is to find the Boolean function $F^B_i$ for each gene so that the gene expression profile can be explained by this model. Two primary strategies have been proposed to learn the connectivity of genes in Boolean networks. The first computes the mutual information between sets of two or more genes and tries to find the smallest set of input genes that provides complete information on the output gene [50]. The other looks for the most parsimonious set of input genes whose expression variations are coordinated or consistent with the output gene [2].
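To make the consistency-based strategy concrete, here is a small sketch with a hypothetical three-gene network and a made-up Boolean rule (the names and the rule are illustrative, not from the thesis):

```python
# Hypothetical rule: X1(t+1) = X2(t) AND NOT X3(t); X2 and X3 are held constant here.
def update(state):
    x1, x2, x3 = state
    return (int(x2 and not x3), x2, x3)

def consistent(fn, series):
    # The parsimony-based strategy keeps a candidate Boolean function only if
    # it reproduces every observed state transition in the time series.
    return all(fn(series[t]) == series[t + 1] for t in range(len(series) - 1))

series = [(0, 1, 0), (1, 1, 0), (1, 1, 0)]
print(consistent(update, series))  # True: the rule explains all transitions
```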

In contrast to association networks, Boolean networks successfully capture the dynamics of gene regulation. However, Boolean networks are limited because changes in gene expression levels over time cannot be adequately represented by only two states, and the discretization of continuous gene expression levels into binary data is not trivial. Furthermore, solving Boolean networks requires a large amount of experimental data because the model places no constraints on the form of the Boolean interaction functions [23]. To determine a complete set of Boolean functions from data, all possible combinations of input expression have to be considered; for a fully connected Boolean network with $N$ genes, approximately $2^N$ data points are required to infer all Boolean functions [17], since each gene can independently be either "off" or "on". Both association networks and Boolean networks are simple approaches to modeling gene regulation [6], compared with the Bayesian networks and system identification methods discussed next.

2.4.3 Bayesian Networks

A Bayesian network (BN) is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph. Such a model consists of two components: the structure $G$, a directed acyclic graph, and the parameters $\Theta$, the conditional distribution of each variable given its parents. In the graphical structure of the BN given in Figure 2.5, the nodes stand for genes $A, B, C, D, E$ and the edges correspond to conditional dependencies between genes. The absence of an edge between two genes means that those genes are conditionally independent given their parent genes; for example, $B$ and $D$ are conditionally independent given their parent genes $A$ and $E$. BNs follow the Markov assumption that each variable is conditionally dependent on its parents only. The joint distribution over the set of genes can therefore be written as the product of the probability of each gene given its parents, $P(X_1, \ldots, X_N) = \prod_{i=1}^{N} P\left(X_i \mid Pa(X_i)\right)$. Discrete BNs cannot deal with continuous values directly; therefore, the probability of each gene is calculated from the frequencies of its discretized expression levels over time.

Figure 2.5: A simple Bayesian network model with five genes; there is an edge directed from A to D, so A is the parent of D and D is its child.
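The product-form factorization can be sketched symbolically. The parent sets below are partly hypothetical: the text only fixes that A is a parent of D and that B and D share parents A and E, so C's parent is invented purely for illustration.

```python
# Hypothetical parent sets loosely consistent with the Figure 2.5 description.
parents = {"A": [], "E": [], "B": ["A", "E"], "D": ["A", "E"], "C": ["D"]}

def joint_factorization(parents):
    # P(X1, ..., XN) = prod_i P(Xi | Pa(Xi))
    terms = []
    for gene in sorted(parents):
        pa = parents[gene]
        terms.append(f"P({gene}|{','.join(pa)})" if pa else f"P({gene})")
    return " * ".join(terms)

print(joint_factorization(parents))
# P(A) * P(B|A,E) * P(C|D) * P(D|A,E) * P(E)
```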

The problem of learning BNs thus reduces to learning these two components: structure learning and parameter learning. Constructing a BN with a score-based approach means defining a score function based on the posterior probability of the BN given the data, which is then used as the criterion for selecting the optimal set of parents for each variable. However, this selection procedure is computationally costly, because there are too many possible local structures. Several search algorithms, such as greedy hill climbing [9], simulated annealing [49], Markov chain Monte Carlo [58] and expectation maximization [76], have been proposed for learning BNs. According to the scores of the candidate structures proposed by the different search algorithms, the network $G$ with the greatest posterior probability $P(G|D)$ is selected.

Dynamic Bayesian Networks (DBNs), unlike BNs, use temporal gene expression data for constructing causal relationships among genes. Similar to BNs, the first-order Markov assumption also holds for DBNs. Therefore, the parents of each gene are selected using information derived from gene expression at the same or the previous time point, which greatly reduces the complexity of DBN learning. As a result, the structure of a DBN only represents direct associations between genes.

Current methods for DBN learning can be categorized into two major groups: constraint-based methods and score-based methods [75]. Constraint-based methods determine conditional independencies and dependencies between genes based on statistical tests, and provide satisfactory results for sparse networks [7]. Score-based methods treat DBN learning as an optimization problem. Such methods devise a scoring function to evaluate candidate network structures based on the probability of the structure given the temporal expression data; they search the space of possible network structures and select the optimal one [24].

Both BNs and DBNs have been successfully applied to reverse engineering GRNs [21, 84, 24, 35, 85]. BNs are not able to reflect the causality or dynamic information in temporal gene expression data. DBNs can offer a solution; however, their complexity and computational cost are a major bottleneck for analyzing continuous or large datasets [28].


2.5 System Identification Methods

In this thesis, the focus is on one category of methods for reverse engineering gene regulatory networks: system identification algorithms. There is no standard definition of system identification methods for reverse engineering gene regulatory networks. System identification is a term in mathematics and engineering that refers to building dynamic models from measured data. Inspired by systems engineering and the four categories reviewed in [23], we formulate a definition of system identification algorithms for reverse engineering GRNs, based on the key properties that distinguish differential equations from the other three categories: association networks, Boolean networks, and Bayesian networks. A method that (1) is a dynamic system capable of dealing with continuous temporal expression data, (2) has a deterministic function made up of the expression levels of multiple input genes, and (3) is a quantitative system that can describe the significance of each regulator's effect through the coefficients of that deterministic function, is called a system identification method. The differential equation model is clearly an example of a system identification method. System identification algorithms can be promising tools for the analysis of genetic systems, as they allow the behavior of target genes to be described as a function of their source genes.

Several applications of system identification algorithms to the inference of GRNs have been discussed in the literature, including linear modeling [18, 79] and ordinary differential equations [15, 64].

In a linear model, $Gene_i$ is modeled as

$$X_i = \beta_0 + \sum_{j \neq i} \beta_j X_j, \qquad (2.3)$$

where the regression coefficients $\beta_j$ are chosen to minimize the least-squares error of the fit. If $X_j$ is replaced by $\phi(X_j)$, where $\phi$ is a nonlinear function, the model becomes a nonlinear one. To model the dynamics of gene expression data, eq(2.3) can be written as

$$X_i(t) = \beta_0 + \sum_{j \neq i} \beta_j X_j(t-1). \qquad (2.4)$$

Such a model assumes that the expression level of one gene at time point $t$ depends on a weighted linear sum of the expression levels of its regulator genes at the previous time point $t-1$. One property of linear models is that each regulator contributes to the output additively, independently of the rest of the regulators [29]. Linear models do not require prior knowledge about regulatory mechanisms. A series of articles in the literature use linear modeling to construct GRNs [4, 12, 18, 81, 83].
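A least-squares fit of the dynamic linear model in eq(2.4) can be sketched with numpy on synthetic data; the regulator weights and noise level below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 50, 3
regs = rng.normal(size=(T, N))            # expression of candidate regulator genes

# Hypothetical ground truth: the output gene at time t is driven by regulators
# 2 and 3 at time t-1, with a small amount of noise.
y = 0.5 + regs[:-1] @ np.array([0.0, 2.0, -1.0]) + 0.01 * rng.normal(size=T - 1)

# Regress X_i(t) on the regulators at t-1 plus an intercept (eq 2.4).
A = np.hstack([np.ones((T - 1, 1)), regs[:-1]])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(coef, 2))                   # close to [0.5, 0.0, 2.0, -1.0]
```

With low noise, the fitted coefficients recover the true weights, and the near-zero coefficient flags a non-regulator.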

Ordinary differential equations (ODEs) are amongst the most popular formalisms for modeling dynamic systems in science and engineering, and have also been used for reverse engineering GRNs [15, 64]. In an ODE model of gene expression profiles, the regulatory interactions take the form of functional and differential relations between the profiles. More specifically, the ODE has the mathematical form

$$\frac{dX_i}{dt} = \alpha_i + f_i(X), \qquad i = 1, \ldots, N, \qquad (2.5)$$

where $f_i$ is the function corresponding to $Gene_i$, and $X$ is the matrix of all the gene expression profiles of $Gene_1, \ldots, Gene_N$. ODEs can also take into account the time lag arising from the time required for regulation: $X_i$ on the left-hand side of eq(2.5) is replaced with $X_i(t)$ and $X$ on the right-hand side with $X(t-1)$. Since the functions $f_i$ are not fixed, different studies have used different forms, such as sigmoidal functions in [81] and linear functions in [8]. ODEs provide detailed information about the dynamics of gene expression data.
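A forward-Euler discretization gives a minimal sketch of simulating eq(2.5). The linear interaction function and its weights below are assumptions, since the thesis leaves the $f_i$ unspecified:

```python
import numpy as np

def euler_simulate(alpha, f, x0, steps, dt=0.1):
    # Forward-Euler integration of dX/dt = alpha + f(X) (eq 2.5).
    X = [np.asarray(x0, dtype=float)]
    for _ in range(steps):
        X.append(X[-1] + dt * (alpha + f(X[-1])))
    return np.array(X)

# Toy two-gene system with a linear interaction function f(x) = W x.
W = np.array([[-1.0, 0.5],
              [0.0, -1.0]])
traj = euler_simulate(alpha=np.array([1.0, 1.0]), f=lambda x: W @ x,
                      x0=[0.0, 0.0], steps=200)
print(traj[-1].round(2))  # approaches the steady state where alpha + W x = 0
```

Here the trajectory settles at the fixed point $[1.5, 1.0]$ of $\alpha + Wx = 0$, illustrating how the continuous dynamics determine the expression profiles.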

Fast Orthogonal Search (FOS) uses an orthogonal search to identify the regulators that are significant in describing the output. It iteratively searches a given candidate function set, selecting and adding the most significant function term to build up the model. Parallel Cascade Identification (PCI) utilizes a number of cascades, each of which is a smaller system, to solve the system identification problem. The difference between the system output and the first cascade's output is treated as the output of a new system for which a second cascade is added. The difference is again computed and another cascade is added; this process continues until a desired approximation error is reached. These two system identification algorithms have been extensively applied in many different fields, but not in reverse engineering gene regulatory networks. FOS has been applied to estimate Raman spectra [42], to detect broken rotor bars in motors [63], to estimate AC induction motors [52], to select features for computer-aided diagnosis of breast cancer [65], and to estimate optimal joint angles for upper-limb Hill muscle models [55]. PCI is also a popular method that has been studied in signal classification [44] and in predicting clinical outcome or metastatic status [40, 41]. In particular, both have been used to analyze genetic data: FOS to predict the response of multiple sclerosis patients to therapy [54], and PCI to classify and predict protein families [43, 46]. The two algorithms discussed in this thesis, FOS and PCI, can be considered two particular linear models and, if self-regulation is permitted, can also be considered two particular ordinary differential equation models.

Chapter 3

Data and Preprocessing

To evaluate the proposed approaches for reverse engineering gene regulatory networks, three different datasets are employed. First, a temporal synthetic dataset is developed and used for evaluating the performance of FOS and PCI. Second, the two methods are applied to the songbird data, a simulated temporal gene expression dataset developed by Smith et al. [73]. This dataset includes gene regulatory information in response to the singing behavior of a songbird; since its gene regulatory network is known, it is a good benchmark for evaluating reverse engineering methods. Finally, a real biological dataset from the yeast Saccharomyces cerevisiae cell cycle is used. This dataset is a subset, including 14 genes, from a study by Spellman et al. [74]. The cell-cycle pathway of these 14 genes is available in KEGG1. The Saccharomyces cerevisiae yeast data has been studied both biologically and with computational methods in the literature, which provides a great deal of information for evaluating the performance of the proposed methods. These three datasets are referred to as the synthetic data, the songbird data, and the yeast data in the remainder of this thesis.

1KEGG: Kyoto Encyclopedia of Genes and Genomes is a bioinformatics resource that stores genomic and molecular knowledge.

In this chapter, these datasets are introduced and the necessary preprocessing steps are explained prior to further analysis of the data.

3.1 Data

3.1.1 Temporal Synthetic Data

One time-delayed gene regulatory network of an arbitrary structure is modeled to

assess how well FOS and PCI can be used for learning the genetic connections. Based

on the network, a regulation weight matrix is generated and used to simulate tem-

poral expression data. All simulations are done using MATLAB. After the temporal

expression dataset is obtained, both FOS and PCI are used to learn it and construct

two estimated networks, respectively. The calculated networks are then compared

with the actual network to evaluate their performances.

An important assumption made in generating this dataset is that the expression level of a regulator gene at time point $t$ determines the expression value of its target gene only at the next time point $t+1$. The following stochastic formula holds:

$$X_{t+1} = R\,X_t + E, \qquad t = 0, \ldots, T, \qquad (3.1)$$

where $X_{t+1}$ and $X_t$ are column vectors denoting the expression levels of all genes at time points $t+1$ and $t$, $E$ is a vector of system noise, and $R$ is the regulation weight matrix representing gene regulations.

If there is a regulatory relationship directed from source gene $i$ to target gene $j$, the entry $R_{ij}$ of $R$ is a nonzero number; otherwise it is zero. It is not difficult to notice

that a regulation weight matrix can be converted to a GRN, and vice versa. A simple example is shown in Figure 3.1: if a regulation weight matrix $R$ is a $3 \times 3$ matrix defined with only three nonzero entries at $R_{13}$, $R_{21}$ and $R_{23}$, it can be converted to a network with three nodes standing for the three genes $Gene_1$, $Gene_2$ and $Gene_3$, and three directed edges from $Gene_1$ to $Gene_3$, $Gene_2$ to $Gene_1$, and $Gene_2$ to $Gene_3$. Conversely, if the network is given with three nodes and three regulation edges, it can be written in the corresponding matrix format as well.

Figure 3.1: A simple example explaining the relationship between a regulation weight matrix and a GRN.
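The matrix-to-network conversion in this example can be sketched directly:

```python
def matrix_to_edges(R):
    # Entry R[i][j] != 0 encodes a regulation from source gene i+1 to target gene j+1.
    return [(i + 1, j + 1) for i, row in enumerate(R)
            for j, w in enumerate(row) if w != 0]

# The 3x3 example from the text: nonzero entries at R13, R21 and R23.
R = [[0, 0, 1],
     [1, 0, 1],
     [0, 0, 0]]
print(matrix_to_edges(R))  # [(1, 3), (2, 1), (2, 3)]
```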

To obtain temporal synthetic data, a GRN including nine genes and 11 links is defined, as shown in Figure 3.2. As explained above, a regulation weight matrix can be generated from the given structure by randomly assigning a positive or negative number to each nonzero entry $R_{ij}$, indicating the weight of the activation or inhibition relationship from source gene $i$ to target gene $j$, respectively.

Figure 3.2: Predefined network structure for the synthetic data

The regulation weight matrix $R_0$ used for generating the synthetic data is a $9 \times 9$ matrix whose 11 nonzero entries, with values 1, 1, $-0.6$, $-1$, $-1$, 1, $-4$, 1, 1, $-2$ and 1, are placed according to the 11 edges in Figure 3.2; all other entries of $R_0$ are zero.

To simulate the synthetic dataset, all gene expression levels are first initialized to zero. Because $Gene_4$ has no regulators, its expression levels are assigned a series of random real numbers. The expression levels of all other genes are generated from their regulators and the corresponding weights in $R_0$ using eq(3.1), where the values of the noise $E$ are produced by the MATLAB command randn, which generates samples of a standard normally distributed random variable. The expression values of all genes are generated recursively over 150 time points. To exclude the initial transient response of the regulation, only the data from the 50th time point onward are kept for further study. One hundred synthetic datasets are simulated by repeating this procedure; the only differences among them are the realizations of the noise $E$ in eq(3.1).
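The simulation procedure of eq(3.1) can be sketched in Python (numpy in place of MATLAB; the 3-gene weight matrix below is a toy stand-in, not the thesis's 9-gene $R_0$):

```python
import numpy as np

def simulate(R, T=150, keep_from=50, noise=1.0, seed=0):
    # Recursively generate X_{t+1} = R X_t + E (eq 3.1), starting from zeros,
    # and discard the first keep_from points to remove the initial transient.
    rng = np.random.default_rng(seed)
    N = R.shape[0]
    X = np.zeros((T + 1, N))
    for t in range(T):
        X[t + 1] = R @ X[t] + noise * rng.standard_normal(N)
    return X[keep_from:]

R = np.array([[0.0, 0.0, 0.5],
              [0.3, 0.0, -0.4],
              [0.0, 0.0, 0.0]])
data = simulate(R)
print(data.shape)  # (101, 3): 101 retained time points for 3 genes
```

Repeating the call with different seeds yields replicate datasets that differ only in the noise realizations, mirroring how the one hundred synthetic datasets are produced.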


3.1.2 Brainsim Songbird Dataset

To provide a suitable way of evaluating network inference algorithms, Smith et al. [73] designed the Brainsim simulator2 to generate data representing a complex biological system. Brainsim models the vocal communication system of the songbird brain. The brain of a songbird is modeled as five regions, and the expression levels of one hundred genes as well as the activity level in each of these regions are simulated. The bird exhibits a behavior with two possible states, 0 or 1, representing "Silence" or "Singing".

The singing behavior of a songbird causes a variation in the activity level, which directly affects the expression levels of the genes involved in the network. The gene regulatory network in every region contains 100 genes; however, only 10 of these genes are connected with each other and respond to the singing behavior. Two of these ten genes, named Gene1 and Gene4, are directly affected by the activity level, and they affect the expression levels of the remaining eight genes, as shown in Figure 3.3. The remaining 90 genes are irrelevant and can be considered as noise.

The expression levels of the ten relevant genes at each time point are determined by the expression levels of their regulators, noise, and a degradation factor. The expression levels of the 90 irrelevant genes randomly fluctuate or attenuate within the lower and upper expression bounds, 0 to 50. Since noise is modeled in the simulator, every generated gene expression dataset differs slightly from the previous one while reflecting the same gene regulatory network. The activity and gene expression data points are sampled at intervals of ten time steps between time steps 90 and 280; therefore, a dataset consists of the expression levels of 100 genes over 20 time points. To ensure the robustness of the analysis, 750 such datasets are generated.

2The Brainsim simulator and the songbird data are available online http://biology.st-andrews.ac.uk/vannesmithlab/downloads.html.

Figure 3.3: Network Structure of the GRN simulated in Brainsim Songbird Data

3.1.3 Yeast Saccharomyces Cerevisiae Dataset

Since 1998, when Spellman et al. published the yeast Saccharomyces cerevisiae dataset in their article [74], many computational methods have been applied to study this data. To demonstrate the applicability of the methods discussed in this study, a subset of the yeast Saccharomyces cerevisiae microarray time series dataset including 14 genes, FUS3, SIC1, FAR1, CDC6, CDC20, CDC28, CLN1, CLN2, CLN3, CLB5, CLB6, SWI4, SWI6 and MBP1, is used. The details of the cell cycle control of this subset are well known, as shown in Figure 3.4. Moreover, this subset of the data has been extensively explored before, allowing for a comparison with results in the literature [35] and [85].

These 14 genes are involved in the early cell cycle of the yeast Saccharomyces cerevisiae (budding yeast). The cell cycle is the series of events that takes place in a cell leading to its division and duplication [82]. In yeast, it is accomplished through a reproducible sequence of events, DNA replication (S phase) and mitosis (M phase), separated temporally by gaps, the G1 and G2 phases. In the G1 phase, CDC28 associates with CLN1, CLN2 and CLN3, while CLB5 and CLB6 regulate CDC28 during the S, G2 and M phases [1]. The activity of CLN3/CDC28 is required for cell cycle progression to start. When the level of CLN3/CDC28 accumulates beyond a certain threshold, SWI4/SWI6 and MBP1/SWI6 are activated, promoting transcription of CLN1 and CLN2 [1]. CLN1/CDC28 and CLN2/CDC28 promote the activation of other associated kinases, which drives DNA replication. SIC1 and FAR1 are the substrates and inhibitors of CDC28. CDC6 and CDC20 affect the cell division control proteins. Mitogen-activated protein kinase affects this progression through FUS3.

The Kyoto Encyclopedia of Genes and Genomes (KEGG) contains the current knowledge of molecular and genetic pathways based on experimental observations in organisms. A KEGG regulatory pathway represents the current knowledge of protein and gene interaction networks [60]. The structure of the KEGG pathway of the above-mentioned 14 genes is given in Figure 3.4, and it is considered the target network in this thesis.

The dataset generated by Spellman et al. [74], available online3, contains three time series which were measured using different cell synchronization methods: α-factor-based arrest (referred to as alpha; 18 time points at 7-minute intervals over 119 minutes), size-based synchronization (elu; 14 time points at 30-minute intervals over 390 minutes), and arrest of a cdc15 temperature-sensitive mutant (cdc15; 24 time points over 290 minutes, the first 4 and last 3 of which are at 20-minute intervals and the rest at 10-minute intervals). The alpha dataset is used and studied in more detail, as it was also used in two previous studies [35, 85].

3Data is available online http://genome-www.stanford.edu/cellcycle/

Figure 3.4: The Target Pathways of these 14 genes available from KEGG

3.2 Preprocessing

In order to remove systematic bias in the datasets, the following preprocessing steps are necessary to prepare the data for later analysis [51]:

• Removing outliers

• Replacing missing values


3.2.1 Outlier Correction

Outliers in the gene expression data are values that lie far away from most of the other values; such entries have a high probability of having been incorrectly measured. To discover the outliers, the statistical assumption is employed that the expression levels of a gene across experiments should lie within twice the standard deviation, σ, of the mean, μ. All expression values greater than μ + 2σ or less than μ − 2σ are therefore considered outliers. Detected outliers of gene i are removed and replaced by the mean μi of its expression values over the experiments. There are 100 replicates of the synthetic data (750 of the songbird data), and fewer than 2% (1.8%) of the values were detected as outliers over all experiments. Therefore, the effects of outliers can be ignored.
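The μ ± 2σ rule can be sketched as follows (the population standard deviation is assumed here; the thesis does not specify which estimator was used):

```python
def correct_outliers(values):
    # Values outside [mu - 2*sigma, mu + 2*sigma] are replaced by the mean.
    n = len(values)
    mu = sum(values) / n
    sigma = (sum((v - mu) ** 2 for v in values) / n) ** 0.5
    return [mu if abs(v - mu) > 2 * sigma else v for v in values]

print(correct_outliers([1.0] * 9 + [100.0]))
# the extreme value 100.0 is replaced by the mean 10.9; the rest are kept
```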

3.2.2 Missing Values

The yeast Saccharomyces cerevisiae data in our studies have several missing values, which could be due to unreliable measurements at certain time points. The other two datasets, the synthetic data and the songbird data, are derived from computational simulation and can avoid missing values by setting appropriate parameters. In this work, the mean of each gene's expression values over time is used to fill in the missing entries in the yeast expression data, as the mean is a statistically sound measure and easy to implement.
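Mean imputation over time can be sketched as:

```python
def impute_missing(profile):
    # Replace missing entries (None) by the mean of the observed values over time.
    observed = [v for v in profile if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in profile]

print(impute_missing([2.0, None, 4.0, 6.0]))  # [2.0, 4.0, 4.0, 6.0]
```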

Chapter 4

Methods

In this chapter, Fast Orthogonal Search (FOS) and Parallel Cascade Identification

(PCI) are introduced for reverse engineering of gene regulatory networks, and their

implementation is discussed.

To reverse engineer a network, one gene is studied at a time and treated as the system output, while the remaining genes are considered system inputs. Through the proposed algorithms, significant input genes can be selected from the pool of all possible ones and used as the regulators of the corresponding output to build a network. Both FOS and PCI were developed for system identification [37, 39]. They have been applied to predict the response of multiple sclerosis patients to therapy, using FOS [54], and to classify and predict protein families, using PCI [43, 46].

4.1 Fast Orthogonal Search

Fast Orthogonal Search was developed to identify a model by searching through a set of pre-designated candidate functions and iteratively selecting the term that produces the maximum reduction of the mean square error (MSE) of the model [37, 38]. Unlike traditional orthogonal search algorithms, e.g. [36], the search procedure in FOS avoids calculating the actual values of the orthogonal terms, which greatly speeds up the approximation procedure. It was shown that, compared with the orthogonal search algorithm by Desrochers [16], whose computational cost is proportional to the square of the number of candidate functions, FOS depends linearly on the number of candidate functions [37].

4.1.1 Orthogonal Search

An approximation of a dynamic system over $t = 0, \ldots, T$ can be written as

$$y(t) = F[y(t-1), \ldots, y(t-K), x(t), \ldots, x(t-L)] + e(t), \qquad t = 0, \ldots, T, \qquad (4.1)$$

where $y(t)$ is the system output, $F$ is a polynomial function, $x(t)$ is the input, $e(t)$ is the error, and $K$ and $L$ are the maximum time delays of the output and input, respectively. This equation can be rewritten in a more concise form:

$$y(t) = c + \sum_{m=1}^{M} a_m p_m(t) + e(t), \qquad t = 0, \ldots, T, \qquad (4.2)$$

where $c$ is a constant, $p_m(t)$ for $m = 1, 2, \ldots, M$ are the non-orthogonal basis functions selected to be added to the model, and $a_m$ are the associated coefficients that best fit the output. The basis functions $p_m(t)$ have the form

$$p_m(t) = y(t-k_1) \cdots y(t-k_i)\, x(t-l_1) \cdots x(t-l_j), \qquad m \geq 1, \qquad (4.3)$$

where $1 \leq k_1, \ldots, k_i \leq K$, $i \geq 0$ and $0 \leq l_1, \ldots, l_j \leq L$, $j \geq 0$.


Through Gram-Schmidt orthogonalization [61], eq(4.2) can be rewritten as

$$y(t) = c + \sum_{m=1}^{M} g_m w_m(t) + e(t), \qquad t = 0, \ldots, T, \qquad (4.4)$$

where the $w_m(t)$, $m = 1, \ldots, M$, are orthogonal functions over the data record and the $g_m$ are the orthogonal expansion coefficients, achieving a least-squares fit. The constant $c$ can be considered a zero-order function that equals 1, with coefficient $g_0 = c$; that is, $c = g_0 w_0(t)$ where $w_0(t) = 1$ for all $t$. Since the $w_m(t)$ are mutually orthogonal over the data record and derived from the $p_m(t)$, the orthogonal search algorithm iteratively constructs each function to be orthogonal to all previously selected terms,

$$w_m(t) = p_m(t) - \sum_{r=0}^{m-1} \alpha_{mr}\, w_r(t), \qquad m = 1, \ldots, M,$$

where $\alpha_{mr} = \overline{p_m(t)\, w_r(t)} \,/\, \overline{w_r^2(t)}$.1 Orthogonal search is thus an efficient way to select model terms for models of the above form.
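The Gram-Schmidt step can be sketched numerically, treating the time average as an inner product over the data record (a numpy sketch of the classical construction, not the FOS implementation, which avoids forming the $w_m$ explicitly):

```python
import numpy as np

def gram_schmidt(P):
    # Orthogonalize candidate functions (rows of P, sampled over the record):
    # w_m = p_m - sum_r alpha_{mr} w_r, with alpha_{mr} = mean(p_m*w_r)/mean(w_r^2).
    W = []
    for p in P:
        w = p.astype(float).copy()
        for wr in W:
            alpha = np.mean(p * wr) / np.mean(wr * wr)
            w -= alpha * wr
        W.append(w)
    return np.array(W)

rng = np.random.default_rng(1)
P = rng.normal(size=(3, 100))        # three candidate functions over 100 samples
W = gram_schmidt(P)
print(abs(np.mean(W[0] * W[1])) < 1e-8)  # True: orthogonal over the record
```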

On the other hand, finding the optimal $a_m$ in eq(4.2) that minimize the mean square error (MSE) of the system,

$$\text{error} = \overline{\left(y(t) - c - \sum_{m=1}^{M} a_m p_m(t)\right)^2}, \qquad (4.5)$$

is equivalent to finding the optimal $g_m$ in eq(4.4) that minimize its MSE,

$$\text{error} = \overline{\left(y(t) - \sum_{m=0}^{M} g_m w_m(t)\right)^2} = \overline{y^2(t)} - \sum_{m=0}^{M} g_m^2\, \overline{w_m^2(t)}, \qquad (4.6)$$

due to the mutual orthogonality of the $w_m$. To find the optimal $g_i$ that best fits

1The over-bar in Section 4.1 always denotes the time average over the data, from time $R = \max(K, L)$ to $t = T$, where $T$ is the length of the time series.

the data, we take the first derivative of the MSE in eq(4.6) with respect to $g_i$ [71]:

$$\text{error}' = \left\{\overline{\left(y(t) - \sum_{m=0}^{M} g_m w_m(t)\right)^2}\right\}'
= 2\,\overline{\left(y(t) - \sum_{m=0}^{M} g_m w_m(t)\right)\left(-w_i(t)\right)}
= 2\left\{-\overline{y(t)\, w_i(t)} + g_i\, \overline{w_i^2(t)}\right\}, \qquad (4.7)$$

where the cross-terms vanish by orthogonality. Setting eq(4.7) to 0, the value of $g_m$ is given by

$$g_m = \frac{\overline{y(t)\, w_m(t)}}{\overline{w_m^2(t)}}, \qquad m = 0, \ldots, M. \qquad (4.8)$$

Now the coefficients $a_m$ in eq(4.2) can be calculated as

$$a_m = \sum_{i=m}^{M} g_i \upsilon_i, \qquad (4.9)$$

where $\upsilon_m = 1$ and $\upsilon_i = -\sum_{r=m}^{i-1} \alpha_{ir} \upsilon_r$ for $i = m+1, \ldots, M$.

It can be shown that the reduction in MSE from adding any given candidate function is readily obtained from the norm of the corresponding orthogonal function and the orthogonal expansion coefficient. Assume that $M$ candidate function terms $p_1(t), \ldots, p_M(t)$ have already been selected to estimate the output, and a further term $a_{M+1} p_{M+1}(t)$ is to be added to the right side of eq(4.2), i.e., a corresponding orthogonal term $g_{M+1} w_{M+1}(t)$ is to be added to the right side of eq(4.4). The MSE of the model will then be reduced by

$$Q(M+1) = g_{M+1}^2\, \overline{w_{M+1}^2(t)}. \qquad (4.10)$$


Therefore, the candidate function associated with the greatest $Q$ is the term causing the maximum reduction of the MSE. This term is selected and added to eq(4.2). The process is repeated iteratively until no further term can reduce the MSE by more than a given threshold, or until a maximum number of accepted terms is reached, resulting in an accurate model that describes the data. However, explicitly computing the orthogonal functions $w_m(t)$ is costly, as mentioned at the beginning of this section. Fast Orthogonal Search (FOS) was introduced to solve this problem.

4.1.2 Fast Orthogonal Search

Recall the formulas introduced above for building the model of eq(4.2):

1. $a_m$ is calculated using eq(4.9), where $g_m$ is given by $\overline{y(t)\, w_m(t)} \,/\, \overline{w_m^2(t)}$;

2. $w_m(t)$ is calculated as $p_m(t) - \sum_{r=0}^{m-1} \alpha_{mr} w_r(t)$;

3. $\alpha_{mr} = \overline{p_m(t)\, w_r(t)} \,/\, \overline{w_r^2(t)}$;

4. $Q(M) = g_M^2\, \overline{w_M^2(t)}$.

Comparing these equations, it is not difficult to see that the numerators and denominators are all time averages of cross-products of corresponding terms, and that the denominator of $\alpha_{mr}$ has the same form as that of $g_m$. FOS uses a vector $C(m)$ and a matrix $D(m, r)$ to hold the numerator and denominator of $g_m$, respectively. Moreover, the second factor of $Q(M)$ has the same property and can be substituted by $D(m, m)$. Therefore, the significant function terms can be selected using $Q(M)$, and their corresponding coefficients $a_m$ can be calculated, without ever computing the orthogonal function terms $w_m(t)$.

Given a candidate function set with M terms, he pseudocode to calculate the vec-

tor C and the matrixD through FOS as presented in [38] is given below:

START
D(0,0) = 1
C(0) = \overline{y(t)}
for m = 1 to M do
    D(m,0) = \overline{p_m(t)}
end for
for m = 1 to M do
    for r = 0 to m-1 do
        \alpha_{mr} = D(m,r) / D(r,r)
        D(m,r+1) = \overline{p_m(t) p_{r+1}(t)} - \sum_{i=0}^{r} \alpha_{(r+1)i} D(m,i)
    end for
    C(m) = \overline{y(t) p_m(t)} - \sum_{r=0}^{m-1} \alpha_{mr} C(r)
end for

After C and D are available, g_m can be calculated using eq(4.11):

g_m = C(m) / D(m,m), \quad m = 0, \ldots, M. \quad (4.11)

It has been proved in [37] by Korenberg that the MSE of the model defined by

eq(4.5) can be expressed as follows:

error = \overline{y^2(t)} - \sum_{m=0}^{M} g_m^2 D(m,m). \quad (4.12)
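As a sanity check on this bookkeeping, the C/D recursion of the pseudocode above can be reproduced in a few lines. The following is a minimal pure-Python sketch (helper names and toy data are ours, not from the thesis): it computes C(m), D(m,r), the coefficients g_m of eq(4.11), and the MSE of eq(4.12) without ever forming the orthogonal functions w_m(t).

```python
# Minimal pure-Python sketch of the fast recursion: compute C(m), D(m, r), the
# coefficients g_m of eq(4.11), and the MSE of eq(4.12) without ever forming
# the orthogonal functions w_m(t). Helper names and toy data are ours.

def mean(v):
    return sum(v) / len(v)

def cross(a, b):
    """Time average of a cross-product: the overbar used throughout Sec. 4.1."""
    return sum(p * q for p, q in zip(a, b)) / len(a)

def fos_model(y, candidates):
    """candidates: the signals p_1(t)..p_M(t); p_0(t) = 1 is implicit.
    Returns (g, mse) obtained purely from the C/D recursion."""
    M = len(candidates)
    p = [[1.0] * len(y)] + [list(c) for c in candidates]   # prepend p_0(t) = 1
    D = [[0.0] * (M + 1) for _ in range(M + 1)]
    C = [0.0] * (M + 1)
    alpha = [[0.0] * (M + 1) for _ in range(M + 1)]
    D[0][0] = 1.0
    C[0] = mean(y)
    for m in range(1, M + 1):
        D[m][0] = mean(p[m])                               # D(m,0) = average of p_m
    for m in range(1, M + 1):
        for r in range(m):
            alpha[m][r] = D[m][r] / D[r][r]
            D[m][r + 1] = cross(p[m], p[r + 1]) - sum(
                alpha[r + 1][i] * D[m][i] for i in range(r + 1))
        C[m] = cross(y, p[m]) - sum(alpha[m][r] * C[r] for r in range(m))
    g = [C[m] / D[m][m] for m in range(M + 1)]                              # eq (4.11)
    mse = cross(y, y) - sum(g[m] ** 2 * D[m][m] for m in range(M + 1))      # eq (4.12)
    return g, mse
```

For an output lying exactly in the span of {1, p_1, p_2}, the returned MSE is zero up to rounding, and g matches what explicit Gram–Schmidt orthogonalization produces.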


Comparing eq(4.7) and eq(4.12), Q(M+1) in eq(4.10), the reduction of the MSE obtained by adding a new term a_{M+1} p_{M+1}(t), takes the form

Q(M+1) = g_{M+1}^2 D(M+1, M+1). \quad (4.13)

To select the (M+2)th term p_{M+2}(t), we only need to carry out the above procedure for m = M+2; previous calculations for m ≤ M+1 need not be repeated. As mentioned above, FOS will continue to select and add the optimal candidate term to reduce the MSE of the model until it reaches a stopping criterion. In [37, 38], two stopping criteria have been mentioned to terminate FOS. One is that once all candidate function terms have been selected from the candidate functional set, FOS stops searching. The other is based on a statistical significance test: FOS is terminated if adding a further term cannot reduce the MSE by more than would be expected from white Gaussian noise. Suppose M terms have already been selected; for a given candidate function term p_{M+1}(t), its corresponding value of Q(M+1) can be calculated by eq(4.13). It can be shown that if e(t) is a zero-mean, independent Gaussian noise, then the correlation coefficient r is given by

r = \left( \frac{Q(M+1)}{\overline{y^2(t)} - \sum_{m=0}^{M} Q(m)} \right)^{1/2} < \frac{2}{\sqrt{T-R+1}}, \quad (4.14)

with probability of about 0.95 (a 95% confidence interval, C.I.) for a sufficiently long record length T-R+1 [71]. Note that 1/\sqrt{T-R+1} on the right-hand side of eq(4.14) is the standard deviation of r. Moreover, the 2 here is an approximation of 1.96, based on

-\frac{1.96}{\sqrt{T-R+1}} < r < \frac{1.96}{\sqrt{T-R+1}}.

Therefore, eq(4.14) can be rewritten in a more general way:

Q(M+1) > \frac{K}{T-R+1} \left( \overline{y^2(t)} - \sum_{m=0}^{M} Q(m) \right). \quad (4.15)


For example, if we set K = 4, FOS corresponds to a 95% C.I. [45], and if K is chosen as 10.9, the C.I. is 99.9% [42].
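The test of eq(4.15) is straightforward to implement. A hedged sketch (the function and argument names are ours): given a candidate's MSE reduction Q(M+1), the record length, and the reductions already achieved, decide whether the term beats the white-noise level.

```python
def accept_term(q_new, y_sq_mean, q_selected, T, R, K=4.0):
    """eq(4.15): accept a candidate only if its MSE reduction q_new exceeds the
    level attributable to white Gaussian noise. K = 4 corresponds to roughly a
    95% C.I., K = 10.9 to 99.9%."""
    remaining = y_sq_mean - sum(q_selected)   # current model MSE, cf. eq(4.12)
    return q_new > (K / (T - R + 1)) * remaining
```

With T = 100, R = 1, and a remaining MSE of 0.6, the noise level at K = 4 is 0.024, so a reduction of 0.5 is accepted while a reduction of 0.01 is rejected even before tightening K to 10.9.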

4.1.3 Network Construction using FOS

To implement FOS for gene network reverse engineering, we model the interactions of one gene at a time in the network. Moreover, we assume that a gene's expression at a given time point depends only on the expression of its regulators at the previous time point. Consider gene expression data consisting of N gene expression profiles over T time points. Focusing on one gene, Gene_j, it is treated as the output of the system (the target gene of the network), and the remaining N-1 genes constitute the candidate function set ξ = {Gene_1, ..., Gene_{j-1}, Gene_{j+1}, ..., Gene_N}. When the time-series property is added to the system, because only the previous time point of the regulator genes is assumed to drive the regulation, the candidate functional set is ξ = {Gene_1(t), ..., Gene_{j-1}(t), Gene_{j+1}(t), ..., Gene_N(t)} and the output is Gene_j(t+1), t = 1, ..., T-1. Here we do not permit self-regulation; therefore the form defined by eq(4.3) does not include the output y terms. The time lag for the input is 1, therefore R = 1.

Through FOS, the MSE reduction Q is calculated and compared for all candidate functions in ξ. The candidate function yielding the maximum value of Q is selected, added to the model, and deleted from the candidate functional set ξ. Obviously, FOS will always select a time series to estimate the studied gene expression profile. This procedure is repeated iteratively until either of two stopping criteria is met: (i) adding a new function does not reduce the MSE by more than white Gaussian noise would; or (ii) ξ is empty. The identified model is utilized to predict


Gene_j using the selected genes, which are defined as the regulators of Gene_j. Once all genes Gene_j, j = 1, ..., N, have been studied as the target, a network of all genes is constructed, whose nodes stand for genes, whose edges denote the regulations between genes, and whose arrows describe the direction of regulation. Note that the model built through FOS is highly dependent on the predefined candidate basis function set; one could define more complex basis functions, such as cross-products, to construct a more complicated network.
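The per-gene selection loop of this section can be sketched as follows. This is a simplified, hypothetical illustration: instead of the C/D bookkeeping of FOS, it scores each candidate by how much a single residual projection reduces the error, which illustrates the greedy selection logic but not the speed of FOS; the gene names and the tolerance value are ours.

```python
# Hypothetical sketch of the selection loop: each gene in turn is the output,
# and up to max_regulators lag-1 regulators are picked greedily. The score is
# the MSE reduction of a plain residual projection, a simplification of the
# FOS criterion; gene names and `tol` are our own choices.

def center(v):
    mu = sum(v) / len(v)
    return [a - mu for a in v]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def infer_regulators(expr, target, max_regulators=2, tol=1e-6):
    """expr: dict gene -> list of T expression values.
    Returns the genes selected as lag-1 regulators of `target`."""
    y = expr[target][1:]                                        # output Gene_j(t+1)
    cands = {g: v[:-1] for g, v in expr.items() if g != target} # no self-regulation
    resid = center(y)
    chosen = []
    for _ in range(max_regulators):
        best, best_gain, best_proj = None, 0.0, None
        for g, xv in cands.items():
            xc = center(xv)
            den = dot(xc, xc)
            if den == 0.0:
                continue
            coef = dot(resid, xc) / den
            gain = coef * coef * den          # reduction in summed squared error
            if gain > best_gain:
                best, best_gain = g, gain
                best_proj = [coef * v for v in xc]
        if best is None or best_gain / len(y) < tol:            # noise-level stop
            break
        chosen.append(best)
        resid = [a - b for a, b in zip(resid, best_proj)]
        del cands[best]
    return chosen
```

On a noiseless toy system where gene "C" is driven only by gene "A" at the previous time point, the loop selects "A" and then stops, since the residual gain of any further candidate falls below the tolerance.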

4.2 Parallel Cascade Identification

Parallel Cascade Identification (PCI) builds a model of the input/output relationship of a system using a number of cascades, each of which has a dynamic component, capable of capturing the memory of the system, followed by a static polynomial component, which enables an accurate estimation of the system output, as shown in Figure 4.1 [39].

PCI starts by approximating the system using the first cascade. The difference between the actual system output, y(t), and the first cascade output, z_1(t), is called the residue, y_1(t). The residue is then treated as the output of a new system, which is approximated by the second cascade. The residue is again computed, and another cascade is added. The process continues until the approximation error reaches a desired threshold.

For a system represented as eq(4.1), following the Stone–Weierstrass theorem [47], it can be approximated with a finite-order Volterra series², that is,

y_s(n) = k_0 + \sum_{m=1}^{M} V_m, \quad n = 0, 1, \ldots \quad (4.16)

²The Volterra series was developed in 1887 by Vito Volterra. It is a model for non-linear behavior, similar to the Taylor series, but with the ability to capture 'memory' effects.


Figure 4.1: Structure of a PCI model

where M is the order of the Volterra series and, for m ≥ 1, the mth-order Volterra functional is of the form

V_m = \sum_{i_1=0}^{R} \cdots \sum_{i_m=0}^{R} k_m(i_1, \ldots, i_m) x(n - i_1) \cdots x(n - i_m), \quad (4.17)

where k_m is the mth-order symmetric Volterra kernel, which can be seen as a higher-order impulse response of the system, and R+1 is the memory length, meaning that the series output y_s(n) depends only on input delays from 0 to R lags.
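A direct, if inefficient, evaluation of eqs (4.16)–(4.17) for a second-order series (M = 2) can be written as follows; the kernel values used in the examples are arbitrary illustrations, not values from the thesis.

```python
# Direct (inefficient) evaluation of eqs (4.16)-(4.17) for a second-order
# Volterra series, M = 2. Kernel values in examples are arbitrary.

def volterra2(x, k0, k1, k2, R):
    """k1: list of R+1 first-order kernel values; k2: (R+1)x(R+1) symmetric
    second-order kernel. Returns y_s(n) for n = R .. len(x)-1."""
    out = []
    for n in range(R, len(x)):
        v1 = sum(k1[i] * x[n - i] for i in range(R + 1))
        v2 = sum(k2[i][j] * x[n - i] * x[n - j]
                 for i in range(R + 1) for j in range(R + 1))
        out.append(k0 + v1 + v2)
    return out
```

With R = 1, k_0 = 1, k_1 = [1, 0] and a zero second-order kernel, the series reduces to 1 + x(n); adding k_2(0,0) = 1 contributes the memoryless square x(n)².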

Consider a time series y(t) as the system output and x(t) as the input, t = 0, ..., T, and assume that y(t) depends on input delays from 0 to R. PCI starts with the first cascade to approximate the system. Let y_i(t) be the residue after the ith cascade has been added to the parallel cascade model; thus, y_0(t) = y(t). By definition, the following equation holds:

y_i(t) = y_{i-1}(t) - z_i(t), \quad i = 1, 2, \ldots \quad (4.18)

To fit the ith cascade to the residue y_{i-1}(t), i = 1, 2, ..., the procedure of PCI, shown in Figure 4.1, can be briefly described as follows:

1. Define a candidate function pool for the impulse response h_i of the dynamic component in the ith cascade, defined over lags j = 0, ..., R. The pool consists of cross-correlation functions of different orders between the input, x(t), and the residue, y_{i-1}(t), computed over a segment of the input and output signals extending from t = R to t = T. For example, the first-order cross-correlation function is

\phi_{x y_{i-1}}(j) = \frac{1}{T - R + 1} \sum_{t=R}^{T} y_{i-1}(t) x(t - j). \quad (4.19)

2. Randomly select an impulse response h_i(j) from the pre-defined candidate function pool; the output of the dynamic component, u_i(t), is then calculated as

u_i(t) = \sum_{j=0}^{R} h_i(j) x(t - j). \quad (4.20)

3. u_i(t) is then treated as the input of the static component. By fitting a static polynomial P(·) from the input u_i(t) to the residue y_{i-1}(t), a cascade is completely constructed. The cascade output is z_i(t) = P[u_i(t)].

4. Calculate the MSE of the estimated model, i.e., the mean-square value of the new residue over t = R, ..., T:

\overline{y_i^2(t)} = \overline{(y_{i-1}(t) - z_i(t))^2} = \overline{y_{i-1}^2(t)} - \overline{z_i^2(t)}.

5. Repeat this procedure until the MSE reduction caused by adding a new cascade is less than a threshold. Similar to the stopping criterion of FOS, when trying to add a further cascade, the correlation coefficient

r = \sqrt{ \overline{z_{i+1}^2(t)} / \overline{y_i^2(t)} }

is required to satisfy |r| < 2/\sqrt{T - R + 1} with probability of about 95%.
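Steps 1–4 for a single cascade can be sketched as below. This is an illustrative simplification (function names and the toy Wiener system in the usage are ours): the impulse response is taken directly from the first-order cross-correlation of eq(4.19) rather than drawn at random from a pool of higher-order candidates, the dynamic output follows eq(4.20), and the static component is a degree-2 polynomial fitted by least squares.

```python
import random

# Illustrative sketch of one PCI cascade (steps 1-4): impulse response from the
# first-order cross-correlation (eq 4.19), dynamic output via eq 4.20, and a
# degree-2 static polynomial fitted by least squares. Names are ours.

def tavg(v):
    return sum(v) / len(v)

def solve3(A, b):
    """Gauss-Jordan elimination with partial pivoting, enough for the 3x3
    normal equations of the degree-2 polynomial fit."""
    n = len(b)
    M = [row[:] + [bv] for row, bv in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_cascade(x, res, R):
    """Fit one cascade to the residue `res` (indexed 0..T); return the new
    residue y_i(t) over t = R..T."""
    T = len(x) - 1
    seg = list(range(R, T + 1))
    # Step 1: first-order cross-correlation as the candidate impulse response.
    h = [sum(res[t] * x[t - j] for t in seg) / (T - R + 1) for j in range(R + 1)]
    # Step 2: output of the dynamic component, eq (4.20).
    u = {t: sum(h[j] * x[t - j] for j in range(R + 1)) for t in seg}
    # Step 3: least-squares fit of res(t) ~ c0 + c1*u + c2*u^2.
    basis = {t: [1.0, u[t], u[t] ** 2] for t in seg}
    A = [[sum(basis[t][i] * basis[t][j] for t in seg) for j in range(3)]
         for i in range(3)]
    rhs = [sum(basis[t][i] * res[t] for t in seg) for i in range(3)]
    c = solve3(A, rhs)
    z = {t: c[0] + c[1] * u[t] + c[2] * u[t] ** 2 for t in seg}
    # Step 4: new residue y_i(t) = y_{i-1}(t) - z_i(t).
    return [res[t] - z[t] for t in seg]
```

Applied to a toy Wiener system (a two-tap linear filter followed by a quadratic polynomial), a single cascade removes most of the output power, as step 4 predicts.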


Prior to reverse engineering GRNs using PCI, the multiple-input case must be discussed. The multiple-input case introduced in [39] is briefly reviewed here and shown in Figure 4.2. For example, considering two input signals, x_1(t) and x_2(t),

Figure 4.2: Structure of a multiple input/single output PCI model

the differences of the PCI procedure from the single-input case are:

• In Step 1, the candidate set for the impulse response also includes a further term, the cross-correlation of the residue y_{i-1}(t) with both x_1(t) and x_2(t).

• In Step 2, to include both inputs in the system, the output of the dynamic (linear) component is calculated by

w_i(t) = u_i(t) \pm C x_2(t - A), \quad (4.21)

where the sign is chosen randomly, C is a convergence constant defined as \overline{y_{i-1}^2(t)} / \overline{y^2(t)}, and the integer A is selected randomly from {0, ..., R}.

To include three or more inputs in the system, the output of the dynamic component is calculated by

w_i(t) = u_i(t) \pm \sum_{k \geq 2} C x_k(t - A_k), \quad (4.22)

where each A_k is randomly selected from {0, ..., R} and C follows the previous definition.

4.2.1 Network Construction using PCI

For reverse engineering of gene networks, the time lag is set to R = 1. In the multiple-input case, if all input genes are assigned the same coefficient C, then even though an acceptable mathematical model can be generated to predict the time series of the output, the model is not a good representation of genetic regulation. Since PCI randomly selects the impulse response, a modification is made to PCI in this work, as shown in Figure 4.3. First, the system output y(t)

Figure 4.3: Structure of the modified PCI model

is the gene expression level of Gene_j over time, and the input of the system is X(t) = {Gene_1(t), ..., Gene_{j-1}(t), Gene_{j+1}(t), ..., Gene_N(t)}. In constructing the ith cascade, we generate a vector H_i of impulse responses corresponding to the input vector, instead of only one impulse response as in the original PCI. Assuming R = 1, the output of the dynamic component is u_i(t) = H_i X(t-1), which is used directly as the input of the static polynomial component.

Empirical data indicate that gene regulatory networks should be sparse, with the average number of upstream regulators per gene being less than two [48]. Unlike FOS, in which a criterion can be set to terminate the procedure once the maximum number of accepted regulators is reached, PCI generates a relatively full matrix, except for its diagonal (whose entries are zero, as self-regulation is not allowed in the models). A method is therefore developed to reduce the number of links estimated by PCI. The regulation from Gene_i to Gene_j is defined as significant if the entry R_ij of the regulation weight matrix R has a large absolute value compared with the rest of the entries in the same column. Specifically, if R_ij lies more than k standard deviations from the mean of the corresponding column, it is kept for further studies.
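The column-wise k-standard-deviation rule can be sketched directly. The matrix in the usage is a hypothetical example, and excluding the zero diagonal from each column's statistics is our assumption; k = 1.5 is the value used for the Brainsim and yeast runs later in the thesis.

```python
import statistics

# Sketch of the column-wise significance rule for the PCI weight matrix.
# Excluding the zero diagonal from the column statistics is our assumption.

def significant_links(Rw, k=1.5):
    """Rw: n x n regulation weight matrix (row i regulates column j).
    Returns the set of (i, j) entries lying more than k standard deviations
    from the mean of column j."""
    n = len(Rw)
    keep = set()
    for j in range(n):
        col = [Rw[i][j] for i in range(n) if i != j]   # skip the zero diagonal
        mu = statistics.mean(col)
        sd = statistics.pstdev(col)
        for i in range(n):
            if i != j and abs(Rw[i][j] - mu) > k * sd:
                keep.add((i, j))
    return keep
```

For a mostly-zero column with one dominant weight, only that entry survives; columns with no variation yield no links, matching the intent of making the PCI weight matrix sparse.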

4.3 Assessment of Network Inferences

To evaluate the performance of the proposed methods, FOS and PCI, for identifying gene regulatory networks from the datasets, statistical measures are employed. For predictive analysis, a confusion matrix (Table 4.1) is a table with two rows and two columns that reports the numbers of True Positives, False Positives, True Negatives, and False Negatives.

• True Positive (TP): an interaction that exists in both the actual network and the network inferred by the reverse engineering methods;


Table 4.1: Confusion Matrix

                    actual links                          total
predicted links     True Positives    False Positives     P'
                    False Negatives   True Negatives      N'
total               P                 N

• False Positive (FP): an interaction that does not exist in the actual network but was falsely inferred by the reverse engineering methods;

• True Negative (TN): an interaction that exists in neither the actual network nor the inferred network;

• False Negative (FN): an interaction that exists in the actual network but is not inferred by the reverse engineering methods.

Moreover, three other criteria, Precision (pre), Sensitivity (sen), and Specificity (spc), are also employed for evaluation, and are defined as

precision = TP / (TP + FP) = (# of correctly estimated interactions) / (# of all estimated interactions),

sensitivity = TP / (TP + FN) = (# of correctly estimated interactions) / (# of all actual interactions),

specificity = TN / (TN + FP) = (# of possible interactions that exist in neither the actual nor the estimated network) / (# of possible interactions that do not exist in the actual network).
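Counting TP, FP, TN, and FN over all ordered gene pairs (excluding self-links, which are not allowed in the models) gives the three criteria; a minimal sketch with set-valued link lists of our own naming:

```python
# Minimal sketch of the three criteria over all ordered gene pairs; self-links
# are excluded from the universe since self-regulation is not allowed.

def confusion_metrics(actual, predicted, n_genes):
    """actual, predicted: sets of directed links (i, j) with i != j."""
    n_pairs = n_genes * (n_genes - 1)       # all possible directed interactions
    tp = len(actual & predicted)
    fp = len(predicted - actual)
    fn = len(actual - predicted)
    tn = n_pairs - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return precision, sensitivity, specificity
```

With nine genes (72 possible links), 11 true links, and 10 predictions that are all correct, this reproduces the FOS column of Table 5.3: precision 100%, sensitivity 10/11, specificity 61/61.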

Chapter 5

Implementation and Results

Both FOS and PCI are implemented using MATLAB. In this chapter, details of their implementations and the results of the reverse-engineered networks for each dataset described in Chapter 3 are provided. First, the temporal synthetic dataset is used to evaluate the performance of FOS and PCI. Then, the Brainsim songbird data are analyzed and the resulting networks are compared with the actual network. Finally, FOS and PCI are applied to the yeast datasets, and the inferred networks are compared with the target network from KEGG and with two previous network inference studies [35, 85] on the same data.

5.1 Analysis of the Temporal Synthetic Dataset

To evaluate the performance of FOS and PCI for learning the system network structure, 100 synthetic datasets were generated using the structure shown in Figure 3.2. The only difference among the synthetic datasets is the influence of the noise value E in eq(3.1). It is expected that both FOS and PCI should identify the underlying system network structure in all datasets. FOS and PCI are implemented on each dataset to build two models individually.

Every synthetic dataset is composed of nine genes over 100 time points. The stopping criterion for FOS was set to K = 10.9, or that at most two regulators per gene have been selected. The actual and estimated gene expressions using the models built by FOS and PCI are shown in Figures 5.1 and 5.2, respectively. In these

Figure 5.1: System identification by FOS: starred points are actual system outputs and solid lines denote the estimated system output using the identified model.

figures, the solid lines show the estimated gene expressions while the stars denote actual system outputs. The system approximation errors are ∼0.001. The values of the MSE provide only a mathematical view of model accuracy. From Figures 5.1

Figure 5.2: System identification by PCI: starred points are actual system outputs and solid lines denote the estimated system output using the identified model.

and 5.2, it is obvious that both methods construct well-fitting models. Only one gene, Gene4, is not estimated well by either method. The reason is that, to generate the synthetic datasets, the process starts by assigning random values to Gene4 as its expression levels, which are then used to generate the expression values of the other genes. PCI seems to have fitted the system better than FOS because PCI includes more function terms in the model (possibly eight terms) than FOS (two terms at most).

5.1.1 Network Inference

Due to the pre-set stopping criteria, the regulatory weight matrix R_f provided by FOS is very sparse, with at most two nonzero entries in each column. The type of regulation is defined as inhibition if the weight from the source gene to the target gene is negative, and activation if it is positive. In contrast, R_p, the regulation matrix generated by PCI, is relatively full; its entry at the ijth position denotes the weight of the regulation from the regulator gene on the ith row to the target gene in the jth column. The criterion introduced in Section 4.2.1 is used to reduce the size of the network; as a result, the regulation weight matrix becomes sparser.

Finally, 100 inferred gene regulatory networks are available for each method. All resulting links are summed into one matrix to decide which regulations are to be kept as significant. In theory, there are 72 possible regulations in a network of nine genes. The summed regulation matrices for the models inferred through FOS and PCI are shown in Tables 5.1 and 5.2, respectively. From Tables 5.1 and 5.2, one could conclude that:

• All the 100 synthetic datasets do have similar structures.

• Both FOS and PCI perform steadily on these 100 synthetic datasets.

• The criterion proposed to threshold the network inferred by PCI is reasonable and can remove the insignificant regulations.

The histogram of the number of times a link is reverse engineered in the 100


      1     2     3     4     5     6     7     8     9
1     -   100     0     5    14    20    17   100    17
2    13     -   100    10    11    11    10     0    14
3    14    16     -    19   100   100     9     0    16
4    20     7   100     -    17     8    17   100    13
5   100    13     0     8     -    16    11     0    12
6    18    14     0    20    14     -   100     0    12
7    14    20     0     8    13    16     -     0   100
8    10    19     0     9    15     4    18     -    16
9    11    11     0    21    16    21    19     0     -

Table 5.1: Interaction matrix summed over 100 synthetic datasets by FOS: for the target gene in the jth column, the ijth entry denotes the number of times the regulation from the regulator gene on the ith row was discovered in the 100 synthetic datasets. Entries in bold are the actual regulations; diagonal entries (-) are excluded since self-regulation is not allowed.


      1     2     3     4     5     6     7     8     9
1     -   100     0    12     0     0     0     1     0
2     0     -    97     0    19     0     0     0     0
3     0     0     -    27   100    97     0     0     0
4     0     0    92     -     0     9     0   100     0
5   100     0     0    25     -     1     0     0     0
6     0     0     0     6     0     -   100     0     0
7     0     0     0     2     0     0     -     0   100
8     0     0     0     6     0     1     0     -     0
9     0     0     0     0     0     0     0     0     -

Table 5.2: Interaction matrix summed over 100 synthetic datasets by PCI: for the target gene in the jth column, the ijth entry denotes the number of times the regulation from the regulator gene on the ith row was discovered in the 100 synthetic datasets. Entries in bold are the actual regulations; diagonal entries (-) are excluded since self-regulation is not allowed.


synthetic datasets is shown in Figure 5.3. There are two clearly separated parts in each histogram; therefore, a threshold can be set to identify significant interactions and build an inferred network. A regulation is accepted if and only if it appears more than a threshold number of times in the 100 datasets. The threshold is set to 90 for both FOS and PCI, and the filtered regulations are used to build the final network for each method.
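The consensus step above can be sketched as follows (a hypothetical helper; whether "more than a threshold" is a strict or non-strict comparison is our choice of convention):

```python
from collections import Counter

# Sketch of the consensus step: a link is kept iff it is recovered in at least
# `threshold` of the per-dataset networks (90 of 100 here). The non-strict
# comparison is our convention; link names are illustrative.

def consensus_network(networks, threshold=90):
    """networks: one set of directed links per dataset."""
    counts = Counter(link for net in networks for link in net)
    return {link for link, c in counts.items() if c >= threshold}
```

For example, a link recovered in all 100 runs and one recovered in 95 survive a threshold of 90, while a link seen in only 20 runs is discarded.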

Figure 5.4 shows the networks identified by FOS and PCI. Both methods are able to reverse engineer most of the true regulations: out of 11 true regulations, FOS recovers 10 links, while PCI recovers nine. The regulation of Gene6 by Gene8 is missing from both estimated models, and PCI also did not find the regulation of Gene8 by Gene1. To describe their performance more clearly, precision, sensitivity, and specificity are calculated as shown in Table 5.3.

Table 5.3: Comparisons of the networks inferred from the synthetic data by FOS and PCI

              Fast Orthogonal Search   Parallel Cascade Identification
Sensitivity   10/11 = 91%              9/11 = 82%
Precision     10/10 = 100%             9/9 = 100%
Specificity   61/61 = 100%             61/61 = 100%

5.2 Analysis of the Brainsim Songbird Dataset

The Brainsim songbird dataset by Smith [73] is a popular benchmark for evaluating different network inference algorithms. FOS and PCI were also applied to

Figure 5.3: Histograms of the number of times a regulation is discovered in the 100 synthetic datasets by (a) FOS and (b) PCI.


Figure 5.4: The final estimated networks of the synthetic data by (a) FOS and (b) PCI. Solid links are correctly discovered (TP); dashed links are missing ones (FN).


750 such Brainsim datasets, as mentioned in Chapter 3. All the datasets have the same underlying network structure. The network structure of 100 genes and one activity term in each dataset is reverse engineered. The stopping criterion for FOS is set as K = 10.9 with a maximum of two regulators, and for PCI, k = 1.5. Therefore, similar to the previous section, 750 regulation weight matrices are generated for each method, FOS and PCI.

5.2.1 Network Inference for Songbird data

To discover the significant regulations, all 750 regulation matrices reverse engineered by FOS are summed. Note that for 100 genes there are more than 10,000 possible regulations, and too many regulations appear only once or twice in the 750 datasets; therefore, we only plot the histogram of the 50 most significant regulations, shown in Figure 5.5. A threshold of 300 is used to select the most significant regulations, so that their number is comparable to the number of actual connections, which is 11. With this threshold, we obtain 11 significant regulations, which are used to build the final network, shown in Figure 5.6.

For the implementation of PCI on the songbird data, the histogram of the 50 most significant regulations inferred from the 750 datasets is given in Figure 5.7. Due to the criterion used to make the regulation weight matrix sparse, only a few regulations are considered significant. Therefore, most of the insignificant regulations have been removed, the histogram follows a more uniform distribution, and the threshold is set to 600. This results in 10 significant regulations used to build the network, shown in Figure 5.8.

By comparing the network inferred by FOS (Figure 5.6) with the original network


Figure 5.5: The histogram of the top 50 significant regulations discovered from the 750 Brainsim songbird datasets by FOS.

Figure 5.6: The final estimated network of the Brainsim songbird data using FOS. Dashed lines denote regulations that FOS could not recover.


Figure 5.7: The histogram of the top 50 significant regulations discovered from the 750 Brainsim songbird datasets by PCI.

Figure 5.8: The final estimated network of the Brainsim songbird data using PCI. Dashed lines denote regulations that PCI could not recover.


structure (Figure 3.3), it is observed that 10 of the 11 inferred interactions are truly captured and only one extra interaction is inferred: the regulation of Gene 5 by Activity. The co-regulation of Gene 6 by Gene 3 is missed by both methods; it was also not predicted by previous studies of the Brainsim songbird data [73]. This is because Gene 3 and Gene 5 control Gene 6 in a coordinated fashion, with the lower expression level of the pair serving as the limiting factor in the regulation of Gene 6; Gene 5 had a lower expression level than Gene 3 in 89% of the temporal cases, so Gene 5 nearly always serves as the effective regulator [73]. Analyzing the network inferred through PCI (Figure 5.8), six of the seven inferred interactions exist in the actual network, one extra interaction, from Gene 1 to Gene 5, is inferred, and five interactions are missed. For both FOS and PCI, the incorrectly inferred interaction involves the regulation of Gene 5. Both FOS and PCI are able to reverse engineer most of the true regulations. To evaluate the accuracies of the networks obtained by FOS and PCI, the criteria 'precision', 'sensitivity' and 'specificity' are calculated again; the results are shown in Table 5.4. As shown, FOS performed better than PCI, with more correctly detected regulations.

              Fast Orthogonal Search   Parallel Cascade Identification
Sensitivity   10/11 = 91%              6/11 = 55%
Precision     10/11 = 91%              6/7 = 86%
Specificity   10088/10089 ≈ 100%       10084/10089 ≈ 100%

Table 5.4: Comparisons of the networks inferred from the Brainsim simulated data by FOS and PCI


5.3 Analysis of Yeast Saccharomyces cerevisiae Dataset

A biological dataset consisting of 14 genes from the yeast Saccharomyces cerevisiae [74], including three time series, was ultimately used to evaluate the efficiency of the two reverse engineering methods. The KEGG pathway of these genes, shown in Figure 3.4, is regarded as the target network used to compare and evaluate the performance of FOS and PCI. Since CLN3 only works at the start of the cell cycle, its regulators are not considered for either method. The stopping criteria used for analyzing this data are the same as for the previous two datasets: for FOS, K = 10.9 with a maximum of two regulators; for PCI, k = 1.5. For this data, two individual networks are inferred by the two methods.

5.3.1 Network Inference

As discussed in Chapter 3, the KEGG pathway is treated as the target network for comparison. Complexes comprising one or several genes are each considered as a 'gene' in the network. There are 10 such nodes: the complexes CLN3/CDC28, SWI4/SWI6, MBP1/SWI6, CLN1/CLN2/CDC18, and CLB5/CLB6/CDC28, and the single-gene nodes CDC20, CDC6, SIC1, FAR1, and FUS. The following assumptions are made:

ing assumptions are made:

• Genes CLN3 and CDC28 are only considered as possible regulators, as they

are starters of the cell cycle network.

• All discovered links from any gene in one complex to any gene in a different complex are considered as one regulation. For example, if FOS or PCI results in three regulations from genes in the complex CLN3/CDC28 to genes in the complex SWI4/SWI6, say CLN3 → SWI4, CLN3 → SWI6, and CDC28 → SWI4, still only one regulation is used to construct the resulting network; its weight equals the maximum of the weights of these three regulations.

• All regulations among genes in the same complex will be ignored.

• If two regulations with opposite directions exist between two complexes, their weights are compared and only the direction with the higher weight is kept, which determines the directionality of the regulation between the two complexes. For example, between complexes cplx_i and cplx_j, if R_ij and R_ji are both nonzero and R_ij > R_ji, then the directionality is determined as cplx_i → cplx_j. This interpretation is based on the biological assumption that a small variation in the regulator gene results in a large change in the target gene.

The corresponding networks of the yeast dataset using FOS and PCI are shown in Figure 5.9 (a) and (b). They are also compared with two previous studies [35, 85]. Details of those methods are not discussed here; instead, their results are adopted for comparison, and their resulting networks are shown in Figure 5.9 (c) and (d), respectively.

By comparing the networks inferred using FOS and PCI with the KEGG pathway, it is observed that more than forty percent of the interactions in the target network are inferred by FOS and PCI, while only two interactions are captured by Kim et al. [35] and three by Zhang et al. [85]. The reverse-engineered results using FOS and PCI also outperform the previous studies in terms of predicting more correctly


Figure 5.9: The yeast cell cycle pathway inferred from the Spellman data using different methods: (a) FOS, (b) PCI, (c) Kim [35], and (d) Zhang [85].


estimated and misdirected interactions. Using the information from all four reverse engineering approaches applied to the cell cycle pathway of the yeast data, 'precision', 'sensitivity' and 'specificity' are calculated and displayed in Table 5.5 as a summary of Figure 5.9. Unlike the synthetic and songbird datasets, the yeast dataset does

              FOS    PCI    Kim [35]   Zhang [85]
TPs           4      5      2          3
FPs           8      7      8          8
Sensitivity   40%    50%    20%        30%
Precision     29%    36%    15%        27%
Specificity   85%    85%    85%        86%

Table 5.5: Comparisons of the networks inferred from the yeast Saccharomyces cerevisiae data by FOS, PCI, and the two other available studies.

not have replicate samples for analysis; hence its inferred results are less statistically sound and harder to evaluate. Even though the absolute values are not very high, they show a significant improvement over the previously reported studies [35, 85].

Chapter 6

Summary and Conclusions

Reverse engineering gene regulatory networks from gene expression data is an important but challenging area of research in systems biology. In this thesis, Fast Orthogonal Search and Parallel Cascade Identification, two system identification approaches inspired by engineering systems, are introduced and employed to construct GRNs using temporal gene expression data. The fast convergence time of FOS, O(n²), makes it an attractive approach for analyzing large-scale data. FOS searches all possible regulator genes in a candidate set; it iteratively selects the optimal one, adds it to the model, and deletes it from the candidate set. The selection procedure guarantees that the search always selects the most significant regulator from the remaining possible regulators. The other approach, PCI, considers all possible regulators simultaneously, but assigns different weights to them. A modification to this algorithm was proposed to make the regulation weight matrix generated by PCI sparse.

To evaluate the reliability and efficiency of FOS and PCI for inferring causal regulatory interactions from temporal gene expression data, synthetic datasets are generated and used. FOS can recover 10 of the 11 actual regulations in this data, and PCI, using the proposed criterion, can infer a sparse network and recover nine of the 11 true regulations. Via the three statistical evaluation criteria 'sensitivity', 'precision' and 'specificity', as well as the mean square error, the accuracies of the structures inferred by both methods are quantified.

FOS and PCI are also applied to the Brainsim songbird data, a simulated temporal dataset with known structure that models the singing behavior of a songbird. The inferred structures, quantified via the criteria 'sensitivity', 'precision' and 'specificity', indicate good performance of both network inference approaches; only one of the inferred interactions is a false regulation for either approach, while 10 true network regulations are recovered through FOS and six using PCI.

Finally, the efficiencies of FOS and PCI for learning the network structure are evaluated using a biological dataset, the temporal expression values of 14 genes in the yeast Saccharomyces cerevisiae cell cycle data reported in [74]. The networks inferred from the yeast data by FOS and PCI are compared, using the evaluation criteria 'sensitivity', 'precision' and 'specificity', with the KEGG pathway of the yeast as the target network and with two other yeast network inference studies on the same data. Even though the absolute values of these criteria are not high, compared with the two previous studies the results demonstrate a good performance of both FOS and PCI.

In conclusion, both FOS and PCI can deal with continuous gene expression data, capture their dynamics, and build deterministic models. By modeling the input/output relationship, they can infer the causality of the gene regulatory networks by assigning the inputs as the regulators of the output.
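This final assembly step, in which the selected inputs of each identified per-gene model become directed edges into that gene, can be sketched as follows (the gene names and selections are hypothetical, for illustration only):

```python
# Sketch: turning per-gene input/output models into a directed network.
# 'selected_inputs' maps each output (target) gene to the regulator genes
# chosen as inputs by the identification step; names are hypothetical.
selected_inputs = {
    "g3": ["g1", "g2"],  # the model of g3 uses g1 and g2 as inputs
    "g2": ["g1"],        # the model of g2 uses g1 as its only input
}

# Each (input, output) pair becomes a directed edge regulator -> target,
# which is exactly the causal interpretation described above.
edges = [(regulator, target)
         for target, regulators in selected_inputs.items()
         for regulator in regulators]
```

Collecting these edges over all target genes yields the inferred gene regulatory network.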


6.1 Further directions

The design and application of methods for reverse engineering gene regulatory networks from gene expression data is a key aspect of systems biology. We proposed applying system identification algorithms, well known in mathematics and engineering, to the reverse engineering problem. A few future directions for this work are listed below:

• Studying alternative basis functions to the gene expression profile functions used as the input in this work, to approximate the regression model of the association between a given gene and its potential regulators in FOS and PCI.

• Considering biological knowledge to determine transcription factors or potential gene regulators, and assigning the gene expression functions of such genes a higher probability of being selected as potential regulators. This can be done by dividing the candidate functional set into several subsets; FOS could then start its search from the subset of highest relevance, and PCI could build different groups of cascades from different subsets.

• Generalizing the proposed model so that different regulators of a gene may regulate their target with different time lags, instead of the one-time-lag assumption used in this work. This can result in a more flexible network inference model with higher accuracy.

• Incorporating biological information to determine the maximum number of potential regulators for each gene, instead of defining an equal maximal number of regulators for all genes. Since FOS always selects regulators for a given target gene, this prior knowledge can yield an improvement.


• Because Parallel Cascade Identification randomly assigns coefficients to the input of the dynamic system, applying alternative algorithms to generate the impulse responses might decrease the computation time of the PCI model.

• Further studying possible approaches for defining the significance of a regulation

in the regulation weight matrix generated by PCI.
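The time-lag generalization suggested above can be written schematically. Purely as an illustration (the notation below is introduced here and is not taken from the thesis; the identified models are in general nonlinear), the uniform one-step-lag model for a target gene $x_j$ with selected regulators $x_{i_1}, \dots, x_{i_k}$,

```latex
x_j(t) = f_j\big(x_{i_1}(t-1), \dots, x_{i_k}(t-1)\big),
```

would generalize to regulator-specific lags,

```latex
x_j(t) = f_j\big(x_{i_1}(t-\tau_{j i_1}), \dots, x_{i_k}(t-\tau_{j i_k})\big),
```

where each integer lag $\tau_{ji} \ge 1$ would be estimated together with $f_j$, at the cost of a larger search space for the identification algorithms.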

Bibliography

[1] Cell cycle: yeast Saccharomyces cerevisiae. http://www.genome.jp/dbget-bin/wwwbget/map04111.

[2] T. Akutsu, S. Miyano, and S. Kuhara. Identification of genetic networks from a

small number of gene expression patterns under the boolean network model. in

Proceedings of Pacific Symposium on Biocomputing, 4:17–28, 1999.

[3] A. A. Alizadeh, M. B. Eisen, R. E. Davis, C. Ma, et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403:503–511, 2000.

[4] M. Bansal, G. D. Gatta, and D. di Bernardo. Inference of gene regulatory

networks and compound mode of action from time course gene expression profiles.

Bioinformatics, 22(7):815–822, 2006.

[5] K. Basso, A. A. Margolin, G. Stolovitzky, U. Klein, R. D. Favera, and A. Califano. Reverse engineering of regulatory networks in human B cells. Nature Genetics, 37(4):382–390, 2005.

[6] S. Bornholdt. Boolean network models of cellular regulation: prospects and

limitations. Journal of the Royal Society, 5(Suppl 1):85–94, 2008.


[7] L. Campos and J. Huete. On the use of independence relationships for learn-

ing simplified belief networks. International Journal of Intelligent Systems,

12(7):495–522, 1998.

[8] T. Chen, H. L. He, and G. M. Church. Modeling gene expression with differential equations. in Proceedings of Pacific Symposium on Biocomputing, 4:29–40, 1999.

[9] X. Chen, G. Anantha, and X. Wang. An effective structure learning method for constructing gene networks. Bioinformatics, 22(11):1367–1374, 2006.

[10] A. Crombach and P. Hogeweg. Evolution of evolvability in gene regulatory net-

works. PLoS Computational Biology, 4(7):e1000112, 2007.

[11] X. Cui and G. A. Churchill. Statistical tests for differential expression in cDNA

microarray experiments. Genome Biology, 4(4):210.1–210.10, 2003.

[12] M. S. Dasika, A. Gupta, and C. D. Maranas. A mixed integer linear programming

framework for inferring time delay in gene regulatory networks. in Proceedings

of Pacific Symposium on Biocomputing, 9:474–485, 2004.

[13] E. H. Davidson and D. H. Erwin. Gene regulatory networks and the evolution

of animal body plans. Science, 311(5762):796–800, 2006.

[14] A. G. de Brevern, S. Hazout, and A. Malpertuy. Influence of microarrays exper-

iments missing values on the stability of gene groups by hierarchical clustering.

BMC Bioinformatics, 5:114–225, 2004.

[15] M. J. L. de Hoon, S. Imoto, K. Kobayashi, N. Ogasawara, and S. Miyano. Inferring gene regulatory networks from time-ordered gene expression data of Bacillus subtilis using differential equations. in Proceedings of Pacific Symposium on Biocomputing, 8:17–28, 2003.

[16] A. A. Desrochers. On an improved model reduction technique for nonlinear

systems. Automatica, 17(2):407–409, 1981.

[17] P. D’haeseleer, S. Liang, and R. Somogyi. Genetic network inference: from

co-expression clustering to reverse engineering. Bioinformatics, 16(8):707–762,

2000.

[18] P. D’haeseleer, X. Wen, S. Fuhrman, and R. Somogyi. Linear modeling of mRNA expression levels during CNS development and injury. in Proceedings of the 4th Pacific Symposium on Biocomputing, 4:41–52, 1999.

[19] P. D’haeseleer, X. Wen, S. Fuhrman, and R. Somogyi. Mining the gene expression

matrix: Inferring gene relationships from large scale gene expression data. in

Proceedings of the second international workshop on Information Processing in

Cells and Tissues, pages 203–212, 1998.

[20] S. Draghici. Data Analysis tools for DNA microarrays. Chapman and Hall-CRC,

2003.

[21] N. Friedman. Learning bayesian network structure from massive datasets: the

sparse candidate algorithm. in Proceedings of Fifteenth Conference on Uncer-

tainty in Artificial Intelligence, pages 206–215, 1999.

[22] N. Friedman, M. Linial, I. Nachman, and D. Pe’er. Using bayesian networks to analyze expression data. Journal of Computational Biology, 7(3):601–620, 2000.


[23] T. S. Gardner and J. J. Faith. Reverse engineering transcription control networks.

Physics of Life Reviews, 2(1):65–88, 2005.

[24] D. Heckerman, D. Geiger, and D. Chickering. Learning bayesian networks: the

combination of knowledge and statistical data. Machine Learning, 20(3):197–243,

1995.

[25] J. Herrero, R. Diaz-Urizrte, and J. Dopazo. Gene expression data preprocessing.

Bioinformatics, 19(5):655–656, 2003.

[26] S. Huang. Gene expression profiling, genetic networks, and cellular states: an

integrating concept for tumorigenesis and drug discovery. Journal of Molecular

Medicine, 77(6):469–480, 1999.

[27] T. Ideker, V. Thorsson, A. F. Siegel, and L. E. Hood. Testing for differentially

expressed genes by maximum likelihood analysis of microarray data. Journal of

Computational Biology, 7(6):805–817, 2000.

[28] R. Kabli, F. Herrmann, and J. McCall. A chain model genetic algorithm for

bayesian network structure learning. in Proceedings of the 9th annual conference

on Genetic and evolutionary computation, pages 1264–1271, 2007.

[29] G. Karlebach and R. Shamir. Modelling and analysis of gene regulatory networks.

Nature Reviews Molecular Cell Biology, 9(10):770–780, 2008.

[30] S. A. Kauffman. Homeostasis and differentiation in random genetic control net-

works. Nature, 224:177–178, 1969.

[31] S. A. Kauffman. Metabolic stability and epigenesis in randomly constructed genetic nets. Journal of Theoretical Biology, 22(3):437–467, 1969.


[32] S. A. Kauffman. The large scale structure and dynamics of genetic control cir-

cuits: an ensemble approach. Journal of Theoretical Biology, 44(1):167–190,

1974.

[33] S. A. Kauffman and L. Glass. The logical analysis of continuous, nonlinear

biochemical control networks. Journal of Theoretical Biology, 39(1):103–129,

1973.

[34] H. Kim, G. H. Golub, and H. Park. Missing value estimation for DNA microarray

gene expression data: local least squares imputation. Bioinformatics, 21(2):187–

198, 2006.

[35] S. Y. Kim, S. Imoto, and S. Miyano. Inferring gene networks from time series microarray data using dynamic bayesian networks. Briefings in Bioinformatics, 4(3):228–235, 2003.

[36] M. J. Korenberg. Orthogonal identification of nonlinear difference equation mod-

els. in Proceedings of 28th Midwest Symposium on Circuits and Systems, 1:90–95,

1985.

[37] M. J. Korenberg. Fast orthogonal identification of nonlinear difference equation

and function expansion models. in Proceedings of 30th Midwest Symposium on

Circuits and Systems, 1:270–276, 1987.

[38] M. J. Korenberg. A robust orthogonal algorithm for system identification and

time-series analysis. Biological Cybernetics, 60(4):267–276, 1989.

[39] M. J. Korenberg. Parallel cascade identification and kernel estimation for non-

linear systems. Annals of Biomedical Engineering, 19(4):429–455, 1991.


[40] M. J. Korenberg. Prediction of treatment response using gene expression profiles.

Journal of Proteome research, 1(1):55–61, 2002.

[41] M. J. Korenberg. On predicting medulloblastoma metastasis by gene expression

profiling. Journal of Proteome Research, 3(1):91–96, 2004.

[42] M. J. Korenberg, C. J. H. Brenan, and I. W. Hunter. Raman spectral estimation

via fast orthogonal search. Analyst, 122:879–882, 1997.

[43] M. J. Korenberg, R. David, I. W. Hunter, and J. E. Solomon. Parallel cas-

cade identification and its application to protein family prediction. Journal of

Biotechnology, 91(1):35–47, 2001.

[44] M. J. Korenberg and I. W. Hunter. Rapid DTMF signal classification via parallel cascade identification. Electronics Letters, 32:1862–1863, 1996.

[45] M. J. Korenberg and L. D. Paarmann. Orthogonal approaches to time-series

analysis and system identification. IEEE Signal Processing Magazine, 8(3):29–

43, 1991.

[46] M. J. Korenberg, J. E. Solomon, and M. E. Regelson. Parallel cascade iden-

tification as a means for automatically classifying protein sequences into struc-

ture/function groups. Biological cybernetics, 82(1):15–21, 2000.

[47] S. Lang. Real and functional analysis. Transactions of the American Mathemat-

ical Society, 41(3):88–89, 1937.

[48] R. D. Leclerc. Survival of the sparsest: robust gene networks are parsimonious.

Molecular systems biology, 4(213):1–6, 2008.


[49] P. Leray and O. Francois. Bayesian network structural learning and incomplete

data. in Proceedings of the international and interdisciplinary conference on

adaptive knowledge representation and reasoning, pages 33–40, 2005.

[50] S. Liang, S. Fuhrman, and R. Somogyi. Reveal, a general reverse engineering

algorithm for inference of genetic network architectures. in Proceedings of Pacific

Symposium on Biocomputing, 3:18–29, 1998.

[51] W. K. Lim, K. Wang, C. Lefebvre, and A. Califano. Comparative analysis of mi-

croarray normalization procedures: effects on reverse engineering gene networks.

Bioinformatics, 23(13):282–288, 2007.

[52] D. R. McGaughey, M. Tarbouchi, K. Nutt, and A. Chikhani. Speed sensorless estimation of AC induction motors using the fast orthogonal search algorithm. IEEE Transactions on Industry Applications, 21(1):112–120, 2006.

[53] C. B. Moler. Numerical computing with MATLAB. Philadelphia: Society for

Industrial and Applied Mathematics, 2004.

[54] S. Mostafavi, S. Baranzini, J. Oksenberg, and P. Mousavi. A fast multivari-

ate feature-selection/classification approach for prediction of therapy in multiple

sclerosis. in Proceedings of IEEE Conference on Computational Intelligence in

Bioinformatics and Computational Biology, pages 1–8, 2006.

[55] K. Mountjoy, E. Morin, and K. Hashtrudi-Zaad. Use of the fast orthogonal search

method to estimate optimal joint angle for upper limb hill-muscle models. IEEE

Transactions on Biomedical Engineering, 57(4):790–798, 2010.


[56] D. M. Mutch, A. Berger, R. Mansourian, A. Rytz, and M. Roberts. The limit fold

change model: a practical approach for selecting differentially expressed genes

from microarray data. BMC Bioinformatics, 3(17), 2002.

[57] H. G. Natke. Application of system identification in Engineering. Springer, New

York, 1988.

[58] R. E. Neapolitan. Learning Bayesian networks (artificial intelligence). New York:

Prentice–Hall, 2004.

[59] S. Oba, M. A. Sato, I. Takemasa, M. Monden, K. I. Matsubara, and S. Ishii. A bayesian missing value estimation method for gene expression profile data. Bioinformatics, 19(16):2088–2096, 2000.

[60] H. Ogata, S. Goto, W. Fujibuchi, and M. Kanehisa. Computation with the KEGG pathway database. Biosystems, 47(1-2):119–128, 1998.

[61] S. F. Orfanidis. Optimum signal processing. McGraw-Hill, New York, 1988.

[62] B. E. Perrin, L. Ralaivola, A. Mazurie, S. Bottani, J. Mallet, and F. d’Alche

Buc. Gene networks inference using dynamic bayesian networks. Bioinformatics,

19(Suppl 2):ii138–148, 2003.

[63] M. Pineda-Sanchez, M. Riera-Guasp, J. A. Antonino-Daviu, J. Roger-Folch,

J. Perez-Cruz, and R. Puche-Panadero. Instantaneous frequency of the left side-

band harmonic during the start-up transient: A new method for diagnosis of

broken bars. IEEE Transactions on Industrial Electronics, 56(11):4557–4570,

2009.


[64] L. Qian, H. Wang, and E. R. Dougherty. Inference of noisy nonlinear differ-

ential equation models for gene regulatory network using genetic programming

and kalman filtering. IEEE Transactions on Signal Processing, 56(7):3327–3339,

2008.

[65] T. M. Rakoczy. Feature selection for computer-aided diagnosis of breast cancer

using dynamic contrast-enhanced magnetic resonance images. Master’s thesis,

Royal Military College of Canada, September 2009.

[66] J. C. Rapp, B. J. Baumgartner, and J. Mullet. Quantitative analysis of tran-

scription and rna levels of 15 barley chloroplast genes. The Journal of Biological

Chemistry, 267(30):21404–21411, 1992.

[67] W. Richard. Genes and DNA. Kingfisher, Boston, 2003.

[68] H. E. Samad, M Khammash, L. Petzold, and D. Gillespie. Stochastic modeling

of gene regulatory networks. International Journal of Robust Nonlinear Control,

15:691–711, 2005.

[69] M. Schena, D. Shalon, R. W. Davis, and P. O. Brown. Quantitative monitoring

of gene expression patterns with a complementary DNA microarray. Science,

270(5235):467–470, 1995.

[70] A. Schulze and J. Downward. Navigating gene expression using microarrays: a

technology review. Nature cell biology, 3(8):E190–E195, 2001.

[71] J. Shao. Mathematical Statistics. Springer, New York, 2005.

[72] C. Sima, J. Hua, and S. Jung. Inference of gene regulatory networks using time-

series data: A survey. Current Genomics, 10(6):416–429, 2009.


[73] V. A. Smith, E. D. Jarvis, A. J. Hartemink, and E. J. Hartemink. Evaluating

functional network inference using simulations of complex biological systems.

Bioinformatics, 18(Suppl 1):S216–S224, 2002.

[74] P. T. Spellman, G. Sherlock, M. Q. Zhang, V. R. Iyer, K. Anders, M. B. Eisen, P. O. Brown, D. Botstein, and B. Futcher. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell, 9(12):3273–3297, 1998.

[75] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction and Search. MIT

Press, 2001.

[76] N. Sugimoto and H. Iba. Inference of gene regulatory networks by means of

dynamic differential bayesian networks and nonparametric regression. Genome

Informatics, 15(2):121–30, 2004.

[77] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani,

D. Botstein, and R. B. Altman. Missing value estimation methods for DNA

microarrays. Bioinformatics, 17(6):520–525, 2001.

[78] E. P. van Someren, L. F. Wessels, E. Backer, and M. L. Reinders. Genetic

network modeling. Pharmacogenomics, 3(4):507–25, 2002.

[79] E. P. van Someren, L. F. Wessels, and M. L. Reinders. Linear modeling of genetic

networks from experimental data. in Proceedings of International Conference on

Intelligent Systems for Molecular Biology, 8:355–66, 2000.


[80] X. Wang, A. Li, Z. Jiang, and H. Feng. Missing value estimation for DNA

microarray gene expression data by support vector regression imputation and

orthogonal coding scheme. BMC Bioinformatics, 7(32), 2006.

[81] D. C. Weaver, C. T. Workman, and G. D. Stormo. Modeling regulatory networks

with weight matrices. in Proceedings of the Pacific Symposium on Biocomputing,

4:112–123, 1999.

[82] R. F. Weaver. Molecular Biology. McGraw Hill, 2008.

[83] M. K. Yeung, J. Tegner, and J. J. Collins. Reverse engineering gene networks

using singular value decomposition and robust regression. Proceedings of the

National Academy of Sciences of the United States of America, 99(9):6163–6168,

2002.

[84] J. Yu, V. A. Smith, P. P. Wang, A. J. Hartemink, and E. D. Jarvis. Using

bayesian network inference algorithms to recover molecular genetic regulatory

networks. in Proceedings of third International Conference on System Biology,

37(382–390), 2002.

[85] Y. Zhang, Z. Deng, H. Jiang, and P. Jia. Inferring gene regulatory networks

from multiple data sources via a dynamic bayesian network with structural EM.

Lecture Notes in Computer Science, 4544:204–214, 2007.