gene networks from high-throughput data –reverse

47
Linköping studies in science and technology. Dissertations. No. 1301 Gene networks from high-throughput data –Reverse engineering and analysis Mika Gustafsson Department of Science and Technology Linköping University, SE-601 74 Norrköping, Sweden Norrköping 2010

Upload: others

Post on 03-Oct-2021

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Gene networks from high-throughput data –Reverse

Linköping studies in science and technology. Dissertations. No. 1301

Gene networks from high-throughput data –Reverse engineering and analysis

Mika Gustafsson

Department of Science and Technology Linköping University, SE-601 74 Norrköping, Sweden

Norrköping 2010

Page 2: Gene networks from high-throughput data –Reverse

All previously published papers were reproduced with permission from the publisher. Linköpings studies in science and technology. Dissertations. No. 1301 Copyright © Mika Gustafsson, 2010 ISBN 978-91-7393-442-8 ISSN 0345-7524 Printed by LiU-tryck, Linköping 2010

Page 3: Gene networks from high-throughput data –Reverse

ABSTRACT Experimental innovations starting in the 1990’s leading to the advent of high-throughput experiments in cellular biology have made it possible to measure thousands of genes simultaneously at a modest cost. This enables the discovery of new unexpected relationships between genes in addition to the possibility of falsify existing. To benefit as much as possible from these experiments the new inter disciplinary research field of systems biology have materialized. Systems biology goes beyond the conventional reductionist approach and aims at learning the whole system under the assumption that the system is greater than the sum of its parts. One emerging enterprise in systems biology is to use the high-throughput data to reverse engineer the web of gene regulatory interactions governing the cellular dynamics. This relatively new endeavor goes further than clustering genes with similar expression patterns and requires the separation of cause of gene expression from the effect. Despite the rapid data increase we then face the problem of having too few experiments to determine which regulations are active as the number of putative interactions has increased dramatic as the number of units in the system has increased. One possibility to overcome this problem is to impose more biologically motivated constraints. However, what is a biological fact or not is often not obvious and may be condition dependent. Moreover, investigations have suggested several statistical facts about gene regulatory networks, which motivate the development of new reverse engineering algorithms, relying on different model assumptions. As a result numerous new reverse engineering algorithms for gene regulatory networks has been proposed. As a consequent, there has grown an interest in the community to assess the performance of different attempts in fair trials on “real” biological problems. This resulted in the annually held DREAM conference which contains computational challenges that can be solved by the prosing researchers directly, and are evaluated by the chairs of the conference after the submission deadline. This thesis contains the evolution of regularization schemes to reverse engineer gene networks from high-throughput data within the framework of ordinary differential equations. Furthermore, to understand gene networks a substantial part of it also concerns statistical analysis of gene networks. First, we reverse engineer a genome-wide regulatory network based solely on microarray data utilizing an extremely simple strategy assuming sparseness (LASSO). To validate and analyze this network we also develop some statistical tools. Then we present a refinement of the initial strategy which is the algorithm for which we achieved best performer at the DREAM2 conference. This strategy is further refined into a reverse engineering scheme which also can include external high-throughput data, which we confirm to be of relevance as we achieved best performer in the DREAM3 conference as well. Finally, the tools we developed to analyze stability and flexibility in linearized ordinary differential equations representing gene regulatory networks is further discussed.

Page 4: Gene networks from high-throughput data –Reverse
Page 5: Gene networks from high-throughput data –Reverse

POPULÄRVETENSKAPLIG SAMMANFATTNING Experimentella innovationer i cellbiologi från 1990-talet har gjort det möjligt att mäta aktiviteten hos tusentals gener samtidigt relativt billigt. Detta har möjliggjort upptäckten av oväntade förhållanden mellan gener direkt från experiment vilket både genererar nya hypoteser och avfärdar befintliga. För att dra största möjliga nytta av de nya mätningarna har det nya tvärvetenskapliga forskningsområdet systembiologi växt fram. Systembiologi går bortom den konventionella reduktionistiska ansatsen och försöker lära hela systemet under antagandet att systemet är större än summan av dess beståndsdelar. Ett stort ämne inom detta område är att använda data från de storskaliga mätningarna för att skapa modeller för det regleringsnätverk som ligger till grund för dynamiken av geners uttryck och cellens dynamik. Detta försök att skilja orsak och verkan åt mellan gener är relativt nytt och går vidare från tidigare ansatser där grupper av gener med liknande dynamiska mönster detekteras. Flera svåra problem föreligger emellertid, framförallt har vi för få experiment för att säkert bestämma vilka interaktioner som används. En möjlighet att lösa detta problem är att införa mer biologiskt motiverade villkor. Men vad som är ett biologiskt faktum eller inte är ofta inte uppenbart och även tillståndsberoende. Dessutom har undersökningar föreslagit flera statistiska fakta om genregleringsnätverk, som kräver utveckling av nya metoder att identifiera systemet. Därför har en uppsjö nya metoder för att identifiera genregleringsnätverk föreslagits. Följaktligen har också intresset ökat i området för att utvärdera resultaten av olika försök i tester på ”verkliga” biologiska problem. Detta resulterade i den årligen hållna DREAM konferensen som innehåller flera beräkningsproblem som kan lösas direkt av de forskare som förespråkar en viss metod och bedöms sedan efter sista dagen för deltagande av konferensens organisatörer. Denna avhandling innehåller utvecklandet av en strategi för att identifiera nätverk innehållande tusentals gener utifrån storskaliga mätningar baserat på regularisering av ordinära differentialekvationer. Vidare för att förstå och möjliggöra validering av gennätverk handlar en väsentlig del också om statistisk analys av gennätverk. Först identifierar vi ett storskaligt regleringsnätverk baserat enbart på tidsutvecklingen för olika geners uttryck via en enkel strategi att nyttja att genregleringsnätverk är relativt enkla. För att validera och analysera gennätverk utvecklar vi också en del statistiska verktyg som fyller detta ändamål. Sedan visar vi en vidareutveckling av den ursprungliga strategin med vilken vi vann utmärkelsen best performer vid DREAM2 konferensen. Denna identifieringsstrategi utvecklar vi sedan vidare till en strategi som klarar att ta hänsyn till flera typer av storskaliga datatyper. Som bevis på dess applicerbarhet får vi återigen utmärkelsen best performer vid DREAM3 konferensen. Slutligen så diskuterar vi de verktyg vi utvecklat för att analysera stabilitet och flexibilitet i linjariserade ordinära differentialekvationer i kontexten av vad andra forskare funnit.

Page 6: Gene networks from high-throughput data –Reverse
Page 7: Gene networks from high-throughput data –Reverse

ACKNOWLEDGEMENTS This thesis is the result of my research performed during the time as a Ph. d. student in Theoretical and applied mathematical physics at ITN, Linköping University. During this time I have been financed by the CENIIT organization (centrum för industriell IT) for which support I am grateful. The work has been performed under the supervision of Michael Hörnquist from whom I have learned a lot, both about science and about great leadership. I have learned to look deeper for the real insights grace to Michael’s skeptical plain talking attitude to science. Michael has been a great supervisor by providing both great scientific freedom and also support to develop ideas. In addition, he has always been responsive, consequent and ensured the best to his colleagues, thus being a tremendous leader for many people. Co-supervisor Jesper Tegnér has also inspired me a lot during my time as a Ph. d. student. Jesper has a great sense of entrepreneurship and he has learned me much about how to present the scientific content. I would also like to thank the rest of my co-authors (Anna Lombardi, Jesper Lundström and Johan Björkegren) for the stimulating discussions we have had about our articles. Thanks also to Martin Evaldsson for help with various computer issues, and our inspiring projects. I would also like to thank my colleagues in the mathematics group for the collaborations we have had in education issues, and also the remainder of the department for their company.

Page 8: Gene networks from high-throughput data –Reverse

INCLUDED PUBLICATIONS I. M. Gustafsson, M. Hörnquist, and A. Lombardi, “Constructing and analyzing

a large-scale gene-to-gene regulatory network--lasso-constrained inference and biological validation,” IEEE/ACM transactions on computational biology and bioinformatics vol. 2, Sep. 2005, pp. 254-261. Author’s contribution: Performed the network analysis and biological validation. Took active part in the writing process and created the figures.

II. M. Gustafsson, M. Hörnquist, and A. Lombardi, “Comparison and validation of community structures in complex networks,” Physica A: Statistical Mechanics and its Applications, vol. 367, 2006, pp. 559-576. Author’s contribution: Initiated and to some extent managed the study and introduced the proposed novel measures related to closeness and coherence. Performed the GO analysis of the gene network. Took active part in the writing process and created some of the figures.

III. M. Gustafsson, M. Hörnquist, J. Lundström, J. Björkegren, and J. Tegnér, “Reverse engineering of gene networks with LASSO and nonlinear basis functions,” Annals of the New York Academy of Sciences, vol. 1158, Mar. 2009, pp. 265-275. Author’s contribution: Proposed the fundamental ideas of the algorithms, performed the calculations. Took active part in the writing process and created the figures.

IV. M. Gustafsson, M. Hörnquist, J. Björkegren, and J. Tegnér, “Genome-wide system analysis reveals stable yet flexible network dynamics in yeast,” IET systems biology, vol. 3, Jul. 2009, pp. 219-228. Author’s contribution: Performed the calculations and suggested the proposed measures. . Wrote an early draft of the paper, then took active part in the writing process and created the figures.

V. M. Gustafsson and M. Hörnquist, “Integrating various data sources for improved quality in reverse engineering of gene regulatory networks,” Handbook on research on computational methods in gene regulatory networks, Hershey, PA: IGI Global (Medical Information Science Reference), 2009, pp. 476-496. Author’s contribution: Performed the calculations, wrote the paper and created the figures.

VI. M. Gustafsson and M. Hörnquist, “Gene Expression Prediction by Soft Integration and the Elastic Net -Best Performance of the DREAM3 Gene Expression Challenge,” PLoS ONE, vol. 5, Feb. 2010, p. e9134. Author’s contribution: Proposed the fundamental ideas of the algorithms, performed the calculations. Wrote an early draft of the paper, then took active part in the writing process and created the figures.

VII. M. Gustafsson and M. Hörnquist, “System Analysis of Gene Regulatory Networks,“ submitted. Author’s contribution: Wrote the paper and created the figures.

Page 9: Gene networks from high-throughput data –Reverse

CONTENTS 1 Introduction to the work ........................................................................... 1

1.1 Background ......................................................................................... 1 1.2 Motivation of the papers ..................................................................... 2

2 Reverse engineering of gene networks ..................................................... 5

2.1 The biological system ......................................................................... 5 2.1.1 Gene regulation ...................................................................... 5 2.1.2 Available data ......................................................................... 6 2.1.3 Gene regulatory networks ...................................................... 8

2.2 Inference ............................................................................................. 8 2.2.1 Model exploration .................................................................. 9 2.2.2 Linear ODE models .............................................................. 10 2.2.3 Regularization ....................................................................... 11 2.2.4 Model selection .................................................................... 14 2.2.5 Incorporating prior knowledge ............................................ 16

2.3 Strategy assessment DREAM .......................................................... 18 2.3.1 Model assessment ................................................................. 18 2.3.2 Assessment of our strategy .................................................. 19

3 Network analysis ....................................................................................... 21

3.1 Network architecture ........................................................................ 21 3.1.1 Hubs ...................................................................................... 21 3.1.2 Higher organization .............................................................. 22

3.2 Stability and flexibility ..................................................................... 24 3.2.1 Background ........................................................................... 24 3.2.2 Traditional definitions of stability ....................................... 25 3.2.3 Motivation of new concepts, stability and flexibility .......... 25 3.2.4 Mathematical definitions ...................................................... 26 3.2.5 Findings ................................................................................ 27

4 References .................................................................................................. 30

PAPER I

PAPER II

PAPER III

PAPER IV

PAPER V

PAPER VI

PAPER VII

Page 10: Gene networks from high-throughput data –Reverse
Page 11: Gene networks from high-throughput data –Reverse

1

1 INTRODUCTION TO THE WORK 1.1 BACKGROUND Experimental innovations in cell biology starting in the 1990’s have made it possible to measure thousands of genes simultaneously at a modest cost, using high-throughput techniques. These techniques have generated lots of genome-wide material, which in part is publicly available or could be generated relatively cheap. Moreover, this increase in experimental information about genes, proteins and metabolites has made it possible to address new types of questions. For example by observing the activity of most of the relevant regulators and targets in the cell it is possible to detect unexpected relationships from the experiments. Therefore it is possible to use the high-throughput data to not only falsify existing hypotheses, but also generate new ones. This kind of new data has opened the innovative research field of systems biology, which aims at understanding the whole biological system as an entire unit. One emerging enterprise in this field is to use the genome-wide material to reverse engineer the dynamic system which governs the dynamics of the cell. This includes the gene regulatory network (GRN) that represents which genes regulate each other [1,2]. Possibly, the systems biology approach then leads to new understandings of diseases and a goal for the community is to produce a detailed mathematical model of the cell. Although it is a long way still to have a detailed in silico model of a cell [3] new insights about complex diseases as cancers have already been found by utilizing systems biology approaches (see [4] and references therein). In the case of genome-wide gene regulatory network inference we face the problem that the number of putative regulations greatly exceeds the number of experiments. Therefore, it is problematic to determine which regulations are active using conventional tools. One possibility to overcome this problem is to impose more biologically motivated constraints. Basically, this means that we bias the inference to facts that are known from similar (partly known) systems. However, what is a biological fact or not is often not obvious and may depend on the biological context. Moreover, investigations have suggested several statistical facts about gene regulatory networks, which require the development of new reverse engineering algorithms, relying on different model assumptions. For example, from evolutionary arguments and from well characterized systems we may assume that gene regulatory networks contain relatively few interactions (sparseness), contain some master switches (hubs) controlling significant parts of the gene regulatory network and contain sub networks which only weakly interacts with its surrounding (communities) [5,3,1,6]. Luscombe et. al. [7] showed that also several statistical features are strongly condition dependent. For example, different types of motifs (i.e., small recurrent graph structures) were significantly enriched for the active part of a network in different conditions depending on the response time required by the system. To integrate as much relevant material as possible and to create models for specific purposes several new reverse engineering algorithms for gene regulatory networks have been proposed. This includes completely different modeling frameworks suitable for different issues and how different types of data are integrated into the model. Consequently an increased interest has grown in the community to assess the performance of different attempts in fair trials on real biological problems. Therefore,

Page 12: Gene networks from high-throughput data –Reverse

2

some comparative studies to elucidate the merits of different algorithms have been made [8,9]. However, as algorithms are implemented differently by different people it is still hard to judge which algorithm is most suitable to a particular problem. This has resulted in the annually held DREAM conference which contains computational challenges that can be solved by any research group, and are evaluated by the chairs of the conference after the submission deadline. Indisputably it is hard to produce challenges that are of biological importance and have a true fair answer. Therefore the challenges and assessment metrics from the DREAM conference is continuously refined by the research community. Moreover, to motivate researchers to participate their identity is only revealed with their permission. 1.2 MOTIVATION OF THE PAPERS This thesis contains among other things refinements of regularization schemes to reverse engineer gene networks from high-throughput data within an ordinary differential equations (ODE) framework. As the subject is in a rapid evolution this is manifested in the evolution of the papers. The main objective of the thesis has been to elucidate the possibilities of linear models for genome-wide reconstruction of gene regulatory networks. Generally, linear models are the first order approximation of the system around its working point and are therefore good to start with when the functional form of the regulatory functions are unknown. Moreover, they have relatively few free parameters (degrees of freedom), and several fast computational implementations exist. Thus, they seem good starting points to infer genome-wide gene regulatory networks. Furthermore, to understand gene networks a substantial other part of it deals with statistical analysis of gene networks. In Paper I we explored some possibilities of the recently proposed least absolute shrinkage and selection operator (LASSO) for inference of gene regulatory networks. We also explored (Paper I, IV) what kind of large-scale statistical characteristics that are consistent with other networks than the ones from the LASSO. We utilized a publicly available microarray data set of Yeast (Saccharomyces cerevisiae) containing measurements of 6178 transcripts (genes) in 73 time instances of the cell cycle. Our working hypothesis was that by utilizing only microarray data the inferred system contains lots of errors, but on a statistical level the network is “more right than wrong”. As it turned out this modest assumption was not so easy to validate, since there are no known “golden truth” to use for benchmarking even for Yeast which is the most well studied eukaryote. Therefore, we justified the proposed network by calculating topological quantities and related these to biological properties using annotation databases (mostly gene ontology (GO) [10]). Indeed, this led us to delve into the general field of complex networks, and Paper II was the result. Moreover, we analyzed the topology of the network in Paper I in more detail in Paper IV. Paper IV took long time to publish partly since we proposed some new concepts in it, which not all reviewers took the time to try to understand. As we finished the first study several new possibilities to extend the inference scheme had been proposed. This included more efficient algorithms for implementation of the LASSO and several extensions to it. In Paper III we developed a successful strategy to reverse engineer gene regulatory networks from time-series and steady state expression data. This strategy achieved best performer in the DREAM2 in silico network challenge (explained in details in Paper III).

Page 13: Gene networks from high-throughput data –Reverse

3

However, this challenge of the DREAM was quite far from the inference problem we aim at understanding. The data were computationally generated to mimic noise-free microarrays, and the problem was also greatly over-determined for our model (i.e., the number of data instances exceeded the number of units). Subsequently, we explored possible benefits of incorporating other types of large-scale data sets into the inference that impose constraints on which of the network edges are more likely. These ideas are discussed in a general sense and some of the possibilities are also implemented in Paper V. Moreover, the DREAM3 conference included a challenge were it was possible to explore the basic ideas outlined in Paper V, with some modifications. Here, we have the possibility to incorporate all kinds of public data to predict the behavior of a real biological system in a situation where the number of units vastly exceeds the number of measurements. Hence, that challenge matches the goal of the overall strategy well. Indeed, the extra data helped us to achieve best performer in this DREAM3 challenge, which we describe in Paper VI. Finally, Paper VII was invited for a special issue of Open Bioinformatics by Professor Alberto de La Fuente to describe the newly proposed measures in Paper IV to a wider audience. The introductory part of this dissertation is divided into the chapters Reverse engineering of gene networks and Network analysis. Indeed, several of the included papers in the thesis contain both parts, but hopefully this division of the material is more pedagogical.

Page 14: Gene networks from high-throughput data –Reverse

4

Page 15: Gene networks from high-throughput data –Reverse

5

2 REVERSE ENGINEERING OF GENE NETWORKS This chapter is intended to introduce the biological system (§2.1) as we look upon it and discuss the in silico abstraction of the system (§2.2). In §2.3 we discuss how we can assess the relevance of the proposed mathematical models. Some text in this chapter is taken from Paper V and Paper VII with only minor modifications. 2.1 THE BIOLOGICAL SYSTEM

Cellular systems are intrinsically complex, still robust and at the same time able to quickly adapt to new situations. To understand, describe and model a wide range of cellular systems −involving genes, mRNA molecules, proteins and metabolites − networks have served as the unifying language [11]. Among the intra cellular networks there are several different types of networks, ranging from descriptions of physical interactions of proteins, to detailed metabolic networks describing biochemical reactions. Here we also have the gene regulatory networks (commonly abbreviated as GRNs), describing indirect regulatory relationships between genes[1,2]. Gene regulatory networks, defined from the interaction maps of proteins, RNA molecules and metabolites, describe in combination with the kinetics the cellular responses to input signals and its regular dynamics [1,2]. Some parts of these huge regulatory systems are well explored, e.g., the yeast cell cycle, the lambda switch, and the SOS pathway, but still many of the parts are unknown. However, reverse engineering genome-wide gene regulatory networks has a high potential for drug discovery and is essential for both the understanding of how cellular system works and is fundamental to make computationally derived predictions of cellular behavior. 2.1.1 Gene regulation

The full manual of the cell is carried within its DNA, which together with epigenetic effects and external stimuli determines its function. The DNA is like any other manual, passive information that needs to be expressed in order to have a meaning. The informative parts that can be activated can be divided into open reading frames, i.e., genes, which may contain promoters, a transcription start site (TSS) and an end sequence of nucleotides of the gene. A gene becomes activated in the transcription process, when a RNA polymerase copies the blueprint DNA into mRNA templates. The mRNA is then translated and proteins are synthesized in the ribosome. The transcription activity itself is governed by the present proteins in the region around the DNA, rather than the actual RNA polymerase concentration. The transcription activity is often regulated by the binding of transcription factors (TFs). Often the TFs are activators and involved in the recruitment of the RNA polymerase [12,13]. Another possibility is that the regulation is carried out from repressors (or inhibtors) by blocking the DNA sequence to obstruct the RNA polymerase from transcription. Sometimes regulation is carried out by recruiting TF recruiters or inhibitors, and sometimes by binding to a TF to inhibit the action of that TF (sees Figure 1 for a schematic illustration of the information flow in the eukaryotic cell). The cell can therefore be seen as a feedback control system, where gene activity is regulated by proteins that are regulated by the expression levels of other genes. The complete picture

Page 16: Gene networks from high-throughput data –Reverse

6

of the control system will include many intermediate levels which are regulated, e.g., certain proteins regulate genes in complexes and transcription factors (TFs) may be activated by metabolites. Moreover, in addition to transcription regulation through the gene expression post-transcription and translation regulation occur. In eukaryotes the transcribed mRNAs (pre-RNA) are processed by RNA binding proteins (RBPs) before translation. The RBPs regulates the transcripts through degradation, through the rate of export to the ribosome and through splicing. Splicing is the process where non-coding parts are removed from the transcripts. For humans this might been done in several ways (alternative splicing) depending on which RBPs that govern the splicing, which leads to that a transcript give raise to several different proteins. For example, 80% of the human genes are affected by alternative splicing [14] and the number of human proteins is about twice as many as the number of human genes.

Figure 1: Schematic chart of the information flow within an eukaryote cell. The information goes from the genes within the DNA molecule inside the nucleus out to the ribosomes-in the cytoplasm-via messenger RNA, where it is translated and proteins are built. The proteins-together with external stimuli and some RNA molecules-determine the subsequent expression, thus forming feedback control loops. The picture is a modification of a picture downloaded from http://www.ornl.gov/hgmis (visited 2003-03). 2.1.2 Available data

A reconstruction of the complete cellular network that describes gene regulation would require measurements of the concentrations of at least RNA molecules, proteins, metabolites and the knowledge of the interaction maps of these entities and their DNA-binding regions. Advancements in microarray technologies make it possible to measure the transcriptome (i.e., the set of mRNA molecules) in a cell relatively cheap, and by synchronizing cells (such that several cells are in a similar state) it is possible to measure the transcriptome in time-series. Importantly, several experimental techniques in molecular biology have been developed for automatic measurements of huge amount

Page 17: Gene networks from high-throughput data –Reverse

7

of units within cells. These high-throughput techniques form the basis in order to understand how the emerging omics (e.g., transcriptomics, proteomics, and metabolomics) interacts to understand cell regulation. However, large-scale data sets measuring protein or metabolic levels are only rarely present at the moment, but a huge amount of public data sets measuring the transcriptome in different conditions are available for many organisms in a unified format at the gene expression omnibus (GEO) repository [15,16]. Moreover, different types of binding information are also publicly available for some organisms originating from large-scale measurements and computational predictions, e.g., protein-protein and TF-DNA interactions [17-20]. Particularly informative for understanding genome-wide gene regulation is the interaction map between TFs and their DNA binding regions. This information may give direct structural properties of the regulatory possibilities, e.g., the presence of a binding element upstream of gene A for a TF which gene B codes for may induce a regulation of gene A by gene B. Several other types of data from specialized experimental techniques are also emerging. For example, quantitative real-time PCR assays (qRT-PCR) can measure the concentrations of thousands different transcripts at a high-resolution which may be needed for the TFs which often are abundant in low concentrations. Yet another type of experimental technique (and data) is the deep-CAGE which provides the expression of individual TSS [21-25] . Another type of structural information is based on the predictions of the interactions of proteins and the DNA sequence, i.e., prediction of putative regulations from the TF binding sites (TFBS). Yet another piece of information is the great source of common biological knowledge which comes from small-scale studies. When it comes to reverse engineering the system it may be incorporated in a variety of ways. The information may come from annotation knowledge or more “unclean” information as text-mining. Annotation knowledge may be the collection of detailed information from previous experiments given in a structural form, while text-mining may be a possibility to include the great unstructured data within the plethora of published biological papers in databases, e.g., PubGene extracts gene interactions from co-citations in PubMed [26]. The far most frequent way to measure changes in gene expression is the DNA microarray technique [27,28]. Some different platforms utilizing different techniques exists, e.g., the leading platforms are the ones obtained by Affymetrix and Illumina (for a comparison of the two see [29]). Most often the expression fold changes are the natural way to present gene expression for biologists, i.e., logarithmic values of the ratios (log ratios) to some control experiment. Reasons for this include that there are several orders of magnitude difference in data such that fold changes are the most natural way to present the data, but also that multiplicative errors in the concentrations are additive with respect to the log ratios. A problem with the microarray technique is that the data from it is noisy and several studies have reported that it to contain lots of outliers [30-33]. Some problems are connected to this, especially the lack of replicates in the early studies using the microarray technique created problems to estimate the detection limit. The presence of

Page 18: Gene networks from high-throughput data –Reverse

8

a heavy tailed distribution of errors in the log ratios was shown by Purdom et al. [31]. They showed from empirical observations that the associated errors cannot be described by the frequently assumed Gaussian distribution1. 2.1.3 Gene regulatory networks

A popular simplifying approach is to project transcripts and proteins on the associated coding genes, and disregard the metabolites. The corresponding effective gene regulatory network (also known as influence network [1]) describes regulations directly between genes, although the full regulatory system contains complex interactions between all these entities. This projection leads to loss of details in the model [2], but is probably a necessary simplification due to sparseness of data from similar conditions. Data sparseness is indeed a major problem for large-scale modelling projects. Despite the increasing amount of available high-throughput data sets we still have very few experiments from the similar conditions measuring different entities (e.g., mRNA and protein concentrations). Integration of data is also impeded by the different standards between array platforms. Today analyses are mainly performed on models based on projections of the full regulatory system onto subspaces, which contain fewer units and/or fewer interactions due to the limited number of included conditions [34,35]. Indeed, neither do interactions necessarily correspond to biochemical reactions nor physical interactions between proteins and genes; instead they correspond to the effective impact of the regulator gene on its downstream target. By modelling in the low-resolution space we may both have several pieces of structural information and qualitative information available of the same genes, thus we increase the amount of information known per unit. On the other hand we cannot obtain high-resolution molecular networks from the gene regulatory networks. That is, we cannot draw conclusions about the chemical reactions involved without extra information. Even with detailed knowledge of the kinetic framework in the full system, it would generally be hard to deduce the functional form of the kinetics in the gene-to-gene space in practise. This due to the fact that the functional form of the “effective” kinetics in the projected gene-gene space may largely depend on the exact model parameters of the full system. Since inference approaches do not know the model parameters in advance they should probably either be based on non-parametric adaptive models or simple enough models as we are short of data. 2.2 INFERENCE

In figure 2 we show a schematic overview of the methodology throughout this thesis. First we have the high throughput experimental data (left), which is refined into a network model representation (middle). Then it is possible to draw conclusions from this (right). Conclusions may be drawn both about the inferred network, i.e., how it is organized, and about what kind of dynamics it can exhibit. Ideally, eventually also such

1 Instead they showed it to closely follow an asymmetric double exponential (i.e. asymmetric Laplace) distribution, with a significant portion of outliers.

Page 19: Gene networks from high-throughput data –Reverse

9

topological and dynamical knowledge should be fed into the inference step as prior knowledge to confine the search space.

Figure 2: Map of the steps from experiments to knowledge, and a summary of some important features in the organization of gene regulatory networks and their impacts (taken from Paper VII). 2.2.1 Model exploration

The models utilized to describe the time-evolution of the gene levels with gene regulatory networks may be classified with respect to the number of states per node they allow. The Boolean networks and threshold networks are both ideal two state modeling formalisms, which are at one side of the extreme. They approximate the activity of each gene as being either simply on or off, which is motivated if the upper or lower limit of the gene activity often is attained. However, this approximation discards all intermediate steps and may therefore be of some importance for modeling multiple conditions within the same framework, but is probably worse to model a cell around a stable working point (e.g., the cell cycle). The Bayesian networks constitutes another modeling framework, which is more flexible and can also allow for multiple states per node and the framework is also extendable to model continuous levels. At the other extreme the number of states of the ordinary differential equations (ODE) based models assume a continuum of activity levels being possible, thus being particularly suitable for model systems around a working point. For the Bayesian networks continuity is modeled explicitly on the expense of either more data for inference or further a priori assumptions (e.g., restricting the degree distribution). This is problematic since already the two state Bayesian networks demands much data. With proper restrictions (i.e., regularizations) that are biologically motivated (e.g., LASSO, ridge, and elastic-net) the ODE based models are both computationally tractable and may need fewer experiments for inference than the former categories [9]. To determine causal relationships on a gene level we utilized (in Paper I,III,IV,V,VI) a popular model class that is continuous in time and deterministic. Within this class of

Page 20: Gene networks from high-throughput data –Reverse

10

models, the rate of change of gene expression for gene , at time , can quite generally be described by

Here is the gene expression at time t of gene j, N is the number of genes is an external perturbation and is a stochastic variable. 2.2.2 Linear ODE models

However, such a model has infinitely many degrees of freedom and the model class must be restricted. One common choice is to limit to time independent affine functions, let the perturbation be time independent and assume additive error , i.e.,

. (2)

In the case where the errors are individually independent Gaussian distributed, the minimization of the sum of squares in (3) below, with 2 is a proper choice for inference of the interaction matrix 2 (similarly 1 in (3) if are Laplacian distributed). Most of our work has been focussed on least squares methods (Paper I,III,IV,VI) but in Paper V (and to some extent also in Paper VI) we explored least absolute deviations for inference of gene regulatory networks. The assumption of an affine (effectively linear) relationship is certainly wrong, but still there exist several reasons why it might be of interest. In the case of expression measurements from a single condition, we can consider it as the result of a linearization of a non-linear function around a working point. This makes the approximation correct at least to first order. Moreover, inference of linear functions has great computational advantages, which might be of crucial importance given the huge size of the system. They may also be extended to cope with some non-linearities by transforming the input data, which we utilized in Paper III and V. Yet another reason for linear modelling is the lack of detailed knowledge of the “true” dynamical equations, due to the projection of gene regulation onto the space of genes only. The detailed equations governing the biochemical reactions can to some extent be described with Michaelis-Menten equations3, but when projecting several such processes, the result is unclear. It should also be noted that this formulation includes degradation into the model through negative self-interaction terms ( 0), and therefore all self-interactions are the combination of degradation and self-regulation. However, another common experimental situation is that the observation may come from perturbations, which have the purpose of driving the system far away from its working point. This type of

2In the case of enough data it gives the Maximum likelihood estimate (MLE). 3 At least if the system is sufficiently mixed.

, , , , (1)

Page 21: Gene networks from high-throughput data –Reverse

11

perturbations is one motivation to the investigation of non-linear functions in the dynamical equations (Paper V). Back to the inference problem again, importantly the 1) parameters in (2) can be inferred in subproblems each of size 1 . In the case where the number of measurements 1 , the matrix and can be obtained from solving N ordinary least squares problems ( 2) or least absolute deviations problems ( 1), where is the number of genes to be explained, i.e.,

· arg min·

, (3)

Here · , … , and is the time instance where the measurement k has been carried out. However, in the typical microarray scenario, we have , therefore we have infinitely many models that perfectly match the experimental data. To overcome this problem many different proposals have been suggested. Simulated experiments were proposed by D’haessler et al. [36], choosing · with the smallest L2-norm4 was exploited by Dewey et al. [37] and sparseness was in the gene regulatory network inference setting incorporated by Yeung et al. [38] and further extended by Wang et al. [39]. However, more measurements in the same time-series is an ineffective way of increasing the information content in the data [40]. The use of a sparse solution relies on a biologically motivated assumption [6], i.e., each gene is only regulated by a few others while choosing · with the smallest L2-norm corresponds to a fully connected network. However, working with perfect fits is a strong assumption when the noise level is high and perfect fits can be obtained from exactly K predictors for each gene, and as the number of experiments increases the network will become denser. Therefore it is necessary to control the model complexity in a careful way. 2.2.3 Regularization A more stable version of utilizing sparseness, the LASSO, was proposed by Tibshirani [41] which puts a L1 restriction on the predictors ( 2 in (4) below).

· arg min

·,

. .

(4)

Practically the exponent 1,2 is only utilized, mainly due to computational reasons. These two corresponds to regularized least absolute deviations (RLAD) (

4Lp‐norm · ∑

/.

Page 22: Gene networks from high-throughput data –Reverse

12

1) and least absolute shrinkage and selection operator (LASSO) 2). The parameter is the minimum L1-norm of the corresponding non-regularized solution of (3), i.e., the sum of absolute values of the least absolute deviations (LAD) 1 and least squares (LS) 2 solutions respectively. Hence, it serves as a baseline to the regularization parameter 0,1 , where 0 corresponds to the · 0 and

1 corresponds to the non-regularized case (LAD or LS). The least squares criterion is known to be sensitive for outliers, since the squaring puts extra notice to large deviations. However the least absolute deviations criterion is not affected at all by outliers, at least in response data . However, the solutions obtained from least absolute deviations are unstable, in the sense that small variations in data might result in large variations of the inferred parameters. This is not a problem with the least squares criterion where small changes in the measurements data always yields small changes in the inferred parameters (i.e., stability). The least absolute deviations criterion results in a solution which minimizes the median of the residuals, while the least squares criterion minimizes the mean of the residuals. Indeed, this reflects our assumption on the error distribution, where 1 corresponds to a Laplace and 2 to a Gaussian error distribution of . Thus the use of 1 in regression of gene regulatory networks may be motivated from the significant amount of outliers in high throughput data sources, but may be devastating as we also often are short of data [42]. Furthermore, the L1 constraint induces the solution to be on the border of a hyper octahedron, with sharp corners at the coordinate axes, thus the minimum of the quadratic objective on the constraint is therefore likely to have some of the coefficients exactly zero. Thus, the inequality in (4) sets some of the predictors to exactly zero, which in addition regularizes the solution. In a statistical frame the benefit obtained from shrinkage and regularization of · comes from a lower variance of the resulting fit. The fit itself is indeed a stochastic variable since it is a fit of a single realisation of the data. The decreased variance in the fit comes from posing an additional property of the model, i.e., assuming the model parameters to be small in a L1 sense, which leads to an increased bias of the model (see e.g., [43]). For the LASSO the amount of data needed to fit the model is strongly dependent of the number of predictors, of the gene of interest in the true model (4) [44]. In [44] they show that an unbiased estimate of the degrees of freedom -which is a measure of the model complexity- of the LASSO is the number of non-zero elements in the model. Thus, LASSO is particularly good to infer simply regulated genes, and is capable of performing simultaneous subset selection and parameter estimation without serious over-fitting. Despite its simplicity we utilized this formulation (4) ( 2) with the global value

0.1 in Paper I, which we set to have approximately two regulators for each gene on average. The network obtained from this study was shown even in the extreme case where 6178 and 73 to be biologically relevant (Paper I and IV) 5.

5 For computational convenience we actually utilized the L2-norm instead of the L1-norm in Paper I, but as the L2-norm always is smaller than the L1-norm this simply shrinks the solution further, which equally could have been performed from a smaller .

Page 23: Gene networks from high-throughput data –Reverse

13

The main reason to avoid scanning over a grid of values was that the computational implementation was based on treating (4) ( 2) as a general quadratic programming (QP) problem, which originally was suggested by Tibshirani [41] as a possibility. Some years later a specialized computational method for tackling (4) ( 2) much more efficiently was proposed by Efron et al. [45], based on the method of least angle regression (LARS). This was based on the fact that the coefficients · follow a piecewise linear path in and it is possible to compute the solutions for all 0,1 at about the same time as solving the problem (4) ( 2) for 1. This made it possible to scan the regularization parameter and utilize different model selection strategies as discussed in §2.2.4. The problem (4) ( 1) is called regularized least absolute deviations (RLAD) [46] and may for fixed be treated as an ordinary linear programming (LP) problem for which there exists very fast solving methods (e.g., LP simplex) also for very large systems, which we made use of in Paper V. Indeed there are shortcuts for this problem type as well, which we utilized in Paper VI with the ideas from Wang et al. [46]. These shortcuts also reveals a piecewise linear path of the coefficients. The solution can be computed extremely efficiently by updating two rows in the solutions of an algebraic system of linear equations [46] at the separation points where a new linearization is valid. In Paper VI we implemented this strategy for the part containing the RLAD. A more classical approach to shrink ·, thus also lowering the · variance is to replace the constraint of (4) by

wN

Ω . (5)

The problem is then called ridge regression (or Tikhonov regularization) and has been use for several decades in the statistician community (see for example the text-book [47]). Ridge regression and the LASSO rely on different prior assumptions about the distribution of the parameters. Ridge assumes the coefficients · to be Gaussian distributed and none of the coefficients will therefore almost certainly be exactly zero, while LASSO assumes sparse non-zero elements in · (Laplace distribution [43])6. However, as a subset selection operator the L1-norm tends to be somewhat greedy and picks only one among several highly correlated predictors. Ridge on the other hand picks all predictors to the degree of correlation to the target. Sparseness might be a purpose of its own for interpretability reasons, but often it does not always lower the prediction error significantly [48]. For gene regulatory network inference this obstacle might be severe for the prediction error, since many genes are correlated and the sparseity of data is tremendous. Nevertheless LASSO works well as a subset selection operator and can be combined with other types of data.

6Note that the distribution of the model parameters is in a Bayesian notion stochastic, hence the regularization corresponds to a particular prior distribution of the parameters. A non-Bayesian notion simply indicates that our regularizer is based on the assumption that the model parameters follow a particular distribution.

Page 24: Gene networks from high-throughput data –Reverse

14

Within the LASSO framework Zou et al. [48] proposed another important extension, called the elastic net, which modifies the inequality in the LASSO and ridge regression to yield a combined L1 and L2 shrink constraint, with the computational benefits of the LARS algorithm.

· arg min

·,

. . 1 2

(6)

Here is the parameter determining the mix between ridge ( 0) and LASSO

1. The net effect of (6) is both to shrink the number of non-zero coefficients and their magnitude (if 0 1). Furthermore, it makes it possible to have more included variables in the model (i.e., non-zero coefficients) than experiments in the model. Particularly, (6) has a great impact when the regulatory genes are correlated. As mentioned above LASSO selects only the most correlated genes, while the elastic net to some extent also includes slightly less correlated variables. In the case of highly correlated variables ridge regression and the elastic net have empirically been observed to yield lower prediction errors than the LASSO (see [48] and references therein), but may then include more indirect regulatory interactions as well. Therefore the choice of the mixture parameter for gene regulatory network inference may depend on whether direct interactions or low prediction error is preferred. For both the LASSO and the elastic net the inequality regularizes the solution, increases the numerical stability, and decreases the variance of the ·estimates from an increased bias. However, as previously indicated, the model assumptions of all these approaches are very different. The elastic net is an important recent inference procedure for gene regulatory networks, especially when prediction of new gene expression experiments is desired [48]. Interestingly, our application of the elastic net showed its effectiveness on real gene expression (paper VI), since we achieved best performer on an application of data integration utilizing the elastic net in the DREAM3 gene expression challenge [49]. However, when the main goal for inference is for interpretational reasons, we may still explore some other extensions to the LASSO and RLAD (4) instead. 2.2.4 Model selection

As mentioned in the previous section the main reason to introduce regularization is to bias the models towards our expectations, hence removing some degrees of freedom from the inference procedure. More generally, regularization is often used to prevent the growth of elements corresponding to highly correlated predictors in the solution of (3). This may be needed even if the number of experiments exceeds the number of putative predictors, in highly correlated data sets. Regularization, leads to less sensitivity in our estimations (i.e., lower variance of the fit) and if the bias is in a correct direction also drives the model towards a better model. Goodness of the model

Page 25: Gene networks from high-throughput data –Reverse

15

can be defined by various means, and ideally we want a model which is both structurally correct and have low prediction error. In the limits of the number of experiments ∞ the model which is minimal with respect to the Akaike information criterion, AIC (or equally Mallows Cp) or the Bayesian information criterion, BIC (similar to MDL) is preferred depending whether correct structure or low prediction error is the goal [43,50]. These criteria and also the generalized cross validation (GCV) criterion [51] are based on different trade-offs between model complexity and internal fit of the data. A fundamental assumption of all these criteria to be valid is lots of data, but instead the normal situation for high-throughput inference is insufficiency of data. Therefore, the intuitive minimal cross validation (CV) criterion is frequently used (e.g., in Paper III,V,VI and [52,53]) for data driven model selection of gene regulatory networks. The basic idea of cross validation is to estimate the prediction error by holding a small portion of the data aside for prediction and utilizing the inference procedure on the remaining data, by cycling the withheld data. Then the model with lowest estimated prediction error is selected. To estimate the expected prediction error as good as possible from the limited amount of experimental data we often have, systematic recycling of data is of crucial importance. However, having a low prediction error does not necessarily mean that the model is structurally correct. This is particularly true in the highly correlated data sets as gene expressions sets often are, where several similar candidate predictors exists. Moreover, this estimation of the prediction error may be unreliable since we have few experiments to make the estimate from [54]. Nevertheless, the alternatives might be overly conservative [54] and the performance of cross validation seems to work pretty good if used with care (Paper III,VI and [52,53,55]). Practically -fold cross validation is performed by dividing the data into disjoint sets and setting one of the sets aside each time, and utilizing the models inferred by the complementary set of experiments to predict the experiment of interest. Thus, straightforward implementations increases the computation time by almost a factor . However, sometimes careful considerations in the implementation may take advantage of computations made in other folds, as we did in Paper V. In practise a small leads to a bias of the model downwards, i.e. a pessimistic estimate of the prediction error since we use a small portion of the data. On the other hand for a large k we instead have a large variance in the data since the training samples are too similar, hence the trainings are not independent [43]. Nevertheless in practise we utilize as large as possible, again as we normally are short of data and for the typical high-throughput scenario we include moderate number of experiments , hence it is possible to utilize , which is called leave one out cross validation (LOOCV). The simplicity of the cross validation principle makes it flexible to new situations and it may be combined with any error function (e.g., the objective in (4)) or even using a correlation merit function as a measure which we utilized in Paper VI. In some cases the experimental setup naturally determines which experiments should be kept together (e.g., those within the same time-series in Paper VI) in order to increase the ability for the model to generalize. Moreover, as we found out in Paper VI the error according to

Page 26: Gene networks from high-throughput data –Reverse

16

cross validation is somewhat biased [55], since we pick the regularization parameter corresponding to the minimum value of the cross validation error. This may also be a problem when comparing minimum cross validation values obtained by tuning extra parameters for minimal cross validation. In order to avoid a bias in the model an extra cross validation loop should be made. However, the addition of an extra cross validation loop both increases the computation time by an extra factor 1 , and decreases the number of data points utilized in the solution of the sub problems furthermore. These two facts led us to use the biased cross validation to determine some extra parameters in Paper V and VI (more about this follows in §2.2.5). 2.2.5 Incorporating prior knowledge

The preference of parsimonious models discussed in §2.2.2 is based on our prior belief that gene regulation of individual genes can be understood from only a few other genes. This prior knowledge comes from prior studies in molecular biology (see e.g., [6] and references therein). As pointed out in §2.2.3 this decreases the data need tremendously. Moreover, the molecular biology community has through the years gathered huge amounts of data originating from both small scale studies and high-throughput experiments. The different data sources that may be taken into account includes TF-binding, protein-interactions, sequence information, literature knowledge and expression data. Indeed, these are in general crude clues about gene regulations and also connected with many false positive and false negative regulations. The introduction of several data types in the reverse engineering process request for a method to weight the data types appropriately, i.e., to prioritize, filter and in some cases discard the data based on how consistent it is with other data. The integration of multiple data sources in the inference serves two main purposes. First, it can improve the biological relevance of the gene regulatory network, i.e. the structure of the network becomes better. Second, it will hopefully also improve the ability of the model to predict new experiments with lower error. The biological significance is almost certainly increased if the inference algorithm is fed with known facts about the biological system. However, to lower the prediction error, careful statistically sound model selection must be performed. Generally low prediction error may not be critical if the trust of the extra data is great a priori and that the aim of the model is structural learning. One strategy to incorporate structural learning in the inference of ODE networks was developed by Wang et al. [39], which inferred a network that is structurally most consistent with data. The main problem with such a strategy is that one relies totally on the prior knowledge data. This is problematic since the prior knowledge as stated above is not perfect and may in some cases even be irrelevant for the problem of interest. Therefore a more open-ended incorporation would be to let the data decide the relevance from itself. In Paper V we outlined such a principled approach, which we successfully applied to increase the prediction performance of new expression data. The main idea is to softly integrate the extra data by the extent it decrease the prediction error of the model. Therefore we modify (4) (given ) to yield

Page 27: Gene networks from high-throughput data –Reverse

17

· arg min·

. . Π 1 2 .

(7)

The introduction of the weights makes it possible to include expression profiles of different relevance to the problem or lower the impact of noisy experiments. To set the weights of profiles of less relevance we need some data driven method, while to decrease the impact of noisy experiments inverse variances 1⁄ is typically used for 2 and 1⁄ for 1. In the case of including some expression profiles of less relevance we should utilize a separate evaluation function. Particularly, we utilized the non-weighted error-function disregarding the less relevant experiments together with cross validation in Paper V and VI to determine the weights . In addition Π in (7) let us penalize known regulatory interactions less than novel interactions. This reduces model complexity of a model consistent with our prior belief, but still let us discovers novel interactions from expression data. In a Bayesian setting we are manipulating the prior probabilities for interactions such that if Π is high there is a low prior probability for an interaction as the expression data are realized. Importantly, it is possible to use modified versions of the existing computationally efficient LARS and RLAD algorithms to solve (7). Note also that Π 0 and that we must decide a model how the prior knowledge is used to construct Π . In Paper VI we utilized

Π1

1 (8)

Here is large if the particular data source suggest an interaction between gene and and low if it suggests that there is no interaction. For example in Paper V we utilized the YEASTRACT database [56,57] of TF bindings on the DNA sequence. Then 1 if the database contains supportive evidence of an interaction and

0 if no evidence exists. Moreover, intermediate values may also be used for putative bindings, which we also did in Paper VI. Still the free parameter is to be determined from another procedure, and a natural way is to let the cross validation error determine this. In Paper V we determined , but also some of the weights (which was the same for all experiments of the same type) from the cross validation error. As previously mentioned (at the end of §2.2.4), when we set an extra parameter as the minimizer of the cross validation error it never leads to an increase (if the non-integrated solution still is possible). Therefore in order to avoid over-fitting we strived to use extra parameters related to the prior knowledge extremely carefully in our study in Paper VI. However, to fully employ the prior material it might be too conservative to choose the same for all genes since the knowledge is not homogeneously distributed over the genes, thus being of different importance for different genes. Therefore it would probably be fruitful to extend , such that , … , . One

Page 28: Gene networks from high-throughput data –Reverse

18

way to control the bias of the cross validation is to compute the null hypothesis and reject the null hypothesis only if the cross validation is lowered enough with respect to that (e.g., lower than 95% of the cross validation errors in the null distribution). The null hypothesis here corresponds to randomly ordered · . Indeed, this takes long computation time since it requires different null hypothesis for different genes, and might be combined with early stopping criteria to be more efficient. 2.3 STRATEGY ASSESSMENT DREAM 2.3.1 Model assessment

Since the validity of several of the assumptions may be questionable the outcomes should be taken with some care, thus the model should be validated a posteriori. Ideally, this validation should contain the validation of individual edges and prediction of responses of the system in new experiments. The validation could be done experimentally or more cost-effective by reusing available experiments (that are not used for inference). However, several research groups have reported networks depending on different data sources with great variability (see Paper IV). This stresses the complexity of the true regulatory system and also opens the question about other sources for validation. The variability among the models may be the result of either conceptual differences of the approaches or improper modeling. Moreover, large-scale gene regulatory network models are to a large extent incomplete and therefore we have adopted alternative strategies to validate the inferred models. In Paper I and IV we validated our model by observing that the network topology of the inferred network was in agreement with other studies and that the positioning of the genes seemed plausible. Still some benchmark studies by others have compared the effectiveness of different inference approaches by validating their predictions on the incomplete set of known regulatory interactions [8,9]. This may exclude some algorithms, but since the exact problem to be solved is somewhat fuzzy some algorithms may show advantageous in different situations. Moreover, most algorithms are not standalone algorithms and are subject to some arbitrariness. Therefore these approaches must find objective ways to set the arbitrariness of these parameters, which may put the study in to question. To establish which algorithms really are best suited for some particular problems Stolovitzky, Monroe and Califano initiated the dialogue for reverse engineering assessments and methods (DREAM) as a conference in 2006 [58]. From this initiative an annual research competition started. The DREAM challenges let research groups utilize and adjust their favourite methods on different biologically motivated problems (called challenges by DREAM), where objective answers are known by the organisers. Since no group know the answers before making the predictions, and that all known benefits of each of the algorithms may be utilized by the developing researchers, the assessment is fair. The DREAM competitions has been held in three conferences so far, called DREAM2 (2007), DREAM 3 (2008), DREAM4 (2009) [49,59]. The DREAM challenges are a great resource for the community as it both produces lots of benchmark sets and let developing researchers show the effectiveness of their method in a fair challenge. For the benchmark data we know the details of the successful approaches but unfortunately the identities of the less successful ones remain hidden ([49,59-66] and Paper III, VI). It is unfortunate that the less successful strategies remain hidden

Page 29: Gene networks from high-throughput data –Reverse

19

since in general much also can be learned from unsuccessful attempts, but was introduced to ensure that the most prestigious research groups presumably also could gain from it. Importantly, in DREAM3 some information about the less successful strategies was gathered (survey in [49]). 2.3.2 Assessment of our strategy

As we showed some effectiveness of the LASSO on a large-scale in Paper I we contributed to the network inference challenge of DREAM2 with an extension to our earlier work on the LASSO. The challenge was divided into several sub challenges and by our LASSO approach (Paper III) we were selected as best performer in the reverse engineering of Barabasi-Albert networks [11] with a skewed distribution of degrees (more in §3.1.1). In the DREAM3 gene expression challenge we found an arena for testing the framework outlined in §2.2.5 where the aim was to predict relative RNA expression time-responses of a set of target genes on the knock-out of a given TF and drug stimulation. The interest in this challenge was that the measurements were carried out on a real biological cell, and that the organism was the well studied Saccharomyces cerevisiae (budding Yeast). Since the names of the RNA probes where given in this challenge it called for an integrative approach. Hence, we tested the different versions outlined in §2.2.3 - §2.2.5 and assessed the performance by utilizing cross validation of the scoring of the DREAM challenge for model and strategy selection. From this we choose a version of the elastic net (7) with lots of public material expression profiles and binding material in the prior knowledge matrix (see Paper VI). Indeed, this strategy turned out to be successful and we won the title best performer in this challenge. However, somewhat puzzling was the fact that the contribution from the prior information was low and that another simpler algorithm achieved almost as high score as we did (and was also selected as best performer [60]). Still we achieved an increase in performance using the extra material, but not that much. This reveals-as previously mentioned in §2.2.5-that much can be done in this direction, and particularly we believe that a method how to tune the free parameters connected with , , in equation (7) must be developed. The main problem is that the introduction of these extra parameters always leads to a cross validation decrease; therefore this decrease must be controlled. In Paper VI we controlled the decrease by choosing the same , for all genes but different for different data types and utilized different for different genes. The rationales for this strategy are that both and are model regularization parameters controlling the degree of freedom of the fit, and should therefore have the same flexibility. Moreover, we could incorporate prior material in a variety of ways, e.g., equation (8) is rather arbitrary and can have different sources. Therefore, our original concern was to be able to reject irrelevant priors in a powerful way. However, to speculate further on successful modifications, I believe that although should be different for different genes as it controls the number of predictor genes it would probably be better to use a single for all genes. This mainly since the key reason for the introduction of is to control the strong correlations among gene profiles. Hence, as the input data is identical for each of the regression sub problem in (7) and therefore the pair wise correlations of the predictors, it would presumably be a reasonably good approximation to utilize the same for all genes. Moreover, the parameters connected

Page 30: Gene networks from high-throughput data –Reverse

20

to the prior data might require different values for different genes. One remedy then is to utilize the strategy outlined at the end of §2.2.5.

Page 31: Gene networks from high-throughput data –Reverse

21

3 NETWORK ANALYSIS Analyzing different inferred gene systems is of importance to learn basic organizational principles of gene regulatory networks. These principles could both be used as prior knowledge utilized in the inference of the system or as validating new findings as we performed in Paper I and IV. The DREAM conference provides great tools to compare new inference strategies with old ones, without validation based on network analysis. Still, to validate new networks and also to develop better reverse engineering schemes network analysis is of great interest. This chapter is divided into two sections, where §3.1 concerns statistical features of the architecture of gene regulatory networks where we mainly discuss the results of Paper I, II and IV. In §3.2 we discuss system level characteristics of gene regulatory networks which is greatly affected from the features described in §3.1 and is primarily based on our findings in Paper IV and much of the text is recycled from Paper VII. 3.1 NETWORK ARCHITECTURE 3.1.1 Hubs

A quantity that has been under much attention in the research of complex networks since the late 1990's is the distribution of node degrees, i.e., the number of interactions on a node. From the theory of Erdös-Renyi (ER) networks [67], where all node pairs have the same probability to be connected, a very narrow Poisson distribution is expected. In contrast many real networks where during the late 1990's shown to be scale-free, i.e., exhibit a broad degree distribution following approximately a power law for high degrees [68]. These nodes with high degrees have received a great portion of attention and are commonly referred to as hubs. Of particular importance in our studies are the regulatory hubs, which have lots of out-going edges. However, for a global understanding of the system is the presence of a large portion of weakly interacting units (having low-degree) also of importance. A popular model to explain the degree distribution became dynamical growth models, such as the Barabasi-Albert (BA) and gene duplication models [58-63]. The former has been a popular model to explain the presence of broad degree distributions in various kinds of networks, basically from the dynamic network growth where new nodes are more likely to interact with already well connected nodes [69] , which is called rich-gets-richer or Matthew effect [70] and more frequently preferential attachment [69,71]. Gene duplication models are based on growth by the duplication of units (e.g., genes) including the interactions and introduce some kind of random rewiring of interactions (diversification) [72]. This model origin in evolutionary principles and reproduce the degree distribution (as well as some other properties) for some protein-protein interaction networks [73,74] for certain duplication and diversification rates. Hubs have been reported in various biological networks [11,69,75-77]. Starting with the pioneering findings by Jeong et. al. [76,77] in a protein interaction network and metabolic network. The main reasons for its occurrences were suggested by Barabasi

Page 32: Gene networks from high-throughput data –Reverse

22

et. al. [75] to produce a system stable against random failures, yet fragile towards targeted attacks [62,63]. In a gene regulatory network a significant portion of the transcription factors (TFs) has been shown to act as regulatory hubs [78,79] which then may act as master switches between different states of the cell. In Paper I and IV we also observed TFs and essential7 cell cycle associated genes as regulatory hubs, which we use as part of the validation. Moreover, Maslov and Sneppen [75] demonstrated that regulatory hubs in the yeast transcriptional network (derived from literature mining in [81]) tend to regulate genes with low in-degree (peripheral genes) on average (assortative mixing [74]). This suggests that many regulatory hubs are not master switches, but instead their main goal is to mediate a signal in a cost efficient manner. Furthermore, in Paper IV we reported these hubs to be kept quiet by a negative in-regulation, which is consistent with the findings in [82]. Indeed, all these aspects associated to hubs may be used as either validation of the results or should be incorporated as facts built into the reverse engineering process. In Paper I and IV we found that the degree distribution of the inferred network to exhibit a broad distribution of out-degrees, where some hubs directly controlled significant portions of the networks. Interestingly, the LASSO performed much better on the BA network than the ER network in the DREAM2 challenges (Paper III). This is not trivial from the explicit L1 constrains of the in-regulations in the LASSO. Moreover, the above mentioned aspects associated to hubs do contribute to the overall system characteristics (see §3.2.5). 3.1.2 Higher organization Evidently, the degree distribution is an important design principle, which is simple, frequent and can explain several topological features of networks [11,68,69,75]. However, to describe and gain understanding of the system it is necessary to study features which involve the interaction of several entities. Several authors have found recurrent graph structures called motifs to be significant for gene regulatory networks [5,81]. They are also frequently associated with specific structures of the dynamical parameters [5,83] (and Paper IV), e.g., the signs in the feed forward loop (FFL) are often organized to yield coherent signals to the downstream target. Interestingly, in Paper IV the enrichment of several different motifs in our network was in good agreement with what others have reported [7,81]. Moreover, as we also consider the signs within the motifs we could effectively separate several of the motifs into one of the categories coherent/incoherent (see Figure 3 for examples). For example, we found that the FFL motif containing four nodes (4-FFL) often was utilized in its incoherent version, thus working as a short impulse generator. The dynamical effect of some motifs in isolation has been studied in experimental detail [83] and their numbers have been shown to change considerably between exogenous and endogenous processes [7]. In Paper IV we also observed that the genes involved in the incoherent FFLs are annotated as associated to the cell cycle.

7 A gene is essential (sometimes also called lethal) if gene deletion of the gene of interest kills the cell. As gene function is condition specific it is necessary to delete a gene in various conditions and measure the growth of the cells. Another less stringent use of the concept by applying only to growth rich media has been adopted by several researchers ([76,80] and by us in Paper IV).

Page 33: Gene networks from high-throughput data –Reverse

23

Figure 3: Sign distribution in network motifs and the observed density from the network in Paper IV (this is from Paper IV). Negative signs (repression) are coded by a perpendicular line to the line connecting the regulator with its target, and positive signs (activation) by two tilted lines. The classification of a motif being coherent (incoherent) refers to whether the input signals to the target(s) receives from its ancestors are consistent (or not). Another popular concept which is frequently reported in the context of gene networks is the concept of modules. The core idea is that a module is a functional evolutionary tested unit which works relatively isolated to perform some specific tasks [84]. Hence, it consists of genes with high degree of process similarity (Paper II) , i.e., a group of genes where several of the genes share GO process annotation [10]. In engineered systems, modules may be robust subunits performing different tasks [84]. In biological systems, modularity brings an evolutionary flexibility to the system, and recent studies suggest it to originate from time-varying evolutionary goals [46, 47], and that several of the modules serves their own specific goals. Moreover, some studies have revealed that modules can be organized into meta-modules [85] in a hierarchical structure [11]. Modules have been frequently employed from different angles within the systems biology community (Paper II, IV and e.g., [86,87]), but recovering gene groups with high process similarity in common. The first attempts used time-series clustering to detect gene clusters with high degree of process similarity [86]. More recently, network approaches originally based on the graph theoretic concept of tightly connected sub graph, called communities [88] have been adopted to the same problem. Furthermore, integrated approaches [87,85,89] which take into account both structural and expression data have also been adopted for this problem [89]. Since genes within the same module often share some processes annotation it is likely that their expression profiles show high correlation (or anti-correlation). Therefore some approaches of first clustering the data and then forming a network between the clusters have been utilized [90,52]. This seems crucial to the LASSO and RLAD (4) as we otherwise only find the gene with the highest correlation in the module to the target derivative as regulator, which could be somewhat accidental for the noisy high-throughput data [52]. However, in Paper IV we refrained from such an approach, since it is then hard to interpret the meaning of

Page 34: Gene networks from high-throughput data –Reverse

24

the edges. Interestingly, we still inferred a modular network with high degree of similarity in processes. However, by utilizing time-series co-expression clustering only on the same data, we also detect groups of genes with high shared degree of similarity in processes. But, as we compute the similarity between these clusters we see that the partitions are not more similar than expected by chance (see Paper IV and [91]). Evidently, co-expression based clustering as well as network clustering, detects gene sets with high process similarity, but otherwise the two approaches has little in common. Another remedy to avoid the two stage pre-clustering and inference procedure is to utilize the elastic net (6). The elastic net trades some strength from the most correlated putative predictor to others inter-correlated predictors. In a statistical setting it is motivated from a decrease in the variance of the fit, but from a network perspective it favors that several genes from a module are seen as regulators.

Figure 4: Example of a small network with community structure (this is from Paper II). Communities are marked using dashed circles.

3.2 STABILITY AND FLEXIBILITY 3.2.1 Background

The stability of the cell is crucial for its survival, i.e., it needs both being robust against random failures (e.g., unexpected gene disruption) and robust against small environmental changes, e.g., temperature fluctuations. The great challenge to understand system characteristics of gene regulatory networks has motivated several researchers to address the issue of system stability (and concepts connected with it) from different perspectives with different limitations. Therefore there is a need to clarify the usage of the concepts dynamics, stability, and flexibility since they are frequently employed within the systems biology community with multiple meanings. Dynamics is widely used for cellular networks for both the topological evolution as well as to describe the expression dynamics. The former represents the long term evolution of system characteristics over several generations, while the latter describes relatively short term fluctuations of the usage of the system over parts of the life time of an individual. We will here use the term to refer to the internal dynamics of a network model, i.e., the short-term fluctuations. Our definitions are based on Paper IV were we also addressed these popular issues, but with modifications of old concepts to fit in our framework. That study served both the purpose of further validating the network, but also led to new understandings on how individual architectural design principles on a global level impacts the stability and flexibility of the system.

Page 35: Gene networks from high-throughput data –Reverse

25

3.2.2 Traditional definitions of stability

Traditionally, stability has been divided into structural and dynamical stability. There are several reviews in the field discussing these concepts (e.g., [92,93]).

• Structural stability refers to the effect on the solution paths from a perturbation of the system parameters, i.e., a system is said to be structurally stable if small alterations of the model parameters lead to modest changes in the expression trajectories. • Dynamical stability refers to the stability of the expression trajectories upon the addition of noise.

To this Albert and Barabasi added a crude version of structural stability, which we call topological stability [75]. They assessed the stability (they called it robustness, a terminology followed by others) of the system from changes (deletion of nodes or edges) of topological quantities, such as average path length and diameter. Not surprisingly, they concluded that networks with only a few well connected hubs were robust (or stable) against most random removals, but extra sensitive towards removal of some specific nodes or edges (fragile). Indeed, the structural stability analysis is probably of great importance for some networks, where the main source of errors is edge/node failures (e.g., power grids). However, on the scale of intracellular processes the major source of perturbations for gene regulatory networks comes from changes in the gene expression pattern, based on noise as well as external stimuli. Therefore, the dynamical stability is to be of most importance to understand the mechanisms of the topological features in gene regulatory networks. However, from an evolutionary perspective it might still be of importance to analyze the topological stability, e.g., the impact of gene duplication and the effect of mutations in promoter regions. Nevertheless, in this thesis our main interest is on dynamical stability. 3.2.3 Motivation of new concepts, stability and flexibility

Dynamical stability in ODE models is often determined by analyzing the linearized system around a working point. It is then easy to compute the trajectory of any expression state from the eigensolutions to the Jacobian matrix, given that it stay close enough to the working point were the linearization still is valid. For a linear system the Jacobian matrix is identical to the interaction matrix in (2) and generally it corresponds to the first order approximation (linearization) in its present state. Furthermore, the eigensolutions to the Jacobian describes either exponentially growing, decaying or oscillatory states. The time-evolution explicitly takes the form

(9)

Here, corresponds to the th component of the th eigenvector to the Jacobian, is the corresponding (complex) eigenvalue, and coefficients are to be determined from some known condition (e.g., the initial conditions 0 ). Conventional analysis adopted from linear systems then concludes that the presence of any exponentially

Page 36: Gene networks from high-throughput data –Reverse

26

growing state (i.e., if the real part of any is greater than zero 0) eventually will lead to system instabilities, due to noise. This may be problematic in a large complex system (with sufficiently number of edges) if it is not properly tuned, since increased complexity leads to an increased probability for instabilities in random systems [94]. However, it may be compensated from strong self-degradation in the system. In a series of work Sinha et al. [95-97] studied the impact of some topological features on stability. In particular, they concluded that the presence of hubs (with random sign distribution) increases the probability of some growing modes, which may be compensated for by a modular organization. Notably, their studies reveal that dynamical stability in the classic strict sense is less probable with the heavy tailed out-degree distribution found in gene networks by other groups [76-79,98]. These studies reveal interesting couplings between topology and dynamical stability, but completely lack the influence of the model parameters on the dynamical stability. This influence is probably crucial for gene regulatory networks, since some studies have confirmed an intricate balance between activation and repression (Paper IV and [82]). The flip-side of dynamical stability is the responsiveness of the system to stimuli to switch into new working states. Evidently this flexibility comes on the expense of stability in the system. In a Boolean setting a system with a compromise between these two is called critical and a recent study by Balleza et al. [99] revealed several different organisms to have topologies and model parameters to facilitate a near critical system. Even though our primary interest is networks with an ODE dynamics, the studies of Boolean networks are of importance for the discussion of system stability in gene regulatory networks since not much have been studied for ODEs. However, as Boolean networks cannot admit intermediate activity levels they may not capture dynamic instabilities around some working point. When the stability analysis is performed on a linear system, a single growing mode will induce instability of the whole system. On the other hand a growing mode in the linearization of a non-linear model may also reflect a dynamic state switch to a new working point. For example, linear instabilities are deliberately used in some air-fighters control system to increase its maneuverability [100]. Hence, in a gene regulatory network, a growing eigenmode may reflect a fast switch from the current working state to some other which beneficially should be rapid, e.g., a switch from the cell cycle to stress conditions. However, as the presence of growing modes leads to a drift from the current working point, which must be compensated by non-linear effects, growing modes must be utilized only cautiously to be advantageous for the regulation of the cell. Therefore, but also because of errors when inferring a large complex network on noisy incomplete data, a linear model of a gene regulatory network will most probably contain some (but not too many) fast growing modes. The definition of stability based on its eigensolutions needs as a consequence to be adapted. 3.2.4 Mathematical definitions

From the above discussion it is clear that we need some stability measure that is softer than the standard definition that states that a system is linearly stable if and only if all

Page 37: Gene networks from high-throughput data –Reverse

27

eigenvalues to the interaction matrix (equation (2) have negative real parts. This means that all the possible trajectories in (9) are linear combinations of exponentially decaying and oscillatory eigenmodes. We therefore define the instability, , as the sum of all positive real parts to the eigenvalues to . We interpret the instability as the responsiveness to random signals (noise).

. (10)

Here is the jth eigenvalue to w, which we from now assume to be ordered such that

for all j and that is the number of eigenvalues with positive real parts. High responiveness to specific signals, i.e., some eigenvalues with large real parts may generally also include responsiveness to noise (i.e., many eigenvalues with large real parts). To circumvent most of the dependency problem of flexibility and stability, we define flexibility as the effective number of eigenmodes, where real parts are significantly larger than the real parts of the other, via the participation ratio [101]:

∑. (11)

PR is then the same as inverse of the 4th moment of the (L2) normalized vector , … , . PR quantifies how many eigenvalues are significantly larger than

the rest (among the ones with positive real parts). Thus, refer to a situation of non-specific drift where all unstable modes are as significant. Contrary 1 has only one component non-zero, and for equally large eigenvalues (and the rest is zero) we have . Therefore we define the flexibility index as

1 (12)

Then becomes an index between zero and unity, where values close to zero and unity refers to networks with non-specific and specific drift respectively. Similarly we transform to a stability index

1 , (13)

where is the maximum value of the instability and serves the purpose to make dimensionfree between zero and unity. In our studies we have approximated it from the sum of absolute values to all matrix elements of . This corresponds to the sum of the largest real parts contained in each of the rightmost circles given by the Gershgorin circles and in the special case of non-intersecting circles serves as a theoretical upper bound to . Thus is a soft measure of the degree of stability which can be used on linearizations of non-linear systems, and measures how responsive the system is to some particular stimuli. 3.2.5 Findings

The findings obtained by the concepts defined in §3.2.4 can be summarized in the system analysis carried out in Figure 5. This was also a main conclusion of Paper IV which we further explained in Paper VII.

Page 38: Gene networks from high-throughput data –Reverse

28

Figure 5: Stability and flexibility for the inferred network (Yeast) and several randomized versions thereof (this is from Paper IV). The error bars cover two standard deviations of the ensemble networks and are obtained from repeating the design of each network. By comparing flexibility and stability of the random ensembles of Figure 5 several conclusions may be drawn. If we follow the arrows backwards, starting from Yeast (network inferred in Paper I) and stepping back to Yeast Topology (same topology as Yeast but randomized signs) via Repressed Hubs (same as Yeast but randomization of the signs going to non-regulatory hubs) we can draw the conclusion that the organization of the signs (i.e. repression) on regulatory hubs increased the stability of the network. Moreover, stepping to the ensemble REWIRED (same degree distribution as Yeast obtained by the procedure in [102]) we can see that the topological features beyond the degree distribution incraseses both stability and flexibility. Furthermore, stepping to the ER-network (Erdös Renyi non-structured ensemble with narrow degree distribution) it is evident that the presence of hubs increased both stability and flexibility of the system. In Paper VII we discuss the possible reasons for these findings, which we proposed in Paper IV to partly validate the inferred Yeast network and partly to delve into the large-scale impacts of network design principles of gene regulatory networks. Last, the RAND network ensemble provides the result from the inference approach on randomizations of the expression data. Thus RAND provides a kind of sanity check that our conclusions are not completely dependent of the choice of inference scheme. Therefore it is satisfying that the Yeast network has higher stability than RAND. This put some strength into our findings and the large standard deviations reveal that the LASSO (4) is capable of infer networks with varying stability. Moreover, the flexibility of RAND is similar to Yeast which possibly reveals some of the limitations with the proposed measure of flexibility. A limitation of both measures is to compare models with different magnitudes and statistics of the model parameters. Although we have made an attempt to make it independent of the size of the input data,

Page 39: Gene networks from high-throughput data –Reverse

29

the safest way to use them is to compare network with identical but swapped matrix elements, i.e. with its randomized counterparts. The proposed quantities stability and flexibility may be used to understand the meaning of different network phenomena. But, to fully exploit the concepts it could be used directly in the inference setting. Somewhat problematic to this is that it depends on the eigenvalues which only can be computed after a full network inference has been carried out. Therefore, the main objective of these quantities will probably be to select between inferred competing networks.

Page 40: Gene networks from high-throughput data –Reverse

30

4 REFERENCES [1] M. Hecker, S. Lambeck, S. Toepfer, E. van Someren, and R. Guthke, “Gene

regulatory network inference: Data integration in dynamic models--A review,” Biosystems, vol. 96, Apr. 2009, pp. 86-103.

[2] P. Brazhnik, A. de la Fuente, and P. Mendes, “Gene networks: how to put the function in genomics,” Trends in biotechnology, vol. 20, Nov. 2002, pp. 467-472.

[3] R. Bonneau, “Learning biological networks: from modules to dynamics,” Nat Chem Biol, vol. 4, Nov. 2008, pp. 658-664.

[4] L.B. Edelman, J.A. Eddy, and N.D. Price, “In silico models of cancer,” Wiley Interdisciplinary Reviews: Systems Biology and Medicine, vol. 9999, 2010, p. n/a.

[5] U. Alon, An introduction to systems biology : design principles of biological circuits, Boca Raton, Fla.: Chapman and Hall/CRC, 2007.

[6] R.D. Leclerc, “Survival of the sparsest: robust gene networks are parsimonious,” Mol Syst Biol, vol. 4, print. 2008.

[7] N.M. Luscombe, M.M. Babu, H. Yu, M. Snyder, S.A. Teichmann, and M. Gerstein, “Genomic analysis of regulatory network dynamics reveals large topological changes,” Nature, vol. 431, Sep. 2004, pp. 308-312.

[8] J. Faith, B. Hayete, J. Thaden, I. Mogno, J. Wierzbowski, G. Cottarel, S. Kasif, J. Collins, and T. Gardner, “Large-Scale Mapping and Validation of Escherichia coli Transcriptional Regulation from a Compendium of Expression Profiles,” PLoS Biol, vol. 5, Jan. 2007, p. e8.

[9] Mukesh Bansal, Vincenzo Belcastro, Alberto Ambesi-Impiombato, and Diego di Bernardo, “How to infer gene networks from expression profiles,” Mol Syst Biol, vol. 3, Feb. 2007.

[10] M. Ashburner, C.A. Ball, J.A. Blake, D. Botstein, H. Butler, J.M. Cherry, A.P. Davis, K. Dolinski, S.S. Dwight, J.T. Eppig, M.A. Harris, D.P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J.C. Matese, J.E. Richardson, M. Ringwald, G.M. Rubin, and G. Sherlock, “Gene ontology: tool for the unification of biology. The Gene Ontology Consortium,” Nature genetics, vol. 25, May. 2000, pp. 25-29.

[11] A.L. Barabasi and Z.N. Oltvai, “Network biology: understanding the cell's functional organization,” Nature reviews.Genetics, vol. 5, Feb. 2004, pp. 101-113.

[12] M. Ptashne, A. Gann, and I. ebrary, Genes & signals, Cold Spring Harbor, New York: Cold Spring Harbor Laboratory Press, 2002.

[13] M. Ptashne and A. Gann, “Transcriptional activation by recruitment.,” Nature, vol. 386, Apr. 1997, pp. 577, 569.

[14] A. Matlin, F. Clark, and C. Smith, “Understanding alternative splicing: towards a cellular code.,” Nature reviews. Molecular cell biology, vol. 6, Maj. 2005, pp. 398, 386.

[15] R. Edgar, M. Domrachev, and A.E. Lash, “Gene Expression Omnibus: NCBI gene expression and hybridization array data repository,” Nucleic acids research, vol. 30, 2002, pp. 207-210.

[16] T. Barrett, D.B. Troup, S.E. Wilhite, P. Ledoux, D. Rudnev, C. Evangelista, I.F. Kim, A. Soboleva, M. Tomashevsky, and R. Edgar, “NCBI GEO: Mining tens of millions of expression profiles - Database and tools update,” Nucleic acids research, vol. 35, 2007, pp. D760-D765.

[17] M.C. Teixeira, P. Monteiro, P. Jain, S. Tenreiro, A.R. Fernandes, N.P. Mira, M.

Page 41: Gene networks from high-throughput data –Reverse

31

Alenquer, A.T. Freitas, A.L. Oliveira, and I. SaÌ-Correia, “The YEASTRACT database: a tool for the analysis of transcription regulatory associations in Saccharomyces cerevisiae.,” Nucleic acids research., vol. 34, 2006, pp. D446-451.

[18] P.T. Monteiro, N.D. Mendes, M.C. Teixeira, S. D'orey, S. Tenreiro, N.P. Mira, H. Pais, A.P. Francisco, A.M. Carvalho, A.B. Lourenço, I. Sá-Correia, A.I. Oliveira, and A.T. Freitas, “YEASTRACT-DISCOVERER: New tools to improve the analysis of transcriptional regulatory associations in Saccharomyces cerevisiae,” Nucleic acids research, vol. 36, 2008, pp. D132-D136.

[19] L. Salwinski, C. Miller, A. Smith, F. Pettit, J. Bowie, and D. Eisenberg, “The Database of Interacting Proteins: 2004 update,” Nucl. Acids Res., vol. 32, 2004, pp. 451, D449.

[20] K. Prasad, R. Goel, K. Kandasamy, S. Keerthikumar, S. Kumar, S. Mathivanan, D. Telikicherla, R. Raju, B. Shafreen, A. Venugopal, L. Balakrishnan, A. Marimuthu, S. Banerjee, D. Somanathan, A. Sebastian, S. Rani, S. Ray, H. Kishore, S. Kanth, M. Ahmed, M. Kashyap, R. Mohmood, Y. Ramachandra, V. Krishna, A. Rahiman, S. Mohan, P. Ranganathan, S. Ramabadran, R. Chaerkady, and A. Pandey, “Human Protein Reference Database--2009 update.,” Nucleic acids research, vol. 37, Jan. 2009, p. gkn892.

[21] T. Shiraki, S. Kondo, S. Katayama, K. Waki, T. Kasukawa, H. Kawaji, R. Kodzius, A. Watahiki, M. Nakamura, T. Arakawa, S. Fukuda, D. Sasaki, A. Podhajska, M. Harbers, J. Kawai, P. Carninci, and Y. Hayashizaki, “Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage.,” Proc Natl Acad Sci USA, vol. 100, 2003, pp. 15776 - 15781.

[22] R. Nilsson, V.B. Bajic, H. Suzuki, D.D. Bernardo, J. Bjorkegren, S. Katayama, J.F. Reid, M.J. Sweet, M. Gariboldi, P. Carninci, Y. Hayashizaki, D.A. Hume, J. Tegner, and T. Ravasi, “Transcriptional network dynamics in macrophage activation,” Genomics, vol. 88, Aug. 2006, pp. 133-142.

[23] P. Carninci, A. Sandelin, B. Lenhard, S. Katayama, K. Shimokawa, J. Ponjavic, C.A. Semple, M.S. Taylor, P.G. Engstrom, M.C. Frith, A.R. Forrst, W.B. Alkema, S.L. Tan, C. Plessy, R. Kodzius, T. Ravasi, T. Kasukawa, S. Fukuda, M. Kanamori-Katayama, Y. Kitazume, H. Kawaji, C. Kai, M. Nakamura, H. Konno, K. Nakano, S. Mottagui-Taber, P. Arner, A. Chesi, S. Gustincich, and F. Persichetti, “Genome-wide analysis of mammalian promoter architecture and evolution.,” Nat Genet, vol. 38, 2006, pp. 626 - 635.

[24] U. Nagalakshmi, Z. Wang, K. Waern, C. Shou, D. Raha, M. Gerstein, and M. Snyder, “The transcriptional landscape of the yeast genome defined by RNA sequencing.,” Science, vol. 320, 2008, pp. 1344 - 1349.

[25] The FANTOM Consortium, H. Suzuki, A.R. Forrest, E. van Nimwegen, C.O. Daub, P.J. Balwierz, K.M. Irvine, T. Lassmann, T. Ravasi, Y. Hasegawa, M.J. de Hoon, S. Katayama, K. Schroder, P. Carninci, Y. Tomaru, M. Kanamori-Katayama, A. Kubosaki, A. Akalin, Y. Ando, E. Arner, M. Asada, H. Asahara, T. Bailey, V.B. Bajic, D. Bauer, A.G. Beckhouse, N. Bertin, J. Bjorkegren, F. Brombacher, E. Bulger, A.M. Chalk, J. Chiba, N. Cloonan, A. Dawe, J. Dostie, P.G. Engstrom, M. Essack, G.J. Faulkner, J.L. Fink, D. Fredman, K. Fujimori, M. Furuno, T. Gojobori, J. Gough, S.M. Grimmond, M. Gustafsson, M. Hashimoto, T. Hashimoto, M. Hatakeyama, S. Heinzel, W. Hide, O. Hofmann, M. Hornquist, L. Huminiecki, K. Ikeo, N. Imamoto, S. Inoue, Y. Inoue, R. Ishihara, T. Iwayanagi, A. Jacobsen, M. Kaur, H. Kawaji, M.C. Kerr, R. Kimura, S. Kimura, Y. Kimura, H. Kitano, H. Koga, T. Kojima, S. Kondo, T. Konno, A. Krogh, A. Kruger, A. Kumar, B. Lenhard, A. Lennartsson, M. Lindow, M. Lizio,

Page 42: Gene networks from high-throughput data –Reverse

32

C. Macpherson, N. Maeda, C.A. Maher, M. Maqungo, J. Mar, N.A. Matigian, H. Matsuda, J.S. Mattick, S. Meier, S. Miyamoto, E. Miyamoto-Sato, K. Nakabayashi, Y. Nakachi, M. Nakano, S. Nygaard, T. Okayama, Y. Okazaki, H. Okuda-Yabukami, V. Orlando, J. Otomo, M. Pachkov, N. Petrovsky, C. Plessy, J. Quackenbush, A. Radovanovic, M. Rehli, R. Saito, A. Sandelin, S. Schmeier, C. Schonbach, A.S. Schwartz, C.A. Semple, M. Sera, J. Severin, K. Shirahige, C. Simons, G. St Laurent, M. Suzuki, T. Suzuki, M.J. Sweet, R.J. Taft, S. Takeda, Y. Takenaka, K. Tan, M.S. Taylor, R.D. Teasdale, J. Tegner, S. Teichmann, E. Valen, C. Wahlestedt, K. Waki, A. Waterhouse, C.A. Wells, O. Winther, L. Wu, K. Yamaguchi, H. Yanagawa, J. Yasuda, M. Zavolan, D.A. Hume, Riken Omics Science Center1, T. Arakawa, S. Fukuda, K. Imamura, C. Kai, A. Kaiho, T. Kawashima, C. Kawazu, Y. Kitazume, M. Kojima, H. Miura, K. Murakami, M. Murata, N. Ninomiya, H. Nishiyori, S. Noma, C. Ogawa, T. Sano, C. Simon, M. Tagami, Y. Takahashi, J. Kawai, and Y. Hayashizaki, “The transcriptional network that controls growth arrest and differentiation in a human myeloid leukemia cell line,” Nature genetics, Apr. 2009.

[26] T. Jenssen, A. Laegreid, J. Komorowski, and E. Hovig, “A literature network of human genes for high-throughput analysis of gene expression.,” Nature genetics, vol. 28, May. 2001, pp. 28, 21.

[27] D. Kulesh, D. Clive, D. Zarlenga, and J. Greene, “Identification of interferon-modulated proliferation-related cDNA sequences.,” Proc Natl Acad Sci U S A, vol. 84, Dec. 1987, pp. 8457, 8453.

[28] M. Schena, D. Shalon, R. Davis, and P. Brown, “Quantitative monitoring of gene expression patterns with a complementary DNA microarray.,” Science (New York, N.Y.), vol. 270, Oct. 1995, pp. 470, 467.

[29] M. Barnes, J. Freudenberg, S. Thompson, B. Aronow, and P. Pavlidis, “Experimental comparison and cross-validation of the Affymetrix and Illumina gene expression analysis platforms,” Nucleic Acids Research, vol. 33, Oct. 2005, pp. 5914-5923.

[30] T. Speed, Statistical analysis of gene expression microarray data, Chapman & Hall/CRC, 2003.

[31] E. Purdom and S.P. Holmes, “Error distribution for gene expression data,” Statistical applications in genetics and molecular biology, vol. 4, 2005.

[32] V. Filkov, S. Skiena, and J.Z. Zhi, “Analysis techniques for microarray time-series data,” Journal of computational biology, vol. 9, 2002, pp. 317-330.

[33] T. Ideker, V. Thorsson, J. Ranish, R. Christmas, J. Buhler, J. Eng, R. Bumgarner, D. Goodlett, R. Aebersold, and L. Hood, “Integrated Genomic and Proteomic Analyses of a Systematically Perturbed Metabolic Network,” Science, vol. 292, Maj. 2001, pp. 929-934.

[34] J. Rung, T. Schlitt, A. Brazma, K. Freivalds, and J. Vilo, “Building And Analyzing Genome-Wide Gene Disruption Networks,” Bioinformatics, vol. 18, 2002, pp. 202-210.

[35] K. Tan, J. Tegner, and T. Ravasi, “Integrated approaches to uncovering transcription regulatory networks in mammalian cells,” Genomics, vol. 91, Mar. 2008, pp. 219-231.

[36] P. D'haeseleer, S. Liang, and R. Somogyi, “Linear Modeling of mRNA Expression Levels During CNS Development and Injury,” Pacific Symposium on Biocomputing, R.B. Altman, A.K. Dunker, L. Hunter, T.E. Klein, and K. Lauderdaule, eds., Singapore: World Scientific Publishing Co, 1999, pp. 41-52.

[37] T.G. Dewey and D.J. Galas, “Dynamic Models Of Gene Expression And Classification,” Funct. Integr. Genomics, vol. 1, 2001, pp. 269-271.

[38] M.K.S. Yeung, J. Tegnér, and J.J. Collins, “Reverse Engineering Gene Networks

Page 43: Gene networks from high-throughput data –Reverse

33

Using Singular Value Decomposition And Robust Regression,” Proc. Nat. Acad. Sciences USA, vol. 99, 2002, pp. 6163-6168.

[39] Y. Wang, T. Joshi, X. Zhang, D. Xu, and L. Chen, “Inferring gene regulatory networks from multiple microarray datasets,” Bioinformatics, vol. 22, Oktober. 2006, pp. 2413-2420.

[40] M. Hörnquist, J. Hertz, and M. Wahde, “Effective Dimensionality For Principal Component Analysis Of Rat CNS Expression Data,” BioSystems, vol. 71, 2003, pp. 311-317.

[41] Tibshirani, “Regression Shrinkage and Selection via the Lasso,” J. R. Stat. Soc. B, vol. 58, 1996, pp. 267-288.

[42] S.C. Narula and J.F. Wellington, “The minimum sum of absolute errors regression: A state of the art survey,” International Statistical Review, vol. 50, 1982, pp. 317-326.

[43] T. Hastie, R. Tibshirani, and J.H. Friedman, The elements of statistical learning : data mining, inference, and prediction, New York: Springer, 2001.

[44] H. Zou, T. Hastie, and R. Tibshirani, “On the "degrees of freedom" of the lasso,” Annals of Statistics, vol. 35, 2007, pp. 2173-3192.

[45] B. Efron, T. Hastie, L. Johnstone, and R. Tibshirani, “Least angle regression,” Annals of Statistics, vol. 32, 2004, pp. 499, 407.

[46] L. Wang, M.D. Gordon, and J. Zhu, “Regularized least absolute deviations regression and an efficient algorithm for parameter tuning,” 2007, pp. 690-700.

[47] N.R. Draper and H. Smith, Applied Regression Analysis, New York: Wiley, 1998.

[48] H. Zou and T. Hastie, “Regularization and variable selection via the elastic net,” Journal of the Royal Statistical Society.Series B: Statistical Methodology, vol. 67, 2005, pp. 301-320.

[49] D. Marbach and G. Stolovitzky, PLoS ONE, vol. 5, 2010. [50] V. Thorsson, M. Hörnquist, A.F. Siegel, and L. Hood, “Reverse engineering

galactose regulation in yeast through model selection,” Statistical applications in genetics and molecular biology, vol. 4, 2005, p. Article28.

[51] G. Golub, M. Heath, and G. Wahba, “Generalized Cross-Validation as a Method for Choosing a Good Ridge Parameter,” Technometrics, vol. 21, 1979, pp. 223, 215.

[52] R. Bonneau, D. Reiss, P. Shannon, M. Facciotti, L. Hood, N. Baliga, and V. Thorsson, “The Inferelator: an algorithm for learning parsimonious regulatory networks from systems-biology data sets de novo,” Genome Biology, vol. 7, 2006, p. R36.

[53] J.M. Pena, J. Björkegren, and J. Tegnér, “Learning dynamic Bayesian network models via cross-validation,” Pattern Recogn. Lett., vol. 26, 2005, pp. 2295-2308.

[54] A. Isaksson, M. Wallman, H. Göransson, and M.G. Gustafsson, “Cross-validation and bootstrapping are unreliable in small sample classification,” Pattern Recogn. Lett., vol. 29, 2008, pp. 1960-1965.

[55] R.J. Tibshirani and R. Tibshirani, “A bias correction for the minimum error rate in cross-validation,” Annals of Applied Statistics, vol. 3, 2009, pp. 822-829.

[56] M.C. Teixeira, P. Monteiro, P. Jain, S. Tenreiro, A.R. Fernandes, N.P. Mira, M. Alenquer, A.T. Freitas, A.L. Oliveira, and I. Sa-Correia, “The YEASTRACT database: a tool for the analysis of transcription regulatory associations in Saccharomyces cerevisiae,” Nucleic Acids Research, vol. 34, Jan. 2006, pp. D446-451.

[57] P.T. Monteiro, N.D. Mendes, M.C. Teixeira, S. d'Orey, S. Tenreiro, N.P. Mira, H. Pais, A.P. Francisco, A.M. Carvalho, A.B. Lourenco, I. Sa-Correia, A.L.

Page 44: Gene networks from high-throughput data –Reverse

34

Oliveira, and A.T. Freitas, “YEASTRACT-DISCOVERER: new tools to improve the analysis of transcriptional regulatory associations in Saccharomyces cerevisiae,” Nucleic Acids Research, vol. 36, Jan. 2008, pp. D132-136.

[58] G. Stolovitzky, D. Monroe, and A. Califano, “Dialogue on reverse-engineering assessment and methods: the DREAM of high-throughput pathway inference,” Annals of the New York Academy of Sciences, vol. 1115, Dec. 2007, pp. 1-22.

[59] G. Stolovitzky, R.J. Prill, and A. Califano, “Lessons from the DREAM2 Challenges,” Annals of the New York Academy of Sciences, vol. 1158, 2009, pp. 159-195.

[60] J. Ruan, “A Top-Performing Algorithm for the DREAM3 Gene Expression Prediction Challenge,” PLoS ONE, vol. 5, Feb. 2010, p. e8944.

[61] M. Lauria, F. Iorio, and D. di Bernardo, “NIRest: a tool for gene network and mode of action inference,” Annals of the New York Academy of Sciences, vol. 1158, Mar. 2009, pp. 257-264.

[62] T. Gowda, S. Vrudhula, and S. Kim, “Prediction of Pairwise Gene Interaction Using Threshold Logic,” Annals of the New York Academy of Sciences, vol. 1158, 2009, pp. 276-286.

[63] A. Scheinine, W.I. Mentzen, G. Fotia, E. Pieroni, F. Maggio, G. Mancosu, and A. de la Fuente, “Inferring gene networks: dream or nightmare?,” Annals of the New York Academy of Sciences, vol. 1158, Mar. 2009, pp. 287-301.

[64] N.D. Clarke and G. Bourque, “Success in the DREAM3 Signaling Response Challenge Using Simple Weighted-Average Imputation: Lessons for Community-Wide Experiments in Systems Biology,” PLoS ONE, vol. 5, Jan. 2010, p. e8417.

[65] K.Y. Yip, R.P. Alexander, K. Yan, and M. Gerstein, “Improved Reconstruction of In Silico Gene Regulatory Networks by Integrating Knockout and Perturbation Data,” PLoS ONE, vol. 5, Jan. 2010, p. e8121.

[66] N. Guex, E. Migliavacca, and I. Xenarios, “Multiple Imputations Applied to the DREAM3 Phosphoproteomics Challenge: A Winning Strategy,” PLoS ONE, vol. 5, Jan. 2010, p. e8012.

[67] B. Bollobás, Random Graphs, Cambridge UK: Cambridge University Press, 2001.

[68] R. Albert and A. Barabási, “Statistical mechanics of complex networks,” Reviews of Modern Physics, vol. 74, 2002, pp. 97, 47.

[69] A.L. Barabasi and R. Albert, “Emergence of scaling in random networks,” Science (New York, N.Y.), vol. 286, Oct. 1999, pp. 509-512.

[70] R.K. Merton, “The Matthew Effect in Science: The reward and communication systems of science are considered,” Science, vol. 159, Jan. 1968, pp. 56-63.

[71] D.D.S. Price, “A general theory of bibliometric and other cumulative advantage processes,” Journal of the American Society for Information Science, vol. 27, 1976, pp. 292-306.

[72] S. Maslov, K. Sneppen, and U. Alon, “Correlation Profiles and Motifs in Complex Networks,” Handbook of Graphs and Networks, S. Bornholdt and G. Schuster, eds., Weinheim: Wiley, 2003, pp. 168-198.

[73] S. Bornholdt and H. Schuster, Handbook of Graphs and Networks: From the Genome to the Internet, Wiley-VCH, 2003.

[74] M.E. Newman, “The Structure and Function of Complex Networks,” SIAM Review, vol. 45, 2003, pp. 167-256.

[75] R. Albert, H. Jeong, and A.-. Baraba´si, “Error and attack tolerance of complex networks,” Nature, vol. 406, 2000, pp. 378-382.

[76] H. Jeong, S.P. Mason, A.-. Baraba´si, and Z.N. Oltvai, “Lethality and centrality in protein networks,” Nature, vol. 411, 2001, pp. 41-42.

Page 45: Gene networks from high-throughput data –Reverse

35

[77] H. Jeong, B. Tombor, R. Albert, Z.N. Oltavai, A.-. Baraba´si, T. Dandekar, and S. Schuster, “The large-scale organization of metabolic networks,” Chemtracts, vol. 14, 2001, pp. 137-141.

[78] T.I. Lee, N.J. Rinaldi, F. Robert, D.T. Odom, Z. Bar-Joseph, G.K. Gerber, N.M. Hannett, C.T. Harbison, C.M. Thompson, and I. Simon, “Transcriptional regulatory networks in Saccharomyces cerevisiae.,” Science, vol. 298, 2002, pp. 799 - 804.

[79] S. Maslov and K. Sneppen, “Computational architecture of the yeast regulatory network,” Physical biology, vol. 2, Nov. 2005, pp. S94-100.

[80] T.R. Hughes, M.J. Marton, A.R. Jones, C.J. Roberts, R. Stoughton, C.D. Armour, H.A. Bennett, E. Coffey, H. Dai, Y.D. He, M.J. Kidd, A.M. King, M.R. Meyer, D. Slade, P.Y. Lum, S.B. Stepaniants, D.D. Shoemaker, D. Gachotte, K. Chakraburtty, J. Simon, M. Bard, and S.H. Friend, “Functional discovery via a compendium of expression profiles,” Cell, vol. 102, Jul. 2000, pp. 109-126.

[81] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, “Network motifs: simple building blocks of complex networks,” Science (New York, N.Y.), vol. 298, Oct. 2002, pp. 824-827.

[82] A. Ma'ayan, G.A. Cecchi, J. Wagner, A.R. Rao, R. Iyengar, and G. Stolovitzky, “Ordered cyclic motifs contribute to dynamic stability in biological and engineered networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 105, Dec. 2008, pp. 19235-19240.

[83] U. Alon, “Network motifs: theory and experimental approaches,” Nature reviews.Genetics, vol. 8, Jun. 2007, pp. 450-461.

[84] U. Alon, “Biological networks: the tinkerer as an engineer,” Science (New York, N.Y.), vol. 301, Sep. 2003, pp. 1866-1867.

[85] J. Ihmels, G. Friedlander, S. Bergmann, O. Sarig, Y. Ziv, and N. Barkai, “Revealing modular organization in the yeast transcriptional network,” Nature genetics, vol. 31, 2002, pp. 370-377.

[86] M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein, “Cluster analysis and display of genome-wide expression patterns,” Proceedings of the National Academy of Sciences of the United States of America, vol. 95, Dec. 1998, pp. 14863-14868.

[87] E. Segal, M. Shapira, A. Regev, D. Pe'er, D. Botstein, D. Koller, and N. Friedman, “Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data,” Nature genetics, vol. 34, Jun. 2003, pp. 166-176.

[88] M. Girvan and M.E.J. Newman, “Community structure in social and biological networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, Jun. 2002, pp. 7821-7826.

[89] X. Wang, E. Dalkic, M. Wu, and C. Chan, “Gene module level analysis: identification to networks and dynamics,” Current opinion in biotechnology, vol. 19, Oct. 2008, pp. 482-491.

[90] E.P. van Someren, L.F. Wessels, and M.J. Reinders, “Linear modeling of genetic networks from experimental data.,” Proc Int Conf Intell Syst Mol Biol, vol. 8, 2000, pp. 355 - 366.

[91] M. Gustafsson, Large-scale topology, stability and biology of gene networks, Linköping: Department of Science and Technology, Linköpings universitet, 2006.

[92] A. Lesne, “Robustness: confronting lessons from physics and biology,” Biological reviews of the Cambridge Philosophical Society, vol. 83, Nov. 2008, pp. 509-532.

[93] S. Nikolov, E. Yankulova, O. Wolkenhauer, and V. Petrov, “Principal difference

Page 46: Gene networks from high-throughput data –Reverse

36

between stability and structural stability (robustness) as used in systems biology,” Nonlinear Dynamics, Psychology, and Life Sciences, vol. 11, 2007, pp. 413-433.

[94] R.M. May, “Will a large complex system be stable?,” Nature, vol. 238, Aug. 1972, pp. 413-414.

[95] S. Sinha and S. Sinha, “Evidence of universality for the May-Wigner stability theorem for random networks with local dynamics,” Physical review.E, Statistical, nonlinear, and soft matter physics, vol. 71, Feb. 2005, p. 020902.

[96] S. Sinha, “Complexity vs. stability in small-world networks,” Physica A: Statistical Mechanics and its Applications, vol. 346, Feb. 2005, pp. 147-153.

[97] R.K. Pan and S. Sinha, “Modular networks emerge from multiconstraint optimization,” Physical review.E, Statistical, nonlinear, and soft matter physics, vol. 76, Oct. 2007, p. 045103.

[98] I. Farkas, H. Jeong, T. Vicsek, A.L. Barabasi, and Z.N. Oltvai, “The Topology Of The Transcription Regulatory Network In Saccharomyces Cerevisiae,” Physica A, vol. 381, 2003, pp. 601-612.

[99] E. Balleza, E.R. Alvarez-Buylla, A. Chaos, S. Kauffman, I. Shmulevich, and M. Aldana, “Critical dynamics in genetic regulatory networks: examples from four kingdoms,” PLoS ONE, vol. 3, Jun. 2008, p. e2456.

[100] R. Singh, “The contenders: Gripen JAS-39,” Feb. 2007. [101] F. Wegner, “Inverse participation ratio in 2+ε dimensions,” Zeitschrift für Physik

B Condensed Matter, vol. 36, 1980, pp. 209-214. [102] S. Maslov and K. Sneppen, “Specificity and stability in topology of protein

networks,” Science (New York, N.Y.), vol. 296, May. 2002, pp. 910-913.

Page 47: Gene networks from high-throughput data –Reverse

37