Network-based approaches for pathway level analysis

Tin Nguyen1, Cristina Mitrea2, and Sorin Draghici2,3

1Department of Computer Science and Engineering, University of Nevada, Reno

2Department of Computer Science, Wayne State University

3Department of Obstetrics and Gynecology, Wayne State University

September 26, 2017

Contents

1 Abstract

2 Introduction

3 Network-based pathway analysis
3.1 Software availability and implementation
3.2 Experiment input and pathway databases
3.3 Graph models
3.4 Pathway scoring strategies
3.4.1 Hierarchically aggregated scoring
3.4.2 Multivariate analysis and Bayesian network
3.4.3 Other approaches

4 Challenges in pathway analysis
4.1 Systematic bias of pathway analysis methods
4.2 Querying data from knowledge bases
4.3 Missing benchmark datasets and comparison

5 Conclusions

6 Conflict-of-Interest Statement

7 Acknowledgment

To whom correspondence should be addressed. Phone: (313)-577-6679, Fax: (313)-577-6868, Email: [email protected]


1 Abstract

Identification of impacted pathways is an important problem because it allows us to gain insights into the underlying biology beyond the detection of differentially expressed genes. In the past decade, a plethora of methods have been developed for this purpose. The latest generation of pathway analysis methods is designed to take into account various aspects of pathway topology in order to increase the accuracy of the findings. Here we cover 34 such topology-based pathway analysis methods published in the past 13 years. We compare these methods on categories related to implementation, availability, input format, graph models, and the statistical approaches used to compute pathway-level statistics and statistical significance. We also discuss a number of critical challenges that need to be addressed, arising both in methodology and pathway representation, including inconsistent terminology, data format, lack of meaningful benchmarks, and, more importantly, a systematic bias that is present in most existing methods.

2 Introduction

With rapid advances in high-throughput technologies, various kinds of genomic data are prevalent in most biomedical research. Advanced techniques in sequencing (e.g., RNA-Seq, miRNA-Seq, DNA-Seq) and microarray assays (e.g., gene expression, methylation) have transformed biological research by enabling comprehensive monitoring of biological systems. Vast amounts of data of all types have accumulated in many public repositories, such as Gene Expression Omnibus (GEO) [1,2], Array Express [3,4], The Cancer Genome Atlas (http://cancergenome.nih.gov/), and cBioPortal [5,6]. However, there is a large gap between the ease of data collection and our ability to extract knowledge from these data. Contributing to this gap is the fact that living organisms are complex systems whose emerging phenotypes are the results of multiple complex interactions taking place on various metabolic and signaling pathways.

Regardless of the technology being used, a typical comparative analysis (e.g., disease vs. control, treated vs. not treated, drug A vs. drug B) often yields a set of genes that are differentially expressed (DE) between the two phenotypes. Even though these lists of DE genes are important in identifying the genes that may be involved in biological changes, they fail to reveal the underlying mechanisms. In order to translate these lists of DE genes into a better understanding of biological phenomena, researchers have developed a variety of knowledge bases that map genes to functional modules. Depending on the amount of information that one wishes to include, these modules can be described as simple gene sets based on a function, process or component (e.g., the Molecular Signatures Database MSigDB [7]), organized in a hierarchical structure that contains information about the relationship between the various modules, as found in the Gene Ontology [8], or organized into pathways that describe in detail all known interactions between the various genes that are involved in a certain phenomenon. Biological processes, in which genes are known to interact with each other, are accumulated in public databases, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) [9,10], Reactome [11], and Biocarta (www.biocarta.com).

Concurrently, statistical methods have been developed to identify the functional modules or pathways that are impacted, based on the differential expression evidence. They allow us to gain insights into the functional mechanisms of cells beyond the detection of differentially expressed genes. The earliest approaches use Over-Representation Analysis (ORA) [12,13] to identify gene sets that have more DE genes than expected by chance. The drawbacks of this type of approach are that: i) it only considers the number of DE genes and completely ignores the magnitude of the actual expression changes, resulting in information loss; ii) it assumes that genes are independent, which they are not (since the pathways are graphs that describe precisely how these genes influence each other); and iii) it ignores the interactions between various genes and modules. Functional Class Scoring (FCS) approaches, such as Gene Set Enrichment Analysis (GSEA) [14] and Gene Set Analysis (GSA) [15], have been developed to address some of the issues raised by ORA approaches. The main improvement of FCS is the observation that small but coordinated changes in expression of functionally related genes can have a significant impact on pathways.
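To make the ORA idea concrete, the sketch below computes the probability of seeing at least a given number of DE genes in a pathway by chance, using the hypergeometric distribution that is commonly used for over-representation tests. The function name, parameter names, and the numbers are illustrative assumptions, not taken from any of the cited tools.

```python
from math import comb


def ora_p_value(n_measured, n_de, pathway_size, de_in_pathway):
    """Upper-tail hypergeometric probability P(X >= de_in_pathway).

    n_measured    : number of genes measured in the experiment
    n_de          : number of measured genes called differentially expressed
    pathway_size  : number of measured genes that belong to the pathway
    de_in_pathway : number of DE genes that belong to the pathway
    """
    total = comb(n_measured, n_de)
    upper = min(n_de, pathway_size)
    tail = sum(
        comb(pathway_size, k) * comb(n_measured - pathway_size, n_de - k)
        for k in range(de_in_pathway, upper + 1)
    )
    return tail / total


# Hypothetical numbers, for illustration only.
p = ora_p_value(n_measured=20, n_de=5, pathway_size=4, de_in_pathway=3)
print(f"ORA p-value: {p:.4f}")
```

Note that only counts enter the computation: neither the size of the expression changes nor the position of the genes in the pathway plays any role, which is exactly the limitation listed above.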

ORA and FCS approaches are often referred to as gene set enrichment methods. Comprehensive lists of gene set analysis approaches, as well as comparisons between them, can be found in well-developed surveys [16-20]. While useful for the purpose for which they have been developed (to analyze sets of genes), these methods completely ignore the topology and interaction between genes. Topology-based approaches, which fully exploit all the knowledge about how genes interact as described by pathways, have been developed more recently. The first such techniques were ScorePAGE [21] for metabolic pathways and the Impact Analysis [22,23] for signaling pathways. Figure 1 shows an example pathway, the ERBB signaling pathway, in which the nodes represent genes and compounds while the edges represent the known interactions between the compounds. Gene set analysis approaches are not able to account for the topological order of genes, nor are they able to explain the signal propagation and mechanisms of the pathway. These approaches would yield exactly the same significance value for this pathway even if the graph were to be completely redesigned by future discovery. In contrast, network-based pathway analysis can distinguish between this pathway and any other pathway with the same proportion of DE genes.
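The point that a set-based statistic cannot see rewiring can be illustrated with a toy example. In the sketch below, the same genes and the same DE gene are placed on two differently wired graphs. The overlap count, which is all a set-based method uses, is identical, while a deliberately simple topology-aware quantity (the number of genes reachable downstream of the DE genes) differs. The gene names, both graphs, and the reachability score are hypothetical and are not the scoring scheme of any surveyed method.

```python
def downstream_reach(edges, de_genes):
    """Count genes reachable from any DE gene along directed edges (DE genes included)."""
    children = {}
    for source, target in edges:
        children.setdefault(source, []).append(target)
    seen = set(de_genes)
    stack = list(de_genes)
    while stack:
        gene = stack.pop()
        for child in children.get(gene, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return len(seen)


# Hypothetical pathway: same gene set, two different wirings.
genes = {"A", "B", "C", "D"}
de_genes = {"A"}
original = [("A", "B"), ("B", "C"), ("C", "D")]  # A sits at the top of the cascade
rewired = [("B", "C"), ("C", "D"), ("D", "A")]   # A sits at the bottom

overlap = len(genes & de_genes)  # all that a set-based method sees
print("DE genes in set:", overlap)
print("reach (original):", downstream_reach(original, de_genes))
print("reach (rewired): ", downstream_reach(rewired, de_genes))
```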

Here we provide a survey of 34 network-based methods developed for pathway analysis. Our survey of commercial tools for pathway analysis found iPathwayGuide (Advaita Corporation, http://www.advaitabio.com) and MetaCore (Thomson Reuters, http://www.thomsonreuters.com) to be topology-based. Other commercial tools, such as Ingenuity Pathway Analysis (https://www.ingenuity.com) or Genomatix (http://www.genomatix.de), do not use pathway topology information and thus are not included in this review. The existence of only two commercial tools is further evidence of the challenging task undertaken by the developers of such methods, due to the lack of standards.

In this document, we categorize and compare the methods based on the following criteria: the type of input the method accepts, the graph model, and the statistical approaches used to evaluate gene and pathway changes. In section 3.2 we describe the types of input the surveyed methods use. In section 3.3 we provide details regarding the graph models used for the biological networks. In section 3.4 we discuss the statistical methods used by the surveyed methods to assess gene and pathway changes between two phenotypes. In section 4, we discuss a number of outstanding challenges that need to be addressed in order to improve the reliability, as well as the relevance, of the next-generation pathway analysis approaches. We discuss the key elements and limitations of the surveyed methods without going into the details of each technique. For a more detailed review and technical descriptions of network-based pathway analysis methods, please refer to Mitrea et al. [24] or the referenced manuscripts (Table 1).

3 Network-based pathway analysis

The term pathway analysis is used in a very broad context in the literature, including biological network construction and inference. Here we focus on methods that are able to exploit biological knowledge in public repositories, rather than on network inference approaches that attempt to infer or reconstruct pathways from molecular measurements. Table 1 shows the list of 34 pathway analysis approaches, together with their availability and licensing. In this section we start by describing the availability of each method, before discussing the input format, graph model, and the underlying statistical approaches.

Figure 1: Pathways are more than gene sets. Panel A shows the graphical representation of the ERBB signaling pathway from the KEGG database, while panel B shows the set of genes on the pathway. The graph in panel A contains important information regarding gene product (protein) localization, gene, protein or metabolite interactions and the type of these interactions (activation, repression, etc.), the direction of the signal propagation, etc. ORA and FCS approaches are unable to exploit this information. As such, for the same molecular measurement, these approaches would yield exactly the same significance value for this pathway even if the graph were to be completely redesigned by future discovery. In contrast, network-based approaches are able to take into account the topological order of the genes and their interactions.

3.1 Software availability and implementation

We often think that the main strength of an approach lies in its novelty and algorithm efficiency. However, a tool's implementation and availability have become increasingly important for several reasons. First, software availability and version control are crucial for reproducing the experimental results that were used to assess the performance of the approach [60]. For this reason, many journals request authors to make their software and data available before accepting methods articles. Second, if software is not ready to run, it is very unlikely that the intended audience (mostly life scientists) will invest the time to understand and implement complex algorithms. Practicality, user-friendliness, output format, and type of interface are all to be considered. Depending on the desired availability and intended audience, a software package may be implemented as standalone or web-based. Among the 34 approaches, 30 are available either as a standalone software package or as a web service.

| Method | Availability | HIPAA | License | Code | Year | Reference |
|---|---|---|---|---|---|---|
| Signaling pathway analysis using gene expression | | | | | | |
| MetaCore* | Web (http://www.genego.com/metacore.php) | No | Thomson Reuters | Java | 2004 | N/A |
| Pathway-Express | Standalone (Bioconductor), Web (http://vortex.cs.wayne.edu/); superseded by ROntoTools | No | free** | Java, R | 2005 | [22,25] |
| PathOlogist | Standalone (ftp://ftp1.nci.nih.gov/pub/pathologist/) | No | free | MATLAB | 2007 | [26,27] |
| iPathwayGuide* | Web (http://www.advaitabio.com/products.html) | Yes | Advaita Corp. | Java, R | 2009 | N/A |
| SPIA | Standalone (Bioconductor) | No | GPL (2) | R | 2009 | [23] |
| NetGSA | Standalone (http://www.biostat.washington.edu/~ashojaie/software/) | No | GPL-2 | R | 2009 | [28,29] |
| PWEA | Standalone (https://zlab.bu.edu/PWEA/) | No | free** | C++ | 2010 | [30] |
| TopoGSA | Web (http://www.infobiotics.net/topogsa) | No | free** | PHP, R | 2010 | [31] |
| TopologyGSA | Standalone (CRAN) | No | AGPL-3 | R | 2010 | [32] |
| DEGraph | Standalone (Bioconductor) | No | GPL-3 | R | 2010 | [33] |
| GGEA | Standalone (Bioconductor) | No | Artistic-2.0 | R | 2011 | [34] |
| BPA | Standalone (http://bumil.boun.edu.tr/bpa) | No | free** | MATLAB | 2011 | [35] |
| GANPA | Standalone (CRAN) | No | GPL-2 | R | 2011 | [36] |
| ROntoTools** | Standalone (Bioconductor) | No | CC BY-NC-ND 4 | R | 2012 | [37] |
| BAPA-IGGFD | No implementation available | No | N/A | R | 2012 | [38] |
| CePa | Standalone (CRAN), Web (http://mcube.nju.edu.cn/cgi-bin/cepa/main.pl) | No | GPL (2) | R | 2012 | [39] |
| THINK-Back-DS | Standalone, Web (http://eecs.umich.edu/db/think/software.html) | No | free** | Java | 2012 | [40] |
| TBScore | No implementation available | No | N/A | N/A | 2012 | [41] |
| ACST | Standalone (available as article supplemental) | No | N/A | R | 2012 | [42] |
| EnrichNet | Web (http://www.enrichnet.org/) | No | free** | PHP | 2012 | [43] |
| clipper | Standalone (http://romualdi.bio.unipd.it/software) | No | AGPL-3 | R | 2013 | [44] |
| DEAP | Standalone (available as article supplemental) | No | GNU Lesser GPL | Python | 2013 | [45] |
| DRAGEN | Standalone (http://bioinfo.au.tsinghua.edu.cn/dragen/) | No | N/A | C++ | 2014 | [46] |
| ToPASeq*** | Standalone (Bioconductor) | No | AGPL-3 | R | 2016 | [47] |
| pDis | Standalone (Bioconductor) | No | free** | R | 2016 | [48] |
| SPATIAL | No implementation available | No | N/A | N/A | 2016 | [49] |
| BLMA**** | Standalone (Bioconductor) | No | GPL (2) | R | 2017 | [50,51] |
| Signaling pathway analysis using multiple types of data | | | | | | |
| PARADIGM | Standalone (http://sbenz.github.com/Paradigm) | No | UCSC-CGB, free** | C | 2010 | [52] |
| microGraphite | Standalone (http://romualdi.bio.unipd.it/software) | No | AGPL-3 | R | 2014 | [53] |
| mirIntegrator | Standalone (Bioconductor) | No | GPL-3 | R | 2016 | [54,55] |
| Metabolic pathway analysis | | | | | | |
| ScorePAGE | No implementation available | No | N/A | N/A | 2004 | [21] |
| TAPPA | No implementation available | No | N/A | N/A | 2007 | [56] |
| MetPA | Web (http://metpa.metabolomics.ca) | No | GPL (2) | PHP, R | 2010 | [57] |
| MetaboAnalyst | Web (http://metpa.metabolomics.ca) | No | GPL (2) | R | 2011 | [58,59] |

Table 1: Pathway analysis tools. Availability describes whether the method is implemented as a standalone or web-based tool. HIPAA provides information about HIPAA compliance. License provides information about the type of the software license. GPL is an abbreviation for the GNU General Public License; AGPL is an abbreviation for the GNU Affero General Public License; CC BY-NC-ND 4 is an abbreviation of Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Public License. Code shows the programming language used for the method implementation. * commercial methods; ** free for academic and non-commercial use; UCSC-CGB is the University of California Santa Cruz Cancer Genome Browser; *** ToPASeq provides an R package that runs TopologyGSA, DEGraph, clipper, SPIA, TBScore, PWEA, TAPPA. **** BLMA provides an R package for bi-level meta-analysis that runs SPIA, ORA, GSA, and PADOG using one or multiple expression datasets.

Typically, standalone tools need to be installed on local machines or servers, which often requires some administrative skills. Most standalone tools depend on full or partial copies of public pathway databases, stored locally, which need to be updated periodically. Advantages of standalone tools include: i) instant availability that does not require internet access, and ii) the security and privacy of the experiment data. Web-based tools, on the other hand, run the analyses on a remote server that provides computational power and a graphical interface. The major advantage of web-based tools is that they are user-friendly and do not require a separate local installation. From the accessibility perspective, web-based tools have the advantage of being available from any location as long as there is an internet connection and a browser available. Also, updates are almost seamless to the client. This makes the user's task easy and enables collaboration, since users all over the world can utilize the same method without the burden of installing it or keeping it up to date. Some methods provide both web-based and standalone implementations.

HIPAA compliance may also be a factor in certain applications that involve data coming from patients or data linked to other clinical data or clinical records. Currently, iPathwayGuide is the only available tool that performs topological pathway analysis and is HIPAA compliant.

The programming language and style used for implementation also play an important role in the acceptance of a method. Software tools that are neatly implemented and packaged are more appealing than those that do not have ready-to-use implementations. Many of the methods are implemented in the R programming language and are available as software packages from Bioconductor, CRAN, or the authors' website. Their popularity among biologists and bioinformaticians is due to the fact that many dedicated bioinformatics packages are available in R. In addition, the rigorous review procedure provided by Bioconductor and CRAN makes the software packages more standardized and reliable.

3.2 Experiment input and pathway databases

A pathway analysis approach typically requires two types of input: experiment data and known biological networks. Experiment data is usually collected from high-throughput experiments that compare a condition phenotype to a control phenotype, where a condition phenotype can be a disease, a drug treatment, or the knock-out (KO) of a gene, while a control phenotype can be healthy, a different disease or drug treatment, or the wild-type (non-KO) samples. Experiment data can be obtained from multiple technologies that produce different types of data: gene expression, protein abundance, metabolite concentration, miRNA expression, etc. Biological networks or pathways are often represented in the form of graphs that capture our current knowledge about the interactions of genes, proteins, metabolites, or compounds in an organism. The pathway data is accumulated, updated, and refined by amassing knowledge from scientific literature describing individual interactions or high-throughput experiment results.


Table 2 shows a summary of the input format and mathematical modeling of the surveyed approaches. Most pathway analysis methods analyze data from high-throughput experiments, such as microarrays, next-generation sequencing, or proteomics. They accept either a list of gene IDs or a list of such gene IDs associated with measured changes. These changes could be measured with different technologies and therefore can serve as proxies for different biochemical entities. For instance, one could use gene expression changes measured with microarrays, or protein levels measured with a proteomic approach, etc.

Different analysis methods use different input formats. Many methods accept a list of all genes considered in the experiment together with their expression values. Some analysis methods select a subset of genes, considered to be differentially expressed (DE), based on a predefined cut-off. The cut-off is typically applied on fold-change, p-value, or both. These methods use the list of DE genes and their corresponding statistics (fold-change, p-value) as input. Other methods use only the list of DE genes, without corresponding expression values, because their scoring methods are based only on the relative positions of the genes in the graph.

Methods that use cut-offs are sensitive to the chosen threshold value, because a small change in the cut-off may drastically change the number of selected genes [61]. In addition, they typically use the most significant genes and discard the rest, whose weaker but coordinated changes may also have a significant impact on pathways. Genes with moderate differential expression may be lost, even though they might be important players in the impacted pathways [62]. Furthermore, the genes included in the set of DE genes can vary dramatically if the selection methods are changed. Hence, the results of pathway analyses based on DE genes may be vastly different depending on both the selection method and the threshold value [63]. Moreover, for the same disease, independent studies or measurements often produce different sets of DE genes [64-66]. This makes approaches that use DE genes as input appear even more unreliable.
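The sensitivity to the threshold can be seen on a handful of made-up genes. The sketch below applies a fold-change and p-value cut-off, as described above, and shows how modest changes in either threshold alter the selected set. The gene labels, the values, and the `select_de` helper are hypothetical and serve only as an illustration.

```python
# Hypothetical (gene, log2 fold-change, p-value) records, for illustration only.
results = [
    ("G1", 2.1, 0.001),
    ("G2", 1.4, 0.012),
    ("G3", 1.1, 0.030),
    ("G4", 0.9, 0.004),
    ("G5", -1.6, 0.049),
    ("G6", -1.05, 0.051),
    ("G7", 0.4, 0.200),
]


def select_de(records, min_abs_log2fc, max_p):
    """Return the genes that pass both the fold-change and the p-value cut-off."""
    return [
        gene
        for gene, log2fc, p_value in records
        if abs(log2fc) >= min_abs_log2fc and p_value <= max_p
    ]


# Three plausible-looking cut-off choices give three different DE lists.
for min_fc, max_p in [(1.0, 0.05), (1.5, 0.05), (1.0, 0.01)]:
    chosen = select_de(results, min_fc, max_p)
    print(f"|log2FC| >= {min_fc}, p <= {max_p}: {chosen}")
```

G4 and G6 are each dropped by a narrow margin on one criterion, which is the kind of moderately changed gene that a hard cut-off discards.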

Usually pathways are sets of genes and/or gene products that interact with each other in a coordinated way to accomplish a given biological function. A typical signaling pathway (in KEGG, for instance) uses nodes to represent genes or gene products and edges to represent signals, such as activation or repression, that go from one gene to another. A typical metabolic pathway uses nodes to represent biochemical compounds and edges to represent reactions that transform one or more compounds into one or more other compounds. These reactions are usually carried out or controlled by enzymes, which are in turn coded by genes. Hence, in a metabolic pathway, genes or gene products are associated with edges rather than with nodes, as they are in a signaling pathway. The immediate consequence of this difference is that many techniques cannot be applied directly on all available pathways. This is why the analysis of metabolic pathways is still generally and arguably under-developed (only 128 Google Scholar citations to date for the original ScorePAGE paper [21] and not many other methods available for metabolic pathway analysis). However, the topology-based analysis of signaling pathways was very successful (over 1200 citations to date for the two impact analysis papers mentioned above [22,23] and over 30 other methods developed since).
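The structural difference between the two pathway types can be sketched with two small record types. In the signaling case genes are the endpoints of each edge; in the metabolic case the endpoints are compounds and the genes are attached to the edge as enzymes. The class names, field names, and all gene and compound labels below are hypothetical and do not follow the schema of KEGG or any other database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignalEdge:
    """Signaling pathway edge: a signal sent from one gene (node) to another."""
    source_gene: str
    target_gene: str
    effect: str  # e.g. "activation" or "repression"


@dataclass(frozen=True)
class ReactionEdge:
    """Metabolic pathway edge: a reaction between compounds (nodes),
    carried out by enzymes that are encoded by genes."""
    substrate: str
    product: str
    enzyme_genes: tuple


# Hypothetical toy pathways.
signaling = [
    SignalEdge("geneA", "geneB", "activation"),
    SignalEdge("geneB", "geneC", "repression"),
]
metabolic = [
    ReactionEdge("compound1", "compound2", ("geneX",)),
    ReactionEdge("compound2", "compound3", ("geneY", "geneZ")),
]


def genes_on_nodes(edges):
    """Signaling: genes are found at the endpoints of the edges."""
    return {gene for edge in edges for gene in (edge.source_gene, edge.target_gene)}


def genes_on_edges(edges):
    """Metabolic: genes are found on the edges, as enzyme annotations."""
    return {gene for edge in edges for gene in edge.enzyme_genes}


print(sorted(genes_on_nodes(signaling)))
print(sorted(genes_on_edges(metabolic)))
```

A method that expects gene-level measurements on nodes can consume the first structure directly, but needs an extra mapping step for the second, which is one way to read the remark that many techniques cannot be applied directly to all pathways.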

There are other types of biological networks that incorporate genome-wide interactions between genes or proteins, such as protein-protein interaction (PPI) networks. These networks are not restricted to specific biological functions. The main caveat related to PPI data is that most such data are obtained from bait-prey laboratory assays, rather than from in vivo or in vitro studies. The fact that two proteins stick to each other in an assay performed in an artificial environment can be misleading, since the two proteins may never be present at the same time in the same tissue or the same part of the cell.

Publicly available curated pathway databases used by the methods listed in Table 2 are KEGG [67], NCI-PID [68], BioCarta [69], WikiPathways [70], PANTHER [71], and Reactome [72]. These knowledge bases are built by manually curating experiments performed in different cell types under different conditions. These curated knowledge bases are more reliable than protein interaction networks but do not include all known genes and their interactions. As an example, despite being continuously updated, KEGG included only about 5,000 human genes in signaling pathways, while the number of protein-coding genes is estimated to be between 19,000 and 20,000 [73].

| Method | Experiment input | Pathway database | Graph model | Pathway scoring |
|---|---|---|---|---|
| Signaling pathway analysis using gene expression | | | | |
| MetaCore | DE genes | proprietary canonical pathway, genome-scale network | Single-type, directed | Hierarchically aggregated |
| Pathway-Express | DE genes change, or measured genes change* | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| PathOlogist | Measured genes expression | KEGG | Multi-type, directed | Hierarchically aggregated |
| iPathwayGuide | DE genes change, or measured genes change | KEGG signaling, Reactome, NCI, BioCarta | Single-type, directed | Hierarchically aggregated |
| SPIA | DE genes change | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| NetGSA | Measured genes expression | KEGG signaling | Single-type, directed | Multivariate analysis |
| PWEA | Measured genes expression | YeastNet | Single-type, undirected | Hierarchically aggregated |
| TopoGSA | DE genes | PPI network, KEGG | Single-type, undirected | Hierarchically aggregated |
| TopologyGSA | Measured genes expression | NCI-PID | Single-type, undirected | Multivariate analysis |
| DEGraph | Measured genes expression | KEGG | Single-type, undirected | Multivariate analysis |
| GGEA | Measured genes expression | KEGG | Single-type, directed | Aggregate fuzzy similarity |
| BPA | Measured genes expression - with cut-off | NCI-PID | Single-type, DAG | Bayesian network |
| GANPA | DE genes change, or measured genes expression | PPI network, KEGG, Reactome, NCI-PID, HumanCyc | Single-type, undirected | Hierarchically aggregated |
| ROntoTools | DE genes change, or measured gene expression | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| BAPA-IGGFD | Measured genes expression - with cut-off | Literature-based interaction database, KEGG, WikiPathways, Reactome, MSigDB, GO BP, PANTHER; constructed gene association network from PPIs; co-annotation in GO Biological Process (BP); and co-expression in microarray data | Single-type, DAG | Bayesian network |
| CePa | DE genes expression, or measured genes expression | NCI-PID | Single-type, directed | Hierarchically aggregated |
| THINK-Back-DS | DE genes change, measured genes expression | KEGG, PANTHER, BioCarta, Reactome, GenMAPP | Single-type, directed | Hierarchically aggregated |
| TBScore | DE genes change | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| ACST | Measured genes expression | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| EnrichNet | DE genes list | PPI network, KEGG, BioCarta, WikiPathways, Reactome, NCI-PID, InterPro, GO with STRING 9.0 | Single-type, undirected | Hierarchically aggregated |
| clipper | Measured genes expression | BioCarta, KEGG, NCI-PID, Reactome | Single-type, directed | Multivariate analysis |
| DEAP | Measured genes expression | KEGG, Reactome | Single-type, directed | Hierarchically aggregated |
| DRAGEN | Measured genes expression | RegulonDB, M3D, HTRIdb, ENCODE, MSigDB | Single-type, directed | Linear regression |
| ToPASeq*** | Measured genes expression | KEGG | Single-type, directed | Hierarchical & multivariate |
| pDis | DE genes change, or measured genes change* | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| SPATIAL | DE genes change | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| BLMA**** | Measured genes expression | KEGG signaling | Single-type, directed | Hierarchically aggregated |
| Signaling pathway analysis using multiple types of data | | | | |
| PARADIGM | Measured genes expression, copy number, and proteins levels | Constructed PPI networks from MIPS, DIP, BIND, HPRD, IntAct, and BioGRID | Multi-type, directed | Hierarchically aggregated |
| microGraphite | Measured gene expression, and miRNA expression | BioCarta, KEGG, NCI-PID, Reactome | Single-type, directed | Multivariate analysis |
| mirIntegrator | DE genes change, and DE miRNA change | KEGG signaling, miRTarBase | Single-type, directed | Hierarchically aggregated |
| Metabolic pathway analysis | | | | |
| ScorePAGE | Measured genes expression | KEGG metabolic | Single-type, undirected | Hierarchically aggregated |
| TAPPA | Measured genes expression | KEGG metabolic | Single-type, undirected | Hierarchically aggregated |
| MetPA | DE metabolites change | KEGG metabolic | Single-type, directed | Hierarchically aggregated |
| MetaboAnalyst | DE genes change and DE metabolites change | KEGG metabolic | Single-type, directed | Hierarchically aggregated |

Table 2: A summary of the experiment data input format and biological network databases used by the surveyed methods. Experiment input shows the experiment data input format for each method. DE means differentially expressed. Change means fold-change value or t-statistic when comparing gene/metabolite values between two phenotypes. Measured means the list of all the genes/metabolites measured in the experiment; List represents a list of gene/metabolite identifiers (e.g., symbols). With cut-off marks methods that take as input the list of all measured genes and mark the DE genes during the analysis. Pathway database shows the name of the knowledge source for biological interactions. Graph model shows whether the graph has one or multiple types of nodes, as well as whether it is directed or undirected. * the package ROntoTools (Bioconductor) allows for analysis using expression values of all genes. *** ToPASeq provides an R package that runs TopologyGSA, DEGraph, clipper, SPIA, TBScore, PWEA, TAPPA. **** BLMA provides an R package for bi-level meta-analysis that runs SPIA, ORA, GSA, and PADOG using one or multiple expression datasets.

The implementation of analysis methods constrains the software to accept a specific input pathway data format, while the underlying graph models in the methods are independent of the input format. Regardless of the pathway format, this information must be parsed into a computer-readable graph data structure before being processed. The implementation may incorporate a parser, or this may be up to the user. For instance, SPIA accepts any signaling pathway or network if it can be transformed into an adjacency matrix representing a directed graph where all nodes are components and all edges are interactions. NetGSA is similarly flexible with regard to signaling and metabolic pathways. SPIA provides KEGG signaling pathways as a set of pre-parsed adjacency matrices. The methods described in this chapter may be restricted to only one pathway database, or may accept several.
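As an illustration of the parsing step, the sketch below turns a list of directed interactions into an adjacency matrix. It is a generic construction and not the internal format of SPIA or any other package; in particular, the orientation (rows are sources, columns are targets), the 0/1 encoding, and the gene labels are assumptions made for this example.

```python
def adjacency_matrix(nodes, edges):
    """Build a 0/1 adjacency matrix for a directed graph.

    matrix[i][j] == 1 means there is an edge from nodes[i] to nodes[j]
    (row = source, column = target; an assumed convention).
    """
    index = {node: position for position, node in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, target in edges:
        matrix[index[source]][index[target]] = 1
    return matrix


# Hypothetical three-gene cascade.
nodes = ["geneA", "geneB", "geneC"]
edges = [("geneA", "geneB"), ("geneB", "geneC")]

matrix = adjacency_matrix(nodes, edges)
for node, row in zip(nodes, matrix):
    print(node, row)
```

A plain 0/1 matrix keeps direction but not the type of interaction; representing activation versus repression would need, for example, one matrix per interaction type or signed entries, which is again a design choice rather than something prescribed by the text.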

To the best of our knowledge, there has been no comprehensive comparison of the pros and cons of existing knowledge bases. It is entirely up to researchers to choose tools that are able to work with the database they trust. Intuitively, methods that are able to exploit the complementary information available from different databases have an edge over methods that work with a single database.

3.3 Graph models

Pathway analysis approaches use two major graph models to represent biological networks obtained from knowledge bases: i) single-type and ii) multiple-type. The first model allows only one type of node, i.e. a gene or protein, with edges representing molecular interactions between the nodes (KEGG signaling in Figure 2). Models that contain directed graphs are more suitable for analyses that include signal propagation of gene perturbation. The second graph model allows multiple types of nodes, such as components and interactions (NCI-PID in Figure 2). Multi-type graph models are more complex than single-type models, but they are expected to capture more pathway characteristics. For example, single-type models are limited when trying to describe all the relations between multiple components that are involved in the same interaction. Bipartite graphs, which contain two types of nodes and allow connections only between nodes of different types, are a particular case of multi-type graph models.

The majority of analysis methods surveyed here use a single-type graph model. Some apply the analysis on a directed or undirected single-type network built using the input pathway, while others transform the pathways into graphs with specific characteristics. An example of the latter is TopologyGSA, which transforms the directed input pathway into an undirected decomposable graph that has the advantage of being easily broken down into separate modules [74]. In this method, decomposable graphs are used to find important submodules, i.e. those that drive the changes across the whole pathway. For each pathway, TopologyGSA creates an undirected moral graph¹ from the

¹ The moral graph of a DAG is the undirected graph created by adding an (undirected) edge between all parents of the same node (sometimes called marrying), and then replacing all directed edges by undirected edges. The name stems from the fact that, in a moral graph, two nodes that have a common child are required to be married by sharing an edge.


Figure 2: Five biological network graph models used by public databases are displayed. The KEGG database contains both signaling and metabolic networks. Signaling networks have genes/gene products as nodes and regulatory signals as edges. Types of regulatory signals include activation, inhibition, phosphorylation, and many others. Metabolic networks have biochemical compounds as nodes and chemical reactions as edges. Enzymes (specialized proteins/gene products) catalyze biochemical reactions, therefore genes are linked to the edges in these networks. The Reactome database is a collection of biochemical reactions that are grouped in functionally related sets to form a pathway. There are two types of nodes: biochemical compounds and reactions. Reaction nodes link biochemical compounds as reactants and products. NCI-PID signaling networks also have two types of nodes: component nodes and process nodes. Component nodes are usually biomolecular components. Process nodes are usually biochemical reactions or biological processes. In these networks, process nodes link two or more component nodes through directed edges. Process nodes are assigned one of the following states: positive regulation, negative regulation, or involved in. Protein-protein interaction (PPI) networks have proteins as nodes and physical binding as edges or interactions. Two-hybrid assays are typically used to determine protein-protein interactions. PPIs can be directed in the bait-prey orientation, when the bait-prey relation is considered (top), or undirected, when it is not (bottom). The Biological Pathway Exchange (BioPAX) format has physical entities as nodes and conversions as edges. The advantage of this representation is that it is generic and provides a lot of flexibility to accommodate various types of interactions. In addition, it provides a machine-readable standard that can be used by all databases to provide their data in a unified format. Network nodes can be genes and gene products, complexes, as well as non-coding RNA. Network edges can be the assembly of a complex, the disassembly of a complex, or biochemical reactions, among others.

underlying directed acyclic graph (DAG) by connecting the parents of each child and removing the edge direction. The moral graph is then used to test the hypothesis that the underlying network is changed significantly between the two phenotypes. If the research hypothesis is rejected, a decomposable/triangulated graph is generated from the moral graph by adding new edges. This graph is broken into the maximal possible submodules and the hypothesis is re-tested on each of them.
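The moralization step can be sketched in a few lines of Python. This is a minimal illustration of the general definition of a moral graph, assuming the DAG is given as a list of (parent, child) pairs; the function name and the toy graph are hypothetical, and the sketch does not reproduce the TopologyGSA implementation or its subsequent triangulation step.

```python
from itertools import combinations


def moralize(dag_edges):
    """Return the undirected edge set of the moral graph of a DAG.

    dag_edges is an iterable of (parent, child) pairs.
    """
    dag_edges = list(dag_edges)
    parents = {}
    for parent, child in dag_edges:
        parents.setdefault(child, set()).add(parent)
    # Drop the edge directions.
    moral = {frozenset(edge) for edge in dag_edges}
    # "Marry" every pair of parents that share a child.
    for shared in parents.values():
        for u, v in combinations(sorted(shared), 2):
            moral.add(frozenset((u, v)))
    return moral


# Hypothetical toy DAG: A and B are both parents of C, which regulates D.
toy_dag = [("A", "C"), ("B", "C"), ("C", "D")]
moral_graph = moralize(toy_dag)
```

In the toy example the only added edge is A-B, because A and B are the only pair of nodes with a common child.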

PathOlogist and PARADIGM are the two surveyed methods that use multi-type graph models. PathOlogist uses a bipartite graph model with component and interaction nodes. PARADIGM, conceptually motivated by the central dogma of molecular biology, takes a pathway graph as input and converts it into a more detailed graph, where each component node is replaced by several more specific nodes: biological entity nodes, interaction nodes, and nodes containing observed experiment data. The observed experiment nodes could in principle contain gene expression and copy number information. Biological entity nodes are DNA, mRNA, protein, and active protein. The interaction nodes are transcription, translation, or protein activation, among others. Biological entity and interaction node values are derived from these data and specify the probability of the node being active. These are the hidden states of the model. From the mathematical model perspective, models that allow multiple types of nodes (component nodes and interaction nodes) are more flexible and


are able to model both AND and OR gates, which are very common when describing cellular processes.

3.4 Pathway scoring strategies

The goal of the scoring method is to compute a score for each pathway based on the expression change and graph model, resulting in a ranked list of pathways or sub-pathways. There are a variety of approaches to quantify the changes in a pathway. Some of the analysis methods use a hierarchically aggregated scoring algorithm, where on the first level, a score is calculated and assigned to each node or pair of nodes (component and/or interaction). On the second level, these scores are aggregated to compute the score of the pathway. On the last level, the statistical significance of the pathway score is assessed using univariate hypothesis testing. Another approach assigns a random variable to each node and a multivariate probability distribution is calculated for each pathway. The output score can be calculated in two ways. One way is to use multivariate hypothesis testing to assess the statistical significance of changes in the pathway distribution between the two phenotypes. The other way is to estimate the distribution parameters based on the Bayesian network model and use this distribution to compute a probabilistic score to measure the changes. In this section, we provide details regarding the scoring algorithms of the surveyed methods. See Figure 3 for the scoring algorithm categories.

3.4.1 Hierarchically aggregated scoring

The workflow of the hierarchically aggregated scoring strategy has three levels: node statistic computation, pathway statistic computation, and evaluation of the significance of the pathway statistic.

Node level scoring. Most of the surveyed methods incorporate pathway topology information in the node scores. The node level scoring can be divided into four categories: i) graph measures (centrality), ii) similarity measures, iii) probabilistic graphical models, and iv) normalized node value (NNV). Approaches in the first category use centrality measures or a variation of these measures to score nodes in a given pathway. Centrality measures represent the importance of a node relative to all other nodes in a network. Several centrality measures can be applied to networks of genes and their interactions: degree centrality, closeness, betweenness, and eigenvector centrality. Degree centrality accounts for the number of directed edges that enter and leave each node. Closeness sums the shortest distance from each node to all other nodes in the network. Node betweenness measures the importance of a node according to the number of shortest paths that pass through it. Eigenvector centrality uses the adjacency matrix of a graph to determine a dominant eigenvector; each element of this vector is a score for the corresponding node. Thus, each score is influenced by the scores of neighboring nodes. In the case of directed graphs, a node that has many downstream genes has more influence and receives a higher score. Methods in this category include MetaCore, SPIA, iPathwayGuide, MetPA, SPATIAL, TopoGSA, DEAP, pDis, mirIntegrator, Pathway-Express, and BLMA.
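Two of the centrality measures above can be sketched as follows. This is a minimal illustration on a hypothetical four-node undirected network, not the scoring code of any surveyed method. The eigenvector scores are approximated by power iteration on the adjacency matrix plus the identity, a shift that is assumed here only to avoid oscillation on bipartite graphs; it leaves the dominant eigenvector unchanged.

```python
def degree_centrality(adj):
    """Count edges leaving plus edges entering each node."""
    n = len(adj)
    return [sum(adj[i]) + sum(adj[j][i] for j in range(n)) for i in range(n)]


def eigenvector_centrality(adj, iterations=200):
    """Approximate the dominant eigenvector by power iteration on (A + I)."""
    n = len(adj)
    scores = [1.0] * n
    for _ in range(iterations):
        # Each node keeps its own score and adds the scores of its neighbors.
        updated = [
            scores[i] + sum(adj[i][j] * scores[j] for j in range(n))
            for i in range(n)
        ]
        norm = max(updated) or 1.0
        scores = [value / norm for value in updated]
    return scores


# Hypothetical toy network: nodes 0, 1, 2 form a triangle; node 3 hangs off node 0.
adj = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
degrees = degree_centrality(adj)
scores = eigenvector_centrality(adj)
```

Because the toy matrix is symmetric, every undirected edge is counted once as incoming and once as outgoing in the degree score. Node 0 receives the highest eigenvector score, since it is connected to every other node.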

Approaches in the second category use similarity measures in their node level scoring. Similarity measures estimate the co-expression, behavioral similarity, or co-regulation of pairs of components. Their values can be correlation coefficients, covariances, or dot products of the gene expression profiles across time or samples. In these methods, the pathways with clusters of highly correlated


Figure 3: A summary of the statistical models used by the surveyed methods. Most methods use the hierarchically aggregated scoring strategy, in which the score is computed at the node level before being aggregated at the pathway level for significance assessment. In the left panel, the rows show the node-level statistical models while the columns show the aggregating strategies (linear, non-linear, or weighted gene set). Approaches in the multivariate analysis and Bayesian network categories use multivariate modeling and Bayesian networks, respectively, to find pathways that are most likely to be impacted. Both BLMA and ToPASeq provide R packages that run multiple algorithms. DRAGEN follows a linear regression strategy while GGEA applies fuzzy modeling to rank the pathways.

genes are considered more significant. At the node level, a score is assigned to each pair of nodes in the network, computed as the ratio of a similarity measure to the shortest path distance between these nodes. Thus, the topology information is captured in the node score by incorporating the shortest path distance of the pair. Methods in this category include ScorePAGE and PWEA.
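The pairwise score described above (a similarity measure divided by the shortest path distance) can be sketched as follows. The choice of Pearson correlation as the similarity measure, the treatment of unreachable pairs, and all names and data are assumptions made for this illustration; ScorePAGE and PWEA each define their own variants.

```python
from collections import deque
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def shortest_path_length(neighbors, source, target):
    """Breadth-first search distance in an unweighted graph (None if unreachable)."""
    seen, queue = {source}, deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def pair_score(expr, neighbors, g1, g2):
    """Similarity of two genes divided by their shortest path distance."""
    dist = shortest_path_length(neighbors, g1, g2)
    if not dist:  # identical or unreachable nodes get a score of 0 (assumption)
        return 0.0
    return pearson(expr[g1], expr[g2]) / dist


# Hypothetical expression profiles and a chain-shaped pathway A - B - C.
expr = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [4, 3, 2, 1]}
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
score_ab = pair_score(expr, neighbors, "A", "B")
score_ac = pair_score(expr, neighbors, "A", "C")
```

In the toy data, A and B are perfectly correlated and adjacent, while A and C are perfectly anti-correlated and two steps apart, so the magnitude of their score is halved.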

Approaches in the third category incorporate the topology in the node level scoring using a probabilistic graphical model. In this model, nodes are random variables, and edges define the conditional dependency of the nodes they link. For example, PARADIGM takes observed experiment data and calculates scores for all component nodes, in both observed and hidden states, from the detailed network created by the method based on the input pathway. For each node score, a positive or negative value denotes how likely it is for the node to be active or inactive, respectively. The scores are calculated to maximize the probability of the observed values. A p-value is associated with each score of each sample such that each node can be tagged as significantly active, significantly inactive, or not significant. For each network, a matrix of p-values is output, in which columns are samples and rows are component nodes. Methods in this category include PARADIGM and PathOlogist.


Approaches in the fourth category simply compute the score for each node using the information obtained from the experiment input. For example, TAPPA calculates the score of each node as the square root of the normalized log gene expression (node value), while ACST and TBScore calculate the node level score using a sign statistic and log fold-change.

Pathway level scoring. There are three different ways to compute the pathway score: i) linear, ii) non-linear, and iii) weighted gene set. Most methods aggregate node level statistics to pathway level statistics using linear functions such as averaging or summation. For example, iPathwayGuide computes the score of a pathway as the sum of the scores of all genes in the pathway, while TBScore weights the pathway DE genes based on their log fold change and the number of distinct DE genes directly downstream of them, using a depth-first search algorithm.

Approaches in the second group (non-linear) use a non-linear function to compute the pathway scores. For example, TAPPA computes the pathway score for each sample as a weighted sum of the product of all node pair scores in the pathway. The weight coefficient is 0 when there is no edge between a pair. For any connected node pair the weight is a sign function, which represents joint up- or down-regulation of the pair. Another example is EnrichNet, in which pathway scores measure the difference between the node score distribution of a pathway and that of a background network/gene set, which consists of all pathways. At the node level, the distance of all DE genes to the pathway is measured and summarized as a distance distribution. The method assumes that the most relevant pathway is the one with the greatest difference between the pathway node score distribution and the background score distribution. The difference between the two distributions is measured by the weighted averaging of the difference between the two discretized and normalized distributions. The averaging method down-weights the higher distances and emphasizes the lower distance nodes.
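A sample-level score of the kind described for TAPPA can be sketched as follows. The sketch follows the verbal description above: node scores are square roots of absolute normalized log expression values, unconnected pairs have weight 0, and connected pairs are weighted by a sign function. Taking the sign of the sum of the two node values is an assumption made here to represent joint up- or down-regulation, and the gene names and values are hypothetical.

```python
from math import sqrt


def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def pathway_score(values, edges):
    """Weighted sum of products of node pair scores for one sample.

    values maps each gene to its normalized log expression in the sample;
    edges lists the connected pairs (all other pairs have weight 0).
    """
    total = 0.0
    for g1, g2 in edges:
        weight = sign(values[g1] + values[g2])  # assumed form of the sign weight
        total += weight * sqrt(abs(values[g1])) * sqrt(abs(values[g2]))
    return total


# Hypothetical sample with two connected pairs, A-B and B-C.
values = {"A": 4.0, "B": 1.0, "C": -9.0}
edges = [("A", "B"), ("B", "C")]
score = pathway_score(values, edges)
```

In the toy sample the A-B pair contributes +2 (both genes are up) and the B-C pair contributes -3 (the pair is dominated by the strongly down-regulated C), giving a score of -1.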

Methods in the third group (weighted gene set) design scoring techniques that incorporate existing gene set analysis methods, such as GSEA [14], GSA [15], or LRPath [75]. Pathway-level scores can be calculated using node scores, which represent the topology characteristics of the pathway, as weight adjustments to a gene set analysis method. PWEA, GANPA, THINK-Back-DS, and CePa use this approach and we refer to them as weighted gene set analysis methods.

Pathway significance assessment. Pathway scores are intended to provide information regarding the amount of change incurred in the pathway between two phenotypes. However, the amount of change is not meaningful by itself since any amount of change can take place just by chance. An assessment of the significance of the measured changes is thus required.

TopoGSA, MetPA, and EnrichNet output scores without any significance assessment, leaving it up to the user to interpret the results. This is problematic because the user does not have any instrument to help distinguish between changes due to noise or random causes, and meaningful changes, unlikely to occur just by chance and therefore possibly related to the phenotype. The rest of the analysis methods perform a hypothesis test for each pathway. The null hypothesis is that the value of the observed statistic is due to random noise or chance alone. The research hypothesis is that the observed values are substantial enough that they are potentially related to the phenotype. A p-value for the calculated score is then computed, and a user-defined threshold on the p-value is used to decide whether the null hypothesis can be rejected or not for each pathway. Finally, a correction for multiple comparisons should be performed.
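The text does not prescribe a particular correction for multiple comparisons. As one common choice, assumed here purely for illustration, the Benjamini-Hochberg procedure for controlling the false discovery rate can be sketched as follows; the p-values in the example are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four pathways.
raw = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw)
```

Each pathway whose adjusted p-value falls below the user-defined threshold would then be reported as significant.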

Typically, pathway analysis methods compute one score per pathway. The distribution of this score under the null hypothesis can be constructed and compared to the observed value. However, there are often too few samples to calculate this distribution, so it is assumed that the distribution is


known. For example, in MetaCore and many other techniques, when the pathway score is the number of DE nodes that fall on the pathway, the distribution is assumed to be hypergeometric. However, the hypergeometric distribution assumes that the variables (genes in this case) are independent, which is incorrect, as witnessed by the fact that the pathway graph structure itself is designed to reflect the specific ways in which the genes influence each other. Another approach to identify the distribution is to use statistical techniques such as the bootstrap [76]. Bootstrapping can be done either at the sample level, by permuting the sample labels, or at the gene set level, by permuting the values assigned to the genes in the set.
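Under the hypergeometric assumption discussed above, the p-value of a pathway is the probability of observing at least the given number of DE genes among the pathway genes. The following minimal sketch computes this upper tail directly from binomial coefficients; the function and parameter names are hypothetical and the example counts are made up.

```python
from math import comb


def hypergeom_pvalue(total_genes, de_genes, pathway_size, de_in_pathway):
    """P(X >= de_in_pathway) when pathway_size genes are drawn without replacement."""
    upper = min(de_genes, pathway_size)
    tail = sum(
        comb(de_genes, k) * comb(total_genes - de_genes, pathway_size - k)
        for k in range(de_in_pathway, upper + 1)
    )
    return tail / comb(total_genes, pathway_size)


# Hypothetical counts: 10 measured genes, 3 DE genes, a pathway of 4 genes
# that contains 2 of the DE genes.
p_value = hypergeom_pvalue(10, 3, 4, 2)
```

For these counts the tail probability is 70/210, i.e. one third. As noted above, the calculation treats genes as independent, which is exactly the assumption that the pathway topology contradicts.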

3.4.2 Multivariate analysis and Bayesian network

Multivariate analysis methods mostly use multivariate probability distributions to compute pathway-level statistics and these can be grouped into two subcategories. Methods in the first category use multivariate hypothesis testing, while methods in the second category are based on Bayesian networks.

NetGSA, TopologyGSA, DEGraph, clipper, and microGraphite are methods based on multivariate hypothesis testing. These analysis methods assume the vectors of gene expression values in each (sub)pathway are random vectors with multivariate normal distributions. The network topology information is stored in the covariance matrix of the corresponding distribution. For a network, if the two distributions of the gene expression vectors corresponding to the two phenotypes are significantly different, the network is assumed to be significantly impacted when comparing the two phenotypes. The significance assessment is done by a multivariate hypothesis test. The definition of the null hypothesis for the statistical tests and the techniques to calculate the parameters of the distributions are the main differences among these analysis methods.

BPA and BAPA-IGGFD are two methods based on Bayesian networks. In a Bayesian network, which is a special case of probabilistic graphical models, a random variable is assigned to each node of a directed acyclic graph (DAG). The edges in the graph represent the conditional probabilities between nodes, so that the children are independent from each other and the rest of the graph when conditioned on the parents. In BPA, the value of the Bayesian random variable assigned to each node captures the state of a gene (DE or not). In contrast, in BAPA-IGGFD each random variable assigned to an edge is the probability that up- or down-regulation of the genes at both ends of an interaction is concordant with the type of interaction, which can be activation or inhibition. In both BPA and BAPA-IGGFD, each random variable is assumed to follow a binomial distribution whose probability of success follows a beta distribution. However, these two methods use different approaches in representing the multivariate distribution of the corresponding random vector. BPA assumes that the random vector has a multinomial distribution, which is the generalization of the binomial distribution. In this case, the vector of the success probabilities follows the Dirichlet distribution, which is the multivariate extension of the beta distribution. In contrast, BAPA-IGGFD assumes the random variables are independent, therefore the multivariate distributions are calculated by multiplying the distributions of the random variables in the vector. It is worth mentioning that the assumption of independence in BAPA-IGGFD is contradicted by evidence, specifically in the case of edges that share nodes.

3.4.3 Other approaches

The four methods DRAGEN, GGEA, ToPASeq, and BLMA follow strategies that are very different from those of the other 30 methods. BLMA implements a bi-level meta-analysis approach


that can be applied in conjunction with any of the four statistical approaches: SPIA [23], GSA [15], ORA [13], and PADOG [77]. The package allows users to perform pathway analysis with one dataset or with multiple datasets. Similarly, ToPASeq provides an R package that runs TopologyGSA, DEGraph, clipper, SPIA, TBScore, PWEA, and TAPPA. This method models the input (biological networks and experiment data) in such a way that it can be used for any of the 7 different analyses.

DRAGEN is an analysis method that scores the interactions rather than the genes and uses a regression model to detect differential regulation. DRAGEN fits each edge of a pathway into two linear models (for case and control) and then computes a p-value that represents the difference between the two models. For each pathway, a summary statistic is computed by combining the p-values of the edges using a weighted Fisher's method [78]. GGEA, on the other hand, uses a Petri net [79] to model the pathway. The summary statistic, named consistency, is computed from the fuzzy similarity between the observed gene expression and a Petri net with fuzzy logic (PNFL) [80]. For both DRAGEN and GGEA, the p-value of the pathway is calculated by comparing the observed summary statistic against the null distribution that is constructed by permutation.
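To illustrate how edge-level p-values can be combined into a single summary statistic, the sketch below implements the classical, unweighted Fisher's method. It is a simplification: DRAGEN uses a weighted variant and assesses significance by permutation, neither of which is reproduced here. The statistic is -2 times the sum of the log p-values, which follows a chi-square distribution with 2k degrees of freedom when the k p-values are independent; for even degrees of freedom the tail probability has a closed form.

```python
from math import exp, log


def fisher_combined_pvalue(p_values):
    """Combine independent p-values (each > 0) with the unweighted Fisher's method."""
    k = len(p_values)
    statistic = -2.0 * sum(log(p) for p in p_values)
    # Survival function of a chi-square distribution with 2k degrees of freedom:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = statistic / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return exp(-half) * total


# Hypothetical edge-level p-values for one pathway.
combined = fisher_combined_pvalue([0.05, 0.05])
```

Two moderately small p-values combine into a smaller one (about 0.017 in this example), while a single p-value is returned unchanged.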

4 Challenges in pathway analysis

Pathway analysis has become the first choice for gaining insights into the underlying biology due to its explanatory power. However, there are outstanding annotation and methodological limitations that have not been addressed [19, 81]. There are three main limitations of current knowledge bases. First, existing knowledge bases are unable to keep up with the information available in data obtained from recent technologies. For example, RNA-Seq data allows us to identify transcripts that are active under certain conditions. Alternatively spliced transcripts, even if they originate from the same gene, may have distinct or even opposite functions [82]. However, most knowledge bases provide pathway annotation only at the gene level. Second, there is a lack of condition- and cell-specific information, i.e. information about cell type, conditions, and time points. Finally, current pathway annotations are neither complete nor perfectly accurate [19, 83]. For example, the number of genes in KEGG has remained around 5,000 despite continuous updates over the past 10 years, while there are approximately 19,000 human genes annotated with at least one GO term. The number of protein-coding genes is estimated to be between 19,000 and 20,000 [73], most of which are included in DNA microarray assays, such as Affymetrix HG U133 plus 2.0.

Another challenge is the oversimplification that characterizes many of the models provided by pathway databases. In principle, each type of tissue might have different mechanisms, so generic, organism-level pathways present a somewhat simplistic description of the phenomena. Furthermore, signaling and metabolic processes can also be different from one condition to another, or even from one patient to another. Understanding the specific pathways that are impacted in a given phenotype or sub-group of patients should be another goal for the next generation of pathway analysis tools. See Khatri et al. [19] for a more detailed discussion of the annotation limitations of existing knowledge bases.

Here we focus on the challenges of pathway analysis from a computational perspective. We demonstrate that there is a systematic bias in pathway analysis [84]. This bias makes most, if not all, pathway analysis approaches unreliable. We also discuss the lack of benchmark datasets or pipelines to assess the performance of existing approaches.


4.1 Systematic bias of pathway analysis methods

Pathway analysis approaches often rely on hypothesis testing to identify the pathways that are impacted under the effects of the diseases. Null distributions are used to model populations so that statistical tests can determine whether an observation is unlikely to occur by chance. In principle, the p-values produced by a sound statistical test must be uniformly distributed under the null hypothesis [85-88]. For example, the p-values that result from comparing two groups using a t-test should be distributed uniformly if the data are normally distributed [87]. When the assumptions of statistical models do not hold, the resulting p-values are not uniformly distributed under the null hypothesis. This makes classical methods, such as the t-test, inaccurate since gene expression values do not necessarily follow their assumptions. Here we show that the problem also extends to pathway analysis, as the pathway p-values obtained from statistical approaches are not uniformly distributed under the null hypothesis. This might lead to severe bias towards well-studied diseases, such as cancer, and thus make the results unreliable [84].
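The uniformity property can be checked empirically with a small simulation. The sketch below is an illustration under simplifying assumptions: it uses a two-sided z-test with known unit variance instead of a t-test (so that only the standard library is needed), draws both groups from the same normal distribution, and measures the departure from uniformity with the Kolmogorov-Smirnov distance. All function names and settings are hypothetical.

```python
import random
from math import erfc, sqrt


def null_pvalues(n_tests=2000, n_per_group=10, seed=0):
    """Two-sided z-test p-values for two groups drawn from the same N(0, 1)."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_tests):
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        z = (sum(a) / n_per_group - sum(b) / n_per_group) / sqrt(2 / n_per_group)
        p_values.append(erfc(abs(z) / sqrt(2)))  # equals 2 * (1 - Phi(|z|))
    return p_values


def ks_distance_from_uniform(p_values):
    """Largest gap between the empirical CDF and the Uniform(0, 1) CDF."""
    ordered = sorted(p_values)
    n = len(ordered)
    return max(max((i + 1) / n - p, p - i / n) for i, p in enumerate(ordered))


p_null = null_pvalues()
distance_null = ks_distance_from_uniform(p_null)
# Squaring the p-values mimics a test that is biased towards zero.
distance_biased = ks_distance_from_uniform([p * p for p in p_null])
```

When the test assumptions hold, the distance stays close to zero; the artificially biased p-values produce a distance near 0.25, which is the kind of departure examined for pathway-level p-values in the remainder of this section.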

Consider three pathway analysis methods that represent three different classes of methods for pathway analysis: Gene Set Analysis (GSA) [15] is a Functional Class Scoring method [14, 15, 77, 89], Down-weighting of Overlapping Genes (PADOG) [77] is an enrichment method [12, 90, 91], and Signaling Pathway Impact Analysis (SPIA) [92] is a topology-aware method [22, 92]. To simulate the null distribution, we download and process the data from 9 public datasets: GSE14924 CD4, GSE14924 CD8, GSE17054, GSE12662, GSE57194, GSE33223, GSE42140, GSE8023, and GSE15061. Using 140 control samples from the 9 datasets, we simulate 40,000 datasets as follows. We randomly label 70 samples as control samples and the remaining 70 samples as disease samples. We repeat this procedure 10,000 times to generate different groups of 70 control and 70 disease samples. To make the simulation more general, we also create 10,000 datasets consisting of 10 control and 10 disease samples, 10,000 datasets consisting of 10 control and 20 disease samples, and 10,000 datasets consisting of 20 control and 10 disease samples. We then calculate the p-values of the KEGG human signaling pathways using each of the three methods.
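The random relabeling scheme described above can be sketched as follows. The sample identifiers are placeholders rather than actual GEO accessions, the seed is fixed only to make the sketch reproducible, and the number of repetitions is reduced from 10,000 to 5 per design to keep the example small. Running the three pathway analysis methods on each simulated dataset is outside the scope of this sketch.

```python
import random


def random_label_split(sample_ids, n_control, n_disease, rng):
    """Randomly pick disjoint pseudo-control and pseudo-disease groups."""
    chosen = rng.sample(sample_ids, n_control + n_disease)
    return chosen[:n_control], chosen[n_control:]


rng = random.Random(42)  # fixed seed, assumed only for reproducibility
samples = [f"control_{i}" for i in range(140)]  # placeholder identifiers
designs = [(70, 70), (10, 10), (10, 20), (20, 10)]  # (controls, diseases)
repeats = 5  # the text repeats each design 10,000 times
simulated = [
    random_label_split(samples, n_control, n_disease, rng)
    for n_control, n_disease in designs
    for _ in range(repeats)
]
```

Because every sample comes from a healthy individual, any pathway reported as significant in one of these simulated comparisons is a false positive by construction.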

The effect of combining control (i.e. healthy) samples from different experiments is to uniformly distribute all sources of bias among the random groups of samples. If we compare groups of control samples based on experiments, there could be true differences due to batch effects. By pooling them together, we form a population which is considered the reference population. This approach is similar to selecting from a large group of people that may contain different sub-groups (e.g. different ethnicities, gender, race, life style, or living conditions). When we randomly select samples (for the two random groups to be compared) from the reference population, we expect all bias (e.g. ethnic subgroups) to be represented equally in both random groups and therefore, we should see no difference between these random groups, no matter how many distinct ethnic subgroups were present in the population at large. Therefore, the p-values of a test for difference between the two randomly selected groups should be equally probable between zero and one.

Figure 4 displays the empirical null distributions of p-values using PADOG, GSA, and SPIA. The horizontal axes represent p-values while the vertical axes represent p-value densities. Green panels (A0-A6) show p-value distributions from PADOG, while blue (B0-B6) and purple (C0-C6) panels show p-value distributions from GSA and SPIA, respectively. For each method, the larger panel (A0, B0, and C0) shows the cumulative p-values from all KEGG signaling pathways. The small panels, 6 per method, display extreme examples of non-uniform p-value distributions for specific pathways. For each method, we show three distributions severely biased towards zero (e.g. A1-A3), and three distributions severely biased towards one (e.g. A4-A6).

These results show that, contrary to generally accepted beliefs, the p-values are not uniformly


[Figure 4 image: histograms of p-values (horizontal axis: p-values; vertical axis: density). PADOG panels: (A0) all pathways, (A1) Pathways in cancer, (A2) Renal cell carcinoma, (A3) Prostate cancer, (A4) Prion diseases, (A5) Pertussis, (A6) African trypanosomiasis. GSA panels: (B0) all pathways, (B1) Renal cell carcinoma, (B2) Endometrial cancer, (B3) ErbB signaling pathway, (B4) Pertussis, (B5) Prion diseases, (B6) African trypanosomiasis. SPIA panels: (C0) all pathways, (C1) Rheumatoid arthritis, (C2) Amoebiasis, (C3) Phagosome, (C4) Wnt signaling pathway, (C5) Basal cell carcinoma, (C6) Hippo signaling pathway.]

Figure 4: The empirical distributions of p-values using Down-weighting of Overlapping Genes (PADOG; top), Gene Set Analysis (GSA; middle), and Signaling Pathway Impact Analysis (SPIA; bottom). The distributions are generated by re-sampling from 140 control samples obtained from 9 AML datasets. The horizontal axes display the p-values while the vertical axes display the p-value densities. Panels A0-A6 (green) show the distributions of p-values from PADOG; panels B0-B6 (blue) show the distributions of p-values from GSA; panels C0-C6 (purple) show the distributions of p-values from SPIA. The large panels on the left, A0, B0, and C0, display the distributions of p-values cumulated from all KEGG signaling pathways. The smaller panels on the right display the p-value distributions of selected individual pathways, which are extreme cases. For each method, the upper three distributions, for example A1-A3, are biased towards zero and the lower three distributions, for example A4-A6, are biased towards one. Since none of these p-value distributions are uniform, there will be systematic bias in identifying significant pathways using any one of the methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are more likely to be among false negative results even if they may be implicated in the given phenotype.


distributed for the three methods considered. Therefore one should expect a very strong and systematic bias in identifying significant pathways for each of these methods. Pathways that have p-values biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways that have p-values biased towards one are likely to rarely meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias, due to non-uniformity of p-value distributions, results in failure of the statistical methods to correctly identify the biological pathways implicated in the condition, and also leads to inconsistent and incorrect results.

4.2 Querying data from knowledge bases

Independent research groups have tried different strategies to model complex bio-molecular phenomena. These independent efforts have led to variation among pathway databases, complicating the task of developing pathway analysis methods. Depending on the database, there may be differences in information sources, experiment interpretation, models of molecular interactions, or boundaries of the pathways. Therefore, it is possible that pathways with the same designation and aiming to describe the same phenomena may have different topologies in different databases. As an example, one could compare the insulin signaling pathways of KEGG and BioCarta. BioCarta includes fewer nodes and emphasizes the effect of insulin on transcription, while KEGG includes transcription regulation as well as apoptosis and other biological processes. Differences in graph models for molecular interactions are particularly apparent when comparing the signaling pathways in KEGG and NCI-PID. While KEGG represents the interaction information using the directed edges themselves, NCI-PID introduces process nodes to model interactions (see Figure 2). Developers are facing the challenge of modifying methods to accept novel pathway databases or modifying the actual pathway graphs to conform to the method.

Pathway databases differ not only in the way that interactions are modeled, but also in the formats in which their data are provided [16]. Common formats are Pathway Interaction Database eXtensible Markup Language (PID XML), KEGG Markup Language (KGML), Biological Pathway Exchange (BioPAX) Level 2 and Level 3, Systems Biology Markup Language (SBML), and the Biological Connection Markup Language (BCML) [93]. The NCI provides a unified assembly of BioCarta and Reactome, as well as their in-house NCI-Nature curated pathways, in NCI-PID format [68]. In order to unify pathway databases, pathway information should be provided in a commonly accepted format.

Another challenge is that the same biological pathways are represented differently from one pathway database to another. No tool is compatible with all database formats, so either the pathway input must be modified or the underlying algorithm must be altered to accommodate the differences. As an example, Vaske et al. [52] attempted to compare SPIA [23] with their tool PARADIGM by re-implementing SPIA in C and forcing its application on NCI-PID pathways. Implementation errors are present in the C version of SPIA, invalidating the comparison. One solution to this challenge could be the development of a unified, globally accepted pathway format. Another possible solution is to build conversion software tools that can translate between pathway formats. Some attempts exist to use BioPAX [94] as the lingua franca for this domain.
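
As a rough idea of what a conversion tool does, the sketch below reads a small hand-written XML fragment modeled on KGML's `entry` and `relation` elements and emits a format-neutral edge list. The fragment is a simplified assumption rather than a real KEGG file, the identifiers are illustrative, and the function name `kgml_to_edges` is hypothetical; a production converter would have to handle groups, compounds, and reactions as well.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written fragment in the style of KGML (not a real KEGG file).
KGML_SNIPPET = """
<pathway name="path:toy" org="hsa" number="00000" title="Toy pathway">
  <entry id="1" name="hsa:3630" type="gene"/>
  <entry id="2" name="hsa:3643" type="gene"/>
  <entry id="3" name="hsa:3667" type="gene"/>
  <relation entry1="1" entry2="2" type="PPrel">
    <subtype name="activation" value="--&gt;"/>
  </relation>
  <relation entry1="2" entry2="3" type="PPrel">
    <subtype name="phosphorylation" value="+p"/>
  </relation>
</pathway>
"""


def kgml_to_edges(xml_text):
    """Return (source, target, interaction) tuples from a KGML-like document."""
    root = ET.fromstring(xml_text.strip())
    # Map the file-local entry ids to the gene identifiers they stand for.
    names = {entry.get("id"): entry.get("name") for entry in root.iter("entry")}
    edges = []
    for rel in root.iter("relation"):
        src = names.get(rel.get("entry1"))
        dst = names.get(rel.get("entry2"))
        if src is None or dst is None:
            continue  # skip relations that point to unknown entries
        subtypes = [s.get("name") for s in rel.findall("subtype")] or ["unspecified"]
        for interaction in subtypes:
            edges.append((src, dst, interaction))
    return edges


if __name__ == "__main__":
    for edge in kgml_to_edges(KGML_SNIPPET):
        print(edge)
```

A neutral edge list of this kind is easy to write back out in another format, but it discards format-specific detail, so round-trip conversions between pathway formats are generally not lossless.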

4.3 Missing benchmark datasets and comparison

Newly developed approaches are typically assessed using simulated data or well-studied biological datasets [95, 96]. The advantage of simulation is that the ground truth is known and can be used to compare the sensitivity and specificity of different methods. However, simulation is often biased and does not fully reflect the complexity of living organisms. On the other hand, when using real biological data, the biology is never fully known. In addition, many papers presenting new pathway analysis methods include results obtained on only a couple of datasets, and researchers are often influenced by the observer-expectancy effect [97]. Thus, such results are not objective, and many times they cannot be reproduced.

A better evaluation approach has been proposed by Tarca et al. [98] using 42 real datasets. This approach uses a target pathway, which is the pathway describing the condition under study. For instance, in an experiment comparing colon cancer with healthy samples, the target pathway would be the colon cancer pathway. The datasets are chosen so that there is a target pathway associated with each of them. The datasets are also all public, so other methods can be compared on the very same data in a reproducible and objective way. The lower the rank and the p-value of the target pathway in the method output, the better the method. This approach has several important advantages: it is reproducible, it is completely objective, and it relies on more than just a couple of datasets that are assessed by the authors using the literature. This approach was also used by Bayerlova et al. [96] (using a different set of 36 real datasets), as well as by other, more recent papers.
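
The target-pathway criterion can be sketched in a few lines. This is an illustration under stated assumptions, not the code of the cited benchmarks: each dataset is represented by a hypothetical dictionary mapping pathway names to p-values together with the name of its target pathway, and the two medians are simply one plausible way to summarize a method across datasets. All pathway names and p-values below are made up.

```python
from statistics import median


def target_rank(results, target):
    """1-based rank of the target pathway when sorting by ascending p-value."""
    ordered = sorted(results.items(), key=lambda item: item[1])
    for rank, (name, _) in enumerate(ordered, start=1):
        if name == target:
            return rank
    raise KeyError(f"target pathway {target!r} not in the results")


def summarize(benchmark):
    """Median rank and median p-value of the target pathways over all datasets."""
    ranks = [target_rank(results, target) for results, target in benchmark]
    pvalues = [results[target] for results, target in benchmark]
    return median(ranks), median(pvalues)


# Invented output of one method on two datasets, for illustration only.
toy_benchmark = [
    ({"Colorectal cancer": 0.001, "Cell cycle": 0.02, "Apoptosis": 0.30}, "Colorectal cancer"),
    ({"Alzheimer's disease": 0.04, "Cell cycle": 0.01, "Apoptosis": 0.50}, "Alzheimer's disease"),
]

if __name__ == "__main__":
    print(summarize(toy_benchmark))
```

Two methods run on the same public datasets can then be compared by these summaries, with lower values indicating that the target pathways were placed closer to the top of the output.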

However, in spite of its advantages over the usual practice of analyzing only a couple of datasets, even this benchmarking has important limitations. First, not all conditions have a namesake pathway in existing databases or one described in the literature. Second, complex diseases are often associated not with a single target pathway but with many biological processes. By its nature, this assessment approach will ignore other pathways and their ranking, even though they may be true positives. More importantly, these approaches fail to take into consideration the systematic bias of pathway analysis approaches. In these review papers, most of the datasets are related to cancer: 28 out of 42 datasets (about 67%) used in Tarca et al. [98], and 26 out of 36 datasets (about 72%) used in Bayerlova et al. [96], are cancer datasets. In those cases, methods that are biased towards cancer are very likely to identify cancer pathways, which are also the target pathways, as significant. For this reason, the comparisons obtained from these reviews are likely to be biased and are not reliable for assessing the performance of existing approaches.

5 Conclusions

Pathway analysis is a core strategy of many basic research, clinical research, and translational medicine programs. Emerging applications range from targeting and modeling disease networks to screening chemical or ligand libraries, to identification of drug target interactions for improved efficacy and safety. The integration of molecular interaction information into pathway analysis represents a major advance in the development of mathematical techniques aimed at the evaluation of systems perturbations in biological entities.

This document discussed and categorized 34 existing network-based pathway analysis approaches from different perspectives, including experiment input, graphical representation of pathways in knowledge bases, and statistical approaches to assess pathway significance. Although widely used, DE genes as input make the software sensitive to cutoff parameters. Regarding the graph model, approaches using multiple types of nodes and bipartite graphs are more flexible and are able to model both AND and OR gates, which are very common when describing cellular processes. In addition, tools that are able to work with multiple knowledge bases are expected to perform better due to complementary and independent information that cannot be obtained from individual databases. We also pointed out that there has been no reliable benchmark to assess and rank existing approaches. Some initial efforts to assess pathway analysis methods in an objective and reproducible way do exist, but they still fail to take into account the statistical bias of existing approaches.

Despite tremendous efforts in the field, there are outstanding challenges that need to be addressed. First, current pathway databases are unable to provide transcript-level activity or information related to new types of data, such as SNP, mutation, or methylation. Second, incomplete annotation and the lack of condition- and cell-specific information hinder the accuracy of downstream pathway analysis. Third, variation among pathway databases and the lack of a standard format in which the pathway data are provided pose a real challenge for implementation. Developers face the challenge of either modifying their methods to accept novel pathway databases or modifying the actual pathway graphs to conform to their method. Fourth, there is a systematic bias due to the fact that certain conditions, such as cancer, are much more studied than others. Using control data obtained from 9 mRNA datasets, we showed that the p-values obtained by three pathway analysis approaches that represent three mainstream strategies in pathway analysis (FCS, enrichment, and network-based) are not uniformly distributed under the null. Pathways whose p-values are biased towards zero will often be falsely identified as significant (false positives). Likewise, pathways whose p-values are biased towards one will rarely meet the significance requirements, even when they are truly implicated in the given phenotype (false negatives). Systematic bias, due to non-uniformity of p-value distributions, causes the statistical methods to fail to identify the biological pathways implicated in the condition, and leads to inconsistent and incorrect results.

Finally, there is a lack of benchmarks to assess the performance of each mathematical approach, as well as to validate existing knowledge bases and their graphical representation of pathways. There have been some efforts to provide benchmarks for this purpose, but such methods are not fully reliable yet.

6 Conflict-of-Interest Statement

Sorin Draghici is the founder and CEO of Advaita Corporation, a Wayne State University spin-off that commercializes iPathwayGuide, one of the pathway analysis tools mentioned in this chapter.

7 Acknowledgment

This work has been partially supported by the following grants: National Institutes of Health (R01 DK089167, R42 GM087013), National Science Foundation (DBI-0965741), and the Robert J. Sokol, MD Endowed Chair in Systems Biology. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.

References

[1] Ron Edgar, Michael Domrachev, and Alex E Lash. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research, 30(1):207-210, 2002.

[2] Tanya Barrett, Stephen E Wilhite, Pierre Ledoux, Carlos Evangelista, Irene F Kim, Maxim Tomashevsky, Kimberly A Marshall, Katherine H Phillippy, Patti M Sherman, Michelle Holko, Andrey Yefanov, Hyeseung Lee, Naigong Zhang, Cynthia L Robertson, Nadezhda Serova, Sean Davis, and Alexandra Soboleva. NCBI GEO: archive for functional genomics data sets - update. Nucleic Acids Research, 41(D1):D991-D995, 2013.

[3] Alvis Brazma, Helen Parkinson, Ugis Sarkans, Mohammadreza Shojatalab, Jaak Vilo, Niran Abeygunawardena, Ele Holloway, Misha Kapushesky, Patrick Kemmeren, Gonzalo Garcia Lara, Ahmet Oezcimen, Philippe Rocca-Serra, and Susanna-Assunta Sansone. ArrayExpress - a public repository for microarray gene expression data at the EBI. Nucleic Acids Research, 31(1):68-71, 2003.

[4] Gabriella Rustici, Nikolay Kolesnikov, Marco Brandizi, Tony Burdett, Miroslaw Dylag, Ibrahim Emam, Anna Farne, Emma Hastings, Jon Ison, Maria Keays, Natalja Kurbatova, James Malone, Roby Mani, Annalisa Mupo, Rui Pedro Pereira, Ekaterina Pilicheva, Johan Rung, Anjan Sharma, Y. Amy Tang, Tobias Ternent, Andrew Tikhonov, Danielle Welter, Eleanor Williams, Alvis Brazma, Helen Parkinson, and Ugis Sarkans. ArrayExpress update - trends in database growth and links to data analysis tools. Nucleic Acids Research, 41(D1):D987-D990, 2013.

[5] Jianjiong Gao, Bulent Arman Aksoy, Ugur Dogrusoz, Gideon Dresdner, Benjamin Gross, S Onur Sumer, Yichao Sun, Anders Jacobsen, Rileen Sinha, Erik Larsson, Ethan Cerami, Chris Sander, and Nikolaus Schultz. Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal. Science Signaling, 6(269):pl1, 2013.

[6] Ethan Cerami, Jianjiong Gao, Ugur Dogrusoz, Benjamin E Gross, Selcuk Onur Sumer, Bulent Arman Aksoy, Anders Jacobsen, Caitlin J Byrne, Michael L Heuer, Erik Larsson, et al. The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer Discovery, 2(5):401-404, 2012.

[7] Arthur Liberzon, Aravind Subramanian, Reid Pinchback, Helga Thorvaldsdottir, Pablo Tamayo, and Jill P Mesirov. Molecular signatures database (MSigDB) 3.0. Bioinformatics, 27(12):1739-1740, 2011.

[8] Michael Ashburner, Catherine A. Ball, Judith A. Blake, David Botstein, Heather Butler, J. Michael Cherry, Allan P. Davis, Kara Dolinski, Selina S. Dwight, Janan T. Eppig, Midori A. Harris, David P. Hill, Laurie Issel-Tarver, Andrew Kasarskis, Suzanna Lewis, John C. Matese, Joel E. Richardson, Martin Ringwald, Gerald M. Rubin, and Gavin Sherlock. Gene Ontology: tool for the unification of biology. Nature Genetics, 25:25-29, 2000.

[9] Minoru Kanehisa and Susumu Goto. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 28(1):27-30, 2000.

[10] Minoru Kanehisa, Miho Furumichi, Mao Tanabe, Yoko Sato, and Kanae Morishima. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research, 45(D1):D353-D361, 2017.

[11] David Croft, Antonio Fabregat Mundo, Robin Haw, Marija Milacic, Joel Weiser, Guanming Wu, Michael Caudy, Phani Garapati, Marc Gillespie, Maulik R Kamdar, Bijay Jassal, Steven Jupe, Lisa Matthews, Bruce May, Stanislav Palatnik, Karen Rothfels, Veronica Shamovsky, Heeyeon Song, Mark Williams, Ewan Birney, Henning Hermjakob, Lincoln Stein, and Peter D'Eustachio. The Reactome pathway knowledgebase. Nucleic Acids Research, 42(D1):D472-D477, 2014.

[12] Sorin Draghici, Purvesh Khatri, Rui P Martins, G Charles Ostermeier, and Stephen A Krawetz. Global functional profiling of gene expression. Genomics, 81(2):98-104, 2003.

[13] Saeed Tavazoie, Jason D. Hughes, Michael J. Campbell, Raymond J. Cho, and George M. Church. Systematic determination of genetic network architecture. Nature Genetics, 22:281-285, 1999.

[14] Aravind Subramanian, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, Scott L. Pomeroy, Todd R. Golub, Eric S. Lander, and Jill P. Mesirov. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43):15545-15550, 2005.

[15] Bradley Efron and Robert Tibshirani. On testing the significance of sets of genes. The Annals of Applied Statistics, 1(1):107-129, 2007.

[16] Han-Yu Chuang, Matan Hofree, and Trey Ideker. A Decade of Systems Biology. Annual Review of Cell and Developmental Biology, 26:721-744, 2010.

[17] Frank Emmert-Streib and Galina V. Glazko. Pathway Analysis of Expression Data: Deciphering Functional Building Blocks of Complex Diseases. PLoS Computational Biology, 7(5):e1002053, 2011.

[18] Thomas Kelder, Bruce R Conklin, Chris T Evelo, and Alexander R Pico. Finding the right questions: exploratory pathway analysis to enhance biological discovery in large datasets. PLoS Biology, 8(8):e1000472, 2010.

[19] Purvesh Khatri, Marina Sirota, and Atul J Butte. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Computational Biology, 8(2):e1002375, 2012.

[20] Muhammad Faiz Misman, Safaai Deris, Suhairul ZM Hashim, R Jumali, and Mohd Saberi Mohamad. Pathway-based microarray analysis for defining statistical significant phenotype-related pathways: a review of common approaches. In Information Management and Engineering, 2009. ICIME'09. International Conference on, pages 496-500. IEEE, 2009.

[21] Jorg Rahnenfuhrer, Francisco S. Domingues, Jochen Maydt, and Thomas Lengauer. Calculating the Statistical Significance of Changes in Pathway Activity From Gene Expression Data. Statistical Applications in Genetics and Molecular Biology, 3(1), 2004.

[22] Sorin Draghici, Purvesh Khatri, Adi L Tarca, Kashyap Amin, Arina Done, Calin Voichita, Constantin Georgescu, and Roberto Romero. A systems biology approach for pathway level analysis. Genome Research, 17(10):1537-1545, 2007.

[23] Adi L Tarca, Sorin Draghici, Purvesh Khatri, Sonia S Hassan, Pooja Mittal, Jung-Sun Kim, Chong J Kim, Juan P Kusanovic, and Roberto Romero. A novel signaling pathway impact analysis (SPIA). Bioinformatics, 25(1):75-82, 2009.

[24] Cristina Mitrea, Zeinab Taghavi, Behzad Bokanizad, Samer Hanoudi, Rebecca Tagett, Michele Donato, Calin Voichita, and Sorin Draghici. Methods and approaches in the topology-based analysis of biological pathways. Frontiers in Physiology, 4:278, 2013.

[25] Purvesh Khatri, Calin Voichita, Khalid Kattan, Nadeem Ansari, Avani Khatri, Constantin Georgescu, Adi L Tarca, and Sorin Draghici. Onto-Tools: new additions and improvements in 2006. Nucleic Acids Research, 35(Web Server issue):W206-W211, 2007.

[26] Sol Efroni, Carl F Schaefer, and Kenneth H Buetow. Identification of Key Processes Underlying Cancer Phenotypes Using Biologic Pathway Analysis. PLoS ONE, 2(5):e425, 2007.

[27] S. Greenblum, S. Efroni, C. Schaefer, and K. Buetow. The PathOlogist: an automated tool for pathway-centric analysis. BMC Bioinformatics, 12(1):133, 2011.

[28] Ali Shojaie and George Michailidis. Analysis of Gene Sets Based on the Underlying Regulatory Network. Journal of Computational Biology, 16(3):407-426, 2009.

[29] Ali Shojaie and George Michailidis. Network Enrichment Analysis in Complex Experiments. Statistical Applications in Genetics and Molecular Biology, 9(1), 2010.

[30] Jui-Hung Hung, Troy W Whitfield, Tun-Hsiang Yang, Zhenjun Hu, Zhiping Weng, and Charles DeLisi. Identification of functional modules that correlate with phenotypic difference: the influence of network topology. Genome Biology, 11(2):R23, 2010.

[31] Enrico Glaab, Anaïs Baudot, Natalio Krasnogor, and Alfonso Valencia. TopoGSA: network topological gene set analysis. Bioinformatics, 26(9):1271-1272, 2010.

[32] Maria S Massa, Monica Chiogna, and Chiara Romualdi. Gene set analysis exploiting the topology of a pathway. BMC Systems Biology, 4(1):121, 2010.

[33] Laurent Jacob, Pierre Neuvial, and Sandrine Dudoit. Gains in power from structured two-sample tests of means on graphs. arXiv preprint arXiv:1009.5173, 2010.

[34] Ludwig Geistlinger, Gergely Csaba, Robert Kuffner, Nicola Mulder, and Ralf Zimmer. From sets to graphs: towards a realistic enrichment analysis of transcriptomic systems. Bioinformatics, 27(13):i366-i373, 2011.

[35] Senol Isci, Cengizhan Ozturk, Jon Jones, and Hasan H Otu. Pathway analysis of high-throughput biological data within a Bayesian network framework. Bioinformatics, 27(12):1667-1674, 2011.

[36] Zhaoyuan Fang, Weidong Tian, and Hongbin Ji. A network-based gene-weighting approach for pathway analysis. Cell Research, 22(3):565-580, 2011.

[37] Calin Voichita, Michele Donato, and Sorin Draghici. Incorporating gene significance in the impact analysis of signaling pathways. In Machine Learning and Applications (ICMLA), 2012 11th International Conference on, volume 1, pages 126-131, Boca Raton, FL, USA, 12-15 December 2012. IEEE.

[38] Yifang Zhao, Ming-Hui Chen, Baikang Pei, David Rowe, Dong-Guk Shin, Wangang Xie, Fang Yu, and Lynn Kuo. A Bayesian Approach to Pathway Analysis by Integrating Gene-Gene Functional Directions and Microarray Data. Statistics in Biosciences, 4(1):105-131, 2012.

[39] Zuguang Gu, Jialin Liu, Kunming Cao, Junfeng Zhang, and Jin Wang. Centrality-based pathway enrichment: a systematic approach for finding significant pathways dominated by key genes. BMC Systems Biology, 6(1):56, 2012.

[40] Fernando Farfan, Jun Ma, Maureen A Sartor, George Michailidis, and Hosagrahar V Jagadish. THINK Back: knowledge-based interpretation of high throughput data. BMC Bioinformatics, 13(Suppl 2):S4, 2012.

[41] Maysson Al-Haj Ibrahim, Sabah Jassim, Michael Anthony Cawthorne, and Kenneth Langlands. A Topology-Based Score for Pathway Enrichment. Journal of Computational Biology, 19(5):563-573, 2012.

[42] Jakub Mieczkowski, Karolina Swiatek-Machado, and Bozena Kaminska. Identification of Pathway Deregulation - Gene Expression Based Analysis of Consistent Signal Transduction. PLoS ONE, 7(7):e41541, 2012.

[43] Enrico Glaab, Anaïs Baudot, Natalio Krasnogor, Reinhard Schneider, and Alfonso Valencia. EnrichNet: network-based gene set enrichment analysis. Bioinformatics, 28(18):i451-i457, 2012.

[44] Paolo Martini, Gabriele Sales, M Sofia Massa, Monica Chiogna, and Chiara Romualdi. Along signal paths: an empirical gene set approach exploiting pathway topology. Nucleic Acids Research, 41(1):e19, 2013.

[45] Winston A Haynes, Roger Higdon, Larissa Stanberry, Dwayne Collins, and Eugene Kolker. Differential expression analysis for pathways. PLoS Computational Biology, 9(3):e1002967, 2013.

[46] Shining Ma, Tao Jiang, and Rui Jiang. Differential regulation enrichment analysis via the integration of transcriptional regulatory network and gene expression data. Bioinformatics, 31(4):563-571, 2014.

[47] Ivana Ihnatova and Eva Budinska. ToPASeq: an R package for topology-based pathway analysis of microarray and RNA-Seq data. BMC Bioinformatics, 16(1):350, 2015.

[48] Sahar Ansari, Calin Voichita, Michele Donato, Rebecca Tagett, and Sorin Draghici. A novel pathway analysis approach based on the unexplained disregulation of genes. Proceedings of the IEEE, 105(3):482-495, 2017.

[49] Behzad Bokanizad, Rebecca Tagett, Sahar Ansari, B Hoda Helmi, and Sorin Draghici. SPATIAL: A System-level PAThway Impact AnaLysis approach. Nucleic Acids Research, 44(11):5034-5044, 2016.

[50] Tin Nguyen and Sorin Draghici. BLMA: A package for bi-level meta-analysis. R package version 1.

[51] Tin Nguyen, Rebecca Tagett, Michele Donato, Cristina Mitrea, and Sorin Draghici. A novel bi-level meta-analysis approach-applied to biological pathway analysis. Bioinformatics, 32(3):409-416, 2016.

[52] Charles J Vaske, Stephen C Benz, J Zachary Sanborn, Dent Earl, Christopher Szeto, Jingchun Zhu, David Haussler, and Joshua M Stuart. Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics, 26(12):i237-i245, 2010.

[53] Enrica Calura, Paolo Martini, Gabriele Sales, Luca Beltrame, Giovanna Chiorino, Maurizio D'Incalci, Sergio Marchini, and Chiara Romualdi. Wiring miRNAs to pathways: a topological approach to integrate miRNA and mRNA expression profiles. Nucleic Acids Research, 42(11):e96, 2014.

[54] Diana Diaz, Michele Donato, Tin Nguyen, and Sorin Draghici. MicroRNA-augmented pathways (mirAP) and their applications to pathway analysis and disease subtyping. In Pacific Symposium on Biocomputing, volume 22, page 390. NIH Public Access, 2016.

[55] Tin Nguyen, Diana Diaz, Rebecca Tagett, and Sorin Draghici. Overcoming the matched-sample bottleneck: an orthogonal approach to integrate omic data. Nature Scientific Reports, 6:29251, 2016.

[56] Shouguo Gao and Xujing Wang. TAPPA: topological analysis of pathway phenotype association. Bioinformatics, 23(22):3100-3102, 2007.

[57] Jianguo Xia and David S Wishart. MetPA: a web-based metabolomics tool for pathway analysis and visualization. Bioinformatics, 26(18):2342-2344, 2010.

[58] Jianguo Xia and David S Wishart. Web-based inference of biological patterns, functions and pathways from metabolomic data using MetaboAnalyst. Nature Protocols, 6(6):743, 2011.

[59] Jianguo Xia, Igor V Sinelnikov, Beomsoo Han, and David S Wishart. MetaboAnalyst 3.0 - making metabolomics more meaningful. Nucleic Acids Research, 43(W1):W251-W257, 2015.

[60] Geir Kjetil Sandve, Anton Nekrutenko, James Taylor, and Eivind Hovig. Ten simple rules for reproducible computational research. PLoS Computational Biology, 9(10):e1003285, 2013.

[61] Dougu Nam and Seon-Young Kim. Gene-set approach for expression pattern analysis. Briefings in Bioinformatics, 9(3):189-197, 2008.

[62] Yoram Ben-Shaul, Hagai Bergman, and Hermona Soreq. Identifying subtle interrelated changes in functional gene categories using continuous measures of gene expression. Bioinformatics, 21(7):1129-1137, 2005.

[63] Kuang-Hung Pan, Chih-Jian Lih, and Stanley N Cohen. Effects of threshold choice on biological conclusions reached during analysis of gene expression by DNA microarrays. Proceedings of the National Academy of Sciences of the United States of America, 102(25):8961-8965, 2005.

[64] Paul K Tan, Thomas J Downey, Edward L Spitznagel Jr, Pin Xu, Dadin Fu, Dimiter S Dimitrov, Richard A Lempicki, Bruce M Raaka, and Margaret C Cam. Evaluation of gene expression measurements from commercial microarray platforms. Nucleic Acids Research, 31(19):5676-5684, 2003.

[65] Liat Ein-Dor, Itai Kela, Gad Getz, David Givol, and Eytan Domany. Outcome signature genes in breast cancer: is there a unique set? Bioinformatics, 21(2):171-178, 2005.

[66] Liat Ein-Dor, Or Zuk, and Eytan Domany. Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proceedings of the National Academy of Sciences, 103(15):5923-5928, 2006.

[67] Hiroyuki Ogata, Susumu Goto, Kazushige Sato, Wataru Fujibuchi, Hidemasa Bono, and Minoru Kanehisa. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 27(1):29-34, 1999.

[68] Carl F Schaefer, Kira Anthony, Shiva Krupa, Jeffrey Buchoff, Matthew Day, Timo Hannay, and Kenneth H Buetow. PID: the pathway interaction database. Nucleic Acids Research, 37(Suppl 1):D674-D679, 2009.

[69] BioCarta. BioCarta - Charting Pathways of Life. http://www.biocarta.com.

[70] Alexander R Pico, Thomas Kelder, Martijn P van Iersel, Kristina Hanspers, Bruce R Conklin, and Chris Evelo. WikiPathways: pathway editing for the people. PLoS Biology, 6(7):e184, 2008.

[71] Huaiyu Mi, Betty Lazareva-Ulitsky, Rozina Loo, Anish Kejariwal, Jody Vandergriff, Steven Rabkin, Nan Guo, Anushya Muruganujan, Olivier Doremieux, Michael J Campbell, Hiroaki Kitano, and Paul D Thomas. The PANTHER database of protein families, subfamilies, functions and pathways. Nucleic Acids Research, 33(Suppl 1):D284-D288, 2005.

[72] G Joshi-Tope, Marc Gillespie, Imre Vastrik, Peter D'Eustachio, Esther Schmidt, Bernard de