Module Networks: Discovering Regulatory Modules and their Condition Specific Regulators from Gene Expression Data

Cohen Jony

Post on 21-Dec-2015

Page 1: Module Networks Discovering Regulatory Modules and their Condition Specific Regulators from Gene Expression Data Cohen Jony

Module Networks

Discovering Regulatory Modules and their Condition Specific Regulators from Gene Expression Data

Cohen Jony

Outline

The Problem

Regulators

Module Networks

Learning Module Networks

Results

Conclusion

The Problem

Inferring regulatory networks from gene expression data.

From:

Into:

Regulators

Regulation types

Regulators example

This is an example of a regulatory module.

Known solution: Bayesian Networks

The problem:

With too many variables and too little data, statistical noise leads to spurious dependencies, resulting in models that significantly overfit the data.

From Bayesian To Module

Module Networks

We assume that we are given a domain of random variables X = {X1, ..., Xn}. We use Val(Xi) to denote the domain of values of the variable Xi.

A module set C is a set of formal variables M1, ..., MK, each standing in for the variables of one module. Because all the variables in a module share the same conditional probability distribution (CPD), they must all have the same domain of values.

Module Networks

A module network template T = (S, θ) for C defines, for each module Mj in C:

1) a set of parents PaMj from X;

2) a conditional probability template (CPT) P(Mj | PaMj), which specifies a distribution over Val(Mj) for each assignment in Val(PaMj).

We use S to denote the dependency structure encoded by {PaMj : Mj in C} and θ to denote the parameters required for the CPTs {P(Mj | PaMj) : Mj in C}.

Module Networks

A module assignment function for C is a function A : X → {1, ..., K} such that A(Xi) = j only if Val(Xi) = Val(Mj).

A module network is defined by both the module network template and the assignment function.

Example

In our example, we have three modules M1, M2, and M3.

PaM1 = Ø, PaM2 = {MSFT}, and PaM3 = {AMAT, INTL}.

The assignment function gives A(MSFT) = 1, A(MOT) = 2, A(INTL) = 2, and so on.
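The template and the assignment function can be pictured as two small lookup tables. The sketch below is only an illustration: the dictionary layout and the helper names `members` and `parents_of` are assumptions, and only the assignments stated on the slide are filled in.

```python
from collections import defaultdict

# Parent sets from the example: PaM1 is empty, PaM2 = {MSFT}, PaM3 = {AMAT, INTL}.
parents = {1: set(), 2: {"MSFT"}, 3: {"AMAT", "INTL"}}

# Only the assignments given in the example; the other variables are left out.
assignment = {"MSFT": 1, "MOT": 2, "INTL": 2}


def members(assignment):
    """Group variables by the module they are assigned to (hypothetical helper)."""
    groups = defaultdict(list)
    for variable, module in assignment.items():
        groups[module].append(variable)
    return dict(groups)


def parents_of(variable, parents, assignment):
    """A variable inherits the parent set of the module it belongs to."""
    return parents[assignment[variable]]
```

Here `parents_of("MOT", parents, assignment)` returns `{"MSFT"}`, because MOT belongs to M2 and so shares that module's parents.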

Learning Module Networks

The iterative learning procedure searches for the model with the highest score by using the Expectation Maximization (EM) algorithm.

An important property of the EM algorithm is that each iteration is guaranteed not to decrease the likelihood of the model, so the procedure converges to a local maximum of the score.

Each iteration of the algorithm consists of two steps:

M-step

E-step

Learning Module Networks cont.

M-step

In the M-step, the procedure is given a partition of the genes into modules and learns the best regulation program (regression tree) for each module.

The regulation program is learned via a combinatorial search over the space of trees.

The tree is grown from the root to its leaves. At any given node, the query that best partitions the gene expression into two distinct distributions is chosen, until no such split exists.
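The choice of a single query at one tree node can be sketched as follows. This is a simplified illustration, not the scoring used in the original work: it assumes each query has the form "regulator expression above a threshold" and, as a stand-in for the Bayesian score, ranks queries by the reduction in the sum of squared errors. The names `sse` and `best_split` and the input layout are hypothetical.

```python
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(regulators, module_expr):
    """Return (gain, regulator, threshold) for the best binary query, or None.

    regulators:  {name: [expression level per experiment]}
    module_expr: [module expression per experiment], same experiment order
    """
    base = sse(module_expr)
    best = None
    for name, levels in regulators.items():
        # Every observed level except the largest is a candidate threshold,
        # so both sides of the split are always non-empty.
        for threshold in sorted(set(levels))[:-1]:
            low = [y for x, y in zip(levels, module_expr) if x <= threshold]
            high = [y for x, y in zip(levels, module_expr) if x > threshold]
            gain = base - sse(low) - sse(high)
            if gain > 0 and (best is None or gain > best[0]):
                best = (gain, name, threshold)
    return best
```

Growing the tree would repeat this choice on each of the two resulting groups of experiments and stop when `best_split` returns `None`, mirroring "until no such split exists".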

Learning Module Networks cont.

E-step

In the E-step, given the inferred regulation programs, we determine the module whose associated regulation program best predicts each gene’s behavior.

We test the probability of a gene’s measured expression values in the dataset under each regulation program, obtaining an overall probability that this gene’s expression profile was generated by this regulation program.

We then select the module whose program gives the gene’s expression profile the highest probability, and re-assign the gene to this module.

We take care not to assign a regulator gene to a module in which it is also a regulatory input.
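The reassignment step can be sketched as below. This is a minimal illustration under stated assumptions: each regulation program is reduced to a predicted mean per experiment plus a single standard deviation, the probability is Gaussian, and the regulator check only blocks a gene from a module that uses it directly as an input. The names `log_likelihood` and `reassign` and the input layout are hypothetical.

```python
import math


def log_likelihood(profile, means, sigma):
    """Gaussian log-probability of a profile given per-experiment predicted means."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - m) ** 2 / (2 * sigma ** 2)
        for x, m in zip(profile, means)
    )


def reassign(profiles, programs, inputs):
    """Assign each gene to the module whose program best explains its profile.

    profiles: {gene: [expression per experiment]}
    programs: {module: (predicted means per experiment, sigma)}
    inputs:   {module: set of regulator genes used by that module's program}
    """
    assignment = {}
    for gene, profile in profiles.items():
        # A regulator may not join a module in which it is a regulatory input.
        allowed = [m for m in programs if gene not in inputs[m]]
        assignment[gene] = max(
            allowed, key=lambda m: log_likelihood(profile, *programs[m])
        )
    return assignment
```

Working in log-probabilities avoids numerical underflow when the per-experiment probabilities are multiplied over many experiments.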

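Bayesian score

When the priors satisfy the assumptions listed under Assumptions, the Bayesian score decomposes into local module scores:

$$\mathrm{score}(S, A : D) = \sum_{j=1}^{K} \mathrm{score}_{M_j}\big(\mathrm{Pa}_{M_j}, A(X^j) : D\big)$$

where the local module score $\mathrm{score}_{M_j}$ is defined next.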

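Bayesian score cont.

$$\mathrm{score}_{M_j}(U, X : D) = \log \int L_j(U, X, \theta_{M_j \mid U} : D)\, P(\theta_{M_j \mid U} \mid S_j = U)\, d\theta_{M_j \mid U} + \log P(S_j = U) + \log P(A_j = X)$$

Here $L_j(U, X, \theta_{M_j \mid U} : D)$ is the likelihood function and $P(\theta_{M_j \mid U} \mid S_j = U)$ is the parameter prior.

$S_j = U$ denotes that we chose a structure where U are the parents of module Mj.

$A_j = X$ denotes that A is such that $X^j = X$, where $X^j$ is the set of variables assigned to module Mj.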

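Assumptions

Let P(A), P(S | A), and P(θ | S, A) be the assignment, structure, and parameter priors.

P(θ | S, A) satisfies parameter independence if

$$P(\theta \mid S, A) = \prod_{j=1}^{K} P(\theta_{M_j \mid \mathrm{Pa}_{M_j}} \mid S, A)$$

P(θ | S, A) satisfies parameter modularity if

$$P(\theta_{M_j \mid \mathrm{Pa}_{M_j}} \mid S_1, A) = P(\theta_{M_j \mid \mathrm{Pa}_{M_j}} \mid S_2, A)$$

for all structures S1 and S2 such that $\mathrm{Pa}^{S_1}_{M_j} = \mathrm{Pa}^{S_2}_{M_j}$.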

Assumptions

P(θ, S | A) satisfies assignment independence if P(θ | S, A) = P(θ | S) and P(S | A) = P(S).

P(S) satisfies structure modularity if

$$P(S) \propto \prod_{j} \rho_j(S_j)$$

where Sj denotes the choice of parents for module Mj, and ρj is a distribution over the possible parent sets for module Mj.

P(A) satisfies assignment modularity if

$$P(A) \propto \prod_{j} \alpha_j(A_j)$$

where Aj is the choice of variables assigned to module Mj, and {αj : j = 1, ..., K} is a family of functions from 2^X to the positive reals.

Assumptions - Explanations

Parameter independence, parameter modularity, and structure modularity are the natural analogues of standard assumptions in Bayesian network learning.

Parameter independence implies that P(θ | S, A) is a product of terms that parallels the decomposition of the likelihood, with one prior term per local likelihood term Lj.

Parameter modularity states that the prior for the parameters of a module Mj depends only on the choice of parents for Mj and not on other aspects of the structure.

Structure modularity implies that the prior over the structure S is a product of terms, one per module.

Assumptions - Explanations

The following two assumptions are new to module networks.

Assignment independence makes the priors on the parents and parameters of a module independent of the exact set of variables assigned to the module.

Assignment modularity implies that the prior on A is proportional to a product of local terms, one corresponding to each module.

Thus, the reassignment of one variable from one module Mi to another module Mj does not change our preferences on the assignment of variables in modules other than Mi and Mj.
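A short worked consequence may help here. Combining these assumptions with the decomposition of the Bayesian score into local module scores, and holding the structure S fixed, moving a single variable X from module Mi to module Mj should change only two local terms. In the sketch below, $X^i$ and $X^j$ denote the sets of variables assigned to the two modules before the move, and $A'$ denotes the assignment after the move; $A'$ and $\Delta$ are notation introduced only for this illustration.

```latex
\Delta = \mathrm{score}(S, A' : D) - \mathrm{score}(S, A : D)
       = \big[\mathrm{score}_{M_i}(\mathrm{Pa}_{M_i}, X^i \setminus \{X\} : D)
              - \mathrm{score}_{M_i}(\mathrm{Pa}_{M_i}, X^i : D)\big]
       + \big[\mathrm{score}_{M_j}(\mathrm{Pa}_{M_j}, X^j \cup \{X\} : D)
              - \mathrm{score}_{M_j}(\mathrm{Pa}_{M_j}, X^j : D)\big]
```

All other modules contribute identical terms before and after the move, so they cancel. This is what makes evaluating a single reassignment cheap during the search.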

Experiments

The network learning procedure was evaluated on synthetic data, gene expression data, and stock market data.

The data consisted solely of continuous values. As all of the variables have the same domain, the definition of the module set reduces simply to a specification of the total number of modules.

Beam search was used as the search algorithm, using a lookahead of three splits to evaluate each operator.

As a comparison, Bayesian networks were used with precisely the same structure learning algorithm, simply treating each variable as its own module.

Synthetic data

The synthetic data was generated by a known module network.

The generating model had 10 modules and a total of 35 variables that were a parent of some module. From the learned module network, 500 variables were selected, including the 35 parents.

This procedure was run for training sets of various sizes ranging from 25 instances to 500 instances, each repeated 10 times for different training sets.

Synthetic data - results

Generalization to unseen test data was evaluated by measuring the likelihood ascribed by the learned model to 4500 unseen instances.

As expected, models learned with larger training sets do better; but, when run using the correct number of 10 modules, the gain from increasing the number of data instances beyond 100 samples is small.

Models learned with a larger number of modules had a wider spread for the assignments of variables to modules and consequently achieved poor performance.

Synthetic data – results cont.

For all training set sizes, except 25, the model with 10 modules performs the best.

Log-likelihood per instance assigned to held-out data.

Synthetic data – results cont.

Models learned using 100, 200, or 500 instances and up to 50 modules assigned 80% of the variables to 10 modules.

Fraction of variables assigned to the largest 10 modules.

Synthetic data – results cont.

The total number of parent-child relationships in the generating model was 2250.

The procedure recovers 74% of the true relationships when learning from a dataset of 500 instances.

Average percentage of correct parent-child relationships recovered.

Synthetic data – results cont.

As the variables begin fragmenting over a large number of modules, the learned structure contains many spurious relationships.

Thus, in domains with a modular structure, statistical noise is likely to prevent overly detailed learned models such as Bayesian networks from extracting the commonality between different variables with a shared behavior.

Gene Expression Data

Expression data measuring the response of yeast to different stress conditions was used.

The data consists of 6157 genes and 173 experiments.

The 2355 genes that varied significantly in the data were selected, and a module network was learned over these genes.

A Bayesian network was also learned over this data set.

Candidate regulators

A set of 466 candidate regulators was compiled from SGD and YPD.

The set includes both transcription factors and signaling proteins that may have a transcriptional impact.

It also includes genes described as similar to such regulators.

Global regulators, whose regulation is not specific to a small set of genes or processes, were excluded.

Gene Expression results

The figure demonstrates that module networks generalize much better than Bayesian networks to unseen data for almost all choices of the number of modules.

Biological validity

The biological validity of the learned module network with 50 modules was tested.

The enriched annotations reflect the key biological processes expected in our dataset.

For example, the “protein folding” module contains 10 genes, 7 of which are annotated as protein folding genes. In the whole data set, there are only 26 genes with this annotation. Thus, the p-value of this annotation, that is, the probability of choosing 7 or more genes in this category by choosing 10 random genes, is less than 10^-12.

42 modules, out of 50, had at least one significantly enriched annotation with a p-value less than 0.005.
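The p-value in the protein folding example is a hypergeometric tail probability, which can be checked with a few lines of exact integer arithmetic. The sketch below assumes that "the whole data set" means the 2355 selected genes; the slide does not state the population size explicitly, and the function name `enrichment_p_value` is hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, drawn, hits):
    """P(at least `hits` annotated genes among `drawn` genes chosen at random)."""
    total = comb(population, drawn)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, drawn - k)
        for k in range(hits, min(annotated, drawn) + 1)
    )
    return tail / total


# Assumed population: the 2355 genes over which the module network was learned.
p = enrichment_p_value(population=2355, annotated=26, drawn=10, hits=7)
```

Under that assumption, `p` falls below 10^-12, in line with the figure quoted above. A larger population, such as all 6157 genes, would only make the value smaller.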

Biological validity cont.

The module is enriched for both the HAP4 motif and STRE, recognized by Hap4 and Msn4, respectively, which supports their inclusion in the module’s regulation program.

Lines represent 500 bp of genomic sequence located upstream of the start codon of each of the genes; colored boxes represent the presence of cis-regulatory motifs located in these regions.

Stock Market Data

The data consists of NASDAQ stock prices for 2143 companies, covering 273 trading days.

Each stock is a variable, and each trading day is an instance.

The value of the variable is the log of the ratio between that day’s and the previous day’s closing stock price.

As potential controllers, the 250 stocks (out of 2143) with the largest average trading volume across the dataset were selected.
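The variable definition above amounts to a daily log return. A minimal sketch, assuming a plain list of closing prices in chronological order (the function name `log_returns` is hypothetical):

```python
import math


def log_returns(closing_prices):
    """Log of the ratio between each day's close and the previous day's close."""
    return [
        math.log(today / yesterday)
        for yesterday, today in zip(closing_prices, closing_prices[1:])
    ]
```

A series of n closing prices yields n - 1 values, since the first day has no previous close to compare against.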

Stock Market Data

Cross validation is used to evaluate the generalization ability of different models.

Module networks perform significantly better than Bayesian networks in this domain.

Stock Market Data

Significant enrichment was found for 21 annotations, covering a wide variety of sectors.

In 20 of the 21 cases, the enrichment was far more significant in the modules learned using module networks than in those learned by AutoClass.

Module networks compared with AutoClass

Conclusions

The results show that learned module networks have much higher generalization performance than a Bayesian network learned from the same data.

Parameter sharing between variables in the same module allows each parameter to be estimated from a much larger sample. This allows us to learn dependencies that would be considered too weak based on the statistics of single variables. (These are well-known advantages of parameter sharing.)

An interesting aspect of the method is that it determines automatically which variables have shared parameters.

Conclusions

The assumption of shared structure significantly restricts the space of possible dependency structures, allowing us to learn more robust models than those learned in a classical Bayesian network setting.

In a module network, a spurious correlation would have to arise between a possible parent and a large number of other variables before the algorithm would introduce the dependency.

Overview of Module Networks

Literature

Reference: Discovering Regulatory Modules and their Condition Specific Regulators from Gene Expression Data. By: Eran Segal, Michal Shapira, Aviv Regev, Dana Pe’er, David Botstein, Daphne Koller & Nir Friedman.

Bibliography: P. Cheeseman, J. Kelly, M. Self, J. Stutz, W. Taylor, and D. Freeman. Autoclass: a Bayesian classification system. In ML ’88. 1988.

THE END