
cs726

Modeling regulatory networks in cells using Bayesian networks

Golan Yona

Department of Computer Science, Cornell University


Outline

• Regulatory networks
• Expression data
• Bayesian networks
– What
– Why
– How
• Learning networks from expression data
• Using Bayesian networks to analyze expression data (Friedman et al.)


Regulatory networks

[Figure: KEGG regulatory pathways]


Metabolic pathways

[Figure: KEGG metabolic pathways]


Expression Arrays

• Measure the expression levels of thousands of genes in a cell simultaneously, under specific conditions (e.g. cell cycle)

• Each cell has the same genomic data, but different subsets of proteins are expressed in different cells, and in the same cell under different conditions.

• Protein level is controlled by controlling
– transcription initiation
– mRNA transcription
– mRNA transport
– splicing
– post-translational modifications
– degradation of mRNA and proteins

• Microarrays measure the level of mRNA, thus providing indirect evidence for the control of protein levels

[Figure: micro spotting pin]

• Some genes are over-expressed (red) and some are under-expressed (green), measured with respect to a control group of genes (“fixed” genes)

• Different pathways are activated under different conditions
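The red/green reading above can be turned into a discrete value per gene. The sketch below is only an illustration: the function name `discretize`, the use of a log2 ratio against the control, and the threshold of 0.5 are assumptions, not something stated on the slides.

```python
import math


def discretize(sample_level, control_level, threshold=0.5):
    """Map an expression measurement to -1, 0 or 1 relative to a control.

    Assumption: the comparison is done on the log2 ratio, and the
    threshold of 0.5 is an arbitrary illustrative choice.
    """
    log_ratio = math.log2(sample_level / control_level)
    if log_ratio > threshold:
        return 1    # over-expressed (red)
    if log_ratio < -threshold:
        return -1   # under-expressed (green)
    return 0        # roughly unchanged
```

A gene measured at four times its control level maps to 1, and one measured at a quarter of it maps to -1.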


Goals

• Recover protein interactions and sub-networks that correspond to regulatory networks in the cell.

• Basic assumption: some genes are dependent on others while others exhibit independence or conditional independence

• The means: Bayesian networks, which can model the statistical dependencies between different variables (genes)

• Different from clustering analysis ..

• Applicable when the dependency between genes is “local”

• Problems: the data is noisy, partial and sometimes misleading (translation, activation); there is not enough of it to ensure statistically significant models; time scale


Bayesian Networks

• A compromise between the assumption of complete dependency and complete conditional independence (Naïve Bayes)

• Less constraining yet still tractable

• We know something about the statistical dependencies between features but not necessarily about the type of the underlying distributions


Example

[Figure: example Bayesian network with nodes Oil pressure in engine, Fan speed, Coolant temp., Engine temp., Oil temp., Air pressure in tire, Smoke]


Bayesian Networks

• Also called belief nets

• A graph description of dependencies and independencies between variables. Each node corresponds to a variable (gene).

• The graph is directed and acyclic

• The variables are discrete

• A variable can take on a value from a set of values $\{a_1, a_2, \ldots\}$, e.g. on/off

• Each value $a_i$ has a probability $P(a_i)$, with $\sum_i P(a_i) = 1$

• A link joining node A to node C is directional and represents the set of conditional probabilities $P(c_j \mid a_i)$, read as causality (e.g. the probability that C is on when A is off)

• The network is described in terms of its nodes and edges and the conditional probability distributions associated with each node.


Network structure

• For every node A
– The parents of A are the immediate predecessors of A
– The children of A are the immediate successors of A
– B is a descendant of A if there is a directed path from A to B

• Conditional probability

• Network assertions:
– The value of a variable depends on its parents
– A variable is conditionally independent of its non-descendants given its parents

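Conditional probability table for Coolant temp. given its parents (Engine temp., Fan speed):

| Coolant temp. | Eng cold, Fan fast | Eng cold, Fan slow | Eng hot, Fan fast | Eng hot, Fan slow |
|---|---|---|---|---|
| High | 0.1 | 0.4 | 0.6 | 0.9 |
| Low | 0.9 | 0.6 | 0.4 | 0.1 |

[Figure: network fragment with nodes Fan speed, Coolant temp., Engine temp., Oil temp., Smoke]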


Calculating the probability of an assignment

• The network describes the joint probability distribution of all variables (some are conditionally independent and some are not)

• Depends on the structure!

• The probability of a specific assignment of values $y_1, y_2, \ldots, y_n$ for the variables $Y_1, Y_2, \ldots, Y_n$:

$$P(y_1, y_2, \ldots, y_n) = \prod_{i=1}^{n} P(y_i \mid \mathrm{Parents}(Y_i))$$

This is the likelihood of the data given the model. All you need to know is ..
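The product over nodes can be sketched for a three-node fragment of the engine example. The coolant table uses the numbers from the earlier slide; the priors for engine temperature and fan speed are hypothetical values chosen only so that the example runs, and the names `PARENTS`, `CPT` and `joint_probability` are illustrative.

```python
# Parents of each variable in a small fragment of the engine example.
PARENTS = {"engine": (), "fan": (), "coolant": ("engine", "fan")}

# CPT[var][parent_values][value] = P(var = value | parents = parent_values)
CPT = {
    "engine": {(): {"cold": 0.7, "hot": 0.3}},    # hypothetical prior
    "fan": {(): {"fast": 0.5, "slow": 0.5}},      # hypothetical prior
    "coolant": {                                  # numbers from the slide
        ("cold", "fast"): {"high": 0.1, "low": 0.9},
        ("cold", "slow"): {"high": 0.4, "low": 0.6},
        ("hot", "fast"): {"high": 0.6, "low": 0.4},
        ("hot", "slow"): {"high": 0.9, "low": 0.1},
    },
}


def joint_probability(assignment, parents=PARENTS, cpt=CPT):
    """Multiply P(y_i | Parents(Y_i)) over all variables in the assignment."""
    prob = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[p] for p in parents[var])
        prob *= cpt[var][parent_values][value]
    return prob


p = joint_probability({"engine": "hot", "fan": "slow", "coolant": "high"})
```

With these assumed priors, `p` is the product 0.3 × 0.5 × 0.9. Only one table lookup per node is needed, which is what makes the factorized form tractable.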


Learning Bayesian network from data

• Given the data set with specific assignments for variables (on/off for each gene), how can we find the most probable network structure that explains the data (the best “match” to the data)?

• How to quantify a match?

• Note that there are two aspects of the network that we need to learn
– Structure (nodes, edges)
– Conditional probability distributions

• Common strategy: assign a score to each network G


• Pick the network that maximizes the score

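$$\mathrm{score}(G) = \log P(G \mid \mathrm{Data}) = \underbrace{\log P(\mathrm{Data} \mid G)}_{\text{likelihood}} + \underbrace{\log P(G)}_{\text{prior}} + \mathrm{const}$$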
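A short derivation, assuming the score is the log posterior of the structure G given the data, shows where the constant comes from. By Bayes' rule:

$$\log P(G \mid \mathrm{Data}) = \log \frac{P(\mathrm{Data} \mid G)\, P(G)}{P(\mathrm{Data})} = \log P(\mathrm{Data} \mid G) + \log P(G) - \log P(\mathrm{Data})$$

The last term does not depend on G, so the constant is $-\log P(\mathrm{Data})$ and it can be dropped when comparing networks.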


Learning

• The likelihood of the data given the model is estimated by averaging over all possible assignments of parameters (conditional probabilities) to G

• Summation over all possible assignments for conditional probabilities. The major contribution is from the set estimated from the data

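• Given a specific structure, for every node we look up its parents and calculate the empirical conditional probability distribution

$$P(\mathrm{Data} \mid G) = \int P(\mathrm{Data} \mid G, \Theta)\, P(\Theta \mid G)\, d\Theta$$

where $\Theta$ denotes the parameters (conditional probabilities) of G.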
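For discrete variables, the average over parameters has a closed form if each conditional distribution is given a Dirichlet prior. The sketch below assumes a symmetric Dirichlet(alpha) prior, which the slides do not specify; the function name, the `parents` and `arity` arguments, and the row-as-dict data layout are illustrative choices.

```python
import math
from collections import Counter


def log_marginal_likelihood(data, parents, arity, alpha=1.0):
    """log P(Data | G) with the conditional probabilities integrated out.

    data:    list of dicts, one dict per sample, mapping variable -> value
    parents: dict mapping each variable to a tuple of its parents
    arity:   dict mapping each variable to its number of possible values
    alpha:   pseudo-count of the assumed symmetric Dirichlet prior
    """
    total = 0.0
    for var, pars in parents.items():
        # Count how often each value occurs under each parent configuration.
        counts = Counter()
        for row in data:
            counts[(tuple(row[p] for p in pars), row[var])] += 1
        by_config = {}
        for (config, value), n in counts.items():
            by_config.setdefault(config, {})[value] = n
        r = arity[var]
        # Unseen configurations and unseen values contribute exactly zero.
        for value_counts in by_config.values():
            n_total = sum(value_counts.values())
            total += math.lgamma(r * alpha) - math.lgamma(r * alpha + n_total)
            for n in value_counts.values():
                total += math.lgamma(alpha + n) - math.lgamma(alpha)
    return total
```

Adding a log prior over structures to this value gives a score of the form described earlier. On data where one variable copies another, the structure with an edge between them scores higher than the empty structure.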


Model selection

• The second term (log prior) is a measure of the complexity of the model (through uncertainty)

• Occam's razor: entia non sunt multiplicanda praeter necessitatem (thou shalt not multiply entities)

• MDL principle

• In the papers discussed here the prior term is ignored


In search for the best network

• In theory: test different structures, calculate the probability of assignment to variables for each network structure, and output the network that maximizes the likelihood of the data given the network.

• Impossible in practice: the number of possible networks over n genes is $2^{\binom{n}{2}}$

• For the yeast genome with 6000 genes this is > $10^{5,000,000}$
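The size of the search space can be checked with a few lines of arithmetic. The sketch assumes the count on the slide is 2^(n(n-1)/2), i.e. one present/absent choice for every unordered pair of genes; the function name is illustrative.

```python
import math


def log10_num_graphs(n):
    """log10 of 2**(n*(n-1)/2), assuming one on/off choice per pair of nodes."""
    num_pairs = n * (n - 1) // 2
    return num_pairs * math.log10(2)


yeast_exponent = log10_num_graphs(6000)  # number of decimal digits, roughly
```

Under that assumption the exponent for 6000 genes is above five million, consistent with the bound quoted on the slide. Working in log space avoids building the huge integer itself.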


Possible solution

• Apply a heuristic local greedy search: start with a random network and locally improve it by testing perturbations of the original structure.

• Test one edge at a time by adding, removing or reversing the edge and testing its effect on the score. If the score improves, accept the change.
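The edge-by-edge search can be sketched as a first-improvement hill climb. This is a minimal sketch, not the procedure from the papers: the scoring function is passed in by the caller, the search starts from the empty network unless a start structure is given, and all function names are illustrative.

```python
import itertools


def copy_structure(parents):
    return {v: set(ps) for v, ps in parents.items()}


def has_cycle(parents):
    """Return True if the parent map contains a directed cycle."""
    state = {}  # node -> "visiting" or "done"

    def visit(node):
        if state.get(node) == "done":
            return False
        if state.get(node) == "visiting":
            return True
        state[node] = "visiting"
        if any(visit(p) for p in parents[node]):
            return True
        state[node] = "done"
        return False

    return any(visit(v) for v in parents)


def neighbors(parents):
    """Yield every structure one edge addition, removal or reversal away."""
    for u, v in itertools.permutations(parents, 2):
        if u in parents[v]:                 # edge u -> v exists
            removed = copy_structure(parents)
            removed[v].discard(u)
            yield removed
            flipped = copy_structure(removed)
            flipped[u].add(v)               # now v -> u
            yield flipped
        elif v not in parents[u]:           # no edge between u and v yet
            added = copy_structure(parents)
            added[v].add(u)
            yield added


def greedy_search(variables, score_fn, start=None):
    """Accept any single-edge change that improves the score; stop when none does."""
    current = {v: set() for v in variables} if start is None else copy_structure(start)
    current_score = score_fn(current)
    improved = True
    while improved:
        improved = False
        for candidate in neighbors(current):
            if has_cycle(candidate):
                continue                    # keep the graph acyclic
            candidate_score = score_fn(candidate)
            if candidate_score > current_score:
                current, current_score = candidate, candidate_score
                improved = True
                break
    return current, current_score
```

Because only strict improvements are accepted, the loop terminates, but it may stop at a local maximum. Restarting from several random structures is one common way to reduce that risk.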


How to learn from expression data

• Two types of features learned from multiple networks

• First: gene Y is in the Markov blanket of X (the two genes are involved in the same biological process, and no other gene mediates the dependence)

• Problem: unobserved variables can mediate the interaction

• Second: gene X is an ancestor of Y (based on all networks that are learned)
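Both feature types can be read off a learned structure. The sketch below works on a single directed acyclic graph stored as a map from each node to its set of parents, which is a simplification; the function names are illustrative.

```python
def markov_blanket(parents, x):
    """Parents of x, children of x, and the other parents of those children."""
    children = {v for v, ps in parents.items() if x in ps}
    co_parents = {p for child in children for p in parents[child]}
    return (set(parents[x]) | children | co_parents) - {x}


def ancestors(parents, y):
    """All nodes from which y can be reached along directed edges."""
    seen = set()
    stack = list(parents[y])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen


# Toy structure: A -> B, A -> C, D -> C
toy = {"A": set(), "B": {"A"}, "C": {"A", "D"}, "D": set()}
blanket_of_a = markov_blanket(toy, "A")
ancestors_of_c = ancestors(toy, "C")
```

In the toy structure, D enters the Markov blanket of A only because the two share the child C.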


Application to the Yeast Cell cycle data

• Expression level measurements for 6177 genes at different time points in six cell cycles, altogether 76 measurements for each gene

• Only 800 genes vary during the cell cycle, and 250 cluster into 8 fairly distinct classes.

• Networks are learned for the 800 genes

• Confidence values are based on the set of networks learned from different bootstrap sets
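The bootstrap confidence of a feature is the fraction of resampled data sets whose learned network contains it. In the sketch below the structure learner and the feature extractor are supplied by the caller; the names `learn_fn`, `features_fn` and `n_boot` are hypothetical, and the default of 100 resamples is an arbitrary choice.

```python
import random
from collections import Counter


def bootstrap_confidence(data, learn_fn, features_fn, n_boot=100, seed=0):
    """Estimate, for each feature, how often it appears across bootstrap networks.

    data:        list of samples
    learn_fn:    function mapping a list of samples to a learned network
    features_fn: function mapping a network to an iterable of hashable features
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_boot):
        # Resample with replacement, keeping the original sample size.
        resample = [rng.choice(data) for _ in data]
        network = learn_fn(resample)
        counts.update(set(features_fn(network)))
    return {feature: count / n_boot for feature, count in counts.items()}
```

A feature that shows up in every bootstrap network gets confidence 1.0; features that depend on a few unusual samples get lower values.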


Typical sub-network


Biological significance

• Order relations: there are a few dominant genes that appear before many others, e.g. genes that are involved in cell cycle control and initiation.

• Most are nuclear proteins, but there are also cytoplasm membrane proteins (budding and sporulation)

• Some DNA repair proteins (prerequisite for transcription)

• RSR1 – initiator of signal transduction cascades in the cell


Biological significance

• Markov connection: functionally related


• Most pairs have similar functions (verified sometimes through transitivity)

• Some are physically adjacent on the chromosome

• Some relations cannot be detected directly from expression data

• Detect conditional independence: a group of genes that are expressed similarly, but one is a parent of all the others and there are no connections between the others. The parent is then a control gene (e.g. CLN2, an early cell cycle control gene that controls RNR3, SVS1, SRO4 and RAD41, which are functionally unrelated).


Conclusions

• A powerful tool, but
– Not enough data
– Computational problems
– Learning algorithms
– Authors decompose networks into basic elements again

• Many possible extensions
