cs726 modeling regulatory networks in cells using bayesian networks golan yona department of...
Post on 20-Dec-2015
219 views
TRANSCRIPT
cs726
Modeling regulatory networks in cells using Bayesian networks
Golan Yona
Department of Computer ScienceCornell University
cs726
Outline
• Regulatory networks• Expression data• Bayesian Networks
• What• Why• How
• Learning networks from expression data• Using Bayesian networks to analyze expression data (Friedman
et al)
cs726
Regulatory networks
KEGG Regulatory
Pathways
cs726
cs726
Metabolic pathways
KEGG Metabolic
Pathways
cs726
cs726
Expression Arrays
• Measure the expression levels of thousands of genes in a cell under specific conditions (e.g. cell cycle) simultaneously
• Each cell has the same genomic data but different subsets of proteins are being expressed in different cells and at the same cell under different conditions.
• Protein level is controlled by controlling – transcription initiation– mRNA transcription– mRNA transport – splicing– post-translational modifications– degradation of mRNA and proteins.
• Microarray measure the level of mRNA, thus providing an indirect evidence for the control of protein levels
cs726
Micro Spotting pin
Micro Spotting pin
cs726
cs726
• Some are over-expressed (red), some under-expressed (green) measured with respect to a control group of genes (“fixed” genes)
• Different pathways are activated under different conditions
cs726
Goals
• Recover protein interactions and sub-networks that correspond to regulatory networks in the cell.
• Basic assumption: some genes are dependent on others while others exhibit independence or conditional independence
• The means: Bayesian networks. Capable of modeling the statistical dependencies between different variables (genes)
• Different from clustering analysis ..
• Applicable when the dependency between genes is “local”
• Problems: data is noisy, partial, sometimes misleading (translation, activation), not enough to ensure statistically significant models, time scale
cs726
Bayesian Networks
• A compromise between the assumption of complete dependency and complete conditional independence (Naïve Bayes)
• Less constraining yet still tractable• We know something about the statistical
dependencies between features but not necessarily about the type of the underlying distributions
cs726
Example
Oil pressureIn engine
Fan speed
Coolant temp.
Engine temp.
Oil temp.
Air pressure In tire
Smoke
cs726
Bayesian Networks
• Also called belief nets• A graph description of dependencies and independencies between
variables. Each node corresponds to a variable (gene).• The graph is directed and acyclic• The variables are discrete• A variable can take on a value from a set of values {a1,a2,…} e.g.
on/off• The probability of a specific value P(ai) and i P(ai) = 1• A link joining node A to node C is directional and represents the set of
conditional probabilities P(cj/ai) – causality (the probability that C is on when A is off)
• The network is described in terms of its nodes and edges and the conditional probability distributions associated with each node.
cs726
Network structure• For every node A
• The parents of A is the set of immediate predecessors of A
• The children of A is the set of immediate successors of A• B is a descendant of A if there is a directed path from A to B• Conditional probability
• Network assertions:• The value of a variable depends on its parents• A variable is conditionally independent of it non-descendants given its
parents
Eng cold, Eng cold, Eng hot, Eng hot
Fan fast Fan slow Fan fastFan slow
High 0.1 0.4 0.6 0.9
Low 0.9 0.6 0.4 0.1
Coo
lan
t te
mp
.
Parents
Fan speed
Coolant temp.
Engine temp.
Oil temp.
Smoke
cs726
Calculating the probability of an assignment
• The network describes the joint probability distribution of all variables (some conditionally independent and some are not)
• Depends on the structure!• The probability of a specific assignment of values
y1,y2,…yn for the variables Y1,Y2,…,Yn
This is the likelihood of the data given the model.All you need to know is..
n
i
iin YParentsyPyyyP1
21 ))(/(),...,,(
cs726
Learning Bayesian network from data
• Given the data set with specific assignments for variables (on/off for each gene), how can we find the most probable network structure that explains the data (the best “match” to the data)?
• How to quantify a match?
• Note that there are two aspects of the network that we need to learn– Structure (nodes, edges)
– Conditional probability distributions
• Common strategy: assign a score to each network G
cs726
• Common strategy: assign a score to each network G
• Pick the network that maximizes the score
constGPGDataP
DataGPGscore
)(log)/(log
)/(log)(
Likelihood prior
cs726
Learning
• The likelihood of the data given the model is estimated by averaging over all possible assignments of parameters (conditional probabilities) to G
• Summation over all possible assignments for conditional probabilities. The major contribution is from the set estimated from the data
• Given a specific structure, for every node we lookup its parents and calculate the empirical conditional probability distribution
dGPGDataPGDataP )/(),/()/(
cs726
Model selection
• The second term (log prior) is a measure for the complexity of the model (through uncertainty)
• Occam razor: entia non sunt multiplicanda praeter necessitatem(thou shall not multiply entities)
MDL principle
• In the papers discussed here it is being ignored
cs726
In search for the best network
• In theory: test different structures, calculate the probability of assignment to variables for each network structure, and output the network that maximizes the likelihood of the data given the network.
• Impossible in practice – the number of possible networks over n genes is
• For the yeast genome with 6000 genes this is > 105,000,000
22n
cs726
Possible solution
• Apply a heuristic local greedy search: Start with a random network and locally improve it, by testing perturbations over the original structure.
• Test one edge at a time, by adding, removing or reversing the edge, and testing its affect on the score. If the score improves - accept
cs726
How to learn from expression data
• Two types of features learned from multiple networks
• First - a gene Y is in the Markov blanket of X (two genes are involved in the same biological process. No other gene mediates the dependence)
• Problem of unobserved variables that can intermediate the interaction
• Second type – a gene X is ancestor of Y (based on all networks that are learned)
cs726
Application to the Yeast Cell cycle data
• Expression level measurements for 6177 genes along different time points in six cell cycles – altogether 76 measurements for each gene
• Only 800 genes vary during cell cycle and 250 cluster into 8 fairly distinct classes.
• Networks are learned for the 800 genes• Confidence values based on the set of networks
learned from different bootstrap sets
cs726
Typical sub-network
cs726
Biological significance
• Order relations: there are a few dominant genes that appear before many others, e.g. genes that are involved in cell cycle control and initiation.
cs726
• Most are nuclear proteins, but also cytoplasm membrane proteins (budding and sporulation)
• Some DNA repair proteins (prerequisite for transcription) • RSR1 – initiator of signal trunsduction cascades in the cell
cs726
Biological significance
• Markov connection: functionally related
cs726
• Most pairs have similar functions (verified sometimes through transitivity)
• Some are physically adjacent on the chromosome
• Some relations cannot be detected directly from expression data
• Detect conditional independence – group of genes that are expressed similarly, but one is a parent of all others and there are no connections between the others the parent is a control gene (e.g. CLN2 early cell cycle control gene, that controls RNR3, SVS1, SRO4 and RAD41 that are functionally unrelated).
cs726
Conclusions
• A powerful tool, but – not enough data– Computational problems– Learning algorithms– Authors decompose networks into basic
elements again
• Many possible extensions