Gene Expression Analysis Using Bayesian Networks


Post on 04-Feb-2016


Page 1: Gene Expression Analysis Using Bayesian Networks


Gene Expression Analysis Using Bayesian Networks

Éric Paquet

LBIT Université de Montréal

Page 2: Gene Expression Analysis Using Bayesian Networks

Biological basis

• DNA (storage of genetic information)
• RNA polymerase (copies DNA into RNA)
• mRNA (storage and transport of genetic information)
• Ribosome (translates genetic information into proteins)
• Proteins (expression of genetic information)

*-PDB file 1L3A, Transcriptional Regulator Pbf-2

Page 3: Gene Expression Analysis Using Bayesian Networks

Biological basis

How are proteins regulated? Example: the E. coli lactose operon.

Under normal conditions, E. coli uses glucose for energy. How does it react when there is no more glucose, only lactose?

Page 4: Gene Expression Analysis Using Bayesian Networks

Biological basis

[Figure: the lactose operon in the E. coli environment (glucose, lactose). Labels: RNA polymerase; lactose decomposer (β-galactosidase); lactose getter (permease).]

Polymerase action is blocked by a DNA lock: the protein associated with the gene lacI.

Page 5: Gene Expression Analysis Using Bayesian Networks

Biological basis

[Figure: the lactose operon in the E. coli environment (glucose, lactose). Labels: RNA polymerase; lactose; lactose decomposer (β-galactosidase); lactose getter (permease).]

Lactose recruits the protein associated with the gene lacI, unlocking the DNA, which is then accessible to the polymerase.

Page 6: Gene Expression Analysis Using Bayesian Networks

Biological basis

[Figure: regulatory links acting on the lactose decomposer (β-galactosidase) and the lactose getter (permease). Legend: = inhibit.]

Page 7: Gene Expression Analysis Using Bayesian Networks

Biological basis

[Figure: the lactose operon in the E. coli environment (glucose, lactose). Labels: RNA polymerase; CAP; c-AMP; lactose; lactose decomposer (β-galactosidase); lactose getter (permease).]

In the absence of glucose, a polymerase magnet binds to the DNA and accelerates production of the gene products that help lactose decomposition.

Page 8: Gene Expression Analysis Using Bayesian Networks

Biological basis

[Figure: regulatory links acting on the lactose decomposer (β-galactosidase) and the lactose getter (permease). Legend: = inhibit, = activate.]

Research goal: infer these links

Page 9: Gene Expression Analysis Using Bayesian Networks

Why?

• Get insights about cellular processes
• Help understand diseases
• Find drug targets

Page 10: Gene Expression Analysis Using Bayesian Networks

How?

Using gene expression data and tools for learning Bayesian networks

[Figure: gene expression data* (axes: experiments, [mRNA]) + tools for learning Bayesian networks. Labels: lactose decomposer (β-galactosidase); lactose getter (permease).]

*-Spellman et al. (1998) Mol Biol Cell 9:3273-97

Page 11: Gene Expression Analysis Using Bayesian Networks

What is gene expression data?

Data showing the concentration of a specific mRNA at a given time in the life of the cell.

[Figure: gene expression data* (axes: experiments, [mRNA]).]

• Each column is the result of one image.
• Each real value comes from one spot and tells whether the concentration of a specific mRNA is higher (+) or lower (-) than the normal value.

*-Spellman et al. (1998) Mol Biol Cell 9:3273-97

Page 12: Gene Expression Analysis Using Bayesian Networks

What is a Bayesian network?

A graphical representation of a joint distribution over a set of random variables.

[Figure: network with nodes A, B, C, D, E.]

P(A,B,C,D,E) = P(A)*P(B)*P(C|A)*P(D|A,B)*P(E|D)

Nodes represent gene expression while edges encode the interactions (e.g. inhibition, activation).
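The factorization on this slide can be checked numerically. The sketch below reads the parent sets off the formula (A → C, A → D, B → D, D → E) and multiplies the local conditional probabilities. The binary variables and every number in `P_ONE` are hypothetical values chosen for illustration; they are not from the presentation.

```python
from itertools import product

# Parents of each node, read off the factorization on the slide:
# P(A,B,C,D,E) = P(A) P(B) P(C|A) P(D|A,B) P(E|D)
PARENTS = {"A": (), "B": (), "C": ("A",), "D": ("A", "B"), "E": ("D",)}

# Hypothetical conditional probability tables for binary variables:
# P(node = 1 | parent values). These numbers are illustrative only.
P_ONE = {
    "A": {(): 0.3},
    "B": {(): 0.6},
    "C": {(0,): 0.1, (1,): 0.8},
    "D": {(0, 0): 0.05, (0, 1): 0.4, (1, 0): 0.5, (1, 1): 0.9},
    "E": {(0,): 0.2, (1,): 0.7},
}


def joint_probability(assignment):
    """Multiply one local factor per node, as in the factorization."""
    prob = 1.0
    for node, parents in PARENTS.items():
        parent_values = tuple(assignment[p] for p in parents)
        p_one = P_ONE[node][parent_values]
        prob *= p_one if assignment[node] == 1 else 1.0 - p_one
    return prob


def total_probability():
    """Sum the joint over all 2^5 assignments; a valid joint sums to 1."""
    nodes = list(PARENTS)
    return sum(
        joint_probability(dict(zip(nodes, values)))
        for values in product((0, 1), repeat=len(nodes))
    )
```

Calling `joint_probability({"A": 1, "B": 1, "C": 1, "D": 1, "E": 1})` multiplies 0.3, 0.6, 0.8, 0.9 and 0.7, one factor per node.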

Page 13: Gene Expression Analysis Using Bayesian Networks

A small problem with Bayesian networks

A Bayesian network must be a DAG (directed acyclic graph), but there are many examples of regulatory networks with directed cycles.

[Figure: example networks*: hysteretic oscillator, switch, transcription factor dimer.]

*-Husmeier D., Bioinformatics, Vol. 19 no. 17 2003, pages 2271–2282

Page 14: Gene Expression Analysis Using Bayesian Networks

How can we deal with that?

Using DBNs (dynamic Bayesian networks*) and sequential gene expression data.

We unfold the network in time.

[Figure: a network over A and B is unfolded into A1, B1 at time t and A2, B2 at time t+1.]

DBN = BN with constraints on parent and child nodes

*-Friedman, Murphy, Russell, Learning the Structure of Dynamic Probabilistic Networks
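A minimal sketch of why unfolding in time removes the cycle problem, assuming the simplest convention in which every edge X → Y of the static network becomes X at time t → Y at time t+1. The function names and the two-node feedback loop are illustrative assumptions, not the authors' code.

```python
def has_cycle(nodes, edges):
    """Depth-first search for a directed cycle."""
    children = {n: [] for n in nodes}
    for parent, child in edges:
        children[parent].append(child)
    state = {n: 0 for n in nodes}  # 0 = unvisited, 1 = in progress, 2 = done

    def visit(node):
        state[node] = 1
        for nxt in children[node]:
            # Reaching a node that is still in progress closes a cycle.
            if state[nxt] == 1 or (state[nxt] == 0 and visit(nxt)):
                return True
        state[node] = 2
        return False

    return any(state[n] == 0 and visit(n) for n in nodes)


def unfold(nodes, edges, n_slices=2):
    """Copy each node once per time slice; edge X -> Y becomes X_t -> Y_{t+1}."""
    unfolded_nodes = [f"{n}{t}" for t in range(1, n_slices + 1) for n in nodes]
    unfolded_edges = [
        (f"{p}{t}", f"{c}{t + 1}") for t in range(1, n_slices) for p, c in edges
    ]
    return unfolded_nodes, unfolded_edges


# A feedback loop A <-> B is not a DAG, but its unfolded version is.
nodes = ["A", "B"]
edges = [("A", "B"), ("B", "A")]
unfolded_nodes, unfolded_edges = unfold(nodes, edges)
```

Because every unfolded edge points forward in time, no directed cycle can form, whatever the static network looks like.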

Page 15: Gene Expression Analysis Using Bayesian Networks

What are we searching for?

A Bayesian network that is most probable given the data D (gene expression).

We find this BN as follows: BN* = argmax_BN {P(BN|D)}

Where:

$$P(BN \mid D) = \frac{P(D \mid BN)\, P(BN)}{P(D)}$$

• P(D|BN): marginal likelihood
• P(BN): prior on network structure
• P(D): data probability

Naïve approach to the problem: try all possible DAGs and keep the best one!

Page 16: Gene Expression Analysis Using Bayesian Networks

It is impossible to try all possible DAGs because the number of DAGs increases super-exponentially with the number of nodes:

n = 3 → 25 DAGs
n = 4 → 543 DAGs
n = 5 → 29281 DAGs
n = 6 → 3781503 DAGs
n = 7 → 1138779265 DAGs
n = 8 → 783702329343 DAGs
…

We are interested in problems with around 60 nodes…
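The counts on this slide can be reproduced with Robinson's recurrence for the number of labelled DAGs, a(n) = Σ_{k=1..n} (-1)^(k+1) C(n,k) 2^(k(n-k)) a(n-k), with a(0) = 1. The recurrence is a standard result that the slide does not state; the sketch below is one way to evaluate it.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labelled DAGs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


for n in range(3, 9):
    print(n, count_dags(n))
```

Python integers are arbitrary precision, so `count_dags(60)` can also be evaluated to see how far out of reach exhaustive search is.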

Page 17: Gene Expression Analysis Using Bayesian Networks

Learning Bayesian Networks from data?

Choose a network-space search method and a conditional distribution representation.

• Network-space search methods (basically add, remove and reverse edges)
  • Greedy hill-climbing
  • Beam search
  • Stochastic hill-climbing
  • Simulated annealing
  • MCMC simulation
• Conditional distribution representation
  • Linear Gaussian
  • Multinomial, binomial

[Figure: network with nodes A, B, C; P(a) = ?, P(b) = ?, P(c|a,b) = ?]
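All the search methods listed above move through network space with the same three operations: add, remove and reverse an edge. Here is a sketch of how the neighbours of a DAG could be enumerated, keeping only acyclic results. The function names and the edge-set representation are assumptions made for illustration.

```python
from itertools import permutations


def is_acyclic(nodes, edges):
    """Kahn's algorithm: a graph is a DAG iff all nodes can be removed in topological order."""
    indegree = {n: 0 for n in nodes}
    for _, child in edges:
        indegree[child] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for parent, child in edges:
            if parent == node:
                indegree[child] -= 1
                if indegree[child] == 0:
                    ready.append(child)
    return removed == len(nodes)


def neighbours(nodes, edges):
    """All DAGs reachable by adding, removing or reversing one edge."""
    edges = frozenset(edges)
    result = set()
    for parent, child in permutations(nodes, 2):
        if (parent, child) in edges:
            without = edges - {(parent, child)}
            result.add(without)  # remove: always stays acyclic
            flipped = without | {(child, parent)}
            if is_acyclic(nodes, flipped):
                result.add(flipped)  # reverse
        elif (child, parent) not in edges:
            added = edges | {(parent, child)}
            if is_acyclic(nodes, added):
                result.add(added)  # add
    return result
```

For the chain A → B → C, adding C → A would close a cycle, so only five of the six candidate moves are valid neighbours.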

Page 19: Gene Expression Analysis Using Bayesian Networks

We use three gene expression levels

1. Sort the values:
   -1.06 -0.12 0.18 0.21 1.16 1.19
2. Split the data into 3 equal buckets:
   level 0: -1.06 -0.12
   level 1: 0.18 0.21
   level 2: 1.16 1.19
3. Discretized data: 0 0 2 2 1 1
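The discretization steps above can be sketched as an equal-frequency binning. The slide shows only the sorted values and the discretized result, so the unsorted order in `raw` below is a hypothetical one chosen to be consistent with that result.

```python
def discretize_equal_frequency(values, n_levels=3):
    """Rank the values, then cut the ranking into n_levels equal-sized buckets."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    levels = [0] * len(values)
    for rank, index in enumerate(order):
        levels[index] = rank * n_levels // len(values)
    return levels


# Hypothetical measurement order (assumption); the values are those on the slide.
raw = [-1.06, -0.12, 1.16, 1.19, 0.18, 0.21]
levels = discretize_equal_frequency(raw)
```

With six values and three levels, each bucket receives exactly two values, so the two lowest map to 0, the two middle ones to 1 and the two highest to 2.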

Page 20: Gene Expression Analysis Using Bayesian Networks

Back to:

$$P(BN \mid D) = \frac{P(D \mid BN)\, P(BN)}{P(D)}$$

• P(D|BN): marginal likelihood
• P(BN): prior on network structure
• P(D): data probability

Page 21: Gene Expression Analysis Using Bayesian Networks

Insight on each term

P(BN) → prior on the network

• In our research, we always use a prior equal to 1.
• We could use it to incorporate knowledge. E.g.: we know that an edge is present. If the edge is in the BN, P(BN) = 1, else P(BN) = 0.
• Efforts are made to reduce the search space by using knowledge, e.g. limiting the number of parents or children.

Page 22: Gene Expression Analysis Using Bayesian Networks

Insight on each term

P(D|BN) → marginal likelihood

Easy to calculate using a multinomial distribution with a Dirichlet prior*:

$$P(d \mid bn) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(N_{ij})}{\Gamma(N_{ij} + M_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(a_{ijk} + s_{ijk})}{\Gamma(a_{ijk})}$$

*-Heckerman, A Tutorial on Learning With Bayesian Networks and Neapolitan, Learning Bayesian Networks
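A sketch of how this marginal likelihood could be evaluated in log space, following the usual reading of the notation in the cited references: for node i, a_ijk is the Dirichlet hyperparameter and s_ijk the observed count for state k under parent configuration j, with N_ij = Σ_k a_ijk and M_ij = Σ_k s_ijk. The function names and the nested-list layout of the counts are assumptions made for illustration.

```python
from math import exp, lgamma


def log_marginal_likelihood_node(prior_counts, data_counts):
    """Log of one node's factor: prior_counts[j][k] = a_ijk, data_counts[j][k] = s_ijk."""
    total = 0.0
    for a_j, s_j in zip(prior_counts, data_counts):
        n_ij = sum(a_j)  # N_ij
        m_ij = sum(s_j)  # M_ij
        total += lgamma(n_ij) - lgamma(n_ij + m_ij)
        for a, s in zip(a_j, s_j):
            total += lgamma(a + s) - lgamma(a)
    return total


def log_marginal_likelihood(network_counts):
    """Sum the per-node terms; network_counts is a list of (prior_counts, data_counts)."""
    return sum(log_marginal_likelihood_node(a, s) for a, s in network_counts)


# One binary node without parents (a single parent configuration),
# uniform prior a = (1, 1), and one observation of each state.
single_node = log_marginal_likelihood_node([[1, 1]], [[1, 1]])
print(exp(single_node))
```

Working with `lgamma` instead of the gamma function itself avoids overflow once the counts grow beyond a few hundred.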

Page 23: Gene Expression Analysis Using Bayesian Networks

MCMC (Markov Chain Monte Carlo) simulation

Markov chain part: zoom on a node of the chain

[Figure: a network over A, B and C and its neighbouring networks. Five transitions each have P(BNnew) = 1/5; one has P(BNnew) = 0.]

Page 24: Gene Expression Analysis Using Bayesian Networks

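MCMC (Markov Chain Monte Carlo) simulation

Monte Carlo part:

• Choose the next BN with probability P(BNnew).
• Accept the new BN with the following Metropolis–Hastings acceptance criterion:

$$
\begin{aligned}
P_{MH} &= \min\left\{1,\ \frac{P(BN_{new} \mid D)}{P(BN_{old} \mid D) * P(BN_{new})}\right\} \\
&= \min\left\{1,\ \frac{\dfrac{P(D \mid BN_{new})\, P(BN_{new})}{P(D)}}{\dfrac{P(D \mid BN_{old})\, P(BN_{old})}{P(D)} * P(BN_{new})}\right\} \\
&= \min\left\{1,\ \frac{P(D \mid BN_{new})\, P(BN_{new})}{P(D \mid BN_{old})\, P(BN_{old}) * P(BN_{new})}\right\}
\end{aligned}
$$

P(D) is gone!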
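A sketch of the acceptance step in log space. It uses the textbook Metropolis-Hastings ratio, which carries both proposal probabilities (old to new, and new to old); this is an assumption that goes beyond the formula written on the slide. Here `log_score` stands for log(P(D|BN) P(BN)), and all names are illustrative.

```python
import math
import random


def acceptance_probability(log_score_new, log_score_old,
                           q_new_given_old=1.0, q_old_given_new=1.0):
    """min(1, [score_new * q(old|new)] / [score_old * q(new|old)]), computed in logs."""
    log_ratio = (log_score_new - log_score_old
                 + math.log(q_old_given_new) - math.log(q_new_given_old))
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)


def accept(log_score_new, log_score_old, rng=random, **proposal):
    """Draw a uniform random number and compare it with the acceptance probability."""
    return rng.random() < acceptance_probability(
        log_score_new, log_score_old, **proposal)
```

Since the criterion only involves a ratio of posteriors, P(D) never has to be computed, which is the point made on the slide.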

Page 25: Gene Expression Analysis Using Bayesian Networks

Monte Carlo part, example:

1. Choose a random path. Each path has a P(BNnew) of 1/5.
2. Choose another random number. If it is smaller than the Metropolis–Hastings criterion, accept BNnew; otherwise return to BNold.

[Figure: a network over A, B and C and its neighbouring networks. Five transitions each have P(BNnew) = 1/5; one has P(BNnew) = 0.]

Page 26: Gene Expression Analysis Using Bayesian Networks

MCMC (Markov Chain Monte Carlo) simulation recap:

• Choose a starting BN at random.
• Burn-in phase: generate 5*N BNs from the MCMC without storing them.
• Storing phase: get 100*N BN structures from the MCMC.

[Figure: log(P(D|BN)P(BN)) plotted against iteration, with the burn-in phase and the storing phase marked.]
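The two-phase schedule in the recap can be sketched as a generic sampler. To keep the example short and runnable it walks over a toy state space (the integers 0 to 4 on a ring) with a symmetric proposal; in the presentation the states would be network structures and the score would be log(P(D|BN)P(BN)). The toy target, the function names and the choice N = 12 are assumptions made for illustration.

```python
import math
import random


def run_mcmc(start, propose, log_score, n_burn_in, n_store, seed=0):
    """Metropolis sampler (symmetric proposal): discard n_burn_in states, keep n_store."""
    rng = random.Random(seed)
    current, current_score = start, log_score(start)
    stored = []
    for step in range(n_burn_in + n_store):
        candidate = propose(current, rng)
        candidate_score = log_score(candidate)
        log_ratio = candidate_score - current_score
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current, current_score = candidate, candidate_score
        if step >= n_burn_in:  # storing phase
            stored.append(current)
    return stored


def propose_toy(state, rng):
    """Toy symmetric proposal: step left or right on a ring of 5 states."""
    return (state + rng.choice((-1, 1))) % 5


def log_score_toy(state):
    """Toy unnormalized log score, peaked at state 2."""
    return -abs(state - 2)


N = 12  # illustrative value
samples = run_mcmc(0, propose_toy, log_score_toy, n_burn_in=5 * N, n_store=100 * N)
```

A rejected candidate leaves the chain where it is, so the current state is stored again; this repetition is what makes the stored sample follow the target distribution.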

Page 27: Gene Expression Analysis Using Bayesian Networks

Why 100*N BNs and not only 1?

• Because we don't have enough data, and there are a lot of high-scoring networks.
• Instead, we associate a confidence with each edge. E.g.: how many times in the sample do we find an edge going from A to B?
• We can fix a threshold on the confidence and retrieve a global network built from the edges whose confidence is over the threshold.
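Edge confidence as described here is simply the fraction of sampled networks that contain the edge. A minimal sketch, assuming each sampled network is stored as a set of (parent, child) pairs; the four sampled networks and the 0.5 threshold are made-up illustration values.

```python
from collections import Counter


def edge_confidence(sampled_networks):
    """Fraction of sampled networks in which each directed edge appears."""
    counts = Counter(
        edge for network in sampled_networks for edge in set(network))
    n_samples = len(sampled_networks)
    return {edge: count / n_samples for edge, count in counts.items()}


def consensus_network(confidence, threshold):
    """Keep the edges whose confidence is over the threshold."""
    return {edge for edge, value in confidence.items() if value > threshold}


# Hypothetical sample of four networks over the genes A, B and C.
sample = [
    {("A", "B")},
    {("A", "B"), ("B", "C")},
    {("B", "A")},
    {("A", "B")},
]
confidence = edge_confidence(sample)
network = consensus_network(confidence, threshold=0.5)
```

Edges are directed, so A → B and B → A are counted separately.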

Page 28: Gene Expression Analysis Using Bayesian Networks

What we are working on:

Mixing both sequential and non-sequential data to retrieve interesting relations between genes.

How?

Using DBN and MCMC for sequential data + BN and MCMC for non-sequential data.

[Figure: 100*N networks from DBN + 100*N networks from BN → information tuner → learn network.]

Page 29: Gene Expression Analysis Using Bayesian Networks

How to test the approach:

• Problem: there is no way to test it on real data because there is no completely known network.
• Solution: work on a realistic simulation where we know the network structure.

Example:

[Figure: example network* with nodes 0 to 13 → simulate.]

*-Hartemink A., "Using Bayesian Network Inference Algorithms to Recover Molecular Genetic Regulatory Networks"

Page 30: Gene Expression Analysis Using Bayesian Networks

How to test the approach:

[Figure: the example network* (nodes 0 to 13) is simulated to produce sequential data and non-sequential data. DBN MCMC runs on the sequential data and BN MCMC on the non-sequential data; both feed the info tuner, which outputs a network over nodes 0 to 13.]

Compare using ROC curves.

*-Hartemink A., "Using Bayesian Network Inference Algorithms to Recover Molecular Genetic Regulatory Networks"

Page 31: Gene Expression Analysis Using Bayesian Networks

Test description:

• Generate 60 sequential data points.
• Generate 120 non-sequential data points (~ the proportion found in reality).
• Run DBN MCMC on the sequential data and keep 100*N sample networks.
• Run BN MCMC on the non-sequential data and keep 100*N sample networks.
• Test performance using weights on the samples:

0 BN, 1 DBN
0.05 BN, 0.95 DBN
…
0.95 BN, 0.05 DBN
1 BN, 0 DBN

The metric used is the area under the ROC curve. A perfect learner gets 1.0, a random one gets 0.5 and the worst one gets 0.
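The area under the ROC curve can be computed without drawing the curve: it equals the probability that a randomly chosen true edge receives a higher score than a randomly chosen false edge, with ties counting one half. A minimal sketch, assuming `scores` holds the edge confidences and `labels` marks which candidate edges are in the true network; both names are illustrative.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of true and false edges."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one true edge and one false edge")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))
```

A learner that ranks every true edge above every false edge scores 1.0, one that gives all edges the same confidence scores 0.5, and one that ranks them exactly backwards scores 0.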

Page 32: Gene Expression Analysis Using Bayesian Networks

Results:

[Figure: area under ROC curve (y-axis, 0 to 1) against the BN/DBN weighting (x-axis, starting at 0 BN, 1 DBN).]

Page 33: Gene Expression Analysis Using Bayesian Networks

Perspective:

• Working on more sophisticated ways to mix sequential and non-sequential data
• Working on real cases:
  • Yeast cell cycle
  • Arabidopsis thaliana circadian rhythm
• Real data also means missing values
  • Evaluate missing-value solutions (EM, KNNImpute)

Page 34: Gene Expression Analysis Using Bayesian Networks


Acknowledgements:

François Major

Page 35: Gene Expression Analysis Using Bayesian Networks

Why are there missing data?

• Low correlation
• Experimental problems

Page 36: Gene Expression Analysis Using Bayesian Networks

ROC Curve

Receiver Operating Characteristic curve

[Figure: ROC curve*.]

*-http://gim.unmc.edu/dxtests/roc2.htm

Page 37: Gene Expression Analysis Using Bayesian Networks

MCMC simulation and number of sampled networks

[Figure: ROC curve area as a function of the number of sampled networks from the MCMC simulation, for N=12. x-axis: # of samples from MCMC (500 to 5000); y-axis: ROC area (0.86 to 0.895).]