Bioinformatics and Graphical Models: Computation, approximation, and their value MSR: Nebojsa Jojic, Vladimir Jojic, Chris Meek, David Heckerman UW: Jim Mullins, Mark Jensen, Jerry Learn

Post on 15-Jan-2016

Page 1

Bioinformatics and Graphical Models: Computation, approximation, and their value

MSR: Nebojsa Jojic, Vladimir Jojic, Chris Meek, David Heckerman

UW: Jim Mullins, Mark Jensen, Jerry Learn

Page 2

Overview

• Computational cost of usual algorithms
– State of the art
– Phylogeny + alignment
– Phylogeny + sequence modeling
– Approximations and their pitfalls

• Recombination
– Analogy to other ML domains
– Graphical model
– Experiments and computational cost

• Value of the computation
– Potential applications
– Drug discovery cycle
– Value of time and clinical success
– Market size and growth

• Discussion

Page 3

Rational vaccine design (Jim Mullins et al.)

• Rational design
– Analysis of sequences to form a model of virus evolution (phylogenies, etc.)
– Develop vaccines that target as much variability as possible

• Traditional design
– Trial and error
– Educated guesses

Page 4

State-of-the-art sequence analysis programs

• Example:
– Rational AIDS vaccine design
– Analysis of the envelope gene from a single patient in one visit
– 200 sequences with 600 base pairs each
– Overnight to align
– 1-2 hours to 2-3 days to build a tree, depending on how much search you are willing to do
– This does not include modeling the inter-sequence dependencies or coupling alignment and tree search, and it ignores recombination

• The total length of the HIV genome is about 10,000 base pairs, and the number of samples is practically limited only by cost

Page 5

Computational cost of a slightly more detailed analysis

• Metropolis search over all trees on 400 sequences of the full genome (about 10,000 base pairs) would take around 2 years on one machine

• Exact search is intractable!
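To give a sense of why exact search over trees is out of reach, the sketch below counts tree topologies. It relies on the standard combinatorial result that n labeled leaves admit (2n-5)!! distinct unrooted binary trees; the function name `num_unrooted_trees` is made up for this illustration, and the slides do not say which tree space their search actually uses.

```python
def num_unrooted_trees(n_leaves: int) -> int:
    """Count unrooted binary tree topologies on n labeled leaves: (2n-5)!!."""
    if n_leaves < 3:
        return 1  # one or two leaves admit a single topology
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5.
    for k in range(3, 2 * n_leaves - 4, 2):
        count *= k
    return count


if __name__ == "__main__":
    # 200 and 400 are the sequence counts mentioned on the slides.
    for n in (4, 5, 10, 200, 400):
        digits = len(str(num_unrooted_trees(n)))
        print(f"{n} sequences: a tree count with {digits} decimal digits")
```

Even before alignment or recombination is considered, the count for a few hundred sequences has hundreds of digits, which is why sampling and approximate search are used in place of enumeration.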

Page 6

Approximation

• Free energy as an upper bound on the negative log-likelihood

• Computation and approximation of the free energy:
– Iterative conditional modes
– Mean-field method
– Structured variational techniques
– (Loopy) belief propagation
– Sampling techniques

• How tight is the bound?

• What does the looseness translate to?
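The bound can be checked numerically on a toy model. The sketch below assumes a single hidden variable h with three states and a made-up table of joint probabilities p(h, x) for one fixed observation x; none of these numbers come from the slides. It evaluates the free energy F(q) = sum_h q(h) [log q(h) - log p(h, x)] for several choices of q and compares it with -log p(x).

```python
import math


def free_energy(q, joint):
    """F(q) = sum_h q(h) * (log q(h) - log p(h, x)); terms with q(h) = 0 contribute 0."""
    return sum(qh * (math.log(qh) - math.log(ph))
               for qh, ph in zip(q, joint) if qh > 0)


# Hypothetical joint probabilities p(h, x) for h = 0, 1, 2 and one fixed x.
joint = [0.10, 0.03, 0.02]

evidence = sum(joint)                      # p(x)
neg_log_lik = -math.log(evidence)          # the quantity being bounded

posterior = [p / evidence for p in joint]  # exact q: bound is tight
uniform = [1 / 3, 1 / 3, 1 / 3]            # a crude approximate q
point = [1.0, 0.0, 0.0]                    # point estimate, in the spirit of ICM

if __name__ == "__main__":
    print("-log p(x)     :", neg_log_lik)
    for name, q in (("posterior", posterior), ("uniform", uniform), ("point", point)):
        f = free_energy(q, joint)
        print(f"F({name:9s}) : {f:.4f}  gap = {f - neg_log_lik:.4f}")
```

The gap F(q) - (-log p(x)) equals the KL divergence from q to the exact posterior, so it is zero only when q is exact. The approximation methods listed above differ in how restricted their family of q is, and therefore in how large this gap can remain.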

Page 7

An example of the approximation issues

Page 8

An example of the approximation issues

Page 9

An example of the approximation issues: Tightness of the bounds

Figure panels: Variational technique; Exact EM algorithm

Page 10

Recombination

• In HIV, the rate of recombination has recently been estimated to be ¼ of the rate of mutation!

• Combinatorial explosion in inference

Page 11

Similar situations in other domains where graphical models work well

• Occlusion in video

• Source interaction in audio

• Composition of images

Page 12

“Occlusion” in audio

Figure: M * Speaker1 + (1-M) * Speaker2 = mixture

Figure panels: Retrieved Speaker1; Retrieved Speaker2

Page 13

Epitome of an image

Input image

A set of image patches

Epitome

Page 14

Layers from a single photograph

Figure: graphical model with nodes e_m, e_s, s1, s2, M, x

Page 15

Modeling alignment and recombination by learning a library of gene patterns

Graphical model variables: s^j_{i-1}, s^j_i, s^j_{i+1}; c^j_{i-1}, c^j_i, c^j_{i+1}; x^j_{i-1}, x^j_i, x^j_{i+1}

• s: pattern position
• c = 1: copy letter (with possible mutation)
• c = 0: draw letter from a distribution unrelated to the patterns

Conditionals:

p(x^j_i | s^j_i = (1,2), c = 1) = f(x^j_i, r1(2)) = f(x^j_i, C)

p(x^j_i | s, c = 0) = g(x^j_i)

Example:

Patterns:

r1 = {ACTGTCAGT}
r2 = {ACGATC}

Observations and a hidden variable assignment:

s1 = {(1,1), (1,2), (1,3), (1,3), (1,3), (1,3), (1,4), (1,5), (1,6), (2,1), (2,2), (2,3)}
c1 = { 1 1 1 0 0 0 1 1 1 1 1 1 }
x1 = { A C T C A T G T A A C G }

s2 = {(2,1), (2,2), (2,3), (2,4), (2,5), (2,6), (1,4), (1,5), (1,6)}
c2 = { 1 1 1 1 1 1 1 1 1 }
x2 = { A C G A T C G T C }

Figure annotations: copy pattern 1, position 2 (letter C); insertion; mutation
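The example assignment on this slide can be checked mechanically. The sketch below takes the patterns r1 and r2 and the two (s, c, x) triples from the slide and labels each observed letter as a copy, a mutation, or an insertion. The emission functions f and g are not specified in the slides, so the log-likelihood part assumes a hypothetical form: f gives probability 1 - eps to a matching letter and eps/3 to each other letter, and g is uniform over the four bases. Only the emission terms p(x | s, c) are scored; priors and transitions on s and c are left out.

```python
import math

PATTERNS = {1: "ACTGTCAGT", 2: "ACGATC"}  # r1 and r2 from the slide

S1 = [(1, 1), (1, 2), (1, 3), (1, 3), (1, 3), (1, 3),
      (1, 4), (1, 5), (1, 6), (2, 1), (2, 2), (2, 3)]
C1 = [1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1]
X1 = "ACTCATGTAACG"

S2 = [(2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6), (1, 4), (1, 5), (1, 6)]
C2 = [1] * 9
X2 = "ACGATCGTC"


def label_positions(s, c, x):
    """Label each observed letter as 'copy', 'mutation', or 'insertion'."""
    labels = []
    for (k, pos), copy_flag, letter in zip(s, c, x):
        if copy_flag == 0:
            labels.append("insertion")            # letter drawn from g
        elif PATTERNS[k][pos - 1] == letter:      # positions are 1-based
            labels.append("copy")
        else:
            labels.append("mutation")             # copied with a change
    return labels


def emission_log_likelihood(s, c, x, eps=0.05):
    """Sum of log p(x_i | s_i, c_i) under the hypothetical f and g described above."""
    total = 0.0
    for (k, pos), copy_flag, letter in zip(s, c, x):
        if copy_flag == 0:
            total += math.log(0.25)               # assumed uniform g
        elif PATTERNS[k][pos - 1] == letter:
            total += math.log(1.0 - eps)          # assumed f: match
        else:
            total += math.log(eps / 3.0)          # assumed f: mismatch
    return total


if __name__ == "__main__":
    for name, (s, c, x) in {"x1": (S1, C1, X1), "x2": (S2, C2, X2)}.items():
        print(name, label_positions(s, c, x), emission_log_likelihood(s, c, x))
```

Under this reading, x1 copies r1, inserts three letters, mutates one copied letter, and then switches to r2, while x2 copies r2 and then switches to r1 with no changes. The switches between patterns are what the model uses to represent recombination.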

Page 16

Experimental results

Page 17

Value of computation (from Tufts Center)

Page 18

Growth

• Human viruses
– West Nile
– SARS
– Hepatitis C
– Polio
– …

• Animal viruses
– FIV
– Pig, chicken and cow viruses

• Most bacterial diseases

• Parasitic diseases

• The first sign of success of rational design might trigger a great increase in the number of diseases tackled

Page 19

How can MS/MSR be involved?

• MS: Architecture, platform, tools
– Storage, transmission, computation
– E.g., parallelizable computation on a single machine; peer-to-peer networks for parallel computation on multiple machines

• MSR:
– Helping to speed up the scientific progress leading to the new opportunities for growth
– Advising MS on the research direction in the community and the future requirements for the platform