Incomplete Lineage Sorting: Consistent Phylogeny Estimation From Multiple Loci & a couple of unrelated observations Elchanan Mossel, UC Berkeley Joint work with: Sebastien Roch, Microsoft Research At Newton Institute Dec 07

Post on 03-Jan-2016

Page 1: Elchanan Mossel, UC Berkeley Joint work with: Sebastien Roch, Microsoft Research

Incomplete Lineage Sorting: Consistent Phylogeny Estimation From Multiple Loci

& a couple of unrelated observations

Elchanan Mossel, UC Berkeley

Joint work with: Sebastien Roch, Microsoft Research

At Newton Institute Dec 07

Page 2

Lecture Plan

• A simple observation about gene trees and population trees.
• A comment on: “optimal” and “absolutely converging” tree reconstruction.
• A comment on: “Generic models”.
• A comment on: “Network Reconstruction”.
• Disclaimer: Last talk, so a bit philosophical (but would be happy to provide hard technical proofs).

Page 3

Gene Trees and Population Trees

• Main goal in phylogenetics: recovering species/population histories.
• Data: current genes.
• Issue: In recent populations, gene trees may differ from population trees.
• Model for evolution of trees in populations: coalescence.
    • Fixed size population N.
    • Each individual chooses a random parent in the previous generation.
    • # generations = N × branch-length.

• Main Question: How to reconstruct population trees from gene trees?
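
The coalescence model on this slide can be sketched in a few lines. This is a minimal illustration, assuming a haploid population of fixed size in which each individual picks a uniformly random parent; the function names are hypothetical and not from the talk. Two lineages merge in a given generation with probability 1/N, so the average waiting time is about N generations, which matches the scaling "# generations = N × branch-length".

```python
import random


def coalescence_generations(pop_size, rng):
    """Count generations back until two lineages pick the same parent.

    Assumes a fixed-size haploid population in which every individual
    chooses its parent uniformly at random from the previous generation.
    """
    a, b = 0, 1  # two distinct lineages in the current generation
    generations = 0
    while a != b:
        a = rng.randrange(pop_size)  # parent of the first lineage
        b = rng.randrange(pop_size)  # parent of the second lineage
        generations += 1
    return generations


def mean_coalescence_time(pop_size, trials, seed=0):
    """Average coalescence time of two lineages over repeated simulations."""
    rng = random.Random(seed)
    total = sum(coalescence_generations(pop_size, rng) for _ in range(trials))
    return total / trials
```

Calling `mean_coalescence_time(50, 4000)` should give a value close to 50, since the waiting time is geometric with success probability 1/N.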

Page 4

Gene Trees: The Engineering Approach

• Two common “engineering” approaches:
• Approach 1: Assume all genes come from a single tree.
    • Kubatko-Degnan: Inconsistent.
• Approach 2: Build a tree for each gene on its own, then take the majority tree.
    • Degnan-Rosenberg: Inconsistent.

• Q: What should be done instead?

Page 5

Gene Trees: A Rigorous Approach

• M-Roch: A consistent estimator of the molecular distance $d(P_1,P_2)$ between two populations is:
    • $D(P_1,P_2) = \min \{ d_g(P_1,P_2) : g \in \text{Genes} \}$
• ⇒ distances between populations are identifiable.
• ⇒ tree is identifiable.
• Under standard coalescence assumptions, get a good rate:
    • $P(\text{topology error}) \le (\#\,\text{pops}) \times \exp(-c \cdot \#\,\text{genes})$
    • c = shortest branch length.
• Estimator can be “plugged in” into any distance-based method for reconstructing trees.
• In M-Roch, use NJ, but it works similarly for:
    • Short-quartets (ESSW)
    • Distorted metrics and forests (M)
    • etc.
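
The estimator D(P1,P2) above is an entrywise minimum over genes. The sketch below assumes each gene comes with a symmetric matrix of pairwise distances between populations (an assumed input format; the function name is hypothetical). The intuition is that gene lineages cannot coalesce more recently than the populations split, so every gene distance is at least the population distance and the minimum approaches it from above as genes are added.

```python
def population_distances(gene_matrices):
    """Entrywise minimum of per-gene distance matrices.

    gene_matrices: non-empty list of square matrices (lists of lists),
    one per gene, all indexed by the same populations.
    Returns the matrix D with D[i][j] = min over genes g of d_g(i, j).
    """
    n = len(gene_matrices[0])
    return [
        [min(matrix[i][j] for matrix in gene_matrices) for j in range(n)]
        for i in range(n)
    ]


# Two genes, three populations (made-up numbers for illustration only).
gene_1 = [[0, 3, 5], [3, 0, 5], [5, 5, 0]]
gene_2 = [[0, 2, 6], [2, 0, 4], [6, 4, 0]]
estimate = population_distances([gene_1, gene_2])
```

The resulting matrix can then be passed to any distance-based tree builder, such as neighbor joining.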

Page 6

Comments on Absolute Convergence

• Algorithmic paradigm: Want to reconstruct a tree on n species using sequence length L and running time T.

• “Absolute Convergence”: L = poly(n); T = poly(n).

• Q: Is this the best we can do?

Page 7

Resolution of Steel’s conjecture [M’04], [Daskalakis-M-Roch’06]

• Short branches: sequence length $L = c \log n$.
• Long branches: sequence length $L = n^C$.
• [Figure labels: ancestral reconstruction; phylogenetic reconstruction.]
• n = # species.
• Short branches := all branches < $l_c$. Long branches := all branches > $l_c$.
• $l_c$ depends on the mutation model but not on the tree, tree size, etc.

Page 8

The algorithmic challenge

• Conj: For short branches, if data is generated from the model:
    • ML identifies the correct tree using L = O(log n) samples
    • (best bound known is L = exp(O(n))).
• Conclusion: In order to “beat” ML, need algorithms with L = O(log n).
• Challenge: The constant in the O is important!
• Challenge: Deal with short/long branches (contract edges; output forest).
• Challenge: General mutation models (not just CFN, JC).
• Comment: Rigorous methods have a running time guarantee.
• Comment: For L = poly(n), know how to deal with all challenges:
    • ESSW
    • M’07 (forests; long edges).
    • Gronau et al. (short edges).

Page 9

On generic parameters

• From Rhodes’ talk: “Generic models are easier to identify”.
• Typically: generic parameters.
• How about generic trees?

Page 10

Mixtures and Phenomena in High Dims

• The Geometry of High Dimensions: “Almost every collection of k vectors is almost orthogonal in high enough dimension n”.
• M-Roch (in preparation): For every k, as n → ∞ the probability that a mixture of k trees on n leaves is identifiable goes to 1.
• Holds for most reasonable measures on the space of trees and most mutation models.
• Basic idea: In generic situations can (almost) cluster samples according to trees.
• Gives an efficient algorithm.
• Similar results hold for rates across sites.
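
The high-dimensional geometry statement can be checked numerically. The sketch below is only an illustration of that one remark, not of the M-Roch clustering algorithm; it assumes vectors with independent uniform ±1 entries, and the function names are hypothetical. For such vectors the cosine between two of them is their dot product divided by n, and it concentrates around 0 at scale 1/√n.

```python
import itertools
import random


def random_sign_vector(n, rng):
    """Vector of n independent uniform +1/-1 entries."""
    return [rng.choice((-1, 1)) for _ in range(n)]


def max_abs_cosine(vectors):
    """Largest |cosine| over all pairs of +1/-1 vectors of equal length.

    Every such vector has Euclidean norm sqrt(n), so the cosine of a
    pair is simply dot(x, y) / n.
    """
    n = len(vectors[0])
    worst = 0.0
    for x, y in itertools.combinations(vectors, 2):
        dot = sum(a * b for a, b in zip(x, y))
        worst = max(worst, abs(dot) / n)
    return worst


rng = random.Random(0)
k = 5
low_dim = max_abs_cosine([random_sign_vector(10, rng) for _ in range(k)])
high_dim = max_abs_cosine([random_sign_vector(4000, rng) for _ in range(k)])
```

With k fixed, `high_dim` should be much closer to 0 than a typical `low_dim`, which is the sense in which k random vectors become almost orthogonal as n grows.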

Page 11

A Comment on Dynamic Programming

• Q (Zhang): Given a tree, is it possible to find the most informative k species?
    • In terms of Parsimony?
    • In terms of ML?
• Note: If we know the Parsimony/ML score for the left/right sub-trees, we know it for the root.
• Q: Can we use dynamic programming?
• A: Yes, but with the right “data structure”.
• Information per node: a discrete version of the set of achievable distributions.
• Called “Density Evolution” in coding theory / spin-glass theory.
• Additive error = 1/poly(n).

[Figure: a node L with sub-trees L1 and L2.]
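
The note that the score at the root follows from the scores of the left and right sub-trees is the usual bottom-up recursion. As a minimal sketch, here is the standard small-parsimony dynamic program (Sankoff style, unit cost per change) for a single character. It is not the richer “density evolution” data structure the slide refers to for choosing k species; the tree encoding (nested tuples with leaf names as strings) and the function names are assumptions made for illustration.

```python
def sankoff(node, states, alphabet=(0, 1)):
    """Return {s: fewest changes in the sub-tree if `node` has state s}.

    node: a leaf name (str) or a tuple of child nodes.
    states: dict mapping leaf names to observed states.
    """
    if isinstance(node, str):
        return {s: 0 if states[node] == s else float("inf") for s in alphabet}
    cost = {s: 0 for s in alphabet}
    for child in node:
        child_cost = sankoff(child, states, alphabet)
        for s in alphabet:
            # Best state t for the child, paying 1 if the edge changes state.
            cost[s] += min(child_cost[t] + (s != t) for t in alphabet)
    return cost


def parsimony_score(tree, states, alphabet=(0, 1)):
    """Minimum number of state changes needed to explain the leaf states."""
    return min(sankoff(tree, states, alphabet).values())


leaf_states = {"A": 0, "B": 0, "C": 1, "D": 1}
score_ab_cd = parsimony_score((("A", "B"), ("C", "D")), leaf_states)
score_ac_bd = parsimony_score((("A", "C"), ("B", "D")), leaf_states)
```

Here the tree grouping A with B needs one change and the tree grouping A with C needs two, so the first tree is the more parsimonious explanation of this character.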

Page 12

Hardness of Distinguishing Network Models with Hidden Nodes

• Basic question: Is it possible to recover a network G from observation at a subset of the nodes?

• Easier question: Suppose we observe X1,…,Xr. Is it possible to determine if they come from nodes S in G1 or nodes T in G2?

• Problem: It may be that the two distributions are the same.

• Assume: The two distributions are different (large total variation distance)

• Q: Assuming the two distributions are different, how hard is it to tell whether the samples come from G1 or G2?

• Related question: What is a computational model of a biologist?

[Figure: networks G1 and G2.]

Page 13

The distinguishing problem for Trees

• Q: Assuming the two distributions are different, how hard is it to tell whether the samples come from T1 or T2?
• Note: For trees the problem is easy:
    • Perform a likelihood test.
    • Easy to do efficiently (peeling, pruning, dynamic programming).
    • # samples needed: poly(n).

[Figure: trees T1 and T2.]
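
The likelihood test for trees can be sketched with the pruning (peeling) recursion. This is a minimal illustration assuming the two-state symmetric (CFN) model with a uniform root and a known flip probability on every edge; the tree encoding (an internal node is a list of `(child, flip_probability)` pairs, a leaf is its name) and all function names are hypothetical.

```python
import itertools
import math


def partial_likelihoods(node, site):
    """[P(leaf data below node | node state 0), P(... | node state 1)]."""
    if isinstance(node, str):
        return [1.0 if site[node] == s else 0.0 for s in (0, 1)]
    result = [1.0, 1.0]
    for child, flip in node:
        below = partial_likelihoods(child, site)
        for s in (0, 1):
            # The child keeps state s with prob 1 - flip, else flips.
            result[s] *= (1 - flip) * below[s] + flip * below[1 - s]
    return result


def site_likelihood(tree, site):
    """Probability of one site pattern, with a uniform root state."""
    at_root = partial_likelihoods(tree, site)
    return 0.5 * at_root[0] + 0.5 * at_root[1]


def log_likelihood(tree, sites):
    return sum(math.log(site_likelihood(tree, site)) for site in sites)


def likelihood_test(tree1, tree2, sites):
    """Return 'T1' or 'T2', whichever gives the samples higher likelihood."""
    if log_likelihood(tree1, sites) >= log_likelihood(tree2, sites):
        return "T1"
    return "T2"


# Two quartet trees: T1 groups a with b, T2 groups a with c.
T1 = [([("a", 0.1), ("b", 0.1)], 0.2), ([("c", 0.1), ("d", 0.1)], 0.2)]
T2 = [([("a", 0.1), ("c", 0.1)], 0.2), ([("b", 0.1), ("d", 0.1)], 0.2)]
pattern = {"a": 0, "b": 0, "c": 1, "d": 1}
```

Each call to `site_likelihood` visits every edge once, so the test runs in time linear in the tree size per sample. The pattern above can be explained by one change on T1 but needs two on T2, so `likelihood_test(T1, T2, [pattern] * 5)` picks T1.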

Page 14

Two Models of a Biologist

• The Computationally Limited Biologist: Cannot solve hard computational problems; in particular, cannot sample from general G-distributions.

• The Computationally Unlimited Biologist: Can sample from any distribution.

• Related to the following problem: Can nature solve computationally hard problems?

From Shapiro at Weizmann

Page 15

Hardness Results

• The Computationally Limited Biologist (Bogdanov-M): Distinguishing problem can be solved efficiently iff NP=RP.
• The Computationally Unlimited Biologist (Bogdanov-M): The problem is at least zero-knowledge hard.
• Zero-Knowledge Problem: Can we decide whether samples from a computationally efficient distribution come from the uniform distribution?

• Related to cryptography.

[Figure: networks G1 and G2.]

Page 16

Reconstructing Networks

• Motivation: abundance of stochastic networks in biology, social networks, neuroscience, etc.
• Network defines a distribution as follows:
    • G = (V,E) = graph on [n] = {1,2,…,n}.
    • Distribution defined on $A^V$, where A is some finite set.
    • To each clique C in G, associate a function $\psi_C : A^C \to \mathbb{R}_+$ and:
    • $P[\sigma] = \prod_C \psi_C(\sigma_C)$
• Called Markov Random Field, Factorized Distribution, etc.
• Directed models are also common.
• Markov Property: If S separates A from B then $\sigma_A$ and $\sigma_B$ are conditionally independent given $\sigma_S$.

Page 17

Reconstructing Networks

• Task 1: Given samples of $\sigma$, find G.
• Task 2: Given samples of $\sigma$ restricted to a set S, find G.
• Will consider the problem when n is large and the maximum degree d is small.
• (Note that the specification of the model is of size max(n, exp(max |C|)).)

Page 18

Reconstructing Networks – A Trivial Algorithm

• Lower bound (Bresler-M-Sly): In order to recover G of max-deg d, need at least c d log n samples.
    • Pf follows by “counting # of networks”.
• Upper bound (Bresler-M-Sly): If the distribution is “non-degenerate”, c d log n samples suffice.
• Trivial Algorithm:
    • For each v ∈ V:
        • Enumerate over candidate neighborhoods N(v).
        • For each w ∈ V, check if v is independent of w given N(v).
• Non-Degeneracy: For every v and every w ∈ N(v) there exist two assignments $\sigma_1$ and $\sigma_2$ to N(v) that differ at w and: $d_{TV}(P(\sigma_v \mid \sigma_1), P(\sigma_v \mid \sigma_2)) \ge \epsilon$.
• For the soft-core model it suffices to have, for all $\psi = \psi_{u,v}$:
    • $\max_{a,b,c,d} |\psi(c,a) - \psi(d,a) + \psi(c,b) - \psi(d,b)| > \epsilon$
• Running time = $O(n^{d+1} \log n)$.
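
The trivial algorithm can be sketched as follows. This is a simplified illustration under stated assumptions, not the Bresler-M-Sly implementation: variables are binary, conditional independence is tested by comparing empirical conditional frequencies against a fixed threshold (an arbitrary choice here), and the smallest candidate set that passes is returned. All function names and parameters are hypothetical.

```python
import itertools
import random
from collections import defaultdict


def conditionally_independent(samples, v, w, cond, threshold, min_count=50):
    """Empirical test that x[v] is independent of x[w] given x[cond]."""
    # counts[assignment to cond][value of w] = [#samples, #samples with x[v] == 1]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for x in samples:
        cell = counts[tuple(x[u] for u in cond)][x[w]]
        cell[0] += 1
        cell[1] += x[v]
    for by_w in counts.values():
        freqs = [ones / total for total, ones in by_w.values() if total >= min_count]
        # For binary x[v], this gap is the total variation distance.
        if freqs and max(freqs) - min(freqs) > threshold:
            return False
    return True


def learn_neighborhoods(samples, n, max_degree, threshold=0.2):
    """For each v, find the smallest set U (|U| <= max_degree) separating v."""
    neighborhoods = {}
    for v in range(n):
        others = [u for u in range(n) if u != v]
        neighborhoods[v] = None
        for size in range(max_degree + 1):
            for cand in itertools.combinations(others, size):
                rest = [w for w in others if w not in cand]
                if all(conditionally_independent(samples, v, w, cand, threshold)
                       for w in rest):
                    neighborhoods[v] = set(cand)
                    break
            if neighborhoods[v] is not None:
                break
    return neighborhoods


def sample_chain(n, stay, rng):
    """One sample from a binary Markov chain 0 - 1 - ... - (n-1)."""
    x = [rng.randrange(2)]
    for _ in range(n - 1):
        x.append(x[-1] if rng.random() < stay else 1 - x[-1])
    return x
```

As a usage example, drawing 10,000 samples with `sample_chain(4, 0.8, random.Random(1))` and calling `learn_neighborhoods(samples, 4, max_degree=2)` should recover the path 0 - 1 - 2 - 3. The enumeration over candidate sets of size at most d is what gives the $n^{d+1}$ factor in the running time.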

Page 19

A Trivial Algorithm – Related Results

• Trivial Algorithm:
    • For each v ∈ V:
        • Enumerate over candidate neighborhoods N(v).
        • For each w ∈ V, check if v is independent of w given N(v).
• Related work:
    • Algorithm was suggested before.
    • Abbeel, D. Koller, A. Ng: without restrictions, learn a model whose KL distance from the generating model is small (no guarantee of obtaining the true model; in order to get O(1) KL distance need poly samples).
    • M. J. Wainwright, P. Ravikumar, J. D: Use L1 regularization to get the true model for Ising models; sampling complexity $O(d^5 \log n)$; no running time bounds.
    • Other related work: assuming a special form of potentials.

Page 20

Variants of the Trivial Algorithm

• If the graph has exponential decay of correlations:
    • $\mathrm{Corr}(u,v) \le \exp(-c \, d(u,v))$
    • Suffices to enumerate over N(v) among the w correlated with v.
    • Running time: $O(n^2 \log n + n f(d))$.
• Missing nodes: Suppose G is triangle free; then a variant of the algorithm can find one hidden node.
    • Idea (with M. Biskup’s help): Run the algorithm as if the node is not hidden.
• Noise: The algorithm tolerates small amounts of noise (statistical robustness).
• Q: What about higher amounts of noise?
• (From Bresler-M-Sly)

[Figure label: possible w’s.]

Page 21

Higher Noise & Non-Identifiable Example

• Bresler-M-Sly: Example of non-identifiability.
• Consider:
    • G1 = path of length 2,
    • G2 = triangle + noise.
• Assume Ising model with random interactions and random noise.
• Then with constant probability, cannot distinguish between the models.
• Ising: $P[\sigma] = \prod_{(u,v) \in E} \exp(\beta_{u,v} \, \sigma(u) \, \sigma(v))$
• Intuitive reason: the dimension of the distribution is 3 in both cases.

[Figure legend: separate markers for hidden nodes and observed nodes.]

Page 22

Thanks!!

Page 23

Thanks!!

• Sebastien Roch

•Costis Daskalakis

• Andrej Bogdanov

Page 24

Thanks!!

Fascinating workshop:

Principal Organiser: Professor Mike Steel (University of Canterbury, NZ)

Organisers: Professor Vincent Moulton (University of East Anglia) and Dr Katharina Huber (University of East Anglia)

Sponsored by: Allan Wilson Centre for Molecular Ecology and Evolution

As part of a great program:

Organisers: Professor V Moulton (East Anglia), Professor M Steel (Canterbury) and Professor D Huson (Tübingen)