guenomu software -- model and algorithm in 2013

guenomu Software and Model Leonardo de O. Martins University of Vigo May, 16th 2013 Leo Martins (U Vigo) guenomu software 2013/5/16 1 / 15

Upload: leonardo-de-oliveira-martins

Post on 03-Jul-2015

Category:

Science


DESCRIPTION

This is a progress report presented to the Phylogenomics Group at UVigo in May 2013 on the then-current status of the guenomu software and the Bayesian model it implements. At that time I was experimenting with a mixture model, which has since been abandoned, and with the Hdist, which is still experimental. The presentation also describes the exchange algorithm for doubly-intractable distributions, the generalized Multiple-Try Metropolis, and the parallel PRNG used to minimize communication between jobs.

TRANSCRIPT

Page 1: guenomu software -- model and algorithm in 2013

guenomu

Software and Model

Leonardo de O. Martins

University of Vigo

May 16th, 2013

Leo Martins (U Vigo) guenomu software 2013/5/16 1 / 15

Page 2: guenomu software -- model and algorithm in 2013

Outline

1 The Model

2 The Sampling

3 The Code

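Page 3: guenomu software -- model and algorithm in 2013

Hierarchical Bayesian model

$$P(S, \Theta \mid D) \propto P(\theta_0)\, P(\vec{\lambda}_0)\, P(\alpha_0)\, P(S) \times \prod_{i=1}^{N} P(D_i \mid G_i, \vec{\theta}_i)\, P(\vec{\theta}_i \mid \theta_0)\, P(G_i \mid \vec{\lambda}_i, \vec{w}_i, S)\, P(\vec{\lambda}_i \mid \vec{\lambda}_0)\, P(\vec{w}_i \mid \alpha_i)\, P(\alpha_i \mid \alpha_0)$$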

Page 4: guenomu software -- model and algorithm in 2013

The mixture of distance distributions

$$P(G \mid \vec{\lambda}, \vec{w}, S) = \frac{w_1 e^{-\left(d_{DUPS}(G,S)/\lambda_{DUPS} + d_{LOSS}(G,S)/\lambda_{LOSS}\right)} + w_2 e^{-d_{ILS}(G,S)/\lambda_{ILS}} + w_3 e^{-d_{RF}(G,S)/\lambda_{RF}}}{Z(\vec{\lambda}, \vec{w}, S)}$$

$$w_i \sim \mathrm{Gamma}(\alpha_{gene}, 1)$$

$$\lambda_x \sim \mathrm{Exp}(\Lambda_x)$$

each gene has its own set of $w_i$ and $\lambda_i$

the distances $d_x(G, S)$ are scaled to account for different gene family sizes
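
The numerator of the mixture can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `mixture_numerator`, the dictionary keys and the example values are hypothetical and not taken from the guenomu code, the distances are assumed to be already scaled, and the normalising constant Z is deliberately left out.

```python
import math


def mixture_numerator(d, lam, w):
    """Unnormalised mixture weight of a gene tree G given a species tree S.

    d   -- scaled distances d_x(G, S), keyed by 'DUPS', 'LOSS', 'ILS', 'RF'
    lam -- scale parameters lambda_x, same keys as d
    w   -- the three mixture weights (w1, w2, w3)
    """
    dup_loss = math.exp(-(d["DUPS"] / lam["DUPS"] + d["LOSS"] / lam["LOSS"]))
    ils = math.exp(-d["ILS"] / lam["ILS"])
    rf = math.exp(-d["RF"] / lam["RF"])
    return w[0] * dup_loss + w[1] * ils + w[2] * rf


# Hypothetical values, for illustration only.
lam = {"DUPS": 1.0, "LOSS": 2.0, "ILS": 1.5, "RF": 0.5}
w = (0.2, 0.3, 0.5)
identical = {"DUPS": 0.0, "LOSS": 0.0, "ILS": 0.0, "RF": 0.0}
distant = {"DUPS": 2.0, "LOSS": 3.0, "ILS": 1.0, "RF": 4.0}
print(mixture_numerator(identical, lam, w), mixture_numerator(distant, lam, w))
```

When all distances are zero every exponential equals one, so the value is simply the sum of the weights; larger distances can only lower it.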

Page 9: guenomu software -- model and algorithm in 2013

Outline

1 The Model

2 The Sampling

3 The Code

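Page 10: guenomu software -- model and algorithm in 2013

Doubly-intractable distributions

$$\pi(y \mid \theta) = \frac{q_\theta(y)}{Z(\theta)} = \frac{e^{\theta^{t} s(y)}}{Z(\theta)}; \qquad Z(\theta) = \sum_{y} e^{\theta^{t} s(y)} \tag{1}$$

augmented distribution: $\pi(\theta', y', \theta \mid y) \propto \pi(y \mid \theta)\, \pi(\theta)\, h(\theta' \mid \theta)\, \pi(y' \mid \theta')$

Gibbs update of the auxiliary variables $\theta'$, $y'$:

I. draw $\theta' \sim h(\cdot \mid \theta)$

II. draw $y' \sim \pi(\cdot \mid \theta')$

exchange ratio from $\theta$ to $\theta'$:

$$\min\left\{1, \frac{q_\theta(y')\, \pi(\theta')\, h(\theta \mid \theta')\, q_{\theta'}(y)}{q_\theta(y)\, \pi(\theta)\, h(\theta' \mid \theta)\, q_{\theta'}(y')}\right\} \tag{2}$$

We draw $y'$ (the gene tree) through a secondary MCMC starting at its current value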
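
The two Gibbs steps and the ratio in (2) can be tried out on a toy problem. The sketch below is not the guenomu implementation: it assumes a scalar θ with s(y) = y on a five-element state space, a standard normal prior, and a symmetric random-walk h so that the h terms cancel. The auxiliary y′ is drawn exactly by enumeration, where the slides use a secondary MCMC over gene trees. All function names are made up for the example.

```python
import math
import random


def log_q(theta, y):
    """Unnormalised log-density log q_theta(y) = theta * s(y), with s(y) = y."""
    return theta * y


def log_prior(theta):
    """Log of a standard normal prior on theta, up to a constant (toy assumption)."""
    return -0.5 * theta * theta


def sample_aux(theta, n_states, rng):
    """Draw y' ~ pi(. | theta) on {0, ..., n_states - 1} by enumeration."""
    weights = [math.exp(log_q(theta, y)) for y in range(n_states)]
    return rng.choices(range(n_states), weights=weights)[0]


def exchange_log_ratio(theta, theta_new, y_obs, y_aux):
    """Log of the ratio in Eq. (2) for a symmetric h (the h terms cancel)."""
    numerator = log_q(theta, y_aux) + log_prior(theta_new) + log_q(theta_new, y_obs)
    denominator = log_q(theta, y_obs) + log_prior(theta) + log_q(theta_new, y_aux)
    return numerator - denominator


def exchange_step(theta, y_obs, n_states, rng, step=0.5):
    """One exchange update of theta; returns (new_theta, accepted)."""
    theta_new = theta + rng.gauss(0.0, step)      # I.  theta' ~ h(. | theta)
    y_aux = sample_aux(theta_new, n_states, rng)  # II. y' ~ pi(. | theta')
    log_r = exchange_log_ratio(theta, theta_new, y_obs, y_aux)
    if rng.random() < math.exp(min(0.0, log_r)):
        return theta_new, True
    return theta, False


rng = random.Random(2013)
theta, accepted = 0.0, 0
for _ in range(2000):
    theta, ok = exchange_step(theta, y_obs=3, n_states=5, rng=rng)
    accepted += ok
print(f"final theta = {theta:.3f}, acceptance rate = {accepted / 2000:.2f}")
```

Note that Z(θ) never appears in `exchange_log_ratio`: the unnormalised q terms are evaluated at swapped arguments, which is what makes the normalising constants cancel.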

Page 16: guenomu software -- model and algorithm in 2013

Species tree proposal with the exchange algorithm

Page 20: guenomu software -- model and algorithm in 2013

Generalized Multiple-Try Metropolis

MH: sample $y$ and accept it with probability $\min(1, r)$

$$r = \frac{\pi(y)}{\pi(x)} \frac{q(y, x)}{q(x, y)} = \frac{\pi(y)}{\pi(x)} \frac{p(x \mid y)}{p(y \mid x)}$$

MTM: choose $y$ among several samples, according to their relative weights

$$r = \frac{w(y_1, x) + \cdots + w(y_k, x)}{w(x^*_1, y) + \cdots + w(x^*_k, y)}$$

where $w(x, y) = \pi(x)\, q(x, y)\, \lambda(x, y) = \pi(x)\, p(y \mid x)\, \lambda(x, y)$

GMTM: weights $w(\cdot)$ do not need to represent probability distributions.

$$r = \frac{\pi(y)\, p_k(x \mid y)}{\pi(x)\, p_k(y \mid x)} \frac{W_x}{W_y}$$

where $W_y = \dfrac{w_i(y_i, x)}{\sum_{j=1}^{k} w_j(y_j, x)}$ for the chosen element $i$
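
A minimal sketch of one MTM step follows. It assumes a symmetric Gaussian random-walk proposal q and the choice λ(x, y) = 1/q(x, y), so that the weight reduces to w(x, y) = π(x). The function name `mtm_step`, its parameters and the standard normal target are illustrative assumptions, not guenomu code; the GMTM variant with arbitrary weights is not shown.

```python
import math
import random


def mtm_step(x, log_pi, k, step, rng):
    """One Multiple-Try Metropolis update; returns (new_state, accepted)."""
    # Draw k candidates from q(x, .) and weight them by w(y_j, x) = pi(y_j).
    ys = [x + rng.gauss(0.0, step) for _ in range(k)]
    wy = [math.exp(log_pi(y)) for y in ys]
    if sum(wy) == 0.0:
        return x, False  # every candidate has zero weight: stay put
    # Choose y among the candidates according to their relative weights.
    y = rng.choices(ys, weights=wy)[0]
    # Reference set: k - 1 draws from q(y, .) plus the current state x.
    xs = [y + rng.gauss(0.0, step) for _ in range(k - 1)] + [x]
    wx = [math.exp(log_pi(v)) for v in xs]
    r = sum(wy) / sum(wx)
    if rng.random() < min(1.0, r):
        return y, True
    return x, False


def log_std_normal(v):
    """Log-density of a standard normal target, up to a constant."""
    return -0.5 * v * v


rng = random.Random(2013)
x, samples = 0.0, []
for _ in range(5000):
    x, _ = mtm_step(x, log_std_normal, k=5, step=2.0, rng=rng)
    samples.append(x)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"sample mean = {mean:.2f}, sample variance = {var:.2f}")
```

With k = 1 the reference set contains only x and the ratio becomes π(y)/π(x), which is the Metropolis ratio for a symmetric proposal.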

Page 23: guenomu software -- model and algorithm in 2013

gene tree proposal with GMTM or MTM

Page 26: guenomu software -- model and algorithm in 2013

Outline

1 The Model

2 The Sampling

3 The Code

Page 27: guenomu software -- model and algorithm in 2013

RF distance, Assignment cost (Hdist)

Page 30: guenomu software -- model and algorithm in 2013

A parallel pseudo-random number generator (PRNG)

Given a seed and an algorithm, we have a stream of PRNs.

[Figure: diagram with the labels seed, PRNG1, PRNG2 (four times), x1, x2, x3, x4, x11, x12]

Using a second algorithm, the first stream will give us a sequence of seeds. We use the 150 parameter sets for the Tausworthe (LFSR) generators (L'Ecuyer, Math. Comput. 1999, pp. 261). Therefore, given the seed, we can predict all states of all streams.
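
The idea of seeding several streams from one master stream can be sketched as follows. This illustration swaps in Python's `random.Random` (a Mersenne Twister) for both generators, so it does not use the Tausworthe parameter sets mentioned on the slide; `make_streams` is a hypothetical helper name.

```python
import random


def make_streams(seed, n_streams):
    """Derive n_streams generators deterministically from a single seed."""
    master = random.Random(seed)  # plays the role of the first generator
    sub_seeds = [master.getrandbits(64) for _ in range(n_streams)]
    return [random.Random(s) for s in sub_seeds]  # one generator per stream


# Two independent reconstructions from the same seed give the same streams.
streams_a = make_streams(seed=2013, n_streams=4)
streams_b = make_streams(seed=2013, n_streams=4)
draws_a = [[s.random() for _ in range(3)] for s in streams_a]
draws_b = [[s.random() for _ in range(3)] for s in streams_b]
print(draws_a == draws_b)
```

Because everything is derived from the one seed, any process that knows the seed can rebuild any stream without receiving its state from another process.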

Page 31: guenomu software -- model and algorithm in 2013

A parallel pseudo-random number generator (PRNG)

In our gene/species model:

[Figure: diagram with the labels seed, PRNG1, PRNG2 (four times), x1, x2, x3, x4, x11, x12]

we split gene families among jobs

all jobs receive the seed (broadcast) and therefore can reproduce the same x1. That's cheaper than communicating the states.

each job uses its own x(i+1) for sampling new gene trees etc. and can work in parallel. They use the common x1 for sampling e.g. a new species tree, which needs synchronization.

the only thing that must be shared is thus the proposal values (AllReduce) when updating "global" parameters, so that all jobs can make the same acceptance/rejection decision.
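
The synchronisation described above can be simulated in a single process. The sketch below uses no MPI: a plain `sum` stands in for the AllReduce, Python's `random.Random` stands in for the common stream, and `local_log_ratio` is a hypothetical callback for each job's contribution to the log acceptance ratio. It shows the mechanism only and is not how guenomu is written.

```python
import math
import random


def synchronized_update(seed, n_jobs, local_log_ratio):
    """Simulate one update of a global parameter across n_jobs jobs.

    local_log_ratio(job, proposal) returns the contribution of that job's
    gene families to the log acceptance ratio (hypothetical callback).
    """
    # Every job rebuilds the common stream from the broadcast seed.
    common = [random.Random(seed) for _ in range(n_jobs)]
    # Each job draws the proposal from its own copy of the common stream.
    proposals = [rng.gauss(0.0, 1.0) for rng in common]
    # Stand-in for AllReduce(sum): afterwards every job holds the same total.
    total = sum(local_log_ratio(job, proposals[job]) for job in range(n_jobs))
    # Each job draws the same uniform number and reaches the same decision.
    decisions = [rng.random() < math.exp(min(0.0, total)) for rng in common]
    return proposals, decisions


proposals, decisions = synchronized_update(
    seed=2013, n_jobs=4, local_log_ratio=lambda job, p: -0.1 * (job + 1) * p * p
)
print(len(set(proposals)) == 1, len(set(decisions)) == 1)
```

Only the summed log ratio crosses job boundaries here; the proposal and the uniform draw are reproduced locally from the shared seed.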

Page 35: guenomu software -- model and algorithm in 2013

Each job looks like an independent analysis

Page 36: guenomu software -- model and algorithm in 2013

https://bitbucket.org/leomrtns/guenomu