a new nonparametric bayesian model for genetic recombination in open ancestral space

A New Nonparametric Bayesian Model for Genetic Recombination in Open Ancestral Space Presented by Chunping Wang Machine Learning Group, Duke University February 26, 2007 Paper by E. P. Xing and K-A. Sohn

A New Nonparametric Bayesian Model for Genetic Recombination in Open Ancestral Space

Presented by Chunping Wang

Machine Learning Group, Duke University

February 26, 2007

Paper by E. P. Xing and K-A. Sohn

Outline

• Terminology and Introduction

• DP Mixtures for Non-recombination Inheritance

• HMDP for Recombination

• Results

• Conclusions

Terminology and Introduction (1)

• Allele: a viable DNA coding on a chromosome – observation

• Locus: the location of an allele – index of an observation

• Haplotype: a sequence of alleles – data sequence

• Recombination: exchange of pieces between paired chromosomes – state transition

• Mutation: any change to a haplotype during inheritance – emission

Terminology and Introduction (2)

[Figure: ancestors and their descendants]

Terminology and Introduction (3)

Problems:

1. Ancestral inference: recovering ancestral haplotypes;

2. Recombination analysis: inferring the recombination hotspots;

3. Ancestral mapping: inferring the ancestral origin of each allele in each modern haplotype.

DP Mixtures for Non-recombination Inheritance (1)

Non-recombination:

• Only mutation may occur during inheritance;

• Each modern haplotype originates from a single ancestor.

This assumption holds only for haplotypes spanning a short region of a chromosome.

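DP Mixtures for Non-recombination Inheritance (2)

[Figure: graphical model with nodes $Q_0$, $Q$, $\phi_i$, $h_i$ and a plate of size $n$]

$h_i \mid \phi_i \sim P_h(\cdot \mid \phi_i)$

$\phi_i \mid Q \sim Q$

$Q \mid \tau, Q_0 \sim DP(\tau, Q_0)$

where $\phi_k^* = (a_k, \theta_k)$, $k = 1, \dots, K$, the distinct values of $\{\phi_i\}_{i=1}^{n}$, denote the $k$th ancestor together with the mutation parameter corresponding to the $k$th ancestor.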
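As a rough illustration of how a DP mixture lets the number of ancestors grow with the data, the sketch below draws ancestor labels for $n$ haplotypes from the Chinese restaurant process, a standard representation of the DP. The function name `crp_assignments`, the parameter name `tau` for the DP concentration, and the values used are assumptions made for this example; they are not taken from the slides.

```python
import random

def crp_assignments(n, tau, seed=0):
    """Draw ancestor labels for n haplotypes from a Chinese restaurant process.

    tau is the DP concentration parameter (name assumed for this sketch).
    """
    rng = random.Random(seed)
    labels = []   # labels[i] = ancestor index assigned to haplotype i
    counts = []   # counts[k] = number of haplotypes assigned to ancestor k
    for _ in range(n):
        # An existing ancestor k is chosen with weight counts[k],
        # a brand-new ancestor with weight tau.
        weights = counts + [tau]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)      # open a new ancestor
        else:
            counts[k] += 1
        labels.append(k)
    return labels, counts

labels, counts = crp_assignments(n=200, tau=1.0)
print(len(counts), "ancestors used by", len(labels), "haplotypes")
```

Larger values of `tau` tend to open more ancestors; the number of ancestors is an outcome of the draw rather than an input.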

DP Mixtures for Non-recombination Inheritance (3)

HMDP for Recombination (1)

For long haplotypes that may descend from multiple ancestors, we consider recombinations (state transitions across discrete spatial intervals).

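[Figure: graphical model of the hierarchical DP, with base $F$ and master DP $Q_0$ at the top, lower-level DPs $Q_1, Q_2, \dots, Q_j$, and for each $j$ the variables $\phi_{ji}$ and $h_{ji}$ in a plate of size $m_j$]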

HMDP for Recombination (2)

Each row of the HMM transition matrix is a DP. These DPs are linked by a top-level master DP and therefore share the same set of target states.

The mixing proportions of the $j$th lower-level DP are denoted $\pi_j = [\pi_{j,1}, \pi_{j,2}, \dots]$; the $j$th row of the transition matrix is then $\pi_j$.
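The way the rows of the transition matrix share one set of target states can be sketched with a finite truncation of the hierarchical DP. The code below is only an approximation under stated assumptions: the master weights `beta` come from truncated stick-breaking, and each row is drawn as a Dirichlet with parameters `alpha * beta`, which is what a DP with a discrete base measure over finitely many atoms reduces to. The names `gamma`, `alpha`, `K` and both function names are assumptions for this example and do not appear in the slides.

```python
import random

def stick_breaking(gamma, K, rng):
    """Truncated stick-breaking weights for the master DP (K components)."""
    beta, remaining = [], 1.0
    for _ in range(K - 1):
        v = rng.betavariate(1.0, gamma)
        beta.append(remaining * v)
        remaining *= 1.0 - v
    beta.append(remaining)   # the last component takes the leftover mass
    return beta

def transition_rows(beta, alpha, rng):
    """Draw one row pi_j ~ Dirichlet(alpha * beta) per state j."""
    rows = []
    for _ in beta:
        # A Dirichlet draw is a vector of normalised Gamma draws.
        # The small floor keeps the Gamma shape strictly positive.
        g = [rng.gammavariate(max(alpha * b, 1e-12), 1.0) for b in beta]
        total = sum(g)
        rows.append([x / total for x in g])
    return rows

rng = random.Random(1)
beta = stick_breaking(gamma=2.0, K=6, rng=rng)
pi = transition_rows(beta, alpha=5.0, rng=rng)
for j, row in enumerate(pi):
    print(j, [round(p, 3) for p in row])
```

Every row is centred on the same `beta`, so all rows place mass on the same states; `alpha` controls how closely each row follows the master weights.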

HMDP for Recombination (3)

[Figure labels: Modern haplotype; Ancestor haplotype]

The indicators $C_{i,t}$ of the $i$th modern haplotype, one for each locus $t$, specify the corresponding ancestral haplotype:

• $C_{i,t} = k$ for all $t$ when no recombination takes place during the inheritance process producing haplotype $H_i$;

• $C_{i,t} \neq C_{i,t+1}$ when a recombination occurs between loci $t$ and $t+1$.

HMDP for Recombination (4)

Introduce a Poisson point process to control the duration of non-recombinant inheritance (space-inhomogeneous):

$p(x \mid \lambda) = \frac{1}{x!} \lambda^{x} e^{-\lambda}$

Denote

x: the number of recombinations;

d: the physical distance between loci t and t+1;

r: the recombination rate per unit distance.

Then, with $\lambda = dr$,

$p(x > 0 \mid dr) = 1 - e^{-dr}$

$p(x = 0 \mid dr) = e^{-dr}$
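The two probabilities follow directly from the Poisson mass function evaluated at zero. A minimal numerical check is sketched below; the function names and the example values of `d` and `r` are assumptions chosen only for illustration.

```python
import math

def poisson_pmf(x, rate):
    """Poisson probability of exactly x events for the given rate."""
    return rate ** x * math.exp(-rate) / math.factorial(x)

def prob_no_recombination(d, r):
    """P(x = 0) when the Poisson rate is d * r."""
    return math.exp(-d * r)

def prob_recombination(d, r):
    """P(x > 0) = 1 - exp(-d * r), computed stably for small d * r."""
    return -math.expm1(-d * r)

d, r = 0.5, 0.2   # hypothetical distance and rate per unit distance
print(poisson_pmf(0, d * r), prob_no_recombination(d, r), prob_recombination(d, r))
```

For small `d * r` the recombination probability is close to `d * r` itself, which is why closely spaced loci rarely switch ancestor.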

HMDP for Recombination (5)

Combined with the standard stationary HMDP, this gives the non-stationary state-transition probability:

$p(C_{i,t+1} = k' \mid C_{i,t} = k) = e^{-dr}\,\delta(k, k') + (1 - e^{-dr})\,\pi_{k,k'}$

As $d$ or $r$ goes to infinity, $e^{-dr} \to 0$ and $1 - e^{-dr} \to 1$, so the inhomogeneous HMDP model reduces to a standard HMDP.
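The distance-dependent transition can be sketched for a finite row of mixing proportions. This is an illustrative sketch rather than the authors' implementation: with probability $e^{-dr}$ no recombination occurs and the ancestor is kept, otherwise the next ancestor is drawn from the row $\pi_k$. The function name `transition_row` and the example row are assumptions.

```python
import math

def transition_row(k, pi_k, d, r):
    """Non-stationary transition probabilities out of ancestor k.

    pi_k : stationary row of mixing proportions for state k (sums to 1)
    d, r : physical distance to the next locus and recombination rate
    """
    stay = math.exp(-d * r)   # probability of no recombination
    return [
        (1.0 - stay) * p + (stay if k_next == k else 0.0)
        for k_next, p in enumerate(pi_k)
    ]

pi_k = [0.2, 0.5, 0.3]        # hypothetical row for ancestor k = 1
print(transition_row(1, pi_k, d=1.0, r=0.7))
print(transition_row(1, pi_k, d=1e6, r=1.0))   # recovers pi_k
print(transition_row(1, pi_k, d=0.0, r=1.0))   # stays on ancestor 1
```

The two limiting calls show the behaviour described above: very distant loci follow the stationary row, while loci at zero distance never change ancestor.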

HMDP for Recombination (6)

Inference:

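The emission function is governed by a mutation parameter $\theta$, where $\theta \sim \mathrm{Beta}(\alpha_h, \beta_h)$.

The prior base: $F(A, \theta) = p(A)\,p(\theta)$, with $p(A)$ uniform.

Integrating over $\theta$ gives the marginal likelihood $p(h \mid a, c)$.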
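To make the integration step concrete, the sketch below computes a marginal likelihood under an assumed single-locus mutation model; the exact emission formula is not recoverable from this transcript, so the model here is an assumption. It supposes that a descendant allele equals the ancestral allele with probability $\theta$ and otherwise takes one of the remaining `n_alleles - 1` alleles uniformly. With $\theta \sim \mathrm{Beta}(\alpha_h, \beta_h)$, the integral over $\theta$ is a ratio of Beta functions. All function and parameter names are hypothetical.

```python
import math

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_marginal_likelihood(matches, mismatches, alpha_h, beta_h, n_alleles=2):
    """Log marginal likelihood with the mutation parameter integrated out.

    matches    : alleles identical to their assigned ancestral allele
    mismatches : alleles that differ from it (mutations)
    Assumed emission: theta**matches * ((1 - theta) / (n_alleles - 1))**mismatches
    """
    log_p = log_beta(alpha_h + matches, beta_h + mismatches) - log_beta(alpha_h, beta_h)
    log_p -= mismatches * math.log(n_alleles - 1)
    return log_p

# One observed allele that matches its ancestor, uniform Beta(1, 1) prior.
print(math.exp(log_marginal_likelihood(1, 0, 1.0, 1.0)))
```

Working with match and mismatch counts means the sampler never has to represent $\theta$ explicitly, which is the practical benefit of integrating it out.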

HMDP for Recombination (7)

Inference:

Two sampling stages:

1. Sample $\{C_{i,t}\}$ given all haplotypes $h$ and the most recently sampled ancestor pool $a$;

2. Sample every ancestor $A_k$ given all haplotypes $h$ and the current $\{C_{i,t}\}$.

Combining the HDP prior and the marginal likelihood, we can infer the posterior for $\{C_{i,t}\}$ and $\{A_{k,t}\}$, which are the variables of interest.

Results (1)

Simulated data:

30 populations, each with 200 haplotypes derived from K=5 ancestral haplotypes; T=100.

Compared models: HMDP and HMMs with K=3, 5 and 10

The average ancestor reconstruction errors for the five ancestors

Even the HMM given the true number of ancestors, K=5, does not beat the HMDP

Results (2)

[Figure: box plot of the empirical recombination rates; the vertical gray lines mark the pre-specified recombination hotspots; Threshold 1 and Threshold 2 are marked]

Results (3)

Population maps: 1. true map; 2. HMDP; 3-5. HMMs with K=3,5,10

Each vertical thin line – one modern haplotype;

Each color – one ancestral haplotype.

Measure for accuracy: the mean squared distance to the true map

Results (4)

Real haplotype data set 1: Daly data – single population

512 haplotypes; T=103

Bottom: empirical recombination rates

Upper vertical lines: recombination hotspots.

Red dotted lines: HMM; blue dashed lines: MDL; black solid lines: HMDP

Results (5)

A Gaussian mixture fitting of empirical recombination rates

Choose the threshold

Results (6)

Estimated population map

Each vertical thin line – one modern haplotype;

Each color – one ancestral haplotype.

Conclusions

• The HMDP model applies and extends the HDP to population genetics;

• The HDP allows the state space of the HMM to be infinite, which makes it suitable for inferring an unknown number of ancestral haplotypes;

• The HMDP model also allows the recombination rates to be non-stationary;

• The HMDP model can jointly infer a number of important genetic variables.