indel rates and probabilistic alignments

30
Indel rates and probabilistic alignments Gerton Lunter Budapest, June 2008

Upload: lulu

Post on 02-Feb-2016

37 views

Category:

Documents


0 download

DESCRIPTION

Indel rates and probabilistic alignments. Gerton Lunter Budapest, June 2008. Alignment accuracy. Simulation: Jukes-Cantor model Subs/indel rate = 7.5 Aligned with Viterbi + true model.  Observed FPF. Neutral model for indels. CGACATTAA--ATAGGCATAGCAGGACCAGATACCAGATCAAAGGCTTCAGGCGCA - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Indel rates and  probabilistic alignments

Indel rates and probabilistic alignments

Gerton Lunter

Budapest, June 2008

Page 2: Indel rates and  probabilistic alignments

Alignment accuracy

Simulation:Jukes-Cantor modelSubs/indel rate = 7.5Aligned with Viterbi + true model

Observed FPF

Page 3: Indel rates and  probabilistic alignments

Neutral model for indels

CGACATTAA--ATAGGCATAGCAGGACCAGATACCAGATCAAAGGCTTCAGGCGCACGACGTTAACGATTGGC---GCAGTATCAGATACCCGATCAAAG----CAGACGCA

Page 4: Indel rates and  probabilistic alignments

Neutral model for indels

• Look at inter-gap segments

Pr( length = L ) ?

CGACATTAA--ATAGGCATAGCAGGACCAGATACCAGATCAAAGGCTTCAGGCGCACGACGTTAACGATTGGC---GCAGTATCAGATACCCGATCAAAG----CAGACGCA

Page 5: Indel rates and  probabilistic alignments

Neutral model for indels

CGACATTAA--ATAGGCATAGCAGGACCAGATACCAGATCAAAGGCTTCAGGCGCACGACGTTAACGATTGGC---GCAGTATCAGATACCCGATCAAAG----CAGACGCA i i+1

• Look at inter-gap segments

Pr( length = L ) ?

Def: pi = Pr( column i+1 survived | column i survived)

Assumption: indels are independent of each other

Page 6: Indel rates and  probabilistic alignments

Neutral model for indels

CGACATTAA--ATAGGCATAGCAGGACCAGATACCAGATCAAAGGCTTCAGGCGCACGACGTTAACGATTGGC---GCAGTATCAGATACCCGATCAAAG----CAGACGCA i i+1

• Look at inter-gap segments

Pr( length = L ) pi pi+1 ... pi+L-2

Def: pi = Pr( column i+1 survived | column i survived)

Assumption: indels are independent of each other

Assumption: indels occur uniformly across the genome

Page 7: Indel rates and  probabilistic alignments

Neutral model for indels

CGACATTAA--ATAGGCATAGCAGGACCAGATACCAGATCAAAGGCTTCAGGCGCACGACGTTAACGATTGGC---GCAGTATCAGATACCCGATCAAAG----CAGACGCA i i+1

• Look at inter-gap segments

Pr( length = L ) pL

Def: pi = Pr( column i+1 survived | column i survived)

Assumption: indels are independent of each other

Assumption: indels occur uniformly across the genome

Prediction: Inter-gap distances follow a geometric distribution

Page 8: Indel rates and  probabilistic alignments

Inter-gap distances in alignments

Inter-gap distance (nucleotides)

Weighted regression: R2 > 0.9995

Log 1

0 co

unts

Transposable elements

+

Page 9: Indel rates and  probabilistic alignments

Inter-gap distances in alignments(simulation)

Page 10: Indel rates and  probabilistic alignments

Biases in alignments

A: gap wander (Holmes & Durbin, JCB 5 1998)

B,C: gap attractionD: gap annihilation

Page 11: Indel rates and  probabilistic alignments

Biases in alignments

Page 12: Indel rates and  probabilistic alignments

Influence of alignment parameters

• De-tuning of parameters away from “truth” does not improve alignments• Accuracy of parameters (within ~ factor 2) does not hurt alignments much

Page 13: Indel rates and  probabilistic alignments

Influence of model accuracy

Improved model (for mammalian genomic DNA):

• Better modelling of indel length distribution• Substitution model & indel rates depend on local GC content• Additional variation in local substitution rate

Parameters: BlastZ alignments of human and mouse

Page 14: Indel rates and  probabilistic alignments

Influence of model accuracy

Simulation:– 20 GC categories– 10 substitution rate categories– 100 sequences each = 20.000 sequences– Each ~800 nt, + 2x100 flanking sequence

Page 15: Indel rates and  probabilistic alignments

Summary so far

• Alignments are biased– Accuracy depends on position relative to gap– Fewer gaps than indels

• Alignments can be quite inaccurate– For 0.5 subs/site, 0.067 indels/site:

accuracy = 65%, false positives = 15%

• Choice of parameters does not matter much• Choice of MODEL does not matter much…

Page 16: Indel rates and  probabilistic alignments

Alignments: Best scoring path

A C C G T T C A C A A T G G A T

A

T

C

A

T

C

T

G

C

A

G

T

(Needleman-Wunsch, Smith-Waterman, Viterbi)

Page 17: Indel rates and  probabilistic alignments

Alignments: Posterior probabilities

A C C G T T C A C A A T G G A T

A

T

C

A

T

C

T

G

C

A

G

T

(Durbin, Eddy, Krogh, Mitchison 1998)

Page 18: Indel rates and  probabilistic alignments

Posterior probabilities

0

0.5

1

CT

TT

CT

AA

AA

CA

TG

AA

CC

GG

GG

GC

AC

AA

AC

CG

CC

CG

CG

GA

AA

GG

GG

TT

TT

AC

GT

AA

CG

TT

AA

GA

GG

GG

GT

GC

CC

C-A-

A-

TT

TC

TT

AC

GG

TT

GG

GT

CC

CA

AC

GT

GG

TT

TT

GG

GG

-A

TT

CC

GG

GG

GG

A-A-

TT

TG

CT

TT

CC

GG

CC

AA

TG

AA

AG

TT

AA

AT

GG

AA

0

0.5

1

AA

TT

TT

-T

-T

-T

-A

-G

-G

-T

-A

-G

GC

GG

GG

TT

GG

TT

GC

GG

AA

G-C-G-

TT

TG

TT

TT

TA

TT

TC

CC

CC

TG

GG

CC

AA

TA

TT

GG

TT

GC

CT

TT

CT

GG

AT

GC

AG

TA

GG

GG

AA

GG

TG

GG

-G

-G

-G

-T

-C

-C

-A

-G

-C

CC

AT

GC

AG

CC

AC

GG

CC

CA

GA

AG

CC

GG

TC

GG

GG

Page 19: Indel rates and  probabilistic alignments

Posteriors: Good predictors of accuracy

Page 20: Indel rates and  probabilistic alignments

Posterior decoding: better than Max Likelihood

Page 21: Indel rates and  probabilistic alignments

Posteriors & estimating indel rates

0

0.5

1

AA

TT

TT

-T

-T

-T

-A

-G

-G

-T

-A

-G

GC

GG

GG

TT

GG

TT

GC

GG

AA

G-C-G-

TT

TG

TT

TT

TA

TT

TC

CC

CC

TG

GG

CC

AA

TA

TT

GG

TT

GC

CT

TT

CT

GG

AT

GC

AG

TA

GG

GG

AA

GG

TG

GG

-G

-G

-G

-T

-C

-C

-A

-G

-C

CC

AT

GC

AG

CC

AC

GG

CC

CA

GA

AG

CC

GG

TC

GG

GG

The inter-gap histogram slope estimates the indelrate, and is not affected by gap attraction…

.. but is influenced by gap annihilation…

…leading to lower ‘asymptotic accuracy’…

…which cannot be observed – but posteriors can be…

…and they are identical in the mean:

Page 22: Indel rates and  probabilistic alignments

Indel rate estimators

Density: Alignment gaps per siteInter-gap: Slope of inter-gap histogramBW: Baum-Welch parameter estimateProb: Inter-gap histogram with posterior probability correction

Page 23: Indel rates and  probabilistic alignments

Human-mouse indel rate estimates

Inde

l rat

e

Page 24: Indel rates and  probabilistic alignments

Simulations: inferences are accurate

Inde

l rat

e

Page 25: Indel rates and  probabilistic alignments

Second summary

• Alignments are biased, and have errors

• Posterior accurately predicts local alignment quality

• Posterior decoding improves alignments, reduces biases

• With posterior decoding: modelling of indel lengths and sequence content improves alignments

• Indel rates (human-mouse) 60-100% higher than apparent from alignments

Page 26: Indel rates and  probabilistic alignments

Neutral indel model: Whole genome

Inter-gap distance (nucleotides) Inter-gap distance (nucleotides)

Lo

g 10 c

ou

nts

Transposable elements: Whole genome:

Page 27: Indel rates and  probabilistic alignments

Estimating fraction of sequence under purifying selection

Model: ● Genome is mixture of “conserved” and “neutral” sequence● “Conserved” sequence accepts no indel mutations● “Neutral” sequence accepts any indel mutation● Indels are point events (no spatial extent)

Account for “neutral overhang”:

Correction depends on level of clustering of conserved sequence:

– Low clustering: conserved segment is flanked by neutral overhangneutral contribution = 2 x average neutral distance between indels

– High clustering: indels “sample” neutral sequenceneutral contribution = 1 x average neutral distance between indels

Lower bound: ~79 Mb, or ~2.6 % Upper bound: ~100 Mb, or ~3.25 %

Page 28: Indel rates and  probabilistic alignments

How much of our genome is under purifying selection?

2.56 – 3.25% indel-conserved (79-100 Mb)+ + :

Divergence (subs/site)

Mb

5%

Page 29: Indel rates and  probabilistic alignments

Inferences are not biased by divergence

Inferred from data: Simulation (100 Mb conserved)

Page 30: Indel rates and  probabilistic alignments

Conclusions• Alignment is an inference problem; don’t ignore the uncertainties!

• Posterior decoding (heuristic) can be better than Viterbi (exact)

• Indel rates are high. Useful for identifying functional regions,since indels can be more disruptive of function than substitutions.

• Up to 10% of our genome may be functional, and a large proportion is rapidly turning over.