phasing and imputation

ESF Genomic Resources Summer School October 1-7, 2012, Pag Island, Croatia

Upload: gregor-gorjanc

Post on 30-Oct-2014

Page 1: Phasing and Imputation

ESF Genomic Resources Summer School

October 1-7, 2012, Pag Island, Croatia

Page 2: Phasing and Imputation

Day 3 – Practical workshop: Phasing and imputation

Gregor Gorjanc

Biotechnical Faculty, Department of Animal Science, University of Ljubljana, Slovenia

October 1-7, 2012, Pag Island, Croatia

Livestock Conservation Genomics: Data, Tools and Trends

Page 3: Phasing and Imputation

Background

• Introduced to phasing and imputation via genomic selection, i.e. genome-wide SNP data used for genomic prediction

• Collaboration with John Hickey

Page 4: Phasing and Imputation

Whole-genome sequence haplotypes (human trios) – Chr1

• Red – paternal grandfather
• Magenta – maternal grandfather
• Blue – paternal grandmother
• Green – maternal grandmother
• Scale – number of variants (1000 variants / Mb)

Roach et al (2011)

Page 5: Phasing and Imputation

Roach et al (2011)

• Blue – paternal
• Brown – maternal
• Dark – grandfather
• Light – grandmother

Page 6: Phasing and Imputation

*type

• Diplotype = combination of haplotypes

• Genotype = combination of alleles at a locus (over chromosomes)

• Haplotype = combination of alleles on a chromosome (over loci)

Haplotype 1: h_i,1 = (A-C-A-T)

Haplotype 2: h_i,2 = (G-C-A-G)

Genotype: g_i = (AG, CC, AA, TG)
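A minimal sketch of the relation between the two haplotypes and the genotype on this slide. The function name `genotype_from_haplotypes` is made up for illustration; it simply pairs the alleles locus by locus.

```python
def genotype_from_haplotypes(hap1, hap2):
    """Pair up the alleles of two haplotypes, locus by locus."""
    if len(hap1) != len(hap2):
        raise ValueError("haplotypes must cover the same loci")
    return [a + b for a, b in zip(hap1, hap2)]


# Haplotypes from the slide
h1 = ["A", "C", "A", "T"]
h2 = ["G", "C", "A", "G"]

g = genotype_from_haplotypes(h1, h2)
print(g)  # ['AG', 'CC', 'AA', 'TG']
```

Going in the other direction (from genotype to haplotypes) is the phasing problem: the genotype alone does not say which allele at each heterozygous locus sits on which haplotype.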

Page 7: Phasing and Imputation

Genotyping platforms

• We obtain genotypes separately for each locus, so haplotypes cannot be resolved directly

• With n SNP loci there are 2^n possible haplotypes

– 1 locus: 2^1 = 2 haplotypes (0, 1 alleles)

– 2 loci: 2^2 = 4 haplotypes (0-0, 0-1, 1-0, 1-1)

– 3 loci: 2^3 = 8 haplotypes (0-0-0, 0-0-1, 0-1-1, …)

– …

– 10 loci: 2^10 = 1024 haplotypes
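The counts above can be checked by brute-force enumeration. This is only an illustrative sketch; `all_haplotypes` is a hypothetical helper name, and alleles are coded 0/1 as on the slide.

```python
from itertools import product


def all_haplotypes(n_loci):
    """Enumerate every possible haplotype over n biallelic loci (alleles 0/1)."""
    return ["-".join(alleles) for alleles in product("01", repeat=n_loci)]


for n in (1, 2, 3, 10):
    print(n, "loci:", len(all_haplotypes(n)), "haplotypes")

print(all_haplotypes(2))  # ['0-0', '0-1', '1-0', '1-1']
```

The exponential growth is why enumeration is not an option for realistic numbers of SNP, and why the methods on the following slides are needed.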

Page 8: Phasing and Imputation

Classes of methods for phasing (=„haplotyping“)

• Molecular analysis: isolation of a single molecule and sequencing (not completely there yet)

• Population/statistical analysis: linkage disequilibrium (short regions)

• Genetic analysis: genetic principles (simple rules, linkage) (longer regions)

• Parsimony analysis (minimum number of recombinations)

• Pooled DNA

• …

Page 9: Phasing and Imputation

Methods & Programs

• Very active area of developments: several methods, programs, data formats, options, tricks, …

• fastPHASE
• Beagle
• MACH
• IMPUTE2
• FIMPUTE
• findhap.f90
• PHASEBOOK
• Alpha{Phase, Impute}
• LDMIP
• …

Some review papers:

• Liu et al (2009) Ann Rev Genomics Hum Genet, 10, 387-406
• Marchini & Howie (2010) Nat Rev Gen, 11, 499-511
• Browning & Browning (2011) Nat Rev Gen, 12, 703-714

Page 10: Phasing and Imputation

LD-based modelling: HMM

• Hidden Markov Model (HMM) – inferring haplotype cluster (or reference haplotype) membership locally along the chromosome (clustering each marker) allowing for block-like structure and gradual decline of LD with increasing distance

• Examples

– fastPHASE

– IMPUTE2

Scheet & Stephens (2006)
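To make the idea of "inferring reference haplotype membership locally along the chromosome" concrete, here is a deliberately simplified forward-backward sketch. It is not the fastPHASE or IMPUTE2 model: it assumes a haploid target, a fixed switch probability between adjacent markers and a fixed mismatch (error) probability. The function name and the default parameter values are assumptions for illustration only.

```python
def copying_posteriors(target, reference, switch=0.05, error=0.01):
    """Posterior probability, per marker, that `target` copies each reference haplotype.

    target    : list of 0/1 alleles, None for a missing marker
    reference : list of reference haplotypes (lists of 0/1 alleles)
    """
    k, n = len(reference), len(target)

    def emit(j, s):
        # Missing markers carry no information
        if target[s] is None:
            return 1.0
        return 1.0 - error if reference[j][s] == target[s] else error

    def normalise(values):
        total = sum(values)
        return [v / total for v in values]

    # Forward pass: stay on the same haplotype or jump to a random one
    fwd = [normalise([emit(j, 0) / k for j in range(k)])]
    for s in range(1, n):
        prev = fwd[-1]
        jump = switch * sum(prev) / k
        fwd.append(normalise(
            [emit(j, s) * ((1 - switch) * prev[j] + jump) for j in range(k)]
        ))

    # Backward pass
    bwd = [[1.0] * k for _ in range(n)]
    for s in range(n - 2, -1, -1):
        nxt = [emit(i, s + 1) * bwd[s + 1][i] for i in range(k)]
        jump = switch * sum(nxt) / k
        bwd[s] = normalise([(1 - switch) * nxt[j] + jump for j in range(k)])

    return [normalise([f * b for f, b in zip(fwd[s], bwd[s])]) for s in range(n)]


# Toy data: the target looks like reference 0 on the left and reference 1 on the right
reference = [[0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1]]
target = [0, 0, 0, 1, 1, 1]
post = copying_posteriors(target, reference)
for s, p in enumerate(post):
    print(s, [round(x, 3) for x in p])
```

The posterior membership changes along the chromosome, which is how such models capture block-like structure; a missing marker (None) would be filled in from the haplotype the target is most likely copying at that position.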

Page 11: Phasing and Imputation

Beagle model (also LD based)

• Locally variable number of clusters (haplotypes) - K

Browning & Browning (2010)

Page 12: Phasing and Imputation

Long-range phasing

• A fast rule based phasing method

• Basically a pedigree free linkage approach

• Completely unrelated animals contribute phasing information

• Even animals from a different breed can contribute

Kong et al (2008); Hickey et al (2011)

Page 13: Phasing and Imputation

What underlies a genotype? (coded as allele dosage)

Proband
Genotype: 10110122121110212101220022

Father
Genotype: 11110222111111111111121021

Mother
Genotype: 10111121211121212211221121

Page 14: Phasing and Imputation

What underlies a genotype? (coded as allele dosage)

Proband
Haplotypes: 10100111011100111001110011, 00010011110010101100110011
Genotype: 10110122121110212101220022

Father
Haplotypes: 01010011100011000110011010, 10100111011100111001110011
Genotype: 11110222111111111111121021

Mother
Haplotypes: 10101110101111111111111110, 00010011110010101100110011
Genotype: 10111121211121212211221121
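A short sketch of the allele-dosage coding used on this slide: the genotype at each locus is the sum of the 0/1 alleles on the two haplotypes. The strings are the proband's haplotypes and genotype from the slide; the function name `dosage` is an assumption.

```python
def dosage(hap_a, hap_b):
    """Genotype coded as allele dosage = sum of the two haplotypes, locus by locus."""
    return "".join(str(int(a) + int(b)) for a, b in zip(hap_a, hap_b))


proband_hap_1 = "10100111011100111001110011"
proband_hap_2 = "00010011110010101100110011"

proband_genotype = dosage(proband_hap_1, proband_hap_2)
print(proband_genotype)  # 10110122121110212101220022
```

A dosage of 0 or 2 is homozygous and tells us both alleles; a dosage of 1 is heterozygous and leaves the phase unknown.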

Page 15: Phasing and Imputation

Phasing a Trio

Proband
Haplotypes: 10100111011100111001110011, 00010011110010101100110011
Genotype: 10110122121110212101220022

Father
Haplotypes: 01010011100011000110011010, 10100111011100111001110011
Genotype: 11110122111111111111121021

Mother
Haplotypes: 10101110101111111111111110, 00010011110010101100110011
Genotype: 10111121211121212211221121

Cannot phase this locus!!
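A hedged sketch of simple rule-based trio phasing, using the genotypes from this slide. It assumes error-free genotypes with no Mendelian inconsistencies, and the assignment of the two parental genotypes to father and mother follows from which proband haplotype each parent carries on the slides. A heterozygous locus of the proband is phased only if at least one parent is homozygous; when all three are heterozygous the locus stays unphased ("?").

```python
def phase_trio(child, father, mother):
    """Return (paternal, maternal) haplotypes of the child; '?' where phase is unknown."""
    pat, mat = [], []
    for c, f, m in zip(child, father, mother):
        if c in "02":
            # Homozygous child: both haplotypes carry the same allele
            p = q = "1" if c == "2" else "0"
        elif f in "02":
            # Heterozygous child, homozygous father: father's allele is known
            p = "1" if f == "2" else "0"
            q = "0" if p == "1" else "1"
        elif m in "02":
            # Heterozygous child, homozygous mother: mother's allele is known
            q = "1" if m == "2" else "0"
            p = "0" if q == "1" else "1"
        else:
            # All three heterozygous: the trio alone cannot resolve the phase
            p = q = "?"
        pat.append(p)
        mat.append(q)
    return "".join(pat), "".join(mat)


proband = "10110122121110212101220022"
father = "11110122111111111111121021"
mother = "10111121211121212211221121"

pat, mat = phase_trio(proband, father, mother)
print("paternal:", pat)
print("maternal:", mat)
```

Loci that remain "?" are exactly the ones where additional information, such as the surrogate parents introduced on the next slides, is needed.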

Page 16: Phasing and Imputation

Surrogate parents are the drivers of long range phasing

Proband
Haplotypes: 10100111011100111001110011, 00010011110010101100110011
Genotype: 10110122121110212101220022

Father:
Pat Hap: 01010111100011000110011010
Mat Hap: 10100111011100111001110011
Genotype: 11110222111111111111121021
Proband G: 10110122121110212101220022
Opp Homo: **************************

Mother:
Pat Hap: 10101110101111111111111110
Mat Hap: 00010011110010101100110011
Genotype: 10111121211121212211221121
Proband G: 10110122121110212101220022
Opp Homo: **************************

Other:
Pat Hap: 10111110101011111100111110
Mat Hap: 10101110010110110000111110
Genotype: 20212220111121221100222220
Proband G: 10110122121110212101220022
Opp Homo: ****X**X**************XX*X

Not a surrogate parent!

Page 17: Phasing and Imputation

Surrogate parents are the drivers of long range phasing

Proband
Haplotypes: 10100111011100111001110011, 00010011110010101100110011
Genotype: 10110122121110212101220022

Father:
Pat Hap: 01010111100011000110011010
Mat Hap: 10100111011100111001110011
Genotype: 11110222111111111111121021
Proband G: 10110122121110212101220022
Opp Homo: **************************

Mother:
Pat Hap: 10101110101111111111111110
Mat Hap: 00010011110010101100110011
Genotype: 10111121211121212211221121
Proband G: 10110122121110212101220022
Opp Homo: **************************

Other:
Pat Hap: 10100111011100111001110011
Mat Hap: 10101110101010101011100110
Genotype: 20201221112110212012210121
Proband G: 10110122121110212101220022
Opp Homo: **************************

A surrogate parent! (Even without pedigree information)
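The "Opp Homo" rows on the two surrogate-parent slides can be reproduced with a few lines. This is a sketch assuming error-free genotypes: an individual is treated as a surrogate parent of the proband if the two never show opposing homozygotes (0 against 2). The function names are made up for illustration.

```python
def opposing_homozygotes(genotype, proband):
    """Mark loci where one individual is 0 and the other is 2 with 'X', others with '*'."""
    return "".join(
        "X" if {a, b} == {"0", "2"} else "*" for a, b in zip(genotype, proband)
    )


def is_surrogate_parent(genotype, proband):
    """No opposing homozygotes means the two can share a haplotype."""
    return "X" not in opposing_homozygotes(genotype, proband)


proband = "10110122121110212101220022"
other_a = "20212220111121221100222220"  # "Not a surrogate parent!"
other_b = "20201221112110212012210121"  # "A surrogate parent!"

print(opposing_homozygotes(other_a, proband))  # ****X**X**************XX*X
print(is_surrogate_parent(other_a, proband))   # False
print(is_surrogate_parent(other_b, proband))   # True
```

Note that the check uses genotypes only, which is why it works without pedigree information.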

Page 18: Phasing and Imputation

Phasing a Trio

Proband
Haplotypes: 10100111011100111001110011, 00010011110010101100110011
Genotype: 10110122121110212101220022

Mother
Haplotypes: 10101110101111111111111110, 00010011110010101100110011
Genotype: 10111121211121212211221121

Surrogate Father
Haplotypes: 10100111011100111001110011, 10101110101010101011100110
Genotype: 20201221112…

Can now phase this locus!

The surrogate father:

• Could be a female

• Could be a descendant

• Could be many generations distant

• Can be ‘unrelated’
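A sketch of how a surrogate parent adds phase information. It assumes that the surrogate shares one whole haplotype with the proband over the region and that genotypes are error-free; wherever the surrogate is homozygous, the allele on the shared haplotype is known. The surrogate genotype is the full string shown on the preceding slide (this slide shows it truncated), and the function name is hypothetical.

```python
def shared_haplotype(proband, surrogate):
    """Alleles of the haplotype shared with a surrogate parent; '?' where unknown."""
    shared = []
    for g, s in zip(proband, surrogate):
        if g in "02":
            # Proband homozygous: allele known regardless of the surrogate
            shared.append("1" if g == "2" else "0")
        elif s in "02":
            # Surrogate homozygous: its only allele must be on the shared haplotype
            shared.append("1" if s == "2" else "0")
        else:
            shared.append("?")
    return "".join(shared)


proband = "10110122121110212101220022"
surrogate_father = "20201221112110212012210121"

hap = shared_haplotype(proband, surrogate_father)
print(hap)
print("locus 6 allele on the shared haplotype:", hap[5])
```

Each additional surrogate that is homozygous at a still-unknown locus fills in another "?", which is why many surrogates together can phase long stretches.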

Page 19: Phasing and Imputation

Long Range Phasing

Proband
Haplotypes: 10100111011100111001110011, 00010011110010101100110011

Father
Haplotypes: 01010011100011000110011010, 10100111011100111001110011

Mother
Haplotypes: 10101110101111111111111110, 00010011110010101100110011

Erdös 1 Surrogate Fathers
10100111011100111001110011, 10101110101010101011100110
10100111011100111001110011, 10101010101000000000111110
10100111011100111001110011, 10101010101111000001100110

Erdös 1 Surrogate Mothers
00010011110010101100110011, 10101110010110110000111110
00010011110010101100110011, 10111110101011111100111110
00010011110010101100110011, 11111110111111110000111110

Erdös 2 Surrogate Mothers
10111110101011111100111110, 10101110010110110000111110
11111110111111110000111110, 10101010101111000001100110

Annotations on the figure:

• Surrogate giving phase information

• Potentially many meioses separating

• Erdös 1 surrogates are surrogates of the proband.

• Erdös n+1 surrogates are surrogates of Erdös n surrogates of the proband.

Kong et al (2008); Hickey et al (2011)
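The recursive definition of Erdös numbers can be read as a breadth-first search over the "is a surrogate of" relation. The sketch below uses small made-up genotypes and the simple no-opposing-homozygotes test; it is an illustration of the definition, not the procedure of any particular program.

```python
from collections import deque


def is_surrogate(a, b):
    """Two genotypes (allele dosage strings) with no opposing homozygotes."""
    return all({x, y} != {"0", "2"} for x, y in zip(a, b))


def erdos_numbers(genotypes, proband):
    """Erdös number of every individual reachable from the proband via surrogates."""
    numbers = {proband: 0}
    queue = deque([proband])
    while queue:
        current = queue.popleft()
        for other, genotype in genotypes.items():
            if other not in numbers and is_surrogate(genotypes[current], genotype):
                numbers[other] = numbers[current] + 1
                queue.append(other)
    return numbers


# Hypothetical toy genotypes, not taken from the slides
genotypes = {
    "proband": "0012",
    "A": "1111",  # no opposing homozygotes with the proband: Erdös 1
    "B": "2110",  # opposes the proband, but not A: Erdös 2
    "C": "0002",  # no opposing homozygotes with the proband: Erdös 1
}

print(erdos_numbers(genotypes, "proband"))
```

Individuals that are never reached have no chain of surrogates connecting them to the proband and therefore do not appear in the result.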

Page 20: Phasing and Imputation

Haplotype library imputation

• Build library of all completely phased haplotypes

• Find haplotypes in the library which can explain an individual's genotype

• Low error rates

• Computationally fast

• Useful for extremely large data sets – Strategic use

Haplotype library:

10100111011100111001110011

10100000010000100011110011

10110011001100111001110011

10100111011001001001110011

10100101011100111001110011

10100111001100111001110001

111110111011100111001110011

10100100000000111001110011

Page 21: Phasing and Imputation

Practical - Phasing

1. Get AlphaPhase distribution package (either)

a) https://sites.google.com/site/hickeyjohn/alphaphase

b) DropBox (software directory)

2. Unpack the distribution package to your working directory

Distribution

|-- Examples

| |-- PhasingWithoutPedigreeInformation

| | |-- GenotypeFormat

| | `-- UnorderedFormat

| |-- PhasingWithPedigreeInformation

| | |-- GenotypeFormat

| | `-- UnorderedFormat

| `-- SimulatedScenario

|-- LinuxExecutable

|-- MacOSXexecutable

`-- WindowsExecutable

Page 22: Phasing and Imputation

Practical – Phasing II

3. Open a terminal (Windows: Menu Start, Run, cmd) and change to the following directory (change ???)

cd ???/Distribution/Examples/PhasingWithPedigreeInformation/GenotypeFormat

4. Open the parameter file and change the number of SNP to 500

5. Run AlphaPhase (change OS if needed)

../../../WindowsExecutable/AlphaPhase.exe

6. Open the AlphaPhase manual and go to the Output section; read the meaning of the created output files

7. Check the phasing of the genotype for individual 1598

– Copy the first few values (say 20) of the maternal and paternal haplotypes to Excel and use Data, Text to columns, Space + merge delimiters

– Add the genotype of individual 1598 and check whether the sum of the haplotypes gives the genotype

Page 23: Phasing and Imputation

Practical – Phasing III

8. Who is the first (sequentially) Erdös 1 surrogate parent of individual 1598?

9. Find genotype of the above surrogate parent and copy the first few values into Excel and find which loci were informative for the phasing of genotype of individual 1598.

10. What was the phasing yield for individual 1598 and the overall phasing yield per core? Were there any SNP that had very low phasing yield?

11. How many haplotypes were found for the first core?

12. Is distribution of haplotype occurrence uniform?

13. We see „high repetition“ of some haplotypes in individuals 1591:1600. Why?

Page 24: Phasing and Imputation

Imputation – Why?

• Genotyping at high density is expensive

• Genotyping at low density is cheaper

• Imputation is free

• A genotype imputation method that allows us to get lots of genotyped and phenotyped animals at a low cost

Page 25: Phasing and Imputation

Imputation Empo_er_ng you_ wo_k (e.g. GW_S)

Marchini & Howie (2010)

Page 26: Phasing and Imputation

AlphaImpute vs. IMPUTE2

• AlphaImpute

– Uses pedigree and linkage information

• IMPUTE2

– Pedigree-free imputation which uses linkage disequilibrium

– Similar algorithm to Beagle / fastPHASE

• Data sets for comparison

– pig data set

– multiple breed cattle data set

Page 27: Phasing and Imputation

Pig data set

                      0.5k LD              2.5k LD              5k LD                7.5k LD
Category     Count    AlphaImpute IMPUTE2  AlphaImpute IMPUTE2  AlphaImpute IMPUTE2  AlphaImpute IMPUTE2
BothParents  51       0.98        0.77     0.99        0.92     1.00        0.96     1.00        0.96
SireMGS      62       0.93        0.80     0.98        0.92     0.99        0.94     0.99        0.96
DamPGS       47       0.96        0.79     0.98        0.92     0.99        0.95     0.99        0.96
Sire         45       0.89        0.78     0.97        0.92     0.99        0.95     0.99        0.97
Dam          13       0.90        0.76     0.96        0.89     0.98        0.93     0.98        0.95
Other        291      0.86        0.79     0.94        0.91     0.97        0.95     0.97        0.96

Correlation is the statistic that matters

Pedigree ~6500, SNP60K ~3200, SNPLD ~500

Page 28: Phasing and Imputation

Cattle data set

                      0.5k LD              2.5k LD              5k LD                7.5k LD
Category     Count    AlphaImpute IMPUTE2  AlphaImpute IMPUTE2  AlphaImpute IMPUTE2  AlphaImpute IMPUTE2
BothParents  28       0.97        0.64     0.99        0.92     0.99        0.94     1.00        0.95
SireMGS      224      0.87        0.60     0.97        0.91     0.98        0.95     0.99        0.96
DamPGS       7        0.92        0.63     0.97        0.87     0.98        0.91     0.98        0.95
Sire         144      0.86        0.60     0.96        0.90     0.98        0.95     0.98        0.96
Dam          4        0.95        0.63     0.98        0.90     0.99        0.97     0.99        0.95
Other        219      0.84        0.58     0.94        0.90     0.96        0.95     0.97        0.96

Correlation is the statistic that matters

Pedigree ~24000, SNP60K ~4400, SNPLD ~600

Page 29: Phasing and Imputation

Practical - Imputation

1. We have LD genotype for core 1 on a full-sib of individual 1600

1601: 1?????????????????????????????0???????????????????????1?????1??????????????????2???????????????1???2

2. Write down pedigree of 1601 and find out haplotype codes of parents for core 1.

3. Find haplotypes (01111000…) of parents and setup the table for inspection of data and imputation (see next slide).

4. What do the existing LD genotypes of individual 1601 show?

5. Impute missing genotypes.

Page 30: Phasing and Imputation

Parental haplotypes

100 0 0 127 19 0011001001001100000001101000000100000111101110111100101000001000100000000110110110100001001100001001*

100 0 0 20 49 *1001001100100100000011001000000100000100000110111110111000001111000001011100010010100011110001000110

200 0 0 13 34 0110110001001100010001001000000111100111100110111100101000001000100011010000101110100000001100001001

200 0 0 80 17 *0010100010110011110001001000000100011011100001000101000011100010001101110001010110100000010000011001*

LD genotype

1601 100 200 / / 1?????????????????????????????0???????????????????????1?????1??????????????????2???????????????1???2

1. The first genotype suggests that 1601 inherited (part of) haplotype 20 from parent 100, while haplotypes 13 or 80 were inherited from parent 200

2. The second LD genotype is not informative

3. The third LD genotype suggests that haplotype 80 was inherited from parent 200

4. The same as 3.

5. The fifth LD genotype suggests that there has been recombination between parent 100 gametes (haplotypes 127 and 20) as we have allele 1 on both chromosomes in individual 1601, but not on haplotype 20; no change can be seen for the other homologous chromosome

6. The sixth LD genotype is not informative for parent 100, while it is for parent

7. The seventh genotype again suggests recombination between the 4th and 5th LD genotype for parent 100, while there is no info on recombination for parent 200

Result:

LD genotype

1601 100 200 / / 1?????????????????????????????0???????????????????????1?????1??????????????????2???????????????1???2

Haplotypes

1601 100 200 20/127 1 1001001100100100000011001000000100000100000110111110111000001111000001000110110110100001001100001001

1601 100 200 80 18 0010100010110011110001001000000100011011100001000101000011100010001101110001010110100000010000011001

|

V

Full (imputed) genotype

1601 100 200 / / 1011101110210111110012002000000200011111100111111211111011101121001102110111120220200001011100012002

Practical – Imputation (Solutions)

Page 31: Phasing and Imputation

Imputation quality

• %Incorrect genotypes: genotype error rate

– true 0, inferred 2: error (1)

– true 0, inferred 1: error (1)

– true 0, inferred 0: correct (0)

• %Correct genotypes: genotype concordance = 1 – genotype error rate

• %Incorrect alleles: allele error rate

– true 0, inferred 2: error (1)

– true 0, inferred 1: error/2 (1/2)

– true 0, inferred 0: correct (0)

– with probabilistic imputation: ½ abs(true allele dosage – inferred allele dosage)

• %Correct alleles: allele concordance = 1 – allele error rate

• Correlation(true allele dosage, inferred allele dosage)
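The three quality measures can be sketched as follows. The true and inferred dosages are small made-up vectors, and the function names are assumptions; the allele error uses the ½·abs(difference) rule from the slide, so it also works for probabilistic (non-integer) dosages.

```python
def genotype_error_rate(true, inferred):
    """Share of loci where the inferred genotype differs from the true genotype."""
    return sum(t != i for t, i in zip(true, inferred)) / len(true)


def allele_error_rate(true, inferred):
    """Average of 1/2 * |true dosage - inferred dosage| over loci."""
    return sum(abs(t - i) / 2 for t, i in zip(true, inferred)) / len(true)


def correlation(x, y):
    """Pearson correlation between true and inferred allele dosages."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical toy data: one heterozygote wrongly inferred as homozygous
true_dosage = [0, 1, 2, 1, 0, 2]
inferred_dosage = [0, 1, 2, 2, 0, 2]

print("genotype error rate: ", genotype_error_rate(true_dosage, inferred_dosage))
print("genotype concordance:", 1 - genotype_error_rate(true_dosage, inferred_dosage))
print("allele error rate:   ", allele_error_rate(true_dosage, inferred_dosage))
print("allele concordance:  ", 1 - allele_error_rate(true_dosage, inferred_dosage))
print("correlation:         ", correlation(true_dosage, inferred_dosage))
```

The same single mistake counts as a full error for genotypes but only half an error for alleles, which is why the two concordances differ.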

Page 32: Phasing and Imputation

Example in maize

Hickey et al (2012) IMPUTE2, 1227 maize inbred lines

Page 33: Phasing and Imputation

IMPUTE2, 1227 maize inbred lines Hickey et al (2012)

Example in maize

Page 34: Phasing and Imputation

Practical – Imputation II

• Evaluate imputation quality for a „pink part“ of the core 1

Result:

LD genotype

1601 100 200 / / …1?????1??????????????????2???????????????1…

Haplotypes

1601 100 200 20/127 1 …100000111100000100011011011010000100110000…

1601 100 200 80 19 …001110001000110111000101011010000001000001…

Full (imputed) genotype

1601 100 200 / / …101110112100110211011112022020000101110001…

True genotype

1601 100 200 / / …101110112100110111011112022020000101110001…
