phasing and imputation
ESF Genomic Resources Summer School
October 1-7, 2012, Pag Island, Croatia
Day 3 – Practical workshop: Phasing and imputation
Gregor Gorjanc
Biotechnical Faculty, Department of Animal Science
University of Ljubljana, Slovenia
Livestock Conservation Genomics: Data, Tools and Trends
Background
• Introduced to phasing and imputation via genomic selection = genome-wide SNP data used for genomic prediction
• Collaboration with John Hickey
Red – paternal grandfather; Magenta – maternal grandfather; Blue – paternal grandmother; Green – maternal grandmother
Scale – number of variants (1000 variants / Mb)
Whole-genome sequence haplotypes (human trios) – Chr1
Roach et al (2011)
Roach et al (2011)
Blue – paternal; Brown – maternal; Dark – grandfather; Light – grandmother
• Diplotype = combination of haplotypes
• Genotype = combination of alleles at locus (over chr.)
• Haplotype = combination of alleles on chr. (over loci)
*type
Haplotype 1: h_{i,1} = (A-C-A-T)
Haplotype 2: h_{i,2} = (G-C-A-G)
Genotype: g_i = (AG, CC, AA, TG)
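A minimal sketch of how the genotype above follows from the two haplotypes: alleles are paired locus by locus. The function name is hypothetical, and the order of the two alleles within a locus carries no meaning once the phase is lost (AG is the same genotype as GA).

```python
def genotype_from_haplotypes(hap1, hap2):
    """Pair the alleles of two haplotypes locus by locus (unphased genotype)."""
    if len(hap1) != len(hap2):
        raise ValueError("haplotypes must cover the same loci")
    return [a + b for a, b in zip(hap1, hap2)]


h1 = ["A", "C", "A", "T"]  # haplotype 1 from the slide
h2 = ["G", "C", "A", "G"]  # haplotype 2 from the slide
g = genotype_from_haplotypes(h1, h2)
print(",".join(g))  # AG,CC,AA,TG
```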
Genotyping platforms
• We obtain genotypes separately for each locus → cannot resolve haplotypes directly
• With n SNP loci there are 2^n possible haplotypes
– 1 locus: 2^1 = 2 haplotypes (0, 1 alleles)
– 2 loci: 2^2 = 4 haplotypes (0-0, 0-1, 1-0, 1-1)
– 3 loci: 2^3 = 8 haplotypes (0-0-0, 0-0-1, 0-1-1, …)
– …
– 10 loci: 2^10 = 1024 haplotypes
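The counts above can be checked by brute-force enumeration. This is only an illustration for biallelic loci coded 0/1; the function name is hypothetical.

```python
from itertools import product


def all_haplotypes(n_loci):
    """Enumerate every possible haplotype over n biallelic loci (alleles 0/1)."""
    return ["-".join(alleles) for alleles in product("01", repeat=n_loci)]


for n in (1, 2, 3, 10):
    print(n, len(all_haplotypes(n)))  # 1 2, 2 4, 3 8, 10 1024

print(all_haplotypes(2))  # ['0-0', '0-1', '1-0', '1-1']
```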
Classes of methods for phasing (=„haplotyping“)
• Molecular analysis → isolation of a single molecule and sequencing → not completely there yet
• Population/statistical analysis → linkage disequilibrium – short regions
• Genetic analysis → genetic principles (simple rules, linkage) – longer regions
• Parsimony analysis (minimum number of recombinations)
• Pooled DNA
• …
Methods & Programs
• Very active area of developments: several methods, programs, data formats, options, tricks, …
• fastPHASE
• Beagle
• MACH
• IMPUTE2
• FIMPUTE
• findhap.f90
• PHASEBOOK
• Alpha{Phase, Impute}
• LDMIP
• …
Some review papers:
• Liu et al (2009) Ann Rev Genomics Hum Genet, 10, 387-406
• Marchini & Howie (2010) Nat Rev Gen, 11, 499-511
• Browning & Browning (2011) Nat Rev Gen, 12, 703-714
LD based modelling HMM
• Hidden Markov Model (HMM) – inferring haplotype cluster (or reference haplotype) membership locally along the chromosome (clustering each marker) allowing for block-like structure and gradual decline of LD with increasing distance
• Examples
– fastPHASE
– IMPUTE2
Scheet & Stephens (2006)
Beagle model (also LD based)
• Locally variable number of clusters (haplotypes) - K
Browning & Browning (2010)
Long-range phasing
• A fast rule based phasing method
• Basically a pedigree free linkage approach
• Completely unrelated animals contribute phasing information
• Even animals from a different breed can contribute
Kong et al (2008); Hickey et al (2011)
Proband
Mother Father
10110122121110212101220022
11110222111111111111121021 10111121211121212211221121
What underlies a genotype? (coded as allele dosage)
Proband
Mother Father
10100111011100111001110011
00010011110010101100110011
01010011100011000110011010 10101110101111111111111110
10100111011100111001110011 00010011110010101100110011
10110122121110212101220022
11110222111111111111121021 10111121211121212211221121
What underlies a genotype? (coded as allele dosage)
Proband
Mother Father
10100111011100111001110011
00010011110010101100110011
01010011100011000110011010 10101110101111111111111110
10100111011100111001110011 00010011110010101100110011
11110122111111111111121021 10111121211121212211221121
10110122121110212101220022
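A short check of the allele dosage coding, using the proband's two haplotypes and genotype from the slide: the genotype is the locus-wise sum of the two haplotypes. The function name is hypothetical.

```python
def dosage_genotype(hap_a, hap_b):
    """Sum two 0/1 haplotype strings locus by locus into a 0/1/2 genotype."""
    return "".join(str(int(a) + int(b)) for a, b in zip(hap_a, hap_b))


# Proband haplotypes as shown on the slide
hap_a = "10100111011100111001110011"
hap_b = "00010011110010101100110011"

proband = dosage_genotype(hap_a, hap_b)
print(proband)  # 10110122121110212101220022
```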
Phasing a Trio
Can not phase this locus!!
Proband
Mother:
Father:
10100111011100111001110011
00010011110010101100110011
01010111100011000110011010
10101110101111111111111110
10100111011100111001110011
00010011110010101100110011
10111110101011111100111110
10101110010110110000111110
Other:
11110222111111111111121021
**************************
10110122121110212101220022
10110122121110212101220022
Pat Hap:
Mat Hap:
Genotype:
Proband G:
Opp Homo:
Pat Hap:
Mat Hap:
Genotype:
Proband G:
Opp Homo:
10111121211121212211221121
**************************
10110122121110212101220022
Pat Hap:
Mat Hap:
Genotype:
Proband G:
Opp Homo:
20212220111121221100222220
****X**X**************XX*X
10110122121110212101220022
Genotype
Not a surrogate parent!
Surrogate parents are the drivers of long range phasing
Proband
Mother:
Father:
10100111011100111001110011
00010011110010101100110011
01010111100011000110011010
10101110101111111111111110
10100111011100111001110011
00010011110010101100110011
Other:
11110222111111111111121021
**************************
10110122121110212101220022
10110122121110212101220022
Pat Hap:
Mat Hap:
Genotype:
Proband G:
Opp Homo:
Pat Hap:
Mat Hap:
Genotype:
Proband G:
Opp Homo:
10111121211121212211221121
**************************
10110122121110212101220022
Pat Hap:
Mat Hap:
Genotype:
Proband G:
Opp Homo:
20201221112110212012210121
**************************
10110122121110212101220022
Genotype
10100111011100111001110011
10101110101010101011100110
A surrogate parent! (Even without pedigree information)
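The "Opp Homo" rows can be reproduced with a few lines: an X marks a locus where one individual is 0 and the other is 2 (opposing homozygotes), so the two cannot share a haplotype there. This is a simplified sketch with hypothetical function names; it ignores genotyping errors and compares the whole stretch at once rather than a core with tails.

```python
def opposing_homozygotes(geno_a, geno_b):
    """Mark loci where the two genotypes are opposing homozygotes (0 vs 2)."""
    return "".join(
        "X" if {a, b} == {"0", "2"} else "*"
        for a, b in zip(geno_a, geno_b)
    )


def is_surrogate_candidate(geno_a, geno_b):
    """True when no locus shows opposing homozygotes."""
    return "X" not in opposing_homozygotes(geno_a, geno_b)


proband = "10110122121110212101220022"
other_1 = "20212220111121221100222220"  # "Not a surrogate parent!"
other_2 = "20201221112110212012210121"  # "A surrogate parent!"

print(opposing_homozygotes(proband, other_1))  # ****X**X**************XX*X
print(is_surrogate_candidate(proband, other_1))  # False
print(is_surrogate_candidate(proband, other_2))  # True
```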
Surrogate parents are the drivers of long range phasing
Proband
Mother Surrogate Father
10100111011100111001110011
00010011110010101100110011
10101110101111111111111110
00010011110010101100110011
Can now phase this locus!
10100111011100111001110011
10101110101010101011100110
Could be a female
Could be a descendant
Could be many generations distant
Can be ‘unrelated’
Phasing a Trio
20201221112… 10111121211121212211221121
10110122121110212101220022
Proband
Mother Father
10100111011100111001110011
00010011110010101100110011
01010011100011000110011010 10101110101111111111111110
10100111011100111001110011 00010011110010101100110011
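The simple rules for phasing a trio can be sketched as follows. A heterozygous locus of the proband is phased when at least one parent is homozygous; when both parents are heterozygous the locus stays unphased ("?"). The genotypes below are hypothetical toy data, not the slide data, and the sketch assumes no Mendelian inconsistencies.

```python
def phase_trio(child, sire, dam):
    """Phase a child's 0/1/2 genotype from parental genotypes; '?' = unphased."""
    pat, mat = [], []
    for c, s, d in zip(child, sire, dam):
        if c == "0":                      # homozygous child: phase is trivial
            p, m = "0", "0"
        elif c == "2":
            p, m = "1", "1"
        elif s in ("0", "2"):             # homozygous sire fixes paternal allele
            p = "0" if s == "0" else "1"
            m = "1" if p == "0" else "0"
        elif d in ("0", "2"):             # homozygous dam fixes maternal allele
            m = "0" if d == "0" else "1"
            p = "1" if m == "0" else "0"
        else:                             # all three heterozygous
            p, m = "?", "?"
        pat.append(p)
        mat.append(m)
    return "".join(pat), "".join(mat)


# Hypothetical toy genotypes (5 loci)
pat_hap, mat_hap = phase_trio(child="11201", sire="01211", dam="11102")
print(pat_hap)  # 0?100
print(mat_hap)  # 1?101
```

The unphased second locus is where a surrogate parent that is homozygous at that locus would add information.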
Erdös 1 Surrogate Fathers
10100111011100111001110011
10100111011100111001110011
10100111011100111001110011
10101110101010101011100110
10101010101000000000111110
10101010101111000001100110
00010011110010101100110011
00010011110010101100110011
00010011110010101100110011
10101110010110110000111110
10111110101011111100111110
11111110111111110000111110
Erdös 1 Surrogate Mothers
10111110101011111100111110
10101110010110110000111110
11111110111111110000111110
10101010101111000001100110
Erdös 2 Surrogate Mothers
Potentially many meioses separating
• Erdös 1 surrogates are surrogates of the proband.
• Erdös n+1 surrogates are surrogates of the Erdös n surrogates of the proband.
Surrogate giving phase information
Long Range Phasing
Kong et al (2008); Hickey et al (2011)
• Build library of all completely phased haplotypes
• Find haplotypes in the library which can explain an individual's genotype
• Low error rates
• Computationally fast
• Useful for extremely large data sets – Strategic use
10100111011100111001110011
10100000010000100011110011
10110011001100111001110011
10100111011001001001110011
10100101011100111001110011
10100111001100111001110001
111110111011100111001110011
10100100000000111001110011
Haplotype library imputation
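A minimal sketch of the haplotype library idea: search the library for pairs of haplotypes whose sum matches the observed genotype at all non-missing loci. The library and genotypes here are hypothetical toy data, and the exhaustive pair search is only for illustration, not how a production program would do it.

```python
from itertools import combinations_with_replacement


def explain_genotype(genotype, library):
    """Return library haplotype pairs consistent with a 0/1/2/? genotype."""
    pairs = []
    for h1, h2 in combinations_with_replacement(library, 2):
        consistent = all(
            g == "?" or int(g) == int(a) + int(b)
            for g, a, b in zip(genotype, h1, h2)
        )
        if consistent:
            pairs.append((h1, h2))
    return pairs


library = ["0101", "1100", "0011", "1010"]  # hypothetical phased haplotypes

print(explain_genotype("0112", library))  # [('0101', '0011')]
print(len(explain_genotype("0??2", library)))  # 3
```

With the full genotype a single pair explains the data; with only two observed loci three pairs remain, which is why sparse genotypes need additional information (pedigree, more markers).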
Practical - Phasing
1. Get AlphaPhase distribution package (either)
a) https://sites.google.com/site/hickeyjohn/alphaphase
b) DropBox (software directory)
2. Unpack the distribution package to your working directory
Distribution
|-- Examples
| |-- PhasingWithoutPedigreeInformation
| | |-- GenotypeFormat
| | `-- UnorderedFormat
| |-- PhasingWithPedigreeInformation
| | |-- GenotypeFormat
| | `-- UnorderedFormat
| `-- SimulatedScenario
|-- LinuxExecutable
|-- MacOSXexecutable
`-- WindowsExecutable
Practical – Phasing II
3. Open terminal (Windows: Menu Start, Run, cmd) and change to the following directory (change ???)
cd ???/Distribution/Examples/PhasingWithPedigreeInformation/GenotypeFormat
4. Open the parameter file and change the number of SNP to 500
5. Run AlphaPhase (change OS if needed)
../../../WindowsExecutable/AlphaPhase.exe
6. Open AlphaPhase manual and go to the Output section – read the meaning of created output files
7. Check the phasing of the genotype for individual 1598
– Copy the first few values (say 20) of the maternal and paternal haplotype to Excel and use Data → Text to columns → Space + merge delimiters
– Add the genotype of individual 1598 and check if the sum of the haplotypes gives the genotype
Practical – Phasing III
8. Who is the first (sequentially) Erdös 1 surrogate parent of individual 1598?
9. Find genotype of the above surrogate parent and copy the first few values into Excel and find which loci were informative for the phasing of genotype of individual 1598.
10. What was the phasing yield for individual 1598 and the overall phasing yield per core? Were there any SNP that had very low phasing yield?
11. How many haplotypes were found for the first core?
12. Is distribution of haplotype occurrence uniform?
13. We see „high repetition“ of some haplotypes in individuals 1591:1600. Why?
Imputation – Why?
• Genotyping at high density is expensive
• Genotyping at low density is cheaper
• Imputation is free
• A genotype imputation method that allows us to get lots of genotyped and phenotyped animals at a low cost
Imputation Empo_er_ng you_ wo_k (e.g. GW_S)
Marchini & Howie (2010)
AlphaImpute vs. IMPUTE2
• AlphaImpute
– Uses pedigree and linkage information
• IMPUTE2
– Pedigree-free imputation which uses linkage disequilibrium
– Similar algorithm to Beagle / fastPHASE
• Data sets for comparison
– pig data set
– multiple breed cattle data set
Pig data set
Category     Count  0.5k LD               2.5k LD               5k LD                 7.5k LD
                    AlphaImpute  IMPUTE2  AlphaImpute  IMPUTE2  AlphaImpute  IMPUTE2  AlphaImpute  IMPUTE2
BothParents     51  0.98         0.77     0.99         0.92     1.00         0.96     1.00         0.96
SireMGS         62  0.93         0.80     0.98         0.92     0.99         0.94     0.99         0.96
DamPGS          47  0.96         0.79     0.98         0.92     0.99         0.95     0.99         0.96
Sire            45  0.89         0.78     0.97         0.92     0.99         0.95     0.99         0.97
Dam             13  0.90         0.76     0.96         0.89     0.98         0.93     0.98         0.95
Other          291  0.86         0.79     0.94         0.91     0.97         0.95     0.97         0.96
Correlation is the statistic that matters
Pedigree ~6500, SNP60K ~3200, SNPLD ~500
Cattle data set
Category     Count  0.5k LD               2.5k LD               5k LD                 7.5k LD
                    AlphaImpute  IMPUTE2  AlphaImpute  IMPUTE2  AlphaImpute  IMPUTE2  AlphaImpute  IMPUTE2
BothParents     28  0.97         0.64     0.99         0.92     0.99         0.94     1.00         0.95
SireMGS        224  0.87         0.60     0.97         0.91     0.98         0.95     0.99         0.96
DamPGS           7  0.92         0.63     0.97         0.87     0.98         0.91     0.98         0.95
Sire           144  0.86         0.60     0.96         0.90     0.98         0.95     0.98         0.96
Dam              4  0.95         0.63     0.98         0.90     0.99         0.97     0.99         0.95
Other          219  0.84         0.58     0.94         0.90     0.96         0.95     0.97         0.96
Correlation is the statistic that matters
Pedigree ~24000, SNP60K ~4400, SNPLD ~600
Practical - Imputation
1. We have the LD genotype for core 1 on a full-sib of individual 1600:
1601: 1?????????????????????????????0???????????????????????1?????1??????????????????2???????????????1???2
2. Write down pedigree of 1601 and find out haplotype codes of parents for core 1.
3. Find haplotypes (01111000…) of parents and setup the table for inspection of data and imputation (see next slide).
4. What do the existing LD genotypes of individual 1601 show?
5. Impute missing genotypes.
Parental haplotypes
100 0 0 127 19 0011001001001100000001101000000100000111101110111100101000001000100000000110110110100001001100001001*
100 0 0 20 49 *1001001100100100000011001000000100000100000110111110111000001111000001011100010010100011110001000110
200 0 0 13 34 0110110001001100010001001000000111100111100110111100101000001000100011010000101110100000001100001001
200 0 0 80 17 *0010100010110011110001001000000100011011100001000101000011100010001101110001010110100000010000011001*
LD genotype
1601 100 200 / / 1?????????????????????????????0???????????????????????1?????1??????????????????2???????????????1???2
1. The first LD genotype suggests that 1601 inherited (part of) haplotype 20 from parent 100, while haplotype 13 or 80 was inherited from parent 200
2. The second LD genotype is not informative
3. The third LD genotype suggests that haplotype 80 was inherited from parent 200
4. The same as 3.
5. The fifth LD genotype suggests that there has been recombination between parent 100 gametes (haplotypes 127 and 20), as we have allele 1 on both chromosomes in individual 1601, but not on haplotype 20; no change can be seen for the other homologous chromosome
6. The sixth LD genotype is not informative for parent 100, while it is for parent 200
7. The seventh LD genotype again suggests recombination between the 4th and 5th LD genotype for parent 100, while there is no info on recombination for parent 200
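The reasoning in the steps above can be sketched in code: try each combination of one haplotype per parent and keep those that agree with the observed LD genotypes. This is a simplification that assumes no recombination within the core (the real case above needs recombination for parent 100); the haplotypes, their labels and the LD genotype are hypothetical toy data.

```python
from itertools import product


def compatible_gametes(ld_genotype, sire_haps, dam_haps):
    """List (sire hap id, dam hap id, full genotype) matching the LD genotype."""
    results = []
    for (s_id, s_hap), (d_id, d_hap) in product(sire_haps.items(), dam_haps.items()):
        # Full genotype implied by this pair of parental haplotypes
        full = "".join(str(int(a) + int(b)) for a, b in zip(s_hap, d_hap))
        # Keep the pair only if it agrees at every observed ('?' = missing) locus
        if all(g == "?" or g == f for g, f in zip(ld_genotype, full)):
            results.append((s_id, d_id, full))
    return results


sire_haps = {"s1": "010110", "s2": "110001"}  # hypothetical
dam_haps = {"d1": "001101", "d2": "101010"}   # hypothetical
ld = "1???2?"                                 # two observed LD genotypes

matches = compatible_gametes(ld, sire_haps, dam_haps)
print(matches)  # [('s1', 'd2', '111120')]
```

When exactly one combination remains, its sum is the imputed full genotype; an empty result would point to recombination or a genotyping error.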
Result:
LD genotype
1601 100 200 / / 1?????????????????????????????0???????????????????????1?????1??????????????????2???????????????1???2
Haplotypes
1601 100 200 20/127 1 1001001100100100000011001000000100000100000110111110111000001111000001000110110110100001001100001001
1601 100 200 80 18 0010100010110011110001001000000100011011100001000101000011100010001101110001010110100000010000011001
|
V
Full (imputed) genotype
1601 100 200 / / 1011101110210111110012002000000200011111100111111211111011101121001102110111120220200001011100012002
Practical – Imputation (Solutions)
Imputation quality
• %Incorrect genotypes = genotype error rate
– true 0, inferred 2 → error (1)
– true 0, inferred 1 → error (1)
– true 0, inferred 0 → correct (0)
• %Correct genotypes = genotype concordance = 1 – genotype error rate
• %Incorrect alleles = allele error rate
– true 0, inferred 2 → error (1)
– true 0, inferred 1 → error/2 (1/2)
– true 0, inferred 0 → correct (0)
– with probabilistic imputation: ½ abs(true allele dosage – inferred allele dosage)
• %Correct alleles = allele concordance = 1 – allele error rate
• Correlation(true allele dosage, inferred allele dosage)
Example in maize
Hickey et al (2012) IMPUTE2, 1227 maize inbred lines
Practical – Imputation II
• Evaluate imputation quality for the „pink part“ of core 1
Result:
LD genotype
1601 100 200 / / …1?????1??????????????????2???????????????1…
Haplotypes
1601 100 200 20/127 1 …100000111100000100011011011010000100110000…
1601 100 200 80 19 …001110001000110111000101011010000001000001…
Full (imputed) genotype
1601 100 200 / / …101110112100110211011112022020000101110001…
True genotype
1601 100 200 / / …101110112100110111011112022020000101110001…
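One possible way to do the evaluation, using the three measures from the Imputation quality slide on the imputed and true genotypes shown above. Function names are hypothetical and the correlation is a plain Pearson correlation written out by hand.

```python
def genotype_error_rate(true, inferred):
    """Share of loci where the inferred genotype differs from the true one."""
    return sum(t != i for t, i in zip(true, inferred)) / len(true)


def allele_error_rate(true, inferred):
    """Mean of 1/2 * |true dosage - inferred dosage| over loci."""
    return sum(abs(int(t) - int(i)) / 2 for t, i in zip(true, inferred)) / len(true)


def dosage_correlation(true, inferred):
    """Pearson correlation between true and inferred allele dosages."""
    x = [int(t) for t in true]
    y = [int(i) for i in inferred]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


imputed = "101110112100110211011112022020000101110001"
true = "101110112100110111011112022020000101110001"

n_wrong = sum(t != i for t, i in zip(true, imputed))
print(n_wrong, "of", len(true), "genotypes differ")  # 1 of 42 genotypes differ
print("genotype error rate:", genotype_error_rate(true, imputed))
print("allele error rate:  ", allele_error_rate(true, imputed))
print("correlation:        ", dosage_correlation(true, imputed))
```

The single wrong genotype (true 1, imputed 2) counts as a full genotype error but only half an allele error.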