Alkes Price Harvard School of Public Health January 24 & January 26, 2017 EPI 511, Advanced Population and Medical Genetics Week 1: • Intro + HapMap / 1000 Genomes • Linkage Disequilibrium

Page 1: EPI 511, Advanced Population and Medical Genetics

Alkes Price

Harvard School of Public Health

January 24 & January 26, 2017

EPI 511, Advanced Population and Medical Genetics

Week 1:

• Intro + HapMap / 1000 Genomes

• Linkage Disequilibrium

Page 2: EPI 511, Advanced Population and Medical Genetics

EPI 511: Course structure

Week 1: HapMap, 1000G / Linkage disequilibrium

Week 2: Population structure and admixture

Week 3: Population stratification

Week 4: Fine-mapping / Natural selection

Week 5: Heritability / Genetic risk prediction

Week 6: Mixed models / Rare variant analysis

Week 7: Functional interpretation

Page 3: EPI 511, Advanced Population and Medical Genetics

EPI 511: How to address the instructor

Alkes

Dr. Price

Professor Price

Honorable Professor Price

Honorable Distinguished Dr. Professor Price

Page 4: EPI 511, Advanced Population and Medical Genetics

EPI 511: Office Hours

Instructor: Alkes

Office Hours: Thu 3:30-4:30pm, Building 2, Room 211

Email Address: [email protected]

(Please put EPI511 in the subject of your email)

Teaching Assistant: Armin

Office Hours: Fri + Mon 2-3pm, Building 2, Room 209

Email Address: [email protected]

Page 9: EPI 511, Advanced Population and Medical Genetics

EPI 511: Course components

• Advance reading: 1 required paper + 1 optional paper per course session

• Lecture + Discussion. Discussants: each student to sign up as discussant for 1 class

Video of each class will be posted on the course website <1hr after class.

Page 14: EPI 511, Advanced Population and Medical Genetics

EPI 511: Course components

• Advance reading: 1 required paper + 1 optional paper per course session

• Lecture + Discussion. Discussants: each student to sign up as discussant for 1 class

• Experiences: 5 take-home projects due Tue Jan 31, …, Tue Feb 28

• short Research Paper: due Fri Mar 10

• self-assessment Opportunity: 20min exam (date will not be announced in advance)

Page 16: EPI 511, Advanced Population and Medical Genetics

EPI 511: Outcome measures

• Advance reading (0% of course grade): 1 required paper + 1 optional paper per course session

• Lecture + Discussion (0% of course grade). Discussants: each student to sign up as discussant for 1 class

• Experiences (60% of course grade): 5 take-home projects (data and programming intensive)

Page 17: EPI 511, Advanced Population and Medical Genetics

Approaches to Scientific Understanding

Love is Understanding.

-- Madonna

Data is Understanding.

-- Alkes

Page 19: EPI 511, Advanced Population and Medical Genetics

Approaches to Scientific Understanding

Understanding Data requires Fixing Bugs.

Page 20: EPI 511, Advanced Population and Medical Genetics

Genetics + data + programming = bright future

Gewin 2007 Nature Hayden 2012 Nature

Page 22: EPI 511, Advanced Population and Medical Genetics

EPI 511: Outcome measures

• Advance reading (0% of course grade): 1 required paper + 1 optional paper per course session

• Lecture + Discussion (0% of course grade). Discussants: each student to sign up as discussant for 1 class

• Experiences (60% of course grade): 5 take-home projects (data and programming intensive)

• short Research Paper (40% of course grade): 1,000-1,500 words (suggested topics provided on Feb 16)

• self-assessment Opportunity (0% of course grade): 20min exam (date will not be announced in advance)

Page 23: EPI 511, Advanced Population and Medical Genetics

EPI 511: Policy on group work

Experiences (60% of course grade) 5 take-home projects (data and programming intensive)

• OK to discuss experiences with your colleagues

• Each piece of code you write should be your own

short Research Paper (40% of course grade) 1,000-1,500 words (suggested topics provided on Feb 16)

• OK to discuss the project with your colleagues

• Each piece of code you write should be your own

• Each piece of text you write should be your own

Page 24: EPI 511, Advanced Population and Medical Genetics

EPI 511, Advanced Population and Medical Genetics

Week 1:

• Introduction + HapMap Project

• Linkage Disequilibrium

Page 25: EPI 511, Advanced Population and Medical Genetics

Outline

1. Introduction to Population Genetics

2. HapMap and HapMap2 projects

3. FST

4. HapMap3 and 1000 Genomes projects

Page 27: EPI 511, Advanced Population and Medical Genetics

What is Population Genetics?

Population genetics is the study of genetic variation

both within and between human populations.

Page 29: EPI 511, Advanced Population and Medical Genetics

Are different human populations

actually genetically different?

Slightly.

5-7% of worldwide human genetic variation is due to

genetic differences between human populations.

The remaining 93-95% of human genetic variation is due to

genetic variation within human populations

(Rosenberg et al. 2002 Science).

Page 31: EPI 511, Advanced Population and Medical Genetics

Why study differences between

human populations?

• Learn about human migration patterns and ancient history.

• Improve our power to identify and localize disease genes.

Page 32: EPI 511, Advanced Population and Medical Genetics

Rosenberg et al. 2010

Nat Rev Genet

Page 33: EPI 511, Advanced Population and Medical Genetics

Bustamante et al. 2011 Nature; also see Popejoy & Fullerton 2016 Nature

Page 34: EPI 511, Advanced Population and Medical Genetics

Why study differences between

human populations?

• Learn about human migration patterns and ancient history.

• Improve our power to identify and localize disease genes.

Williams et al. 2014 Nature

Page 35: EPI 511, Advanced Population and Medical Genetics

Why study differences between

human populations?

• Learn about human migration patterns and ancient history.

• Improve our power to identify and localize disease genes.

- Use differences in linkage disequilibrium for fine-mapping.

- Avoid false positives due to population stratification.

- Signals of natural selection at genes related to disease.

Page 37: EPI 511, Advanced Population and Medical Genetics

Does “race” exist?

Worldwide patterns of human genetic variation are best

described using continuous clines instead of discrete clusters.

(Serre & Paabo 2004 Genome Res)

Racial classifications are inadequate descriptors of the

distribution of human genetic variation.

(Tishkoff & Kidd 2004 Nat Genet)

For a fun time: go to a population genetics party and ask,

Page 40: EPI 511, Advanced Population and Medical Genetics

Isn’t it politically incorrect to study

differences between human populations?

No. It is not politically incorrect.

“Studies of human population genetics have generated the

strongest proof that there is no scientific basis for racism.”

(Cavalli-Sforza 2005 Nat Rev Genet)

also see Cavalli-Sforza et al. 1994 The History and Geography of Human Genes

Page 41: EPI 511, Advanced Population and Medical Genetics

Outline

1. Introduction to Population Genetics

2. HapMap and HapMap2 projects

3. FST

4. HapMap3 and 1000 Genomes projects

Page 42: EPI 511, Advanced Population and Medical Genetics

The International HapMap Project (International HapMap Consortium 2005 Nature)

CEU (European) CHB (Chinese)

JPT (Japanese) YRI (Nigerian)

Page 43: EPI 511, Advanced Population and Medical Genetics

CEU northern European USA 90

CHB Chinese China 45

JPT Japanese Japan 44

YRI Yoruba Nigeria 90

The International HapMap Project: 270 samples from 4 populations

Page 44: EPI 511, Advanced Population and Medical Genetics

The International HapMap Project (International HapMap Consortium 2005 Nature)

CEU (European) CHB (Chinese)

JPT (Japanese) YRI (Nigerian)

Phase I HapMap:

>1,000,000 SNPs

Page 45: EPI 511, Advanced Population and Medical Genetics

The International HapMap Project (International HapMap Consortium 2007 Nature)

CEU (European) CHB (Chinese)

JPT (Japanese) YRI (Nigerian)

Phase II HapMap:

>3,000,000 SNPs

Page 47: EPI 511, Advanced Population and Medical Genetics

What is a SNP?

Rosenberg & Nordborg 2002 Nat Rev Genet

A Single Nucleotide Polymorphism (SNP) is a letter of the

genome that differs in different individuals (e.g. G/T).

Each SNP corresponds to one single mutation event in history,

e.g. G mutated to T in one single ancestor.

G = ancestral allele, T = derived allele.

Coalescent tree

Page 48: EPI 511, Advanced Population and Medical Genetics

What is a SNP: physical position

Each SNP has a physical position on a chromosome.

SNP          chrom.   physical position (bp)
rs10910034   1        2165898
rs1713712    1        2166021
…            …        …

Page 49: EPI 511, Advanced Population and Medical Genetics

What is a SNP: physical vs. genetic position

Each SNP has a physical and genetic position on a chromosome.

SNP          chrom.   physical position (bp)   genetic position (Morgans)
rs10910034   1        2165898                  0.01904785
rs1713712    1        2166021                  0.01904814
…            …        …                        …

1 recombination event per Morgan per generation.

Genome-wide recombination rate is about 1cM / Mb.

[cM = centiMorgan = 1/100 Morgan, Mb = Megabase = $10^6$ bp]

Thus, 1 Morgan is roughly 100Mb = $10^8$ bp on average.
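As a rough illustration of the conversions on this slide, the sketch below turns a physical distance into an expected genetic distance at the genome-wide average of 1cM / Mb, and computes the local recombination rate implied by the two SNPs in the table. The function names are hypothetical and not part of the course materials.

```python
CM_PER_MB_GENOME_AVG = 1.0  # genome-wide average quoted on the slide


def expected_morgans(bp_distance, cm_per_mb=CM_PER_MB_GENOME_AVG):
    """Expected genetic distance (Morgans) for a physical distance in bp."""
    megabases = bp_distance / 1e6
    centimorgans = megabases * cm_per_mb
    return centimorgans / 100.0


def local_rate_cm_per_mb(bp1, morgans1, bp2, morgans2):
    """Recombination rate (cM/Mb) implied by two SNPs with known positions."""
    centimorgans = (morgans2 - morgans1) * 100.0
    megabases = (bp2 - bp1) / 1e6
    return centimorgans / megabases


# The two SNPs from the slide (rs10910034 and rs1713712 on chromosome 1).
gap_bp = 2166021 - 2165898
print("physical gap (bp):", gap_bp)
print("expected Morgans at 1 cM/Mb:", expected_morgans(gap_bp))
print("implied local rate (cM/Mb):",
      local_rate_cm_per_mb(2165898, 0.01904785, 2166021, 0.01904814))
```

For these two SNPs the physical gap is 123 bp and the implied local rate is roughly 0.24 cM/Mb, somewhat below the genome-wide average.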

Page 50: EPI 511, Advanced Population and Medical Genetics

HapMap project: Summary of main results

• 3.1 million SNPs successfully genotyped using Perlegen

genotyping technology (Hinds et al. 2005 Science).

• These 3.1 million SNPs: about 30% of all common SNPs

(defined as SNPs with minor allele frequency >5%).

Page 51: EPI 511, Advanced Population and Medical Genetics

CEU northern European USA 90

CHB Chinese China 45

JPT Japanese Japan 44

YRI Yoruba Nigeria 90

HapMap: 270 samples from 4 populations

Affymetrix and

Illumina chips

Page 52: EPI 511, Advanced Population and Medical Genetics

HapMap project: Summary of main results

• 3.1 million SNPs successfully genotyped using Perlegen

genotyping technology (Hinds et al. 2005 Science).

• These 3.1 million SNPs: about 30% of all common SNPs

(defined as SNPs with minor allele frequency >5%).

“Properties of SNPs are influenced by discovery sampling …

HapMap relied on nearly any piece of information available.”

Clark et al. 2005 Genome Res; also see Keinan et al. 2007 Nat Genet

Page 53: EPI 511, Advanced Population and Medical Genetics

Summary of main results, continued

• Understanding genetic differences between populations.

• Patterns of linkage disequilibrium both within and across

populations.

• Most common SNPs in the human genome are in strong

linkage disequilibrium with at least one HapMap SNP

[avg $r^2$ ≥ 0.90 in 10 sequenced ENCODE regions].

Page 54: EPI 511, Advanced Population and Medical Genetics

Genetic differences between HapMap populations (International HapMap Consortium 2005 and 2007 Nature)

77% frequency

68% frequency

50% frequency C allele of rs10910034

Page 55: EPI 511, Advanced Population and Medical Genetics

Genetic differences between HapMap populations (International HapMap Consortium 2005 and 2007 Nature)

FST = 0.19

FST = 0.11

FST = 0.16

Note: FST accounts for

sampling error due to

finite sample size.

Page 56: EPI 511, Advanced Population and Medical Genetics

Populations can be distinguished using

a large number of genetic markers

Principal Components Analysis

using 100 markers

Page 57: EPI 511, Advanced Population and Medical Genetics

Populations can be distinguished using

a large number of genetic markers

using 3 million markers

Principal Components Analysis

Page 58: EPI 511, Advanced Population and Medical Genetics

Outline

1. Introduction to Population Genetics

2. HapMap and HapMap2 projects

3. FST

4. HapMap3 and 1000 Genomes projects

Page 59: EPI 511, Advanced Population and Medical Genetics

Genetic differences between HapMap populations (International HapMap Consortium 2005 and 2007 Nature)

FST = 0.19

FST = 0.11

FST = 0.16

Page 60: EPI 511, Advanced Population and Medical Genetics

Defining vs. Estimating FST

• FST is an underlying parameter that depends on the two

populations, but does not depend on a particular finite sample.

• $\hat{F}_{ST}$ is an estimate of the underlying $F_{ST}$ that depends on a

particular finite sample that is analyzed.

Weir & Hill 2002 Annu Rev Genet, Bhatia et al. 2013 Genome Res

Page 62: EPI 511, Advanced Population and Medical Genetics

Defining FST

Definition:

• The FST between two populations is the value such that the

allele frequency difference between the two populations has

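mean 0 and variance $2F_{ST}\,p(1-p)$, where $p$ is the allele

frequency in the ancestral population.

$p_1 \sim N\big(p,\ F_{ST}\,p(1-p)\big)$

[Figure: the ancestral population (allele frequency $p$) splits into populations 1 and 2 (allele frequencies $p_1$ and $p_2$); each branch adds variance $F_{ST}\,p(1-p)$.]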

Weir & Hill 2002 Annu Rev Genet, Bhatia et al. 2013 Genome Res

Page 63: EPI 511, Advanced Population and Medical Genetics

Defining FST

Definition:

• The FST between two populations is the value such that the

allele frequency difference between the two populations has

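mean 0 and variance $2F_{ST}\,p(1-p)$, where $p$ is the allele

frequency in the ancestral population.

$p_1 \sim \mathrm{Beta}\big(p(1-F_{ST})/F_{ST},\ (1-p)(1-F_{ST})/F_{ST}\big)$

[Figure: the ancestral population (allele frequency $p$) splits into populations 1 and 2 (allele frequencies $p_1$ and $p_2$); each branch adds variance $F_{ST}\,p(1-p)$.]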

Weir & Hill 2002 Annu Rev Genet, Bhatia et al. 2013 Genome Res
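The Beta model on this slide can be checked numerically. The sketch below draws $p_1$ and $p_2$ independently from the stated Beta distribution and compares the empirical variance of $p_1 - p_2$ with $2F_{ST}\,p(1-p)$. The values p = 0.3 and FST = 0.1, the number of draws, and the function names are arbitrary illustrative choices, not taken from the lecture.

```python
import random


def beta_params(p, fst):
    """Shape parameters of the Beta model: mean p, variance fst * p * (1 - p)."""
    scale = (1.0 - fst) / fst
    return p * scale, (1.0 - p) * scale


def simulate_difference(p, fst, n_draws=100_000, seed=0):
    """Return (mean, variance) of p1 - p2 over independent Beta draws."""
    rng = random.Random(seed)
    a, b = beta_params(p, fst)
    diffs = [rng.betavariate(a, b) - rng.betavariate(a, b)
             for _ in range(n_draws)]
    mean = sum(diffs) / n_draws
    var = sum((d - mean) ** 2 for d in diffs) / (n_draws - 1)
    return mean, var


p, fst = 0.3, 0.1
mean_diff, var_diff = simulate_difference(p, fst)
print("mean of p1 - p2:    ", mean_diff)
print("variance of p1 - p2:", var_diff)
print("2 * FST * p * (1-p):", 2 * fst * p * (1 - p))
```

The empirical mean should be close to 0 and the empirical variance close to the theoretical value, up to simulation noise.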

Page 64: EPI 511, Advanced Population and Medical Genetics

Defining FST

Definition:

• The FST between two populations is the value such that the

allele frequency difference between the two populations has

mean 0 and variance $2F_{ST}\,p(1-p)$, where $p$ is the allele

frequency in the ancestral population.

OR

• The FST between two populations is equal to the proportion

of genotypic variance in a set of N individuals from each

population that is attributable to population differences.

Weir & Hill 2002 Annu Rev Genet, Bhatia et al. 2013 Genome Res

Page 65: EPI 511, Advanced Population and Medical Genetics

Defining FST

Theorem 1:

• The FST between two populations is the value such that the

allele frequency difference between the two populations has

mean 0 and variance $2F_{ST}\,p(1-p)$, where $p$ is the allele

frequency in the ancestral population.

=>

• The FST between two populations is equal to the proportion

of genotypic variance in a set of N individuals from each

population that is attributable to population differences.

Page 68: EPI 511, Advanced Population and Medical Genetics

Defining FST

Proof: Let $p_{avg} = (p_1 + p_2)/2$.

Total genotypic variance is $2p_{avg}(1-p_{avg}) \approx 2p(1-p)$

[Note that individuals are diploid: genotype = 0 or 1 or 2.

Binomial sampling with n=2.]

Genotypic variance attributable to population differences:

Suppose we have N data points with value $2p_1$, N with value $2p_2$.

After subtracting the average value $(p_1 + p_2)$, we have

N data points with value $(p_1 - p_2)$, N with value $(p_2 - p_1)$.

Since $p_1$ and $p_2$ each have variance $F_{ST}\,p(1-p)$, it follows that

$(p_1 - p_2)$ and $(p_2 - p_1)$ each have variance $2F_{ST}\,p(1-p)$.

$2F_{ST}\,p(1-p) \,/\, [2p(1-p)] = F_{ST}$. Q.E.D.

Page 69: EPI 511, Advanced Population and Medical Genetics

Defining FST

Theorem 1′:

• The FST between two populations is the value such that the

allele frequency difference between the two populations has

mean 0 and variance $2F_{ST}\,p(1-p)$, where $p$ is the allele

frequency in the ancestral population.

=>

• The proportion of genotypic variance in a set of

$\alpha N$ individuals from population 1 and $(1-\alpha)N$ individuals

from population 2 that is attributable to population differences

is equal to $4\alpha(1-\alpha) \cdot F_{ST}$.
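The slide states Theorem 1′ without proof. One way to see it, following the same steps as the proof of Theorem 1 and the same approximation of total genotypic variance by $2p(1-p)$, is sketched below; this derivation is a reconstruction, not taken from the lecture.

```latex
\begin{align*}
\text{mean genotype}
  &= 2\bigl(\alpha p_1 + (1-\alpha)\,p_2\bigr) \\
\text{deviation of population 1 from the mean}
  &= 2p_1 - 2\bigl(\alpha p_1 + (1-\alpha)\,p_2\bigr)
   = 2(1-\alpha)(p_1 - p_2) \\
\text{deviation of population 2 from the mean}
  &= 2p_2 - 2\bigl(\alpha p_1 + (1-\alpha)\,p_2\bigr)
   = -2\alpha\,(p_1 - p_2) \\
\text{between-population variance}
  &= \alpha \cdot 4(1-\alpha)^2 (p_1 - p_2)^2
   + (1-\alpha) \cdot 4\alpha^2 (p_1 - p_2)^2 \\
  &= 4\alpha(1-\alpha)\,(p_1 - p_2)^2 \\
E\bigl[\text{between-population variance}\bigr]
  &= 4\alpha(1-\alpha) \cdot 2F_{ST}\,p(1-p) \\
\text{proportion of genotypic variance}
  &= \frac{4\alpha(1-\alpha) \cdot 2F_{ST}\,p(1-p)}{2p(1-p)}
   = 4\alpha(1-\alpha)\,F_{ST}
\end{align*}
```

Setting $\alpha = 1/2$ gives $4 \cdot \tfrac12 \cdot \tfrac12 \cdot F_{ST} = F_{ST}$, which recovers Theorem 1.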

Page 71: EPI 511, Advanced Population and Medical Genetics

Genetic differences between HapMap populations (International HapMap Consortium 2005 and 2007 Nature)

FST = 0.19: $[2F_{ST}\,p(1-p)]^{1/2} = 0.31$ for p = 0.5

FST = 0.11: $[2F_{ST}\,p(1-p)]^{1/2} = 0.23$ for p = 0.5

FST = 0.16: $[2F_{ST}\,p(1-p)]^{1/2} = 0.28$ for p = 0.5
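The quantity $[2F_{ST}\,p(1-p)]^{1/2}$ is the standard deviation of the allele frequency difference between two populations, so it gives a sense of how large a "typical" difference is for a given FST. A minimal sketch that reproduces the arithmetic on these slides (the function name is hypothetical):

```python
import math


def typical_frequency_difference(fst, p=0.5):
    """Standard deviation of p1 - p2, i.e. sqrt(2 * FST * p * (1 - p))."""
    return math.sqrt(2.0 * fst * p * (1.0 - p))


# FST values quoted on the HapMap and subpopulation slides.
for fst in (0.19, 0.11, 0.16, 0.009, 0.005, 0.004, 0.007, 0.008):
    sd = typical_frequency_difference(fst)
    print(f"FST = {fst:<5}  sd of allele frequency difference = {sd:.3f}")
```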

Page 73: EPI 511, Advanced Population and Medical Genetics

Genetic distances (FST) between

European American subpopulations

Ashkenazi vs. Northwest Eur.: FST = 0.009, $[2F_{ST}\,p(1-p)]^{1/2} = 0.067$ for p = 0.5

Ashkenazi vs. Southeast Eur.: FST = 0.004, $[2F_{ST}\,p(1-p)]^{1/2} = 0.045$ for p = 0.5

Northwest Eur. vs. Southeast Eur.: FST = 0.005, $[2F_{ST}\,p(1-p)]^{1/2} = 0.050$ for p = 0.5

Price, Butler et al. 2008 PLoS Genet

Page 74: EPI 511, Advanced Population and Medical Genetics

Genetic distances (FST) between

East Asian subpopulations

Chinese vs. Japanese: FST = 0.007

$[2F_{ST}\,p(1-p)]^{1/2} = 0.059$ for p = 0.5

International HapMap Consortium 2007 Nature

Page 75: EPI 511, Advanced Population and Medical Genetics

Genetic distances (FST) between

West African subpopulations

Yoruba (Nigeria) vs. Luhya (Kenya): FST = 0.008

$[2F_{ST}\,p(1-p)]^{1/2} = 0.063$ for p = 0.5

International HapMap3 Consortium 2010 Nature

Page 77: EPI 511, Advanced Population and Medical Genetics

How do we estimate FST?

$p_1$ and $p_2$ are allele frequencies in 2 populations

$\mathrm{Var}(p_1 - p_2) = 2F_{ST}\,p(1-p)$.

Thus, estimate $F_{ST} = \mathrm{Var}\big((p_1 - p_2) / [2p(1-p)]^{1/2}\big) = E\big((p_1 - p_2)^2 / [2p(1-p)]\big)$.

A PROBLEM: we don’t get to observe $p$ (ancestral frequency)

SOLUTION: approximate $p \approx p_{avg} = (p_1 + p_2)/2$.

Page 78: EPI 511, Advanced Population and Medical Genetics

How do we estimate FST?

$p_1$ and $p_2$ are allele frequencies in 2 populations

$\mathrm{Var}(p_1 - p_2) = 2F_{ST}\,p(1-p)$.

Thus, estimate $F_{ST} = \mathrm{Var}\big((p_1 - p_2) / [2p(1-p)]^{1/2}\big) = E\big((p_1 - p_2)^2 / [2p(1-p)]\big)$.

A BIGGER PROBLEM: we don’t get to observe $p_1$ and $p_2$.

We only get to observe sample allele frequencies $\hat{p}_1$ and $\hat{p}_2$

in sample sizes $N_1$ (from pop. 1) and $N_2$ (from pop. 2).

Page 79: EPI 511, Advanced Population and Medical Genetics

How do we estimate FST?

$p_1$ and $p_2$ are allele frequencies in 2 populations

$\mathrm{Var}(p_1 - p_2) = 2F_{ST}\,p(1-p)$.

Thus, estimate $F_{ST} = \mathrm{Var}\big((p_1 - p_2) / [2p(1-p)]^{1/2}\big) = E\big((p_1 - p_2)^2 / [2p(1-p)]\big)$.

SOLUTION:

Since $\mathrm{Var}(\hat{p}_1 - \hat{p}_2) \approx [2F_{ST} + 1/(2N_1) + 1/(2N_2)]\,p(1-p)$, estimate

$F_{ST} = E\big([(\hat{p}_1 - \hat{p}_2)^2 - (1/(2N_1) + 1/(2N_2))\,p(1-p)] \,/\, [2p(1-p)]\big)$

(where we approximate $p \approx (\hat{p}_1 + \hat{p}_2)/2$)

some details omitted; see Bhatia et al. 2013 Genome Res

Page 80: EPI 511, Advanced Population and Medical Genetics

How do we estimate FST?

$p_1$ and $p_2$ are allele frequencies in 2 populations

$\mathrm{Var}(p_1 - p_2) = 2F_{ST}\,p(1-p)$.

Thus, estimate $F_{ST} = \mathrm{Var}\big((p_1 - p_2) / [2p(1-p)]^{1/2}\big) = E\big((p_1 - p_2)^2 / [2p(1-p)]\big)$.

SOLUTION:

Since $\mathrm{Var}(\hat{p}_1 - \hat{p}_2) \approx [2F_{ST} + 1/(2N_1) + 1/(2N_2)]\,p(1-p)$, estimate

$F_{ST} = E\big([(\hat{p}_1 - \hat{p}_2)^2 - (1/(2N_1) + 1/(2N_2))\,p(1-p)] \,/\, [2p(1-p)]\big)$.

OR $F_{ST} = \dfrac{\sum_i \big[(\hat{p}_{i1} - \hat{p}_{i2})^2 - (1/(2N_1) + 1/(2N_2))\,p_i(1-p_i)\big]}{\sum_i \big[2p_i(1-p_i)\big]}$ (where $i$ indexes SNPs)

some details omitted; see Bhatia et al. 2013 Genome Res
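The ratio-of-sums estimator on this slide can be sketched in a few lines of Python. This follows only the formula as written here, with $p_i$ approximated by the average of the two sample frequencies; the slide itself notes that some details are omitted (see Bhatia et al. 2013), so this is an illustration rather than a reference implementation. The function and argument names are hypothetical.

```python
def estimate_fst(freqs1, freqs2, n1, n2):
    """Ratio-of-sums FST estimate from per-SNP sample allele frequencies.

    freqs1, freqs2: sample allele frequencies in populations 1 and 2
                    (same SNPs, same order).
    n1, n2:         number of diploid individuals sampled from each population.
    """
    if len(freqs1) != len(freqs2):
        raise ValueError("frequency lists must have the same length")
    # Sampling-noise term: 1/(2*N1) + 1/(2*N2), as on the slide.
    correction = 1.0 / (2 * n1) + 1.0 / (2 * n2)
    numerator = 0.0
    denominator = 0.0
    for p1, p2 in zip(freqs1, freqs2):
        p = (p1 + p2) / 2.0          # stand-in for the ancestral frequency
        het = p * (1.0 - p)
        numerator += (p1 - p2) ** 2 - correction * het
        denominator += 2.0 * het
    if denominator == 0.0:
        raise ValueError("all SNPs are monomorphic; FST is undefined")
    return numerator / denominator


# Toy example with three SNPs and 50 individuals per population.
print(estimate_fst([0.60, 0.20, 0.45], [0.40, 0.25, 0.50], n1=50, n2=50))
```

Summing numerator and denominator over SNPs before dividing, rather than averaging per-SNP ratios, keeps low-frequency SNPs with tiny denominators from dominating the estimate. Note that the estimate can be slightly negative when the two populations have nearly identical sample frequencies.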

Page 81: EPI 511, Advanced Population and Medical Genetics

Drift vs. Divergence

YRI CHB CEU

0.02

0.04 0.07

0.10

YRI YRI CEU CEU CHB CHB

Divergence

(per 1000bp of DNA)

0.84 0.60 0.57

Keinan et al. 2007 Nat Genet

NA18488 NA06989 NA18597

Drift

(FST)

Page 82: EPI 511, Advanced Population and Medical Genetics

Drift vs. Divergence

Drift

(FST)

YRI CHB CEU

0.02

0.04 0.05

0.10

YRI YRI CEU CEU CHB CHB

Divergence

(generations)

~30K

gen.

Keinan et al. 2007 Nat Genet

NA18488 NA06989 NA18597

Based on mut. rate 1.2–1.8 × $10^{-8}$ per bp per generation

(Kong et al. 2012 Nature,

Sun et al. 2012 Nat Genet)

~20K

gen.

~20K

gen.
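The slide does not spell out how divergence per 1000bp is converted to generations. One reading that reproduces the quoted numbers is that differences accumulate along both lineages back to the common ancestor, so divergence per bp is about $2\mu T$ and $T \approx d / (2\mu)$. The sketch below applies that assumed relationship, taking 1.5 × $10^{-8}$ (the midpoint of the quoted range) as the default mutation rate; both the formula and the midpoint are assumptions, not statements from the lecture.

```python
def generations_from_divergence(divergence_per_kb, mutation_rate=1.5e-8):
    """Approximate time in generations, assuming divergence per bp = 2 * mu * T."""
    divergence_per_bp = divergence_per_kb / 1000.0
    return divergence_per_bp / (2.0 * mutation_rate)


# Divergence values (per 1000bp) shown on the previous slide.
for d in (0.84, 0.60, 0.57):
    mid = generations_from_divergence(d)
    slow = generations_from_divergence(d, mutation_rate=1.2e-8)
    fast = generations_from_divergence(d, mutation_rate=1.8e-8)
    print(f"divergence {d}/kb: ~{mid:,.0f} generations "
          f"(range {fast:,.0f} to {slow:,.0f})")
```

Under these assumptions the three values come out near 28K, 20K and 19K generations, in line with the ~30K and ~20K figures on the slide.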

Page 83: EPI 511, Advanced Population and Medical Genetics

Outline

1. Introduction to Population Genetics

2. HapMap and HapMap2 projects

3. FST

4. HapMap3 and 1000 Genomes projects

Page 84: EPI 511, Advanced Population and Medical Genetics

CEU northern European USA 90

CHB Chinese China 45

JPT Japanese Japan 44

YRI Yoruba Nigeria 90

HapMap: 270 samples from 4 populations

Affymetrix and

Illumina chips

Perkel 2008 Nat Methods

Page 85: EPI 511, Advanced Population and Medical Genetics

The HapMap Project:

Work is done, relax on beach?

Page 86: EPI 511, Advanced Population and Medical Genetics

Beyond HapMap: what the world still needs

• Larger sample sizes for analyses of linkage disequilibrium

• More complete representation of world population diversity

e.g. South Asian and Native American genetic variation

• Analyses of copy number variation (CNV)

• Low-frequency variants (minor allele frequency <5%)

Page 87: EPI 511, Advanced Population and Medical Genetics

The International HapMap3 Project:

1,260 samples from 11 diverse populations

International HapMap3 Consortium 2010 Nature

Page 88: EPI 511, Advanced Population and Medical Genetics

CEU northern European USA 180

CHB Chinese China 90

JPT Japanese Japan 90

YRI Yoruba Nigeria 180

TSI Tuscan Italy 90

CHD Chinese USA 100

LWK Luhya Kenya 90

MKK Maasai Kenya 180

ASW African-American USA 90

MXL Mexican-American USA 90

GIH Gujarati-American USA 90

HapMap3: 1,260 samples from 11 populations

Page 89: EPI 511, Advanced Population and Medical Genetics

The HapMap3 project

• Larger sample sizes for analyses of linkage disequilibrium

• More complete representation of world population diversity

e.g. South Asian and Native American genetic variation

• Analyses of copy number variation (CNV)

• Low-frequency variants (minor allele frequency <5%)

International HapMap3 Consortium 2010 Nature

Page 90: EPI 511, Advanced Population and Medical Genetics

Data generation: SNPs and CNVs

Affymetrix 6.0 array

900K SNPs

940K copy-number probes

Illumina Infinium 1M array

1M SNPs, of which

80K targeted at CNV regions

1.5M SNPs passed QC in all populations

(99.3% concordance for 250K SNPs on both arrays)

Note: only 1.5M SNPs, versus 3.1 million SNPs in HapMap2

International HapMap3 Consortium 2010 Nature

Page 91: EPI 511, Advanced Population and Medical Genetics

Not all HapMap3 populations are

similar to a population from HapMap

HapMap3 population        Closest pop. from HapMap   FST
TSI (Tuscan)              CEU                        0.004
CHD (Chinese)             CHB                        0.001
LWK (Luhya)               YRI                        0.008
MKK (Maasai)              YRI                        0.03
ASW (African-American)    YRI                        0.01
MXL (Mexican-American)    CEU                        0.04
GIH (Gujarati-American)   CEU                        0.04

Page 92: EPI 511, Advanced Population and Medical Genetics

Approaches to Scientific Understanding

Love is Understanding.

-- Madonna

Data is Understanding.

-- Alkes

Page 93: EPI 511, Advanced Population and Medical Genetics

HapMap3 data: individual files

CEU.ind:

NA06989 F CEU

NA11891 M CEU

NA11843 M CEU

NA12341 F CEU

NA12739 M CEU

[sample ID] [sex] [popname]

Page 94: EPI 511, Advanced Population and Medical Genetics

HapMap3 data: SNP files

CEU.snp:

rs10458597 1 0.0 554484 C T

rs2185539 1 0.0 556738 C T

rs11240767 1 0.0 718814 C T

rs12564807 1 0.0 724325 A G

rs3131972 1 0.0 742584 G A

[SNP ID] [chr] [0.0] [position] [ref] [var]
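A minimal sketch of reading one line of a `.snp` file in the six-column layout shown above. The record type and field names are hypothetical; in particular, the third column (shown as `[0.0]` on the slide) is assumed here to be an unused genetic-position placeholder.

```python
from typing import NamedTuple


class SnpRecord(NamedTuple):
    snp_id: str
    chrom: str
    genetic_pos: float   # assumed meaning of the "[0.0]" column
    physical_pos: int    # position in bp
    ref: str
    var: str


def parse_snp_line(line):
    """Parse one whitespace-delimited line of a .snp file."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    snp_id, chrom, genetic_pos, physical_pos, ref, var = fields
    return SnpRecord(snp_id, chrom, float(genetic_pos), int(physical_pos), ref, var)


record = parse_snp_line("rs3131972 1 0.0 742584 G A")
print(record.snp_id, record.chrom, record.physical_pos, record.ref, record.var)
```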

Page 95: EPI 511, Advanced Population and Medical Genetics

HapMap3 data: genotype files

CEU.geno:

2222222222… [Each line is 1 SNP, each column is 1 indiv.]

2222222222…

2222222222…

2222222222…

1121212112…

[Number of copies of reference allele: 0 or 1 or 2.

9 denotes missing data.]

Note: the HapMap3 data files for this course are restricted to

~700K SNPs that are common (MAF>5%) in every population.
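Given the genotype encoding described above (one line per SNP, one character per individual, 0/1/2 copies of the reference allele, 9 for missing), the reference allele frequency at a SNP is the sum of the non-missing genotypes divided by twice the number of non-missing individuals. A minimal sketch, with a hypothetical function name:

```python
def ref_allele_frequency(geno_line):
    """Reference allele frequency for one .geno line, ignoring missing calls (9).

    Returns None if every genotype on the line is missing.
    """
    counts = []
    for ch in geno_line.strip():
        if ch == "9":
            continue  # missing genotype
        if ch not in "012":
            raise ValueError(f"unexpected genotype code {ch!r}")
        counts.append(int(ch))
    if not counts:
        return None
    # Each diploid individual carries 2 alleles.
    return sum(counts) / (2 * len(counts))


# First ten individuals of two lines shown on the slide.
print(ref_allele_frequency("2222222222"))
print(ref_allele_frequency("1121212112"))
```

Applying the function to each line of a `.geno` file gives per-SNP frequencies for one population; the corresponding lines of two populations' files can then be compared directly.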

Page 96: EPI 511, Advanced Population and Medical Genetics

Beyond HapMap: what the world still needs

• Larger sample sizes for analyses of linkage disequilibrium

• More complete representation of world population diversity

e.g. South Asian and Native American genetic variation

• Analyses of copy number variation (CNV)

• Low-frequency variants (minor allele frequency <5%)

Page 97: EPI 511, Advanced Population and Medical Genetics

Common Disease/Common Variant hypothesis

Lander 1996 Science; Reich & Lander 2001 Trends Genet

reviewed in Gibson 2012 Nat Rev Genet, Visscher et al. 2012 Am J Hum Genet

“For common diseases, there will be one or a few

predominating disease alleles with relatively high frequencies at

each of the major underlying disease loci”

Page 98: EPI 511, Advanced Population and Medical Genetics

Are rare and low-frequency variants important?

Visscher et al. 2012 Am J Hum Genet

(to be continued, Thu of Week 6)

Page 99: EPI 511, Advanced Population and Medical Genetics

Are rare and low-frequency variants important?

Gibson 2012 Nat Rev Genet

(to be continued, Thu of Week 6)

Page 100: EPI 511, Advanced Population and Medical Genetics

Are rare and low-frequency variants important?

Kaiser 2012 Science (to be continued, Thu of Week 6)

Page 101: EPI 511, Advanced Population and Medical Genetics

HapMap3 1Mb pilot sequencing study

and 1000 Genomes pilot projects

International HapMap3 Consortium 2010 Nature

1000 Genomes Project Consortium 2010 Nature

• HapMap3 pilot sequencing: ten 100kb regions spanning 1Mb in total (high coverage: Sanger sequencing)

692 individuals from 10 HapMap3 populations

• 1000 Genomes Trio pilot project: Genome-wide (high coverage: 42x)

6 individuals (one CEU trio and one YRI trio)

• 1000 Genomes Low-coverage pilot project: Genome-wide (low coverage: 2x-6x)

179 individuals from CEU, YRI, CHB, JPT populations

• 1000 Genomes Exon pilot project: 8,140 exons spanning 1.4Mb from 906 genes (high coverage: >50x)

697 individuals from 7 HapMap3 populations

Page 102: EPI 511, Advanced Population and Medical Genetics

Sample size and SNP discovery (per Mb)

International HapMap3 Consortium 2010 Nature

Page 103: EPI 511, Advanced Population and Medical Genetics

The 1000 Genomes (1000G) Project

Sequence the entire genomes of 1,092 individuals:

379 of European ancestry (Europe and USA)

286 of East Asian ancestry (Asia)

246 of African ancestry (Africa and USA)

181 of Latino ancestry (Latin America and USA)

Use next-generation sequencing technologies (~4x coverage):

e.g. Illumina, 454, SOLiD (read lengths 25-400bp)

(Metzker 2010 Nat Rev Genet, Davey et al. 2011 Nat Rev Genet,

also see Nielsen et al. 2011 Nat Rev Genet)

1000 Genomes Project Consortium 2012 Nature

Page 104: EPI 511, Advanced Population and Medical Genetics

1000G project: Summary of main results

• 38 million SNPs discovered and successfully genotyped.

Most of these are rare and low-frequency variants.

• The 38 million SNPs include

99.7% of all SNPs with minor allele frequency 5%

98% of all SNPs with minor allele frequency 1% ***

50% of all SNPs with minor allele frequency 0.1%

based on an independent UK European sample.

***: stated goal to identify >95% of SNPs with frequency 1%

was successfully achieved.

1000 Genomes Project Consortium 2012 Nature

Page 105: EPI 511, Advanced Population and Medical Genetics

Common variants are shared across populations,

but rare variants are often population-private

1000 Genomes Project Consortium 2012 Nature

Page 106: EPI 511, Advanced Population and Medical Genetics

1000G project: the final phase

Sequence the entire genomes of 2,504 individuals:

503 of European ancestry (Europe and USA)

504 of East Asian ancestry (Asia)

661 of African ancestry (Africa and USA)

347 of Latino ancestry (Latin America and USA)

489 of South Asian ancestry (South Asia and USA)

Use next-generation sequencing technologies (~7x coverage):

Illumina only (read lengths 70-400bp only)

85 million SNPs, of which 64 million have MAF<0.5%

Related resource: UK10K project: 7x WGS of 3,781 UK samples

(UK10K Consortium 2015 Nature; also see Gudbjartsson et al. 2015 Nat Genet)

1000 Genomes Project Consortium 2015 Nature

Page 107: EPI 511, Advanced Population and Medical Genetics

1000G project: the final phase

Sequence the entire genomes of 2,504 individuals:

503 of European ancestry (Europe and USA)

504 of East Asian ancestry (Asia)

661 of African ancestry (Africa and USA)

347 of Latino ancestry (Latin America and USA)

489 of South Asian ancestry (South Asia and USA)

Use next-generation sequencing technologies (~7x coverage):

Illumina only (read lengths 70-400bp only)

85 million SNPs, of which 64 million have MAF<0.5%

1000 Genomes Project Consortium 2015 Nature; also see UK10K Consortium

2015 Nature, Gudbjartsson et al. 2015 Nat Genet, McCarthy et al. 2016 Nat Genet

Page 109: EPI 511, Advanced Population and Medical Genetics

What about rare variants?

• The 1000G project has identified most low-frequency variants

(minor allele frequency 1%-5%). These variants can be placed

on genotyping arrays or imputed (see Thu of Week 1)

• Rare variants: most have not been identified by 1000 Genomes!

Must sequence disease samples directly.

Past focus has been mostly on exome sequencing, but

now shifting to whole-genome sequencing.

(to be continued, Thu of Week 6)

Kiezun et al. 2012 Nat Genet, Tennessen et al. 2012 Science, Pasaniuc et al. 2012 Nat Genet,

Purcell et al. 2014 Nature, Do et al. 2015 Nature, Cai et al. 2015 Nature. Reviewed in

Goldstein et al. 2013 Nat Rev Genet, Lee et al. 2014 Am J Hum Genet, Zuk et al. 2014 PNAS

Page 110: EPI 511, Advanced Population and Medical Genetics

• Human populations are slightly genetically different.

These differences may be important for disease mapping.

(see Thu slides: Linkage Disequilibrium.)

• FST quantifies differences between human populations.

• HapMap, HapMap2, HapMap3 and 1000 Genomes projects

provide a valuable resource for common & low-frequency

variants (but most rare variants have not yet been identified).

Conclusions

Page 111: EPI 511, Advanced Population and Medical Genetics

EPI 511, Advanced Population and Medical Genetics

Week 1:

• Intro + HapMap / 1000 Genomes

• Linkage Disequilibrium

Page 112: EPI 511, Advanced Population and Medical Genetics

EPI 511: Course components

• Advance reading 1 required paper + 1 optional paper per course session

• Lecture + Discussion discussants: each student to sign up as discussant for 1 class

Page 113: EPI 511, Advanced Population and Medical Genetics

Outline

1. Introduction to Linkage Disequilibrium

2. LD and Tag SNPs

3. LD and imputation

4. LD and fine-mapping

Page 115: EPI 511, Advanced Population and Medical Genetics

Definition: Linkage Disequilibrium (LD) refers to

correlations between genotypes of nearby markers.

Linkage Disequilibrium

Page 116: EPI 511, Advanced Population and Medical Genetics

Definition: Linkage Disequilibrium (LD) refers to

correlations between genotypes of nearby markers.

Linkage Disequilibrium Association Studies

Linkage Disequilibrium Linkage Mapping

(reviewed in Ott et al. 2015 Nat Rev Genet)

Linkage Disequilibrium

Page 117: EPI 511, Advanced Population and Medical Genetics

Linkage Disequilibrium: Example

[Figure: aligned DNA sequences (3 billion letters) for individuals 1-8, two chromosomes each, with SNP 1 and SNP 2 marked.]

SNP 1 and SNP 2: YES, in LD

Page 118: EPI 511, Advanced Population and Medical Genetics

Linkage Disequilibrium: Example

[Figure: aligned DNA sequences (3 billion letters) for individuals 1-8, two chromosomes each, with SNP 1, SNP 2 and SNP 3 marked.]

SNP 1 and SNP 2: YES, in LD

SNP 3: NOT in LD

Page 119: EPI 511, Advanced Population and Medical Genetics

Linkage Disequilibrium: Example

[Figure: the same sequences recoded as 0/1 alleles for individuals 1-8, two chromosomes each, with SNP 1, SNP 2 and SNP 3 marked.]

SNP 1 and SNP 2: r2=1, in LD

SNP 3: r2=0, NOT in LD

r2 is squared correlation

Page 120: EPI 511, Advanced Population and Medical Genetics

Linkage Disequilibrium: Example

[Figure: 0/1 alleles for individuals 1-8, two chromosomes each, with SNP 1, SNP 2 and SNP 3 marked.]

SNP 1 and SNP 2: r2=1, in LD

SNP 3: r2=0.7, partial LD

r2 is squared correlation

Page 121: EPI 511, Advanced Population and Medical Genetics

Linkage Disequilibrium: Example

Individuals   1 2 3 4 5 6 7 8

              0 0 0 0 0 0 0 0
SNP 1         1 0 2 0 1 0 0 0
              0 0 0 0 0 0 0 0
              0 0 0 0 0 0 0 0
SNP 2         1 0 2 0 1 0 0 0
              0 0 0 0 0 0 0 0
              0 0 0 0 0 0 0 0
              0 0 0 0 0 0 0 0
SNP 3         0 0 2 0 1 0 0 0
              ... … … … … … … …   (3 billion letters)

SNP 1 and SNP 2: r2=1, in LD

SNP 1 and SNP 3: r2=0.7, partial LD

r2 is squared correlation

Page 122: EPI 511, Advanced Population and Medical Genetics

Genotypes vs. Haplotypes: phasing

Individuals

1 2 3 4 5 6 7 8

0 0 0 0 0 0 0 0

1 0 2 0 1 0 0 0

0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0

1 0 2 0 1 0 0 0

0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0

0 0 2 0 1 0 0 0

Individuals

1 2 3 4 5 6 7 8

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0

PHASING

Genotypes Haplotypes

Stephens et al. 2001 Am J Hum Genet, Browning et al. 2011 Nat Rev Genet,

Williams et al. 2012 Am J Hum Genet, Delaneau et al. 2013 Nat Methods,

Loh et al. 2016a Nat Genet, Loh et al. 2016b Nat Genet
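The r2 values quoted on the preceding slides can be checked directly from the phased haplotypes. The snippet below is a minimal sketch, assuming the SNP 1, SNP 2 and SNP 3 rows are the three non-zero rows of the 16-haplotype panel above; the helper name `r_squared` is illustrative, not part of the lecture.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length allele vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    return cov ** 2 / (var_x * var_y)

# 16 phased haplotypes (two per individual), copied from the haplotype panel
snp1 = [0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
snp2 = [0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
snp3 = [0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

r2_12 = r_squared(snp1, snp2)  # identical columns, so perfect LD
r2_13 = r_squared(snp1, snp3)  # 9/13, which the slides round to 0.7
print(f"r2(SNP1, SNP2) = {r2_12:.2f}")
print(f"r2(SNP1, SNP3) = {r2_13:.2f}")
```

On these 16 haplotypes SNP 1 and SNP 3 give r2 = 9/13, consistent with the "r2=0.7, partial LD" label.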

Page 123: EPI 511, Advanced Population and Medical Genetics

Genotypes vs. Haplotypes: phasing

Individuals

1 2 3 4 5 6 7 8

0 0 0 0 0 0 0 0

1 0 2 0 1 0 0 0

0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0

1 0 2 0 1 0 0 0

0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0

0 0 2 0 1 0 0 0

Individuals

1 2 3 4 5 6 7 8

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 1 0 0 1 1 0 0 1 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0

PHASING

Genotypes Haplotypes

Stephens et al. 2001 Am J Hum Genet, Browning et al. 2011 Nat Rev Genet,

Williams et al. 2012 Am J Hum Genet, Delaneau et al. 2013 Nat Methods,

Loh et al. 2016a Nat Genet, Loh et al. 2016b Nat Genet

Page 124: EPI 511, Advanced Population and Medical Genetics

Genotypes vs. Haplotypes: phasing

Individuals

1 2 3 4 5 6 7 8

0 0 0 0 0 0 0 0

1 0 2 0 1 0 0 0

0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0

1 0 2 0 1 0 0 0

0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0

0 0 2 0 1 0 0 0

Individuals

1 2 3 4 5 6 7 8

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0

PHASING

Genotypes Haplotypes

Fact: r2 between SNP1 and SNP2 (phased haplotype data) equals

r2 between SNP1 and SNP2 (unphased genotype data),

assuming Hardy-Weinberg equilibrium holds
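A small simulation can illustrate this fact. The sketch below assumes hypothetical haplotype frequencies (chosen so that the haplotype r2 is 0.5) and forms diploid individuals by pairing haplotypes at random, which is one simple way to impose Hardy-Weinberg equilibrium; it is an illustration, not a proof.

```python
import random

def r_squared(x, y):
    """Squared Pearson correlation between two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    return cov ** 2 / (var_x * var_y)

random.seed(0)

# Hypothetical haplotypes as (allele at SNP1, allele at SNP2) with frequencies
# 0.25, 0.15, 0.60, i.e. pA = 0.25, pB = 0.40, pAB = 0.25
haplotypes = [(1, 1), (0, 1), (0, 0)]
weights = [0.25, 0.15, 0.60]

n_individuals = 20000
haps = random.choices(haplotypes, weights=weights, k=2 * n_individuals)

# r2 from phased haplotype data
hap_r2 = r_squared([h[0] for h in haps], [h[1] for h in haps])

# Random mating: consecutive haplotypes form one diploid individual,
# and only the unphased allele counts (0, 1, 2) are kept
geno_a = [haps[2 * i][0] + haps[2 * i + 1][0] for i in range(n_individuals)]
geno_b = [haps[2 * i][1] + haps[2 * i + 1][1] for i in range(n_individuals)]
geno_r2 = r_squared(geno_a, geno_b)

print(f"haplotype r2 = {hap_r2:.3f}, genotype r2 = {geno_r2:.3f}")
```

The two estimates should agree up to sampling noise; with non-random pairing of haplotypes they can differ.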

Page 125: EPI 511, Advanced Population and Medical Genetics

Linkage Disequilibrium: Haplotype Blocks

[Figure: 0/1 alleles for individuals 1-8, two chromosomes each, with SNP 1, SNP 2 and SNP 3 marked.]

These 3 SNPs form a “haplotype block” with two main haplotypes

Page 126: EPI 511, Advanced Population and Medical Genetics

LD with phased haplotypes: r2 vs. D′

Slatkin 2008 Nat Rev Genet

Consider two SNPs with frequencies pA and pB of alleles A, B.

Let gA refer to # copies (0, 1) of allele A for the first SNP.

Let gB refer to # copies (0, 1) of allele B for the second SNP.

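$$ r^2 = \frac{[E(g_A g_B) - E(g_A)E(g_B)]^2}{\mathrm{Var}(g_A)\,\mathrm{Var}(g_B)} = \frac{(p_{AB} - p_A p_B)^2}{p_A(1-p_A)\,p_B(1-p_B)} $$

where p_AB is the frequency of haplotypes carrying both allele A and allele B.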

Page 127: EPI 511, Advanced Population and Medical Genetics

LD with phased haplotypes: r2 vs. D′

Slatkin 2008 Nat Rev Genet

Consider two SNPs with frequencies pA and pB of alleles A, B.

Suppose pA < pB < 0.5.

$$ r^2 = \frac{(p_{AB} - p_A p_B)^2}{p_A(1-p_A)\,p_B(1-p_B)} $$

$$ D' = \frac{p_{AB} - p_A p_B}{p_A - p_A p_B} $$

Page 128: EPI 511, Advanced Population and Medical Genetics

LD with phased haplotypes: r2 vs. D′

Slatkin 2008 Nat Rev Genet

Consider two SNPs with frequencies pA and pB of alleles A, B.

Suppose pA < pB < 0.5. r2 and D′ are maximized when pAB = pA.

$$ D' = \frac{p_{AB} - p_A p_B}{p_A - p_A p_B} \le 1 $$

$$ r^2 = \frac{(p_{AB} - p_A p_B)^2}{p_A(1-p_A)\,p_B(1-p_B)} \le \frac{p_A - p_A p_B}{p_B - p_A p_B} $$

Page 129: EPI 511, Advanced Population and Medical Genetics

LD with phased haplotypes: r2 vs. D′

Slatkin 2008 Nat Rev Genet

Consider two SNPs with frequencies pA and pB of alleles A, B.

Suppose pA < pB < 0.5. r2 and D′ are maximized when pAB = pA.

e.g. pA = 0.25, pB = 0.4, pAB = 0.25 => r2 = 0.5, D′ = 1
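The worked example above can be reproduced from the haplotype-frequency definitions D = pAB - pA pB, r2 = D^2 / (pA(1-pA) pB(1-pB)) and D′ = D / (pA - pA pB). The function name `ld_stats` is illustrative, and the D′ normalization used here assumes pA ≤ pB and D ≥ 0, as on this slide.

```python
def ld_stats(p_a, p_b, p_ab):
    """Return (r2, D') from allele frequencies p_a, p_b and haplotype frequency p_ab."""
    d = p_ab - p_a * p_b
    r2 = d ** 2 / (p_a * (1 - p_a) * p_b * (1 - p_b))
    # Normalizer valid when p_a <= p_b and d >= 0: the largest possible
    # p_ab is p_a, so the largest possible D is p_a - p_a * p_b
    d_prime = d / (p_a - p_a * p_b)
    return r2, d_prime

r2, d_prime = ld_stats(p_a=0.25, p_b=0.4, p_ab=0.25)
print(f"r2 = {r2:.2f}, D' = {d_prime:.2f}")
```

This shows why D′ = 1 does not imply strong correlation: D′ reaches its maximum whenever the rarer allele always appears with the other allele, while r2 also depends on how well the two allele frequencies match.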

$$ D' = \frac{p_{AB} - p_A p_B}{p_A - p_A p_B} \le 1 $$

$$ r^2 = \frac{(p_{AB} - p_A p_B)^2}{p_A(1-p_A)\,p_B(1-p_B)} \le \frac{p_A - p_A p_B}{p_B - p_A p_B} $$

Page 130: EPI 511, Advanced Population and Medical Genetics

LD with unphased diploid genotypes

Slatkin 2008 Nat Rev Genet

Consider two SNPs with frequencies pA and pB of alleles A, B.

Let gA refer to # copies (0, 1, 2) of allele A for the first SNP.

Let gB refer to # copies (0, 1, 2) of allele B for the second SNP.

$$ r^2 = \frac{[E(g_A g_B) - E(g_A)E(g_B)]^2}{\mathrm{Var}(g_A)\,\mathrm{Var}(g_B)} = \ldots $$

$$ D' = \frac{p_{AB} - p_A p_B}{p_A - p_A p_B} \le 1 $$

D′ cannot be directly computed,

since pAB relies on phased data!

Page 131: EPI 511, Advanced Population and Medical Genetics

Approaches to Scientific Understanding

Love is Understanding.

-- Madonna

Data is Understanding.

-- Alkes

Page 132: EPI 511, Advanced Population and Medical Genetics

Linkage Disequilibrium: Haplotype Blocks

Slatkin 2008 Nat Rev Genet

[Figure: haplotype blocks in a 216kb region (MHC, chr 6). x-axis = y-axis = SNP position in region (0 kb to 200 kb).]

D′ and L are measures of LD

(related to r2)

Red indicates high LD

Black indicates low LD

Also see Haploview program, Barrett et al. 2005 Bioinformatics

Page 133: EPI 511, Advanced Population and Medical Genetics

Linkage Disequilibrium: Haplotype Blocks

Europeans

and Asians

Africans

Gabriel et al. 2002 Science

also see Reich 2001 Nature, Daly 2001 Nat Genet

Page 134: EPI 511, Advanced Population and Medical Genetics

Linkage Disequilibrium: Haplotype Blocks

African chromosomes: 50% of the genome lies in

haplotype blocks >22kb.

Europeans and Asians: 50% of the genome lies in

haplotype blocks >44kb.

Longer haplotype blocks in Europeans/Asians due to

out-of-Africa population bottleneck: descended from

small number of ancestors who left Africa 60-40 kya.

Gabriel et al. 2002 Science

also see Reich 2001 Nature, Daly 2001 Nat Genet

Page 135: EPI 511, Advanced Population and Medical Genetics

A brief history of modern humans

Cavalli-Sforza & Feldman 2003 Nat Genet; also see Ramachandran et al. 2005 PNAS,

Mellars 2006 Science, Armitage et al. 2011 Science, Henn et al. 2012 PNAS

Page 136: EPI 511, Advanced Population and Medical Genetics

A brief history of modern humans, contradicted

Green et al. 2010 Science, Reich et al. 2010 Nature, Meyer et al. 2012 Science,

Sankararaman et al. 2014 Nature, Vernot & Akey 2014 Science

reviewed in Racimo et al. 2015 Nat Rev Genet

• All non-African populations have ~2% of their genomes

descended from Neanderthals.

• Melanesian populations have ~5% of their genomes

descended from Denisovans, a relative of Neanderthals.

Page 137: EPI 511, Advanced Population and Medical Genetics

Population bottlenecks increase LD

population

bottleneck

population

bottleneck

Cavalli-Sforza & Feldman 2003 Nat Genet; also see Ramachandran et al. 2005 PNAS,

Mellars 2006 Science, Armitage et al. 2011 Science, Henn et al. 2012 PNAS

Page 138: EPI 511, Advanced Population and Medical Genetics

Population bottlenecks increase LD

[Figure: 0/1 alleles for individuals 1-8, two chromosomes each, with SNP 2 and SNP 3 marked.]

SNP 2 and SNP 3: r2=0, NOT in LD

r2 is squared correlation

Page 139: EPI 511, Advanced Population and Medical Genetics

Population bottlenecks increase LD

due to subsampling haplotypes (genetic drift)

[Figure: 0/1 alleles for individuals 1-8, two chromosomes each, with SNP 2 and SNP 3 marked.]

SNP 2 and SNP 3: r2=0, NOT in LD

r2 is squared correlation

Page 140: EPI 511, Advanced Population and Medical Genetics

Population bottlenecks increase LD

due to subsampling haplotypes (genetic drift)

[Figure: a subset of the haplotypes from the previous slide, with SNP 2 and SNP 3 marked.]

SNP 2 and SNP 3: r2=0.5, partial LD

Page 141: EPI 511, Advanced Population and Medical Genetics

Population bottlenecks increase LD

due to subsampling haplotypes (genetic drift)

[Figure: 0/1 alleles for individuals 1-8, two chromosomes each, with SNP 2 and SNP 3 marked.]

SNP 2 and SNP 3: r2=0.5, partial LD

r2 is squared correlation

Page 142: EPI 511, Advanced Population and Medical Genetics

Population bottlenecks increase LD

Conrad et al. 2006 Nat Genet

Average number of haplotypes per genomic region
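The effect of subsampling haplotypes can be illustrated with a toy simulation. This is a sketch under simplifying assumptions that are not from the lecture: two SNPs that are independent in the source population (r2 = 0), each with a hypothetical allele frequency of 0.5, and a bottleneck modeled as drawing a fixed number of founder haplotypes.

```python
import random

def r_squared(x, y):
    """Squared correlation; returns None if either SNP is monomorphic."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    if var_x == 0 or var_y == 0:
        return None
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return cov ** 2 / (var_x * var_y)

def mean_r2_after_bottleneck(n_founders, n_trials, rng):
    """Average r2 between two independent SNPs among n_founders sampled haplotypes."""
    total, used = 0.0, 0
    for _ in range(n_trials):
        # Each founder haplotype draws its two alleles independently
        snp_a = [int(rng.random() < 0.5) for _ in range(n_founders)]
        snp_b = [int(rng.random() < 0.5) for _ in range(n_founders)]
        r2 = r_squared(snp_a, snp_b)
        if r2 is not None:  # skip draws where a SNP was lost entirely
            total += r2
            used += 1
    return total / used

rng = random.Random(1)
small = mean_r2_after_bottleneck(n_founders=10, n_trials=2000, rng=rng)
large = mean_r2_after_bottleneck(n_founders=1000, n_trials=200, rng=rng)
print(f"mean r2 with 10 founder haplotypes:   {small:.3f}")
print(f"mean r2 with 1000 founder haplotypes: {large:.4f}")
```

With few founders, chance alone produces appreciable r2 between SNPs that were uncorrelated before the bottleneck; with many founders the average r2 stays close to zero.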

Page 143: EPI 511, Advanced Population and Medical Genetics

Outline

1. Introduction to Linkage Disequilibrium

2. LD and Tag SNPs

3. LD and imputation

4. LD and fine-mapping

Page 144: EPI 511, Advanced Population and Medical Genetics

Linkage Disequilibrium and tag SNPs

Individuals

Cases Controls

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0

... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

SNP 1: causal SNP

3 billion

letters

Direct association: genotype SNP1 in Cases and Controls.

Page 145: EPI 511, Advanced Population and Medical Genetics

Linkage Disequilibrium and tag SNPs

Individuals

Cases Controls

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0

... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

SNP 1

3 billion

letters

Indirect association: genotype SNP2 in Cases and Controls.

If SNP1 affects disease risk, then SNP2 will also be associated!

SNP 2

r2=1,

in LD

Page 146: EPI 511, Advanced Population and Medical Genetics

Linkage Disequilibrium and tag SNPs

Individuals

Cases Controls

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0

... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

SNP 1

3 billion

letters

Indirect association: genotype SNP3 in Cases and Controls.

If SNP1 affects disease risk, then SNP3 will also be associated!

SNP 3

r2=0.7,

partial

LD

SNP 2

Page 147: EPI 511, Advanced Population and Medical Genetics

Linkage Disequilibrium and tag SNPs

Theorem 2 (Pritchard and Przeworski 2001 Am J Hum Genet):

If SNP1 is causal and LD(SNP1,SNP2) = r2, then

Power of an association study of SNP1 with N samples =

Power of an association study of SNP2 with N/r2 samples.
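As a quick numerical reading of the theorem, the helper below converts a sample size at the causal SNP into the sample size needed at a tag SNP for equal power. The function name and the example numbers are illustrative assumptions.

```python
import math

def samples_needed_for_tag(n_causal, r2):
    """Samples needed at a tag SNP to match the power of n_causal samples at the causal SNP."""
    if not 0 < r2 <= 1:
        raise ValueError("r2 must be in (0, 1]")
    return math.ceil(n_causal / r2)

for r2 in (1.0, 0.7, 0.5):
    n_tag = samples_needed_for_tag(1000, r2)
    print(f"r2 = {r2}: {n_tag} samples at the tag SNP match 1000 at the causal SNP")
```

For example, a tag SNP with r2 = 0.5 to the causal SNP requires twice as many samples.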

Page 148: EPI 511, Advanced Population and Medical Genetics

Linkage Disequilibrium and tag SNPs

Theorem 2 (Pritchard and Przeworski 2001 Am J Hum Genet):

If SNP1 is causal and LD(SNP1,SNP2) = r2, then

Power of an association study of SNP1 with N samples =

Power of an association study of SNP2 with N/r2 samples.

Proof:

Let g1 and g2 be genotypes of SNP1 and SNP2 respectively

and π be phenotype, all normalized to mean 0 and variance 1.

Armitage Trend Test (χ2 = Nρ(g, π)2; Armitage 1955 Biometrics).

Page 149: EPI 511, Advanced Population and Medical Genetics

Linkage Disequilibrium and tag SNPs

Theorem 2 (Pritchard and Przeworski 2001 Am J Hum Genet):

If SNP1 is causal and LD(SNP1,SNP2) = r2, then

Power of an association study of SNP1 with N samples =

Power of an association study of SNP2 with N/r2 samples.

Proof:

Let g1 and g2 be genotypes of SNP1 and SNP2 respectively

and π be phenotype, all normalized to mean 0 and variance 1.

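Armitage Trend Test ($\chi^2 = N\rho(g, \pi)^2$; Armitage 1955 Biometrics), where r denotes the correlation between g1 and g2:

SNP1 with N samples: $N\rho(g_1, \pi)^2 = N\,E(g_1 \pi)^2$

SNP2 with N/r2 samples: $(N/r^2)\,\rho(g_2, \pi)^2 = (N/r^2)\,E(g_2 \pi)^2$

$= (N/r^2)\,E([r g_1 + (g_2 - r g_1)]\,\pi)^2$

$= (N/r^2)\,E(r g_1 \pi)^2 = N\,E(g_1 \pi)^2$. Q.E.D.

(The last step uses $E((g_2 - r g_1)\,\pi) = 0$: the residual $g_2 - r g_1$ is uncorrelated with $g_1$, and because SNP1 is the causal SNP, SNP2 is associated with phenotype only through SNP1.)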

Page 150: EPI 511, Advanced Population and Medical Genetics

Linkage Disequilibrium: Haplotype Blocks

[Figure: haplotypes carried by cases and controls, with the risk haplotype marked.]

Question: Which SNP to genotype?

Answer: Choose 1 SNP per haplotype block,

and take advantage of indirect association!

Page 151: EPI 511, Advanced Population and Medical Genetics

Linkage Disequilibrium: Haplotype Blocks

[Figure: haplotypes carried by cases and controls, with the risk haplotype marked.]

Needed: a resource describing the haplotypes

at each location in the genome.

Page 152: EPI 511, Advanced Population and Medical Genetics

The International HapMap Project: 270 samples from 4 populations

Code   Population   Location   Samples   Relationship
CEU    European     USA        90        30 trios
YRI    Yoruba       Nigeria    90        30 trios
CHB    Chinese      China      45        unrelated
JPT    Japanese     Japan      45        unrelated

Page 153: EPI 511, Advanced Population and Medical Genetics

Genetic differences between populations are small

68% frequency 50% frequency C allele of rs10910034

A allele of rs260509

52% frequency 51% frequency

11kb away on chr 1

Page 154: EPI 511, Advanced Population and Medical Genetics

LD differences between populations are large!

68% frequency 50% frequency C allele of rs10910034

A allele of rs260509

52% frequency 51% frequency

11kb away on chr 1 r2 = 0.97 r2 = 0.34

Page 155: EPI 511, Advanced Population and Medical Genetics

HapMap project: a resource for “SNP tagging”

[Figure: 0/1 alleles for individuals 1-8, two chromosomes each, with SNP 1, SNP 2 and SNP 3 marked.]

SNP1 “tags” this entire haplotype block at an r2 of 0.7

Page 156: EPI 511, Advanced Population and Medical Genetics

HapMap project: a resource for “SNP tagging”

How to select SNPs to genotype in an association study:

• Choose genomic region(s) of interest.

• Look up HapMap SNPs in the genomic region(s).

• Choose a subset of HapMap SNPs which “tag” haplotype

blocks in the genomic region(s).

(e.g. Tagger algorithm, de Bakker et al. 2005 Nat Genet)

Note: because LD patterns vary by population, it is

important to choose tag SNPs using a HapMap population

similar to the population in the association study.
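The selection step can be sketched as a greedy procedure: repeatedly pick the SNP that covers the most still-untagged SNPs at the chosen r2 threshold. This is a simplified illustration and not the Tagger algorithm itself. SNP1-SNP3 reuse the haplotype rows from the phasing slides, SNP4 is a hypothetical extra SNP, and the function names are assumptions.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length allele vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    return cov ** 2 / (var_x * var_y)

def greedy_tags(panel, threshold=0.8):
    """panel: dict mapping SNP name -> list of 0/1 alleles across reference haplotypes."""
    snps = list(panel)
    # For each candidate, the set of SNPs it would tag (every SNP tags itself)
    covers = {
        s: {t for t in snps if s == t or r_squared(panel[s], panel[t]) >= threshold}
        for s in snps
    }
    untagged = set(snps)
    tags = []
    while untagged:
        # Pick the SNP covering the most untagged SNPs (ties go to the first listed)
        best = max(snps, key=lambda s: len(covers[s] & untagged))
        tags.append(best)
        untagged -= covers[best]
    return tags

panel = {
    "SNP1": [0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    "SNP2": [0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    "SNP3": [0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    "SNP4": [1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0],  # hypothetical
}

tags_strict = greedy_tags(panel, threshold=0.8)
tags_loose = greedy_tags(panel, threshold=0.6)
print("tags at r2 >= 0.8:", tags_strict)
print("tags at r2 >= 0.6:", tags_loose)
```

Lowering the threshold lets SNP1 also tag SNP3 (r2 of about 0.7), so fewer tag SNPs are needed, at the cost of power as described by the N/r2 result.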

Page 157: EPI 511, Advanced Population and Medical Genetics

HapMap project: a resource for “SNP tagging”

International HapMap Consortium 2007 Nature; also see Barrett et al. 2006 Nat Genet,

Smith et al. 2006 Genomics, International HapMap Consortium 2005 Nature

How many “tag SNPs” are required?

For the entire genome, the answer is:

Thus, to choose tag SNPs at an r2 of 0.8, we need roughly

1 SNP per 3kb in YRI, or 1 SNP per 5kb in CEU or CHB+JPT

Page 158: EPI 511, Advanced Population and Medical Genetics

Things aren’t always what they seem

Page 159: EPI 511, Advanced Population and Medical Genetics

Things aren’t always what they seem

• Estimating LD using a small number of HapMap samples

may lead to overfitting.

• HapMap SNPs are not a random subset of SNPs.

Page 160: EPI 511, Advanced Population and Medical Genetics

Things aren’t always what they seem

• Estimating LD using a small number of HapMap samples

may lead to overfitting.

• HapMap SNPs are not a random subset of SNPs.

Bhangale et al. 2008 Nat Genet

Page 161: EPI 511, Advanced Population and Medical Genetics

Things aren’t always what they seem

According to International HapMap Consortium 2007 Nature:

82% of common SNPs are tagged at r2 ≥ 0.8 by Affymetrix 6.0

According to Bhangale et al. 2008 Nat Genet:

66% of common SNPs are tagged at r2 ≥ 0.8 by Affymetrix 6.0

Bhangale et al. 2008 Nat Genet

Page 162: EPI 511, Advanced Population and Medical Genetics

Multi-SNP tagging

Haplotype

1 2 3 4 [freq. 25% for each haplotype]

SNP1 A A C C

SNP2 A C C A

SNP3 A C A C

r2=0,

NOT

in LD

(causal)

Page 163: EPI 511, Advanced Population and Medical Genetics

Multi-SNP tagging

Haplotype

1 2 3 4 [freq. 25% for each haplotype]

SNP1 A A C C

SNP2+3 A+A C+C C+A A+C r2=1,

YES

in LD

(causal)
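The four-haplotype example can be verified numerically. In this sketch the alleles are coded A = 0 and C = 1 (an arbitrary coding), and the combined SNP2+3 predictor is represented as "SNP2 and SNP3 alleles differ", which is one way to encode the haplotype pairs A+A, C+C, C+A, A+C.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length allele vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    return cov ** 2 / (var_x * var_y)

# Haplotypes 1-4, each at 25% frequency; A = 0, C = 1
snp1 = [0, 0, 1, 1]  # causal
snp2 = [0, 1, 1, 0]
snp3 = [0, 1, 0, 1]

# Two-SNP predictor: 1 when the SNP2 and SNP3 alleles differ
snp2_plus_3 = [a ^ b for a, b in zip(snp2, snp3)]

r2_single_2 = r_squared(snp1, snp2)
r2_single_3 = r_squared(snp1, snp3)
r2_multi = r_squared(snp1, snp2_plus_3)
print(r2_single_2, r2_single_3, r2_multi)
```

Neither SNP2 nor SNP3 alone carries any information about SNP1, but the pair together determines it exactly.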

Page 164: EPI 511, Advanced Population and Medical Genetics

Multi-SNP tagging

Pe’er et al. 2006 Nat Genet

also see Zaitlen et al. 2007 Am J Hum Genet

Page 165: EPI 511, Advanced Population and Medical Genetics

Outline

1. Introduction to Linkage Disequilibrium

2. LD and Tag SNPs

3. LD and imputation

4. LD and fine-mapping

Page 166: EPI 511, Advanced Population and Medical Genetics

What is imputation?

Marchini et al. 2007 Nat Genet, Howie et al. 2009 PLoS Genet, Li et al. 2010

Genet Epidemiol, Howie et al. 2012 Nat Genet, Fuchsberger et al. 2015 Bioinformatics

Page 167: EPI 511, Advanced Population and Medical Genetics

What is imputation?

Marchini et al. 2007 Nat Genet, Howie et al. 2009 PLoS Genet, Li et al. 2010

Genet Epidemiol, Howie et al. 2012 Nat Genet, Fuchsberger et al. 2015 Bioinformatics

Page 168: EPI 511, Advanced Population and Medical Genetics

Imputation: Why try?

• Increase power to detect disease association at untyped causal SNP

(imputed causal SNP may have stronger association than tag SNP)

Page 169: EPI 511, Advanced Population and Medical Genetics

Imputation: Why try?

r2 = 0.8

Causal SNP

Marchini et al. 2007 Nat Genet, Howie et al. 2009 PLoS Genet, Li et al. 2010

Genet Epidemiol, Howie et al. 2012 Nat Genet, Fuchsberger et al. 2015 Bioinformatics

Page 170: EPI 511, Advanced Population and Medical Genetics

Imputation: Why try?

Causal SNP

Marchini et al. 2007 Nat Genet, Howie et al. 2009 PLoS Genet, Li et al. 2010

Genet Epidemiol, Howie et al. 2012 Nat Genet, Fuchsberger et al. 2015 Bioinformatics

Page 171: EPI 511, Advanced Population and Medical Genetics

Imputation: Why try?

• Increase power to detect disease association at untyped causal SNP

(imputed causal SNP may have stronger association than tag SNP)

Page 172: EPI 511, Advanced Population and Medical Genetics

Imputation: Why try?

• Increase power to detect disease association at untyped causal SNP

(imputed causal SNP may have stronger association than tag SNP)

• Enable meta-analysis of studies on Affymetrix + Illumina chips

Page 173: EPI 511, Advanced Population and Medical Genetics

Imputation: Why try?

• Increase power to detect disease association at untyped causal SNP

(imputed causal SNP may have stronger association than tag SNP)

• Enable meta-analysis of studies on Affymetrix + Illumina chips

• Improve genotype data quality

Page 174: EPI 511, Advanced Population and Medical Genetics

Imputation: Algorithms

Hidden Markov Model (HMM) based approaches:

• IMPUTE (Marchini et al. 2007 Nat Genet, Howie et al. 2009 PLoS Genet,

Howie et al. 2012 Nat Genet)

• MACH (Li et al. 2010 Genet Epidemiol)

• fastPHASE/BIMBAM (Scheet/Stephens 2006 AJHG, Servin/Stephens 2007

PLoS Genet, Guan/Stephens 2008 PLoS Genet)

• GEDI (Kennedy et al. 2008 ISBRA)

Localized Haplotype Clustering:

• BEAGLE (Browning/Browning 2007 AJHG, Browning/Browning 2009 AJHG)

Likelihood-based approaches:

• UNPHASED (Dudbridge 2008 Hum Hered)

• SNPMStat (Lin et al. 2008 AJHG)

reviewed in Marchini et al. 2010 Nat Rev Genet; also see Li et al. 2009 ARGHG

Page 175: EPI 511, Advanced Population and Medical Genetics

Imputation: What do the algorithms output?

Integer-valued genotypes at untyped SNPs

e.g. genotype = 2

OR

Continuous genotype dosages at untyped SNPs

e.g. genotype dosage = 1.79

OR

Continuous genotype probabilities at untyped SNPs

e.g. genotype probabilities P(0) = 0.01, P(1) = 0.19, P(2) = 0.80
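The three output types are related: the dosage is the expected genotype under the genotype probabilities, and the integer-valued call is the most probable genotype. A minimal sketch using the numbers from this slide (function names are illustrative):

```python
def dosage(probs):
    """Expected allele count given [P(0), P(1), P(2)]."""
    return sum(genotype * p for genotype, p in enumerate(probs))

def best_guess(probs):
    """Most probable integer genotype given [P(0), P(1), P(2)]."""
    return max(range(len(probs)), key=lambda genotype: probs[genotype])

probs = [0.01, 0.19, 0.80]
d = dosage(probs)
call = best_guess(probs)
print(f"dosage = {d:.2f}, best-guess genotype = {call}")
```

Here 0(0.01) + 1(0.19) + 2(0.80) = 1.79, matching the dosage example above.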

Page 176: EPI 511, Advanced Population and Medical Genetics

Imputation: People do it.

reviewed in Marchini et al. 2010 Nat Rev Genet; also see Li et al. 2009 ARGHG

Page 177: EPI 511, Advanced Population and Medical Genetics

HMM-based imputation approaches

[Figure: reference haplotypes hap1-hap5 and the imputed sample (Imp.), with untyped positions marked "?".]

reviewed in Marchini et al. 2010 Nat Rev Genet; also see Li et al. 2009 ARGHG

Note: current paradigm is to first phase the data, then run imputation on

phased data (Howie et al. 2012 Nat Genet, Fuchsberger et al. 2015 Bioinformatics)

Page 178: EPI 511, Advanced Population and Medical Genetics

HMM-based imputation approaches

[Figure: reference haplotypes hap1-hap5 and the imputed sample (Imp.).]

reviewed in Marchini et al. 2010 Nat Rev Genet; also see Li et al. 2009 ARGHG

Note: current paradigm is to first phase the data, then run imputation on

phased data (Howie et al. 2012 Nat Genet, Fuchsberger et al. 2015 Bioinformatics)

Page 179: EPI 511, Advanced Population and Medical Genetics

Measuring imputation accuracy

Concordance rate: % of genotypes (or alleles) imputed correctly

• Natural analogue of genotyping error rate in QC analyses

• Concordance rate is often in the range of 95-99%.

Squared correlation (r2) between true and imputed genotype

• Natural analogue of r2 between causal SNP and tag SNP

• r2 << concordance rate, particularly for rare SNPs.

Page 182: EPI 511, Advanced Population and Medical Genetics

Measuring imputation accuracy

Concordance rate: % of genotypes (or alleles) imputed correctly

• Natural analogue of genotyping error rate in QC analyses

• Concordance rate is often in the range of 95-99%.

Squared correlation (r2) between true and imputed genotype

• Natural analogue of r2 between causal SNP and tag SNP

• r2 << concordance rate, particularly for rare SNPs.

Normalized difference between true and imputed allele frequency

• Measures whether imputation is biased towards ref or var allele
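A toy example shows why r2 can be far below the concordance rate for a rare SNP. The genotypes below are hypothetical: 100 individuals, 4 of whom are true heterozygotes, and an imputation that recovers only one of them.

```python
def r_squared(x, y):
    """Squared Pearson correlation between true and imputed genotypes."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    return cov ** 2 / (var_x * var_y)

def concordance(true, imputed):
    """Fraction of genotypes imputed exactly right."""
    return sum(t == i for t, i in zip(true, imputed)) / len(true)

# Hypothetical rare SNP: 4 heterozygotes among 100 individuals
true_genotypes = [1, 1, 1, 1] + [0] * 96
imputed_genotypes = [1, 0, 0, 0] + [0] * 96  # only one carrier recovered

conc = concordance(true_genotypes, imputed_genotypes)
r2 = r_squared(true_genotypes, imputed_genotypes)
print(f"concordance = {conc:.2f}, r2 = {r2:.2f}")
```

Almost all genotypes agree because almost everyone is homozygous for the common allele, yet most of the information about carriers, which is what an association test relies on, has been lost.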

Page 183: EPI 511, Advanced Population and Medical Genetics

Imputation using HapMap data

International HapMap3 Consortium 2010 Nature

common SNPs imputed using HapMap2 CEU (N=120): r2 = 0.95

common SNPs imputed using HapMap3 CEU+TSI (N=410): r2 = 0.96

(European-ancestry WTCCC samples, Affymetrix & Illumina chips)

Page 185: EPI 511, Advanced Population and Medical Genetics

Imputation using HapMap data

International HapMap3 Consortium 2010 Nature

x-axis: MAF<5% SNPs, imputed using HapMap2 CEU (N=120)

y-axis: MAF<5% SNPs, imputed using HapMap3 CEU+TSI (N=410)

Page 189: EPI 511, Advanced Population and Medical Genetics

Low-coverage sequencing + imputation increases power vs. genotyping arrays

Effective sample size of a GWAS with a $300,000 budget:

Platform            Cost per sample   Actual #samples   Average imputation r2   Effective #samples
Illumina 1M array   $400              750               1.00                    750
0.4x sequencing     $83*              3,600             0.81**                  2,900
0.1x sequencing     $43*              7,000             0.64**                  4,500

Pasaniuc et al. 2012 Nat Genet; also see Cai et al. 2015 Nature, Davies et al. 2016 Nat Genet

*Based on sample preparation cost of $30/sample, which is conservatively double the $15/sample reported by Rohland & Reich 2012 Genome Res, and on $133 per 1x sequencing (Illumina Network cost).

**Imputation r2 attained at Illumina 1M SNPs by downsampling reads from real off-target exome sequencing data. Relative performance of low-coverage sequencing will be even higher at non-Illumina 1M SNPs.
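
The arithmetic behind the table can be reproduced with a short sketch. It assumes that cost per sample is the preparation cost plus coverage times the per-1x sequencing cost, and that effective sample size is actual sample size times average imputation r2, which is how the table's columns appear to be derived. The function and variable names are illustrative.

```python
BUDGET = 300_000     # total GWAS budget in dollars (from the slide)
PREP_COST = 30       # sample preparation cost per sample (footnote *)
COST_PER_1X = 133    # sequencing cost per 1x coverage (footnote *)


def sequencing_cost(coverage):
    """Per-sample cost of low-coverage sequencing at the given coverage."""
    return PREP_COST + coverage * COST_PER_1X


def sample_sizes(cost_per_sample, imputation_r2, budget=BUDGET):
    """Return (actual N, effective N), assuming effective N = N * r2."""
    actual = budget / cost_per_sample
    effective = actual * imputation_r2
    return actual, effective


# (cost per sample, average imputation r2) for each design on the slide.
designs = {
    "Illumina 1M array": (400, 1.00),
    "0.4x sequencing": (sequencing_cost(0.4), 0.81),
    "0.1x sequencing": (sequencing_cost(0.1), 0.64),
}

results = {}
for name, (cost, r2) in designs.items():
    actual, effective = sample_sizes(cost, r2)
    results[name] = (cost, actual, effective)
    print(f"{name}: ${cost:.0f}/sample, N = {actual:.0f}, effective N = {effective:.0f}")
```

The unrounded results agree with the table up to the rounding used on the slide: the 0.1x design in particular yields roughly six times the effective sample size of the array for the same budget.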

Page 190: EPI 511, Advanced Population and Medical Genetics

Outline

1. Introduction to Linkage Disequilibrium

2. LD and Tag SNPs

3. LD and imputation

4. LD and fine-mapping (to be continued, Tue of Week 4)

Page 191: EPI 511, Advanced Population and Medical Genetics

Definition of fine-mapping

Manhattan plot from Ikram et al. 2010 PLoS Genet

Which of these SNPs on chr 6 is the biologically causal SNP?

(Ditto for chr 5, 8, 12, 19)

Page 192: EPI 511, Advanced Population and Medical Genetics

WTCCC fine-mapping study

Maller et al. 2012 Nat Genet

Page 193: EPI 511, Advanced Population and Medical Genetics

LD and fine-mapping in Europeans

GWAS in Europeans

SNP1: P-value = 10^-8

Page 194: EPI 511, Advanced Population and Medical Genetics

TCF7L2 locus in T2D: 1 top signal

Maller et al. 2012 Nat Genet

Page 195: EPI 511, Advanced Population and Medical Genetics

LD and fine-mapping in Europeans

Fine-mapping in Europeans

SNP1: P-value = 10^-8 CAUSAL??

SNP2: P-value = 10^-8 CAUSAL??

Page 196: EPI 511, Advanced Population and Medical Genetics

FTO locus in T2D: many top signals

Maller et al. 2012 Nat Genet

Page 197: EPI 511, Advanced Population and Medical Genetics

LD and cross-population fine-mapping

SNP    P-value (Europeans)   P-value (Africans)
SNP1   10^-8                 10^-5                CAUSAL
SNP2   10^-8                 0.62
SNP3   0.41                  10^-5

LD in Europeans

r2     SNP1   SNP2   SNP3
SNP1   1.00   0.99   0.08
SNP2   0.99   1.00   0.07
SNP3   0.08   0.07   1.00

LD in Africans

r2     SNP1   SNP2   SNP3
SNP1   1.00   0.12   0.98
SNP2   0.12   1.00   0.14
SNP3   0.98   0.14   1.00
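
The reasoning on this slide can be sketched in a few lines of code. The P-values and r2 values are copied from the slide. The two cutoffs (`P_THRESHOLD`, `R2_THRESHOLD`) and the function names are hypothetical choices made only for illustration, and intersecting the associated SNPs across populations is a crude heuristic rather than the statistical method of the cited papers.

```python
# P-values and pairwise r2 values copied from the slide.
P_VALUES = {
    "Europeans": {"SNP1": 1e-8, "SNP2": 1e-8, "SNP3": 0.41},
    "Africans": {"SNP1": 1e-5, "SNP2": 0.62, "SNP3": 1e-5},
}
R2 = {
    "Europeans": {("SNP1", "SNP2"): 0.99, ("SNP1", "SNP3"): 0.08, ("SNP2", "SNP3"): 0.07},
    "Africans": {("SNP1", "SNP2"): 0.12, ("SNP1", "SNP3"): 0.98, ("SNP2", "SNP3"): 0.14},
}

P_THRESHOLD = 1e-4    # hypothetical cutoff for "associated" (an assumption)
R2_THRESHOLD = 0.9    # hypothetical cutoff for "indistinguishable by LD" (an assumption)


def associated_snps(population):
    """SNPs whose P-value passes the cutoff in one population."""
    return {snp for snp, p in P_VALUES[population].items() if p <= P_THRESHOLD}


def high_ld_pairs(population):
    """SNP pairs in such strong LD that one population cannot separate them."""
    return [pair for pair, r2 in R2[population].items() if r2 >= R2_THRESHOLD]


for pop in P_VALUES:
    print(pop, sorted(associated_snps(pop)), high_ld_pairs(pop))

# Only a SNP associated in both populations remains a causal candidate.
shared = associated_snps("Europeans") & associated_snps("Africans")
print("Associated in both populations:", sorted(shared))
```

Within Europeans, SNP1 and SNP2 cannot be separated (r2 = 0.99). Within Africans, SNP1 and SNP3 cannot be separated (r2 = 0.98). Only SNP1 is associated in both populations, which matches the slide's CAUSAL label.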

Page 199: EPI 511, Advanced Population and Medical Genetics

LD and multi-ethnic fine-mapping

Zaitlen*, Pasaniuc* et al. 2010 Am J Hum Genet

also see Morris 2011 Genet Epidemiol, Udler et al. 2009 Hum Mol Genet, Wu et al. 2013 PLoS Genet, Peters et al. 2013 PLoS Genet, Liu et al. 2016 Am J Hum Genet

Page 200: EPI 511, Advanced Population and Medical Genetics

Conclusions

• Linkage Disequilibrium is good, because we can tag most common SNPs using chips with 1,000,000 SNPs or fewer.

• Linkage Disequilibrium is good, because we can impute genotypes at most common HapMap SNPs.

• Linkage Disequilibrium is bad, because it leads to ambiguity as to the causal SNP when doing fine-mapping.

• Studying multiple populations, especially Africans (low LD), can improve our ability to localize causal variants.
