peter kraft [email protected] bldg 2 rm 207 2-4271

EPI293: Design and analysis of gene association studies

Winter Term 2008

Lecture 2: Patterns of LD and “tag SNP” selection

Peter Kraft, [email protected]

Bldg 2 Rm 207, 2-4271

Study 1: Pop’n A, small N, no assoc’n

Study 2: Pop’n A, large N, no assoc’n

Study 3: Pop’n B, large N, assoc’n

Before HapMap: “looking under lamppost”

After HapMap

Study 2 revisited: Pop’n A, large N, assoc’n

Outline

• Measures of linkage disequilibrium
• Reasons for LD and empirical patterns of LD
• “Tagging” SNPs
• The HapMap project
• Resources and tools for SNP selection


[Figure: 16 chromosomes sampled from the population, each shown with its alleles at two loci: 8 carry A G, 6 carry a g, 2 carry A g, none carry a G]

Basic idea: linkage disequilibrium

Alleles at two (or more) loci are correlated on chromosomes drawn at random from the population

Measures of linkage disequilibrium

• Basic data: table of haplotype frequencies

[Figure: 16 sampled chromosomes (8 A G, 6 a g, 2 A g, 0 a G), tabulated by haplotype]

         A        a
G        8        0       50%
g        2        6       50%
       62.5%    37.5%

Linkage disequilibrium and marginal allele freqs.

         A                           a
G        $p_A p_G + \delta = x$      $q_A p_G - \delta = y$      $p_G$
g        $p_A q_G - \delta = w$      $q_A q_G + \delta = z$      $q_G$
         $p_A$                       $q_A$                       1

• $p_A$ and $p_G$ are (minor) allele frequencies
– $q_A = 1 - p_A$; $q_G = 1 - p_G$
• $\delta = xz - yw$ is a measure of departure from independence
– No association between A and G ⇒ $\delta = 0$
– $\max(\delta) = \min(p_A q_G,\ p_G q_A)$

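         A               a
G        $n_{11}$        $n_{10}$        $n_{1\cdot}$
g        $n_{01}$        $n_{00}$        $n_{0\cdot}$
         $n_{\cdot 1}$   $n_{\cdot 0}$

Measure: D’. Ref.: Lewontin (1964)
$D' = \frac{n_{11} n_{00} - n_{10} n_{01}}{\min(n_{1\cdot} n_{\cdot 0},\ n_{0\cdot} n_{\cdot 1})}$

Measure: $\Delta^2 = r^2$. Ref.: Hill and Weir (1994)
$r^2 = \frac{(n_{11} n_{00} - n_{10} n_{01})^2}{n_{1\cdot}\, n_{0\cdot}\, n_{\cdot 1}\, n_{\cdot 0}}$

Measure: $\delta^*$. Ref.: Levin (1953)
[formula not recoverable from the extracted slide]

Measure: odds ratio. Ref.: Edwards (1963)
$\frac{n_{11} n_{00}}{n_{10} n_{01}}$

Measure: Q. Ref.: Yule (1900)
$Q = \frac{n_{11} n_{00} - n_{10} n_{01}}{n_{11} n_{00} + n_{10} n_{01}}$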

|D’| and r2 are most common

• D prime …
– …ranges from 0 [no LD] to 1 [complete LD]…
– …is less sensitive to marginal allele frequencies…
– …is directly related to recombination fraction
• R squared…
– …also ranges from 0 to 1…
– …is the squared correlation between alleles on the same chromosome…
– …is very sensitive to marginal allele frequencies…
– …is directly related to study power
   • If a marker M and causal gene G are in LD, then a study with N cases and controls which measures M (but not G) will have the same power to detect an association as a study with r² N cases and controls that directly measured G
   • r² N is the “effective sample size”
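The effective-sample-size rule can be turned into a quick planning calculation. The sketch below is illustrative only: the function names and the example numbers are assumptions, not part of the lecture.

```python
def effective_sample_size(n_typed, r2):
    """Sample size of a study typing the causal variant directly that has
    the same power as n_typed subjects typed at a marker with LD r2."""
    if not 0.0 <= r2 <= 1.0:
        raise ValueError("r2 must lie in [0, 1]")
    return r2 * n_typed


def required_sample_size(n_direct, r2):
    """Subjects needed at the marker to match a direct study of n_direct."""
    if not 0.0 < r2 <= 1.0:
        raise ValueError("r2 must lie in (0, 1]")
    return n_direct / r2


# Hypothetical example: 2000 subjects typed at a marker with r2 = 0.5
print(effective_sample_size(2000, 0.5))
print(required_sample_size(1000, 0.5))
```

Read the other way round, a tag with r² = 0.8 to the causal variant calls for 1/0.8 = 1.25 times as many subjects to keep power unchanged.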

Worked example (haplotype counts: A G = 8, a G = 0, A g = 2, a g = 6):

         A        a
G        8        0       50%
g        2        6       50%
       62.5%    37.5%

$D' = \frac{8 \times 6 - 0 \times 2}{8 \times 6} = 1$

$r^2 = \frac{(8 \times 6 - 0 \times 2)^2}{10 \times 6 \times 8 \times 8} = 0.6$
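The same calculation can be sketched in code. This is a minimal illustration, assuming haplotype counts are already known; the function name `ld_measures` is hypothetical, and |D’| is returned so that negative association is handled as well.

```python
def ld_measures(n_AG, n_aG, n_Ag, n_ag):
    """Return (delta, |D'|, r^2) from the four haplotype counts."""
    n = n_AG + n_aG + n_Ag + n_ag
    p_A = (n_AG + n_Ag) / n          # frequency of allele A
    p_G = (n_AG + n_aG) / n          # frequency of allele G
    q_A, q_G = 1 - p_A, 1 - p_G
    delta = n_AG / n - p_A * p_G     # departure from independence
    # Largest |delta| compatible with the margins depends on its sign
    if delta >= 0:
        d_max = min(p_A * q_G, p_G * q_A)
    else:
        d_max = min(p_A * p_G, q_A * q_G)
    d_prime = abs(delta) / d_max if d_max > 0 else 0.0
    denom = p_A * q_A * p_G * q_G
    r2 = delta ** 2 / denom if denom > 0 else 0.0
    return delta, d_prime, r2


# Counts from the slide: AG = 8, aG = 0, Ag = 2, ag = 6
delta, d_prime, r2 = ld_measures(8, 0, 2, 6)
print(delta, d_prime, round(r2, 3))
```

With the slide's counts this reproduces D’ = 1 and r² = 0.6: one haplotype is absent, so D’ is at its maximum, while r² stays below 1 because the allele frequencies at the two loci differ.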

Computational detail

• Haplotypes are rarely directly observed
• Have to infer from genotype data
– Genotypes consistent with haplotype pairs
• Most popular algorithm: Expectation Maximization¹
• Related to, but not exactly equal to, 3x3 table of genotypes

[Figure: a double heterozygote (Aa, Gg) is consistent with two haplotype pairs: A G with a g, or A g with a G]

¹ Thomas pp. 243-245
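For two SNPs the EM idea fits in a few lines, because only double heterozygotes are ambiguous. The sketch below is an illustration under stated assumptions (random mating, a 3x3 genotype count table indexed by copies of A and copies of G); it is not the textbook's or any package's implementation, and the function name is hypothetical.

```python
def em_haplotype_freqs(g, n_iter=200, tol=1e-12):
    """Estimate two-SNP haplotype frequencies by EM.

    g[i][j] = number of subjects with i copies of A and j copies of G.
    """
    n_dh = g[1][1]  # double heterozygotes: AG/ag or Ag/aG
    # Haplotype counts contributed by unambiguous genotypes
    base = {
        "AG": 2 * g[2][2] + g[2][1] + g[1][2],
        "Ag": 2 * g[2][0] + g[2][1] + g[1][0],
        "aG": 2 * g[0][2] + g[1][2] + g[0][1],
        "ag": 2 * g[0][0] + g[1][0] + g[0][1],
    }
    total = 2 * sum(sum(row) for row in g)  # number of chromosomes
    freqs = {h: 0.25 for h in base}
    for _ in range(n_iter):
        # E-step: probability that a double heterozygote is AG/ag
        cis = freqs["AG"] * freqs["ag"]
        trans = freqs["Ag"] * freqs["aG"]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: add expected counts and renormalise
        new = {
            "AG": (base["AG"] + n_dh * w) / total,
            "ag": (base["ag"] + n_dh * w) / total,
            "Ag": (base["Ag"] + n_dh * (1 - w)) / total,
            "aG": (base["aG"] + n_dh * (1 - w)) / total,
        }
        converged = max(abs(new[h] - freqs[h]) for h in base) < tol
        freqs = new
        if converged:
            break
    return freqs


# Hypothetical data: 3 aa/gg, 2 Aa/Gg and 3 AA/GG subjects
counts = [[3, 0, 0], [0, 2, 0], [0, 0, 3]]
print(em_haplotype_freqs(counts))
```

In this toy data set the unambiguous subjects carry only A G and a g, so the EM assigns the double heterozygotes almost entirely to the A G / a g configuration.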

3x3 table of genotypes, scored as allele counts:

          BB=2    Bb=1    bb=0
AA=2
Aa=1
aa=0

Correlation from this table makes no assumptions about HWE (Weir, Genetic Data Analysis)
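A genotype-based correlation of this kind can be computed directly from the 3x3 table by treating each genotype as an allele count (0, 1 or 2). The sketch below is one plain way to do that; the function name and the example tables are assumptions for illustration.

```python
def genotype_r2(table):
    """Squared Pearson correlation between allele counts at two loci.

    table[i][j] = number of subjects with i copies of A and j copies of B.
    """
    n = sum(sum(row) for row in table)
    cells = [(i, j, table[i][j]) for i in range(3) for j in range(3)]
    mean_x = sum(i * c for i, _, c in cells) / n
    mean_y = sum(j * c for _, j, c in cells) / n
    var_x = sum(c * (i - mean_x) ** 2 for i, _, c in cells) / n
    var_y = sum(c * (j - mean_y) ** 2 for _, j, c in cells) / n
    cov = sum(c * (i - mean_x) * (j - mean_y) for i, j, c in cells) / n
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic locus carries no information
    return cov ** 2 / (var_x * var_y)


# Hypothetical tables: perfectly correlated and independent genotypes
print(genotype_r2([[3, 0, 0], [0, 2, 0], [0, 0, 3]]))
print(genotype_r2([[1, 1, 1], [1, 1, 1], [1, 1, 1]]))
```

No phase is inferred here, which is why the calculation does not rely on Hardy-Weinberg proportions.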


Why does LD exist?

1. “Recombination coldspots”
2. Demographics (e.g. bottlenecks)
3. Population stratification or admixture
• Confounds gene-disease association
• Does not decay with distance

(among other reasons… selective pressure … etc.)

Decay of LD in Pictures

Decay of LD: $\delta_T = \delta_0 (1 - \theta)^T$

[Figure: delta (0 to 1) plotted against theta (0.05 to 0.25), with one curve each for 1, 5, 10, 20, 40 and 80 generations]
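The decay formula is easy to tabulate. The sketch below evaluates it for the recombination fractions and generation counts shown on the plot axes; the function name is an assumption, and the starting value of 1 is chosen only so the output reads as a fraction of the initial LD.

```python
def ld_after(delta0, theta, generations):
    """LD remaining after a number of generations of random mating,
    for recombination fraction theta between the two loci."""
    return delta0 * (1 - theta) ** generations


for theta in (0.05, 0.10, 0.15, 0.20, 0.25):
    row = [round(ld_after(1.0, theta, t), 3) for t in (1, 5, 10, 20, 40, 80)]
    print(theta, row)
```

Even for theta = 0.05, little of the initial LD is left after 80 generations, which is why long-range LD in old, randomly mating populations is rare.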

200 kbp from chr2, positions 51,783,239 to 51,983,238

Data from the ENCODE project: http://www.hapmap.org/downloads/encode1.html.en

Implications

• Admixture can lead to false positives
– Two unlinked loci can stay in LD
– Recent admixture, continual gene flow problematic
• Isolated populations have advantages for fine-mapping
– LD extends long distances, so fewer markers need be typed
– But resolution may be poor

Knowledge of local LD structure is essential for candidate gene studies !



Basic “tagging” design

Measure haplotypes/LD pattern in a subsample

(often external database)

Choose subset of SNPs (“tagSNPs”) that contain majority of information

Genotype “tagSNPs” in main study, analyze appropriately

Over 750 known SNPs – at least 50 are common in Europeans

ATM

“block” = region of limited haplotype diversity and/or high LD

But there are unappealing aspects of the “haplotype block” idea

• Definition and “block finding” algorithms are ad hoc
– Different definitions and algorithms lead to different block structures
– Block structure changes with sample size, marker density
• “Hard boundaries” are…
– …unappealing for tagSNP selection (what about “between blocks”?)…
– …an inaccurate description of LD patterns (some haplotypes overlap boundaries)
• Plus, haplotypes present analytic challenges

[Wall & Pritchard (2003a) Nat Rev Genet 4:587; (2003b) AJHG 73:502]
[Nothnagel and Rohde (2005) AJHG 77:988]

Keep it simple

• We want SNPs that predict unobserved variants
• Why not choose SNPs based on pairwise correlations?

• Q: What if we don’t know enough about common genetic variation to say we’ve captured it?

• A: HapMap and resequencing projects

[Figure: six SNPs (1: A/T, 2: G/A, 3: G/C, 4: T/C, 5: G/C, 6: A/C) shown with their observed haplotypes; three groups of SNPs are marked as being in high r²]



HapMap:

application in the design and interpretation of association studies

Mark J. Daly, PhD on behalf of

The International HapMap Consortium

[OK it may look like I’m totally stealing these slides—but they are free on the web at http://www.hapmap.org/tutorials.html.en]

Goals of this segment

• Briefly summarize HapMap design and current status

• Discuss the application of HapMap to all aspects of association study design, analysis and interpretation

HapMap Project

High-density SNP genotyping across the genome provides information about
– SNP validation, frequency, assay conditions
– correlation structure of alleles in the genome

A freely-available public resource to increase the power and efficiency of genetic association studies to medical traits

All data is freely available on the web for application in study design and analyses as researchers see fit

HapMap Samples

• 90 Yoruba individuals (30 parent-parent-offspring trios) from Ibadan, Nigeria (YRI)

• 90 individuals (30 trios) of European descent from Utah (CEU)

• 45 Han Chinese individuals from Beijing (CHB)

• 45 Japanese individuals from Tokyo (JPT)

HapMap progress

PHASE I – completed, described in Nature paper

* 1,000,000 SNPs successfully typed in all 270 HapMap samples

* ENCODE variation reference resource available

PHASE II – data generation complete, data released early November 2005

* >3,500,000 SNPs typed in total !!!

Frazer, K. A., D. G. Ballinger, D. R. Cox, D. A. Hinds, L. L. Stuve, R. A. Gibbs, J. W. Belmont, A. Boudreau, P. Hardenbol, S. M. Leal, S. Pasternak, D. A. Wheeler, et al. (2007). "A second generation human haplotype map of over 3.1 million SNPs." Nature 449(7164): 851-61.

ENCODE-HAPMAP variation project

• Ten “typical” 500kb regions

• 48 samples sequenced

• All discovered SNPs (and those in dbSNP) typed in all 270 HapMap samples

• Current data set – 1 SNP every 279 bp

A much more complete variation resource by which the genome-wide map can be evaluated

Completeness of dbSNP

Vast majority of common SNPs are contained in or highly correlated with a SNP in dbSNP

Recombination hotspots are widespread and account for LD structure

7q21

Coverage of Phase II HapMap (estimated from ENCODE data)

From Table 6 – “A Haplotype Map of the Human Genome”, Nature

Panel      % r² > 0.8    max r²
YRI        81            0.90
CEU        94            0.97
CHB+JPT    94            0.97

Vast majority of common variation (MAF > .05) captured by Phase II HapMap

Applying the HapMap

• Study design - tagging
• Study coverage evaluation
• Study analysis - improving association testing
• Study interpretation
– Comparison of multiple studies
– Connection to genes/genomic features
– Integration with expression and other functional data
• Other uses of HapMap data
– Admixture, LOH, selection

Tagging from HapMap

• Since HapMap describes the majority of common variation in the genome, choosing non-redundant sets of SNPs from HapMap offers considerable efficiency without power loss in association studies

Pairwise tagging

Tags:
• SNP 1
• SNP 3
• SNP 6
3 in total

Test for association:
• SNP 1
• SNP 3
• SNP 6

[Figure: six SNPs (1: A/T, 2: G/A, 3: G/C, 4: T/C, 5: G/C, 6: A/C) and their haplotypes, with three groups of SNPs in high r²]

After Carlson et al. (2004) AJHG 74:106
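Pairwise tagging can be sketched as a greedy procedure: repeatedly pick the SNP that captures the most still-untagged SNPs at the chosen r² threshold. The code below is a simplified illustration in that spirit, not the published algorithm or Haploview's implementation; the function names and the four haplotypes are hypothetical.

```python
def pairwise_r2(haps, i, j):
    """r^2 between biallelic SNPs i and j across phased haplotypes."""
    n = len(haps)
    # Code the allele seen on the first haplotype as 1, the other as 0
    a = [1 if h[i] == haps[0][i] else 0 for h in haps]
    b = [1 if h[j] == haps[0][j] else 0 for h in haps]
    pa, pb = sum(a) / n, sum(b) / n
    pab = sum(x * y for x, y in zip(a, b)) / n
    denom = pa * (1 - pa) * pb * (1 - pb)
    if denom == 0:
        return 0.0
    return (pab - pa * pb) ** 2 / denom


def greedy_tags(haps, threshold=0.8):
    """Return 0-based indices of tag SNPs chosen greedily."""
    m = len(haps[0])
    captures = {
        i: {j for j in range(m)
            if i == j or pairwise_r2(haps, i, j) >= threshold}
        for i in range(m)
    }
    untagged, tags = set(range(m)), []
    while untagged:
        # Ties are broken in favour of the lowest SNP index
        best = max(sorted(untagged), key=lambda i: len(captures[i] & untagged))
        tags.append(best)
        untagged -= captures[best]
    return tags


# Hypothetical haplotypes over six SNPs (columns are SNP 1 to SNP 6)
haplotypes = ["AGGTGA", "AGCCCC", "TAGCGC", "TACCCC"]
print([t + 1 for t in greedy_tags(haplotypes)])
```

In this toy example SNPs 1 and 2, SNPs 3 and 5, and SNPs 4 and 6 are perfectly correlated, so three tags are enough; which member of each pair is reported depends only on the tie-breaking rule.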

Pairwise Tagging Efficiency

Table 7 Number of selected tag SNPs to capture all observed common SNPs in the Phase I HapMap for the three analysis panels using pairwise tagging at different r2 thresholds

Pairwise threshold    YRI        CEU        CHB+JPT
r² ≥ 0.5              324,865    178,501    159,029
r² ≥ 0.8              474,409    293,835    259,779
r² = 1                604,886    447,579    434,476

Tag SNPs were picked to capture common SNPs in release 16c.1 for every 7,000 SNP bin using Haploview.

Tagging Phase I HapMap offers 2-5x gains in efficiency


Use of haplotypes can improve genotyping efficiency

Tags:
• SNP 1
• SNP 3
2 in total

• SNP 1 captures 1+2
• SNP 3 captures 3+5
• “AG” haplotype captures SNP 4+6

[Figure: six SNPs (1: A/T, 2: G/A, 3: G/C, 4: T/C, 5: G/C, 6: A/C) and their haplotypes]

Tags in a multi-marker test should be conditional on significance of LD in order to avoid overfitting

Efficiency and power

[Figure: relative power (%) versus average marker density (per kb), for tag SNPs and for random SNPs]

P.I.W. de Bakker et al. (2005) Nat Genet Advance Online Publication 23 Oct 2005

~300,000 tag SNPs needed to cover common variation in whole genome in CEU

Will tag SNPs picked from HapMap apply to other population samples?

Two issues:
• What if LD structure strongly differs between my samples and the HapMap samples?
– Are CEU or YRI panels good surrogates for Latinos from Los Angeles? Are CEU samples even good surrogates for whites from France?
• Is HapMap sample size sufficient?
– Small-sample correlation is overestimated; are tagging algorithms “overfitting” the sample?

PK slide

Will tag SNPs picked from HapMap apply to other population samples?

Population differences add very little inefficiency
Paul de Bakker, Pac Symp Biocomput 2006

[Figure panels: CEU; Whites from Los Angeles, CA; Botnia, Finland. CEU = Utah residents with European ancestry (CEPH)]

De Bakker et al (2006) Nat Genet

Need and Goldstein (2006) Nat Genet

Impact of training set sample size

Tags chosen as pairwise tags

Tags chosen as multimarker tags(up to 6 markers)

Zeggini et al Nature Genetics 37, 1320 - 1322 (2005)

Impact of training set sample size

Tags chosen for common variants

Tags chosen for common and rare variants

Zeggini et al Nature Genetics 37, 1320 - 1322 (2005)



Public sources of SNP data

• Candidate genes
– “Seattle SNPs” http://pga.gs.washington.edu/ *
– Environmental Genome Project http://egp.gs.washington.edu/ *
– IIPGA http://innateimmunity.net/IIPGA2/index_html *
– HAPMAP http://www.hapmap.org/
– BPC3 http://www.uscnorris.com/MECGenetics/
• Genome-wide
– HAPMAP
• dbSNP http://www.ncbi.nlm.nih.gov/projects/SNP/
• OMIM (Online Mendelian Inheritance in Man)

* Resequencing data

Bioinformatics tools

– https://innateimmunity.net/IIPGA2/Bioinformatics/
– http://pga.gs.washington.edu/software.html
– Haploview http://www.broad.mit.edu/mpg/haploview/index.php
– SNPSelector

So, OK, how should I select SNPs?

• PubMed/lit search
– Previous associations with your (or related) phenotype
   • GWAS!
– Functional studies
• Potentially functional variants
– nsSNPs (perhaps ranked by SIFT or Polyphen score)
– Splice sites
– Conserved regions
• tagSNPs

SNP Selector
Bioinformatics 21:4181
http://primer.duhs.duke.edu/

Molecular genotyping methods

David G. Cox M.S. Ph.D.
Instructor of Epidemiology
[email protected]
Bldg. 2 Rm. 211
(617) 432-2262

Overview

• How it works
• Considerations in choosing a method
• Quality Control (QC)
• Organizing your data
• Completing the study

PCR

• Rapid, versatile, in vitro method for amplifying defined target DNA sequences to yield multiple copies of a specific region of DNA sequence
• 1980s, K. Mullis invented PCR
– Won Nobel Prize in 1993
• Applications for basic science, epidemiology, evolution, linkage analysis, forensics, anthropology

PCR (2)

• Allows for screening of uncharacterized mutations
• Rapid genotyping for polymorphic markers
• Detecting point mutations

PCR cycle

Three steps
1. Denaturation
• Denature DNA to separate strands
2. Annealing
• Primers bind to strands
3. Extension
• Polymerase synthesizes new strands

PCR cycle (2)

• Reaction mixture proceeds through repeated cycles of primer annealing, DNA synthesis, and denaturation
• Target sequence concentration increases exponentially with each cycle
– Each newly synthesized DNA strand acts as a template for further DNA synthesis in subsequent cycles
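Because each new strand serves as a template in the next cycle, the copy number roughly doubles per cycle. The sketch below illustrates that growth; the `efficiency` parameter (fraction of templates copied per cycle) and the function name are assumptions added for illustration, and real reactions plateau as reagents run out.

```python
def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Idealised number of target copies after a number of PCR cycles."""
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must lie in [0, 1]")
    return initial_copies * (1 + efficiency) ** cycles


for n in (10, 20, 30):
    print(n, pcr_copies(1, n))
```

With perfect doubling, 30 cycles turn a single template into 2^30 copies, about a billion.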

Denaturation

Annealing

Extension

Main assays used

• Looking to optimize three things
– Cost of genotyping
– Speed of genotyping
– Reliability of data
• Three main categories
– Low-plexed: usually PCR based
– High-plexed: PCR or non-PCR based
– Mega-plexed: non-PCR based

PCR-based methods

• Plex = number of separate assays in an individual tube
• Single to low-plexed
– Usually limited to number of tags
   • Either fluorescent or mass
• Tags are expensive part
• Micro scale reactions
• Low start-up costs
– Robotics not necessary
– Machines in many labs

Taqman

BioTrove

• Miniaturized Taqman
– Primers and probes spotted into holes
– Taqman reaction exactly the same
• Reduces cost
– Lowers quantity of probe and master mix
– Still need to order a minimum amount of primer and probe

iPLEX

Non-PCR-based methods

• Usually rely on some sort of genome-wide amplification step
• Hybridization techniques increase plex
– Stick DNA to some sort of chip
   • Chips are roughly the size of microscope slides
• Nano scale reactions
• High start-up costs for machines and robotics
– Core facilities normally used

Illumina products

• Highly multiplexed assays
– From 384 to ~1M SNPs
   • Custom chips designed up to 72k
   • GWAS products of ~500k and ~1M
– Based on pair-wise tagging of SNPs from HapMap
• Use specially etched holes
– Solves “spotting” problem
– Addressing system

Goldengate

Infinium

Affymetrix

• GWAS chip
– Over 1.8M features
• SNPs
– ~500k from earlier version
   • Evenly spaced across genome
– ~500k additional SNPs
   • Tag
   • X/Y
   • mtDNA
   • New SNPs
   • Hotspots
• CNVs
– ~200k specifically targeted to CNVs
– ~750k additional probes across genome

Quick word on CNVs

• Latest craze in genetic epi (one of)
• Copy Number Variants
– Either more (3+) or fewer (1) copies of a genetic region present
• Polymorphic regions of varying zygosity
– Detected as Mendelian errors in HapMap
– Behave (from a population genetic standpoint) like any other polymorphism
• Still not well characterized
– i.e. regions with high homology can show up as CNVs
• Genotyped using quantification of genotype signal

Back to Affy

Affy vs. Illumina

Affymetrix
– Earlier product
   • Began with ~100k
   • Assay and software issues
– Costs have drastically declined
– SNP coverage has drastically increased
   • tagSNPs added
– WGA DNA OK

Illumina
– Later product
   • Began with ~500k
   • Better assay and software design (originally)
– Cost issues
– SNP coverage relatively constant
   • Always based on HapMap
– WGA DNA discouraged

Genotype Clustering

Clustering continued

• Low-plex assays
– Usually done by eye by a technician
– Can be labor intensive and subject to user bias
• High- and mega-plex assays
– Usually computer assisted or completely automated
– Less labor intensive but subject to clustering errors

Best case scenario

Software clustering

[Figure: cluster plot for rs4804195, Norm R versus Norm Theta; cluster counts 1049, 1566, 459]

Human clustering

[Figure: cluster plot for rs4804195, Norm R versus Norm Theta; cluster counts 1088, 1549, 491]

What would you do with this?

[Figure: cluster plot for rs6451182, Norm R versus Norm Theta; cluster counts 17, 2885, 225]

Or this?

[Figure: cluster plot for rs598558, Norm R versus Norm Theta; cluster counts as extracted: 3 38286]

So you want to genotype?!?

• Three main things to consider
– Number of SNPs
– Number of samples
– Budgetary considerations
• Minor considerations
– DNA source
– DNA quantity/quality

So you want to genotype?!?

Biotrove

Budgetary Considerations

Low plex
– Per SNP cost normally doesn’t decrease as the number of SNPs increases
– Per genotype cost may decrease as sample size increases
– Overall study cost can be low ($Ks)

Higher plex
– Per SNP cost decreases as you get closer to the maximum plex
– Per genotype cost decreases drastically as plex goes up
– Overall study cost can be high ($Ms)

And the BIG question

[Figure: “Genotype Costs”: cost per well (axis 0 to 800) and cost per genotype (axis 0 to 1.2), by assay and scale]

How to minimize costs while maximizing genotyping

• Genotype the right number of samples
– Fill plates
– Find the sweet spot in assay ordering
• Genotype the right number of SNPs
– If you can fill the beads, your per genotype cost goes down without drastically increasing the total cost

Quality control

Low-plex
– Blinded QC
   • Repeated samples
   • ~10% of the total sample size
– >95% completion rate
– Easy to repeat individual plates to correct any errors

High- to mega-plex
– Blinded QC
   • One or two samples per plate
   • Same samples on every plate
– Set both SNP and sample completion rates
– Not easy to repeat plates

Data overload

Low-plex
– Data trickles in
– Little need for elaborate databases
   • Assay description: primer/probe sequence, alleles
   • SNP description: locus (usually rs# sufficient)
– Relational db for genotype data
   • ID x genotype

Data overload

High- to mega-plex
– Data deluge
   • Up to 1M SNPs worth of data at once
– Annotation of SNPs
   • Assay characteristics
   • SNP characteristics
– Large sample sizes
   • 1536 SNPs x 1000 samples is over 1.5 million data points

Data analysis issues

Low-plex
– Often <10 SNPs per study
   • Easy/quick to analyze
   • Data presentation simple
   • Data archival simple
– Multiple comparison issues have largely been ignored

High- to mega-plex
– Massive data sets
   • Even summary stats for all the SNPs take hours
   • Need to be able to access individual SNPs as well
   • Presenting 1536 to 1M SNPs worth of data is a challenge
– Multiple comparison issues more obvious

In summary

• Genotyping is now a numbers game
– Methods are VERY accurate
– Budgets are tighter
• Considerations
– Number of SNPs
– Number of samples
– Quantity/Quality of DNA
• Feel free to contact me regarding DNA sources etc.

Online resources (and slide sources)

• Taqman (appliedbiosystems.com)
• iPLEX (sequenom.com)
• Illumina products (Illumina.com)
• Affymetrix products (Affymetrix.com)

Genotyping Quality Control

P Kraft

Quality control

Method of assessment: High quality standard

• Completion rate: > 95% complete
– High failure rate correlated with high error rate
• Reproducible genotypes: repeat genotyping of random 5% sample has <1% discordance
• Hardy Weinberg
– Single loci: no significant deviations or small magnitude of deviation
– Multiple loci: no more deviations than expected (q-q plot), no consistent trend (all undercalling hets)
• Non-paternity: where family data available
– Remove all non-paternities

See Leal (2005) Genet Epidemiol, Cox & Kraft (2006) Hum Hered, Abecasis (2005) Am J Hum Genet
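The first two checks are simple proportions. The sketch below shows one way to compute them, assuming genotype calls are stored as strings with `None` for a failed call; the function names and the example calls are hypothetical.

```python
def completion_rate(calls):
    """Fraction of attempted genotypes that produced a call."""
    return sum(c is not None for c in calls) / len(calls)


def discordance_rate(first, second):
    """Fraction of duplicate pairs whose two calls disagree,
    among pairs where both attempts produced a call."""
    pairs = [(a, b) for a, b in zip(first, second)
             if a is not None and b is not None]
    if not pairs:
        raise ValueError("no duplicate pair has two successful calls")
    return sum(a != b for a, b in pairs) / len(pairs)


# Hypothetical calls for one SNP and a blinded repeat of the same samples
run1 = ["AA", "AG", "GG", None, "AG", "AA", "AG", "GG"]
run2 = ["AA", "AG", "GG", "AG", "AA", "AA", "AG", "GG"]
print(completion_rate(run1))
print(discordance_rate(run1, run2))
```

A SNP would pass the thresholds on the slide if its completion rate exceeds 0.95 and its discordance rate among repeats is below 0.01.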

Testing for departure from HWE

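Genotype    aa       Aa       AA
Count       $N_0$    $N_1$    $N_2$

Observed A allele frequency: $p = (2N_2 + N_1)/(2N)$, where $N = N_2 + N_1 + N_0$

Pearson’s chi-square test for departures from HWE:

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} = \frac{(N_2 - Np^2)^2}{Np^2} + \frac{(N_1 - 2Np(1-p))^2}{2Np(1-p)} + \frac{(N_0 - N(1-p)^2)^2}{N(1-p)^2}$$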

This should be compared to a central chi-squared distribution with one degree of freedom

Implemented e.g. in SAS GENETICS – PROC ALLELE
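For readers without SAS, the Pearson test is short enough to write out. The sketch below uses only the Python standard library; the function name is an assumption, and the p-value relies on the identity that the upper tail of a chi-square distribution with one degree of freedom equals erfc(sqrt(x/2)).

```python
import math


def hwe_chi_square(n_aa, n_Aa, n_AA):
    """Pearson chi-square test for departure from HWE.

    Returns (p, chi2, p_value), where p is the observed A allele frequency.
    """
    n = n_aa + n_Aa + n_AA
    p = (2 * n_AA + n_Aa) / (2 * n)
    if p == 0.0 or p == 1.0:
        raise ValueError("locus is monomorphic; the test is undefined")
    observed = (n_aa, n_Aa, n_AA)
    expected = (n * (1 - p) ** 2, 2 * n * p * (1 - p), n * p ** 2)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a central chi-square distribution with 1 df
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return p, chi2, p_value


# Hypothetical genotype counts (aa, Aa, AA)
print(hwe_chi_square(30, 40, 30))
```

For small counts or rare alleles the chi-square approximation is poor, and an exact test is the usual alternative.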

http://www.sph.umich.edu/csg/abecasis/Exact/index.html

$\delta = (P_{Aa} - 2 p_A p_a) / (2 p_A p_a)$

Example from BPC3

9 of 22 tests significant at .05 level!

quantile of Chi-square distribution

A Q-Q plot compares two distributions by plotting their quantiles against each other.

Here it is useful for assessing the similarity between the observed distribution of test statistics and the theoretical null distribution. Points should lie on the y=x diagonal!
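The coordinates of such a plot are straightforward to compute. The sketch below works on the p-value scale rather than the chi-square scale: under the null, p-values are uniform, so the i-th smallest of n is compared with (i + 0.5)/n. The plotting position and the function name are assumptions made for illustration.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) pairs of -log10 p-values for a Q-Q plot."""
    observed = sorted(p_values)
    n = len(observed)
    return [
        (-math.log10((i + 0.5) / n), -math.log10(p))
        for i, p in enumerate(observed)
    ]


# Hypothetical HWE p-values for five SNPs
for expected, obs in qq_points([0.92, 0.40, 0.03, 0.61, 0.0004]):
    print(round(expected, 2), round(obs, 2))
```

Points well above the diagonal at the small-p end indicate more departures from HWE than chance would produce.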

[Figure panels: Good; Better than before; Admixture?]

[Figure: log-log quantile plot of p-value for Hardy-Weinberg proportion, log10(p-value) versus log10(quantile); exact test, 299,779 SNPs; observed values shown against 20 simulations. Annotations: expected 244, observed 586; expected 2600, observed 3340]

CGEMS prostate cancer whole genome scan: phase 1a, slide courtesy of G Thomas

Statistical Methods to Handle Errors

• Family-based:
– AE-TDT: models both missing parental data and genotype errors
   » Am J Hum Genet. 2001 Aug;69(2):371-80.
   » Eur J Hum Genet. 2004 Sep;12(9):752-61.
– Likelihood with nuisance parameters
   » Genet Epidemiol. 2004 Feb;26(2):142-54.
– Bayesian
   » Genet Epidemiol. 2004 Jan;26(1):70-80.
• Case-control:
   » Rice & Holmans (2003) Ann J Hum Genet 29:204
   » Gordon et al. (2004) Stat App Genet Molec Biol 3
– Need locus-specific error rates
   » Difficult to get for high-throughput platforms

Nondifferential genotyping error can lead to inflated Type I error rates!

Nondifferential genotyping error does not generally lead to inflated Type I error rates, but can lead to loss of power, bias away from null
