EPI293 Design and analysis of gene association studies
Winter Term 2008
Lecture 2: Patterns of LD and “tag SNP” selection
Peter Kraft [email protected]
Bldg 2 Rm 207 2-4271
Before HapMap: “looking under lamppost”
Study 1: Pop’n A, small N, no assoc’n
Study 2: Pop’n A, large N, no assoc’n
Study 3: Pop’n B, large N, assoc’n
After HapMap
Study 2 revisited: Pop’n A, large N, assoc’n
Outline
• Measures of linkage disequilibrium
• Reasons for LD and empirical patterns of LD
• “Tagging” SNPs
• The HapMap project
• Resources and tools for SNP selection
Basic idea: linkage disequilibrium
Alleles at two (or more) loci are correlated on chromosomes drawn at random from the population
[Figure: 16 chromosomes drawn at random; 8 carry A-G, 6 carry a-g, 2 carry A-g, none carry a-G]
Measures of linkage disequilibrium
• Basic data: table of haplotype frequencies
[Figure: the 16 sampled chromosomes (8 A-G, 6 a-g, 2 A-g, 0 a-G), tabulated as haplotype counts]

        A        a
G       8        0        50%
g       2        6        50%
        62.5%    37.5%
Linkage disequilibrium and marginal allele freqs.

        A                          a
G       $p_A p_G + \delta = x$     $q_A p_G - \delta = y$     $p_G$
g       $p_A q_G - \delta = w$     $q_A q_G + \delta = z$     $q_G$
        $p_A$                      $q_A$                      1

• $p_A$ and $p_G$ are (minor) allele frequencies
  – $q_A = 1 - p_A$; $q_G = 1 - p_G$
• $\delta = xz - yw$ is a measure of departure from independence
  – No association between A and G ⇒ $\delta = 0$
  – $\max(\delta) = \min(p_A q_G,\; p_G q_A)$
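        A                a
G       $n_{11}$         $n_{10}$         $n_{1\cdot}$
g       $n_{01}$         $n_{00}$         $n_{0\cdot}$
        $n_{\cdot 1}$    $n_{\cdot 0}$

Measure             Formula                                                                                                      Ref.
D’                  $(n_{11} n_{00} - n_{10} n_{01}) / \min(n_{1\cdot} n_{\cdot 0},\; n_{0\cdot} n_{\cdot 1})$                   Lewontin (1964)
$\Delta^2 = r^2$    $(n_{11} n_{00} - n_{10} n_{01})^2 / (n_{1\cdot}\, n_{0\cdot}\, n_{\cdot 1}\, n_{\cdot 0})$                  Hill and Weir (1994)
$\delta^*$          $(n_{11} n_{00} - n_{10} n_{01}) / \ldots$                                                                   Levin (1953)
$\lambda$           $(n_{11} n_{00}) / (n_{10} n_{01})$                                                                          Edwards (1963)
Q                   $(n_{11} n_{00} - n_{10} n_{01}) / (n_{11} n_{00} + n_{10} n_{01})$                                          Yule (1900)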
|D’| and r2 are most common
• D prime …
  – …ranges from 0 [no LD] to 1 [complete LD]…
  – …is less sensitive to marginal allele frequencies…
  – …is directly related to recombination fraction
• R squared…
  – …also ranges from 0 to 1…
  – …is correlation between alleles on the same chromosome…
  – …is very sensitive to marginal allele frequencies…
  – …is directly related to study power
• If a marker M and causal gene G are in LD, then a study with N cases and controls which measures M (but not G) will have the same power to detect an association as a study with r2 N cases and controls that directly measured G
• r2 N is the “effective sample size”
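The “effective sample size” rule can be turned into a two-line calculation. The sketch below is only an illustration of the r2 N relationship stated above; the function names are made up for this example.

```python
def effective_sample_size(n, r2):
    """Size of a study that typed G directly with the same power as
    n subjects typed only at marker M (r2 = LD between M and G)."""
    return r2 * n


def marker_sample_size(n_direct, r2):
    """Subjects needed when typing M to match a direct study of n_direct subjects."""
    return n_direct / r2


# With r2 = 0.6, 1000 subjects typed at the marker behave like 600 typed at G.
print(effective_sample_size(1000, 0.6))
print(marker_sample_size(600, 0.6))
```

Read the other way round, a marker with r2 = 0.6 needs roughly 1/0.6 ≈ 1.7 times as many subjects as a direct assay of G.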
Worked example (16 chromosomes: 8 A-G, 6 a-g, 2 A-g, 0 a-G)

        A        a
G       8        0        50%
g       2        6        50%
        62.5%    37.5%

$D' = (8 \times 6 - 0 \times 2) / (8 \times 6) = 1$
$r^2 = (8 \times 6 - 0 \times 2)^2 / (10 \times 6 \times 8 \times 8) = 0.6$
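The same arithmetic can be sketched in Python for any 2x2 table of haplotype counts. This is an illustrative sketch, not code from the lecture: the function name is invented, and the branch for a negative numerator uses the standard Lewontin normalisation, which the slide does not show.

```python
def ld_measures(n11, n10, n01, n00):
    """Return (|D'|, r^2) from haplotype counts laid out as
         A    a
    G   n11  n10
    g   n01  n00
    """
    row1, row0 = n11 + n10, n01 + n00   # G and g totals
    col1, col0 = n11 + n01, n10 + n00   # A and a totals
    cross = n11 * n00 - n10 * n01       # numerator shared by D' and r^2
    if cross >= 0:
        d_max = min(row1 * col0, row0 * col1)
    else:
        # Assumption: usual normalisation when the association is negative.
        d_max = min(row1 * col1, row0 * col0)
    d_prime = abs(cross) / d_max
    r2 = cross ** 2 / (row1 * row0 * col1 * col0)
    return d_prime, r2


# Counts from the worked example: D' = 1 and r^2 = 0.6
print(ld_measures(8, 0, 2, 6))
```

The example shows why the two measures disagree: one haplotype (a-G) is absent, so D’ is at its maximum, yet the allele frequencies differ (62.5% vs 50%), so r2 cannot reach 1.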
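Computational detail
• Haplotypes are rarely directly observed
• Have to infer from genotype data
  – Genotypes can be consistent with more than one haplotype pair
  – [Figure: the double heterozygote Aa/Gg is consistent with haplotypes A-G + a-g or with A-g + a-G]
• Most popular algorithm: Expectation Maximization [1]
• LD estimated this way is related to, but not exactly equal to, the correlation from the 3x3 table of genotypes

[1] Thomas pp. 243-245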
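A minimal sketch of the EM idea for two biallelic loci follows. It is not the implementation from Thomas or any package: the function name, the 3x3 input layout and the starting values are assumptions for illustration, and it assumes random mating (HWE) when splitting the double heterozygotes.

```python
def em_haplotype_freqs(geno, n_iter=100, tol=1e-10):
    """Estimate AG, Ag, aG, ag haplotype frequencies by EM.

    geno[i][j] = number of individuals with i copies of A and j copies of G.
    Only the double heterozygotes geno[1][1] have ambiguous phase.
    """
    n_chrom = 2 * sum(sum(row) for row in geno)
    # Haplotype counts contributed by phase-unambiguous genotypes.
    base = {
        "AG": 2 * geno[2][2] + geno[2][1] + geno[1][2],
        "Ag": 2 * geno[2][0] + geno[2][1] + geno[1][0],
        "aG": 2 * geno[0][2] + geno[1][2] + geno[0][1],
        "ag": 2 * geno[0][0] + geno[1][0] + geno[0][1],
    }
    double_hets = geno[1][1]
    freqs = {h: 0.25 for h in base}  # assumed starting point
    for _ in range(n_iter):
        # E-step: probability that a double heterozygote is AG/ag rather than Ag/aG.
        cis = freqs["AG"] * freqs["ag"]
        trans = freqs["Ag"] * freqs["aG"]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: re-count haplotypes using the expected phase split.
        new = {
            "AG": (base["AG"] + w * double_hets) / n_chrom,
            "ag": (base["ag"] + w * double_hets) / n_chrom,
            "Ag": (base["Ag"] + (1 - w) * double_hets) / n_chrom,
            "aG": (base["aG"] + (1 - w) * double_hets) / n_chrom,
        }
        change = max(abs(new[h] - freqs[h]) for h in new)
        freqs = new
        if change < tol:
            break
    return freqs


# Hypothetical data: 5 AA/GG, 10 Aa/Gg and 5 aa/gg individuals.
print(em_haplotype_freqs([[5, 0, 0], [0, 10, 0], [0, 0, 5]]))
```

In this toy data set every unambiguous individual carries only A-G or a-g, so the EM iterations push the double heterozygotes toward the A-G + a-g phase.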
[Figure: 3x3 table of genotypes, rows AA=2, Aa=1, aa=0 and columns BB=2, Bb=1, bb=0]
Correlation from this table makes no assumptions about HWE
(Weir, Genetic Data Analysis)
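As a sketch of what “correlation from this table” means, the genotypes can be scored 0, 1, 2 and an ordinary Pearson correlation computed from the 3x3 counts. The function name and the input layout below are assumptions for illustration.

```python
def genotype_r2(table):
    """Squared Pearson correlation between allele counts at two loci.

    table[i][j] = number of individuals with i copies of allele A
    and j copies of allele B (i, j in 0, 1, 2). No phase and no HWE needed.
    """
    cells = [(i, j, table[i][j]) for i in range(3) for j in range(3)]
    n = sum(count for _, _, count in cells)
    mean_a = sum(i * count for i, _, count in cells) / n
    mean_b = sum(j * count for _, j, count in cells) / n
    cov = sum((i - mean_a) * (j - mean_b) * count for i, j, count in cells) / n
    var_a = sum((i - mean_a) ** 2 * count for i, _, count in cells) / n
    var_b = sum((j - mean_b) ** 2 * count for _, j, count in cells) / n
    return cov ** 2 / (var_a * var_b)


# Hypothetical counts: genotypes at the two loci always match, so r^2 = 1.
print(genotype_r2([[5, 0, 0], [0, 10, 0], [0, 0, 5]]))
```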
Why does LD exist?
1. “Recombination coldspots”
2. Demographics (e.g. bottlenecks)
3. Population stratification or admixture
   • Confounds gene-disease association
   • Does not decay with distance
(among other reasons… selective pressure … etc.)
Decay of LD in Pictures
Decay of LD: $\delta_T = \delta_0 (1 - \theta)^T$
[Figure: delta (y-axis, 0 to 1) against theta (x-axis, 0.05 to 0.25), one curve each for 1, 5, 10, 20, 40 and 80 generations]
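The decay formula is easy to tabulate. The sketch below simply evaluates delta_T = delta_0 (1 - theta)^T for the generation counts shown in the figure; the function name and the choice theta = 0.05 are illustrative assumptions.

```python
def ld_after(delta0, theta, generations):
    """LD remaining after the given number of generations of random mating,
    for recombination fraction theta between the two loci."""
    return delta0 * (1 - theta) ** generations


# Fraction of the initial LD left at theta = 0.05
for t in (1, 5, 10, 20, 40, 80):
    print(t, round(ld_after(1.0, 0.05, t), 3))
```

Even at theta = 0.05 about 60% of the initial LD survives 10 generations, while for unlinked loci (theta = 0.5) LD halves every generation.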
[Figure: 200 kbp from chr2, positions 51,783,239 to 51,983,238]
Data from the ENCODE project: http://www.hapmap.org/downloads/encode1.html.en
Implications
• Admixture can lead to false positives
  – Two unlinked loci can stay in LD
  – Recent admixture, continual gene flow problematic
• Isolated populations have advantages for fine-mapping
  – LD extends long distances, so fewer markers need be typed
  – But resolution may be poor
Knowledge of local LD structure is essential for candidate gene studies!
Basic “tagging” design
Measure haplotypes/LD pattern in a subsample (often an external database)
Choose a subset of SNPs (“tagSNPs”) that contain the majority of the information
Genotype “tagSNPs” in the main study, analyze appropriately
ATM: over 750 known SNPs, at least 50 of them common in Europeans
[Figure: ATM]
“block” = region of limited haplotype diversity and/or high LD
But there are unappealing aspects of the “haplotype block” idea
• Definition and “block finding” algorithms are ad hoc
  – Different defns, algs lead to different block structures
  – Block structure changes with sample size, marker density
• “Hard boundaries” are…
  – …unappealing for tagSNP selection (what about “between blocks”)…
  – …an inaccurate description of LD patterns (some haps overlap boundaries)
• Plus, haplotypes present analytic challenges
[Wall & Pritchard (2003a) Nat Rev Genet 4:587 (2003b) AJHG 73:502]
[Nothnagel and Rohde (2005) AJHG 77:988]
[Figure: CYP19]
Keep it simple
• We want SNPs that predict unobserved variants
• Why not choose SNPs based on pairwise correlations?
• Q: What if we don’t know enough about common genetic variation to say we’ve captured it?
• A: HapMap and resequencing projects

[Figure: six SNPs on four haplotypes; SNPs 1-2, 3-5 and 4-6 are in high r2]

SNP       1 (A/T)   2 (G/A)   3 (G/C)   4 (T/C)   5 (G/C)   6 (A/C)
Hap 1     A         G         G         T         G         A
Hap 2     A         G         C         C         C         C
Hap 3     T         A         C         C         C         C
Hap 4     T         A         G         C         G         C
HapMap:
application in the design and interpretation of association studies
Mark J. Daly, PhD on behalf of
The International HapMap Consortium
[OK it may look like I’m totally stealing these slides—but they are free on the web at http://www.hapmap.org/tutorials.html.en]
Goals of this segment
• Briefly summarize HapMap design and current status
• Discuss the application of HapMap to all aspects of association study design, analysis and interpretation
HapMap Project
High-density SNP genotyping across the genome provides information about
– SNP validation, frequency, assay conditions
– correlation structure of alleles in the genome
A freely-available public resource to increase the power and efficiency of genetic association studies to medical traits
All data is freely available on the web for application in study design and analyses as researchers see fit
HapMap Samples
• 90 Yoruba individuals (30 parent-parent-offspring trios) from Ibadan, Nigeria (YRI)
• 90 individuals (30 trios) of European descent from Utah (CEU)
• 45 Han Chinese individuals from Beijing (CHB)
• 45 Japanese individuals from Tokyo (JPT)
HapMap progress
PHASE I – completed, described in Nature paper
* 1,000,000 SNPs successfully typed in all 270 HapMap samples
* ENCODE variation reference resource available
PHASE II – data generation complete, data released early November 2005
* >3,500,000 SNPs typed in total !!!
Frazer, K. A., D. G. Ballinger, D. R. Cox, D. A. Hinds, L. L. Stuve, R. A. Gibbs, J. W. Belmont, A. Boudreau, P. Hardenbol, S. M. Leal, S. Pasternak, D. A. Wheeler, et al. (2007). "A second generation human haplotype map of over 3.1 million SNPs." Nature 449(7164): 851-61.
ENCODE-HAPMAP variation project
• Ten “typical” 500kb regions
• 48 samples sequenced
• All discovered SNPs (and those in dbSNP) typed in all 270 HapMap samples
• Current data set – 1 SNP every 279 bp
A much more complete variation resource by which the genome-wide map can be evaluated
Completeness of dbSNP
Vast majority of common SNPs are contained in or highly correlated with a SNP in dbSNP
Recombination hotspots are widespread and account for LD structure
[Figure: 7q21]
Coverage of Phase II HapMap (estimated from ENCODE data)
From Table 6 – “A Haplotype Map of the Human Genome”, Nature

Panel      % r2 > 0.8   max r2
YRI        81           0.90
CEU        94           0.97
CHB+JPT    94           0.97

Vast majority of common variation (MAF > .05) captured by Phase II HapMap
Applying the HapMap
• Study design - tagging
• Study coverage evaluation
• Study analysis - improving association testing
• Study interpretation
  – Comparison of multiple studies
  – Connection to genes/genomic features
  – Integration with expression and other functional data
• Other uses of HapMap data
  – Admixture, LOH, selection
Tagging from HapMap
• Since HapMap describes the majority of common variation in the genome, choosing non-redundant sets of SNPs from HapMap offers considerable efficiency without power loss in association studies
Pairwise tagging
[Figure: six SNPs (1: A/T, 2: G/A, 3: G/C, 4: T/C, 5: G/C, 6: A/C) on four haplotypes; SNPs 1-2, 3-5 and 4-6 are in high r2]
Tags: SNP 1, SNP 3, SNP 6 (3 in total)
Test for association: SNP 1, SNP 3, SNP 6
After Carlson et al. (2004) AJHG 74:106
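A greedy pairwise tagger in the spirit of this slide can be sketched as follows. This is not the Carlson et al. or Haploview implementation. The four haplotype strings are read off the six-SNP figure, equal haplotype frequencies are assumed purely for illustration, and all function names are hypothetical.

```python
# Four haplotypes over SNPs 1..6, read off the figure (assumed equally frequent).
HAPLOTYPES = ["AGGTGA", "AGCCCC", "TACCCC", "TAGCGC"]


def allele_columns(haplotypes):
    """Code each SNP as 1 where a haplotype matches the first haplotype's allele."""
    reference = haplotypes[0]
    return [[1 if hap[k] == reference[k] else 0 for hap in haplotypes]
            for k in range(len(reference))]


def r2(x, y):
    """Squared correlation between two 0/1 allele indicators."""
    n = len(x)
    px, py = sum(x) / n, sum(y) / n
    pxy = sum(a * b for a, b in zip(x, y)) / n
    denom = px * (1 - px) * py * (1 - py)
    return (pxy - px * py) ** 2 / denom if denom > 0 else 0.0


def greedy_tags(columns, threshold=0.8):
    """Repeatedly pick the SNP that captures the most untagged SNPs at r2 >= threshold."""
    untagged = set(range(len(columns)))
    tags = {}
    while untagged:
        best, best_bin = None, set()
        for i in sorted(untagged):
            captured = {j for j in untagged
                        if j == i or r2(columns[i], columns[j]) >= threshold}
            if len(captured) > len(best_bin):
                best, best_bin = i, captured
        tags[best] = best_bin
        untagged -= best_bin
    return tags


tags = greedy_tags(allele_columns(HAPLOTYPES))
# Report with 1-based SNP numbers, as on the slide.
for tag, captured in sorted(tags.items()):
    print("SNP", tag + 1, "captures", sorted(j + 1 for j in captured))
```

With these toy haplotypes the sketch needs three tags, one per bin {1, 2}, {3, 5} and {4, 6}. It happens to pick SNP 4 where the slide picks SNP 6; within a bin of perfectly correlated SNPs the choice is arbitrary.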
Pairwise Tagging Efficiency
Table 7. Number of selected tag SNPs to capture all observed common SNPs in the Phase I HapMap for the three analysis panels using pairwise tagging at different r2 thresholds

Pairwise      YRI        CEU        CHB+JPT
r2 ≥ 0.5      324,865    178,501    159,029
r2 ≥ 0.8      474,409    293,835    259,779
r2 = 1        604,886    447,579    434,476

Tag SNPs were picked to capture common SNPs in release 16c.1 for every 7,000 SNP bin using Haploview.
Tagging Phase I HapMap offers 2-5x gains in efficiency
Use of haplotypes can improve genotyping efficiency
[Figure: six SNPs (1: A/T, 2: G/A, 3: G/C, 4: T/C, 5: G/C, 6: A/C) on four haplotypes]
Pairwise tagging
  Tags: SNP 1, SNP 3, SNP 6 (3 in total)
  Test for association: SNP 1, SNP 3, SNP 6
Multi-marker tagging
  Tags: SNP 1, SNP 3 (2 in total)
  Test for association:
    SNP 1 captures 1+2
    SNP 3 captures 3+5
    “AG” haplotype captures SNP 4+6
Tags in a multi-marker test should be conditional on the significance of LD in order to avoid overfitting
Efficiency and power
[Figure: relative power (%) against average marker density (per kb), for tag SNPs and for random SNPs]
P.I.W. de Bakker et al. (2005) Nat Genet Advance Online Publication 23 Oct 2005
~300,000 tag SNPs needed to cover common variation in whole genome in CEU
Will tag SNPs picked from HapMap apply to other population samples?
Two issues:
• What if LD structure strongly differs between my samples and the HapMap samples?
  – Are CEU or YRI panels good surrogates for Latinos from Los Angeles? Are CEU samples even good surrogates for whites from France?
• Is HapMap sample size sufficient?
  – Small-sample correlation is overestimated; are tagging algorithms “overfitting” the sample?
(PK slide)
Will tag SNPs picked from HapMap apply to other population samples?
Population differences add very little inefficiency (Paul de Bakker, Pac Symp Biocomput 2006)
[Figure: CEU (Utah residents with European ancestry, CEPH) compared with whites from Los Angeles, CA and with Botnia, Finland]
De Bakker et al (2006) Nat Genet
Need and Goldstein (2006) Nat Genet
Impact of training set sample size
[Figure panels: tags chosen as pairwise tags; tags chosen as multimarker tags (up to 6 markers)]
Zeggini et al Nature Genetics 37, 1320 - 1322 (2005)
Impact of training set sample size
[Figure panels: tags chosen for common variants; tags chosen for common and rare variants]
Zeggini et al Nature Genetics 37, 1320 - 1322 (2005)
Public sources of SNP data
• Candidate genes
  – “Seattle SNPs” http://pga.gs.washington.edu/ *
  – Environmental Genome Project http://egp.gs.washington.edu/ *
  – IIPGA http://innateimmunity.net/IIPGA2/index_html *
  – HAPMAP http://www.hapmap.org/
  – BPC3 http://www.uscnorris.com/MECGenetics/
• Genome-wide
  – HAPMAP
• dbSNP http://www.ncbi.nlm.nih.gov/projects/SNP/
• OMIM (Online Mendelian Inheritance in Man)
* Resequencing data
Bioinformatics tools
– https://innateimmunity.net/IIPGA2/Bioinformatics/
– http://pga.gs.washington.edu/software.html
– Haploview http://www.broad.mit.edu/mpg/haploview/index.php
– SNPSelector
So, OK, how should I select SNPs?
• PubMed/lit search
  – Previous associations with your (or related) phenotype
    • GWAS!
  – Functional studies
• Potentially functional variants
  – nsSNPs (perhaps ranked by SIFT or PolyPhen score)
  – Splice sites
  – Conserved regions
• tagSNPs
SNP Selector
Bioinformatics 21:4181
http://primer.duhs.duke.edu/
Molecular genotyping methods
David G. Cox M.S. Ph.D.
Instructor of Epidemiology
[email protected] (hsph.harvard.edu)
Bldg. 2 Rm. 211
(617) 432-2262
Overview
• How it works
• Considerations in choosing a method
• Quality Control (QC)
• Organizing your data
• Completing the study
PCR
• Rapid, versatile, in vitro method for amplifying defined target DNA sequences to yield multiple copies of a specific region of DNA sequence
• 1980s, K. Mullis invented PCR
  – Won Nobel Prize in 1993
• Applications for basic science, epidemiology, evolution, linkage analysis, forensics, anthropology
PCR (2)
• Allows for screening of uncharacterized mutations
• Rapid genotyping for polymorphic markers
• Detecting point mutations
PCR cycle
Three steps
1. Denaturation
   • Denature DNA to separate strands
2. Annealing
   • Primers bind to strands
3. Extension
   • Polymerase synthesizes new strands
PCR cycle (2)
• Reaction mixture proceeds through repeated cycles of primer annealing, DNA synthesis, and denaturation
• Target sequence concentration increases exponentially with each cycle
  – Each newly synthesized DNA strand acts as a template for further DNA synthesis in subsequent cycles
[Figure: denaturation, annealing, extension]
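The exponential growth can be made concrete with a short sketch. Perfect doubling per cycle is the idealised case; the `efficiency` parameter is an assumption added here to show how less-than-perfect cycles slow the growth, and the function name is hypothetical.

```python
def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Copies of the target after the given number of cycles.

    efficiency = 1.0 means every template is copied in every cycle
    (perfect doubling); lower values are an illustrative assumption.
    """
    return initial_copies * (1 + efficiency) ** cycles


print(pcr_copies(1, 30))        # ideal case: 2**30 copies from one template
print(pcr_copies(1, 30, 0.9))   # 90% efficiency per cycle
```

Thirty ideal cycles turn a single template into about a billion copies, which is why nanogram quantities of DNA are enough for genotyping.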
Main assays used
• Looking to optimize three things
  – Cost of genotyping
  – Speed of genotyping
  – Reliability of data
• Three main categories
  – Low-plexed
    • Usually PCR based
  – High-plexed
    • PCR or non-PCR based
  – Mega-plexed
    • Non-PCR based
PCR-based methods
• Plex = number of separate assays in an individual tube
• Single to low-plexed
  – Usually limited to number of tags
    • Either fluorescent or mass
    • Tags are the expensive part
• Micro scale reactions
• Low start-up costs
  – Robotics not necessary
  – Machines in many labs
Taqman
BioTrove
• Miniaturized Taqman
  – Primers and probes spotted into holes
  – Taqman reaction exactly the same
• Reduces cost
  – Lowers quantity of probe and master mix
  – Still need to order a minimum amount of primer and probe
iPLEX
Non-PCR-based methods
• Usually rely on some sort of genome-wide amplification step
• Hybridization techniques increase plex
  – Stick DNA to some sort of chip
    • Chips are roughly the size of microscope slides
• Nano scale reactions
• High start-up costs for machines and robotics
  – Core facilities normally used
Illumina products
• Highly multiplexed assays
  – From 384 to ~1M SNPs
    • Custom chips designed up to 72k
    • GWAS products of ~500k and ~1M
      – Based on pair-wise tagging of SNPs from hapmap
• Use specially etched holes
  – Solves “spotting” problem
  – Addressing system
Goldengate
Infinium
Affymetrix
• GWAS chip
  – Over 1.8M features
• SNPs
  – ~500k from earlier version
    • Evenly spaced across genome
  – ~500k additional SNPs
    • Tag
    • X/Y
    • mtDNA
    • New SNPs
    • Hotspots
• CNVs
  – ~200k specifically targeted to CNVs
  – ~750k additional probes across genome
Quick word on CNVs
• Latest craze in genetic epi (one of)
• Copy Number Variants
  – Either more (3+) or fewer (1) copies of a genetic region present
• Polymorphic regions of varying zygosity
  – Detected as Mendelian errors in HapMap
  – Behave (from a population genetic standpoint) like any other polymorphism
• Still not well characterized
  – e.g. regions with high homology can show up as CNVs
• Genotyped using quantification of genotype signal
Back to Affy
Affy vs. Illumina
Affymetrix
– Earlier product
  • Began with ~100k
  • Assay and software issues
– Costs have drastically declined
– SNP coverage has drastically increased
  • tagSNPs added
– WGA DNA OK
Illumina
– Later product
  • Began with ~500k
  • Better assay and software design (originally)
– Cost issues
– SNP coverage relatively constant
  • Always based on hapmap
– WGA DNA discouraged
Genotype Clustering
Clustering continued
• Low-plex assays
  – Usually done by eye by a technician
  – Can be labor intensive and subject to user bias
• High- and mega-plex assays
  – Usually computer assisted or completely automated
  – Less labor intensive but subject to clustering errors
Best case scenario
Software clustering
[Figure: rs4804195, Norm R (0 to 4) against Norm Theta (0 to 1); cluster counts 1049, 1566, 459]
Human clustering
[Figure: rs4804195, Norm R (0 to 4) against Norm Theta (0 to 1); cluster counts 1088, 1549, 491]
What would you do with this?
[Figure: rs6451182, Norm R (0 to 4) against Norm Theta (0 to 1); cluster counts 17, 2885, 225]
Or this?
[Figure: rs598558, Norm R (-0.20 to 1.60) against Norm Theta (0 to 1); cluster counts 3 38286]
So you want to genotype?!?
• Three main things to consider
  – Number of SNPs
  – Number of samples
  – Budgetary considerations
• Minor considerations
  – DNA source
  – DNA quantity/quality
So you want to genotype?!?
[Figure: Biotrove]
Budgetary Considerations
Low plex
– Per SNP cost normally doesn’t decrease as the number of SNPs increases
– Per genotype cost may decrease as sample size increases
– Overall study cost can be low ($Ks)
Higher plex
– Per SNP cost decreases as you get closer to the maximum plex
– Per genotype cost decreases drastically as plex goes up
– Overall study cost can be high ($Ms)
And the BIG question
[Figure: Genotype Costs. Cost per well (left axis, 0 to 800) and cost per genotype (right axis, 0 to 1.2) by assay and scale]
How to minimize costs while maximizing genotyping
• Genotype the right number of samples
  – Fill plates
  – Find the sweet spot in assay ordering
• Genotype the right number of SNPs
  – If you can fill the beads, your per genotype cost goes down without drastically increasing the total cost
Quality control
Low-plex
– Blinded QC
  • Repeated samples
  • ~10% of the total sample size
– >95% completion rate
– Easy to repeat individual plates to correct any errors
High- to mega-plex
– Blinded QC
  • One or two samples per plate
  • Same samples on every plate
– Set both SNP and sample completion rates
– Not easy to repeat plates
Data overload
Low-plex
– Data trickles in
– Little need for elaborate databases
  • Assay description
    – Primer/probe sequence
    – Alleles
  • SNP description
    – Locus (usually rs# sufficient)
– Relational db for genotype data
  • ID x genotype
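One way to picture the small relational layout described above is the following sketch using Python’s built-in sqlite3 module. The table names, column names and the example rows (including the rs number) are hypothetical; the slide only says that assay, SNP and ID x genotype information should be kept.

```python
import sqlite3


def build_genotype_db():
    """Create an in-memory database with one table per kind of record."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE snp (
            rs_id   TEXT PRIMARY KEY,   -- locus (rs# is usually sufficient)
            alleles TEXT
        );
        CREATE TABLE assay (
            assay_id              INTEGER PRIMARY KEY,
            rs_id                 TEXT REFERENCES snp(rs_id),
            primer_probe_sequence TEXT
        );
        CREATE TABLE genotype (
            sample_id TEXT,
            rs_id     TEXT REFERENCES snp(rs_id),
            genotype  TEXT,
            PRIMARY KEY (sample_id, rs_id)   -- one call per ID x SNP
        );
    """)
    return conn


conn = build_genotype_db()
conn.execute("INSERT INTO snp VALUES ('rs0000001', 'A/G')")  # made-up SNP
conn.executemany(
    "INSERT INTO genotype VALUES (?, ?, ?)",
    [("S001", "rs0000001", "AA"), ("S002", "rs0000001", "AG")],
)
rows = conn.execute(
    "SELECT genotype, COUNT(*) FROM genotype WHERE rs_id = ? GROUP BY genotype",
    ("rs0000001",),
).fetchall()
print(rows)
```

Keeping genotypes in a long ID x SNP table means a new SNP adds rows rather than columns, which suits data that trickles in.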
Data overload
High- to mega-plex
– Data deluge
  • Up to 1M SNPs worth of data at once
– Annotation of SNPs
  • Assay characteristics
  • SNP characteristics
– Large sample sizes
  • 1536x1000 samples is over 1.5 million data points
Data analysis issues
Low-plex
– Often <10 SNPs per study
  • Easy/quick to analyze
  • Data presentation simple
  • Data archival simple
– Multiple comparison issues have largely been ignored
High- to mega-plex
– Massive data sets
  • Even summary stats for all the SNPs take hours
  • Need to be able to access individual SNPs as well
  • Presenting 1536 - 1M SNPs worth of data is a challenge
– Multiple comparison issues more obvious
In summary
• Genotyping is now a numbers game
  – Methods are VERY accurate
  – Budgets are tighter
• Considerations
  – Number of SNPs
  – Number of samples
  – Quantity/Quality of DNA
• Feel free to contact me regarding DNA sources etc.
Online resources (and slide sources)
• Taqman (appliedbiosystems.com)
• iplex (sequenom.com)
• Illumina products (Illumina.com)
• Affymetrix products (Affymetrix.com)
Genotyping Quality Control
P Kraft
Quality control

Method of assessment      High quality standard
Completion rate           > 95% complete
                          High failure rate correlated with high error rate
Reproducible genotypes    Repeat genotyping of random 5% sample has <1% discordance
Hardy Weinberg            Single loci: no significant deviations or small magnitude of deviation
                          Multiple loci: no more deviations than expected (q-q plot), no consistent trend (e.g. all undercalling hets)
Non-paternity             Where family data available
                          Remove all non-paternities

See Leal (2005) Genet Epidemiol, Cox & Kraft (2006) Hum Hered, Abecasis (2005) Am J Hum Genet
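The first two rows of the table reduce to simple proportions. The sketch below is an illustration only: the function names, the use of `None` for a failed call, and the toy data are assumptions, while the 95% and 1% thresholds come from the table.

```python
def completion_rate(calls):
    """Fraction of attempted genotypes that produced a call (None = no call)."""
    return sum(call is not None for call in calls) / len(calls)


def discordance_rate(pairs):
    """Fraction of repeat-genotyped samples whose two calls disagree.
    Pairs with a missing call on either run are left out of the comparison."""
    compared = [(a, b) for a, b in pairs if a is not None and b is not None]
    if not compared:
        return float("nan")
    return sum(a != b for a, b in compared) / len(compared)


def passes_qc(calls, repeat_pairs, min_completion=0.95, max_discordance=0.01):
    """Apply the two thresholds from the table."""
    return (completion_rate(calls) > min_completion
            and discordance_rate(repeat_pairs) < max_discordance)


# Hypothetical SNP: 100 samples with 2 failed calls, 5 blinded repeats that all agree.
calls = ["AA"] * 60 + ["AG"] * 30 + ["GG"] * 8 + [None] * 2
repeats = [("AA", "AA"), ("AG", "AG"), ("GG", "GG"), ("AA", "AA"), ("AG", "AG")]
print(completion_rate(calls), discordance_rate(repeats), passes_qc(calls, repeats))
```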
Testing for departure from HWE

Genotype   aa        Aa        AA
Count      $N_0$     $N_1$     $N_2$

Observed A allele frequency: $p = (2N_2 + N_1)/(2N)$, where $N = N_2 + N_1 + N_0$

Pearson’s chi-square test for departures from HWE:

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} = \frac{(N_2 - Np^2)^2}{Np^2} + \frac{(N_1 - 2Np(1-p))^2}{2Np(1-p)} + \frac{(N_0 - N(1-p)^2)^2}{N(1-p)^2}$$

This should be compared to a central chi-squared distribution with one degree of freedom
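A direct transcription of this test into Python might look like the sketch below. It is not PROC ALLELE or the exact test linked on the slide; the function name is invented, and the p-value uses the identity that the upper tail of a one-degree-of-freedom chi-square equals erfc(sqrt(x/2)). It assumes both alleles are observed (0 < p < 1).

```python
import math


def hwe_chisq(n_aa, n_Aa, n_AA):
    """Pearson chi-square test for departure from HWE at one biallelic locus.

    Returns (chi-square statistic, p-value on 1 degree of freedom).
    """
    n = n_aa + n_Aa + n_AA
    p = (2 * n_AA + n_Aa) / (2 * n)              # observed A allele frequency
    observed = [n_aa, n_Aa, n_AA]
    expected = [n * (1 - p) ** 2, 2 * n * p * (1 - p), n * p ** 2]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2))     # chi-square(1) upper tail
    return chi2, p_value


print(hwe_chisq(25, 50, 25))   # exactly in HWE
print(hwe_chisq(50, 0, 50))    # no heterozygotes at all
```

The second call mimics extreme undercalling of heterozygotes; the statistic is 100 on one degree of freedom, far beyond any conventional threshold.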
Implemented e.g. in SAS GENETICS – PROC ALLELE
http://www.sph.umich.edu/csg/abecasis/Exact/index.html
$\delta = (P_{Aa} - 2 p_A p_a) / (2 p_A p_a)$
Example from BPC3
9 of 22 tests significant at .05 level!
[Figure: Q-Q plot of HWE test statistics against quantiles of the chi-square distribution]
Q-Q plots compare two distributions by plotting their quantiles against each other.
Here it is useful for judging the similarity between the observed distribution of test statistics and the theoretical null distribution. Points should lie on the y=x diagonal!
[Figure panels: “Good”, “Better than before”, “Admixture?”]
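The coordinates of such a plot can be computed without any plotting library. The sketch below works on p-values rather than chi-square statistics and uses (i + 0.5)/n as the expected quantile, which is one common convention and an assumption here; the function names are hypothetical, and all p-values must be greater than 0.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) pairs of -log10 p-values for a Q-Q plot.

    Under the null hypothesis p-values are uniform on (0, 1), so the i-th
    smallest of n is expected near (i + 0.5) / n.
    """
    n = len(p_values)
    return [(-math.log10((i + 0.5) / n), -math.log10(p))
            for i, p in enumerate(sorted(p_values))]


def expected_significant(n_tests, alpha=0.05):
    """Number of tests expected to fall below alpha if every null is true."""
    return n_tests * alpha


# 22 tests at the .05 level: about 1 significant result expected by chance.
print(expected_significant(22))
```

Against an expectation of about one, 9 significant tests out of 22 is a clear excess, which is what the points above the diagonal show.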
[Figure: log-log quantile plot of p-value for Hardy-Weinberg proportion; log10(p-value) against log10(quantile); exact test, 299 779 SNPs; observed values compared with 20 simulations; annotations: expected 244 vs observed 586, expected 2600 vs observed 3340]
CGEMS prostate cancer whole genome scan: phase 1a, slide courtesy of G Thomas
Statistical Methods to Handle Errors
• Family-based:
  • AE-TDT: Models both missing parental data and genotype errors
    » Am J Hum Genet. 2001 Aug;69(2):371-80.
    » Eur J Hum Genet. 2004 Sep;12(9):752-61.
  • Likelihood with nuisance parameters
    » Genet Epidemiol. 2004 Feb;26(2):142-54.
  • Bayesian
    » Genet Epidemiol. 2004 Jan;26(1):70-80.
• Case-control:
    » Rice & Holmans (2003) Ann J Hum Genet 29:204
    » Gordon et al. (2004) Stat App Genet Molec Biol 3
• Need locus-specific error rates
    » Difficult to get for high-throughput platforms
Differential genotyping error can lead to inflated Type I error rates!
Nondifferential genotyping error does not generally lead to inflated Type I error rates, but can lead to loss of power and bias toward the null