Practical Considerations in Statistical Genetics Ashley Beecham June 19, 2015


Page 1: Practical Considerations in Statistical Genetics

Ashley Beecham

June 19, 2015

Page 2: Considerations

Study Design
Quality control: pre-analysis
► Samples
► Genetic markers
Quality control: post-analysis
► Q-Q plots
Quality control: meta-analysis
Multiple Testing

Page 3: Study Design

Is your phenotype genetic (i.e. heritable)?
Is it a binary trait? Or quantitative?
Are there age differences? Gender differences?
Are there important environmental factors to consider?

Page 4: Sample Quality Control

Genotyping efficiency
Gender discrepancies
Relatedness
Population stratification (case-control studies)
Mendelian errors (families)

Page 5: Sample Quality Control (Gender Checks)

[Figure annotations: "Sample Mix-up or Mislabel" (two regions); "Possible Sample Contamination"]

Page 6: Sample Quality Control (Relatedness)

Calculate the mean identity by state (IBS) between pairs of samples, then plot the standardized mean and variance using Graphical Relationship Representation (Abecasis et al., Bioinformatics 2001).

[Figure panels: Unrelated Case-Control; Trios]
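The pairwise IBS summary described on this slide can be sketched as follows. This is a minimal illustration, not the Graphical Relationship Representation implementation: it assumes genotypes are coded as 0, 1 or 2 copies of one allele, and it reports the raw mean and variance without the standardization mentioned above. The sample IDs and genotypes are made up.

```python
from itertools import combinations
from statistics import mean, pvariance

def ibs_summary(genotypes):
    """Return {(id1, id2): (mean IBS, variance of IBS)} for every sample pair.

    genotypes maps a sample ID to a list of allele counts (0, 1 or 2) over
    the same ordered markers; None marks a missing call.
    """
    summary = {}
    for a, b in combinations(sorted(genotypes), 2):
        # Alleles shared identical by state at one marker: 2 - |g_a - g_b|.
        shared = [2 - abs(x - y)
                  for x, y in zip(genotypes[a], genotypes[b])
                  if x is not None and y is not None]
        summary[(a, b)] = (mean(shared), pvariance(shared))
    return summary

# Hypothetical toy data.
toy = {
    "s1": [0, 1, 2, 1, 0, 2],
    "s2": [0, 1, 2, 1, 0, 2],  # identical to s1 (duplicate sample)
    "s3": [2, 1, 0, 1, 2, 0],  # shares few alleles with s1
}
result = ibs_summary(toy)
```

A duplicated sample shows up as a pair with mean IBS of 2 and variance 0, while less similar pairs have a lower mean and a larger variance.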

Page 7: Sample Quality Control (Population Stratification)

Allele frequency and prevalence differences between groups
Genetic drift
Differential selection
Little migration between subpopulations
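A small numeric sketch of why allele frequency and prevalence differences between groups matter. The counts below are hypothetical and chosen so that carrier status is unrelated to disease inside each subpopulation; pooling the two groups still produces an apparent association.

```python
def odds_ratio(case_carriers, case_noncarriers,
               control_carriers, control_noncarriers):
    """Odds ratio of a 2x2 carrier-by-disease table."""
    return (case_carriers * control_noncarriers) / (
        case_noncarriers * control_carriers)

# Hypothetical counts:
# (case carriers, case non-carriers, control carriers, control non-carriers)
subpop_1 = (20, 80, 180, 720)    # carrier frequency 20%, prevalence 10%
subpop_2 = (240, 160, 360, 240)  # carrier frequency 60%, prevalence 40%
pooled = tuple(a + b for a, b in zip(subpop_1, subpop_2))

or_1 = odds_ratio(*subpop_1)       # 1.0: no association within subpopulation 1
or_2 = odds_ratio(*subpop_2)       # 1.0: no association within subpopulation 2
or_pooled = odds_ratio(*pooled)    # about 1.93: spurious association when pooled
```

The pooled odds ratio departs from 1 only because the subpopulation with the higher carrier frequency also has the higher prevalence.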

Page 8: Sample Quality Control (Population Stratification)

EIGENSTRAT (Price et al., Nature Genetics 2006): a principal components analysis (PCA) method

► Applies principal components analysis to genotype data to infer population substructure

Principal components can be used as covariates in a regression model to correct for bias caused by substructure
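One common way to write the covariate adjustment for a binary trait is shown below. The notation is an assumption made for illustration and does not come from the slides: $y_i$ is case status, $g_i$ is the genotype at the tested SNP, $\mathrm{PC}_{ik}$ is sample $i$'s score on the $k$-th principal component, and $K$ is the number of components included.

```latex
\operatorname{logit} P(y_i = 1) = \beta_0 + \beta_1 g_i + \sum_{k=1}^{K} \gamma_k \, \mathrm{PC}_{ik}
```

The test of interest is then $\beta_1 = 0$, with the $\gamma_k$ terms absorbing differences that track ancestry.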

Page 9: Quality Control of Genetic Markers

Genotyping efficiency
Hardy-Weinberg equilibrium
Differential missingness

Page 10: Marker Quality Control: Hardy-Weinberg Equilibrium

There are two alleles at a given locus, A and a

$p = \mathrm{freq}(A)$ and $q = \mathrm{freq}(a)$

$p + q = 1$

Page 11: Marker Quality Control: Hardy-Weinberg Equilibrium

$(p + q)(p + q) = p^2 + pq + qp + q^2 = p^2 + 2pq + q^2$

$p^2$: AA homozygotes
$2pq$: Aa heterozygotes
$q^2$: aa homozygotes

Page 12: Marker Quality Control: Hardy-Weinberg Equilibrium

$p^2 = f(AA)$
$2pq = f(Aa)$
$q^2 = f(aa)$
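The relationship between allele frequencies and expected genotype frequencies can be sketched in a few lines. The function names and the example counts are illustrative assumptions.

```python
def allele_frequency(n_AA, n_Aa, n_aa):
    """Estimate p = freq(A) by allele counting: AA carries two A, Aa carries one."""
    n = n_AA + n_Aa + n_aa
    return (2 * n_AA + n_Aa) / (2 * n)

def hwe_genotype_frequencies(p):
    """Expected genotype frequencies under Hardy-Weinberg equilibrium."""
    q = 1 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}

# Hypothetical genotype counts for 1,000 samples.
p = allele_frequency(490, 420, 90)       # 0.7
expected = hwe_genotype_frequencies(p)   # AA 0.49, Aa 0.42, aa 0.09
```

The three expected frequencies always sum to 1 because they are the expansion of (p + q) squared.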

Page 13: Marker Quality Control: Hardy-Weinberg Equilibrium

Under a dominant model:
► Frequency of affecteds = $p^2 + 2pq$

Under a recessive model:
► Frequency of affecteds = $q^2$
► Frequency of carriers = $2pq$
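A worked example for the recessive case, using a hypothetical disease allele frequency of $q = 0.01$ (so $p = 0.99$) and assuming every aa individual is affected:

```latex
\begin{aligned}
\text{Frequency of affecteds} &= q^2 = (0.01)^2 = 0.0001 \quad (\text{1 in } 10{,}000) \\
\text{Frequency of carriers}  &= 2pq = 2 \times 0.99 \times 0.01 = 0.0198 \quad (\text{about 1 in } 50)
\end{aligned}
```

Carriers are far more common than affected individuals when the disease allele is rare.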

Page 14: Marker Quality Control: Hardy-Weinberg Equilibrium

Simple $\chi^2$ test
Laboratory error
May be telling you something
► Controls in HWE, cases not
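A minimal sketch of the simple chi-square test for HWE, assuming a biallelic marker and the Pearson goodness-of-fit statistic with 1 degree of freedom. Exact tests are often preferred for rare alleles, and the slides do not say which variant was used. The counts in the examples are hypothetical.

```python
import math

def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Pearson chi-square test of Hardy-Weinberg equilibrium (1 df).

    Returns (chi2, p_value). Monomorphic markers carry no information
    about HWE and are reported as (0.0, 1.0).
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)
    q = 1 - p
    if p == 0 or q == 0:
        return 0.0, 1.0
    observed = (n_AA, n_Aa, n_aa)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square distribution with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

in_hwe = hwe_chi_square(490, 420, 90)      # counts match HWE expectations
out_of_hwe = hwe_chi_square(600, 200, 200) # deficit of heterozygotes
```

Running the test separately in cases and controls is one way to spot the "controls in HWE, cases not" pattern noted above.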

Page 15: Post Analysis Quality Control: Q-Q plots

What is a Q-Q plot? “Q” stands for quantile.

Used to assess the number and magnitude of observed associations between SNPs and the trait of interest, compared to the association statistics expected under the null hypothesis of no association

► Deviations from the “identity” line can indicate true association
► Sharp deviations are likely due to error
► Deviations are also possible due to sample relatedness or population structure

Genomic Inflation Factor (GIF) can be computed to assess deviations
► Ratio of the median observed association statistic to the expected median
► A value of 1 would mean no deviation
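The GIF ratio can be sketched as below, assuming the association statistics are chi-square statistics with 1 degree of freedom, whose median under the null is about 0.4549. The input values are hypothetical.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def genomic_inflation_factor(chi2_stats):
    """Ratio of the median observed statistic to the median expected under the null."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN

# Hypothetical statistics: the second list is the first one doubled.
gif_null = genomic_inflation_factor([0.1, 0.4549364, 2.0])
gif_inflated = genomic_inflation_factor([0.2, 0.9098728, 4.0])
```

Because the median is used, a handful of true associations in the tail has little effect on the value, while a genome-wide shift moves it away from 1.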

Page 16: Post Analysis Quality Control: Q-Q plots

Page 17: Meta-Analyses

There can be biases in our data not only within sites but across sites!
► Genotyping effects
► Genotype calling effects

Page 18: Batch Effects: A Tale of the ImmunoChip

ImmunoChip
Fine-Mapping
Replication

207,728

AS (Ankylosing Spondylitis)
CeD (Coeliac Disease)
CD (Crohn’s Disease)
IgA (IgA Deficiency)
MS (Multiple Sclerosis)
PBC (Primary Biliary Cirrhosis)
PS (Psoriasis)
RA (Rheumatoid Arthritis)
SLE (Systemic Lupus Erythematosus)
T1D (Type 1 Diabetes)
UC (Ulcerative Colitis)
AITD (Autoimmune Thyroid Disease)
WTCCC2 (PD, Bipolar, Reading etc.)

Page 19: A Focus on Multiple Sclerosis

Stratum    Cases    Controls
AUSNZ        247         944
Belgium      302       1,703
Denmark      741         835
Finland      221         486
France       386         354
Germany    2,582       5,545
Italy        957       1,255
Norway       894         674
Sweden     2,153       2,331
UK         4,324       4,422
US         1,691       5,542
TOTAL     14,498      24,091

Page 20: Genotyping and Genotype Calling

Genotyping was done at 5 sites:
► John P. Hussman Institute for Human Genomics, University of Miami
► Wellcome Trust Sanger Institute
► Local sites in France, Germany, and the United States

All genotype calling was done at the Wellcome Trust Sanger Institute in 3 batches
► Initially used Illuminus and GenoSNP
► Final genotype calls made with Opticall

[Figure: two intensity plots, Norm Intensity (A) on the x axis vs. Norm Intensity (B) on the y axis, for markers exm-IND10-102817747 and exm-IND22-16602868]

Page 21: Initial Marker Quality Control

Using Illuminus and GenoSNP, autosomal markers were divided into categories of ‘good’, ‘middle’, and ‘bad’ based on the following criteria:

Good: call rate in both was ≥95% and concordance was ≥99%
► Concordant calls were kept

Bad: call rate was <95% in both Illuminus and GenoSNP
► Drop all markers

Middle: marker did not meet Good or Bad criteria
► More detailed analysis was done using 1000 Genomes data
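The three-way rule on this slide can be sketched as a small function. The function name and the representation of rates as fractions between 0 and 1 are assumptions; the thresholds are the ones stated above.

```python
CALL_RATE_MIN = 0.95     # call rate threshold from the slide
CONCORDANCE_MIN = 0.99   # concordance threshold from the slide

def classify_marker(call_rate_illuminus, call_rate_genosnp, concordance):
    """Assign a marker to 'good', 'bad' or 'middle' using the stated criteria."""
    both_pass = (call_rate_illuminus >= CALL_RATE_MIN
                 and call_rate_genosnp >= CALL_RATE_MIN)
    both_fail = (call_rate_illuminus < CALL_RATE_MIN
                 and call_rate_genosnp < CALL_RATE_MIN)
    if both_pass and concordance >= CONCORDANCE_MIN:
        return "good"
    if both_fail:
        return "bad"
    # Everything else, e.g. one caller passes, or both pass with low concordance.
    return "middle"

# Hypothetical markers.
examples = [
    classify_marker(0.99, 0.98, 0.995),  # good
    classify_marker(0.90, 0.92, 0.500),  # bad
    classify_marker(0.99, 0.90, 0.999),  # middle: only one caller passes
    classify_marker(0.99, 0.99, 0.970),  # middle: concordance too low
]
```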

Page 22: Initial Test for Population Substructure

When testing for population substructure, problems related to ‘calling batches’ were discovered.

PCA was done using a test set of Swedish samples.

[Figure labels: Miami; Sanger]

Page 23: Investigating the Problem

We performed the following comparisons to identify the source of the problem:
(A) Define the genotyping center as the phenotype and test each variant for association with it.
(B) Test for differences in genotype missingness between the 2 centers.
(C) Test for deviation from Hardy-Weinberg equilibrium.

[Figure panels:
(A) Scatter plot of the first principal component’s loadings (y axis) vs. -log10(p-values) from a logistic regression model using the genotyping center as phenotype
(B) Scatter plot of the first principal component’s loadings (y axis) vs. -log10(p-values) from a test of SNP missingness between the 2 genotyping centers
(C) Scatter plot of the first principal component’s loadings (y axis) vs. -log10(p-values) for deviation from Hardy-Weinberg equilibrium]

Page 24: Investigating the Problem

In the next step, we identified all the SNPs with a p-value < $10^{-3}$ in each respective test. We removed them and then recalculated the principal components.

[Figure panels: Genotyping center as phenotype; SNP missingness between centers; HWE]

From the above, it is clear that the difference between genotyping centers is not the culprit. Rather, the problem seems to be associated with differences in HWE, which are a proxy for discordant calls between centers.
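The filtering step can be sketched as follows: for each of the three tests, drop the SNPs below the threshold and keep the rest for a fresh PCA. The SNP IDs and p-values are placeholders, and recomputing the principal components is left out.

```python
P_THRESHOLD = 1e-3  # threshold stated on the slide

def retained_snps(p_values, threshold=P_THRESHOLD):
    """Return the sorted SNP IDs whose p-value is not below the threshold."""
    return sorted(snp for snp, p in p_values.items() if p >= threshold)

# Hypothetical per-test p-values; names and values are made up.
tests = {
    "center_as_phenotype": {"snp1": 0.40, "snp2": 2e-5, "snp3": 0.08},
    "missingness": {"snp1": 5e-4, "snp2": 0.30, "snp3": 0.90},
    "hwe": {"snp1": 0.70, "snp2": 0.02, "snp3": 1e-9},
}

# One retained set per test; each set would feed its own PCA run.
kept = {name: retained_snps(p_values) for name, p_values in tests.items()}
```

Comparing the first principal component across the three retained sets shows which test removes the markers that drive the split.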

Page 25: Investigating the Problem

Example: rs13306196

For this SNP, the Illuminus call was used for both centers. In Miami, a G allele was assigned and in Sanger an A allele was assigned. This means that the cluster assignment was likely reversed between sites.

Data               A1   A2   Genotype counts (A1A1/A1A2/A2A2)
All                G    A    1969/0/6866
Miami Illuminus    0    G    0/0/1969
Sanger Illuminus   0    A    0/0/6866
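The merged counts show why an HWE test flags this kind of discordance. Assuming the Pearson chi-square test with 1 degree of freedom, a marker with two homozygote classes and no heterozygotes gives a statistic equal to the sample size. With observed counts $(n_{AA}, 0, n_{aa})$, $n = n_{AA} + n_{aa}$, $p = n_{AA}/n$ and $q = n_{aa}/n$:

```latex
\begin{aligned}
\chi^2 &= \frac{(np - np^2)^2}{np^2} + \frac{(0 - 2npq)^2}{2npq} + \frac{(nq - nq^2)^2}{nq^2} \\
       &= nq^2 + 2npq + np^2 = n(p + q)^2 = n
\end{aligned}
```

For the "All" row, $n = 1969 + 6866 = 8835$, so $\chi^2 = 8835$, an extreme departure from HWE even though each center on its own looks monomorphic.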

Page 26: Investigating the Problem

[Figure panels: GenoSNP; Illuminus]

Illuminus fails to call the same allele even for some mono-allelic markers

Page 27: Solution to the Problem

The dichotomy of the first principal component is explained by calling discordances of the Illuminus caller. There is probably a bug in the Illuminus calling algorithm that causes difficulties in making calls when fewer than 3 clusters exist.

Solution: Re-QC using GenoSNP or Opticall (new)

Page 28: Solution to the Problem

[Figure panels: Clean GenoSNP/Illuminus; Opticall]

Using Opticall, the first principal component no longer splits the data into 2 separate clusters

In later analyses, Opticall was determined to have less variation than GenoSNP in genotype frequencies between genotype calling batches

Page 29: Final Assessment of Analysis: GIF

[Figure: marker flow diagram. Counts shown: 207,728; 192,402; 161,311; 24,388; 20,381; 10,710; 28,406; 108,517. Labels shown: production; Failed QC; Monomorphic; MAF > 5%; MAF 0.5-5%; MAF < 0.5%; (Autosomal)]

Page 30: Multiple Testing

In genetics, there have always been two opposing camps:
► Liberals: They don’t worry about it at all. They report nominal P values and aren’t afraid to be wrong.
► Conservatives: They worry about it all the time. They report only fully “corrected” P values.

Common methods:
► Bonferroni
► False Discovery Rate
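The two methods can be sketched side by side. The slide names "False Discovery Rate" without specifying a procedure; the Benjamini-Hochberg step-up procedure is used here as an assumption, and the p-values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a test when its p-value is at most alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[i] = True
    return rejected

# Hypothetical nominal p-values from five tests.
nominal = [0.02, 0.001, 0.5, 0.012, 0.045]
bonf = bonferroni(nominal)         # [False, True, False, False, False]
bh = benjamini_hochberg(nominal)   # [True, True, False, True, False]
```

On the same inputs Bonferroni keeps one result and Benjamini-Hochberg keeps three, which reflects the trade-off between the two camps described above.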