
Case-Control Genetic Association Studies with Next-Generation Sequence Data and External Controls: The Robust Variance Score

Lisa Strug

The Hospital for Sick Children

University of Toronto

Emerging Statistical Challenges and Methods For Analysis of Massive Genomic Data in Complex Human Disease Studies

June 2014, Banff

Motivation

Association studies using next generation sequencing (NGS) are expensive

Public NGS data exists (e.g. 1000 genomes, Complete Genomics)

Can we use our NGS cases with (publicly available) ‘out of study’ sequenced control groups in genetic association studies?

Supplement our control NGS data with public NGS data?

Or as the only control group, much like the Wellcome Trust Case Control Consortium (2007) did for GWAS with SNPs

Benefits of Using External NGS Control Groups

Using (publicly) available control groups would reduce costs

Allows one to focus on sequencing cases

Could increase statistical power

Would provide a means to prioritize variants for follow-up genotyping or functional studies

Could be used to increase sample size of sequenced controls

Challenges

Several factors could bias association studies when cases and controls are sequenced as part of different experimental designs:

1. Study design considerations e.g. case-control differences in ethnicity or other confounders

2. Factors related to sequencing technology and parameters:
• Enrichment and base calling procedures (different platforms)
• Alignment (algorithm and reference genome)
• Read depth (low read depth, LRD, results in biased allele frequency)
• SNP detection and genotype calling algorithms (e.g. GATK, SAMtools)

Effects of read depth and platforms on genotype calling

1. Sequencing on Ion Torrent PGM (lower coverage for AT-rich genome and higher error rate*) or on Illumina HiSeq 2000 (lower error rate*)

2. Raw reads with base quality (FastQ file)

3. Map and local re-map to reference genome

4. Aligned reads with map qualities (BAM file)

5. SNP detection and genotype calling
• Genotype probability P(genotype|Data) for each sample
• SNPs and indels by joining information across samples

6. VCF files with genotype calls, genotype probabilities, etc.

* Quail et al., BMC Genomics 2012, 13:341

Bias When Using Genotype Calls

Systematic differences in read depth can affect Type I error (Kim et al. 2011) due to differential misclassification/screening (rare homozygotes are miscalled/screened more often in the LRD sample)

Simulation: genotype calls for 1000 variants; 50 cases at 100x and 150 controls at 4x

R (selection threshold): filters out called genotypes that have low confidence and assigns them as missing.

NGS Case-Control Association Designs

1. Sequence cases for variant discovery and follow up with genotyping (Liu and Leal, 2012; Longmate et al., 2010; Sanna et al., 2011)
• Cost effective and can control Type I error inflation
• Cannot detect protective variants present only in discovery sample and may be overly conservative

2. Adjusting for read depth or weighting variant calls by quality score (Daye et al., 2012; Garner, 2011)
• Parameters are not estimable when cases and controls are distinguishable by read depth

3. Randomly down-sampling BAM files (available in GATK toolkit)
• Not as powerful as an approach using all data

Our Proposal

• Re-purpose and extend an approach by Skotte et al (2012) that incorporates genotype uncertainty into the association score test by using genotype likelihoods

• In comparison to using called genotypes:
  • Can improve power
  • Avoids spurious findings
  • Limits the need for some subjective quality thresholds, e.g. R

Workflow proposed for NGS association using external NGS controls

Note: Different alignment algorithms are implicitly accounted for by the RVS because the unit of analysis in the association analysis is the genotype probability rather than the genotype call.

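Recall: Score Statistic for Logistic Regression

• Notation: $Y_i$ - phenotype value for sample i (1 = case, 0 = control); $G_{ij}$ - genotype information for sample i at the jth variant.

• To find out whether the genotype is associated with the phenotype, we use the logistic model $\mathrm{logit}\, P(Y_i = 1 \mid G_{ij}) = \beta_0 + \beta_1 G_{ij}$

• The score statistic used to test the null hypothesis $H_0: \beta_1 = 0$ is $S_j = \sum_{i=1}^{n} (Y_i - \bar{Y})\, G_{ij}$

• And the corresponding score test statistic is $T_j = S_j^2 / \mathrm{Var}(S_j)$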
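The score test recalled above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `score_test` and the toy data are hypothetical, the variance is the usual null estimate for a logistic model with an intercept only, and the p-value uses the chi-square distribution with 1 degree of freedom.

```python
import math

def score_test(y, g):
    """Score test of H0: no genotype effect in a logistic model (intercept only under H0)."""
    n = len(y)
    y_bar = sum(y) / n
    g_bar = sum(g) / n
    # Score: S = sum_i (y_i - y_bar) * g_i
    s = sum((yi - y_bar) * gi for yi, gi in zip(y, g))
    # Null variance estimate: y_bar * (1 - y_bar) * sum_i (g_i - g_bar)^2
    var_s = y_bar * (1 - y_bar) * sum((gi - g_bar) ** 2 for gi in g)
    t = s * s / var_s
    # Upper tail of a chi-square with 1 df: P(X > t) = erfc(sqrt(t / 2))
    p = math.erfc(math.sqrt(t / 2))
    return s, var_s, t, p

# Toy data (hypothetical): 4 cases, 6 controls, genotypes coded 0/1/2
y = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
g = [2, 1, 1, 0, 0, 0, 1, 0, 0, 0]
s, var_s, t, p = score_test(y, g)
print(f"S = {s:.3f}, Var(S) = {var_s:.3f}, T = {t:.3f}, p = {p:.4f}")
```

With only ten samples the asymptotic p-value is not reliable; the example only shows how the pieces fit together.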

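Association for Sequencing Data by Skotte et al. (2012)

• Notation: $D_{ij}$ - sequence data (reads and errors, BAM file) for the jth variant of the ith subject; $G_{ij}$ - unobserved genotype (coded as 0, 1, 2).

• Joint probability of the observed data $(Y_i, D_{ij})$ for i = 1,…,n:

$$\prod_{i=1}^{n} P(Y_i, D_{ij}) = \prod_{i=1}^{n} \sum_{g=0}^{2} P(Y_i \mid G_{ij} = g)\, P(D_{ij}, G_{ij} = g)$$

• With $P(Y_i \mid G_{ij} = g)$ given by the logistic model, the score used to test the null hypothesis $H_0: \beta_1 = 0$ is

$$S_j = \sum_{i=1}^{n} (Y_i - \bar{Y})\, E(G_{ij} \mid D_{ij}) \qquad (1)$$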

Calculation of E(Gij|Dij)

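• The calculation of the expected genotype given the sequencing data is given by

$$E(G_{ij} \mid D_{ij}) = \sum_{g=0}^{2} g\, P(G_{ij} = g \mid D_{ij})$$

where

$$P(G_{ij} = g \mid D_{ij}) = \frac{P(D_{ij} \mid G_{ij} = g)\, P(G_{ij} = g)}{\sum_{g'=0}^{2} P(D_{ij} \mid G_{ij} = g')\, P(G_{ij} = g')}$$

The genotype likelihood $P(D_{ij} \mid G_{ij} = g)$ is calculated from all the reads of the sample and is provided in the output of standard genotype calling packages (e.g. a VCF file), or calculated from the aligned reads using the simple Bayesian genotyper (McKenna et al., 2010).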

Genotype frequencies are calculated from the full sample by the EM algorithm (McKenna et al. 2010; Skotte et al. 2012)

Also need to calculate the var(Sj) and therefore var[E(Gij|Dij)]
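A minimal sketch of this calculation, assuming each sample's genotype likelihoods are already available as a triple (for G = 0, 1, 2). The function names, the starting values, the fixed number of EM iterations, and the toy likelihoods are all illustrative choices, not taken from the talk or from any package.

```python
def posterior(lik, freq):
    """P(G = g | D) for g = 0, 1, 2 by Bayes' rule."""
    joint = [l * f for l, f in zip(lik, freq)]
    total = sum(joint)
    return [j / total for j in joint]

def em_genotype_freq(liks, n_iter=100):
    """EM estimate of genotype frequencies from per-sample genotype likelihoods."""
    freq = [1 / 3, 1 / 3, 1 / 3]  # flat starting values (an arbitrary choice)
    for _ in range(n_iter):
        # E-step: posterior genotype probabilities for every sample
        post = [posterior(l, freq) for l in liks]
        # M-step: new frequencies are the average posteriors
        freq = [sum(p[g] for p in post) / len(liks) for g in range(3)]
    return freq

def expected_genotype(lik, freq):
    """E(G | D) = sum_g g * P(G = g | D)."""
    return sum(g * p for g, p in enumerate(posterior(lik, freq)))

# Hypothetical genotype likelihoods P(D | G = g) for four samples
liks = [
    (0.98, 0.02, 0.00),  # clearly homozygous reference
    (0.01, 0.98, 0.01),  # clearly heterozygous
    (0.40, 0.50, 0.10),  # ambiguous, as expected at low read depth
    (0.97, 0.03, 0.00),
]
freq = em_genotype_freq(liks)
dosages = [expected_genotype(l, freq) for l in liks]
print([round(f, 3) for f in freq], [round(d, 3) for d in dosages])
```

The ambiguous sample gets an expected genotype between the integer values, which is how genotype uncertainty enters the score.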

Variance of E(Gij|Dij) Depends on Read Depth

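• The law of total variance: $\mathrm{Var}(G_{ij}) = \mathrm{Var}[E(G_{ij} \mid D_{ij})] + E[\mathrm{Var}(G_{ij} \mid D_{ij})]$

• At high read depth, $\mathrm{Var}[E(G_{ij} \mid D_{ij})] \approx \mathrm{Var}(G_{ij})$, because in $E(G_{ij} \mid D_{ij}) = \sum_{g=0}^{2} g\, P(G_{ij} = g \mid D_{ij})$ the probability $P(G_{ij} = g \mid D_{ij})$ goes to 1 for the true genotype, and $E[\mathrm{Var}(G_{ij} \mid D_{ij})]$ goes to 0

• When read depth is not high enough, $\mathrm{Var}[E(G_{ij} \mid D_{ij})] < \mathrm{Var}(G_{ij})$

• Therefore, $\mathrm{Var}[E(G_{ij} \mid D_{ij})]$ is read depth dependent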
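The read depth dependence can be checked with a small simulation. The sketch below rests on simplifying assumptions that are not from the talk: a fixed read depth per individual, a binomial read model with a constant 1% base error, Hardy-Weinberg genotype frequencies at MAF = 0.2, and known genotype frequencies in the posterior. All function names are hypothetical.

```python
import math
import random

def genotype_likelihoods(n_alt, depth, err=0.01):
    """Binomial likelihood of n_alt alternate reads out of depth, for G = 0, 1, 2."""
    p_alt = (err, 0.5, 1 - err)
    return [math.comb(depth, n_alt) * p ** n_alt * (1 - p) ** (depth - n_alt)
            for p in p_alt]

def simulate_dosage_variance(depth, maf=0.2, n=5000, err=0.01, seed=1):
    """Sample variance of E(G | D) across n simulated individuals."""
    rng = random.Random(seed)
    freq = [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]
    p_alt = (err, 0.5, 1 - err)
    dosages = []
    for _ in range(n):
        g = rng.choices([0, 1, 2], weights=freq)[0]          # true genotype
        n_alt = sum(rng.random() < p_alt[g] for _ in range(depth))  # simulated reads
        lik = genotype_likelihoods(n_alt, depth, err)
        joint = [l * f for l, f in zip(lik, freq)]
        dosages.append(sum(k * j for k, j in enumerate(joint)) / sum(joint))
    mean = sum(dosages) / n
    return sum((d - mean) ** 2 for d in dosages) / (n - 1)

var_true = 2 * 0.2 * 0.8                   # Var(G) under Hardy-Weinberg, MAF = 0.2
var_high = simulate_dosage_variance(100)   # high read depth
var_low = simulate_dosage_variance(4)      # low read depth
print(f"Var(G) = {var_true:.3f}, 100x: {var_high:.3f}, 4x: {var_low:.3f}")
```

Under these assumptions the variance at 100x should sit close to Var(G), while the variance at 4x is visibly smaller.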

Robust Variance Estimation

• Estimators are needed for the true variances of $E(G_{ij} \mid D_{ij})$ in cases and in controls

• If read depth is the same for cases and controls, the logistic regression estimate of Var(Sj) could be used; otherwise, this estimate is biased

• Bias is a function of Ncontrols and Ncases: for Ncontrols >> Ncases, Var(Sj) is underestimated

• Thus variance estimation must distinguish between the two groups

• We estimate $\mathrm{Var}[E(G_{ij} \mid D_{ij})]$ in cases by $\mathrm{Var}(G_{ij})$ computed with estimated genotype frequencies

• And $\mathrm{Var}[E(G_{ij} \mid D_{ij})]$ in controls is estimated by its sample variance in controls
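A hedged sketch of a group-specific variance estimate, compared with the pooled (logistic regression) estimate. It assumes the score variance decomposes as Ncases(1 - Ȳ)² × (variance in cases) + Ncontrols Ȳ² × (variance in controls), which holds when individuals are independent and both groups share the same mean under the null; the exact form used by the authors should be checked against the publication. Function names and the toy numbers are hypothetical.

```python
def sample_var(x):
    """Unbiased sample variance."""
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / (len(x) - 1)

def genotype_var(freq):
    """Var(G) implied by genotype frequencies (P(G=0), P(G=1), P(G=2))."""
    mean = freq[1] + 2 * freq[2]
    return freq[1] + 4 * freq[2] - mean ** 2

def robust_var(n_cases, ctrl_dosages, freq):
    """Group-specific variance of the score (assumed decomposition, see lead-in)."""
    n_ctrl = len(ctrl_dosages)
    y_bar = n_cases / (n_cases + n_ctrl)
    return (n_cases * (1 - y_bar) ** 2 * genotype_var(freq)           # cases: Var(G)
            + n_ctrl * y_bar ** 2 * sample_var(ctrl_dosages))         # controls: sample variance

def pooled_var(case_dosages, ctrl_dosages):
    """Usual logistic regression estimate, which ignores group differences."""
    x = list(case_dosages) + list(ctrl_dosages)
    y_bar = len(case_dosages) / len(x)
    m = sum(x) / len(x)
    return y_bar * (1 - y_bar) * sum((v - m) ** 2 for v in x)

# Toy example: few high-depth cases, many low-depth controls whose
# expected genotypes are shrunk toward the mean
freq = (0.64, 0.32, 0.04)
cases = [0, 0, 0, 1, 1]
controls = [0.2] * 9 + [0.6] * 5 + [1.0]
v_robust = robust_var(len(cases), controls, freq)
v_pooled = pooled_var(cases, controls)
print(f"robust: {v_robust:.4f}, pooled: {v_pooled:.4f}")
```

In this toy setting the pooled estimate is smaller than the group-specific one, in line with the underestimation described above for Ncontrols >> Ncases.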

Robust Variance Score (RVS) Test

• Is the score test with the proposed variance estimation

• Evaluate the test statistic by its asymptotic distribution or by bootstrap resampling (Hall and Hart, 1990).

Score test for single SNP analysis:

$$T_j = \frac{S_j^2}{\widehat{\mathrm{Var}}(S_j)}$$

For joint analysis of J rare variants, combine the vector S=(S1,…,SJ) into a single test statistic by common approaches such as CAST (Morgenthaler and Thilly 2007), Weighted Sum (Madsen and Browning 2009), SKAT (Wu et al. 2011), SKAT-O (Lee et al. 2012).

Estimate covariance matrices for vector S separately in cases and controls and combine as previously done.
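One way to picture the joint analysis is a linear burden-type statistic, (w'S)² / (w'Σw), with Σ built from group-specific covariance matrices. The sketch below is an assumption-laden illustration rather than the authors' implementation: it uses sample covariances of the expected genotypes in both groups (the slide does not say how the case covariance is estimated), unit weights by default, and a chi-square reference with 1 degree of freedom. The function names and toy data are hypothetical.

```python
import math

def cov_matrix(rows):
    """Sample covariance matrix of a list of equal-length dosage vectors."""
    n, n_var = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_var)]
    return [[sum((r[j] - means[j]) * (r[k] - means[k]) for r in rows) / (n - 1)
             for k in range(n_var)] for j in range(n_var)]

def burden_score_test(cases, controls, weights=None):
    """Burden-type test on expected genotypes with group-specific covariances."""
    n_var = len(cases[0])
    w = weights or [1.0] * n_var
    n1, n0 = len(cases), len(controls)
    y_bar = n1 / (n1 + n0)
    # S_j = sum_i (Y_i - Y_bar) * E(G_ij | D_ij)
    s = [(1 - y_bar) * sum(r[j] for r in cases) - y_bar * sum(r[j] for r in controls)
         for j in range(n_var)]
    c1, c0 = cov_matrix(cases), cov_matrix(controls)
    # Combine the two covariance matrices with the (Y_i - Y_bar)^2 weights
    sigma = [[n1 * (1 - y_bar) ** 2 * c1[j][k] + n0 * y_bar ** 2 * c0[j][k]
              for k in range(n_var)] for j in range(n_var)]
    num = sum(wj * sj for wj, sj in zip(w, s)) ** 2
    den = sum(w[j] * sigma[j][k] * w[k] for j in range(n_var) for k in range(n_var))
    t = num / den
    return t, math.erfc(math.sqrt(t / 2))  # chi-square (1 df) upper tail

# Toy expected genotypes for 2 variants (rows are individuals)
cases = [[1, 0], [0, 1], [1, 1]]
controls = [[0, 0], [0.1, 0], [0, 0.2], [0.1, 0.1]]
t, p = burden_score_test(cases, controls)
print(f"T = {t:.3f}, p = {p:.5f}")
```

With a single variant this reduces to the single SNP statistic, using the sample variance in each group. Other combinations (for example SKAT-type quadratic forms) need the full matrix Σ rather than the single quadratic form w'Σw.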

Evaluating the RVS

I. Using Simulation to assess Power and Type I error

II. RVS applied to 1000 Genomes data: high read depth (HRD) exomes versus low read depth (LRD) whole genomes

III. RVS applied to Rolandic Epilepsy NGS HRD cases versus 1000 Genomes LRD controls, in a previously identified region of association

I. Evaluating RVS via Simulation

Simulation Setting:

Single variant analysis under the null: MAF = 1%, 10%, 20%, 30%, 40%

Rare variant analysis: collapse 5 rare variants, MAF ranging from 0.1% to 5%

500 cases, 100x average read depth with error N(0.01,0.025)

1500 controls, 4x average read depth with error N(0.01,0.025)

We present RVS with CAST; similar results for other tests

Under the alternative, all 5 variants have OR=1.5

Single Variant Analysis Type I Error

QQ plot of p-values for an analysis of 1000 variants simulated under the null model.

500 high read depth cases and 1500 low read depth controls. MAF equal to A) 0.01, B) 0.1, C) 0.2 and D) 0.4.

RVS Type I Error and Power: Compared to True Genotypes

Table: Empirical power (analysis with CAST; 500 cases, 1500 controls)

Method            Level of the test
                  0.05    10^-2   10^-3   10^-4
RVS               0.96    0.89    0.72    0.51
True Genotypes    0.97    0.92    0.79    0.61

Quantile-quantile plot of p-values

II. Evaluating RVS Using 1000 Genomes Data

The 1000 Genomes project samples (CEU + GBR), phase 3 release [20130502]
• One sample consists of exome data on 56 individuals (~50x)
• Second sample consists of 113 individuals sequenced at low read depth (~6.5x)

Aligned chromosome 11 reads downloaded

Use GATK (DePristo et al. 2011) on the combined sample of aligned reads, generate multi-sample VCF file

Apply common filters, then extract genotype calls and genotype likelihoods from the VCF file

Compared RVS to the score test using genotype calls
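Extracting genotype likelihoods from a VCF record can be sketched as follows, assuming the record carries a `PL` field (Phred-scaled genotype likelihoods, as defined in the VCF specification) and the site is biallelic, so that `PL` has three values. The record below is made up for illustration, and the helper name is hypothetical; a production pipeline would normally use a dedicated VCF parser.

```python
def genotype_likelihoods_from_vcf(format_field, sample_field):
    """Convert a sample's PL values to likelihoods for G = 0, 1, 2.

    PL is Phred-scaled, so likelihood = 10 ** (-PL / 10), defined up to a
    constant that cancels when posterior probabilities are computed.
    Returns None when PL is absent or missing.
    """
    fields = dict(zip(format_field.split(":"), sample_field.split(":")))
    pl = fields.get("PL")
    if pl is None or pl == ".":
        return None
    return [10 ** (-int(x) / 10) for x in pl.split(",")]

# Hypothetical biallelic VCF record with two samples
record = ("11\t100\t.\tA\tG\t50\tPASS\t.\tGT:AD:DP:GQ:PL"
          "\t0/1:3,2:5:45:45,0,60\t./.")
cols = record.split("\t")
fmt, samples = cols[8], cols[9:]
liks = [genotype_likelihoods_from_vcf(fmt, s) for s in samples]
print(liks)
```

Samples without likelihoods come back as `None` and would need to be handled (for example dropped) before computing expected genotypes.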

RVS Type I Error in 1000 Genomes Control Groups

Single SNP analysis, MAF>5%

Quantile-quantile plot of p-values

RVS Type I Error in 1000 Genomes Control Groups

Analysis with CAST, MAF<5%, 5 variants

Quantile-quantile plots of p-values

III. RVS in NGS cases with Rolandic Epilepsy (RE) and public controls (1000 Genomes)

27 individuals of European descent with RE, sequenced (~197x) in a previously linked and associated 600kb region of chr 11

Compare them to 113 whole genome controls from 1000 Genomes (MAF>0.05)

Generate multi-sample VCF file on the combined set with GATK

Apply common filters, then extract genotype calls and genotype likelihoods

Compare RVS results to an analysis with genotype calls using a sample of 200 Europeans sequenced by Complete Genomics (~35x)

Rolandic Epilepsy NGS Association Analysis

SNP          Position*   P-value (rank) using genotype calls:    P-value (rank) using RVS:
                         RE versus Complete Genomics controls    RE versus 1000 Genomes controls
rs6484504    31424823    0.00008 (1)                             0.0010 (3)
rs578666     31404484    0.0002 (2)                              0.0003 (2)
rs674035     31399014    0.0007 (3)                              0.0002 (1)
rs11031375   31428184    0.003 (4)                               0.009 (7)
rs662702     31809070    0.012 (5)                               0.052 (191)
rs11031330   31275073    0.011 (7)                               0.0018 (4)
rs603202     31866585    NA                                      0.0024 (5)

* Build 37; approximately 450 variants analyzed

Conclusions

Score test based on genotype calls/likelihoods has inflated type I error when case and control groups have different read depths.

Read depth bias can be avoided if both cases and controls are HRD, but we cannot guarantee HRD in both groups at every locus.

RVS allows one to incorporate external control groups into NGS association studies, assuming matching on other factors.

RVS can be used for single or joint rare variant analysis.

RVS can be extended to accommodate in-study controls augmented with an out-of-study control group.

RVS will be extended to accommodate covariate adjustment.

Code and Publication

Code is available at:
• https://github.com/strug-lab
• Strug.research.sickkids.ca

Derkach A, Chiang T, Gong J, Addis L, Dobbins S, Tomlinson I, Houlston R, Pal DK, Strug LJ. 2014. Association analysis using next generation sequence data from publicly available control groups: The robust variance score statistics. Bioinformatics, Epub. PMID: 24733292


Acknowledgments

Andriy Derkach, graduate student in statistics at University of Toronto

Deb Pal, Richard Houlston and Ian Tomlinson
