Jie Zheng at #ICG12: PhenoSpD: an atlas of phenotypic correlations and a multiple testing correction...

PhenoSpD: an integrated toolkit for phenotypic correlation estimation and multiple testing correction using GWAS summary statistics

Jie Zheng

The 12th International Conference on Genomics

27th Oct 2017

One belt one road

Bristol

An invitation from GIGA Science and BGI

Phenome wide association study (PheWAS)

• PheWAS analyzes many phenotypes in relation to a single or multiple genetic variant(s).

• PheWAS is commonplace, e.g.

• MR-PheWAS. Millard et al, Sci Rep, 2015

• Haycock et al, JAMA Oncology, 2017

It is likely that longer telomeres increase risk for several cancers but reduce risk for some non-neoplastic diseases, including cardiovascular diseases.

Post-GWAS era: a database of harmonized GWAS summary data at the MRC Integrative Epidemiology Unit in Bristol

The network of post GWAS analysis software

Centralized Database

PhenoSpD MR-Base

LD Hub

LD Hub for LD Score Regression

Univariate analysis: SNP heritability

[Figure: chi-square statistic (y-axis) plotted against LD score (x-axis)]

Bivariate analysis: genetic correlations

LD Hub web app

Scope of LD Hub

LD Hub

Database: 233 publicly available GWAS traits

Test Center: on-the-fly LD score regression analysis pipeline

Lookup Center: lookup of existing LD score regression results

GWAShare Center: summary data sharing & user contribution

MR-Base for Mendelian randomization

[Diagram nodes: SNPs, Trait 1, Confounders, Trait 2]

Trait 1 = risk factor (exposure)

Trait 2 = disease (outcome)

Two sample Mendelian randomization
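
To illustrate how a two-sample analysis combines SNP-exposure and SNP-outcome associations taken from two separate GWASs, here is a minimal Python sketch of the Wald ratio and the fixed-effect inverse-variance weighted (IVW) estimator, a standard two-sample MR estimator. The function names and the example numbers are hypothetical, and this is not the MR-Base implementation.

```python
import math


def wald_ratio(beta_exposure, beta_outcome):
    """Single-SNP causal estimate: SNP-outcome effect / SNP-exposure effect."""
    if beta_exposure == 0:
        raise ValueError("SNP-exposure effect must be non-zero")
    return beta_outcome / beta_exposure


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate and its standard error across several SNPs.

    Assumes independent SNPs and ignores uncertainty in the SNP-exposure
    effects (first-order weights).
    """
    if not (len(beta_exposure) == len(beta_outcome) == len(se_outcome)):
        raise ValueError("inputs must have the same length")
    num = sum(bx * by / se ** 2
              for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome))
    den = sum(bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome))
    return num / den, math.sqrt(1.0 / den)


# Made-up summary statistics for three SNPs (illustration only).
bx = [0.10, 0.20, 0.15]    # SNP effects on Trait 1 (exposure)
by = [0.05, 0.10, 0.075]   # SNP effects on Trait 2 (outcome)
se = [0.01, 0.02, 0.015]   # standard errors of the outcome effects
estimate, std_err = ivw_estimate(bx, by, se)
```

Each SNP here has the same Wald ratio, so the IVW estimate equals that ratio; with real data the SNP-specific ratios differ and the weights decide how much each SNP contributes.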

MR-Base web interface

Scope of MR-Base

MR-Base

SNP lookups

12 two-sample MR methodologies

MR-Base R package

Database: ~2000 GWAS (1100 with full data)

PhenoSpD: why do we need it?

• Molecular phenotypes such as metabolites are highly correlated.

• Multiple testing correction is a headache: Bonferroni correction is far too conservative when traits are correlated.

• When individual-level phenotype data are available, the phenotypic correlation matrix can be calculated easily.

• However, in the real world, individual-level phenotype data are normally not available.

• In MR-Base / LD Hub, we only have GWAS summary statistics.

• We need a magic hand to correct for multiple testing!

Wurtz et al, J Am Coll Cardiol. 2013

PhenoSpD: how it works

1. Harmonize GWAS summary statistics

2. Estimate the phenotypic correlation matrix using metaCCA / LD score regression

3. Apply spectral decomposition (SpD) to estimate the equivalent number of independent variables in the phenotypic correlation matrix

MetaCCA

• Summary statistics-based multivariate association testing using canonical correlation analysis (Cichonska et al, Bioinformatics 2016)

• As a by-product, it provides a way to estimate the phenotypic correlation matrix Σ_YY, which is equal to the Pearson correlation between the regression coefficients (betas) of two GWASs

• This assumes that both traits were measured in the same samples

• PS: 1000 Genomes is not the best option for estimating the LD matrix between SNPs. See Benner et al AJHG 2017, and LDstore
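
The idea of using the correlation between GWAS betas as a phenotypic correlation estimate can be sketched in a few lines of Python. This is an illustration only, not the metaCCA code: the function names are hypothetical, the betas are assumed to be already harmonized (same SNPs, same order, same effect allele), and any standardization of the betas is left out.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant sequence has no defined correlation")
    return sxy / math.sqrt(sxx * syy)


def phenotypic_correlation_from_betas(betas):
    """Build a trait-by-trait correlation matrix from per-SNP betas.

    betas: dict mapping trait name -> list of effect estimates, one per SNP,
    in the same SNP order for every trait (assumed harmonized upstream).
    """
    traits = list(betas)
    return {a: {b: 1.0 if a == b else pearson(betas[a], betas[b])
                for b in traits}
            for a in traits}


# Made-up betas for four SNPs (illustration only).
example = {
    "trait_1": [0.10, -0.20, 0.30, 0.00],
    "trait_2": [0.20, -0.40, 0.60, 0.00],    # proportional to trait_1
    "trait_3": [-0.10, 0.20, -0.30, 0.00],   # mirror image of trait_1
}
matrix = phenotypic_correlation_from_betas(example)
```

In practice the estimate would be taken over a large number of SNPs; four SNPs are used here only to keep the example readable.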

LD score regression

• Method to estimate SNP heritability and genetic correlations (Bulik-Sullivan et al, Nat Genet 2014, 2015)

• It also provides a way to estimate the phenotypic correlation between two traits, which is the intercept term of the bivariate LD score regression.

• Compared with metaCCA, it adjusts for sample overlap automatically

• Both genetic and phenotypic correlation matrices can be found in LD Hub
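
Under the cross-trait LD score regression model of Bulik-Sullivan et al 2015, the expected intercept is rho * N_s / sqrt(N_1 * N_2), where rho is the phenotypic correlation in the overlapping samples, N_s is the number of overlapping samples and N_1, N_2 are the two GWAS sample sizes. With complete overlap the intercept equals rho. The Python sketch below only rearranges that expression; the function name and the example numbers are hypothetical, and it is not part of LD Hub or PhenoSpD.

```python
import math


def phenotypic_correlation_from_intercept(intercept, n1, n2, n_overlap):
    """Rescale a bivariate LD score regression intercept to a correlation.

    Uses intercept = rho * n_overlap / sqrt(n1 * n2), solved for rho.
    Assumes the number of overlapping samples is known and non-zero.
    """
    if n1 <= 0 or n2 <= 0:
        raise ValueError("sample sizes must be positive")
    if not 0 < n_overlap <= min(n1, n2):
        raise ValueError("n_overlap must be in (0, min(n1, n2)]")
    return intercept * math.sqrt(n1 * n2) / n_overlap


# Complete overlap: the intercept is already the phenotypic correlation.
rho_full = phenotypic_correlation_from_intercept(-0.70, 5000, 5000, 5000)

# Half of the samples shared: the same rho gives a halved intercept.
rho_half = phenotypic_correlation_from_intercept(-0.35, 5000, 5000, 2500)
```

When the overlap is unknown, the raw intercept can still be read as a lower bound (in absolute value) on the phenotypic correlation under this model.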

SNPSpD and MatSpD

• SNPSpD: A simple correction for multiple testing for SNPs in LD using spectral decomposition (SpD). Nyholt 2004 AJHG

• MatSpD: MatrixSpD, estimates the equivalent number of independent variables in a correlation (r) matrix

• The same method can be used to estimate the number of independent variables in a phenotypic correlation matrix
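
Nyholt (2004) defines the effective number of independent variables as Meff = 1 + (M - 1)(1 - Var(lambda)/M), where lambda are the eigenvalues of the M x M correlation matrix. The Python sketch below applies that formula to a phenotypic correlation matrix. To stay dependency-free it avoids an eigen-decomposition: for a correlation matrix the eigenvalues sum to M and their squares sum to the sum of squared matrix entries, which is enough to get the sample variance. The function name is hypothetical, and whether PhenoSpD reports exactly this estimator is an assumption.

```python
def effective_number_nyholt(corr):
    """Nyholt (2004) effective number of independent variables.

    corr: square correlation matrix as a list of lists (unit diagonal).
    Uses Var(lambda) = (sum(r_ij^2) - M) / (M - 1), the sample variance of
    the eigenvalues, which follows from trace(R) = M and
    trace(R^2) = sum of squared entries for a symmetric matrix R.
    """
    m = len(corr)
    if m == 0 or any(len(row) != m for row in corr):
        raise ValueError("corr must be a non-empty square matrix")
    if m == 1:
        return 1.0
    sum_sq = sum(r * r for row in corr for r in row)
    var_lambda = (sum_sq - m) / (m - 1)
    return 1 + (m - 1) * (1 - var_lambda / m)


# Three uncorrelated traits count as three independent tests.
independent = effective_number_nyholt([[1, 0, 0], [0, 1, 0], [0, 0, 1]])

# Three perfectly correlated traits count as a single test.
identical = effective_number_nyholt([[1, 1, 1], [1, 1, 1], [1, 1, 1]])

# Two traits with r = 0.5 fall in between.
partial = effective_number_nyholt([[1, 0.5], [0.5, 1]])
```

The resulting Meff can then replace the raw number of traits in a Bonferroni-style threshold.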

Simulation

• How accurate is the phenotypic correlation estimation using GWAS results?

• Are there any parameters that strongly affect the estimation?

Model               N_ind_A  N_ind_B  N_overlap  Overlap_%  N_SNPs  N_EnvF  N_simu  y1_y2_A_obs  y1_y2_B_obs  Mean_y1_y2_est  SD_y1_y2_est  Deviation_obs_est (%)
sample size 1           300      300        150        50%    1000     100     100        -0.70        -0.70           -0.46          0.56                  34.1%
sample size 2           500      500        250        50%    1000     100     100        -0.71        -0.70           -0.47          0.56                  33.0%
sample size 3          1000     1000        500        50%    1000     100     100        -0.70        -0.70           -0.47          0.54                  33.3%
sample size 4          3000     3000       1500        50%    1000     100     100        -0.70        -0.70           -0.46          0.54                  33.6%
sample size 5          5000     5000       2500        50%    1000     100     100        -0.70        -0.70           -0.47          0.54                  33.2%
sample size 6         10000    10000       5000        50%    1000     100     100        -0.71        -0.71           -0.47          0.54                  33.9%
sample overlap 1       5000     5000       1000        10%    1000     100     100        -0.70        -0.70           -0.13          0.39                  82.1%
sample overlap 2       5000     5000       2000        20%    1000     100     100        -0.70        -0.70           -0.23          0.47                  67.2%
sample overlap 3       5000     5000       3000        30%    1000     100     100        -0.71        -0.71           -0.33          0.47                  54.0%
sample overlap 4       5000     5000       4000        40%    1000     100     100        -0.71        -0.71           -0.40          0.51                  43.2%
sample overlap 5       5000     5000       5000        50%    1000     100     100        -0.71        -0.71           -0.47          0.53                  33.3%
sample overlap 6       5000     5000       6000        60%    1000     100     100        -0.71        -0.71           -0.53          0.59                  25.2%
sample overlap 7       5000     5000       7000        70%    1000     100     100        -0.70        -0.70           -0.58          0.57                  17.5%
sample overlap 8       5000     5000       8000        80%    1000     100     100        -0.70        -0.70           -0.62          0.65                  11.4%
sample overlap 9       5000     5000       9000        90%    1000     100     100        -0.71        -0.71           -0.67          0.67                   5.8%
unbalance sample 1     5000     5000       9000        90%    1000     100     100        -0.71        -0.71           -0.67          0.67                   5.8%
number of SNPs 1       5000     5000       2500        50%      10     100     100        -0.70        -0.70           -0.44          0.73                  38.1%
number of SNPs 2       5000     5000       2500        50%     100     100     100        -0.70        -0.70           -0.48          0.53                  34.1%
number of SNPs 3       5000     5000       2500        50%     500     100     100        -0.70        -0.70           -0.47          0.53                  34.3%
number of SNPs 4       5000     5000       2500        50%    1000     100     100        -0.71        -0.70           -0.47          0.56                  33.5%
number of SNPs 5       5000     5000       2500        50%    5000     100     100        -0.70        -0.70           -0.47          0.55                  33.6%
number of SNPs 6       5000     5000       2500        50%   10000     100     100        -0.71        -0.71           -0.47          0.59                  33.7%
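
The qualitative pattern in the simulation results (the estimate moves closer to the observed correlation as sample overlap grows) can be explored with a much simpler toy simulation. The Python sketch below is not the simulation behind the results above and does not try to reproduce its numbers: it assumes two cohorts sharing some individuals, SNPs with no effect on either trait, a fixed minor allele frequency and no explicit environmental factors. All names and parameter values are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ols_slope(genotypes, phenotypes):
    """Per-SNP GWAS beta: least-squares slope of phenotype on genotype."""
    n = len(genotypes)
    mg, mp = sum(genotypes) / n, sum(phenotypes) / n
    sgg = sum((g - mg) ** 2 for g in genotypes)
    sgp = sum((g - mg) * (p - mp) for g, p in zip(genotypes, phenotypes))
    return sgp / sgg


def simulate_beta_correlation(n_a, n_b, n_overlap, n_snps, rho,
                              maf=0.3, seed=1):
    """Correlation of betas from a trait-1 GWAS in cohort A and a trait-2
    GWAS in cohort B, where the cohorts share n_overlap individuals."""
    if not 0 <= n_overlap <= min(n_a, n_b):
        raise ValueError("n_overlap must be between 0 and min(n_a, n_b)")
    rng = random.Random(seed)
    n_total = n_a + n_b - n_overlap
    y1, y2 = [], []
    for _ in range(n_total):
        e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
        y1.append(e1)
        y2.append(rho * e1 + math.sqrt(1 - rho ** 2) * e2)  # corr(y1, y2) = rho
    start_b = n_total - n_b  # cohort A = first n_a people, cohort B = last n_b
    betas_1, betas_2 = [], []
    for _ in range(n_snps):
        # Genotype = number of minor alleles (0, 1 or 2) per individual.
        g = [(rng.random() < maf) + (rng.random() < maf)
             for _ in range(n_total)]
        betas_1.append(ols_slope(g[:n_a], y1[:n_a]))
        betas_2.append(ols_slope(g[start_b:], y2[start_b:]))
    return pearson(betas_1, betas_2)


full = simulate_beta_correlation(200, 200, 200, 400, rho=-0.7)
half = simulate_beta_correlation(200, 200, 100, 400, rho=-0.7)
```

In this toy model the expected beta correlation is rho * n_overlap / sqrt(n_a * n_b), so `half` should sit noticeably closer to zero than `full`.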

Accuracy tests using real data

The estimated phenotypic correlations have good agreement with observed phenotypic correlations

The exceptions are traits with limited sample size (therefore limited sample overlap).

• Shin et al provided the observed phenotypic correlation matrix for 452 metabolites, which can be used as a test dataset

• So we compared the observed phenotypic correlation with the estimated phenotypic correlation using PhenoSpD.

Growing importance of PhenoSpD

• PhenoSpD is particularly useful for multiple GWASs from the same samples, e.g. complex molecular traits such as metabolites and cytokines

• It can also be applied to all traits in MR-Base / LD Hub, where we can split traits into groups, e.g. traits in the GIANT consortium are highly likely to be correlated and the majority of them come from the same samples

Real case application in MR-Base and LD Hub

Number of independent traits in MR-Base

Consortium / First author  Category                  N_traits   N_SNPs  N_correlations  N_independent_traits
Kettunen                   Blood metabolites              123  9826292            7503                  44.9
Shin                       Metabolites                    451  2482345          101475                 324.4
Roederer                   Immune system phenotypes       151  1585187           11325                  94.2
CARDIOGRAM                                                  2   335391               1                     1
TRICL                                                       4   335391               6                     3
TAG                                                         4  1449634               6                  3.98
SSGAC                                                       7  1449634              21                     6
PGC                                                         4   335391               6                 3.644
Leptin                                                      2  1449634               1                     1
MAGIC                                                      16  1449634             120                11.098
IIBDGC                                                      3   335391               3                     2
Hrgene                                                      8  1449634              28                     7
HaemGen                                                     6  1449634              15                     5
GPC                                                         6  1449634              15                     5
GLGC                                                        4  1449634               6                     3
GIANT                                                      15  1449634             105               10.1097
GEFOS                                                       3  1449634               3                     3
CKDGen                                                      9   335391              36                     8
EGG                                                         4  1449634               6                     4
GIS                                                         2  2029112               1                     1
GUGC                                                        2  2449580               1                     1
ENIGMA                                                      7  7237736              21                     6
UK Biobank                                                  5  9440243               9                     5
Others                                                     24        /               /                    24
All                                                       862        /          120713              577.3317

Number of independent traits in LD Hub

Consortium / First author  Category                  N_traits   N_SNPs  N_correlations  N_independent_traits
All traits                 All traits                     221        /           24310              134.1167

Growing importance of PhenoSpD

• There is great potential to apply PhenoSpD to multiple traits in large-scale biobanks and cohorts such as UK Biobank, China Kadoorie Biobank and the HUNT study (all traits in one sample)

UK Biobank release from Ben Neale’s group

• RAPID GWAS OF THOUSANDS OF PHENOTYPES FOR 337,000 SAMPLES IN THE UK BIOBANK (http://www.nealelab.is/blog/2017/7/19/rapid-gwas-of-thousands-of-phenotypes-for-337000-samples-in-the-uk-biobank)

• GWAS summary statistics from 337,000 European samples are available for over 2,400 human traits; anyone can access and download the results.

• ~600 traits are heritable, which are the most valuable data


PhenoSpD application

• Assess the potential causal relationship between genetic variation, DNA methylation and 139 complex traits.

• PhenoSpD: 139 outcomes → 62 independent outcomes
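
The practical effect of going from 139 outcomes to 62 independent outcomes is a less stringent significance threshold. The short Python sketch below does the arithmetic, assuming a family-wise alpha of 0.05, which is not stated in the slides; the function name is hypothetical.

```python
def corrected_threshold(alpha, n_effective_tests):
    """Bonferroni-style per-test threshold for a given number of tests."""
    if n_effective_tests <= 0:
        raise ValueError("number of tests must be positive")
    return alpha / n_effective_tests


ALPHA = 0.05  # assumed family-wise error rate, not taken from the slides

bonferroni = corrected_threshold(ALPHA, 139)   # all outcomes treated as independent
phenospd = corrected_threshold(ALPHA, 62)      # independent outcomes from PhenoSpD
relaxation = phenospd / bonferroni             # how much looser the threshold is
```

Under this assumption the per-test threshold moves from roughly 3.6e-4 to roughly 8.1e-4, i.e. a bit more than twice as permissive.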

Hypothesis free MR of DNA methylation on 139 human traits

Links for PhenoSpD

• PhenoSpD paper is on bioRxiv: https://www.biorxiv.org/content/early/2017/07/25/148627

• R scripts of PhenoSpD can be found on MRC-IEU GitHub: https://github.com/MRCIEU/PhenoSpD

• LD Hub: http://ldsc.broadinstitute.org/ldhub/

• MR-Base: www.mrbase.org


Acknowledgements

• LD Hub team

• Jie Zheng

• David M Evans

• Benjamin Neale

• MR-Base team

• Gibran Hemani

• Jie Zheng

• George Davey Smith

• Tom Gaunt

• Philip Haycock

• PhenoSpD team

• Jie Zheng

• Tom Richardson

• Louise Millard

• Gibran Hemani

• Chris Raistrick

• Bjarni Vilhjalmsson

• Philip Haycock

• Tom Gaunt

Q & A

Thank you!

Questions welcomed