Jie Zheng at #ICG12: PhenoSpD: an atlas of phenotypic correlations and a multiple testing correction...
PhenoSpD: an integrated toolkit for phenotypic correlation estimation and multiple testing correction using GWAS summary statistics
Jie Zheng
The 12th International Conference on Genomics
27th Oct 2017
One belt one road
Bristol
An invitation from GIGA Science and BGI
Phenome wide association study (PheWAS)
• PheWAS tests many phenotypes for association with a single or multiple genetic variant(s).
• PheWAS is commonplace, e.g.
• MR-PheWAS. Millard et al, Sci Rep, 2015
• Haycock et al, JAMA Oncology, 2017
It is likely that longer telomeres increase risk for several cancers but reduce risk for some non-neoplastic diseases, including cardiovascular diseases.
Post GWAS era: a database of harmonized GWAS summary data in MRC Integrative Epidemiology Unit in Bristol
The network of post GWAS analysis software
Centralized Database
PhenoSpD MR-Base
LD Hub
LD Hub for LD Score Regression
Univariate analysis: SNP heritability
[Figure: chi-square statistic plotted against LD Score]
Bivariate analysis: genetic correlations
LD Hub web app
Scope of LD Hub
• LD Hub database: 233 publicly available GWAS traits
• Test Center: on-the-fly LD score regression analysis pipeline
• Lookup Center: lookup of existing LD score regression results
• GWAShare Center: summary data sharing & user contribution
MR-Base for Mendelian randomization
[Diagram nodes: SNPs, Trait 1, Confounders, Trait 2]
Trait 1 = risk factor (exposure)
Trait 2 = disease (outcome)
Two sample Mendelian randomization
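To make the two-sample setting concrete, here is a minimal sketch of one widely used estimator, the fixed-effect inverse-variance-weighted (IVW) estimate, which combines per-SNP ratios of the outcome effect to the exposure effect. The talk does not say which estimators MR-Base implements, so treat the choice of IVW, the function name `ivw_estimate` and the three-SNP input as illustrative assumptions.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW estimate from per-SNP summary statistics.

    Each SNP gives a ratio estimate beta_outcome / beta_exposure, weighted by
    beta_exposure^2 / se_outcome^2 (first-order weights).
    """
    weights = [bx ** 2 / se ** 2 for bx, se in zip(beta_exposure, se_outcome)]
    ratios = [by / bx for bx, by in zip(beta_exposure, beta_outcome)]
    total_weight = sum(weights)
    estimate = sum(w * r for w, r in zip(weights, ratios)) / total_weight
    std_error = math.sqrt(1.0 / total_weight)
    return estimate, std_error


# Hypothetical summary statistics for three SNPs (not taken from the talk).
snp_exposure = [0.10, 0.20, 0.15]   # SNP effect on Trait 1 (exposure)
snp_outcome = [0.05, 0.10, 0.075]   # SNP effect on Trait 2 (outcome)
snp_outcome_se = [0.01, 0.02, 0.015]

estimate, std_error = ivw_estimate(snp_exposure, snp_outcome, snp_outcome_se)
print(estimate, std_error)
```

Every SNP in this toy input has the same ratio, so the combined estimate equals that ratio; real data would show heterogeneity between SNPs.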
MR-Base web interface
Scope of MR-Base
• SNP lookups
• 12 two-sample MR methodologies
• MR-Base R package
• Database: ~2000 GWAS (1100 with full data)
PhenoSpD: why do we need it?
• Molecular phenotypes such as metabolites are highly correlated.
• Multiple testing correction is a headache: Bonferroni correction is definitely overkill for correlated traits.
• When individual-level phenotype data are available, the phenotypic correlation matrix can be calculated easily.
• However, in the real world, individual-level phenotype data are normally not available.
• In MR-Base / LD Hub, we only have GWAS summary statistics.
• We need a magic hand to correct for multiple testing!
Wurtz et al, J Am Coll Cardiol. 2013
PhenoSpD: how it works
1. Harmonize GWAS summary statistics
2. Estimate phenotypic correlation matrix using metaCCA / LD score regression
3. Apply Spectral decomposition (SpD) to estimate the equivalent number of independent variables in the phenotypic correlation matrix
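Step 1 can be sketched as follows, assuming that harmonizing means aligning every GWAS to the same effect allele so that betas are comparable across traits. The function `harmonize`, the dictionary layout and the SNP records are hypothetical and not taken from the PhenoSpD scripts. Strand flips and palindromic SNPs are ignored here.

```python
def harmonize(reference, other):
    """Align `other` to the effect alleles used in `reference`.

    Both inputs map SNP id -> (effect_allele, other_allele, beta).
    SNPs missing from the reference or with incompatible alleles are dropped.
    """
    aligned = {}
    for snp, (effect, non_effect, beta) in other.items():
        if snp not in reference:
            continue
        ref_effect, ref_non_effect, _ = reference[snp]
        if (effect, non_effect) == (ref_effect, ref_non_effect):
            aligned[snp] = (effect, non_effect, beta)
        elif (effect, non_effect) == (ref_non_effect, ref_effect):
            # Alleles are swapped: flip the sign of the effect estimate.
            aligned[snp] = (ref_effect, ref_non_effect, -beta)
    return aligned


# Hypothetical records for illustration only.
gwas_a = {"rs1": ("A", "G", 0.10), "rs2": ("C", "T", -0.20), "rs3": ("A", "C", 0.05)}
gwas_b = {"rs1": ("A", "G", 0.08), "rs2": ("T", "C", 0.15),
          "rs3": ("A", "G", 0.02), "rs4": ("G", "T", 0.30)}

harmonized_b = harmonize(gwas_a, gwas_b)
print(harmonized_b)
```

In this toy input rs2 is sign-flipped, rs3 is dropped because its alleles do not match, and rs4 is dropped because it is absent from the reference.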
MetaCCA
• Summary statistics-based multivariate association testing using canonical correlation analysis – Cichonska et al, Bioinformatics 2016
• As a by-product, it provides a way to estimate the phenotypic correlation matrix Σ_YY: each entry is the Pearson correlation between the regression coefficients (betas) of two GWASs
• The assumption is that both traits are measured in the same samples
• PS: 1000 Genomes is not the best option for estimating the LD matrix between SNPs. See Benner et al, AJHG 2017, and LDstore
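The idea that betas from two GWASs in the same sample carry the phenotypic correlation can be checked with a small simulation. This is a sketch under simplifying assumptions (independent SNPs with no true effect, complete sample overlap, standardized genotypes), not the metaCCA implementation. All names and parameter values are illustrative.

```python
import numpy as np


def phenotypic_correlation_from_betas(beta_trait1, beta_trait2):
    """Pearson correlation between the per-SNP betas of two GWASs."""
    return float(np.corrcoef(beta_trait1, beta_trait2)[0, 1])


rng = np.random.default_rng(0)
n_individuals, n_snps, true_r = 2000, 300, -0.7

# Independent SNPs coded 0/1/2, then standardized column by column.
genotypes = rng.binomial(2, 0.3, size=(n_individuals, n_snps)).astype(float)
genotypes = (genotypes - genotypes.mean(axis=0)) / genotypes.std(axis=0)

# Two traits measured on the same individuals with correlation true_r.
covariance = [[1.0, true_r], [true_r, 1.0]]
phenotypes = rng.multivariate_normal([0.0, 0.0], covariance, size=n_individuals)
phenotypes = phenotypes - phenotypes.mean(axis=0)

# Marginal regression slope of each trait on each standardized SNP.
betas = genotypes.T @ phenotypes / n_individuals  # shape (n_snps, 2)

estimated_r = phenotypic_correlation_from_betas(betas[:, 0], betas[:, 1])
print(true_r, estimated_r)
```

With complete overlap the estimate lands near the simulated correlation; the simulation slide later in the talk looks at what happens when overlap is only partial.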
LD score regression
• Method to estimate SNP heritability and genetic correlations -- Bulik-Sullivan et al NG 2014, 2015
• It also provides a way to estimate the phenotypic correlation between two traits, which is the intercept term of the bivariate LD score regression.
• Compared with metaCCA, it adjusts for sample overlap automatically
• Both genetic and phenotypic correlation matrices can be found in LD Hub
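A rough sketch of where that intercept comes from: in the bivariate model of Bulik-Sullivan et al, the expected product of the two z-scores at a SNP is linear in its LD score, and the intercept equals the phenotypic correlation multiplied by N_overlap / sqrt(N_1 N_2), so it equals the phenotypic correlation itself under complete overlap. The code below fits an unweighted least-squares line to simulated z-scores. Real LD score regression uses regression weights and a block jackknife, which are omitted, and all parameter values are made up.

```python
import numpy as np


def cross_trait_intercept(z1, z2, ld_scores):
    """Intercept of an unweighted regression of z1 * z2 on LD score."""
    slope, intercept = np.polyfit(ld_scores, z1 * z2, deg=1)
    return float(intercept)


rng = np.random.default_rng(1)
n_snps = 20000
ld_scores = rng.uniform(1.0, 100.0, size=n_snps)

# Assumed (made-up) line: E[z1 * z2] = intercept + slope * LD score.
true_intercept, true_slope = -0.5, 0.002
expected_product = true_intercept + true_slope * ld_scores

# Draw z-score pairs whose per-SNP correlation follows that line.
z1 = rng.standard_normal(n_snps)
noise = rng.standard_normal(n_snps)
z2 = expected_product * z1 + np.sqrt(1.0 - expected_product ** 2) * noise

estimated_intercept = cross_trait_intercept(z1, z2, ld_scores)
print(true_intercept, estimated_intercept)
```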
SNPSpD and MatSpD
• SNPSpD: A simple correction for multiple testing for SNPs in LD using spectral decomposition (SpD). Nyholt 2004 AJHG
• MatSpD: MatrixSpD, estimate the equivalent number of independent variables in a correlation (r) matrix
• The same method can be used to estimate the number of independent variables in a phenotypic correlation matrix
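The spectral decomposition step can be sketched with the formula from Nyholt 2004, M_eff = 1 + (M - 1)(1 - Var(λ) / M), where λ are the eigenvalues of the M × M correlation matrix. This is a minimal illustration only: the function name is an assumption, and the PhenoSpD R scripts may handle details such as negative eigenvalues differently.

```python
import numpy as np


def effective_number_of_tests(correlation_matrix):
    """Nyholt (2004) estimate of the number of independent variables."""
    corr = np.asarray(correlation_matrix, dtype=float)
    m = corr.shape[0]
    if m == 1:
        return 1.0
    eigenvalues = np.linalg.eigvalsh(corr)        # symmetric matrix
    eigen_variance = np.var(eigenvalues, ddof=1)  # sample variance
    return float(1.0 + (m - 1.0) * (1.0 - eigen_variance / m))


uncorrelated = np.eye(4)               # four independent traits
fully_correlated = np.ones((4, 4))     # four copies of one trait

print(effective_number_of_tests(uncorrelated))
print(effective_number_of_tests(fully_correlated))
```

The two extremes behave as expected: four uncorrelated traits count as four tests, and four perfectly correlated traits count as one.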
Simulation
• How accurate is the phenotypic correlation estimated from GWAS results?
• Are there any parameters that strongly affect this estimate?
Model               N_ind_A  N_ind_B  N_overlap  Overlap_%  N_SNPs  N_EnvF  N_simu  y1_y2_A_obs  y1_y2_B_obs  Mean_y1_y2_est  SD_y1_y2_est  Deviation_obs_est (%)
sample size 1       300      300      150        50%        1000    100     100     -0.70        -0.70        -0.46           0.56          34.1%
sample size 2       500      500      250        50%        1000    100     100     -0.71        -0.70        -0.47           0.56          33.0%
sample size 3       1000     1000     500        50%        1000    100     100     -0.70        -0.70        -0.47           0.54          33.3%
sample size 4       3000     3000     1500       50%        1000    100     100     -0.70        -0.70        -0.46           0.54          33.6%
sample size 5       5000     5000     2500       50%        1000    100     100     -0.70        -0.70        -0.47           0.54          33.2%
sample size 6       10000    10000    5000       50%        1000    100     100     -0.71        -0.71        -0.47           0.54          33.9%
sample overlap 1    5000     5000     1000       10%        1000    100     100     -0.70        -0.70        -0.13           0.39          82.1%
sample overlap 2    5000     5000     2000       20%        1000    100     100     -0.70        -0.70        -0.23           0.47          67.2%
sample overlap 3    5000     5000     3000       30%        1000    100     100     -0.71        -0.71        -0.33           0.47          54.0%
sample overlap 4    5000     5000     4000       40%        1000    100     100     -0.71        -0.71        -0.40           0.51          43.2%
sample overlap 5    5000     5000     5000       50%        1000    100     100     -0.71        -0.71        -0.47           0.53          33.3%
sample overlap 6    5000     5000     6000       60%        1000    100     100     -0.71        -0.71        -0.53           0.59          25.2%
sample overlap 7    5000     5000     7000       70%        1000    100     100     -0.70        -0.70        -0.58           0.57          17.5%
sample overlap 8    5000     5000     8000       80%        1000    100     100     -0.70        -0.70        -0.62           0.65          11.4%
sample overlap 9    5000     5000     9000       90%        1000    100     100     -0.71        -0.71        -0.67           0.67          5.8%
unbalance sample 1  5000     5000     9000       90%        1000    100     100     -0.71        -0.71        -0.67           0.67          5.8%
unbalance sample 2  5000     6000     9000       82%        1000    100     100     -0.71        -0.71        -0.64           0.66          9.8%
unbalance sample 3  5000     8000     9000       69%        1000    100     100     -0.70        -0.70        -0.58           0.65          17.3%
unbalance sample 4  5000     10000    9000       60%        1000    100     100     -0.70        -0.70        -0.54           0.62          22.9%
unbalance sample 5  5000     13000    9000       50%        1000    100     100     -0.70        -0.70        -0.48           0.54          30.9%
number of SNPs 1    5000     5000     2500       50%        10      100     100     -0.70        -0.70        -0.44           0.73          38.1%
number of SNPs 2    5000     5000     2500       50%        100     100     100     -0.70        -0.70        -0.48           0.53          34.1%
number of SNPs 3    5000     5000     2500       50%        500     100     100     -0.70        -0.70        -0.47           0.53          34.3%
number of SNPs 4    5000     5000     2500       50%        1000    100     100     -0.71        -0.70        -0.47           0.56          33.5%
number of SNPs 5    5000     5000     2500       50%        5000    100     100     -0.70        -0.70        -0.47           0.55          33.6%
number of SNPs 6    5000     5000     2500       50%        10000   100     100     -0.71        -0.71        -0.47           0.59          33.7%
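The last column appears to be the relative difference between the observed and the estimated correlation. That reading is inferred from the numbers in the table, not from a stated definition, so the formula below is an assumption. Recomputing it from the rounded values of the first row gives about 34.3%, close to the reported 34.1%; the small gap is presumably because the slide's figure was computed from unrounded values.

```python
def deviation_percent(observed_r, estimated_r):
    """Relative deviation of the estimate from the observed correlation, in %.

    Assumed definition: (observed - estimated) / observed * 100.
    """
    return (observed_r - estimated_r) / observed_r * 100.0


# First row of the table ("sample size 1"), using the rounded values shown.
first_row = deviation_percent(-0.70, -0.46)
print(round(first_row, 1))
```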
Accuracy tests using real data
The estimated phenotypic correlations have good agreement with observed phenotypic correlations
The exceptions are traits with limited sample size (therefore limited sample overlap).
• Shin et al provided the observed phenotypic correlation matrix for 452 metabolites, which can be used as a test dataset
• So we compared the observed phenotypic correlation with the estimated phenotypic correlation using PhenoSpD.
Growing importance of PhenoSpD
• PhenoSpD is particularly useful for multiple GWASs from the same samples, e.g. complex molecular traits such as metabolites and cytokines
• It can also be applied to all traits in MR-Base / LD Hub by splitting the traits into groups, e.g. traits from the GIANT consortium are very likely to be correlated, and the majority of them come from the same samples
Real case application in MR-Base and LD Hub
Consortium / First author  Category                  N_traits  N_SNPs   N_correlations  N_independent_traits
Kettunen                   Blood metabolites         123       9826292  7503            44.9
Shin                       Metabolites               451       2482345  101475          324.4
Roederer                   Immune system phenotypes  151       1585187  11325           94.2
CARDIOGRAM                                           2         335391   1               1
TRICL                                                4         335391   6               3
TAG                                                  4         1449634  6               3.98
SSGAC                                                7         1449634  21              6
PGC                                                  4         335391   6               3.644
Leptin                                               2         1449634  1               1
MAGIC                                                16        1449634  120             11.098
IIBDGC                                               3         335391   3               2
Hrgene                                               8         1449634  28              7
HaemGen                                              6         1449634  15              5
GPC                                                  6         1449634  15              5
GLGC                                                 4         1449634  6               3
GIANT                                                15        1449634  105             10.1097
GEFOS                                                3         1449634  3               3
CKDGen                                               9         335391   36              8
EGG                                                  4         1449634  6               4
GIS                                                  2         2029112  1               1
GUGC                                                 2         2449580  1               1
ENIGMA                                               7         7237736  21              6
UK Biobank                                           5         9440243  9               5
Others                                               24        /        /               24
All                                                  862       /        120713          577.3317
Number of independent traits in MR-Base
Consortium / First author  Category    N_traits  N_SNPs  N_correlations  N_independent_traits
All traits                 All traits  221       /       24310           134.1167
Number of independent traits in LD Hub
Growing importance of PhenoSpD
• There is great potential to apply PhenoSpD to multiple traits in large-scale biobanks and cohorts such as UK Biobank, China Kadoorie Biobank and the HUNT study (all traits in one sample)
UK Biobank release from Ben Neale’s group
• RAPID GWAS OF THOUSANDS OF PHENOTYPES FOR 337,000 SAMPLES IN THE UK BIOBANK (http://www.nealelab.is/blog/2017/7/19/rapid-gwas-of-thousands-of-phenotypes-for-337000-samples-in-the-uk-biobank)
• GWAS summary statistics from 337,000 European samples are available for over 2,400 human traits; anyone can access and download the results.
• ~600 of these traits are heritable, and these are the most valuable data
PhenoSpD application
• Assess the potential causal relationship between genetic variation, DNA methylation and 139 complex traits.
• PhenoSpD: 139 outcomes → 62 independent outcomes
Hypothesis free MR of DNA methylation on 139 human traits
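To show what the reduction from 139 outcomes to 62 independent outcomes means for the significance threshold, the sketch below divides a nominal alpha by each count. The alpha of 0.05 is an assumption (the slide does not state it), as is the idea that the correction is applied Bonferroni-style to the number of independent outcomes.

```python
def corrected_threshold(alpha, n_tests):
    """Per-test p-value threshold when correcting alpha for n_tests tests."""
    return alpha / n_tests


alpha = 0.05  # assumed nominal significance level

bonferroni_threshold = corrected_threshold(alpha, 139)  # all outcomes
phenospd_threshold = corrected_threshold(alpha, 62)     # independent outcomes

print(f"{bonferroni_threshold:.2e}", f"{phenospd_threshold:.2e}")
```

Under these assumptions the threshold based on independent outcomes is roughly 2.2 times less stringent than plain Bonferroni across all 139 outcomes.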
Links for PhenoSpD
• PhenoSpD paper is on bioRxiv: https://www.biorxiv.org/content/early/2017/07/25/148627
• R scripts of PhenoSpD can be found on the MRC-IEU GitHub: https://github.com/MRCIEU/PhenoSpD
• LD Hub: http://ldsc.broadinstitute.org/ldhub/
• MR-Base: www.mrbase.org
Acknowledgements
• LD Hub team
• Jie Zheng
• David M Evans
• Benjamin Neale
• MR-Base team
• Gibran Hemani
• Jie Zheng
• George Davey Smith
• Tom Gaunt
• Philip Haycock
• PhenoSpD team
• Jie Zheng
• Tom Richardson
• Louise Millard
• Gibran Hemani
• Chris Raistrick
• Bjarni Vilhjalmsson
• Philip Haycock
• Tom Gaunt
Q & A
Thank you!
Questions welcomed