![Page 1: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/1.jpg)
Imputation & Meta-analysis
Alexander Teumer
OHBM – 26/06/2016
![Page 2: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/2.jpg)
Imputation
Why do we impute
To allow comparison with other samples on other chips
To fine map – i.e. run association at variants we have not
genotyped
To improve call rate – i.e. increase the number of variants
available for poorly genotyped samples (not ideal)
To identify genotyping errors
[Figure: genotypes from array system A and array system B compared against a reference panel; labels: DNA, SNP, recombination hotspots]
![Page 3: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/3.jpg)
A quick conceptual theory of imputation
Start with some genotype data
Using the LD structure within
your data, phase your data
to reconstruct the haplotypes
![Page 4: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/4.jpg)
A quick conceptual theory of imputation
Compare your phased data to
the references
Use the LD structure to
impute in the missing genotypes
(Marchini, J. and Howie, B. 2010. Nat Rev Genet 11 499-511.)
![Page 5: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/5.jpg)
Choose a Genotyping Array
Ideally use a chip designed for imputation
All chips have data sheets. If you are obtaining genotyping, make sure you
check the sheet before choosing the chip!
Also look for papers on imputation using your preferred chip and ask
authors who have published using that chip
Check the manifests and make sure your favourite genes are covered!
Some arrays are less suitable for imputation
ExomeChip (almost only exomes covered, most SNPs not in refpanel)
Cardio-MetaboChip (selected regions only)
...but some have tag SNPs added
Illumina HumanCoreExome BeadChip
(Exome+300k genome-wide tag SNPs)
![Page 6: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/6.jpg)
Easiest (and best) way of imputing
Use the Imputation Servers
Michigan: https://imputationserver.sph.umich.edu/ (Minimac3)
Sanger: https://imputation.sanger.ac.uk/ (PBWT)
![Page 7: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/7.jpg)
Step 1 – Choose phasing method
ShapeIT
Well established method
Phased data not downloadable from imputation server
(cannot be re-used for fast re-imputation with different reference panel)
Eagle v2.0
New algorithm
Very fast and accurate
HapiUR
Available on Michigan server only
Not a reference-based phasing algorithm
This phasing does not take into account any sources of information
other than the input genotypes, i.e. no family data
![Page 8: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/8.jpg)
Step 2 – Pick your references
HapMapII
2.4M SNPs
Well imputed and well known set
Good for first imputation run – not commonly used anymore
1KGP aka 1000G
Phase1v3: ~37M SNPs+INDELs, of which ~11M will be usable
1,092 individuals
Phase3v5: ~82M SNPs+INDELs, of which ~12M will be usable
2,504 individuals
Haplotype reference consortium (HRC)
release 1.1 (full panel only usable through the imputation servers)
39M SNPs (MAC≥5), 32,470 individuals (pan European + 1000G)
![Page 9: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/9.jpg)
Step 2 – Pick your references
All Ethnicities vs Specific Ethnicity panels
Consider what the consortiums/collaborators you want to
work with want to do
Case by case basis
All ethnicities panels are larger (and slower) – but often
requested by collaborators
Can be more accurate – esp. for a ‘cosmopolitan US’ sample
May not improve imputation for homogeneous populations or
those with strong founder effects
![Page 10: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/10.jpg)
Main Differences of Imputation Servers
Michigan: Minimac3 very precise
Sanger: PBWT very fast
Chr X imputation coming soon for imputation servers
Durbin et al., Poster 2015
![Page 11: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/11.jpg)
Genotype data - Make your data clean!
Convert to PLINK binary format
Exclude samples with:
Excessive missingness (>5%)
Reported vs. genotyped sex-mismatch
Unusually high/low heterozygosity
Check for ancestry outliers (PCA/MDS) or related/duplicate samples
Exclude SNPs with:
Excessive missingness (>5%)
Monomorphic SNPs (may represent genotyping errors)
Genotyping platform dependent: low MAF (<1%)
e.g. for HumanCoreExome or old array types
HWE violations (P < ~10⁻⁴)
Mendelian errors (if family data are available)
Duplicate chromosomal positions
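The SNP exclusion rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the thresholds are the ones quoted on the slide, and the HWE test is a simple 1-df chi-square test on genotype counts (dedicated tools such as PLINK typically use an exact test instead).

```python
from math import erfc, sqrt


def hwe_pvalue(n_aa, n_ab, n_bb):
    """P-value of a 1-df chi-square test for Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic: test undefined, filtered separately below
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chisq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return erfc(sqrt(chisq / 2.0))  # upper tail of chi-square with 1 df


def snp_passes_qc(n_aa, n_ab, n_bb, n_missing,
                  miss_max=0.05, maf_min=0.01, hwe_min=1e-4):
    """Hypothetical helper applying the slide's SNP filters to genotype counts."""
    n_called = n_aa + n_ab + n_bb
    if n_called == 0:
        return False
    if n_missing / (n_called + n_missing) > miss_max:
        return False  # excessive missingness
    p = (2 * n_aa + n_ab) / (2 * n_called)
    maf = min(p, 1.0 - p)
    if maf == 0.0:
        return False  # monomorphic
    if maf < maf_min:
        return False  # low MAF (platform dependent)
    return hwe_pvalue(n_aa, n_ab, n_bb) >= hwe_min
```

For example, `snp_passes_qc(360, 480, 160, 0)` keeps a SNP whose genotype counts match HWE exactly, while a SNP with no heterozygotes at all (`500, 0, 500`) is dropped by the HWE filter.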
![Page 12: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/12.jpg)
Genotype data - Make your data clean!
Align DNA strand to reference panel: usually forward (+) strand
Problem: strand-ambiguous SNPs (A/T and C/G SNPs)
Remember: DNA is composed of 2 antiparallel strands. The complement of an A is
a T and the complement of a C is a G, which makes it difficult to work out whether
the genotypes are strand-aligned to the references.
The (+) and (–) strand designation is an arbitrary construct that changes between builds and sources.
Check allele frequencies, or drop these SNPs and re-impute them…
Align SNP positions to the same genome build
Imputation servers require GRCh37 (hg19)
Convert using LiftOver (http://genome.ucsc.edu/cgi-bin/hgLiftOver)
![Page 13: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/13.jpg)
Format input file
VCF format required
One file per chr for Michigan, one for all chr for Sanger imputation server
Use PLINK≥1.9 or PSEQ to convert plink files to VCF
Consider sample IDs: FID, IID or both (PLINK)
Ensure chromosomes are numbers 1...22, X, Y (without prefix) (PSEQ)
Match alleles and coordinates to GRCh37 (+) strand, Sanger: match also ref alleles
checkVCF tool, plink: use options --a2-allele and --real-ref-alleles to set reference alleles
Sort SNPs by genomic position (per chromosome)
VCFtools
[Figure: VCF file layout with its parts labelled: comments, info, genotypes]
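Two of the formatting requirements above (chromosome labels without a prefix, variants sorted by position within each chromosome) are easy to check before converting to VCF. The sketch below is a hypothetical helper operating on plain `(chrom, pos, id)` tuples; it is not part of PLINK, PSEQ, VCFtools or checkVCF.

```python
# Assumed ordering: autosomes 1..22, then X, then Y.
CHROM_ORDER = {str(i): i for i in range(1, 23)}
CHROM_ORDER.update({"X": 23, "Y": 24})


def normalize_chrom(label):
    """Strip a leading 'chr' prefix so labels read 1..22, X, Y."""
    label = label.strip()
    if label.lower().startswith("chr"):
        label = label[3:]
    return label.upper()


def sort_variants(variants):
    """Return (chrom, pos, id) tuples with clean labels, sorted by position."""
    cleaned = [(normalize_chrom(c), int(pos), vid) for c, pos, vid in variants]
    return sorted(cleaned, key=lambda v: (CHROM_ORDER[v[0]], v[1]))
```

A label outside 1..22, X, Y raises a `KeyError`, which is a deliberate choice here: unexpected contigs should be looked at rather than silently sorted.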
![Page 14: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/14.jpg)
Output VCF
Comments, info and genotypes in one file
One line per variant
One column per person
Allele dosage info and genotype probabilities incl.
imputation uncertainties
![Page 15: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/15.jpg)
But I’m going to assume you have the
time, computational capacity, storage
space and desire to do this yourself…
![Page 16: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/16.jpg)
Genotypes and reference panel
Sample and SNP QC are the same as for imputation
server approach
Download reference panel
match strand and genome build positions with own genotypes
HapMapII (NCBI build 36 / hg18 coordinates)
HapMapIII (NCBI build 36 / hg18 coordinates)
1000G phase1 release 3 (NCBI build 37 / GRCh37 / hg19)
1000G phase3 release 5 (NCBI build 37 / GRCh37 / hg19)
build your own reference panel...
full HRC panel not publicly available for download
![Page 17: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/17.jpg)
Phase your data
Choose pre-phasing program
ShapeIT http://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html
Eagle https://data.broadinstitute.org/alkesgroup/Eagle/
MaCH http://www.sph.umich.edu/csg/abecasis/MaCH/
Download genetic map and reference panel
genetic map contains recombination information
appropriate reference panel (optional) improves phasing
(speed + precision)
![Page 18: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/18.jpg)
Impute your data
Choose imputation program
Minimac/Minimac3
IMPUTE2
Beagle
Never use PLINK
Similar accuracy, features, time frame
Different output formats & downstream analysis options
Take care of chrX imputation, i.e. for PAR and non-PAR:
specific options (IMPUTE2)
split by sex (Minimac/Minimac3)
[Figure: imputation program popularity (Mach/Minimac, Beagle, PLINK, Impute)]
![Page 19: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/19.jpg)
File formats
Different software require different file formats
Tools for conversion available
Phasing
Software: MaCH | Eagle | ShapeIT
Input file format: Merlin | PLINK | PLINK/GEN
Output file format: MaCH | HAPS | HAPS
Imputing
Software: Minimac | Minimac3 | IMPUTE2
Input file format: MaCH/HAPS | VCF | HAPS
Output file format: DOSE | VCF/DOSE | GEN
GWAS (dosage)
Software: mach2QTL/ProbABEL* | EPACTS* | SNPTEST2/QUICKTEST
Input file format: DOSE | VCF | GEN/VCF (SNPTEST2)
* supports analysis of related samples
![Page 20: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/20.jpg)
Meta-analysis
![Page 21: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/21.jpg)
Approaches to GWAS meta-analysis
Fixed effects
Most common - most powerful approach for discovery under the model that the true effect of each risk allele is the same in each data set
Inverse variance weighted most common
N weighted (z-score based) also common
Random effects
Uncommon - more appropriate when the aim is to consider the generalizability of the observed association and estimate the average effect size of the associated variant and its uncertainty across different populations
Bayesian
Very uncommon – mainly MAs from the Wellcome Trust
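The two fixed-effects schemes named above can be written down compactly. The following is a minimal sketch using the standard textbook formulas (inverse-variance weighting of betas, and square-root-of-N weighting of Z-scores); the function names are illustrative and this is not METAL's code.

```python
from math import erfc, sqrt


def two_sided_p(z):
    """Two-sided P-value for a standard normal Z-score."""
    return erfc(abs(z) / sqrt(2.0))


def inverse_variance_meta(betas, ses):
    """Fixed-effects estimate: each study weighted by 1 / SE^2."""
    weights = [1.0 / se ** 2 for se in ses]
    beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    se = sqrt(1.0 / sum(weights))
    return beta, se, two_sided_p(beta / se)


def sample_size_meta(zs, ns):
    """Z-score based estimate: each study weighted by sqrt(N)."""
    weights = [sqrt(n) for n in ns]
    z = sum(w * zi for w, zi in zip(weights, zs)) / sqrt(sum(w * w for w in weights))
    return z, two_sided_p(z)
```

With two equally precise studies, `inverse_variance_meta([0.1, 0.3], [0.1, 0.1])` gives the simple average of the betas with a standard error shrunk by a factor of sqrt(2).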
![Page 22: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/22.jpg)
Quality control of data going into MA is
critical!
Exclude rare variants
Typically 1% or 0.5% MAF; with large samples (5000+) you can
consider going lower
Exclude poorly imputed variants
Imputation accuracy metric depends on the software used
Mach/minimac/QUICKTEST r2
IMPUTE properinfo/info
BEAGLE ovarimp
Typically calculated as observed variance / expected variance;
can empirically exceed 1, so it is usually capped at 1
Threshold ~0.6
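The "observed variance / expected variance" idea can be illustrated as follows. This is a simplified sketch, assuming per-individual allele dosages on a 0..2 scale and an expected variance of 2p(1-p) under Hardy-Weinberg; the exact estimators used by MaCH/minimac, IMPUTE and Beagle differ in detail, and the function name is hypothetical.

```python
def imputation_r2(dosages):
    """Ratio of observed dosage variance to the variance expected under HWE."""
    n = len(dosages)
    mean = sum(dosages) / n
    p = mean / 2.0                      # estimated allele frequency
    expected = 2.0 * p * (1.0 - p)      # variance of a perfectly known genotype
    if expected == 0.0:
        return 0.0                      # monomorphic: no information
    observed = sum((d - mean) ** 2 for d in dosages) / n
    return min(observed / expected, 1.0)  # capped at 1, as on the slide
```

Dosages that all sit near the mean (for example `[0.5, 1, 1, 1.5]`) give a low ratio, which is the signature of a poorly imputed variant.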
![Page 23: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/23.jpg)
Important considerations for MA
Duplicate QC and meta-analysis sites
Always check the input data
Column header, beta/SE/P-value distribution, allele frequencies,...
Use GWAtoolbox or EasyQC R-packages
Harmonize variant ID CHR:POS:TYPE (SNP/INDEL)
Make sure you double check meta-analysis results
QQ plots
Manhattan plots
Allele frequencies (min/max per SNP)
Heterogeneity (HetPVal / I²)
Compare inverse-variance vs. z-score based meta-analysis results
Consider allowing cohorts to ignore variants with MAF <0.5% and low r² – it will save you a lot of time and save a lot of storage space!
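For the heterogeneity check listed above, Cochran's Q and I² can be computed directly from the per-study betas and standard errors. This is a sketch of the standard formulas with a hypothetical function name; the P-value (HetPVal) is left to the meta-analysis software.

```python
def heterogeneity(betas, ses):
    """Return (Q, degrees of freedom, I2 in percent) for one variant."""
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    # I2 is the share of variation beyond what chance alone would explain.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0.0 else 0.0
    return q, df, i2
```

Studies that agree exactly give Q = 0 and I² = 0; the further the betas drift apart relative to their standard errors, the closer I² gets to 100%.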
![Page 24: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/24.jpg)
Input file QC: GWAtoolbox
Checks consistency and distribution of input file columns
Compares beta distribution across cohorts
Harmonizes input files (header + separator)
Corrects for genomic control and calculates effective N
Input script like METAL
![Page 25: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/25.jpg)
GWAS Meta-Analysis
Most commonly used software for common variant
analysis: METAL
Automatic strand flipping of non-ambiguous SNPs
Calculation of max/min/mean allele frequency
Inverse variance & sample size weightings
Automatic genomic control correction
Heterogeneity tests
Most commonly used software for rare variant analysis:
RAREMETAL
seqMeta (R-package)
![Page 26: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/26.jpg)
METAL
http://www.sph.umich.edu/csg/abecasis/metal/
Documentation can be found at the metal wiki:
http://genome.sph.umich.edu/wiki/Metal_Documentation
![Page 27: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/27.jpg)
METAL
Requires results files
‘Script’ file
Describes the input files
Defines meta-analysis strategy
Name output file
![Page 28: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/28.jpg)
Steps
1. Check format of results files
   a. Ensure all necessary columns are available
   b. Modify files to include all information
2. Prepare script file
   a. Ensure headers match description
   b. Cross-check that each results file matches its PROCESS name
3. Run metal
   a. metal < metal_script_file > metal_run.log
   b. Output: result file + info file
   c. Check log for errors and warnings
![Page 29: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/29.jpg)
Running METAL
Input results files:
SNPID chr position coded_all noncoded_all strand_genome beta SE pval AF_coded_all HWE_pval callrate n_total imputed used_for_imp oevar_imp
rs10 7 92221824 C A + -0.484216 0.240421 0.0440064 0.942 1 1 2004 1 0 0.346707
rs1000000 12 125456933 G A + -0.117195 0.0814519 0.150201 0.7925 0.115932 1 2004 1 0 0.993797
...
SNPID chr position coded_all noncoded_all strand_genome beta SE pval AF_coded_all HWE_pval callrate n_total imputed used_for_imp oevar_imp
rs10 7 92221824 C A + -0.484216 0.240421 0.0440064 0.942 1 1 2004 1 0 0.346707
rs1000000 12 125456933 G A + -0.117195 0.0814519 0.150201 0.7925 0.115932 1 2004 1 0 0.993797
...
METAL script file:
# define column names
MARKER SNPID
ALLELE coded_all noncoded_all
EFFECT beta
STDERR SE
PVALUE pval
FREQLABEL AF_coded_all
# set genomic control on/off
GENOMICCONTROL ON
# filter result file lines
ADDFILTER SE > 0
ADDFILTER pval > 0
# set weights to inverse-variance
SCHEME STDERR
# define input file separator
SEPARATOR COMMA
# add custom variable to calculate N total
CUSTOMVARIABLE TotalSampleSize
LABEL TotalSampleSize as n_total
# set prefix of output filename
OUTFILE Meta-results_invvar .txt
# define input files
PROCESS results1.txt
PROCESS results2.txt
# start meta-analysis and calc heterogeneity
ANALYZE HETEROGENEITY
![Page 30: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/30.jpg)
Output
info file:
# This file contains a short description of the columns in the
# meta-analysis summary file, named ' Meta-results_invvar1.txt'
# Marker - this is the marker name
# Allele1 - the first allele for this marker in the first file where it occurs
# Allele2 - the second allele for this marker in the first file where it occurs
# Freq1 - weighted average of frequency for allele 1 across all studies
# FreqSE - corresponding standard error for allele frequency estimate
# Effect - overall estimated effect size for allele1
# StdErr - overall standard error for effect size estimate
# P-value - meta-analysis p-value
# Direction - summary of effect direction for each study, with one '+' or '-' per study
# HetChiSq - chi-squared statistic in simple test of heterogeneity
# df - degrees of freedom for heterogeneity statistic
# HetPVal - P-value for heterogeneity statistic
# TotalSampleSize - custom variable 1
# Input for this meta-analysis was stored in the files:
# --> Input File 1 : results1.txt
# --> Input File 2 : results2.txt
result file:
MarkerName Allele1 Allele2 Freq1 FreqSE Effect StdErr P-value Direction HetChiSq HetDf HetPVal TotalSampleSize
rs2326918 a g 0.8545 0.0053 0.0638 0.091 0.4836 +- 0.483 1 0.4873 2412
rs10760160 a c 0.5164 0.006 -0.0492 0.0625 0.431 -- 0.007 1 0.9324 2412
SNP1-152986 a c 0.3796 0 -0.147 0.3169 0.6427 ?- 0 0 1 408
...
![Page 31: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/31.jpg)
Common Errors
###########################################################################
## Processing file 'results3.txt'
## WARNING: Bad alleles for marker '5:92717972:SNP', expecting 'a/g' found 'c/g'
## WARNING: Bad alleles for marker '9:110286832:SNP', expecting 'a/g' found 'a/c'
![Page 32: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/32.jpg)
Questions?
GWAS Catalog: http://www.ebi.ac.uk/gwas/home
![Page 33: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/33.jpg)
Appendix
![Page 34: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/34.jpg)
Phase your data - Details
Phasing programs “use a hidden Markov model (HMM) to
model the haplotypes underlying G as an imperfect
mosaic of haplotypes in the set H. Compatible haplotypes
are sampled for G using the forward-backward algorithm
for HMMs”
Problem: complexity is quadratic in the number of conditioning haplotypes (K)
and scales with the number of SNPs (M): O(MK²)
Delaneau, O. et al. 2013. Nat Meth 10 5-6.
![Page 35: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/35.jpg)
Phase your data
Currently best program for phasing is SHAPEIT2
Delaneau, O., Zagury, J.-F. et al. 2013. Nat Meth 10 5-6.
Avoids the quadratic bottleneck by:
“collapsing all K haplotypes in H into a graph structure, Hg, and
then carrying out the HMM calculations on this graph.”
Sampling pairs of haplotypes
Transition accuracy is improved by drawing on surrogate
family members
![Page 36: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/36.jpg)
Phase your data
SHAPEIT2
Transition accuracy is improved by drawing on surrogate
family members
restricts each phasing update to a set of k template haplotypes
chosen separately for each individual at each iteration
The k templates are chosen by computing Hamming distances
between an individual's current sampled haplotypes and each
possible template haplotype.
The k templates with the smallest distances are referred to as
“surrogate family members”
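The template-selection step described above can be sketched as a nearest-neighbour search under Hamming distance. This is an illustration only, not SHAPEIT2's implementation: haplotypes are represented as equal-length strings of 0/1 alleles, and each candidate is scored by its distance to the closer of the individual's two current haplotypes (an assumption made here for simplicity).

```python
def hamming(a, b):
    """Number of positions at which two equal-length haplotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def choose_templates(current_pair, candidates, k):
    """Pick the k candidate haplotypes closest to the current haplotype pair."""
    # Hypothetical scoring rule: distance to the nearer of the two haplotypes.
    ranked = sorted(candidates,
                    key=lambda h: min(hamming(h, c) for c in current_pair))
    return ranked[:k]
```

Because the conditioning set is restricted to k templates per individual, the cost of each update depends on k rather than on the full number of haplotypes in the sample.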
![Page 37: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/37.jpg)
SHAPEIT2
https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html
Can multi-thread
Note: this is a genetic map based on recombination (cM) not a
physical map (BP)!
![Page 38: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/38.jpg)
Recommendation
Minimac3
lower memory and more computationally efficient
implementation
References are in a custom format (m3vcf) that can handle
very large references with lower memory
Can read in the SHAPEIT2 references
Output is vcf format
Includes both SNP and individuals IDs – safest format to avoid
errors
Downstream analysis with RAREMETALWORKER or other vcf
input tools
![Page 39: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/39.jpg)
vcf format
![Page 40: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/40.jpg)
Imputing in minimac3
Can impute X
Impute Males & Females together for the pseudo Autosomal
region (PAR)
Separately for the non-PAR
![Page 41: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/41.jpg)
Output
Comments, info and genotypes in a single file
One line per variant
One column per person
![Page 42: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/42.jpg)
Output
![Page 43: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/43.jpg)
The comments
![Page 44: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/44.jpg)
The info
![Page 45: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/45.jpg)
The genotypes
![Page 46: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/46.jpg)
A practical example
http://labs.med.miami.edu/myers/LFuN/LFuN.html
post-mortem gene expression in ‘brain’ tissue
N=193
![Page 47: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/47.jpg)
Imputation
Chromosome 22 only – HapMapII- b36r22
MaCH phasing
(In real life with a sample this size include the reference
in the phasing)
Minimac Imputation
Run twice
Once without strand alignment (badImp)
Once with strand alignment (goodImp)
![Page 48: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/48.jpg)
How do we know there was no
strand alignment from the output?
No way of telling from the phasing log
Because we didn’t include a reference
Imputation log is FULL of errors
Allele rs915677-T rs915677-R rs9617528-T rs9617528-R
A 0 .08 .72 0
C .91 0 0 .17
G 0 .92 .28 0
T .09 0 0 .83
![Page 49: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/49.jpg)
Plot the r2 for the 2 imputation runs
How do they compare?
badImp: 17,908/39,905 with r2 >= 0.6
goodImp: 24,685/39,905 with r2 >= 0.6
Still quite bad because of the small N
Should have compensated by including reference data
in the phasing step
In a QIMR dataset (N=19k): 32,296/33,815
![Page 50: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/50.jpg)
[Figure: three panels titled Bad Imputation, Better Imputation, Good Imputation]
![Page 51: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/51.jpg)
Analyses…
DO NOT ANALYSE HARDCALL GENOTYPES!!!!!!
Analyse the dosage or probabilities as this will account
for the imputation uncertainty
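A small numerical illustration of why the hard call loses information: given the three genotype probabilities for one person at one variant, the dosage is the expected count of the coded allele, while the hard call keeps only the most likely genotype. The function names below are hypothetical.

```python
def expected_dosage(probs):
    """Expected coded-allele count from (P(AA), P(AB), P(BB)), B being coded."""
    p_aa, p_ab, p_bb = probs
    return p_ab + 2.0 * p_bb


def hard_call(probs):
    """Most likely genotype, as a count of the coded allele (0, 1 or 2)."""
    return max(range(3), key=lambda g: probs[g])
```

For probabilities `(0.1, 0.6, 0.3)` the hard call is 1, but the dosage is 1.2; the difference is exactly the uncertainty that a hard-call analysis throws away.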
![Page 52: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/52.jpg)
Analyses in RAREMETALWORKER
Simple phenotype file formats
Can account for relatedness & twins
Can use GRM to account for relatedness (memory+++)
Ped file
(no header)
Dat file
raremetalworker --ped your.ped --dat your.dat --vcf your.vcf.gz --prefix example
raremetalworker --ped your.ped --dat your.dat --vcf your.vcf.gz --kinPedigree --prefix example
![Page 53: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/53.jpg)
Files to practice with
Detailed cookbooks are available:
Minimac http://genome.sph.umich.edu/wiki/Minimac:_GIANT_1000_Genomes_Imputation_Cookbook
Minimac3 http://genome.sph.umich.edu/wiki/Minimac3_Imputation_Cookbook
Impute2 http://genome.sph.umich.edu/wiki/Impute2:_GIANT_1000_Genomes_Imputation_Cookbook
But really and truly consider using the Imputation Servers
so that you can access the HRC references!
https://imputationserver.sph.umich.edu/
https://imputation.sanger.ac.uk/
![Page 54: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/54.jpg)
Meta-analysis
(extended)
![Page 55: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/55.jpg)
Setting up a Meta-analysis
Managing the personal and social connections is extremely important
Meta-analyses are usually unfunded
The timeline is too short and the budget too small for a grant
Meta-analyses do not work top down – to be successful they MUST be led by analysts who know what they are doing
Evangelou, E. 2013. Nat Rev Genet 14 379-389.
![Page 56: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/56.jpg)
Columns METAL uses
SNP
Effect allele & non-effect allele
Frequency of effect allele
OR/Beta
SE [for standard error meta-analysis]
P-value [for Z-score meta-analysis]
IMPORTANT – you cannot use FDR-controlled or adaptively
permuted P values!
N/weight column [for Z-score meta-analysis]
![Page 57: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/57.jpg)
Effect allele
Differs for different programs and analysis options
Minor/major allele
Alphabetical
1st listed
DO NOT ASSUME YOU KNOW ALWAYS DOUBLE
CHECK!
![Page 58: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/58.jpg)
Genomic control
λ (lambda)
Median test statistic / expected median test statistic
Should be close to one
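As a worked version of the definition above, λ can be computed from a list of GWAS P-values by converting each one back to a 1-df chi-square statistic. This is a minimal sketch with hypothetical function names, assuming two-sided P-values strictly greater than 0.

```python
from statistics import NormalDist, median

_NORMAL = NormalDist()
# Median of a chi-square distribution with 1 df (about 0.455).
EXPECTED_MEDIAN = _NORMAL.inv_cdf(0.75) ** 2


def chisq_from_p(p):
    """1-df chi-square statistic corresponding to a two-sided P-value."""
    z = _NORMAL.inv_cdf(p / 2.0)  # lower tail keeps precision for small p
    return z * z


def genomic_lambda(pvalues):
    """Median observed test statistic divided by its expected median."""
    return median([chisq_from_p(p) for p in pvalues]) / EXPECTED_MEDIAN
```

P-values spread evenly between 0 and 1 give λ of about 1; a set of results that is shifted toward small P-values across the whole genome gives λ above 1, which is what genomic control corrects for.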
![Page 59: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/59.jpg)
Strand Ambiguous SNPs
When you get data from different studies, it is not always
aligned the same way
Remember A<>T & C<>G
If a SNP is A/C then the reverse strand is T/G
No ambiguity: regardless of strand we know which allele is
which
A/G, T/C & T/G are also non-ambiguous
METAL can align your non-ambiguous SNPs
![Page 60: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/60.jpg)
Strand Ambiguous SNPs
Remember A<>T & C<>G
If a SNP is A/T then the reverse strand is T/A
AMBIGUOUS!!! Need to check allele freq to make sure
samples are aligned
C/G SNPs are also ambiguous!
METAL cannot align ambiguous SNPs
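The strand logic from these two slides can be sketched as follows. This is an illustration with hypothetical function names, not METAL's code: it flags A/T and C/G SNPs as ambiguous, and for all other SNPs it aligns a study's effect to a chosen reference allele pair by trying the reported alleles and then their complements.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(a1, a2):
    """True for A/T and C/G SNPs, whose complement is the same allele pair."""
    return COMPLEMENT[a1.upper()] == a2.upper()


def align_beta(effect, other, beta, ref_effect, ref_other):
    """Return beta expressed for ref_effect, or None if it cannot be aligned."""
    e, o = effect.upper(), other.upper()
    if is_strand_ambiguous(e, o):
        return None  # needs an allele frequency check instead
    for a, b in ((e, o), (COMPLEMENT[e], COMPLEMENT[o])):
        if (a, b) == (ref_effect, ref_other):
            return beta       # same effect allele (possibly other strand)
        if (a, b) == (ref_other, ref_effect):
            return -beta      # alleles swapped: flip the sign of the effect
    return None  # allele pair does not match the reference at all
```

A study reporting T/G for a reference A/C SNP is simply on the other strand and keeps its beta, whereas a study reporting C/A has the effect allele swapped and its beta changes sign.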
![Page 61: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/61.jpg)
Meta-analysis running
We will run meta-analysis based on effect size and on test
statistic
For the weights in the test-statistic analysis, I’ve assumed that the
sample sizes are the same
METAL defaults to weight of 1 when no weight column is
supplied
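The two schemes can be sketched with the standard textbook formulas: a weighted sum of Z-scores, and an inverse-variance weighted average of effect sizes. The function names are hypothetical, and the weights passed to `z_meta` are assumed to be the final per-study weights (commonly the square root of the sample size); with no weights given, every study gets weight 1.

```python
import math
from statistics import NormalDist


def z_meta(z_scores, weights=None):
    """Weighted Z-score meta-analysis: Z = sum(w*z) / sqrt(sum(w^2))."""
    if weights is None:
        weights = [1.0] * len(z_scores)  # equal weights when none are supplied
    num = sum(w * z for w, z in zip(weights, z_scores))
    den = math.sqrt(sum(w * w for w in weights))
    z = num / den
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))  # two-sided P value
    return z, p


def ivw_meta(betas, std_errs):
    """Inverse-variance weighted (beta + SE) fixed-effect meta-analysis."""
    inv_var = [1.0 / (se * se) for se in std_errs]
    beta = sum(w * b for w, b in zip(inv_var, betas)) / sum(inv_var)
    se = math.sqrt(1.0 / sum(inv_var))
    return beta, se
```

With two equally weighted studies that both have Z = 2, the combined Z is 4 / sqrt(2), about 2.83, which shows how consistent evidence accumulates across samples.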
![Page 62: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/62.jpg)
INPUT FILES
Results1.txt
Results2.txt
![Page 63: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/63.jpg)
Step 2: script file: metal_run_file
# PERFORM META-ANALYSIS based on effect size and on test statistic
# Loading in the input files with results from the participating samples
# Note: Order of samples is …[sample size, alphabetic order,..]
# Phenotype is ..
# MB March 2013
# Specify column names
MARKER SNP
ALLELE A1 A2
PVALUE P
EFFECT log(OR)
STDERR SE
# Process two results files
PROCESS results1.txt
PROCESS results2.txt
# Output file naming
OUTFILE meta_res_Z .txt
# Conduct Z-based meta-analysis from test statistic
ANALYZE
# Clear workspace
CLEAR
# Change meta-analysis scheme to beta + SE
SCHEME STDERR
# Process two results files
PROCESS results1.txt
PROCESS results2.txt
# Output file naming
OUTFILE meta_res_SE .txt
# Conduct effect size meta-analysis
ANALYZE
![Page 64: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/64.jpg)
Larger Consortia
# PERFORM META-ANALYSIS on P-values
module load metal
metal << EOT
# Loading in the inputfiles with results from the participating samples
# Note: Order of samples is alphabetical
# Phenotype is WB
# 1. AGES_HAP
MARKER SNPID
ALLELE coded_all noncoded_all
EFFECT Beta
PVALUE Pval
WEIGHT n_total
GENOMICCONTROL ON
COLUMNCOUNTING LENIENT
PROCESS AGES_HAP.txt
# 2. ALSPAC_HAP
MARKER SNPID
ALLELE coded_all noncoded_all
EFFECT Beta
PVALUE Pval
WEIGHT n_total
GENOMICCONTROL ON
COLUMNCOUNTING LENIENT
PROCESS ALSPAC_HAP.txt
# AND SO ON (in this case 40 files)
![Page 65: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/65.jpg)
Running metal
metal < metal_run_file > metal_run.log
metal is the command
metal_run_file is the script file
METAL writes information about the run to standard out [the terminal];
here it is redirected to metal_run.log
It will spawn 4 files:
2 results files: meta_res_Z1.txt + meta_res_SE1.txt
2 info files: meta_res_Z1.txt.info + meta_res_SE1.txt.info
![Page 66: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/66.jpg)
Output you’ll see
Overview of METAL commands
Any errors
And your best hit from meta-analysis
![Page 67: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/67.jpg)
Output
![Page 68: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/68.jpg)
Don’t ask for stuff you don’t need (it’s annoying, and adding extra columns × 30M lines is a waste of
space…)
You need:
SNP, CHR:BP, EffectAllele, NonEffectAllele, EA_Freq, Ntotal,
Beta, SE, P, Rsq
Not
![Page 69: PowerPoint Presentation Materials...Title PowerPoint Presentation Author Created Date 5/26/2016 9:35:46 AM](https://reader036.vdocuments.net/reader036/viewer/2022071214/6042036451fc117e8575535a/html5/thumbnails/69.jpg)
Some of the slides are courtesy of Sarah Medland