Charles Schmitt
Director, Informatics and Data Sciences
Senior Researcher – Data Mining
Renaissance Computing Institute
Searching for the Genetic Causes of Disease with Hadoop
(and other big data technologies…)
Who is involved?
Data Sciences Group: Charles Schmitt, Ph.D.; Erik Scott; Nassib Nasser; Keary Cavin; Micheal Shoffner
Collaborators: Jonathan Berg, M.D.; Jim Evans, M.D.; Kari North, Ph.D.; Ethan Lange, Ph.D.; Rob Fowler, Ph.D.; UNC HTSF; UNC LCCC; UNC Center for Bioinformatics; UNC ITS RC; RENCI ACIS; UNC IPIT; multiple remote collaboration sites
Biomedical Informatics Group: Kirk Wilhelmsen, M.D.; Chris Bizon, Ph.D.; Xiaoshu Wang, Ph.D.; Jason Reilly; Phil Owens; Guifeng Jin; Michael Spiegel, Ph.D.; Joshua Salisbury, Ph.D.
Human DNA
• 23 chromosomes
• Nearly identical copies
• Dynamic 3-d structure
Human Genetic Variations
ATCGATCGATCAGACTA__GGGCTAGACTACGATCGATC – reference genome
ATCGATCGGTCAGACTATCGGGCTA__CTACGAGCGCTC – patient maternal
ATCGATCGGTCAGACTATCGGGCTA__CTACGATCGCTC – patient paternal
SNPs: low millions
Indels: low 100k
Structural variations: ~5-15% of the genome is larger structural variants
(Nature Biotechnology 29, 723–730 (2011))
Next-Generation Sequencing
Low coverage/targeted sequencing: cheaper and faster to sequence, less data to store
But…
Greater reliance on making statistical inferences
Different strategies for research and clinical use
[Figure: reads aligned to a genome region containing two exons, at 4x coverage]
Identifying variations
Likely heterozygous: (6 C, 9 G), (7 T, 9 G)
Likely sequencing error: (2 C, 14 T), (1 C, 15 A)
Identifying variations
2 homozygous SNPs
Unclear: (6 C, 14 T)
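The allele counts on these slides can be turned into a rough genotype call with a binomial model. The sketch below is only an illustration, not the variant caller used in this work: the function names, the flat 1% per-base error rate, and the three-genotype model are all assumptions made for the example.

```python
from math import comb

def genotype_likelihoods(major_count, minor_count, error_rate=0.01):
    """Binomial likelihoods of the read counts under three genotypes.

    error_rate is an assumed per-base sequencing error rate.
    """
    n = major_count + minor_count
    ways = comb(n, minor_count)
    return {
        # Every minor-allele read is a sequencing error.
        "homozygous_major": ways * error_rate**minor_count * (1 - error_rate)**major_count,
        # Each read samples either allele with probability 1/2.
        "heterozygous": ways * 0.5**n,
        # Every major-allele read is a sequencing error.
        "homozygous_minor": ways * error_rate**major_count * (1 - error_rate)**minor_count,
    }

def call_genotype(major_count, minor_count, error_rate=0.01):
    """Return the genotype with the highest likelihood."""
    likelihoods = genotype_likelihoods(major_count, minor_count, error_rate)
    return max(likelihoods, key=likelihoods.get)

print(call_genotype(9, 6))   # counts from a "likely heterozygous" site
print(call_genotype(15, 1))  # counts from a "likely sequencing error" site
```

A production caller would also weigh base quality, mapping quality, and strand bias, all of which this sketch ignores.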
Identifying variations
CTT deletion (deltaF508) is the most common cause of cystic fibrosis
Clinical Binning – the critical information
Bin 1 (criterion: loci with clinical utility)
• Genes which, when mutated, result in high risk of a clinically actionable condition
• Examples: BRCA1/2, MLH1, MSH2, FBN1, NF1; loci with proven PGx clinical utility
• Estimated number of genes/loci: Dozen(s)
Bin 2A (criterion: loci with clinical validity)
• Low risk incidental information
• Examples: PGx variants and common risk SNPs with no proven clinical utility
• Estimated number of genes/loci: ~20 (eventually 100s – 1000s)
Bin 2B (criterion: loci with clinical validity)
• Medium risk incidental information
• Examples: APOE, genes associated with Mendelian disease for which no firm clinical recommendations exist
• Estimated number of genes/loci: 100s
Bin 2C (criterion: loci with clinical validity)
• High risk incidental information
• Examples: Huntington disease, prion diseases, SCA, PS1, PS2, APP
• Estimated number of genes/loci: Dozen(s)
Bin 3 (criterion: loci with unknown clinical implications)
• All other loci
• Estimated number of genes/loci: >20,000
Bin R (criterion: loci with important reproductive implications)
• Carrier status for severe AR disease
• Examples: Tay Sachs, Familial Dysautonomia, CF, etc.
• Estimated number of genes/loci: Hundreds
Slide provided by Jim Evans, M.D., Ph.D., Department of Genetics, UNC-CH
The promise of genetics requires a greater understanding of the underlying structure of the data
Computing on the Genome: Imputation
ATCGATCGATCAG - reference
ATCGGTCGATCAG – patient
TCGGTNNNTCAG
GTCGGTCAG
ATCGGTCGGTCA
ATCGGTCGGTC
It's unclear whether this patient is A/A, A/G, or G/G
Computing on the Genome: Imputation
Infer that the patient is homozygous G/G
ATCGATCGATCAG - reference
ATCGGTCGATCAG – patient
TCGGTNNNTCAG
GTCGGTCAG
ATCGGTCGGTCA
ATCGGTCGGTC
Population evidence:
ATCGGTCGGTCAG – patient 2
ATCGGTCGGTCAG – patient 3
ATCGGTCGGTCAG – patient 4
ATCGGTCGGTCAG – patient 5
ATCGATCGATCAG – patient 6
ATCGATCGATCAG – patient 7
ATCGATCGATCAG – patient 8
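The population-evidence idea can be sketched as a simple vote: keep only the population sequences that agree with the patient wherever the patient is confidently observed, then let them decide the unresolved site. This is a deliberately naive, haploid illustration and not the method used in this work; the function name `impute_by_matching` and the choice to mark the unresolved ninth base as `N` are assumptions.

```python
from collections import Counter

def impute_by_matching(target, panel, missing="N"):
    """Fill missing sites using panel sequences that agree with the target
    at every observed site; ties and empty matches stay missing."""
    observed = [i for i, base in enumerate(target) if base != missing]
    matches = [h for h in panel if all(h[i] == target[i] for i in observed)]
    imputed = list(target)
    for i, base in enumerate(target):
        if base != missing:
            continue
        votes = Counter(h[i] for h in matches).most_common(2)
        # Accept the top allele only if it is a strict winner.
        if votes and (len(votes) == 1 or votes[0][1] > votes[1][1]):
            imputed[i] = votes[0][0]
    return "".join(imputed)

# Patients 2-8 from the slide.
panel = ["ATCGGTCGGTCAG"] * 4 + ["ATCGATCGATCAG"] * 3
# The patient: G at the fifth base, ninth base treated as unresolved (assumption).
target = "ATCGGTCGNTCAG"
print(impute_by_matching(target, panel))
```

Because only patients 2 to 5 share the patient's G at the fifth base, they alone vote, and the unresolved base is filled in from them.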
Computing on the Genome: Imputation
Hidden Markov Models for cross-genome statistical correlations (Thunder*)
Imputation on 708 samples takes over 200,000 CPU hours to complete, or 22 CPU years
How many samples do we need to impute on rare variants?
* Li Y, Sidore C, Kang HM, Boehnke M, Abecasis GR. Low-coverage sequencing: Implications for design of complex trait association studies. Genome Res. 2011 Jun;21(6):940-51
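To make the Hidden Markov Model idea concrete, the following is a toy haploid haplotype-copying HMM with a forward-backward pass. It is not Thunder's implementation: the function name `impute_posteriors`, the switch and error rates, and the use of `N` for an unresolved base are assumptions for illustration, and real tools work on diploid genotype likelihoods across far more sites and samples.

```python
def impute_posteriors(target, panel, site, switch=0.01, error=0.01, missing="N"):
    """Posterior allele probabilities at `site` from a haploid copying HMM.

    Hidden state = which panel haplotype the target copies at each site.
    `switch` and `error` are assumed, illustrative parameters.
    """
    k, m = len(panel), len(target)

    def emit(pos, hap):
        if target[pos] == missing:
            return 1.0  # a missing observation carries no information
        return 1.0 - error if panel[hap][pos] == target[pos] else error

    # Forward pass: fwd[pos][hap] = P(observations up to pos, state = hap)
    fwd = [[emit(0, h) / k for h in range(k)]]
    for pos in range(1, m):
        total = sum(fwd[-1])
        fwd.append([emit(pos, h) * ((1 - switch) * fwd[-1][h] + switch * total / k)
                    for h in range(k)])

    # Backward pass: bwd[pos][hap] = P(observations after pos | state = hap)
    bwd = [[1.0] * k for _ in range(m)]
    for pos in range(m - 2, -1, -1):
        nxt = [emit(pos + 1, h) * bwd[pos + 1][h] for h in range(k)]
        total = sum(nxt)
        bwd[pos] = [(1 - switch) * nxt[h] + switch * total / k for h in range(k)]

    # Combine both passes and sum the state posteriors per allele.
    weights = [fwd[site][h] * bwd[site][h] for h in range(k)]
    norm = sum(weights)
    posterior = {}
    for h in range(k):
        allele = panel[h][site]
        posterior[allele] = posterior.get(allele, 0.0) + weights[h] / norm
    return posterior

panel = ["ATCGGTCGGTCAG"] * 4 + ["ATCGATCGATCAG"] * 3  # patients 2-8
posterior = impute_posteriors("ATCGGTCGNTCAG", panel, site=8)
print(posterior)
```

With these inputs most of the posterior mass at the unresolved base falls on G, because the patient's observed G at the fifth base favors copying from patients 2 to 5.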
Identifying moderate penetrant mutations from cross-population genetic structures
Convergent Haplotype Association Tagging
CHAT: developed by Kirk Wilhelmsen
[Figure: three diagrams labeled A, B, and C, each with elements numbered 1–5, and a combined A/B/C view over elements 1–5]
Using Graph Theory in CHAT
                Case   Control
Clique             5         0
Not in clique    495       500
The discovered CHAT is 2,800 SNPs long and spans 26 Mb
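One way to judge the case/control clique counts above is a one-sided Fisher's exact test. The slides do not say which test was applied, so this is an assumed illustration; `fisher_one_sided` is a hypothetical helper written with the standard library only.

```python
from math import comb

def fisher_one_sided(case_in, control_in, case_out, control_out):
    """P(at least `case_in` cases among clique members) under the
    hypergeometric null of no association."""
    members = case_in + control_in
    cases = case_in + case_out
    total = cases + control_in + control_out
    upper = min(members, cases)
    return sum(
        comb(cases, x) * comb(total - cases, members - x)
        for x in range(case_in, upper + 1)
    ) / comb(total, members)

# Clique: 5 cases, 0 controls; not in clique: 495 cases, 500 controls.
p = fisher_one_sided(5, 0, 495, 500)
print(f"one-sided p = {p:.4f}")
```

With only five clique members the test has limited power, which is one reason a finding like this would need replication in further samples.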
The promise of genetics requires better approaches to store and analyze large data
The cost of storing 100,000 genomes
10 PB = full human genomes at low coverage (1)
2 PB = human exomes at medium coverage (2)
Or:
$5 to $25 million for UNC Health Care System
– Every patient’s genome once on enterprise data storage
– Not including archived copies, not including analysis data sets
Or:
$15 to $75 billion for the US to store every patient’s genome once
Cost of disk space alone, not including refresh of equipment
(1) Empirical data, assuming ~100 GB per sample for compressed fastq, bam, vcf, and ancillary data files at coverage between 3-15x
(2) Empirical data, assuming ~20 GB per sample at around 30x, storing only compressed fastq and bam files
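The storage totals follow directly from the per-sample sizes in the footnotes. The sketch below reproduces that arithmetic; the price range of $500 to $2,500 per TB and the US population of about 300 million are not stated on the slide and are assumptions, chosen because they reproduce the dollar figures quoted.

```python
GB_PER_PB = 1_000_000  # decimal units
TB_PER_PB = 1_000

def storage_pb(samples, gb_per_sample):
    """Total storage in petabytes for a number of samples."""
    return samples * gb_per_sample / GB_PER_PB

def disk_cost_usd(petabytes, usd_per_tb):
    """Disk cost in dollars at an assumed price per terabyte."""
    return petabytes * TB_PER_PB * usd_per_tb

genomes_pb = storage_pb(100_000, 100)  # low-coverage whole genomes
exomes_pb = storage_pb(100_000, 20)    # medium-coverage exomes
print(genomes_pb, exomes_pb)

# Assumed price range per TB (not given on the slide).
for usd_per_tb in (500, 2_500):
    print(usd_per_tb, disk_cost_usd(genomes_pb, usd_per_tb))

# Assumed US population of ~300 million, one low-coverage genome each.
us_pb = storage_pb(300_000_000, 100)
print(us_pb, disk_cost_usd(us_pb, 500), disk_cost_usd(us_pb, 2_500))
```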
There is more clinical genetic data …
Courtesy of NIH via WikiCommons
… gene expression (RNA-seq)
… per-tissue data
… time series data
… the personal microbiome
An Informatics Ecosystem for Clinical Genomics
At ~8K genomes, will scale to ~10-20K genomes
Need to scale to 100,000-200,000+ genomes
High Performance Computing (HPC)
Computing
– KillDevil (ITS RC)
• 706 traditional, GPU-based, and large-memory compute nodes
– BlueRidge (RENCI)
• 204 traditional, GPU-based, and large-memory compute nodes
– Croatan (RENCI)
• 30-node big-data configuration with 1 PB spinning disk
– Topsail (UNC Genomics)
• 400 traditional compute nodes
– Kure (ITS RC)
• 220 traditional and large-memory compute nodes
– Open Science Grid
• Distributed cycle-scavenging grid across research institutions
– Teragrid
• National HPC grid
Storage
• PB+ Dell/Isilon system at UNC
• PB+ DDN/NetApp/Dell systems at RENCI
Leverages: traditional bioinformatics tools, traditional HPC workflow systems
Aggregating genomic knowledge
• VarDB: several-TB database
– Reference Genomes
– Canonical Variants
– Annotations
– Indexes
• AnnoBot: automated query system to update VarDB
[Diagram: VarDB aggregates NCBI RefSeq, dbSNP, HGMD (commercial), annotations of clinical variations, protein effects (PolyPhen), other databases, and other tools…]
Leverages strengths of RDBMS in structured knowledge representation
HadoopVCF Example: Allele Frequency
Variant Data file 1
#POS   CHR   …   REF   ALT   SAMPLE1   SAMPLE2   SAMPLE3
10     1     …   G     T     0/0       0/1       ./.
100    5     …   G     C     ./.       1/0       ./.
1455   12    …   A     T     0/0       0/0       0/0
1023   18    …   G     T     0/1       0/0       0/1
Variant Data file 2
#POS   CHR   …   REF   ALT   SAMPLE5   SAMPLE6   SAMPLE7
11     1     …   C     A     0/1       0/1       ./.
500    12    …   T     C     ./.       0/1       ./.
1023   18    …   G     T     1/0       0/1       0/0
HadoopVCF developed by Chris Bizon
HadoopVCF Example: Generalized
[Figure: grid of samples against genomic variants and genomic loci]
Each file holds different data for different samples and locations.
Hadoop: Generalized algorithm
• Mapper
– Key = subset of sample and loci
– Value = intermediate sums
• Reducer
– Calculation over intermediate sums
• Allele frequencies, % missing, HWE p-values, …
• Hadoop Distributed Cache
– Context from VCF headers and/or RDBMS for each mapper
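The mapper/reducer pattern above can be imitated locally in plain Python. This is not HadoopVCF's code: the record tuple layout, the function names, and the in-memory grouping that stands in for Hadoop's shuffle are all assumptions, and the rows are copied from the two example variant files.

```python
from collections import defaultdict

def mapper(record):
    """Emit (locus key, partial sums) for one VCF-like data row."""
    chrom, pos, ref, alt, genotypes = record
    alt_alleles = called = missing = 0
    for gt in genotypes:
        if "." in gt:  # "./." marks a missing genotype
            missing += 1
            continue
        alleles = gt.replace("|", "/").split("/")
        alt_alleles += sum(a == "1" for a in alleles)
        called += len(alleles)
    yield (chrom, pos, ref, alt), (alt_alleles, called, missing, len(genotypes))

def reducer(key, values):
    """Combine partial sums into allele frequency and fraction missing."""
    alt_alleles, called, missing, samples = (sum(col) for col in zip(*values))
    frequency = alt_alleles / called if called else None
    return key, frequency, missing / samples

file1 = [("1", 10, "G", "T", ["0/0", "0/1", "./."]),
         ("5", 100, "G", "C", ["./.", "1/0", "./."]),
         ("12", 1455, "A", "T", ["0/0", "0/0", "0/0"]),
         ("18", 1023, "G", "T", ["0/1", "0/0", "0/1"])]
file2 = [("1", 11, "C", "A", ["0/1", "0/1", "./."]),
         ("12", 500, "T", "C", ["./.", "0/1", "./."]),
         ("18", 1023, "G", "T", ["1/0", "0/1", "0/0"])]

# Local stand-in for Hadoop's shuffle: group mapper output by key.
grouped = defaultdict(list)
for record in file1 + file2:
    for key, value in mapper(record):
        grouped[key].append(value)

results = {key: reducer(key, values)[1:] for key, values in grouped.items()}
print(results[("18", 1023, "G", "T")])  # (allele frequency, fraction missing)
```

Because the mapper emits sums rather than final statistics, the variant at position 1023 on chromosome 18, which appears in both files, is combined correctly across all six samples in the reducer.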
Why Hadoop?
• Scalability for certain genome analysis patterns
• Challenges:
– Other analysis patterns: Hidden Markov Models, permutation testing, haplotype blocks, graphs, hierarchical graphs?
– Share resources
• Running on scheduled HPC clusters
• Running on centralized HP storage system + local disks
• Moving data to and from the worker nodes
Managing an R&D ecosystem with big data
[Diagram: RENCI storage (tape, drives), UNC storage (tape, drives), UNC HPC, RENCI HPC, Open Science Grid, Teragrid, clouds, IT machines, RENCI Hadoop, genomics storage, genomics HPC, genomics Hadoop, lab machines, and external partner resources; used by data providers, external partners, analysts, automated processes, developers, and IT staff]
Research versus production use:
Life cycle: control increases over time as
• Work scope increases
• Expertise and technology mature
• Risk increases
• Number of groups touching data increases
Wild West
iRODS Data Virtualization
The iRODS Data Grid installs in a “layer” over storage systems, so you can view, manage, access, add, and share part or all of your data and metadata in a unified Collection.
[Diagram: a user client views and manages data through the data grid. Storage sites: RENCI holds /cuahsi/modeling, Utah State Univ holds /cuahsi/catalog, SDSC holds /cuahsi/terrain. The user sees a single “Virtual Collection”: /cuahsi/catalog, /cuahsi/modeling, /cuahsi/terrain]
Managing an R&D ecosystem with big data
[Diagram: RENCI storage (tape, drives), UNC storage (tape, drives), UNC HPC, RENCI HPC, Open Science Grid, Teragrid, clouds, IT machines, RENCI Hadoop, genomics storage, genomics HPC, genomics Hadoop, lab machines, external partners]
Integrated Rule-Oriented Data System (iRODS)
Sits over: NFS, Hadoop, DDN WOS, POSIX, RDBMS, Web services
Provides: data services, programmatic APIs, data workflows, iRODS clients
Control over:
• Data movement and replication
• Metadata standards
• Archival, deletion, and retention
• Integration with workflows, Hadoop, databases
• Hiding complexities
• Automation
• …, all policy driven
• …, without breaking the in-place systems
Users: data providers, external partners, analysts, automated processes, developers, IT staff
Thank You