charles schmitt director, informatics and data sciences senior researcher – data mining

Charles Schmitt

Director, Informatics and Data Sciences

Senior Researcher – Data Mining

Renaissance Computing Institute

Searching for the Genetic Causes of Disease with Hadoop

(and other big data technologies…)

Who is involved?

Data Sciences Group: Charles Schmitt, Ph.D.; Erik Scott; Nassib Nasser; Keary Cavin; Micheal Shoffner

Collaborators: Jonathan Berg, M.D.; Jim Evans, M.D.; Kari North, Ph.D.; Ethan Lange, Ph.D.; Rob Fowler, Ph.D.; UNC HTSF; UNC LCCC; UNC Center for Bioinformatics; UNC ITS RC; RENCI ACIS; UNC IPIT; multiple remote collaboration sites

Biomedical Informatics Group: Kirk Wilhelmsen, M.D.; Chris Bizon, Ph.D.; Xiaoshu Wang, Ph.D.; Jason Reilly; Phil Owens; Guifeng Jin; Michael Spiegel, Ph.D.; Joshua Salisbury, Ph.D.

Human DNA

• 23 chromosomes

• Nearly identical copies

• Dynamic 3-d structure

Human Genetic Variations

ATCGATCGATCAGACTA__GGGCTAGACTACGATCGATC – reference genome

ATCGATCGGTCAGACTATCGGGCTA__CTACGAGCGCTC – patient maternal

ATCGATCGGTCAGACTATCGGGCTA__CTACGATCGCTC – patient paternal

SNPs: low millions

Indels: low 100k

Structural variations: ~5-15% of the genome is larger structural variants

(Nature Biotechnology Volume: 29, Pages: 723–730 Year published: (2011))

Next-Generation Sequencing

Genome

Low coverage/targeted sequencing: cheaper and faster to sequence, less data to store,

But…

Greater reliance on making statistical inferences

Different strategies for research and clinical use

[Figure: reads aligned to a genome at 4x coverage, with two exons marked]

Identifying variations

Likely heterozygous: (6 C, 9 G), (7 T, 9 G)

Likely sequencing error: (2 C, 14 T), (1 C, 15 A)

2 homozygous SNPs

Unclear: (6 C, 14 T)
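The read counts above can be turned into a genotype call with a simple likelihood comparison. The following is a minimal sketch, not the method used in the talk: it assumes a diploid site with two observed alleles (called A and B here), independent reads, and a flat per-read error rate of 1%. The function names and the error rate are illustrative assumptions.

```python
from math import comb


def genotype_likelihoods(count_a, count_b, error_rate=0.01):
    """Binomial likelihood of the read counts under each diploid genotype.

    Assumption: every read is an independent draw, and a read shows the
    wrong allele with probability `error_rate`.
    """
    n = count_a + count_b
    ways = comb(n, count_b)
    return {
        # Homozygous A: every B read must be a sequencing error.
        "A/A": ways * (1 - error_rate) ** count_a * error_rate ** count_b,
        # Heterozygous: each read is A or B with probability 1/2.
        "A/B": ways * 0.5 ** n,
        # Homozygous B: every A read must be a sequencing error.
        "B/B": ways * error_rate ** count_a * (1 - error_rate) ** count_b,
    }


def call_genotype(count_a, count_b, error_rate=0.01):
    """Return the genotype with the highest likelihood."""
    likelihoods = genotype_likelihoods(count_a, count_b, error_rate)
    return max(likelihoods, key=likelihoods.get)


print(call_genotype(6, 9))   # 6 C, 9 G  -> A/B (heterozygous)
print(call_genotype(2, 14))  # 2 C, 14 T -> B/B (the 2 C reads look like errors)
```

Production variant callers also weigh base qualities and mapping qualities, which this sketch ignores.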


CTT deletion (deltaF508) is the most common cause of cystic fibrosis

Clinical Binning – the critical information

Criteria: Loci with Clinical Utility

• Bin 1: Genes which, when mutated, result in high risk of a clinically actionable condition
– Examples: BRCA1/2, MLH1, MSH2, FBN1, NF1; loci with proven PGx clinical utility
– Estimated number of genes/loci: Dozen(s)

Criteria: Loci with Clinical Validity

• Bin 2A: Low risk incidental information
– Examples: PGx variants and common risk SNPs with no proven clinical utility
– Estimated number of genes/loci: ~20 (eventually 100s – 1000s)

• Bin 2B: Medium risk incidental information
– Examples: APOE; genes associated with Mendelian disease for which no firm clinical recommendations exist
– Estimated number of genes/loci: 100s

• Bin 2C: High risk incidental information
– Examples: Huntington disease; prion diseases; SCA, PS1, PS2, APP
– Estimated number of genes/loci: Dozen(s)

Criteria: Loci with Unknown Clinical Implications

• Bin 3: All other loci
– Estimated number of genes/loci: >20,000

Criteria: Loci with important reproductive implications

• Bin R: Carrier status for severe AR disease
– Examples: Tay Sachs, Familial Dysautonomia, CF, etc.
– Estimated number of genes/loci: Hundreds

Slide provided by Jim Evans, M.D., Ph.D., Department of Genetics, UNC-CH

The promise of genetics requires a greater understanding of the underlying structure of the data


Computing on the Genome: Imputation

ATCGATCGATCAG - reference

ATCGGTCGATCAG – patient

Reads: TCGGTNNNTCAG, GTCGGTCAG, ATCGGTCGGTCA, ATCGGTCGGTC

It's unclear if this patient is A/A, A/G, or G/G



Infer the patient is homozygous for GG



ATCGGTCGGTCAG – patient 2
ATCGGTCGGTCAG – patient 3
ATCGGTCGGTCAG – patient 4
ATCGGTCGGTCAG – patient 5
ATCGATCGATCAG – patient 6
ATCGATCGATCAG – patient 7
ATCGATCGATCAG – patient 8

Population Evidence
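The idea of using population evidence to fill in an uncertain site can be sketched as a toy lookup. This is a deliberately simplified illustration, not the Hidden Markov Model approach named in the talk: it assumes a small panel of fully known sequences and fills each unknown base (marked N) by majority vote among panel sequences that agree with the patient at every known position. The function name and the N marker are assumptions made for the example.

```python
from collections import Counter


def impute_site(observed, panel, missing="N"):
    """Fill missing bases using panel sequences that match all known bases."""
    known = [i for i, base in enumerate(observed) if base != missing]
    matches = [hap for hap in panel
               if all(hap[i] == observed[i] for i in known)]
    if not matches:
        return observed  # no population evidence; leave the gaps in place
    filled = list(observed)
    for i, base in enumerate(observed):
        if base == missing:
            # Majority vote among the matching panel sequences.
            filled[i] = Counter(hap[i] for hap in matches).most_common(1)[0][0]
    return "".join(filled)


# Patients 2-5 and 6-8 from the slide serve as the panel.
panel = ["ATCGGTCGGTCAG"] * 4 + ["ATCGATCGATCAG"] * 3

# The patient clearly carries G at the first variant site; the second is unknown.
print(impute_site("ATCGGTCGNTCAG", panel))  # ATCGGTCGGTCAG
```

In the panel, G at the first variant site always travels with G at the second, which is why the unknown base is filled with G.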



Hidden Markov Models for cross-genome statistical correlations (Thunder*)

Imputation on 708 samples takes over 200,000 CPU hours to complete, or 22 CPU years

How many samples do we need to impute on rare variants?

* Li Y, Sidore C, Kang HM, Boehnke M, Abecasis GR. Low-coverage sequencing: Implications for design of complex trait association studies. Genome Res. 2011 Jun;21(6):940-51




Identifying moderately penetrant mutations from cross-population genetic structures

Convergent Haplotype Association Tagging

CHAT: developed by Kirk Wilhelmsen

[Figure: diagram with groups labeled A, B, and C, each containing items 1 through 5]

Using Graph Theory in CHAT

          Case   Control
Clique       5         0
-Clique    495       500

Discovered CHAT is 2,800 SNPs in length and 26 Mb
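One way to read the case/control table is to ask how likely it would be for all 5 clique members to be cases if clique membership were unrelated to case status. The sketch below computes that hypergeometric probability from the counts on the slide. The choice of test is an assumption made for illustration; the talk does not say which statistic CHAT uses.

```python
from math import comb


def prob_all_in_cases(clique_size, n_cases, n_controls):
    """Chance that a random group of `clique_size` people contains only cases."""
    return comb(n_cases, clique_size) / comb(n_cases + n_controls, clique_size)


# 5 clique members, 500 cases and 500 controls in total.
p = prob_all_in_cases(5, 500, 500)
print(f"{p:.4f}")  # about 0.031
```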

The promise of genetics requires better approaches to store and analyze large data

The cost of storing 100,000 genomes

10 PB = full human genomes at low coverage (1)

2 PB = human exomes at medium coverage (2)

Or:

$5 to $25 million for UNC Health Care System
– Every patient’s genome once on enterprise data storage
– Not including archived copies, not including analysis data sets

Or:

$15 to $75 billion for the US to store every patient’s genome once

Cost of disk space alone, not including refresh of equipment

(1) Empirical data, assuming ~100 GB per sample for compressed fastq, bam, vcf, and ancillary data files at coverage between 3-15x

(2) Empirical data, assuming ~20 GB per sample at around 30x, storing only compressed fastq and bam files
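The capacity figures follow directly from the per-sample sizes in the footnotes. A minimal arithmetic check, assuming decimal units (1 PB = 1,000,000 GB):

```python
GB_PER_PB = 1_000_000  # decimal units, an assumption for this check


def total_petabytes(n_genomes, gb_per_sample):
    """Raw storage for one copy of every sample, in petabytes."""
    return n_genomes * gb_per_sample / GB_PER_PB


print(total_petabytes(100_000, 100))  # full genomes, low coverage: 10.0
print(total_petabytes(100_000, 20))   # exomes, medium coverage: 2.0
```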

There is more clinical genetic data …

Courtesy of NIH via WikiCommons

… gene expression (RNA-seq)
… per tissue data
… time series data
… the personal microbiome

An Informatics Ecosystem for Clinical Genomics

At ~8K genomes, will scale to ~10-20K genomes

Need to scale to 100,000-200,000+ genomes

High Performance Computing (HPC)

Computing
– KillDevil (ITS RC)
• 706 traditional, GPU-based, and large-memory compute nodes
– BlueRidge (RENCI)
• 204 traditional, GPU-based, and large-memory compute nodes
– Croatan (RENCI)
• 30-node big-data configuration with 1 PB spinning disk
– Topsail (UNC Genomics)
• 400 traditional compute nodes
– Kure (ITS RC)
• 220 traditional and large-memory compute nodes
– Open Science Grid
• Distributed cycle-scavenging grid across research institutions
– Teragrid
• National HPC grid

Storage
• PB+ Dell/Isilon system at UNC
• PB+ DDN/NetApp/Dell systems at RENCI

Leverages: traditional bioinformatics tools, traditional HPC workflow systems

Aggregating genomic knowledge

VarDB.

NCBI RefSeq

dbSNP

HGMD (commercial)

Annotations of Clinical Variations

Other tools…

Protein Effects

PolyPhen

Leverages strengths of RDBMS in structured knowledge representation

Other databases

• VarDB: several TB database
– Reference Genomes
– Canonical Variants
– Annotations
– Indexes

• AnnoBot: automated query system to update VarDB

HadoopVCF Example: Allele Frequency

Variant Data file 1

#POS  CHR  …  REF  ALT  SAMPLE1  SAMPLE2  SAMPLE3
10    1    …  G    T    0/0      0/1      ./.
100   5    …  G    C    ./.      1/0      ./.
1455  12   …  A    T    0/0      0/0      0/0
1023  18   …  G    T    0/1      0/0      0/1

Variant Data file 2

#POS  CHR  …  REF  ALT  SAMPLE5  SAMPLE6  SAMPLE7
11    1    …  C    A    0/1      0/1      ./.
500   12   …  T    C    ./.      0/1      ./.
1023  18   …  G    T    1/0      0/1      0/0

HadoopVCF developed by Chris Bizon

HadoopVCF Example: Generalized

[Figure: grid of data files laid out by Samples on one axis and Genomic Variants, Genomic Loci on the other]

Each file holds different data for different samples and locations.

Hadoop: Generalized algorithm

• Mapper
– Key = subset of sample and loci
– Value = intermediate sums

• Reducer
– Calculation over intermediate sums
– Allele frequencies, % missing, HWE p-values, …

• Hadoop Distributed Cache
– Context from VCF headers and/or RDBMS for each mapper
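The mapper/reducer pattern above can be sketched in plain Python for the allele frequency example. This is an in-memory illustration of the pattern, not HadoopVCF itself and not Hadoop API code: the function names are hypothetical, the elided "…" columns are dropped from the sample rows, and samples are assumed to be diploid.

```python
from collections import defaultdict

# Rows from the two variant data files: POS CHR REF ALT genotypes...
FILE_1 = """\
10 1 G T 0/0 0/1 ./.
100 5 G C ./. 1/0 ./.
1455 12 A T 0/0 0/0 0/0
1023 18 G T 0/1 0/0 0/1"""

FILE_2 = """\
11 1 C A 0/1 0/1 ./.
500 12 T C ./. 0/1 ./.
1023 18 G T 1/0 0/1 0/0"""


def mapper(line):
    """Emit (locus key, intermediate sums) for one variant row."""
    pos, chrom, ref, alt, *genotypes = line.split()
    alt_alleles = called_alleles = missing_samples = 0
    for genotype in genotypes:
        if genotype == "./.":
            missing_samples += 1
            continue
        alleles = genotype.split("/")
        called_alleles += len(alleles)
        alt_alleles += alleles.count("1")
    return (chrom, int(pos), ref, alt), (alt_alleles, called_alleles, missing_samples)


def reducer(values):
    """Combine intermediate sums for one locus into summary statistics."""
    alt = sum(v[0] for v in values)
    called = sum(v[1] for v in values)
    missing = sum(v[2] for v in values)
    samples = missing + called // 2  # diploid assumption
    return {
        "allele_frequency": alt / called if called else None,
        "pct_missing": 100 * missing / samples if samples else None,
    }


def run(files):
    """Stand-in for the shuffle step: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for text in files:
        for line in text.splitlines():
            key, value = mapper(line)
            grouped[key].append(value)
    return {key: reducer(values) for key, values in grouped.items()}


results = run([FILE_1, FILE_2])
print(results[("18", 1023, "G", "T")])
```

The locus at position 1023 on chromosome 18 appears in both files, so its sums are combined across all six samples before the frequency is computed. Emitting sums rather than finished frequencies is what lets the reducer merge partial results from different files.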

Why Hadoop?

• Scalability for certain genome analysis patterns

• Challenges:
– Other analysis patterns: Hidden Markov Models, permutation testing, haplotype blocks, graphs, hierarchical graphs?
– Share resources
• Running on scheduled HPC clusters
• Running on centralized HP storage system + local disks
• Moving data to and from the worker nodes

Managing an R&D ecosystem with big data

[Diagram components: RENCI Storage (tape, drives); UNC Storage (tape, drives); UNC HPC; RENCI HPC; Open Science Grid; Teragrid; Clouds; IT Machines; RENCI Hadoop; Genomics Storage; Lab Machines; External Partner Resources; Genomics HPC; Genomics Hadoop; Data Providers; External Partners]

Research versus production use:

Life cycle: control increases over time as
• Work scope increases
• Expertise and technology mature
• Risk increases
• Number of groups touching data increases

Wild West

Analysts Automated Processes Developers

IT Staff

iRODS Data Virtualization

RENCI: /cuahsi/modeling

The iRODS Data Grid installs in a “layer” over storage systems, so you can view, manage, access, add, and share part or all of your data and metadata in a unified Collection.

Utah State Univ: /cuahsi/catalog

SDSC: /cuahsi/terrain

User sees single “Virtual Collection”: /cuahsi/catalog, /cuahsi/modeling, /cuahsi/terrain

User client views & manages data

Data Grid

Managing an R&D ecosystem with big data

[Diagram components: RENCI Storage (tape, drives); UNC Storage (tape, drives); UNC HPC; RENCI HPC; Open Science Grid; Teragrid; Clouds; IT Machines; RENCI Hadoop; Genomics Storage; Lab Machines; External Partners; Genomics HPC; Genomics Hadoop]

Integrated Rules-Oriented Data System (iRODS)

NFS, Hadoop, DDN WOS, POSIX, RDBMS, Web services

Data Services Programmatic APIs

Data Workflows iRODS Clients

Control over:
• Data movement and replication
• Metadata standards
• Archival, deletion, and retention
• Integration with workflows, Hadoop, databases
• Hiding complexities
• Automation
• …, all policy driven
• …, without breaking the in-place systems

Data Providers External Partners

Analysts Automated Processes Developers

IT Staff

Thank You
