DESCRIPTION
Toronto Hadoop User Group (THUG), September 18, 2012: Biotech workflow in Hadoop
TRANSCRIPT
Example Biotech Use Cases for Hadoop
Adam Muise, Systems Engineer
Cloudera
THUG – Toronto Hadoop User Group
September 18, 2012
This evening…
• I will discuss a Hadoop DNA sequencing use case actually implemented at a large biotech firm
• I will take questions from the biologists who are new to Hadoop and review the architecture
• We will walk through the sequencing workflow
• We will encourage an informal discussion of other ongoing biotech use cases that we can apply Hadoop to
• We will have an open discussion about other common bioinformatic toolsets and their compatibility with Hadoop
• You will tell me if we should continue to have a biotech-themed meetup on a quarterly basis
Use Case: Drug Development using NGS
• Implemented at a very large biotech firm we will call “N”
• NGS = Next Generation Sequencing and all of the scientific process that accompanies it
• Workflow was previously on traditional HPC
Challenges at “N”
• NGS produces a great deal of DNA data to sequence
• Prior to using Hadoop, the traditional process was a tangle of manually distributed R and Perl scripts mixed with a traditional HPC cluster (MPI-based, 2,200 processing cores)
• The workflow we will focus on took 6 weeks for a full data set
• E.g. 300 R files submitted at the same time hung the system; processing this benchmark took several hours because the parallelism had to be reduced
Enter the Dragon
• The workflow was analyzed by Cloudera and implemented in Hadoop
• The resulting workflow went from 6 weeks to 2 days or less on a 23 node cluster (3 masters, 20 workers)
• The primary gains were the massive parallelism that Hadoop provides without requiring message passing, and the data locality built into MapReduce/HDFS
• 300 R files processed in 66 seconds
Symbolic workflow
Ingest Data
• Typically FASTA, FASTQ, QSEQ, or tab-delimited formats

Pre-processing Steps
• Remove bad reads
• Count the bases (and other stats)
• Remove adapters and barcodes (added by sequencers)
• Assess quality of sample

Alignment
• Align reads to a reference genome
• Lots of replicated reads to process

Post-processing Steps
• Meaningful scientific analysis starts here
• SNP
• Variance analysis
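As a sketch of the "count the bases (and other stats)" step above, here is a minimal standalone Python routine over FASTQ records. The function names are illustrative, not from the talk, and real pipelines would run this at scale as a MapReduce job rather than in one process:

```python
from collections import Counter

def parse_fastq(lines):
    """Yield (read_id, sequence, quality) from FASTQ text, 4 lines per record."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)                      # '+' separator line
        qual = next(it).strip()
        yield header.strip().lstrip('@'), seq, qual

def base_counts(records):
    """Tally A/C/G/T/N across all reads -- the 'base count stats' idea."""
    counts = Counter()
    for _, seq, _ in records:
        counts.update(seq.upper())
    return counts

if __name__ == '__main__':
    sample = ['@read1', 'ACGTN', '+', 'IIIII',
              '@read2', 'GGCC',  '+', 'IIII']
    print(dict(base_counts(parse_fastq(sample))))
```
In a Hadoop setting, each mapper would emit per-base tallies and a reducer would sum them.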
Technologies involved in the typical sequencing workflows at “N”:

Pre-Processing
• FastQC (v0.9.1) / FastQ Screen (v0.2.1)
• Base Count Stats
• FASTX-Toolkit (0.0.13)
• FASTQ/A Barcode Splitter; FASTQ/A Clipper
• FASTQ Quality Filter
• FASTQ Quality Trimmer

Aligner Algorithms
• Burrows-Wheeler Aligner – BWA (v0.5.9)
• Bowtie aligner (0.12.7)
• TopHat (1.4.1.1, 1.3.2 GNU)
• Bowtie2 aligner (2.0.0-beta5)

Post-Processing
• Picard:
  – Multiple & Alignment Metrics
  – Quality Score Distribution
  – Mean Quality by Cycle
  – GC Bias Metrics
  – Insert Size Metrics
  – RNA-Seq Metrics
  – Mark Duplicates
• HTSeq
• Cufflinks / Cuffdiff / Cuffmerge / Cuffcompare
Resulting Workflow: Pre-processing Jobs
• Ingest into the Hadoop cluster (HDFS put)
• Remove adapters (markers added by sequencers) if present
• Segregate data based on barcodes for small RNA (again, an artifact of sequencing and useful for sorting)
• Plug in any other tools that can be used with Hadoop Streaming (e.g. stdin and stdout to a Perl script or C program)
• Sequences converted to BAM (Binary Alignment/Map format)
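The Hadoop Streaming plug-in point means any stdin/stdout program can act as a mapper. As a minimal sketch of the quality-filtering idea, here is a streaming-style mapper in Python; the cutoff, phred offset, and file names are assumptions for illustration, not values from the talk:

```python
#!/usr/bin/env python
# Hadoop Streaming-style mapper: drop FASTQ reads whose mean quality is low.
# Illustrative invocation (paths hypothetical):
#   hadoop jar hadoop-streaming.jar -input raw/ -output filtered/ \
#       -mapper filter_reads.py
import sys

MIN_MEAN_QUAL = 20    # assumed cutoff
PHRED_OFFSET = 33     # Sanger / Illumina 1.8+ quality encoding

def mean_quality(qual_line):
    """Average phred score of one quality string."""
    scores = [ord(c) - PHRED_OFFSET for c in qual_line]
    return sum(scores) / float(len(scores))

def filter_stream(lines, out):
    """Read 4-line FASTQ records from `lines`; write records that pass."""
    record = []
    for line in lines:
        record.append(line.rstrip('\n'))
        if len(record) == 4:
            if mean_quality(record[3]) >= MIN_MEAN_QUAL:
                out.write('\n'.join(record) + '\n')
            record = []

if __name__ == '__main__':
    filter_stream(sys.stdin, sys.stdout)
```
The same structure works for any of the FASTX-style filters: the mapper sees a shard of the input on stdin and emits the surviving records on stdout.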
Resulting Workflow: Alignment
• Used TopHat 2.0.4:
  – http://tophat.cbcb.umd.edu/
  – Implemented in Hadoop as a MapReduce job
  – Input/output file format is BAM
  – TopHat uses the Bowtie aligner
• This step maps the sequences to an established reference genome and separates the exons from the introns
• Splice junctions are identified via the reads that were not identified as exons
• This is the bulk of the work
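To give a feel for what the alignment records carry downstream, here is a tiny tally over SAM alignment lines (plain-text SAM for illustration; real pipelines read the binary BAM output with a library such as Picard or pysam). The helper names are hypothetical:

```python
def is_mapped(sam_line):
    """True if the read in this SAM alignment line mapped to the reference.
    Bit 0x4 of the FLAG field (column 2) means 'segment unmapped'."""
    flag = int(sam_line.split('\t')[1])
    return not (flag & 0x4)

def mapped_counts(sam_lines):
    """Count mapped vs unmapped alignment lines, skipping '@' header lines."""
    mapped = unmapped = 0
    for line in sam_lines:
        if line.startswith('@'):
            continue
        if is_mapped(line):
            mapped += 1
        else:
            unmapped += 1
    return mapped, unmapped
```
Per-shard counts like these are exactly the kind of statistic a MapReduce job can aggregate across the whole alignment output.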
Resulting Workflow: Post-processing
• Various post-processing tools can be implemented based on the specific workflow variation
• Frequently Cufflinks was used:
  – http://cufflinks.cbcb.umd.edu/index.html
• These tools were implemented in Hadoop primarily with Streaming; this allows Hadoop to act as a very flexible, generalized parallel execution engine without imposing the complexity of traditional grid computing
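One of the Picard metrics mentioned earlier, Mean Quality by Cycle, reduces nicely to this kind of streaming aggregation. A standalone Python sketch (the phred offset of 33 is an assumption; this mirrors the shape of the metric, not Picard's actual implementation):

```python
def quality_by_cycle(qual_lines, phred_offset=33):
    """Mean phred quality at each read position (cycle) across all reads.
    Reads may have different lengths; each cycle averages only the reads
    that reached it."""
    sums, counts = [], []
    for qual in qual_lines:
        for i, ch in enumerate(qual):
            if i == len(sums):          # first read to reach this cycle
                sums.append(0)
                counts.append(0)
            sums[i] += ord(ch) - phred_offset
            counts[i] += 1
    return [s / float(n) for s, n in zip(sums, counts)]
```
In a MapReduce formulation, mappers would emit (cycle, sum, count) triples and a reducer would combine them, which is why this class of metric parallelizes so well.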
ANOTHER USE CASE “M”
Another Use Case: “M” Challenges
• Immense amount of critical genomic data; the "Vanilla-brand RDBMS" is unable to capture/process all raw logs
• Governance: "Vanilla-brand RDBMS" is used for compliance today. Data is captured on tape, which is not easily accessed and carries many hidden costs; accessing and analyzing data with the "Vanilla-brand RDBMS" is a very slow process
• Genome analysis and integration is highly manual
• Existing platforms don’t scale to expected volumes:
  – 100s to 1000s of reference genomes
  – 10,000 prokaryotic genomes/year by 2014
  – 100,000 resequenced lines by 2015
• Next-generation sequence analysis workflows are manual today – not sustainable for forecasted growth
• By 2015, the "M" sequencing lab expected to sequence more than two quadrillion nucleotides per year
“M” Challenges
Business Challenges before Cloudera Enterprise
• Ingesting data for genotyping and DNA sequencing is hitting scaling limitations:
  – new sequencing devices are generating more data
  – need to increase the frequency and number of products being sequenced
  – need to bring openly available sequence datasets in-house
• DNA data has multiple dimensions; processing it efficiently in a relational database with Java-based applications is:
  – not meeting performance needs, not scaling to meet SLAs, not cost effective
  – burdened by infrastructure complexity and difficulty scaling horizontally
  – unable to store and access sparse, unstructured datasets
“M” Existing Infrastructure
• Systems which support "M"'s R&D pipeline are constantly changing, evolving, and becoming increasingly complex as the science and its corresponding processes evolve. The initial project phase is the adoption of complex computational analytics run on growing data sets. Additionally, operational improvements are found by streamlining and automating analytical computations which previously ran offline for a select few users.
• RDBMS: "Vanilla-brand RDBMS" for data capture, governance, and compliance protocols
• Performance bottlenecks and costly investment: the size of the data that analysis can be run against is limited by the size of the hardware the database engine is running on. Costly additional software/hardware licenses are required.
• HPC: Illumina for sequencing, genotyping, and gene expression. "M" is not able to perform complex analytics on large datasets in parallel within a short period of time; scientists run manual workloads to clean the data before they can access it.
• Java-based applications run on a Sun Grid Engine-managed compute farm. Java applications are the easiest and simplest method of performing computational analysis, but they have several limiting factors. First, to perform any analysis, time must be spent retrieving data from a repository, in most cases an RDBMS. Second, the size of the data that an analysis can be run against at any one time is limited by the max heap size of the Java Virtual Machine and, in some respects, the amount of memory on the machine. Finally, the amount of parallelization that can be performed at any one time is limited by the hardware of the machine running the application.
“M” Implementation: Use Case
• Phase 1: ‘Sequence Search and Retrieval’. "M" uses CDH to capture and store Apache logs from internal (not public-facing) R&D workflow applications. Today they mine 300 TB of data on their CDH cluster, searching for trends and performance metrics used to coordinate scientific workflows, including raw genetic sequencing. "M" has developed complex (proprietary) computational algorithms running on CDH. Phase one includes a gradual migration of data off stored tape to CDH for full governance and compliance.
• Phase 2: Genetic Data in One Repository. The R&D biotech (seed breeding) group owns one cluster in production. They are building a scalable data architecture for next-generation storage and analysis of genetic data, at scale, with real-time access. Analysis results will include specific gene detection.
“M” Implementation: Cloudera Enterprise Implementation
• Cloudera Manager 3.5 with CDH3u3, including HBase
• Sequence data dumped to NFS for cleansing and filtering (phase 2: move to HDFS/MapReduce)
• Ingest cleansed data into HDFS for cost-effective storage and processing
• MapReduce on ingested data for insert into HBase
• Real-time full/range scans on HBase, depending on user queries, for analysis (compare/contrast, patterns)
• For fingerprint data: Sqoop in/out of "Vanilla-brand RDBMS" for further processing (SimMatrix)

Results
• Store and process large amounts of data, in near real time, cost effectively
• Analyse (compare/contrast, patterns) against a larger number of products very efficiently
• Run more queries against a much larger data set more frequently; this was a major challenge pre-Cloudera
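Real-time range scans in HBase hinge on row-key design, since HBase sorts rows lexicographically by key bytes. "M"'s actual schema is proprietary and not described in the talk; the following is purely a hypothetical sketch of one workable scheme for genomic position data (organism and chromosome names are made up):

```python
def sequence_row_key(organism, chromosome, position, width=10):
    """Compose an HBase-style row key so that lexicographic order matches
    genomic order. The position is zero-padded because HBase compares keys
    byte-by-byte, so '50' would otherwise sort before '5'."""
    return '%s|%s|%0*d' % (organism, chromosome, width, position)

def scan_range(organism, chromosome, start, stop, width=10):
    """Start/stop keys for a range scan over [start, stop) on one chromosome."""
    return (sequence_row_key(organism, chromosome, start, width),
            sequence_row_key(organism, chromosome, stop, width))
```
These start/stop keys would then feed the STARTROW/STOPROW of an HBase scan, keeping a positional query to a narrow slice of the table instead of a full scan.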