
Hadoop-BAM: Directly manipulating BAM on Hadoop
Aleksi Kallio, CSC - IT Center for Science, Finland
BOSC 2011, July 16, Vienna

Background
Chipster 2.0: seamless integration of analysis tools, computing clusters and visualizations through a user friendly interface
With NGS data, the “seamless” part gets really hard...
Use Hadoop to improve user experience
Hadoop-BAM: a small side product that might prove useful for many people

Problem definition
Because of NGS instruments, we are in the middle of data deluge
BAM (Binary Alignment/Map) files are a standardized and compact way of storing (aligned) reads [Samtools]
So, what does “data deluge” mean?
“Data deluge is a situation where one desperately tries to find space for yet another huge set of BAM (and fastq) files.”

Problem definition (it gets worse...)
You not only need to store the data, you also have to do something with it
Pipelines take a long time to run
And in real life you don't use your pipelines once, but often tweak and rerun and rerun...

Enter: Hadoop
Map-reduce is a framework for processing terabytes of data in a distributed way
Hadoop is an open source implementation of Google's map-reduce framework
NGS data has a lot in common with web logs, which were the original motivation for map-reduce

Map-reduce framework

Hadoop and map-reduce
The framework basically implements a distributed sorting algorithm
The user has to write only the “map” and “reduce” functions, nothing else (see the sketch below)
The framework does automatic parallelization and fault tolerance
But BAM is not Hadoop friendly:
• Binary record format
• BGZF compression on top of that
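To make the division of labour concrete, here is a minimal word-count style sketch (illustrative only, not from the original slides): the user supplies just a map and a reduce function, and Hadoop takes care of splitting the input, shuffling and sorting by key, and fault tolerance.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // "map": emit (word, 1) for every word in an input line.
    public static class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    // "reduce": the framework has already grouped and sorted by key,
    // so the reducer only sums the counts for each word.
    public static class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            context.write(word, new IntWritable(sum));
        }
    }
}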

Possible solutions
Implement your own map-reduce framework
Ouch...
Convert to Hadoop-friendly text format
Storage size blows up
Network speed would become a bottleneck
Find a way to cope with BAM files in Hadoop
So we have Hadoop-BAM

Hadoop-BAM
Small and simple Java library
Throw it into your Hadoop installation
BAM! Your BAM files are accessible to Hadoop map-reduce functions

What does it do?
Gives you the Picard SAM API
Hadoop splits data into chunks, and special care is needed at chunk boundaries
Hadoop-BAM handles chunk boundaries behind the scenes
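As a hedged illustration of what this looks like in practice (not from the original slides): a map function receives ordinary Picard SAMRecord objects, and the BAM input format's record reader deals with split boundaries. The class and package names below follow the original Hadoop-BAM releases (fi.tkk.ics.hadoop.bam) and may differ in later versions, so treat the details as assumptions.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import fi.tkk.ics.hadoop.bam.SAMRecordWritable;  // wraps a Picard SAMRecord

// Count aligned reads per reference sequence; chunk-boundary handling is
// done for us inside the BAM record reader.
public class ReadsPerReferenceMapper
        extends Mapper<LongWritable, SAMRecordWritable, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, SAMRecordWritable record, Context context)
            throws IOException, InterruptedException {
        // Standard Picard SAM API on the wrapped record
        context.write(new Text(record.get().getReferenceName()), ONE);
    }
}

// Job setup sketch (assumed class names):
//   job.setInputFormatClass(fi.tkk.ics.hadoop.bam.BAMInputFormat.class);
//   job.setMapperClass(ReadsPerReferenceMapper.class);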

Detecting BAM record boundaries
First: BGZF blocks
Easy: blocks begin with a 32-bit magic number
To make checking even more robust, multiple blocks are checked, backtracking if needed
Second: BAM records
Harder: records have no identifiers
But various fields cross-reference each other
We can detect records with very good accuracy
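For intuition, here is a minimal sketch (not the library's actual code) of the first step, finding a BGZF block start inside an arbitrary byte range. The 32-bit magic and the "BC" extra subfield come from the BGZF specification; a real implementation would also verify the following blocks and backtrack, as the slide describes.

public final class BgzfBoundaryScanner {

    // gzip ID1, ID2, CM = deflate, FLG with the FEXTRA bit set
    private static final int[] MAGIC = {0x1f, 0x8b, 0x08, 0x04};

    /** Returns the offset of the first plausible BGZF block start at or after 'from', or -1. */
    public static int findBlockStart(byte[] buf, int from) {
        for (int i = from; i + 18 <= buf.length; i++) {
            if (!matchesMagic(buf, i)) continue;
            // XLEN is at offsets 10-11 (little endian); the BGZF "BC" subfield
            // with its 2-byte payload must fit inside it. For simplicity this
            // assumes "BC" is the first subfield, which BGZF writers use in practice.
            int xlen = (buf[i + 10] & 0xff) | ((buf[i + 11] & 0xff) << 8);
            if (xlen >= 6 && buf[i + 12] == 'B' && buf[i + 13] == 'C') {
                return i;  // a real implementation would also check the following blocks
            }
        }
        return -1;
    }

    private static boolean matchesMagic(byte[] buf, int pos) {
        for (int j = 0; j < MAGIC.length; j++) {
            if ((buf[pos + j] & 0xff) != MAGIC[j]) return false;
        }
        return true;
    }
}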

Example: Preprocessing for Chipster genome browser
How can we allow interactive browsing, with zooming in and out, of large BAM files?
Sampling can be used, but it is either slow or inaccurate
Preprocess the data and produce summaries at different zoom levels (mipmapping; see the sketch below)
Implemented on top of Hadoop-BAM
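A rough sketch of the mipmapping idea as a map function (illustrative; the bin sizes and key encoding here are assumptions, not Chipster's actual preprocessing): each read contributes a count to one bin per zoom level, reducers sum the counts per key, and the genome browser then reads the pre-computed summary that matches the current zoom instead of scanning raw reads.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import fi.tkk.ics.hadoop.bam.SAMRecordWritable;  // as in the previous sketch

public class SummaryLevelMapper
        extends Mapper<LongWritable, SAMRecordWritable, Text, LongWritable> {

    // Bin widths in bases, one per zoom level (hypothetical values).
    private static final long[] BIN_SIZES = {1000L, 10000L, 100000L, 1000000L};
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(LongWritable key, SAMRecordWritable record, Context context)
            throws IOException, InterruptedException {
        String ref = record.get().getReferenceName();
        long start = record.get().getAlignmentStart();
        for (long binSize : BIN_SIZES) {
            long bin = start / binSize;
            // Key encodes zoom level, chromosome and bin; the reducer sums counts per key.
            context.write(new Text(binSize + ":" + ref + ":" + bin), ONE);
        }
    }
}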

Result looks nice

Benchmarking
Take 50 GB of data from the 1000 Genomes project
Run on a cluster with 112 AMD Opteron 2.6 GHz processors (1344 cores) and an InfiniBand interconnect

Scalability results

Scalability results (cont.)
Did sorting and summarizing
Fairly nice scaling for the processing step
No scaling for import and export
Lesson: avoid moving data in and out of Hadoop
So having to convert data from BAM to something else would be bad

Future plans
Develop or port basic BAM tools to use Hadoop-BAM
Tools that work on BAM and BED files
Building on top of Hadoop-BAM
Pig query engine
Variant detection pipelines
Some ideas about doing join operations
It's really hard...

Conclusions
Cloud computing is not a free lunch: tools, algorithms and data formats need to be adapted
Hadoop-BAM library available with MIT license:
http://sourceforge.net/projects/hadoop-bam/
Contact: [email protected]

Acknowledgements
Matti Niemenmaa, André Schumacher, Keijo Heljanko (Aalto University, Department of Information and Computer Science)
Petri Klemelä, Eija Korpelainen (CSC - IT Center for Science)
TIVIT Cloud Software program for funding