
Hadoop-BAM: Directly manipulating BAM on Hadoop
Aleksi Kallio, CSC - IT Center for Science, Finland
BOSC 2011, July 16, Vienna

Background
Chipster 2.0: seamless integration of analysis tools, computing clusters and visualizations through a user friendly interface
With NGS data, the “seamless” part gets really hard...
Use Hadoop to improve user experience
Hadoop-BAM: a small side product that might prove useful for many people

Problem definition
Because of NGS instruments, we are in the middle of data deluge
BAM (Binary Alignment/Map) files are a standardized and compact way of storing (aligned) reads [Samtools]
So, what does “data deluge” mean?
“Data deluge is a situation where one desperately tries to find space for yet another huge set of BAM (and fastq) files.”

Problem definition (it gets worse...)
You not only need to store the data, you also have to do something with it
Pipelines take a long time to run
And in real life you don't use your pipelines once, but often tweak and rerun and rerun...

Enter: Hadoop
Map-reduce is a framework for processing terabytes of data in a distributed way
Hadoop is an open source implementation of Google's map-reduce framework
NGS data has a lot in common with web logs, which were the original motivation for map-reduce

Map-reduce framework

Hadoop and map-reduce
The framework basically implements a distributed sorting algorithm
The user has to write only the “map” and “reduce” functions, nothing else (see the sketch below)
The framework does automatic parallelization and fault tolerance
But BAM is not Hadoop friendly:
• Binary record format
• BGZF compression on top of that
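To make the division of labour concrete, here is a minimal word-count style sketch (illustrative only, not from the original slides): the user supplies just a map and a reduce function, and Hadoop takes care of splitting the input, shuffling and sorting by key, and fault tolerance.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // "map": emit (word, 1) for every word in an input line.
    public static class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    // "reduce": the framework has already grouped and sorted by key,
    // so the reducer only sums the counts for each word.
    public static class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            context.write(word, new IntWritable(sum));
        }
    }
}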

Possible solutions
Implement your own map-reduce framework
Ouch...
Convert to Hadoop-friendly text format
Storage size blows up
Network speed would become a bottleneck
Find a way to cope with BAM files in Hadoop
So we have Hadoop-BAM

Hadoop-BAM
Small and simple Java library
Throw it into your Hadoop installation
BAM! Your BAM files are accessible to Hadoop map-reduce functions

What does it do?
Gives you the Picard SAM API
Hadoop splits data into chunks, and special care is needed at chunk boundaries
Hadoop-BAM handles chunk boundaries behind the scenes
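As a hedged illustration of what this looks like in practice (not from the original slides): a map function receives ordinary Picard SAMRecord objects, and the BAM input format's record reader deals with split boundaries. The class and package names below follow the original Hadoop-BAM releases (fi.tkk.ics.hadoop.bam) and may differ in later versions, so treat the details as assumptions.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import fi.tkk.ics.hadoop.bam.SAMRecordWritable;  // wraps a Picard SAMRecord

// Count aligned reads per reference sequence; chunk-boundary handling is
// done for us inside the BAM record reader.
public class ReadsPerReferenceMapper
        extends Mapper<LongWritable, SAMRecordWritable, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, SAMRecordWritable record, Context context)
            throws IOException, InterruptedException {
        // Standard Picard SAM API on the wrapped record
        context.write(new Text(record.get().getReferenceName()), ONE);
    }
}

// Job setup sketch (assumed class names):
//   job.setInputFormatClass(fi.tkk.ics.hadoop.bam.BAMInputFormat.class);
//   job.setMapperClass(ReadsPerReferenceMapper.class);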

Detecting BAM record boundaries
First: BGZF blocks
Easy: blocks begin with a 32-bit magic number
To make checking even more robust, multiple blocks are checked, backtracking if needed
Second: BAM records
Harder: records have no identifiers
But various fields cross-reference each other
We can detect records with very good accuracy
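For intuition, here is a minimal sketch (not the library's actual code) of the first step, finding a BGZF block start inside an arbitrary byte range. The 32-bit magic and the "BC" extra subfield come from the BGZF specification; a real implementation would also verify the following blocks and backtrack, as the slide describes.

public final class BgzfBoundaryScanner {

    // gzip ID1, ID2, CM = deflate, FLG with the FEXTRA bit set
    private static final int[] MAGIC = {0x1f, 0x8b, 0x08, 0x04};

    /** Returns the offset of the first plausible BGZF block start at or after 'from', or -1. */
    public static int findBlockStart(byte[] buf, int from) {
        for (int i = from; i + 18 <= buf.length; i++) {
            if (!matchesMagic(buf, i)) continue;
            // XLEN is at offsets 10-11 (little endian); the BGZF "BC" subfield
            // with its 2-byte payload must fit inside it. For simplicity this
            // assumes "BC" is the first subfield, which BGZF writers use in practice.
            int xlen = (buf[i + 10] & 0xff) | ((buf[i + 11] & 0xff) << 8);
            if (xlen >= 6 && buf[i + 12] == 'B' && buf[i + 13] == 'C') {
                return i;  // a real implementation would also check the following blocks
            }
        }
        return -1;
    }

    private static boolean matchesMagic(byte[] buf, int pos) {
        for (int j = 0; j < MAGIC.length; j++) {
            if ((buf[pos + j] & 0xff) != MAGIC[j]) return false;
        }
        return true;
    }
}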

Example: Preprocessing for Chipster genome browser
How can we allow interactive browsing, with zooming in and out, of large BAM files?
Sampling can be used, but it is either slow or inaccurate
Preprocess the data and produce summaries at different zoom levels (mipmapping; see the sketch below)
Implemented on top of Hadoop-BAM
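A rough sketch of the mipmapping idea as a map function (illustrative; the bin sizes and key encoding here are assumptions, not Chipster's actual preprocessing): each read contributes a count to one bin per zoom level, reducers sum the counts per key, and the genome browser then reads the pre-computed summary that matches the current zoom instead of scanning raw reads.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import fi.tkk.ics.hadoop.bam.SAMRecordWritable;  // as in the previous sketch

public class SummaryLevelMapper
        extends Mapper<LongWritable, SAMRecordWritable, Text, LongWritable> {

    // Bin widths in bases, one per zoom level (hypothetical values).
    private static final long[] BIN_SIZES = {1000L, 10000L, 100000L, 1000000L};
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(LongWritable key, SAMRecordWritable record, Context context)
            throws IOException, InterruptedException {
        String ref = record.get().getReferenceName();
        long start = record.get().getAlignmentStart();
        for (long binSize : BIN_SIZES) {
            long bin = start / binSize;
            // Key encodes zoom level, chromosome and bin; the reducer sums counts per key.
            context.write(new Text(binSize + ":" + ref + ":" + bin), ONE);
        }
    }
}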

Result looks nice

Benchmarking
Take 50 GB of data from the 1000 Genomes project
Run on a cluster with 112 AMD Opteron 2.6 GHz processors (1344 cores) and an InfiniBand interconnect

Scalability results

Scalability results (cont.)
Did sorting and summarizing
Fairly nice scaling for the processing step
No scaling for import and export
Lesson: avoid moving data in and out of Hadoop
So having to convert data from BAM to something else would be bad

Future plans
Develop or port basic BAM tools to use Hadoop-BAM
Tools that work on BAM and BED files
Building on top of Hadoop-BAM
Pig query engine
Variant detection pipelines
Some ideas about doing join operations
It's really hard...

Conclusions
Cloud computing is not a free lunch: tools, algorithms and data formats need to be adapted
Hadoop-BAM library available with MIT license:
http://sourceforge.net/projects/hadoop-bam/
Contact: [email protected]

Acknowledgements
Matti Niemenmaa, André Schumacher, Keijo Heljanko (Aalto University, Department of Information and Computer Science)
Petri Klemelä, Eija Korpelainen (CSC - IT Center for Science)
TIVIT Cloud Software program for funding