Hadoop-BAM: Directly manipulating BAM on Hadoop Aleksi Kallio CSC - IT Center for Science, Finland BOSC 2011, July 16, Vienna

Page 1: F07-Cloud-Hadoop-BAM

Hadoop-BAM: Directly manipulating BAM on Hadoop

Aleksi Kallio, CSC - IT Center for Science, Finland

BOSC 2011, July 16, Vienna

Page 2: F07-Cloud-Hadoop-BAM

Background

Chipster 2.0: seamless integration of analysis tools, computing clusters and visualizations through a user-friendly interface

With NGS data, the "seamless" part gets really hard...

Use Hadoop to improve user experience

Hadoop-BAM: a small by-product that might prove useful to quite a few people

Page 3: F07-Cloud-Hadoop-BAM

Problem definition

Because of NGS instruments, we are in the middle of a data deluge

BAM (Binary Alignment/Map) files are a standardized and compact way of storing (aligned) reads [Samtools]

So, what does "data deluge" mean?

“Data deluge is a situation where one desperately tries to find space for yet another huge set of BAM (and fastq) files.”

Page 4: F07-Cloud-Hadoop-BAM

Problem definition (it gets worse...)

You don't only need to store data, but you also have to do something with it

Pipelines take a long time to run

And in real life you don't use your pipelines once, but often tweak and rerun and rerun...

Page 5: F07-Cloud-Hadoop-BAM

Enter: Hadoop

Map-reduce is a framework for processing terabytes of data in a distributed way

Hadoop is an open source implementation of Google's map-reduce framework

NGS data has a lot in common with web logs, which were the original motivation for map-reduce

Page 6: F07-Cloud-Hadoop-BAM

Map-reduce framework

Page 7: F07-Cloud-Hadoop-BAM

Hadoop and map-reduce

The framework basically implements a distributed sorting algorithm

The user has to write "map" and "reduce" functions, nothing else

The framework does automatic parallelization and fault tolerance

But BAM is not Hadoop friendly:

• Binary record format

• BGZF compression on top of that
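To make the division of labour concrete, here is a minimal single-process sketch of the map, sort and reduce steps described above. It is plain Python, not Hadoop code: `run_map_reduce`, `map_read`, `reduce_counts` and the toy `reads` list are hypothetical names invented for this illustration, and the dictionaries merely stand in for aligned reads.

```python
from itertools import groupby
from operator import itemgetter


def map_read(read):
    """Map: emit one (key, value) pair per aligned read."""
    yield read["chrom"], 1


def reduce_counts(chrom, counts):
    """Reduce: combine all values that share one key."""
    return chrom, sum(counts)


def run_map_reduce(records, mapper, reducer):
    """Toy stand-in for the framework: map, sort by key, then reduce."""
    # Map phase (a real framework runs this in parallel on many nodes).
    pairs = [pair for record in records for pair in mapper(record)]
    # Shuffle phase: the distributed sort groups equal keys together.
    pairs.sort(key=itemgetter(0))
    # Reduce phase: one reducer call per distinct key.
    return [
        reducer(key, [value for _, value in group])
        for key, group in groupby(pairs, key=itemgetter(0))
    ]


# Hypothetical input: dictionaries standing in for aligned reads.
reads = [
    {"name": "r1", "chrom": "chr2", "pos": 500},
    {"name": "r2", "chrom": "chr1", "pos": 100},
    {"name": "r3", "chrom": "chr1", "pos": 250},
]

result = run_map_reduce(reads, map_read, reduce_counts)
print(result)  # [('chr1', 2), ('chr2', 1)]
```

Only `map_read` and `reduce_counts` are problem specific; everything in `run_map_reduce` is what the framework provides.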

Page 8: F07-Cloud-Hadoop-BAM

Possible solutions

Implement your own map-reduce framework

Ouch...

Convert to Hadoop-friendly text format

Storage size blows up

Network speed would become a bottleneck

Find a way to cope with BAM files in Hadoop

So we have Hadoop-BAM

Page 9: F07-Cloud-Hadoop-BAM

Hadoop-BAM

Small and simple Java library

Throw it into your Hadoop installation

BAM! Your BAM files are accessible by Hadoop map-reduce functions

Page 10: F07-Cloud-Hadoop-BAM

What does it do?

Gives you the Picard SAM API

Hadoop splits data into chunks and special care is needed at chunk boundaries

Hadoop-BAM handles chunk boundaries behind the scenes
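The usual convention for chunked input can be sketched as follows: a chunk owns every record whose first byte falls inside it, even if the record's tail spills into the next chunk. The sketch assumes a made-up length-prefixed record format, and all function names are hypothetical. `record_starts` scans from the beginning of the data purely to keep the example short; a real reader cannot do that and has to detect record starts from an arbitrary offset. Whether Hadoop-BAM follows exactly this rule is an assumption, not something the slides state.

```python
import struct


def encode(records):
    """Concatenate records, each prefixed by a 4-byte little-endian length."""
    return b"".join(struct.pack("<i", len(r)) + r for r in records)


def record_starts(data):
    """Offsets of all records (stand-in for real boundary detection)."""
    offsets, pos = [], 0
    while pos < len(data):
        offsets.append(pos)
        (length,) = struct.unpack_from("<i", data, pos)
        pos += 4 + length
    return offsets


def read_split(data, starts, split_start, split_end):
    """Return the records that begin inside [split_start, split_end)."""
    out = []
    for off in starts:
        if split_start <= off < split_end:
            (length,) = struct.unpack_from("<i", data, off)
            # The record may extend past split_end; it is still read whole.
            out.append(data[off + 4 : off + 4 + length])
    return out


records = [f"read-{i}".encode() * (i % 3 + 1) for i in range(10)]
data = encode(records)
starts = record_starts(data)

split_size = 16  # deliberately small so that records straddle chunk borders
splits = [(s, min(s + split_size, len(data))) for s in range(0, len(data), split_size)]
recovered = [r for s, e in splits for r in read_split(data, starts, s, e)]
```

With this rule every record is processed exactly once, regardless of where the chunk borders fall.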

Page 11: F07-Cloud-Hadoop-BAM

Detecting BAM record boundaries

First: BGZF blocks

Easy, blocks begin with magic numbers (32 bits)

To make checking even more robust, multiple blocks are checked and backtracked if needed
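A minimal sketch of this idea, assuming the BGZF layout from the SAM/BAM specification: an 18-byte gzip header whose first four bytes are `1f 8b 08 04` and whose "BC" extra subfield stores the block length minus one. The helper names and the chain length of three blocks are hypothetical; this is not Hadoop-BAM's actual code.

```python
import struct
import zlib

BGZF_MAGIC = b"\x1f\x8b\x08\x04"  # gzip ID1, ID2, CM=deflate, FLG=FEXTRA


def make_bgzf_block(payload, level=6):
    """Build one BGZF block around payload (used here only as test data)."""
    comp = zlib.compressobj(level, zlib.DEFLATED, -15)  # raw deflate stream
    cdata = comp.compress(payload) + comp.flush()
    bsize = 18 + len(cdata) + 8 - 1  # total block length minus one
    header = BGZF_MAGIC + struct.pack("<IBBH", 0, 0, 0xFF, 6)  # MTIME, XFL, OS, XLEN
    extra = b"BC" + struct.pack("<HH", 2, bsize)  # SI1, SI2, SLEN, BSIZE
    trailer = struct.pack("<II", zlib.crc32(payload) & 0xFFFFFFFF, len(payload))
    return header + extra + cdata + trailer


def block_size_at(data, offset):
    """Block length if a plausible BGZF header starts at offset, else None."""
    if data[offset:offset + 4] != BGZF_MAGIC or offset + 18 > len(data):
        return None
    xlen, si1, si2, slen, bsize = struct.unpack_from("<HBBHH", data, offset + 10)
    # Simplification: assume "BC" is the only extra subfield (XLEN == 6).
    if xlen != 6 or (si1, si2) != (66, 67) or slen != 2:
        return None
    return bsize + 1


def find_block_start(data, start, chain=3):
    """First offset >= start from which `chain` consecutive blocks validate."""
    offset = data.find(BGZF_MAGIC, start)
    while offset != -1:
        pos, ok = offset, True
        for _ in range(chain):
            if pos == len(data):  # reached the end of the data cleanly
                break
            size = block_size_at(data, pos)
            if size is None:
                ok = False
                break
            pos += size
        if ok:
            return offset
        # False positive: backtrack and try the next candidate.
        offset = data.find(BGZF_MAGIC, offset + 1)
    return None


# The first block is stored uncompressed (level 0), so the magic bytes in
# its payload appear verbatim in the data and act as a decoy.
decoy_payload = b"\x00" * 20 + BGZF_MAGIC + b"x" * 40
blocks = [make_bgzf_block(decoy_payload, level=0),
          make_bgzf_block(b"second block"),
          make_bgzf_block(b"third block")]
data = b"".join(blocks)

found = find_block_start(data, 1)  # start searching inside the first block
```

Here `found` equals `len(blocks[0])`: the decoy matches the magic number but fails the header check, so the search moves on to the next real block.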

Second: BAM records

Harder, no identifiers

But various fields cross-reference each other

We can detect records with very good accuracy
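The slides do not say which cross-references are checked. The sketch below assumes the fixed-length field layout from the SAM/BAM specification and tests a few consistency conditions that any valid record must satisfy. `looks_like_record` and `make_record` are hypothetical helpers, and the real heuristics in Hadoop-BAM may differ.

```python
import struct

# block_size, refID, pos, l_read_name, mapq, bin, n_cigar_op, flag,
# l_seq, next_refID, next_pos, tlen: 36 bytes in total.
FIXED = struct.Struct("<iiiBBHHHiiii")


def make_record(name, ref_id, pos, seq_len):
    """Build a minimal BAM alignment record (test data only, bin left at 0)."""
    read_name = name + b"\x00"
    cigar = struct.pack("<I", seq_len << 4)  # one op: seq_len x M
    seq = bytes((seq_len + 1) // 2)          # 4-bit packed bases
    qual = b"\xff" * seq_len                 # qualities missing
    body = read_name + cigar + seq + qual
    fixed = FIXED.pack(32 + len(body), ref_id, pos, len(read_name),
                       30, 0, 1, 0, seq_len, -1, -1, 0)
    return fixed + body


def looks_like_record(data, offset, n_ref):
    """Heuristic: do the fields at offset agree with each other?"""
    if offset + FIXED.size > len(data):
        return False
    (block_size, ref_id, pos, l_name, _mapq, _bin, n_cigar, _flag,
     l_seq, next_ref, next_pos, _tlen) = FIXED.unpack_from(data, offset)
    # Reference ids must point into the header's reference list (or be -1).
    if not (-1 <= ref_id < n_ref and -1 <= next_ref < n_ref):
        return False
    if pos < -1 or next_pos < -1 or l_name < 1 or l_seq < 0:
        return False
    # block_size must be large enough for the variable-length fields.
    min_size = 32 + l_name + 4 * n_cigar + (l_seq + 1) // 2 + l_seq
    if block_size < min_size:
        return False
    # The read name must be NUL-terminated exactly where l_read_name says.
    name_end = offset + FIXED.size + l_name - 1
    return name_end < len(data) and data[name_end] == 0


r1 = make_record(b"read1", 0, 100, 10)
r2 = make_record(b"read2", 1, 200, 7)
data = r1 + r2

print(looks_like_record(data, 0, n_ref=2))        # True
print(looks_like_record(data, len(r1), n_ref=2))  # True
print(looks_like_record(data, 3, n_ref=2))        # False
```

A single check like this can still accept a wrong offset by chance, which is presumably why several consecutive records would be validated in practice, as with the BGZF blocks.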

Page 12: F07-Cloud-Hadoop-BAM

Example: Preprocessing for Chipster genome browser

How to allow interactive browsing with zooming in and out, for large BAM files?

Can use sampling, but it is either slow or inaccurate

Preprocess data and produce summaries at different levels (mipmapping)

Implemented on top of Hadoop-BAM
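The mipmapping idea can be sketched in a few lines: compute a summary at the finest resolution once, then derive each coarser level by merging neighbouring bins. The sketch assumes the summary is a read count per fixed-size bin; the slides do not specify what Chipster actually stores, and `base_level` and `build_levels` are hypothetical names.

```python
def base_level(read_starts, bin_size, genome_len):
    """Count reads per bin at the finest resolution."""
    n_bins = -(-genome_len // bin_size)  # ceiling division
    counts = [0] * n_bins
    for pos in read_starts:
        counts[pos // bin_size] += 1
    return counts


def build_levels(counts):
    """Halve the resolution repeatedly by summing pairs of adjacent bins."""
    levels = [counts]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([sum(prev[i:i + 2]) for i in range(0, len(prev), 2)])
    return levels


read_starts = [5, 12, 13, 47, 80, 81, 99]  # toy alignment start positions
levels = build_levels(base_level(read_starts, bin_size=10, genome_len=100))

for level in levels:
    print(level)
# [1, 2, 0, 0, 1, 0, 0, 0, 2, 1]
# [3, 0, 1, 0, 3]
# [3, 1, 3]
# [4, 3]
# [7]
```

A browser can then pick the level whose bin width matches the current zoom, so zoomed-out views never need to touch the raw reads.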

Page 13: F07-Cloud-Hadoop-BAM

Result looks nice

Page 14: F07-Cloud-Hadoop-BAM

Benchmarking

Take 50 GB of data from 1000 Genomes

Run on a cluster of 112 AMD Opteron 2.6 GHz nodes (1344 cores) with Infiniband interconnect

Page 15: F07-Cloud-Hadoop-BAM

Scalability results

Page 16: F07-Cloud-Hadoop-BAM

Scalability results (cont.)

Did sorting and summarizing

Fairly nice scaling for the processing step

No scaling for import and export

Lesson: avoid moving data in and out of Hadoop

So having to convert data from BAM to something else would be bad

Page 17: F07-Cloud-Hadoop-BAM

Future plans

Develop or port basic BAM tools to use Hadoop-BAM

Tools that work on BAM and BED files

Building on top of Hadoop-BAM

Pig query engine

Variant detection pipelines

Some ideas about doing join operations

It's really hard...

Page 18: F07-Cloud-Hadoop-BAM

Conclusions

Cloud computing is not a free lunch: tools, algorithms and data formats need to be adapted

Hadoop-BAM library is available under the MIT license:

http://sourceforge.net/projects/hadoop-bam/

Contact: [email protected]

Page 19: F07-Cloud-Hadoop-BAM

Acknowledgements

Matti Niemenmaa, André Schumacher, Keijo Heljanko (Aalto University, Department of Information and Computer Science)

Petri Klemelä, Eija Korpelainen (CSC - IT Center for Science)

TIVIT Cloud Software program for funding