kallio bosc2010 chipster-cloud

Aleksi Kallio, CSC – IT Center for Science, Finland

Connecting Chipster genome browser to the cloud

Architecture of Chipster platform

Loosely coupled, independent components
Message oriented communications
Flexible, scalable, robust
In other words, very cloud like

Clients

Authentication service

Management service

Computing services

Brokers

Message broker

File broker

Chipster in the cloud

1) Deploying compute nodes in the cloud
  • Easy, because the architecture is already loosely coupled and based on message passing
2) Running large parallel jobs in the cloud
  • Architecture allows this easily
  • Cloud compatible tools can be integrated quickly
3) Using cloud as a back end for interactive visualisations
  • Maybe not so obvious
  • So let's dig into this further...

Background: Chipster Genome Browser

Interactive Swing-based GUI
Shows reads and analysis results in genomic context
Interactive zooming from chromosome down to nucleotide level
Ensembl annotations for genes and transcripts
Integrated with the rest of Chipster
Parallel, distributed to some extent

Basic idea

Preprocess data with Hadoop / MapReduce
Generate powers-of-two summaries for the data, like in Google Earth
  • Doubles the data size
  • Current genome browser samples data to produce summaries
  • Now summaries can be read directly
    – Accurate results, significantly fewer disk seeks
Distribute data to scale into massive datasets
  • Use messaging to query independent data providers
  • Aggregate results as/if they appear to the visualiser
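
The powers-of-two summaries can be illustrated with a minimal sketch. It is not the Chipster or Hadoop implementation: the function names, the use of per-position read counts as base data, and summing as the summary operation are all assumptions made for illustration. Each level halves the previous one, so the summary levels together hold fewer entries than the base data, which is why the total data size roughly doubles.

```python
from typing import List, Tuple


def build_pyramid(base: List[int]) -> List[List[int]]:
    """Build summary levels; levels[k][i] is the sum of 2**k base positions.

    Sums are used (an assumption) because they combine exactly from one
    level to the next, unlike sampled values.
    """
    levels = [list(base)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        # Merge neighbouring windows pairwise; a trailing odd window is kept.
        levels.append([sum(prev[i:i + 2]) for i in range(0, len(prev), 2)])
    return levels


def query(levels: List[List[int]], start: int, end: int,
          max_bins: int) -> Tuple[int, List[int]]:
    """Return (level, windows) for base positions [start, end).

    Picks the finest level that needs roughly max_bins windows or fewer,
    so a zoomed-out view reads a handful of precomputed values directly.
    """
    span = end - start
    level = len(levels) - 1
    for k in range(len(levels)):
        if span <= max_bins * 2 ** k:
            level = k
            break
    width = 2 ** level
    lo = start // width
    hi = (end + width - 1) // width  # round up to cover the whole region
    return level, levels[level][lo:hi]


if __name__ == "__main__":
    levels = build_pyramid([1, 2, 3, 4, 5, 6, 7, 8])
    extra = sum(len(level) for level in levels[1:])
    print("summary entries:", extra, "base entries:", len(levels[0]))
    print(query(levels, 0, 8, 2))
```

In a MapReduce setting, each level could plausibly be produced by keying windows on `position // 2**k` and reducing with a sum, but that mapping is a suggestion here, not something the slides specify.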

Work in progress...

Genome browser up and running

Hadoop based data processing at very early stages

Currently trying to get it to scale well

What's the point?

Besides items (e.g., reads), visualiser can receive “superitems” (e.g., summaries of reads)
  • Summarise coverage, quality, SNPs etc. of the original reads
All kinds of advanced information can be generated in the preprocessing step
  – Such as features that combine a large number of genomes
  – Generators should be pluggable
We spend resources on the server side to improve user experience on the client side
  • On the server side, CPU, memory and disk space are required
  • But only for a short time (like in large batch jobs)
  • Cheap commodity servers can be used
  • And the experiment has already been expensive
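
One way to picture superitems and pluggable generators is the sketch below. The `Read` and `SuperItem` classes, their fields, and the three generator functions are hypothetical names chosen for illustration; the slides do not define a data model. The point is only that each generator is an independent function, so new kinds of summary information can be added without touching the rest of the preprocessing step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Read:
    start: int      # 0-based start position (hypothetical field)
    length: int     # read length in bases
    quality: float  # mean base quality of the read (hypothetical field)


@dataclass
class SuperItem:
    start: int
    end: int
    values: Dict[str, float]  # one entry per registered generator


# A generator turns the reads overlapping a window into one number.
Generator = Callable[[List[Read], int, int], float]


def read_count(reads: List[Read], start: int, end: int) -> float:
    return float(len(reads))


def mean_quality(reads: List[Read], start: int, end: int) -> float:
    return sum(r.quality for r in reads) / len(reads) if reads else 0.0


def mean_coverage(reads: List[Read], start: int, end: int) -> float:
    # Count covered bases inside the window, then average over its width.
    covered = sum(
        max(0, min(r.start + r.length, end) - max(r.start, start))
        for r in reads
    )
    return covered / (end - start)


def summarise(reads: List[Read], start: int, end: int,
              generators: Dict[str, Generator]) -> SuperItem:
    """Build one superitem for [start, end) using every plugged-in generator."""
    window = [r for r in reads if r.start < end and r.start + r.length > start]
    values = {name: gen(window, start, end) for name, gen in generators.items()}
    return SuperItem(start, end, values)


if __name__ == "__main__":
    reads = [Read(0, 10, 30.0), Read(5, 10, 20.0), Read(40, 10, 10.0)]
    plugins = {"count": read_count, "quality": mean_quality,
               "coverage": mean_coverage}
    print(summarise(reads, 0, 20, plugins))
```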

Summary

Use cheap server resources to enable better user experience

Goal: to make data analysis quicker (and more fun)
Tackle server side unreliability on the client side
Future development
  – If this works out, it could be used in other Chipster visualisers also
  – Integrating HBase queries to interactive visualisations
  – Optimising data summarising for visual truthfulness
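
Tackling server side unreliability on the client side, combined with aggregating results as they appear, might look roughly like the sketch below. A thread pool stands in for the message broker, and `query_providers`, the provider callables, and the `on_result` callback are hypothetical names; none of this is Chipster code. The client draws whatever arrives in time and simply notes which data providers failed or did not answer.

```python
import concurrent.futures as cf
from typing import Callable, Dict, List, Tuple

Region = Tuple[int, int]                 # (start, end) in base positions
Provider = Callable[[Region], List[int]]


def query_providers(providers: Dict[str, Provider], region: Region,
                    timeout_s: float,
                    on_result: Callable[[str, List[int]], None]) -> List[str]:
    """Query all providers in parallel and hand over results as they arrive.

    Returns the names of providers that failed or missed the timeout, so
    the visualiser can show a partial view instead of blocking.
    """
    pool = cf.ThreadPoolExecutor(max_workers=max(1, len(providers)))
    futures = {pool.submit(p, region): name for name, p in providers.items()}
    missing = set(providers)
    try:
        for future in cf.as_completed(futures, timeout=timeout_s):
            name = futures[future]
            try:
                data = future.result()
            except Exception:
                continue  # provider failed; keep going with the others
            on_result(name, data)  # e.g. repaint the affected part of the view
            missing.discard(name)
    except cf.TimeoutError:
        pass  # slow providers are reported as missing
    finally:
        pool.shutdown(wait=False)  # do not block the caller on stragglers
    return sorted(missing)


if __name__ == "__main__":
    def healthy(region: Region) -> List[int]:
        return [1, 2, 3]

    def broken(region: Region) -> List[int]:
        raise RuntimeError("node down")

    missing = query_providers({"a": healthy, "b": broken}, (0, 100), 5.0,
                              lambda name, data: print(name, data))
    print("missing:", missing)
```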

For more info: [email protected],
