Cloud Computing for Research
Roger S. Barga
Dennis Gannon, Jared Jackson, Wei Lu, Jaliya Ekanayake, Mohamed Fathalla
Extreme Computing Group, Microsoft Research (MSR)
Three areas of focus for our team
• Highly scalable research services
  • Target high-value research applications where compute currently impedes progress
  • Release as open source to the research community
• Lower barriers between clouds and research – the impedance mismatch
  • Do I really need to rewrite/port my application?
  • Do I really need to know that I am even using a cloud? (client + cloud)
• Services for data scientists to explore extremely large data sets
  • Data analytics as a service
  • Raise the level of abstraction for deploying and using analytics
• Provide technical support to NSF Computing in the Cloud PIs (groups), part of an international program (US, Europe, Asia, …)
Computational Resources in Research – Lack of Broad Access
[Chart: of roughly 70M scientists and engineers worldwide, only ~1M have high-performance capacity and ~14M have data-intensive capacity (an 80%/20% split); some 55M have little to no access to high-performance, data-intensive computing.]
Data Explosion in Bioinformatics & Life Sciences
• Reference databases spanning multiple disciplines
• Sequencing technology driving exponential growth

The Response: Metagenomics
• Biological engineering
• Genomics
• Environmental engineering
• Oceanography, climate research

The Challenge: Enable Discovery
• NCBI Trace Library
• NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• One of the most important software tools in bioinformatics
• Identifies similarity between biological sequences
• Computationally intensive
  • Large number of pairwise alignment operations
  • A typical large BLAST run can take 700–1,000 CPU hours
• For most biologists, two choices for running large jobs:
  • Build a local cluster
  • Submit jobs to NCBI or EBI (long job queue times)
NCBI BLAST on Windows Azure
• Parallel BLAST engine on Azure
  • Query-segmentation, data-parallel pattern:
    • Split the input sequences
    • Query the partitions in parallel
    • Merge the results together when done
• Follows the generally suggested application model for Windows Azure
  • Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud
  • Large data-set management
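The Web Role + Queue + Worker pattern above can be sketched in a few lines. This is an illustrative stand-in only: it uses Python's standard-library queue and threads in place of a real Azure storage queue and worker-role instances, and the "BLAST" step is a placeholder string, not the real tool.

```python
# Sketch of the Web Role + Queue + Worker pattern: a front end enqueues
# one task per query partition, and worker instances drain the queue in
# parallel. Stdlib queue/threads stand in for Azure storage queues and
# worker roles; all names here are illustrative.
import queue
import threading

task_queue = queue.Queue()
results = []
results_lock = threading.Lock()

def web_role_submit(partitions):
    """The 'web role': enqueue one task per query partition."""
    for p in partitions:
        task_queue.put(p)

def worker_role():
    """A 'worker role': pull tasks until the queue is drained."""
    while True:
        try:
            part = task_queue.get_nowait()
        except queue.Empty:
            return
        result = f"blast-result({part})"  # stand-in for running BLAST
        with results_lock:
            results.append(result)
        task_queue.task_done()

web_role_submit(["part-0", "part-1", "part-2", "part-3"])
workers = [threading.Thread(target=worker_role) for _ in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(len(results))  # 4: one result per partition
```

Because workers pull from a shared queue, adding instances speeds up the drain without any task reassignment logic, which is what makes the pattern a good fit for an elastic cloud.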
AzureBLAST Task Flow
• A simple split/join pattern: one splitting task fans out into many parallel BLAST tasks, and a merging task joins the results.
• Leverage the multiple cores of one instance via the "-num_threads" argument of NCBI BLAST.
• Task granularity:
  • Too large a partition → load imbalance
  • Too small a partition → unnecessary overheads (NCBI BLAST startup overhead, data-transfer overhead)
• Best practice: use test runs to profile the workload and determine the optimal partition size.
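The split/join pattern can be made concrete with a small sketch: split a FASTA query into fixed-size partitions, process each partition independently, and merge the per-partition outputs in order. The FASTA parsing and the "hit" strings are deliberately simplified for illustration; the real AzureBLAST stages partitions in blob storage and runs NCBI BLAST on each one.

```python
# Minimal split/join sketch for query segmentation: split a FASTA query
# into fixed-size partitions, run a stand-in "BLAST" over each, and
# merge the outputs. Parsing is simplified for illustration.

def split_fasta(fasta_text, seqs_per_partition):
    """Split task: group sequences into partitions of fixed size."""
    records = [">" + r for r in fasta_text.split(">") if r.strip()]
    return [records[i:i + seqs_per_partition]
            for i in range(0, len(records), seqs_per_partition)]

def blast_partition(partition):
    """Stand-in for one parallel BLAST task over a partition."""
    return [f"hit for {rec.splitlines()[0]}" for rec in partition]

def merge_results(per_partition_results):
    """Join task: concatenate results in partition order."""
    return [hit for part in per_partition_results for hit in part]

fasta = ">seq1\nMKT\n>seq2\nGAV\n>seq3\nLIP\n"
partitions = split_fasta(fasta, seqs_per_partition=2)
merged = merge_results([blast_partition(p) for p in partitions])
print(len(partitions), len(merged))  # 2 partitions, 3 merged hits
```

The `seqs_per_partition` knob is exactly the task-granularity trade-off the slide describes: larger partitions amortize per-task overhead, smaller ones balance load better.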
Micro-Benchmarks Inform Design
• Task size vs. performance
  • Benefit from a warm-cache effect
  • 100 sequences per partition proved the best choice
• Instance size vs. performance
  • Super-linear speedup with larger worker instances
  • Primarily due to greater memory capacity
• Task size / instance size vs. cost
  • Extra-large instances gave the best and most economical throughput
  • Fully utilize the resources
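The "test runs to determine optimal size" advice amounts to a small profiling loop: time the workload at several candidate partition sizes and keep the cheapest. The sketch below uses a synthetic workload with a fixed per-task startup cost (the numbers are invented for illustration, not measurements from AzureBLAST).

```python
# Minimal profiling harness: time a stand-in workload at several
# candidate partition sizes and pick the fastest. The workload is a
# placeholder with a fixed per-task "startup" cost, not real BLAST.
import time

def run_partition(num_sequences):
    # Placeholder: fixed per-task overhead plus per-sequence work.
    total = 0
    for _ in range(1000 * num_sequences + 50_000):  # 50k ~ startup cost
        total += 1
    return total

def profile(sizes, total_sequences=400):
    timings = {}
    for size in sizes:
        n_tasks = total_sequences // size
        start = time.perf_counter()
        for _ in range(n_tasks):
            run_partition(size)
        timings[size] = time.perf_counter() - start
    return timings

timings = profile([10, 50, 100, 200])
best = min(timings, key=timings.get)
print(best, {s: round(t, 3) for s, t in timings.items()})
```

With a fixed per-task overhead, larger partitions win in this toy model; in practice the curve turns back up once partitions are large enough to cause load imbalance, which is why measuring beats guessing.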
R. palustris as a Platform for H2 Production
• Identify the key drivers for producing hydrogen, a promising alternative fuel – understand R. palustris well enough to improve its H2 production
• Characterize a population of strains and use integrative-genomics approaches to dissect the molecular networks of H2 production
• BLAST used to query 16 strains to sort out genetic relationships
  • Each strain has an estimated ~5,000 proteins
  • Jobs were kicked off NCBI clusters before completion
  • Against the NCBI non-redundant protein database: ~30 minutes
  • Against ~5,000 proteins from another strain: < 30 seconds
  • Publishable result in one day for roughly $150
Eric Schadt (Pacific Biosciences) and Sam Phattarasukol (Harwood Lab, UW)
All-Against-All Experiment: Discovering Homologs
• BLAST against Uniref100, a non-redundant protein sequence database, to discover the interrelationships of known protein sequences
• "All against all" query: the database is also the input query
  • The protein database is large (4.2 GB)
  • A total of 9,865,668 sequences to be queried
  • Theoretically, 100 billion sequence comparisons!
• Performance estimate: ~3,216,731 minutes (6.1 years) on a single 8-core VM
• One of the largest BLAST jobs we are aware of; experiments at this scale are usually infeasible for most researchers

Our Approach
• Allocated a total of ~4,000 cores: 475 extra-large VMs (8 cores per VM)
• 8 deployments of AzureBLAST, each with its own co-located storage service
• Divided the 10 million sequences into multiple segments
  • Each segment was submitted to one deployment as one job for execution
  • ~300,000 tasks on ~4,000 cores on Azure (70,000 bp, or ~35 sequences, per task)
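The task counts above follow from simple arithmetic; the check below reproduces them (a sanity check on the slide's rounding, not data from the original run).

```python
# Back-of-the-envelope arithmetic behind the experiment's task count and
# single-VM runtime estimate.
import math

total_sequences = 9_865_668
seqs_per_task = 35
tasks = math.ceil(total_sequences / seqs_per_task)
print(tasks)            # 281,877 exact tasks; the slide rounds to ~300,000

single_vm_minutes = 3_216_731
years = single_vm_minutes / (60 * 24 * 365)
print(round(years, 1))  # ~6.1 years on one 8-core VM, as estimated
```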
Cloud System Upgrades
• North Europe data center: 34,256 tasks processed in total
• All 62 nodes lost their tasks and then came back together (~30 minutes, ~6 nodes per group) – this is an update domain at work

Failures Happen
• West Europe data center: 30,976 tasks completed before the job was killed
• 35 nodes experienced a blob-writing failure at the same time
• Reasonable guess: a fault domain at work
Impedance mismatch – Azure is designed to manage long-running services in a highly available, cost-effective manner. Researchers operate quite differently…
• Business: "develop, deploy, and forget"
• Researcher: "constantly changing codebase, tasks, dependencies"

Anthill – making Azure easier to use for researchers…

> AHill myCalc.exe            myCalc will run on Azure
> AHill myCalc.exe d1 d2 d3…  parameter sweep on Azure
> AHill myCalc1 …             concurrent execution using a VM pool
> AHill myCalc2 …
> AHill myCalc3 …
Completed:
• Support for application parametric sweeps (various patterns)
• Support for complex data types (any ISerializable type)
• Support for scheduler fault tolerance (no single point of failure)

Ongoing:
• Complex schedules (workflows), in progress
• Preparing for an open release
Lessons Learned
• A master scheduling work into a pool of slaves is highly efficient
• Lightweight workflow suffices to coordinate task flow
• Fault tolerance and data movement between tasks matter (don't always write results to long-term storage; wait to see whether future tasks reuse them)
Excel DataScope
Offer data analytics as a service on Windows Azure, enabling users to upload data, extract patterns from it, identify hidden associations, discover similarities, forecast time series…

The project includes an extensible collection of data-analytics and machine-learning algorithms, plus a runtime service on Azure that scales out the execution of these algorithms. Analysts can submit, sample, and analyze data from Excel through a customizable data-analytics ribbon.

So what are we building?
• A common framework for implementing analytics and machine-learning algorithms that can efficiently scale out to handle jobs of varying size
  • A highly efficient MapReduce framework, from batch to streaming/iteration
  • In-memory processing algorithms whenever possible
  • Minimized I/O overhead
  • Incremental processing
• Efficient scheduling of MapReduce tasks across a shared pool of Azure VMs
• Data services, from partitioning data across VMs to shared read-only working sets
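The MapReduce framework mentioned above can be illustrated with a toy in-memory version. This sketch shows only the programming model (mapper emits key/value pairs, values are grouped by key, reducer aggregates); the real service schedules these phases as tasks across Azure VMs.

```python
# Toy in-memory MapReduce illustrating the programming model, using
# word count as the canonical example. No distribution or scheduling
# here -- just the map / group-by-key / reduce structure.
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record and group values by key."""
    intermediate = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            intermediate[key].append(value)
    return intermediate

def reduce_phase(intermediate, reducer):
    """Apply the reducer to each key's grouped values."""
    return {key: reducer(key, values) for key, values in intermediate.items()}

def word_mapper(line):
    for word in line.split():
        yield word, 1

def count_reducer(key, values):
    return sum(values)

lines = ["clouds scale out", "clouds scale up"]
counts = reduce_phase(map_phase(lines, word_mapper), count_reducer)
print(counts)  # {'clouds': 2, 'scale': 2, 'out': 1, 'up': 1}
```

Because the mapper and reducer are pure functions over independent records and key groups, both phases parallelize naturally, which is what lets a runtime scale the same user code from a laptop sample to a pool of VMs.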
[Excel DataScope screenshots]
Observations and Experience
• Clouds are the largest-scale compute centers ever constructed and have the potential to be important to both large- and small-scale research
• There is an impedance mismatch between clouds and research workloads
• Equally important, clouds can increase participation in research, providing much-needed resources to users and communities that lack ready access
• Clouds provide valuable fault-tolerance and scalability abstractions
• Select the best-fit VM for the job (CPU / memory / network)
• Guidance, recommendations, and examples are just hints – always measure if in doubt…
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance of compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
© 2010 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.