Cloud Computing for Research
Roger S. Barga
Dennis Gannon, Jared Jackson, Wei Lu, Jaliya Ekanayake, Mohamed Fathalla
Extreme Computing Group, Microsoft Research (MSR)
Three areas of focus for our team
• Highly scalable research services
  • Target high-value research applications where compute currently impedes progress
  • Release as open source to the research community
• Lower barriers between clouds and research – the impedance mismatch
  • Do I really need to rewrite/port my application?
  • Do I really need to know that I am even using a cloud? (client + cloud)
• Services for data scientists to explore extremely large data sets
  • Data analytics as a service
  • Raise the level of abstraction for deploying and using analytics
• Provide technical support to NSF Computing in the Cloud PIs (groups), part of an international program (US, Europe, Asia, …)
Computational Resources in Research – Lack of Broad Access
[Chart: of roughly 70M scientists and engineers worldwide, only ~1M have high-performance capacity and ~14M have data-intensive capacity (an 80%/20% split); some 55M have little to no access to high-performance, data-intensive computing.]
Data Explosion in Bioinformatics & Life Sciences
• Reference databases spanning multiple disciplines
• Sequencing technology driving exponential growth

The Response: Metagenomics
• Biological engineering
• Genomics
• Environmental engineering
• Oceanography, climate research

The Challenge: Enable Discovery
• NCBI Trace Library
• NCBI BLAST
BLAST (Basic Local Alignment Search Tool)
• One of the most important software tools in bioinformatics
• Identifies similarity between biological sequences
• Computationally intensive
  • Large number of pairwise alignment operations
  • A typical large BLAST run can take 700–1,000 CPU hours
• For most biologists, two choices for running large jobs:
  • Build a local cluster
  • Submit jobs to NCBI or EBI (long job queue times)
NCBI BLAST on Windows Azure
• Parallel BLAST engine on Azure
  • Query-segmentation, data-parallel pattern:
    • Split the input sequences
    • Query the partitions in parallel
    • Merge the results together when done
• Follows the generally suggested application model for Windows Azure
  • Web Role + Queue + Worker
• With three special considerations:
  • Batch job management
  • Task parallelism on an elastic cloud
  • Large data-set management
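The Web Role + Queue + Worker pattern above can be sketched in a few lines. This is an illustrative stand-in only: it uses Python's standard-library queue and threads in place of a real Azure storage queue and worker-role instances, and the "BLAST" step is a placeholder string, not the real tool.

```python
# Sketch of the Web Role + Queue + Worker pattern: a front end enqueues
# one task per query partition, and worker instances drain the queue in
# parallel. Stdlib queue/threads stand in for Azure storage queues and
# worker roles; all names here are illustrative.
import queue
import threading

task_queue = queue.Queue()
results = []
results_lock = threading.Lock()

def web_role_submit(partitions):
    """The 'web role': enqueue one task per query partition."""
    for p in partitions:
        task_queue.put(p)

def worker_role():
    """A 'worker role': pull tasks until the queue is drained."""
    while True:
        try:
            part = task_queue.get_nowait()
        except queue.Empty:
            return
        result = f"blast-result({part})"  # stand-in for running BLAST
        with results_lock:
            results.append(result)
        task_queue.task_done()

web_role_submit(["part-0", "part-1", "part-2", "part-3"])
workers = [threading.Thread(target=worker_role) for _ in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(len(results))  # 4: one result per partition
```

Because workers pull from a shared queue, adding instances speeds up the drain without any task reassignment logic, which is what makes the pattern a good fit for an elastic cloud.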
AzureBLAST Task Flow
• A simple split/join pattern: one splitting task fans out into many parallel BLAST tasks, and a merging task joins the results.
• Leverage the multiple cores of one instance via the "-num_threads" argument of NCBI BLAST.
• Task granularity:
  • Too large a partition → load imbalance
  • Too small a partition → unnecessary overheads (NCBI BLAST startup overhead, data-transfer overhead)
• Best practice: use test runs to profile the workload and determine the optimal partition size.
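The split/join pattern can be made concrete with a small sketch: split a FASTA query into fixed-size partitions, process each partition independently, and merge the per-partition outputs in order. The FASTA parsing and the "hit" strings are deliberately simplified for illustration; the real AzureBLAST stages partitions in blob storage and runs NCBI BLAST on each one.

```python
# Minimal split/join sketch for query segmentation: split a FASTA query
# into fixed-size partitions, run a stand-in "BLAST" over each, and
# merge the outputs. Parsing is simplified for illustration.

def split_fasta(fasta_text, seqs_per_partition):
    """Split task: group sequences into partitions of fixed size."""
    records = [">" + r for r in fasta_text.split(">") if r.strip()]
    return [records[i:i + seqs_per_partition]
            for i in range(0, len(records), seqs_per_partition)]

def blast_partition(partition):
    """Stand-in for one parallel BLAST task over a partition."""
    return [f"hit for {rec.splitlines()[0]}" for rec in partition]

def merge_results(per_partition_results):
    """Join task: concatenate results in partition order."""
    return [hit for part in per_partition_results for hit in part]

fasta = ">seq1\nMKT\n>seq2\nGAV\n>seq3\nLIP\n"
partitions = split_fasta(fasta, seqs_per_partition=2)
merged = merge_results([blast_partition(p) for p in partitions])
print(len(partitions), len(merged))  # 2 partitions, 3 merged hits
```

The `seqs_per_partition` knob is exactly the task-granularity trade-off the slide describes: larger partitions amortize per-task overhead, smaller ones balance load better.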
Micro-Benchmarks Inform Design
• Task size vs. performance
  • Benefit from a warm-cache effect
  • 100 sequences per partition proved the best choice
• Instance size vs. performance
  • Super-linear speedup with larger worker instances
  • Primarily due to greater memory capacity
• Task size / instance size vs. cost
  • Extra-large instances gave the best and most economical throughput
  • Fully utilize the resources
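The "test runs to determine optimal size" advice amounts to a small profiling loop: time the workload at several candidate partition sizes and keep the cheapest. The sketch below uses a synthetic workload with a fixed per-task startup cost (the numbers are invented for illustration, not measurements from AzureBLAST).

```python
# Minimal profiling harness: time a stand-in workload at several
# candidate partition sizes and pick the fastest. The workload is a
# placeholder with a fixed per-task "startup" cost, not real BLAST.
import time

def run_partition(num_sequences):
    # Placeholder: fixed per-task overhead plus per-sequence work.
    total = 0
    for _ in range(1000 * num_sequences + 50_000):  # 50k ~ startup cost
        total += 1
    return total

def profile(sizes, total_sequences=400):
    timings = {}
    for size in sizes:
        n_tasks = total_sequences // size
        start = time.perf_counter()
        for _ in range(n_tasks):
            run_partition(size)
        timings[size] = time.perf_counter() - start
    return timings

timings = profile([10, 50, 100, 200])
best = min(timings, key=timings.get)
print(best, {s: round(t, 3) for s, t in timings.items()})
```

With a fixed per-task overhead, larger partitions win in this toy model; in practice the curve turns back up once partitions are large enough to cause load imbalance, which is why measuring beats guessing.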
R. palustris as a Platform for H2 Production
• Identify the key drivers for producing hydrogen, a promising alternative fuel – understand R. palustris well enough to improve its H2 production
• Characterize a population of strains and use integrative-genomics approaches to dissect the molecular networks of H2 production
• BLAST used to query 16 strains to sort out genetic relationships
  • Each strain has an estimated ~5,000 proteins
  • Jobs were kicked off NCBI clusters before completion
  • Against the NCBI non-redundant protein database: ~30 minutes
  • Against ~5,000 proteins from another strain: < 30 seconds
  • Publishable result in one day for roughly $150
Eric Schadt (Pacific Biosciences) and Sam Phattarasukol (Harwood Lab, UW)
All-Against-All Experiment: Discovering Homologs
• BLAST against Uniref100, a non-redundant protein sequence database, to discover the interrelationships of known protein sequences
• "All against all" query: the database is also the input query
  • The protein database is large (4.2 GB)
  • A total of 9,865,668 sequences to be queried
  • Theoretically, 100 billion sequence comparisons!
• Performance estimate: ~3,216,731 minutes (6.1 years) on a single 8-core VM
• One of the largest BLAST jobs we are aware of; experiments at this scale are usually infeasible for most researchers

Our Approach
• Allocated a total of ~4,000 cores: 475 extra-large VMs (8 cores per VM)
• 8 deployments of AzureBLAST, each with its own co-located storage service
• Divided the 10 million sequences into multiple segments
  • Each segment was submitted to one deployment as one job for execution
  • ~300,000 tasks on ~4,000 cores on Azure (70,000 bp, or ~35 sequences, per task)
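The task counts above follow from simple arithmetic; the check below reproduces them (a sanity check on the slide's rounding, not data from the original run).

```python
# Back-of-the-envelope arithmetic behind the experiment's task count and
# single-VM runtime estimate.
import math

total_sequences = 9_865_668
seqs_per_task = 35
tasks = math.ceil(total_sequences / seqs_per_task)
print(tasks)            # 281,877 exact tasks; the slide rounds to ~300,000

single_vm_minutes = 3_216_731
years = single_vm_minutes / (60 * 24 * 365)
print(round(years, 1))  # ~6.1 years on one 8-core VM, as estimated
```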
Cloud System Upgrades
• North Europe data center: 34,256 tasks processed in total
• All 62 nodes lost their tasks and then came back together (~30 minutes, ~6 nodes per group) – this is an update domain at work

Failures Happen
• West Europe data center: 30,976 tasks completed before the job was killed
• 35 nodes experienced a blob-writing failure at the same time
• Reasonable guess: a fault domain at work
Impedance mismatch – Azure is designed to manage long-running services in a highly available, cost-effective manner. Researchers operate quite differently…
• Business: "develop, deploy, and forget"
• Researcher: "constantly changing codebase, tasks, dependencies"

Anthill – making Azure easier to use for researchers…

> AHill myCalc.exe            myCalc will run on Azure
> AHill myCalc.exe d1 d2 d3…  parameter sweep on Azure
> AHill myCalc1 …             concurrent execution using a VM pool
> AHill myCalc2 …
> AHill myCalc3 …
Completed:
• Support for application parametric sweeps (various patterns)
• Support for complex data types (any ISerializable type)
• Support for scheduler fault tolerance (no single point of failure)

Ongoing:
• Complex schedules (workflows), in progress
• Preparing for an open release
Lessons Learned
• A master scheduling work into a pool of slaves is highly efficient
• Lightweight workflow suffices to coordinate task flow
• Fault tolerance and data movement between tasks matter (don't always write results to long-term storage; wait to see whether future tasks reuse them)
Excel DataScope
Offer data analytics as a service on Windows Azure, enabling users to upload data, extract patterns from it, identify hidden associations, discover similarities, forecast time series…

The project includes an extensible collection of data-analytics and machine-learning algorithms, plus a runtime service on Azure that scales out the execution of these algorithms. Analysts can submit, sample, and analyze data from Excel through a customizable data-analytics ribbon.

So what are we building?
• A common framework for implementing analytics and machine-learning algorithms that can efficiently scale out to handle jobs of varying size
  • A highly efficient MapReduce framework, from batch to streaming/iteration
  • In-memory processing algorithms whenever possible
  • Minimized I/O overhead
  • Incremental processing
• Efficient scheduling of MapReduce tasks across a shared pool of Azure VMs
• Data services, from partitioning data across VMs to shared read-only working sets
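The MapReduce framework mentioned above can be illustrated with a toy in-memory version. This sketch shows only the programming model (mapper emits key/value pairs, values are grouped by key, reducer aggregates); the real service schedules these phases as tasks across Azure VMs.

```python
# Toy in-memory MapReduce illustrating the programming model, using
# word count as the canonical example. No distribution or scheduling
# here -- just the map / group-by-key / reduce structure.
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every record and group values by key."""
    intermediate = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            intermediate[key].append(value)
    return intermediate

def reduce_phase(intermediate, reducer):
    """Apply the reducer to each key's grouped values."""
    return {key: reducer(key, values) for key, values in intermediate.items()}

def word_mapper(line):
    for word in line.split():
        yield word, 1

def count_reducer(key, values):
    return sum(values)

lines = ["clouds scale out", "clouds scale up"]
counts = reduce_phase(map_phase(lines, word_mapper), count_reducer)
print(counts)  # {'clouds': 2, 'scale': 2, 'out': 1, 'up': 1}
```

Because the mapper and reducer are pure functions over independent records and key groups, both phases parallelize naturally, which is what lets a runtime scale the same user code from a laptop sample to a pool of VMs.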
[Excel DataScope screenshots]
Observations and Experience
• Clouds are the largest-scale compute centers ever constructed and have the potential to be important to both large- and small-scale research
• There is an impedance mismatch between clouds and research workloads
• Equally important, clouds can increase participation in research, providing much-needed resources to users and communities that lack ready access
• Clouds provide valuable fault-tolerance and scalability abstractions
• Select the best-fit VM for the job (CPU / memory / network)
• Guidance, recommendations, and examples are just hints – always measure if in doubt…
Resources: AzureScope
http://azurescope.cloudapp.net
• Simple benchmarks illustrating basic performance of compute and storage services
• Benchmarks for reference algorithms
• Best-practice tips
• Code samples
Email us with questions at xcgngage@microsoft.com
© 2010 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.