talk given at "cloud computing for systems biology" workshop

Post on 06-May-2015

Category:

Technology

The role of cloud computing in big biology

Deepak Singh

life science industry

Credit: Bosco Ho

By ~Prescott under a CC-BY-NC license

context

analysis methods

technology

technology

?

??

?

back of the room

technology

inherent characteristics

data driven

multi-dimensional

collaborative

distributed

<amazon web services>

the cloud

has_many :definitions

infrastructure as a service

precursors

virtualization

service oriented architecture

distributed computing

Your Custom Applications and Services

Compute: Amazon Elastic Compute Cloud (EC2), with Elastic Load Balancing and Auto Scaling

Storage: Amazon Simple Storage Service (S3), with AWS Import/Export

Database: Amazon RDS and SimpleDB

Content Delivery: Amazon CloudFront

Messaging: Amazon Simple Queue Service (SQS)

Payments: Amazon Flexible Payments Service (FPS)

On-Demand Workforce: Amazon Mechanical Turk

Parallel Processing: Amazon Elastic MapReduce

Monitoring: Amazon CloudWatch

Management: AWS Management Console

Tools: AWS Toolkit for Eclipse

Isolated Networks: Amazon Virtual Private Cloud

scalable

cost effective

pay as you go

reliable

secure

Amazon EC2

servers on demand

highly scalable

3000 CPUs for one firm's risk management application

[Chart: Number of EC2 Instances by day, Wednesday 4/22/2009 through Tuesday 4/28/2009; levels marked at 3000 and 300, annotated "300 CPUs on weekends"]
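The risk management example scales between about 3000 instances and 300 on weekends. A rough sketch of what that means in instance-hours, assuming (this is not stated in the slides) that the fleet holds each size around the clock, with five weekdays at the high level and two weekend days at the low level. The function name is hypothetical and no prices are implied.

```python
HOURS_PER_DAY = 24

def weekly_instance_hours(weekday_size, weekend_size, weekdays=5, weekend_days=2):
    """Instance-hours used in one week by a fleet that resizes on weekends."""
    return HOURS_PER_DAY * (weekdays * weekday_size + weekend_days * weekend_size)

# Assumed pattern: 3000 instances on weekdays, 300 on weekends
elastic = weekly_instance_hours(3000, 300)
# Comparison case: provision for the peak all week
fixed = weekly_instance_hours(3000, 3000)
# Fraction of instance-hours avoided by scaling down
saving = 1 - elastic / fixed

print(elastic, fixed, round(saving, 3))
```

Under these assumptions, scaling down on weekends avoids roughly a quarter of the weekly instance-hours compared with running the peak fleet all week.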

design for failure

“Everything fails, all the time” -- Werner Vogels

assume failure

design backwards

nothing fails
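One small instance of "assume failure" is to wrap every remote call in a retry with exponential backoff. The sketch below is illustrative only: `call_with_retries` and `FlakyService` are hypothetical names, not part of any AWS library, and the slides do not prescribe this particular technique.

```python
import random
import time

def call_with_retries(operation, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Run operation(), retrying with exponential backoff and jitter on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Delay doubles each attempt and is scaled by a random factor
            # between 0.5 and 1.0 so that clients do not retry in lockstep
            delay = base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.0)
            sleep(delay)

class FlakyService:
    """Hypothetical service that fails a fixed number of times, then succeeds."""
    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def fetch(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("simulated failure")
        return "ok"

service = FlakyService(failures=2)
# Pass a no-op sleep so the example runs instantly
result = call_with_retries(service.fetch, sleep=lambda seconds: None)
print(result, service.calls)
```

The caller sees a single successful call even though the first two attempts failed, which is the "design backwards" idea in miniature.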

highly available systems

elastic block store

elastic IP

SQS

US East Region

Availability Zone A

Availability Zone B

Availability Zone C

Availability Zone D

data storage

one size does not fit all

Amazon S3

distributed object store

durable

available


scalable

fast

simple

structured data anyone?

Amazon SimpleDB

zero administration

highly available

schema less

key-value store

Amazon Relational Database Service (RDS)

single API call

MySQL database

automatic backup

scale up with API call

master-slave replication (futures)

data center failover (futures)

what do people do?

solve problems

> 1PB of data in S3

provide platforms & services

http://heroku.com

Platform as a Service

http://cyclecomputing.com

Computation as a Service

http://wiki.github.com/documentcloud/cloud-crowd

Computational Platforms

sudo gem install cloud-crowd

Image: Matt Wood

they do science

3.7 million classifications in just over three days

~15 million in less than a month

>2.6 million clicks in 100 hours

Image via image editor under a CC-BY License

Protein Docking @ Pfizer

http://bioteam.net

</amazon web services>

anecdote

collaborative project

800 GB

Image: Wikipedia Commons

weeks to get started

Image: Matt Wood

Image: Chris Dagdigian

gigabytes

terabytes

petabytes

really fast

constant flux

Image: Chris Dagdigian

data management is not data storage

masterclass

Big data & Biology: The implications of petascale science

Tuesday November 17, 1:30PM - 3:00PM, Room: PB253-254-257-258

“science data platform”

deliver data to applications

deliver data to people

typical informatics workflow

Via Argonne National Labs under a CC-BY-SA license

killer app

Data

Apps

Data Platform

App Platform

data services

Data Platform

application services

App Platform

Scalable Data Platform

Services

APIs

Getters Filters Savers

WORK
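One possible reading of "Getters, Filters, Savers" is a set of small composable stages behind the platform's APIs. The sketch below is an assumption about that shape, not the speaker's design; every name in it (`run_pipeline`, `get_reads`, `long_enough`, `high_quality`, `save_to_list`) is hypothetical, and the in-memory records stand in for data that a real getter would fetch from a data service.

```python
def run_pipeline(getter, filters, saver):
    """Pull records from a getter, pass them through each filter, hand them to a saver."""
    records = getter()
    for keep in filters:
        # filter() binds `keep` immediately, so each stage uses its own predicate
        records = filter(keep, records)
    return saver(records)

def get_reads():
    # Hypothetical getter: a real one might read from a storage service or database
    return iter([
        {"id": "r1", "seq": "ACGTACGT", "quality": 38},
        {"id": "r2", "seq": "ACG", "quality": 40},
        {"id": "r3", "seq": "TTGACCAA", "quality": 12},
    ])

def long_enough(read):
    return len(read["seq"]) >= 8

def high_quality(read):
    return read["quality"] >= 30

saved = []

def save_to_list(records):
    # Hypothetical saver: a real one might write back to a data service
    saved.extend(records)
    return len(saved)

count = run_pipeline(get_reads, [long_enough, high_quality], save_to_list)
print(count, [read["id"] for read in saved])
```

Because each stage only agrees on the record format, a getter, filter, or saver can be swapped without touching the others, which is one way to keep the platform loosely coupled.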

must accommodate change

must scale

highly available

loosely coupled

dynamic

task-based resources

one project, one set of resources

no waiting

Protein Docking @ Pfizer

http://bioteam.net

distributed mindset

one approach

disk read/writes: slow & expensive

data processing: fast & cheap

distribute data, parallelize reads

map/reduce

distributed data processing at scale
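The map/reduce pattern can be sketched in a few lines. This is a single-process illustration, not Hadoop: in a real cluster the map calls run in parallel on the nodes that hold the data, and the framework performs the grouping step. The k-mer counting task and all function names are chosen for illustration.

```python
from collections import defaultdict

def map_phase(read, k=3):
    """Emit (k-mer, 1) for every k-mer in one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1

def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)

def map_reduce(reads, k=3):
    # Each read is mapped independently, which is what makes the work parallelizable
    pairs = (pair for read in reads for pair in map_phase(read, k))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = map_reduce(["ACGTAC", "GTACGT"])
print(counts)
```

Since no map call depends on another read, the input can be split across as many machines as there are chunks of data, and only the grouped keys need to move between them.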

abstracting away hadoop

apache hive

http://hadoop.apache.org/hive/

apache pig

http://hadoop.apache.org/pig/

cascading

http://www.cascading.org/

hosted hadoop service

hadoop easy & simple

[Diagram: Amazon Elastic MapReduce workflow. Input data is placed in an input S3 bucket; the application is deployed through the web console or command line tools; Hadoop runs on Amazon EC2 instances; output results go to an output S3 bucket; the user is notified at the end and gets the results.]

Elastic MapReduce

developers: develop & distribute

scientists/analysts: consume

CloudBurst

Catalog k-mers -> Collect seeds -> End-to-end alignment

Mike Schatz, University of Maryland
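The three phases named on the CloudBurst slide can be illustrated with a simplified, single-process sketch. This is not CloudBurst's code: it runs on one machine rather than on Hadoop, it only counts mismatches (no gaps), and the function names and toy sequences are hypothetical.

```python
from collections import defaultdict

def catalog_kmers(reference, reads, k):
    """Phase 1: record where every k-mer occurs in the reference and in each read."""
    ref_index = defaultdict(list)
    read_index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        ref_index[reference[pos:pos + k]].append(pos)
    for read_id, read in reads.items():
        for offset in range(len(read) - k + 1):
            read_index[read[offset:offset + k]].append((read_id, offset))
    return ref_index, read_index

def collect_seeds(ref_index, read_index):
    """Phase 2: a seed is a k-mer shared by the reference and a read.
    Each seed implies one candidate start position for the whole read."""
    candidates = set()
    for kmer, read_hits in read_index.items():
        for ref_pos in ref_index.get(kmer, []):
            for read_id, offset in read_hits:
                start = ref_pos - offset
                if start >= 0:
                    candidates.add((read_id, start))
    return candidates

def align_end_to_end(reference, reads, candidates, max_mismatches):
    """Phase 3: compare the full read against the reference at each candidate start."""
    hits = []
    for read_id, start in sorted(candidates):
        read = reads[read_id]
        window = reference[start:start + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((read_id, start, mismatches))
    return hits

# Toy data, invented for this example
reference = "ACGTTGCAACGTAGCT"
reads = {"r1": "TGCAACGA", "r2": "GGGGGGGG"}

ref_index, read_index = catalog_kmers(reference, reads, k=4)
candidates = collect_seeds(ref_index, read_index)
hits = align_end_to_end(reference, reads, candidates, max_mismatches=1)
print(hits)
```

Grouping by k-mer in the first two phases is a natural fit for the map and shuffle steps, which is why this kind of read mapping distributes well.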

Scalable Data Platform

Services

APIs

Getters Filters Savers

WORK

IN CONCLUSION

large scale biology

complex multidimensional data

whole lot of data

distributed collaborations

new computing and data architectures

a solution: cloud services

distributed

scalable

economical

here today

deesingh@amazon.com

Twitter: @mndoci

Presentation ideas from @mza, James Hamilton, and @lessig

Thank you!