masterworks talk on big data and the implications of petascale science
Big Data and Biology: The implications of petascale science
Deepak Singh
Via Reavel under a CC-BY-NC-ND license
life science industry
Credit: Bosco Ho
By ~Prescott under a CC-BY-NC license
data
Image: Wikipedia
biology
big data
Source: http://www.nature.com/news/specials/bigdata/index.html
Image: Matt Wood
Human genome
Image: Matt Wood
not just sequencing
more data
Image: Matt Wood
all hell breaks loose
~100 TB/Week
~100 TB/Week
>2 PB/Year
years
weeks
days
days
days
minutes?
gigabytes
terabytes
petabytes
exabytes?
really fast
Image: http://www.broadinstitute.org/~apleite/photos.html
single lab
Image: Chris Dagdigian
implications of scale
data management
data processing
data sharing
fundamental concepts
1. architecting for scale
“Everything fails, all the time”-- Werner Vogels
“Things will crash. Deal with it”-- Jeff Dean
“Remember everything fails”-- Randy Shoup
fun with numbers
datacenter availability
Source: Uptime Institute
Tier I: 28.8 hours annual downtime (99.67% availability)
Tier II: 22.0 hrs annual downtime (99.75% availability)
Tier III: 1.6 hrs annual downtime (99.98% availability)
Tier IV: 0.8 hrs annual downtime (99.99% availability)
Source: Uptime Institute
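Those availability percentages map directly to annual downtime via hours-per-year arithmetic. A back-of-envelope sketch (not from the slides) to check the tiers:

```python
# Annual downtime implied by an availability fraction,
# assuming a 365-day (8,760-hour) year.
def annual_downtime_hours(availability: float) -> float:
    return (1.0 - availability) * 8760

for tier, avail in [("Tier I", 0.9967), ("Tier II", 0.9975),
                    ("Tier III", 0.9998), ("Tier IV", 0.9999)]:
    print(f"{tier}: {annual_downtime_hours(avail):.1f} hours/year")
```

Tier I and II match the slide's figures; Tier III and IV come out at 1.8 and 0.9 hours here, so the slide's 1.6 and 0.8 reflect slightly different rounding.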
cooling systems go down
power units fail
2-4% of servers will die annually
Source: Jeff Dean, LADIS 2009
1-5% of disk drives will die every year
Source: Jeff Dean, LADIS 2009
2.3% AFR in population of 13,250
3.3% AFR in population of 22,400
4.2% AFR in population of 246,000
Source: James Hamilton
software breaks
human errors
human errors
~20% of admin issues have unintended consequences
Source: James Hamilton
achieving scalabilityand availability
partitioning
redundancy
recovery oriented computing
Source: http://perspectives.mvdirona.com/, http://roc.cs.berkeley.edu/
assume sw/hw failure
design apps to be resilient
automation
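Designing for failure means the application, not the hardware, supplies availability. A minimal sketch of the retry-across-replicas pattern (the zone names and functions are illustrative, not from the talk):

```python
import random

# Hypothetical replica endpoints in different availability zones.
REPLICAS = ["zone-a", "zone-b", "zone-c"]

def fetch(replica: str, key: str) -> str:
    # Stand-in for a network call; here zone-a is "down".
    if replica == "zone-a":
        raise ConnectionError(f"{replica} unavailable")
    return f"{key}@{replica}"

def resilient_fetch(key: str) -> str:
    # Assume any replica can fail; try them in random order
    # and fall through to the next on error.
    for replica in random.sample(REPLICAS, len(REPLICAS)):
        try:
            return fetch(replica, key)
        except ConnectionError:
            continue  # recovery-oriented: failure is expected
    raise RuntimeError("all replicas failed")

print(resilient_fetch("genome-17"))  # served by zone-b or zone-c
```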
[Diagram: AWS services surrounding "Your Custom Applications and Services"]
Compute: Amazon Elastic Compute Cloud (EC2), Elastic Load Balancing, Auto Scaling
Storage: Amazon Simple Storage Service (S3), AWS Import/Export
Content Delivery: Amazon CloudFront
Messaging: Amazon Simple Queue Service (SQS)
Payments: Amazon Flexible Payments Service (FPS)
On-Demand Workforce: Amazon Mechanical Turk
Parallel Processing: Amazon Elastic MapReduce
Monitoring: Amazon CloudWatch
Management: AWS Management Console
Tools: AWS Toolkit for Eclipse
Isolated Networks: Amazon Virtual Private Cloud
Database: Amazon RDS and SimpleDB
Amazon S3
durable
available
Amazon EC2
highly scalable
3,000 CPUs for one firm’s risk management application
[Chart: number of EC2 instances in use over one week, Thursday through Wednesday, climbing from the hundreds into the thousands]
highly available systems
dynamic
fault tolerant
US East Region
Availability Zone A
Availability Zone B
Availability Zone C
Availability Zone D
2. one size does not fit all
2. one size does not fit all^data
many data types
structured data
using the right data store
(a) feature first
RDBMS
Oracle, SQL Server, DB2, MySQL, Postgres
Source: http://www.bioinformaticszen.com/
use a bigger computer
remove joins
scaling limits
(b) scale first
scale is highest priority
single RDBMS incapable
solution 1: data sharding
10’s
100’s
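Sharding splits one logical table across tens or hundreds of database instances, typically by hashing a key so each record has a fixed home. A minimal sketch (the shard count and key names are illustrative):

```python
import hashlib

N_SHARDS = 16  # tens to hundreds in practice

def shard_for(key: str) -> int:
    # Stable hash so the same key always lands on the same shard.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % N_SHARDS

shards = [dict() for _ in range(N_SHARDS)]  # stand-ins for databases

def put(key: str, value):
    shards[shard_for(key)][key] = value

def get(key: str):
    return shards[shard_for(key)].get(key)

put("sample:42", {"organism": "H. sapiens"})
print(shard_for("sample:42"), get("sample:42"))
```

The cost is that cross-shard joins and transactions are no longer free, which is why sharding is a "scale first" rather than "feature first" choice.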
solution 2: scalable key-value store
scale is design point
MongoDB, Project Voldemort, Cassandra, HBase, BigTable, Amazon SimpleDB, Dynamo
(c) simple structured storage
simple, fast
low ops cost
BerkeleyDB, Tokyo Cabinet, Amazon SimpleDB
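These stores trade query power for a dictionary-style API and low operational cost. Python's standard-library `dbm` module has the same flavor of file-backed key-value storage; a minimal sketch (illustrating the model, not any of the systems named above):

```python
import dbm
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "samples.db")

with dbm.open(path, "c") as db:        # "c": create if missing
    db["sample:1"] = "H. sapiens"      # keys and values stored as bytes
    db["sample:2"] = "M. musculus"

with dbm.open(path, "r") as db:        # reopen read-only: data persisted
    print(db["sample:1"].decode())     # -> H. sapiens
```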
(d) purpose optimized stores
data warehousing
stream processing
Aster Data, Vertica, Netezza, Greenplum, VoltDB, StreamBase
what about files?
cluster file systems
Lustre, GlusterFS
distributed file systems
HDFS, GFS
distributed object store
Amazon S3, Dynomite
[Diagram: AWS services surrounding "Your Custom Applications and Services"]
Compute: Amazon Elastic Compute Cloud (EC2), Elastic Load Balancing, Auto Scaling
Storage: Amazon Simple Storage Service (S3), AWS Import/Export
Content Delivery: Amazon CloudFront
Messaging: Amazon Simple Queue Service (SQS)
Payments: Amazon Flexible Payments Service (FPS)
On-Demand Workforce: Amazon Mechanical Turk
Parallel Processing: Amazon Elastic MapReduce
Monitoring: Amazon CloudWatch
Management: AWS Management Console
Tools: AWS Toolkit for Eclipse
Isolated Networks: Amazon Virtual Private Cloud
Database: Amazon RDS and SimpleDB
3. processing big data
disk reads/writes: slow & expensive
data processing: fast & cheap
distribute the data: parallel reads
data processing for the cloud
distributed file system(HDFS)
map/reduce
Via Cloudera under a Creative Commons License
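The map/reduce model in one breath: a map function emits key-value pairs from each input record, the framework sorts and groups the pairs by key, and a reduce function folds each group into a result. A minimal in-process sketch of the word-count example that Hadoop tutorials use (no Hadoop required):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit (word, 1) for every word in the input record.
    for word in line.split():
        yield (word, 1)

def reducer(word, counts):
    # Fold all values for one key into a single result.
    return (word, sum(counts))

lines = ["big data big biology", "big data"]
pairs = [kv for line in lines for kv in mapper(line)]
pairs.sort(key=itemgetter(0))              # the "shuffle/sort" phase
results = [reducer(k, (v for _, v in g))
           for k, g in groupby(pairs, key=itemgetter(0))]
print(results)  # [('big', 3), ('biology', 1), ('data', 2)]
```

Because mappers see records independently and reducers see one key at a time, both phases parallelize across machines, which is where the fault tolerance and massive scalability below come from.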
fault tolerance
massive scalability
petabyte scale
hosted hadoop service
hadoop easy and simple
[Diagram: Amazon Elastic MapReduce. Input data is loaded into an S3 bucket; the application is deployed via the web console or command-line tools; Elastic MapReduce runs Hadoop across Amazon EC2 instances; results land in an output S3 bucket and the user is notified at the end]
back to the science
basic informatics workflow
Via Christolakis under a CC-BY-NC-ND license
Via Argonne National Labs under a CC-BY-SA license
killer app
getting the data
Register projects
Register samples
Sample prep
Sequencing
Analysis
These slides cover work presented by Matt Wood at various conferences
Image: Matt Wood
constant change
flexible data capture
virtual fields
no schema
specify at run time
specify at run time (bootstrapping)
Sample
Name
Organism
Concentration
Source: Matt Wood
key value pairs
change happens
Sample
Name
Organism
Concentration
Sample
Name
Organism
Concentration
Origin
Quality metric
V1 V2
Source: Matt Wood
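With no fixed schema, each sample is just a bag of key-value pairs, so adding the Origin and Quality metric fields between V1 and V2 needs no migration: old records stay valid and readers tolerate missing fields. A sketch of the idea (the field names follow the slides; the values and code are illustrative):

```python
# V1 record: fields specified at run time, no schema to declare.
sample_v1 = {
    "Name": "sample-001",
    "Organism": "H. sapiens",
    "Concentration": "12 ng/ul",
}

# V2 record: change happens. New fields appear alongside the old
# ones, and existing V1 records remain valid as-is.
sample_v2 = dict(sample_v1)
sample_v2["Origin"] = "blood"           # illustrative value
sample_v2["Quality metric"] = 0.97      # illustrative value

for record in (sample_v1, sample_v2):
    # Readers default missing fields instead of failing.
    print(record["Name"], record.get("Quality metric", "n/a"))
```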
high throughput
lots of pipelines
scaling projects/pipelines?
lots of apps
loosely coupled
automation
scale operationally
be agile
now what?
Via asklar under a CC-BY license
Via Argonne National Labs under a CC-BY-SA license
many data types
changing data types
Shaq Image: Keith Allison under a CC-BY-SA license
?
lots and lots and lots and lots and lots and lots of data and lots and lots and lots of data
By bitterlysweet under a CC-BY-NC-ND license
Source: http://bit.ly/anderson-bigdata
Chris Anderson doesn’t understand science
“more is different”
few data points
elaborate models
the unreasonable effectiveness of data
Source: “The Unreasonable Effectiveness of Data”, Alon Halevy, Peter Norvig, and Fernando Pereira
simple models, lots of data
information platform
information platforms at scale
one organization
4 TB daily added(compressed)
135 TB data scanned daily(compressed)
15 PB data total capacity
???
Facebook data from Ashish Thusoo’s HadoopWorld 2009 talk
not always that big
can we learn any lessons?
Source: “Information Platforms and the Rise of the Data Scientist”, Jeff Hammerbacher in Beautiful Data
analytics platform
Data warehouse
A data warehouse is a repository of an organization's electronically stored data. Data warehouses are designed to facilitate reporting and analysis.
ETL
extract
transform
load
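Extract, transform, load in miniature: pull raw records from a source, normalize them, and load them into a queryable store. A toy sketch with sqlite standing in for the warehouse (all names and data are illustrative):

```python
import csv
import io
import sqlite3

raw = "name,reads\nsample-1,1200000\nsample-2,900000\n"   # source feed

# Extract: parse the raw feed into records.
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: cast types and derive a reads-in-millions column.
clean = [(r["name"], int(r["reads"]), int(r["reads"]) // 1_000_000)
         for r in rows]

# Load: insert into the warehouse table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE samples (name TEXT, reads INT, mreads INT)")
db.executemany("INSERT INTO samples VALUES (?, ?, ?)", clean)

print(db.execute("SELECT name, mreads FROM samples").fetchall())
```

The scaling complaint in the next slides is exactly about this pipeline: the extract and transform stages grow with the data, while the warehouse at the end stays fixed.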
Via asklar under a CC-BY license
1 TB
MySQL --> Oracle
more data
more data types
changing data types
limit data warehouse
too limited
how do you scale and adapt?
100’s of TBs
1000’s of jobs
back to the science
back in the day
small data sets
flat files
[Diagram: directory tree: folder1/, folder2/, ..., folderN/, each holding file1, file2, ..., fileN]
shared file system
RDBMS
Image: Wikimedia Commons
Image: Chris Dagdigian
need to process
need to analyze
100’s of TBs
1000’s of jobs
Facebook data from Ashish Thusoo’s HadoopWorld 2009 talk
ETL
Via asklar under a CC-BY license
data mining&
analytics
Via Argonne National Labs under a CC-BY-SA license
analysts are not programmers
not savvy with map/reduce
apache hive
http://hadoop.apache.org/hive/
manage & query data
manage & query dataon top of Hadoop
work by @peteskomoroch
apache pig
http://hadoop.apache.org/pig/
[Diagram: Amazon Elastic MapReduce. Input data is loaded into an S3 bucket; the application is deployed via the web console or command-line tools; Elastic MapReduce runs Hadoop across Amazon EC2 instances; results land in an output S3 bucket and the user is notified at the end]
hadoop and bioinformatics
High Throughput Sequence Analysis
Mike Schatz, University of Maryland
Short Read Mapping
Seed & Extend
Good alignments must have significant exact alignment
Minimal exact alignment length = l/(k+1)
Expensive to scale
Need parallelization framework
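The l/(k+1) bound is the pigeonhole principle: split a read of length l into k+1 pieces; at most k mismatches can corrupt at most k of them, so at least one piece of length l/(k+1) must align exactly and can serve as the seed. A quick sketch (the 36 bp / 3-mismatch numbers are illustrative, not from the talk):

```python
# Pigeonhole behind seed-and-extend: a length-l read aligned with at
# most k mismatches must contain an exact match of length >= l // (k+1).
def min_exact_seed(l: int, k: int) -> int:
    return l // (k + 1)

# e.g. 36 bp reads allowing up to 3 mismatches:
print(min_exact_seed(36, 3))  # 9: one of the four 9-mers must be exact
```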
CloudBurst
Catalog k-mers Collect seeds End-to-end alignment
http://cloudburst-bio.sourceforge.net; Bioinformatics 2009 25: 1363-1369
Bowtie: Ultrafast short read aligner
Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10 (3): R25.
SOAPSnp: Consensus alignment and SNP calling
Ruiqiang Li, Yingrui Li, Xiaodong Fang, et al. (2009) "SNP detection for massively parallel whole-genome resequencing" Genome Res
Crossbow: Rapid whole genome SNP analysis
Ben Langmead
http://bowtie-bio.sourceforge.net/crossbow/index.shtml
Preprocessed reads
Map: Bowtie
Sort: bin and partition
Reduce: SoapSNP
Crossbow condenses over 1,000 hours of resequencing computation into a few hours without requiring the user to own or operate a computer cluster
Comparing Genomes
Estimating relative evolutionary rates from sequence comparisons:
Identification of probable orthologs
A B C D E
S. cerevisiae C. elegans
species tree / gene tree
Admissible comparisons: A or B vs. D; C vs. E
Inadmissible comparisons: A or B vs. E; C vs. D
1. Orthologs found using the reciprocal smallest distance algorithm
2. Build alignment between two orthologs
>Sequence C
MSGRTILASTIAKPFQEEVTKAVKQLNFT-----PKLVGLLSNEDPAAKMYANWTGKTCESLGFKYEL-…
>Sequence E
MSGRTILASKVAETFNTEIINNVEEYKKTHNGQGPLLVGFLANNDPAAKMYATWTQKTSESMGFRYDL…
3. Estimate distance given a substitution matrix
[Table: amino-acid substitution rate matrix over Phe, Ala, Pro, Leu, Thr]
[Diagram: reciprocal smallest distance. Genes a, b, and c from Genome I are each aligned against genes in Genome J and distances are calculated (D = 0.1 through 1.2); the reciprocal pair with the smallest distance, Ib vs. Jc at D = 0.1, is called as an ortholog pair]
RSD algorithm summary
Prof. Dennis WallHarvard Medical School
Roundup is a database of orthologs and their evolutionary distances.To get started, click browse. Alternatively, you can read our documentation here.
Good luck, researchers!
massive computational demand
1000 genomes = 5,994,000 processes = 23,976,000 hours (≈ 2,737 years)
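The totals are consistent with all-against-all genome pairs, each run under several parameter settings. A back-of-envelope check (the 12 parameter settings per pair and ~4 CPU-hours per process are assumptions chosen to reproduce the slide's numbers, not figures from the talk):

```python
from math import comb

genomes = 1000
pairs = comb(genomes, 2)     # all-against-all genome pairs: 499,500
processes = pairs * 12       # assumed parameter settings per pair
hours = processes * 4        # assumed ~4 CPU-hours per process
years = hours / 8760         # wall-clock years on a single CPU

print(pairs, processes, hours, round(years))
# 499500 5994000 23976000 2737
```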
compared 50+ genomes
trends in data sharing
data motion is hard
cloud services are a viable dataspace
share data
share applications
share results
http://aws.amazon.com/publicdatasets/
[Diagram: data platform plus app platform combined into a scalable data platform, exposing services and APIs; getters, filters, and savers do the work]
to conclude
big data
change thinking
data managementdata processing
data sharing
think distributed
new software architectures
new computing paradigms
cloud services
the cloud works
[email protected]
Twitter: @mndoci
Presentation ideas from @mza, James Hamilton, and @lessig
Thank you!