masterworks talk on big data and the implications of petascale science

Big Data and Biology: The implications of petascale science
Deepak Singh

Upload: deepak-singh

Post on 13-Nov-2014


TRANSCRIPT

Page 1: Masterworks talk on Big Data and the implications of petascale science

Big Data and Biology: The implications of petascale science
Deepak Singh

Page 2: Masterworks talk on Big Data and the implications of petascale science
Page 4: Masterworks talk on Big Data and the implications of petascale science
Page 5: Masterworks talk on Big Data and the implications of petascale science
Page 6: Masterworks talk on Big Data and the implications of petascale science

life science industry

Page 7: Masterworks talk on Big Data and the implications of petascale science

Credit: Bosco Ho

Page 8: Masterworks talk on Big Data and the implications of petascale science
Page 9: Masterworks talk on Big Data and the implications of petascale science

By ~Prescott under a CC-BY-NC license

Page 10: Masterworks talk on Big Data and the implications of petascale science
Page 11: Masterworks talk on Big Data and the implications of petascale science

data

Page 12: Masterworks talk on Big Data and the implications of petascale science

Image: Wikipedia

Page 13: Masterworks talk on Big Data and the implications of petascale science

biology

Page 14: Masterworks talk on Big Data and the implications of petascale science

big data

Page 15: Masterworks talk on Big Data and the implications of petascale science

Source: http://www.nature.com/news/specials/bigdata/index.html

Page 16: Masterworks talk on Big Data and the implications of petascale science

Image: Matt Wood

Page 17: Masterworks talk on Big Data and the implications of petascale science

Human genome

Image: Matt Wood

Page 18: Masterworks talk on Big Data and the implications of petascale science
Page 19: Masterworks talk on Big Data and the implications of petascale science

not just sequencing

Page 20: Masterworks talk on Big Data and the implications of petascale science
Page 22: Masterworks talk on Big Data and the implications of petascale science
Page 23: Masterworks talk on Big Data and the implications of petascale science
Page 24: Masterworks talk on Big Data and the implications of petascale science

more data

Page 25: Masterworks talk on Big Data and the implications of petascale science
Page 26: Masterworks talk on Big Data and the implications of petascale science
Page 27: Masterworks talk on Big Data and the implications of petascale science

Image: Matt Wood

Page 28: Masterworks talk on Big Data and the implications of petascale science

all hell breaks loose

Page 29: Masterworks talk on Big Data and the implications of petascale science

~100 TB/Week

Page 30: Masterworks talk on Big Data and the implications of petascale science

~100 TB/Week

>2 PB/Year

Page 31: Masterworks talk on Big Data and the implications of petascale science
Page 32: Masterworks talk on Big Data and the implications of petascale science
Page 33: Masterworks talk on Big Data and the implications of petascale science
Page 34: Masterworks talk on Big Data and the implications of petascale science
Page 35: Masterworks talk on Big Data and the implications of petascale science
Page 36: Masterworks talk on Big Data and the implications of petascale science
Page 37: Masterworks talk on Big Data and the implications of petascale science
Page 38: Masterworks talk on Big Data and the implications of petascale science

years

Page 39: Masterworks talk on Big Data and the implications of petascale science

weeks

Page 40: Masterworks talk on Big Data and the implications of petascale science

days

Page 41: Masterworks talk on Big Data and the implications of petascale science

days

Page 42: Masterworks talk on Big Data and the implications of petascale science

days

minutes?

Page 43: Masterworks talk on Big Data and the implications of petascale science

gigabytes

Page 44: Masterworks talk on Big Data and the implications of petascale science

terabytes

Page 45: Masterworks talk on Big Data and the implications of petascale science

petabytes

Page 46: Masterworks talk on Big Data and the implications of petascale science

exabytes?

Page 47: Masterworks talk on Big Data and the implications of petascale science

really fast

Page 48: Masterworks talk on Big Data and the implications of petascale science

Image: http://www.broadinstitute.org/~apleite/photos.html

Page 49: Masterworks talk on Big Data and the implications of petascale science

single lab

Page 50: Masterworks talk on Big Data and the implications of petascale science

Image: Chris Dagdigian

Page 51: Masterworks talk on Big Data and the implications of petascale science
Page 52: Masterworks talk on Big Data and the implications of petascale science
Page 53: Masterworks talk on Big Data and the implications of petascale science
Page 54: Masterworks talk on Big Data and the implications of petascale science
Page 55: Masterworks talk on Big Data and the implications of petascale science

implications of scale

Page 56: Masterworks talk on Big Data and the implications of petascale science

data management

Page 57: Masterworks talk on Big Data and the implications of petascale science

data processing

Page 58: Masterworks talk on Big Data and the implications of petascale science

data sharing

Page 59: Masterworks talk on Big Data and the implications of petascale science
Page 60: Masterworks talk on Big Data and the implications of petascale science

fundamental concepts

Page 61: Masterworks talk on Big Data and the implications of petascale science

1. architecting for scale

Page 62: Masterworks talk on Big Data and the implications of petascale science
Page 63: Masterworks talk on Big Data and the implications of petascale science

“Everything fails, all the time” -- Werner Vogels

Page 64: Masterworks talk on Big Data and the implications of petascale science
Page 65: Masterworks talk on Big Data and the implications of petascale science

“Things will crash. Deal with it” -- Jeff Dean

Page 66: Masterworks talk on Big Data and the implications of petascale science
Page 67: Masterworks talk on Big Data and the implications of petascale science

“Remember everything fails” -- Randy Shoup

Page 68: Masterworks talk on Big Data and the implications of petascale science

fun with numbers

Page 69: Masterworks talk on Big Data and the implications of petascale science

datacenter availability

Page 70: Masterworks talk on Big Data and the implications of petascale science

Source: Uptime Institute

Page 71: Masterworks talk on Big Data and the implications of petascale science

Tier I: 28.8 hours annual downtime (99.67% availability)
Tier II: 22.0 hours annual downtime (99.75% availability)
Tier III: 1.6 hours annual downtime (99.98% availability)
Tier IV: 0.8 hours annual downtime (99.99% availability)

Source: Uptime Institute
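
The relationship between an availability percentage and annual downtime is simple arithmetic. The sketch below assumes a 365-day year and uses the rounded percentages from the slide, so it reproduces the slide's hour figures only approximately (those were presumably derived from unrounded availabilities). The function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def annual_downtime_hours(availability_percent: float) -> float:
    """Expected hours of downtime per year at the given availability."""
    return (1.0 - availability_percent / 100.0) * HOURS_PER_YEAR


# Rounded availability figures as quoted on the slide.
tiers = {"Tier I": 99.67, "Tier II": 99.75, "Tier III": 99.98, "Tier IV": 99.99}

for name, percent in tiers.items():
    print(f"{name}: {annual_downtime_hours(percent):.1f} hours/year")
```

Even the best tier leaves some hours of expected downtime each year, which is why the talk goes on to treat failure as a design input.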

Page 72: Masterworks talk on Big Data and the implications of petascale science

cooling systems go down

Page 73: Masterworks talk on Big Data and the implications of petascale science

power units fail

Page 74: Masterworks talk on Big Data and the implications of petascale science

2-4% of servers will die annually

Source: Jeff Dean, LADIS 2009

Page 75: Masterworks talk on Big Data and the implications of petascale science

1-5% of disk drives will die every year

Source: Jeff Dean, LADIS 2009

Page 76: Masterworks talk on Big Data and the implications of petascale science

2.3% AFR in population of 13,250
3.3% AFR in population of 22,400
4.2% AFR in population of 246,000

Source: James Hamilton
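
To make the annualized failure rates (AFR) concrete, the sketch below converts each rate into an expected number of failures per year and per day. It assumes the populations are disk drives, as the preceding slide suggests, and that failures are spread evenly over the year; the function name is illustrative.

```python
def expected_annual_failures(population: int, afr_percent: float) -> float:
    """Expected number of units failing per year at the given AFR."""
    return population * afr_percent / 100.0


# (population, AFR %) pairs as quoted on the slide.
observations = [(13_250, 2.3), (22_400, 3.3), (246_000, 4.2)]

for population, afr in observations:
    failures = expected_annual_failures(population, afr)
    print(f"{population:>7} units at {afr}% AFR: "
          f"about {failures:.0f} failures/year, {failures / 365:.1f} per day")
```

At the largest population this works out to many failures every day, so replacement has to be a routine, automated operation rather than an incident.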

Page 77: Masterworks talk on Big Data and the implications of petascale science

software breaks

Page 78: Masterworks talk on Big Data and the implications of petascale science

human errors

Page 79: Masterworks talk on Big Data and the implications of petascale science

human errors
~20% admin issues have unintended consequences

Source: James Hamilton

Page 80: Masterworks talk on Big Data and the implications of petascale science

achieving scalability and availability

Page 81: Masterworks talk on Big Data and the implications of petascale science

partitioning

Page 82: Masterworks talk on Big Data and the implications of petascale science

redundancy
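
A rough way to see why redundancy helps: if replicas fail independently, the service is down only when all of them are down at once. The sketch below rests on that independence assumption, which real systems only approximate (shared power, network, or software bugs correlate failures). The function name is illustrative.

```python
def combined_availability(single_availability: float, replicas: int) -> float:
    """Availability of a service that is up if at least one replica is up,
    assuming replicas fail independently of one another."""
    if replicas < 1:
        raise ValueError("need at least one replica")
    return 1.0 - (1.0 - single_availability) ** replicas


for n in (1, 2, 3):
    print(f"{n} replica(s) at 99% each: {combined_availability(0.99, n):.6f}")
```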

Page 83: Masterworks talk on Big Data and the implications of petascale science

recovery oriented computing

Source: http://perspectives.mvdirona.com/, http://roc.cs.berkeley.edu/

Page 84: Masterworks talk on Big Data and the implications of petascale science

assume sw/hw failure

Page 85: Masterworks talk on Big Data and the implications of petascale science

design apps to be resilient

Page 86: Masterworks talk on Big Data and the implications of petascale science

automation

Page 87: Masterworks talk on Big Data and the implications of petascale science
Page 88: Masterworks talk on Big Data and the implications of petascale science

Your Custom Applications and Services

Compute: Amazon Elastic Compute Cloud (EC2), Elastic Load Balancing, Auto Scaling
Storage: Amazon Simple Storage Service (S3), AWS Import/Export
Content Delivery: Amazon CloudFront
Messaging: Amazon Simple Queue Service (SQS)
Payments: Amazon Flexible Payments Service (FPS)
On-Demand Workforce: Amazon Mechanical Turk
Parallel Processing: Amazon Elastic MapReduce
Monitoring: Amazon CloudWatch
Management: AWS Management Console
Tools: AWS Toolkit for Eclipse
Isolated Networks: Amazon Virtual Private Cloud
Database: Amazon RDS and SimpleDB

Page 89: Masterworks talk on Big Data and the implications of petascale science

Amazon S3

Page 90: Masterworks talk on Big Data and the implications of petascale science

durable

Page 91: Masterworks talk on Big Data and the implications of petascale science

available

Page 92: Masterworks talk on Big Data and the implications of petascale science


Page 93: Masterworks talk on Big Data and the implications of petascale science

Amazon EC2

Page 94: Masterworks talk on Big Data and the implications of petascale science

highly scalable

Page 95: Masterworks talk on Big Data and the implications of petascale science

3000 CPU’s for one firm’s risk management application

[Chart: Number of EC2 Instances by day, Wednesday 4/22/2009 through Tuesday 4/28/2009; scale marks at 3000 and 300; annotation: 300 CPU’s on weekends]

Page 96: Masterworks talk on Big Data and the implications of petascale science

highly available systems

Page 97: Masterworks talk on Big Data and the implications of petascale science

dynamic

Page 98: Masterworks talk on Big Data and the implications of petascale science

fault tolerant

Page 99: Masterworks talk on Big Data and the implications of petascale science

US East Region

Availability Zone A

Availability Zone B

Availability Zone C

Availability Zone D

Page 100: Masterworks talk on Big Data and the implications of petascale science

2. one size does not fit all

Page 101: Masterworks talk on Big Data and the implications of petascale science

2. one size does not fit all data

Page 102: Masterworks talk on Big Data and the implications of petascale science

many data types

Page 103: Masterworks talk on Big Data and the implications of petascale science

structured data

Page 104: Masterworks talk on Big Data and the implications of petascale science

using the right data store

Page 105: Masterworks talk on Big Data and the implications of petascale science

(a) feature first

Page 106: Masterworks talk on Big Data and the implications of petascale science

RDBMS

Oracle, SQL Server, DB2, MySQL, Postgres

Page 110: Masterworks talk on Big Data and the implications of petascale science

use a bigger computer

Page 111: Masterworks talk on Big Data and the implications of petascale science

remove joins

Page 112: Masterworks talk on Big Data and the implications of petascale science

scaling limits

Page 113: Masterworks talk on Big Data and the implications of petascale science

(b) scale first

Page 114: Masterworks talk on Big Data and the implications of petascale science

scale is highest priority

Page 115: Masterworks talk on Big Data and the implications of petascale science

single RDBMS incapable

Page 116: Masterworks talk on Big Data and the implications of petascale science

solution 1: data sharding

Page 117: Masterworks talk on Big Data and the implications of petascale science

10’s

Page 118: Masterworks talk on Big Data and the implications of petascale science

100’s
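
A minimal sketch of data sharding, assuming hash-based routing: each key is hashed to pick one of N independent stores, so no single database has to hold everything. The class and function names are hypothetical, and plain dictionaries stand in for separate database servers.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index with a stable hash (Python's built-in
    hash() is salted per process, so it is not suitable for routing)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Toy sharded store: each dict stands in for one database server."""

    def __init__(self, num_shards: int):
        self.shards = [{} for _ in range(num_shards)]

    def put(self, key: str, value) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str, default=None):
        return self.shards[shard_for(key, len(self.shards))].get(key, default)


store = ShardedStore(num_shards=10)
store.put("sample-0001", {"organism": "E. coli"})  # made-up record
print(store.get("sample-0001"))
```

Modulo routing like this remaps most keys whenever the number of shards changes, which is one reason growing from tens to hundreds of shards is operationally painful; consistent hashing is a common alternative.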

Page 119: Masterworks talk on Big Data and the implications of petascale science

solution 2: scalable key-value store

Page 120: Masterworks talk on Big Data and the implications of petascale science

scale is design point

MongoDB, Project Voldemort, Cassandra, HBase, BigTable, Amazon SimpleDB, Dynamo

Page 121: Masterworks talk on Big Data and the implications of petascale science

(c) simple structured storage

Page 122: Masterworks talk on Big Data and the implications of petascale science
Page 123: Masterworks talk on Big Data and the implications of petascale science

simple
fast

low ops cost

BerkeleyDB, Tokyo Cabinet, Amazon SimpleDB

Page 124: Masterworks talk on Big Data and the implications of petascale science

(d) purpose optimized stores

Page 125: Masterworks talk on Big Data and the implications of petascale science
Page 126: Masterworks talk on Big Data and the implications of petascale science

data warehousing
stream processing

Aster Data, Vertica, Netezza, Greenplum, VoltDB, StreamBase

Page 127: Masterworks talk on Big Data and the implications of petascale science

what about files?

Page 128: Masterworks talk on Big Data and the implications of petascale science

cluster file systems

Lustre, GlusterFS

Page 129: Masterworks talk on Big Data and the implications of petascale science

distributed file systems

HDFS, GFS

Page 130: Masterworks talk on Big Data and the implications of petascale science

distributed object store

Amazon S3, Dynomite

Page 131: Masterworks talk on Big Data and the implications of petascale science
Page 132: Masterworks talk on Big Data and the implications of petascale science

Your Custom Applications and Services

Compute: Amazon Elastic Compute Cloud (EC2), Elastic Load Balancing, Auto Scaling
Storage: Amazon Simple Storage Service (S3), AWS Import/Export
Content Delivery: Amazon CloudFront
Messaging: Amazon Simple Queue Service (SQS)
Payments: Amazon Flexible Payments Service (FPS)
On-Demand Workforce: Amazon Mechanical Turk
Parallel Processing: Amazon Elastic MapReduce
Monitoring: Amazon CloudWatch
Management: AWS Management Console
Tools: AWS Toolkit for Eclipse
Isolated Networks: Amazon Virtual Private Cloud
Database: Amazon RDS and SimpleDB

Page 133: Masterworks talk on Big Data and the implications of petascale science

Your Custom Applications and Services

Compute: Amazon Elastic Compute Cloud (EC2), Elastic Load Balancing, Auto Scaling
Storage: Amazon Simple Storage Service (S3), AWS Import/Export
Content Delivery: Amazon CloudFront
Messaging: Amazon Simple Queue Service (SQS)
Payments: Amazon Flexible Payments Service (FPS)
On-Demand Workforce: Amazon Mechanical Turk
Parallel Processing: Amazon Elastic MapReduce
Monitoring: Amazon CloudWatch
Management: AWS Management Console
Tools: AWS Toolkit for Eclipse
Isolated Networks: Amazon Virtual Private Cloud
Database: Amazon RDS and SimpleDB

Page 134: Masterworks talk on Big Data and the implications of petascale science

3. processing big data

Page 135: Masterworks talk on Big Data and the implications of petascale science

disk read/writes
slow & expensive

Page 136: Masterworks talk on Big Data and the implications of petascale science

data processing
fast & cheap

Page 137: Masterworks talk on Big Data and the implications of petascale science

distribute the data
parallel reads

Page 138: Masterworks talk on Big Data and the implications of petascale science
Page 139: Masterworks talk on Big Data and the implications of petascale science

data processing for the cloud

Page 140: Masterworks talk on Big Data and the implications of petascale science

distributed file system (HDFS)

Page 141: Masterworks talk on Big Data and the implications of petascale science

map/reduce
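
The map/reduce model can be sketched in a few lines of single-process Python. This is only an illustration of the three phases (map, shuffle, reduce) and not how Hadoop is implemented; the function names and the base-counting task are made up for the example.

```python
from collections import defaultdict
from itertools import chain


def mapper(read: str):
    """Map: emit a (key, value) pair for every base in one sequence read."""
    for base in read:
        yield base, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def map_reduce(records):
    pairs = chain.from_iterable(mapper(record) for record in records)
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())


print(map_reduce(["ACGT", "AACC"]))
```

Because each mapper call sees one record and each reducer call sees one key, a framework is free to run them on many machines and to rerun any piece that fails.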

Page 142: Masterworks talk on Big Data and the implications of petascale science

Via Cloudera under a Creative Commons License

Page 143: Masterworks talk on Big Data and the implications of petascale science

Via Cloudera under a Creative Commons License

Page 144: Masterworks talk on Big Data and the implications of petascale science

fault tolerance

Page 145: Masterworks talk on Big Data and the implications of petascale science

massive scalability

Page 146: Masterworks talk on Big Data and the implications of petascale science

petabyte scale

Page 147: Masterworks talk on Big Data and the implications of petascale science
Page 148: Masterworks talk on Big Data and the implications of petascale science
Page 149: Masterworks talk on Big Data and the implications of petascale science

hosted hadoop service

Page 150: Masterworks talk on Big Data and the implications of petascale science

hadoop easy and simple

Page 151: Masterworks talk on Big Data and the implications of petascale science

[Diagram: Amazon Elastic MapReduce]
Steps: Input Data, Deploy Application (Web Console, Command line tools), Notify, Get Results, End
Amazon S3: Input S3 bucket (input dataset), Output S3 bucket (output results)
Amazon EC2 Instances: Hadoop

Page 152: Masterworks talk on Big Data and the implications of petascale science

back to the science

Page 153: Masterworks talk on Big Data and the implications of petascale science

basic informatics workflow

Page 154: Masterworks talk on Big Data and the implications of petascale science
Page 155: Masterworks talk on Big Data and the implications of petascale science
Page 156: Masterworks talk on Big Data and the implications of petascale science
Page 157: Masterworks talk on Big Data and the implications of petascale science
Page 159: Masterworks talk on Big Data and the implications of petascale science

Via Argonne National Labs under a CC-BY-SA license

Page 160: Masterworks talk on Big Data and the implications of petascale science

Via Argonne National Labs under a CC-BY-SA license

killer app

Page 161: Masterworks talk on Big Data and the implications of petascale science

getting the data

Page 162: Masterworks talk on Big Data and the implications of petascale science

Register projects

Register samples

Sample prep

Sequencing

Analysis

These slides cover work presented by Matt Wood at various conferences

Page 163: Masterworks talk on Big Data and the implications of petascale science

Image: Matt Wood

Page 164: Masterworks talk on Big Data and the implications of petascale science

constant change

Page 165: Masterworks talk on Big Data and the implications of petascale science

flexible data capture

Page 166: Masterworks talk on Big Data and the implications of petascale science

virtual fields

Page 167: Masterworks talk on Big Data and the implications of petascale science

no schema

Page 168: Masterworks talk on Big Data and the implications of petascale science
Page 169: Masterworks talk on Big Data and the implications of petascale science

specify at run time

Page 170: Masterworks talk on Big Data and the implications of petascale science

specify at run time (bootstrapping)

Page 171: Masterworks talk on Big Data and the implications of petascale science

Sample

Name

Organism

Concentration

Source: Matt Wood

Page 172: Masterworks talk on Big Data and the implications of petascale science

Source: Matt Wood

Page 173: Masterworks talk on Big Data and the implications of petascale science

key value pairs

Page 174: Masterworks talk on Big Data and the implications of petascale science
Page 175: Masterworks talk on Big Data and the implications of petascale science

change happens

Page 176: Masterworks talk on Big Data and the implications of petascale science

V1: Sample (Name, Organism, Concentration)
V2: Sample (Name, Organism, Concentration, Origin, Quality metric)

Source: Matt Wood

Page 177: Masterworks talk on Big Data and the implications of petascale science

Source: Matt Wood
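
The idea of schema-free capture with key-value pairs can be sketched as follows. This is an illustrative toy, not the system described in the talk: the class name, sample IDs, and field values are all made up. The point is that V2 fields (Origin, Quality metric) can be added at run time without migrating V1 records.

```python
from collections import defaultdict


class SampleStore:
    """Toy schema-free store: each sample is a bag of key-value pairs."""

    def __init__(self):
        self._pairs = defaultdict(dict)

    def record(self, sample_id: str, **fields) -> None:
        """Attach any fields to a sample; no schema is declared up front."""
        self._pairs[sample_id].update(fields)

    def get(self, sample_id: str, key: str, default=None):
        return self._pairs.get(sample_id, {}).get(key, default)

    def known_fields(self):
        """All field names seen so far, across every sample."""
        return sorted({key for fields in self._pairs.values() for key in fields})


store = SampleStore()
# V1-style record: three fields.
store.record("s1", name="sample one", organism="E. coli", concentration=12.5)
# V2-style record: two extra fields, no migration of s1 required.
store.record("s2", name="sample two", organism="H. sapiens", concentration=8.0,
             origin="lab A", quality_metric=0.93)

print(store.known_fields())
print(store.get("s1", "origin"))  # V1 records simply lack the new field
```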

Page 178: Masterworks talk on Big Data and the implications of petascale science

high throughput

Page 179: Masterworks talk on Big Data and the implications of petascale science

lots of pipelines

Page 180: Masterworks talk on Big Data and the implications of petascale science

scaling projects/pipelines?

Page 181: Masterworks talk on Big Data and the implications of petascale science

lots of apps

Page 182: Masterworks talk on Big Data and the implications of petascale science

loosely coupled

Page 183: Masterworks talk on Big Data and the implications of petascale science

automation

Page 184: Masterworks talk on Big Data and the implications of petascale science

scale operationally

Page 185: Masterworks talk on Big Data and the implications of petascale science

be agile

Page 186: Masterworks talk on Big Data and the implications of petascale science

now what?

Page 188: Masterworks talk on Big Data and the implications of petascale science

Via Argonne National Labs under a CC-BY-SA license

Page 189: Masterworks talk on Big Data and the implications of petascale science

many data types

Page 190: Masterworks talk on Big Data and the implications of petascale science

changing data types

Page 191: Masterworks talk on Big Data and the implications of petascale science

Shaq Image: Keith Allison under a CC-BY-SA license

Page 192: Masterworks talk on Big Data and the implications of petascale science

Shaq Image: Keith Allison under a CC-BY-SA license

Page 193: Masterworks talk on Big Data and the implications of petascale science

Shaq Image: Keith Allison under a CC-BY-SA license

Page 194: Masterworks talk on Big Data and the implications of petascale science

Shaq Image: Keith Allison under a CC-BY-SA license

Page 195: Masterworks talk on Big Data and the implications of petascale science

Shaq Image: Keith Allison under a CC-BY-SA license

Page 196: Masterworks talk on Big Data and the implications of petascale science

?

Page 197: Masterworks talk on Big Data and the implications of petascale science
Page 198: Masterworks talk on Big Data and the implications of petascale science

lots and lots and lots and lots and lots and lots of data and lots and lots of lots of data

Page 199: Masterworks talk on Big Data and the implications of petascale science

By bitterlysweet under a CC-BY-NC-ND license

Page 200: Masterworks talk on Big Data and the implications of petascale science

Source: http://bit.ly/anderson-bigdata

Page 201: Masterworks talk on Big Data and the implications of petascale science

Chris Anderson doesn’t understand science

Page 202: Masterworks talk on Big Data and the implications of petascale science

“more is different”

Page 203: Masterworks talk on Big Data and the implications of petascale science

few data points

Page 204: Masterworks talk on Big Data and the implications of petascale science

elaborate models

Page 205: Masterworks talk on Big Data and the implications of petascale science

the unreasonable effectiveness of data

Source: “The Unreasonable Effectiveness of Data”, Alon Halevy, Peter Norvig, and Fernando Pereira

Page 206: Masterworks talk on Big Data and the implications of petascale science

simple models
lots of data

Page 207: Masterworks talk on Big Data and the implications of petascale science
Page 208: Masterworks talk on Big Data and the implications of petascale science

information platform

Page 209: Masterworks talk on Big Data and the implications of petascale science
Page 210: Masterworks talk on Big Data and the implications of petascale science

information platforms at scale

Page 211: Masterworks talk on Big Data and the implications of petascale science

one organization

Page 212: Masterworks talk on Big Data and the implications of petascale science

4 TB daily added (compressed)

Page 213: Masterworks talk on Big Data and the implications of petascale science

135 TB data scanned daily (compressed)

Page 214: Masterworks talk on Big Data and the implications of petascale science

15 PB data total capacity

Page 215: Masterworks talk on Big Data and the implications of petascale science

???

Page 216: Masterworks talk on Big Data and the implications of petascale science

Facebook data from Ashish Thusoo’s HadoopWorld 2009 talk

Page 217: Masterworks talk on Big Data and the implications of petascale science

not always that big

Page 218: Masterworks talk on Big Data and the implications of petascale science

can we learn any lessons?

Source: “Information Platforms and the Rise of the Data Scientist”, Jeff Hammerbacher in Beautiful Data

Page 219: Masterworks talk on Big Data and the implications of petascale science

analytics platform

Page 220: Masterworks talk on Big Data and the implications of petascale science

Data warehouse

Page 221: Masterworks talk on Big Data and the implications of petascale science

A data warehouse is a repository of an organization's electronically stored data. Data warehouses are designed to facilitate reporting and analysis.

Page 222: Masterworks talk on Big Data and the implications of petascale science
Page 223: Masterworks talk on Big Data and the implications of petascale science
Page 224: Masterworks talk on Big Data and the implications of petascale science
Page 225: Masterworks talk on Big Data and the implications of petascale science

ETL

Page 226: Masterworks talk on Big Data and the implications of petascale science

extract

Page 227: Masterworks talk on Big Data and the implications of petascale science

transform

Page 228: Masterworks talk on Big Data and the implications of petascale science

load
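
The three ETL steps can be illustrated with a small, self-contained sketch. It assumes CSV text as the source and uses an in-memory SQLite database as a stand-in for the warehouse; the column names and the sample row are hypothetical.

```python
import csv
import io
import sqlite3


def extract(csv_text: str):
    """Extract: read raw rows from the source (CSV text here)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Transform: clean whitespace, normalize case, convert types."""
    return [
        (row["sample"].strip(),
         row["organism"].strip().lower(),
         float(row["concentration"]))
        for row in rows
    ]


def load(conn: sqlite3.Connection, records) -> None:
    """Load: write the cleaned records into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples "
        "(sample TEXT, organism TEXT, concentration REAL)"
    )
    conn.executemany("INSERT INTO samples VALUES (?, ?, ?)", records)
    conn.commit()


source = "sample,organism,concentration\ns1, E. Coli ,12.5\n"  # made-up data
warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(source)))
print(warehouse.execute("SELECT * FROM samples").fetchall())
```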

Page 230: Masterworks talk on Big Data and the implications of petascale science

1 TB

Page 231: Masterworks talk on Big Data and the implications of petascale science

MySQL --> Oracle

Page 232: Masterworks talk on Big Data and the implications of petascale science

more data

Page 233: Masterworks talk on Big Data and the implications of petascale science

more data types

Page 234: Masterworks talk on Big Data and the implications of petascale science

changing data types

Page 235: Masterworks talk on Big Data and the implications of petascale science

limit data warehouse

Page 236: Masterworks talk on Big Data and the implications of petascale science

too limited

Page 237: Masterworks talk on Big Data and the implications of petascale science

how do you scale and adapt?

Page 238: Masterworks talk on Big Data and the implications of petascale science

100’s of TBs

Page 239: Masterworks talk on Big Data and the implications of petascale science

1000’s of jobs

Page 240: Masterworks talk on Big Data and the implications of petascale science

back to the science

Page 241: Masterworks talk on Big Data and the implications of petascale science

back in the day

Page 242: Masterworks talk on Big Data and the implications of petascale science

small data sets

Page 243: Masterworks talk on Big Data and the implications of petascale science

flat files

Page 244: Masterworks talk on Big Data and the implications of petascale science

../folder1/
../folder2/
    file1
    file2
    ..
    fileN
../folderN/

Page 245: Masterworks talk on Big Data and the implications of petascale science

shared file system

Page 246: Masterworks talk on Big Data and the implications of petascale science

RDBMS

Page 247: Masterworks talk on Big Data and the implications of petascale science

Image: Wikimedia Commons

Page 248: Masterworks talk on Big Data and the implications of petascale science
Page 249: Masterworks talk on Big Data and the implications of petascale science
Page 250: Masterworks talk on Big Data and the implications of petascale science

Image: Chris Dagdigian

Page 251: Masterworks talk on Big Data and the implications of petascale science

need to process

Page 252: Masterworks talk on Big Data and the implications of petascale science

need to analyze

Page 253: Masterworks talk on Big Data and the implications of petascale science

100’s of TBs

Page 254: Masterworks talk on Big Data and the implications of petascale science

1000’s of jobs

Page 255: Masterworks talk on Big Data and the implications of petascale science

Facebook data from Ashish Thusoo’s HadoopWorld 2009 talk

Page 256: Masterworks talk on Big Data and the implications of petascale science
Page 257: Masterworks talk on Big Data and the implications of petascale science

ETL

Page 259: Masterworks talk on Big Data and the implications of petascale science

data mining & analytics

Page 260: Masterworks talk on Big Data and the implications of petascale science

Via Argonne National Labs under a CC-BY-SA license

Page 261: Masterworks talk on Big Data and the implications of petascale science

analysts are not programmers

Page 262: Masterworks talk on Big Data and the implications of petascale science

not savvy with map/reduce

Page 263: Masterworks talk on Big Data and the implications of petascale science

apache hive

http://hadoop.apache.org/hive/

Page 264: Masterworks talk on Big Data and the implications of petascale science

manage & query data

Page 265: Masterworks talk on Big Data and the implications of petascale science

manage & query data
on top of Hadoop

Page 266: Masterworks talk on Big Data and the implications of petascale science

work by @peteskomoroch

Page 267: Masterworks talk on Big Data and the implications of petascale science
Page 268: Masterworks talk on Big Data and the implications of petascale science
Page 269: Masterworks talk on Big Data and the implications of petascale science

cascading

http://www.cascading.org/

Page 270: Masterworks talk on Big Data and the implications of petascale science
Page 271: Masterworks talk on Big Data and the implications of petascale science

apache pig

http://hadoop.apache.org/pig/

Page 272: Masterworks talk on Big Data and the implications of petascale science

[Diagram: Amazon Elastic MapReduce]
Steps: Input Data, Deploy Application (Web Console, Command line tools), Notify, Get Results, End
Amazon S3: Input S3 bucket (input dataset), Output S3 bucket (output results)
Amazon EC2 Instances: Hadoop

Page 273: Masterworks talk on Big Data and the implications of petascale science

hadoop and bioinformatics

Page 274: Masterworks talk on Big Data and the implications of petascale science

High Throughput Sequence Analysis
Mike Schatz, University of Maryland

Page 275: Masterworks talk on Big Data and the implications of petascale science

Short Read Mapping

Page 276: Masterworks talk on Big Data and the implications of petascale science

Seed & Extend
Good alignments must have significant exact alignment

Minimal exact alignment length = l/(k+1)

Page 277: Masterworks talk on Big Data and the implications of petascale science

Seed & Extend
Good alignments must have significant exact alignment

Minimal exact alignment length = l/(k+1)

Expensive to scale

Page 278: Masterworks talk on Big Data and the implications of petascale science

Seed & Extend
Good alignments must have significant exact alignment

Minimal exact alignment length = l/(k+1)

Expensive to scale

Page 279: Masterworks talk on Big Data and the implications of petascale science

Seed & Extend
Good alignments must have significant exact alignment

Minimal exact alignment length = l/(k+1)

Expensive to scale

Need parallelization framework
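
The bound l/(k+1) follows from the pigeonhole principle: if a read of length l aligns with at most k mismatches and is cut into k+1 non-overlapping pieces, at least one piece must be free of mismatches. The sketch below illustrates only the seed step, assumes substitutions only (no insertions or deletions), and uses hypothetical function names and made-up sequences.

```python
def min_seed_length(read_length: int, max_mismatches: int) -> int:
    """Length of the exact match guaranteed by the pigeonhole argument."""
    return read_length // (max_mismatches + 1)


def seeds(read: str, max_mismatches: int):
    """Cut the read into k+1 non-overlapping seeds as (offset, seed) pairs."""
    size = min_seed_length(len(read), max_mismatches)
    if size == 0:
        raise ValueError("read is too short for this many mismatches")
    return [(i * size, read[i * size:(i + 1) * size])
            for i in range(max_mismatches + 1)]


def has_exact_seed(read: str, window: str, max_mismatches: int) -> bool:
    """True if some seed matches the reference window at the same offset."""
    return any(window[start:start + len(seed)] == seed
               for start, seed in seeds(read, max_mismatches))


read = "ACGTACGTACGT"
window = "ACGAACGTACTT"  # differs from the read at two positions
print(min_seed_length(len(read), 2))
print(has_exact_seed(read, window, 2))
```

Only windows that share an exact seed need the expensive extension step, but finding those seeds across a whole genome for millions of reads is still a large job.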

Page 280: Masterworks talk on Big Data and the implications of petascale science

CloudBurst

1. Catalog k-mers
2. Collect seeds
3. End-to-end alignment
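
The three stages can be read as a map, a shuffle, and a reduce. What follows is a small in-memory Python sketch of that reading, not CloudBurst's actual code (which runs on Hadoop). It assumes mismatches only (no indels) and non-overlapping read seeds; every function name is hypothetical.

```python
from collections import defaultdict


def map_kmers(reference, reads, seed_len):
    """Map: emit (k-mer, record) for each reference k-mer and each read seed."""
    for pos in range(len(reference) - seed_len + 1):
        yield reference[pos:pos + seed_len], ("ref", pos)
    for name, read in reads.items():
        # Non-overlapping seeds: offsets 0, seed_len, 2*seed_len, ...
        for off in range(0, len(read) - seed_len + 1, seed_len):
            yield read[off:off + seed_len], ("read", name, off)


def shuffle(pairs):
    """Shuffle: collect all records that share the same k-mer."""
    groups = defaultdict(list)
    for kmer, record in pairs:
        groups[kmer].append(record)
    return groups


def reduce_align(groups, reference, reads, max_mismatches):
    """Reduce: extend each shared seed into an end-to-end comparison."""
    hits = set()
    for records in groups.values():
        ref_positions = [r[1] for r in records if r[0] == "ref"]
        read_seeds = [(r[1], r[2]) for r in records if r[0] == "read"]
        for pos in ref_positions:
            for name, off in read_seeds:
                read = reads[name]
                start = pos - off  # where the whole read would begin
                if start < 0 or start + len(read) > len(reference):
                    continue
                window = reference[start:start + len(read)]
                mismatches = sum(a != b for a, b in zip(read, window))
                if mismatches <= max_mismatches:
                    hits.add((name, start, mismatches))
    return sorted(hits)


def align(reference, reads, max_mismatches):
    """Run the three stages; seed length follows the l/(k+1) bound."""
    shortest = min(len(r) for r in reads.values())
    seed_len = shortest // (max_mismatches + 1)
    groups = shuffle(map_kmers(reference, reads, seed_len))
    return reduce_align(groups, reference, reads, max_mismatches)
```

Grouping by k-mer is what makes the work parallel: each group can be extended independently, so the reduce step spreads across as many machines as there are distinct seeds.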

Page 281: Masterworks talk on Big Data and the implications of petascale science

http://cloudburst-bio.sourceforge.net; Bioinformatics 2009 25: 1363-1369

Page 282: Masterworks talk on Big Data and the implications of petascale science
Page 283: Masterworks talk on Big Data and the implications of petascale science

Bowtie: Ultrafast short read aligner

Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10 (3): R25.

Page 284: Masterworks talk on Big Data and the implications of petascale science

SOAPsnp: Consensus alignment and SNP calling

Ruiqiang Li, Yingrui Li, Xiaodong Fang, et al. (2009) "SNP detection for massively parallel whole-genome resequencing" Genome Res

Page 285: Masterworks talk on Big Data and the implications of petascale science

Crossbow: Rapid whole genome SNP analysis

Ben Langmead

http://bowtie-bio.sourceforge.net/crossbow/index.shtml

Page 286: Masterworks talk on Big Data and the implications of petascale science
Page 287: Masterworks talk on Big Data and the implications of petascale science

Preprocessed reads

Page 288: Masterworks talk on Big Data and the implications of petascale science

Preprocessed reads

Map: Bowtie

Page 289: Masterworks talk on Big Data and the implications of petascale science

Preprocessed reads

Map: Bowtie

Sort: Bin and partition

Page 290: Masterworks talk on Big Data and the implications of petascale science

Preprocessed reads

Map: Bowtie

Sort: Bin and partition

Reduce: SOAPsnp
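
The map, sort, and reduce steps above can be imitated on toy data. The Python sketch below is an illustration only: a brute-force ungapped matcher stands in for Bowtie, and a majority vote stands in for SOAPsnp, so neither reflects how the real tools work. Sending a read to every bin it overlaps is a design choice of this sketch, not a claim about Crossbow. All names and default parameters are hypothetical.

```python
from collections import Counter, defaultdict


def map_align(reads, reference, max_mismatches=1):
    """Map (stand-in for Bowtie): best ungapped placement of each read."""
    for read in reads:
        best = None
        for start in range(len(reference) - len(read) + 1):
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
                best = (start, mismatches)
        if best is not None:
            yield best[0], read


def sort_bin(alignments, bin_size):
    """Sort: send each alignment to every genomic bin it overlaps."""
    bins = defaultdict(list)
    for start, read in alignments:
        end = start + len(read) - 1
        for b in range(start // bin_size, end // bin_size + 1):
            bins[b].append((start, read))
    return {b: sorted(bins[b]) for b in sorted(bins)}


def reduce_call_snps(binned, reference, bin_size, min_depth=2):
    """Reduce (stand-in for SOAPsnp): majority vote at each position of a bin."""
    snps = []
    for b, alignments in binned.items():
        pileup = defaultdict(Counter)
        for start, read in alignments:
            for off, base in enumerate(read):
                pos = start + off
                if pos // bin_size == b:  # each bin calls only its own positions
                    pileup[pos][base] += 1
        for pos in sorted(pileup):
            base, count = pileup[pos].most_common(1)[0]
            if count >= min_depth and base != reference[pos]:
                snps.append((pos, reference[pos], base))
    return snps


def call_snps(reads, reference, bin_size=4):
    """Chain the three steps: map, then sort, then reduce."""
    binned = sort_bin(map_align(reads, reference), bin_size)
    return reduce_call_snps(binned, reference, bin_size)
```

Because each bin is self-contained after the sort, the reduce step can run on one machine per bin, which is the property that lets the pipeline spread across a rented cluster.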

Page 291: Masterworks talk on Big Data and the implications of petascale science

Crossbow condenses over 1,000 hours of resequencing computation into a few hours without requiring the user to own or operate a computer cluster

Page 292: Masterworks talk on Big Data and the implications of petascale science

Comparing Genomes

Page 293: Masterworks talk on Big Data and the implications of petascale science

Estimating relative evolutionary rates from sequence comparisons: Identification of probable orthologs

[Figure: species tree and gene tree relating sequences A, B, C, D, E in S. cerevisiae and C. elegans]

Admissible comparisons: A or B vs. D; C vs. E

Inadmissible comparisons: A or B vs. E; C vs. D

Page 294: Masterworks talk on Big Data and the implications of petascale science

Estimating relative evolutionary rates from sequence comparisons:

[Figure: species tree and gene tree relating sequences A, B, C, D, E in S. cerevisiae and C. elegans]

1. Orthologs found using the Reciprocal smallest distance algorithm

2. Build alignment between two orthologs

>Sequence C
MSGRTILASTIAKPFQEEVTKAVKQLNFT-----PKLVGLLSNEDPAAKMYANWTGKTCESLGFKYEL-…

>Sequence E
MSGRTILASKVAETFNTEIINNVEEYKKTHNGQGPLLVGFLANNDPAAKMYATWTQKTSESMGFRYDL…

3. Estimate distance given a substitution matrix

[Matrix: substitution rates between amino acids (Phe, Ala, Pro, Leu, Thr), with entries of the form µπ]

Page 295: Masterworks talk on Big Data and the implications of petascale science

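RSD algorithm summary

[Figure: Genome I and Genome J. Sequence b from Genome I is compared against sequences a, b, c of Genome J (align sequences and calculate distances: D = 0.2, 0.3, 0.1). The closest hit, c, is then compared back against sequences a, b, c of Genome I (align sequences and calculate distances: D = 1.2, 0.1, 0.9). Orthologs: Ib - Jc, D = 0.1]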
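
The reciprocal test can be sketched in a few lines of Python. This is a simplified illustration, assuming the distance between any two sequences is already available as a function. It skips the similarity search, alignment, and distance estimation steps, and any thresholds a real RSD run would apply. The distance table is toy data: a few values echo the D values on the slide, and the rest are made up.

```python
from typing import Callable, List, Tuple


def rsd_orthologs(
    genes_i: List[str],
    genes_j: List[str],
    dist: Callable[[str, str], float],
) -> List[Tuple[str, str, float]]:
    """Return (gene in I, gene in J, distance) for reciprocal smallest distances."""
    pairs = []
    for gi in genes_i:
        # Forward step: the gene in genome J closest to gi.
        gj = min(genes_j, key=lambda g: dist(gi, g))
        # Reverse step: the gene in genome I closest to that hit.
        back = min(genes_i, key=lambda g: dist(g, gj))
        # Keep the pair only if the reverse step lands on the starting gene.
        if back == gi:
            pairs.append((gi, gj, dist(gi, gj)))
    return pairs


# Toy distances keyed by (gene in genome I, gene in genome J); illustrative only.
TOY_DISTANCES = {
    ("a", "a"): 0.4, ("a", "b"): 0.8, ("a", "c"): 1.2,
    ("b", "a"): 0.2, ("b", "b"): 0.3, ("b", "c"): 0.1,
    ("c", "a"): 0.7, ("c", "b"): 0.5, ("c", "c"): 0.9,
}


def toy_dist(gene_i: str, gene_j: str) -> float:
    """Look up a precomputed toy distance."""
    return TOY_DISTANCES[(gene_i, gene_j)]
```

With this table, only gene b of genome I and gene c of genome J pick each other in both directions, so `rsd_orthologs(["a", "b", "c"], ["a", "b", "c"], toy_dist)` returns the single pair `("b", "c", 0.1)`.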

Page 296: Masterworks talk on Big Data and the implications of petascale science

Prof. Dennis Wall, Harvard Medical School

Page 297: Masterworks talk on Big Data and the implications of petascale science

Roundup is a database of orthologs and their evolutionary distances. To get started, click browse. Alternatively, you can read our documentation here.

Good luck, researchers!

Page 298: Masterworks talk on Big Data and the implications of petascale science

massive computational demand

Page 299: Masterworks talk on Big Data and the implications of petascale science

1000 genomes = 5,994,000 processes = 23,976,000 hours

Page 300: Masterworks talk on Big Data and the implications of petascale science

2737 years
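
The figures on these two slides can be checked with a little arithmetic. The Python sketch below uses only the numbers shown; the 12 jobs per genome pair and the 4 hours per process are inferred from those numbers rather than stated on the slides, and the year is taken as 365 days.

```python
# Figures taken from the slides.
genomes = 1000
processes = 5_994_000
total_hours = 23_976_000

# Number of distinct genome pairs among 1000 genomes.
pairs = genomes * (genomes - 1) // 2

# Inferred, not stated on the slide: jobs run per genome pair.
jobs_per_pair = processes // pairs

# Inferred, not stated on the slide: average hours per process.
hours_per_process = total_hours / processes

# Serial running time in years, assuming 365-day years.
years = total_hours / (24 * 365)

print(pairs, jobs_per_pair, hours_per_process, round(years))
```

Run serially, the workload comes to roughly 2737 years, which matches the slide.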

Page 301: Masterworks talk on Big Data and the implications of petascale science
Page 302: Masterworks talk on Big Data and the implications of petascale science

compared 50+ genomes

Page 303: Masterworks talk on Big Data and the implications of petascale science

trends in data sharing

Page 304: Masterworks talk on Big Data and the implications of petascale science

data motion is hard

Page 305: Masterworks talk on Big Data and the implications of petascale science

cloud services are a viable dataspace

Page 306: Masterworks talk on Big Data and the implications of petascale science

share data

Page 307: Masterworks talk on Big Data and the implications of petascale science

share applications

Page 308: Masterworks talk on Big Data and the implications of petascale science
Page 309: Masterworks talk on Big Data and the implications of petascale science

share results

Page 311: Masterworks talk on Big Data and the implications of petascale science
Page 312: Masterworks talk on Big Data and the implications of petascale science

Data Platform

App Platform

Page 313: Masterworks talk on Big Data and the implications of petascale science

Data Platform

App Platform

Page 314: Masterworks talk on Big Data and the implications of petascale science

Scalable Data Platform

Services

APIs

Getters Filters Savers

WORK

Page 315: Masterworks talk on Big Data and the implications of petascale science

to conclude

Page 316: Masterworks talk on Big Data and the implications of petascale science

big data

Page 317: Masterworks talk on Big Data and the implications of petascale science

change thinking

Page 318: Masterworks talk on Big Data and the implications of petascale science

data management

data processing

data sharing

Page 319: Masterworks talk on Big Data and the implications of petascale science

think distributed

Page 320: Masterworks talk on Big Data and the implications of petascale science

new software architectures

Page 321: Masterworks talk on Big Data and the implications of petascale science

new computing paradigms

Page 322: Masterworks talk on Big Data and the implications of petascale science

cloud services

Page 323: Masterworks talk on Big Data and the implications of petascale science

the cloud works

Page 324: Masterworks talk on Big Data and the implications of petascale science

[email protected]
Twitter: @mndoci
Presentation ideas from @mza, James Hamilton, and @lessig

Thank you!