masterworks talk on big data and the implications of petascale science
Big Data and Biology: The implications of petascale science
Deepak Singh
Via Reavel under a CC-BY-NC-ND license
life science industry
Credit: Bosco Ho
By ~Prescott under a CC-BY-NC license
data
Image: Wikipedia
biology
big data
Source: http://www.nature.com/news/specials/bigdata/index.html
Image: Matt Wood
Human genome
Image: Matt Wood
not just sequencing
more data
Image: Matt Wood
all hell breaks loose
~100 TB/Week
~100 TB/Week
>2 PB/Year
years
weeks
days
days
days
minutes?
gigabytes
terabytes
petabytes
exabytes?
really fast
Image: http://www.broadinstitute.org/~apleite/photos.html
single lab
Image: Chris Dagdigian
implications of scale
data management
data processing
data sharing
fundamental concepts
1. architecting for scale
“Everything fails, all the time”-- Werner Vogels
“Things will crash. Deal with it”-- Jeff Dean
“Remember everything fails”-- Randy Shoup
fun with numbers
datacenter availability
Source: Uptime Institute
Tier I: 28.8 hours annual downtime (99.67% availability)
Tier II: 22.0 hrs annual downtime (99.75% availability)
Tier III: 1.6 hrs annual downtime (99.98% availability)
Tier IV: 0.8 hrs annual downtime (99.99% availability)
Source: Uptime Institute
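Those availability percentages map directly to annual downtime via hours-per-year arithmetic. A back-of-envelope sketch (not from the slides) to check the tiers:

```python
# Annual downtime implied by an availability fraction,
# assuming a 365-day (8,760-hour) year.
def annual_downtime_hours(availability: float) -> float:
    return (1.0 - availability) * 8760

for tier, avail in [("Tier I", 0.9967), ("Tier II", 0.9975),
                    ("Tier III", 0.9998), ("Tier IV", 0.9999)]:
    print(f"{tier}: {annual_downtime_hours(avail):.1f} hours/year")
```

Tier I and II match the slide's figures; Tier III and IV come out at 1.8 and 0.9 hours here, so the slide's 1.6 and 0.8 reflect slightly different rounding.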
cooling systems go down
power units fail
2-4% of servers will die annually
Source: Jeff Dean, LADIS 2009
1-5% of disk drives will die every year
Source: Jeff Dean, LADIS 2009
2.3% AFR in population of 13,250
3.3% AFR in population of 22,400
4.2% AFR in population of 246,000
Source: James Hamilton
software breaks
human errors
human errors
~20% of admin issues have unintended consequences
Source: James Hamilton
achieving scalabilityand availability
partitioning
redundancy
recovery oriented computing
Source: http://perspectives.mvdirona.com/, http://roc.cs.berkeley.edu/
assume sw/hw failure
design apps to be resilient
automation
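Designing for failure means the application, not the hardware, supplies availability. A minimal sketch of the retry-across-replicas pattern (the zone names and functions are illustrative, not from the talk):

```python
import random

# Hypothetical replica endpoints in different availability zones.
REPLICAS = ["zone-a", "zone-b", "zone-c"]

def fetch(replica: str, key: str) -> str:
    # Stand-in for a network call; here zone-a is "down".
    if replica == "zone-a":
        raise ConnectionError(f"{replica} unavailable")
    return f"{key}@{replica}"

def resilient_fetch(key: str) -> str:
    # Assume any replica can fail; try them in random order
    # and fall through to the next on error.
    for replica in random.sample(REPLICAS, len(REPLICAS)):
        try:
            return fetch(replica, key)
        except ConnectionError:
            continue  # recovery-oriented: failure is expected
    raise RuntimeError("all replicas failed")

print(resilient_fetch("genome-17"))  # served by zone-b or zone-c
```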
[Diagram: AWS services surrounding "Your Custom Applications and Services"]
Compute: Amazon Elastic Compute Cloud (EC2), Elastic Load Balancing, Auto Scaling
Storage: Amazon Simple Storage Service (S3), AWS Import/Export
Content Delivery: Amazon CloudFront
Messaging: Amazon Simple Queue Service (SQS)
Payments: Amazon Flexible Payments Service (FPS)
On-Demand Workforce: Amazon Mechanical Turk
Parallel Processing: Amazon Elastic MapReduce
Monitoring: Amazon CloudWatch
Management: AWS Management Console
Tools: AWS Toolkit for Eclipse
Isolated Networks: Amazon Virtual Private Cloud
Database: Amazon RDS and SimpleDB
Amazon S3
durable
available
Amazon EC2
highly scalable
3,000 CPUs for one firm’s risk management application
[Chart: number of EC2 instances in use over one week, Thursday through Wednesday, climbing from the hundreds into the thousands]
highly available systems
dynamic
fault tolerant
US East Region
Availability Zone A
Availability Zone B
Availability Zone C
Availability Zone D
2. one size does not fit all
2. one size does not fit all^data
many data types
structured data
using the right data store
(a) feature first
RDBMS
Oracle, SQL Server, DB2, MySQL, Postgres
Source: http://www.bioinformaticszen.com/
use a bigger computer
remove joins
scaling limits
(b) scale first
scale is highest priority
single RDBMS incapable
solution 1: data sharding
10’s
100’s
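Sharding splits one logical table across tens or hundreds of database instances, typically by hashing a key so each record has a fixed home. A minimal sketch (the shard count and key names are illustrative):

```python
import hashlib

N_SHARDS = 16  # tens to hundreds in practice

def shard_for(key: str) -> int:
    # Stable hash so the same key always lands on the same shard.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % N_SHARDS

shards = [dict() for _ in range(N_SHARDS)]  # stand-ins for databases

def put(key: str, value):
    shards[shard_for(key)][key] = value

def get(key: str):
    return shards[shard_for(key)].get(key)

put("sample:42", {"organism": "H. sapiens"})
print(shard_for("sample:42"), get("sample:42"))
```

The cost is that cross-shard joins and transactions are no longer free, which is why sharding is a "scale first" rather than "feature first" choice.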
solution 2: scalable key-value store
scale is design point
MongoDB, Project Voldemort, Cassandra, HBase, BigTable, Amazon SimpleDB, Dynamo
(c) simple structured storage
simple, fast
low ops cost
BerkeleyDB, Tokyo Cabinet, Amazon SimpleDB
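These stores trade query power for a dictionary-style API and low operational cost. Python's standard-library `dbm` module has the same flavor of file-backed key-value storage; a minimal sketch (illustrating the model, not any of the systems named above):

```python
import dbm
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "samples.db")

with dbm.open(path, "c") as db:        # "c": create if missing
    db["sample:1"] = "H. sapiens"      # keys and values stored as bytes
    db["sample:2"] = "M. musculus"

with dbm.open(path, "r") as db:        # reopen read-only: data persisted
    print(db["sample:1"].decode())     # -> H. sapiens
```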
(d) purpose optimized stores
data warehousing
stream processing
Aster Data, Vertica, Netezza, Greenplum, VoltDB, StreamBase
what about files?
cluster file systems
Lustre, GlusterFS
distributed file systems
HDFS, GFS
distributed object store
Amazon S3, Dynomite
[Diagram: AWS services surrounding "Your Custom Applications and Services"]
Compute: Amazon Elastic Compute Cloud (EC2), Elastic Load Balancing, Auto Scaling
Storage: Amazon Simple Storage Service (S3), AWS Import/Export
Content Delivery: Amazon CloudFront
Messaging: Amazon Simple Queue Service (SQS)
Payments: Amazon Flexible Payments Service (FPS)
On-Demand Workforce: Amazon Mechanical Turk
Parallel Processing: Amazon Elastic MapReduce
Monitoring: Amazon CloudWatch
Management: AWS Management Console
Tools: AWS Toolkit for Eclipse
Isolated Networks: Amazon Virtual Private Cloud
Database: Amazon RDS and SimpleDB
3. processing big data
disk reads/writes: slow & expensive
data processing: fast & cheap
distribute the data: parallel reads
data processing for the cloud
distributed file system(HDFS)
map/reduce
Via Cloudera under a Creative Commons License
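The map/reduce model in one breath: a map function emits key-value pairs from each input record, the framework sorts and groups the pairs by key, and a reduce function folds each group into a result. A minimal in-process sketch of the word-count example that Hadoop tutorials use (no Hadoop required):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit (word, 1) for every word in the input record.
    for word in line.split():
        yield (word, 1)

def reducer(word, counts):
    # Fold all values for one key into a single result.
    return (word, sum(counts))

lines = ["big data big biology", "big data"]
pairs = [kv for line in lines for kv in mapper(line)]
pairs.sort(key=itemgetter(0))              # the "shuffle/sort" phase
results = [reducer(k, (v for _, v in g))
           for k, g in groupby(pairs, key=itemgetter(0))]
print(results)  # [('big', 3), ('biology', 1), ('data', 2)]
```

Because mappers see records independently and reducers see one key at a time, both phases parallelize across machines, which is where the fault tolerance and massive scalability below come from.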
fault tolerance
massive scalability
petabyte scale
hosted hadoop service
hadoop easy and simple
[Diagram: Amazon Elastic MapReduce. Input data is loaded into an S3 bucket; the application is deployed via the web console or command-line tools; Elastic MapReduce runs Hadoop across Amazon EC2 instances; results land in an output S3 bucket and the user is notified at the end]
back to the science
basic informatics workflow
Via Christolakis under a CC-BY-NC-ND license
Via Argonne National Labs under a CC-BY-SA license
killer app
getting the data
Register projects
Register samples
Sample prep
Sequencing
Analysis
These slides cover work presented by Matt Wood at various conferences
Image: Matt Wood
constant change
flexible data capture
virtual fields
no schema
specify at run time
specify at run time (bootstrapping)
Sample
Name
Organism
Concentration
Source: Matt Wood
key value pairs
change happens
Sample
Name
Organism
Concentration
Sample
Name
Organism
Concentration
Origin
Quality metric
V1 V2
Source: Matt Wood
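With no fixed schema, each sample is just a bag of key-value pairs, so adding the Origin and Quality metric fields between V1 and V2 needs no migration: old records stay valid and readers tolerate missing fields. A sketch of the idea (the field names follow the slides; the values and code are illustrative):

```python
# V1 record: fields specified at run time, no schema to declare.
sample_v1 = {
    "Name": "sample-001",
    "Organism": "H. sapiens",
    "Concentration": "12 ng/ul",
}

# V2 record: change happens. New fields appear alongside the old
# ones, and existing V1 records remain valid as-is.
sample_v2 = dict(sample_v1)
sample_v2["Origin"] = "blood"           # illustrative value
sample_v2["Quality metric"] = 0.97      # illustrative value

for record in (sample_v1, sample_v2):
    # Readers default missing fields instead of failing.
    print(record["Name"], record.get("Quality metric", "n/a"))
```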
high throughput
lots of pipelines
scaling projects/pipelines?
lots of apps
loosely coupled
automation
scale operationally
be agile
now what?
Via asklar under a CC-BY license
Via Argonne National Labs under a CC-BY-SA license
many data types
changing data types
Shaq Image: Keith Allison under a CC-BY-SA license
?
lots and lots and lots and lots and lots and lots of data and lots and lots and lots of data
By bitterlysweet under a CC-BY-NC-ND license
Source: http://bit.ly/anderson-bigdata
Chris Anderson doesn’t understand science
“more is different”
few data points
elaborate models
the unreasonable effectiveness of data
Source: “The Unreasonable Effectiveness of Data”, Alon Halevy, Peter Norvig, and Fernando Pereira
simple models, lots of data
information platform
information platforms at scale
one organization
4 TB daily added(compressed)
135 TB data scanned daily(compressed)
15 PB data total capacity
???
Facebook data from Ashish Thusoo’s HadoopWorld 2009 talk
not always that big
can we learn any lessons?
Source: “Information Platforms and the Rise of the Data Scientist”, Jeff Hammerbacher in Beautiful Data
analytics platform
Data warehouse
A data warehouse is a repository of an organization's electronically stored data. Data warehouses are designed to facilitate reporting and analysis.
ETL
extract
transform
load
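Extract, transform, load in miniature: pull raw records from a source, normalize them, and load them into a queryable store. A toy sketch with sqlite standing in for the warehouse (all names and data are illustrative):

```python
import csv
import io
import sqlite3

raw = "name,reads\nsample-1,1200000\nsample-2,900000\n"   # source feed

# Extract: parse the raw feed into records.
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: cast types and derive a reads-in-millions column.
clean = [(r["name"], int(r["reads"]), int(r["reads"]) // 1_000_000)
         for r in rows]

# Load: insert into the warehouse table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE samples (name TEXT, reads INT, mreads INT)")
db.executemany("INSERT INTO samples VALUES (?, ?, ?)", clean)

print(db.execute("SELECT name, mreads FROM samples").fetchall())
```

The scaling complaint in the next slides is exactly about this pipeline: the extract and transform stages grow with the data, while the warehouse at the end stays fixed.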
Via asklar under a CC-BY license
1 TB
MySQL --> Oracle
more data
more data types
changing data types
limit data warehouse
too limited
how do you scale and adapt?
100’s of TBs
1000’s of jobs
back to the science
back in the day
small data sets
flat files
[Diagram: directory tree: folder1/, folder2/, ..., folderN/, each holding file1, file2, ..., fileN]
shared file system
RDBMS
Image: Wikimedia Commons
Image: Chris Dagdigian
need to process
need to analyze
100’s of TBs
1000’s of jobs
Facebook data from Ashish Thusoo’s HadoopWorld 2009 talk
ETL
Via asklar under a CC-BY license
data mining&
analytics
Via Argonne National Labs under a CC-BY-SA license
analysts are not programmers
not savvy with map/reduce
apache hive
http://hadoop.apache.org/hive/
manage & query data
manage & query dataon top of Hadoop
work by @peteskomoroch
apache pig
http://hadoop.apache.org/pig/
[Diagram: Amazon Elastic MapReduce. Input data is loaded into an S3 bucket; the application is deployed via the web console or command-line tools; Elastic MapReduce runs Hadoop across Amazon EC2 instances; results land in an output S3 bucket and the user is notified at the end]
hadoop and bioinformatics
High Throughput Sequence Analysis
Mike Schatz, University of Maryland
Short Read Mapping
Seed & Extend
Good alignments must have significant exact alignment
Minimal exact alignment length = l/(k+1)
Expensive to scale
Need parallelization framework
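The l/(k+1) bound is the pigeonhole principle: split a read of length l into k+1 pieces; at most k mismatches can corrupt at most k of them, so at least one piece of length l/(k+1) must align exactly and can serve as the seed. A quick sketch (the 36 bp / 3-mismatch numbers are illustrative, not from the talk):

```python
# Pigeonhole behind seed-and-extend: a length-l read aligned with at
# most k mismatches must contain an exact match of length >= l // (k+1).
def min_exact_seed(l: int, k: int) -> int:
    return l // (k + 1)

# e.g. 36 bp reads allowing up to 3 mismatches:
print(min_exact_seed(36, 3))  # 9: one of the four 9-mers must be exact
```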
CloudBurst
Catalog k-mers Collect seeds End-to-end alignment
http://cloudburst-bio.sourceforge.net; Bioinformatics 2009 25: 1363-1369
Bowtie: Ultrafast short read aligner
Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10 (3): R25.
SOAPSnp: Consensus alignment and SNP calling
Ruiqiang Li, Yingrui Li, Xiaodong Fang, et al. (2009) "SNP detection for massively parallel whole-genome resequencing" Genome Res
Crossbow: Rapid whole genome SNP analysis
Ben Langmead
http://bowtie-bio.sourceforge.net/crossbow/index.shtml
Preprocessed reads
Map: Bowtie
Sort: bin and partition
Reduce: SoapSNP
Crossbow condenses over 1,000 hours of resequencing computation into a few hours without requiring the user to own or operate a computer cluster
Comparing Genomes
Estimating relative evolutionary rates from sequence comparisons:
Identification of probable orthologs
A B C D E
S. cerevisiae C. elegans
species tree / gene tree
Admissible comparisons: A or B vs. D; C vs. E
Inadmissible comparisons: A or B vs. E; C vs. D
1. Orthologs found using the reciprocal smallest distance algorithm
2. Build alignment between two orthologs
>Sequence C
MSGRTILASTIAKPFQEEVTKAVKQLNFT-----PKLVGLLSNEDPAAKMYANWTGKTCESLGFKYEL-…
>Sequence E
MSGRTILASKVAETFNTEIINNVEEYKKTHNGQGPLLVGFLANNDPAAKMYATWTQKTSESMGFRYDL…
3. Estimate distance given a substitution matrix
[Table: amino-acid substitution rate matrix over Phe, Ala, Pro, Leu, Thr]
[Diagram: reciprocal smallest distance. Genes a, b, and c from Genome I are each aligned against genes in Genome J and distances are calculated (D = 0.1 through 1.2); the reciprocal pair with the smallest distance, Ib vs. Jc at D = 0.1, is called as an ortholog pair]
RSD algorithm summary
Prof. Dennis WallHarvard Medical School
Roundup is a database of orthologs and their evolutionary distances.To get started, click browse. Alternatively, you can read our documentation here.
Good luck, researchers!
massive computational demand
1000 genomes = 5,994,000 processes = 23,976,000 hours (≈ 2,737 years)
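The totals are consistent with all-against-all genome pairs, each run under several parameter settings. A back-of-envelope check (the 12 parameter settings per pair and ~4 CPU-hours per process are assumptions chosen to reproduce the slide's numbers, not figures from the talk):

```python
from math import comb

genomes = 1000
pairs = comb(genomes, 2)     # all-against-all genome pairs: 499,500
processes = pairs * 12       # assumed parameter settings per pair
hours = processes * 4        # assumed ~4 CPU-hours per process
years = hours / 8760         # wall-clock years on a single CPU

print(pairs, processes, hours, round(years))
# 499500 5994000 23976000 2737
```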
compared 50+ genomes
trends in data sharing
data motion is hard
cloud services are a viable dataspace
share data
share applications
share results
http://aws.amazon.com/publicdatasets/
[Diagram: data platform plus app platform combined into a scalable data platform, exposing services and APIs; getters, filters, and savers do the work]
to conclude
big data
change thinking
data managementdata processing
data sharing
think distributed
new software architectures
new computing paradigms
cloud services
the cloud works
[email protected]
Twitter: @mndoci
Presentation ideas from @mza, James Hamilton, and @lessig
Thank you!