April 19, 2023
The Cloud Workloads Archive: A Status Report
Berkeley, CA, USA
Alexandru Iosup
Parallel and Distributed Systems Group, Delft University of Technology, The Netherlands
Rean Griffith, Andrew Konwinski, Matei Zaharia, Ali Ghodsi, Ion Stoica
RADLab, University of California, Berkeley, USA
Special thanks to Ion for this opportunity!
About the Team
• Recent Work in Performance
  • The Grid Workloads Archive (Nov 2006)
  • The Failure Trace Archive (Nov 2009)
  • Analysis of Facebook, Yahoo, and Google data center workloads (2009-2010)
  • The Peer-to-Peer Trace Archive (Apr 2010)
  • Tools: GrenchMark workload-based grid benchmarking, RAIN
• Speaker: Alexandru Iosup
  • Systems work: Tribler (P2P file sharing), Koala (grid scheduling), POGGI and CAMEO (massively multiplayer online gaming)
  • Performance evaluation of clouds for sci.comp.: EC2 & three others
  • Team of 15+ active collaborators in NL, AT, RO, US
  • Happy to be in Berkeley until September
Traces: Sine Qua Non in Comp.Sys.Res.
• “My system/method/algorithm is better than yours (on my carefully crafted workload)”
  • Unrealistic (trivial): Prove that ‘prioritize jobs from users whose name starts with A’ is a good scheduling policy
  • Realistic? 85% jobs are short, 15% are long
  • Major problem in Computer Systems research
• Workload Trace = recording of real activity from a (real) system, often as a sequence of jobs / requests submitted by users for execution
  • Main use: compare and cross-validate new job and resource management techniques and algorithms
  • Major problem: obtaining and using real workload traces
Previous Data Sharing Efforts
• Critical datasets in computer science
  • Grid Workloads Archive
  • Failure Trace Archive
  • Peer-to-Peer Trace Archive
  • Game Trace Archive (soon)
  • … PWA, ITA, CRAWDAD, …
• 1,000s of scientists
• From theory to practice
Research Question: Are data center workloads unique? (vs GWA, PWA, …)
[Figure: dataset size (axis ticks 1GB, 10GB, 100GB, 1TB; marker 1TB/yr) versus year (‘06, ‘09, ‘10, ‘11) for the trace archives; labels include P2PTA and GamTA.]
Agenda
1. Introduction & Motivation
2. The Cloud Workloads Archive: What’s in a Name?
3. Format and Tools
4. Contents
5. Analysis & Modeling
6. Applications
7. Take Home Message
The Cloud Workloads Archive (CWA): What’s in a Name?
CWA = Public collection of cloud/data center workload traces and of tools to process these traces; allows us to:
1. Compare and cross-validate new job and resource management techniques and algorithms, across various workload traces
2. Determine which (part of a) trace is most interesting for a specific job and resource management technique or algorithm
3. Design a general model for data center workloads, and validate it with various real workload traces
4. Evaluate the generality of a particular workload trace, to determine if results are biased towards a particular trace
5. Analyze the evolution of workload characteristics across long timescales, both intra- and inter-trace
One Format Fits Them All
• Flat format
  • Job and Tasks
  • Summary (20 unique data fields) and Detail (60 fields)
• Categories of information
  • Shared with GWA, PWA: Time, Disk, Memory, Net
  • Jobs/Tasks that change resource consumption profile
  • MapReduce-specific (two-thirds of the data fields)
A. Iosup, R. Griffith, A. Konwinski, M. Zaharia, A. Ghodsi, I. Stoica, Data Format for the Cloud Workloads Archive, v.3, 13/07/10
CWJ CWJD CWT CWTD
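The slide describes a flat format with job-level and task-level records, each in a short summary form and a longer detail form. A minimal Python sketch of what such records could look like follows. Every field name below is hypothetical and chosen for illustration; the actual 20 summary and 60 detail fields are defined in the format document cited above and are not reproduced here.

```python
from dataclasses import dataclass, fields


@dataclass
class JobSummary:
    # Hypothetical subset of summary fields; names are illustrative only.
    job_id: str
    user_id: str
    submit_time: float  # seconds since the start of the trace
    wait_time: float    # seconds spent queued
    run_time: float     # seconds spent running
    num_tasks: int


@dataclass
class TaskSummary:
    # Hypothetical task-level record; task_type is "M" (map) or "R" (reduce).
    job_id: str
    task_id: str
    task_type: str
    start_time: float
    run_time: float


def to_flat_line(record) -> str:
    """Serialize one record as a single whitespace-separated line."""
    return " ".join(str(getattr(record, f.name)) for f in fields(record))


def parse_job_line(line: str) -> JobSummary:
    """Parse a line produced by to_flat_line back into a JobSummary."""
    job_id, user_id, submit, wait, run, ntasks = line.split()
    return JobSummary(job_id, user_id, float(submit), float(wait),
                      float(run), int(ntasks))


job = JobSummary("j1", "u7", 0.0, 12.5, 165.0, 3)
print(to_flat_line(job))
```

One line per record keeps conversion tools simple: each source trace only needs a converter that emits these lines, and the analysis tools can then read any trace the same way.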
CWA Contents: Large-Scale Workloads
• Tools
  • Convert to CWA format
  • Analyze and model automatically → Report
Trace ID  System      Size J/T/Obs   Period       Notes
CWA-01    Facebook    1.1M/-/-       5m/2009      Time & IO
CWA-02    Yahoo M     28K/28M/-      20d/2009     ~Full detail
CWA-03    Facebook 2  61K/10M/-      10d/2009     Full detail
CWA-04    Facebook 3  ?/?/-          10d/01-2010  Full detail
CWA-05    Facebook 4  ?/?/-          3m/02+2010   Full detail
CWA-06    Google 2    25 Aug 2010
CWA-07    eBay        23 Sep 2010
CWA-08    Twitter     Need help!
CWA-09?   Google      9K/177K/4M     7h/2009      Coarse, Period
Types of Analysis
Analysis Type
1. Basic statistics
2. Evolution over time
3. Correlations
Data Break-down
1. Overall
2. By Task Type (M/R)
3. By App. Type (ID)
4. By User (ID)
5. By Duration (Short)
Analysis Focus
1. Time-related
  • Run, Wait, Resp.Time
  • Bounded Slowdown
2. Structure-related
  • Number of tasks
3. IO-related
  • IO sizes and ratios
4. Status-related
5. Sys. Utilization-related
  • Counts/Ratios
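The time-related metrics listed above can be sketched in a few lines of Python. The `Job` record and its field names are hypothetical. Bounded slowdown is written in its commonly used form, max(response time / max(run time, tau), 1). The threshold tau = 10 s is a conventional choice from the scheduling literature and an assumption here, since the slides do not state which value was used.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical record: timestamps in seconds since the start of the trace.
    submit: float
    start: float
    end: float


def wait_time(job: Job) -> float:
    return job.start - job.submit


def run_time(job: Job) -> float:
    return job.end - job.start


def response_time(job: Job) -> float:
    return job.end - job.submit


def bounded_slowdown(job: Job, tau: float = 10.0) -> float:
    """Slowdown with run time bounded below by tau, never less than 1."""
    return max(response_time(job) / max(run_time(job), tau), 1.0)


job = Job(submit=0.0, start=30.0, end=90.0)
print(wait_time(job), run_time(job), response_time(job), bounded_slowdown(job))
```

The bound matters for workloads dominated by very short tasks: without it, a one-second task that waits a few seconds would show a large slowdown and skew the averages.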
Types of Analysis: Sys.U., Over Time, By RunTime
• Also 1h, 10mins, … counting intervals
• Study Short-/Long-Range Dependence (self-similarity)
• Also Job count, Running/Waiting counts, …
• Study system utilization behavior
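As a rough illustration of the counting-interval idea, the sketch below bins job arrivals into fixed-length intervals and counts the jobs running at a given instant. The function names and the sample data are hypothetical. It is not the archive's tooling, only one simple way to produce the time series that such an analysis starts from.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def arrivals_per_interval(submit_times: Sequence[float],
                          interval_s: float) -> List[int]:
    """Count job arrivals in consecutive intervals of interval_s seconds.

    Assumes non-negative submit times measured from the start of the trace.
    """
    counts = Counter(int(t // interval_s) for t in submit_times)
    if not counts:
        return []
    # Fill in empty intervals so the series has no gaps.
    return [counts.get(i, 0) for i in range(max(counts) + 1)]


def running_count(jobs: Sequence[Tuple[float, float]], t: float) -> int:
    """Number of jobs running at time t, given (start, end) pairs."""
    return sum(1 for start, end in jobs if start <= t < end)


# Hypothetical submit times in seconds, counted in 10-minute intervals.
print(arrivals_per_interval([0, 10, 599, 600, 1900], 600))  # [3, 1, 0, 1]
```

Re-running the same count with different interval lengths (10 minutes, 1 hour, …) gives the series at several timescales, which is the usual input for studying short- and long-range dependence.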
Modeling Process
• Well-known prob. distrib.
  • Normal, Exp, LogNormal, Gamma, Weibull, Gen-Pareto
• MLE to fit
  • Fit the parameters of a known distribution to the empirical distribution
• Goodness-of-Fit
  • Assess how good the fit is; select best-fitting distribution
  • Kolmogorov-Smirnov: sensitive to body of distribution + D stat
  • Anderson-Darling: sensitive to tails of distribution
  • Hybrid method*: works for very large populations
*Kondo et al., Failure Trace Archive, CCGrid’10, Best Paper Award.
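The fit-then-assess loop can be sketched with the Python standard library alone. This is a simplified illustration, not the authors' tooling: it covers only the Exponential and LogNormal candidates, because their maximum-likelihood estimates have closed forms, and it ranks them by the Kolmogorov-Smirnov D statistic. Anderson-Darling and the hybrid method are not implemented. All function names are hypothetical.

```python
import math
from statistics import NormalDist
from typing import Callable, Dict, Sequence, Tuple

Cdf = Callable[[float], float]


def fit_exponential(xs: Sequence[float]) -> Cdf:
    """MLE for Exp: rate = 1 / sample mean."""
    rate = len(xs) / sum(xs)

    def cdf(x: float) -> float:
        return 1.0 - math.exp(-rate * x) if x > 0 else 0.0
    return cdf


def fit_lognormal(xs: Sequence[float]) -> Cdf:
    """MLE for LogNormal: mean and (biased) std of log(x).

    Needs positive samples with at least two distinct values.
    """
    logs = [math.log(x) for x in xs]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    normal = NormalDist(mu, sigma)

    def cdf(x: float) -> float:
        return normal.cdf(math.log(x)) if x > 0 else 0.0
    return cdf


def ks_statistic(xs: Sequence[float], cdf: Cdf) -> float:
    """Largest gap D between the empirical CDF and the fitted CDF."""
    ordered = sorted(xs)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def best_fit(xs: Sequence[float]) -> Tuple[str, Dict[str, float]]:
    """Fit each candidate by MLE and return the one with the smallest D."""
    candidates = {"Exp": fit_exponential(xs), "LogNormal": fit_lognormal(xs)}
    scores = {name: ks_statistic(xs, cdf) for name, cdf in candidates.items()}
    return min(scores, key=scores.get), scores
```

Calling `best_fit(run_times)` on a list of positive run times returns the name of the better-fitting candidate together with both D values. Because D is driven by the body of the distribution, a tail-sensitive test is still needed before drawing conclusions about the very long tasks.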
Main Results: Basic Stats
• MapReduce vs Grid workloads [vs Parallel Prod. Env.]
  • Massive short tasks vs Many long tasks vs Few very long tasks
  • Fewer users for MapReduce environments?
• TODO: Analyse amounts per core
Trace ID  TRunTime [s]  #Tasks/Job   Pk.Arr.Rate/D  # users
CWA-01    165J          n/a          21KJ/-T        n/a
CWA-02    512/80med     901/712Map   6KJ/3.2MT      n/a
CWA-03    433/86med     153/143Map   8KJ/2MT        18
GWA-T1    370           5-20         -/20KT         332
GWA-T3    89,274        5-20         -/8KT          387
GWA-T6    14,599        5-20         -/22.5KT       206
GWA-T10   31,964        5-20         -/1.6KTph      216
GWA-T11   8,971         5-20         -/22KTph       412
Applications
1. Mesos running mixtures of workloads
  • Workloads: MPI, MapReduce, grid, …
  • Find bottlenecks
  • Find workloads that are particularly difficult to run
  • Improve the system!
  • Status: in progress, using cluster in Finland (Petri Savolainen)
2. All the apps typical to trace-based work: design, validation, and comparison of algorithms, methods, and systems.
Take Home Message
• Cloud Workloads Archive
  • Datasets
  • Tools to convert, analyze, and model the datasets
  • Need your help to collect more traces
• Converted and analyzed three MapReduce workloads
  • Different from grid and parallel production environment workloads (ask about additional proof and let me show a couple more slides)
  • Invariants?
• Applications
  • 1: Model of Cloud/MapReduce workloads
  • 2: Test and improve Mesos
Continuing Our Collaboration
• Scheduling mixtures of grid/HPC/cloud workloads
  • Scheduling and resource management in practice
  • Modeling aspects of cloud infrastructure and workloads
• Condor on top of Mesos
• Massively Social Gaming and Mesos
  • Step 1: Game analytics and social network analysis in Mesos
• …
Alex Iosup, Rean Griffith, Andrew Konwinski, Matei Zaharia, Ali Ghodsi, Ion Stoica
Thank you! Questions? Observations?
More Information:
• The Grid Workloads Archive: gwa.ewi.tudelft.nl
• The Failure Trace Archive: fta.inria.fr
• The GrenchMark perf. eval. tool: grenchmark.st.ewi.tudelft.nl
• Cloud research: www.st.ewi.tudelft.nl/~iosup/research_cloud.html
• see PDS publication database at: www.pds.twi.tudelft.nl/
email: [email protected]
Big thanks to our collaborators: U. Wisc.-Madison, U Chicago, U Dortmund, U Innsbruck, LRI/INRIA Paris, INRIA Grenoble, U Leiden, Politehnica University of Bucharest, Technion, …
Thanks for all: AliG, Andrew, AndyK, Ari, Beth, Blaine, David, Ion, Justin, Lucian, Matei, Petri, Rean, Tim, …
Additional Slides
Main Results: Basic Stats
• MapReduce vs Grid workloads
  • IO-intensive vs Compute-intensive
  • Constant Wr[%] ~ 40% of IO for MapReduce traces?
• TODO: More MapReduce traces to validate findings
Trace ID  Total IO [MB]  Rd. [MB]  Wr [%]  HDFS Wr [MB]
CWA-01    10,934         6,805     38%     1,538
CWA-02    75,546         47,539    37%     8,563
CWA-03    -              -         -       -
GWA12.1   469            174       63%     n/a
GWA12.2   144            114       21%     n/a
GWA12.3   161            130       19%     n/a
GWA12.4   389            33        92%     n/a
GWA12.5   330            31        91%     n/a
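The Wr [%] column appears to be the share of total IO that is not reads, that is (Total IO − Rd.) / Total IO. This reading is inferred from the numbers in the table, not stated on the slide. The short Python check below recomputes the column under that assumption; the function name is hypothetical.

```python
def write_percent(total_mb: float, read_mb: float) -> float:
    """Share of total IO that is writes, assuming writes = total - reads."""
    return 100.0 * (total_mb - read_mb) / total_mb


# (Total IO [MB], Rd. [MB]) pairs copied from the table; CWA-03 has no data.
rows = {
    "CWA-01": (10934, 6805),
    "CWA-02": (75546, 47539),
    "GWA12.1": (469, 174),
    "GWA12.2": (144, 114),
    "GWA12.3": (161, 130),
    "GWA12.4": (389, 33),
    "GWA12.5": (330, 31),
}

for trace, (total, read) in rows.items():
    print(f"{trace}: Wr = {write_percent(total, read):.0f}%")
```

Under this reading, the two MapReduce traces land at 38% and 37%, which is the near-constant value the slide asks about, while the grid traces range from 19% to 92%.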
Main Results
• Two-mode trace: do NOT analyze it as a whole