June 23, 2022 1 The Cloud Workloads Archive: A Status Report Berkeley, CA, USA Alexandru Iosup Parallel and Distributed Systems Group, Delft University of Technology, The Netherlands Rean Griffith, Andrew Konwinski, Matei Zaharia, Ali Ghodsi, Ion Stoica RADLab, University of California, Berkeley, USA Special thanks to Ion for this opportunity!



The Cloud Workloads Archive: A Status Report

Berkeley, CA, USA

Alexandru Iosup

Parallel and Distributed Systems Group, Delft University of Technology, The Netherlands

Rean Griffith, Andrew Konwinski, Matei Zaharia, Ali Ghodsi, Ion Stoica

RADLab, University of California, Berkeley, USA

Special thanks to Ion for this opportunity!


About the Team

• Recent Work in Performance
  • The Grid Workloads Archive (Nov 2006)
  • The Failure Trace Archive (Nov 2009)
  • Analysis of Facebook, Yahoo, and Google data center workloads (2009-2010)
  • The Peer-to-Peer Trace Archive (Apr 2010)
  • Tools: GrenchMark workload-based grid benchmarking, RAIN

• Speaker: Alexandru Iosup
  • Systems work: Tribler (P2P file sharing), Koala (grid scheduling), POGGI and CAMEO (massively multiplayer online gaming)
  • Performance evaluation of clouds for sci.comp.: EC2 & three others
  • Team of 15+ active collaborators in NL, AT, RO, US
  • Happy to be in Berkeley until September


Traces: Sine Qua Non in Comp.Sys.Res.

• “My system/method/algorithm is better than yours (on my carefully crafted workload)”
  • Unrealistic (trivial): Prove that ‘prioritize jobs from users whose name starts with A’ is a good scheduling policy
  • Realistic? 85% jobs are short, 15% are long
  • Major problem in Computer Systems research

• Workload Trace = recording of real activity from a (real) system, often as a sequence of jobs / requests submitted by users for execution
  • Main use: compare and cross-validate new job and resource management techniques and algorithms
  • Major problem: obtaining and using real workload traces
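As a minimal sketch of the main use named above (comparing techniques on a common trace), the following replays one toy trace under two textbook queue policies and reports the mean wait time of each. The `Job` fields, the single-server model, and `TOY_TRACE` are illustrative assumptions; they are not the archive's format or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    job_id: int
    submit_time: float  # seconds since the start of the trace (hypothetical field)
    run_time: float     # seconds of service the job needs (hypothetical field)


def mean_wait(trace, pick_next):
    """Replay the trace on a single server; pick_next chooses among queued jobs."""
    pending = sorted(trace, key=lambda j: j.submit_time)
    queue, clock, total_wait, i = [], 0.0, 0.0, 0
    while i < len(pending) or queue:
        # Admit every job submitted up to the current clock.
        while i < len(pending) and pending[i].submit_time <= clock:
            queue.append(pending[i])
            i += 1
        if not queue:
            clock = pending[i].submit_time  # server idles until the next arrival
            continue
        job = pick_next(queue)
        queue.remove(job)
        total_wait += clock - job.submit_time
        clock += job.run_time
    return total_wait / len(trace)


def fcfs(queue):
    """First-come, first-served: earliest submission first."""
    return min(queue, key=lambda j: j.submit_time)


def sjf(queue):
    """Shortest job first (assumes run times are known in advance)."""
    return min(queue, key=lambda j: j.run_time)


# Toy trace, invented for illustration only.
TOY_TRACE = [
    Job(1, 0.0, 100.0),
    Job(2, 1.0, 10.0),
    Job(3, 2.0, 10.0),
    Job(4, 3.0, 200.0),
    Job(5, 4.0, 10.0),
]

if __name__ == "__main__":
    for name, policy in (("FCFS", fcfs), ("SJF", sjf)):
        print(name, mean_wait(TOY_TRACE, policy))
```

Running the same comparison on several real traces, rather than on one hand-made trace, is what guards against the "carefully crafted workload" problem.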


Previous Data Sharing Efforts

• Critical datasets in computer science
  • Grid Workloads Archive
  • Failure Trace Archive
  • Peer-to-Peer Trace Archive
  • Game Trace Archive (soon)
  • … PWA, ITA, CRAWDAD, …

• 1,000s of scientists
• From theory to practice

Research Question: Are data center workloads unique? (vs GWA, PWA, …)


[Figure: dataset size (axis ticks 1GB, 10GB, 100GB, 1TB; annotation 1TB/yr) versus year (‘06, ‘09, ‘10, ‘11); labeled archives include P2PTA and GamTA]


Agenda

1. Introduction & Motivation
2. The Cloud Workloads Archive: What’s in a Name?
3. Format and Tools
4. Contents
5. Analysis & Modeling
6. Applications
7. Take Home Message


The Cloud Workloads Archive (CWA): What’s in a Name?

CWA = Public collection of cloud/data center workload traces and of tools to process these traces; allows us to:

1. Compare and cross-validate new job and resource management techniques and algorithms, across various workload traces
2. Determine which (part of a) trace is most interesting for a specific job and resource management technique or algorithm
3. Design a general model for data center workloads, and validate it with various real workload traces
4. Evaluate the generality of a particular workload trace, to determine if results are biased towards a particular trace
5. Analyze the evolution of workload characteristics across long timescales, both intra- and inter-trace


One Format Fits Them All

• Flat format
  • Job and Tasks
  • Summary (20 unique data fields) and Detail (60 fields)

• Categories of information
  • Shared with GWA, PWA: Time, Disk, Memory, Net
  • Jobs/Tasks that change resource consumption profile
  • MapReduce-specific (two-thirds data fields)


A. Iosup, R. Griffith, A. Konwinski, M. Zaharia, A. Ghodsi, I. Stoica, Data Format for the Cloud Workloads Archive, v.3, 13/07/10

[Figure labels: CWJ, CWJD, CWT, CWTD]


CWA Contents: Large-Scale Workloads

• Tools
  • Convert to CWA format
  • Analyze and model automatically Report


Trace ID   System       Size J/T/Obs   Period        Notes
CWA-01     Facebook     1.1M/-/-       5m/2009       Time & IO
CWA-02     Yahoo M      28K/28M/-      20d/2009      ~Full detail
CWA-03     Facebook 2   61K/10M/-      10d/2009      Full detail
CWA-04     Facebook 3   ?/?/-          10d/01-2010   Full detail
CWA-05     Facebook 4   ?/?/-          3m/02+2010    Full detail
CWA-06     Google 2     25 Aug 2010
CWA-07     eBay         23 Sep 2010
CWA-08     Twitter      Need help!
CWA-09?    Google       9K/177K/4M     7h/2009       Coarse, Period


Types of Analysis


Analysis Type
1. Basic statistics
2. Evolution over time
3. Correlations

Data Break-down
1. Overall
2. By Task Type (M/R)
3. By App. Type (ID)
4. By User (ID)
5. By Duration (Short)

Analysis Focus
1. Time-related
  • Run, Wait, Resp.Time
  • Bounded Slowdown
2. Structure-related
  • Number of tasks
3. IO-related
  • IO sizes and ratios
4. Status-related
5. Sys. Utilization-related
  • Counts/Ratios
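The time-related metrics listed above can be sketched as follows. This uses the usual definition of bounded slowdown from the scheduling literature; the function names and the 10-second threshold are assumptions for illustration, not values stated in the slides.

```python
def response_time(wait_time, run_time):
    """Response time = time spent waiting plus time spent running."""
    return wait_time + run_time


def bounded_slowdown(wait_time, run_time, threshold=10.0):
    """Slowdown with very short jobs clamped to `threshold` seconds.

    The clamp keeps near-zero run times from producing huge slowdowns;
    the result is never below 1. The default threshold is an assumption.
    """
    return max(1.0, response_time(wait_time, run_time) / max(run_time, threshold))


if __name__ == "__main__":
    print(bounded_slowdown(wait_time=90.0, run_time=10.0))  # long wait, short job
    print(bounded_slowdown(wait_time=5.0, run_time=1.0))    # clamped to 1.0
```

The clamp matters for workloads dominated by short tasks, where plain slowdown would be driven almost entirely by the shortest jobs.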


Types of Analysis: Sys.U., Over Time, By RunTime

• Also 1h, 10mins, … counting intervals
• Study Short-/Long- Range Dependence (self-similarity)

• Also Job count, Running/Waiting counts, …
• Study system utilization behavior
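A minimal sketch of the counting-interval idea above: bin submission times into fixed intervals, then re-aggregate the counts at coarser scales (for example 10-minute bins into 1-hour bins). The function names are hypothetical and the input is assumed to be a list of submission timestamps in seconds.

```python
from collections import Counter


def counts_per_interval(submit_times, interval):
    """Number of arrivals in each consecutive interval, starting at the first arrival."""
    if not submit_times:
        return []
    start = min(submit_times)
    bins = Counter(int((t - start) // interval) for t in submit_times)
    # Include empty intervals so the series has no gaps.
    return [bins.get(k, 0) for k in range(max(bins) + 1)]


def aggregate(counts, m):
    """Sum consecutive groups of m counts; a trailing partial group is dropped."""
    usable = len(counts) - len(counts) % m
    return [sum(counts[i:i + m]) for i in range(0, usable, m)]


if __name__ == "__main__":
    ten_min = counts_per_interval([0, 30, 70, 600, 1250], interval=600)
    print(ten_min)
    print(aggregate([3, 1, 1, 0, 2, 5, 4], 3))
```

Comparing how the variability of such series changes across aggregation levels is one common starting point for studying short- and long-range dependence.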


Modeling Process

• Well-known prob. distrib.
  • Normal, Exp, LogNormal, Gamma, Weibull, Gen-Pareto, …

• MLE to fit
  • Fit known distribution to empirical distribution parameters

• Goodness-of-Fit
  • Assess how good the fit is; select best-fitting distribution
  • Kolmogorov-Smirnov: sensitive to body of distribution + D stat
  • Anderson-Darling: sensitive to tails of distribution
  • Hybrid method*: works for very large populations


*Kondo et al., Failure Trace Archive, CCGrid’10, Best Paper Award.
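The fit-then-test loop described on this slide can be sketched with the standard library alone. This is a simplified illustration, not the archive's tooling: it covers only two of the listed candidates (Exp and LogNormal), uses their closed-form maximum-likelihood estimates, and ranks them by the Kolmogorov-Smirnov D statistic. Anderson-Darling and the hybrid method are not shown, and all function names are hypothetical.

```python
import math


def fit_exponential(xs):
    """MLE for the exponential rate: 1 / sample mean."""
    return len(xs) / sum(xs)


def fit_lognormal(xs):
    """MLE for LogNormal: mean and (biased) standard deviation of the logs."""
    logs = [math.log(x) for x in xs]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma


def exponential_cdf(rate):
    return lambda x: 1.0 - math.exp(-rate * x)


def lognormal_cdf(mu, sigma):
    return lambda x: 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))


def ks_statistic(xs, cdf):
    """D = largest gap between the empirical CDF and the fitted CDF."""
    xs = sorted(xs)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def best_fit(xs):
    """Fit each candidate by MLE and return (name of smallest D, all D values)."""
    candidates = {
        "Exp": exponential_cdf(fit_exponential(xs)),
        "LogNormal": lognormal_cdf(*fit_lognormal(xs)),
    }
    scores = {name: ks_statistic(xs, cdf) for name, cdf in candidates.items()}
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    # Synthetic sample: evenly spaced quantiles of a unit-rate exponential.
    n = 200
    sample = [-math.log(1.0 - (i + 0.5) / n) for i in range(n)]
    print(best_fit(sample))
```

All values must be strictly positive for the LogNormal fit. For very large traces even a tiny D is flagged as a significant mismatch by the standard test, which is the situation the hybrid method on the slide is said to address.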


Main Results: Basic Stats

• MapReduce vs Grid workloads [vs Parallel Prod. Env.]
  • Massive short tasks vs Many long tasks vs Few very long tasks
  • Fewer users for MapReduce environments?

• TODO: Analyse amounts per core


Trace ID   TRunTime [s]   #Tasks/Job   Pk.Arr.Rate/D   # users
CWA-01     165J           n/a          21KJ/-T         n/a
CWA-02     512/80med      901/712Map   6KJ/3.2MT       n/a
CWA-03     433/86med      153/143Map   8KJ/2MT         18
GWA-T1     370            5-20         -/20KT          332
GWA-T3     89,274         5-20         -/8KT           387
GWA-T6     14,599         5-20         -/22.5KT        206
GWA-T10    31,964         5-20         -/1.6KTph       216
GWA-T11    8,971          5-20         -/22KTph        412


Applications

1. Mesos running mixtures of workloads
  • Workloads: MPI, MapReduce, grid, …
  • Find bottlenecks
  • Find workloads that are particularly difficult to run
  • Improve the system!
  • Status: in progress, using cluster in Finland (Petri Savolainen)

2. All the apps typical to trace-based work: design, validation, and comparison of algorithms, methods, and systems.


Take Home Message

• Cloud Workloads Archive
  • Datasets
  • Tools to convert, analyze, and model the datasets
  • Need your help to collect more traces

• Converted and analyzed three MapReduce workloads
  • Different from grid and parallel production environment workloads (ask about additional proof and let me show a couple more slides)
  • Invariants?

• Applications
  • 1: Model of Cloud/MapReduce workloads
  • 2: Test and improve Mesos


Continuing Our Collaboration

• Scheduling mixtures of grid/HPC/cloud workloads
  • Scheduling and resource management in practice
  • Modeling aspects of cloud infrastructure and workloads

• Condor on top of Mesos

• Massively Social Gaming and Mesos
  • Step 1: Game analytics and social network analysis in Mesos

• …


Alex Iosup, Rean Griffith, Andrew Konwinski, Matei Zaharia, Ali Ghodsi, Ion Stoica

Thank you! Questions? Observations?

More Information:

• The Grid Workloads Archive: gwa.ewi.tudelft.nl

• The Failure Trace Archive: fta.inria.fr

• The GrenchMark perf. eval. tool: grenchmark.st.ewi.tudelft.nl

• Cloud research: www.st.ewi.tudelft.nl/~iosup/research_cloud.html

• see PDS publication database at: www.pds.twi.tudelft.nl/

email: [email protected]

Big thanks to our collaborators: U. Wisc.-Madison, U Chicago, U Dortmund, U Innsbruck, LRI/INRIA Paris, INRIA Grenoble, U Leiden, Politehnica University of Bucharest, Technion, …

Thanks for all: AliG, Andrew, AndyK, Ari, Beth, Blaine, David, Ion, Justin, Lucian, Matei, Petri, Rean, Tim, …


Additional Slides


Main Results: Basic Stats

• MapReduce vs Grid workloads
  • IO-intensive vs Compute-intensive
  • Constant Wr[%] ~40% IO for MapReduce traces?

• TODO: More MapReduce traces to validate findings


Trace ID   Total IO [MB]   Rd. [MB]   Wr [%]   HDFS Wr [MB]
CWA-01     10,934          6,805      38%      1,538
CWA-02     75,546          47,539     37%      8,563
CWA-03     -               -          -        -
GWA12.1    469             174        63%      n/a
GWA12.2    144             114        21%      n/a
GWA12.3    161             130        19%      n/a
GWA12.4    389             33         92%      n/a
GWA12.5    330             31         91%      n/a
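The Wr [%] column appears to be the share of total IO that is not reads. That reading is an assumption inferred from the numbers, not something the slides state, but it reproduces every filled row of the table after rounding. A small check, with a hypothetical function name:

```python
def write_percent(total_io_mb, read_mb):
    """Share of total IO that is written, assuming writes = total - reads."""
    return round(100.0 * (total_io_mb - read_mb) / total_io_mb)


# (trace ID, total IO [MB], reads [MB], Wr [%] as printed in the table)
ROWS = [
    ("CWA-01", 10934, 6805, 38),
    ("CWA-02", 75546, 47539, 37),
    ("GWA12.1", 469, 174, 63),
    ("GWA12.2", 144, 114, 21),
    ("GWA12.3", 161, 130, 19),
    ("GWA12.4", 389, 33, 92),
    ("GWA12.5", 330, 31, 91),
]

if __name__ == "__main__":
    for trace_id, total, read, printed in ROWS:
        print(trace_id, write_percent(total, read), printed)
```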


Main Results

• Two-mode trace: do NOT analyze as a whole
