iaas cloud benchmarking: approaches, challenges, and experience

Download IaaS Cloud Benchmarking: Approaches, Challenges, and Experience

Post on 25-Feb-2016




0 download

Embed Size (px)


IaaS Cloud Benchmarking: Approaches, Challenges, and Experience. Alexandru Iosup Parallel and Distributed Systems Group Delft University of Technology The Netherlands. - PowerPoint PPT Presentation


  • ** Our team: Undergrad Nassos Antoniou, Thomas de Ruiter, Ruben Verboon, Grad Siqi Shen, Nezih Yigitbasi, Ozan Sonmez Staff Henk Sips, Dick Epema, Alexandru Iosup Collaborators Ion Stoica and the Mesos team (UC Berkeley), Thomas Fahringer, Radu Prodan (U. Innsbruck), Nicolae Tapus, Mihaela Balint, Vlad Posea (UPB), Derrick Kondo, Emmanuel Jeannot (INRIA), Assaf Schuster, Orna Ben-Yehuda (Technion), Ted Willke (Intel), Claudio Martella (Giraph), IaaS Cloud Benchmarking:Approaches, Challenges, and ExperienceAlexandru Iosup

    Parallel and Distributed Systems Group Delft University of Technology The NetherlandsLecture TCE, Technion, Haifa, Israel

  • Lectures at the Technion Computer Engineering Center (TCE), Haifa, IL**IaaS Cloud BenchmarkingMassivizing Online Social GamesScheduling in IaaS CloudsGamification in Higher EducationA TU Delft perspective on Big Data Processing and PreservationMay 7May 9May 27June 610am Taub 337Actually, HUJIGrateful to Orna Agmon Ben-Yehuda, Assaf Schuster, Isaac Keslassy.Also thankful to Bella Rotman and Ruth Boneh.

  • The Parallel and Distributed Systems Group at TU DelftAugust 31, 2011*

    Home pagewww.pds.ewi.tudelft.nl Publicationssee PDS publication database at publications.st.ewi.tudelft.nl VENIVENIVENI

  • (TU) Delft the Netherlands Europepop.: 100,000 pop: 16.5 M

    founded 13th centurypop: 100,000 founded 1842pop: 13,000 pop.: 100,000 (We are here)

  • **

  • **

  • What is Cloud Computing?3. A Useful IT ServiceUse only when you want! Pay only for what you use!**

  • IaaS Cloud ComputingVENI @larGe: Massivizing Online Games using Cloud Computing Many tasks

  • **HP EngineeringWhich Applications NeedCloud Computing? A Simplistic ViewDemand VariabilityDemand VolumeTaxes, @HomeLowLowHighHighWeb ServerSpace Survey Comet DetectedSky SurveyPharma ResearchSocial GamingOnline GamingSocial NetworkingAnalyticsSW Dev/TestOffice ToolsTsunami PredictionEpidemic SimulationExp. ResearchAfter an idea by Helmut KrcmarOK, so were done here?Not so fast!

  • **Average job size is 1 (that is, there are no [!] tightly-coupled, only conveniently parallel jobs)What I Learned From GridsA. Iosup, C. Dumitrescu, D.H.J. Epema, H. Li, L. Wolters, How are Real Grids Used? The Analysis of Four Grid Traces and Its Implications, Grid 2006.From Parallel to Many-Task ComputingA. Iosup and D.H.J. Epema, Grid Computing Workloads, IEEE Internet Computing 15(2): 19-26 (2011)* The past

  • **What I Learned From GridsNMI Build-and-Test Environment at U.Wisc.-Madison: 112 hosts, >40 platforms (e.g., X86-32/Solaris/5, X86-64/RH/9)Serves >50 grid middleware packages: Condor, Globus, VDT, gLite, GridFTP, RLS, NWS, INCA(-2), APST, NINF-G, BOINC Two years of functionality tests (04-06): over 1:3 runs have at least one failure! Test or perish! For grids, reliability is more important than performance!A. Iosup, D.H.J.Epema, P. Couvares, A. Karp, M. Livny, Build-and-Test Workloads for Grid Middleware: Problem, Analysis, and Applications, CCGrid, 2007.* The past

  • **What I Learned From Grids99.999% reliableSmall Cluster5x decrease in failure rate after first year [Schroeder and Gibson, DSN06]Production Cluster>10% jobs fail [Iosup et al., CCGrid06]DAS-299.99999% reliableServer20-45% failures [Khalili et al., Grid06]TeraGrid27% failures, 5-10 retries [Dumitrescu et al., GCC05]Grid3Grids are unreliable infrastructureA. Iosup, M. Jan, O. Sonmez, and D.H.J. Epema, On the Dynamic Resource Availability in Grids, Grid 2007, Sep 2007.* The past

  • ***What I Learned From Grids,Applied to IaaS CloudsThe path to abundanceOn-demand capacityCheap for short-term tasksGreat for web apps (EIP, web crawl, DB ops, I/O)The killer cyclonePerformance for scientific applications (compute- or data-intensive)Failures, Many-tasks, etc.

    http://www.flickr.com/photos/dimitrisotiropoulos/4204766418/ Tropical Cyclone Nargis (NASA, ISSS, 04/29/08)or

    *We just dont know!

  • ****This Presentation: Research QuestionsQ1: What is the performance of production IaaS cloud services?Q2: How variable is the performance of widely used production cloud services?Q3: How do provisioning and allocation policies affect the performance of IaaS cloud services?Other questions studied at TU Delft: How does virtualization affect the performance of IaaS cloud services? What is a good model for cloud workloads? Etc.Q0: What are the workloads of IaaS clouds?Q4: What is the performance of production graph-processing platforms? (ongoing)But this is Benchmarking = process of quantifying the performance and other non-functional properties of the system

  • Why IaaS Cloud Benchmarking?Establish and share best-practices in answering important questions about IaaS clouds

    Use in procurementUse in system designUse in system tuning and operationUse in performance management

    Use in training**

  • Provide a platform for collaborative research efforts in the areas of computer benchmarking and quantitative system analysisProvide metrics, tools and benchmarks for evaluating early prototypes and research results as well as full-blown implementationsFoster interactions and collaborations btw. industry and academiaMission StatementThe Research Group of the Standard Performance Evaluation CorporationSPEC Research Group (RG)Find more information on: http://research.spec.org* The present

  • Current Members (Dec 2012)Find more information on: http://research.spec.org* The present

  • **AgendaAn Introduction to IaaS Cloud ComputingResearch Questions or Why We Need Benchmarking?A General Approach and Its Main ChallengesIaaS Cloud Workloads (Q0)IaaS Cloud Performance (Q1) and Perf. Variability (Q2)Provisioning and Allocation Policies for IaaS Clouds (Q3)Big Data: Large-Scale Graph Processing (Q4) Conclusion

  • A General Approach for IaaS Cloud Benchmarking*** The present

  • **Approach: Real Traces, Models, and Tools +Real-World Experimentation (+ Simulation)Formalize real-world scenariosExchange real tracesModel relevant operational elementsDevelop calable tools for meaningful and repeatable experimentsConduct comparative studiesSimulation only when needed (long-term scenarios, etc.)Rule of thumb:Put 10-15% project effort into benchmarking* The present

  • 10 Main Challenges in 4 Categories*MethodologicalExperiment compressionBeyond black-box testing through testing short-term dynamics and long-term evolutionImpact of middleware System-RelatedReliability, availability, and system-related propertiesMassive-scale, multi-site benchmarkingPerformance isolation, multi-tenancy models**Workload-relatedStatistical workload modelsBenchmarking performance isolation under various multi-tenancy workloads Metric-RelatedBeyond traditional performance: variability, elasticity, etc.Closer integration with cost models* List not exhaustiveIosup, Prodan, and Epema, IaaS Cloud Benchmarking: Approaches, Challenges, and Experience, MTAGS 2012. (invited paper)Read our article* The future

  • Agenda**An Introduction to IaaS Cloud ComputingResearch Questions or Why We Need Benchmarking?A General Approach and Its Main ChallengesIaaS Cloud Workloads (Q0)IaaS Cloud Performance (Q1) & Perf. Variability (Q2)Provisioning & Allocation Policies for IaaS Clouds (Q3)Big Data: Large-Scale Graph Processing (Q4) Conclusion

    WorkloadsPerformanceVariabilityPoliciesBig Data: Graphs

  • IaaS Cloud Workloads: Our Team**

  • **What Ill Talk AboutIaaS Cloud Workloads (Q0)

    BoTsWorkflowsBig Data Programming ModelsMapReduce workloads

  • What is a Bag of Tasks (BoT)? A System ViewWhy Bag of Tasks? From the perspective of the user, jobs in set are just tasks of a larger jobA single useful result from the complete BoTResult can be combination of all tasks, or a selection of the results of most or even a single task2012-2013* BoT = set of jobs sent by a user that is submitted at most s after the first jobIosup et al., The Characteristics and Performance of Groups of Jobs in Grids, Euro-Par, LNCS, vol.4641, pp. 382-393, 2007.Q0

  • Applications of the BoT Programming ModelParameter sweepsComprehensive, possibly exhaustive investigation of a modelVery useful in engineering and simulation-based science Monte Carlo simulationsSimulation with random elements: fixed time yet limited inaccuracyVery useful in engineering and simulation-based science Many other types of batch processingPeriodic computation, Cycle scavengingVery useful to automate operations and reduce waste2012-2013*Q0

  • BoTs Are the Dominant Programming Model for Grid Computing (Many Tasks) *Iosup and Epema: Grid Computing Workloads. IEEE Internet Computing 15(2): 19-26 (2011)Q0

  • What is a Wokflow?2012-2013* WF = set of jobs with precedence (think Direct Acyclic Graph)Q0

  • Applications of the Workflow Programming ModelComplex applicationsComplex filtering of dataComplex analysis of instrument measurements

    Applications created by non-CS scientists*Workflows have a natural correspondence in the real-world, as descriptions of a scientific procedureVisual model of a graph sometimes easier to program

    Precursor of the MapReduce Programming Model (next slides)2012-2013**Adapted from: Carole Goble and David de Roure, Chapter in The Fourth Paradigm, http://research.microsoft.com/en-us/collaboration/fourthparadigm/ Q0

  • Workflows Exist in Grids, but Did No Evidence of a Dominant Programming ModelTraces

    Selected Findings Loose couplingGraph with 3-4 levelsAverage WF size is 30/44 jobs75%+ WFs are sized 40 jobs or less, 95% are sized 200 jobs or less

    2012-2013*Ostermann et al., On the Characteristics of Grid Workflows, Cor