experiences teaching mapreduce in the clouds ari rabkin, charles reiss, randy katz, david patterson...

Post on 23-Dec-2015

1

Experiences Teaching MapReduce in the Clouds

Ari Rabkin, Charles Reiss, Randy Katz, David Patterson

University of California, Berkeley

2

Introduction: What we did

• Hadoop MapReduce performance benchmarking

• 300 students, 80 cores per student (in one semester)

• 2400 cores
• Impossible without the cloud

3

Context: Teaching varieties of parallelism

• Instruction (e.g. pipelining), Data (e.g. vector instructions), Request (e.g. replicated webservers), …

• We were teaching many of these in a sophomore course

• This talk focuses on task parallelism

4

Task parallelism

• Our example: MapReduce

• Sophomores wrote a MapReduce program and ran it in a distributed environment

• Observed speedup

• On a large dataset using real-world tools

5

Others have taught MapReduce

• As a programming paradigm [Johnson '08]
• As part of an elective "big data" analysis course [Aaron '08, Lin '10, Couch '10]

6

Unlike prior work, we

• Cared about performance and its implementation on a cluster

• Taught sophomores
• Emphasized cost and economics

7

Outline

• Motivation: MapReduce and why it matters
• Assignment goals and design
• Experiences
  o challenges for students
  o challenges for instructors

8

MapReduce: Why it matters

• Trend of "big data"
  o more data collection (smartphones, Internet services, etc.)
  o cheaper data storage
  o cheaper access to data processing capability (public cloud computing providers)
• Dominant way to make sense of very large datasets on commodity hardware is MapReduce
  o Google, Facebook, IBM, Amazon, many more, …

9

MapReduce: Programming model

• input: input records (e.g. page from a web crawl)
• "map": a function call per record
• key-value pairs (e.g. word -> # of times in record)
• group by: list of values for each key
• "reduce": a function call per group
• output: results for each key (e.g. word and its number of occurrences)
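The map, group-by, and reduce steps on this slide can be sketched in plain Java, without Hadoop. This is only an illustration of the programming model on a single machine: the class name `WordCountSketch` and its method names are assumptions made for this example, not part of the Hadoop API or of the course materials.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-machine illustration of map -> group by -> reduce (hypothetical names, no Hadoop).
final class WordCountSketch {

    // "map": called once per input record; emits word -> number of times in this record.
    static Map<String, Integer> map(String record) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : record.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // "group by": collect the list of values emitted for each key across all records.
    static Map<String, List<Integer>> groupByKey(List<Map<String, Integer>> mapOutputs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map<String, Integer> output : mapOutputs) {
            for (Map.Entry<String, Integer> pair : output.entrySet()) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        return groups;
    }

    // "reduce": called once per group; sums the per-record counts for one word.
    static int reduce(String key, List<Integer> values) {
        int total = 0;
        for (int value : values) {
            total += value;
        }
        return total;
    }

    // Wire the three steps together: records in, word totals out.
    static Map<String, Integer> run(List<String> records) {
        List<Map<String, Integer>> mapOutputs = new ArrayList<>();
        for (String record : records) {
            mapOutputs.add(map(record));
        }
        Map<String, Integer> results = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : groupByKey(mapOutputs).entrySet()) {
            results.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("the cat sat", "the dog sat down")));
    }
}
```

In a real framework the programmer supplies only the map and reduce functions; the grouping step is done by the framework.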

10

MapReduce: Distributed execution

[Diagram: three input file partitions each feed a map task; the map tasks feed two reduce tasks, and each reduce task writes one output file]

Multiple "map", "reduce" calls per task
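One way to picture how work is divided among tasks is sketched below in plain Java. The slide does not say how records or keys are assigned, so both rules here are assumptions for illustration: input records are split into contiguous partitions (one per map task), and each key is routed to a reduce task by hashing. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of dividing work among map and reduce tasks.
final class PartitionSketch {

    // Split the input records into contiguous partitions, one per map task.
    // Assumes numMapTasks > 0.
    static List<List<String>> partition(List<String> records, int numMapTasks) {
        List<List<String>> partitions = new ArrayList<>();
        int size = (records.size() + numMapTasks - 1) / numMapTasks; // ceiling division
        for (int i = 0; i < numMapTasks; i++) {
            int from = Math.min(i * size, records.size());
            int to = Math.min(from + size, records.size());
            partitions.add(new ArrayList<>(records.subList(from, to)));
        }
        return partitions;
    }

    // Route a key to a reduce task. Masking with Integer.MAX_VALUE clears the sign
    // bit so the remainder is always a valid index in [0, numReduceTasks).
    static int reduceTaskFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        System.out.println(partition(List.of("r1", "r2", "r3", "r4", "r5"), 3));
        System.out.println(reduceTaskFor("word", 2));
    }
}
```

Because the routing depends only on the key, every value for a given key reaches the same reduce task, which is what lets each "reduce" call see the complete group.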

11

Assignment goals

• Measure performance
  o Observe parallel speedup
• Non-trivial use of MapReduce
  o Multiple stages: output of one MapReduce program used as input to another
• Off-the-shelf tools
  o Hadoop (standard industry platform, open source)

12

Why we used cloud computing

• Datacenter-like resources to hundreds of students
  o Performance isolation
  o Complement teaching about datacenter architecture
• Maximum actual usage of >2400 cores
  o Larger than our instructional clusters
  o Interference with other instructional users

13

Usage over time

[Figure: usage over time, with the lab and the project deadline marked]

14

Assignment (Spring)

• Two-stage: co-occurrence (“How associated is a target word with other words?”) + sorting (top-K)

• Java: native Hadoop API language
• Dataset of Usenet posts: 8.4GB (compressed size)

inst.eecs.berkeley.edu/~cs61c/sp11/
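The two-stage shape of the assignment, where the output of one stage is the input of the next, can be sketched in plain Java. The slides do not give the exact association metric, so the one used here (the number of posts in which a word appears together with the target word) is an assumption, as are the class and method names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical two-stage sketch: co-occurrence counting, then top-K by sorting.
final class CoOccurrenceSketch {

    // Stage 1 (assumed metric): for each post containing the target word,
    // count every other distinct word in that post once.
    static Map<String, Integer> coOccurrence(List<String> posts, String target) {
        Map<String, Integer> counts = new HashMap<>();
        for (String post : posts) {
            Set<String> words = new HashSet<>(Arrays.asList(post.toLowerCase().split("\\s+")));
            words.remove("");
            if (!words.contains(target)) {
                continue;
            }
            for (String word : words) {
                if (!word.equals(target)) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Stage 2: consume stage 1's output, sort by count (descending, ties broken
    // alphabetically), and keep the first k words.
    static List<String> topK(Map<String, Integer> counts, int k) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Integer>comparingByKey()));
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        List<String> posts = List.of("apple banana", "apple banana cherry", "banana cherry", "apple date");
        System.out.println(topK(coOccurrence(posts, "apple"), 2));
    }
}
```

On a cluster each stage would be its own MapReduce job, with the first job's output files serving as the second job's input.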

15

Assignment structure (Spring)1. Laboratory 1 — MapReduce programming

o Against native Hadoop APIo Running on lab machines only (not parallel)o Trivial MR tasks (fit in lab time)

2. Laboratory 2 — Measuring MR at scaleo Timing, calculations for existing MR programso Some design excersizes; no new coding

3. Project Part 1 — implement, run locally (smaller datasets)

4. Project Part 2 — time, get working at scale

16

What students achieved

[Figure: speedup plot; the reference line marks linear speedup]
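As a reminder of what "linear speedup" means, the sketch below computes speedup as the ratio of run times on a smaller and a larger cluster, and efficiency as that speedup divided by the ratio of core counts. Speedup is linear when efficiency is 1. The class name and the timings in `main` are hypothetical and are not measurements from the course.

```java
// Hypothetical helper for comparing run times on two cluster sizes.
final class SpeedupSketch {

    // Speedup of the larger cluster relative to the smaller one.
    static double speedup(double smallClusterSeconds, double largeClusterSeconds) {
        return smallClusterSeconds / largeClusterSeconds;
    }

    // Efficiency: achieved speedup divided by the ideal (linear) speedup,
    // where the ideal equals the ratio of core counts.
    static double efficiency(double speedup, int smallClusterCores, int largeClusterCores) {
        double coreRatio = (double) largeClusterCores / smallClusterCores;
        return speedup / coreRatio;
    }

    public static void main(String[] args) {
        // Made-up timings: 2400 s on 20 cores, 600 s on 80 cores.
        double s = speedup(2400.0, 600.0);
        System.out.println("speedup = " + s + ", efficiency = " + efficiency(s, 20, 80));
    }
}
```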

17

Debugging difficulties

• First time efficiency mattered for many students

• Long runtime + remote execution → longer debugging cycle
  o Real-world problem

18

Efficiency

Most students on par with reference solution

~10 minutes — time on input big enough for MapReduce to make sense

Hadoop not well-tuned for small inputs

on 40 cores

19

Efficiency

But some students observed very bad performance

Waiting 40+ minutes for results which should take 10 minutes

on 40 cores

20

Things we learned about our students' Java

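Integer numSeen;
for (...) {
    ...
    numSeen += 1;  // boxed Integer: each increment unboxes, adds, and re-boxes
}

for (each word in bigString) {
    ...
    // contains() rescans all of bigString on every iteration
    if (bigString.contains(targetWord)) {
        ...
    }
}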

// and more...
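A hedged sketch of how the two patterns above might be rewritten follows. The slide elides the loop bodies, so the surrounding logic here is invented purely for illustration, and the class and method names are hypothetical. The two points it shows are: count with a primitive `int` rather than a boxed `Integer`, and evaluate a loop-invariant `contains()` once, outside the loop.

```java
// Hypothetical rewrites of the two slow patterns; loop bodies are illustrative only.
final class LoopFixSketch {

    // A primitive int counter avoids the unbox/add/re-box work on every increment.
    static int countOccurrences(String[] words, String targetWord) {
        int numSeen = 0;
        for (String word : words) {
            if (word.equals(targetWord)) {
                numSeen += 1;
            }
        }
        return numSeen;
    }

    // The contains() result does not change inside the loop, so compute it once
    // instead of rescanning the whole string for every word.
    static int countWordsIfMentioned(String bigString, String targetWord) {
        boolean mentionsTarget = bigString.contains(targetWord);
        int processed = 0;
        for (String word : bigString.split("\\s+")) {
            if (mentionsTarget && !word.isEmpty()) {
                processed += 1;
            }
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println(countOccurrences(new String[] {"a", "b", "a"}, "a"));
        System.out.println(countWordsIfMentioned("x y z", "y"));
    }
}
```

On small lab inputs the difference is invisible; on inputs large enough for MapReduce to make sense, the per-word rescan turns a linear pass into roughly quadratic work.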

21

Using a public cloud provider

• Grant from Amazon ($100 credit/student)

• We wanted:
  o More capacity than we could provision internally
  o Students use cloud provider like commercial user

22

Using a public cloud provider

"Backup" billing even with grant

23

What it cost (in grant credits)

Outliers: usually misunderstood tools; tried restarting repeatedly after problems

Most student costs reasonable

Each used a "dedicated" cluster of around 80 cores.

24

Student satisfaction

• When surveyed, students ranked this project first among the three software projects
  o Most students (90% of responders) recommended keeping the project in later semesters

• Students reported that this project impressed potential employers

25

Conclusion/Lessons Learned

• Students wrote a parallel program and ran it against a large dataset
  o Almost all students ran programs on large datasets and observed parallel speedups
  o Early experience for sophomores debugging, deploying programs with large datasets
• First time that students wrote programs with long enough run-time to measure efficiency
• Public clouds allowed us to demonstrate scale with low per-student costs

26

Other CC uses: long-running servers

• Long-running servers per student or group
• Web/service classes

No elasticity, low resource usage: cost-effective?

27

Other CC uses: VM per student

• Consistent infrastructure for development
• Way to hand out/in assignments
• With or without a “cloud” to host the VMs

28

Other CC uses: static clusters

• Customized machines for a particular course
• Sometimes done without cost benefit: cluster kept up for entire semester

29

30

Backup Slides

31

Scripts

• https://github.com/woggling/ec2-wrappers

• Danger! Pre-alpha software!
  – Depends on Berkeley infrastructure in several places
  – Could spend real money; do not use without understanding
  – Requires some manual monitoring
  – Documentation is probably incomplete

32

Using a public cloud provider

56%

44%
