1
Experiences Teaching MapReduce in the Clouds
Ari Rabkin, Charles Reiss, Randy Katz, David Patterson
University of California, Berkeley
2
Introduction: What we did
• Hadoop MapReduce performance benchmarking
• 300 students, 80 cores per student (in one semester)
• 2400 cores
• Impossible without the cloud
3
Context: Teaching varieties of parallelism
• Instruction (e.g. pipelining), Data (e.g. vector instructions), Request (e.g. replicated webservers), …
• We were teaching many of these in a sophomore course
• This talk focuses on task parallelism
4
Task parallelism
• Our example: MapReduce
• Sophomores wrote a MapReduce program and ran it in a distributed environment
• Observed speedup
• On a large dataset using real-world tools
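"Speedup" here is the usual metric: serial run time divided by parallel run time, with linear speedup on p cores meaning a factor of p. A minimal sketch with hypothetical timings (not measurements from the course):

```java
public class SpeedupDemo {

    // speedup = T(1 core) / T(p cores); efficiency = speedup / p
    static double speedup(double tSerial, double tParallel) {
        return tSerial / tParallel;
    }

    static double efficiency(double tSerial, double tParallel, int cores) {
        return speedup(tSerial, tParallel) / cores;
    }

    public static void main(String[] args) {
        // Hypothetical: 400s on one core, 10s on 40 cores.
        System.out.println(speedup(400.0, 10.0));        // 40.0 -> linear
        System.out.println(efficiency(400.0, 10.0, 40)); // 1.0
    }
}
```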
5
Others have taught MapReduce
• As a programming paradigm [Johnson '08]
• As part of an elective "big data" analysis course
[Aaron '08, Lin '10, Couch '10]
6
Unlike prior work, we
• Cared about performance and its implementation on a cluster
• Taught sophomores
• Emphasized cost and economics
7
Outline
• Motivation: MapReduce and why it matters
• Assignment goals and design
• Experiences
o challenges for students
o challenges for instructors
8
MapReduce: Why it matters
• Trend of "big data"
o more data collection — smartphones, Internet services, etc.
o cheaper data storage
o cheaper access to data processing capability — public cloud computing providers
• Dominant way to make sense of very large datasets on commodity hardware is MapReduce
o Google, Facebook, IBM, Amazon, many more, …
9
MapReduce: Programming model
[Diagram: input → input records (e.g. page from a web crawl) → "map": a function call per record → key-value pairs (e.g. word -> # of times in record) → group by key → list of values for each key → "reduce": a function call per group → results for each key (e.g. word and its number of occurrences) → output]
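As a concrete illustration, the model above can be simulated in plain Java without Hadoop; `WordCountSketch` and its helpers are illustrative names, not part of the course materials:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory sketch of the model: "map" is called once per input record
// and emits key-value pairs; pairs are grouped by key; "reduce" is
// called once per key with the list of that key's values.
public class WordCountSketch {

    // map: one record (a line of text) -> (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String record) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : record.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) out.add(Map.entry(word, 1));
        }
        return out;
    }

    // reduce: one key and all of its values -> total count
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    static Map<String, Integer> run(List<String> records) {
        Map<String, List<Integer>> groups = new TreeMap<>(); // group by key
        for (String record : records) {
            for (Map.Entry<String, Integer> kv : map(record)) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                      .add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
            result.put(g.getKey(), reduce(g.getKey(), g.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("the quick fox", "the lazy dog")));
        // {dog=1, fox=1, lazy=1, quick=1, the=2}
    }
}
```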
10
MapReduce: Distributed execution
[Diagram: input file partitions feed multiple map tasks; map output is shuffled to reduce tasks, which write output files. Multiple "map", "reduce" calls per task.]
11
Assignment goals
• Measure performance
o Observe parallel speedup
• Non-trivial use of MapReduce
o Multiple stages: output of one MapReduce program used as input to another
• Off-the-shelf tools
o Hadoop (standard industry platform, open source)
12
Why we used cloud computing
• Datacenter-like resources to hundreds of students
o Performance isolation
o Complement teaching about datacenter architecture
• Maximum actual usage of >2400 cores
o Larger than our instructional clusters
o Interference with other instructional users
13
Usage over time
[Chart: cluster usage over time, with annotations at the lab deadline and project deadline]
14
Assignment (Spring)
• Two-stage — co-occurrence ("How associated is a target word with other words?") + sorting (top-K)
• Java — native Hadoop API language
• Dataset of Usenet posts — 8.4GB (compressed size)
inst.eecs.berkeley.edu/~cs61c/sp11/
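A rough, in-memory sketch of that two-stage shape (illustrative only; the real assignment used the Hadoop API over the Usenet dataset, and `CooccurrenceSketch` is a hypothetical name): stage one counts words co-occurring with a target word, and stage two keeps the top K, with stage one's output serving as stage two's input.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class CooccurrenceSketch {

    // Stage 1: for records containing targetWord, count the other words.
    static Map<String, Integer> cooccurrenceCounts(List<String> records,
                                                   String targetWord) {
        Map<String, Integer> counts = new HashMap<>();
        for (String record : records) {
            List<String> words = Arrays.asList(record.split("\\s+"));
            if (!words.contains(targetWord)) continue;
            for (String w : words) {
                if (!w.equals(targetWord)) counts.merge(w, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Stage 2: top-K words by count, descending; consumes stage 1's output.
    static List<String> topK(Map<String, Integer> counts, int k) {
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
            .limit(k)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Integer> stage1 = cooccurrenceCounts(
            List.of("cat dog", "cat fish dog", "dog fish"), "cat");
        System.out.println(topK(stage1, 1)); // [dog]
    }
}
```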
15
Assignment structure (Spring)
1. Laboratory 1 — MapReduce programming
o Against native Hadoop API
o Running on lab machines only (not parallel)
o Trivial MR tasks (fit in lab time)
2. Laboratory 2 — Measuring MR at scale
o Timing, calculations for existing MR programs
o Some design exercises; no new coding
3. Project Part 1 — implement, run locally (smaller datasets)
4. Project Part 2 — time, get working at scale
16
What students achieved
[Chart: measured student speedups; reference line = linear speedup]
17
Debugging difficulties
• First time efficiency mattered for many students
• Long runtime + remote execution → longer debugging cycle
o Real-world problem
18
Efficiency
Most students on par with reference solution
~10 minutes — time on input big enough for MapReduce to make sense (on 40 cores)
Hadoop not well-tuned for small inputs
19
Efficiency
But some students observed very bad performance
Waiting 40+ minutes for results which should take 10 minutes (on 40 cores)
20
Things we learned about our student Java
Integer numSeen;
for (...) {
    ...
    numSeen += 1;
}

for (each word in bigString) {
    ...
    if (bigString.contains(targetWord)) { ... }
}
// and more...
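The snippets above show two recurring inefficiencies: accumulating into a boxed `Integer` (an object allocation per increment) and rescanning the whole string with `contains()` inside a loop over its words (quadratic work). A sketch of the fixes, with hypothetical names:

```java
public class EfficiencyFixes {

    // Count occurrences of targetWord among the words of bigString.
    static int countOccurrences(String bigString, String targetWord) {
        int numSeen = 0; // primitive int: no boxing on each increment
        for (String word : bigString.split("\\s+")) {
            // Compare each word directly rather than calling
            // bigString.contains(targetWord) inside the loop.
            if (word.equals(targetWord)) numSeen += 1;
        }
        return numSeen;
    }

    public static void main(String[] args) {
        System.out.println(countOccurrences("to be or not to be", "be")); // 2
    }
}
```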
21
Using a public cloud provider
• Grant from Amazon ($100 credit/student)
• We wanted:
o More capacity than we could provision internally
o Students use cloud provider like a commercial user
22
Using a public cloud provider
"Backup" billing even with grant
23
What it cost (in grant credits)
Outliers: usually misunderstood tools; tried restarting repeatedly after problems
Most student costs reasonable; each used a "dedicated" cluster of around 80 cores
24
Student satisfaction
• When surveyed, students ranked this project first among the three software projects
o Most students (90% of responders) recommended keeping the project in later semesters
• Students reported that this project impressed potential employers
25
Conclusion/Lessons Learned
• Students wrote a parallel program and ran it against a large data set
o Almost all students ran programs on large datasets and observed parallel speedups
o Early experience for sophomores debugging, deploying programs with large datasets
• First time that students write programs with long enough run-time to measure efficiency
• Public clouds allowed us to demonstrate scale with low per-student costs
26
Other CC uses: long-running servers
• Long-running servers per student or group
• Web/service classes
• No elasticity, low resource usage — cost-effective?
27
Other CC uses: VM per student
• Consistent infrastructure for development
• Way to hand out/in assignments
• With or without a "cloud" to host the VMs
28
Other CC uses: static clusters
• Customized machines for a particular course
• Sometimes done without cost benefit --- cluster kept up for entire semester
29
30
Backup Slides
31
Scripts
• https://github.com/woggling/ec2-wrappers
• Danger! Pre-alpha software!
– Depends on Berkeley infrastructure in several places
– Could spend real money; do not use without understanding
– Requires some manual monitoring
– Documentation is probably incomplete
32
Using a public cloud provider
[Pie chart: 56% / 44% split]