MapReduce CS-4513 D-term 2008
MapReduce
CS-4513 Distributed Computing Systems
(Slides include materials from Operating System Concepts, 7th ed., by Silberschatz, Galvin, & Gagne; Distributed Systems: Principles & Paradigms, 2nd ed., by Tanenbaum and Van Steen; and Modern Operating Systems, 2nd ed., by Tanenbaum)
Why MapReduce
• An important new model of parallel and distributed computing
• Particularly for problems dealing with “big data”
• An abstraction to automate the mechanics of data handling and to let the programmer concentrate on semantics of the problem
From Operating System course
• Three fundamental models of parallel computing
  – Data Parallelism
  – Task Parallelism
  – Pipelined Parallelism
• Each requires a different set of tools
• Each requires a different mode of thinking
MapReduce
• A new model
• Fundamentally different from previous models
• Shares some elements with each one
• Promise (hope?) of solving new classes of problems that were previously very tedious to solve
• Not in textbooks
• Not in previous Distributed Systems courses at WPI
Learning about MapReduce
• Partition class into four teams
• Each team responsible for understanding and teaching the rest of the class about one subtopic
• 30-40 minutes of class time per team
• Two teams on April 4
• Two teams on April 8
MapReduce subtopics
• The abstraction itself and its algorithms
• Distributed MapReduce
• Class of problems that MapReduce can help solve
• Google File System to support MapReduce
MapReduce abstraction
• Explain the abstraction, what it does, etc.
• Explain the algorithms
• Show non-trivial programming examples
• Focus on how to think about a problem
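For orientation only, the abstraction can be sketched as two user-supplied functions plus a generic driver. The sketch below is a minimal single-process illustration, not the implementation described in the course readings; the names map_fn, reduce_fn, and map_reduce and the sample documents are assumptions made up for this example.

```python
from collections import defaultdict

def map_fn(doc_name, text):
    """Map step: emit one (word, 1) pair per word in the document."""
    for word in text.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce step: combine all values emitted for one key."""
    return word, sum(counts)

def map_reduce(inputs, map_fn, reduce_fn):
    """Hypothetical driver: run map, group values by key, then run reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)  # the "group by key" step
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

# Made-up sample input: (document name, contents) pairs.
docs = [
    ("a.txt", "the quick brown fox"),
    ("b.txt", "the lazy dog and the fox"),
]
word_counts = map_reduce(docs, map_fn, reduce_fn)
```

The programmer writes only map_fn and reduce_fn, which carry the semantics of the problem; the grouping and data handling live in the driver.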
Distributed MapReduce
• Show how it is naturally distributable and scalable
• Up to terabytes of data and more
• Show how mechanics of distribution and parallelization are automated
• Focus on
  – Performance, Reliability, Fault-tolerance, Failure recovery
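One way to picture why the model distributes naturally: map tasks work on independent chunks of input, and each key is routed to exactly one reduce task. The sketch below is a rough illustration under stated assumptions, using a local thread pool as a stand-in for separate machines. NUM_REDUCERS, partition, map_task, reduce_task, and run_job are hypothetical names, and the sketch omits the fault tolerance and failure recovery that a real system must provide.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import zlib

NUM_REDUCERS = 2  # assumption: number of reduce tasks in this toy job

def partition(key, num_reducers=NUM_REDUCERS):
    """Route a key to one reducer using a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def map_task(chunk):
    """One map task: emit (word, 1) pairs, bucketed by target reducer."""
    buckets = [defaultdict(list) for _ in range(NUM_REDUCERS)]
    for line in chunk:
        for word in line.lower().split():
            buckets[partition(word)][word].append(1)
    return buckets

def reduce_task(bucket_parts):
    """One reduce task: sum the values for every key routed to it."""
    merged = defaultdict(int)
    for part in bucket_parts:
        for word, ones in part.items():
            merged[word] += sum(ones)
    return dict(merged)

def run_job(chunks):
    """Hypothetical driver: parallel map, shuffle, then parallel reduce."""
    with ThreadPoolExecutor() as pool:
        map_outputs = list(pool.map(map_task, chunks))
        # Shuffle: reducer r receives bucket r from every map task.
        reducer_inputs = [
            [out[r] for out in map_outputs] for r in range(NUM_REDUCERS)
        ]
        reduce_outputs = list(pool.map(reduce_task, reducer_inputs))
    result = {}
    for part in reduce_outputs:
        result.update(part)  # safe: each key belongs to exactly one reducer
    return result

# Made-up input, already split into two chunks of lines.
chunks = [["the quick brown fox"], ["the lazy dog", "the fox"]]
totals = run_job(chunks)
```

Because no map task depends on another and no key is shared between reducers, adding workers only changes how many chunks and buckets are processed at once, not the user's map and reduce logic.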
Classes of problems
• Identify classes of problems on which to use MapReduce
• Characterize them
• Why were they difficult before?
• Why are people so excited about MapReduce?
• Why did Google rewrite 10,000 existing programs in MapReduce form?
Google File System
• What is so special about it?
• How is it different from traditional file systems?
• How does it help MapReduce?
• Focus on
  – Performance, Reliability, Fault-tolerance, Failure recovery
Action items today
• Form teams (one for each subtopic)
• Roster to professor
• Get organized to
  – Do reading
  – Prepare topic
References
• See e-mails
• See course web page