MapReduce
CS-4513 Distributed Computing Systems, D-term 2008

(Slides include materials from Operating System Concepts, 7th ed., by Silberschatz, Galvin, & Gagne; Distributed Systems: Principles & Paradigms, 2nd ed., by Tanenbaum and Van Steen; and Modern Operating Systems, 2nd ed., by Tanenbaum)

Why MapReduce

• An important new model of parallel and distributed computing

• Particularly for problems dealing with “big data”

• An abstraction to automate the mechanics of data handling and to let the programmer concentrate on semantics of the problem
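To make the split between "mechanics" and "semantics" concrete, here is a minimal single-process sketch. It assumes the classic word-count problem, which is not taken from these slides, and the names `map_reduce`, `map_words`, and `reduce_counts` are hypothetical. The driver handles grouping and data movement; the programmer supplies only the two problem-specific functions.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Hypothetical driver: the 'mechanics' the abstraction hides.

    Runs mapper over every record, groups emitted values by key,
    then calls reducer once per key.
    """
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# The 'semantics': the only code specific to the word-count problem.
def map_words(line):
    """Emit (word, 1) for every word in one line of text."""
    for word in line.lower().split():
        yield word, 1

def reduce_counts(word, counts):
    """Combine all the counts emitted for one word."""
    return sum(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
word_counts = map_reduce(lines, map_words, reduce_counts)
print(word_counts)
```

Swapping in a different `mapper` and `reducer` changes the problem being solved without touching the driver, which is the point of the abstraction.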

From Operating System course

• Three fundamental models of parallel computing
  – Data Parallelism
  – Task Parallelism
  – Pipelined Parallelism

• Each requires a different set of tools

• Each requires a different mode of thinking
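The three models can be contrasted in a few lines. This is an illustrative sketch only: the data, the stage functions, and the use of Python threads are assumptions made for the example, and in CPython threads show the structure of each model rather than a real speedup for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor
import queue
import threading

data = list(range(10))

# Data parallelism: the same operation applied to different pieces of the data.
with ThreadPoolExecutor(max_workers=4) as pool:
    squares = list(pool.map(lambda x: x * x, data))

# Task parallelism: different operations run concurrently on the same data.
with ThreadPoolExecutor(max_workers=2) as pool:
    total_future = pool.submit(sum, data)
    max_future = pool.submit(max, data)
    total, largest = total_future.result(), max_future.result()

# Pipelined parallelism: stages run concurrently, each feeding the next
# through a queue. DONE is a sentinel that shuts the pipeline down.
DONE = object()

def stage_double(inbox, outbox):
    while (item := inbox.get()) is not DONE:
        outbox.put(item * 2)
    outbox.put(DONE)

def stage_increment(inbox, results):
    while (item := inbox.get()) is not DONE:
        results.append(item + 1)

q1, q2, pipeline_out = queue.Queue(), queue.Queue(), []
workers = [
    threading.Thread(target=stage_double, args=(q1, q2)),
    threading.Thread(target=stage_increment, args=(q2, pipeline_out)),
]
for worker in workers:
    worker.start()
for x in data:
    q1.put(x)
q1.put(DONE)
for worker in workers:
    worker.join()
```

Each model needs different tools even in this toy form: a parallel map, independent futures, and queues with a shutdown signal.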

MapReduce

• A new model

• Fundamentally different from previous models

• Shares some elements with each one

• Promise (hope?) of solving new classes of problems that were previously very tedious to solve

• Not in textbooks

• Not in previous Distributed Systems courses at WPI

Learning about MapReduce

• Partition class into four teams

• Each team responsible for understanding and teaching the rest of the class about one subtopic

• 30-40 minutes of class time per team

• Two teams on April 4

• Two teams on April 8

MapReduce subtopics

• The abstraction itself and its algorithms

• Distributed MapReduce

• Class of problems that MapReduce can help solve

• Google File System to support MapReduce

MapReduce abstraction

• Explain the abstraction, what it does, etc.

• Explain the algorithms

• Show non-trivial programming examples

• Focus on how to think about a problem

Distributed MapReduce

• Show how it is naturally distributable and scalable

• Up to terabytes of data and more

• Show how mechanics of distribution and parallelization are automated

• Focus on
  – Performance, Reliability, Fault-tolerance, Failure recovery
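As a rough illustration of what "automated mechanics" can mean, the sketch below simulates workers with a thread pool on one machine. Everything here is an assumption made for the example: the names `partition`, `map_task`, `reduce_task`, and `run_with_retry` are hypothetical, keys are routed to reduce partitions by a hash of the key modulo the partition count, and failure recovery is reduced to simply re-executing a task that raised an exception. A real distributed implementation also has to deal with machines, networks, and storage.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_REDUCERS = 3  # hypothetical number of reduce partitions

def partition(key, num_reducers=NUM_REDUCERS):
    """Pick the reduce partition for a key using a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def map_task(chunk):
    """Map one input chunk to per-partition lists of (word, 1) pairs."""
    buckets = defaultdict(list)
    for line in chunk:
        for word in line.split():
            buckets[partition(word)].append((word, 1))
    return buckets

def reduce_task(pairs):
    """Sum the counts for every key in one partition."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

def run_with_retry(task, arg, attempts=3):
    """Re-execute a failed task: a simple stand-in for failure recovery."""
    for attempt in range(attempts):
        try:
            return task(arg)
        except Exception:
            if attempt == attempts - 1:
                raise

chunks = [["a b a"], ["b c"], ["a c c"]]  # toy input, one chunk per map task

with ThreadPoolExecutor(max_workers=3) as pool:
    # Map phase: each chunk is processed independently, so it scales with workers.
    map_outputs = list(pool.map(lambda c: run_with_retry(map_task, c), chunks))

    # Shuffle: gather every pair destined for the same partition.
    shuffled = defaultdict(list)
    for buckets in map_outputs:
        for part, pairs in buckets.items():
            shuffled[part].extend(pairs)

    # Reduce phase: partitions are independent, so they also run in parallel.
    reduce_outputs = list(pool.map(reduce_task, shuffled.values()))

# Each key lands in exactly one partition, so merging cannot overwrite a count.
result = {}
for partial in reduce_outputs:
    result.update(partial)
print(result)
```

Because map tasks are deterministic and independent of one another, re-running a failed one produces the same output, which is what makes this style of recovery simple.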

Classes of problems

• Identify classes of problems on which to use MapReduce

• Characterize them

• Why were they difficult before?

• Why are people so excited about MapReduce?

• Why did Google rewrite 10,000 existing programs in MapReduce form?

Google File System

• What is so special about it?

• How is it different from traditional file systems?

• How does it help MapReduce?

• Focus on
  – Performance, Reliability, Fault-tolerance, Failure recovery

Action items today

• Form teams (one for each subtopic)

• Roster to professor

• Get organized to
  – Do reading
  – Prepare topic

References

• See e-mails

• See course web page