Database Applications (15-415)
Hadoop Lecture 24, April 23, 2014
Mohammad Hammoud
Today…
Last Session: NoSQL databases
Today’s Session: Hadoop = HDFS + MapReduce
Announcements:
Final Exam is on Sunday, April 27th, at 9:00AM in room 2051 (all materials are included; open book, open notes)
We will hold a “review session” (for the final exam) tomorrow during the recitation
PS4 grades are out
PS5 (the “last” assignment) is due tomorrow, by midnight
Outline
A “Very Brief” Primer and GFS/HDFS
MapReduce: Systems and Applications Perspectives
MapReduce: Programming, Computation, Architectural and Scheduling Models
Fault-Tolerance in MapReduce
Hadoop MapReduce
MapReduce is one of the most successful realizations of large-scale “data-parallel” distributed analytics engines
Hadoop is an open source implementation of MapReduce
Hadoop MapReduce uses the Hadoop Distributed File System (HDFS) as its distributed storage layer
HDFS is an open source implementation of GFS
GFS Data Distribution Policy
The Google File System (GFS) is a scalable DFS for data-intensive applications
GFS divides large files into multiple pieces called chunks or blocks (by default 64MB) and stores them on different data servers
This design is referred to as block-based design
Each GFS chunk has a unique 64-bit identifier and is stored as a file in the lower-layer local file system on the data server
GFS distributes chunks across cluster data servers using a random distribution policy
GFS Random Distribution Policy
[Figure: A large file is divided into seven 64MB blocks, Blk 0 to Blk 6, at offsets 0M, 64M, 128M, 192M, 256M, 320M and 384M. Server 0 (the writer) holds all seven blocks; copies of the blocks are spread across Server 1, Server 2 and Server 3.]
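The block-based design and random distribution policy described above can be illustrated with a minimal sketch. The function names, the server names and the choice of three copies per chunk are assumptions made for illustration; this is not GFS code, and it ignores the writer-side copy shown in the figure.

```python
import random

CHUNK_SIZE = 64 * 1024 * 1024  # 64MB default chunk size, as stated in the slide


def split_into_chunks(file_size, chunk_size=CHUNK_SIZE):
    """Return (chunk_index, start_offset, end_offset) for each chunk of a file."""
    chunks = []
    offset = 0
    index = 0
    while offset < file_size:
        end = min(offset + chunk_size, file_size)
        chunks.append((index, offset, end))
        offset = end
        index += 1
    return chunks


def place_chunks(chunks, servers, replicas=3, seed=None):
    """Assign each chunk to `replicas` distinct servers chosen uniformly at random.

    The replica count of 3 is an assumption of this sketch.
    """
    rng = random.Random(seed)
    return {index: rng.sample(servers, replicas) for index, _, _ in chunks}


# Hypothetical example: a 448MB file yields 7 chunks (Blk 0 to Blk 6).
chunks = split_into_chunks(7 * CHUNK_SIZE)
servers = ["server0", "server1", "server2", "server3"]
placement = place_chunks(chunks, servers, seed=0)
```

Because placement is random, two runs with different seeds will generally scatter the same file differently, which is what spreads load across the data servers.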
GFS Architecture
GFS adopts a master-slave architecture
[Figure: A GFS client sends a file name to the Master and receives a contact address. The client then sends a chunk id and range to a Chunk Server and receives the chunk data. Each Chunk Server stores its chunks on a Linux file system.]
The Problem Scope
Hadoop MapReduce is used for powerful and efficient analytics over Big Data
The power of MapReduce lies in its ability to scale to 100s and even 1000s of machines
What amount of work can MapReduce handle?
Big Data in the order of 100s of GBs, TBs or PBs
It is unlikely that datasets of such sizes can fit on a single machine
Hence, a storage layer like HDFS is required!
Hadoop MapReduce: A System’s View
Hadoop MapReduce incorporates two phases, Map and Reduce phases, which encompass multiple Map and Reduce tasks
[Figure: A dataset stored in HDFS is read as HDFS blocks (Split 0 to Split 3), one per Map task. In the Map phase, each Map task writes its output as partitions. In the Reduce phase, the partitions pass through the Shuffle stage, the Merge stage and the Reduce stage, where the Reduce tasks run and write their results to HDFS.]
Data Structure: Keys and Values
The MapReduce programmer has to specify only two “sequential” functions, the Map and the Reduce functions
These functions will be translated “automatically” into multiple Map and Reduce tasks
In MapReduce, data elements are always structured as key-value (i.e., (K, V)) pairs
In particular, the Map and Reduce functions receive and emit (K, V) pairs
[Figure: Input Splits, (K, V) pairs -> Map Function -> Intermediate Outputs, (K’, V’) pairs -> Reduce Function -> Final Outputs, (K’’, V’’) pairs]
WordCount: An Application View
A Text File:
Mohammad is delivering a lecture at CMUQ
CMUQ is a member of QF
A Chunk of File: Mohammad is delivering a lecture at CMUQ
A Map Function (Parse & Count)
Input (Key1, Value1): (0, Mohammad is), (20, delivering a), (18, lecture at CMUQ)
Output (Key2, Value2): (Mohammad, 1), (is, 1), (delivering, 1), (a, 1), (lecture, 1), (at, 1), (CMUQ, 1)
A Chunk of File: CMUQ is a member of QF
A Map Function (Parse & Count)
Input (Key1, Value1): (0, CMUQ is a), (17, member of QF)
Output (Key2, Value2): (CMUQ, 1), (is, 1), (a, 1), (member, 1), (of, 1), (QF, 1)
A Reduce Function (Iterate & Sum)
Output (Key2, Value2): (Mohammad, 1), (is, 2), (delivering, 1), (a, 2), (lecture, 1), (at, 1), (CMUQ, 2), (member, 1), (of, 1), (QF, 1)
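The WordCount flow above can be reproduced in a few lines of plain Python. This is a single-process sketch of the idea, not Hadoop code: the names `map_fn`, `reduce_fn` and `run_wordcount` are hypothetical, and each chunk is simplified to a single (offset, line) record.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map (Parse & Count): (offset, line of text) -> list of (word, 1) pairs."""
    return [(word, 1) for word in value.split()]


def reduce_fn(key, values):
    """Reduce (Iterate & Sum): (word, [1, 1, ...]) -> (word, total count)."""
    return (key, sum(values))


def run_wordcount(chunks):
    # Map phase: one map "task" per chunk, applied record by record.
    intermediate = []
    for chunk in chunks:
        for offset, line in chunk:
            intermediate.extend(map_fn(offset, line))

    # Shuffle and merge: group all values that share the same key.
    grouped = defaultdict(list)
    for word, count in intermediate:
        grouped[word].append(count)

    # Reduce phase: one call per distinct key.
    return dict(reduce_fn(word, counts) for word, counts in grouped.items())


chunk_a = [(0, "Mohammad is delivering a lecture at CMUQ")]
chunk_b = [(0, "CMUQ is a member of QF")]
counts = run_wordcount([chunk_a, chunk_b])
```

Here `counts` holds the same totals as the Reduce output in the slide, for example 2 for "CMUQ", "is" and "a". The programmer writes only the two sequential functions; everything inside `run_wordcount` stands in for what the engine does automatically.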
Hadoop MapReduce: A Closer Look
[Figure: The same pipeline runs on every node (Node 1, Node 2, . . .):
Chunks loaded from a (local) HDFS datanode
InputFormat divides the chunks into Splits
RecordReaders (RR) turn each Split into Input (K, V) pairs
Map tasks emit Intermediate (K, V) pairs
Partitioner assigns the Intermediate (K, V) pairs to partitions
Shuffling Process: Intermediate (K, V) pairs exchanged by all nodes
Sort
Reduce emits Final (K, V) pairs
OutputFormat performs the writeback to the HDFS store]
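The Partitioner and Shuffling steps of the pipeline can be sketched as follows. The sketch assumes hash-based partitioning (a key's hash modulo the number of Reduce tasks); `partition` and `shuffle` are hypothetical names, and `zlib.crc32` is used only because it gives the same value on every run.

```python
import zlib


def partition(key, num_reducers):
    """Map a key to a reducer index in the range [0, num_reducers)."""
    # crc32 is deterministic across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(intermediate_pairs, num_reducers):
    """Route every (key, value) pair to its reducer's bucket, then sort each bucket."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in intermediate_pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return [sorted(bucket) for bucket in buckets]


# Intermediate pairs produced by Map tasks on two different nodes.
node1 = [("CMUQ", 1), ("is", 1), ("a", 1)]
node2 = [("CMUQ", 1), ("is", 1), ("QF", 1)]
buckets = shuffle(node1 + node2, num_reducers=2)
```

Since the partition depends only on the key, both ("CMUQ", 1) pairs end up in the same bucket regardless of which node produced them, which is what lets a single Reduce task see all values for a key.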
The Programming Model
Hadoop MapReduce employs a shared-memory programming model
This entails two main issues:
Developers need not “explicitly” encode functions that send/receive messages within their MapReduce programs
HDFS provides a shared abstraction to all tasks
[Figure: Map tasks MT1 to MT6 read from a “Shared-Memory” Storage Address Space (Provided by HDFS). Their outputs reach Reduce tasks RT1 to RT3 through “Implicit” communication (Provided by the MapReduce Engine) in the Merge & Sort Stage. The Reduce tasks write to a “Shared-Memory” Storage Address Space (Provided by HDFS).]
The Computation Model
Hadoop MapReduce adopts a synchronous computation model
A distributed program is said to be synchronous if and only if the tasks operate in a lock-step mode
[Figure: In the Map Phase, Map tasks MT1, MT2 and MT3 produce Partition0 to Partition5. In the Reduce Phase, the Shuffle Stage delivers Partition0 and Partition1 to Reduce task RT1, followed by the Reduce Stage.]
Shuffle, Merge and Sort start ONLY after 5% of Map Tasks commit!
Reduce starts ONLY after ALL partitions are shuffled, merged and sorted!
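The two synchronization rules above amount to two simple gating conditions. A minimal sketch, with hypothetical function names and the 5% threshold taken from the slide:

```python
def shuffle_may_start(committed_maps, total_maps, threshold=0.05):
    """Shuffle, Merge and Sort may begin once `threshold` of the Map tasks have committed."""
    return total_maps > 0 and committed_maps / total_maps >= threshold


def reduce_may_start(shuffled_partitions, total_partitions):
    """The Reduce stage begins only when every partition is shuffled, merged and sorted."""
    return shuffled_partitions == total_partitions


# With 100 Map tasks, shuffling is held back until the 5th task commits.
early = shuffle_may_start(4, 100)
ready = shuffle_may_start(5, 100)
```

The second condition is the barrier that makes the model lock-step: a single unfinished partition keeps the Reduce stage from starting.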
The Architectural and Scheduling Models
Hadoop MapReduce employs a master-slave architecture
A pull-based “task” scheduling strategy is used, whereby:
Map tasks are scheduled near the HDFS blocks they read
Reduce tasks are scheduled anywhere
[Figure: A Core Switch connects Rack Switch 1 and Rack Switch 2, which connect the JobTracker and TaskTracker1 to TaskTracker5. TaskTracker1 requests a Map task, and the JobTracker schedules a Map task (from MT1, MT2, MT3) at an empty Map slot on TaskTracker1.]
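One way to picture pull-based, locality-aware scheduling is the sketch below. It assumes a three-level preference (same node, then same rack, then anywhere); the function `pick_map_task`, the data layout and the rack assignments are hypothetical and only loosely follow the figure.

```python
def pick_map_task(pending_tasks, tracker, rack_of):
    """Choose a pending Map task for a TaskTracker that has just asked for work.

    pending_tasks: dict task_id -> list of nodes holding a copy of the task's HDFS block
    tracker:       name of the requesting TaskTracker
    rack_of:       dict node name -> rack name
    """
    # 1) Prefer a task whose block is stored on the requesting node.
    for task, holders in pending_tasks.items():
        if tracker in holders:
            return task
    # 2) Otherwise prefer a task whose block is stored in the same rack.
    for task, holders in pending_tasks.items():
        if any(rack_of[h] == rack_of[tracker] for h in holders):
            return task
    # 3) Otherwise take any pending task, or None if nothing is left.
    return next(iter(pending_tasks), None)


rack_of = {"TT1": "rack1", "TT2": "rack1", "TT3": "rack2", "TT4": "rack2"}
pending = {"MT1": ["TT3"], "MT2": ["TT2", "TT4"], "MT3": ["TT1", "TT3"]}
choice = pick_map_task(pending, "TT1", rack_of)  # MT3: its block is on TT1 itself
```

The scheduling is "pull-based" because the function runs only when a TaskTracker asks for work; the master never pushes a task to a tracker that has not reported a free slot.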
Job Scheduling in MapReduce
An application is represented by one or many jobs
A job consists of one or many Map and Reduce tasks
Hadoop MapReduce comes with various choices of job schedulers:
FIFO Scheduler: schedules jobs in order of submission
Fair Scheduler: aims at giving every user a “fair” share of the cluster capacity over time
Capacity Scheduler: similar to Fair Scheduler but does not apply job preemption
Summary
Aspect | Hadoop MapReduce
Parallelism Model | Data-Parallel
Programming Model | Shared-Memory
Computation Model | Synchronous
Architectural Model | Master-Slave
Scheduling Model | Pull-Based
Application Suitability | Loosely-Connected/Embarrassingly-Parallel Applications
Fault Tolerance in Hadoop: Node Failures
MapReduce can guide jobs toward a successful completion even when jobs are run on large clusters (where probability of failures increases)
Hadoop MapReduce achieves fault-tolerance through restarting tasks
If a TaskTracker (TT) fails to communicate with the JobTracker (JT) for a period of time (by default, 1 minute), JT will assume that the TT in question has crashed
If the job is still in the Map phase, JT asks another TT to re-execute all Map tasks that previously ran at the failed TT
If the job is in the Reduce phase, JT asks another TT to re-execute all Reduce tasks that were in-progress on the failed TT
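The failure-handling rules above can be written down as a short sketch. The function names, the task record layout and the use of seconds for time are assumptions made for illustration; the 60-second timeout follows the default quoted in the slide.

```python
def failed_trackers(last_heartbeat, now, timeout=60.0):
    """Return the TaskTrackers whose last heartbeat is older than `timeout` seconds."""
    return [tt for tt, seen in last_heartbeat.items() if now - seen > timeout]


def tasks_to_reexecute(tasks, failed_tt, job_phase):
    """Select the tasks of a failed TaskTracker that must run again on another one.

    tasks: list of dicts with keys "id", "kind" ("map" or "reduce"),
           "tracker" and "state" ("done" or "in_progress")
    """
    on_failed = [t for t in tasks if t["tracker"] == failed_tt]
    if job_phase == "map":
        # Map phase: every Map task that ran on the failed node, finished or not.
        return [t["id"] for t in on_failed if t["kind"] == "map"]
    # Reduce phase: only the Reduce tasks that were still in progress.
    return [t["id"] for t in on_failed
            if t["kind"] == "reduce" and t["state"] == "in_progress"]


heartbeats = {"TT1": 100.0, "TT2": 30.0}
dead = failed_trackers(heartbeats, now=120.0)  # TT2 has been silent for 90 seconds
tasks = [
    {"id": "MT1", "kind": "map", "tracker": "TT2", "state": "done"},
    {"id": "MT2", "kind": "map", "tracker": "TT1", "state": "done"},
    {"id": "RT1", "kind": "reduce", "tracker": "TT2", "state": "in_progress"},
]
redo = tasks_to_reexecute(tasks, "TT2", job_phase="map")
```

Note that the Map-phase rule includes finished Map tasks: in this sketch MT1 is marked "done" yet is still selected, because the slide asks for all Map tasks that previously ran at the failed TT.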
Fault Tolerance in Hadoop: Speculative Execution
A MapReduce job is dominated by the slowest task
MapReduce attempts to locate slow tasks (or stragglers) and run replicated (or speculative) tasks that will optimistically commit before the stragglers
In general, this strategy is known as task resiliency or task replication (as opposed to data replication), but in Hadoop it is referred to as speculative execution
Only one copy of a straggler is allowed to be replicated
Whichever of the two copies of a task commits first becomes the definitive copy, and the other one is killed by JT
But, How to Locate Stragglers?
Hadoop monitors each task progress using a progress score between 0 and 1
If a task’s progress score is less than (average – 0.2), and the task has run for at least 1 minute, it is marked as a straggler
[Figure: Progress of two tasks over time. T1 has PS = 2/3 and is not a straggler. T2 has PS = 1/12 and is a straggler.]
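The straggler rule can be checked against the two tasks in the figure with a small sketch. The function name and the 120-second runtimes are assumptions; the thresholds (0.2 below the average, at least 1 minute of runtime) come from the slide.

```python
def find_stragglers(tasks, gap=0.2, min_runtime=60.0):
    """Return the ids of tasks marked as stragglers.

    tasks: dict task_id -> (progress score in [0, 1], runtime in seconds)
    """
    if not tasks:
        return []
    average = sum(score for score, _ in tasks.values()) / len(tasks)
    return [task_id for task_id, (score, runtime) in tasks.items()
            if score < average - gap and runtime >= min_runtime]


# Progress scores from the figure; the runtimes are made up for the example.
tasks = {"T1": (2 / 3, 120.0), "T2": (1 / 12, 120.0)}
stragglers = find_stragglers(tasks)
```

The average progress score is (2/3 + 1/12) / 2 = 0.375, so the cutoff is 0.175. T2's score of 1/12 (about 0.083) falls below it, while T1's score of 2/3 does not, which matches the figure.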
Next Class
A Review Session