![Page 1: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/1.jpg)
Map Reduce
Data at Scale
History
• A simple paradigm that has surfaced several times
• Observed by Google as a software pattern:
• Data gets filtered locally and filtered data is then reassembled elsewhere
• Software pattern: Many engineers are re-engineering the same steps
• Map-reduce:
• Engineer the common steps efficiently
• Individual problems only need to be engineered for what makes them different
History
• Open source project (in part sponsored by Yahoo!)
• Java-based Hadoop
• Eventually a first tier Apache Foundation project
• Other projects at higher level: Pig, Hive, HBase, Mahout, Zookeeper
• Use Hadoop as foundation
• Hadoop is becoming a distributed OS
Map Reduce Paradigm
• Input: Large amount of data spread over many different nodes
• Output: A single file of results
• Two important phases:
• Mapper: Records are processed into key-value pairs. Mapper sends key-value pairs to reducers
• Reducer: Creates the final answer from the mappers' key-value pairs
Simple Example
• Hadoop Word Count
• Given different documents on many different sites
• Mapper:
• Extract words from record
• Combines words and generates key-value pairs of type word: count
• Sends to the reducers based on hash of key
• Reducer:
• Receives key-value pairs
• Adds values for each key
• Sends accumulated results to aggregator - client
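The word count steps above can be sketched in a few lines of plain Python. This is only an in-memory illustration, not Hadoop code: the function names `map_words` and `reduce_counts` and the two sample documents are assumptions made for the example.

```python
import re
from collections import defaultdict

def map_words(record):
    """Mapper: emit one (word, 1) pair per word in the record."""
    return [(word, 1) for word in re.findall(r"[a-z']+", record.lower())]

def reduce_counts(pairs):
    """Reducer: add up the values for each key."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Hypothetical input: in a real job the documents live on many nodes.
documents = ["the ant and the rat", "the rat"]
pairs = [pair for doc in documents for pair in map_words(doc)]
word_counts = reduce_counts(pairs)
print(word_counts)
```

In a real cluster the pairs would be routed to several reducers by a hash of the key; here a single reducer sees all of them.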
[Figure: Word count data flow. Groups of documents feed four mappers. The mappers send per-word counts to the reducers, for example ant: 2, ant: 3, ant: 1, ant: 4 to one reducer and rat: 5, rat: 3, rat: 1 to another. The reducers produce totals such as ani: 1, ant: 10, ape: 39, asp: 2, ass: 5, auk: 2 and ram: 12, rat: 9, ray: 5, roe: 2. The figure also shows the client and a file system labelled HFS.]
Map-reduce paradigm in detail
• The simple mapper-reducer paradigm can be expanded into several typical components
Map Reduce in Detail
• Mapper:
• Record Reader
• Parses the data into records
• Example: Stackoverflow comments.
• <row Id="5" PostId="5" Score="2" Text="Programming in Portland, cooking in Chippewa ; it makes sense that these would be unlocalized. But does bicycling.se need to follow only that path? I agree that route a to b in city x is not a good use of this site; but general resources would be." CreationDate="2010-08-25T21:21:03.233" UserId="21" />
• Record reader extracts the “Text=” string
• Passes the record in key-value format to the rest of the mapper
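A record reader for rows like the one above might look roughly as follows. This is a sketch using Python's standard XML parser, not Hadoop's RecordReader interface. The function name `read_record`, the choice of `Id` as the key, and the shortened `Text` value are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

def read_record(line):
    """Parse one <row .../> line and return (key, value) = (Id, Text)."""
    row = ET.fromstring(line)
    return row.get("Id"), row.get("Text")

# The Text attribute is shortened here to keep the example readable.
sample_row = (
    '<row Id="5" PostId="5" Score="2" Text="Programming in Portland" '
    'CreationDate="2010-08-25T21:21:03.233" UserId="21" />'
)
record_key, record_text = read_record(sample_row)
print(record_key, record_text)
```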
Map Reduce in Detail
• Mapper
• map
• Produces “intermediate” key-value pairs from the record
• Example:
• “Programming in Portland, cooking in Chippewa ; it makes sense that these would be unlocalized. But does bicycling.se need to follow only that path? I agree that route a to b in city x is not a good use of this site; but general resources would be.”
• Map produces: <programming: 1> <in: 1> <Portland: 1> <cooking: 1> <in: 1> …
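As a rough illustration of the map step, the sketch below turns a record's text into intermediate (word, 1) pairs. The name `map_record` and the tokenizing rule (runs of word characters, case preserved) are assumptions; a real job would choose its own tokenization.

```python
import re

def map_record(text):
    """Emit an intermediate (word, 1) pair for every word, in input order."""
    return [(word, 1) for word in re.findall(r"\w+", text)]

intermediate = map_record("Programming in Portland, cooking in Chippewa")
print(intermediate)
```

Note that "in" is emitted twice with value 1; merging such duplicates is left to the combiner and the reducer.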
Map Reduce in Detail
• Mapper
• Combiner — a local reducer
• Takes key-value pairs and processes them
• Example:
• Map produces: <programming: 1> <in: 1> <Portland: 1> <cooking: 1> <in: 1> …
• Combiner combines words: <programming: 1> <in: 4> <Portland: 3> …
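A combiner can be sketched as a small local aggregation over the map output of one node. The function name `combine` and the sample pairs are assumptions for illustration; the counts below follow from the sample pairs only.

```python
from collections import Counter

def combine(pairs):
    """Local reducer: sum the counts of identical keys on the mapper node."""
    combined = Counter()
    for key, count in pairs:
        combined[key] += count
    return sorted(combined.items())

map_output = [("programming", 1), ("in", 1), ("Portland", 1), ("cooking", 1), ("in", 1)]
combined_output = combine(map_output)
print(combined_output)
```

Five pairs go in and four come out, which is the traffic reduction the combiner is for.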
Map Reduce in Detail
• Combiners allow us to reduce network traffic
• By compacting the same information before it is sent to the reducers
Map Reduce in Detail
• Mapper
• Partitioner
• Partitioner creates shards of the key-value pairs produced
• One for each reducer
• Often uses a hash function or a range
• Example:
• md5(key) mod (#reducers)
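The hash-based rule in the example can be written down directly. This is a minimal sketch of md5(key) mod (#reducers) using Python's `hashlib`; the function name `partition` and the sample keys are assumptions.

```python
import hashlib

def partition(key, num_reducers):
    """Return the index of the reducer responsible for this key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_reducers

# Every occurrence of a key lands in the same shard, whichever mapper emits it.
shards = {word: partition(word, 4) for word in ["ant", "ape", "rat", "ram"]}
print(shards)
```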
Map Reduce in Detail
• Reducer
• Shuffle and Sort
• Part of the map-reduce framework
• Incoming key-value pairs are sorted by key into one large data list
• Groups keys together for easy agglomeration
• Programmer can specify the comparator, but nothing else
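What shuffle and sort hands to the reduce function can be imitated in memory as follows. The framework does this across machines and on disk; this sketch only shows the resulting grouping. The name `shuffle_and_sort` is an assumption, and the `key=` argument plays the role of the comparator.

```python
from itertools import groupby
from operator import itemgetter

def shuffle_and_sort(pairs):
    """Sort pairs by key, then collect the values of each key into one group."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]

incoming = [("rat", 5), ("ant", 2), ("rat", 3), ("ant", 3)]
grouped = shuffle_and_sort(incoming)
print(grouped)
```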
Map Reduce in Detail
• Reducer
• reduce
• Written by programmer
• Works on each key group
• Data can be combined, filtered, aggregated
• Output is prepared
Map Reduce in Detail
• Reducer
• Output format
• Formats final key-value pair
Hadoop
• Classic Map-Reduce
• Client that submits the map-reduce job
• Jobtracker, which coordinates the job run
• Task trackers that run the tasks that the job has been split into
• Distributed file system (HDFS — Hadoop file system) for file sharing between entities
Hadoop Classic
[Figure: How classic MapReduce runs a job. Components: client node (client JVM running the MapReduce program and its Job), jobtracker node (JobTracker), tasktracker node (TaskTracker and a child JVM running a MapTask or ReduceTask), and HDFS. The numbered arrows are:]
1: run job
2: get new job id
3: copy job resources
4: submit job
5: initialize job
6: retrieve input splits
7: heartbeat
8: retrieve job resources
9: launch
10: run
Hadoop
• Job submission
• Creates an internal JobSubmitter instance
• JobSubmitter
• asks jobtracker for a new jobid
• checks the output specifications of the job
• computes the input splits for the job
• copies the resources needed for the job
• tells the jobtracker that the job is ready for submission
Hadoop
• Jobtracker receives call from submitJob( )
• places it in internal queue
• retrieves the input splits computed by the client
• creates a map task for each split
• number of reduce tasks is set by mapred.reduce.tasks
• runs job setup task
• runs job cleanup task
Hadoop
• Task assignment
• Tasktrackers periodically send heartbeat to jobtracker
• Heartbeat indicates whether the tasktracker is ready to run a new task
• Tasktrackers have a set number of slots for map tasks and for reduce tasks
• To assign a reduce task, the jobtracker simply takes the next one from the list of reduce tasks
• For map tasks, preference is given to data-local (input split on the same node) or rack-local placement
Classical Hadoop
• Task execution
• Tasktracker localizes the job JAR from the file system
• It copies any files needed from the distributed cache
• Creates instances of TaskRunner to run the task
• TaskRunners launch a new Java Virtual Machine
• Child informs parent of progress
Classical Hadoop
• Streaming and pipes run special map and reduce tasks
• Streaming:
• communicates with process using standard input and output streams
• Pipes:
• Pipes task listens on socket
• passes C++ process a port number
• In both cases
• Java process passes input key-value pairs to the external process
Classic Hadoop Streaming and Piping
[Figure: Streaming and Pipes. In both cases the TaskTracker launches a child JVM that runs the MapTask or ReduceTask. Streaming: the task launches a streaming process and exchanges data with it over stdin and stdout. Pipes: the task launches a C++ map or reduce class, built on the C++ wrapper library, and exchanges data with it over a socket.]
Classical Hadoop
• Progress and status updates
• Map-Reduce jobs can take hours
• Each job has a status
• System estimates progress for each task
• Mappers: percentage of input processed
• Reducers: More complicated, but estimates are possible
• Tasks use set of counters for various events
• Can be user defined — see below
Classical Hadoop
• Job completion
• When the jobtracker receives notification that the last task has completed
• Job status changes to “successful”
• Job statistics are sent to console
Yarn: Hadoop 2
• Classic structure runs into bottlenecks at about 4000 nodes
• YARN: Yet Another Resource Negotiator
• YARN splits the jobtracker's responsibilities into separate entities:
• Resource manager daemon
• Application master
Yarn: Hadoop 2
• Each application instance has a dedicated application master
Yarn: Hadoop 2
[Figure: How YARN runs a MapReduce job. Components: client node (client JVM running the MapReduce program and its Job), resource manager node (resource manager), a node manager node whose node manager launches the MRAppMaster, a node manager node whose node manager launches a task JVM (YarnChild) running a MapTask or ReduceTask, and HDFS. The numbered arrows are:]
1: run job
2: get new application
3: copy job resources
4: submit application
5: start container and launch
6: initialize job
7: retrieve input splits
8: allocate resources
9: start container and launch
10: retrieve job resources
11: run
Yarn: Hadoop 2
• Job submission as before
• When the resource manager receives a call to submitApplication(), it hands the request off to the scheduler
• Scheduler allocates a container
• Resource manager launches the application master process there (5)
Yarn: Hadoop 2
• Application master
• initializes the job by creating book-keeping objects
• to receive and report on progress by individual tasks
• Retrieves input splits
• Creates a map task object for each split
• Creates a number of reduce tasks (mapreduce.job.reduces)
• Decides how to run job
• Small jobs might be run in the same JVM as the application master
• These are called uber tasks
Yarn: Hadoop 2
• Application master requests containers for all map and reduce tasks from resource manager
• Scheduler gets enough information to make smart decisions
• One of the selling points of Yarn: Resource allocation is much smarter
Yarn: Hadoop 2
• Task execution:
• Application master starts container by contacting the node manager
• Streaming and pipes work in the same way as Classical MapReduce
Yarn: Hadoop 2
• Progress and Status reports
• tasks report progress and status to application masters
• (Classical: reports move from child through tasktracker to jobtracker for aggregation)
• client polls application master every second
Failures in Classic Hadoop
• Child task is failing:
• Throwing a runtime exception
• JVM informs the client through the tasktracker
• Tasktracker can take on another task
• Streaming tasks that exit with a nonzero exit code are marked as failed
• Sudden exit of child JVM
• Tasktracker notes exit and marks attempt as failed
Failures in Classic Hadoop
• Child task is failing
• Hanging tasks
• Tasktracker notices lack of progress updates and marks task as failed
• Normal timeout period is 10 minutes
Failures in Classic Hadoop
• When jobtracker is notified that a task attempt has failed:
• jobtracker reschedules task elsewhere
• jobtracker tries a task at most four times
• Client can specify the percentage of tasks that are allowed to fail
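The rescheduling rule above can be sketched as a bounded retry loop. This is a toy model, not the jobtracker's actual logic: `run_with_retries`, `flaky_task`, and the node names are hypothetical, and cycling through the nodes in order is an assumption about placement.

```python
MAX_ATTEMPTS = 4  # the limit of four attempts mentioned above

def run_with_retries(task, nodes, max_attempts=MAX_ATTEMPTS):
    """Run task on successive nodes; give up after max_attempts failed attempts."""
    for attempt in range(1, max_attempts + 1):
        node = nodes[(attempt - 1) % len(nodes)]  # reschedule elsewhere
        try:
            return task(node), attempt
        except RuntimeError:
            continue  # this attempt failed, try the next node
    raise RuntimeError(f"task failed after {max_attempts} attempts")

def flaky_task(node):
    """Hypothetical task that always fails on node-a."""
    if node == "node-a":
        raise RuntimeError("task attempt failed")
    return f"done on {node}"

result, attempts = run_with_retries(flaky_task, ["node-a", "node-b"])
print(result, attempts)
```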
Failures in Classic Hadoop
• Users and the jobtracker can kill task attempts
• E.g. speculative execution can kill duplicates
• E.g. Tasktracker has failed
Failures in Classic Hadoop
• Tasktracker failure
• Tasktracker then no longer sends heartbeats
• Jobtracker removes tasktracker from its pool
• Jobtracker arranges for map tasks to restart
• Because there might be no access to the local results
• Tasktrackers can be blacklisted if too many tasks there fail
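Detecting a failed tasktracker from missing heartbeats amounts to a timeout check. The sketch below assumes the jobtracker keeps the time of the last heartbeat per tasktracker; the function name, the tracker names, and the 600-second timeout are illustrative assumptions, not Hadoop configuration values taken from the slides.

```python
def expired_trackers(last_heartbeat, now, timeout_seconds):
    """Return the trackers whose last heartbeat is older than the timeout."""
    return sorted(
        name for name, seen in last_heartbeat.items()
        if now - seen > timeout_seconds
    )

# Hypothetical bookkeeping: tracker name -> time of last heartbeat in seconds.
last_heartbeat = {"tracker-1": 100.0, "tracker-2": 695.0}
failed = expired_trackers(last_heartbeat, now=705.0, timeout_seconds=600.0)
print(failed)  # these trackers would be removed from the pool
```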
Failures in Classic Hadoop
• Jobtracker failure
• No mechanism in Hadoop to deal with this type of failure
Failures in YARN
• Task failures like before:
• Propagated back to application master
• Application master marks them as failed
• Hanging tasks are discovered by application master
Failures in YARN
• Application master failure
• Applications get several attempts to succeed
• Application master sends heartbeats to Resource manager
• Resource manager can restart application master elsewhere
Failures in YARN
• Node manager failure
• Resource manager will know due to lack of heartbeats
Failures in YARN
• Resource manager failure
• Resource manager is hardened by using checkpointing to save state
Job Scheduling
• First implementations just used FIFO
• Later, priorities were introduced
• Fair scheduler: every user gets equal access to the capacity of the cluster
• Capacity scheduler: made up of queues
• Each queue is run like a fair scheduler
• Gives administrator more control over how different users are treated
Shuffle and Sort
• Each reducer gets sorted input
• The part of the system that sorts the map outputs and transfers them to the reducers is the shuffle
Shuffle and Sort
• Map side
• Output is not simply written to disk
• For better performance:
• Output is put into buffers
• Buffers are partitioned and sorted when flushed to disk as spill files
• When mapper finishes:
• Spill files are combined
• Output is made available to reducers from the mapper's local disk, not through HDFS
Shuffle and Sort
• Reduce side:
• Reducers start copying mapper outputs as soon as they are available (copy phase)
• If mapper outputs at reducer reach critical size, they are placed into spill files on disk
• Spill files are sorted and combined in batches — typically 10 spill files
• Final combination feeds directly to reducer
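Merging sorted spill files in batches is a multi-way merge. The sketch below keeps everything in memory as lists, whereas the framework works with files on disk; `merge_spills` and the sample runs are assumptions for illustration, and `batch_size` stands in for the typical batch of 10.

```python
import heapq

def merge_spills(spills, batch_size=10):
    """Merge sorted runs batch_size at a time until one sorted run remains."""
    runs = [list(spill) for spill in spills]
    while len(runs) > 1:
        batch, runs = runs[:batch_size], runs[batch_size:]
        runs.append(list(heapq.merge(*batch)))  # one pass over the batch
    return runs[0] if runs else []

# Three small sorted runs, merged two at a time to force two merge rounds.
spills = [
    [("ant", 2), ("rat", 5)],
    [("ant", 3), ("ram", 12)],
    [("ape", 39)],
]
merged = merge_spills(spills, batch_size=2)
print(merged)
```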
Map Reduce Patterns
• Summarizations
• Input: A large data set that can be grouped according to various criteria
• Output: A numerical summary
• Example:
• Calculate minimum, maximum, total of certain fields in documents in xml format ordered by user-id
Summarization
• Example:
• Given a database in xml-document format
• Determine the earliest, latest, and number of posts for each user
<row Id="193" PostTypeId="1" AcceptedAnswerId="194" CreationDate="2010-10-23T20:08:39.740" Score="3" ViewCount="30" Body="<p>Do you lose one point of reputation when you down vote community wiki? Meta? </p>

<p>I know that you do for "regular questions". </p>
" OwnerUserId="134" LastActivityDate="2010-10-24T05:41:48.760" Title="Do you lose one point of reputation when you down vote community wiki? Meta?" Tags="<discussion>" AnswerCount="1" CommentCount="0" />
Summarization
• Mapper:
• Step 1: Preprocess document by extracting the user ID and the date of the post
• Step 2: map:
• User ID becomes the key.
• Value stores the date twice in Java-date format and adds a long value of 1
“134”: (2010-10-23T20:08:39.740, 2010-10-23T20:08:39.740, 1)
Summarization
• Mapper:
• Step 3: Combiner
• Take intermediate User-ID — value pairs
• Combine the value pairs
• Combination of two values:
• first item is minimum of the dates
• second item is maximum of the dates
• third item is sum of third items
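The combination rule for two values can be sketched as follows. Values are modelled as (earliest, latest, count) tuples; the helper names `to_value` and `combine_values` and the second timestamp are assumptions made for the example.

```python
from datetime import datetime

def to_value(timestamp):
    """Map step: store the post date twice and add a count of 1."""
    date = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%f")
    return (date, date, 1)

def combine_values(a, b):
    """Combine two (earliest, latest, count) triples for the same user."""
    return (min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2])

first = to_value("2010-10-23T20:08:39.740")
second = to_value("2010-10-25T09:15:00.000")  # hypothetical second post
summary = combine_values(first, second)
print(summary)
```

Because the rule is associative and commutative, the same function can serve in the combiner and in the reducer.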
Summarization
• The map reduce framework is given the number of reducers
• Autonomously maps combiner results to reducers
• Each reducer gets key-value pairs for a range of user-IDs grouped by user-ID
Summarization
• Reducer:
• Passes through each group combining key-value pairs
• End-result:
• Key-value pair with key = user-id
• Value is a triple with
• minimum posting date
• maximum posting date
• number of posts
![Page 58: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/58.jpg)
Summarization• Reducer:
• Each summary key-value pair is sent to the client
![Page 59: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/59.jpg)
Summarization• Example (cont.)
Mapper 1 output:
UserID 12345 01.02.2010 01.02.2010 1
UserID 12345 02.02.2010 02.02.2010 1
UserID 12345 04.02.2010 04.02.2010 1
UserID 98765 12.02.2010 12.02.2010 1
UserID 98765 02.02.2010 02.02.2010 1
UserID 98765 05.02.2010 05.02.2010 1
UserID 56565 02.02.2010 02.02.2010 1
UserID 56565 03.02.2010 03.02.2010 1
Combiner output for Mapper 1:
UserID 12345 01.02.2010 04.02.2010 3
UserID 98765 02.02.2010 12.02.2010 3
UserID 56565 02.02.2010 03.02.2010 2
Mapper 2 output:
UserID 12345 02.02.2010 02.02.2010 1
UserID 12345 04.02.2010 04.02.2010 1
UserID 77444 12.02.2010 12.02.2010 1
UserID 77444 02.02.2010 02.02.2010 1
UserID 98765 05.02.2010 05.02.2010 1
Combiner output for Mapper 2:
UserID 12345 02.02.2010 04.02.2010 2
UserID 77444 02.02.2010 12.02.2010 2
UserID 98765 05.02.2010 05.02.2010 1
![Page 60: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/60.jpg)
Summarization• Example (cont.) Automatic Shuffle and Sort
• Records with the same key are sent to the same reducer
[Diagram: Mappers 1, 2, 3, 4, ..., 123 send their combined records to Reducers 1, 2, and 3. Both partial records for UserID 12345, (01.02.2010, 04.02.2010, 3) and (02.02.2010, 04.02.2010, 2), arrive at the same reducer.]
![Page 61: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/61.jpg)
Summarization• Example (cont.)
• Reducer receives records already ordered by user-ID
• Combines records with same key
Reducer input:
UserID 12345 01.02.2010 04.02.2010 3
UserID 12345 02.02.2010 04.02.2010 2
UserID 12345 26.03.2010 30.04.2010 5
UserID 12345 19.01.2010 01.04.2010 3
UserID 16542 02.02.2010 04.02.2010 6
UserID 16542 26.03.2010 29.05.2010 5
UserID 16542 19.01.2010 19.01.2010 1
Reducer output:
UserID 12345 19.01.2010 30.04.2010 13
UserID 16542 19.01.2010 29.05.2010 12
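The whole summarization flow can be sketched locally in plain Python. This is a simulation under stated assumptions, not Hadoop code: the names `map_post`, `combine`, and `run_job` and the sample records are made up for illustration. One `combine` function can serve as both combiner and reducer because minimum, maximum, and sum are associative and commutative.

```python
from collections import defaultdict
from datetime import datetime
from functools import reduce


def map_post(user_id, date_text):
    """Mapper: emit (user_id, (min_date, max_date, count)) for one post."""
    date = datetime.strptime(date_text, "%d.%m.%Y")
    return user_id, (date, date, 1)


def combine(a, b):
    """Combiner and reducer step: merge two (min, max, count) triples."""
    return min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2]


def run_job(splits):
    """Simulate one job; each inner list is the input split of one mapper."""
    shuffled = defaultdict(list)
    for split in splits:
        local = {}  # combiner state of this mapper
        for user_id, date_text in split:
            key, value = map_post(user_id, date_text)
            local[key] = combine(local[key], value) if key in local else value
        for key, value in local.items():  # shuffle: group partial results by key
            shuffled[key].append(value)
    # Reducer: fold all partial triples of a key into one summary.
    return {key: reduce(combine, values) for key, values in shuffled.items()}


# Hypothetical input: two splits with (user ID, posting date) records.
splits = [
    [("12345", "01.02.2010"), ("12345", "02.02.2010"), ("12345", "04.02.2010")],
    [("12345", "02.02.2010"), ("12345", "04.02.2010"), ("77444", "12.02.2010")],
]
summary = run_job(splits)
```

Each value in `summary` is a triple of earliest posting date, latest posting date, and number of posts.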
![Page 62: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/62.jpg)
Summarization• In (pseudo-)pig:
• Load data
posts = LOAD '/stackexchange/posts.tsv.gz' USING PigStorage('\t') AS ( post_id : long, user_id : int, text : chararray, … post : date )
![Page 63: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/63.jpg)
Summarization• In (pseudo-)pig:
• Group by user-id
• Obtain min, max, count:
post_group = GROUP posts BY user_id;
result = FOREACH post_group GENERATE group, MIN(posts.date), MAX(posts.date), COUNT_STAR(posts);
![Page 64: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/64.jpg)
Summarization• In (pseudo-)pig:
• Load data
orders = LOAD '/stackexchange/posts.tsv.gz' USING PigStorage('\t') AS ( post_id : long, user_id : int, text : chararray, … post : date )
![Page 65: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/65.jpg)
Summarization• Your turn:
• Calculate the average score per user
• The score is kept in the “score”-field
![Page 66: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/66.jpg)
Summarization• Solution:
• Need to aggregate sum of score and number of posts
• Mapper: for each user-id, create a record with score
• Combiner adds scores and counts
• Reducer combines as well
• Generates output key-value pair and sends it to the user
• Records at each stage:
• Mapper output: userid: score, 1
• Combiner output: userid: sum_score, count
• Reducer output: userid: sum_score/count
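A minimal Python sketch of this solution follows. The function names and the sample records are assumptions for illustration. The point is that (sum, count) pairs can be merged in any order, whereas averaging partial averages would weight the partitions incorrectly, so the division happens only once at the end.

```python
from collections import defaultdict


def map_score(user_id, score):
    """Mapper: emit (user_id, (score, 1))."""
    return user_id, (score, 1)


def add_pairs(a, b):
    """Combiner and reducer merge step: add the score sums and the counts."""
    return a[0] + b[0], a[1] + b[1]


def average_scores(records):
    """Aggregate (sum, count) per user, then divide once at the very end."""
    totals = defaultdict(lambda: (0, 0))
    for user_id, score in records:
        key, value = map_score(user_id, score)
        totals[key] = add_pairs(totals[key], value)
    return {user: total / count for user, (total, count) in totals.items()}


# Hypothetical (user ID, score) records.
records = [("134", 4), ("134", 10), ("77", 3), ("134", 1)]
averages = average_scores(records)
```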
![Page 67: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/67.jpg)
Summarization• Finding the median of a numerical variable
• Mapper aggregates all values in a list
• Reducer aggregates all values in a list
• Reducer then determines median of the list
• Can easily run into memory problems
![Page 68: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/68.jpg)
Summarization• Median calculation:
• Can compress lists by using counts
• 2, 3, 3, 3, 2, 4, 5, 2, 1, 2 becomes (1,1), (2,4), (3,3), (4,1), (5,1)
• Combiner creates compressed lists
• Reducer code directly calculates median
• An instance where combiner and reducer use different code
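The compressed-list idea can be sketched as follows. The names `compress` and `median_from_counts` are assumptions, and the median of an even number of values is taken as the mean of the two middle values. The combiner only counts, while the reducer merges the counts and walks them in sorted order until it reaches the middle ranks.

```python
from collections import Counter


def compress(values):
    """Combiner: replace a list of values by (value, count) pairs."""
    return Counter(values)


def median_from_counts(partials):
    """Reducer: merge the compressed lists and locate the median by rank."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    total = sum(merged.values())
    # Zero-based ranks of the two middle elements (identical if total is odd).
    lower_rank, upper_rank = (total - 1) // 2, total // 2
    lower = upper = None
    seen = 0
    for value in sorted(merged):
        seen += merged[value]
        if lower is None and seen > lower_rank:
            lower = value
        if seen > upper_rank:
            upper = value
            break
    return (lower + upper) / 2


# The list from the slide, split over two hypothetical combiners.
partials = [compress([2, 3, 3, 3, 2]), compress([4, 5, 2, 1, 2])]
median = median_from_counts(partials)
```

Memory use now grows with the number of distinct values rather than with the number of records.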
![Page 69: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/69.jpg)
Summarization• Standard Deviation
• Square-root of variance
• Variance: average squared deviation from the average
• $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2}$
• Leads to a two-pass solution: calculate the average first
![Page 70: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/70.jpg)
Summarization• Standard Deviation
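• Numerically dangerous one-pass solution
$$
\begin{aligned}
\sigma_x^2 &= \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2 \\
&= \frac{1}{N}\sum_{i=1}^{N}\left(x_i^2 - 2\bar{x}x_i + \bar{x}^2\right) \\
&= \frac{1}{N}\sum_{i=1}^{N}x_i^2 - 2\bar{x}\,\frac{1}{N}\sum_{i=1}^{N}x_i + \bar{x}^2 \\
&= \frac{1}{N}\sum_{i=1}^{N}x_i^2 - 2\bar{x}^2 + \bar{x}^2 = \frac{1}{N}\sum_{i=1}^{N}x_i^2 - \bar{x}^2
\end{aligned}
$$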
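Why the one-pass formula is dangerous can be seen in a small Python experiment. The sample data is an assumption chosen to have a small spread around a large offset. With double-precision floats, the mean of the squares and the square of the mean are both close to 10^18, so their difference loses the digits that carry the variance.

```python
def variance_two_pass(values):
    """Population variance: compute the mean first, then the squared deviations."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def variance_one_pass(values):
    """Population variance as mean of the squares minus square of the mean."""
    n = len(values)
    total = 0.0
    total_of_squares = 0.0
    for x in values:
        total += x
        total_of_squares += x * x
    mean = total / n
    return total_of_squares / n - mean * mean


# Assumed sample data: deviations of -6, -3, 3, 6 around 1e9 + 10.
values = [1e9 + 4.0, 1e9 + 7.0, 1e9 + 13.0, 1e9 + 16.0]
two_pass = variance_two_pass(values)   # exact variance of these values is 22.5
one_pass = variance_one_pass(values)   # suffers from cancellation
```

The two-pass result matches the exact value, while the one-pass result is off by more than the variance itself.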
![Page 71: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/71.jpg)
Summarization• Chan’s adaptation of Welford’s online algorithm
• Using the counts of elements, can calculate the variance in parallel from any number of partitions
• Unfortunately, it can still be numerically unstable
    def parallel_variance(avg_a, count_a, var_a, avg_b, count_b, var_b):
        # Merge the sample variances of two partitions a and b.
        delta = avg_b - avg_a
        m_a = var_a * (count_a - 1)  # sum of squared deviations in a
        m_b = var_b * (count_b - 1)  # sum of squared deviations in b
        M2 = m_a + m_b + delta ** 2 * count_a * count_b / (count_a + count_b)
        return M2 / (count_a + count_b - 1)
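The function above returns only the merged variance, so it cannot be applied again to a third partition without also knowing the merged count and average. A possible variation, sketched here with assumed names `summarize` and `merge`, carries the triple (count, mean, M2) so that a reducer can fold it over any number of partitions.

```python
from functools import reduce
import statistics


def summarize(values):
    """Per-partition summary: (count, mean, M2), M2 = sum of squared deviations."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    return n, mean, m2


def merge(a, b):
    """Merge two summaries with the same update rule as parallel_variance."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2


# Hypothetical data spread over three partitions.
partitions = [[2.0, 4.0, 4.0], [4.0, 5.0], [5.0, 7.0, 9.0]]
n, mean, m2 = reduce(merge, map(summarize, partitions))
sample_variance = m2 / (n - 1)
```

The result can be checked against `statistics.variance` applied to the concatenated data.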
![Page 72: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/72.jpg)
Summarization• Standard Deviation:
• Schubert & Gertz: Numerically Stable Parallel Computation of (Co)-Variance
• SSDBM '18 Proceedings of the 30th International Conference on Scientific and Statistical Database Management
![Page 73: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/73.jpg)
Summarization• Inverted Index
• Analyze each comment in StackOverflow to find hyperlinks to Wikipedia
• Create an index of wikipedia pages pointing to StackOverflow comments that link to them
![Page 74: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/74.jpg)
Summarization• Inverted Index is a group-by problem solved almost
entirely in the map-reduce framework
![Page 75: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/75.jpg)
Summarization / Inverted Index
• Mapper
• Parser:
• Processes posts
• Checks for right type of post, extracts a list of wikipedia urls (or Null if there are none)
• Outputs key-value pairs :
• Keys: wikipedia url
• Value: row-ID of post
• Optional combiner:
• Aggregates values for a wikipedia url in a single list
![Page 76: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/76.jpg)
Summarization / Inverted Index
• Reducer
• Aggregates values belonging to the same key in a list
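The mapper and reducer roles can be sketched in Python as follows. The regular expression, the function names, and the sample posts are assumptions for illustration; real posts would need a more careful link parser.

```python
import re
from collections import defaultdict

# Assumed pattern for Wikipedia links.
WIKIPEDIA_URL = re.compile(r"https?://\w+\.wikipedia\.org/wiki/[^\s<>]+")


def map_post(row_id, text):
    """Mapper: emit one (url, row_id) pair per Wikipedia link in the post."""
    for url in WIKIPEDIA_URL.findall(text):
        yield url, row_id


def build_index(posts):
    """Shuffle and reduce: collect the row IDs of each URL in a sorted list."""
    index = defaultdict(set)
    for row_id, text in posts:
        for url, post_id in map_post(row_id, text):
            index[url].add(post_id)
    return {url: sorted(ids) for url, ids in index.items()}


# Hypothetical posts as (row ID, text) pairs.
posts = [
    (1, "See https://en.wikipedia.org/wiki/MapReduce for background."),
    (2, "No links here."),
    (3, "Compare https://en.wikipedia.org/wiki/MapReduce and "
        "https://en.wikipedia.org/wiki/Bloom_filter"),
]
index = build_index(posts)
```

Posts without links simply emit nothing, so they never reach a reducer.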
![Page 77: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/77.jpg)
Summarization / Inverted Index
• Generic Inverted Index diagram
[Diagram: mappers emit (keyword: id) pairs; reducers collect them into (keyword: list of id) pairs and send these to the client.]
![Page 78: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/78.jpg)
Counter Pattern• Used to gather stats on a Hadoop job
• Create various counters (but not too many)
• Counters work exclusively in the Map-Reduce Paradigm
![Page 79: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/79.jpg)
Counter Pattern
[Diagram: three counting mappers increment named counters (A through E); each mapper reports to its Task Tracker, and the Task Trackers report to the Job Tracker, which holds the totals for Counter A, Counter B, Counter C, and Counter D.]
![Page 80: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/80.jpg)
Counter Pattern• Mapper processes each input
• Increments counter for each record
• Counters are aggregated by Task Trackers
• Task Trackers report counts to Job Tracker
• Job Tracker aggregates counts (unless task tracker failed)
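The aggregation path can be imitated locally in Python. This sketch is not the Hadoop counter API: the function names, the "Unknown" counter, and the sample locations are assumptions. Each map task only increments named counters and writes no output records, and the job-level total is the sum of the per-task counters.

```python
from collections import Counter


def run_map_task(records, states):
    """One map task: count users per recognized state, emit no output records."""
    counters = Counter()
    for location in records:
        tokens = location.upper().split()
        state = next((t for t in tokens if t in states), None)
        counters[state if state else "Unknown"] += 1
    return counters


def aggregate(task_counters):
    """Job-level aggregation: add up the counters reported by every task."""
    totals = Counter()
    for counters in task_counters:
        totals.update(counters)
    return totals


# Hypothetical location strings, split over two map tasks.
states = {"WI", "IL", "CA"}
tasks = [["Milwaukee WI", "Chicago IL"], ["Madison WI", "Paris"]]
totals = aggregate(run_map_task(task, states) for task in tasks)
```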
![Page 81: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/81.jpg)
Counter Pattern
    public static class CountNumUsersByStateMapper
            extends Mapper<Object, Text, NullWritable, NullWritable> {
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            Map<String, String> parsed = MRDPUtils.transformXmlToMap(value.toString());
            String location = parsed.get("Location");
            if (location != null && !location.isEmpty()) {
                // Assumption: the slide elides the loop that the break belongs to;
                // here the location is split into whitespace-separated tokens.
                for (String state : location.toUpperCase().split("\\s")) {
                    if (states.contains(state)) {
                        context.getCounter(STATE_COUNTER_GROUP, state).increment(1);
                        break;
                    }
                }
                …
            }
        }
    }
![Page 82: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/82.jpg)
Counter Pattern• To get the counts, read them in the driver after the job completes:
    int code = job.waitForCompletion(true) ? 0 : 1;
    if (code == 0) {
        for (Counter counter : job.getCounters().getGroup(
                CountNumUsersByStateMapper.STATE_COUNTER_GROUP)) {
            System.out.println(counter.getDisplayName() + "\t" + counter.getValue());
        }
    }
![Page 83: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/83.jpg)
Filtering Patterns• Extract data from records without changing them
• Sampling:
• get a few random records
• get records with very high or low values in a field
![Page 84: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/84.jpg)
Filtering Patterns• Simple filtering:
• User defined function or condition is a boolean
• Decides whether record is to be kept or not
• Very generic pattern
    map(key, record):
        if user_condition(record):
            emit key, record
![Page 85: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/85.jpg)
Filtering Patterns• Simple Filtering
[Diagram: each InputSplit is processed by its own Filter Mapper, which writes one OutputSplit.]
![Page 86: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/86.jpg)
Filtering Patterns• Simple Filtering
• There is no “reduce” operation because there is no aggregation
• If output splits are saved, they can serve as new inputs
![Page 87: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/87.jpg)
Filtering Patterns• Simple Filtering in Pig
• Uses the FILTER keyword
• b = FILTER a BY value < 3
![Page 88: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/88.jpg)
Filtering Patterns• Because there are no reducers
• Data never has to be transmitted
• There is no sort phase and no reduce phase
![Page 89: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/89.jpg)
Filtering Patterns• Filtering pattern:
• Grep: filtering for a regular expression
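A map-only grep can be sketched in a few lines of Python. The function name, the sample lines, and the pattern are assumptions for illustration.

```python
import re


def grep_mapper(lines, pattern):
    """Map-only grep: emit every line that contains a match for the pattern."""
    regex = re.compile(pattern)
    for line in lines:
        if regex.search(line):
            yield line


# Hypothetical log lines; the pattern matches "error:" at line start, any case.
lines = ["error: disk full", "all good", "ERROR: timeout", "warning: slow"]
matches = list(grep_mapper(lines, r"(?i)^error:"))
```

Matching lines are emitted unchanged, so the output can feed another job directly.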
![Page 90: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/90.jpg)
Filtering Patterns• Getting a random sample
• Simple random sampling (SRS)
• Grab a random subset of data
• Can get filter_percentage property:
• context.getConfiguration().get(“filter_percentage”)
• Mapper writes objects with a given probability
• There is neither a combiner nor a reducer
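Simple random sampling can be sketched as a map-only filter whose condition is a coin flip. In this sketch the `filter_percentage` value is passed as a function argument instead of being read from the job configuration, and the fixed seed is an assumption made only to keep the run repeatable.

```python
import random


def sample_mapper(records, filter_percentage, rng):
    """Map-only sampler: keep each record independently with the given probability."""
    keep_probability = filter_percentage / 100.0
    for record in records:
        if rng.random() < keep_probability:
            yield record


rng = random.Random(42)  # fixed seed only to make the sketch repeatable
sample = list(sample_mapper(range(10_000), filter_percentage=5.0, rng=rng))
```

The sample size is itself random: about 5 percent of the input, but not exactly.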
![Page 91: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/91.jpg)
Excurse: Bloom Filters• Bloom filters (1970 Burton Howard Bloom)
• Use to test membership in a set
• Idea:
• A data structure that can quickly decide whether an element does not belong to a large set
• And probabilistically whether an element is present
• With a low probability of error
![Page 92: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/92.jpg)
Excurse: Bloom Filters• Idea: A bloom filter is a large bit array
• (How large: Deduplication proposes sizes of several GB)
• Uses a good hash function
• For each element in the set:
• Calculate h(ele, 0), h(ele, 1), h(ele, 2)
• Set the resulting bits in the bit array
![Page 93: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/93.jpg)
Excurse: Bloom Filters• To test for the presence of $x$ in the set $S$
• Calculate $h(x,0)$, $h(x,1)$, $h(x,2)$
• Check whether the corresponding bits are set.
• Bits are set, but $x \notin S$
• Then the bits were set by other elements.
• If $|S| = n$ and there are $N$ bits in the bit array, this happens with probability
$$\left(1 - \left(1 - \frac{1}{N}\right)^{3n}\right)^{3} \approx \left(1 - \exp\left(-\frac{3n}{N}\right)\right)^{3}$$
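A small Bloom filter with three hash functions can be sketched in Python as follows. Deriving h(x, i) from SHA-256 with the index i as a salt is an assumption of this sketch, as are the class name, the filter size, and the sample elements.

```python
import hashlib


class BloomFilter:
    """Bit array of num_bits bits with num_hashes salted hash functions."""

    def __init__(self, num_bits, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, element):
        # h(element, i): hash the salt i together with the element and reduce
        # the result modulo the array size.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{element}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, element):
        for pos in self._positions(element):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, element):
        """False means definitely absent; True means present or a false positive."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(element))


# Hypothetical set of 1000 members in a filter of 20 bits per element.
members = [f"user{i}" for i in range(1000)]
bloom = BloomFilter(num_bits=20_000)
for member in members:
    bloom.add(member)

non_members = [f"guest{i}" for i in range(10_000)]
false_positive_rate = sum(bloom.might_contain(x) for x in non_members) / len(non_members)
```

With n = 1000 and N = 20000, the approximation above predicts a false-positive probability of roughly 0.003, and the measured rate should be of that order.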
![Page 94: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/94.jpg)
Excurse: Bloom Filters
[Plot: multiplicator (y-axis, 20 to 80) against error probability (x-axis, 0.0002 to 0.0010) for 3 and 4 hashes.]
The number of bits equals the number of set elements times the multiplicator; the plot shows the multiplicator needed to achieve a given error probability with 3 and 4 hashes.
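The plotted relationship can be recomputed from the approximation $(1 - \exp(-kn/N))^k$ by writing $N = m \cdot n$ and solving for the multiplicator $m$, which gives $m = -k / \ln(1 - p^{1/k})$ for error probability $p$ and $k$ hashes. The function names below are assumptions.

```python
import math


def false_positive_probability(multiplicator, num_hashes):
    """Approximate error probability (1 - exp(-k / m))^k for m bits per element."""
    return (1.0 - math.exp(-num_hashes / multiplicator)) ** num_hashes


def required_multiplicator(error_probability, num_hashes):
    """Invert the approximation: bits per element needed for a target error."""
    root = error_probability ** (1.0 / num_hashes)
    return -num_hashes / math.log(1.0 - root)


m3 = required_multiplicator(0.001, 3)  # bits per element with 3 hashes
m4 = required_multiplicator(0.001, 4)  # bits per element with 4 hashes
```

For an error probability of 0.001, four hashes need fewer bits per element than three.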
![Page 95: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/95.jpg)
Filtering Patterns• Bloom Filtering pattern:
• Need to accept a few false positives
• Creating Bloom Filter
• Using Bloom Filter
![Page 96: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/96.jpg)
Filtering Patterns• Mappers create local bloom filter
• Reducer combines them
[Diagram: each input split goes through a map task that builds a local Bloom filter; a single reducer merges the local filters into the global Bloom filter.]
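Merging works because a Bloom filter for a union of sets is the bitwise OR of the filters of the individual sets, provided all filters share the same size and hash functions. A minimal sketch follows, with the filter stored as a Python integer bit mask; the size, the SHA-256-based hashes, and all names are assumptions.

```python
import hashlib
from functools import reduce

NUM_BITS = 1 << 12  # assumed filter size, shared by all mappers
NUM_HASHES = 3


def positions(element):
    """Bit positions h(element, 0), h(element, 1), h(element, 2)."""
    for i in range(NUM_HASHES):
        digest = hashlib.sha256(f"{i}:{element}".encode("utf-8")).digest()
        yield int.from_bytes(digest[:8], "big") % NUM_BITS


def local_filter(split):
    """Mapper: build the Bloom filter of one input split as an int bit mask."""
    bits = 0
    for element in split:
        for pos in positions(element):
            bits |= 1 << pos
    return bits


def global_filter(local_filters):
    """Reducer: the filter of the union is the bitwise OR of the local filters."""
    return reduce(lambda a, b: a | b, local_filters, 0)


# Hypothetical input splits.
splits = [["ann", "bob"], ["cy"], ["dee", "ann"]]
combined = global_filter(local_filter(split) for split in splits)
```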
![Page 97: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/97.jpg)
Filtering Patterns• Could use more than one reducer by breaking up the local Bloom filters into ranges
[Diagram: input splits, map tasks with local Bloom filters, reducer, global Bloom filter.]
![Page 98: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/98.jpg)
Filtering Patterns• Bloom Filtering
[Diagram: the global Bloom filter is distributed through the Distributed Cache; each input split passes through a Bloom filter test and is written to its own output file.]
![Page 99: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/99.jpg)
Filtering Patterns• Bloom Filtering is used for
• Removing (almost all of) unwatched items
• Prefiltering data
• Pig: Need to implement Bloom filtering as user-defined functions
![Page 100: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/100.jpg)
Filtering Patterns• Top-ten
• Retrieve the records that have the k largest values in a certain attribute
• Group Exercise
![Page 101: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/101.jpg)
Filtering Patterns• A set of distinct records
• Example: Web page log
• Want to have records where user-name, device, or browser are different, but we don’t care about time stamps
![Page 102: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/102.jpg)
Filtering Patterns• Unique records:
• Use the map-reduce grouping properties
• Mappers group by the attributes we are interested in
• Combiners emit one value for each group
• Reducer only emits one value for each group
map(key, record): emit record, null
reduce(key, records): emit key
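For the web-log example, the pseudocode can be sketched in Python like this. The field names and the sample log are assumptions. The time stamp is deliberately left out of the key, so records that differ only in time collapse into one group.

```python
def map_record(record):
    """Mapper: the attributes of interest become the key, the value is empty."""
    return (record["user"], record["device"], record["browser"]), None


def distinct(records):
    """Shuffle groups by key; the reducer emits each key exactly once."""
    groups = {}
    for record in records:
        key, value = map_record(record)
        groups.setdefault(key, []).append(value)
    return sorted(groups)


# Hypothetical web-log records.
log = [
    {"user": "ann", "device": "phone", "browser": "firefox", "time": "10:00"},
    {"user": "ann", "device": "phone", "browser": "firefox", "time": "10:05"},
    {"user": "bob", "device": "laptop", "browser": "chrome", "time": "10:07"},
]
unique = distinct(log)
```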
![Page 103: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/103.jpg)
Filtering Patterns• Unique records
• Pig:
b = distinct a;
![Page 104: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/104.jpg)
Data Organization Patterns• Problem:
• Transform data to a different format
• Row-based data to hierarchical format such as JSON or XML
![Page 105: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/105.jpg)
Data Organization Patterns• Example
• StackOverflow data
• Posts and comments are separated
• Lines in an XML document
• Hierarchy combines posts and comments
• Hierarchical data model allows us to correlate length of posts with number of comments, etc.
[Diagram: a Posts root with Post children, each Post holding its Comment children.]
![Page 106: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/106.jpg)
Data Organization Patterns• Often, the data to be combined comes from different data
sets
• Hadoop class MultipleInputs from org.apache.hadoop.mapreduce.lib.input
• Allows specifying different input paths and a different mapper for each input
• Configuration is done in the driver
![Page 107: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/107.jpg)
Data Organization Patterns• Multiple sources to hierarchical Pattern
• Mappers load data and parse it into a cohesive format
• Output key corresponds to root of hierarchical record
• E.g. StackOverflow: root is post_id
• Need to identify the source for each mapper output
• E.g. StackOverflow: is this a post or a comment
• Combiners are pretty useless because we create large strings
![Page 108: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/108.jpg)
Data Organization Patterns• Multiple sources to hierarchical Pattern
• Reducers receive data from different sources key by key
• For each key, can now build hierarchical data structure
![Page 109: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/109.jpg)
Data Organization Patterns• Multiple sources to hierarchical Pattern
• Result is in hierarchical form
• Probably need to add header and footer so that it is well-formed
![Page 110: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/110.jpg)
Data Organization Patterns
[Diagram: input splits of Data Set A feed Data Set A mappers, which emit (post ID, post data); input splits of Data Set B feed Data Set B mappers, which emit (parent ID, child data); after shuffle and sort, the hierarchy reducers write the output parts.]
![Page 111: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/111.jpg)
Data Organization Patterns• Multiple sources to Hierarchical Pattern
• Performance problems
• Need enough reducers
• Reducers might see a lot of skew:
• Some are busy, others are not
• Hot spots can result in humongous strings moved between mappers and reducers
• Could take up the heap of a Java Virtual Machine
![Page 112: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/112.jpg)
Data Organization Patterns• Needs two mappers: one for comments, one for posts
• Both: extract post-id to use as output key
• Append “P” or “C” to distinguish between sources
![Page 113: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/113.jpg)
Data Organization Patterns• Reducer:
• Reducers receive post-id + marker as key and text as value
• For each post-id with “P” marker:
• Create a post entry in the XML
• For each post-id with “C” marker:
• Create child
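The two mappers and the reducer can be sketched in Python as follows. One assumption deviates from the wording above: the "P" or "C" marker is carried in the value rather than in the key, so that a post and its comments share the same key and land in the same group. The record layout and function names are also assumptions.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def map_post(post):
    """Post mapper: key is the post ID, value is tagged with 'P'."""
    return post["id"], ("P", post["body"])


def map_comment(comment):
    """Comment mapper: key is the parent post ID, value is tagged with 'C'."""
    return comment["post_id"], ("C", comment["text"])


def reduce_post(post_id, tagged_values):
    """Reducer: build one <post> element with its <comment> children."""
    post = ET.Element("post", id=str(post_id))
    for tag, text in tagged_values:
        if tag == "P":
            post.set("body", text)
        else:
            ET.SubElement(post, "comment").text = text
    return ET.tostring(post, encoding="unicode")


# Hypothetical inputs from the two data sets.
posts = [{"id": 1, "body": "How do combiners work?"}]
comments = [{"post_id": 1, "text": "See the docs."},
            {"post_id": 1, "text": "Thanks!"}]

grouped = defaultdict(list)  # stands in for shuffle and sort
for key, value in [map_post(p) for p in posts] + [map_comment(c) for c in comments]:
    grouped[key].append(value)
documents = [reduce_post(key, values) for key, values in sorted(grouped.items())]
```

Each string in `documents` is one post element; a surrounding root element would still have to be added to obtain a well-formed XML file.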
![Page 114: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/114.jpg)
Data Organization Patterns• Partitioning
• Moves records into shards, but does not care about ordering
• Example: partitioning by date
• Need to know number of partitions ahead of time
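A partitioner for the date example could look like the following sketch, which assigns one shard per year. The function name, the year range, and the sample dates are assumptions. The number of partitions is fixed before the job runs, which is why a date outside the configured range is rejected.

```python
from collections import defaultdict
from datetime import date


def year_partitioner(record_date, first_year, num_partitions):
    """Partitioner: map a date to a shard index; the shard count is fixed up front."""
    index = record_date.year - first_year
    if not 0 <= index < num_partitions:
        raise ValueError(f"year {record_date.year} lies outside the configured range")
    return index


# Hypothetical record dates covering the years 2009 to 2011.
records = [date(2009, 5, 1), date(2011, 1, 9), date(2009, 12, 31), date(2010, 7, 4)]
shards = defaultdict(list)
for record in records:
    shards[year_partitioner(record, first_year=2009, num_partitions=3)].append(record)
```

Within a shard the records keep their arrival order; nothing is sorted.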
![Page 115: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/115.jpg)
Data Organization Patterns• Partitioning: Let the partitioner do the job
[Diagram: input splits are read by identity mappers; shuffle and sort, driven by the partitioner, routes the records to identity reducers, which write the output splits.]
![Page 116: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/116.jpg)
Data Organization Patterns• Binning:
• Moves records into categories irrespective of order
• Related to partitioning
• Which one works better depends on the system
• Binners do not use reducers
![Page 117: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/117.jpg)
Data Organization Patterns
• Binning
• Number of outputs = number of mappers × number of bins
[Figure: three InputSplits, each read by its own Binning Mapper, which writes its own Bin A, Bin B, and Bin C outputs.]
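A small map-only sketch of binning, under the assumption that each mapper writes one output per bin it encounters. The bin rule and all names here are hypothetical:

```python
def choose_bin(value):
    # Hypothetical rule: a three-way split around the value 8.
    if value == 8:
        return "eights"
    return "bigs" if value > 8 else "smalls"


def binning_mapper(mapper_id, input_split, choose_bin):
    """Map-only binning: no reducer, one output per (mapper, bin) pair."""
    outputs = {}
    for record in input_split:
        bin_name = choose_bin(record)
        outputs.setdefault((mapper_id, bin_name), []).append(record)
    return outputs


splits = [[1, 8, 12], [9, 8, 3], [7, 20, 8]]
all_outputs = {}
for mapper_id, split in enumerate(splits):
    all_outputs.update(binning_mapper(mapper_id, split, choose_bin))
```

With three mappers and three bins, and every bin hit by every mapper, this produces nine outputs.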
![Page 118: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/118.jpg)
Data Organization Patterns
• Binning:
• Pig
• SPLIT data INTO eights IF col1 == 8, bigs IF col1 > 8, smalls IF col1 < 8;
![Page 119: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/119.jpg)
Data Organization Patterns
• Total Order Sorting
• Data needs to be sorted by a given comparator
• Result is a set of shards that are ordered
• Need to know the distribution of data first
• Run an analyze phase first
![Page 120: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/120.jpg)
Data Organization Patterns
• Total order sorting:
• Analyze phase
• Mapper does random sampling
• Outputs only the key by which we sort
• Use only one reducer, which gives us the sampled sort keys in order
![Page 121: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/121.jpg)
Data Organization Patterns
• Total order sorting:
• Order phase
• Mapper extracts the sort key and stores the record as a value
• Custom partitioner is loaded based on the results of the analysis phase
• TotalOrderPartitioner in Hadoop
• Takes the data ranges prescribed and uses them to partition
• Reducer simply outputs the values
• Shuffle and sort has already done all the work
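The two phases can be sketched in plain Python. This is an in-memory illustration under stated assumptions, not Hadoop's TotalOrderPartitioner; `analyze`, `order`, and the sampling rate are hypothetical:

```python
import bisect
import random


def analyze(records, key, num_partitions, sample_rate=0.5, seed=0):
    """Analyze phase: sample sort keys and pick num_partitions - 1 split points."""
    rng = random.Random(seed)
    sample = sorted(key(r) for r in records if rng.random() < sample_rate)
    if not sample:
        return []
    step = len(sample) / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]


def order(records, key, split_points):
    """Order phase: range-partition by the split points, then sort each shard."""
    shards = [[] for _ in range(len(split_points) + 1)]
    for record in records:
        shards[bisect.bisect_right(split_points, key(record))].append(record)
    return [sorted(shard, key=key) for shard in shards]


records = list(range(100))
random.Random(1).shuffle(records)
split_points = analyze(records, key=lambda r: r, num_partitions=4)
shards = order(records, key=lambda r: r, split_points=split_points)
```

Because every key in shard i is smaller than every key in shard i+1, reading the shards in order yields a totally ordered result.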
![Page 122: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/122.jpg)
Data Organization Patterns
• Total Ordering in Pig
• c = order b by col1;
![Page 123: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/123.jpg)
Data Organization Patterns
• Shuffling
• Randomizes the order of a set of records
![Page 124: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/124.jpg)
Data Organization Patterns
• Shuffling
• Mapper keeps each record unchanged, but attaches a random key
• Reducer sorts according to the random keys
• Only the record is written out; the random key is dropped
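A minimal sketch of the shuffling pattern, assuming a single reducer and a seeded generator for repeatability (function names are hypothetical):

```python
import random


def shuffle_mapper(record, rng):
    """Keep the record unchanged and attach a random key."""
    return (rng.random(), record)


def shuffle_reducer(pairs):
    """Sort by the random key and emit only the record."""
    return [record for _, record in sorted(pairs, key=lambda pair: pair[0])]


rng = random.Random(42)
records = ["r1", "r2", "r3", "r4", "r5"]
shuffled = shuffle_reducer(shuffle_mapper(r, rng) for r in records)
```

The output contains exactly the input records, in an order determined only by the random keys.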
![Page 125: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/125.jpg)
Data Organization Patterns
• Shuffling
• Pig:
• c = GROUP b BY RANDOM();
• d = FOREACH c GENERATE FLATTEN(b);
![Page 126: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/126.jpg)
Join Patterns
• Join Pattern
• Different cases
• Simplest Case: Join with a small table
• Send small table to all mappers
• Mappers calculate local join
• No reducers are needed
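A sketch of the small-table case, assuming the small table fits in memory as a dictionary and that an inner join is wanted. The table contents and function name are made up for illustration:

```python
def replicated_join_mapper(input_split, small_table):
    """Join each (key, value) record of the large split with the cached small table."""
    for key, value in input_split:
        if key in small_table:  # inner join: records without a match are dropped
            yield (key, value, small_table[key])


# The small table is sent to every mapper (here: simply passed as an argument).
small_table = {"bob": "Milwaukee", "eve": "Madison"}
splits = [[("bob", 1), ("amy", 2)], [("eve", 3), ("bob", 4)]]
output_parts = [list(replicated_join_mapper(split, small_table)) for split in splits]
```

Each mapper produces its own output part from its local join.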
![Page 127: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/127.jpg)
Join Patterns
[Figure: five Input Splits of the large table A each feed a Join Mapper; the small table B reaches every Join Mapper through the Distributed Cache; each mapper writes its own Output Part.]
![Page 128: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/128.jpg)
Join Patterns
• Pig allows you to give hints for joins
huge = LOAD 'huge.txt' AS (h1,h2);
smallest = LOAD 'smallest.txt' AS (ss1, ss2);
small = LOAD 'small.txt' AS (s1,s2);
A = JOIN huge BY h1, small BY s1, smallest BY ss1 USING 'replicated';
![Page 129: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/129.jpg)
Join Patterns
• Reduce Side Join
• Join multiple large data sets together by some foreign key
• Structure:
• Mapper goes through all records in both data sets
• Mapper creates pairs
• foreign key -> (source, rest of record)
• source is the name of the table
• Can use hash partitioner or a customized partitioner
• Reducer separates the values of each input group into two lists, one per source
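The structure above can be sketched as an in-memory simulation. The `shuffle` helper stands in for the framework's shuffle and sort, and the table contents are made up:

```python
from collections import defaultdict


def join_mapper(source, records):
    """Emit foreign key -> (source, rest of record) for every record."""
    for foreign_key, rest in records:
        yield foreign_key, (source, rest)


def shuffle(pairs):
    """Stand-in for shuffle and sort: group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def join_reducer(foreign_key, values):
    """Separate the group into one list per source, then emit an inner join."""
    list_a = [rest for source, rest in values if source == "A"]
    list_b = [rest for source, rest in values if source == "B"]
    for a in list_a:
        for b in list_b:
            yield (foreign_key, a, b)


table_a = [("bob", "a1"), ("amy", "a2")]
table_b = [("bob", "b1"), ("bob", "b2"), ("zed", "b3")]
pairs = list(join_mapper("A", table_a)) + list(join_mapper("B", table_b))
joined = [
    row
    for key, values in sorted(shuffle(pairs).items())
    for row in join_reducer(key, values)
]
```

Only "bob" appears in both tables, so the inner join yields two rows for that key.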
![Page 130: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/130.jpg)
[Figure: the input splits of Data Set A and Data Set B each feed a JoinMapper; the mappers emit pairs such as bob->A,record and bob->B,record; after Shuffle & Sort, each JoinReducer combines them into rows such as bob, A.record, B.record and writes an output part.]
![Page 131: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/131.jpg)
Join Patterns
• Within each input group, the reducer receives all records that share a foreign key value
• This allows the reducer to create many types of joins based on equality
![Page 132: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/132.jpg)
Join Patterns
• Inner join:
• Records from Table A and Table B are joined if they share the same foreign key
![Page 133: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/133.jpg)
[Figure: Inner Join of Table A (foreign key, attribute 1, attribute 2) and Table B (foreign key, attribute 1, attribute 2, attribute 3).]
![Page 134: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/134.jpg)
Join Patterns
• Outer Join
• If the foreign key is not present in one table, then the missing values are filled with NULL values
![Page 135: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/135.jpg)
[Figure: Outer Join of Table A (foreign key, attribute 1, attribute 2) and Table B (foreign key, attribute 1, attribute 2, attribute 3); attributes missing on either side are filled with NULL.]
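A sketch of how a reducer might turn one input group into inner or outer join rows, with `None` standing in for NULL. The `how` parameter and its values ("inner", "left", "right", "full") are an assumption borrowed from common SQL terminology, not from the slides:

```python
def join_group(foreign_key, list_a, list_b, how="inner"):
    """Combine the two per-source lists of one input group into joined rows."""
    if how in ("left", "full") and not list_b:
        list_b = [None]  # keep A records, pad the B side with NULL
    if how in ("right", "full") and not list_a:
        list_a = [None]  # keep B records, pad the A side with NULL
    return [(foreign_key, a, b) for a in list_a for b in list_b]


inner_rows = join_group("amy", ["a2"], [], how="inner")
left_rows = join_group("amy", ["a2"], [], how="left")
full_rows = join_group("zed", [], ["b3"], how="full")
```

Because the reducer sees both lists for a key at once, switching join types only changes how empty lists are handled.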
![Page 136: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/136.jpg)
Join Patterns
• Optimization with Bloom Filters
• If we calculate an inner join with map-reduce, a mapper does not have to create a key-value pair if the foreign key value is not present in the other table
[Figure: Table A (foreign key, attribute 1, attribute 2) and Table B (foreign key, attribute 1, attribute 2, attribute 3); a foreign key value in one table has no counterpart in the other (NULL, value not present).]
![Page 137: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/137.jpg)
Join Patterns
• Create a Bloom Filter for both sets (or for only one)
• Put Bloom filter in distributed cache or send it to all mappers for the other table
• Then only send records to the reducer when we know that the foreign key value is also present in the other table
![Page 138: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/138.jpg)
[Figure: the reduce side join as before (input splits of Data Set A and Data Set B, JoinMappers, Shuffle & Sort, JoinReducers, output parts), with a Bloom Filter for the foreign key values of Table B supplied to the JoinMappers.]
![Page 139: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/139.jpg)
Join Patterns
• Bloom filter does not need to be very good to be efficient
• Bloom filter does not have false negatives, only false positives
• Bloom filter needs to be created before it can be used, making this into a two phase job
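A toy Bloom filter and a filtering mapper, to illustrate the two phases. The filter size, the number of hash functions, and the use of SHA-256 are arbitrary choices for this sketch; a production job would more likely use a library implementation:

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for position in self._positions(item):
            self.bits[position] = 1

    def might_contain(self, item):
        # False means "definitely absent"; True may be a false positive.
        return all(self.bits[position] for position in self._positions(item))


# Phase 1: build the filter from the foreign keys of Table B.
table_b_keys = ["bob", "eve"]
bloom = BloomFilter()
for key in table_b_keys:
    bloom.add(key)


# Phase 2: mappers for Table A emit only records that may have a partner.
def filtering_mapper(records, bloom):
    for foreign_key, rest in records:
        if bloom.might_contain(foreign_key):
            yield foreign_key, ("A", rest)


table_a = [("bob", "a1"), ("amy", "a2"), ("eve", "a3")]
emitted = list(filtering_mapper(table_a, bloom))
```

A false positive only costs an extra record sent to the reducer, where the join discards it; a false negative cannot occur, so no matching record is lost.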
![Page 140: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/140.jpg)
Join Patterns
• Composite Join
• Supported by Hadoop: CompositeInputFormat
• Idea:
• Preprocess all table contents to create input shards that are sorted and partitioned by foreign key.
• This is somewhat similar to the idea of the hash-join
• Two-phase job again
![Page 141: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/141.jpg)
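Join Patterns
• Create hash buckets for records based on foreign keys
• Buckets are sorted
[Figure: foreign keys of Data Set A and Data Set B, bucketed by hash(fk)%5]
hash(fk)%5 = 0: Data Set A: Adam, Adam, James, Xavier; Data Set B: Adam, Xavier, Zazie
hash(fk)%5 = 1: Data Set A: Brenda, Carl; Data Set B: Brenda, Brenda, Carl, Oscar
hash(fk)%5 = 2: Data Set A: Jorge, Fritz, Karl; Data Set B: Jorge, Jorge
hash(fk)%5 = 3: Data Set A: Aaron, Aaron, Gilbert; Data Set B: Gilbert, Gilbert, Gilbert
hash(fk)%5 = 4: Data Set A: Dennis, Dennis, Frodo; Data Set B: Dennis, Frodo, Frodo, Frodo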
![Page 142: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/142.jpg)
Join Patterns
• Send all buckets with the same hash value to the same mapper
• Mappers then combine
• There are no reducers involved
[Figure: the hash(fk)%5 buckets of Data Set A and Data Set B from the previous slide are sent pairwise to five Mappers, which write the output parts: Adam, Adam, Xavier; Brenda, Brenda, Carl; Jorge, Jorge; Gilbert, Gilbert; Dennis, Dennis, Frodo, Frodo, Frodo.]
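A sketch of the composite join idea in plain Python rather than Hadoop's CompositeInputFormat. The bucket function (CRC32 modulo 5) is an arbitrary stand-in, so the bucket assignment will not match the figure; records are reduced to their foreign keys for brevity:

```python
import zlib

NUM_BUCKETS = 5


def bucket_of(foreign_key):
    # Assumption: any deterministic hash works as long as both data sets use it.
    return zlib.crc32(foreign_key.encode()) % NUM_BUCKETS


def make_buckets(foreign_keys):
    """Preprocessing phase: hash-partition the keys, then sort each bucket."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for foreign_key in foreign_keys:
        buckets[bucket_of(foreign_key)].append(foreign_key)
    return [sorted(bucket) for bucket in buckets]


def composite_join_mapper(bucket_a, bucket_b):
    """Merge-join two sorted buckets; emits one key per matching (A, B) pair."""
    out, i, j = [], 0, 0
    while i < len(bucket_a) and j < len(bucket_b):
        if bucket_a[i] < bucket_b[j]:
            i += 1
        elif bucket_a[i] > bucket_b[j]:
            j += 1
        else:
            key, i_end, j_end = bucket_a[i], i, j
            while i_end < len(bucket_a) and bucket_a[i_end] == key:
                i_end += 1
            while j_end < len(bucket_b) and bucket_b[j_end] == key:
                j_end += 1
            out.extend([key] * ((i_end - i) * (j_end - j)))
            i, j = i_end, j_end
    return out


data_a = ["Adam", "Adam", "James", "Xavier", "Brenda", "Carl"]
data_b = ["Adam", "Xavier", "Zazie", "Brenda", "Brenda", "Carl", "Oscar"]
joined = [
    key
    for bucket_a, bucket_b in zip(make_buckets(data_a), make_buckets(data_b))
    for key in composite_join_mapper(bucket_a, bucket_b)
]
```

Since equal keys always land in buckets with the same index, each mapper can join its pair of buckets without a shuffle or reducer.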
![Page 143: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/143.jpg)
Join Patterns
• Composite Join Performance
• Most of the work is in the creation of the composite join splits
![Page 144: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/144.jpg)
Join Patterns
• Cartesian Product
• Cartesian product combines every record in one table with every record in another table
• Easily creates gigantic results
• With huge execution times
• Uses essentially the composite join pattern
• Each mapper receives an input split from Table A and an input split from Table B
• If we break Table A into n pieces and Table B into m pieces, then we need n × m mappers
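A short sketch of the split pairing, assuming both tables are simple lists; `split_table` and `cartesian_mapper` are hypothetical helpers:

```python
from itertools import product


def split_table(table, pieces):
    """Break a table into `pieces` roughly equal input splits."""
    return [table[i::pieces] for i in range(pieces)]


def cartesian_mapper(split_a, split_b):
    """Pair every record of one split with every record of the other."""
    return [(a, b) for a in split_a for b in split_b]


table_a = ["a1", "a2", "a3", "a4"]
table_b = ["b1", "b2", "b3"]
n, m = 2, 3

# One mapper task per pair of splits: n × m tasks in total.
tasks = list(product(split_table(table_a, n), split_table(table_b, m)))
output_parts = [cartesian_mapper(split_a, split_b) for split_a, split_b in tasks]
```

Here 2 × 3 = 6 mapper tasks together produce all 4 × 3 = 12 pairs.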
![Page 145: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/145.jpg)
[Figure: the foreign keys of Data Set A and Data Set B are each divided into two splits by hash(fk1)%2 and hash(fk2)%2; four Mappers receive the split pairs (A-1, B-1), (A-1, B-2), (A-2, B-1), and (A-2, B-2), and each writes one cartesian join output part.]
![Page 146: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/146.jpg)
Meta Patterns
• Job Chaining
• Run two or more map-reduce jobs
• Output of the first is input to the second
[Figure: two map-reduce jobs in sequence. In each job, InputShards feed Mappers, followed by Shuffle & Order and Reducers; the reducer output of the first job becomes InputShards of the second job.]
![Page 147: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/147.jpg)
Meta Patterns
• The map-reduce framework itself is not very good at job chaining
• Special frameworks exist
• Oozie: an Apache workflow engine project
• Java web application
• Workflow - collection of actions
• Map-reduce jobs, Pig jobs arranged in a DAG
• Uses XML-based Hadoop Process Definition Language for workflow specifications
• Oozie coordinates jobs
![Page 148: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/148.jpg)
Meta Patterns
• Doing it yourself
• Using Java
• Map-reduce drivers are simple Java classes
• Take the drivers of the individual Map-reduce jobs and call them in sequence
• Output and input paths need to match
• Use Job.submit(), Job.isComplete(), and Job.waitForCompletion()
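The idea of calling drivers in sequence can be illustrated with an in-memory stand-in. `run_job` is a hypothetical helper that simulates one map-reduce job; it is not the Hadoop Job API, and the two example jobs are made up:

```python
from collections import defaultdict


def run_job(mapper, reducer, input_records):
    """Stand-in for one map-reduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in input_records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


# Job 1: word count.
def count_mapper(line):
    for word in line.split():
        yield word, 1


def count_reducer(word, ones):
    yield (word, sum(ones))


# Job 2: group words by their count.
def invert_mapper(pair):
    word, count = pair
    yield count, word


def invert_reducer(count, words):
    yield (count, sorted(words))


lines = ["a b a", "b c a"]
job1_output = run_job(count_mapper, count_reducer, lines)
# The output of the first job must match the input the second job expects.
job2_output = run_job(invert_mapper, invert_reducer, job1_output)
```

In a real driver, the second job would be started only after the first has completed successfully, and its input path would be the first job's output path.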
![Page 149: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/149.jpg)
Meta Patterns
• Doing it yourself
• Scripting
• Using JobControl
![Page 150: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/150.jpg)
Chain Folding
• Opportunities for optimization
• Mappers work in isolation
• A record can be submitted to multiple mappers or to a reducer-mapper combination
• Avoids transfer of files
![Page 151: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/151.jpg)
Chain Folding
• Chain folding patterns
• Multiple mapping phases are adjacent
• Fold them into single mappers
• If a job ends with a map phase
• Push into the preceding reducer phase
• Split mappers that reduce data and mappers that increase data
• Filter data as early as possible
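A sketch of folding adjacent map phases into a single mapper, assuming each mapper is a generator that yields zero or more records per input. The three example mappers are hypothetical; the filter is placed first so that data shrinks as early as possible:

```python
def fold_mappers(*mappers):
    """Fold adjacent map phases into one mapper applied record by record."""
    def folded(record):
        records = [record]
        for mapper in mappers:
            records = [out for r in records for out in mapper(r)]
        return records
    return folded


def drop_short(line):
    # Filter: reduces data, so it runs first.
    if len(line) > 3:
        yield line


def tokenize(line):
    # Increases data: one record per word.
    yield from line.split()


def lowercase(word):
    yield word.lower()


folded = fold_mappers(drop_short, tokenize, lowercase)
result = [out for line in ["Hi", "Map Reduce"] for out in folded(line)]
```

Because mappers work on records in isolation, the folded mapper gives the same result as three separate map phases without writing intermediate files.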
![Page 152: Map Reduce - Marquette University€¦ · • Map-Reduce jobs can take hours • Each job has a status • System estimates progress for each task • Mappers: per cent input dealt](https://reader034.vdocuments.net/reader034/viewer/2022052018/6031af06a9490a416b18ccf3/html5/thumbnails/152.jpg)
Job Merging
• If two unrelated map-reduce jobs use the same input set
• Can combine the mappers and reducers
• Data is now loaded and parsed only once
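A minimal sketch of job merging, assuming the two jobs are a line count and a word count over the same input. Tagging each key with a job name is one possible way to keep the outputs apart; all names are hypothetical:

```python
from collections import defaultdict


def merged_mapper(line):
    """One parse of the record feeds both jobs; keys are tagged with the job name."""
    words = line.split()  # the record is parsed only once
    yield ("line_count", "all"), 1
    for word in words:
        yield ("word_count", word), 1


def merged_reducer(tagged_key, values):
    """Both jobs happen to sum here; the tag would select the logic otherwise."""
    job, key = tagged_key
    return (job, key, sum(values))


groups = defaultdict(list)
for line in ["a b", "a"]:
    for tagged_key, value in merged_mapper(line):
        groups[tagged_key].append(value)

results = sorted(merged_reducer(key, values) for key, values in groups.items())
```

The tag lets the outputs of the two logical jobs be separated again after the single pass over the input.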