hadoop first mr job - inverted index construction
TRANSCRIPT
First MR job - Inverted Index construction
MapReduce - Introduction
• Parallel job processing framework
• Written in Java
• Close integration with HDFS
• Provides:
– Auto partitioning of a job into sub-tasks
– Auto retry on failures
– Linear scalability
– Locality of task execution
– Plugin-based framework for extensibility
Let's think scalability
• Let's go through an exercise of scaling a simple program to process a large data set.
• Problem: count the number of times each word occurs in a set of documents.
• Example: only one document with only one sentence – “Do as I say, not as I do.”
• Pseudocode: A multiset is a set where each element also has a count
define wordCount as Multiset; (assume a hash table)
for each document in documentSet {
T = tokenize(document);
for each token in T {
wordCount[token]++;
}
}
display(wordCount);
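The same logic as a runnable Java sketch (the class name and whitespace tokenization are illustrative assumptions, not part of the original pseudocode):
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class LocalWordCount {
    public static void main(String[] args) throws Exception {
        // wordCount as a "multiset": word -> number of occurrences
        Map<String, Integer> wordCount = new HashMap<>();
        for (String path : args) {                               // each argument names one document
            String document = new String(Files.readAllBytes(Paths.get(path)));
            for (String token : document.split("\\s+")) {        // tokenize(document)
                if (!token.isEmpty()) {
                    wordCount.merge(token, 1, Integer::sum);      // wordCount[token]++
                }
            }
        }
        wordCount.forEach((w, c) -> System.out.println(w + "\t" + c));  // display(wordCount)
    }
}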
How about a billion documents?
• Looping through all the documents using a single computer will be extremely time-consuming.
• You speed it up by rewriting the program so that it distributes the work over several machines.
• Each machine will process a distinct fraction of the documents. When all the machines have completed this, a second phase of processing will combine the results of all the machines.
define wordCount as Multiset;
for each document in documentSubset {
T = tokenize(document);
for each token in T {
wordCount[token]++;
}
}
sendToSecondPhase(wordCount);
define totalWordCount as Multiset;
for each wordCount received from firstPhase {
multisetAdd (totalWordCount, wordCount);
}
Problems
• Where are documents stored?
– Having more machines for processing only helps up to a certain point, until the storage server can't keep up.
– You'll also need to split up the documents among the set of processing machines such that each machine will process only those documents that are stored on it.
• wordCount (and totalWordCount) are stored in memory
– When processing large document sets, the number of unique words can exceed the RAM of a single machine.
– Furthermore, phase two has only one machine, which will process the wordCount sent from all the machines in phase one. The single machine in phase two becomes the bottleneck.
• Solution: divide based on expected output!
– Let's say we have 26 machines for phase two. We assign each machine to handle only the wordCount for words beginning with a particular letter of the alphabet (a sketch of this routing rule follows below).
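A minimal sketch of that routing rule (the method name and the fallback for words that do not start with a letter are illustrative assumptions):
// Route a word to one of 26 phase-two machines by its first letter.
static int machineFor(String word) {
    char first = Character.toLowerCase(word.charAt(0));
    if (first >= 'a' && first <= 'z') {
        return first - 'a';   // 0..25, one machine per letter
    }
    return 0;                 // assumption: anything else goes to machine 0
}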
Map-Reduce
• MapReduce programs are executed in two main phases, called
– mapping and
– reducing.
• In the mapping phase, MapReduce takes the input data and feeds each data element to the mapper.
• In the reducing phase, the reducer processes all the outputs from the mapper and arrives at a final result.
• The mapper is meant to filter and transform the input into something that the reducer can aggregate over.
• MapReduce uses lists and (key/value) pairs as its main data primitives.
Map-Reduce
• Map-Reduce program
– Based on two functions: Map and Reduce
– Every Map/Reduce program must specify a Mapper and optionally a Reducer
– Operates on key and value pairs
Map-Reduce works like a Unix pipeline:
cat input | grep | sort | uniq -c | cat > output
Input | Map | Shuffle & Sort | Reduce | Output
cat /var/log/auth.log* | grep "session opened" | cut -d' ' -f10 | sort | uniq -c > ~/userlist
Map function: takes a key/value pair and generates a set of intermediate key/value pairs
map(k1, v1) -> list(k2, v2)
Reduce function: takes an intermediate key and all intermediate values associated with that key
reduce(k2, list(v2)) -> list(k3, v3)
Map-Reduce on Hadoop: Putting things in context
[Figure: input files stored in HDFS are divided into splits; on each of the M machines a RecordReader turns a split into (key, value) pairs for a Map task; optional Combiners pre-aggregate map output; a Partitioner assigns intermediate pairs to one of P partitions, one per Reducer; the R Reducers write the output files back to HDFS.]
Some MapReduce Terminology
• Job – A “full program” - an execution of a Mapper and Reducer across a data set
• Task – An execution of a Mapper or a Reducer on a slice of data
– a.k.a. Task-In-Progress (TIP)
• Task Attempt – A particular instance of an attempt to execute a task on a machine
Terminology Example
• Running “Word Count” across 20 files is one job
• 20 files to be mapped implies 20 map tasks + some number of reduce tasks
• At least 20 map task attempts will be performed… more if a machine crashes, or due to speculative execution, etc.
Task Attempts
• A particular task will be attempted at least once, possibly more times if it crashes
– If the same input causes crashes over and over, that input will eventually be abandoned
• Multiple attempts at one task may occur in parallel with speculative execution turned on
– Task ID from TaskInProgress is not a unique identifier; don’t use it that way
MapReduce: High Level
[Figure: a MapReduce job submitted by a client computer goes to the JobTracker on the master node; TaskTrackers on the slave nodes run the individual task instances.]
Node-to-Node Communication
• Hadoop uses its own RPC protocol
• All communication begins in slave nodes
– Prevents circular-wait deadlock
– Slaves periodically poll for “status” message
• Classes must provide explicit serialization
Nodes, Trackers, Tasks
• Master node runs JobTracker instance, which accepts Job requests from clients
• TaskTracker instances run on slave nodes
• TaskTracker forks separate Java process for task instances
Job Distribution
• MapReduce programs are contained in a Java “jar” file + an XML file containing serialized program configuration options
• Running a MapReduce job places these files into the HDFS and notifies TaskTrackers where to retrieve the relevant program code
• … Where’s the data distribution?
Data Distribution
• Implicit in design of MapReduce!
– All mappers are equivalent; so map whatever data is local to a particular node in HDFS
• If lots of data does happen to pile up on the same node, nearby nodes will map instead
– Data transfer is handled implicitly by HDFS
Configuring With JobConf
• MR Programs have many configurable options
• JobConf objects hold (key, value) components, mapping a String name to a value
– e.g., "mapred.map.tasks" -> 20
– JobConf is serialized and distributed before running the job
• Objects implementing JobConfigurable can retrieve elements from a JobConf
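A minimal driver-side sketch of setting such options, reusing the WordCount job from the example later in these slides (the particular values are illustrative assumptions):
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
// Two equivalent ways of setting a configurable option:
conf.set("mapred.map.tasks", "20");   // by property name
conf.setNumMapTasks(20);              // typed setter (a hint; the actual map count follows the input splits)
conf.setNumReduceTasks(4);            // number of reduce tasks actually used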
Job Launch Process: Client
• Client program creates a JobConf
– Identify classes implementing Mapper and Reducer interfaces
• JobConf.setMapperClass(), setReducerClass()
– Specify inputs, outputs
• JobConf.setInputPath(), setOutputPath()
– Optionally, other options too:
• JobConf.setNumReduceTasks(), JobConf.setOutputFormat()…
Job Launch Process: JobClient
• Pass JobConf to JobClient.runJob() or submitJob()
– runJob() blocks, submitJob() does not
• JobClient:
– Determines proper division of input into InputSplits
– Sends job data to master JobTracker server
Job Launch Process: JobTracker
• JobTracker:
– Inserts jar and JobConf (serialized to XML) in shared location
– Posts a JobInProgress to its run queue
Job Launch Process: TaskTracker
• TaskTrackers running on slave nodes periodically query JobTracker for work
• Retrieve job-specific jar and config
• Launch task in separate instance of Java
– main() is provided by Hadoop
Job Launch Process: Task
• TaskTracker.Child.main():
– Sets up the child TaskInProgress attempt
– Reads XML configuration
– Connects back to necessary MapReduce components via RPC
– Uses TaskRunner to launch user process
Job Launch Process: TaskRunner
• TaskRunner, MapTaskRunner, MapRunner work in a daisy-chain to launch your Mapper
– Task knows ahead of time which InputSplits it should be mapping
– Calls Mapper once for each record retrieved from the InputSplit
• Running the Reducer is much the same
Creating the Mapper
• You provide the instance of Mapper
– Should extend MapReduceBase
• One instance of your Mapper is initialized by the MapTaskRunner for a TaskInProgress
– Exists in separate process from all other instances of Mapper – no data sharing!
Mapper
• void map(WritableComparable key,
Writable value,
OutputCollector output,
Reporter reporter)
What is Writable?
• Hadoop defines its own classes for strings (Text), integers (IntWritable), etc.
• All values are instances of Writable
• All keys are instances of WritableComparable
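A minimal sketch of a custom value type (the class name and fields are illustrative assumptions; only the Writable contract itself comes from Hadoop). A key type would additionally implement WritableComparable and provide compareTo():
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// A (count, totalLength) pair usable as a map/reduce value.
public class CountAndLength implements Writable {
    private int count;
    private long totalLength;

    public void write(DataOutput out) throws IOException {    // serialize fields in a fixed order
        out.writeInt(count);
        out.writeLong(totalLength);
    }

    public void readFields(DataInput in) throws IOException { // deserialize in the same order
        count = in.readInt();
        totalLength = in.readLong();
    }
}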
Getting Data To The Mapper
[Figure: the job's InputFormat divides each input file into InputSplits; a RecordReader reads each InputSplit and feeds (key, value) records to a Mapper, which produces intermediate pairs.]
Reading Data
• Data sets are specified by InputFormats
– Defines input data (e.g., a directory)
– Identifies partitions of the data that form an InputSplit
– Factory for RecordReader objects to extract (k, v) records from the input source
FileInputFormat and Friends
• TextInputFormat – Treats each ‘\n’-terminated line of a file as a value
• KeyValueTextInputFormat – Maps ‘\n’- terminated text lines of “k SEP v”
• SequenceFileInputFormat – Binary file of (k, v) pairs with some additional metadata
• SequenceFileAsTextInputFormat – Same, but maps (k.toString(), v.toString())
Filtering File Inputs
• FileInputFormat will read all files out of a specified directory and send them to the mapper
• Delegates filtering this file list to a method subclasses may override
– e.g., Create your own “xyzFileInputFormat” to read *.xyz from directory list
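One way to express such a filter, assuming FileInputFormat.setInputPathFilter() is available in the API version in use (the class name and the .xyz suffix are illustrative):
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Accept only input files whose names end in ".xyz".
public class XyzPathFilter implements PathFilter {
    public boolean accept(Path path) {
        return path.getName().endsWith(".xyz");
    }
}

// In the driver: FileInputFormat.setInputPathFilter(conf, XyzPathFilter.class);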
Record Readers
• Each InputFormat provides its own RecordReader implementation
– Provides capability multiplexing
• LineRecordReader – Reads a line from a text file
• KeyValueRecordReader – Used by KeyValueTextInputFormat
Input Split Size
• FileInputFormat will divide large files into chunks
– Exact size controlled by mapred.min.split.size
• RecordReaders receive file, offset, and length of chunk
• Custom InputFormat implementations may override split size – e.g., “NeverChunkFile”
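For example, a driver holding a JobConf named conf can raise the minimum split size so large files are chunked less aggressively (the 64 MB value is an illustrative assumption):
// Ask for splits of at least 64 MB; block boundaries may still influence the actual split size.
conf.setLong("mapred.min.split.size", 64L * 1024 * 1024);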
Sending Data To Reducers
• Map function receives OutputCollector object
– OutputCollector.collect() takes (k, v) elements
• Any (WritableComparable, Writable) can be used
WritableComparator
• Compares WritableComparable data
– Will call WritableComparable.compare()
– Can provide fast path for serialized data
• JobConf.setOutputValueGroupingComparator()
Sending Data To The Client
• Reporter object sent to Mapper allows simple asynchronous feedback
– incrCounter(Enum key, long amount)
– setStatus(String msg)
• Allows self-identification of input
– InputSplit getInputSplit()
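A minimal sketch of using the Reporter inside a map() method (the surrounding mapper class, the counter enum, and the notion of an "empty record" are illustrative assumptions):
// Application-defined counters, incremented via Reporter.incrCounter(Enum, long).
enum RecordCounters { GOOD_RECORDS, EMPTY_RECORDS }

public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output,
                Reporter reporter) throws IOException {
    String line = value.toString();
    if (line.trim().isEmpty()) {
        reporter.incrCounter(RecordCounters.EMPTY_RECORDS, 1);  // asynchronous feedback to the framework
        return;
    }
    reporter.incrCounter(RecordCounters.GOOD_RECORDS, 1);
    reporter.setStatus("last input offset: " + key.get());      // human-readable task status
    output.collect(new Text(line), new IntWritable(1));
}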
Partition And Shuffle
[Figure: each Mapper's intermediate (key, value) pairs pass through a Partitioner; shuffling then moves each partition's data to the Reducer responsible for it, so every Reducer receives all the intermediates for its keys.]
Partitioner
• int getPartition(key, val, numPartitions)
– Outputs the partition number for a given key
– One partition == values sent to one Reduce task
• HashPartitioner used by default
– Uses key.hashCode() to return partition num
• JobConf sets Partitioner implementation
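A minimal sketch of a custom Partitioner in this (org.apache.hadoop.mapred) API, revisiting the earlier "one machine per first letter" idea; the class name and the fallback for non-letter keys are illustrative assumptions:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Sends words that start with the same letter to the same reduce task.
public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {
    public void configure(JobConf job) { }   // no per-job configuration needed

    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String word = key.toString();
        int letter = word.isEmpty() ? 0 : Character.toLowerCase(word.charAt(0)) - 'a';
        if (letter < 0 || letter > 25) letter = 0;    // assumption: non-letters go to partition 0
        return letter % numPartitions;                // stay within the configured number of partitions
    }
}

// In the driver: conf.setPartitionerClass(FirstLetterPartitioner.class);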
Reduction
• reduce( WritableComparable key,
Iterator values,
OutputCollector output,
Reporter reporter)
• Keys & values sent to one partition all go to the same reduce task
• Calls are sorted by key – “earlier” keys are reduced and output before “later” keys
Finally: Writing The Output
[Figure: each Reducer writes its results through a RecordWriter, supplied by the job's OutputFormat, to its own output file.]
WordCount M/R
map(String filename, String document)
{
List<String> T = tokenize(document);
for each token in T {
emit ((String)token, (Integer) 1);
}
}
reduce(String token, List<Integer> values)
{
Integer sum = 0;
for each value in values {
sum = sum + value;
}
emit ((String)token, (Integer) sum);
}
Word Count: Java Mapper
public static class MapClass extends MapReduceBase
implements Mapper<LongWritable, Text, Text,
IntWritable> {
public void map(LongWritable key, Text value,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer itr = new StringTokenizer(line);
while (itr.hasMoreTokens()) {
Text word = new Text(itr.nextToken());
output.collect(word, new IntWritable(1));
}
}
}
Word Count: Java Reduce
public static class Reduce extends MapReduceBase
implements Reducer<Text, IntWritable, Text,
IntWritable> {
public void reduce(Text key,
Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
Word Count: Java Driver
public void run(String inPath, String outPath)
throws Exception {
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setMapperClass(MapClass.class);
conf.setReducerClass(Reduce.class);
// Output types must match what the reducer emits (Text key, IntWritable value).
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(conf, new
Path(inPath));
FileOutputFormat.setOutputPath(conf, new
Path(outPath));
JobClient.runJob(conf);
}
WordCount with many mappers and one reducer
Job, Task, and Task Attempt IDs
• The format of a job ID is composed of the time that the jobtracker (not the job) started and an incrementing counter maintained by the jobtracker to uniquely identify the job to that instance of the jobtracker.
• job_201206111011_0002:
– is the second (0002; job IDs are 1-based) job run by the jobtracker
– which started at 10:11 on June 11, 2012
• Tasks belong to a job, and their IDs are formed by replacing the job prefix of a job ID with a task prefix and adding a suffix to identify the task within the job.
• task_201206111011_0002_m_000003:
– is the fourth (000003; task IDs are 0-based)
– map (m) task of the job with ID job_201206111011_0002.
– The task IDs are created for a job when it is initialized, so they do not necessarily dictate the order in which the tasks will be executed.
• Tasks may be executed more than once, due to failure or speculative execution, so to identify different instances of a task execution, task attempts are given unique IDs on the jobtracker.
• attempt_201206111011_0002_m_000003_0:
– is the first (0; attempt IDs are 0-based) attempt at running task task_201206111011_0002_m_000003.
Exercise - description
• The objectives for this exercise are:
– Become familiar with decomposing a problem into Map and Reduce stages.
– Get a sense for how MapReduce can be used in the real world.
• An inverted index is a mapping of words to their locations in a set of documents. Most modern search engines utilize some form of an inverted index to process user-submitted queries. In its most basic form, an inverted index is a simple hash table which maps words in the documents to some sort of document identifier.
• For example, if given the following 2 documents:
Doc1: Buffalo buffalo buffalo.
Doc2: Buffalo are mammals.
we could construct the following inverted file index:
Buffalo -> Doc1, Doc2
buffalo -> Doc1
buffalo. -> Doc1
are -> Doc2
mammals. -> Doc2
Exercise - tasks
• Task - 1: (30 min)
– Write pseudo-code for map and reduce to solve the inverted index problem
– What are your K1, V1, K2, V2, etc.?
– "Execute" your pseudo-code with the following example and explain what the Shuffle & Sort stage does with keys and values
• Task - 2: (30 min)
– Use the distributed code (Python/Java) and execute it following the instructions
• Where were the input and output data stored, and in what format?
• What were the K1, V1, K2, V2 data types used?
• Task - 3: (45 min)
• Some words are so common that their presence in an inverted index is "noise": they can obfuscate the more interesting properties of that document. For example, the words "the", "a", "and", "of", "in", and "for" occur in almost every English document. How can you determine whether a word is "noisy"?
– Re-write your pseudo-code with determination (your algorithm) and removal of "noisy" words using the map-reduce framework.
• Group / individual presentation (45 min)
Example: Inverted Index
• Input: (filename, text) records
• Output: list of files containing each word
• Map:
foreach word in text.split():
output(word, filename)
• Combine: unique filenames for each word
• Reduce:
def reduce(word, filenames):
output(word, sort(filenames))
Inverted Index
Input documents:
hamlet.txt: "to be or not to be"
12th.txt: "be not afraid of greatness"
Map output (one (word, filename) pair per word, made unique by the combiner):
from hamlet.txt: (to, hamlet.txt), (be, hamlet.txt), (or, hamlet.txt), (not, hamlet.txt)
from 12th.txt: (be, 12th.txt), (not, 12th.txt), (afraid, 12th.txt), (of, 12th.txt), (greatness, 12th.txt)
Reduce output (the inverted index):
afraid, (12th.txt)
be, (12th.txt, hamlet.txt)
greatness, (12th.txt)
not, (12th.txt, hamlet.txt)
of, (12th.txt)
or, (hamlet.txt)
to, (hamlet.txt)
A better example
• Billions of crawled pages and links
• Generate an index of words linking to web urls in which they occur.
– Input is split into url->pages (lines of pages)
– Map looks for words in lines of page and puts out word -> link pairs
– Group k,v pairs to generate word->{list of links}
– Reduce puts out pairs to output
Search Reverse Index
public static class MapClass extends MapReduceBase
implements Mapper<Text, Text, Text, Text> {
private Text word = new Text();
public void map(Text url, Text pageText,
OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
String line = pageText.toString();
StringTokenizer itr = new StringTokenizer(line);
while (itr.hasMoreTokens()) {
// ignore unwanted and redundant words here
word.set(itr.nextToken());
output.collect(word, url);
}
}
}
Search Reverse Index
public static class Reduce extends MapReduceBase
implements Reducer<Text, Text, Text, Text> {
public void reduce(Text word, Iterator<Text> urls,
OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
// concatenate the URLs containing this word into a single output value
StringBuilder links = new StringBuilder();
while (urls.hasNext()) {
if (links.length() > 0) links.append(", ");
links.append(urls.next().toString());
}
output.collect(word, new Text(links.toString()));
}
}
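A minimal driver sketch for wiring these two classes together (not from the original slides; the class name InvertedIndex and the run() wrapper are assumptions). KeyValueTextInputFormat is chosen because this mapper expects a Text key (the URL) and a Text value (the page text):
public void run(String inPath, String outPath) throws Exception {
JobConf conf = new JobConf(InvertedIndex.class);
conf.setJobName("inverted-index");
conf.setInputFormat(KeyValueTextInputFormat.class); // supplies (Text url, Text pageText) records
conf.setMapperClass(MapClass.class);
conf.setReducerClass(Reduce.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(conf, new Path(inPath));
FileOutputFormat.setOutputPath(conf, new Path(outPath));
JobClient.runJob(conf);
}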
End of session
Day – 1: First MR job - Inverted Index construction