hadoop first mr job - inverted index construction
TRANSCRIPT
First MR job - Inverted Index construction
MapReduce - Introduction
• Parallel job processing framework
• Written in Java
• Close integration with HDFS
• Provides:
– Auto partitioning of a job into sub-tasks
– Auto retry on failures
– Linear scalability
– Locality of task execution
– Plugin-based framework for extensibility
Let's think scalability
• Let's go through an exercise of scaling a simple program to process a large data set.
• Problem: count the number of times each word occurs in a set of documents.
• Example: only one document with only one sentence – “Do as I say, not as I do.”
• Pseudocode: A multiset is a set where each element also has a count
define wordCount as Multiset; (assume a hash table)
for each document in documentSet {
T = tokenize(document);
for each token in T {
wordCount[token]++;
}
}
display(wordCount);
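The same logic as a runnable Java sketch (the class name and whitespace tokenization are illustrative assumptions, not part of the original pseudocode):
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class LocalWordCount {
    public static void main(String[] args) throws Exception {
        // wordCount as a "multiset": word -> number of occurrences
        Map<String, Integer> wordCount = new HashMap<>();
        for (String path : args) {                               // each argument names one document
            String document = new String(Files.readAllBytes(Paths.get(path)));
            for (String token : document.split("\\s+")) {        // tokenize(document)
                if (!token.isEmpty()) {
                    wordCount.merge(token, 1, Integer::sum);      // wordCount[token]++
                }
            }
        }
        wordCount.forEach((w, c) -> System.out.println(w + "\t" + c));  // display(wordCount)
    }
}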
How about a billion documents?
• Looping through all the documents using a single computer will be extremely time-consuming.
• You speed it up by rewriting the program so that it distributes the work over several machines.
• Each machine will process a distinct fraction of the documents. When all the machines have completed this, a second phase of processing will combine the results of all the machines.
define wordCount as Multiset;
for each document in documentSubset {
T = tokenize(document);
for each token in T {
wordCount[token]++;
}
}
sendToSecondPhase(wordCount);
define totalWordCount as Multiset;
for each wordCount received from firstPhase {
multisetAdd (totalWordCount, wordCount);
}
Problems
• Where are documents stored?
– Having more machines for processing only helps up to a certain point, until the storage server can't keep up.
– You'll also need to split up the documents among the set of processing machines such that each machine will process only those documents that are stored on it.
• wordCount (and totalWordCount) are stored in memory
– When processing large document sets, the number of unique words can exceed the RAM of a single machine.
– Furthermore, phase two has only one machine, which will process the wordCount sent from all the machines in phase one. The single machine in phase two becomes the bottleneck.
• Solution: divide based on expected output!
– Let's say we have 26 machines for phase two. We assign each machine to handle only the wordCount for words beginning with a particular letter of the alphabet (a sketch of this routing rule follows below).
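A minimal sketch of that routing rule (the method name and the fallback for words that do not start with a letter are illustrative assumptions):
// Route a word to one of 26 phase-two machines by its first letter.
static int machineFor(String word) {
    char first = Character.toLowerCase(word.charAt(0));
    if (first >= 'a' && first <= 'z') {
        return first - 'a';   // 0..25, one machine per letter
    }
    return 0;                 // assumption: anything else goes to machine 0
}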
Map-Reduce
• MapReduce programs are executed in two main phases, called
– mapping and
– reducing.
• In the mapping phase, MapReduce takes the input data and feeds each data element to the mapper.
• In the reducing phase, the reducer processes all the outputs from the mapper and arrives at a final result.
• The mapper is meant to filter and transform the input into something that the reducer can aggregate over.
• MapReduce uses lists and (key/value) pairs as its main data primitives.
Map-Reduce
• Map-Reduce program
– Based on two functions: Map and Reduce
– Every Map/Reduce program must specify a Mapper and optionally a Reducer
– Operates on key and value pairs
Map-Reduce works like a Unix pipeline:
cat input | grep | sort | uniq -c | cat > output
Input | Map | Shuffle & Sort | Reduce | Output
cat /var/log/auth.log* | grep "session opened" | cut -d' ' -f10 | sort | uniq -c > ~/userlist
Map function: takes a key/value pair and generates a set of intermediate key/value pairs
map(k1, v1) -> list(k2, v2)
Reduce function: takes an intermediate key and all intermediate values associated with that key
reduce(k2, list(v2)) -> list(k3, v3)
Map-Reduce on Hadoop: Putting things in context
[Figure: input files stored in HDFS are divided into splits; on each of the M machines a RecordReader turns a split into (key, value) pairs for a Map task; optional Combiners pre-aggregate map output; a Partitioner assigns intermediate pairs to one of P partitions, one per Reducer; the R Reducers write the output files back to HDFS.]
Some MapReduce Terminology
• Job – A “full program” - an execution of a Mapper and Reducer across a data set
• Task – An execution of a Mapper or a Reducer on a slice of data
– a.k.a. Task-In-Progress (TIP)
• Task Attempt – A particular instance of an attempt to execute a task on a machine
Terminology Example
• Running “Word Count” across 20 files is one job
• 20 files to be mapped implies 20 map tasks + some number of reduce tasks
• At least 20 map task attempts will be performed… more if a machine crashes, or due to speculative execution, etc.
Task Attempts
• A particular task will be attempted at least once, possibly more times if it crashes
– If the same input causes crashes over and over, that input will eventually be abandoned
• Multiple attempts at one task may occur in parallel with speculative execution turned on
– Task ID from TaskInProgress is not a unique identifier; don’t use it that way
MapReduce: High Level
[Figure: a MapReduce job submitted by a client computer goes to the JobTracker on the master node; TaskTrackers on the slave nodes run the individual task instances.]
Node-to-Node Communication
• Hadoop uses its own RPC protocol
• All communication begins in slave nodes
– Prevents circular-wait deadlock
– Slaves periodically poll for “status” message
• Classes must provide explicit serialization
Nodes, Trackers, Tasks
• Master node runs JobTracker instance, which accepts Job requests from clients
• TaskTracker instances run on slave nodes
• TaskTracker forks separate Java process for task instances
Job Distribution
• MapReduce programs are contained in a Java “jar” file + an XML file containing serialized program configuration options
• Running a MapReduce job places these files into the HDFS and notifies TaskTrackers where to retrieve the relevant program code
• … Where’s the data distribution?
Data Distribution
• Implicit in design of MapReduce!
– All mappers are equivalent; so map whatever data is local to a particular node in HDFS
• If lots of data does happen to pile up on the same node, nearby nodes will map instead
– Data transfer is handled implicitly by HDFS
Configuring With JobConf
• MR Programs have many configurable options
• JobConf objects hold (key, value) components, mapping a String name to a value
– e.g., "mapred.map.tasks" -> 20
– JobConf is serialized and distributed before running the job
• Objects implementing JobConfigurable can retrieve elements from a JobConf
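A minimal driver-side sketch of setting such options, reusing the WordCount job from the example later in these slides (the particular values are illustrative assumptions):
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
// Two equivalent ways of setting a configurable option:
conf.set("mapred.map.tasks", "20");   // by property name
conf.setNumMapTasks(20);              // typed setter (a hint; the actual map count follows the input splits)
conf.setNumReduceTasks(4);            // number of reduce tasks actually used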
Job Launch Process: Client
• Client program creates a JobConf
– Identify classes implementing Mapper and Reducer interfaces
• JobConf.setMapperClass(), setReducerClass()
– Specify inputs, outputs
• JobConf.setInputPath(), setOutputPath()
– Optionally, other options too:
• JobConf.setNumReduceTasks(), JobConf.setOutputFormat()…
Job Launch Process: JobClient
• Pass JobConf to JobClient.runJob() or submitJob()
– runJob() blocks, submitJob() does not
• JobClient:
– Determines proper division of input into InputSplits
– Sends job data to master JobTracker server
Job Launch Process: JobTracker
• JobTracker:
– Inserts jar and JobConf (serialized to XML) in shared location
– Posts a JobInProgress to its run queue
Job Launch Process: TaskTracker
• TaskTrackers running on slave nodes periodically query JobTracker for work
• Retrieve job-specific jar and config
• Launch task in separate instance of Java
– main() is provided by Hadoop
Job Launch Process: Task
• TaskTracker.Child.main():
– Sets up the child TaskInProgress attempt
– Reads XML configuration
– Connects back to necessary MapReduce components via RPC
– Uses TaskRunner to launch user process
Job Launch Process: TaskRunner
• TaskRunner, MapTaskRunner, MapRunner work in a daisy-chain to launch your Mapper
– Task knows ahead of time which InputSplits it should be mapping
– Calls Mapper once for each record retrieved from the InputSplit
• Running the Reducer is much the same
Creating the Mapper
• You provide the instance of Mapper
– Should extend MapReduceBase
• One instance of your Mapper is initialized by the MapTaskRunner for a TaskInProgress
– Exists in separate process from all other instances of Mapper – no data sharing!
Mapper
• void map(WritableComparable key,
Writable value,
OutputCollector output,
Reporter reporter)
What is Writable?
• Hadoop defines its own classes for strings (Text), integers (IntWritable), etc.
• All values are instances of Writable
• All keys are instances of WritableComparable
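A minimal sketch of a custom value type (the class name and fields are illustrative assumptions; only the Writable contract itself comes from Hadoop). A key type would additionally implement WritableComparable and provide compareTo():
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// A (count, totalLength) pair usable as a map/reduce value.
public class CountAndLength implements Writable {
    private int count;
    private long totalLength;

    public void write(DataOutput out) throws IOException {    // serialize fields in a fixed order
        out.writeInt(count);
        out.writeLong(totalLength);
    }

    public void readFields(DataInput in) throws IOException { // deserialize in the same order
        count = in.readInt();
        totalLength = in.readLong();
    }
}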
Getting Data To The Mapper
[Figure: the job's InputFormat divides each input file into InputSplits; a RecordReader reads each InputSplit and feeds (key, value) records to a Mapper, which produces intermediate pairs.]
Reading Data
• Data sets are specified by InputFormats
– Defines input data (e.g., a directory)
– Identifies partitions of the data that form an InputSplit
– Factory for RecordReader objects to extract (k, v) records from the input source
FileInputFormat and Friends
• TextInputFormat – Treats each ‘\n’-terminated line of a file as a value
• KeyValueTextInputFormat – Maps ‘\n’- terminated text lines of “k SEP v”
• SequenceFileInputFormat – Binary file of (k, v) pairs with some additional metadata
• SequenceFileAsTextInputFormat – Same, but maps (k.toString(), v.toString())
Filtering File Inputs
• FileInputFormat will read all files out of a specified directory and send them to the mapper
• Delegates filtering this file list to a method subclasses may override
– e.g., Create your own “xyzFileInputFormat” to read *.xyz from directory list
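One way to express such a filter, assuming FileInputFormat.setInputPathFilter() is available in the API version in use (the class name and the .xyz suffix are illustrative):
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Accept only input files whose names end in ".xyz".
public class XyzPathFilter implements PathFilter {
    public boolean accept(Path path) {
        return path.getName().endsWith(".xyz");
    }
}

// In the driver: FileInputFormat.setInputPathFilter(conf, XyzPathFilter.class);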
Record Readers
• Each InputFormat provides its own RecordReader implementation
– Provides capability multiplexing
• LineRecordReader – Reads a line from a text file
• KeyValueRecordReader – Used by KeyValueTextInputFormat
Input Split Size
• FileInputFormat will divide large files into chunks
– Exact size controlled by mapred.min.split.size
• RecordReaders receive file, offset, and length of chunk
• Custom InputFormat implementations may override split size – e.g., “NeverChunkFile”
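For example, a driver holding a JobConf named conf can raise the minimum split size so large files are chunked less aggressively (the 64 MB value is an illustrative assumption):
// Ask for splits of at least 64 MB; block boundaries may still influence the actual split size.
conf.setLong("mapred.min.split.size", 64L * 1024 * 1024);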
Sending Data To Reducers
• Map function receives OutputCollector object
– OutputCollector.collect() takes (k, v) elements
• Any (WritableComparable, Writable) can be used
WritableComparator
• Compares WritableComparable data
– Will call WritableComparable.compare()
– Can provide fast path for serialized data
• JobConf.setOutputValueGroupingComparator()
Sending Data To The Client
• Reporter object sent to Mapper allows simple asynchronous feedback
– incrCounter(Enum key, long amount)
– setStatus(String msg)
• Allows self-identification of input
– InputSplit getInputSplit()
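A minimal sketch of using the Reporter inside a map() method (the surrounding mapper class, the counter enum, and the notion of an "empty record" are illustrative assumptions):
// Application-defined counters, incremented via Reporter.incrCounter(Enum, long).
enum RecordCounters { GOOD_RECORDS, EMPTY_RECORDS }

public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output,
                Reporter reporter) throws IOException {
    String line = value.toString();
    if (line.trim().isEmpty()) {
        reporter.incrCounter(RecordCounters.EMPTY_RECORDS, 1);  // asynchronous feedback to the framework
        return;
    }
    reporter.incrCounter(RecordCounters.GOOD_RECORDS, 1);
    reporter.setStatus("last input offset: " + key.get());      // human-readable task status
    output.collect(new Text(line), new IntWritable(1));
}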
Partition And Shuffle
[Figure: each Mapper's intermediate (key, value) pairs pass through a Partitioner; shuffling then moves each partition's data to the Reducer responsible for it, so every Reducer receives all the intermediates for its keys.]
Partitioner
• int getPartition(key, val, numPartitions)
– Outputs the partition number for a given key
– One partition == values sent to one Reduce task
• HashPartitioner used by default
– Uses key.hashCode() to return partition num
• JobConf sets Partitioner implementation
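A minimal sketch of a custom Partitioner in this (org.apache.hadoop.mapred) API, revisiting the earlier "one machine per first letter" idea; the class name and the fallback for non-letter keys are illustrative assumptions:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Sends words that start with the same letter to the same reduce task.
public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {
    public void configure(JobConf job) { }   // no per-job configuration needed

    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String word = key.toString();
        int letter = word.isEmpty() ? 0 : Character.toLowerCase(word.charAt(0)) - 'a';
        if (letter < 0 || letter > 25) letter = 0;    // assumption: non-letters go to partition 0
        return letter % numPartitions;                // stay within the configured number of partitions
    }
}

// In the driver: conf.setPartitionerClass(FirstLetterPartitioner.class);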
Reduction
• reduce( WritableComparable key,
Iterator values,
OutputCollector output,
Reporter reporter)
• Keys & values sent to one partition all go to the same reduce task
• Calls are sorted by key – “earlier” keys are reduced and output before “later” keys
Finally: Writing The Output
[Figure: each Reducer writes its results through a RecordWriter, supplied by the job's OutputFormat, to its own output file.]
WordCount M/R
map(String filename, String document)
{
List<String> T = tokenize(document);
for each token in T {
emit ((String)token, (Integer) 1);
}
}
reduce(String token, List<Integer> values)
{
Integer sum = 0;
for each value in values {
sum = sum + value;
}
emit ((String)token, (Integer) sum);
}
Word Count: Java Mapper
public static class MapClass extends MapReduceBase
implements Mapper<LongWritable, Text, Text,
IntWritable> {
public void map(LongWritable key, Text value,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer itr = new StringTokenizer(line);
while (itr.hasMoreTokens()) {
Text word = new Text(itr.nextToken());
output.collect(word, new IntWritable(1));
}
}
}
Word Count: Java Reduce
public static class Reduce extends MapReduceBase
implements Reducer<Text, IntWritable, Text,
IntWritable> {
public void reduce(Text key,
Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
Word Count: Java Driver
public void run(String inPath, String outPath)
throws Exception {
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setMapperClass(MapClass.class);
conf.setReducerClass(Reduce.class);
// Output types must match what the reducer emits (Text key, IntWritable value).
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(conf, new
Path(inPath));
FileOutputFormat.setOutputPath(conf, new
Path(outPath));
JobClient.runJob(conf);
}
WordCount with many mappers and one reducer
Job, Task, and Task Attempt IDs
• The format of a job ID is composed of the time that the jobtracker (not the job) started and an incrementing counter maintained by the jobtracker to uniquely identify the job to that instance of the jobtracker.
• job_201206111011_0002:
– is the second (0002; job IDs are 1-based) job run by the jobtracker
– which started at 10:11 on June 11, 2012
• Tasks belong to a job, and their IDs are formed by replacing the job prefix of a job ID with a task prefix and adding a suffix to identify the task within the job.
• task_201206111011_0002_m_000003:
– is the fourth (000003; task IDs are 0-based)
– map (m) task of the job with ID job_201206111011_0002.
– The task IDs are created for a job when it is initialized, so they do not necessarily dictate the order in which the tasks will be executed.
• Tasks may be executed more than once, due to failure or speculative execution, so to identify different instances of a task execution, task attempts are given unique IDs on the jobtracker.
• attempt_201206111011_0002_m_000003_0:
– is the first (0; attempt IDs are 0-based) attempt at running task task_201206111011_0002_m_000003.
Exercise - description
• The objectives for this exercise are:
– Become familiar with decomposing a problem into Map and Reduce stages.
– Get a sense for how MapReduce can be used in the real world.
• An inverted index is a mapping of words to their locations in a set of documents. Most modern search engines utilize some form of an inverted index to process user-submitted queries. In its most basic form, an inverted index is a simple hash table which maps words in the documents to some sort of document identifier.
• For example, if given the following 2 documents:
Doc1: Buffalo buffalo buffalo.
Doc2: Buffalo are mammals.
we could construct the following inverted file index:
Buffalo -> Doc1, Doc2
buffalo -> Doc1
buffalo. -> Doc1
are -> Doc2
mammals. -> Doc2
Exercise - tasks
• Task - 1: (30 min)
– Write pseudo-code for map and reduce to solve the inverted index problem
– What are your K1, V1, K2, V2, etc.?
– "Execute" your pseudo-code with the following example and explain what the Shuffle & Sort stage does with keys and values
• Task - 2: (30 min)
– Use the distributed code (Python/Java) and execute it following the instructions
• Where were the input and output data stored, and in what format?
• What were the K1, V1, K2, V2 data types used?
• Task - 3: (45 min)
• Some words are so common that their presence in an inverted index is "noise": they can obfuscate the more interesting properties of that document. For example, the words "the", "a", "and", "of", "in", and "for" occur in almost every English document. How can you determine whether a word is "noisy"?
– Re-write your pseudo-code with determination (your algorithm) and removal of "noisy" words using the map-reduce framework.
• Group / individual presentation (45 min)
Example: Inverted Index
• Input: (filename, text) records
• Output: list of files containing each word
• Map:
foreach word in text.split():
output(word, filename)
• Combine: unique filenames for each word
• Reduce:
def reduce(word, filenames):
output(word, sort(filenames))
Inverted Index
Input documents:
hamlet.txt: "to be or not to be"
12th.txt: "be not afraid of greatness"
Map output (one (word, filename) pair per word, made unique by the combiner):
from hamlet.txt: (to, hamlet.txt), (be, hamlet.txt), (or, hamlet.txt), (not, hamlet.txt)
from 12th.txt: (be, 12th.txt), (not, 12th.txt), (afraid, 12th.txt), (of, 12th.txt), (greatness, 12th.txt)
Reduce output (the inverted index):
afraid, (12th.txt)
be, (12th.txt, hamlet.txt)
greatness, (12th.txt)
not, (12th.txt, hamlet.txt)
of, (12th.txt)
or, (hamlet.txt)
to, (hamlet.txt)
A better example
• Billions of crawled pages and links
• Generate an index of words linking to web urls in which they occur.
– Input is split into url->pages (lines of pages)
– Map looks for words in lines of page and puts out word -> link pairs
– Group k,v pairs to generate word->{list of links}
– Reduce puts out pairs to output
Search Reverse Index
public static class MapClass extends MapReduceBase
implements Mapper<Text, Text, Text, Text> {
private Text word = new Text();
public void map(Text url, Text pageText,
OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
String line = pageText.toString();
StringTokenizer itr = new StringTokenizer(line);
while (itr.hasMoreTokens()) {
// ignore unwanted and redundant words here
word.set(itr.nextToken());
output.collect(word, url);
}
}
}
Search Reverse Index
public static class Reduce extends MapReduceBase
implements Reducer<Text, Text, Text, Text> {
public void reduce(Text word, Iterator<Text> urls,
OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
// concatenate the URLs containing this word into a single output value
StringBuilder links = new StringBuilder();
while (urls.hasNext()) {
if (links.length() > 0) links.append(", ");
links.append(urls.next().toString());
}
output.collect(word, new Text(links.toString()));
}
}
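A minimal driver sketch for wiring these two classes together (not from the original slides; the class name InvertedIndex and the run() wrapper are assumptions). KeyValueTextInputFormat is chosen because this mapper expects a Text key (the URL) and a Text value (the page text):
public void run(String inPath, String outPath) throws Exception {
JobConf conf = new JobConf(InvertedIndex.class);
conf.setJobName("inverted-index");
conf.setInputFormat(KeyValueTextInputFormat.class); // supplies (Text url, Text pageText) records
conf.setMapperClass(MapClass.class);
conf.setReducerClass(Reduce.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(conf, new Path(inPath));
FileOutputFormat.setOutputPath(conf, new Path(outPath));
JobClient.runJob(conf);
}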
End of session
Day – 1: First MR job - Inverted Index construction