Write your first MapReduce program in 20 minutes
Posted on January 2, 2009 by Michael Nielsen
The slow revolution
Some revolutions are marked by a single, spectacular event: the
storming of the Bastille during the French Revolution, or the
destruction of the towers of the World Trade Center on September 11,
2001, which so changed the US's relationship with the rest of the
world. But often the most important revolutions aren't announced
with the blare of trumpets. They occur softly, too slow for a single
news cycle, but fast enough that if you aren't alert, the revolution is
over before you're aware it's happening.
Such a revolution is happening right now in computing.
Microprocessor clock speeds have stagnated since about 2000. Major
chipmakers such as Intel and AMD continue to wring out
improvements in speed by improving on-chip caching and other clever
techniques, but they are gradually hitting the point of diminishing
returns. Instead, as transistors continue to shrink in size, the
chipmakers are packing multiple processing units onto a single chip.
Most computers shipped today use multi-core microprocessors, i.e.,
chips with 2 (or 4, or 8, or more) separate processing units on the
main microprocessor.
The result is a revolution in software development. We're gradually
moving from the old world in which multiple processor computing was
a special case, used only for boutique applications, to a world in which
it is widespread. As this movement happens, software development, so
long tailored to single-processor models, is seeing a major shift in
some of its basic paradigms, to make the use of multiple processors
natural and simple for programmers.
This movement to multiple processors began decades ago. Projects
such as the Connection Machine demonstrated the potential of
massively parallel computing in the 1980s. In the 1990s, scientists
became large-scale users of parallel computing, using it to simulate
things like nuclear explosions and the
dynamics of the Universe. Those scientific applications were a bit like
the early scientific computing of the late 1940s and 1950s: specialized,
boutique applications, built with heroic effort using relatively
primitive tools. As computing with multiple processors becomes
widespread, though, we're seeing a flowering of general-purpose
software development tools tailored to multiple processor
environments.
One of the organizations driving this shift is Google. Google is one of
the largest users of multiple processor computing in the world, with its
entire computing cluster containing hundreds of thousands of
commodity machines, located in data centers around the world, and
linked using commodity networking components. This approach to
multiple processor computing is known as distributed computing; the
characteristic feature of distributed computing is that the processors
in the cluster don't necessarily share any memory or disk space, and so
information sharing must be mediated by the relatively slow network.
In this post, I'll describe a framework for distributed computing called
MapReduce. MapReduce was introduced in a paper written in 2004
by Jeffrey Dean and Sanjay Ghemawat from Google. What's beautiful
about MapReduce is that it makes parallelization almost entirely
invisible to the programmer who is using MapReduce to develop
applications. If, for example, you allocate a large number of machines
in the cluster to a given MapReduce job, the job runs in a highly
parallelized way. If, on the other hand, you allocate only a small
number of machines, it will run in a much more serial way, and it's
even possible to run jobs on just a single machine.
What exactly is MapReduce? From the programmer's point of view,
it's just a library that's imported at the start of your program, like any
other library. It provides a single library call that you can make,
passing in a description of some input data and two ordinary serial
functions (the "mapper" and "reducer") that you, the programmer,
specify in your favorite programming language. MapReduce then takes
over, ensuring that the input data is distributed through the cluster,
and computing those two functions across the entire cluster of
machines, in a way we'll make precise shortly. All the details
(parallelization, distribution of data, tolerance of machine failures)
are hidden away from the programmer, inside the library.
What we're going to do in this post is learn how to use the MapReduce
library. To do this, we don't need a big sophisticated version of the
MapReduce library. Instead, we can get away with a toy
implementation (just a few lines of Python!) that runs on a single
machine. By using this single-machine toy library we can learn how to
develop for MapReduce. The programs we develop will run essentially
unchanged when, in later posts, we improve the MapReduce library so
that it can run on a cluster of machines.
Our first MapReduce program
Okay, so how do we use MapReduce? I'll describe it with a simple
example, which is a program to count the number of occurrences of
different words in a set of files. The example is simple, but you'll find it
rather strange if you're not already familiar with MapReduce: the
program we'll describe is certainly not the way most programmers
would solve the word-counting problem! What it is, however, is an
excellent illustration of the basic ideas of MapReduce. Furthermore,
what we'll eventually see is that by using this approach we can easily
scale our wordcount program up to run on millions or even billions of
documents, spread out over a large cluster of computers, and that's
not something a conventional approach could do easily.
The input to a MapReduce job is just a set of (input_key,input_value)
pairs, which we'll implement as a Python dictionary. In the wordcount
example, the input keys will be the filenames of the files we're
interested in counting words in, and the corresponding input values
will be the contents of those files:
filenames = ["text\\a.txt", "text\\b.txt", "text\\c.txt"]
i = {}
for filename in filenames:
    f = open(filename)
    i[filename] = f.read()
    f.close()
After this code is run the Python dictionary i will contain the input
to our MapReduce job, namely, i has three keys containing the
filenames, and three corresponding values containing the contents of
those files. Note that I've used Windows file naming conventions
above; if you're running a Mac or Linux you may need to tinker with
the filenames. Also, to run the code you will of course need the text
files text\a.txt, text\b.txt, and text\c.txt. You can create some
simple example texts by cutting and pasting from the following:
text\a.txt:
The quick brown fox jumped over the lazy grey dogs.
text\b.txt:
That's one small step for a man, one giant leap for mankind.
text\c.txt:
Mary had a little lamb,
Its fleece was white as snow;
And everywhere that Mary went,
The lamb was sure to go.
The MapReduce job will process this input dictionary in two phases:
the map phase, which produces output which (after a little
intermediate processing) is then processed by the reduce phase. In the
map phase what happens is that for each (input_key,input_value)
pair in the input dictionary i, a function
mapper(input_key,input_value) is computed, whose output is a
list of intermediate keys and values. This function mapper is supplied
by the programmer; we'll show how it works for wordcount below.
The output of the map phase is just the list formed by concatenating
the lists of intermediate keys and values for all of the different input
keys and values.
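
In code, this concatenation takes only a few lines. Here's a sketch,
where i is the input dictionary from above; it's essentially what the
toy library at the end of this post does:

intermediate = []
for (input_key, input_value) in i.items():
    intermediate.extend(mapper(input_key, input_value))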
I said above that the function mapper is supplied by the programmer.
In the wordcount example, what mapper does is take the input key
and input value (a filename, and a string containing the contents of
the file) and then move through the words in the file. For each word
it encounters, it returns the intermediate key and value (word,1),
indicating that it found one occurrence of word. So, for example, for
the input key text\a.txt, a call to
mapper("text\\a.txt",i["text\\a.txt"]) returns:
[('the', 1), ('quick', 1), ('brown', 1), ('fox', 1), ('jumped', 1),
('over', 1), ('the', 1), ('lazy', 1), ('grey', 1), ('dogs', 1)]
Notice that everything has been lowercased, so we don't count words
with different cases as distinct. Furthermore, the same key gets
repeated multiple times, because words like "the" appear more than
once in the text. This, incidentally, is the reason we use a Python list
for the output, and not a Python dictionary, for in a dictionary the
same key can only be used once.
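
You can check this directly in the Python interpreter: building a
dictionary from a list of pairs silently collapses the repeated keys,
losing counts (the order of keys in the output may differ):

>>> dict([('the', 1), ('quick', 1), ('the', 1)])
{'the': 1, 'quick': 1}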
Here's the Python code for the mapper function, together with a helper
function used to remove punctuation:
def mapper(input_key, input_value):
    return [(word, 1) for word in
            remove_punctuation(input_value.lower()).split()]

def remove_punctuation(s):
    return s.translate(string.maketrans("", ""), string.punctuation)
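
As an aside, remove_punctuation uses the Python 2 form of
string.maketrans, matching the rest of the code in this post. If you
happen to be on Python 3 (where that call no longer exists), a roughly
equivalent helper, still assuming import string, is:

def remove_punctuation(s):
    # Python 3: the three-argument str.maketrans builds a translation
    # table that deletes every character in string.punctuation
    return s.translate(str.maketrans("", "", string.punctuation))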
mapper works by lowercasing the input file, removing the punctuation,
splitting the resulting string around whitespace, and finally emitting
the pair (word,1) for each resulting word. Note, incidentally, that I'm
ignoring apostrophes, to keep the code simple, but you can easily
extend the code to deal with apostrophes and other special cases.
With this specification of mapper, the output of the map phase for
wordcount is simply the result of combining the lists for
mapper("text\\a.txt"), mapper("text\\b.txt"), and
mapper("text\\c.txt"):
[('the', 1), ('quick', 1), ('brown', 1), ('fox', 1),
('jumped', 1), ('over', 1), ('the', 1), ('lazy', 1), ('grey', 1),
('dogs', 1), ('mary', 1), ('had', 1), ('a', 1), ('little', 1),
('lamb', 1), ('its', 1), ('fleece', 1), ('was', 1), ('white', 1),
('as', 1), ('snow', 1), ('and', 1), ('everywhere', 1),
('that', 1), ('mary', 1), ('went', 1), ('the', 1), ('lamb', 1),
('was', 1), ('sure', 1), ('to', 1), ('go', 1), ('thats', 1),
('one', 1), ('small', 1), ('step', 1), ('for', 1), ('a', 1),
('man', 1), ('one', 1), ('giant', 1), ('leap', 1), ('for', 1),
('mankind', 1)]
The map phase of MapReduce is logically trivial, but when the input
dictionary has, say, 10 billion keys, and those keys point to files held on
thousands of different machines, implementing the map phase is
actually quite non-trivial. What the MapReduce library handles is
details like knowing which files are stored on what machines, making
sure that machine failures don't affect the computation, making
efficient use of the network, and storing the output in a usable form.
We won't worry about these issues for now, but we will come back to
them in future posts.
What the MapReduce library now does in preparation for the reduce
phase is to group together all the intermediate values which have the
same key. In our example the result of doing this is the following
intermediate dictionary:
{'and': [1], 'fox': [1], 'over': [1], 'one': [1, 1], 'as': [1],
'go': [1], 'its': [1], 'lamb': [1, 1], 'giant': [1],
'for': [1, 1], 'jumped': [1], 'had': [1], 'snow': [1],
'to': [1], 'leap': [1], 'white': [1], 'was': [1, 1],
'mary': [1, 1], 'brown': [1], 'lazy': [1], 'sure': [1],
'that': [1], 'little': [1], 'small': [1], 'step': [1],
'everywhere': [1], 'mankind': [1], 'went': [1], 'man': [1],
'a': [1, 1], 'fleece': [1], 'grey': [1], 'dogs': [1],
'quick': [1], 'the': [1, 1, 1], 'thats': [1]}
We see, for example, that the word "and", which appears only once in
the three files, has as its associated value a list containing just a single
1, [1]. By contrast, the word "one", which appears twice, has [1,1] as
its value.
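
If you want to see this grouping step in code, here's a minimal sketch
using a plain dictionary, where intermediate is the list output by the
map phase; the toy library later in the post does the same job with
itertools.groupby:

groups = {}
for (intermediate_key, intermediate_value) in intermediate:
    groups.setdefault(intermediate_key, []).append(intermediate_value)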
The reduce phase now commences. A programmer-defined function
reducer(intermediate_key,intermediate_value_list) is
applied to each entry in the intermediate dictionary. For wordcount,
reducer simply sums up the list of intermediate values, and returns
both the intermediate_key and the sum as the output. This is done
by the following code:
def reducer(intermediate_key, intermediate_value_list):
    return (intermediate_key, sum(intermediate_value_list))
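
Applied, for example, to the entry for the word "the" in the
intermediate dictionary above, it gives the expected count:

>>> reducer('the', [1, 1, 1])
('the', 3)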
The output from the reduce phase, and from the total MapReduce
computation, is thus:
[('and', 1), ('fox', 1), ('over', 1), ('one', 2), ('as', 1),
('go', 1), ('its', 1), ('lamb', 2), ('giant', 1), ('for', 2),
('jumped', 1), ('had', 1), ('snow', 1), ('to', 1), ('leap', 1),
('white', 1), ('was', 2), ('mary', 2), ('brown', 1),
('lazy', 1), ('sure', 1), ('that', 1), ('little', 1),
('small', 1), ('step', 1), ('everywhere', 1), ('mankind', 1),
('went', 1), ('man', 1), ('a', 2), ('fleece', 1), ('grey', 1),
('dogs', 1), ('quick', 1), ('the', 3), ('thats', 1)]
You can easily check that this is just a list of the words in the three files
we started with, and the associated wordcounts, as desired.
We've looked at code defining the input dictionary i, the mapper and
reducer functions. Collecting it all up, and adding a call to the
MapReduce library, here's the complete wordcount.py program:
# wordcount.py
import string
import map_reduce

def mapper(input_key, input_value):
    return [(word, 1) for word in
            remove_punctuation(input_value.lower()).split()]

def remove_punctuation(s):
    return s.translate(string.maketrans("", ""), string.punctuation)

def reducer(intermediate_key, intermediate_value_list):
    return (intermediate_key, sum(intermediate_value_list))

filenames = ["text\\a.txt", "text\\b.txt", "text\\c.txt"]
i = {}
for filename in filenames:
    f = open(filename)
    i[filename] = f.read()
    f.close()
print map_reduce.map_reduce(i, mapper, reducer)
The map_reduce module imported by this program implements
MapReduce in pretty much the simplest possible way, using some
useful functions from the itertools library:
# map_reduce.py
"""Defines a single function, map_reduce, which takes an input
dictionary i and applies the user-defined function mapper to each
(input_key,input_value) pair, producing a list of intermediate
keys and intermediate values. Repeated intermediate keys then
have their values grouped into a list, and the user-defined
function reducer is applied to the intermediate key and list of
intermediate values. The results are returned as a list."""

import itertools

def map_reduce(i, mapper, reducer):
    intermediate = []
    for (key, value) in i.items():
        intermediate.extend(mapper(key, value))
    groups = {}
    for key, group in itertools.groupby(sorted(intermediate),
                                        lambda x: x[0]):
        groups[key] = [y for x, y in group]
    return [reducer(intermediate_key, groups[intermediate_key])
            for intermediate_key in groups]
(Credit to a nice blog post from Dave Spencer for the use of
itertools.groupby to simplify the reduce phase.)
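
Incidentally, if you'd like a quick sanity check of the library that
doesn't depend on the three text files, you can feed map_reduce a small
inline input instead; doc1 and doc2 here are just made-up keys for the
test:

i = {"doc1": "the cat sat", "doc2": "the dog ran"}
print map_reduce.map_reduce(i, mapper, reducer)
# prints ('the', 2), ('cat', 1), ('sat', 1), ('dog', 1), ('ran', 1),
# in some order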
Obviously, on a single machine an implementation of the MapReduce
library is pretty trivial! In later posts we'll extend this library so that it
can distribute the execution of the mapper and reducer functions
across multiple machines on a network. The payoff is that with enough
improvement to the library we can, with essentially no change, use our
wordcount.py program to count the words not just in 3 files, but
rather the words in billions of files, spread over thousands of
computers in a cluster. What the MapReduce library does, then, is
provide an approach to developing in a distributed environment where
many simple tasks (like wordcount) remain simple for the
programmer. Important (but boring) tasks like parallelization, getting
the right data into the right places, dealing with the failure of
computers and networking components, and even coping with racks of
computers being taken offline for maintenance, are all taken care of
under the hood of the library.
In the posts that follow, we're thus going to do two things. First, we're
going to learn how to develop MapReduce applications. That means
taking familiar tasks (things like computing PageRank) and figuring
out how they can be done within the MapReduce framework. We'll do
that in the next post in this series. In later posts, we'll also take a look
at Hadoop, an open source platform that can be used to develop
MapReduce applications.
Second, we'll go under the hood of MapReduce, and look at how it
works. We'll scale up our toy implementation so that it can be used
over small clusters of computers. This is not only fun in its own right,
it will also make us better MapReduce programmers, in the same way
as understanding the innards of an operating system (for example) can
make you a better application programmer.
To finish off this post, though, we'll do just two things. First, we'll sum
up what MapReduce does, stripping out the wordcount-specific
material. It's not any kind of formal specification, just a brief
informal summary, together with a few remarks. We'll refine this
summary a little in some future posts, but this is the basic MapReduce
model. Second, we'll give an overview of how MapReduce takes
advantage of a distributed environment to parallelize jobs.
MapReduce in general
Summing up our earlier description of MapReduce, and with the
details about wordcount removed, the input to a MapReduce job is a
set of (input_key,input_value) pairs. Each pair is used as input to
a function mapper(input_key,input_value) which produces as
output a list of intermediate keys and intermediate values:
[(intermediate_key,intermediate_value),
(intermediate_key',intermediate_value'),
...]
The output from all the different input pairs is then sorted, so that
values associated with the same intermediate_key are grouped
together in a list of intermediate values. The
reducer(intermediate_key,intermediate_value_list)
function is then applied to each intermediate key and list of
intermediate values, to produce the output from the MapReduce job.
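
Schematically, then, the whole job looks like this; it's just a sketch of
the shapes the data passes through, not runnable code:

{input_key: input_value, ...}                            # job input
  -> [(intermediate_key, intermediate_value), ...]       # map phase
  -> {intermediate_key: [intermediate_value, ...], ...}  # grouping
  -> [(intermediate_key, output_value), ...]             # reduce phase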
A natural question is whether the order of values in
intermediate_value_list matters. I must admit I'm not sure of the
answer to this question; if it's discussed in the original paper, then I
missed it. In most of the examples I'm familiar with, the order doesn't
matter, because the reducer works by applying a commutative,
associative operation across all intermediate values in the list. As we'll
see in a minute, because the mapper computations are potentially
done in parallel, on machines which may be of varying speed, it'd be
hard to guarantee the ordering, and this suggests that the ordering
doesn't matter. It'd be nice to know for sure; if anyone reading this
does know the answer, I'd appreciate hearing it, and will update the
post!
One of the most striking things about MapReduce is how restrictive it
is. A priori it's by no means clear that MapReduce should be all that
useful in practical applications. It turns out, though, that many
interesting computations can be expressed either directly in
MapReduce, or as a sequence of a few MapReduce computations.
We've seen wordcount implemented in this post, and we'll see how to
compute PageRank using MapReduce in the next post. Many other
important computations can also be implemented using MapReduce,
including doing things like finding shortest paths in a graph, grepping
a large document collection, or many data mining algorithms. For
many such problems, though, the standard approach doesn't obviously
translate into MapReduce. Instead, you need to think through the
problem again from scratch, and find a way of doing it using
MapReduce.
Exercises
How would you implement grep in MapReduce?
Problems
Take a well-known algorithms book, e.g., Cormen-Leiserson-
Rivest-Stein, or a list of well-known algorithms. Which
algorithms lend themselves to being expressed in MapReduce?
Which do not? This isn't so much a problem as it is a suggestion
for a research program.
The early years of serial computing saw the introduction of
powerful general-purpose ideas like those that went into the Lisp
and Smalltalk programming languages. Arguably, those ideas are
still the most powerful in today's modern programming
languages. MapReduce isn't a programming language as we
conventionally think of them, of course, but it's similar in that it
introduces a way of thinking about and approaching
programming problems. I don't think MapReduce has quite the
same power as the ideas that went into Lisp and Smalltalk; it's
more like the Fortran of distributed computing. Can we find
similarly powerful ideas to Lisp or Smalltalk in distributed
computing? What's the "hundred-year framework" going to look
like for distributed computing?
How MapReduce takes advantage of the distributed setting
You can probably already see how MapReduce takes advantage of a
large cluster of computers, but let's spell out some of the details. There
are two key points. First, the mapper functions can be run in parallel,
on different processors, because they don't share any data. Provided
the right data is in the local memory of the right processor (a task
MapReduce manages for you), the different computations can be
done in parallel. The more machines are in the cluster, the more
mapper computations can be simultaneously executed. Second, the
reducer functions can also be run in parallel, for the same reason,
and, again, the more machines are available, the more computations
can be done in parallel.
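
Even on a single multi-core machine you can get a taste of this. Here's
a hypothetical variant of the map phase built on Python's
multiprocessing module; mapper_star is a helper introduced here so
that Pool.map, which passes a single argument, can call our
two-argument mapper, and mapper must live at the top level of a module
so the worker processes can find it:

from multiprocessing import Pool

def mapper_star(pair):
    # unpack one (input_key, input_value) item for the mapper
    return mapper(*pair)

def parallel_map_phase(i, num_workers=4):
    # run the mapper over the input dictionary using a pool of worker
    # processes; on Windows this needs an if __name__ == '__main__' guard
    pool = Pool(num_workers)
    intermediate = []
    for result in pool.map(mapper_star, i.items()):
        intermediate.extend(result)
    pool.close()
    return intermediate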
The difficulty in this story arises in the grouping step that takes place
between the map phase and the reduce phase. For the reducer
functions to work in parallel, we need to ensure that all the
intermediate values corresponding to the same key get sent to the
same machine. Obviously, this requires communication between
machines, over the relatively slow network connections.
This looks tough to arrange, and you might think (as I did at first) that
a lot of communication would be required to get the right data to the
right machine. Fortunately, a simple and beautiful idea is used to
make sure that the right data ends up in the right location, without
there being too much communication overhead.
Imagine you've got 1000 machines that you're going to use to run
reducers on. As the mappers compute the intermediate keys and value
lists, they compute hash(intermediate_key) mod 1000 for some
hash function. This number is used to identify the machine in the
cluster that the corresponding reducer will be run on, and the
resulting intermediate key and value list is then sent to that machine.
Because every machine running mappers uses the same hash function,
this ensures that value lists corresponding to the same intermediate
key all end up at the same machine. Furthermore, by using a hash we
ensure that the intermediate keys end up pretty evenly spread over
machines in the cluster. Now, I should warn you that in practice this
hashing method isn't literally quite what's done (we'll describe the
details in later posts), but that's the main idea.
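
In Python, the routing rule is only a line or two; reducer_machine
here is a hypothetical helper, not part of our toy library, and hash
stands in for whatever hash function a real implementation uses:

def reducer_machine(intermediate_key, num_machines=1000):
    # every mapper applies the same function, so values for a given
    # key are always routed to the same reducer machine
    return hash(intermediate_key) % num_machines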
Needless to say, there's a lot more to the story of how MapReduce
works in practice, especially the way it handles data distribution and
fault-tolerance. In future posts we'll explore many more of the details.
Nonetheless, hopefully this post gives you a basic understanding of
how MapReduce works. In the next post we'll apply MapReduce to the
computation of PageRank.