Coursework II: Google MapReduce in GridSAM

Steve Crouch

stc@ecs

School of Electronics and Computer Science

Contents

Introduction to Google’s MapReduce

Applications of MapReduce

The coursework
– Extending a basic MapReduce framework provided in pseudocode

Coursework deadline: 27th March 4pm

Hand-in via ECS Coursework Handin System

Google MapReduce

MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, Google Inc., OSDI 2004.

http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en//papers/mapreduce-osdi04.pdf

Google’s Need for a Distributed Programming Model and Infrastructure

Google implement many computations over a lot of data
– Input: e.g. crawled documents, web request logs, etc.
– Output: e.g. inverted indices, web document graphs, pages crawled per host, frequent per-day queries, etc.

Input usually very large (> 1TB)

Computations need to be distributed for timeliness of results

Want to do this in an easy, but scalable and robust way; provide a programming model (with a suitable abstraction) for the distributed processing aspects

Realised many computations follow a map / reduce approach
– map operation applied to a set of logical input “records” to generate intermediate key/value pairs
– reduce operation applied to all intermediate values sharing the same key to combine data in a useful way
– Used as basis for rewrite of their production indexing system!

History of MapReduce – Inspired by Functional Programming!

Functional operations only create new data structures and do not alter existing ones

Order of operations does not matter

Emphasis on data flow

e.g. higher-order functions, as found in Lisp (the definitions below use Standard ML syntax)

– map() – applies a function to each value in a sequence

fun map f [ ] = [ ]
  | map f (x::xs) = (f x) :: (map f xs)

– reduce() – combines all elements of a sequence using a binary operator

fun reduce f c [ ] = c
  | reduce f c (x::xs) = f x (reduce f c xs)
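The two definitions above carry over almost clause for clause into other languages. The following Python sketch is illustrative only: the names `map_list` and `reduce_right` are chosen here and are not part of the coursework materials, and the recursion mirrors the ML clauses rather than aiming for efficiency.

```python
def map_list(f, xs):
    """Apply f to every element of xs, building a new list (xs is not modified)."""
    if not xs:                                    # map f [] = []
        return []
    return [f(xs[0])] + map_list(f, xs[1:])       # (f x) :: (map f xs)


def reduce_right(f, c, xs):
    """Combine the elements of xs with the binary operator f, starting from c."""
    if not xs:                                    # reduce f c [] = c
        return c
    return f(xs[0], reduce_right(f, c, xs[1:]))   # f x (reduce f c xs)


if __name__ == "__main__":
    print(map_list(lambda x: x * x, [1, 2, 3]))
    print(reduce_right(lambda x, acc: x + acc, 0, [1, 2, 3]))
```

Neither function alters its input list, which matches the point that functional operations only create new data structures.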

Looking at map and reduce Another Way…

map():

– Delegates or distributes the computation for each piece of data to a given function, creating a new set of data

– Each computation cannot see the effects of the other computations

– The order of computation is irrelevant

reduce() takes this created data and reduces it to something we want

map() moves left to right over the list, applying the given function… can this be exploited in distributed computing?
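Because each application of the function is independent of the others, the calls can in principle be handed to separate workers. A minimal sketch of that idea, assuming a thread pool from Python's standard `concurrent.futures` module stands in for a set of compute nodes (the name `parallel_map` and the default worker count are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_map(f, xs, workers=4):
    """Apply f to each element of xs using a pool of worker threads.

    Executor.map yields results in input order, even though the
    individual calls may run and finish in any order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(f, xs))


if __name__ == "__main__":
    print(parallel_map(len, ["Hello world", "Hello there"]))
```

This only works safely because the mapped function has no side effects that other calls could observe.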

Applying the Programming Model to the Data

[Figure: MapReduce data flow. Input key*value pairs from data store 1 to data store n are each passed to a map task, which emits (key 1, values...), (key 2, values...), (key 3, values...). A barrier then aggregates intermediate values by output key. One reduce task per key takes "key k, intermediate values" and produces "final key k values".]

Distributed Computing Seminar: Lecture 2: MapReduce Theory and Implementation, Christophe Bisciglia, Aaron Kimball & Sierra Michels-Slettvet, Summer 2007.

For Example…

Counting the number of occurrences of each word in a large collection of documents:

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

• map outputs each word plus occurrence count
• reduce sums together all counts emitted for each word

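[Figure: worked example. The inputs (doc1, "Hello world") and (doc2, "Hello there") each pass through map(), which emits (Hello, 1), (world, 1) and (Hello, 1), (there, 1). reduce() then outputs 2 (Hello), 1 (world) and 1 (there).]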
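The word-count example can be tried out in a single process. The sketch below is one possible rendering in Python, not the framework the coursework asks for: `map_fn`, `reduce_fn` and `word_count` are names assumed here, counts are kept as integers rather than the strings used in the pseudocode, and words are split on whitespace only.

```python
from collections import defaultdict


def map_fn(doc_name, contents):
    """Emit a (word, 1) pair for every word in the document contents."""
    return [(word, 1) for word in contents.split()]


def reduce_fn(word, counts):
    """Sum all the counts emitted for one word."""
    return sum(counts)


def word_count(documents):
    """Run map over every document, group by word, then reduce each group."""
    intermediate = defaultdict(list)
    for doc_name, contents in documents:
        for word, count in map_fn(doc_name, contents):
            intermediate[word].append(count)    # grouping by key: the "barrier"
    return {word: reduce_fn(word, counts) for word, counts in intermediate.items()}


if __name__ == "__main__":
    docs = [("doc1", "Hello world"), ("doc2", "Hello there")]
    print(word_count(docs))
```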

How it Works in Practice

"MapReduce: Simplified Data Processing on Large Clusters" by Jeffrey Dean and Sanjay Ghemawat, OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004.

1. User program:
- Splits the input into M pieces (typically 16MB to 64MB each)
- Program starts up across compute nodes as either Master or Worker (with exactly 1 Master)

2. Master assigns M map tasks and R reduce tasks to idle workers (either one map or one reduce task each)

3. A map Worker:
- Parses key/value pairs out of its input
- Passes each key/value pair to the map function
- Buffers intermediate keys/values in memory

4. Periodically, the map Worker writes intermediate key/value pairs to disk and informs the Master of their locations; the Master forwards these locations to the reduce Workers

5/6. When notified of locations by Master, reduce Worker remotely reads in data, sorts and groups data by key, passes to reduce function, results appended to output file

7. When all maps and reduces done, Master wakes up user program which resumes
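The numbered steps can be imitated in one process to see how M map tasks and R reduce tasks fit together. This is a sketch under stated assumptions, not Google's implementation: there is no real Master, network or disk, all function names are invented here, and the rule for dividing keys among the R reduce tasks (a CRC32 hash of the key modulo R) is an assumption, since the steps above do not spell one out.

```python
import zlib
from collections import defaultdict


def split_input(records, m):
    """Step 1: split the input records into m pieces."""
    return [records[i::m] for i in range(m)]


def run_map_task(piece, map_fn, r):
    """Steps 3-4: apply map_fn to each record and bucket its output into r regions."""
    regions = [[] for _ in range(r)]
    for key, value in piece:
        for out_key, out_value in map_fn(key, value):
            # Assumed partitioning rule: stable hash of the key, modulo r.
            region = zlib.crc32(out_key.encode()) % r
            regions[region].append((out_key, out_value))
    return regions


def run_reduce_task(region_parts, reduce_fn):
    """Steps 5-6: read one region from every map task, sort and group by key, reduce."""
    grouped = defaultdict(list)
    for part in region_parts:
        for key, value in part:
            grouped[key].append(value)
    return {key: reduce_fn(key, grouped[key]) for key in sorted(grouped)}


def run_job(records, map_fn, reduce_fn, m=2, r=2):
    """Run all map tasks, then all reduce tasks, and merge the reduce outputs."""
    map_outputs = [run_map_task(piece, map_fn, r) for piece in split_input(records, m)]
    result = {}
    for region in range(r):
        parts = [output[region] for output in map_outputs]
        result.update(run_reduce_task(parts, reduce_fn))
    return result


if __name__ == "__main__":
    docs = [("doc1", "Hello world"), ("doc2", "Hello there")]
    counts = run_job(docs,
                     lambda name, text: [(w, 1) for w in text.split()],
                     lambda word, ones: sum(ones))
    print(counts)
```

Each key lands in exactly one region, so every reduce task sees all the values for the keys it owns.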

Coursework: Part II

Learning Objectives:

To develop a general architectural and operational understanding of typical production-level grid software.

To develop the programming skills required to drive typical services on a production-level grid.

Tasks

Download and install the GridSAM server and client

(a) Extend some Java code stubs (which use the GridSAM Java API) to submit and monitor jobs to GridSAM

(b) Extend some pseudocode that describes a basic MapReduce framework for performing word counting on a number of files

Coursework: Part II – Installing GridSAM

Pre-Requisites

– Client and Server: Linux only (e.g. SuSE 9.0, RedHat, Debian, Ubuntu)
  • May work on other Linux distributions, but there has been no exhaustive testing
  • Tested on undergrad Linux boxes
– Requires Java JDK 6 (not JRE) or above
– Beware:
  • Firewalls blocking port 8080 and your FTP port in between client and server – add exceptions
  • VPNs can cause problems with staging data to/from GridSAM

Preparation/Installation

Java 7 recommended
– Note: you may need to upgrade your Java
– Ensure JAVA_HOME is set and Java is on your path

Install client…
– Download gridsam-2.3.0-client.zip from coursework page
– unzip gridsam-2.3.0-client.zip (into a file path that contains no spaces)
– cd gridsam-2.3.0-client
– java SetupGridSAM

Install server (Linux only)…
– Can just reuse your Apache Tomcat 5.5.28/6.0.32 from mgrid (see mgrid install slides)
– Download gridsam.war from coursework page
– Shut down Tomcat, copy gridsam.war into apache-tomcat-6.0.32/webapps and restart Tomcat
– Can check log files in apache-tomcat-6.0.32/webapps/gridsam/WEB-INF/logs if any problems occur


Coursework Materials

Download COMP3019-materials.tgz from coursework page
– Copy to gridsam-2.3.0-client directory
– Unpack, you’ll find some GridSAMExample* files

./GridSAMExampleCompile to check compilation
– Code not complete; that’s the coursework!

GridSAMExampleRun won’t work until you’ve done the coursework
– Note server.domain and port in script – you need to change these to point at your server (use HTTP not HTTPS!!)

Use the scripts and Java code as a basis

Refer to API docs on coursework page as required
– To obtain job status, use e.g.:
  jobStage = jobManager.findJobInstance(jobID).getLastKnownStage().getState().toString();
– Doing job.getLastKnownStage().getState().toString() directly won’t work

The Coursework

See the coursework handout on the COMP3019 page:
– http://www.ecs.soton.ac.uk/~stc/COMP3019

Notes for Part 1:
– When specifying multiple arguments to your m-grid applet, there is a single string you can use as an argument.
– Consider how you pass the two necessary arguments (i.e. a character and a textfile) as a single argument into the applet
– To load the text file below into your applet, package it into the jar file along with the code, and use the following in the applet:

InputStream in = getClass().getResourceAsStream("textfile.txt");

Part 2 (GridSAM) Notes:
– If you encounter problems using the GridSAM FTP server, some students have found success using a StupidFTP server (available under Ubuntu)
– When you want to check the status of a job use e.g.
  jobStage = jobManager.findJobInstance(jobID).getLastKnownStage().getState().toString();
  Doing job.getLastKnownStage().getState().toString() directly won’t work

Coursework: Part II – Running a Command Line Example

Example using File Staging

Objectives: submit simple job with data input and output requirements and monitor progress

[Figure: the OMII GridSAM Client submits JSDL to the OMII GridSAM Server and monitors the job; 2 input files and 1 output file are exchanged with the OMII GridSAM FTP Server.]

JSDL Example

Gridsam-2.3.0/examples/remotecat-staging.jsdl

(Change ftp URLs to match your ftp server, e.g. ftp://anonymous:anonymous@localhost:55521/concat.sh):

<JobDescription>
  <JobIdentification> … </JobIdentification>
  <Application>
    <POSIXApplication xmlns="http://schemas.ggf.org/jsdl/2005/06/jsdl-posix">
      <Executable>bin/concat</Executable>
      <Argument>dir2/subdir1/file2.txt</Argument>
      <Output>stdout.txt</Output>
      <Error>stderr.txt</Error>
      <Environment name="FIRST_INPUT">dir1/file1.txt</Environment>
    </POSIXApplication>
  </Application>
  …
  <DataStaging>
    <FileName>bin/concat</FileName>
    <CreationFlag>overwrite</CreationFlag>
    <Source>
      <URI>ftp://ftp.do:55521/concat.sh</URI>
    </Source>
  </DataStaging>
  <DataStaging>
    <FileName>dir1/file1.txt</FileName>
    <CreationFlag>overwrite</CreationFlag>
    <Source>
      <URI>ftp://ftp.do:55521/input1.txt</URI>
    </Source>
  </DataStaging>
  <DataStaging>
    <FileName>dir2/subdir1/file2.txt</FileName>
    <CreationFlag>overwrite</CreationFlag>
    <Source>
      <URI>ftp://ftp.do:55521/input2.txt</URI>
    </Source>
  </DataStaging>
  <DataStaging>
    <FileName>stdout.txt</FileName>
    <CreationFlag>overwrite</CreationFlag>
    <DeleteOnTermination>true</DeleteOnTermination>
    <Target>
      <URI>ftp://ftp.do:55521/stdout.txt</URI>
    </Target>
  </DataStaging>
</JobDescription>
</JobDefinition>
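Editing every ftp URL by hand is error-prone, so the change can also be scripted. The following Python sketch uses only the standard library and assumes it is given the complete, well-formed JSDL file (the excerpt above contains elisions and would not parse as it stands). The function name `retarget_ftp_uris` is invented here, and elements are matched on the local name `URI` so that no particular JSDL namespace is assumed.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit, urlunsplit


def retarget_ftp_uris(jsdl_text, new_netloc):
    """Return jsdl_text with every ftp:// URI pointed at new_netloc.

    new_netloc is the "user:password@host:port" part of the URL,
    e.g. "anonymous:anonymous@localhost:55521".
    """
    root = ET.fromstring(jsdl_text)
    for element in root.iter():
        local_name = element.tag.rsplit("}", 1)[-1]   # drop any "{namespace}" prefix
        if local_name != "URI" or not element.text:
            continue
        parts = urlsplit(element.text.strip())
        if parts.scheme == "ftp":
            element.text = urlunsplit(parts._replace(netloc=new_netloc))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = ("<DataStaging><FileName>bin/concat</FileName>"
              "<Source><URI>ftp://ftp.do:55521/concat.sh</URI></Source></DataStaging>")
    print(retarget_ftp_uris(sample, "anonymous:anonymous@localhost:55521"))
```

ElementTree may rename namespace prefixes when it writes the document back out; the result is equivalent XML, but it will not match the original file character for character.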

Set up the GridSAM Client’s FTP Server

To allow GridSAM to retrieve input and store output

In gridsam-2.3.0-client directory:
> ./gridsam.sh GridSAMFTPServer -p 55521 -d examples/

2010-04-29 08:20:59,250 WARN [GridSAMFTPServer] (main:) ../data/examples/ is exposed through FTP at ftp://[email protected]:55521/

2010-04-29 08:20:59,268 WARN [GridSAMFTPServer] (main:) Please make sure you understand the security implication of using anonymous FTP for file staging.

FtpServer.server.config.root.dir = ../data/examples/

FtpServer.server.config.data = /home/omii/COMP3019/omii-uk-client/gridsam/ftp/ftp1215306750

FtpServer.server.config.port = 55521

FtpServer.server.config.self.host = 152.78.237.90

Started FTP

Exposes the examples directory through FTP on port 55521 (anonymous access!)

Create input1.txt and input2.txt in this directory with some text in them

CLI Example: Submit to GridSAM Server

Ensure Java is on your path

In gridsam-2.3.0-client directory:
– Submit to GridSAM server:

./gridsam.sh GridSAMSubmit -s "http://localhost:8080/gridsam/services/gridsam?wsdl" -j examples/remotecat-staging.jsdl

Unique job ID is returned
– i.e. UID is urn:gridsam:<characters>

CLI Example: Monitoring the Job

Monitor job until completion:
> ./gridsam.sh GridSAMStatus -s "http://localhost:8080/gridsam/services/gridsam?wsdl" -j <unique_job_id>

– <unique_job_id> is the entire urn:gridsam:<characters> string

Job progress indicated by current state:
– Pending, Staging-in, Staged-in, Active, Executed, Staging-out, Staged-out, Done

When complete, output resides in the stdout.txt file in the examples/ directory
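Monitoring until completion amounts to polling the job state until it stops changing. GridSAM's own client API is Java, so the Python sketch below only illustrates the polling logic. `get_state` is a hypothetical callable standing in for whatever returns the current state string (for instance a wrapper around the GridSAMStatus command), and treating "Failed" and "Terminated" as additional terminal states is an assumption not taken from the state list above.

```python
import time

# "Done" comes from the state list above; the other two are assumed failure states.
TERMINAL_STATES = {"Done", "Failed", "Terminated"}


def wait_for_job(get_state, poll_seconds=5.0, max_polls=120, sleep=time.sleep):
    """Poll get_state() until a terminal state is seen.

    Returns the list of distinct states observed, in order.
    Raises TimeoutError if no terminal state appears within max_polls polls.
    """
    seen = []
    for _ in range(max_polls):
        state = get_state()
        if not seen or seen[-1] != state:
            seen.append(state)              # record each change of state once
        if state in TERMINAL_STATES:
            return seen
        sleep(poll_seconds)
    raise TimeoutError("job did not reach a terminal state")


if __name__ == "__main__":
    # Simulated job: a fixed sequence of states instead of a real server.
    fake_states = iter(["Pending", "Staging-in", "Active", "Done"])
    print(wait_for_job(lambda: next(fake_states), poll_seconds=0))
```

Bounding the number of polls keeps a stuck job from hanging the client indefinitely.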

What to Hand In

Submit: source code, results files, parameter files and output

Other parts that require written answers should form a separate document:

– In text, Microsoft Word or PDF
– Up to 800 words in length, not including any source or trace output

Submission via ECS Coursework Handin system: Single Zip file: source, results, parameter files, output & written answers
