Coursework II: Google MapReduce in GridSAM
Steve Crouch
[email protected], stc@ecs
School of Electronics and Computer Science
Contents
Introduction to Google’s MapReduce
Applications of MapReduce
The coursework
– Extending a basic MapReduce framework provided in pseudocode
Coursework deadline: 27th March 4pm
Handin via ECS Coursework Handin System
Google MapReduce
MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, Google Inc., OSDI 2004.
http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en//papers/mapreduce-osdi04.pdf
Google’s Need for a Distributed Programming Model and Infrastructure
Google implement many computations over a lot of data
– Input: e.g. crawled documents, web request logs, etc.
– Output: e.g. inverted indices, web document graphs, pages crawled per host, frequent per-day queries, etc.
Input usually very large (> 1TB)
Computations need to be distributed for timeliness of results
Want to do this in an easy, but scalable and robust way; provide a programming model (with a suitable abstraction) for the distributed processing aspects
Realised many computations follow a map / reduce approach
– map operation applied to a set of logical input “records” to generate intermediate key/value pairs
– reduce operation applied to all intermediate values sharing the same key to combine data in a useful way
– Used as basis for rewrite of their production indexing system!
History of MapReduce – Inspired by Functional Programming!
Functional operations only create new data structures and do not alter existing ones
Order of operations does not matter
Emphasis on data flow
e.g. Higher-order functions in Lisp (the definitions here are written in ML notation)
– map() – applies a function to each value in a sequence
fun map f [] = []
  | map f (x::xs) = (f x) :: (map f xs)
– reduce() – combines all elements of a sequence using a binary operator
fun reduce f c [] = c
  | reduce f c (x::xs) = f x (reduce f c xs)
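The two definitions above translate almost line for line into Python. This is an illustrative sketch only: the names `my_map` and `my_reduce` are made up here, and in practice Python's built-in `map` and `functools.reduce` would be used instead. Note that the slide's reduce folds from the right, whereas `functools.reduce` folds from the left; the two agree for operators such as addition.

```python
from functools import reduce as fold_left


def my_map(f, xs):
    """Apply f to each element, building a new list (xs is not modified)."""
    if not xs:
        return []
    return [f(xs[0])] + my_map(f, xs[1:])


def my_reduce(f, c, xs):
    """Right fold: combine the elements of xs with binary f, starting from c."""
    if not xs:
        return c
    return f(xs[0], my_reduce(f, c, xs[1:]))


squares = my_map(lambda x: x * x, [1, 2, 3, 4])
total = my_reduce(lambda x, acc: x + acc, 0, squares)

# Same sum with the standard library's left fold.
builtin_total = fold_left(lambda acc, x: acc + x, squares, 0)
```

Because neither function changes its input list, each call to `f` inside `my_map` is independent of the others, which is the property a distributed implementation relies on.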
Looking at map and reduce Another Way…
map():
– Delegates or distributes the computation for each piece of data to a given function, creating a new set of data
– Each computation cannot see the effects of the other computations
– The order of computation is irrelevant
reduce() takes this created data and reduces it to something we want
map() moves left to right over the list, applying the given function… can this be exploited in distributed computing?
Applying the Programming Model to the Data
[Figure: each data store (1 … n) supplies input key*value pairs to a map task; every map task emits (key 1, values...), (key 2, values...), (key 3, values...); a barrier aggregates intermediate values by output key; one reduce task per key then turns that key's intermediate values into the final values for key 1, key 2, key 3, ...]
Distributed Computing Seminar: Lecture 2: MapReduce Theory and Implementation, Christophe Bisciglia, Aaron Kimball & Sierra Michels-Slettvet, Summer 2007.
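The data flow on this slide (map over every input record, a barrier that groups intermediate values by key, then one reduce per key) can be sketched in a few lines of single-process Python. The driver name `run_mapreduce` and the "pages crawled per host" job with its `.example` URLs are hypothetical, chosen only to illustrate the flow; nothing here is distributed.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map over every input record, group by key, then reduce per key."""
    # Map phase: each call sees only its own record.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Barrier: no reduce starts until every map output has been grouped.
    grouped = defaultdict(list)
    for out_key, out_value in intermediate:
        grouped[out_key].append(out_value)

    # Reduce phase: one call per distinct intermediate key.
    return {k: reduce_fn(k, vs) for k, vs in grouped.items()}


# Hypothetical job: count pages crawled per host.
def host_map(doc_id, url):
    host = url.split("/")[2]  # "http://host/path" -> "host"
    yield host, 1


def host_reduce(host, counts):
    return sum(counts)


crawl_log = [
    ("d1", "http://a.example/index.html"),
    ("d2", "http://b.example/about.html"),
    ("d3", "http://a.example/news.html"),
]
pages_per_host = run_mapreduce(crawl_log, host_map, host_reduce)
```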
For Example…
Counting the number of occurrences of each word in a large collection of documents:
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");
reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
• map outputs each word plus occurrence count
• reduce sums together all counts emitted for each word
doc1, “Hello world” → map() → (Hello, 1), (world, 1)
doc2, “Hello there” → map() → (Hello, 1), (there, 1)
reduce() → 2 (Hello), 1 (world), 1 (there)
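The word-count example can be run directly as a small Python sketch. The function names `map_words` and `reduce_counts` are made up for illustration; as in the pseudocode, counts travel as strings, and the grouping step in the middle stands in for the framework.

```python
from collections import defaultdict


def map_words(doc_name, contents):
    # key: document name, value: document contents
    for word in contents.split():
        yield word, "1"


def reduce_counts(word, counts):
    # key: a word, values: a list of counts (as strings)
    result = 0
    for v in counts:
        result += int(v)
    return str(result)


documents = [("doc1", "Hello world"), ("doc2", "Hello there")]

# Stand-in for the framework: collect every emitted count under its word.
intermediate = defaultdict(list)
for name, text in documents:
    for word, count in map_words(name, text):
        intermediate[word].append(count)

word_counts = {w: reduce_counts(w, cs) for w, cs in intermediate.items()}
```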
How it Works in Practice
"MapReduce: Simplified Data Processing on Large Clusters" by Jeffrey Dean and Sanjay Ghemawat, OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004.
1. User program:
– Splits work into M 64MB pieces
– Program starts up across compute nodes as either Master or Worker (with exactly 1 Master)
2. Master assigns M map tasks and R reduce tasks to idle workers (either one map or one reduce task each)
3. A map Worker:
– Parses key/value pairs out of its input
– Passes each key/value pair to the map function
– Buffers intermediate key/value pairs in memory
4. Periodically, a map Worker writes intermediate key/value pairs to disk and informs the Master of their locations, which the Master forwards to reduce Workers
5/6. When notified of locations by the Master, a reduce Worker remotely reads in the data, sorts and groups it by key, and passes it to the reduce function; results are appended to an output file
7. When all maps and reduces are done, the Master wakes up the user program, which resumes
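The numbered steps can be mimicked in a single process to see how M map tasks feed R reduce tasks. This is a sequential sketch, not a distributed implementation: `master`, `map_worker` and `reduce_worker` are hypothetical names, lists in memory stand in for files on disk, and choosing the reduce task by hashing the key modulo R is an assumption borrowed from the Dean and Ghemawat paper rather than something stated on the slide.

```python
import zlib
from itertools import groupby


def partition(key, r):
    # Deterministic hash so a key always lands in the same reduce partition.
    return zlib.crc32(key.encode("utf-8")) % r


def map_worker(split, map_fn, r):
    """Steps 3-4: run map_fn over one split, bucket output into r regions."""
    regions = [[] for _ in range(r)]
    for key, value in split:
        for k, v in map_fn(key, value):
            regions[partition(k, r)].append((k, v))
    return regions


def reduce_worker(region_index, all_map_outputs, reduce_fn):
    """Steps 5-6: gather one region from every map worker, sort, group, reduce."""
    pairs = []
    for regions in all_map_outputs:
        pairs.extend(regions[region_index])
    pairs.sort(key=lambda kv: kv[0])
    output = []
    for k, group in groupby(pairs, key=lambda kv: kv[0]):
        output.append((k, reduce_fn(k, [v for _, v in group])))
    return output


def master(splits, map_fn, reduce_fn, r):
    """Steps 2 and 7: run M map tasks, then R reduce tasks, return R outputs."""
    map_outputs = [map_worker(s, map_fn, r) for s in splits]
    return [reduce_worker(i, map_outputs, reduce_fn) for i in range(r)]


splits = [[("doc1", "Hello world")], [("doc2", "Hello there")]]  # M = 2
output_files = master(
    splits,
    lambda name, text: [(w, 1) for w in text.split()],
    lambda word, counts: sum(counts),
    r=2,
)
```

Each entry of `output_files` plays the role of one reduce task's output file; a given word appears in exactly one of them.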
Coursework: Part II
Learning Objectives:
To develop a general architectural and operational understanding of typical production-level grid software.
To develop the programming skills required to drive typical services on a production-level grid.
Tasks
Download and install the GridSAM server and client
(a) Extend some Java code stubs (which use the GridSAM Java API) to submit and monitor jobs to GridSAM
(b) Extend some pseudocode that describes a basic MapReduce framework for performing word counting on a number of files
Coursework: Part II – Installing GridSAM
Pre-Requisites
– Client and Server: Linux only (e.g. SuSE 9.0, RedHat, Debian, Ubuntu)
  May work on other Linux distributions, but there has been no exhaustive testing
  Tested on undergrad Linux boxes
– Requires Java JDK 6 (not JRE) or above
– Beware:
  Firewalls blocking port 8080 and your FTP port between client and server – add exceptions
  VPNs can cause problems with staging data to/from GridSAM
Preparation/Installation
Java 7 recommended
– Note: you may need to upgrade your Java
– Ensure JAVA_HOME is set and Java is on your path
Install client…
– Download gridsam-2.3.0-client.zip from coursework page
– unzip gridsam-2.3.0-client.zip (into a file path that contains no spaces)
– cd gridsam-2.3.0-client
– java SetupGridSAM
Install server (Linux only)…
– Can just reuse your Apache Tomcat 5.5.28/6.0.32 from mgrid (see mgrid install slides)
– Download gridsam.war from coursework page
– Shut down Tomcat, copy gridsam.war into apache-tomcat-6.0.32/webapps and restart Tomcat
– Can check log files in apache-tomcat-6.0.32/webapps/gridsam/WEB-INF/logs if any problems occur
Coursework Materials
Download COMP3019-materials.tgz from coursework page
– Copy to gridsam-2.3.0-client directory
– Unpack, you’ll find some GridSAMExample* files
./GridSAMExampleCompile to check compilation
– Code not complete; that’s the coursework!
GridSAMExampleRun won’t work until you have done the coursework
– Note server.domain and port in script – you need to change these to point at your server (use HTTP not HTTPS!!)
Use the scripts and Java code as a basis
Refer to API docs on coursework page as required
– To obtain job status, use e.g.: jobStage = jobManager.findJobInstance(jobID).getLastKnownStage().getState().toString();
– Doing job.getLastKnownStage().getState().toString() directly won’t work
The Coursework
See the coursework handout on the COMP3019 page:
– http://www.ecs.soton.ac.uk/~stc/COMP3019
Notes for Part 1:
– Your m-grid applet can only be given a single string as its argument.
– Consider how you pass the two necessary arguments (i.e. a character and a textfile) as a single argument into the applet
– To load the text file into your applet, package it into the jar file along with the code, and use the following in the applet:
InputStream in = getClass().getResourceAsStream("textfile.txt");
Part 2 (GridSAM) Notes:
– If you encounter problems using the GridSAM FTP server, some students have found success using a StupidFTP server (available under Ubuntu)
– When you want to check the status of a job use e.g. jobStage = jobManager.findJobInstance(jobID).getLastKnownStage().getState().toString();
Doing job.getLastKnownStage().getState().toString() directly won’t work
Coursework: Part II – Running a Command Line Example
Example using File Staging
Objectives: submit simple job with data input and output requirements and monitor progress
[Figure: the OMII GridSAM Client submits JSDL to the OMII GridSAM Server and monitors the job; 2 input files and 1 output file are staged via the OMII GridSAM FTP Server]
JSDL Example
Gridsam-2.3.0/examples/remotecat-staging.jsdl
Change ftp URLs to match your ftp server (e.g. ftp://anonymous:anonymous@localhost:55521/concat.sh):
<JobDescription>
<JobIdentification> … </JobIdentification>
<Application>
<POSIXApplication xmlns="http://schemas.ggf.org/jsdl/2005/06/jsdl-posix">
<Executable>bin/concat</Executable>
<Argument>dir2/subdir1/file2.txt</Argument>
<Output>stdout.txt</Output>
<Error>stderr.txt</Error>
<Environment name="FIRST_INPUT">dir1/file1.txt</Environment>
</POSIXApplication>
</Application>
…
JSDL Example
<DataStaging>
<FileName>bin/concat</FileName>
<CreationFlag>overwrite</CreationFlag>
<Source>
<URI>ftp://ftp.do:55521/concat.sh</URI>
</Source>
</DataStaging>
<DataStaging>
<FileName>dir1/file1.txt</FileName>
<CreationFlag>overwrite</CreationFlag>
<Source>
<URI>ftp://ftp.do:55521/input1.txt</URI>
</Source>
</DataStaging>
<DataStaging>
<FileName>dir2/subdir1/file2.txt</FileName>
<CreationFlag>overwrite</CreationFlag>
<Source>
<URI>ftp://ftp.do:55521/input2.txt</URI>
</Source>
</DataStaging>
<DataStaging>
<FileName>stdout.txt</FileName>
<CreationFlag>overwrite</CreationFlag>
<DeleteOnTermination>true</DeleteOnTermination>
<Target>
<URI>ftp://ftp.do:55521/stdout.txt</URI>
</Target>
</DataStaging>
</JobDescription>
</JobDefinition>
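Every staging URI in the JSDL has to point at your own FTP server, so editing them by hand is repetitive. A rough Python sketch of doing it in one pass follows. The helper name `retarget_ftp_uris` is hypothetical, and it works on the file as plain text rather than parsing the XML, on the assumption that each URI sits inside a `<URI>` element exactly as in the excerpt above.

```python
import re

# Matches <URI>ftp://HOST[:PORT]/path</URI> and captures everything except
# the host/port part, so only that part is swapped out.
_FTP_URI = re.compile(r"(<URI>\s*ftp://)[^/<\s]+(/[^<]*</URI>)")


def retarget_ftp_uris(jsdl_text, authority):
    """Point every ftp:// staging URI at `authority`, keeping each file path."""
    return _FTP_URI.sub(
        lambda m: m.group(1) + authority + m.group(2), jsdl_text
    )


sample = """<Source>
<URI>ftp://ftp.do:55521/input1.txt</URI>
</Source>
<Target>
<URI>ftp://ftp.do:55521/stdout.txt</URI>
</Target>"""

updated = retarget_ftp_uris(sample, "anonymous:anonymous@localhost:55521")
```

To apply it to the real file, read `remotecat-staging.jsdl` into a string, pass it through the function and write the result back, keeping a copy of the original.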
Set up the GridSAM Client’s FTP Server
To allow GridSAM to retrieve input and store output
In gridsam-2.3.0-client directory:
> ./gridsam.sh GridSAMFTPServer -p 55521 -d examples/
2010-04-29 08:20:59,250 WARN [GridSAMFTPServer] (main:) ../data/examples/ is exposed through FTP at ftp://[email protected]:55521/
2010-04-29 08:20:59,268 WARN [GridSAMFTPServer] (main:) Please make sure you understand the security implication of using anonymous FTP for file staging.
FtpServer.server.config.root.dir = ../data/examples/
FtpServer.server.config.data = /home/omii/COMP3019/omii-uk-client/gridsam/ftp/ftp1215306750
FtpServer.server.config.port = 55521
FtpServer.server.config.self.host = 152.78.237.90
Started FTP
Exposes the examples directory through FTP on port 55521 (anonymous access!)
Create input1.txt and input2.txt in this directory with some text in them
CLI Example: Submit to GridSAM Server
Ensure Java is on your path
In gridsam-2.3.0-client directory:
– Submit to GridSAM server:
./gridsam.sh GridSAMSubmit -s "http://localhost:8080/gridsam/services/gridsam?wsdl" -j examples/remotecat-staging.jsdl
Unique job ID is returned
– i.e. UID is urn:gridsam:<characters>
CLI Example: Monitoring the Job
Monitor job until completion:
> ./gridsam.sh GridSAMStatus -s "http://localhost:8080/gridsam/services/gridsam?wsdl" -j <unique_job_id>
– <unique_job_id> is entire urn:gridsam:<characters> string
Job progress indicated by current state:
– Pending, Staging-in, Staged-in, Active, Executed, Staging-out, Staged-out, Done
When complete, output resides in the stdout.txt file in the examples/ directory
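Monitoring "until completion" amounts to a polling loop over the states listed above. The Python sketch below shows one way to structure such a loop. It does not call GridSAM: `get_state` is a hypothetical callable you would supply (for example one that wraps the status command or the Java API call), and treating only "Done" as terminal is an assumption. Any failure states your server reports should be added to `terminal`.

```python
import time

# Happy-path states as listed on the slide, in order.
STATES = ["Pending", "Staging-in", "Staged-in", "Active",
          "Executed", "Staging-out", "Staged-out", "Done"]


def wait_for_job(get_state, job_id, terminal=("Done",), interval=5.0,
                 timeout=600.0, sleep=time.sleep, clock=time.monotonic):
    """Poll get_state(job_id) until a terminal state or the timeout is hit.

    Returns the distinct states observed, in order of first appearance.
    """
    deadline = clock() + timeout
    seen = []
    while True:
        state = get_state(job_id)
        if not seen or seen[-1] != state:
            seen.append(state)  # record each state change once
        if state in terminal:
            return seen
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} still {state} after {timeout}s")
        sleep(interval)


# Demonstration with a fake status source that walks through every state.
_fake = iter(STATES)
history = wait_for_job(lambda job_id: next(_fake), "urn:gridsam:example",
                       sleep=lambda seconds: None)
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real delays.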
What to Hand In
Submit: source code, results files, parameter files and output
Other parts that require written answers should form a separate document:
– In text, Microsoft Word or PDF
– Up to 800 words in length, not including any source or trace output
Submission via ECS Coursework Handin system: Single Zip file: source, results, parameter files, output & written answers