Big Data - Lab A1 (SC11 Tutorial)
TRANSCRIPT
An Introduction to Data Intensive Computing
Appendix A: Amazon's Elastic MapReduce
Robert Grossman, University of Chicago and Open Data Group
Collin Bennett, Open Data Group
November 14, 2011
Basic Idea
• With Hadoop Streaming you can run any program as the Mapper and the Reducer.
• For example, you can run Python and Perl code.
• You can also run standard Unix utilities.
• With streaming, Mappers and Reducers use standard input and standard output.
Mappers for Streams
• As the mapper task runs, it converts its inputs into lines and feeds the lines to the stdin of the process.
• The mapper collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the mapper.
• By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab) is the value.
• This default can be changed.
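• As an illustration of this convention, here is a minimal streaming mapper sketch in Python (a hypothetical map.py, not part of the tutorial's sample code) that emits one tab-separated key/value line per word:

#!/usr/bin/python
# map.py -- hypothetical example, not part of the tutorial's sample code.
# Reads raw lines from stdin and writes tab-separated key/value pairs to stdout.
import sys

for line in sys.stdin:
    for word in line.split():
        # Everything up to the first tab is the key; the rest of the line is the value.
        print word.lower() + "\t" + "1"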
Reducers for Streams
• As the reducer task runs, it converts its input key/value pairs into lines and feeds the lines to the stdin of the process.
• The reducer collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the reducer.
• By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value.
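• For comparison, a minimal streaming reducer sketch in Python (a hypothetical reduce.py, not part of the tutorial's sample code) that sums the counts for each key read from stdin:

#!/usr/bin/python
# reduce.py -- hypothetical example, not part of the tutorial's sample code.
# Hadoop delivers the mapper output sorted by key, so equal keys arrive together.
import sys

current_key = None
total = 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current_key:
        if current_key is not None:
            print current_key + "\t" + str(total)
        current_key = key
        total = 0
    total += int(value)
if current_key is not None:
    print current_key + "\t" + str(total)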
Example
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -input myInputDirs \
  -output myOutputDir \
  -mapper /bin/cat \
  -reducer /bin/wc

• Here the Unix utilities cat and wc are the Mapper and Reducer.
S3 Buckets
• S3 bucket names must be unique across AWS.
• A good practice is to use a pattern like tutorial.osdc.org/dataset1.txt for a domain you own.
• The file is then referenced as tutorial.osdc.org.s3.amazonaws.com/dataset1.txt
• If you own osdc.org you can create a DNS CNAME entry to access the file as tutorial.osdc.org/dataset1.txt
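• If you would rather script bucket creation than use the AWS Console, a minimal sketch using the boto3 library (an assumption; boto3 postdates these slides, and any S3 client works) might look like this:

#!/usr/bin/python
# Sketch only: boto3 is assumed to be installed and configured with your AWS credentials.
import boto3

s3 = boto3.client("s3")
# Bucket names are global across AWS, so a domain you own makes a good name.
s3.create_bucket(Bucket="tutorial.osdc.org")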
S3 Security
• AWS Access Key (user name): this functions as your S3 user name. It is an alphanumeric text string that uniquely identifies users.
• AWS Secret Key (functions as the password).
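• Command line tools and SDKs authenticate with this pair of credentials. A minimal sketch of passing them explicitly, using the boto3 library (an assumption that postdates these slides) with placeholder key values:

#!/usr/bin/python
# Sketch only: boto3 postdates these slides; the key strings below are placeholders.
import boto3

session = boto3.session.Session(
    aws_access_key_id="AKIAEXAMPLEKEY",            # functions as the user name
    aws_secret_access_key="exampleSecretKeyValue"  # functions as the password
)
s3 = session.client("s3")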
Overview
1. Upload input data to S3
2. Create job flow by defining Map and Reduce
3. Download output data from S3
Custom Jobs
• Amazon Elastic MapReduce custom jobs can be written as a:
  – Custom JAR file
  – Streaming file
  – Pig program
  – Hive program
Step 1. Load Your Data Into an S3 Bucket
• Amazon's Elastic MapReduce reads data from S3 and writes data to S3.
Step 1b. Upload Data Into the S3 Bucket
• This can be done from the AWS Console.
• It can also be done using command line tools or from a script (see the sketch below).
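• For example, a minimal upload sketch in Python using the boto3 library (an assumption; the original tutorial used the console or the command line tools of the time), with the bucket and file names taken from the earlier slide:

#!/usr/bin/python
# Sketch only (boto3 is an assumption): upload a local input file to the tutorial bucket.
import boto3

s3 = boto3.client("s3")
s3.upload_file("dataset1.txt", "tutorial.osdc.org", "dataset1.txt")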
Step 2a. Write a Mapper

#!/usr/bin/python
import sys
import re

def main(argv):
    line = sys.stdin.readline()
    # A word is a letter followed by letters or digits
    pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*")
    try:
        while line:
            for word in pattern.findall(line):
                # "LongValueSum:" tells the aggregate reducer to sum the values for this key
                print "LongValueSum:" + word.lower() + "\t" + "1"
            line = sys.stdin.readline()
    except "end of file":
        return None

if __name__ == "__main__":
    main(sys.argv)
Step 2b. Upload the Mapper to S3
• This Mapper is already in S3 at this location: s3://elasticmapreduce/samples/wordcount/wordSplitter.py, so we don't need to upload it.
Step 3a. Write a Reducer

#!/usr/bin/python
import sys

def generateLongCountToken(id):
    # Re-emit the key with the "LongValueSum:" prefix and a count of 1
    return "LongValueSum:" + id + "\t" + "1"

def main(argv):
    line = sys.stdin.readline()
    try:
        while line:
            line = line[:-1]              # strip the trailing newline
            fields = line.split("\t")     # the key is everything before the first tab
            print generateLongCountToken(fields[0])
            line = sys.stdin.readline()
    except "end of file":
        return None

if __name__ == "__main__":
    main(sys.argv)
Step 3b. Upload Reducer to S3
myAggregatorForKeyCount.py
• This is a standard Reducer and part of a standard Hadoop library called Aggregate, so we don't need to upload it, just invoke it.
Hadoop Library Aggregate
To use Aggregate, simply specify "-reducer aggregate":

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -input myInputDirs \
  -output myOutputDir \
  -mapper myAggregatorForKeyCount.py \
  -reducer aggregate \
  -file myAggregatorForKeyCount.py \
  -jobconf mapred.reduce.tasks=12
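• To see why the mapper emits "LongValueSum:" keys, here is a rough Python sketch of what the built-in aggregate reducer effectively does with them (a simplification for illustration, not Hadoop's actual implementation):

#!/usr/bin/python
# Rough sketch of what "-reducer aggregate" effectively does for LongValueSum keys:
# strip the prefix and sum the values for each key.
import sys

totals = {}
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key.startswith("LongValueSum:"):
        key = key[len("LongValueSum:"):]
    totals[key] = totals.get(key, 0) + int(value)

for key in sorted(totals):
    print key + "\t" + str(totals[key])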
Step 4c. Configure Bootstrap Actions
• These include parameters for Hadoop, etc.
• The choices are presented when you configure the job flow in the AWS Console.
Step 6. The Output Data is in S3
• The output is in files labeled part-00000, part-00001, etc.
• Recall we specified the bucket plus folders: tutorial.osdc.org/wordcount/output/2011-06-26
Step 6. Download the Data From S3
• You can leave the data in S3 and work with it there.
• You can download it with command line tools, for example:

aws get tutorial.osdc.org/wordcount/output/2011-06-26/part-00000 part00000

• You can download it with the S3 AWS Console.
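• Equivalently, a minimal download sketch in Python using boto3 (an assumption; the command above uses the aws command line tool of the time):

#!/usr/bin/python
# Sketch only (boto3 is an assumption): fetch one output part file from the output folder.
import boto3

s3 = boto3.client("s3")
s3.download_file("tutorial.osdc.org",
                 "wordcount/output/2011-06-26/part-00000",
                 "part-00000")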
Step 7. Remove Any Unnecessary Files
• You will be charged for all files that remain in S3, so remove any unnecessary ones.
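• A minimal cleanup sketch in Python using boto3 (an assumption), which deletes everything under the job's output prefix:

#!/usr/bin/python
# Sketch only (boto3 is an assumption): delete everything under the job's output
# prefix once you have downloaded what you need. Double-check the prefix before
# running anything like this against a real bucket.
import boto3

s3 = boto3.resource("s3")
bucket = s3.Bucket("tutorial.osdc.org")
bucket.objects.filter(Prefix="wordcount/output/2011-06-26/").delete()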