Big Data - Lab A1 (SC11 Tutorial)
TRANSCRIPT
An Introduction to Data Intensive Computing
Appendix A: Amazon's Elastic MapReduce
Robert Grossman, University of Chicago and Open Data Group
Collin Bennett, Open Data Group
November 14, 2011
Basic Idea
• With Hadoop Streaming you can run any program as the Mapper and the Reducer.
• For example, you can run Python and Perl code.
• You can also run standard Unix utilities.
• With streaming, Mappers and Reducers use standard input and standard output.
Mappers for Streams
• As the mapper task runs, it converts its inputs into lines and feeds the lines to the stdin of the process.
• The mapper collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the mapper.
• By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab) is the value.
• This default can be changed.
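• As an illustration of this convention, here is a minimal streaming mapper sketch in Python (a hypothetical map.py, not part of the tutorial's sample code) that emits one tab-separated key/value line per word:

#!/usr/bin/python
# map.py -- hypothetical example, not part of the tutorial's sample code.
# Reads raw lines from stdin and writes tab-separated key/value pairs to stdout.
import sys

for line in sys.stdin:
    for word in line.split():
        # Everything up to the first tab is the key; the rest of the line is the value.
        print word.lower() + "\t" + "1"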
Reducers for Streams
• As the reducer task runs, it converts its input key/value pairs into lines and feeds the lines to the stdin of the process.
• The reducer collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the reducer.
• By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value.
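• For comparison, a minimal streaming reducer sketch in Python (a hypothetical reduce.py, not part of the tutorial's sample code) that sums the counts for each key read from stdin:

#!/usr/bin/python
# reduce.py -- hypothetical example, not part of the tutorial's sample code.
# Hadoop delivers the mapper output sorted by key, so equal keys arrive together.
import sys

current_key = None
total = 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current_key:
        if current_key is not None:
            print current_key + "\t" + str(total)
        current_key = key
        total = 0
    total += int(value)
if current_key is not None:
    print current_key + "\t" + str(total)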
Example
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -input myInputDirs \
  -output myOutputDir \
  -mapper /bin/cat \
  -reducer /bin/wc

• Here the Unix utilities cat and wc are the Mapper and Reducer.
S3 Buckets
• S3 bucket names must be unique across AWS.
• A good practice is to use a pattern like tutorial.osdc.org/dataset1.txt for a domain you own.
• The file is then referenced as tutorial.osdc.org.s3.amazonaws.com/dataset1.txt
• If you own osdc.org you can create a DNS CNAME entry to access the file as tutorial.osdc.org/dataset1.txt
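• If you would rather script bucket creation than use the AWS Console, a minimal sketch using the boto3 library (an assumption; boto3 postdates these slides, and any S3 client works) might look like this:

#!/usr/bin/python
# Sketch only: boto3 is assumed to be installed and configured with your AWS credentials.
import boto3

s3 = boto3.client("s3")
# Bucket names are global across AWS, so a domain you own makes a good name.
s3.create_bucket(Bucket="tutorial.osdc.org")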
S3 Security
• AWS Access Key (user name): this functions as your S3 user name. It is an alphanumeric text string that uniquely identifies users.
• AWS Secret Key (functions as the password).
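• Command line tools and SDKs authenticate with this pair of credentials. A minimal sketch of passing them explicitly, using the boto3 library (an assumption that postdates these slides) with placeholder key values:

#!/usr/bin/python
# Sketch only: boto3 postdates these slides; the key strings below are placeholders.
import boto3

session = boto3.session.Session(
    aws_access_key_id="AKIAEXAMPLEKEY",            # functions as the user name
    aws_secret_access_key="exampleSecretKeyValue"  # functions as the password
)
s3 = session.client("s3")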
Overview
1. Upload input data to S3
2. Create job flow by defining Map and Reduce
3. Download output data from S3
Custom Jobs
• Amazon Elastic MapReduce custom jobs can be written as a:
  – Custom JAR file
  – Streaming file
  – Pig program
  – Hive program
Step 1. Load Your Data Into an S3 Bucket
• Amazon's Elastic MapReduce reads data from S3 and writes data to S3.
Step 1b. Upload Data Into the S3 Bucket
• This can be done from the AWS Console.
• It can also be done using command line tools or from a script (see the sketch below).
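• For example, a minimal upload sketch in Python using the boto3 library (an assumption; the original tutorial used the console or the command line tools of the time), with the bucket and file names taken from the earlier slide:

#!/usr/bin/python
# Sketch only (boto3 is an assumption): upload a local input file to the tutorial bucket.
import boto3

s3 = boto3.client("s3")
s3.upload_file("dataset1.txt", "tutorial.osdc.org", "dataset1.txt")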
Step 2a. Write a Mapper

#!/usr/bin/python
import sys
import re

def main(argv):
    line = sys.stdin.readline()
    # A word is a letter followed by letters or digits
    pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*")
    try:
        while line:
            for word in pattern.findall(line):
                # "LongValueSum:" tells the aggregate reducer to sum the values for this key
                print "LongValueSum:" + word.lower() + "\t" + "1"
            line = sys.stdin.readline()
    except "end of file":
        return None

if __name__ == "__main__":
    main(sys.argv)
Step 2b. Upload the Mapper to S3
• This Mapper is already in S3 at this location: s3://elasticmapreduce/samples/wordcount/wordSplitter.py, so we don't need to upload it.
Step 3a. Write a Reducer

#!/usr/bin/python
import sys

def generateLongCountToken(id):
    # Re-emit the key with the "LongValueSum:" prefix and a count of 1
    return "LongValueSum:" + id + "\t" + "1"

def main(argv):
    line = sys.stdin.readline()
    try:
        while line:
            line = line[:-1]              # strip the trailing newline
            fields = line.split("\t")     # the key is everything before the first tab
            print generateLongCountToken(fields[0])
            line = sys.stdin.readline()
    except "end of file":
        return None

if __name__ == "__main__":
    main(sys.argv)
Step 3b. Upload Reducer to S3
myAggregatorForKeyCount.py
• This is a standard Reducer and part of a standard Hadoop library called Aggregate, so we don't need to upload it, just invoke it.
Hadoop Library Aggregate
To use Aggregate, simply specify "-reducer aggregate":

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -input myInputDirs \
  -output myOutputDir \
  -mapper myAggregatorForKeyCount.py \
  -reducer aggregate \
  -file myAggregatorForKeyCount.py \
  -jobconf mapred.reduce.tasks=12
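• To see why the mapper emits "LongValueSum:" keys, here is a rough Python sketch of what the built-in aggregate reducer effectively does with them (a simplification for illustration, not Hadoop's actual implementation):

#!/usr/bin/python
# Rough sketch of what "-reducer aggregate" effectively does for LongValueSum keys:
# strip the prefix and sum the values for each key.
import sys

totals = {}
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key.startswith("LongValueSum:"):
        key = key[len("LongValueSum:"):]
    totals[key] = totals.get(key, 0) + int(value)

for key in sorted(totals):
    print key + "\t" + str(totals[key])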
Step 4c. Configure Bootstrap Actions
• These include parameters for Hadoop, etc.
• The choices are presented when you configure the job flow in the AWS Console.
Step 6. The Output Data is in S3
• The output is in files labeled part-00000, part-00001, etc.
• Recall we specified the bucket plus folders: tutorial.osdc.org/wordcount/output/2011-06-26
Step 6. Download the Data From S3
• You can leave the data in S3 and work with it there.
• You can download it with command line tools, for example:

aws get tutorial.osdc.org/wordcount/output/2011-06-26/part-00000 part00000

• You can download it with the S3 AWS Console.
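• Equivalently, a minimal download sketch in Python using boto3 (an assumption; the command above uses the aws command line tool of the time):

#!/usr/bin/python
# Sketch only (boto3 is an assumption): fetch one output part file from the output folder.
import boto3

s3 = boto3.client("s3")
s3.download_file("tutorial.osdc.org",
                 "wordcount/output/2011-06-26/part-00000",
                 "part-00000")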
Step 7. Remove Any Unnecessary Files
• You will be charged for all files that remain in S3, so remove any unnecessary ones.
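• A minimal cleanup sketch in Python using boto3 (an assumption), which deletes everything under the job's output prefix:

#!/usr/bin/python
# Sketch only (boto3 is an assumption): delete everything under the job's output
# prefix once you have downloaded what you need. Double-check the prefix before
# running anything like this against a real bucket.
import boto3

s3 = boto3.resource("s3")
bucket = s3.Bucket("tutorial.osdc.org")
bucket.objects.filter(Prefix="wordcount/output/2011-06-26/").delete()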