Paahuni Khandelwal Email: cs435@cs.colostate.edu
20th September, 2019
[Recitation 4]
CS 435 - Introduction to Big Data
Submission
o Submission deadline for PA1: 30th Sept (by 5 pm)
o Tarball should include
  - 3 .java files, one per profile (Profile1.java, Profile2.java, Profile3.java)
  - 3 jars, one per profile, each containing the .class files (Profile1.jar, Profile2.jar, Profile3.jar)
  - Output folder/part-file from each profile
o NOTE: Your output should be generated after running jobs on this input set -
https://www.cs.colostate.edu/~cs435/PA/PA1/Fall2019/PA1Dataset.tar.gz
Profile 2
o A list of top 500 unigrams and their frequencies within each article
o Output should be grouped by Document ID
o Top 500 unigrams per document
o Total output records <= 500 * (no. of unique documents)
Profile 2 Solution 1- Using Multiple jobs with TopN
o Output of Mapper 1: ({DocumentID, unigram}, 1)
o Functionality of Reducer 1:
  - Count the frequency of each (DocumentID, unigram) pair
o Output of Reducer 1 will be in the form: (DocumentID, {unigram, frequency})
MAP REDUCE JOB 1
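The counting done by Job 1 can be tried outside Hadoop first. The following is a minimal plain-Java sketch, not the assignment solution: the class name PairCountSketch, the helper names, and the tokenization rule (lowercase, split on non-alphanumeric characters) are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for Job 1: the "mapper" emits ({docId, unigram}, 1)
// and the "reducer" sums the ones for each composite key.
class PairCountSketch {

    // Builds a tab-separated composite key (hypothetical helper).
    static String compositeKey(String docId, String unigram) {
        return docId + "\t" + unigram;
    }

    // Counts how often each (docId, unigram) pair occurs in one document.
    // The tokenization below is an assumption, not the PA1 specification.
    static Map<String, Integer> countPairs(String docId, String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue; // split() can yield a leading empty token
            }
            counts.merge(compositeKey(docId, token), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        countPairs("42", "The cat saw the dog.")
            .forEach((key, count) -> System.out.println(key + "\t" + count));
    }
}
```

In the real job the summing happens in Reducer 1 across all mappers; the single map here only mimics the end result for one document.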
Profile 2 Solution 1- Using Multiple jobs with TopN
o Functionality of Mapper 2:
  - Store and update the top 500 unigrams (based on frequency) in a TreeMap
  - Override the cleanup() method of the Mapper class to send the local top 500 unigrams from each mapper
  - Is a TreeMap on the mapper side really needed for Profile 2?
o Output of Mapper 2: (DocumentID, {unigram, frequency})
o Functionality of Reducer 2:
  - Store and update a TreeMap to get the top 500 unigrams per document
o Output of Reducer 2 will be in the form: (DocumentID, {unigram, frequency})
MAP REDUCE JOB 2
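The "store and update top N" step can be sketched in plain Java for a single document. This is a hedged illustration, not the slide's exact method: it uses a bounded TreeSet of entries instead of a TreeMap keyed on frequency, because a TreeMap keyed on frequency alone would overwrite unigrams that share the same count. The class name TopNSketch and its members are hypothetical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

class TopNSketch {

    // Frequency descending, then unigram ascending so ties stay distinct.
    static final Comparator<Map.Entry<String, Integer>> BY_FREQ_DESC = (a, b) -> {
        int c = Integer.compare(b.getValue(), a.getValue());
        return c != 0 ? c : a.getKey().compareTo(b.getKey());
    };

    // Keeps at most n entries while scanning the counts of one document.
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> counts, int n) {
        TreeSet<Map.Entry<String, Integer>> best = new TreeSet<>(BY_FREQ_DESC);
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            best.add(new AbstractMap.SimpleEntry<>(e.getKey(), e.getValue()));
            if (best.size() > n) {
                best.pollLast(); // evict the weakest entry seen so far
            }
        }
        return new ArrayList<>(best);
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 5);
        counts.put("cat", 2);
        counts.put("dog", 2);
        counts.put("a", 1);
        System.out.println(topN(counts, 2)); // prints [the=5, cat=2]
    }
}
```

In a reducer the same structure would be rebuilt for each DocumentID, with n set to 500.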
Profile 2 Solution 2- Using Composite Key with sortComparator
o Output of Mapper 1: ({DocumentID, unigram}, 1)
o Functionality of Reducer 1:
  - Count the frequency of each (DocumentID, unigram) pair
o Output of Reducer 1 will be in the form: (DocumentID, {unigram, frequency})
o Separate the fields of the composite key/value pair with "\t"
MAP REDUCE JOB 1
Profile 2 Solution 2- Using Composite Key with sortComparator
o Output of Mapper 2: ({DocumentID, unigram, frequency}, NullWritable.get())
o myGroupComp:
  - Extend WritableComparator to perform sorting on all three attributes
o Functionality of Reducer 2:
  - Store 500 unigrams per document
o Output of Reducer 2 will be in one of these forms:
  - (DocumentID, {unigram, frequency})
  - ({DocumentID, unigram, frequency}, null)
  - ({DocumentID, unigram}, frequency)
MAP REDUCE JOB 2
sortComparator
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public static class myGroupComp extends WritableComparator {

    protected myGroupComp() {
        // true: create key instances so compare() receives deserialized Text objects
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        // Each key has the form DocumentID \t unigram \t frequency
        String[] t1Items = ((Text) w1).toString().split("\t");
        String[] t2Items = ((Text) w2).toString().split("\t");

        int docId1 = Integer.parseInt(t1Items[0]);
        int docId2 = Integer.parseInt(t2Items[0]);
        String unigram1 = t1Items[1];
        String unigram2 = t2Items[1];
        int frequency1 = Integer.parseInt(t1Items[2]);
        int frequency2 = Integer.parseInt(t2Items[2]);

        // 1) DocumentID ascending
        int comparison = Integer.compare(docId1, docId2);
        if (comparison == 0) {
            // 2) frequency descending (arguments swapped)
            comparison = Integer.compare(frequency2, frequency1);
            if (comparison == 0) {
                // 3) unigram ascending, to break ties
                comparison = unigram1.compareTo(unigram2);
            }
        }
        return comparison;
    }
}
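The ordering implemented by the comparator can be checked locally without a cluster. Below is a minimal sketch that applies the same three-level comparison to plain strings; the class name CompositeKeyOrderSketch and the sample keys are made up for illustration, and the keys are assumed to be well-formed "DocumentID \t unigram \t frequency" strings with numeric IDs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class CompositeKeyOrderSketch {

    // Same ordering as the slide's comparator, on plain strings:
    // docId ascending, frequency descending, unigram ascending.
    static final Comparator<String> KEY_ORDER = (k1, k2) -> {
        String[] a = k1.split("\t");
        String[] b = k2.split("\t");
        int c = Integer.compare(Integer.parseInt(a[0]), Integer.parseInt(b[0]));
        if (c != 0) {
            return c;
        }
        c = Integer.compare(Integer.parseInt(b[2]), Integer.parseInt(a[2]));
        if (c != 0) {
            return c;
        }
        return a[1].compareTo(b[1]);
    };

    // Returns a sorted copy, leaving the input untouched.
    static List<String> sorted(List<String> keys) {
        List<String> copy = new ArrayList<>(keys);
        copy.sort(KEY_ORDER);
        return copy;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList(
            "2\tcat\t7", "1\tdog\t3", "1\tthe\t9", "1\tcat\t3");
        // Prints: 1 the 9, 1 cat 3, 1 dog 3, 2 cat 7 (one per line)
        sorted(keys).forEach(k -> System.out.println(k.replace('\t', ' ')));
    }
}
```

Because all records of a document arrive together and already in descending frequency, the reducer only needs to keep the first 500 it sees for each DocumentID.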
TreeMap: Profile 2, Job 2 using sortComparator
o How to sort key/value pairs having multiple attributes in a TreeMap?
o Functionality of Reducer 2: Store and update TreeMap to get the global top 500 unigrams per document
  - Maintain a global list of only the top 500 items (DocumentID + "\t" + unigram + "\t" + frequency)
  - Maintain 2 global variables: noOfUnigramsSeenSoFar, and the DocumentID it is counting for
  - Use cleanup() to write output to context
  - Inside your reduce():
    * Add the new tuple to the list if a new DocIDFromInputKey arrives
    * Or, increment noOfUnigramsSeenSoFar by 1 if DocIDFromInputKey is the same as DocumentID
    * Keep setting DocumentID to the current tuple's DocIDFromInputKey
    * Iterate through the list and write output to context
    * Handle the last DocIDFromInputKey's unigrams in the cleanup() method
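The bookkeeping with a counter and a "current DocumentID" can be sketched in plain Java as follows. This is a simplified variant under stated assumptions, not the slide's exact procedure: it emits each kept record immediately instead of buffering and flushing in cleanup(), and it assumes the keys arrive already sorted by DocumentID and then by descending frequency. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PerDocLimitSketch {

    // Keeps the first n keys of each document from an already sorted stream.
    // Assumes every key is well-formed: "docId \t unigram \t frequency".
    static List<String> firstNPerDoc(List<String> sortedKeys, int n) {
        List<String> out = new ArrayList<>();
        String currentDocId = null; // document the counter refers to
        int seenForDoc = 0;         // plays the role of noOfUnigramsSeenSoFar
        for (String key : sortedKeys) {
            String docId = key.substring(0, key.indexOf('\t'));
            if (!docId.equals(currentDocId)) {
                currentDocId = docId; // a new document starts: reset the counter
                seenForDoc = 0;
            }
            if (seenForDoc < n) {
                out.add(key);
                seenForDoc++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList(
            "1\tthe\t9", "1\tcat\t3", "1\tdog\t3", "2\tcat\t7");
        firstNPerDoc(keys, 2).forEach(k -> System.out.println(k.replace('\t', ' ')));
    }
}
```

With a real reducer the two variables would be instance fields, since reduce() is called once per key rather than once per stream.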
Map Reduce Chaining Jobs?
Configuration conf1 = new Configuration();
Job job1 = Job.getInstance(conf1, "Job 1");
job1.setNumReduceTasks(4);
job1.setJarByClass(Profile2.class);
job1.setMapperClass(Mapper1.class);
// A reducer can double as a combiner only if its output types match the mapper's output types
job1.setCombinerClass(Reducer1.class);
job1.setReducerClass(Reducer1.class);
job1.setPartitionerClass(UnigramPartitioner.class);
. . .
FileInputFormat.addInputPath(job1, new Path(args[0]));
FileOutputFormat.setOutputPath(job1, output_path_for_Job_1);
// Blocks until Job 1 finishes, so its output exists before Job 2 starts
job1.waitForCompletion(true);

Configuration conf2 = new Configuration();
Job job2 = Job.getInstance(conf2, "Job 2");
job2.setJarByClass(Profile2.class);
job2.setMapperClass(Mapper2.class);
job2.setReducerClass(Reducer2.class);
job2.setSortComparatorClass(myGroupComp.class);
. . .
// Job 2 reads what Job 1 wrote
FileInputFormat.addInputPath(job2, output_path_for_Job_1);
FileOutputFormat.setOutputPath(job2, new Path(args[2]));
System.exit(job2.waitForCompletion(true) ? 0 : 1);
Profile 3
o A list of top 500 unigrams and their frequencies in the corpus
o List should be sorted from most frequent unigrams to least frequent ones
o Solution to generate Profile 3 is a combination of Profile 1 and Profile 2
o We will list each unigram with its total occurrence in the complete dataset
Profile 3 Solution 1 (using TopN) - 1 MapReduce Job
o Output of Mapper: (unigram, 1)
o Functionality of Reducer:
  - Store and update a HashMap<Unigram, Frequency> to get the frequency of each unigram
  - Use cleanup() to sort and write the top 500 unigrams to context
  - Your cleanup() should sort the entrySet with comparingByValue, in reverseOrder()
  - Set count to 0 and iterate through the sorted entries
  - Write <unigram, frequency> to context until count reaches 500
o Output of Reducer will be in the form: (unigram, frequency)
Note: Explicitly set number of reducers to 1 in driver as job.setNumReduceTasks(1)
Refer: https://docs.oracle.com/javase/8/docs/api/java/util/Map.Entry.html
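The sorting step described for cleanup() can be sketched with the standard library alone. This is a minimal illustration, assuming the frequencies are already in a HashMap; the class name SortByValueSketch and the tie-break on the unigram (added so the result is deterministic) are not from the slides.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SortByValueSketch {

    // Returns the `limit` most frequent unigrams, most frequent first.
    static Map<String, Integer> topByValue(Map<String, Integer> counts, int limit) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // comparingByValue sorts ascending; reverseOrder flips it to descending.
        entries.sort(
            Collections.reverseOrder(Map.Entry.<String, Integer>comparingByValue())
                .thenComparing(Map.Entry.<String, Integer>comparingByKey()));

        Map<String, Integer> top = new LinkedHashMap<>(); // preserves sorted order
        int count = 0;
        for (Map.Entry<String, Integer> e : entries) {
            if (count == limit) {
                break;
            }
            top.put(e.getKey(), e.getValue()); // in Hadoop: context.write(...)
            count++;
        }
        return top;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("the", 5);
        counts.put("cat", 2);
        counts.put("dog", 2);
        counts.put("a", 1);
        System.out.println(topByValue(counts, 3)); // prints {the=5, cat=2, dog=2}
    }
}
```

This only works with a single reducer, since each reducer would otherwise see just its own share of the unigrams.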
Profile 3 Solution 2 (using Multiple Jobs)
MAP REDUCE JOB 1
o Output of Mapper 1: (unigram, 1)
o Functionality of Reducer 1:
  - Store and update HashMap<Unigram, Frequency> to get the frequency of each unigram
o Output of Reducer 1 will be in the form: (unigram, frequency)
Profile 3 Solution 2 (using Multiple Jobs)
MAP REDUCE JOB 2
o Use TopN design pattern
o Functionality of Mapper 2:
  - Initialize TreeMap<String, String>
  - Store and update the 500 unigrams in the TreeMap
  - Use cleanup() to send local top 500 unigrams from each mapper
  NOTE: You can set key as NullWritable
o Output of Mapper 2: (null, {unigram, frequency})
o Functionality of Reducer 2:
  - Initialize TreeMap<String, String>
  - Store and update global top 500 unigrams
o Output of Reducer 2 will be in the form: (unigram, frequency)
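The local-then-global flow of the TopN pattern can be simulated in plain Java. The sketch below is an illustration under assumptions: records are "unigram \t frequency" strings, each inner list stands for the input of one mapper, and sorting plus truncation replaces the TreeMap for brevity. The class name MergeTopNSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class MergeTopNSketch {

    // Records are "unigram\tfrequency": frequency descending, then unigram.
    static final Comparator<String> BY_FREQ_DESC = (r1, r2) -> {
        String[] a = r1.split("\t");
        String[] b = r2.split("\t");
        int c = Integer.compare(Integer.parseInt(b[1]), Integer.parseInt(a[1]));
        return c != 0 ? c : a[0].compareTo(b[0]);
    };

    // Sorts a copy of the records and keeps the first n.
    static List<String> topN(List<String> records, int n) {
        List<String> sorted = new ArrayList<>(records);
        sorted.sort(BY_FREQ_DESC);
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    // Each inner list is one mapper's input; the result is the reducer's output.
    static List<String> globalTopN(List<List<String>> mapperInputs, int n) {
        List<String> merged = new ArrayList<>();
        for (List<String> input : mapperInputs) {
            merged.addAll(topN(input, n)); // what one mapper's cleanup() would emit
        }
        return topN(merged, n);            // what the single reducer would keep
    }

    public static void main(String[] args) {
        List<List<String>> inputs = Arrays.asList(
            Arrays.asList("the\t9", "cat\t3", "a\t1"),
            Arrays.asList("dog\t7", "of\t8", "x\t2"));
        System.out.println(globalTopN(inputs, 2)); // the 9 and of 8, tab-separated
    }
}
```

Keeping only a local top N in each mapper is safe here because Job 1 has already produced one final count per unigram, so no unigram's frequency is split across mappers.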
Profile 3 Solution 2 (using Multiple Jobs)
MAP REDUCE JOB 2 - Second Solution
o Output of Mapper 2: (frequency, unigram)
o Use sortComparator:
  - Set descending order while sorting on the key (frequency)
o Output of Reducer 2 will be in the form: (unigram, frequency)
  - Write only the top 500 unigrams
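What the reducer sees when frequency is the key can be mimicked with a TreeMap sorted in reverse order. This is a rough plain-Java stand-in for the shuffle, not Hadoop code: every unigram with the same frequency ends up in the same group, just as reduce() would receive them as the values of one key. The class name FrequencyKeySketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FrequencyKeySketch {

    // Returns up to `limit` lines of "unigram\tfrequency", most frequent first.
    static List<String> topUnigrams(Map<String, Integer> counts, int limit) {
        // Stand-in for the shuffle: group by frequency, keys sorted descending.
        TreeMap<Integer, List<String>> byFrequency =
            new TreeMap<>(Comparator.<Integer>reverseOrder());
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            byFrequency.computeIfAbsent(e.getValue(), f -> new ArrayList<>())
                       .add(e.getKey());
        }

        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> group : byFrequency.entrySet()) {
            for (String unigram : group.getValue()) { // one reduce() call per frequency
                if (out.size() == limit) {
                    return out; // stop once the limit is reached
                }
                out.add(unigram + "\t" + group.getKey());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new TreeMap<>();
        counts.put("the", 5);
        counts.put("cat", 2);
        counts.put("dog", 2);
        counts.put("a", 1);
        topUnigrams(counts, 3).forEach(line -> System.out.println(line.replace('\t', ' ')));
    }
}
```

The limit has to be counted across groups, not per key, because several unigrams can share one frequency.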
Logger for intermediate results
import org.apache.log4j.Logger;
. . .
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    // One shared logger per class; messages end up in the task logs
    private static final Logger LOGGER = Logger.getLogger(TokenizerMapper.class.getName());
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        LOGGER.info("Mapper input key and value received");
        . . .
    }
}
Next Recitation
o Will be held on 27th September (Friday) in CS130 from 4 to 5 pm
o Introduction for PA 2
o Help Sessions for PA1
  - Monday (9/23): 2pm-4pm
  - Thursday (9/26): 8am-10am
  - Friday (9/27): 3pm-4pm