mapreduce design patterns - piazza
TRANSCRIPT
![Page 1: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/1.jpg)
MapReduce Design Patterns
CMSC 491Hadoop-Based Distributed Computing
Spring 2015Adam Shook
![Page 2: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/2.jpg)
Agenda
• Summarization Patterns• Filtering Patterns• Data Organization Patterns• Joins Patterns• Metapatterns• I/O Patterns• Bloom Filters
![Page 3: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/3.jpg)
MapReduce Design Patterns
• Book by Donald Miner & Adam Shook• Building effective algorithms and analytics for
Hadoop and other systems.• 23 pattern grouped into six categories
– Summarization– Filtering– Data Organization– Joins– Metapatterns– Input and output
![Page 4: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/4.jpg)
Pattern Categories/1• Summarization patterns: Top-down summaries to get a top-level view
– Numerical summarizations– Inverted index– Counting with counters
• Filtering patterns: Extract interesting subsets of the data– Filtering– Bloom filtering– Top ten– Distinct
• Data organization patterns: Reorganize and restructure data to work with other systems or to make MapReduce analysis easier– Structured to hierarchical– Partitioning– Binning– Total order sorting– Shuffling
![Page 5: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/5.jpg)
Pattern Categories/2• Join patterns: Bringing and analyze different data sets together to discover
interesting relationships.– Reduce-side join– Replicated join– Composite join– Cartesian product
• Metapatterns: Piece together several patterns to solve a complex problem or to perform several analytics in the same job.– Job chaining– Chain folding– Job merging
• Input and output patterns: Custom the way to use Hadoop to input and output data.– Generating data– External source output– External source input– Partition pruning
![Page 6: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/6.jpg)
SUMMARIZATION PATTERNS
Numerical Summarizations, Inverted Index, Counting with Counters
![Page 7: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/7.jpg)
Overview
• Top-down summarization of large data sets to get a top-level view
• Most straightforward patterns• Calculate aggregates over entire
data set or groups• Build indexes
![Page 8: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/8.jpg)
Numerical Summarizations
• Group records together by a field or set of fields and calculate a numerical aggregate per group
• Build histograms or calculate statistics from numerical values
![Page 9: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/9.jpg)
Known Uses
• Word Count• Record Count• Min/Max/Count• Average/Median/Standard Deviation
![Page 10: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/10.jpg)
Structure
![Page 11: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/11.jpg)
Performance
• Perform well, especially when combiner is used
• Need to be concerned about data skew with from the key
![Page 12: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/12.jpg)
Example
• Discover the first time a StackOverflow user posted, the last time a user posted, and the number of posts in between
• User ID, Min Date, Max Date, Count
![Page 13: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/13.jpg)
public class MinMaxCountTuple implements Writable {private Date min = new Date();private Date max = new Date();private long count = 0; private final static SimpleDateFormat frmt =
new SimpleDateFormat( "yyyy-MM-dd'T'HH:mm:ss.SSS");
public Date getMin() { return min; } public void setMin(Date min) { this.min = min; } public Date getMax() { return max; } public void setMax(Date max) { this.max = max; } public long getCount() { return count; } public void setCount(long count) { this.count = count; }public void readFields(DataInput in) {
min = new Date(in.readLong()); max = new Date(in.readLong()); count = in.readLong();
}public void write(DataOutput out) {
out.writeLong(min.getTime());out.writeLong(max.getTime());out.writeLong(count);
}
public String toString() {return frmt.format(min) + "\t" + frmt.format(max) + "\t" + count;
}}
![Page 14: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/14.jpg)
public static class MinMaxCountMapper extends Mapper<Object, Text, Text, MinMaxCountTuple> {
private Text outUserId = new Text();private MinMaxCountTuple outTuple =
new MinMaxCountTuple();
private final static SimpleDateFormat frmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS");
public void map(Object key, Text value, Context context) { Map<String, String> parsed =
xmlToMap(value.toString());String strDate = parsed.get("CreationDate"); String userId = parsed.get("UserId");
Date creationDate = frmt.parse(strDate); outTuple.setMin(creationDate); outTuple.setMax(creationDate)outTuple.setCount(1);outUserId.set(userId);context.write(outUserId, outTuple);
}}
![Page 15: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/15.jpg)
public static class MinMaxCountReducer extends Reducer<Text, MinMaxCountTuple, Text, MinMaxCountTuple> {
private MinMaxCountTuple result = new MinMaxCountTuple();
public void reduce(Text key, Iterable<MinMaxCountTuple> values, Context context) {
result.setMin(null); result.setMax(null); result.setCount(0);int sum=0;for (MinMaxCountTuple val : values) {
if (result.getMin() == null ||val.getMin().compareTo(result.getMin()) < 0) {
result.setMin(val.getMin());}if (result.getMax() == null ||
val.getMax().compareTo(result.getMax()) > 0) {result.setMax(val.getMax());
} sum += val.getCount();
}result.setCount(sum); context.write(key, result);
}}
![Page 16: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/16.jpg)
![Page 17: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/17.jpg)
public static void main(String[] args) {
Configuration conf = new Configuration();String[] otherArgs = new GenericOptionsParser(conf, args)
.getRemainingArgs();
if (otherArgs.length != 2) {System.err.println("Usage: MinMaxCountDriver <in> <out>");System.exit(2);
}
Job job = new Job(conf, "Comment Date Min Max Count");job.setJarByClass(MinMaxCountDriver.class);
job.setMapperClass(MinMaxCountMapper.class);job.setCombinerClass(MinMaxCountReducer.class);job.setReducerClass(MinMaxCountReducer.class);
job.setOutputKeyClass(Text.class);job.setOutputValueClass(MinMaxCountTuple.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);}
![Page 18: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/18.jpg)
Inverted Index
• Generate an index from a data set to enable fast searches or data enrichment
• Building an index takes time, but can greatly reduce the amount of time to search for something
• Output can be ingested into key/value store
![Page 19: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/19.jpg)
Structure
![Page 20: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/20.jpg)
Counting with Counters
• Use MapReduce framework’s counter utility to calculate global sum entirely on the map side, producing no output
• Small number of counters only!!
![Page 21: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/21.jpg)
Known Uses
• Count number of records• Count a small number of unique field
instances• Sum fields of data together
![Page 22: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/22.jpg)
Structure
![Page 23: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/23.jpg)
FILTERING PATTERNSFiltering, Bloom Filtering, Top Ten, Distinct
![Page 24: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/24.jpg)
Filtering
• Discard records that are not of interest (Extract interesting subsets of the data)
• Create subsets of your big data sets that you want to further analyze
![Page 25: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/25.jpg)
Known Uses
• Closer view of the data• Tracking a thread of events• Distributed grep• Data cleansing• Simple random sampling
![Page 26: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/26.jpg)
Structure
![Page 27: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/27.jpg)
Bloom Filtering
• A Bloom filter (B.H. Bloom, 1970) is a space-efficient probabilistic data structure that is used to test set membership.
• Keep records that are a member of a large predefined set of values (hot values)
• Inherent possibility of false positives. It is not a problem if the output is a bit inaccurate.
![Page 28: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/28.jpg)
Known Uses
• Removing most of the non-watched values
• Pre-filtering a data set prior to expensive membership test
![Page 29: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/29.jpg)
Structure
![Page 30: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/30.jpg)
Top Ten
• Retrieve a relatively small number of top K records based on a ranking scheme
• Find the outliers or most interesting records
![Page 31: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/31.jpg)
Known Uses
• Outlier analysis• Selecting interesting data• Catchy dashboards
![Page 32: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/32.jpg)
Structure
![Page 33: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/33.jpg)
Distinct
• Remove duplicate entries of your data, either full records or a subset of fields
• That fourth V nobody talks about that much
![Page 34: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/34.jpg)
Known Uses
• Deduplicate data• Get distinct values• Protect from inner join explosion
![Page 35: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/35.jpg)
Structure
![Page 36: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/36.jpg)
DATA ORGANIZATION PATTERNS
Structured to Hierarchical, Partitioning, Binning,Total Order Sorting, Shuffling
![Page 37: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/37.jpg)
Data Organization patterns
• Reorganize and restructure data to work with other systems or to make MapReduce analysis easier.
![Page 38: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/38.jpg)
Structured to Hierarchical
• Transformed row-based data to a hierarchical format (such as JSON and XML)
• Reformatting RDBMS data to a more conducive structure
![Page 39: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/39.jpg)
Known Uses
• Pre-joining data• Prepare data for HBase or MongoDB
![Page 40: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/40.jpg)
Structure
![Page 41: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/41.jpg)
Partitioning
• Partition records into smaller data sets (i.e., shards, partitions or bins)
• It does not care about the order of records.
• Enables faster future query times due to partition pruning
![Page 42: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/42.jpg)
Known Uses
• Partition pruning by continuous value• Partition pruning by category• Sharding
![Page 43: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/43.jpg)
Structure
![Page 44: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/44.jpg)
Binning
• File records into one or more categories– Similar to partitioning, but the
implementation is different
• Can be used to solve similar problems to Partitioning
• Splits data up in the map phase instead of in the partitioner.
![Page 45: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/45.jpg)
Known Uses
• Pruning for follow-on analytics• Categorizing data
![Page 46: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/46.jpg)
Structure
![Page 47: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/47.jpg)
Total Order Sorting
• Sort your data set in parallel on a sort key
• Difficult to apply “divide and conquer” technique of MapReduce
• Sort key has to be comparable so the data can be ordered.
![Page 48: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/48.jpg)
Known Uses
• Sorting
![Page 49: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/49.jpg)
Structure: analyse phase
• determines the ranges• analyse phase is a random sampling
of the data
![Page 50: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/50.jpg)
Structure: sort phase
• TotalOrderPartitioner takes the data ranges from the partition file produced in the analyse step.
![Page 51: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/51.jpg)
Shuffling
• Set of records that you want to completely randomize
• Instill some anonymity or create some repeatable random sampling
![Page 52: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/52.jpg)
Known Uses
• Anonymize the order of the data set• Repeatable random sampling after
shuffled
![Page 53: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/53.jpg)
Structure
![Page 54: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/54.jpg)
JOIN PATTERNS
Join Refresher, Reduce-Side Join w/ and w/o Bloom Filter,Replicated Join, Composite Join, Cartesian Product
![Page 55: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/55.jpg)
Join Patterns
• Bringing and analyze different data sets together to discover interesting relationships.
• MR is good in processing datasets by looking at each record in isolation.
• Joining/combining datasets does not fit gracefully into the MR paradigm.
![Page 56: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/56.jpg)
Join Refresher
• A join is an operation that combines records from two or more data sets based on a field or set of fields, known as a foreign key
• Let’s go over the different types of joins before talking about how to do it in MapReduce
![Page 57: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/57.jpg)
A Tale of Two Tables
![Page 58: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/58.jpg)
Inner Join
• A+B on UserID
![Page 59: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/59.jpg)
Left Outer Join
• A+B on UserID
![Page 60: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/60.jpg)
Right Outer Join
• A+B on UserID
![Page 61: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/61.jpg)
Full Outer Join
• A+B on UserID
![Page 62: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/62.jpg)
Antijoin
• Full outer join minus the inner join
![Page 63: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/63.jpg)
Cartesian Product
AxB
![Page 64: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/64.jpg)
How to implement?
• Reduce-Side Join w/ and w/o Bloom Filter
• Replicated Join• Composite Join
• Cartesian Product stands alone
![Page 65: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/65.jpg)
Reduce Side Join
• Two or more data sets are joined in the reduce phase
• Covers all join types we have discussed– Exception: Mr. Cartesian
• All data is sent over the network– If applicable, filter using Bloom filter
![Page 66: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/66.jpg)
Structure
![Page 67: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/67.jpg)
Performance
• Need to be concerned about data skew
• 2 PB joined on 2 PB means 4 PB of network traffic
![Page 68: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/68.jpg)
Replicated Join
• Inner and Left Outer Joins• Removes need to shuffle any data to
the reduce phase• Read all files from the distributed
cache during the setup of the mapper method and store them into in-memory lookup tables.
• Mapper processes each record and joins it with all the data stored in memory.
![Page 69: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/69.jpg)
Structure
![Page 70: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/70.jpg)
Performance
• Fastest type of join• Map-only i.e., no combiner, partitioner
or reducer is used
• Limited based on how much data you can safely store inside JVM
• Need to be concerned about growing data sets
• Could optionally use a Bloom filter
![Page 71: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/71.jpg)
Composite Join
• Leverages built-in Hadoop utilities to join the data
• Requires the data to be already organized and prepared in a specific way– Sorted by foreign key, partitioned by
foreign key, and read in a very particular manner.
• Really only useful if you have one large data set that you are using a lot
![Page 72: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/72.jpg)
Data Structure
![Page 73: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/73.jpg)
Structure
![Page 74: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/74.jpg)
Performance
• Good performance, join operation is done on the map side
• Requires the data to have the same number of partitions, partitioned in the same way, and each partition must be sorted
![Page 75: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/75.jpg)
Cartesian Product
• Pair up and compare every single record with every other record in a data set
• Allows relationships between many different data sets to be uncovered at a fine-grain level
![Page 76: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/76.jpg)
Known Uses
• Document or image comparisons• Math stuff or something
![Page 77: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/77.jpg)
Structure
![Page 78: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/78.jpg)
Performance
• Massive data explosion!• Can use many map slots for a long
time
• Effectively creates a data set size O(n2)– Need to make sure your cluster can fit
what you are doing
![Page 79: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/79.jpg)
METAPATTERNSJob Chaining, Chain Folding, Job Merging
![Page 80: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/80.jpg)
Metapatterns
• Piece together several patterns to solve a complex problem or to perform several analytics in the same job.
![Page 81: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/81.jpg)
Job Chaining
• One job is often not enough• Need a combination of patterns
discussed to do your workflow
• Sequential vs Parallel
![Page 82: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/82.jpg)
Methodologies
• In the Driver• In a Bash run script• With the JobControl utility
![Page 83: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/83.jpg)
Chain Folding
• In job chaining, temporary data is stored in HDFS. So, total I/O is many.
• Each record can be submitted to multiple mappers, then a reducer, then a mapper
• Reduces amount of data movement in the pipeline. Chain folding is a optimization.
![Page 84: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/84.jpg)
Structure
Original chain
Optimizing mappers
![Page 85: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/85.jpg)
Structure
Original chain
Optimizing a reducer with a mapper
![Page 86: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/86.jpg)
Methodologies
• Just do it• ChainMapper/ChainReducer
![Page 87: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/87.jpg)
Job Merging
• Merge unrelated jobs together into the same pipeline
• Like job folding, job merging is another optimization method aimed to reduce the amount of I/O through the MR pipeline.
![Page 88: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/88.jpg)
Structure
![Page 89: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/89.jpg)
Methodologies
• Tag map output records• Use MultipleOutputs
![Page 90: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/90.jpg)
I/O PATTERNS
Generating Data, External Source Output,External Source Input, Partition Pruning
![Page 91: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/91.jpg)
Input and output patterns
• Custom the way to use Hadoop to input and output data.
![Page 92: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/92.jpg)
Customizing I/O
• Unstructured and semi-structured data often calls for a custom input format to be developed
![Page 93: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/93.jpg)
Generating Data
• Generate lots of data on the fly and in parallel from nothing. It does not load data.
• Random or representative big data sets for you to test your analytics with
![Page 94: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/94.jpg)
Known Uses
• Benchmarking your new cluster• Making more data to represent a
sample you were given
![Page 95: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/95.jpg)
Structure
![Page 96: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/96.jpg)
External Source Output
• You want to write MapReduce output to some non-native location
• Direct loading into a system instead of using HDFS as a staging area
![Page 97: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/97.jpg)
Known Uses
• Write directly out to some non-HDFS solution– Key/Value Store– RDBMS– In-Memory Store
• Many of these are already written
![Page 98: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/98.jpg)
Structure
![Page 99: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/99.jpg)
External Source Input
• You want to load data in parallel from some other source
• Hook other systems into the MapReduce framework
![Page 100: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/100.jpg)
Known Uses
• Skip the staging area and load directly into MapReduce
• Key/Value store• RDBMS• In-Memory store
![Page 101: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/101.jpg)
Structure
![Page 102: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/102.jpg)
Partition Pruning
• Abstract away how the data is stored to load what data is needed based on the query
![Page 103: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/103.jpg)
Known Uses
• Discard unneeded files based on the query
• Abstract data storage from query, allowing for powerful middleware to be built
![Page 104: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/104.jpg)
Structure
![Page 105: MapReduce Design Patterns - Piazza](https://reader030.vdocuments.net/reader030/viewer/2022012523/61970413c801ba2c3b5bfa81/html5/thumbnails/105.jpg)
References
• “MapReduce Design Patterns” – O’Reilly 2012
• www.github.com/adamjshook/mapreducepatterns
• http://en.wikipedia.org/wiki/Bloom_filter