Design Patterns in MapReduce


Akhilesh Joshi

[email protected]

▪ Local Aggregation

▪ Pairs and Stripes

▪ Order inversion

▪ Graph algorithms

▪ In Hadoop, intermediate results are written to local disk before being sent over the network. Since network and disk latencies are relatively expensive compared to other operations, reductions in the amount of intermediate data translate into increases in algorithmic efficiency.

▪ In MapReduce, local aggregation of intermediate results is one of the keys to efficient algorithms.

▪ Hence we use a COMBINER to perform this local aggregation and reduce the number of intermediate key-value pairs passed from the Mapper to the Reducer.

▪ Combiners provide a general mechanism within the MapReduce framework to reduce the amount of intermediate data generated by the mappers.

▪ They can be understood as mini-reducers that process the output of mappers.

▪ Combiners aggregate term counts across the documents processed by each map task

▪ CONCLUSION

This results in a reduction in the number of intermediate key-value pairs that need to be shuffled across the network ==> from the order of the total number of terms in the collection to the order of the number of unique terms in the collection.
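For example, in Hadoop's Java API the combiner is registered on the job. Below is a minimal word-count sketch using the TokenCounterMapper and IntSumReducer classes that ship with Hadoop; note that the framework treats the combiner as an optimization and may run it zero or more times.

// Wiring a combiner into a Hadoop word-count job. Reusing the reducer
// as the combiner is safe here because summing is associative and
// commutative.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountWithCombiner {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "wordcount-with-combiner");
    job.setJarByClass(WordCountWithCombiner.class);
    job.setMapperClass(TokenCounterMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}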

An associative array (i.e., Map in Java) is introduced inside the mapper to tally up term counts within a single document: instead of emitting a key-value pair for each term in the document, this version emits a key-value pair for each unique term in the document.

NOTE: the reducer is not changed!

In this case, we initialize an associative array for holding term counts. Since it is possible to preserve state across multiple calls of the Map method (for each input key-value pair), we can continue to accumulate partial term counts in the associative array across multiple documents, and emit key-value pairs only when the mapper has processed all documents. That is, emission of intermediate data is deferred until the Close method in the pseudo-code.

IN-MAPPER COMBINER IMPLEMENTATION
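A minimal Java sketch of this in-mapper combining pattern for word count; in Hadoop's Java API, cleanup() plays the role of the Close method in the pseudo-code (the class name is illustrative).

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// In-mapper combining: tally counts in a HashMap across all input
// records, emit once in cleanup() when the whole split is processed.
public class InMapperCombinerMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private final Map<String, Integer> counts = new HashMap<>();

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      // Accumulate in memory instead of emitting (term, 1) immediately.
      counts.merge(itr.nextToken(), 1, Integer::sum);
    }
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    // Deferred emission: one pair per unique term seen by this map task.
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
  }
}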

▪ An advantage of using this design pattern is that we have control over the local aggregation

▪ In-mapper combiners should be preferred over actual combiners, as actual combiners create overhead for creating and destroying objects

▪ Combiners do reduce the amount of intermediate data, but the mapper still emits the full number of key-value pairs; the combiner only aggregates them after they have been produced

▪ Yes! Using the in-mapper combiner tweaks the mapper to preserve state across documents

▪ This creates a bottleneck, as the outcome of the algorithm may depend on the order in which key-value pairs arrive. We call this an ORDER-DEPENDENT BUG!

▪ Such bugs are difficult to detect when we are dealing with large datasets

Another disadvantage?

▪ There must be sufficient memory to hold the intermediate counts until all key-value pairs are processed (in our word count example, the vocabulary may grow larger than the in-memory associative array can hold!)

▪ SOLUTION: flush the in-memory state periodically by maintaining a counter. The size of the blocks to be flushed is empirical and hard to determine.
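A sketch of that bounded-memory variant of the word-count mapper above; the flush threshold is an illustrative value that would be tuned empirically.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BoundedInMapperCombiner
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int FLUSH_THRESHOLD = 100_000;  // illustrative, tuned empirically
  private final Map<String, Integer> counts = new HashMap<>();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      counts.merge(itr.nextToken(), 1, Integer::sum);
    }
    // Flush partial aggregates once the map grows past the threshold;
    // the reducer sums partial counts, so correctness is unaffected.
    if (counts.size() > FLUSH_THRESHOLD) {
      flush(context);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    flush(context);  // emit whatever remains at the end of the split
  }

  private void flush(Context context) throws IOException, InterruptedException {
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
    counts.clear();
  }
}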

▪ Problem statement: compute the mean for a given key (say the key is an alphanumeric employee id and the value is a salary)

▪ Addition and Multiplication are associative

▪ Division and Subtraction are not associative

1. Computing Average directly in reducer (no combiner)

2. Using Combiners to reduce workload on reducer

3. Using in-memory combiner to increase efficiency of approach 2

1. Computing Average directly in reducer (no combiner)

2. Using Combiners to reduce workload on reducer : DOES NOT WORK

3. Using in-memory combiner to increase efficiency of approach 2

WHY ?

Because average is not associative, combiners would calculate averages from separate map tasks and send them to the reducer. The reducer would then take these averages and combine them into another average. This leads to a wrong result, since AVERAGE(1,2,3,4,5) ≠ AVERAGE(AVERAGE(1,2), AVERAGE(3,4,5)): the left side is 3, while the right side is AVERAGE(1.5, 4) = 2.75.

▪ NOTES :

▪ NO COMBINER USED

▪ AVERAGE IS CALCULATED IN REDUCER

▪ MAPPER USED : IDENTITY MAPPER

▪ This algorithm works (a sketch follows the list below) but has some problems

Problems:

1. requires shuffling all key-value pairs from mappers to reducers across the network

2. reducer cannot be used as a combiner
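A minimal Java sketch of this first version, assuming tab-separated input lines of employee id and salary (class and field names are illustrative); note that the reducer divides, which is exactly why it cannot double as a combiner.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Version 1: mapper forwards (employeeId, salary); reducer computes the mean.
public class MeanV1 {

  public static class SalaryMapper
      extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");  // empId \t salary
      context.write(new Text(fields[0]),
                    new DoubleWritable(Double.parseDouble(fields[1])));
    }
  }

  public static class MeanReducer
      extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
        throws IOException, InterruptedException {
      double sum = 0;
      long count = 0;
      for (DoubleWritable v : values) { sum += v.get(); count++; }
      // Dividing here is safe; it would NOT be safe in a combiner.
      context.write(key, new DoubleWritable(sum / count));
    }
  }
}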

INCORRECT

Notes :

1. Combiners used

2. Wrong, since the output of the combiner must match the output of the mapper; here the output of the combiner is a (sum, count) pair, whereas the output of the mapper was just a list of integers

3. This breaks the basic MapReduce contract

CORRECTLY

Notes:

A correct implementation of the combiner, since the output of the mapper matches the output of the combiner

What if I don’t use combiner ?

The reducer will still be able to calculate the mean correctly at the end; the combiner just acts as an intermediary to reduce the reducer's workload.

Also, the output of the reducer need not be the same as that of the combiner or the mapper.
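A hedged sketch of this correct version: it ships the associative parts (sum, count) and divides only once, in the reducer. The pair is encoded as a comma-separated Text value so the combiner's output type matches the mapper's (a custom Writable would be the more idiomatic choice); names and the input format are assumptions.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MeanWithCombiner {

  public static class SalaryMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");  // empId \t salary
      context.write(new Text(fields[0]), new Text(fields[1] + ",1"));
    }
  }

  // Combiner output ("sum,count" Text) matches the mapper's output type.
  public static class SumCountCombiner extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      double sum = 0; long count = 0;
      for (Text v : values) {
        String[] parts = v.toString().split(",");
        sum += Double.parseDouble(parts[0]);
        count += Long.parseLong(parts[1]);
      }
      context.write(key, new Text(sum + "," + count));
    }
  }

  public static class MeanReducer extends Reducer<Text, Text, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      double sum = 0; long count = 0;
      for (Text v : values) {
        String[] parts = v.toString().split(",");
        sum += Double.parseDouble(parts[0]);
        count += Long.parseLong(parts[1]);
      }
      context.write(key, new DoubleWritable(sum / count));  // divide once, here
    }
  }
}

Without the combiner registered, the reducer still computes the correct mean, since it parses the raw "salary,1" values the same way.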

MORE EFFICIENT THAN ALL OTHER VERSIONS

Notes :

▪ Inside the mapper, the partial sums and counts associated with each string are held in memory across input key-value pairs

▪ Intermediate key-value pairs are emitted only after the entire input split has been processed

▪ The in-memory combiner uses resources efficiently to reach the desired result

▪ The workload on the reducer is somewhat reduced

▪ WE ARE EMITTING SUM AND COUNT TO REACH THE AVERAGE, i.e. associative operations used for a non-associative (average) result.
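A minimal sketch of this in-mapper variant, under the same assumed input format as above; the MeanReducer from the previous sketch can be reused unchanged, since it already consumes "sum,count" values.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// In-mapper combining for the mean: a partial (sum, count) per employee id
// is held in memory and emitted once, after the whole split is processed.
public class InMapperMeanMapper extends Mapper<LongWritable, Text, Text, Text> {

  // key -> {running sum, running count}
  private final Map<String, double[]> partials = new HashMap<>();

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    String[] fields = value.toString().split("\t");  // empId \t salary
    double[] p = partials.computeIfAbsent(fields[0], k -> new double[2]);
    p[0] += Double.parseDouble(fields[1]);  // sum
    p[1] += 1;                              // count
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    for (Map.Entry<String, double[]> e : partials.entrySet()) {
      context.write(new Text(e.getKey()),
                    new Text(e.getValue()[0] + "," + (long) e.getValue()[1]));
    }
  }
}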

▪ The idea of stripes is to aggregate data before the reducers by using a combiner. There are several benefits to this, discussed below. With pairs, when a mapper completes, its intermediate data sits idle until all mappers are complete. With stripes, the intermediate data is passed to the combiner, which can start processing the data like a reducer. So, instead of sitting idle, map-side resources can run the combiner until the slowest mapper finishes.

▪ Link : http://nosql.mypopescu.com/post/19286669299/mapreduce-pairs-and-stripes-explained/

▪ Input to the problem

▪ Key-value pairs in the form of a docid and a doc

▪ The mapper:

▪ Processes each input document

▪ Emits key-value pairs with:

▪ Each co-occurring word pair as the key

▪ The integer one (the count) as the value

▪ This is done with two nested loops:

▪ The outer loop iterates over all words

▪ The inner loop iterates over all neighbors

▪ The reducer:

▪ Receives pairs relative to co-occurring words

▪ This requires modifying the partitioner

▪ Computes an absolute count of the joint event

▪ Emits the pair and the count as the final key-value output

▪ Basically, reducers emit the cells of the co-occurrence matrix
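A minimal Java sketch of the pairs approach, treating each input line as the co-occurrence window and encoding the pair as "left,right" in a Text key (a custom WritableComparable pair type would be more idiomatic); names are illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CooccurrencePairs {

  public static class PairsMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] words = value.toString().split("\\s+");
      for (int i = 0; i < words.length; i++) {      // outer loop: all words
        for (int j = 0; j < words.length; j++) {    // inner loop: all neighbors
          if (i != j) {
            context.write(new Text(words[i] + "," + words[j]), ONE);
          }
        }
      }
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));  // one cell of the matrix
    }
  }
}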

▪ Input to the problem

▪ Key-value pairs in the form of a docid and a doc

▪ The mapper:

▪ Same two nested loops structure as before

▪ Co-occurrence information is first stored in an associative array

▪ Emit key-value pairs with words as keys and the corresponding arrays as values

▪ The reducer:

▪ Receives all associative arrays related to the same word

▪ Performs an element-wise sum of all associative arrays with the same key

▪ Emits key-value output in the form of word, associative array

▪ Basically, reducers emit rows of the co-occurrence matrix
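A corresponding sketch of the stripes approach, using Hadoop's MapWritable as the associative array; same assumed line-as-window input, illustrative names.

import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CooccurrenceStripes {

  public static class StripesMapper extends Mapper<LongWritable, Text, Text, MapWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] words = value.toString().split("\\s+");
      for (int i = 0; i < words.length; i++) {
        MapWritable stripe = new MapWritable();  // one associative array per word
        for (int j = 0; j < words.length; j++) {
          if (i != j) {
            Text neighbor = new Text(words[j]);
            IntWritable old = (IntWritable) stripe.get(neighbor);
            stripe.put(neighbor, new IntWritable(old == null ? 1 : old.get() + 1));
          }
        }
        context.write(new Text(words[i]), stripe);  // word -> { neighbor: count }
      }
    }
  }

  public static class StripesReducer extends Reducer<Text, MapWritable, Text, MapWritable> {
    @Override
    protected void reduce(Text key, Iterable<MapWritable> values, Context context)
        throws IOException, InterruptedException {
      MapWritable total = new MapWritable();
      for (MapWritable stripe : values) {       // element-wise sum of stripes
        for (Map.Entry<Writable, Writable> e : stripe.entrySet()) {
          IntWritable old = (IntWritable) total.get(e.getKey());
          int add = ((IntWritable) e.getValue()).get();
          total.put(e.getKey(), new IntWritable(old == null ? add : old.get() + add));
        }
      }
      context.write(key, total);  // one row of the co-occurrence matrix
    }
  }
}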

PAIRS

▪ Generates a large number of key-value pairs (also intermediate)

▪ The benefit from combiners is limited, as it is less likely for a mapper to process multiple occurrences of a word pair

▪ Does not suffer from memory paging problems

STRIPES

▪ More compact

▪ Generates fewer and shorter intermediate keys

▪ Can make better use of combiners

▪ The framework has less sorting to do

▪ The values are more complex and have serialization/deserialization overhead

▪ Greatly benefits from combiners, as the key space is the vocabulary

▪ Suffers from memory paging problems, if not properly engineered

“STRIPES”

▪ Idea: group together pairs into an associative array

▪ Each mapper takes a sentence:

▪ Generate all co-occurring term pairs

▪ For each term a, emit a → { b: count_b, c: count_c, d: count_d … }

▪ Reducers perform element-wise sum of associative arrays

(a, b) → 1

(a, c) → 2

(a, d) → 5

(a, e) → 3

(a, f) → 2

a → { b: 1, c: 2, d: 5, e: 3, f: 2 }

a → { b: 1, d: 5, e: 3 }

+ a → { b: 1, c: 2, d: 2, f: 2 }

= a → { b: 2, c: 2, d: 7, e: 3, f: 2 }

▪ Combiners can be used in both pairs and stripes, but implementing combiners in stripes gives better results because of the associative array

▪ The stripes approach might encounter memory problems, since it tries to fit the associative array into memory

▪ The pairs approach does not face such a problem with respect to in-memory space

▪ THE STRIPES APPROACH PERFORMED BETTER THAN PAIRS, BUT EACH APPROACH HAS ITS OWN SIGNIFICANCE.

▪ The memory problem in stripes can be dealt with by dividing the entire vocabulary into buckets and applying stripes to individual buckets; this in turn reduces the memory allocation required by the stripes approach.

ORDER INVERSION

▪ A drawback of the co-occurrence matrix is that some word pairs appear very frequently together, simply because one of the words is very common

▪ Solution :

▪ Convert absolute counts into relative frequencies f(wj | wi): what proportion of the time does wj appear in the context of wi?

f(wj | wi) = N(wi, wj) / Σw′ N(wi, w′)

▪ N(·, ·) is the number of times a co-occurring word pair is observed

▪ The denominator is called the marginal (the sum of the counts of the conditioning variable co-occurring with anything else)

▪ In the reducer, the counts of all words that co-occur with the conditioning variable (wi) are available in the associative array

▪ Hence, the sum of all those counts gives the marginal

▪ Then we divide the joint counts by the marginal and we’re done

▪ The reducer receives the pair (wi , wj) and the count

▪ From this information alone it is not possible to compute f(wj|wi)

▪ Fortunately, as with the mapper, the reducer can also preserve state across multiple keys

▪ We can buffer in memory all the words that co-occur with wi and their counts

▪ This is basically building the associative array in the stripes method

We must define the sort order of the pair

▪ In this way, the keys are first sorted by the left word, and then by the right word (in the pair)

▪ Hence, we can detect if all pairs associated with the word we are conditioning on (wi) have been seen

▪ At this point, we can use the in-memory buffer, compute the relative frequencies and emit

We must ensure that all pairs with the same left word are sent to the same reducer. This does not happen automatically, so we use a separate partitioner to achieve this task, as sketched below . . .
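A minimal Java sketch of such a partitioner, assuming the pair is encoded as "left,right" in a Text key (as in the pairs sketch above); the class name is illustrative.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Partition on the left word only, so every (wi, *) and (wi, wj) pair
// reaches the same reducer.
public class LeftWordPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String leftWord = key.toString().split(",")[0];
    return (leftWord.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}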

▪ Emit a special key-value pair to capture the marginal

▪ Control the sort order of the intermediate key, so that the special key-value pair is processed first

▪ Define a custom partitioner for routing intermediate key-value pairs

▪ Preserve state across multiple keys in the reducer

▪ The order inversion pattern is a nice trick that lets a reducer see intermediate results before it processes the data that generated them.

▪ We illustrate this with the example of computing relative frequencies for co-occurring word pairs, e.g. what are the relative frequencies of words occurring within a small window of the word "dog"? The mapper counts word pairs in the corpus, so its output looks like:

((dog, cat), 125)

((dog, foot), 246)

▪ But it also keeps a running total of all the word pairs containing "dog", outputting this as ((dog,*), 5348)

▪ Using a suitable partitioner, so that all (dog,...) pairs get sent to the same reducer, and choosing the "*" token so that it occurs before any word in the sort order, the reducer sees the total ((dog,*), 5348) first, followed by all the other counts, and can trivially store the total and then output relative frequencies.

The benefit of the pattern is that it avoids an extra MapReduce iteration without creating any additional scalability bottleneck.
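A hedged sketch of the reduce side of this pattern, assuming the mapper also emits the special ("wi,*", count) pairs described above, pair keys are encoded as "left,right" in a Text, and "*" sorts before every real word (true for Text's byte ordering, since '*' precedes letters and digits); the class name is illustrative.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Order inversion on the reduce side: thanks to the sort order and the
// left-word partitioner, the marginal pair (wi, "*") arrives before any
// (wi, wj), so the total can be stored and each joint count divided by
// it as it streams past.
public class RelativeFrequencyReducer
    extends Reducer<Text, IntWritable, Text, DoubleWritable> {

  private double marginal = 0;  // total count for the current left word

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    long sum = 0;
    for (IntWritable v : values) sum += v.get();

    if (key.toString().endsWith(",*")) {
      marginal = sum;  // special key seen first: remember the marginal
    } else {
      context.write(key, new DoubleWritable(sum / marginal));
    }
  }
}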

▪ Input to reducers are sorted by the keys

▪ Values are arbitrarily ordered

▪ We may want to order reducer values either ascending or descending.

▪ Solution :

▪ Buffer reducer values in memory and sort them

▪ Disadvantage: if the data is too large, it may not fit in memory; it also creates unnecessary object space in the memory heap

▪ Use the secondary sort design pattern in MapReduce

▪ Uses the shuffle and sort method

▪ Reducer values will be sorted

▪ Secondary key sorting is done by creating a composite key
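A minimal sketch of such a composite key, with illustrative names; a full secondary-sort job would also register a partitioner that hashes only the natural key, and a grouping comparator (via job.setGroupingComparatorClass(...)) that compares only the natural key, so all composite keys for one natural key reach a single reduce call in sorted order.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Composite key for secondary sort: the natural key plus the value we
// want the reducer to see in order. The shuffle sorts on both fields.
public class CompositeKey implements WritableComparable<CompositeKey> {
  private String naturalKey;
  private long secondaryValue;

  public CompositeKey() {}  // no-arg constructor required by Hadoop

  public CompositeKey(String naturalKey, long secondaryValue) {
    this.naturalKey = naturalKey;
    this.secondaryValue = secondaryValue;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeUTF(naturalKey);
    out.writeLong(secondaryValue);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    naturalKey = in.readUTF();
    secondaryValue = in.readLong();
  }

  @Override
  public int compareTo(CompositeKey other) {
    int cmp = naturalKey.compareTo(other.naturalKey);  // primary: natural key
    if (cmp != 0) return cmp;
    return Long.compare(secondaryValue, other.secondaryValue);  // secondary: value, ascending
  }
}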

▪ Parallel BFS

▪ Page Rank