
Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters

Contents
1. Introduction
2. Map-Reduce
3. Map-Reduce-Merge
4. Application to relational data processing
5. Optimization
6. Enhancements
7. Case studies
8. Conclusions

Introduction

Challenge: process and manage the vast amount of data collected from the entire World Wide Web.

Current solutions: customized parallel data processing systems that use large clusters of shared-nothing commodity nodes
Google’s GFS, BigTable, MapReduce
Ask.com’s Neptune
Microsoft’s Dryad
Yahoo!’s Hadoop

Introduction

Hadoop: open source

Map-Reduce refactors data processing into two primitives: map and reduce.

Developers don't need to worry about the nuisance details of coordinating parallel sub-tasks and managing distributed file storage, which increases productivity.
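The two primitives can be illustrated with a minimal single-process sketch. The driver `run_map_reduce` and the word-count functions are hypothetical names chosen for this example; they are not Hadoop APIs, and a real framework would distribute the map, shuffle, and reduce steps across nodes.

```python
from collections import defaultdict

def map_fn(_key, line):
    """Emit (word, 1) for every word in one input line."""
    return [(word, 1) for word in line.split()]

def reduce_fn(_word, counts):
    """Sum all counts collected for one word."""
    return sum(counts)

def run_map_reduce(records, map_fn, reduce_fn):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)
    return {k2: reduce_fn(k2, values) for k2, values in groups.items()}

word_counts = run_map_reduce(
    [(0, "map reduce merge"), (1, "map reduce")], map_fn, reduce_fn
)
```

The user writes only `map_fn` and `reduce_fn`; everything else belongs to the framework, which is where the productivity gain comes from.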

Introduction

Map-Reduce is best at handling homogeneous datasets.

Example: joins of heterogeneous datasets call for extra Map-Reduce steps.

Map-Reduce-Merge
simplified design
relationally complete

Map-Reduce

Features and Principles

Low-cost, unreliable commodity hardware
Extremely scalable RAIN cluster
Fault-tolerant yet easy to administer
Simplified and restricted yet powerful
Highly parallel yet abstracted
High throughput
High performance by the large
Functional programming primitives
...

Map-Reduce

Homogenization: for equi-joins

Transform each dataset into (join key, data-source tag + payload).
Then apply map-reduce to merge entries from different datasets.

Problems: it supports only equi-joins, may take lots of extra disk space, and may incur excessive communication.
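A minimal sketch of homogenization, assuming two small in-memory datasets. The tags "L" and "R", the helper names, and the sample employee and department records are all hypothetical; the point is only to show how the source tag lets a single reduce function tell the two inputs apart.

```python
from collections import defaultdict

def homogenize(dataset, tag, key_of):
    """Map step: rewrite each record as (join key, (source tag, payload))."""
    return [(key_of(rec), (tag, rec)) for rec in dataset]

def reduce_join(key, tagged_values):
    """Reduce step: pair every 'L' payload with every 'R' payload for one key."""
    left = [payload for tag, payload in tagged_values if tag == "L"]
    right = [payload for tag, payload in tagged_values if tag == "R"]
    return [(key, l, r) for l in left for r in right]

def equi_join(left, right, left_key, right_key):
    """Group the homogenized records by join key, then reduce each group."""
    groups = defaultdict(list)
    tagged = homogenize(left, "L", left_key) + homogenize(right, "R", right_key)
    for key, value in tagged:
        groups[key].append(value)
    joined = []
    for key in sorted(groups):
        joined.extend(reduce_join(key, groups[key]))
    return joined

employees = [("alice", 10), ("bob", 20), ("carol", 10)]
departments = [(10, "search"), (30, "ads")]
rows = equi_join(employees, departments, lambda e: e[1], lambda d: d[0])
```

Every record is rewritten with a tag and both datasets pass through the same shuffle, which is where the extra disk space and communication come from. Grouping by key equality is also why this approach is limited to equi-joins.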

Map-Reduce-Merge

α, β, γ represent dataset lineages.
The reduce function produces a key/value list instead of just values.
The merge function reads data from both lineages.
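The three phases can be sketched as follows. This is an illustrative single-process model, not the framework itself: the driver names, the naive all-pairs merge loop, and the sales and stores data are assumptions made for the example. Here `alpha` and `beta` are the two reduced lineages and the merge output forms the new lineage `gamma`.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """One lineage: map, group by key, reduce to one (key, value list) per key."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    return [reduce_fn(k2, groups[k2]) for k2 in sorted(groups)]

def map_reduce_merge(alpha, beta, merge_fn):
    """Offer every pair of reduced entries (one per lineage) to merge_fn."""
    gamma = []
    for a in alpha:
        for b in beta:
            gamma.extend(merge_fn(a, b))
    return gamma

def identity_map(key, value):
    return [(key, value)]

def merge_fn(a, b):
    """Join a store's total with its region when the store ids match."""
    (store_a, totals), (store_b, regions) = a, b
    if store_a != store_b:
        return []
    return [(region, total) for region in regions for total in totals]

# Lineage alpha: sales per store, reduced to (store, [total]).
sales = map_reduce([("s1", 5), ("s2", 7), ("s1", 3)], identity_map,
                   lambda k, vs: (k, [sum(vs)]))
# Lineage beta: store regions, reduced to (store, [region]).
stores = map_reduce([("s1", "west"), ("s2", "east")], identity_map,
                    lambda k, vs: (k, vs))
totals_by_region = map_reduce_merge(sales, stores, merge_fn)
```

Because reduce keeps the key alongside its values, the merge function can compare keys from the two lineages directly, with no source tags needed.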

Map-Reduce-Merge

Example

Merge Phase

Partition Selector: determines from which reducers this merger retrieves its input data, based on the merger number
Processors:
1. Process data from one source only
2. Users can define two processor functions
Merger: processes two key/value pairs
Configurable Iterators:
1. A merger has two logical iterators
2. Users control their relative movement against each other
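How the four components might fit together is sketched here. All function names, the "left"/"right"/"both" return convention, and the sample partitions are hypothetical, and the sketch assumes reducer outputs are sorted by key with unique keys in each input.

```python
def partition_selector(merger_id):
    """Return the reducer partitions (left ids, right ids) this merger reads."""
    return [merger_id], [merger_id]

def advance(left_key, right_key):
    """Configurable iterator: decide which logical iterator moves next."""
    if left_key < right_key:
        return "left"
    if left_key > right_key:
        return "right"
    return "both"  # assumes keys are unique within each input

def merge(left_kv, right_kv):
    """Merger: emit a joined record when the two keys match."""
    if left_kv[0] != right_kv[0]:
        return []
    return [(left_kv[0], (left_kv[1], right_kv[1]))]

def run_merger(merger_id, left_parts, right_parts, process_left, process_right):
    """Select partitions, run one processor per source, then merge."""
    left_ids, right_ids = partition_selector(merger_id)
    left = [process_left(kv) for p in left_ids for kv in left_parts[p]]
    right = [process_right(kv) for p in right_ids for kv in right_parts[p]]
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        out.extend(merge(left[i], right[j]))
        step = advance(left[i][0], right[j][0])
        if step in ("left", "both"):
            i += 1
        if step in ("right", "both"):
            j += 1
    return out

left_parts = {0: [(1, "a"), (3, "c")], 1: [(5, "e")]}
right_parts = {0: [(1, "x"), (2, "y"), (3, "z")], 1: [(5, "w"), (6, "v")]}
upper = lambda kv: (kv[0], kv[1].upper())  # processor for the left source
same = lambda kv: kv                       # processor for the right source
merged_0 = run_merger(0, left_parts, right_parts, upper, same)
merged_1 = run_merger(1, left_parts, right_parts, upper, same)
```

Swapping in a different `advance` function changes how the two iterators move relative to each other, for example to rescan one side as a nested loop, without touching the merger itself.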

Merger

Configurable Iterators: examples 1-3

Application to relational data processing

Map-Reduce-Merge is relationally complete:
Projection: map
Aggregation: map + reduce
Generalized selection: map for WHERE conditions, reduce for HAVING conditions, merger for filtering conditions involving more than one relation
Joins: to be discussed...
Set union / set intersection / set difference: easily handled in the merger
Cartesian product: nested loop
Rename: trivial
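As one illustration of why set operations fit naturally in a merger, the sketch below walks two sorted, duplicate-free inputs once and applies union, intersection, or difference. The function name `set_merge` and its `op` parameter are hypothetical; it assumes the reducers have already sorted and deduplicated their outputs.

```python
def set_merge(left, right, op):
    """Walk two sorted, duplicate-free lists once and apply a set operation."""
    out, i, j = [], 0, 0
    while i < len(left) or j < len(right):
        if j == len(right) or (i < len(left) and left[i] < right[j]):
            # Element appears only on the left side.
            if op in ("union", "difference"):
                out.append(left[i])
            i += 1
        elif i == len(left) or right[j] < left[i]:
            # Element appears only on the right side.
            if op == "union":
                out.append(right[j])
            j += 1
        else:
            # Element appears on both sides.
            if op in ("union", "intersection"):
                out.append(left[i])
            i += 1
            j += 1
    return out

a = [1, 2, 4]
b = [2, 3, 4, 5]
union = set_merge(a, b, "union")
intersection = set_merge(a, b, "intersection")
difference = set_merge(a, b, "difference")
```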

Sort-Merge Join

Map: use a range partitioner, so records are partitioned into ordered buckets, each covering a mutually exclusive key range

Reduce: sort data

Merge: reads from two sets of reducer outputs that cover the same key range
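The three steps can be sketched in a single process as follows. The helper names, the `boundaries` parameter, and the sample records are assumptions for illustration; each list index stands in for one reducer or merger.

```python
import bisect

def range_partition(records, boundaries):
    """Map step: place each (key, value) in the bucket whose key range holds it."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for key, value in records:
        buckets[bisect.bisect_right(boundaries, key)].append((key, value))
    return buckets

def sort_bucket(bucket):
    """Reduce step: sort one bucket by key."""
    return sorted(bucket, key=lambda kv: kv[0])

def merge_join(left, right):
    """Merge step: two-pointer join over two key-sorted lists."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit the cross product of the two runs that share this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            while i < len(left) and left[i][0] == lk:
                out.extend((lk, left[i][1], right[m][1]) for m in range(j, j_end))
                i += 1
            j = j_end
    return out

def sort_merge_join(left, right, boundaries):
    """Join bucket n of the left input with bucket n of the right input."""
    joined = []
    left_buckets = range_partition(left, boundaries)
    right_buckets = range_partition(right, boundaries)
    for lb, rb in zip(left_buckets, right_buckets):
        joined.extend(merge_join(sort_bucket(lb), sort_bucket(rb)))
    return joined

result = sort_merge_join(
    [(7, "g"), (2, "b"), (4, "d"), (2, "B")],
    [(4, "x"), (2, "y"), (9, "z")],
    boundaries=[5],
)
```

Both inputs must use the same boundaries, so that bucket n on one side covers exactly the same key range as bucket n on the other.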

Hash Join

Map: use a common partitioner => records are partitioned into hashed buckets

Reduce: reads from every mapper for one designated partition; using the same hash function, records from these partitions can be grouped and aggregated with a hash table

Merge: reads from two sets of reducer outputs that share the same hashing buckets; one set is used as the build set and the other as the probe set
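A minimal single-process sketch of the partition and build/probe steps. The helper names and `num_buckets` are hypothetical, the reduce step is treated as a pass-through (no aggregation), and Python's built-in `hash` stands in for the common hash function shared by both inputs.

```python
from collections import defaultdict

def hash_partition(records, num_buckets):
    """Map step: route each (key, value) to bucket hash(key) % num_buckets."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in records:
        buckets[hash(key) % num_buckets].append((key, value))
    return buckets

def build_probe(build_side, probe_side):
    """Merge step: build a hash table on one input, probe it with the other."""
    table = defaultdict(list)
    for key, value in build_side:
        table[key].append(value)
    return [(key, b, p) for key, p in probe_side for b in table.get(key, [])]

def hash_join(left, right, num_buckets=4):
    """Join bucket n of the left input with bucket n of the right input."""
    joined = []
    left_buckets = hash_partition(left, num_buckets)
    right_buckets = hash_partition(right, num_buckets)
    for lb, rb in zip(left_buckets, right_buckets):
        joined.extend(build_probe(lb, rb))
    return joined

result = sorted(
    hash_join([(1, "a"), (2, "b"), (5, "e")], [(5, "x"), (1, "y"), (1, "z")])
)
```

Because both inputs are hashed with the same function, matching keys always land in buckets with the same number, so each merger only needs to read one pair of buckets.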

Block Nested-Loop Join

Map: same as the one for the hash join

Reduce: same as the one for the hash join

Merge: almost the same as the one for the hash join, except that it uses a nested-loop join instead of a hash join
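The merge step for one pair of matching buckets might look like the sketch below. The function names, `block_size`, and the sample records are hypothetical; the outer input is read one block at a time and the inner input is scanned once per block.

```python
def blocks(records, block_size):
    """Yield consecutive slices of at most block_size records."""
    for start in range(0, len(records), block_size):
        yield records[start:start + block_size]

def block_nested_loop_join(outer, inner, predicate, block_size=2):
    """Merge step: scan the inner input once for each block of the outer input."""
    out = []
    for outer_block in blocks(outer, block_size):
        for inner_rec in inner:
            for outer_rec in outer_block:
                if predicate(outer_rec, inner_rec):
                    out.append((outer_rec, inner_rec))
    return out

pairs = block_nested_loop_join(
    [(1, "a"), (4, "d"), (6, "f")],
    [(4, "x"), (6, "y"), (4, "z")],
    predicate=lambda o, i: o[0] == i[0],
)
```

Larger blocks mean fewer passes over the inner input, at the cost of holding more outer records in memory at once.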

Optimizations

Optimal Reduce-Merge Connections
Results of Reduce: partitioned and sorted

The selector of Merge can choose the pertinent part of the data

Optimizations

Combining Phases
ReduceMap, MergeMap: directly send output to new mappers
ReduceMerge: combine the merger into the reducer
ReduceMergeMap: combination of the above two

Enhancements

Map-Reduce-Merge Library
A library that contains commonly used merger configurations, such as all kinds of joins

Enhancements

Map-Reduce-Merge Workflow
The regular Map-Reduce workflow is very strict.
Adding a new phase creates many workflow combinations.

Case Studies

Join Webgraphs
| URL | inlinks | outlinks |
Each column is stored in a separate file.

Goal: compute the intersection of inlinks and outlinks for each URL

Case Studies

Join Webgraphs
Reading all three columns into one Map-Reduce can overflow the buffer.

Safer approach:
1) Use each URL as a row-id
2) Replicate the row-id to each inlink and outlink
3) Produce <row-id, inoutlink>
4) Then natural join <row-id, URL> with <row-id, inoutlink>

Case Studies

Map-Reduce-Merge Workflow for TPC-H Q2
SQL: 5-way join with aggregate and group-by clauses

Map-Reduce-Merge: four 2-way joins, then order-by and sorting

Conclusions

Map-Reduce-Merge supports joins of heterogeneous datasets

Thus, it can be used to implement many relational operators, particularly joins
