Map-Reduce-Merge
Simplified Relational Data Processing on Large Clusters
Contents
1. Introduction
2. Map-Reduce
3. Map-Reduce-Merge
4. Application to relational data processing
5. Optimization
6. Enhancements
7. Case studies
8. Conclusions
Introduction
Challenge: process and manage the vast amount of data collected from the entire World Wide Web.
Current solutions: customized parallel data processing systems that use large clusters of shared-nothing commodity nodes
Google’s GFS, BigTable, MapReduce
Ask.com’s Neptune
Microsoft’s Dryad
Yahoo!’s Hadoop
Introduction
Hadoop: open source
Map-Reduce refactors data processing into two primitives: map and reduce
Developers do not need to worry about the details of coordinating parallel sub-tasks and managing distributed file storage, which increases productivity
Introduction
Map-Reduce is best at handling homogeneous datasets
Example: joins call for extra Map-Reduce steps
Map-Reduce-Merge: simplified design, relationally complete
Map-Reduce
Features and Principles
Low-cost unreliable commodity hardware
extremely scalable RAIN cluster
fault-tolerant yet easy to administer
simplified and restricted yet powerful
highly parallel yet abstracted
high throughput
high performance in the large
functional programming primitives
...
Map-Reduce
Homogenization: for equi-joins
Transform each dataset into (join key, data-source tag + payload)
Then apply map-reduce to merge entries from different datasets
Problems: works only for equi-joins; may take a lot of extra disk space and incur excessive communication
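The homogenization approach above can be sketched in a single process as follows. This is only an illustration, not code from the slides: the function names (homogenize, reduce_join) and the sample employee/department tables are assumptions made up for the example.

```python
from collections import defaultdict

def homogenize(records, key_of, tag):
    """Map step: turn each record into (join key, (source tag, payload))."""
    return [(key_of(r), (tag, r)) for r in records]

def reduce_join(tagged_pairs, left_tag, right_tag):
    """Reduce step: group by join key, then pair entries from the two sources."""
    groups = defaultdict(lambda: {left_tag: [], right_tag: []})
    for key, (tag, payload) in tagged_pairs:
        groups[key][tag].append(payload)
    joined = []
    for key in sorted(groups):
        for left in groups[key][left_tag]:
            for right in groups[key][right_tag]:
                joined.append((key, left, right))
    return joined

# Hypothetical sample data: (name, dept_id) and (dept_id, dept_name).
employees = [("alice", 1), ("bob", 2), ("carol", 1)]
departments = [(1, "sales"), (2, "research")]

tagged = (homogenize(employees, lambda e: e[1], "emp")
          + homogenize(departments, lambda d: d[0], "dept"))
result = reduce_join(tagged, "emp", "dept")
```

Every record carries an extra tag and passes through the same shuffle, which is where the extra disk space and communication come from.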
Map-Reduce-Merge
α, β, γ represent dataset lineages
The reduce function produces a key/value list instead of just values
The merge function reads data from both lineages
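A minimal single-process sketch of how the three primitives fit together, assuming in-memory lists stand in for distributed datasets. The driver names (run_map_reduce, run_merge) and the sample data are hypothetical, and the merge here naively visits every pair of reduced entries rather than using a partition selector or configurable iterators.

```python
from collections import defaultdict

def run_map_reduce(records, map_fn, reduce_fn):
    """Run one lineage: map, group by key, reduce to a (key, [values]) list."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    # Reduce keeps the key, so the output is a key/value list sorted by key.
    return [(k2, reduce_fn(k2, vs)) for k2, vs in sorted(groups.items())]

def run_merge(alpha_out, beta_out, merge_fn):
    """Feed reduced pairs from both lineages to merge_fn, producing lineage gamma."""
    gamma = []
    for a in alpha_out:
        for b in beta_out:
            gamma.extend(merge_fn(a, b))
    return gamma

# Lineage alpha: (order_id, (dept, amount)); reduce sums amounts per dept.
alpha = [(1, ("A", 10)), (2, ("B", 5)), (3, ("A", 7))]
# Lineage beta: (dept, multiplier); map and reduce pass values through.
beta = [("A", 2), ("B", 3)]

alpha_out = run_map_reduce(alpha, lambda k, v: [(v[0], v[1])], lambda k, vs: [sum(vs)])
beta_out = run_map_reduce(beta, lambda k, v: [(k, v)], lambda k, vs: vs)

def scale_totals(a, b):
    """Merge function: emit (dept, total * multiplier) when the keys match."""
    return [(a[0], a[1][0] * b[1][0])] if a[0] == b[0] else []

gamma = run_merge(alpha_out, beta_out, scale_totals)
```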
Map-Reduce-Merge
Example
Merge Phase
Partition Selector: determines from which reducers this merger retrieves its input data, based on the merger number
Processors:
1. Each processes data from one source only
2. Users can define two processor functions
Merger: processes two key/value pairs, one from each source
Configurable Iterators:
1. A merger has two logical iterators
2. Users control how the iterators move relative to each other
Merger
Configurable Iterators
examples 1, 2, and 3
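As a rough illustration of the idea of configurable iterators (not a reproduction of the examples on the slides), the sketch below drives two logical iterators and lets a user-supplied policy decide which one advances. All names (merge_with_iterators, advance_smaller_key, emit_matches) are hypothetical.

```python
def merge_with_iterators(left, right, move, merger):
    """Drive two logical iterators over key/value lists.

    move(left_key, right_key) is the user-supplied policy that returns
    "left", "right", or "both"; merger(l, r) is called on each visited pair.
    """
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        out.extend(merger(left[i], right[j]))
        step = move(left[i][0], right[j][0])
        if step not in ("left", "right", "both"):
            raise ValueError(f"unknown iterator move: {step!r}")
        if step != "right":
            i += 1
        if step != "left":
            j += 1
    return out

def advance_smaller_key(left_key, right_key):
    """Policy for key-sorted inputs with unique keys: advance the smaller side."""
    if left_key < right_key:
        return "left"
    if left_key > right_key:
        return "right"
    return "both"

def emit_matches(l, r):
    """Merger: emit a joined record only when the keys are equal."""
    return [(l[0], l[1], r[1])] if l[0] == r[0] else []

left = [(1, "a"), (3, "b"), (5, "c")]
right = [(2, "x"), (3, "y"), (5, "z"), (7, "w")]
matches = merge_with_iterators(left, right, advance_smaller_key, emit_matches)
```

Swapping in a different policy changes the traversal pattern without touching the merger function.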
Application to relational data processing
Map-Reduce-Merge is relationally complete:
projection: map
aggregation: map + reduce
generalized selection: map handles WHERE, reduce handles HAVING, and the merger handles filtering conditions that involve more than one relation
joins: to be discussed...
set union / set intersection / set difference: easily handled in the merger
Cartesian product: nested loop
rename: trivial
Sort-Merge Join
Map: use a range partitioner, so records are partitioned into ordered, mutually exclusive buckets
Reduce: sort data
Merge: reads from two sets of reducer outputs that cover the same key range
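The three steps of the sort-merge join can be sketched as below, assuming both datasets are lists of (key, value) pairs and share the same range boundaries. The function names and the boundaries parameter are assumptions for illustration only.

```python
import bisect
from operator import itemgetter

def range_partition(records, boundaries):
    """Map step: route each (key, value) to the bucket whose key range holds it."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for key, value in records:
        buckets[bisect.bisect_right(boundaries, key)].append((key, value))
    return buckets

def sort_merge(left, right):
    """Merge step: join two key-sorted lists, handling duplicate keys."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on the right, then pair it with
            # every left record that carries the same key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            while i < len(left) and left[i][0] == lk:
                out.extend((lk, left[i][1], right[k][1]) for k in range(j, j_end))
                i += 1
            j = j_end
    return out

def sort_merge_join(alpha, beta, boundaries):
    joined = []
    for a_bucket, b_bucket in zip(range_partition(alpha, boundaries),
                                  range_partition(beta, boundaries)):
        # Reduce step: sort each bucket by key before merging.
        joined.extend(sort_merge(sorted(a_bucket, key=itemgetter(0)),
                                 sorted(b_bucket, key=itemgetter(0))))
    return joined

alpha = [(5, "a5"), (1, "a1"), (9, "a9"), (5, "a5b")]
beta = [(9, "b9"), (5, "b5"), (2, "b2")]
joined = sort_merge_join(alpha, beta, boundaries=[4])
```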
Hash Join
Map: use a common partitioner, so records from both datasets are partitioned into hashed buckets
Reduce: reads from every mapper for one designated partition; using the same hash function, records from these partitions can be grouped and aggregated with a hash table
Merge: reads from two sets of reducer outputs that share the same hashing buckets; one set is used to build a hash table and the other to probe it
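A small sketch of the hash join, assuming in-memory lists of (key, value) pairs and Python's built-in hash as the common partitioner. The function names and the num_buckets parameter are hypothetical.

```python
from collections import defaultdict

def hash_partition(records, num_buckets):
    """Map step: the same partitioner is applied to both datasets."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in records:
        buckets[hash(key) % num_buckets].append((key, value))
    return buckets

def build_probe(build_side, probe_side):
    """Merge step: build a hash table from one side, probe it with the other."""
    table = defaultdict(list)
    for key, value in build_side:
        table[key].append(value)
    return [(key, b, p) for key, p in probe_side for b in table.get(key, [])]

def hash_join(alpha, beta, num_buckets=4):
    joined = []
    # Matching keys always land in buckets with the same index on both sides.
    for a_bucket, b_bucket in zip(hash_partition(alpha, num_buckets),
                                  hash_partition(beta, num_buckets)):
        joined.extend(build_probe(a_bucket, b_bucket))
    return joined

alpha = [(1, "a1"), (2, "a2"), (2, "a2b")]
beta = [(2, "b2"), (3, "b3"), (1, "b1")]
joined = hash_join(alpha, beta)
```

The output order depends on bucket order, so callers that need a stable order should sort the result.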
Block Nested-Loop Join
Map: same as the one for the hash join
Reduce: same as the one for the hash join
Merge: almost the same as for the hash join, except that a nested-loop join is used instead
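The merge step of a block nested-loop join might look like the sketch below, applied to one pair of matching buckets. The function name, the block_size parameter, and the predicate argument are assumptions for illustration.

```python
def block_nested_loop_merge(outer, inner, predicate, block_size=2):
    """Join two buckets by scanning the inner side once per block of outer records."""
    out = []
    for start in range(0, len(outer), block_size):
        block = outer[start:start + block_size]
        for r in inner:              # one inner scan per outer block
            for l in block:
                if predicate(l, r):
                    out.append((l, r))
    return out

outer = [(1, "a"), (2, "b"), (3, "c")]
inner = [(2, "x"), (3, "y")]
pairs = block_nested_loop_merge(outer, inner, lambda l, r: l[0] == r[0])
```

Larger blocks mean fewer scans of the inner bucket, at the cost of holding more outer records in memory at once.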
Optimizations
Optimal Reduce-Merge Connections
Results of Reduce: partitioned and sorted
The Merge selector can choose only the pertinent part of the data
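One way to picture such a selector, assuming each reducer's output covers a known half-open key range [low, high): a merger picks only the reducers whose ranges overlap its own. The function name and the range representation are hypothetical.

```python
def select_reducers(merger_range, reducer_ranges):
    """Return indexes of reducers whose [low, high) key range overlaps the merger's."""
    m_low, m_high = merger_range
    return [i for i, (low, high) in enumerate(reducer_ranges)
            if low < m_high and m_low < high]

reducer_ranges = [(0, 10), (10, 20), (20, 30)]
needed = select_reducers((5, 15), reducer_ranges)
```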
Optimizations
Combining Phases
ReduceMap, MergeMap: directly send output to new mappers
ReduceMerge: combine the merger with the reducer
ReduceMergeMap: combination of the above two
Enhancements
Map-Reduce-Merge Library
A library that contains commonly used merger configurations, such as all kinds of joins
Enhancements
Map-Reduce-Merge Workflow
The regular Map-Reduce workflow is very strict
Adding a new phase creates many workflow combinations
Case Studies
Join Webgraphs
| URL | inlinks | outlinks |
Each column is stored in a separate file
Goal: compute the intersection of inlinks and outlinks for each URL
Case Studies
Join Webgraphs
Reading all three columns into one Map-Reduce job can overflow the buffer
Safer approach:
1) each URL as a row-id
2) replicate the row-id to each inlink and outlink
3) produce <row-id, inoutlink>
4) then natural join <row-id, URL> with <row-id, inoutlink>
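Steps 2 to 4 of the safer approach can be sketched in memory as follows. The function names, the integer row-ids, and the example.com-style hostnames are all made up for illustration, and the set operations stand in for the distributed joins.

```python
def replicate(column):
    """Step 2: turn (row_id, [links]) into a set of (row_id, link) pairs."""
    return {(row_id, link) for row_id, links in column for link in links}

def in_out_links(inlinks, outlinks):
    """Step 3: keep the (row_id, link) pairs present in both columns."""
    return sorted(replicate(inlinks) & replicate(outlinks))

def join_with_urls(urls, inout):
    """Step 4: natural join of (row_id, URL) with (row_id, inoutlink) on row_id."""
    url_of = dict(urls)
    return [(url_of[row_id], link) for row_id, link in inout if row_id in url_of]

# Hypothetical webgraph columns, each keyed by row-id.
urls = [(0, "a.example"), (1, "b.example")]
inlinks = [(0, ["b.example", "c.example"]), (1, ["a.example"])]
outlinks = [(0, ["c.example", "d.example"]), (1, ["a.example", "c.example"])]

inout = in_out_links(inlinks, outlinks)
result = join_with_urls(urls, inout)
```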
Case Studies
Map-Reduce-Merge Workflow for TPC-H Q2
SQL: 5-way join with aggregate and GROUP BY clauses
Map-Reduce-Merge: four 2-way joins, then ORDER BY and sorting
Conclusions
Map-Reduce-Merge supports joins of heterogeneous datasets
Thus, it can be used to implement many relational operators, particularly joins