1. Introduction
2. Where the MR wins
3. DBMS “sweet spot” tests
4. Why the Parallel DBMS wins
5. Conclusion
Guideline
The MapReduce (MR) paradigm has been hailed as a revolutionary new platform for large-scale, massively parallel data access.
Example: Hadoop.
Introduction - MR
Parallel DBMSs appeared in the mid-1980s, when the Teradata and Gamma projects pioneered a new architectural paradigm based on a cluster of commodity computers.
Introduction - Parallel DBMS
Distributing the rows of a relational table across the nodes of the cluster so that they can be processed in parallel.
Introduction - Horizontal partitioning
One benefit is that the system automatically manages the alternative partitioning strategies for the tables involved in a query.
Examples: hash, range, and round-robin partitioning.
Introduction - DBMS
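The three partitioning strategies named above can be sketched in a few lines. This is only an illustration: the function names, the sample rows, and the use of `crc32` as the hash are assumptions, not how any particular parallel DBMS implements partitioning.

```python
from bisect import bisect_right
from zlib import crc32


def hash_partition(rows, key, n_nodes):
    """Send each row to the node chosen by hashing its partitioning key."""
    nodes = [[] for _ in range(n_nodes)]
    for row in rows:
        # crc32 is used only because it is deterministic across runs.
        node = crc32(str(row[key]).encode()) % n_nodes
        nodes[node].append(row)
    return nodes


def range_partition(rows, key, boundaries):
    """Send each row to the node whose key range contains it.

    `boundaries` is a sorted list of split points; node i holds keys
    below boundaries[i] and at or above boundaries[i - 1].
    """
    nodes = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        nodes[bisect_right(boundaries, row[key])].append(row)
    return nodes


def round_robin_partition(rows, n_nodes):
    """Deal rows out to the nodes in turn, ignoring their contents."""
    nodes = [[] for _ in range(n_nodes)]
    for i, row in enumerate(rows):
        nodes[i % n_nodes].append(row)
    return nodes


# Hypothetical sample table.
rows = [{"id": i, "amount": 10 * i} for i in range(6)]
by_hash = hash_partition(rows, "id", 3)
by_range = range_partition(rows, "amount", [20, 40])
by_turn = round_robin_partition(rows, 3)
```

Hash and range partitioning keep rows with the same key on the same node, which matters for joins and grouping; round-robin only balances row counts.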
It is not easy!
UDFs (user-defined functions) help: a Map function can be written as a UDF, and the Reduce step behaves like GROUP BY in SQL.
Introduction - Mapping parallel DBMS onto MapReduce
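To make the Reduce / GROUP BY correspondence concrete, here is a minimal sketch that computes the same aggregate twice: once with hand-written map and reduce functions, and once with a SQL GROUP BY. The `visits` data and all function names are hypothetical, and the in-process `sqlite3` database merely stands in for a DBMS; it is not a parallel system.

```python
import sqlite3
from collections import defaultdict

# Hypothetical input: (url, seconds spent on the page).
visits = [("a.com", 3), ("b.com", 5), ("a.com", 4)]


def map_fn(record):
    """Map: emit (key, value) pairs; in a DBMS this logic could live in a UDF."""
    url, duration = record
    yield url, duration


def reduce_fn(key, values):
    """Reduce: aggregate all values that share a key."""
    return key, sum(values)


def run_mr(records):
    groups = defaultdict(list)
    for rec in records:
        for key, value in map_fn(rec):
            groups[key].append(value)  # the "shuffle": group values by key
    return sorted(reduce_fn(k, vs) for k, vs in groups.items())


def run_sql(records):
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE visits (url TEXT, duration INTEGER)")
    con.executemany("INSERT INTO visits VALUES (?, ?)", records)
    rows = con.execute(
        "SELECT url, SUM(duration) FROM visits GROUP BY url ORDER BY url"
    ).fetchall()
    con.close()
    return rows


mr_result = run_mr(visits)
sql_result = run_sql(visits)
```

Both paths return the total duration per URL; the grouping that the MR version does by hand is what GROUP BY expresses declaratively.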
1. ETL and “read once” data sets
2. Complex analytics
3. Semi-structured data
4. Quick-and-dirty analyses
5. Limited-budget operations
Where the MR wins
Extract-transform-load system
An MR system can be considered a general-purpose parallel ETL system.
DBMSs may perform the ETL
ETL and “read once” data sets
MR systems are good at processing data that is being prepared for loading into a back-end system.
A DBMS requires wide tables with many attributes to hold such data.
Plus, MR-style systems can easily store and process semi-structured data.
Semi-structured data
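A small sketch of the contrast above, assuming hypothetical JSON log lines with differing fields. The map function reads only the field it needs and ignores the rest, while a single relational table that accepts every record type would need one column per field seen anywhere.

```python
import json
from collections import Counter

# Hypothetical semi-structured input: each record has its own set of fields.
LINES = [
    '{"type": "click", "url": "a.com"}',
    '{"type": "purchase", "sku": "X1", "price": 9.5}',
    '{"type": "click", "url": "b.com", "referrer": "a.com"}',
]


def map_fn(line):
    """MR style: no schema declared up front; read only the field needed."""
    record = json.loads(line)
    yield record.get("type", "unknown"), 1


def count_by_type(lines):
    counts = Counter()
    for line in lines:
        for key, value in map_fn(line):
            counts[key] += value
    return dict(counts)


def wide_table_columns(lines):
    """Columns a single 'wide' table would need to hold every record type."""
    columns = set()
    for line in lines:
        columns.update(json.loads(line))
    return sorted(columns)


type_counts = count_by_type(LINES)
columns = wide_table_columns(LINES)
```

Here three short records already require five columns, most of which would be NULL in any given row.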
1. Repetitive record parsing
2. Compression
3. Pipelining
4. Scheduling
5. Column-oriented storage
Why the Parallel DBMS wins
In an MR system, each Map and Reduce task must repeatedly parse and convert string fields into the appropriate type.
Records are parsed by DBMSs when the data is initially loaded.
Repetitive record parsing
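The cost difference can be illustrated by counting parse calls. In this sketch the record layout, the `parse` helper, and the `LoadedTable` class are all hypothetical; the point is only that the MR-style path parses on every job, while the DBMS-style path parses once at load time.

```python
# Hypothetical raw records: id|date|amount, stored as text.
RAW_LINES = ["1|2024-01-05|19.99", "2|2024-01-06|5.00", "3|2024-01-07|12.50"]

stats = {"parses": 0}


def parse(line):
    """Convert one text line into typed fields, counting each call."""
    stats["parses"] += 1
    rec_id, day, amount = line.split("|")
    return int(rec_id), day, float(amount)


def mr_style_job(lines, predicate):
    """MR style: every job starts from raw text and parses it again."""
    return [rec for rec in map(parse, lines) if predicate(rec)]


class LoadedTable:
    """DBMS style: parse once at load time, then query typed records."""

    def __init__(self, lines):
        self.records = [parse(line) for line in lines]

    def query(self, predicate):
        return [rec for rec in self.records if predicate(rec)]


# Two jobs over the raw text.
stats["parses"] = 0
mr_style_job(RAW_LINES, lambda r: r[2] > 10)
mr_style_job(RAW_LINES, lambda r: r[0] == 2)
mr_parses = stats["parses"]

# One load followed by the same two queries.
stats["parses"] = 0
table = LoadedTable(RAW_LINES)
table.query(lambda r: r[2] > 10)
table.query(lambda r: r[0] == 2)
dbms_parses = stats["parses"]
```

With three records and two queries, the MR-style path parses six times and the loaded table three; the gap grows with every additional job.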
In a parallel DBMS, data is streamed from producer to consumer.
The intermediate data is never written to disk.
In an MR system, the producer writes its results to local data structures, and consumers then read from them.
Pipelining
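The two data-flow styles can be approximated as follows. This is a loose sketch under stated assumptions: a Python generator stands in for streaming between operators, and a temporary file stands in for the local data structures an MR producer writes. Neither models a real engine, and all names are hypothetical.

```python
import os
import tempfile


def producer(n):
    """Upstream operator: yields one value at a time."""
    for i in range(n):
        yield i * i


def pipelined_sum(n):
    """Streaming: the consumer handles each value as it is produced.

    No intermediate result is materialized.
    """
    return sum(producer(n))


def materialized_sum(n):
    """Materializing: the producer writes all output to a local file first,
    and the consumer reads that file afterwards.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as out:
        for value in producer(n):
            out.write(f"{value}\n")
        path = out.name
    try:
        with open(path) as src:
            return sum(int(line) for line in src)
    finally:
        os.remove(path)


streamed = pipelined_sum(5)
staged = materialized_sum(5)
```

Both versions return the same answer; the second pays for an extra write and read of the intermediate data, though materialized output can be reread if a downstream consumer has to be restarted.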
In a parallel DBMS, every node knows in advance what it should do.
In an MR system, tasks are scheduled on processing nodes one storage block at a time.
Scheduling
Vertica reads only the attributes necessary for solving the user query.
DBMS-X and Hadoop are both row stores.
Column-oriented storage
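A toy comparison of the two layouts, counting how many field values each one touches to answer a query over a single attribute. The table contents and function names are hypothetical, and in-memory lists only approximate what a real row or column store reads from disk.

```python
# Hypothetical table with one large attribute the query does not need.
rows = [
    {"id": 1, "url": "a.com", "rank": 10, "body": "x" * 100},
    {"id": 2, "url": "b.com", "rank": 20, "body": "y" * 100},
    {"id": 3, "url": "c.com", "rank": 30, "body": "z" * 100},
]

# Row store: records kept whole. Column store: one list per attribute.
row_store = rows
column_store = {name: [r[name] for r in rows] for name in rows[0]}


def avg_rank_row_store(store):
    """Reads every attribute of every record to reach 'rank'."""
    fields_read = 0
    total = 0
    for record in store:
        fields_read += len(record)  # the whole record is fetched
        total += record["rank"]
    return total / len(store), fields_read


def avg_rank_column_store(store):
    """Reads only the 'rank' column."""
    column = store["rank"]
    return sum(column) / len(column), len(column)


row_avg, row_fields = avg_rank_row_store(row_store)
col_avg, col_fields = avg_rank_column_store(column_store)
```

Both layouts give the same average, but the row layout touches four values per record and the column layout one.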
MR advocates should learn from parallel DBMSs the technologies and techniques for efficient parallel query execution.
What should MR learn from Parallel DBMS
MR systems are powerful tools for ETL-style applications and for complex analytics. If the application is query-intensive, whether semi-structured or rigidly structured, then a DBMS is probably the better choice.
Conclusion