
MapReduce vs. Parallel DBMSs

Presenter: Ran Ding

1. Introduction 2. Where the MR wins 3. DBMS “sweet spot” tests 4. Why the Parallel DBMS wins 5. Conclusion

Outline

The MapReduce (MR) paradigm has been hailed as a revolutionary new platform for large-scale, massively parallel data access.

Example: Hadoop

Introduction: MR

Parallel DBMSs appeared in the mid-1980s, when the Teradata and Gamma projects pioneered a new architectural paradigm based on a cluster of commodity computers.

Introduction: Parallel DBMS

Distributing the rows of a relational table across the nodes of the cluster so they can be processed in parallel.

Introduction: Horizontal partitioning

One benefit is that the system automatically manages the various alternative partitioning strategies for the tables involved in the query.

Examples: hash, range, and round-robin partitioning.

Introduction: DBMS
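The three partitioning strategies named on this slide can be sketched in a few lines of Python. This is only an illustration, not how any particular DBMS implements them: the function names, the `crc32` hash, and the row layout (dicts with an `id` field) are all assumptions made for the example.

```python
import bisect
import zlib
from collections import defaultdict


def hash_partition(rows, key, n_nodes):
    """Send each row to the node chosen by hashing its partitioning key."""
    parts = defaultdict(list)
    for row in rows:
        # crc32 is an arbitrary choice here; it is deterministic across runs.
        node = zlib.crc32(str(row[key]).encode()) % n_nodes
        parts[node].append(row)
    return dict(parts)


def range_partition(rows, key, boundaries):
    """Send each row to the node whose key range contains the row's key.

    `boundaries` is a sorted list of split points; node i holds the keys
    that fall between split point i-1 (inclusive) and split point i.
    """
    parts = defaultdict(list)
    for row in rows:
        node = bisect.bisect_right(boundaries, row[key])
        parts[node].append(row)
    return dict(parts)


def round_robin_partition(rows, n_nodes):
    """Deal rows out to the nodes in turn, ignoring their contents."""
    parts = defaultdict(list)
    for i, row in enumerate(rows):
        parts[i % n_nodes].append(row)
    return dict(parts)


# Hypothetical table: six rows with ids 0..5, spread over three nodes.
rows = [{"id": i} for i in range(6)]
by_hash = hash_partition(rows, "id", 3)
by_range = range_partition(rows, "id", [2, 4])
by_turn = round_robin_partition(rows, 3)
```

Hash and range partitioning keep rows with equal keys on the same node, which is what lets each node work on its own share of the table; round-robin only balances the row counts.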

It is not easy!

UDFs (user-defined functions) help. The Reduce step is similar to GROUP BY in SQL.

Introduction: Mapping parallel DBMS onto MapReduce
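As a rough sketch of the correspondence on this slide, the toy job below runs a Map function over each record (the part a UDF would play) and then groups and aggregates by key (the part GROUP BY would play). The table, the field names `dept` and `salary`, and the helper `run_mapreduce` are hypothetical and exist only for this example.

```python
from collections import defaultdict


def map_fn(record):
    """Hypothetical Map step: emit (key, value) pairs for one record."""
    yield record["dept"], record["salary"]


def reduce_fn(key, values):
    """Hypothetical Reduce step: aggregate all values that share a key."""
    return key, sum(values)


def run_mapreduce(records):
    """Single-process stand-in for an MR job: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)  # the "shuffle": group values by key
    return dict(reduce_fn(key, values) for key, values in groups.items())


employees = [
    {"dept": "eng", "salary": 100},
    {"dept": "ops", "salary": 80},
    {"dept": "eng", "salary": 150},
]
totals = run_mapreduce(employees)
```

Under these assumptions the job computes roughly what `SELECT dept, SUM(salary) FROM employees GROUP BY dept` would compute in SQL.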

1. ETL and “read once” data sets 2. Complex analytics 3. Semi-structured data 4. Quick-and-dirty analyses 5. Limited-budget operations

Where the MR wins

ETL = extract-transform-load

An MR system can be considered a general-purpose parallel ETL system.

For DBMSs, ETL is usually handled by separate ETL tools upstream of the database.

ETL and “read once” data sets

Some analytics need multiple passes over the data and cannot be structured as single SQL aggregate queries

MR is a good candidate for such applications

Complex analytics

MR systems are a good fit when the data is being prepared for loading into a back-end system

A DBMS has to model such data as wide tables with many attributes

Plus, MR-style systems easily store and process semi-structured data

Semi-structured data
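To make the "wide table" point concrete, the sketch below takes a few key-value records whose attributes vary from record to record and flattens them into one fixed set of columns. The records and the helper `to_wide_rows` are invented for illustration; the point is only that every attribute a record lacks becomes an empty (`None`) cell.

```python
# Hypothetical web-log records: each one carries a different set of attributes.
records = [
    {"url": "/home", "user": "alice"},
    {"url": "/buy", "user": "bob", "item": "book", "price": 12.5},
    {"url": "/search", "query": "mapreduce"},
]


def to_wide_rows(records):
    """Flatten key-value records into one wide table with a fixed schema."""
    columns = sorted({key for record in records for key in record})
    rows = [tuple(record.get(col) for col in columns) for record in records]
    return columns, rows


columns, rows = to_wide_rows(records)
# Count how many cells of the wide table are empty.
empty_cells = sum(value is None for row in rows for value in row)
```

An MR-style system can store the `records` list as it is, while the relational version needs the union of all attributes up front.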

A DBMS requires the programmer to define a schema before loading the data

With MR, the data is just copied into place!

Quick-and-dirty analyses

MR systems are mostly open source and free

Parallel DBMSs: high cost

Limited-budget operations

DBMS “Sweet Spot” Test

1. Repetitive record parsing 2. Compression 3. Pipelining 4. Scheduling 5. Column-oriented storage

Why the Parallel DBMS wins

In MR, each Map and Reduce task must repeatedly parse and convert string fields into the appropriate type

In a DBMS, records are parsed once, when the data is initially loaded.

Repetitive record parsing
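The difference in parsing cost can be illustrated with a small counter. In this sketch, which is an assumption-laden toy and not a benchmark, two "MR-style" jobs each re-parse the raw text lines, while the "DBMS-style" path parses them once at load time and answers both queries from the typed rows. The record layout and all function names are hypothetical.

```python
from datetime import date

parse_calls = 0  # counts how often a text record is converted to typed fields


def parse(line):
    """Split one comma-separated line and convert each field to its type."""
    global parse_calls
    parse_calls += 1
    user_id, day, amount = line.split(",")
    return int(user_id), date.fromisoformat(day), float(amount)


raw_lines = ["1,2009-01-01,3.5", "2,2009-01-02,4.0", "1,2009-01-03,1.5"]


def mr_total(lines):
    """MR-style job: parses every line as part of the job itself."""
    return sum(parse(line)[2] for line in lines)


def mr_max(lines):
    """A second MR-style job: parses every line again."""
    return max(parse(line)[2] for line in lines)


mr_results = (mr_total(raw_lines), mr_max(raw_lines))
mr_parse_calls = parse_calls

# DBMS-style: parse once at load time, then query the typed rows.
parse_calls = 0
table = [parse(line) for line in raw_lines]
db_results = (sum(row[2] for row in table), max(row[2] for row in table))
db_parse_calls = parse_calls
```

Both paths give the same answers; only the number of parse calls differs, and the gap grows with every additional query over the same data.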

Why compression helps the DBMSs but not Hadoop is hard to say. Commercial DBMSs may use carefully tuned compression algorithms.

Compression

In a parallel DBMS, data is streamed from producer to consumer

The intermediate data is never written to disk

In an MR system, the producer writes its results to local data structures, and consumers then read from them

Pipelining
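A minimal way to picture the two data-transfer styles: in the first function a Python generator stands in for streaming, so each row is handed to the consumer without being stored; in the second the producer's output is first written to a temporary file, which the consumer then reads back. The file format, the names, and the tiny workload are assumptions chosen for the example.

```python
import json
import os
import tempfile


def producer(n):
    """Produce n intermediate rows, one at a time."""
    for i in range(n):
        yield {"id": i, "value": i * i}


def consumer(rows):
    """Consume rows: sum the values of rows with an even id."""
    return sum(row["value"] for row in rows if row["id"] % 2 == 0)


def pipelined(n):
    """Streaming style: rows flow from producer to consumer, nothing is stored."""
    return consumer(producer(n))


def materialized(n):
    """Materializing style: write all intermediate rows to disk, then read them."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "intermediate.jsonl")
        with open(path, "w") as f:
            for row in producer(n):
                f.write(json.dumps(row) + "\n")
        with open(path) as f:
            return consumer(json.loads(line) for line in f)
```

Both functions return the same result; the second one pays for an extra write and read of the intermediate data.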

In a parallel DBMS, every node knows in advance what it should do, according to the distributed query plan

In an MR system, each task is scheduled on processing nodes one storage block at a time.

Scheduling

Vertica reads only the attributes necessary for solving the user query

DBMS-X and Hadoop are both row stores

Column-oriented storage
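The effect of the storage layout can be sketched by counting how many stored values a simple aggregate has to touch. The table, the helper names, and the "values read" counter below are hypothetical; real systems read disk pages rather than individual values, so this only shows the direction of the effect.

```python
# Hypothetical table with four attributes per record.
rows = [
    {"id": 1, "name": "a", "city": "X", "amount": 10},
    {"id": 2, "name": "b", "city": "Y", "amount": 20},
    {"id": 3, "name": "c", "city": "X", "amount": 30},
]


def to_columns(rows):
    """Re-organize row-wise records into one list per attribute."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def sum_row_store(rows, attr):
    """Row store: every attribute of every record is read to reach one field."""
    total, values_read = 0, 0
    for row in rows:
        values_read += len(row)
        total += row[attr]
    return total, values_read


def sum_column_store(columns, attr):
    """Column store: only the requested attribute's column is read."""
    column = columns[attr]
    return sum(column), len(column)


row_result = sum_row_store(rows, "amount")
col_result = sum_column_store(to_columns(rows), "amount")
```

With four attributes per record, the row-wise scan touches four times as many values as the column-wise scan for the same sum.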

MR advocates should learn from parallel DBMSs the technologies and techniques for efficient parallel query execution.

What should MR learn from Parallel DBMS

MR systems are powerful tools for ETL-style applications and for complex analytics. If the application is query-intensive, whether semi-structured or rigidly structured, then a DBMS is probably the better choice.

Conclusion

Thank you~~

Questions?
