
BigBench: Big Data Benchmark Proposal

Ahmad Ghazal, Tilmann Rabl, Minqing Hu, Francois Raab, Meikel Poess, Alain Crolotte, Hans-Arno Jacobsen

2

• End-to-end benchmark
• Based on a product retailer
• Focus on
  - Parallel DBMS
  - MR engines
• Initial work presented at 1st WBDB, San Jose
• Collaboration with industry & academia
  - Teradata, University of Toronto, InfoSizing, Oracle
• Full spec at 3rd WBDB, Xian, China

BigBench

3

• Data Model
  - Variety, Volume, Velocity
• Data Generator
  - PDGF for structured data
  - Enhancement: semi-structured & text generation
• Workload specification
  - Main driver: retail big data analytics
  - Covers: data source, declarative & procedural, and machine learning algorithms
• Metrics
• Evaluation
  - Done on Teradata Aster

Outline

4

• Big Data is not about size only
• 3 V’s
  - Variety
    • Different types of data
  - Volume
    • Huge size
  - Velocity
    • Continuous updates

Data Model

5

Data Model (Variety)

6

• Volume
  - Based on scale factor
  - Similar to TPC-DS scaling
  - Weblogs & product reviews also scaled
• Velocity
  - Periodic refreshes for all data
  - Different velocity for different areas
    • V_structured < V_unstructured < V_semistructured
  - Queries run with refresh

Data Model (continued)

7

• "Parallel Data Generation Framework" (PDGF)
  - For the structured part of the model
  - Scale factor similar to TPC-DS
• Extensions to PDGF
  - Web logs
  - Product reviews
• Web logs: retail customers/guests visiting the site
  - Web logs similar to Apache web server logs
  - Coupled with structured data
• Product reviews: customers and guest users
  - Algorithm based on Markov chain
  - Real data set sample as input
  - Coupled with structured data

Data Generator
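As a rough illustration of web logs that resemble Apache server logs and stay coupled with the structured data, the sketch below emits one line in Apache Common Log Format. It is not PDGF code: the function name `make_weblog_line`, the `/item?i_item_sk=` URL layout, the private IP range, and the use of the customer key as the logged user are all assumptions made for this example.

```python
import random
from datetime import datetime, timedelta, timezone

def make_weblog_line(rng, customer_ids, item_ids, start):
    """Return one Common Log Format line that references structured keys.

    Hypothetical helper: the URL layout and field choices are assumptions.
    """
    # None stands for a guest visitor with no customer key.
    customer = rng.choice(customer_ids + [None])
    item = rng.choice(item_ids)
    # Spread the clicks over one day starting at `start`.
    ts = start + timedelta(seconds=rng.randrange(86400))
    host = "10.0.%d.%d" % (rng.randrange(256), rng.randrange(1, 255))
    user = "-" if customer is None else str(customer)
    request = "GET /item?i_item_sk=%d HTTP/1.1" % item
    size = rng.randrange(200, 5000)
    return '%s - %s [%s] "%s" 200 %d' % (
        host, user, ts.strftime("%d/%b/%Y:%H:%M:%S %z"), request, size)

rng = random.Random(42)
start = datetime(2013, 1, 1, tzinfo=timezone.utc)
for _ in range(3):
    print(make_weblog_line(rng, [101, 102, 103], [1, 2, 3, 4], start))
```

Because every line carries an item key and, for registered customers, a customer key, the generated log can be joined back to the structured tables.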

8

• Input
  - Real review data
  - Category information
  - Rating
  - Review text
• Initial setup
  - Parse text
    • Produce tokens
    • Correlation between tokens
  - Build repository by category
• API for data generation
  - Integrated with PDGF
  - Based on category ID

Data Generator: TextGen
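The TextGen steps (parse text into tokens, record which tokens follow which, keep one repository per category, generate by category ID) can be sketched with a first-order Markov chain. This is a minimal sketch and not the actual TextGen implementation: the names `build_repository` and `generate_review`, whitespace tokenization, and the restart-on-dead-end rule are assumptions.

```python
import random
from collections import defaultdict

def build_repository(reviews):
    """Map category_id -> {token: [tokens observed right after it]}."""
    repo = defaultdict(lambda: defaultdict(list))
    for category_id, text in reviews:
        tokens = text.lower().split()  # parse text into tokens
        # Record each adjacent pair as an observed transition.
        for current, following in zip(tokens, tokens[1:]):
            repo[category_id][current].append(following)
    return repo

def generate_review(repo, category_id, length, rng):
    """Walk the category's chain and emit `length` tokens (length >= 1)."""
    if category_id not in repo:
        raise KeyError(category_id)
    chain = repo[category_id]
    starts = sorted(chain)  # sorted so a seeded rng is reproducible
    token = rng.choice(starts)
    out = [token]
    while len(out) < length:
        followers = chain.get(token)
        # Dead end (token never seen with a successor): restart the walk.
        token = rng.choice(followers) if followers else rng.choice(starts)
        out.append(token)
    return " ".join(out)

reviews = [
    (1, "great phone great battery"),
    (1, "battery lasts long"),
    (2, "soft cotton shirt"),
]
repo = build_repository(reviews)
print(generate_review(repo, 1, 8, random.Random(0)))
```

Storing repeated followers in a list means frequent transitions are sampled more often, which is what lets the generated text mirror the token correlations of the input sample.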

9

• Workload queries
  - 30 queries
  - Specified in English
  - No required syntax
• Business functions (adapted from McKinsey+)
  - Marketing
    • Cross-selling, customer micro-segmentation, sentiment analysis, enhancing multichannel consumer experiences
  - Merchandising
    • Assortment optimization, pricing optimization
  - Operations
    • Performance transparency, product return analysis
  - Supply chain
    • Inventory management
  - Reporting (customers and products)

Workload (continued)

10

Technical functions
• Data source dimension
  - Structured
  - Semi-structured
  - Unstructured
• Processing type dimension
  - Declarative (SQL, HQL)
  - Procedural
  - Mix of both
• Analytic technique dimension
  - Statistical analysis: correlation analysis, time-series, regression
  - Data mining: classification, clustering, association mining, pattern analysis and text analysis
  - Simple reporting: ad hoc queries not covered above

Workload (continued)

11

Workload (continued): Business Categories Query Breakdown

Business category      Number of queries   Percentage
Marketing              18                  60%
Merchandising          5                   17%
Operations             4                   13%
Supply chain           2                   7%
New business models    1                   3%

12

Workload (continued): Technical Dimensions Breakdown

Query processing type      Number of queries   Percentage
Declarative                10                  33%
Procedural                 7                   23%
Declarative + Procedural   13                  43%

13

Technical Dimensions Breakdown (continued)

Data sources          Number of queries   Percentage
Structured            18                  60%
Semi-structured       7                   23%
Unstructured          5                   17%

Analytic techniques   Number of queries   Percentage
Statistics analysis   6                   20%
Data mining           17                  57%
Reporting             8                   27%
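The percentages in these breakdowns follow from dividing each count by the 30 workload queries and rounding. The short sketch below reproduces the analytic-technique column; the helper name `percentages` is hypothetical. The counts add up to 31 rather than 30, which suggests that at least one query is counted under more than one technique. That reading is an inference from the numbers, not something the slides state.

```python
def percentages(counts, total):
    """Share of `total` for each entry, rounded to a whole percent."""
    return {name: round(100 * n / total) for name, n in counts.items()}

TOTAL_QUERIES = 30  # from the workload slide: 30 queries

analytic = {"Statistics analysis": 6, "Data mining": 17, "Reporting": 8}
shares = percentages(analytic, TOTAL_QUERIES)

# Counts beyond the 30 queries point to overlapping classifications.
overlap = sum(analytic.values()) - TOTAL_QUERIES

print(shares)
print("extra classifications:", overlap)
```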

14

• Future work
• Initial thoughts
  - Focus on loading and type of processing
  - MR engines good at loading
  - DBMS good at SQL
  - Metric = (formula not recoverable from the source)
• Applicable to single and multi-streams

Metrics

15

• BigBench proof of concept
• DBMS
  - Typically data loaded into tables
  - Possibly parsing weblogs to get schema
  - Reviews captured as VARCHAR or BLOB fields
  - Queries run using SQL + UDF
• MR engine
  - Data can be loaded on HDFS
  - MR, HQL, PigLatin can be used
• DBMS & MR engine
  - DBMS with Hadoop connectors
  - Data can be placed and split among both
  - Processing can also be split between the two

Evaluation
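For the DBMS path, "parsing weblogs to get schema" could look roughly like the sketch below, which turns one Apache Common Log Format line into a row with named columns. The column names and the helper `parse_weblog` are assumptions chosen for illustration; the slides do not specify the Weblogs table layout.

```python
import re

# Common Log Format: host ident user [time] "method path protocol" status size
CLF = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_weblog(line):
    """Return a dict of columns for one log line, or None if it does not match."""
    m = CLF.match(line)
    if m is None:
        return None
    row = m.groupdict()
    row["status"] = int(row["status"])
    # Apache writes "-" when no bytes were sent.
    row["size"] = 0 if row["size"] == "-" else int(row["size"])
    return row

sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326')
print(parse_weblog(sample))
```

Rows in this shape can be bulk-loaded into a table, after which the log is queryable with plain SQL alongside the structured data.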

16

• Teradata Aster
  - MPP database
  - Discovery platform
  - Supports SQL-MR
• System
  - Aster 5.0
  - 8-node cluster
  - 200 GB of data

Evaluation (continued)

17

• Data generation
  - DSDGen produced the structured part
  - PDGF+ produced the semi-structured and unstructured parts
• Data loaded into tables
  - Parsed weblogs into Weblogs table
  - Product reviews table
• Queries
  - SQL-MR syntax


18

• Example query
• Product category affinity analysis
  - Computes the probability of browsing products from a category after customers viewed items from another category
  - One form of market basket analysis
• Business case
  - Marketing: cross-selling
• Type of source
  - Structured (web sales)
• Processing type
  - Mix of declarative and procedural
• Analytics type
  - Data mining


19

SELECT
    category_cd1 AS category1_cd,
    category_cd2 AS category2_cd,
    COUNT(*) AS cnt
FROM basket_generator (
    ON (
        -- one row per web sale: the item's category and the buying customer
        SELECT i.i_category_id AS category_cd,
               s.ws_bill_customer_sk AS customer_id
        FROM web_sales s
        INNER JOIN item i ON s.ws_item_sk = i.i_item_sk
    )
    PARTITION BY customer_id
    BASKET_ITEM('category_cd')
    ITEM_SET_MAX(500)
)
GROUP BY 1, 2
ORDER BY 1, 3, 2;
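To make the intent of the query above concrete outside of SQL-MR, here is a minimal Python sketch of the same counting idea. It assumes that `basket_generator` emits each unordered pair of distinct categories once per customer; that is a reading of the query, not documented behavior, and the `ITEM_SET_MAX` limit is not modeled. The function name `category_pair_counts` and the sample rows are hypothetical.

```python
from collections import Counter
from itertools import combinations

def category_pair_counts(rows):
    """Count, over all customers, how often two categories co-occur.

    `rows` holds (customer_id, category_id) pairs, i.e. the shape of the
    inner SELECT in the query.
    """
    # PARTITION BY customer_id: collect each customer's distinct categories.
    per_customer = {}
    for customer_id, category_id in rows:
        per_customer.setdefault(customer_id, set()).add(category_id)

    counts = Counter()
    for categories in per_customer.values():
        # Assumed basket semantics: every unordered pair, once per customer.
        for pair in combinations(sorted(categories), 2):
            counts[pair] += 1
    return counts

rows = [(1, 10), (1, 20), (1, 20), (2, 10), (2, 20), (2, 30)]
for (cat1, cat2), cnt in sorted(category_pair_counts(rows).items()):
    print(cat1, cat2, cnt)
```

The per-customer grouping is the procedural part of the query, and the final count per category pair corresponds to the declarative `GROUP BY`.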


20

Evaluation (Sample Queries)

21

Evaluation (Processing Type)

22

• End-to-end big data benchmark
• Data Model
  - Adding semi-structured and unstructured data
  - Different velocity for structured, semi-structured and unstructured
  - Volume based on scale factor
• Data Generation
  - Unstructured data using Markov chain model
  - Enhancing PDGF for semi-structured and integrating it with unstructured
• Workload
  - 30 queries based on McKinsey report
  - Covers processing type, data source and data mining algorithms
• Evaluation
  - Aster SQL-MR

Conclusion

23

• Add late binding to web logs
  - No schema/keys upfront
• Finalizing
  - Data generator
  - Metric specifications
  - Downloadable kit
• More POCs
  - Hadoop ecosystem, such as Hive

Future Work
