Quark Virtualization Engine for Analytics Rajat Venkatesh Qubole @vrajat

Upload: hadoop-summit

Post on 07-Jan-2017


Page 1: Quark Virtualization Engine for Analytics

Quark Virtualization Engine for Analytics

Rajat Venkatesh, Qubole (@vrajat)

Page 2: Quark Virtualization Engine for Analytics

Agenda

• Motivation
• Use Cases
• Architecture
• Roadmap

Page 3: Quark Virtualization Engine for Analytics

Data @Qubole

[Diagram. Labels: api.qubole.com; Monitoring & Alerts; Business Analysts; Workload Analysis; Customer Clusters; Amazon RDS; Amazon S3]

Page 4: Quark Virtualization Engine for Analytics

Multi-Store Architecture

[Diagram. Labels: Embedded Thin JDBC JAR Quark Server; Quark Catalog; Laptop or Server; Amazon Redshift]

Page 5: Quark Virtualization Engine for Analytics

Narrow Tables

• TPC-DS dataset, ~3 billion rows, ORC, Presto 0.119
• Q3 references 3 attributes from store_sales

[Chart: Q3; label "String (512 Bytes)"; axis ticks 0-50 and 0-250]

Page 6: Quark Virtualization Engine for Analytics

Narrow Tables

Table        No. of Queries   Total Columns   Columns Used
Tickets      25000+           265             74
Customers    10000+           53              43
Support      6000+            33              10
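
To make the table concrete, the share of columns that queries actually touch can be computed directly from the numbers above. This is a minimal Python sketch; the dictionary name and the helper function are illustrative and not part of Quark.

```python
# Column usage per table, copied from the slide: (total columns, columns used).
usage = {
    "Tickets": (265, 74),
    "Customers": (53, 43),
    "Support": (33, 10),
}


def used_fraction(total, used):
    """Fraction of a table's columns referenced by at least one query."""
    return used / total


for table, (total, used) in usage.items():
    print(f"{table}: {used_fraction(total, used):.1%} of columns used")
```

For Tickets and Support, well under a third of the columns are ever referenced, which is the case a narrow copy of the table is meant to exploit.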

Page 7: Quark Virtualization Engine for Analytics

TPCDS q3.sql

select dt.d_year, item.i_brand_id brand_id, item.i_brand brand,
       sum(ss_ext_sales_price) sum_agg
from store_sales, item, date_dim dt
where dt.d_date_sk = store_sales.ss_sold_date_sk
  and store_sales.ss_item_sk = item.i_item_sk
  and item.i_manufact_id = 436
  and dt.d_moy = 12
  -- partition key filters
  and (ss_sold_date_sk between 2451149 and 2451179
    or ss_sold_date_sk between 2451514 and 2451544
    or ss_sold_date_sk between 2451880 and 2451910
    or ss_sold_date_sk between 2452245 and 2452275
    or ss_sold_date_sk between 2452610 and 2452640)
group by dt.d_year, item.i_brand, item.i_brand_id
order by dt.d_year, sum_agg desc, brand_id
limit 100;

Narrow table holding only the three store_sales columns that Q3 references:

create table narrow_store_sales_3m as
  select ss_sold_date_sk, ss_item_sk, ss_ext_sales_price
  from store_sales
  where ss_sold_date_sk >= (julian_day(now() - 3 months));

Page 8: Quark Virtualization Engine for Analytics

Materialized View in Quark

create view store_sales_view as
  select ss_sold_date_sk, ss_item_sk, ss_ext_sales_price
  from store_sales
  where ss_sold_date_sk >= (julian_day(now() - 3 months))
stored in narrow_store_sales_3m;
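
The routing decision behind such a view can be sketched as a coverage check: a query may be redirected to the narrow table only if the view keeps every column and every day the query needs. The Python below is an illustrative sketch, not Quark's or Calcite's actual matching algorithm. The View class, the function names and the cutoff value are hypothetical, and the column set assumes the narrow table keeps the three store_sales columns that Q3 references.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    name: str            # table that stores the view's rows
    columns: frozenset   # columns kept in the narrow table
    min_date_sk: int     # oldest ss_sold_date_sk kept in the view


def can_answer(view, query_columns, query_min_date_sk):
    """True if every requested column and every requested day is in the view."""
    covers_columns = set(query_columns) <= view.columns
    covers_dates = query_min_date_sk >= view.min_date_sk
    return covers_columns and covers_dates


def pick_table(view, base_table, query_columns, query_min_date_sk):
    """Route to the narrow table when it is sufficient, else to the base table."""
    if can_answer(view, query_columns, query_min_date_sk):
        return view.name
    return base_table


narrow = View(
    name="narrow_store_sales_3m",
    columns=frozenset({"ss_sold_date_sk", "ss_item_sk", "ss_ext_sales_price"}),
    min_date_sk=2452550,  # made-up cutoff standing in for "now minus 3 months"
)

print(pick_table(narrow, "store_sales", ["ss_item_sk", "ss_ext_sales_price"], 2452600))
print(pick_table(narrow, "store_sales", ["ss_quantity"], 2452600))
```

The first call stays on the narrow table; the second falls back to store_sales because ss_quantity was not kept in the view.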

Page 9: Quark Virtualization Engine for Analytics

Sorted Tables

• Sort on non-partitioned columns.
• For example, in TPC-DS, store_sales is partitioned by ss_sold_date_sk and sorted by ss_item_sk.

[Chart: queries q27, q3, q42, q52, q55, q7, q89, q98; legend: Base Tables, Denormalized, % Speedup; axis ticks 0-500 and 0-100]

Page 10: Quark Virtualization Engine for Analytics

Materialized View in Quark

create view store_sales_sorted as
  select * from store_sales
  where ss_sold_date_sk >= (julian_day(now() - 3 months))
  order by ss_sold_date_sk, ss_item_sk
stored in sorted_store_sales_3m;
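
One plausible reason a sorted copy helps: columnar formats such as ORC keep min/max statistics per block of rows, so a filter on the sort column can skip most blocks. The Python sketch below simulates that effect with synthetic data. The block size, value range and function names are made up for illustration, and whether the speedups reported in the talk come from this effect is an assumption.

```python
import random


def build_blocks(values, block_size):
    """Split values into fixed-size blocks and record each block's (min, max)."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    return [(min(b), max(b)) for b in blocks]


def blocks_to_read(block_stats, wanted):
    """Count blocks whose [min, max] range could contain the wanted value."""
    return sum(1 for lo, hi in block_stats if lo <= wanted <= hi)


random.seed(0)
# Synthetic stand-in for the ss_item_sk column of one partition.
item_sks = [random.randint(1, 1000) for _ in range(10_000)]

unsorted_reads = blocks_to_read(build_blocks(item_sks, 100), wanted=436)
sorted_reads = blocks_to_read(build_blocks(sorted(item_sks), 100), wanted=436)
print(f"blocks read unsorted: {unsorted_reads}, sorted: {sorted_reads}")
```

With random order almost every block's range contains the value, so nearly all blocks must be read; after sorting, only the one or two blocks around the value qualify.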

Page 11: Quark Virtualization Engine for Analytics

Denormalized Tables

• Join the store_sales and items tables in TPC-DS and store the result.
• Only star schema joins are supported.
• FK-PK joins only.

[Chart: queries q19, q3, q42, q43, q46, q52, q53, q55, q59, q63, q68, q7, q73, q79, q89, q98; legend: Unsorted, Sorted, % Speedup; axis ticks 0-1600 and 0-80]

Page 12: Quark Virtualization Engine for Analytics

Materialized View in Quark

create view store_sales_items_view as
  select * from store_sales
    join items on ss_item_sk = i_item_sk
  where ss_sold_date_sk >= (julian_day(now() - 3 months))
  order by ss_sold_date_sk, ss_item_sk
stored in sorted_store_sales_items_3m;
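
The idea of storing the join once can be tried end to end with SQLite from Python. This is a small sketch under stated assumptions: the rows are made up, the three-month filter is left out for brevity, and SQLite stands in for whatever engine actually holds the data. It only shows that an aggregate over the pre-joined table matches the same aggregate computed with the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table items (i_item_sk integer primary key, i_brand text);
    create table store_sales (ss_sold_date_sk integer, ss_item_sk integer,
                              ss_ext_sales_price real);
    insert into items values (1, 'brand_a'), (2, 'brand_b');
    insert into store_sales values
        (2452610, 1, 10.0), (2452611, 1, 5.0), (2452612, 2, 7.5);

    -- Denormalized copy: the FK-PK join is computed once and stored.
    create table sorted_store_sales_items_3m as
        select * from store_sales join items on ss_item_sk = i_item_sk
        order by ss_sold_date_sk, ss_item_sk;
""")

join_sql = """select i_brand, sum(ss_ext_sales_price)
              from store_sales join items on ss_item_sk = i_item_sk
              group by i_brand order by i_brand"""
flat_sql = """select i_brand, sum(ss_ext_sales_price)
              from sorted_store_sales_items_3m
              group by i_brand order by i_brand"""

join_rows = conn.execute(join_sql).fetchall()
flat_rows = conn.execute(flat_sql).fetchall()
print(join_rows == flat_rows, flat_rows)
```

The second query never touches items, which is the work a denormalized table saves at query time.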

Page 13: Quark Virtualization Engine for Analytics

OLAP Cubes

• Cubes are stored in a table.
• Cubes can be built on partial data, for example 3 months.
• Incremental cubes.

create cube store_sales_cube as
  select sum( … ), …
  from store_sales
    join items on ss_item_sk = i_item_sk
    join …
  where ss_sold_date_sk >= (julian_day(now() - 3 months))
  group by i_item_sk, dd_year, …
stored in sorted_store_sales_cube_3m;
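
A cube stored in a table is, in essence, a pre-aggregated table that coarser queries can roll up from. The sketch below illustrates this with SQLite from Python. The data is made up, and d_year is placed directly on store_sales as a simplification so that no date dimension join is needed. It does not use Quark's create cube DDL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table store_sales (ss_item_sk integer, d_year integer,
                              ss_ext_sales_price real);
    insert into store_sales values
        (1, 2001, 10.0), (1, 2001, 5.0), (1, 2002, 2.0), (2, 2002, 7.5);

    -- The "cube": one pre-aggregated row per (item, year) combination.
    create table store_sales_cube as
        select ss_item_sk, d_year,
               sum(ss_ext_sales_price) as sum_price, count(*) as row_count
        from store_sales group by ss_item_sk, d_year;
""")

# Yearly totals computed from the base table ...
base = conn.execute(
    """select d_year, sum(ss_ext_sales_price)
       from store_sales group by d_year order by d_year"""
).fetchall()
# ... and the same answer rolled up from the smaller cube table.
cube = conn.execute(
    """select d_year, sum(sum_price)
       from store_sales_cube group by d_year order by d_year"""
).fetchall()
print(base, cube)
```

Sums roll up cleanly because they are additive; keeping row_count alongside the sum allows averages to be derived from the cube as well.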

Page 14: Quark Virtualization Engine for Analytics

Bring your own Storage & SQL Engine

• Quark supports multiple technologies.
• Views or cubes can span databases.
  – Store your cube in Redshift, HBase or Elasticsearch.
• Redirect your lookup queries to Apache HBase.

Page 15: Quark Virtualization Engine for Analytics

Predicate Injection

Table store_sales partitioned by year, month

select ....
from date_dim dt, store_sales, item
where ....
  -- partition key filters
  and (ss_sold_date_sk between 2451149 and 2451179
    or ss_sold_date_sk between 2451514 and 2451544
    or ss_sold_date_sk between 2451880 and 2451910
    or ss_sold_date_sk between 2452245 and 2452275
    or ss_sold_date_sk between 2452610 and 2452640)
  ....

-- Inject predicate year between 1998 and 2002 and month in (11, 12)
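
How a partition predicate could be derived from the date-key ranges can be sketched in a few lines of Python. The sketch assumes ss_sold_date_sk is a Julian day number, as the julian_day calls elsewhere in the deck suggest; the function names are hypothetical and this is not Quark's implementation.

```python
from datetime import date

# Offset between a Julian day number and Python's proleptic Gregorian ordinal
# (Julian day 2451545 is 2000-01-01).
JDN_TO_ORDINAL = 1721425


def julian_to_date(jdn):
    """Convert an integer Julian day number to a calendar date."""
    return date.fromordinal(jdn - JDN_TO_ORDINAL)


def partition_predicate(ranges):
    """Derive year/month bounds that cover every [start, end] Julian-day range."""
    years, months = set(), set()
    for start, end in ranges:
        for jdn in range(start, end + 1):
            d = julian_to_date(jdn)
            years.add(d.year)
            months.add(d.month)
    return min(years), max(years), sorted(months)


ranges = [(2451149, 2451179), (2451514, 2451544), (2451880, 2451910),
          (2452245, 2452275), (2452610, 2452640)]
lo, hi, months = partition_predicate(ranges)
print(f"year between {lo} and {hi} and month in ({', '.join(map(str, months))})")
```

Under the Julian-day assumption all five ranges fall in December of 1998 through 2002, so the sketch derives a tighter month list than the (11, 12) shown on the slide. Any predicate that covers the original ranges is safe to inject, since it only prunes partitions and the original filter still applies.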

Page 16: Quark Virtualization Engine for Analytics

Apache Kylin and Apache Lens Comparison

● Quark supports many optimized storage structures
  ○ Materialized Views
  ○ Predicate Injections
● Quark encourages a mix of storage and SQL engines (Apache Kylin)
● ANSI SQL (Apache Lens)
● DDL Statements
● No UI/API or Web Services. JDBC Server/Client only.

Page 17: Quark Virtualization Engine for Analytics

Architecture

[Diagram. Labels: JDBC Client; Quark Server; Catalog; Hive; DWH; K-V Store; Catalog; Optimizer; Execution Engine; MV and Cube Definitions]

Avatica + Protobuf API: Get Catalog. Execute Queries.

Page 18: Quark Virtualization Engine for Analytics

Contributions to Apache Calcite

Materialized Views
[CALCITE-749] Add MaterializationService.TableFactory
[CALCITE-786] Detect if materialized view can be used to rewrite a query in non-trivial cases
[CALCITE-787] Star table wrongly assigned to materialized view
[CALCITE-925] Match materialized views when predicates contain strings and ranges

OLAP Cubes
[CALCITE-758] Use more than one lattice in the same query

Cost Based Optimizer
[CALCITE-1003] Utility to convert RelNode to SQL
[CALCITE-1010] FETCH/LIMIT and OFFSET in RelToSqlConverter
[CALCITE-1109] Fix up condition when pushing Filter through Aggregate
[CALCITE-1130] Add support for operators IS_NULL and IS_NOT_NULL in RexImplicationChecker
[CALCITE-1216] Rule to convert Filter-on-Scan to materialized view

Page 19: Quark Virtualization Engine for Analytics

Quark as a Service

1. Register DBs as DbTaps

2. Submit QuarkCommand

Account Info including DbTaps

Page 20: Quark Virtualization Engine for Analytics

Roadmap

• Optimizer
  – Materialized Views and Joins.
  – Statistics: choose among MVs or SQL engines.
• Multi-Store
  – SQL Dialects
  – JIT Function definitions
  – Query Life Cycle & Management
• ETL
  – Integrate with workflow engines like Apache Oozie or Airflow.

Page 21: Quark Virtualization Engine for Analytics

Co-ordinates

Github: https://github.com/qubole/quark/

Mailing List: [email protected]

Subscribe: [email protected]

Unsubscribe: [email protected]

Gitter: https://gitter.im/qubole/quark