Quark Virtualization Engine for Analytics
Rajat Venkatesh, Qubole, @vrajat
Quark
Agenda
• Motivation
• Use Cases
• Architecture
• Roadmap
Data @Qubole
api.qubole.com
Monitoring & Alerts
Business Analysts
Workload Analysis
Customer Clusters
Amazon RDS
Amazon S3
Multi-Store Architecture
Embedded Thin JDBC JAR Quark Server
Quark Catalog
Laptop or Server
Amazon Redshift
Narrow Tables
TPCDS dataset, ~3 billion rows, ORC, Presto 0.119
Q3 referenced 3 attributes from store_sales
[Chart: Q3; axis ticks 0 to 50 and 0 to 250; label: String (512 Bytes)]
Narrow Tables
Table      No. of Queries  Total Columns  Columns Used
Tickets    25000+          265            74
Customers  10000+          53             43
Support    6000+           33             10
TPCDS q3.sql

select dt.d_year, item.i_brand_id brand_id, item.i_brand brand,
       sum(ss_ext_sales_price) sum_agg
from store_sales, item, date_dim dt
where dt.d_date_sk = store_sales.ss_sold_date_sk
  and store_sales.ss_item_sk = item.i_item_sk
  and item.i_manufact_id = 436
  and dt.d_moy = 12
  -- partition key filters
  and (ss_sold_date_sk between 2451149 and 2451179
    or ss_sold_date_sk between 2451514 and 2451544
    or ss_sold_date_sk between 2451880 and 2451910
    or ss_sold_date_sk between 2452245 and 2452275
    or ss_sold_date_sk between 2452610 and 2452640)
group by dt.d_year, item.i_brand, item.i_brand_id
order by dt.d_year, sum_agg desc, brand_id
limit 100;

create table narrow_store_sales_3m as
select ss_sold_date_sk, ss_item_sk, ss_ext_sales_price
from store_sales
where ss_sold_date_sk >= (julian_day(now() - 3 months));
Materialized View in Quark

create view store_sales_view as
select ss_sold_date_sk, ss_item_sk, ss_ext_sales_price
from store_sales
where ss_sold_date_sk >= (julian_day(now() - 3 months))
stored in narrow_store_sales_3m;
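The narrow-table view can be read as a coverage check: a query may be redirected to the stored table when every store_sales column it references is kept in the view and its date filter falls inside the view's range. The following is a minimal Python sketch of that idea only, not Quark's actual matching logic. The class, the function, and the cutoff value are hypothetical; the column names are the three store_sales attributes that Q3 references.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MaterializedView:
    name: str            # logical view name
    stored_in: str       # physical table that holds the rows
    columns: frozenset   # columns kept in the narrow table
    min_date_sk: int     # smallest ss_sold_date_sk kept in the view


def rewrite_target(query_columns, query_min_date_sk, view, base_table="store_sales"):
    """Return the table a query should read from (hypothetical helper)."""
    covers_columns = set(query_columns) <= view.columns
    covers_rows = query_min_date_sk >= view.min_date_sk
    return view.stored_in if covers_columns and covers_rows else base_table


view = MaterializedView(
    name="store_sales_view",
    stored_in="narrow_store_sales_3m",
    columns=frozenset({"ss_sold_date_sk", "ss_item_sk", "ss_ext_sales_price"}),
    min_date_sk=2452550,  # illustrative cutoff, not taken from the slides
)
q3_columns = ["ss_sold_date_sk", "ss_item_sk", "ss_ext_sales_price"]

# Covered by the view: read the narrow table.
print(rewrite_target(q3_columns, 2452610, view))
# Needs a column the view dropped: fall back to the base table.
print(rewrite_target(q3_columns + ["ss_quantity"], 2452610, view))
# Needs rows older than the view keeps: fall back to the base table.
print(rewrite_target(q3_columns, 2451149, view))
```

A real optimizer would also have to handle disjunctive filters like the ones in Q3, which is why this sketch only compares a single lower bound.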
Sorted Tables
• Sort on non-partitioned columns.
• For example, in TPCDS, store_sales is partitioned by ss_sold_date_sk and sorted by ss_item_sk.
[Chart: queries q27, q3, q42, q52, q55, q7, q89, q98; series: Base Tables, Denormalized, % Speedup]
Materialized View in Quark

create view store_sales_sorted as
select * from store_sales
where ss_sold_date_sk >= (julian_day(now() - 3 months))
order by ss_sold_date_sk, ss_item_sk
stored in sorted_store_sales_3m;
Denormalized Tables
• Join store_sales with the item table in TPCDS and store the result.
• Only star schema joins supported.
• FK-PK joins only.
[Chart: queries q19, q3, q42, q43, q46, q52, q53, q55, q59, q63, q68, q7, q73, q79, q89, q98; series: Unsorted, Sorted, % Speedup]
Materialized View in Quark

create view store_sales_items_view as
select * from store_sales join item on ss_item_sk = i_item_sk
where ss_sold_date_sk >= (julian_day(now() - 3 months))
order by ss_sold_date_sk, ss_item_sk
stored in sorted_store_sales_items_3m;
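The effect of a denormalized table can be illustrated on toy data: once the FK-PK join between store_sales and the TPCDS item table is stored, a query that needs both tables can read the stored result and skip the join. This is a minimal sketch using Python's built-in sqlite3 module as a stand-in engine. The rows are invented for illustration, and plain `create table ... as` is used instead of Quark's view DDL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table item (i_item_sk integer primary key, i_brand text);
    create table store_sales (ss_sold_date_sk integer, ss_item_sk integer,
                              ss_ext_sales_price real);
    insert into item values (1, 'brand_a'), (2, 'brand_b');
    insert into store_sales values
        (2452610, 1, 10.0), (2452611, 2, 5.0), (2452612, 1, 2.5);
    -- FK-PK join stored once, ordered by the sort columns used on the slide
    create table sorted_store_sales_items_3m as
        select * from store_sales join item on ss_item_sk = i_item_sk
        order by ss_sold_date_sk, ss_item_sk;
""")

# Original form: the join is evaluated at query time.
with_join = conn.execute("""
    select i_brand, sum(ss_ext_sales_price)
    from store_sales join item on ss_item_sk = i_item_sk
    group by i_brand order by i_brand
""").fetchall()

# Rewritten form: the same answer read from the denormalized table.
without_join = conn.execute("""
    select i_brand, sum(ss_ext_sales_price)
    from sorted_store_sales_items_3m
    group by i_brand order by i_brand
""").fetchall()

print(with_join == without_join)
```

The rewrite is only safe because each sales row matches exactly one item row, which is the reason the slide restricts the feature to FK-PK joins.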
OLAP Cubes
• Cubes are stored in a table
• Cubes on partial data, for example 3 months
• Incremental Cubes

create cube store_sales_cube as
select sum( … ), …
from store_sales join item on ss_item_sk = i_item_sk join …
where ss_sold_date_sk >= (julian_day(now() - 3 months))
group by i_item_sk, d_year, …
stored in sorted_store_sales_cube_3m;
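Storing a cube in a table means a coarser aggregate can be answered by re-aggregating the stored sums instead of scanning the fact table. The sketch below shows that roll-up on toy data with Python's sqlite3 module. The rows and the cube layout (a sum and a count per item and year) are assumptions for illustration, and plain `create table ... as` stands in for the `create cube` statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table date_dim (d_date_sk integer primary key, d_year integer);
    create table store_sales (ss_sold_date_sk integer, ss_item_sk integer,
                              ss_ext_sales_price real);
    insert into date_dim values (1, 2001), (2, 2001), (3, 2002);
    insert into store_sales values
        (1, 10, 4.0), (2, 10, 6.0), (2, 11, 1.0), (3, 10, 8.0);
    -- Cube: one pre-aggregated row per (item, year)
    create table store_sales_cube as
        select ss_item_sk, d_year,
               sum(ss_ext_sales_price) as sum_price, count(*) as cnt
        from store_sales join date_dim on ss_sold_date_sk = d_date_sk
        group by ss_item_sk, d_year;
""")

# Yearly totals computed from the base tables.
from_base = conn.execute("""
    select d_year, sum(ss_ext_sales_price)
    from store_sales join date_dim on ss_sold_date_sk = d_date_sk
    group by d_year order by d_year
""").fetchall()

# The same totals rolled up from the cube: a sum of sums, no join needed.
from_cube = conn.execute("""
    select d_year, sum(sum_price)
    from store_sales_cube
    group by d_year order by d_year
""").fetchall()

print(from_base == from_cube)
```

Sums and counts roll up this way; measures such as distinct counts do not, so a cube has to store whatever each measure needs for re-aggregation.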
Bring your own Storage & SQL Engine
• Quark supports multiple technologies.
• Views or Cubes can span databases
  – Store your cube in Redshift or HBase or Elasticsearch
• Redirect your lookup queries to Apache HBase
Predicate Injection
Table store_sales partitioned by year, month

select ....
from date_dim dt, store_sales, item
where ....
  -- partition key filters
  and (ss_sold_date_sk between 2451149 and 2451179
    or ss_sold_date_sk between 2451514 and 2451544
    or ss_sold_date_sk between 2451880 and 2451910
    or ss_sold_date_sk between 2452245 and 2452275
    or ss_sold_date_sk between 2452610 and 2452640)
  ....
  -- Inject predicate: year between 1998 and 2002 and month in (11, 12)
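A predicate on the partition columns can be derived from the ss_sold_date_sk ranges by converting each range to calendar dates and collecting the years and months it touches. The sketch below assumes that ss_sold_date_sk is a Julian day number, as the julian_day() calls elsewhere in the deck suggest, and uses the standard anchor that Julian day 2451545 is 2000-01-01. The function names are hypothetical and this is not Quark's implementation.

```python
from datetime import date, timedelta

# Anchor: Julian day number 2451545 corresponds to 2000-01-01.
JDN_ANCHOR = 2451545
DATE_ANCHOR = date(2000, 1, 1)


def jdn_to_date(jdn):
    """Convert a Julian day number to a calendar date."""
    return DATE_ANCHOR + timedelta(days=jdn - JDN_ANCHOR)


def partition_predicate(ranges):
    """Derive a year/month predicate implied by inclusive date_sk ranges."""
    years, months = set(), set()
    for lo, hi in ranges:
        day, end = jdn_to_date(lo), jdn_to_date(hi)
        while day <= end:
            years.add(day.year)
            months.add(day.month)
            day += timedelta(days=1)
    month_list = ", ".join(str(m) for m in sorted(months))
    return f"year between {min(years)} and {max(years)} and month in ({month_list})"


ranges = [
    (2451149, 2451179),
    (2451514, 2451544),
    (2451880, 2451910),
    (2452245, 2452275),
    (2452610, 2452640),
]
print(partition_predicate(ranges))
```

Under that assumption the five ranges each cover December of 1998 through 2002, so the sketch derives month in (12). A wider set such as the slide's (11, 12) is also safe, because an injected predicate only has to be implied by the original filter; it prunes partitions without changing the result.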
Apache Kylin and Apache Lens Comparison
● Quark supports many optimized storage structures
  ○ Materialized Views
  ○ Predicate Injections
● Quark encourages a mix of storage and SQL Engines (Apache Kylin)
● ANSI SQL (Apache Lens)
● DDL Statements
● No UI/API or Web Services. JDBC Server/Client only.
Architecture
JDBC Client
Quark Server
Catalog
Hive
DWH
K-V Store
Catalog
Optimizer
Execution Engine
MV and Cube Definitions
Avatica + Protobuf API
Get Catalog.
Execute Queries.
Contributions to Apache Calcite
Materialized Views
[CALCITE-749] Add MaterializationService.TableFactory
[CALCITE-786] Detect if materialized view can be used to rewrite a query in non-trivial cases
[CALCITE-787] Star table wrongly assigned to materialized view
[CALCITE-925] Match materialized views when predicates contain strings and ranges
OLAP Cubes
[CALCITE-758] Use more than one lattice in the same query
Cost Based Optimizer
[CALCITE-1003] Utility to convert RelNode to SQL
[CALCITE-1010] FETCH/LIMIT and OFFSET in RelToSqlConverter
[CALCITE-1109] Fix up condition when pushing Filter through Aggregate
[CALCITE-1130] Add support for operators IS_NULL and IS_NOT_NULL in RexImplicationChecker
[CALCITE-1216] Rule to convert Filter-on-Scan to materialized view
Quark as a Service
1. Register DBs as DbTaps
2. Submit QuarkCommand
Account Info including DbTaps
Roadmap
• Optimizer
  – Materialized Views and Joins.
  – Statistics - Choose among MVs or SQL engines.
• Multi-Store
  – SQL Dialects
  – JIT Function definitions
  – Query Life Cycle & Management
• ETL
  – Integrate with Workflow engines like Apache Oozie or Airflow.
Co-ordinates
GitHub: https://github.com/qubole/quark/
Mailing List: [email protected]
Subscribe: [email protected]
Unsubscribe: [email protected]
Gitter: https://gitter.im/qubole/quark