Introduction to Presto at Treasure Data
![Page 2: Introduction to Presto at Treasure Data](https://reader034.vdocuments.net/reader034/viewer/2022042520/58e85ec51a28ab007c8b487f/html5/thumbnails/2.jpg)
How do we make SQL scalable?
• Problem
  • Count access logs of each web page:
    • SELECT page, count(*) FROM weblog GROUP BY page
• A Challenge
  • How do you process millions of records in a second?
  • Making SQL scalable enough to handle large data sets
![Page 3: Introduction to Presto at Treasure Data](https://reader034.vdocuments.net/reader034/viewer/2022042520/58e85ec51a28ab007c8b487f/html5/thumbnails/3.jpg)
Hive
• Translate SQL into MapReduce (Hadoop) programs
• MapReduce:
  • Does the same job by using many machines
[Figure: Single CPU Job (A → B) vs. Distributed Processing (HDFS → split → map → reduce → merge → HDFS, with A divided into A0, A1, A2 and B into B0, B1, B2, B3)]
![Page 4: Introduction to Presto at Treasure Data](https://reader034.vdocuments.net/reader034/viewer/2022042520/58e85ec51a28ab007c8b487f/html5/thumbnails/4.jpg)
SQL to MapReduce
• Mapping SQL stages into a MapReduce program
  • SELECT page, count(*) FROM weblog GROUP BY page
[Figure: HDFS → split → map → reduce → merge → HDFS, annotated with TableScan(weblog) → GroupBy(hash(page)) → count(weblog of a page) → result]
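The mapping from SQL stages to MapReduce steps can be sketched in a few lines of Python. This is only a single-process illustration: the function names (`split`, `map_task`, `shuffle`, `reduce_task`) and the sample `weblog` rows are assumptions made for this example, not Hive's actual implementation, and a real job would write to HDFS between steps.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def split(records: List[dict], n_splits: int) -> List[List[dict]]:
    """Divide the input into roughly equal splits, one per map task."""
    return [records[i::n_splits] for i in range(n_splits)]


def map_task(records: Iterable[dict]) -> List[Tuple[str, int]]:
    """TableScan(weblog): emit a (page, 1) pair for every log record."""
    return [(r["page"], 1) for r in records]


def shuffle(pairs: Iterable[Tuple[str, int]], n_reducers: int) -> List[Dict[str, List[int]]]:
    """GroupBy(hash(page)): route each key to one reducer by its hash."""
    buckets: List[Dict[str, List[int]]] = [defaultdict(list) for _ in range(n_reducers)]
    for page, one in pairs:
        buckets[hash(page) % n_reducers][page].append(one)
    return buckets


def reduce_task(bucket: Dict[str, List[int]]) -> Dict[str, int]:
    """count(weblog of a page): sum the ones collected for each page."""
    return {page: sum(ones) for page, ones in bucket.items()}


def count_by_page(weblog: List[dict], n_splits: int = 3, n_reducers: int = 2) -> Dict[str, int]:
    mapped = [pair for part in split(weblog, n_splits) for pair in map_task(part)]
    result: Dict[str, int] = {}
    for bucket in shuffle(mapped, n_reducers):
        result.update(reduce_task(bucket))  # merge: keys are disjoint across reducers
    return result


# Made-up sample data standing in for the weblog table.
weblog = [{"page": p} for p in ["/a", "/b", "/a", "/c", "/a", "/b"]]
counts = count_by_page(weblog)
```

Because every page is hashed to exactly one reducer, the final merge only has to combine disjoint results.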
![Page 5: Introduction to Presto at Treasure Data](https://reader034.vdocuments.net/reader034/viewer/2022042520/58e85ec51a28ab007c8b487f/html5/thumbnails/5.jpg)
HDFS is the bottleneck
• HDFS (Hadoop Distributed File System)
  • Used for storing intermediate results
  • Provides fault-tolerance, but slow
[Figure: the same MapReduce pipeline as on the previous slide (HDFS → split → map → reduce → merge → HDFS, annotated with TableScan(weblog) → GroupBy(hash(page)) → count(weblog of a page) → result)]
![Page 6: Introduction to Presto at Treasure Data](https://reader034.vdocuments.net/reader034/viewer/2022042520/58e85ec51a28ab007c8b487f/html5/thumbnails/6.jpg)
Presto
• Distributed query engine developed by Facebook
• Uses HTTP for data transfer
• No intermediate storage like HDFS
• No fault-tolerance (but the failure rate is less than 0.2%)
• Pipelining data transfer and data processing
[Figure: split → map → reduce → merge pipeline without HDFS, annotated with TableScan(weblog) → GroupBy(hash(page)) → count(weblog of a page) → result]
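Pipelining can be illustrated with Python generators: each operator hands a row to the next operator as soon as it is produced, so nothing is written to intermediate storage between stages. This is a minimal single-process sketch, not Presto's implementation; the `status` column, the operator names, and the `events` log are assumptions added for illustration.

```python
from typing import Dict, Iterable, Iterator, List

events: List[str] = []  # records the order in which operators touch rows


def table_scan(rows: Iterable[dict]) -> Iterator[dict]:
    """Produce rows one at a time instead of materializing the whole table."""
    for row in rows:
        events.append(f"scan {row['page']}")
        yield row


def filter_rows(rows: Iterable[dict], status: int) -> Iterator[dict]:
    """Pass matching rows downstream as soon as they arrive."""
    for row in rows:
        if row["status"] == status:
            events.append(f"filter-pass {row['page']}")
            yield row


def count_by_page(rows: Iterable[dict]) -> Dict[str, int]:
    """Consume the stream and keep only the running counts in memory."""
    counts: Dict[str, int] = {}
    for row in rows:
        counts[row["page"]] = counts.get(row["page"], 0) + 1
    return counts


# Made-up sample data; the status column is hypothetical.
weblog = [
    {"page": "/a", "status": 200},
    {"page": "/b", "status": 404},
    {"page": "/a", "status": 200},
]

counts = count_by_page(filter_rows(table_scan(weblog), status=200))
```

The `events` list shows the interleaving: the first row is scanned and filtered before the second row is scanned, which is the opposite of a stage-wise engine that finishes one stage completely before starting the next.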
![Page 7: Introduction to Presto at Treasure Data](https://reader034.vdocuments.net/reader034/viewer/2022042520/58e85ec51a28ab007c8b487f/html5/thumbnails/7.jpg)
Architecture Comparison

| | Hive | Presto | Spark | BigQuery |
|---|---|---|---|---|
| Performance | Slow | Fast | Fast | Ultra Fast (using many disks) |
| Intermediate Storage | HDFS | None | Memory/Disk | Colossus (?) |
| Data Transfer | HTTP | HTTP | HTTP | ? |
| Query Execution | Stage-wise MapReduce | Run all stages at once (pipelining) | Stage-wise | ? |
| Fault Tolerance | Yes | None (but TD will retry the query from scratch) | Yes, but limited | ? |
| Multiple Job Support | Good (can handle many jobs) | Limited (~5 concurrent queries per account in TD) | Requires another resource manager (e.g. YARN, Mesos) | Limited (query queue) |
![Page 8: Introduction to Presto at Treasure Data](https://reader034.vdocuments.net/reader034/viewer/2022042520/58e85ec51a28ab007c8b487f/html5/thumbnails/8.jpg)
Presto Usage Stats
• More than 99.8% of queries finish without any error
• About 90% of queries finish within 1 minute
• Treasure Data Presto stats
  • Processing more than 100,000 queries / day
  • Processing 15 trillion records / day
• Facebook's stats:
  • 30,000~100,000 queries / day
  • 1 trillion records / day
• Treasure Data is the No.1 Presto user in the world
![Page 9: Introduction to Presto at Treasure Data](https://reader034.vdocuments.net/reader034/viewer/2022042520/58e85ec51a28ab007c8b487f/html5/thumbnails/9.jpg)
Presto can process more than 1M rows/sec.
![Page 10: Introduction to Presto at Treasure Data](https://reader034.vdocuments.net/reader034/viewer/2022042520/58e85ec51a28ab007c8b487f/html5/thumbnails/10.jpg)
Presto Overview
• A distributed SQL engine developed by Facebook
  • For interactive analysis on peta-scale data sets
  • As a replacement for Hive
  • Nov. 2013: Open sourced on GitHub
  • Facebook now has 12 engineers working on Presto
• Code
  • In-memory query engine, written in Java
  • Based on ANSI SQL syntax
  • Isolating the query execution layer and the storage access layer
• Connector provides data access methods
  • Cassandra / Hive / JMX / Kafka / MySQL / PostgreSQL / MongoDB / System / TPCH connectors
  • td-presto is our connector to access PlazmaDB (Columnar MessagePack Database)
![Page 11: Introduction to Presto at Treasure Data](https://reader034.vdocuments.net/reader034/viewer/2022042520/58e85ec51a28ab007c8b487f/html5/thumbnails/11.jpg)
Architectural overview
https://prestodb.io/overview.html
With Hive connector
![Page 12: Introduction to Presto at Treasure Data](https://reader034.vdocuments.net/reader034/viewer/2022042520/58e85ec51a28ab007c8b487f/html5/thumbnails/12.jpg)
Presto Users
• Facebook
![Page 13: Introduction to Presto at Treasure Data](https://reader034.vdocuments.net/reader034/viewer/2022042520/58e85ec51a28ab007c8b487f/html5/thumbnails/13.jpg)
• Dropbox
![Page 14: Introduction to Presto at Treasure Data](https://reader034.vdocuments.net/reader034/viewer/2022042520/58e85ec51a28ab007c8b487f/html5/thumbnails/14.jpg)
• Airbnb
![Page 15: Introduction to Presto at Treasure Data](https://reader034.vdocuments.net/reader034/viewer/2022042520/58e85ec51a28ab007c8b487f/html5/thumbnails/15.jpg)
Interactive Analysis with TD Presto + Jupyter
• https://github.com/treasure-data/td-jupyter-notebooks/blob/master/imported/pandas-td-tutorial.ipynb
![Page 16: Introduction to Presto at Treasure Data](https://reader034.vdocuments.net/reader034/viewer/2022042520/58e85ec51a28ab007c8b487f/html5/thumbnails/16.jpg)
Presto Internal
Query Execution
![Page 17: Introduction to Presto at Treasure Data](https://reader034.vdocuments.net/reader034/viewer/2022042520/58e85ec51a28ab007c8b487f/html5/thumbnails/17.jpg)
Presto Architecture
[Figure: a Query runs as Stage 2 → Stage 1 → Stage 0. Each stage consists of tasks (Task 2.0 to 2.2, Task 1.0 to 1.2, Task 0.0) that process Splits on workers (@worker#0, @worker#2, @worker#3). Operator labels: TableScan (FROM), Aggregation (GROUP BY), Output.]
![Page 18: Introduction to Presto at Treasure Data](https://reader034.vdocuments.net/reader034/viewer/2022042520/58e85ec51a28ab007c8b487f/html5/thumbnails/18.jpg)
Logical Query Plan
Output[nationkey, _col1] => [nationkey:bigint, count:bigint]
    - _col1 := count
  Exchange[GATHER] => nationkey:bigint, count:bigint
    Aggregate(FINAL)[nationkey] => [nationkey:bigint, count:bigint]
        - count := "count"("count_15")
      Exchange[REPARTITION] => nationkey:bigint, count_15:bigint
        Aggregate(PARTIAL)[nationkey] => [nationkey:bigint, count_15:bigint]
            - count_15 := "count"("expr")
          Project => [nationkey:bigint, expr:bigint]
              - expr := 1
            InnerJoin[("custkey" = "custkey_0")] => [custkey:bigint, custkey_0:bigint, nationkey:bigint]
              Project => [custkey:bigint]
                Filter[("orderpriority" = '1-URGENT')] => [custkey:bigint, orderpriority:varchar]
                  TableScan[tpch:tpch:orders:sf0.01, original constraint= ('1-URGENT' = "orderpriority")] => [custkey:bigint, orderpriority:varchar]
                      - custkey := tpch:custkey:1
                      - orderpriority := tpch:orderpriority:5
              Exchange[REPLICATE] => custkey_0:bigint, nationkey:bigint
                TableScan[tpch:tpch:customer:sf0.01, original constraint=true] => [custkey_0:bigint, nationkey:bigint]
                    - custkey_0 := tpch:custkey:0
                    - nationkey := tpch:nationkey:3
select c.nationkey, count(1)
from orders o join customer c on o.custkey = c.custkey
where o.orderpriority = '1-URGENT'
group by c.nationkey
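The exchange operators in this plan can be read as data movement between workers. The following Python sketch walks through the same steps in one process: REPLICATE broadcasts the customer table, each worker computes a partial count, REPARTITION routes partial counts by nationkey, and GATHER collects the final rows. The sample rows, `N_WORKERS`, and the function name are assumptions made for illustration; this is not how Presto is implemented internally.

```python
from collections import Counter, defaultdict
from typing import Dict, List

# Toy stand-ins for the TPC-H tables; the rows are made up for illustration.
orders = [
    {"custkey": 1, "orderpriority": "1-URGENT"},
    {"custkey": 2, "orderpriority": "1-URGENT"},
    {"custkey": 1, "orderpriority": "5-LOW"},
    {"custkey": 3, "orderpriority": "1-URGENT"},
    {"custkey": 2, "orderpriority": "1-URGENT"},
]
customer = [
    {"custkey": 1, "nationkey": 10},
    {"custkey": 2, "nationkey": 20},
    {"custkey": 3, "nationkey": 10},
]

N_WORKERS = 2  # hypothetical cluster size


def scan_filter_join_partial(order_split: List[dict], customers: Dict[int, int]) -> Counter:
    """One worker: Filter -> InnerJoin -> Project(expr := 1) -> Aggregate(PARTIAL)."""
    partial: Counter = Counter()
    for o in order_split:
        if o["orderpriority"] != "1-URGENT":
            continue
        nationkey = customers.get(o["custkey"])
        if nationkey is not None:  # inner join: drop orders with no matching customer
            partial[nationkey] += 1
    return partial


# Exchange[REPLICATE]: every worker receives the full customer table.
replicated = {c["custkey"]: c["nationkey"] for c in customer}

# Each worker scans its own split of the orders table.
order_splits = [orders[i::N_WORKERS] for i in range(N_WORKERS)]
partials = [scan_filter_join_partial(s, replicated) for s in order_splits]

# Exchange[REPARTITION]: route partial counts by nationkey so that all
# partial counts of one nation reach the same worker.
repartitioned: List[Dict[int, List[int]]] = [defaultdict(list) for _ in range(N_WORKERS)]
for partial in partials:
    for nationkey, count_15 in partial.items():
        repartitioned[nationkey % N_WORKERS][nationkey].append(count_15)

# Aggregate(FINAL): sum the partial counts per nation on each worker.
finals = [{k: sum(v) for k, v in part.items()} for part in repartitioned]

# Exchange[GATHER] + Output: collect all final rows in one place.
result = {k: v for part in finals for k, v in part.items()}
```

Splitting the aggregation into PARTIAL and FINAL steps means only one small row per nation and worker crosses the network, rather than one row per joined order.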
![Page 19: Introduction to Presto at Treasure Data](https://reader034.vdocuments.net/reader034/viewer/2022042520/58e85ec51a28ab007c8b487f/html5/thumbnails/19.jpg)
[Same logical query plan and query as on the previous slide, annotated with: Stage 3 (Table Scan)]
![Page 20: Introduction to Presto at Treasure Data](https://reader034.vdocuments.net/reader034/viewer/2022042520/58e85ec51a28ab007c8b487f/html5/thumbnails/20.jpg)
[Same logical query plan and query as on the previous slide, annotated with: Stage 3, Stage 2. Caption: Logical Plan Optimization]
![Page 21: Introduction to Presto at Treasure Data](https://reader034.vdocuments.net/reader034/viewer/2022042520/58e85ec51a28ab007c8b487f/html5/thumbnails/21.jpg)
[Same logical query plan and query as on the previous slide, annotated with: Stage 3, Stage 2, Stage 1]
![Page 22: Introduction to Presto at Treasure Data](https://reader034.vdocuments.net/reader034/viewer/2022042520/58e85ec51a28ab007c8b487f/html5/thumbnails/22.jpg)
[Same logical query plan and query as on the previous slide, annotated with: Stage 3, Stage 2, Stage 1, Stage 0. Caption: Output Query Results (JSON)]
![Page 23: Introduction to Presto at Treasure Data](https://reader034.vdocuments.net/reader034/viewer/2022042520/58e85ec51a28ab007c8b487f/html5/thumbnails/23.jpg)
TD Storage Architecture
[Figure: incoming logs land as many small log files in Real-Time Storage. A log merge job (Hadoop MapReduce) merges them into 1-hour partitions (2015-09-29 01:00:00, 2015-09-29 02:00:00, 2015-09-29 03:00:00, …) in Archive Storage, using time column-based partitioning. Hive and Presto (Distributed SQL Query Engine) query the storage.]
![Page 24: Introduction to Presto at Treasure Data](https://reader034.vdocuments.net/reader034/viewer/2022042520/58e85ec51a28ab007c8b487f/html5/thumbnails/24.jpg)
Utilizing Time Index
• Data is stored in 1-hour partitions (time column-based partitioning): 2015-09-29 01:00:00, 2015-09-29 02:00:00, 2015-09-29 03:00:00, …
• Partial Scan: TD_TIME_RANGE(time, '2015-09-29 02:00:00', '2015-09-29 03:00:00')
  • Hive/Presto reads only the matching 1-hour partitions to produce the query results
• Full Scan: TD_TIME_RANGE(non_time_column, '2015-09-29 02:00:00', '2015-09-29 03:00:00')
  • Scanning the whole data set
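The difference between a partial scan and a full scan can be sketched as partition pruning. In this Python illustration, the storage layout (a dict keyed by partition start time), the `updated_at` column, and both function names are hypothetical, and the half-open range [start, end) is an assumption about how the range condition is evaluated.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

HOUR = timedelta(hours=1)


def ts(text: str) -> datetime:
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


def row(time_text: str, page: str) -> dict:
    # updated_at is a hypothetical non-partitioning column holding the same value.
    return {"time": ts(time_text), "updated_at": ts(time_text), "page": page}


# Hypothetical storage layout: partition start time -> rows in that hour.
partitions: Dict[datetime, List[dict]] = {
    ts("2015-09-29 01:00:00"): [row("2015-09-29 01:10:00", "/a")],
    ts("2015-09-29 02:00:00"): [row("2015-09-29 02:30:00", "/b")],
    ts("2015-09-29 03:00:00"): [row("2015-09-29 03:05:00", "/c")],
}


def scan_time_range(start: datetime, end: datetime) -> Tuple[int, List[dict]]:
    """Range on the partitioning column: skip partitions outside [start, end)."""
    scanned, rows = 0, []
    for part_start, part_rows in partitions.items():
        if part_start + HOUR <= start or part_start >= end:
            continue  # pruned without reading any rows
        scanned += 1
        rows += [r for r in part_rows if start <= r["time"] < end]
    return scanned, rows


def scan_other_column(column: str, start: datetime, end: datetime) -> Tuple[int, List[dict]]:
    """Range on any other column: every partition has to be read."""
    scanned, rows = 0, []
    for part_rows in partitions.values():
        scanned += 1
        rows += [r for r in part_rows if start <= r[column] < end]
    return scanned, rows


start, end = ts("2015-09-29 02:00:00"), ts("2015-09-29 03:00:00")
partial_scanned, partial_rows = scan_time_range(start, end)
full_scanned, full_rows = scan_other_column("updated_at", start, end)
```

Both calls return the same row, but the first one reads a single partition while the second reads all of them.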
![Page 25: Introduction to Presto at Treasure Data](https://reader034.vdocuments.net/reader034/viewer/2022042520/58e85ec51a28ab007c8b487f/html5/thumbnails/25.jpg)
Queries with huge results
• SELECT col1, col2, col3, … FROM …
  • Read query results from Presto in JSON (single-thread task: slow)
• INSERT INTO (table) SELECT col1, col2, …
  • or CREATE TABLE AS
  • Directly create 1-hour partitions on S3 from query results
  • Runs in parallel: fast
[Figure: Presto writing 1-hour partitions (header, col1, col2, …) as msgpack.gz files on Amazon S3]
![Page 26: Introduction to Presto at Treasure Data](https://reader034.vdocuments.net/reader034/viewer/2022042520/58e85ec51a28ab007c8b487f/html5/thumbnails/26.jpg)
Memory Consuming Operators
• DISTINCT col1, col2, … (duplicate elimination)
  • Need to store the whole data set in a single node
• COUNT(DISTINCT col1), etc.
  • Use approx_distinct(col1) instead
• ORDER BY col1, col2, …
  • A single node task (in Presto)
• UNION
  • Performs duplicate elimination (single node)
  • Use UNION ALL
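The semantic difference between UNION and UNION ALL can be checked on any SQL engine. The sketch below uses Python's built-in sqlite3 module rather than Presto, with made-up tables `weblog_a` and `weblog_b`; it shows only the result semantics, not Presto's memory behavior.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE weblog_a (page TEXT);
    CREATE TABLE weblog_b (page TEXT);
    INSERT INTO weblog_a VALUES ('/a'), ('/b');
    INSERT INTO weblog_b VALUES ('/b'), ('/c');
    """
)

# UNION removes duplicates, so the engine must compare all rows with each other.
union_rows = conn.execute(
    "SELECT page FROM weblog_a UNION SELECT page FROM weblog_b"
).fetchall()

# UNION ALL simply concatenates both inputs; no duplicate elimination is needed.
union_all_rows = conn.execute(
    "SELECT page FROM weblog_a UNION ALL SELECT page FROM weblog_b"
).fetchall()

# An exact distinct count also has to track every distinct value it has seen.
distinct_pages = conn.execute(
    "SELECT COUNT(DISTINCT page) FROM "
    "(SELECT page FROM weblog_a UNION ALL SELECT page FROM weblog_b)"
).fetchone()[0]
```

If duplicates are impossible or harmless for the analysis, UNION ALL returns the rows without the extra deduplication step.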
![Page 27: Introduction to Presto at Treasure Data](https://reader034.vdocuments.net/reader034/viewer/2022042520/58e85ec51a28ab007c8b487f/html5/thumbnails/27.jpg)
Finding bottlenecks
• Table scan range
  • Check TD_TIME_RANGE condition
• DISTINCT
  • Duplicate elimination of all selected columns (single node)
  • Slow and memory consuming
• Huge result output
  • Output Stage (0) becomes the bottleneck
  • Use DROP TABLE IF EXISTS …, then CREATE TABLE AS SELECT …
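The DROP TABLE IF EXISTS / CREATE TABLE AS SELECT pattern can be tried out with Python's built-in sqlite3 module. The table names `weblog` and `page_counts`, the `status` column, and the `materialize` helper are assumptions for this sketch; SQLite does not reproduce the parallel partition writes, only the statement pattern and the fact that it is safe to re-run.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE weblog (page TEXT, status INTEGER);
    INSERT INTO weblog VALUES ('/a', 200), ('/b', 404), ('/a', 200);
    """
)


def materialize(conn: sqlite3.Connection) -> None:
    """Store the result in a table instead of streaming rows back to the client."""
    conn.execute("DROP TABLE IF EXISTS page_counts")
    conn.execute(
        "CREATE TABLE page_counts AS "
        "SELECT page, COUNT(*) AS cnt FROM weblog GROUP BY page"
    )


materialize(conn)
materialize(conn)  # safe to re-run: the old table is dropped first
rows = conn.execute("SELECT page, cnt FROM page_counts ORDER BY page").fetchall()
```

Dropping the table first keeps the job idempotent, so a retried query does not fail because the target table already exists.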
![Page 28: Introduction to Presto at Treasure Data](https://reader034.vdocuments.net/reader034/viewer/2022042520/58e85ec51a28ab007c8b487f/html5/thumbnails/28.jpg)
Resources
• Presto Query FAQs
  • https://docs.treasuredata.com/articles/presto-query-faq
• Presto Documentation
  • https://prestodb.io/docs