The Enterprise Presto Company
STARBURST
Wojciech BielaGrzegorz Kokosiński
Presto: SQL-on-Anything
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
Starburst in a nutshell
We are the Presto company!
● Largest team of Presto contributors outside of Facebook● Led Presto initiative at Teradata for past 3 years● Working in the SQL-on-Hadoop space since 2011
We offer:
● A production-ready distribution of Presto● Professional enterprise support● Presto managed services
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
Presto is SQL-on-Anything.Deploy Anywhere , Query Anything
Analyst Tools
Data Sources
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
Presto Community
More at https://github.com/prestodb/presto/wiki/Presto-Users
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
Why Presto?
● 100% open source distributed ANSI SQL engine○ Originally developed by Facebook
○ Introduced to Fortune 500 by Teradata
○ Commercialized by Starburst
● Presto is SQL-on-Anything: ○ Deploy anywhere
○ Query anything
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
Presto Highlights
● Community-driven open source project● No vendor lock-in
○ No Hadoop distro vendor lock-in
○ No storage engine vendor lock-in
○ No cloud vendor lock-in
● Query data where it lives○ No ETL or data integration necessary to get to insights
● Proven scalability● High concurrency● Interactive ANSI SQL queries
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
Presto in Production
More at https://github.com/prestodb/presto/wiki/Presto-Users
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
Presto in Production
* Multiple clusters (1000s of nodes)
* 300PB in HDFS, MySQL, and Raptor
* 1000s users, 100s concurrent queries
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
Facebook - Data warehouse
● Hive + HDFS + ORC● multiple clusters ● Thousands of users, 300PB, 1000s nodes● PBs of data scanned, O(100k) queries every day● 100s of concurrent queries
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
Facebook - User facing
● Usage○ reporting backend for ad campaign analytics
● Sharded MySQL storage● relatively small data (10’s to 100’s of TBs) ● 0.1-5 seconds latency● Support for data updates● highly available (different DCs)● 10-15 way joins
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
Presto in Production
* 250+ AWS nodes
* 100+ PB in S3 (Parquet)
* 650+ users with 6K+ queries daily
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
Suro / Kafka Cassandra
AegisthusUrsula
Amazon S3
TVs mobile laptop dimensionsevents
Teradata
TVs mobile laptopTVs mobile laptop
Netflix data pipeline
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
Presto in Production
* 150+ PB HDFS
* 800+ nodes (2 clusters on prem)
* 200K+ queries/day
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
Presto in Production
* 800+ nodes (on premise)
* Parquet data
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
Presto in Production
* 120+ nodes in AWS
* 4PB is S3
* 200+ users
* Starburst support
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
Presto in Production
More at https://github.com/prestodb/presto/wiki/Presto-Users
* Presto for interactive workload
* 200 nodes on AWS
* 20k+ queries / day
* 20PB+ data on S3
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
Lyft ecosystem
Ingest Storage Compute Visualisation
AWS S3
Events
MongoDB
Other DS
Hive
Redshift
Superset
Other tools
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
Presto in Production
More at https://github.com/prestodb/presto/wiki/Presto-Users
* 100 Presto VMs(on premises)
* 1K+ HDFS nodes
* ORC data
* Starburst support
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
Presto in Production
More at https://github.com/prestodb/presto/wiki/Presto-Users
* 200+ nodes (on premises)
* HDFS, ObjectStore, and Cassandra
* Starburst support
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
Logical Data Warehouse
Operational
Yahoo! Japan DWH
TeradataDWH
Operations(RDBs)
Data Lake (Hadoop)
TeradataDWH
RDBs
QG Presto
NoSQL
Data Lake
Hadoop S3
Copy & Load
ETL
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
Data stream API
Worker
Data stream API
Worker
Coordinator
Metadata
API
Parser/
analyzerPlanner Scheduler
Worker
Client
Data location
API
Pluggable
Architecture
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
Presto Connectors
Amazon S3
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
Architecture
● Core Presto○ parser, planner, optimizer and scheduler
○ execution engine
○ stateless,
● Plugins○ connectors - data+metadata
○ user defined functions
○ user defined types
○ event listeners
○ authentication and authorization
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
Connectors - Hive
● Table metadata read from Hive catalog
● Multiple filesystems○ HDFS, S3
● Supported file formats○ ORC (optimized reader, optimized writer)
○ RCFile (optimized reader, optimized writer)
○ Parquet (optimized reader)
○ Avro
○ all other Hive formats
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
Easy deployment
● self contained (RPM/tar.gz)● worker auto discovery● trivial dependencies
○ just a recent JVM
● single-port network communication○ easy firewall/network setup
● even easier with presto-admin
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
Hardware agnostic
● Infrastructure agnostic○ on premise (appliance or commodity clusters)
○ VM (OpenStack, etc.)
○ cloud (Amazon, Azure, etc)
■ pure EC2
■ EMR
■ AWS Athena (pay-as-you-go)
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
Presto
Presto
HDFS DN
HDFS DN
HDFS DN
Presto
HDFS DN
Presto
HDFS DN
Presto
HDFS DN
HDFS DN
HDFS DN
Presto
HDFS DN
HDFS DN
HDFS DN
Separate nodes Shared nodes Mixed (rack local)
Deployment for HDFS
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
SQL support
● ANSI SQL support (good for BI tools)○ all standard data types
○ complex subqueries support (eg. correlated)
● Structural types○ map, array, row
○ JSON
● Lambda expressions○ SELECT transform(ARRAY['dog', 'whale'], x -> length(x))
■ [3, 5]
○ SELECT reduce(ARRAY[5, 20, 50], 0, (s, x) -> s + x, s -> s)
■ 75
● Spatial joins, functions and data types
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
SQL support
● All standard DDL/DML is supported ○ CREATE TABLE / CREATE TABLE AS
■ connector specific extensions supported via WITH clause
○ DROP TABLE
○ INSERT
○ DELETE
○ GRANT / REVOKE
● Set of supported features depends on connector○ richest support for Hive connector
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
Performance
● MPP style ○ Operators pipeline and data streaming
○ In-memory execution
● Columnar data processing● Highly tuned Java
○ Query to ByteCode compilation
○ Memory efficient structures - Minimize GC
○ Careful inner loop implementation
● Multi-threaded execution keeps CPU busy○ Focus on being versatile. Support both (long running) single query at
a time and (interactive) highly concurrent workloads.
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
Cost-Based Optimizer!
● Optimizer in ‘vanilla’ Presto○ currently rule based
● Exploit statistics provided by connectors○ Leverage existing Hive statistics
○ Selectivity estimates and statistics of plan fragments
○ Cost calculation of plan variants
● Cost based decisions (current Starburst release 195e)○ Join type selection
○ Join reordering
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
Cost-Based Optimizer - results
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
Cost-Based Optimizer in action
● 13x max speed-up● >50% 2-5x boost● ~10% 6-10x boost
For more see: https://www.starburstdata.com/technical-blog
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
Connectivity
● Presto CLI● Enterprise JDBC/ODBC drivers
○ full JDBC and ODBC specs compliance
○ Kerberos authentication
○ LDAP authentication
● Open source JDBC driver○ requires Java 8
○ limited support for authentication
● Language specific bindings (R, Python, Go, Ruby, …)
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
BI Tool Support
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
Security
● User authentication (CLI/ODBC/JDBC)○ Basic
○ Kerberos
○ LDAP
● Pluggable user authorization schemes (access control)● Connector level authorization
○ E.g. grants information stored in Hive catalog
● Support for kerberized HDFS/Hive metastore● SSL on the wire
○ client to Presto
○ between Presto nodes
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
Key contributions from our team
● ANSI SQL syntax enhancements to fully support TPC-H and TPC-DS
● Spill to disk capabilities for large intermediate data sets
● Distributed sorting to handle ORDER BY for large datasets
● Security Integrations such as Kerberos, LDAP, and in-transit encryption
● Cost-Based Optimizer and other query performance improvements
● ODBC and JDBC drivers to enable BI tools such as Tableau, Qlik, etc
● Presto connectors for SQLServer and Cassandra
● Presto-Admin for easy installation & management of Presto
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
Enterprise Support for Presto
PrestoCare™
Administration, monitoring and support of the
Presto Platform and Services
Enterprise Support
24/7 Enterprise support of Presto on-premises or in
the cloud.
Installation and tuning assistance.
Product roadmap influence.
Professional Services
Presto architecture,
tuning, integrations,
implementation, and other
development.
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
Starburst Presto Roadmap
● Kafka connector improvements
● HDFS wire encryption
● Further Cost-Based Optimizer extensions
● Execution engine improvements
● Planner improvements
● Better AVRO support
● Support for Oracle Linux
● Support for Azure Cloud
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
More information
Certified Distro: www.starburstdata.com/presto
Project Website: www.prestodb.io
Presto Users Group: www.groups.google.com/group/presto-users
GitHub:www.github.com/prestodb/prestowww.github.com/starburstdata/presto
STARBURST
©2018 Starburst Data, Inc. All Rights Reserved
Learn more at:
www.starburstdata.com
[email protected]@starburstdata.com