presto @ treasure data - presto meetup boston 2015
Post on 12-Apr-2017
Designing An Evolving Database Service with Presto
Taro L. Saito leo@treasure-data.com
Oct 6th, 2015. Presto Meetup @ Boston
Presto Usage at Treasure Data
• 100+ customers are actively using Presto
• 30,000+ Presto queries every day
• Importing 1,000,000+ records / sec.
Import Export
Store Analyze with Presto/Hive
Mobile and Web Sources
Mobile SDKs
JavaScript SDK (web access logs)
Stream Sources
Streaming
Apache logs
nginx logs
syslog
JSON logs
…
JSON
Existing Data Sources
Bulk Import
Data files (CSV, TSV, etc.)
MySQL
PostgreSQL
Oracle
…
Embedded Devices
• Collect data from embedded Linux, serial devices, MQTT, XBee radio, etc.
Import data, now.
Treasure Data Architecture
[Architecture diagram]
Logs → Real-Time Storage (many small log files)
Log merge job (Hadoop MapReduce) → Archive Storage
Archive Storage: 1-hour partitions (2015-09-29 01:00:00, 2015-09-29 02:00:00, 2015-09-29 03:00:00, …), time column-based partitioning
Query engines: Hive, Presto (distributed SQL query engine)
Storage backends: S3 (AWS), Riak CS (IDCF); columnar format
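The 1-hour, time column-based partitioning named in the architecture diagram can be illustrated with a small sketch. This is not Treasure Data's implementation: the function names `partition_start`, `partition_label`, and `group_by_partition` are hypothetical, and the sketch assumes the `time` column holds Unix seconds interpreted in UTC.

```python
from collections import defaultdict
from datetime import datetime, timezone

PARTITION_SECONDS = 3600  # 1-hour partitions, as labeled on the slide


def partition_start(unix_time: int) -> int:
    """Round a Unix timestamp down to the start of its 1-hour partition."""
    return unix_time - (unix_time % PARTITION_SECONDS)


def partition_label(unix_time: int) -> str:
    """Format the partition start like the labels on the slide (UTC assumed)."""
    start = datetime.fromtimestamp(partition_start(unix_time), tz=timezone.utc)
    return start.strftime("%Y-%m-%d %H:%M:%S")


def group_by_partition(records):
    """Group records (dicts with a 'time' key) by their partition start."""
    partitions = defaultdict(list)
    for record in records:
        partitions[partition_start(record["time"])].append(record)
    return dict(partitions)
```

With this layout, a query that filters on a time range only needs to read the partitions whose start times overlap that range.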
Extensible Columnar Store
• JSON data
  • {"time": 1412380700, "user": 1}
• Additional column
  • {"time": 1412381000, "user": 2, "status": 200}
• Type escalation (int -> string)
  • {"time": 1412390000, "user": "U01", "status": 200}
• MessagePack
  • A fast and compact JSON-like format
• Auto type conversion
  • Table schema <=> MessagePack types
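The schema evolution on this slide (a new column appearing, then int -> string escalation) can be sketched as follows. This is only an illustration under stated assumptions, not the actual storage code: the type names `long`, `double`, `boolean`, and `string`, the rule that any type conflict escalates to `string`, and all function names are hypothetical. Plain Python dicts stand in for decoded MessagePack records.

```python
def type_name(value):
    """Map a Python value to an assumed column type name."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"


def merge_type(current, new):
    """Return the column type after observing a value of type `new`."""
    if current is None or current == new:
        return new
    # Assumed escalation rule: conflicting types fall back to string.
    return "string"


def infer_schema(records):
    """Build a column -> type mapping that grows as new columns appear."""
    schema = {}
    for record in records:
        for column, value in record.items():
            schema[column] = merge_type(schema.get(column), type_name(value))
    return schema


def read_column(records, column, schema):
    """Read one column; missing values become None, escalated ones str."""
    values = []
    for record in records:
        value = record.get(column)
        if value is not None and schema[column] == "string":
            value = str(value)
        values.append(value)
    return values
```

Applied to the three records on the slide, `user` ends up as a string column and `status` is simply absent (None) for the first record.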
Use Cases
E-COMMERCE
BEFORE
AFTER
Biggest Mobile Shopping
WISH.COM
• Reduced costs
• Scalability
• Single data warehouse
GAMING
BEFORE: 2500+ servers, daily upload, delay of 1-2 days
AFTER: 2500+ servers, real-time, 1 billion records/day
• Reduced TCO
• Real-time collection
• Real-time access to KPIs
Top 10 globally; 40M+ users
x 20
AD TECH
Publishers’ Dashboard Advertisers’ Dashboard
• 800 B/month
• Live in 2 weeks with 1 engineer!
• 300% growth
Europe’s largest mobile ad-exchange
More than 50 billion impressions/month
LOYALTY
Aggregation
E-Commerce
Marketing Campaigns; Promotions
• Customer Segmentation
• A/B Testing
Challenges
• Handle huge query result output
  • SELECT * / CREATE TABLE AS / INSERT INTO
  • Parallel result upload to S3
  • Bypass JSON result generation at the coordinator
• td-presto connector
  • Accesses MessagePack-based columnar store
  • Handles S3 access retry / pipelining
• Future:
  • Better query plan visualization
    • Quickly find the performance bottleneck and memory-consuming tasks
  • Storing intermediate query results to disks
    • Process large joins, query resource limitation
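The "parallel result upload" and "S3 access retry" items above can be sketched generically. This is a minimal illustration, not the td-presto connector: `with_retry`, `upload_in_parallel`, and the caller-supplied `upload_chunk` callable are hypothetical names, no real S3 client is used, and exponential backoff is an assumed retry policy that the slides do not specify.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def with_retry(operation, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call operation(); on IOError, wait with exponential backoff and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except IOError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))


def upload_in_parallel(chunks, upload_chunk, workers=4, **retry_options):
    """Upload result chunks concurrently, retrying each chunk independently.

    upload_chunk is a caller-supplied function (hypothetical here) that
    sends one chunk to storage and raises IOError on a transient failure.
    """
    def upload_one(chunk):
        return with_retry(lambda: upload_chunk(chunk), **retry_options)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input chunks.
        return list(pool.map(upload_one, chunks))
```

Retrying per chunk means a transient failure on one part does not force the whole result to be uploaded again.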
Extensible Schema SQL via Hive, Presto
Unlimited Users, Queries
Enterprise Apps
Data Science Tools
REST API
Ingestion: Streaming, Bulk
BI Tools
treasuredata.com/request_demo