presto summit series: pinterest · scale at pinterest business scale 400m+ maus 200b+ pins 4b+...

Presto Summit Series: Pinterest Presto at Pinterest August 19, 2020

Introduction

Presto at Pinterest

● Overview of Presto at Pinterest and the technical challenges
● Leveraging warning systems for users to write better queries
● Managing diverse workloads
● Future work

Overview of Presto @ Pinterest

Scale at Pinterest

● Business Scale
  ○ 400M+ MAUs
  ○ 200B+ Pins
  ○ 4B+ Boards

● Data Scale
  ○ 400+ PB @ S3
  ○ Peak of 80k Hadoop jobs per day
  ○ 10,000+ Hive/Hadoop nodes
  ○ > 500 Presto workers (dedicated nodes + k8s)
  ○ > 110,000 Hive tables

● Everything in the cloud (AWS)

Evolution

2014: Qubole + Redshift (Qubole: MR, HMS, Hive, Spark)
2016 Q2: Qubole + Redshift + 🏠Hadoop
2016 Q3: Qubole + Redshift + 🏠Hadoop + 🏠Presto + HMS (RO)
2016 Q4: Qubole + Redshift + 🏠Hadoop + 🏠Presto + HMS (RO) + 🏠Spark
2018: 🏠Hadoop + 🏠Presto (RO) + 🏠Spark + 🏠Hive + HMS (RW)
2020: 🏠Hadoop + 🏠Presto (RW) + 🏠Spark + 🏠Hive + HMS (RW) + 🏠Spark SQL + 🏠Flink

Transition annotations on the slide: Better Support (repeated), Happy Security Team

Presto at Pinterest

Clusters Overview

● PrestoSQL 320 with some backports/customized changes
● Connectors: Hive (major), MySQL, Druid, Thrift
● Adhoc workers running on k8s pods

Use case         # of clusters   Cluster size
adhoc            2               200
scheduled        1               165
PII and others   2               30 ~ 100

Coordinator: 64 core, 488G
Worker: 43 ~ 48 core, 340 ~ 384G

Presto Controller

● An in-house service critical to our Presto deployments, monitoring, self-healing, etc.

● The controller serves the following major functions:
1. Health check
2. Slow worker detection
3. Heavy query detection
4. Rolling restarts of Presto clusters
5. Scaling clusters
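The slides do not say how the controller detects slow workers. As a rough illustration only, the sketch below flags workers whose average split time is well above the cluster median. The metric, the `factor` threshold, and the function name are all assumptions made for this example, not Pinterest's implementation.

```python
from statistics import median


def find_slow_workers(avg_split_seconds, factor=2.0):
    """Return workers whose average split time exceeds factor x the cluster median.

    avg_split_seconds: dict mapping a worker name to its average split time
    (a hypothetical metric chosen for illustration).
    """
    if not avg_split_seconds:
        return []
    cutoff = factor * median(avg_split_seconds.values())
    return sorted(w for w, t in avg_split_seconds.items() if t > cutoff)


# Hypothetical measurements for four workers.
slow = find_slow_workers({"w1": 1.0, "w2": 1.2, "w3": 0.9, "w4": 5.0})
```

Comparing against the median rather than the mean keeps a single very slow worker from shifting the cutoff.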

Presto Gateway

● A service that sits between clients and Presto clusters
● Essentially a smart HTTP proxy server
● Stateless, easy to scale
● Makes clients agnostic of specific Presto clusters and enables the following:

1. Query routing based on rules/health/load/resource groups
2. Resource usage visibility for users/orgs
3. Overall health visibility for Presto clusters
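Routing on health and load can be pictured with a minimal sketch like the one below. It assumes the gateway keeps a small record per cluster and sends each query to the least-loaded healthy cluster for its use case. The `Cluster` fields, cluster names, and `pick_cluster` function are hypothetical and are not taken from the gateway's actual code.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    use_case: str          # e.g. "adhoc" or "scheduled"
    healthy: bool
    running_queries: int   # stand-in for a load signal


def pick_cluster(clusters, use_case):
    """Pick the least-loaded healthy cluster serving the given use case."""
    candidates = [c for c in clusters if c.healthy and c.use_case == use_case]
    if not candidates:
        raise LookupError(f"no healthy cluster for use case {use_case!r}")
    return min(candidates, key=lambda c: c.running_queries)


clusters = [
    Cluster("adhoc-1", "adhoc", True, 40),
    Cluster("adhoc-2", "adhoc", True, 12),
    Cluster("adhoc-3", "adhoc", False, 0),   # unhealthy, never selected
    Cluster("sched-1", "scheduled", True, 5),
]
chosen = pick_cluster(clusters, "adhoc")
```

Because the decision depends only on the current cluster records, a function like this fits the stateless design noted above.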

Challenges: Deeply Nested Large Thrift Schemas

● Prime reason for the coordinator getting stuck or crashing
● Example: A popular and commonly used large Thrift schema has over 12 million primitives and a depth of 41 levels. When serialized to a string, this schema takes over 282 MB.
● Close to 500 Hive tables have over 100K primitives in their schemas.
● The coordinator fetches the table schema from the Hive Metastore and then serializes that schema in each task request it sends to workers.
  ○ Keeps the Hive Metastore service from getting bombarded with requests from workers.
  ○ Adverse effect on coordinator memory and network when schemas are very large.

Challenges: Deeply Nested Large Thrift Schemas

● The large, deeply nested schema issue is limited to tables using Thrift schemas.

● A Thrift schema Java archive (jar) file is created, put on the classpath of the coordinator and each worker of a Presto cluster, and loaded at service start time.

● Schemas were removed completely from task requests: instead, only a Thrift class name is passed as part of the request.

● Workers use the Thrift schema jar to construct the table schema.
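The idea of sending a class name instead of a full schema can be sketched as follows. This is a simplified stand-in, not Presto's task protocol: `LOCAL_SCHEMAS` plays the role of the schemas loaded from the jar at start time, and the class name, field names, and JSON request shape are all hypothetical.

```python
import json

# Hypothetical stand-in for schemas loaded from the Thrift schema jar at startup.
LOCAL_SCHEMAS = {
    "com.example.thrift.Event": {"id": "i64", "payload": {"tags": "list<string>"}},
}


def build_task_request(table, thrift_class):
    """Coordinator side: the request carries only the Thrift class name."""
    return json.dumps({"table": table, "thrift_class": thrift_class})


def resolve_schema(task_request):
    """Worker side: rebuild the table schema from the locally loaded classes."""
    request = json.loads(task_request)
    return LOCAL_SCHEMAS[request["thrift_class"]]


request = build_task_request("events", "com.example.thrift.Event")
schema = resolve_schema(request)
```

The request size now depends on the length of the class name rather than on the size of the schema.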

Challenges: Inconsistent serdes/schemas jar versions between coordinator and workers

● A Presto process always loads the latest serdes/schemas jars from S3 when it restarts.
● The jars loaded by the coordinator and workers could get out of sync when one process restarts but the other does not.
● Solution: version the jars and include the version in the node info, broadcast the coordinator node version to all workers, and restart any worker whose jar version does not match so that it pulls the right version of the jars.
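The version check in the solution above reduces to a comparison like the one in this sketch. The version strings, worker names, and the `workers_to_restart` helper are hypothetical; how versions are actually broadcast and how restarts are triggered is not described in the slides.

```python
def workers_to_restart(coordinator_jar_version, worker_jar_versions):
    """Return the workers whose loaded jar version differs from the coordinator's.

    worker_jar_versions: dict mapping a worker name to the jar version it
    reported in its node info (illustrative data shape).
    """
    return sorted(
        worker
        for worker, version in worker_jar_versions.items()
        if version != coordinator_jar_version
    )


# w2 restarted earlier and still holds an older jar version in this example.
stale = workers_to_restart("v42", {"w1": "v42", "w2": "v41", "w3": "v42"})
```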

Warning systems for better query authoring and diagnosing

(Dr. Presto Project)

Presto Warnings

● Users are sometimes unaware that they are writing inefficient queries; the warning system delivers our recommendations to them

Use cases and example warnings:

● Query Authoring
  ○ Replace “count(distinct x)” with approx_distinct
  ○ Large result set / wide output columns
  ○ Missing partition predicate

● Query Diagnosing
  ○ High CPU consumption, wrong resource group config
  ○ Scanning huge non-columnar tables
  ○ Wrong join order/type
  ○ Performance analysis, etc.
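As an illustration of the first authoring warning, the sketch below scans query text for `count(distinct x)` and suggests `approx_distinct`, which returns an approximate count. It is a standalone regex check written for this example and does not reflect how warnings are generated inside Presto; the table and column names are made up.

```python
import re

# Matches count(distinct <expr>) for simple, non-nested expressions.
COUNT_DISTINCT = re.compile(r"count\s*\(\s*distinct\s+([^)]+?)\s*\)", re.IGNORECASE)


def authoring_warnings(sql):
    """Return one suggestion per count(distinct ...) found in the query text."""
    warnings = []
    for match in COUNT_DISTINCT.finditer(sql):
        column = match.group(1)
        warnings.append(
            f"Consider approx_distinct({column}) instead of count(distinct {column})"
        )
    return warnings


# Hypothetical query against a hypothetical table.
found = authoring_warnings(
    "SELECT COUNT(DISTINCT user_id) FROM pins WHERE dt = '2020-08-19'"
)
```

A text-level check like this misses nested expressions and would also match inside comments or string literals, so a real warning system would more plausibly work on the parsed query.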

Managing diverse workloads

What are we facing?

● Traffic varies between business and non-business hours
  ○ Y-axis: # of concurrently running queries
  ○ X-axis: datetime (range in a day)
  ○ Red line: scheduled queries, green line: adhoc queries

What are we facing?

● Previously, resource limits were set per user per client
  ○ # of queries allowed is 3
  ○ No CPU quota

● Various query types, from resource-intensive to lightweight
● Due to the nature of adhoc usage, compute demand changes very fast

How to solve it

Query traffic variance

● Route traffic between the adhoc and scheduled clusters, i.e. route scheduled queries to the adhoc cluster during off-peak hours
● Publish the query traffic pattern to users via warnings so they can reschedule jobs if possible
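A time-based routing rule of this kind might look like the sketch below. The business-hour window, the cluster labels, and the decision to treat the adhoc cluster as an additional target rather than a replacement are assumptions for illustration; the slides do not give the actual rule.

```python
from datetime import time

# Assumed business-hour window; the real peak hours are not stated.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(18, 0)


def eligible_clusters(query_type, now):
    """Return the cluster types a query may be routed to at the given time of day."""
    off_peak = not (BUSINESS_START <= now < BUSINESS_END)
    if query_type == "scheduled" and off_peak:
        # Off-peak: scheduled queries may also use spare adhoc capacity.
        return ["scheduled", "adhoc"]
    return [query_type]


night_targets = eligible_clusters("scheduled", time(2, 0))
day_targets = eligible_clusters("scheduled", time(11, 0))
```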

How to solve it

Improve resource usage & developer velocity

Org-based resource group

● Have one single resource group for a whole organization (varies from 20 ~ 100 users) based on LDAP.
● Provide visibility into which user/query is taking the majority of resources in the group.
● Each resource group has resource sub-groups allocated with different resources: fast_lane, normal and expensive. Users choose which one to use by setting a session property.
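The mapping from an organization and a session property to a sub-group can be sketched as below. The sub-group names come from the slide; the session property name `resource_sub_group`, the default of `normal`, the org name, and the dotted group path are hypothetical choices for this example.

```python
SUB_GROUPS = ("fast_lane", "normal", "expensive")


def resource_group_path(org, session_properties, default="normal"):
    """Resolve the resource group for a query from its org and session properties.

    org: the user's organization, e.g. as looked up from LDAP.
    session_properties: dict of session properties set by the user;
    "resource_sub_group" is a made-up property name for illustration.
    """
    choice = session_properties.get("resource_sub_group", default)
    if choice not in SUB_GROUPS:
        raise ValueError(f"unknown resource sub-group: {choice!r}")
    return f"{org}.{choice}"


path = resource_group_path("ads", {"resource_sub_group": "fast_lane"})
```

Grouping by organization first and by sub-group second lets each org's usage be reported as a whole while still separating light and heavy queries.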

Future Work

● Spot instances
● More Presto warnings
● Fine-grained access control
● Better query failure diagnostics
● Presto for ETL workloads

Presto at Pinterest Blog: https://medium.com/pinterest-engineering/presto-at-pinterest-a8bda7515e52
Presto at Pinterest NYC Presto Summit (2019): https://www.youtube.com/watch?v=AY7VtreK8IQ

THANK YOU!

Register: www.starburstdata.com/events