from to - hptsspark’s evolution since 2013 sql (can run all tpc-ds/h queries) column store...

41
From to : enabling complex data applications Reynold Xin @rxin HPTS 2019

Upload: others

Post on 01-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

From to :enabling complex data applications

Reynold Xin @rxinHPTS 2019

Page 2: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle
Page 3: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

Databricks Platform

> 1 million VMs per day launched.

> 1 exabytes data processed per week (soon to be per day).

Page 4: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

whoami

Databricks co-founder & Chief Architect- Designed (and implemented) most of “modern day” Spark- #1 code contributor to Spark by commits and net lines deleted

PhD in databases from Berkeley

Page 5: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

I used to be obsessed with perf eng

Page 6: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

“The greatest performance improvement of all is when a system goes from not-working to working.”

John Ousterhout

Page 7: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

This Talk/ What is Apache Spark?

/ Delta Lake: scalable & reliable data lakes

Page 8: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

This Talk/ What is Apache Spark?

/ Delta Lake: scalable & reliable data lakes

Page 9: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle
Page 10: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

2013 pitch: a better MapReduce

Page 11: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

Spark’s evolution since 2013

SQL (can run all TPC-DS/H queries)Column store (Parquet, ORC, etc)Query optimizer (heuristics & cost-based)External operations (handle data > memory)Volcano execution engine -> Hyper-style code gen engineStreaming (incremental view maintenance)… What’s the point in reinventing a SQL query engine?

Page 12: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

Because databases suck at …

Fast iteration: installation, data modeling, etc, take LONG time.

Scalability.

Building complex data applications (focus of this talk).

Page 13: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

Modern eng principles & practices

Modularity / separation of concernsIncremental developmentExternal dependencies, libraries, frameworksDebuggerChange management Testing (unit tests, integration tests)Continuous integration / continuous delivery… SQL makes it virtually impossible to apply these.

Page 14: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

So what is “modern day” Spark?

An open source relational query enginewith language-integrated APIs in Python, Scala, Java, C# …and seamless UDFs

to enablebuilding complex appswithmodern eng principles.

Page 15: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

This Talk/ What is Apache Spark?

/ Delta Lake: scalable & reliable data lakes

Page 16: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

Delta Lake

6 mo

since open source

4000+

use in companies

1 EB+

data processed / wk

Page 17: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

What does a typical data lake project look like?

Page 18: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

Evolution of a Cutting-Edge Data Lake

Events

?AI & Reporting

StreamingAnalytics

Data Lake

Page 19: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

Evolution of a Cutting-Edge Data Lake

Events

AI & Reporting

StreamingAnalytics

Data Lake

Page 20: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

Challenge #1: Historical Queries?

Data Lake

λ-arch

λ-arch

StreamingAnalytics

AI & Reporting

Eventsλ-arch1

1

1

Page 21: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

Challenge #2: Messy Data?

Data Lake

λ-arch

λ-arch

StreamingAnalytics

AI & Reporting

Events

Validation

λ-archValidation

1

21

1

2

Page 22: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

Reprocessing

Challenge #3: Mistakes and Failures?

Data Lake

λ-arch

λ-arch

StreamingAnalytics

AI & Reporting

Events

Validation

λ-archValidation

Reprocessing

Partitioned

1

2

3

1

1

3

2

Page 23: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

Reprocessing

Challenge #4: Updates?

Data Lake

λ-arch

λ-arch

StreamingAnalytics

AI & Reporting

Events

Validation

λ-archValidation

Reprocessing

Updates

Partitioned

UPDATE &MERGE

Scheduled to Avoid Modifications

1

2

3

1

1

3

4

4

4

2

Page 24: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

Wasting Time & Money

Solving Systems Problems

Instead of Extracting Value From Data

Page 25: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

Data Lake Distractions

No atomicity means failed production jobs leave data in corrupt state requiring tedious recovery

No quality enforcement creates inconsistent and unusable data

No consistency / isolation makes it almost impossible to mix appends and reads, batch and streaming

Page 26: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

Let’s try it instead with

Page 27: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

Reprocessing

Challenges of the Data Lake

Data Lake

λ-arch

λ-arch

StreamingAnalytics

AI & Reporting

Events

Validation

λ-archValidation

Reprocessing

Updates

Partitioned

UPDATE &MERGE

Scheduled to Avoid Modifications

1

2

3

1

1

3

4

4

4

2

Page 28: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

AI & Reporting

StreamingAnalytics

The Architecture

Data Lake

CSV,JSON, TXT…

Kinesis

Page 29: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

AI & Reporting

StreamingAnalytics

The Architecture

Data Lake

CSV,JSON, TXT…

Kinesis

Full ACID Transaction

Focus on your data flow, instead of worrying about failures.

Page 30: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

AI & Reporting

StreamingAnalytics

The Architecture

Data Lake

CSV,JSON, TXT…

Kinesis

Open Standards, Open Source

Store petabytes of data without worries of lock-in. Growing community including Presto, Spark and more.

Page 31: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

AI & Reporting

StreamingAnalytics

The Architecture

Data Lake

CSV,JSON, TXT…

Kinesis

Powered by

Unifies Streaming / Batch with unlimited scalability.

Page 32: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

Data Lake

AI & Reporting

StreamingAnalytics

Business-level Aggregates

Filtered, CleanedAugmented

Raw Ingestion

The

Bronze Silver Gold

CSV,JSON, TXT…

Kinesis

Quality

Delta Lake allows you to incrementally improve the quality of your data until it is ready for consumption.

*Data Quality Levels *

Page 33: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

Data Lake

AI & Reporting

StreamingAnalytics

Business-level Aggregates

Filtered, CleanedAugmented

Raw Ingestion

The

Bronze Silver Gold

CSV,JSON, TXT…

Kinesis

• Dumping ground for raw data• Often with long retention (years)• Avoid error-prone parsing

🔥

Page 34: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

Data Lake

AI & Reporting

StreamingAnalytics

Business-level Aggregates

Filtered, CleanedAugmented

Raw Ingestion

The

Bronze Silver Gold

CSV,JSON, TXT…

Kinesis

Intermediate data with some cleanup applied.Queryable for easy debugging!

Page 35: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

Data Lake

AI & Reporting

StreamingAnalytics

Business-level Aggregates

Filtered, CleanedAugmented

Raw Ingestion

The

Bronze Silver Gold

CSV,JSON, TXT…

Kinesis

Clean data, ready for consumption.Read with Spark or Presto

Page 36: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

Data Lake

AI & Reporting

StreamingAnalytics

Business-level Aggregates

Filtered, CleanedAugmented

Raw Ingestion

The

Bronze Silver Gold

CSV,JSON, TXT…

Kinesis

Streams move data through the Delta Lake• Low-latency or manually triggered• Eliminates management of schedules and jobs

Page 37: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

Data Lake

AI & Reporting

StreamingAnalytics

Business-level Aggregates

Filtered, CleanedAugmented

Raw Ingestion

The

Bronze Silver Gold

CSV,JSON, TXT…

Kinesis

Delta Lake also supports batch jobs and standard DML

UPDATE

DELETE MERGE

OVERWRITE

• Retention• Corrections• GDPR

• UPSERTS

INSERT

Page 38: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

Data Lake

AI & Reporting

StreamingAnalytics

Business-level Aggregates

Filtered, CleanedAugmented

Raw Ingestion

The

Bronze Silver Gold

CSV,JSON, TXT…

Kinesis

Easy to recompute when business logic changes:• Clear tables• Restart streams

DELETE DELETE

Page 39: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

Delta under the hood

my_table/_delta_log/

00000.json00001.json

date=2019-01-01/file-1.parquet

Transaction LogTable Versions

(Optional) Partition DirectoriesData Files

Page 40: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

Handling Massive Metadata

Large tables can have millions of files in them! How do we scale the metadata? Use Spark for scaling!

Add 1.parquet

Add 2.parquetRemove 1.parquet

Remove 2.parquet

Add 3.parquet

Checkpoint

Page 41: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle

Conclusion

Complex data applications can only be built with modern engprinciples & practices, and SQL is a terrible choice.

We are building a new generation of tools to incorporate the best of SQL & support modern eng practices.

Spark + Delta Lake are just the beginning. Hiring in SF, AMS, YYZ.