![Page 1: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/1.jpg)
From to :enabling complex data applications
Reynold Xin @rxin
HPTS 2019
![Page 2: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/2.jpg)
![Page 3: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/3.jpg)
Databricks Platform
> 1 million VMs per day launched.
> 1 exabyte of data processed per week (soon to be per day).
![Page 4: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/4.jpg)
whoami
Databricks co-founder & Chief Architect
- Designed (and implemented) most of “modern day” Spark
- #1 code contributor to Spark by commits and net lines deleted
PhD in databases from Berkeley
![Page 5: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/5.jpg)
I used to be obsessed with perf eng
![Page 6: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/6.jpg)
“The greatest performance improvement of all is when a system goes from not-working to working.”
John Ousterhout
![Page 7: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/7.jpg)
This Talk
/ What is Apache Spark?
/ Delta Lake: scalable & reliable data lakes
![Page 8: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/8.jpg)
This Talk
/ What is Apache Spark?
/ Delta Lake: scalable & reliable data lakes
![Page 9: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/9.jpg)
![Page 10: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/10.jpg)
2013 pitch: a better MapReduce
![Page 11: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/11.jpg)
Spark’s evolution since 2013
SQL (can run all TPC-DS/H queries)
Column store (Parquet, ORC, etc)
Query optimizer (heuristics & cost-based)
External operations (handle data > memory)
Volcano execution engine -> Hyper-style code gen engine
Streaming (incremental view maintenance)
…
What’s the point in reinventing a SQL query engine?
![Page 12: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/12.jpg)
Because databases suck at …
Fast iteration: installation, data modeling, etc. take a LONG time.
Scalability.
Building complex data applications (focus of this talk).
![Page 13: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/13.jpg)
Modern eng principles & practices
Modularity / separation of concerns
Incremental development
External dependencies, libraries, frameworks
Debugger
Change management
Testing (unit tests, integration tests)
Continuous integration / continuous delivery
…
SQL makes it virtually impossible to apply these.
![Page 14: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/14.jpg)
So what is “modern day” Spark?
An open source relational query engine with language-integrated APIs in Python, Scala, Java, C# … and seamless UDFs
to enable building complex apps with modern eng principles.
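To make the point about language-integrated APIs and UDFs concrete, here is a minimal sketch in plain Python. It does not use Spark's actual API: `with_column`, `where`, `normalize_country`, and `clean_events` are hypothetical helpers standing in for DataFrame operations and a user-defined function, and rows are plain dicts. The idea it illustrates is that each stage is an ordinary function, so it can be composed, imported, debugged, and unit tested like any other code.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]  # assumption: a record is a plain dict in this sketch


def normalize_country(code: str) -> str:
    """A 'UDF': ordinary Python logic that can be unit tested in isolation."""
    return code.strip().upper()


def with_column(rows: List[Row], name: str, fn: Callable[[Row], object]) -> List[Row]:
    """Return new rows with column `name` computed by `fn(row)`."""
    return [{**r, name: fn(r)} for r in rows]


def where(rows: List[Row], pred: Callable[[Row], bool]) -> List[Row]:
    """Keep only the rows for which `pred(row)` is true."""
    return [r for r in rows if pred(r)]


def clean_events(rows: List[Row]) -> List[Row]:
    """One reusable pipeline stage composed from the helpers above."""
    rows = with_column(rows, "country", lambda r: normalize_country(r["country"]))
    return where(rows, lambda r: r["amount"] > 0)


events = [
    {"country": " us ", "amount": 10},
    {"country": "nl", "amount": -5},
]
cleaned = clean_events(events)
```

Because `clean_events` is just a function, a test can call it on a handful of hand-written rows without any database installed.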
![Page 15: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/15.jpg)
This Talk
/ What is Apache Spark?
/ Delta Lake: scalable & reliable data lakes
![Page 16: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/16.jpg)
Delta Lake
6 mo
since open source
4000+
use in companies
1 EB+
data processed / wk
![Page 17: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/17.jpg)
What does a typical data lake project look like?
![Page 18: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/18.jpg)
Evolution of a Cutting-Edge Data Lake
[Diagram: Events; ?; Data Lake; Streaming Analytics; AI & Reporting]
![Page 19: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/19.jpg)
Evolution of a Cutting-Edge Data Lake
[Diagram: Events; Data Lake; Streaming Analytics; AI & Reporting]
![Page 20: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/20.jpg)
Challenge #1: Historical Queries?
[Diagram: Events; λ-arch; Data Lake; Streaming Analytics; AI & Reporting. Callout 1: λ-arch.]
![Page 21: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/21.jpg)
Challenge #2: Messy Data?
[Diagram: Events; λ-arch; Validation; Data Lake; Streaming Analytics; AI & Reporting. Callouts: 1 λ-arch, 2 Validation.]
![Page 22: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/22.jpg)
Challenge #3: Mistakes and Failures?
[Diagram: Events; λ-arch; Validation; Reprocessing; Data Lake; Partitioned; Streaming Analytics; AI & Reporting. Callouts: 1 λ-arch, 2 Validation, 3 Reprocessing.]
![Page 23: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/23.jpg)
Challenge #4: Updates?
[Diagram: Events; λ-arch; Validation; Reprocessing; Updates; Data Lake; Partitioned; UPDATE & MERGE; Scheduled to Avoid Modifications; Streaming Analytics; AI & Reporting. Callouts: 1 λ-arch, 2 Validation, 3 Reprocessing, 4 Updates.]
![Page 24: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/24.jpg)
Wasting Time & Money
Solving Systems Problems
Instead of Extracting Value From Data
![Page 25: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/25.jpg)
Data Lake Distractions
✗ No atomicity means failed production jobs leave data in a corrupt state, requiring tedious recovery
✗ No quality enforcement creates inconsistent and unusable data
✗ No consistency / isolation makes it almost impossible to mix appends and reads, batch and streaming
![Page 26: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/26.jpg)
Let’s try it instead with
![Page 27: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/27.jpg)
Challenges of the Data Lake
[Diagram: Events; λ-arch; Validation; Reprocessing; Updates; Data Lake; Partitioned; UPDATE & MERGE; Scheduled to Avoid Modifications; Streaming Analytics; AI & Reporting. Callouts: 1 λ-arch, 2 Validation, 3 Reprocessing, 4 Updates.]
![Page 28: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/28.jpg)
The Architecture
[Diagram: CSV, JSON, TXT… and Kinesis sources; Data Lake; Streaming Analytics; AI & Reporting]
![Page 29: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/29.jpg)
The Architecture
[Diagram: CSV, JSON, TXT… and Kinesis sources; Data Lake; Streaming Analytics; AI & Reporting]
Full ACID Transaction
Focus on your data flow, instead of worrying about failures.
![Page 30: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/30.jpg)
The Architecture
[Diagram: CSV, JSON, TXT… and Kinesis sources; Data Lake; Streaming Analytics; AI & Reporting]
Open Standards, Open Source
Store petabytes of data without worries of lock-in. Growing community including Presto, Spark and more.
![Page 31: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/31.jpg)
The Architecture
[Diagram: CSV, JSON, TXT… and Kinesis sources; Data Lake; Streaming Analytics; AI & Reporting]
Powered by
Unifies Streaming / Batch with unlimited scalability.
![Page 32: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/32.jpg)
[Diagram: CSV, JSON, TXT… and Kinesis sources; Data Lake tables Bronze (Raw Ingestion), Silver (Filtered, Cleaned, Augmented), Gold (Business-level Aggregates); Streaming Analytics; AI & Reporting; a "Quality" axis across the tables]
Delta Lake allows you to incrementally improve the quality of your data until it is ready for consumption.
*Data Quality Levels*
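The bronze, silver, gold progression can be sketched in a few lines of plain Python. This is an illustration of the idea only, not Delta Lake code: the function names, the `user` and `amount` fields, and the validation rule are all assumptions made for the example. Bronze keeps raw payloads untouched, silver parses and validates them, and gold holds an aggregate that is ready to consume.

```python
import json
from collections import defaultdict


def to_bronze(raw_lines):
    """Bronze: store raw payloads as-is, with no parsing."""
    return [{"raw": line} for line in raw_lines]


def to_silver(bronze):
    """Silver: parse each payload and drop records that fail basic validation."""
    silver = []
    for rec in bronze:
        try:
            event = json.loads(rec["raw"])
        except json.JSONDecodeError:
            continue  # unparseable input remains available in bronze only
        if (
            isinstance(event, dict)
            and "user" in event
            and isinstance(event.get("amount"), (int, float))
        ):
            silver.append({"user": event["user"], "amount": event["amount"]})
    return silver


def to_gold(silver):
    """Gold: a business-level aggregate (total amount per user)."""
    totals = defaultdict(float)
    for rec in silver:
        totals[rec["user"]] += rec["amount"]
    return dict(totals)


raw = [
    '{"user": "a", "amount": 3}',
    "not json",
    '{"user": "a", "amount": 2}',
    '{"user": "b"}',
]
bronze = to_bronze(raw)
silver = to_silver(bronze)
gold = to_gold(silver)
```

Keeping the unparsed input in bronze means a stricter or corrected `to_silver` can be rerun later without going back to the original sources.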
![Page 33: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/33.jpg)
[Diagram: Bronze / Silver / Gold pipeline from CSV, JSON, TXT… and Kinesis sources to Streaming Analytics and AI & Reporting]
Bronze (Raw Ingestion):
• Dumping ground for raw data
• Often with long retention (years)
• Avoid error-prone parsing
![Page 34: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/34.jpg)
[Diagram: Bronze / Silver / Gold pipeline from CSV, JSON, TXT… and Kinesis sources to Streaming Analytics and AI & Reporting]
Silver (Filtered, Cleaned, Augmented):
Intermediate data with some cleanup applied. Queryable for easy debugging!
![Page 35: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/35.jpg)
[Diagram: Bronze / Silver / Gold pipeline from CSV, JSON, TXT… and Kinesis sources to Streaming Analytics and AI & Reporting]
Gold (Business-level Aggregates):
Clean data, ready for consumption. Read with Spark or Presto.
![Page 36: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/36.jpg)
[Diagram: Bronze / Silver / Gold pipeline from CSV, JSON, TXT… and Kinesis sources to Streaming Analytics and AI & Reporting]
Streams move data through the Delta Lake
• Low-latency or manually triggered
• Eliminates management of schedules and jobs
![Page 37: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/37.jpg)
[Diagram: Bronze / Silver / Gold pipeline from CSV, JSON, TXT… and Kinesis sources to Streaming Analytics and AI & Reporting]
Delta Lake also supports batch jobs and standard DML: INSERT, UPDATE, DELETE, MERGE, OVERWRITE
• Retention
• Corrections
• GDPR
• UPSERTS
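As a rough illustration of what MERGE (upsert) and DELETE mean for a table, here is a plain-Python sketch over lists of dicts. It is not Delta Lake's implementation or API: `merge`, `delete_where`, and the `id` / `email` columns are hypothetical names chosen for the example.

```python
def merge(target, updates, key):
    """Upsert: update rows whose key matches, insert the rest. Returns a new list."""
    by_key = {row[key]: dict(row) for row in target}
    for row in updates:
        # Matched rows are overwritten column by column; unmatched rows are inserted.
        by_key[row[key]] = {**by_key.get(row[key], {}), **row}
    return list(by_key.values())


def delete_where(target, pred):
    """Delete: return the rows for which `pred(row)` is false."""
    return [r for r in target if not pred(r)]


users = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
]
changes = [
    {"id": 2, "email": "b2@example.com"},  # existing key: update
    {"id": 3, "email": "c@example.com"},   # new key: insert
]
merged = merge(users, changes, key="id")

# A GDPR-style erasure of one user, expressed as a delete.
after_erasure = delete_where(merged, lambda r: r["id"] == 1)
```

Both helpers return new lists instead of mutating their input, which mirrors the idea that a table change produces a new table version.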
![Page 38: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/38.jpg)
[Diagram: Bronze / Silver / Gold pipeline from CSV, JSON, TXT… and Kinesis sources to Streaming Analytics and AI & Reporting, with DELETE marked on two tables]
Easy to recompute when business logic changes:
• Clear tables
• Restart streams
![Page 39: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/39.jpg)
Delta under the hood
my_table/
- _delta_log/ : Transaction Log
  - 00000.json, 00001.json : Table Versions
- date=2019-01-01/ : (Optional) Partition Directories
  - file-1.parquet : Data Files
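One way to picture how a directory of numbered log files can give atomic commits is the toy sketch below. It is an assumption-laden illustration, not Delta Lake's real protocol: `commit` is a hypothetical helper, the five-digit file names simply follow the slide, the second data file path is made up, and a local filesystem stands in for cloud storage. The key idea is that a writer claims the next version number by creating its log file exclusively, so two writers cannot both commit the same version.

```python
import json
import os
import tempfile


def commit(table_dir, actions):
    """Write `actions` as the next log entry and return the committed version."""
    log_dir = os.path.join(table_dir, "_delta_log")
    os.makedirs(log_dir, exist_ok=True)
    while True:
        # Next version = number of log entries written so far.
        version = len([f for f in os.listdir(log_dir) if f.endswith(".json")])
        path = os.path.join(log_dir, f"{version:05d}.json")
        try:
            # Mode "x" fails if the file already exists, so only one writer
            # can claim a given version number.
            with open(path, "x") as f:
                json.dump(actions, f)
            return version
        except FileExistsError:
            continue  # another writer took this version; retry with the next one


table = os.path.join(tempfile.mkdtemp(), "my_table")
v0 = commit(table, [{"add": "date=2019-01-01/file-1.parquet"}])
v1 = commit(table, [{"add": "date=2019-01-02/file-2.parquet"}])
log_files = sorted(os.listdir(os.path.join(table, "_delta_log")))
```

Readers that list `_delta_log/` see either a whole commit or none of it only if the file write itself is atomic; this sketch does not guarantee that, and a production system would need to.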
![Page 40: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/40.jpg)
Handling Massive Metadata
Large tables can have millions of files in them! How do we scale the metadata? Use Spark for scaling!
Add 1.parquet
Add 2.parquet
Remove 1.parquet
Remove 2.parquet
Add 3.parquet
Checkpoint
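The add / remove actions and the checkpoint can be read as a log replay. The sketch below is a single-process illustration in plain Python, whereas the slide says Spark is used to scale this; `replay` is a hypothetical helper, and treating the actions as one flat, ordered list (ignoring how they are grouped into commits) is an assumption. A checkpoint is modeled as the set of live files after some prefix of the log, so later readers only replay the actions that follow it.

```python
def replay(actions, snapshot=frozenset()):
    """Apply add/remove actions on top of `snapshot` and return the live files."""
    files = set(snapshot)
    for op, name in actions:
        if op == "add":
            files.add(name)
        elif op == "remove":
            files.discard(name)
    return files


actions = [
    ("add", "1.parquet"),
    ("add", "2.parquet"),
    ("remove", "1.parquet"),
    ("remove", "2.parquet"),
    ("add", "3.parquet"),
]

# Replaying the whole log from the beginning.
full = replay(actions)

# Checkpoint after the first three actions, then replay only the tail.
checkpoint = replay(actions[:3])
from_checkpoint = replay(actions[3:], checkpoint)
```

Replaying from the checkpoint gives the same table state as replaying the full log, while reading fewer log entries.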
![Page 41: From to - HPTSSpark’s evolution since 2013 SQL (can run all TPC-DS/H queries) Column store (Parquet, ORC, etc) Query optimizer (heuristics & cost-based) External operations (handle](https://reader034.vdocuments.net/reader034/viewer/2022050410/5f87092b145a6424df2d6630/html5/thumbnails/41.jpg)
Conclusion
Complex data applications can only be built with modern eng principles & practices, and SQL is a terrible choice.
We are building a new generation of tools to incorporate the best of SQL & support modern eng practices.
Spark + Delta Lake are just the beginning. Hiring in SF, AMS, YYZ.