TRANSCRIPT
![Page 1: Not your Father's Database: Not Your Father’s Database: How to Use Apache® Spark™ Properly in your Big Data Architecture](https://reader031.vdocuments.net/reader031/viewer/2022030311/58ef89501a28aba94e8b45eb/html5/thumbnails/1.jpg)
Not Your Father’s Database: How to Use Apache® Spark™ Properly in Your Big Data Architecture
![Page 2: Not your Father's Database: Not Your Father’s Database: How to Use Apache® Spark™ Properly in your Big Data Architecture](https://reader031.vdocuments.net/reader031/viewer/2022030311/58ef89501a28aba94e8b45eb/html5/thumbnails/2.jpg)
![Page 3: Not your Father's Database: Not Your Father’s Database: How to Use Apache® Spark™ Properly in your Big Data Architecture](https://reader031.vdocuments.net/reader031/viewer/2022030311/58ef89501a28aba94e8b45eb/html5/thumbnails/3.jpg)
About Me
2005 Mobile Web & Voice Search
![Page 4: Not your Father's Database: Not Your Father’s Database: How to Use Apache® Spark™ Properly in your Big Data Architecture](https://reader031.vdocuments.net/reader031/viewer/2022030311/58ef89501a28aba94e8b45eb/html5/thumbnails/4.jpg)
2012 Reporting & Analytics
![Page 5: Not your Father's Database: Not Your Father’s Database: How to Use Apache® Spark™ Properly in your Big Data Architecture](https://reader031.vdocuments.net/reader031/viewer/2022030311/58ef89501a28aba94e8b45eb/html5/thumbnails/5.jpg)
2014 Solutions Engineering
![Page 6: Not your Father's Database: Not Your Father’s Database: How to Use Apache® Spark™ Properly in your Big Data Architecture](https://reader031.vdocuments.net/reader031/viewer/2022030311/58ef89501a28aba94e8b45eb/html5/thumbnails/6.jpg)
This system talks like a SQL Database…
Is this your Spark infrastructure?
HDFS
![Page 7: Not your Father's Database: Not Your Father’s Database: How to Use Apache® Spark™ Properly in your Big Data Architecture](https://reader031.vdocuments.net/reader031/viewer/2022030311/58ef89501a28aba94e8b45eb/html5/thumbnails/7.jpg)
But the performance is very different…
![Page 8: Not your Father's Database: Not Your Father’s Database: How to Use Apache® Spark™ Properly in your Big Data Architecture](https://reader031.vdocuments.net/reader031/viewer/2022030311/58ef89501a28aba94e8b45eb/html5/thumbnails/8.jpg)
Just in Time Data Warehouse w/ Spark
HDFS
![Page 9: Not your Father's Database: Not Your Father’s Database: How to Use Apache® Spark™ Properly in your Big Data Architecture](https://reader031.vdocuments.net/reader031/viewer/2022030311/58ef89501a28aba94e8b45eb/html5/thumbnails/9.jpg)
![Page 10: Not your Father's Database: Not Your Father’s Database: How to Use Apache® Spark™ Properly in your Big Data Architecture](https://reader031.vdocuments.net/reader031/viewer/2022030311/58ef89501a28aba94e8b45eb/html5/thumbnails/10.jpg)
Just in Time Data Warehouse w/ Spark
HDFS and more…
![Page 11: Not your Father's Database: Not Your Father’s Database: How to Use Apache® Spark™ Properly in your Big Data Architecture](https://reader031.vdocuments.net/reader031/viewer/2022030311/58ef89501a28aba94e8b45eb/html5/thumbnails/11.jpg)
Separate Compute vs. Storage
Benefits:
• No need to import your data into Spark to begin processing.
• Dynamically scale Spark clusters to match compute vs. storage needs.
• Choose the data store whose performance characteristics best fit your use case.
![Page 12: Not your Father's Database: Not Your Father’s Database: How to Use Apache® Spark™ Properly in your Big Data Architecture](https://reader031.vdocuments.net/reader031/viewer/2022030311/58ef89501a28aba94e8b45eb/html5/thumbnails/12.jpg)
Today’s Goal
Know when to use other data stores besides file systems
![Page 13: Not your Father's Database: Not Your Father’s Database: How to Use Apache® Spark™ Properly in your Big Data Architecture](https://reader031.vdocuments.net/reader031/viewer/2022030311/58ef89501a28aba94e8b45eb/html5/thumbnails/13.jpg)
Use Case: Data Warehousing
![Page 14: Not your Father's Database: Not Your Father’s Database: How to Use Apache® Spark™ Properly in your Big Data Architecture](https://reader031.vdocuments.net/reader031/viewer/2022030311/58ef89501a28aba94e8b45eb/html5/thumbnails/14.jpg)
Good: General Purpose Processing
Types of data sets to store in file systems:
• Archival Data
• Unstructured Data
• Social Media and other web datasets
• Backup copies of data stores
![Page 15: Not your Father's Database: Not Your Father’s Database: How to Use Apache® Spark™ Properly in your Big Data Architecture](https://reader031.vdocuments.net/reader031/viewer/2022030311/58ef89501a28aba94e8b45eb/html5/thumbnails/15.jpg)
Good: General Purpose Processing
Types of workloads:
• Batch Workloads
• Ad Hoc Analysis
  – Best Practice: Use in-memory caching
• Multi-step Pipelines
• Iterative Workloads
![Page 16: Not your Father's Database: Not Your Father’s Database: How to Use Apache® Spark™ Properly in your Big Data Architecture](https://reader031.vdocuments.net/reader031/viewer/2022030311/58ef89501a28aba94e8b45eb/html5/thumbnails/16.jpg)
Good: General Purpose Processing
Benefits:
• Inexpensive Storage
• Incredibly flexible processing
• Speed and Scale
![Page 17: Not your Father's Database: Not Your Father’s Database: How to Use Apache® Spark™ Properly in your Big Data Architecture](https://reader031.vdocuments.net/reader031/viewer/2022030311/58ef89501a28aba94e8b45eb/html5/thumbnails/17.jpg)
Bad: Random Access
sqlContext.sql("select * from my_large_table where id=2I34823")
Will this command run in Spark?
![Page 18: Not your Father's Database: Not Your Father’s Database: How to Use Apache® Spark™ Properly in your Big Data Architecture](https://reader031.vdocuments.net/reader031/viewer/2022030311/58ef89501a28aba94e8b45eb/html5/thumbnails/18.jpg)
Yes, but it’s not very efficient: Spark may have to go through all your files to find your row.
![Page 19: Not your Father's Database: Not Your Father’s Database: How to Use Apache® Spark™ Properly in your Big Data Architecture](https://reader031.vdocuments.net/reader031/viewer/2022030311/58ef89501a28aba94e8b45eb/html5/thumbnails/19.jpg)
Bad: Random Access
Solution: If you frequently access your data randomly, use a database.
• For traditional SQL databases, create an index on your key column.
• Key-value NoSQL stores retrieve the value for a key efficiently out of the box.
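To make the first bullet concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for a traditional SQL database. The table name my_large_table and the id column follow the query on the earlier slide; the payload column, the index name, and the sample rows are made up for illustration. The sketch asks SQLite for its query plan before and after the index is created.

```python
import sqlite3

def lookup_plan(conn, row_id):
    """Return SQLite's query-plan text for a single-row lookup by id."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM my_large_table WHERE id = ?",
        (row_id,),
    ).fetchall()
    # The last column of each plan row is the human-readable detail text.
    return " ".join(r[3] for r in rows)

conn = sqlite3.connect(":memory:")
# Hypothetical schema: only the table name and the id column come from the slides.
conn.execute("CREATE TABLE my_large_table (id TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO my_large_table VALUES (?, ?)",
    ((f"id{i}", f"row {i}") for i in range(10_000)),
)

plan_before = lookup_plan(conn, "id4242")  # no index yet: the table is scanned
conn.execute("CREATE INDEX idx_my_large_table_id ON my_large_table (id)")
plan_after = lookup_plan(conn, "id4242")   # the planner can now use the index

print(plan_before)
print(plan_after)
```

The exact plan wording differs between SQLite versions, but the second plan should name the index, which is the behavior the slide is recommending for random access.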
![Page 20: Not your Father's Database: Not Your Father’s Database: How to Use Apache® Spark™ Properly in your Big Data Architecture](https://reader031.vdocuments.net/reader031/viewer/2022030311/58ef89501a28aba94e8b45eb/html5/thumbnails/20.jpg)
Bad: Frequent Inserts
sqlContext.sql("insert into TABLE myTable select fields from my2ndTable")
Each insert creates a new file:
• Inserts are reasonably fast.
• But querying will be slow…
![Page 21: Not your Father's Database: Not Your Father’s Database: How to Use Apache® Spark™ Properly in your Big Data Architecture](https://reader031.vdocuments.net/reader031/viewer/2022030311/58ef89501a28aba94e8b45eb/html5/thumbnails/21.jpg)
Bad: Frequent Inserts
Solution:
• Option 1: Use a database to support the inserts.
• Option 2: Routinely compact your Spark SQL table files.
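Option 2 can be pictured with plain files. The sketch below is not Spark code: it uses only the Python standard library, and the directory layout, the part-*.jsonl naming, and the compact helper are hypothetical. It merges many small files (one per insert) into a single larger file so that a later reader opens one file instead of a hundred. In Spark, the same idea is usually expressed by reading the table and rewriting it with fewer partitions.

```python
import json
import tempfile
from pathlib import Path

def write_small_file(table_dir, index, rows):
    """Simulate one insert: each insert lands in its own small file."""
    path = table_dir / f"part-{index:05d}.jsonl"
    with path.open("w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return path

def compact(table_dir, output_name="compacted-00000.jsonl"):
    """Merge every part-*.jsonl file into one file, then delete the parts."""
    parts = sorted(table_dir.glob("part-*.jsonl"))
    target = table_dir / output_name
    with target.open("w") as out:
        for part in parts:
            with part.open() as f:
                for line in f:
                    out.write(line)
    for part in parts:
        part.unlink()
    return target

table_dir = Path(tempfile.mkdtemp())
for i in range(100):
    write_small_file(table_dir, i, [{"id": i, "value": i * i}])

files_before = len(list(table_dir.iterdir()))
compacted = compact(table_dir)
files_after = len(list(table_dir.iterdir()))
with compacted.open() as f:
    rows_after = [json.loads(line) for line in f]

print(files_before, "files before compaction,", files_after, "after")
```

A production compaction job would also need to write the new file somewhere temporary and swap it in safely while readers are active; the sketch leaves that out.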
![Page 22: Not your Father's Database: Not Your Father’s Database: How to Use Apache® Spark™ Properly in your Big Data Architecture](https://reader031.vdocuments.net/reader031/viewer/2022030311/58ef89501a28aba94e8b45eb/html5/thumbnails/22.jpg)
Good: Data Transformation/ETL
Use Spark to slice and dice your data files any way you like.
File storage is cheap, so storing duplicate copies of your data is not an “anti-pattern”.
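As one illustration of storing the same data more than once, the sketch below writes a small set of records under two directory layouts, one keyed by date and one keyed by country, so that each kind of query reads only the files it needs. It is plain standard-library Python rather than Spark, and the record fields, the by_date and by_country layout names, and both helper functions are assumptions made for the example.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path

def write_partitioned(records, root, key):
    """Write records under root/<key>=<value>/data.jsonl, one folder per value."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)
    for value, recs in groups.items():
        part_dir = root / f"{key}={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        with (part_dir / "data.jsonl").open("w") as f:
            for rec in recs:
                f.write(json.dumps(rec) + "\n")

def read_partition(root, key, value):
    """Read only the folder for one key value, skipping all other files."""
    with (root / f"{key}={value}" / "data.jsonl").open() as f:
        return [json.loads(line) for line in f]

# Hypothetical sample data.
records = [
    {"date": "2016-02-16", "country": "US", "clicks": 3},
    {"date": "2016-02-16", "country": "DE", "clicks": 5},
    {"date": "2016-02-17", "country": "US", "clicks": 7},
]

base = Path(tempfile.mkdtemp())
write_partitioned(records, base / "by_date", "date")        # copy 1
write_partitioned(records, base / "by_country", "country")  # copy 2

us_rows = read_partition(base / "by_country", "country", "US")
day_rows = read_partition(base / "by_date", "date", "2016-02-16")
print(len(us_rows), "US rows;", len(day_rows), "rows for 2016-02-16")
```

The trade-off is extra storage and a second write in exchange for cheaper reads, which matches the slide's point that duplication is acceptable when file storage is cheap.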
![Page 23: Not your Father's Database: Not Your Father’s Database: How to Use Apache® Spark™ Properly in your Big Data Architecture](https://reader031.vdocuments.net/reader031/viewer/2022030311/58ef89501a28aba94e8b45eb/html5/thumbnails/23.jpg)
Bad: Frequent/Incremental Updates
Update statements are not supported yet.
Why not?
• Random Access: Locate the row(s) in the files.
• Delete & Insert: Delete the old row and insert a new one.
• Update: File formats aren’t optimized for updating rows.
Solution: Many databases support efficient update operations.
![Page 24: Not your Father's Database: Not Your Father’s Database: How to Use Apache® Spark™ Properly in your Big Data Architecture](https://reader031.vdocuments.net/reader031/viewer/2022030311/58ef89501a28aba94e8b45eb/html5/thumbnails/24.jpg)
Bad: Frequent/Incremental Updates
Use Case: Up-to-date, live views of your SQL tables.
Tip: Use ClusterBy for fast joins, or bucketing with Spark 2.0.
Database Snapshot + Incremental SQL Query
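One way to read the "Database Snapshot + Incremental SQL Query" diagram: take a periodic full snapshot of the table, fetch only the rows changed since that snapshot, and let the newer rows win. The sketch below shows that merge step in plain Python. The row shape, the id and updated_at fields, and the merge_latest helper are hypothetical; in Spark the same step would be expressed as a union or join over the two datasets.

```python
def merge_latest(snapshot_rows, incremental_rows, key="id", version="updated_at"):
    """Combine a full snapshot with newer rows, keeping the latest row per key."""
    latest = {}
    for row in list(snapshot_rows) + list(incremental_rows):
        current = latest.get(row[key])
        # Keep the row with the highest version; on ties the later input wins.
        if current is None or row[version] >= current[version]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])

# Hypothetical data: a snapshot taken at time 100 and changes made afterwards.
snapshot = [
    {"id": 1, "name": "alice", "updated_at": 100},
    {"id": 2, "name": "bob", "updated_at": 100},
]
incremental = [
    {"id": 2, "name": "robert", "updated_at": 150},  # changed since the snapshot
    {"id": 3, "name": "carol", "updated_at": 160},   # new since the snapshot
]

live_view = merge_latest(snapshot, incremental)
for row in live_view:
    print(row)
```

Deleted rows are not handled here; a real pipeline would need some marker for them in the incremental query.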
![Page 25: Not your Father's Database: Not Your Father’s Database: How to Use Apache® Spark™ Properly in your Big Data Architecture](https://reader031.vdocuments.net/reader031/viewer/2022030311/58ef89501a28aba94e8b45eb/html5/thumbnails/25.jpg)
Good: Connecting BI Tools
Tip: Cache your tables for optimal performance.
HDFS
![Page 26: Not your Father's Database: Not Your Father’s Database: How to Use Apache® Spark™ Properly in your Big Data Architecture](https://reader031.vdocuments.net/reader031/viewer/2022030311/58ef89501a28aba94e8b45eb/html5/thumbnails/26.jpg)
Bad: External Reporting w/ load
Too many concurrent requests will start to queue up.
HDFS
![Page 27: Not your Father's Database: Not Your Father’s Database: How to Use Apache® Spark™ Properly in your Big Data Architecture](https://reader031.vdocuments.net/reader031/viewer/2022030311/58ef89501a28aba94e8b45eb/html5/thumbnails/27.jpg)
Bad: External Reporting w/ load
Solution: Write out to a DB as a cache to handle load.
HDFS
DB
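The pattern on this slide can be sketched as: compute the report once in a batch job, write the small result table to a database, and let the reporting front end query the database instead of the cluster. Below, a plain Python aggregation stands in for the Spark job and the built-in sqlite3 module stands in for the serving database. The events data, the daily_report table, and all three function names are assumptions for the example.

```python
import sqlite3
from collections import Counter

def run_batch_job(events):
    """Stand-in for the heavy batch job: count events per day."""
    return Counter(e["day"] for e in events)

def publish_report(conn, counts):
    """Replace the cached report table with the latest batch results."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_report (day TEXT PRIMARY KEY, n INTEGER)"
    )
    with conn:  # the delete and the inserts are committed together
        conn.execute("DELETE FROM daily_report")
        conn.executemany("INSERT INTO daily_report VALUES (?, ?)", counts.items())

def serve_request(conn, day):
    """Cheap keyed lookup that an external dashboard can issue many times."""
    row = conn.execute(
        "SELECT n FROM daily_report WHERE day = ?", (day,)
    ).fetchone()
    return row[0] if row else 0

# Hypothetical input data for the batch job.
events = [{"day": "2016-02-16"}, {"day": "2016-02-16"}, {"day": "2016-02-17"}]

conn = sqlite3.connect(":memory:")
publish_report(conn, run_batch_job(events))

# Many concurrent-style requests hit the database, not the batch engine.
answers = [serve_request(conn, "2016-02-16") for _ in range(1000)]
print(answers[0])
```

The batch job runs once per refresh, while the thousand lookups are served from the precomputed table, which is what keeps request load off the cluster.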
![Page 28: Not your Father's Database: Not Your Father’s Database: How to Use Apache® Spark™ Properly in your Big Data Architecture](https://reader031.vdocuments.net/reader031/viewer/2022030311/58ef89501a28aba94e8b45eb/html5/thumbnails/28.jpg)
Use Case: Advanced Analytics and Data Science
![Page 29: Not your Father's Database: Not Your Father’s Database: How to Use Apache® Spark™ Properly in your Big Data Architecture](https://reader031.vdocuments.net/reader031/viewer/2022030311/58ef89501a28aba94e8b45eb/html5/thumbnails/29.jpg)
Good: Machine Learning & Data Science
Use MLlib, GraphX and Spark packages for machine learning and data science.
Benefits:
• Built-in distributed algorithms.
• In-memory capabilities for iterative workloads.
• All-in-one solution: data cleansing, featurization, training, testing, serving, etc.
![Page 30: Not your Father's Database: Not Your Father’s Database: How to Use Apache® Spark™ Properly in your Big Data Architecture](https://reader031.vdocuments.net/reader031/viewer/2022030311/58ef89501a28aba94e8b45eb/html5/thumbnails/30.jpg)
Bad: Searching Content w/ load
sqlContext.sql("select * from mytable where name like '%xyz%'")
Spark will go through each row to find results.
![Page 31: Not your Father's Database: Not Your Father’s Database: How to Use Apache® Spark™ Properly in your Big Data Architecture](https://reader031.vdocuments.net/reader031/viewer/2022030311/58ef89501a28aba94e8b45eb/html5/thumbnails/31.jpg)
Use Case: Streaming and Realtime Analytics
![Page 32: Not your Father's Database: Not Your Father’s Database: How to Use Apache® Spark™ Properly in your Big Data Architecture](https://reader031.vdocuments.net/reader031/viewer/2022030311/58ef89501a28aba94e8b45eb/html5/thumbnails/32.jpg)
Good: Periodic Scheduled Jobs
Schedule your workloads to run on a regular basis:
• Launch a dedicated cluster for important workloads.
• Output your results as reports, or store them in files or a database.
• Poor Man’s Streaming: Spark is fast, so you can run the job at short, frequent intervals.
![Page 33: Not your Father's Database: Not Your Father’s Database: How to Use Apache® Spark™ Properly in your Big Data Architecture](https://reader031.vdocuments.net/reader031/viewer/2022030311/58ef89501a28aba94e8b45eb/html5/thumbnails/33.jpg)
Bad: Low Latency Stream Processing
Spark Streaming can detect new files dropped into a folder to process, but there is a delay to build up a whole file’s worth of data.
Solution: Send data to message queues, not files.
![Page 34: Not your Father's Database: Not Your Father’s Database: How to Use Apache® Spark™ Properly in your Big Data Architecture](https://reader031.vdocuments.net/reader031/viewer/2022030311/58ef89501a28aba94e8b45eb/html5/thumbnails/34.jpg)
Thank you
![Page 35: Not your Father's Database: Not Your Father’s Database: How to Use Apache® Spark™ Properly in your Big Data Architecture](https://reader031.vdocuments.net/reader031/viewer/2022030311/58ef89501a28aba94e8b45eb/html5/thumbnails/35.jpg)
Not Your Father’s Database: How to Use Apache Spark Properly in Your Big Data Architecture
Spark Summit East 2016