Conquering Big Data with Apache Spark (IEEE BigData 2015 keynote, November 1, 2015; bigdataieee.org/BigData2015)
TRANSCRIPT
Conquering Big Data with Apache Spark
Ion Stoica November 1st, 2015
UC BERKELEY
The Berkeley AMPLab
January 2011 – 2017 • 8 faculty • > 50 students • a 3-engineer software team
AMP = Algorithms, Machines, People
Organized for collaboration:
• 3-day retreats (twice a year)
• AMPCamp (since 2012): 400+ campers (100s of companies)
The Berkeley AMPLab
Governmental and industrial funding:
Goal: Next generation of open source data analytics stack for industry & academia:
Berkeley Data Analytics Stack (BDAS)
Generic Big Data Stack
Processing Layer
Resource Management Layer
Storage Layer
Hadoop Stack

• Processing Layer: Hadoop MR, Hive, Pig, Storm, Impala, Giraph, …
• Resource Management Layer: YARN
• Storage Layer: HDFS
BDAS Stack

• Processing Layer: Spark Core, Spark Streaming, SparkSQL, GraphX, MLlib, SparkR; plus MLBase, BlinkDB, SampleClean, Velox
• Resource Management Layer: Mesos (3rd party: Hadoop YARN)
• Storage Layer: Tachyon, Succinct (3rd party: HDFS, S3, Ceph, …)
Today's Talk

The same BDAS stack, highlighting the components covered in this talk: Spark Core, Spark Streaming, SparkSQL, GraphX, MLlib, and SparkR, running over Mesos/YARN and Tachyon/HDFS/S3.
Overview

1. Introduction
2. RDDs
3. Generality of RDDs (e.g. streaming)
4. DataFrames
5. Project Tungsten
A Short History
Started at UC Berkeley in 2009
Open source: 2010
Apache project: 2013
Today: most popular big data project
What Is Spark?

Parallel execution engine for big data processing

Easy to use: 2-5x less code than Hadoop MR
• High-level APIs in Python, Java, and Scala

Fast: up to 100x faster than Hadoop MR
• Can exploit in-memory data when available
• Low-overhead scheduling, optimized engine

General: supports multiple computation models
Analogy
First cellular phones
Unified device (smartphone)
Specialized devices
Analogy
First cellular phones
Unified device (smartphone)
Specialized devices
Better Games Better GPS Better Phone
Analogy
Batch processing Unified system Specialized systems
Analogy
Batch processing Unified system Specialized systems
Real-time analytics
Instant fraud detection
Better Apps
General: unifies batch and interactive computation
Spark Core
SparkSQL
General: unifies batch, interactive, and streaming computation
Spark Core
Spark Streaming SparkSQL
General: unifies batch, interactive, and streaming computation
Easy to build sophisticated applications
• Supports iterative, graph-parallel algorithms
• Powerful APIs in Scala, Python, Java, R
Spark Core
Spark Streaming SparkSQL MLlib GraphX SparkR
Easy to Write Code
WordCount in 50+ lines of Java MR
WordCount in 3 lines of Spark
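The transcript does not capture the code on this slide. As a rough sketch of what those three Spark lines compute, here is the same word count in plain Python (toy input invented for illustration; in Spark the input would come from sc.textFile and the steps would be flatMap, map, and reduceByKey running across a cluster):

```python
from collections import Counter

# Toy stand-in for the input file; Spark would read this with sc.textFile(...)
lines = ["to be or not to be", "to be is to do"]

# flatMap: split every line into words
words = [w for line in lines for w in line.split(" ")]

# map + reduceByKey: pair each word with 1, then sum the 1s per word
counts = Counter(words)

print(counts["to"])  # 4
```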
Fast: Time to Sort 100 TB

2013 record (Hadoop MR): 2100 machines, 72 minutes
2014 record (Spark): 207 machines, 23 minutes
Spark also sorted 1 PB in 4 hours
Source: Daytona GraySort benchmark, sortbenchmark.org
Community Growth
                       June 2014    June 2015
total contributors     255          730
contributors/month     75           135
lines of code          175,000      400,000
Meetup Groups: January 2015
source: meetup.com
Meetup Groups: October 2015
source: meetup.com
Community Growth

Summit attendees: 1100 (2014) → 3900 (2015)
Meetup members: 12K (2014) → 42K (2015)
Developers contributing: 350 (2014) → 600 (2015)
Large-Scale Usage
Largest cluster: 8000 nodes
Largest single job: 1 petabyte
Top streaming intake: 1 TB/hour
2014 on-disk sort record
Spark Ecosystem: Distributions and Applications
Overview

1. Introduction
2. RDDs
3. Generality of RDDs (e.g. streaming)
4. DataFrames
5. Project Tungsten
RDD: Resilient Distributed Datasets

Collections of objects distributed across a cluster
• Stored in RAM or on disk
• Automatically rebuilt on failure

Operations:
• Transformations
• Actions

Execution model: similar to SIMD
Operations on RDDs

Transformations: f(RDD) => RDD
• Lazy (not computed immediately)
• E.g., map, filter, groupBy

Actions:
• Trigger computation
• E.g., count, collect, saveAsTextFile
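The lazy/eager split above can be sketched with a toy class (invented for illustration, not real Spark): transformations only record work to be done, and an action replays the recorded operations over the data.

```python
# Toy illustration of lazy transformations vs. eager actions,
# loosely modeled on the RDD API described above (not real Spark).
class ToyRDD:
    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []          # recorded operations, not yet executed

    def map(self, f):                 # transformation: lazy
        return ToyRDD(self.data, self.ops + [("map", f)])

    def filter(self, p):              # transformation: lazy
        return ToyRDD(self.data, self.ops + [("filter", p)])

    def collect(self):                # action: triggers the computation
        out = self.data
        for kind, f in self.ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

    def count(self):                  # action
        return len(self.collect())

rdd = ToyRDD(range(10)).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
# Nothing has run yet; the ops list has merely been recorded.
print(rdd.count())    # 5
print(rdd.collect())  # [0, 4, 16, 36, 64]
```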
Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns.
lines = spark.textFile("hdfs://...")                      # Base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))    # Transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()           # Action

(Diagram: the driver sends tasks to three workers; each worker reads its HDFS block, processes and caches its partition of messages, and returns results to the driver.)

messages.filter(lambda s: "php" in s).count()

(Diagram: for the second query, the workers process straight from their caches, with no HDFS reads, and return results to the driver.)

Cache your data → Faster Results
Full-text search of Wikipedia
• 60 GB on 20 EC2 machines
• 0.5 sec from memory vs. 20 s on-disk
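Traced in plain Python (log contents invented for illustration; real Spark distributes these steps across the workers), the two queries reuse the same filtered-and-mapped messages:

```python
# Invented log lines in the slide's assumed format: LEVEL <tab> time <tab> message
log = [
    "INFO\t12:00\tstartup complete",
    "ERROR\t12:01\tmysql connection refused",
    "ERROR\t12:02\tphp fatal error",
    "ERROR\t12:03\tmysql timeout",
]

errors = [s for s in log if s.startswith("ERROR")]   # lines.filter(...)
messages = [s.split("\t")[2] for s in errors]        # errors.map(...)
# messages.cache() would pin this dataset in cluster memory;
# here it is simply a Python list reused by both queries.

print(sum("mysql" in s for s in messages))  # 2
print(sum("php" in s for s in messages))    # 1
```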
Language Support

Standalone programs: Python, Scala, & Java
Interactive shells: Python & Scala
Performance: Java & Scala are faster due to static typing, but Python is often fine

Python:
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala:
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("ERROR"); }
}).count();
Expressive API

map, reduce
Expressive API

map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...
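As a rough guide to the semantics of two of these operators, here is what reduceByKey and join compute, sketched over plain Python lists of key-value pairs (toy data invented for illustration; real Spark shuffles these by key across the cluster):

```python
from collections import defaultdict

# Toy key-value "RDDs"
a = [("x", 1), ("y", 2), ("x", 3)]
b = [("x", "p"), ("z", "q")]

# reduceByKey: combine values that share a key with a function (here, +)
sums = defaultdict(int)
for k, v in a:
    sums[k] += v
print(dict(sums))  # {'x': 4, 'y': 2}

# join: inner join on key, yielding (key, (left_value, right_value)) pairs
joined = [(k, (va, vb)) for k, va in a for kb, vb in b if k == kb]
print(joined)      # [('x', (1, 'p')), ('x', (3, 'p'))]
```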
Fault Recovery: Design Alternatives

Replication:
• Slow: need to write data over the network
• Memory inefficient

Backup on persistent storage:
• Persistent storage is still (much) slower than memory
• Still need to go over the network to protect against machine failures

Spark's choice: lineage, i.e. track the sequence of operations so that lost RDD partitions can be reconstructed efficiently
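A toy sketch of lineage-based recovery (names invented for illustration; real Spark tracks lineage per partition in the RDD DAG): each partition remembers its parent data and the operation that produced it, so a lost partition is recomputed rather than restored from a replica.

```python
# Each "partition" stores its parent data and the function that derived it.
class Partition:
    def __init__(self, parent_data, fn):
        self.parent_data = parent_data
        self.fn = fn                           # the lineage: how to rebuild
        self.data = fn(parent_data)            # materialized result

    def lose(self):
        self.data = None                       # simulate node failure

    def recover(self):
        self.data = self.fn(self.parent_data)  # replay lineage on another node

a1 = [1, 2, 3, 4]                              # parent partition, e.g. on disk
b1 = Partition(a1, lambda xs: [x for x in xs if x % 2 == 0])  # filter
b1.lose()
b1.recover()
print(b1.data)  # [2, 4]
```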
Fault Recovery Example

Two-partition RDD A = {A1, A2} stored on disk
1) filter and cache → RDD B
2) join → RDD C
3) aggregate → RDD D

(Diagram: A1 and A2 flow through filter to B1 and B2, join to C1 and C2, and aggregate to D.)
Fault Recovery Example

C1 is lost due to node failure before the aggregation finishes.

(Diagram: the same lineage graph, with C1 marked as lost.)
Fault Recovery Example

C1 is lost due to node failure before the aggregation finishes.
C1 is reconstructed, eventually, on a different node.

(Diagram: the lineage (filter, then join) is replayed on another node to rebuild C1, after which the aggregate D completes.)
Fault Recovery Results

Iteration time (s) across 10 iterations: 119, 57, 56, 58, 58, 81, 57, 59, 57, 59.
The failure happens at iteration 6 (the 81 s spike); subsequent iterations return to normal speed.
Overview

1. Introduction
2. RDDs
3. Generality of RDDs (e.g. streaming)
4. DataFrames
5. Project Tungsten
Spark Streaming: Motivation

Process large data streams at second-scale latencies
• Site statistics, intrusion detection, online ML

To build and scale these apps, users want:
• Integration with the offline analytical stack
• Fault-tolerance, both for crashes and stragglers
Traditional Streaming Systems

Event-driven, record-at-a-time processing
• Each node has mutable state
• For each record: update state & send new records

State is lost if a node dies.
Making stateful stream processing fault-tolerant is challenging.

(Diagram: input records flow into nodes 1-3, each holding mutable state.)
Spark Streaming

Data streams are chopped into batches
• A batch is an RDD holding a few 100s of ms worth of data
Each batch is processed in Spark

(Diagram: data streams → receivers → batches.)
Streaming: How Does It Work?

Data streams are chopped into batches
• A batch is an RDD holding a few 100s of ms worth of data
Each batch is processed in Spark; results are pushed out in batches

(Diagram: data streams → receivers → batches → results.)
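A minimal sketch of this micro-batch model in pure Python (stream contents and batch size invented for illustration; Spark Streaming actually forms an RDD per time interval and schedules it on the cluster):

```python
from collections import Counter

# Simulated stream of records arriving over time
stream = ["a b", "b c", "c c", "a a", "b b", "c a"]
BATCH = 2  # records per batch; Spark would use a time interval instead

results = []
for i in range(0, len(stream), BATCH):
    batch = stream[i:i + BATCH]                 # chop the stream into a batch
    words = [w for rec in batch for w in rec.split()]
    results.append(Counter(words))              # process the batch like an RDD

print(results[0]["b"])  # 2
```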
Streaming Word Count

val lines = ssc.socketTextStream("localhost", 9999)         // create DStream from data over socket
val words = lines.flatMap(_.split(" "))                     // split lines into words
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)  // count the words
wordCounts.print()                                          // print some counts on screen
ssc.start()                                                 // start processing the stream
![Page 62: Conquering Big Data with Apache Sparkbigdataieee.org/BigData2015/IEEEKeynote Nov 1 2015.pdfWhat Is Spark? Parallel execution engine for big data processing Easy to use: 2-5x less code](https://reader036.vdocuments.net/reader036/viewer/2022071106/5fe0953a22560f4d043e3187/html5/thumbnails/62.jpg)
Word Count
object NetworkWordCount {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
![Page 63: Conquering Big Data with Apache Sparkbigdataieee.org/BigData2015/IEEEKeynote Nov 1 2015.pdfWhat Is Spark? Parallel execution engine for big data processing Easy to use: 2-5x less code](https://reader036.vdocuments.net/reader036/viewer/2022071106/5fe0953a22560f4d043e3187/html5/thumbnails/63.jpg)
Word Count
object NetworkWordCount {
  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("NetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
Spark Streaming
public class WordCountTopology {
  public static class SplitSentence extends ShellBolt implements IRichBolt {
    public SplitSentence() {
      super("python", "splitsentence.py");
    }
    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word"));
    }
    @Override
    public Map<String, Object> getComponentConfiguration() {
      return null;
    }
  }

  public static class WordCount extends BaseBasicBolt {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
      String word = tuple.getString(0);
      Integer count = counts.get(word);
      if (count == null) count = 0;
      count++;
      counts.put(word, count);
      collector.emit(new Values(word, count));
    }
    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word", "count"));
    }
  }
Storm
  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("spout", new RandomSentenceSpout(), 5);
    builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
    builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
    Config conf = new Config();
    conf.setDebug(true);
    if (args != null && args.length > 0) {
      conf.setNumWorkers(3);
      StormSubmitter.submitTopologyWithProgressBar(args[0], conf, builder.createTopology());
    } else {
      conf.setMaxTaskParallelism(3);
      LocalCluster cluster = new LocalCluster();
      cluster.submitTopology("word-count", conf, builder.createTopology());
      Thread.sleep(10000);
      cluster.shutdown();
    }
  }
}
![Page 64: Conquering Big Data with Apache Sparkbigdataieee.org/BigData2015/IEEEKeynote Nov 1 2015.pdfWhat Is Spark? Parallel execution engine for big data processing Easy to use: 2-5x less code](https://reader036.vdocuments.net/reader036/viewer/2022071106/5fe0953a22560f4d043e3187/html5/thumbnails/64.jpg)
Machine Learning Pipelines

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
df = sqlCtx.load("/path/to/data")
model = pipeline.fit(df)
[Diagram: the Pipeline (tokenizer → hashingTF → lr) transforms df0 → df1 → df2 → df3, producing a Model containing lr.model]
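The fit-and-chain pattern the pipeline uses can be sketched without Spark; the Tokenizer and HashingTF classes below are much-simplified stand-ins for the MLlib stages of the same names:

```python
class Tokenizer:
    """Stand-in stage: split each line of text into words."""
    def fit(self, data):
        return self
    def transform(self, data):
        return [line.split() for line in data]

class HashingTF:
    """Stand-in stage: hash each word into a fixed-size count vector."""
    def __init__(self, num_features=16):
        self.n = num_features
    def fit(self, data):
        return self
    def transform(self, data):
        vecs = []
        for words in data:
            v = [0] * self.n
            for w in words:
                v[hash(w) % self.n] += 1
            vecs.append(v)
        return vecs

class Pipeline:
    """Fit each stage on the output of the previous stages, like df0 -> df1 -> df2."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, data):
        for s in self.stages:
            s.fit(data)
            data = s.transform(data)   # each stage's output feeds the next
        return self

pipe = Pipeline(stages=[Tokenizer(), HashingTF(8)]).fit(["spark is fast", "spark is general"])
```

The key design point mirrored here is that every stage shares one interface (fit/transform), so stages compose into a single estimator.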
![Page 65: Conquering Big Data with Apache Sparkbigdataieee.org/BigData2015/IEEEKeynote Nov 1 2015.pdfWhat Is Spark? Parallel execution engine for big data processing Easy to use: 2-5x less code](https://reader036.vdocuments.net/reader036/viewer/2022071106/5fe0953a22560f4d043e3187/html5/thumbnails/65.jpg)
Powerful Stack – Agile Development
[Bar chart: non-test, non-example source lines (axis 0–140,000) for Hadoop MapReduce, Storm (Streaming), Impala (SQL), Giraph (Graph), and Spark]
![Page 66: Conquering Big Data with Apache Sparkbigdataieee.org/BigData2015/IEEEKeynote Nov 1 2015.pdfWhat Is Spark? Parallel execution engine for big data processing Easy to use: 2-5x less code](https://reader036.vdocuments.net/reader036/viewer/2022071106/5fe0953a22560f4d043e3187/html5/thumbnails/66.jpg)
Powerful Stack – Agile Development
[Bar chart: non-test, non-example source lines (axis 0–140,000) for Hadoop MapReduce, Storm (Streaming), Impala (SQL), Giraph (Graph), and Spark, now with Streaming added to Spark's bar]
![Page 67: Conquering Big Data with Apache Sparkbigdataieee.org/BigData2015/IEEEKeynote Nov 1 2015.pdfWhat Is Spark? Parallel execution engine for big data processing Easy to use: 2-5x less code](https://reader036.vdocuments.net/reader036/viewer/2022071106/5fe0953a22560f4d043e3187/html5/thumbnails/67.jpg)
Powerful Stack – Agile Development
[Bar chart: non-test, non-example source lines (axis 0–140,000) for Hadoop MapReduce, Storm (Streaming), Impala (SQL), Giraph (Graph), and Spark, now with Streaming and SparkSQL added to Spark's bar]
![Page 68: Conquering Big Data with Apache Sparkbigdataieee.org/BigData2015/IEEEKeynote Nov 1 2015.pdfWhat Is Spark? Parallel execution engine for big data processing Easy to use: 2-5x less code](https://reader036.vdocuments.net/reader036/viewer/2022071106/5fe0953a22560f4d043e3187/html5/thumbnails/68.jpg)
Powerful Stack – Agile Development
[Bar chart: non-test, non-example source lines (axis 0–140,000) for Hadoop MapReduce, Storm (Streaming), Impala (SQL), Giraph (Graph), and Spark, now with Streaming, SparkSQL, and GraphX added to Spark's bar]
![Page 69: Conquering Big Data with Apache Sparkbigdataieee.org/BigData2015/IEEEKeynote Nov 1 2015.pdfWhat Is Spark? Parallel execution engine for big data processing Easy to use: 2-5x less code](https://reader036.vdocuments.net/reader036/viewer/2022071106/5fe0953a22560f4d043e3187/html5/thumbnails/69.jpg)
Powerful Stack – Agile Development
[Bar chart: non-test, non-example source lines (axis 0–140,000) for Hadoop MapReduce, Storm (Streaming), Impala (SQL), Giraph (Graph), and Spark with Streaming, SparkSQL, GraphX, and "Your App?"]
![Page 70: Conquering Big Data with Apache Sparkbigdataieee.org/BigData2015/IEEEKeynote Nov 1 2015.pdfWhat Is Spark? Parallel execution engine for big data processing Easy to use: 2-5x less code](https://reader036.vdocuments.net/reader036/viewer/2022071106/5fe0953a22560f4d043e3187/html5/thumbnails/70.jpg)
Benefits for Users
High-performance data sharing • Data sharing is the bottleneck in many environments • RDDs provide in-place sharing through memory
Applications can compose models • Run a SQL query and then PageRank the results • ETL your data and then run graph/ML on it
Benefit from investment in shared functionality • E.g. re-usable components (shell) and performance optimizations
![Page 71: Conquering Big Data with Apache Sparkbigdataieee.org/BigData2015/IEEEKeynote Nov 1 2015.pdfWhat Is Spark? Parallel execution engine for big data processing Easy to use: 2-5x less code](https://reader036.vdocuments.net/reader036/viewer/2022071106/5fe0953a22560f4d043e3187/html5/thumbnails/71.jpg)
Overview
1. Introduction 2. RDDs 3. Generality of RDDs (e.g. streaming) 4. DataFrames 5. Project Tungsten
![Page 72: Conquering Big Data with Apache Sparkbigdataieee.org/BigData2015/IEEEKeynote Nov 1 2015.pdfWhat Is Spark? Parallel execution engine for big data processing Easy to use: 2-5x less code](https://reader036.vdocuments.net/reader036/viewer/2022071106/5fe0953a22560f4d043e3187/html5/thumbnails/72.jpg)
Beyond Hadoop Users
[Diagram: Spark's early adopters are users who understand MapReduce and functional APIs; the broader user base includes data engineers, data scientists, statisticians, R users, PyData, …]
![Page 73: Conquering Big Data with Apache Sparkbigdataieee.org/BigData2015/IEEEKeynote Nov 1 2015.pdfWhat Is Spark? Parallel execution engine for big data processing Easy to use: 2-5x less code](https://reader036.vdocuments.net/reader036/viewer/2022071106/5fe0953a22560f4d043e3187/html5/thumbnails/73.jpg)
![Page 74: Conquering Big Data with Apache Sparkbigdataieee.org/BigData2015/IEEEKeynote Nov 1 2015.pdfWhat Is Spark? Parallel execution engine for big data processing Easy to use: 2-5x less code](https://reader036.vdocuments.net/reader036/viewer/2022071106/5fe0953a22560f4d043e3187/html5/thumbnails/74.jpg)
DataFrames in Spark
Distributed collection of data grouped into named columns (i.e. RDD with schema)
Domain-specific functions designed for common tasks • Metadata • Sampling • Project, filter, aggregation, join, … • UDFs
Available in Python, Scala, Java, and R
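A minimal sketch, in plain Python, of what project, filter, and aggregation mean on rows of named columns (the rows and column names are invented for illustration):

```python
rows = [
    {"name": "a", "dept": "x", "salary": 10},
    {"name": "b", "dept": "x", "salary": 20},
    {"name": "c", "dept": "y", "salary": 30},
]

# filter: keep only rows matching a predicate
high = [r for r in rows if r["salary"] > 15]

# project: keep only some named columns
names = [{"name": r["name"]} for r in high]

# aggregation: group by one column and sum another
totals = {}
for r in rows:
    totals[r["dept"]] = totals.get(r["dept"], 0) + r["salary"]
```

A DataFrame expresses the same operations declaratively over a distributed dataset, which is what lets the engine optimize them rather than executing them literally as above.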
![Page 75: Conquering Big Data with Apache Sparkbigdataieee.org/BigData2015/IEEEKeynote Nov 1 2015.pdfWhat Is Spark? Parallel execution engine for big data processing Easy to use: 2-5x less code](https://reader036.vdocuments.net/reader036/viewer/2022071106/5fe0953a22560f4d043e3187/html5/thumbnails/75.jpg)
Spark DataFrame
> head(filter(df, df$waiting < 50))  # an example in R
##  eruptions waiting
##1     1.750      47
##2     1.750      47
##3     1.867      48
APIs similar to single-node tools (Pandas, dplyr), i.e. easy to learn
![Page 76: Conquering Big Data with Apache Sparkbigdataieee.org/BigData2015/IEEEKeynote Nov 1 2015.pdfWhat Is Spark? Parallel execution engine for big data processing Easy to use: 2-5x less code](https://reader036.vdocuments.net/reader036/viewer/2022071106/5fe0953a22560f4d043e3187/html5/thumbnails/76.jpg)
Spark RDD Execution
[Diagram: a Java/Scala frontend runs on a JVM backend, and a Python frontend on a Python backend; user-defined functions are opaque closures to the engine]
![Page 77: Conquering Big Data with Apache Sparkbigdataieee.org/BigData2015/IEEEKeynote Nov 1 2015.pdfWhat Is Spark? Parallel execution engine for big data processing Easy to use: 2-5x less code](https://reader036.vdocuments.net/reader036/viewer/2022071106/5fe0953a22560f4d043e3187/html5/thumbnails/77.jpg)
Spark DataFrame Execution
[Diagram: DataFrame frontend → logical plan (an intermediate representation for the computation) → Catalyst optimizer → physical execution]
![Page 78: Conquering Big Data with Apache Sparkbigdataieee.org/BigData2015/IEEEKeynote Nov 1 2015.pdfWhat Is Spark? Parallel execution engine for big data processing Easy to use: 2-5x less code](https://reader036.vdocuments.net/reader036/viewer/2022071106/5fe0953a22560f4d043e3187/html5/thumbnails/78.jpg)
Spark DataFrame Execution
[Diagram: Python, Java/Scala, and R DataFrames are simple wrappers that create a logical plan (the intermediate representation for the computation), which the Catalyst optimizer compiles to physical execution]
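One way to picture the logical plan as an intermediate representation is a tiny plan tree plus a single rewrite rule; this is a much-simplified, invented stand-in for what Catalyst does, not its actual encoding:

```python
# Plan nodes encoded as tuples:
#   ("scan", table) | ("filter", predicate, child) | ("project", cols, child)

def push_filter_below_project(plan):
    """One Catalyst-style rewrite: Filter(Project(x)) -> Project(Filter(x)),
    so the predicate runs before the projection (valid when the predicate
    only uses columns the projection keeps)."""
    if plan[0] == "filter" and plan[2][0] == "project":
        pred, (_, cols, child) = plan[1], plan[2]
        return ("project", cols, ("filter", pred, child))
    return plan

plan = ("filter", "age > 21", ("project", ["name", "age"], ("scan", "people")))
optimized = push_filter_below_project(plan)
```

Because every language frontend emits the same plan trees, one optimizer serves all of them, which is the point of the slide above.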
![Page 79: Conquering Big Data with Apache Sparkbigdataieee.org/BigData2015/IEEEKeynote Nov 1 2015.pdfWhat Is Spark? Parallel execution engine for big data processing Easy to use: 2-5x less code](https://reader036.vdocuments.net/reader036/viewer/2022071106/5fe0953a22560f4d043e3187/html5/thumbnails/79.jpg)
Benefit of Logical Plan: Simpler Frontend
Python: ~2,000 lines of code (built over a weekend)
R: ~1,000 lines of code
I.e., much easier to add new language bindings (Julia, Clojure, …)
![Page 80: Conquering Big Data with Apache Sparkbigdataieee.org/BigData2015/IEEEKeynote Nov 1 2015.pdfWhat Is Spark? Parallel execution engine for big data processing Easy to use: 2-5x less code](https://reader036.vdocuments.net/reader036/viewer/2022071106/5fe0953a22560f4d043e3187/html5/thumbnails/80.jpg)
Performance
[Bar chart: runtime (secs, axis 0–10) for an example aggregation workload with the RDD API, Java/Scala vs Python]
![Page 81: Conquering Big Data with Apache Sparkbigdataieee.org/BigData2015/IEEEKeynote Nov 1 2015.pdfWhat Is Spark? Parallel execution engine for big data processing Easy to use: 2-5x less code](https://reader036.vdocuments.net/reader036/viewer/2022071106/5fe0953a22560f4d043e3187/html5/thumbnails/81.jpg)
Benefit of Logical Plan: Performance Parity Across Languages
[Bar chart: runtime (secs, axis 0–10) for an example aggregation workload; RDD API (Java/Scala, Python) vs DataFrame API (Java/Scala, Python, R, SQL)]
![Page 82: Conquering Big Data with Apache Sparkbigdataieee.org/BigData2015/IEEEKeynote Nov 1 2015.pdfWhat Is Spark? Parallel execution engine for big data processing Easy to use: 2-5x less code](https://reader036.vdocuments.net/reader036/viewer/2022071106/5fe0953a22560f4d043e3187/html5/thumbnails/82.jpg)
Overview
1. Introduction 2. RDDs 3. Generality of RDDs (e.g. streaming) 4. DataFrames 5. Project Tungsten
![Page 83: Conquering Big Data with Apache Sparkbigdataieee.org/BigData2015/IEEEKeynote Nov 1 2015.pdfWhat Is Spark? Parallel execution engine for big data processing Easy to use: 2-5x less code](https://reader036.vdocuments.net/reader036/viewer/2022071106/5fe0953a22560f4d043e3187/html5/thumbnails/83.jpg)
Hardware Trends
Storage
Network
CPU
![Page 84: Conquering Big Data with Apache Sparkbigdataieee.org/BigData2015/IEEEKeynote Nov 1 2015.pdfWhat Is Spark? Parallel execution engine for big data processing Easy to use: 2-5x less code](https://reader036.vdocuments.net/reader036/viewer/2022071106/5fe0953a22560f4d043e3187/html5/thumbnails/84.jpg)
Hardware Trends
            2010
Storage     50+MB/s (HDD)
Network     1Gbps
CPU         ~3GHz
![Page 85: Conquering Big Data with Apache Sparkbigdataieee.org/BigData2015/IEEEKeynote Nov 1 2015.pdfWhat Is Spark? Parallel execution engine for big data processing Easy to use: 2-5x less code](https://reader036.vdocuments.net/reader036/viewer/2022071106/5fe0953a22560f4d043e3187/html5/thumbnails/85.jpg)
Hardware Trends
            2010            2015
Storage     50+MB/s (HDD)   500+MB/s (SSD)
Network     1Gbps           10Gbps
CPU         ~3GHz           ~3GHz
![Page 86: Conquering Big Data with Apache Sparkbigdataieee.org/BigData2015/IEEEKeynote Nov 1 2015.pdfWhat Is Spark? Parallel execution engine for big data processing Easy to use: 2-5x less code](https://reader036.vdocuments.net/reader036/viewer/2022071106/5fe0953a22560f4d043e3187/html5/thumbnails/86.jpg)
Hardware Trends
            2010            2015             Change
Storage     50+MB/s (HDD)   500+MB/s (SSD)   10X
Network     1Gbps           10Gbps           10X
CPU         ~3GHz           ~3GHz            ~1X
![Page 87: Conquering Big Data with Apache Sparkbigdataieee.org/BigData2015/IEEEKeynote Nov 1 2015.pdfWhat Is Spark? Parallel execution engine for big data processing Easy to use: 2-5x less code](https://reader036.vdocuments.net/reader036/viewer/2022071106/5fe0953a22560f4d043e3187/html5/thumbnails/87.jpg)
Project Tungsten
Substantially speed up execution by optimizing CPU efficiency, via: (1) Runtime code generation (2) Exploiting cache locality (3) Off-heap memory management
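A toy illustration of (1), runtime code generation, in plain Python: compile a predicate into a specialized function once, instead of interpreting an expression tree for every row (the expression and column names are invented; Tungsten generates JVM bytecode, not Python):

```python
def compile_predicate(expr):
    """Generate and compile a specialized row-predicate function at runtime,
    avoiding per-row interpretation overhead (a toy analogue of Tungsten's
    code generation)."""
    src = "def pred(row):\n    return " + expr + "\n"
    ns = {}
    exec(src, ns)          # compile the generated source once
    return ns["pred"]

pred = compile_predicate("row['age'] > 21 and row['dept'] == 'x'")
rows = [{"age": 30, "dept": "x"}, {"age": 18, "dept": "x"}]
matches = [r for r in rows if pred(r)]
```

The generated function pays the compilation cost once and then runs as straight-line code over every row, which is where the CPU-efficiency win comes from.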
![Page 88: Conquering Big Data with Apache Sparkbigdataieee.org/BigData2015/IEEEKeynote Nov 1 2015.pdfWhat Is Spark? Parallel execution engine for big data processing Easy to use: 2-5x less code](https://reader036.vdocuments.net/reader036/viewer/2022071106/5fe0953a22560f4d043e3187/html5/thumbnails/88.jpg)
From DataFrame to Tungsten
[Diagram: Python, Java/Scala, and R DataFrames → logical plan → Tungsten execution]
Initial phase in Spark 1.5; more work coming in 2016
![Page 89: Conquering Big Data with Apache Sparkbigdataieee.org/BigData2015/IEEEKeynote Nov 1 2015.pdfWhat Is Spark? Parallel execution engine for big data processing Easy to use: 2-5x less code](https://reader036.vdocuments.net/reader036/viewer/2022071106/5fe0953a22560f4d043e3187/html5/thumbnails/89.jpg)
Project Tungsten: Fully Managed Memory
Spark’s core API uses raw Java objects for aggregations and joins • GC overhead • Memory overhead: 4-8x more memory than serialized format • Computation overhead: little memory locality
DataFrames use a custom binary format and off-heap managed memory
• GC free • No memory overhead • Cache locality
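The binary-format point can be illustrated with Python's struct module: fixed-width rows packed into one contiguous buffer, rather than one heap object per field (the row layout here is invented, not Tungsten's actual format):

```python
import struct

ROW = struct.Struct("<id")   # one row: 4-byte int id, 8-byte float value

def pack_rows(rows):
    """Pack (id, value) rows into one contiguous bytes buffer:
    no per-row objects, predictable size, cache-friendly scans."""
    buf = bytearray(ROW.size * len(rows))
    for i, (rid, val) in enumerate(rows):
        ROW.pack_into(buf, i * ROW.size, rid, val)
    return bytes(buf)

def scan_sum(buf):
    # aggregate by walking the buffer directly, never materializing row objects
    return sum(val for _, val in ROW.iter_unpack(buf))

buf = pack_rows([(1, 2.5), (2, 4.5)])
total = scan_sum(buf)
```

A garbage collector never sees these rows, and their size is exactly the payload, which is the "GC free" and "no memory overhead" claim above.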
![Page 90: Conquering Big Data with Apache Sparkbigdataieee.org/BigData2015/IEEEKeynote Nov 1 2015.pdfWhat Is Spark? Parallel execution engine for big data processing Easy to use: 2-5x less code](https://reader036.vdocuments.net/reader036/viewer/2022071106/5fe0953a22560f4d043e3187/html5/thumbnails/90.jpg)
Example: Hash Table Data Structure
Keep data close to the CPU cache
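A sketch of the idea in plain Python: an open-addressed table stored in flat parallel arrays probes adjacent slots, instead of chasing pointers through scattered chain nodes (a simplified analogue for illustration, not Tungsten's actual data structure):

```python
class FlatHashTable:
    """Open-addressed hash table in flat parallel arrays, so probes walk
    contiguous memory; a chained table would dereference scattered nodes."""
    def __init__(self, capacity=64):
        self.keys = [None] * capacity
        self.vals = [0] * capacity
        self.cap = capacity

    def _slot(self, key):
        i = hash(key) % self.cap
        while self.keys[i] is not None and self.keys[i] != key:
            i = (i + 1) % self.cap     # linear probing: next adjacent slot
        return i

    def add(self, key, amount):
        i = self._slot(key)
        if self.keys[i] is None:
            self.keys[i] = key
        self.vals[i] += amount

    def get(self, key):
        return self.vals[self._slot(key)]

t = FlatHashTable()
for w in "a b a c a".split():
    t.add(w, 1)
```

In a language with value types, the adjacent slots share cache lines, so linear probing stays in cache where pointer chasing would miss.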
![Page 91: Conquering Big Data with Apache Sparkbigdataieee.org/BigData2015/IEEEKeynote Nov 1 2015.pdfWhat Is Spark? Parallel execution engine for big data processing Easy to use: 2-5x less code](https://reader036.vdocuments.net/reader036/viewer/2022071106/5fe0953a22560f4d043e3187/html5/thumbnails/91.jpg)
Example: Aggregation Operation
![Page 92: Conquering Big Data with Apache Sparkbigdataieee.org/BigData2015/IEEEKeynote Nov 1 2015.pdfWhat Is Spark? Parallel execution engine for big data processing Easy to use: 2-5x less code](https://reader036.vdocuments.net/reader036/viewer/2022071106/5fe0953a22560f4d043e3187/html5/thumbnails/92.jpg)
Unified API, One Engine, Automatically Optimized
[Diagram: language frontends (Python, Java/Scala, R, SQL, …) → DataFrame logical plan → Tungsten backend (LLVM, JVM, SIMD, 3D XPoint, …)]
![Page 93: Conquering Big Data with Apache Sparkbigdataieee.org/BigData2015/IEEEKeynote Nov 1 2015.pdfWhat Is Spark? Parallel execution engine for big data processing Easy to use: 2-5x less code](https://reader036.vdocuments.net/reader036/viewer/2022071106/5fe0953a22560f4d043e3187/html5/thumbnails/93.jpg)
Refactoring Spark Core
[Diagram: the refactored stack layers Python, SQL, SparkR, Streaming, and Advanced Analytics on top of DataFrame (& Dataset), which runs on Tungsten Execution]
![Page 94: Conquering Big Data with Apache Sparkbigdataieee.org/BigData2015/IEEEKeynote Nov 1 2015.pdfWhat Is Spark? Parallel execution engine for big data processing Easy to use: 2-5x less code](https://reader036.vdocuments.net/reader036/viewer/2022071106/5fe0953a22560f4d043e3187/html5/thumbnails/94.jpg)
Summary
General engine with libraries for many data analysis tasks
Access to diverse data sources
Simple, unified API
Major focus going forward: • Ease of use (DataFrames) • Performance (Tungsten)
[Diagram: libraries for SQL, Streaming, ML, Graph, …]