Spark 2
Alexey Zinovyev, Java/Big Data Trainer at EPAM
About
With IT since 2007
With Java since 2009
With Hadoop since 2012
With EPAM since 2015
Big Data Training
Secret Word from EPAM
itsubbotnik
Contacts
E-mail : [email protected]
Twitter : @zaleslaw @BigDataRussia
Facebook: https://www.facebook.com/zaleslaw
vk.com/big_data_russia Big Data Russia
vk.com/java_jvm Java & JVM langs
Joker’16: Spark 2 from Zinoviev Alexey
Sprk Dvlprs! Let’s start!
< SPARK 2.0
Modern Java in 2016
Big Data in 2014
Big Data in 2017
Machine Learning EVERYWHERE
Machine Learning vs Traditional Programming
Something wrong with HADOOP
Hadoop is not SEXY
Whaaaat?
Map Reduce Job Writing
MR code
Hadoop Developers Right Now
Iterative Calculations
10x – 100x
MapReduce vs Spark
MapReduce vs Spark
MapReduce vs Spark
SPARK 2.0 DISCUSSION
Spark
Family
Spark
Family
Case #0 : How to avoid DStreams with RDD-like API?
Continuous Applications
Continuous Applications cases
• Updating data that will be served in real time
• Extract, transform and load (ETL)
• Creating a real-time version of an existing batch job
• Online machine learning
Write Batches
Catch Streaming
The main concept of Structured Streaming
You can express your streaming computation the
same way you would express a batch computation
on static data.
Batch
// Read JSON once from S3
val logsDF = spark.read.json("s3://logs")

// Transform with the DataFrame API and save
logsDF.select("user", "url", "date")
  .write.parquet("s3://out")
Real
Time
// Read JSON continuously from S3
val logsDF = spark.readStream.json("s3://logs")

// Transform with the DataFrame API and save continuously
logsDF.select("user", "url", "date")
  .writeStream
  .format("parquet")
  .option("path", "s3://out")
  .start()
Unbounded Table
WordCount
from
Socket
val lines = spark.readStream
.format("socket")
.option("host", "localhost")
.option("port", 9999)
.load()
val words = lines.as[String].flatMap(_.split(" "))
val wordCounts = words.groupBy("value").count()
Don’t forget to start the streaming query with start(): nothing runs until you do.
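The full loop can be sketched like this. It continues the socket WordCount from the slide above and assumes the `spark` session and the `wordCounts` Dataset defined there; the console sink prints each micro-batch result:

```scala
// Sketch only: assumes `spark` and `wordCounts` from the slide above,
// plus a process feeding text into localhost:9999.
import org.apache.spark.sql.streaming.OutputMode

val query = wordCounts.writeStream
  .outputMode(OutputMode.Complete()) // re-emit the whole counts table
  .format("console")                 // print each micro-batch result
  .start()                           // nothing happens before start()

query.awaitTermination()
```

Complete mode fits an aggregation like a running word count; an unaggregated stream would use Append mode instead.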
WordCount with Structured Streaming
Structured Streaming provides …
• fast & scalable
• fault-tolerant
• end-to-end exactly-once semantics
• stream processing
• the ability to use the DataFrame/Dataset API for streaming
Structured Streaming provides (in dreams) …
• fast & scalable
• fault-tolerant
• end-to-end exactly-once semantics
• stream processing
• the ability to use the DataFrame/Dataset API for streaming
Let’s UNION streaming and static DataSets
Let’s UNION streaming and static DataSets
org.apache.spark.sql.AnalysisException:
Union between streaming and batch DataFrames/Datasets is not supported;
Let’s UNION streaming and static DataSets
Go to UnsupportedOperationChecker.scala and check your operation
Case #1 : We should think about optimization in RDD terms
Single
Thread
collection
No perf
issues,
right?
The main concept
more partitions = more parallelism
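A rough sketch of the rule (a local-mode illustration of my own, assuming a SparkSession named `spark`):

```scala
// Sketch: assumes a SparkSession `spark` running in local mode.
val rdd = spark.sparkContext.parallelize(1 to 1000000, 4)
println(rdd.getNumPartitions) // 4 partitions -> at most 4 parallel tasks

// More partitions -> more tasks that can run in parallel
val wider = rdd.repartition(16) // full shuffle up to 16 partitions
val fewer = wider.coalesce(8)   // shrink without a full shuffle
```

The partition count caps parallelism: with 4 partitions, at most 4 tasks run at once no matter how many cores the cluster has.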
Do it
parallel
I’d like
NARROW
Map, filter, filter
GroupByKey, join
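The narrow/wide distinction can be sketched with pair-RDD operations (an illustration of my own, not the slide's code; assumes a SparkSession `spark`):

```scala
// Narrow: each output partition depends on one input partition.
// Wide: groupByKey/join need a shuffle across the cluster.
val pairs = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

val narrow = pairs.mapValues(_ * 2).filter(_._2 > 2) // no data movement
val wide   = pairs.groupByKey()                      // shuffle boundary
// reduceByKey also shuffles, but combines on the map side first
val better = pairs.reduceByKey(_ + _)
```

Preferring map-side combining (reduceByKey) over a raw groupByKey keeps the amount of shuffled data small.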
Case #2 : DataFrames suggest mixing SQL and Scala functions
History of Spark APIs
Four API styles
rdd.filter(_.age > 21) // RDD
df.filter("age > 21") // DataFrame SQL-style
df.filter(df.col("age").gt(21)) // Expression style
dataset.filter(_.age > 21) // Dataset API
Case #2 : A DataFrame refers to data attributes by name
DataSet = RDD’s types + DataFrame’s Catalyst
• RDD API
• compile-time type-safety
• off-heap storage mechanism
• performance benefits of the Catalyst query optimizer
• Tungsten
Structured APIs in SPARK
Unified API in Spark 2.0
DataFrame = Dataset[Row]
A DataFrame is now an untyped Dataset: it still carries a schema, but its rows are generic Row objects
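A hedged sketch of what the alias means in practice (the names `Person`, `df` and `ds` are mine; assumes a SparkSession `spark`):

```scala
// DataFrame is a type alias for Dataset[Row] in Spark 2.0.
import org.apache.spark.sql.{DataFrame, Dataset, Row}

case class Person(name: String, age: Long)
import spark.implicits._

val df: DataFrame = Seq(Person("Ann", 30)).toDF // untyped: Dataset[Row]
val ds: Dataset[Person] = df.as[Person]         // typed view, same data
val r: Row = df.first()    // fields by name/index, checked at runtime
val p: Person = ds.first() // fields via the case class, checked at compile time
```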
Define
case class
case class User(email: String, footSize: Long, name: String)

// DataFrame -> Dataset of Users
val userDS =
  spark.read.json("/home/tmp/datasets/users.json").as[User]

userDS.map(_.name).collect()
userDS.filter(_.footSize > 38).collect()

userDS.rdd // IF YOU REALLY WANT
Case #3 : Spark has many contexts
Spark Session
• New entry point in Spark for creating Datasets
• Replaces SQLContext, HiveContext and StreamingContext
• The move from SparkContext to SparkSession signifies a move away from RDDs
Spark
Session
val sparkSession = SparkSession.builder
.master("local")
.appName("spark session example")
.getOrCreate()
val df = sparkSession.read
.option("header","true")
.csv("src/main/resources/names.csv")
df.show()
No, I want to create my lovely RDDs
Where’s parallelize() method?
RDD?
case class User(email: String, footSize: Long, name: String)

val userDS =
  spark.read.json("/home/tmp/datasets/users.json").as[User]

userDS.map(_.name).collect()
userDS.filter(_.footSize > 38).collect()

userDS.rdd // IF YOU REALLY WANT
// spark.sparkContext.parallelize(...) is also still available
Case #4 : Spark uses Java serialization A LOT
Two choices to distribute data across the cluster
• Java serialization
By default, with ObjectOutputStream
• Kryo serialization
Requires registering classes (doesn’t rely on Serializable)
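Switching the serializer is a SparkConf setting; a minimal sketch (the registered classes `MyEvent` and `MyKey` are hypothetical stand-ins for your own types):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registration is recommended: registered classes are written as
  // small numeric ids instead of full class names
  .registerKryoClasses(Array(classOf[MyEvent], classOf[MyKey]))
```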
The main problem: overhead of serializing
Each serialized object contains the class structure as
well as the values
Don’t forget about GC
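The overhead is easy to see with plain JDK serialization; a small runnable sketch (the `Point` class is mine):

```scala
// Java serialization writes class metadata alongside the values,
// so the stream is much bigger than the 16 bytes of raw doubles.
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

case class Point(x: Double, y: Double) // 2 * 8 = 16 bytes of actual data

val bos = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(bos)
oos.writeObject(Point(1.0, 2.0))
oos.close()

println(s"raw fields: 16 bytes, serialized: ${bos.size} bytes")
```

Every such on-heap object also has to be traced by the garbage collector, which is the second cost the slide alludes to.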
Tungsten Compact Encoding
Encoder’s concept
Generate bytecode to interact with off-heap memory
&
give access to attributes without ser/deser
Encoders
No custom encoders
Case #5 : Not enough storage levels
Caching in Spark
• A frequently used RDD can be stored in memory
• One method, one shortcut: persist(), cache()
• SparkContext keeps track of cached RDDs
• Stored as serialized or deserialized Java objects
Full list of storage levels (with defaults)
• MEMORY_ONLY (default for Spark Core)
• MEMORY_AND_DISK
• MEMORY_ONLY_SER (default for Spark Streaming)
• MEMORY_AND_DISK_SER
• DISK_ONLY
• MEMORY_ONLY_2, MEMORY_AND_DISK_2
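Any of these levels can be requested explicitly through persist(); a short sketch (the input path is a hypothetical example, and a SparkSession `spark` is assumed):

```scala
import org.apache.spark.storage.StorageLevel

val logs = spark.sparkContext.textFile("/data/logs") // hypothetical path

logs.persist(StorageLevel.MEMORY_AND_DISK_SER) // serialized, spills to disk
// cache() is just shorthand for persist(StorageLevel.MEMORY_ONLY)
```

The _SER levels trade CPU (ser/deser on access) for a much smaller memory footprint; the _2 levels replicate each partition on two nodes.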
Developer API to make new Storage Levels
What’s the most popular file format in BigData?
Case #6 : External libraries to read CSV
Easy to
read CSV
val data = sqlContext.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/datasets/samples/users.csv")

data.cache()
data.createOrReplaceTempView("users")
display(data) // display() is a Databricks notebook helper
Case #7 : How to measure Spark performance?
You should measure performance!
TPC-DS
99 Queries
http://bit.ly/2dObMsH
How to benchmark Spark
Special Tool from Databricks
Benchmark Tool for SparkSQL
https://github.com/databricks/spark-sql-perf
Spark 2 vs Spark 1.6
Case #8 : What’s faster: SQL or DataSet API?
Job Stages in old Spark
Scheduler Optimizations
Catalyst Optimizer for DataFrames
Unified Logical Plan
DataFrame = Dataset[Row]
Bytecode
DataSet.explain()
== Physical Plan ==
Project [avg(price)#43,carat#45]
+- SortMergeJoin [color#21], [color#47]
   :- Sort [color#21 ASC], false, 0
   :  +- TungstenExchange hashpartitioning(color#21,200), None
   :     +- Project [avg(price)#43,color#21]
   :        +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Final,isDistinct=false)], output=[color#21,avg(price)#43])
   :           +- TungstenExchange hashpartitioning(cut#20,color#21,200), None
   :              +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Partial,isDistinct=false)], output=[cut#20,color#21,sum#58,count#59L])
   :                 +- Scan CsvRelation(-----)
   +- Sort [color#47 ASC], false, 0
      +- TungstenExchange hashpartitioning(color#47,200), None
         +- ConvertToUnsafe
            +- Scan CsvRelation(----)
Case #9 : Why does explain() show so many Tungsten things?
How to be effective with CPU
• Runtime code generation
• Exploiting cache locality
• Off-heap memory management
Tungsten’s goal
Push performance closer to the limits of modern
hardware
Maybe something UNSAFE?
UnsafeRow format
• Bit set for tracking null values
• Small values are inlined
• Variable-length values store a relative offset into the variable-length data section
• Rows are always 8-byte word aligned
• Equality comparison and hashing can be performed on raw bytes without requiring additional interpretation
Case #10 : Can I influence memory management in Spark?
Case #11 : Should I tune the GC generations?
Cached
Data
During
operations
For your
needs
For Dark
Lord
IN CONCLUSION
What we still don’t have…
• the ability to join structured streams with other sources
• one unified ML API
• a GraphX rethinking and redesign
• custom encoders
• Datasets everywhere
• integration with something important
Roadmap
• Support other data sources (not only S3 + HDFS)
• Transactional updates
• Dataset is one DSL for all operations
• GraphFrames + Structured MLLib
• Tungsten: custom encoders
• The RDD-based API is expected to be removed in Spark 3.0
And we can DO IT!
[Architecture diagram] Internal, external and social sources feed real-time data ingestion (Real-Time ETL & CEP) and batch data ingestion (Batch ETL & Raw Area with a Scheduler; HDFS → CFS as an option; all raw data backup is stored here). Downstream: real-time data-marts, batch data-marts, relations graph, ontology metadata, search index, time-series data (Titan & KairosDB store data in Cassandra), real-time dashboarding, and pushed events & alarms (Email, SNMP etc.).
First Part
Second
Part
Contacts
E-mail : [email protected]
Twitter : @zaleslaw @BigDataRussia
Facebook: https://www.facebook.com/zaleslaw
vk.com/big_data_russia Big Data Russia
vk.com/java_jvm Java & JVM langs
Any questions?