
Spark 2

Alexey Zinovyev, Java/Big Data Trainer at EPAM

About

With IT since 2007

With Java since 2009

With Hadoop since 2012

With EPAM since 2015


Secret Word from EPAM

itsubbotnik


Contacts

E-mail : Alexey_Zinovyev@epam.com

Twitter : @zaleslaw @BigDataRussia

Facebook: https://www.facebook.com/zaleslaw

vk.com/big_data_russia Big Data Russia

vk.com/java_jvm Java & JVM langs


Sprk Dvlprs! Let’s start!


< SPARK 2.0


Modern Java in 2016

Big Data in 2014


Big Data in 2017


Machine Learning EVERYWHERE


Machine Learning vs Traditional Programming


Something is wrong with HADOOP


Hadoop is not SEXY


Whaaaat?


Writing a MapReduce Job


MR code


Hadoop Developers Right Now


Iterative calculations: 10x – 100x speed-up


MapReduce vs Spark


SPARK 2.0 DISCUSSION


Spark Family


Case #0 : How to avoid DStreams and their RDD-like API?


Continuous Applications


Continuous Applications cases

• Updating data that will be served in real time

• Extract, transform and load (ETL)

• Creating a real-time version of an existing batch job

• Online machine learning


Write Batches


Catch Streaming


The main concept of Structured Streaming

You can express your streaming computation the same way you would express a batch computation on static data.


Batch

// Read JSON once from S3
val logsDF = spark.read.json("s3://logs")

// Transform with DataFrame API and save
logsDF.select("user", "url", "date")
  .write.parquet("s3://out")


Real Time

// Read JSON continuously from S3
val logsDF = spark.readStream.json("s3://logs")

// Transform with DataFrame API and save continuously
logsDF.select("user", "url", "date")
  .writeStream
  .format("parquet")
  .option("path", "s3://out")
  .option("checkpointLocation", "s3://out/_checkpoints")  // file sinks need a checkpoint location
  .start()


Unlimited Table


WordCount from Socket

// import spark.implicits._ is assumed for .as[String]
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val words = lines.as[String].flatMap(_.split(" "))

val wordCounts = words.groupBy("value").count()


Don’t forget to start the streaming query.
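The piece that is easy to forget: nothing runs until a sink is attached and the query is started. A minimal sketch, assuming the wordCounts Dataset from above and that console output is fine for the demo:

val query = wordCounts.writeStream
  .outputMode("complete")   // emit the full recomputed counts on every trigger
  .format("console")        // demo sink; any supported sink would do
  .start()

query.awaitTermination()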


WordCount with Structured Streaming


Structured Streaming provides …

• fast & scalable
• fault-tolerant
• end-to-end exactly-once semantics
• stream processing
• the ability to use the DataFrame/Dataset API for streaming


Structured Streaming provides (in dreams) …

• fast & scalable
• fault-tolerant
• end-to-end exactly-once semantics
• stream processing
• the ability to use the DataFrame/Dataset API for streaming


Let’s UNION streaming and static DataSets

org.apache.spark.sql.AnalysisException: Union between streaming and batch DataFrames/Datasets is not supported;

Go to UnsupportedOperationChecker.scala and check your operation.
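For reference, a minimal sketch (paths and names are illustrative) of the kind of union that trips this check in Spark 2.0:

import spark.implicits._

val staticDS = spark.read.textFile("/tmp/static.txt")   // batch Dataset[String]

val streamDS = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
  .as[String]                                            // streaming Dataset[String]

staticDS.union(streamDS)  // throws AnalysisException: Union between streaming and batch ... is not supported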


Case #1 : We should think about optimization in RDD terms


Single-threaded collection


No perf issues, right?


The main concept

more partitions = more parallelism
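A small sketch of how that knob is turned, assuming an existing SparkSession called spark:

val rdd = spark.sparkContext.parallelize(1 to 1000000, numSlices = 8)
println(rdd.getNumPartitions)        // 8 partitions -> up to 8 concurrent tasks

val wider    = rdd.repartition(16)   // full shuffle up to 16 partitions
val narrower = wider.coalesce(4)     // narrow coalesce down to 4, avoids a shuffle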


Do it in parallel


I’d like NARROW transformations


map, filter, filter (narrow transformations)


groupByKey, join (wide transformations, require a shuffle)
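A hedged sketch (names are illustrative) of where the difference shows up: narrow transformations stay inside a partition, wide ones force a shuffle and a new stage:

val nums = spark.sparkContext.parallelize(1 to 100, 4)

val narrow = nums.map(_ * 2).filter(_ > 10)            // narrow: no data movement
val wide   = nums.map(n => (n % 10, n)).groupByKey()   // wide: shuffle across partitions

println(wide.toDebugString)  // the ShuffledRDD in the lineage marks the stage boundary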


Case #2 : DataFrames suggest mixing SQL and Scala functions


History of Spark APIs


rdd.filter(_.age > 21)            // RDD
df.filter("age > 21")             // DataFrame, SQL-style
df.filter(df.col("age").gt(21))   // DataFrame, expression style
dataset.filter(_.age > 21)        // Dataset API


Case #2 : DataFrame refers to data attributes by name


DataSet = RDD’s types + DataFrame’s Catalyst

• RDD API

• compile-time type-safety

• off-heap storage mechanism

• performance benefits of the Catalyst query optimizer

• Tungsten


Structured APIs in SPARK


Unified API in Spark 2.0

DataFrame = Dataset[Row]

A DataFrame is now simply an untyped Dataset of Row: it still carries a schema, but without compile-time typing.
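In Spark 2.x that equivalence is essentially a one-line type alias in the org.apache.spark.sql package object, and the two views convert freely, assuming a typed Dataset[User] named userDS like the one defined a few slides below:

// In the Spark source (org.apache.spark.sql package object), roughly:
//   type DataFrame = Dataset[Row]

import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._

val df: DataFrame     = userDS.toDF()   // drop the static type, keep the schema
val ds: Dataset[User] = df.as[User]     // get the type back via its encoder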


Define a case class, read JSON, filter by field

case class User(email: String, footSize: Long, name: String)

// DataFrame -> Dataset of Users (import spark.implicits._ is assumed)
val userDS =
  spark.read.json("/home/tmp/datasets/users.json").as[User]

userDS.map(_.name).collect()
userDS.filter(_.footSize > 38).collect()

userDS.rdd // IF YOU REALLY WANT


Case #3 : Spark has many contexts


Spark Session

• New entry point in Spark for creating Datasets
• Replaces SQLContext, HiveContext and StreamingContext
• The move from SparkContext to SparkSession signifies a move away from RDDs


Spark Session

import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder
  .master("local")
  .appName("spark session example")
  .getOrCreate()

val df = sparkSession.read
  .option("header", "true")
  .csv("src/main/resources/names.csv")

df.show()


No, I want to create my lovely RDDs


Where’s the parallelize() method?
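It did not go anywhere; the classic entry point still hangs off the session. A minimal sketch, using the sparkSession built above:

val sc  = sparkSession.sparkContext       // the underlying SparkContext is still there
val rdd = sc.parallelize(Seq(1, 2, 3))

rdd.map(_ + 1).collect()                  // Array(2, 3, 4)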


RDD?

userDS.rdd // IF YOU REALLY WANT (same User Dataset as above)


Case #4 : Spark uses Java serialization A LOT


Two choices to distribute data across the cluster

• Java serialization: the default, via ObjectOutputStream
• Kryo serialization: you should register classes (it does not rely on java.io.Serializable)
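A hedged sketch of switching to Kryo and registering classes up front (User here is the same case class used in the Dataset examples):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[User]))   // unregistered classes still work, but carry the full class name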


The main problem: overhead of serializing

Each serialized object contains the class structure as well as the values.



Don’t forget about GC


Tungsten Compact Encoding


Encoder’s concept

Generate bytecode to interact with off-heap memory & give access to attributes without ser/deser.
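A small sketch of asking for such an encoder explicitly (using the same User case class as in the Dataset examples):

import org.apache.spark.sql.Encoders

case class User(email: String, footSize: Long, name: String)

val userEncoder = Encoders.product[User]   // Tungsten-backed encoder generated for the case class
println(userEncoder.schema)                // the columnar layout it encodes to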


Encoders


No custom encoders


Case #5 : Not enough storage levels


Caching in Spark

• Frequently used RDDs can be stored in memory
• One method, one shortcut: persist() and cache()
• SparkContext keeps track of cached RDDs
• Serialized or deserialized Java objects
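A minimal sketch, assuming an already-built RDD called logs:

import org.apache.spark.storage.StorageLevel

val cached = logs.persist(StorageLevel.MEMORY_AND_DISK_SER)

cached.count()       // the first action actually materializes the cache
cached.unpersist()   // release the blocks when done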


Full list of options

• MEMORY_ONLY (default for Spark Core)
• MEMORY_AND_DISK
• MEMORY_ONLY_SER (default for Spark Streaming)
• MEMORY_AND_DISK_SER
• DISK_ONLY
• MEMORY_ONLY_2, MEMORY_AND_DISK_2


Developer API to make new Storage Levels


What’s the most popular file format in Big Data?


Case #6 : External libraries to read CSV


Easy to read CSV

val data = sqlContext.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/datasets/samples/users.csv")

data.cache()
data.createOrReplaceTempView("users")
display(data)   // display() is a Databricks notebook helper
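In Spark 2.x the csv source ships with Spark SQL, so the external spark-csv package is no longer needed and the unified entry point can be used directly; a minimal sketch:

val users = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/datasets/samples/users.csv")

users.show()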


Case #7 : How to measure Spark performance?


You should measure performance!


TPC-DS: 99 queries

http://bit.ly/2dObMsH


How to benchmark Spark


Special tool from Databricks: a benchmark tool for Spark SQL

https://github.com/databricks/spark-sql-perf


Spark 2 vs Spark 1.6


Case #8 : What’s faster: SQL or DataSet API?


Job Stages in old Spark


Scheduler Optimizations


Catalyst Optimizer for DataFrames


Unified Logical Plan

DataFrame = Dataset[Row]


Bytecode


DataSet.explain()

== Physical Plan ==
Project [avg(price)#43, carat#45]
+- SortMergeJoin [color#21], [color#47]
   :- Sort [color#21 ASC], false, 0
   :  +- TungstenExchange hashpartitioning(color#21,200), None
   :     +- Project [avg(price)#43, color#21]
   :        +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Final,isDistinct=false)], output=[color#21,avg(price)#43])
   :           +- TungstenExchange hashpartitioning(cut#20,color#21,200), None
   :              +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Partial,isDistinct=false)], output=[cut#20,color#21,sum#58,count#59L])
   :                 +- Scan CsvRelation(-----)
   +- Sort [color#47 ASC], false, 0
      +- TungstenExchange hashpartitioning(color#47,200), None
         +- ConvertToUnsafe
            +- Scan CsvRelation(----)


Case #9 : Why does explain() show so many Tungsten things?


How to be effective with CPU

• Runtime code generation

• Exploiting cache locality

• Off-heap memory management


Tungsten’s goal

Push performance closer to the limits of modern hardware.


Maybe something UNSAFE?


UnsafeRow format

• Bit set for tracking null values
• Small values are inlined
• Variable-length values store a relative offset into the variable-length data section
• Rows are always 8-byte word aligned
• Equality comparison and hashing can be performed on raw bytes without requiring additional interpretation


Case #10 : Can I influence memory management in Spark?


Case #11 : Should I tune generational GC settings?


Cached Data
During operations
For your needs
For Dark Lord
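Both cases boil down to a handful of knobs on the unified memory manager plus the executor JVM options; a hedged sketch (values are illustrative, not recommendations):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.memory.fraction", "0.6")           // share of the heap used for execution + storage
  .set("spark.memory.storageFraction", "0.5")    // part of that protected for cached data
  .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC -XX:MaxGCPauseMillis=200")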


IN CONCLUSION


We still have no …

• ability to join structured streaming with other sources
• one unified ML API
• GraphX rethinking and redesign
• custom encoders
• Datasets everywhere
• integration with something important


Roadmap

• Support other data sources (not only S3 + HDFS)
• Transactional updates
• Dataset as the single DSL for all operations
• GraphFrames + Structured MLlib
• Tungsten: custom encoders
• The RDD-based API is expected to be removed in Spark 3.0


And we can DO IT!

[Architecture diagram: internal, external and social sources feed real-time data ingestion (real-time ETL & CEP) and batch data ingestion (batch ETL & raw area, where a backup of all raw data is stored; HDFS → CFS as an option); downstream sit real-time and batch data marts, a relations graph, ontology metadata, a search index, time-series data (Titan & KairosDB store data in Cassandra), real-time dashboarding, events & alarms pushed out via email, SNMP etc., and a scheduler.]


First Part


Second Part


Contacts

E-mail : Alexey_Zinovyev@epam.com

Twitter : @zaleslaw @BigDataRussia

Facebook: https://www.facebook.com/zaleslaw

vk.com/big_data_russia Big Data Russia

vk.com/java_jvm Java & JVM langs


Any questions?
