Bigdata processing with Spark - part II
SIKS Big Data Course, Part Two. Prof.dr.ir. Arjen P. de Vries ([email protected]), December 7, 2016
Recap: Spark Data Sharing

Data sharing is crucial for:
- Interactive analysis
- Iterative machine learning algorithms

Spark RDDs:
- Distributed collections, cached in memory across cluster nodes

Keep track of lineage:
- To ensure fault tolerance
- To optimize processing based on knowledge of the data partitioning
RDDs in More Detail

RDDs additionally provide:
- Control over partitioning, which can be used to optimize data placement across queries; usually more efficient than the sort-based approach of MapReduce
- Control over persistence (e.g. store on disk vs. in RAM)
- Fine-grained reads (treat an RDD as a big table)

Slide by Matei Zaharia, creator of Spark, http://spark-project.org
Scheduling Process

Example job: rdd1.join(rdd2).groupBy(…).filter(…)

- RDD Objects: build the operator DAG
- DAGScheduler: splits the DAG into stages of tasks and submits each stage as soon as it is ready; agnostic to what the operators do
- TaskScheduler: launches tasks via the cluster manager and retries failed or straggling tasks; doesn't know about stages (a failed stage goes back to the DAGScheduler)
- Worker: executes tasks in threads; its block manager stores and serves blocks
RDD API Example

```scala
// Read input file
val input = sc.textFile("input.txt")

// Split each line into words; drop empty lines
val tokenized = input
  .map(line => line.split(" "))
  .filter(words => words.size > 0)

// Frequency of log levels: key on the first word,
// sum the counts into 2 partitions
val counts = tokenized
  .map(words => (words(0), 1))
  .reduceByKey((a, b) => a + b, 2)
```
Transformations

`sc.textFile().map().filter().map().reduceByKey()`
DAG View of RDDs

textFile() → map() → filter() → map() → reduceByKey()

Hadoop RDD (3 partitions) → Mapped RDD → Filtered RDD → Mapped RDD → Shuffle RDD (2 partitions); the slide groups these as input, tokenized, and counts from the example.
Transformations build up a DAG, but don't "do anything": nothing executes until an action forces evaluation.
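A minimal sketch of this laziness, assuming the SparkContext `sc` from the earlier example:

```scala
// Building the DAG: no cluster work happens on these two lines
val lengths = sc.textFile("input.txt").map(line => line.length)

// Only the action triggers evaluation of the whole DAG
val total = lengths.reduce((a, b) => a + b)
```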
How runJob Works

runJob(counts) needs to compute its parents, its parents' parents, and so on, all the way back to an RDD with no dependencies (e.g. the HadoopRDD), walking the chain input → tokenized → counts.
Physical Optimizations

1. Certain types of transformations can be pipelined.
2. If dependent RDDs have already been cached (or persisted in a shuffle), the graph can be truncated.

Pipelining and truncation produce a set of stages, where each stage is composed of tasks.
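A small sketch of the truncation case, reusing `tokenized` from the example above:

```scala
// Mark tokenized for in-memory persistence
tokenized.cache()

// First action computes the lineage from the input file and fills the cache
tokenized.count()

// Later jobs over tokenized start from the cached partitions:
// the scheduler truncates the graph instead of re-reading input.txt
tokenized.map(words => words.mkString(" ")).count()
```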
Scheduler Optimizations

- Pipelines narrow operations within a stage
- Picks join algorithms based on partitioning (minimize shuffles)
- Reuses previously cached data

[Figure: RDDs A through G linked by groupBy, map, union, and join, grouped into Stages 1-3; previously computed partitions need not be recomputed, so their tasks are skipped.]
Task Details

Stage boundaries are only at input RDDs or "shuffle" operations. So each task looks like this: fetch map outputs and/or read from external storage, run the pipelined functions f1, f2, …, then write to a map output file or return results to the master.
Stage Graph

Stage 1 (Tasks 1-3). Each task will:
1. Read Hadoop input
2. Perform maps and filters
3. Write partial sums (shuffle write)

Stage 2 (Tasks 1-2). Each task will:
1. Read partial sums (shuffle read)
2. Invoke the user function passed to runJob
Physical Execution Model

Distinguish between:
- Jobs: the complete work to be done
- Stages: bundles of work that can execute together
- Tasks: the unit of work; corresponds to one RDD partition

Defining stages and tasks should not require deep knowledge of what these actually do: a goal of Spark is to be extensible, letting users define new RDD operators.
RDD Interface

- Set of partitions ("splits")
- List of dependencies on parent RDDs
- Function to compute a partition given its parents
- Optional preferred locations
- Optional partitioning info (Partitioner)

This interface captures all current Spark operations! A condensed sketch follows.
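A condensed sketch of how these five pieces appear in Spark's Scala RDD class (method names follow the Spark source; signatures simplified for illustration):

```scala
import org.apache.spark.{Dependency, Partition, Partitioner, TaskContext}

abstract class RDD[T] {
  // Required: the set of partitions ("splits")
  protected def getPartitions: Array[Partition]

  // Required: compute one partition, given its (already computed) parents
  def compute(split: Partition, context: TaskContext): Iterator[T]

  // Dependencies on parent RDDs
  protected def getDependencies: Seq[Dependency[_]]

  // Optional: preferred locations for a partition (e.g. HDFS block hosts)
  protected def getPreferredLocations(split: Partition): Seq[String] = Nil

  // Optional: how the data is partitioned (e.g. a HashPartitioner)
  val partitioner: Option[Partitioner] = None
}
```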
Example: HadoopRDD

- partitions = one per HDFS block
- dependencies = none
- compute(partition) = read the corresponding block
- preferredLocations(part) = HDFS block locations
- partitioner = none
Example: FilteredRDD

- partitions = same as parent RDD
- dependencies = "one-to-one" on the parent
- compute(partition) = compute the parent partition and filter it
- preferredLocations(part) = none (ask the parent)
- partitioner = none
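A minimal sketch of such an RDD against Spark's public API (illustrative only: Spark itself implements filter via MapPartitionsRDD):

```scala
import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

class FilteredRDD[T: ClassTag](parent: RDD[T], f: T => Boolean)
    extends RDD[T](parent) {  // this constructor sets up a one-to-one dependency

  // Same partitions as the parent RDD
  override protected def getPartitions: Array[Partition] = parent.partitions

  // Compute the parent partition, then filter it
  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    parent.iterator(split, context).filter(f)

  // preferredLocations and partitioner keep their defaults (none)
}
```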
Example: JoinedRDD

- partitions = one per reduce task
- dependencies = "shuffle" on each parent
- compute(partition) = read and join the shuffled data
- preferredLocations(part) = none
- partitioner = HashPartitioner(numTasks), so Spark will now know this data is hashed!
Dependency Types

"Narrow" dependencies: map, filter, union, join with inputs co-partitioned

"Wide" (shuffle) dependencies: groupByKey, join with inputs not co-partitioned
Improving Efficiency

Basic principle: avoid shuffling!
Filter Input Early
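A small sketch of the principle (assuming lines: RDD[String]; the key extraction is hypothetical):

```scala
// Filter early: the narrow filter runs where the data lives,
// so the shuffle for reduceByKey moves only the error lines
val errors = lines.filter(line => line.startsWith("ERROR"))
val counts = errors
  .map(line => (line.split(" ")(1), 1))  // hypothetical: second token is the key
  .reduceByKey((a, b) => a + b)
```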
Avoid groupByKey on Pair RDDs

groupByKey is a "wide" (shuffle) dependency: all key-value pairs are shuffled across the network to a reducer, where the values are collected together. A comparison sketch follows.
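Assuming pairs: RDD[(String, Int)], reduceByKey combines partial results map-side before the shuffle, while groupByKey ships every pair:

```scala
// Preferred: reduceByKey combines partial sums map-side before the shuffle
val sums = pairs.reduceByKey((a, b) => a + b)

// Avoid: groupByKey ships every (key, value) pair, then sums at the reducer
val slowSums = pairs.groupByKey().mapValues(values => values.sum)
```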
aggregateByKey

Three inputs:
- Zero element
- Merging function within a partition
- Merging function across partitions

```scala
// Count values per key (assuming kv: RDD[(String, String)])
val initialCount = 0
val addToCounts = (n: Int, v: String) => n + 1
val sumPartitionCounts = (p1: Int, p2: Int) => p1 + p2

val countByKey = kv.aggregateByKey(initialCount)(addToCounts, sumPartitionCounts)
```

Combiners!
combineByKey

```scala
// Per-key average (assuming input: RDD[(String, Int)])
val result = input.combineByKey(
  (v) => (v, 1),                                      // createCombiner: first value seen
  (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),   // mergeValue: within a partition
  (acc1: (Int, Int), acc2: (Int, Int)) =>
    (acc1._1 + acc2._1, acc1._2 + acc2._2)            // mergeCombiners: across partitions
).map {
  case (key, value) => (key, value._1 / value._2.toFloat)
}

result.collectAsMap().map(println(_))
```
Control the Degree of Parallelism

Repartition:
- Concentrate effort; increase use of nodes

Coalesce:
- Reduce the number of tasks

Both in the sketch below.
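(Partition counts are illustrative.)

```scala
// repartition always shuffles; use it to spread work across more nodes
val spread = rdd.repartition(200)

// coalesce (when decreasing) merges partitions without a shuffle,
// cutting the number of tasks for small outputs
val compact = spread.coalesce(10)
```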
Broadcast Values

In case of a join with a small RHS or LHS, broadcast the small set to every node in the cluster, as in the sketch below.
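A hedged sketch of such a map-side join, assuming large: RDD[(String, Int)] and a small: RDD[(String, String)] that fits in driver memory:

```scala
// Ship the small side to every node once
val smallMap = sc.broadcast(small.collectAsMap())

// Map-side join: the large RDD is never shuffled
val joined = large.flatMap { case (k, v) =>
  smallMap.value.get(k).map(w => (k, (v, w)))  // keys missing on the small side are dropped
}
```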
Broadcast Variables

- Create with SparkContext.broadcast(initVal)
- Access with .value inside tasks
- Immutable! If you modify the broadcast value after creation, that change is local to the node.
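For example (a minimal sketch, assuming words: RDD[String]):

```scala
// Created once on the driver, shipped to each node
val stopWords = sc.broadcast(Set("the", "a", "of"))

// Tasks only read it via .value; changes made inside a task stay local to that node
val content = words.filter(word => !stopWords.value.contains(word))
```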
Maintaining Partitioning

- mapValues instead of map
- flatMapValues instead of flatMap (good for tokenization!)
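The difference in a sketch (assuming pairs: RDD[(String, Int)]): operations on values alone cannot move a record to another key, so the partitioner is kept; a general map might change the key, so the partitioner is dropped.

```scala
import org.apache.spark.HashPartitioner

val partitioned = pairs.partitionBy(new HashPartitioner(8))

// mapValues keeps the partitioner: a later reduceByKey or join avoids a shuffle
val kept = partitioned.mapValues(v => v + 1)

// map loses the partitioner, even though the key is unchanged
val lost = partitioned.map { case (k, v) => (k, v + 1) }
```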
The best trick of all, however…
Use Higher-Level APIs!

- DataFrame APIs for core processing; work across Scala, Java, Python and R
- Spark ML for machine learning
- Spark SQL for structured query processing
Higher-Level Libraries

On top of the Spark core: Spark Streaming (real-time), Spark SQL (structured data), MLlib (machine learning), and GraphX (graph processing).
Combining Processing Types

```scala
// Load data using SQL
val points = ctx.sql("select latitude, longitude from tweets")

// Train a machine learning model
val model = KMeans.train(points, 10)

// Apply it to a stream
sc.twitterStream(...)
  .map(t => (model.predict(t.location), 1))
  .reduceByWindow("5s", (a, b) => a + b)
```

(Schematic code from the original slide; the APIs are simplified for illustration.)
Performance of Composition

Separate computing frameworks: every step in the pipeline pays an HDFS write followed by an HDFS read before the next framework can continue.

Spark: one HDFS read at the start; intermediate results stay in memory across the composed libraries.
Encode Domain Knowledge

In essence, these libraries are nothing more than pre-cooked code that still operates over the abstraction of RDDs. They focus on optimizations that require domain knowledge.
Spark MLlib
Data Sets
Challenge: Data Representation

Java objects are often many times larger than the underlying data:

```scala
class User(name: String, friends: Array[Int])
User("Bobby", Array(1, 2))
```

[Figure: the in-memory object graph for this User: object headers and pointers for the User, the String "Bobby" with its backing char[], and the int[] of friends add substantial overhead beyond the raw bytes.]
DataFrames / Spark SQL

Efficient library for working with structured data:
- Two interfaces: SQL for data analysts and external apps, DataFrames for complex programs
- Optimized computation and storage underneath

Spark SQL was added in 2014, DataFrames in 2015.
Spark SQL Architecture

[Figure: SQL queries and DataFrame programs are parsed into a logical plan using the catalog, optimized, compiled into a physical plan with runtime code generation, and executed over RDDs; the Data Source API connects external storage.]
DataFrame API

DataFrames hold rows with a known schema and offer relational operations through a DSL:

```python
c = HiveContext()
users = c.sql("select * from users")

ma_users = users[users.state == "MA"]  # builds an expression AST, not a Python lambda

ma_users.count()
ma_users.groupBy("name").avg("age")
ma_users.map(lambda row: row.name.upper())
```
What DataFrames Enable

1. Compact binary representation: columnar, compressed cache; rows for processing
2. Optimization across operators (join reordering, predicate pushdown, etc.)
3. Runtime code generation
Performance

[Figure: benchmark charts from the original slides.]
Data Sources

Uniform way to access structured data:
- Apps can migrate across Hive, Cassandra, JSON, …
- Rich semantics allow query pushdown into data sources

[Figure: Spark SQL turns the DataFrame operation users[users.age > 20] into a query over the data source (select * from users).]
Examples

JSON (tweets.json contains records like `{ "text": "hi", "user": { "name": "bob", "id": 15 } }`):

```sql
select user.id, text from tweets
```

JDBC:

```sql
select age from users where lang = "en"
```

Together:

```sql
select t.text, u.age
from tweets t, users u
where t.user.id = u.id
and u.lang = "en"
```

Spark SQL pushes the subquery `select id, age from users where lang = "en"` down into the JDBC data source.
Thanks

- Matei Zaharia, MIT (https://cs.stanford.edu/~matei/)
- Patrick Wendell, Databricks
- http://spark-project.org