Matei Zaharia, AMP Camp 2012: Advanced Spark
TRANSCRIPT
![Page 1: Matei Zaharia Amp Camp 2012 Advanced Spark](https://reader031.vdocuments.net/reader031/viewer/2022021922/577ccd491a28ab9e788bfa41/html5/thumbnails/1.jpg)
Advanced Spark Features
Matei Zaharia
UC Berkeley
www.spark-project.org
Motivation You’ve now seen the core primitives of Spark: RDDs, transformations and actions
As we’ve built applications, we’ve added other primitives to improve speed & usability » A key goal has been to keep Spark a small, extensible platform for research
These work seamlessly with the existing model
Spark Model
Process distributed collections with functional operators, the same way you can for local ones

```scala
val points: RDD[Point] = // ...
var clusterCenters = new Array[Point](k)

val closestCenter = points.map { p =>
  findClosest(clusterCenters, p)
}
...
```
Two foci for extension: collection storage & layout ("How should this be split across nodes?"), and interaction of functions with the program ("How should this variable be shipped?")
Outline
Broadcast variables (richer shared variables)
Accumulators (richer shared variables)
Controllable partitioning (data layout)
Extending Spark
![Page 8: Matei Zaharia Amp Camp 2012 Advanced Spark](https://reader031.vdocuments.net/reader031/viewer/2022021922/577ccd491a28ab9e788bfa41/html5/thumbnails/8.jpg)
Motivation
Normally, Spark closures, including the variables they use, are sent separately with each task
In some cases, a large read-only variable needs to be shared across tasks, or across operations
Examples: large lookup tables, "map-side join"
Example: Join
```scala
// Load RDD of (URL, name) pairs
val pageNames = sc.textFile("pages.txt").map(...)

// Load RDD of (URL, visit) pairs
val visits = sc.textFile("visits.txt").map(...)

val joined = visits.join(pageNames)
```
Shuffles both pageNames and visits over the network
[Diagram: map tasks read pages.txt and visits.txt; reduce tasks receive keys in ranges A-E, F-J, K-O, P-T, U-Z]
Alternative if One Table is Small
```scala
val pageNames = sc.textFile("pages.txt").map(...)
val pageMap = pageNames.collect().toMap

val visits = sc.textFile("visits.txt").map(...)

val joined = visits.map(v => (v._1, (pageMap(v._1), v._2)))
```
Runs locally on each block of visits.txt, but pageMap is sent along with every task
[Diagram: the master ships pageMap with each map task over blocks of visits.txt]
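Stripped of Spark, the lookup pattern above is ordinary Scala. A minimal local sketch, with invented data (the pageMap and visits values are purely illustrative):

```scala
// Small lookup table, as would be collected from pages.txt
val pageMap = Map("u1" -> "Home", "u2" -> "About")

// Visit records: (URL, visitor)
val visits = Seq(("u1", "alice"), ("u2", "bob"), ("u1", "carol"))

// Map-side "join": look each visit's URL up in the small table
// instead of shuffling both datasets by key
val joined = visits.map(v => (v._1, (pageMap(v._1), v._2)))
```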
Better Version with Broadcast
```scala
val pageNames = sc.textFile("pages.txt").map(...)
val pageMap = pageNames.collect().toMap
val bc = sc.broadcast(pageMap)   // type is Broadcast[Map[...]]

val visits = sc.textFile("visits.txt").map(...)

// Call .value to access the broadcast value
val joined = visits.map(v => (v._1, (bc.value(v._1), v._2)))
```
Only sends pageMap to each node once
[Diagram: the master ships bc once per node; map tasks over blocks of visits.txt read bc.value and produce the result]
Broadcast Variable Rules Create with SparkContext.broadcast(initialVal)
Access with .value inside tasks » First task to do so on each node fetches the value
Cannot modify value after creation » If you try, the change will only be visible on one node
Scaling Up Broadcast
[Charts: iteration time (s), split into communication and computation, vs. number of machines (10, 30, 60, 90), for the initial HDFS-based version and for Cornet P2P broadcast]
[Chowdhury et al, SIGCOMM 2011]
Cornet Performance
[Chart: completion time (s) to send 1 GB of data to 100 receivers, comparing HDFS (R=3) and HDFS (R=10) (Hadoop's "distributed cache"), BitTornado, Tree (D=2), Chain, and Cornet]
[Chowdhury et al, SIGCOMM 2011]
Outline
Broadcast variables
Accumulators
Controllable partitioning
Extending Spark
Motivation Often, an application needs to aggregate multiple values as it progresses
Accumulators generalize MapReduce’s counters to enable this
Usage
```scala
val badRecords = sc.accumulator(0)     // Accumulator[Int]
val badBytes = sc.accumulator(0.0)     // Accumulator[Double]

records.filter(r => {
  if (isBad(r)) {
    badRecords += 1
    badBytes += r.size
    false
  } else {
    true
  }
}).save(...)

printf("Total bad records: %d, avg size: %f\n",
  badRecords.value, badBytes.value / badRecords.value)
```
Accumulator Rules Create with SparkContext.accumulator(initialVal)
“Add” to the value with += inside tasks » Each task’s effect only counted once
Access with .value, but only on master » Exception if you try it on workers
Custom Accumulators
Define an object extending AccumulatorParam[T], where T is your data type, providing:
» A zero element for a given T
» An addInPlace method to merge in values

```scala
class Vector(val data: Array[Double]) { ... }

implicit object VectorAP extends AccumulatorParam[Vector] {
  def zero(v: Vector) = new Vector(new Array(v.data.size))

  def addInPlace(v1: Vector, v2: Vector) = {
    for (i <- 0 until v1.data.size)
      v1.data(i) += v2.data(i)
    v1
  }
}
```

Now you can use sc.accumulator(new Vector(…))
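The merge logic itself can be exercised without a cluster. The sketch below mimics the shape of AccumulatorParam with a local stand-in trait (the AccParam and Vec names here are made up for this example; only Spark's real AccumulatorParam works with sc.accumulator):

```scala
// Local stand-in for Spark's AccumulatorParam trait, for illustration only
trait AccParam[T] {
  def zero(initial: T): T
  def addInPlace(t1: T, t2: T): T
}

class Vec(val data: Array[Double])

object VecAP extends AccParam[Vec] {
  // Zero element: a vector of the same size, all 0.0
  def zero(v: Vec) = new Vec(new Array[Double](v.data.size))

  // Merge by element-wise addition into v1
  def addInPlace(v1: Vec, v2: Vec) = {
    for (i <- 0 until v1.data.size)
      v1.data(i) += v2.data(i)
    v1
  }
}

// Simulate two tasks each merging a partial result into the accumulator
val acc = VecAP.zero(new Vec(Array(0.0, 0.0, 0.0)))
VecAP.addInPlace(acc, new Vec(Array(1.0, 2.0, 3.0)))
VecAP.addInPlace(acc, new Vec(Array(4.0, 5.0, 6.0)))
```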
Another Common Use
```scala
val sum = sc.accumulator(0.0)
val count = sc.accumulator(0.0)

// foreach is an action that runs only for its side effects
records.foreach(r => {
  sum += r.size
  count += 1.0
})

val average = sum.value / count.value
```
Outline
Broadcast variables
Accumulators
Controllable partitioning
Extending Spark
Motivation Recall from yesterday that network bandwidth is ~100× as expensive as memory bandwidth
One way Spark avoids using it is through locality-aware scheduling for RAM and disk
Another important tool is controlling the partitioning of RDD contents across nodes
Example: PageRank
1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page's rank to 0.15 + 0.85 × contribs

```scala
val links = // RDD of (url, neighbors) pairs
var ranks = // RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}
```
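The same loop can be run on plain Scala collections to check the update rule; the three-page graph below is invented for illustration:

```scala
// Local sketch of the PageRank loop using plain collections instead of RDDs
val links = Map(
  "a" -> Seq("b", "c"),
  "b" -> Seq("c"),
  "c" -> Seq("a")
)
var ranks = links.map { case (url, _) => (url, 1.0) }

for (i <- 1 to 10) {
  // Each page splits its rank evenly among its neighbors
  val contribs = links.toSeq.flatMap { case (url, neighbors) =>
    val rank = ranks(url)
    neighbors.map(dest => (dest, rank / neighbors.size))
  }
  // reduceByKey + mapValues, expressed with groupBy on a local Seq
  ranks = contribs.groupBy(_._1)
    .map { case (url, pairs) => (url, 0.15 + 0.85 * pairs.map(_._2).sum) }
}
```

Because every page here has at least one inbound link, the total rank stays at 3.0 across iterations.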
PageRank Execution
links and ranks are repeatedly joined
Each join requires a full shuffle over the network » Hash both onto same nodes
[Diagram: each iteration joins Links (url, neighbors) with Ranks (url, rank), then reduceByKey produces the next Ranks (Contribs0, Ranks1, Contribs2, Ranks2, ...); map and reduce tasks shuffle links and ranks across key ranges A-F, G-L, M-R, S-Z]
Solution
Pre-partition the links RDD so that links for URLs with the same hash code are on the same node

```scala
val ranks = // RDD of (url, rank) pairs
val links = sc.textFile(...).map(...)
             .partitionBy(new HashPartitioner(8))

for (i <- 1 to ITERATIONS) {
  ranks = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }.reduceByKey(_ + _)
   .mapValues(0.15 + 0.85 * _)
}
```
![Page 26: Matei Zaharia Amp Camp 2012 Advanced Spark](https://reader031.vdocuments.net/reader031/viewer/2022021922/577ccd491a28ab9e788bfa41/html5/thumbnails/26.jpg)
New Execution
partitionBy map
Input File Links
Ranks0
join flatMap
Ranks1
reduceByKey join flatMap
Ranks2
reduceByKey
. . .
Links not shuffled
Ranks also not shuffled
![Page 27: Matei Zaharia Amp Camp 2012 Advanced Spark](https://reader031.vdocuments.net/reader031/viewer/2022021922/577ccd491a28ab9e788bfa41/html5/thumbnails/27.jpg)
How it Works Each RDD has an optional Partitioner object
Any shuffle operation on an RDD with a Partitioner will respect that Partitioner
Any shuffle operation on two RDDs will take on the Partitioner of one of them, if one is set
Otherwise, by default use HashPartitioner
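What a hash partitioner does can be sketched in a few lines; SimpleHashPartitioner below is a simplified stand-in for Spark's HashPartitioner, not its actual implementation:

```scala
// Simplified stand-in for HashPartitioner: map a key to one of n partitions.
// Note the adjustment for negative hash codes, which % alone would mishandle.
class SimpleHashPartitioner(val numPartitions: Int) {
  def getPartition(key: Any): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }
}

val p = new SimpleHashPartitioner(8)
// Equal keys always land in the same partition, which is what makes
// co-partitioned joins possible without a shuffle
val samePartition = p.getPartition("http://a.com/x") == p.getPartition("http://a.com/x")
```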
![Page 28: Matei Zaharia Amp Camp 2012 Advanced Spark](https://reader031.vdocuments.net/reader031/viewer/2022021922/577ccd491a28ab9e788bfa41/html5/thumbnails/28.jpg)
Examples
pages.join(visits).reduceByKey(...)
» Output of join is already partitioned
pages.join(visits).map(...).reduceByKey(...)
» map loses knowledge of the partitioning
pages.join(visits).mapValues(...).reduceByKey(...)
» mapValues leaves keys unchanged, so partitioning is preserved
![Page 29: Matei Zaharia Amp Camp 2012 Advanced Spark](https://reader031.vdocuments.net/reader031/viewer/2022021922/577ccd491a28ab9e788bfa41/html5/thumbnails/29.jpg)
PageRank Performance
171
72
23
0
50
100
150
200
Time pe
r iteration
(s)
Hadoop
Basic Spark
Spark + Controlled Partitioning
Why it helps so much: links RDD is much bigger in bytes than ranks!
Telling How an RDD is Partitioned
Use the .partitioner method on RDD:

```scala
scala> val a = sc.parallelize(List((1, 1), (2, 2)))
scala> val b = sc.parallelize(List((1, 1), (2, 2)))
scala> val joined = a.join(b)

scala> a.partitioner
res0: Option[Partitioner] = None

scala> joined.partitioner
res1: Option[Partitioner] = Some(HashPartitioner@286d41c0)
```
Custom Partitioning
Can define your own subclass of Partitioner to leverage domain-specific knowledge
Example: in PageRank, hash URLs by domain name, because many links are internal

```scala
class DomainPartitioner extends Partitioner {
  def numPartitions = 20

  def getPartition(key: Any): Int = {
    val mod = parseDomain(key.toString).hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod  // hashCode may be negative
  }

  // Needed for Spark to tell when two partitioners are equivalent
  override def equals(other: Any): Boolean =
    other.isInstanceOf[DomainPartitioner]
}
```
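parseDomain is not defined on the slide; a plausible version is sketched below, with the partitioning logic pulled out as a plain function so it can be tested locally:

```scala
// Hypothetical parseDomain: extract the host part of a URL
def parseDomain(url: String): String =
  url.stripPrefix("http://").stripPrefix("https://").takeWhile(_ != '/')

def partitionFor(url: String, numPartitions: Int): Int = {
  val mod = parseDomain(url).hashCode % numPartitions
  if (mod < 0) mod + numPartitions else mod
}

// Pages from the same domain land in the same partition,
// so intra-domain links need no shuffle
val a = partitionFor("http://berkeley.edu/a", 20)
val b = partitionFor("http://berkeley.edu/b", 20)
```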
Outline
Broadcast variables
Accumulators
Controllable partitioning
Extending Spark
Extension Points Spark provides several places to customize functionality:
Extending RDD: add new input sources or transformations
spark.cache.class: customize caching
spark.serializer: customize object storage
What People Have Done New RDD transformations (sample, glom, mapPartitions, leftOuterJoin, rightOuterJoin)
New input sources (DynamoDB)
Custom serialization for memory and bandwidth efficiency
Why Change Serialization? Greatly impacts network usage
Can also be used to improve memory efficiency » Java objects are often larger than raw data » Most compact way to keep large amounts of data in memory is SerializingCache
Spark’s default choice of Java serialization is very simple to use, but very slow » High space & time cost due to forward compatibility
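The space overhead is easy to measure locally with the JDK's own ObjectOutputStream; exact byte counts vary across JVMs, so this sketch only compares the serialized size to the raw data size:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Serialize with default Java serialization and measure the output size
def javaSerializedSize(obj: AnyRef): Int = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(obj)
  out.close()
  bytes.size
}

val data = Array.fill(1000)(1.0)   // 1000 doubles = 8000 bytes of raw data
val rawSize = data.length * 8
val serializedSize = javaSerializedSize(data)
// serializedSize exceeds rawSize because of the stream header and class descriptor
```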
Serializer Benchmark: Time
Source: http://code.google.com/p/thrift-‐protobuf-‐compare/wiki/Benchmarking
Serializer Benchmark: Space
Source: http://code.google.com/p/thrift-‐protobuf-‐compare/wiki/Benchmarking
Better Serialization
You can implement your own serializer by extending spark.Serializer
As a ready-made option that saves a lot of time, we recommend Kryo (code.google.com/p/kryo)
» One of the fastest serializers, with minimal boilerplate
» Note: Spark currently uses Kryo 1.x, not 2.x
Using Kryo
```scala
class MyRegistrator extends spark.KryoRegistrator {
  def registerClasses(kryo: Kryo) {
    kryo.register(classOf[Class1])
    kryo.register(classOf[Class2])
  }
}

System.setProperty("spark.serializer", "spark.KryoSerializer")
System.setProperty("spark.kryo.registrator", "mypkg.MyRegistrator")

// Optional, for memory usage
System.setProperty("spark.cache.class", "mypkg.SerializingCache")

val sc = new SparkContext(...)
```
Impact of Serialization Saw as much as 4× space reduction and 10× time reduction with Kryo
Simple way to test serialization cost in your program: profile it with jstack or hprof
We plan to work on this further in the future!
Codebase Size
Spark core: 14,000 LOC
» RDD ops: 1600
» Block store: 2000
» Scheduler: 2000
» Networking: 1200
» Accumulators: 200
» Broadcast: 3500
Interpreter: 3300 LOC
Standalone runner: 1200 LOC
Mesos runner: 700 LOC
Hadoop I/O: 400 LOC
Conclusion Spark provides a variety of features to improve application performance » Broadcast + accumulators for common sharing patterns » Data layout through controlled partitioning
With in-memory data, the bottleneck often shifts to network or CPU
You can do more by hacking Spark itself – ask us if interested!