What’s New in Spark 0.6 and Shark 0.2
November 5, 2012
UC BERKELEY
www.spark-project.org
Agenda
Intro & Spark 0.6 tour (Matei Zaharia)
Standalone deploy mode (Denny Britz)
Shark 0.2 (Reynold Xin)
Q & A
What Are Spark & Shark?
Spark: fast cluster computing engine based on general operators & in-memory computing
Shark: Hive-compatible data warehouse system built on Spark
Both are open source projects from the UC Berkeley AMP Lab
What is the AMP Lab?
60-person lab focusing on big data
Funded by NSF, DARPA, 18 companies
Goal: build an open-source, next-generation analytics stack
[Diagram: AMP Lab stack. Components shown: Mesos, Spark, Shark, Streaming, Graph, Learning, Hadoop, MPI, ...]
Some Exciting News
Recently, three full-time developers joined AMP to work on these projects
We also encourage outside contributions!
»This release: Shark server (Yahoo!), improved accumulators (Quantifind)
Spark 0.6 Release
Biggest release so far in terms of features
Biggest in terms of developers (18 total, 12 new)
Focus areas: ease-of-use and performance
Ease-of-Use
Spark already had good traction despite two fairly researchy aspects
»Scala language
»Requirement to run on Mesos
A big goal was to improve these:
»Java API (and upcoming API in Python)
»Simpler deployment (standalone mode, YARN)
Java API
Scala:
lines.filter(_.contains("error")).count()
Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  public Boolean call(String s) {
    return s.contains("error");
  }
}).count();
Java API Features
Supports all existing Spark features
»RDDs, accumulators, broadcast variables
Retains type safety through specific classes for RDDs of special types
»E.g. JavaPairRDD<K, V> for key-value pairs
Using Key-Value Pairs
import scala.Tuple2;
JavaRDD<String> words = ...;
JavaPairRDD<String, Integer> ones = words.map(
  new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String s) {
      return new Tuple2<String, Integer>(s, 1);
    }
  });
// Can now call ones.reduceByKey(), groupByKey(), etc
More info: spark-project.org/docs/0.6.0/
Coming Next: PySpark
lines = sc.textFile(sys.argv[1])
counts = lines.flatMap(lambda x: x.split(' ')) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(lambda x, y: x + y)
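To make the three steps of the word count concrete, here is a minimal single-machine sketch in plain Python. It does not use Spark at all; the helper names `flat_map` and `reduce_by_key` are hypothetical stand-ins written for this illustration, not PySpark functions, and the sample input is made up.

```python
def flat_map(func, items):
    """Apply func to each item and concatenate the resulting lists."""
    return [out for item in items for out in func(item)]


def reduce_by_key(func, pairs):
    """Combine all values that share a key using func."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


# Hypothetical input standing in for the lines of a text file.
lines = ["error in task", "task done", "error again"]

words = flat_map(lambda x: x.split(' '), lines)   # one word per element
ones = [(w, 1) for w in words]                    # (word, 1) pairs
counts = reduce_by_key(lambda x, y: x + y, ones)  # sum the 1s per word
```

In this sketch everything runs in one process; the point of the cluster version is that each step can run in parallel across partitions of the data.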
Simpler Deployment
Refactored Spark’s scheduler to allow running on different cluster managers
Denny will talk about the standalone mode…
Other Ease-of-Use Work
Documentation
»Big effort to improve Spark’s help and Scaladoc
Debugging hints (pointers to user code in logs)
Maven Central artifacts
spark-project.org/documentation.html
Performance
New ConnectionManager and BlockManager
»Replace simple HTTP shuffle with faster, async NIO
Faster control-plane (task scheduling & launch)
Per-RDD control of storage level
Some Graphs
[Chart: Large User App (2000 maps / 1000 reduces); y-axis: running time (minutes), 0 to 120; visible series label: Spark 0.5]
[Chart: Wikipedia Search Demo; y-axis: running time (ms), 0 to 1000; visible series label: Spark 0.5]
Per-RDD Storage Level
import spark.storage.StorageLevel
val data = file.map(...)
// Keep in memory, recompute when out of space
// (default behavior with cache())
data.persist(StorageLevel.MEMORY_ONLY)
// Drop to disk instead of recomputing
data.persist(StorageLevel.MEMORY_AND_DISK)
// Serialize in-memory data
data.persist(StorageLevel.MEMORY_ONLY_SER)
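The trade-off between recomputing and dropping to disk can be illustrated with a toy model in plain Python. This is only a sketch under stated assumptions, not Spark code: `ToyCache` and `run` are hypothetical names, the "disk" is just a second dictionary, and eviction is simplified to oldest-first, which may differ from what Spark actually does.

```python
class ToyCache:
    """Toy model of a bounded in-memory partition cache (not Spark code)."""

    def __init__(self, compute, capacity, spill_to_disk):
        self.compute = compute              # function: partition id -> data
        self.capacity = capacity            # max partitions held in memory
        self.spill_to_disk = spill_to_disk  # True models MEMORY_AND_DISK
        self.memory = {}                    # insertion-ordered, oldest first
        self.disk = {}                      # stand-in for on-disk storage
        self.computations = 0               # how often compute() was called

    def get(self, pid):
        if pid in self.memory:
            return self.memory[pid]
        if pid in self.disk:
            data = self.disk[pid]           # read back instead of recomputing
        else:
            data = self.compute(pid)
            self.computations += 1
        self._store(pid, data)
        return data

    def _store(self, pid, data):
        if len(self.memory) >= self.capacity:
            old_pid = next(iter(self.memory))    # evict the oldest entry
            old_data = self.memory.pop(old_pid)
            if self.spill_to_disk:
                self.disk[old_pid] = old_data    # keep a copy on "disk"
        self.memory[pid] = data


def run(spill_to_disk):
    """Access three partitions twice with room for only two in memory."""
    cache = ToyCache(lambda pid: [pid] * 3, capacity=2,
                     spill_to_disk=spill_to_disk)
    for pid in [0, 1, 2, 0, 1, 2]:
        cache.get(pid)
    return cache.computations


memory_only_computations = run(spill_to_disk=False)
memory_and_disk_computations = run(spill_to_disk=True)
```

With this access pattern the memory-only variant computes every partition on every access (6 computations), while the spilling variant computes each partition once (3 computations) and pays for reads from the stand-in disk instead.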
Compatibility
We’ve always strived to stay source-compatible!
The only change in this release is in configuration: spark.cache.class is replaced with per-RDD storage levels
Shark 0.2
Hive compatibility improvements
Thrift server mode
Performance improvements
Simpler deployment (comes with Spark 0.6)
Hive Compatibility
Hive 0.9 support
Full UDF/UDAF support
ADD FILE support for running scripts
User-supplied jars using ADD JAR
Thrift Server
Contributed by Yahoo!, compatible with Hive Thrift server
Enables multiple clients to share cached tables
BI tool integration (e.g. Tableau)
Performance
[Chart: Group By (1B items, 150M distinct); y-axis: running time (secs), 0 to 70; visible series label: Shark 0.1]
[Chart: Join (1B join 150M); y-axis: running time (secs), 0 to 250; visible series label: Shark 0.1]
Shark 0.3 Preview
In-memory columnar compression (dictionary encoding, run length encoding, etc.)
Map pruning
JVM bytecode generation for expression evals
Persist cached table metadata across sessions
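The two compression schemes named above can be sketched generically in a few lines of plain Python. This illustrates the techniques only and says nothing about how Shark implements them; all function names and the sample column are hypothetical.

```python
def run_length_encode(column):
    """Collapse consecutive repeats into [value, count] pairs."""
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


def run_length_decode(runs):
    """Expand [value, count] pairs back into the original column."""
    return [value for value, count in runs for _ in range(count)]


def dictionary_encode(column):
    """Replace each value with a small integer code plus a lookup table."""
    dictionary = {}
    codes = []
    for value in column:
        # First occurrence of a value gets the next unused code.
        codes.append(dictionary.setdefault(value, len(dictionary)))
    return list(dictionary), codes


def dictionary_decode(values, codes):
    """Map integer codes back to the original values."""
    return [values[code] for code in codes]


# Hypothetical column with few distinct values and long runs.
column = ["US", "US", "US", "DE", "DE", "US"]
runs = run_length_encode(column)
values, codes = dictionary_encode(column)
```

Both schemes work best on columns with few distinct values, which is one reason a columnar layout helps: values of the same column sit next to each other.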
Spark 0.7+
Spark Streaming
PySpark: Python API for Spark
Memory monitoring dashboard