Intro to Apache Spark by CTO of Twingo
TRANSCRIPT
![Page 1: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/1.jpg)
An Overview of Apache Spark
Ilya Gulman, CTO, Twingo
June 2014
![Page 2: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/2.jpg)
About Twingo
• A Big Data company
• Established in 2006 by Golan Nahum
• 25 employees
• Reseller and expert integrator of HP Vertica
• Reseller and integrator of MapR
• Reseller and expert integrator of MicroStrategy
• Deep knowledge of Python and Linux
• More than 20 successful Big Data projects
• Expertise in SaaS/OEM Big Data solutions
![Page 3: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/3.jpg)
Agenda
• What is Spark?
• The Difference with Spark
• SQL on Spark
• Combining the power
• Real-World Use Cases
• Resources
![Page 4: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/4.jpg)
What is Spark?
![Page 5: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/5.jpg)
What is Spark?
A fast and general MapReduce-like engine for large-scale data processing:
• Fast – in-memory data storage for very fast interactive queries; up to 100 times faster than Hadoop
• General – a unified platform that can combine SQL, machine learning, streaming, graph, and complex analytics
• Easy to use – applications can be developed in Java, Scala, or Python
• Integrated with Hadoop – can read from HDFS, HBase, Cassandra, and any Hadoop data source
![Page 6: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/6.jpg)
Spark is the Most Active Open Source Project in Big Data
[Chart: project contributors in the past year (0-140 scale); Spark leads Giraph, Storm, and Tez]
![Page 7: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/7.jpg)
The Spark Community
![Page 8: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/8.jpg)
Unified Platform
• Shark (SQL)
• Spark Streaming (streaming)
• MLlib (machine learning)
• GraphX (graph computation)
• Spark (general execution engine)

Continued innovation brings new functionality, e.g.:
• Java 8 (closures, lambda expressions)
• Spark SQL (SQL on Spark, not just Hive)
• BlinkDB (approximate queries)
• SparkR (R wrapper for Spark)
![Page 9: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/9.jpg)
Supported Languages
• Java
• Scala
• Python
• SQL
![Page 10: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/10.jpg)
Data Sources
• Local files
  – file:///opt/httpd/logs/access_log
• S3
• Hadoop Distributed File System (HDFS)
  – Regular files, sequence files, any other Hadoop InputFormat
• HBase
• Any other Hadoop data source
![Page 11: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/11.jpg)
The Difference with Spark
![Page 12: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/12.jpg)
Easy and Fast Big Data
• Easy to develop
  – Rich APIs in Java, Scala, and Python
  – Interactive shell
  – 2-5× less code
• Fast to run
  – General execution graphs
  – In-memory storage
  – Up to 10× faster on disk, 100× in memory
![Page 13: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/13.jpg)
Resilient Distributed Datasets (RDD)
• Spark revolves around RDDs
• An RDD is a fault-tolerant collection of elements that can be operated on in parallel
  – Parallelized collection: a Scala collection run in parallel
  – Hadoop dataset: records of files supported by Hadoop
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
![Page 14: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/14.jpg)
RDD Operations
• Transformations
  – Create a new dataset from an existing one
  – map, filter, distinct, union, sample, groupByKey, join, etc.
• Actions
  – Return a value after running a computation
  – collect, count, first, takeSample, foreach, etc.

Check the documentation for a complete list:
http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-operations
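The key behavioral difference between the two kinds of operations is laziness: transformations only describe a computation, while actions force it to run. A plain-Python analogy (not the Spark API) using lazy generators:

```python
# Transformations vs. actions, sketched with plain Python generators:
# generators build a pipeline but compute nothing until consumed,
# much like RDD transformations are only evaluated when an action runs.

lines = ["INFO ok", "ERROR disk", "ERROR net", "INFO done"]

# "Transformations": nothing is computed yet.
errors = (line for line in lines if "ERROR" in line)   # like rdd.filter(...)
words = (line.split()[1] for line in errors)           # like rdd.map(...)

# "Action": consuming the pipeline finally runs it, like rdd.collect().
result = list(words)
print(result)  # ['disk', 'net']
```

In Spark the same laziness lets the scheduler see the whole chain of transformations before choosing how to execute it.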
![Page 15: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/15.jpg)
RDD Code Examplefile = spark.textFile("hdfs://...")
errors = file.filter(lambda line: "ERROR" in line)
# Count all the errorserrors.count()
# Count errors mentioning MySQLerrors.filter(lambda line: "MySQL" in line).count()
# Fetch the MySQL errors as an array of stringserrors.filter(lambda line: "MySQL" in line).collect()
![Page 16: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/16.jpg)
RDD Persistence / Caching
• Variety of storage levels
  – MEMORY_ONLY (default), MEMORY_AND_DISK, etc.
• API calls
  – persist(StorageLevel)
  – cache() – shorthand for persist(StorageLevel.MEMORY_ONLY)
• Considerations
  – Read from disk vs. recompute (MEMORY_AND_DISK)
  – Total memory storage size (MEMORY_ONLY_SER)
  – Replicate to a second node for faster fault recovery (MEMORY_ONLY_2)
    • Consider this option if supporting a web application
http://spark.apache.org/docs/latest/scala-programming-guide.html#rdd-persistence
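Why caching matters can be shown with a plain-Python sketch (not the Spark API): without a cache, every action re-runs the whole lineage; with one, the materialized data is reused.

```python
# Count how often the "lineage" is actually executed.
compute_calls = 0

def expensive_transform(data):
    """Stand-in for a chain of RDD transformations."""
    global compute_calls
    compute_calls += 1
    return [x * x for x in data]

data = range(5)

# Uncached: two "actions" trigger two full recomputations.
sum(expensive_transform(data))
max(expensive_transform(data))
assert compute_calls == 2

# "Cached": compute once, then both actions reuse the stored result,
# mirroring rdd.cache() followed by several actions.
cached = expensive_transform(data)
sum(cached)
max(cached)
assert compute_calls == 3
```

The storage-level choices above are the same trade-off: keep results around (memory, disk, serialized, replicated) versus recompute them on demand.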
![Page 17: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/17.jpg)
Cache Scaling Matters
[Chart: execution time (s) vs. % of working set in cache]
![Page 18: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/18.jpg)
RDD Fault Recovery
RDDs track lineage information that can be used to efficiently recompute lost data:

```python
msgs = (textFile.filter(lambda s: s.startswith("ERROR"))
                .map(lambda s: s.split("\t")[2]))
```

HDFS file → filter(func = startsWith(...)) → filtered RDD → map(func = split(...)) → mapped RDD
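The lineage idea can be sketched in plain Python (an illustration, not Spark's internals): each dataset remembers its parent and the function that produced it, so a lost result can be rebuilt by replaying the chain.

```python
class SketchRDD:
    """Toy lineage node: parent dataset + the function that derives this one."""

    def __init__(self, parent, func, data=None):
        self.parent, self.func, self.data = parent, func, data

    def compute(self):
        if self.data is None:  # lost or never materialized: replay lineage
            self.data = self.func(self.parent.compute())
        return self.data

# Source data stands in for a file on stable storage (HDFS).
source = SketchRDD(None, None, data=["ERROR\ta\tx", "INFO\tb\ty", "ERROR\tc\tz"])
filtered = SketchRDD(source, lambda d: [s for s in d if s.startswith("ERROR")])
mapped = SketchRDD(filtered, lambda d: [s.split("\t")[2] for s in d])

print(mapped.compute())   # ['x', 'z']
mapped.data = None        # simulate losing the computed partition
print(mapped.compute())   # rebuilt from lineage, same result
```

Because only the functions and parent references need to be tracked, this recovery is much cheaper than replicating every intermediate dataset.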
![Page 19: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/19.jpg)
Interactive Shell
• Iterative development
  – Cache those RDDs
  – Open the shell and ask questions
  – We have all wished we could do this with MapReduce
  – Compile/save your code for scheduled jobs later
• Scala – spark-shell
• Python – pyspark
![Page 20: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/20.jpg)
SQL on Spark
![Page 21: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/21.jpg)
Before Spark - Hive
• Puts structure/schema onto HDFS data
• Compiles HiveQL queries into MapReduce jobs
• Very popular: 90+% of Facebook Hadoop jobs are generated by Hive
• Initially developed by Facebook
![Page 22: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/22.jpg)
But... Hive is slow
• Takes 20+ seconds even for simple queries
• "A good day is when I can run 6 Hive queries" – @mtraverso
![Page 23: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/23.jpg)
SQL over Spark
![Page 24: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/24.jpg)
Shark – SQL over Spark
• Hive-compatible (HiveQL, UDFs, metadata)
  – Works in existing Hive warehouses without changing queries or data
• Augments Hive
  – In-memory tables and columnar memory store
• Fast execution engine
  – Uses Spark as the underlying execution engine
  – Low-latency, interactive queries
  – Scales out and tolerates worker failures
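The payoff of in-memory tables is that repeated interactive SQL stays fast. As an analogy only (Shark itself runs HiveQL on Spark, not SQLite), the idea can be shown with Python's standard-library sqlite3 and an in-memory database:

```python
import sqlite3

# An in-memory table: data lives in RAM, so queries avoid disk entirely.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (level TEXT, msg TEXT)")
conn.executemany("INSERT INTO logs VALUES (?, ?)",
                 [("ERROR", "disk"), ("INFO", "ok"), ("ERROR", "net")])

# Interactive, low-latency query against the cached table.
count, = conn.execute(
    "SELECT COUNT(*) FROM logs WHERE level = 'ERROR'").fetchone()
print(count)  # 2
```

Shark applies the same principle at cluster scale: keep the warehouse tables cached (in a columnar memory store) and let Spark execute the query plan.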
![Page 26: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/26.jpg)
Machine Learning
![Page 27: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/27.jpg)
Machine Learning - MLlib
• K-Means
• L1- and L2-regularized Linear Regression
• L1- and L2-regularized Logistic Regression
• Alternating Least Squares
• Naive Bayes
• Stochastic Gradient Descent

* As of May 14, 2014
** Don't be surprised if you see the Mahout library converting to Spark soon
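To make the first item on that list concrete, here is a minimal one-dimensional k-means in plain Python; it illustrates the algorithm MLlib provides, not the MLlib API itself.

```python
def kmeans_1d(points, centers, iterations=10):
    """Toy 1-D k-means: alternate assignment and update steps."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [sum(ps) / len(ps) if ps else c
                   for c, ps in clusters.items()]
    return sorted(centers)

points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]
print(kmeans_1d(points, centers=[0.0, 5.0]))  # roughly [1.0, 9.5]
```

MLlib runs the same two steps as distributed Spark jobs over an RDD of points, which is why it benefits so much from caching the input in memory across iterations.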
![Page 28: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/28.jpg)
Streaming
![Page 29: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/29.jpg)
Comparison to Storm
• Higher throughput than Storm
  – Spark Streaming: 670k records/sec/node
  – Storm: 115k records/sec/node
  – Commercial systems: 100-500k records/sec/node
[Charts: throughput per node (MB/s) vs. record size (bytes) for WordCount and Grep, Spark vs. Storm]
![Page 30: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/30.jpg)
Combining the power
![Page 31: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/31.jpg)
Combining the power
• Use a machine-learning result as a table:

```
GENERATE KMeans(tweet_locations)
SAVE AS TABLE tweet_clusters;
```

• Combine SQL, ML, and streaming (Scala):

```scala
val points = sc.runSql[Double, Double](
  "select latitude, longitude from historic_tweets")
val model = KMeans.train(points, 10)
sc.twitterStream(...)
  .map(t => (model.closestCenter(t.location), 1))
  .reduceByWindow("5s", _ + _)
```
![Page 32: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/32.jpg)
Real-World Use Cases
![Page 33: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/33.jpg)
Spark at Yahoo!
• Fast machine learning for personalized news pages
![Page 34: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/34.jpg)
Spark at Yahoo!
• Hive on Spark (Shark)
  – Using existing BI tools to view and query advertising analytics data collected in Hadoop
  – Any tool that plugs into Hive, like Tableau, automatically works with Shark
![Page 35: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/35.jpg)
Spark at Conviva
• One of the largest streaming-video companies on the Internet
• 4+ billion video feeds per month (second only to YouTube)
• Conviva uses Spark Streaming to learn network conditions in real time
![Page 36: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/36.jpg)
Resources
![Page 37: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/37.jpg)
Remember
• If you want to use a new technology, you must learn that technology
• For those who have been using Hadoop for a while: at one time you had to learn all about MapReduce and how to manage and tune it
• To get the most out of a new technology you need to learn it, and that includes tuning
  – There are switches you can use to optimize your work
![Page 38: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/38.jpg)
Configuration
http://spark.apache.org/docs/latest/
Most important:
• Application configuration
  http://spark.apache.org/docs/latest/configuration.html
• Standalone cluster configuration
  http://spark.apache.org/docs/latest/spark-standalone.html
• Tuning guide
  http://spark.apache.org/docs/latest/tuning.html
![Page 39: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/39.jpg)
Resources
• Pig on Spark
  – http://apache-spark-user-list.1001560.n3.nabble.com/Pig-on-Spark-td2367.html
  – https://github.com/aniket486/pig
  – https://github.com/twitter/pig/tree/spork
  – http://docs.sigmoidanalytics.com/index.php/Setting_up_spork_with_spark_0.8.1
  – https://github.com/sigmoidanalytics/pig/tree/spork-hadoopasm-fix
• Latest on Spark
  – http://databricks.com/categories/spark/
  – http://www.spark-stack.org/
![Page 40: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/40.jpg)
Thank You
www.twingo.co.il
![Page 41: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/41.jpg)
Optional - More Examples
![Page 42: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/42.jpg)
Word Count
• Java MapReduce: ~15 lines of code
• Java Spark: ~7 lines of code
• Scala and Python: 4 lines of code
  – In the interactive shell, skip line 1 and replace the last line with counts.collect()
• Java 8: 4 lines of code

Java 8:

```java
SparkContext sc = new SparkContext(master, appName, [sparkHome], [jars]);
JavaRDD<String> file = sc.textFile("hdfs://...");
JavaRDD<String> counts = file
    .flatMap(line -> Arrays.asList(line.split(" ")))
    .mapToPair(w -> new Tuple2<String, Integer>(w, 1))
    .reduceByKey((x, y) -> x + y);
counts.saveAsTextFile("hdfs://...");
```

Scala:

```scala
val sc = new SparkContext(master, appName, [sparkHome], [jars])
val file = sc.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
```
![Page 43: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/43.jpg)
Network Word Count – Streaming

```scala
// Create the context with a 1-second batch size
val ssc = new StreamingContext(args(0), "NetworkWordCount", Seconds(1),
  System.getenv("SPARK_HOME"), StreamingContext.jarOfClass(this.getClass))

// Create a NetworkInputDStream on the target host:port and count the
// words in the input stream of \n-delimited text (e.g. generated by 'nc')
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_ONLY_SER)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()

ssc.start()
ssc.awaitTermination()  // keep the streaming context running
```
![Page 44: Intro to Apache Spark by CTO of Twingo](https://reader035.vdocuments.net/reader035/viewer/2022062503/587c2a0d1a28aba0118b5101/html5/thumbnails/44.jpg)
Deploying Spark – Cluster Manager Types
• Standalone mode
  – Comes bundled (EC2-capable)
• YARN
• Mesos