big data apache spark + scala
TRANSCRIPT
![Page 2: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/2.jpg)
Spark, elevatorthe pitch
2
![Page 3: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/3.jpg)
Spark, the elevator pitch
Developed in 2009 at UC Berkeley AMPLab, open sourced in 2010, Spark has since become one of the largest OSS communities in big data, with over 200 contributors in 50+ organizations
“Organizations that are looking at big data challenges – including collection, ETL, storage, exploration and analytics – should consider Spark for its in-memory performance and the breadth of its model.! It supports advanced analytics solutions on Hadoop clusters, including the iterative model required for machine learning and graph analysis.”!
Gartner, Advanced Analytics and Data Science (2014)
spark.apache.org3
![Page 4: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/4.jpg)
Spark, the elevator pitch
4
![Page 5: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/5.jpg)
Spark, the elevator pitch
circa 2010: a unified engine for enterprise data workflows,based on commodity hardware a decade later…
Spark: Cluster Computing with Working Sets! Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica people.csail.mit.edu/matei/papers/2010/hotcloud_spark.pdf! ! Resilient Distributed Datasets: A Fault-Tolerant Abstraction for! In-Memory Cluster Computing! Matei Zaharia, Mosharaf Chowdhury,Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica! usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
5
![Page 6: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/6.jpg)
Spark, the elevator pitch
Spark Core is the general execution engine for the Spark platform that other functionality is built atop:! !
• •
in-memory computing capabilities deliver speed! general execution model supports wide variety of use cases! ease of development – native APIs in Java, Scala, Python (+ SQL, Clojure, R)•
6
![Page 7: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/7.jpg)
7
![Page 8: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/8.jpg)
Spark, the elevator pitch
Sustained exponential growth, as one of the mostactive Apache projects ohloh.net/orgs/apache
8
![Page 10: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/10.jpg)
TL;DR: Smashing The Previous Petabyte Sort Record
databricks.com/blog/2014/11/05/spark-officially- sets-a-new-record-in-large-scale-sorting.html
9
![Page 11: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/11.jpg)
![Page 12: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/12.jpg)
![Page 13: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/13.jpg)
![Page 14: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/14.jpg)
![Page 15: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/15.jpg)
![Page 16: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/16.jpg)
![Page 17: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/17.jpg)
![Page 18: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/18.jpg)
Why Streaming?
10
![Page 19: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/19.jpg)
ecause Machine Data!
Why Streaming?
B
I <3 Logs Jay Kreps O’Reilly (2014)! shop.oreilly.com/product/ 0636920034339.do
11
![Page 20: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/20.jpg)
Why Streaming?
Because Google!
MillWheel: Fault-Tolerant Stream Processing at Internet Scale! Tyler Akidau, Alex Balikov, Kaya Bekiroglu, Slava Chernyak, Josh Haberman, Reuven Lax, Sam McVeety, Daniel Mills, Paul Nordstrom, Sam Whittle! Very Large Data Bases (2013)! research.google.com/pubs/ pub41378.html
12
![Page 21: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/21.jpg)
hy Streaming?
Because IoT!
kickstarter.com/projects/1614456084/b4rm4n- be-a-cocktail-hero
W
13
![Page 22: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/22.jpg)
hy Streaming?
Because IoT! (exabytes/day per sensor)
bits.blogs.nytimes.com/2013/06/19/g-e-makes-the- machine-and-then-uses-sensors-to-listen-to-it/
W
14
![Page 23: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/23.jpg)
Spark Streaming
15
![Page 24: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/24.jpg)
![Page 25: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/25.jpg)
![Page 26: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/26.jpg)
Spark Streaming: Integration
Data can be ingested from many sources: Kafka, Flume, Twitter, ZeroMQ,TCP sockets, etc.!
Results can be pushed out to filesystems, databases, live dashboards, etc.!
Spark’s built-in machine learning algorithms andgraph processing algorithms can be applied todata streams
19
![Page 27: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/27.jpg)
014! ! graduated (Spark 0.9)
Spark Streaming: Timeline
2012! 2013!
! !
project started!alpha release (Spark 0.7)!
project lead: Tathagata Das @tathadas
20
Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing! Matei Zaharia,Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, Ion Stoica! Berkeley EECS (2012-12-14)! www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf
![Page 28: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/28.jpg)
Spark Streaming: Requirements
Typical kinds of applications:!
• • • • •
datacenter operations!
web app funnel metrics!
ad optimization!
anti-fraud!
various telematics!
and much much more!
21
![Page 29: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/29.jpg)
Spark Streaming: Some Excellent Resources
Programming Guide spark.apache.org/docs/latest/streaming- programming-guide.html! TD @ Spark Summit 2014 youtu.be/o-NXwFrNAWQ?list=PLTPXxbhUt- YWGNTaDj6HSjnHMxiTD1HCR! “Deep Dive into Spark Streaming” s l i d e s h a re . n e t / s p a r k - p ro j e c t / d e e p - divewithsparkstreaming- tathagatadassparkmeetup20130617! Spark Reference Applications databricks.gitbooks.io/databricks-spark- reference-applications/
22
![Page 30: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/30.jpg)
Quiz: name the bits and pieces…
import org.apache.spark.streaming._ import org.apache.spark.streaming.StreamingContext._
// create val ssc =
// create val lines
a StreamingContext with a SparkConf configuration new StreamingContext(sparkConf, Seconds(10))
a DStream that will connect to serverIP:serverPort = ssc.socketTextStream(serverIP, serverPort)
// split each line into words val words = lines.flatMap(_.split(" "))
// count each word in each batch! val pairs = words.map(word => (word, 1)) val wordCounts = pairs.reduceByKey(_ + _)
// print a few of the counts to the console wordCounts.print()
ssc.start() ssc.awaitTermination()
23
![Page 31: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/31.jpg)
More Components
36
![Page 32: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/32.jpg)
![Page 33: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/33.jpg)
![Page 34: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/34.jpg)
![Page 35: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/35.jpg)
![Page 36: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/36.jpg)
![Page 37: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/37.jpg)
![Page 38: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/38.jpg)
![Page 39: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/39.jpg)
![Page 40: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/40.jpg)
Complementary Frameworks
36
![Page 41: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/41.jpg)
Spark Integrations:
Insights
cloud-based notebooks… ETL… the Hadoop ecosystem…widespread use of PyData… advanced analytics in streaming… rich custom search… web apps for data APIs… low-latency + multi-tenancy…
37
DiscoverIntegrate With Many Other
Systems
Use Lots of Different Data Sources
Run Sophisticated
AnalyticsClean Up Your Data
![Page 42: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/42.jpg)
Spark Integrations: Advanced analytics for streaming use cases
Kafka + Spark + Cassandra! datastax.com/documentation/datastax_enterprise/4.5/datastax_enterprise/spark/sparkIntro.html! http://helenaedelson.com/?p=991! github.com/datastax/spark-
cassandra-connector! github.com/dibbhatt/kafka-spark-
consumer
38
columnar key-valuedata streams
unified compute
![Page 43: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/43.jpg)
Spark Integrations: Rich search, immediate insights
Spark + ElasticSearch! databricks.com/blog/2014/06/27/application-spotlight- elasticsearch.html!
elasticsearch.org/guide/en/elasticsearch/hadoop/current/ spark.html!
spark-summit.org/2014/talk/streamlining-search- indexing-using-elastic-search-and-spark
39
document search
unified compute
![Page 44: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/44.jpg)
Spark Integrations: General Guidelines
• • •
use Tachyon as a best practice for sharing between two streaming apps!
or write to Cassandra or HBase / then read back!
design patterns for integration: spark.apache.org/docs/latest/streaming- programming-guide.html#output- operations-on-dstreams
40
![Page 45: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/45.jpg)
Examples
56
![Page 46: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/46.jpg)
![Page 47: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/47.jpg)
![Page 48: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/48.jpg)
![Page 49: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/49.jpg)
![Page 50: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/50.jpg)
Resources
56
![Page 51: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/51.jpg)
community:
spark.apache.org/community.htmlvideo+slide archives: spark-summit.orgevents worldwide: goo.gl/2YqJZKresources: workshops:
databricks.com/spark-training-resources databricks.com/spark-training
59
![Page 52: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/52.jpg)
atei Zaharia O’Reilly (2015*) hop.oreilly.com/product/
636920028512.do
Holden Karau Packt (2013) shop.oreilly.com/product/ 9781782167068.do
books:
Learning Spark Holden Karau, Fast Data Processing
with SparkAndy Konwinski,
Spark in Action Chris Fregly Manning (2015*) sparkinaction.com/
60
![Page 53: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/53.jpg)
Paco Nathan ( @pacoid )Spark DatabricksAndy Petrella (@noootsab)Gerard Mass (@maasg) devoxx
Los créditos de algunos de estos gráficos, tablas o logos son:
![Page 54: Big data apache spark + scala](https://reader034.vdocuments.net/reader034/viewer/2022042600/58f9a980760da3da068b6f47/html5/thumbnails/54.jpg)
aspgems.com @aspgems
Gracias