NightClazz Spark - Machine Learning / Introduction to Spark and Zeppelin
![Page 1: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/1.jpg)
Spark & Zeppelin
Introduction
#NightClazz Spark & ML
10/03/16
Fabrice Sznajderman
![Page 2: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/2.jpg)
Agenda
● Apache Spark
● Apache Zeppelin
Introduction
![Page 3: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/3.jpg)
Who am I?
● Fabrice Sznajderman
  ○ Java/Scala/Web developer
    ■ Java/Scala trainer
● BrownBagLunch.fr
![Page 4: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/4.jpg)
Spark
Introduction
![Page 5: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/5.jpg)
Big picture
Spark introduction
![Page 6: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/6.jpg)
What is it about?
● A cluster computing framework
● Open source
● Written in Scala
![Page 7: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/7.jpg)
History
2009: Project started at UC Berkeley's AMPLab
2010: Project open-sourced
2013: Became an Apache project; the Databricks company was created
2014: Became a top-level Apache project and the most active project in the Apache Software Foundation (500+ contributors)
2014: Release of Spark 1.0, 1.1 and 1.2
2015: Release of Spark 1.3, 1.4, 1.5 and 1.6
2015: IBM, SAP… invest in Spark
2015: 2,000 registrations at Spark Summit SF, 1,000 at Spark Summit Amsterdam
2016: New Spark Summit in San Francisco in June 2016
![Page 8: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/8.jpg)
Multi-language
● Scala
● Java
● Python
● R
![Page 9: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/9.jpg)
Spark Shell
● REPL
● Learn the API
● Interactive analysis
![Page 10: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/10.jpg)
RDD
Core concept
![Page 11: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/11.jpg)
Definition
● Resilient
● Distributed
● Dataset
![Page 12: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/12.jpg)
Properties
● Immutable
● Serializable
● Can be persisted in RAM and/or on disk
● Simple or complex element types
![Page 13: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/13.jpg)
Use as a collection
● DSL
● Monadic type
● Several operators
  ○ map, filter, count, distinct, flatMap, ...
  ○ join, groupBy, union, ...
![Page 14: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/14.jpg)
Created from
● A collection (List, Set)
● Various file formats
  ○ JSON, text, Hadoop SequenceFile, ...
● Various databases
  ○ JDBC, Cassandra, ...
● Other RDDs
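Three of the sources listed above (a collection, a text file, another RDD) can be sketched as follows. This is a minimal illustration rather than code from the talk: the object name `RddCreationSketch`, the temporary file and the sample values are assumptions, and it presumes Spark is on the classpath and runs in local mode. Database sources are left out because they need an external connector.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, introduced only for this illustration.
object RddCreationSketch {
  /** Returns (size of the collection RDD, number of file lines, doubled values). */
  def run(): (Long, Long, List[Int]) = {
    val conf = new SparkConf().setAppName("rdd-creation-sketch").setMaster("local[1]")
    val sc = new SparkContext(conf)
    try {
      // 1. From an in-memory collection
      val fromCollection = sc.parallelize(List(1, 2, 3, 4))

      // 2. From a text file: one element per line (a temp file stands in for real data)
      val path = Files.createTempFile("rdd-creation-sketch", ".txt")
      Files.write(path, "first line\nsecond line\n".getBytes(StandardCharsets.UTF_8))
      val fromFile = sc.textFile(path.toUri.toString)

      // 3. From another RDD: every transformation yields a new RDD
      val fromRdd = fromCollection.map(_ * 2)

      (fromCollection.count(), fromFile.count(), fromRdd.collect().toList)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The original RDD is never modified: `fromRdd` is a new RDD derived from `fromCollection`, which is what the "Immutable" property implies.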
![Page 15: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/15.jpg)
Sample
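import org.apache.spark.{SparkConf, SparkContext}

// Configure a local Spark application
val conf = new SparkConf()
  .setAppName("sample")
  .setMaster("local")
val sc = new SparkContext(conf)
// One RDD element per line of the file
val rdd = sc.textFile("data.csv")
// Count the lines longer than 10 characters
val nb = rdd.map(s => s.length).filter(i => i > 10).count()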
![Page 16: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/16.jpg)
Lazy-evaluation
● Intermediate operators (transformations): only describe the computation
  ○ map, filter, distinct, flatMap, …
● Final operators (actions): trigger the computation
  ○ count, mean, fold, first, ...
val nb = rdd.map(s => s.length).filter(i => i > 10).count()
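One way to observe this laziness is to count how many times the function passed to `map` actually runs. The sketch below is an illustration under assumptions, not code from the talk: the object name `LazyEvaluationSketch` and the sample strings are made up, and `longAccumulator` assumes Spark 2.0 or later.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper object, introduced only for this illustration.
object LazyEvaluationSketch {
  /** Returns (map calls before the action, result of count(), map calls after the action). */
  def run(): (Long, Long, Long) = {
    val conf = new SparkConf().setAppName("lazy-evaluation-sketch").setMaster("local[1]")
    val sc = new SparkContext(conf)
    try {
      // Counts every invocation of the function given to map
      val mapCalls = sc.longAccumulator("map calls")

      val rdd = sc.parallelize(Seq("spark", "zeppelin", "rdd"))
      // Intermediate operators: nothing is executed yet
      val lengths = rdd.map { s => mapCalls.add(1); s.length }
      val longOnes = lengths.filter(_ > 4)
      val before = mapCalls.sum

      // Final operator: the whole chain runs now
      val nb = longOnes.count()
      val after = mapCalls.sum

      (before, nb, after)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The counter stays at zero until `count()` is called, and only then does the `map` function run once per element.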
![Page 17: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/17.jpg)
Caching
● Reuse an intermediate result
● cache operator
● Avoid recomputation
// Computed once, then kept for both actions below
val r = rdd.map(s => s.length).cache()
val nb = r.filter(i => i > 10).count()
val sum = r.filter(i => i > 10).sum()
![Page 18: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/18.jpg)
Distributed Architecture
Core concept
![Page 19: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/19.jpg)
Run locally
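import org.apache.spark.{SparkConf, SparkContext}

// Pick one master URL:
//   "local"    : a single worker thread
//   "local[4]" : four worker threads
//   "local[*]" : one worker thread per logical core
val master = "local[*]"
val conf = new SparkConf().setAppName("sample")
  .setMaster(master)
val sc = new SparkContext(conf)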
![Page 20: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/20.jpg)
Run on cluster
val master = "spark://..."
val conf = new SparkConf().setAppName("sample")
  .setMaster(master)
val sc = new SparkContext(conf)
![Page 21: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/21.jpg)
Standalone Cluster
[Diagram: one Spark master, three Spark slaves each running executors (E), and three Spark clients]
![Page 22: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/22.jpg)
Modules
Core concept
![Page 23: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/23.jpg)
Composed of
● Spark Core
● Spark SQL, Spark Streaming, MLlib, GraphX
● DataFrames, ML Pipeline
● Several data sources
![Page 24: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/24.jpg)
Several data sources
http://prog3.com/article/2015-06-18/2824958
![Page 25: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/25.jpg)
Spark SQL
● Structured data processing
● SQL language
● DataFrame
![Page 26: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/26.jpg)
DataFrame 1/3
What?
● A distributed collection of rows organized into named columns
● An abstraction for selecting, filtering, aggregating and plotting structured data
● Provides a schema
● Not an RDD replacement
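A short sketch of "named columns" and of selecting, filtering and aggregating. It is an illustration under assumptions, not code from the talk: it uses `SparkSession`, the entry point of Spark 2.0 and later (Spark 1.x, current at the time of the talk, used `SQLContext`), and the object name `DataFrameSketch`, the column names and the rows are made up.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

// Hypothetical helper object, introduced only for this illustration.
object DataFrameSketch {
  /** Returns the average age per city for people older than 40. */
  def run(): Map[String, Double] = {
    val spark = SparkSession.builder()
      .appName("dataframe-sketch")
      .master("local[1]")
      .getOrCreate()
    import spark.implicits._
    try {
      // Rows organized into named columns; the schema is inferred from the tuple types
      val people = Seq(("Ann", "Paris", 34), ("Bob", "Lyon", 45), ("Cid", "Paris", 52))
        .toDF("name", "city", "age")
      people.printSchema()

      // Filter, then aggregate, by referring to columns by name
      val avgAgeByCity = people
        .filter($"age" > 40)
        .groupBy("city")
        .agg(avg("age").as("avg_age"))

      avgAgeByCity.collect().map(row => row.getString(0) -> row.getDouble(1)).toMap
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Because the query is expressed against named columns rather than opaque functions, the engine can see what is being filtered and aggregated.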
![Page 27: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/27.jpg)
DataFrame 1/3
Why?
● RDDs are more efficient than what came before (Hadoop)
● But RDDs are still too complicated for common tasks
● DataFrames are simpler and faster
![Page 28: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/28.jpg)
DataFrame 2/3
Optimized
![Page 29: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/29.jpg)
DataFrame 3/3
How?
● Available since Spark 1.3
● The DataFrame API is just an interface
  ○ The implementation is done once, in the Spark engine
  ○ All languages benefit from the optimizations without rewriting anything
![Page 30: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/30.jpg)
Spark Streaming
● Framework built on the RDD and DataFrame APIs
● Real-time data processing
● The RDD becomes a DStream here
● Same as before, but the dataset is not static
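To illustrate "same as before, but the dataset is not static", the sketch below applies the earlier `map` / `filter` / `count` chain to a DStream. It is an assumption-laden illustration, not code from the talk: the object name `DStreamSketch` and the sample lines are made up, and `queueStream` is used as a test-only source so that no socket or message broker is needed.

```scala
import scala.collection.mutable

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical helper object, introduced only for this illustration.
object DStreamSketch {
  /** Returns the largest per-batch count of lines longer than 10 characters. */
  def run(): Long = {
    val conf = new SparkConf().setAppName("dstream-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1)) // one micro-batch per second

    // Test-only source: each queued RDD becomes one micro-batch
    val batches = mutable.Queue[RDD[String]]()
    batches += ssc.sparkContext.parallelize(Seq("a short line", "tiny", "another long line"))
    val lines = ssc.queueStream(batches)

    // Same operators as on an RDD, applied to every batch
    val longLineCounts = lines.map(_.length).filter(_ > 10).count()

    // Collect the per-batch results on the driver
    val seen = mutable.ArrayBuffer[Long]()
    longLineCounts.foreachRDD { rdd => seen.synchronized { seen ++= rdd.collect() } }

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000) // let a few batches run
    ssc.stop(stopSparkContext = true, stopGracefully = false)
    seen.synchronized { seen.max }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Only the first batch contains data, so later batches report a count of zero; that is why the sketch returns the maximum.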
![Page 31: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/31.jpg)
Spark Streaming
Internal flow
http://spark.apache.org/docs/latest/img/streaming-flow.png
![Page 32: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/32.jpg)
Spark Streaming
Inputs / Outputs
http://spark.apache.org/docs/latest/img/streaming-arch.png
![Page 33: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/33.jpg)
Spark MLlib
● Makes practical machine learning scalable and easy
● Provides common learning algorithms & utilities
![Page 34: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/34.jpg)
Spark MLlib
● Divided into 2 packages
  ○ spark.mllib
  ○ spark.ml
![Page 35: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/35.jpg)
Spark MLlib
spark.mllib
● Original API, based on RDDs
● Each model has its own interface
![Page 36: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/36.jpg)
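Spark MLlib
spark.mllib
Each model exposes its own interface

// Model 1: random forest
import org.apache.spark.SparkContext
import org.apache.spark.mllib.tree.RandomForest

val sc: SparkContext = ??? // init SparkContext
// randomSplit returns an Array of RDDs
val Array(trainingData, checkData) = sc.textFile("train.csv")
  /* transform */
  .randomSplit(Array(0.98, 0.02))
val model = RandomForest.trainClassifier(
  trainingData, 10, Map[Int, Int](), 30, "auto", "gini", 7, 100, 0)
val prediction = model.predict(...)

// Model 2: logistic regression
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

// init SparkContext
val Array(trainingData, checkData) = sc.textFile("train.csv")
  /* transform */
  .randomSplit(Array(0.98, 0.02))
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(10)
  .run(trainingData)
val prediction = model.predict(...)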
![Page 37: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/37.jpg)
Spark MLlib
spark.ml
● Provides a uniform set of high-level APIs
● Built on top of DataFrames
● Pipeline concepts
  ○ Transformer
  ○ Estimator
  ○ Pipeline
![Page 38: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/38.jpg)
Spark MLlib
spark.ml
● Transformer: transform(DF)
  ○ maps a DataFrame to a new one by adding a column
  ○ e.g. predicts the label and adds the result as a new column
● Estimator: fit(DF)
  ○ a learning algorithm
  ○ produces a model from a DataFrame
![Page 39: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/39.jpg)
Spark MLlib
spark.ml
● Pipeline
  ○ sequence of stages (transformer or estimator)
  ○ specific order
![Page 40: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/40.jpg)
Spark MLlib
spark.ml
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.DataFrame

val training: DataFrame = ???
val test: DataFrame = ???
val lr = new LogisticRegression()
lr.setMaxIter(10).setRegParam(0.01)
// Train the model
val model1 = lr.fit(training)
// Predict on the test data
model1.transform(test)
![Page 41: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/41.jpg)
Spark MLlib
spark.ml
Different models, same way to create the pipeline

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{LogisticRegression, RandomForestClassifier}
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.DataFrame

// Pipeline 1: random forest
val training: DataFrame = ???
val test: DataFrame = ???
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val rf = new RandomForestClassifier()
/* .add parameter */
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, rf))
// Fit the pipeline to training documents.
val model = pipeline.fit(training)
model.transform(test)

// Pipeline 2: logistic regression
val training: DataFrame = ???
val test: DataFrame = ???
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
/* .add parameter */
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))
// Fit the pipeline to training documents.
val model = pipeline.fit(training)
model.transform(test)
![Page 42: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/42.jpg)
Zeppelin
Introduction
![Page 43: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/43.jpg)
Big picture
Zeppelin introduction
![Page 44: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/44.jpg)
What is it about?
● “A web-based notebook that enables interactive data analytics”
● 100% open source
● Undergoing incubation
![Page 45: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/45.jpg)
Multi-purpose
● Data Ingestion
● Data Discovery
● Data Analytics
● Data Visualization & Collaboration
![Page 46: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/46.jpg)
Multiple language backends
● Scala
● Shell
● Python
● Markdown
● Your own language, by creating your own interpreter
![Page 47: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/47.jpg)
Data visualization
An easy way to build graphs from data
![Page 48: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/48.jpg)
Demo
![Page 49: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/49.jpg)
Thank you
![Page 50: NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin](https://reader031.vdocuments.net/reader031/viewer/2022022202/587c04af1a28ab7c668b749b/html5/thumbnails/50.jpg)