NightClazz Spark - Machine Learning / Introduction to Spark and Zeppelin

Spark & Zeppelin

Introduction

#NightClazz Spark & ML

10/03/16

Fabrice Sznajderman

Agenda

● Apache Spark
● Apache Zeppelin

Introduction

Who am I?

● Fabrice Sznajderman
  ○ Java/Scala/Web developer
    ■ Java/Scala trainer
● BrownBagLunch.fr

Spark
Introduction

Big picture
Spark introduction

What is it about?

● A cluster computing framework
● Open source
● Written in Scala

History

2009 : Project started at UC Berkeley's AMPLab

2010 : Project open-sourced

2013 : Became an Apache project; the Databricks company was created

2014 : Became a top-level Apache project and the most active project in the Apache Software Foundation (500+ contributors)

2014 : Release of Spark 1.0, 1.1 and 1.2

2015 : Release of Spark 1.3, 1.4 and 1.5 (1.6 followed in early January 2016)

2015 : IBM, SAP… invested in Spark

2015 : 2,000 registrations at Spark Summit SF, 1,000 at Spark Summit Amsterdam

2016 : next Spark Summit in San Francisco in June

Multi-language

● Scala
● Java
● Python
● R

Spark Shell

● REPL
● Learn the API
● Interactive analysis

RDD
Core concept

Definition

● Resilient
● Distributed
● Dataset

Properties

● Immutable
● Serializable
● Can be persisted in RAM and/or on disk
● Simple or complex element types

Use as a collection

● DSL
● Monadic type
● Several operators
  ○ map, filter, count, distinct, flatMap, ...
  ○ join, groupBy, union, ...
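The collection-like operators can be sketched as follows. This is a minimal illustration, not taken from the slides: it assumes a local master, and the sample data and the application name are made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local master and tiny in-memory data, for illustration only.
val conf = new SparkConf().setAppName("operators-sketch").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

val lines = sc.parallelize(Seq("spark and zeppelin", "spark and scala"))

// flatMap + distinct: split lines into words and keep each word once.
val words = lines.flatMap(_.split(" ")).distinct().collect().sorted

// RDDs of pairs gain key-based operators such as join.
val ages = sc.parallelize(Seq(("alice", 30), ("bob", 25)))
val cities = sc.parallelize(Seq(("alice", "Paris"), ("bob", "Lyon")))
val joined = ages.join(cities).collect().sortBy(_._1)

// union concatenates two RDDs of the same element type.
val all = sc.parallelize(Seq(1, 2)).union(sc.parallelize(Seq(3))).collect().sorted

sc.stop()
```

The results are collected and sorted on the driver only to make them easy to inspect; on real data the work would stay distributed as long as possible.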

Created from

● A collection (List, Set)
● Various file formats
  ○ json, text, Hadoop SequenceFile, ...
● Various databases
  ○ JDBC, Cassandra, ...
● Other RDDs
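Three of these sources can be sketched as follows: a collection, a text file and another RDD. The temporary file only stands in for real data, and the master and names are assumptions made for the example.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local master, for illustration only.
val conf = new SparkConf().setAppName("create-sketch").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

// From a collection.
val fromList = sc.parallelize(List(1, 2, 3))

// From a text file: a temporary file stands in for real data (assumption).
val path = Files.createTempFile("sample", ".txt")
Files.write(path, "first line\nsecond line".getBytes("UTF-8"))
val fromFile = sc.textFile(path.toString)

// From another RDD: every transformation returns a new RDD.
val fromRdd = fromList.map(_ * 2)

val listCount = fromList.count()
val lineCount = fromFile.count()
val doubled = fromRdd.collect()

sc.stop()
```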

Sample

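import org.apache.spark.{SparkConf, SparkContext}

// Configure an application named "sample" that runs locally.
val conf = new SparkConf()
  .setAppName("sample")
  .setMaster("local")
val sc = new SparkContext(conf)

// Read the file as an RDD of lines, then count the lines longer than 10 characters.
val rdd = sc.textFile("data.csv")
val nb = rdd.map(s => s.length).filter(i => i > 10).count()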

Lazy evaluation

● Intermediate operators (transformations)
  ○ map, filter, distinct, flatMap, …
● Final operators (actions)
  ○ count, mean, fold, first, ...

val nb = rdd.map(s => s.length).filter(i => i > 10).count()
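One way to observe laziness is to build a chain of intermediate operators on a file that does not exist: nothing is read until a final operator runs. A minimal sketch, assuming a local master; the path is hypothetical and deliberately missing.

```scala
import scala.util.Try
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: local master, for illustration only.
val conf = new SparkConf().setAppName("lazy-sketch").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

// Hypothetical path that does not exist: building the chain still succeeds,
// because intermediate operators only record what to do.
val lengths = sc.textFile("/no/such/file.csv").map(_.length).filter(_ > 10)

// The final operator triggers the actual read, so the missing file
// is only noticed here.
val result = Try(lengths.count())

sc.stop()
```

Wrapping the action in `Try` keeps the sketch from aborting; in this setup `result` holds the failure raised when the input path is finally resolved.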

Caching

● Reuse an intermediate result
● Cache operator
● Avoids re-computing

val r = rdd.map(s => s.length).cache()

val nb = r.filter(i => i > 10).count()
val sum = r.filter(i => i > 10).sum()
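On an RDD, cache() is shorthand for persist with the memory-only storage level. Persisting in RAM and on disk can be requested with an explicit level, as in this minimal sketch (local master and sample data are assumptions made for the example):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Assumption: local master and an in-memory sample, for illustration only.
val conf = new SparkConf().setAppName("persist-sketch").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)

val lengths = sc.parallelize(Seq("spark", "zeppelin", "rdd")).map(_.length)

// Keep partitions in memory and spill the ones that do not fit to disk.
lengths.persist(StorageLevel.MEMORY_AND_DISK)

val total = lengths.sum()   // the first action computes and stores the partitions
val longest = lengths.max() // the second action can reuse the stored partitions

sc.stop()
```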

Distributed Architecture

Core concept

Run locally

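// Pick one master URL:
val master = "local"       // a single worker thread
// val master = "local[*]" // one worker thread per logical core
// val master = "local[4]" // four worker threads

val conf = new SparkConf().setAppName("sample")
  .setMaster(master)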


Run on cluster

val master = "spark://..."

val conf = new SparkConf().setAppName("sample")
  .setMaster(master)


Standalone Cluster

[Diagram: a Spark master coordinating three Spark slaves, which host six boxes labelled E (presumably executors); three Spark clients connect to the cluster]

Modules
Core concept

Composed of

● Spark Core
● Spark SQL
● Spark Streaming
● MLlib
● GraphX
● DataFrames
● ML Pipeline

Several data sources


http://prog3.com/article/2015-06-18/2824958

Spark SQL

● Structured data processing
● SQL language
● DataFrame

DataFrame 1/3
What?

● A distributed collection of rows organized into named columns
● An abstraction for selecting, filtering, aggregating and plotting structured data
● Provides a schema
● Not an RDD replacement
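A DataFrame with named columns can be sketched as follows. This is a hedged example, not taken from the slides: it assumes the Spark 1.x entry point (SQLContext, which later versions deprecate in favour of SparkSession), a local master and made-up sample data.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Assumption: local master and Spark 1.x style SQLContext.
val conf = new SparkConf().setAppName("dataframe-sketch").setMaster("local[*]")
val sc = SparkContext.getOrCreate(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Named columns give the rows a schema.
val people = sc.parallelize(Seq(("alice", 30), ("bob", 25), ("carol", 41)))
  .toDF("name", "age")

// Select, filter and aggregate through the DataFrame API.
val adults = people.filter($"age" > 26).select("name")
  .collect().map(_.getString(0)).sorted
val averageAge = people.agg(Map("age" -> "avg")).first().getDouble(0)

// The same filter expressed in SQL against a temporary table.
people.registerTempTable("people")
val viaSql = sqlContext.sql("SELECT name FROM people WHERE age > 26").count()

sc.stop()
```

Both the API calls and the SQL string go through the same engine, which is why the two forms can be mixed freely.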

DataFrame 1/3
Why?

● RDDs are more efficient than what came before (Hadoop)
● But RDDs are still too complicated for common tasks
● DataFrames are simpler and faster

DataFrame 2/3

Optimized

DataFrame 3/3
How?

● Since Spark 1.3
● The DataFrame API is just an interface
  ○ The implementation is done once, in the Spark engine
  ○ All languages benefit from the optimizations without rewriting anything

Spark Streaming

● Framework built over the RDD and DataFrame APIs
● Real-time data processing
● The RDD becomes a DStream here (a sequence of RDDs)
● Same as before, but the dataset is not static
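The "same operators, non-static dataset" idea can be sketched with a word count over micro-batches. This is a minimal sketch under stated assumptions: a queue of RDDs stands in for a real source, the batch interval and timeout are arbitrary, and the master is local.

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Assumption: local master with two threads, one micro-batch per second.
val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))

// Assumption: a queue of RDDs stands in for a real source (socket, Kafka, ...).
val queue = mutable.Queue[RDD[String]]()
queue += ssc.sparkContext.parallelize(Seq("spark streaming", "spark sql"))

// Same operators as on an RDD, applied to every micro-batch.
val counts = ssc.queueStream(queue)
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// foreachRDD runs on the driver once per batch; gather the results there.
val seen = mutable.ArrayBuffer[(String, Int)]()
counts.foreachRDD { rdd => seen ++= rdd.collect(); () }

ssc.start()
ssc.awaitTerminationOrTimeout(5000) // let a few batches run
ssc.stop(stopSparkContext = true, stopGracefully = true)
```

With a real source the context would normally keep running (awaitTermination) instead of stopping after a timeout.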

Spark Streaming
Internal flow

http://spark.apache.org/docs/latest/img/streaming-flow.png

Spark Streaming
Inputs / Outputs

http://spark.apache.org/docs/latest/img/streaming-arch.png

Spark MLlib

● Makes practical machine learning scalable and easy
● Provides common learning algorithms and utilities

Spark MLlib

● Divided into 2 packages
  ○ spark.mllib
  ○ spark.ml

Spark MLlib
spark.mllib

● Original API based on RDD
● Each model has its own interface

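Spark MLlib
spark.mllib

import org.apache.spark.SparkContext
import org.apache.spark.mllib.tree.RandomForest

val sc: SparkContext = ??? // init SparkContext

// randomSplit returns an Array of RDDs, hence the Array(...) pattern
val Array(trainingData, checkData) = sc.textFile("train.csv")
  /*transform*/
  .randomSplit(Array(0.98, 0.02))

// numClasses = 10, no categorical features, 30 trees, "auto" feature subsets,
// "gini" impurity, maxDepth = 7, maxBins = 100, seed = 0
val model = RandomForest.trainClassifier(
  trainingData, 10, Map[Int, Int](), 30, "auto", "gini", 7, 100, 0)

val prediction = model.predict(...)

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

// init SparkContext

val Array(trainingData, checkData) = sc.textFile("train.csv")
  /*transform*/
  .randomSplit(Array(0.98, 0.02))

val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(10)
  .run(trainingData)

val prediction = model.predict(...)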

Each model exposes its own interface

Spark MLlib
spark.ml

● Provides a uniform set of high-level APIs
● Built on top of DataFrames
● Pipeline concepts
  ○ Transformer
  ○ Estimator
  ○ Pipeline

Spark MLlib
spark.ml

● Transformer : transform(DF)
  ○ maps a DataFrame to a new one by adding a column
  ○ e.g. predicts the label and adds the result as a new column
● Estimator : fit(DF)
  ○ a learning algorithm
  ○ produces a model from a DataFrame

Spark MLlib
spark.ml

● Pipeline
  ○ a sequence of stages (each a transformer or an estimator)
  ○ run in a specific order

Spark MLlib
spark.ml

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.DataFrame

val training: DataFrame = ???
val test: DataFrame = ???

val lr = new LogisticRegression()

lr.setMaxIter(10).setRegParam(0.01)

// train the model
val model1 = lr.fit(training)

// predict on the test data
model1.transform(test)

Spark MLlib
spark.ml

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val rf = new RandomForestClassifier()
  /*.add parameter*/

val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, rf))

// Fit the pipeline to training documents.
val model = pipeline.fit(training)

model.transform(test)


val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  /*.add parameter*/

val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))

// Fit the pipeline to training documents.
val model = pipeline.fit(training)

model.transform(test)

Different models

Same way to create the pipeline

Zeppelin
Introduction

Big picture
Zeppelin introduction

What is it about?

● “A web-based notebook that enables interactive data analytics”
● 100% open source
● Undergoing incubation (Apache Incubator)

Multi-purpose

● Data Ingestion
● Data Discovery
● Data Analytics
● Data Visualization & Collaboration

Multiple language backends

● Scala
● shell
● python
● markdown
● your own language, by creating your own interpreter

Data visualization
Easy way to build graphs from data


Technology