
FlinkML: Large-scale Machine Learning with Apache Flink
Theodore Vasiloudis, Swedish Institute of Computer Science (SICS)

Big Data Application Meetup, July 27th, 2016

Large-scale Machine Learning


What do we mean?

● Small-scale learning
  ○ We have a small-scale learning problem when the active budget constraint is the number of examples.
● Large-scale learning
  ○ We have a large-scale learning problem when the active budget constraint is the computing time.

Source: Léon Bottou

Apache Flink

What is Apache Flink?

● Distributed stream and batch data processing engine
● Easy and powerful APIs for batch and real-time streaming analysis
● Backed by a very robust execution backend:
  ○ true streaming dataflow engine
  ○ custom memory manager
  ○ native iterations
  ○ cost-based optimizer

What is Apache Flink?

What does Flink give us?

● Expressive APIs
● Pipelined stream processor
● Closed loop iterations

Expressive APIs

● Main bounded data abstraction: DataSet
● Program using functional-style transformations, creating a dataflow.

case class Word(word: String, frequency: Int)

val lines: DataSet[String] = env.readTextFile(...)

lines.flatMap(line => line.split(" ").map(word => Word(word, 1)))
  .groupBy("word")
  .sum("frequency")
  .print()

Pipelined Stream Processor

Iterate in the dataflow

Iterate by looping

● A loop in the client submits one job per iteration step
● Reuse data by caching in memory or disk

Iterate in the dataflow
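In Flink the loop instead lives inside the dataflow as a native iteration operator, so no per-iteration job submission is needed. Below is a minimal sketch using the DataSet API's bulk iteration; the step function is a toy that just increments a value.

import org.apache.flink.api.scala._

object BulkIterationSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // The whole loop is part of a single dataflow: intermediate results
    // stay inside the engine instead of being cached by the client.
    val initial: DataSet[Double] = env.fromElements(0.0)
    val result = initial.iterate(100) { current =>
      current.map(x => x + 1.0) // step function: produces the next iteration's input
    }

    result.print()
  }
}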

Delta iterations

FlinkML


FlinkML

● New effort to bring large-scale machine learning to Apache Flink
● Goals:
  ○ Truly scalable implementations
  ○ Keep glue code to a minimum
  ○ Ease of use


FlinkML: Overview

● Supervised Learning
  ○ Optimization framework
  ○ Support Vector Machine
  ○ Multiple linear regression
● Recommendation
  ○ Alternating Least Squares (ALS)
● Pre-processing
  ○ Polynomial features
  ○ Feature scaling
● Unsupervised learning
  ○ Quad-tree exact kNN search
● sklearn-like ML pipelines


FlinkML API

// LabeledVector is a feature vector with a label (class or real value)
val trainingData: DataSet[LabeledVector] = ...
val testingData: DataSet[Vector] = ...

val mlr = MultipleLinearRegression()
  .setStepsize(0.01)
  .setIterations(100)
  .setConvergenceThreshold(0.001)

mlr.fit(trainingData)

// The fitted model can now be used to make predictions
val predictions: DataSet[LabeledVector] = mlr.predict(testingData)


FlinkML Pipelines

val scaler = StandardScaler()
val polyFeatures = PolynomialFeatures().setDegree(3)
val mlr = MultipleLinearRegression()

// Construct pipeline of standard scaler, polynomial features and
// multiple linear regression
val pipeline = scaler.chainTransformer(polyFeatures).chainPredictor(mlr)

// Train pipeline
pipeline.fit(trainingData)

// Calculate predictions
val predictions = pipeline.predict(testingData)

FlinkML: Focus on scalability

Alternating Least Squares

[Figure: the users ✕ items rating matrix R is approximated by the product of two low-rank factor matrices, R ≅ X ✕ Y]
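For context, this is the generic objective that ALS minimizes (a standard formulation; the exact regularization scheme in FlinkML's implementation may differ):

\min_{X,\,Y} \; \sum_{(u,i) \in \Omega} \left( r_{ui} - x_u^{\top} y_i \right)^2
  + \lambda \left( \sum_{u} \lVert x_u \rVert^2 + \sum_{i} \lVert y_i \rVert^2 \right)

Here Ω is the set of observed user-item ratings. ALS alternates between the two factors: fix Y and solve a regularized least-squares problem for each user vector x_u, then fix X and solve for each item vector y_i, repeating until convergence.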

Naive Alternating Least Squares

Blocked Alternating Least Squares

Blocked ALS performance

FlinkML blocked ALS performance

Going beyond SGD in large-scale optimization

● Beyond SGD → Use Primal-Dual framework

● Slow updates → Immediately apply local updates

CoCoA: Communication Efficient Coordinate Ascent

Primal-dual framework

Source: Smith (2014)


Immediately Apply Updates

Source: Smith (2014)


CoCoA: Communication Efficient Coordinate Ascent

CoCoA performance

Source: Jaggi (2014)

CoCoA performance

Available on FlinkML

SVM
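For reference, a minimal sketch of training the CoCoA-based SVM through FlinkML. Parameter names are taken from the FlinkML SVM documentation; the values shown are arbitrary, and trainingData / testData are assumed to already exist (labels in {-1, +1}).

import org.apache.flink.api.scala._
import org.apache.flink.ml.classification.SVM
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.math.Vector

val env = ExecutionEnvironment.getExecutionEnvironment

val trainingData: DataSet[LabeledVector] = ...
val testData: DataSet[Vector] = ...

// The SVM learner runs CoCoA under the hood; setBlocks controls how the
// data (and the corresponding dual variables) are partitioned.
val svm = SVM()
  .setBlocks(env.getParallelism)
  .setIterations(100)
  .setRegularization(0.001)
  .setStepsize(0.1)

svm.fit(trainingData)
val predictions = svm.predict(testData)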


Dealing with stragglers: SSP Iterations

● BSP: Bulk Synchronous Parallel
  ○ Every worker needs to wait for the others to finish before starting the next iteration.
● ASP: Asynchronous Parallel
  ○ Every worker can work individually, updating the model as needed.
  ○ Can be fast, but can often diverge.
● SSP: Stale Synchronous Parallel
  ○ Relax constraints, so the slowest workers can be up to K iterations behind the fastest ones.
  ○ Allows for progress, while keeping convergence guarantees (see the sketch of the staleness rule below).

Dealing with stragglers: SSP Iterations

Source: Ho et al. (2013)

SSP Iterations in Flink: Lasso Regression

Source: Peel et al. (2015)


PR submitted

Challenges in developing an open-source ML library

Challenges in open-source ML libraries

● Depth or breadth
● Design choices
● Testing

Challenges in open-source ML libraries

● Attracting developers
● What to commit
● Avoiding code rot

Current and future work on FlinkML

Current work

● Tooling
  ○ Evaluation & cross-validation framework
  ○ Distributed linear algebra
  ○ Streaming predictors
● Algorithms
  ○ Implicit ALS
  ○ Multi-layer perceptron
  ○ Efficient streaming decision trees
  ○ Column-wise statistics, histograms


Future of Machine Learning on Flink

● Streaming ML
  ○ Flink already has SAMOA bindings.
  ○ Preliminary work already started: implement SOTA algorithms and develop new techniques.
● “Computation efficient” learning
  ○ Utilize hardware and develop novel systems and algorithms to achieve large-scale learning with modest computing resources.

“Demo”


References

● Flink Project: flink.apache.org
● FlinkML Docs: https://ci.apache.org/projects/flink/flink-docs-master/libs/ml/
● Léon Bottou: Learning with Large Datasets
● Smith (2014): CoCoA AMPCAMP presentation
● Jaggi (2014): “Communication-efficient distributed dual coordinate ascent.” NIPS 2014.
● Ho (2013): “More effective distributed ML via a stale synchronous parallel parameter server.” NIPS 2013.
● Peel (2015): “Distributed Frank-Wolfe under Pipelined Stale Synchronous Parallelism.” IEEE BigData 2015.
● Recent INRIA paper examining Spark vs. Flink (batch only)
● Extending the Yahoo streaming benchmark (and winning the Twitter Hack-week with Flink)
● Also interesting: Bayesian anomaly detection in Flink
