
“Scalable” Machine Learning

Mikio L. Braun Recommender Stammtisch

June 26, 2014

Mikio L. Braun Recommender Stammtisch “Scalable” Machine Learning June 26, 2014 1 / 29


Hm... scalable machine learning

Why scalable machine learning?

Lots of data (literally, 100 GBs of log data)

Many many tasks (say, one per user)

Proper model selection over features/parameters takes a lot of time!

But the truth is, core ML methods don’t scale very well...


But is it really necessary?


Size ≠ Complexity


A complex data set


Learning curve: checkers data set

[Figure: test error (%) against data set size (10^2 to 10^5) for BumpBoost500, BumpBoost1000, BumpBoost2000, KRR and SVM]


What is enough data?


So how does complexity occur?

a lot of information

  Many many combinations of things are indicative
  Whatever it would actually take to memorize a lot of things

high dimensional, unresolved invariances

  For example text: “He liked the book very much”
  “like”: “enjoy”, “found interesting”, “was captivated by”
  “very much”: “greatly”, “massively”
  More degrees of freedom: “He thought the book was a keeper”
  “He had read his share of books over the past years, in fact, he had come to consider himself quite a lover of books. But this book, which his aunt had so mysteriously sent him just when he most needed it, was surprisingly captivating beyond his wildest expectations.”


Map-Reduce for Machine Learning On Multicore

Locally Weighted Linear Regression, Naive Bayes, Gaussian Discriminant Analysis, k-Means, Logistic Regression, Neural Networks, Principal Component Analysis, Independent Component Analysis, Expectation Maximization, Support Vector Machines


One example

Locally Weighted Linear Regression

Solves $A\theta = b$ with $A = \sum_{i=1}^{m} w_i (x_i x_i^T)$, $b = \sum_{i=1}^{m} w_i (x_i y_i)$.

Compute sums partially in Map step, combine sums in Reduce step to get $A$ and $b$.

Solve for θ single threaded.

But...

Only works when number of dimensions is small.

... in which case the problem doesn’t require many examples anyway.

... same approach (more or less) for GDA, PCA, ICA

... Naive Bayes even simpler (just compute conditional probabilities)
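The Map/Reduce split described above can be sketched roughly as follows. This is only an illustration under stated assumptions: plain Scala collections stand in for a Map-Reduce framework, the weights w_i are taken as given, and all names (`partialSums`, `combine`, `solve`, `fit`) are hypothetical.

```scala
// Script style: paste into the Scala REPL or run as a .sc file. All names are hypothetical.
type Vec = Array[Double]
type Mat = Array[Array[Double]]

// Map step: partial sums of w_i * x_i x_i^T and w_i * x_i * y_i over one chunk.
// Each example is (x, y, w).
def partialSums(chunk: Seq[(Vec, Double, Double)], d: Int): (Mat, Vec) = {
  val a = Array.fill(d, d)(0.0)
  val b = Array.fill(d)(0.0)
  for ((x, y, w) <- chunk; r <- 0 until d) {
    b(r) += w * x(r) * y
    for (c <- 0 until d) a(r)(c) += w * x(r) * x(c)
  }
  (a, b)
}

// Reduce step: element-wise addition of two partial results.
def combine(l: (Mat, Vec), r: (Mat, Vec)): (Mat, Vec) = {
  val a = l._1.zip(r._1).map { case (u, v) => u.zip(v).map { case (p, q) => p + q } }
  val b = l._2.zip(r._2).map { case (p, q) => p + q }
  (a, b)
}

// Single-threaded solve of A theta = b by Gaussian elimination with partial pivoting.
def solve(a: Mat, b: Vec): Vec = {
  val n = b.length
  val m = a.indices.map(i => a(i) :+ b(i)).toArray // augmented copy [A | b]
  for (col <- 0 until n) {
    val pivot = (col until n).maxBy(r => math.abs(m(r)(col)))
    val tmp = m(col); m(col) = m(pivot); m(pivot) = tmp
    for (r <- col + 1 until n) {
      val f = m(r)(col) / m(col)(col)
      for (c <- col to n) m(r)(c) -= f * m(col)(c)
    }
  }
  val theta = new Array[Double](n)
  for (r <- n - 1 to 0 by -1) {
    val s = (r + 1 until n).map(c => m(r)(c) * theta(c)).sum
    theta(r) = (m(r)(n) - s) / m(r)(r)
  }
  theta
}

// Chunks play the role of data shards; only the d x d matrix and d-vector travel to the reducer.
def fit(data: Seq[(Vec, Double, Double)], d: Int, chunkSize: Int): Vec = {
  val (a, b) = data.grouped(chunkSize).map(partialSums(_, d)).reduce(combine)
  solve(a, b)
}
```

The reducer only ever sees d × d numbers, which is why the approach is limited to a small number of dimensions.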


k-Means Clustering

Parallelize computation of all distances

Iteration, updates done sequentially
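A rough sketch of that split, assuming plain Scala collections in place of a cluster: the distance computations and assignments run per chunk (the parallelizable part), while the outer iteration and the center update stay sequential. Function names are hypothetical.

```scala
// Script style: paste into the Scala REPL or run as a .sc file. All names are hypothetical.
type Point = Array[Double]

def sqDist(a: Point, b: Point): Double =
  a.zip(b).map { case (p, q) => (p - q) * (p - q) }.sum

def add(a: Point, b: Point): Point =
  a.zip(b).map { case (p, q) => p + q }

// Map step: assign each point of a chunk to its nearest center and
// return, per center index, the coordinate sum and the number of points.
def assignChunk(chunk: Seq[Point], centers: Seq[Point]): Map[Int, (Point, Int)] =
  chunk
    .groupBy(p => centers.indices.minBy(k => sqDist(p, centers(k))))
    .map { case (k, ps) => (k, (ps.reduce(add), ps.size)) }

// Reduce step: merge the per-center statistics of two chunks.
def merge(l: Map[Int, (Point, Int)], r: Map[Int, (Point, Int)]): Map[Int, (Point, Int)] =
  r.foldLeft(l) { case (acc, (k, (s, n))) =>
    acc.get(k) match {
      case Some((s0, n0)) => acc.updated(k, (add(s0, s), n0 + n))
      case None           => acc.updated(k, (s, n))
    }
  }

// Sequential outer loop: each iteration needs the centers produced by the previous one.
def kMeans(data: Seq[Point], init: Seq[Point], iterations: Int, chunkSize: Int): Seq[Point] =
  (1 to iterations).foldLeft(init) { (centers, _) =>
    val stats = data.grouped(chunkSize).map(assignChunk(_, centers)).reduce(merge)
    centers.indices.map { k =>
      // A center without assigned points keeps its old position.
      stats.get(k).map { case (s, n) => s.map(_ / n) }.getOrElse(centers(k))
    }
  }
```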


Batch updates

Logistic Regression, Neural Networks, SVMs with stochastic gradient descent.

BUT the gradient is computed over the whole data set


Batch vs. true stochastic gradient descent
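To make the contrast concrete, here is a minimal sketch on a toy one-dimensional least-squares problem. The problem, the step size and the function names are assumptions chosen for illustration and are not taken from the slides.

```scala
// Script style: paste into the Scala REPL or run as a .sc file. All names are hypothetical.

// Gradient of the squared error (w * x - y)^2 / 2 for a single example.
def grad(w: Double, x: Double, y: Double): Double = (w * x - y) * x

// Batch gradient descent: one pass over the data yields a single update,
// using the gradient averaged over the whole data set.
def batchEpoch(w: Double, data: Seq[(Double, Double)], eta: Double): Double =
  w - eta * data.map { case (x, y) => grad(w, x, y) }.sum / data.size

// True stochastic gradient descent: one pass yields data.size updates,
// each using the gradient of a single example and the most recent w.
def sgdEpoch(w: Double, data: Seq[(Double, Double)], eta: Double): Double =
  data.foldLeft(w) { case (cur, (x, y)) => cur - eta * grad(cur, x, y) }
```

The batch variant parallelizes trivially, because the per-example gradients are independent given w, but it pays a full pass for each step. The stochastic variant is inherently sequential, since every update depends on the previous one.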


Lots of open questions, though

Mostly iterative algorithms

Sometimes, questionable implementations, for example, microbatches.


What does scalable mean anyway?

Scalable Inference for Logistic-Normal Topic Models
... This paper presents a partially collapsed Gibbs sampling algorithm that approaches the provably correct distribution by exploring the ideas of data augmentation ...

A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks
... With [...] an efficient stochastic variational inference algorithm, we are able to analyze real networks with over a million vertices [...] on a single machine in a matter of hours ...

Scalable Inference of Overlapping Communities
... Our algorithm is based on stochastic variational inference in the mixed-membership stochastic blockmodel ...

Scalable Influence Estimation in Continuous-Time Diffusion Networks
... In this paper, we propose a randomized algorithm for influence estimation in continuous-time diffusion networks ...

Scalable imputation of genetic data with a discrete fragmentation-coagulation process
... Our model can be thought of as a discrete time analogue of continuous time fragmentation-coagulation processes, preserving the important properties of projectivity, exchangeability and reversibility, while being more scalable ...

Scalable kernels for graphs with continuous attributes
... In this paper, we present a class of path kernels with computational complexity $O(n^2(m + \delta^2))$ ...

Custom made algorithms & implementations!

(http://papers.nips.cc/search/?q=scalable)

Large Scale Learning

Stochastic Gradient Descent

Higher order descent, just a few steps


Stochastic Gradient Descent

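Doing gradient descent on

$$\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \max\bigl(0,\ 1 - y_i(\langle w, x_i\rangle + b)\bigr)$$

yields

$$w \leftarrow \begin{cases} w - \eta_t\, w & \text{if } y_t(\langle w, X_t\rangle + b) \ge 1 \\ w - \eta_t\,(w - C\, y_t X_t) & \text{else} \end{cases}$$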
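A minimal sketch of this per-example update for the hinge-loss objective, under some assumptions of its own: the step size is eta0 / t, the bias b is updated alongside w without being regularized, and `svmSgd` and `dot` are hypothetical names. An example whose margin is at least 1 only shrinks w; an example that violates the margin also pulls w towards y_t X_t.

```scala
// Script style: paste into the Scala REPL or run as a .sc file. All names are hypothetical.
type Vec = Array[Double]

def dot(a: Vec, b: Vec): Double = a.zip(b).map { case (p, q) => p * q }.sum

// SGD on 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i (<w, x_i> + b)).
// Assumption of this sketch: step size eta0 / t at update number t.
def svmSgd(data: Seq[(Vec, Double)], c: Double, eta0: Double, epochs: Int): (Vec, Double) = {
  var w = new Array[Double](data.head._1.length)
  var b = 0.0
  var t = 0
  for (_ <- 1 to epochs; (x, y) <- data) {
    t += 1
    val eta = eta0 / t
    if (y * (dot(w, x) + b) >= 1.0) {
      // Margin satisfied: only the regularizer contributes to the gradient.
      w = w.map(wi => wi - eta * wi)
    } else {
      // Margin violated: the hinge term contributes -C * y * x (and -C * y for the bias).
      w = w.zip(x).map { case (wi, xi) => wi - eta * (wi - c * y * xi) }
      b += eta * c * y
    }
  }
  (w, b)
}

// Predicted label in {-1.0, +1.0}.
def predict(model: (Vec, Double), x: Vec): Double =
  if (dot(model._1, x) + model._2 >= 0.0) 1.0 else -1.0
```

The only state carried between examples is w, b and the step counter, which is what keeps the memory footprint small.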


Parallelization potential for SGD

Memory footprint: w

Main problem: streaming the data by fast enough

Cross-validation/feature extraction issues


Google’s DistBelief

Data shards, model replicas, common parameter set.

(Dean, Corrado, Monga et al., Large Scale Distributed Deep Networks, NIPS 2012, http://research.google.com/archive/large_deep_networks_nips2012.html)


The future: Spark and Stratosphere/Flink

The problem with Hadoop:

Just one Map / Reduce step, but many algorithms are iterative

Disk based → long startup times


Spark

(http://spark.apache.org)

In-memory

Much larger set of operations (groupBy, joins, etc.)

resilience by storing how data was generated

caching of results on disk

micro-batch streaming, too


Apache Spark

// open file as collection of lines
// (sc is the SparkContext that the Spark shell provides)
val textFile = sc.textFile("README.md")

// count lines in file
textFile.count()

// get lines containing the word Spark
val linesWithSpark = textFile
  .filter(line => line.contains("Spark"))

// count those lines
linesWithSpark.count()


Stratosphere/Flink - Big Data by Query Optimization

(http://stratosphere.eu/)

Databases: Query → Relational Algebra → Algorithms

Stratosphere: The same for Big Data

In-Memory, more operations, but also iterations!

Optimizing operations, including reshuffles of data.


Stream Mining

Stream Mining — mid 2000s

Process potentially infinite stream of data

Stream query:

How often have I seen item i?
What are the most frequent items?
How many distinct items are there?

Approximate results with bounded resources
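As one illustration of answering "How often have I seen item i?" with bounded resources, here is a minimal Count-Min sketch. The slides do not name this data structure, so it is a swapped-in example; the class name and its parameters are hypothetical.

```scala
// Script style: paste into the Scala REPL or run as a .sc file.
// Count-Min sketch, chosen for illustration only; the slides do not name a specific structure.
import scala.util.hashing.MurmurHash3

// `depth` hash rows of `width` counters each: memory is fixed, however long the stream gets.
final class CountMin(depth: Int, width: Int) {
  private val table = Array.fill(depth, width)(0L)

  // One hash function per row, obtained by using the row number as the seed.
  private def column(item: String, row: Int): Int =
    java.lang.Math.floorMod(MurmurHash3.stringHash(item, row), width)

  def add(item: String): Unit =
    for (row <- 0 until depth) table(row)(column(item, row)) += 1L

  // Never underestimates: hash collisions can only inflate a counter,
  // so the minimum over all rows is the tightest available estimate.
  def estimate(item: String): Long =
    (0 until depth).map(row => table(row)(column(item, row))).min
}
```

The estimate is approximate in one direction only: it can be too high, never too low, and a wider table makes collisions less likely.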


Getting rid of exactness


Heavy hitters

Task: find most frequent items in a data set.

Metwally, Agrawal, Abbadi, Efficient Computation of Frequent and Top-k Elements in Data Streams, International Conference on Database Theory, 2005
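The cited paper introduces the Space-Saving algorithm. A simplified sketch of its core idea follows: it keeps a fixed number of counters and, when a new item arrives while all counters are taken, replaces the item with the smallest count. The linear scan for the minimum and all function names are simplifications of this sketch, not the paper's data structure.

```scala
// Script style: paste into the Scala REPL or run as a .sc file. All names are hypothetical.

// Space-Saving with at most `capacity` monitored items.
// Each entry stores (estimated count, maximal overestimation).
def spaceSaving(stream: Iterator[String], capacity: Int): Map[String, (Long, Long)] =
  stream.foldLeft(Map.empty[String, (Long, Long)]) { (counters, item) =>
    counters.get(item) match {
      case Some((count, err)) =>
        counters.updated(item, (count + 1, err))
      case None if counters.size < capacity =>
        counters.updated(item, (1L, 0L))
      case None =>
        // Evict the item with the smallest count; the newcomer inherits that
        // count as its possible overestimation. (Linear scan, for brevity only.)
        val (victim, (minCount, _)) = counters.minBy { case (_, (count, _)) => count }
        (counters - victim).updated(item, (minCount + 1, minCount))
    }
  }

// The k monitored items with the largest estimated counts.
def topK(counters: Map[String, (Long, Long)], k: Int): Seq[(String, Long)] =
  counters.toSeq.map { case (item, (count, _)) => (item, count) }.sortBy(-_._2).take(k)
```

Memory is bounded by `capacity` regardless of how many distinct items the stream contains; the price is that counts of evicted-and-replaced items are upper bounds rather than exact values.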


Exponential Decay
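The slide body did not survive in this transcript, so the following is only a sketch of one common way to realize exponential decay for counters: the score halves every `halfLife` time units and each event adds 1. The half-life parametrization and all names are assumptions.

```scala
// Script style: paste into the Scala REPL or run as a .sc file. All names are hypothetical.

// A decayed counter: `score` is the value as of time `lastTime`.
final case class Decayed(score: Double, lastTime: Double)

// Score as seen at time `now`: it halves every `halfLife` time units.
def valueAt(c: Decayed, now: Double, halfLife: Double): Double =
  c.score * math.pow(0.5, (now - c.lastTime) / halfLife)

// Register one event at time `now`: decay the old score up to `now`, then add 1.
def hit(c: Decayed, now: Double, halfLife: Double): Decayed =
  Decayed(valueAt(c, now, halfLife) + 1.0, now)
```

Because decay is applied lazily from the stored timestamp, a counter costs nothing while no events arrive, and recent activity outweighs old activity in any ranking by score.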


Indices

page         referrer  agent      score
/index.html  google    mozilla    12.3
/about.html  facebook  iexplorer  10.2
/index.html  twitter   safari     8.5
/post/123    google    mozilla    5.5
/about.html  twitter   safari     3.2

Trend for “referrer = google” → (/index.html, google, mozilla, 12.3), (/post/123, google, mozilla, 5.5)

Trend for “agent = safari” → (/index.html, twitter, safari, 8.5), (/about.html, twitter, safari, 3.2)
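The idea of a trend restricted by an index can be sketched with the table above. This is an illustration only: `Entry`, `buildIndex` and `trend` are hypothetical names and say nothing about how streamdrill implements its indices.

```scala
// Script style: paste into the Scala REPL or run as a .sc file. All names are hypothetical.
final case class Entry(page: String, referrer: String, agent: String, score: Double)

val entries = Seq(
  Entry("/index.html", "google", "mozilla", 12.3),
  Entry("/about.html", "facebook", "iexplorer", 10.2),
  Entry("/index.html", "twitter", "safari", 8.5),
  Entry("/post/123", "google", "mozilla", 5.5),
  Entry("/about.html", "twitter", "safari", 3.2)
)

// Secondary index: field value -> entries, each group kept in descending score order.
def buildIndex(es: Seq[Entry], field: Entry => String): Map[String, Seq[Entry]] =
  es.sortBy(-_.score).groupBy(field)

val byReferrer = buildIndex(entries, _.referrer)
val byAgent = buildIndex(entries, _.agent)

// Trend restricted to one index value: the top k entries by score.
def trend(index: Map[String, Seq[Entry]], value: String, k: Int): Seq[Entry] =
  index.getOrElse(value, Seq.empty).take(k)
```

For instance, `trend(byReferrer, "google", 10)` returns the two google rows in score order, matching the first trend listed above.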


streamdrill

In-memory, stream mining driven realtime analytics engine

written in Scala

Main tool: Trends (Top K + indices + exponential decay)

Process up to 20k events per second

Track about 1M per GB

Upcoming: Profiling, Recommendation, etc.

Download demo jar at streamdrill.com


Summary

Complexity ≠ Size

Big Data best used for inherently parallel steps like

preprocessing
feature extraction
cross validation
applying predictions

Else, you’re stuck with algorithm design!

Large scale learning by stochastic gradient descent

Spark/Stratosphere making it easier!

Stream Mining!
