
“Scalable” Machine Learning

Mikio L. Braun Recommender Stammtisch

June 26, 2014

Mikio L. Braun Recommender Stammtisch “Scalable” Machine Learning June 26, 2014 1 / 29

Hm... scalable machine learning

Why scalable machine learning?

Lots of data (literally, 100 GBs of log data)

Many many tasks (say, one per user)

Proper model selection over features/parameters takes a lot of time!

But the truth is, core ML methods don’t scale very well...


But is it really necessary?

[Figure: plot with x-axis from -6 to 6 and y-axis from -0.5 to 1.5]


Size ≠ Complexity


A complex data set

[Figure: two-dimensional data set, both axes from 0.0 to 1.0]


Learning curve checkers data set

[Figure: learning curves. x-axis: data set size, 10^2 to 10^5; y-axis: test error (%), 0 to 45; curves: BumpBoost500, BumpBoost1000, BumpBoost2000, KRR, SVM]


What is enough data?


So how does complexity occur?

a lot of information

Many many combinations of things are indicative

Whatever it would actually take to memorize a lot of things

high dimensional, unresolved invariances

For example text: “He liked the book very much”

“like”: “enjoy”, “found interesting”, “was captivated by”

“very much”: “greatly”, “massively”

More degrees of freedom: “He thought the book was a keeper”

“He had read his share of books over the past years, in fact, he had come to consider himself quite a lover of books. But this book, which his aunt had so mysteriously sent him just when he most needed it, was surprisingly captivating beyond his wildest expectations.”


Map-Reduce for Machine Learning On Multicore

Locally Weighted Linear Regression, Naive Bayes, Gaussian Discriminant Analysis, k-Means, Logistic Regression, Neural Networks, Principal Component Analysis, Independent Component Analysis, Expectation Maximization, Support Vector Machines


One example

Locally Weighted Linear Regression

Solves $A\theta = b$ with $A = \sum_{i=1}^{m} w_i (x_i x_i^T)$, $b = \sum_{i=1}^{m} w_i (x_i y_i)$.

Compute sums partially in Map step, combine sums in Reduce step to get A and b.

Solve for θ single threaded.
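The three steps above can be sketched in plain Scala, without any cluster framework. This is only an illustration under stated assumptions: the names `Example`, `partialSums`, `combine`, `solve` and `fit` are made up for this sketch, the shards are ordinary in-memory sequences, and the linear system is solved with a simple Gaussian elimination that is only sensible for a small number of dimensions.

```scala
object LwlrMapReduce {
  type Vec = Array[Double]
  type Mat = Array[Array[Double]]

  // One weighted training example: features x, target y, weight w (hypothetical type).
  final case class Example(x: Vec, y: Double, w: Double)

  // "Map" step: partial sums of A and b over one shard of the data.
  def partialSums(shard: Seq[Example], d: Int): (Mat, Vec) = {
    val a = Array.fill(d, d)(0.0)
    val b = Array.fill(d)(0.0)
    for (e <- shard; i <- 0 until d) {
      b(i) += e.w * e.x(i) * e.y
      for (j <- 0 until d) a(i)(j) += e.w * e.x(i) * e.x(j)
    }
    (a, b)
  }

  // "Reduce" step: element-wise sum of two partial results.
  def combine(l: (Mat, Vec), r: (Mat, Vec)): (Mat, Vec) = {
    val a = l._1.zip(r._1).map { case (u, v) => u.zip(v).map { case (p, q) => p + q } }
    val b = l._2.zip(r._2).map { case (p, q) => p + q }
    (a, b)
  }

  // Single-threaded solve of A theta = b (Gaussian elimination, partial pivoting).
  def solve(a0: Mat, b0: Vec): Vec = {
    val n = b0.length
    val a = a0.map(_.clone)
    val b = b0.clone
    for (col <- 0 until n) {
      val pivot = (col until n).maxBy(r => math.abs(a(r)(col)))
      val tmpRow = a(col); a(col) = a(pivot); a(pivot) = tmpRow
      val tmpB = b(col); b(col) = b(pivot); b(pivot) = tmpB
      for (r <- col + 1 until n) {
        val f = a(r)(col) / a(col)(col)
        for (c <- col until n) a(r)(c) -= f * a(col)(c)
        b(r) -= f * b(col)
      }
    }
    // Back substitution.
    val theta = Array.fill(n)(0.0)
    for (r <- n - 1 to 0 by -1) {
      val s = (r + 1 until n).map(c => a(r)(c) * theta(c)).sum
      theta(r) = (b(r) - s) / a(r)(r)
    }
    theta
  }

  // Map over shards, reduce the partial sums, then solve once.
  def fit(shards: Seq[Seq[Example]], d: Int): Vec = {
    val (a, b) = shards.map(partialSums(_, d)).reduce((l, r) => combine(l, r))
    solve(a, b)
  }
}
```

Only the `partialSums` calls touch the data, so they are the part that could run on separate cores or machines. The combined A is a d × d matrix, which is why the approach depends on d being small.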

But...

Only works when number of dimensions is small.

... in which case the problem doesn’t require many examples anyway.

... same approach (more or less) for GDA, PCA, ICA

... Naive Bayes even simpler (just compute conditional probabilities)


k-Means Clustering

Parallelize computation of all distances

Iteration, updates done sequentially
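A minimal sketch of this split, assuming in-memory points and hypothetical helper names (`nearest`, `step`, `run`): the distance computation and assignment is independent per point, while the centroid update and the loop over iterations are sequential.

```scala
object KMeansStep {
  type Point = Array[Double]

  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (p, q) => (p - q) * (p - q) }.sum

  // Index of the closest centroid; independent per point, so it parallelizes.
  def nearest(p: Point, centroids: Seq[Point]): Int =
    centroids.indices.minBy(k => sqDist(p, centroids(k)))

  // One iteration: assign all points, then recompute each centroid as the
  // mean of its assigned points (done sequentially here).
  def step(points: Seq[Point], centroids: Seq[Point]): Seq[Point] = {
    val assignment = points.map(p => nearest(p, centroids)) // parallelizable part
    centroids.indices.map { k =>
      val members = points.zip(assignment).filter(_._2 == k).map(_._1)
      if (members.isEmpty) centroids(k) // keep the centroid of an empty cluster
      else Array.tabulate(centroids(k).length)(j => members.map(_(j)).sum / members.size)
    }
  }

  // Each iteration needs the centroids of the previous one, so they run in order.
  def run(points: Seq[Point], init: Seq[Point], iterations: Int): Seq[Point] =
    (1 to iterations).foldLeft(init)((c, _) => step(points, c))
}
```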


Batch updates

Logistic Regression, Neural Networks, SVMs with stochastic gradient descent.

BUT the gradient is computed over the whole data set


Batch vs. true stochastic gradient descent


Lots of open questions, though

Mostly iterative algorithms

Sometimes, questionable implementations, for example, microbatches.


What does scalable mean anyway?

Scalable Inference for Logistic-Normal Topic Models ... This paper presents a partially collapsed Gibbs sampling algorithm that approaches the provably correct distribution by exploring the ideas of data augmentation ...

A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks

Scalable Inference of Overlapping Communities

Scalable Influence Estimation in Continuous-Time Diffusion Networks

Scalable imputation of genetic data with a discrete fragmentation-coagulation process

Scalable kernels for graphs with continuous attributes

Custom made algorithms & implementations!

(http://papers.nips.cc/search/?q=scalable)


A Scalable Approach to Probabilistic Latent Space Inference of Large-Scale Networks ... With [...] an efficient stochastic variational inference algorithm, we are able to analyze real networks with over a million vertices [...] on a single machine in a matter of hours ...

Scalable Inference of Overlapping Communities ... Our algorithm is based on stochastic variational inference in the mixed-membership stochastic blockmodel ...

Scalable Influence Estimation in Continuous-Time Diffusion Networks ... In this paper, we propose a randomized algorithm for influence estimation in continuous-time diffusion networks ...

Scalable imputation of genetic data with a discrete fragmentation-coagulation process ... Our model can be thought of as a discrete time analogue of continuous time fragmentation-coagulation processes, preserving the important properties of projectivity, exchangeability and reversibility, while being more scalable ...

Scalable kernels for graphs with continuous attributes ... In this paper, we present a class of path kernels with computational complexity $O(n^2(m + \delta^2))$ ...




Large Scale Learning

Stochastic Gradient Descent

Higher order descent, just a few steps


Stochastic Gradient Descent

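Doing stochastic gradient descent on

$$\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \max\bigl(0,\ 1 - y_i(\langle w, x_i\rangle + b)\bigr)$$

yields, for the example $(X_t, y_t)$ drawn at step $t$,

$$w \leftarrow \begin{cases} w - \eta_t\, w & \text{if } y_t(\langle w, X_t\rangle + b) > 1 \\ w - \eta_t\,(w - C\, y_t X_t) & \text{else} \end{cases}$$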
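A minimal Scala sketch of one such update loop. It assumes labels in {-1, +1}, a bias held fixed at 0, and a step size η_t = η_0 / t; the slide specifies none of these, and the names `update` and `train` are made up for this sketch. Each step shrinks w, and the example term is added only when the margin y_t(⟨w, X_t⟩ + b) is below 1. With `c = 1` the example term is simply y_t X_t.

```scala
object SvmSgd {
  type Vec = Array[Double]

  def dot(a: Vec, b: Vec): Double = a.zip(b).map { case (p, q) => p * q }.sum

  // One SGD step on a single example (x, y), y in {-1, +1}.
  // eta is the step size, c the cost parameter; the bias b is not updated here.
  def update(w: Vec, b: Double, x: Vec, y: Double, eta: Double, c: Double): Vec = {
    val violated = y * (dot(w, x) + b) < 1.0 // hinge loss is active
    w.indices.map { j =>
      val grad = if (violated) w(j) - c * y * x(j) else w(j)
      w(j) - eta * grad
    }.toArray
  }

  // Streams over the data `epochs` times with step size eta0 / t (an assumed schedule).
  // Only w is kept in memory between examples.
  def train(data: Seq[(Vec, Double)], d: Int, eta0: Double, c: Double, epochs: Int): Vec = {
    var w = Array.fill(d)(0.0)
    var t = 0
    for (_ <- 1 to epochs; (x, y) <- data) {
      t += 1
      w = update(w, 0.0, x, y, eta0 / t, c)
    }
    w
  }
}
```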


Parallelization potential for SGD

Memory footprint: w

Main problem: stream the data by fast enough

Cross-validation/feature extraction issues


Google’s DistBelief

Data shards, model replicas, common parameter set.

(Dean, Corrado, Monga et al. Large Scale Distributed Deep Networks, NIPS 2012, http://research.google.com/archive/large_deep_networks_nips2012.html)


The future: Spark and Stratosphere/Flink

The problem with Hadoop:

Just one Map / Reduce step, but many algorithms are iterative

Disk based → long startup times


Spark

(http://spark.apache.org)

In-memory

Much larger set of operations (groupBy, joins, etc.)

resilience by storing how data was generated

caching of results on disk

micro-batch streaming, too


Apache Spark

// open file as collection of lines

val textFile = sc.textFile("README.md")

// count lines in file

textFile.count()

// get lines containing the word Spark

val linesWithSpark = textFile.filter(line => line.contains("Spark"))

// count those lines

linesWithSpark.count()


Stratosphere/Flink - Big Data by Query Optimization

(http://stratosphere.eu/)

Databases: Query → Relational Algebra → Algorithms

Stratosphere: The same for Big Data

In-Memory, more operations, but also iterations!

Optimizing operations, including reshuffles of data.


Stream Mining

Stream Mining — mid 2000s

Process potentially infinite stream of data

Stream query:

How often have I seen item i?

What are the most frequent items?

How many distinct items are there?

Approximate results with bounded resources


Getting rid of exactness


Heavy hitters

Task: find most frequent items in a data set.

Metwally, Agrawal, Abbadi, Efficient Computation of Frequent and Top-k Elements in Data Streams, International Conference on Database Theory, 2005
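A rough sketch in the spirit of the Space-Saving algorithm from the cited paper: keep a fixed number of counters, and when an unmonitored item arrives while all counters are taken, replace the item with the smallest count and let the newcomer inherit that count plus one. The class name and its linear scan for the minimum are simplifications made for this sketch; the paper uses a dedicated data structure for that step. Counts are estimates and may overcount.

```scala
import scala.collection.mutable

// Approximate frequent-item counter with at most `capacity` monitored items.
final class SpaceSaving[A](capacity: Int) {
  require(capacity > 0, "capacity must be positive")
  private val counts = mutable.Map.empty[A, Long]

  def add(item: A): Unit = {
    if (counts.contains(item)) counts(item) += 1
    else if (counts.size < capacity) counts(item) = 1L
    else {
      // Evict the item with the smallest count; the newcomer takes over
      // that count plus one, so its estimate can be too high but never too low.
      val (victim, minCount) = counts.minBy(_._2)
      counts.remove(victim)
      counts(item) = minCount + 1
    }
  }

  // Monitored items by decreasing estimated count.
  def top(k: Int): Seq[(A, Long)] = counts.toSeq.sortBy(-_._2).take(k)
}
```

Memory stays bounded by `capacity` no matter how long the stream is, which is the "approximate results with bounded resources" trade-off.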


Exponential Decay
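One common way to realize exponential decay is a counter whose value halves after a fixed half-life. The sketch below is an assumption for illustration, not taken from the slides: `DecayingCounter` and its fields are hypothetical names, and time is any monotonically increasing number.

```scala
// Counter whose value halves every `halfLife` time units.
// Assumes `now` never goes backwards relative to `lastUpdate`.
final case class DecayingCounter(value: Double, lastUpdate: Double, halfLife: Double) {
  // Decayed value as seen at time `now`.
  def valueAt(now: Double): Double =
    value * math.pow(0.5, (now - lastUpdate) / halfLife)

  // Decay the old value up to `now`, then add the new amount.
  def add(now: Double, amount: Double = 1.0): DecayingCounter =
    copy(value = valueAt(now) + amount, lastUpdate = now)
}
```

Only one number and one timestamp are stored per item, so old events fade out without keeping a window of past events.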


Indices

page         referrer  agent      score
/index.html  google    mozilla    12.3
/about.html  facebook  iexplorer  10.2
/index.html  twitter   safari     8.5
/post/123    google    mozilla    5.5
/about.html  twitter   safari     3.2

Trend for “referrer = google” → (/index.html, google, mozilla, 12.3), (/post/123, google, mozilla, 5.5)

Trend for “agent = safari” → (/index.html, twitter, safari, 8.5), (/about.html, twitter, safari, 3.2)
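The lookups above can be illustrated with a small in-memory index. This is a sketch only: `Entry`, `indexBy`, `byReferrer` and `byAgent` are names invented here, and nothing is implied about how streamdrill implements its indices.

```scala
object TrendIndex {
  final case class Entry(page: String, referrer: String, agent: String, score: Double)

  // The rows of the example table.
  val entries: Seq[Entry] = Seq(
    Entry("/index.html", "google", "mozilla", 12.3),
    Entry("/about.html", "facebook", "iexplorer", 10.2),
    Entry("/index.html", "twitter", "safari", 8.5),
    Entry("/post/123", "google", "mozilla", 5.5),
    Entry("/about.html", "twitter", "safari", 3.2)
  )

  // Index from a field value to its entries, each list sorted by decreasing score.
  def indexBy(key: Entry => String): Map[String, Seq[Entry]] =
    entries.groupBy(key).map { case (k, es) => k -> es.sortBy(-_.score) }

  val byReferrer: Map[String, Seq[Entry]] = indexBy(_.referrer)
  val byAgent: Map[String, Seq[Entry]] = indexBy(_.agent)
}
```

`TrendIndex.byReferrer("google")` then returns the two google rows in score order, without scanning the whole table at query time.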


streamdrill

In-memory, stream mining driven realtime analytics engine

written in Scala

Main tool: Trends (Top K + indices + exponential decay)

Process up to 20k events per second

Track about 1M per GB

Upcoming: Profiling, Recommendation, etc.

Download demo jar at streamdrill.com


Summary

Complexity ≠ Size

Big Data best used for inherently parallel steps like

preprocessing

feature extraction

cross validation

applying predictions

Else, you’re stuck with algorithm design!

Large scale learning by stochastic gradient descent

Spark/Stratosphere making it easier!

Stream Mining!

