Graphlab Ted Dunning Clustering


Uploaded by: mapr-technologies

Posted on: 15-Jan-2015

Category: Technology

DESCRIPTION

Talk on the Mahout nearest-neighbor framework, focusing particularly on the k-means acceleration provided by the streaming k-means implementation.

TRANSCRIPT

Page 1: Graphlab Ted Dunning  Clustering

© MapR Technologies - Confidential

Large-scale Single-pass k-Means Clustering at Scale (Ted Dunning)

Page 2

Large-scale Single-pass k-Means Clustering

Page 3

Large-scale k-Means Clustering

Page 4

Goals

- Cluster very large data sets
- Facilitate large nearest-neighbor search
- Allow a very large number of clusters
- Achieve good quality
  - low average distance to nearest centroid on held-out data

- Based on Mahout Math
- Runs on a Hadoop (really MapR) cluster
- FAST: cluster tens of millions of points in minutes

Page 5

Non-goals

- Use map-reduce (but it is there)
- Minimize the number of clusters
- Support metrics other than L2

Page 6

Anti-goals

- Multiple passes over the original data
- Scaling as O(k n)

Page 7

Why?

Page 8

K-nearest Neighbor with Super-Fast k-means

Page 9

What’s that?

- Find the k nearest training examples; use the average value of the target variable from them
- This is easy ... but hard
  - easy because it is so conceptually simple and you don't have knobs to turn or models to build
  - hard because of the stunning amount of math
  - also hard because we need the top 50,000 results
- Initial prototype was massively too slow
  - 3K queries x 200K examples takes hours
  - needed 20M x 25M in the same time
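The averaging idea on this slide can be sketched in a few lines. This is a hedged Python illustration of plain brute-force k-NN regression, not the Mahout implementation; all names are made up for the example:

```python
import numpy as np

def knn_regress(train_X, train_y, query, k):
    """Predict the target for `query` as the mean target of its k nearest
    training examples (Euclidean distance, brute force)."""
    d = np.linalg.norm(train_X - query, axis=1)  # distance to every example
    nearest = np.argsort(d)[:k]                  # indices of the k closest
    return train_y[nearest].mean()

# Two tight groups of examples with targets 1.0 and 5.0
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
y = np.array([1.0, 1.0, 5.0, 5.0])
print(knn_regress(X, y, np.array([0.2, 0.5]), k=2))  # 1.0
```

The brute-force cost is exactly what made the prototype slow: every query touches every training example, which is why the rest of the talk is about pruning that search.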


Page 11

How We Did It

2 week hackathon with 6 developers from customer bank Agile-ish development To avoid IP issues– all code is Apache Licensed (no ownership question)– all data is synthetic (no question of private data)– all development done on individual machines, hosting on Github– open is easier than closed (in this case)

Goal is new open technology to facilitate new closed solutions

Ambitious goal of ~ 1,000,000 x speedup– well, really only 100-1000x after basic hygiene

Page 12

What We Did

- Mechanism for extending Mahout Vectors
  - DelegatingVector, WeightedVector, Centroid
- Shared-memory matrix
  - FileBasedMatrix uses mmap to share very large dense matrices
- Searcher interface
  - ProjectionSearch, KmeansSearch, LshSearch, Brute
- Super-fast clustering
  - Kmeans, StreamingKmeans
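The shared-memory matrix trick can be illustrated with numpy.memmap, which plays the same role here that mmap-ed files play for FileBasedMatrix. This is a hedged sketch of the technique, not the Mahout API:

```python
import os
import tempfile
import numpy as np

# Writer: back a dense matrix with a file so other processes can map the
# same bytes instead of each holding a private copy in heap memory.
rows, cols = 1000, 16
path = os.path.join(tempfile.mkdtemp(), "matrix.dat")
m = np.memmap(path, dtype=np.float64, mode="w+", shape=(rows, cols))
m[:] = np.arange(rows * cols, dtype=np.float64).reshape(rows, cols)
m.flush()  # push the written pages to disk

# Reader (in real use, a different process): map the same file read-only.
shared = np.memmap(path, dtype=np.float64, mode="r", shape=(rows, cols))
print(shared[3, 2])  # 50.0
```

The operating system pages the file in on demand, so a matrix far larger than any one JVM or process heap can be shared by all the searchers on a machine.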

Page 13

Projection Search

java.util.TreeSet!
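The punchline is that an ordinary sorted structure is enough: project every point onto a random direction, keep the projections sorted, and compute real distances only for points whose projection lands near the query's. A hedged Python sketch of the idea (a sorted numpy array stands in for the TreeSet; names are illustrative):

```python
import numpy as np

def build_index(points, direction):
    """Sort point indices by their scalar projection onto `direction`."""
    proj = points @ direction
    order = np.argsort(proj)
    return proj[order], order  # sorted keys and the matching point indices

def projection_search(points, keys, order, direction, query, n_candidates=8):
    """Check true distances only for points whose projection is close to
    the query's projection; everything else is skipped entirely."""
    i = int(np.searchsorted(keys, query @ direction))
    lo, hi = max(0, i - n_candidates), min(len(order), i + n_candidates)
    cand = order[lo:hi]
    d = np.linalg.norm(points[cand] - query, axis=1)
    return int(cand[np.argmin(d)])  # index of the (approximate) nearest point
```

With several independent directions the candidate lists can be merged before the distance check, which is what the "How Many Projections?" question on the next slide is about.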

Page 14

How Many Projections?

Page 15

K-means Search

- Simple idea
  - pre-cluster the data
  - to find the nearest points, search the nearest clusters
- Recursive application
  - to search a cluster, use a Searcher!
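A hedged sketch of the pre-cluster-then-probe idea, using plain Lloyd's k-means plus cluster pruning (an illustrative toy, not the actual KmeansSearch code):

```python
import numpy as np

def kmeans_lite(X, k, iters=10, seed=0):
    """Tiny Lloyd's k-means, used only to pre-cluster the data."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = np.argmin(((X[:, None] - means) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (assign == j).any():
                means[j] = X[assign == j].mean(axis=0)
    # final assignment against the final means
    assign = np.argmin(((X[:, None] - means) ** 2).sum(-1), axis=1)
    return means, assign

def pruned_nearest(X, means, assign, query, n_probe=2):
    """Search only the n_probe clusters whose centroids are nearest the query."""
    best = np.argsort(np.linalg.norm(means - query, axis=1))[:n_probe]
    cand = np.flatnonzero(np.isin(assign, best))
    d = np.linalg.norm(X[cand] - query, axis=1)
    return int(cand[np.argmin(d)])
```

The recursive trick on the slide is to make the within-cluster scan itself another Searcher, so each level prunes the candidate set again.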

Pages 16-20

(figure-only slides: diagrams accompanying the k-means search discussion; the images are not captured in this transcript)

Page 21

But This Requires k-means!

- Need a new k-means algorithm to get speed
  - Hadoop is very slow at iterative map-reduce
  - maybe Pregel clones like Giraph would be better
  - or maybe not
- Streaming k-means is
  - one pass (through the original data)
  - very fast (20 µs per data point with threads)
  - very parallelizable

Page 22

Basic Method

- Use a single pass of k-means with very many clusters
  - output is a bad-ish clustering but a good surrogate for the data
- Use the weighted centroids from step 1 to do in-memory clustering
  - output is a good clustering with fewer clusters

Page 23

Algorithmic Details

For each data point x_n:
- compute the distance ∂ to the nearest centroid
- sample u ~ Uniform(0, 1)
  - if u > ∂/β, add x_n to the nearest centroid
  - else create a new centroid from x_n
- if the number of centroids > 10 log n
  - recursively cluster the centroids
  - set β = 1.5 β if the number of centroids did not decrease
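Those steps can be sketched end-to-end in Python. This is an illustrative toy, not Mahout's StreamingKMeans; in particular, the collapse step here is a simple greedy merge standing in for the recursive clustering on the slide:

```python
import math
import random

def collapse(centroids, beta):
    """Greedy re-clustering of the weighted centroids themselves; a crude
    stand-in for the recursive clustering step."""
    merged = []
    for w, c in centroids:
        if merged:
            j, d = min(((i, math.dist(mc, c)) for i, (mw, mc) in enumerate(merged)),
                       key=lambda t: t[1])
            if d < beta:
                mw, mc = merged[j]
                merged[j] = [mw + w, [(mw * a + w * b) / (mw + w)
                                      for a, b in zip(mc, c)]]
                continue
        merged.append([w, list(c)])
    return merged

def streaming_kmeans(points, beta=1.0, seed=0):
    """One pass over the data, keeping a list of [weight, centroid] pairs."""
    rng = random.Random(seed)
    centroids = []
    for n, x in enumerate(points, start=1):
        if not centroids:
            centroids.append([1.0, list(x)])
            continue
        # distance to the nearest existing centroid
        j, d = min(((i, math.dist(c, x)) for i, (w, c) in enumerate(centroids)),
                   key=lambda t: t[1])
        if rng.random() > d / beta:
            w, c = centroids[j]  # absorb x into the nearest centroid
            centroids[j] = [w + 1.0, [(w * a + b) / (w + 1.0)
                                      for a, b in zip(c, x)]]
        else:
            centroids.append([1.0, list(x)])  # x seeds a new centroid
        if len(centroids) > 10 * max(1.0, math.log(n)):
            before = len(centroids)
            centroids = collapse(centroids, beta)
            if len(centroids) >= before:
                beta *= 1.5  # loosen the threshold and try again later
    return centroids
```

Distant points are likely to seed new centroids (u > ∂/β fails more often when ∂ is large), so the centroid list tracks the shape of the data with far fewer points than the original stream.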

Page 24

How It Works

- Result is a large set of centroids
  - these provide an approximation of the original distribution
  - we can cluster the centroids to get a close approximation of clustering the original data
  - or we can just use the result directly

Page 25

Parallel Speedup?


Page 28

Warning, Recursive Descent

Inner loop requires finding nearest centroid

With lots of centroids, this is slow

But wait, we have classes to accelerate that!

(Let’s not use k-means searcher, though)

Empirically, projection search beats 64 bit LSH by a bit

Page 29

Moving to Scale

- Map-reduce implementation nearly trivial
- Map: rough-cluster the input data; output β and the weighted centroids
- Reduce:
  - a single reducer gets all centroids
  - if there are too many centroids, merge them using recursive clustering
  - optionally do the final clustering in-memory
- Combiner possible, but essentially never important
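The reduce step boils down to clustering the mappers' weighted centroids, where each centroid counts with its weight. A hedged sketch using a weighted Lloyd's iteration (not the actual Mahout reducer; names are illustrative):

```python
import numpy as np

def merge_weighted_centroids(centroids, weights, k, iters=10, seed=0):
    """Weighted k-means over the mappers' centroids: each centroid stands
    for `weight` original points, so the reducer preserves the data's mass."""
    rng = np.random.default_rng(seed)
    C = np.asarray(centroids, dtype=float)
    w = np.asarray(weights, dtype=float)
    means = C[rng.choice(len(C), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = np.argmin(((C[:, None] - means) ** 2).sum(-1), axis=1)
        for j in range(k):
            mask = assign == j
            if mask.any():
                # weighted mean: heavy centroids pull harder
                means[j] = np.average(C[mask], axis=0, weights=w[mask])
    return means

# Two mappers produced four weighted centroids around two true clusters.
means = merge_weighted_centroids(
    [[0.0, 0.0], [0.2, 0.0], [10.0, 10.0], [10.0, 10.2]],
    weights=[3, 1, 1, 3], k=2)
```

A combiner could run the same merge early on each mapper's output, but since each mapper already emits only a small centroid set, it rarely matters, which matches the last bullet above.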

Page 30

- Contact:
  - [email protected]
  - @ted_dunning
- Slides and such: http://info.mapr.com/ted-mlconf
- Hash tags: #mlconf #mahout #mapr