Transcript
Page 1: TunUp final presentation

TunUp: A Distributed Cloud-based Genetic Evolutionary Tuning for Data Clustering

Gianmario Spacagna, [email protected]

March 2013

AgilOne, Inc., 1091 N Shoreline Blvd. #250, Mountain View, CA 94043

Page 2: TunUp final presentation

Agenda

1. Introduction
2. Problem description
3. TunUp
4. K-means
5. Clustering evaluation
6. Full space tuning
7. Genetic algorithm tuning
8. Conclusions

Page 3: TunUp final presentation

Big Data

Page 4: TunUp final presentation

Business Intelligence

Why? Where? What? How?
Insights on customers, products and companies

Can someone else know your customer better than you? Do you have the domain knowledge and proper computation infrastructure?

Page 5: TunUp final presentation

Big Data as a Service (BDaaS)

Page 6: TunUp final presentation

Problem Description

(Figure: income and cost vs. number of customers)

Page 7: TunUp final presentation

Tuning of Clustering Algorithms

We need tuning when:

➢ New algorithm or version is released

➢ We want to improve accuracy and/or performance

➢ A new customer arrives and the system must be adapted to the new dataset and requirements


Page 8: TunUp final presentation

TunUp

Java framework integrating JavaML and Watchmaker

Main features:

➢ Data manipulation (loading, labelling and normalization)
➢ Clustering algorithms (k-means)
➢ Clustering evaluation (AIC, Dunn, Davies-Bouldin, Silhouette, aRand)
➢ Evaluation techniques validation (Pearson correlation t-test)
➢ Full search space tuning
➢ Genetic algorithm tuning (local and parallel implementation)
➢ RESTful API for web service deployment (Tomcat on Amazon EC2)

Open-source: http://github.com/gm-spacagna/tunup

Page 9: TunUp final presentation

k-means

Geometric hard-assignment clustering algorithm:

It partitions n data points into k clusters in which each point belongs to the cluster with the nearest centroid (cluster mean).

If we have k clusters in the set S = {S1, ..., Sk}, where xj is the jth point in a cluster and μi is the centroid of cluster Si, the goal of k-means is minimizing the Within-Cluster Sum of Squares:

argmin_S Σ_{i=1..k} Σ_{xj ∈ Si} ||xj − μi||²

Algorithm:

1. Initialization: a set of k random centroids is generated

2. Assignment: each point is assigned to the closest centroid

3. Update: the new centroids are calculated as the mean of the new clusters

4. Repeat from step 2 until convergence (centroids are stable and do not change)
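The four steps above can be sketched in a few lines of plain Python. This is an illustrative, self-contained sketch with a fixed Euclidean distance and hypothetical function names, not TunUp's Java API:

```python
# Minimal k-means sketch following the four steps above.
import random

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def mean(points):
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))

def kmeans(points, k, max_iterations=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)              # 1. Initialization
    for _ in range(max_iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                           # 2. Assignment
            i = min(range(k), key=lambda c: euclidean(p, centroids[c]))
            clusters[i].append(p)
        new_centroids = [mean(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]   # 3. Update
        if new_centroids == centroids:             # 4. Convergence check
            break
        centroids = new_centroids
    return centroids, clusters
```

On two well-separated blobs this converges in a handful of iterations; with less separated data the result depends on the random initial centroids, which is exactly the issue discussed later in the presentation.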

Page 10: TunUp final presentation

k-means tuning

Input parameters required:

1. K = (2,...,40)

2. Distance measure

3. Max iterations = 20 (fixed)

Different input parameters can produce very different outcomes!

0. Angular
2. Chebyshev
3. Cosine
4. Euclidean
5. Jaccard Index
6. Manhattan
7. Pearson Correlation Coefficient
8. Radial Basis Function Kernel
9. Spearman Footrule
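A small illustration of why the choice of distance measure matters: the measures above can disagree about which of two points is nearer. In this Python sketch (hypothetical helper names, not TunUp's API), Euclidean distance says r is the nearer neighbour of p, while cosine distance prefers q because it only compares directions:

```python
# Three of the candidate distance measures applied to the same points.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

p, q, r = (1.0, 0.0), (3.0, 0.0), (0.0, 0.5)
# Euclidean: r is closer to p than q is.
# Cosine: q points in the same direction as p, so q is "closer" than r.
```

Since k-means assigns each point to its "closest" centroid, swapping the measure can change every assignment, hence the very different outcomes.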

Page 11: TunUp final presentation

Clustering Evaluation

Definition of cluster: “A group of the same or similar elements gathered or occurring closely together”

Two main categories:

➢ Internal criterion : only based on the clustered data itself

➢ External criterion : based on benchmarks of pre-classified items

How do we evaluate if a set of clusters is good or not?

“Clustering is in the eye of the beholder” [V. Estivill-Castro, 2002]

Page 12: TunUp final presentation

Internal Evaluation

Common goal is assigning better scores when:
➢ High intra-cluster similarity
➢ Low inter-cluster similarity

Cluster models:

➢ Distance-based (k-means)

➢ Distribution-based (EM clustering)

➢ Density-based (DBSCAN)

➢ Connectivity-based (linkage clustering)

The choice of the evaluation technique depends on the nature of the data and the cluster model of the algorithm.

Page 13: TunUp final presentation

Proposed techniques

AIC: measure of the relative quantity of information lost by a statistical model. The clustering algorithm is modelled as a Gaussian Mixture Model. (inverted function)

Dunn: ratio between the minimum inter-cluster distance and the maximum cluster diameter. (natural fn.)

Davies-Bouldin : average similarity between each cluster and its most similar one. (inverted fn.)

Silhouette: measure of how well each point lies within its cluster. Indicates whether the object is correctly clustered or whether it would fit better in the neighbouring cluster. (natural fn.)
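As a concrete example of one of these criteria, the Dunn index above can be sketched in a few lines (toy Python with a hypothetical helper, not TunUp's implementation; higher is better):

```python
# Dunn index: minimum inter-cluster distance / maximum cluster diameter.
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def dunn_index(clusters):
    # Smallest distance between points in different clusters.
    inter = min(euclidean(p, q)
                for i, ci in enumerate(clusters)
                for cj in clusters[i + 1:]
                for p in ci for q in cj)
    # Largest distance between two points of the same cluster.
    diameter = max((euclidean(p, q)
                    for c in clusters for p in c for q in c), default=0.0)
    return inter / diameter
```

Compact, well-separated clusters give a large ratio; overlapping or sprawling clusters drive it toward zero.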

Page 14: TunUp final presentation

External criterion: AdjustedRand

Given a set of n elements S = {o1, ..., on} and two partitions to compare: X = {X1, ..., Xr} and Y = {Y1, ..., Ys}.

We can use AdjustedRand as the reference clustering evaluation and use it to validate the internal criteria.

RandIndex = (number of agreements between X and Y) / (total number of possible pair combinations)

AdjustedRandIndex = (RandIndex − ExpectedIndex) / (MaxIndex − ExpectedIndex)
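The pair-counting definition above can be sketched as follows (plain Python, unadjusted index only; the adjusted version additionally subtracts the ExpectedIndex as in the formula):

```python
# Rand index: fraction of element pairs on which partitions X and Y agree,
# i.e. pairs that are together in both or separate in both.
from itertools import combinations

def rand_index(x_labels, y_labels):
    pairs = list(combinations(range(len(x_labels)), 2))
    agreements = sum(
        (x_labels[i] == x_labels[j]) == (y_labels[i] == y_labels[j])
        for i, j in pairs
    )
    return agreements / len(pairs)
```

Identical partitions score 1.0; the adjustment rescales so that a random partition scores around 0 instead of a positive baseline.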

Page 15: TunUp final presentation

Correlation t-test

Pearson correlation between each internal evaluation technique and AdjustedRand, over a set of 120 random k-means configuration evaluations.

Average correlations:

AIC: 0.77
Dunn: 0.49
Davies-Bouldin: 0.51
Silhouette: 0.49
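The validation step reduces to computing a Pearson correlation between two score series, one per evaluation technique. A minimal sketch (toy data, not the 120 real evaluations):

```python
# Pearson correlation coefficient between two equal-length score series.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A value near 1 (as AIC's 0.77 here) means the internal criterion ranks configurations almost the same way as the AdjustedRand reference, so it can stand in for it when no labels are available.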

Page 16: TunUp final presentation

Datasets

D31: 3100 vectors, 2 dimensions, 31 clusters

S1: 5000 vectors, 2 dimensions, 15 clusters

Source: http://cs.joensuu.fi/sipu/datasets/

Page 17: TunUp final presentation

Initial centroids issue

N. observations = 200
Input configuration: k = 31, Distance Measure = Euclidean

(Figure: distributions of AdjustedRand and AIC scores)

We can consider the median value!

Page 18: TunUp final presentation

Full space evaluation

The global optimum is at:
k = 36
Distance Measure = Euclidean

N executions averaged = 20

Page 19: TunUp final presentation

Genetic Algorithm Tuning

Mutation:

Pr(mutate k_i → k_j) ∝ 1 / distance(k_i, k_j)

Pr(mutate d_i → d_j) = 1 / (N_dist − 1)

Crossover (single-point):

[x1, x2, x3, x4, ..., xm] × [y1, y2, y3, y4, ..., ym]
→ [x1, x2, x3, y4, ..., ym] and [y1, y2, y3, x4, ..., xm]

Selection: Elitism + Roulette wheel
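The two operators can be sketched as follows for a candidate holding k and a distance-measure index (Python sketch; the constant N_DIST and all names are assumptions for illustration, not TunUp's Watchmaker-based implementation):

```python
import random

N_DIST = 9             # assumed count of candidate distance measures
K_RANGE = range(2, 41) # k = (2, ..., 40) as on the tuning slide

def crossover(parent_a, parent_b, point):
    # Single-point crossover: swap the tails of the two genomes.
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

def mutate_k(k, rng):
    # Pr(k -> k') proportional to 1 / |k - k'|: nearby values favoured.
    candidates = [j for j in K_RANGE if j != k]
    weights = [1.0 / abs(k - j) for j in candidates]
    return rng.choices(candidates, weights=weights)[0]

def mutate_distance(d, rng):
    # Pr(d -> d') = 1 / (N_DIST - 1): uniform over the other measures.
    return rng.choice([j for j in range(N_DIST) if j != d])
```

Weighting k-mutations by inverse distance keeps the search local around promising cluster counts, while the distance measure, being categorical, is resampled uniformly.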

Page 20: TunUp final presentation

Tuning parameters:

Fitness evaluation: AIC
Prob. mutation: 0.5
Prob. crossover: 0.9
Population size: 6
Stagnation limit: 5
Elitism: 1
N. executions averaged: 10

Relevant results:

➢ Best fitness value always decreasing
➢ Mean fitness value trend decreasing
➢ High standard deviation in the previous population often generates a better mean population in the next one

Page 21: TunUp final presentation

Results

Test1: k = 39, Distance Measure = Manhattan

Test2: k = 33, Distance Measure = RBF Kernel

Test3: k = 36, Distance Measure = Euclidean

Different results due to:
1. Early convergence
2. Random initial centroids

Page 22: TunUp final presentation

Parallel GA

Optimal number of servers = POP_SIZE − ELITISM

Amazon Elastic Compute Cloud (EC2)
10 x Micro instances

Simulation: 10 evolutions, POP_SIZE = 5, no elitism

E[T single evolution] ≤

Page 23: TunUp final presentation

Conclusions

We developed, tested and analysed TunUp, an open-source solution for: evaluation, validation and tuning of data clustering algorithms.

Future applications:
➢ Tuning of existing algorithms
➢ Supporting new algorithms design
➢ Evaluation and comparison of different algorithms

Limitations:
➢ Single distance measure
➢ Equal normalization
➢ Master / slave parallel execution
➢ Random initial centroids

Page 24: TunUp final presentation

Questions?

Page 25: TunUp final presentation

Thank you! Tack! Grazie!

