TunUp final presentation
TunUp: A Distributed Cloud-based Genetic Evolutionary Tuning for Data Clustering
Gianmario [email protected]
March 2013
AgilOne, Inc., 1091 N Shoreline Blvd. #250, Mountain View, CA 94043
Agenda
1. Introduction
2. Problem description
3. TunUp
4. K-means
5. Clustering evaluation
6. Full space tuning
7. Genetic algorithm tuning
8. Conclusions
Big Data
Business Intelligence
Why? Where? What? How?
Insights into customers, products and companies
Can someone else know your customers better than you? Do you have the domain knowledge and the proper computational infrastructure?
Big Data as a Service (BDaaS)
Problem Description
(Figure: income and cost as a function of the number of customers)
Tuning of Clustering Algorithms
We need tuning when:
➢ New algorithm or version is released
➢ We want to improve accuracy and/or performance
➢ New customer comes and the system must be adapted for the new dataset and requirements
TunUp
A Java framework integrating JavaML and Watchmaker.
Main features:
➢ Data manipulation (loading, labelling and normalization)
➢ Clustering algorithms (k-means)
➢ Clustering evaluation (AIC, Dunn, Davies-Bouldin, Silhouette, aRand)
➢ Evaluation techniques validation (Pearson correlation t-test)
➢ Full search space tuning
➢ Genetic algorithm tuning (local and parallel implementation)
➢ RESTful API for web service deployment (Tomcat on Amazon EC2)
Open-source: http://github.com/gm-spacagna/tunup
k-means
A geometric, hard-assignment clustering algorithm: it partitions n data points into k clusters, each point belonging to the cluster with the nearest centroid (mean).
Given k clusters S = {S_1, ..., S_k}, where x_j denotes the j-th point assigned to cluster S_i and μ_i the mean (centroid) of that cluster, the goal of k-means is to minimize the Within-Cluster Sum of Squares:

WCSS = Σ_{i=1}^{k} Σ_{x_j ∈ S_i} ‖x_j − μ_i‖²
Algorithm:
1. Initialization: a set of k random centroids is generated
2. Assignment: each point is assigned to the closest centroid
3. Update: the new centroids are computed as the means of the new clusters
4. Repeat from step 2 until convergence (centroids are stable and no longer change)
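The four steps above can be sketched as follows (a minimal pure-Python Lloyd's algorithm; TunUp itself relies on the JavaML implementation, so all names here are illustrative):

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(cluster):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(xs) / len(cluster) for xs in zip(*cluster))

def kmeans(points, k, max_iterations=20, seed=0):
    rnd = random.Random(seed)
    centroids = rnd.sample(points, k)                  # 1. initialization
    clusters = [[] for _ in range(k)]
    for _ in range(max_iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                               # 2. assignment
            i = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[i].append(p)
        new = [mean(c) if c else centroids[i]          # 3. update
               for i, c in enumerate(clusters)]
        if new == centroids:                           # 4. convergence
            break
        centroids = new
    return centroids, clusters

# two well-separated groups should yield two balanced clusters
cents, clus = kmeans([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], k=2)
```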
k-means tuning
Input parameters required:
1. K = (2,...,40)
2. Distance measure
3. Max iterations = 20 (fixed)
Different input parameters lead to very different outcomes!
0. Angular
1. Chebyshev
2. Cosine
3. Euclidean
4. Jaccard Index
5. Manhattan
6. Pearson Correlation Coefficient
7. Radial Basis Function Kernel
8. Spearman Footrule
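The distance measure matters because it changes which centroid counts as "nearest". A small illustrative sketch (pure Python, invented example points): Euclidean and cosine distance can disagree about the closest centroid for the same point.

```python
import math

def euclidean(a, b):
    """Straight-line distance."""
    return math.dist(a, b)

def cosine_distance(a, b):
    """1 - cosine similarity: small when the two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1 - dot / (math.hypot(*a) * math.hypot(*b))

p, a, b = (2, 2), (1, 1), (2, 3)
# Euclidean says centroid b is nearer; cosine says a (same direction as p) is nearer
by_euclid = min([a, b], key=lambda c: euclidean(p, c))
by_cosine = min([a, b], key=lambda c: cosine_distance(p, c))
```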
Clustering Evaluation
Definition of cluster: "A group of the same or similar elements gathered or occurring closely together"
Two main categories:
➢ Internal criterion : only based on the clustered data itself
➢ External criterion : based on benchmarks of pre-classified items
How do we evaluate if a set of clusters is good or not?
“Clustering is in the eye of the beholder” [E. Castro, 2002]
Internal Evaluation
The common goal is to assign better scores when there is:
➢ High intra-cluster similarity
➢ Low inter-cluster similarity
Cluster models:
➢ Distance-based (k-means)
➢ Distribution-based (EM clustering)
➢ Density-based (DBSCAN)
➢ Connectivity-based (linkage clustering)
The choice of the evaluation technique depends on the nature of the data and the cluster model of the algorithm.
Proposed techniques
AIC: a measure of the relative amount of information lost by a statistical model; the clustering is modelled as a Gaussian mixture process. (inverted fn.)
Dunn: ratio of the minimum inter-cluster distance to the maximum cluster diameter. (natural fn.)

Davies-Bouldin: average similarity between each cluster and its most similar one. (inverted fn.)

Silhouette: measure of how well each point lies within its cluster, indicating whether the object is correctly clustered or would fit better in a neighbouring cluster. (natural fn.)
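As a concrete example of one internal criterion, here is a minimal sketch of the Dunn index described above (pure Python; TunUp uses the JavaML evaluators, so this is illustrative only):

```python
import itertools
import math

def dunn_index(clusters):
    """Min inter-cluster distance / max cluster diameter (higher is better)."""
    inter = min(math.dist(p, q)
                for a, b in itertools.combinations(clusters, 2)
                for p in a for q in b)
    diameter = max(math.dist(p, q)
                   for c in clusters
                   for p, q in itertools.combinations(c, 2))
    return inter / diameter

# compact, well-separated clusters score higher than spread-out, close ones
tight = [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]
loose = [[(0, 0), (0, 5)], [(4, 4), (10, 11)]]
```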
External criterion: AdjustedRand
Given a set of n elements S = {o1, ..., on} and two partitions to compare, X = {X1, ..., Xr} and Y = {Y1, ..., Ys}:
We can use AdjustedRand as the reference for the best clustering evaluation and use it to validate the internal criteria.
RandIndex = (number of agreements between X and Y) / (total number of possible pair combinations)

AdjustedRandIndex = (RandIndex − ExpectedIndex) / (MaxIndex − ExpectedIndex)
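A minimal sketch of the AdjustedRand computation via the standard pair-counting contingency table (pure Python, illustrative; TunUp's aRand comes from its JavaML integration):

```python
from collections import Counter
from math import comb

def adjusted_rand(x, y):
    """(RandIndex - ExpectedIndex) / (MaxIndex - ExpectedIndex) over point pairs."""
    n = len(x)
    nij = Counter(zip(x, y))        # contingency table cells
    a = Counter(x)                  # row sums (partition X)
    b = Counter(y)                  # column sums (partition Y)
    sum_ij = sum(comb(v, 2) for v in nij.values())
    sum_a = sum(comb(v, 2) for v in a.values())
    sum_b = sum(comb(v, 2) for v in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

# identical partitions up to relabelling score 1.0
ari_perfect = adjusted_rand([0, 0, 1, 1], [1, 1, 0, 0])
# maximally disagreeing partitions score below 0
ari_bad = adjusted_rand([0, 1, 0, 1], [0, 0, 1, 1])
```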
Correlation t-test
Average correlations:
AIC: 0.77
Dunn: 0.49
Davies-Bouldin: 0.51
Silhouette: 0.49
Pearson correlation over a set of 120 random k-means configuration evaluations:
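The validation step amounts to computing Pearson's r between an internal metric's scores and the aRand scores over the same configurations. A sketch (the score lists below are invented toy data, not the 120 real evaluations):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# toy scores: one internal metric tracks aRand well, the other does not
arand       = [0.2, 0.4, 0.5, 0.7, 0.9]
metric_good = [0.25, 0.35, 0.55, 0.65, 0.95]
metric_bad  = [0.9, 0.1, 0.8, 0.2, 0.5]
```

A high |r| against aRand is what qualifies an internal metric (such as AIC, at 0.77) as a usable fitness function.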
Datasets
D31: 3100 vectors, 2 dimensions, 31 clusters
Source: http://cs.joensuu.fi/sipu/datasets/
S1: 5000 vectors, 2 dimensions, 15 clusters
Initial Centroids issue
N. observations = 200
Input configuration: k = 31, Distance Measure = Euclidean
(Figure: distributions of the AdjustedRand and AIC scores across runs)
We can consider the median value!
Full space evaluation
The global optimum is at:
K = 36
Distance Measure = Euclidean
N executions averaged = 20
Genetic Algorithm Tuning
Mutation:
Pr(mutate k_i → k_j) ∝ 1 / distance(k_i, k_j)
Pr(mutate d_i → d_j) = 1 / (N_dist − 1)

Crossover:
[x1, x2, x3, x4, ..., xm] × [y1, y2, y3, y4, ..., ym]
→ [x1, x2, x3, y4, ..., ym] and [y1, y2, y3, x4, ..., xm]

Selection: elitism + roulette wheel
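The operators above can be sketched as follows (pure Python; TunUp builds on Watchmaker, so these names are illustrative): k mutates preferentially to nearby values, the distance measure mutates uniformly over the alternatives, and crossover swaps the tails of two parameter vectors.

```python
import random

def mutate_k(k, k_values, rnd):
    """Draw a new k with probability proportional to 1/|k - k_j| (nearby values favoured)."""
    others = [v for v in k_values if v != k]
    weights = [1.0 / abs(k - v) for v in others]
    return rnd.choices(others, weights=weights)[0]

def mutate_distance(d, n_dist, rnd):
    """Any other distance measure index, each with probability 1/(N_dist - 1)."""
    return rnd.choice([i for i in range(n_dist) if i != d])

def crossover(x, y, rnd):
    """Single-point crossover: swap the tails of two parameter vectors."""
    cut = rnd.randrange(1, len(x))
    return x[:cut] + y[cut:], y[:cut] + x[cut:]

rnd = random.Random(42)
ks = list(range(2, 41))
# individuals as [k, distance_index]; with length 2 the cut point is always 1
child_a, child_b = crossover([2, 0], [39, 5], rnd)
```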
Tuning parameters:
Fitness evaluation: AIC
Prob. mutation: 0.5
Prob. crossover: 0.9
Population size: 6
Stagnation limit: 5
Elitism: 1
N executions averaged: 10
Relevant results:
➢ Best fitness value always decreasing
➢ Mean fitness value trend decreasing
➢ High standard deviation in one population often generates a better mean in the next population
Results
Test1: k = 39, Distance Measure = Manhattan
Test2: k = 33, Distance Measure = RBF Kernel
Test3: k = 36, Distance Measure = Euclidean
The results differ due to:
1. Early convergence
2. Random initial centroids
Parallel GA
Optimal n. of servers = POP_SIZE – ELITISM
Amazon Elastic Compute Cloud (EC2): 10 x Micro instances
Simulation:10 evolutions, POP_SIZE = 5, no elitism
E[T single evolution] ≤
Conclusions
We developed, tested and analysed TunUp, an open-source solution for the evaluation, validation and tuning of data clustering algorithms.
Future applications:
➢ Tuning of existing algorithms
➢ Supporting the design of new algorithms
➢ Evaluation and comparison of different algorithms
Limitations:
➢ Single distance measure
➢ Equal normalization
➢ Master/slave parallel execution
➢ Random initial centroids
Questions?
Thank you! Tack! Grazie!