

Accelerating Dynamic Time Warping Clustering with a Novel Admissible Pruning Strategy

Nurjahan Begum, Liudmila Ulanova, Jun Wang¹, Eamonn Keogh

University of California, Riverside; ¹University of Texas at Dallas

Outline

• Introduction, Related Work & Background
• Density Peaks (DP) Clustering Algorithm
• Pruning Using DTW Bounds
• Going Anytime: Distance Computation-Ordering Heuristic
• Experimental Evaluation

Outline

• Introduction, Related Work & Background
• Density Peaks (DP) Clustering Algorithm
• Pruning Using DTW Bounds
• Going Anytime: Distance Computation-Ordering Heuristic
• Experimental Evaluation

Problem Description

• The problem this work addresses is robustly clustering large time series datasets with invariance to irrelevant data. The desired properties are:

• Accuracy

• Invariance to irrelevant data

• Scalability (Efficiency, Interruptibility)

• Robustness to parameter settings

Accuracy: The Use of DTW

• For most time series data mining algorithms, the quality of the output depends almost exclusively on the distance measure used.

• A consensus has emerged that DTW is the best distance measure in most domains, almost always outperforming the Euclidean Distance (ED).

• Do DTW and ED converge for increasing data sizes? – Not for clustering!
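For concreteness, the classic dynamic-programming formulation of DTW (not part of the slides; a minimal sketch assuming equal-length, univariate series and squared point costs) looks like this:

```python
import numpy as np

def dtw_distance(x, y):
    """Classic O(n*m) dynamic-programming DTW with squared point costs."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            # Extend the cheapest of the three admissible warping steps.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return np.sqrt(D[n, m])

# DTW aligns a shifted peak that Euclidean distance penalizes heavily.
a = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0])
b = np.array([0.0, 1.0, 2.0, 1.0, 0.0, 0.0])
print(dtw_distance(a, b))     # 0.0 -- the warping path absorbs the shift
print(np.linalg.norm(a - b))  # 2.0 -- ED pays for the misalignment
```

The toy pair above illustrates why DTW helps clustering: under shift, ED and DTW disagree badly, and averaging over a larger dataset does not make that disagreement vanish for clustering.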

Invariance to Irrelevant Data: The Use of DP

• It has been suggested that successful clustering of time series requires the ability to ignore some data objects:
• Anomalous objects are themselves unclusterable;
• They interfere with the clustering of clusterable data.

• DP, in contrast to clustering algorithms such as K-means, can ignore anomalous objects.

Efficiency: Pruning Using Both Boundings

• Both DTW and DP are slow. – CPU constrained, not I/O constrained.

• In some problems (notably similarity search), lower-bound pruning is the main technique used to produce speedup, and its effectiveness tends to improve on large datasets.

• This is not effective in clustering due to the need to know the distance between all pairs, or at least all distances within a certain range.

• Also, due to the non-metric character of DTW, it is hard to build an index to speed it up.

• This work exploits both the lower and upper bounds of DTW in the framework of DP.
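As a sketch of the two bounds involved (ours, not the authors' code): LB_Keogh gives an admissible lower bound on DTW under a warping band of width r, and for equal-length series the diagonal warping path is always valid, so the plain Euclidean distance serves as an admissible upper bound. The band width r here is an assumed parameter.

```python
import numpy as np

def lb_keogh(q, c, r):
    """LB_Keogh: admissible lower bound on DTW(q, c) under a warping
    band of width r. Only points of q falling outside the running
    min/max envelope of c contribute to the bound."""
    lb = 0.0
    for i in range(len(q)):
        lo, hi = max(0, i - r), min(len(c), i + r + 1)
        upper, lower = c[lo:hi].max(), c[lo:hi].min()
        if q[i] > upper:
            lb += (q[i] - upper) ** 2
        elif q[i] < lower:
            lb += (q[i] - lower) ** 2
    return np.sqrt(lb)

def euclidean_ub(q, c):
    """The diagonal is a valid warping path, so ED upper-bounds DTW."""
    return np.linalg.norm(q - c)

q = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
c = np.array([0.0, 0.0, 2.0, 1.0, 0.0])
lb, ub = lb_keogh(q, c, 1), euclidean_ub(q, c)
```

A pair of series can then be pruned whenever its lower bound already exceeds the relevant cutoff, or resolved whenever its upper bound already falls below it, without ever computing the exact DTW distance.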

Interruptibility: Going Anytime

• What if the pruning is still not sufficient? - User interruption

- This work further adapts the proposed method to an anytime algorithm.

• Anytime algorithms are algorithms that can return a valid solution to a problem, even if interrupted before ending.
• Small setup time
• Best-so-far answer

• Monotonicity & Diminishing returns

Robustness to Parameter Settings: The Use of DP

• Many clustering algorithms require the user to set many parameters.

• DP requires only two parameters. Moreover, they are relatively intuitive and not particularly sensitive to user choice.

Outline

• Introduction, Related Work & Background
• Density Peaks (DP) Clustering Algorithm
• Pruning Using DTW Bounds
• Going Anytime: Distance Computation-Ordering Heuristic
• Experimental Evaluation

Internal Logic & Required Parameters of DP

• The DP algorithm assumes that cluster centers are surrounded by neighbors of lower local density, and lie relatively far from any point of higher local density. For a given point i:
• The local density ρi is the number of points closer to i than some cutoff distance dc;
• The distance from points of higher density is the minimum distance δi from point i to any point of higher density.

• The DP algorithm requires two pre-set parameters:
• The cutoff distance dc
• The number of clusters k (which can be determined by locating a knee point)

See Rodriguez, A. & Laio, A., "Clustering by Fast Search and Find of Density Peaks," Science, 344(6191):1492-1496, 2014 for more!

Four Phases of DP

• Local Density Calculation

• Distance to Higher Density Points Computation

• Cluster Center Selection

• Cluster Assignment

Phase 1: Local Density Calculation

Phase 2: Distance to Higher Density Points Computation

Phase 3: Cluster Center Selection

• The cluster centers are selected using a simple heuristic: points with higher values of (ρi×δi) are more likely to be centers.

Phase 4: Cluster Assignment
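The four phases can be sketched end to end as a minimal, unpruned Density Peaks implementation over a precomputed all-pair distance matrix (our illustration, not the authors' code; ties in density are broken by a fixed decreasing-density ordering):

```python
import numpy as np

def density_peaks(dist, dc, k):
    """Minimal, unpruned Density Peaks sketch on a precomputed n x n
    distance matrix `dist`. Illustration only: no TADPole pruning."""
    n = dist.shape[0]
    # Phase 1: local density = number of points within cutoff distance dc.
    rho = (dist < dc).sum(axis=1) - 1            # subtract 1 to exclude self
    # Phase 2: delta = distance to the nearest point of higher density,
    # taken over points earlier in a decreasing-density ordering.
    order = np.argsort(-rho)
    delta = np.empty(n)
    nneigh = np.zeros(n, dtype=int)
    delta[order[0]] = dist[order[0]].max()       # densest point: max distance
    for p in range(1, n):
        i = order[p]
        earlier = order[:p]
        j = earlier[np.argmin(dist[i, earlier])]
        delta[i], nneigh[i] = dist[i, j], j
    # Phase 3: centers = top-k points by the rho * delta heuristic.
    centers = np.argsort(-(rho * delta))[:k]
    # Phase 4: each non-center point inherits the label of its nearest
    # higher-density neighbor, processed in decreasing-density order.
    label = np.full(n, -1)
    label[centers] = np.arange(k)
    for i in order:
        if label[i] == -1:
            label[i] = label[nneigh[i]]
    return rho, delta, centers, label

# Two well-separated 1-D groups of points, the denser one on the right.
pts = np.array([0.0, 0.1, 0.2, 10.0, 10.05, 10.1, 10.15])
dist = np.abs(pts[:, None] - pts[None, :])
rho, delta, centers, label = density_peaks(dist, dc=0.3, k=2)
```

Running this on the toy data recovers one cluster per group, with one center chosen from each group by the ρ×δ score.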

Why DP?

• Capability of ignoring outliers.

• Capability of handling datasets whose clusters can form arbitrary shapes.

• Few user-set parameters and low sensitivity.

• Amenability to distance-computation pruning and to conversion into an anytime algorithm.

Outline

• Introduction, Related Work & Background
• Density Peaks (DP) Clustering Algorithm
• Pruning Using DTW Bounds
• Going Anytime: Distance Computation-Ordering Heuristic
• Experimental Evaluation

Pruning Using DTW Bounds

• The proposed algorithm, TADPole (Time-series Anytime DP), requires distance computations in the following two phases:

• Phase 1: local density computation

• Phase 2: distance to higher density points computation (NN distance computation)

Pruning in the Local Density Computation Phase

Pruning in the NN Distance Computation Phase

Pruning in the NN Distance Computation Phase

Multidimensional Time Series Clustering

Independent calculation → Summation

Multidimensional Time Series Clustering
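The "independent calculation → summation" strategy above can be sketched as follows (our illustration; rows of each array are dimensions): each dimension is warped separately with its own DTW, and the per-dimension distances are then summed.

```python
import numpy as np

def dtw_1d(x, y):
    """Plain dynamic-programming DTW on one dimension (squared costs)."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def dtw_independent(X, Y):
    """'Independent' multidimensional DTW: each dimension gets its own
    warping path; the per-dimension distances are then summed."""
    return sum(dtw_1d(X[d], Y[d]) for d in range(X.shape[0]))

# Two 2-dimensional series of length 4 (rows = dimensions).
X = np.array([[0.0, 1.0, 2.0, 1.0],
              [1.0, 0.0, 0.0, 1.0]])
Y = np.array([[0.0, 0.0, 2.0, 1.0],
              [1.0, 0.0, 1.0, 1.0]])
d = dtw_independent(X, Y)
```

Note the design choice this encodes: each dimension may warp differently, which suits dimensions that are only loosely synchronized.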

Pruning Effectiveness: Baselines

• Brute force: all-pair distance matrix computed.

• Oracle (post-hoc): only the necessary distance computations are performed.
• Local density calculation phase: only distance computations contributing to the actual density of an object are considered.
• NN distance calculation phase: only the actual NN distances are considered.

Pruning Effectiveness: Illustration

• Dataset: StarLightCurves

Outline

• Introduction, Related Work & Background
• Density Peaks (DP) Clustering Algorithm
• Pruning Using DTW Bounds
• Going Anytime: Distance Computation-Ordering Heuristic
• Experimental Evaluation

Going Anytime: Which Phase is Amenable?

• TADPole requires distance computations in the following two phases:

• Phase 1: local density computation
• Not amenable to anytime ordering (this phase constitutes the setup time)

• Phase 2: NN distance computation
• Amenable to anytime ordering!

Going Anytime: Contestants

• Oracle: in each step of the algorithm, this ordering cheats by choosing the object that maximizes the current Rand Index.

• Top-to-bottom, left-to-right ordering? Too sensitive to "luck"!

• Random ordering: less sensitive to luck.

• The proposed heuristic: ρ × ub
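A sketch of that heuristic (the ρ and ub values below are made up for illustration, not from the paper): each pending object is scored by its local density times the DTW upper bound on its current nearest-neighbor distance, and objects are processed in decreasing score order, so dense objects whose true NN distance is still only loosely known get resolved first.

```python
import numpy as np

# Hypothetical state for five objects: local densities rho, and upper
# bounds ub on each object's current nearest-neighbor DTW distance.
rho = np.array([5.0, 2.0, 9.0, 4.0, 7.0])
ub  = np.array([1.0, 3.0, 0.5, 2.0, 1.5])

# Process objects in decreasing rho * ub order: a high score means a
# dense object whose NN distance is still loosely bounded.
order = np.argsort(-(rho * ub))
print(order.tolist())  # [4, 3, 1, 0, 2]
```

Object 2 has the highest density, yet it is processed last: its tight upper bound means its exact NN distance can add little to the answer's quality.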

Going Anytime: Effectiveness Illustration

Outline

• Introduction, Related Work & Background
• Density Peaks (DP) Clustering Algorithm
• Pruning Using DTW Bounds
• Going Anytime: Distance Computation-Ordering Heuristic
• Experimental Evaluation

Clustering Quality & Efficiency Evaluation

TADPole is at least an order of magnitude faster than the rival methods.

Parameter Sensitivity Evaluation

Performed on Symbols dataset with k = 6

Conclusions & Comments

• Pruning using both bounds

• Anytime algorithm

• More borrowing than originating!
