
V. Megalooikonomou, Temple University

Clustering and Partitioning for Spatial and Temporal Data Mining

Vasilis Megalooikonomou

Data Engineering Laboratory (DEnLab)
Dept. of Computer and Information Sciences

Temple University
Philadelphia, PA

www.cis.temple.edu/~vasilis


Outline

• Introduction
  – Motivation
  – Problems:
    • Spatial domain
    • Time domain
  – Challenges
• Spatial data
  – Partitioning and Clustering
  – Detection of discriminative patterns
  – Results
• Temporal data
  – Partitioning
  – Vector Quantization
  – Results
• Conclusions - Discussion


Introduction

• Large spatial and temporal databases

• Meta-analysis of data pooled from multiple studies

• Goal: To understand patterns and discover associations, regularities and anomalies in spatial and temporal data


Problem

Spatial Data Mining:

Given a large collection of spatial data, e.g., 2D or 3D images, and other data, find interesting things, i.e.:
• associations among image data or among image and non-image data
• discriminative areas among groups of images
• rules/patterns
• similar images to a query image (queries by content)


Challenges

• How to apply data mining techniques to images?
• Learning from images directly
• Heterogeneity and variability of image data
• Preprocessing (segmentation, spatial normalization, etc.)
• Exploration of high correlation between neighboring objects
• Large dimensionality
• Complexity of associations
• Efficient management of topological/distance information
• Spatial knowledge representation / Spatial Access Methods (SAMs)


Example: Association Mining – Spatial Data

• Discover associations among spatial and non-spatial data:
  – Images {i1, i2,…, iL}
  – Spatial regions {s1, s2,…, sK}
  – Non-spatial variables {c1, c2,…, cM}

[Figure: example images i1 to i7 labeled with non-spatial variables (c1, c2, c3, c6, c7, c9)]


Example: fMRI contrast maps

[Figure: fMRI contrast maps for a control and a patient]


Applications

Medical Imaging, Bioinformatics, Geography, Meteorology, etc.


Voxel-based Analysis

• No model on the image data

• Each voxel’s changes analyzed independently - a map of statistical significance is built

• Discriminatory significance measured by statistical tests (t-test, ranksum test, F-test, etc.)

• Statistical Parametric Mapping (SPM)

• Significance of associations measured by chi-squared test, Fisher’s exact test (a contingency table for each pair of variables)

• Cluster voxels by findings

[V. Megalooikonomou, C. Davatzikos, E. Herskovits, SIGKDD 1999]
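As a rough illustration of the voxel-wise idea, the sketch below tests each voxel independently and builds a map of test statistics. It is not the authors' code: the function names, the toy data, and the cutoff `T_CUTOFF` are assumptions for illustration, and it thresholds the Welch t statistic directly instead of converting it to a p-value (which would need a t-distribution CDF from a statistics library).

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0

def voxelwise_t_map(controls, patients):
    """Test every voxel independently; return one t statistic per voxel.

    controls, patients: lists of images, each image a flat list of voxel values.
    """
    n_voxels = len(controls[0])
    return [
        welch_t([img[v] for img in controls], [img[v] for img in patients])
        for v in range(n_voxels)
    ]

# Toy data (hypothetical): 4 voxels per image, only voxel 2 differs between groups.
controls = [[1.0, 2.0, 5.0, 3.0], [1.1, 2.1, 5.2, 2.9], [0.9, 1.9, 4.9, 3.1]]
patients = [[1.0, 2.1, 8.0, 3.0], [1.1, 1.9, 8.3, 3.1], [0.9, 2.0, 7.9, 2.9]]

t_map = voxelwise_t_map(controls, patients)

# Hypothetical cutoff on |t|; a real analysis would use p-values and a
# correction for the number of voxels tested.
T_CUTOFF = 4.0
significant = [v for v, t in enumerate(t_map) if abs(t) > T_CUTOFF]
```

Note that the number of tests equals the number of voxels, which is what makes the multiple comparison problem severe for this kind of analysis.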


Analysis by grouping of voxels

• Grouping of voxels (atlas-based)
  – Prior knowledge increases sensitivity
  – Data reduction: 10^7 voxels → R regions (structures)
  – Map a ROI onto at least one region
  – As good as the atlas being used

• M non-spatial variables, R regions

• Analysis
  – Categorical structural variables
    • M x R contingency tables, Chi-square/Fisher exact test
    • multiple comparison problem
    • log-linear analysis, multivariate Bayesian
  – Continuous structural variables
    • Logistic regression, Mann-Whitney
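For the categorical case, one contingency table is tested per (variable, region) pair. A minimal sketch of the Pearson chi-squared statistic follows; the counts, the variable names, and the use of a fixed critical value are illustrative assumptions, and the sketch assumes every expected count is non-zero.

```python
def chi_square_statistic(table):
    """Pearson chi-squared statistic for an r x c contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows = abnormality present / absent in one atlas region,
# columns = non-spatial variable positive / negative.
table = [[30, 10],
         [10, 30]]
stat = chi_square_statistic(table)

# Critical value of the chi-squared distribution, 1 degree of freedom, alpha = 0.05.
CRITICAL_1DF_005 = 3.841
associated = stat > CRITICAL_1DF_005
```

With M variables and R regions this test is repeated M x R times, so the significance level would normally be adjusted for multiple comparisons.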


Dynamic Recursive Partitioning

• Adaptive partitioning of a 3D volume
• Partitioning criterion: discriminative power of feature(s) of hyper-rectangle and size of hyper-rectangle
• Extract features from discriminative regions
• Reduce multiple comparison problem (# tests = # partitions < # voxels)
• tests downward closed

[V. Megalooikonomou, D. Pokrajac, A. Lazarevic, and Z. Obradovic, SPIE Conference on Visualization and Data Analysis, 2002]
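The recursive idea can be sketched as follows. This is a simplified 2D illustration under stated assumptions, not the published DRP implementation: the feature is the mean intensity of a rectangle, the criterion is a cutoff on Welch's t statistic (`t_cutoff` is a hypothetical parameter), and a rectangle is split into four quadrants instead of the eight octants of a 3D volume.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0

def region_mean(img, r0, r1, c0, c1):
    """Mean intensity of img inside the rectangle [r0, r1) x [c0, c1)."""
    return mean(img[r][c] for r in range(r0, r1) for c in range(c0, c1))

def drp(group_a, group_b, rect, t_cutoff=4.0, max_depth=3, depth=0):
    """Return the discriminative rectangles found by recursive partitioning."""
    r0, r1, c0, c1 = rect
    fa = [region_mean(img, r0, r1, c0, c1) for img in group_a]
    fb = [region_mean(img, r0, r1, c0, c1) for img in group_b]
    if abs(welch_t(fa, fb)) > t_cutoff:
        return [rect]                      # discriminative: stop splitting
    if depth == max_depth or r1 - r0 < 2 or c1 - c0 < 2:
        return []                          # too deep or too small: stop
    rm, cm = (r0 + r1) // 2, (c0 + c1) // 2
    found = []
    for child in [(r0, rm, c0, cm), (r0, rm, cm, c1),
                  (rm, r1, c0, cm), (rm, r1, cm, c1)]:
        found += drp(group_a, group_b, child, t_cutoff, max_depth, depth + 1)
    return found

def make_image(base, bump):
    """4x4 toy image: constant background, bump added to the top-left 2x2 block."""
    return [[base + (bump if r < 2 and c < 2 else 0.0) for c in range(4)]
            for r in range(4)]

# Hypothetical groups: patients differ from controls only in the top-left block.
controls = [make_image(b, 0.0) for b in (1.0, 1.5, 0.5)]
patients = [make_image(b, 3.0) for b in (1.0, 1.5, 0.5)]
regions = drp(controls, patients, (0, 4, 0, 4))
```

In this toy case the whole image is not discriminative on its own, so it is split, and only the top-left quadrant is reported.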


Other Methods for Spatial Data Classification

• Distributional Distances:
  – Mahalanobis distance
  – Kullback-Leibler divergence (parametric, non-parametric)

• Maximum Likelihood:
  – Estimate probability densities and compute likelihood

• EM (Expectation-Maximization) method to model spatial regions using some base function (Gaussian)

• Static partitioning:
  – Reduction of the # of attributes as compared to voxel-wise analysis
  – Space partitioned into 3D hyper-rectangles (variables: properties of voxels inside hyper-rectangles) - incrementally increase discretization

[Figure: distinguishing among distributions]

D. Pokrajac, V. Megalooikonomou, A. Lazarevic, D. Kontos, Z. Obradovic, Artificial Intelligence in Medicine, Vol. 33, No. 3, pp. 261-280, Mar. 2005.


Experimental Results

[Figure: areas discovered by DRP with t-test: significance threshold = 0.05, maximum tree depth = 3. The colorbar shows significance.]

[D. Kontos, V. Megalooikonomou, D. Pokrajac, A. Lazarevic, Z. Obradovic, O. B. Boyko, J. Ford, F. Makedon, A. J. Saykin, MICCAI 2004]

Comparison of number of tests performed

Threshold  Depth  DRP   Voxel-wise
0.05       3      569   201774
0.05       4      4425  201774
0.01       4      4665  201774

Classification accuracy (%)

Method                        Criterion    Threshold  Tree depth  Controls  Patients  Total
DRP                           correlation  0.4        3           82        93        88
DRP                           t-test       0.05       3           89        100       94
DRP                           t-test       0.05       4           84        100       92
DRP                           t-test       0.01       4           87        100       93
DRP                           ranksum      0.05       3           87        100       93
DRP                           ranksum      0.05       4           80        100       90
DRP                           ranksum      0.01       4           87        96        91
Maximum Likelihood / EM       -            -          -           77        67        72
Maximum Likelihood / k-means  -            -          -           77        83        80
Kullback-Leibler / EM         -            -          -           79        57        68
Kullback-Leibler / k-means    -            -          -           77        66        71


Experimental Results

Impact:
• Assist in interpretation of images (e.g., facilitating diagnosis)
• Enable researchers to integrate, manipulate and analyze large volumes of image data

[Figure: discriminative sub-regions detected when applying (a) DRP and (b) voxel-wise analysis with ranksum test and significance threshold 0.05 to the real fMRI volume data]


Time Sequence Analysis

• Time series data abound in many applications …
• Challenges:
  – High dimensionality
  – Large number of sequences
  – Similarity metric definition
• Similarity analysis (e.g., find stocks similar to that of IBM)
• Goals: high accuracy, (high speed) in similarity searches among time series and in discovering interesting patterns
• Applications: clustering, classification, similarity searches, summarization

Time Sequence: A sequence (ordered collection) of real values: X = x1, x2,…, xn


Dimensionality Reduction Techniques

• DFT: Discrete Fourier Transform

• DWT: Discrete Wavelet Transform

• SVD: Singular Value Decomposition

• APCA: Adaptive Piecewise Constant Approximation

• PAA: Piecewise Aggregate Approximation

• SAX: Symbolic Aggregate approXimation

• …


Similarity distances for time series

• Euclidean Distance: most common, sensitive to shifts

• Dynamic Time Warping (DTW): improves accuracy but slow: O(n^2)

• Envelope-based DTW: faster: O(n)

• A more intuitive idea: two series should be considered similar if they have enough non-overlapping time-ordered pairs of subsequences that are similar (Agrawal et al. VLDB, 1995)
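To make the contrast between the two basic distances concrete, here is a minimal sketch of Euclidean distance and the textbook quadratic-time DTW dynamic program. The function names and the toy series are illustrative; the envelope-based speedup is not shown.

```python
from math import sqrt, inf

def euclidean(x, y):
    """Euclidean distance between two equal-length sequences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dtw(x, y):
    """Dynamic Time Warping distance via the O(n*m) dynamic program."""
    n, m = len(x), len(y)
    # cost[i][j] = best cumulative squared cost aligning x[:i] with y[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],      # x[i-1] repeats a match
                                 cost[i][j - 1],      # y[j-1] repeats a match
                                 cost[i - 1][j - 1])  # both advance
    return sqrt(cost[n][m])

x = [0, 0, 1, 2, 1, 0, 0]
y = [0, 1, 2, 1, 0, 0, 0]   # same shape as x, shifted one step earlier

d_euclid = euclidean(x, y)
d_dtw = dtw(x, y)
```

On this pair the Euclidean distance penalizes the shift, while DTW warps the time axis and finds a zero-cost alignment.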


Partitioning – Piecewise Constant Approximations

• Original time series (n points)

• Piecewise Constant Approximation (PCA) or Piecewise Aggregate Approximation (PAA), [Yi and Faloutsos ’00, Keogh et al., ’00] (n' segments)

• Adaptive Piecewise Constant Approximation (APCA), [Keogh et al., ’01] (n" segments)
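A minimal sketch of PAA: the series is cut into equal-length segments and each segment is replaced by its mean. The function name and the divisibility restriction are assumptions made to keep the example short.

```python
def paa(series, n_segments):
    """Piecewise Aggregate Approximation: mean of each equal-length segment.

    Assumes len(series) is divisible by n_segments, to keep the sketch short.
    """
    n = len(series)
    if n % n_segments != 0:
        raise ValueError("series length must be a multiple of n_segments")
    width = n // n_segments
    return [sum(series[k:k + width]) / width for k in range(0, n, width)]

series = [1, 3, 2, 4, 10, 12, 11, 13]   # n = 8 points
approx = paa(series, 4)                 # n' = 4 segments
```

APCA differs in that the segment boundaries adapt to the data, so segments may have different lengths.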


Multiresolution Vector Quantized approximation (MVQ)

Partitions a sequence into equal-length segments and uses VQ to represent each sequence by appearance frequencies of key-subsequences

1) Uses a ‘vocabulary’ of subsequences (codebook): training is involved

2) Takes multiple resolutions into account: keeps both local and global information

3) Unlike wavelets, partially ignores the ordering of ‘codewords’

4) Can exploit prior knowledge about the data

5) Employs a new distance metric

[V. Megalooikonomou, Q. Wang, G. Li, C. Faloutsos, ICDE 2005]


Methodology

[Figure: codebook generation (s = 16), series transformation, and series encoding; the diagram shows series rewritten as strings of codewords (c m d b c a i f …) and as vectors of codeword counts (1 1 2 1 0 0 …); dimension labels: s, l]


Methodology

Creating a ‘vocabulary’

Frequently appearing patterns in subsequences

Output: A codebook with s codewords

Q: How to create?

A: Use Vector Quantization, in particular, the Generalized Lloyd Algorithm (GLA)

Representing time series

X = x1, x2,…, xn is encoded with a new representation f = (f1, f2,…, fs)

(fi is the frequency of the i-th codeword in X)
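The two steps above (train a codebook, then count codeword occurrences) can be sketched as below. This is an illustrative toy, not the authors' implementation: the helper names are hypothetical, the codebook is seeded with a hand-picked `initial_codebook` (GLA is often initialized differently, e.g. by splitting), and segments are taken as non-overlapping windows.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def nearest(codebook, segment):
    """Index of the codeword closest to the segment."""
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i], segment))

def segments(series, length):
    """Split a series into consecutive, non-overlapping equal-length segments."""
    return [tuple(series[k:k + length])
            for k in range(0, len(series) - length + 1, length)]

def gla(training_segments, initial_codebook, iterations=20):
    """Generalized Lloyd Algorithm: alternate assignment and centroid update."""
    codebook = [tuple(c) for c in initial_codebook]
    for _ in range(iterations):
        cells = [[] for _ in codebook]
        for seg in training_segments:
            cells[nearest(codebook, seg)].append(seg)
        # Replace each codeword by the centroid of its cell (keep it if empty).
        codebook = [
            tuple(sum(col) / len(cell) for col in zip(*cell)) if cell else old
            for cell, old in zip(cells, codebook)
        ]
    return codebook

def encode(series, codebook, length):
    """Frequency vector f = (f1, ..., fs): count of each codeword in the series."""
    f = [0] * len(codebook)
    for seg in segments(series, length):
        f[nearest(codebook, seg)] += 1
    return f

# Toy training data (hypothetical): flat-low, rising, and flat-high patterns.
training = segments([0, 0, 1, 2, 0, 0, 1, 2, 5, 5], 2)
codebook = gla(training, initial_codebook=[(0, 0), (1, 1), (4, 4)])
f = encode([0, 0, 1, 2, 1, 2, 5, 5], codebook, 2)
```

Here s = 3 and the encoded series is described only by how often each pattern occurs, not by where it occurs.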


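Methodology

New distance metric:

The histogram model is used to calculate similarity at each resolution level:

$$S_{HM}(q, t) = \frac{1}{1 + dis(q, t)}$$

with

$$dis(q, t) = \sum_{i=1}^{s} \frac{|f_{i,t} - f_{i,q}|}{1 + f_{i,t} + f_{i,q}}$$

[Figure: histograms of the codeword frequencies f_{i,t} and f_{i,q} over codewords 1, 2, ..., s]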
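A small numeric sketch of the histogram model, assuming the distance has the form dis(q,t) = Σ_i |f_{i,t} − f_{i,q}| / (1 + f_{i,t} + f_{i,q}) and the similarity is S_HM(q,t) = 1 / (1 + dis(q,t)). The frequency vectors are made-up values.

```python
def dis(f_q, f_t):
    """Histogram distance between two codeword-frequency vectors."""
    return sum(abs(ft - fq) / (1 + ft + fq) for fq, ft in zip(f_q, f_t))

def s_hm(f_q, f_t):
    """Histogram-model similarity: 1 for identical histograms, smaller otherwise."""
    return 1.0 / (1.0 + dis(f_q, f_t))

# Hypothetical frequency vectors for a query q and a series t (s = 4 codewords).
f_q = [2, 0, 1, 3]
f_t = [2, 1, 1, 0]

d = dis(f_q, f_t)        # 0 + 1/2 + 0 + 3/4
sim = s_hm(f_q, f_t)
```

The "1 +" in each denominator keeps a term finite when a codeword is absent from both series.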


Methodology

Time series summarization:
• High level information (frequently appearing patterns) is more useful
• The new representation can provide this kind of information

[Figure: example in which codewords (patterns) 3 and 5 each show up 2 times]


Methodology

Problems of frequency-based encoding:

• It is hard to define an appropriate resolution (codeword length)

• It may lose global information


Methodology

Solution: Use multiple resolutions. This addresses both problems of frequency-based encoding:

• It is hard to define an appropriate resolution (codeword length)

• It may lose global information


Methodology

Proposed distance metric:

Weighted sum of similarities at all resolution levels:

$$S_{HHM}(q, d_j) = \sum_{i=1}^{c} w_i \cdot S_{HM,i}(q, d_j)$$

where S_{HM,i}(q, d_j) is the similarity at level i and c is the number of resolution levels

• Lacking any prior knowledge, equal weights for all resolution levels work well most of the time
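The weighted combination can be sketched as follows, assuming the per-level similarity is the histogram model S_HM = 1 / (1 + dis) with dis(q,t) = Σ_i |f_{i,t} − f_{i,q}| / (1 + f_{i,t} + f_{i,q}). The two-level frequency vectors and the weight vectors are invented for illustration.

```python
def s_hm(f_q, f_t):
    """Histogram-model similarity at one resolution level."""
    d = sum(abs(ft - fq) / (1 + ft + fq) for fq, ft in zip(f_q, f_t))
    return 1.0 / (1.0 + d)

def s_hhm(levels_q, levels_t, weights):
    """Weighted sum of the per-level similarities over all c levels."""
    return sum(w * s_hm(fq, ft)
               for w, fq, ft in zip(weights, levels_q, levels_t))

# Hypothetical encodings at c = 2 levels: identical at the coarse level,
# different at the finer level.
levels_q = [[4, 0], [2, 0, 1, 3]]
levels_t = [[4, 0], [2, 1, 1, 0]]

equal_weights = s_hhm(levels_q, levels_t, [1, 1])
coarse_only = s_hhm(levels_q, levels_t, [1, 0])
```

A weight vector such as [1, 0] reduces the measure to single-level VQ, which here cannot tell the two series apart.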


MVQ: Example of Codebooks

• Codebook for the first level

• Codebook for the second level (more codewords since there are more details)


Experiments

Datasets
• SYNDATA (control chart data): synthetic
• CAMMOUSE: 3 * 5 sequences obtained using the Camera Mouse Program
• RTT: RTT measurements from UCR to CMU with sending rate of 50 msec for a day


Experiments

Best Match Searching:

Matching accuracy: % of knn’s (found by different approaches) that are in the same class

$$\text{Accuracy} = \frac{|std\_set(q) \cap knn(q)|}{k} \times 100\%$$
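A minimal sketch of this measure, assuming Accuracy = |std_set(q) ∩ knn(q)| / k × 100%, where knn(q) holds the k nearest neighbors returned for query q and std_set(q) is taken to be the set of sequences in the same class as q. The sequence ids are hypothetical.

```python
def matching_accuracy(knn, std_set, k):
    """Percentage of the k retrieved neighbors that belong to the standard set."""
    return 100.0 * len(set(knn[:k]) & set(std_set)) / k

knn_of_q = ["s3", "s7", "s1", "s9", "s4"]   # hypothetical ids, nearest first
std_set_q = {"s1", "s3", "s4", "s8"}        # hypothetical same-class ids

acc = matching_accuracy(knn_of_q, std_set_q, k=5)
```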


Experiments

Best Match Searching

SYNDATA

Method           Weight vector  Accuracy
Single-level VQ  [1 0 0 0 0]    0.55
Single-level VQ  [0 1 0 0 0]    0.70
Single-level VQ  [0 0 1 0 0]    0.65
Single-level VQ  [0 0 0 1 0]    0.48
Single-level VQ  [0 0 0 0 1]    0.46
MVQ              [1 1 1 1 1]    0.83
Euclidean        -              0.51

CAMMOUSE

Method           Weight vector  Accuracy
Single-level VQ  [1 0 0 0 0]    0.56
Single-level VQ  [0 1 0 0 0]    0.60
Single-level VQ  [0 0 1 0 0]    0.44
Single-level VQ  [0 0 0 1 0]    0.56
Single-level VQ  [0 0 0 0 1]    0.60
MVQ              [1 1 1 1 1]    0.83
Euclidean        -              0.58


Experiments

Best Match Searching

[Figure: precision-recall for different methods (a) on SYNDATA dataset (b) on CAMMOUSE dataset; the MVQ curve is labeled in both panels]


Experiments

Clustering experiments

Given two clusterings, G = G1, G2, …, Gk (the true clusters), and A = A1, A2, …, Ak (clustering result by a certain method), the clustering accuracy is evaluated with the cluster similarity defined as:

$$Sim(G, A) = \frac{\sum_{i} \max_{j} Sim(G_i, A_j)}{k}$$

with

$$Sim(G_i, A_j) = \frac{2\,|G_i \cap A_j|}{|G_i| + |A_j|}$$

[Gavrilov, M., Anguelov, D., Indyk, P. and Motwani, R., KDD 2000]
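A short sketch of this measure, assuming Sim(G_i, A_j) = 2|G_i ∩ A_j| / (|G_i| + |A_j|) and Sim(G, A) = (Σ_i max_j Sim(G_i, A_j)) / k, with clusters represented as Python sets. The example clusterings are made up.

```python
def sim_pair(g, a):
    """Overlap between one true cluster and one found cluster (1.0 if identical)."""
    return 2 * len(g & a) / (len(g) + len(a))

def cluster_similarity(G, A):
    """Average, over true clusters, of the best match among found clusters."""
    return sum(max(sim_pair(g, a) for a in A) for g in G) / len(G)

# Hypothetical clusterings of six items.
G = [{1, 2, 3}, {4, 5, 6}]          # true clusters
A = [{1, 2}, {3, 4, 5, 6}]          # result with item 3 misplaced

score = cluster_similarity(G, A)
```

The measure is not symmetric in general, since the maximum is taken per true cluster.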


Experiments

Clustering experiments

SYNDATA

Method           Weight vector  Accuracy
Single-level VQ  [1 0 0 0 0]    0.69
Single-level VQ  [0 1 0 0 0]    0.71
Single-level VQ  [0 0 1 0 0]    0.63
Single-level VQ  [0 0 0 1 0]    0.51
Single-level VQ  [0 0 0 0 1]    0.49
MVQ              [1 1 1 1 1]    0.82
DFT              -              0.67
SAX              -              0.65
DTW              -              0.80
Euclidean        -              0.55

RTT

Method           Weight vector  Accuracy
Single-level VQ  [1 0 0 0 0]    0.55
Single-level VQ  [0 1 0 0 0]    0.52
Single-level VQ  [0 0 1 0 0]    0.57
Single-level VQ  [0 0 0 1 0]    0.80
Single-level VQ  [0 0 0 0 1]    0.79
MVQ              [0 0 0 1 1]    0.81
DFT              -              0.54
SAX              -              0.54
DTW              -              0.62
Euclidean        -              0.50


MVQ: Example: Two Time Series

• Given two time series t1 and t2 as follows:

[Figure: the two time series t1 and t2]

• In the first level, they are encoded with the same codeword (3), so they are not distinguishable

• In the second level, more details are recorded. These two series have different encoded forms: the first series is encoded with codewords 1 and 4, the second one is encoded with codewords 9 and 12.


Analysis of images by projection to 1D

• Hilbert Space Filling Curve
• Binning
• Statistical tests of significance on groups of points
• Identification of discriminative areas by back-projection

[Figure: (a) linear mapping of a 3D fMRI scan, (b) effect of binning by representing each bin with its Vmean measurement, (c) the discriminative voxels after applying the t-test with θ=0.05]

[D. Kontos, V. Megalooikonomou, N. Ghubade, and C. Faloutsos. IEEE Engineering in Medicine and Biology Society (EMBS), 2003]
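The first two steps can be sketched in 2D, which is a simplification of the 3D case described here. The index-to-coordinate conversion is the standard iterative Hilbert curve construction; the function names, the toy image, and the bin size are assumptions for illustration.

```python
def d2xy(n, d):
    """Map position d along a Hilbert curve to (x, y) on an n x n grid.

    n must be a power of two; d ranges over 0 .. n*n - 1.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                      # rotate/flip the current quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def hilbert_linearize(image):
    """Read a square image (side a power of two) in Hilbert-curve order."""
    n = len(image)
    return [image[y][x] for x, y in (d2xy(n, d) for d in range(n * n))]

def bin_means(values, bin_size):
    """Represent each run of bin_size consecutive values by its mean."""
    return [sum(values[k:k + bin_size]) / len(values[k:k + bin_size])
            for k in range(0, len(values), bin_size)]

# Toy 4x4 image (hypothetical): signal only in the quadrant with x < 2, y < 2.
image = [[1 if (x < 2 and y < 2) else 0 for x in range(4)] for y in range(4)]
path = [d2xy(4, d) for d in range(16)]
linear = hilbert_linearize(image)
bins = bin_means(linear, 4)
```

Because consecutive curve positions are spatial neighbors, each bin covers a compact area, so a significant bin can be projected back to a localized region of the image.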


Applying time series techniques

[Figure: areas discovered: (a) θ=0.05, (b) θ=0.01. The colorbar shows significance.]

Results: 87%-98% classification accuracy (t-test, CATX)

Variation: Concatenate the values of statistically significant areas → spatial sequences

• Pattern analysis using the similarity between spatial sequences and time sequences

• SVD, DFT, DWT, PCA (clustering accuracy: 89-100%)

[Q. Wang, D. Kontos, G. Li and V. Megalooikonomou, ICASSP 2004]


Conclusions

• ‘Find patterns/interesting things’ efficiently and robustly in spatial and temporal data
• Use of partitioning and clustering
• Analysis at multiple resolutions
• Reduction of the number of tests performed
• Intelligent exploration of the space to find discriminative areas
• Reduction of dimensionality
• Symbolic representation
• Nice summarization


Collaborators

Faculty:
• Zoran Obradovic
• Orest Boyko
• James Gee
• Andrew Saykin
• Christos Faloutsos
• Christos Davatzikos
• Edward Herskovits
• Fillia Makedon
• Dragoljub Pokrajac

Students:
• Despina Kontos
• Qiang Wang
• Guo Li

Others:
• James Ford
• Alexandar Lazarevic


Thank you!

Acknowledgements

This research has been funded by:

– National Science Foundation CAREER award 0237921

– National Science Foundation Grant 0083423

– National Institutes of Health Grant R01 MH68066 funded by NIMH, NINDS, and NIA