Distances between Data Sets Based on Summary Statistics

Nikolaj Tatti, JMLR, 01/2007
Presented by Yuting Qi, ECE Dept., Duke Univ., 02/02/2007
Machine Learning Paper Reading Series


Page 1: Distances between Data Sets Based on Summary Statistics

Distances between Data Sets Based on Summary Statistics

Nikolaj Tatti, JMLR, 01/2007

Presented by Yuting Qi, ECE Dept., Duke Univ., 02/02/2007

Machine Learning Paper Reading Series

Page 2: Distances between Data Sets Based on Summary Statistics

Introduction

• Goal:
– Define a dissimilarity measure, the constrained minimum (CM) distance, between two data sets D1 and D2 by comparing summary statistics of the data sets.

• Requirements:
– It should be a metric.
– It should take the statistical nature of the data into account.
– It should be fast to evaluate.

Page 3: Distances between Data Sets Based on Summary Statistics

The Constrained Minimum (CM) Distance 1/5

• Definition:
– Basic notation:
• D: data set, a finite collection of samples in Ω.
• Ω: finite sample space; |Ω| is the number of elements in Ω.
• S: feature function, S: Ω → R^N, known or learned.
• θ: frequency, θ ∈ R^N, the average value of S over D, θ = S(D) = (1/|D|) Σ_{ω ∈ D} S(ω).

Example:

Ω = {A, B, C}, D1 = (C, C, C, A), D2 = (C, A, B, A)

If the only feature of interest is the proportion of C in the data set, then the feature function S is S(ω) = 1 if ω = C and S(ω) = 0 otherwise, which gives

S(D1) = 3/4, S(D2) = 1/4
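The frequencies in this example can be reproduced in a few lines of Python. This is a minimal sketch: the names `feature_c` and `frequency` are illustrative and do not come from the slides, and S is assumed to be the indicator of the symbol C.

```python
from fractions import Fraction

def feature_c(sample):
    """Assumed feature function S: 1 if the sample is 'C', else 0."""
    return 1 if sample == "C" else 0

def frequency(feature, data):
    """Average value of the feature over the data set, i.e. S(D)."""
    return Fraction(sum(feature(x) for x in data), len(data))

D1 = ("C", "C", "C", "A")
D2 = ("C", "A", "B", "A")

theta1 = frequency(feature_c, D1)
theta2 = frequency(feature_c, D2)
print(theta1, theta2)  # 3/4 1/4
```

Exact fractions are used only so that the output matches the slide's 3/4 and 1/4 without rounding.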

Page 4: Distances between Data Sets Based on Summary Statistics

The Constrained Minimum (CM) Distance 2/5

– Constrained set of distributions: C+(S, θ) = {p ∈ P : E_p[S] = θ}

P is the set of all distributions defined on Ω, and θ is calculated from the given data set. We estimate statistics from a given data set, then examine the distributions that can produce such statistics.

– An alternative definition of C+(S, θ): C+(S, θ) = {u ∈ P : Σ_i S(i) u_i = θ}

If we identify Ω with {1, 2, …, |Ω|}, P is the set of vectors u in R^|Ω| with non-negative elements that sum to 1, where u_i = p(i).

– Constrained space: C(S, θ) = {u ∈ R^|Ω| : Σ_i S(i) u_i = θ, Σ_i u_i = 1}

Given θ1 and θ2, C(S, θ1) and C(S, θ2) are parallel.
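To make the two sets concrete, the sketch below tests whether a vector u belongs to the constrained set C+(S, θ) or to the constrained space C(S, θ). It rests on assumptions that the extracted slide text does not spell out: S is stored as an N × |Ω| matrix, and the constrained space keeps the sum-to-one condition but drops non-negativity. The second assumption matches the later illustration, where C is a line and C+ a segment. All function names are hypothetical.

```python
import numpy as np

OMEGA = ("A", "B", "C")            # sample space, indexed 0..2
S = np.array([[0.0, 0.0, 1.0]])    # N x |Omega| matrix; one row: indicator of C (assumed)

def empirical_distribution(data):
    """Vector u with u_i = fraction of samples in data equal to OMEGA[i]."""
    return np.array([data.count(w) / len(data) for w in OMEGA])

def in_constrained_space(u, theta, tol=1e-12):
    """u sums to 1 and reproduces the frequencies theta (signs unrestricted)."""
    return bool(abs(u.sum() - 1.0) < tol and np.allclose(S @ u, theta, atol=tol))

def in_constrained_set(u, theta, tol=1e-12):
    """As above, but u must also be a distribution (non-negative entries)."""
    return in_constrained_space(u, theta, tol) and bool(np.all(u >= -tol))

u1 = empirical_distribution(("C", "C", "C", "A"))   # empirical distribution of D1
v = np.array([0.5, -0.25, 0.75])                    # sums to 1, but has a negative entry

print(in_constrained_set(u1, [0.75]))    # True
print(in_constrained_space(v, [0.75]))   # True
print(in_constrained_set(v, [0.75]))     # False
```

The vector v shows the difference between the two: it satisfies the linear constraints but is not a probability distribution.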

Page 5: Distances between Data Sets Based on Summary Statistics

The Constrained Minimum (CM) Distance 3/5

• Illustration:

[Figure: the set P drawn as a triangle with vertices A, B, C]

Example:

Ω = {A, B, C}, D1 = (C, C, C, A), D2 = (C, A, B, A)

The feature function S is the indicator of C, so

S(D1) = 0.75, S(D2) = 0.25

P is the triangle, and the set of vectors whose elements sum to 1 is a plane.

Then C(S, 0.75) and C(S, 0.25) are parallel lines.

The constrained sets of distributions C+(S, 0.75) and C+(S, 0.25) are the segments of these lines that lie inside the triangle.

Motivation:

A natural way to measure the distance between two parallel spaces is to find the shortest distance between a point in one space and a point in the other.

Page 6: Distances between Data Sets Based on Summary Statistics

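The Constrained Minimum (CM) Distance 4/5

• CM distance
– Pick a vector from each constrained space: u1 ∈ C(S, θ1), u2 ∈ C(S, θ2).
– The CM distance between D1 and D2 is d_CM(D1, D2 | S) = sqrt(|Ω|) · min ||u1 − u2||_2, where the minimum is taken over all such u1 and u2.

• Theorem 1: If Cov[S] is invertible, then d_CM(D1, D2 | S)^2 = (θ1 − θ2)^T Cov^{-1}[S] (θ1 − θ2), where Cov[S] is the covariance of S under the uniform distribution on Ω.

• Computation time:
– |Ω| can be very large, but O(N^3) time is feasible.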
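The formulas on this slide did not survive extraction, so the sketch below works from two stated assumptions rather than from the slide itself. The first is that the CM distance is sqrt(|Ω|) times the smallest Euclidean distance between the two constrained spaces. The second is that Theorem 1 gives the closed form (θ1 − θ2)^T Cov^{-1}[S] (θ1 − θ2) for the squared distance, with Cov[S] taken under the uniform distribution on Ω and assumed invertible. The function names are hypothetical. Computing both quantities for the running example is a useful consistency check.

```python
import numpy as np

def cm_distance_sq(S, theta1, theta2):
    """Squared CM distance via the assumed closed form
    (theta1 - theta2)^T Cov[S]^{-1} (theta1 - theta2),
    with Cov[S] taken under the uniform distribution on Omega."""
    S = np.atleast_2d(np.asarray(S, dtype=float))          # N x |Omega|
    mean = S.mean(axis=1, keepdims=True)
    cov = (S - mean) @ (S - mean).T / S.shape[1]           # N x N
    diff = np.atleast_1d(np.asarray(theta1, float) - np.asarray(theta2, float))
    return float(diff @ np.linalg.solve(cov, diff))        # N x N solve, O(N^3)

def cm_distance_sq_geometric(S, theta1, theta2):
    """Same quantity from the assumed geometric definition: |Omega| times the
    squared minimum Euclidean distance between the two constrained spaces."""
    S = np.atleast_2d(np.asarray(S, dtype=float))
    A = np.vstack([S, np.ones(S.shape[1])])                # constraints: S u = theta, sum(u) = 1
    diff = np.atleast_1d(np.asarray(theta1, float) - np.asarray(theta2, float))
    b = np.append(diff, 0.0)                               # v = u1 - u2 satisfies A v = b
    v = np.linalg.pinv(A) @ b                              # minimum-norm solution of A v = b
    return float(S.shape[1] * (v @ v))

S = [[0.0, 0.0, 1.0]]                                      # indicator of C on Omega = {A, B, C}
closed_form = cm_distance_sq(S, 0.75, 0.25)
geometric = cm_distance_sq_geometric(S, 0.75, 0.25)
print(closed_form, geometric)
```

For this example both routes give a squared distance of 1.125 up to floating-point rounding. The closed form only needs an N × N linear solve, which is why its cost depends on the number of features N and not on |Ω|.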

Page 7: Distances between Data Sets Based on Summary Statistics

The Constrained Minimum (CM) Distance 5/5

• Properties:

Page 8: Distances between Data Sets Based on Summary Statistics

CM Distance and Binary Data Sets 1/2

• Basic definitions:
– Sample space: Ω = {0,1}^K, the set of binary vectors ω = (ω1, …, ωK).
– Itemset: B ⊆ {a1, …, aK}, where ai corresponds to the i-th dimension.
– Boolean formula S: Ω → {0,1}
• Conjunction function S_B:
– S_B(ω) = ω_i1 ∧ ω_i2 ∧ … ∧ ω_iL, given itemset B = {a_i1, …, a_iL}
• Parity function T_B:
– T_B(ω) = ω_i1 ⊕ ω_i2 ⊕ … ⊕ ω_iL (⊕: XOR)
– Given a collection of itemsets F = {B1, …, BN}, we have S_F = (S_B1, …, S_BN) and T_F = (T_B1, …, T_BN).
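The conjunction and parity functions are easy to evaluate on a binary data matrix. The following is a minimal sketch with a made-up data set. The names `conjunction`, `parity` and `frequencies` are illustrative, and itemsets are represented as tuples of 0-based column indices, which is a convention chosen here and not one from the slides.

```python
import numpy as np

def conjunction(data, itemset):
    """S_B: 1 for each row where every column in the itemset is 1."""
    return np.all(data[:, list(itemset)] == 1, axis=1).astype(int)

def parity(data, itemset):
    """T_B: XOR (sum modulo 2) of the columns in the itemset, per row."""
    return data[:, list(itemset)].sum(axis=1) % 2

def frequencies(feature, data, family):
    """Frequency vector theta: the mean of each feature over the data set."""
    return np.array([feature(data, B).mean() for B in family])

# Hypothetical binary data set with three dimensions (columns a1, a2, a3).
D = np.array([[1, 1, 0],
              [1, 0, 1],
              [1, 1, 1],
              [0, 0, 0]])
F = [(0,), (1,), (0, 1)]   # itemsets {a1}, {a2}, {a1, a2}

theta_S = frequencies(conjunction, D, F)   # itemset supports
theta_T = frequencies(parity, D, F)        # parity frequencies
print(theta_S)  # [0.75 0.5  0.5 ]
print(theta_T)  # [0.75 0.5  0.25]
```

For single-item itemsets the two functions coincide. They differ on {a1, a2}: the conjunction counts rows where both items are 1, while the parity counts rows where exactly one of them is 1.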

Page 9: Distances between Data Sets Based on Summary Statistics

CM Distance and Binary Data Sets 2/2

• The CM distance can be calculated in O(N) time, assuming θ1 and θ2 are known.
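One way to see why a linear-time evaluation is plausible: under the uniform distribution on {0,1}^K, parity functions of distinct, non-empty itemsets are uncorrelated and each has variance 1/4, so their covariance matrix is I/4. If the squared distance has the quadratic form (θ1 − θ2)^T Cov^{-1}[T_F] (θ1 − θ2), it then collapses to 4·||θ1 − θ2||^2. Neither the quadratic form nor the use of parity frequencies for θ1 and θ2 is stated on this slide, so both are assumptions of the sketch below, which checks the covariance claim numerically. The function names are hypothetical.

```python
import itertools
import numpy as np

def parity_matrix(K, family):
    """N x 2^K matrix whose row j evaluates T_Bj on every point of {0,1}^K."""
    omega = np.array(list(itertools.product([0, 1], repeat=K)))   # 2^K x K
    return np.array([omega[:, list(B)].sum(axis=1) % 2 for B in family])

def cm_distance_parity(theta1, theta2):
    """O(N) evaluation, assuming Cov[T_F] = I/4 under the uniform distribution."""
    diff = np.asarray(theta1, float) - np.asarray(theta2, float)
    return 2.0 * float(np.sqrt(diff @ diff))

K = 3
F = [(0,), (1,), (0, 1), (0, 1, 2)]             # distinct, non-empty itemsets
T = parity_matrix(K, F)
mean = T.mean(axis=1, keepdims=True)
cov = (T - mean) @ (T - mean).T / T.shape[1]    # covariance under the uniform distribution

print(np.allclose(cov, 0.25 * np.eye(len(F))))  # True
```

Enumerating all 2^K points is done here only to verify the covariance on a small case. The distance function itself touches each of the N frequencies once.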

Page 10: Distances between Data Sets Based on Summary Statistics

CM Distance and Event Sequences 1/1

• Transform a sequence s into a binary data set D: given a window length k, pick a window in s and transform it into a binary vector of length |Ω| (the size of the alphabet) by setting an element to 1 if the corresponding symbol occurs in the window.

• Choose a collection F that represents the statistics of the sequence s; a popular choice is episodes.

• Given the transformed data sets D1 and D2 and the collection F, the CM distance between s1 and s2 is defined as the CM distance between D1 and D2 with respect to F.
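The window transformation can be sketched as follows. The slide only says to "pick a window"; this example assumes every window of length k is used, sliding one position at a time, which is one possible reading. The function name `sequence_to_binary` is hypothetical.

```python
import numpy as np

def sequence_to_binary(sequence, alphabet, k):
    """Slide a window of length k over the sequence; each window becomes a
    0/1 vector marking which alphabet symbols occur in it."""
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    rows = []
    for start in range(len(sequence) - k + 1):
        row = np.zeros(len(alphabet), dtype=int)
        for symbol in sequence[start:start + k]:
            row[index[symbol]] = 1
        rows.append(row)
    return np.array(rows)

# Windows of "abcab" with k = 2 are: ab, bc, ca, ab
D = sequence_to_binary("abcab", alphabet="abc", k=2)
print(D.tolist())  # [[1, 1, 0], [0, 1, 1], [1, 0, 1], [1, 1, 0]]
```

Each row of D is one window, so two sequences over the same alphabet yield two binary data sets with the same columns, which can then be compared like any other pair of binary data sets.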

Page 11: Distances between Data Sets Based on Summary Statistics

Empirical Tests

• 7 data sets:
– Bible, Addresses, Beatles, 20Newsgroups, TopGenres, TopDecades, Abstract
– Compare the CM distance to a base distance.
– Clustering experiments using different algorithms based on the CM distance.

Page 12: Distances between Data Sets Based on Summary Statistics

Empirical Tests

Page 13: Distances between Data Sets Based on Summary Statistics

Conclusions & Discussion

• The CM distance has nice statistical properties and can be evaluated efficiently.

• It properly takes into account the correlation between features.

• For many types of feature functions, the CM distance is fast to compute.

• The performance of the CM distance depends heavily on the data set.