towards compression-based information retrieval daniele cerra symbiose seminar, rennes, 18.11.2010

50
Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

Upload: olivia-gregory

Post on 16-Jan-2016

217 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

Towards Compression-based Information Retrieval

Daniele Cerra

Symbiose Seminar, Rennes, 18.11.2010

Page 2: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

2

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Scope

PatternMatching

ShannonInformation

Theory

Data

Compression

AlgorithmicKL-divergence

Dictionary-basedSimilarity

Grammar-basedSimilarity

Content-basedImage Retrieval

Classification,Clustering,Detection

ComplexityEstimation for

AnnotatedDatasets

TheoryMethods

Applications

AlgorithmicInformation

Theory

Main Contributions

Page 3: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

3

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Outline

The core: Compression-based similarity measures (CBSM)

Theory: a “Complex” Web

Speeding up CBSM

Applications and Experiments

Conclusions and Perspectives

Page 4: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

4

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Outline

The core: Compression-based similarity measures (CBSM)

Theory: a “Complex” Web

Speeding up CBSM

Applications and Experiments

Conclusions and Perspectives

Page 5: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

5

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Compression-based Similarity Measures

Most well-known: Normalized Compression Distance (NCD) General Distance between any two strings x and y

Similarity metric under some assumptions

Basically parameter-free Applicable with any off-the-shelf compressor (such as Gzip) If two objects compress better together than separately, it means they share

common patterns and are similar

C(y)}max {C(x),

(y)}min{C(x),CC(x,y)yxNCD

),(

x

y

Coder

Coder

Coder

C(x)

C(y)

C(xy) NCD

Li, M. et al., “The similarity metric”, IEEE Tr. Inf. Theory, vol. 50, no. 12, 2004

Page 6: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

6

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Applications of CBSM

Clustering and classification of: Texts Music DNA genomes Chain letters Images Time Series …

Rodents

Primates

LandslidesExplosions

DNA Genomes

Seismic signalsSatellite images

Page 7: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

7

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Assessment and discussion of NCD results

Results obtained by NCD often outperform state-of-the-art methods Comparisons with 51 other distances*

But NCD-like measures have always been applied to restricted datasets Size < 100 objects in the main papers on the topic All information retrieval systems use at least thousands of objects More thorough experiments are required

NCD is too slow to be applied on a large dataset 1 second (on a 2.65 GHz machine) to process 10 strings of 10 KB each and

output 5 distances Being NCD data-driven, the full data has to be processed again and again to

compute each distance from a given object The price to pay for a parameter-free approach is that a compact

representation of the data in any explicit parameter space is not allowed

* Keogh, E., Lonardi, S. & Ratanamahatana, C. “Towards Parameter-free Data Mining”, SIGKDD 2004.

Page 8: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

8

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Outline

The core: Compression-based similarity measures (CBSM)

Theory: a “Complex” Web

Speeding up CBSM

Applications and Experiments

Conclusions and Perspectives

Page 9: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

9

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

A “Complex” Web

Classic Information Theory

Algorithmic Information Theory

Data Compression

Alg. Mutual Information

K(x)

Compression

H(X)

NCD

NID

Shannon Mutual Information

How to quantify information?

Page 10: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

10

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Shannon Entropy

Information content of the output of a random variable X

Example: Entropy of the outcomes of the toss of a biased/unbiased coin Max H(X) -> Coin not biased

Every toss carries a full bit of information!

Note: H(X) can be (much) greater than 1 if the values that X can take are more than two!

x

xpxpXH )(log)()(

Page 11: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

11

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Entropy & Compression: Shannon-Fano Code

x

xpxpXH )(log)()(

xP(x)

Symbol A B C D E

Code 00 01 10 110 111

Bits per Symbol in average

Compression is achieved!

X={A,B,C,D,E} We should use 3 bits per symbol to encode the outcomes of X

A B C D E2/5 1/5 1/5 1/10 1/10

2.210

1

10

13

5

1

5

1

5

22

Page 12: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

12

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

The probabilistic approach lacuna

H(X) is related to the probability density function of a data source X

It can’t tell the amount of information contained in an isolated object!

Algorithmic Information Theory comes to the rescue…

s={I_carry_important_information}

What is the information content of s? Source S: unknown

Page 13: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

13

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Algorithmic Information Theory: Kickoff

3 parallel definitions:

R. J. Solomonoff, “A formal theory of inductive inference”, 1964. A. N. Kolmogorov, “Three approaches to the quantitative definition of

information”, 1965. G. J. Chaitin, “On the length of programs for computing finite binary

sequences”, 1966.

Page 14: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

14

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Kolmogorov Complexity

Known also as: Kolmogorov-Chaitin complexity Algorithmic complexity

Shortest program q that outputs the string x and halts on an universal Turing machine

K(x) is a measure of the computational resources needed to specify the object x

K(x) is uncomputable

qxKQxq

min)(

Page 15: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

15

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Kolmogorov Complexity

A string with low complexity

001001001001001001001001001001001001001001 13 x (Write 001)

A string with high complexity 0100110111100100111011100011001001011100101 Write 0100110111100100111011100011001001011100101

qxKQxq

min)(

Page 16: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

16

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Two approaches to information content

Probabilistic (classic) AlgorithmicVS.

Information UncertaintyShannon Entropy

Information ComplexityKolmogorov Complexity

x

xpxpXH )(log)()( qxKQxq

min)(

Related to a single object x Length of the shortest program q among Qx

programs which outputs the finite binary string x and halts on a Universal Turing Machine

Measures how difficult it is to describe x from scratch

Uncomputable

Related to a discrete random variable X on a finite alphabet A with a probability mass function p(x)

Measure of the average uncertainty in X Measures the average number of bits

required to describe X Computable if p(x) is known

Page 17: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

17

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Alg. Mutual Information

A “Complex” Web

How to measure the information shared between two objects?

Classic Information Theory

Algorithmic Information Theory

Data Compression

K(x)

Compression

H(X)

NCD

NID

Shannon Mutual Information

Page 18: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

18

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Probabilistic (classic) Algorithmic

VS.(Statistic) Mutual Information Algorithmic Mutual Information

Amount of computational resources shared by the shortest programs which output the strings x and y

The joint Kolmogorov complexity K(x,y) is the length of the shortest program which outputs x followed by y

Symmetric, non-negative If then

K(x,y) = K(x) + K(y) x and y are algorithmically independent

Measure in bits of the amount of information a random variable X has about another variable Y

The joint entropy H(X,Y) is the entropy of the pair (X,Y) with a joint distribution p(x,y)

Symmetric, non-negative If I(X;Y) = 0 then

H(X;Y) = H(X) + H(Y) X and Y are statistically independent

),()()();( YXHYHXHYXI ),()()():( yxKyKxKyxIw

yx ypxp

yxpyxpYXI

, )()(

),(log),();(

0):( yxIw

Shannon/Kolmogorov Parallelisms: Mutual Information

Page 19: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

19

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

A “Complex” Web

How to derive a computable similarity measure?

Classic Information Theory

Algorithmic Information Theory

Data Compression

Alg. Mutual Information

K(x)

Compression

H(X)

NCD

NID

Shannon Mutual Information

Page 20: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

20

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Compression: Approximating Kolmogorov Complexity

K(x) represents a lower bound for what an off-the-shelf compressor can achieve

Vitányi et al. suggest this approximation:

C(x) is the size of the file obtained by compressing x with a standard lossless compressor (such as Gzip)

)()( xCxK

AOriginal size: 65 KbCompressed size: 47

Kb

BOriginal size: 65 KbCompressed size: 2 Kb

Page 21: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

21

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Back to NCD

)}(),(min{

)}(),(max{),(),(

)}(),(min{

)}(),(max{),(),(

)()(

yCxC

yCxCyxCyxNCD

yKxK

yKxKyxKyxNID

xCxK

The size K(x) of the shortest program which outputs x is assimilated to the size C(x) of the compressed version of x

Normalized measure of the elements that a compressor may use twice when compressing two objects x and y

Derived from algorithmic mutual information

Normalized length of the shortest program that computes x knowing y, as well as computing y knowing x

Similarity metric minimizing any admissible metric

ComputableAlgorithmicNormalized Compression Distance

(NCD)Normalized Information Distance

(NID)

Page 22: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

22

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

A “Complex” Web

What if we add some more?

Classic Information Theory

Algorithmic Information Theory

Data Compression

Alg. Mutual Information

K(x)

Compression

H(X)

NCD

NID

Shannon Mutual Information

Occam’s razor

Page 23: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

23

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Occam’s Razor

William of Occam14th century

All other things being equal, the simplest solution is the best!

If two explanations exist for a given phenomenon, pick the one with the smallest Kolmogorov Complexity! Say I!

Maybe today William of Occam could say:

Page 24: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

24

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

An Example of Occam’s razor

The Copernican model (Ma) is simpler than the Ptolemaic (Mb).

Mb in order to explain the motion of Mercury relative to Venus, introduced the existence of epicycles in the orbits of the planets.

Ma accounts for this motion with no further explanation.

Finally, the model Ma was acknowledged as correct.

Note that we can assume: K(Ma) < K(Mb)

Page 25: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

25

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Outline

The core: Compression-based similarity measures (CBSM)

Theory: a “Complex” Web

Speeding up CBSM

Applications and Experiments

Conclusions and Perspectives

Page 26: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

26

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Evolution of CBSM 1993 Ziv & Merhav

First use of relative entropy to classify texts

2000 Frank et al., Khmelev First compression-based experiments on text categorization

2001 Benedetto et al. Intuitively defined compression-based relative entropy Caused a rise of interest in compression-based methods

2002 Watanabe et al. Pattern Representation based on Data Compression (PRDC) Dictionary-based First in classifying general data with a first step of conversion into strings Independent from IT concepts

2004 NCD Solid theoretical foundations (Algorithmic Information Theory)

2005-2006 Other similarity measures Keogh et al. (Compression-based Dissimilarity Measure), Chen & Li (Chen-Li Metric for DNA classification) Sculley & Brodley (Cosine Similarity) Differ from NCD only by their normalization factors - Sculley & Brodley (2006)

2008 Macedonas et al. Independent definition of dictionary distance

Page 27: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

27

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Pattern Recognition based on Data Compression (PRDC)

||

|)|(|),(

x

DyxyxPRDC

Distance of object x from object y

Length of string representing the object x

Length of string x coded with the dictionary D(y) extracted from y

Watanabe et al., 2002

PRDC equation is not normalized according to the complexity of x and skips the joint compression step.

Normalizing the equation used in PRDC, almost identical measure with NCD are obtained ( average difference on 400 measures)

PRDC can be inserted in the list of measures which differ by NCD only for the normalization factor (Sculley & Brodley, 2006)

Length of string representing the object x coded with the dictionary extracted from itself

Length of string x coded with the dictionary D(xy) extracted from x and y joint

|)|(|

|)|(|

Dxx

Dxyx

Distance between two objects encoded into strings x and y

)10( 4O

Page 28: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

28

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Grammar-based Approximation

A dictionary extracted from a string x in PRDC may be regarded as a model for x

To better approximate K(x), consider the smallest Context-Free Grammar (CFG) generating x. The grammar’s set of rules can be regarded as

the smallest dictionary and generative model for x.

: size of x of length N represented by its smallest context-free grammar G(x): number of rules contained in the grammar

xC)(xG

)(xK

Two-part complexity representation Model + data given the model (MDL-like) Complexity overestimations are intuitively accounted for and decreased in the second term

NCD with.. Intraclass Interclass Difference

Standard Compressor

1.02 1.11 0.09

Grammar approximation

0.83 0.96 0.13

Avg. distances on 40,000 measurements

Sample CFG G(z) for string z = {aaabaaacaaadaaaf}

Page 29: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

29

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Comparisons...

NormalizedPRDC

Page 30: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

30

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Comparisons...

NCD

Grammars

Bears far apart

Rodents scatteredRodents

Primates

Primates

Seals Divided

Note Using a specific compressor for DNA

genomes improves NCD results considerably

Page 31: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

31

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Drawbacks of the Introduced CBSM

Solution

Extract dictionaries from the data, possibly offline

Compare only those

Similarity Measure

Accuracy Speed

NCD

Algorithmic Kullback-Leibler

PRDC

Normalized PRDC

Grammar-based

??

How to combine accuracy and speed?

Comparable to NCD

Worse than NCD

Better than NCD

Page 32: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

32

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Outline

The core: Compression-based similarity measures (CBSM)

Theory: a “Complex” Web

Speeding up CBSM

Applications and Experiments

Conclusions and Perspectives

Page 33: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

33

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Images preprocessing: 1D encoding

Scanning Direction

jip ,1

jip ,1

jip ,

What about loss of textural information? Horizontal textural information is already implicit in the dictionaries Basic vertical interactions are stored for each pixel

Smooth / Rough: 1 bit of information Other solutions (e.g. Peano scanning) gave worse performances

Conversion to Hue Saturation Value (HSV) color space Scalar quantization

4 bits for HueHuman eye is more sensitive to changes

in hue 2 bits for Saturation 2 bits for Value HSV color space

Page 34: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

34

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Dictionary-based Distance: Dictionary Extraction

Convert each image to a string and extract meaningful patterns into dictionaries using an LZW-like compressor Unlike LZW, loose (or no) constraints on dictionary size, flexible alphabet size

Sort entries in the dictionaries in order to enable binary searches Store only the dictionary

CAAABB

..ABABBCA..AABBCA

Dictionary-based universal compression algorithm Improvement by Welch (1984) over the LZ78

compressor (Lempel & Ziv, 1978) Searches for matches between the text to be

compressed and a set of previously found strings contained in a dictionary

When a match is found, a substring is substituted by a code representing a pattern in the dictionary

LZW”TOBEORNOTTOBEORTOBEORNOT!”

Page 35: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

35

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Fast Compression Distance

)(

))(),(()(),(

xD

yDxDxDyxFCD

Consider two dictionaries to compute a distance between them as the difference between shared/not shared patterns

The joint compression step of NCD is now replaced by an inner join of two sets

NCD acts like a black box, FCD simplifies it by making dictionaries explicit Dictionaries have to be extracted only once (also offline)

Count(select * from (D(x)))

Count(select * from Inner_Join(D(x),D(y)))

Page 36: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

36

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

How fast is FCD with respect to NCD?

Operations needed for the joint compression steps (LZW-based NCD)

yx mm log

)log()( yxyx mmnn

))(),((),( yDxDyxFCD

),(),( yxCyxNCD

Further advantages If in the search a pattern gives a mismatch, ignore all extensions of that pattern

Ensured by LZW’s prefix-closure property

Ignore shortest patterns (regard them as noise) To reduce storage space, ignore all redundant patterns which are prefixes of others

No losses also ensured by LZW’s prefix-closure property

Complexity decreases by approx. one order of magnitude

xn

xm

n. elements in x

n. patterns in x

Page 37: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

37

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Datasets

164 video frames from the movie “Run, Lola, Run”

1500 digital photos and hand-drawn images

90 books of known Italian authors

10,200 photographs of objects pictured from 4 different points of view

144 infrared images of meadows some of which contain fawns

Corel

Lola

Liber Liber

Fawns & Meadows

Nister-Stewenius

Page 38: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

38

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Authorship Attribution

Author Texts SuccessesDante 8 8D’Annunzio 4 4Deledda 15 15Fogazzaro 5 5Guicciardini 6 6Machiavelli 12 10Manzoni 4 4Pirandello 11 11Salgari 11 11Svevo 5 5Verga 9 9TOTAL 90 88

90

91

92

93

94

95

96

97

98

Method

Acc

ura

cy%

Relative Entropy

Ziv-Merhav

NCD(zlib)

NCD(bzip2)

NCD(blocksort)

Algorithmic KL

DD

Classification accuracy: comparison with 6 other compression-based methods

0

50

100

150

200

250

300

350

400

450

500

1Method

Tim

e (s

)

NCD

Algorithmic KL

DD

Running times comparison for the top 3 methods

FCD

FCD

Page 39: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

39

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Example of NCD’s Failure: Wild Animals Detection

The 3 missed detections (FCD)

Confusion Matrices

Fawn Meadow Accuracy Time

FCDFawn 38 3

97.9% 58 secMeadow 0 103

NCDFawn 29 12

77.8% 14 minMeadow 20 83

41 Fawns

103 Meadows

Limited buffer size in the compressor and total loss of vertical texture causes NCD’s performance to decrease!

Compressor used with NCD: LZWImage size: 160x120

Page 40: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

40

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Content-based image retrieval system

Classical (Smeulders, 2000)

Many steps and parameters to set

Proposed

Compare each object in the set to a query image Q and rank the results on the basis of their distance from Q

Page 41: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

41

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Applications: COREL Dataset

1500 images, 15 classes, 100 images per class

Africans Beach Architecture Buses

Dinosaurs

Mountains

Elephants

Food

Flowers Horses

Caves Postcards

Sunsets Tigers Women

Page 42: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

42

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Precision (P) vs. Recall (R) Evaluation

True Positives

False Positives False Negatives

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1Recall

Pre

cisi

on

FCD GMM-MDIR JTC

Minimum Distortion Information Retrieval (Jeong & Gray, 2004)

FPTP

TPP

FNTP

TPR

Jointly Trained Codebook (Daptardar & Storer, 2008)

Running time: 18 min (images resampled to 64x64) 2 Processors (2GHz) + 2 GB RAM

Page 43: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

43

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Confusion Matrix

Classification according to the minimum average distance from a class

Page 44: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

44

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

False alarms (?)

Typical images belonging to the class “Africans”

The 10 misclassified images Mostly misclassified as “Tigers” No human presence “Extreme” image misclassified as “Food”

Page 45: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

45

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Estimation of the “complexity” of a dataset

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Recall

Pre

cis

ion

Fawns

Lola

Corel

Liber Liber

Dataset Classes Objects Content Diversity

Fawns & Meadows 2 144 Average

Lola 19 164 Average

Corel 15 1500 High

Liber Liber 11 90 Low

Dataset MAP score

Fawns & Meadows 0.872

Liber Liber 0.81

Lola 0.661

Corel 0.433

Average

Reduced

High

Complexity

Rank

Page 46: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

46

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

A larger dataset and a comparison with state-of-the-artmethods: Nister-Stewenius

10200 images 2550 objects photographed under 4

points of view Score (from 1 to 4) represents the

number of meaningful objects retrieved in the top-4

SIFT-based NS1 and NS2 use different training sets and parameters settings, and yield different results

FCD is independent from parameters Only 1000 images processed for NCD Query Time: 8 seconds

2.5

2.75

3

3.25

3.5

3.75

4

100

500

1000

1500

2000

2500

3000

3500

4000

4500

5000

5500

6000

6500

7000

7500

8000

8500

9000

9500

1000

0

1020

0

NS1

FCD

NS2

NCD

Page 47: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

47

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

SAR Scene Hierarchical Clustering

32 TerraSAR-X subsets acquired over Paris

False Alarm

Page 48: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

48

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Outline

The core: Compression-based similarity measures (CBSM)

Theoretical Foundations

Contributions: Theory

Contributions: Applications and Experiments

Conclusions

Page 49: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

49

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

What can (not) compression-based techniques do

The FCD allows validating compression-based techniques Largest dataset tested by NCD: 100 objects Largest dataset tested by FCD: 10200 objects

CBSM do NOT Outperform every existing technique Results obtained so far on small datasets could be misleading On the larger datasets analyzed, results are often inferior to the state of

the art Open question: could they be somehow improved?

BUT they yield comparable results to other existing techniques Reducing drastically (ideally skipping) the parameters definition Skipping feature extraction Skipping clustering (in the case of classification) Without assuming a priori knowledge of the data Data driven, applicable to any kind of data

Page 50: Towards Compression-based Information Retrieval Daniele Cerra Symbiose Seminar, Rennes, 18.11.2010

50

Com

pete

nce

Cent

re o

n In

form

ation

Ext

racti

on

and

Imag

e U

nder

stan

ding

for E

arth

Obs

erva

tion

Compression