
SYSTEMS FOR MULTIMEDIA DATA SEARCH

Claudio Lucchese and Salvatore Orlando
Third Department Workshop

Dipartimento di Informatica, Università degli studi di Venezia

Thanks to the following people, involved in the SAPIR project:

Raffaele Perego
Fausto Rabitti
Paolo Bolettieri
Fabrizio Falchi
Tommaso Piccioli
Matteo Mordacchini

WHAT AND, MORE IMPORTANTLY, WHY ??!

Multi-Media Objects:
  text, web pages (Google, Yahoo!, …)
  images (Flickr bought by Yahoo!, …)
  video (YouTube bought by Google, …)
  audio
  etc. ...

Why are we interested in MM objects ?
  1 image every 10 documents.
  Increasing multimedia content on the web.
  Yet, search in audio-visual content is limited to associated text and metadata annotations.

OK, BUT GOOGLE ALREADY DOES THE JOB ...

We mean to search by content !

Google does text-based (annotations) search
  Do we have enough text ?
  How is the quality of such text ?
  Is it correlated with the actual content ?

Look at this:
  http://www.nmis.isti.cnr.it/khi/p
  ISTI-CNR (Pisa) and Max Planck Kunsthistorische Institute project on Florentine Coat of Arms

CONTENT-BASED SEARCH

[Figure: Shape Selection and Results panels]


WHAT IS THE ADDED VALUE ?

Content-based search for disambiguation:
  What about searching for “sapphire” on Flickr ?

CONTENT-BASED, I.E. SIMILARITY SEARCH

Objects are “unknown”, distances between objects are “known”

Metric Space assumption:
  symmetry
  identity
  triangle inequality

Distance functions inducing a metric space:
  Minkowski distances, edit distance, Jaccard distance...

Typical queries: Range or kNN
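The three distance families named on this slide can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the edit distance shown is the Levenshtein variant, which is one common choice and not necessarily the one the authors had in mind.

```python
from typing import Sequence, Set


def minkowski(x: Sequence[float], y: Sequence[float], p: float = 2.0) -> float:
    """L_p distance between two equal-length vectors (a metric for p >= 1)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning s into t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def jaccard_distance(a: Set, b: Set) -> float:
    """1 - |a ∩ b| / |a ∪ b|; two empty sets are treated as identical."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)
```

All three satisfy symmetry, identity and the triangle inequality, so any index that relies only on the metric space assumption can use them interchangeably.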

SIMILARITY SEARCH

Applications:
  Photo Search
  Medical Data Search
  3D Shape Search (query: haemoglobin molecule)

SIMILARITY SEARCH

Objects are “unknown”, distances between objects are “known”

Metric Space assumption:
  symmetry
  identity
  triangle inequality

Distance functions inducing a metric space:
  Minkowski distances, edit distance, Jaccard distance...

Typical queries: Range or kNN

Applications:
  photos, 3D shapes, medical images, but also text, DNA, graphs, etc.
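The two typical query types can be written down as a brute-force scan that only ever calls the distance function, which is exactly the "objects unknown, distances known" setting. This is a baseline sketch under that assumption, not an index; `range_query` and `knn_query` are names chosen for this example.

```python
import heapq
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def range_query(db: Iterable[T], q: T, radius: float,
                dist: Callable[[T, T], float]) -> List[T]:
    """All objects whose distance from q is at most radius."""
    return [o for o in db if dist(q, o) <= radius]


def knn_query(db: Iterable[T], q: T, k: int,
              dist: Callable[[T, T], float]) -> List[Tuple[float, T]]:
    """The k objects closest to q, as (distance, object) pairs, nearest first."""
    return heapq.nsmallest(k, ((dist(q, o), o) for o in db),
                           key=lambda pair: pair[0])
```

Both functions evaluate the distance once per database object, which is the cost that metric indexes try to reduce.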

FEATURE-BASED APPROACH

From the object space to the feature space
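As a toy illustration of moving from the object space to the feature space, the sketch below maps a grayscale image to a normalized intensity histogram and compares two images by the L1 distance between their histograms. The histogram feature is an assumption made for this example only; it is much simpler than the descriptors a real system would extract.

```python
from typing import List, Sequence


def gray_histogram(pixels: Sequence[int], bins: int = 4) -> List[float]:
    """Feature vector of an image given as a flat list of 0..255 gray levels:
    the fraction of pixels falling into each of `bins` equal-width buckets."""
    if not pixels:
        raise ValueError("empty image")
    counts = [0] * bins
    for v in pixels:
        counts[min(v * bins // 256, bins - 1)] += 1
    return [c / len(pixels) for c in counts]


def l1(x: Sequence[float], y: Sequence[float]) -> float:
    """Minkowski distance with p = 1 between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


# Two tiny "images": similarity is computed on features, never on raw pixels.
dark = gray_histogram([0, 0, 0, 0])
mixed = gray_histogram([0, 10, 200, 255])
distance = l1(dark, mixed)
```

Once every object is reduced to a vector like this, the similarity query becomes a query in a metric space over feature vectors.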

DATA STRUCTURES

Vantage Point Tree
Generalized Hyper-plane Tree

Multiple sub-trees may hold the solution
They perform well with high dimensionality data
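A minimal vantage point tree can make the "multiple sub-trees may hold the solution" remark concrete. The sketch below assumes a random choice of vantage point and a median split, which are common textbook choices and not taken from the slides; `VPNode`, `build_vp_tree` and `range_search` are names invented for this example.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class VPNode:
    vantage: object
    mu: float = 0.0                      # median distance from the vantage point
    inside: Optional["VPNode"] = None    # objects with d(vantage, o) <= mu
    outside: Optional["VPNode"] = None   # objects with d(vantage, o) > mu


def build_vp_tree(objects: Sequence, dist: Callable,
                  rng: Optional[random.Random] = None) -> Optional[VPNode]:
    """Recursively split the objects around a randomly chosen vantage point."""
    if not objects:
        return None
    rng = rng if rng is not None else random.Random(0)
    items = list(objects)
    node = VPNode(items.pop(rng.randrange(len(items))))
    if items:
        pairs = [(dist(node.vantage, o), o) for o in items]
        node.mu = sorted(d for d, _ in pairs)[len(pairs) // 2]
        node.inside = build_vp_tree([o for d, o in pairs if d <= node.mu], dist, rng)
        node.outside = build_vp_tree([o for d, o in pairs if d > node.mu], dist, rng)
    return node


def range_search(node: Optional[VPNode], q, radius: float, dist: Callable,
                 out: Optional[List] = None) -> List:
    """Collect every object within `radius` of q, pruning with the triangle inequality."""
    if out is None:
        out = []
    if node is None:
        return out
    d = dist(q, node.vantage)
    if d <= radius:
        out.append(node.vantage)
    if d - radius <= node.mu:   # the query ball reaches the inner region
        range_search(node.inside, q, radius, dist, out)
    if d + radius > node.mu:    # the query ball reaches the outer region
        range_search(node.outside, q, radius, dist, out)
    return out
```

When the query ball straddles the median distance `mu`, both conditions hold and both sub-trees are visited, which is the case the slide warns about.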

I HAVE NEVER SEEN ANYTHING LIKE THAT ON THE WEB !!!

Why are giants like Google and Yahoo! not using content-based search ?

Recent studies confirm that centralized solutions are not scalable !
  A single standard PC would need about 12 years to process a collection of 100 million images.

Why ?
  Feature extraction is expensive !
    feature extraction vs. words in a web-page
  Searching is expensive !
    similarity search vs. boolean search
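The 12-year figure can be turned into a per-image cost with a little arithmetic. The sketch below assumes 365.25-day years and, for the second function, perfectly linear speed-up across machines, which is an idealization introduced here and not a claim from the slides.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # 31,557,600 seconds


def seconds_per_image(years: float, images: int) -> float:
    """Average processing time per image implied by a total duration."""
    return years * SECONDS_PER_YEAR / images


def machines_needed(years_on_one_pc: float, target_days: float) -> int:
    """Machines required to finish in `target_days`, assuming ideal linear speed-up."""
    return math.ceil(years_on_one_pc * 365.25 / target_days)


per_image = seconds_per_image(12, 100_000_000)   # roughly 3.8 seconds per image
cluster = machines_needed(12, 30)                # machines for a 30-day run
```

Under these assumptions the quoted figure corresponds to a few seconds of work per image, and finishing within a month would already take well over a hundred machines.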


PARALLEL FEATURE EXTRACTION

We crawled Flickr to gather Image Ids
A set of clients were deployed on the EGEE Grid:
  Retrieve 1000 image Ids
  Download Image
  Extract MPEG-7 features
  Parse Flickr Photo-page to obtain additional metadata
  Send metadata to our centralized repository

We have about 50 million images with features
  it seems that Flickr has at least 1 billion images
  Yahoo! images has at least 2 billion images
  and … they grow exponentially !

Our metadata collection is called CoPhIR
It is by far the largest publicly available collection
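The client loop described on this slide can be sketched as follows. Everything concrete here is hypothetical: `download`, `extract_features`, `parse_page` and `send` are placeholder callables supplied by the caller, and a local thread pool stands in for the Grid clients, so this shows only the shape of the pipeline and not the project's actual code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Iterator, List

BATCH_SIZE = 1000  # each client retrieves 1000 image ids at a time


def batches(ids: Iterable, size: int = BATCH_SIZE) -> Iterator[List]:
    """Split the crawled id stream into fixed-size work units."""
    batch: List = []
    for image_id in ids:
        batch.append(image_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def process_batch(image_ids: List, download: Callable, extract_features: Callable,
                  parse_page: Callable[..., Dict], send: Callable) -> int:
    """One client's job: download, extract features, parse metadata, send."""
    done = 0
    for image_id in image_ids:
        image = download(image_id)
        record = {"id": image_id, "features": extract_features(image)}
        record.update(parse_page(image_id))   # additional photo-page metadata
        send(record)                          # push to the central repository
        done += 1
    return done


def run_clients(ids: Iterable, n_clients: int, download: Callable,
                extract_features: Callable, parse_page: Callable, send: Callable,
                batch_size: int = BATCH_SIZE) -> int:
    """Run the batches on n_clients workers and return the number of images processed."""
    with ThreadPoolExecutor(max_workers=n_clients) as pool:
        futures = [pool.submit(process_batch, b, download, extract_features,
                               parse_page, send)
                   for b in batches(ids, batch_size)]
        return sum(f.result() for f in futures)
```

Because each batch is independent, the work scales with the number of clients until the central repository or the download bandwidth becomes the bottleneck.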

PARALLEL SEARCH

The search space is mapped into a linear interval
This is mapped onto a distributed network thanks to some DHT-based algorithm
Each node of the network holds an M-Tree
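One way to picture the mapping is sketched below. The slides do not say how the linear interval is built, so this example assumes a single pivot object (key = normalized distance from the pivot) and an equal-width split of the interval [0, 1] among the nodes; all function names and both assumptions are illustrative only.

```python
from typing import Callable, List


def linear_key(obj, pivot, dist: Callable, max_dist: float) -> float:
    """Map an object to [0, 1] using its distance from a fixed pivot."""
    return min(dist(pivot, obj) / max_dist, 1.0)


def node_for_key(key: float, n_nodes: int) -> int:
    """Node responsible for a key when [0, 1] is split into equal slices."""
    return min(int(key * n_nodes), n_nodes - 1)


def nodes_for_range_query(q, radius: float, pivot, dist: Callable,
                          max_dist: float, n_nodes: int) -> List[int]:
    """Nodes that may hold objects within `radius` of q.

    By the triangle inequality, any such object o satisfies
    d(pivot, q) - radius <= d(pivot, o) <= d(pivot, q) + radius,
    so only the nodes covering that key interval need to be contacted.
    """
    d = dist(pivot, q)
    lo = min(max(d - radius, 0.0) / max_dist, 1.0)
    hi = min((d + radius) / max_dist, 1.0)
    return list(range(node_for_key(lo, n_nodes), node_for_key(hi, n_nodes) + 1))
```

Each contacted node would then run the query on its local index, which according to the slide is an M-Tree.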

OUR PROPOSAL

Metric cache:
  Cache is widely used in traditional search engines

Trivial:
  Store results of previous queries
  Return results when submitted query is stored

Less trivial ...
  Store results of previous queries
  Best-effort query answering (with guarantees)
  Use past queries to optimize database queries.

A CACHE WITH ONE ENTRY ...

qi is the query in cache with its k-NN.
ri is the “radius of the query”.
qj is a new query.

If d(qi,qj) < ri then the cached objects within distance
  Rji = ri - d(qi,qj) from qj
are the top-k’ results of the new query.

[Figure: qi with radius ri, qj with radius Rji, and the distance d(qi,qj)]
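The one-entry rule can be sketched as a small function. It assumes the cache stores the k nearest neighbours of qi and that ri is the distance from qi to the farthest of them; the function name is invented for this example, and the strict comparison against Rji is a cautious choice made here so that ties at distance ri cannot break the guarantee.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def answer_from_cached_query(q_new: T, q_cached: T, cached_knn: Sequence[T],
                             dist: Callable[[T, T], float]) -> List[Tuple[float, T]]:
    """Return the cached objects that are guaranteed to be the top-k' of q_new.

    Any database object o with d(q_new, o) < R, where R = r_i - d(q_i, q_new),
    satisfies d(q_i, o) <= d(q_i, q_new) + d(q_new, o) < r_i, so it lies inside
    the cached k-NN ball and must already be in the cache.
    """
    if not cached_knn:
        return []
    r_i = max(dist(q_cached, o) for o in cached_knn)   # radius of the cached query
    d = dist(q_cached, q_new)
    if d >= r_i:
        return []                                      # q_new falls outside the ball
    safe_radius = r_i - d                              # R_ji on the slide
    hits = [(dist(q_new, o), o) for o in cached_knn]
    return sorted((h for h in hits if h[0] < safe_radius), key=lambda h: h[0])
```

The number of returned objects k' depends on how close the new query is to the cached one: the closer it is, the larger the safe radius and the more results come with a guarantee.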

A CACHE WITH MANY ENTRIES

Given the new query qx, find the largest Rxi corresponding to the cached query qi.

The cached objects within distance Rxi from qx are the most similar to the query.

Additionally, one could use other objects in the cache to provide an approximate answer.

[Figure: cached queries qi, qj, ql with radii ri, rj, rl, and the new query qx]
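With many entries, the same reasoning is applied to every cached query and the largest safe radius wins. The sketch below is one possible reading of the slide: the cache layout (a list of query and k-NN pairs), the function name, the strict comparison, and the way the approximate part is filled from the remaining cached objects are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
Entry = Tuple[T, Sequence[T]]   # (cached query, its k nearest neighbours)


def answer_from_cache(q_new: T, cache: Sequence[Entry], k: int,
                      dist: Callable[[T, T], float]
                      ) -> Tuple[List[Tuple[float, T]], List[Tuple[float, T]]]:
    """Return (exact, approximate) results for q_new using only cached objects."""
    # Largest safe radius R_xi = r_i - d(q_i, q_new) over all cached queries.
    best_radius = 0.0
    for q_i, knn_i in cache:
        if knn_i:
            r_i = max(dist(q_i, o) for o in knn_i)
            best_radius = max(best_radius, r_i - dist(q_i, q_new))

    # Rank every distinct cached object by its distance from the new query.
    seen: List[T] = []
    for _, knn_i in cache:
        for o in knn_i:
            if o not in seen:
                seen.append(o)
    ranked = sorted(((dist(q_new, o), o) for o in seen), key=lambda p: p[0])

    # Objects strictly inside the safe radius are guaranteed true neighbours;
    # the rest of the ranking only gives a best-effort, approximate answer.
    exact = [p for p in ranked if p[0] < best_radius][:k]
    approximate = ranked[len(exact):k]
    return exact, approximate
```

A caller could return the exact part immediately and decide, based on how many approximate results are needed to reach k, whether it is worth forwarding the query to the underlying index.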

SOME RESULTS ...


THE END. ... Thank you ! =)