distributed multi-modal similarity...

47
Distributed Multi-modal Similarity Retrieval David Novak Seminar of DISA Lab, October 14, 2014 David Novak Multi-modal Similarity Retrieval DISA Seminar 1 / 17

Upload: others

Post on 27-Jul-2020

21 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Distributed Multi-modal Similarity Retrieval

David Novak

Seminar of DISA Lab, October 14, 2014

David Novak Multi-modal Similarity Retrieval DISA Seminar 1 / 17

Page 2: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Outline of the Talk

1 MotivationSimilarity SearchE↵ectiveness and E�ciencyMulti-modal Search

2 Existing SolutionsSimilarity IndexingDistributed Key-value Stores

3 Big Data Similarity RetrievalGeneric ArchitectureSpecific System

4 Conclusions

David Novak Multi-modal Similarity Retrieval DISA Seminar 2 / 17

Page 3: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Motivation Similarity Search

Motivation

The similarity is key to human cognition, learning, memory. . .[cognitive psychology]

David Novak Multi-modal Similarity Retrieval DISA Seminar 3 / 17

Page 4: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Motivation Similarity Search

Motivation

The similarity is key to human cognition, learning, memory. . .[cognitive psychology]

everything we can see, hear, measure, observe is in digital form

David Novak Multi-modal Similarity Retrieval DISA Seminar 3 / 17

Page 5: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Motivation Similarity Search

Motivation

The similarity is key to human cognition, learning, memory. . .[cognitive psychology]

everything we can see, hear, measure, observe is in digital form

Therefore, computers should be able to search data base on similarity

David Novak Multi-modal Similarity Retrieval DISA Seminar 3 / 17

Page 6: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Motivation Similarity Search

Motivation

The similarity is key to human cognition, learning, memory. . .[cognitive psychology]

everything we can see, hear, measure, observe is in digital form

Therefore, computers should be able to search data base on similarity

The similarity search problem has two aspects

e↵ectiveness: how to measure similarity of two “objects”domain specific (photos, X-rays, MRT results, voice, music, EEG,. . . )

David Novak Multi-modal Similarity Retrieval DISA Seminar 3 / 17

Page 7: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Motivation Similarity Search

Motivation

The similarity is key to human cognition, learning, memory. . .[cognitive psychology]

everything we can see, hear, measure, observe is in digital form

Therefore, computers should be able to search data base on similarity

The similarity search problem has two aspects

e↵ectiveness: how to measure similarity of two “objects”domain specific (photos, X-rays, MRT results, voice, music, EEG,. . . )

e�ciency: how to realize similarity search fastusing a given similarity measureon very large data collections

David Novak Multi-modal Similarity Retrieval DISA Seminar 3 / 17

Page 8: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Motivation E↵ectiveness and E�ciency

E�ciency: Motivation Example

Type of data:

general images (photos)

every image has been processed by a deep neural networkto obtain a “semantic characterization” of the image (descriptor)compared by Euclidean distance, it measures visual similarity of images

David Novak Multi-modal Similarity Retrieval DISA Seminar 4 / 17

Page 9: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Motivation E↵ectiveness and E�ciency

E�ciency: Motivation Example

Type of data:

general images (photos)

every image has been processed by a deep neural networkto obtain a “semantic characterization” of the image (descriptor)

compared by Euclidean distance, it measures visual similarity of images

David Novak Multi-modal Similarity Retrieval DISA Seminar 4 / 17

Page 10: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Motivation E↵ectiveness and E�ciency

E�ciency: Motivation Example

Type of data:

general images (photos)

every image has been processed by a deep neural networkto obtain a “semantic characterization” of the image (descriptor)compared by Euclidean distance, it measures visual similarity of images

David Novak Multi-modal Similarity Retrieval DISA Seminar 4 / 17

Page 11: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Motivation E↵ectiveness and E�ciency

E�ciency: Motivation Example

Type of data:

general images (photos)

every image has been processed by a deep neural networkto obtain a “semantic characterization” of the image (descriptor)compared by Euclidean distance, it measures visual similarity of images

David Novak Multi-modal Similarity Retrieval DISA Seminar 4 / 17

Page 12: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Motivation E↵ectiveness and E�ciency

E�ciency: Motivation Example

Type of data:

general images (photos)every image has been processed by a deep neural network

to obtain a “semantic characterization” of the image (descriptor)compared by Euclidean distance, it measures visual similarity of images

David Novak Multi-modal Similarity Retrieval DISA Seminar 4 / 17

Page 13: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Motivation E↵ectiveness and E�ciency

E�ciency: Motivation Example

Type of data:

general images (photos)

every image has been processed by a deep neural networkto obtain a “semantic characterization” of the image (descriptor)compared by Euclidean distance, it measures visual similarity of images

E�ciency problem:

what if we had 100 million of images with such descriptors

each descriptor is a 4096-dimensional float vector

David Novak Multi-modal Similarity Retrieval DISA Seminar 4 / 17

Page 14: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Motivation E↵ectiveness and E�ciency

E�ciency: Motivation Example

Type of data:

general images (photos)

every image has been processed by a deep neural networkto obtain a “semantic characterization” of the image (descriptor)compared by Euclidean distance, it measures visual similarity of images

E�ciency problem:

what if we had 100 million of images with such descriptors

each descriptor is a 4096-dimensional float vector

) over 1.5 TB of data to be organized for similarity searchanswer similarity queries online

David Novak Multi-modal Similarity Retrieval DISA Seminar 4 / 17

Page 15: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Motivation Multi-modal Search

Real Application: Multi-field Data

real-world application data objects would have many “fields”:attribute fields (numbers, strings, dates, etc.)(several) descriptors for similarity searchkeywords/annotations for full-text search, etc.

example:

{ "ID": "image_1",

"author": "David Novak",

"date": "20140327",

"categories": [ "outdoor", "family" ],

"DNN_visual_descriptor": [5.431, 0.0042, 0.0, 0.97,... ],

"dominant_color": "0x9E, 0xC2, 0x13",

"keywords": "summer, beach, ocean, sun, sand" }

David Novak Multi-modal Similarity Retrieval DISA Seminar 5 / 17

Page 16: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Motivation Multi-modal Search

Real Application: Multi-field Data

real-world application data objects would have many “fields”:attribute fields (numbers, strings, dates, etc.)(several) descriptors for similarity searchkeywords/annotations for full-text search, etc.

example:

{ "ID": "image_1",

"author": "David Novak",

"date": "20140327",

"categories": [ "outdoor", "family" ],

"DNN_visual_descriptor": [5.431, 0.0042, 0.0, 0.97,... ],

"dominant_color": "0x9E, 0xC2, 0x13",

"keywords": "summer, beach, ocean, sun, sand" }

David Novak Multi-modal Similarity Retrieval DISA Seminar 5 / 17

Page 17: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Motivation Multi-modal Search

Objectives

Goal: generic, horizontally scalable system architecture that would allow

standard attribute-based access

keyword (full-text) search

similarity search in “arbitrary” similarity space

multi-modal search – combination of several search perspectives, e.g.direct combination of similarity modalitiessimilarity query with filtering by attribute(s)re-ranking of search result by di↵erent criteria

. . . and do it all on a very large scalevoluminous data collectionshigh query throughput

David Novak Multi-modal Similarity Retrieval DISA Seminar 6 / 17

Page 18: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Motivation Multi-modal Search

Objectives

Goal: generic, horizontally scalable system architecture that would allow

standard attribute-based access

keyword (full-text) search

similarity search in “arbitrary” similarity space

multi-modal search – combination of several search perspectives, e.g.direct combination of similarity modalitiessimilarity query with filtering by attribute(s)re-ranking of search result by di↵erent criteria

. . . and do it all on a very large scalevoluminous data collectionshigh query throughput

David Novak Multi-modal Similarity Retrieval DISA Seminar 6 / 17

Page 19: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Motivation Multi-modal Search

Objectives

Goal: generic, horizontally scalable system architecture that would allow

standard attribute-based access

keyword (full-text) search

similarity search in “arbitrary” similarity space

multi-modal search – combination of several search perspectives, e.g.direct combination of similarity modalitiessimilarity query with filtering by attribute(s)re-ranking of search result by di↵erent criteria

. . . and do it all on a very large scalevoluminous data collectionshigh query throughput

David Novak Multi-modal Similarity Retrieval DISA Seminar 6 / 17

Page 20: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Existing Solutions Similarity Indexing

Distance-based Similarity Search

generic similarity searchapplicable to many domains

data modeled as metric space (D, �), where D is a domain of objectsand � is a total distance function � : D ⇥D �! R+

0

satisfyingpostulates of identity, symmetry, and triangle inequality

query by example: K -NN(q) returns K objects x from the datasetX ✓ D with the smallest �(q, x)

David Novak Multi-modal Similarity Retrieval DISA Seminar 7 / 17

Page 21: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Existing Solutions Similarity Indexing

Distance-based Similarity Search

generic similarity searchapplicable to many domains

data modeled as metric space (D, �), where D is a domain of objectsand � is a total distance function � : D ⇥D �! R+

0

satisfyingpostulates of identity, symmetry, and triangle inequality

query by example: K -NN(q) returns K objects x from the datasetX ✓ D with the smallest �(q, x)

23

1

q

David Novak Multi-modal Similarity Retrieval DISA Seminar 7 / 17

Page 22: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Existing Solutions Similarity Indexing

Similarity Indexing Techniques

Metric-based similarity indexing: two decades of research

memory structures for precise K -NN search

e�cient disk-oriented techniquesprecise and approximate (not all objects from K -NN answer returned)objects are partitioned and organized on disk by the similarity metric

David Novak Multi-modal Similarity Retrieval DISA Seminar 8 / 17

Page 23: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Existing Solutions Similarity Indexing

Similarity Indexing Techniques

Metric-based similarity indexing: two decades of research

memory structures for precise K -NN search

e�cient disk-oriented techniquesprecise and approximate (not all objects from K -NN answer returned)

objects are partitioned and organized on disk by the similarity metric

David Novak Multi-modal Similarity Retrieval DISA Seminar 8 / 17

Page 24: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Existing Solutions Similarity Indexing

Similarity Indexing Techniques

Metric-based similarity indexing: two decades of research

memory structures for precise K -NN search

e�cient disk-oriented techniquesprecise and approximate (not all objects from K -NN answer returned)objects are partitioned and organized on disk by the similarity metric

disk storage

David Novak Multi-modal Similarity Retrieval DISA Seminar 8 / 17

Page 25: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Existing Solutions Similarity Indexing

Similarity Indexing Techniques

Metric-based similarity indexing: two decades of research

memory structures for precise K -NN search

e�cient disk-oriented techniquesprecise and approximate (not all objects from K -NN answer returned)objects are partitioned and organized on disk by the similarity metric

given query q, the“most-promising” partitions formthe candidate set

disk storage

David Novak Multi-modal Similarity Retrieval DISA Seminar 8 / 17

Page 26: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Existing Solutions Similarity Indexing

Similarity Indexing Techniques

Metric-based similarity indexing: two decades of research

memory structures for precise K -NN search

e�cient disk-oriented techniquesprecise and approximate (not all objects from K -NN answer returned)objects are partitioned and organized on disk by the similarity metric

given query q, the“most-promising” partitions formthe candidate set

the candidate set SC is refined bycalculating �(q, x), 8x 2 SC

disk storage

David Novak Multi-modal Similarity Retrieval DISA Seminar 8 / 17

Page 27: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Existing Solutions Similarity Indexing

Similarity Indexing Techniques: Metadata Organization

Recently, there were proposed a few indexes of di↵erent type

memory index that organizes only metadataNovak, D., & Zezula, P. (2014). Rank Aggregation of Candidate Sets for E�cient

Similarity Search. In DEXA 2014, Springer.

David Novak Multi-modal Similarity Retrieval DISA Seminar 9 / 17

Page 28: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Existing Solutions Similarity Indexing

Similarity Indexing Techniques: Metadata Organization

Recently, there were proposed a few indexes of di↵erent type

memory index that organizes only metadataNovak, D., & Zezula, P. (2014). Rank Aggregation of Candidate Sets for E�cient

Similarity Search. In DEXA 2014, Springer.

David Novak Multi-modal Similarity Retrieval DISA Seminar 9 / 17

Page 29: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Existing Solutions Similarity Indexing

Similarity Indexing Techniques: Metadata Organization

Recently, there were proposed a few indexes of di↵erent type

memory index that organizes only metadataNovak, D., & Zezula, P. (2014). Rank Aggregation of Candidate Sets for E�cient

Similarity Search. In DEXA 2014, Springer.

similarity index IX candidate set CX⊆ X

k-NN(q)

refined answerdata storage

1

calculate δ(q,c), c ∊ CX2

3

David Novak Multi-modal Similarity Retrieval DISA Seminar 9 / 17

Page 30: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Existing Solutions Similarity Indexing

Distributed Similarity Indexes

Distributed Data Structures for metric-based similarity search

data partitioned to nodes according to the metric

at query time, query-relevant partitions (nodes) accessed

GHT?, VPT?, MCAN, M-Chord

David Novak Multi-modal Similarity Retrieval DISA Seminar 10 / 17

Page 31: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Existing Solutions Similarity Indexing

Distributed Similarity Indexes

Distributed Data Structures for metric-based similarity search

data partitioned to nodes according to the metric

at query time, query-relevant partitions (nodes) accessedGHT?, VPT?, MCAN, M-Chord

David Novak Multi-modal Similarity Retrieval DISA Seminar 10 / 17

Page 32: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Existing Solutions Similarity Indexing

Distributed Similarity Indexes

Distributed Data Structures for metric-based similarity search

data partitioned to nodes according to the metric

at query time, query-relevant partitions (nodes) accessedGHT?, VPT?, MCAN, M-Chord

C2

p1

p2

2*c 3*c0 c

C

C1

p

0

0

(mchord)K2

N2

K1N1K3

N3

N4

K N

K4

+2

+16

+8 +4

(a) (b)

+1

0 = 32 0 = 32

David Novak Multi-modal Similarity Retrieval DISA Seminar 10 / 17

Page 33: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Existing Solutions Distributed Key-value Stores

Current Distributed Stores

Currently, many e�cient distributed key-value or document stores emerged

distributed hash tablesobjects organized by IDs (ID-object map)

quick access to “documents” by IDs

secondary indexes on attributes

David Novak Multi-modal Similarity Retrieval DISA Seminar 11 / 17

Page 34: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Existing Solutions Distributed Key-value Stores

Current Distributed Stores

Currently, many e�cient distributed key-value or document stores emerged

distributed hash tablesobjects organized by IDs (ID-object map)

quick access to “documents” by IDs

secondary indexes on attributes

David Novak Multi-modal Similarity Retrieval DISA Seminar 11 / 17

Page 35: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Big Data Similarity Retrieval Generic Architecture

Generic Architecture

key-value store (ID-object) on the whole dataset X

worker worker

worker

worker

worker

David Novak Multi-modal Similarity Retrieval DISA Seminar 12 / 17

Page 36: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Big Data Similarity Retrieval Generic Architecture

Generic Architecture

key-value store (ID-object) on the whole dataset X

worker worker

similarity index IXi

field

worker

worker

worker

David Novak Multi-modal Similarity Retrieval DISA Seminar 12 / 17

Page 37: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Big Data Similarity Retrieval Generic Architecture

Generic Architecture

key-value store (ID-object) on the whole dataset X

worker worker

similarity index IXi

field

worker

worker

k-NN(q.field)candidate set CXi⊆ Xi

1

worker

refinement on part of CXi

2

merge partial answers3

refinement on part of CXi

2

David Novak Multi-modal Similarity Retrieval DISA Seminar 12 / 17

Page 38: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Big Data Similarity Retrieval Generic Architecture

Generic Architecture

key-value store (ID-object) on the whole dataset X

worker worker

similarity index IXi

field inverted file index IXi

field2

attribute index IXi

field3similarity index IXj

field

worker

worker

k-NN(q.field)candidate set CXi⊆ Xi

1

worker

refinement on part of CXi

2

merge partial answers3

refinement on part of CXi

2

David Novak Multi-modal Similarity Retrieval DISA Seminar 12 / 17

Page 39: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Big Data Similarity Retrieval Generic Architecture

System Features

Types of queries

ID-object query (often useful to initiate k-NN(q) query)

attribute-based queries (secondary indexes)

key-word (full-text) queries (Lucene-like index)

similarity queries (via similarity indexes)

combined similarity queries (late fusion)

K -NN query with attribute filtering

distributed re-ranking query answer

e�cient management of multiple data collectionsX = X

1

[ X2

[ · · · [ Xs

core key-value store is well horizontally scalable

David Novak Multi-modal Similarity Retrieval DISA Seminar 13 / 17

Page 40: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Big Data Similarity Retrieval Generic Architecture

System Features

Types of queries

ID-object query (often useful to initiate k-NN(q) query)

attribute-based queries (secondary indexes)

key-word (full-text) queries (Lucene-like index)

similarity queries (via similarity indexes)

combined similarity queries (late fusion)

K -NN query with attribute filtering

distributed re-ranking query answer

e�cient management of multiple data collectionsX = X

1

[ X2

[ · · · [ Xs

core key-value store is well horizontally scalable

David Novak Multi-modal Similarity Retrieval DISA Seminar 13 / 17

Page 41: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Big Data Similarity Retrieval Generic Architecture

System Features

Types of queries

ID-object query (often useful to initiate k-NN(q) query)

attribute-based queries (secondary indexes)

key-word (full-text) queries (Lucene-like index)

similarity queries (via similarity indexes)

combined similarity queries (late fusion)

K -NN query with attribute filtering

distributed re-ranking query answer

e�cient management of multiple data collectionsX = X

1

[ X2

[ · · · [ Xs

core key-value store is well horizontally scalable

David Novak Multi-modal Similarity Retrieval DISA Seminar 13 / 17

Page 42: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Big Data Similarity Retrieval Specific System

Specific System: Large-scale Image Management

100M objects from the CoPhIR dataset (benchmark):

{ "ID": "002561195",

"title": "My wife & daughter on Gold Coast beach",

"tags": "summer, beach, ocean, sun, sand, Australia",

"mpeg7_scalable_color": "25 36 0 127 69...",

"mpeg7_color_layout": "25 41 53 20; 32; -16...",

"mpeg7_color_structure": "25 41 53 20; 32;...",

"mpeg7_edge_histogram": "5 1 2 3 7 7 3 6...",

"mpeg7_homogeneous_texture": "232 201 198 180 201...",

"GPS_coordinates": "45.50382, -73.59921",

"flickr_user": "david_novak" }

David Novak Multi-modal Similarity Retrieval DISA Seminar 14 / 17

Page 43: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Big Data Similarity Retrieval Specific System

System Schema

Infinispan (Ispn) ID-object store

Lucene I tags,title

B+-tree I flickr_user

PPP-Codes index I GPS

PPP-Codes index I mpeg7 index

worker

Ispn node

worker

Ispn node

index

worker

Ispn node

worker

Ispn node

index

worker

Ispn node

index

David Novak Multi-modal Similarity Retrieval DISA Seminar 15 / 17

Page 44: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Big Data Similarity Retrieval Specific System

Specific System: Demo

20M objects of this type:

{ "ID": "002561195",

"title": "My wife & daughter on Gold Coast beach",

"keywords": "summer, beach, ocean, sun, sand, Australia",

"DNN_visual_descriptor": [5.431, 0.0042, 0.0, 0.97,... ] }

demo

David Novak Multi-modal Similarity Retrieval DISA Seminar 16 / 17

Page 45: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Conclusions

Conclusions

We have proposed and alfa-tested system architecture that

provides large-scale similarity search

...on a broad family of data + similarity measures

is distributed and horizontally scalable

can manage multi-field data:attribute, keywords, several similarity modalitiesmany variants of multi-modal search queries

Challenges:

full implementation and thorough testing

the similarity index can be bottleneck =) distribute it

David Novak Multi-modal Similarity Retrieval DISA Seminar 17 / 17

Page 46: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Conclusions

Conclusions

We have proposed and alfa-tested system architecture that

provides large-scale similarity search

...on a broad family of data + similarity measures

is distributed and horizontally scalable

can manage multi-field data:attribute, keywords, several similarity modalitiesmany variants of multi-modal search queries

Challenges:

full implementation and thorough testing

the similarity index can be bottleneck =) distribute it

David Novak Multi-modal Similarity Retrieval DISA Seminar 17 / 17

Page 47: Distributed Multi-modal Similarity Retrievaldisa.fi.muni.cz/wp-content/uploads/disa-presentation.pdf · Distributed Similarity Indexes Distributed Data Structures for metric-based

Conclusions

Conclusions

We have proposed and alfa-tested system architecture that

provides large-scale similarity search

...on a broad family of data + similarity measures

is distributed and horizontally scalable

can manage multi-field data:attribute, keywords, several similarity modalitiesmany variants of multi-modal search queries

Challenges:

full implementation and thorough testing

the similarity index can be bottleneck =) distribute it

David Novak Multi-modal Similarity Retrieval DISA Seminar 17 / 17