andrew zisserman - uclahelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfandrew zisserman (work...
TRANSCRIPT
![Page 1: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/1.jpg)
Scalable Visual Object Retrieval
Andrew Zisserman
(work with Ondřej Chum, Michael Isard, James Philbin, Josef Sivic)
Visual Geometry Group
Dept of Engineering Science
University of Oxford
![Page 2: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/2.jpg)
Query by visual example
Query: image/video clip � Retrieve: images/shots from archive
near duplicate
same object
same category
![Page 3: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/3.jpg)
outline
In images and videos:
1. Retrieving specific objects
• Use text analogy for efficient retrieval
2. Scaling up visual vocabularies
3. Query expansion to improve recall
![Page 4: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/4.jpg)
“Groundhog Day” [Rammis, 1993]Visually defined query
“Find this
clock”
Example: visual search in feature films
“Find this
place”
Problem specification: particular object retrieval
![Page 5: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/5.jpg)
retrieved shots
Example
![Page 6: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/6.jpg)
Particular objects, not entire images
Forced to face problems of:
• scale change,
• pose change,
• illumination change, and
• partial occlusion
![Page 7: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/7.jpg)
When do (images of) objects match?
Two requirements:
1. “patches” (parts) correspond, and
2. Configuration (spatial layout) corresponds
![Page 8: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/8.jpg)
Success of text retrieval
• efficient
• scalable
• high precision
Can we use retrieval mechanisms from text retrieval?
Need a visual analogy of a textual word.
![Page 9: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/9.jpg)
Approach
Determine regions (segmentation) and vector descriptors in each
frame which are invariant to camera viewpoint changes
Match descriptors between frames using invariant vectors
Visual problem
query?
• Retrieve key frames containing the same object
![Page 10: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/10.jpg)
Example of visual fragments
Image content is transformed into local fragments that are invariant to
translation, rotation, scale, and other imaging parameters
Lowe ICCV 1999• Fragments generalize over viewpoint and lighting
![Page 11: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/11.jpg)
Scale invariance
Multi-scale extraction of Harris interest points
Selection of points at characteristic scale in scale space
Laplacian
Chacteristic scale :
- maximum in scale space
- scale invariant
Mikolajczyk and Schmid ICCV 2001
![Page 12: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/12.jpg)
Viewpoint covariant segmentation
• Characteristic scales (size of region)
• Lindeberg and Garding ECCV 1994
• Lowe ICCV 1999
• Mikolajczyk and Schmid ICCV 2001
• Affine covariance (shape of region)
• Baumberg CVPR 2000
• Matas et al BMVC 2002 Maximally stable regions
• Mikolajczyk and Schmid ECCV 2002
• Schaffalitzky and Zisserman ECCV 2002
• Tuytelaars and Van Gool BMVC 2000
Shape adapted regions
“Harris affine”
![Page 13: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/13.jpg)
Example of affine covariant regions
1000+ regions per image Harris-affine
Maximally stable regions
• a region’s size and shape are not fixed, but
• automatically adapts to the image intensity to cover the same physical surface
• i.e. pre-image is the same surface region
Represent each region by SIFT descriptor (128-vector) [Lowe 1999]
![Page 14: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/14.jpg)
Descriptors – SIFT [Lowe’99]
distribution of the gradient over an image patch
gradient 3D histogram
→ →
image patch
y
x
very good performance in image matching [Mikolaczyk and Schmid’03]
4x4 location grid and 8 orientations (128 dimensions)
![Page 15: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/15.jpg)
Example
In each frame independently
determine elliptical regions (segmentation covariant with camera viewpoint)
compute SIFT descriptor for each region [Lowe ‘99]
Harris-affine
Maximally stable regions
1000+ descriptors per frame
![Page 16: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/16.jpg)
Object recognition
Establish correspondences between object model image and target image by
nearest neighbour matching on SIFT vectors
128D descriptor
spaceModel image Target image
![Page 17: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/17.jpg)
Match regions between frames using SIFT descriptors
Harris-affine
Maximally stable regions
• Multiple fragments overcomes problem of partial occlusion
• Transfer query box to localize object
Now, convert this approach to a text
retrieval representation
![Page 18: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/18.jpg)
Build a visual vocabulary for a movie
Vector quantize descriptors
• k-means clustering
+
+
Implementation
• compute SIFT features on frames from 48 shots of the film
• 6K clusters for Shape Adapted regions
• 10K clusters for Maximally Stable regions
SIFT 128DSIFT 128D
![Page 19: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/19.jpg)
Samples of visual words (clusters on SIFT descriptors):
Maximally stable regionsShape adapted regions
generic examples – cf textons
![Page 20: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/20.jpg)
More specific example
Samples of visual words (clusters on SIFT descriptors):
![Page 21: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/21.jpg)
Detect patches
Compute SIFT
descriptorNormalize
patch
+
+
Find nearest cluster
centre
Assign visual words and compute histograms for each
key frame in the video
Represent frame by sparse histogram
of visual word occurrences
200101…
![Page 22: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/22.jpg)
The same visual word
![Page 23: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/23.jpg)
Visual words are ‘iconic’ image patches or fragments
• represent the frequency of word occurrence
• but not their position
Representation: bag of (visual) words
Image
Collection of visual words
![Page 24: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/24.jpg)
Search
� For fast search, store a “posting list” for the dataset
� This maps word occurrences to the documents they occur in
frame #5 frame #10
Posting list
1
2
...
5,10, ...
10,...
...
#1 #1
#2
![Page 25: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/25.jpg)
Films = common dataset
“Pretty Woman”
“Groundhog Day”
“Casablanca”
“Charade”
![Page 26: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/26.jpg)
Video Google Demo
![Page 27: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/27.jpg)
Matching a query region
Stage 1: generate a short list of possible frames using bag of visual word representation:
1. Accumulate all visual words within the query region2. Use “book index” to find other frames with these words3. Compute similarity for frames which share at least one word
� Generates a tf-idf ranked list of all the frames in dataset
frame #5 frame #10
Posting list
1
2
...
5,10, ...
10,...
...
#1 #1
#2
![Page 28: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/28.jpg)
• Discard mismatches
• require spatial agreement with the neighbouring matches
• Compute matching score
• score each match with the number of agreement matches
• accumulate the score from all matches
• Also matches define correspondence between target and query region
Stage 2: re-rank short list on spatial consistency
NB weak measure
of spatial
consistency
![Page 29: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/29.jpg)
Sony logo from Google image
search on `Sony’
Retrieve shots from Groundhog Day
Example application I – product placement
![Page 30: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/30.jpg)
Retrieved shots in Groundhog Day for search on Sony logo
![Page 31: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/31.jpg)
Notre Dame from Google image
search on `Notre Dame’
Retrieve shots from Charade
Example II - finding photos in a personal collection
Query image
Charade (6,503 keyframes)
![Page 32: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/32.jpg)
First (correctly) retrieved shot
![Page 33: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/33.jpg)
A keyframe from the matching shotQuery image
Viewpoint invariant matching
![Page 34: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/34.jpg)
Part 2: Scaling up: the Oxford buildings
dataset
![Page 35: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/35.jpg)
Particular object search
Find these landmarks ...in these images
![Page 36: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/36.jpg)
Particular Object Search
� Problem: find particular occurrences of an object in a very large dataset of images
� Want to find the object despite possibly large changes inscale, viewpoint, lighting and partial occlusion
ViewpointScale
Lighting Occlusion
![Page 37: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/37.jpg)
Representation & Similarity
� Text retrieval approach to visual search (“Video Google”)
Image Sparse affine invariant regions descriptors
(Hessian Affine + SIFT)
Detection + Description Quantize Sparse
histogram of visual word occurrences
� Representation is a sparse histogram for each image
� Similarity measure is L2 distance between tf-idf weightedhistograms
200101…
![Page 38: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/38.jpg)
Investigate …
Vocabulary size: number of visual words in range 10K to 1M
Use of spatial information to re-rank
![Page 39: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/39.jpg)
Oxford buildings dataset
� Automatically crawled from Flickr
� Dataset (i) consists of 5062 images, crawled by searching for Oxford landmarks, e.g.
� “Oxford Christ Church”� “Oxford Radcliffe camera”� “Oxford”
� High resolution images (1024 x 768)
![Page 40: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/40.jpg)
Oxford buildings dataset
� Landmarks plus queries used for evaluation
All Soul's
Ashmolean
Balliol
Bodleian
Thom Tower
Cornmarket
Bridge of Sighs
Keble
Magdalen
Pitt Rivers
Radcliffe Camera
� Ground truth obtained for 11 landmarks over 5062 images
� Performance measured by mean Average Precision (mAP) over 55 queries
![Page 41: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/41.jpg)
Oxford buildings dataset
� Automatically crawled from Flickr
� Consists of:
� Dataset (i) crawled by searching for Oxford landmarks
� Datasets (ii) and (iii) from other popular Flickr tags. Acts as additional distractors
![Page 42: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/42.jpg)
Quantization / Clustering
� K-means usually seen as a quick + cheap method
� But far too slow for our needs – D~128, N~20M+, K~1M
![Page 43: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/43.jpg)
K-means overview
� K-means overview:
Initialize cluster centres
Find nearest cluster to each datapoint (slow) O(N K)
Re-compute cluster centres as centroid
Iterate
� Idea: nearest neighbour search is the bottleneck – use approximate nearest neighbour search
� K-means provably locally minimizes the sum of squared errors (SSE) between a cluster centre and its points
![Page 44: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/44.jpg)
Approximate K-means
� Use multiple, randomized k-d trees for search
� A k-d tree hierarchically decomposes the descriptor space
� Points nearby in the space can be found (hopefully) by backtracking around the tree some small number of steps
� Single tree works OK in low dimensions – not so well in high dimensions
![Page 45: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/45.jpg)
Approximate K-means
� Multiple randomized trees increase the chances of finding nearby points
Query point
True nearest neighbour
True nearest neighbour found?
No No Yes
![Page 46: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/46.jpg)
Approximate K-means
� Use the best-bin first strategy to determine which branch of the tree to examine next
� share this priority queue between multiple trees – searching multiple trees only slightly more expensive than searching one
� Original K-means complexity = O(N K)
� Approximate K-means complexity = O(N log K)
� This means we can scale to very large K
![Page 47: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/47.jpg)
Approximate K-means
� How accurate is the approximate search?
� Performance on 5K image dataset for a random forest of 8 trees
� Allows much larger clusterings than would be feasible with
standard K-means: N~17M points, K~1M
� AKM – 8.3 cpu hours per iteration
� Standard K-means - estimated 2650 cpu hours per iteration
![Page 48: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/48.jpg)
Approximate K-means
� Using large vocabularies gives a big boost in performance (peak @ 1M words)
� More discriminative vocabularies give:
� Better retrieval quality
� Increased search speed – documents share less words, so fewer documents need to be scored
![Page 49: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/49.jpg)
Beyond Bag of Words
� Use the position and shape of the underlying features to improve retrieval quality
� Both images have many matches – which is correct?
![Page 50: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/50.jpg)
Beyond Bag of Words
� We can measure spatial consistency between the query and each result to improve retrieval quality
Many spatially consistent matches – correct result
Few spatially consistent matches – incorrect result
![Page 51: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/51.jpg)
Beyond Bag of Words
� Extra bonus – gives localization of the object
![Page 52: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/52.jpg)
Estimating spatial correspondences
1. Test each correspondence
![Page 53: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/53.jpg)
Estimating spatial correspondences
2. Compute a (restricted) affine transformation (5 dof)
![Page 54: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/54.jpg)
Estimating spatial correspondences
3. Score by number of consistent matches
Use RANSAC on full affine transformation (6 dof)
![Page 55: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/55.jpg)
0.6250.6021.25M
0.6450.6181M
0.6300.609750K
0.6420.606500K
0.6330.598250K
0.5970.535100K
0.5990.47350K
vocab
sizebag of
wordsspatial
Mean Average Precision variation with vocabulary size
![Page 56: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/56.jpg)
![Page 57: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/57.jpg)
Example Results
Query Example Results
![Page 58: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/58.jpg)
Demo
![Page 59: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/59.jpg)
Part 3: Query expansion
![Page 60: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/60.jpg)
Query Expansion in text
In text :
• Reissue top n responses as queries
• Pseudo/blind relevance feedback
• Danger of topic drift
In vision:
• Reissue spatially verified image regions as queries
![Page 61: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/61.jpg)
Query Image Originally retrieved image Originally not retrieved
Query Expansion
![Page 62: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/62.jpg)
Query Expansion
![Page 63: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/63.jpg)
Query Expansion
![Page 64: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/64.jpg)
Query Expansion
![Page 65: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/65.jpg)
Query image Originally retrieved Retrieved only
after expansion
Query Expansion
![Page 66: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/66.jpg)
Query
image
Expanded results (better)
Original results (good)
Prec.
Prec.
Rec.
Rec.
![Page 67: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/67.jpg)
![Page 68: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/68.jpg)
before after expansion
![Page 69: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/69.jpg)
before after expansion
![Page 70: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/70.jpg)
Ori
gin
al q
ue
ryA
fte
r exp
an
sio
n
Average Precision histogram for 55 queries
![Page 71: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/71.jpg)
Demo
![Page 72: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/72.jpg)
Summary and Extensions
Have successfully ported methods from text retrieval to the visual
domain:
• Visual words enable posting lists for efficient retrieval of specific
objects
• Spatial re-ranking improves precision
• Query expansion improves recall, without drift
Outstanding problems:
• Include spatial information into index
• Universal vocabularies
Other examples of text methods ported to vision:
• Data mining – see Till Quack’s talk
• Use of topic models, e.g. pLSA and LDA for object and scene
categories
![Page 73: Andrew Zisserman - UCLAhelper.ipam.ucla.edu/publications/sews2/sews2_7272.pdfAndrew Zisserman (work with Ondřej Chum, ... 2. Scaling up ... Part 2: Scaling up: the Oxford buildings](https://reader031.vdocuments.net/reader031/viewer/2022022204/5b02f8ff7f8b9a6c0b8b8f5c/html5/thumbnails/73.jpg)
Sivic, J. and Zisserman, A.
Video Google: A Text Retrieval Approach to Object Matching in Videos
Proceedings of the International Conference on Computer Vision (2003)
http://www.robots.ox.ac.uk/~vgg/publications/papers/sivic03.pdf
Demo: http://www.robots.ox.ac.uk/~vgg/research/vgoogle/
Philbin, J., Chum, O., Isard, M., Sivic, J. and Zisserman, A.
Object retrieval with large vocabularies and fast spatial matching
Proceedings of the Conference on Computer Vision and Pattern Recognition(2007)
http://www.robots.ox.ac.uk/~vgg/publications/papers/philbin07.pdf
Chum, O., Philbin, J., Isard, M., Sivic, J. and Zisserman, A.
Total Recall: Automatic Query Expansion with a Generative Feature Model for
Object Retrieval
Proceedings of the International Conference on Computer Vision (2007)
http://www.robots.ox.ac.uk/~vgg/publications/papers/chum07b.pdf
Papers and Demo