TRANSCRIPT
1
Large Scale Similarity Learning and Indexing Part II: Learning to Hash for Large Scale Search
Fei Wang and Jun Wang
IBM TJ Watson Research Center
2
Outline
Background
Approximate nearest neighbor search
Tree and hashing for data indexing
Locality sensitive hashing
Learning to Hash:
Unsupervised hashing
Supervised hashing
Semi-supervised hashing (pointwise/pairwise/listwise)
Large Scale Active Learning with Hashing
Hyperplane hashing
Fast query selection with hashing
Summary and Discussion
3
Motivation
Similarity based search has been popular in many applications
– Image/video search and retrieval: finding the most similar images/videos
– Audio search: find similar songs
– Product search: find shoes with similar style but different color
– Patient search: find patients with similar diagnostic status
Two key components:
– Similarity/distance measure
– Indexing scheme
Whittlesearch (Kovashka et al. 2013)
4
Nearest Neighbor Search (NNS)
Given a set of points X in a metric space and a query point q, find the closest point in X to q
k-nearest neighbor search: find the k closest points in X to q
Nearest neighbor search is a fundamental problem in many topics, including computational geometry, information retrieval, machine learning, data mining, computer vision, and so on
Time complexity: linear to the size of data
Also need to load the entire dataset in the memory
Example: 1 billion images with 10K-dim BOW features
Linear scan takes ~15 hrs;
Storage for such a dataset is ~40 TB
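The linear-scan baseline above can be sketched directly; its cost is O(n·d) per query, which is what motivates indexing (the toy points are illustrative):

```python
import math

def linear_scan_nn(points, query):
    """Exhaustive nearest-neighbor search: O(n * d) per query."""
    best, best_dist = None, math.inf
    for p in points:
        d = math.dist(p, query)  # Euclidean distance (Python 3.8+)
        if d < best_dist:
            best, best_dist = p, d
    return best, best_dist

points = [(0.0, 0.0), (1.0, 1.0), (3.0, 2.0)]
print(linear_scan_nn(points, (0.9, 1.2))[0])  # -> (1.0, 1.0)
```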
5
Approximate Nearest Neighbor
Instead of finding the exact nearest neighbors, return approximate nearest neighbors (Indyk, 03)
ANNs are reasonably good for many applications
Retrieving ANNs can be much faster (with sublinear complexity)
Tree and Hashing are two popular indexing schemes for fast ANN search
6
Tree-Based ANN Search
Recursively partition the data: Divide and Conquer
Search complexity is O(log n) (worst case could be O(n))
Inefficient for high dimensional data
Requires significant memory cost
[Figure: KD-tree — recursive axis-aligned partitions of the data]
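The recursive divide-and-conquer idea can be sketched as a minimal KD-tree with median splits and branch-and-bound search (a sketch, not an optimized implementation):

```python
import math

def build_kdtree(points, depth=0):
    """Recursively split the points on alternating axes at the median."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def kdtree_nn(node, query, best=None):
    """Branch-and-bound descent; prunes the far subtree when the splitting
    plane is farther away than the best distance found so far."""
    if node is None:
        return best
    d = math.dist(node["point"], query)
    if best is None or d < best[1]:
        best = (node["point"], d)
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = ("left", "right") if diff < 0 else ("right", "left")
    best = kdtree_nn(node[near], query, best)
    if abs(diff) < best[1]:  # the far side could still hold a closer point
        best = kdtree_nn(node[far], query, best)
    return best

tree = build_kdtree([(0.0, 0.0), (1.0, 1.0), (3.0, 2.0)])
print(kdtree_nn(tree, (0.9, 1.2))[0])  # -> (1.0, 1.0)
```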
7
Various Tree-Based Methods
Different ways to construct the tree structure
KD-tree Ball-tree
PCA-tree Random Projection-tree
[Figure: PCA-tree partitions along the 1st and 2nd principal components]
8
Hashing-Based ANN Search
Repeatedly partition the data
Each item in database represented as a hash code
Significantly reduced storage cost
for 1 billion images: 40 TB features -> 8GB hash codes
Search complexity: constant time or sublinear
linear scan over 1 billion images: 15 hrs -> 13 sec
Each database item is represented by a k-bit code:

        x1    x2    x3    x4    x5
  h1     0     1     1     0     1
  h2     1     0     1     0     1
  ...   ...   ...   ...   ...   ...
  hk    ...   ...   ...   ...   ...
  code  010…  100…  111…  001…  110…
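The table above suggests a compact representation: pack each item's bits into an integer, so Hamming distance becomes an XOR plus a popcount (the toy codes follow the table's first bits):

```python
def hamming(a, b):
    """Hamming distance between two binary codes stored as integers."""
    return bin(a ^ b).count("1")

# k-bit codes packed into plain integers: 1 billion 64-bit codes fit in ~8 GB
codes = {"x1": 0b010, "x2": 0b100, "x3": 0b111, "x4": 0b001, "x5": 0b110}
print(hamming(codes["x1"], codes["x3"]))  # 010 vs 111 -> 2
```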
9
Hashing: Training Step
Design models (hash functions) for computing hash codes
a general linear-projection-based hash function family: h_k(x) = sign(w_k^T x + b_k)
Estimate the model parameters (the projections w_k and thresholds b_k)
10
Hashing: Indexing Step
Compute the hash codes for each database item
Organize all the codes in hash tables (Inverse look-up)
database items
hash codes
database items
Inverse look-up
hash bucket
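The inverse look-up above can be sketched with a plain dictionary mapping each hash code (bucket) to the database items that fall into it (the toy codes are illustrative):

```python
from collections import defaultdict

# Inverse look-up: map each hash code (bucket) to the database items in it
table = defaultdict(list)
db_codes = {"x1": "011", "x2": "101", "x3": "111", "x4": "000", "x5": "111"}
for item, code in db_codes.items():
    table[code].append(item)

print(table["111"])  # -> ['x3', 'x5']
```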
11
Hashing: Query Step (Hash Lookup)
Compute the hash codes for the query point
Return the points within a small radius to the query in the Hamming space
The number of hash codes within Hamming radius r of a k-bit code is sum_{i=0}^{r} C(k, i)
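The bucket count within a Hamming ball can be checked by enumeration (storing codes as integers is an implementation assumption):

```python
from itertools import combinations
from math import comb

def neighbors_within_radius(code, k, r):
    """All k-bit codes within Hamming radius r of `code` (an int)."""
    out = []
    for d in range(r + 1):
        for bits in combinations(range(k), d):  # choose d bit positions to flip
            flipped = code
            for b in bits:
                flipped ^= 1 << b
            out.append(flipped)
    return out

# |{codes within radius r}| = sum_{i=0}^{r} C(k, i)
print(len(neighbors_within_radius(0b0000, 4, 1)), sum(comb(4, i) for i in range(2)))  # -> 5 5
```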
12
Hashing: Query Step (Hamming Ranking)
Hamming distance: the number of different bits between two hash codes
Rank the database items using their Hamming distance to the query’s hash code
Generate a ranked list
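Hamming ranking can be sketched as sorting database codes by their distance to the query code (the toy codes are illustrative):

```python
def hamming(a, b):
    """Hamming distance between integer-packed codes."""
    return bin(a ^ b).count("1")

db = {"x1": 0b010, "x2": 0b100, "x3": 0b111, "x4": 0b001, "x5": 0b110}
query_code = 0b110

# Rank all items by Hamming distance to the query's code
ranked = sorted(db, key=lambda item: hamming(db[item], query_code))
print(ranked)  # -> ['x5', 'x1', 'x2', 'x3', 'x4']
```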
13
A Conceptual Diagram for Hashing Based Image Search System
Indexing and Search
Image Database
Similarity Search & Retrieval
Hash Function Design
Visual Search Applications
Reranking Refinement
Designing compact yet accurate hash codes is critical to making the search effective
15
Locality Sensitive Hashing
Single hash bit: h(x) = sign(w^T x), with w a random projection
Hash table built from codes of bit length K
Collision probability: Pr[h(x) = h(y)] = 1 - theta(x, y) / pi
High dot product (small angle): unlikely to be split; lower dot product: likely to be split
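The collision behavior of a single random-hyperplane bit can be checked empirically: two vectors at angle theta are split by a random hyperplane with probability theta/pi, a standard LSH fact. A minimal simulation (the 60-degree example vectors and the trial count are illustrative choices):

```python
import math
import random

random.seed(0)

def lsh_bit(w, x):
    """One random-hyperplane hash bit: h(x) = sign(w . x)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0

# Two unit vectors at a 60-degree angle
x = (1.0, 0.0)
y = (math.cos(math.pi / 3), math.sin(math.pi / 3))

trials = 20000
hits = 0
for _ in range(trials):
    w = (random.gauss(0, 1), random.gauss(0, 1))  # random hyperplane normal
    hits += lsh_bit(w, x) == lsh_bit(w, y)

# Theory: Pr[h(x) = h(y)] = 1 - theta/pi = 1 - (pi/3)/pi = 2/3
print(hits / trials)
```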
16
Outline
Background
Approximate nearest neighbor search
Tree and hashing for data indexing
Locality sensitive hashing
Learning to Hash:
Unsupervised hashing
Supervised hashing
Semi-supervised hashing (pointwise/pairwise/listwise)
Large Scale Active Learning with Hashing
Hyperplane hashing
Fast query selection with hashing
Summary and Discussion
17
Overview: Learning-Based Hashing Techniques
Unsupervised: only use the property of unlabeled data (data-dependent)
Spectral hashing (SH, Weiss et al. NIPS 2008)
Kernelized methods (Kulis et al. ICCV 2009)
Graph hashing (Liu et al. ICML 2011)
Isotropic hashing (Kong et al. NIPS 2012)
Angular quantization hashing (Gong et al. NIPS 2012)
Supervised: use labeled data (task-dependent)
Deep learning based (Torralba et al. CVPR 2008)
Binary reconstructive embedding (Kulis et al. NIPS 2009)
Supervised kernel method (Liu et al. CVPR 2012)
Minimal loss hashing (Norouzi & Fleet ICML 2011)
Semi-supervised: use both labeled and unlabeled data
Metric learning based (Jain et al. CVPR 2008)
Semi-supervised hashing (Wang et al. CVPR 2010, PAMI 2012)
Sequential hashing (Wang et al. ICML 2011)
18
Overview: Advanced (Other) Hashing Techniques
Triplet and Listwise Hashing
Hamming metric learning based hashing (Norouzi et al. NIPS 2012)
Order preserving hashing (Wang et al. ACM MM 2013)
Column generation hashing (Li et al. ICML 2013)
Ranking supervised hashing (Wang et al. ICCV 2013)
Hyperplane Hashing and Active Learning
Angle & embedding hyperplane hashing (Jain et al. NIPS 2010)
Bilinear hashing (Liu et al. ICML 2012)
Fast pairwise query selection (Qian et al. ICDM 2013)
Hashing for complex data sources
Heterogeneous hashing (Ou et al. KDD 2013)
Structured hashing (Ye et al. ICCV 2013)
Multiple feature hashing (Song et al. ACM MM 2011)
Composite hashing (Zhang et al. ACM SIGIR 2011)
Submodular hashing (Cao et al. ACM MM 2012)
19
Unsupervised: PCA Hashing
Partition the data along the PCA directions
Projections with high variance are more reliable
Low-variance projections are very noisy
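PCA hashing as described can be sketched by thresholding the top-k principal projections at zero (a minimal sketch; the toy data, dimensions, and zero threshold are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))  # toy data matrix: 1000 points, 8 dims

def pca_hash(X, k):
    """Threshold projections onto the top-k principal directions at zero."""
    Xc = X - X.mean(axis=0)                  # center the data
    cov = Xc.T @ Xc / len(Xc)                # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    W = eigvecs[:, -k:]                      # top-k (highest variance) directions
    return (Xc @ W > 0).astype(np.uint8)     # one bit per direction

codes = pca_hash(X, 4)
print(codes.shape)  # -> (1000, 4)
```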
20
Unsupervised: Spectral Hashing (Weiss et al. 2008)
Partition the data along the PCA directions
Essentially a balanced minimum cut problem (NP hard for a single bit partition)
Approximation through spectral relaxation (uniform distribution assumption)
Objective: minimize sum_{ij} W_ij ||y_i - y_j||^2 with y_i in {-1, +1}^k
subject to: sum_i y_i = 0 (balancing) and (1/n) sum_i y_i y_i^T = I (orthogonality)
21
Unsupervised: Spectral Hashing
Illustration of spectral hashing
Main steps:
1) extract projections by performing PCA on the data,
2) select projections (prefer those with large spread and small spatial frequency),
3) generate hash codes by thresholding a sinusoidal function
22
Unsupervised: Graph Hashing (Liu et al. 2011)
Graph is capable of capturing complex nonlinear structure
The same objective as spectral hashing
[Figure: graph hashing partitions data with nonlinear structure into +1 / -1 codes]
23
Unsupervised: Graph Hashing
Different solution: full graph construction and eigendecomposition are not scalable, so a scalable approximation (e.g. an anchor graph) is used instead
The same objective as spectral hashing
24
Unsupervised: Angular Quantization (Gong et al. 2012)
Data-independent angular quantization
The binary code of a data point is its nearest binary vertex on the hypercube
Data-dependent angular quantization
25
From Unsupervised to Supervised Hashing
Existing hashing methods mostly rely on random or principal projections, which are not compact and give insufficient accuracy
Simple metrics and features are usually not enough to express semantic similarity – semantic gap
Goal: to learn effective binary hash functions through incorporating supervised information
Five categories of objects from Caltech 101, 40 images for each category
26
Binary Reconstructive Embedding (Kulis & Darrell 2009)
Kernelized hashing function
Euclidean distance and the binary (Hamming) distance
Objective: minimize the difference of these two distance measures
Can be supervised method if using semantic distance/similarity
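The reconstruction objective can be sketched as follows; the exact scaling (half the squared Euclidean distance vs. Hamming distance normalized by the code length K) is an assumed convention, and the toy points and codes are illustrative:

```python
def bre_objective(X, B, pairs):
    """Binary reconstructive embedding objective (sketch): sum over pairs of
    the squared difference between the (scaled) original distance and the
    normalized Hamming distance of the codes."""
    K = len(B[0])
    loss = 0.0
    for i, j in pairs:
        d_orig = 0.5 * sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
        d_ham = sum(bi != bj for bi, bj in zip(B[i], B[j])) / K
        loss += (d_orig - d_ham) ** 2
    return loss

# Toy example where the codes reconstruct the distances exactly
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
B = [(0, 0), (1, 0), (0, 1)]
pairs = [(0, 1), (0, 2), (1, 2)]
print(bre_objective(X, B, pairs))  # -> 0.0
```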
27
RBMs Based Binary Coding (Torralba et al. 2008)
Restricted Boltzmann Machine (RBM)
Stacking RBMs into multiple layers – deep network (512-512-256-N)
The training process has two stages:
unsupervised pre-training and supervised fine tuning
energy: E(v, h) = - sum_i a_i v_i - sum_j b_j h_j - sum_{i,j} v_i w_ij h_j (weights w, offsets a, b)
Objective: maximize the expected log probability of the data (Hinton & Salakhutdinov, 2006)
28
Supervised Hashing with Kernels (Liu et al. 2012)
Pairwise similarity
Code inner product approximates pairwise similarity
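The code-inner-product idea can be illustrated as follows, assuming codes in {-1, +1}^K so that the normalized inner product lands in [-1, 1] and can match a pairwise similarity label:

```python
def code_similarity(ci, cj):
    """Normalized inner product of two {-1, +1} codes; ranges over [-1, 1].
    Identical codes -> +1, opposite codes -> -1."""
    K = len(ci)
    return sum(a * b for a, b in zip(ci, cj)) / K

ci = (1, -1, 1, 1)
print(code_similarity(ci, ci))                      # -> 1.0
print(code_similarity(ci, tuple(-b for b in ci)))   # -> -1.0
```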
29
Metric Learning Based Hashing (Jain et al. 2008)
Given a distance metric parameterized by M = G^T G, the generalized distance: d_M(x, y) = (x - y)^T M (x - y)
Generalized similarity measure: s_M(x, y) = x^T M y
Parameterized hash function: h(x) = sign(w^T G x), with w a random Gaussian vector
Collision probability: Pr[h(x) = h(y)] = 1 - (1/pi) arccos( s_M(x, y) / (||Gx|| ||Gy||) )
30
Semi-Supervised Hashing (Wang et al. 2010)
Different ways to preserve pairwise relationships
Besides minimizing empirical loss on labeled pairs (neighbor pairs get the same bit, non-neighbor pairs get different bits),
maximum entropy principle: regularize on all data so that each bit partitions the data in a balanced way
31
Hamming Metric Learning (Norouzi et al. 2012)
Supervised information as triplets
Triplet ranking loss
Objective: minimize regularized ranking loss
Optimization: stochastic gradient descent
32
Column Generation Hashing (Li et al. 2013)
Learn hashing functions and the weights of hash bits
Large-margin formulation
Column generation to iteratively learn hash bits and update the bit weights
weighted Hamming distance
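The weighted Hamming distance can be sketched directly: each differing bit contributes its learned weight instead of a constant 1 (the bit weights shown are hypothetical learned values):

```python
def weighted_hamming(a, b, weights):
    """Weighted Hamming distance: each differing bit contributes its weight."""
    return sum(w for x, y, w in zip(a, b, weights) if x != y)

weights = [0.5, 1.5, 1.0]  # hypothetical learned bit weights
print(weighted_hamming([0, 1, 1], [1, 1, 0], weights))  # bits 0 and 2 differ -> 1.5
```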
33
Ranking Supervised Hashing (Wang et al. 2013)
Preserve ranking list in the Hamming space
Triplet matrix representing ranking order
Ranking consistency
34
Outline
Background
Approximate nearest neighbor search
Tree and hashing for data indexing
Locality sensitive hashing
Learning to Hash:
Unsupervised hashing
Supervised hashing
Semi-supervised hashing (pointwise/pairwise/listwise)
Large Scale Active Learning with Hashing
Hyperplane hashing
Fast query selection with hashing
Summary and Discussion
35
Point-to-Point NN vs. Point-to-Hyperplane NN
Hyperplane hashing: aims to find the points nearest to a hyperplane
An efficient way for query selection in the active learning paradigm
Points nearest to the hyperplanes are the most uncertain ones
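Selecting the most uncertain point can be sketched as picking the point with the smallest distance |w . x| / ||w|| to the decision hyperplane (assuming a hyperplane through the origin; w and the points are illustrative):

```python
import math

def hyperplane_distance(w, x):
    """Point-to-hyperplane distance |w . x| / ||w|| (hyperplane through origin)."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    return abs(dot) / math.sqrt(sum(wi * wi for wi in w))

# Most uncertain point = the one closest to the decision hyperplane with normal w
w = (1.0, -1.0)
points = [(2.0, 0.0), (0.5, 0.4), (-1.0, 1.5)]
most_uncertain = min(points, key=lambda x: hyperplane_distance(w, x))
print(most_uncertain)  # -> (0.5, 0.4)
```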
37
Angle and Embedding Based Hyperplane Hashing (Jain et al. 2010)
Angle based hyperplane hashing
Embedding based hyperplane hashing
Figures from http://vision.cs.utexas.edu/projects/activehash/
collision probability
embedding
Distance in the embedded space is proportional to the distance in the original space
39
Analysis and Comparison
Collision probability of bilinear hyperplane hashing
Compare with angle-based and embedding-based
40
Active Learning for Big Data
Active learning aims to reduce annotation cost by connecting human annotators and prediction models
The key idea of active learning is to iteratively identify the points that are most ambiguous to the current prediction model
This requires exhaustive testing over all the data samples: at least linear complexity, which is not feasible for big data applications
Example and figure from “Active Learning Literature Survey” by Burr Settles
41
Active Learning with Hashing
The conceptual diagram
Two key components
Index unlabeled data into hash tables;
Compute the hash code of the current classifier and treat it as a query
Figures from http://vision.cs.utexas.edu/projects/activehash/
42
Empirical Study
20News group data (18,846 documents, 20 classes)
Starting with 5 randomly labeled documents per class
Perform 300 iterations of active learning with different query selection strategies
43
Active Learning with Pairwise Queries
Typical applications can be found in pairwise comparison based ranking (Jamieson & Nowak et al. 2011, Wauthier et al. 2013)
In an active learning setting, the system sends the annotator a pair of points and receives the relevance comparison result as supervision
Exhaustively selecting optimal sample pairs has quadratic complexity
Jamieson & Nowak 2011; Qian et al. 2013
44
Fast Pairwise Query Selection with Hashing
Key motivations:
The selected query pairs should have high relevance;
The order between the paired points should be uncertain
Two-step selection strategy
Relevance selection (point-to-point hashing);
Uncertain selection (point-to-hyperplane hashing):
45
Outline
Background
Approximate nearest neighbor search
Tree and hashing for data indexing
Locality sensitive hashing
Learning to Hash:
Unsupervised hashing
Supervised hashing
Semi-supervised hashing (pointwise/pairwise/listwise)
Large Scale Active Learning with Hashing
Hyperplane hashing
Fast query selection with hashing
Summary and Discussion
47
Summary and Trend in Learning to Hash
From data-independent to data-dependent
From task-independent to task-dependent
From simple supervision to complex supervision (pointwise -> pairwise -> triplet/listwise)
From linear methods to kernel based methods
From homogeneous data to heterogeneous data
From simple data to structured data
From point-to-point methods to point-to-hyperplane methods
From model driven to application driven
From single table to multiple tables