TRANSCRIPT
1
Large Scale Similarity Learning and Indexing Part II: Learning to Hash for Large Scale Search
Fei Wang and Jun Wang
IBM TJ Watson Research Center
2
Outline
Background
Approximate nearest neighbor search
Tree and hashing for data indexing
Locality sensitive hashing
Learning to Hash:
Unsupervised hashing
Supervised hashing
Semi-supervised hashing (pointwise/pairwise/listwise)
Large Scale Active Learning with Hashing
Hyperplane hashing
Fast query selection with hashing
Summary and Discussion
3
Motivation
Similarity based search has been popular in many applications
– Image/video search and retrieval: finding the most similar images/videos
– Audio search: find similar songs
– Product search: find shoes with similar style but different color
– Patient search: find patients with similar diagnostic status
Two key components:
– Similarity/distance measure
– Indexing scheme
Whittlesearch (Kovashka et al. 2013)
4
Nearest Neighbor Search (NNS)
Given a set of points X in a metric space and a query point q, find the closest point in X to q
k-nearest neighbor search: find the k closest points in X to q
Nearest neighbor search is a fundamental problem in many topics, including computational geometry, information retrieval, machine learning, data mining, computer vision, and so on
Time complexity: linear to the size of data
Also need to load the entire dataset in the memory
Example: 1 billion images with 10K-dim BOW features
Linear scan takes ~15 hrs;
Storage for such a dataset is ~40 TB
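The linear-scan baseline above can be sketched directly; its cost is O(n·d) per query, which is what motivates indexing (the toy points are illustrative):

```python
import math

def linear_scan_nn(points, query):
    """Exhaustive nearest-neighbor search: O(n * d) per query."""
    best, best_dist = None, math.inf
    for p in points:
        d = math.dist(p, query)  # Euclidean distance (Python 3.8+)
        if d < best_dist:
            best, best_dist = p, d
    return best, best_dist

points = [(0.0, 0.0), (1.0, 1.0), (3.0, 2.0)]
print(linear_scan_nn(points, (0.9, 1.2))[0])  # -> (1.0, 1.0)
```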
5
Approximate Nearest Neighbor
Instead of finding the exact nearest neighbors, return approximate nearest neighbors (Indyk, 03)
ANNs are reasonably good for many applications
Retrieving ANNs can be much faster (with sublinear complexity)
Tree and Hashing are two popular indexing schemes for fast ANN search
6
Tree-Based ANN Search
Recursively partition the data: Divide and Conquer
Search complexity is O(log n) (worst case could be O(n))
Inefficient for high dimensional data
Requires significant memory cost
[Figure: KD-tree — recursive axis-aligned partitions of the data]
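The recursive divide-and-conquer idea can be sketched as a minimal KD-tree with median splits and branch-and-bound search (a sketch, not an optimized implementation):

```python
import math

def build_kdtree(points, depth=0):
    """Recursively split the points on alternating axes at the median."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def kdtree_nn(node, query, best=None):
    """Branch-and-bound descent; prunes the far subtree when the splitting
    plane is farther away than the best distance found so far."""
    if node is None:
        return best
    d = math.dist(node["point"], query)
    if best is None or d < best[1]:
        best = (node["point"], d)
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = ("left", "right") if diff < 0 else ("right", "left")
    best = kdtree_nn(node[near], query, best)
    if abs(diff) < best[1]:  # the far side could still hold a closer point
        best = kdtree_nn(node[far], query, best)
    return best

tree = build_kdtree([(0.0, 0.0), (1.0, 1.0), (3.0, 2.0)])
print(kdtree_nn(tree, (0.9, 1.2))[0])  # -> (1.0, 1.0)
```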
7
Various Tree-Based Methods
Different ways to construct the tree structure
KD-tree Ball-tree
PCA-tree Random Projection-tree
[Figure: PCA-tree partitions along the 1st and 2nd principal components]
8
Hashing-Based ANN Search
Repeatedly partition the data
Each item in database represented as a hash code
Significantly reduced storage cost
for 1 billion images: 40 TB features -> 8GB hash codes
Search complexity: constant time or sublinear
linear scan over 1 billion images: 15 hrs -> 13 sec
Each database item is represented by a k-bit code:

        x1    x2    x3    x4    x5
  h1     0     1     1     0     1
  h2     1     0     1     0     1
  ...   ...   ...   ...   ...   ...
  hk    ...   ...   ...   ...   ...
  code  010…  100…  111…  001…  110…
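The table above suggests a compact representation: pack each item's bits into an integer, so Hamming distance becomes an XOR plus a popcount (the toy codes follow the table's first bits):

```python
def hamming(a, b):
    """Hamming distance between two binary codes stored as integers."""
    return bin(a ^ b).count("1")

# k-bit codes packed into plain integers: 1 billion 64-bit codes fit in ~8 GB
codes = {"x1": 0b010, "x2": 0b100, "x3": 0b111, "x4": 0b001, "x5": 0b110}
print(hamming(codes["x1"], codes["x3"]))  # 010 vs 111 -> 2
```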
9
Hashing: Training Step
Design models (hash functions) for computing hash codes
a general linear-projection-based hash function family: h_k(x) = sign(w_k^T x + b_k)
Estimate the model parameters (the projections w_k and thresholds b_k)
10
Hashing: Indexing Step
Compute the hash codes for each database item
Organize all the codes in hash tables (Inverse look-up)
database items
hash codes
database items
Inverse look-up
hash bucket
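The inverse look-up above can be sketched with a plain dictionary mapping each hash code (bucket) to the database items that fall into it (the toy codes are illustrative):

```python
from collections import defaultdict

# Inverse look-up: map each hash code (bucket) to the database items in it
table = defaultdict(list)
db_codes = {"x1": "011", "x2": "101", "x3": "111", "x4": "000", "x5": "111"}
for item, code in db_codes.items():
    table[code].append(item)

print(table["111"])  # -> ['x3', 'x5']
```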
11
Hashing: Query Step (Hash Lookup)
Compute the hash codes for the query point
Return the points within a small radius to the query in the Hamming space
The number of hash codes within Hamming radius r of a k-bit code is sum_{i=0}^{r} C(k, i)
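The bucket count within a Hamming ball can be checked by enumeration (storing codes as integers is an implementation assumption):

```python
from itertools import combinations
from math import comb

def neighbors_within_radius(code, k, r):
    """All k-bit codes within Hamming radius r of `code` (an int)."""
    out = []
    for d in range(r + 1):
        for bits in combinations(range(k), d):  # choose d bit positions to flip
            flipped = code
            for b in bits:
                flipped ^= 1 << b
            out.append(flipped)
    return out

# |{codes within radius r}| = sum_{i=0}^{r} C(k, i)
print(len(neighbors_within_radius(0b0000, 4, 1)), sum(comb(4, i) for i in range(2)))  # -> 5 5
```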
12
Hashing: Query Step (Hamming Ranking)
Hamming distance: the number of different bits between two hash codes
Rank the database items using their Hamming distance to the query’s hash code
Generate a ranked list
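Hamming ranking can be sketched as sorting database codes by their distance to the query code (the toy codes are illustrative):

```python
def hamming(a, b):
    """Hamming distance between integer-packed codes."""
    return bin(a ^ b).count("1")

db = {"x1": 0b010, "x2": 0b100, "x3": 0b111, "x4": 0b001, "x5": 0b110}
query_code = 0b110

# Rank all items by Hamming distance to the query's code
ranked = sorted(db, key=lambda item: hamming(db[item], query_code))
print(ranked)  # -> ['x5', 'x1', 'x2', 'x3', 'x4']
```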
13
A Conceptual Diagram for Hashing Based Image Search System
Indexing and Search
Image Database
Similarity Search & Retrieval
Hash Function Design
Visual Search Applications
Reranking Refinement
Designing compact yet accurate hash codes is critical to making the search effective
15
Locality Sensitive Hashing
Single hash bit: h(x) = sign(w^T x), with w a random projection
Hash table built from codes of bit length K
Collision probability: Pr[h(x) = h(y)] = 1 - theta(x, y) / pi
High dot product (small angle): unlikely to be split; lower dot product: likely to be split
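The collision behavior of a single random-hyperplane bit can be checked empirically: two vectors at angle theta are split by a random hyperplane with probability theta/pi, a standard LSH fact. A minimal simulation (the 60-degree example vectors and the trial count are illustrative choices):

```python
import math
import random

random.seed(0)

def lsh_bit(w, x):
    """One random-hyperplane hash bit: h(x) = sign(w . x)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0

# Two unit vectors at a 60-degree angle
x = (1.0, 0.0)
y = (math.cos(math.pi / 3), math.sin(math.pi / 3))

trials = 20000
hits = 0
for _ in range(trials):
    w = (random.gauss(0, 1), random.gauss(0, 1))  # random hyperplane normal
    hits += lsh_bit(w, x) == lsh_bit(w, y)

# Theory: Pr[h(x) = h(y)] = 1 - theta/pi = 1 - (pi/3)/pi = 2/3
print(hits / trials)
```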
16
Outline
Background
Approximate nearest neighbor search
Tree and hashing for data indexing
Locality sensitive hashing
Learning to Hash:
Unsupervised hashing
Supervised hashing
Semi-supervised hashing (pointwise/pairwise/listwise)
Large Scale Active Learning with Hashing
Hyperplane hashing
Fast query selection with hashing
Summary and Discussion
17
Overview: Learning-Based Hashing Techniques
Unsupervised: only use the property of unlabeled data (data-dependent)
Spectral hashing (SH, Weiss et al. NIPS 2008)
Kernelized methods (Kulis et al. ICCV 2009)
Graph hashing (Liu et al. ICML 2011)
Isotropic hashing (Kong et al. NIPS 2012)
Angular quantization hashing (Gong et al. NIPS 2012)
Supervised: use labeled data (task-dependent)
Deep learning based (Torralba et al. CVPR 2008)
Binary reconstructive embedding (Kulis et al. NIPS 2009)
Supervised kernel method (Liu et al. CVPR 2012)
Minimal loss hashing (Norouzi & Fleet ICML 2011)
Semi-supervised: use both labeled and unlabeled data
Metric learning based (Jain et al. CVPR 2008)
Semi-supervised hashing (Wang et al. CVPR 2010, PAMI 2012)
Sequential hashing (Wang et al. ICML 2011)
18
Overview: Advanced (Other) Hashing Techniques
Triplet and Listwise Hashing
Hamming metric learning based hashing (Norouzi et al. NIPS 2012)
Order preserving hashing (Wang et al. ACM MM 2013)
Column generation hashing (Li et al. ICML 2013)
Ranking supervised hashing (Wang et al. ICCV 2013)
Hyperplane Hashing and Active Learning
Angle & embedding hyperplane hashing (Jain et al. NIPS 2010)
Bilinear hashing (Liu et al. ICML 2012)
Fast pairwise query selection (Qian et al. ICDM 2013)
Hashing for complex data sources
Heterogeneous hashing (Ou et al. KDD 2013)
Structured hashing (Ye et al. ICCV 2013)
Multiple feature hashing (Song et al. ACM MM 2011)
Composite hashing (Zhang et al. ACM SIGIR 2011)
Submodular hashing (Cao et al. ACM MM 2012)
19
Unsupervised: PCA Hashing
Partition the data along the PCA directions
Projections with high variance are more reliable
Low-variance projections are very noisy
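PCA hashing as described can be sketched by thresholding the top-k principal projections at zero (a minimal sketch; the toy data, dimensions, and zero threshold are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))  # toy data matrix: 1000 points, 8 dims

def pca_hash(X, k):
    """Threshold projections onto the top-k principal directions at zero."""
    Xc = X - X.mean(axis=0)                  # center the data
    cov = Xc.T @ Xc / len(Xc)                # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    W = eigvecs[:, -k:]                      # top-k (highest variance) directions
    return (Xc @ W > 0).astype(np.uint8)     # one bit per direction

codes = pca_hash(X, 4)
print(codes.shape)  # -> (1000, 4)
```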
20
Unsupervised: Spectral Hashing (Weiss et al. 2008)
Partition the data along the PCA directions
Essentially a balanced minimum cut problem (NP hard for a single bit partition)
Approximation through spectral relaxation (uniform distribution assumption)
Objective: minimize sum_{ij} W_ij ||y_i - y_j||^2 with y_i in {-1, +1}^k
subject to: sum_i y_i = 0 (balancing) and (1/n) sum_i y_i y_i^T = I (orthogonality)
21
Unsupervised: Spectral Hashing
Illustration of spectral hashing
Main steps:
1) extract projections by performing PCA on the data,
2) select projections (prefer those with large spread and small spatial frequency),
3) generate hash codes by thresholding a sinusoidal function
22
Unsupervised: Graph Hashing (Liu et al. 2011)
Graph is capable of capturing complex nonlinear structure
The same objective as spectral hashing
[Figure: graph hashing partitions data with nonlinear structure into +1 / -1 codes]
23
Unsupervised: Graph Hashing
Different solution: full graph construction and eigendecomposition are not scalable, so a scalable approximation (e.g. an anchor graph) is used instead
The same objective as spectral hashing
24
Unsupervised: Angular Quantization (Gong et al. 2012)
Data-independent angular quantization
The binary code of a data point is its nearest binary vertex on the hypercube
Data-dependent angular quantization
25
From Unsupervised to Supervised Hashing
Existing hashing methods mostly rely on random or principal projections, which are not compact and give insufficient accuracy
Simple metrics and features are usually not enough to express semantic similarity – semantic gap
Goal: to learn effective binary hash functions through incorporating supervised information
Five categories of objects from Caltech 101, 40 images for each category
26
Binary Reconstructive Embedding (Kulis & Darrell 2009)
Kernelized hashing function
Euclidean distance and the binary (Hamming) distance
Objective: minimize the difference of these two distance measures
Can be supervised method if using semantic distance/similarity
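The reconstruction objective can be sketched as follows; the exact scaling (half the squared Euclidean distance vs. Hamming distance normalized by the code length K) is an assumed convention, and the toy points and codes are illustrative:

```python
def bre_objective(X, B, pairs):
    """Binary reconstructive embedding objective (sketch): sum over pairs of
    the squared difference between the (scaled) original distance and the
    normalized Hamming distance of the codes."""
    K = len(B[0])
    loss = 0.0
    for i, j in pairs:
        d_orig = 0.5 * sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
        d_ham = sum(bi != bj for bi, bj in zip(B[i], B[j])) / K
        loss += (d_orig - d_ham) ** 2
    return loss

# Toy example where the codes reconstruct the distances exactly
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
B = [(0, 0), (1, 0), (0, 1)]
pairs = [(0, 1), (0, 2), (1, 2)]
print(bre_objective(X, B, pairs))  # -> 0.0
```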
27
RBMs Based Binary Coding (Torralba et al. 2008)
Restricted Boltzmann Machine (RBM)
Stacking RBMs into multiple layers – deep network (512-512-256-N)
The training process has two stages:
unsupervised pre-training and supervised fine tuning
energy: E(v, h) = - sum_i a_i v_i - sum_j b_j h_j - sum_{i,j} v_i w_ij h_j (weights w, offsets a, b)
Objective: maximize the expected log probability of the data (Hinton & Salakhutdinov, 2006)
28
Supervised Hashing with Kernels (Liu et al. 2012)
Pairwise similarity
Code inner product approximates pairwise similarity
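The code-inner-product idea can be illustrated as follows, assuming codes in {-1, +1}^K so that the normalized inner product lands in [-1, 1] and can match a pairwise similarity label:

```python
def code_similarity(ci, cj):
    """Normalized inner product of two {-1, +1} codes; ranges over [-1, 1].
    Identical codes -> +1, opposite codes -> -1."""
    K = len(ci)
    return sum(a * b for a, b in zip(ci, cj)) / K

ci = (1, -1, 1, 1)
print(code_similarity(ci, ci))                      # -> 1.0
print(code_similarity(ci, tuple(-b for b in ci)))   # -> -1.0
```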
29
Metric Learning Based Hashing (Jain et al. 2008)
Given a distance metric parameterized by M = G^T G, the generalized distance: d_M(x, y) = (x - y)^T M (x - y)
Generalized similarity measure: s_M(x, y) = x^T M y
Parameterized hash function: h(x) = sign(w^T G x), with w a random Gaussian vector
Collision probability: Pr[h(x) = h(y)] = 1 - (1/pi) arccos( s_M(x, y) / (||Gx|| ||Gy||) )
30
Semi-Supervised Hashing (Wang et al. 2010)
Different ways to preserve pairwise relationships
Besides minimizing empirical loss on labeled pairs (neighbor pairs get the same bit, non-neighbor pairs get different bits),
maximum entropy principle: regularize on all data so that each bit partitions the data in a balanced way
31
Hamming Metric Learning (Norouzi et al. 2012)
Supervised information as triplets
Triplet ranking loss
Objective: minimize regularized ranking loss
Optimization: stochastic gradient descent
32
Column Generation Hashing (Li et al. 2013)
Learn hashing functions and the weights of hash bits
Large-margin formulation
Column generation to iteratively learn hash bits and update the bit weights
weighted Hamming distance
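The weighted Hamming distance can be sketched directly: each differing bit contributes its learned weight instead of a constant 1 (the bit weights shown are hypothetical learned values):

```python
def weighted_hamming(a, b, weights):
    """Weighted Hamming distance: each differing bit contributes its weight."""
    return sum(w for x, y, w in zip(a, b, weights) if x != y)

weights = [0.5, 1.5, 1.0]  # hypothetical learned bit weights
print(weighted_hamming([0, 1, 1], [1, 1, 0], weights))  # bits 0 and 2 differ -> 1.5
```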
33
Ranking Supervised Hashing (Wang et al. 2013)
Preserve ranking list in the Hamming space
Triplet matrix representing ranking order
Ranking consistency
34
Outline
Background
Approximate nearest neighbor search
Tree and hashing for data indexing
Locality sensitive hashing
Learning to Hash:
Unsupervised hashing
Supervised hashing
Semi-supervised hashing (pointwise/pairwise/listwise)
Large Scale Active Learning with Hashing
Hyperplane hashing
Fast query selection with hashing
Summary and Discussion
35
Point-to-Point NN vs. Point-to-Hyperplane NN
Hyperplane hashing: aims to find the points nearest to a hyperplane
An efficient way for query selection in the active learning paradigm
Points nearest to the hyperplanes are the most uncertain ones
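Selecting the most uncertain point can be sketched as picking the point with the smallest distance |w . x| / ||w|| to the decision hyperplane (assuming a hyperplane through the origin; w and the points are illustrative):

```python
import math

def hyperplane_distance(w, x):
    """Point-to-hyperplane distance |w . x| / ||w|| (hyperplane through origin)."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    return abs(dot) / math.sqrt(sum(wi * wi for wi in w))

# Most uncertain point = the one closest to the decision hyperplane with normal w
w = (1.0, -1.0)
points = [(2.0, 0.0), (0.5, 0.4), (-1.0, 1.5)]
most_uncertain = min(points, key=lambda x: hyperplane_distance(w, x))
print(most_uncertain)  # -> (0.5, 0.4)
```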
37
Angle and Embedding Based Hyperplane Hashing (Jain et al. 2010)
Angle based hyperplane hashing
Embedding based hyperplane hashing
Figures from http://vision.cs.utexas.edu/projects/activehash/
collision probability
embedding
Distance in the embedded space is proportional to the distance in the original space
39
Analysis and Comparison
Collision probability of bilinear hyperplane hashing
Compare with angle-based and embedding-based
40
Active Learning for Big Data
Active learning aims to reduce annotation cost by connecting human annotators and prediction models
The key idea of active learning is to iteratively identify the points that are most ambiguous to the current prediction model
This requires exhaustive testing over all the data samples: at least linear complexity, which is not feasible for big data applications
Example and figure from “Active Learning Literature Survey” by Burr Settles
41
Active Learning with Hashing
The conceptual diagram
Two key components
Index unlabeled data into hash tables;
Compute the hash code of the current classifier and treat it as a query
Figures from http://vision.cs.utexas.edu/projects/activehash/
42
Empirical Study
20News group data (18,846 documents, 20 classes)
Starting with 5 randomly labeled documents per class
Perform 300 iterations of active learning with different query selection strategies
43
Active Learning with Pairwise Queries
Typical applications can be found in pairwise comparison based ranking (Jamieson & Nowak et al. 2011, Wauthier et al. 2013)
In an active learning setting, the system sends the annotator a pair of points and receives the relevance comparison result as supervision
Exhaustively selecting optimal sample pairs has quadratic complexity
Jamieson & Nowak 2011; Qian et al. 2013
44
Fast Pairwise Query Selection with Hashing
Key motivations:
The selected query pairs should have high relevance;
The order between the paired points should be uncertain
Two-step selection strategy
Relevance selection (point-to-point hashing);
Uncertain selection (point-to-hyperplane hashing):
45
Outline
Background
Approximate nearest neighbor search
Tree and hashing for data indexing
Locality sensitive hashing
Learning to Hash:
Unsupervised hashing
Supervised hashing
Semi-supervised hashing (pointwise/pairwise/listwise)
Large Scale Active Learning with Hashing
Hyperplane hashing
Fast query selection with hashing
Summary and Discussion
47
Summary and Trend in Learning to Hash
From data-independent to data-dependent
From task-independent to task-dependent
From simple supervision to complex supervision (pointwise -> pairwise -> triplet/listwise)
From linear methods to kernel based methods
From homogeneous data to heterogeneous data
From simple data to structured data
From point-to-point methods to point-to-hyperplane methods
From model driven to application driven
From single table to multiple tables