IIIT Hyderabad

Document Image Retrieval using Bag of Visual Words Model

Ravi Shekhar

CVIT, IIIT Hyderabad

Advisor : Prof. C.V. Jawahar

Motivation

• A large number of printed books have been digitized
• Digital libraries such as the Universal Digital Library (UDL), the Digital Library of India (DLI), and Google Books
• Efficient and effective methods are needed for content-level access

[Figure: Digital Library Database]

Process Overview

[Figure: process overview. Blocks: Scanning, Documents, Processing, Index, Database, Input Query, Matching, Retrieved Documents]

Matching can be done at two levels: “Text” and “Image”

Matching Approaches

• Recognition Based Approach (Text Level Matching)
  • Optical Character Recognition (OCR)
• Recognition Free Approach (Image Level Matching)
  • Word Spotting

Recognition Based Approach

• Optical Character Recognition (OCR)
  • Binarization of the document
  • Segmentation using connected components
    • Line level
    • Word level
    • Character level
  • Character recognition using features such as patch, profile, etc.
  • Classification using ANN or SVM

Limitations of Recognition Based Approach

• Cuts
• Merges
• Variation in script
• Variation in font and typesetting
• Underlined and overwritten text

Recognition Free Approach

• Word Spotting
  • Word images are represented using global (profile) features
  • Features are matched using distance measures such as L1 and L2
  • Word images of different sizes are compared using Dynamic Time Warping (DTW)
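The DTW comparison used in word spotting can be sketched as follows. This is a minimal illustration, assuming each word image has already been turned into a sequence of per-column feature vectors and that the local cost is the L2 distance; the function name `dtw_distance` and the unconstrained warping path are assumptions, not details given on the slides.

```python
from math import inf

def dtw_distance(seq_a, seq_b):
    """DTW alignment cost between two sequences of feature vectors.

    The sequences may have different lengths, which is what lets word
    images of different widths be compared.
    """
    def l2(u, v):
        return sum((p - q) ** 2 for p, q in zip(u, v)) ** 0.5

    n, m = len(seq_a), len(seq_b)
    # cost[i][j] = best cost of aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = l2(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three allowed predecessor cells.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

The table has one cell per pair of columns, so the cost of a single comparison grows with the product of the two word widths, which is consistent with the slides' remark that word-spotting methods are slow for real-time use.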

Why a Recognition Free Approach?

• Robust OCRs are unavailable for many non-Latin languages
• These languages have a rich heritage, and there is a need for content-level search
• Word-spotting-based methods are too slow for real-time systems
• Most existing retrieval methods are memory intensive
• Scalability is an immediate challenge

Word Image Retrieval using Bag of Visual Words


Bag of Visual Words (BoVW)

• Bag of Words (BoW) is the most popular representation for text retrieval
• Efficient BoW-based systems such as Lucene are publicly available
• Bag of Visual Words (BoVW) performs very well for image and video retrieval
• BoVW-based systems are flexible, powerful, and scalable to billions of images

BoVW Representation

• Word images are represented using a histogram of visual words
• Codebook generation
  • A subset of the images is used
  • Clustering is done using Hierarchical K-Means (HKM)
  • HKM is faster than K-Means both in building the tree and in finding nearest neighbours
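The histogram representation can be sketched as follows. This is a simplified illustration that assumes a flat codebook given as a list of cluster centres and uses brute-force nearest-centre search; the slides use a Hierarchical K-Means tree for this step, and the names `nearest_word` and `bovw_histogram` are hypothetical.

```python
def nearest_word(descriptor, codebook):
    """Index of the codebook centre closest (squared L2) to the descriptor."""
    def sq_dist(u, v):
        return sum((p - q) ** 2 for p, q in zip(u, v))
    return min(range(len(codebook)), key=lambda k: sq_dist(descriptor, codebook[k]))

def bovw_histogram(descriptors, codebook):
    """Normalized histogram of visual words for one word image."""
    hist = [0.0] * len(codebook)
    for d in descriptors:
        hist[nearest_word(d, codebook)] += 1.0
    total = sum(hist)
    # The histogram length equals the codebook size, whatever the image width.
    return [h / total for h in hist] if total else hist
```

For example, `bovw_histogram([[1, 0], [0, 1], [9, 9], [1, 1]], [[0, 0], [10, 10]])` assigns three descriptors to the first visual word and one to the second.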

BoVW based Representation

[Figures: histogram of visual words for a word image; examples with cuts and with merges]

Proposed Architecture

Advantages of BoVW based Representation

• Fixed-size representation
• Robust against degradation

[Figure: clean word images compared with versions containing cuts and merges]

• Scalable to billions of images
• Language independent

Spatial Verification

• Lost geometry: the BoVW histogram discards the spatial arrangement of the visual words
• Spatial verification is applied to the retrieved results to compensate

[Figure: clean word images used to illustrate the lost geometry]

Re-ranking

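• SIFT based re-ranking
• The higher the total score, the better the match

$$\mathrm{Total\ Score}(I_i, I_j) = \mathrm{Score}(I_i, I_j) + \sum_{k=1}^{3} \mathrm{Score}(I_i^k, I_j^k)$$

where
  Score(I_i, I_j): score for the entire image
  Score(I_i^k, I_j^k): score for the k-th part of the image

Each score is the number of match points normalized by the number of SIFT points in I_i and in I_j (the operator combining the two counts is not legible in the source).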

Experiments

Books Used in the Experiments

Language    #Books  #Pages  #Words
Hindi       4       427     112677
Malayalam   6       610     108767
Telugu      5       742     131156
Bangla      3       363     124584
Hindi       32      3992    1008138

Quantitative Results

Performance Statistics (mAP)

Language    #Images  #Query  mAP     mAP after Re-ranking  mAP after Spatial Verification
Hindi       112677   138     0.6808  0.7820                0.7865
Malayalam   108767   101     0.6962  0.7991                0.8188
Telugu      131156   131     0.6483  0.7328                0.7495
Bangla      124584   125     0.7806  0.8766                0.8947
Hindi       1008138  138     0.5895  0.7022                0.7062


Performance Statistics (Prec@10)

Language    #Images  #Query  Prec@10  Prec@10 after Re-ranking  Prec@10 after Spatial Verification
Hindi       112677   138     0.8437   0.8719                    0.8770
Malayalam   108767   101     0.7668   0.8328                    0.8581
Telugu      131156   131     0.8507   0.8668                    0.883
Bangla      124584   125     0.8498   0.9022                    0.9182
Hindi       1008138  138     0.8059   0.8509                    0.8543
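The two measures reported in the tables above, mAP and Prec@10, can be computed as in the sketch below. It follows the common textbook definitions; the slides do not state the exact evaluation protocol, so the handling of ties and of queries with few relevant results is an assumption, as are the function names.

```python
def precision_at_k(relevant_flags, k=10):
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(relevant_flags[:k]) / k

def average_precision(relevant_flags, num_relevant):
    """Mean of the precision values taken at each relevant item in the ranking."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevant_flags, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / num_relevant if num_relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (relevant_flags, num_relevant), one entry per query."""
    return sum(average_precision(flags, n) for flags, n in runs) / len(runs)
```

Here `relevant_flags` is the ranked result list of one query, with 1 marking a correct word image and 0 an incorrect one.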


• mAP vs. query length
• The more characters in the query, the better the results

[Figure: mAP vs. query length]


Retrieval Time and Index Size

#Images  Retrieval Time  Index Size
25K      50 ms           28 MB
100K     209 ms          130 MB
0.5M     411 ms          550 MB
1M       700 ms          1.2 GB

Qualitative Results

[Figure: query word images and their retrieved results]

• Sample output for noisy images where commercial OCR fails

Enhancement over Bag of Visual Words based Word Image Retrieval


Query Expansion

• Observation: top-ranked results are usually correct
• The top-k results are used to form a new query
• Improves the precision of the retrieved list
• Modified average query expansion
  ─ Instead of giving equal weight to each of the top-k results, a rank-based weight (1/2rank) is given
• Improves mAP and Prec@10 by 2%
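The rank-weighted expansion can be sketched as below. Two points are assumptions rather than facts from the slides: the weight is read as 1/2^rank (the slide text "1/2rank" could also be read as 1/(2·rank), so the weight is a replaceable parameter), and the original query histogram is kept in the average with weight 1. The function name `expand_query` is hypothetical.

```python
def expand_query(query_hist, ranked_hists, k=5, weight=lambda rank: 1.0 / 2 ** rank):
    """Build an expanded query histogram from the top-k retrieved histograms.

    ranked_hists[0] is the histogram of the rank-1 result, and so on.
    """
    expanded = list(query_hist)  # assumption: the original query keeps weight 1
    total_w = 1.0
    for rank, hist in enumerate(ranked_hists[:k], start=1):
        w = weight(rank)
        total_w += w
        for b, v in enumerate(hist):
            expanded[b] += w * v
    # Weighted average, so the result stays on the same scale as the input.
    return [v / total_w for v in expanded]
```

Passing `weight=lambda rank: 1.0` gives plain average query expansion, which makes it easy to compare the two weighting schemes on the same ranked list.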

[Figure: query expansion pipeline. The query image is converted to a query histogram and used to query the index; the top-ranked results (Rank 1 to Rank 6) are combined into a refined, expanded query histogram, which is used to query the index again. The previous results are shown next to the modified results.]

Text Query Support

• BoVW retrieval was originally formulated in a “query by example” setting, but users would prefer a textual interface for document image collections
• We propose a novel and simple framework for text query support
  • A small subset of the data with ground truth, covering all possible characters in a particular language, is used
  • Visual words are learnt specific to each character and averaged across its different variations
  • Given a textual query, we synthesize its BoVW histogram
• Text query results are comparable to word image results
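One way to picture the synthesis step is the sketch below. It assumes annotated samples arrive as (character, visual-word ids) pairs and that a query histogram is formed by summing the averaged per-character histograms and normalizing; the slides only say the histogram is "synthesized", so the summation rule and both function names are assumptions.

```python
def learn_char_histograms(samples, vocab_size):
    """samples: iterable of (character, list of visual-word ids) from annotated data.

    Returns, for each character, its visual-word counts averaged over all
    of its annotated occurrences.
    """
    sums, counts = {}, {}
    for ch, word_ids in samples:
        hist = sums.setdefault(ch, [0.0] * vocab_size)
        for w in word_ids:
            hist[w] += 1.0
        counts[ch] = counts.get(ch, 0) + 1
    return {ch: [v / counts[ch] for v in hist] for ch, hist in sums.items()}

def synthesize_query(text, char_hists, vocab_size):
    """Sum the per-character histograms of the query text and normalize."""
    query = [0.0] * vocab_size
    for ch in text:
        for b, v in enumerate(char_hists[ch]):
            query[b] += v
    total = sum(query)
    return [v / total for v in query] if total else query
```

The synthesized histogram has the same length as an image histogram, so it can be sent to the same index without changing the retrieval step.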

[Figure: in the query-by-example setting, the input query image is converted to a histogram; with text query support, the input text query is converted to a text query histogram]

Qualitative Results

Sample output for queries using different techniques


Vector Quantization

• In Vector Quantization (VQ), each feature vector is mapped to a single visual word (VW), i.e., hard assignment

$$\arg\min_{C} \sum_{i=1}^{N} \| x_i - B c_i \|^2 \quad \text{s.t. } \|c_i\|_{\ell^0} = 1,\ \|c_i\|_{\ell^1} = 1,\ c_i \ge 0,\ \forall i$$

where
  x_i : descriptor
  c_i : code
  B : codebook

• Problems with VQ
  • Visual word uncertainty: a single VW is chosen even when 2 or more are equally plausible
  • Visual word plausibility: a VW is assigned even when the vocabulary has no suitable candidate
• Solution: Soft Assignment
  • Map each feature vector to 2 or more possible VWs

[Figure: input descriptor and its neighbouring visual words]

Soft Assignment

• Map each feature vector to 2 or more possible VWs
• Approaches to Soft Assignment
  • Distance based
    • Equal weight
    • Based on distance in feature space
    • Gaussian distance
    • Does not minimize reconstruction error
  • Through learning optimal reconstruction
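The Gaussian distance variant of soft assignment can be sketched as follows. The number of neighbours, the bandwidth `sigma`, and the function name `soft_assign` are illustrative assumptions; the slides do not give parameter values.

```python
from math import exp

def soft_assign(descriptor, codebook, num_neighbours=2, sigma=1.0):
    """Gaussian-weighted assignment of one descriptor to its nearest visual words.

    Returns {visual_word_index: weight}, with the weights summing to 1.
    """
    sq = [sum((p - q) ** 2 for p, q in zip(descriptor, c)) for c in codebook]
    nearest = sorted(range(len(codebook)), key=lambda k: sq[k])[:num_neighbours]
    base = sq[nearest[0]]
    # Subtracting the smallest distance keeps the largest weight at exp(0) = 1,
    # so the normalizer never underflows to zero.
    weights = {k: exp(-(sq[k] - base) / (2.0 * sigma ** 2)) for k in nearest}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}
```

A descriptor lying halfway between two visual words receives weight 0.5 for each, which is the uncertainty case that hard assignment cannot express. As the slide notes, weights chosen this way are not tuned to minimize reconstruction error.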

Locality-constrained Linear Coding (LLC)

• Similar patches should have similar codes
• The locality of visual words is used to describe the feature vector

$$\arg\min_{C} \sum_{i=1}^{N} \| x_i - B c_i \|^2 + \lambda \| d_i \odot c_i \|^2 \quad \text{s.t. } \mathbf{1}^{T} c_i = 1,\ \forall i$$

where ⊙ is element-wise multiplication and d_i = exp(dist(x_i, B))

• LLC coding process
  • Find the K nearest neighbours of x_i, denoted B_i (a local subset of the codebook B)
  • Reconstruct x_i using B_i
  • Replace the input x_i with the non-zero code obtained in the previous step
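The three-step coding process can be sketched with NumPy as below. It follows the commonly used approximated LLC solution, in which the reconstruction weights over the K nearest visual words are found by solving a small regularized linear system under the sum-to-one constraint. The values of `k` and `reg` and the function name `llc_code` are assumptions, not parameters reported on the slides.

```python
import numpy as np

def llc_code(x, codebook, k=2, reg=1e-4):
    """Approximate LLC code of descriptor x: non-zero only on its k nearest words."""
    x = np.asarray(x, dtype=float)
    B = np.asarray(codebook, dtype=float)
    # Step 1: find the k nearest visual words of x.
    idx = np.argsort(np.sum((B - x) ** 2, axis=1))[:k]
    # Step 2: reconstruct x from those words subject to sum(w) == 1.
    z = B[idx] - x                      # neighbours shifted so x is the origin
    C = z @ z.T                         # local covariance (k x k)
    C += (reg * np.trace(C) + 1e-12) * np.eye(k)   # regularize for stability
    w = np.linalg.solve(C, np.ones(k))
    w /= w.sum()
    # Step 3: place the weights into a full-length, mostly zero code.
    code = np.zeros(len(B))
    code[idx] = w
    return code
```

For a descriptor halfway between two visual words, for example `llc_code([0.5], [[0.0], [1.0], [5.0]])`, the code puts equal weight on the two nearest words and zero on the distant one, and `code @ B` reconstructs the descriptor.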

Re-ranking

• SIFT based re-ranking [1]
• Longest common subsequence (LCS) based re-ranking [2]
  • Size of the LCS of visual words projected on the x-axis
  • The larger the LCS, the better the match
• Linear combination [2]
  Final Score = λ * Index_Score + (1 − λ) * Re-ranking_Score, where λ is a weighting parameter

[Figure: visual words V1, V2, V6, V4, V4, V8, V9 of a word image plotted by position (x, y) and projected on the x-axis]

1. Ravi Shekhar, C. V. Jawahar: Word Image Retrieval Using Bag of Visual Words. DAS 2012
2. Ismet Zeki Yalniz, R. Manmatha: An Efficient Framework for Searching Text in Noisy Document Images. DAS 2012
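LCS-based re-ranking and the linear combination can be sketched as below. The input format (a list of (x, visual word id) pairs per word image), the default `lam=0.5`, and the function names are assumptions; the slides also do not say how the raw LCS size is scaled before it is combined with the index score.

```python
def words_along_x(keypoints):
    """keypoints: list of (x, visual_word_id); returns the ids ordered by x."""
    return [w for _, w in sorted(keypoints, key=lambda p: p[0])]

def lcs_length(a, b):
    """Length of the longest common subsequence of two visual-word sequences."""
    prev = [0] * (len(b) + 1)
    for wa in a:
        cur = [0]
        for j, wb in enumerate(b, start=1):
            # Either extend a match diagonally or carry over the best so far.
            cur.append(prev[j - 1] + 1 if wa == wb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def final_score(index_score, rerank_score, lam=0.5):
    """Final Score = lam * Index_Score + (1 - lam) * Re-ranking_Score."""
    return lam * index_score + (1.0 - lam) * rerank_score
```

Because the LCS keeps the left-to-right order of visual words, it recovers part of the geometry that the plain histogram discards.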

Dataset Used

Books Used for the Experiments

Book          #Pages  #Words
Telugu-1716   120     4121
Telugu-1718   100     21345
English-1601  363     113008


LLC Based Statistics (mAP)

Book          BoVW    BoVW + SIFT Re-ranking  BoVW + LCS Re-ranking  LLC     LLC + LCS Re-ranking
Telugu-1716   0.8173  0.8645                  0.9036                 0.91    0.95
Telugu-1718   0.7834  0.8861                  0.918                  0.92    0.96
English-1601  0.8015  0.8531                  0.92                   0.8765  0.9451


Text Query Based Statistics

Book          Method      mAP
Telugu-1716   Text Query  0.8413
Telugu-1718   Text Query  0.90
English-1601  Text Query  0.87

Patch Based Word Image Retrieval

• A feature is designed based on patches
• Each patch is represented using profile features
• Profile features
  • Projection profile: measures the ink distribution of the word image
  • Ink transition: measures the internal shape of the image
  • Upper word profile: distance from the upper boundary of the word image
  • Lower word profile: distance from the lower boundary of the word image

Overview of Feature Calculation

[Figure: for the input word image, the 4 profile features (projection profile, ink transition, upper word profile, lower word profile) are calculated and concatenated to form the descriptor]
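The four profile features and their concatenation can be sketched as below. It assumes a binarized patch stored as a list of rows with 1 for ink and 0 for background, computes every feature per column, and counts every change between ink and background as a transition; normalization is omitted. These conventions and the function name are assumptions, since the slides do not give the exact definitions.

```python
def profile_features(image):
    """image: list of rows, each a list of 0/1 values (1 = ink).

    Returns the concatenation of four per-column profiles.
    """
    height, width = len(image), len(image[0])
    projection, transitions, upper, lower = [], [], [], []
    for x in range(width):
        column = [image[y][x] for y in range(height)]
        ink_rows = [y for y, v in enumerate(column) if v]
        # Projection profile: amount of ink in the column.
        projection.append(sum(column))
        # Ink transitions: changes between ink and background going down.
        transitions.append(sum(1 for y in range(1, height) if column[y] != column[y - 1]))
        # Upper / lower profiles: distance from each boundary to the first ink pixel.
        upper.append(ink_rows[0] if ink_rows else height)
        lower.append(height - 1 - ink_rows[-1] if ink_rows else height)
    return projection + transitions + upper + lower
```

For a patch of width w the descriptor has 4w entries, so patches of equal width always yield descriptors of equal length.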

Fast Pre-Processing

[Figure: flowchart. The input patch is converted to its corresponding patch vector and checked against a lookup table of visual words V1, V2, V3, ..., Vk. If the patch vector is present, the corresponding visual word is retrieved; if not, the corresponding visual word is found and the lookup table is updated.]
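The lookup-table idea in the flowchart can be sketched as a small cache. It assumes patch vectors take discrete values, so that identical patches produce identical dictionary keys, and it uses brute-force search over a flat codebook on a miss; the class name `VisualWordCache` and its attributes are hypothetical.

```python
class VisualWordCache:
    """Maps patch vectors to visual words, searching the codebook only on a miss."""

    def __init__(self, codebook):
        self.codebook = codebook
        self.table = {}      # patch vector (as tuple) -> visual word index
        self.misses = 0      # number of full codebook searches performed

    def _nearest(self, vector):
        def sq_dist(c):
            return sum((p - q) ** 2 for p, q in zip(vector, c))
        return min(range(len(self.codebook)), key=lambda k: sq_dist(self.codebook[k]))

    def lookup(self, patch_vector):
        key = tuple(patch_vector)
        if key not in self.table:
            # "No" branch: find the visual word, then update the table.
            self.misses += 1
            self.table[key] = self._nearest(key)
        # "Yes" branch: the stored visual word is returned directly.
        return self.table[key]
```

Printed text repeats the same glyph fragments many times, so under this assumption a large share of patches would be served from the table instead of a nearest-neighbour search.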

Dataset Used

Book          #Pages  #Words
Telugu-1718   100     21345
English-1601  363     113008


Baseline Statistics

Book         Method                      mAP
Telugu-1718  SIFT                        0.7834
Telugu-1718  Patch                       0.53
Telugu-1718  Patch Feature               0.6183
Telugu-1718  Patch Feature with Overlap  0.7214


Enhancement on Baseline Statistics

Enhancement Method    SIFT    Patch Feature
Query Expansion       0.7920  0.75
Spatial Verification  0.8571  0.83
LCS Re-ranking        0.8798  0.8481


Results with Split Features

Book          SIFT  Patch Feature
Telugu-1718   0.94  0.954
English-1601  0.93  0.90

Qualitative Results


Contributions

• Language independent system
  • Tested on 4 different languages
• Scalable to huge datasets
  • Tested on 1 million word images
• Handles noisy document images
  • Demonstrated performance on a dataset where commercial OCR fails
• Enhancements over baseline results
  • Query expansion
  • Text query support
  • Document-specific sparse coding
• A document-specific descriptor is proposed

Future Work

• Test on datasets with different fonts
• Similar methods for handwritten and camera-based datasets
• Learning character-level visual words automatically using annotated data
• Multi-keyword support
• Combine recognition-based and recognition-free methods
• Improve the patch-based descriptor

Related Publications

• Ravi Shekhar and C. V. Jawahar, “Word Image Retrieval using Bag of Visual Words”, In Proceedings of 10th IAPR International Workshop on Document Analysis Systems (DAS), 2012.

• Praveen Krishnan, Ravi Shekhar and C. V. Jawahar, “Content Level Access to Digital Library of India Pages”, In Proceedings of 8th Indian Conference on Vision, Graphics and Image Processing (ICVGIP), 2012.

• Ravi Shekhar and C. V. Jawahar, “Document Specific Sparse Coding for Word Retrieval”, In Proceedings of 12th International Conference on Document Analysis and Recognition (ICDAR), 2013.

Thanks !!!