06 case study - word spotting

Word Spotting in Historical Document Images

Dr Khurram Khurshid

Case Study


Historical Documents

Contain invaluable information

Preservation – information access?

Digitization: easy and fast access

Information spotting

www.inforouter.com

Problem


Information Retrieval

How to retrieve the required information? OCR?

Fails for ancient documents

A 16th century document image

A 19th century document image

Problem

lêVÎerS' danS C6tte %Ur6' a été °*g^) el" doit *£

Word Spotting


Plan

Introduction

• Related work - state of the art

Proposed method

Document Indexing

• Word/graphic segmentation

• Character extraction

• Feature definition

Word Retrieval

• Character matching

• Word matching

Experimental Results

Applications

Conclusion/Perspectives

Presentation Plan

http://en.wikipedia.org/wiki/Johannes_Gutenberg


Introduction - Word Spotting

Word Spotting - an alternative to OCR

Comparing two word images through a matching process to see if they are similar or not

Input query

Matching process

Recognized words


State of the Art

Mainly two different categories of methods

Holistic

• view a word image as a unit

Analytical

• a word image is segmented into smaller units




Holistic vs Analytical

Handwritten text

• Character segmentation can be avoided

• Holistic methods

Printed text

• Easier to segment into characters

• Analytical methods

Analytical methods:

• Ability to focus on the local intrinsic characteristics of words

• Allow a more precise word representation

Analytical approach suitable for printed docs




Benchmark in the domain

Profile feature sequence matching [Rath and Manmatha 2007]

Four feature sequences for each word image

• Vertical projection

• Lower profile

• Upper profile

• Ink/background transitions

Matching at word level using DTW


Method of Rath | Proposed Method
Holistic | Analytical
4 features | (4+2=6) features
Features defined at word level | Features defined at character level
DTW at word level | Multi-level dynamic matching



Proposed Approach

Character Extraction

Feature Definition

Word/graphic Segment

Indexing Dynamic String Comparison (Word – level)

Dynamic Time Warping (Character – level)

Query Word Processing

ASCII Query




Data Sets

Document images of Bibliothèque Interuniversitaire de Médecine (BIUM)

12 books of 19th century

Data Set A (Training set)

• 20 pages : 4 each from 5 books

• Total words = 6,632

• 25 (5x5) query words having 175 instances

Data Set B (Test set)

• 60 pages : 4 each from 12 books

• Total words = 17,010

• 60 (5x12) query words having 435 instances in total

Data Set C

• 3 complete books

• More than 500 pages in total

3 books of 16th century




Document Image Indexing

Feature Definition

Indexing

Word/Graphic Segmentation

Binarization



Document Image Binarization

NICK algorithm

$$ T = m + k \sqrt{\frac{\sum_{i=1}^{NP} p_i^{2} - m^{2}}{NP}} $$

k = NICK factor having value between –0.2 and –0.1

pi = pixel value of gray scale image

NP = number of pixels in the window

m = mean gray value of these NP pixels

k = -0.2

k = -0.1
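The threshold computation can be illustrated with a minimal sketch for a single window. It assumes the formula is read as T = m + k·sqrt((Σ p_i² − m²) / NP), with m² outside the summation, and that ink is darker than the background; the function names `nick_threshold` and `binarize_pixel` are hypothetical.

```python
import math

def nick_threshold(window, k=-0.1):
    """Threshold T for one window of gray-level pixel values (assumed formula, see lead-in)."""
    n_p = len(window)                      # NP: number of pixels in the window
    m = sum(window) / n_p                  # m: mean gray value of the window
    sum_sq = sum(p * p for p in window)    # sum of squared pixel values
    return m + k * math.sqrt((sum_sq - m * m) / n_p)

def binarize_pixel(pixel, window, k=-0.1):
    """Return 1 (ink) if the pixel is darker than the window threshold, else 0."""
    return 1 if pixel < nick_threshold(window, k) else 0

# Usage: a flat 3x3 window of gray value 100
flat = [100] * 9
t = nick_threshold(flat, k=-0.1)
```

Because k is negative, the threshold sits below the local mean, so a uniform background window is not turned into ink.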





Multi-step bottom up approach

• Horizontal Run Length Smoothing Algorithm

Graphic Component Detection

• Height-Area Analysis of the components

d > threshold
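Horizontal run-length smoothing can be sketched as follows for one row of a binary image. This is a generic RLSA sketch, not necessarily the exact variant used here: background gaps no longer than a threshold are filled with ink, so characters of one word merge, while gaps d greater than the threshold keep words apart. The function name and the 0/1 row representation are assumptions.

```python
def rlsa_horizontal(row, threshold):
    """Fill background runs of length <= threshold that lie between two ink pixels.

    row: list of 0/1 values, where 1 means ink.
    """
    out = list(row)
    last_ink = None                      # column of the most recent ink pixel
    for x, value in enumerate(row):
        if value == 1:
            if last_ink is not None:
                gap = x - last_ink - 1   # background pixels between the two ink pixels
                if 0 < gap <= threshold:
                    for g in range(last_ink + 1, x):
                        out[g] = 1       # smear the short gap
            last_ink = x
    return out

# Usage: the gap of 2 is filled, the gap of 4 is kept
smoothed = rlsa_horizontal([1, 0, 0, 1, 0, 0, 0, 0, 1], threshold=2)
```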





Evaluation of Word Segmentation on Data Set B

• Words segmented perfectly = 99.76%

Problems

• Titles in very large font

• Can be treated separately using large RLSA





Component Height-Area Analysis


Word components = [ Component area < Mean component area × A AND Component height < Mean component height × B ]
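The height-area rule above can be sketched as a simple filter. The values of A and B are not given here, so they are left as required parameters; the dictionary keys `area` and `height` and the function name are hypothetical.

```python
from statistics import mean

def split_words_and_graphics(components, a, b):
    """Separate word components from graphic components with the height-area rule.

    components: list of dicts with 'area' and 'height' keys (assumed representation).
    a, b: the multipliers A and B of the rule (values must be supplied by the caller).
    """
    mean_area = mean(c["area"] for c in components)
    mean_height = mean(c["height"] for c in components)
    words, graphics = [], []
    for c in components:
        is_word = c["area"] < mean_area * a and c["height"] < mean_height * b
        (words if is_word else graphics).append(c)
    return words, graphics

# Usage with illustrative numbers only
comps = [
    {"area": 100, "height": 10},
    {"area": 120, "height": 12},
    {"area": 5000, "height": 200},   # a large component, e.g. an illustration
]
words, graphics = split_words_and_graphics(comps, a=2, b=2)
```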




T-character (true alphabetic characters)

Connected components (CCs) of the word image

S-Character (segmented character)

Heuristic Rules – 3 passes

Pass 1 - Multi-component characters

[Figure: components A and B of a multi-component character]


Improve S-characters to correspond to T-characters




Pass 2 – Grouping the broken S-characters

Pass 3 – Remove punctuation marks and noise components

After processing stages, 98% of S-characters correspond to T-characters

[Figure labels: components A and B, Less than T; Split Characters; Merged Characters]




Validation using data set A

Before post processing passes

# Total T-characters in the data set | 82264
# of raw S-characters within words | 115414
# of T-characters in these S-characters | 60358
Recall % | 73.4%
Precision % | 52.3%

After pass 1 and 2

# of S-characters treated (merged) during pass 1 and 2 | 20745
# of S-characters after pass 1 and pass 2 | 94669
Recall % | 98.6%
Precision % | 85.7%

After pass 3

# of S-characters removed during pass 3 | 10244
# of S-characters after pass 3 | 84425
Recall % | 98.6%
Precision % | 96.1%


Feature Extraction

Sequence of features, computed for each pixel column

• Upper profile - distance of first ink pixel from top

• Lower profile - distance of last ink pixel from top

• Vertical projection - sum of the intensity values in the column

• Ink/Non-ink transitions - number of ink/non-ink transitions

• Vertical histogram - number of ink pixels

• Mid Row transitions

[Figure: feature sequences Up, Lp, Vp, Ink, Vh, Mr of an S-character, e.g. 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 ...]

Length of the vector sequence = Pixel width of S-character
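The six per-column features can be sketched as below. Several details are assumptions: the image is given both as gray levels (`gray`) and as a 0/1 ink mask (`ink`), an empty column gets the image height as its upper and lower profile, and the mid-row transition is taken to be 1 when the middle-row pixel differs from its left neighbour. The slides do not define the mid-row feature precisely, so that part in particular is only a guess.

```python
def column_features(gray, ink):
    """Return one (Up, Lp, Vp, Ink, Vh, Mr) tuple per pixel column.

    gray: list of rows of gray-level values; ink: list of rows of 0/1 (1 = ink).
    """
    height, width = len(ink), len(ink[0])
    mid = height // 2
    features = []
    for x in range(width):
        column = [ink[y][x] for y in range(height)]
        ink_rows = [y for y, v in enumerate(column) if v]
        up = ink_rows[0] if ink_rows else height     # upper profile (assumed fallback)
        lp = ink_rows[-1] if ink_rows else height    # lower profile (assumed fallback)
        vp = sum(gray[y][x] for y in range(height))  # vertical projection of intensities
        tr = sum(1 for y in range(1, height) if column[y] != column[y - 1])
        vh = len(ink_rows)                           # vertical histogram: ink pixel count
        mr = 1 if x > 0 and ink[mid][x] != ink[mid][x - 1] else 0  # assumed definition
        features.append((up, lp, vp, tr, vh, mr))
    return features

# Usage on a tiny 3x3 image
ink = [[0, 1, 0], [0, 1, 1], [0, 1, 0]]
gray = [[255, 0, 255], [255, 0, 0], [255, 0, 255]]
feats = column_features(gray, ink)
```

The result has one tuple per column, which matches the statement that the sequence length equals the pixel width of the S-character.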




Index File

One index file for one document image

• Position of each word

• Width and height of the word’s bounding box

• Number of S-characters in the word

• Position of each S-character in the document image

• Width/height of the S-character’s bounding box

• Features of each S-character
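One possible in-memory layout for such an index is sketched below. The class and field names are hypothetical, and the on-disk format of the index file is not specified here; the sketch only mirrors the items listed above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SCharacterEntry:
    x: int                     # position of the S-character in the document image
    y: int
    width: int                 # bounding-box size
    height: int
    features: List[Tuple[float, ...]] = field(default_factory=list)  # one tuple per pixel column

@dataclass
class WordEntry:
    x: int                     # position of the word
    y: int
    width: int                 # bounding-box size
    height: int
    s_characters: List[SCharacterEntry] = field(default_factory=list)

    @property
    def num_s_characters(self) -> int:
        return len(self.s_characters)

@dataclass
class DocumentIndex:
    image_name: str            # one index per document image
    words: List[WordEntry] = field(default_factory=list)

# Usage: a word with two S-characters
word = WordEntry(10, 20, 40, 15, [SCharacterEntry(10, 20, 18, 15), SCharacterEntry(30, 20, 20, 15)])
index = DocumentIndex("page_001", [word])
```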

Computational time per page

Test using Data Set B on

• Intel core2duo 2.1GHz

• 3GB RAM

Average time = 130s

Average size = 600KB

[Figure: indexing time (s), 0 to 450, for each document image]




Word Retrieval

Multi-stage matching

DTW for S-character matching

Spotted words

Word matching - string comparison

ASCII query

Query image representation

Length-Ratio filter

Indexed docs

Processing stages

Query image


Query Formation

ASCII query

Prototype characters

Collection for each book

Word Image query

Click on the word in user interface

Position information in index file




Length-Ratio Filter

Filter out ‘non-likely’ words

Compare the number of S-characters in test and query words

50 – 75% words are filtered out

[Figure: % of words filtered out (40 to 80) versus # of S-characters in query word (3 to 10)]
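The length-ratio filter can be sketched as a comparison of S-character counts. The accepted ratio range is not stated here; the bounds 0.5 and 2.0 below are purely illustrative assumptions, as is the function name.

```python
def length_ratio_filter(query_len, candidate_lens, low=0.5, high=2.0):
    """Return the indices of candidate words whose S-character count is compatible with the query.

    query_len: number of S-characters in the query word.
    candidate_lens: number of S-characters of each indexed word.
    low, high: accepted range for candidate_len / query_len (assumed values).
    """
    kept = []
    for idx, n in enumerate(candidate_lens):
        ratio = n / query_len
        if low <= ratio <= high:
            kept.append(idx)
    return kept

# Usage: a query of 4 S-characters against six indexed words
kept = length_ratio_filter(4, [1, 2, 4, 8, 9, 3])
```

Only the words that pass this cheap test go on to the more expensive character and word matching stages.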




Word Spotting

Multi-level matching for retrieval

Character level matching: DTW with Euclidean distance on the features Up, Lp, Vp, Ink, Vh, Mr

Word level matching: String match




Character Matching

Two characters are matched by comparing their feature vectors using DTW

Why DTW? Non-linear elastic matching

[Figure: linear alignment (point i matched with point i) versus non-linear alignment (point i matched with point i+2)]




Character Matching

For two S-characters X and Y of widths m and n, their feature vectors are treated as two series X = (x1 .. xm) and Y = (y1 .. yn)

[Figure: distance matrix with rows x1 .. xm and columns y1 .. yn; the warping path ends at D(m,n)]

$$ D(i,j) = \min \{ D(i,j-1),\; D(i-1,j),\; D(i-1,j-1) \} + d(x_i, y_j) $$

$$ d(x_i, y_j) = \sum_{k=1}^{6} (x_{i,k} - y_{j,k})^{2} $$

Distance Normalization

• Minimum Warping Path: D(m,n) / No. of steps (k)

• Average width: D(m,n) / [(m+n)/2]
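The DTW recurrence and the two normalizations can be sketched as follows. Each S-character is assumed to be a list of per-column feature tuples, the local distance is the sum of squared feature differences, and the number of steps is counted along the minimum-cost path; the function names are hypothetical.

```python
def local_distance(x, y):
    """d(x_i, y_j): sum of squared differences over the features of two columns."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def dtw(X, Y):
    """Return (D(m, n), number of steps on the chosen warping path)."""
    m, n = len(X), len(Y)
    inf = float("inf")
    # Each cell stores (accumulated distance, steps so far)
    D = [[(inf, 0)] * (n + 1) for _ in range(m + 1)]
    D[0][0] = (0.0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best, steps = min(D[i][j - 1], D[i - 1][j], D[i - 1][j - 1])
            D[i][j] = (best + local_distance(X[i - 1], Y[j - 1]), steps + 1)
    return D[m][n]

def normalized_dtw(X, Y, by="path"):
    """Normalize by warping-path length ('path') or by average width ('width')."""
    total, steps = dtw(X, Y)
    if by == "path":
        return total / steps
    return total / ((len(X) + len(Y)) / 2)

# Usage with one feature per column for brevity
a = [(1,), (2,), (3,)]
b = [(1,), (1,), (2,), (3,)]   # same shape, stretched by one column
c = [(2,), (3,), (4,)]
```

The stretched series `b` aligns with `a` at zero cost, which is the elastic behaviour that a column-by-column linear comparison cannot provide.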




Word Matching

String matching distances

Relative position correspondence

Edit distance

Merge-split Edit distance

Linear Matching



String match


Relative Position Correspondence (RPC)

Natural way to match two strings

• Each S-character of the query word is matched against a number of neighbouring S-characters around its relative position in the test word

• Smallest of these costs is added to the total word distance

Normalized word distance = Total word distance / number of matches

[Figure: query word and test word shown as ordered S-characters (1, 2, 3, 4, 5, ... 15); each query S-character is compared with 2 neighbors on each side in the test word]
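A minimal sketch of relative position correspondence is given below. Two points are assumptions: the relative position in the test word is obtained by scaling the query index proportionally to the word lengths, and the number of matches equals the number of query S-characters. `char_cost` stands in for the character-level DTW distance.

```python
def rpc_distance(query, test, char_cost, neighbours=2):
    """Normalized RPC distance between two sequences of S-characters."""
    s, t = len(query), len(test)
    total = 0.0
    for i, q in enumerate(query):
        # Relative position of the i-th query S-character in the test word (assumed scaling)
        centre = round(i * (t - 1) / (s - 1)) if s > 1 else 0
        lo = max(0, centre - neighbours)
        hi = min(t - 1, centre + neighbours)
        # The smallest cost within the neighbourhood is added to the word distance
        total += min(char_cost(q, test[j]) for j in range(lo, hi + 1))
    return total / s

# Usage with a toy cost (0 for identical symbols, 1 otherwise) in place of DTW
toy_cost = lambda a, b: 0.0 if a == b else 1.0
d_same = rpc_distance(list("word"), list("words"), toy_cost)
d_diff = rpc_distance(list("abc"), list("xyz"), toy_cost)
```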




Edit Distance

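Distance given by the minimal cost sequence of edit operations

• Replace, Delete, Insert

For two words A, B of size s & t respectively

• A = (a1 ... as) and B = (b1 ... bt)

Edit operation costs = DTW distances

Normalization by length of minimum warping path

[Figure: matrix W between word 1 and word 2; each cell cost is a DTW distance between two S-characters, e.g. columns r1 .. r5 against o1 .. o4]

$$ W(i,j) = \min \begin{cases} W(i-1,j-1) + \gamma(a_i \to b_j) & \text{Replace} \\ W(i-1,j) + \gamma(a_i \to \Lambda) & \text{Delete} \\ W(i,j-1) + \gamma(\Lambda \to b_j) & \text{Insert} \end{cases} \qquad 1 \le i \le s;\; 1 \le j \le t $$

Here γ denotes the cost of an edit operation and Λ the empty S-character.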




Merge-Split Edit Distance

Proposed solution to solve character segmentation problems

Two new operations, modelling the split capability: a_i → (b_j + b_{j+1}) and (a_i + a_{i+1}) → b_j

Merge-T function

• One S-character of the query

• against two S-characters of test

Merge-Q function

• One S-character of the test

• against two S-characters of query

Classical Edit operations

• Replace, Insert, Delete




Merge-Split Edit Distance Calculation

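$$ W(0,0) = 0 $$

$$ W(0,j) = W(0,j-1) + \gamma(\Lambda \to b_j) $$

$$ W(i,0) = W(i-1,0) + \gamma(a_i \to \Lambda) $$

$$ W(i,j) = \min \begin{cases} W(i-1,j-1) + \gamma(a_i \to b_j) \\ W(i-1,j-1) + \gamma(a_i \to (b_j + b_{j+1})) \\ W(i-1,j-1) + \gamma((a_i + a_{i+1}) \to b_j) \\ W(i-1,j) + \gamma(a_i \to \Lambda) \\ W(i,j-1) + \gamma(\Lambda \to b_j) \end{cases} \qquad 1 \le i \le s;\; 1 \le j \le t $$

[Figure: matrix W with rows Λ, a1, a2, ... and columns Λ, b1, b2, ..., filled for i <= s and for j <= t up to W(s,t); when a merge is used, the value is copied in the next cell]

Normalization

k = length of warping path – no. of merge functions used in path

Normalized word distance = W (s,t) / k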
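The merge-split edit distance can be sketched as a small dynamic program. This is a re-indexed variant rather than the copy-to-next-cell bookkeeping described above: a merge is scored at the cell where it ends, so every operation on the path counts as one step and k is simply the number of operations. The callables `cost`, `merge` and `gap_cost` are hypothetical stand-ins for the DTW distance, the concatenation of two S-characters, and the insertion/deletion cost. With `allow_merge=False` the function reduces to the classical edit distance.

```python
def merge_split_distance(query, test, cost, merge, gap_cost, allow_merge=True):
    """Return the normalized word distance W(s, t) / k."""
    s, t = len(query), len(test)
    inf = float("inf")
    # Each cell stores (accumulated cost, number of operations on the best path)
    W = [[(inf, 0)] * (t + 1) for _ in range(s + 1)]
    W[0][0] = (0.0, 0)
    for i in range(s + 1):
        for j in range(t + 1):
            if i == 0 and j == 0:
                continue
            cands = []
            if i >= 1 and j >= 1:      # Replace a_i -> b_j
                c, k = W[i - 1][j - 1]
                cands.append((c + cost(query[i - 1], test[j - 1]), k + 1))
            if i >= 1:                 # Delete a_i
                c, k = W[i - 1][j]
                cands.append((c + gap_cost(query[i - 1]), k + 1))
            if j >= 1:                 # Insert b_j
                c, k = W[i][j - 1]
                cands.append((c + gap_cost(test[j - 1]), k + 1))
            if allow_merge and i >= 1 and j >= 2:   # Merge-T: one query vs two test
                c, k = W[i - 1][j - 2]
                cands.append((c + cost(query[i - 1], merge(test[j - 2], test[j - 1])), k + 1))
            if allow_merge and i >= 2 and j >= 1:   # Merge-Q: two query vs one test
                c, k = W[i - 2][j - 1]
                cands.append((c + cost(merge(query[i - 2], query[i - 1]), test[j - 1]), k + 1))
            W[i][j] = min(cands)
    total, k = W[s][t]
    return total / k if k else 0.0

# Toy stand-ins: S-characters are strings, cost is 0 for identical strings and 1 otherwise
toy_cost = lambda a, b: 0.0 if a == b else 1.0
toy_merge = lambda a, b: a + b
toy_gap = lambda a: 1.0
with_merge = merge_split_distance(["p", "o", "u", "r"], ["p", "o", "ur"], toy_cost, toy_merge, toy_gap)
classical = merge_split_distance(["p", "o", "u", "r"], ["p", "o", "ur"], toy_cost, toy_merge, toy_gap,
                                 allow_merge=False)
```

In the toy run the merged test S-character "ur" is matched by combining the query S-characters "u" and "r", which the classical operations can only approximate with a replace followed by a delete.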




Merge-Split Edit Operations - Example

Query word A with 3 S-chars (F, I, G)

Test word B with 2 S-chars (FI, G)

Operation | Edit | Cost
Insert | γ(Λ → b1) | 1.53
Delete | γ(a1 → Λ) | 1.39
Replace | γ(a1 → b1) | 0.86
Merge-T | γ(a1 → (b1 + b2)) | 1.65
Merge-Q | γ((a1 + a2) → b1) | 0.07




Matching Example

Test word (columns): p, o, ur

Query word (rows): p, o, u, r

  | Λ | p | o | ur
Λ | 0.00 | 1.79 | 3.39 | 5.56
p | 1.78 | 0.02 | 1.62 | 3.79
o | 3.47 | 1.72 | 0.04 | 2.05
u | 5.51 | 3.75 | 2.08 | 0.09
r | 6.85 | 5.10 | 3.42 | 0.09

Resolves segmentation problems

Computationally expensive

Cost of matching u to ur = 1.83

Cost of matching (u + r) to ur = 0.09




Linear Displacement Matching

Reduce computational time: step-wise instead of recursive matching

Three operations in each step: Replace, Merge-T, Merge-Q

• Minimum cost of the three operations is added to the total word distance

• S-characters used in minimum cost operation are marked

Cost of insertion/deletion for the remaining S-characters?

Normalized word distance = Total word distance / Number of steps
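The step-wise idea can be sketched as a greedy left-to-right pass. This is an interpretation under assumptions: "marking" is modelled by advancing two indices, and since the handling of leftover S-characters is left open above, the sketch only reports how many remain. `cost` and `merge` are hypothetical stand-ins for the DTW distance and the concatenation of two S-characters.

```python
def linear_matching(query, test, cost, merge):
    """Greedy step-wise matching; returns (normalized distance, leftover S-characters)."""
    i = j = 0
    total, steps = 0.0, 0
    s, t = len(query), len(test)
    while i < s and j < t:
        # Each option: (cost, query S-characters consumed, test S-characters consumed)
        options = [(cost(query[i], test[j]), 1, 1)]                                # Replace
        if j + 1 < t:
            options.append((cost(query[i], merge(test[j], test[j + 1])), 1, 2))    # Merge-T
        if i + 1 < s:
            options.append((cost(merge(query[i], query[i + 1]), test[j]), 2, 1))   # Merge-Q
        c, di, dj = min(options)
        total += c
        i += di          # the S-characters used in the chosen operation are consumed
        j += dj
        steps += 1
    leftover = (s - i) + (t - j)
    return (total / steps if steps else 0.0), leftover

# Toy stand-ins: S-characters are strings, cost is 0 for identical strings and 1 otherwise
toy_cost = lambda a, b: 0.0 if a == b else 1.0
toy_merge = lambda a, b: a + b
result = linear_matching(["p", "o", "u", "r"], ["p", "o", "ur"], toy_cost, toy_merge)
```

Each step inspects at most three costs, so the work grows linearly with word length instead of filling the whole s × t matrix.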




Partial Calculation of the Distance Matrix

Test word (columns): p, o, ur

Query word (rows): p, o, u, r

  | Λ | p | o | ur
Λ | 0.00 | 1.79 | 3.39 | 5.56
p | 1.78 | 0.02 | 1.62 | 3.79
o | 3.47 | 1.72 | 0.04 | 2.05
u | 5.51 | 3.75 | 2.08 | 0.09
r | 6.85 | 5.10 | 3.42 | 0.09

Normalized Word Distance = Total distance / Number of iterations = 0.09/3 = 0.03




Computational Evaluation

Linear Displacement Matching vs Merge-Split distance

• Data Set A

Effect of query length?

[Figure: time per 100 words (secs), 0 to 4.5, versus word length (3 to 10), for Merge Split Edit and Linear Matching]

Computational time increases with the query length

Increase non-significant





Data Set B

• Query words of different lengths and styles

Performance measures

• Recall, Precision, F-measure (F) and R-score (Relevance measure)

[Figure: words retrieved for the query "rhinoscopie", labelled Exact, Relevant (non exact) and False positive]



Performance Measures

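Precision

$$ P = \frac{\text{Exact Words Retrieved}}{\text{Exact Words Retrieved} + \text{False Positives}} $$

Recall

$$ R = \frac{\text{Exact Words Retrieved}}{\text{Total Exact Words Existing}} $$

F-measure

$$ F = \frac{2 \cdot P \cdot R}{P + R} $$

R-score

$$ \text{R-score} = \frac{\text{Relevant Words Retrieved}}{\text{Relevant Words Retrieved} + \text{False Positives}} $$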
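As a worked check of these four measures, the short sketch below evaluates them on the Merge-Split counts reported for Data Set B (427 exact words, 39 relevant words, 4 false positives, 435 query instances). The function name and the dictionary layout are hypothetical.

```python
def performance(exact, relevant, false_positives, total_instances):
    """Return precision, recall, F-measure and R-score as fractions in [0, 1]."""
    precision = exact / (exact + false_positives)
    recall = exact / total_instances
    f_measure = 2 * precision * recall / (precision + recall)
    # R-score is undefined when nothing relevant or false was retrieved
    denom = relevant + false_positives
    r_score = relevant / denom if denom else None
    return {"precision": precision, "recall": recall, "f_measure": f_measure, "r_score": r_score}

# Merge-Split counts on Data Set B
scores = performance(exact=427, relevant=39, false_positives=4, total_instances=435)
for name, value in scores.items():
    print(f"{name}: {100 * value:.2f}%")
```

The printed values agree with the reported 99.07% precision, 98.16% recall, 98.61% F-measure and 90.70% R-score.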





Measure | RPC | Edit distance | Merge-Split | Linear matching | Rath et al. 2007 | ABBYY OCR
#query word instances | 435 | 435 | 435 | 435 | 435 | 435
#exact words detected | 401 | 406 | 427 | 420 | 335 | 422
#relevant words detected | 99 | 53 | 39 | 33 | 66 | 0
#false positives | 51 | 16 | 4 | 3 | 54 | 0
Precision (%) | 88.72% | 96.21% | 99.07% | 99.29% | 86.12% | 100%
Recall (%) | 92.18% | 93.33% | 98.16% | 96.55% | 77.01% | 97.01%
F-measure | 90.42% | 94.75% | 98.61% | 97.90% | 81.31% | 98.48%
R-score | 66.00% | 76.81% | 90.70% | 91.67% | 55.00% | -




Variation with threshold

[Figure: precision versus recall as the threshold varies, and F-score (40 to 100) versus threshold (0.4 to 1.1)]


Feature Performance

[Figure: precision versus recall, and F-score at thresholds T1 to T6, for each individual feature: VP, UP, LP, Ink, VH, MRow]



Three ancient books of 16th century

• 12 document images, each with 1400+ words

• 15 query words with 171 instances in total





# Words retrieved (TOTAL over all queries, 171 query instances)

Method | Exact | Relevant | False positives | Recall % | Precision %
OCR | 89 |  |  | 52.04% | -
Edit distance | 88 | 41 | 25 | 51.46% | 77.87%
Linear Matching | 146 | 74 | 19 | 85.38% | 88.48%
Merge-Split Edit | 149 | 78 | 23 | 87.13% | 86.63%




Figure-caption pair generation

Link a figure with its caption

Caption candidates selection

• Spatial Information

Figure caption label search

• Label word spotting

Data Set C

• 180 / 204 captions detected

• 4 false positives (98% precision)



Perspectives

Automatic generation of prototype characters

Potential to be used for non-latin (Oriental) text

Potential to be used for low resolution contemporary documents

Ancient documents: Improve text line and word extraction


Results on Arabic text
