06 case study - word spotting

Word Spotting in Historical Document Images Dr Khurram Khurshid Case Study

Upload: institute-of-space-technology-ist

Post on 10-Aug-2015


TRANSCRIPT

Page 1: 06   case study - word spotting

Word Spotting in Historical Document Images

Dr Khurram Khurshid

Case Study

Page 2: 06   case study - word spotting


Historical Documents

Contain invaluable information

Preservation – information access?

Digitization

• Easy and fast access

Information spotting

www.inforouter.com

Problem

Page 3: 06   case study - word spotting


Information Retrieval

How to retrieve the required information? OCR?

Fails for ancient documents

A 16th century document image

A 19th century document image

Problem

lêVÎerS' danS C6tte %Ur6' a été °*g^) el" doit *£

Word Spotting

Page 4: 06   case study - word spotting


Presentation Plan

Introduction

• Related work - state of the art

• Proposed method

Document Indexing

• Word/graphic segmentation

• Character extraction

• Feature definition

Word Retrieval

• Character matching

• Word matching

Experimental Results

Applications

Conclusion/Perspectives

http://en.wikipedia.org/wiki/Johannes_Gutenberg

Page 5: 06   case study - word spotting


Introduction - Word Spotting

Word Spotting - an alternative to OCR

Comparing two word images through a matching process to see if they are similar or not

Input query

Matching process

Recognized words

Introduction Document Indexing Word Retrieval Results Applications Conclusion

Word Spotting State of the art Proposed Method Data sets

Page 6: 06   case study - word spotting


State of the Art

Mainly two different categories of methods

Holistic

• View a word image as a unit

Analytical

• A word image is segmented into smaller units

Page 7: 06   case study - word spotting

Holistic vs Analytical

Handwritten text

• Character segmentation can be avoided

• Holistic methods

Printed text

• Easier to segment into characters

• Analytical methods

• Ability to focus on the local intrinsic characteristics of words

• Allow a more precise word representation

Analytical approach suitable for printed documents

Page 8: 06   case study - word spotting

Benchmark in the domain

Profile feature sequence matching [Rath and Manmatha 2007]

Four feature sequences for each word image

• Vertical projection

• Lower profile

• Upper profile

• Ink/background transitions

Matching at word level using DTW

                 Method of Rath                   Proposed Method
Approach         Holistic                         Analytical
Features         4                                6 (4 + 2)
Feature level    Features defined at word level   Features defined at character level
Matching         DTW at word level                Multi-level dynamic matching

Page 9: 06   case study - word spotting

Proposed Approach

Indexing: word/graphic segmentation → character extraction → feature definition → index

Retrieval: ASCII query or query word → query word processing → Dynamic Time Warping (character level) → dynamic string comparison (word level)

Page 10: 06   case study - word spotting

Data Sets

Document images of Bibliothèque Interuniversitaire de Médecine (BIUM)

12 books of 19th century

Data Set A (Training set)

• 20 pages : 4 each from 5 books

• Total words = 6,632

• 25 (5x5) query words having 175 instances

Data Set B (Test set)

• 60 pages : 4 each from 12 books

• Total words = 17,010

• 60 (5x12) query words having 435 instances in total

Data Set C

• 3 complete books

• More than 500 pages in total

3 books of 16th century

Page 11: 06   case study - word spotting

Document Image Indexing

Binarization → Word/Graphic Segmentation → Character Extraction → Feature Definition → Indexing

Binarization Word/graphic segmentation Character extraction Feature definition Image indexing

Page 12: 06   case study - word spotting

Document Image Binarization

NICK algorithm

$T = m + k \sqrt{\dfrac{\sum_{i=1}^{NP} p_i^2 - m^2}{NP}}$

T = threshold for the window

k = NICK factor having value between –0.2 and –0.1

pi = pixel value of gray scale image

NP = number of pixels in the window

m = mean gray value of these NP pixels

[Figure: binarization results for k = -0.2 and k = -0.1]
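A minimal sketch of the threshold for one window, assuming the NICK formula reads T = m + k·sqrt((Σ pi² − m²) / NP) with the symbols defined on this slide. The function names, the flat list of window pixels and the "darker than T means ink" convention are illustrative assumptions, not taken from the slides.

```python
import math

def nick_threshold(window, k=-0.1):
    """Local threshold T for one window of gray values (assumed formula)."""
    n_p = len(window)                       # NP: number of pixels in the window
    m = sum(window) / n_p                   # m: mean gray value
    sum_sq = sum(p * p for p in window)     # sum of squared pixel values
    return m + k * math.sqrt((sum_sq - m * m) / n_p)

def binarize_pixel(value, window, k=-0.1):
    """Return 1 (ink) if the pixel is darker than the local threshold, else 0.

    Assumes dark text on a light background.
    """
    return 1 if value < nick_threshold(window, k) else 0
```

With k between -0.2 and -0.1 the threshold sits below the window mean, so a uniform background window is not marked as ink.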

Page 13: 06   case study - word spotting

Word/Graphic Segmentation

Multi-step bottom-up approach

• Horizontal Run Length Smoothing Algorithm

• Graphic component detection: height-area analysis of the components

d > threshold
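The horizontal smoothing step can be sketched as follows for a single pixel row. This is the textbook form of RLSA, in which background gaps no longer than a threshold between two ink pixels are filled; reading the slide's "d > threshold" as "such gaps are left open and separate two words" is an assumption, as are the function and parameter names.

```python
def rlsa_horizontal(row, max_gap):
    """Fill background runs of length <= max_gap lying between two ink pixels.

    row: list of 0 (background) / 1 (ink) values for one image row.
    """
    out = list(row)
    last_ink = None
    for x, value in enumerate(row):
        if value == 1:
            gap = x - last_ink - 1 if last_ink is not None else 0
            if 0 < gap <= max_gap:
                for g in range(last_ink + 1, x):
                    out[g] = 1          # short gap: merge into the same word blob
            last_ink = x
    return out
```

For example, with `max_gap=2` the row `[1, 0, 0, 1, 0, 0, 0, 0, 1]` keeps its four-pixel gap open and only the two-pixel gap is filled.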

Page 14: 06   case study - word spotting

Word/Graphic Segmentation

Evaluation of word segmentation on Data Set B

• Words segmented perfectly = 99.76%

Problems

• Titles in very large font

• Can be treated separately using large RLSA

Page 15: 06   case study - word spotting

Word/Graphic Segmentation

Component Height-Area Analysis

Word components = [ component area < mean component area × A AND component height < mean component height × B ]
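The height-area rule can be sketched as below. The slides do not give values for the constants A and B, so the defaults here are placeholders, and the `(area, height)` tuple layout is an assumption for illustration.

```python
def word_components(components, a=3.0, b=3.0):
    """Keep components that satisfy the height-area rule.

    components: list of (area, height) tuples, one per connected component.
    a, b: the constants A and B of the rule (placeholder values).
    """
    mean_area = sum(c[0] for c in components) / len(components)
    mean_height = sum(c[1] for c in components) / len(components)
    return [c for c in components
            if c[0] < mean_area * a and c[1] < mean_height * b]
```

Components rejected by the rule would be treated as graphics rather than text.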

Page 16: 06   case study - word spotting


Character Extraction

T-character (true alphabetic characters)

S-character (segmented character)

• Connected components (CCs) of the word image

Improve S-characters to correspond to T-characters

Heuristic rules – 3 passes

• Pass 1 - Multi-component characters

Page 17: 06   case study - word spotting


Character Extraction

Pass 2 – Grouping the broken S-characters

Pass 3 – Remove punctuation marks and noise components

After the processing stages, 98% of S-characters correspond to T-characters

[Figure: examples of split characters and merged characters]

Page 18: 06   case study - word spotting


Character Extraction

Validation using Data Set A

Before post-processing passes
# Total T-characters in the data set                        82264
# of raw S-characters within words                          115414
# of T-characters in these S-characters                     60358
Recall                                                      73.4%
Precision                                                   52.3%

After pass 1 and 2
# of S-characters treated (merged) during pass 1 and 2      20745
# of S-characters after pass 1 and pass 2                   94669
# of T-characters in these S-characters                     81103
Recall                                                      98.6%
Precision                                                   85.7%

After pass 3
# of S-characters removed during pass 3                     10244
# of S-characters after pass 3                              84425
# of T-characters in these S-characters                     81103
Recall                                                      98.6%
Precision                                                   96.1%
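The recall and precision figures above can be reproduced from the counts, assuming recall is the share of T-characters that appear among the S-characters and precision is the share of S-characters that are T-characters. These definitions are inferred from the numbers, not stated on the slide.

```python
def recall_precision(t_found, t_total, s_total):
    """Return (recall %, precision %) for character extraction.

    t_found: T-characters found among the S-characters
    t_total: total T-characters in the data set
    s_total: number of S-characters produced
    """
    return 100.0 * t_found / t_total, 100.0 * t_found / s_total

# Counts taken from the validation table for Data Set A.
before = recall_precision(60358, 82264, 115414)
after_pass_2 = recall_precision(81103, 82264, 94669)
after_pass_3 = recall_precision(81103, 82264, 84425)
```

Pass 3 only removes S-characters, so recall stays fixed while precision rises.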

Page 19: 06   case study - word spotting


Feature Extraction

Sequence of features: for each pixel column

• Upper profile (Up) - distance of first ink pixel from top

• Lower profile (Lp) - distance of last ink pixel from top

• Vertical projection (Vp) - summation of intensity values

• Ink/non-ink transitions (Ink) - number of ink/non-ink transitions

• Vertical histogram (Vh) - count of ink pixels

• Mid-row transitions (Mr), for example: 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0……

Length of the vector sequence = pixel width of S-character

Feature vector per column: (Up, Lp, Vp, Ink, Vh, Mr)
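A rough sketch of the six per-column features on a small character image. Several details are assumptions made for illustration: the image is passed as a binary grid (1 = ink) plus a gray-level grid, empty columns get the image height for both profiles, and the mid-row feature is taken as 1 when the middle row changes between ink and non-ink from the previous column.

```python
def column_features(binary, gray):
    """Return one (Up, Lp, Vp, Ink, Vh, Mr) tuple per pixel column."""
    height, width = len(binary), len(binary[0])
    mid = height // 2
    features = []
    for x in range(width):
        col = [binary[y][x] for y in range(height)]
        ink_rows = [y for y, v in enumerate(col) if v]
        up = ink_rows[0] if ink_rows else height      # first ink pixel from top
        lp = ink_rows[-1] if ink_rows else height     # last ink pixel from top
        vp = sum(gray[y][x] for y in range(height))   # sum of intensity values
        ink = sum(1 for a, b in zip(col, col[1:]) if a != b)  # transitions in column
        vh = len(ink_rows)                            # number of ink pixels
        mr = 1 if x > 0 and binary[mid][x] != binary[mid][x - 1] else 0
        features.append((up, lp, vp, ink, vh, mr))
    return features
```

The returned list has one entry per column, so its length equals the pixel width of the S-character.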

Page 20: 06   case study - word spotting


Index File

One index file for one document image

• Position of each word

• Width and height of the word’s bounding box

• Number of S-characters in the word

• Position of each S-character in the document image

• Width/height of the S-character’s bounding box

• Features of each S-character

Computational time per page

• Test using Data Set B on an Intel Core 2 Duo 2.1 GHz with 3 GB RAM

• Average time = 130 s

• Average size = 600 KB

[Plot: indexing time (s) per document image]

Page 21: 06   case study - word spotting


Word Retrieval

Processing stages

Query image or ASCII query → query image representation → length-ratio filter (on indexed docs) → multi-stage matching → spotted words

Multi-stage matching

• DTW for S-character matching

• Word matching - string comparison

Query formation Length-ratio filter Word spotting Character matching Word matching

Page 22: 06   case study - word spotting

Query Formation

ASCII query

Prototype characters

Collection for each book

Word Image query

Click on the word in user interface

Position information in index file

Page 23: 06   case study - word spotting

Length-Ratio Filter

Filter out unlikely words

Compare the number of S-characters in test and query words

50 – 75% of words are filtered out

[Plot: % of words filtered out against the number of S-characters in the query word (3 to 10)]
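A small sketch of such a filter, comparing S-character counts before any expensive matching is done. The slides do not state the accepted ratio range, so `low` and `high` are placeholder values, and the `(word_id, n_chars)` pairs are an assumed representation of index-file entries.

```python
def length_ratio_filter(query_len, candidates, low=0.7, high=1.5):
    """Return ids of candidate words whose S-character count is close to the query's.

    query_len: number of S-characters in the query word
    candidates: list of (word_id, n_chars) pairs from the index
    low, high: accepted range of n_chars / query_len (placeholder values)
    """
    kept = []
    for word_id, n_chars in candidates:
        ratio = n_chars / query_len
        if low <= ratio <= high:
            kept.append(word_id)
    return kept
```

Only the words that pass the filter go on to character and word matching.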

Page 24: 06   case study - word spotting


Word Spotting

Multi-level matching for retrieval

• Character level matching: DTW with Euclidean distance on the feature vectors (Up, Lp, Vp, Ink, Vh, Mr)

• Word level matching: string match

Page 25: 06   case study - word spotting

Character Matching

Two characters are matched by comparing their feature vectors using DTW

Why DTW? Non-linear elastic matching

[Figure: linear alignment (point i matched to point i) compared with non-linear alignment (point i matched to point i+2)]

Page 26: 06   case study - word spotting

Character Matching

For two S-characters X and Y of widths m and n, their feature vectors are treated as two series X = (x1 .. xm) and Y = (y1 .. yn)

$D(i,j) = \min \{\, D(i,j-1),\ D(i-1,j),\ D(i-1,j-1) \,\} + d(x_i, y_j)$

$d(x_i, y_j) = \sum_{k=1}^{6} (x_{i,k} - y_{j,k})^2$

[Figure: cost matrix of x1 .. xm against y1 .. yn, with the warping path ending at D(m,n)]

Distance Normalization

• Minimum Warping Path: D(m,n) / No. of steps (k)

• Average width: D(m,n) / [(m+n)/2]
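A minimal sketch of character-level DTW, assuming the recurrence D(i,j) = min{D(i,j-1), D(i-1,j), D(i-1,j-1)} + d(xi, yj) with d taken as the sum of squared differences over the features of a column. The function names are illustrative, ties between predecessors are broken towards the shorter path, and both sequences are assumed non-empty.

```python
import math

def local_distance(x, y):
    """Sum of squared differences between two per-column feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def dtw(X, Y):
    """Return (D(m,n), number of steps on the chosen warping path)."""
    m, n = len(X), len(Y)
    D = [[math.inf] * (n + 1) for _ in range(m + 1)]
    steps = [[0] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Cheapest predecessor among diagonal, upper and left cells.
            best, k = min(
                (D[i - 1][j - 1], steps[i - 1][j - 1]),
                (D[i - 1][j], steps[i - 1][j]),
                (D[i][j - 1], steps[i][j - 1]),
            )
            D[i][j] = best + local_distance(X[i - 1], Y[j - 1])
            steps[i][j] = k + 1
    return D[m][n], steps[m][n]

def normalized(X, Y, by="path"):
    """Normalize by warping-path length ("path") or by average width ("width")."""
    total, k = dtw(X, Y)
    return total / k if by == "path" else total / ((len(X) + len(Y)) / 2)
```

Two characters that differ only by a stretched column, such as `[(0,), (1,), (2,)]` and `[(0,), (1,), (1,), (2,)]`, get distance 0, which is the elastic behaviour that motivates DTW here.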

Page 27: 06   case study - word spotting


Word Matching

String matching distances

• Relative position correspondence

• Edit distance

• Merge-split edit distance

• Linear matching

Page 28: 06   case study - word spotting


Relative Position Correspondence (RPC)

Natural way to match two strings

• One S-character of the query word is matched with a varying number of relative neighbour S-characters in the test word

• Smallest of these costs is added to the total word distance

Normalized word distance = Total word distance / number of matches

[Figure: S-characters 1, 2, 3, ... of the query word, taken in order, compared with the S-characters at the same relative positions in the test word, using 2 neighbours on each side]

Page 29: 06   case study - word spotting


Edit Distance

Distance given by the minimal cost sequence of edit operations

• Replace, Delete, Insert

For two words A, B of size s and t respectively

• A = (a1 ... as) and B = (b1 ... bt)

$W(i,j) = \min \begin{cases} W(i-1,j-1) + \gamma(a_i \to b_j) & \text{Replace} \\ W(i-1,j) + \gamma(a_i \to \Lambda) & \text{Delete} \\ W(i,j-1) + \gamma(\Lambda \to b_j) & \text{Insert} \end{cases} \qquad 1 \le i \le s;\ 1 \le j \le t$

(γ denotes the cost of an edit operation and Λ the empty S-character.)

Edit operation costs = DTW distances

Normalization by length of minimum warping path

[Figure: word-level matrix W for word1 against word2; each cell cost is a DTW distance between S-characters]

Page 30: 06   case study - word spotting


Merge-Split Edit Distance

Proposed solution to solve character segmentation problems

Classical edit operations

• Replace, Insert, Delete

Two new operations, modelling the split capability: $a_i \to (b_j + b_{j+1})$ and $(a_i + a_{i+1}) \to b_j$

Merge-T function: $a_i \to (b_j + b_{j+1})$

• One S-character of the query

• against two S-characters of test

Merge-Q function: $(a_i + a_{i+1}) \to b_j$

• One S-character of the test

• against two S-characters of query

Page 31: 06   case study - word spotting


Merge-Split Edit Distance Calculation

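Initialization

$W(0,0) = 0$

$W(0,j) = W(0,j-1) + \gamma(\Lambda \to b_j)$

$W(i,0) = W(i-1,0) + \gamma(a_i \to \Lambda)$

Recurrence

$W(i,j) = \min \begin{cases} W(i-1,j-1) + \gamma(a_i \to b_j) \\ W(i-1,j-1) + \gamma(a_i \to (b_j + b_{j+1})) \\ W(i-1,j-1) + \gamma((a_i + a_{i+1}) \to b_j) \\ W(i-1,j) + \gamma(a_i \to \Lambda) \\ W(i,j-1) + \gamma(\Lambda \to b_j) \end{cases} \qquad 1 \le i \le s;\ 1 \le j \le t$

(γ denotes the cost of an edit operation and Λ the empty S-character.)

When a merge operation is used, the value is copied in the next cell

[Figure: matrix W with rows Λ, a1, a2, ... and columns Λ, b1, b2, ..., filled for i <= s and j <= t up to W(s,t)]

Normalization

k = length of warping path – no. of merge functions used in path

Normalized word distance = W (s,t) / k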
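A hedged sketch of a merge-split edit distance. It uses a look-back formulation (a merge at cell (i,j) reaches back two S-characters on one side) instead of the slide's look-ahead step with the value copied into the next cell, so it illustrates the idea rather than reproducing the exact table-filling order. The cost callbacks are stand-ins: in the method described here they would be DTW distances, and merging two S-characters is modelled simply with `+`.

```python
def merge_split_distance(query, test, match_cost, gap_cost):
    """Return (total cost, number of operations) between two S-character lists.

    match_cost(q, t): cost of matching a (possibly merged) query item to a test item
    gap_cost(c): cost of inserting or deleting one S-character
    """
    s, t = len(query), len(test)
    inf = float("inf")
    W = [[(inf, 0)] * (t + 1) for _ in range(s + 1)]
    W[0][0] = (0.0, 0)
    for i in range(s + 1):
        for j in range(t + 1):
            if i == 0 and j == 0:
                continue
            cands = []
            if i >= 1:                      # Delete a_i
                c, k = W[i - 1][j]
                cands.append((c + gap_cost(query[i - 1]), k + 1))
            if j >= 1:                      # Insert b_j
                c, k = W[i][j - 1]
                cands.append((c + gap_cost(test[j - 1]), k + 1))
            if i >= 1 and j >= 1:           # Replace a_i -> b_j
                c, k = W[i - 1][j - 1]
                cands.append((c + match_cost(query[i - 1], test[j - 1]), k + 1))
            if i >= 2 and j >= 1:           # Merge-Q: two query chars -> one test char
                c, k = W[i - 2][j - 1]
                merged = query[i - 2] + query[i - 1]
                cands.append((c + match_cost(merged, test[j - 1]), k + 1))
            if i >= 1 and j >= 2:           # Merge-T: one query char -> two test chars
                c, k = W[i - 1][j - 2]
                merged = test[j - 2] + test[j - 1]
                cands.append((c + match_cost(query[i - 1], merged), k + 1))
            W[i][j] = min(cands)
    return W[s][t]

# Toy costs on plain strings: 0 for identical items, 1 otherwise.
toy_match = lambda q, t: 0.0 if q == t else 1.0
toy_gap = lambda c: 1.0
```

With these toy costs, the query `["p", "o", "u", "r"]` against the test `["p", "o", "ur"]` costs 0 in 3 operations, since each merge counts as a single operation in the normalization.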

Page 32: 06   case study - word spotting


Merge-Split Edit Operations - Example

Query word A with 3 S-chars (F, I, G)

Test word B with 2 S-chars (FI, G)

Operation   Cost term               Cost
Insert      γ(Λ → b1)               1.53
Delete      γ(a1 → Λ)               1.39
Replace     γ(a1 → b1)              0.86
Merge-T     γ(a1 → (b1 + b2))       1.65
Merge-Q     γ((a1 + a2) → b1)       0.07

Page 33: 06   case study - word spotting


Matching Example

Query word (rows): p, o, u, r. Test word (columns): p, o, ur.

        Λ      p      o      ur
Λ       0.00   1.79   3.39   5.56
p       1.78   0.02   1.62   3.79
o       3.47   1.72   0.04   2.05
u       5.51   3.75   2.08   0.09
r       6.85   5.10   3.42   0.09

Cost of matching u to ur = 1.83

Cost of matching (u + r) to ur = 0.09

Resolves segmentation problems

Computationally expensive

Page 34: 06   case study - word spotting


Linear Displacement Matching

Reduce computational time: step-wise instead of recursive matching

Three operations in each step: Replace, Merge-T, Merge-Q

• Minimum cost of the three operations is added to the total word distance

• S-characters used in the minimum cost operation are marked

Cost of insertion/deletion for the remaining S-characters?

Normalized word distance = Total word distance / Number of steps

Page 35: 06   case study - word spotting


Partial Calculation of the Distance Matrix

Query word (rows): p, o, u, r. Test word (columns): p, o, ur.

        Λ      p      o      ur
Λ       0.00   1.79   3.39   5.56
p       1.78   0.02   1.62   3.79
o       3.47   1.72   0.04   2.05
u       5.51   3.75   2.08   0.09
r       6.85   5.10   3.42   0.09

Normalized word distance = Total distance / Number of iterations = 0.09 / 3 = 0.03

Page 36: 06   case study - word spotting


Computational Evaluation

Linear Displacement Matching vs Merge-Split distance

• Data Set A

Effect of query length?

[Plot: time per 100 words (secs) against word length (3 to 10), for Merge Split Edit and Linear Matching]

• Computational time increases with the query length

• The increase is not significant

Page 37: 06   case study - word spotting


Experimental Results

Data Set B

• Query words of different lengths and styles

Performance measures

• Recall, Precision, F-measure (F) and R-score (Relevance measure)

[Figure: retrieved words for the query "rhinoscopie", labelled as Exact, Relevant (non-exact) or False positive]

Performance measures 19th century documents 16th century documents

Page 38: 06   case study - word spotting


Performance Measures

Precision

$P = \dfrac{\text{Exact Words Retrieved}}{\text{Exact Words Retrieved} + \text{False Positives}}$

Recall

$R = \dfrac{\text{Exact Words Retrieved}}{\text{Total Exact Words Existing}}$

F-measure

$F = \dfrac{2 \cdot P \cdot R}{P + R}$

R-score

$R\text{-}score = \dfrac{\text{Relevant Words Retrieved}}{\text{Relevant Words Retrieved} + \text{False Positives}}$
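As a worked check, the four measures can be computed directly from counts. The sketch below uses the Merge-Split counts reported in the results for Data Set B (427 exact words, 39 relevant words, 4 false positives, 435 instances); the function names are illustrative.

```python
def precision(exact, false_pos):
    """Exact words retrieved / (exact words retrieved + false positives)."""
    return exact / (exact + false_pos)

def recall(exact, total_existing):
    """Exact words retrieved / total exact words existing."""
    return exact / total_existing

def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

def r_score(relevant, false_pos):
    """Relevant words retrieved / (relevant words retrieved + false positives)."""
    return relevant / (relevant + false_pos)

# Merge-Split counts on Data Set B.
p = precision(427, 4)
r = recall(427, 435)
f = f_measure(p, r)
rs = r_score(39, 4)
```

Expressed as percentages these come out at about 99.07, 98.16, 98.61 and 90.70, in line with the reported Merge-Split figures.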

Page 39: 06   case study - word spotting


Experimental Results

                            RPC      Edit distance   Merge-Split   Linear matching   Rath et al. 2007   ABBYY OCR
# query word instances      435      435             435           435               435                435
# exact words detected      401      406             427           420               335                422
# relevant words detected   99       53              39            33                66                 0
# false positives           51       16              4             3                 54                 0
Precision (%)               88.72%   96.21%          99.07%        99.29%            86.12%             100%
Recall (%)                  92.18%   93.33%          98.16%        96.55%            77.01%             97.01%
F-measure                   90.42%   94.75%          98.61%        97.90%            81.31%             98.48%
R-score                     66.00%   76.81%          90.70%        91.67%            55.00%             -

Page 40: 06   case study - word spotting


Variation with threshold

[Plot: precision against recall]

[Plot: F-score against threshold (0.4 to 1.1)]

Page 41: 06   case study - word spotting


Feature Performance

[Plot: recall against precision for each feature: VP, UP, LP, Ink, VH, MRow]

[Plot: F-score at thresholds T1 to T6 for each feature: VP, UP, LP, Ink, VH, MRow]

Page 42: 06   case study - word spotting


Experimental Results

Three ancient books of the 16th century

• 12 document images, each with 1400+ words

• 15 query words with 171 instances in total

Page 43: 06   case study - word spotting


Experimental Results

# Words retrieved (TOTAL over all queries, 171 query instances)

                    OCR      Edit distance   Linear Matching   Merge-Split Edit
Exact (Exct)        89       88              146               149
Relevant (Rel)      -        41              74                78
False pos. (FP)     -        25              19                23
Recall %            52.04%   51.46%          85.38%            87.13%
Precision %         -        77.87%          88.48%            86.63%

Page 44: 06   case study - word spotting


Figure-caption pair generation

Link a figure with its caption

Caption candidates selection

• Spatial information

Figure caption label search

• Label word spotting

Data Set C

• 180 / 204 captions detected

• 4 false positives (98% precision)

Page 45: 06   case study - word spotting


Perspectives

Automatic generation of prototype characters

Potential to be used for non-Latin (Oriental) text

Potential to be used for low resolution contemporary documents

Ancient documents: Improve text line and word extraction

Page 46: 06   case study - word spotting

Results on Arabic text