06 case study - word spotting
Word Spotting in Historical Document Images
Dr Khurram Khurshid
Case Study
2
Historical Documents
Contain invaluable information
Preservation – information access?
Digitization
• Easy and fast access
Information spotting
www.inforouter.com
Problem
3
Information Retrieval
How to retrieve the required information? OCR?
Fails for ancient documents
A 16th century document image
A 19th century document image
Problem
OCR output: lêVÎerS' danS C6tte %Ur6' a été °*g^) el" doit *£
Word Spotting
4
Plan
Introduction
• Related work - state of the art
• Proposed method
Document Indexing
• Word/graphic segmentation
• Character extraction
• Feature definition
Word Retrieval
• Character matching
• Word matching
Experimental Results
Applications
Conclusion/Perspectives
Presentation Plan
http://en.wikipedia.org/wiki/Johannes_Gutenberg
5
Introduction - Word Spotting
Word Spotting - an alternative to OCR
Compares two word images through a matching process to decide whether they are similar
Input query
Matching process
Recognized words
Introduction Document Indexing Word Retrieval Results Applications Conclusion
Word Spotting State of the art Proposed Method Data sets
6
State of the Art
Mainly two different categories of methods
Holistic
• view a word image as a unit
Analytical
• a word image is segmented into smaller units
7
Holistic vs Analytical
Handwritten text
• Character segmentation can be avoided
Holistic methods
Printed text
• Easier to segment into characters
Analytical methods
• Ability to focus on the local intrinsic characteristics of words
• Allow a more precise word representation
Analytical approach suitable for printed docs
8
Benchmark in the domain
Profile feature sequence matching [Rath and Manmatha 2007]
Four feature sequences for each word image
• Vertical projection
• Lower profile
• Upper profile
• Ink/background transitions
Matching at word level using DTW
Method of Rath                 | Proposed Method
Holistic                       | Analytical
4 features                     | (4+2=6) features
Features defined at word level | Features defined at character level
DTW at word level              | Multi-level dynamic matching
9
Proposed Approach
Indexing: Word/graphic Segmentation → Character Extraction → Feature Definition
Query Word Processing (ASCII Query)
Matching: Dynamic Time Warping (Character – level), Dynamic String Comparison (Word – level)
10
Data Sets
Document images of Bibliothèque Interuniversitaire de Médecine (BIUM)
12 books of 19th century
Data Set A (Training set)
• 20 pages : 4 each from 5 books
• Total words = 6,632
• 25 (5x5) query words having 175 instances
Data Set B (Test set)
• 60 pages : 4 each from 12 books
• Total words = 17,010
• 60 (5x12) query words having 435 instances in total
Data Set C
• 3 complete books
• More than 500 pages in total
3 books of 16th century
11
Document Image Indexing
Binarization → Word/Graphic Segmentation → Character Extraction → Feature Definition → Indexing
Binarization Word/graphic segmentation Character extraction Feature definition Image indexing
12
Document Image Binarization
NICK algorithm
$$T = m + k\sqrt{\frac{\sum_{i=1}^{NP} p_i^2 - m^2}{NP}}$$
k = NICK factor having value between –0.2 and –0.1
pi = pixel value of gray scale image
NP = number of pixels in the window
m = mean gray value of these NP pixels
[Figure: binarization results for k = -0.2 and k = -0.1]
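A minimal sketch of the NICK threshold, assuming the published form T = m + k * sqrt((sum of p_i^2 - m^2) / NP), a square window around each pixel, and dark ink on a light background. The function names, the `half` window parameter and the rule "pixel < T means ink" are illustrative assumptions, not taken from the slides.

```python
import math


def nick_threshold(window, k=-0.1):
    """NICK threshold for a flat list of gray values (one local window)."""
    np_count = len(window)                 # NP: number of pixels in the window
    m = sum(window) / np_count             # m: mean gray value
    sum_sq = sum(p * p for p in window)    # sum of squared pixel values
    return m + k * math.sqrt((sum_sq - m * m) / np_count)


def binarize(gray, half=1, k=-0.1):
    """Return a 0/1 image (1 = ink) using a (2*half+1)-wide window, clipped at borders."""
    h, w = len(gray), len(gray[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                gray[yy][xx]
                for yy in range(max(0, y - half), min(h, y + half + 1))
                for xx in range(max(0, x - half), min(w, x + half + 1))
            ]
            # Assumption: ink is darker than the local threshold.
            out[y][x] = 1 if gray[y][x] < nick_threshold(window, k) else 0
    return out
```

With k closer to -0.2 the threshold drops further below the local mean, so fewer pixels are kept as ink; k closer to -0.1 keeps more.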
13
Word/Graphic Segmentation
Multi-step bottom-up approach
Horizontal Run Length Smoothing Algorithm
Graphic Component Detection
• Height-Area Analysis of the components
d > threshold
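A rough sketch of horizontal run-length smoothing on a binary image (1 = ink), assuming that a background gap of width d between two ink pixels is filled when d is at most a threshold, so that only gaps with d > threshold remain as separators. The function names and the threshold value are hypothetical.

```python
def hrlsa_row(row, threshold):
    """Fill background runs of length <= threshold that lie between two ink pixels."""
    out = list(row)
    last_ink = None
    for x, value in enumerate(row):
        if value:
            if last_ink is not None:
                gap = x - last_ink - 1
                if 0 < gap <= threshold:
                    for g in range(last_ink + 1, x):
                        out[g] = 1
            last_ink = x
    return out


def hrlsa(image, threshold):
    """Apply horizontal run-length smoothing to every row of a 0/1 image."""
    return [hrlsa_row(row, threshold) for row in image]
```

After smoothing, the connected components of the result can serve as word candidates, since characters of one word are closer together than neighbouring words.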
14
Word/Graphic Segmentation
Evaluation of Word Segmentation on Data Set B
• Words segmented perfectly = 99.76%
Problems
• Titles in very large font
• Can be treated separately using large RLSA
15
Word/Graphic Segmentation
Component Height-Area Analysis
Word Components = [ Component area < Mean component area x A AND Component height < Mean component height x B ]
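The rule above can be sketched as follows. The slides do not give values for A and B, so they are left as required parameters; the `(area, height)` tuple layout is an assumption made for this example.

```python
def classify_components(components, a_factor, b_factor):
    """Return True for word components, False for graphic components.

    components: list of (area, height) tuples, one per connected component.
    a_factor, b_factor: the multipliers A and B from the rule (values not given in the slides).
    """
    mean_area = sum(area for area, _ in components) / len(components)
    mean_height = sum(height for _, height in components) / len(components)
    return [
        area < mean_area * a_factor and height < mean_height * b_factor
        for area, height in components
    ]
```

For example, with two small components and one very large one, only the large one exceeds both scaled means and is treated as a graphic.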
16
Character Extraction
T-character (true alphabetic characters)
Connected components (CCs) of the word image
S-Character (segmented character)
Heuristic Rules – 3 passes
Pass 1 - Multi-component characters
Improve S-characters to correspond to T-characters
17
Character Extraction
Pass 2 – Grouping the broken S-characters
Pass 3 – Remove punctuation marks and noise components
After processing stages, 98% of S-characters correspond to T-characters
[Figure: examples of split characters and merged characters, components A and B; annotation: less than T]
18
Character Extraction
Validation using data set A
# Total T-characters in the data set: 82264

                                          | Before post processing passes | After pass 1 and 2 | After pass 3
# of S-characters treated in this stage   | -                             | 20745 (merged)     | 10244 (removed)
# of S-characters                         | 115414 (raw, within words)    | 94669              | 84425
# of T-characters in these S-characters   | 60358                         | 81103              | 81103
Recall %                                  | 73.4%                         | 98.6%              | 98.6%
Precision %                               | 52.3%                         | 85.7%              | 96.1%
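The recall and precision figures in this validation are consistent with recall = matched T-characters / total T-characters and precision = matched T-characters / number of S-characters. That reading is inferred from the numbers rather than stated on the slide; a small check under that assumption:

```python
def extraction_scores(matched_t, total_t, s_chars):
    """Return (recall %, precision %) rounded to one decimal place."""
    recall = 100.0 * matched_t / total_t
    precision = 100.0 * matched_t / s_chars
    return round(recall, 1), round(precision, 1)


TOTAL_T = 82264
before = extraction_scores(60358, TOTAL_T, 115414)        # raw S-characters
after_pass_2 = extraction_scores(81103, TOTAL_T, 94669)   # 115414 - 20745 merged
after_pass_3 = extraction_scores(81103, TOTAL_T, 84425)   # 94669 - 10244 removed
```

Pass 3 leaves recall unchanged because it only removes S-characters that do not correspond to any T-character, which raises precision.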
19
Feature Extraction
Sequence of features, computed for each pixel column
• Upper profile - distance of first ink pixel from top
• Lower profile - distance of last ink pixel from top
• Vertical projection - summation of different intensity values
• Ink/Non-ink transitions - number of ink /non-ink transitions
• Vertical histogram - count number of ink pixels
• Mid Row transitions
[Figure: example pixel row 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 …]
Length of the vector sequence = Pixel width of S-character
Feature vector per column: (Up, Lp, Vp, Ink, Vh, Mr)
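One possible reading of the six column features, as a sketch. Several details are assumptions not fixed by the slide: an empty column gets the image height as both profiles, the vertical projection sums the gray values of the column, and the mid-row feature is 1 when the middle row changes between ink and non-ink relative to the previous column.

```python
def column_features(ink, gray):
    """Return one (Up, Lp, Vp, Ink, Vh, Mr) tuple per pixel column.

    ink:  0/1 image of the S-character (1 = ink), list of rows.
    gray: gray-scale image of the same size.
    """
    h, w = len(ink), len(ink[0])
    mid = h // 2
    features = []
    for x in range(w):
        column = [ink[y][x] for y in range(h)]
        ink_rows = [y for y, v in enumerate(column) if v]
        up = ink_rows[0] if ink_rows else h    # distance of first ink pixel from top
        lp = ink_rows[-1] if ink_rows else h   # distance of last ink pixel from top
        vp = sum(gray[y][x] for y in range(h)) # summed intensity values
        transitions = sum(1 for y in range(1, h) if column[y] != column[y - 1])
        vh = len(ink_rows)                     # number of ink pixels
        # Assumption: transition along the middle row, relative to the previous column.
        mr = 1 if x > 0 and ink[mid][x] != ink[mid][x - 1] else 0
        features.append((up, lp, vp, transitions, vh, mr))
    return features
```

The returned list has one entry per column, so its length equals the pixel width of the S-character.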
20
Index File
One index file for one document image
• Position of each word
• Width and height of the word’s bounding box
• Number of S-characters in the word
• Position of each S-character in the document image
• Width/height of the S-character’s bounding box
• Features of each S-character
Computational time per page
Test using Data Set B on
• Intel core2duo 2.1GHz
• 3GB RAM
Average time = 130s
Average size = 600KB
[Figure: indexing time (s) per document image]
21
Word Retrieval
Multi-stage matching
DTW for S-character matching
Spotted words
Word matching - string comparison
ASCII query
Query image representation
Length-Ratio filter
Indexed docs
Processing stages
Query image
Query formation Length-ratio filter Word spotting Character matching Word matching
22
Query Formation
ASCII query
Prototype characters
Collection for each book
Word Image query
Click on the word in user interface
Position information in index file
23
Length-Ratio Filter
Filter out unlikely words
Compare the number of S-characters in test and query words
50–75% of words are filtered out
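A sketch of such a filter, assuming the comparison is the ratio of the smaller S-character count to the larger one. The slides do not state the acceptance bound, so `min_ratio` is a hypothetical parameter.

```python
def length_ratio_filter(query_len, test_lens, min_ratio):
    """Return indices of test words whose S-character count is close enough to the query's.

    query_len: number of S-characters in the query word.
    test_lens: number of S-characters in each candidate word.
    min_ratio: hypothetical lower bound on min(len) / max(len).
    """
    kept = []
    for index, test_len in enumerate(test_lens):
        longest = max(query_len, test_len)
        if longest == 0:
            continue  # nothing to compare
        if min(query_len, test_len) / longest >= min_ratio:
            kept.append(index)
    return kept
```

Only the surviving candidates go on to the more expensive character-level and word-level matching.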
[Figure: % of words filtered out (40 to 80) against # of S-characters in query word (3 to 10)]
24
Word Spotting
Multi-level matching for retrieval
Character level matching: DTW with Euclidean distance on the feature vectors (Up, Lp, Vp, Ink, Vh, Mr)
Word level matching: String match
25
Character Matching
Two characters are matched by comparing their feature vectors using DTW
Why DTW? Non-linear elastic matching
[Figure: linear alignment (point i matched to point i) vs non-linear alignment (point i matched to point i+2)]
26
Character Matching
For two S-characters X and Y of widths m and n, their feature vectors are treated as two series X = (x1 .. xm) and Y = (y1 .. yn)
$$D(i,j) = \min\{\, D(i-1,j-1),\ D(i-1,j),\ D(i,j-1) \,\} + d(x_i, y_j)$$
$$d(x_i, y_j) = \sum_{k=1}^{6} (x_{i,k} - y_{j,k})^2$$
[Figure: DTW cost matrix with x1 .. xm on one axis and y1 .. yn on the other; the accumulated cost of the warping path is D(m,n)]
Distance Normalization
• Minimum Warping Path
D(m,n) / No. of steps (k)
• Average width
D(m,n) / [(m+n)/2]
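A compact sketch of the character-level DTW described on this slide, assuming each column is a 6-value feature vector and the local cost is the sum of squared differences. Tracking the number of steps alongside the accumulated cost, and breaking ties in favour of fewer steps, are choices made for this example.

```python
def local_cost(x, y):
    """Sum of squared differences between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def dtw(X, Y):
    """Return (D(m, n), k): accumulated cost and number of steps of the chosen path."""
    m, n = len(X), len(Y)
    inf = float("inf")
    D = [[inf] * (n + 1) for _ in range(m + 1)]
    steps = [[0] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best, k = min(
                (D[i - 1][j - 1], steps[i - 1][j - 1]),
                (D[i - 1][j], steps[i - 1][j]),
                (D[i][j - 1], steps[i][j - 1]),
            )
            D[i][j] = local_cost(X[i - 1], Y[j - 1]) + best
            steps[i][j] = k + 1
    return D[m][n], steps[m][n]


def normalized_by_path(X, Y):
    """D(m, n) divided by the number of steps k."""
    total, k = dtw(X, Y)
    return total / k


def normalized_by_width(X, Y):
    """D(m, n) divided by the average width (m + n) / 2."""
    total, _ = dtw(X, Y)
    return total / ((len(X) + len(Y)) / 2)
```

The elastic alignment is what lets a character whose image is one or two columns wider still match its counterpart at zero or near-zero cost.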
27
Word Matching
String matching distances
Relative position correspondence
Edit distance
Merge-split Edit distance
Linear Matching
28
Relative Position Correspondence (RPC)
Natural way to match two strings
One S-character of the query word is matched against a number of neighbouring S-characters at the corresponding relative position in the test word
• Smallest of these costs is added to the total word distance
Normalized word distance = Total word distance / number of matches
[Figure: S-characters of the query word and the test word in order (1, 2, 3, ...); each query S-character is compared with 2 neighbors on each side in the test word]
29
Edit Distance
Distance given by the minimal cost sequence of edit operations
• Replace, Delete, Insert
For two words A, B of size s & t respectively
• A = (a1 ... as) and B = (b1 ... bt)
Edit operation costs = DTW distances
Normalization by length of minimum warping path
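[Figure: word-level matrix W for word1 against word2; each cell cost is a DTW distance between two S-characters, e.g. columns r1 .. r5 against o1 .. o4]
$$W(i,j) = \min \begin{cases} W(i-1,j-1) + \mathrm{cost}(a_i \to b_j) & \text{(Replace)} \\ W(i-1,j) + \mathrm{cost}(a_i \to \Lambda) & \text{(Delete)} \\ W(i,j-1) + \mathrm{cost}(\Lambda \to b_j) & \text{(Insert)} \end{cases} \qquad 1 \le i \le s;\ 1 \le j \le t$$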
30
Merge-Split Edit Distance
Proposed solution to solve character segmentation problems
Two new operations: a_i → (b_j + b_{j+1}) and (a_i + a_{i+1}) → b_j
Merge-T function
• One S-character of the query
• against two S-characters of test
Merge-Q function
• One S-character of the test
• against two S-characters of query
Modelling the Split capability
Classical Edit operations
Replace, Insert, Delete
31
Merge-Split Edit Distance Calculation
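$$W(0,0) = 0$$
$$W(0,j) = W(0,j-1) + \mathrm{cost}(\Lambda \to b_j)$$
$$W(i,0) = W(i-1,0) + \mathrm{cost}(a_i \to \Lambda)$$
$$W(i,j) = \min \begin{cases} W(i-1,j-1) + \mathrm{cost}(a_i \to b_j) \\ W(i-1,j-1) + \mathrm{cost}(a_i \to (b_j + b_{j+1})) \\ W(i-1,j-1) + \mathrm{cost}((a_i + a_{i+1}) \to b_j) \\ W(i-1,j) + \mathrm{cost}(a_i \to \Lambda) \\ W(i,j-1) + \mathrm{cost}(\Lambda \to b_j) \end{cases} \qquad 1 \le i \le s;\ 1 \le j \le t$$
[Figure: matrix W with rows Λ, a1, a2, … and columns Λ, b1, b2, …, filled for i <= s and j <= t; the final value is W(s,t)]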
Normalization
k = length of warping path – no. of merge functions used in path
Normalized word distance = W(s,t) / k
When a merge function is used, its value is copied into the next cell
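A sketch of an edit distance with the two merge operations. It is not the exact table-filling scheme of the slide: instead of copying a merged value into the next cell, it uses an index-shifted variant in which a merge consumes two S-characters in a single step, so the returned step count already plays the role of k. The `cost`, `merge` and `indel` callables are placeholders for the DTW-based costs and are assumptions of this example.

```python
def merge_split_distance(query, test, cost, merge, indel):
    """Return (distance, steps) between two sequences of S-characters.

    cost(x, y): cost of matching x against y (DTW distance in the slides).
    merge(x, y): representation of two adjacent S-characters joined together.
    indel(x): cost of inserting or deleting x.
    """
    s, t = len(query), len(test)
    W = [[None] * (t + 1) for _ in range(s + 1)]
    W[0][0] = (0.0, 0)
    for i in range(1, s + 1):
        w, k = W[i - 1][0]
        W[i][0] = (w + indel(query[i - 1]), k + 1)
    for j in range(1, t + 1):
        w, k = W[0][j - 1]
        W[0][j] = (w + indel(test[j - 1]), k + 1)
    for i in range(1, s + 1):
        for j in range(1, t + 1):
            a_i, b_j = query[i - 1], test[j - 1]
            candidates = [
                (W[i - 1][j - 1], cost(a_i, b_j)),  # Replace
                (W[i - 1][j], indel(a_i)),          # Delete
                (W[i][j - 1], indel(b_j)),          # Insert
            ]
            if j >= 2:  # Merge-T: one query S-character against two test S-characters
                candidates.append((W[i - 1][j - 2], cost(a_i, merge(test[j - 2], b_j))))
            if i >= 2:  # Merge-Q: two query S-characters against one test S-character
                candidates.append((W[i - 2][j - 1], cost(merge(query[i - 2], a_i), b_j)))
            W[i][j] = min((w + c, k + 1) for (w, k), c in candidates)
    return W[s][t]


# Toy stand-ins for the image-based costs: S-characters are plain strings.
def toy_cost(x, y):
    return 0.0 if x == y else 1.0


def toy_merge(x, y):
    return x + y


def toy_indel(x):
    return 1.0
```

With the toy costs, `merge_split_distance(list("pour"), ["p", "o", "ur"], toy_cost, toy_merge, toy_indel)` matches "u" and "r" jointly against the merged test S-character "ur" and needs three steps in total.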
32
Merge-Split Edit Operations - Example
Query word A with 3 S-chars (F, I, G)
Test word B with 2 S-chars (FI, G)
Operation | Expression              | Cost
Insert    | cost(Λ → b1)            | 1.53
Delete    | cost(a1 → Λ)            | 1.39
Replace   | cost(a1 → b1)           | 0.86
Merge-T   | cost(a1 → (b1 + b2))    | 1.65
Merge-Q   | cost((a1 + a2) → b1)    | 0.07
33
Matching Example
Rows: query word (Λ, p, o, u, r). Columns: test word (Λ, p, o, ur).
      | Λ    | p    | o    | ur
Λ     | 0.00 | 1.79 | 3.39 | 5.56
p     | 1.78 | 0.02 | 1.62 | 3.79
o     | 3.47 | 1.72 | 0.04 | 2.05
u     | 5.51 | 3.75 | 2.08 | 0.09
r     | 6.85 | 5.10 | 3.42 | 0.09
Resolves segmentation problems
Computationally expensive
Cost of matching u to ur = 1.83
Cost of matching (u + r) to ur = 0.09
34
Linear Displacement Matching
Reduce computational time: step-wise instead of recursive matching
Three operations in each step: Replace, Merge-T, Merge-Q
• Minimum cost of the three operations is added to the total word distance
S-characters used in the minimum cost operation are marked
Cost of insertion/deletion for the remaining S-characters?
Normalized word distance = Total word distance / Number of steps
35
Partial Calculation of the Distance Matrix
Rows: query word (Λ, p, o, u, r). Columns: test word (Λ, p, o, ur).
      | Λ    | p    | o    | ur
Λ     | 0.00 | 1.79 | 3.39 | 5.56
p     | 1.78 | 0.02 | 1.62 | 3.79
o     | 3.47 | 1.72 | 0.04 | 2.05
u     | 5.51 | 3.75 | 2.08 | 0.09
r     | 6.85 | 5.10 | 3.42 | 0.09
Normalized Word Distance = Total distance / Number of iterations = 0.09/3 = 0.03
36
Computational Evaluation
Linear Displacement Matching vs Merge-Split distance
• Data Set A
Effect of query length?
[Figure: time per 100 words (secs) against word length (3 to 10) for Merge Split Edit and Linear Matching]
Computational time increases with the query length
The increase is not significant
37
Experimental Results
Data Set B
• Query words of different lengths and styles
Performance measures
• Recall, Precision, F-measure (F) and R-score (Relevance measure)
[Figure: retrieved results for the query "rhinoscopie", labelled as exact, relevant (non exact) or false positive]
Performance measures 19th century documents 16th century documents
38
Performance Measures
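Precision
$$P = \frac{\text{Exact Words Retrieved}}{\text{Exact Words Retrieved} + \text{False Positives}}$$
Recall
$$R = \frac{\text{Exact Words Retrieved}}{\text{Total Exact Words Existing}}$$
F-measure
$$F = \frac{2 \cdot P \cdot R}{P + R}$$
R-score
$$R\text{-}score = \frac{\text{Relevant Words Retrieved}}{\text{Relevant Words Retrieved} + \text{False Positives}}$$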
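The four measures are simple ratios; a short sketch follows, with function and argument names chosen for this example. The usage below plugs in the counts reported for RPC on Data Set B (435 instances, 401 exact, 99 relevant, 51 false positives).

```python
def precision(exact, false_positives):
    """Exact words retrieved / (exact words retrieved + false positives)."""
    return exact / (exact + false_positives)


def recall(exact, total_existing):
    """Exact words retrieved / total exact words existing."""
    return exact / total_existing


def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


def r_score(relevant, false_positives):
    """Relevant words retrieved / (relevant words retrieved + false positives)."""
    return relevant / (relevant + false_positives)


p = precision(401, 51)
r = recall(401, 435)
rpc_scores = {
    "precision": round(100 * p, 2),
    "recall": round(100 * r, 2),
    "f_measure": round(100 * f_measure(p, r), 2),
    "r_score": round(100 * r_score(99, 51), 2),
}
```

The R-score separates non-exact but related retrievals from true false positives, which precision alone would not show.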
39
Experimental Results
                         | RPC    | Edit distance | Merge-Split | Linear matching | Rath et al. 2007 | ABBYY OCR
#query word instances    | 435    | 435           | 435         | 435             | 435              | 435
#exact words detected    | 401    | 406           | 427         | 420             | 335              | 422
#relevant words detected | 99     | 53            | 39          | 33              | 66               | 0
#false positives         | 51     | 16            | 4           | 3               | 54               | 0
Precision (%)            | 88.72% | 96.21%        | 99.07%      | 99.29%          | 86.12%           | 100%
Recall (%)               | 92.18% | 93.33%        | 98.16%      | 96.55%          | 77.01%           | 97.01%
F-measure                | 90.42% | 94.75%        | 98.61%      | 97.90%          | 81.31%           | 98.48%
R-score                  | 66.00% | 76.81%        | 90.70%      | 91.67%          | 55.00%           | -
40
Variation with threshold
[Figure: recall against precision, and F-score against threshold (0.4 to 1.1)]
41
Feature Performance
[Figure: recall against precision, and F-score at thresholds T1 to T6, for each individual feature (VP, UP, LP, Ink, VH, MRow)]
42
Experimental Results
Three ancient books of 16th century
12 document images, each with 1400+ words
15 query words with 171 instances in total
43
Experimental Results
# Words retrieved for the 171 query instances in total (Exct = exact, Rel = relevant, FP = false positives)
Method           | Exct | Rel | FP | Recall % | Precision %
OCR              | 89   | -   | -  | 52.04%   | -
Edit distance    | 88   | 41  | 25 | 51.46%   | 77.87%
Linear Matching  | 146  | 74  | 19 | 85.38%   | 88.48%
Merge-Split Edit | 149  | 78  | 23 | 87.13%   | 86.63%
44
Figure-caption pair generation
Link a figure with its caption
Caption candidates selection
• Spatial Information
Figure caption label search
• Label word spotting
Data Set C
• 180 / 204 captions detected
• 4 false positives (98% precision)
45
Perspectives
Automatic generation of prototype characters
Potential to be used for non-Latin (Oriental) text
Potential to be used for low resolution contemporary documents
Ancient documents: Improve text line and word extraction
46
Results on Arabic text