Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents
Walid Magdy & Kareem Darwish, IBM Technology Development Center, PO Box 166 El-Ahram, Giza, Egypt, {wmagdy,darwishk}@eg.ibm.com
![Page 1: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents](https://reader035.vdocuments.net/reader035/viewer/2022062518/5681472f550346895db46bf1/html5/thumbnails/1.jpg)
Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents
Walid Magdy & Kareem Darwish
IBM Technology Development Center
PO Box 166 El-Ahram, Giza, Egypt
{wmagdy,darwishk}@eg.ibm.com
![Page 2: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents](https://reader035.vdocuments.net/reader035/viewer/2022062518/5681472f550346895db46bf1/html5/thumbnails/2.jpg)
Outline:
1. Motivation
2. Background
3. Approach
4. Experimental Setup
5. Results
6. Conclusion
7. Future Work
![Page 3: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents](https://reader035.vdocuments.net/reader035/viewer/2022062518/5681472f550346895db46bf1/html5/thumbnails/3.jpg)
Motivation:
Timeline, 1400 to 2000:
• First printing press
• Read to search
• E-text becomes commonplace
• Automated full text search
• 1998: Arabic e-text comes online
Problem: 500+ years of legacy documents
Goal: To search printed documents efficiently and effectively
Does OCR solve the problem?
![Page 4: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents](https://reader035.vdocuments.net/reader035/viewer/2022062518/5681472f550346895db46bf1/html5/thumbnails/4.jpg)
Arabic Language Challenges
• Orthography
– Character shape depends on position
– 15 of the 28 letters contain dots
– Optional diacritics may be present
– Printed text may include ligatures and kashida
• Morphology
– Prefix, infix, and suffix
– 6×10^10 possible surface forms
– Example: وسيكتبونها = wasaya + ktub + uunahaa = “and will” + “write” + “they it” = “and they will write it”
• Other factors
– Eighth most widely spoken language in the world
– Web growth started only recently
![Page 5: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents](https://reader035.vdocuments.net/reader035/viewer/2022062518/5681472f550346895db46bf1/html5/thumbnails/5.jpg)
Arabic Pre-processing & Retrieval
• Pre-processing:
– Remove diacritics
– Normalize different forms of alef & ya to accommodate
∙ Common spelling errors
∙ Grammatical, morphological, and orthographic properties
– Letter forms: ا ، إ ، آ ، ء ، أ ، ؤ ، ئ → ا and ي ، ى → ي
• Text Retrieval: Best Index Terms
– Regular text: light stemming and character 3- & 4-grams are best
– OCR text: character 3- & 4-grams are best
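The pre-processing and indexing steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `preprocess` and `char_ngrams` are hypothetical, the diacritic range (U+064B to U+0652) is an assumption about which marks count as diacritics, and the mapping table covers only the alef variants and alef maqsura (the slide's list may also fold hamza forms).

```python
import re

# Assumed diacritic set: fathatan (U+064B) through sukun (U+0652).
DIACRITICS = re.compile("[\u064B-\u0652]")

# Assumed normalization table (illustrative subset of the slide's list).
NORMALIZE = str.maketrans({
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0649": "\u064A",  # alef maqsura          -> ya
})


def preprocess(text):
    """Strip diacritics, then fold alef and ya variants."""
    return DIACRITICS.sub("", text).translate(NORMALIZE)


def char_ngrams(word, n):
    """Return overlapping character n-grams; short words are kept whole."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]
```

A word indexed with character 3-grams contributes every overlapping window of three letters, so a single OCR error damages only the n-grams that overlap it rather than the whole index term.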
![Page 6: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents](https://reader035.vdocuments.net/reader035/viewer/2022062518/5681472f550346895db46bf1/html5/thumbnails/6.jpg)
Main Idea:
Image: Word-Based Correction for Retrieval of Arabic OCR Degraded Documents
→ OCR →
Degraded Text: VVorcl-Easod Comectlon l0r Belrieval of Arahie OCRDcgraclod Doeurnerits
→ Correction →
Corrected Text: Word-Based Correction for Retrieval of Arabic OCR Degraded Documents
We want to examine the effect of correction on retrieval
![Page 7: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents](https://reader035.vdocuments.net/reader035/viewer/2022062518/5681472f550346895db46bf1/html5/thumbnails/7.jpg)
Approach:
[Diagram: OCR system → OCR-degraded text → OCR correction → OCR-corrected text → Indexing → Ranked list of documents]
![Page 8: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents](https://reader035.vdocuments.net/reader035/viewer/2022062518/5681472f550346895db46bf1/html5/thumbnails/8.jpg)
Experimental Setup:
• Test collections
• Error Correction
• Building Error Model
• Training & Decoding
• Experiments
![Page 9: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents](https://reader035.vdocuments.net/reader035/viewer/2022062518/5681472f550346895db46bf1/html5/thumbnails/9.jpg)
Document Collections:

| | ZAD | TREC 2002 CLIR |
|---|---|---|
| Source | Printed 14th century religious book, scanned at 300x300 dpi and OCR’ed | Arabic newswire articles from Agence France-Presse (AFP) |
| Size | 2,730 documents | 383,872 articles |
| Topics | 25 | 50 |
| Degradation | Real degraded text from the OCR process | Synthetic degraded text using a degradation model |
| WER | 39% | 30.8% |
![Page 10: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents](https://reader035.vdocuments.net/reader035/viewer/2022062518/5681472f550346895db46bf1/html5/thumbnails/10.jpg)
The ZAD Collection:
Sample Document: [page image]
Sample Query: حكم التيمم ومتى شرع (the ruling on tayammum and when it was prescribed)
![Page 11: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents](https://reader035.vdocuments.net/reader035/viewer/2022062518/5681472f550346895db46bf1/html5/thumbnails/11.jpg)
The TREC 2002 CLIR Collection:
Sample Query: سجناء حرب ايرانيين وعراقيين (Iranian and Iraqi prisoners of war)
Sample Document:
<DOC>
<DOCNO>19940513_AFP_ARB0001</DOCNO>
<HEADER> ارا0080 4 ع 0177 قبرص /افب-تصج68 الشرق الاوسط/سلام/حكم ذاتي </HEADER>
<BODY>
<HEADLINE> &HT; العلم الفلسطيني لم يرفع فوق كنيس اريحا </HEADLINE>
<TEXT>
<P> اريحا (الضفة الغربية) 13-5 (اف ب)- يقوم احد عناصر الشرطة الفلسطينية بحراسة مدخل الكنيس اليهودي في وسط اريحا احد آخر مواقع المدينة التي تم تسليمها الى الشرطة الفلسطينية الا انه لم يتم رفع العلم الفلسطيني فوق الكنيس </P>
<P> وقال ضابط فلسطيني لفلسطينية كانت تحاول رفع العلم الفلسطيني فوق الكنيس "هذا مكان مقدس" </P>
<P> وقبيل ذلك اقترب ثلاثة مستوطنين يهود من مدخل الكنيس الذي كان الجنود الاسرائيليون ما زالوا يوءمنون حراسته وعندما منعهم الجنود من الدخول قاموا بتمزيق ثيابهم </P>
</TEXT>
![Page 12: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents](https://reader035.vdocuments.net/reader035/viewer/2022062518/5681472f550346895db46bf1/html5/thumbnails/12.jpg)
OCR-Correction Model:
Training: Manually corrected OCR text + OCR-degraded text → Aligning characters (mapping) → Build error model
Decoding: OCR-degraded text → Generate corrections → Pick the most likely correction using Bayes' rule → OCR-corrected text
![Page 13: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents](https://reader035.vdocuments.net/reader035/viewer/2022062518/5681472f550346895db46bf1/html5/thumbnails/13.jpg)
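Aligning Characters Mapping:
Ex: walid → vvaicl (√ = match, S = substitution, D = deletion, I = insertion)

m:n Mapping

| Clean | OCR | Operation |
|---|---|---|
| w | vv | S |
| a | a | √ |
| l | Null | D |
| i | i | √ |
| d | cl | S |

1:1 Mapping

| Clean | OCR | Operation |
|---|---|---|
| w | v | S |
| Null | v | I |
| a | a | √ |
| l | Null | D |
| i | i | √ |
| d | c | S |
| Null | l | I |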
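A 1:1 alignment like the one on this slide can be produced with a standard edit-distance table and a backtrace. The sketch below is an assumption about how such an alignment could be computed, not the authors' aligner; the function name `align` and the unit costs are illustrative. Several alignments can tie for the minimum cost, so the exact pairs returned may differ from the slide while the number of edits stays the same.

```python
def align(clean, ocr):
    """Return a minimum-edit 1:1 alignment as (clean_char, ocr_char, op) triples.

    op is "=" (match), "S" (substitution), "D" (deletion), or "I" (insertion);
    None stands for the Null side of an insertion or deletion.
    """
    n, m = len(clean), len(ocr)
    # cost[i][j] = edit distance between clean[:i] and ocr[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (clean[i - 1] != ocr[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Walk back from the bottom-right corner to recover one optimal path.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (clean[i - 1] != ocr[j - 1]):
            op = "=" if clean[i - 1] == ocr[j - 1] else "S"
            pairs.append((clean[i - 1], ocr[j - 1], op))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((clean[i - 1], None, "D"))
            i -= 1
        else:
            pairs.append((None, ocr[j - 1], "I"))
            j -= 1
    pairs.reverse()
    return pairs
```

Calling `align("walid", "vvaicl")` yields an alignment with five edit operations, the same count as the 1:1 table above. An m:n mapping would additionally merge adjacent edits (for example w → v plus an inserted v into w → vv).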
![Page 14: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents](https://reader035.vdocuments.net/reader035/viewer/2022062518/5681472f550346895db46bf1/html5/thumbnails/14.jpg)
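Building Error Model:

$$P_{\text{substitution}}(C_k..C_l \rightarrow D_x..D_y) = \frac{\text{count}(C_k..C_l \rightarrow D_x..D_y)}{\text{count}(C_k..C_l)}$$

$$P_{\text{deletion}}(C_k..C_l \rightarrow \varepsilon) = \frac{\text{count}(C_k..C_l \rightarrow \varepsilon)}{\text{count}(C_k..C_l)}$$

$$P_{\text{insertion}}(\varepsilon \rightarrow D_x..D_y) = \frac{\text{count}(\varepsilon \rightarrow D_x..D_y)}{\text{count}(C)}$$

where $C_k..C_l$ and $D_x..D_y$ are each one or more characters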
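The three ratios can be estimated directly from a list of aligned segment pairs. The following is a hedged sketch: `build_error_model` is a hypothetical name, the empty string stands in for ε, and count(C) is assumed here to be the total number of clean characters in the training alignments (the slide does not spell out its definition).

```python
from collections import Counter


def build_error_model(aligned_pairs):
    """Estimate mapping probabilities from (clean_segment, ocr_segment) pairs.

    An empty clean segment marks an insertion; an empty OCR segment marks a
    deletion. Matches (clean == ocr) are counted like substitutions.
    """
    aligned_pairs = list(aligned_pairs)
    pair_counts = Counter(aligned_pairs)
    clean_counts = Counter(c for c, _ in aligned_pairs if c)
    # Assumption: count(C) = total clean characters seen in training.
    total_clean_chars = sum(len(c) for c, _ in aligned_pairs)

    model = {}
    for (clean, ocr), k in pair_counts.items():
        if clean:
            # Substitution or deletion: normalize by count of the clean segment.
            model[(clean, ocr)] = k / clean_counts[clean]
        else:
            # Insertion: normalize by the overall clean character count.
            model[(clean, ocr)] = k / total_clean_chars
    return model
```

For instance, if the clean segment `w` is seen twice in training, once rendered as `vv` and once as `w`, both mappings receive probability 0.5.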
![Page 15: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents](https://reader035.vdocuments.net/reader035/viewer/2022062518/5681472f550346895db46bf1/html5/thumbnails/15.jpg)
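Decoding:
Bayes' Rule:

$$\widehat{\text{Word}}_{\text{correct}} = \arg\max_{\text{Word}_{\text{correct}}} P(\text{Word}_{\text{correct}} \mid \text{Word}_{\text{OCR}}) = \arg\max_{\text{Word}_{\text{correct}}} P(\text{Word}_{\text{OCR}} \mid \text{Word}_{\text{correct}})\, P(\text{Word}_{\text{correct}})$$

Character Level model:

$$P(\text{Word}_{\text{OCR}} \mid \text{Word}_{\text{correct}}) = \prod_{\text{all } D_x..D_y} P(D_x..D_y \mid C_k..C_l)$$

Word Level model:

$$P(\text{Word}_{\text{correct}}) = \text{LM probability (a simple unigram probability was used)}$$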
![Page 16: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents](https://reader035.vdocuments.net/reader035/viewer/2022062518/5681472f550346895db46bf1/html5/thumbnails/16.jpg)
Example:
Character Level Model:
1. Segmentation
2. Mapping (ε marks an empty segment)
3. Generate Candidates
Ex: dairn
d a i r n
da i r n
d ai r n
dai r n
d a ir n
da ir n
d air n
dair n
d a i rn
da i rn
d ai rn
dai rn
d a irn
da irn
d airn
dairn
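The sixteen segmentations listed for `dairn` are all the ways of cutting a five-letter word at its four internal boundaries (2^4 = 16). A minimal sketch of that enumeration follows; the function name `segmentations` is hypothetical, and whether the original system capped segment length is not stated on the slide, so no cap is applied here.

```python
from itertools import combinations


def segmentations(word):
    """Return every split of `word` into contiguous, non-empty segments."""
    n = len(word)
    result = []
    for k in range(n):  # k = number of cut points
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            result.append([word[a:b] for a, b in zip(bounds, bounds[1:])])
    return result
```

The number of segmentations doubles with each added character, so a practical decoder would likely prune segments that have no entry in the error model.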
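Chosen segmentation: d | a | i | rn
Mapping tables (OCR segment → candidate segment, probability; ε = empty segment):
d → d 0.8, h 0.1, cl 0.08, ε 0.02
a → a 0.9, o 0.05, r 0.02, oi 0.015, ε 0.005, n 0.005, e 0.005
i → i 0.84, l 0.12, ε 0.02, t 0.015, ll 0.005, ε 0.005
rn → rn 0.7, m 0.15, im 0.02, ln 0.015, ε 0.005
(unlabeled table) l 0.09, i 0.05, li 0.02, s 0.015, f 0.005, t 0.005, a 0.005
Generated candidates:
dairn 0.425
daim 0.091
claim 0.0091
aim 0.00227
horn 0.00007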
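Candidate generation multiplies one mapping probability per segment. The sketch below uses a subset of the slide's table entries; reading the blank table cells as ε (written as the empty string) is an assumption, although it reproduces the slide's scores for daim, claim, aim, and horn to the precision shown. The names `TABLES` and `candidates` are illustrative, and keeping the highest-scoring path when two paths spell the same word is a design choice, not something the slide states.

```python
from itertools import product

# Subset of the slide's mapping tables; "" stands for epsilon (assumed).
TABLES = {
    "d": {"d": 0.8, "h": 0.1, "cl": 0.08, "": 0.02},
    "a": {"a": 0.9, "o": 0.05},
    "i": {"i": 0.84, "": 0.02},
    "rn": {"rn": 0.7, "m": 0.15},
}


def candidates(segments, tables):
    """Score every combination of per-segment replacements."""
    scores = {}
    for choice in product(*(tables[s].items() for s in segments)):
        word = "".join(replacement for replacement, _ in choice)
        prob = 1.0
        for _, p in choice:
            prob *= p
        # Keep the best path if several paths produce the same word.
        scores[word] = max(scores.get(word, 0.0), prob)
    return scores


scores = candidates(["d", "a", "i", "rn"], TABLES)
```

With these entries, `claim` comes from cl · a · i · m = 0.08 × 0.9 × 0.84 × 0.15, and `horn` from h · o · ε · rn = 0.1 × 0.05 × 0.02 × 0.7.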
![Page 17: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents](https://reader035.vdocuments.net/reader035/viewer/2022062518/5681472f550346895db46bf1/html5/thumbnails/17.jpg)
Example (cont):
Word Level Model:
Find the frequency of occurrence of each generated word in the dictionary
P(dairn | dairn) = 0.425, Freq(dairn) = 0
P(daim | dairn) = 0.091, Freq(daim) = 0
P(claim | dairn) = 0.0091, Freq(claim) = 1500
P(aim | dairn) = 0.00227, Freq(aim) = 4000
P(horn | dairn) = 0.00007, Freq(horn) = 150
Result: dairn → claim
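The final choice combines the character-level score with the unigram word frequency, following the Bayes' rule decomposition used for decoding. A minimal sketch with the slide's numbers follows; `best_correction` is a hypothetical name, and normalizing frequencies by their sum over these five candidates is an assumption made only to turn counts into probabilities (it does not change which word wins).

```python
# Character-level scores and dictionary frequencies as given on the slide.
CHANNEL = {"dairn": 0.425, "daim": 0.091, "claim": 0.0091, "aim": 0.00227, "horn": 0.00007}
FREQ = {"dairn": 0, "daim": 0, "claim": 1500, "aim": 4000, "horn": 150}


def best_correction(channel, freq):
    """Rank candidates by channel score times unigram probability."""
    total = sum(freq.values())
    scores = {w: p * freq.get(w, 0) / total for w, p in channel.items()}
    return max(scores, key=scores.get), scores


best, scores = best_correction(CHANNEL, FREQ)
```

Before normalization, `claim` scores 0.0091 × 1500 = 13.65 and `aim` scores 0.00227 × 4000 = 9.08, so `claim` wins even though `aim` is the more frequent word; `dairn` and `daim` drop out because they do not occur in the dictionary.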
![Page 18: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents](https://reader035.vdocuments.net/reader035/viewer/2022062518/5681472f550346895db46bf1/html5/thumbnails/18.jpg)
IR Experiments
• The degraded collections were corrected, and the best one, two, three, or five corrections for each word were indexed
• The collections were indexed and searched using words, character 3-grams, character 4-grams, and lightly stemmed words
• Retrieval performance was tested for every combination of index type and number of corrections
• The measure of merit is Mean Average Precision
• Significance was tested using a t-test with a p-value threshold of 0.05
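Mean Average Precision, the measure of merit named above, averages per-topic average precision over all topics. The sketch below uses the common uninterpolated definition; the function names and toy document IDs are illustrative and do not come from the evaluation software used in the experiments.

```python
def average_precision(ranked, relevant):
    """Uninterpolated average precision for one topic.

    ranked: document IDs in retrieval order.
    relevant: set of relevant document IDs for the topic.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked, relevant) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For a topic with relevant documents at ranks 1 and 3, the average precision is (1/1 + 2/3) / 2; relevant documents that are never retrieved contribute zero to the sum but still count in the denominator.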
![Page 19: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents](https://reader035.vdocuments.net/reader035/viewer/2022062518/5681472f550346895db46bf1/html5/thumbnails/19.jpg)
Correction Results:
Word Error Rate (%) by number of corrections considered (N-corrections)

| N-corrections | ZAD Collection | TREC Collection |
|---|---|---|
| No correction | 39 | 30.8 |
| 1 | 22.2 | 16.7 |
| 2 | 16.9 | 11.9 |
| 3 | 15 | 10.2 |
| 4 | 13.7 | 9.5 |
| 5 | 13.2 | 9.2 |
| 10 | 11.5 | 8.1 |
| All | 8.1 | 6.8 |
![Page 20: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents](https://reader035.vdocuments.net/reader035/viewer/2022062518/5681472f550346895db46bf1/html5/thumbnails/20.jpg)
IR Results:
“ZAD Collection”:
[Bar chart: Mean Average Precision (axis 0 to 0.45) for Whole Word, 3-gram, 4-gram, and Stem indexing; series: Clean, Bad, 1 Correction, 2 Corrections, 3 Corrections, 5 Corrections]
![Page 21: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents](https://reader035.vdocuments.net/reader035/viewer/2022062518/5681472f550346895db46bf1/html5/thumbnails/21.jpg)
IR Results:
“TREC Collection”:
[Bar chart: Mean Average Precision (axis 0 to 0.3) for Whole Word, 3-gram, 4-gram, and Stem indexing; series: Original (clean), Bad, 1 Correction, 2 Corrections, 3 Corrections, 5 Corrections]
![Page 22: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents](https://reader035.vdocuments.net/reader035/viewer/2022062518/5681472f550346895db46bf1/html5/thumbnails/22.jpg)
Conclusion & future work:
Conclusion:
• Although WER was halved, IR effectiveness did not improve by a statistically significant margin
• Using more than one correction does not help
• Indexing using n-grams (shorter index terms) is better than “moderate” error correction
Future work:
• Effect of using an n-gram word LM on error correction (Magdy, W. and K. Darwish. Arabic OCR Error Correction Using Character Segment Correction, Language Modeling, and Shallow Morphology. In EMNLP 2006)
• Effect of “good” error correction on retrieval effectiveness
![Page 23: Effect of Word-Based Correction on Retrieval of Arabic OCR Degraded Documents](https://reader035.vdocuments.net/reader035/viewer/2022062518/5681472f550346895db46bf1/html5/thumbnails/23.jpg)
Lnanh gon → Correction → Thank you