Querying a Large Corpus of Historical Handwritten Manuscripts Using Word-Spotting Algorithms
Yaacov Choueka, Adiel ben-Shalom
The Friedberg Genizah Project
Nachum Dershowitz, Lior Wolf, Adi Silberfenig
School of Computer Science, Tel Aviv University
Minerva 2015, Jerusalem
The Problem: find all occurrences of a given query word in all the manuscripts of the corpus (arbitrary language, arbitrary script)
Example: The Cairo Genizah Corpus
360,000 fragments; Hebrew characters; Hebrew and Arabic languages
The query: בראשית
Simple Solution: full-text search
KWIC Output
The catch:
The software can search only manuscripts that have been transcribed into electronic form!
Usually, however, most manuscripts are never transcribed!
In the Genizah case: 480,000 images are available, but only 40,000 (8%) have been transcribed!
OCR does not work well for handwritten historical documents
אהבתי כי ישמע יהוה את
קולי תחנוני כי הטה אוזנו לי ובימי אקרא אפפוני חבלי
מות ומצרי שאול מצאוני צרה
ויגון אמצא ובשם יהוה אקרא אנה יהוה מלטה
נפשי חנון יהוה וצדיק ואלוהינו
מרחם שומר פתאים יהוה דלותי ולי יהושיע שובי נפשי למנוחיכי כי יהוה גמל עליכי
כי חלצת נפשי ממות את עיני
מדמעה את רגלי מדחיאתהלך לפני יהוה בארצות החיים האמנתי כי אדבר אני
OCR transcription:
אדזבעיכישעידודארוליעחנוניכידסראזנויוביסיארראאוניחבלישתומצרישאולצאוניצדוגוןאמצאובשםידוארראאנאידודלטכשינוןידודוצדידואדינוסרחסשוערתאיסיזוזדלייייליידושיעשובינשילסנוחיכיכיידודגמלעיכיכיחלצתנשיממועאעעיניסדסעדאערגליאעדלךלניידודבארדחייפדאסנעיכיאדבראני גליאעדל
Search for the image of the query word
(and not for its text)
The word-spotting approach:
Given one (or more) image(s) of a query word, find all occurrences of similar images in the corpus of manuscript images
Query:
Word-spotting
Query:
1. Binarization
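The slides do not say which binarization method is used. As a minimal sketch, assuming a single global threshold chosen by Otsu's method (a common choice, not necessarily the authors'), a grayscale page given as a list of rows of 0-255 values could be turned into an ink/background map like this. The function names are illustrative.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0-255) that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels with value <= t
    sum_bg = 0  # sum of the gray levels of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image):
    """Map dark pixels (assumed to be ink) to 1 and light pixels to 0."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[1 if p <= t else 0 for p in row] for row in image]


page = [[200, 210, 30],
        [220, 25, 35]]
ink = binarize(page)
```

Historical manuscripts often have stains and uneven lighting, so a local (adaptive) threshold may well behave better in practice than the global one sketched here.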
2. Extracting Word Candidates (“Patches”) from a Manuscript’s Image
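How the word candidates are found is not described on the slide. One simple possibility, shown here purely as an assumption, is to take the bounding box of every connected ink component in the binarized image. A real system would probably need to merge letter-level components into word-level boxes.

```python
from collections import deque


def bounding_boxes(binary):
    """Return (top, left, bottom, right) for each 8-connected ink component."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search over the component that contains (r, c).
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


page = [[1, 1, 0, 0, 0],
        [1, 0, 0, 1, 1],
        [0, 0, 0, 1, 0]]
candidates = bounding_boxes(page)
```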
3. Patch Normalization
Normalizing every patch into a standard grid of 8,960 pixels (20 × 7 cells of 8 × 8 pixels each)
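A rough sketch of the normalization step follows. It assumes the grid is 20 cells wide and 7 cells high (the slide gives only "20*7", so the orientation is a guess) and uses nearest-neighbour resampling, which is just one of several possible interpolation choices.

```python
CELL = 8                   # each cell is 8 x 8 pixels
CELLS_X, CELLS_Y = 20, 7   # assumed orientation: 20 cells across, 7 down
WIDTH, HEIGHT = CELLS_X * CELL, CELLS_Y * CELL   # 160 x 56 = 8,960 pixels


def normalize_patch(patch, width=WIDTH, height=HEIGHT):
    """Resize a patch (list of rows) to the standard grid by nearest neighbour."""
    src_h, src_w = len(patch), len(patch[0])
    return [[patch[y * src_h // height][x * src_w // width]
             for x in range(width)]
            for y in range(height)]


tiny = [[0, 1],
        [1, 0]]
grid = normalize_patch(tiny)
pixel_count = sum(len(row) for row in grid)
```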
4. Image descriptors for every patch
Constructing, for every patch, an image-descriptor vector of 12,460 real numbers:
140 cells × (31 + 58) = 12,460
(31 features of the HOG vector, 58 features of the LBP vector)
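The descriptor length can be checked with a few lines of arithmetic. The slide does not say which LBP variant is used. One reading that is consistent with the figure 58 is the "uniform" LBP over 8 neighbours, where only bit patterns with at most two 0/1 transitions around the circle get their own bin. The sketch below counts those patterns and recomputes the total. The 31 HOG features are taken from the slide as given.

```python
def transitions(pattern, bits=8):
    """Count 0/1 changes around a circular bit pattern."""
    count = 0
    for i in range(bits):
        a = (pattern >> i) & 1
        b = (pattern >> ((i + 1) % bits)) & 1
        count += a != b
    return count


# Patterns with at most two transitions are the "uniform" ones.
uniform_patterns = [p for p in range(256) if transitions(p) <= 2]

LBP_BINS = len(uniform_patterns)   # matches the 58 on the slide
HOG_BINS = 31                      # taken from the slide
CELLS = 20 * 7                     # 140 cells per normalized patch
DESCRIPTOR_LENGTH = CELLS * (HOG_BINS + LBP_BINS)
```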
5. Dimension Reduction
PCA (Principal Component Analysis) reduces the M × 12,460 descriptor matrix (one row per patch: Patch 1, Patch 2, Patch 3, ..., Patch M) to an M × 1000 matrix.
M = total number of patches in all images of the corpus
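To make the reduction concrete, here is a toy PCA in plain Python. It is only a sketch: it finds the leading components by power iteration with deflation on the covariance matrix, which is practical for a handful of dimensions but not for 12,460. A real system would use an optimized linear-algebra library. All function names are illustrative.

```python
import math


def pca(rows, k, iters=200):
    """Return (mean, components): the top-k principal directions of rows."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]

    components = []
    for _ in range(k):
        v = [1.0 / math.sqrt(d)] * d          # unit-length starting vector
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:
                break                          # no variance left
            v = [x / norm for x in w]
        cov_v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        eigenvalue = sum(v[i] * cov_v[i] for i in range(d))
        components.append(v)
        # Deflation: remove the direction just found from the covariance.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return mean, components


def project(row, mean, components):
    """Coordinates of one descriptor in the reduced space."""
    return [sum((row[j] - mean[j]) * c[j] for j in range(len(row)))
            for c in components]


# Toy data: 3-dimensional points that all lie on one line.
data = [[float(t), 2.0 * t, 0.0] for t in range(5)]
mean, components = pca(data, k=1)
reduced = [project(row, mean, components) for row in data]
```

In this toy case every point is described by a single number after projection, in the same way that each 12,460-dimensional patch descriptor is described by 1000 numbers after the reduction above.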
6. Similarity Computation
Computing an efficient similarity measure between the reduced query vector and the reduced vectors of all patches of all images in the corpus
Query: 1 × 1000 vector (the query patch)
Dataset: M × 1000 matrix (Patch 1, Patch 2, Patch 3, ..., Patch M)
Result: M values, where entry i is the similarity of the query patch to patch number i
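The slides do not name the similarity measure. As an illustration only, the sketch below uses cosine similarity, a common choice for comparing descriptor vectors, and returns one score per patch. The two-dimensional vectors stand in for the 1000-dimensional reduced descriptors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarities(query, patches):
    """Entry i is the similarity of the query to patch number i."""
    return [cosine_similarity(query, p) for p in patches]


query = [1.0, 0.0]
patches = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
scores = similarities(query, patches)
```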
7. Result
Sort the results by decreasing similarity and display the patches with the best similarity to the query
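Since only the best matches are displayed, a full sort of all M scores is not strictly needed. A minimal sketch, assuming the scores are held in a Python list, selects the top k with the standard-library `heapq.nlargest`. The helper name `top_matches` is hypothetical.

```python
import heapq


def top_matches(scores, k):
    """Return the k best (patch_index, score) pairs, best first."""
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])


scores = [0.2, 0.9, 0.5, 0.7]
best = top_matches(scores, 2)
```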
Two Tests
1. George Washington (handwritten)
2. Lord Byron (printed)
20 pages, about 5000 words each

                           1. Washington   2. Byron
Precision                  50%             91%
Single query               0.08 sec        0.03 sec
Pre-processing per page    46 sec          3 sec
Current Problems
1. Efficiently building (off-line, in terms of space and time) compact image-descriptors for all patches from all (half-a-million) images.
2. Building an efficient (on-line) system for comparing the query vector to all (100 million?) patches’ vectors
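A back-of-the-envelope calculation suggests why both problems are hard. Assuming 4-byte floating-point values (an assumption, the slides do not give a storage format) and the tentative figure of 100 million patches, the raw descriptors would occupy about 5 TB, and even the reduced 1000-dimensional vectors about 400 GB.

```python
BYTES_PER_VALUE = 4          # assumption: 32-bit floats
NUM_PATCHES = 100_000_000    # the "100 million?" estimate from the slide


def storage_gb(num_vectors, dims, bytes_per_value=BYTES_PER_VALUE):
    """Storage for dense vectors, in gigabytes (10**9 bytes)."""
    return num_vectors * dims * bytes_per_value / 1e9


full_gb = storage_gb(NUM_PATCHES, 12_460)    # before dimension reduction
reduced_gb = storage_gb(NUM_PATCHES, 1_000)  # after dimension reduction
```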
When solved and implemented
it will offer new horizons
to the study of large corpora of historical documents
Thank You