d01 choueka dershowitz_word_spotting_algorithm

Querying a Large Corpus of Historical Handwritten Manuscripts Using Word-Spotting Algorithms

Yaacov Choueka, Adiel ben-Shalom
The Friedberg Genizah Project

Nachum Dershowitz, Lior Wolf, Adi Silberfenig
School of Computer Science, Tel Aviv University

Minerva 2015, Jerusalem

The Problem: find all occurrences of a given query word in all the manuscripts of the corpus (arbitrary language, arbitrary script)

Example: The Cairo Genizah Corpus

360,000 fragments
Hebrew characters, Hebrew and Arabic languages

The query: בראשית (Bereshit, "in the beginning")

Simple Solution: full-text search

KWIC Output
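As a rough illustration of what full-text search with KWIC (keyword-in-context) output means, here is a minimal sketch. The fragment names and texts are invented for the example, and exact whitespace-delimited matching is an assumption; the slides do not describe the actual search software.

```python
def kwic(texts, query, width=3):
    """Return (doc_id, left_context, keyword, right_context) for every exact hit."""
    hits = []
    for doc_id, text in texts.items():
        words = text.split()
        for i, word in enumerate(words):
            if word == query:
                left = " ".join(words[max(0, i - width):i])
                right = " ".join(words[i + 1:i + 1 + width])
                hits.append((doc_id, left, word, right))
    return hits


# Hypothetical transcribed fragments (illustrative only, not Genizah data).
transcribed = {
    "fragment-1": "in the beginning God created the heaven and the earth",
    "fragment-2": "from the beginning of the year to the end of the year",
}

kwic_hits = kwic(transcribed, "beginning", width=2)
for doc_id, left, word, right in kwic_hits:
    print(f"{doc_id}: {left:>12} [{word}] {right}")
```

This works only on text that already exists in electronic form.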

The catch:

The software can search only manuscripts that have been transcribed into electronic form!

Usually, however, most of the manuscripts are never transcribed!

In the Genizah case: 480,000 images are available, but only 40,000 (8%) have been transcribed!

OCR does not work well for handwritten historical documents

אהבתי כי ישמע יהוה את

קולי תחנוני כי הטה אוזנו לי ובימי אקרא אפפוני חבלי

מות ומצרי שאול מצאוני צרה

ויגון אמצא ובשם יהוה אקרא אנה יהוה מלטה

נפשי חנון יהוה וצדיק ואלוהינו

מרחם שומר פתאים יהוה דלותי ולי יהושיע שובי נפשי למנוחיכי כי יהוה גמל עליכי

כי חלצת נפשי ממות את עיני

מדמעה את רגלי מדחי אתהלך לפני יהוה בארצות החיים האמנתי כי אדבר אני

OCR transcription:

אדזבעיכישעידודארוליעחנוניכידסראזנויוביסיארראאוניחבלישתומצרישאולצאוניצדוגוןאמצאובשםידוארראאנאידודלטכשינוןידודוצדידואדינוסרחסשוערתאיסיזוזדלייייליידושיעשובינשילסנוחיכיכיידודגמלעיכיכיחלצתנשיממועאעעיניסדסעדאערגליאעדלךלניידודבארדחייפדאסנעיכיאדבראני גליאעדל

Search for the image of the query word (and not for its text)

The word-spotting approach:

Given one (or more) image(s) of a query word, find all occurrences of similar images in the corpus collection of manuscripts’ images

Word-spotting

[Figure: word-spotting example with a query word image (label: Query)]

1. Binarization
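The slides do not say which binarization method is used. As one common possibility, the sketch below uses a global Otsu threshold on an 8-bit grayscale image stored as a list of rows; the choice of Otsu and the convention that dark pixels are ink are both assumptions made for illustration.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0..255) that maximizes the between-class variance.

    Pixels with value <= t form one class, pixels with value > t the other.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of those pixel values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray):
    """Map a grayscale image to 1 = ink (dark) and 0 = background (light)."""
    t = otsu_threshold([p for row in gray for p in row])
    return [[1 if p <= t else 0 for p in row] for row in gray]


# Tiny synthetic image: three dark ink pixels on a light background.
gray = [
    [200, 210, 205, 198],
    [202,  30,  25, 207],
    [199,  28, 204, 201],
]
binary = binarize(gray)
```

Historical manuscripts often have stains and uneven backgrounds, so a real system may need a locally adaptive threshold instead of a single global one.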

2. Extracting Word-Candidates (“Patches”) From a Manuscript’s Image
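How the word candidates are extracted is not described on the slides. One simple way to picture the step, assuming a binarized image of a single text line, is to find connected ink components and merge neighbors separated by a small horizontal gap. The functions `component_boxes` and `merge_into_words` and the `max_gap` parameter are hypothetical names for this sketch, not the authors' method.

```python
from collections import deque


def component_boxes(binary):
    """Bounding boxes (top, left, bottom, right), inclusive, of 8-connected ink blobs."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for r in range(h):
        for c in range(w):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def merge_into_words(boxes, max_gap=1):
    """Merge boxes on one text line whose horizontal gap is at most max_gap columns."""
    words = []
    for top, left, bottom, right in sorted(boxes, key=lambda b: b[1]):
        if words and left - words[-1][3] - 1 <= max_gap:
            t, l, b, r = words[-1]
            words[-1] = (min(t, top), l, max(b, bottom), max(r, right))
        else:
            words.append((top, left, bottom, right))
    return words


# One tiny "text line": two nearby blobs (one word) and a distant third blob.
line = [
    [1, 1, 0, 1, 0, 0, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 0, 0, 1],
]
boxes = component_boxes(line)
patches = merge_into_words(boxes, max_gap=1)
```

On real manuscripts, word boundaries are ambiguous, so a system would likely produce many overlapping candidates rather than one clean segmentation.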

3. Patch Normalization

Normalizing every patch into a standard grid of 8960 pixels (20*7 cells of 8*8 pixels each)
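The grid arithmetic can be checked with a small sketch: 20*7 = 140 cells of 8*8 = 64 pixels gives 8960 pixels. The orientation (20 cells wide by 7 cells high, i.e. 160 by 56 pixels) and the nearest-neighbor resampling are assumptions; the slide states only the cell counts.

```python
CELL = 8                    # cell side in pixels (from the slide)
CELLS_X, CELLS_Y = 20, 7    # assumed orientation: 20 cells wide, 7 cells high
WIDTH, HEIGHT = CELLS_X * CELL, CELLS_Y * CELL  # 160 x 56 pixels


def normalize_patch(patch):
    """Nearest-neighbor resize of a 2-D list of pixels to HEIGHT x WIDTH."""
    src_h, src_w = len(patch), len(patch[0])
    return [
        [patch[y * src_h // HEIGHT][x * src_w // WIDTH] for x in range(WIDTH)]
        for y in range(HEIGHT)
    ]


def split_into_cells(image):
    """Return the 8x8 cells of a normalized patch in row-major order."""
    cells = []
    for cy in range(CELLS_Y):
        for cx in range(CELLS_X):
            cells.append([row[cx * CELL:(cx + 1) * CELL]
                          for row in image[cy * CELL:(cy + 1) * CELL]])
    return cells


# A synthetic 13 x 37 patch stands in for a word candidate of arbitrary size.
patch = [[(x + y) % 2 for x in range(37)] for y in range(13)]
normalized = normalize_patch(patch)
cells = split_into_cells(normalized)
pixel_count = sum(len(row) for row in normalized)  # 8960
```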

4. Image descriptors for every patch

Constructing, for every patch, an image-descriptor vector of 12,460 real numbers

140 cells * (31+58) = 12,460

(31 features of HOG vector) (58 features of LBP vector)
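The descriptor length can be verified numerically. The sketch below also shows one plausible origin of the number 58: it equals the count of "uniform" 8-neighbor LBP patterns (at most two 0/1 transitions around the circle). That the authors use the uniform LBP variant is an assumption based only on this count; the 31 HOG features per cell are taken from the slide as given.

```python
def transitions(pattern, bits=8):
    """Number of 0/1 changes when the bit pattern is read circularly."""
    return sum(
        ((pattern >> i) & 1) != ((pattern >> ((i + 1) % bits)) & 1)
        for i in range(bits)
    )


# Uniform patterns: at most two transitions around the 8-pixel neighborhood.
uniform_patterns = [p for p in range(256) if transitions(p) <= 2]

LBP_BINS = len(uniform_patterns)  # 58
HOG_BINS = 31                     # per-cell HOG features, as stated on the slide
CELLS = 20 * 7                    # cells per normalized patch
DESCRIPTOR_LENGTH = CELLS * (HOG_BINS + LBP_BINS)  # 12460
```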

5. Dimension Reduction

PCA (Principal Component Analysis) reduces the descriptor matrix from M rows of 12,460 values (one row per patch: Patch 1, Patch 2, Patch 3, ..., Patch M) to M rows of 1000 values.

M = total number of patches in all images of the corpus
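To make the reduction step concrete, here is a toy PCA in pure Python that finds the leading principal directions by power iteration with deflation. It is a sketch on a tiny dataset only: reducing 12,460 dimensions to 1000 over millions of patches would call for an optimized linear-algebra library, and the slides do not say which PCA implementation is used. The function names are hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pca_fit(rows, k, iterations=100, seed=0):
    """Return (mean, components): the top-k principal directions of rows."""
    n, d = len(rows), len(rows[0])
    mean = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in rows]
    rng = random.Random(seed)
    components = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for _ in range(iterations):
            # Deflation: remove the directions that were already found.
            for c in components:
                overlap = dot(v, c)
                v = [a - overlap * b for a, b in zip(v, c)]
            # Multiply by the (unnormalized) covariance: w = X^T (X v).
            projections = [dot(row, v) for row in centered]
            w = [sum(p * row[j] for p, row in zip(projections, centered))
                 for j in range(d)]
            norm = math.sqrt(dot(w, w))
            if norm == 0.0:
                break  # no variance left in the remaining subspace
            v = [a / norm for a in w]
        components.append(v)
    return mean, components


def pca_transform(rows, mean, components):
    """Project each row onto the principal components."""
    return [[dot([x - m for x, m in zip(row, mean)], c) for c in components]
            for row in rows]


# Four 3-D points that vary only along the direction (1, 1, 0).
data = [[0.0, 0.0, 5.0], [1.0, 1.0, 5.0], [2.0, 2.0, 5.0], [3.0, 3.0, 5.0]]
mean, components = pca_fit(data, k=1)
reduced = pca_transform(data, mean, components)  # one value per point
```

The sign of a principal direction is arbitrary, so only the absolute values of the components and projections are meaningful here.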

6. Similarity Computation

Computing an efficient similarity measure between the reduced query vector and the reduced vectors of all patches of all images in the corpus

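[Figure: the query patch (1 row of 1000 values) is compared with the dataset (M rows of 1000 values: Patch 1, Patch 2, Patch 3, ..., Patch M). The result has M values; value i is the similarity of the query patch to patch number i.]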
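The slides do not name the similarity measure. As an illustration of step 6, the sketch below assumes cosine similarity and compares one reduced query vector against every reduced patch vector, producing one score per patch. The 3-dimensional vectors stand in for the 1000-dimensional ones.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarities(query, dataset):
    """Entry i is the similarity of the query to patch number i."""
    return [cosine_similarity(query, row) for row in dataset]


# Toy stand-ins for the reduced patch vectors and the reduced query vector.
dataset = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [3.0, 3.0, 0.0]]
query = [2.0, 0.0, 0.0]
result = similarities(query, dataset)
```

If the dataset rows are normalized to unit length once, off-line, each query reduces to a single matrix-vector product.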

7. Result

Sort the results by decreasing similarity and display the patches with the best similarity to the query
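Since only the best few matches are displayed, the full list of M scores need not be sorted. A minimal sketch using the standard-library `heapq.nlargest`; the scores and the helper name `top_matches` are made up for the example.

```python
import heapq


def top_matches(scores, k):
    """Return the k (patch_index, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])


# Hypothetical similarity scores for five patches.
scores = [0.12, 0.93, 0.40, 0.88, 0.05]
best = top_matches(scores, k=3)
for patch_index, score in best:
    print(f"patch {patch_index}: similarity {score:.2f}")
```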

Two Tests

Test                                 Precision   Single query   Pre-processing per page
1. George Washington (handwritten)   50%         0.08 sec       46 sec
2. Lord Byron (printed)              91%         0.03 sec       3 sec

20 pages, about 5000 words each

Current Problems

1. Efficiently building (off-line, in terms of space and time) compact image-descriptors for all patches from all (half-a-million) images.

2. Building an efficient (on-line) system for comparing the query vector to all (100 million?) patches’ vectors.
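A back-of-the-envelope calculation gives a feel for the scale of both problems. It assumes 32-bit floats for the stored values and extrapolates the 46 seconds per page measured in the handwritten test to all 480,000 Genizah images on a single machine; both are assumptions for illustration, and the patch count is the slides' own tentative figure.

```python
IMAGES = 480_000          # images available (from the slides)
PATCHES = 100_000_000     # "100 million?" patches (the slides' tentative estimate)
FULL_DIM = 12_460         # descriptor length before PCA
REDUCED_DIM = 1_000       # descriptor length after PCA
BYTES_PER_VALUE = 4       # assumption: 32-bit floats
SECONDS_PER_PAGE = 46     # assumption: handwritten-test timing applies to every image


def storage_gb(patches, dim, bytes_per_value=BYTES_PER_VALUE):
    """Storage for all patch vectors, in gigabytes (10**9 bytes)."""
    return patches * dim * bytes_per_value / 1e9


full_gb = storage_gb(PATCHES, FULL_DIM)        # unreduced descriptors
reduced_gb = storage_gb(PATCHES, REDUCED_DIM)  # PCA-reduced descriptors

# Off-line pre-processing time on a single machine, in days.
preprocess_days = IMAGES * SECONDS_PER_PAGE / 86_400

# On-line cost of one brute-force query: one multiply-add per stored value.
multiply_adds_per_query = PATCHES * REDUCED_DIM

print(f"unreduced: {full_gb:.0f} GB, reduced: {reduced_gb:.0f} GB")
print(f"pre-processing: about {preprocess_days:.0f} days on one machine")
print(f"multiply-adds per query: {multiply_adds_per_query:.1e}")
```

Under these assumptions the reduced vectors alone take about 400 GB, which suggests why compact descriptors and an efficient comparison scheme are listed as open problems.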

When solved and implemented, it will offer new horizons to the study of large corpora of historical documents

Thank You
