Querying a Large Corpus of Historical Handwritten Manuscripts Using Word-Spotting Algorithms. Yaacov Choueka, Adiel ben-Shalom (The Friedberg Genizah Project); Nachum Dershowitz, Lior Wolf, Adi Silberfenig (School of Computer Science, Tel Aviv University). Minerva 2015, Jerusalem.

Post on 13-Apr-2017


Page 1: D01 choueka dershowitz_word_spotting_algorithm

Querying a Large Corpus of Historical Handwritten Manuscripts Using Word-Spotting Algorithms

Yaacov Choueka, Adiel ben-Shalom
The Friedberg Genizah Project

Nachum Dershowitz, Lior Wolf, Adi Silberfenig
School of Computer Science, Tel Aviv University

Minerva 2015, Jerusalem

Page 2: D01 choueka dershowitz_word_spotting_algorithm

The Problem: find all occurrences of a given query word in all the manuscripts of the corpus (arbitrary language, arbitrary script)

Example: The Cairo Genizah Corpus

360,000 fragments; Hebrew characters; Hebrew and Arabic languages

The query: בראשית ("Bereshit", "In the beginning")

Page 3: D01 choueka dershowitz_word_spotting_algorithm

Simple Solution: full-text search

Page 4: D01 choueka dershowitz_word_spotting_algorithm

KWIC Output
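For readers unfamiliar with KWIC (key word in context) listings, here is a minimal sketch of how such output can be produced from transcribed text. It is an illustration only, not the Friedberg Genizah Project's software; the function name `kwic` and the `width` parameter are hypothetical, and tokenization is assumed to be plain whitespace splitting.

```python
def kwic(text, query, width=3):
    """Return (left context, keyword, right context) for each exact match.

    Assumption: words are separated by whitespace; no stemming or
    normalization is applied.
    """
    tokens = text.split()
    hits = []
    for i, token in enumerate(tokens):
        if token == query:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


if __name__ == "__main__":
    for left, word, right in kwic("a b c d b e", "b", width=1):
        print(f"{left:>10} [{word}] {right}")
```

Each hit is shown with a few words on either side, which is what lets a reader judge an occurrence without opening the full transcription.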

Page 5: D01 choueka dershowitz_word_spotting_algorithm

The catch:

The software can search only manuscripts that have been transcribed into electronic form!

Usually, however, most of the manuscripts are never transcribed!

In the Genizah case: 480,000 images are available, but only 40,000 (8%) have been transcribed!

Page 6: D01 choueka dershowitz_word_spotting_algorithm

OCR

Does not work well for handwritten historical documents

Manuscript text (Hebrew, Psalm 116):

אהבתי כי ישמע יהוה את

קולי תחנוני כי הטה אוזנו לי ובימי אקרא אפפוני חבלי

מות ומצרי שאול מצאוני צרה

ויגון אמצא ובשם יהוה אקרא אנה יהוה מלטה

נפשי חנון יהוה וצדיק ואלוהינו

מרחם שומר פתאים יהוה דלותי ולי יהושיע שובי נפשי למנוחיכי כי יהוה גמל עליכי

כי חלצת נפשי ממות את עיני

מדמעה את רגלי מדחי אתהלך לפני יהוה בארצות החיים האמנתי כי אדבר אני

OCR Transcription:

אדזבעיכישעידודארוליעחנוניכידסראזנויוביסיארראאוניחבלישתומצרישאולצאוניצדוגוןאמצאובשםידוארראאנאידודלטכשינוןידודוצדידואדינוסרחסשוערתאיסיזוזדלייייליידושיעשובינשילסנוחיכיכיידודגמלעיכיכיחלצתנשיממועאעעיניסדסעדאערגליאעדלךלניידודבארדחייפדאסנעיכיאדבראני גליאעדל

Page 7: D01 choueka dershowitz_word_spotting_algorithm

The word-spotting approach:

Search for the image of the query word (and not for its text)

Page 8: D01 choueka dershowitz_word_spotting_algorithm

Word-spotting

Given one or more images of a query word, find all occurrences of similar images in the corpus's collection of manuscript images

Query: [image of the query word]

Page 9: D01 choueka dershowitz_word_spotting_algorithm

Query:

Page 10: D01 choueka dershowitz_word_spotting_algorithm

1. Binarization
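The slide does not say which binarization method is used. As one common possibility, the sketch below applies Otsu's global threshold, which picks the gray level that best separates dark ink from light background. The choice of Otsu, the function names, and the convention that ink maps to 1 are all assumptions made for illustration.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0..255) that maximizes between-class variance.

    Pixels <= t form the dark class, pixels > t the light class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the dark class so far
    sum0 = 0    # sum of gray levels in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(image):
    """image: list of rows of gray levels. Returns rows of 1 (ink) / 0 (background)."""
    t = otsu_threshold([p for row in image for p in row])
    return [[1 if p <= t else 0 for p in row] for row in image]
```

Historical manuscripts often have stains and uneven lighting, so a single global threshold may be too crude in practice; locally adaptive thresholds are a frequent alternative.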

Page 11: D01 choueka dershowitz_word_spotting_algorithm

2. Extracting Word-Candidates (“Patches”) from a Manuscript’s Image
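How the candidates are extracted is not described on the slide. A simple starting point, sketched here under that caveat, is to take the bounding box of every connected group of ink pixels in the binarized page. A real system would still have to merge boxes that belong to one word (and split touching words); the function name `candidate_boxes` and the 8-connectivity choice are assumptions.

```python
from collections import deque


def candidate_boxes(binary):
    """Return one bounding box (top, left, bottom, right), inclusive,
    per 8-connected component of ink pixels (value 1)."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for r in range(height):
        for c in range(width):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes
```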

Page 12: D01 choueka dershowitz_word_spotting_algorithm

3. Patch Normalization

Normalizing every patch into a standard grid of 8960 pixels (20*7 cells of 8*8 pixels each)
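The grid arithmetic is 20 * 7 = 140 cells and 140 * 64 = 8960 pixels. The sketch below resizes an arbitrary patch to that grid with nearest-neighbor sampling. It assumes the 20 cells run horizontally and the 7 vertically (a 160 x 56 pixel patch), which the slide does not state; the resampling method is likewise an assumption.

```python
CELLS_X, CELLS_Y, CELL = 20, 7, 8   # assumed orientation: 20 cells wide, 7 high
GRID_W = CELLS_X * CELL             # 160 pixels
GRID_H = CELLS_Y * CELL             # 56 pixels, so 160 * 56 = 8960 pixels


def normalize_patch(patch, out_w=GRID_W, out_h=GRID_H):
    """Resize a patch (list of pixel rows) to out_h x out_w by nearest neighbor."""
    in_h, in_w = len(patch), len(patch[0])
    return [
        [patch[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]
```

Bringing every patch to the same size is what makes the later descriptor vectors directly comparable, whatever the original size of the handwriting.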

Page 13: D01 choueka dershowitz_word_spotting_algorithm

4. Image descriptors for every patch

Constructing, for every patch, an image-descriptor vector of 12,460 real numbers:

140 cells * (31 + 58) = 12,460

(per cell: 31 features of the HOG vector, Histogram of Oriented Gradients, and 58 features of the LBP vector, Local Binary Patterns)
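To make the bookkeeping concrete, the sketch below cuts a normalized patch into 8 x 8 cells and concatenates one feature vector per cell. It does not implement HOG or LBP: `placeholder_features` is a hypothetical stand-in that only returns a vector of the right length (31 + 58 = 89), and real HOG also normalizes across neighboring cells, which this per-cell interface ignores.

```python
HOG_DIM, LBP_DIM = 31, 58   # per-cell feature counts given on the slide


def split_cells(patch, cell=8):
    """Cut a patch (list of pixel rows) into cell x cell blocks, row-major."""
    height, width = len(patch), len(patch[0])
    return [
        [row[x:x + cell] for row in patch[y:y + cell]]
        for y in range(0, height, cell)
        for x in range(0, width, cell)
    ]


def describe(patch, cell_features):
    """Concatenate the per-cell feature vectors into one descriptor."""
    descriptor = []
    for cell in split_cells(patch):
        descriptor.extend(cell_features(cell))
    return descriptor


def placeholder_features(cell):
    """Stand-in only, NOT real HOG/LBP: repeats the mean ink value 89 times."""
    n = sum(len(row) for row in cell)
    mean = sum(sum(row) for row in cell) / n
    return [mean] * (HOG_DIM + LBP_DIM)
```

With a 56 x 160 patch this yields 140 cells and a descriptor of 140 * 89 = 12,460 numbers, matching the count on the slide.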

Page 14: D01 choueka dershowitz_word_spotting_algorithm

5. Dimension Reduction

PCA (Principal Component Analysis) reduces the M x 12,460 matrix of patch descriptors (rows: Patch 1, Patch 2, Patch 3, ..., Patch M) to an M x 1000 matrix.

M = total number of patches in all images of the corpus
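A minimal, dependency-free sketch of PCA follows, using power iteration with orthogonalization against the components already found. It is meant only to show what "reduce 12,460 numbers to 1000" involves; the function names, the iteration count, and the solver itself are assumptions, and at corpus scale one would rely on an optimized linear-algebra library rather than plain Python lists.

```python
import math
import random


def fit_pca(rows, k, iters=200, seed=0):
    """Return (mean, components): the top-k principal directions of rows."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    rng = random.Random(seed)
    components = []
    for _ in range(k):
        v = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(iters):
            # w = X^T (X v): covariance times v without forming a d x d matrix.
            xv = [sum(a * b for a, b in zip(row, v)) for row in centered]
            w = [sum(xv[i] * centered[i][j] for i in range(n)) for j in range(d)]
            # Remove directions already extracted.
            for c in components:
                dot = sum(a * b for a, b in zip(w, c))
                w = [a - dot * b for a, b in zip(w, c)]
            norm = math.sqrt(sum(a * a for a in w))
            if norm == 0:
                return mean, components   # no variance left to explain
            v = [a / norm for a in w]
        components.append(v)
    return mean, components


def project(row, mean, components):
    """Map one descriptor to its reduced vector (one value per component)."""
    centered = [x - m for x, m in zip(row, mean)]
    return [sum(a * b for a, b in zip(centered, c)) for c in components]
```

The mean and components are fitted once on the corpus descriptors; the same `project` call is then applied to the query descriptor so that query and corpus live in the same reduced space.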

Page 15: D01 choueka dershowitz_word_spotting_algorithm

6. Similarity Computation

Computing an efficient similarity measure between the reduced query vector and the reduced vectors of all patches of all images in the corpus

[Diagram: the query patch (1 x 1000) is compared with the dataset (M x 1000; rows Patch 1, Patch 2, Patch 3, ..., Patch M). The result is a vector of M values, where entry i is the similarity of the query patch to patch number i.]
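The slides do not name the similarity measure. Purely as an illustration, the sketch below uses cosine similarity, one common choice for comparing descriptor vectors; the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarities(query, patches):
    """Entry i is the similarity of the query vector to patch vector i."""
    return [cosine(query, p) for p in patches]
```

If the patch vectors are normalized to unit length ahead of time, the whole computation reduces to one matrix-vector product per query, which is the kind of operation that can be made fast over a very large M.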

Page 16: D01 choueka dershowitz_word_spotting_algorithm

7. Result

Sort the results by decreasing similarity and display the patches most similar to the query
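When only the best few matches are displayed, the full list of M scores does not need to be sorted. A minimal sketch using the standard-library `heapq.nlargest` (the function name `top_matches` is hypothetical):

```python
import heapq


def top_matches(scores, k):
    """Return the k best (patch index, similarity) pairs, best first."""
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])


if __name__ == "__main__":
    for index, score in top_matches([0.2, 0.9, 0.5, 0.7], k=2):
        print(f"patch {index}: similarity {score:.2f}")
```

The returned indices point back to the patches, and through them to the manuscript image and position to display.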

Page 17: D01 choueka dershowitz_word_spotting_algorithm

Two Tests

1. George Washington – Handwritten
2. Lord Byron – Printed

20 pages, about 5000 words each

                           George Washington   Lord Byron
Precision                  50%                 91%
Single query               0.08 sec            0.03 sec
Pre-processing per page    46 sec              3 sec

Page 18: D01 choueka dershowitz_word_spotting_algorithm

Current Problems

1. Efficiently building (off-line, in terms of space and time) compact image-descriptors for all patches from all (half-a-million) images.

2. Building an efficient (on-line) system for comparing the query vector to all (100 million?) patches’ vectors
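A rough storage estimate shows why compact descriptors matter. The figures below assume 100 million patches (the slide itself marks this number with a question mark) and 4 bytes per value (single-precision floats); both are assumptions, not numbers reported by the authors.

```python
def descriptor_storage_bytes(num_patches, dims, bytes_per_value=4):
    """Bytes needed to hold one dense vector of `dims` values per patch."""
    return num_patches * dims * bytes_per_value


if __name__ == "__main__":
    patches = 100_000_000  # assumed; the slide says "100 million?"
    for dims in (12_460, 1_000):
        gigabytes = descriptor_storage_bytes(patches, dims) / 1e9
        print(f"{dims:>6} dims: {gigabytes:,.0f} GB")
```

Under these assumptions the unreduced descriptors would take about 4,984 GB and the 1000-dimensional ones about 400 GB, still far more than fits in the memory of a typical single machine.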

Page 19: D01 choueka dershowitz_word_spotting_algorithm

When solved and implemented, this will open new horizons for the study of large corpora of historical documents

Page 20: D01 choueka dershowitz_word_spotting_algorithm

Thank You