TRANSCRIPT
Keynote talk, ICDAR-2019
Text Search and Information Retrieval in Large Historical Collections of Untranscribed Manuscripts
Enrique Vidal, Alejandro H. Toselli, Joan Puigcerver and the HTR PRHLT team, [email protected]
Pattern Recognition and Human Language Technology Center
This presentation can be downloaded from:
www.prhlt.upv.es/~evidal/tmp/icdar19keynoteEVidal2p.pdf
Text search and IR in untranscribed manuscripts ICDAR-2019
Handwritten Text Recognition and Historical Manuscripts
• Some decades ago, off-line Handwritten Text Recognition (HTR) was thought to quickly become a research topic of little practical interest, since the use of text written on paper would soon become obsolete.
However...
• Massive historical manuscript collections, stored in thousands of kilometers of shelves in archives and libraries, have changed the picture dramatically.
• Digitization is a first step, but not enough: important information lingers hidden behind zillions of pixels of digital images, and the quintessence of these historical documents (their textual content) remains inaccessible.
E. Vidal PRHLT-UPV, September-2019 Page 1
Textual access to Untranscribed Manuscripts
If perfect or sufficiently accurate transcripts of the text images were available, image textual contents would obviously be accessible.
However...
• manual transcription is entirely prohibitive for massive image collections;
• automatic transcription results generally lack the accuracy level needed for most applications, including scholarly editions, content-based document classification and information retrieval.
Good news: probabilistic indexing (PI) and textual search can be carried out directly on untranscribed images, as we will see now.
Even better news: PI allows interesting forms of "big data" analysis, such as text analytics, document classification, information retrieval, etc.
Index
1 Demonstrations . 3
2 Text Image KWS and Probabilistic Indexing . 7
3 Implementation Issues . 15
4 Laboratory Results on Several Manuscript Collections . 18
5 Beyond Basic Keyword Spotting . 22
- Wild Cards, Approximate Spelling, Abbreviations and Hyphenation . 24
- Estimating Textual Features of Untranscribed Manuscript Collections . 26
- Boolean & Sequence Queries and Content-based Image Classification . 30
- Database-like Information Retrieval from Handwritten Tables . 34
- Search for Melodic Patterns in Handwritten Music Notation . 41
6 Conclusions . 45
7 END (and further details) . 47
Large-Scale Probabilistic Keyword Indexing and Search is Here, Now!
The HTR team of the PRHLT Research Center has been developing the PI technology during the last decade and, more recently, has successfully applied it to five large manuscript collections, thereby making their textual contents fully accessible:
• Chancery (AN & BN, France): 83 000 pgs., very abridged French & Latin, 14-15th c. http://prhlt-kws.prhlt.upv.es/himanis/ (HIMANIS project)
• TSO (Teatro del Siglo de Oro, BN of Spain): 41 000 pages, Spanish, 16-17th c. http://prhlt-carabela.prhlt.upv.es/tso/ (READ project)
• Bentham Papers (UCL & BL): 95 000 pages, English, scrawl writing, 18-19th c. http://prhlt-kws.prhlt.upv.es/bentham/ (READ project)
• FCR (Finnish Court Records, NA Finland): 138 000 pages, Swedish, 18-19th c. http://prhlt-kws.prhlt.upv.es/fcr/ (READ project)
• Carabela (AGI + AHPC): 125 000 pages, Spanish, abstruse scripts, 16-18th c.; manuscripts valuable to underwater archaeology. http://carabela.prhlt.upv.es/en/demonstrators (CARABELA prj.)
These and many other smaller-scale, experimental demonstrators are available from: http://transcriptorium.eu/demots/KWSdemos
Two Quick Demonstrations with Bentham Papers and Carabela
Bentham is famous as a theorist of punishment. Hence he wrote a lot about Australia and its penal colonies. Let us try to see what:
  Australia
  [New South Wales]
  Austral* || [New Holland]
  [New South Wales] || [Botany Bay]
  [New South Wales] && convict*
  [New South Wales] && [penal colon*]
Let us try now with Carabela. In the 16-17th centuries, every land to the south of the Philippines (including Australia) was vaguely known as "Terra Australis Incognita", and in Spain as "Tierra Austral", "Islas Australes", etc. (of course, "Nueva Guinea" was also included):
  [(Tierra* || Isla*) Austral*] || (Austral* Incognita∼3) || [Nueva Guinea]
  [Luis Vaez∼ Torres] && [Pedro Fernandez Quiros∼2]   (early navigators of Austral seas)
Finally, one more query to show the robustness of the indexing approach:
  capitan
[Map: Vaez de Torres' and Cook's routes and dates]
Text Image KWS Statistical Framework: 2-D Posteriorgram
Main concept: posterior word probability at pixel level, or "2-D posteriorgram":

  P(v | X, i, j),  1 ≤ i ≤ I,  1 ≤ j ≤ J,  v ∈ V

where X is an I × J text image, V is a vocabulary, and (i, j) is a pixel of X.
P(v | X, i, j) denotes the probability that the word v is written in a region of X which includes the pixel (i, j). It can be computed directly by marginalization:

  P(v | X, i, j) = ∑_B P(v, B | X, i, j) ≈ (1 / K(i, j)) ∑_{B ∈ B(i,j)} P(v | X, B)

where B(i, j) is the set of all the K(i, j) reasonably shaped and sized (and assumedly equiprobable) regions or boxes of X which include the pixel (i, j).
[Figure: a few possible boxes B ∈ B(i, j). For v = "matter", the thick-line box will provide the highest value of P(v | X, B), while most of the other boxes will contribute only (very) low values to the sum.]
What exactly is P(v | X, B)?
Computing the 2-D Posteriorgram by Word Classification
The 2-D posteriorgram:  P(v | X, i, j) ≈ (1 / K(i, j)) ∑_{B ∈ B(i,j)} P(v | X, B)

P(v | X, B) is the posterior probability (implicitly or explicitly) used by any isolated word-image classifier; i.e., any system capable of solving the following classification problem for a presegmented, word-shaped subimage of X bounded by B, X_B:

  v̂ = argmax_{v ∈ V} P(v | X_B)

For instance, for a simple k-nearest-neighbour classifier, if k_v is the number of v-labelled prototypes among the k nearest to X_B:

  P(v | X, B) = k_v / k

However, the better the classifier, the better the estimated posteriorgram!
Notice: directly obtaining a full 2-D posteriorgram in this way entails a formidable amount of computation, but P(v | X, i, j) can be computed very efficiently by clever combinations of subsampling of (i, j) and choices of B(i, j) [see later].
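As a toy illustration of the two formulas above, the following sketch estimates P(v | X, B) with a k-NN word classifier and aggregates box posteriors into a pixel-level posterior. The data structures (label lists, box identifiers) are hypothetical stand-ins, not the actual PRHLT implementation:

```python
from collections import Counter

def knn_word_posterior(neighbor_labels):
    """P(v | X, B) for a k-NN word-image classifier: the fraction k_v / k
    of the k nearest prototypes to the subimage X_B labelled with word v."""
    k = len(neighbor_labels)
    return {v: kv / k for v, kv in Counter(neighbor_labels).items()}

def pixel_posterior(boxes_at_pixel, box_posterior):
    """P(v | X, i, j) ~= (1 / K(i,j)) * sum_{B in B(i,j)} P(v | X, B),
    assuming the K(i, j) boxes covering pixel (i, j) are equiprobable."""
    K = len(boxes_at_pixel)
    acc = Counter()
    for B in boxes_at_pixel:
        for v, p in box_posterior(B).items():
            acc[v] += p / K
    return dict(acc)
```

For instance, with neighbour labels ["matter", "matter", "matters"], knn_word_posterior yields P(matter) = 2/3 and P(matters) = 1/3; averaging box posteriors over two boxes mirrors the 1/K(i, j) normalization of the marginalization formula.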
Pixel-level Posteriorgram
[Figure: a text image X and the pixel-level posterior probabilities P(v | X, i, j) for the word v = "matter", computed using an accurate, contextual (n-gram based) word classifier. This helped to achieve very good posteriors: low in a region of X around (i=100, j=60), where a very similar (but different) word, "matters", is written; high for the other three correct words.]
Pixel-level Posteriorgram: Probabilistic Word Indexing
[Figure: the text image X and its pixel-level posteriorgram P(v | X, i, j), as above.]
Directly computing and using a full pixel-level posteriorgram would entail a formidable computational load and would require prohibitive amounts of indexing storage.
But, for each word, image-region relevance probabilities and locations are easily derived from the posteriorgram, and used to probabilistically index the word in an efficient way.
Probabilistic Index: Example

# pageID="Bentham-071-021-002-part"
# keyword      relPrb    bounding box (x y w h)
2             0.929     1   36  20  31
21            0.064     1   36  24  31
IT            0.982    33   36  27  31
IF            0.012    33   36  26  31
MATTERS       0.998    76   35 104  31
MATTER        0.011    77   36  93  31
NOT           0.999   216   36   7  31
WHETHER       1.000   256   36  99  31
THE           0.997   389   36  33  31
MIS-SUPPOSAL  1.000   455   36 193  31
THE           0.927   430   88  30  31
LHE           0.056   434   88  25  31
...           ...     ...
REGARDS       0.857     5  115  84  31
UGARDS        0.138     5  115  80  31
THE           0.993   110  115  43  31
MATTER        0.998   160  115  93  31
OF            0.996   271  115  23  31
FACT          0.999   306  115  49  31
OR            0.973   377  115  37  31
ON            0.021   377  115  42  31
MATTER        0.990   425  116 100  31
OF            0.995   542  115  25  31
LAM           0.407   575  115  30  31
BIMR          0.175   575  115  55  31
...           ...     ...
LAW           0.032   575  115  36  31
TAUE          0.031   575  115  55  31
...           ...     ...
LANE          0.012   575  115  59  31
THE           0.990     1  198  28  31
MATTER        0.934    61  198  64  31
OF            0.988   141  198  28  31
FAST          0.367   182  198  62  31
FAR           0.186   182  198  36  31
...           ...     ...
FACT          0.017   182  198  46  31
AS            0.142   200  198  29  31
HAE           0.022   200  198  29  31
WHERE         0.992   255  198  90  31
YOU           0.761   365  198  45  31
YOW           0.030   365  198  45  31
GOUS          0.064   372  198  47  31
SUPPOSE       0.975   429  198 120  31
SUPFROSE      0.024   429  198 125  31
SOME          0.834   570  198  78  31
SONER         0.016   576  198  83  31
OME           0.109   580  198  65  31
ME            0.022   620  198  22  31

[Figure: spots for MATTER and MATTERS marked in colors according to their relevance probabilities.]
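Read as data, such an index is just a list of (keyword, relevance probability, bounding box) entries per page. A minimal query sketch over a hand-copied fragment of the listing above (a hypothetical in-memory representation, not the actual index format):

```python
# A few MATTER/MATTERS entries copied from the listing: (keyword, relPrb, x, y, w, h)
index = [
    ("MATTERS", 0.998,  76,  35, 104, 31),
    ("MATTER",  0.011,  77,  36,  93, 31),
    ("MATTER",  0.998, 160, 115,  93, 31),
    ("MATTER",  0.990, 425, 116, 100, 31),
    ("MATTER",  0.934,  61, 198,  64, 31),
]

def spots(index, keyword, min_prob=0.5):
    """Return (relevance probability, bounding box) of the likely spots of a
    keyword; min_prob sets a user-controlled precision-recall tradeoff."""
    return [(p, (x, y, w, h)) for v, p, x, y, w, h in index
            if v == keyword and p >= min_prob]
```

Lowering min_prob raises recall at the cost of precision: the MATTER alternative scored 0.011 (where MATTERS is actually written) only appears below that threshold.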
Real Probabilistic Index of a Random Page from Bentham Papers
for the 25 most common English words: the, be, to, of, and, a, in, that, have, ...
[Figure: page image with marked spots; colors indicate relevance probabilities: low = red, high = green.]
Image Region KWS: Proper Information Retrieval Probabilistic Formulation
Posteriorgrams P(v | X, i, j) can be used directly for KWS. But for indexing, we need the probability that a word v is written within a pre-specified image region.
Let X be a given image region and R ∈ {yes, no} a binary random variable.
The relevance probability, P(R | X, v), is defined as the probability that X is relevant for v; i.e., that v is written somewhere in X. A good approximation is:

  P(R | X, v) ≈ max_{i,j} P(v | X, i, j)

This is a formal result which is also intuitively meaningful (as seen on p. 10) [this is the short story; formal details omitted here].
If w is the (unknown) transcript of the image region X, it can be seen that:

  P(R | v, X) = ∑_{w : v ∈ w} P(w | X)   and thus:   ∑_v P(R | X, v) = m   (see p. 27)

where m (≥ 1) is the expected number of (different) words written in X.
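A sketch of these two relationships, with the posteriorgram represented as a hypothetical mapping from pixels to word posteriors and an indexed region as a mapping from pseudo-words to relevance probabilities:

```python
def relevance_probability(posteriorgram, v):
    """P(R | X, v) ~= max over pixels (i, j) of P(v | X, i, j)."""
    return max((probs.get(v, 0.0) for probs in posteriorgram.values()),
               default=0.0)

def expected_distinct_words(region_entries):
    """sum_v P(R | X, v) = m, the expected number of distinct words
    written in the region X."""
    return sum(region_entries.values())
```

The max picks up the best-supported location of v anywhere in X, so a word written once with high posterior dominates many low-probability alternatives elsewhere in the region.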
Lexicon-Free Probabilistic Indexing
Two basic ideas:
• Use character-level models both for optical and language modeling (high-order character N-grams).
• Index any character sequence which, according to the models, is sufficiently likely to constitute an actual word.
Thereby, probabilistic index entries are called "pseudo-words" (rather than "words").
The development of these ideas, departing from the popular "filler model" for KWS, can be tracked through the following publications (especially Puigcerver's PhD thesis):
• A. Fischer et al., "Lexicon-free handwritten word spotting using character HMMs", Pattern Recognition Letters, 2012
• V. Frinken et al., "A novel word spotting method based on recurrent neural networks", IEEE TPAMI, 2012
• A.H. Toselli et al., "Fast HMM-Filler approach for Key Word Spotting in Handwritten Documents", ICDAR'13
• A. Fischer et al., "Improving HMM-Based Keyword Spotting with Character Language Models", ICDAR'13
• J. Puigcerver et al., "Word-Graph and Character-Lattice Combination for KWS in Handwritten Documents", ICFHR'14
• A.H. Toselli et al., "Context-aware lattice based Filler approach for key word spotting in handwritten documents", ICDAR'15
• J. Puigcerver et al., "Probabilistic interpretation and improvements to HMM-Filler for handwritten keyword spotting", ICDAR'15
• A.H. Toselli et al., "Two methods to improve confidence scores for lexicon-free word spotting in handwritten text", ICFHR'16
• T. Bluche et al., "Preparatory KWS Experiments for Large-Scale Indexing of a Vast Medieval Manuscript Collection ...", ICDAR'17
• J. Puigcerver, "A probabilistic formulation of keyword spotting", Ph.D. dissertation, Universitat Politecnica de Valencia, 2018
• A.H. Toselli et al., "Making Two Vast Historical Manuscript Collections Searchable and Extracting Meaningful Textual Features Through Large-Scale Probabilistic Indexing", ICDAR'19
Probabilistic Indices are NOT Transcripts

| OPTIMAL TRANSCRIPTION | PROBABILISTIC INDEXING |
| Generally comes after layout analysis | Is generally layout-agnostic |
| Strictly needs carefully detected lines | Line detection helps, but only if accurate |
| The output is a best, unique text interpretation of the image (maybe accompanied by word bounding boxes) according to the models used | For the given models, the output is a rich probability distribution of words and their sizes and positions in the images |
| The output is expected to be provided in correct reading order | In general, probabilistic indexing is reading-order agnostic |
| Provides plain-text output which, if accurate, could be used directly in many applications | In its basic form, does not provide any text output; only bounding-box-marked images |
| Usually yields only fixed and comparatively low precision-recall performance for the given trained models | Allows flexible, user-controlled precision-recall tradeoffs; overall search performance is generally much better for the same trained models |
Implementing Probabilistic Text Image Indexing and Search
[Diagram: text images → KWS & indexing → page image indices → ingestion → spots database → keyword search and user interface]
• "KWS & Indexing": off-line pre-computation of the probabilistic indices.
• "Ingestion": off-line creation of the database of spotting results. Typically a simple and computationally cheap process.
• "Keyword search": on-line analysis of user queries, finding the requested information and presenting the retrieved images. Short response times are needed.
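A toy sketch of the ingestion step, assuming PI entries arrive as (page id, pseudo-word, relevance probability, bounding box) tuples. A real system would use a database engine, but the shape of the resulting inverted file is the same:

```python
from collections import defaultdict

def ingest(pi_entries):
    """Build a simple inverted file: pseudo-word -> posting list of spots.
    Each PI entry is (page_id, word, rel_prob, bbox)."""
    inverted = defaultdict(list)
    for page_id, word, prob, bbox in pi_entries:
        # case-fold so queries need not match the index case exactly
        inverted[word.lower()].append((prob, page_id, bbox))
    for word in inverted:
        # best spots first, so threshold queries can stop early on-line
        inverted[word].sort(reverse=True)
    return inverted
```

Sorting each posting list by decreasing relevance probability is one way to keep on-line response times short: a threshold query scans each list only down to the first entry below the threshold.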
Probabilistic Text Image Indexing: Index Building through KWS
[Diagram: transcribed images feed an optical + language model training system; text images go through a contextual word recognizer (HTR), producing char/word lattices, which the KWS + indexing tool turns into page-level probabilistic indices.]
• Indexing is typically based on Key Word Spotting (KWS) technologies.
• The most effective KWS methods use contextual word recognizers, which require models trained from transcribed images.
• Both the contextual recognizer and the training system are often separate pieces of software, not included in the indexing tool proper.
• The contextual recognizer produces intermediate rich data structures, such as character and/or word lattices, used by the KWS and indexing process.
• In general, KWS and indexing can be computationally (very) demanding.
Laboratory Results on Modern Manuscript Collections (18th-19th c.)
[Recall-precision curves (precision π vs. recall ρ) and Average Precision (AP):
AP=0.91 CRNN Bentham; AP=0.91 HMMs Bentham; AP=0.92 CRNN Plantas; AP=0.91 HMMs Plantas; AP=0.89 CRNN IAMDB; AP=0.91 CRNN GW]

Dataset training and test details:
• BENTHAM: many hands. Training: 400 pages (86 K running words); 86 char OMs, 2-gram word LM trained on Bentham texts; lexicon 9 K words. Test: 33 pages; query set: 8 658 keywords.
• PLANTAS (VOL-I): single hand. Training: 264 pages (79 K running words); 76 char OMs, 2-gram word LM trained with the training set + book glossary transcripts; lexicon 20 K words. Test: 607 pages; query set: 10 932 keywords.
• IAMDB: many hands. Training: 7 K lines (62 K running words); 81 char OMs, 2-gram word LM trained on LBW (English) text; lexicon 21 K words. Test: 929 lines; query set: 3 421 keywords.
• GW: single hand. Training: 492 lines (4 K running words); 83 char OMs, 2-gram word LM trained on training transcripts; lexicon 1.5 K words. Test: 164 lines; query set: 899 keywords.
Laboratory Results on Earlier Manuscript Collections (14th-18th c.)
[Recall-precision curves (precision π vs. recall ρ), Average Precision (AP) and Mean Average Precision (mAP):
TSO AP=0.80, mAP=0.83; Chancery AP=0.75, mAP=0.68; Passau AP=0.69, mAP=0.67]

Dataset training and test details:
• TSO: Spanish, many abbreviations, one hand. Training: 286 page images; 92 char CRNN OMs + 8-gram char LM trained on training transcripts; lexicon: 6 289 tokens. Test: cross-validation; query set: 5 409 keywords.
• CHANCERY: medieval French/Latin, heavily abbreviated, many hands. Training: 341 acts (∼100 pages); 105 char CRNN OMs + 5-gram char LM trained on training transcripts; lexicon: ∼20 000 tokens. Test: 95 acts; query set: 6 506 keywords.
• PASSAU: German/Latin, tables, many hands. Training: 200 pages; 102 char CRNN OMs + 6-gram char LM trained on training transcripts; lexicon 12 381 tokens. Test: 91 images; query set: 6 500 keywords.
Beyond Simple Keyword Spotting
• Word spelling flexibility:
  - Wild cards representing arbitrary character sequences
  - Approximate spelling (allowing character insertions, deletions and/or substitutions)
  - Hyphenated words (based on accurate prediction of word prefixes & suffixes and on the geometry of bounding boxes)
  Probabilistic indices make it simple and natural to handle these kinds of word spelling variations.
• Text data analytics on handwritten text images:
  - Astronomical amounts of historical handwritten documents exist, but only an infinitesimal fraction of the information they convey is known so far
  - Given the sheer scale of the available data, big-data analysis is perhaps the only way to obtain relevant information from these documents
  - Currently available text data analysis tools require plain-text input.
  Probabilistic indices allow many interesting forms of text data analytics from untranscribed images of handwritten text.
Wild Cards, Approximate Spelling, Abbreviations and Hyphenation
Since the (pseudo-)words of probabilistic index entries are just plain-text character sequences, flexible spelling can be allowed in query words, as in many plain-text search systems:
• Wild cards: e.g., Mari*, *anna, ad*able
• Approximate or "fuzzy" spelling: e.g., Elizabeth∼, neighbor∼2
  - particularly useful in historical documents, where the exact spelling is often unknown
  - also useful to increase the recall for correctly spelled words which are not perfectly indexed due to lexicon-free spotting
  (both techniques are implemented in the search interface)
• Abbreviations: can be handled with flexible spelling to some extent, but much better results are achieved by training the optical and language models with the expanded forms of the abbreviated words.
• Hyphenated words: similarly, the best results are achieved by training the models with tagged word fragments and indexing the full forms of the hyphenated words with the help of geometric reasoning.
Lab results: to be published soon. Demonstrator: http://prhlt-kws.prhlt.upv.es/fcr-hyp/
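Because index entries are plain character strings, both mechanisms reduce to string matching over the indexed vocabulary. A sketch using standard-library wildcards and a textbook edit distance; the '*' and '∼d' query syntax of the search interface is only mimicked here, not reproduced:

```python
import fnmatch

def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def expand_query(query, vocabulary, max_dist=0):
    """Map a flexible query onto the plain-text (pseudo-)words of the index:
    '*' wild cards via fnmatch; otherwise fuzzy matching within max_dist."""
    if "*" in query:
        return [v for v in vocabulary if fnmatch.fnmatch(v, query)]
    return [v for v in vocabulary if edit_distance(query, v) <= max_dist]
```

The expanded word set can then be issued as an OR query over the index, which is one reason fuzzy matching raises recall for imperfectly indexed lexicon-free pseudo-words.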
Text Data Analytics from Probabilistically Indexed Images
Basic text features usually computed for plain-text documents can be easily estimated from probabilistically indexed (but otherwise untranscribed) text images. Here we focus on:
• Number of running words
• Zipf curves
• Size of the vocabulary
Recall: a Probabilistic Index (PI) provides, for each image region X and each character string (or "pseudo-word") v, the probability that v is written in X, P(R | X, v).
Estimating Number of Running Words and Word Frequencies
Let w be the (unknown) transcript of X and define g(w, v) = 1 iff v ∈ w (0 otherwise). The number of (different) words in w is n(w) = ∑_v g(w, v), and its expected value is:

  E[n(w)] = ∑_w ∑_v g(w, v) P(w | X) = ∑_v ∑_{w : v ∈ w} P(w | X) = ∑_v P(R | X, v)

If the image regions X are sufficiently small (e.g., short lines), this is a generally good, lower-bound approximation to the number of running words in X.
Similarly, for an image collection X and its (unknown) transcript W:

  E[n(W)] ≈ ∑_{X ∈ X} ∑_v P(R | X, v)

The number of running words of an image collection can thus be estimated as the sum of the (pseudo-)word relevance probabilities of all the index entries.
Finally, the frequency of use n(v) of a specific word v in X is estimated as:

  E[n(v)] = ∑_{X ∈ X} P(R | X, v)
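The two estimators above are plain sums over index entries. A sketch, with the collection index represented as a hypothetical list of per-region dictionaries mapping pseudo-words to relevance probabilities:

```python
def expected_running_words(collection_index):
    """E[n(W)] ~= sum over regions X and index entries v of P(R | X, v)."""
    return sum(p for region in collection_index for p in region.values())

def expected_frequency(collection_index, v):
    """E[n(v)] = sum over regions X of P(R | X, v)."""
    return sum(region.get(v, 0.0) for region in collection_index)
```

Note that both estimates are sums of probabilities, so they need no thresholding: low-probability alternative spellings contribute proportionally little.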
Zipf Curve and Lexicon Size of a Text
The Zipf curve of a given text (transcript) depicts the frequency of each word as a function of its rank in the decreasingly sorted list of word frequencies. The vocabulary size is just the highest rank of a least frequent word (freq = 1).
[Log-log plot: frequency vs. rank for the Bentham test-set GT transcripts.]

Estimating Zipf Curves and Lexicon Sizes from PIs
The same curve can be estimated from a probabilistic index of the (otherwise untranscribed) images, using the expected word frequencies. The vocabulary size can now be estimated as the rank for which freq = ε (e.g., ε = 0.5).
[Log-log plots: frequency vs. rank, from GT transcripts and from probabilistic indices, for the Bentham, Passau and TSO test sets, with the freq = 0.5 level marked.]
Estimating Running Words and Lexicon Size: Laboratory Results

| Dataset | Running words: real | estim. | error | Lexicon size: real | estim.† | error |
| BENTHAM | 89 870 | 83 235 | -7.4% | 6 988 | 7 431 | +6.3% |
| TSO     |  9 996 |  9 926 | -0.7% | 2 544 | 2 496 | -1.9% |
| PASSAU  | 26 709 | 26 155 | -2.1% | 5 801 | 5 598 | -3.5% |

† Obtained from estimated Zipf curves with threshold ε = 0.5.
Work in progress: full details and results to be published soon.
Preliminary results in: [Toselli et al.: "Making Two Vast Historical Manuscript Collections Searchable and Extracting Meaningful Textual Features Through Large-Scale Probabilistic Indexing", ICDAR-2019]
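The estimates marked † can be sketched as follows, where est_freqs is a hypothetical mapping from pseudo-words to their expected frequencies E[n(v)]:

```python
def zipf_curve(est_freqs):
    """Estimated frequencies sorted in decreasing order: the value at
    index r - 1 is the frequency of the rank-r word."""
    return sorted(est_freqs.values(), reverse=True)

def estimated_lexicon_size(est_freqs, eps=0.5):
    """Rank of the last word whose estimated frequency reaches the
    threshold eps (eps = 0.5 in the table above)."""
    return sum(1 for f in est_freqs.values() if f >= eps)
```

With real counts the cut-off would be freq = 1; with expected counts the softer threshold ε absorbs the probability mass spread over low-confidence pseudo-words.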
Multiple-word Queries: Boolean Combinations and Sequences
Multi-word and word-sequence queries are useful for searching for information in a more "semantic" way, as in plain-text information retrieval applications like Google.
Since the single-word relevance probabilities of a probabilistic index are properly normalized and consistent, they can be combined to (approximately) support various types of multi-word queries:
• Boolean: AND (∧), OR (∨), NOT (¬). If P(v1), P(v2) are the relevance probabilities of two indexed words v1, v2, then:

  P(v1 ∧ v2) = P(v1) P(v2 | v1) = P(v2) P(v1 | v2) ≈ min(P(v1), P(v2))
  P(v1 ∨ v2) = P(v1) + P(v2) - P(v1 ∧ v2) ≈ max(P(v1), P(v2))
  P(¬v1) = 1 - P(v1)
  [Toselli et al., Pattern Anal. Appl. 22(1), pp. 23-32, 2019]

For example, to search for image regions containing both the words "cat" (v1) and "dog" (v2), but none of the words "mouse" (v3) or "rabbit" (v4), the combined relevance probability is computed as:

  P(v1 ∧ v2 ∧ ¬(v3 ∨ v4)) ≈ min(P(v1), P(v2), 1 - max(P(v3), P(v4)))

• Word sequence: issue an AND query and examine the bounding-box coordinates of the retrieved spots. Only those which are roughly in reading order are finally retrieved.
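The approximations above compose mechanically. A direct sketch of the combination rules, including the cat/dog example:

```python
def p_and(*probs):
    """P(v1 ∧ v2 ∧ ...) ≈ min of the single-word relevance probabilities."""
    return min(probs)

def p_or(*probs):
    """P(v1 ∨ v2 ∨ ...) ≈ max of the single-word relevance probabilities."""
    return max(probs)

def p_not(p):
    """P(¬v) = 1 - P(v)."""
    return 1.0 - p

def cat_dog_query(p_cat, p_dog, p_mouse, p_rabbit):
    """The example query: cat AND dog AND NOT (mouse OR rabbit)."""
    return p_and(p_cat, p_dog, p_not(p_or(p_mouse, p_rabbit)))
```

Because every operator returns a probability, arbitrary Boolean query trees can be evaluated bottom-up over the index entries of a region.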
Document Image Classification Based on Textual Content
Two approaches:
• Classification by means of user provided (maybe complex) queries:
Users do often know how to define the textual content of a class of documentsin terms of boolean combinations of words and word sequences.
Let q1, . . . , qC be queries defining the contents of classes 1, . . . , C, and letP (R | X, qc) the relevance probability of document (image) X to the query ofclass c. Then X is classified into the “most relevant” class:
c = argmax1≤c≤C
P (R|X, qc)
• Classification based on successful Machine Learning plain-text classifiers,which can be adapted to use probabilistic indices of text images.
Particularly interesting are “bag-of-word” models which do not explicitly relyon word order, such as Multinomial or Bernoullli mixture models (details: .66)
⇒ Work in progress; results to be reported soon.
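The query-based classification rule can be sketched as below. The class queries, the AND-only scoring, and the index layout are invented for the example; real class queries may be arbitrary Boolean combinations:

```python
# Illustrative sketch of query-based document classification: each class
# is defined by a query, each query is scored with the min approximation
# for AND, and the image goes to the class with the most relevant query.

def score_and(index, words):
    """AND-combined relevance: P(w1 ∧ ... ∧ wn) ≈ min over the words."""
    return min(index.get(w, 0.0) for w in words)

def classify(index, class_queries):
    """Return the class whose (AND) query is most relevant to the image."""
    return max(class_queries, key=lambda c: score_and(index, class_queries[c]))

class_queries = {
    "baptism":  ["tauf", "tag"],
    "marriage": ["braut", "braeutigam"],
}
page_index = {"tauf": 0.7, "tag": 0.9, "braut": 0.2}
print(classify(page_index, class_queries))  # baptism
```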
Index
1 Demonstrations . 3
2 Text Image KWS and Probabilistic Indexing . 7
3 Implementation Issues . 15
4 Laboratory Results on Several Manuscript Collections . 18
5 Beyond Basic Keyword Spotting . 22
- Wild Cards, Approximate Spelling, Abbreviations and Hyphenation . 24
- Estimating Textual Features of Untranscribed Manuscript Collections . 26
- Boolean & Sequence Queries and Content-based Image Classification . 30
- Data-Base-like Information Retrieval from Handwritten Tables . 34
- Search for Melodic Patterns in Handwritten Music Notation . 41
6 Conclusions . 45
7 END (and further details) . 47
Handwritten Table Images

• Handwritten tables perhaps account for more than half of the vast amounts of documents preserved in archives.

• Tables contain important, often ready-to-use information for many historical studies, such as ethnography, demography, economics, genealogy, etc.

• Automatic transcription of handwritten tables is very difficult:
– ad-hoc, variable, inconsistent and even erratic layouts,
– difficult line detection and hopeless reading-order ambiguities,
– short lines lack linguistic context to help accurate word recognition, . . .

Handwritten table images (from the Passau dataset)
Data-Base-like Information Retrieval from Table Images
Probabilistic Indices hold geometric information of word bounding boxes (BB) and support Boolean multi-word queries.

This, along with BB-based geometric reasoning, can be used to support layout-agnostic, structured queries for information retrieval from table images.

Consider queries of the form 〈column-heading, column-content〉, where column-heading is a Boolean combination of table heading words and column-content is a (single) keyword.

Examples (from the PASSAU collection):
〈 ORT, PASSAU 〉  (〈 PLACE, PASSAU 〉)
〈 GEBURTS ORT, PASSAU 〉  (〈 BIRTH PLACE, PASSAU 〉)
〈 TAUF TAG, APRIL 〉  (〈 CHRISTENING DAY, APRIL 〉)
〈 KRANKHEIT ARZT, FRAISEN 〉  (〈 MEDICAL OFFICER, FRAISEN 〉)
〈 NAMEN DES BRAEUTIGAMS, JOSEF 〉  (〈 NAME OF THE GROOM, JOSEF 〉)
〈 NAMEN DER BRAUT, MARIA 〉  (〈 NAME OF THE BRIDE, MARIA 〉)
〈 TAG MONAT JAHR TODES, 1879 〉  (〈 DAY MONTH YEAR OF DEATH, 1879 〉)
Structured Multi-word Query Retrieval

To deal with queries of the form 〈column-heading, column-content〉, the retrieval process is carried out in four steps for each table image:

• retrieve column-heading words with AND-combined relevance probability higher than the given threshold τ,

• apply simple geometric reasoning: BBs of candidate spots are assigned high probability only if they are close enough to each other and loosely located in upper regions of the image,

• retrieve column-content words with relevance probability higher than τ,

• assign high probability only to the retrieved column-content BBs which fall within column-wise regions loosely delimited by the horizontal span of the spotted column-heading BBs, and are below these BBs.
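The four steps can be sketched as follows. Spots are (word, probability, (x, y, w, h)) tuples; the data layout, the 0.3-page-height “upper region” cutoff, and the min-combination of probabilities are illustrative assumptions, and step 2 is reduced to its upper-region part:

```python
# Hedged sketch of the four-step structured table retrieval above.

def heading_region(heading_spots, page_height):
    """Keep heading spots in the upper part of the page (part of step 2)
    and return the horizontal span and lower edge they delimit."""
    upper = [s for s in heading_spots if s[2][1] < 0.3 * page_height]
    if not upper:
        return None
    xs = [b[0] for _, _, b in upper] + [b[0] + b[2] for _, _, b in upper]
    return min(xs), max(xs), max(b[1] + b[3] for _, _, b in upper)

def retrieve(heading_spots, content_spots, page_height, tau=0.5):
    """Steps 1-4: threshold headings, locate the column, filter content."""
    heads = [s for s in heading_spots if s[1] > tau]            # step 1
    region = heading_region(heads, page_height)                 # step 2
    if region is None:
        return []
    x0, x1, y_bottom = region
    hits = []
    for word, p, (x, y, w, h) in content_spots:
        if p > tau and x0 <= x and x + w <= x1 and y > y_bottom:  # steps 3-4
            # combined relevance: min of heading and content probabilities
            hits.append((word, min(p, min(hp for _, hp, _ in heads))))
    return hits

heads = [("BRAUT", 0.9, (200, 40, 60, 15))]
cells = [("MARIA", 0.8, (205, 120, 50, 15)), ("MARIA", 0.6, (20, 150, 50, 15))]
print(retrieve(heads, cells, page_height=800))  # [('MARIA', 0.8)]
```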
Information Retrieval from Table Images: Example
Query: 〈 NAMEN DER BRAUT , MARIA 〉
Information Retrieval from Table Images: Example
Query: 〈 NAMEN DER BRAUT , MARIA 〉 – spotting column-heading words
h1 = NAMEN, h2 = DER, h3 = BRAUT
Information Retrieval from Table Images: Example
Query: 〈 NAMEN DER BRAUT , MARIA 〉 – column-heading geometric reasoning
Information Retrieval from Table Images: Example
Query: 〈 NAMEN DER BRAUT , MARIA 〉 – probabilities of column-heading words
h1 = NAMEN, h2 = DER, h3 = BRAUT

P(h1) = P(h11 ∨ h12) ≈ max(P(h11), P(h12))
P(h2) = P(h21 ∨ h22 ∨ h23) ≈ max(P(h21), P(h22), P(h23))
P(h3) = P(h31)
Information Retrieval from Table Images: Example
Query: 〈 NAMEN DER BRAUT , MARIA 〉 – column-heading probability
h1 = NAMEN, h2 = DER, h3 = BRAUT

P(h) = P(h1 ∧ h2 ∧ h3) ≈ min(P(h1), P(h2), P(h3))
Information Retrieval from Table Images: Example
Query: 〈 NAMEN DER BRAUT , MARIA 〉 – candidate region for column-content word
Information Retrieval from Table Images: Example
Query: 〈 NAMEN DER BRAUT , MARIA 〉 – spotting column-content words
v = MARIA

P(v) = P(v1 ∨ v2) ≈ max(P(v1), P(v2))
Information Retrieval from Table Images: Example
Query: 〈 NAMEN DER BRAUT , MARIA 〉 – retrieved data and total relevance probability
h = NAMEN DER BRAUT, v = MARIA

P(〈h, v〉) = P(h ∧ v) ≈ min(P(h), P(v))
Table Images Information Retrieval: Laboratory Results
– Recall-Precision curves
– Average Precision (AP), mean AP (mAP)

[Recall–Precision plot: tabQueries AP=0.90, mAP=0.92; singleKWs AP=0.75, mAP=0.69]
Dataset training and test details
• PASSAU: German/Latin, many hands. Training: 200 pages, 102-char CRNN OMs + 6-gram char LM trained on training transcripts; Lexicon: 12 381 tokens. Test: 91 page images; Query set: 6 500 keywords

• TAB PASSAU: Table queries in PASSAU. Training: same as PASSAU. Test: 44 table images; Query set: 363 real multiword structured queries.

• See: [Toselli et al., “Probabilistic Indexing and Search for Information Extraction on Handwritten German Parish Records”, ICFHR-2018]
Work in progress: more details and results to be published soon
Table Images Information Retrieval: Demonstrations
http://prhlt-carabela.prhlt.upv.es/passauTab/
http://prhlt-carabela.prhlt.upv.es/passauTab/views/help.html

Simple queries:

< name der braut , Maria >  (< name of the bride, Maria >)
< wohnort , Passau >  (< place of residence, Passau >)

Single column, complex header/content descriptions:

< tauf tag , [2*ten (April || May)] >  (< baptism day, 2* of (April or May) >)
< (eltern || parent*) , Sebas* >  (< relatives or parents, Sebas* >)

Two columns:

< tag trauung , Jaen* || Feb* > && < name* braut , Maria >  (< day wedding, Jan* or Feb* > and < name bride, Maria >)
< name braeutigams , Georg > && < eltern brau*, Martin || Magdalena >  (< name of the groom, Georg > and < parents of the bride, Martin or Magdalena >)
Automatic Processing of Historic Handwritten Music Manuscripts
• Millions of historic musical manuscripts are preserved in cathedrals, abbeys, archives, etc. Many are digitized, but their musical contents remain inaccessible

• In many cases, perfect transcripts are not really needed; instead, content-based search with some degree of reliability would be extremely useful

• Spotting just single music symbols is mostly useless (all the symbols involved generally appear in each page); instead, helpful search targets are “melodic patterns”, which typically correspond to music symbol sequences.

We explore approaches for accurate retrieval of melodic patterns, represented by music symbol sequences, from collections of early music manuscripts.
Handwritten Music Notation: The VORAU Cod. 253 Manuscript

• Manuscript from Vorau Abbey library, ca. 1450, provided by the Austrian Academy of Sciences

• Dataset details:

Data              Train-Val    Test
Pages                   422      44
Staves                1 000      97
Running symbols      13 066   1 086
Symbol set size          19      15

• Written in German gothic notation, without information about duration of notes

• Representation based on vertical positions of music symbols in stave lines (L) and spaces (S)

Example: C3 S3 L4 L3 S3 S3 S3 L3 S2 L2 L3 . . .
VORAU-253: Indexing and Search Laboratory Results
[Recall–Precision plot: Single Symbols AP=0.89, mAP=0.75; Sequences AP=0.86, mAP=0.92]

• (Imperfect) Transkribus stave segmentation

• CRNN (TensorFlow) optical models and symbol 2-gram LM.

• Query sets:
– Single symbols: all the 15 symbols seen in the test set.
– Symbol sequences: all the 615 sequences with lengths ranging from 3 to 15 which appear in the test set more than once.

• Average Precision (AP) & mean AP (mAP) evaluated at stave level for sequence queries and at relative symbol position level for single symbol queries.
See: [Calvo et al.: “Music Symbol Sequence Indexing in Medieval Plainchant Manuscripts”, ICDAR’19]
Work in progress: more details and results to be published soon
Music Indexing and Search: Demonstration
http://prhlt-carabela.prhlt.upv.es/music
Single symbols:
S3
F4
Symbol sequences:
[ S3 S3 S4 S3 L3 L3 L3 ]
[L3 L3 L3 L3 L3 L3 L3 L3]
[S2 L2 L2 S2 L3 S2 L3 S2 S1 L2 S1]
Sequences with alteration:
[L2 FLAT S2 L2]
[ FLAT S3 L3 S2 L3 ]
Conclusions
• A probabilistic framework has been introduced for indexing and searching in large collections of untranscribed handwritten documents

• Empirical results with a variety of historic collections, exhibiting different challenges and levels of complexity, assess the potential of this framework

• Abbreviations, hyphenation and other difficulties entailed by historical manuscripts are overcome

• On the basis of the proposed approach, several very large collections of historical manuscripts have been actually indexed and their textual contents made publicly accessible through efficient web search interfaces

• Probabilistic Indices allow Text Data Analytics and a variety of forms of “semantic” Information Retrieval and “big-data” analysis to be carried out on massive sets of untranscribed handwritten text images

• Current and future projects:
– CARABELA: Completing manuscript processing, up to 150 000 pages, and exploring content-based image classification into user-defined classes of documents
– Probabilistically index the 1 000 000 pages of the complete FCR collection of the National Archives of Finland (NAF)
– Plans for other very large European collections of historical manuscripts.
Thanks for your attention!
(additional details below)
Image Region KWS
• Posteriorgrams can be directly used for KWS: given a threshold τ ∈ [0, 1], a word v is spotted in all image positions where P(v | X, i, j) > τ. Varying τ, adequate precision–recall tradeoffs can be achieved

• But, for indexing purposes, we need the probability that a word v is written within a pre-specified image region, such as a page, a column, or a line

A popular (but wrong!) idea: for a text image region X, use the word posterior probability P(v | X).

But this is ill-defined, because Σ_v P(v | X) = 1

. . . but, for each of the (many) different words v actually written in X, we ideally want P(v | X) to be close to 1: the sum should ideally be ≫ 1!
What is an adequate posterior probability for image region KWS ?
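The first bullet above, position-level spotting by thresholding a posteriorgram, can be sketched as follows. The posteriorgram is a toy nested dict invented for the example; real systems derive P(v | X, i, j) from optical and language models:

```python
# Minimal sketch of pixel-position KWS from a posteriorgram: a word v is
# spotted at every position (i, j) where P(v | X, i, j) > tau.

def spot(posteriorgram, word, tau):
    """Return all (i, j) positions where the word's posterior exceeds tau."""
    return [(i, j)
            for (i, j), word_probs in posteriorgram.items()
            if word_probs.get(word, 0.0) > tau]

pg = {(10, 40): {"cat": 0.85, "cart": 0.10},
      (10, 90): {"dog": 0.70},
      (35, 40): {"cat": 0.30}}
print(spot(pg, "cat", 0.5))  # [(10, 40)]
```

Lowering tau trades precision for recall: `spot(pg, "cat", 0.2)` also returns the weaker second candidate.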
Choosing Adequate Minimal Image Regions: Line-level KWS
Line-shaped regions are good for indexing and search in practice; moreover, they allow for efficient computation by clever vertical subsampling and choice of B(i, j):

• Vertical subsampling: in general, it amounts to just guessing a proper line height and then running a vertical sliding window of this height with some overlap

• Choosing B(i, j): for a line-shaped region, the marginalization bounding boxes needed to compute posteriorgrams can be just defined by horizontal segmentation

Line-level posteriorgrams can be very efficiently computed using Word Graphs (WGs), obtained as a byproduct of Viterbi or “token-passing” decoding.

This has two important benefits in order to compute posteriorgrams by marginalization:

• Optical (HMM) Character Models and (N-gram) Language Models are used to provide very accurate, contextual word classification probabilities, P(v | X, B)

• WGs provide lots of alternative horizontal word-level segmentations, which directly define B(i, j)
Line-level KWS: State-of-the-art Modelling

• Optical modelling: deep convolutional-recurrent (CRNN) network:

• Textual context modelling: finite-state character n-grams:
Probabilistic Indexing & Search: Precision-Recall Tradeoff Model
Indexing and search quality can be assessed by means of precision (π) & recall (ρ) performance.

Precision is high if most of the retrieved results are correct, while recall is high if most of the existing correct results are retrieved.

If perfectly correct text were indexed, you’d get a single, “ideal” point with ρ = π = 1.
[Recall–Precision plot: Perfect (AP=1.0); Prob. Index (AP=0.8); Aut. Transcript (AP=0.6); the Prob. Index curve runs from high confidence threshold (low recall) to low confidence threshold (high recall)]

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plain text, precision and recall are also fixed values, albeit not “ideal” (perhaps something like ρ = 0.75, π = 0.8, with Average Precision AP=0.6).

In contrast, probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting a threshold on the system confidence (relevance probability)

This flexible “precision-recall tradeoff model” obviously allows for better search and retrieval performance than naive plain-text searching on automatic noisy transcripts.
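The tradeoff can be made concrete with a small computation: sweeping a confidence threshold over scored spots yields one (precision, recall) point per threshold. This is the standard textbook computation, not the talk's evaluation code; the spot list and relevance judgments are invented:

```python
# Hedged sketch of the precision-recall tradeoff under a varying
# confidence threshold tau.

def pr_point(spots, relevant, tau):
    """Precision and recall of the spots scored above threshold tau.
    spots: list of (item, score); relevant: set of truly relevant items."""
    retrieved = [item for item, score in spots if score > tau]
    if not retrieved:
        return 1.0, 0.0  # conventional point when nothing is retrieved
    tp = sum(1 for item in retrieved if item in relevant)
    return tp / len(retrieved), tp / len(relevant)

spots = [("a", 0.9), ("b", 0.8), ("c", 0.4), ("d", 0.2)]
relevant = {"a", "c"}
print(pr_point(spots, relevant, 0.5))  # high threshold: (0.5, 0.5)
print(pr_point(spots, relevant, 0.1))  # low threshold:  (0.5, 1.0)
```

Plain-text search on a fixed (noisy) transcript corresponds to a single such point; a probabilistic index exposes the whole curve.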
Laboratory Results on Several Manuscript Collections (17th-19th c.)
– Recall-Precision curves
– Average Precision (AP)
– Mean Average Precision (mAP)

[Recall–Precision plot: Bentham AP=0.88, mAP=0.94; Plantas AP=0.86, mAP=0.81; Austen AP=0.81, mAP=0.74; Austen-B AP=0.71, mAP=0.66]

Datasets training and test details

• BENTHAM: Multi-hand. Training: 400 pages from Bentham, 87 char. HMMs, 2-gram LM trained on Bentham texts; Lexicon: 9 341 tokens. Test: 33 pages; query set: 6 962 keywords

• PLANTAS (VOL-I): Single hand. Training: 224 pages from Plantas, 77 char. HMMs, 2-gram LM trained with the training set + book glossary transcripts; Lexicon: 11 561 tokens. Test: 647 pages; query set: 9 945 keywords

• AUSTEN: Single hand. Training: 50 Austen pages, 81 char. HMMs, 2-gram LM trained on Austen texts; Lexicon: 20K tokens. Test: 78 pages; query set: 2 281 keywords

• AUSTEN-B: Single hand. No training; using Bentham character HMMs, lexicon and LM. Test & query set: same as for AUSTEN
Laboratory Results on Difficult Medieval Collections (14th-16th c.)
– Recall-Precision curves
– Average Precision (AP)

[Recall–Precision plot: Bentham AP=0.88; Plantas AP=0.86; Austen AP=0.81; Austen-B AP=0.71; Alcaraz AP=0.71; WienStUlrich AP=0.77; Chancery AP=0.75]

Datasets training and test details

• ALCARAZ: Spanish, multi-hand. Training: 44 pages, 70 char. HMMs + 2-gram LM trained on training transcripts; Lexicon: 3 405 tokens. Test: cross-val.; Query set: 3 400 keywords

• WIENSANKTULRICH: German/Latin, one hand. Training: 52 pages, 74 char. HMMs, 2-gram LM from training transcripts; Lexicon: 2 303 tokens. Test: cross-val.; Query set: 2 256 keywords

• CHANCERY: Medieval French/Latin, heavily abbreviated, many hands. Training: 341 Acts (∼100 pages), 105-char CRNN OMs + 5-gram char LM trained on training transcripts; Lexicon: ∼20 000 tokens. Test: 95 Acts; Query set: 6 506 keywords.

HIMANIS JPICH PROJECT: The full Chancery collection (82 000 page images) was indexed. See: prhlt-kws.prhlt.upv.es/himanis
Chancery Laboratory Results: Impact of LM and Abbreviations
[Left plot, Recall–Precision for character N-gram LMs: 5grm AP=0.75, mAP=0.68; 3grm AP=0.69, mAP=0.61; 0grm AP=0.62, mAP=0.52; HTR AP=0.58, mAP=0.44]

[Right plot, Recall–Precision for abbreviated keywords: 5gram-Latin AP=0.86, mAP=0.73; 5gram-French AP=0.80, mAP=0.74; 5gram-All AP=0.75, mAP=0.68]

Left: Recall-Precision results for different character N-gram models (0grm, 3grm, 5grm). A single R-P point (HTR) is also shown for the 1-best recognition hypotheses with character 5-grams.

Right: Recall-Precision results for (only) abbreviated keywords using character 5-gram models: Latin-only (5g-la), French-only (5g-fr) and both Latin and French (5gr). A single R-P point (HTR) is also shown for the 1-best recognition hypotheses with character 5-grams.
Chancery: Examples of Abbreviated Word Spotting
Modernized (expanded) query keywords and corresponding spotting results

Keyword              Guillaume   chevalier   livres   quelconques
Avg. Precision (AP)       0.79        0.89     0.79          0.91

(The Full form, Abbreviated and False Positives rows of the original table show image snippets.)

For each keyword: selected examples of correctly spotted images, both in full form and abbreviated, and one example of a false positive.

The AP shown for each keyword is the true experimental value, computed taking into account all the spotting results on the test set.

Latin and French abbreviated-only results are better than those including all the query words!
[Recall–Precision plot: 5gram-Latin AP=0.86, mAP=0.73; 5gram-French AP=0.80, mAP=0.74; 5gram-All AP=0.75, mAP=0.68]
Probabilistic Text Image Indexing and Search: System Diagram
[System diagram: Text images → KWS & indexing tool → Page-level indices → Ingestion → Database → Keyword search]

• “KWS & indexing tool”: off-line pre-computation of probabilistic indices

• “Ingestion”: off-line creation of the actual database. Typically a simple and computationally cheap process

• “Keyword search”: on-line user query analysis, finding the requested information and presenting the retrieved images. Short response times needed.
Probabilistic Text Image Indexing: Index Building through KWS
[Diagram: Transcribed images → Optical + Language Models training system; Text images → Contextual word recognizer (HTR) → Char / word lattices → KWS + indexing → Page-level probabilistic indices]

• Indexing is typically based on Key Word Spotting (KWS) technologies

• Most effective KWS methods use contextual word recognizers which require models trained from transcribed images

• Both the contextual recognizer and the training system are often separate pieces of software, not included in the indexing tool proper

• The contextual recognizer produces intermediate rich data structures, such as character and/or word lattices, used by the KWS and indexing process
• In general, KWS and indexing can be computationally (very) demanding.
Probabilistic Text Image Indexing: Index Ingestion
[Diagram: Page-level indices → Ingestion (char folding, word grouping, index trimming, data structuring) → Database]

The set of individual image probabilistic indices is compiled into a data structure adequate for fast operation of the search engine.

The following processes are carried out here:

• Case & diacritics folding

• Word grouping – e.g., to index lemmas rather than regular words

• Organize the spots according to the chosen hierarchical structure

• Trim the index to the desired indexing density. Density can be expressed as a relevance probability threshold, or as a number specifying how many spots per page, per image region, or per running word should be indexed
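Both trimming criteria can be sketched in a few lines. The list-of-tuples page layout is a hypothetical simplification of the index, invented for the example:

```python
# Illustrative sketch of index trimming during ingestion: keep either the
# spots above a relevance-probability threshold, or the top-k spots per page.

def trim_by_threshold(page_spots, tau):
    """Keep spots whose relevance probability exceeds tau."""
    return [s for s in page_spots if s[1] > tau]

def trim_top_k(page_spots, k):
    """Keep only the k most relevant spots on the page."""
    return sorted(page_spots, key=lambda s: s[1], reverse=True)[:k]

page = [("cat", 0.9), ("dog", 0.3), ("mouse", 0.7), ("rabbit", 0.05)]
print(trim_by_threshold(page, 0.5))  # [('cat', 0.9), ('mouse', 0.7)]
print(trim_top_k(page, 3))           # [('cat', 0.9), ('mouse', 0.7), ('dog', 0.3)]
```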
Probabilistic Indexing: Search Engine and User Interface
[Diagram: Keyword query → Query analysis → Search engine ↔ Database; retrieved text images → Display (GUI)]

• GUI: graphical / textual specification of queries and desired precision-recall tradeoff

• Query analysis: trivial for single words, but significant for multi-word queries

• Search engine: accesses the database. Specialized software is typically needed for probabilistically consistent support of multi-word queries and hierarchical search.

• Display retrieved images: prepare the images to be presented to the users as a result of their queries. The way they are presented is highly application dependent.
Demonstrations
• CHANCERY collection (HIMANIS project). XIV–XV century “Tresor des Chartes” registers. 82 000 images of densely handwritten text in Latin and French.

• TSO collection (Teatro del Siglo de Oro, READ project). XV–XVII century manuscripts of Spanish comedies, with more than 100 000 images, written by many hands. Work in progress: 150 manuscripts with 21 000 images so far.

• Many other demonstrators for smaller collections from varied historical periods in several languages.
Chancery: Indexing & Search Demonstration
PRHLT HIMANIS Search Interface
http://himanis.huma-num.fr/himanis

A small sample of query examples:

liliorum
predicatorum

Elena
Elena || Helena || Helene

Isabel
Ysabel || Isabelle || Elisabet || Helisabet || Elisa || Elisia
(guerre || paix) && Alemaigne
[ duc de borbon ]
Teatro del Siglo de Oro Espanol: Indexing & Search Demonstration
PRHLT TSO Search Interface
http://prhlt-carabela.prhlt.upv.es/tso

A small sample of query examples:

marquesa
Almagro

teniente || alferez || sargento
sol && espanol

Isabel (belleza || hermosura || nobleza)
(valor || dolor) && (amor || honor)

[ Lope de Vega ]
[ Calderon de la Barca ]
Indexing & Search Demonstration for Other Collections
Many other (smaller) handwritten collections, from varied historical periods, in several languages: in the TRANSCRIPTORIUM web site.
Passau Miscellanea: Indexing & Search Demonstration
PRHLT READ Search Interface for Passau
http://transcriptorium.eu/demots/kws-Passau
A small sample of query possibilities:
Sabina
Margareta || Margareth || Margaretha || Margaretham || Margaritha
Adam && Eva
Passau && Anna
(Johann || Anna) 1798
[ filia legitima ]
[matrica consignatio copulatorum]
Mixture Models for Textual-Content Based Image Classification
Let b(w) = b1, . . . , bN be the “bag-of-words” bit vector of the text w. The K-component Bernoulli mixture likelihood of b(w) for class c is:

P_B(b(w) | c) ≡ P_B(w | c) = Σ_{k=1}^{K} Π_{n=1}^{N} p_ckn^{bn} (1 − p_ckn)^{(1−bn)}

The parameters of this model, p_ckn, 1 ≤ c ≤ C, 1 ≤ k ≤ K, 1 ≤ n ≤ N, can be learned from class-labelled documents through EM estimation.

For a text image X, w is unknown and we have to marginalize over all possible words in X – the required “word content” probabilities are provided by the image Probabilistic Index. After some developments and assumptions, the following approximation can be derived:

c ≈ argmax_c P(c) P_B(w | c)

where w is the set of N most relevant words according to P(R | X, v), v ∈ w, and N can be estimated as the expected value of n(w) (see p. 27)
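A minimal sketch of the classification rule, assuming a standard mixture parameterization with explicit component weights (the slide's compact formula absorbs them); all parameter values below are invented toy numbers, where real ones would be EM-estimated from class-labelled documents:

```python
# Hedged sketch of Bernoulli-mixture classification of a bag-of-words bit
# vector, computed in the log domain for numerical stability.
import math

def bernoulli_mixture_ll(bits, weights, protos):
    """log P_B(b | c) for one class: weighted sum over K components of the
    product over N dimensions of p^b (1-p)^(1-b)."""
    comp = []
    for w, p in zip(weights, protos):
        comp.append(math.log(w) + sum(
            math.log(pn if b else 1.0 - pn) for b, pn in zip(bits, p)))
    m = max(comp)  # log-sum-exp trick
    return m + math.log(sum(math.exp(c - m) for c in comp))

def classify(bits, priors, params):
    """argmax_c P(c) * P_B(b | c), evaluated in the log domain."""
    return max(params, key=lambda c: math.log(priors[c]) +
               bernoulli_mixture_ll(bits, *params[c]))

params = {  # per class: (mixture weights, Bernoulli prototypes)
    "letters":  ([1.0], [[0.9, 0.1, 0.2]]),
    "accounts": ([1.0], [[0.1, 0.8, 0.9]]),
}
priors = {"letters": 0.5, "accounts": 0.5}
print(classify([1, 0, 0], priors, params))  # letters
```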