TRANSCRIPT
Keynote talk, ICDAR-2019
Text Search and Information Retrieval in Large Historical Collections of Untranscribed Manuscripts
Enrique Vidal, Alejandro H. Toselli, Joan Puigcerver and the HTR PRHLT team, [email protected]
Pattern Recognition and Human Language Technology Center
This presentation can be downloaded from:
www.prhlt.upv.es/~evidal/tmp/icdar19keynoteEVidal2p.pdf
Text search and IR in untranscribed manuscripts ICDAR-2019
Handwritten Text Recognition and Historical Manuscripts
• Some decades ago, off-line Handwritten Text Recognition (HTR) was thought to quickly become a research topic of little practical interest, since the use of text written on paper would soon become obsolete.
However...
• Massive historical manuscript collections, stored in thousands of kilometers of shelves in archives and libraries, have changed the picture dramatically.
• Digitization is a first step, but not enough: important information lingers hidden behind zillions of pixels of digital images, and the quintessence of these historical documents (their textual content) remains inaccessible.
E. Vidal PRHLT-UPV, September-2019 Page 1
Textual access to Untranscribed Manuscripts
If perfect or sufficiently accurate transcripts of the text images were available, image textual contents would obviously be accessible.
However...
• manual transcription is entirely prohibitive for massive image collections;
• automatic transcription results generally lack the accuracy level needed for most applications, including scholarly editions, content-based document classification and information retrieval.
Good news: probabilistic indexing (PI) and textual search can be carried out directly on untranscribed images, as we will see now.
Even better news: PI allows interesting forms of "big data" analysis, such as text analytics, document classification, information retrieval, etc.
Index
1 Demonstrations . 3
2 Text Image KWS and Probabilistic Indexing . 7
3 Implementation Issues . 15
4 Laboratory Results on Several Manuscript Collections . 18
5 Beyond Basic Keyword Spotting . 22
- Wild Cards, Approximate Spelling, Abbreviations and Hyphenation . 24
- Estimating Textual Features of Untranscribed Manuscript Collections . 26
- Boolean & Sequence Queries and Content-based Image Classification . 30
- Database-like Information Retrieval from Handwritten Tables . 34
- Search for Melodic Patterns in Handwritten Music Notation . 41
6 Conclusions . 45
7 END (and further details) . 47
Large-Scale Probabilistic Keyword Indexing and Search is Here, Now!
The HTR team of the PRHLT Research Center has been developing the PI technology during the last decade and, more recently, has successfully applied it to five large manuscript collections, thereby making their textual contents fully accessible:
• Chancery (AN & BN, France): 83 000 pgs., very abridged French & Latin, 14-15th c. http://prhlt-kws.prhlt.upv.es/himanis/ (HIMANIS project)
• TSO (Teatro del Siglo de Oro, BN of Spain): 41 000 pages, Spanish, 16-17th c. http://prhlt-carabela.prhlt.upv.es/tso/ (READ project)
• Bentham Papers (UCL & BL): 95 000 pages, English, scrawl writing, 18-19th c. http://prhlt-kws.prhlt.upv.es/bentham/ (READ project)
• FCR (Finnish Court Records, NA Finland): 138 000 pages, Swedish, 18-19th c. http://prhlt-kws.prhlt.upv.es/fcr/ (READ project)
• Carabela (AGI + AHPC): 125 000 pages, Spanish, abstruse scripts, 16-18th c.; manuscripts valuable to underwater archaeology. http://carabela.prhlt.upv.es/en/demonstrators (CARABELA prj.)
These and many other smaller-scale, experimental demonstrators are available from: http://transcriptorium.eu/demots/KWSdemos
Two Quick Demonstrations with Bentham Papers and Carabela
Bentham is famous as a theorist of punishment. Hence he wrote a lot about Australia and its penal colonies. Let us try to see what:
  Australia
  [New South Wales]
  Austral* || [New Holland]
  [New South Wales] || [Botany Bay]
  [New South Wales] && convict*
  [New South Wales] && [penal colon*]
Let us try now with Carabela. In the 16-17th centuries, every land to the south of the Philippines (including Australia) was vaguely known as "Terra Australis Incognita", and in Spain as "Tierra Austral", "Islas Australes", etc. (of course, "Nueva Guinea" was also included):
  [(Tierra* || Isla*) Austral*] || (Austral* Incognita∼3) || [Nueva Guinea]
  [Luis Vaez∼ Torres] && [Pedro Fernandez Quiros∼2]   (early navigators of Austral seas)
Finally, one more query to show the robustness of the indexing approach:
  capitan
[Map: Vaez de Torres' and Cook's routes and dates]
Text Image KWS Statistical Framework: 2-D Posteriorgram
Main concept: posterior word probability at pixel level, or "2-D posteriorgram":

  P(v | X, i, j),  1 ≤ i ≤ I,  1 ≤ j ≤ J,  v ∈ V

where X is an I × J text image, V is a vocabulary, and (i, j) is a pixel of X.
P(v | X, i, j) denotes the probability that the word v is written in a region of X which includes the pixel (i, j). It can be computed directly by marginalization:

  P(v | X, i, j) = ∑_B P(v, B | X, i, j) ≈ (1 / K(i, j)) ∑_{B ∈ B(i,j)} P(v | X, B)

where B(i, j) is the set of all the K(i, j) reasonably shaped and sized (and assumedly equiprobable) regions or boxes of X which include the pixel (i, j).
[Figure: a few possible boxes B ∈ B(i, j). For v = "matter", the thick-line box will provide the highest value of P(v | X, B), while most of the other boxes will contribute only (very) low values to the sum.]
What exactly is P(v | X, B)?
Computing the 2-D Posteriorgram by Word Classification
The 2-D posteriorgram:  P(v | X, i, j) ≈ (1 / K(i, j)) ∑_{B ∈ B(i,j)} P(v | X, B)

P(v | X, B) is the posterior probability (implicitly or explicitly) used by any isolated word-image classifier; i.e., any system capable of solving the following classification problem for a presegmented, word-shaped subimage of X bounded by B, X_B:

  v̂ = argmax_{v ∈ V} P(v | X_B)

For instance, for a simple k-nearest-neighbour classifier, if k_v is the number of v-labelled prototypes among the k nearest to X_B:

  P(v | X, B) = k_v / k

However, the better the classifier, the better the estimated posteriorgram!
Notice: directly obtaining a full 2-D posteriorgram in this way entails a formidable amount of computation, but P(v | X, i, j) can be computed very efficiently by clever combinations of subsampling of (i, j) and choices of B(i, j) [see later].
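As a toy illustration of the two formulas above, the following sketch estimates P(v | X, B) with a k-NN word classifier and aggregates box posteriors into a pixel-level posterior. The data structures (label lists, box identifiers) are hypothetical stand-ins, not the actual PRHLT implementation:

```python
from collections import Counter

def knn_word_posterior(neighbor_labels):
    """P(v | X, B) for a k-NN word-image classifier: the fraction k_v / k
    of the k nearest prototypes to the subimage X_B labelled with word v."""
    k = len(neighbor_labels)
    return {v: kv / k for v, kv in Counter(neighbor_labels).items()}

def pixel_posterior(boxes_at_pixel, box_posterior):
    """P(v | X, i, j) ~= (1 / K(i,j)) * sum_{B in B(i,j)} P(v | X, B),
    assuming the K(i, j) boxes covering pixel (i, j) are equiprobable."""
    K = len(boxes_at_pixel)
    acc = Counter()
    for B in boxes_at_pixel:
        for v, p in box_posterior(B).items():
            acc[v] += p / K
    return dict(acc)
```

For instance, with neighbour labels ["matter", "matter", "matters"], knn_word_posterior yields P(matter) = 2/3 and P(matters) = 1/3; averaging box posteriors over two boxes mirrors the 1/K(i, j) normalization of the marginalization formula.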
Pixel-level Posteriorgram
[Figure: a text image X and the pixel-level posterior probabilities P(v | X, i, j) for the word v = "matter", computed using an accurate, contextual (n-gram based) word classifier. This helped to achieve very good posteriors: low in a region of X around (i=100, j=60), where a very similar (but different) word, "matters", is written; high for the other three correct words.]
Pixel-level Posteriorgram: Probabilistic Word Indexing
[Figure: the text image X and its pixel-level posteriorgram P(v | X, i, j), as above.]
Directly computing and using a full pixel-level posteriorgram would entail a formidable computational load and would require prohibitive amounts of indexing storage.
But, for each word, image-region relevance probabilities and locations are easily derived from the posteriorgram, and used to probabilistically index the word in an efficient way.
Probabilistic Index: Example

# pageID="Bentham-071-021-002-part"
# keyword      relPrb    bounding box (x y w h)
2             0.929     1   36  20  31
21            0.064     1   36  24  31
IT            0.982    33   36  27  31
IF            0.012    33   36  26  31
MATTERS       0.998    76   35 104  31
MATTER        0.011    77   36  93  31
NOT           0.999   216   36   7  31
WHETHER       1.000   256   36  99  31
THE           0.997   389   36  33  31
MIS-SUPPOSAL  1.000   455   36 193  31
THE           0.927   430   88  30  31
LHE           0.056   434   88  25  31
...           ...     ...
REGARDS       0.857     5  115  84  31
UGARDS        0.138     5  115  80  31
THE           0.993   110  115  43  31
MATTER        0.998   160  115  93  31
OF            0.996   271  115  23  31
FACT          0.999   306  115  49  31
OR            0.973   377  115  37  31
ON            0.021   377  115  42  31
MATTER        0.990   425  116 100  31
OF            0.995   542  115  25  31
LAM           0.407   575  115  30  31
BIMR          0.175   575  115  55  31
...           ...     ...
LAW           0.032   575  115  36  31
TAUE          0.031   575  115  55  31
...           ...     ...
LANE          0.012   575  115  59  31
THE           0.990     1  198  28  31
MATTER        0.934    61  198  64  31
OF            0.988   141  198  28  31
FAST          0.367   182  198  62  31
FAR           0.186   182  198  36  31
...           ...     ...
FACT          0.017   182  198  46  31
AS            0.142   200  198  29  31
HAE           0.022   200  198  29  31
WHERE         0.992   255  198  90  31
YOU           0.761   365  198  45  31
YOW           0.030   365  198  45  31
GOUS          0.064   372  198  47  31
SUPPOSE       0.975   429  198 120  31
SUPFROSE      0.024   429  198 125  31
SOME          0.834   570  198  78  31
SONER         0.016   576  198  83  31
OME           0.109   580  198  65  31
ME            0.022   620  198  22  31

[Figure: spots for MATTER and MATTERS marked in colors according to their relevance probabilities.]
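Read as data, such an index is just a list of (keyword, relevance probability, bounding box) entries per page. A minimal query sketch over a hand-copied fragment of the listing above (a hypothetical in-memory representation, not the actual index format):

```python
# A few MATTER/MATTERS entries copied from the listing: (keyword, relPrb, x, y, w, h)
index = [
    ("MATTERS", 0.998,  76,  35, 104, 31),
    ("MATTER",  0.011,  77,  36,  93, 31),
    ("MATTER",  0.998, 160, 115,  93, 31),
    ("MATTER",  0.990, 425, 116, 100, 31),
    ("MATTER",  0.934,  61, 198,  64, 31),
]

def spots(index, keyword, min_prob=0.5):
    """Return (relevance probability, bounding box) of the likely spots of a
    keyword; min_prob sets a user-controlled precision-recall tradeoff."""
    return [(p, (x, y, w, h)) for v, p, x, y, w, h in index
            if v == keyword and p >= min_prob]
```

Lowering min_prob raises recall at the cost of precision: the MATTER alternative scored 0.011 (where MATTERS is actually written) only appears below that threshold.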
Real Probabilistic Index of a Random Page from Bentham Papers
for the 25 most common English words: the, be, to, of, and, a, in, that, have, ...
[Figure: page image with marked spots; colors indicate relevance probabilities: low = red, high = green.]
Image Region KWS: Proper Information Retrieval Probabilistic Formulation
Posteriorgrams P(v | X, i, j) can be used directly for KWS. But for indexing, we need the probability that a word v is written within a pre-specified image region.
Let X be a given image region and R ∈ {yes, no} a binary random variable.
The relevance probability, P(R | X, v), is defined as the probability that X is relevant for v; i.e., that v is written somewhere in X. A good approximation is:

  P(R | X, v) ≈ max_{i,j} P(v | X, i, j)

This is a formal result which is also intuitively meaningful (as seen on p. 10) [this is the short story; formal details omitted here].
If w is the (unknown) transcript of the image region X, it can be seen that:

  P(R | v, X) = ∑_{w : v ∈ w} P(w | X)   and thus:   ∑_v P(R | X, v) = m   (see p. 27)

where m (≥ 1) is the expected number of (different) words written in X.
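A sketch of these two relationships, with the posteriorgram represented as a hypothetical mapping from pixels to word posteriors and an indexed region as a mapping from pseudo-words to relevance probabilities:

```python
def relevance_probability(posteriorgram, v):
    """P(R | X, v) ~= max over pixels (i, j) of P(v | X, i, j)."""
    return max((probs.get(v, 0.0) for probs in posteriorgram.values()),
               default=0.0)

def expected_distinct_words(region_entries):
    """sum_v P(R | X, v) = m, the expected number of distinct words
    written in the region X."""
    return sum(region_entries.values())
```

The max picks up the best-supported location of v anywhere in X, so a word written once with high posterior dominates many low-probability alternatives elsewhere in the region.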
Lexicon-Free Probabilistic Indexing
Two basic ideas:
• Use character-level models both for optical and language modeling (high-order character N-grams).
• Index any character sequence which, according to the models, is sufficiently likely to constitute an actual word.
Thereby, probabilistic index entries are called "pseudo-words" (rather than "words").
The development of these ideas, departing from the popular "filler model" for KWS, can be tracked through the following publications (especially Puigcerver's PhD thesis):
• A. Fischer et al., "Lexicon-free handwritten word spotting using character HMMs", Pattern Recognition Letters, 2012
• V. Frinken et al., "A novel word spotting method based on recurrent neural networks", IEEE TPAMI, 2012
• A.H. Toselli et al., "Fast HMM-Filler approach for Key Word Spotting in Handwritten Documents", ICDAR'13
• A. Fischer et al., "Improving HMM-Based Keyword Spotting with Character Language Models", ICDAR'13
• J. Puigcerver et al., "Word-Graph and Character-Lattice Combination for KWS in Handwritten Documents", ICFHR'14
• A.H. Toselli et al., "Context-aware lattice based Filler approach for key word spotting in handwritten documents", ICDAR'15
• J. Puigcerver et al., "Probabilistic interpretation and improvements to HMM-Filler for handwritten keyword spotting", ICDAR'15
• A.H. Toselli et al., "Two methods to improve confidence scores for lexicon-free word spotting in handwritten text", ICFHR'16
• T. Bluche et al., "Preparatory KWS Experiments for Large-Scale Indexing of a Vast Medieval Manuscript Collection ...", ICDAR'17
• J. Puigcerver, "A probabilistic formulation of keyword spotting", Ph.D. dissertation, Universitat Politecnica de Valencia, 2018
• A.H. Toselli et al., "Making Two Vast Historical Manuscript Collections Searchable and Extracting Meaningful Textual Features Through Large-Scale Probabilistic Indexing", ICDAR'19
Probabilistic Indices are NOT Transcripts

| OPTIMAL TRANSCRIPTION | PROBABILISTIC INDEXING |
| Generally comes after layout analysis | Is generally layout-agnostic |
| Strictly needs carefully detected lines | Line detection helps, but only if accurate |
| The output is a best, unique text interpretation of the image (maybe accompanied by word bounding boxes) according to the models used | For the given models, the output is a rich probability distribution of words and their sizes and positions in the images |
| The output is expected to be provided in correct reading order | In general, probabilistic indexing is reading-order agnostic |
| Provides plain-text output which, if accurate, could be used directly in many applications | In its basic form, does not provide any text output; only bounding-box-marked images |
| Usually yields only fixed and comparatively low precision-recall performance for the given trained models | Allows flexible, user-controlled precision-recall tradeoffs; overall search performance is generally much better for the same trained models |
Implementing Probabilistic Text Image Indexing and Search
[Diagram: text images → KWS & indexing → page image indices → ingestion → spots database → keyword search and user interface]
• "KWS & Indexing": off-line pre-computation of the probabilistic indices.
• "Ingestion": off-line creation of the database of spotting results. Typically a simple and computationally cheap process.
• "Keyword search": on-line analysis of user queries, finding the requested information and presenting the retrieved images. Short response times are needed.
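A toy sketch of the ingestion step, assuming PI entries arrive as (page id, pseudo-word, relevance probability, bounding box) tuples. A real system would use a database engine, but the shape of the resulting inverted file is the same:

```python
from collections import defaultdict

def ingest(pi_entries):
    """Build a simple inverted file: pseudo-word -> posting list of spots.
    Each PI entry is (page_id, word, rel_prob, bbox)."""
    inverted = defaultdict(list)
    for page_id, word, prob, bbox in pi_entries:
        # case-fold so queries need not match the index case exactly
        inverted[word.lower()].append((prob, page_id, bbox))
    for word in inverted:
        # best spots first, so threshold queries can stop early on-line
        inverted[word].sort(reverse=True)
    return inverted
```

Sorting each posting list by decreasing relevance probability is one way to keep on-line response times short: a threshold query scans each list only down to the first entry below the threshold.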
Probabilistic Text Image Indexing: Index Building through KWS
[Diagram: transcribed images feed an optical + language model training system; text images go through a contextual word recognizer (HTR), producing char/word lattices, which the KWS + indexing tool turns into page-level probabilistic indices.]
• Indexing is typically based on Key Word Spotting (KWS) technologies.
• The most effective KWS methods use contextual word recognizers, which require models trained from transcribed images.
• Both the contextual recognizer and the training system are often separate pieces of software, not included in the indexing tool proper.
• The contextual recognizer produces intermediate rich data structures, such as character and/or word lattices, used by the KWS and indexing process.
• In general, KWS and indexing can be computationally (very) demanding.
Laboratory Results on Modern Manuscript Collections (18th-19th c.)
[Recall-precision curves (precision π vs. recall ρ) and Average Precision (AP):
AP=0.91 CRNN Bentham; AP=0.91 HMMs Bentham; AP=0.92 CRNN Plantas; AP=0.91 HMMs Plantas; AP=0.89 CRNN IAMDB; AP=0.91 CRNN GW]

Dataset training and test details:
• BENTHAM: many hands. Training: 400 pages (86 K running words); 86 char OMs, 2-gram word LM trained on Bentham texts; lexicon 9 K words. Test: 33 pages; query set: 8 658 keywords.
• PLANTAS (VOL-I): single hand. Training: 264 pages (79 K running words); 76 char OMs, 2-gram word LM trained with the training set + book glossary transcripts; lexicon 20 K words. Test: 607 pages; query set: 10 932 keywords.
• IAMDB: many hands. Training: 7 K lines (62 K running words); 81 char OMs, 2-gram word LM trained on LBW (English) text; lexicon 21 K words. Test: 929 lines; query set: 3 421 keywords.
• GW: single hand. Training: 492 lines (4 K running words); 83 char OMs, 2-gram word LM trained on training transcripts; lexicon 1.5 K words. Test: 164 lines; query set: 899 keywords.
Laboratory Results on Earlier Manuscript Collections (14th-18th c.)
[Recall-precision curves (precision π vs. recall ρ), Average Precision (AP) and Mean Average Precision (mAP):
TSO AP=0.80, mAP=0.83; Chancery AP=0.75, mAP=0.68; Passau AP=0.69, mAP=0.67]

Dataset training and test details:
• TSO: Spanish, many abbreviations, one hand. Training: 286 page images; 92 char CRNN OMs + 8-gram char LM trained on training transcripts; lexicon: 6 289 tokens. Test: cross-validation; query set: 5 409 keywords.
• CHANCERY: medieval French/Latin, heavily abbreviated, many hands. Training: 341 acts (∼100 pages); 105 char CRNN OMs + 5-gram char LM trained on training transcripts; lexicon: ∼20 000 tokens. Test: 95 acts; query set: 6 506 keywords.
• PASSAU: German/Latin, tables, many hands. Training: 200 pages; 102 char CRNN OMs + 6-gram char LM trained on training transcripts; lexicon 12 381 tokens. Test: 91 images; query set: 6 500 keywords.
Beyond Simple Keyword Spotting
• Word spelling flexibility:
  - Wild cards representing arbitrary character sequences
  - Approximate spelling (allowing character insertions, deletions and/or substitutions)
  - Hyphenated words (based on accurate prediction of word prefixes & suffixes and on the geometry of bounding boxes)
  Probabilistic indices make it simple and natural to handle these kinds of word spelling variations.
• Text data analytics on handwritten text images:
  - Astronomical amounts of historical handwritten documents exist, but only an infinitesimal fraction of the information they convey is known so far
  - Given the sheer scale of the available data, big-data analysis is perhaps the only way to obtain relevant information from these documents
  - Currently available text data analysis tools require plain-text input.
  Probabilistic indices allow many interesting forms of text data analytics from untranscribed images of handwritten text.
Wild Cards, Approximate Spelling, Abbreviations and Hyphenation
Since the (pseudo-)words of probabilistic index entries are just plain-text character sequences, flexible spelling can be allowed in query words, as in many plain-text search systems:
• Wild cards: e.g., Mari*, *anna, ad*able
• Approximate or "fuzzy" spelling: e.g., Elizabeth∼, neighbor∼2
  - particularly useful in historical documents, where the exact spelling is often unknown
  - also useful to increase the recall for correctly spelled words which are not perfectly indexed due to lexicon-free spotting
  (both techniques are implemented in the search interface)
• Abbreviations: can be handled with flexible spelling to some extent, but much better results are achieved by training the optical and language models with the expanded forms of the abbreviated words.
• Hyphenated words: similarly, the best results are achieved by training the models with tagged word fragments and indexing the full forms of the hyphenated words with the help of geometric reasoning.
Lab results: to be published soon. Demonstrator: http://prhlt-kws.prhlt.upv.es/fcr-hyp/
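Because index entries are plain character strings, both mechanisms reduce to string matching over the indexed vocabulary. A sketch using standard-library wildcards and a textbook edit distance; the '*' and '∼d' query syntax of the search interface is only mimicked here, not reproduced:

```python
import fnmatch

def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def expand_query(query, vocabulary, max_dist=0):
    """Map a flexible query onto the plain-text (pseudo-)words of the index:
    '*' wild cards via fnmatch; otherwise fuzzy matching within max_dist."""
    if "*" in query:
        return [v for v in vocabulary if fnmatch.fnmatch(v, query)]
    return [v for v in vocabulary if edit_distance(query, v) <= max_dist]
```

The expanded word set can then be issued as an OR query over the index, which is one reason fuzzy matching raises recall for imperfectly indexed lexicon-free pseudo-words.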
Text Data Analytics from Probabilistically Indexed Images
Basic text features usually computed for plain-text documents can be easily estimated from probabilistically indexed (but otherwise untranscribed) text images. Here we focus on:
• Number of running words
• Zipf curves
• Size of the vocabulary
Recall: a Probabilistic Index (PI) provides, for each image region X and each character string (or "pseudo-word") v, the probability that v is written in X, P(R | X, v).
Estimating Number of Running Words and Word Frequencies
Let w be the (unknown) transcript of X and define g(w, v) = 1 iff v ∈ w (0 otherwise). The number of (different) words in w is n(w) = ∑_v g(w, v), and its expected value is:

  E[n(w)] = ∑_w ∑_v g(w, v) P(w | X) = ∑_v ∑_{w : v ∈ w} P(w | X) = ∑_v P(R | X, v)

If the image regions X are sufficiently small (e.g., short lines), this is a generally good, lower-bound approximation to the number of running words in X.
Similarly, for an image collection X and its (unknown) transcript W:

  E[n(W)] ≈ ∑_{X ∈ X} ∑_v P(R | X, v)

The number of running words of an image collection can thus be estimated as the sum of the (pseudo-)word relevance probabilities of all the index entries.
Finally, the frequency of use n(v) of a specific word v in X is estimated as:

  E[n(v)] = ∑_{X ∈ X} P(R | X, v)
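The two estimators above are plain sums over index entries. A sketch, with the collection index represented as a hypothetical list of per-region dictionaries mapping pseudo-words to relevance probabilities:

```python
def expected_running_words(collection_index):
    """E[n(W)] ~= sum over regions X and index entries v of P(R | X, v)."""
    return sum(p for region in collection_index for p in region.values())

def expected_frequency(collection_index, v):
    """E[n(v)] = sum over regions X of P(R | X, v)."""
    return sum(region.get(v, 0.0) for region in collection_index)
```

Note that both estimates are sums of probabilities, so they need no thresholding: low-probability alternative spellings contribute proportionally little.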
Zipf Curve and Lexicon Size of a Text
The Zipf curve of a given text (transcript) depicts the frequency of each word as a function of its rank in the decreasingly sorted list of word frequencies. The vocabulary size is just the highest rank of a least frequent word (freq = 1).
[Log-log plot: frequency vs. rank for the Bentham test-set GT transcripts.]

Estimating Zipf Curves and Lexicon Sizes from PIs
The same curve can be estimated from a probabilistic index of the (otherwise untranscribed) images, using the expected word frequencies. The vocabulary size can now be estimated as the rank for which freq = ε (e.g., ε = 0.5).
[Log-log plots: frequency vs. rank, from GT transcripts and from probabilistic indices, for the Bentham, Passau and TSO test sets, with the freq = 0.5 level marked.]
Estimating Running Words and Lexicon Size: Laboratory Results

| Dataset | Running words: real | estim. | error | Lexicon size: real | estim.† | error |
| BENTHAM | 89 870 | 83 235 | -7.4% | 6 988 | 7 431 | +6.3% |
| TSO     |  9 996 |  9 926 | -0.7% | 2 544 | 2 496 | -1.9% |
| PASSAU  | 26 709 | 26 155 | -2.1% | 5 801 | 5 598 | -3.5% |

† Obtained from estimated Zipf curves with threshold ε = 0.5.
Work in progress: full details and results to be published soon.
Preliminary results in: [Toselli et al.: "Making Two Vast Historical Manuscript Collections Searchable and Extracting Meaningful Textual Features Through Large-Scale Probabilistic Indexing", ICDAR-2019]
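The estimates marked † can be sketched as follows, where est_freqs is a hypothetical mapping from pseudo-words to their expected frequencies E[n(v)]:

```python
def zipf_curve(est_freqs):
    """Estimated frequencies sorted in decreasing order: the value at
    index r - 1 is the frequency of the rank-r word."""
    return sorted(est_freqs.values(), reverse=True)

def estimated_lexicon_size(est_freqs, eps=0.5):
    """Rank of the last word whose estimated frequency reaches the
    threshold eps (eps = 0.5 in the table above)."""
    return sum(1 for f in est_freqs.values() if f >= eps)
```

With real counts the cut-off would be freq = 1; with expected counts the softer threshold ε absorbs the probability mass spread over low-confidence pseudo-words.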
Multiple-word Queries: Boolean Combinations and Sequences
Multi-word and word-sequence queries are useful for searching for information in a more "semantic" way, as in plain-text information retrieval applications like Google.
Since the single-word relevance probabilities of a probabilistic index are properly normalized and consistent, they can be combined to (approximately) support various types of multi-word queries:
• Boolean: AND (∧), OR (∨), NOT (¬). If P(v1), P(v2) are the relevance probabilities of two indexed words v1, v2, then:

  P(v1 ∧ v2) = P(v1) P(v2 | v1) = P(v2) P(v1 | v2) ≈ min(P(v1), P(v2))
  P(v1 ∨ v2) = P(v1) + P(v2) - P(v1 ∧ v2) ≈ max(P(v1), P(v2))
  P(¬v1) = 1 - P(v1)
  [Toselli et al., Pattern Anal. Appl. 22(1), pp. 23-32, 2019]

For example, to search for image regions containing both the words "cat" (v1) and "dog" (v2), but none of the words "mouse" (v3) or "rabbit" (v4), the combined relevance probability is computed as:

  P(v1 ∧ v2 ∧ ¬(v3 ∨ v4)) ≈ min(P(v1), P(v2), 1 - max(P(v3), P(v4)))

• Word sequence: issue an AND query and examine the bounding-box coordinates of the retrieved spots. Only those which are roughly in reading order are finally retrieved.
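The approximations above compose mechanically. A direct sketch of the combination rules, including the cat/dog example:

```python
def p_and(*probs):
    """P(v1 ∧ v2 ∧ ...) ≈ min of the single-word relevance probabilities."""
    return min(probs)

def p_or(*probs):
    """P(v1 ∨ v2 ∨ ...) ≈ max of the single-word relevance probabilities."""
    return max(probs)

def p_not(p):
    """P(¬v) = 1 - P(v)."""
    return 1.0 - p

def cat_dog_query(p_cat, p_dog, p_mouse, p_rabbit):
    """The example query: cat AND dog AND NOT (mouse OR rabbit)."""
    return p_and(p_cat, p_dog, p_not(p_or(p_mouse, p_rabbit)))
```

Because every operator returns a probability, arbitrary Boolean query trees can be evaluated bottom-up over the index entries of a region.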
Document Image Classification Based on Textual Content
Two approaches:
• Classification by means of user provided (maybe complex) queries:
Users do often know how to define the textual content of a class of documentsin terms of boolean combinations of words and word sequences.
Let q1, . . . , qC be queries defining the contents of classes 1, . . . , C, and letP (R | X, qc) the relevance probability of document (image) X to the query ofclass c. Then X is classified into the “most relevant” class:
c = argmax1≤c≤C
P (R|X, qc)
• Classification based on successful Machine Learning plain-text classifiers,which can be adapted to use probabilistic indices of text images.
Particularly interesting are “bag-of-word” models which do not explicitly relyon word order, such as Multinomial or Bernoullli mixture models (details: .66)
⇒ Work in progress; results to be reported soon.
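The query-based classification rule can be sketched as below. The class queries, the AND-only scoring, and the index layout are invented for the example; real class queries may be arbitrary Boolean combinations:

```python
# Illustrative sketch of query-based document classification: each class
# is defined by a query, each query is scored with the min approximation
# for AND, and the image goes to the class with the most relevant query.

def score_and(index, words):
    """AND-combined relevance: P(w1 ∧ ... ∧ wn) ≈ min over the words."""
    return min(index.get(w, 0.0) for w in words)

def classify(index, class_queries):
    """Return the class whose (AND) query is most relevant to the image."""
    return max(class_queries, key=lambda c: score_and(index, class_queries[c]))

class_queries = {
    "baptism":  ["tauf", "tag"],
    "marriage": ["braut", "braeutigam"],
}
page_index = {"tauf": 0.7, "tag": 0.9, "braut": 0.2}
print(classify(page_index, class_queries))  # baptism
```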
Index
1 Demonstrations . 3
2 Text Image KWS and Probabilistic Indexing . 7
3 Implementation Issues . 15
4 Laboratory Results on Several Manuscript Collections . 18
5 Beyond Basic Keyword Spotting . 22
- Wild Cards, Approximate Spelling, Abbreviations and Hyphenation . 24
- Estimating Textual Features of Untranscribed Manuscript Collections . 26
- Boolean & Sequence Queries and Content-based Image Classification . 30
- Data-Base-like Information Retrieval from Handwritten Tables . 34
- Search for Melodic Patterns in Handwritten Music Notation . 41
6 Conclusions . 45
7 END (and further details) . 47
Handwritten Table Images

• Handwritten tables perhaps account for more than half of the vast amounts of documents preserved in archives.

• Tables contain important, often ready-to-use information for many historical studies, such as ethnography, demography, economics, genealogy, etc.

• Automatic transcription of handwritten tables is very difficult:
– ad-hoc, variable, inconsistent and even erratic layouts,
– difficult line detection and hopeless reading-order ambiguities,
– short lines lack linguistic context to help accurate word recognition, . . .

Handwritten table images (from the Passau dataset)
Data-Base-like Information Retrieval from Table Images
Probabilistic Indices hold geometric information of word bounding boxes (BB) and support Boolean multi-word queries.

This, along with BB-based geometric reasoning, can be used to support layout-agnostic, structured queries for information retrieval from table images.

Consider queries of the form 〈column-heading, column-content〉, where column-heading is a Boolean combination of table heading words and column-content is a (single) keyword.

Examples (from the PASSAU collection):
〈 ORT, PASSAU 〉  (〈 PLACE, PASSAU 〉)
〈 GEBURTS ORT, PASSAU 〉  (〈 BIRTH PLACE, PASSAU 〉)
〈 TAUF TAG, APRIL 〉  (〈 CHRISTENING DAY, APRIL 〉)
〈 KRANKHEIT ARZT, FRAISEN 〉  (〈 MEDICAL OFFICER, FRAISEN 〉)
〈 NAMEN DES BRAEUTIGAMS, JOSEF 〉  (〈 NAME OF THE GROOM, JOSEF 〉)
〈 NAMEN DER BRAUT, MARIA 〉  (〈 NAME OF THE BRIDE, MARIA 〉)
〈 TAG MONAT JAHR TODES, 1879 〉  (〈 DAY MONTH YEAR OF DEATH, 1879 〉)
Structured Multi-word Query Retrieval

To deal with queries of the form 〈column-heading, column-content〉, the retrieval process is carried out in four steps for each table image:

• retrieve column-heading words with AND-combined relevance probability higher than the given threshold τ,

• apply simple geometric reasoning: BBs of candidate spots are assigned high probability only if they are close enough to each other and loosely located in upper regions of the image,

• retrieve column-content words with relevance probability higher than τ,

• assign high probability only to the retrieved column-content BBs which fall within column-wise regions loosely delimited by the horizontal span of the spotted column-heading BBs, and are below these BBs.
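The four steps can be sketched as follows. Spots are (word, probability, (x, y, w, h)) tuples; the data layout, the 0.3-page-height “upper region” cutoff, and the min-combination of probabilities are illustrative assumptions, and step 2 is reduced to its upper-region part:

```python
# Hedged sketch of the four-step structured table retrieval above.

def heading_region(heading_spots, page_height):
    """Keep heading spots in the upper part of the page (part of step 2)
    and return the horizontal span and lower edge they delimit."""
    upper = [s for s in heading_spots if s[2][1] < 0.3 * page_height]
    if not upper:
        return None
    xs = [b[0] for _, _, b in upper] + [b[0] + b[2] for _, _, b in upper]
    return min(xs), max(xs), max(b[1] + b[3] for _, _, b in upper)

def retrieve(heading_spots, content_spots, page_height, tau=0.5):
    """Steps 1-4: threshold headings, locate the column, filter content."""
    heads = [s for s in heading_spots if s[1] > tau]            # step 1
    region = heading_region(heads, page_height)                 # step 2
    if region is None:
        return []
    x0, x1, y_bottom = region
    hits = []
    for word, p, (x, y, w, h) in content_spots:
        if p > tau and x0 <= x and x + w <= x1 and y > y_bottom:  # steps 3-4
            # combined relevance: min of heading and content probabilities
            hits.append((word, min(p, min(hp for _, hp, _ in heads))))
    return hits

heads = [("BRAUT", 0.9, (200, 40, 60, 15))]
cells = [("MARIA", 0.8, (205, 120, 50, 15)), ("MARIA", 0.6, (20, 150, 50, 15))]
print(retrieve(heads, cells, page_height=800))  # [('MARIA', 0.8)]
```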
Information Retrieval from Table Images: Example
Query: 〈 NAMEN DER BRAUT , MARIA 〉
Information Retrieval from Table Images: Example
Query: 〈 NAMEN DER BRAUT , MARIA 〉 – spotting column-heading words
h1 = NAMEN, h2 = DER, h3 = BRAUT
Information Retrieval from Table Images: Example
Query: 〈 NAMEN DER BRAUT , MARIA 〉 – column-heading geometric reasoning
Information Retrieval from Table Images: Example
Query: 〈 NAMEN DER BRAUT , MARIA 〉 – probabilities of column-heading words
h1 = NAMEN, h2 = DER, h3 = BRAUT

P(h1) = P(h11 ∨ h12) ≈ max(P(h11), P(h12))
P(h2) = P(h21 ∨ h22 ∨ h23) ≈ max(P(h21), P(h22), P(h23))
P(h3) = P(h31)
Information Retrieval from Table Images: Example
Query: 〈 NAMEN DER BRAUT , MARIA 〉 – column-heading probability
h1 = NAMEN, h2 = DER, h3 = BRAUT

P(h) = P(h1 ∧ h2 ∧ h3) ≈ min(P(h1), P(h2), P(h3))
Information Retrieval from Table Images: Example
Query: 〈 NAMEN DER BRAUT , MARIA 〉 – candidate region for column-content word
Information Retrieval from Table Images: Example
Query: 〈 NAMEN DER BRAUT , MARIA 〉 – spotting column-content words
v = MARIA

P(v) = P(v1 ∨ v2) ≈ max(P(v1), P(v2))
Information Retrieval from Table Images: Example
Query: 〈 NAMEN DER BRAUT , MARIA 〉 – retrieved data and total relevance probability
h = NAMEN DER BRAUT, v = MARIA

P(〈h, v〉) = P(h ∧ v) ≈ min(P(h), P(v))
Table Images Information Retrieval: Laboratory Results
– Recall-Precision curves
– Average Precision (AP), mean AP (mAP)

[Recall–Precision plot: tabQueries AP=0.90, mAP=0.92; singleKWs AP=0.75, mAP=0.69]
Dataset training and test details
• PASSAU: German/Latin, many hands. Training: 200 pages, 102-char CRNN OMs + 6-gram char LM trained on training transcripts; Lexicon: 12 381 tokens. Test: 91 page images; Query set: 6 500 keywords

• TAB PASSAU: Table queries in PASSAU. Training: same as PASSAU. Test: 44 table images; Query set: 363 real multiword structured queries.

• See: [Toselli et al., “Probabilistic Indexing and Search for Information Extraction on Handwritten German Parish Records”, ICFHR-2018]
Work in progress: more details and results to be published soon
Table Images Information Retrieval: Demonstrations
http://prhlt-carabela.prhlt.upv.es/passauTab/
http://prhlt-carabela.prhlt.upv.es/passauTab/views/help.html

Simple queries:

< name der braut , Maria >  (< name of the bride, Maria >)
< wohnort , Passau >  (< place of residence, Passau >)

Single column, complex header/content descriptions:

< tauf tag , [2*ten (April || May)] >  (< baptism day, 2* of (April or May) >)
< (eltern || parent*) , Sebas* >  (< relatives or parents, Sebas* >)

Two columns:

< tag trauung , Jaen* || Feb* > && < name* braut , Maria >  (< day wedding, Jan* or Feb* > and < name bride, Maria >)
< name braeutigams , Georg > && < eltern brau*, Martin || Magdalena >  (< name of the groom, Georg > and < parents of the bride, Martin or Magdalena >)
Automatic Processing of Historic Handwritten Music Manuscripts
• Millions of historic musical manuscripts are preserved in cathedrals, abbeys, archives, etc. Many are digitized, but their musical contents remain inaccessible

• In many cases, perfect transcripts are not really needed; instead, content-based search with some degree of reliability would be extremely useful

• Spotting just single music symbols is mostly useless (all the symbols involved generally appear in each page); instead, helpful search targets are “melodic patterns”, which typically correspond to music symbol sequences.

We explore approaches for accurate retrieval of melodic patterns, represented by music symbol sequences, from collections of early music manuscripts.
Handwritten Music Notation: The VORAU Cod. 253 Manuscript

• Manuscript from Vorau Abbey library, ca. 1450, provided by the Austrian Academy of Sciences

• Dataset details:

Data              Train-Val    Test
Pages                   422      44
Staves                1 000      97
Running symbols      13 066   1 086
Symbol set size          19      15

• Written in German gothic notation, without information about duration of notes

• Representation based on vertical positions of music symbols in stave lines (L) and spaces (S)

Example: C3 S3 L4 L3 S3 S3 S3 L3 S2 L2 L3 . . .
VORAU-253: Indexing and Search Laboratory Results
[Recall–Precision plot: Single Symbols AP=0.89, mAP=0.75; Sequences AP=0.86, mAP=0.92]

• (Imperfect) Transkribus stave segmentation

• CRNN (TensorFlow) optical models and symbol 2-gram LM.

• Query sets:
– Single symbols: all the 15 symbols seen in the test set.
– Symbol sequences: all the 615 sequences with lengths ranging from 3 to 15 which appear in the test set more than once.

• Average Precision (AP) & mean AP (mAP) evaluated at stave level for sequence queries and at relative symbol position level for single symbol queries.
See: [Calvo et al.: “Music Symbol Sequence Indexing in Medieval Plainchant Manuscripts”, ICDAR’19]
Work in progress: more details and results to be published soon
Music Indexing and Search: Demonstration
http://prhlt-carabela.prhlt.upv.es/music
Single symbols:
S3
F4
Symbol sequences:
[ S3 S3 S4 S3 L3 L3 L3 ]
[L3 L3 L3 L3 L3 L3 L3 L3]
[S2 L2 L2 S2 L3 S2 L3 S2 S1 L2 S1]
Sequences with alteration:
[L2 FLAT S2 L2]
[ FLAT S3 L3 S2 L3 ]
Conclusions
• A probabilistic framework has been introduced for indexing and searching in large collections of untranscribed handwritten documents

• Empirical results with a variety of historic collections, exhibiting different challenges and levels of complexity, assess the potential of this framework

• Abbreviations, hyphenation and other difficulties entailed by historical manuscripts are overcome

• On the basis of the proposed approach, several very large collections of historical manuscripts have been actually indexed and their textual contents made publicly accessible through efficient web search interfaces

• Probabilistic Indices allow Text Data Analytics and a variety of forms of “semantic” Information Retrieval and “big-data” analysis to be carried out on massive sets of untranscribed handwritten text images

• Current and future projects:
– CARABELA: Completing manuscript processing, up to 150 000 pages, and exploring content-based image classification into user-defined classes of documents
– Probabilistically index the 1 000 000 pages of the complete FCR collection of the National Archives of Finland (NAF)
– Plans for other very large European collections of historical manuscripts.
Thanks for your attention!
(additional details below)
Image Region KWS
• Posteriorgrams can be directly used for KWS: given a threshold τ ∈ [0, 1], a word v is spotted in all image positions where P(v | X, i, j) > τ. Varying τ, adequate precision–recall tradeoffs can be achieved

• But, for indexing purposes, we need the probability that a word v is written within a pre-specified image region, such as a page, a column, or a line

A popular (but wrong!) idea: for a text image region X, use the word posterior probability P(v | X).

But this is ill-defined, because Σ_v P(v | X) = 1

. . . but, for each of the (many) different words v actually written in X, we ideally want P(v | X) to be close to 1: the sum should ideally be ≫ 1!
What is an adequate posterior probability for image region KWS ?
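The first bullet above, position-level spotting by thresholding a posteriorgram, can be sketched as follows. The posteriorgram is a toy nested dict invented for the example; real systems derive P(v | X, i, j) from optical and language models:

```python
# Minimal sketch of pixel-position KWS from a posteriorgram: a word v is
# spotted at every position (i, j) where P(v | X, i, j) > tau.

def spot(posteriorgram, word, tau):
    """Return all (i, j) positions where the word's posterior exceeds tau."""
    return [(i, j)
            for (i, j), word_probs in posteriorgram.items()
            if word_probs.get(word, 0.0) > tau]

pg = {(10, 40): {"cat": 0.85, "cart": 0.10},
      (10, 90): {"dog": 0.70},
      (35, 40): {"cat": 0.30}}
print(spot(pg, "cat", 0.5))  # [(10, 40)]
```

Lowering tau trades precision for recall: `spot(pg, "cat", 0.2)` also returns the weaker second candidate.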
Choosing Adequate Minimal Image Regions: Line-level KWS
Line-shaped regions are good for indexing and search in practice; moreover, they allow for efficient computation by clever vertical subsampling and choice of B(i, j):

• Vertical subsampling: in general, it amounts to just guessing a proper line height and then running a vertical sliding window of this height with some overlap

• Choosing B(i, j): for a line-shaped region, the marginalization bounding boxes needed to compute posteriorgrams can be just defined by horizontal segmentation

Line-level posteriorgrams can be very efficiently computed using Word Graphs (WGs), obtained as a byproduct of Viterbi or “token-passing” decoding.

This has two important benefits in order to compute posteriorgrams by marginalization:

• Optical (HMM) Character Models and (N-gram) Language Models are used to provide very accurate, contextual word classification probabilities, P(v | X, B)

• WGs provide lots of alternative horizontal word-level segmentations, which directly define B(i, j)
Line-level KWS: State-of-the-art Modelling

• Optical modelling: deep convolutional-recurrent (CRNN) network:

• Textual context modelling: finite-state character n-grams:
Probabilistic Indexing & Search: Precision-Recall Tradeoff Model
Indexing and search quality can be assessed by means of precision (π) & recall (ρ) performance.

Precision is high if most of the retrieved results are correct, while recall is high if most of the existing correct results are retrieved.

If perfectly correct text were indexed, you’d get a single, “ideal” point with ρ = π = 1.
[Recall–Precision plot: Perfect (AP=1.0); Prob. Index (AP=0.8); Aut. Transcript (AP=0.6); the Prob. Index curve runs from high confidence threshold (low recall) to low confidence threshold (high recall)]

If automatic (typically noisy) handwritten text transcripts are naively indexed just as plain text, precision and recall are also fixed values, albeit not “ideal” (perhaps something like ρ = 0.75, π = 0.8, with Average Precision AP=0.6).

In contrast, probabilistic indexing allows for arbitrary precision-recall tradeoffs by setting a threshold on the system confidence (relevance probability)

This flexible “precision-recall tradeoff model” obviously allows for better search and retrieval performance than naive plain-text searching on automatic noisy transcripts.
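The tradeoff can be made concrete with a small computation: sweeping a confidence threshold over scored spots yields one (precision, recall) point per threshold. This is the standard textbook computation, not the talk's evaluation code; the spot list and relevance judgments are invented:

```python
# Hedged sketch of the precision-recall tradeoff under a varying
# confidence threshold tau.

def pr_point(spots, relevant, tau):
    """Precision and recall of the spots scored above threshold tau.
    spots: list of (item, score); relevant: set of truly relevant items."""
    retrieved = [item for item, score in spots if score > tau]
    if not retrieved:
        return 1.0, 0.0  # conventional point when nothing is retrieved
    tp = sum(1 for item in retrieved if item in relevant)
    return tp / len(retrieved), tp / len(relevant)

spots = [("a", 0.9), ("b", 0.8), ("c", 0.4), ("d", 0.2)]
relevant = {"a", "c"}
print(pr_point(spots, relevant, 0.5))  # high threshold: (0.5, 0.5)
print(pr_point(spots, relevant, 0.1))  # low threshold:  (0.5, 1.0)
```

Plain-text search on a fixed (noisy) transcript corresponds to a single such point; a probabilistic index exposes the whole curve.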
Laboratory Results on Several Manuscript Collections (17th-19th c.)
– Recall-Precision curves
– Average Precision (AP)
– Mean Average Precision (mAP)

[Recall–Precision plot: Bentham AP=0.88, mAP=0.94; Plantas AP=0.86, mAP=0.81; Austen AP=0.81, mAP=0.74; Austen-B AP=0.71, mAP=0.66]

Datasets training and test details

• BENTHAM: Multi-hand. Training: 400 pages from Bentham, 87 char. HMMs, 2-gram LM trained on Bentham texts; Lexicon: 9 341 tokens. Test: 33 pages; query set: 6 962 keywords

• PLANTAS (VOL-I): Single hand. Training: 224 pages from Plantas, 77 char. HMMs, 2-gram LM trained with the training set + book glossary transcripts; Lexicon: 11 561 tokens. Test: 647 pages; query set: 9 945 keywords

• AUSTEN: Single hand. Training: 50 Austen pages, 81 char. HMMs, 2-gram LM trained on Austen texts; Lexicon: 20K tokens. Test: 78 pages; query set: 2 281 keywords

• AUSTEN-B: Single hand. No training; using Bentham character HMMs, lexicon and LM. Test & query set: same as for AUSTEN
Laboratory Results on Difficult Medieval Collections (14th-16th c.)
– Recall-Precision curves
– Average Precision (AP)

[Recall–Precision plot: Bentham AP=0.88; Plantas AP=0.86; Austen AP=0.81; Austen-B AP=0.71; Alcaraz AP=0.71; WienStUlrich AP=0.77; Chancery AP=0.75]

Datasets training and test details

• ALCARAZ: Spanish, multi-hand. Training: 44 pages, 70 char. HMMs + 2-gram LM trained on training transcripts; Lexicon: 3 405 tokens. Test: cross-val.; Query set: 3 400 keywords

• WIENSANKTULRICH: German/Latin, one hand. Training: 52 pages, 74 char. HMMs, 2-gram LM from training transcripts; Lexicon: 2 303 tokens. Test: cross-val.; Query set: 2 256 keywords

• CHANCERY: Medieval French/Latin, heavily abbreviated, many hands. Training: 341 Acts (∼100 pages), 105-char CRNN OMs + 5-gram char LM trained on training transcripts; Lexicon: ∼20 000 tokens. Test: 95 Acts; Query set: 6 506 keywords.

HIMANIS JPICH PROJECT: The full Chancery collection (82 000 page images) was indexed. See: prhlt-kws.prhlt.upv.es/himanis
Chancery Laboratory Results: Impact of LM and Abbreviations
[Left plot, Recall–Precision for character N-gram LMs: 5grm AP=0.75, mAP=0.68; 3grm AP=0.69, mAP=0.61; 0grm AP=0.62, mAP=0.52; HTR AP=0.58, mAP=0.44]

[Right plot, Recall–Precision for abbreviated keywords: 5gram-Latin AP=0.86, mAP=0.73; 5gram-French AP=0.80, mAP=0.74; 5gram-All AP=0.75, mAP=0.68]

Left: Recall-Precision results for different character N-gram models (0grm, 3grm, 5grm). A single R-P point (HTR) is also shown for the 1-best recognition hypotheses with character 5-grams.

Right: Recall-Precision results for (only) abbreviated keywords using character 5-gram models: Latin-only (5g-la), French-only (5g-fr) and both Latin and French (5gr). A single R-P point (HTR) is also shown for the 1-best recognition hypotheses with character 5-grams.
Chancery: Examples of Abbreviated Word Spotting
Modernized (expanded) query keywords and corresponding spotting results

Keyword              Guillaume   chevalier   livres   quelconques
Avg. Precision (AP)       0.79        0.89     0.79          0.91

(The Full form, Abbreviated and False Positives rows of the original table show image snippets.)

For each keyword: selected examples of correctly spotted images, both in full form and abbreviated, and one example of a false positive.

The AP shown for each keyword is the true experimental value, computed taking into account all the spotting results on the test set.

Latin and French abbreviated-only results are better than those including all the query words!
[Recall–Precision plot: 5gram-Latin AP=0.86, mAP=0.73; 5gram-French AP=0.80, mAP=0.74; 5gram-All AP=0.75, mAP=0.68]
Probabilistic Text Image Indexing and Search: System Diagram
[System diagram: Text images → KWS & indexing tool → Page-level indices → Ingestion → Database → Keyword search]

• “KWS & indexing tool”: off-line pre-computation of probabilistic indices

• “Ingestion”: off-line creation of the actual database. Typically a simple and computationally cheap process

• “Keyword search”: on-line user query analysis, finding the requested information and presenting the retrieved images. Short response times needed.
Probabilistic Text Image Indexing: Index Building through KWS
[Diagram: Transcribed images → Optical + Language Models training system; Text images → Contextual word recognizer (HTR) → Char / word lattices → KWS + indexing → Page-level probabilistic indices]

• Indexing is typically based on Key Word Spotting (KWS) technologies

• Most effective KWS methods use contextual word recognizers which require models trained from transcribed images

• Both the contextual recognizer and the training system are often separate pieces of software, not included in the indexing tool proper

• The contextual recognizer produces intermediate rich data structures, such as character and/or word lattices, used by the KWS and indexing process
• In general, KWS and indexing can be computationally (very) demanding.
Probabilistic Text Image Indexing: Index Ingestion
[Diagram: Page-level indices → Ingestion (char folding, word grouping, index trimming, data structuring) → Database]

The set of individual image probabilistic indices is compiled into a data structure adequate for fast operation of the search engine.

The following processes are carried out here:

• Case & diacritics folding

• Word grouping – e.g., to index lemmas rather than regular words

• Organize the spots according to the chosen hierarchical structure

• Trim the index to the desired indexing density. Density can be expressed as a relevance probability threshold, or as a number specifying how many spots per page, per image region, or per running word should be indexed
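Both trimming criteria can be sketched in a few lines. The list-of-tuples page layout is a hypothetical simplification of the index, invented for the example:

```python
# Illustrative sketch of index trimming during ingestion: keep either the
# spots above a relevance-probability threshold, or the top-k spots per page.

def trim_by_threshold(page_spots, tau):
    """Keep spots whose relevance probability exceeds tau."""
    return [s for s in page_spots if s[1] > tau]

def trim_top_k(page_spots, k):
    """Keep only the k most relevant spots on the page."""
    return sorted(page_spots, key=lambda s: s[1], reverse=True)[:k]

page = [("cat", 0.9), ("dog", 0.3), ("mouse", 0.7), ("rabbit", 0.05)]
print(trim_by_threshold(page, 0.5))  # [('cat', 0.9), ('mouse', 0.7)]
print(trim_top_k(page, 3))           # [('cat', 0.9), ('mouse', 0.7), ('dog', 0.3)]
```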
Probabilistic Indexing: Search Engine and User Interface
[Diagram: Keyword query → Query analysis → Search engine ↔ Database; retrieved text images → Display (GUI)]

• GUI: graphical / textual specification of queries and desired precision-recall tradeoff

• Query analysis: trivial for single words, but significant for multi-word queries

• Search engine: accesses the database. Specialized software is typically needed for probabilistically consistent support of multi-word queries and hierarchical search.

• Display retrieved images: prepare the images to be presented to the users as a result of their queries. The way they are presented is highly application dependent.
Demonstrations
• CHANCERY collection (HIMANIS project). XIV–XV century “Tresor des Chartes” registers. 82 000 images of densely handwritten text in Latin and French.

• TSO collection (Teatro del Siglo de Oro, READ project). XV–XVII century manuscripts of Spanish comedies, with more than 100 000 images, written by many hands. Work in progress: 150 manuscripts with 21 000 images so far.

• Many other demonstrators for smaller collections from varied historical periods in several languages.
Chancery: Indexing & Search Demonstration
PRHLT HIMANIS Search Interface
http://himanis.huma-num.fr/himanis

A small sample of query examples:

liliorum
predicatorum

Elena
Elena || Helena || Helene

Isabel
Ysabel || Isabelle || Elisabet || Helisabet || Elisa || Elisia
(guerre || paix) && Alemaigne
[ duc de borbon ]
Teatro del Siglo de Oro Espanol: Indexing & Search Demonstration
PRHLT TSO Search Interface
http://prhlt-carabela.prhlt.upv.es/tso

A small sample of query examples:

marquesa
Almagro

teniente || alferez || sargento
sol && espanol

Isabel (belleza || hermosura || nobleza)
(valor || dolor) && (amor || honor)

[ Lope de Vega ]
[ Calderon de la Barca ]
Indexing & Search Demonstration for Other Collections
Many other (smaller) handwritten collections, from varied historical periods, in several languages: in the TRANSCRIPTORIUM web site.
Passau Miscellanea: Indexing & Search Demonstration
PRHLT READ Search Interface for Passau
http://transcriptorium.eu/demots/kws-Passau
A small sample of query possibilities:
Sabina
Margareta || Margareth || Margaretha || Margaretham || Margaritha
Adam && Eva
Passau && Anna
(Johann || Anna) 1798
[ filia legitima ]
[matrica consignatio copulatorum]
Mixture Models for Textual-Content Based Image Classification
Let b(w) = b1, . . . , bN be the “bag-of-words” bit vector of the text w. The K-component Bernoulli mixture likelihood of b(w) for class c is:

P_B(b(w) | c) ≡ P_B(w | c) = Σ_{k=1}^{K} Π_{n=1}^{N} p_ckn^{bn} (1 − p_ckn)^{(1−bn)}

The parameters of this model, p_ckn, 1 ≤ c ≤ C, 1 ≤ k ≤ K, 1 ≤ n ≤ N, can be learned from class-labelled documents through EM estimation.

For a text image X, w is unknown and we have to marginalize over all possible words in X – the required “word content” probabilities are provided by the image Probabilistic Index. After some developments and assumptions, the following approximation can be derived:

c ≈ argmax_c P(c) P_B(w | c)

where w is the set of N most relevant words according to P(R | X, v), v ∈ w, and N can be estimated as the expected value of n(w) (see p. 27)
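A minimal sketch of the classification rule, assuming a standard mixture parameterization with explicit component weights (the slide's compact formula absorbs them); all parameter values below are invented toy numbers, where real ones would be EM-estimated from class-labelled documents:

```python
# Hedged sketch of Bernoulli-mixture classification of a bag-of-words bit
# vector, computed in the log domain for numerical stability.
import math

def bernoulli_mixture_ll(bits, weights, protos):
    """log P_B(b | c) for one class: weighted sum over K components of the
    product over N dimensions of p^b (1-p)^(1-b)."""
    comp = []
    for w, p in zip(weights, protos):
        comp.append(math.log(w) + sum(
            math.log(pn if b else 1.0 - pn) for b, pn in zip(bits, p)))
    m = max(comp)  # log-sum-exp trick
    return m + math.log(sum(math.exp(c - m) for c in comp))

def classify(bits, priors, params):
    """argmax_c P(c) * P_B(b | c), evaluated in the log domain."""
    return max(params, key=lambda c: math.log(priors[c]) +
               bernoulli_mixture_ll(bits, *params[c]))

params = {  # per class: (mixture weights, Bernoulli prototypes)
    "letters":  ([1.0], [[0.9, 0.1, 0.2]]),
    "accounts": ([1.0], [[0.1, 0.8, 0.9]]),
}
priors = {"letters": 0.5, "accounts": 0.5}
print(classify([1, 0, 0], priors, params))  # letters
```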