HAL Id: hal-00939950
https://hal.archives-ouvertes.fr/hal-00939950
Submitted on 31 Jan 2014

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

To cite this version: Vladislavs Dovgalecs, Alexandre Burnett, Pierrick Tranouez, Stéphane Nicolas, Laurent Heutte. Spot it! Finding words and patterns in historical documents. ICDAR, Aug 2013, United States. pp. 1039-1043, 10.1109/ICDAR.2013.208. hal-00939950



Spot it! Finding words and patterns in historical documents

Vladislavs Dovgalecs*, Alexandre Burnett, Pierrick Tranouez, Stéphane Nicolas and Laurent Heutte
Université de Rouen, LITIS EA 4108, BP 12, 76801 SAINT-ÉTIENNE DU ROUVRAY, France
[email protected]

Abstract—We propose a system designed to spot either words or patterns, based on a user-made query. Employing a two-stage approach, it takes advantage of the descriptive power of the Bag of Visual Words (BOVW) representation and the discriminative power of the proposed Longest Weighted Profile (LWP) algorithm. First, we try to identify the zones of images that share common characteristics with the query, as summed up in a BOVW. Then, we filter these zones using the LWP, introducing spatial constraints extracted from the query. We have validated our system on the George Washington handwritten document database for word spotting, and on medieval manuscripts from the DocExplore project for pattern spotting.

I. INTRODUCTION

In the last decades, a considerable effort has been made to digitize large collections of historical handwritten and printed documents and manuscripts. While the problems of preservation of and access to these documents seem to be at least partially solved, there is a need for efficient and seamless search in such digitized documents. It is clear that manual search in more than a handful of scanned document pages quickly becomes unwieldy, and automated search solutions are required.

Modern text-based search systems have reached some maturity and are efficient at searching very large text corpora. Unfortunately, this requires manual transcription and annotation of scanned documents, which implies considerable human resources. Another option is modern optical character recognition (OCR) systems, which work relatively reliably on scanned printed documents. These systems fail on handwritten historical documents due to writing irregularities, touching lines, free page layout, varying styles, degradations such as stains and tears, ancient fonts, etc. In the last decades, techniques called word spotting have emerged, which are attractive for their ability to solve this kind of detection problem.

In this paper we address the problem of unsupervised word spotting and graphical pattern spotting in historical handwritten documents, where the goal is to find regions containing the requested word or a visually similar graphical pattern. The presented method is designed to retrieve both words and graphical patterns from a single query.

Indeed, word and pattern spotting using a single query is a very challenging problem for multiple reasons. Typically, the digitized manuscript is the only source of information and no annotation or transcription is available. This seriously limits state-of-the-art object detection methods [1], as their discriminative model training would require at least several positive class samples. It also remains unclear how to select negative class instances when training such two-class classifiers. Furthermore, when resorting to matching techniques such as sliding windows, writing visual variability results in a relatively large number of false positives. These false positives are often visually similar to the query but do not necessarily contain the same word.

Figure 1. The offline and online processing phases of the proposed method.

To answer these two challenges, we propose a segmentation-free method designed to spot both words and graphical patterns. The method consists of two phases: an offline feature extraction phase and an online query search phase. In the online phase, the algorithm proceeds with a two-stage query detection. The first stage performs rapid elimination of large zones with visually dissimilar content, while the second stage concentrates on more precise, feature-order-sensitive matching and scoring.

This paper is organized as follows: in Section II we review related work, in Section III the proposed method is presented, in Section IV experimental results are discussed, and finally in Section V we draw conclusions and sketch future research directions.

II. RELATED WORK

From the literature one can identify three large groups of works aiming to solve the problem of word spotting, ordered from most to least intensive required pre-processing: word segmentation, line segmentation and no prior segmentation.

Historically, the first works adopted word segmentation pre-processing and used simple pixel-wise comparison approaches for spotting in printed and handwritten documents [2], [3]. A more sophisticated method [4] uses cohesive elastic matching, which compares only informative parts of the template keyword, for an omnilingual word retrieval task. A probabilistic method modeling word shape and annotation was used for word recognition in historical documents [5] and implemented as a search engine in [6].

The second group of methods simplifies the problem by requiring only successful line segmentation, which is a simpler task. Hidden Markov Models (HMM) rapidly became very popular for handwritten word recognition by exploiting the natural sequence information encoded in segmented text lines. In [7] a comparison of HMM-based word spotting to DTW matching using the same SIFT-like features showed a large improvement in performance, favoring more sophisticated methods that model ordered sequences of features. Unfortunately, HMM training requires large amounts of training data and considerable training time. Addressing this issue, an attractive sub-word model, where only a limited set of individual letters is needed for training, was proposed in [8].

Line segmentation in historical handwritten documents can still be a difficult task due to touching text lines and unconstrained layout. The third, relatively recent group of methods requires no word or line segmentation and operates in a segmentation-free context. A typical approach consists in sliding a window over a document image and performing patch-by-patch comparisons. In [9] every patch is described using Histogram of Oriented Gradients (HOG) and color features, while the negative effect of many false positives is countered using a boosted two-class large margin classifier. The authors in [10] attempt to discover the underlying semantic structure of the BOVW features using unsupervised Latent Semantic Analysis. A rapid spotting method exploiting adapted low-level features and efficient compact coding techniques was proposed recently in [11]. The HOG features are computed for each basic cell of an image, allowing rapid descriptor creation for an arbitrarily sized query. In order to increase the classifier's generalization to unseen patterns, the query is learned using an Exemplar SVM. Finally, the Product Quantization (PQ) method is used to further reduce HOG-PCA features down to a single real value, enabling very compact storage of descriptors in memory. In a similar spirit, a recent method [12] exploits an inverted file of quantized SIFT descriptors for rapid retrieval of word images and proposes a scoring method that takes discriminant visual feature ordering information into account.

The main contribution of this paper is a segmentation-free word spotting method that rapidly eliminates non-matching regions in a first stage and concentrates on spatial-ordering-sensitive matching in the localized areas. Our method is composed of two stages: (1) initial candidate zone identification using orderless BOVW representations and (2) enforcement of spatial ordering constraints using the proposed Longest Weighted Profile (LWP) algorithm. The secondary contribution of this paper is the robust LWP matching algorithm, which is sensitive to visual feature location information. In this paper we focus on (1) robust matching techniques given a single query and (2) a system capable of working with both words and graphical patterns.

Figure 2. Illustration of some visual words belonging to two different clusters. Notice the visually similar patches in each cluster, with occasionally different content.

Reliable word spotting with a single query is a challenging task requiring careful feature selection and appropriate design of the decision stage. For instance, the performance of a two-class classifier may be severely limited, as the trained model will usually rely on a single positive example. We argue that a careful selection of visual features and a focus on matching techniques robust to false positives can provide meaningful detections.

Handwritten historical documents usually carry more than simple textual information. A manuscript can contain various graphical elements that could also be interesting to retrieve. To this end, our method is capable of working not only with handwritten words but also with graphical patterns. Despite lacking the classic training phase needed by state-of-the-art object detection methods, our framework operates in a completely unsupervised setup and provides visually consistent retrieval results on graphical patterns as well.

III. GENERAL PRESENTATION OF THE METHOD

The proposed system consists of offline and online phases, as depicted in Fig. 1. The main purpose of the offline phase is to extract dense local features, encode them as visual words using a codebook, and produce BOVW representation features. In the online phase, the BOVW representations and visual word information are used in a two-stage query search process. The procedure starts with candidate zone spotting using the BOVW representation, followed by another sliding window pass using the LWP algorithm for scoring.

A. Offline feature extraction

The offline processing phase begins with SIFT features densely sampled on a regular grid with a step of 5 pixels. At each grid location f_ij = (x_i, y_i), three SIFT descriptors are extracted, each covering a patch of size 10x10, 15x15 and 20x20 pixels respectively. To lower the computational burden and storage requirements, we remove descriptors whose gradient norm does not exceed a threshold of 0.01, which removes most of the background features while keeping zones with non-uniform visual information. It is important not to compute rotation-invariant SIFT descriptors, since we want to discriminate between, for example, horizontal and vertical writing strokes. Our evaluation showed that the best performance was obtained when the three SIFT descriptors for a fixed feature f_ij are concatenated into a 3 × 128 = 384 dimensional vector d_ij. Visualization in image space of such concatenated patch descriptors showed more coherent and visually similar patches, as in Fig. 2. Due to space limitations we do not present more results justifying this choice of features.

Figure 3. Longest Weighted Profile illustration. Four visual word clusters representing letters "B", "P", "D" and "A" are shown together with their mutual similarities (left). Illustrative example of comparing query (Q) and test (Y) sequences: green - perfect match, orange and red - partial match with variable degrees of similarity, blue - mismatch. Best seen in color.

The processing proceeds with codebook creation using the k-means clustering algorithm. Empirical experimental results and cluster compactness analysis revealed approximately 1000-1500 natural clusters in the concatenated SIFT descriptor space. In this work we fixed the number of clusters to k = 1500. Every descriptor d_ij is then compared to the codebook of labeled cluster centers and is assigned the label w_l ∈ {1, ..., k} of the closest cluster, also called a visual word.
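The codebook lookup just described amounts to nearest-centroid vector quantization. A minimal NumPy sketch (function name and array shapes are illustrative, not from the authors' code; labels here are 0-based):

```python
import numpy as np

def assign_visual_words(descriptors, centers):
    """Map each descriptor to the label of its nearest cluster center.

    descriptors: (n, d) array of concatenated SIFT descriptors
    centers:     (k, d) array of k-means cluster centers (the codebook)
    returns:     (n,) array of visual word labels in {0, ..., k-1}
    """
    # Squared Euclidean distance between every descriptor and every center.
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)
```

In the paper's setup, d = 384 and k = 1500; the broadcasting trick above is simple but memory-hungry, so a chunked or KD-tree lookup would be preferable at scale.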

For fast query comparison, we precompute fixed-size patch BOVW representations following the same setup as in [10]. The size of a patch is fixed to 300 × 75 pixels, sampled every 25 pixels in each document image of the database. Weak spatial information is encoded by additionally splitting every sliding window into two even cells.
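The two-cell patch representation can be sketched as follows: the window is split vertically down the middle and one visual word histogram per cell is concatenated (a sketch under assumed conventions; the cell split direction and names are illustrative):

```python
import numpy as np

def patch_bovw(positions, words, x0, x1, k):
    """Two-cell BOVW histogram for a horizontal patch [x0, x1).

    positions: (n,) x-coordinates of the visual words inside the patch
    words:     (n,) visual word labels in {0, ..., k-1}
    k:         codebook size
    returns:   (2k,) concatenated left-cell and right-cell histograms
    """
    mid = (x0 + x1) / 2.0
    left = np.bincount(words[positions < mid], minlength=k)
    right = np.bincount(words[positions >= mid], minlength=k)
    return np.concatenate([left, right]).astype(float)
```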

B. Online query search

The online phase starts when a user outlines a query region Q depicting a word or a graphical pattern. The query search begins with feature extraction in the selected region and proceeds in two stages: (1) candidate zone detection followed by (2) false positive filtering using spatial constraints. It is important to note that the query Q is represented both by a weakly ordered BOVW feature Q_BOVW and as a sequence Q_seq of visual words. In this paper, a sequence is constructed by projecting the window-enclosed visual words onto a horizontal axis. Each representation is used in its respective processing stage.

1) Candidate zone detection: In the first stage, candidate zone detection works by comparing the query feature Q_BOVW with all densely sampled sliding window features Y^ij_BOVW, j = 1, ..., n_i in the i-th document image. The input of the first stage is the feature Q_BOVW and the densely sampled sliding windows covering each document image. The output of the first stage is a set of zones of interest containing potential matches, described by bounding box windows. Every pair of BOVW features is compared using the χ² distance metric,

χ²(Q_BOVW, Y^ij_BOVW) = (1/2) Σ_{l=1}^{k} [Q_BOVW(l) − Y^ij_BOVW(l)]² / [Q_BOVW(l) + Y^ij_BOVW(l)]    (1)

which is adapted to histogram comparison. Comparing the query to all sliding windows on a single page produces a vector of distances, of which we retain only the top 200 results.
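Equation (1) translates directly into code; a small epsilon guards the empty-bin case where both histograms are zero, a detail the formula leaves implicit:

```python
import numpy as np

def chi2_distance(q, y, eps=1e-10):
    """Chi-squared distance between two BOVW histograms (Eq. 1)."""
    q = np.asarray(q, dtype=float)
    y = np.asarray(y, dtype=float)
    return 0.5 * np.sum((q - y) ** 2 / (q + y + eps))
```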

Analysis of the results, which we do not show due to limited space, showed that irrespective of the size of the query, the retained sliding windows tend to form compact clusters indicating zones of interest. For each such localized zone of interest we build a bounding box. In order to reduce the number of degenerate cases in which a large bounding box contains two or more weakly overlapping zones of interest, a non-maximum suppression algorithm is applied. This post-processing increases the chance that a zone of interest is compactly represented, while also reducing the search space for the second stage.
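A greedy non-maximum suppression pass over scored windows can be sketched as below (a generic sketch, not the authors' implementation; boxes are assumed to be (x0, y0, x1, y1) tuples and the overlap threshold is an assumed parameter):

```python
def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression.

    boxes:  list of (x0, y0, x1, y1) tuples
    scores: matching scores, higher is better
    returns indices of the retained boxes
    """
    def iou(a, b):
        # Intersection-over-union of two axis-aligned boxes.
        ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
        ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        union = area(a) + area(b) - inter
        return inter / union if union > 0 else 0.0

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep
```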

2) Introducing spatial information: The second stage of the method uses the detected zones of interest as input for further filtering. Its output is a set of query-sized windows with their respective similarity scores.

The essence of the second stage is to enforce the spatial ordering information characteristic of words and graphical patterns alike. As the BOVW features used in the first stage encode only weak spatial information (windows are split vertically into two even-sized cells), experimental results showed that a large number of visually similar false positives are returned as well. We argue that enforcing a spatial arrangement of local features efficiently filters out all zones with mismatching feature ordering.

Processing proceeds with a query-sized window, which we slide within each bounding box, computing the matching score with the LWP algorithm. The LWP algorithm uses the second representation of the query, Q_seq, and compares it to each sequence Y^ij_seq, j = 1, ..., n_i:

s_ij = LWP(Q_seq, Y^ij_seq, M)    (2)

where n_i is the number of sliding windows in the i-th bounding box and M is a matrix encoding inter-cluster visual similarities, as discussed in subsection III-C. It is important to note that applying the LWP algorithm directly, without the preliminary first-stage filtering, would be prohibitively costly.

C. The Longest Weighted Profile (LWP) Algorithm

The proposed LWP algorithm is used to robustly match two selected regions represented as visual word sequences and to return a similarity score based on the common information found in the two sequences. The ordering of visual words encodes fine-to-coarse spatial relationships of local features and is an important cue to distinguish between two similar but not identical regions. This sort of false similarity is due to the orderless nature of the BOVW representation. A remedy to this problem is to leverage the spatial ordering of visual words in each region. It is clear that, due to a priori suboptimal feature sampling and writing variability, two sequences (generated from two image words) cannot be compared directly element by element, and that a noise-insensitive matching procedure should be applied instead.

The proposed matching procedure (see Algorithm 1) features the required robustness properties. It takes as input two visual word sequences Q_seq and Y_seq and an inter-cluster similarity matrix M, and computes a similarity score. The score is higher if common and visually similar (as encoded by the matrix M) visual words appear in the same order. Furthermore, by tolerating the small random variations in the writing of a word that may push a descriptor from one k-means cluster to another, it allows the similarity threshold to be raised, thus eliminating false positives without losing true ones. It is important to note that the algorithm is insensitive to noisy, mismatching feature insertions that may occur at arbitrary locations in the sequences. The standard substring search algorithm [13] is not robust in this context, as it returns the longest contiguous substring, which could easily be a subpart of a longer sequence interrupted by random insertions.

Algorithm 1 The Longest Weighted Profile (LWP) algorithm.
INPUT:
  Q - query sequence composed of m visual words
  Y - test sequence composed of n visual words
  M - k × k inter-cluster similarity matrix
OUTPUT: s_QY - similarity score
PROCEDURE:
  S := array(0...m, 0...n), initialized to 0
  for i := 1 to m
    for j := 1 to n
      if Q_i = Y_j
        S_{i,j} := S_{i-1,j-1} + 1
      else
        Λ_{i,j} := S_{i-1,j-1} + M_{Q_i,Y_j}
        S_{i,j} := max(S_{i,j-1}, S_{i-1,j}, Λ_{i,j})
      end
    end
  end
  return S_{m,n}
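Algorithm 1 translates directly into a dynamic program. A minimal Python sketch (M is passed as a plain nested lookup; with an all-zero off-diagonal M the score equals the classic LCS length):

```python
def lwp(Q, Y, M):
    """Longest Weighted Profile score between visual word sequences Q and Y.

    M[a][b] is the inter-cluster similarity of visual words a and b;
    it is only consulted when the words differ.
    """
    m, n = len(Q), len(Y)
    # S[i][j] holds the best score matching Q[:i] against Y[:j].
    S = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if Q[i - 1] == Y[j - 1]:
                # Exact visual word match: extend the diagonal by 1.
                S[i][j] = S[i - 1][j - 1] + 1.0
            else:
                # Partial credit for visually similar clusters.
                weighted = S[i - 1][j - 1] + M[Q[i - 1]][Y[j - 1]]
                S[i][j] = max(S[i][j - 1], S[i - 1][j], weighted)
    return S[m][n]
```

As in the paper, the run time is O(mn) for sequences of m and n elements.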

As depicted in Fig. 2, features carrying the same visual word label do not necessarily carry the same visual information. Counting only strict match and mismatch cases would reflect a perfect local feature clustering, which is certainly not the case with real-world data. It is therefore reasonable to account for these perhaps artificially split clusters and to assign some positive similarity score (see the scoring illustration in Fig. 3) when comparing two different visual words belonging to two highly overlapping clusters. This prior information is encoded in the symmetric matrix M, and we propose one possible way to compute it:

M_ij = max(0, ⟨μ_i, μ_j⟩ / (‖μ_i‖ ‖μ_j‖))^τ,  ∀ i, j ∈ {1, ..., k}    (3)

where μ_i, i = 1, ..., k are the cluster centers in the concatenated SIFT feature space. The parameter τ > 0 controls the trade-off between close and far cluster centers and is empirically set to τ = 50 in all our experiments. It is interesting to note that in the limit τ → +∞, the matrix M reduces to the identity matrix and the algorithm reduces to the classic Longest Common Subsequence (LCS) algorithm [13]. The running time of the proposed algorithm comparing two sequences of m and n elements using dynamic programming is O(mn).
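Equation (3) is a cosine similarity between cluster centers, clipped at zero and raised to the power τ. A NumPy sketch (assuming non-zero centers; the array layout is illustrative):

```python
import numpy as np

def similarity_matrix(centers, tau=50.0):
    """Inter-cluster similarity matrix M of Eq. (3).

    centers: (k, d) array of k-means cluster centers mu_i
    tau:     sharpness; large tau drives M toward the identity (plain LCS)
    """
    norms = np.linalg.norm(centers, axis=1, keepdims=True)
    cos = (centers @ centers.T) / (norms * norms.T)
    return np.maximum(0.0, cos) ** tau
```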

In Fig. 4 we show an example comparing the LWP algorithm to the LCS algorithm in the word spotting context.

Figure 4. Performance comparison of the standard LCS [13] and the proposed LWP algorithm. The evaluation is done exhaustively on all the words from 2 pages of the George Washington database.

Figure 5. Relative number of annotated words in the George Washington database, grouped by length.

IV. EXPERIMENTAL RESULTS

In this paper, the evaluation of word spotting performance follows the protocol used in [10] on the George Washington handwritten document dataset. For graphical pattern spotting, to our knowledge there is no publicly available historical document collection with pattern annotations. We therefore provide qualitative results of pattern spotting on our in-house database of historical documents.

The evaluation protocol assumes that both databases are annotated, with a unique label assigned to each window carrying the same information (word or pattern). System performance is evaluated by querying each annotated element in turn and measuring Precision and Recall. We consider the detection of a word or pattern to be successful if more than 50% of the ground truth window's surface is covered by a result window.
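The 50% coverage criterion can be checked as follows (the (x0, y0, x1, y1) box format is an assumption; note this measures surface coverage of the ground truth, not symmetric intersection-over-union):

```python
def is_hit(gt, result, min_coverage=0.5):
    """True when the result window covers more than min_coverage
    of the ground-truth window's surface."""
    ix0, iy0 = max(gt[0], result[0]), max(gt[1], result[1])
    ix1, iy1 = min(gt[2], result[2]), min(gt[3], result[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    gt_area = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter > min_coverage * gt_area
```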

The performance of the proposed spotting method, using all 4860 words as queries on the George Washington handwritten documents, is shown in Fig. 6. The results are grouped by query length in characters for convenience of comparison. The relative number of annotated words of each length in this database is depicted in Fig. 5. Globally, the results reflect the difficulty of retrieving short words and the lesser difficulty for longer words. This is expected behavior, since short words (e.g. 1-3 characters) are described by fewer visual features and the search results contain many false positives. From a use case perspective, these words are often stop words and their retrieval utility in real-life scenarios may be limited. The retrieval results are significantly better for queries composed of 7-9 and 10-12 characters. On this particular database, up to 30%-40% recall can be safely attained without sacrificing much precision. The best results are obtained with the longest words (13-15 characters), of which up to 50% can be retrieved at approximately the same precision.

Figure 6. Word spotting performance using all 4860 annotated words, grouped by query length in characters.

Figure 7. Qualitative graphical pattern spotting performance. The query window is shown in red and the corresponding retrieved results in blue.

Finally, we evaluated the performance of the method in the context of graphical pattern spotting, with some results shown in Fig. 7. Overall, the results show consistency in structure and visual content between the query and the retrieved regions (examples A and B). Analysis of the results showed that the first stage already provides good estimates of the visually similar regions, with little improvement from the LWP algorithm. The typical false positives (example C) arose from examples carrying high-level information that is not leveraged in this spotting method. Finally, we encountered difficulties in selecting the best threshold score, which is characteristic of systems using no classifiers. The practical implementation in the DocExplore platform [14] allows this threshold to be controlled by the user and tuned manually as a sensitivity parameter.

V. CONCLUSIONS AND FUTURE WORK

The proposed segmentation-free spotting method has been shown to work with both words and graphical patterns. The system works remarkably well considering the completely unsupervised learning approach and the single query. The proposed method demonstrates the power of discriminative BOVW features for candidate zone detection and of the LWP algorithm for enforcing spatial information. In future work we intend to learn discriminative visual features from the document images and to provide a deeper performance analysis of the system for graphical pattern spotting. We expect that document-specific visual features could further improve the retrieval results for medium to long words. In order to reliably detect short words, discriminative prior knowledge (e.g. using whitespace as a cue for word separation) could be integrated by extending the decision stages of the proposed method.

ACKNOWLEDGMENT

This work has been done in the context of the EU Interreg IVa DocExplore project (http://www.docexplore.eu). The authors would like to thank the Municipal Library of Rouen, France for providing digitized historical manuscripts, and Cécile Capot for the great effort she put into graphical pattern annotation.

REFERENCES

[1] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object Detection with Discriminatively Trained Part-Based Models," IEEE Transactions on PAMI, vol. 32, no. 9, 2010.

[2] S.-S. Kuo and O. Agazzi, "Keyword spotting in poorly printed documents using pseudo 2-D hidden Markov models," IEEE Transactions on PAMI, vol. 16, no. 8, pp. 842-848, 1994.

[3] R. Manmatha, C. Han, and E. M. Riseman, "Word Spotting: A New Approach to Indexing Handwriting," in Proc. of CVPR. IEEE Computer Society, Jun. 1996, pp. 631-637.

[4] Y. Leydier, A. Ouji, F. Lebourgeois, and H. Emptoz, "Towards an omnilingual word retrieval system for ancient manuscripts," Pattern Recognition, vol. 42, no. 9, pp. 2089-2105, 2009.

[5] V. Lavrenko, T. M. Rath, and R. Manmatha, "Holistic word recognition for handwritten historical documents," in Proc. of DIAL. IEEE Computer Society, 2004, pp. 278-287.

[6] T. M. Rath, R. Manmatha, and V. Lavrenko, "A search engine for historical manuscript images," in Proc. of Intl. ACM SIGIR Conf. on Research and Development in IR. New York, NY, USA: ACM, 2004, pp. 369-376.

[7] J. A. Rodriguez and F. Perronnin, "Local gradient histogram features for word spotting in unconstrained handwritten documents," in Proc. of ICFHR, Aug. 2008, pp. 7-12.

[8] S. Thomas, C. Chatelain, L. Heutte, and T. Paquet, "An Information Extraction model for unconstrained handwritten documents," in ICPR, Istanbul, Turkey, 2010, pp. 3412-3415.

[9] P. Yarlagadda, A. Monroy, B. Carque, and B. Ommer, "Recognition and analysis of objects in medieval images," in Proc. of ICCV. Berlin, Heidelberg: Springer-Verlag, 2010, pp. 296-305.

[10] M. Rusiñol, D. Aldavert, R. Toledo, and J. Lladós, "Browsing Heterogeneous Document Collections by a Segmentation-Free Word Spotting Method," in Proc. of ICDAR, 2011, pp. 63-67.

[11] J. Almazan, A. Gordo, A. Fornés, and E. Valveny, "Efficient Exemplar Word Spotting," in British Machine Vision Conference, 2012.

[12] I. Z. Yalniz and R. Manmatha, "An Efficient Framework for Searching Text in Noisy Document Images," in 10th IAPR Intl. Workshop on DAS, 2012, pp. 48-52.

[13] T. H. Cormen, C. Stein, R. L. Rivest, and C. E. Leiserson, Introduction to Algorithms, 2nd ed. McGraw-Hill Higher Education, 2001.

[14] P. Tranouez, S. Nicolas, V. Dovgalecs, A. Burnett, L. Heutte, Y. Liang, R. Guest, and M. Fairhurst, "DocExplore: overcoming cultural and physical barriers to access ancient documents," in Proc. of ACM Symposium on Document Engineering. Paris, France: ACM, 2012, pp. 205-208.