

    IMAGE SEGMENTATION OF HISTORICAL DOCUMENTS

    Carlos A.B. Mello and Rafael D. Lins
    Department of Electronics and Systems – UFPE – Brazil

    {cabm, rdl}@cin.ufpe.br

    ABSTRACT This paper presents a new entropy-based segmentation algorithm for images of documents. The algorithm is used to eliminate the noise inherent to the paper itself, especially in documents written on both sides. It generates good quality monochromatic images, increasing the hit rate of commercial OCR tools.

    I. INTRODUCTION

    We are interested in the processing and automatic transcription of historical documents from the nineteenth century onwards. Image segmentation [2] of this kind of document is more difficult than that of more recent documents because, while the paper colour darkens with age, the printed part, either handwritten or typed, tends to fade. These two factors acting simultaneously narrow the discrimination gap between the two predominant colour clusters of documents. If a document is typed or written on both sides and the opacity of the paper is such as to allow the back printing to be visualized on the front side, the degree of difficulty of good segmentation increases enormously. A new set of hues of paper and printing colours appears, and better filtering techniques are needed to filter out those pixels, reducing back-to-front noise.

    The segmentation algorithm presented was applied to documents from Joaquim Nabuco's1 file [5,12] held by the Joaquim Nabuco Foundation (a research center in Recife, Brazil). The segmentation process is used to generate high quality greyscale or monochromatic images. Figure 1 shows the application of a nearest colour algorithm for decreasing the colours of a sample document from Nabuco's bequest, using Adobe Photoshop [10]. The document is written on both sides; the colour reduction process has not produced satisfactory results, as the ink on one side of the paper interferes with the monochromatic image of the other side.

    This paper introduces a new entropy-based segmentation algorithm and compares it with three of the most important entropy-based segmentation algorithms

    1 Brazilian statesman, writer, and diplomat, one of the key figures in the campaign for freeing black slaves in Brazil, Brazilian ambassador to London (b.1861-d.1910).

    described in the literature. Two different grounds for comparison are presented: visual inspection of the filtered document and the response of Optical Character Recognition (OCR) tools.

    II. ENTROPY-BASED SEGMENTATION

    The documents of Nabuco's file are digitized at 200 dpi in true colour and then converted to 256-level greyscale format using the equation:

    C = 0.3*R + 0.59*G + 0.11*B

    where C is the new greyscale colour and R, G and B are, respectively, the Red, Green and Blue components of the palette of the original colour image.
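In code, the conversion amounts to a per-pixel weighted sum. The sketch below is a minimal illustration; the function name and the rounding choice are our assumptions, not specified by the paper:

```python
def to_greyscale(r, g, b):
    """Convert an RGB triple (0..255 each) to a 256-level grey value
    using the paper's luminance weights C = 0.3R + 0.59G + 0.11B."""
    return round(0.3 * r + 0.59 * g + 0.11 * b)

# A pure-white pixel stays white, pure black stays black.
white = to_greyscale(255, 255, 255)
black = to_greyscale(0, 0, 0)
```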

    Three segmentation algorithms based on the entropy function [1] are applied to greyscale images and are studied here: Pun [9], Kapur et al [3] and Johannsen [8].

    A. Pun's Algorithm

    Pun's algorithm analyses the entropy of the black pixels, Hb, and the entropy of the white pixels, Hw, bounded by the threshold value t. The algorithm suggests choosing t so as to maximize the function H = Hb + Hw, where Hb and Hw are defined by:

    Hb = - Σ_{i=0}^{t} p[i] log(p[i])    (Eq. 1)

    Hw = - Σ_{i=t+1}^{255} p[i] log(p[i])    (Eq. 2)

    where p[i] is the probability of occurrence in the image of a pixel i with colour colour[i]. The logarithm function is taken in base 256. Figure 2 presents the application of Pun's algorithm to the sample image shown in figure 1-left.
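Equations (1) and (2) can be read directly as a pair of half-histogram entropies. The sketch below assumes the probabilities p are already computed from the image histogram (a 256-entry list summing to 1); the helper name is our assumption:

```python
import math

def entropy_split(p, t, base=256):
    """Return (Hb, Hw): entropies of the two histogram halves split at
    threshold t, per Eq. 1 and Eq. 2.  Zero-probability bins are skipped
    since the limit of p*log(p) at 0 is 0."""
    hb = -sum(p[i] * math.log(p[i], base) for i in range(0, t + 1) if p[i] > 0)
    hw = -sum(p[i] * math.log(p[i], base) for i in range(t + 1, 256) if p[i] > 0)
    return hb, hw
```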

    B. Kapur et al's Algorithm

    Reference [3] defines a probability distribution A for the object and a distribution B for the background of the document image, such that:

    A: p0/Pt, p1/Pt, ..., pt/Pt
    B: pt+1/(1 - Pt), pt+2/(1 - Pt), ..., p255/(1 - Pt)

    where Pt is the cumulative probability of the grey levels up to t.

    The entropy values Hw and Hb are evaluated using equations (1) and (2) above, with p[i] following the previous distributions. The maximization of the function Hw + Hb is analysed to define the threshold value t. The sample image of figure 1-left is segmented with this algorithm and the result is presented in figure 3.
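A possible reading of this criterion in code, under the usual interpretation that Pt is the cumulative probability up to t (the function name and the guard against empty classes are our assumptions):

```python
import math

def kapur_threshold(p, base=256):
    """Kapur et al.: choose t maximizing the sum of the entropies of the
    normalized object (A) and background (B) distributions split at t.
    p is a 256-entry list of grey-level probabilities."""
    def ent(probs):
        return -sum(q * math.log(q, base) for q in probs if q > 0)

    best_t, best_h = 0, -1.0
    for t in range(255):
        pt = sum(p[: t + 1])                     # cumulative probability Pt
        if pt <= 0 or pt >= 1:                   # skip empty classes
            continue
        ha = ent(q / pt for q in p[: t + 1])     # entropy of distribution A
        hb = ent(q / (1 - pt) for q in p[t + 1:])  # entropy of distribution B
        if ha + hb > best_h:
            best_t, best_h = t, ha + hb
    return best_t
```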

    C. Johannsen's Algorithm

    Another variation of an entropy-based algorithm is proposed by Johannsen, which minimizes the function Sb(t) + Sw(t), with:

    Sw(t) = log(Σ_{i=t}^{255} p_i) + (1/Σ_{i=t}^{255} p_i)[E(p_t) + E(Σ_{i=t+1}^{255} p_i)]

    and

    Sb(t) = log(Σ_{i=0}^{t} p_i) + (1/Σ_{i=0}^{t} p_i)[E(p_t) + E(Σ_{i=0}^{t-1} p_i)]

    where E(x) = -x·log(x) and t is the threshold value. Figure 4 presents the application of this algorithm to the image of the document under study.
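The minimization above can be sketched as an exhaustive search over t. This is a hedged sketch of our reading of the criterion; the guard on zero probabilities is our assumption, not part of the original formulation:

```python
import math

def johannsen_threshold(p):
    """Johannsen's criterion: pick t minimizing Sb(t) + Sw(t),
    with E(x) = -x * log(x).  p is a 256-entry probability list."""
    def E(x):
        return -x * math.log(x) if x > 0 else 0.0

    best_t, best_s = None, float("inf")
    for t in range(1, 255):
        pb = sum(p[: t + 1])      # cumulative mass of the dark class (0..t)
        pw = sum(p[t:])           # cumulative mass of the light class (t..255)
        if pb <= 0 or pw <= 0 or p[t] <= 0:
            continue              # criterion undefined for empty classes
        sb = math.log(pb) + (E(p[t]) + E(sum(p[:t]))) / pb
        sw = math.log(pw) + (E(p[t]) + E(sum(p[t + 1:]))) / pw
        if sb + sw < best_s:
            best_t, best_s = t, sb + sw
    return best_t
```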

    Figure 1. (left) Original image in 256 greyscale levels and (right) its monochromatic version generated by Photoshop.

    Figure 2. Pun's algorithm applied to the document.

    Figure 3. Kapur et al's segmentation.

    Figure 4. Johannsen's segmentation.

    III. A NEW SEGMENTATION ALGORITHM

    The algorithm scans the image looking for the most frequent colour, which is likely to belong to the image background (the paper). This colour is used as the initial threshold value, t, to evaluate Hw and Hb as defined in equations (1) and (2).

    The entropy H of the complete histogram of the image is also evaluated. It must be noticed that in this new algorithm the logarithmic function used to evaluate H, Hw and Hb is taken with a base equal to the product of the dimensions of the image; that is, if the image has dimensions x by y, the logarithmic base is x·y. As can be seen in [4], this does not change the concept of entropy.

    Using the value of H, two multiplicative factors, mw and mb, are defined following the rules:

    If 0.25 < H < 0.30, then mw = 1 and mb = 2.6
    If H ≤ 0.25, then mw = 2 and mb = 3
    If 0.30 ≤ H < 0.305, then mw = 1 and mb = 2
    If H ≥ 0.305, then mw = 0.8 and mb = 0.8

    These values of mw and mb were found empirically after several experiments. At present, they can be applied to images of historical documents only; for any other kind of image, these values must be analysed again. We emphasise that this new algorithm was developed to work with images with the characteristics of historical documents.

    The greyscale image is scanned again and each pixel i with colour colour[i] is turned white if:

    colour[i]/256 ≥ mw*Hw + mb*Hb

    Otherwise, its colour remains the same (to generate a new greyscale image) or it is turned to black (generating a monochromatic image).
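Putting the pieces together, the whole procedure can be sketched as below. This is our reading of the description above, not the authors' code: the flat-list image representation, the helper names, and the direction of the final comparison are assumptions.

```python
import math

def mello_lins_segment(image, width, height):
    """Sketch of the paper's algorithm: threshold at the most frequent
    grey level, compute H/Hb/Hw with log base width*height, pick the
    empirical factors from H, and whiten pixels above the cut.
    `image` is a flat list of grey levels in 0..255 (an assumption)."""
    n = width * height
    hist = [0] * 256
    for c in image:
        hist[c] += 1
    p = [h / n for h in hist]

    t = max(range(256), key=lambda i: hist[i])   # most frequent colour
    H = -sum(q * math.log(q, n) for q in p if q > 0)
    Hb = -sum(p[i] * math.log(p[i], n) for i in range(t + 1) if p[i] > 0)
    Hw = -sum(p[i] * math.log(p[i], n) for i in range(t + 1, 256) if p[i] > 0)

    # Empirical multiplicative factors from the paper's rules.
    if 0.25 < H < 0.30:
        mw, mb = 1.0, 2.6
    elif H <= 0.25:
        mw, mb = 2.0, 3.0
    elif H < 0.305:                              # here 0.30 <= H < 0.305
        mw, mb = 1.0, 2.0
    else:                                        # H >= 0.305
        mw, mb = 0.8, 0.8

    cut = mw * Hw + mb * Hb
    # White where colour/256 clears the cut; black otherwise
    # (monochromatic output; the >= direction is our assumption).
    return [255 if c / 256 >= cut else 0 for c in image]
```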

    This condition can be inverted, generating a new image where the pixels with colours corresponding to the ink are eliminated, leaving only the pixels classified as paper. This new segmentation algorithm was used for two kinds of applications: 1) to create high quality monochromatic images of the documents for minimum storage space and efficient network transmission, and 2) to achieve better hit rates from commercial OCR tools. The application of the algorithm to the sample document of figure 1-left can be found in figure 5 next.

    Figure 5. Application of the new segmentation algorithm to the document presented in figure 1-left.

    Comparing figures 2, 3, 4 and 5, one can observe that the algorithm proposed in this paper yielded the best quality image, with most of the back-to-front interference removed. It is also important to notice that the new algorithm presented the lowest processing time amongst the algorithms analysed.

    The entropy filtering presented here was used on a set of 40 images of documents and letters from Nabuco's bequest. Unsatisfactory images, which required the intervention of an operator, were produced in only four cases.

    Figure 6 zooms into one of these documents and the output obtained.

    For typed documents (also from Nabuco's file) the segmentation algorithm was applied in search of better responses from commercial OCR tools. In previous tests [6], the OCR tool Omnipage [11] from Caere Corp. achieved the best hit rates1 amongst the six commercial tools analysed. These rates reached almost 99% in some cases. When applied to historical documents, however, this rate decreased to much lower values. The segmented images for a sample typed document can be seen in detail in figure 7.

    Figure 6. (top left) Original image; (top right) original image in black-and-white; (center left) original image segmented by Pun's algorithm; (center right) application of Kapur et al's algorithm; (bottom left) Johannsen's algorithm and (bottom right) our algorithm applied to the original image.

    The table below presents the hit rate of Omnipage for four typed documents representative of Nabuco's bequest after segmentation with the four entropy-based algorithms presented here. They are compared with the use of the original image with no pre-processing other than that performed by the software itself (the column labelled Omnipage).

    A small degradation in the hit rate of the software can be seen in one of the cases (the D064 image) when compared with its use after the application of the new segmentation technique. This degradation can be justified by a possible loss of part of some characters in

    1 Number of characters correctly transcribed from image to text.

    the segmentation process, producing errors in the character recognition process. Even so, the segmentation algorithm proposed in this paper reached the best rates on average.

    Image   Omnipage   Johannsen   Pun    Kapur et al   New Scheme
    D023    80.3       78.3        43.3   91.7          91.4
    D064    84.4       84.5        63.7   85.2          80.1
    D077    80.1       80.1        71.8   77.3          92.4
    D097    75.4       5.1         69.5   73.4          88.0

    Table 1. Hit rate (in percentage) of Omnipage for images of typed historical documents.

    Figure 7-bottom shows another application of the algorithm, as explained before, where the frequencies classified as ink are eliminated, leaving only the background of the image (the paper). This image is used in another part of the system, in the generation of paper texture for historical documents [7].

    Figure 7. (top left) Original image, (top right) segmented image (ink) and (bottom) negative segmentation (paper).

    The algorithm was also tested against other segmentation methods, such as iterative selection, yielding better results in terms of OCR hit rates and of the visual quality of the monochromatic images.

    IV. CONCLUSION

    This paper introduces a new segmentation algorithm for historical documents, which is particularly suitable for reducing the back-to-front noise of documents written on both sides. Applied to a set of 40 samples from Nabuco's bequest, it worked satisfactorily in 90% of them, producing, under visual inspection, better quality images than the best known algorithms described in the literature. The automatic image-to-text transcription of those documents using Omnipage 9.0, a commercial OCR tool from Caere Corp. [11], improved after segmentation. The algorithm presented did not work well with very faded documents. We are currently working on re-tuning the algorithm for this class of documents.

    V. REFERENCES

    [1] N. Abramson. Information Theory and Coding. McGraw-Hill, 1963.

    [2] R. Gonzalez and P. Wintz. Digital Image Processing. Addison-Wesley, 1987.

    [3] J.N. Kapur, P.K. Sahoo and A.K.C. Wong. A New Method for Gray-Level Picture Thresholding using the Entropy of the Histogram. Computer Vision, Graphics and Image Processing, 29(3), 1985.

    [4] S. Kullback. Information Theory and Statistics. Dover Publications, 1997.

    [5] R.D. Lins et al. An Environment for Processing Images of Historical Documents. Microprocessing & Microprogramming, pp. 111-121, North-Holland, 1995.

    [6] C.A.B. Mello and R.D. Lins. A Comparative Study on Commercial OCR Tools. Vision Interface '99, pp. 224-323, Québec, Canada, 1999.

    [7] C.A.B. Mello and R.D. Lins. Generating Paper Texture Using Statistical Moments. IEEE International Conference on Acoustics, Speech and Signal Processing, Istanbul, Turkey, June 2000.

    [8] J.R. Parker. Algorithms for Image Processing and Computer Vision. John Wiley and Sons, 1997.

    [9] T. Pun. Entropic Thresholding, A New Approach. Computer Graphics and Image Processing, 16(3), 1981.

    [10] Adobe Systems Inc. http://www.adobe.com

    [11] Caere Corporation. http://www.caere.com

    [12] Nabuco Project. http://www.di.ufpe.br/~nabuco