[ieee 2006 ieee international conference on industrial technology - mumbai, india...



Page 1: [IEEE 2006 IEEE International Conference on Industrial Technology - Mumbai, India (2006.12.15-2006.12.17)] 2006 IEEE International Conference on Industrial Technology - Novel Thresholding

Novel Thresholding Method for Document Analysis

Adnan Khashman, Senior Member, IEEE and Boran Sekeroglu, Member, IEEE

Near East University, Lefkosa, Mersin 10, Turkey
[email protected] and bsekeroglu@neu.edu.tr

Abstract: Thresholding is a simple and efficient method for image enhancement and segmentation of grayscale documents, where the relationship of pixel values in the documents can provide an effective single point for the separation of the background and foreground layers. Document analysis and effective separation of text may provide useful data for electronic storage systems and libraries. This paper presents a novel method, namely Mass-Difference Thresholding (MDTh), for enhancement and text separation from documents. MDTh will be implemented using 30 documents that have various levels of noise and color. A comparison will be drawn between MDTh and five other well-known and efficient thresholding methods. Experimental results suggest that the developed method performs well, thus providing a fast and efficient method for text separation.

I. INTRODUCTION

The purpose of document analysis is the effective preparation and separation of the documents in order to provide efficient and clear data for recognition and analysis of the documents. Thus, the first step of document analysis is performing the efficient separation of the texts from the background. It looks quite easy to separate texts from a clean and noiseless background, where simple and basic thresholding methods can solve this problem easily and effectively. However, some documents, especially historical documents, have different conditions resulting from layers such as noise, colors and meaningless shapes [1]. Therefore, analyzing these degraded documents generally requires more complex multi-algorithms or thresholding techniques to separate these layers.

Thresholding is one of the simplest methods that can be used to separate foreground texts or objects from the background. A fixed-point threshold may be used successfully in very high contrast images; in low contrast images, however, it causes either loss of information or fuzzy information. Thresholding methods such as the background-symmetry algorithm [2] are based on brightness histograms of images, and thus are called histogram-derived thresholds [3].

Recently, several algorithms have been proposed for separating foreground texts from backgrounds [4-7]. Nonetheless, the most commonly used and efficient methods are considered to include Otsu's thresholding method [8], Kapur et al.'s Entropy technique [9], the Kittler-Illingworth minimum error technique [10], the Parker Method [11] and the Solihin et al. Quadratic Integral Ratio (QIR) method [12].

Otsu proposed an efficient method based on the image histogram. It uses discriminant analysis to divide foreground and background by maximizing the discriminant measure [8]. Kapur et al. suggested a method that relies on maximizing the total entropy of both the object and background regions to find the appropriate threshold [9]. Kittler and Illingworth proposed a method which starts by choosing an arbitrary initial threshold T. Then the parameters μi(T), σi(T) and the a priori probability Pi(T), where i = 1, 2, are computed and an updated threshold is calculated [10]. Parker proposed a method that first detects the edges, and then the interior objects between edges are filled. For each pixel (x,y) in the input image Z, it calculates the negative of the gradient in the direction of the brightest neighbor; then, for each region, the sample mean and standard deviations are calculated [11]. Solihin and Leedham presented the Quadratic Integral Ratio (QIR) method, which uses Quadratic Integral Estimators to separate the image into three classes: foreground, background and fuzzy. Class A separates the foreground and fuzzy classes, and class C separates the fuzzy and background classes. QIR deals with a quadratic approximation of the real intensity histogram of the image to separate the classes [12].

Leedham et al. [13] showed that Otsu's algorithm mostly over-thresholded and cannot produce good results when the background intensity is high and there is low contrast in the image. It was also reported that Kapur's entropy performed well if the images had good contrast, while QIR produced better results than the other techniques. These methods proved successful on simple text images or clean document images. Under extreme conditions, however, most of them either over-threshold or under-threshold.

This paper presents a new global single-stage thresholding technique, namely Mass-Difference Thresholding (MDTh), which is based on the mass average and the global maxima of an image to separate foregrounds from backgrounds with minimum loss of information and minimum deviation from the optimum threshold point. The MDTh method will be implemented using 30 new and historical documents containing 2345 words. A comparison will be drawn between MDTh and five known efficient methods for text separation, namely the Otsu, Kapur, Parker, Kittler-Illingworth and QIR Methods. The comparison is based on the rate of recognized words in a document after thresholding is applied using all six methods on 15 documents containing 1205 words.

The structure of the paper is as follows: Section 2 describes the MDTh method. Section 3 presents experimental results of applying MDTh and a comparison of the results when using other known thresholding methods. Finally, Section 4 concludes the work and provides suggestions for further work.
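As background for the comparison, the histogram-based criterion attributed to Otsu [8] in the introduction can be sketched as follows. This is a minimal illustration, assuming the commonly used between-class variance as the discriminant measure; the function name `otsu_threshold` and the 256-bin histogram input are hypothetical choices made for this sketch, not taken from the paper.

```python
from typing import Sequence


def otsu_threshold(histogram: Sequence[int]) -> int:
    """Return the grey level t that maximizes the between-class variance.

    Pixels with value <= t form one class and the remaining pixels the other.
    Assumption: the discriminant measure is w0 * w1 * (mu0 - mu1)^2, which is
    proportional to the between-class variance.
    """
    total = sum(histogram)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_t, best_measure = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of intensities at or below the candidate threshold
    for t, count in enumerate(histogram):
        w0 += count
        sum0 += t * count
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, so the measure is undefined
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        measure = w0 * w1 * (mu0 - mu1) ** 2
        if measure > best_measure:
            best_measure, best_t = measure, t
    return best_t


# Toy bimodal histogram: dark text around level 50, bright paper around 200.
hist = [0] * 256
hist[50] = 10
hist[200] = 10
print(otsu_threshold(hist))
```

Any threshold between the two modes separates this toy histogram equally well; the sketch returns the first such level it encounters.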

1-4244-0726-5/06/$20.00 ©2006 IEEE


II. MASS-DIFFERENCE THRESHOLDING

MDTh is a global single-stage thresholding technique that finds the optimum threshold value using the global maxima (highest pixel value) and the mass average (mean of the intensities) of an image. The relationship between pixel values of grayscale images provides a threshold point for the foreground and the background of the image. The highest pixel value represents the global maxima of the image. The averaged pixel values of a whole image represent the mass average (mean of intensities) of the image.

MDTh is different from the background-symmetry algorithm [2], which assumes a distinct and dominant peak for the background that is symmetric about its maximum. In that algorithm the maximum peak is found by searching for the maximum value in the histogram, whereas in MDTh the maximum value is the highest pixel value within the image.

MDTh uses the deviations between the mass average and the global maxima. The Mass of an image can be defined as:

$$M = \frac{\sum_{x=1}^{dim_x} \sum_{y=1}^{dim_y} I[x,y]}{dim_x \times dim_y} \tag{1}$$

where M represents the mass of the image, dim_x and dim_y denote the x and y dimensions of the image respectively, and I represents the original grayscale image. The Global Maxima, or maximum brightness of an image, is defined in equation (2) as a maximum function of the original grayscale image.

$$G_{max} = F_{max}(I) \tag{2}$$

After the calculation of the mass and the global maxima, the Local Deviation (D) of the Mass (M) from the Global Maxima (G_max) is defined as:

$$D = G_{max} - M \tag{3}$$

The Total Deviation (T), which represents the Optimum Threshold Value, is defined as the difference between the local deviation and the mass of the image, as given in equation (4). The absolute value of this difference is taken to avoid negative threshold values in cases where the mass of the image is smaller than the local deviation.

$$T = |M - D| \tag{4}$$

Equation (4) can also be written as:

$$T = |M - (G_{max} - M)| \tag{5}$$

An MDTh image (MI) is obtained using equation (6):

$$MI[x,y] = \begin{cases} 0, & \text{if } I[x,y] \le T \\ 255, & \text{else} \end{cases} \tag{6}$$

Figure 1 demonstrates the MDTh operation on the histogram of the word "sample" image. Here the Mass of the image (mean of intensities) is M = 226, the Global Maxima (highest pixel value) is G_max = 246 on the grey-level scale of 0 to 255, and the optimum threshold point is T = 206.

Fig. 1. "Sample" Image and Histogram Operations: Mass of Image (M = 226), Global Maxima (G_max = 246), Optimum Threshold (T = 206).

III. EXPERIMENTAL RESULTS AND COMPARISON

The proposed global single-stage thresholding method (MDTh) has been implemented using 30 images of various documents containing 2345 words in total. The documents have a variety of noise and contrasts. A total deviation value representing the selected optimum threshold point of an image was calculated for each of the 30 document images. This yielded 30 optimum threshold points that are different and unique to their respective images. Figure 2 and Figure 3 show examples of applying MDTh to separate text in documents.

The efficiency of text separation methods can be determined by the recognition rate of words in the thresholded or segmented document images. Visual inspection of the documents determines the number of recognized and readable words in a thresholded document. This inspection was carried out by 15 independent persons. The total number of words in the original document and the number of recognized and readable words after thresholding were used to determine the recognition rate as in equation (7).
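The MDTh steps defined in equations (1) to (6) of Section II can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the image is assumed to be a list of rows of grey levels in 0 to 255, the function names (`mass`, `global_maxima`, `mdth_threshold`, `mdth_image`) are hypothetical, and treating pixels exactly equal to T as foreground is an assumption about how ties are handled. The toy image is chosen so that it reproduces the values quoted for Figure 1: D = 246 - 226 = 20 and T = |226 - 20| = 206.

```python
from typing import List

Image = List[List[int]]  # rows of grey levels in the range 0..255


def mass(image: Image) -> float:
    """Equation (1): mean of all pixel intensities."""
    total = sum(sum(row) for row in image)
    count = sum(len(row) for row in image)
    return total / count


def global_maxima(image: Image) -> int:
    """Equation (2): highest pixel value within the image."""
    return max(max(row) for row in image)


def mdth_threshold(image: Image) -> float:
    """Equations (3) to (5): T = |M - (Gmax - M)|."""
    m = mass(image)
    d = global_maxima(image) - m  # local deviation D, equation (3)
    return abs(m - d)             # total deviation T, equation (4)


def mdth_image(image: Image) -> Image:
    """Equation (6): pixels at or below T become 0, all others 255.

    Assumption: ties (pixel == T) are mapped to 0.
    """
    t = mdth_threshold(image)
    return [[0 if pixel <= t else 255 for pixel in row] for row in image]


# Toy 2x2 image with the Figure 1 statistics: M = 226, Gmax = 246.
toy = [[246, 200],
       [232, 226]]
print(mass(toy), global_maxima(toy), mdth_threshold(toy))  # 226.0 246 206.0
print(mdth_image(toy))  # [[255, 0], [255, 255]]
```

Because the threshold needs only one pass for the mean and one for the maximum, the cost is linear in the number of pixels, which fits the single-stage character described in Section II.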


TABLE I
COMPARISON OF RECOGNITION RATES

Method                   Clean     Degraded   Highly Degraded   Total      Recognition Rate
Otsu                     300/325   335/450    290/430           925/1205   77 %
Kapur                    289/325   325/450    175/430           789/1205   65 %
Parker                   252/325   250/450    235/430           737/1205   61 %
Kittler and Illingworth  199/325   60/450     35/430            294/1205   24 %
QIR                      215/325   90/450     80/430            385/1205   32 %
MDTh                     308/325   350/450    300/430           958/1205   80 %
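The totals and percentages in Table I can be re-derived from the per-category counts. The sketch below is only a consistency check, assuming the recognition rate is the number of recognized words divided by the total number of words, expressed as a percentage; the names `recognition_rate`, `TABLE_I` and `CATEGORY_TOTALS` are hypothetical.

```python
def recognition_rate(recognized: int, total: int) -> float:
    """Recognized words as a percentage of all words."""
    return 100.0 * recognized / total


# Words per category: clean, degraded, highly degraded (from Table I).
CATEGORY_TOTALS = (325, 450, 430)

# Recognized words per method in the same category order (from Table I).
TABLE_I = {
    "Otsu": (300, 335, 290),
    "Kapur": (289, 325, 175),
    "Parker": (252, 250, 235),
    "Kittler and Illingworth": (199, 60, 35),
    "QIR": (215, 90, 80),
    "MDTh": (308, 350, 300),
}

total_words = sum(CATEGORY_TOTALS)  # 1205
for method, counts in TABLE_I.items():
    recognized = sum(counts)
    rate = recognition_rate(recognized, total_words)
    print(f"{method:<24} {recognized:>4}/{total_words}  {rate:5.1f} %")
```

Rounded to whole percentages, the computed rates agree with the last column of Table I.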

[Fig. 2: example of applying MDTh to separate text in a document image, panels (a) and (b)]


Fig. 3. MDTh Implementation Example: (a) Original Image, (b) MDTh Image.

$$\text{Recognition rate} = \frac{W_R}{W_T} \times 100\,\% \tag{7}$$

where W_R is the number of recognized words and W_T is the total number of words in the original document.

The recognition rate after applying MDTh for text separation from 30 documents was 85.11 %, where 1996 words out of the total 2345 words were clear and readable after visual inspection by the 15 independent persons. The average processing time of MDTh was 0.03 seconds for each image using a 2.4 GHz PC with 256 MB of RAM, Windows XP OS and a Borland C++ compiler.

A comparison between MDTh and previously developed efficient methods for text separation has been drawn using 15 out of the 30 document images. Text separation by thresholding was carried out using the Otsu, Kapur, Parker, Kittler-Illingworth, QIR and MDTh Methods. The 15 documents contain 1205 words in total, and Table I lists the recognition rates of the six methods. The results of this comparison show that the recognition rates of the other five methods were lower than that of MDTh. This shows that the MDTh method is an efficient method for text separation from documents, as well as simple and fast to implement. Figure 4 shows the thresholding results of the word "sample" using all six methods for comparison purposes.

IV. CONCLUSION

The novel MDTh method is based on the local and total deviation of the document pixels, from which an optimum threshold value is determined. Threshold values vary from one document to another as they represent the color and contrast values of the document image. The difference in averaged values (Mass Difference) provides a unique and optimum thresholding value for each document image, and thus clearly separates foregrounds from backgrounds with minimum loss of information and minimum fuzziness of the thresholded image.

The developed MDTh method for text separation is efficient and fast, with a processing time of 0.03 seconds per image. The efficiency of our novel method was demonstrated by applying thresholding to 30 documents containing 2345 words. A recognition rate of approximately 85 % was achieved [...]

[...] MDTh method by varying the segment sizes when calculating the mass-difference. Each document would then have a unique optimum segment size that provides a reduction in processing time.


Fig. 4. Sample Comparison Results: (a) Original Image, (b) Otsu Method, (c) Kapur Method, (d) Parker Method, (e) Kittler and Illingworth Method, (f) QIR Method, (g) MDTh Method - Removed Background and (h) MDTh Method Result.

REFERENCES

[1] Y. Zheng, H. Li and D. Doermann, "Machine Printed Text and Handwriting Identification in Noisy Document Images", IEEE Trans. PAMI, vol. 26 (3), pp. 337-353, 2004.

[2] R.C. Gonzalez and R.E. Woods, "Digital Image Processing", Reading, Massachusetts, Addison-Wesley, 2002.

[3] I.T. Young, J.J. Gerbrands, and L.J. Van Vliet, "Image Processing Fundamentals", available at: http://www.ph.tn.tudelft.nl/Courses/FIP/noframes/fip.html.

[4] L. O'Gorman, "The Document Spectrum for Page Layout Analysis", IEEE Trans. PAMI, vol. 15 (11), pp. 1162-1173, 1993.

[5] X. Ye, M. Cheriet and C.Y. Suen, "Stroke-Model-Based Character Extraction from Gray-Level Document Images", IEEE Trans. on Image Processing, vol. 10 (8), pp. 1152-1161, 2001.

[6] A. Shahpour, A.S. Fard, H. Aghaeinia and K. Faez, "A Restoration and Segmentation Unit for the Historic Persian Documents", Lecture Notes on Computer Science, Springer-Verlag, Berlin Heidelberg, New York, vol. 3708, 2005.

[7] E. Kavallieratou and H. Antonopoulou, "Cleaning and Enhancing Historical Document Images", Lecture Notes on Computer Science, Springer-Verlag, Berlin Heidelberg, New York, vol. 3708, pp. 681-688, 2005.

[8] N. Otsu, "A Threshold Selection Method from Gray-Level Histograms", IEEE Trans. on Systems, Man, and Cybernetics, vol. 9, pp. 62-66, 1979.

[9] J.N. Kapur, P.K. Sahoo and A.K.C. Wong, "A New Method for Gray-Level Picture Thresholding Using the Entropy of the Histogram", Computer Vision, Graphics, and Image Processing, vol. 29, pp. 273-285, 1985.

[10] J. Kittler and J. Illingworth, "Minimum Error Thresholding", Pattern Recognition, vol. 19 (1), pp. 41-47, 1986.

[11] J.R. Parker, "Gray Level Thresholding in Badly Illuminated Images", IEEE Trans. PAMI, vol. 13 (8), pp. 813-819, 1991.

[12] Y. Solihin and C.G. Leedham, "Integral Ratio: A New Class of Global Thresholding Techniques for Handwriting Images", IEEE Trans. PAMI, vol. 21 (8), pp. 761-768, 1999.

[13] G. Leedham, S. Varma, A. Patankar and V. Govindaraju, "Separating Text and Background in Degraded Document Images - A Comparison of Global Thresholding Techniques for Multi-Stage Thresholding", Proceedings of the Eighth International Workshop on Frontiers in Handwriting Recognition, 2002.
