
Image Binarization And OCR Toolkit For Old Degraded Documents

International Journal of Emerging Technologies and Engineering (IJETE), Volume 2, Issue 1, January 2015, ISSN 2348-8050

Prof. D. J. Bonde, Prathamesh Bhokare, Ramdas Chavan, Akash Dhawade, Prashant Patule
Department of Information Technology, Savitribai Phule Pune University, Pune

ABSTRACT

Segmentation of text from badly degraded document images is a very challenging task because of the small difference between the background and the foreground text in such images. In this paper we propose an image binarization technique based on adaptive image contrast, a combination of the local image contrast and the local image gradient. In this technique, an adaptive contrast map is first constructed for the input degraded document image. The contrast map is then binarized and combined with Canny's edge map to identify the text stroke edge pixels. The document is then segmented by a local threshold that is estimated from the intensities of the detected text stroke edge pixels within a local window. The proposed method is simple, robust, and involves minimum parameter tuning.

Keywords: OCR (Optical Character Recognition)

1. INTRODUCTION

Document image binarization is performed in the preprocessing stage of document analysis; it aims to segment the foreground text from the document background. A fast and accurate document image binarization technique is becoming increasingly important as more and more text document images are scanned and must be processed quickly and reliably. Image binarization has been studied for many years, but there are still unsettled problems related to thresholding. The proposed method is simple, robust, and has few parameters. In binarization, an old document image is converted into a binarized image, which is an enhanced version of the input image. OCR is then applied to this enhanced image for character recognition.

2. Related Work

Many thresholding techniques have been reported for document image binarization. As many degraded documents do not have a clear bimodal pattern, global thresholding is usually not a suitable approach for degraded document binarization. Adaptive thresholding, which estimates a local threshold for each document image pixel, is often a better approach to deal with the different variations within degraded document images. For example, the early window-based adaptive thresholding techniques estimate the local threshold using the mean and the standard deviation of the image pixels within a local neighborhood window. The main drawback of these window-based thresholding techniques is that the thresholding performance depends on the window size and hence on the character stroke width. Other approaches have also been reported, including background subtraction, texture analysis, recursive methods, decomposition methods, contour completion, Markov Random Fields, matched wavelets, cross-section sequence graph analysis, self-learning, Laplacian energy, user assistance, and combinations of binarization techniques. These methods combine different types of image information and domain knowledge and are often complex.

The local image contrast and the local image gradient are very useful features for segmenting the text from the document background, because the document text usually has a certain image contrast with respect to the neighboring document background. They are very effective and have been used in many document image binarization techniques.

3. PROPOSED WORK AND METHODS:

The proposed system converts a grayscale document image into a binary document image. This section describes the proposed document image binarization technique. Given a degraded document image, a grayscale image is first constructed; then, depending on the threshold value, a final black-and-white image is constructed. The text is then segmented, and scaling is performed for template matching and character recognition. Some post-processing, such as thinning, is further applied to improve the document binarization quality.


The aim is to develop a system that takes a scanned image of a degraded document and produces a binary image by applying various binarization techniques; OCR is then used for text extraction.

3.1. BLURRING:

In blurring, we simply smooth the image. Blurring reduces the noise in the degraded document caused by changes in temperature or other environmental conditions. An image looks sharp or detailed when we can perceive all the objects and their shapes in it correctly; for example, an image of a face looks clear when we can identify the eyes, ears, nose, lips, forehead, etc. very clearly. The shape of an object is defined by its edges, so in blurring we reduce the edge content and make the transition from one color to another very smooth.

Figure 1: Blurring of text

Figure 2: Blurring of image
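As a minimal sketch of this smoothing step (not the authors' exact implementation; the file name 'scan.png' and the box-blur kernel are illustrative assumptions), the following Python snippet averages each pixel with its neighbors, which reduces edge content:

import numpy as np
from PIL import Image

def box_blur(gray, radius=1):
    """Blur a 2-D grayscale array by averaging each pixel with its neighbours."""
    h, w = gray.shape
    padded = np.pad(gray.astype(np.float32), radius, mode='edge')
    out = np.zeros_like(gray, dtype=np.float32)
    k = 2 * radius + 1
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]   # sum the shifted copies inside the window
    return (out / (k * k)).astype(np.uint8)       # divide by the window area

# 'scan.png' is an illustrative file name for a scanned document image.
gray = np.array(Image.open('scan.png').convert('L'))
blurred = box_blur(gray, radius=2)
Image.fromarray(blurred).save('scan_blurred.png')

A Gaussian blur (e.g. Pillow's ImageFilter.GaussianBlur) could be used instead of the box kernel; the choice of kernel is an assumption here.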

Noise Reduction Using Filtering

Median filtering is very widely used in digital image processing because, under certain conditions, it preserves edges while removing noise. The median filter is a nonlinear digital filtering technique, often used to remove noise. By contrast, mean filtering simply replaces each pixel value in an image with the mean ('average') value of its neighbors.

3.2. GRAYSCALING:

In grayscale images we do not differentiate how much of each color is emitted; the same amount is emitted in each channel. In grayscaling we apply the scanline algorithm and separate out the RGB colors. What we can differentiate is the total amount of emitted light for each pixel: little light gives dark pixels and much light is perceived as bright pixels. When converting an RGB image to grayscale, we have to take the RGB values for each pixel and output a single value reflecting the brightness of that pixel. One such approach is to take the average of the contributions from the three channels: (R+G+B)/3. However, since perceived brightness is often dominated by the green component, a different, more "human-oriented" method is to take a weighted average, e.g. 0.3R + 0.59G + 0.11B.

Figure 3: RGB to Grayscale
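The following sketch shows both conversion rules described above, using NumPy and Pillow (library choice and the file name 'scan.png' are illustrative assumptions, not the authors' implementation):

import numpy as np
from PIL import Image

# Load an RGB image; 'scan.png' is an illustrative file name.
rgb = np.array(Image.open('scan.png').convert('RGB')).astype(np.float32)
r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]

# Simple average of the three channels.
avg_gray = ((r + g + b) / 3.0).astype(np.uint8)

# Perceptually weighted average: green dominates perceived brightness.
weighted_gray = (0.3 * r + 0.59 * g + 0.11 * b).astype(np.uint8)

Image.fromarray(weighted_gray).save('scan_gray.png')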

3.3 IMAGE SEGMENTATION:

Image segmentation is the process of dividing an image into multiple parts (segments). It is typically used to identify objects or other relevant information in digital images. There are many different ways to perform image segmentation, such as thresholding. The result of image segmentation is a set of segments that collectively cover the entire image, or a set of contours extracted from the image. The pixels in a region are similar with respect to some characteristic or computed property, such as color, intensity, or texture, while adjacent regions differ significantly with respect to the same characteristics.

3.3.1. Scan Line:

The main advantage of this method is that sorting vertices along the normal of the scanning plane reduces the number of comparisons between edges. Another advantage is that it is not necessary to translate the coordinates of all vertices from main memory into working memory: only the vertices defining edges that intersect the current scan line need to be in active memory, and each vertex is read in only once.

Steps (a minimal sketch follows Figure 4):

1. Locate the intersection points of the scan line with the image edges.
2. Sort the intersections from left to right.
3. Set the corresponding buffer positions between each pair of intersections to a specified fill color.

Figure 4: Scanline
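The sketch below illustrates steps 2 and 3 for a single scan line, assuming the edge intersections for that line have already been computed (the function name fill_scanline and the example values are hypothetical, not part of the paper):

import numpy as np

def fill_scanline(buffer, y, intersections, fill_value=255):
    """Fill one row of the image buffer between pairs of edge intersections.

    buffer        -- 2-D NumPy array representing the image
    y             -- index of the current scan line (row)
    intersections -- x-coordinates where edges cross this scan line
    """
    xs = sorted(int(round(x)) for x in intersections)   # step 2: sort left to right
    for left, right in zip(xs[0::2], xs[1::2]):          # take intersections in pairs
        buffer[y, left:right + 1] = fill_value           # step 3: fill between them

# Example: fill row 10 between x = 3..7 and x = 12..15 of a small blank buffer.
img = np.zeros((20, 20), dtype=np.uint8)
fill_scanline(img, 10, [3, 7, 12, 15])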

3.3.2. Thresholding:

Thresholding is a non-linear operation that converts a grayscale image into a binary image, where one of two levels is assigned to each pixel depending on whether it is below or above the specified threshold value. For a good-quality document image, global thresholding is used to extract the document text. A threshold can be applied to the data directly from the command line, e.g. myBinaryImage = myGrayImage > thresholdValue ? 255 : 0. It is, however, far more efficient to use the ImageThreshold operation, which also provides several methods for finding the "optimal" threshold value for a given image. Thresholding is the simplest method of image segmentation; from a grayscale image, thresholding can be used to create binary images. ImageThreshold provides the following methods for determining the threshold value:

Thresholding Methods:

1. Automatically calculate a threshold value using an iterative method.
2. Approximate the histogram of the image as a bimodal distribution and choose a midpoint value as the threshold level.
3. Adaptive thresholding: evaluate the threshold based on the last 8 pixels in each row, using alternating rows.

Figure 5: Threshold Image
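A minimal Python sketch of global thresholding is given below. The first line reproduces the quoted expression with NumPy; the second uses Otsu's method via OpenCV as one common way of choosing a threshold from a roughly bimodal histogram (OpenCV and Otsu are assumptions for illustration, not the paper's ImageThreshold operation; file names are illustrative):

import cv2
import numpy as np

# 'scan_gray.png' is an illustrative file name for a grayscale document image.
gray = cv2.imread('scan_gray.png', cv2.IMREAD_GRAYSCALE)

# Fixed global threshold, equivalent to: gray > thresholdValue ? 255 : 0
threshold_value = 128
binary_fixed = np.where(gray > threshold_value, 255, 0).astype(np.uint8)

# Automatically chosen global threshold (Otsu assumes a roughly bimodal histogram).
otsu_value, binary_otsu = cv2.threshold(gray, 0, 255,
                                        cv2.THRESH_BINARY + cv2.THRESH_OTSU)

cv2.imwrite('scan_binary.png', binary_otsu)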

3.4 MEDIAN FILTERING:

The main idea of the median filter is to run through the signal entry by entry, replacing each entry with the median of the neighboring entries. The pattern of neighbors is called the "window", which slides, entry by entry, over the entire signal.
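The following NumPy sketch applies this sliding-window idea to a 2-D image (a didactic implementation under assumed window settings; a library call such as OpenCV's cv2.medianBlur would normally be used in practice):

import numpy as np

def median_filter(gray, radius=1):
    """Replace each pixel with the median of its (2*radius+1)^2 neighbourhood."""
    h, w = gray.shape
    padded = np.pad(gray, radius, mode='edge')
    k = 2 * radius + 1
    # Collect every shifted copy of the image inside the window ...
    windows = np.stack([padded[dy:dy + h, dx:dx + w]
                        for dy in range(k) for dx in range(k)], axis=0)
    # ... and take the median across the window axis.
    return np.median(windows, axis=0).astype(gray.dtype)

# Example on a tiny array with an isolated noise spike at the centre.
noisy = np.zeros((5, 5), dtype=np.uint8)
noisy[2, 2] = 255
print(median_filter(noisy, radius=1))   # the spike is removed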

3.4.1. Thinning:

Thinning is a morphological operation that is used to remove selected foreground pixels from binary images, somewhat like erosion or opening. It can be used for several applications, but it is particularly useful for skeletonization. In this mode it is commonly used to tidy up the output of edge detectors by reducing all lines to single-pixel thickness. Thinning is normally applied only to binary images and produces another binary image as output. The thinning operation is related to the hit-and-miss transform, so it is helpful to understand that operator before reading on.

How It Works:

Like other morphological operators, the behavior of the thinning operation is determined by a structuring element. The binary structuring elements used for thinning are of the extended type described under the hit-and-miss transform (i.e. they can contain both ones and zeros).

The thinning operation is related to the hit-and-miss transform and can be expressed quite simply in terms of it. The thinning of an image I by a structuring element J is:

thin(I, J) = I − hit-and-miss(I, J)

where the subtraction is a logical subtraction defined by X − Y = X ∩ NOT Y.

In everyday terms, the thinning operation is calculated by translating the origin of the structuring element to each possible pixel position in the image and, at each such position, comparing it with the underlying image pixels. If the foreground and background pixels in the structuring element exactly match the foreground and background pixels in the image, then the image pixel underneath the origin of the structuring element is set to background (zero); otherwise it is left unchanged. Note that the structuring element must always have a one or a blank at its origin if it is to have any effect.
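As a short illustration (the paper does not name a library; using scikit-image's morphological thin function here is an assumption), the snippet below reduces a thick binary bar to a one-pixel-wide line:

import numpy as np
from skimage.morphology import thin   # assumption: scikit-image's iterative thinning

# A small binary image: a 3-pixel-wide vertical bar.
binary = np.zeros((10, 10), dtype=bool)
binary[1:9, 3:6] = True

# Iteratively thin the foreground down to a one-pixel-wide skeleton.
thinned = thin(binary)
print(thinned.astype(np.uint8))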

3.4.2. Template Matching:

Template matching is a technique in digital image processing for finding small parts of an image that match a template image. A basic method of template matching uses a convolution mask (template), tailored to a specific feature of the search image, which we want to detect. This technique can easily be performed on gray images or edge images.

Improving Template Matching

Improvements can be made to the matching method by using more than one template; these other templates can have different scales and rotations. It is also possible to improve the accuracy of the matching method by hybridizing the feature-based and template-based approaches. Naturally, this requires that the search and template images have features that are apparent enough to support feature matching.

Benefits of Binarization for Template Matching

The convolution output will be highest at places where the image structure matches the mask structure, i.e. where large image values get multiplied by large mask values.

1. Fast binarization
2. Quality of binarization

Figure 6: Template Matching

3.5 Mathematical Equation

S = {I, Igs, B, Ir Filter(), IO}

where:
S = the system, which takes an input image and performs processing on it
I = the input image (on which processing is performed)
Igs = the grayscale image (the RGB image converted to grayscale; a 24-bit image is converted to a 16-bit image)
I = {I1, I2, ..., In}, the set of input images
B = {B1, B2, ..., Bn}, the set of blocks
IO = {IO1, IO2, ..., IOn}, the set of output images (i.e. the enhanced images)

4. FLOW OF PROPOSED SYSTEM:


We will call the search image S(x, y), where (x, y) are the coordinates of each pixel in the search image, and the template T(xt, yt), where (xt, yt) are the coordinates of each pixel in the template. We then simply move the center (or the origin) of the template T(xt, yt) over each (x, y) point in the search image and calculate the sum of products between the coefficients in S(x, y) and T(xt, yt) over the whole area spanned by the template. As all possible positions of the template with respect to the search image are considered, the position with the highest score is the best position. This method is sometimes referred to as 'linear spatial filtering' and the template is called a filter mask.
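A minimal sketch of this matching step is given below, using OpenCV's matchTemplate with normalized cross-correlation as a stand-in for the sum-of-products described above (the library choice and the file names are illustrative assumptions, not the authors' implementation):

import cv2

# Illustrative file names: a binarized page and a binarized character template.
search = cv2.imread('scan_binary.png', cv2.IMREAD_GRAYSCALE)
template = cv2.imread('template_A.png', cv2.IMREAD_GRAYSCALE)

# Score every placement of the template over the search image
# (normalised cross-correlation, i.e. a scaled sum of products).
scores = cv2.matchTemplate(search, template, cv2.TM_CCORR_NORMED)

# The position with the highest score is the best match.
_, max_score, _, max_loc = cv2.minMaxLoc(scores)
x, y = max_loc
h, w = template.shape
print('best match at', (x, y), 'score', max_score)
cv2.rectangle(search, (x, y), (x + w, y + h), 128, 1)
cv2.imwrite('match_result.png', search)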

4.1 OCR (Optical Character Recognition)

OCR (optical character recognition) is the recognition of printed or written text characters by a computer. In OCR processing, the scanned-in image or bitmap is analyzed for light and dark areas in order to identify each alphabetic letter or numeric digit. When a character is recognized, it is converted into an ASCII code. Special circuit boards and computer chips designed expressly for OCR are used to speed up the recognition process. OCR is being used by libraries to digitize and preserve their holdings; it is also used to process checks and credit card slips and to sort mail.

The output of the binarization stage (the binarized image) is taken as the input to OCR, which extracts the text from the image.
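The paper does not specify which OCR engine the toolkit uses; as one possible illustration (an assumption), the binarized output can be passed to the Tesseract engine through the pytesseract wrapper:

from PIL import Image
import pytesseract   # assumption: Tesseract OCR must be installed on the system

# 'scan_binary.png' is an illustrative name for the binarized output image
# produced by the earlier steps.
binary = Image.open('scan_binary.png')
text = pytesseract.image_to_string(binary)
print(text)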

5. ADVANTAGES & DISADVANTAGES

5.1 ADVANTAGES

More stable and easier to use for document images with different kinds of degradation.
Superior performance.
Less human effort.
Easy and accurate process.

5.2 DISADVANTAGES

Based on scored performance.
Not very accurate.

6. CONCLUSION:

This paper proposed a system called Image Binarization And OCR Toolkit For Old Degraded Documents. Our system produces a binarized image of an old degraded document and recognizes the characters using an OCR toolkit.

REFERENCES:

[1] S. Lu, B. Su, and C. L. Tan, "Document image binarization using background estimation and stroke edges," Int. J. Document Anal. Recognit., vol. 13, no. 4, pp. 303-314, Dec. 2010.

[2] B. Su, S. Lu, and C. L. Tan, "Binarization of historical handwritten document images using local maximum and minimum filter," in Proc. Int. Workshop Document Anal. Syst., Jun. 2010, pp. 159-166.

[3] M. Sezgin and B. Sankur, "Survey over image thresholding techniques and quantitative performance evaluation," J. Electron. Imag., vol. 13, no. 1, pp. 146-165, Jan. 2004.

[4] J.-D. Yang, Y.-S. Chen, and W.-H. Hsu, "Adaptive thresholding algorithm and its hardware implementation," Pattern Recognit. Lett., vol. 15, no. 2, pp. 141-150, 1994.

[5] M. Cheriet, J. N. Said, and C. Y. Suen, "A recursive thresholding technique for image segmentation," in Proc. IEEE Trans. Image Process., Jun. 1998, pp. 918-921.

[6] C. T. Yuen, M. Rizon, W. S. San, and T. C. Seong, "Facial features for template matching based face recognition," American J. of Engineering and Applied Sciences, vol. 3, no. 1, pp. 899-903, 2010.

[7] J. G. Kuk, N. I. Cho, and K. M. Lee, "MAP-MRF approach for binarization of degraded document image," in Proc. Int. Conf. Image Process., 2008, pp. 2612-2615.

[8] S. Kumar, R. Gupta, N. Khanna, S. Chaudhury, and S. D. Joshi, "Text extraction and document image segmentation using matched wavelets and MRF model," IEEE Trans. Image Process., vol. 16, no. 8, pp. 2117-2128, Aug. 2007.

[9] A. Dawoud, "Iterative cross section sequence graph for handwritten character segmentation," IEEE Trans. Image Process., vol. 16, no. 8, pp. 2150-2154, Aug. 2007.

[10] B. Su, S. Lu, and C. L. Tan, "A self-training learning document binarization framework," in Proc. Int. Conf. Pattern Recognit., Aug. 2010, pp. 3187-3190.