International Journal of Emerging Technologies and Engineering (IJETE)
Volume 2 Issue 1, January 2015, ISSN 2348 – 8050
20 www.ijete.org
Image Binarization And OCR Toolkit For Old Degraded Documents
Prof. D. J. Bonde
Prathamesh Bhokare*, Ramdas Chavan**, Akash Dhawade***, Prashant Patule****
*–**** Department of Information Technology, Savitribai Phule Pune University, Pune
ABSTRACT
Segmentation of text from badly degraded document images is a very challenging task because of the small difference between the background and the foreground text in such images. In this paper we propose an image binarization technique that addresses this issue through adaptive image contrast, a combination of local image contrast and local image gradient. In this technique an adaptive contrast map is first constructed for an input degraded document image. The contrast map is then binarized and combined with Canny's edge map to identify the text stroke edge pixels. The document is then segmented by a local threshold that is estimated from the intensities of the detected text stroke edge pixels within a local window. The proposed method is simple, robust, and involves minimal parameter tuning.
Keywords: OCR (Optical Character Recognition)
1. INTRODUCTION
Document image binarization is performed in the preprocessing stage of document analysis; it aims to segment the foreground text from the document background. A fast and accurate document image binarization technique is becoming increasingly important as growing numbers of text document images are scanned and must be processed quickly and faithfully. Image binarization has been studied for many years, but there are still unsettled problems related to thresholding. The proposed method is simple, robust, and has few parameters. In binarization, an old document image is converted into a binarized image, which is an enhanced version of the input image. OCR is then applied to this enhanced image for character recognition.
2. RELATED WORK
Many thresholding techniques have been reported for
document image binarization. As many degraded
documents do not have a clear bimodal pattern, global
thresholding is usually not a suitable approach for the
degraded document binarization. Adaptive thresholding,
which estimates a local threshold for each document
image pixel, is often a better approach to deal with
different variations within degraded document images.
For example, the early window-based adaptive
thresholding techniques estimate the local threshold by
using the mean and the standard deviation of the image
pixels within a local neighborhood window. The main
drawback of these window-based thresholding
techniques is that the thresholding performance depends
on the window size and hence the character stroke width.
Other approaches have also been reported, including background subtraction, texture analysis, recursive methods, decomposition methods, contour completion, Markov Random Fields, matched wavelets, cross section sequence graph analysis, self-learning, Laplacian energy, user assistance, and combinations of binarization techniques. These methods combine different types of
image information and domain knowledge and are often
complex. The local image contrast and the local image
gradient are very useful features for segmenting the text
from the document background because the document
text usually has certain image contrast to the neighboring
document background. They are very effective and have
been used in many document image binarization
techniques.
3. PROPOSED WORK AND METHODS:
The proposed system converts a grayscale document image into a binary document image. This section describes the proposed document image binarization technique. Given a degraded document image, a grayscale image is first constructed; then, depending on the threshold value, a final black-and-white image is constructed. The text is then segmented, and scaling is performed for template matching and character recognition. Some post-processing, such as thinning, is further applied to improve the document binarization quality.
The aim is to develop a system that takes a scanned image of a particular degraded document, produces a binary image by applying various binarization techniques, and then uses OCR for text extraction.
3.1. BLURRING:
In blurring, we reduce noise in the degraded document caused by, for example, changes in temperature or environmental conditions. An image looks sharp or detailed when we can perceive all the objects and their shapes in it correctly; for example, an image of a face looks clear when we can identify the eyes, ears, nose, lips, forehead, etc. The shape of an object comes from its edges, so blurring reduces the edge content and makes the transition from one color to another very smooth.
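The blurring step described above can be sketched as a simple box filter: each pixel is replaced by the average of its neighborhood, which smooths edges and suppresses noise. The following is a minimal illustration (the function name and test image are our own, not part of the proposed system):

```python
import numpy as np

def box_blur(img, k=3):
    """Blur a 2-D grayscale image by averaging each pixel's k x k
    neighborhood. Border pixels are handled by edge padding."""
    pad = k // 2
    padded = np.pad(img.astype(float), pad, mode="edge")
    out = np.zeros(img.shape, dtype=float)
    # Sum the k*k shifted copies of the image, then divide by the window size.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

# A sharp edge (0 -> 255) becomes a smooth ramp after blurring.
edge = np.zeros((5, 6))
edge[:, 3:] = 255
blurred = box_blur(edge)
```

As the document notes, the sharp transition between the two colors is spread over several pixels, which is exactly the reduction of edge content that blurring performs.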
Figure 1: Blurring of text
Figure 2: Blurring of image
Noise Reduction Using Filtering
Median filtering is very widely used in digital image processing because, under certain conditions, it preserves edges while removing noise. The median filter is a nonlinear digital filtering technique, often used to remove noise. By contrast, the idea of mean filtering is simply to replace each pixel value in an image with the mean ('average') value of its neighbors.
3.2. GRAYSCALING:
In grayscale images, however, we do not differentiate how much we emit of the different colors; we emit the same amount in each channel. In grayscaling we apply the scanline algorithm and separate out the RGB colors. What we can differentiate is the total amount of emitted light for each pixel: little light gives dark pixels and much light is perceived as bright pixels. When converting an RGB image to grayscale, we have to take the RGB values for each pixel and produce a single output value reflecting the brightness of that pixel. One such approach is to take the average of the contributions from each channel: (R + G + B)/3. However, since the perceived brightness is often dominated by the green component, a different, more "human-oriented" method is to take a weighted average, e.g. 0.3R + 0.59G + 0.11B.
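The weighted average above can be written directly as a dot product over the channel axis. A minimal sketch (function name is illustrative):

```python
import numpy as np

def to_grayscale(rgb):
    """Convert an H x W x 3 RGB image to grayscale using the perceptual
    weights 0.3R + 0.59G + 0.11B described above."""
    weights = np.array([0.3, 0.59, 0.11])
    return rgb @ weights  # weighted sum over the last (channel) axis

# Pure green is perceived as much brighter than pure blue.
green = np.array([[[0.0, 255.0, 0.0]]])
blue = np.array([[[0.0, 0.0, 255.0]]])
g_gray = to_grayscale(green)[0, 0]  # about 150.45
b_gray = to_grayscale(blue)[0, 0]   # about 28.05
```

This matches the document's point: the green channel dominates perceived brightness, so equal channel intensities map to very different gray levels.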
Figure 3: RGB to Grayscale
3.3 IMAGE SEGMENTATION:
Image segmentation is the process of dividing an image into multiple parts. It is typically used to identify objects or other relevant information in digital images. There are many different ways to perform image segmentation, such as thresholding.
The result of image segmentation is a set of segments that collectively cover the entire image, or a set of contours extracted from the image. The pixels in a region are similar with respect to some characteristic or computed property, such as color, intensity, or texture, while adjacent regions differ significantly with respect to the same characteristics.
3.3.1. Scan Line:
The main advantage of this method is that
sorting vertices along the normal of the scanning plane
reduces the number of comparisons between edges.
Another advantage is that it is not necessary to translate
the coordinates of all vertices from the main memory
into the working memory—only vertices defining edges
that intersect the current scan line need to be in active
memory, and each vertex is read in only once.
Steps:
1. Locate the intersection points of the scan line with the image.
2. Sort the intersections from left to right.
3. Set the corresponding buffer positions between each pair of intersections to a specified fill color.
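The three steps above can be sketched for a single scan line, assuming the crossing x-coordinates have already been located (step 1); the function name and buffer representation are our own:

```python
def fill_scanline(row_buffer, intersections, fill_value=1):
    """Fill one scan line of a raster following the steps above: sort the
    x-coordinates where the line crosses the shape, then fill the buffer
    between each successive pair of crossings."""
    xs = sorted(intersections)            # step 2: sort left to right
    for left, right in zip(xs[0::2], xs[1::2]):
        for x in range(left, right):      # step 3: fill between the pair
            row_buffer[x] = fill_value
    return row_buffer

row = [0] * 10
fill_scanline(row, [7, 2])  # crossings given out of order on purpose
```

Because crossings come in entry/exit pairs, filling between each sorted pair colors exactly the interior spans of the shape on that line.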
Figure 4: Scanline
3.3.2. Thresholding:
Thresholding is a non-linear operation that converts a grayscale image into a binary image, where the two levels are assigned to pixels that are below or above the specified threshold value.
For a good-quality document image, global thresholding is used to extract the document text. A threshold can be applied to the data directly, e.g. myBinaryImage = myGrayImage > thresholdValue ? 255 : 0. It is, however, far more efficient to use the ImageThreshold operation, which also provides several methods for finding the "optimal" threshold value for a given image. Thresholding is the simplest method of image segmentation: from a grayscale image, thresholding can be used to create binary images. ImageThreshold provides the following methods for determining the threshold value:
Thresholding Methods:
1. Automatically calculate a threshold value using an iterative method.
2. Approximate the histogram of the image as a bimodal distribution and choose a midpoint value as the threshold level.
3. Adaptive thresholding: evaluate the threshold based on the last 8 pixels in each row, using alternating rows.
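Method 1 above (an iterative threshold) and the direct pixel test can be sketched as follows; this is an illustrative reconstruction, not the paper's implementation, and the function names are our own:

```python
import numpy as np

def iterative_threshold(gray, tol=0.5):
    """Method 1: repeatedly set the threshold to the midpoint of the mean
    background and mean foreground intensities until it stops moving."""
    t = gray.mean()
    while True:
        lo, hi = gray[gray <= t], gray[gray > t]
        new_t = 0.5 * (lo.mean() + hi.mean())
        if abs(new_t - t) < tol:
            return new_t
        t = new_t

def binarize(gray, t):
    """Pixels above the threshold become 255 (white), the rest 0 (black)."""
    return np.where(gray > t, 255, 0).astype(np.uint8)

# Two well-separated intensity populations: dark text on a bright page.
gray = np.array([[20, 30, 220], [210, 25, 230]], dtype=float)
binary = binarize(gray, iterative_threshold(gray))
```

For a clearly bimodal image like this, the iteration converges in one step to the midpoint between the two population means.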
Figure 5: Threshold Image
3.4 MEDIAN FILTERING:
The main idea of the median filter is to run
through the signal entry by entry, replacing each entry
with the median of neighboring entries. The pattern of
neighbors is called the "window", which slides, entry by
entry, over the entire signal.
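The sliding-window behavior described above can be sketched directly; a sketch assuming numpy, with an illustrative salt-noise example of our own:

```python
import numpy as np

def median_filter(img, k=3):
    """Slide a k x k window over the image and replace each pixel with the
    median of its window. Outlier (salt-and-pepper) pixels are discarded,
    while step edges survive better than under mean filtering."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.empty_like(img)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            out[y, x] = np.median(padded[y:y + k, x:x + k])
    return out

# A single "salt" pixel in a flat region is removed entirely:
# eight of the nine window values are 0, so the median is 0.
noisy = np.zeros((5, 5), dtype=np.uint8)
noisy[2, 2] = 255
clean = median_filter(noisy)
```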
3.4.1. Thinning:
Thinning is a morphological operation that is
used to remove selected foreground pixels from binary
images, somewhat like erosion or opening. It can be
used for several applications, but is particularly useful
for skeletonization. In this mode it is commonly used to
tidy up the output of edge detectors by reducing all lines
to single pixel thickness. Thinning is normally only applied to binary images, and produces another binary image as output. The thinning operation is related to the hit-and-miss transform, and so it is helpful to have an understanding of that operator before reading on.
How It Works:
Like other morphological operators, the behavior of the
thinning operation is determined by a structuring
element. The binary structuring elements used for
thinning are of the extended type described under the hit-
and-miss transform (i.e. they can contain both ones and
zeros).
The thinning operation is related to the hit-and-miss transform and can be expressed quite simply in terms of it. The thinning of an image I by a structuring element J is:

thin(I, J) = I − hit-and-miss(I, J)

where the subtraction is a logical subtraction defined by X − Y = X ∩ NOT Y.
In everyday terms, the thinning operation is calculated
by translating the origin of the structuring element to
each possible pixel position in the image, and at each
such position comparing it with the underlying image
pixels. If the foreground and background pixels in the
structuring element exactly match foreground and
background pixels in the image, then the image pixel
underneath the origin of the structuring element is set to
background (zero). Otherwise it is left unchanged. Note
that the structuring element must always have a one or a
blank at its origin if it is to have any effect.
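The definition above can be sketched with a small hit-and-miss transform in which the structuring element uses 1 for required foreground, 0 for required background, and -1 for "don't care" (blank); the element shown and the function names are illustrative, not the paper's:

```python
import numpy as np

def hit_and_miss(img, se):
    """Hit-and-miss transform on a binary image: a pixel fires only where
    the structuring element's 1s and 0s exactly match the underlying
    foreground and background pixels (-1 entries are don't-cares)."""
    k = se.shape[0]
    pad = k // 2
    padded = np.pad(img, pad, mode="constant")
    care = se != -1
    out = np.zeros_like(img)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            win = padded[y:y + k, x:x + k]
            out[y, x] = int(np.all(win[care] == se[care]))
    return out

def thin_once(img, se):
    """One thinning step: thin(I, J) = I - hit_and_miss(I, J), where the
    logical subtraction X - Y is X AND NOT Y."""
    return img & ~hit_and_miss(img, se) & 1

# Element that matches interior pixels on the bottom edge of a region.
se = np.array([[1, 1, 1],
               [-1, 1, -1],
               [0, 0, 0]])
blob = np.ones((4, 4), dtype=int)
thinned = thin_once(blob, se)
```

One pass with this element peels matched pixels off the bottom edge; full skeletonization iterates over rotations of such elements until nothing changes.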
3.4.2. Template Matching:
Template matching is a technique in digital image processing for finding small parts of an image which match a template image. A basic method of template matching uses a convolution mask (template), tailored to a specific feature of the search image, which we want to detect. This technique can be easily performed on grey images or edge images.
Improving Template Matching
Improvements can be made to the matching method by using more than one template; these other templates can have different scales and rotations. It is also possible to improve the accuracy of the matching method by hybridizing the feature-based and template-based approaches. Naturally, this requires that the search and template images have features that are apparent enough to support feature matching.
Benefit of Binarization vs Template Matching
The convolution output will be highest at places where the image structure matches the mask structure, where large image values get multiplied by large mask values.
1] Fast binarization
2] Quality of binarization
Figure 6: Template Matching
3.5 Mathematical Equation
S = {I, Igs, B, Filter(), IO}
where:
S = the system, to which we provide an input image and which performs processing on that image
I = {I1, I2, ..., In}, the set of input images (on which processing is performed)
Igs = the grayscale image (RGB images are converted to grayscale; a 24-bit image is converted to 16 bits)
B = {B1, B2, ..., Bn}, the set of blocks
Filter() = the filtering operation
IO = {IO1, ..., IOn}, the set of output (enhanced) images
4. FLOW OF PROPOSED SYSTEM:
We will call the search image S(x, y), where (x, y) represent the coordinates of each pixel in the search image. We will call the template T(xt, yt), where (xt, yt) represent the coordinates of each pixel in the template. We then simply move the center (or the origin) of the template T(xt, yt) over each (x, y) point in the search image and calculate the sum of products between the coefficients in S(x, y) and T(xt, yt) over the whole area spanned by the template. As all possible positions of the template with respect to the search image are considered, the position with the highest score is the best position. This method is sometimes referred to as 'linear spatial filtering' and the template is called a filter mask.
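The sum-of-products search described above can be sketched with numpy; the function name and toy image are illustrative:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def match_template(search, template):
    """Slide the template over every position of the search image, compute
    the sum of products at each position, and return the (row, col) of the
    template's top-left corner at the highest-scoring position."""
    windows = sliding_window_view(search, template.shape)
    scores = np.einsum("ijkl,kl->ij", windows, template)
    return np.unravel_index(np.argmax(scores), scores.shape)

# A bright 2x2 patch hidden in a dark image is located exactly.
search = np.zeros((6, 6))
search[3:5, 2:4] = 1.0
template = np.ones((2, 2))
pos = match_template(search, template)
```

Note that a raw sum of products favors bright regions regardless of shape, which is one reason the document recommends binarizing (or normalizing) before matching.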
4.1 OCR (Optical Character Recognition)
OCR (optical character recognition) is the recognition of
printed or written text characters by a computer. In OCR
processing, the scanned-in image or bitmap is analyzed
for light and dark areas in order to identify each
alphabetic letter or numeric digit. When a character is
recognized, it is converted into an ASCII code. Special
circuit boards and computer chips designed expressly for
OCR are used to speed up the recognition process. OCR
is being used by libraries to digitize and preserve their
holdings. OCR is also used to process checks and credit
card slips and sort the mail.
The output for a given input image (i.e. the binarized image) is taken as the input to OCR, which extracts the text from the image.
5. ADVANTAGES & DISADVANTAGES
5.1 ADVANTAGES
More stable and easy to use for document images with different kinds of degradation.
Superior performance.
Less human effort.
Easy and accurate process.
5.2 DISADVANTAGES
Based on scored performance.
Not very accurate.
6. CONCLUSION:
This paper proposed a system called Image Binarization and OCR Toolkit for Old Degraded Documents. Our system provides a binarized image of old degraded documents and recognizes the characters using an OCR toolkit.
REFERENCES:
[1] S. Lu, B. Su, and C. L. Tan, "Document image binarization using background estimation and stroke edges," Int. J. Document Anal. Recognit., vol. 13, no. 4, pp. 303–314, Dec. 2010.
[2] B. Su, S. Lu, and C. L. Tan, "Binarization of historical handwritten document images using local maximum and minimum filter," in Proc. Int. Workshop Document Anal. Syst., Jun. 2010, pp. 159–166.
[3] M. Sezgin and B. Sankur, "Survey over image thresholding techniques and quantitative performance evaluation," J. Electron. Imag., vol. 13, no. 1, pp. 146–165, Jan. 2004.
[4] J.-D. Yang, Y.-S. Chen, and W.-H. Hsu, "Adaptive thresholding algorithm and its hardware implementation," Pattern Recognit. Lett., vol. 15, no. 2, pp. 141–150, 1994.
[5] M. Cheriet, J. N. Said, and C. Y. Suen, "A recursive thresholding technique for image segmentation," IEEE Trans. Image Process., Jun. 1998, pp. 918–921.
[6] C. T. Yuen, M. Rizon, W. S. San, and T. C. Seong, "Facial features for template matching based face recognition," American J. of Engineering and Applied Sciences, vol. 3, no. 1, pp. 899–903, 2010.
[7] J. G. Kuk, N. I. Cho, and K. M. Lee, "MAP-MRF approach for binarization of degraded document image," in Proc. Int. Conf. Image Process., 2008, pp. 2612–2615.
[8] S. Kumar, R. Gupta, N. Khanna, S. Chaudhury, and S. D. Joshi, "Text extraction and document image segmentation using matched wavelets and MRF model," IEEE Trans. Image Process., vol. 16, no. 8, pp. 2117–2128, Aug. 2007.
[9] A. Dawoud, "Iterative cross section sequence graph for handwritten character segmentation," IEEE Trans. Image Process., vol. 16, no. 8, pp. 2150–2154, Aug. 2007.
[10] B. Su, S. Lu, and C. L. Tan, "A self-training learning document binarization framework," in Proc. Int. Conf. Pattern Recognit., Aug. 2010, pp. 3187–3190.