Challenging Issues in Devanagari Script Recognition

Rameshwar S. Mohite 1, Balaji R. Bombade 2

1 M. Tech. Student, CSE Dept., SGGS IE&T Nanded-431606, India. 2 Assistant Professor, CSE Dept., SGGS IE&T Nanded-431606, India.

Abstract

Optical character recognition (OCR) has been researched extensively, and a large number of articles have been published on the topic over the last few decades. OCR plays a vital role in digital image processing and pattern recognition. Much work has been reported for Roman, Chinese, Japanese and Arabic scripts, but comparatively little has been done on Indian script recognition. In India, more than 300 million people use the Devanagari script for documentation. Although several efficient methodologies for Devanagari script recognition have been proposed, the recognition accuracy for Devanagari is not yet comparable to that of its overseas counterparts. This is predominantly due to the large variety of characters and symbols and their close proximity to one another in the Devanagari script. In this paper, we discuss some challenging issues that arise in the recognition of Devanagari script.

Keywords: OCR, image processing, peculiarities of the Devanagari script, challenging issues in Devanagari script.

1. Introduction

Machine simulation of human activities has been one of the most challenging research areas since the advent of digital computers. The main reason for this effort was not only the challenge of simulating human reading, but also the possibility of useful applications in which the information on paper documents has to be transferred into machine-editable form. OCR is the process of automatic computer recognition of characters and symbols in optically scanned and digitized pages of text [1]. Automatic recognition of the information on documents such as cheques, envelopes, forms, and other manuscripts has numerous practical and commercial applications in banks, post offices, libraries, publishing houses, language processing, and forensic investigation.

At present, many OCR systems are available for handling printed English documents with reasonable levels of accuracy. Such systems are also available for many European languages as well as some Asian languages such as Japanese and Chinese. However, few efforts have been reported on developing OCR systems for Indian languages. India is a multilingual, multi-script country with twenty-two languages, which are written in eleven scripts. Devanagari is an old script that evolved from the Brahmi script. It is used to write many languages, such as Hindi, Konkani, Marathi, Nepali, Sanskrit, Bodo, Dogri and Maithili. Hindi is an official language of India. Devanagari is the second most popular script in the Indian subcontinent and the third most popular in the world [2]. About 300 million people use the Devanagari script for documentation in the central and northern parts of India [3]. It also serves as an auxiliary script for other languages such as Punjabi, Sindhi and Kashmiri.

The rest of this paper is organized as follows: Section 2 reviews previous work. Section 3 describes the peculiarities of the Devanagari script. Challenging issues are discussed in Section 4. The conclusion is given in Section 5.

2. Previous Work

OCR work on printed Devanagari script started in the early 1970s. A good survey of the work on offline recognition of Devanagari script is given in [4]. The first complete OCR systems for printed Devanagari are perhaps due to Palit and Chaudhuri [5] and Pal and Chaudhuri [6]. Work on machine-printed Devanagari has also been done by Bansal et al. [7]. A syntactic pattern analysis system for Devanagari script recognition is presented in Sinha's Ph.D. thesis [8]. OCR is classified into two types, offline recognition and online recognition. In offline recognition the source is either an image or a scanned form of the document, whereas in online recognition the successive points are represented as a function of time and the order of strokes is also available. This study examines the direction of character recognition (CR) research and analyses the limitations of the methodologies used in these systems, which can be classified according to two major criteria:

Rameshwar S Mohite et al, Int.J.Computer Technology & Applications,Vol 5 (3),947-952

IJCTA | May-June 2014 Available [email protected]

ISSN:2229-6093

The data acquisition process (on-line or off-line).

The text type (machine-printed or handwritten).

Regardless of the type to which the problem belongs, an OCR system generally has the following major phases:

Pre-processing.

Segmentation.

Feature extraction.

Classification and recognition.

2.1 Pre-processing

Pre-processing consists of several sub-processes that clean the document image and make it suitable for accurate recognition. The main sub-processes are binarization, noise reduction, skew correction and thinning. Binarization transforms a grayscale image into a black-and-white image. Binarization methods fall into two main classes: global and local. In a global approach, threshold selection yields a single threshold value for the entire image; the most commonly used method is Otsu's method [9]. Local approaches use local information to guide the threshold value pixel by pixel in an adaptive manner, which is well suited to degraded documents [10]. A histogram-based thresholding approach can also be used to convert a grayscale image into a two-tone image. Digital images are susceptible to several types of noise. Noise in a document image is often due to a poor-quality optical scanning device. Salt-and-pepper noise arises from the scanning process and the quality of the paper being scanned, and corrupts individual pixels. A median filter is used to remove salt-and-pepper noise [11], [12]. Wiener filtering and morphological operations can also be applied to remove noise [12]. When a document is scanned using an optical scanner, a small degree of skew is unavoidable. The skew angle is the angle that the text lines in the digital image make with the horizontal direction. Skew estimation and correction are important pre-processing steps in document layout analysis. In [13], a rule-based approach is proposed that removes irrelevant data and corrects skew in scanned text documents of Devanagari script. Thinning is a technique that reduces strokes to single-pixel width so that characters can be recognized more easily. It is applied repeatedly, leaving only a pixel-wide linear representation of each character. Thinning extracts the shape information of the characters. Detailed information about thinning algorithms is available in [14].
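To make the global thresholding step concrete, the following is a minimal Python sketch of Otsu's method [9]: it selects the grey level that maximizes the between-class variance of the histogram and then labels the darker pixels as ink. The function names `otsu_threshold` and `binarize`, the 8-bit grey-level range, and the convention that ink is darker than paper are assumptions made for this illustration; it is not the implementation used in any of the cited systems.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level t that maximizes the between-class variance.

    Class 0 holds pixels with value <= t, class 1 holds the rest.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    w0 = 0      # number of pixels in class 0
    sum0 = 0    # sum of grey levels in class 0
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue        # class 0 is still empty
        if w1 == 0:
            break           # class 1 is empty from here on
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        # Proportional to the between-class variance (constant factor dropped).
        score = w0 * w1 * (mu0 - mu1) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t


def binarize(image):
    """Map a grayscale image (rows of 0..255 ints) to 1 = ink, 0 = background.

    Assumption: dark ink on light paper.
    """
    t = otsu_threshold([p for row in image for p in row])
    return [[1 if p <= t else 0 for p in row] for row in image]


# Tiny hypothetical image: four dark pixels, four light pixels.
gray = [[10, 12, 200, 205],
        [11, 198, 13, 202]]
print(binarize(gray))
```

A single global threshold of this kind is expected to work on clean scans; for degraded documents the local, adaptive approach of [10] would replace `otsu_threshold` with a per-pixel threshold.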

2.2 Segmentation

Segmentation is the process of splitting the document image into lines, words and characters/symbols. Segmentation is a vital phase in an OCR system because it affects the recognition rate. Segmentation can be external or internal. External segmentation is the separation of the various text parts, such as paragraphs, sentences or words. In internal segmentation, an image of a sequence of characters is separated into sub-images of individual characters. The segmentation process involves three steps, namely line segmentation, word segmentation and character segmentation. Segmentation of lines and words is done using the horizontal and vertical projection profiles of the scanned document image [15]. Bansal and Sinha [16] proposed a method for segmenting touching and fused Devanagari characters in printed text. Their strategy uses a two-pass algorithm for the segmentation and separation of Devanagari composite characters into their constituent symbols. In the first phase, words are segmented into easily separable characters or composite characters. Statistical information about the height and width of each separated box is used to hypothesize whether a character box is composite. In the second phase, the hypothesized composite characters are segmented further. The proposed algorithm makes extensive use of the structural properties of the script. In [16], characters are segmented from each Devanagari word by removing the shirorekha. In [18], the touching characters are first identified and then segmented into basic ones by a new fuzzy decision-making approach. This approach was motivated by an examination of the complex ways in which characters touch each other in the Devanagari and Bangla scripts.
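The projection-profile segmentation of lines and words mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the image is already binarized (1 = ink, 0 = background) and free of skew, and the names `runs`, `segment_lines`, `segment_words` and the `min_gap` parameter are hypothetical, not taken from [15].

```python
def runs(profile):
    """Return (start, end) pairs of maximal non-zero runs; end is exclusive."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def segment_lines(image):
    """Text lines are runs of rows whose horizontal projection is non-zero."""
    horizontal_profile = [sum(row) for row in image]
    return runs(horizontal_profile)


def segment_words(image, top, bottom, min_gap=2):
    """Words are runs of inked columns within rows top..bottom-1.

    Runs separated by fewer than min_gap blank columns are merged, since
    small gaps are assumed to lie inside a word rather than between words.
    """
    width = len(image[0])
    vertical_profile = [sum(image[r][c] for r in range(top, bottom))
                        for c in range(width)]
    merged = []
    for start, end in runs(vertical_profile):
        if merged and start - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


page = ["0000000000",
        "0111001110",
        "0101001010",
        "0000000000",
        "0110000000",
        "0000000000"]
image = [[int(ch) for ch in row] for row in page]

for top, bottom in segment_lines(image):
    print((top, bottom), segment_words(image, top, bottom))
```

The choice of `min_gap` matters: with a value of 3 the two words of the first line in this toy page merge into one, which mirrors the under-segmentation that occurs when words are printed close together.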

2.3 Feature Extraction

Feature extraction is highly problem dependent. Good features are those whose values are similar for objects belonging to the same category and distinct for objects in different categories. A better approach to recognition is to segment characters into basic symbols and then recognize each symbol. The system described by Sinha and Mahabala [19] for printed Devanagari characters stores structural descriptions of each symbol of the script in terms of primitives and their relationships. Bansal and Sinha [17] considered several statistical classifying features for Devanagari characters, such as horizontal zero crossings, moments, vertex points, and pixel density in different zones. They also considered word envelope information comprising the number of character boxes, the number of vertical bars, the number of upper modifier boxes, the number of lower modifier boxes, a vector giving the positions of the vertical bars, and a vector giving the type and location of each character box. Jawahar et al. [20] used principal component analysis (PCA) for feature extraction from printed characters. A word-level matching system for searching in printed document images is proposed by Meshesha and Jawahar [21]. Its feature-extraction scheme extracts local features by scanning vertical strips of the word image and combines them automatically based on their discriminatory potential. The features considered are word profiles, moments, and transform-domain representations. In [22], a technique to identify Kannada, Hindi, and English text lines from a printed document is presented. The system is based on the upper and lower profiles of isolated text lines of the input document image. The locations of the connected components of the upper and lower profiles are extracted, and the coefficients of variation of the upper and lower profiles are then calculated. The authors of [23] consider five features: the lower profile, the upper profile, the ink-background transitions, the number of black pixels, and the span of the foreground pixels. The upper and lower profiles measure the distance of the top and bottom foreground pixels from the respective baselines. Ink-background transitions measure the number of transitions from ink to background and the reverse. The number of black pixels provides information about the density of ink in the vertical strip.
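The five per-strip features attributed to [23] can be illustrated with a short Python sketch. It rests on several assumptions that are not stated in the source: each vertical strip is one pixel column wide, the baselines are taken to be the top and bottom edges of the word image, the span is counted inclusively from the first to the last ink pixel, and an empty column reports the full image height for both profiles. The function name `column_features` is hypothetical.

```python
def column_features(word):
    """Return one (upper, lower, transitions, black, span) tuple per column.

    word is a binary image given as rows of 0/1 values, 1 = ink.
    """
    height = len(word)
    features = []
    for c in range(len(word[0])):
        column = [word[r][c] for r in range(height)]
        ink_rows = [r for r, v in enumerate(column) if v]
        if ink_rows:
            upper = ink_rows[0]                   # distance from the top edge
            lower = height - 1 - ink_rows[-1]     # distance from the bottom edge
            span = ink_rows[-1] - ink_rows[0] + 1
        else:
            upper = lower = height                # assumption for empty columns
            span = 0
        # Count ink-to-background and background-to-ink changes down the column.
        transitions = sum(1 for a, b in zip(column, column[1:]) if a != b)
        features.append((upper, lower, transitions, len(ink_rows), span))
    return features


rows = ["1111",
        "0100",
        "0101",
        "0000"]
word = [[int(ch) for ch in row] for row in rows]

for column_index, feature in enumerate(column_features(word)):
    print(column_index, feature)
```

The resulting sequence of tuples, read from left to right, is the kind of input a sequence classifier would consume; any normalization by image height is left out of this sketch.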

2.4 Classification and Recognition

The extracted features are given as the input to the

decision making part of the recognition system. The

performance of a classifier relies on the quality of the features. Two main types of approaches have been

applied for character recognition [24].

The holistic approach

The analytical approach.

In the holistic approach, recognition operates globally on the whole word image, and no attempt is made to identify characters separately. The main advantage of holistic methods is that they avoid segmenting words into characters [25]. Their main drawback is that they handle long words poorly, which reduces recognition accuracy. The analytical approach deals with several levels of representation of the image, such as sub-word or letter recognition. The main advantages of analytical methods are an unlimited vocabulary and high recognition accuracy. Their main drawback is that they are vulnerable to segmentation errors. Analytical methods require both external and internal segmentation. Several techniques are used to classify the extracted features in existing systems, such as neural networks, support vector machines and combination classifiers.

A neural network is a computing architecture that involves massively parallel interconnection of adaptive neural processors. The neural networks most widely used in OCR systems are multi-layer perceptrons (MLP). MLPs are used as classifiers because of their universal approximation property and good generalization ability [26], [30]. A back-propagation neural network classifier is proposed by K. Y. Rajput et al. [27]. In [23], a recurrent neural network known as Bidirectional Long Short-Term Memory (BLSTM) is used. The support vector machine (SVM) is based on statistical learning theory. A classification task generally involves separating the data into two sets, a training set and a testing set. Each instance in the training set contains one target value and several attributes. Many researchers have used SVMs successfully, e.g. C. V. Jawahar et al. [20], Sandhya Arora et al. [26], and Umapada Pal et al. [28]. Numerous classification methods have been proposed for Devanagari script recognition, and each has specific strengths and weaknesses. Hence, combinations of classifiers are often used to solve a given classification problem. For Indian scripts, combinations such as SVM and ANN [26], K-means and SVM [29], and MLP and minimum edit distance [31] have been used.

3. Peculiarities of Devanagari Script

Devanagari script has 34 consonants (Vyanjana) and 13 vowels (Swara). Basic characters are formed from vowels and consonants. A vowel can be written as an independent letter or as one of a variety of diacritic marks placed above, below, to the left or to the right of the consonant it belongs to. Vowels written in this way are known as modifiers, and the characters so formed are called conjuncts; different conjunct forms are shown in Figure 1. Occasionally, two or more consonants merge and take a new shape, which is known as a composite character. Devanagari is written from left to right and has no upper-case and lower-case characters. Every character has a horizontal line at the top, called the shirorekha or header line. The header lines of two or more basic or composite characters join to form a word.

Figure 1. Different conjunct forms.

Devanagari words can normally be divided into three

discrete zones: top zone, core zone, and bottom zone.




The top zone and core zone are always separated by the header line, whereas there is no analogous feature separating the bottom zone from the core zone. The top zone contains the top modifiers, and the bottom zone contains the lower modifiers. The core zone contains the vowels, consonants, conjunct forms and composite characters, as shown in Figure 2.

Figure 2. Three zones of a Devanagari script.
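Because the header line is the only explicit separator between zones, a common heuristic is to locate it as the band of rows with the largest horizontal projection. The Python sketch below follows that heuristic; it is an assumption of this illustration rather than a method prescribed in this paper, and the names `find_header_line`, `split_zones` and the `ratio` parameter are hypothetical. It splits a binary word image into the top zone and the part below the header line, and makes no attempt to separate the core zone from the bottom zone, since no analogous separating feature exists.

```python
def find_header_line(word, ratio=0.9):
    """Return (first_row, last_row) of the header line band.

    The band starts at the row with the largest ink count and grows over
    neighbouring rows whose ink count is at least ratio times that peak.
    """
    profile = [sum(row) for row in word]
    peak_row = profile.index(max(profile))
    limit = ratio * profile[peak_row]
    top = bottom = peak_row
    while top > 0 and profile[top - 1] >= limit:
        top -= 1
    while bottom + 1 < len(profile) and profile[bottom + 1] >= limit:
        bottom += 1
    return top, bottom


def split_zones(word):
    """Split a word image into (top_zone, below_header) around the header line."""
    top, bottom = find_header_line(word)
    return word[:top], word[bottom + 1:]


rows = ["00100000",   # top modifier
        "00100000",
        "11111111",   # header line, two pixels thick
        "11111111",
        "01000100",   # core strokes
        "01000100",
        "00001000"]   # lower modifier
word = [[int(ch) for ch in row] for row in rows]

print(find_header_line(word))
top_zone, below_header = split_zones(word)
print(len(top_zone), len(below_header))
```

The heuristic assumes a single, unbroken header line per word; words with a missing header line or with characters set in different font sizes would not satisfy it.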

4. Challenging issues in Devanagari script

Recognition of printed Devanagari script is a challenging problem because the same character varies in appearance with font family, font size, font orientation, etc. Differences in font family and size make the recognition task difficult; under such conditions pre-processing, feature extraction and recognition are not robust. Sometimes text in the same font and size contains both bold and normal characters, so stroke width is also an issue that hinders recognition. For example, Figure 3 shows a four-character word image in which the first character is set in a larger font size than the remaining characters. As a result, the characters in the word do not all fall under a single shirorekha or header line.

Figure 3. Font variation in Devanagari script.

The header line plays a vital role in Devanagari script recognition. It is used for identifying word boundaries and for skew correction. If the header line is absent from a word, skew correction and character segmentation of the printed word become problematic. The presence of more than one header line can cause a word to be confused with two text lines.

A Devanagari word carries information about the number of characters, the number of vertical bars, the number of modifiers, and the positions of the vertical bars. Based on this information a word can be recognized. A Devanagari word contains a complex mixture of some or all of these elements. Confusion arises when identifying the upper modifier, the lower modifier and the exact position of each modifier, and sometimes from a weak join between a modifier and its character. Moreover, if the position of a modifier is missed, the word cannot be interpreted. The situation becomes harder when there is a gap between the character and the modifier, so that the modifier does not touch the core character at all. The bottom modifier called "nukta" (a dot at the bottom) usually does not touch the core character. Figure 4 illustrates one such example.

Figure 4. The gap between character and modifier.

Image degradation can arise from multiple sources, such as poor ink quality, tight spacing between characters, document age, etc. If the document is heavily degraded, extracting any meaningful information is very difficult. OCR for Devanagari script becomes even more difficult when composite characters and modifiers occur together in a noisy image. Upper modifiers such as the anuswar are small and difficult to distinguish from noise. Figure 5a illustrates an example: the small dot on top of the word is a valid mark, but because of its small size it can be mistaken for noise. Several isolated marks that are vowel modifiers, namely "Anuswar", "Visarga" and "Chandra Bindu", add to the misinterpretation. Most errors in conjunct recognition are probably due to confusion with vowels and the virama symbol. When a word contains a special symbol such as the omkar or the rupee sign, recognition becomes very difficult. This is illustrated in Figure 5b.

Figure 5. (a) Modifier confused with noise.

(b) Word with special symbol.
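The confusion between a small modifier and noise can be reproduced with a naive despeckling step. The Python sketch below labels 8-connected components and deletes every component smaller than a given area; this size-based filter is an assumed, generic noise-removal rule chosen for illustration, and the names `connected_components`, `remove_small_components` and `min_area` are hypothetical. In the toy word, the anuswar-like dot above the header line and a stray speck below the word have the same area, so the filter cannot tell them apart.

```python
from collections import deque


def connected_components(image):
    """Return a list of components, each a list of (row, col) ink pixels.

    Pixels are grouped with 8-connectivity using breadth-first search.
    """
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    components = []
    for r in range(height):
        for c in range(width):
            if not image[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, component = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(component)
    return components


def remove_small_components(image, min_area):
    """Keep only components with at least min_area pixels."""
    cleaned = [[0] * len(image[0]) for _ in image]
    for component in connected_components(image):
        if len(component) >= min_area:
            for y, x in component:
                cleaned[y][x] = 1
    return cleaned


rows = ["0001000",   # anuswar-like dot above the header line
        "0000000",
        "1111111",   # header line
        "0100010",
        "0100010",
        "0000000",
        "1000000"]   # stray speck of noise
word = [[int(ch) for ch in row] for row in rows]

cleaned = remove_small_components(word, min_area=2)
print(len(connected_components(word)), len(connected_components(cleaned)))
```

After filtering, only the main body of the word survives: the speck is gone, but so is the valid dot. Distinguishing the two would need information beyond component size, such as position relative to the header line.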

Devanagari words consist of modifiers, curved shapes, joined or fused characters and composite characters, which calls for segmentation at several levels. For these reasons, recognition of Devanagari script is more difficult than recognition of English. Line separation may be ambiguous when text lines overlap. If two top modifiers touch each other, they are segmented as one component. For instance, "Chandra" and "Bindu" are two components that occur together in many words, but because of image degradation they may overlap and no longer appear as separate components. The problem of separating a lower modifier from a consonant is most severe for a character that has a modifier-like loop in its lower part. The gap between words is an important factor in word separation, because closely spaced words may not be segmented into individual words. Devanagari words contain characters such as अ (a) and भ (bha). If the header line is removed using vertical projection, such characters lose part of their shape, which can lead to misrecognition. Figure 6 illustrates this effect. Errors in text line segmentation also cause problems in word segmentation and character segmentation.

Figure 6. (a) Word containing such characters.

(b) Loss of shape of these characters.

In the classification phase, some characters have similar features, which confuses the classifier. Certain character classes are confused with each other because of their similarity, as shown in Figure 7a. In Figure 7b, "na" (the second character in the image) can be confused with "ta". This happens because the "na" character in Devanagari has a hole at the beginning which has been filled in; in such a situation it can be confused with "ta" [23].

Figure 7. (a) Characters with similar features.

(b) Ambiguity between na and ta.

A conjunct form of ra and the dash have two different meanings in the Devanagari script depending on position in the word. As shown in Figure 8a, the conjunct symbol is interpreted and pronounced as a combination of 'ra' + 'ya', whereas in Figure 8b the dash is used in place of the word "to". It is difficult for the classifier to distinguish the conjunct symbol from the dash because the same shape carries two different meanings in the Devanagari script.

Figure 8. Analogous feature symbol.

The quality of the training data also affects the performance of a word recognizer. Poor-quality hardware leads to poorly generated documents, in which important and critical parts of a word are lost at the initial stage itself. This misleads word recognition, particularly for Devanagari script. Recognition of the Devanagari script requires algorithms designed specifically for the script. Better algorithms and techniques are required for correct and efficient recognition because of the problems discussed throughout this paper.

5. Conclusion

OCR techniques and algorithms vary from script to script. Different approaches to Devanagari script recognition have been proposed in various journal articles, but recognition rates are not yet outstanding and improvements have been marginal. Generally, recognition rates depend on pre-processing, segmentation and feature extraction. Existing methodologies are not capable of segmenting and recognizing documents in complex cases. Devanagari script recognition is a difficult task. In this paper, the problems have been described, and we hope they will be solved in the future to make OCR systems more effective.

6. References

[1] U. Pal and B. B. Chaudhuri, "Indian script character recognition: A survey", Pattern Recognit., vol. 37, pp. 1887-1899, 2004.

[2] Raghuraj Singh, C. S. Yadav, Prabhat Verma, Vibhash Yadav, "Optical Character Recognition for Printed Devanagari Script Using Artificial Neural Network", IJCSC, Vol. 1, No. 1, pp. 91-95, Jan-June 2010.

[3] R. M. K. Sinha, "A journey from Indian scripts processing to Indian language processing", IEEE Ann. Hist. Comput., vol. 31, no. 1, pp. 8–31, Jan./Mar. 2009.

[4] R. Jayadevan, S. R. Kolhe, P. M. Patil and U. Pal, "Offline Recognition of Devanagari Script: A Survey", IEEE Transactions on Systems, Man and Cybernetics-Part C: Applications and Reviews, 2011.

[5] S. Palit, B. B. Chaudhuri, "A feature-based scheme for the machine recognition of printed Devanagari script", Pattern Recognition, Image Processing and Computer Vision, India, pp. 163-168, 1995.

[6] U. Pal and B. B. Chaudhuri, "Printed Devanagari script OCR system", Vivek, vol. 10, pp. 12–24, 1997.

[7] V. Bansal, "Integrating Knowledge Sources in Devanagari Text Recognition System", Ph.D. Thesis, 1996.

[8] R. M. K. Sinha, "A Syntactic pattern analysis system and its application to Devnagari script recognition", Ph.D. Thesis, Dept. Elect. Eng., Indian Institute of Technology, Kanpur, India, 1973.

[9] N. Otsu, "A threshold selection method from grey level histogram", IEEE Trans on SMC, Vol. 9, pp. 62-66, 1979.

[10] Y. Yang and H. Yan, "An adaptive logical method for binarization of degraded document images", Pattern Recognition (33), pp. 787-807, 2000.




[11] G. G. Rajput, Rajeswari Horakeri, "Shape Descriptors based Handwritten Character Recognition Engine with Application to Kannada Characters", International Conference on Computer & Communication Technology (ICCCT), pp. 135-141, 2011.

[12] P. Patidar, M. Gupta, S. Shrivastava and A. Nagawat, "Image De-noising by Various Filter for Different Noise", vol. 9, no. 4, pp. 45-50, Nov. 2010.

[13] Pramod Kumar Sharma, Kapil Dev Dhingra, Sudip Sanyal, "A Rule Based Approach for Skew Correction and Removal of Insignificant Data from Scanned Text Documents of Devanagari Script", SITIS, pp. 899-903, 2007.

[14] L. Lam, S. W. Lee, and C. Y. Suen, "Thinning Methodologies- A Comprehensive Survey", IEEE Trans. PAMI, vol. 14, pp. 869–885, Sept. 1992.

[15] B. B. Chaudhuri and U. Pal, "An OCR system to read two Indian language scripts: Bangla and Devanagari", in Proc. 4th Conf. Document Anal. Recognit., pp. 1011–1015b, 1997.

[16] V. Bansal and R. Sinha, "Segmentation of touching and fused Devanagari characters", Pattern Recogn. 35 (4), pp. 875–893, 2002.

[17] V. Bansal and R. M. K. Sinha, "Integrating knowledge sources in Devanagari text recognition", IEEE Trans. Syst. Man Cybern. A: Syst. Hum., vol. 30, no. 4, pp. 500–505, Jul. 2000.

[18] U. Garain and B. Chaudhuri, "Segmentation of touching characters in printed Devanagari and Bangla scripts using fuzzy multifactorial analysis", IEEE Trans. Syst. Man Cybern. Part C 32 (4), pp. 449–459, 2002.

[19] R. M. K. Sinha and H. Mahabala, "Machine recognition of Devnagari script", IEEE Trans. Syst. Man Cybern., vol. 9, no. 8, pp. 435–441, Aug. 1979.

[20] C. V. Jawahar, P. Kumar, and S. S. R. Kiran, "Bilingual OCR for Hindi-Telugu documents and its applications", in Proc. 7th Conf. Document Anal. Recognit., pp. 1–5, 2003.

[21] M. Meshesha and C. V. Jawahar, "Matching word images for content-based retrieval from printed document images", Int. J. Document Anal. Recognit., vol. 11, pp. 29–38, 2008.

[22] P. A. Vijaya and M. C. Padma, "Text line identification from a multilingual document", in Proc. Int. Conf. Digital Image Process., pp. 302–305, 2009.

[23] Naveen Sankaran and C. V. Jawahar, "Recognition of Printed Devanagari Text Using BLSTM Neural Network", ICPR, pp. 322-325, IEEE, 2012.

[24] J. Hull, T. K. Ho, J. Favata, V. Govindaraju, S. Srihari, "Combination of Segmentation-based and Wholistic Handwritten Word Recognition Algorithms", Elsevier Publ., pp. 261-272, 1992.

[25] J. Rocha and T. Pavlidis, "New method for word recognition without segmentation", in Proceedings of SPIE, volume 1906, page 76, 1993.

[26] Sandhya Arora et al., "Performance Comparison of SVM and ANN for Handwritten Devnagari Character Recognition", IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 3, May 2010.

[27] K. Y. Rajput and Sangeeta Mishra, "Recognition and Editing of Devnagari Handwriting Using Neural Network", Proceedings of SPIT-IEEE Colloquium and International Conference, Mumbai, India, Vol. 1, 66.

[28] Umapada Pal, Sukalpa Chanda, Tetsushi Wakabayashi, Fumitaka Kimura, "Accuracy Improvement of Devnagari Character Recognition Combining SVM and MQDF".

[29] Satish Kumar, "Evaluation of Orthogonal Directional Gradients on Hand-Printed Datasets", Intl. Journal of Information Technology and Knowledge Management, Volume 2, No. 1, pp. 203-207, Jan-Jun 2009.

[30] Anil K. Jain, Robert P. W. Duin, and Jianchang Mao, "Statistical Pattern Recognition: A Review", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 1, pp. 4-37, January 2000.

[31] Sandhya Arora, Debotosh Bhattacharjee, Mita Nasipuri, D. K. Basu, M. Kundu, "Recognition of Non-Compound Handwritten Devnagari Characters using a Combination of MLP and Minimum Edit Distance", International Journal of Computer Science and Security (IJCSS), Volume (4): Issue-1, pp. 107-120.


