chapter 1 - shodhgangashodhganga.inflibnet.ac.in/bitstream/10603/27115/4/04...script identification...

Chapter 1

Introduction

---------------------------------------------------------------------------------------------------------------------------

Analysis of document images for information extraction has become prominent in the recent past. A wide variety of information that has conventionally been stored on paper is now being converted into electronic form for better storage and intelligent processing. This requires processing the documents with digital image processing methods. To develop a successful multi-lingual Optical Character Recognition (OCR) system, separation or identification of the different scripts is an essential step. The document, once its script is recognized, can then be submitted to the respective OCR system for character/numeral recognition. In this chapter, a brief overview of document image analysis, OCR and script recognition is presented. Literature related to Indic script identification is reviewed. Further, properties of the major Indic scripts are described.

---------------------------------------------------------------------------------------------------------------------------

Optical Character Recognition (OCR) is an interesting and challenging field of research in pattern recognition. To develop a successful multi-lingual OCR system, separation or identification of the different scripts is an essential step. In a multi-lingual country like India, a script identification system facilitates the OCR system. India has 22 official languages, and 12 different scripts [8] are used to write them. A systematic, staged approach can be used to identify the script of a document, after which the document is fed to the corresponding OCR system for character/numeral recognition.

1.1 Document Image Analysis

Document image analysis is the process that performs the overall interpretation of document images. It refers to the algorithms and techniques that are applied to images of documents to obtain a computer-readable description from pixel data. It recognizes the text and graphics components in document images and extracts the intended information from them. It also complements OCR by organizing the document and applying outside knowledge to interpret it. It combines image processing, document formatting, script identification and character recognition in order to deal with a particular application. Thus, document image analysis deals with the global issues involved in recognizing written script in images.

Two categories of document image analysis can be defined: text processing and graphical processing.

Text processing deals with the textual components of a document image. Its tasks are:

- Determining the skew (any tilt at which the document may have been scanned)

- Finding columns, paragraphs, text lines and words, and recognizing the text by OCR.

Graphical processing deals with the non-textual elements, such as tables, lines, images, symbols, delimiters and company logos.
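The task of finding text lines is often approached with a horizontal projection profile, a feature that also appears in the literature reviewed later in this chapter. The following is a minimal sketch under stated assumptions, not the method of any cited work: the page is assumed to be already binarized and deskewed, it is stored as nested lists (1 = ink, 0 = background), and the names `horizontal_profile` and `find_text_lines` are illustrative.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumed binary image: 1 = ink, 0 = background


def horizontal_profile(img: Image) -> List[int]:
    """Count the ink pixels in every row of the image."""
    return [sum(row) for row in img]


def find_text_lines(img: Image) -> List[Tuple[int, int]]:
    """Return (first_row, last_row) for each run of rows that contain ink."""
    lines = []
    start = None
    for y, count in enumerate(horizontal_profile(img)):
        if count > 0 and start is None:
            start = y                      # a text line begins
        elif count == 0 and start is not None:
            lines.append((start, y - 1))   # the text line ended on the previous row
            start = None
    if start is not None:                  # line touching the bottom edge
        lines.append((start, len(img) - 1))
    return lines


page = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0],
]
print(find_text_lines(page))  # [(1, 2), (4, 4)]
```

The same idea applied to columns of a single text line (a vertical profile) gives a first approximation of word and character boundaries. A skewed scan blurs the gaps in the profile, which is one reason skew correction comes first.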

1.2 Optical Character Recognition

A well-known product of document image analysis is Optical Character Recognition (OCR) software, which recognizes the characters in a scanned document. OCR is a field of research in pattern recognition, artificial intelligence and computer vision. Powerful OCR software saves a great deal of time and effort when creating, processing and repurposing documents. The technology is used in a broad range of applications. Typical applications are handwritten character recognition, processing of textual web images, and information extraction from digital libraries. Large digital archives are currently available; however, their full value can be realized only by accessing the information that is embedded in the digital images.

The problem of character recognition can be divided along two lines: (i) typewritten versus handwritten recognition, and (ii) offline versus online recognition.

A typewritten OCR system recognizes text that has been typed and then scanned prior to the recognition process.

The field of handwriting recognition is divided into the sub-fields of on-line and off-line recognition. In on-line recognition, special devices are used to track the movement of the pen and record temporal information, so recognition concentrates on interpreting the dynamic handwriting motion. This technology is used mostly for handwriting analysis on tablet PCs, PDAs and mobile phones, among others. In off-line recognition, an image of the handwritten text is scanned and recorded. In general, off-line recognition is considered the more difficult task because of the lack of temporal information. It is possible to construct the image of the handwriting from the pen movement information; however, it is not possible to reconstruct the pen movement from the image alone.

Generally, handwritten character recognition refers to the process of recognizing static handwriting, usually focusing on the shape of the character against its background. The process works offline, on a source image that does not change. The system attempts to recognize a character that has been written by a human. This is usually the more difficult task for the following reasons:

- Complexity in pre-processing

- Complexity in feature extraction and classification

- Sensitivity of the scheme to variation in the handwritten text of a document

- Characters in the document have descenders and ascenders

- Variation in the shapes of characters written by different writers

- Similarity between some symbols of different scripts

1.3 Script Identification

The goal of a handwriting recognition system is to process handwritten data electronically with the same or nearly the same accuracy as humans. By carrying out this process with computers, a large amount of data can be transcribed at high speed. An integrated approach to the design of OCRs for all Indian scripts has great benefits. It is necessary to identify the different script forms before running an individual OCR system. In a country like India, script identification is a must for a multilingual OCR system. It acts as a pre-processor to the OCR system, identifying the script type of the document so that the specific OCR tool can be selected, as illustrated in Fig. 1.1. In a multi-script environment, a bank of OCRs corresponding to all the different scripts is expected. The characters in an input document can then be recognized reliably by selecting the appropriate OCR system from the OCR bank.

FIGURE 1.1 Stages of document processing in a multi-script environment
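The role of script identification as a pre-processor that selects one engine from an OCR bank can be sketched as a simple lookup. This is an illustration only: the engines are stubs that return a placeholder string, and `OCR_BANK`, `make_stub_engine` and `recognize` are hypothetical names, not part of any real OCR library.

```python
from typing import Callable, Dict

OcrEngine = Callable[[str], str]  # takes an image identifier, returns recognized text


def make_stub_engine(script: str) -> OcrEngine:
    """Build a placeholder engine; a real system would wrap a script-specific OCR."""
    def engine(image_id: str) -> str:
        return f"<{script} text recognized from {image_id}>"
    return engine


# Hypothetical OCR bank: one engine per supported script.
OCR_BANK: Dict[str, OcrEngine] = {
    script: make_stub_engine(script) for script in ("Kannada", "Devanagari", "Roman")
}


def recognize(image_id: str, identify_script: Callable[[str], str]) -> str:
    """Identify the script first, then dispatch to the matching engine in the bank."""
    script = identify_script(image_id)
    try:
        engine = OCR_BANK[script]
    except KeyError:
        raise ValueError(f"no OCR engine registered for script {script!r}") from None
    return engine(image_id)


# Usage with a dummy identifier that always answers "Kannada".
print(recognize("page_001", lambda _image_id: "Kannada"))
```

Keeping the identifier and the bank separate means a new script can be supported by registering one more engine, without touching the identification stage.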

Many documents in the Indian environment are multi-script in nature. A document containing text in more than one script is called a multi-script document. Most people use more than one script for communication. Many Indian documents contain two scripts, namely the state’s official script (local script) and English. In certain cases, a document may contain three scripts, for example, the state’s official script (local), Devanagari (national) and Latin (English). An automatic script identification technique is useful to identify the script type of a particular word or line in a multi-script document, segment out its characters and feed them to the appropriate script-specific OCR for recognition. Fig. 1.2 shows several examples of multi-script documents.

FIGURE 1.2: Examples of multi-script document images: (a) Malayalam and English

(b) Kannada and English (c) Tamil and English (d) Oriya and English

Recognition of scripts from document images is at the heart of any document image understanding system. Typically, in a multi-script document, different paragraphs, text blocks, text lines or words on a page are written in different scripts (Fig. 1.2). The structure of the script and the writing style pose challenges for script type recognition. The script recognition system operates in the following phases, as shown in Fig. 1.3:

1. Pre-processing (noise removal, enhancement, skew detection, segmentation)

2. Feature extraction

3. Script recognition (in the Indian context, Kannada, English, Devanagari, Tamil, Telugu, Gujarati, Punjabi, Oriya, Bengali, Malayalam and Urdu).

FIGURE 1.3 Stages of script recognition
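The three phases can be read as a chain of functions. The sketch below is a toy illustration of that chain and not the system developed in this work: pre-processing is reduced to fixed-threshold binarization, the only feature is ink density, and the classifier and its labels `script_A` and `script_B` are hypothetical placeholders.

```python
from typing import Callable, List

GrayImage = List[List[int]]    # assumed 8-bit intensities, 0 (black) to 255 (white)
BinaryImage = List[List[int]]  # 1 = ink, 0 = background


def binarize(gray: GrayImage, threshold: int = 128) -> BinaryImage:
    """Phase 1 (pre-processing): pixels darker than the threshold become ink."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def ink_density(img: BinaryImage) -> List[float]:
    """Phase 2 (feature extraction): one toy feature, the fraction of ink pixels."""
    total = sum(len(row) for row in img)
    return [sum(map(sum, img)) / total] if total else [0.0]


def identify_script(
    gray: GrayImage,
    extract: Callable[[BinaryImage], List[float]],
    classify: Callable[[List[float]], str],
) -> str:
    """Run the three phases in order and return the predicted script label."""
    return classify(extract(binarize(gray)))


# Placeholder classifier with hypothetical labels; a real one is trained on data.
def toy_classifier(features: List[float]) -> str:
    return "script_A" if features[0] > 0.4 else "script_B"


sample = [[0, 255], [100, 200]]
print(identify_script(sample, ink_density, toy_classifier))  # script_A
```

Passing the feature extractor and classifier as arguments mirrors the way the reviewed methods differ mainly in phases 2 and 3 while sharing similar pre-processing.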

Documents written in Indian scripts present great challenges to an OCR designer due to the large number of letters in the alphabet, the sophisticated ways in which they combine, and the complicated graphemes that result. The problem is compounded by the unstructured manner in which popular fonts are designed. Further, handwritten script recognition for Indic scripts is still in its infancy compared to non-Indic scripts such as Latin, Chinese, Japanese and Korean, and is worthy of serious investigation.

1.4 Script Recognition - Literature Review

A script is defined as the graphic form of the writing system used to write statements expressible in a language. A script class refers to a particular style of writing and the set of characters used in it. Languages throughout the world are typeset in many different scripts. A script may be used by only one language or may be shared by many languages, sometimes with slight variations from one language to another. In India, there are many documents written in regional scripts. For example, due to the policy of the state governments in India, official transactions are carried out in the regional language, while English is used for communication with other states.

Significant work on identifying scripts in multilingual documents has been carried out by various researchers. Existing script identification techniques mainly depend on features extracted from document images at block, line or word level. Block level script identification identifies the script of a given document from among documents in various scripts. In line level script identification, a document image can contain more than one script, but each line must be in a single script. Word level script identification allows the document to contain more than one script, and the script of every word is identified. The script recognition methods available in the literature at block level, line level and word level are reviewed below, in that order.

Quite a few publications are found in the literature on differentiating Indian scripts at block level. Peake and Tan [1] have proposed a method for automatic script and language identification from document images using multiple channel (Gabor) filters and gray level co-occurrence matrices for seven scripts: Chinese, English, Greek, Korean, Malayalam, Persian and Russian. Tan [2] has developed a rotation invariant texture feature extraction method for automatic script identification for six scripts: Chinese, Greek, English, Russian, Persian and Malayalam. Judith [3] has proposed a method for script and language identification of Arabic, Chinese, Cyrillic, Devanagari, Japanese and Roman using connected component features. To discriminate between printed text lines in Arabic and English, three techniques are presented in [4]. The first is based on detecting the peaks in the horizontal projection profile. The second is based on the moments of the profiles, using neural networks for classification. The third is based on classifying run length histograms using neural networks. Dhandra et al. [5] have proposed a script identification method at block level that extracts the features in two stages. In the first stage, morphological erosion and opening by reconstruction are carried out on a document image in the horizontal, vertical, left diagonal and right diagonal directions. In the second stage, the average pixel distribution is found in these directions. The classification is done using a nearest neighbor classifier. The experiments are performed on Kannada, Urdu, English and Devanagari scripts with a block size of 128 x 128 pixels. Multilingual document recognition technology and its application in China, which is useful for building multilingual digital libraries, are reported in [6]. The key technologies include statistical character recognition, structural analysis for similar character discrimination, character segmentation, script identification and post-processing. A hierarchical blind script identifier for 11 different Indian scripts is reported in [7]. The nodes of the hierarchical tree use different feature-classifier combinations: Gabor and Discrete Cosine Transform features have been evaluated using nearest neighbor, linear discriminant and support vector machine classifiers.
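To make the idea of directional features with a nearest neighbor classifier concrete, the sketch below is a heavily simplified illustration in the spirit of [5], not a reproduction of it. It uses plain binary erosion with a 3-pixel line structuring element (no opening by reconstruction), takes the surviving ink fraction in each direction as the feature, and classifies by Euclidean distance. All names and the tiny images are illustrative assumptions.

```python
from math import dist
from typing import List, Tuple

Image = List[List[int]]  # assumed binary block: 1 = ink, 0 = background

# (dy, dx) steps of the 3-pixel line structuring element for each direction.
DIRECTIONS = {
    "horizontal": (0, 1),
    "vertical": (1, 0),
    "right_diagonal": (-1, 1),
    "left_diagonal": (1, 1),
}


def erode(img: Image, dy: int, dx: int) -> Image:
    """Binary erosion: a pixel survives only if it and both of its neighbours
    along (dy, dx) are ink. Pixels outside the image count as background."""
    h, w = len(img), len(img[0])

    def ink(y: int, x: int) -> bool:
        return 0 <= y < h and 0 <= x < w and img[y][x] == 1

    return [
        [1 if ink(y, x) and ink(y - dy, x - dx) and ink(y + dy, x + dx) else 0
         for x in range(w)]
        for y in range(h)
    ]


def directional_features(img: Image) -> List[float]:
    """Fraction of the block that survives erosion in each of the four directions."""
    area = len(img) * len(img[0])
    return [sum(map(sum, erode(img, dy, dx))) / area for dy, dx in DIRECTIONS.values()]


def nearest_neighbor(sample: List[float], training: List[Tuple[List[float], str]]) -> str:
    """Return the label of the training vector closest to the sample."""
    return min(training, key=lambda item: dist(sample, item[0]))[1]


h_bar = [[0, 0, 0], [1, 1, 1], [0, 0, 0]]   # toy block dominated by horizontal strokes
v_bar = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]   # toy block dominated by vertical strokes
training = [(directional_features(h_bar), "horizontal-heavy"),
            (directional_features(v_bar), "vertical-heavy")]

query = [[0] * 5, [0] * 5, [1] * 5, [0] * 5, [0] * 5]
print(nearest_neighbor(directional_features(query), training))  # horizontal-heavy
```

In a realistic setting the training set would hold feature vectors from many labelled blocks per script, and the structuring element would be sized relative to the stroke width.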

More methods are available in the literature for script recognition at line level from printed documents than from handwritten documents. Twelve Indian scripts have been explored to develop an automatic script recognizer at text line level in [8, 10]. The recognizer classifies using the characteristics and shape based features of the script. Devanagari was discriminated through the headline feature, and structural shapes were designed to discriminate English from the other Indian scripts. Further, the work has been extended using water reservoirs to accommodate more scripts rather than triplets. In [9], an automated technique for the identification of printed Roman, Chinese, Arabic, Devanagari and Bangla text lines from a single document is presented. An automatic scheme to identify text lines of different Indian scripts from a printed document is attempted in [11]. Features based on the water reservoir principle, contour tracing, profiles etc. are employed to identify the scripts. In [12], a system is presented for printed text lines in Oriya and Roman scripts. Classification is done through horizontal projection profiles of the pixel intensity in different zones, along with the line height and the number of characters present in the line. In [13], texture is used as a tool for determining the script of a handwritten document image, based on the observation that text has a distinct visual texture, to classify English, Devanagari and Urdu. Handwritten blocks and lines are used, and 13 spatial spread features are extracted using morphological filters to form the feature set. In [14], a model to identify the script type of a trilingual document printed in Kannada, Hindi and English scripts is proposed. The distinct characteristic features of these scripts are studied from the nature of the top and bottom profiles, and the model is trained to learn the distinct features of each script.
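The headline feature mentioned above lends itself to a short illustration. The sketch below is one plausible way to test a binarized text line for a headline and is not the procedure of [8, 10]: it looks for the row with the most ink and checks that it lies in the upper half of the line and spans most of the width. The function name and the threshold `min_fill` are assumptions made for the example.

```python
from typing import List

Image = List[List[int]]  # assumed binary text-line image: 1 = ink, 0 = background


def has_headline(line_img: Image, min_fill: float = 0.7) -> bool:
    """Heuristic headline test: the densest row must sit in the upper half of
    the line and cover at least `min_fill` of the line width."""
    profile = [sum(row) for row in line_img]          # horizontal projection profile
    width = len(line_img[0])
    peak_row = max(range(len(profile)), key=profile.__getitem__)
    in_upper_half = peak_row < len(profile) / 2
    return in_upper_half and profile[peak_row] / width >= min_fill


# Toy line with a continuous bar near the top, as in Devanagari.
with_headline = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
]
# Toy line made of separate vertical strokes, with no bar.
without_headline = [
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
]
print(has_headline(with_headline), has_headline(without_headline))  # True False
```

A test of this kind cannot separate Devanagari from other headline scripts such as Bangla, which is why the reviewed systems combine it with further shape based features.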

A brief review of the work proposed in the literature at word level follows. Chain code based representation and manipulation of handwritten images is reported in [15]. A survey of offline cursive script word recognition is presented in [16]. The survey classifies the methods into three categories: segmentation-free methods, segmentation-based methods and the perception-oriented approach. Most of the survey focuses on the algorithms that were proposed to realize the recognition phase. Two different approaches have been proposed in [17] for script identification at the word level from a bilingual document containing Roman and Tamil scripts. In the first approach, words are divided into three distinct spatial zones. The spatial spread of a word in the upper and lower zones, together with the character density, is used to identify the script. The second approach analyses the directional energy distribution of a word using Gabor filters with suitable frequencies and orientations. Word level script identification from a document containing English, Devanagari and Telugu text is reported in [18]. In [19], a method for identification and separation of text words of Kannada, Devanagari and Roman scripts using discriminating features is presented. In [20], using a piece-wise projection method, the destination address block (DAB) is segmented into lines and then words are extracted. The busy-zone of the word is computed using water reservoirs. Finally, word-wise Bangla/Devanagari and English scripts are identified using features based on the matra and water reservoir concepts. A system for word-wise handwritten script identification for Indian postal automation is reported in [21]. A knowledge based approach to determine the postal code is proposed in [22]. In [23], a method based on morphological opening by reconstruction of an image in different directions and on regional descriptors is proposed for script identification at word level. The method is based on the observation that every text has a distinct visual appearance. In [24], a script identification algorithm is analyzed which takes into account the fact that the script changes at the word level in most Indian bilingual or multilingual printed documents. A Gabor function based multichannel directional filtering approach for both text area separation and script identification at the word level is reported in [25]. In [26], the effectiveness of Gabor and discrete cosine transform (DCT) features for word level multi-script identification has been independently evaluated using nearest neighbor, linear discriminant and support vector machine (SVM) classifiers. In [27], distinct features of each script are used to identify Kannada, English and Devanagari using a voting technique. The method proposed in [28] automatically separates the scripts of handwritten words from a document written in Bengali or Devanagari mixed with Roman script.
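The zone-based idea described for [17] can be illustrated with a small sketch. How the zones are delimited is not stated in this review, so the rule used here is an assumption: the middle zone is taken as the band of rows whose ink count is at least half of the peak row, and everything above or below it forms the upper or lower zone. The function name `zone_spread` is illustrative.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumed binary word image: 1 = ink, 0 = background


def zone_spread(word: Image) -> Tuple[float, float, float]:
    """Return the share of ink in the (upper, middle, lower) zones of a word."""
    profile = [sum(row) for row in word]   # horizontal projection profile
    peak = max(profile)
    if peak == 0:                          # blank image: no ink in any zone
        return (0.0, 0.0, 0.0)
    # Assumed zoning rule: rows with at least half the peak density form the core band.
    core = [y for y, count in enumerate(profile) if count >= 0.5 * peak]
    top, bottom = core[0], core[-1]
    total = sum(profile)
    upper = sum(profile[:top]) / total
    middle = sum(profile[top:bottom + 1]) / total
    lower = sum(profile[bottom + 1:]) / total
    return (upper, middle, lower)


word = [
    [0, 1, 0, 0],   # ascender part
    [1, 1, 1, 1],   # core band
    [1, 1, 1, 0],   # core band
    [0, 0, 1, 0],   # descender part
]
upper, middle, lower = zone_spread(word)
print(round(upper, 3), round(middle, 3), round(lower, 3))  # 0.111 0.778 0.111
```

Scripts differ in how much ink falls outside the core band, so the upper and lower shares, together with a density measure, can serve as inputs to a classifier.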

Background information on past research into both global and local approaches to script identification in document images is reported in [29]. Both kinds of approach can perform script identification at document, line and word level. Gopal Datt Joshi et al. [30] have proposed a hierarchical classification scheme which uses features consistent with human perception for script identification from Indian documents.

1.5 Introduction to Major Indian Scripts and Languages

India is a multilingual country. It has 22 official languages, which include Assamese, Bengali, English, Gujarati, Hindi, Konkani, Kannada, Kashmiri, Malayalam, Marathi, Nepali and Oriya. Further, not all Indian languages have a unique script. Some of them use the same or a similar script. For example, languages such as Hindi, Marathi, Rajasthani, Sanskrit and Nepali are written using the Devanagari script; Assamese and Bengali are written using the Bengali script; Urdu and Kashmiri are written using the Urdu script; and Telugu and Kannada use similar scripts. In all, twelve different scripts are used to write these 22 languages: Roman, Bengali, Devanagari, Gurumukhi, Gujarati, Kashmiri, Malayalam, Oriya, Tamil, Kannada, Telugu and Urdu. Apart from Roman and the Urdu script, which is of Perso-Arabic origin, they have evolved from a single source, the phonographic Brahmi script, first documented extensively in the edicts of Emperor Asoka in the third century BC. They are defined as “syllabic alphabets” or abugidas, in that the unit of encoding is a syllable of speech; however, the corresponding orthographic units show a distinctive internal structure and a constituent set of graphemes [32]. A word in these scripts is written as a sequence of these orthographic syllabic units, referred to as characters.

FIGURE 1.4 Twelve Indian scripts: Roman, Devanagari, Bangla, Gujarati, Kannada, Kashmiri, Malayalam, Oriya, Gurumukhi, Tamil, Telugu and Urdu

Apart from numerals, vowels and consonants, there are compound characters in most of the Indian regional scripts. Compound characters are formed by combining two or more consonants, and they are more complex in shape than the basic consonants. Further, a vowel following a consonant may take a modified shape and is placed on the left, right, top or bottom of the consonant, depending on the vowel. Such characters are called modified characters. A brief description of the languages using the Latin, Devanagari, Gujarati, Gurumukhi, Telugu, Kannada, Tamil, Malayalam, Bengali and Oriya scripts, which are considered in our study, is presented below. All these scripts are written from left to right.
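The structure of compound and modified characters can be seen in how such characters are encoded as text. The sketch below uses Python's standard `unicodedata` module on Devanagari as an example; it concerns encoded text rather than document images, and is offered only to illustrate the composition described above. The helper names `describe` and `script_of` are illustrative.

```python
import unicodedata
from typing import List

KA = "\u0915"       # consonant KA
SSA = "\u0937"      # consonant SSA
VIRAMA = "\u094D"   # sign that suppresses the inherent vowel and joins consonants
SIGN_I = "\u093F"   # dependent form of the vowel I

compound = KA + VIRAMA + SSA   # two consonants joined into one conjunct (ksha)
modified = KA + SIGN_I         # consonant with a vowel sign drawn to its left


def describe(text: str) -> List[str]:
    """List the Unicode character name of every code point in the text."""
    return [unicodedata.name(ch) for ch in text]


def script_of(ch: str) -> str:
    """First word of the Unicode name, which names the script for these letters."""
    return unicodedata.name(ch).split()[0]


print(describe(compound))
print(describe(modified))
print(script_of(KA), script_of("A"))  # DEVANAGARI LATIN
```

A single visual shape can therefore correspond to several stored code points, and the vowel sign in the second example is stored after the consonant even though it is drawn before it. This gap between visual order and logical order is part of what makes segmentation of Indic scripts harder than that of Roman text.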

i) English: English is the most common auxiliary language and is widely used on almost all continents. In the last couple of centuries it has virtually attained the status of a universal language. In many Asian countries, such as India and Malaysia, English is accepted and used as a means of communication among the people themselves. In a multilingual country like India, where 22 official languages and hundreds of local dialects are in use, English acts as a binding force among countrymen. The Indian parliament has also recognized English as an official language in addition to Hindi, which is considered the national language. The modern English alphabet is a Latin-based alphabet consisting of 26 letters, each with an upper and a lower case form. In addition, there are some special symbols and numerals. The English script is also termed a bicameral script (a script using two separate cases). The letters A, E, I, O and U are considered vowel letters; the remaining letters are considered consonant letters (Fig. 1.5). Capital letters are A, B, C, etc.; lower case letters are a, b, c, etc. The structure of the English alphabet is dominated by vertical and slanted strokes.

FIGURE 1.5 English Alphabets: vowels (upper case) and consonants (upper case)

ii) Hindi: Hindi is an Indo-Aryan language of North India, with equal status to English as an official language throughout India. It is one of several languages spoken in different parts of the sub-continent, with about 487 million speakers. Hindi is written in the Devanagari script. The script is phonetic, so that Hindi, unlike English, is pronounced as it is written. The Devanagari alphabet descended from the Brahmi script sometime around the 11th century AD. It was originally developed to write Sanskrit but was later adapted to write many other languages. The writing system is an alpha-syllabary (abugida). The script has 12 vowels and 34 consonants (Fig. 1.6). Consonant letters carry an inherent vowel, which can be altered or muted by means of diacritics (matras). Vowels can be written as independent letters or by using a variety of diacritical marks, which are written above, below, before or after the consonant they belong to. This feature is common to most of the alphabets of South and South East Asia. When consonants occur together in clusters, special conjunct letters are used. The Devanagari script is used to write Bhojpuri, Marathi, Mundari, Nepali, Pali, Sanskrit, Sindhi and many more languages, including Hindi. Devanagari is recognizable by a distinctive horizontal line that runs along the tops of the letters and links them together.

FIGURE 1.6 Hindi Vowels (with vowel diacritics) and Consonants

iii) Gujarati: The Gujarati script is one of the modern scripts of India and was derived from the Devanagari script during the 16th century CE. The major difference between Gujarati and Devanagari is the lack of the top horizontal bar in Gujarati. Otherwise the two scripts are fairly similar. Gujarati is a syllabic alphabet in which all consonants have an inherent vowel. Vowels can be written as independent letters or by using a variety of diacritical marks, which are written above, below, before or after the consonant they belong to. The Gujarati character set provides 14 vowels and 34 consonants (plus 2 compound characters, ksha and gna), as shown in Fig. 1.7.

FIGURE 1.7 Gujarati Vowels (with vowel diacritics) and Consonants

iv) Punjabi: Punjabi is an Indo-Aryan language spoken by about 105 million people, mainly in West Punjab in Pakistan and East Punjab in India. Punjabi descended from the Shauraseni language of medieval northern India and became a distinct language during the 11th century. The Gurumukhi (Punjabi) alphabet was devised during the 16th century and is modeled on the Landa alphabet. It is a syllabic alphabet in which all consonants have an inherent vowel. Diacritics, which can appear above, below, before or after the consonant they belong to, are used to change the inherent vowel. Modern Gurumukhi has forty-one consonants, nine vowel symbols, two symbols for nasal sounds, and one symbol which duplicates the sound of any consonant. In addition, four conjuncts are used (Fig. 1.8).

FIGURE 1.8 Punjabi Vowels (with vowel diacritics) and Consonants

v) Telugu: Telugu is a Dravidian language spoken by about 75 million people, mainly in the southern Indian state of Andhra Pradesh, where it is the official language. It is also spoken in neighbouring states such as Karnataka, Tamil Nadu, Orissa, Maharashtra and Chhattisgarh. The origins of the Telugu alphabet can be traced to the Brahmi alphabet of ancient India, which developed into an alphabet used for both Telugu and Kannada, which in turn split into two separate alphabets between the 12th and 15th centuries AD. The writing system is a syllabic alphabet in which all consonants have an inherent vowel. Diacritics, which can appear above, below, before or after the consonant they belong to, are used to change the inherent vowel. Written text consists of sequences of simple and/or complex characters. The alphabet consists of 60 symbols, of which 16 are vowels, 3 are vowel modifiers and 41 are consonants, as shown in Fig. 1.9.

FIGURE 1.9 Telugu Vowels, Vowels diacritics and Consonants

vi) Kannada: Kannada is the official language of the southern Indian state of Karnataka. It is a Dravidian language spoken by about 44 million people in the Indian states of Karnataka, Andhra Pradesh, Tamil Nadu and Maharashtra. The earliest inscriptional records in Kannada are from the 6th century. The Kannada script is closely akin to the Telugu script in origin. Under the influence of Christian missionary organizations, the Kannada and Telugu scripts were standardized at the beginning of the 19th century. The writing system is an alpha-syllabary in which all consonants have an inherent vowel. Other vowels are indicated with diacritics, which can appear above, below, before or after the consonants. Kannada has 16 vowels and 34 consonants. There are about 250 basic, modified and compound character shapes in Kannada (Fig. 1.10).

FIGURE 1.10 Kannada Vowels and Consonants

vii) Tamil: Tamil is a Dravidian language spoken by around 52 million people in India, Sri Lanka, Malaysia, Vietnam, Singapore, Canada, the USA, the UK and Australia. It is the first language of the Indian state of Tamil Nadu and is spoken by a significant minority of people (2 million) in north-eastern Sri Lanka. The earlier Tamil inscriptions were written in the Brahmi, Grantha and Vattezhuthu scripts. The Tamil script is partially “alphabetic” and partially syllable-based (Fig. 1.11); its writing system is a syllabic alphabet. There are twelve vowels and eighteen consonants. The consonants are made up of six surds, their corresponding six sonants, and six medials. Combinations of consonants with vowels give rise to new symbols or result in modified symbols.

FIGURE 1.11 Tamil Vowels (with vowel diacritics) and Consonants

viii) Malayalam: Malayalam belongs to the southern group of Dravidian languages, along with Tamil, Kota, Kodagu and Kannada. It has a close affinity to Tamil. In the early thirteenth century the Malayalam script developed from a script known as vattezhuthu (round writing), a descendant of the Brahmi script. It is a syllabic alphabet in which all consonants have an inherent vowel. Diacritics, which can appear above, below, before or after the consonant they belong to, are used to change the inherent vowel. The modern Malayalam alphabet has 13 vowel letters, 36 consonant letters and a few other symbols, as shown in Fig. 1.12.

FIGURE 1.12 Malayalam Vowels and Consonants

ix) Bengali: The Bengali (also called Bangla) script is used for writing the Bengali language, spoken mostly in Bangladesh and India. The Bengali alphabet is derived from the Brahmi alphabet. It is also closely related to the Devanagari alphabet, from which it started to diverge in the 11th century AD. The Bengali script has a total of 11 vowel graphemes. All of these are used in both Bengali and Assamese, the two main languages using the script. The script is also used for a number of other Indian languages, including Sylheti; Assamese uses it with one or two modifications. Bengali writing shares some similarities with the Dravidian-language scripts, particularly in the shapes of some vowel letters, but it is generally more similar to the Aryan-language scripts, in particular Devanagari.

Thirty-five consonant letters and eleven independent vowel letters are used in this script (Fig. 1.13). Each vowel letter also has a diacritic form, which combines with a consonant to modify the inherent vowel.

FIGURE 1.13 Bengali Vowels (with vowel diacritics) and Consonants

x) Odiya (Oriya): The spoken languages Oriya, Bengali and Assamese have a common mother language, Prakrit (or Pali), which diversified into three branches in Eastern India: Magadhi, Maitheli and Sudrusa. Magadhi became modern Oriya, Maitheli modern Bengali and Sudrusa modern Assamese. The Oriya script is derived from the ancient Brahmi script through various transformations. The Oriya alphabet is complex, consisting of 268 symbols (13 vowels, 36 consonants, 10 digits and 210 conjuncts). Fig. 1.14 shows the vowels, vowel diacritics and consonants of Oriya.

FIGURE 1.14 Oriya vowels, vowel diacritics and Consonants (panels: vowels and vowel diacritics; consonants)


xi) Urdu: The Urdu alphabet is the right-to-left alphabet used for the Urdu language.

It is a modification of the Persian alphabet, which is itself a derivative of the Arabic

alphabet. With 38 letters and no distinct letter cases, the Urdu alphabet is typically

written in the calligraphic Nasta'liq script.

FIGURE 1.15 The Urdu alphabet, with names in the Devanagari and Roman alphabets

1.6 Motivation and Problem Definition

Automatic script identification is crucial to meet the growing demand for electronic

processing of volumes of documents written in different scripts. Script identification

from handwritten documents is a challenging task due to large variation in

handwriting as compared to printed documents. Many of the documents in India,

handwritten or machine printed, contain two or more scripts. Further, documents
combining a regional script with the Latin script occur more frequently than
other combinations. From the literature survey, it is evident that
handwritten script recognition is at an early stage [3, 13, 16, 20, 21, 22, 23, 28],
whereas most of the reported studies address script
recognition for printed documents [4, 5, 7, 8, 9, 10, 11, 12, 14, 17, 19, 22, 24, 25, 26,
27, 30, 31]. This motivated us to work in this area and design algorithms for script
recognition from handwritten documents. Based on the work carried out in this area,
it was proposed to design efficient algorithms to identify the script type at the
block, line and word levels, with the observation that in multi-script documents a
specific script may appear at any of these levels. Ten major Indian scripts,
including the Roman (Latin) script, are considered in the proposed work.

1.7 Organization of the Thesis

The thesis is organized into seven chapters.

In Chapter 1, a brief description of Document Image Analysis and the OCR
system is presented. The importance of handwritten script identification is also
described. Methods and techniques available in the literature are presented. Different
types of scripts and languages in the Indian context are discussed.

Chapter 2 presents details regarding the collection of handwritten script
documents from various sources. As no standard database for handwritten script
identification of Indian scripts is available, we have created a large dataset for
carrying out experiments with the methods proposed in the thesis. A novel method for
skew correction of the scanned document images is presented. Denoising is performed,
and binary images of the blocks, lines and words of the handwritten
document images are extracted to create the database. The proposed skew correction
technique is evaluated on various printed and handwritten script documents.

In Chapter 3, methods used for extracting the features of various Indian scripts

are described. In many cases, the most discriminating information is hidden in the
frequency content of the signal rather than in the spatial domain. So for feature
extraction, Gabor filters, the DCT and wavelets are considered. Also, a brief description of the

classifiers used for recognizing the script of the block, line, and word is presented in

this chapter.
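As an illustration of what a frequency-domain feature can look like, the following minimal sketch computes a 2-D DCT of an image block and summarizes it as the fraction of energy in a few frequency zones. It is not the feature set used in the thesis: the function names (`dct_matrix`, `dct2`, `dct_zone_energies`) and the four-zone layout are assumptions made only for this example.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n matrix (rows are basis vectors)."""
    k = np.arange(n).reshape(-1, 1)   # frequency index
    i = np.arange(n).reshape(1, -1)   # sample index
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)        # DC row has its own scaling
    return c


def dct2(block):
    """2-D DCT-II of a grey-level block, applied separably to rows and columns."""
    block = np.asarray(block, dtype=float)
    rows, cols = block.shape
    return dct_matrix(rows) @ block @ dct_matrix(cols).T


def dct_zone_energies(block, n_zones=4):
    """Fraction of DCT energy per zone, from low (zone 0) to high frequency.

    The zone of coefficient (u, v) is set by max(u/rows, v/cols); this
    layout is a hypothetical choice for the sketch.
    """
    energy = dct2(block) ** 2
    rows, cols = energy.shape
    u = np.arange(rows).reshape(-1, 1) / rows
    v = np.arange(cols).reshape(1, -1) / cols
    zone = np.minimum((np.maximum(u, v) * n_zones).astype(int), n_zones - 1)
    feats = np.array([energy[zone == z].sum() for z in range(n_zones)])
    total = feats.sum()
    return feats / total if total > 0 else feats
```

A smooth block concentrates its energy in the lowest zone, while a block with fine stroke detail shifts energy towards the higher zones, which is the kind of difference a classifier can exploit.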

A novel method for recognizing the script at block level is presented in

Chapter 4. Block-level script identification determines the script of a given document
from among documents written in various scripts. Blocks of size 512 x 512 pixels are input to
the proposed system for script recognition. Features based on Fourier, DCT and
wavelet transforms are extracted to maximize the distinction between English, Devanagari and
local official language scripts. The classification is done using the k-nearest neighbour
(k-NN) classifier. The results clearly show that the combined features that constitute
DCT and wavelets yielded better results for recognition of the script.
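The k-NN decision rule itself is simple enough to sketch in a few lines. The example below is only illustrative: the function name `knn_predict`, the Euclidean distance and the toy two-dimensional feature vectors are assumptions, and the thesis may use a different distance measure or value of k.

```python
from collections import Counter

import numpy as np


def knn_predict(train_features, train_labels, query, k=3):
    """Label a query feature vector by majority vote among its k nearest
    training vectors (Euclidean distance; ties go to the nearer neighbour)."""
    train_features = np.asarray(train_features, dtype=float)
    query = np.asarray(query, dtype=float)
    distances = np.linalg.norm(train_features - query, axis=1)
    nearest = np.argsort(distances, kind="stable")[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy feature vectors standing in for block-level features (made-up values).
features = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
            [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
labels = ["Devanagari"] * 3 + ["Roman"] * 3
print(knn_predict(features, labels, [0.2, 0.1]))
```

Because k-NN has no training phase beyond storing the labelled feature vectors, its accuracy depends almost entirely on how well the chosen features separate the scripts.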

In Chapter 5, handwritten script identification methods at line level and

portion of the line level are presented. A document image can contain more than one
script, but each line is assumed to be written in a single script. Gabor filter banks are used for

feature extraction from the line and from a portion of the line. The portion of the line may contain
two or more words. Script classification for a portion of the line is simpler
and faster than analysis of the entire line extracted from the
handwritten document. Experiments are performed for identifying the script type among
eight Indian scripts, including English, for bi-script documents. Gabor combined with
DCT and Gabor combined with wavelets are proposed for tri-script documents. The
classification is done using k-NN and SVM classifiers. At line level, features are
extracted by using Gabor combined with DCT and Gabor combined with wavelets.
The results are promising when DCT or wavelet transforms are applied to the
Gabor-convolved images, as compared to using the Gabor-convolved images alone.
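A Gabor filter bank can be sketched as a set of oriented, Gaussian-windowed cosine kernels applied to the line image. The code below is a minimal illustration rather than the configuration used in the thesis: the wavelengths, the four orientations, the choice sigma = wavelength / 2, and the function names are all assumptions. Filtering is done by circular convolution through the FFT, which is adequate for summary statistics of the response.

```python
import numpy as np


def gabor_kernel(wavelength, theta, gamma=1.0):
    """Real (cosine-phase) Gabor kernel, made zero-mean.

    sigma = wavelength / 2 and a half-width of 2 * wavelength are
    assumptions chosen for this sketch.
    """
    sigma = 0.5 * wavelength
    half = int(2 * wavelength)
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)    # axis along the carrier
    yr = -x * np.sin(theta) + y * np.cos(theta)   # axis across the carrier
    envelope = np.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2.0 * sigma ** 2))
    kernel = envelope * np.cos(2.0 * np.pi * xr / wavelength)
    return kernel - kernel.mean()


def filter_fft(image, kernel):
    """Circular convolution via the FFT (kernel must fit inside the image).

    The kernel is not re-centred, so the response is shifted; this does not
    affect the mean and standard deviation used as features below.
    """
    padded = np.zeros(image.shape, dtype=float)
    kh, kw = kernel.shape
    padded[:kh, :kw] = kernel
    return np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(padded)))


def gabor_features(image, wavelengths=(4.0, 8.0), n_orient=4):
    """Mean and standard deviation of |response| for every filter in the bank."""
    image = np.asarray(image, dtype=float)
    feats = []
    for lam in wavelengths:
        for i in range(n_orient):
            theta = i * np.pi / n_orient
            response = np.abs(filter_fft(image, gabor_kernel(lam, theta)))
            feats.extend([response.mean(), response.std()])
    return np.array(feats)
```

With the defaults this gives 2 x 4 filters and 16 features per image. The combination described above would apply a DCT or wavelet transform to each Gabor-convolved image before summarizing it, instead of taking the statistics directly.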

Script identification at word level is proposed in Chapter 6. The method

presented in Chapter 5 for script identification at line level of the document image is
used for word-level identification. To increase the accuracy of script recognition,
a neural network classifier with ranking of features is considered. Experiments are
carried out on nine different scripts. It is observed that the performance of the neural
network classifier is better than that of k-NN and SVM. The proposed method is
evaluated on word databases consisting of words with more than two characters, two
characters, and one character, respectively. Encouraging results are obtained.

Conclusions and future work are presented in Chapter 7. In conclusion, the
methods proposed for identification of scripts from handwritten documents are
summarized. Limitations of the proposed methods and future extensions are also
discussed.