Sindhi OCR Using Back Propagation Neural Network

International Journal of Advanced Computer Science, Vol. 3, No. 3, Pp. 113-117, Mar., 2013.

Manuscript Received: 4 Oct. 2011

Revised: 25 Oct. 2011

Accepted: 14 Jan. 2013

Published: 15 Feb. 2013

Keywords: ANN, artificial intelligence, supervised learning

Abstract: This paper proposes an OCR solution for Sindhi character recognition using Artificial Neural Networks (ANNs) and identifies the major alphabet differences between Sindhi and Arabic from an OCR perspective. A large body of Sindhi literature exists only in hard copy and needs to be converted to soft copy so that everyone can access and search it. Sindhi is a rich language with fifty-two characters and the ability to absorb words from other languages. Compared with other languages encoded in Unicode, Sindhi characters differ in “shape”, “cursive style”, and “position of character”. These properties carry linguistic information, but they make writing and printing harder and add complexity to document digitization (OCR). The recognition system takes Sindhi characters as input either from a drawing control or typed in the “MB Lateefi” font. The input is then cropped and converted to a defined size with horizontal and vertical attributes. The training process of the proposed system uses unsupervised learning, which changes the randomly initialized weight matrix so that it moves closer to the input. Training stops when the weights are nearly equal to the values presented at the input layer. For every input character, one neuron becomes the winner neuron.

1. Introduction

Sindhi is one of the major languages of Pakistan and is spoken by approximately 30-40 million people [8]; after Urdu, it is the second most widely spoken language of Pakistan. It is an Indo-Aryan language of the lower Indus River valley. Sindhi is flexible in adopting new words into its vocabulary, a mark of a rich language that readily absorbs words from other languages. It has the potential to grow and to meet the need for new sounds by adding characters and words over time. This paper compares the Sindhi and Arabic alphabets from an OCR perspective and identifies the major differences between them. Its second objective is to propose a solution for Sindhi OCR using an artificial neural network.

1 Ali Muhammad Nizamani is with SZABIST, Karachi, Pakistan (Email: [email protected]). 2 Prof. Naeem Ul Hassan Janjua is with Bahria University, Karachi, Pakistan (Email: [email protected]).

A. Comparison of Sindhi & Arabic Language Alphabets with OCR Perspective

Sindhi is written in an Arabic-style script that also incorporates elements of the Persian writing style. Like Arabic, it is written from right to left. The Sindhi script comprises fifty-two characters and seven diacritic signs [9]. Of the fifty-two characters, three are borrowed from Persian and twenty-nine are inherited from Arabic. The structure of the Sindhi alphabet is shown in Fig. 1.

Fig. 1. Combination of Sindhi Alphabets. [2]

There are twenty modified characters, which mark the difference between the Sindhi and Arabic alphabets. They supply the sounds of Sindhi that the inherited characters do not cover, and they were developed by changing the structure of existing characters. For example, the character for “Ghay” is created by combining “gaf” and “hay”. New characters are also created by adding “Nukta” (dots) and other diacritics. For example, four-dot characters such as “Cha” and “Bhay”, which do not exist in Arabic, were created for those sounds. Combining more than one character, increasing the number of dots, and adding diacritics together yield the twenty additional characters shown in Fig. 2.

ڃ ,ڱ ,ٻ ,ڄ , ڳ ,،ههگ ،ڌ ڀ ، ههج ,،ڙ ،ڻ ،ڏ ،ڍ ،ٽ ،ڊ ٺ ,ڦ ,ٿ ,ڇ ,ک

Fig. 2 Modified Characters for Sindhi Sounds. [2]

Fig. 3 Stand alone, Initial, Medial, Final Characters positions [2]

Sindhi OCR using Back propagation Neural Networks
Ali Muhammad Nizamani 1 & Prof. Naeem Ul Hassan Janjua 2

International Journal Publishers Group (IJPG) ©

Sindhi is written in a cursive style, in which characters are joined together to form a word or another character. A character can take one of four major positions, which distinguish the character styles. These positions are shown in Fig. 3.

As Fig. 3 shows, changing the position of a character within a word creates a different sound. Diacritics are very important in the Sindhi alphabet: they help create additional characters that give written and spoken form to sounds. For example, four dots were introduced in Sindhi to supply missing sounds; the characters “Bhay”, “Thay”, and “Fay” use four dots. Similarly, placing “Alf mudA” or “hamza” over another character produces a different character sound [2].
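The four character positions described above can be illustrated with Unicode, which encodes separate presentation forms for many Arabic-script letters. The sketch below is only an illustration and is not part of the proposed system: it assumes Python's standard `unicodedata` module, and it assumes that the four code points starting at U+FB52 are the presentation forms of ٻ (U+067B), one of the modified characters in Fig. 2. The helper name `positional_forms` is hypothetical.

```python
import unicodedata

BASE = "\u067b"  # ٻ, one of the modified Sindhi characters listed in Fig. 2


def positional_forms(first_code_point):
    """Map four consecutive presentation-form code points to their position names."""
    forms = {}
    for offset in range(4):
        ch = chr(first_code_point + offset)
        name = unicodedata.name(ch)          # the name ends with "<POSITION> FORM"
        position = name.split()[-2].lower()  # isolated, final, initial, or medial
        forms[position] = ch
    return forms


forms = positional_forms(0xFB52)  # assumed start of the presentation forms of ٻ
for position, ch in forms.items():
    # NFKC normalization folds a presentation form back to its base letter.
    folded = unicodedata.normalize("NFKC", ch)
    print(position, f"U+{ord(ch):04X}", folded == BASE)
```

From an OCR point of view, all four shapes should be recognized as the same base character, which is why a recognizer typically outputs the base code point and leaves shaping to the text renderer.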

B. Self Organizing Maps

Many research ideas are inspired by humans and nature, for example genetic algorithms, decision support systems, and robots. To serve humanity, researchers try to make machines work like humans. The aim of machine learning is to create machines that can make decisions based on their experience. The level of intelligence of a machine depends on the historical data and on the model or algorithm used for a particular learning process.

Compared with supervised learning, unsupervised learning has some complexity and accuracy problems, especially with abnormally written characters, which mislead the system during recognition. This problem can be addressed by building a word or character dictionary that is searched for possible character compositions, because the presence of similar known patterns will definitely help to remove ambiguity between characters. In unsupervised learning, if the target result is not achieved, the neural network tries to find the best comparable result [4].

The back propagation algorithm is used in layered feed-forward ANNs. Such a network has three layers (input, hidden, and output), and signals travel in the forward direction. The network receives its input at the neurons of the input layer, and the neurons of the output layer give the output of the network. Back propagation follows the supervised learning method, which means that the network is trained on inputs for which the expected output is supplied. During training the network computes its output, and then the error (the difference between actual and expected results) is calculated. Training starts with random weights, which are then adjusted to achieve a minimal error. Each artificial neuron in an ANN that implements back propagation computes a weighted sum of its inputs, to which the activation function is applied. A back propagation network learns from training patterns: each input pattern is applied to the input units and propagated forward, and the activation that arises at the output layer is compared with the correct output pattern to calculate an error. After training, the network is able to classify the set of inputs it was trained on, and it is then tested further with patterns that were not used in training.
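The training loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the sigmoid activation, the learning rate, the layer sizes, and the names `TinyBackpropNet` and `train` are all choices made for this example.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyBackpropNet:
    """Three-layer feed-forward network trained with plain back propagation."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        # Training starts from small random weights; the last column is a bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                      for _ in range(n_out)]

    def forward(self, x):
        xb = list(x) + [1.0]
        hidden = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in self.w_hidden]
        hb = hidden + [1.0]
        out = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in self.w_out]
        return hidden, out

    def train_pattern(self, x, target, rate=0.5):
        hidden, out = self.forward(x)
        xb, hb = list(x) + [1.0], hidden + [1.0]
        # Output deltas: error times the derivative of the sigmoid.
        d_out = [(t - o) * o * (1.0 - o) for t, o in zip(target, out)]
        # Hidden deltas: output deltas propagated back through w_out.
        d_hid = [h * (1.0 - h) * sum(d_out[k] * self.w_out[k][j] for k in range(len(d_out)))
                 for j, h in enumerate(hidden)]
        for k, row in enumerate(self.w_out):
            for j in range(len(row)):
                row[j] += rate * d_out[k] * hb[j]
        for j, row in enumerate(self.w_hidden):
            for i in range(len(row)):
                row[i] += rate * d_hid[j] * xb[i]
        # Squared error of this pattern, measured before the weight update.
        return sum((t - o) ** 2 for t, o in zip(target, out))


def train(net, samples, max_epochs=5000, max_error=0.01):
    """Repeat epochs until the summed squared error drops below max_error."""
    error = float("inf")
    epoch = 0
    for epoch in range(1, max_epochs + 1):
        error = sum(net.train_pattern(x, t) for x, t in samples)
        if error < max_error:
            break
    return epoch, error


# Two toy 2 x 2 patterns, flattened to four inputs, with one output neuron per class.
samples = [([1, 0, 0, 1], [1, 0]), ([0, 1, 1, 0], [0, 1])]
net = TinyBackpropNet(n_in=4, n_hidden=3, n_out=2)
epochs_used, final_error = train(net, samples)
print(epochs_used, final_error)
```

The hidden deltas are computed before the output weights are changed, so each update uses the weights that produced the error.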

2. OCR Approaches

A. Pattern Matching approach for Urdu Characters

Urdu is written in a cursive style and resembles Sindhi writing, in which characters are connected to each other to form a word or character. The Urdu alphabet consists of 40 characters, which exhibit features such as a single loop, a double loop, or an incomplete loop; dots and diacritics are also part of several characters. In the pattern matching approach, the system is trained by first taking an input image, creating a chain code for every character, and storing the chain codes in a data dictionary in an XML file. Training consists of preprocessing and of line and character segmentation. Each trained character in the XML data dictionary holds its chain code information. The recognition process takes an image as input and creates its chain code. It then compares that chain code with those in the XML data dictionary and matches the pattern signature against the stored patterns [6].
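The chain code idea can be sketched as follows. The example assumes the common Freeman 8-direction coding and exact string matching against an in-memory dictionary; the cited work stores its dictionary in an XML file, and its exact coding and comparison rules are not given here, so `chain_code`, `match`, and the direction table are illustrative assumptions.

```python
# Freeman 8-direction codes for a step (dx, dy); y grows downward, as in image rows.
DIRECTIONS = {(1, 0): 0, (1, -1): 1, (0, -1): 2, (-1, -1): 3,
              (-1, 0): 4, (-1, 1): 5, (0, 1): 6, (1, 1): 7}


def chain_code(points):
    """Encode an ordered list of 8-connected (x, y) points as a string of directions."""
    codes = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        step = (x1 - x0, y1 - y0)
        if step not in DIRECTIONS:
            raise ValueError(f"{(x0, y0)} and {(x1, y1)} are not 8-connected neighbours")
        codes.append(str(DIRECTIONS[step]))
    return "".join(codes)


def match(code, dictionary):
    """Return the label whose stored chain code equals the input code, or None."""
    for label, stored in dictionary.items():
        if stored == code:
            return label
    return None


# A stroke that runs two pixels right and then two pixels down.
stroke = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
dictionary = {"corner": chain_code(stroke)}  # stands in for the trained dictionary
print(chain_code(stroke), match(chain_code(stroke), dictionary))
```

Exact matching is the simplest possible comparison; a practical system would more likely tolerate small differences between chain codes.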

B. Statistical Analysis Approach for Arabic Characters

To recognize Arabic characters, an artificial neural network was developed and trained using the least mean squares (LMS) algorithm. This approach creates a matrix of binary numbers that serves as input to a simple feature extraction process. In the LMS algorithm, the weights of the input neurons are adjusted iteratively and the optimum weight is calculated for a single neuron: the network weights are moved along the negative of the gradient of the performance function, and after several iterations the weights are adjusted. The approach uses a vector of 35 elements to represent a single Arabic letter, as shown in Fig. 4.

Fig. 4 Vector View of Character Noon [10]

Statistical analysis showed that the pixel values of an Arabic letter are largely uncorrelated. This finding motivated an investigation of the standard deviation as a distinctive feature of a letter. According to Figure 3, the system takes the 28 typed letters of the Arabic alphabet as input. Every character is represented by a matrix of 7 * 5 binary pixels [10], as shown in Fig. 4.
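A minimal sketch of the LMS (Widrow-Hoff) update for a single linear neuron on 35-element binary vectors is given below, together with the standard deviation feature mentioned above. The learning rate, the number of epochs, the toy patterns, and the function names are assumptions made for illustration and are not taken from [10].

```python
import statistics


def lms_train(samples, n_inputs, rate=0.01, epochs=200):
    """Train one linear neuron; samples is a list of (input_vector, desired_output)."""
    weights = [0.0] * n_inputs
    bias = 0.0
    for _ in range(epochs):
        for x, desired in samples:
            output = bias + sum(w * v for w, v in zip(weights, x))
            error = desired - output
            # Step along the negative gradient of the squared error.
            weights = [w + rate * error * v for w, v in zip(weights, x)]
            bias += rate * error
    return weights, bias


def std_feature(vector):
    """Population standard deviation of the pixel values of one letter."""
    return statistics.pstdev(vector)


# Two made-up 7 * 5 letters, flattened to 35 binary pixels each.
letter_a = [1 if i % 2 == 0 else 0 for i in range(35)]
letter_b = [1 if i % 3 == 0 else 0 for i in range(35)]
weights, bias = lms_train([(letter_a, 1.0), (letter_b, -1.0)], n_inputs=35)
print(std_feature(letter_a), std_feature(letter_b))
```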

C. Back propagation neural network for Arabic Characters

Back propagation is the neural network approach used by Hamza, Ali A. for Arabic character recognition. This approach designs and trains a classifier that recognizes any combination of characters, with size and font attributes, produced using Microsoft Word. In that study, the 28 basic Arabic characters plus 10 numerals, and 52 Latin characters plus 10 numerals, are used as the set of inputs for the recognition process. The approach has the common three-layer ANN architecture of input, hidden, and output layers. After training, the back propagation network takes input for character recognition. It iterates over the input for several epochs, propagating back through the layers, until it reaches the desired result, that is, until the algorithm reaches the defined error level. The input layer of the network consists of 50 * 50 pixels, the hidden layer consists of 20 * 20 neurons, and the output layer has as many neurons as there are characters to be recognized. The hidden layer includes a bias and is connected to the previous layer through a weight matrix. The structure is shown in Fig. 5.

Fig. 5 Back propagation neural network Structure [11]

Training takes an input together with its font size, font type, and color parameters. The trained set for every character is stored in a database file. The study found that the reliability of the OCR system depends on the amount of noise in the environment, and that the time needed to train on Arabic input data depends directly on the size of the character set [11].
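The layer sizes quoted above give a feel for how large such a network is. The short calculation below assumes fully connected layers with one bias weight per hidden and output neuron, and it assumes 28 output neurons (one per basic Arabic letter) purely as an example; the cited study states only that the output layer has as many neurons as there are characters to recognize.

```python
def layer_sizes(input_shape, hidden_shape, n_classes):
    """Flatten the 2-D input and hidden grids into neuron counts."""
    n_in = input_shape[0] * input_shape[1]
    n_hidden = hidden_shape[0] * hidden_shape[1]
    return n_in, n_hidden, n_classes


def weight_count(n_in, n_hidden, n_out, bias=True):
    """Number of trainable weights in a fully connected three-layer network."""
    extra = 1 if bias else 0
    return (n_in + extra) * n_hidden + (n_hidden + extra) * n_out


n_in, n_hidden, n_out = layer_sizes((50, 50), (20, 20), n_classes=28)  # 28 is assumed
print(n_in, n_hidden, n_out, weight_count(n_in, n_hidden, n_out))
```

Under these assumptions the network has about one million weights, almost all of them between the input and hidden layers, which is consistent with the observation that training time grows with the size of the problem.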

D. Back propagation algorithm for English OCR

A feed-forward neural network trained with back propagation is one of the best-known and simplest approaches to the OCR problem. The training stage teaches the network to give the anticipated output for a specific input. Training works with two main components: the input and the desired output. Once training is complete, the network is able to take an input and give the corresponding output. For example, OCR of the English alphabet has 26 capital letters to recognize. An image of 5 x 6 pixels is provided as the input pattern and is converted into an input vector of 30 elements, in which “1” is assigned to every white pixel and “0” to every black pixel. Figure 6 shows the true and false pixels of the input character K.

Fig. 6 Matrix presentation of character k [17]

In every epoch, the squared error is calculated over all samples in the training set. When the error becomes less than the specified error value, the network stops iterating and is ready for recognition. For recognition, an input pattern is again given to the network, and the element with the maximum value in the output pattern is taken as the recognized pattern [17].
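The conversion of a 5 x 6 image into a 30-element vector, and the choice of the largest output as the recognized letter, can be sketched as below. The bitmap of K is drawn by hand for this example and is not the one in Fig. 6; the orientation (5 columns by 6 rows), the output values, and the function names are likewise assumptions.

```python
def image_to_vector(rows, white="."):
    """Flatten a grid of pixel characters into a list of 1 (white) and 0 (black)."""
    return [1 if ch == white else 0 for row in rows for ch in row]


def recognized_label(outputs, labels):
    """Pick the label whose output element is largest."""
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return labels[best]


# Hand-drawn letter K, 5 pixels wide and 6 pixels high ('#' is black, '.' is white).
LETTER_K = ["#...#",
            "#..#.",
            "###..",
            "#..#.",
            "#...#",
            "#...#"]

vector = image_to_vector(LETTER_K)
print(len(vector), vector)

# Made-up network outputs for three candidate letters.
print(recognized_label([0.1, 0.8, 0.3], ["J", "K", "L"]))
```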

3. Proposed Approach for Sindhi OCR

The proposed Sindhi OCR solution is developed using the back propagation neural network approach. A typical back propagation network with input, hidden, and output layers is used. The input layer takes the normalized input, and the weights are adjusted in the hidden layer. The desired output is the winning neuron, whose weights are nearest to the input; the final output neuron is reached at the output layer. Every node (neuron) has a weight vector of the same dimension as the input data vector.

A. Sindhi OCR Workflow

The Sindhi OCR application follows the steps presented in Fig. 7. It first takes the input and normalizes it. Features are then extracted from the normalized input for every character, and all characters are classified into groups, each with a unique identification. Once all characters have been processed successfully, the system can perform recognition.

Fig. 7 [self]

a. Feature Extraction

Feature extraction highlights the features of the character written by the user. The process skips the empty white space of the region and collects the character template in a feature vector. Each input feature vector is first resized to the size of the lattice, a 20 * 20 map. An optimum lattice size, that is, an optimum number of neurons in the network topology, helps in recognizing the input characters. A network that is large relative to a small input pattern takes more time in training and recognition iterations, whereas a network topology that is too small for a larger input pattern causes wrong outputs; the proposed solution therefore uses a 20 by 20 matrix, as shown in Fig. 8.

Fig. 8 Matrix presentation of Sindhi Character Bay [self]
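The cropping and resizing step can be sketched as follows. The code assumes the drawn character is available as a grid of booleans (True for ink) and uses a simple "any ink in the cell" rule for down-sampling; both the rule and the names `bounding_box` and `downsample` are assumptions, since the paper does not spell out its resizing method.

```python
def bounding_box(bitmap):
    """Return (top, bottom, left, right) of the inked area of a boolean bitmap."""
    rows = [r for r, row in enumerate(bitmap) if any(row)]
    if not rows:
        raise ValueError("bitmap contains no ink")
    cols = [c for c in range(len(bitmap[0])) if any(row[c] for row in bitmap)]
    return rows[0], rows[-1], cols[0], cols[-1]


def downsample(bitmap, size=20):
    """Crop to the bounding box, then map the crop onto a size x size grid."""
    top, bottom, left, right = bounding_box(bitmap)
    height, width = bottom - top + 1, right - left + 1
    grid = [[False] * size for _ in range(size)]
    for gy in range(size):
        # Source rows covered by this grid cell (always at least one row).
        y0 = top + gy * height // size
        y1 = max(y0 + 1, top + (gy + 1) * height // size)
        for gx in range(size):
            x0 = left + gx * width // size
            x1 = max(x0 + 1, left + (gx + 1) * width // size)
            # A cell is set when any source pixel it covers contains ink.
            grid[gy][gx] = any(bitmap[y][x] for y in range(y0, y1) for x in range(x0, x1))
    return grid


# A 40 x 40 drawing with ink along the top row and the left column.
drawing = [[(y == 0 or x == 0) for x in range(40)] for y in range(40)]
lattice = downsample(drawing)
print(len(lattice), len(lattice[0]), sum(cell for row in lattice for cell in row))
```

Cropping before resizing is what removes the empty white space, so the same character drawn at different positions on the panel maps to the same 20 by 20 pattern.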

b. Classification

Classification is a process closely related to pattern recognition. The proposed Sindhi OCR solution uses it to classify all trained input characters, organizing the inputs into groups, each with a unique identification. The classification process receives the input patterns presented to the input neurons; these are processed further, which results in the identification of the various patterns. OCR performs classification after feature extraction: at this stage, the feature vector of each character is classified into the training set. Figure 9 presents the classified groups of data as characters in a list box.

Fig. 9 Character Dictionary [self]

c. Recognition

Once the normalized input has been trained by the system, the character dictionary is updated. The input character vector is then compared with the weight values in the lattice to find the matching pattern. Because the proposed solution follows the SOM topology, the network contains two layers of nodes: an input layer and a mapping layer. The input layer acts as a distribution layer, and the number of its nodes equals the number of features or attributes associated with the input. The input and each node of the mapping layer can therefore be represented as vectors of the same dimension. The mapping nodes are initialized with random numbers. Every input pattern is compared with every node of the mapping layer grid, and the node with the smallest Euclidean distance between its weight vector and the input vector is selected as the winning neuron. Once the winning neuron is selected, all of its neighboring nodes are adjusted proportionally. In this way the inputs are mapped onto the grid. The recognized character, given by the nearest winning neuron, is shown in editable format in Fig. 10.

Fig. 10 Sindhi Recognized Character Editable Format [self]
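The winner selection and neighborhood update described above can be sketched as one SOM training step. The Gaussian neighborhood function, the learning rate, the radius, and the small 3 by 3 grid are assumptions made to keep the example short; the proposed system uses a 20 by 20 lattice and does not state which neighborhood function it applies.

```python
import math
import random


def make_grid(rows, cols, dim, seed=0):
    """Mapping layer: every node starts with a random weight vector."""
    rng = random.Random(seed)
    return [[[rng.random() for _ in range(dim)] for _ in range(cols)] for _ in range(rows)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_winner(grid, vector):
    """Return (row, col) of the node whose weights are nearest to the input."""
    cells = ((r, c) for r in range(len(grid)) for c in range(len(grid[0])))
    return min(cells, key=lambda rc: euclidean(grid[rc[0]][rc[1]], vector))


def update(grid, vector, winner, rate=0.5, radius=1.0):
    """Pull the winner and its neighbours toward the input, scaled by grid distance."""
    wr, wc = winner
    for r, row in enumerate(grid):
        for c, weights in enumerate(row):
            # Influence is 1 at the winner and decays with distance on the grid.
            influence = math.exp(-((r - wr) ** 2 + (c - wc) ** 2) / (2 * radius ** 2))
            for i, x in enumerate(vector):
                weights[i] += rate * influence * (x - weights[i])


grid = make_grid(rows=3, cols=3, dim=4)
pattern = [1.0, 0.0, 0.0, 1.0]
winner = find_winner(grid, pattern)
update(grid, pattern, winner)
print(winner, euclidean(grid[winner[0]][winner[1]], pattern))
```

Repeating this step over all training patterns, usually with a shrinking rate and radius, is what lets neighboring nodes come to represent similar characters.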

B. Software Workflow, Implementation Aspect

The proposed Sindhi OCR application contains two methods, which are briefly described below.

Training

Recognize

a. Training

When input is given to the Sindhi OCR solution, the initial weight matrix of the input pattern is used to find the current error value. The error is lower when the input values are well organized, that is, when the character is written smoothly on the drawing panel. The training method determines the number of input neurons from the down-sampled image: the OCR lattice is a 20 by 20 pixel matrix, so there are 400 neurons in total. Each epoch reports the error, and training stops iterating once it reaches the defined acceptance level. The input count variable holds the number of nodes of the lattice, that is, the total number of input neurons for the character entered or drawn by the user. The letter count variable counts the inputs that have already been entered into the Sindhi OCR system for recognition; it holds only the 52 characters of the Sindhi alphabet and also ensures that no pattern is trained more than twice. The training set is a dictionary that holds the information of every trained pattern and is used later in the recognition process. The training set array has the same size as the letter list, which is predefined with the 52 characters of the Sindhi alphabet.

b. Recognize method

When the recognition routine is called, a different code routine executes. First, the recognition method checks the input character: it determines whether the input given by the user has been trained and is present in the training set. If the input character has not been trained, the system prompts the user to train the input pattern first.

If the system finds that the network has been trained with the input character, it proceeds with the following routine, which stores the total number of input neurons in the sample size variable and initializes an input array of doubles. An iteration then examines the down-sampled values: if a down-sampled value is false, a negative value is assigned to the corresponding element of the input array; otherwise the element holds +0.5. Once the input values are stored in the input array, another routine passes the complete array to a method named winner. The winner method returns the best output neuron, which is held in a variable named best. Most of the actual work performed by the neural network is done in the winner method. The first thing the winner method does is normalize the inputs; it then calculates the output value of each output neuron. The input pattern is iterated and propagated back until the defined error state is reached. Finally, the neuron with the largest output value is considered the winner and goes to the output stage.
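The recognition routine described above can be sketched as follows. The text says only that a negative value is stored for false pixels, so the value -0.5 used here is an assumption, as are the dot-product output, the function names, and the 2 by 2 toy grids (the real lattice is 20 by 20).

```python
import math


def build_input(downsampled, off_value=-0.5):
    """Turn a grid of booleans into a flat input array (+0.5 for ink, negative otherwise)."""
    return [0.5 if pixel else off_value for row in downsampled for pixel in row]


def normalize(vector):
    """Scale the vector to unit length; an all-zero vector is returned unchanged."""
    length = math.sqrt(sum(v * v for v in vector))
    return [v / length for v in vector] if length else list(vector)


def winner(weights, vector):
    """Index of the output neuron with the largest output for the normalized input."""
    x = normalize(vector)
    outputs = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return max(range(len(outputs)), key=lambda i: outputs[i])


def recognize(downsampled, weights, letters):
    """Return the letter of the winning neuron, or ask for training first."""
    if not weights:
        raise LookupError("network is not trained; train the input pattern first")
    best = winner(weights, build_input(downsampled))
    return letters[best]


# Two toy patterns stand in for trained characters; each weight row is the
# normalized input of one training pattern.
grid_a = [[True, False], [False, True]]
grid_b = [[False, True], [True, False]]
weights = [normalize(build_input(grid_a)), normalize(build_input(grid_b))]
print(recognize(grid_a, weights, ["A", "B"]), recognize(grid_b, weights, ["A", "B"]))
```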

4. Conclusion and Future Work

A large body of Sindhi literature is available only in hard copy and needs to be converted to soft copy so that everyone can use it and the information is available to explore worldwide. Achieving this requires mature Sindhi OCR applications that can convert the literature to soft copy. An OCR application reduces the cost of data entry and the error level of data entry operations. It helps to increase the strength and life of the language and the richness of Sindhi literature. Technically, for an OCR system it is necessary to investigate and identify the optimum data required to train a particular network. An OCR application, especially one that takes input in a particular font, consumes a large number of epochs to train a particular input, so it is necessary to reduce the number of epochs, especially when training on large input data. In this paper a successful Sindhi OCR application is developed that uses the self-organizing map approach to recognize characters. It can be trained on the fifty-two Sindhi characters and is able to recognize them. It takes the “MB-Lateefi” font as input.

As far as future work is concerned, the Sindhi OCR application currently takes input in a particular font. In future work the proposed solution can be extended to a font-independent system, and further to a handwriting recognition system.

REFERENCES

[1] Noor Ahmed Shaikh, Ghulam Ali Mallah, Zubair A. Shaikh, “Character Segmentation of Sindhi, an Arabic Style Script Language, Using Height Profile Vector”, Australian Journal of Basic and Applied Science, ISSN 1991-8187.

[2] S. A. Husain, Asma Sajjad, Fareeha Anwar, “Online Urdu Character Recognition System”, MVA 2007 Conference on Machine Vision Application, Tokyo, Japan.

[3] Ali Dasdan, Kemal Oflazer, “Genetic Synthesis of Unsupervised Learning Algorithms”, Department of Computer Engineering and Information Science, Bilkent University, Turkey.

[4] Kuzman Ganchev, “Language Segmentation for Optical Character Recognition using Self Organizing Maps”, 2003.

[5] Simone Marinai, “SOM Clustering for Text Retrieval and Classification with Examples on Indian Scripts”.

[6] Tabassam Nawaz, Syed Ammar Hassan, Habib ur Rehman, Anoshia Faiz, “Optical Character Recognition System for Urdu (Naskh Font) Using Pattern Matching Technique”, International Journal of Image Processing, Volume (3).

[7] Yasmine Elglaly, Francis Quek, “Isolated Handwritten Arabic Character Recognition using Multilayer Perceptrons and K Nearest Neighbor Classifiers”.

[8] Sindhi Language Authority, Official Website. http://www.Sindhila.org/Research%20Journal.htm (Accessed 25th May 2011).

[9] Sindhi Language, Website. http://www.Sindhilanguage.com/script.html (Accessed 25th May 2011).

[10] Ahmed M. Sarhan and Omar I. Al Helalat, “Arabic Character Recognition Using Artificial Neural Network and Statistical Analysis”, European and Mediterranean Conference on Information Systems, 2007.

[11] Hamza, Ali A., “Back Propagation Neural Network Arabic Characters Classification Module Utilizing Microsoft Word”, Journal of Computer Science, 2008.

[12] Vikas J. Dongre, Vijay H. Mankar, “A Review of Research on Devnagari Character Recognition”, Department of Electronics & Telecommunication, Government Polytechnic, Nagpur, India, International Journal of Computer Applications (0975 – 8887), Volume 12, No. 2, November 2010.

[13] Rashad Al-Jawfi, “Handwriting Arabic Character Recognition LeNet Using Neural Network”, Department of Mathematics and Computer Science, Ibb University, Yemen, The International Arab Journal of Information Technology, Vol. 6, No. 3, July 2009.

[14] Velappa Ganapathy and Kok Leong Liew, “Handwritten Character Recognition Using Multiscale Neural Network Training Technique”, World Academy of Science, Engineering and Technology 39, 2008.

[15] R. Indra Gandhi, “An Attempt to Recognize Handwritten Tamil Character Using Kohonen SOM”, Research Scholar, Department of Computer Science, Mother Teresa Women’s University, Int. J. of Advanced Networking and Applications, Volume 01, Issue 03, Pages 188-192 (2009).

[16] Kuzman Ganchev, “Language Segmentation for Optical Character Recognition using Self Organizing Map”, in Proceedings of the Class of 2003 Senior Conference, pages 109–115, Computer Science Department, Swarthmore College.

[17] http://www.codeproject.com/kb/cs/neural_network_ocr.aspx (Accessed 25th May 2011).

[18] Ubeeka Jain, Dharamveer Sharma, “Recognition of Isolated Handwritten Character of Gurumukhi Script using Neocognitron”, International Journal of Computer Applications, 0975-8887, November 2010.

[19] R. Jagadeesh Kannan, R. Prabhakar, “A Comparative Study of Optical Character Recognition for Tamil Script”, European Journal of Scientific Research, ISSN 1450-216X, Vol. 35, No. 4 (2009).
