Sindhi OCR using Back Propagation Neural Network
Ali Muhammad Nizamani, Prof. Naeem Ul Hassan Janjua
International Journal of Advanced Computer Science, Vol. 3, No. 3, Pp. 113-117, Mar., 2013.
Manuscript Received: 4 Oct. 2011
Revised: 25 Oct. 2011
Accepted: 14 Jan. 2013
Published: 15 Feb. 2013
Keywords: ANN, artificial intelligence, supervised learning
Abstract: This paper proposes an OCR solution for Sindhi
character recognition using Artificial Neural Networks
(ANNs) and identifies the major alphabet differences between
Sindhi and Arabic from an OCR perspective. A large body of
Sindhi literature exists only in hard copy and needs to be
converted into soft copy so that everyone can access and
search it. Sindhi is a rich language with fifty-two
characters, and it readily absorbs words from other
languages. Compared with other Unicode-encoded languages,
Sindhi characters differ in shape, cursive style, and
position within a word. These properties carry linguistic
information, but they make writing and printing harder and
add complexity to document digitization (OCR). The
recognition system takes Sindhi characters as input from a
drawing control or as characters in the "MB Lateefi" font.
The input is then cropped and converted to a defined size
with horizontal and vertical attributes. The training
process of the proposed system uses unsupervised learning,
which adjusts the randomly initialized weights of the matrix
to bring them closer to the input. Training stops when the
weights are nearly equal to the values presented at the
input layer. For every input character, one neuron becomes
the winner neuron.
1. Introduction
Sindhi is one of the major languages of Pakistan and is
spoken by approximately 30-40 million people [8]; after
Urdu, it is the second most widely spoken language of
Pakistan. It is an Indo-Aryan language of the lower Indus
River valley. Sindhi readily adopts new words into its
dictionary, a mark of a rich language that easily absorbs
words from other languages. It has the potential to grow and
to meet the need for new sounds by adding characters and
words over time. This paper compares the Sindhi and Arabic
alphabets from an OCR perspective and identifies the major
differences between them. Its second objective is to propose
a solution for Sindhi OCR using an artificial neural network.
1 Ali Muhammad Nizamani is with SZABIST, Karachi, Pakistan (Email: [email protected]).
2 Prof. Naeem Ul Hassan Janjua is with Bahria University, Karachi, Pakistan (Email: [email protected]).
A. Comparison of Sindhi & Arabic Language Alphabets
with OCR Perspective
Sindhi is written in an Arabic-style script, and the Sindhi
Arabic script also draws on the Persian writing style.
Arabic script is written from right to left. The Sindhi
script comprises fifty-two characters and seven diacritic
signs [9]. Of the fifty-two characters, three are borrowed
from Persian and twenty-nine are inherited from Arabic. The
structure of the Sindhi alphabet is shown in Fig. 1.
Fig. 1. Combination of Sindhi Alphabets. [2]
Twenty modified characters distinguish the Sindhi alphabet
from the Arabic one. These characters supply sounds of the
Sindhi language that the Arabic alphabet lacks, and they
were created by changing the structure of existing
characters. For example, to write the sound "Ghay", the
characters "gaf" and "hay" are combined into a new
character. New characters are also created by adding dots
("Nukta") and diacritics. For example, for sounds such as
"Cha" and "Bhay", four-dot characters were created that do
not exist in Arabic. Combining more than one character,
increasing the number of dots, and adding diacritics
together yield the twenty additional characters shown in
Fig. 2.
ڃ ,ڱ ,ٻ ,ڄ , ڳ ,،ههگ ،ڌ ڀ ، ههج ,،ڙ ،ڻ ،ڏ ،ڍ ،ٽ ،ڊ ٺ ,ڦ ,ٿ ,ڇ ,ک
Fig. 2 Modified Characters for Sindhi Sounds. [2]
Fig. 3 Stand alone, Initial, Medial, Final Characters positions [2]
Sindhi is written in a cursive style. In a cursive script,
characters are connected to one another to form another
character or a word. A character can occupy four major
positions (stand-alone, initial, medial, and final), and its
shape differs in each. These positions are shown in Fig. 3.
According to Fig. 3, changing the position of a character
within a word creates a different sound. Diacritics are also
very important in the Sindhi alphabet: they help to create
more characters, giving spoken sounds a written form. For
example, four dots were introduced in Sindhi to represent
sounds that were otherwise missing; the characters "Bhay",
"Thay", and "Fay" use four dots. Similarly, placing
"Alf mudA" or "hamza" over another character produces a
different character sound [2].
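The four positions can be inspected directly in Unicode, which encodes them as separate presentation forms for Arabic-script letters. The sketch below is illustrative only and is not part of the proposed system. It uses ARABIC LETTER BEH (U+0628) as an example base letter, and it assumes that the "stand-alone" position of Fig. 3 corresponds to what Unicode calls the ISOLATED form. The helper names positional_form and base_letter are hypothetical.

```python
import unicodedata

# Presentation-form code points of ARABIC LETTER BEH (U+0628), listed in
# the order isolated, initial, medial, final.
BEH_FORMS = ["\uFE8F", "\uFE91", "\uFE92", "\uFE90"]


def positional_form(glyph: str) -> str:
    """Read ISOLATED / INITIAL / MEDIAL / FINAL from the Unicode character name."""
    name = unicodedata.name(glyph)  # for example "ARABIC LETTER BEH INITIAL FORM"
    return name.split()[-2]


def base_letter(glyph: str) -> str:
    """Fold a positional glyph back to the abstract letter it represents."""
    return unicodedata.normalize("NFKC", glyph)


for glyph in BEH_FORMS:
    print(positional_form(glyph), "->", unicodedata.name(base_letter(glyph)))
```

For an OCR system this means that one abstract letter may need up to four shape classes, which can be folded back to a single character after recognition.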
B. Self Organizing Maps
Many research findings are inspired by humans and nature,
such as genetic algorithms, decision support systems, and
robots. To serve humanity, researchers try to make machines
work like a human. The aim of machine learning is to create
machines that are capable of making decisions from their
experience. The level of intelligence of a machine depends
on the historical data and on the model or algorithm used
for a particular learning process. Compared with supervised
learning, unsupervised learning has some complexity and
accuracy problems, especially with abnormally written
characters, which mislead the system during recognition.
That problem can be resolved by building a word or character
dictionary that is searched for possible character
compositions, because the presence of similar known patterns
will definitely help to remove ambiguity between characters.
In unsupervised learning, if the target result is not
achieved, the neural network tries to find the best
comparable result [4].
The back propagation algorithm is used in layered
feed-forward ANNs. Such a network has three layers (input,
hidden, and output), and signals are sent in the forward
direction. The neurons in the input layer receive the
network's input, and the neurons in the output layer give
the network's output. Back propagation follows the
supervised learning method, which means that the network is
trained on inputs for which the expected output is known.
During training the output is computed, and then the error
(the difference between the actual and the expected result)
is calculated. Training starts with random weights, which
are then adjusted to achieve a minimal error. The activation
of each artificial neuron in an ANN that implements back
propagation is computed from a weighted sum of its inputs.
Back propagation neural networks learn from training
patterns: each input pattern is applied to the input units
and propagated forward. The activation that arises at the
output layer is compared with the correct output pattern to
calculate an error. After training, the network is able to
classify the set of inputs, and it is then tested further
with patterns that were not used in training.
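The training loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this paper: it assumes sigmoid activations, one bias per neuron, per-pattern weight updates, and a toy two-input data set (logical OR) in place of character images. The class name TinyBackpropNet and all parameter values are hypothetical.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class TinyBackpropNet:
    """Input -> hidden -> output network trained by back propagation."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        # Training starts from small random weights; the last column is the bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                      for _ in range(n_out)]

    def forward(self, x):
        xb = list(x) + [1.0]
        hidden = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in self.w_hidden]
        hb = hidden + [1.0]
        out = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in self.w_out]
        return hidden, out

    def train_pattern(self, x, target, lr=0.5):
        """One forward pass and one backward weight update; returns the squared error."""
        hidden, out = self.forward(x)
        xb, hb = list(x) + [1.0], hidden + [1.0]
        # Output deltas: error times the derivative of the sigmoid.
        d_out = [(t - o) * o * (1.0 - o) for t, o in zip(target, out)]
        # Hidden deltas: output deltas propagated back through w_out.
        d_hidden = [h * (1.0 - h) * sum(d_out[k] * self.w_out[k][j]
                                        for k in range(len(d_out)))
                    for j, h in enumerate(hidden)]
        for k, row in enumerate(self.w_out):
            for j in range(len(row)):
                row[j] += lr * d_out[k] * hb[j]
        for j, row in enumerate(self.w_hidden):
            for i in range(len(row)):
                row[i] += lr * d_hidden[j] * xb[i]
        return sum((t - o) ** 2 for t, o in zip(target, out))


def train(net, samples, max_epochs=5000, target_error=0.01):
    """Repeat epochs until the summed squared error drops below target_error."""
    history = []
    for _ in range(max_epochs):
        history.append(sum(net.train_pattern(x, t) for x, t in samples))
        if history[-1] < target_error:
            break
    return history


# Toy training set (logical OR), standing in for character patterns.
TOY_SAMPLES = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
               ([1.0, 0.0], [1.0]), ([1.0, 1.0], [1.0])]
net = TinyBackpropNet(2, 3, 1)
errors = train(net, TOY_SAMPLES)
print(f"epochs: {len(errors)}, first error: {errors[0]:.3f}, last error: {errors[-1]:.3f}")
```

The error list makes the stopping rule visible: training ends either when the error falls below the chosen threshold or when the epoch limit is reached.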
2. OCR Approaches
A. Pattern Matching approach for Urdu Characters
Urdu is written in a cursive style that resembles Sindhi
writing: characters are connected with each other to make a
word or character. The Urdu alphabet consists of 40
characters. These characters exhibit features such as a
single loop, a double loop, or an incomplete loop, and
several of them also include dots and diacritics. The
pattern matching approach trains the system by first taking
an input image, creating the chain code of every character,
and storing it in a data dictionary in an XML file. Training
consists of preprocessing followed by line and character
segmentation. Each trained character in the data dictionary
(XML file) holds the chain code information of that
character. The recognition process takes an image as input
and creates its chain code. It then compares that chain code
with those in the data dictionary (XML file) and matches the
pattern signature against the patterns already stored
there [6].
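A chain code can be illustrated with a short sketch. The code below is not the implementation of [6]: it assumes the common Freeman 8-direction coding, takes an already traced contour as an ordered list of points, keeps the dictionary in memory instead of an XML file, and uses the similarity ratio of Python's difflib as a stand-in matching rule. The two dictionary entries are toy strokes, not real Urdu characters.

```python
from difflib import SequenceMatcher

# Freeman 8-direction codes for a step (dx, dy); y grows downward as in images.
DIRECTIONS = {(1, 0): 0, (1, -1): 1, (0, -1): 2, (-1, -1): 3,
              (-1, 0): 4, (-1, 1): 5, (0, 1): 6, (1, 1): 7}


def chain_code(contour):
    """Encode an ordered list of neighbouring (x, y) contour points as direction codes."""
    return [DIRECTIONS[(x1 - x0, y1 - y0)]
            for (x0, y0), (x1, y1) in zip(contour, contour[1:])]


def best_match(code, dictionary):
    """Return the label whose stored chain code is most similar to code."""
    def similarity(label):
        return SequenceMatcher(None, code, dictionary[label]).ratio()
    return max(dictionary, key=similarity)


# Hypothetical data dictionary with two toy strokes.
dictionary = {
    "square": chain_code([(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]),
    "diagonal": chain_code([(0, 0), (1, 1), (2, 2), (3, 3)]),
}
# The same square drawn at a different place yields the same chain code.
query = chain_code([(5, 5), (6, 5), (6, 6), (5, 6), (5, 5)])
print(query, best_match(query, dictionary))
```

Because the code records only the direction of each step, it does not depend on where the character sits in the image.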
B. Statistical Analysis Approach for Arabic Characters
To recognize Arabic characters, an artificial neural network
is developed and trained using the least mean squares (LMS)
algorithm. This approach creates a matrix of binary numbers
that serves as input to a simple feature extraction process.
In the LMS algorithm the weights of the input neurons are
adjusted iteratively, and the optimum weights are calculated
for a single neuron. The network weights are moved along the
negative of the gradient of the performance function, so
that after several iterations the weights are adjusted. The
approach represents a single Arabic letter as a vector of 35
elements, as shown in Fig. 4.
Fig. 4 Vector View of Character Noon [10]
Statistical analysis showed that the pixel values of the
Arabic letters are highly uncorrelated. This finding
motivated an investigation of the standard deviation as a
distinctive feature of each letter. The system takes the 28
typed letters of the Arabic alphabet as input, and every
character is represented by a matrix of 7 * 5 binary
pixels [10], as shown in Fig. 4.
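The 35-element representation, the standard deviation feature, and the LMS update can be sketched together. The code below is an illustration under assumptions and does not reproduce [10]: the two 7 * 5 bitmaps are invented, a single linear neuron is trained with the Widrow-Hoff form of the LMS rule, and the learning rate and epoch count are arbitrary choices.

```python
import statistics

# Hypothetical 7 x 5 bitmaps (1 = ink); invented for illustration only.
LETTER_A = ["00100", "00000", "10001", "10001", "10001", "01110", "00000"]
LETTER_B = ["00000", "00000", "10001", "10001", "11111", "00000", "00100"]


def to_vector(bitmap):
    """Flatten 7 rows of 5 pixels into a 35-element list of 0/1 values."""
    return [int(ch) for row in bitmap for ch in row]


def lms_train(samples, lr=0.05, epochs=200):
    """Widrow-Hoff rule for one linear neuron: w += lr * (target - w.x) * x."""
    weights = [0.0] * len(samples[0][0])
    errors = []
    for _ in range(epochs):
        total = 0.0
        for x, target in samples:
            output = sum(w * v for w, v in zip(weights, x))
            delta = target - output
            # Step along the negative gradient of the squared error.
            weights = [w + lr * delta * v for w, v in zip(weights, x)]
            total += delta ** 2
        errors.append(total)
    return weights, errors


samples = [(to_vector(LETTER_A), 1.0), (to_vector(LETTER_B), 0.0)]
weights, errors = lms_train(samples)
for vector, _ in samples:
    # Standard deviation of the pixel values, the feature discussed above.
    print(len(vector), round(statistics.pstdev(vector), 3))
print(errors[0], errors[-1])
```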
C. Back propagation neural network for Arabic Characters
Back propagation is the neural network approach used by
Hamza, Ali A. for Arabic character recognition. The approach
designs and trains a classifier that recognizes any
combination of characters, with size and font attributes,
produced using Microsoft Word. In that study the 28 basic
Arabic characters plus 10 numerals, together with 52 Latin
characters and 10 numerals, are used as the set of inputs
for the recognition process. The approach has the common ANN
architecture of three layers (input, hidden, and output).
After the training process, the back propagation algorithm
takes input for character recognition. It iterates over the
input for several epochs and propagates the error back
through the layers until the desired result is reached, that
is, until the algorithm reaches the defined error level. The
input layer of the network consists of 50 * 50 pixels, the
hidden layer consists of 20 * 20 neurons, and the output
layer has as many neurons as there are characters to be
recognized. The hidden layer includes a bias and is
connected to the weight matrix of the previous layer, as
shown in Fig. 5.
Fig. 5 Back propagation neural network Structure [11]
Training takes an input together with font size, font type,
and color parameters. The trained set of every character is
preserved in a database file. The study found that the
reliability of the OCR system depends on the degree of noise
in the environment, and that the time needed to train on
Arabic input data depends directly on the size of the
character set [11].
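The layer sizes quoted above give a feel for the size of such a network. The arithmetic below is a rough sketch that assumes fully connected layers, one bias per neuron, and one output neuron per character in the described set; none of these details is confirmed by [11].

```python
def dense_parameters(n_inputs: int, n_neurons: int) -> int:
    """Weights plus one bias per neuron in a fully connected layer."""
    return (n_inputs + 1) * n_neurons


n_input = 50 * 50    # input layer: one unit per pixel
n_hidden = 20 * 20   # hidden layer size given in the text
# Assumption: one output neuron per character class in the described set.
n_output = 28 + 10 + 52 + 10

hidden_params = dense_parameters(n_input, n_hidden)
output_params = dense_parameters(n_hidden, n_output)
print(n_input, n_hidden, n_output, hidden_params + output_params)
```

Under these assumptions every additional character class adds 401 trainable values to the output layer, which is consistent with training time growing with the size of the character set.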
D. Back propagation algorithm for English OCR
A feed-forward neural network with back propagation learning
is one of the best-known and easiest approaches to the OCR
problem. The training stage teaches the neural network to
give the anticipated output for a specific input. Training
has two main components: the input and the desired output.
Once training is complete, the network is able to take an
input and give the corresponding output. For example, for
English alphabet OCR there are 26 capital letters to
recognize. An image of 5 x 6 pixels is provided as the input
pattern and is converted into a vector of 30 elements, in
which "1" is assigned to every white pixel and "0" to every
black pixel. Figure 6 shows the true and false pixels of the
input character K.
Fig. 6 Matrix presentation of character k [17]
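The conversion of a 5 x 6 image into a 30-element vector can be sketched as follows. The bitmap is a hand-drawn stand-in for Fig. 6, not the actual data of [17], and it assumes that 5 x 6 means five columns by six rows. The function name to_input_vector is hypothetical.

```python
# Hypothetical 5 x 6 bitmap of "K": '#' is a black (ink) pixel, '.' is white.
LETTER_K = [
    "#...#",
    "#..#.",
    "###..",
    "#..#.",
    "#...#",
    "#...#",
]


def to_input_vector(bitmap):
    """Flatten row by row, writing 1 for white and 0 for black as in the text."""
    return [1 if ch == "." else 0 for row in bitmap for ch in row]


vector = to_input_vector(LETTER_K)
print(len(vector), vector)
```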
In every epoch the squared error is calculated over all
samples in the training set. When the error becomes less
than the specified error value, the network stops iterating
and is ready for recognition. For recognition, an input
pattern is again given to the network, and the output neuron
with the maximum value in the output pattern is taken as the
recognized pattern [17].
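The recognition rule, picking the strongest output, can be written in a few lines. The sketch assumes one output neuron per capital letter in alphabetical order; the output values are invented and do not come from a trained network.

```python
import string


def squared_error(targets, outputs):
    """Squared error of one sample, summed over all output neurons."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs))


def recognize(outputs, labels=string.ascii_uppercase):
    """Return the label of the output neuron with the largest activation."""
    if len(outputs) != len(labels):
        raise ValueError("one output value per label is required")
    winner = max(range(len(outputs)), key=lambda i: outputs[i])
    return labels[winner]


# Hypothetical output pattern for the 26 capital letters.
outputs = [0.02] * 26
outputs[10] = 0.93
print(recognize(outputs))
```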
3. Proposed Approach for Sindhi OCR
The proposed Sindhi OCR solution is developed using the back
propagation neural network approach. A typical back
propagation neural network is used, containing an input, a
hidden, and an output layer. The input layer takes the
normalized input, and the weights are adjusted in the hidden
layer. The desired output is the winning neuron, the one
whose weights are nearest to the input; this final output
neuron is reached at the output layer. Every node, or
neuron, has a weight vector of the same dimension as the
input data vector.
A. Sindhi OCR Workflow
The Sindhi OCR application first takes the input and
normalizes it. Features of every normalized character are
then extracted, and all characters are classified into
groups, each with a unique identification. Once all
characters have been classified successfully, the system can
perform recognition. The steps of the OCR process are
presented in Fig. 7.
Fig. 7 [self]
a. Feature Extraction
The feature extraction process captures the features of the
character written by the user. The process skips the empty
white space around the character and collects the character
template in a feature vector. Each input feature vector is
first resized to the size of the lattice, which is a 20 * 20
map. An optimum lattice size, that is, an optimum number of
neurons in the network topology, helps the network to
recognize the input characters. A large network for a small
input pattern consumes more time in the training and
recognition iterations. Similarly, a smaller network
topology for a larger input pattern will cause wrong
outputs, so the proposed solution uses a 20 by 20 matrix, as
shown in Fig. 8.
Fig. 8 Matrix presentation of Sindhi Character Bay [self]
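The two steps of feature extraction, skipping the empty border and resizing to the 20 by 20 lattice, can be sketched as below. This is an assumed reading of the description rather than the application's code: the image is a list of rows of booleans (True = ink), the crop is a plain bounding box, and a lattice cell is set when any source pixel that falls into it carries ink. The function names are hypothetical.

```python
def crop_to_ink(image):
    """Drop the empty rows and columns around the drawn character (True = ink)."""
    rows = [r for r, row in enumerate(image) if any(row)]
    if not rows:
        raise ValueError("image contains no ink")
    cols = [c for c in range(len(image[0])) if any(row[c] for row in image)]
    return [row[cols[0]:cols[-1] + 1] for row in image[rows[0]:rows[-1] + 1]]


def downsample(image, size=20):
    """Map the image onto a size x size grid; a cell is True if any pixel in it is ink."""
    height, width = len(image), len(image[0])
    grid = []
    for gy in range(size):
        y0 = gy * height // size
        y1 = max(y0 + 1, (gy + 1) * height // size)
        row = []
        for gx in range(size):
            x0 = gx * width // size
            x1 = max(x0 + 1, (gx + 1) * width // size)
            row.append(any(image[y][x] for y in range(y0, y1) for x in range(x0, x1)))
        grid.append(row)
    return grid


# Hypothetical drawing-panel image: 40 rows x 60 columns containing a vertical bar.
panel = [[10 <= x < 14 and 5 <= y < 35 for x in range(60)] for y in range(40)]
lattice = downsample(crop_to_ink(panel))
features = [cell for row in lattice for cell in row]
print(len(lattice), len(lattice[0]), len(features))
```

In this sketch the cropped character is stretched to fill the whole lattice, so the aspect ratio of the original drawing is not preserved; whether the application does the same is not stated in the text.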
b. Classification
Classification is a process closely related to pattern
recognition. The proposed Sindhi OCR solution uses it to
classify all trained input characters, organizing the inputs
into groups, each with a unique identification. The
classification process receives the input patterns, which
are presented to the input neurons; these inputs are then
processed to identify the various patterns. Classification
is performed after feature extraction: in this stage the
feature vector of each character is classified into the
training set. Figure 9 shows the classified groups of data
as characters in a list box.
Fig. 9 Character Dictionary [self]
c. Recognition
Once the normalized input has been trained by the system,
the character dictionary is updated. The input character
vector is then compared with the weight values held in the
lattice to find the matching pattern. Because the proposed
solution follows the SOM topology, the network contains two
layers of nodes: an input layer and a mapping layer. The
input layer acts as a distribution layer, and the number of
its nodes is equal to the number of features or attributes
associated with the input. The input and each node of the
mapping layer can therefore be represented as vectors of the
same length. The mapping nodes are initialized with random
numbers. Every input pattern is then compared with every
node of the mapping layer grid, and the node with the
smallest Euclidean distance between its weight vector and
the input vector is selected as the winning neuron. Once the
winning neuron is selected, all of its neighboring nodes are
adjusted proportionally. In this way the input patterns are
mapped onto the grid. The character of the nearest winning
neuron is output in editable format, as shown in Fig. 10.
Fig. 10 Sindhi Recognized Character Editable Format [self]
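The winner selection and neighborhood adjustment described above can be sketched as a single SOM step. The code is a minimal illustration under assumptions: a small 4 x 4 map of 3-element weight vectors stands in for the 20 by 20 lattice, the neighborhood influence is Gaussian within a fixed radius, and the learning rate is arbitrary. It requires Python 3.8 or later for math.dist.

```python
import math
import random


def best_matching_unit(lattice, x):
    """Return (row, col) of the node whose weight vector is nearest to x."""
    positions = ((r, c) for r in range(len(lattice)) for c in range(len(lattice[0])))
    return min(positions, key=lambda rc: math.dist(lattice[rc[0]][rc[1]], x))


def som_step(lattice, x, lr=0.5, radius=1.0):
    """Move the winner and its grid neighbours toward x; nearer neighbours move more."""
    win_r, win_c = best_matching_unit(lattice, x)
    for r, row in enumerate(lattice):
        for c, weights in enumerate(row):
            grid_dist = math.hypot(r - win_r, c - win_c)
            if grid_dist <= radius:
                influence = math.exp(-grid_dist ** 2 / (2 * radius ** 2))
                row[c] = [w + lr * influence * (v - w) for w, v in zip(weights, x)]
    return win_r, win_c


rng = random.Random(0)
# Mapping nodes are initialized with random numbers.
lattice = [[[rng.random() for _ in range(3)] for _ in range(4)] for _ in range(4)]
x = [1.0, 0.0, 0.0]
before = min(math.dist(w, x) for row in lattice for w in row)
winner = som_step(lattice, x)
after = math.dist(lattice[winner[0]][winner[1]], x)
print(winner, before, after)
```

With a learning rate of 0.5 the winning node moves halfway toward the input in one step, so its distance to the input is halved.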
B. Software Workflow, Implementation Aspect
The proposed Sindhi OCR application contains two methods,
which are briefly described below.
Training
Recognize
a. Training
When an input is given to the Sindhi OCR solution, the
initial weight matrix of the input pattern is used to find
the current error value. The error value is lower when the
input values are better organized, that is, when the
character is written smoothly on the drawing panel. The
number of input neurons is determined by the training method
from the down-sampled image. The OCR lattice is a 20 by 20
pixel matrix, so there are 400 input neurons in total. Each
epoch reports error information, and once the training
process reaches its defined acceptance level the iteration
stops. The input count variable holds the total number of
input neurons, one for each node of the lattice entered or
drawn by the user. The letter count variable counts the
inputs that have already been entered into the Sindhi OCR
system for recognition. This variable holds at most the 52
characters of the Sindhi alphabet, and it also ensures that
no pattern is trained more than twice. The training set is a
dictionary that holds information about every trained
pattern, which is later used in the recognition process. The
training set array has the same size as the letters list,
which is predefined with the 52 characters of the Sindhi
alphabet.
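The bookkeeping described above can be sketched as a small container. The class below is hypothetical and only mirrors the stated constraints: 400 input values per pattern, a predefined list of 52 letters, and a limit on how often one pattern is trained. Placeholder labels are used instead of the actual Sindhi characters.

```python
LATTICE_SIZE = 20
INPUT_COUNT = LATTICE_SIZE * LATTICE_SIZE  # 400 input neurons
LETTER_COUNT = 52                          # size of the predefined letters list
MAX_TRAININGS = 2                          # limit taken from the description above


class TrainingSet:
    """Dictionary of trained patterns keyed by letter (hypothetical helper)."""

    def __init__(self, letters):
        if len(letters) != LETTER_COUNT:
            raise ValueError(f"expected {LETTER_COUNT} letters")
        self.patterns = {letter: None for letter in letters}
        self.times_trained = {letter: 0 for letter in letters}

    def add(self, letter, vector):
        """Store a pattern; return False when the training limit is reached."""
        if letter not in self.patterns:
            raise KeyError(f"{letter!r} is not in the letters list")
        if len(vector) != INPUT_COUNT:
            raise ValueError(f"expected {INPUT_COUNT} input values")
        if self.times_trained[letter] >= MAX_TRAININGS:
            return False
        self.patterns[letter] = list(vector)
        self.times_trained[letter] += 1
        return True

    def is_trained(self, letter):
        return self.patterns.get(letter) is not None


# Placeholder labels stand in for the 52 Sindhi characters.
letters = [f"letter_{i}" for i in range(LETTER_COUNT)]
training_set = TrainingSet(letters)
sample = [False] * INPUT_COUNT
results = [training_set.add("letter_0", sample) for _ in range(3)]
print(results)
```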
b. Recognize method
When the recognition routine is called, a different code
path executes. First, the recognition method checks whether
the input given by the user has been trained and is
contained in the training set. If the input character has
not been trained, the system prompts the user to train the
input pattern first.
If the system finds that the network holds the input
character, it proceeds to the following routine, which
stores the total number of input neurons in the sample size
variable and initializes an input array of doubles. An
iteration then examines the down-sampled values: if a
down-sampled value is false, a negative value is assigned in
the input array; otherwise +0.5 is stored. Once the input
values are stored, another routine sends the complete input
array to a method known as winner. The winner method returns
the best output neuron, which is held in a variable known as
best. Most of the actual work performed by the neural
network is done in the winner method. The first thing the
winner method does is normalize the inputs; it then
calculates the output value of each output neuron. The input
pattern is iterated and propagated back until it reaches the
defined error level. Finally, the neuron with the largest
output value is considered the winner and goes to the output
stage.
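The recognition path can be sketched as follows. This is an assumed reconstruction, not the application's code: the negative input value is taken to be -0.5, normalization scales the input to unit length, and each output neuron's weight vector is simply the normalized trained pattern of one character. Four-pixel patterns are used in place of the 400-value lattice to keep the example short.

```python
import math


def to_bipolar(sample):
    """False (no ink) -> -0.5 (assumed value), True (ink) -> +0.5."""
    return [0.5 if pixel else -0.5 for pixel in sample]


def normalize(vector):
    """Scale to unit length so that outputs depend on direction, not magnitude."""
    length = math.sqrt(sum(v * v for v in vector))
    return [v / length for v in vector] if length else list(vector)


def winner(inputs, weight_rows):
    """Return the index of the output neuron with the largest output value."""
    x = normalize(inputs)
    outputs = [sum(w * v for w, v in zip(row, x)) for row in weight_rows]
    return max(range(len(outputs)), key=lambda i: outputs[i])


def recognize(sample, trained):
    """trained maps a label to its stored boolean pattern."""
    if not trained:
        raise LookupError("train the input pattern first")
    labels = list(trained)
    # Assumption: each output neuron's weights are the normalized trained pattern.
    weights = [normalize(to_bipolar(trained[label])) for label in labels]
    best = winner(to_bipolar(sample), weights)
    return labels[best]


# Hypothetical four-pixel patterns.
trained = {"a": [True, False, False, True], "b": [False, True, True, False]}
sample = [True, False, False, False]
print(recognize(sample, trained))
```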
4. Conclusion and Future Work
A large body of Sindhi literature is available only in hard
copy and needs to be converted into soft copy so that
everyone can use it and explore it worldwide. To achieve
this, mature Sindhi OCR applications are needed that can
convert the literature into soft copy. An OCR application
helps to reduce the cost of data entry and the error level
of data entry operations. It helps to increase the strength
and life of the language and enriches Sindhi literature.
Technically, for an OCR system it is necessary to identify
and investigate what data is optimal for training a
particular network. A good OCR application, especially one
that takes input in a particular font, consumes a large
number of epochs to train a particular input. It is
therefore necessary to reduce the number of epochs,
especially when training on large input data. In this paper
a Sindhi OCR application has been successfully developed
that uses the self-organizing map approach to recognize
characters. The application can be trained on the fifty-two
Sindhi characters and is able to recognize them. It takes
the "MB-Lateefi" font as input.
As far as future work is concerned, Sindhi OCR currently
takes input in one particular font. In future work the
proposed solution can be extended into a font-independent
system, and further into a handwriting recognition system.
REFERENCES
[1] Noor Ahmed Shaikh, Ghulam Ali Mallah, Zubair A. Shaikh, "Character segmentation of Sindhi, an Arabic style script language, using height profile vector", Australian Journal of Basic and Applied Science, ISSN 1991-8187.
[2] S. A. Husain, Asma Sajjad, Fareeha Anwar, "Online Urdu Character Recognition System", MVA 2007 Conference on Machine Vision Applications, Tokyo, Japan.
[3] Ali Dasdan, Kemal Oflazer, "Genetic Synthesis of Unsupervised Learning Algorithms", Department of Computer Engineering and Information Science, Bilkent University, Turkey.
[4] Kuzman Ganchev, "Language segmentation for Optical Character Recognition using Self Organizing Maps", 2003.
[5] Simone Marinai, "SOM clustering for text retrieval and classification with examples on Indian Scripts".
[6] Tabassam Nawaz, Syed Ammar Hassan, Habib ur Rehman, Anoshia Faiz, "Optical Character Recognition System for Urdu (Naskh Font) using Pattern Matching Technique", International Journal of Image Processing, Volume (3).
[7] Yasmine Elglaly, Francis Quek, "Isolated Handwritten Arabic Character Recognition using Multilayer Perceptrons and K Nearest Neighbor Classifiers".
[8] Sindhi Language Authority, Official Website. http://www.Sindhila.org/Research%20Journal.htm (Accessed 25th May 2011).
[9] Sindhi Language, Website. http://www.Sindhilanguage.com/script.html (Accessed 25th May 2011).
[10] Ahmed M. Sarhan and Omar I. Al Helalat, "Arabic Character Recognition Using Artificial Neural Network and Statistical Analysis", European and Mediterranean Conference on Information Systems, 2007.
[11] Hamza, Ali A., "Back Propagation Neural Network Arabic Characters Classification Module Utilizing Microsoft Word", Journal of Computer Science, 2008.
[12] Vikas J. Dongre, Vijay H. Mankar, "A Review of Research on Devnagari Character Recognition", Department of Electronics & Telecommunication, Government Polytechnic, Nagpur, India. International Journal of Computer Applications (0975-8887), Volume 12, No. 2, November 2010.
[13] Rashad Al-Jawfi, "Handwriting Arabic Character Recognition LeNet Using Neural Network", Department of Mathematics and Computer Science, Ibb University, Yemen. The International Arab Journal of Information Technology, Vol. 6, No. 3, July 2009.
[14] Velappa Ganapathy and Kok Leong Liew, "Handwritten Character Recognition Using Multiscale Neural Network Training Technique", World Academy of Science, Engineering and Technology 39, 2008.
[15] Shambhavi.rajesh, Int. J. of Advanced Networking and Applications, Volume: 01, Issue: 03, Pages: 188-192 (2009). "An Attempt to Recognize Handwritten Tamil Character Using Kohonen SOM", R. Indra Gandhi, Research Scholar, Department of Computer Science, Mother Teresa Women's University.
[16] Kuzman Ganchev, "Language segmentation for Optical Character Recognition using Self Organizing Map", in Proceedings of the Class of 2003 Senior Conference, pages 109-115, Computer Science Department, Swarthmore College.
[17] http://www.codeproject.com/kb/cs/neural_network_ocr.aspx (Accessed 25th May 2011).
[18] Ubeeka Jain, Dharamveer Sharma, "Recognition of Isolated Handwritten Character of Gurumukhi Script using Neocognitron", International Journal of Computer Applications (0975-8887), November 2010.
[19] R. Jagadeesh Kannan, R. Prabhakar, "A Comparative Study of Optical Character Recognition for Tamil Script", European Journal of Scientific Research, ISSN 1450-216X, Vol. 35, No. 4 (2009).