A SEMINAR/ COLLOQUIUM REPORT ON OFFLINE HANDWRITTEN HINDI CHARACTER RECOGNITION USING DATA MINING Submitted in Partial Fulfillment of the Requirements for the Degree of Master in Computer Applications SUBMITTED BY SHRIKRISHNA SHARMA Roll No.: 1350204016 UNDER THE SUPERVISION OF Mr. Ashok Kumar Associate Professor, Invertis University Bareilly INVERTIS INSTITUTE OF COMPUTER APPLICATIONS INVERTIS UNIVERSITY Invertis Village, Lucknow National Highway 24, Bareilly, Uttar Pradesh 243123 Batch: 2012 - 2015


CERTIFICATE This is to certify that Mr. SHRIKRISHNA SHARMA (Roll No. 1350204016) has carried out the Seminar/Colloquium work presented in this report, entitled OFFLINE HANDWRITTEN HINDI CHARACTER RECOGNITION USING DATA MINING, for the award of Master of Computer Applications from Invertis University, Bareilly, under the supervision of the undersigned. Ashok Kumar, Seminar/Colloquium Supervisor, Associate Professor, Invertis University Bareilly. Ajay Indian, HOD (IICA), Associate Professor, Invertis University Bareilly.

ACKNOWLEDGEMENT The real spirit of achieving a goal is through the way of excellence and austere discipline. I would never have succeeded in completing my task without the cooperation, encouragement and help provided to me by various personalities. First of all, I render my gratitude to the Almighty, who bestowed self-confidence, ability and strength in me to complete this work. Without His grace this would never have become today's reality. With a deep sense of gratitude I express my sincere thanks to my esteemed and worthy supervisor, Mr. Ashok Kumar of the Department of Master of Computer Applications, for his valuable guidance in carrying out this work under his effective supervision, encouragement, enlightenment and cooperation. Most of the novel ideas and solutions found in this thesis are the result of our numerous stimulating discussions. His feedback and editorial comments were also invaluable for the writing of this thesis. I would be failing in my duties if I did not express my deep sense of gratitude towards Mr. Ajay Indian, Head of the Computer Applications Department, who has been a constant source of inspiration for me throughout this work. I am also thankful to all the staff members of the Department for their full cooperation and help.

Thank you.
Shrikrishna Sharma (1350204016)

Abstract

Handwritten numeral recognition plays a vital role in postal automation services, especially in countries like India where multiple languages and scripts are used. Discrete Hidden Markov Models (HMM) and hybrids of Neural Networks (NN) and HMMs are popular methods in handwritten word recognition systems. The hybrid system gives better recognition results due to the better discrimination capability of the NN.

A major problem in handwriting recognition is the huge variability and distortion of patterns. Elastic models based on local observations and dynamic programming, such as HMMs, are not efficient at absorbing this variability: their view is local, they cannot cope with length variability, and they are very sensitive to distortions. A Support Vector Machine (SVM) can instead be used to estimate global correlations and classify the pattern. SVM is an alternative to NN, and in handwriting recognition it gives better recognition results.

The aim of this paper is to develop an approach which improves the efficiency of handwriting recognition using an artificial neural network.

Keywords: Handwriting recognition, Support Vector Machine, Neural Network

Advancement in Artificial Intelligence has led to the development of various smart devices. The biggest challenge in the field of image processing is to recognize documents in both printed and handwritten format. Character recognition is one of the most widely used biometric traits for authentication of both persons and documents.

Optical Character Recognition (OCR) is a type of document image analysis where a scanned digital image containing either machine-printed or handwritten script is input into an OCR software engine and translated into an editable, machine-readable digital text format. A neural network is designed to model the way in which the brain performs a particular task or function of interest. Each character image is comprised of 30 x 20 pixels. We have applied a feature extraction technique for calculating the features. The features extracted from characters are the directions of pixels with respect to their neighboring pixels. These inputs are given to a back-propagation neural network with a hidden layer and an output layer. We have used the back-propagation neural network for efficient recognition, where errors are corrected through back-propagation and the rectified neuron values are transmitted by the feed-forward method through the multiple layers of the network.
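The feature step above (pixel directions with respect to neighbours) can be sketched roughly as a gradient-direction histogram; this is an illustration under the assumption of eight direction bins, not the report's exact procedure:

```python
import numpy as np

def direction_features(char_img, bins=8):
    """Sketch of a direction-feature extractor: approximate each pixel's
    direction with respect to its neighbours by the local gradient, then
    build a magnitude-weighted histogram over `bins` directions."""
    img = np.asarray(char_img, dtype=float)
    gy, gx = np.gradient(img)          # finite differences w.r.t. neighbours
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx)           # direction in (-pi, pi]
    hist, _ = np.histogram(ang, bins=bins, range=(-np.pi, np.pi), weights=mag)
    total = hist.sum()
    return hist / total if total > 0 else hist

# A vertical stroke: its gradients point horizontally.
stroke = np.zeros((8, 8))
stroke[:, 3:5] = 1.0
feats = direction_features(stroke)
print(feats.shape)  # (8,) -- one value per direction bin
```

The normalized 8-bin vector would then serve as one input vector to the back-propagation network.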

Handwriting recognition is the ability of a computer to receive and interpret intelligible handwritten input from sources such as photographs, touch-screens, paper documents and other devices. A written text image may be sensed "off line" from a piece of paper by optical scanning (optical character recognition). Devnagari script has 14 vowels and 33 consonants. Vowels occur either in isolation or in combination with consonants. Apart from the vowels and consonants, called basic characters, there are compound characters in Devnagari script, which are formed by joining two or more basic characters. In addition, Devnagari script has twelve forms of modifiers for each of the 33 consonants, giving rise to modified shapes which depend on whether the modifier is placed to the left, right, top or bottom of the character. The net result is that there are several thousand different shapes or patterns, which makes a Devnagari OCR more difficult to develop. Here the focus is on the recognition of offline handwritten Hindi characters, which can be used in common applications like commercial forms, bill processing systems, bank cheques, government records, signature verification, postcode recognition, passport readers and offline document recognition generated by the expanding technological society. In this project, Devnagari script characters are recognized from document images by the use of a template matching algorithm.
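Template matching in this spirit can be sketched with normalized correlation against a small dictionary of templates; the template names and shapes below are purely illustrative, not the project's actual templates:

```python
import numpy as np

def match_template(char_img, templates):
    """Minimal template-matching classifier: z-score both images and
    pick the template with the highest correlation coefficient."""
    x = np.asarray(char_img, dtype=float).ravel()
    x = (x - x.mean()) / (x.std() + 1e-9)
    best, best_score = None, -np.inf
    for name, tpl in templates.items():
        t = np.asarray(tpl, dtype=float).ravel()
        t = (t - t.mean()) / (t.std() + 1e-9)
        score = float(np.dot(x, t)) / x.size   # correlation in [-1, 1]
        if score > best_score:
            best, best_score = name, score
    return best, best_score

templates = {
    "vertical": np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]]),
    "horizontal": np.array([[0, 0, 0], [1, 1, 1], [0, 0, 0]]),
}
label, score = match_template(np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]]),
                              templates)
print(label)  # vertical
```

A real system would hold one (or several) stored templates per Devnagari character and compare each segmented character image against all of them.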

Table of Contents

1. Cover Page
2. Certificate
3. Abstract
4. Acknowledgements
5. Table of Contents
6. List of Tables
7. List of Figures
8. Introduction
9. Review of related work and problem statement
10. Proposed approach
11. Solution approach
12. Implementation and Result
13. Future work and conclusion
14. References

List of Tables

Recognition accuracy of handwritten Hindi characters
Detailed recognition performance of SVM on UCI datasets
Detailed recognition performance of SVM and HMM on UCI datasets
Recognition rate of each numeral in the datasets

List of Figures

Hindi language basic character set
Character recognition of the document image
Output saved in the form of text format
Generated 8x8 input matrix
Loading some entries from the digits dataset into the application using the default values
The analysis also performs rather well on completely new and previously unseen data

Introduction

Handwriting recognition refers to the process of translating images of handwritten, typewritten, or printed digits into a format understood by the user, for the purpose of editing, indexing/searching, and a reduction in storage size. A handwriting recognition system has its own importance and is adoptable in various fields, such as online handwriting recognition on computer tablets, recognizing zip codes on mail for postal sorting, processing bank cheque amounts, reading numeric entries in forms filled up by hand, and so on. There are two distinct handwriting recognition domains, online and offline, which are differentiated by the nature of their input signals.

In an offline system, a static representation of a digitized document is used in applications such as cheque, form, mail or document processing. On the other hand, online handwriting recognition (OHR) systems rely on information acquired during the production of the handwriting. They require specific equipment that allows the capture of the trajectory of the writing tool. Mobile communication systems such as the Personal Digital Assistant (PDA), electronic pad and smart-phone have an online handwriting recognition interface integrated in them. Therefore, it is important to further improve the recognition performance for these applications while trying to constrain the space for parameter storage and improve processing speed. Figure 1 shows an online handwritten word recognition system. Many current systems use a Discrete Hidden Markov Model based recognizer or a hybrid of Neural Network (NN) and HMM. After normalization, the writing is usually segmented into basic units (normally a character or part of a character) and each segment is classified and labeled. Using an HMM search algorithm in the context of a language model, the most likely word path is then returned to the user as the intended string.
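The HMM search step described here, which turns per-segment classifier scores into the most likely label sequence, can be sketched with a standard Viterbi decoder; all probabilities below are made up for illustration:

```python
import numpy as np

def viterbi(log_init, log_trans, log_obs):
    """Minimal Viterbi decoder: given per-segment log observation scores
    (e.g. from an NN or SVM classifier) and HMM transition scores,
    return the most likely state (character) sequence."""
    T, N = log_obs.shape
    delta = log_init + log_obs[0]             # best score ending in each state
    back = np.zeros((T, N), dtype=int)        # back-pointers for the path
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # N x N candidate transitions
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(N)] + log_obs[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):             # walk the back-pointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Two states, three segments; observations strongly suggest 0 -> 1 -> 1.
log_obs = np.log(np.array([[0.9, 0.1], [0.2, 0.8], [0.3, 0.7]]))
log_trans = np.log(np.array([[0.6, 0.4], [0.4, 0.6]]))
log_init = np.log(np.array([0.5, 0.5]))
print(viterbi(log_init, log_trans, log_obs))  # [0, 1, 1]
```

In a word recognizer the states would be characters and the transition scores would come from the language model.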

The segmentation process can be performed in various ways. However, the observation probability for each segment is normally obtained by using a neural network (NN), and a Hidden Markov Model (HMM) estimates the probabilities of transitions within a resulting word path. This research aims to investigate the use of support vector machines (SVM) in place of the NN in a hybrid SVM/HMM recognition system. The main objective is to further improve the recognition rate [6, 7] by using the SVM at the segment classification level. This is motivated by successful earlier work by Ganapathiraju on a hybrid SVM/HMM speech recognition (SR) system and the work by Bahlmann [8] in OHR. Ganapathiraju obtained a better recognition rate compared to the hybrid NN/HMM SR system. In this work, an SVM is first developed and used to train an OHR system on character databases. SVMs with probabilistic outputs are then developed for use in the hybrid system. Eventually, the SVM will be integrated with the HMM module for word recognition. Preliminary results of using SVM for character recognition are given and compared with the results using NN reported by Poisson. The following databases were used: IRONOFF, UNIPEN and the mixed IRONOFF-UNIPEN database.

Biometrics is most commonly defined as a measurable physiological or behavioral characteristic of the individual that can be used in personal identification and verification. A character recognition device is one such smart device, acquiring partial human intelligence with the ability to capture and recognize various characters in different languages. Character recognition (in general, pattern recognition) addresses the problem of classifying input data, represented as vectors, into categories. Character recognition is a part of pattern recognition [1].

It is impossible to achieve 100% accuracy. The most basic way to recognize patterns is with probabilistic methods, for example using Bayesian network classifiers to recognize characters. The need for character recognition software has increased greatly since the outstanding growth of the Internet. Optical Character Recognition (OCR) is a very well-studied problem in the vast area of pattern recognition. Its origins can be found as early as 1870, when an image transmission system was invented which used an array of photocells to recognize patterns.
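As a concrete instance of such a probabilistic method (a sketch, not the classifier used in this report), a Bernoulli naive-Bayes recognizer over binary pixel vectors looks like this:

```python
import numpy as np

def train_bernoulli_nb(X, y, n_classes):
    """Train a Bernoulli naive-Bayes classifier on binary pixel vectors,
    with Laplace smoothing on the per-pixel probabilities."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    priors = np.zeros(n_classes)
    probs = np.zeros((n_classes, X.shape[1]))
    for c in range(n_classes):
        Xc = X[y == c]
        priors[c] = len(Xc) / len(X)
        probs[c] = (Xc.sum(axis=0) + 1) / (len(Xc) + 2)  # P(pixel on | class)
    return priors, probs

def classify(x, priors, probs):
    """Pick the class maximising the log posterior."""
    x = np.asarray(x, dtype=float)
    log_post = np.log(priors) + (x * np.log(probs)
                                 + (1 - x) * np.log(1 - probs)).sum(axis=1)
    return int(np.argmax(log_post))

# Toy "characters": class 0 lights the left pixels, class 1 the right.
X = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
y = [0, 0, 1, 1]
priors, probs = train_bernoulli_nb(X, y, n_classes=2)
print(classify([1, 1, 0, 0], priors, probs))  # 0
```

The "naive" independence assumption over pixels is what keeps this basic: it is easy to train but ignores the spatial structure that later neural approaches exploit.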

Until the middle of the 20th century, OCR was primarily developed as an aid to the visually handicapped. With the advent of digital computers in the 1940s, OCR was realized as a data processing approach for the first time. The first commercial OCR systems began to appear in the early 1950s, and soon they were being used by the US postal service to sort mail. The accurate recognition of Latin-script typewritten text is now considered largely a solved problem in applications where clear imaging is available, such as the scanning of printed documents.

Typical accuracy rates on these exceed 99%; total accuracy can only be achieved by human review. Optical Character Recognition (OCR) programs are capable of reading printed text. This could be text that was scanned in from a document, or handwritten text that was drawn on a hand-held device, such as a Personal Digital Assistant (PDA). The character recognition software breaks the image into sub-images, each containing a single character.

The sub-images are then translated from an image format into a binary format, where each 0 and 1 represents an individual pixel of the sub-image. The binary data is then fed into a neural network that has been trained to make the association between the character image data and a numeric value that corresponds to the character. The output from the neural network is then translated into ASCII text and saved as a file. Recognition of characters is a very complex problem. The characters can be written in different sizes, orientations, thicknesses, formats and dimensions, which gives infinite variations.

The capability of a neural network to generalize and remain insensitive to missing data [6, 7] is very beneficial in recognizing characters. An artificial neural network is used as the backend to solve the recognition problem. Neural networks have been used in a variety of different areas to solve a wide range of problems. Unlike human brains, which can identify and memorize characters like letters or digits, computers treat them as binary graphics. The central objective of this paper is to demonstrate the capability of artificial neural network implementations in recognizing extended sets of image pixel data. In this paper, offline recognition of characters is performed on a printed text document. It is a process by which we convert a printed or scanned page to ASCII characters that a computer can recognize. A back-propagation feed-forward neural network is used to recognize the characters.

After training the network with the back-propagation learning algorithm, high recognition accuracy can be achieved. Recognition of printed characters is itself a challenging problem, since there are variations of the same character due to changes of font or the introduction of different types of noise. Differences in font and size make the recognition task difficult if pre-processing, feature extraction and recognition are not robust. This paper is organized as follows. The multilayer perceptron neural network for recognition is briefly described in Section 2. In Section 3, the character recognition procedure is described. In Section 4, training performance and prediction accuracy are analyzed, together with the data description and result analysis.
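The training procedure described above, a feed-forward pass followed by back-propagation of the error, can be sketched as follows; the toy data, layer sizes and learning rate are assumptions for illustration, not the report's configuration:

```python
import numpy as np

# Minimal back-propagation training loop for a 1-hidden-layer network
# on toy binary "pixel" vectors with one-hot class targets.
rng = np.random.default_rng(0)

X = np.array([[1., 1., 0., 0.],
              [1., 0., 0., 0.],
              [0., 0., 1., 1.],
              [0., 0., 0., 1.]])
T = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])  # one-hot targets

W1 = rng.normal(0.0, 0.5, (4, 6)); b1 = np.zeros(6)
W2 = rng.normal(0.0, 0.5, (6, 2)); b2 = np.zeros(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 1.0
for _ in range(5000):
    # Feed-forward pass through hidden and output layers.
    H = sigmoid(X @ W1 + b1)
    Y = sigmoid(H @ W2 + b2)
    # Back-propagate the output error and update the weights.
    dY = (Y - T) * Y * (1 - Y)
    dH = (dY @ W2.T) * H * (1 - H)
    W2 -= lr * (H.T @ dY); b2 -= lr * dY.sum(axis=0)
    W1 -= lr * (X.T @ dH); b1 -= lr * dH.sum(axis=0)

pred = np.argmax(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), axis=1)
print(pred)
```

On this separable toy set the loop drives the training error to zero; a real character recognizer would use the direction-feature vectors as `X` and one output unit per character class.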

Hindi handwritten character recognition is one of the major problems in today's world. Even typed Hindi characters are difficult for a computer to recognize; handwritten Hindi characters are therefore not recognized efficiently and accurately by machine. Much research has been done to recognize these characters and many algorithms have been proposed. Many types of software are on the market for optical Hindi character recognition. Recognizing characters requires many processes to be performed; no single process or single machine can perform the recognition alone. Artificial neural networks can be used for recognition of characters due to the simplicity of their design and their universality.

Hindi character recognition is becoming more and more important in the modern world. It helps humans ease their jobs and solve more complex problems. The problem of recognition of hand-printed characters is still an active area of research. With the increasing necessity for office automation, it is imperative to provide practical and effective solutions. The structural, topological and statistical information that has been observed about the characters often does not help in the recognition process, due to the different writing styles and moods of persons at the time of writing. Only limited variations in the shapes of characters are considered here.

Literature Survey:-

Although the first research report on handwritten Devnagari characters was published in 1977 [1], not much research work was done after that. At present, researchers have started to work on handwritten Devnagari characters and a few research reports have been published recently. In this paper, the implementation is done in MATLAB, which allows matrix manipulations, plotting of functions and data, implementation of algorithms, creation of user interfaces, and interfacing with programs written in other languages, including C, C++, Java, and Fortran. Hanmandlu and Murthy [2][3] proposed a fuzzy model based recognition of handwritten Hindi numerals and characters, and obtained 92.67% accuracy for handwritten Devnagari numerals and 90.65% accuracy for handwritten Devnagari characters. Bajaj et al. [4] employed three different kinds of features, namely density features, moment features and descriptive component features, for classification of Devnagari numerals. They proposed a multi-classifier connectionist architecture for increasing the recognition reliability, and obtained 89.6% accuracy for handwritten Devnagari numerals. Kumar and Singh [5] proposed a Zernike moment feature based approach for Devnagari handwritten character recognition. They used an artificial neural network for classification.

OCR is one of the oldest ideas in the history of pattern recognition using computers. In recent times, Punjabi character recognition has become a field of practical usage. In character recognition, the process starts with reading a scanned image of a series of characters, determining their meaning, and finally translating the image to a computer-written text document. Mainly, this process is done in post offices, to mechanically read names and addresses on envelopes, and by banks, to read the amount and number on cheques. Companies and civilians can also use this method to quickly translate paper documents into computer-written documents. Much research has been done on character recognition in the last 56 years. Some books [6-8] and many surveys [4, 5] have been published on character recognition. Most of the work on character recognition was done on Japanese, Latin and Chinese characters in the middle of the 1960s. The work by Impedovo et al. [9] focuses on commercial OCR systems. Jain et al. [10] summarized and compared some of the well-known methods used in various stages of a pattern recognition system, and tried to identify research topics and applications which are at the forefront of this field. Pal and Chaudhuri [8] in their report summarized different systems for Indian language script recognition, describing some commercial systems like the Bangla and Devnagari OCRs. Manish [11] in his survey report summarized a system for the recognition of Punjabi characters, and reported the scope of future work to be extended in several directions, such as OCR for poor quality documents, multi-font OCR and bi-script/multi-script OCR development. A bibliography of the fields of OCR and document analysis is given in [12]. Tappert et al. [13] and Wakahara et al. [14] worked on on-line handwriting recognition and described a distortion-tolerant shape matching method. Nouboud and Plamondon [15] and Suen et al.
[16] proposed methods used for on-line recognition of hand-printed characters, while Connell et al. [17, 18] described on-line character recognition for Devanagari characters and alphanumeric characters. Bortolozzi et al. [19] have published a very useful study on recent advances in handwriting recognition. Lee et al. [20] described off-line recognition of totally unconstrained handwritten numerals using a multilayer cluster neural network. The character regions are determined by using projection profiles and topographic features extracted from the gray-scale images. Then, a nonlinear character segmentation path in each character region is found by using a multi-stage graph search algorithm. Khaly and Ahmed [21], Amin [22] and Lorigo and Govindraju [23] have produced a bibliography of research on Arabic optical text recognition. Hildebrandt and Liu [24] have reported the advances in handwritten Chinese character recognition, and Liu et al. [25] have discussed various techniques used for on-line Chinese character recognition.

2.1 Indian Script Recognition

Compared to the English and Chinese languages, research on OCR of Indian language scripts has not achieved the same perfection. A few attempts have been carried out on the recognition of Indian character sets in Devanagari, Bangla, Tamil, Telugu, Oriya, Gurmukhi, Gujarati and Kannada. These attempts are briefly described in the following sub-sections.

2.1.1 Recognition of Handwritten Devnagari Scripts

Devnagari is the most popular script in India. The Devnagari script is used to write many Indian languages, such as Hindi, Marathi, Rajasthani, Sanskrit and Nepali. The characters of the Hindi language are shown in figure 9. The work on handwritten Devnagari character recognition started in 1977, when I. K. Sethi and B. Chatterjee [26] presented a system for handwritten Devnagari characters. In this system, sets of very simple primitives were used.
Most of the decisions were taken on the basis of the presence/absence or positional relationship of these 18 primitives. A multistage process was used for taking these decisions; by the completion of each stage, the options for deciding the class membership of the input token decrease. In 1979, Sinha and Mahabala [27] presented a syntactic pattern analysis system with an embedded picture language for the recognition of handwritten and machine-printed Devnagari characters. In this system, mainly a feature extraction technique was used. Sethi and Chatterjee [28] also did some studies on hand-printed Devnagari numerals based upon a binary decision tree classifier, where the binary decision tree was made on the basis of the presence or absence of some basic primitives, namely horizontal line segment, vertical line segment, left and right slant, D-curve, C-curve, etc., and their positions and interconnections. That decision process was also a multistage process. Brijesh K. Verma [29] presented a system for handwritten Hindi Character Recognition (HCR) using Multi-Layer Perceptron (MLP) networks and Radial Basis Function (RBF) networks. The error back-propagation algorithm was used to train the MLP networks.

2.1.2 Recognition of Bangla Characters

Among all the Indian scripts, the maximum work on recognition of handwritten characters has been done on Bangla characters. Handwritten Bangla characters are shown in figure 10. For offline handwritten Bangla numeral and character recognition, some OCR systems are available on the market. In 1982, S. K. Parui and B. B. Chaudhuri et al. [30] proposed a recognition scheme using a syntactic method for connected Bangla handwritten numerals. Using automata, sub-patterns are built on the basis of one-dimensional strings of eight-direction codes. In 1998, A. F. R. Rahman and M.
Kaykobad [31] proposed a complete Bangla OCR system in which they used a hybrid approach for recognition of handwritten Bangla characters. Everybody has a different writing style. For this reason, Pal and Chaudhuri [32] proposed a robust scheme for the recognition of isolated Bangla off-line handwritten numerals. In this scheme, the direction of the numeral, the height and position of the numeral with respect to the character bounding box, the shape of the reservoir, etc. are used for recognition. Dutta and Chaudhuri [34] reported a work on recognition of isolated Bangla alphanumeric handwritten characters using neural networks. In this method, primitives are used for representing the characters, together with the structural constraints between the primitives imposed by the junctions present in the characters. A neural network approach is also used by Bhattacharya et al. [33] for the recognition of Bangla handwritten numerals. Here, certain features like loops, junctions, etc. present in the graph are considered to classify a numeral into a smaller group. Sural and Das [35] defined fuzzy sets on the Hough transform of character pattern pixels, from which additional fuzzy sets are synthesized using t-norms. Garain et al. [36] proposed an online handwriting recognition system for Bangla. A low-complexity classifier has been designed, and the proposed similarity measure appears to be quite robust against wide variations in writing style. Pal, Wakabayashi and F. Kimura [37] proposed a recognition system for handwritten offline compound Bangla characters using a Modified Quadratic Discriminant Function (MQDF). The features used for recognition are mainly based on directional information obtained from the arc tangent of the gradient. To get the features, first a 2 x 2 mean filter is applied 4 times on the gray-level image and non-linear size normalization is done on the image.

2.1.3 Recognition of Tamil Characters

The work on recognition of Tamil characters was started in 1978 by Siromoney et al.
[38]. They described a method for recognition of machine-printed Tamil characters using an encoded character string dictionary. The scheme employs string features extracted by row-wise and column-wise scanning of the character matrix. Features in each row (column) are encoded suitably depending upon the complexity of the script to be recognised. Chandrasekaran et al. [39] used a similar approach for constrained hand-printed Tamil character recognition. Chinnuswamy and Krishnamoorthy [40] presented an approach for hand-printed Tamil character recognition employing labeled graphs to describe the structural composition of characters in terms of line-like primitives. Recognition is carried out by correlation matching of the labeled graph of the unknown character with those of the prototypes. A piece of work on on-line Tamil character recognition is reported by Aparna et al. [41]. They used shape-based features including dot, line terminal, bumps and cusp. Stroke identification is done by comparing an unknown stroke with a database of strokes. A finite state automaton has been used for character recognition, with an accuracy of 71.32-91.5%.

2.1.4 Recognition of Telugu Characters

A two-stage recognition system for printed Telugu alphabets has been described by Rajasekaran and Deekshatulu [42]. In the first stage, a directed curve tracing method is employed to recognize primitives and to extract the basic character from the actual character pattern. In the second stage, the basic character is coded and, on the basis of the knowledge of the primitives and the basic character present in the input pattern, the classification is achieved by means of a decision tree. Lakshmi and Patvardhan [43] presented a Telugu OCR system for printed text of multiple sizes and multiple fonts. After pre-processing, a connected component approach is used for segmenting characters. Real-valued direction features have been used for a neural network based recognition system. The authors have claimed an accuracy of 98.6%.
Negi et al. [2] presented a system for printed Telugu character recognition, using connected components and fringe distance based template matching for recognition. Fringe distances compare only the black pixels and their positions between the templates and the input images.

2.1.5 Recognition of Gurmukhi Characters

The Gurmukhi script is used primarily for writing the Punjabi language. The Punjabi language is spoken by eighty-four million native speakers and is the world's 14th most widely spoken language. Lehal and Singh [30] developed a complete OCR system for printed Gurmukhi script, where connected components are first segmented using a thinning based approach. They started by discussing useful pre-processing techniques. Lehal and Singh [30] have discussed in detail the segmentation problems for Gurmukhi script. They observed that the horizontal projection method, which is the method most commonly employed to extract the lines from a document, fails in many cases when applied to Gurmukhi text and results in over-segmentation or under-segmentation. The text image is broken into horizontal text strips using the horizontal projection of each row. The gaps in the horizontal projection profile are taken as separators between the text strips. Each text strip can represent: a) the core zone of one text line, consisting of the upper and middle zones and optionally the lower zone (core strip); b) the upper zone of a text line (upper strip); c) the lower zone of a text line (lower strip); d) the core zone of more than one text line (multi strip). Then, using the estimated average height of the core strip and its percentage, they identify the type of each strip. The classification process is carried out in three stages. In the first stage, the characters are grouped into three sets depending on their zonal position, i.e. upper zone, middle zone and lower zone.
In the second stage, the characters in the middle zone set are further distributed into smaller sub-sets by a binary decision tree using a set of robust and font independent features. In the third stage, the nearest neighbor classifier is used with the special features distinguishing the characters in each subset. This enhances the computational efficiency. The system has an accuracy of about 97.34%. An OCR post-processor for Gurmukhi script has also been developed. Finally, Lehal and Singh, and Lehal et al., proposed a post-processor for Gurmukhi OCR where statistical information on Punjabi language syllable combinations and certain heuristics based on Punjabi grammar rules have been considered. There is also some literature dealing with segmentation of Gurmukhi script. Lehal and Singh have performed segmentation of Gurmukhi script by connected component analysis of a word, assuming the headline is not part of the word. Goyal et al. have suggested a dissection based Gurmukhi character segmentation method, which segments the characters in the different zones of a word by examining the vertical white space. Manish [11] proposed an algorithm for recognizing Gurmukhi script. In his work he recognized Punjabi characters with an efficiency of 92.56%. For Chinese and Latin, the efficiency of word recognition is over 99%.
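The horizontal-projection strip extraction described above can be sketched directly: sum the ink in each row and treat runs of zero rows as separators between text strips (a simplified illustration; real pages need a noise threshold rather than an exact zero test):

```python
import numpy as np

def text_strips(binary_page):
    """Split a binary page image into horizontal text strips using the
    row-wise projection profile; blank rows act as separators."""
    profile = binary_page.sum(axis=1)        # ink count per row
    strips, start = [], None
    for r, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = r                        # a strip begins
        elif ink == 0 and start is not None:
            strips.append((start, r - 1))    # strip ended on previous row
            start = None
    if start is not None:                    # strip touching the bottom edge
        strips.append((start, len(profile) - 1))
    return strips

page = np.array([[0, 0, 0],
                 [1, 1, 0],
                 [0, 1, 1],
                 [0, 0, 0],
                 [1, 0, 1],
                 [0, 0, 0]])
print(text_strips(page))  # [(1, 2), (4, 4)]
```

Each returned (top, bottom) row range is one candidate strip, which would then be classified as core, upper, lower or multi strip by its height, as the text describes.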

HINDI LANGUAGE: A REVIEW:-

Hindi is an Indo-Aryan language and is one of the official languages of India. It is the world's third most commonly used language after Chinese and English and has approximately 500 million speakers all over the world. It is written in the Devnagari script, from left to right along a horizontal line. The basic character set has 13 SWARS (vowels) and 33 VYANJANS (consonants), shown in the figure.

Figure 1: Hindi language basic character set

DEVNAGARI SCRIPT :-

Hindi is the world's third most commonly used language after Chinese and English, and there are approximately 500 million people all over the world who speak and write in Hindi. Devnagari is the basic script of many languages in India, such as Hindi and Sanskrit. It is indisputable that Devnagari has a very sound scientific basis. For a long time, it has been the script of the Indian Aryan languages. It is even now used by the Sanskrit, Hindi, Marathi and Nepali languages. Hindi is a widely spoken world language, and since its script is Devnagari, Devnagari is correspondingly popular. As Hindi has been declared the national language by the Constitution of India, Devnagari has got the status of the national script.

Hindi was also declared the state language, and Devnagari the state script, of several major states such as Himachal Pradesh, Haryana, Rajasthan, Madhya Pradesh, Bihar and Uttaranchal. Devnagari is regarded as a highly scientific script. Since the major Indian scripts developed from the Brahmi script, Devnagari is related to almost every other Indian script. In Devnagari all letters are equal, i.e. there is no concept of capital or small letters. Devnagari is half-syllabic in nature.

Optical Character Recognition (OCR):-

OCR is the acronym for Optical Character Recognition. This technology allows a machine to automatically recognize characters through an optical mechanism. Human beings recognize many objects in this manner: the eyes are the optical mechanism, while the brain interprets the input, and the ability to interpret these signals varies from person to person depending on many factors. Reviewing these variables makes it easy to understand the challenges faced by the technologist developing an OCR system. Documents on paper can be read and understood by humans, but a computer cannot understand such documents directly.

OCR systems are developed in order to convert these documents into a computer-processable form. OCR is the process of converting scanned images of machine-printed or handwritten text, numerals, letters and symbols into a computer-processable format such as ASCII. OCR is an area of pattern recognition, and the processing of handwritten characters is motivated largely by the desire to improve man-machine communication.

Proposed Algorithm:-

The system performs character recognition by exploiting template matching for its ability to recognize handwritten Hindi characters. The following steps are followed:

1- A database of handwritten Hindi characters is created from the handwriting of different people.
2- Preprocessing of each training image:
a) Binarization of the image using bw = im2bw(Ibw, level).
b) Edge detection using iedge = edge(uint8(bw)).
c) Dilation using se = strel('square', 2); iedge2 = imdilate(iedge, se).
d) Region filling using ifill = imfill(iedge2, 'holes').
e) Character detection using:
[Ilabel, num] = bwlabel(ifill);
Iprops = regionprops(Ilabel);
Ibox = [Iprops.BoundingBox];
Ibox = reshape(Ibox, [4 num]);
3- Extraction and scaling of the normalized characters to a 50x50 grid using bounding-box analysis:
img{cnt} = imcrop(Ibw, Ibox(:, cnt));
bw2 = imgcrop(img{cnt});
charvec = imresize(bw2, [50 50]);
4- Template generation by image averaging; the templates are saved in a templates.mat file, which is used in the matching phase.
5- The test image is binarized and matched against the templates, generating a result.txt file containing the recognized characters.

The scope of the proposed system is limited to the recognition of a single character.
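The pipeline above (binarize, find the character's bounding box, normalize its size, average templates, score a match) can be mirrored in a small pure-Python sketch. This is illustrative only: the helper names are my own, the threshold convention follows MATLAB's im2bw, and the normalized size is a parameter rather than the report's fixed 50x50.

```python
# Hypothetical sketch of the template-matching pipeline: binarize a
# grayscale image, crop to the character's bounding box, resize to a
# fixed grid, build templates by averaging, and score by similarity.

def binarize(img, level):
    """Threshold a grayscale image (0..255) into 0/1, as im2bw does."""
    return [[1 if px >= level * 255 else 0 for px in row] for row in img]

def bounding_box(bw):
    """Smallest (r0, r1, c0, c1) rectangle containing all 1-pixels.
    Assumes at least one foreground pixel exists."""
    rows = [r for r, row in enumerate(bw) if any(row)]
    cols = [c for c in range(len(bw[0])) if any(row[c] for row in bw)]
    return rows[0], rows[-1], cols[0], cols[-1]

def crop_and_resize(bw, size):
    """Crop to the bounding box, then nearest-neighbour resize to size x size."""
    r0, r1, c0, c1 = bounding_box(bw)
    h, w = r1 - r0 + 1, c1 - c0 + 1
    return [[bw[r0 + (i * h) // size][c0 + (j * w) // size]
             for j in range(size)] for i in range(size)]

def average_templates(samples):
    """Pixel-wise mean of equally sized binary samples (template generation)."""
    n, size = len(samples), len(samples[0])
    return [[sum(s[i][j] for s in samples) / n for j in range(size)]
            for i in range(size)]

def match_score(char, template):
    """Simple similarity: 1 minus the mean absolute pixel difference."""
    size = len(char)
    diff = sum(abs(char[i][j] - template[i][j])
               for i in range(size) for j in range(size))
    return 1 - diff / (size * size)
```

A test character would then be classified by computing match_score against every stored template and taking the best-scoring one.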

Scanning :- Handwritten character data samples are acquired on paper from various people. These samples are then scanned from the paper through an optically digitizing device such as an optical scanner or a camera. A flat-bed scanner is used at 300 dpi, which converts the data on the scanned paper into a bitmap image.

DETAIL REVIEW OF WORK AND PROBLEM STATEMENT

DESIGNING OF A MULTILAYER NEURAL NETWORK FOR RECOGNITION

There are two basic methods used for OCR: matrix matching and feature extraction. Of the two, matrix matching is the simpler and more common, while feature extraction is the more versatile and can make a product more robust and accurate. Here we use matrix matching for the recognition of characters. The process of character recognition of the document image mainly involves six phases:
1. Acquisition of a grayscale image
2. Digitization/binarization
3. Line and boundary detection
4. Feature extraction
5. Feed-forward artificial neural network based matching
6. Recognition of the character based on the matching score

Fig. 2.

The scanned image must be a grayscale image or a binary image [4, 5], where a binary image is a contrast-stretched grayscale image. The grayscale image then undergoes digitization. In digitization [12], a rectangular matrix of 0s and 1s is formed from the image, where 0 represents black and 1 represents white, and all RGB values are converted into 0s and 1s. The matrix of dots represents a two-dimensional array of bits.

Digitization is also called binarization, as it converts a grayscale image into a binary image using an adaptive threshold. Line and boundary detection is the process of identifying points in a digital image at which the character's top, bottom, left and right extents are calculated. The feed-forward neural network approach is used to combine all the unique features, which are taken as inputs; one hidden layer is used to integrate and collaborate [9] similar features and, if required, to adjust the inputs by adding or subtracting weight values; finally, one output layer is used to find the overall matching score of the character.

CHARACTER RECOGNITION PROCEDURE

Pre-processing:- The pre-processing stage yields a clean document, in the sense that a normalized image with maximal shape information, maximal compression and minimal noise is obtained.

Segmentation:- Segmentation is an important stage, because the extent to which words, lines or characters can be separated directly affects the recognition rate of the script.

Feature extraction:- After segmenting the character, features such as height, width, horizontal lines, vertical lines, and top and bottom extents are extracted.

Classification:- For classification or recognition, the back-propagation algorithm is used.

Output:- The output is saved in text format.

TRAINING ALGORITHM PERFORMANCE AND ACCURACY OF PREDICTION

The back-propagation algorithm requires a numerical representation of the characters. Learning is implemented using the back-propagation algorithm with a learning rate. The gradient is calculated [10] after every iteration and compared with a threshold gradient value; if the gradient is greater than the threshold, the next iteration is performed. The batch steepest-descent training function is used: the weights and biases are updated in the direction of the negative gradient of the performance function. To evaluate the model quantitatively and to compare models, two error measures are employed: the mean squared error (MSE) and the mean absolute error (MAE). If y_i is the actual observation for time period i and F_i is the forecast for the same period, then the error is defined as

e_i = y_i - F_i  (1)

The mean squared error is

MSE = (1/n) * sum(e_i^2) for i = 1..n  (2)

and the mean absolute error is

MAE = (1/n) * sum(|e_i|) for i = 1..n  (3)

where n is the number of time periods. The mean squared error decreased gradually and became stable, and the training and testing errors produced satisfactory results on the training performance curve of the neural network. The accuracy of the trained network is tested against output data and assessed in two ways. In the first, the predicted output values are compared with the measured values; the results show the relative accuracy of the predicted output, and the overall percentage error obtained from the tested results is 4%. In the second, the root mean square error and the mean absolute error are determined and compared. The performance index for training the ANN is given in terms of the MSE. The tolerance limit for the MSE is set to 0.001; the MSE of the training set became stable at 0.0070 when the number of iterations reached 350.
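As a quick check of the two error measures used above, a minimal sketch of MSE and MAE (with e_i = y_i - F_i) could look like this:

```python
# Mean squared error and mean absolute error over paired series of
# actual observations and forecasts for the same time periods.

def mse(actual, forecast):
    """MSE = (1/n) * sum of squared errors."""
    n = len(actual)
    return sum((y - f) ** 2 for y, f in zip(actual, forecast)) / n

def mae(actual, forecast):
    """MAE = (1/n) * sum of absolute errors."""
    n = len(actual)
    return sum(abs(y - f) for y, f in zip(actual, forecast)) / n
```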
The closeness of the training and testing errors validates the accuracy of the model.

EXPERIMENTAL RESULTS

An interface for the proposed character recognition system was created using Microsoft Visual C# 2008 Express Edition. The MLP network implemented is composed of three layers: an input layer, a hidden layer and an output layer. The input layer consists of 180 neurons, which receive printed image data from a 30x20 symbol pixel matrix. The hidden layer consists of 256 neurons, a number decided on the basis of optimal results from trial and error [12]. The output layer is composed of 16 neurons. Number of characters = 90, learning rate = 150, number of neurons in the hidden layer = 256.

TABLE I: PERCENTAGE OF ERROR FOR DIFFERENT EPOCHS

2. Existing Techniques

2.1 Modified Quadratic Discriminant Function (MQDF) Classifier

G. S. Lehal and Nivedan Bhatt [10] designed a recognition system for handwritten Devnagari numerals using a modified quadratic discriminant function (MQDF) classifier. A recognition rate of 89% and a confusion rate of 4.5% were obtained.

2.2 Neural Networks on Devnagari Numerals

R. Bajaj, L. Dey and S. Chaudhari [11] used a neural-network-based classification scheme. Numerals were represented by feature vectors of three types, and three different neural classifiers were used to classify them. Finally, the outputs of the three classifiers were combined using a connectionist scheme. A 3-layer MLP was used to implement the classifier for segment-based features. Their work produced a recognition rate of 89.68%.

2.3 Gaussian Distribution Function

R. J. Ramteke et al. applied classifiers to 2000 numeral images obtained from individuals of different professions. The results of PCA, correlation coefficients and perturbed moments were an experimental success compared with moment invariants (MIs). This research produced a 92.28% recognition rate using 77 feature dimensions.

2.4 Fuzzy classifier on Hindi Numerals

M. Hanmandlu, A. V. Nath, A. C. Mishra and V. K. Madasu used a fuzzy membership function for the recognition of handwritten Hindi numerals and produced a 96% recognition rate. To recognize the unknown numeral set, an exponential variant of the fuzzy membership function was selected; it was constructed using the normalized vector distance.

2.5 Multilayer Perceptron

Ujjwal Bhattacharya and B. B. Chaudhuri [11] used distinct MLP classifiers. They worked on Devnagari, Bengali and English handwritten numerals. A back-propagation (BP) algorithm was used to train the MLP classifiers. It provided 99.27% and 99.04% recognition accuracies on the original training and test sets of the Devnagari numeral database, respectively.

2.6 Quadratic classifier for Devanagari Numerals

U. Pal, T. Wakabayashi, N. Sharma and F. Kimura [14] developed a modified quadratic classifier for the recognition of offline handwritten numerals in six popular Indian scripts. They used 64-dimensional features for high-speed recognition. A five-fold cross-validation technique was used for computing results, and 99.56% accuracy was obtained on the Devnagari script.

PROPOSED APPROACH

3.1 Support Vector Machine (SVM)

In its basic form, SVM implements two-class classification. In recent years it has been used as an alternative to popular methods such as neural networks. The advantage of SVM is that it takes into account both experimental data and structural behavior for better generalization, based on the principle of structural risk minimization (SRM). Its formulation approximates the SRM principle by maximizing the margin of class separation, which is why it is also known as a large-margin classifier. The basic SVM formulation is for linearly separable datasets.

It can be applied to nonlinear datasets by implicitly mapping the nonlinear inputs into a linear feature space, where the maximum-margin decision function is approximated. The mapping is done using a kernel function. Multi-class classification can be performed by modifying the two-class scheme. The objective of recognition is to interpret a sequence of numerals taken from the test set. The architecture of the proposed system is given in Fig. 3. The SVM (a binary classifier) is applied to the multi-class numeral recognition problem using a one-versus-rest method, and is trained on the training samples using a linear kernel.

The classifier performs its function in two phases, training and testing [29]. After preprocessing and feature extraction, training is performed on the feature vectors, which are stored in the form of matrices. The result of training is used for testing on the numerals. The training procedure is given in the algorithm listing.
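The one-versus-rest scheme mentioned above can be sketched in a few lines. This is an illustration of the scheme only: the binary classifier here is a hypothetical linear scorer f(x) = w.x + b standing in for the trained linear-kernel SVMs, and the function names are my own.

```python
# One-versus-rest multi-class prediction: one binary decision function
# per class; the class whose function gives the highest score wins.

def decision(w, b, x):
    """Linear decision function f(x) = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def one_vs_rest_predict(models, x):
    """models maps each class label to its trained (w, b) pair.
    Returns the label with the highest decision score for x."""
    return max(models, key=lambda label: decision(*models[label], x))
```

In the full system, each (w, b) pair would come from training one "class vs rest" SVM on the stored feature-vector matrices.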

3.2 Statistical Learning Theory

Support Vector Machines were developed by Vapnik in the framework of Statistical Learning Theory [13]. In statistical learning theory (SLT), the problem of classification in supervised learning is formulated as follows. We are given a set of l training examples and their classes, {(x1, y1), ..., (xl, yl)} in R^n x R, sampled according to an unknown joint probability distribution P(x, y) characterizing how the classes are spread in R^n x R. To measure the performance of a classifier, a loss function L(y, f(x)) is defined: L(y, f(x)) is zero if f classifies x correctly, and one otherwise. How f performs on average can be described by the risk functional. The ERM (empirical risk minimization) principle states that, given the training set and a set of possible classifiers in the hypothesis space F, we should choose the f in F that minimizes the empirical risk Remp(f). However, such an f may not generalize well to unseen data, due to the overfitting phenomenon: Remp(f) is a poor, over-optimistic approximation of R(f), the true risk. The neural network classifier relies on the ERM principle.
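The risk functional itself did not survive extraction; under the standard SLT definitions consistent with the loss function just introduced (a reconstruction, not the report's own typesetting), the true and empirical risks are:

```latex
R(f) = \int L\bigl(y, f(x)\bigr)\, dP(x, y),
\qquad
R_{\mathrm{emp}}(f) = \frac{1}{l} \sum_{i=1}^{l} L\bigl(y_i, f(x_i)\bigr).
```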

The normal practice for obtaining a more realistic estimate of the generalization error, as with neural networks, is to divide the available data into a training set and a test set. The training set is used to find a classifier with minimal empirical error (to optimize the weights of an MLP neural network), while the test set is used to estimate the generalization error (the error rate on the test set). If we have different classifier hypothesis spaces F1, F2, ... (e.g. MLP neural networks with different topologies), we can select a classifier from each hypothesis space (each topology) with minimal Remp(f) and choose the final classifier with minimal generalization error. However, doing this requires designing and training a potentially large number of individual classifiers. Using SLT, we do not need to do that: the generalization error can be minimized directly by minimizing an upper bound on the risk functional R(f).

The bound given below holds for any distribution P(x, y) with probability at least 1 - eta, where the parameter h denotes the so-called VC (Vapnik-Chervonenkis) dimension and the confidence term is defined by Vapnik [10]. ERM alone is not sufficient to find a good classifier, because even with a small Remp(f), when h is large compared to l the confidence term will be large, so R(f) will also be large, i.e. not optimal. We actually need to minimize Remp(f) and the confidence term at the same time, a process which is called structural risk minimization (SRM). With SRM, we no longer need a test set for model selection.
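The bound and confidence term referred to above are missing from the extracted text; reconstructed from Vapnik's standard result (an assumption on my part, not the report's own formula), they read: with probability at least 1 - eta,

```latex
R(f) \;\le\; R_{\mathrm{emp}}(f)
  + \underbrace{\sqrt{\frac{h\left(\ln\frac{2l}{h} + 1\right) - \ln\frac{\eta}{4}}{l}}}_{\text{confidence term}}
```

where l is the number of training examples and h the VC dimension of the hypothesis space.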

Taking different sets of classifiers F1, F2, ... with known VC dimensions h1, h2, ..., we can select the f from each set with minimal Remp(f), compute the bound, and choose the classifier with minimal R(f). No further evaluation on a test set is needed, at least in theory. However, we still have to train a potentially very large number of individual classifiers. To avoid this, we want to make h tunable, i.e. to associate each potential classifier set Fi with a VC dimension h and choose an optimal f from an optimal Fi in a single optimization step. This is done in large-margin classification.

3.3 SVM formulations

SVM is realized from the above SLT framework. The simplest formulation of SVM is linear, where the decision hyperplane lies in the space of the input data x. In this case the hypothesis space is a subset of all hyperplanes of the form f(x) = w.x + b. As the solution to the learning problem, SVM finds the optimal hyperplane, which is geometrically the furthest from both classes, since that will generalize best for future unseen data.

There are two ways of finding the optimal decision hyperplane. The first is to find the plane that bisects the two closest points of the two convex hulls defined by the points of each class, as shown in Figure 2. The second is to maximize the margin between two supporting planes, as shown in Figure 3. Both methods produce the same optimal decision plane and the same set of points that support the solution (the closest points on the two convex hulls in Figure 2, or the points on the two parallel supporting planes in Figure 3). These points are called the support vectors.
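The margin-maximization view can be written as the standard primal optimization problem (stated here for the linearly separable case, which is the setting of this section):

```latex
\min_{w,\, b} \;\; \frac{1}{2}\lVert w \rVert^2
\quad \text{subject to} \quad
y_i \left( w \cdot x_i + b \right) \ge 1, \qquad i = 1, \dots, l,
```

which yields a separating hyperplane with margin 2 / ||w||; the training points for which the constraint holds with equality are the support vectors.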

4. Feature Extraction

4.1 Moment Invariants

The moment invariants (MIs) [1] are used to evaluate seven distributed parameters of a numeral image. In any character recognition system, the characters are processed to extract features that uniquely represent the properties of the character. A set of seven moment invariants is derived from the normalized central moments. The image was then thinned, and seven further moments were extracted from the thinned image. We thus had 14 features (7 original and 7 thinned), which were used as features for recognition with a Gaussian distribution function. To increase the success rate, new features need to be extracted by applying the affine invariant moment method.

4.2 Affine Moment Invariants

The affine moment invariants were derived by means of the theory of algebraic invariants; a full derivation and a comprehensive discussion of the properties of the invariants can be found in the literature. Four such features can be computed for character recognition, so overall 18 features have been used with the Support Vector Machine.

5. Experiment

5.1 Data Set Description

In this paper, UCI Machine Learning datasets are used. The UCI Machine Learning Repository is a collection of databases, domain theories and data generators used by the machine learning community for the empirical analysis of machine learning algorithms. One of the available datasets is the Optical Recognition of Handwritten Digits Data Set.

A dataset of handwritten Assamese characters was created by collecting samples from 45 writers. Each writer contributed 52 basic characters, 10 numerals and 121 Assamese conjunct consonants, so the total number of entries per writer is 183 (= 52 characters + 10 numerals + 121 conjunct consonants) and the total number of samples in the dataset is 8235 (= 45 x 183). The handwriting samples were collected on an iBall 8060U external digitizing tablet connected to a laptop, using its cordless digital stylus pen. The dataset is organized into 45 folders.

Each file contains information about the character id (ID), character name (Label) and actual shape of the character (Char). In the raw Optdigits data, digits are represented as 32x32 matrices. They are also available in a preprocessed form in which each digit has been divided into non-overlapping 4x4 blocks and the number of on pixels counted in each block. This yields 8x8 input matrices in which each element is an integer in the range 0..16.
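The Optdigits block-counting step just described can be sketched directly (the function name is my own; the block size of 4 matches the dataset's preprocessing):

```python
# Divide a square binary bitmap into non-overlapping block x block
# cells and count the "on" (1) pixels in each cell. For a 32x32 input
# and block = 4 this yields an 8x8 matrix of integers in 0..16.

def block_count(bitmap, block=4):
    n = len(bitmap)        # assumes an n x n bitmap with n divisible by block
    out_n = n // block
    return [[sum(bitmap[block * i + di][block * j + dj]
                 for di in range(block) for dj in range(block))
             for j in range(out_n)]
            for i in range(out_n)]
```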

5.2 Data Preprocessing

For the experiments using SVM, isolated example characters are preprocessed and 7 local features are extracted for each point of the spatially resampled online signal. For each example character there are 350 feature values as input to the SVM. We use an SVM with an RBF kernel, since the RBF kernel has been shown to generally give better recognition results. A grid search was done to find the best values of C and gamma (the parameters in the original RBF kernel formulation) for the final SVM models, and by that C = 8 and gamma = were chosen.

Preprocessing :-

A series of operations is performed on the scanned image during preprocessing (Figure 4). The operations performed are: (i) median filtering is applied to reduce noise introduced into the character image during scanning; the filter window is usually a template centered on the point of interest, and to perform median filtering at a point, the values of the pixel and its neighbors are sorted by gray level and their median is determined [12]; (ii) global thresholding is applied to convert the image from grayscale to binary form; (iii) the image is normalized to 7x7; (iv) thinning is performed by the method proposed in [10].
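Step (i) above, median filtering with a window centered on the point of interest, can be sketched for a 3x3 window (the border handling here, leaving edge pixels unchanged, is a simplifying assumption of this sketch):

```python
# 3x3 median filter: for each interior pixel, sort the pixel and its
# eight neighbours by gray level and replace the pixel with the median.
# Border pixels are left unchanged for simplicity.

def median_filter3(img):
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            window = sorted(img[r + dr][c + dc]
                            for dr in (-1, 0, 1) for dc in (-1, 0, 1))
            out[r][c] = window[4]  # median of the 9 sorted values
    return out
```

Impulse (salt-and-pepper) noise from scanning is suppressed because an isolated extreme value never ends up in the middle of the sorted window.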

The back-propagation algorithm works as follows: (i) all the weights are initialized to small random values; (ii) an input vector and the desired output vector are presented to the net; (iii) each input unit receives an input signal and transmits it to each hidden unit; (iv) each hidden unit calculates its activation function and sends the signal to each output unit; (v) the actual output is calculated, and each output unit compares its actual output with the desired output to determine the associated error for that unit; (vi) the weights are adjusted to minimize the error.
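The steps (i)-(vi) above can be sketched for a single hidden layer with sigmoid activations. This is a minimal illustration, not the report's implementation: it omits biases, and the layer sizes and learning rate are arbitrary.

```python
import math

# One gradient-descent step of back-propagation for a net with one
# hidden layer. w_ih[j] holds the input weights of hidden unit j;
# w_ho[k] holds the hidden weights of output unit k.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_ih, w_ho):
    # steps (iii)-(iv): propagate the input through hidden and output layers
    hidden = [sigmoid(sum(w * xi for w, xi in zip(col, x))) for col in w_ih]
    output = [sigmoid(sum(w * hi for w, hi in zip(col, hidden))) for col in w_ho]
    return hidden, output

def train_step(x, target, w_ih, w_ho, lr=0.5):
    hidden, output = forward(x, w_ih, w_ho)
    # step (v): error at each output unit, times the sigmoid derivative
    delta_o = [(t - o) * o * (1 - o) for t, o in zip(target, output)]
    # propagate the error back to the hidden units
    delta_h = [hi * (1 - hi) * sum(d * w_ho[k][j] for k, d in enumerate(delta_o))
               for j, hi in enumerate(hidden)]
    # step (vi): adjust the weights to reduce the error
    for k, d in enumerate(delta_o):
        w_ho[k] = [w + lr * d * hi for w, hi in zip(w_ho[k], hidden)]
    for j, d in enumerate(delta_h):
        w_ih[j] = [w + lr * d * xi for w, xi in zip(w_ih[j], x)]
    # sum of squared errors before this update, for monitoring
    return sum((t - o) ** 2 for t, o in zip(target, output))
```

Repeated calls to train_step on the same pattern drive the squared error down, which is step (vi) iterated until the error (or gradient) falls below a threshold.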

In this paper, the proposed back-propagation neural net is designed with two hidden layers, as shown in Figure 7. The input layer contains 49 nodes, as 49 features were extracted for each character. The output layer contains 5 nodes (1 node for each class), and each character is represented as a 5x1 output vector. The number of nodes in each hidden layer was set to 7.

PROPOSED WORK

The character recognition task has been attempted through many different approaches, such as template matching and statistical techniques like NN, HMM and the quadratic discriminant function (QDF). Template matching works effectively for the recognition of standard fonts, but gives poor performance with handwritten characters and when the size of the dataset grows; it is not an effective technique when there is font discrepancy. HMM models achieved great success in the field of speech recognition in past decades; however, developing a 2-D HMM model for character recognition has been found difficult and complex. NN is found to be computationally expensive for recognition. N. Araki et al. applied Bayesian filters based on Bayes' theorem to handwritten character recognition. Later, discriminative classifiers such as the Artificial Neural Network (ANN) and the Support Vector Machine (SVM) attracted a lot of attention. G. Vamvakas et al. compared the performance of three classifiers, Naive Bayes, K-NN and SVM, and attained the best performance with SVM; however, SVM suffers from the limitation of kernel selection. ANNs can adapt to changes in the data and learn the characteristics of the input signal, and they consume less storage and computation than SVMs. The most commonly used ANN classifiers are the MLP and the RBFN. B. K. Verma presented a system for HCR using MLP and RBFN networks on the task of handwritten Hindi character recognition; the error back-propagation algorithm was used to train the MLP networks. J. Sutha et al. showed the effectiveness of the MLP for Tamil HCR using Fourier descriptor features. R. Gheroie et al. proposed handwritten Farsi character recognition using an MLP trained with the error back-propagation algorithm. Similar-shaped characters are difficult to differentiate because of the very minor variations in their structures. T. Wakabayashi et al. proposed an F-Ratio (Fisher ratio) based feature extraction method to improve the results on similar-shaped characters. They considered pairs of similar-shaped characters from different scripts such as English, Arabic/Persian and Devnagari, and used QDF for recognition; QDF suffers from the limitation of a minimum required dataset size. F. Yang et al. [14] proposed a method that combines both structural and statistical features of characters for similar handwritten Chinese character recognition. As various feature extraction methods and classifiers have been used for character recognition by researchers to suit their work, we propose a novel feature set that is expected to perform well for this application. In this paper, features are extracted on the basis of character geometry and are then fed to each of the selected ML algorithms for the recognition of SSHMC.

3. Methodology for Feature Extraction

A device is to be designed and trained to recognize the 26 letters of the alphabet. We assume that some imaging system digitizes each letter, centered in the system's field of vision. The result is that each letter is represented as a 5-by-7 grid of real values.

Figure 1: The 26 letters of the alphabet with a resolution of 5x7.

However, the imaging system is not perfect, and the letters may suffer from noise.

Figure 2: A perfect picture of the letter A and 4 noisy versions (standard deviation of 0.2).

Perfect classification of ideal input vectors is required and, more importantly, reasonably accurate classification of noisy vectors. Before OCR can be used, the source material must be scanned using an optical scanner (and sometimes a specialized circuit board in the PC) to read in the page as a bitmap (a pattern of dots). Software to recognize the images is also required. The character recognition software then processes these scans to differentiate between images and text and to determine what letters are represented in the light and dark areas. Older OCR systems matched these images against stored bitmaps based on specific fonts; the hit-or-miss results of such pattern-recognition systems helped establish OCR's reputation for inaccuracy. Today's OCR engines add the multiple algorithms of neural network technology.

SOLUTION APPROACH

On-line handwriting recognition involves the automatic conversion of text as it is written on a special digitizer or PDA, where a sensor picks up the pen-tip movements as well as pen-up/pen-down switching. This kind of data is known as digital ink and can be regarded as a digital representation of handwriting. The obtained signal is converted into letter codes which are usable within computer and text-processing applications. The elements of an on-line handwriting recognition interface typically include: a pen or stylus for the user to write with; a touch-sensitive surface, which may be integrated with, or adjacent to, an output display; and a software application which interprets the movements of the stylus across the writing surface, translating the resulting strokes into digital text.

General process

The process of online handwriting recognition can be broken down into a few general steps: preprocessing, feature extraction and classification. The purpose of preprocessing is to discard irrelevant information in the input data that can negatively affect recognition; this concerns both speed and accuracy. Preprocessing usually consists of binarization, normalization, sampling, smoothing and denoising. The second step is feature extraction: out of the two- or more-dimensional vector field received from the preprocessing algorithms, higher-dimensional data is extracted. The purpose of this step is to highlight the information important to the recognition model. This data may include information like pen pressure, velocity or changes of writing direction. The last big step is classification. In this step various models are used to map the extracted features to different classes, thus identifying the characters or words the features represent.

Hardware

Commercial products incorporating handwriting recognition as a replacement for keyboard input were introduced in the early 1980s. Examples include handwriting terminals such as the Pencept Penpad and the Inforite point-of-sale terminal. With the advent of the large consumer market for personal computers, several commercial products were introduced to replace the keyboard and mouse on a personal computer with a single pointing/handwriting system, such as those from Pencept, CIC and others. The first commercially available tablet-type portable computer was the GRiDPad from GRiD Systems, released in September 1989; its operating system was based on MS-DOS. In the early 1990s, hardware makers including NCR, IBM and EO released tablet computers running the PenPoint operating system developed by GO Corp. PenPoint used handwriting recognition and gestures throughout and provided the facilities to third-party software. IBM's tablet computer was the first to use the ThinkPad name and used IBM's handwriting recognition. This recognition system was later ported to Microsoft Windows for Pen Computing and IBM's Pen for OS/2; none of these were commercially successful. Advancements in electronics allowed the computing power necessary for handwriting recognition to fit into a smaller form factor than tablet computers, and handwriting recognition is often used as an input method for hand-held PDAs. The first PDA to provide written input was the Apple Newton, which exposed the public to the advantage of a streamlined user interface. However, the device was not a commercial success, owing to the unreliability of the software, which tried to learn a user's writing patterns. By the time of the release of Newton OS 2.0, in which the handwriting recognition was greatly improved, including unique features still not found in current recognition systems such as modeless error correction, the largely negative first impression had already been made.
After the discontinuation of the Apple Newton, the feature was ported to Mac OS X 10.2 and later in the form of Inkwell. Palm later launched a successful series of PDAs based on the Graffiti recognition system. Graffiti improved usability by defining a set of "unistrokes", or one-stroke forms, for each character. This narrowed the possibility for erroneous input, although memorization of the stroke patterns did increase the learning curve for the user. The Graffiti handwriting recognition was found to infringe on a patent held by Xerox, and Palm replaced Graffiti with a licensed version of the CIC handwriting recognition which, while also supporting unistroke forms, pre-dated the Xerox patent. The court finding of infringement was reversed on appeal, and then reversed again on a later appeal. The parties involved subsequently negotiated a settlement concerning this and other patents, and Graffiti remained in Palm OS.

A Tablet PC is a special notebook computer that is outfitted with a digitizer tablet and a stylus, allowing a user to handwrite text on the unit's screen. The operating system recognizes the handwriting and converts it into typewritten text. Windows Vista and Windows 7 include personalization features that learn a user's writing patterns or vocabulary for English, Japanese, Chinese Traditional, Chinese Simplified and Korean. The features include a "personalization wizard" that prompts for samples of a user's handwriting and uses them to retrain the system for higher-accuracy recognition. This system is distinct from the less advanced handwriting recognition system employed in the Windows Mobile OS for PDAs. Although handwriting recognition is an input form that the public has become accustomed to, it has not achieved widespread use on either desktop computers or laptops. It is still generally accepted that keyboard input is both faster and more reliable. As of 2006, many PDAs offered handwriting input, sometimes even accepting natural cursive handwriting, but accuracy is still a problem, and some people still find even a simple on-screen keyboard more efficient.

Software

Initial software modules could understand print handwriting where the characters were separated. The author of the first applied pattern recognition program, in 1962, was Shelia Guberman, then in Moscow. Commercial examples came from companies such as Communications Intelligence Corporation and IBM. In the early 90s, two companies, ParaGraph International and Lexicus, came up with systems that could understand cursive handwriting. ParaGraph was based in Russia and founded by computer scientist Stepan Pachikov, while Lexicus was founded by Ronjon Nag and Chris Kortge, who were students at Stanford University. The ParaGraph CalliGrapher system was deployed in the Apple Newton systems, and the Lexicus Longhand system was made available commercially for the PenPoint and Windows operating systems. Lexicus was acquired by Motorola in 1993 and went on to develop Chinese handwriting recognition and predictive text systems for Motorola. ParaGraph was acquired in 1997 by SGI and its handwriting recognition team formed a P&I division, later acquired from SGI by Vadem; Microsoft acquired the CalliGrapher handwriting recognition and other digital ink technologies developed by P&I from Vadem in 1999. Wolfram Mathematica (8.0 or later) also provides a handwriting or text recognition function, TextRecognize.

HMM models achieved great success in the field of speech recognition in past decades; however, developing a 2-D HMM model for character recognition has been found difficult and complex. NN is computationally very expensive for recognition purposes. N. Araki et al. applied Bayesian filters based on Bayes' Theorem to handwritten character recognition. Later, discriminative classifiers such as the Artificial Neural Network (ANN) and the Support Vector Machine (SVM) grabbed a lot of attention.

G. Vamvakas et al. compared the performance of three classifiers: Naive Bayes, K-NN and SVM, and attained the best performance with SVM. However, SVM suffers from the limitation of kernel selection. ANNs can adapt to changes in the data and learn the characteristics of the input signal. Also, ANNs consume less storage and computation than SVMs. The most commonly used ANN-based classifiers are MLP and RBFN. B.K. Verma [10] presented a system for HCR using MLP and RBFN networks in the task of handwritten Hindi character recognition.

The error back-propagation algorithm was used to train the MLP networks. J. Sutha et al. showed the effectiveness of MLP for Tamil HCR using Fourier descriptor features. R. Gheroie et al. proposed handwritten Farsi character recognition using an MLP trained with the error back-propagation algorithm.

Similar shaped characters are difficult to differentiate because of very minor variations in their structures. T. Wakabayashi et al. proposed an F-Ratio (Fisher Ratio) based feature extraction method to improve results on similar shaped characters. They considered pairs of similar shaped characters of different scripts, such as English, Arabic/Persian and Devanagari, and used QDF for recognition. QDF suffers from the limitation of a minimum required dataset size. F. Yang et al. proposed a method that combines both structural and statistical features of characters for similar handwritten Chinese character recognition.

As can be seen, researchers have used a variety of feature extraction methods and classifiers suited to their particular tasks; we propose a novel feature set that is expected to perform well for this application. In this paper, features are extracted on the basis of character geometry and then fed to each of the selected ML algorithms for recognition of SSHHC.

3. MACHINE LEARNING CONCEPTS

Machine learning [15] is a scientific discipline that deals with the design and development of algorithms that allow computers to develop behaviours based on empirical data. In this application, ML algorithms are used to map instances of the handwritten character samples to predefined classes.

3.1. Machine Learning Algorithms

For this application of SSHHC recognition, we use the ML algorithms described below, as implemented in WEKA 3.7.0. WEKA (Waikato Environment for Knowledge Analysis) is a Java-based open source simulator. These algorithms have been found to perform very well for most applications and have been widely used by researchers. A brief description of each algorithm follows:

3.1.1. Bayesian Network

A Bayesian Network [17], or Belief Network, is a probabilistic model in the form of a directed acyclic graph (DAG) that represents a set of random variables by its nodes and their correlations by its edges. Bayesian Networks have the advantage that they visually represent all the relationships between the variables in the system via connecting arcs. They can also handle situations where the data set is incomplete.

3.1.2. Radial Basis Function Network

An RBFN is an artificial neural network that uses radial basis functions as activation functions. Due to their non-linear approximation properties, RBF Networks are able to model complex mappings. RBF Networks do not suffer from the issue of local minima, because the only parameters that need to be adjusted are the linear mappings from the hidden layer to the output layer.
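The property described above can be sketched in a few lines: with fixed Gaussian centres and widths, fitting the hidden-to-output mapping is an ordinary linear least-squares problem. The toy setup below is an illustration only (it is not the WEKA implementation, and the centre/width choices are assumptions):

```python
import numpy as np

def rbf_design(X, centres, width):
    # One Gaussian activation per (sample, centre) pair
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * width ** 2))

def rbfn_fit(X, y, centres, width):
    # Only the linear output layer is trained: solve H @ w ~ y by least squares
    H = rbf_design(X, centres, width)
    w, *_ = np.linalg.lstsq(H, y, rcond=None)
    return w

def rbfn_predict(X, centres, width, w):
    return rbf_design(X, centres, width) @ w
```

Because the trainable parameters enter linearly, the fit has a single global optimum, which is the "no local minima" point made above.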

3.1.3. Multilayer Perceptron

An MLP is a feed-forward artificial neural network that computes a single output from multiple real-valued inputs by forming a linear combination according to its input weights and then passing the result through a nonlinear activation function (usually a sigmoid). The MLP is a universal function approximator and is highly efficient for solving complex problems due to the presence of one or more hidden layers.

3.1.4. C4.5

C4.5 is an extension of Ross Quinlan's earlier ID3 algorithm. It builds decision trees from a set of training data using the concepts of information gain and entropy. C4.5 uses a white-box model, so the explanation of its results is easy to understand. It also performs well even with large amounts of data.
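The entropy and information-gain quantities that C4.5 uses to choose a split can be computed as follows (the labels here are hypothetical, purely to illustrate the calculation):

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a class-label list, in bits
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    # Gain of splitting `labels` into `groups` (a partition of the labels):
    # parent entropy minus the size-weighted entropy of the children
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - remainder
```

A split that separates the classes perfectly has zero remainder, so its gain equals the parent entropy; C4.5 (via its gain-ratio refinement) prefers such splits when growing the tree.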

3.2. Feature Reduction

Feature extraction is the task of detecting and isolating the desired attributes (features) of an object in an image, so as to maximize the recognition rate with the least number of elements. However, training the classifier with the maximum number of features obtained is not always the best option, as irrelevant or redundant features can have a negative impact on a classifier's performance, and at the same time the resulting classifier can be computationally complex. Feature reduction, or feature selection, is the technique of selecting a subset of relevant features to improve the performance of learning models by speeding up the learning process and reducing computational complexity. The two feature reduction methods chosen for this application are CFS [22] and CON [23], as these methods have been widely used by researchers for feature reduction.
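The idea behind correlation-based selection can be sketched as below. This is a simplified stand-in, not WEKA's exact CFS merit formula: features are ranked by correlation with the class, and a feature that is almost perfectly correlated with one already kept is treated as redundant and dropped (the threshold value is an assumption):

```python
import numpy as np

def select_features(X, y, redundancy_thresh=0.95):
    # Rank features by absolute correlation with the class labels
    corr_cls = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    order = np.argsort(corr_cls)[::-1]  # most class-relevant first
    kept = []
    for j in order:
        # Drop a feature that nearly duplicates one already selected
        redundant = any(
            abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) > redundancy_thresh
            for k in kept
        )
        if not redundant:
            kept.append(int(j))
    return kept
```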

4. EXPERIMENTAL METHODOLOGY

The following sections describe the dataset, pre-processing and feature extraction adopted in our proposed work on recognition of SSHHC.

4.1. Dataset Creation

The dataset is created by asking candidates of different age groups to write the similar shaped characters (, ; , ; , ) several times in their own handwriting on plain white sheets. These samples are scanned using an HP Scanjet G2410 at a resolution of 1200 x 1200 dpi. Each character is cropped and stored in .jpg format using MS Paint. This dataset thus consists of isolated handwritten Hindi characters that are to be recognized using ML algorithms. From these character samples, three datasets are created as described below:

Dataset 1 consists of only 100 samples of the target pair (, ).

Dataset 2 consists of an increased number of samples of the same target pair: the size of the training dataset is increased to 342 samples by adding more samples of the target class (from other persons). More samples are added in order to analyze the impact of the number of samples on the relative performance of the ML algorithms.

Dataset 3 consists of samples of both the target and non-target classes, i.e. other similar shaped character pairs (like , ; , ) are also added to the dataset (making 500 samples in total), with which the ML algorithms are trained. Non-target class characters are added to test the ability of the ML classifiers to identify the target characters among different characters. A few samples of the entire dataset are shown in Figure 1.

Figure 1: Samples of Handwritten Hindi Characters

4.2. Performance Metrics

The performance of the classifiers is evaluated on the basis of the metrics described below:

i. Precision: the proportion of examples that truly have class x among all those classified as class x.
ii. Misclassification Rate: the number of instances classified incorrectly out of the total instances.
iii. Model Build Time: the time taken to train a classifier on a given data set.
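The two accuracy metrics above reduce to simple counts over true and predicted labels, as in this small sketch (the labels are hypothetical):

```python
def precision_for(cls, y_true, y_pred):
    # Of everything predicted as `cls`, how much truly is `cls`
    predicted_cls = [t for t, p in zip(y_true, y_pred) if p == cls]
    return sum(t == cls for t in predicted_cls) / len(predicted_cls)

def misclassification_rate(y_true, y_pred):
    # Fraction of instances whose predicted class differs from the true class
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
```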

4.3. Pre-processing

The following pre-processing steps are applied to the scanned character images:

i. Each RGB character image is converted to gray scale and then binarized through thresholding.
ii. The image is inverted so that the background is black and the foreground is white.
iii. The shortest matrix that fits the entire character skeleton is then obtained for each image; this is termed the universe of discourse.
iv. Finally, spurious pixels are removed from the image, followed by skeletonization.
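Steps i-iii above can be sketched on a grayscale array as follows; the threshold value is an assumption, and despeckling plus skeletonization (step iv) would be handled by a morphology library:

```python
import numpy as np

def preprocess(gray, thresh=128):
    # Binarize: dark ink on white paper becomes True (i.e. white foreground
    # on black background, so the comparison also performs the inversion)
    binary = gray < thresh
    # Universe of discourse: the smallest box containing the character
    ys, xs = np.nonzero(binary)
    return binary[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
```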

4.4. Feature Extraction

After pre-processing, features for each character image are extracted based on the character geometry, using the technique described in [24]. The features are based on the basic line types that form the skeleton of the character. Each pixel in the image is traversed. Individual line segments, their directions and intersection points are identified from an isolated character image. For this, the image matrix is first divided into nine zones, and the number, length and type of lines and the intersections present in each zone are determined. The line types are: horizontal, vertical, right diagonal and left diagonal. For each zone, the following features are extracted, resulting in a feature vector of length 9 per zone:

i. Number of horizontal lines
ii. Number of vertical lines
iii. Number of right diagonal lines
iv. Number of left diagonal lines
v. Normalized length of all horizontal lines
vi. Normalized length of all vertical lines
vii. Normalized length of all right diagonal lines
viii. Normalized length of all left diagonal lines
ix. Number of intersection points
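The 3x3 zoning step that precedes this feature list can be sketched as below. The per-zone line-segment measurements of [24] are beyond a short example, so a simple per-zone pixel count stands in for them here:

```python
import numpy as np

def nine_zones(img):
    # Split the (cropped) character image into a 3x3 grid of sub-images,
    # returned in row-major order; zone sizes differ by at most one pixel
    zones = []
    for rows in np.array_split(img, 3, axis=0):
        zones.extend(np.array_split(rows, 3, axis=1))
    return zones

def zone_pixel_counts(binary_img):
    # Stand-in per-zone measurement: foreground pixels in each of the 9 zones
    return [int(z.sum()) for z in nine_zones(binary_img)]
```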

A total of 81 (9 zones x 9 features) zonal features is thus obtained. After zonal feature extraction, four additional features are extracted for the entire image based on its regional properties, namely:

i. Euler Number: the number of objects in the image minus the number of holes
ii. Eccentricity: the ratio of the distance between the foci of the equivalent ellipse and its major axis length
iii. Orientation: the angle between the x-axis and the major axis of the ellipse that has the same second moments as the region
iv. Extent: the ratio of the pixels in the region to the pixels in its total bounding box
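Three of these regional properties can be computed directly from the foreground pixel coordinates, as in the sketch below (the Euler number additionally needs connected-component labelling, so it is omitted here; this is an illustrative formulation, not the exact routine used in the report):

```python
import numpy as np

def regional_features(img):
    """Extent, orientation and eccentricity of a binary character image
    (True/1 = foreground), via the second moments of the pixel cloud."""
    ys, xs = np.nonzero(img)
    area = len(xs)
    # Extent: foreground pixels over bounding-box pixels
    h = ys.max() - ys.min() + 1
    w = xs.max() - xs.min() + 1
    extent = area / (h * w)
    # Central second moments of the pixel coordinates
    mu20 = np.var(xs)
    mu02 = np.var(ys)
    mu11 = np.mean((xs - xs.mean()) * (ys - ys.mean()))
    # Orientation of the same-second-moments ellipse (radians)
    orientation = 0.5 * np.arctan2(2 * mu11, mu20 - mu02)
    # Eccentricity from the eigenvalues of the covariance matrix
    cov = np.array([[mu20, mu11], [mu11, mu02]])
    l1, l2 = sorted(np.linalg.eigvalsh(cov), reverse=True)
    ecc = np.sqrt(1 - l2 / l1) if l1 > 0 else 0.0
    return extent, orientation, ecc
```

A filled square has eccentricity near 0 (equal moments in both axes), while a one-pixel-thick line is the degenerate case with eccentricity 1.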

5. IMPLEMENTATION AND RESULTS

5.4. Experimental Results

5.4.1. Test Application Analysis

The test application accompanying the source code can perform the recognition of handwritten digits. To do so, open the application (preferably outside Visual Studio, for better performance). Click on the menu File and select Open. This will load some entries from the Optdigits dataset into the application.

To perform the analysis, click the Run Analysis button. Please be aware that it may take some time. After the analysis is complete, the other tabs in the sample application are populated with the analysis information, and the contribution of each factor found during the discriminant analysis is plotted in a pie graph for easy visual inspection. Once the analysis is complete, we can test its classification ability on the testing data set. The green rows have been correctly identified by the discriminant-space Euclidean distance classifier; it correctly identifies 98% of the testing data. The testing and training data sets are disjoint and independent.

Fig.6: Using the default values in the application

Results

After the analysis has been completed and validated, we can use it to classify new digits drawn directly in the application. The bars on the right show the relative response of each of the discriminant functions. Each class has a discriminant function that outputs a closeness measure for the input point; the classification is based on which function produces the maximum output.
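The decision rule above is a plain argmax over per-class scores, as in this sketch (the closeness functions here are hypothetical stand-ins for the discriminant functions):

```python
def classify(x, discriminants):
    """discriminants: dict mapping class label -> scoring function.
    The predicted class is the one whose function gives the maximum output."""
    return max(discriminants, key=lambda c: discriminants[c](x))
```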

Handwritten Devanagari character sets are taken from a test .bmp image. The following steps are carried out to obtain the best accuracy for an input handwritten Hindi character image given to the system. First, the system is trained using different data sets or samples. The system is then tested on a few of the given samples, and accuracy is measured. For each character, features were computed and stored in templates for training the system.

Sets of handwritten Gurumukhi characters are made. The data set was partitioned into two parts: the first is used for training the system and the second for testing. For each character, features were computed and stored for training the network. Three network layers are taken: one input layer, one hidden layer and one output layer. If the number of neurons in the hidden layer is increased, a problem of allocating the required memory occurs. Also, if the value of the error tolerance is high, say 0.1, the desired results are not obtained; changing the error tolerance to, say, 0.01 yields a high accuracy rate. The network also takes more cycles to learn when the error tolerance is small; with a high error tolerance, the network learns in fewer cycles, but the learning is not very fine. The unit disk is taken for each character by finding the maximum radius of the character (i.e. the maximum distance between the centre of the character and its boundary), so that the character fits on the disk.

Below are some tables displaying the results obtained from the program. Images of the same letter are grouped together in every table. Each table gives information about the pre-processing operations that took place (i.e. noise removal, edge detection, gap filling) and whether the image belongs to the same database as the training images. The amount of each filter is also recorded, so the maximum amount of noise the network can tolerate can be estimated; this of course varies from character image to character image. The result also varies every time the algorithm is executed; the variance is very small, but it is there. The main results of Gurumukhi character recognition are as follows:

Table 4: Recognition Accuracy of Handwritten Hindi Characters

Character   No. of Samples   Train/Test   % Accuracy
            200              180/20       93%
            196              176/20       87%
            155              130/25       89%
            184              169/15       71%
            192              162/30       69%
            160              140/20       81%
            179              159/20       79%
            168              148/20       84%
            195              170/25       80%
            177              152/25       90%
            191              166/25       88%
            180              165/15       86%
            195              170/25       89%
            187              167/20       96%
            169              149/20       95%
            199              174/25       92%
            188              168/20       94%
            166              146/20       82%
            196              176/20       82%
            189              164/25       88%
            168              148/20       85%
            178              158/20       84%
            196              176/20       87%
            171              151/20       81%
            182              162/20       88%
            184              164/20       80%
            169              149/20       89%
            180              155/25       76%
            170              150/20       78%
            193              173/20       71%
            185              165/20       82%
            176              146/30       70%
            167              147/20       92%
            157              132/25       85%
            178              158/20       87%
            183              153/30       69%
            191              161/30       73%
            185              155/30       70%

Fig 7: We can see the analysis also performs rather well on completely new and previously unseen data.

Experiments were performed on different samples of numerals in mixed scripts, using a single hidden layer.

Table 2: Detail Recognition performance of SVM on UCI datasets

Table 3: Detail Recognition performance of SVM and HMM on UCI datasets

Table 4: Recognition Rate of Each Numeral in DATASET

It is observed that the recognition rate using SVM is higher than with the Hidden Markov Model. However, the free parameter storage for the SVM model is significantly higher: the memory required for SVM is the number of support vectors multiplied by the number of feature values (in this case 350). This is significantly larger than for the HMM, which only needs to store the weights; the HMM needs less space due to its weight-sharing scheme.
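The storage comparison reduces to a simple product, as in this back-of-the-envelope sketch (the support-vector count is a hypothetical figure for illustration; the text supplies only the 350 feature values):

```python
N_FEATURES = 350          # feature values per support vector (from the text)
n_support_vectors = 1000  # assumed count, purely for illustration

# SVM storage grows with the number of support vectors retained
svm_values = n_support_vectors * N_FEATURES
```

Every additional support vector retained by the model adds another 350 stored values, which is why reducing the support-vector count (discussed in the conclusion) directly shrinks the model.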

However, in SVM, space savings can be achieved by storing only the original online signals and the pen-up/pen-down status in a compact manner; during recognition, the model is expanded dynamically as required. Table 3 shows the comparison of recognition rates between HMM and SVM on all three databases. SVM clearly outperforms HMM in all three isolated character cases.

The results for the isolated character cases above indicate that the recognition rate of the hybrid word recognizer could be improved by using SVM instead of HMM. We are therefore currently implementing a word recognizer using both HMM and SVM and comparing their performance.

6. FUTURE WORK AND CONCLUSION

Conclusion

An important feature of this ANN training is that the learning rates are dynamically computed each epoch by an interpolation map. The ANN error function is transformed into a lower-dimensional error space, and the reduced error function is employed to identify the variable learning rates. As the training progresses, the geometry of the ANN error function constantly changes, and the interpolation map therefore always identifies variable learning rates that gradually reduce in magnitude. As a result, the error function also reduces to a smaller terminal value. The structure analysis shows that if the number of hidden nodes increases, the number of epochs taken to recognize a handwritten character also increases. A lot of effort has been made to achieve higher accuracy, but there is still tremendous scope for improving recognition accuracy.

Handwriting recognition is a challenging field in machine learning, and this work identifies Support Vector Machines as a potential solution. The number of support vectors can be reduced by selecting better C and gamma parameter values through a finer grid search and by reduced-set selection. Work on integrating the SVM character recognition framework into the HMM-based word recognition framework is under way. In the hybrid system, word pre-processing and normalization need to be done before SVM is used for character hypothesis recognition and word likelihood computation using HMM. It is envisaged that, due to SVM's better discrimination capability, the word recognition rate will be better than in an HMM hybrid system. The scope of the proposed system is limited to the recognition of a single character.

Offline handwritten Hindi character recognition is a difficult problem, not only because of the great amount of variation in human handwriting, but also because of overlapped and joined characters.
Recognition approaches depend heavily on the nature of the data to be recognized. Since handwritten Hindi characters can be of various shapes and sizes, the recognition process needs to be highly efficient and accurate to recognize characters written by different users. Several factors make Hindi handwritten character recognition difficult. Some characters are similar in shape (for example and ). Characters are sometimes overlapped and joined. Large numbers of character and stroke classes are present. Different users, or even the same user, can write differently at different times, depending on the pen or pencil, the width of the line, a slight rotation of the paper, the type of paper, and the mood and stress level of the person. A character can be written at different locations on the paper or in the window, and characters can be written in different fonts.

Handwritten Gurumukhi character recognition using neural networks is discussed here. Recognition of handwritten Gurumukhi characters has been found to be a very difficult task, for the following main reasons: some Gurumukhi characters are similar in shape (for example and ); different writers, or even the same writer, can write differently at different times, depending on the pen or pencil, the width of the line, a slight rotation of the paper, the type of paper, and the mood and stress level of the person; a character can be written at different locations on the paper or in the window; and characters can be written in different fonts. These facts are borne out by the work done here. A small set of Gurumukhi characters was trained using a back-propagation neural network, and testing was then performed on another character set; the accuracy of the network was very low. Some other character images were then added to the old character set and the network was retrained on the new sets. Testing was performed again on new image sets written by different people, and the accuracy of the network was found to increase slightly in some cases. Again, new character images were added to the old character set (on which the network had been trained) and the network was retrained on this new set. When the network is presented with new character images, recognition is seen to increase, although at a slow rate. The results of the final training with a 50-character set and testing with a 10-character set are presented. It can be concluded that as the network is trained with more sets, the recognition accuracy will definitely increase.

Future scope

Over the past three decades, many different methods have been explored by a large number of scientists to recognize characters. A variety of approaches have been proposed and tested by researchers in different parts of the world, including statistical methods, structural and syntactic methods, and neural networks. No OCR in the world is 100% accurate to date. The recognition accuracy of the neural networks proposed here can be improved further: the number of character sets used for training is relatively low, and the accuracy of the network can be increased by using more training character sets. This recognition approach has been applied to Gurumukhi characters only; in future work it can be extended to the recognition of Gurumukhi words.

References

[1] R. Plamondon and S. N. Srihari, "On-line and off-line handwriting recognition: a comprehensive survey", IEEE Transactions on PAMI, Vol. 22(1), pp. 63-84, 2000.
[2] Negi, C. Bhagvati and B. Krishna, "An OCR system for Telugu", Proceedings of the Sixth International Conference on Document Processing, pp. 1110-1114, 2001.
[3] J. I. Hong and J. A. Landay, "SATIN: A Toolkit for Informal Ink-based Applications", CHI Letters: ACM Symposium on UIST, 2(2), pp. 63-72.
[4] S. Mori, C. Y. Suen and K. Yamamoto, "Historical review of OCR research and development", Proceedings of the IEEE, Vol. 80(7), pp. 1029-1058, 1992.
[5] U. Pal and B. B. Chaudhuri, "Indian script character recognition", Pattern Recognition, Vol. 37(9), pp. 1887-1899, 2004.
[6] H. Bunke and P. S. P. Wang, Handbook of Character Recognition and Document Image Analysis, World Scientific Publishing Company, 1997.
[7] S. V. Rice, G. Nagy and T. A. Nartker, Optical Character Recognition: An Illustrated Guide to the Frontier, Kluwer Academic Publishers.