A Multi-tier Holistic approach for Urdu Nastaliq Recognition

Syed. Afaq Husain* and Syed. Hassan Amin** Faculty of Computer Science and Engineering

Ghulam Ishaq Khan (GIK) Institute of Engineering Sciences and Technology Topi, 23460, Dist. Swabi, NWFP, PAKISTAN

Email:* [email protected]_ , **[email protected]

Abstract

Character recognition is an active area of research with numerous applications, including web publishing, document analysis and text-to-speech conversion. In this paper, we present a new approach for the off-line recognition of cursive Urdu text. The methodology has been developed for the Noori Nastaliq script [Ahmed]. Word (ligature) based identification has been adopted instead of character-based identification. A multi-tier holistic approach is used to recognize ligatures from a pre-defined ligature set. First, the special ligatures (Dots, Tay, Hamza and Mad) are distinguished from the base ligatures. In the second step, each special ligature is associated with the most probable neighboring base ligature. Finally, this information, along with other rotation, translation and scale (RTS) invariant features of the base ligature, is presented to a feed-forward back-propagation neural network, which performs the final recognition.

Keywords: OCR, Urdu Character Recognition, Noori Nastaliq, Ligature based identification, Back-propagation Neural Network.

1. Objective

Urdu is the national language of Pakistan and is understood by over 300 million people in Pakistan, India and Bangladesh. Because of its large body of historical literature, there is a need for automatic systems that convert this literature into electronic form accessible on the World Wide Web. The proposed Urdu text recognition system aims to convert scanned Urdu documents automatically into computerized text files in UZT format.

Diacritics (Aerab) and punctuation are ignored in the current version of the system; they may, however, be classified as another category of symbols. Multi-font and multi-lingual support has also been omitted for simplicity.

2. Introduction

The Urdu character set is based on the Arabic character set, and Urdu is cursive even in its printed form. A great deal of research has been done on automatic recognition of text written in Roman-based languages [Guyon],[Ha], Chinese [Guo],[Ding], Arabic [Amin] and Persian [Khorsheed3], but no serious research has been published on Urdu text recognition. Arabic and Persian, which share similar basic characters and writing styles with Urdu, have seen worthwhile research in the past decade. However, those solutions do not carry over to Urdu because of a number of inherent differences in the script and styles of Urdu text. Nasakh and Nastaliq are the two most popular writing styles (scripts) in Urdu, and each has unique features that make it different from, and more complicated than, its close counterparts. Table 1 compares the complexity of Urdu script with that of some other languages.

Like Arabic, Urdu script presents the recognition challenges of cursive orthography and context-sensitive letter shapes [Khorsheed2]. However, in contrast to Arabic text, in which connected characters follow a base line, the joined characters in Nastaliq and Nasakh are positioned according to their preceding and succeeding characters as well as the vertical justification of the ligature.

Table 1: Comparative features of some languages

Characteristics             Urdu     Arabic   Latin   Hebrew   Hindi
H Justification             R→L      R→L      L→R     R→L      L→R
V-Justification             Centre   Base     No      No       Top
Cursive                     Yes      Yes      No      No       Yes
Diacritics                  Yes      Yes      No      No       Yes
# Vowels                    2        2        5       11       -
# Letters                   37       28       26      22       40
Letter Shapes               1-28     1-4      2       1        1
Complementary Characters    5        3        -       -        -

Word recognition strategies are generally classified into three categories: the holistic approach, the analytic approach and feature sequence matching [Shridher]. Some researchers, however, regard sequence matching techniques as a form of the holistic approach. The analytic approach tries to segment the word into characters before recognition, while the holistic approach tries to recognize the word or its sub-part (ligature) as a whole [Khorsheed1]. Applied to Urdu, the analytic approach segments words into characters, whereas the holistic approach segments words into symbols; a symbol may be a character, a ligature or possibly a fraction of a character.

In this paper, we present an approach to recognize commonly used ligatures from the Noori Nastaliq script developed by Ahmad Mirza Jamil [Ahmed]. Nastaliq is one of the most beautiful and one of the most complex scripts. The script was originally created by the calligrapher Mir Ali Tabrezi. Attempts to mechanize Urdu script were unsuccessful for a long time, and as a result a typewriter that can type in the Nastaliq style is not available even today. There are two approaches to computerizing Nastaliq: the ligature-based approach (more glyphs) and the character-based approach (more rules). For example, the word has three ligatures or separate shapes , and . Noori Nastaliq describes about 20000 ligatures, which are required to write almost all words contained in the Urdu dictionary. Because ligature-based recognition depends on the ligatures used for training, it carries context information, which gives it higher performance. Its disadvantage is that adding new ligatures requires re-training the system. For example, the Urdu word for "computer" is a single ligature that is not in the formal dictionary of ligatures, although it is widely written in Urdu text.

3. Character Recognition Schemes

The problem of Urdu text recognition is closely related to Arabic text recognition. Arabic text recognition systems generally have the following stages: image acquisition, preprocessing, segmentation, feature extraction, classification and recognition [Khorsheed3].

Arabic text recognition systems are further divided into segmentation-based and segmentation-free systems. Here we briefly describe approaches to Arabic text recognition, since they give valuable insight into the problem of Urdu text recognition [Bunke].

3.1 Segmentation Free Systems

In these systems, the word is recognized as a whole, without trying to segment it and recognize characters or primitives [7]. One approach is to calculate a single feature vector for each word; this feature vector is then used to recognize the word.

3.2 Segmentation Based Systems

In segmentation-based systems, each word is further divided into a number of subparts. These systems are subdivided into four categories: isolated/pre-segmented characters, segmenting a word into characters, segmenting a word into primitives, and integration of recognition and segmentation. These systems are either impractical, because they try to recognize digits and isolated characters, or they have a low recognition rate because of segmentation errors [Khorsheed2].

4. Ligature Identification System

In our proposed system, after preprocessing, the text is segmented into a number of ligatures ordered from right to left and top to bottom. At this stage, a ligature is defined as any connected set of characters. These ligatures also include the special symbols used in Urdu, namely Tau, Mad, Dots, Hamza and Ha. A number of features are calculated and fed into a feed-forward back-propagation neural network to separate the special ligatures from the base ligatures. Each special ligature is then associated with a base ligature and becomes part of the feature vector of that base ligature, thus aiding its recognition. This feature vector is then used to recognize ligatures with a second feed-forward back-propagation neural network.

Figure 1: Stages of Urdu Character Recognition

4.1 Preprocessing

The preprocessing stage involves smoothing, skew detection and correction, document decomposition and slant normalization, among other steps.

4.2 Segmentation

In document image analysis, four commonly used segmentation algorithms are connected component labeling, X-Y tree decomposition, run-length smearing and the Hough transform.

We have applied connected component labeling to the image of Urdu text. This technique assigns a distinct label to each connected component of a binary image. The labels are usually natural numbers from 1 to the number of connected components in the input image. The algorithm scans the image from left to right and top to bottom. On the first line containing black pixels, a unique label is assigned to each contiguous run of black pixels. For each black pixel on the following lines, the pixels in its eight-neighborhood are examined: if any of them has already been labeled, the same label is assigned to the current pixel; otherwise a new label is assigned to it. The procedure continues to the bottom of the image [Khorsheed3].
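The labeling step can be illustrated with a short Python sketch. It is not the scan-line procedure described above: for brevity it uses a flood fill over the eight-neighborhood, which produces the same set of connected components. The function name `label_components` and the 0/1 list-of-rows image format are assumptions made for this illustration.

```python
from collections import deque

def label_components(image):
    """Label 8-connected components of black (1) pixels.

    `image` is a list of rows holding 0 (white) or 1 (black).
    Returns (labels, count): `labels` has the same shape as `image` and
    holds 0 for background or a label in 1..count for each black pixel.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):                # scan top to bottom
        for c in range(cols):            # scan left to right
            if image[r][c] != 1 or labels[r][c] != 0:
                continue
            count += 1                   # unlabeled black pixel: new component
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:                 # spread the label to all connected pixels
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1
                                and labels[ny][nx] == 0):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count

sample = [
    [1, 0, 0, 0, 1],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0],
]
labels, count = label_components(sample)
```

In the sample, the two diagonal pixels at the top left share one label because diagonal neighbors count as connected under eight-connectivity.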

4.3 Feature Extraction I

In this stage, we extract only those features that help in the recognition of special ligatures (see Figure 2). These features are solidity, number of holes, axis ratio, eccentricity, moments, normalized segment length, curvature, and the ratio of bounding box width to height.

[Figure 1 flowchart: Preprocessing → Segmentation → Feature Extraction I → Special Ligature Identification → Feature Extraction II → Ligature Identification]


4.3.1 Solidity

Solidity is a scalar quantity, defined as the proportion of the pixels in the convex hull that are also in the region. It is computed as

Solidity = Ligature Area / Convex Hull Area

where

Ligature Area = ∑∑ f(x, y), for all x, y in the binary image of the ligature

Convex Hull Area = ∑∑ f(x, y), for all x, y in the convex hull of the ligature
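As a rough illustration of this ratio, the following Python sketch counts ligature pixels and the pixel centers that fall inside or on the convex hull of those pixels. The hull construction (Andrew's monotone chain) and the choice to measure hull area by counting pixel centers are assumptions of this sketch; the paper does not say how the hull was computed.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def solidity(image):
    """Ligature area divided by convex hull area, both counted in pixels."""
    pixels = [(x, y) for y, row in enumerate(image)
              for x, value in enumerate(row) if value]
    if not pixels:
        return 0.0
    hull = convex_hull(pixels)
    n = len(hull)
    xs = [p[0] for p in hull]
    ys = [p[1] for p in hull]
    hull_area = 0
    for y in range(min(ys), max(ys) + 1):
        for x in range(min(xs), max(xs) + 1):
            # A point is inside a counter-clockwise convex polygon when it
            # lies on or to the left of every edge.
            if all(cross(hull[i], hull[(i + 1) % n], (x, y)) >= 0
                   for i in range(n)):
                hull_area += 1
    return len(pixels) / hull_area

l_shape = [
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 1],
]
```

For the L-shaped example, five pixels belong to the shape and six pixel centers lie in its triangular hull, so the solidity is 5/6. A filled rectangle gives 1.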

4.3.2 Axes Ratio

It is the ratio of the major axis to the minor axis of the best-fit ellipse of the ligature.

Axis Ratio = a / b

where a and b are the lengths of the semi-major axis and semi-minor axis of the best-fit ellipse.

4.3.3 Eccentricity

It is the ratio of the distance between the foci of the best-fit ellipse to the length of its major axis.

Eccentricity = distance between foci / 2a
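Both ellipse features can be sketched in Python under one assumption that the paper does not state: the best-fit ellipse is taken to be the ellipse with the same second-order central moments as the ligature pixels. Under that assumption the squared semi-axes are proportional to the eigenvalues of the pixel covariance matrix, so a/b = sqrt(λ1/λ2), and eccentricity, taken here as the focal distance divided by the full major-axis length 2a, equals sqrt(1 − λ2/λ1). The function name `ellipse_features` is illustrative.

```python
import math

def ellipse_features(image):
    """Return (axis_ratio, eccentricity) of the moment-matched ellipse."""
    pts = [(x, y) for y, row in enumerate(image)
           for x, value in enumerate(row) if value]
    if not pts:
        raise ValueError("image contains no ligature pixels")
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    # Second-order central moments, normalized by the pixel count.
    mu20 = sum((x - mean_x) ** 2 for x, _ in pts) / n
    mu02 = sum((y - mean_y) ** 2 for _, y in pts) / n
    mu11 = sum((x - mean_x) * (y - mean_y) for x, y in pts) / n
    # Eigenvalues of [[mu20, mu11], [mu11, mu02]], with lam1 >= lam2.
    spread = math.sqrt((mu20 - mu02) ** 2 + 4 * mu11 ** 2)
    lam1 = (mu20 + mu02 + spread) / 2
    lam2 = max((mu20 + mu02 - spread) / 2, 0.0)  # guard against rounding
    if lam1 == 0:
        return 1.0, 0.0          # single pixel: treat as a circle
    eccentricity = math.sqrt(1 - lam2 / lam1)
    axis_ratio = math.sqrt(lam1 / lam2) if lam2 > 0 else math.inf
    return axis_ratio, eccentricity

square = [[1, 1], [1, 1]]
bar = [[1, 1, 1, 1], [1, 1, 1, 1]]
```

A square yields an axis ratio of 1 and an eccentricity of 0, while the 2-by-4 bar yields an axis ratio of sqrt(5) and an eccentricity of sqrt(0.8). A one-pixel-wide straight stroke has λ2 = 0, which this sketch reports as an infinite axis ratio.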

4.3.4 Moment based features

These refer to certain functions of moments that are invariant to geometric transformations such as translation, scaling and rotation [6]. Such features are useful for identifying objects with unique shapes, regardless of their location, size and orientation.

4.3.5 Normalized Length Feature

First, the normalized length L(i) of each segment i is calculated relative to the other segment lengths in the same word. The normalized length of the ligature is then calculated as the sum over its segments:

Normalized Length = ∑ L(i)

4.3.6 Curvature Feature

In a similar fashion, the curvature of a segment is first measured by dividing the Euclidean distance between the two feature points (endpoints) of that segment by its actual length. This feature equals 0 when the segment is a loop and 1 when the segment is a straight line.

C(i) = (Euclidean distance between endpoints) / segment length

The curvature feature of the ligature is then calculated as the sum of the curvature features of all of its segments.

Curvature Feature = ∑ C(i)

4.3.7 Number of Holes

This feature gives the total number of holes in a ligature. If the feature points of the ligature are considered as a set of vertices V, and the segments as a set of edges E, of a graph G(V, E), then the total number of holes in the ligature can be found using graph theory as follows:

Number of Holes = E - Est

where E is the number of edges in G and Est is the number of edges in a spanning tree of G. A connected graph with N vertices has N-1 edges in its spanning tree.
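The curvature and hole-count formulas can be sketched in a few lines of Python. The representation is an assumption of this sketch: each segment is an ordered list of (x, y) points whose first and last entries are its feature points, and the ligature graph is given as a vertex count plus a list of edges and is taken to be connected.

```python
import math

def segment_curvature(points):
    """C(i): straight-line distance between endpoints over path length."""
    path_length = sum(math.dist(points[k], points[k + 1])
                      for k in range(len(points) - 1))
    if path_length == 0:
        return 0.0
    return math.dist(points[0], points[-1]) / path_length

def ligature_curvature(segments):
    """Curvature feature of a ligature: sum of C(i) over its segments."""
    return sum(segment_curvature(seg) for seg in segments)

def number_of_holes(num_vertices, edges):
    """E - Est, where a spanning tree of a connected graph has N - 1 edges."""
    return len(edges) - (num_vertices - 1)

straight = [(0, 0), (1, 0), (2, 0)]
loop = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
triangle_edges = [(0, 1), (1, 2), (2, 0)]
```

The straight segment gives C = 1 and the closed loop gives C = 0, matching the two limiting cases in the text. Three feature points joined by three segments enclose one hole, since a spanning tree on three vertices uses only two edges.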

4.4 Special Ligature Identification

To identify special ligatures, a feed-forward back-propagation neural network with 15 inputs, 25 hidden neurons and 25 output neurons was used. The feature vectors obtained from the Feature Extraction I stage are fed to this network, which then identifies each ligature as either a special ligature or a base ligature.
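To make the stated layer sizes concrete, here is a minimal Python sketch of a forward pass through a 15-25-25 feed-forward network. Only the layer sizes come from the text. The sigmoid activation, the random initial weights and the helper names are assumptions, and back-propagation training is not shown.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """One weight row per output neuron; the last entry of each row is the bias."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
            for _ in range(n_out)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(layers, features):
    """Propagate a feature vector through each layer in turn."""
    activations = list(features)
    for layer in layers:
        # zip stops after the n_in weights, so the bias is added separately.
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + row[-1])
            for row in layer
        ]
    return activations

rng = random.Random(0)                      # fixed seed for repeatability
network = [make_layer(15, 25, rng),         # 15 inputs -> 25 hidden neurons
           make_layer(25, 25, rng)]         # 25 hidden -> 25 output neurons
outputs = forward(network, [0.0] * 15)
predicted = max(range(len(outputs)), key=outputs.__getitem__)
```

With untrained weights the outputs are meaningless; the sketch only shows how a 15-element feature vector maps to 25 output activations, from which the largest could be taken as the predicted class. How the 25 outputs encode the special and base classes is not specified in the text.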

Figure 2: Some special ligatures

4.5 Feature Extraction II

In this stage, we associate special ligatures with base ligatures. A special ligature is associated with the base ligature for which the centroid-to-centroid distance is minimum. In addition, a number of lines are grown from the centre of each special ligature; when one of these lines touches a base ligature, the special ligature is associated with that base ligature.

As a result of this association, twenty new features are added to the feature vector of the base ligature.
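The centroid-distance rule can be sketched as follows. This covers only the minimum centroid-to-centroid criterion, not the line-growing step, and the data layout (each ligature as a list of (x, y) pixel coordinates, base ligatures held in a dictionary keyed by hypothetical names) is assumed for illustration.

```python
import math

def centroid(pixels):
    """Mean (x, y) position of a ligature's pixels."""
    n = len(pixels)
    return (sum(x for x, _ in pixels) / n, sum(y for _, y in pixels) / n)

def associate(special, bases):
    """Return the key of the base ligature whose centroid is nearest."""
    special_centre = centroid(special)
    return min(bases,
               key=lambda name: math.dist(special_centre,
                                          centroid(bases[name])))

bases = {
    "base_a": [(0, 0), (2, 0)],      # centroid (1, 0)
    "base_b": [(10, 0), (12, 0)],    # centroid (11, 0)
}
dot = [(9, 3)]                       # a single-pixel special ligature
nearest = associate(dot, bases)
```

Here the dot lies about 3.6 pixels from the centroid of `base_b` and about 8.5 pixels from that of `base_a`, so it is attached to `base_b`.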

4.6 Ligature Identification

In this stage, the final feature vector, consisting of 34 features, is fed into a feed-forward back-propagation neural network. The network architecture consists of 34 inputs, 65 hidden neurons and 45 output neurons.

5. Results

The system was trained using a training set of two hundred carefully selected ligatures. Testing was done on bitmap images of Urdu text typed in a Nastaliq font using a text editor.

This simplified the problem by removing the need for the pre-processing stage that handles noise introduced during image acquisition. The training set contained the simpler and more commonly used ligatures.

The performance of the system on images containing only the trained ligatures was 100%. However, in cases where an image contained additional ligatures, they were classified as the closest match in the training set. No rejection class was used.

6. Conclusion

In this paper, we have presented a method for the recognition of cursive Urdu text written in Nastaliq script. The system is currently trained on a small number of ligatures but has the potential to be expanded for more practical use. Our approach minimizes errors due to segmentation by using a segmentation-free approach. By using multiple classes of features, we have increased the number of ligatures that can be identified.


7. Future Directions

A number of possible directions are under consideration for enhancing the system for practical use, namely:

1. Increasing the number of ligatures used for training.
2. Adding special characters, numerals and Aerab for recognition as special ligatures.
3. Recognizing intonation marks in the document.
4. Adding multi-lingual support to the system.

References

1. [Ahmed] Ahmad Mirza Jamil, “Noori Nastaliq, Computerized Urdu Calligraphy”, Elite Publishers, 1982.

2. [Amin] A. Amin and S. Al-Fedaghi, “Machine recognition of printed Arabic text utilizing a natural language morphology”, Int. J. of Man-Machine Studies 35, 6 (1991), 768-788.

3. [Badr] Badr Al-Badr, Robert M. Haralick, “Segmentation-free word recognition with application to Arabic”, IJDAR 1(3): 147-166 (1998).

4. [Bunke] H. Bunke, P. Wang, “Handbook of character recognition and document image analysis”, World Scientific, 2000.

5. [Ding] X.Q. Ding, Y.S. Wu, “Recognition of multi-font printed Chinese characters”, CCIPP/CLCS, 1988, Toronto, Canada.

6. [Guo] H. Guo, X.Q. Ding, “The development of high performance Chinese/English bi-lingual OCR system”, Proc. CMIN ’95, Beijing, China, March 95, 248-253.

7. [Guyon] I. Guyon, J. Bromley, N. Matic, etc., “A neural network system for recognizing on-line handwriting”, Models of Neural Networks, Springer Verlag, 1996.

8. [Ha] J.Y. Ha, S.C. Oh, J.H. Kim, and Y.B. Kwon, “Unconstrained handwritten word recognition with interconnected hidden Markov models”, 3rd Int. Workshop on Frontiers in Handwriting Recognition, Buffalo, May 93, 455-460.

9. [Khorsheed1] Mohammad S. Khorsheed, William F. Clocksin, “Structural features of cursive Arabic script”, Proc. of 10th British Vision Conference, University of Nottingham, UK, September 1999.

10. [Khorsheed2] M.S. Khorsheed, “Off-Line Arabic Character Recognition: A Review”.

11. [Khorsheed3] Mohammad S. Khorsheed, “Automatic recognition of words in Arabic manuscripts”, PhD Dissertation, Churchill College, University of Cambridge, June 2000.

12. [Shridher] N. Shridher, F. Kimura, “Segmentation based cursive handwriting recognition”, Handbook of character recognition and document image analysis, 126-127, World Scientific, 1997.

13. [Trier] Øivind Due Trier, Anil K. Jain, and Torfinn Taxt, “Feature Extraction Methods for Character Recognition – A Survey”, Pattern Recognition, Vol. 29, No. 4, pp. 641-662, 1996.
