
Artificial Intelligence 133 (2001) 117–138

Recognition of gestures in Arabic sign language using neuro-fuzzy systems

Omar Al-Jarrah*, Alaa Halawani
Jordan University of Science and Technology, Department of Computer and Internet Engineering,

P.O. Box 3030, Irbid 22110, Jordan

Received 20 April 2000; received in revised form 26 June 2001

Abstract

Hand gestures play an important role in communication between people in their daily lives, but the most extensive use of hand gestures as a means of communication is found in sign languages. Sign language is the basic communication method among deaf people, and a translator is usually needed when an ordinary person wants to communicate with a deaf one. The work presented in this paper aims at developing a system for the automatic translation of the gestures of the manual alphabet in Arabic sign language. In doing so, we have designed a collection of ANFIS networks, each of which is trained to recognize one gesture. Our system does not rely on gloves or visual markings to accomplish the recognition job. Instead, it deals with images of bare hands, which allows the user to interact with the system in a natural way. An image of the hand gesture is processed and converted into a set of features comprising the lengths of vectors selected to span the fingertips' region. The extracted features are rotation, scale, and translation invariant, which makes the system more flexible. The subtractive clustering algorithm and the least-squares estimator are used to identify the fuzzy inference system, and training is achieved using the hybrid learning algorithm. Experiments revealed that our system was able to recognize the 30 Arabic manual alphabets with an accuracy of 93.55%. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: Hand gestures; Sign language; Recognition; Neuro-fuzzy; Arabic sign language; Deaf people

1. Introduction

Human–Computer Interaction (HCI) is becoming increasingly important as the influence of computers on our lives grows more significant. With the advancement in

* Corresponding author.
E-mail addresses: [email protected] (O. Al-Jarrah), [email protected] (A. Halawani).

0004-3702/01/$ – see front matter © 2001 Elsevier Science B.V. All rights reserved.
PII: S0004-3702(01)00141-2


the world of computers, the already-existing HCI devices (the mouse and the keyboard, for example) no longer satisfy the increasing demands. Designers are trying to make HCI faster, easier, and more natural. To achieve this, Human-to-Human Interaction techniques are being introduced into the field of Human–Computer Interaction. One of the most fertile Human-to-Human Interaction fields is the use of hand gestures. People use hand gestures mainly to communicate and to express ideas.

The importance of using hand gestures for communication becomes clearer when sign language is considered. Sign language is the fundamental communication method among people with hearing impairments. It is a collection of gestures, movements, postures, and facial expressions corresponding to letters and words in natural languages. In order for an ordinary person to communicate with deaf people, an interpreter is usually needed to translate sign language into natural language and vice versa. In recent years, the idea of designing a computerized translator has become an attractive research area.

In addition to sign language, researchers are trying to exploit hand gestures in applications like computer games [7], interactive computer graphics [8], hand tracking [1], television control [6], and pointing devices as a substitute for the ordinary mouse [17].

In order to take advantage of gestures in HCI, a means by which the computer can understand the gesture must be provided [16]. In this sense, two major classes have been identified. The first class relies on electromechanical devices that are used to measure the different gesture parameters, such as the hand's position, angle, and the locations of the fingertips. Systems that use such devices are usually called glove-based systems (e.g., the system described in [15]). A major problem with such systems is that they force the user to wear cumbersome and inconvenient devices. As a result, the means by which the user interacts with the system is complicated and less natural.

In order to remove this inconvenience and to increase the naturalness of HCI, the second class exploits machine vision and image processing techniques to create visual-based hand gesture recognition systems. Visual-based gesture recognition systems are further divided into two categories. The first relies on specially designed gloves with visual markers that help in determining hand postures [3,9]. However, using gloves and markers does not provide the naturalness required in such HCI systems; besides, if colored gloves are used, the processing complexity increases. As an alternative, the second kind of visual-based gesture recognition system tries to achieve the ultimate convenience and naturalness by using images of bare hands to recognize gestures.

Many researchers have been trying to introduce hand gestures to the HCI field. In [9], Hussain developed a system for the recognition of Arabic sign language alphabets. His method was based on detecting fingertip and wrist locations using a colored glove with six different colors. Vectors between the wrist and the fingertips, and between the fingertips themselves, are then computed and fed to a set of ANFIS models for recognition. The clear disadvantage of this system is its dependence on the colored glove, which restricts the naturalness of interaction.

Chan-Su Lee et al. proposed a system for the recognition of the Korean sign language [15]. They used fuzzy logic for direction classification, and fuzzy min-max neural networks for posture and orientation recognition [18]. The average recognition rate was 80.1%. The system is classified under the glove-based hand gesture recognition systems


since it uses the CyberGlove™ interface unit to measure the flexure of the fingers, hand posture, and orientation.

Freeman and Roth [5] exploited local orientation information embedded in the gesture image to recognize hand gestures. They used orientation histograms as a feature vector for gesture recognition. As an application, orientation histograms were used in [6] to recognize gestures for the purpose of television control, and in [8] for interactive computer graphics. Details of the computation of local orientation can be found in [4].

Although orientation histograms are less sensitive to lighting changes and exhibit the translation-invariance property, they are not rotation invariant; an image of a rotated version of a given gesture will have an orientation histogram different from that of the unrotated gesture. Besides, orientation histograms may not be unique; two different gestures may have very similar orientation histograms.

In [3], Davis and Shah presented a method for dynamic gesture recognition. Fingertips are detected and tracked through a sequence of frames. Next, the gesture is modeled as a set of vectors representing the direction and displacement of the fingertips. Recognition is done by matching this vector model with previously stored models. The cost of this system is the requirement for special visual markings on the hand to detect the fingertips.

In [21] and [20], elastic graph matching was used for the recognition of face and hand gestures. A gesture is represented as a labeled graph whose nodes are based on a Gabor wavelet transform. For each posture, a model graph is created from one of the posture's images. To classify an image, the model graphs for all postures are sequentially matched to the image, and a similarity value is computed for each graph matching. Then, the posture of the model graph with the highest similarity is chosen as the class of the image.

Our system recognizes the 30 Arabic sign language alphabets visually, using images of bare hands. We use the Adaptive Neuro-Fuzzy Inference System (ANFIS) [12] to accomplish the recognition job. Users are not required to wear gloves or to use any devices to interact with the system. Our approach relies on representing the gesture as a feature vector that is translation, scale, and rotation invariant. The 30 manual alphabets dealt with in this paper are shown in Fig. 1.

The rest of this paper is organized as follows: The next section describes the architecture of the Adaptive Neuro-Fuzzy Inference System. The third section is devoted to the subtractive clustering algorithm. In the fourth section, we introduce the system that we have built for recognition. Our results are summarized in the fifth section. Finally, concluding remarks are given in the sixth section.

2. Adaptive Neuro-Fuzzy Inference System (ANFIS)

The Adaptive Neuro-Fuzzy Inference System (ANFIS) is a class of adaptive networks that are functionally equivalent to Fuzzy Inference Systems (FISs) [12]. Usually, the transformation of human knowledge into a fuzzy system (in the form of rules and membership functions) does not give exactly the desired response, so there is a need to tune the parameters of the FIS to enhance its performance. The main objective of ANFIS is to optimize the parameters of a given fuzzy inference system by applying a learning


Fig. 1. Arabic sign language alphabets.

procedure using a set of input–output data pairs (called training data). The parameter optimization is done in such a way that the error measure between the desired and the actual output is minimized.

The architecture of ANFIS is a feedforward network that consists of five layers [12]. Fig. 2 shows the equivalent ANFIS architecture for a two-input Sugeno-type fuzzy inference system. A rule in the first-order Sugeno FIS has the form:

If $x$ is $A_i$ and $y$ is $B_i$, then $f_i = p_i x + q_i y + r_i$.

The output of a node in the first layer specifies the degree to which a given input $x$ satisfies a quantifier $A$; i.e., the function of node $i$ in this layer is a membership function for the quantifier $A_i$:

\[ O_i^1 = \mu_{A_i}(x). \tag{1} \]


Fig. 2. ANFIS architecture for a two-input, two-rule Sugeno FIS.

Each membership function has a set of parameters that control its shape. For example, a Gaussian membership function of the form

\[ \mu_{A_i}(x) = e^{-((x-c_i)/\sigma_i)^2} \tag{2} \]

has two parameters, $c_i$ and $\sigma_i$. Tuning the values of these parameters varies the membership function, which changes the behavior of the FIS. Parameters in this layer are referred to as premise parameters [12].

In the second layer, the output of a node represents the firing strength of a rule. The node generates this firing strength by multiplying its incoming signals:

\[ w_i = \mu_{A_i}(x) \times \mu_{B_i}(y). \tag{3} \]

The function of a node in the third layer is to compute the ratio of the $i$th rule's firing strength to the sum of all rules' firing strengths:

\[ \bar{w}_i = \frac{w_i}{w_1 + w_2}, \tag{4} \]

where $\bar{w}_i$ is referred to as the normalized firing strength [12].

In the fourth layer, each node has a function of the form:

\[ O_i^4 = \bar{w}_i f_i = \bar{w}_i (p_i x + q_i y + r_i), \tag{5} \]

where $\{p_i, q_i, r_i\}$ is the parameter set. These parameters are referred to as the consequent parameters [12].

The overall output is computed in the fifth layer by summing all the incoming signals, i.e.,

\[ O_1^5 = f = \sum_i \bar{w}_i f_i = \frac{w_1 f_1 + w_2 f_2}{w_1 + w_2}. \tag{6} \]

During the learning process, the premise and consequent parameters are tuned until the desired response of the FIS is achieved [12].
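To make the layer computations concrete, here is a minimal Python/NumPy sketch of the forward pass of the two-input, two-rule Sugeno ANFIS of Fig. 2 (Eqs. (1)-(6)); the function names and parameter values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def gaussian_mf(v, c, sigma):
    """Layer 1: Gaussian membership function, Eq. (2)."""
    return np.exp(-(((v - c) / sigma) ** 2))

def anfis_output(x, y, premise, consequent):
    """Forward pass of a two-input, two-rule Sugeno ANFIS.

    premise: per rule, ((c_A, sigma_A), (c_B, sigma_B));
    consequent: per rule, (p, q, r).
    """
    # Layer 2: firing strengths by product of memberships, Eq. (3)
    w = np.array([gaussian_mf(x, cA, sA) * gaussian_mf(y, cB, sB)
                  for (cA, sA), (cB, sB) in premise])
    wbar = w / w.sum()                                   # Layer 3: Eq. (4)
    f = np.array([p * x + q * y + r for p, q, r in consequent])
    return float((wbar * f).sum())                       # Layers 4-5: Eqs. (5)-(6)

# Example with arbitrary parameter values:
premise = [((0.0, 1.0), (0.0, 1.0)), ((1.0, 1.0), (1.0, 1.0))]
consequent = [(0.5, -0.2, 0.1), (1.0, 0.3, -0.4)]
print(anfis_output(0.3, 0.7, premise, consequent))
```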


3. Subtractive clustering

Subtractive clustering [2] is an effective approach for estimating the number of fuzzy clusters and the cluster centers. In this algorithm, each data point is considered a potential cluster center. For $n$ data points $\{x_1, \ldots, x_n\}$, a density measure is defined for each data point $x_i$ as

\[ D_i = \sum_{j=1}^{n} e^{-\|x_i - x_j\|^2/(r_a/2)^2}, \tag{7} \]

where $r_a$ is a positive constant.

It can be observed that the density measure for a data point is a function of its distance to all other data points. Hence, a data point with many neighboring points has a high potential of being a cluster center. The constant $r_a$ defines the radius of the neighborhood (the cluster radius); points outside this radius have little effect on the density measure. The choice of $r_a$ plays an important role in determining the number of clusters: small values of $r_a$ mean that many clusters will be identified, while larger values mean fewer clusters.

The first cluster center is chosen to be the data point with the highest density measure. Then, the density measure of each data point $x_i$ is reduced according to the formula

\[ D_i = D_i - D_{c_1} e^{-\|x_i - x_{c_1}\|^2/(r_b/2)^2}, \tag{8} \]

where $x_{c_1}$ is the point selected as the first cluster center, $D_{c_1}$ is its density measure, and $r_b$ is a positive constant. Note that data points close to the cluster center will have significantly reduced density measures, so they are unlikely to be selected as the next cluster center. The constant $r_b$ defines the radius of the neighborhood within which the reduction in density is appreciable. It is usually greater than $r_a$, to avoid closely spaced centers, and is set to $1.5 r_a$ [2].

The next cluster center is chosen and the density measure is reduced again. This process is repeated until a stopping criterion is met. A good stopping criterion can be found in [2].
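As a concrete illustration, the following Python/NumPy sketch implements Eqs. (7) and (8) with $r_b = 1.5 r_a$; the stopping rule used here (stop once the remaining peak density falls below a fraction of the first peak) is a simplified stand-in for the criterion of [2], and all names are illustrative.

```python
import numpy as np

def subtractive_clustering(X, ra=0.8, eps=0.15):
    """Estimate cluster centers for data X of shape (n, d)."""
    rb = 1.5 * ra
    # Density measure of every point, Eq. (7)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    D = np.exp(-d2 / (ra / 2) ** 2).sum(axis=1)
    first_peak = D.max()
    centers = []
    while True:
        i = int(np.argmax(D))
        if D[i] < eps * first_peak:      # simplified stopping criterion
            break
        centers.append(X[i].copy())
        # Reduce densities around the newly chosen center, Eq. (8)
        D = D - D[i] * np.exp(-((X - X[i]) ** 2).sum(axis=-1) / (rb / 2) ** 2)
    return np.array(centers)

# Example: two well-separated blobs should yield roughly two centers.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(3, 0.1, (20, 2))])
print(subtractive_clustering(X, ra=0.8))
```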

4. The recognition system

In this section, we discuss the elements of our system for Arabic sign language recognition. We describe the various phases that the image goes through until the gesture is recognized. We mainly concentrate on the feature extraction phase because we believe it is a crucial phase in the recognition system.

4.1. Structure

The structure of the recognition system is shown in Fig. 3. First, an image of the gesture is acquired using a camera connected to the computer. Then, the image enters the image processing stage, in which it is filtered and segmented to identify the gesture region. In this stage we also calculate the properties of the gesture and identify its border. Using the results of the image processing stage, the feature extraction stage


Fig. 3. Structure of the recognition system.

converts the image into the set of features required for the recognition process, which is done in the last stage. These stages are discussed in the following subsections.

4.2. Image processing

The acquired image cannot be used directly for recognition. It is necessary to do some preprocessing to prepare the image for feature extraction and classification. The image processing stage consists of the following steps:

• Image filtering: This step is necessary to reduce the noise gained in the acquisition process and to enhance the image's quality. A 3 × 3 median filter is applied to the image to reduce the noise.

• Image segmentation: The image is segmented into two regions: a background region and a hand-gesture region. This step is crucial because all the following steps rely on a correct segmentation of the image. Pixels corresponding to the gesture are set to 1, and those corresponding to the background are set to 0. To get good results, we apply an automatic thresholding scheme, specifically the iterative thresholding algorithm [11]. The algorithm works well even when the overall brightness conditions change. By segmenting the image, the system can use the region corresponding to the gesture to determine the gesture border and to compute the properties necessary for extracting the image's features.

• Calculation of properties: In this step, we are interested in computing the hand's direction and center of area. These parameters play an important role in the feature extraction phase. The center of area is the point from which vectors to a part of the gesture border will originate, and the direction helps in determining the portion of the border that holds the most important features of the gesture (a short code sketch of this step follows this list). For a binary image $B$ with the object's pixels set to 1, the coordinates of the center of area of the object are given by [11]:


Fig. 4. Gesture properties: (a) Original image; (b) Center of area, represented by the hole inside the gesture region, and orientation, represented by the drawn line.

\[ x_c = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} j\,B[i,j]}{\sum_{i=1}^{n} \sum_{j=1}^{m} B[i,j]}, \tag{9} \]

\[ y_c = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} i\,B[i,j]}{\sum_{i=1}^{n} \sum_{j=1}^{m} B[i,j]}, \tag{10} \]

and the direction of the object is given by [10]:

\[ \theta = \frac{1}{2}\tan^{-1}\!\left[\frac{2\mu_{11}}{\mu_{20} - \mu_{02}}\right], \tag{11} \]

where $\mu_{11}$, $\mu_{20}$, and $\mu_{02}$ are the second-order central moments. Fig. 4 shows an image of a gesture with its corresponding center of area, represented by a hole inside the white region in (b), and the corresponding orientation, represented by a line.

• Border following: After computing the direction and the center of area, we are no longer interested in the whole region of the gesture; the border information is all that is needed for the feature extraction phase. Determining the border of the gesture region is the last job in the image processing phase. Fig. 5(b) shows the result of applying a border tracing algorithm, found in [19], to the gesture in Fig. 5(a). Fig. 5(c) is a smoothed version of the border in Fig. 5(b). The border smoothing is done to enhance the quality of the border data and to eliminate local bad spots. The smoothing was done using a Gaussian filter of the form

\[ G = [0.0625, 0.2500, 0.3750, 0.2500, 0.0625]. \tag{12} \]


Fig. 5. Border tracing: (a) Gesture image; (b) Corresponding border; (c) Border smoothed using the filter in Eq. (12).

This filter is applied to the border image along the $x$ and $y$ axes.
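As a sketch of the "calculation of properties" step above, the following Python/NumPy fragment computes the center of area (Eqs. (9)-(10)) and the direction (Eq. (11)) of a binary image; it uses `arctan2` for quadrant handling, a detail Eq. (11) leaves implicit, and the function name and test blob are illustrative.

```python
import numpy as np

def center_and_direction(B):
    """Center of area and direction of the object in binary image B."""
    i, j = np.nonzero(B)            # rows (i) and columns (j) of object pixels
    xc, yc = j.mean(), i.mean()     # Eqs. (9)-(10): columns -> x, rows -> y
    # Second-order central moments
    mu11 = ((j - xc) * (i - yc)).sum()
    mu20 = ((j - xc) ** 2).sum()
    mu02 = ((i - yc) ** 2).sum()
    # Eq. (11); arctan2 resolves the quadrant ambiguity of tan^-1
    theta = 0.5 * np.arctan2(2 * mu11, mu20 - mu02)
    return (xc, yc), theta

# Example on a small synthetic blob:
B = np.zeros((8, 8), dtype=int)
B[2:6, 3:5] = 1
print(center_and_direction(B))
```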

4.3. Feature extraction

Selecting good features is a crucial step in any object recognition system. One may ask why not use the image itself as the feature. There are two reasons for not doing so. First, the high dimensionality of the image makes it unreasonable to use the image as a feature. Second, much of the information embedded in the image data is redundant, and some of it may not be useful. Therefore, the objective of the feature extraction phase is to represent the image by a set of numerical features that captures the useful information, removes the redundancy of the image data, and reduces its dimensionality. For a set of features to be considered reasonable, it must satisfy the following [14]:

(1) Images of objects from the same class must have very similar features.
(2) Features of objects from different classes must be noticeably different.
(3) Features must be scale, translation, and rotation invariant, i.e., they must allow objects to be recognized regardless of their size, location, and orientation.

Our feature extraction scheme uses the border information, the center of area, and the gesture direction to extract a feature vector for the gesture. The approach depends on originating vectors from the center of area to the portion of the border that bears the most important information about the gesture (the fingertips region). The lengths of these vectors are taken as the corresponding features of the gesture. If $c_x$ and $c_y$ are the coordinates of the center of area, and $b_x$ and $b_y$ are the coordinates of a point on the border, the length of the vector is given by:

\[ l_{cb} = \sqrt{(b_x - c_x)^2 + (b_y - c_y)^2}, \tag{13} \]

and the vector direction is given by:

\[ \theta_{cb} = \tan^{-1}\!\left(\frac{b_y - c_y}{b_x - c_x}\right). \tag{14} \]

Two questions arise here: first, how can we determine the useful part of the border, and second, how many vectors should be used?


To determine the useful portion of the border, the direction of the gesture is used. We find that the information discriminating gestures in Arabic sign language lies in the region between the angles:

\[ (90 - \theta_g) \leq \theta \leq (113 + \theta_g), \tag{15} \]

where $\theta_g$ is the gesture direction (in degrees). Consequently, only vectors with directions that satisfy the above inequality are used. The region specified by this inequality corresponds to the region of the fingertips and was derived from an investigation of the hand gestures of several people.

The number of vectors was determined experimentally; the best results were achieved using 30 vectors. These vectors are equally spaced in the range specified by Eq. (15), and their lengths are taken as the features of the gesture. The constructed feature vector thus has the form:

\[ f = [l_1, l_2, \ldots, l_{30}], \tag{16} \]

where $l_i$, $i = 1, 2, \ldots, 30$, is the length of vector $v_i$. Fig. 6 shows a sample gesture with its corresponding features represented by a bar chart. To reduce the effect of scaling of the gesture, we normalize the features into the range 0 to 100 by dividing them by the maximum vector length and then multiplying by 100. If the image is scaled, all of the vectors are scaled by the same factor, so normalization ensures the scale-invariance property.

The selected features are also translation and rotation invariant. Note that when the position of the gesture in the image changes, the coordinates of its center of area and of the designated vectors change accordingly, so the lengths of these vectors are not affected. Furthermore, if the gesture is rotated, its direction changes accordingly, causing the range of corresponding features to move so that the inequality in (15) remains satisfied. Consequently, the relative orientation does not change.

Fig. 6. Features: (a) Vectors originating from the center of area to a portion of the gesture border; (b) Features (vector lengths) represented as a bar chart.
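A sketch of this feature extraction scheme in Python/NumPy follows; the border representation (an array of $(x, y)$ coordinates), the nearest-angle sampling, the degree units, and the absence of angular wraparound handling are all simplifying assumptions for illustration.

```python
import numpy as np

def extract_features(border, center, theta_g, n_vectors=30):
    """Lengths of n_vectors vectors from the center of area to border points
    whose directions lie in the range of Eq. (15), normalized to [0, 100]."""
    cx, cy = center
    dx, dy = border[:, 0] - cx, border[:, 1] - cy
    lengths = np.hypot(dx, dy)                    # Eq. (13)
    angles = np.degrees(np.arctan2(dy, dx))       # Eq. (14)
    lo, hi = 90 - theta_g, 113 + theta_g          # useful range, Eq. (15)
    feats = []
    for a in np.linspace(lo, hi, n_vectors):      # 30 equally spaced directions
        k = int(np.argmin(np.abs(angles - a)))    # border point nearest in angle
        feats.append(lengths[k])
    f = np.array(feats)
    return 100.0 * f / f.max()                    # scale normalization
```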


4.4. Gesture recognition

In this phase, the features extracted from the image data are recognized as a specific gesture. We use Adaptive Neuro-Fuzzy Inference Systems (ANFIS) as the underlying architecture for the recognition process. The adopted ANFIS architecture is of the type that is functionally equivalent to the first-order Sugeno-type fuzzy inference system. For each of the 30 gestures, an ANFIS model is built and trained to recognize the corresponding gesture. The resulting architecture is sometimes called MANFIS (Many ANFIS) [13]. Each ANFIS model is trained to produce a value of 1 as output if the data presented at its inputs correspond to the gesture that the model is associated with, and a value of 0 otherwise.

The recognition process is done by presenting the features of the image to be classified to each of the 30 ANFIS models. This results in 30 different responses. A voting scheme is applied to determine the class to which the image belongs: the class (gesture) associated with the ANFIS model whose response is closest to the value of 1 is chosen as the class of the image under investigation.
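As a sketch, the voting step reduces to picking the model whose response is closest to 1; here `models` is a hypothetical list of 30 trained per-gesture ANFIS callables, each mapping a 30-element feature vector to a scalar.

```python
def classify(features, models, class_names):
    """Return the gesture name whose ANFIS model responds closest to 1."""
    responses = [m(features) for m in models]
    best = min(range(len(models)), key=lambda i: abs(responses[i] - 1.0))
    return class_names[best]
```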

An important issue that determines the effectiveness of this phase is the construction of the ANFIS models. The process of constructing an ANFIS model for a specific class involves two steps:

• Identification of a fuzzy model (Fuzzy Inference System) for that class, and
• Training the model using ANFIS.

These two steps are discussed in the following subsections.

4.4.1. Fuzzy model identification

As the complexity of a system increases, the ability of humans to describe the system by knowledge-based rules decreases. For example, when the dimensionality of the data is high and the available training set is large, the number of possible value combinations describing a pattern grows prohibitively.

With a data dimensionality of 31 (30 inputs and 1 output), it becomes very difficult to describe the rules manually. Therefore, an automatic model identification method becomes a must. The process of fuzzy model identification includes the following steps:

(1) Determination of the number of fuzzy rules,
(2) Determination of the number of membership functions,
(3) Identification of the premise parameters, and
(4) Identification of the consequent parameters.

The first three steps are done using the subtractive clustering algorithm. When subtractive clustering is applied to a group of input–output data pairs, each identified cluster center is considered a prototype that represents a set of characteristics of the system [2]. So, the cluster centers can be used as the centers of the fuzzy rules that describe the system. In this way, the number of fuzzy rules equals the number of clusters.

The degree to which each rule is fulfilled is determined by an and-combination of 30 membership functions, each of which corresponds to an input dimension. So, if we have $r$ rules, the number of membership functions in each dimension is equal to $r$, and the total number of membership functions is equal to $30 \times r$. The $i$th rule, $i = 1, \ldots, r$, has the form:

if $in_1$ is $\mu_1^i$ and $in_2$ is $\mu_2^i$ ... and $in_{30}$ is $\mu_{30}^i$, then out is $f_i$,

where $\mu_j^i$, $j = 1, \ldots, 30$, is the $i$th membership function of the $j$th input dimension, and $f_i$ is the $i$th linear membership function (an MF with consequent parameters) of the output. Because we are using a first-order Sugeno-type fuzzy model, $f_i$ has the form:

\[ f_i = p_1^i in_1 + p_2^i in_2 + \cdots + p_{30}^i in_{30} + q_i, \tag{17} \]

where $\{p_1^i, \ldots, p_{30}^i, q_i\}$ is the set of consequent parameters.

In our system, we have chosen the premise membership functions $\mu_j^i$ to be Gaussian (see Eq. (2)). Because a cluster center $c_i$ is used as the basis for rule $i$, rule $i$ must be completely fulfilled by the elements of $c_i$. Moreover, we want data points in the neighborhood of $c_i$ to fulfill rule $i$ to a high degree. To achieve this, the membership function $\mu_j^i$ must have the form:

\[ \mu_j^i(in_j) = e^{-((in_j - c_{ij})/(r_a/2))^2}, \tag{18} \]

where $c_{ij}$ is the $j$th element of the cluster center $c_i$, and $r_a$ is the radius of the neighborhood defined previously in Section 3. Therefore, the parameters of the Gaussian membership function (the premise parameters) are identified as $c_{ij}$ and $r_a/2$.

The consequent parameters are identified using the least-squares estimate (LSE) method. Recall that the overall output of the Sugeno-type fuzzy model is calculated using the weighted-average method. For $r$ rules the output is given by:

\[ f = \frac{\sum_{i=1}^{r} w_i f_i}{\sum_{i=1}^{r} w_i} = \sum_{i=1}^{r} \bar{w}_i f_i, \tag{19} \]

where $w_i$ and $\bar{w}_i$ are the firing strength and the normalized firing strength of rule $i$, respectively, and $f_i$ is the $i$th rule consequent. Defining $P_i = [p_1^i, p_2^i, \ldots, p_{30}^i]$ as the row vector of the linear parameters $p_1^i, \ldots, p_{30}^i$, and $I = [in_1, in_2, \ldots, in_{30}]^T$ as the input vector, and substituting for $f_i$, Eq. (19) becomes:

\[ f = \sum_{i=1}^{r} \bar{w}_i (P_i I + q_i), \tag{20} \]

or alternatively:

\[ f = \left[\bar{w}_1 I^T, \bar{w}_1, \ldots, \bar{w}_r I^T, \bar{w}_r\right] \begin{bmatrix} P_1^T \\ q_1 \\ \vdots \\ P_r^T \\ q_r \end{bmatrix}. \tag{21} \]


If we have a set of $n$ training-data vectors, the resulting set of model outputs is given by:

\[ \begin{bmatrix} f_1 \\ \vdots \\ f_n \end{bmatrix} = \begin{bmatrix} \bar{w}_{1,1} I_1^T & \bar{w}_{1,1} & \ldots & \bar{w}_{r,1} I_1^T & \bar{w}_{r,1} \\ \vdots & & & & \vdots \\ \bar{w}_{1,n} I_n^T & \bar{w}_{1,n} & \ldots & \bar{w}_{r,n} I_n^T & \bar{w}_{r,n} \end{bmatrix} \begin{bmatrix} P_1^T \\ q_1 \\ \vdots \\ P_r^T \\ q_r \end{bmatrix}, \tag{22} \]

where $\bar{w}_{i,j}$ is $\bar{w}_i$ evaluated at $I_j$. Note that in Eq. (22), the first matrix on the right-hand side is constant (because the values $I_j$, $j = 1, \ldots, n$, are provided by the training data), while the second matrix on the right-hand side contains all the consequent parameters to be identified. Replacing the vector on the left-hand side with the actual outputs of the training data, the parameter identification problem can now be solved using the LSE method. Eq. (22) can be rewritten in the standard notation of the LSE method as:

\[ AX = B, \tag{23} \]

where $B$ is the output vector, $A$ is the constant matrix, and $X$ is the matrix of parameters to be identified. Our objective is to minimize the squared error $\|AX - B\|^2$. The solution that minimizes the squared error is given by [13]:

\[ X = (A^T A)^{-1} A^T B. \tag{24} \]

4.4.2. Training fuzzy models

Once a fuzzy model for a class is identified, an ANFIS network equivalent to the model can be built as discussed in Section 2. The network is then trained using the hybrid learning algorithm until the desired response is achieved. The hybrid learning algorithm combines the gradient descent method and the least-squares estimate (LSE) to identify parameters. The problem with using gradient descent alone is that it is generally slow and likely to become trapped in local minima [12]. If an adaptive network is linear in some of its parameters, LSE can be used to identify those parameters. For ANFIS, observe that when the values of the premise parameters are fixed, the overall output can be expressed as a linear combination of the consequent parameters, so the LSE can be used to identify them. In the hybrid learning algorithm, each epoch consists of two passes: a forward pass and a backward pass. In the forward pass, the premise parameters are fixed and the consequent parameters are identified by LSE. In the backward pass, the error signals are propagated backwards and the premise parameters are updated using gradient descent.

The consequent parameters identified this way are optimal under the condition that the premise parameters are fixed [13]. This means that the hybrid learning algorithm converges faster than pure backpropagation, since it reduces the search space.
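To illustrate the two passes, here is a deliberately tiny one-input, two-rule sketch of a single hybrid-learning epoch; it substitutes a numerical gradient for the analytic backpropagation of [12], and all names, data, and constants are illustrative assumptions.

```python
import numpy as np

def gaussian(x, c, s):
    return np.exp(-((x - c) / s) ** 2)

def norm_firing(x, c, s):
    """Normalized firing strengths, shape (n, 2)."""
    w = np.stack([gaussian(x, c[0], s[0]), gaussian(x, c[1], s[1])], axis=1)
    return w / w.sum(axis=1, keepdims=True)

def hybrid_epoch(x, y, c, s, lr=0.01):
    # Forward pass: premise fixed, consequents {p_i, q_i} identified by LSE.
    wbar = norm_firing(x, c, s)
    A = np.hstack([wbar * x[:, None], wbar])        # columns: w1*x, w2*x, w1, w2
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)   # [p1, p2, q1, q2]
    p, q = theta[:2], theta[2:]

    def mse(cv, sv):
        wb = norm_firing(x, cv, sv)
        f = (wb * (np.outer(x, p) + q)).sum(axis=1)
        return ((f - y) ** 2).mean()

    # Backward pass: premise parameters updated by gradient descent
    # (numerical gradients stand in for analytic backpropagation here).
    eps = 1e-5
    for i in range(2):
        e = np.zeros(2); e[i] = eps
        c[i] -= lr * (mse(c + e, s) - mse(c - e, s)) / (2 * eps)
        s[i] -= lr * (mse(c, s + e) - mse(c, s - e)) / (2 * eps)
    return c, s, p, q, mse(c, s)

# A few epochs on toy data:
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 50)
y = np.sin(x)
c, s = np.array([-0.5, 0.5]), np.array([0.5, 0.5])
for _ in range(10):
    c, s, p, q, err = hybrid_epoch(x, y, c, s)
print(err)
```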

5. Experimental results

In this section, we evaluate the performance of our recognition system by testing its ability to classify gestures from both the training and the testing data. The effect of the number of


rules used to describe the system on its performance is studied. In addition, we discuss some problems in recognizing certain gestures due to the similarities between them. We also compare our results with those obtained in [9].

5.1. Data set

The data set used for training and testing the recognition system consists of gray-scale images of all thirty gestures shown in Fig. 1. Sixty samples of each gesture were taken from 60 different volunteers. For each gesture, 40 of the 60 samples were used for training, while the remaining 20 samples were used for testing. Training was performed until either an error goal of 0.04 was achieved or a maximum of 300 epochs was reached. The samples were taken at different distances from the camera and with different orientations. In this way, we obtained a data set with cases of different sizes and orientations, so that we could examine the capabilities of our feature extraction scheme. For example, some of the samples taken for the gesture "Tha" are shown in Fig. 7.

5.2. Recognition rate

We evaluate the performance of our system based on its ability to correctly classify samples into their corresponding classes. The metric that we use is called the recognition rate, defined as the ratio of the number of correctly classified samples to the total number of samples, i.e.,

\[ \text{Recognition rate} = \frac{\text{Number of correctly classified samples}}{\text{Total number of samples}} \times 100\%. \tag{25} \]

First, we built a system with 19 rules ($r_a = 0.4$) for each ANFIS model. The results for both the training and the testing data are shown in Table 1.

We can see that the system performs very well on the training data. This is expected, because the parameters of the system were tuned according to these data. On the other hand, the recognition rate for the testing data is relatively low (73.16%). The results obtained for the testing data are not considered satisfactory, because we want the system to respond well to both training and testing data. After several experiments, we found that the observed behavior is affected by the value chosen for the cluster radius, $r_a$.

As stated in Section 3, the value chosen for the cluster radius, $r_a$, plays an important role in determining the number of clusters to be generated. The effect of the cluster radius

Table 1
Results of the system with 19 rules ($r_a = 0.4$) per ANFIS model

Data       # of samples    Recognized samples    Recognition rate (%)
Training   1200            1176                  98.00
Testing    600             439                   73.16
Total      1800            1615                  89.72


Fig. 7. Some of the samples taken for the gesture “Tha”.

on the complexity of the system, represented by the number of rules, is shown in Fig. 8. Small values of $r_a$ lead to a large number of clusters and, hence, result in a large number of rules. On the other hand, large values of $r_a$ mean fewer clusters, which in turn means fewer generated rules. Therefore, changing the value of $r_a$ may affect the performance of the system considerably.

We have run several experiments, varying the value of $r_a$ and observing the resulting behavior. Fig. 9 shows the effect of changing the value of the cluster radius on the recognition rate for the training data. It can be seen that for small values of $r_a$, the recognition rate for the training data is very high, approaching 100% for $r_a = 0.35$. As the value of $r_a$ is increased, the recognition rate decreases accordingly, reaching 87.5% for $r_a = 0.9$. This may suggest that setting the value of $r_a$ as low as possible will enhance the performance of the system. In fact, this is not true. Looking at Fig. 10, which shows the recognition rate for the testing data as a function of $r_a$, we can see that the behavior for the testing data is completely different from that observed for the training data. The performance is very low when the value of $r_a$ is very


Fig. 8. Number of rules as a function of $r_a$.

small (about 61% for $r_a = 0.35$); it improves as $r_a$ is increased, reaching about 85% when $r_a = 0.8$, and then worsens again for larger values of $r_a$.

The behavior observed for small values of $r_a$ is caused by a phenomenon called overfitting. Overfitting is the situation in which the fuzzy system is fitted to the training data so well that its ability to fit the testing data is no longer satisfactory. In our case, overfitting occurs when the number of rules describing the system is very large, which results in a very specific description of the training data. This causes the system to respond very badly to any data that does not fit that specific description, and therefore reduces the system's generalization capability.

When the value of $r_a$ becomes too large, the small number of generated rules is not sufficient to convey a good description of the system, so the behavior is bad for both training and testing data. This explains the situation shown in Fig. 10, in which the recognition rate drops when the value of $r_a$ exceeds 0.8.

Of course, we are not interested in a system with low generalization capability. Instead, we are looking for a system that is trained on the training data and performs well on the testing data. This is achieved by setting $r_a$ to a suitable value. Fig. 11 shows the overall performance (on the training plus the testing data) as a function of $r_a$. The best result is achieved when $r_a = 0.8$, which results in approximately 9 rules per ANFIS model. At this point, the system fits both the training and the testing data well. For the rest of this section, the system with $r_a = 0.8$ is considered unless otherwise specified. The results obtained are summarized in Table 2.


Fig. 9. The effect of changing $r_a$ on the recognition rate of the training data.

Table 2
Results of the system using $r_a = 0.8$

Data       # of samples    Recognized samples    Recognition rate (%)
Training   1200            1158                  96.50
Testing    600             515                   85.83
Total      1800            1673                  92.94

Most of the misclassified samples correspond to gestures that are similar to each other. As an example, Fig. 12 shows the gestures "Ra" and "Za". Because these gestures are similar, their corresponding features are also similar, as can be seen in Fig. 12(b) and (d). Therefore, a sample of the gesture "Ra" is likely to be classified as "Za" or vice versa. The same is true for the gestures "Dal" and "Thal", and the gestures "Tah" and "Thah".

In addition, it was observed that the gestures "Sad" and "He" are sometimes misclassified as each other. Even though these two gestures do not seem similar, their corresponding features have some degree of similarity. Fig. 13 shows that the border information of both gestures is similar. Since we use only the border information in our feature extraction scheme, the resulting features of these gestures are similar to some extent.

In trying to enhance the performance of the problematic gestures, it was noticed that some of these gestures perform better with a relatively higher number of rules. Specifically, the gestures "Sad" and "Thah" have higher recognition rates using 13 rules.


Fig. 10. The effect of changing $r_a$ on the recognition rate of the testing data.

Table 3
Enhancements achieved using 13 rules for "Sad" and "Thah"

Gesture   Old recognition rate (%)   Enhanced recognition rate (%)
Sad       81.66                      90.00
Thah      81.66                      86.66
He        93.33                      95.00

Table 4
Results after enhancements

Data       # of samples    Recognized samples    Recognition rate (%)
Training   1200            1162                  96.83
Testing    600             522                   87.00
Total      1800            1684                  93.55

Replacing the old models of these gestures with models having 13 rules achieved some enhancement in their performance, as shown in Table 3. In addition, the performance of the gesture "He" was slightly enhanced as a side effect of the changes made to the gesture "Sad". The effect of these enhancements on the overall performance was not dramatic, as shown in Table 4; the recognition rate increased from 92.94% to 93.55%.


Fig. 11. The effect of changing $r_a$ on the overall recognition rate.

Fig. 12. Similarities between the gestures “Ra” and “Za”.


Fig. 13. Similarity between features of “Sad” and “He”.

Table 5
Comparison with the results found in [9]

System          Overall recognition rate (%)
System in [9]   95.57
Our system      93.55

Table 5 shows a comparison between the performance achieved by our system and that of the system in [9].

It can be seen that our results are comparable to those of [9]. The great advantage of our system over that designed by Hussain is that it eliminates the restriction of using colored gloves, without any considerable loss in performance. In addition, the image preprocessing employed in our system is more efficient. For example, the determination of the thresholds for the six regions in [9] is done manually, which makes that system unsuitable for real-time use. Moreover, segmenting the image into six regions involves scanning the whole image, determining for each pixel its distance from the threshold set for each region, and then assigning the pixel to the region with the minimum distance. Compared with our segmentation procedure, which involves just one comparison per pixel, the procedure used by Hussain in [9] is computationally expensive.


6. Conclusion

In this paper, we designed a system for the recognition of the alphabets of the Arabic sign language. The work was accomplished by training a set of ANFIS models, each of which is dedicated to the recognition of a given gesture. Without the need for any gloves, an image of the gesture is acquired using a camera connected to a computer. After preprocessing, features are extracted from the image. The feature extraction scheme depends on computing 30 vectors from the gesture's center of area to the useful portion of the gesture border. These vectors are then fed to the ANFIS system, which assigns them to a specific class (gesture).

The proposed system is robust against changes in the gesture's position, size, and/or direction within the image. This is because the extracted features are believed to be translation, scale, and rotation invariant.

Simulation results showed that our system, with approximately 9 rules per ANFIS model, was able to reach a recognition rate of 93.55%.

References

[1] S. Ahmad, A usable real-time 3D hand tracker, in: Proc. 28th Asilomar Conference on Signals, Systems, and Computers, IEEE Computer Society Press, 1995.
[2] S.L. Chiu, Fuzzy model identification based on cluster estimation, J. Intelligent and Fuzzy Systems 2 (3) (1994) 267–278.
[3] J. Davis, M. Shah, Gesture recognition, Technical Report CS-TR-93-11, Department of Computer Science, University of Central Florida, Orlando, FL, 1993.
[4] W.T. Freeman, E.H. Adelson, The design and use of steerable filters, IEEE Transactions on Pattern Analysis and Machine Intelligence 13 (1991) 891–906.
[5] W.T. Freeman, M. Roth, Orientation histograms for hand gesture recognition, in: Proc. IEEE International Workshop on Automatic Face and Gesture Recognition, Zurich, Switzerland, 1995.
[6] W.T. Freeman, C.D. Weissman, Television control by hand gestures, in: Proc. IEEE International Workshop on Automatic Face and Gesture Recognition, Zurich, Switzerland, 1995.
[7] W.T. Freeman, K. Tanaka, K. Kyuma, Computer vision for computer games, in: Proc. IEEE 2nd International Conference on Automatic Face and Gesture Recognition, Killington, VT, 1996.
[8] W.T. Freeman, D. Anderson, P. Beardsley, C. Dodge, H. Kage, K. Kyuma, Y. Miyake, M. Roth, K. Tanaka, C. Weissman, W. Yerazunis, Computer vision for interactive computer graphics, IEEE Computer Graphics and Applications 18 (3) (1998) 42–53.
[9] M.A. Hussain, Automatic recognition of sign language gestures, Master's Thesis, Jordan University of Science and Technology, Irbid, 1999.
[10] A.K. Jain, Fundamentals of Digital Image Processing, Prentice Hall, Englewood Cliffs, NJ, 1989.
[11] R. Jain, R. Kasturi, B. Schunck, Machine Vision, McGraw-Hill, New York, 1995.
[12] J.-S.R. Jang, ANFIS: Adaptive-Network-Based Fuzzy Inference System, IEEE Trans. Systems Man Cybernet. 23 (1993) 665–685.
[13] J.-S.R. Jang, C.-T. Sun, E. Mizutani, Neuro-Fuzzy and Soft Computing, Prentice Hall, Englewood Cliffs, NJ, 1997.
[14] A. Khotanzad, J.-H. Lu, Classification of invariant image representations using a neural network, IEEE Transactions on Acoustics, Speech, and Signal Processing 38 (1990) 1028–1038.
[15] C.-S. Lee, G. tae Park, J.-S. Kim, Z. Bien, W. Jang, S.-K. Kim, Real-time recognition system of Korean sign language based on elementary components, in: Proc. 6th IEEE International Conference on Fuzzy Systems, Barcelona, Spain, 1997, pp. 1463–1468.
[16] V.I. Pavlovic, R. Sharma, T.S. Huang, Visual interpretation of hand gestures for human–computer interaction: A review, IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (1997) 677–695.
[17] F.K.H. Quek, T. Mysliwiec, M. Zhao, FingerMouse: A freehand pointing interface, in: Proc. International Workshop on Automatic Face and Gesture Recognition, Zurich, Switzerland, 1995, pp. 372–377.
[18] P. Simpson, Fuzzy min-max neural networks, Part 1: Classification, IEEE Transactions on Neural Networks 3 (1992) 776–786.
[19] M. Sonka, V. Hlavac, R. Boyle, Image Processing, Analysis and Machine Vision, Chapman & Hall Computing, 1995.
[20] J. Triesch, C. von der Malsburg, Robust classification of hand postures against complex backgrounds, in: Proc. IEEE 2nd International Conference on Automatic Face and Gesture Recognition, Killington, VT, 1996.
[21] L. Wiskott, J.-M. Fellous, N. Kruger, C. von der Malsburg, Face recognition by elastic bunch graph matching, IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (1997) 775–779.