A soft computing model for mapping incomplete/approximate postal addresses to mail delivery points

P. Nagabhushan a, S.A. Angadi b,*, B.S. Anami c

a Department of Studies in Computer Science, University of Mysore, Mysore, India

b Department of Computer Science & Engineering, BEC, Bagalkot 587102, India

c KLE Institute of Technology, Hubli, India

Applied Soft Computing 9 (2009) 806–816

ARTICLE INFO

Article history:

Received 20 March 2006

Received in revised form 31 January 2008

Accepted 12 June 2008

Available online 6 November 2008

Keywords:

Fuzzy symbolic expert system

Postal address component

Mail delivery point

Soft computing model

ABSTRACT

Mapping a postal address to a mail delivery point is an important task that affects the efficiency of the postal service. The task is complex in countries such as India, where postal addresses are not structured. Further, destination addresses in such countries are often incomplete, approximate and erroneous, which adds to the complexity of mapping a postal address to a delivery point. Automation of this aspect of the postal service is a challenge. This paper presents a soft computing model to map a postal address to a mail delivery point. First, the machine readable postal address is processed to identify the address components using a novel fuzzy symbolic similarity analysis, and these labeled components are organized as a symbolic postal address object. This postal address object is then processed using a newly devised fuzzy symbolic methodology for mapping the address to a mail delivery point. Symbolic knowledge bases for postal address component labeling and mail delivery point mapping are devised. Fuzzy symbolic similarity measures are formulated twice: once for address component labeling and once for mapping the entire address to a mail delivery point. Following the similarity computations, which are viewed as fuzzy membership values, an expert system comprising α-cut de-fuzzification is proposed to evaluate the confidence factors while inferring the validity of address component labels and mail delivery points. The system is tested exhaustively, and an efficiency of 94% is obtained in address component identification and about 86% in mail delivery point mapping, while working on an Indian postal database of about 500 addresses.

© 2008 Elsevier B.V. All rights reserved.


1. Introduction

Automating the various sub-tasks of postal mail handling can increase the efficiency of the postal mail service and will help the postal service face competition from electronic communication tools such as email, fax, Short Messaging Service (SMS), Multimedia Messaging Service (MMS) etc. The different aspects of postal services that need to be automated are identified in [1,2]. The most important and central task in postal automation is interpretation of the destination address. Apart from the character recognition activities for reading the postal address, other sub-tasks required to make the postal service efficient include address component identification, address validation, mail delivery point mapping, route optimization for mail distribution on a daily basis etc. [2–5]. These tasks require application of tools/techniques from different technological areas such as pattern recognition, image processing, graph theory, optimization, soft computing etc.

* Corresponding author. Tel.: +91 8354233832; fax: +91 8354234204.

E-mail addresses: [email protected] (P. Nagabhushan), [email protected] (S.A. Angadi), [email protected] (B.S. Anami).

doi:10.1016/j.asoc.2008.06.005

Most of the reported works in the postal automation area are directed towards development of address interpretation systems, with emphasis on extracting the destination address from the mail piece and converting it to machine readable form [6,7]. Some works on validation of machine readable postal addresses [5,8] and generation of the destination postal code (also called PINCODE in India) from machine readable addresses [9] are reported. Truth value and testing issues in complex systems, especially pertaining to postal addresses, are discussed in [3]. Information theoretic analysis of different postal address fields and its usage in address interpretation is presented in [4]. A fuzzy mapping technique for overcoming the errors in postal addresses is presented in [10].

The postal address interpretation works assume standard address formats [3,4,11], where the correctness and consistency of the address are not generally in question. The same is not true in countries like India, where the destination addresses are written using any known information about the geographical location of the addressee/mail delivery point. The typical Indian postal addresses are approximate/incomplete descriptions of the mail delivery point, as brought out in Table 5 of Section 7.2.

To the best of our knowledge there has not been any reported work on identification of address components and mapping to the mail delivery point, especially with approximate/incomplete postal addresses. This task of identifying/labeling the address components and mapping the address to mail delivery points requires categorization and classification of text material. A multitude of works in general text/word categorization and classification applied to other domains are found in the literature. Text categorization is the assignment of text material to one or more predefined categories based on its content [12]. The different approaches to text categorization, such as supervised and unsupervised techniques including knowledge engineering approaches, are discussed in [13]. A number of statistical and machine learning techniques have been devised for general text categorization applications, including regression models, nearest neighbor classifiers, decision trees, Bayesian classifiers, support vector machines etc. [12].

Recent advances in computational strategies have introduced a new paradigm in computation, namely soft computing, for processing imprecise and inaccurate information. Soft computing is tolerant of imprecision, uncertainty, partial truth, and approximation [10,14]. These aspects of soft computing are used to achieve tractability and robustness in processing at low solution cost [15]. Soft computing approaches have been adapted for categorizing text. A least squares support vector machine approach based on radial basis functions, proposed in [16] for text classification, has achieved an efficiency of 80%. A fuzzy inductive learning approach using concept dictionaries for classifying emails by analyzing the subject and body of the email is presented in [17]. Soft computing techniques have also been employed to solve many real world problems. A fuzzy expert system using an uncertainty model with qualitative scales of plausibility values and multi-set based fuzzy algebra for solving the lost circulation problem, a common problem encountered while drilling oil wells, is described in [18]. A framework for medical diagnosis that employs a tight coupling of case base, rule base and neural networks is proposed in [19]. A fuzzy rule-based system is proposed for adaptive scheduling, which dynamically selects and applies the most suitable strategy according to the current state of the scheduling environment [20].

Hence, in pursuit of better computational strategies for processing the machine readable destination address, which by its very nature is imprecise and ambiguous, we explore and employ soft computing strategies for postal address component labeling and mapping the address to a mail delivery point.

The task of mapping the postal address to a mail delivery point is a very important task in postal automation, and the output can be further used for generation of an optimal mail distribution route on a daily basis. In this paper a novel soft computing methodology for mapping the incomplete/approximate postal address to a mail delivery point is proposed. The proposed methodology takes the destination postal address in text form as input and maps it to one or more probable mail delivery points using soft computing techniques, as depicted in Fig. 1. The principal constituents of Soft Computing are Fuzzy Logic (FL), Neural Computing, Evolutionary Computing/Genetic Algorithms, Machine Learning (ML) and Probabilistic Reasoning (PR) including belief networks, chaos theory and parts of learning theory. Another interesting approach to soft computing is Symbolic Data Analysis (SDA) [21].

Fig. 1. Block schematic of the soft computing model.

The application of soft computing for mapping the postal address to a mail delivery point requires an appropriate representation of the postal address to facilitate its further processing for identifying the mail delivery point. The characteristics of a postal address, such as its impreciseness and the presence/absence of various fields, make symbolic data representation an appropriate choice. In this paper a postal address is interpreted as a symbolic object, as proposed in one of our earlier works [5]. Based on this representation scheme, symbolic similarity measures are devised for identifying the address component labels and mail delivery points. The similarity measures resemble fuzzy membership functions, since the similarity values give approximate nearness to various possibilities (of address component labels/mail delivery points). This necessitates the disambiguation of the similarity formulation, which is carried out by an expert system using the fuzzy alpha cut methodology for address component labeling and mail delivery point mapping. The alpha cut set generated is further employed in computing a confidence value for the decision made.

The soft computing methodology is tested for address component labeling and mail delivery point mapping using a knowledge base of 500 postal addresses representing about 300 distinct mail delivery points. Overall, accurate address component labeling has been achieved on 94% of the addresses tested, whereas an average component labeling efficiency of 95.8% is observed. Further, the proposed methodology could unambiguously decide on the mail delivery point for about 86% of addresses.

The paper is organized into eight sections. Section 2 presents a discussion on the incomplete/approximate nature of postal addresses and proposes a symbolic object representation. Section 3 presents the proposed expert system for labeling the address components and mapping the address to a mail delivery point. Section 4 explains the symbolic knowledge bases employed in this work. Section 5 describes the newly devised symbolic similarity measures. Section 6 presents the fuzzy alpha cut based methodology for identifying address components and mapping the postal address to mail delivery points, along with a confidence value for the decision made. Section 7 gives the experimental results and Section 8 presents the conclusion.

2. Representation of postal address as a symbolic object

The postal address is used to describe the addressee with emphasis on specifying the mail delivery point. Most of the advanced countries have fixed formats for the destination postal addresses [11], which is facilitated by the structured layout of the localities. In countries like India this is not the case, and most of the postal addresses give an approximate description of the addressee/mail delivery point. The postal addresses generally make use of well-known landmarks, houses of famous personalities and popular names of roads for describing the mail delivery point. Further, the roads and landmarks may have more than one popular name. All these give an unstructured nature to the postal addresses. People also use synonyms like street/road for cross, avenue for road etc. when writing destination addresses and may sometimes use wrongly spelled words. Hence, typically the postal addresses are approximate/incomplete/imprecise descriptions of the mail delivery points. Such addresses are mapped to actual mail delivery points in the postal system by the experience of the postman or postal mail sorter. Any computational model to automatically map the postal address to a unique or even probable mail delivery point/s has to deal with the approximate nature/incompleteness of the postal address.

Table 1. A typical postal address object.

Postal address:
    Shri M.M. Patil
    Lakshmi Extension
    Gokak-591307
    Belgaum
    Karnataka

Symbolic representation:
    PostalAddressObject = {[Addressee = (Salutation = Shri), (AddresseeName = MMPatil)],
                           [Location = (Area = LaxmiExtension)],
                           [Place = (place = Gokak), (PIN = 591307), (District = Belgaum), (State = Karnataka)]}

Along with the approximate and imprecise nature of postal addresses, other aspects that make the task of mapping the postal address to a mail delivery point complex include the various possible combinations of addressee and mail delivery point and the method of writing addresses. For example, persons with the same name may be present in the neighborhood. Persons with the same initials but different names may be present in the neighborhood. A person may have more than one address. The addressee information may be correct but the location information may be incorrect, or vice versa. To deal with these peculiar characteristics of postal addresses and process them automatically, there is a need for an appropriate data structure to represent the postal address and, consequently, the knowledge base. This section gives a symbolic representation for the postal address; the symbolic knowledge bases employed in this work are described in Section 4.

The postal addresses contain various components/fields. Some of these fields are qualitative, such as addressee name, care of name etc.; other fields, such as house number, road number etc., are numeric, though their usage may be non-numeric in nature. The values taken by most of the fields can be distinct, or one among a given range, or one amongst an enumerated list of values. A postal address may not contain all the possible fields. This description of the postal address makes it a suitable candidate for representation using the symbolic data approach [5].

Symbolic objects are extensions of classical data types. A symbolic object and its extent can model a concept or an object of the real world. Symbolic objects can be of three different types: Assertion Object, Hoard Object and Synthetic Object. An assertion object is a conjunction of events pertaining to a given object. An event is a pair which links feature variables and feature values. A hoard object is a collection of one or more assertion objects, whereas a synthetic object is a collection of one or more hoard objects [21–23]. The postal address object is described as a hoard object consisting of three assertion type objects [5], namely Addressee, Location and Place, as described in Eq. (1):

$$\text{POSTAL ADDRESS OBJECT} = \{[\text{Addressee}], [\text{Location}], [\text{Place}]\} \tag{1}$$

Each of the assertion objects describes an important component of the destination address. The Addressee specifies the name and other personal details of the mail recipient; the Location specifies the geographical position of the mail delivery point; and the Place specifies the city/town or village where the mail recipient resides. Each of these assertion objects is defined by a collection of events described by the feature variables. The feature variables, or postal address fields, of the different assertion objects are listed in Eqs. (2)–(4). Each feature describes some aspect of the object, and all the features together completely specify the assertion objects. However, certain features are missing in a typical postal address because they are not available, and in some cases the written address may contain more than the required address components (typically more values for one feature, viz. two or more landmarks).

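$$[\text{Addressee} = (\text{Addressee Name})(\text{Care of Name})(\text{Qualification})(\text{Profession})(\text{Salutation})(\text{Designation})] \tag{2}$$

$$[\text{Location} = (\text{House Number})(\text{House Name})(\text{Road})(\text{Area})(\text{LandMark})(\text{PBNo})(\text{Firm})] \tag{3}$$

$$[\text{Place} = (\text{Post})(\text{Taluk})(\text{District})(\text{State})(\text{Place})(\text{PIN})(\text{Via})] \tag{4}$$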

A typical postal address and its representation as a symbolic object are given in Table 1.

The symbolic postal address object is processed using an expert system designed around the soft computing methodology, to map it to one or more probable mail delivery points.
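The hoard/assertion structure of Eqs. (1)–(4) can be illustrated with a small data structure. The following is only a minimal sketch: the class names, the `add` method and the field identifiers written without spaces are assumptions made for this illustration and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Feature variables (fields) of each assertion object, following Eqs. (2)-(4).
# The identifiers without spaces are an assumption of this sketch.
ADDRESSEE_FIELDS = ("AddresseeName", "CareOfName", "Qualification",
                    "Profession", "Salutation", "Designation")
LOCATION_FIELDS = ("HouseNumber", "HouseName", "Road", "Area",
                   "LandMark", "PBNo", "Firm")
PLACE_FIELDS = ("Post", "Taluk", "District", "State", "Place", "PIN", "Via")


@dataclass
class AssertionObject:
    """A conjunction of events; each event links a feature to its value(s)."""
    allowed: Tuple[str, ...]
    events: Dict[str, List[str]] = field(default_factory=dict)

    def add(self, feature: str, value: str) -> None:
        if feature not in self.allowed:
            raise KeyError(f"unknown feature: {feature}")
        # A feature may carry several values (e.g. two landmarks).
        self.events.setdefault(feature, []).append(value)


@dataclass
class PostalAddressObject:
    """Hoard object made of three assertion objects, as in Eq. (1)."""
    addressee: AssertionObject = field(
        default_factory=lambda: AssertionObject(ADDRESSEE_FIELDS))
    location: AssertionObject = field(
        default_factory=lambda: AssertionObject(LOCATION_FIELDS))
    place: AssertionObject = field(
        default_factory=lambda: AssertionObject(PLACE_FIELDS))


# The address of Table 1; fields absent from the address are simply not set.
addr = PostalAddressObject()
addr.addressee.add("Salutation", "Shri")
addr.addressee.add("AddresseeName", "MMPatil")
addr.location.add("Area", "LaxmiExtension")
addr.place.add("Place", "Gokak")
addr.place.add("PIN", "591307")
addr.place.add("District", "Belgaum")
addr.place.add("State", "Karnataka")
```

Storing each feature as a list of values is one way to accommodate both missing fields and repeated fields, which the text identifies as typical of such addresses.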

3. Expert system for mapping incomplete/approximate postal addresses

The mapping of postal addresses to mail delivery points requires imitation of the operation of the skilled postal mail sorter/postman. This entails an expert system for mapping the postal address to a mail delivery point. The expert system for this task should comprise a knowledge base and an inference engine for simulating the human mail sorting expert. The task of mapping the postal address to the mail delivery point can be divided into two sub-tasks. The first is identification of address components, also called component labeling or component tagging; it involves assigning a label to each address component. The second involves using the identified components for mapping the address to a mail delivery point. An expert system is devised for the purpose, with two components of the knowledge base and two components for inference. The proposed expert system is depicted by the block schematic of Fig. 2.

The first component of the knowledge base and the soft computing inference mechanism together identify the probable components and assign probable labels to them, generating the postal address object in the process. The soft computing inference is done at the assertion object level, and the results are then grouped into the postal address object (hoard object). The inference mechanism uses symbolic analysis for labeling the address components, using the similarity measure defined in Section 5.1 as a fuzzy membership value and the fuzzy alpha cut set for assigning a confidence value to the decision. Once the postal address components are generated, they are subjected to a conflict resolution process if the confidence in component identification is less than 50%. The expert system then maps the input postal hoard object to a mail delivery point. The similarity measure defined in Section 5.2 is used to find the fuzzy distance of the input address to the different possible mail delivery points. The fuzzy alpha cut set method is further employed for de-fuzzification and evaluation of the confidence value in the mail delivery point identified.

4. Symbolic knowledge bases

The processing of a postal address using the symbolic approach requires appropriately structured symbolic knowledge bases. Section 4.1 describes the symbolic knowledge base employed for postal address component labeling and Section 4.2 describes the knowledge base used for mail delivery point mapping.

4.1. Knowledge base for postal address component labeling

Fig. 2. Expert system for mail delivery point generation.

Fig. 3. Structure of symbolic address component knowledge base.

The symbolic knowledge base employed for postal address component labeling is devised based on the frame structured knowledge base representation described in [24] and on a study of a large number of postal addresses. The symbolic knowledge base used in this work enables a systematic approach to address component labeling. The symbolic knowledge base used for address component labeling, AD_COMP_KB, is organized as a synthetic object of three hoard objects, namely the Addressee Knowledge base (Addresseekb), the Location Knowledge base (Locationkb) and the Place Knowledge base (Placekb), as given in (5).

$$\text{AD\_COMP\_KB} = \{[\text{Addresseekb}], [\text{Locationkb}], [\text{Placekb}]\} \tag{5}$$

The hoard objects are made of assertion objects as detailed in Fig. 3. All the assertion objects of the symbolic knowledge base have the events described in Fig. 4. Each assertion object of the synthetic object AD_COMP_KB may or may not have values for each of the events. The knowledge base is populated with the values extracted and summarized by observing a large number of postal addresses.

Fig. 4. Events associated with the assertion object.

4.2. Knowledge base for mail delivery point identification

The postal address components, in the form of a postal address object, are compared with the knowledge base of a place (PL_MDP_KB) to identify the mail delivery point represented by the address. The knowledge base has been designed by studying a large number of postal addresses and the variants of the postal addresses used when the destination address is written. The symbolic knowledge base is organized as a synthetic object, PL_MDP_KB, consisting of three hoard objects, namely the mail delivery point addressee knowledge base (mdp_addresseeKB), the mail delivery point location knowledge base (mdp_locationKB), and the mail delivery point place knowledge base (mdp_placeKB), as given by (6):

$$\text{PL\_MDP\_KB} = \{[\text{mdp\_addresseeKB}], [\text{mdp\_locationKB}], [\text{mdp\_placeKB}]\} \tag{6}$$

Fig. 5. Structure of symbolic mail delivery point knowledge base.

The assertion objects contained in each of the hoard objects are depicted in Fig. 5. One or more values for each assertion object are stored in the knowledge base. The first hoard object, mdp_addresseeKB, can have the same addressee name with different Id Nos. This is useful in the scenario where the same person has different addresses or different persons have the same addressee information. Similarly, different addressee names with the same Id No may also be present, to represent the scenario where different persons stay at the same delivery point. Hence there is a many-to-many mapping between Id No and the other assertion objects of mdp_addresseeKB, giving multiple entries for each Id No. The hoard object mdp_locationKB has only one entry for each mail delivery point, i.e. DEL-PT-ID, giving the various descriptions of the mail delivery point. The other hoard object, mdp_placeKB, gives the various descriptions of the place where the mail is to be distributed. This organization of the symbolic knowledge base helps in symbolic analysis of the input postal address components. The symbolic knowledge base is used to map the input postal address to a probable mail delivery point.
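The many-to-many relation between Id No and addressee entries, against the one-entry-per-delivery-point layout of mdp_locationKB, can be pictured with plain Python containers. This is a hedged sketch only: the rows, Id Nos and descriptions below are invented for illustration, and the actual layout of the knowledge base (Fig. 5) is not reproduced here.

```python
from collections import defaultdict

# Hypothetical addressee rows: (Id No, addressee name).
# The same name may appear with several Id Nos, and several names may
# share one Id No.
mdp_addresseeKB = [
    (101, "M M Patil"),
    (102, "M M Patil"),   # same addressee information, second delivery point
    (101, "S M Patil"),   # another person at delivery point 101
]

# Hypothetical location rows: one entry per delivery point, holding the
# alternative descriptions of that point.
mdp_locationKB = {
    101: ["Laxmi Extension", "Lakshmi Extension"],
    102: ["Market Road"],
}

# Index the addressee rows in both directions.
ids_by_name = defaultdict(set)
names_by_id = defaultdict(set)
for id_no, name in mdp_addresseeKB:
    ids_by_name[name].add(id_no)
    names_by_id[id_no].add(name)
```

With such indexes, an input addressee can be resolved to every candidate delivery point, and each candidate can then be checked against its single location entry.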

5. Symbolic similarity measures

The symbolic data analysis approach for address component labeling and mail delivery point mapping requires distance/similarity measures to map the input to probable labels/mail delivery points. Widely used symbolic similarity measures, which give numeric values quantifying the similarity, are described in [21,25]; these include similarity/distance measures for interval type data, absolute value/ratio type data etc. The similarity measure described in [25] is made up of three components, namely similarity due to position, similarity due to content and similarity due to span of the two objects being compared. The position similarity is defined only for interval type data and describes the distance of one object to the initial position of the other object. The span similarity is defined for both interval and absolute type data and describes the range/fraction of similarity between the objects. The content similarity describes the similarity between the contents of the two objects. The similarity measures defined in [25] have been used for clustering, classification etc. and have been tested on fat oil and iris data. Most of the symbolic objects experimented with consist of numeric values, whereas the postal address to be processed in the proposed work is mostly descriptive. Hence the available similarity measures cannot be directly adapted to this work and need appropriate tailoring to use the descriptive information and compare two objects for similarity. In this work we present two similarity measures, one for address component labeling and the other for mail delivery point mapping.

5.1. Symbolic similarity measure for address component labeling

The classification of a postal address into labeled components for use in identification of mail delivery points requires the application of text categorization techniques. There are many machine learning approaches to text categorization. The problem of address component labeling is non-trivial, and the component label should be ascertained from the information present in the component only. The presence of some key words and their occurrence relative to the other components helps in identifying the component labels. In this work a symbolic similarity measure is defined to label the address component by finding the similarity between the input component and the probable component labels.

The similarity measure gives the similarity of the input component with the various component labels (assertion objects) of the symbolic synthetic object AD_COMP_KB. The similarity measure between the ith input component (IP_i) and the jth component label (ct_j) of the knowledge base is found using (7):

$$S(IP_i, ct_j) = \frac{1}{EV} \times \sum_{k=1}^{EV} netsim_k \quad \text{for } 1 \le i \le n \text{ and } 1 \le j \le m \tag{7}$$

where n is the number of available components in the input address, m is the number of possible component labels or assertion objects in the knowledge base, and EV takes a value of 7, representing the seven events of the assertion objects.

The values of netsim_k are calculated for each event of the assertion object using (8) for the first five events, to calculate content similarity, and (9) for the last two events, to calculate span and content similarity:

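$$netsim_k = wf_k \times \frac{Interse}{Sum\_IP\_KB} \quad \text{for } 1 \le k \le 5 \tag{8}$$

$$netsim_k = wf_k \times \left( \frac{Interse}{Sum\_IP\_KB} + \frac{Comp\_IP + Comp\_KB}{2 \times Sum\_IP\_KB} \right) \quad \text{for } 6 \le k \le 7 \tag{9}$$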

where Interse is the number of words/elements common to the input component and the component label under test, Comp_IP is the number of words/elements in the input component, Comp_KB is the number of words/elements in the component label (knowledge base) under test, and

$$Sum\_IP\_KB = Comp\_IP + Comp\_KB - Interse$$

The weight factors wf_k are predefined for every component label, and the values are assigned based on the relative importance of the events in different labels (assertion objects).

The similarity measure is computed for each available input component with all the possible component labels. This similarity measure is the fuzzy membership function of the input component in the component label class. The actual decision on the label class to which the input component belongs is made using the de-fuzzification technique described in Section 6.1.
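Eqs. (7)–(9) can be sketched in a few lines of Python. This is an illustrative reading rather than the authors' implementation: each event is assumed to be held as a set of words, the weight values in the example are arbitrary, and the grouping of terms in Eq. (9) (the weight multiplying the sum of content and span similarity) is an assumption.

```python
from typing import Sequence, Set

EV = 7  # number of events per assertion object, as stated in the text


def netsim(k: int, wf: float, ip: Set[str], kb: Set[str]) -> float:
    """Eq. (8) for events 1-5 (content only); Eq. (9) for events 6-7."""
    interse = len(ip & kb)                    # words common to both
    sum_ip_kb = len(ip) + len(kb) - interse   # Sum_IP_KB
    if sum_ip_kb == 0:                        # event empty on both sides
        return 0.0
    content = interse / sum_ip_kb
    if k <= 5:
        return wf * content
    span = (len(ip) + len(kb)) / (2 * sum_ip_kb)
    return wf * (content + span)              # assumed grouping of Eq. (9)


def similarity(ip_events: Sequence[Set[str]],
               kb_events: Sequence[Set[str]],
               weights: Sequence[float]) -> float:
    """Eq. (7): the mean of netsim_k over the EV events."""
    total = sum(netsim(k, weights[k - 1], ip_events[k - 1], kb_events[k - 1])
                for k in range(1, EV + 1))
    return total / EV


# Example with made-up word sets and unit weights.
ip_words = {"laxmi", "extension"}
kb_words = {"extension", "nagar", "colony"}
content_only = netsim(1, 1.0, ip_words, kb_words)       # 1 / 4
content_and_span = netsim(6, 1.0, ip_words, kb_words)   # 1/4 + 5/8
```

In the example, one common word out of a union of four gives a content similarity of 0.25, and the span term adds (2 + 3) / (2 × 4) = 0.625 for events 6 and 7.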

5.2. Symbolic similarity measure for mapping postal address to mail delivery point

The postal address components, organized as a postal address object, are matched against the symbolic knowledge base to find their similarity with the mail delivery points in the knowledge base of the place. The nearness of the input address to a probable mail delivery point is quantified by the similarity measure described in this section. The similarity measure is treated as the fuzzy membership value for that input address with respect to the mail delivery point.

Let the Input Symbolic Postal Address Object be represented by I. The symbolic object has three assertion object components, represented as I_A, I_L and I_P, as depicted in (10):

$$I = \{I_A, I_L, I_P\} \tag{10}$$

The assertion object I_P is used to verify whether the input mail address belongs to the place where mail sorting is done, whereas I_A and I_L are used to generate the similarity measure. Let each record in the addressee knowledge base be represented by Ak_i and each record in the location knowledge base be represented by Lk_j, for all 1 ≤ i ≤ m and all 1 ≤ j ≤ n, where m is the number of entries in the addressee knowledge base and n is the number of entries in the location knowledge base.

The similarity measure that gives the similarity of the input address to a mail delivery point consists of two similarity components: one that measures the similarity of the addressee information in the input to that in the knowledge base, and a second that measures the similarity of the location information in the input to that in the knowledge base. The component that measures the addressee similarity gives the nearness of the input addressee object to the various addressee entries in the symbolic hoard object mdp_addresseeKB. The similarity measure between the input addressee information and an addressee object of the knowledge base is found as given in (11):

$$S^{A}_{I_A,\,Ak_i} = \frac{1}{EC} \sum_{p=1}^{EC} netsim_p \tag{11}$$

where EC is the number of components of the postal addressee object existing in the input.

The value of netsim_p (net similarity measure) for every component is calculated as in (12) with respect to the addressee information in mdp_addresseeKB:

$$netsim_p = \frac{scanSim_p + contentSim_p}{2.00} \tag{12}$$

The parameters scanSim_p and contentSim_p represent the span and content similarity measures for the mail delivery point mapping. They are calculated as in (13) and (14):

$$scanSim = \frac{Comp\_PO + Comp\_KB}{2 \times Sum\_PO\_KB} \tag{13}$$

where Comp_PO is the number of words in the component, Comp_KB is the number of words in the knowledge base entry Ak_i, and Sum_PO_KB = Comp_PO + Comp_KB − Interse, where Interse = COM × MF and COM is the number of words common to I_A and Ak_i.

$$MF = 1$$

when all words of I_A and Ak_i match,

$$MF = \frac{k}{n}$$

when k words of the n possible words of I_A match,

$$MF = \frac{0.25 \times k_1 + k}{n}$$

when k_1 first character matches are used and k word matches are present (the first character matches are assigned a weight of 0.25),

$$MF = \frac{(2 \times fl + oc) \times k_1 + k}{n}$$

where

$$fl = \frac{2}{\text{length of input word}}$$

and

$$oc = \frac{1 - 2 \times fl}{\text{length of input word}}$$

Here k_1 is the number of partial word matches and k is the number of full word matches, $\frac{2 \times fl}{n}$ is the weight assigned to the first and the last character matches, and $\frac{oc}{n}$ is the weight assigned to the middle character match.

$$contentSim = \frac{Interse}{Sum\_PO\_KB} \tag{14}$$
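The span and content similarities of Eqs. (12)–(14), together with the first three forms of MF, can be sketched as follows. The sketch rests on several assumptions that the text does not settle: words are compared case-insensitively, a first character match is counted when an unmatched input word shares its first letter with any knowledge base word, and the partial-match form of MF (using fl and oc) is left out. The function names are hypothetical.

```python
from typing import List

FIRST_CHAR_WEIGHT = 0.25  # weight of a first character match, from the text


def match_factor(ip_words: List[str], kb_words: List[str]) -> float:
    """MF = (0.25 * k1 + k) / n; this is k / n when k1 == 0 and 1 when
    every input word matches a knowledge base word."""
    n = len(ip_words)
    if n == 0:
        return 0.0
    kb = {w.lower() for w in kb_words if w}
    initials = {w[0] for w in kb}
    k = k1 = 0
    for word in (w.lower() for w in ip_words):
        if word in kb:
            k += 1                      # full word match
        elif word and word[0] in initials:
            k1 += 1                     # first character match (assumed rule)
    return (FIRST_CHAR_WEIGHT * k1 + k) / n


def netsim_p(ip_words: List[str], kb_words: List[str]) -> float:
    """Eq. (12): mean of span (Eq. 13) and content (Eq. 14) similarity."""
    comp_po, comp_kb = len(ip_words), len(kb_words)
    com = len({w.lower() for w in ip_words} & {w.lower() for w in kb_words})
    interse = com * match_factor(ip_words, kb_words)   # Interse = COM * MF
    sum_po_kb = comp_po + comp_kb - interse
    if sum_po_kb == 0:                                 # both sides empty
        return 0.0
    scan_sim = (comp_po + comp_kb) / (2 * sum_po_kb)   # Eq. (13)
    content_sim = interse / sum_po_kb                  # Eq. (14)
    return (scan_sim + content_sim) / 2.0


# Initials in the input against full names in a made-up knowledge base entry.
mf = match_factor(["M", "M", "Patil"], ["Mahesh", "Mallappa", "Patil"])
```

In the example, two initials count as first character matches and one word matches fully, so MF = (0.25 × 2 + 1) / 3 = 0.5.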

Accordingly, the similarity measure between the input location information and a location object of the knowledge base is found as given in (15):

$$S^{L}_{I_L,\,Lk_j} = \frac{1}{EC} \sum_{p=1}^{EC} netsim_p \tag{15}$$

where EC is the number of components of the postal location object existing in the input. The netsim_p values are calculated as for the addressee object.

The similarity measure for a given mail delivery point, including addressee and location information, is then calculated as given in Eq. (16):

$$S_{I,KB} = \frac{S^{A}_{I_A,\,Ak_i} + S^{L}_{I_L,\,Lk_j}}{2.0} \quad \text{where } IDNO(Ak_i) = IDNO(Lk_j) \tag{16}$$

The similarity measure so obtained is the fuzzy membership value for the input address with respect to the different probable mail delivery points. The value is de-fuzzified, and a crisp output for the probable mail delivery point, with an evaluated confidence in the decision, is obtained as described in Section 6.2.
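The pairing condition IDNO(Ak_i) = IDNO(Lk_j) in Eq. (16) amounts to a join on the Id No. The sketch below assumes that the addressee and location similarities have already been computed; the function name `combine`, the input layout, and the choice to keep the best addressee entry per delivery point are assumptions of this illustration, not statements about the original system.

```python
from typing import Dict, List, Tuple


def combine(addressee_sims: List[Tuple[int, float]],
            location_sims: Dict[int, float]) -> Dict[int, float]:
    """Eq. (16): average S^A and S^L for entries sharing an Id No.

    addressee_sims holds (Id No, S^A) for every addressee entry; several
    entries may share an Id No, so the highest combined value is kept for
    each delivery point (an assumption of this sketch).
    """
    best: Dict[int, float] = {}
    for id_no, s_a in addressee_sims:
        if id_no not in location_sims:
            continue                     # no location entry to pair with
        s = (s_a + location_sims[id_no]) / 2.0
        best[id_no] = max(best.get(id_no, 0.0), s)
    return best


# Made-up similarity values for two delivery points.
scores = combine([(101, 0.75), (101, 0.5), (102, 0.5)],
                 {101: 0.25, 102: 1.0})
```

The resulting dictionary maps each candidate delivery point to its fuzzy membership value, which is the input expected by the de-fuzzification step.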

6. Fuzzy decision and confidence value

The symbolic measures defined in Section 5 are used to resolve the address component label and the probable mail delivery point/s. The similarity values obtained give a quantitative measure of the similarity of the postal address components/postal address with those in the knowledge base. As similarity values are available for more than one probable candidate, there is a need to decide on the component label/address. In this work the symbolic similarity measures, computed as described in Section 5, are treated as fuzzy membership functions, which are de-fuzzified to remove the ambiguity in deciding upon the component labels/postal addresses. The de-fuzzification is carried out using the alpha cut methodology [26]. Section 6.1 describes the methodology adopted for de-fuzzification and identification of address component labels, and Section 6.2 presents the methodology employed for de-fuzzification and determination of the mail delivery point corresponding to the input postal address.

6.1. Address component labeling

The methodology for labeling the address components involves separating the probable address components (components in separate lines or separated by commas) and extracting the required features. These features are stored in a newly devised data structure called Postal Address Information Structure (PDIS). The structure of PDIS is given in Fig. 6.
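As an illustration of the separation step only, the sketch below splits a raw address on line breaks and commas. The function name is hypothetical. The feature extraction and the PDIS fields of Fig. 6 are not reproduced, and the further separation of a salutation such as "Mr" from the addressee name, visible in Table 2, is not attempted.

```python
import re


def split_components(address):
    """Split a raw address into probable components on commas and line breaks."""
    parts = re.split(r"[,\n]+", address)
    # Drop empty fragments and surrounding whitespace.
    return [part.strip() for part in parts if part.strip()]


raw = "Mr. Bhosale Chandra,\nNear Daddennavar Hospital,\nExtension Area, Bagalkot, 587101"
for component in split_components(raw):
    print(component)
```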

Fig. 6. Postal address information structure.

Fig. 7. The similarity array sorted in descending order (S0 > S1 > S2 > ...).

The PDIS is then employed to find the similarity measure with all the component labels (assertion objects of AD_COMP_KB). After the symbolic similarity measure is calculated for the various component labels for an input component using Eq. (7), the component labels are arranged in decreasing order of similarity. This list gives the fuzzy membership of the input component in the possible component labels, as depicted in the similarity array of Fig. 7. To decide which component label the input component belongs to, a de-fuzzification process is taken up. The de-fuzzification is done by defining the fuzzy alpha cut set. The α value is calculated using Eq. (17):

$$\alpha = S_0 - DFC \times S_0 \qquad (17)$$

where $S_0$ is the maximum similarity value obtained for the input component, and DFC is the de-fuzzification constant, taken as 0.1 based on experimentation with postal address components.

The alpha cut set is obtained from the similarity array by taking into the cut set all the members of the similarity array whose value is greater than α. This is depicted pictorially in Fig. 8. The alpha cut set is used to identify the component label, with an assigned confidence value for the decision. If the alpha cut set has only one member, then the component label ct0 (corresponding to I0 and S0) is assigned to the input component with a confidence measure of 100.

If the alpha cut set has more than one component label (as depicted in Fig. 8), then the probable component labels are output in decreasing order of confidence. The confidence of the system in a probable component label is evaluated using Eq. (18):

$$C_{i,j} = \frac{S_j}{\sum_{k=1}^{p} S_k} \times 100 \quad \text{for } 1 \le j \le p \text{ and } 1 \le i \le n \qquad (18)$$

where $C_{i,j}$ is the confidence of assigning the jth component label to the ith input component, n is the number of input components, p is the number of component labels in the alpha cut set, $S_j$ is the similarity of the ith input component with the jth component label in the similarity array, and $S_k$ is the similarity of the ith input component with the kth component label in the similarity array.
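The de-fuzzification of Eqs. (17) and (18) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation, and the function name and dictionary layout are assumptions. The sample similarity values are the ones listed in Table 2 for "Moonlight Bar" and "Mr".

```python
DFC = 0.1  # de-fuzzification constant used in the paper


def alpha_cut_confidence(similarities, dfc=DFC):
    """Return (label, confidence) pairs for the labels in the alpha cut set.

    `similarities` maps each candidate label to its similarity value.
    """
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return []
    s0 = ranked[0][1]
    alpha = s0 - dfc * s0                                   # Eq. (17)
    cut = [(label, s) for label, s in ranked if s > alpha]  # alpha cut set
    total = sum(s for _, s in cut)
    return [(label, 100.0 * (s / total)) for label, s in cut]  # Eq. (18)


# Two labels fall inside the cut set, so the confidence is shared between them.
print(alpha_cut_confidence({"Landmark": 0.93, "Postbox": 0.86}))
# Only one label exceeds alpha, so it receives a confidence of 100.
print(alpha_cut_confidence({"Salutation": 0.228, "Addressee": 0.1}))
```

For the first call α = 0.837, so both labels enter the cut set and "Landmark" receives a confidence of about 52, which matches the value reported in Table 2.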

Fig. 8. De-fuzzification process and the alpha cut set.


6.2. Mail delivery point identification

The labeled postal address components are organized as a postal object, which is used for mapping the address to the mail delivery point. It is possible that one particular input address may map to more than one mail delivery point, and that more than one address may map to one delivery point, as brought out in the experimental analysis. In such a situation the system has to provide the confidence of mapping to each mail delivery point, which is carried out by de-fuzzification using the alpha cut set methodology. This task is not trivial, as it requires disambiguating the similarity measure to map to one or more mail delivery points after computing the confidence in each of the delivery points identified.

After the symbolic similarity measure is computed for the input postal object with all the possible mail delivery points, the mail delivery points are again arranged in decreasing order of similarity. For identifying the mail delivery point, de-fuzzification is carried out using the alpha cut method. The α value is again calculated as described previously for address component labeling. The de-fuzzification and confidence value calculations are done in a manner similar to that explained in Section 6.1. If the number of addresses in the alpha cut set is more than 4, then historical evidence of similar postal addresses is used to make the decision as described in Algorithm 2. The complete process of mapping the postal address (object) to the mail delivery point is presented in Algorithm 2.
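A minimal sketch of this decision step is given below, under stated assumptions. The function name, the returned dictionary and the `needs_history` flag are hypothetical. The historical-evidence step is only flagged, not implemented, because its details belong to Algorithm 2. The first three similarity values are those of case 6 in Table 5, and the fourth is a made-up low-scoring candidate.

```python
def map_to_delivery_points(similarities, dfc=0.1, max_candidates=4):
    """Alpha cut over mail delivery point similarities, with confidences."""
    ranked = sorted(similarities.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return {"candidates": [], "needs_history": False}
    s0 = ranked[0][1]
    alpha = s0 - dfc * s0                              # Eq. (17)
    cut = [(mdp, s) for mdp, s in ranked if s > alpha]
    total = sum(s for _, s in cut)
    candidates = [(mdp, round(100.0 * (s / total), 2)) for mdp, s in cut]  # Eq. (18)
    # More than four candidates: defer to historical evidence (not implemented here).
    return {"candidates": candidates, "needs_history": len(cut) > max_candidates}


result = map_to_delivery_points(
    {"000028": 0.566, "000003": 0.566, "000006": 0.55, "000099": 0.2}
)
print(result)
```

In this sketch the three delivery points above α = 0.5094 share the confidence almost equally, while the hypothetical fourth candidate is excluded from the cut set.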

7. Experimental results and discussion

The soft computing model for mapping postal addresses to mail delivery points is thoroughly tested using exhaustive knowledge bases. The experimental results for address component labeling are described in Section 7.1 and the results for mail delivery point mapping are presented in Section 7.2.

7.1. Address component labeling

The soft computing methodology for address component labeling is tested on various types of addresses and the results are encouraging. Table 2 summarizes the output of the system for typical input addresses and lists the highest two similarity values generated with respect to input components and the corresponding identified labels. The overall results are given in Table 3 and are depicted in Fig. 9. The total efficiency of the system is about 94% and can be increased by enhancing the symbolic knowledge base. The developed system is robust enough for use in practical situations. The component-wise address identification efficiency is listed in Table 4, and an average component-wise identification/labeling efficiency of about 95.8% is obtained. The results show that almost all the address components are correctly identified. The components with lower identification efficiency are house name, house number and road, because of the various ways in which this information can be provided.

Table 2. Result of address component identification for a typical input.

Input address 1: Mr. Bhosale Chandra, Near Daddennavar Hospital, Extension Area, Bagalkot, 587101

| Component | Similarity measure with label (highest) | Similarity measure with label (second highest) | Alpha cut set | Assigned label | Confidence of decision |
|---|---|---|---|---|---|
| Mr | 0.228, Salutation | 0.1, Addressee | {Salutation} | Salutation | 100 |
| Bhosale Chandra | 0.148, Addressee | 0.1, Care of Name | {Addressee} | Addressee | 100 |
| Near Daddennavar Hospital | 0.228, Landmark | 0.1, Care of Name | {Land Mark} | Land Mark | 100 |
| Extension Area | 0.278, Area | 0.093, Landmark | {Area} | Area | 100 |
| Bagalkot | 0.228, Place | 0.114, PIN | {Place} | Place | 100 |
| 587101 | 0.114, PIN | 0.1, State | {PIN} | PIN | 100 |

Input address 2: Shri, S K Deshpande, "Padmakunja", 15th Cross, Moonlight Bar, Vidyagiri, Bagalkot, 587102

| Component | Similarity measure with label (highest) | Similarity measure with label (second highest) | Alpha cut set | Assigned label | Confidence of decision |
|---|---|---|---|---|---|
| Shri | 0.228, Salutation | 0.1, Addressee | {Salutation} | Salutation | 100 |
| S K Deshpande | 0.123, Addressee | 0.1, Care of Name | {Addressee} | Addressee | 100 |
| Padmakunja | 0.186, House Name | 0.1, Care of Name | {House Name} | House Name | 100 |
| 15th Cross | 0.119, Road Number | 0.1, Care of Name | {Road Number} | Road Number | 100 |
| Moonlight Bar | 0.93, Landmark | 0.86, Postbox | {Landmark, PostBox} | Landmark | 52 |
| Vidyagiri | 0.186, Areaname | 0.1, State | {Areaname} | Areaname | 100 |
| Bagalkot | 0.2, Place | 0.1, State | {State} | State | 100 |
| 587102 | 0.126, Pincode | 0.107, Post | {Pincode} | PIN | 100 |

Table 3. Overall results of address component identification.

| Sl No. | Particulars | Confidence of component labeling: all 100% | Confidence of component labeling: >75% and <100% | Confidence of component labeling: <75% | Percentage of total addresses (=500) |
|---|---|---|---|---|---|
| 1 | Correctly labeled addresses | 399 | 71 | 0 | 94 |
| 2 | Addresses with one incorrectly labeled component | 18 | 2 | 3 | 4.6 |
| 3 | Addresses with two or more incorrectly labeled components | 5 | 1 | 1 | 1.4 |

Table 4. Component-wise address component identification.

| Sl No. | Address component | Number of correctly recognized components | Number of incorrectly recognized components | Percentage identification efficiency |
|---|---|---|---|---|
| 1 | Salutation | 300 | 0 | 100.00 |
| 2 | Addressee | 500 | 0 | 100.00 |
| 3 | Care of Name | 280 | 0 | 100.00 |
| 4 | Qualification | 56 | 4 | 93.33 |
| 5 | Profession | 84 | 6 | 93.33 |
| 6 | House Number | 71 | 9 | 88.75 |
| 7 | House Name | 54 | 6 | 90.00 |
| 8 | Road | 74 | 6 | 92.50 |
| 9 | Area | 430 | 25 | 94.51 |
| 10 | Post | 72 | 0 | 100.00 |
| 11 | Taluk | 75 | 5 | 93.75 |
| 12 | District | 71 | 7 | 91.03 |
| 13 | State | 93 | 6 | 93.94 |
| 14 | Place | 481 | 19 | 96.20 |
| 15 | PIN | 452 | 12 | 97.41 |
| 16 | VIA | 37 | 0 | 100.00 |
| 17 | Landmark | 415 | 18 | 95.84 |
| 18 | Organization/Firm | 110 | 5 | 95.65 |
| 19 | Designation | 75 | 0 | 100.00 |
| 20 | Post Box Number | 65 | 0 | 100.00 |

Average component-wise labeling/identification efficiency: 95.812

7.2. Mail delivery point mapping

The soft computing methodology for Mail Delivery Point (MDP) identification, taking the postal address object as input, was tested using a variety of input addresses. The knowledge base is populated with addressee information of 500 persons and mail delivery point information for 300 delivery points. The testing was carried out considering the various issues that may occur in practice. The output of the delivery point mapping system for such issues imitates the human expert. A few typical cases are listed and elaborated in Table 5. The output indicated in the table does not show the addressee name, which is as given in the input.

The outputs of the system described in Table 5 bring out the power of the soft computing model employed for mapping incomplete/approximate postal addresses to mail delivery points. The table indicates that even though the similarity to an address is only 0.5, the confidence of the system in the identification is 100%, as in case 2. In case 6, although the similarity is 0.566, the confidence is only 33.65%. Both outcomes replicate human expert behavior. For the ninth entry the similarity is only 0.343, but the confidence in the decision is still 100%. The soft computing model for the expert system has thus been successful in imitating human behavior.
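As a worked check of case 6, Eqs. (17) and (18) can be applied to the three similarity values listed in Table 5, assuming that no other delivery point scored above α:

$$\alpha = 0.566 - 0.1 \times 0.566 = 0.5094$$

$$C_{(i)} = C_{(ii)} = \frac{0.566}{0.566 + 0.566 + 0.55} \times 100 \approx 33.65, \qquad C_{(iii)} = \frac{0.55}{1.682} \times 100 \approx 32.70$$

All three candidates exceed α, so the confidence is split among them. This agrees with the 33.65% listed for outputs (i) and (ii) and is close to the 32.6% listed for output (iii). In case 2, by contrast, a single candidate in the alpha cut set receives a confidence of 100% even though its similarity is only 0.5.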

The overall results of the testing using addresses of different nature are listed in Table 6. The system could unambiguously decide on the mail delivery point for about 86% of addresses (corresponding to Sl Nos. 1, 2 and 3); for the remaining addresses either manual resolution was necessary or the addresses were not deliverable. The results are pictorially depicted in Fig. 10.
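The 86% figure follows directly from the first three rows of Table 6, which together account for 301 of the 350 test addresses:

$$\frac{255 + 33 + 13}{350} \times 100 = \frac{301}{350} \times 100 = 86.0\%$$

The remaining 49 addresses (38 needing manual resolution and 11 needing verification) make up the other 14%.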

Fig. 9. Address component labeling using soft computing.

Table 5. The results of the system dealing with various issues.

| Sl No. | Input address | Output | Similarity Measure (SM) / Percentage Confidence (PC) / MDP | Explanation |
|---|---|---|---|---|
| 1 | Mr Shanmukhappa A Angadi, H No 8, 12th Main Road, Near Basava vana, Vidyagiri, Bagalkot | H No. 8, 12th Main Road, Near Basava Vana, Vidyagiri, Bagalkot 587102, Karnataka State | SM = 0.825, PC = 100%, MDP = 000003 | The expected complete address was given as input |
| 2 | Prof S A Angadi, H No 8, 12th Main Road, Near Gangoor Hospital, Bagalkot-587102 | | SM = 0.5, PC = 100%, MDP = 000003 | The input address did not contain the area name but the system could still identify the mail delivery point with 100% confidence |
| 3 | Kumar Manoj S Biradar, c/o S A Angadi, H No 8, 12th Main Road, Vidyagiri, Bagalkot | | SM = 0.626, PC = 100%, MDP = 000003 | The addressee name is not present in the knowledge base and PIN is not present but c/o information is used here |
| 4 | Mr S A Angadi, c/o M A Angadi, "Gurukrupa", Near Old IB, Extension, Bagalkot-587101 | "Gurukrupa", Near Old IB, Extension, Bagalkot 587101, Karnataka State | SM = 0.687, PC = 100%, MDP = 000029 | The addressee has more than one location information, still the correct address was generated |
| 5 | Mr Suresh Basappa Angadi, "Shivakrupa", H No 10, 12th Main Road, Vidyagiri, Bagalkot-587102 | H No 1, "Shiva Krupa", 12th Main Road, | SM = 0.7, PC = 100%, MDP = 000028 | The input address contained wrong house number and was corrected and the decision was correct |
| 6 | Mr Suresh Appasaheb Angadi, "Shivakrupa", H No 10, 12th Main Road, Vidyagiri, Bagalkot-587102 | This gave three probable outputs: (i) H No 1, "Shivakrupa", 12th Main Road, Vidyagiri, | (i) SM = 0.566, PC = 33.65%, MDP = 000028 | (i) The addressee name matches partially and house number for that match is wrong |
| | | (ii) H No 8, 12th Main Road, Near Basava Vana, | (ii) SM = 0.566, PC = 33.65%, MDP = 000003 | (ii) The addressee information matches partially and road number and area information match in the location part |
| | | (iii) H No 10, 12th Main Road, Vidyagiri, Bagalkot-587102, Karnataka State | (iii) SM = 0.55, PC = 32.6%, MDP = 000006 | (iii) There is a larger match in location information though there is very little addressee match. This behavior of the system generally replicates the behavior of the postman when such is the input information |
| 7 | Ms Chandrakala Manojarao Patil, Sales Executive, Jorapur Chawl, Hospeth, Bagalkot-587101 | 416/a, Darbar Chawl, Venkatpeth, | SM = 0.5595, PC = 100%, MDP = 000037 | The addressee information matches but the location information is completely wrong and there is no other person with same name, hence the system has 100% confidence in its decision |
| 8 | Manjula S Chinchi, C-36, Colony, Bagalkot 587101 | C-36, Housing Colony, Bagalkot 587101, Karnataka State | SM = 0.594, PC = 100%, MDP = 000035 | The addressee name is wrongly spelled and location information is also not complete |
| 9 | Mr Goshal G Mujumdar, H No 31, Bagalkot 587101 | H No 31, 16th Main Road, Vidyagiri, | SM = 0.343, PC = 100%, MDP = 000054 | The address has only house number and addressee information and PIN is wrong, still the system could identify the mail delivery point correctly |

Table 6. Test results of Mail Delivery Point (MDP) mapping.

| Sl No. | Particulars | Number of addresses (a) | Percentage of total addresses |
|---|---|---|---|
| 1 | Addresses mapping to single MDP with a confidence of 100% in decision | 255 | 72.86 |
| 2 | Addresses mapping to single MDP with a confidence greater than 75% in decision | 33 | 9.43 |
| 3 | Addresses mapping to single MDP with a confidence less than 75% in decision | 13 | 3.71 |
| 4 | Addresses mapping to multiple MDPs; one MDP is mapped with a percentage confidence greater than 50 and others less than 50 (manual resolution is needed) | 38 | 10.86 |
| 5 | Addresses mapping to multiple MDPs; all MDPs are mapped with confidence less than 50 (undeliverable addresses need verification) | 11 | 3.14 |

(a) The total number of addresses used for testing is 350.


Fig. 10. Overall results of soft computing model for MDP mapping.


It can be shown that the efficiency of the system is dependent on the size and content of the symbolic knowledge base.

8. Conclusions

The soft computing model based expert system presented in this paper is used to address one of the very important tasks of integrated postal automation, namely mapping an incomplete/approximate postal address to a mail delivery point. It employs symbolic similarity measures for address component labeling and mail delivery point mapping. The similarity measures are treated as fuzzy membership functions, and the fuzzy alpha cut methodology is employed for de-fuzzification and for deciding on the component labels and the mail delivery point to which the address maps. The soft computing approach has efficiently modeled the inaccuracies and imprecision of postal addresses with which humans work. The system assumes character/word recognition of the destination address and is designed to be robust enough to overcome discrepancies in Optical Character Recognition (OCR). The soft computing model can be integrated with a knowledge based OCR system to develop a quick reading approach to speed up the recognition of the address.

Comparison of the results with previous works was not possible as, to the best of our knowledge, there are no works of a similar nature. The symbolic model for the postal address and its further processing using the soft computing approach has opened up new avenues in postal automation and presents an efficient tool for improving postal address interpretation systems. The soft computing framework provided here can also be employed to build decision support systems for varied applications in management and other socially relevant problems.

Acknowledgements

The authors acknowledge the comments of the reviewers, which have helped in improving the quality of the paper. This work is supported by a grant received from AICTE, New Delhi, Govt of India, under the RPS scheme for the project titled "Development of Techniques for Optimal Postal Mail Distribution in the Indian Context", vide Fno. 8022/RID/NPROJ/RPS-60 dated 22nd March 2004.

References

[1] G. Garibotto, Computer Vision in Postal Automation, Elsag Bailey – TELEROBOT, 2002.

[2] P. Nagabhushan, Towards automation in Indian Postal Services: a loud thinking, Technovision Special Volume (1998) 128–139.

[3] S. Setlur, A. Lawson, V. Govindaraju, S.N. Srihari, Truthing, testing and evaluation issues in complex systems, in: Sixth IAPR International Conference on Document Analysis and Recognition, Seattle, WA, (2001), pp. 1205–1214.

[4] S.N. Srihari, W.-j. Yang, V. Govindaraju, Information theoretic analysis of postal address fields for automatic address interpretation, in: Proceedings of Fifth International Conference on Document Analysis and Recognition (ICDAR-99), Bangalore, India, (1999), pp. 309–312.

[5] P. Nagabhushan, S.A. Angadi, B.S. Anami, A symbolic data structure for postal address representation and address validation through symbolic knowledge base, in: Proceedings of International Conference on Pattern Recognition and Machine Intelligence (Premi 05), 18–22 December 2005, LNCS, vol. 3776, Springer Verlag, 2005, pp. 388–393.

[6] P. Nagabhushan, S.A. Angadi, B.S. Anami, Automatic recognition of PINCODE printed in Kannada: a fuzzy statistical methodology using heuristics, in: Proceedings of International Conference on Human Machine Interfaces (ICHMI-2004), IISc Campus, Bangalore, 20–23 December, (2004), pp. 255–266.

[7] K. Roy, S. Vajda, U. Pal, B.B. Chaudhari, A system for Indian Postal automation, in: Proceedings of the International Workshop on Frontiers in Handwriting Recognition (IWFHR-9 2004), 2004.

[8] M.R. Premalatha, P. Nagabhushan, An algorithmic prototype for automatic verification and validation of PIN code: a step towards Postal Automation, in: Proceedings of AICTE-ISTE National Conference on Document Analysis & Recognition "NCDAR", 13–14 July, 2001, pp. 225–233.

[9] M.R. Nagamani, P. Nagabhushan, Knowledge based approach to determine the destination postal code through address block extraction: a case study towards postal automation, in: Proceedings of National Conference on Document Analysis and Recognition (NCDAR-2003) held at PESCE, Mandya, 11–12 July, 2003, pp. 152–163.

[10] J.P. Buckley, B.P. Buckles, F.E. Petry, Processing noisy structured textual data using a fuzzy matching approach: application to postal address errors, Journal of Soft Computing 4 (4) (2000) 195–205.

[11] Universal Postal Union Address Standard, FGDC Address Standard Version 2.

[12] A. Kjersti, E. Line, Text categorization: a survey, Norwegian Computing Center, NR 941, 1999.

[13] S. Fabrizio, Machine learning in automated text categorization, ACM Computing Surveys 34 (1) (2002) 1–47.

[14] A. Zadeh Lotfi, Some reflections on soft computing, granular computing and their roles in the conception, design and utilization of information/intelligent systems, Soft Computing 2 (1) (1998) 23–25.

[15] http://www.soft-computing.de/def.html.

[16] V. Mitra, C.-J. Wang, S. Banerjee, Text classification: a least square support vector machine approach, Applied Soft Computing 7 (June (3)) (2007) 908–914.

[17] S. Sakurai, A. Suyama, An e-mail analysis method based on text mining techniques, Applied Soft Computing 6 (November (1)) (2005) 62–71.

[18] L. Sheremetov, I. Batyrshin, D. Filatov, J. Martinez, H. Rodriguez, Fuzzy expert system for solving lost circulation problem, Applied Soft Computing 8 (January (1)) (2008) 14–29.

[19] O.U. Obot, E. Uzoka Faith-Michael, A framework for application of neuro-case-rulebase hybridization in medical diagnosis, Applied Soft Computing, in press. Corrected Proof, Available online 20 April 2008.

[20] K. Lee Key, Fuzzy rule generation for adaptive scheduling in a dynamic manufacturing environment, Applied Soft Computing 8 (September (4)) (2008) 1295–1304.

[21] H.-H. Bock, E. Diday, Analysis of Symbolic Data, Springer, Heidelberg, 2000.

[22] Lecture notes of short term course on symbolic and fuzzy approaches to data analysis, 21–26 April 1997.

[23] E. Diday, Knowledge discovery from the symbolic data and the SODAS software, in: PKDD 2000 Workshop on Symbolic Data Analysis, Lyon, 12th September 2000.

[24] P. Nagabhushan, S.A. Angadi, B.S. Anami, A knowledge-base supported inferencing of address components in postal mail, Presented in the National Conference on Vision Graphics and Image Processing (NVGIP 05), JNNCE, Shimoga, 2–3 March 2005.

[25] K. Chidanada Gowda, Symbolic objects and symbolic classification, in: Proceedings of International Conference on Symbolic and Spatial Data Analysis: Mining Complex Data Structures, Pisa, 20 September, 2004, pp. 1–18.

[26] J. Ross Timothy, Fuzzy Logic with Engineering Applications, McGraw-Hill Publications, 1997.
