

Supervised word sense disambiguation using semantic diffusion kernel

Tinghua Wang a,b,*, Junyang Rao b, Qi Hu c

a School of Mathematics and Computer Science, Gannan Normal University, Ganzhou 341000, China
b Institute of Computer Science and Technology, Peking University, Beijing 100871, China
c School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China

Article info

Article history: Received 10 March 2013; received in revised form 20 July 2013; accepted 19 August 2013; available online 17 September 2013.

Keywords: Word sense disambiguation (WSD); semantic diffusion kernel; support vector machine (SVM); kernel method; natural language processing.

Abstract

The success of machine learning approaches to word sense disambiguation (WSD) is largely dependent on the representation of the context in which an ambiguous word occurs. Typically, the contexts are represented in a vector space using the "Bag of Words (BoW)" technique. Despite its ease of use, the BoW representation suffers from well-known limitations, mostly due to its inability to exploit semantic similarity between terms. In this paper, we apply the semantic diffusion kernel, which models semantic similarity by means of a diffusion process on a graph defined by lexicon and co-occurrence information, to smooth the BoW representation for WSD systems. The semantic diffusion kernel can be obtained through a matrix exponentiation transformation of the given kernel matrix, and virtually exploits higher order co-occurrences to infer semantic similarity between terms. The superiority of the proposed method is demonstrated experimentally on several SensEval disambiguation tasks.

© 2013 Elsevier Ltd. All rights reserved.

1. Introduction

Word sense disambiguation (WSD) refers to the task of identifying the correct sense of an ambiguous word in a given context (Navigli, 2009). The ambiguity results from homonymy, i.e., words having the same spelling and pronunciation but different senses, and polysemy, i.e., words having multiple senses, usually with subtle differences (Nguyen and Ock, 2011). Homonymy is relatively easy to disambiguate because the domains of the different senses are distinct; e.g., the noun bank can be defined as "sloping raised land, especially along the sides of a river" or, alternately, as "an organization where people and businesses can invest or borrow money, convert to foreign money, etc., or a building where these services are offered" (Cambridge Advanced Learner's Dictionary). Polysemy is far more difficult because of the subtle differences and the common origin of the senses; e.g., the noun cold can refer to "a mild viral infection involving the nose and respiratory passages" or "the absence of heat, or the sensation produced by low temperatures" (WordNet 3.1). As a fundamental semantic understanding task at the lexical level in natural language processing, WSD can benefit many applications such as information retrieval (Stokoe et al., 2003; Zhong and Ng, 2012) and machine translation (Carpuat and Wu, 2007; Chan et al., 2007). In actual applications, WSD is often fully integrated into the system and cannot be separated out (for instance, in information retrieval, WSD is often not done explicitly but is just a by-product of query-to-document matching). However, it has been very difficult to formalize the process of disambiguation, which humans perform so effortlessly.

There are two main kinds of methods for performing WSD: knowledge-based approaches and corpus-based approaches. The former disambiguate words by comparing their context against information from predefined lexical resources such as WordNet, whereas the latter do not make use of any of these resources for disambiguation (Navigli, 2009). Most of the corpus-based approaches stem from the machine learning community, ranging from supervised learning, in which a classifier is trained for each distinct word on a corpus of manually sense-annotated examples, to completely unsupervised methods that cluster occurrences of words, thereby inducing senses. Among these, supervised learning approaches have been the most successful to date. Moreover, in recent years it has proved very promising to apply kernel methods (Shawe-Taylor and Cristianini, 2004; Simek et al., 2004) such as the support vector machine (SVM) (Giuliano et al., 2009; Jin et al., 2008; Joshi et al., 2006; Lee et al., 2004; Pahikkala et al., 2009), kernel principal component analysis (KPCA) (Su et al., 2004; Wu et al., 2004) and the regularized least-squares classifier (RLSC) (Popescu, 2004) to the WSD task. Kernel methods in general, and SVM in particular, have delivered extremely high performance in a wide variety of learning tasks. The advantage of using kernel methods for WSD is that they offer a flexible and efficient way of defining application-specific kernels for introducing background knowledge and explicitly modeling linguistic insights.

For machine learning-based WSD, one of the key steps is the representation of the context in which an ambiguous word occurs.


* Corresponding author at: School of Mathematics and Computer Science, Gannan Normal University, Ganzhou 341000, China. Tel.: +86 18810358076.
E-mail addresses: [email protected], [email protected] (T. Wang).



In the commonly used "Bag of Words (BoW)" representation (Shawe-Taylor and Cristianini, 2004), contexts are represented by vectors whose dimensions are indexed by the different words or terms occurring in the contexts. Despite its ease of use, the BoW representation suffers from well-known limitations, mostly due to its inability to exploit semantic similarity between terms: contexts sharing terms that are different but semantically related will be considered unrelated. The lack of semantics in the BoW representation limits the effectiveness of automatic WSD. To alleviate this shortcoming, a number of attempts have been made to incorporate semantic knowledge into the BoW representation. For example, external semantic knowledge provided by word thesauri or ontologies has been embedded into a semantic kernel, which is used to enrich the standard BoW paradigm (Jin et al., 2008; Joshi et al., 2006). In the absence of external semantic knowledge, corpus-based statistical methods such as Latent Semantic Indexing (LSI) (Deerwester et al., 1990) can be applied to capture semantic relations between terms (Cristianini et al., 2002; Giuliano et al., 2009). However, such methods are also limited in their flexibility and are usually computationally expensive.

In this paper, we present a semantically enriched BoW representation for WSD systems. We adopt the semantic diffusion kernel (Kandola et al., 2003) to efficiently model semantic similarity by means of a diffusion process on a graph defined by lexicon and co-occurrence information. The idea behind the semantic diffusion kernel is that higher order correlations between terms can affect their semantic similarities. In order to further clarify the geometric significance of this kernel, we distinguish the notion of "semantic diffusion" from the "diffusion kernel". That is, we consider a semantic matrix that captures the semantic similarities between terms and perform the diffusion process on the semantic matrix rather than directly on the kernel. Combined with the use of SVM, this kernel yields a significant disambiguation improvement over the standard BoW kernel. Besides providing a clearer interpretation of the semantic diffusion kernel, to the best of our knowledge, our work is the first to apply this kernel to the WSD application.

The rest of this paper is organized as follows. In Section 2, we briefly introduce SVM for the binary and multiclass settings, respectively. In Section 3, we give the representation of the contexts and the BoW kernel for WSD. In Section 4, we describe in detail the concept of the semantic diffusion kernel and its geometric significance, then apply this kernel to the WSD system, and finally discuss the related work. The proposed method is evaluated on several SensEval disambiguation tasks in Section 5, and the conclusion and future work are provided in Section 6.

2. Support vector machines (SVM)

The fundamental concepts of SVM were developed by Vapnik (1995). SVM was originally formulated for the solution of classification problems with two classes, known as binary classification problems. For problems with more than two classes, known as multiclass classification problems, a decomposition approach is often employed, in which the multiclass problem is divided into multiple binary subproblems.

2.1. Binary classification

In a binary classification problem, we are given l pairs of training samples {(x_i, y_i)}_{i=1}^{l}, where x_i ∈ X (the input space) and y_i ∈ {±1}. Each x is then mapped to a φ(x) in the feature space, which is implicitly defined by the kernel function k(x, z) = φ(x)^T φ(z) for any x, z ∈ X. The standard SVM tries to find a hyperplane w^T φ(x) + b = 0, which is determined by a weight vector w and a bias b. This hyperplane can be obtained by solving the following optimization problem:

\[
\min_{w,b,\xi}\ \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{l}\xi_{i}
\quad \text{s.t.}\quad y_{i}\big(w^{T}\phi(x_{i})+b\big) \ge 1-\xi_{i},\ \ \xi_{i} \ge 0,\ \ i=1,\dots,l
\qquad (1)
\]

where ξ = (ξ_1, ..., ξ_l)^T is the vector of slack variables and C is the regularization parameter used to impose a trade-off between the training error and generalization. This problem can be solved using the Lagrange method. Let α_i be the Lagrange multiplier corresponding to the ith inequality; the dual of Problem (1) is then

\[
\max_{\alpha}\ \sum_{i=1}^{l}\alpha_{i} - \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} y_{i} y_{j}\alpha_{i}\alpha_{j} k(x_{i}, x_{j})
\quad \text{s.t.}\quad \sum_{i=1}^{l}\alpha_{i} y_{i} = 0,\ \ 0 \le \alpha_{i} \le C,\ \ i=1,\dots,l
\qquad (2)
\]

The weight vector w and bias b can then be determined from the solution α_i. The samples x_i with α_i > 0 are called support vectors. Finally, the resulting decision function can be written as

\[
f(x) = \operatorname{sgn}\left(\sum_{i=1}^{l}\alpha_{i} y_{i} k(x_{i}, x) + b\right)
\qquad (3)
\]

The kernel function k has to be positive definite, i.e., it must satisfy Mercer's condition. Mercer's condition tells us whether or not a kernel is actually an inner product in some space, and it ensures that the solution of Problem (2) is a global optimum. Popular choices for the kernel function include the linear kernel, the polynomial kernel and the Gaussian kernel:

- Linear kernel: k(x_i, x_j) = x_i · x_j
- Polynomial kernel: k(x_i, x_j) = (x_i · x_j + 1)^d, d = 1, 2, ...
- Gaussian kernel: k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2σ^2)), σ > 0

In real applications, the information available to SVM and other kernel methods is contained entirely in the Gram matrix, or kernel matrix, which is given by

\[
K = \big(k(x_{i}, x_{j})\big)_{i,j=1}^{l}
\qquad (4)
\]

The Gram matrix represents a sort of 'bottleneck' for the information that can be exploited: by operating on this matrix, one can implicitly encode the data in a manner more suitable for learning and mining. Since the Gram matrix and the kernel function are essentially equivalent, we can refer to either as the "kernel" without risk of confusion.
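As a small illustration of Eqs. (2)-(4) (a sketch with synthetic data, not the authors' code), the following Python fragment builds a linear Gram matrix explicitly and trains a binary SVM on it through scikit-learn's precomputed-kernel interface; the support vectors are exactly the samples with α_i > 0.

```python
# Minimal sketch: binary SVM trained from an explicit Gram matrix (Eq. (4)).
# The data are synthetic and serve only to exercise the interface.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 5))                      # 40 samples, 5 features
y_train = np.sign(X_train[:, 0] + 0.1 * rng.normal(size=40))
X_test = rng.normal(size=(10, 5))

def linear_kernel(A, B):
    """k(x, z) = x . z, so the Gram matrix is simply A B^T."""
    return A @ B.T

K_train = linear_kernel(X_train, X_train)               # Eq. (4): K = (k(x_i, x_j))
K_test = linear_kernel(X_test, X_train)                 # kernel values between test and training points

clf = SVC(kernel="precomputed", C=1.0)                  # solves the dual problem (2)
clf.fit(K_train, y_train)
print("number of support vectors:", clf.support_.size)  # samples with alpha_i > 0
print("test predictions:", clf.predict(K_test))         # decision function (3)
```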

2.2. Multiclass classification

In kernel methods, multiclass SVM has recently attracted much attention due to the demand for multiclass classification in many practical applications and the success of SVM in binary classification. Several approaches are available for extending binary classification to the multiclass case (Hsu and Lin, 2002; Rifkin and Klautau, 2004). These approaches roughly fall into two categories. The first, denoted all-in-one or single machine, directly considers all data in one optimization formulation. The second considers a decomposition of the multiclass problem into several binary subproblems and then combines their solutions. There are two widely used strategies to decompose the problem: one-versus-rest (1-v-r) and one-versus-one (1-v-1). Given a problem with m classes, the 1-v-r strategy constructs m binary SVMs, each of which is trained to separate one class from the other classes, while the 1-v-1 strategy constructs m(m−1)/2 binary SVMs, each of which is trained to separate one class from another class. When a test sample is provided, it is applied to all the binary SVMs and their outputs are combined based on some voting technique, such as the Max Wins voting scheme (Hsu and Lin, 2002), which counts how often each class is output by the binary SVMs and assigns the test sample to the most voted class. Although the two approaches usually present no significant difference in classification accuracy when the parameters of the SVM are properly tuned, the decomposition approach is often recommended for practical use because of its lower computational overhead and conceptual simplicity.
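By way of illustration only (synthetic data and scikit-learn's ready-made wrappers rather than a hand-rolled decomposition), the following sketch contrasts how many binary machines the two strategies build.

```python
# Sketch of the two decomposition strategies discussed above.
# For m classes, one-versus-rest fits m binary SVMs and one-versus-one fits m(m-1)/2.
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

ovr = OneVsRestClassifier(LinearSVC(C=1.0, max_iter=10000)).fit(X, y)
ovo = OneVsOneClassifier(LinearSVC(C=1.0, max_iter=10000)).fit(X, y)

print("1-v-r binary machines:", len(ovr.estimators_))   # m = 4
print("1-v-1 binary machines:", len(ovo.estimators_))   # m(m-1)/2 = 6
```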

3. BoW representation of contexts

WSD can be viewed as the problem of classifying each word, according to its surrounding context, into one of its senses. It is therefore a classification problem with a few classes, i.e., multiclass classification. A key step in applying SVM to WSD is to choose an appropriate kernel. Up to now, most methods have tried to represent the contextual information in a vector and then used some vector-based kernel. In natural language processing, it is widely agreed that the linear kernel performs better than other kernels (Jin et al., 2008; Popescu, 2004) when such vectors are provided. In this section, we discuss this vector-based data representation and the kernels used for WSD.

Let t denote a word to be disambiguated and d = (t_{-r}, ..., t_{-1}, t_1, ..., t_s) be the context of t, where t_{-r}, ..., t_{-1} are the words, in the order they appear, preceding t, and correspondingly t_1, ..., t_s are the words that follow t in the text. We also define a context span parameter τ to control the length of the context. For a fixed τ, we always take the largest context d = (t_{-r}, ..., t_{-1}, t_1, ..., t_s) such that r ≤ τ and s ≤ τ. Note that if there exist τ words preceding and following the word to be disambiguated, then r = s = τ; otherwise r < τ or s < τ. In addition, let V and n be, respectively, the set of all distinct words over all the contexts in the training corpus and the size of V. The BoW model (also called the vector space model, or VSM) (Shawe-Taylor and Cristianini, 2004) of a context d is defined as follows:

\[
\phi:\ d \mapsto \phi(d) = \big(tf(t_{1}, d), \dots, tf(t_{n}, d)\big)^{T} \in \mathbb{R}^{n}
\qquad (5)
\]

where tf(t_i, d), 1 ≤ i ≤ n, is the frequency of occurrence of word t_i in the context d. If we consider the feature space defined by the VSM, the BoW kernel is given by

\[
k(d_{i}, d_{j}) = \langle \phi(d_{i}), \phi(d_{j}) \rangle = \phi(d_{i})^{T}\phi(d_{j})
\qquad (6)
\]

Note that in this case the input space X is the set of contexts. It is obvious that the BoW kernel is essentially a linear kernel.

In the natural language processing community, the number of features in a data set is usually large, often much larger than the number of instances. In this case, one may not need to map the data to a higher dimensional space; in other words, a nonlinear mapping does not improve the performance and using the linear kernel is good enough (Hsu et al., 2003).
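As an illustration of Eqs. (5) and (6) (the three toy contexts below are invented for this example and are not from the corpora used later), the following Python sketch builds term-frequency context vectors with scikit-learn's CountVectorizer and obtains the BoW kernel matrix as a plain dot product.

```python
# Sketch of the BoW (linear) kernel between contexts, Eqs. (5)-(6).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

contexts = [
    "the bank raised the interest rate on deposits",
    "they walked along the bank of the river",
    "the central bank cut its key interest rate",
]
vectorizer = CountVectorizer()             # term-frequency counts tf(t_i, d), Eq. (5)
Phi = vectorizer.fit_transform(contexts)   # each row is phi(d), stored sparsely

K_bow = (Phi @ Phi.T).toarray()            # Eq. (6): k(d_i, d_j) = phi(d_i)^T phi(d_j)
print(K_bow)                               # contexts 1 and 3 overlap in 'bank', 'interest', 'rate'
```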

4. Semantic diffusion kernel applied to WSD

As mentioned above, the key problem of a supervised WSD system based on SVM is how to choose the appropriate kernel (in kernel methods, this is often regarded as a model selection problem). In this section we discuss a sophisticated method for solving this problem.

4.1. Semantic diffusion kernel

In the BoW representation, the feature vectors are typically sparse, with a small number of non-zero entries for those words occurring in the contexts. Two contexts that use semantically related but distinct words will therefore show no similarity. Ideally, semantically similar contexts should be mapped to nearby positions in the feature space. In order to address the semantic information of the words in BoW, a transformation of the feature vector of the type φ̃(d) = Sφ(d) is required, where S is a semantic matrix indexed by pairs of terms, with the entry S_{i,j} = S_{j,i}, 1 ≤ i, j ≤ n, indicating the strength of their semantic similarity. Using this transformation, the semantic kernels take the form

\[
\tilde{k}(d_{i}, d_{j}) = \tilde{\phi}(d_{i})^{T}\tilde{\phi}(d_{j}) = \phi(d_{i})^{T} S^{T} S\, \phi(d_{j})
\qquad (7)
\]

The semantic kernels correspond to representing a context as a less sparse vector, Sφ(d), which has non-zero entries for all terms that are semantically similar to those present in context d. Different choices of the matrix S lead to different variants of the semantic kernels.

In practice, the problem of how to infer semantic similarities between terms from a corpus remains an open issue. Kandola et al. (2003) presented a semantic diffusion kernel, named the exponential kernel, given by

\[
K(\lambda) = K_{0} \exp(\lambda K_{0})
\qquad (8)
\]

where K_0 is the kernel matrix of the BoW kernel and λ is a decay factor. Let D be the feature-example (term-by-context, in the case of WSD) matrix in the BoW kernel-induced feature space; then D^T D gives the kernel matrix K_0. Let G = DD^T. It has been proved that K(λ) corresponds to a semantic matrix exp(λG/2) (Kandola et al., 2003; Shawe-Taylor and Cristianini, 2004), i.e.

\[
S = \exp\!\Big(\frac{\lambda}{2}G\Big) = I + \frac{\lambda}{2}G + \frac{1}{2!}\Big(\frac{\lambda}{2}G\Big)^{2} + \cdots + \frac{1}{\theta!}\Big(\frac{\lambda}{2}G\Big)^{\theta} + \cdots
\qquad (9)
\]

The matrix G is called the 'generator', which captures the initial semantic similarities between entities, and the semantic matrix S is defined as the exponential of the generator.

In fact, noting that S is a symmetric positive semi-definite matrix since G is symmetric (Gärtner, 2003; Kondor and Lafferty, 2002), we have

\[
\begin{aligned}
K(\lambda) &= D^{T} S^{T} S D = D^{T} S^{2} D = D^{T} \exp(\lambda G)\, D \\
&= D^{T}\Big( I + \lambda G + \frac{\lambda^{2} G^{2}}{2!} + \cdots + \frac{\lambda^{\theta} G^{\theta}}{\theta!} + \cdots \Big) D \\
&= D^{T}D + \lambda D^{T} G D + \frac{\lambda^{2} D^{T} G^{2} D}{2!} + \cdots + \frac{\lambda^{\theta} D^{T} G^{\theta} D}{\theta!} + \cdots \\
&= K_{0} + \lambda K_{0}^{2} + \frac{\lambda^{2} K_{0}^{3}}{2!} + \cdots + \frac{\lambda^{\theta} K_{0}^{\theta+1}}{\theta!} + \cdots \\
&= K_{0}\Big( I + \lambda K_{0} + \frac{\lambda^{2} K_{0}^{2}}{2!} + \cdots + \frac{\lambda^{\theta} K_{0}^{\theta}}{\theta!} + \cdots \Big) \\
&= K_{0} \exp(\lambda K_{0})
\end{aligned}
\qquad (10)
\]

where I denotes the identity matrix.

Geometrically, the semantic diffusion kernel models semantic similarities by means of a diffusion process on a graph defined by lexicon and co-occurrence information. The semantic matrix S essentially captures the higher order correlation between terms. Conceptually, if term t1 co-occurs with term t2 in some contexts, we say t1 and t2 share a first-order correlation. If t1 co-occurs with t2 in some contexts, and t2 with t3 in some others, then t1 and t3 are said to share a second-order correlation through t2. Higher orders of correlation can be defined similarly. Consider Fig. 1(a), which depicts three contexts d1, d2 and d3, each containing two terms from t1, t2, t3 and t4. We can find that t1 and t4 share a third-order correlation through t2 and t3. When modeled as a graph, as shown in Fig. 1(b), each higher order correlation defines a path between the two vertices (terms). Considering the semantic matrix S, the entries of the matrix G^θ are given by

\[
G^{\theta}_{i,j} = \sum_{\substack{m_{1},\dots,m_{\theta} \in \{1,\dots,n\} \\ m_{1} = i,\ m_{\theta} = j}}\ \prod_{p=1}^{\theta-1} G_{m_{p}, m_{p+1}}
\qquad (11)
\]

that is, the number of θth-order co-occurrence paths between terms i and j. Hence the semantic similarity between two terms is measured by the number of co-occurrence paths between them. Specifically, G indicates the first-order correlation between features (terms) over the training corpus, G² indicates the second-order correlation between terms, and so forth; G^θ indicates the θth-order correlation between terms. In addition, it should be noted that the identity matrix I, which can be regarded as the indication of the zero-order correlation between terms, means that the similarity between a term and itself equals 1 and is 0 otherwise. Intuitively, the higher the co-occurrence order is, the less similar the semantics becomes. The semantic matrix S combines co-occurrence paths of all orders with exponentially decaying weights, and the parameter λ ∈ [0, +∞) is used to control the decay speed for increasing orders. Finally, it is easy to see that the exponential kernel reduces to the standard BoW kernel when λ = 0. In other words, the BoW kernel is just a special case of the exponential kernel.
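To make Eqs. (8)-(10) concrete, the following Python sketch computes the semantic diffusion kernel from a toy term-by-context matrix (the matrix D below is invented for illustration, loosely mirroring Fig. 1) and checks numerically that K_0 exp(λK_0) coincides with D^T S^2 D.

```python
# Sketch of the semantic diffusion (exponential) kernel, Eqs. (8)-(10).
import numpy as np
from scipy.linalg import expm

# toy term-by-context matrix D (rows: terms t1..t4, columns: contexts d1..d3)
D = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)

K0 = D.T @ D                                 # BoW kernel matrix between contexts
G = D @ D.T                                  # generator: first-order term co-occurrences
lam = 0.25                                   # decay factor lambda >= 0

K_diffusion = K0 @ expm(lam * K0)            # Eq. (8): K(lambda) = K0 exp(lambda K0)
S = expm(0.5 * lam * G)                      # semantic matrix, Eq. (9)
K_check = D.T @ S @ S @ D                    # Eq. (10): the same kernel written via S

print(np.allclose(K_diffusion, K_check))     # True
print(K_diffusion.round(3))                  # d1 and d3 now share similarity through t2, t3
```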

4.2. The proposed WSD system

Fig. 2 shows the architecture of the proposed WSD system. This system consists of three independent modules: preprocessing, kernel selection and classification. Preprocessing is the step that converts the input documents into formatted information; the details of this step are described in the Empirical evaluation section. In the kernel selection step, we use the semantic diffusion kernel rather than the standard BoW kernel. Finally, the SVM classifier trains a model for each word type that has training data during the training process. The trained classification model is then applied to the test examples of the corresponding word type in the test process.

Obviously, the main feature of our system is that it takes advantage of the semantic diffusion kernel to reveal the semantic similarities between terms based on the "diffusion" of co-occurrence information. Although this kernel or its variations have been applied in many domains, such as collective classification of link data (Tian et al., 2006) and finding relevant web pages (Wang et al., 2009), to the best of our knowledge this is the first time the technique has been applied to a WSD system. Besides, from the point of view of framework flexibility, one may design more appropriate kernels to replace the semantic diffusion kernel, or choose other classifiers to replace SVM in this system.

4.3. Related work

Many current natural language processing systems rely on semantic knowledge extracted from data via machine learning methods. Statistical or alternative models are learned and then applied to running text. The main problem encountered by such systems is the sparse data problem, due to the small number of learning examples. Focusing on WSD, only a handful of occurrences with sense tags are available per word (Agirre et al., 2005). One possible solution to this problem is to utilize higher order co-occurrences as a measure of semantic relation between terms. The underlying idea is that humans do not necessarily use the same vocabulary when writing about the same topic (Lemaire and Denhière, 2006).

Several approaches have been proposed in the past few years to model term similarity based on the concept of higher order co-occurrence (Chakraborti et al., 2007; Ganiz et al., 2011; Hussain and Bisson, 2010; Rohde et al., 2006). Perhaps the most sophisticated and widely used method is LSI, which was originally proposed for information retrieval (Deerwester et al., 1990). Given a term-by-context matrix D, the general idea of LSI is to reduce D using the singular value decomposition (SVD), which is thought to reduce the noise in the data and make relationships between underlying concepts clearer. Formally, SVD decomposes D into the product of three matrices, i.e., D = UΣV^T, where Σ is a diagonal matrix of singular values, and U^T U = I and V^T V = I. If the matrices comprising the SVD are permuted such that the singular values are in descending order, they can be truncated to a much lower rank n′ (n′ ≪ n). It can be shown that the product of these reduced matrices is the best rank-n′ approximation, in terms of sum of squared errors, to the original matrix D. By selecting the top n′ singular values, we reduce the original space to n′ dimensions, and can thus project the instances (both training and test) into the reduced space. Once all training and test instances are projected, any machine learning algorithm can be applied as usual to obtain the decision rule or boundary.
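For concreteness, a rough Python sketch of this LSI step follows (synthetic counts stand in for a real term-by-context matrix, scikit-learn's TruncatedSVD stands in for SVDPACK, and the matrix is arranged contexts-by-terms, as scikit-learn expects, rather than terms-by-contexts).

```python
# Sketch of LSI as a preprocessing step: project contexts onto a rank-n' space
# via truncated SVD, then train any classifier (here a linear SVM) in that space.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
Phi_train = rng.poisson(0.3, size=(200, 1000)).astype(float)  # contexts x terms (toy counts)
y_train = rng.integers(0, 3, size=200)                        # toy sense labels
Phi_test = rng.poisson(0.3, size=(20, 1000)).astype(float)

svd = TruncatedSVD(n_components=50, random_state=0)           # keep the top n' singular values
Z_train = svd.fit_transform(Phi_train)                        # project training contexts
clf = LinearSVC(max_iter=10000).fit(Z_train, y_train)

pred = clf.predict(svd.transform(Phi_test))                   # project test contexts, then classify
print(pred[:10])
```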

In fact, LSI takes advantage of implicit higher order (or latent) structure in the association of terms and contexts. Kontostathis and Pottenger (2006) mathematically proved that LSI implicitly depends on higher order co-occurrences. They also demonstrated empirically that higher order co-occurrences play a key role in the effectiveness of systems used for information retrieval and text mining. In the context of kernel methods, LSI-based kernels, such as latent semantic kernels (Cristianini et al., 2002) and domain kernels (Giuliano et al., 2009), have also been investigated.

Fig. 1. Graphical representation of higher order co-occurrences.

Fig. 2. WSD system architecture.

T. Wang et al. / Engineering Applications of Artificial Intelligence 27 (2014) 167–174170

Page 5: Supervised word sense disambiguation using semantic diffusion kernel

Conceptually, the domain kernel estimates the similarity between the domains of two texts, so as to capture domain aspects of sense distinction. It is a variation of the latent semantic kernel.

The main problem of LSI is its high computational complexity, since computing the SVD itself is not trivial. For an n × n matrix, the SVD computation requires time proportional to n³. This is impractical for matrices with more than a few thousand dimensions. In comparison with LSI, the semantic diffusion kernel does not need the heavy computational overhead of the decomposition. In addition, the semantic diffusion kernel has the advantage of explicitly capturing the latent or higher order similarities, as opposed to doing so implicitly, as LSI does.

5. Empirical evaluation

This section provides an empirical study that shows the benefits of the proposed WSD system on several real corpora. Specifically, we compare the disambiguation performance of four methods:

- MFS: a common baseline model that selects the most frequent sense (MFS) in the training data as the answer. In other words, this baseline assigns the most frequent sense of the ambiguous word in a corpus, independently of the word's contexts. This trivial technique can be very effective if one sense has a high probability, but if a word has equiprobable senses, its performance will be poor.
- BoW: SVM classifier with the BoW kernel.
- LSI: LSI-based SVM classifier. This method first performs LSI (via SVD on the term-by-context matrix D) to find a reduced space; an SVM is then applied in the reduced space.
- SDK: SVM classifier with the semantic diffusion kernel (SDK).

5.1. Data sets

We select the corpora for four words, namely interest, line, hard and serve, which have been used in numerous comparative studies of WSD, from SensEval (http://www.senseval.org/). The detailed descriptions of these data sets are as follows.

The interest data (Bruce and Wiebe, 1994) consist of 2368 instances in which the noun interest is used in one of six senses taken from the Longman Dictionary of Contemporary English (LDOCE). The instances are extracted from the part-of-speech-tagged subset of the Penn Treebank Wall Street Journal Corpus (ACL/DCI version). Table 1 shows the description and distribution of the different senses of the word interest.

The line data (Leacock et al., 1993) consist of 4147 instances in which the noun line is used in one of six possible WordNet senses. These data were extracted from the 1987-1989 Wall Street Journal (WSJ) corpus and the American Printing House for the Blind (APHB) corpus. The distribution of senses is somewhat skewed, with more than 50% of the instances used in the product sense and all the other instances more or less equally distributed among the other five senses. Table 2 shows the description and distribution of the different senses of the word line.

The hard data (Leacock et al., 1998) consist of 4333 instances taken from the San Jose Mercury News (SJM) corpus and are annotated with one of three senses of the adjective hard from WordNet. The distribution of instances is skewed, with almost 80% of the instances used in the not easy (difficult) sense. Table 3 shows the description and distribution of the different senses of the word hard.

The serve data (Leacock et al., 1998) consist of 4378 instances with the verb serve as the target word. They are annotated with one of four senses from WordNet. Like line, these data were created from the WSJ and APHB corpora. Table 4 shows the description and distribution of the different senses of the word serve.

5.2. Experimental setup

For each data set, we partition it into a training set and a test set by stratified sampling: 70% of the data set serves as the training set and 30% as the test set. For the training set, we first remove the words that appear in a list of stop words (for example, "is", "are", "a" and "the"). Words that contain no alphabetic characters, such as punctuation symbols and numbers, are also discarded. We then extract the surrounding words, which can be in the current sentence or in immediately adjacent sentences, within the ±5-word window (i.e., r = s = τ = 5) around an ambiguous word. The extracted words are finally converted to their lemma forms in lower case. Each lemma is treated as one feature whose value is set to its term frequency (TF).

Table 1
Description and distribution of senses of interest.

  Sense                                          Sense tag frequency   Percentage
  Readiness to give attention                    361                   15.24
  Quality of causing attention to be given to    11                    0.46
  Activity, etc. that one gives attention to     66                    2.79
  Advantage, advancement or favor                178                   7.52
  A share in a company or business               500                   21.11
  Money paid for the use of money                1252                  52.87

Table 2
Description and distribution of senses of line.

  Sense       Sense tag frequency   Percentage
  Cord        373                   8.99
  Division    374                   9.02
  Formation   349                   8.42
  Phone       429                   10.34
  Product     2218                  53.48
  Text        404                   9.74

Table 3
Description and distribution of senses of hard.

  Sense                    Sense tag frequency   Percentage
  Not easy (difficult)     3455                  79.74
  Not soft (metaphoric)    502                   11.59
  Not soft (physical)      376                   8.68

Table 4
Description and distribution of senses of serve.

  Sense                    Sense tag frequency   Percentage
  Supply with food         1814                  41.43
  Hold an office           1272                  29.05
  Function as something    853                   19.48
  Provide a service        439                   10.03




For the test set, similar preprocessing is carried out, but the features are kept the same as those extracted from the training set (we directly eliminate lemmas found in the test set but not in the training set).
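A minimal sketch of this preprocessing on a single context follows; the NLTK stop list and WordNet lemmatizer below are stand-ins chosen for illustration, since the paper does not name the specific tools used.

```python
# Sketch of the preprocessing: stop-word removal, filtering of non-alphabetic tokens,
# a +/- tau-word window around the target, lemmatization, and term-frequency values.
# First run: nltk.download('stopwords'); nltk.download('wordnet')
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

STOP = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def context_features(tokens, target_index, tau=5):
    """Return lemma -> term frequency for the +/- tau-word window around the target."""
    window = tokens[max(0, target_index - tau):target_index] + \
             tokens[target_index + 1:target_index + 1 + tau]
    feats = {}
    for tok in window:
        tok = tok.lower()
        if tok in STOP or not re.search(r"[a-z]", tok):   # drop stop words, punctuation, numbers
            continue
        lemma = lemmatizer.lemmatize(tok)
        feats[lemma] = feats.get(lemma, 0) + 1            # TF value of the feature
    return feats

sentence = "The bank raised its interest rate , analysts said .".split()
print(context_features(sentence, target_index=4))         # target word: 'interest'
```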

Once training and test instances are represented as feature vectors, it becomes possible to exploit any number of existing supervised learning algorithms. Here the SVM classifier trains a model for each word type with the training data and tests the disambiguation performance of the learned model on the test data. The parameters of the SVM are optimized by five-fold cross-validation on the training set. For the BoW kernel, there is only one parameter, C, that needs to be optimized. We perform a grid search in one dimension (i.e., a line search) to choose this parameter from the set {2^-2, 2^0, ..., 2^10}. For LSI, since its performance depends critically on the number of retained dimensions, we try 100, 200, 300, 500 and 1000 dimensions and report the best performance. SVD is performed with the software SVDPACK (http://www.netlib.org/svdpack/), and the SVM is then fed with the results from SVDPACK. For the semantic diffusion kernel, there are two parameters, C and λ, that need to be optimized. We perform a grid search over two dimensions, i.e., C ∈ {2^-2, 2^0, ..., 2^10} and λ ∈ {2^0, 2^-1, ..., 2^-10}. In addition, the SVM classifier is implemented with the software LIBLINEAR (Fan et al., 2008), which is a simple but efficient package for solving large-scale linear classification and regression problems.
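The following Python sketch illustrates this kind of grid search for the two diffusion-kernel parameters. It is an illustration of the procedure rather than a reproduction of the authors' setup: it uses scikit-learn's precomputed-kernel SVC instead of LIBLINEAR, synthetic data in place of the real corpora, a coarser λ grid, and L2-normalized context vectors (an added assumption, to keep the matrix exponential well scaled).

```python
# Sketch: five-fold cross-validated grid search over C and lambda for the
# semantic diffusion kernel, using a precomputed-kernel SVM on toy data.
import numpy as np
from scipy.linalg import expm
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import normalize
from sklearn.svm import SVC

rng = np.random.default_rng(0)
Phi = rng.poisson(0.2, size=(120, 300)).astype(float)    # toy contexts-by-terms counts
Phi = normalize(Phi)                                     # L2-normalize contexts (added assumption)
y = rng.integers(0, 3, size=120)                         # toy sense labels
K0 = Phi @ Phi.T                                         # BoW kernel matrix

C_grid = [2.0 ** p for p in (-2, 0, 2, 4, 6, 8, 10)]
lam_grid = [2.0 ** (-p) for p in range(0, 11, 2)]        # coarser lambda grid than the paper's

best = (None, None, -np.inf)
for lam in lam_grid:
    K = K0 @ expm(lam * K0)                              # semantic diffusion kernel, Eq. (8)
    for C in C_grid:
        score = cross_val_score(SVC(kernel="precomputed", C=C), K, y,
                                cv=5, scoring="f1_micro").mean()
        if score > best[2]:
            best = (C, lam, score)
print("best C = %g, lambda = %g, CV micro-F1 = %.3f" % best)
```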

We measure the classification performance using the F1 score, a popular measure for comparing the performance of different algorithms, typically on data with a skewed class distribution. Recall that, given a confusion matrix for a class c such as the one in Table 5, the precision of a classifier for the class c is the ratio of the number of texts correctly assigned to c to the total number of texts assigned to c. The recall is the ratio of the number of texts correctly assigned to c to the total number of texts belonging to c. The F1 score is the harmonic mean of the precision and the recall. Formally, these measures are defined as follows:

\[
\text{Precision}(c) = \frac{TP_{c}}{TP_{c} + FP_{c}}
\qquad (12)
\]

\[
\text{Recall}(c) = \frac{TP_{c}}{TP_{c} + FN_{c}}
\qquad (13)
\]

\[
F_{1}(c) = \frac{2 \cdot \text{Precision}(c) \cdot \text{Recall}(c)}{\text{Precision}(c) + \text{Recall}(c)}
\qquad (14)
\]

For multiclass classification problems, the micro-average and the macro-average are used to evaluate the classifier. Table 6 gives the formulae for the precision and recall. The global F1 score is then given by

\[
F_{1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\qquad (15)
\]

according to the kind of average used.
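As a quick check of Eqs. (12)-(15), the following sketch computes the per-class F1 scores by hand on a small invented label set and compares the micro- and macro-averages with scikit-learn's implementation.

```python
# Sketch of the evaluation measures: per-class precision/recall/F1 (Eqs. (12)-(14))
# and the micro- and macro-averaged F1 (Eq. (15) with the two kinds of averaging).
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])   # invented gold senses
y_pred = np.array([0, 0, 1, 0, 1, 2, 2, 2, 0, 2])   # invented predictions

for c in np.unique(y_true):
    tp = np.sum((y_pred == c) & (y_true == c))
    fp = np.sum((y_pred == c) & (y_true != c))
    fn = np.sum((y_pred != c) & (y_true == c))
    p, r = tp / (tp + fp), tp / (tp + fn)                    # Eqs. (12)-(13)
    print(f"class {c}: F1 = {2 * p * r / (p + r):.3f}")      # Eq. (14)

print("micro-F1:", f1_score(y_true, y_pred, average="micro"))
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))
```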

5.3. Experimental results

The average classification results in terms of micro-F1 and macro-F1 over 10 trials are summarized in Tables 7 and 8, respectively; in each row of these tables, the best result is the highest value. To ensure that the results are more reliable than what would be expected by chance, a two-tailed t-test at the 0.05 significance level is performed to determine whether there is a significant difference between SDK and each of the other methods. A win-tie-loss (W-T-L) summarization based on the t-test is also attached at the bottom of Tables 7 and 8. A win or a loss means that SDK is better or worse than the other method on a data set; a tie means that both methods have the same performance. From these tables, we can see that the BoW, LSI and SDK methods all achieve significantly better performance than the MFS baseline. More importantly, the SDK method is significantly better than the BoW method on all four data sets. SDK is intuitively similar to LSI in that both try to capture higher order similarities. However, instead of computing the latent values implicitly in a lower-rank text space, SDK models semantic similarity explicitly by means of a diffusion process on a graph defined by lexicon and co-occurrence information. This explicit procedure enables us to find real higher order relations between pairs of terms. As we can see, the SDK method is significantly better than the LSI method on three of the four data sets. For the macro-averaged F1 score, although LSI slightly outperforms SDK on the serve data set, the difference is not significant. This shows the benefit of higher order co-occurrence information for WSD and the effectiveness of the SDK method.

In addition, we would like to investigate the influence of the SVM parameters on the disambiguation performance. It is well known that the generalization performance of SVM is heavily dependent on parameter selection (Keerthi and Lin, 2003). An example of the influence of these parameters on the classification performance is depicted in Figs. 3 and 4, where the interest data set is considered.

Table 5
Confusion matrix for class c.

                       Positive   Negative
  Predicted positive   TP_c       FP_c
  Predicted negative   FN_c       TN_c

Table 6
Micro-average and macro-average for m classes.

  Measure     Micro                              Macro
  Precision   Σ_c TP_c / Σ_c (TP_c + FP_c)       (Σ_c Precision(c)) / m
  Recall      Σ_c TP_c / Σ_c (TP_c + FN_c)       (Σ_c Recall(c)) / m

Table 7
Classification results (micro-F1, %) of the four methods.

  Data set   MFS     BoW     LSI     SDK
  Interest   52.87   86.00   86.13   87.32
  Line       53.48   82.94   82.98   84.09
  Hard       79.74   83.70   83.62   84.84
  Serve      41.43   86.29   86.73   87.00
  W-T-L      4-0-0   4-0-0   3-1-0   –

Table 8
Classification results (macro-F1, %) of the four methods.

  Data set   MFS     BoW     LSI     SDK
  Interest   11.53   64.21   65.73   72.82
  Line       11.62   75.21   74.86   76.03
  Hard       14.87   33.13   33.28   34.42
  Serve      9.76    55.21   57.61   57.43
  W-T-L      4-0-0   4-0-0   3-1-0   –




For the semantic diffusion kernel, Fig. 3 presents the classification results obtained using different λ values with C = 1, and Fig. 4 presents the classification results obtained using different C values with λ = 2^-8. From Fig. 3, we can see that the parameter λ needs to be tuned carefully to achieve the best disambiguation performance. Empirically, the semantic diffusion kernel appears to prefer lower values of λ on the interest corpus. One possible reason for this phenomenon could be an over-emphasized influence of the higher order co-occurrence information for relatively larger values of λ. From Fig. 4, it is surprising to find that the performance is not very sensitive to C, which contradicts conventional observations. However, this could result from the inherent insensitivity of the solvers in LIBLINEAR (Fan et al., 2008), in view of the fact that Keerthi and Lin (2003) have shown that once C is larger than a certain value, the obtained models have similar performance. Overall, parameter selection is a crucial issue that should be addressed carefully when applying SVM to WSD. This point is beyond the scope of this paper, but the most intuitive way of addressing it is the one adopted here: performing an exhaustive search over the parameter space to optimize the classification measure.

Finally, it is also interesting to note that the micro-F1 values are consistently higher than the macro-F1 values for all methods on all data sets. This is determined by the skewed class distribution of the data sets and the definition of the F1 measure. The micro-F1 gives an equal weight to all texts, so it is not much affected by the small classes. The macro-F1, however, is an average over the classes, so small classes can drastically affect the score.

6. Conclusion and future work

Kernel methods such as SVM and KPCA are widely applied in the natural language processing community. In this paper, we have explored the use of the semantic diffusion kernel for improving the performance of an SVM classifier in word sense disambiguation. Geometrically, the semantic diffusion kernel models semantic similarities as a diffusion process on a graph whose nodes are the terms and whose edges encode first-order co-occurrence. Diffusion efficiently takes into account all possible paths connecting two nodes, and thereby propagates similarity to contexts that share no terms directly. The superiority of the proposed method is demonstrated experimentally on several SensEval disambiguation tasks.

The following two issues are worth investigating in future work:

1. In a supervised WSD framework, we have the class information of the training data in addition to the co-occurrence information. An inherent limitation of the semantic diffusion kernel is that it fails to exploit this class information. It will be an interesting direction to develop new approaches, or to apply existing techniques such as "sprinkling" (Chakraborti et al., 2006, 2007), to refine the semantic diffusion kernel.

2. It has been shown that the disambiguation performance can be substantially improved by combining several adequate kernels, each kernel in the combination being adequate to the source of information represented by the part of the features that it uses (Giuliano et al., 2009). Instead of using an equal weight for each kernel in the combination, we can learn weights that reflect the importance of the knowledge source that each kernel uses for disambiguation via multiple kernel learning (MKL) (Gönen and Alpaydın, 2011).

Acknowledgments

We would like to thank all the referees for their constructive and insightful comments on this paper. This work is supported in part by the National Natural Science Foundation of China (No. 61202265), the China Postdoctoral Science Foundation Funded Project (No. 2012M510275) and the Natural Science Foundation of Jiangxi Province of China (No. 20114BAB211021).

References

Agirre, E., Lacalle, O.L.D., Martínez, D., 2005. Exploring feature spaces with SVD and unlabeled data for word sense disambiguation. In: Proceedings of the International Conference on Recent Advances in Natural Language Processing, Borovets, Bulgaria.

Bruce, R.F., Wiebe, J., 1994. Word-sense disambiguation using decomposable models. In: Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, USA, pp. 139–146.

Carpuat, M., Wu, D., 2007. Improving statistical machine translation using word sense disambiguation. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Prague, Czech Republic, pp. 61–72.

Chakraborti, S., Lothian, R., Wiratunga, N., Watt, S., 2006. Sprinkling: supervised latent semantic indexing. In: Proceedings of the 28th European Conference on IR Research, London, UK, pp. 510–514.

Chakraborti, S., Wiratunga, N., Lothian, R., Watt, S., 2007. Acquiring word similarities with higher order association mining. In: Proceedings of the Seventh International Conference on Case-Based Reasoning: Case-Based Reasoning Research and Development, Belfast, UK, pp. 61–76.

Chan, Y.S., Ng, H.T., Chiang, D., 2007. Word sense disambiguation improves statistical machine translation. In: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Prague, Czech Republic, pp. 33–40.

Fig. 3. Influence of λ in the semantic diffusion kernel on the classification performance (micro-F1 and macro-F1, in %, plotted against log2 λ, averaged over 10 trials) on the interest corpus. The parameter C was fixed to 1.

Fig. 4. Influence of C on the classification performance (micro-F1 and macro-F1, in %, plotted against log2 C, averaged over 10 trials) on the interest corpus. The parameter λ in the semantic diffusion kernel was fixed to 2^-8.



Cristianini, N., Shawe-Taylor, J., Lodhi, H., 2002. Latent semantic kernels. Journal of Intelligent Information Systems 18 (2–3), 127–152.

Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R., 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41 (6), 391–407.

Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J., 2008. LIBLINEAR: a library for large linear classification. Journal of Machine Learning Research 9, 1871–1874. Software available at http://www.csie.ntu.edu.tw/~cjlin/liblinear.

Ganiz, M.C., George, C., Pottenger, W.M., 2011. Higher order naïve Bayes: a novel non-IID approach to text classification. IEEE Transactions on Knowledge and Data Engineering 23 (7), 1022–1034.

Gärtner, T., 2003. A survey of kernels for structured data. SIGKDD Explorations 5 (1), 49–58.

Giuliano, C., Gliozzo, A., Strapparava, C., 2009. Kernel methods for minimally supervised WSD. Computational Linguistics 35 (4), 513–528.

Gönen, M., Alpaydın, E., 2011. Multiple kernel learning algorithms. Journal of Machine Learning Research 12, 2211–2268.

Hsu, C.W., Chang, C.C., Lin, C.J., 2003. A practical guide to support vector classification. Technical Report, Department of Computer Science, National Taiwan University. http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.

Hsu, C.W., Lin, C.J., 2002. A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks 13 (2), 415–425.

Hussain, S.F., Bisson, G., 2010. Text categorization using word similarities based on higher order co-occurrences. In: Proceedings of the SIAM International Conference on Data Mining, Columbus, USA, pp. 1–12.

Jin, P., Li, F., Zhu, D., Wu, Y., Yu, S., 2008. Exploiting external knowledge sources to improve kernel-based word sense disambiguation. In: Proceedings of the IEEE International Conference on Natural Language Processing and Knowledge Engineering, Beijing, China, pp. 1–8.

Joshi, M., Pedersen, T., Maclin, R., Pakhomov, S., 2006. Kernel methods for word sense disambiguation and acronym expansion. In: Proceedings of the 21st National Conference on Artificial Intelligence, Boston, USA.

Kandola, J., Shawe-Taylor, J., Cristianini, N., 2003. Learning semantic similarity. Advances in Neural Information Processing Systems 15, 657–664.

Keerthi, S.S., Lin, C.J., 2003. Asymptotic behaviors of support vector machines with Gaussian kernel. Neural Computation 15 (7), 1667–1689.

Kondor, R.I., Lafferty, J., 2002. Diffusion kernels on graphs and other discrete structures. In: Proceedings of the 19th International Conference on Machine Learning, Sydney, Australia, pp. 315–322.

Kontostathis, A., Pottenger, W.M., 2006. A framework for understanding latent semantic indexing (LSI) performance. Information Processing and Management 42 (1), 56–73.

Leacock, C., Miller, G.A., Chodorow, M., 1998. Using corpus statistics and WordNet relations for sense identification. Computational Linguistics 24 (1), 147–165.

Leacock, C., Towell, G., Voorhees, E., 1993. Corpus-based statistical sense resolution. In: Proceedings of the ARPA Workshop on Human Language Technology, Plainsboro, USA, pp. 260–265.

Lee, Y.K., Ng, H.T., Chia, T.K., 2004. Supervised word sense disambiguation with support vector machines and multiple knowledge sources. In: Proceedings of Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, Barcelona, Spain, pp. 137–140.

Lemaire, B., Denhière, G., 2006. Effects of high-order co-occurrences on word semantic similarity. Current Psychology Letters 18, 1.

Navigli, R., 2009. Word sense disambiguation: a survey. ACM Computing Surveys 41 (2), 1–69.

Nguyen, K.H., Ock, C.Y., 2011. Word sense disambiguation as a traveling salesman problem. Artificial Intelligence Review, http://dx.doi.org/10.1007/s10462-011-9288-9 (online first).

Pahikkala, T., Pyysalo, S., Boberg, J., Järvinen, J., Salakoski, T., 2009. Matrix representations, linear transformations, and kernels for disambiguation in natural language. Machine Learning 74 (2), 133–158.

Popescu, M., 2004. Regularized least-squares classification for word sense disambiguation. In: Proceedings of Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, Barcelona, Spain, pp. 209–212.

Rifkin, R., Klautau, A., 2004. In defense of one-vs-all classification. Journal of Machine Learning Research 5, 101–141.

Rohde, D.L.T., Gonnerman, L.M., Plaut, D.C., 2006. An improved model of semantic similarity based on lexical co-occurrence. Communications of the ACM 8, 627–633.

Shawe-Taylor, J., Cristianini, N., 2004. Kernel Methods for Pattern Analysis. Cambridge University Press, New York.

Simek, K., Fujarewicz, K., Świerniak, A., Kimmel, M., Jarzab, B., Wiench, M., Rzeszowska, J., 2004. Using SVD and SVM methods for selection, classification, clustering and modeling of DNA microarray data. Engineering Applications of Artificial Intelligence 17 (4), 417–427.

Stokoe, C., Oakes, M.P., Tait, J., 2003. Word sense disambiguation in information retrieval revisited. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada, pp. 159–166.

Su, W., Carpuat, M., Wu, D., 2004. Semi-supervised training of a kernel PCA-based model for word sense disambiguation. In: Proceedings of the 20th International Conference on Computational Linguistics, Geneva, Switzerland.

Tian, Y., Huang, T., Gao, W., 2006. Latent linkage semantic kernels for collective classification of link data. Journal of Intelligent Information Systems 26 (3), 269–301.

Vapnik, V.N., 1995. The Nature of Statistical Learning Theory. Springer-Verlag, New York.

Wang, Y., Su, L., Tian, Y., 2009. KernelRank: exploiting semantic linkage kernels for relevant pages finding. Chinese Journal of Electronics 18 (3), 405–410.

Wu, D., Su, W., Carpuat, M., 2004. A kernel PCA method for superior word sense disambiguation. In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, Spain, pp. 637–644.

Zhong, Z., Ng, H.T., 2012. Word sense disambiguation improves information retrieval. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, Jeju, Republic of Korea, pp. 273–282.
