
Prior Disambiguation of Word Tensors for Constructing Sentence Vectors

Dimitri Kartsaklis
University of Oxford
Department of Computer Science
Wolfson Building, Parks Road
Oxford, OX1 3QD, UK
[email protected]

Mehrnoosh Sadrzadeh
Queen Mary University of London
School of Electronic Engineering and Computer Science
Mile End Road
London, E1 4NS, UK
[email protected]

Abstract

Recent work has shown that compositional-distributional models using element-wise operations on contextual word vectors benefit from the introduction of a prior disambiguation step. The purpose of this paper is to generalise these ideas to tensor-based models, where relational words such as verbs and adjectives are represented by linear maps (higher order tensors) acting on a number of arguments (vectors). We propose disambiguation algorithms for a number of tensor-based models, which we then test on a variety of tasks. The results show that disambiguation can provide better compositional representations even for the case of tensor-based models. Furthermore, we confirm previous findings regarding the positive effect of disambiguation on vector mixture models, and we compare the effectiveness of the two approaches.

1 Introduction

Distributional models of meaning have proved extremely useful for a number of natural language processing tasks, ranging from thesaurus extraction (Curran, 2004) to topic modelling (Landauer and Dumais, 1997) and information retrieval (Manning et al., 2008), to name just a few. These models are based on the distributional hypothesis of Harris (1968), which states that the meaning of a word depends on its context. This idea allows words to be represented by vectors of statistics collected from a sufficiently large corpus of text; each element of the vector reflects how many times a word co-occurs in the same context with another word of the vocabulary. However, due to the generative power of natural language, which can produce infinitely many new structures from a finite set of resources (words), no text corpus, regardless of its size, can provide reliable distributional representations for anything longer than single words or perhaps very short phrases consisting of two words; in other words, this technique cannot scale up to the phrase or sentence level.

Much research activity has recently been dedicated to providing a solution to this problem: although the direct construction of a sentence vector is not possible, we might still be able to create such a vectorial representation synthetically, by somehow composing the vectors of the words that comprise the sentence. Towards this goal, researchers have employed a variety of approaches that roughly fall into two general categories. Following the influential work of Mitchell and Lapata (2008), the models in the first category compute a sentence vector as a mixture of the original word vectors, using simple operations such as element-wise multiplication and addition; we refer to these models as vector mixtures. The main characteristic of these models is that they do not distinguish between the type-logical identities of the different words: an intransitive verb, for example, is of the same order as its subject (a noun), and both will contribute equally to the composite sentence vector.

However, this symmetric treatment of composition seems unjustified from a formal semantics point of view. Words with special meanings, such as verbs and adjectives, are usually seen as functions acting on, and hence modifying, a number of arguments, rather than as lexical units of the same order as those arguments; an adjective, for example, is a function that returns a modified version of its input noun.

Inspired by this view, which is more closely aligned with formal semantics, a second research direction aims to represent relational words as linear maps (tensors of various orders) that can be applied to one or more arguments (vectors). Baroni and Zamparelli (2010), for example, model adjectives as matrices which, when matrix-multiplied with a noun vector, produce a vectorial representation of the specific adjective-noun compound. The notion of a framework where relational words are entities living in vector spaces of higher order than nouns, which are simple vectors, has been formalized by Coecke et al. (2010) in the context of the abstract mathematical framework of compact closed categories. We refer to this class of models as tensor-based.

Regardless of the way they approach the representation of relational words and their composition operation, however, most current compositional-distributional models share a common feature: they all rely on ambiguous vector representations, where all the senses of a polysemous word, such as the verb 'file' (which can mean register or smooth), are merged into the same vector or tensor. At least for the vector mixture approach, this practice has been shown to be suboptimal: Reddy et al. (2011) and Kartsaklis et al. (2013) test a number of simple multiplicative and additive models using disambiguated vector representations on various tasks, showing that the introduction of a disambiguation step prior to the actual composition can indeed increase the quality of the composite vectors. However, the fact that disambiguation can be beneficial for models based on vector mixtures is not very surprising. Both additive and multiplicative compositions are but a kind of average of the vectors of the words in the sentence, and hence can directly benefit from the provision of more accurate starting points. Perhaps a more interesting question, and one that the current paper aims to address, is to what extent disambiguation can also provide benefits for tensor-based approaches, which in general constitute more powerful models for natural language (see discussion in Section 2).

Specifically, this paper aims to: (a) propose disambiguation algorithms for a number of tensor-based distributional models; (b) examine the effect of disambiguation on tensors for relational words; and (c) meaningfully compare the effectiveness of tensor-based against vector mixture models in a number of tasks. Based on the generic procedure of Schutze (1998), we propose algorithms for a number of tensor-based models, where composition is modelled as the application of linear maps (tensor contractions). Following Mitchell and Lapata (2008) and many others, we test our models on two disambiguation tasks similar to that of Kintsch (2001), and on the phrase similarity task introduced in (Mitchell and Lapata, 2010). In almost every case, the results show that disambiguation can make a great difference in the case of tensor-based models; they also reconfirm previous findings regarding the effectiveness of the method for simple vector mixture models.

2 Vectors vs tensors

The simple models of Mitchell and Lapata (2008) constitute the easiest and perhaps the most intuitive way of composing two or more vectors: each element of the resulting vector is computed as the sum or the product of the corresponding elements in the input vectors (left part of Figure 1). In the case of addition, the components of the output vector are simply the cumulative scores of the corresponding input components, so in a sense the output element embraces both input elements, resembling a union of the input features. On the other hand, the element-wise multiplication of two vectors can be seen as the intersection of their features: a zero element in one of the input vectors will eliminate the corresponding feature in the output, no matter how high the other input component was. In addition to failing to identify the special roles of words in a sentence, vector mixture models disregard grammar in another way: the commutativity of the operators makes them a bag-of-words approach, where the meaning of the sentence 'dog bites man' is equated to that of 'man bites dog'.

Figure 1: Vector mixture and tensor-based models for composition. In the latter approach, the ith element of the output vector is the linear combination of the input vector with the ith row of the matrix.

In contrast to the above element-wise treatment, a compositional approach based on linear maps computes each element of the resulting vector via a linear combination of all the elements of the input vector (right part of Figure 1); in other words, possible interdependencies between different features are also taken into account, offering (in principle) more power. Furthermore, by design, the bag-of-words problem is not present here. Overall, tensor-based models offer a more complete and linguistically motivated solution to the problem of composition. For example, one can consider building linear maps for prepositions and logical words, rather than treating them as noise and discarding them, as is commonly done in vector mixture models.
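As a rough illustration of this difference, the following is a minimal sketch with toy random data standing in for corpus statistics (none of it is the authors' actual code):

```python
import numpy as np

D = 4
noun = np.random.rand(D)        # contextual vector for a noun, e.g. 'dog'
verb_vec = np.random.rand(D)    # contextual vector for an intransitive verb

# Vector mixture models: element-wise operations on the two vectors.
additive = noun + verb_vec          # union-like: features accumulate
multiplicative = noun * verb_vec    # intersection-like: a zero kills a feature

# Tensor-based model: the verb is an order-2 tensor (matrix) acting on its
# subject; each output element is a linear combination of all input elements.
verb_mat = np.random.rand(D, D)
tensor_based = verb_mat @ noun

# The mixtures are commutative (bag-of-words); the linear map is not symmetric.
assert np.allclose(noun * verb_vec, verb_vec * noun)
```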

3 Disambiguation in vector mixtures

For a compositional model based on vector mixtures, the polysemy of words can be a critical factor. Pulman (2013) and Kartsaklis et al. (2013) point out that the element-wise combination of "ambiguous" vectors produces results that are hard to interpret; the composed vector is not a purely compositional representation but the product of two tasks that take place in parallel: composition and some amount of disambiguation that emerges as a side-effect of the compositional process, leaving the resulting vector in an intermediate state.

This effect is demonstrated in Figure 2, which shows the composition of the ambiguous verb 'run' (with meanings moving fast and dissolving) with the subject 'horse'. The first three components of our toy vector space are related to the dissolving meaning, while the last three of them to the moving fast meaning. An ambiguous vector for 'run' will have non-zero values for every component. On the other hand, we would expect the vector for 'horse' to have high values for the 'race', 'gallop', and 'move' components, and very low values (but not necessarily zero) for the dissolving-related ones; it is always possible for the word 'horse' to appear in the same context with the word 'painting', for example.

Figure 2: The effect of disambiguation on vector composition. The numbers are (artificial) co-occurrence counts of each target word with six basis words: colour, dissolve, painting, race, gallop, move.

The left part of Figure 2 shows what happens when the ambiguous 'run' vector is used; the multiplication with the 'horse' vector will produce an impure result, half affected by composition and half by disambiguation. However, what we really want is a vector where all the dissolving-related components are eliminated, since they are irrelevant to the way the word 'run' is used in the sentence. In order to achieve this, we have to introduce a disambiguation step prior to composition (right part of Figure 2).
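The toy numbers of Figure 2 can be reproduced directly; the sketch below uses the figure's artificial counts (the variable names are ours):

```python
import numpy as np

# Basis words: colour, dissolve, painting, race, gallop, move (as in Figure 2).
horse = np.array([1, 0, 2, 6, 13, 7])
run_ambiguous = np.array([5, 9, 4, 11, 8, 15])      # both senses merged
run_disambiguated = np.array([0, 0, 0, 11, 8, 15])  # 'moving fast' sense only

print(run_ambiguous * horse)      # [  5   0   8  66 104 105] -- impure result
print(run_disambiguated * horse)  # [  0   0   0  66 104 105] -- dissolving features gone
```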

These ideas are experimentally verified in the work of Reddy et al. (2011) and Kartsaklis et al. (2013); Pulman (2013) also presents a comprehensive analysis of the problem. What remains to be seen is whether disambiguation can also provide benefits for the linguistically motivated setting of tensor-based models, the principles of which are briefly discussed in the next section.

4 Tensors as multilinear maps

A tensor is a geometric object that can be seen as the generalization of the familiar notion of a vector to higher dimensions. The order of a tensor is the number of its dimensions; in other words, the number of indices we need to fully describe an arbitrary element of the tensor. Hence, a vector is a tensor of order 1, a matrix is a tensor of order 2, and so on. Tensors and multilinear maps stand in one-to-one correspondence, as stated by the following well-known "map-state" isomorphism (Bourbaki, 1989):

f : V_1 \to \cdots \to V_j \to V_k \;\cong\; V_k \otimes V_j \otimes \cdots \otimes V_1    (1)

This offers an elegant way to adopt a formal semantics view of natural language in vector spaces. Let nouns live in a basic vector space N ∈ R^D; returning to our previous example, an adjective can then be seen as a map f : N → N, which is isomorphic to N ⊗ N (that is, to a matrix). In general, the order of the tensor is equal to the number of arguments plus one dimension that carries the result; so a unary function (e.g. adjectives, intransitive verbs) is represented by a tensor of order 2 (a matrix), a binary function (e.g. a transitive verb) by an order-3 tensor, and so on. Due to the above isomorphism, function application (and hence our compositional operation) becomes a generalisation of matrix multiplication, formalised in terms of the inner product. In the case of a unary relational word, such as an adjective, this is nothing more than the usual notion of matrix multiplication between a matrix and a vector. The generalization of this process to tensors of higher order is known as tensor contraction. Given two tensors of orders n and m, the tensor contraction operation will always produce a tensor of order n + m − 2.

Let us see an example of how this works for a simple transitive sentence. Let V ∈ R^{I×J×K} be the order-3 tensor for the verb and S ∈ R^I, O ∈ R^K the order-1 tensors (vectors) for the subject and the object of the verb, respectively. Then V × O will return a new tensor living in R^{I×J} (i.e. a matrix);¹ a further interaction of this result with the subject will return a vector for the whole transitive sentence, living in R^J. We should note that the order in which the verb is applied to its arguments is not important, so in general the meaning of a transitive sentence is given by:

(V \times O)^T \times S = (V^T \times S) \times O    (2)

where T denotes a transpose and makes the indices match, since the subject precedes the verb.

¹The symbol × denotes tensor contraction.
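A minimal sketch of Equation 2 with toy dimensionalities (assumed names, expressed here with np.einsum rather than any particular tensor library):

```python
import numpy as np

I, J, K = 3, 4, 5                # toy dimensionalities
V = np.random.rand(I, J, K)      # order-3 verb tensor
S = np.random.rand(I)            # subject vector
O = np.random.rand(K)            # object vector

# Contract with the object first, then with the subject ...
vo = np.einsum('ijk,k->ij', V, O)       # V x O: an I x J matrix
sent1 = np.einsum('ij,i->j', vo, S)     # ... then S: a J-dimensional sentence vector

# ... or with the subject first, then the object; the result is the same.
vs = np.einsum('ijk,i->jk', V, S)
sent2 = np.einsum('jk,k->j', vs, O)

assert np.allclose(sent1, sent2)        # (V x O)^T x S = (V^T x S) x O
```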

5 Creating verb tensors

In this section we review a number of proposals regarding concrete methods of constructing tensors for relational words in the context of the frameworks of Coecke et al. (2010) and Baroni and Zamparelli (2010), which both comply with the setting of Section 4.²

²In what follows we use the case of a transitive verb as an example; however, the descriptions apply to any relational word of any arity. A vector (an order-1 tensor) is denoted as \overrightarrow{x}; tensors of order n > 1 are shown as x^n for clarity.

Relational  Following ideas from the set-theoretic view of formal semantics, Grefenstette and Sadrzadeh (2011a) suggest that the meaning of a relational word should be represented as the sum of its arguments. The meaning of the adjective 'red', for example, becomes the sum of the vectors of all the nouns that 'red' modifies in the corpus; so \overrightarrow{red} = \sum_i \overrightarrow{noun_i}, where i iterates through all the occurrences of 'red'. This can be generalised to relational words of any arity by summing the tensor product of their arguments. So for a transitive verb we have:

verb^2 = \sum_i (\overrightarrow{subj_i} \otimes \overrightarrow{obj_i})    (3)

where i again iterates over all occurrences of the specific verb in the corpus and the superscript denotes the order of the tensor.

In order to achieve a more expressive representation for the sentences, the authors used the convention that the arity of the head word in a sentence also determines the order of the sentence space; that is, the space of intransitive sentences is of order 1, that of transitive ones of order 2, and so on. Recall from Section 4 that for the transitive case this increases the order of the verb tensor to 4 (2 dimensions for the arguments plus another 2 for the result). In spite of this, however, note that the method of Equation 3 produces a matrix. The other two dimensions of the tensor remain empty (filled with zeros), a fact that simplifies the calculations but also considerably weakens the expressive power of the model. This simplification transforms Equation 2 into the following:

subj\,verb\,obj^2 = (\overrightarrow{subj} \otimes \overrightarrow{obj}) \odot verb^2    (4)

where ⊗ denotes the tensor product and ⊙ element-wise multiplication.
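A sketch (toy data, assumed variable names) of Equations 3 and 4 for a single transitive verb:

```python
import numpy as np

D = 4
# (subject, object) vector pairs observed with the verb across the corpus.
subj_obj_pairs = [(np.random.rand(D), np.random.rand(D)) for _ in range(10)]

# Equation (3): the verb matrix is the sum of subject x object outer products.
verb_matrix = sum(np.outer(s, o) for s, o in subj_obj_pairs)

# Equation (4): for a concrete sentence, take the outer product of its subject
# and object and multiply element-wise with the verb matrix.
subj, obj = np.random.rand(D), np.random.rand(D)
sentence_matrix = np.outer(subj, obj) * verb_matrix
```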

Kronecker  In subsequent work (Grefenstette and Sadrzadeh, 2011b), the same team proposes the creation of a verb matrix as the Kronecker product of the verb's contextual vector with itself:

verb^2 = \overrightarrow{verb} \otimes \overrightarrow{verb}    (5)

Again in this model the sentence space is of order 2, and the meaning of a transitive sentence is calculated using Equation 4.

Frobenius  The previous models carry the important limitation that only sentences of the same structure can be meaningfully compared; it is not possible, for example, to compare an intransitive sentence (e.g. 'kids play') with a transitive one ('children play football'), since the former is a vector and the latter a matrix. Using Frobenius algebras, Kartsaklis et al. (2012) provide a unified sentence space for every sentence regardless of its type. These models turn the matrix of Equation 3 into a tensor of order 3 (as required by the type-logical identities) by copying one of the existing dimensions. When the dimension of rows (corresponding to subjects) is copied, the calculation of a vector for a transitive sentence becomes:

\overrightarrow{subj\,verb\,obj} = \overrightarrow{subj} \odot (verb^2 \times \overrightarrow{obj})    (6)

Copying the column dimension (objects) gives:

\overrightarrow{subj\,verb\,obj} = \overrightarrow{obj} \odot ((verb^2)^T \times \overrightarrow{subj})    (7)
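A sketch (toy data, assumed names) of the two Frobenius compositions of Equations 6 and 7, which reduce to element-wise products with the copied argument:

```python
import numpy as np

D = 4
verb_matrix = np.random.rand(D, D)   # built as in Equation (3)
subj, obj = np.random.rand(D), np.random.rand(D)

copy_subject = subj * (verb_matrix @ obj)     # Eq. (6): copy the row dimension
copy_object = obj * (verb_matrix.T @ subj)    # Eq. (7): copy the column dimension
```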

Linear regression  None of the above models create tensors that are fully populated: one or more dimensions will always remain empty. Following an idea first introduced by Baroni and Zamparelli (2010) for the creation of adjective matrices, Grefenstette et al. (2013) use linear regression in order to learn full tensors of order 3 for transitive verbs. Linear regression is a supervised method of learning, so it needs a number of exemplar data points. In the case of the adjective 'red', for example, we would need a set of pairs of the form ⟨\overrightarrow{car}, \overrightarrow{red\,car}⟩, ⟨\overrightarrow{shirt}, \overrightarrow{red\,shirt}⟩, ⟨\overrightarrow{shoe}, \overrightarrow{red\,shoe}⟩ and so on, where the second vector in each pair is the contextual vector of the whole phrase, created exactly as if it were a single word. The goal of the learning process is to find the parameters adj^2 and \overrightarrow{b} such that:

\overrightarrow{adj\,noun} \approx adj^2 \times \overrightarrow{noun} + \overrightarrow{b}    (8)

for all nouns modified by the specific adjective. In practice, the bias \overrightarrow{b} is embedded in adj^2, hence the above procedure provides us with a matrix for the adjective. One can generalize this procedure to tensors of higher order by proceeding step-wise, as done by Grefenstette et al. (2013). For the case of a transitive verb, they first use exemplar pairs of the form ⟨\overrightarrow{subj}, \overrightarrow{subj\,verb\,obj}⟩ to learn a matrix verb\,obj^2 for the verb phrase; then, they perform a new training session with exemplars of the form ⟨\overrightarrow{obj}, verb\,obj^2⟩, the result of which is an order-3 tensor for the verb.
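A minimal sketch of the adjective-matrix case of Equation 8, using an ordinary least-squares fit over toy data (the names and the way the bias is folded in are our assumptions, not the released implementation):

```python
import numpy as np

D, n_pairs = 4, 50
nouns = np.random.rand(n_pairs, D)      # contextual vectors of the nouns
phrases = np.random.rand(n_pairs, D)    # contextual vectors of the 'adj noun' phrases

# Append a constant dimension so the bias of Eq. (8) is absorbed into the matrix.
X = np.hstack([nouns, np.ones((n_pairs, 1))])
W, *_ = np.linalg.lstsq(X, phrases, rcond=None)   # least-squares fit
adj_matrix, bias = W[:-1].T, W[-1]

new_noun = np.random.rand(D)
predicted_phrase = adj_matrix @ new_noun + bias   # approximates the phrase vector
```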

6 Generic context-based disambiguation

In all of the models of Section 5, the training of a relational word's tensor is based on the set of contexts where this word occurs. Hence, in these models the problem of creating disambiguated versions of tensors can be recast as that of further partitioning the set of contexts, so that each subset reflects a different sense of the word in the corpus. If, for example, S is the whole set of sentences in which a word w occurs in the corpus under n different senses, then the goal is to create n subsets S_1, ..., S_n such that S_1 contains the sentences where w appears under the first sense, S_2 the sentences where w occurs under the second sense, and so on. Each one of these subsets can then be used to train a tensor for a specific sense of the target relational word.

Towards this purpose we use a variation of the effective procedure of Schutze (1998): first, each context for a target word w_t is represented by a context vector of the form \frac{1}{n}(\overrightarrow{w_1} + \ldots + \overrightarrow{w_n}), where \overrightarrow{w_i} is the lexical vector of some other word w_i ≠ w_t in the same context. Next, we apply a clustering method to this set of vectors in order to discover the latent senses of w_t. The assumption is that the contexts of w_t will vary according to the specific sense in which this word is used: 'bank' as a financial institution should appear in quite different contexts than 'bank' as land.

The above procedure will give us a number of clusters, each consisting of context vectors; we use the centroid of each cluster as a vectorial representation of the corresponding sense. So in our model each word w is initially represented by a tuple ⟨\overrightarrow{w}, S⟩, where \overrightarrow{w} is the lexical vector of the word as created by the usual distributional practice, and S is a set of sense vectors (centroids of context vector clusters) produced by the above procedure. The disambiguation of a new word w under a context C can now be accomplished as follows: we create a context vector \overrightarrow{c} for C as above, and we compare it with every sense vector of w; the word is assigned to the sense corresponding to the closest sense vector. Specifically, if S_w is the set of sense vectors for w, \overrightarrow{c} the context vector for C, and d(\overrightarrow{v}, \overrightarrow{u}) our vector distance measure, the preferred sense s is given by:

s = \arg\min_{\overrightarrow{v_s} \in S_w} d(\overrightarrow{v_s}, \overrightarrow{c})    (9)

For the actual clustering step we follow the setting of Kartsaklis et al. (2013), which worked well in tasks very similar to ours. Specifically, we perform hierarchical agglomerative clustering (HAC) using Ward's method as the inter-cluster distance, while the distance between vectors is measured with Pearson's correlation.³ In the above work, this configuration has been found to return the highest V-measure (Rosenberg and Hirschberg, 2007) on the noun set of the SEMEVAL 2010 Word Sense Induction & Disambiguation Task (Manandhar et al., 2010). As the context for a word, we consider the sentence in which this word occurs. The output of HAC is a dendrogram embedding all the possible partitionings of the data. In order to select the optimal partitioning, we rely on the Calinski/Harabasz index (Calinski and Harabasz, 1974), also known as the variance ratio criterion (VRC). The VRC is calculated as the ratio of the sum of the inter-cluster variances over the sum of the intra-cluster variances, reflecting the intuition that the optimal partitioning should be the one that results in the most compact and maximally separated clusters. We compute the VRC for a range of different partitionings (from 2 to 10 clusters) and keep the partitioning with the highest score.

³Informal experimentation with more robust probabilistic techniques, such as Dirichlet process Gaussian mixture models, revealed no significant benefits for our setting.
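The whole sense-induction and sense-assignment step can be sketched as follows (a rough approximation with assumed helper names; plain scipy stands in for the FASTCLUSTER library, and Ward linkage is applied to correlation distances as described above):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

def variance_ratio(vectors, labels):
    """Calinski/Harabasz index: between-cluster over within-cluster variance."""
    overall = vectors.mean(axis=0)
    between, within = 0.0, 0.0
    for k in np.unique(labels):
        members = vectors[labels == k]
        centroid = members.mean(axis=0)
        between += len(members) * np.sum((centroid - overall) ** 2)
        within += np.sum((members - centroid) ** 2)
    n, n_clusters = len(vectors), len(np.unique(labels))
    return (between / (n_clusters - 1)) / (within / (n - n_clusters))

def induce_senses(context_vectors):
    """Cluster the context vectors of a target word and return sense centroids."""
    dists = pdist(context_vectors, metric='correlation')  # 1 - Pearson correlation
    Z = linkage(dists, method='ward')                     # HAC with Ward's method
    best = max((fcluster(Z, k, criterion='maxclust') for k in range(2, 11)),
               key=lambda labels: variance_ratio(context_vectors, labels))
    return [context_vectors[best == k].mean(axis=0) for k in np.unique(best)]

def disambiguate(sense_vectors, context_vector):
    """Equation (9): index of the sense whose centroid is closest to the context."""
    dists = [1 - np.corrcoef(s, context_vector)[0, 1] for s in sense_vectors]
    return int(np.argmin(dists))
```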

7 Constructing unambiguous verb tensors

The procedure described in Section 6 provides us with n clusters of context vectors for a target word. Since in our case each context vector corresponds to a distinct sentence, the output of the clustering scheme can also be seen as n subsets of sentences, in which the word appears under different senses. It is now quite straightforward to use this partitioning of the training corpus in order to learn unambiguous versions of verb tensors, as detailed below.

Relational/Frobenius  Both the Relational and the Frobenius models use the same way of creating an initial verb matrix (Equation 3), which they then expand to a higher order tensor. Let S_1, ..., S_n be the sets of sentences returned by the clustering step for a verb. Then, the verb tensor for the ith sense is:

verb^2_i = \sum_{s \in S_i} (\overrightarrow{subj_s} \otimes \overrightarrow{obj_s})    (10)

where subj_s and obj_s refer to the subject/object pair that occurred with the verb in sentence s. This can be generalized to any arity n as follows:

word^n_i = \sum_{s \in S_i} \bigotimes_{k=1}^{n} \overrightarrow{arg_{k,s}}    (11)

where arg_{k,s} denotes the kth argument of the target word in sentence s.

Kronecker  For a given verb v in a context C, let \overrightarrow{v_i} be the sense vector of v given C, corresponding to the sense i returned by Equation 9. Then we have:

verb^2_i = \overrightarrow{v_i} \otimes \overrightarrow{v_i}    (12)

The generalized version for arity n is given by:

word^n_i = \bigotimes_{k=1}^{n} \overrightarrow{v_i}    (13)
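A sketch (assumed data structures, toy data) of how one sense-specific verb matrix would be built under Equations 10 and 12:

```python
import numpy as np

def relational_sense_matrix(sense_sentences):
    # Eq. (10): sum of subject x object outer products over one sense cluster;
    # each element of sense_sentences is assumed to be a (subject, object) pair.
    return sum(np.outer(subj, obj) for subj, obj in sense_sentences)

def kronecker_sense_matrix(sense_vector):
    # Eq. (12): Kronecker product of the verb's sense vector with itself.
    return np.outer(sense_vector, sense_vector)

# Toy usage for a verb with one induced sense cluster of five sentences:
D = 4
cluster = [(np.random.rand(D), np.random.rand(D)) for _ in range(5)]
verb_sense_relational = relational_sense_matrix(cluster)
verb_sense_kronecker = kronecker_sense_matrix(np.random.rand(D))
```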

Linear regression  Creating unambiguous full tensors using linear regression is also quite straightforward. Let us assume again that the clustering step for a verb v returns n sets of sentences S_1, ..., S_n, where each sentence set corresponds to a different sense. Then we have n different regression problems, each of which is trained on exemplar pairs derived exclusively from the sentences of the corresponding set. This results in n verb tensors, which correspond to the different senses of the verb. Generalization to higher arities is a straightforward extension of the step-wise process of Section 5 for transitive verbs.

8 Experiments

In this section we will test the effect of disambiguation on the models of Section 5 in a variety of tasks. Due to the significant methodological differences of the linear regression model from the other approaches and the variety of its set of parameters, we decided that it would be better if this was left as the subject of a distinct work.

Experimental setting  We train our vectors using ukWaC (Ferraresi et al., 2008), a corpus of English text with 2 billion words (100m sentences). We use 2000 dimensions, with weights calculated as the ratio of the probability of the context word given the target word to the probability of the context word overall. The context here is a 5-word window on both sides of the target word. The vectors are disambiguated both syntactically and semantically: first, separate vectors have been created for different syntactic usages of the same word in the corpus; for example, the word 'book' has two vectors, one for its noun sense and one for its verb sense. Furthermore, each word is semantically disambiguated according to the method of Section 6.
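The weighting scheme just described can be written down directly; a toy sketch over an assumed co-occurrence count matrix:

```python
import numpy as np

# counts[t, c]: how often context word c appears in the 5-word window of target t.
counts = np.random.randint(0, 50, size=(1000, 2000)).astype(float)

p_context_given_target = counts / counts.sum(axis=1, keepdims=True)
p_context = counts.sum(axis=0) / counts.sum()
weights = p_context_given_target / p_context   # the ratio described in the text
```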

Models  We compare the tensor-based models of Section 5 with the multiplicative and additive models of Mitchell and Lapata (2008), reporting results for both ambiguous and disambiguated versions. For all the disambiguated models, the best sense for each word in the sentence or phrase is first selected by applying the procedure of Section 6 and Equation 9. If the model is based on a vector mixture, the sense vectors corresponding to these senses are multiplied or added to form the composite representation of the sentence or phrase. For the tensor-based models, the composite meanings are calculated according to the equations of Section 5, using verb tensors created by the procedures of Section 7. The semantic similarity of two phrases or sentences is measured as the cosine distance between their composite vectors. For models that return a matrix (e.g. Relational, Kronecker), the distance is based on the Frobenius inner product.
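For concreteness, the two similarity measures can be sketched as below (the normalisation of the Frobenius inner product is our assumption, made so that matrix and vector scores stay comparable):

```python
import numpy as np

def cosine_similarity(u, v):
    # Similarity between two composite sentence/phrase vectors.
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def frobenius_similarity(A, B):
    # Frobenius inner product between two composite matrices, normalised.
    return np.sum(A * B) / (np.linalg.norm(A) * np.linalg.norm(B))
```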

Implementation details  Our code is mainly written in Python and C++, and for the actual clustering step we use the Python interface of the efficient FASTCLUSTER library (Mullner, 2013). On a shared 24-core Xeon machine with 72 GB of memory, and with a fair amount of parallelism applied, the average processing time per word was about 4 minutes; this roughly translates to 12-13 hours of training on average per dataset.

8.1 Verb disambiguation task

Perhaps surprisingly, one of the most popular tasks for testing compositionality in distributional models is based on disambiguation. This task, originally introduced by Kintsch (2001), has been adopted by Mitchell and Lapata (2008) and others for evaluating the quality of composition in vector spaces. Given an ambiguous verb such as 'file', the goal is to find out to what extent the presence of an appropriate context will disambiguate its intended meaning. The context (e.g. a subject/object pair) is composed with two landmark verbs corresponding to the different senses ('smooth' and 'register') to create simple sentences. The assumption is that a good compositional model should be able to reflect that 'woman files application' is closer to 'woman registers application' than to 'woman smooths application'.

In this paper we test our models on two different datasets of transitive sentences, that of Grefenstette and Sadrzadeh (2011a) and that of Kartsaklis et al. (2013).⁴ Specific details about the creation of the datasets can be found in the above papers; for the purposes of the current work it is sufficient to mention that their main difference is that in the former the verbs and their alternative meanings have been selected automatically using the JCN metric of semantic similarity (Jiang and Conrath, 1997), while in the latter the selection was based on human judgements from the work of Pickering and Frisson (2001). So, while in the first dataset many verbs cannot be considered genuinely ambiguous (e.g. 'say' with meanings state and allege, or 'write' with meanings publish and spell), the landmarks in the second dataset correspond to clearly separated senses (e.g. 'file' with meanings register and smooth, or 'charge' with meanings accuse and bill). Furthermore, the subjects and objects of this latter dataset are modified by appropriate adjectives, overall creating a richer and more linguistically balanced dataset.

⁴This dataset was created by Mehrnoosh Sadrzadeh in collaboration with Edward Grefenstette, but remained unpublished until (Kartsaklis et al., 2013).

In both cases the evaluation methodology is the same: each entry of the dataset has the form ⟨subject, verb, object, high-sim landmark, low-sim landmark⟩. The context is combined with the verb and the two landmarks, creating three simple transitive sentences. The main-verb sentence is paired with both landmark sentences, and these pairs are randomly presented to human evaluators, whose task is to rate the similarity of the sentences within a pair on a scale from 1 to 7. The scores of the compositional models are the cosine distances (or the Frobenius inner products, in the case of matrices) between the composite representations of the sentences of each pair. As an overall score for each model, we report its Spearman's ρ correlation with the human judgements. Both datasets consist of 200 pairs of sentences (10 main verbs × 2 landmarks × 10 contexts).

Results  The results for the G&S dataset are shown in Table 1.⁵ The verbs-only model (BL) refers to a non-compositional evaluation, where the similarity between two sentences is based solely on the distance between the two verbs, without applying any compositional step with subject and object.

⁵For all tables in this section, ≪ and ≫ denote highly statistically significant differences with p < 0.001.

The tensor-based models perform much better than the vector mixture ones, with the disambiguated version of the copy-object model significantly higher than the relational model. By design, the copy-object model retains more information about the objects; this result therefore confirms previous findings that in this particular dataset the role of objects is more important than that of subjects (Kartsaklis et al., 2012). In general, the disambiguation step improves the results of all the tensor-based models except Kronecker; the effect is reversed for the vector mixture models, where the disambiguated versions perform much worse (these findings are further discussed in Section 9).

Model                Ambig.       Disamb.
BL  Verbs only        0.198   ≫   0.132
M1  Multiplicative    0.137   ≫   0.044
M2  Additive          0.127   ≫   0.047
T1  Relational        0.219   <    0.223
T2  Kronecker         0.207   ≫   0.061
T3  Copy-subject      0.070   ≪   0.122
T4  Copy-object       0.241   ≪   0.262
Human agreement       0.599
Difference between T4 and T1 is s.s. with p < 0.001.

Table 1: Results for the G&S dataset.

Model                Ambig.       Disamb.
BL  Verbs only        0.151   ≪   0.217
M1  Multiplicative    0.131   <    0.137
M2  Additive          0.085   ≪   0.193
T1  Relational        0.036   ≪   0.121
T2  Kronecker         0.159   <    0.166
T3  Copy-subject      0.035   ≪   0.117
T4  Copy-object       0.033   ≪   0.095
Human agreement       0.383
Difference between BL and M2 is s.s. with p < 0.001.

Table 2: Results for the Kartsaklis et al. dataset.

The result of disambiguation is clearer for the dataset of Kartsaklis et al. (Table 2). The longer context, in combination with genuinely ambiguous verbs, produces two effects: first, disambiguation is now helpful for all models, whether vector mixtures or tensor-based; second, the disambiguation of just the verb (verbs-only model), without any interaction with the context, is sufficient to provide the best score (0.22), with a difference statistically significant from the second-best model (0.19 for disambiguated additive). In fact, further composition of the verb with the context decreases performance, confirming the results reported by Kartsaklis et al. (2013) for vectors trained on the BNC. Given the nature of the specific task, which is designed around the ambiguity of the verb, this result is not surprising: a direct disambiguation of the verb based on the rest of the context should naturally constitute the best method for achieving top performance; no composition is necessary for this task to be successful.

However, when one uses a task like this to evaluate compositional models (as we do here and as is commonly the case), one implicitly correlates the strength of the disambiguation effect that takes place during composition with the quality of the composition, essentially assuming that the stronger the disambiguation, the better the compositional model that produced this side-effect. Unfortunately, the extent to which this assumption is valid is still not quite clear; the subject is addressed in more detail in (Kartsaklis et al., 2013). Keeping this observation in mind, we now proceed to examine the performance of our models on a task that does not use disambiguation as a criterion of composition.

8.2 Phrase/sentence similarity task

Our second set of experiments is based on the phrase similarity task of Mitchell and Lapata (2010). In contrast to the task of Section 8.1, this one does not involve any assumptions about disambiguation, and thus it seems a more genuine test of models aiming to provide appropriate phrasal or sentential semantic representations; the only criterion is the degree to which these models correctly evaluate the similarity between pairs of sentences or phrases. We work on the verb-phrase part of the dataset, consisting of 72 short verb phrases (verb-object structures). These 72 phrases have been paired in three different ways to form groups exhibiting various degrees of similarity: the first group contains 36 pairs of highly similar phrases (e.g. produce effect-achieve result), the pairs of the second group are of medium similarity (e.g. write book-hear word), while a last group contains low-similarity pairs (use knowledge-provide system). The task is again to compare the similarity scores given by the various models for each phrase pair with those of human annotators. In addition to the verb-phrase task, we also perform a richer version of the experiment using transitive sentences.

Verb phrases  It can be shown that for simple verb phrases the relational model reduces to the copy-subject model; for both of these methods, the meaning of the verb phrase is calculated according to Equation 6. Furthermore, according to the copy-object model, the meaning of a verb phrase computed from a verb matrix \sum_{ij} v_{ij} (\overrightarrow{n_i} \otimes \overrightarrow{n_j}) and an object vector \sum_j o_j \overrightarrow{n_j} becomes:

verb\,object^2 = \sum_{ij} v_{ij} o_j (\overrightarrow{n_i} \otimes \overrightarrow{n_j})    (14)

Finally, the Kronecker model has no meaning for verb phrases, since the vector of a verb phrase becomes (\overrightarrow{v_s} \otimes \overrightarrow{v_s}) \times \overrightarrow{obj}, which is equal to \langle \overrightarrow{v_s} | \overrightarrow{obj} \rangle \overrightarrow{v_s}, where \langle \overrightarrow{v_s} | \overrightarrow{obj} \rangle denotes the inner product between the vectors of the verb and the object. Hence, the meaning of a verb phrase becomes a scalar multiple of the meaning of its verb. As a result, the cosine distance (used for measuring similarity) between the meanings of two verb phrases reduces to the distance between the vectors of their verbs, completely dismissing the role of their objects.

Model                Ambig.       Disamb.
BL  Verbs only        0.310   ≪   0.420
M1  Multiplicative    0.315   ≪   0.448
M2  Additive          0.291   ≪   0.436
T1  Rel./Copy-sbj     0.340   ≪   0.367
T2  Copy-object       0.290   ≪   0.393
Human agreement       0.550
Difference between M1 and M2 is not s.s.
Difference between M1 and BL is s.s. with p < 0.001.

Table 3: Results for the original M&L task.
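The two reductions described above (Equation 14 and the Kronecker case) are easy to verify numerically; a toy check with assumed names:

```python
import numpy as np

D = 4
verb_matrix = np.random.rand(D, D)   # coefficients v_ij of the verb matrix
verb_vec = np.random.rand(D)         # contextual vector of the verb
obj = np.random.rand(D)              # object vector (coefficients o_j)

# Equation (14): element [i, j] of the copy-object verb phrase is v_ij * o_j.
vp_copy_object = verb_matrix * obj[None, :]

# Kronecker model: (v ⊗ v) contracted with the object is <v, obj> times v,
# i.e. a scalar multiple of the verb vector.
vp_kronecker = np.outer(verb_vec, verb_vec) @ obj
assert np.allclose(vp_kronecker, (verb_vec @ obj) * verb_vec)
```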

Hence our models are limited to those of Table 3. The effects of disambiguation on this task are quite impressive: the differences between the scores of all disambiguated models and those of the ambiguous versions are highly statistically significant (with p < 0.001), while 4 of the 5 models show an improvement greater than 10 points of correlation. The models that benefit the most from disambiguation are the vector mixtures; both of these approaches perform significantly better than the best tensor-based model (copy-object). In fact, the score of M1 (0.45) is quite high, given that the inter-annotator agreement is 0.55 (the best score reported by Mitchell and Lapata was 0.41, for their LDA-dilation model).

Transitive sentences  The second part of this experiment aims to examine the extent to which the above picture changes for text structures longer than verb phrases. In order to achieve this, we extend each of the 72 verb phrases to a full transitive sentence by adding an appropriate subject, such that the similarity relationships of the original dataset are retained as much as possible, so that the human judgements for the verb phrase pairs can also be used for the transitive cases. We worked pair-wise: for each pair of verb phrases, we first selected one of the 5 most frequent subjects for the first phrase; then, the subject of the other phrase was selected from a list of synonyms of the first subject, in a way that the new pair of transitive sentences constitutes the least more specific version of the given verb-phrase pair. So, for example, the pair produce effect/achieve result became drug produce effect/medication achieve result, while the pair pose problem/address question became study pose problem/paper address question.⁶

⁶The dataset will be available at http://www.cs.ox.ac.uk/activities/compdistmeaning/.

The restrictions of the verb-phrase version do not hold here, so we evaluate the full set of models (Table 4). Once more, disambiguation produces better results in all cases, with highly statistically significant differences for all but one model. Furthermore, the best score is now delivered by one of the tensor-based models (Kronecker), with a difference that is not statistically significant from disambiguated additive. In any case, the result suggests that as the length of the text segments increases, the performance of vector mixtures and tensor-based models converges. Indeed, note how the performance of the vector mixture models is significantly decreased compared to the verb phrase task.

Model                Ambig.       Disamb.
BL  Verbs only        0.310   ≪   0.341
M1  Multiplicative    0.325   ≪   0.404
M2  Additive          0.368   ≪   0.410
T1  Relational        0.368   ≪   0.397
T2  Kronecker         0.404   <    0.412
T3  Copy-subject      0.310   ≪   0.337
T4  Copy-object       0.321   ≪   0.368
Human agreement       0.550
Difference between T2 and M2 is not s.s.

Table 4: Transitive version of M&L task.

9 Discussion

The purpose of this work was twofold: our main objective was to investigate how disambiguation affects compositional models based on higher order vector spaces; a second, but no less important, goal was to compare this more linguistically motivated approach with the simpler vector mixture methods. Based on the experimental work presented here, we can say with enough confidence that disambiguation as an additional step prior to composition is indeed very beneficial for tensor-based models. Furthermore, our experiments confirm and strengthen previous work (Reddy et al., 2011; Kartsaklis et al., 2013) which showed better performance of disambiguated vector mixture models compared to their ambiguous versions. The positive effect of disambiguation is more evident for the vector mixture models (especially for the additive model) than for the tensor-based ones. This is expected: composite representations created by element-wise operations are averages, and a prior step of disambiguation can make a great difference.

From a task perspective, the effect of disambiguation was much more definite in the phrase/sentence similarity task. This observation is really interesting, since the words of that dataset were not selected to be ambiguous in any way. The superior performance of the disambiguated models, therefore, implies that the proposed methodology can improve tasks based on phrase or sentence similarity regardless of the level of ambiguity in the vocabulary. For these cases, the proposed disambiguation algorithm acts as a fine-tuning process, the outcome of which seems to be always positive; it can only produce better composite representations, not worse. In general, the positive effect of disambiguation in the phrase/sentence similarity task is quite encouraging, especially given the fact that this task constitutes a more appropriate test for evaluating compositional models, avoiding the pitfalls of disambiguation-based experiments (as briefly discussed in Section 8.1).

For disambiguation-based tasks similar to those of Section 8.1, the form of the dataset is very important; hence the inferior performance of disambiguated models on the G&S dataset, compared to the dataset of Kartsaklis et al. In fact, the G&S dataset was the only one where disambiguation was not helpful in some cases (specifically, for the vector mixtures and the Kronecker model). We believe the reason behind this lies in the fact that the automatic selection of landmark verbs using the JCN metric (as done with the G&S dataset) was not very effective for certain cases. Note, for example, that the bare baseline of comparing just the ambiguous versions of the verbs (without any composition) in that dataset already achieves a very high correlation of 0.198 with human judgements (Table 1).⁷ This number is only 0.15 for the Kartsaklis et al. dataset, due to the more effective verb selection procedure. In general, we consider the results gained by this latter experiment more reliable for the specific task, the successful evaluation of which requires genuinely ambiguous verbs.

⁷The number reported for this baseline by Grefenstette and Sadrzadeh (2011a) was 0.16, using vectors trained on the BNC.

The results are less conclusive for the second question we posed at the beginning of this section, regarding the comparison of the two classes of models. Despite the obvious benefits of the tensor-based approaches, this work suggests once more that vector mixture models might constitute a hard-to-beat baseline; similar observations have been made, for example, in the comparative study of Blacoe and Lapata (2012). However, when trying to interpret the mixed results regarding the effectiveness of the tensor-based models compared to vector mixtures, we need to take into account that the tensor-based models tested in this work were all "hybrid", in the sense that they all involved some element of point-wise operation; in other words, they constituted a trade-off between transformational power and complexity.

Even with this compromise, though, the study presented in Section 8.2 implies that the effectiveness of each method depends to some extent on the length of the text segment: when more words are involved, vector mixture models tend to be less effective; on the contrary, the performance of tensor-based models seems to be proportional to the length of the phrase or sentence: the more, the better. These observations comply with the nature of the approaches: "averaging" larger numbers of points results in more general (hence less accurate) representations; on the other hand, a larger number of arguments makes a function (such as a verb) more accurate.

10 Conclusion and future work

In the present paper we showed how to improve a number of tensor-based compositional distributional models of meaning by introducing a step of disambiguation prior to composition. Our simple algorithm (based on the procedure of Schutze (1998)) creates unambiguous versions of tensors before these are composed with the vectors of nouns in order to construct vectors for sentences and phrases. This algorithm is quite generic, and can be applied to any model that follows the tensor contraction process described in Section 4. As for future work, we aim to investigate the application of this procedure to the regression model of Grefenstette et al. (2013).

Acknowledgements

We would like to thank Edward Grefenstette for his comments on the first draft of this paper, as well as the three anonymous reviewers for their fruitful suggestions. Support by EPSRC grant EP/F042728/1 is gratefully acknowledged by the authors.

References

Baroni, M. and Zamparelli, R. (2010). Nouns are Vectors, Adjectives are Matrices. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Blacoe, W. and Lapata, M. (2012). A comparison of vector-based representations for semantic composition. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 546–556, Jeju Island, Korea. Association for Computational Linguistics.

Bourbaki, N. (1989). Commutative Algebra: Chapters 1-7. Springer Verlag, Berlin/New York.

Calinski, T. and Harabasz, J. (1974). A Dendrite Method for Cluster Analysis. Communications in Statistics-Theory and Methods, 3(1):1–27.

Coecke, B., Sadrzadeh, M., and Clark, S. (2010). Mathematical Foundations for Distributed Compositional Model of Meaning. Lambek Festschrift. Linguistic Analysis, 36:345–384.

Curran, J. (2004). From Distributional to Semantic Similarity. PhD thesis, School of Informatics, University of Edinburgh.

Ferraresi, A., Zanchetta, E., Baroni, M., and Bernardini, S. (2008). Introducing and evaluating ukWaC, a very large web-derived corpus of English. In Proceedings of the 4th Web as Corpus Workshop (WAC-4): Can we beat Google?, pages 47–54.

Grefenstette, E., Dinu, G., Zhang, Y.-Z., Sadrzadeh, M., and Baroni, M. (2013). Multi-step regression learning for compositional distributional semantics. In Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013).

Grefenstette, E. and Sadrzadeh, M. (2011a). Experimental Support for a Categorical Compositional Distributional Model of Meaning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Grefenstette, E. and Sadrzadeh, M. (2011b). Experimenting with Transitive Verbs in a DisCoCat. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics (GEMS).

Harris, Z. (1968). Mathematical Structures of Language. Wiley.

Jiang, J. and Conrath, D. (1997). Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the International Conference on Research in Computational Linguistics, pages 19–33, Taiwan.

Kartsaklis, D., Sadrzadeh, M., and Pulman, S. (2012). A unified sentence space for categorical distributional-compositional semantics: Theory and experiments. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012): Posters, pages 549–558, Mumbai, India. The COLING 2012 Organizing Committee.

Kartsaklis, D., Sadrzadeh, M., and Pulman, S. (2013). Separating Disambiguation from Composition in Distributional Semantics. In Proceedings of the 17th Conference on Computational Natural Language Learning (CoNLL-2013), Sofia, Bulgaria.

Kintsch, W. (2001). Predication. Cognitive Science, 25(2):173–202.

Landauer, T. and Dumais, S. (1997). A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. Psychological Review.

Manandhar, S., Klapaftis, I., Dligach, D., and Pradhan, S. (2010). SemEval-2010 Task 14: Word sense induction & disambiguation. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 63–68. Association for Computational Linguistics.

Manning, C., Raghavan, P., and Schutze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.

Mitchell, J. and Lapata, M. (2008). Vector-based Models of Semantic Composition. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, pages 236–244.

Mitchell, J. and Lapata, M. (2010). Composition in distributional models of semantics. Cognitive Science, 34(8):1388–1439.

Mullner, D. (2013). fastcluster: Fast Hierarchical Clustering Routines for R and Python. Journal of Statistical Software, 9(53):1–18.

Pickering, M. and Frisson, S. (2001). Processing ambiguous verbs: Evidence from eye movements. Journal of Experimental Psychology: Learning, Memory, and Cognition, 27(2):556.

Pulman, S. (2013). Combining Compositional and Distributional Models of Semantics. In Heunen, C., Sadrzadeh, M., and Grefenstette, E., editors, Quantum Physics and Linguistics: A Compositional, Diagrammatic Discourse. Oxford University Press.

Reddy, S., Klapaftis, I., McCarthy, D., and Manandhar, S. (2011). Dynamic and static prototype vectors for semantic composition. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 705–713.

Rosenberg, A. and Hirschberg, J. (2007). V-measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 410–420.

Schutze, H. (1998). Automatic Word Sense Discrimination. Computational Linguistics, 24:97–123.