CALCULATING SHADES OF MEANING IN SEMANTIC SPACES
by
David Petar Novakovic, BIT. (Bond, 2006)
http://dpn.name/[email protected]
Submitted in fulfilment of the requirements for the Degree of Master of Information Technology (Research)
Faculty of Information Technology, Queensland University of Technology
July, 2008
ACKNOWLEDGEMENTS
Professor Peter Bruza, for spotting the interest at just the right time and helping me see past the digits and into semantic spaces. Maths is now an unshakable, but much loved curse. Also for the continuing support and direction through this research. His infectious enthusiasm left me leaving every meeting feeling encouraged and revitalised.

Stephen Kelly, for thoroughly proofreading a paper on a topic foreign to him, several times as well.

Jade, for tolerating me living my work for the last two years and still being able to encourage me.

Mum and Dad, for all their endless support and encouragement.

Ben, for doing research before me, so I knew what to expect.

inQbator, for giving me somewhere to work from, and being so supportive.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
1 Introduction
1.1 Background
1.2 The problem: Conceptual Ambiguity
2 Literature Review
2.1 Representation Models That Contain Ambiguity
2.1.1 Hyperspace Analogue to Language
2.1.2 Latent Semantic Analysis
2.2 Word Sense Disambiguation, Discrimination and Induction
2.3 Summary
3 Techniques for Calculating SoM
3.1 Singular Value Decomposition
3.2 Concept Indexing
3.3 Vector Negation
3.4 Non-negative Matrix Factorisation
4 Empirical Evaluation
4.1 Data
4.1.1 Word Selection
4.1.2 Evaluation Data Sets
4.1.3 Human Tagging
4.1.4 Preprocessing
4.2 Method
4.2.1 Building the HAL Model
4.2.2 Computing the Shades of Meaning
4.2.3 Representing the Traces
4.2.4 Comparing SoM to Traces
4.3 Evaluation Metrics
4.3.1 Purity and Normalised Mutual Information
4.3.2 Confusion Matrices
4.3.3 Conditional Entropy
5 Results
5.1 Normalised Mutual Information
5.2 Conditional Entropy
6 Conclusions and Future Work
6.1 SoM Accuracy Across Languages
6.2 Decomposing Term-Trace Data
6.3 Euclidean and Generalised Kernel NMF
6.4 Abductive Reasoning
6.4.1 Quantum Logic
A Confusion Matrices
B Early Pilots and Trace Browser
BIBLIOGRAPHY
LIST OF TABLES
2.1 A simple term × term matrix computed by HAL, before combining rows and columns.
4.1 Trace table
4.2 A reduced example of the “Reagan” confusion matrix at k = 10.
4.3 “onion” at k = 5
5.1 Average overall improvements in NMI.
5.2 Reuters NMI results. IK is the ideal value for k. B is the baseline NMI value. The percentages detail improvements over the baseline.
5.3 Average overall reductions in entropy.
5.4 “coffee” at k = 5. An example of SVD performing better than NMF in a run of the algorithm.
5.5 Reuters-21578 conditional entropy results
5.6 “float” at k = 5. Another example of SVD performing better than NMF.
A.1 SENSEVAL NMI results
A.2 SENSEVAL1 nouns conditional entropy results
A.3 “GATT” at k = 30 showing a nearly perfect reduction in entropy score of 95.28%.
A.4 “Reagan” at k = 10 for SVD and NMF, an example of NMF outperforming SVD.
LIST OF FIGURES
1.1 An example of a semantic space locality for “Reagan.”
1.2 Concepts surrounding “Reagan” feed into it from different angles.
1.3 “Reagan” can be understood in terms of axes of meaning derived from concepts surrounding it.
1.4 An example of some traces about President Ronald Reagan.
1.5 After traces have been processed, a word’s vector contains no original order or trace specific information.
1.6 Axes in a new dimensionally reduced subspace may be considered SoM.
2.1 Algorithm to compute the HAL TCOR matrix. The algorithm is run cumulatively for each document and the function word returns the lexicon index of the i’th word of the document.
B.1 A screenshot of the trace browser.
CHAPTER 1
Introduction
This thesis introduces the problem of conceptual ambiguity, or the Shades of Meaning
(SoM) that can exist around a term or entity. As an example, consider Ronald Reagan,
the ex-president of the USA: there are many aspects to him that are captured in text,
such as the Russian missile deal, the Iran-contra deal and others. Simply finding
documents with the word “Reagan” in them is going to return results that cover many
different shades of meaning related to “Reagan”. Instead it may be desirable to retrieve
results around a specific shade of meaning of “Reagan”, e.g., all documents relating to
the Iran-contra scandal. This thesis investigates computational methods for identifying
shades of meaning around a word, or concept. This problem is related to word sense
ambiguity, but is more subtle: it is based less on the particular syntactic structures
associated with an instance of the term and more on the semantic contexts around it.
A particularly noteworthy difference from typical word sense disambiguation is that the
shades of a concept are not known in advance; it is up to the algorithm itself to
ascertain these subtleties. It is the key hypothesis of this thesis that reducing the
number of dimensions in the representation of concepts is a key part of reducing
sparseness, and thus also crucial in discovering their SoM within a given corpus.
What follows is a discussion of the background theory supporting this research, the
methodology, and the results showing SoM being identified and empirically evaluated.
1.1 Background
Information in society, particularly online, is an ever growing, almost infinite collection
of opinions, reports, essays, emails and other forms of communication. No single person
could ever conceive of completely comprehending this massive amount of information.
Additionally, there are many situations where information is not so readily available,
yet needs to be monitored or analysed to find its core ideas and themes. Whether it is
the sheer volume of information available, or simply that people tend to focus on topics
close to them, it is clear that people do not have the capacity to keep up with the
amount of information available, and knowledge within separate fields is becoming more
and more isolated. In order to help quantify the largely unmonitored relationships
amongst the vast amounts of information on the Internet, scientists have developed
systems to model this data. In general this data is modelled using high dimensional
representations of text. Some of the relevant models are covered in the next section.
In addition to the ever widening gap between comprehension and the global scope of
information, words are ambiguous between different informational contexts. Words can
have many meanings; this is known as polysemy. Sometimes the ambiguities are as simple
as the difference between a verb and a noun, like “fly” the flying insect, and “fly”
the action. Other times the ambiguities are more pronounced, like the nouns “bank”,
the side of a river, and “bank”, where you save money. Since words can mean so many
things, a way is needed to find the different shades of a concept and utilise them.
This thesis is about developing effective computational models for inducing SoM around
a concept. Further research will be done on using the shades to find relevant
information more effectively.
1.2 The problem: Conceptual Ambiguity
While lexical ambiguity is a concern in text processing, conceptual ambiguity is
probably more prevalent in text corpora because of its more general nature. Conceptual
ambiguity, or the aspects of a concept, is much more subtle. Lexical ambiguity, or
homographs,¹ has often been described in terms of senses [43]. This thesis will argue
that in some cases the senses of a word differ from the aspects of a concept. These
aspects are referred to as Shades of Meaning (SoM). Consider a concept that has several
different conceptual “axes” or aspects associated with it; each of these conceptual
contexts is a shade of meaning. This is very common in general literature. For example,
the term “Reagan” immediately gives the association to the ex-president of the USA.
The ambiguity here comes from the fact that Reagan was actually involved in and known
for much more than just the fact that he was president. Other associated concepts
include the Iran-contra scandal, the Russian missile deal, his wife, his acting career
and others [51].
If a text processing engine builds a conceptual semantic space rather than a lexical
space and represents it in a two-dimensional visualisation, it could look like Figure
1.1: a series of concepts represented in a measurable space where some concepts are
closer together than others. These concepts represent the general concepts for each of
these terms, irrespective of occurring with “Reagan.” Since the model has been built out of
¹ Homographs, since text is being dealt with; or, more generally, homonyms.
[Figure: the concepts “reagan,” “acid rain,” “tax,” “missile deal” and “iran-contra” plotted as points in a two-dimensional space.]
Figure 1.1: An example of a semantic space locality for “Reagan.”
a series of documents about President Reagan, it occurs in the centre of all the other
concepts. The reason it can be found amongst them is that it is closely related to them.
The truth is that the “reagan” concept is actually subtly made up of the concepts
surrounding it, so each of these concepts feeds into the Reagan concept. If this were a
two dimensional space someone could walk around in, then standing in the “acid rain”
area and looking toward “reagan” would reveal an orange-green blur: the viewer is
seeing the Reagan concept, but from the point of view of acid rain. This is illustrated
in Figure 1.2.
Finally, it can be seen in Figure 1.3 that the “Reagan” concept is actually made up of
axes reflecting the other concepts around it. When the “Reagan” concept is referred to,
it could be by any one of these SoM, but normally it is by all of them. When “Reagan”
is referred to ambiguously on its own, it is the aim of this thesis to uncover the
fundamental axes surrounding the concept; these are the SoM.
[Figure: arrows from “missile deal,” “iran-contra,” “acid rain” and “tax” feeding into “reagan” from different directions.]
Figure 1.2: Concepts surrounding “Reagan” feed into it from different angles.
[Figure: “reagan” shown with axes running through it toward “acid rain,” “tax,” “missile deal” and “iran-contra.”]
Figure 1.3: “Reagan” can be understood in terms of axes of meaning derived from concepts surrounding it.
As an example, consider a text processing engine indexing large amounts of data and
searching for the term “Reagan”: it will find many instances of the word in many
varying contexts. Here are some example headlines it may find the word “Reagan” in.
1. REAGAN ADMITS IRAN ARMS OPERATION A MISTAKE
2. REAGAN PLEDGES TO INCREASE SPENDING ON ACID RAIN
3. IRAN INVESTIGATORS SEEK REAGAN TAPES
4. SPEAKES SAYS HE, REAGAN MISLED PUBLIC UNWITTINGLY
5. CANADA WELCOMES LATEST REAGAN ACID RAIN PLEDGE
6. REAGAN TO VETO 87.5 BILLION DLR HIGHWAY BILL
Figure 1.4: An example of some traces about President Ronald Reagan.
In the above headlines, 1 and 3 are clearly about the Iran-contra deal, headlines 2 and
5 are about the acid rain problem, and headline 6 is about the Highway Bill that Reagan
vetoed. The unusual sentence is 4: it is not particularly associated with any topic,
like Iran-contra; this one is discussed later. So all of these sentences contain the
word “REAGAN” but are about different aspects of Reagan. The text processing engine
would store all of these sentences under the word Reagan, leading to a single entry
looking similar to the text below. In practice the engine would keep some kind of value
associated with these terms, like their frequency within the document; this is just an
example to describe the ambiguity.
REAGAN: [’BILLION’, ’SAYS’, ’TAPES’, ’UNWITTINGLY’, ’ADMITS’, ’SPEAKES’, ’ACID’, ’SEEK’, ’INVESTIGATORS’, ’CANADA’, ’SPENDING’, ’ARMS’, ’INCREASE’, ’TO’, ’MISLED’, ’HE,’, ’OPERATION’, ’WELCOMES’, ’LATEST’, ’A’, ’IRAN’, ’PLEDGE’, ’VETO’, ’REAGAN’, ’RAIN’, ’DLR’, ’PLEDGES’, ’HIGHWAY’, ’MISTAKE’, ’ON’, ’87.5’, ’BILL’, ’PUBLIC’]
Figure 1.5: After traces have been processed, a word’s vector contains no original order or trace specific information.
This set of words represents a model of “Reagan,” but in a single combined unit. With
appropriate background knowledge, a human can identify some terms dealing with
particular SoM, but this is much more difficult for a computational model to recognise.
When this model is scaled up to millions of sentences, it becomes unmanageable to store
each sentence individually and still be able to process them efficiently and
effectively, so a machine must be taught to find the SoM encoded in this single vector
of words.
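The collapse from ordered traces into a single unordered entry can be sketched in a few lines of Python. The headlines and the collapsing step are simplified stand-ins for illustration, not the engine used in this thesis:

```python
# Two of the example traces containing the target term.
headlines = [
    "REAGAN ADMITS IRAN ARMS OPERATION A MISTAKE",
    "IRAN INVESTIGATORS SEEK REAGAN TAPES",
]

# Collapse every trace into one set of co-occurring terms.  Once this
# is done, neither word order nor trace membership is recoverable,
# which is exactly the ambiguity described above.
entry = sorted({word for h in headlines for word in h.split()})

print(entry)
```

Any algorithm for finding SoM must work from this flattened representation alone, since the individual traces are no longer stored.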
Another problem alluded to in the last paragraph is that headline 4 doesn’t seem to
relate to any topic in particular. In actual fact the headline is associated with an article
about the Iran-contra scandal as well. The fragment could have been made longer until
it included some explicit words that associated the headline with the actual SoM, but
that risks including spurious information that may also weaken its association to the
appropriate shade.
The problem of conceptual ambiguity is further exacerbated by metaphorical and
metonymic references to seemingly unrelated terms. In the case of Reagan, he got the
nickname “the Gipper” from his role in the movie Knute Rockne, All American. Most
Americans know who the nickname refers to, but this would confuse a machine. SoM could
be considered a very specific subset of polysemy. Conceptual ambiguity should also be a
consideration when processing large corpora of text because it is desirable to know the
SoM surrounding a word; this would help in many applications in both information
retrieval (IR) and computational linguistics (CL). It should be noted that in an IR
system, most ambiguity can be resolved to a reasonable level for a user by the addition
of terms to the query. Query expansion in this manner has been likened to semantic
priming [30, 29, 21]. This simply resolves lexical ambiguity by creating more context
for the query. It is preferable to resolve ambiguity at a conceptual level and more
exhaustively. The query “reagan iran-contra” would almost certainly give an IR system
enough context for a relevant enough list of documents to be returned. Many other
queries are often not satisfied, simply because the most prevalent use of a term may not
be the particular meaning you are looking for, or you may not know the word to qualify
the original query with in the first place. Section 2.1.1 will look at an example of
building an ambiguous representation with a high dimensional representation model.
Figure 1.6: Axes in a new dimensionally reduced subspace may be considered SoM.
As stated earlier, it is the key hypothesis of this thesis that reducing the number of
dimensions in the representation of concepts is a key part of reducing sparseness and
thus also crucial in discovering SoM within a given corpus. The use of dimensional
reduction to improve the representation of terms is a common theme among many
different strands of research.
A key part of Latent Semantic Analysis (LSA) is dimensional reduction, which is
generally a way to reduce the complexity of high dimensional matrix representations [25].
Karypis points out that it is possible to map a high dimensional space down onto a
lower dimensional subspace while still retaining important explicit information in the
data [22]. A side effect of this is that some latent connections are also brought out. He
says that by using dimensional reduction algorithms latent concepts can be found in a
document collection. This view is also supported by Landauer and this assertion forms
the basis of this thesis [25]. Cai and He further support this position in their work
with Orthogonal Locality Preserving Indexing [11]. In the context of SoM, the bases in
a new dimensionally reduced subspace may be the SoM surrounding that particular
context. In Figure 1.6, an abstract view of axes originating at the centre of “Reagan”
illustrates the view that the new bases might be SoM. This raises the question of
whether the bases corresponding to SoM should be orthogonal.
In summary, the aim of this thesis is to develop and empirically evaluate unsupervised
computational methods for inducing SoM around a concept relative to a corpus of text.
The models investigated will be matrix models of concepts, and a key component will be
to investigate the effect of dimension reduction on the quality of the SoM being
induced.
CHAPTER 2
Literature Review
To cover the groundwork required to start working with aspects, one should first look
at techniques that have been used to help resolve ambiguity in the past. Sanderson
gives a very brief and clear introduction to information retrieval and word ambiguity
and methods for improving it [43]. In particular he discusses mechanisms for “fixing”
a document’s representation in the system to give better results when resolving
ambiguity; it is these kinds of post-hoc manipulations on data that can drastically
improve the quality of operations on them. It is still important to choose the right
representation model, though. Although most IR researchers discuss semantic ambiguity
in the context of documents, often the discussion and solutions are useful in the
context of word ambiguity. The study of meaning has been given the name
“semiotic-cognitive information systems” [51]. Despite this thesis being primarily
interested in word meanings, there is considerable overlap between the areas of
document based IR and the semantic analysis of text in the context of symbolic meaning.
Some of the systems that often contain ambiguity will be discussed, along with
heuristics to help resolve ambiguity via preprocessing, indexing and post processing.
Though the results that have been found by using them in information retrieval and
semiotic-cognitive systems are of interest, this thesis is concerned with the
computational methods for SoM.
2.1 Representation Models That Contain Ambiguity
We are in agreement that high dimensional models are a promising way to represent
concepts [38]. In order to investigate the potential benefits of term level and
document level co-occurrence information, this chapter discusses high dimensional
models which have consistently been shown to approximate human cognitive processes
effectively while allowing ambiguous concepts to be represented [8, 13, 21, 18]. There
are a few ways of representing concepts in high dimensional models: one is at the term
level, others are at a higher syntactic level.¹
One way of representing meaning at a term level is by representing terms in a model
known as Hyperspace Analogue to Language (HAL) which attempts to approximate
human cognitive processes, for example, human semantic word association norms [8,
9, 10]. This model is generated by passing a window across the text and recording the
relative positions of words to each other. The value given to a word in the window
relative to another word is the inverse of the distance between them. This process
creates a very large sparse matrix of n × n dimensions, where n is the number of terms
in the system.
Another way to model concepts is in a term × document model where each document is
represented by a vector of word frequencies, or related weights. So models can have
term × document matrices, and HAL matrices which are term × term. Latent Semantic
Analysis (LSA) is a model that has been known to produce encouraging results in
representing the semantics of terms in a cognitively validated way [13]. LSA
¹ Like sentences, paragraphs or documents.
adopts a position firmly in the area of IR that focuses on “bag of words” style
representations of terms in a dimensionally reduced term × document matrix. A vector in
LSA carries information about a term and the contexts it has been found in.² There is
not a lot of literature comparing these different ways of representing data, merely
comparisons between different models using only term × term models, or more often,
term × document models. Studies have shown both models are relevant in different ways.
The types of information encoded in LSA and HAL vectors are different in nature, yet in
the literature about LSA its vectors are known as context vectors, confusingly similar
to the HAL literature, which also refers to HAL vectors as “context” vectors. Jones et
al. help with the disambiguation by calling the relationships in LSA context and those
in HAL order [21]. Lavelli et al. make the distinction clearer by calling LSA’s
information document occurrence representation (DOR) and HAL’s information term
co-occurrence representation (TCOR) [26]. Other research projects make the same
assessment of the different types of information and how they relate to each other
[21, 26].
2.1.1 Hyperspace Analogue to Language
The HAL model has shown promise as an effective way of using symbol co-occurrence
to help model the meaning of words [8, 9, 10].
HAL has been noted to compare favourably with human cognitive processing by
replicating human semantic word association norms [8, 9, 10]. HAL is a real valued
matrix, which allows terms and concepts to be represented measurably [33, 51].
Burgess et al. [8] call HAL vectors “context” vectors; while it is a suitable name,
other high dimensional semantic models also use the word context to mean something
different. The context that the term context actually refers to can mean a sentence, a
window, or a document. It will be seen later what other kinds of relationships there
can be. HAL is classified as a TCOR model, which implies a much smaller context window
than DOR models, which use the whole document as a window.
² In Landauer’s work, a context is typically a fragment of text the size of a paragraph.
A HAL model is referred to as a context space by Burgess et al., but as mentioned
earlier it is most clearly described as a TCOR model [8]. The reason for this is that the value
of a particular symbol in the HAL model is calculated as a sum of all the contexts the
particular term appeared in, where the context is normally much smaller than the size
of a document. This model is generated by passing a window of size l = 10 across the
text and recording the relative positions of words to each other, though other values of
l have been used as well. The value given to a word in the window relative to another
word is the inverse of the distance between them. This process typically creates a very
large sparse matrix of n × n dimensions, where n is the number of known terms in
the system. A row in this vector space is a vector for the values of words that appeared
before the chosen word in the window. Columns represent vectors of the words follow-
ing the word in the window. Burgess concatenated these two vectors together to give a
single vector representing the term in the HAL space. Bruza et al. found that
preserving word order information via the two separate vectors did not help with their
experiments, and instead summed the associated entries from the column and row vectors
[33, 51].
A term in the HAL semantic space is defined by the succinct mathematical notation in
Equation 2.1 [1].
HAL(t \mid t') = \sum_{k=1}^{K} w(k)\, n(t, k, t')    (2.1)
The HAL value for any term t_i with respect to another term t_j is given by Equation
2.1, where n(t_i, k, t_j) is the number of times t_i appears at distance k from t_j,
and w(k) = K − k + 1 is the strength of the relationship between the two terms given k.
The HAL semantic
space is denoted S where S[i, j] gives the strength of the co-occurrence relationship
between term ti and term tj . For example, consider the text “President Reagan ignorant
of the arms scandal”, with l = 5, the resulting HAL matrix would be as shown in Table
2.1. Each row i in the resulting matrix represents accumulated weighted associations of
word i with respect to other words which preceded i in a context window. Conversely,
column i represents accumulated weighted associations with words that appeared after
i in a window. Figure 2.1 shows a pseudocode implementation of the HAL algorithm.
As HAL produces vector representations of words, similarity metrics such as cosine
similarity [50, 6], Minkowski similarity [6, 51] and inferential metrics like information
flow [51, 50, 6, 49] can be calculated in the space. Since the information about every
context a symbol appears in is contained in a vector representation, all that is required
is to find the different aspects in the vector. The question is how this should be done.
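As a sketch of how one such metric operates on HAL row vectors, cosine similarity can be computed directly with NumPy. The two vectors below are illustrative only, not drawn from a real HAL space:

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    # Cosine of the angle between two term vectors: 1.0 for identical
    # directions, 0.0 for orthogonal vectors sharing no context.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Illustrative HAL-style row vectors over a seven-word lexicon.
reagan = np.array([0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0])
of_row = np.array([0.0, 5.0, 0.0, 3.0, 4.0, 0.0, 0.0])

print(round(cosine(reagan, of_row), 3))
```

Because every context a term appears in is summed into one vector, a single cosine score blends all of its shades together, which is why finding the separate aspects inside the vector is the harder problem.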
The problem of ambiguity is barely touched upon by much of the preceding work using
HAL. In general, ambiguity in semantic models, particularly HAL, is tackled with the
            arms  ignorant  of  president  reagan  scandal  the
arms           0         3   4          1       2        0    5
ignorant       0         0   0          4       5        0    0
of             0         5   0          3       4        0    0
president      0         0   0          0       0        0    0
reagan         0         0   0          5       0        0    0
scandal        5         2   3          0       1        0    4
the            0         4   5          2       3        0    0

Table 2.1: A simple term × term matrix computed by HAL, before combining rows and columns.
use of semantic priming [29]. Burgess illustrates ambiguity in HAL vectors by reducing
their representations to two dimensions [9]. In their graphic diagrams, examples like
the word “fix” lie in between the two semantic groups representing the use of drugs and
the act of repairing something. Burgess proposes a method of detecting whether a term
may be ambiguous. He says that through his research he has found that as a term appears
in more and more contexts it becomes more general. He suggests that finding the
“Context Attributes” of a HAL vector helps find the dominant context contained in the
vector. This is checked by finding how many standard deviations all the non-zero
dimensions of a vector are from the mean. The terms with the highest standard deviation
would then be taken into consideration as defining the meaning of the vector the most.
While this is very helpful for finding the aboutness of the vector, it is desirable to
know if there are any other aspects in the vector that may be getting overshadowed by
the dominant aspect. In addition to finding separate aspects, it is good to be able to
tell if a particular term is ambiguous amongst the contexts it is found in.
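A minimal sketch of this heuristic might look as follows. The function name and the z-score formulation are this author's reading of Burgess's description, not his published code:

```python
import numpy as np

def context_attributes(vec: np.ndarray, top: int = 3):
    """Rank the non-zero dimensions of a HAL vector by how many
    standard deviations they sit above the mean non-zero weight."""
    nonzero = np.flatnonzero(vec)
    weights = vec[nonzero]
    mu, sigma = weights.mean(), weights.std()
    if sigma == 0.0:
        # All non-zero weights are equal: no dimension dominates.
        return [(int(i), 0.0) for i in nonzero[:top]]
    scores = (weights - mu) / sigma
    order = np.argsort(scores)[::-1]          # strongest first
    return [(int(nonzero[i]), float(scores[i])) for i in order[:top]]

# One weight dominates, so dimension 3 defines the vector's aboutness.
print(context_attributes(np.array([0.0, 1.0, 0.0, 9.0, 1.0, 1.0])))
```

Note that this ranking only surfaces the dominant context; weaker shades sit close to the mean and are exactly the aspects the heuristic overlooks.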
def calc_HAL(word: int→int, n: int, l: int, S: matrix)
    for i in 1..n do
        for j in 1..l such that i-j > 0 do
            S[word(i), word(i-j)] += (l+1) - j
        end
    end
end
Figure 2.1: Algorithm to compute the HAL TCOR matrix. The algorithm is run cumulatively for each document and the function word returns the lexicon index of the i’th word of the document.
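The pseudocode in Figure 2.1 can be realised directly in Python. The sketch below stores the matrix as a nested dictionary keyed by the words themselves rather than by lexicon indices; this is an illustrative implementation, not the one used in the experiments, but it reproduces the weights of Table 2.1:

```python
from collections import defaultdict

def build_hal(tokens, l=5):
    # Cumulative HAL TCOR matrix: S[a][b] is the weighted count of
    # occurrences of word b before word a within a window of size l,
    # weighted by (l + 1) - distance so adjacent words score highest.
    S = defaultdict(lambda: defaultdict(int))
    for i in range(len(tokens)):
        for j in range(1, l + 1):
            if i - j >= 0:
                S[tokens[i]][tokens[i - j]] += (l + 1) - j
    return S

# The worked example behind Table 2.1:
S = build_hal("president reagan ignorant of the arms scandal".split())
print(S["arms"]["the"], S["scandal"]["arms"])   # 5 5, matching Table 2.1
```

Running the function cumulatively over every document in a corpus, as the figure describes, simply means calling it once per document with a shared matrix.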
2.1.2 Latent Semantic Analysis
The Latent Semantic Analysis (LSA) technique was first introduced in 1988 and has been
employed in a variety of fields such as IR, machine learning, computational linguistics
and cognitive science [16]. LSA starts with a term × document matrix S which is
factorised by a theorem from linear algebra known as singular value decomposition (SVD)
[13, 25, 16, 15, 2, pg. 44]. SVD produces three matrices which can be multiplied
together to recover the original matrix, S. While the matrix is decomposed into the
three resultant matrices of SVD, it is possible to perform operations that reduce the
rank and smooth the original matrix. If the original matrix is S, the three matrices
produced by SVD(S) are U, D and V^T. (See Equation 2.2.)
SVD(S) = U D V^T    (2.2)
U is a matrix of orthogonal eigenvectors representing the terms in the matrix; it has
dimensionality t × rank(S), where t is the number of terms in the vocabulary. D is a
diagonal matrix of singular values of rank rank(S). V^T is a matrix similar to U, but
is a matrix of orthogonal eigenvectors of dimensionality n × rank(S), where n is the
number of documents in the system. Dimensions in V^T represent documents from the
system. Deerwester et al. provide a good visual description of the matrices during and
after the LSA operation [13].
LSA truncates the singular value matrix D by taking the highest k singular values and
zeroing out the rest of the singular values on the diagonal. Thus when the matrices are
multiplied together again, the rank of the matrix S′ is reduced to k.
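The truncation step can be sketched with NumPy's SVD routine; the toy term × document matrix below is invented purely for illustration:

```python
import numpy as np

# A toy term x document matrix S (rows: terms, columns: documents).
S = np.array([[2.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 2.0]])

U, d, Vt = np.linalg.svd(S, full_matrices=False)

k = 2
d[k:] = 0.0                  # zero out all but the k largest singular values
S_k = U @ np.diag(d) @ Vt    # re-multiplying yields the rank-k matrix S'

print(np.linalg.matrix_rank(S_k))
```

The entries of `S_k` differ slightly from `S` everywhere, including cells that were originally zero, which is the smoothing effect discussed next.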
This has two effects. Firstly, noise is filtered out of the matrix, because the lower
singular values were discarded. Secondly, as a result of the noise filtration, latent
concepts in the matrix are exposed [13, 25, 16, 15]. What is found is that after
dimension reduction many weights in the matrix that were zero before are found to be
non-zero. These new non-zero weights represent the latent information in the matrix [33].
Both HAL and LSA have an established track record of building a model of a corpus as a
matrix. LSA employs SVD on the matrix, and the resulting eigenvectors each represent an
“axis of meaning” through the corpus. As HAL is a matrix, it could similarly be given
an eigenvector decomposition via SVD. In the following experimentation, a matrix will
not be built from a generic corpus of text as is often done with HAL and LSA; rather,
the corpus will be constructed around a single word. This then raises the question of
whether the eigenvectors of the corresponding dimensionally reduced matrix correspond
to SoM.
2.2 Word Sense Disambiguation,
Discrimination and Induction
Word Sense Disambiguation (WSD) is the process of finding the sense of a particular
word in a particular context. Discrimination normally refers to disambiguating the
senses of words without any kind of supervision. Induction extends discrimination
further by inducing the senses of a word from the corpus itself. This is in stark
contrast to most WSD systems, which check their classifications against some gold
standard. This is similar to what this thesis is hoping to achieve, but not exactly the
same. As noted in Section 1, aspects, or SoM, are a much more subtle form of ambiguity.
Some techniques in WSD will be assessed, and some other previous work will also be
discussed. It is important to discuss some of the methodologies and problems with
WSD in general as a precursor to investigating new methods. There is substantial work being done in the field
of WSD and it is a hard topic to deal with. While part-of-speech taggers achieve
upwards of 95% success tagging corpora, word sense taggers still suffer from poor
results [36]. While some are defeatist about word sense disambiguation [52], it should
instead be viewed as another challenge, especially since some current systems seem to
perform well [42].
There are three major approaches to word sense disambiguation: knowledge based (dis-
ambiguation), supervised (disambiguation) and unsupervised (discrimination). Knowl-
edge based and supervised methods rely on external resources as a reference to help
disambiguate terms [35]. Supervised methods of WSD use some crafted data in the form
of taxonomies, dictionaries or thesauri. The big problem with this approach, if it is
related back to the “Reagan” example, is that the senses for the word “Reagan” total
exactly one: the ex-president of the USA.3 So using hand-built thesauri or dictionaries
is not acceptable for the problem stated in this thesis. Unsupervised WSD is closer
to what this thesis achieves, since it gives aspects the capacity to show themselves
without being forced to fit into a categorisation system. As noted though, to find SoM,
one step further is required.
Some early WSD work by Wilks et al. to create a machine-tractable dictionary of words
showed enough promise with respect to building a statistical model of words that the
trend continued in most subsequent research [54]. The idea of building a machine-
tractable dictionary of words of different senses leads towards finding a dictionary of
words divided by their SoM.
Sanderson has reviewed several different disambiguation techniques [44]. Most lean
heavily in the direction of supervised techniques and he is quite critical of the others,
3 Justifiably, thesaurus.com does not even have a record of “reagan”, and WordNet has one sense, the obvious one.
except for one. He makes note of work by Schütze and Pedersen in the area of WSD
[45]. Of all the methods he discusses in his paper, the work of Schütze and Pedersen
matches most closely what this thesis achieves, in that it cannot be assumed that any
SoM exist around a particular topic; rather, they need to be ascertained from the text
itself. This quote about other approaches to WSD appropriately justifies their position:
“All these approaches share the problem of coverage: specialized domains
tend to exhibit rare words and specialized meanings, which are not covered
by generic lexical resources. The cost of customizing the resources is often
prohibitively high.” Schütze and Pedersen [45].
Schütze and Pedersen create a novel algorithm for calculating the sense of a word based
on the “company it keeps.” They start by taking the context of every appearance of a
word and counting the occurrences of words in the same context as it (i.e., a window of
size k = 40). Note that this is a frequency count, not a weighted measure of distance as
is the case with HAL. By doing this recording, Schütze and Pedersen build a term × term
TCOR (term co-occurrence representation) matrix, where the weights are the frequency
of terms co-occurring in the same window. To reduce computational complexity, two
rounds of clustering and the singular value decomposition algorithm are used to reduce
the size of the matrices involved in the calculation [47]. The first round of clustering is
done around the most frequent words in the corpus; the second round of clustering is
then done based on words that appear near words from the first round. This is done
for two reasons: to reduce the dimensionality of the data set for processing, and, more
importantly, to reduce the dimensionality while keeping a record of second-order word
co-occurrences. The third major operation is performing SVD on a matrix built out of
occurrence information from the first two cluster operations. This
method is very computationally intensive, but they achieve good results. To combine
the vectors to represent terms effectively in a query, Schütze and Pedersen sum the
vectors for the terms in the query. The sum is not a simple sum, though: the vectors are
scaled according to the tf.idf (term frequency, inverse document frequency) weighting
scheme.
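The first step, the windowed frequency count, can be sketched as follows. This is a simplified illustration (whitespace tokenisation, a small window); the clustering and SVD stages are omitted:

```python
from collections import defaultdict

def tcor_counts(tokens, window=40):
    """Term x term co-occurrence frequencies within a +/- window, as in
    Schütze and Pedersen: a plain count, with no HAL-style distance weighting."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[w][tokens[j]] += 1
    return counts

toks = "reagan met the press reagan spoke".split()
m = tcor_counts(toks, window=2)
```

Each cell of the resulting matrix holds the number of times two terms shared a window, which is exactly the frequency-based weighting contrasted with HAL above.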
Sanderson is quick to point out that Schütze and Pedersen use long queries, often over
100 words, which is unrealistic for a user feedback system. Interestingly, Burgess
et al. make reference to Schütze’s work but do not go into much detail about the dif-
ferences [10, 46]. There are two key differences between HAL and Schütze’s work. In
HAL, co-occurrence data is recorded as the inverse of the distance between two words;
this makes word meaning and relationships in HAL sensitive to the distance between
any two co-occurring terms, and makes HAL a logical extension of Schütze and Peder-
sen’s work. Unfortunately, most literature around HAL by Burgess et al. only mentions
dimension reduction in passing. Schütze and Pedersen, on the other hand, consider SVD
a key part of their work. The conclusion is that Schütze and Pedersen’s work is elegant
but does not include distance information from the corpus and may be too computa-
tionally intensive.
Similar to the work of Schütze and Pedersen is the work of Pantel and Lin, who un-
fortunately were unaware of the work of the former but nevertheless make a valuable
contribution to the field [39]. This oversight is also noted by Rapp [41]. Pantel and
Lin use a clustering mechanism they call Clustering By Committee, which clusters terms
based on point-wise mutual information (PMI) weightings. PMI is a probabilistic
measure of the association between two random variables. The Minipar parser arguably
makes the workings of their system “semi-supervised”, as it consults non-corpus-based
resources in order to quantify grammatical contexts. Although this method deviates
from purely “corpus-based” unsupervised principles, it seems to give good results, so
it may be worth investigating Minipar as an indexer in future research.
Rapp makes reference to Pantel and Lin and continues in a similar vein, with good
results and a lesser degree of supervision [41]. He uses only lemmatisation during the
text indexing stage; this is still not ideal, as it is preferable to evaluate purely
unsupervised systems. Rapp is actually using the process behind Latent Semantic
Analysis and acknowledges the fact in his evaluation. He outlines the key differences
as being his use of lemmatisation and of a sliding window of size two instead of whole
documents as context. He thus claims that TCOR works better than DOR in his
situation; this is also supported implicitly by Pantel and Lin in their use of
grammatical relationships to build PMI.
Some of the most recent work that builds upon Rapp, Schütze and Pedersen’s work is
that of Chatterjee et al. [12]. They propose using the Random Indexing (RI) method
for reducing a word space similar to HAL for the task of WSD. The method shows
promise and is very much aligned with the work in this thesis. However, because of the
heavily random element of RI, it is not used as a dimensional reduction strategy in this
thesis.
In [37] Neill proposes a system for finding word senses that is slightly different from
the other WSD systems discussed here, and is even closer to calculating SoM. He calls
it Sense Induction: the senses are determined from the corpus itself and evaluated
qualitatively by humans (in his case, by himself and his supervisor). Though his system
is probabilistic, a lot of the principles translate well into purely geometrical
representations. Sense induction uses a similarity matrix to build a graph-like
structure of the localities in the data set; this information is then used to seed a set
of clusters, each cluster representing a sense. Words are loosely grouped into their
clusters and can be members of several sense definitions. The most innovative work in
Neill’s paper is his use of conditional entropy to calculate the overall performance of
his proposed system. This is discussed further in Section 4.3.3, as it addresses many
of the evaluation issues.
Rapp continues his research into sense induction by further analysing the work of Pan-
tel and Lin [39] and Neill [37] and extending their ideas [42]. His work is of
particular interest because it steps away from using globally scoped vectors and
instead clusters vectors representing each context. Each sentence becomes a vector in
the system, with Boolean weights indicating whether a word exists in that context or
not. He then does some filtering on some terms, and applies SVD to the matrix. This
system gives him good overall results, about 86% unsupervised accuracy, far higher
than the approximate 75% supervised ceiling achieved in the original SENSEVAL tasks.
The biggest problem associated with Rapp’s procedure is the need to store each sen-
tence as an individual entity; as mentioned in Section 1, this could be a problem when
scaling to very large numbers of sentences. Fortunately, since this is not specifically
an IR system, the right number of sentences can be defined as “however many it takes
to get all senses of a word.” Rapp’s system is good and fits well with the hypothesis of
this research: that dimensional reduction helps uncover SoM in text. His ideas will be
pursued further after this present line of research. (See Future Work, Section 6, for
more information.)
In the field of WSD4 researchers overwhelmingly use TCOR as the basis for collecting
4 More generally, this is computational linguistics as opposed to information retrieval.
data about a particular corpus. Additionally, there have been good recent advances in
the field of unsupervised and semi-supervised methods for WSD. This young field of
research is faced with the problem of not having a standard system for quantifying re-
sults. The “Test of English as a Foreign Language” (TOEFL) is a popular method of
quantifying results amongst researchers in the field of WSD, but it does not suit the
test for aspects. Another way the researchers discussed here quantify the results of
their systems is to compare them against established WordNet sense taxonomies. The
problem of quantifying evaluation will be discussed in more depth in Section 4.
2.3 Summary
Several methods of dimensional reduction or clustering that could be applied to the
problem of inducing SoM have been discussed. Nearly all of the methods are used by
their creators to improve information retrieval performance, which is not the aim of
this thesis. Instead the aim is to disambiguate SoM in vectors that carry a lot of
context information. The models discussed in detail were Hyperspace Analogue to
Language (HAL) and Latent Semantic Analysis (LSA), whereby Singular Value
Decomposition (SVD) was discussed as the means for dimensionally reducing the model
to improve the quality of both term and document representations. While these models
have been used in the form of term × document or term × term matrices, these matrices
are typically constructed from the whole corpus. There seems to be no literature in
which the matrix model represents a single concept or word.
Based on this literature review, a system is proposed which computes a HAL TCOR
model derived from a set of sentences corresponding to a word or concept, resulting in
a matrix model corresponding to that concept. This matrix model is input into various
techniques for computing the SoM for the concept. Nearly all of these methods (except
Vector Negation) are used to process term × document DOR information. While some
of these systems have been used to process term co-occurrence information as well
as document co-occurrence information, there does not seem to be any literature on
the analysis of SoM within a single concept. This leads into a discussion of how this
can be investigated; in the next section the potential practical work that this research
leads to will be covered.
In summary, a system using a HAL TCOR space built over a set of sentences containing
a concept is proposed. The application of each of the reviewed systems to this semantic
space will potentially yield SoM. To evaluate whether this modified space contains a
good description of the shades of meaning, the conditional entropy strategy will be
used.
CHAPTER 3
Techniques for Calculating SoM
3.1 Singular Value Decomposition
The SVD of a matrix of term × document information has been described as an ‘intrinsic
semantic space’, which supports the hypothesis that eigenvectors could be highly
related to the SoM around a concept [14]. The intuition behind the term “intrinsic”
refers to the eigenvectors of the decomposition of the matrix being “axes of mean-
ing”. SVD has also been applied to term × term matrices, the eigenvector decomposition
of which should also produce axes of meaning.
Kontostathis and Pottenger provide an analysis of the eigenvectors produced by SVD
during the LSA process [24]. They use a term × term matrix T , calculated by taking the
product of U and D′ and multiplying it by its own transpose; the result is a symmetric
matrix. The equation is given below:
T = UD′(UD′)T (3.1)
In this term × term matrix, weights are still based on DOR; the weights are considered
the similarity between the terms in each dimension. This kind of approximation would
also be possible with HAL/TCOR vectors, and is explored by Bruza and McArthur [33].
Unfortunately, the data used in their experiments is not available to continue tests on.
The HAL algorithm shown in Equation 2.1 produces an n × n matrix M, where n is
the number of terms. Column i encodes accumulated co-occurrence weights of terms
appearing after term i in a context window. Conversely, row i encodes accumulated
co-occurrence weights of terms preceding term i in a context window. In this way, HAL
captures order information. In our case, we will not take word order into account, so
matrix M is added to its transpose, resulting in a symmetric n × n matrix S.
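The construction just described can be sketched as follows. Equation 2.1 itself is not reproduced in this section, so the linearly decaying weight window − d + 1, a common HAL formulation, is an assumption here, as is the toy token sequence:

```python
import numpy as np

def hal_matrix(tokens, vocab, window=5):
    """HAL-style accumulation: column i gathers weights for terms appearing
    after term i, with closer pairs weighted more heavily (window - d + 1)."""
    idx = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for i, w in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d < len(tokens):
                # tokens[i + d] appears d positions after w
                M[idx[tokens[i + d]], idx[w]] += window - d + 1
    return M

toks = "reagan president iran reagan".split()
vocab = sorted(set(toks))
M = hal_matrix(toks, vocab, window=2)
S = M + M.T            # discard word order: symmetric n x n matrix
```

Adding M to its transpose folds the "before" and "after" weights together, which is the symmetrising step required before the decomposition below.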
The SVD of S is computed as follows:
S = UDV T (3.2)
V = U (3.3)
∴ S = UDUT (3.4)
The SVD of a symmetric matrix is the eigenvalue decomposition of it [32]. Thus, by
using HAL matrices as the input to SVD, we are calculating their eigenvalue decompo-
sition. The question we will investigate is whether the eigenvectors correspond to the
SoM of the word represented by the matrix S. Another way of looking at it is through
the spectral theorem, which is applicable because the matrix is symmetric. The spectral
theorem shows that S can be rewritten as a sum of projectors [6]. (See Equation 3.5,
where ui is the ith column vector of U and σi is the singular value D[i, i].) The problem
of SoM can be described as the question of whether ui can accurately describe a SoM
for a particular word.
S = ∑_{i=1}^{n} σ_i u_i u_i^T (3.5)
By creating a rank-k approximation of S, where k < n, the SoM space can be reduced to
varying sizes by changing the parameter k. (See Equation 3.6.)
S ≈ ∑_{i=1}^{k} σ_i u_i u_i^T (3.6)
Because the singular vectors and values are ordered by the size of the singular values,
the eigenvectors in U are expected to always be in the same order: for k = 10, the first
five vectors will be the same as those produced when k = 5. As k is increased it will
introduce new vectors that have lower associated singular values but are orthogonal,
and thus each could be a new SoM.
The SVD algorithm used in this research is the implementation provided in the numpy
package1, which was robust enough to deal with the relatively small amount of data
required for this experiment. After removing the key term, the HAL matrix was
decomposed via SVD and the right singular vectors from V T were used as shades of
meaning, though the left singular vectors from U could equally have been used.
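The procedure just described can be sketched as follows; the matrix values are illustrative only, standing in for a real symmetric HAL matrix:

```python
import numpy as np

def som_vectors(S, key_index, k):
    """Remove the key term's row and column from the symmetric HAL matrix,
    decompose with numpy's SVD, and return the top-k right singular vectors
    as candidate shades of meaning."""
    keep = [i for i in range(S.shape[0]) if i != key_index]
    S2 = S[np.ix_(keep, keep)]
    U, d, Vt = np.linalg.svd(S2)
    return Vt[:k]                  # each row is one candidate SoM axis

S = np.array([[2., 1., 0.],
              [1., 3., 1.],
              [0., 1., 2.]])
shades = som_vectors(S, key_index=0, k=2)
```

Because the rows of V T are orthonormal, each returned shade is orthogonal to the others, matching the "axes of meaning" interpretation above.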
3.2 Concept Indexing
Concept indexing (CI) is a method proposed by Karypis and Han to project a term ×
document matrix into a lower subspace [22]. It is claimed to be much faster and more
efficient than LSA. By looking at the algorithms behind CI there may be a way to adapt
it to work on HAL vectors. From a high-level perspective CI fits right into the model of us-
1 http://numpy.scipy.org/
ing high dimensional spaces to represent concepts. Like LSA, it uses a term × document
matrix to represent a large collection of documents. CI then clusters the documents
into k clusters, with each cluster containing documents of a similar topic; that is, doc-
uments in the same cluster should be more similar to each other than to documents
in other clusters. Naturally, the success of CI relies heavily on the effectiveness of the
clustering algorithm. The centroid vectors of the clusters then become the new axes, or
basis vectors, for the reduced-dimension representation of the collection. Documents
can then be projected onto this k-dimensional space, and document similarity can be
measured by calculating the cosine similarity.
The centroid vectors provide an effective way to represent all the documents contained
within a cluster. Because the documents are all of a similar nature, the highly weighted
terms in the centroid actually provide a good way to summarise the members of the
cluster. To project a document down from its original high dimensional form into the
new lower dimensional subspace, the document is compared via the cosine similarity
function to each centroid vector in the reduced subspace, so the similarity to the ith
centroid becomes the ith element in the new document vector. This is found by calcu-
lating the dot product of the matrix of basis vectors with the document vector.
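This projection step can be sketched directly; the centroid and document values here are hypothetical:

```python
import numpy as np

def project(doc, centroids):
    """Project a document onto the k-dimensional CI subspace: element i of
    the result is the cosine similarity of the document to centroid i."""
    C = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return C @ (doc / np.linalg.norm(doc))   # dot product with the basis matrix

centroids = np.array([[1., 0., 0.],
                      [0., 1., 1.]])
doc = np.array([1., 1., 0.])
p = project(doc, centroids)
```

The document thus keeps a graded similarity to every centroid, rather than being assigned wholesale to its nearest cluster.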
As mentioned, the clustering algorithm plays a very important role in the reduction
of the document set to meaningful groups. Several clustering algorithms have been
benchmarked for clustering effectiveness and performance [4]. The clustering algo-
rithm used in CI is based on the partitional clustering system developed in the afore-
mentioned benchmarks and research. This clustering system is very similar to k-means
clustering, which outperforms traditional forms of clustering quite convincingly.
This partitioning starts by randomly selecting k documents and
using them as the basis for selection of more documents to be added to each cluster.
Documents are then iteratively added to the clusters based on the centroid that they
are most similar to, and at each iteration the centroids are recalculated to represent
the new clusters. In CI, this method is slightly modified so that at each iteration only
two partitions are created, but the algorithm is applied k − 1 times. It is claimed that
this keeps the sizes of the clusters more regular, which improves dimensional reduction.
At each step the algorithm must choose which cluster to split into two: it chooses the
cluster with the biggest spread of concepts. This is done by calculating the square of
the length of the centroid vector, which represents the average of the similarities of all
the documents in the cluster; subtracting this number from one gives the dissimilarity.
Thus the algorithm chooses the cluster with the highest dissimilarity to split further.
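A simplified sketch of this repeated-bisection scheme follows. It uses plain 2-means splits and the one-minus-squared-centroid-length dissimilarity; the five-restart selection that Karypis performs is omitted, and the data points are invented for illustration:

```python
import numpy as np

def dissimilarity(cluster):
    """One minus the squared length of the centroid of the normalised members:
    the 'spread of concepts' used to pick which cluster to split next."""
    C = cluster / np.linalg.norm(cluster, axis=1, keepdims=True)
    c = C.mean(axis=0)
    return 1.0 - c @ c

def bisect_cluster(X, k, rng=None):
    """Split the most spread-out cluster into two, k - 1 times, using a few
    2-means refinement steps per split."""
    rng = np.random.default_rng(0) if rng is None else rng
    clusters = [X]
    for _ in range(k - 1):
        i = int(np.argmax([dissimilarity(c) for c in clusters]))
        c = clusters.pop(i)
        seeds = c[rng.choice(len(c), size=2, replace=False)]
        for _ in range(10):
            Cn = c / np.linalg.norm(c, axis=1, keepdims=True)
            Sn = seeds / np.linalg.norm(seeds, axis=1, keepdims=True)
            labels = (Cn @ Sn.T).argmax(axis=1)
            seeds = np.array([c[labels == j].mean(axis=0) if (labels == j).any()
                              else seeds[j] for j in (0, 1)])
        clusters += [c[labels == 0], c[labels == 1]]
    return clusters

X = np.array([[9., 1.], [10., 0.], [1., 9.], [0., 10.]])
parts = bisect_cluster(X, k=2)
```

On this toy data the two obvious directional groups are recovered; the random seed choice criticised below is visible in the `rng.choice` call.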
It seems that not randomly choosing the seed vectors may help create better clusters;
perhaps this could be done by picking a vector, then picking a second only if its cosine
similarity to the first is sufficiently low, otherwise randomly picking another and
checking its similarity, and so on. This problem is somewhat mitigated by the fact that
in CI Karypis actually performs the clustering five times, and the best of the five
clustered sets is chosen. Even so, using a basic sanity check would hopefully give
better results with minimal overhead.2
Karypis discusses a system for calculating the effectiveness of the dimension reduction
from CI and LSA, which he calls retrieval improvement. This system works by checking
how tight a semantic neighbourhood is before dimensional reduction, and how much
tighter it becomes after the reduction. He observes that CI performs
2 Unless, of course, none of the documents are below the threshold, in which case it would be a wasted effort.
comparably with LSA.
One of the principal claimed benefits of CI is that it performed five to eight times faster
than LSA for Karypis; he says that this is a benefit of using clustering instead of SVD.
This claim does not seem to be based on any firm mathematical or even empirical re-
sults. After conducting empirical tests, it was found that the final matrix multiplication
step of LSA can be many times slower than the SVD calculation itself when calculated
in numpy for Python. The reason could be that the final matrix produced by LSA has
many more non-zero elements than the original full-rank matrix. Some empirical
results are as follows: SVD on a 3000 × 3000 matrix took 41 seconds, while the matrix
multiplication of UDV T without dimensional reduction took 384 seconds. In MATLAB
the results were reversed, which indicates that SVD being the primary reason why LSA
is slower than CI could actually be an implementation detail. One thing that concept
indexing is almost certainly better at dealing with is extremely large data sets, because
most implementations of SVD are not optimised to run on sparse matrices, and the
ones that are still have ceilings on the amount of data they can handle at once.
When applying CI to HAL vectors, the type of information that is being clustered is dif-
ferent: the data is context information for specific terms (TCOR), not the documents
they were found in (DOR). This means the final result is a reduced dimensional space
around terms. In IR, clusters that are built based on term distances are known as metric
clusters [2, pg. 126]. It is hoped that, given the right value for k, words would be
clustered into groups that represent aspects. What was tested in this regard is whether
using HAL vectors as opposed to document vectors gives significant improvements.
The reason this is mentioned is that in CI as performed by Karypis, ambiguous terms
are kept within their document and potentially their aspect. So when
dimensional reduction is performed, one would find ambiguous terms showing their
full conceptual meaning for the particular aspect they are contributing to. On the other
hand, there are still two axes (term × term) whereby terms, if treated as sub-parts of
concepts, could appear in multiple clusters. This is something that was tested; interest-
ingly, another overarching problem with applying CI to this task was discovered.
The biggest problem with CI is the fact that it randomly selects the seed vectors for the
clusters. This may perform well for information retrieval tasks but is not optimal with
respect to the accuracy of SoM. In general, information retrieval requires a relevant
document to be selected based on its general representation, not the exact centroid of
the cluster it is closest to. In concept indexing, the final stage of the process is the
projection of a document vector down onto the new subspace formed by the dimen-
sionally reduced vectors. This means that the essence of the document itself is pre-
served: it is not approximated by a nearest neighbour, it is a dimensionally reduced
version of itself. For shades of meaning the projection step is never performed; instead,
the centroids that best describe the original space are required. For this particular task
the ideal shades need to be selected to best represent something there is only a nebulous
idea about. It would be preferable to make the initial seed selection process less random.
The concept indexing algorithm proved somewhat troublesome and has not given the
originally envisaged results, because of its random selection of the initial vectors used
to seed the centroids. The intent was to build a new set of basis vectors based on the
centroids of the clusters built from the data. This process could well be very effective
for the information retrieval tasks CI was intended for, where documents are projected
down into the new space defined by the cluster centroids. But when it comes to
reducing the dimensionality of terms, the random selection method may not be
representative enough. This is an important distinction when it comes to IR. Generally,
when a system is queried, a combination of the query weightings and document
weightings is used to provide a ranked list of results. This is acceptable when precision
is measured by the overall result of the ranked list of resources. In the case of shades of
meaning, single concepts that best represent the original semantic space are required,
and for this the seed selection process is critical. The method of selecting random
vectors could be offset by implementing the seed selection process outlined in Neill’s
Sense Induction paper [37], but he himself also notes some lacklustre performances
due to the somewhat arbitrary selection process. This is still an open problem which
warrants further research. Early pilot tests with concept indexing found it lacking for
this particular task; as such, it was not empirically evaluated as a candidate method for
computing SoM.
3.3 Vector Negation
Vector negation could refer to any kind of negation operation on vectors; in this case it
specifically refers to a technique investigated by Widdows in Geometry and Meaning
[53]. The NOT operation defined by Widdows is a subtraction which does not com-
pletely subtract one vector from another, but merely orthogonalises one against the
other. Widdows’ WORDSPACE model is broadly similar to LSA in that it is based
around a term × document DOR model, but it does not deal with dimensional reduc-
tion. It simply represents documents as vectors of term frequencies, and terms as vec-
tors of frequencies across documents. He then creates term × term matrices, as adja-
cency matrices of pairwise term similarities of the normalised information from the
documents.
The very interesting concept here is the idea of having a NOT operator that works on
vectors in order to make them irrelevant to each other (orthogonal). This enables NOT
operations to be defined for queries. This may seem irrelevant to the present task, but
in reality it can be generalised to help find SoM. Exploring what negation actually
entails will enable further analysis of how it is relevant. Widdows defines the concept
of vector negation as follows:
“Two word vectors a and b in WORDSPACE are considered irrelevant to
one another if their vectors are orthogonal, i.e. a and b are irrelevant to one
another if a · b = 0.” Widdows [53].
The concept in the quote above is very simple: if a vector has a cosine similarity of zero
to another vector, then the two have nothing to do with each other. Widdows asserts
that simply computing the vector subtraction a − b would not work well, because it
would be a very brute-force removal of b from a. Instead, he suggests rescaling b such
that subtracting it from a leaves some elements in a that may not be there because of b
in the first place; that is, forming a vector a − λb. The question, then, is how to scale b
to get the desired result. This is described as making the vector b irrelevant to a, in
other words, orthogonal. To operationalise this theory, he describes the method of
orthogonalising vectors: given two vectors a and b, take the dot product a · b, scale b
by it, and subtract the resulting vector from a.
a NOT b = a − λb (3.7)
(a NOT b) · b = 0 (3.8)
λ = (a · b) / ‖b‖² (3.9)
‖b‖ = 1 (3.10)
a NOT b = a − (a · b)b (3.11)
Equation 3.7 shows that the a NOT b vector is the a vector with some scaled multiple of
b subtracted. Equation 3.8 states that a NOT b should have a cosine similarity of 0 to b.
Equation 3.9 is given by Widdows as the solution for λ. With normalised vectors the
length ‖b‖ is 1, which allows the equation to be reduced to the final line, giving the
operational formula for negating the vectors.
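Equation 3.11 is straightforward to implement; a minimal sketch with illustrative vectors:

```python
import numpy as np

def vector_not(a, b):
    """Widdows' NOT operation (Equation 3.11): remove from a its component
    in the direction of b, so that the result is orthogonal to b."""
    b = b / np.linalg.norm(b)        # normalise b so that lambda = a . b
    return a - (a @ b) * b

a = np.array([3., 4., 0.])
b = np.array([1., 0., 0.])
r = vector_not(a, b)
```

The result has zero dot product with b, which is exactly the "irrelevance" condition of Equation 3.8.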
The reason this is interesting is that by taking a vector and NOTing it with another
vector, one is effectively saying: given this vector, remove this concept from it. Now
if there is a vector for an ambiguous term, it could be presumed that taking the most
similar term to it and performing the NOT operation would remove an approximation
of that concept from the vector, allowing concepts that may have been overshadowed
before to come through more strongly. With regard to the “reagan” example, one would
hope to be able to take the reagan vector and apply the NOT operator to it to get a
vector weighted more heavily in the direction of another shade of meaning. So if
administration is the dimension with the highest value in the reagan vector, reagan
NOT administration would give the reagan vector orthogonalised to administration;
in other words, a vector where administration no longer represents a dominant shade
of meaning. This would hopefully allow another shade of meaning to shine through,
like something relating to the Iran-Contra affair. Widdows shows this idea working on
selected data in his book [53].
The next experimental step from this would be treating negation as a way of decom-
posing a vector into its SoM, by continuing to apply the NOT operator, k times in total.
This process involves taking a vector and performing the NOT operation, which gives
two vectors: the original, and one without the heavy weighting associated with the
original. If this step is repeated, the final result is a series of vectors, each with a
reduced amount of information. Hopefully each vector would be representative of a
particular shade of meaning. Since these vectors will be orthogonal, they could be used
as basis vectors defining a subspace corresponding to a concept, whereby each basis
vector corresponds to a SoM. Each of the concepts would be a dimension in this
reduced space, on which clustering could be performed. This is highly speculative, and
was tested thoroughly. (See Section 4.)
VN was expected to perform quite well, but is actually fundamentally incompatible with HAL vectors. This is best explained by an example. Say the vector for “reagan” is made up of many dimensions, the largest of which is the dimension for “president”. For this example, assume that the “reagan” vector only has three dimensions. It could look something like: reagan : [president : 20, iran : 15, missile : 5]. Decomposition would mean finding the “president” vector and performing the NOT operation with it and the “reagan” vector. Now, HAL vectors by their very nature are built out of co-occurrence information gathered from a corpus, so president and reagan co-occur often; thus the “president” dimension is large in the “reagan” vector, but the “reagan” dimension is not large in the “reagan” vector. The same holds for the “president” vector: it has no large “president” dimension of its own. This means that when the “reagan” vector is negated against the “president” vector, the “president” dimension will never be reduced in the “reagan” vector, because there is no “president” dimension in the “president” vector. This is a critical step in the negation process, because once the vectors are negated, the next largest dimension in the original vector is the one the operation is performed with next time. In the example that would be the “president” vector again, resulting in a useless loop. A workaround has been developed which keeps a record of which dimensions’ vectors have already been used to negate against the original vector. This method seems to give results when used with other data, but the algorithm does not work with HAL vectors.
The implementation of the Vector Negation principle is an interpretation and adaption of the original concept provided by Widdows. Unfortunately, worthwhile results were not achieved during pilot studies, and the method was consequently not empirically validated as a candidate for computing SoM. It intuitively seems like a good way to find shades of meaning and remains an open research question.
3.4 Non-negative Matrix Factorisation
Non-negative Matrix Factorisation (NMF) refers to a family of mathematical algorithms designed to factorise a matrix into two, or sometimes three, sub-matrices [3]. The operation is broadly similar to SVD except that it enforces different constraints on the data. The main constraint is that all of the values in the resultant sub-matrices must be non-negative.
The NMF operation most frequently decomposes a large matrix S into two smaller matrices W and H. W is considered a set of features of S, and H a set of variables describing how the original data relates to the entries in W. For some specific implementations the W matrix can be considered a set of cluster centroids, while H contains information on how closely each of the rows or columns in the original data is related to the centroids. The product WH gives an approximation of the original matrix, not the original matrix itself; as noted by Berry et al., it is therefore more appropriate to refer to NMF as non-negative matrix approximation. The NMF problem has no unique solution, which is noted in most research on the topic [3, 27, 55].
The actual factorisation of the matrix happens by minimising a particular function over a data set S to obtain the basis vectors, or underlying features, as the matrix W. Many different factorisations are possible for a single matrix, depending on the constraints and the distance/divergence functions used to do the factorisation. A generalisation of the operation given by Lee and Seung is shown in Equation 3.12 [27], where S is the original data set, WH is the result of the factorisation, and D is a measure of distance or divergence to be minimised between S and WH. In short, the factorisation seeks a representation of S that is much smaller than, but as close as possible to, S itself.
JNMF = D(S, WH) (3.12)
Different methods to do the factorisation have been developed to target different complexities of non-negative matrix factorisation. The issues can be broadly categorised into two groups: whether the factorisation converges, and how quickly it converges. Different algorithms have been developed to tackle each of these problems. One of the most troubling issues is the amount of time the algorithm usually takes to converge; NMF takes a long time to complete.
The NMF paper that stirred a lot of recent activity in NMF research uses Kullback-Leibler (KL) divergence as its distance metric [27]. There are other methods for measuring the distance D, including a Euclidean one using the Frobenius norm. For this research into shades of meaning, the KL method, considered the baseline for other implementations of NMF, is used. The KL-divergence function is described in Equation 3.13 [27].
F = Σ_{i=1}^{n} Σ_{µ=1}^{m} [S_{iµ} log(WH)_{iµ} − (WH)_{iµ}]   (3.13)
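The multiplicative update rules Lee and Seung derive for this KL objective can be sketched as follows. This is an illustrative numpy implementation under the standard update equations, not the publicly available implementation [34] actually used in the experiments; the small epsilon and iteration count are arbitrary choices for the sketch.

```python
import numpy as np

def nmf_kl(S, k, n_iter=300, eps=1e-9, seed=0):
    """Factorise a non-negative (n x m) matrix S into W (n x k) and
    H (k x m) using Lee and Seung's multiplicative updates for the
    KL divergence objective. A minimal sketch, not optimised."""
    rng = np.random.default_rng(seed)
    n, m = S.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        # H update: scale by W^T (S / WH), normalised by column sums of W.
        H *= (W.T @ (S / WH)) / (W.sum(axis=0)[:, None] + eps)
        WH = W @ H + eps
        # W update: scale by (S / WH) H^T, normalised by row sums of H.
        W *= ((S / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H
```

Because the updates are purely multiplicative and the factors start positive, W and H remain non-negative throughout, which is what enforces the NMF constraint.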
Because NMF produces matrices with no negative values, they intuitively make
more sense as something to compare strictly positive vectors to. SVD, on the other hand, creates eigenvectors that can contain negative values and in some cases lie entirely in the negative space. It will be interesting to see how NMF and SVD compare to each other in the results. By all accounts the vectors in the W matrix fit the intuition of what constitutes a shade of meaning, and this will be explored in the experimentation section of this research. Studies have shown that NMF can be described by a more general model of PCA, and moreover that NMF computed with KL divergence is equivalent to probabilistic Latent Semantic Analysis [7, 20]. This further supports the intuition that the column vectors of W may correspond to the SoM.
The NMF algorithm used in the experiments is publicly available [34].3 Because NMF is an iterative algorithm it can take a long time to converge. This is not a problem in this strand of research, but for an algorithm that shows such promise it is unfortunate that it can be prohibitively slow. The HAL matrix was stripped of the vectors for the key term and processed with the NMF algorithm to obtain the resulting matrices W and H. Each column vector of W was considered a shade of meaning.
3http://www.ma.utexas.edu/users/zmccoy/plsinmf.html
CHAPTER 4
Empirical Evaluation
The problem of evaluation is challenging for this area of research. With typical word sense disambiguation tasks there is a human-tagged golden standard corpus against which the results of a system can be tested. Unfortunately, for disambiguating SoM there is no such golden standard. This issue is also noted by Neill [37], who proposes several key challenges standing in the way of a successful evaluation methodology: choosing the right words to disambiguate, how to define a sense, and how many senses are inherent to each word. The challenge of finding SoM is thus more like the problem of sense induction, more complex than a typical set-and-run algorithm for information retrieval or word sense disambiguation. An appropriate way must be found to approximate using the system to disambiguate SoM. This includes choosing the words to find SoM for, and defining how to find shades and the number of them. The performance of this system then has to be evaluated and quantified.
4.1 Data
The data requirements for this thesis were a collection of traces: a trace is a phrase or sentence containing a word for which we are interested in computing the SoM.
(Examples of traces for “Reagan” are given in Figure 1.4.) The intuition here is that a semantic space can be built for a specific concept, rather than a generic one for a whole corpus. This semantic space holds many different aspects of a single concept encoded in a single matrix representation. Initially traces were selected for a few handpicked concepts from a chosen corpus, but this allowed for assessor bias, so a semiautomatic system for selecting traces had to be adopted. To find potential solutions, other related fields of research were reviewed for inspiration. For word sense disambiguation the established series of events and competitions are the SENSEVAL1 tests. The test task data are sentences built around a particular word, much like the training data used for the tests of SoM. Each sentence has a gold-standard tagged sense for the instance of that particular word in that particular sentence [23]. SENSEVAL systems are trained on the data and expected to disambiguate the sense of an instance of a word in a particular context. The data in the SENSEVAL corpora has proven excellent for testing early pilots of the various systems, and the algorithms mentioned here have given promising results.
Despite the SENSEVAL data being interesting to run the system on, it is not really representative of the problem of finding SoM. The SENSEVAL data has sub-senses listed as well as regular senses, and the sub-senses listed in the SENSEVAL golden standard dictionary do not necessarily map to a shade of meaning, mostly because the sub-senses are often defined by their syntactic effect on the words around them. Furthermore, some of the senses are overarching general senses which in reality could apply to many different SoM. As an example from the SENSEVAL dictionary, the noun sack is listed as having several main senses and various sub-senses within each of the main ones. The two major relevant senses are the sense related to “getting the sack”
1http://www.senseval.org/
and a sack like a “sack of things.” When decomposing our data into two shades, the two shades appear to represent each of these different senses, which seems reasonable. But when decomposing into five different SoM, instead of other senses appearing, similar senses are uncovered, associated with different contexts. So, for example, more sack shades are created, with the context of sacks of grain in one shade and sacks for garbage in another. This is a good sign for finding SoM, but not promising for evaluating against the SENSEVAL golden standard, which cares about senses, not shades. This is discussed further in the results section. This still leaves the problem of not having a good evaluation strategy to test the performance of the proposed systems. In this case it is suitable to try existing evaluation systems, despite their shortcomings, in conjunction with a new system based on the intuitive results that need to be quantified.
4.1.1 Word Selection
Word selection is an important part of the problem, primarily because there need to be some words that will clearly have SoM, as well as words that do not. What is meant by “clearly” is explained in the following.
Several words were selected from the SENSEVAL data: all of the nouns [23]. All the nouns were chosen so there could be no question about how they were selected. It is suspected that it would be possible to find shades of meaning around verbs and adjectives too, but since this is such an early strand of research that is left for future work. The words chosen from the SENSEVAL data were: [shirt, giant, rabbit, disability, behaviour, knee, excess, bet, sack, scrap, onion, promise, steering, float, accident]
In addition to the SENSEVAL words, several words were chosen from the Reuters-21578 [28] collection, from which a hand-crafted golden standard was created. The researcher and supervisor collaboratively selected words from the Reuters-21578 collection. These words were chosen so that there would be a good selection of words that should perform well and ones that more than likely will not. The words chosen from the collection are: [reagan, GATT, president, oil, economy, coffee]. These words show a spectrum of specificity, ranging from GATT (a very specific acronym) to general terms such as “oil” and “economy”.
If all occurrences of the chosen words in the Reuters-21578 collection were used, the resulting HAL space and vectors would have been extremely large and harder to process, especially with the SVD implementation available. Additionally, manually tagging all words from the corpus would have been impossible. To avoid this problem, only the occurrences of the words that appeared in titles were used. Titles are normally used to summarise the content of an article, so it was expected that they would give good conceptual information while not requiring so much data to be processed. Armed with a selection of words, it is now possible to look at the different stages of evaluation in this research.
4.1.2 Evaluation Data Sets
Both data sets fit a common format, which means similar tests can be run on both. Both the SENSEVAL data and the crafted data from the Reuters collection comprise words in context, generally a sentence or two. In the case of the SENSEVAL data the traces are a couple of sentences, longer than the simple headlines used for the Reuters data. For each chosen word there are a particular number of traces, and
the number of traces varies from term to term. Table 4.1 shows the number of traces for the data in the Reuters collection.
w          Number
coffee     114
economy    57
oil        416
GATT       35
president  67
Reagan     150
Table 4.1: Trace table
Reducing the data that is indexed to a subset of the original corpus really builds a very specific subspace, a locality of that particular term or concept. This is a powerful model because a global model of the text is not as important as a key concept and the different contexts surrounding it. A benefit of using this reduced data set is that it is much smaller and easier to process. The key point here is that the full text of each corpus is not used, which means higher-order relationships may not be modelled by the system very well. This is not a key topic of this research, though, and should not be an issue.
See Appendix A for some examples of text strings used in the experiments.
4.1.3 Human Tagging
The tagging that needed to be done manually was for the traces of the hand-crafted data from Reuters-21578.2 The data was tagged independently by both the researcher and supervisor, who then agreed on a common set of tags which were applied jointly to the data.
2The SENSEVAL traces are already tagged.
Some of the trials and tribulations associated with collecting human-tagged data are well documented by Kilgarriff and Rosenzweig [23]. Problems with human tagging include inter-assessor agreement and replicability. The requirement for a new set of golden standard data is definitely a weakness in the evaluation strategy that needs to be further investigated. Initially the shades were going to be tagged in addition to the traces, and a simple comparison of tags would have been the resulting accuracy measure. This was deemed insufficient for objectively evaluating the performance of the system, and a different method for evaluating performance was researched.
4.1.4 Preprocessing
There are several methods of preprocessing text that change the effectiveness of an information modelling system, including part-of-speech tagging and stemming. These are discussed briefly, simply to justify decisions made in this thesis.
Part of Speech Tagging
Part-of-speech (POS) tagging refers to an automated algorithm for pre-processing text that tags each word with a syntactic type [33]. It seems that adding syntactic structure to indexed corpora adds no benefit with respect to adding meaning or resolving ambiguity in an information retrieval system [43]. Though, as noted by Sanderson, the use of such syntactic information to derive semantic information may be useful. Further investigations into the use of POS tagging in semantic space models by Burgess [8] and Gärdenfors [19] suggest that POS tagging had no significant effect on their work. Pinker [40, p82], in his definition of “mentalese”, points out that a concept as represented in the consciousness of the human mind is devoid of conversation-specific words,
constructions, information about pronouncing words, and the order that words are in. While it is questionable whether this is completely true, it lends support to the use of semantic space models without POS tagging. Relating POS tagging back to the “Reagan” example, the part of speech for Reagan will be the same in nearly every context it is found in: it is a proper noun everywhere!3
Stop Words
Stop words are almost universally dropped from semantic models. Burgess and Gärdenfors are primarily concerned with the cognitive perspective of modelling language while still keeping operational mechanisms in mind. Burgess asserts that humans do not remember sentences because of their syntactic structure; they actually remember the concept representing the meaning of the sentence. Principles such as structural semantics take syntax into consideration when analysing the meaning of a sentence or body of text [8, 25]; in fact, “Chomskian” linguistics puts a heavy emphasis on punctuation and other syntactic markers [19]. Gärdenfors claims that syntax is only useful “for the subtlest of aspects of communication”.
Stemming
Stemming refers to the process of reducing words to their “stems”: “banking” to “bank”; “flooded”, “flooder” and “flooding” to “flood.” This reduces a series of terms to a single common concept [2, p168]. It reduces the complexity of the representation in terms of the number of terms, but it also makes term representations heavy with many different contexts, which is the exact problem this research seeks to resolve.
3Unless of course someone turns it into a verb for some reason. Let’s hope that never happens. :)
There are different levels of stemming available,4 different types of stemming, and often stems of words are not lexically the same (“brought”/“bring”) [43]. Thus stemming can be a very simple or a complex option. It is generally accepted that heavy word stemming can increase IR performance, but at the same time it introduces ambiguity. An example is “training” and “train”: the stems are identical, but each word has a very distinct meaning. It is noted that semantics researchers often avoid stemming altogether [51, 15]. Baeza-Yates, Ribeiro-Neto and Ziviani mention that stemming has been a controversial topic in the field of information retrieval, and that large-scale studies have often given inconclusive results [2, p168]. Pinker also places emphasis on the extra meaning that affixes give stems [40, p133]. Considering that ambiguity is introduced with stemming rather than resolved, and that both the linguistic and statistical sides of the discussion agree its value is questionable, it would be wise not to include stemming as a preprocessing task in semiotic-cognitive analysis of aspects. Stemming should not be completely discounted, though: there is one situation where introducing ambiguity via stemming is not a problem, namely when it can be effectively disambiguated later. In that case a reduced-complexity model exists that makes processing faster without losing the subtle nuances and disambiguation provided by term affixes. Given a way to disambiguate the different aspects around a term, stemming can be used to allow for the retrieval of more aspects that can then be effectively disambiguated into more potential SoM, adding to the existing shades around a concept.
4http://www.comp.lancs.ac.uk/computing/research/stemming/Links/weight.htm
4.2 Method
This section will discuss the different stages that were used for the creation and evalu-
ation of SoM. Additionally it will look at how to build semantic spaces via HAL, then
find shades of meaning in those spaces. Then finally look at how to create a vectorial
representation of the traces for comparison
4.2.1 Building the HAL Model
The semantic space, or HAL space, for the experiments was built using the HAL method outlined in Section 2.1.1 and Equation 2.1, with the parameter l set to 10. Earlier pilots were run with l = 5 and l = 15; on most data there was a noticeable decrease in performance for l = 5, and either a slight increase or decrease for l = 15 depending on the data set. l = 10 was chosen as an effective compromise between the overly large, heavy and often redundant l = 15 and the under-performing l = 5. During HAL processing, stop words and punctuation were filtered, and HAL was run over the resulting traces. Each trace was treated as a document, so HAL did not run across traces. A single HAL space, being the aggregation of all the traces, was built. Each word in the system has a row and column in the symmetric HAL space. (See Section 2.1.1.) Each of these rows is a sum of all the contexts that word has been found in: a weighted sum of all the words found around it. It is essentially these contexts that we would like to uncover in the word’s vector. In later calculations the HAL semantic space for a word w’s data set is denoted Sw.
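The construction described above can be sketched roughly as follows. The exact weighting scheme is given by Equation 2.1 in Section 2.1.1; the linear ramp l − d + 1 used here is an assumed stand-in, and stop-word filtering is presumed to have already been applied to the traces.

```python
import numpy as np

def build_hal(traces, l=10):
    """Build a symmetric HAL-style term-by-term matrix over traces.

    Each trace is a list of tokens. Co-occurrences within a sliding
    window of length l are weighted by l - d + 1, where d is the word
    distance (an assumed weighting; see Equation 2.1 for the thesis's
    actual scheme). Windows do not cross trace boundaries, since each
    trace is treated as a document.
    """
    vocab = sorted({w for t in traces for w in t})
    idx = {w: i for i, w in enumerate(vocab)}
    S = np.zeros((len(vocab), len(vocab)))
    for trace in traces:
        for i, w in enumerate(trace):
            for d in range(1, l + 1):
                if i + d >= len(trace):
                    break
                v = trace[i + d]
                weight = l - d + 1
                S[idx[w], idx[v]] += weight
                S[idx[v], idx[w]] += weight  # aggregate into a symmetric space
    return S, vocab
```

Each row of S is then the weighted sum of all contexts its word has occurred in, exactly the representation the rest of the method operates on.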
The HAL space created from a set of traces looks very different to a HAL space built over a regular document or set of documents. Because every trace contains the word of interest, the HAL vector for the term in question is very large, in that it contains a dimension for nearly every other term in the system. By way of example, on the first 100 Reuters-21578 [28] traces the HAL vector for Reagan contains non-zero dimensions for 274 other terms in the system, whereas the vector for Rome contains only five non-zero dimensions. President, presumably a very closely associated concept to Reagan, has a mere 49 non-zero dimensions. If these HAL vectors were represented as beams of light in a three-dimensional space, the Reagan beam would be so bright, and intrude so much on the space of the other beams, that all the other beams shining in the space would be very hard to see. The whole space is nearly defined by this single vector. This implies that the other vectors might as well all be equivalent when faced with the enormity of the vector looming over them. For this reason the vectors for the term in question (like “reagan”) were removed before the matrix decomposition was performed, to allow the other, weaker vectors to shine through. This means removing both the column and the row for “reagan.”
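Removing the focus term’s row and column can be sketched as below, assuming a vocabulary list kept parallel to the matrix axes (the `vocab` argument is an illustrative convention, not part of the original implementation).

```python
import numpy as np

def strip_term(S, vocab, term):
    """Remove the row and column for `term` from the symmetric
    term-by-term matrix S, returning the reduced matrix and the
    reduced vocabulary."""
    i = vocab.index(term)
    keep = [j for j in range(len(vocab)) if j != i]
    # np.ix_ selects the cross product of the kept rows and columns.
    return S[np.ix_(keep, keep)], [vocab[j] for j in keep]
```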
Traces are the building blocks for this thesis and are used to build the semantic space around a word, from which its SoM are computed. To find out the effectiveness of the SoM systems, a two-stage evaluation process is proposed. By using existing evaluation systems that may be lacking for this task, it may be possible to find out what SoM are not; then, by using the new system of evaluating SoM, it may be possible to define what SoM are. The first stage is to run the systems on the tests from SENSEVAL and first justify the results with an established method. The second stage is to create a SENSEVAL-like test composed of tagged traces with various SoM. The systems are then run on this new data to give a structured, repeatable and intuitive strategy for assessing the quality of each method. These stages are explained in the next few subsections, along with an overview of the word selection discussions. Before looking at evaluation further, there are some issues to resolve around which words to use for evaluation.
4.2.2 Computing the Shades of Meaning
Finding the shades of meaning is the most critical part of this research. Several different methods have shown promise for finding shades of meaning; each was run on the term × term matrix generated in the previous section. The required result was a set of shade vectors, each of which would represent the centre of a shade of meaning, and thus the best representation of its meaning. These vectors are used to represent shades of meaning in the rest of the experimentation.
The calculation for finding the shades of meaning is different for each algorithm, so to show that a set of computational shades CSw is found from a term × term matrix Sw, a generic equation is given. For a system SoM that reduces Sw into k shades, Equation 4.1 applies. For example, for the adaption of the NMF system the equation would read CSw = NMF(Sw, k).
CSw = SoM(Sw, k) (4.1)
When using the trace browser to analyse the results coming out of the different shade-finding algorithms, it became clear that the “bright light” problem mentioned in the previous section (4.2.1) was causing issues with the shades being uncovered. As mentioned, to avoid this problem the vectors for the term in question were removed from the term × term matrix, leaving only the vectors of the words surrounding the term itself. This allowed the more subtle meanings of the contexts to be used effectively to disambiguate traces. It is assumed that the term in question is highly related to the traces, since it appeared in all of them and all of them were about that concept. What is of more interest is disambiguating the subtle contexts around the concept to better understand what its key components are.
4.2.3 Representing the Traces
To evaluate the relationship between a SoM represented as a vector (s ∈ CSw) and a given trace (t ∈ Tw), a vectorial representation ~t of the trace is needed in the same space as s, which can then be used quantitatively to find the distance between s and ~t. To find the general meaning of a trace, a centroid vector of the terms it is composed of is built. To do this, each term’s vector is sourced from the term × term matrix Sw and the vectors are averaged. This gives the centroid meaning of the trace as the system knows it. The HAL vectors for each term are given by Equation 2.1. It is then possible to find the vector for a single trace in the data set using Equation 4.2. A trace’s words are the list t, with each individual word being t[i], where i ranges over the length of t. In the following, Sw,t[i] is the HAL vector corresponding to the ith word in t.
~t = (1/|t|) Σ_{i=1}^{|t|} Sw,t[i]   (4.2)
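Equation 4.2 can be sketched directly. The `focus` parameter anticipates the removal of the focus term described below; skipping words absent from the space is an added robustness assumption, not part of the equation.

```python
import numpy as np

def trace_centroid(S, idx, trace, focus=None):
    """Centroid vector for a trace (Equation 4.2): the average of the
    HAL row vectors of the trace's words. The focus term w is dropped
    first, as described in Section 4.2.3."""
    words = [w for w in trace if w != focus and w in idx]
    if not words:
        return np.zeros(S.shape[1])
    return sum(S[idx[w]] for w in words) / len(words)
```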
As mentioned in the previous section, there is a problem with keeping the main term’s vector in the space when doing comparisons. This also affected the building of centroid representations of the traces. If there is a dominant vector present in all the traces, they all become a lot more similar when viewed from a distance, because a single dominant vector accounts for most of the cluster’s meaning and becomes the real centroid. This would skew a vector pointing in one particular direction off to another, different direction, which could confound the meaning intended to be captured in a SoM. To avoid this problem the focus term w was removed from each trace when building its centroid.
4.2.4 Comparing SoM to Traces
After calculating centroid vectors for each trace, there is a set of vectors for the traces and a set of vectors for the shades, represented in the same vector space, enabling a comparison between the two. It is now possible to compare traces to shades and see which shade most accurately describes a trace. The method used to find the similarity between two vectors in a linear space is the cosine coefficient, or cosine similarity. The cosine similarity between two vectors is a value between -1 and 1; in practice values below zero do not occur very often, so essentially there is a value between zero and one, where zero means “not at all alike” and one means “the same” [53]. The problem this leaves is that there is only a value between zero and one to describe the relationship between the two vectors. It would be ideal to tag each shade with what fits human intuition, and also tag each trace; when a trace’s most closely related shade is found, the tags would be compared to see whether they match. If they do, that is a right answer for the system. A reliable way to tag the traces and shades is needed though, something that is not possible within the scope of this thesis. Instead, the relationship between a shade and its related traces is used to build a model of the overall picture. (See Section 4.3.3.)
To find the most closely related shade for a trace, the shade that yields the highest cosine similarity to the trace’s centroid is found. The shade sj ∈ CSw associated with a trace centroid ~ti from ~Tw (the set of trace centroids) is given by Equation 4.3.
simij = arg max_j (sj · ~ti) / (‖sj‖ ‖~ti‖)   (4.3)
The similarity simij is the cosine similarity for the shade that yields the highest value against the trace centroid in question. These similarities, and their association with a particular trace, allow the most highly related shade to be found for each trace.
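The assignment in Equation 4.3 can be sketched as a straightforward argmax over cosine similarities:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 for a zero vector."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 0.0 if denom == 0 else float(np.dot(a, b) / denom)

def nearest_shade(shades, trace_vec):
    """Return (index, similarity) of the shade in `shades` with the
    highest cosine similarity to the trace centroid (Equation 4.3)."""
    sims = [cosine(s, trace_vec) for s in shades]
    j = int(np.argmax(sims))
    return j, sims[j]
```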
4.3 Evaluation Metrics
The final data produced by the experiment was a set of values indicating the strength of the relationship between a trace and its most related shade. From this data two methods of evaluating the results are possible; these are discussed in the following sections.
4.3.1 Purity and Normalised Mutual Information
Purity and Normalised Mutual Information (NMI) treat traces grouped around a certain SoM as clusters, and the hand-tagged traces as classes. In this situation the evaluation task can be treated as a clustering problem [56, 31]. By using a golden standard of traces assigned to classes it is possible to use traditional clustering evaluation techniques to see how well the SoM represent the perceived classes in the data.
Computing the purity of a cluster entails assigning to it the class which occurs most often in the cluster. Purity is then calculated by finding the number of traces assigned to the right cluster and dividing by the total number of traces. In the following equations Ω is the set of clusters of traces grouped around computational shades and C is the set of classes of hand-tagged data.
purity(Ω, C) = (1/N) Σ_k max_j |ωk ∩ cj|   (4.4)
The problem with purity is that as the number of clusters approaches the number of classes, the purity score continues to improve until it reaches a perfect score when k = n. NMI is suggested as an improvement on purity, balancing the clustering quality against the number of clusters [31, 56].
NMI(Ω, C) = I(Ω; C) / ([H(Ω) + H(C)] / 2)   (4.5)
The NMI is essentially the mutual information divided by the average entropy of the clusters and the classes. The mutual information itself increases as the number of clusters increases, giving a similar situation to purity, where maximum information sharing occurs when k = n. Dividing by the average entropy of the classes and clusters means the number of clusters does not bias the final score. Because of this, NMI can be used to compare clusterings of different sizes, which addresses the monotonic increase in clustering performance often seen with other evaluation strategies.
In the context of traces and SoM, we will cluster the traces around the SoM and evaluate the performance using NMI.
4.3.2 Confusion Matrices
To find out how well each system performed at classifying traces into shade categories, a system is required that first gives an overall picture of the classifications, then quantifies how closely the classifications match the golden standard. Neill uses
acid iran tax highway economy
[per-shade trace counts for rows 0–9 not legible in this transcript]
Table 4.2: A reduced example of the “Reagan” confusion matrix at k = 10.
confusion matrices in combination with conditional entropy to create a meaningful measure of how his word sense induction system performs.
In artificial intelligence and machine learning confusion matrices are used to determine
how well a system has performed at a classification task. These confusion matrices
normally have an axis for answer classes and an axis for test classes. In the case of
the shades of meaning system the answer classes are the golden standard tags used for
traces and the test classes are the shades of meaning the system has created to represent
the data set. A confusion matrix is denoted C, or Cwk since there is a confusion matrix
for every w and k combination.
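Assuming each trace has been assigned to exactly one shade and carries exactly one golden-standard tag, a confusion matrix in the orientation of Table 4.2 (rows are shades, columns are tags) can be built as follows. The function name and argument shapes are hypothetical, chosen only to illustrate the construction.

```python
def confusion_matrix(shade_of, tag_of, k, tags):
    """Build C_wk: rows are the k computational shades, columns are the
    golden-standard tags; cell [i][j] counts the traces assigned to
    shade i that carry tag j."""
    m = [[0] * len(tags) for _ in range(k)]
    col = {t: j for j, t in enumerate(tags)}
    for trace, shade in shade_of.items():
        m[shade][col[tag_of[trace]]] += 1
    return m
```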
The confusion matrix in Table 4.2 is for a reduced version of the “Reagan” data set when decomposed into k = 10 shades of meaning by the Non-negative Matrix Factorisation algorithm. Along the top are the various tags used to tag the traces extracted from the Reuters-21578 corpus. Down the left-hand side are the numbered shades of meaning automatically generated by the algorithm under test. Note that k = |CSw|, so the height of the table is the number of computational shades found by the system. The numbers in the confusion
matrix represent the number of traces tagged with a particular tag that are most closely related to a particular shade. For example, at rank ten shades, the system perfectly groups all 13 traces tagged “highway.” The “highway” tag was used to tag all traces related to the multibillion-dollar highway bill vetoed by Reagan. These traces have a clearly defined scope around the corpus occurrences of “Reagan”. A less well defined scope is the “economy” tag, which is spread over nearly all shades of meaning, but still has the largest number of traces on the row where, incidentally, the majority of the “tax” traces were categorised. Since tax and the economy are closely related concepts, this makes sense too.
The spread of values in the confusion matrix can be used to gain an at-a-glance overview of how well the matrix method has performed. How tightly traces are clustered for a single tag is clearly important, as for “acid”, “tax” and “highway,” but clustering within rows matters too: more traces appearing in a single column of a row is better than being spread across it. A confusion matrix where tight groupings occur in both rows and columns performs best in the evaluation and fits the intuition that SoM can accurately represent the different shades of meaning. For more discussion of the results, with further tables, see the sections for each tested algorithm in Section 5.
4.3.3 Conditional Entropy
Conditional entropy, as used by Neill, is a good measure of how effectively the system performs because it takes the whole distribution of answer and test results into consideration. This means comparing the distribution of manually tagged traces to the classification of traces into shades induced by the systems
under test. The confusion matrices discussed in the previous section can be used to calculate the conditional entropy of a tag distribution compared to the induced test distribution. To calculate the entropy of the tag classifications, the formula given by Neill was used, where i varies over the tag classes and P(i) denotes the probability that a trace will be manually assigned to tag i:
H(i) = −Σ_i P(i) log2 P(i) (4.6)
Ignoring test results for now, a distribution of traces almost all classified with one tag has a low entropy, whereas a distribution of traces evenly mixed between several trace tags has a high entropy. High entropy means heavy mixing between different trace tags. For example, if there were 100 "reagan" traces, 50 of which were classified as "iran-contra" and the other 50 as "missile-deal", the entropy would be high because there is a large amount of mixing, or uncertainty, about which tag any one trace would receive.
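The 50/50 example above comes out at exactly one bit under Equation 4.6. A minimal sketch, assuming tags are given as a flat list, one per trace (the name `tag_entropy` is mine):

```python
import math
from collections import Counter

def tag_entropy(tags):
    """H(i) per Equation 4.6, over the empirical tag distribution."""
    n = len(tags)
    return -sum((m / n) * math.log2(m / n) for m in Counter(tags).values())
```

A single-tag distribution gives zero entropy; an even two-tag split gives one bit.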
To find the performance of the shade test distributions, the percentage decrease in entropy must be calculated: subtracting the conditional entropy of the induced test distribution from the entropy of the tag distribution gives the reduction in entropy attained by the system under test. To find the conditional entropy, the formula given in Equation 4.7 was used, where P(i, j) is the probability of finding a trace in a particular tag and shade class (P(i, j) = C_wk[i, j] / T_w).
H(i|j) = −Σ_i Σ_j P(i, j) log2 P(i|j) (4.7)
Conditional entropy looks at how mixed the test distributions are as compared to the
answer distributions. So if there are two trace tags, "missile-deal" and "iran-contra" as
4.3. EVALUATION METRICS 57
mentioned earlier, each with 50 associated traces, then if the test system assigns all 100 traces to a single shade it has performed as badly as possible: the answer classes are completely mixed within that shade, the conditional entropy equals the answer entropy, and the reduction is zero. Whereas if the system assigns the 50 "missile-deal" traces to one shade and the 50 "iran-contra" traces to another, no mixing has occurred: the conditional entropy is zero, and subtracting it from the answer's entropy yields a 100% reduction, meaning the system has performed perfectly. More generally, the larger the gap between the answer entropy and the conditional entropy, the better the system has performed. The percentage decrease in entropy is given by Neill in Equation 4.8.
[H(i) − H(i|j)] / H(i) × 100% (4.8)
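Equations 4.6–4.8 combine into a single routine over a confusion matrix. The sketch below assumes the orientation of Table 4.2 (rows are shades j, columns are tags i) and is illustrative rather than the thesis implementation:

```python
import math

def entropy_reduction(C):
    """Percentage reduction in entropy per Equation 4.8, from a confusion
    matrix C whose rows are shades (j) and columns are tags (i)."""
    n = sum(sum(row) for row in C)
    # Marginal tag distribution P(i): column sums over all shades.
    col = [sum(row[j] for row in C) / n for j in range(len(C[0]))]
    h_i = -sum(p * math.log2(p) for p in col if p)       # H(i), Eq. 4.6
    # Conditional entropy H(i|j), Eq. 4.7: P(i,j) = C[j][i] / n,
    # P(i|j) = C[j][i] / (row sum of shade j).
    h_cond = 0.0
    for row in C:
        rn = sum(row)
        if rn:
            h_cond -= sum((m / n) * math.log2(m / rn) for m in row if m)
    return 100.0 * (h_i - h_cond) / h_i                  # Eq. 4.8
```

A perfect separation of two equal tags over two shades gives a 100% reduction; collapsing everything onto one shade gives 0%.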
As one of the more drastic examples, see Table 4.3 (pg. 58). The percentage reduction in entropy for the NMF result was 19.3, while the percentage reduction for the SVD result was 2.07. At first glance the distribution in the NMF table does not look brilliant, but compared to the SVD table it has clearly done a better job of distinguishing between the different SoM: there is less spreading of traces amongst columns and rows. The fact that the “spring” sense of onion and the “veg” sense appear on the same row for NMF seems to be offset by the fact that for each tag the traces are grouped quite well. It can be concluded that the “spring” sense of the word is closely related to the “veg” sense, which is indeed the case, whereas no such conclusions can be drawn from the SVD table. More discussion of results like these is given in Section 5.
(a) NMF
basil veg plant spring
[per-shade trace counts for rows 0–4 not legible in this transcript]
(b) SVD
basil veg plant spring
[per-shade trace counts for rows 0–4 not legible in this transcript]
Table 4.3: “onion” at k = 5
CHAPTER 5
Results
Overall the results from this thesis are encouraging. In this chapter discussion revolves around the results with respect to the methods outlined in Section 4.3: normalised mutual information and conditional entropy. After that the conditional entropy results are discussed in more depth, with confusion matrices displayed to help with the analysis. For the full set of confusion matrices, see Appendix A.
It is clear from the results that SVD is an effective method for finding shades of meaning in a concept. There are often clear groupings of traces assigned to a particular tag and shade. Unfortunately, these groupings often sit on the same row out of all of the shades. This fits the understanding that SVD's eigenvectors and singular values seek to account for the largest variation in the original matrix within the leading singular values and vectors. Non-negative matrix factorisation consistently gave the best results. As can be seen from the example tables given here, NMF keeps clusters of data in single rows and columns far more effectively than SVD.
Of the evaluation methods, NMI seemed to be the most indicative of the effectiveness of the systems under test. The reasons for this are discussed below.
5.1 Normalised Mutual Information
The overall NMI results are presented as a percentage improvement over a baseline score. The most notable feature of the overall improvements listed in Table 5.1 is the difference between the SENSEVAL and Reuters results. The most logical conclusion that can be drawn from the difference between the two sets of data is that SENSEVAL traces are not classified based on SoM as defined by this thesis, but rather by the sense of the word of interest. While there may be some overlap between senses and semantic meaning, these results strongly suggest that SoM calculation is a different class of problem from sense disambiguation.
Data + Method   5      10     20     30     Overall
Reut:SVD        8.34   13.16  18.60  19.98  15.02
Reut:NMF        15.48  19.19  14.36  23.06  18.02
Reuters         12.54  14.41  14.49  19.81  15.62
SENS:SVD        5.35   3.93   6.37   4.51   5.04
SENS:NMF        5.05   4.97   7.62   5.40   5.76
SENSEVAL        5.18   4.31   6.24   4.39   5.40
Table 5.1: Average overall improvements in NMI.
The NMI results also show from an early stage that a higher value of k does not always mean a better division of SoM. The SoM for the Reuters data set were chosen to map to the human cognitive understanding of the different aspects of a concept. Thus, despite higher values of k giving a more fine-grained perspective on the concept, it may not always be cognitively economical to apply such a fine-grained conceptual filter. Between the systems under test, NMF gives the best overall improvements over the baseline.
The baseline for NMI is calculated by computing the NMI of clusters generated by simply iterating over the traces and placing them one at a time into each cluster in turn.
term      method  IK, B, nmi (% incr)      B, 5                  B, 10                 B, 20                 B, 30
coffee    nmf     9, 0.15, 0.31 (16.51%)   0.08, 0.17 (8.95%)    0.12, 0.37 (24.65%)   0.20, 0.39 (18.51%)   0.27, 0.40 (13.45%)
coffee    svd     9, 0.15, 0.28 (12.92%)   0.08, 0.29 (20.64%)   0.12, 0.29 (16.69%)   0.20, 0.33 (12.97%)   0.27, 0.38 (10.75%)
oil       nmf     40, 0.30, 0.53 (23.02%)  0.07, 0.29 (22.05%)   0.13, 0.40 (27.24%)   0.20, 0.47 (26.45%)   0.26, 0.51 (24.72%)
oil       svd     40, 0.30, 0.47 (17.51%)  0.07, 0.24 (16.21%)   0.13, 0.34 (21.26%)   0.20, 0.44 (23.59%)   0.26, 0.47 (20.79%)
gatt      nmf     7, 0.26, 0.44 (17.93%)   0.19, 0.37 (18.26%)   0.31, 0.57 (26.41%)   0.47, 0.62 (15.55%)   0.60, 0.68 (7.59%)
gatt      svd     7, 0.26, 0.50 (23.96%)   0.19, 0.45 (26.15%)   0.31, 0.55 (24.17%)   0.47, 0.56 (9.02%)    0.60, 0.56 (-4.62%)
reagan    nmf     21, 0.31, 0.62 (30.58%)  0.13, 0.33 (20.79%)   0.22, 0.49 (27.27%)   0.33, 0.60 (26.99%)   0.40, 0.64 (24.67%)
reagan    svd     21, 0.31, 0.53 (21.56%)  0.13, 0.32 (19.51%)   0.22, 0.42 (20.28%)   0.33, 0.52 (18.80%)   0.40, 0.55 (15.87%)
president nmf     9, 0.27, 0.37 (10.27%)   0.19, 0.26 (6.56%)    0.24, 0.40 (16.61%)   0.31, 0.52 (20.30%)   0.38, 0.55 (17.71%)
president svd     9, 0.27, 0.44 (17.62%)   0.19, 0.42 (23.00%)   0.24, 0.44 (20.24%)   0.31, 0.45 (13.68%)   0.38, 0.50 (12.12%)
economy   nmf     25, 0.64, 0.73 (8.95%)   0.34, 0.44 (9.53%)    0.47, 0.63 (16.20%)   0.60, 0.68 (7.36%)    0.67, 0.72 (4.76%)
economy   svd     25, 0.64, 0.64 (0.12%)   0.34, 0.49 (14.39%)   0.47, 0.56 (8.97%)    0.60, 0.61 (0.91%)    0.67, 0.62 (-4.89%)
Table 5.2: Reuters NMI results. IK is the ideal value for k. B is the baseline NMI value. The percentages detail improvements over the baseline.
Essentially this is what could be achieved by simply splitting the traces amongst the clusters; it represents the capacity of a small child to assign items to classes at random. This means that the baseline is different for every value of k.
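The baseline assignment described above can be sketched as a round-robin split; the function name is hypothetical, and the thesis does not give the exact iteration order, so this is one plausible reading.

```python
def baseline_clusters(traces, k):
    """The NMI baseline: deal traces out one at a time into each of the
    k clusters in turn, with no regard for their content."""
    clusters = [[] for _ in range(k)]
    for t, trace in enumerate(traces):
        clusters[t % k].append(trace)
    return clusters
```

The NMI of this clustering against the golden standard then serves as the per-k baseline B in Table 5.2.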
As can be seen in Table 5.2, there is a wide variety of results depending on the value of k. For “Reagan” at k = 21 with NMF we see the best result for that term. This matches the intuition that the different classes of traces selected most accurately match the SoM in the data. Notably, the NMI value for k = 21 is higher than that for k = 30, showing that a higher value of k does not necessarily mean a better result for representing SoM. In contrast, for “economy” the best results occur at k = 10, not at k = 25, which was chosen as the ideal value of k. “Economy” is also an example of a negative improvement over the baseline: for SVD at k = 30 the generated shades actually perform worse than the baseline. Since “economy” has nearly 70 traces and k = 30, it would appear that the baseline method of simply assigning traces to clusters has chanced upon a surprisingly accurate clustering of the traces. To achieve a higher NMI value there needs to be very little entropy in the guessed clusters and answer classes, or a lot of shared information between them.
The “oil” data set was the largest out of the Reuters corpus examples. It also shows
among the most consistent improvements in NMI. This is because with a larger data set it is harder to fluke a good clustering of traces around SoM.
5.2 Conditional Entropy
The conditional entropy results are presented as an alternative perspective on quantifying the performance of the different systems. As discussed, conditional entropy provides a way to tell how well the test distribution of traces to shades matches the answer distribution of traces to tags. The overall averages for each method on each data set are shown in Table 5.3. Despite some anomalies, it is safe to say that on the whole NMF performed better than SVD at reducing the entropy in the data sets.
Data + Method   5      10     20     30     Overall
Reut:SVD        31.71  43.32  53.72  59.14  46.97
Reut:NMF        27.52  49.42  64.59  74.35  53.97
Reuters         29.61  46.37  59.15  66.74  50.46
SENS:SVD        11.88  20.84  30.15  37.85  25.18
SENS:NMF        14.53  26.37  36.23  45.57  30.67
SENSEVAL        13.20  23.60  33.19  41.71  27.78
Table 5.3: Average overall reductions in entropy.
When comparing NMF to SVD on the Reuters data set there are several instances where SVD performs better than NMF: “coffee” at k = 5, “gatt” at k = 5 and “reagan” at k = 5. For the “coffee” case, consider Table 5.4, the confusion matrix for a run of the algorithm: there is a subtly better grouping of traces for SVD than for NMF. This can be seen in the “growth” column, where SVD groups traces together more. Additionally, NMF has put traces from more tags onto shade 1, which significantly hinders the reduction in entropy.
The entropy decrease results for the SENSEVAL data are in Table A.2 on Page 76. In this
table it can be seen that SVD outperforms NMF on more occasions than in the Reuters data, but not often enough to steal the crown in the overall scores. An example of SVD performing better is shown in Table 5.6 on Page 67, where the columns “sharesact” and “rod” show better grouping of traces for SVD than for NMF.
(a) NMF, the reduction in entropy is 17.22%.
decline marketing make quota rain growth export org import
[per-shade trace counts for rows 0–4 not legible in this transcript]
(b) SVD, the reduction in entropy is 27.50%.
decline marketing make quota rain growth export org import
[per-shade trace counts for rows 0–4 not legible in this transcript]
Table 5.4: “coffee” at k = 5. An example of SVD performing better than NMF in a run of the algorithm.
The first key point when looking at Table 5.5 (pg. 64), also raised by Neill in his work, is that the results improve as the number of shades increases. While this happens for the values of k documented above, it does not happen for every test case. For example, the “accident” term from the SENSEVAL data processed with SVD exhibits non-monotonic changes as k increases: at k = 2 the conditional entropy figure is 3.66, at k = 3 it is 10.27, and at k = 5 it goes back down to 7.52.
To explain these results we must look at the conditional entropy metric in more detail. If entropy is a measure of the amount of mixing between classes and conditional entropy
term      method  B      IK, ent (% decr)    5               10              20              30
coffee    nmf     3.51   9: 1.58 (35.45%)    2.04 (16.93%)   1.39 (43.16%)   1.19 (51.28%)   1.03 (57.88%)
coffee    svd     3.51   9: 1.71 (30.40%)    1.78 (27.50%)   1.66 (32.30%)   1.43 (41.76%)   1.23 (49.69%)
oil       nmf     0.96   40: 1.82 (56.43%)   3.24 (22.48%)   2.72 (34.84%)   2.26 (45.82%)   2.00 (52.26%)
oil       svd     0.96   40: 2.16 (48.37%)   3.47 (17.07%)   3.01 (28.04%)   2.43 (41.84%)   2.23 (46.66%)
gatt      nmf     11.43  7: 1.31 (46.05%)    1.54 (36.41%)   0.81 (66.52%)   0.45 (81.52%)   0.06 (97.64%)
gatt      svd     11.43  7: 1.23 (49.20%)    1.43 (41.11%)   0.95 (60.62%)   0.80 (66.93%)   0.80 (66.93%)
reagan    nmf     2.67   21: 1.35 (64.77%)   2.82 (26.49%)   2.12 (44.67%)   1.45 (62.35%)   1.13 (70.66%)
reagan    svd     2.67   21: 1.86 (51.44%)   2.90 (24.56%)   2.43 (36.66%)   1.92 (49.96%)   1.65 (57.04%)
president nmf     5.97   9: 1.06 (45.89%)    1.41 (27.54%)   0.98 (49.53%)   0.48 (75.26%)   0.26 (86.61%)
president svd     5.97   9: 0.94 (51.80%)    1.13 (42.28%)   0.92 (53.00%)   0.76 (61.25%)   0.58 (70.49%)
economy   nmf     7.02   25: 0.78 (79.40%)   2.46 (35.25%)   1.60 (57.79%)   1.09 (71.32%)   0.72 (81.07%)
economy   svd     7.02   25: 1.33 (65.00%)   2.37 (37.75%)   1.93 (49.31%)   1.50 (60.56%)   1.37 (64.03%)
Table 5.5: Reuters-21578 conditional entropy results
is the amount of mixing between answer and test classes the results observed fit the
description given by Neill.
For the “reagan” example, suppose we have 100 traces with 2 tags (“iran” has 50, “tax” has 50) and k = 3 (SOM1, SOM2, SOM3). If the algorithm splits the two sets of traces 50/50 over SOM1 and SOM2, there is a 100% decrease in entropy, which fits the intuition. Also, if the algorithm splits the two sets of traces 50/25/25 over SOM1, SOM2 and SOM3, the system still performs perfectly, because there has been no mixing of answer classes.
The intuition here is that the human taggers have missed a significant subset of meaning in the data. Consider the situation of adding more SOM to the system by increasing the value of k. Unless this causes the algorithm to start mixing traces of different classes together, the system will maintain a 100% reduction in entropy. To put it another way, if there is a larger amount of mixing and adding more SOM allows the system to better ‘unmix’ the mixed traces, the conditional entropy will decrease. That is, when traces from two classes are assigned to the same SOM and adding more SOM allows them to be more effectively split apart, the conditional entropy will decrease. Looking through the confusion matrices from the experiments, we see that this is exactly what is happening. So, unlike NMI, conditional entropy does not normalise the results when
there are larger values of k.
As for Neill's results, he uses lower values of k (2, 3, 5, 10), and the seed selection process is an extremely important part of his research, especially considering the complexity of his model. Taking k = 2 and k = 3 for example, it is possible that the seed selection process could select much better seeds for k = 2 and then choose sub-par seeds for k = 3. At such low values of k, if one bad seed is chosen, one half or one third of the traces could be very poorly classified by the system. When the system does make a bad seed selection, the proportion of erroneously classified shades is much higher than in systems where k is higher; thus the conditional entropy will be much higher and the reduction in entropy will be lower. Neill also notes that sometimes the performance does not increase as the number of seeds increases, which he attributes to the randomness of the seed selection process. In the pilot tests of this research the random seed problem was noticed with various clustering algorithms too; for this reason concept indexing was not included in the final experimentation.
One might expect that, because two repeated runs of NMF or SVD with the same implementation give the same results, randomness like that observed by Neill simply would not occur. Unfortunately this is not the case: it was observed that NMF does not give consistent results like the SVD algorithm. Most of the time it performed better than SVD, but sometimes it performed worse, not by a lot, but frequently enough to be noticeable. This makes sense, since each time NMF is run it is initialised with random data. It is said to have converged when, for a given iteration, the expression KL > thres × oldKL is true, where KL is the current KL
divergence, oldKL is the divergence from the previous iteration, and thres = 0.9999. This means the random initialisation causes a different factorisation each time, and as soon as the above expression is true the factorisation is considered complete. This behaviour is noted in all the NMF research reviewed [3, 27, 55].
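The procedure described above can be sketched with Lee and Seung's multiplicative updates for KL-divergence NMF. This is not the thesis implementation: the initialisation, epsilon guards and iteration cap are my own choices; only the stopping rule KL > thres × oldKL is taken from the text.

```python
import numpy as np

def nmf_kl(V, k, thres=0.9999, max_iter=500, seed=None):
    """KL-divergence NMF (V ≈ WH) via multiplicative updates, with random
    initialisation and the KL > thres * oldKL stopping rule."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + 1e-9          # random, strictly positive start
    H = rng.random((k, m)) + 1e-9
    old_kl = np.inf
    for _ in range(max_iter):
        WH = W @ H
        H *= (W.T @ (V / WH)) / W.sum(axis=0)[:, None]
        WH = W @ H
        W *= ((V / WH) @ H.T) / H.sum(axis=1)[None, :]
        WH = W @ H
        kl = np.sum(V * np.log((V + 1e-12) / WH) - V + WH)
        if kl > thres * old_kl:            # improvement below threshold
            break
        old_kl = kl
    return W, H
```

Because W and H start from random data, repeated runs can converge to different factorisations, matching the inconsistency observed above.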
The most extreme upper results in the data set are the k = 30 results for the “GATT” data set from the Reuters collection, where the NMF system records a 95.28% decrease in entropy, which seems a very unlikely number to achieve. Interestingly, this result makes a lot of sense when looking at the data set in Table A.3 on Page 77. There are 35 traces in this, the smallest data set in the system, so one would almost expect to see a single trace per shade. This is nearly the case for the NMF results. The SVD results, which are substantially lower, actually group some traces together onto a single shade, which intuitively makes more sense when one thinks about grouping traces with the same tag around a single shade. The reason SVD's reduction in entropy is lower is that it more often mixes traces with different tag assignments onto the same shade. This is because SVD can drive an eigenvector between shades of meaning, rather than through them.
In contrast to the “GATT” data set, the “float” data set is larger, with 70 traces, yet a large decrease in entropy still occurs when it is decomposed into shades, especially for SVD. This means that the shades of meaning present in the “float” data match what the different sense taggings mean more closely than in other data sets (see Table 5.6 on Page 67). Also, the “oil” data set, one of the largest with over 400 traces, shows significant reductions in entropy, suggesting that the size of the data set only impacts the results as the number of shades (k) gets much closer to the total number of traces.
(a) NMF, the reduction in entropy is 44.73%.
sharesact currencyact rod fiesta cash milk wobble device lorry
[per-shade trace counts for rows 0–9 not legible in this transcript]
(b) SVD, the reduction in entropy is 68.25%.
sharesact currencyact rod fiesta cash milk wobble device lorry
[per-shade trace counts for rows 0–9 not legible in this transcript]
Table 5.6: “float” at k = 5. Another example of SVD performing better than NMF.
In Table A.4 on Page 78 are the two confusion matrices created for the “reagan” data set. Aspects of “reagan” that were easy to identify and tag in the data set are easily found by both decompositions; examples of these clearly defined tags are “tax”, “highway” and “acid.” Also interesting is the difference between NMF and SVD, specifically where different tags have been clustered onto the same shade. For example, on shade 8 NMF puts 8 traces tagged “iran” next to 6 tagged “staff”; arguably these topics are all related once the corpus is seen, since after the Iran-contra controversy investigations were carried out on members of Reagan's staff. Other weakly represented
overlaps can be identified in the tables too, like that between “address” (meaning a presidential speech) and “economy.”
CHAPTER 6
Conclusions and Future Work
This thesis investigates the possibility that concepts have subtle aspects or shades of
meaning (SoM). The Hyperspace Analogue to Language (HAL) semantic space model
was used to model concepts in a high dimensional semantic space, a term × term ma-
trix. The data is created by running the HAL algorithm over several sets of data, each
representing a term by a series of traces. These sets of data can be broken up into two
different groups; the nouns from the SENSEVAL1 data, and handpicked sets of traces
built out of all titles containing that word in the Reuters-21578 collection. Two methods
of dimensional reduction were used to induce these shades of meaning about the word
from its HAL semantic space. The methods employed were Singular Value Decompo-
sition (SVD) and Non-negative Matrix Factorisation (NMF). Pilot studies revealed that
other potential methods: vector negation and concept indexing did not reveal promis-
ing enough results to warrant further investigation.
An evaluation framework based on treating the SoM as clusters, together with conditional entropy, was used to evaluate the overall performance of each system for different values of k ∈ {5, 10, 20, 30}. Using this framework it is clear that using dimensional reduction with a matrix representation of a concept can induce the shades of meaning surrounding a particular word, with NMF looking to be more effective than SVD. This is probably because the factorisation drives vectors through term clusters in the semantic space corresponding to a SoM, whereas the orthogonal eigenvector solution imposed by SVD has more chance of driving eigenvectors between such term clusters, degrading performance. This work has opened up a wealth of possibilities for further research; several have been mentioned briefly throughout this document, but some warrant further discussion. By way of closing this line of work, the sections below reflect on possibilities for improving the results of SoM and for applying the knowledge gained here to new applications.
6.1 SoM Accuracy Across Languages
An interesting approach to empirically measuring the performance of SoM systems is to see how well the shades translate from one language and culture to another. This would be possible by building a HAL semantic space over a parallel corpus in two languages. Widdows shows that semantic relationships between two parallel corpora are maintained through statistical model processes [53]. By taking the same shade of meaning in each language and showing them to someone who speaks both languages, it should be possible to get a good idea of whether the two shades of meaning mean the same thing.
6.2 Decomposing Term-Trace Data
In the present line of research the data being operated on is term × term data built by running the HAL algorithm over the traces. Similar but fundamentally different is the work of Rapp in his paper addressing word sense induction [42]. He follows on from the work of Neill, but proposes a new representation model [37]: instead of building local clusters of information out of a large global context, keep the traces in their original context and build a term × trace (or term × context, as he puts it) matrix of term frequency vectors. His work shows encouraging results and inspires a new angle on computing SoM using a hybrid of the HAL model presented in this thesis and his methods.
The newly proposed angle on SoM is to build a matrix of term × trace vectors, but instead of weighting by term frequency, to weight according to a single instance of a HAL window. For example, the trace “President Reagan ignorant of the arms scandal” with a HAL window size of l = 5 would be represented in the matrix as the vector [president: 5, reagan: 0, ignorant: 5, of: 4, the: 3, arms: 2, scandal: 1]. As can be seen in the vector, rather than term frequencies, the HAL weighting window is applied once to the trace, centred on the term in question.
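The single-window weighting described above can be sketched as follows: a word d positions from the term of interest (d ≤ l) receives weight l − d + 1. The function name is mine, and the sketch assumes the term occurs once in the trace.

```python
def hal_trace_vector(trace, term, l=5):
    """Apply the HAL weighting window once, centred on `term`: a word at
    distance d (0 < d <= l) gets weight l - d + 1; the term itself gets 0."""
    words = trace.lower().split()
    i = words.index(term)
    vec = {}
    for j, w in enumerate(words):
        d = abs(j - i)
        if 0 < d <= l:
            vec[w] = max(vec.get(w, 0), l - d + 1)
    vec[term] = 0
    return vec
```

Run on the “Reagan” trace above, this reproduces the example vector from the text.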
Applying SVD to this new style of matrix would reduce its sparsity and allow a term × term matrix to be built in a similar fashion to that discussed in Section 3.1 [24]. By then applying SVD or clustering to this term × term matrix, built from specific term context information, it may be possible to induce SoM. Using the evaluation strategy and data of this research, the effectiveness of this hybrid method could be evaluated.
6.3 Euclidean and Generalised Kernel NMF
There are other ways to solve the NMF problem that are not covered in this thesis. These include using the Frobenius norm to find the Euclidean NMF, as opposed to the KL-divergence NMF. This concept can be taken further by generalising NMF to a kernel-based model where any function can be used to measure the distance1 between the approximation and the original data. Preliminary experiments with a kernel-based NMF and the Frobenius norm are promising with respect to finding shades of meaning. Furthermore, point-wise mutual information and maybe even information flow are further candidates for investigation.
6.4 Abductive Reasoning
An area that shows a lot of potential for applying this work is abductive reasoning [6, 17]. The discovery of connections between Raynaud's disease and fish oil typifies the kind of relationships associated with abductive reasoning. Bruza et al. explore the relationship between semantic space models and abductive reasoning and find promising yet slightly underwhelming results. It should be possible to use shades of meaning to allow much more fine-grained analysis, allowing relevant but weaker relationships to be found more effectively.
The discovery of fish oil as a treatment for Raynaud's disease was serendipitous, made because of the common properties between what fish oil could treat and the effects of Raynaud's disease. There was no material directly linking the two, yet a connection could be made between them. This is somewhat formalised by modelling an A-B-C system, where A represents a potential cure, C the problem for which a cure is needed, and B the relationship that joins the two. The example B terms given for the Raynaud's disease example are “platelet aggregation,” “vascular reactivity,” and “blood viscosity.” By knowing C, and potential B's, the system should be able to narrow the bounds on the possible A's. So if there are
1 In the loosest sense of the word.
vectors for A and C, the intuition is that more B terms will be common to both when they are more closely related; this can be quantified using the cosine similarity. This could be done merely to test the validity of the A-B-C theory.
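The cosine measure over two sparse term vectors can be sketched directly; the dictionary representation and the function name are illustrative assumptions, not part of the A-B-C proposal itself.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term vectors (term -> weight).
    Shared B terms raise the dot product between the A and C vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Vectors sharing no terms score 0; identical vectors score 1.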
Bruza et al. go further by using SVD to reduce the dimensionality of the semantic space in which A and C are modelled. As mentioned in Section 2.1.2, it would be expected that some words that were zero before end up being non-zero after the SVD operation is performed. In the experiments conducted, dimensional reduction moves the word “fish” up to the top 1143 of the 28,799 dimensions in the Raynaud vector.
Concept Combination (CC) and Information Flow (IF) are methods proposed by Bruza
and Song [50, 48] for inferring a relationship between concepts or groups of concepts.
Groups of concepts can be created through the CC heuristic, and relationships then inferred through IF. By using CC, the SoM around a group of concepts, or a composite concept, can be induced.
Quantifying SoM in an operational abductive system could help identify A terms and
phrases with stronger relationships. This would be possible by using the SoM vectors
for C to find appropriate B and A terms. As the vectors represent a SoM they can be
used with CC and IF to find a much more specific concept.
6.4.1 Quantum Logic
An area of research closely linked with Abductive Reasoning is the field of Quantum
Informatics. Words can be considered superpositions of SoM, and when a word
is observed in context the superposition collapses into a particular SoM. By finding
the SoM around a word or concept the theory is moving towards a way of defining
eigenstates, or at least a new basis for words in a quantum-like model of meaning. This
opens up a new field of mathematics applicable to this problem [5].
APPENDIX A
Confusion Matrices
term method IK, B, nmi (% incr) B, 5 B, 10 B, 20 B, 30
shirt nmf 6, 0.06, 0.12 (5.94%) 0.05, 0.10 (5.01%) 0.06, 0.10 (4.01%) 0.09, 0.12 (2.95%) 0.14, 0.17 (2.58%)
shirt svd 6, 0.06, 0.11 (5.53%) 0.05, 0.11 (6.27%) 0.06, 0.10 (4.30%) 0.09, 0.12 (2.94%) 0.14, 0.13 (-1.03%)
onion nmf 4, 0.01, 0.07 (6.07%) 0.01, 0.12 (10.52%) 0.03, 0.17 (13.56%) 0.05, 0.17 (11.33%) 0.08, 0.17 (9.13%)
onion svd 4, 0.01, 0.02 (0.64%) 0.01, 0.01 (0.17%) 0.03, 0.05 (1.82%) 0.05, 0.09 (3.70%) 0.08, 0.09 (1.36%)
disability nmf 3, 0.01, 0.03 (1.86%) 0.06, 0.02 (-4.19%) 0.05, 0.08 (3.03%) 0.09, 0.11 (2.11%) 0.10, 0.14 (3.57%)
disability svd 3, 0.01, 0.02 (0.93%) 0.06, 0.03 (-2.86%) 0.05, 0.07 (1.08%) 0.09, 0.08 (-1.02%) 0.10, 0.12 (2.01%)
scrap nmf 12, 0.17, 0.21 (4.26%) 0.09, 0.17 (7.99%) 0.17, 0.25 (8.82%) 0.24, 0.29 (4.73%) 0.32, 0.36 (3.79%)
scrap svd 12, 0.17, 0.18 (1.34%) 0.09, 0.12 (3.16%) 0.17, 0.16 (-0.87%) 0.24, 0.25 (0.54%) 0.32, 0.31 (-1.31%)
float nmf 9, 0.20, 0.43 (23.51%) 0.13, 0.39 (25.29%) 0.19, 0.51 (31.88%) 0.33, 0.56 (23.71%) 0.38, 0.52 (14.31%)
float svd 9, 0.20, 0.69 (49.09%) 0.13, 0.41 (27.36%) 0.19, 0.68 (48.71%) 0.33, 0.67 (34.86%) 0.38, 0.70 (32.76%)
giant nmf 5, 0.13, 0.20 (7.51%) 0.13, 0.16 (3.33%) 0.16, 0.19 (3.44%) 0.20, 0.18 (-1.83%) 0.24, 0.24 (-0.18%)
giant svd 5, 0.13, 0.13 (0.01%) 0.13, 0.13 (0.01%) 0.16, 0.18 (2.12%) 0.20, 0.17 (-2.49%) 0.24, 0.23 (-1.80%)
sack nmf 6, 0.08, 0.12 (4.20%) 0.09, 0.17 (8.84%) 0.12, 0.15 (3.26%) 0.23, 0.22 (-0.95%) 0.23, 0.32 (9.33%)
sack svd 6, 0.08, 0.21 (12.74%) 0.09, 0.18 (9.31%) 0.12, 0.18 (5.65%) 0.23, 0.24 (1.05%) 0.23, 0.25 (1.82%)
behaviour nmf 3, 0.02, 0.03 (1.02%) 0.02, 0.03 (1.75%) 0.03, 0.04 (0.57%) 0.03, 0.03 (-0.22%) 0.05, 0.05 (-0.22%)
behaviour svd 3, 0.02, 0.00 (-1.54%) 0.02, 0.03 (1.56%) 0.03, 0.04 (0.46%) 0.03, 0.03 (-0.15%) 0.05, 0.04 (-0.77%)
knee nmf 13, 0.15, 0.21 (6.44%) 0.08, 0.18 (10.17%) 0.13, 0.23 (10.42%) 0.17, 0.22 (4.69%) 0.21, 0.24 (3.21%)
knee svd 13, 0.15, 0.22 (7.23%) 0.08, 0.19 (11.28%) 0.13, 0.20 (7.69%) 0.17, 0.20 (2.70%) 0.21, 0.24 (3.39%)
accident nmf 6, 0.04, 0.10 (5.20%) 0.04, 0.05 (1.37%) 0.05, 0.10 (5.85%) 0.08, 0.15 (7.29%) 0.09, 0.14 (5.26%)
accident svd 6, 0.04, 0.08 (3.45%) 0.04, 0.06 (2.35%) 0.05, 0.09 (4.45%) 0.08, 0.12 (4.78%) 0.09, 0.13 (4.41%)
promise nmf 7, 0.12, 0.11 (-1.09%) 0.11, 0.08 (-3.49%) 0.14, 0.16 (1.95%) 0.20, 0.22 (1.79%) 0.23, 0.23 (0.15%)
promise svd 7, 0.12, 0.13 (1.08%) 0.11, 0.13 (2.09%) 0.14, 0.14 (-0.11%) 0.20, 0.19 (-0.79%) 0.23, 0.26 (3.08%)
rabbit nmf 8, 0.09, 0.10 (0.88%) 0.08, 0.06 (-2.58%) 0.10, 0.11 (0.45%) 0.12, 0.12 (-0.31%) 0.12, 0.13 (0.77%)
rabbit svd 8, 0.09, 0.07 (-1.37%) 0.08, 0.05 (-3.19%) 0.10, 0.09 (-1.60%) 0.12, 0.11 (-0.72%) 0.12, 0.11 (-1.14%)
excess nmf 9, 0.09, 0.16 (6.90%) 0.06, 0.08 (2.62%) 0.09, 0.16 (6.32%) 0.16, 0.22 (5.98%) 0.21, 0.25 (4.20%)
excess svd 9, 0.09, 0.08 (-0.85%) 0.06, 0.05 (-0.91%) 0.09, 0.09 (-0.03%) 0.16, 0.16 (-0.39%) 0.21, 0.20 (-0.54%)
steering nmf 5, 0.05, 0.08 (2.67%) 0.05, 0.11 (6.09%) 0.08, 0.24 (16.57%) 0.10, 0.22 (11.68%) 0.16, 0.25 (9.31%)
steering svd 5, 0.05, 0.26 (21.11%) 0.05, 0.26 (21.11%) 0.08, 0.27 (19.08%) 0.10, 0.27 (17.14%) 0.16, 0.28 (11.56%)
bet nmf 15, 0.14, 0.23 (8.75%) 0.07, 0.10 (2.99%) 0.11, 0.15 (4.10%) 0.18, 0.26 (8.10%) 0.23, 0.32 (9.27%)
bet svd 15, 0.14, 0.20 (5.77%) 0.07, 0.09 (2.48%) 0.11, 0.14 (2.81%) 0.18, 0.23 (5.48%) 0.23, 0.28 (5.09%)
Table A.1: SENSEVAL NMI results
term method B IK, ent (% decr) 5 10 20 30
shirt nmf 2.27 6: 1.43 (13.84%) 1.48 (11.07%) 1.43 (14.05%) 1.31 (21.39%) 1.13 (32.01%)
shirt svd 2.27 6: 1.46 (12.41%) 1.49 (10.58%) 1.43 (14.01%) 1.33 (19.85%) 1.26 (24.22%)
onion nmf 1.96 4: 0.74 (10.60%) 0.67 (18.92%) 0.58 (30.03%) 0.49 (40.82%) 0.49 (41.17%)
onion svd 1.96 4: 0.81 (2.03%) 0.81 (2.07%) 0.74 (10.74%) 0.65 (21.90%) 0.63 (23.47%)
disability nmf 2.76 3: 0.92 (4.03%) 0.93 (3.10%) 0.78 (18.81%) 0.67 (30.66%) 0.57 (40.44%)
disability svd 2.76 3: 0.94 (2.29%) 0.92 (4.34%) 0.84 (12.58%) 0.77 (20.20%) 0.65 (32.17%)
scrap nmf 2.78 12: 2.09 (24.32%) 2.32 (15.70%) 2.00 (27.33%) 1.76 (35.98%) 1.43 (48.24%)
scrap svd 2.78 12: 2.25 (18.22%) 2.52 (8.53%) 2.37 (14.08%) 1.95 (29.05%) 1.68 (39.13%)
float nmf 5.63 9: 1.11 (51.13%) 1.41 (37.99%) 0.91 (59.73%) 0.55 (75.81%) 0.52 (76.90%)
float svd 5.63 9: 0.73 (67.79%) 1.59 (30.14%) 0.72 (68.25%) 0.51 (77.45%) 0.33 (85.42%)
giant nmf 5.41 5: 0.79 (29.70%) 0.86 (23.45%) 0.71 (37.01%) 0.66 (41.55%) 0.44 (60.83%)
giant svd 5.41 5: 0.96 (14.69%) 0.96 (14.69%) 0.81 (27.65%) 0.71 (36.69%) 0.51 (54.28%)
sack nmf 4.88 6: 1.37 (15.71%) 1.28 (20.96%) 1.24 (23.26%) 0.97 (40.04%) 0.59 (63.58%)
sack svd 4.88 6: 1.27 (21.88%) 1.33 (17.81%) 1.27 (21.64%) 1.02 (37.09%) 0.97 (39.92%)
behaviour nmf 1.61 3: 0.33 (7.10%) 0.31 (11.98%) 0.29 (18.92%) 0.28 (20.32%) 0.23 (34.59%)
behaviour svd 1.61 3: 0.36 (0.14%) 0.32 (9.27%) 0.30 (14.84%) 0.29 (18.86%) 0.26 (27.65%)
knee nmf 1.79 13: 1.76 (26.06%) 1.97 (17.21%) 1.75 (26.70%) 1.66 (30.18%) 1.53 (35.76%)
knee svd 1.79 13: 1.77 (25.52%) 1.97 (17.32%) 1.85 (22.39%) 1.76 (26.14%) 1.56 (34.36%)
accident nmf 1.59 6: 1.00 (14.94%) 1.09 (7.41%) 0.95 (19.42%) 0.77 (34.04%) 0.76 (35.44%)
accident svd 1.59 6: 1.05 (10.94%) 1.09 (7.53%) 0.99 (15.98%) 0.87 (26.23%) 0.80 (31.77%)
promise nmf 4.00 7: 1.42 (14.70%) 1.52 (8.84%) 1.27 (23.52%) 1.02 (38.75%) 0.92 (44.65%)
promise svd 4.00 7: 1.45 (13.10%) 1.48 (11.42%) 1.43 (14.12%) 1.27 (23.82%) 1.02 (38.93%)
rabbit nmf 2.23 8: 0.54 (24.29%) 0.62 (12.50%) 0.50 (29.82%) 0.43 (39.38%) 0.37 (47.52%)
rabbit svd 2.23 8: 0.60 (15.24%) 0.65 (7.90%) 0.56 (20.86%) 0.47 (33.99%) 0.44 (38.05%)
excess nmf 2.31 9: 1.89 (17.95%) 2.12 (8.20%) 1.88 (18.53%) 1.60 (30.70%) 1.46 (36.94%)
excess svd 2.31 9: 2.16 (6.69%) 2.23 (3.50%) 2.11 (8.59%) 1.89 (17.99%) 1.70 (26.22%)
steering nmf 2.33 5: 1.52 (8.99%) 1.46 (12.27%) 1.12 (32.98%) 1.10 (34.09%) 0.91 (45.33%)
steering svd 2.33 5: 1.24 (25.45%) 1.24 (25.45%) 1.11 (33.52%) 1.04 (37.66%) 1.00 (40.11%)
bet nmf 1.66 15: 2.37 (24.88%) 2.90 (8.29%) 2.67 (15.37%) 2.22 (29.68%) 1.89 (40.07%)
bet svd 1.66 15: 2.51 (20.57%) 2.92 (7.56%) 2.74 (13.35%) 2.36 (25.27%) 2.15 (32.04%)
Table A.2: SENSEVAL1 nouns conditional entropy results
airbus semi gatt subsidies agri herring dispute
[confusion matrix values omitted]
Table A.3: “GATT” at k = 30 showing a nearly perfect reduction in entropy score of 95.28%.
(a) NMF, entropy decrease is 43.84%.
euratom acid iran congress gold wife tax foreign credit japan oil export highway address central gulf economy soviet trip pres staff
[confusion matrix values omitted]
(b) SVD, entropy decrease is 37.07%.
euratom acid iran congress gold wife tax foreign credit japan oil export highway address central gulf economy soviet trip pres staff
[confusion matrix values omitted]
Table A.4: “Reagan” at k = 10 for SVD and NMF, an example of NMF outperforming SVD.
APPENDIX B
Early Pilots and Trace Browser
The biggest problem in gaining intuition about which algorithms were performing well was the sheer amount of raw numeric data that had to be manually scanned by a human reader. This was also a problem when trying to develop an evaluation strategy that was meaningful and not biased. Implementations were working, but no real decisions could be made about any of them. To make the data more readable, a “Trace Browser” (TB) was written to display the massive amounts of data in a way that gives a good idea of the results without needing to read every piece of data in the system or create an evaluation strategy just to understand them.
TB takes all the traces used to build the system and creates a vector to represent each one. This vector is the centroid of all the term vectors in a trace (see Section 4.2.3). The centroid of a cluster is accepted to be a good representative of the content of the constituent vectors [22]. Once the TB has all the trace centroids and all of the SoM built by the above algorithms, it compares them all (via cosine similarity) and creates a ranked list of traces that best match a particular shade of meaning. These ranked lists are then displayed alongside the trace they are best matched against. Only the top ten values from the shade of meaning are displayed, to make it possible to read through
the results quickly.
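The TB's centroid-and-rank step can be sketched as follows. The trace names and three-dimensional term vectors below are invented for illustration; the real system works over full semantic-space vectors.

```python
import numpy as np

def centroid(term_vectors):
    """Centroid of a trace: the mean of its term vectors (Section 4.2.3)."""
    return np.mean(term_vectors, axis=0)

def rank_traces(som_vector, trace_centroids):
    """Rank traces by cosine similarity to a shade-of-meaning vector."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    sims = [(name, cos(som_vector, c)) for name, c in trace_centroids.items()]
    return sorted(sims, key=lambda t: t[1], reverse=True)

# Hypothetical traces, each a small bundle of toy term vectors.
traces = {
    "trace_finance": centroid(np.array([[1.0, 0.2, 0.0], [0.8, 0.1, 0.1]])),
    "trace_sport":   centroid(np.array([[0.0, 1.0, 0.3], [0.1, 0.9, 0.2]])),
}
som = np.array([0.9, 0.1, 0.0])  # a shade of meaning close to "finance"
ranked = rank_traces(som, traces)
print(ranked[0][0])  # the best-matching trace for this shade of meaning
```

The ranked list for each shade of meaning is what the TB then displays alongside its best-matching trace.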
The TB gives an effective way to browse the results quickly and clearly, but it does not solve the problem of evaluation that is clearly an issue with this line of research. At least with the TB there were indicative results that helped make some decisions about evaluation. Figure B.1 shows a snapshot of the trace browser in action, with two shades of meaning in green and a ranked list of the traces shown in purple, with their similarities shown in blue.
Figure B.1: A screenshot of the trace browser.
BIBLIOGRAPHY
[1] Leif Azzopardi, Mark Girolami, and Malcolm Crowe. Probabilistic hyperspace analogue to language. In SIGIR ’05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 575–576, New York, NY, USA, 2005. ACM Press.
[2] Ricardo A. Baeza-Yates and Berthier A. Ribeiro-Neto. Modern Information Retrieval. ACM Press / Addison-Wesley, 1999.
[3] Michael W. Berry, Murray Browne, Amy N. Langville, Paul V. Pauca, and Robert J. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics & Data Analysis, 52(1):155–173, September 2007.
[4] D. Boley, M. Gini, R. Gross, S. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore. Partitioning-based clustering for web document categorization. Decision Support Systems, 27(3):329–341, 1999.
[5] P. Bruza and R. Cole. Quantum logic of semantic space: An exploratory investigation of context effects in practical reasoning. In S. Artemov, H. Barringer, A. S. d’Avila Garcez, L. C. Lamb, and J. Woods, editors, We Will Show Them: Essays in Honour of Dov Gabbay, volume 1, pages 339–361. College Publications, London, 2005.
[6] P.D. Bruza, D.W. Song, and R.M. McArthur. Abduction in semantic space: Towards a logic of discovery. Logic Journal of the IGPL, 12(2):97–110, 2004.
[7] Wray Buntine. Extensions to EM and multinomial PCA. In Proc. European Conference on Machine Learning (ECML-02), pages 23–34, 2002.
[8] C. Burgess. From simple associations to the building blocks of language: Modeling meaning in memory with the HAL model. Behavior Research Methods, Instruments & Computers, 30(2):188–198, 1998.
[9] C. Burgess. Representing and resolving semantic ambiguity: A contribution from high-dimensional memory modeling. In D. S. Gorfein, editor, On the Consequences of Meaning Selection: Perspectives on Resolving Lexical Ambiguity, pages 233–261. American Psychological Association, 2001.
[10] C. Burgess, K. Livesay, and K. Lund. Explorations in context space: words, sentences, discourse. Discourse Processes, 25(2&3):211–257, 1998.
[11] Deng Cai and Xiaofei He. Orthogonal locality preserving indexing. In SIGIR ’05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 3–10, New York, NY, USA, 2005. ACM Press.
[12] Niladri Chatterjee and Shiwali Mohan. Discovering word senses from text using random indexing. In CICLing, pages 299–310, 2008.
[13] S. Deerwester, S. Dumais, T.K. Landauer, and G.W. Furnas. Indexing by latent semantic analysis. Journal of the American Society for Information Science and Technology, 41(6):391–407, 1990.
[14] C. Ding. A probabilistic model for latent semantic indexing. Journal of the American Society for Information Science and Technology, 56(6):597–608, 2005.
[15] Susan T. Dumais. LSI meets TREC: A status report. In Text REtrieval Conference, pages 137–152, 1992.
[16] Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Scott Deerwester, and Richard Harshman. Using latent semantic analysis to improve access to textual information. In Proceedings of the Conference on Human Factors in Computing Systems CHI’88, 1988.
[17] D. Gabbay and J. Woods. Abduction. Lecture notes from the European Summer School on Logic, Language and Information (ESSLLI 2000), 2000.
[18] P. Gärdenfors. Conceptual Spaces: The Geometry of Thought. MIT Press, 2000.
[19] Peter Gärdenfors and Massimo Warglien. Cooperation, conceptual spaces and the evolution of semantics. In P. Vogt, Y. Sugita, E. Tuci, and C. Nehaniv, editors, Symbol Grounding and Beyond, pages 16–30. Springer, Berlin Heidelberg, 2006.
[20] Eric Gaussier and Cyril Goutte. Relation between PLSA and NMF and implications. In Proc. 28th international ACM SIGIR conference on Research and development in information retrieval (SIGIR-05), pages 601–602, 2005.
[21] Michael N. Jones, Walter Kintsch, and Douglas J. Mewhort. High-dimensional semantic space accounts of priming. Journal of Memory and Language, 55(4):534–552, November 2006.
[22] George Karypis and Eui-Hong Han. Concept indexing: A fast dimensionality reduction algorithm with applications to document retrieval and categorization. Technical Report TR-00-0016, University of Minnesota, 2000.
[23] A. Kilgarriff and J. Rosenzweig. Framework and results for English SENSEVAL. Computers and the Humanities, 34:15–48, 2000.
[24] A. Kontostathis and W. Pottenger. Detecting patterns in the LSI term-term matrix. In IEEE ICDM’02 Workshop Proceedings, The Foundation of Data Mining and Knowledge Discovery (FDM’02), 2002.
[25] T.K. Landauer. On the computational basis of learning and cognition: Arguments from LSA. In B.H. Ross, editor, The Psychology of Learning and Motivation, volume 41, pages 43–84. Academic Press, 2002.
[26] Alberto Lavelli, Fabrizio Sebastiani, and Roberto Zanoli. Distributional term representations: an experimental comparison. In CIKM ’04: Proceedings of the thirteenth ACM international conference on Information and knowledge management, pages 615–624, New York, NY, USA, 2004. ACM Press.
[27] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, October 1999.
[28] David D. Lewis. Reuters-21578.
[29] K. Livesay and C. Burgess. Mediated priming in high-dimensional meaning space: What is "mediated" in mediated priming? In Cognitive Science Proceedings, LEA, pages 436–441, 1997.
[30] W. Lowe and S. McDonald. The direct route: Mediated priming in semantic space. In M. A. Gernsbacher and S. D. Derry, editors, 22nd Annual Meeting of the Cognitive Science Society, pages 675–680. Lawrence Erlbaum Associates, 2000.
[31] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 1st edition, July 2008.
[32] I. Matveeva, G. Levow, A. Farahat, and C. Royer. Generalized latent semantic analysis for term representation. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), pages 1–5, 2005.
[33] R. McArthur and P.D. Bruza. Discovery of implicit and explicit connections between people using email utterance. In Proceedings of the 8th European Conference on Computer-Supported Cooperative Work (ECSCW), pages 21–40. Kluwer Academic Publishers, 2003.
[34] Z. McCoy and M. Blom. Comparing PLSI and non-negative matrix factorization.
[35] Rada Mihalcea and Ted Pedersen. Advances in word sense disambiguation. http://www.d.umn.edu/∼tpederse/WSDTutorial.html, July 2005. Slides from the AAAI 2005 tutorial Advances in Word Sense Disambiguation.
[36] Roberto Navigli. Meaningful clustering of senses helps boost word sense disambiguation performance. In ACL ’06: Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the ACL, Sydney, pages 105–112, Morristown, NJ, USA, 2006. Association for Computational Linguistics.
[37] D. B. Neill. Fully automatic word sense induction by semantic clustering. Master’s thesis, M.Phil. in Computer Speech, Text, and Internet Technology, Churchill College, 2002.
[38] Charles Osgood, George Suci, and Percy Tannenbaum. The Measurement of Meaning. University of Illinois Press, 1957.
[39] Patrick Pantel and Dekang Lin. Discovering word senses from text. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 613–619, New York, NY, USA, 2002. ACM Press.
[40] S. Pinker. The Language Instinct: How the Mind Creates Language. HarperCollins, New York, 1994.
[41] R. Rapp. Word sense discovery based on sense descriptor dissimilarity. In Proceedings of the Ninth Machine Translation Summit, pages 315–322, 2003.
[42] Reinhard Rapp. A practical solution to the problem of automatic word sense induction. In Proceedings of the ACL 2004 on Interactive poster and demonstration sessions, page 26, Morristown, NJ, USA, 2004. Association for Computational Linguistics.
[43] Mark Sanderson. Word sense disambiguation and information retrieval. In Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval, pages 49–57, Dublin, IE, 1994.
[44] Mark Sanderson. Retrieving with good sense. Information Retrieval, 2(1):49–69, 2000.
[45] H. Schütze and J. Pedersen. Information retrieval based on word senses. In Symposium on Document Analysis and Information Retrieval (SDAIR), Las Vegas, NV, pages 161–175, 1995.
[46] Hinrich Schütze. Dimensions of meaning. In Proceedings of Supercomputing ’92, Minneapolis, pages 787–796, 1992.
[47] Hinrich Schütze and Jan O. Pedersen. A cooccurrence-based thesaurus and two applications to information retrieval. Inf. Process. Manage., 33(3):307–318, 1997.
[48] D. Song, P. D. Bruza, Z. Huang, and R. Lau. Classifying document titles based on information inference. In Proceedings of the 14th International Symposium on Methodologies for Intelligent Systems (ISMIS 2003), pages 297–306. Springer, 2003.
[49] D. Song, P.D. Bruza, R.M. McArthur, and T. Mansfield. Enabling management oversight in corporate blog space. In Computational Approaches to Analyzing Weblogs, AAAI Spring Symposium Series, Stanford University, March 27–29, 2006.
[50] D.W. Song and P.D. Bruza. Discovering information flow using a high dimensional conceptual space. In Proceedings of the 24th Annual ACM Conference of Research and Development in Information Retrieval (SIGIR’2001), pages 327–333. ACM Press, 2001.
[51] D.W. Song and P.D. Bruza. Towards context sensitive information inference. Journal of the American Society for Information Science and Technology, 54(3):321–334, 2003.
[52] Jean Veronis. Sense tagging: does it make sense? http://citeseer.ist.psu.edu/veronis01sense.html, accessed July 2008.
[53] Dominic Widdows. Geometry and Meaning. Center for the Study of Language and Information/SRI, 2004.
[54] Yorick Wilks, Dan Fass, Cheng-ming Guo, James E. McDonald, Tony Plate, and Brian M. Slator. Providing machine tractable dictionary tools. Machine Translation, 5(2):99–154, 1990.
[55] Wei Xu, Xin Liu, and Yihong Gong. Document clustering based on non-negative matrix factorization. In SIGIR ’03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pages 267–273, New York, NY, USA, 2003. ACM Press.