CALCULATING SHADES OF MEANING IN SEMANTIC SPACES
by
David Petar Novakovic, BIT. (Bond, 2006)
http://dpn.name/[email protected]
Submitted in fulfilment of the requirements for the Degree of Master of Information Technology (Research)
Faculty of Information Technology, Queensland University of Technology
July, 2008
ACKNOWLEDGEMENTS
Professor Peter Bruza, for spotting the interest at just the right time and helping me see past the digits and into semantic spaces. Maths is now an unshakable, but much loved curse. Also for the continuing support and direction through this research. His infectious enthusiasm left me leaving every meeting feeling encouraged and revitalised.

Stephen Kelly, for thoroughly proofreading a paper on a topic foreign to him, several times as well.

Jade, for tolerating me living my work for the last two years and still being able to encourage me.

Mum and Dad, for all their endless support and encouragement.

Ben, for doing research before me, so I knew what to expect.

inQbator, for giving me somewhere to work from, and being so supportive.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
1 Introduction
1.1 Background
1.2 The problem: Conceptual Ambiguity
2 Literature Review
2.1 Representation Models That Contain Ambiguity
2.1.1 Hyperspace Analogue to Language
2.1.2 Latent Semantic Analysis
2.2 Word Sense Disambiguation, Discrimination and Induction
2.3 Summary
3 Techniques for Calculating SoM
3.1 Singular Value Decomposition
3.2 Concept Indexing
3.3 Vector Negation
3.4 Non-negative Matrix Factorisation
4 Empirical Evaluation
4.1 Data
4.1.1 Word Selection
4.1.2 Evaluation Data Sets
4.1.3 Human Tagging
4.1.4 Preprocessing
4.2 Method
4.2.1 Building the HAL Model
4.2.2 Computing the Shades of Meaning
4.2.3 Representing the Traces
4.2.4 Comparing SoM to Traces
4.3 Evaluation Metrics
4.3.1 Purity and Normalised Mutual Information
4.3.2 Confusion Matrices
4.3.3 Conditional Entropy
5 Results
5.1 Normalised Mutual Information
5.2 Conditional Entropy
6 Conclusions and Future Work
6.1 SoM Accuracy Across Languages
6.2 Decomposing Term-Trace Data
6.3 Euclidean and Generalised Kernel NMF
6.4 Abductive Reasoning
6.4.1 Quantum Logic
A Confusion Matrices
B Early Pilots and Trace Browser
BIBLIOGRAPHY
LIST OF TABLES
2.1 A simple term × term matrix computed by HAL, before combining rows and columns.
4.1 Trace table
4.2 A reduced example of the “Reagan” confusion matrix at k = 10.
4.3 “onion” at k = 5
5.1 Average overall improvements in NMI.
5.2 Reuters NMI results. IK is the ideal value for k. B is the baseline NMI value. The percentages detail improvements over the baseline.
5.3 Average overall reductions in entropy.
5.4 “coffee” at k = 5. An example of SVD performing better than NMF in a run of the algorithm.
5.5 Reuters-21578 conditional entropy results
5.6 “float” at k = 5. Another example of SVD performing better than NMF.
A.1 SENSEVAL NMI results
A.2 SENSEVAL1 nouns conditional entropy results
A.3 “GATT” at k = 30 showing a nearly perfect reduction in entropy score of 95.28%.
A.4 “Reagan” at k = 10 for SVD and NMF, an example of NMF outperforming SVD.
LIST OF FIGURES
1.1 An example of a semantic space locality for “Reagan.”
1.2 Concepts surrounding “Reagan” feed into it from different angles.
1.3 “Reagan” can be understood in terms of axes of meaning derived from concepts surrounding it.
1.4 An example of some traces about President Ronald Reagan.
1.5 After traces have been processed, a word’s vector contains no original order or trace specific information.
1.6 Axes in a new dimensionally reduced subspace may be considered SoM.
2.1 Algorithm to compute the HAL TCOR matrix. The algorithm is run cumulatively for each document and the function word returns the lexicon index of the i’th word of the document.
B.1 A screenshot of the trace browser.
CHAPTER 1
Introduction
This thesis introduces the problem of conceptual ambiguity, or the Shades of Meaning
(SoM) that can exist around a term or entity. As an example, consider Ronald Reagan,
the ex-president of the USA: there are many aspects to him that are captured in text,
such as the Russian missile deal, the Iran-contra deal and others. Simply finding
documents with the word “Reagan” in them is going to return results that cover many
different shades of meaning related to “Reagan”. Instead it may be desirable to retrieve
results around a specific shade of meaning of “Reagan”, e.g., all documents relating to
the Iran-contra scandal. This thesis investigates computational methods for identifying
shades of meaning around a word, or concept. This problem is related to word sense
ambiguity, but is more subtle: it is based less on the particular syntactic structures
associated with an instance of the term and more on the semantic contexts around it.
A particularly noteworthy difference from typical word sense disambiguation is that the
shades of a concept are not known in advance; it is up to the algorithm itself to
ascertain these subtleties. It is the key hypothesis of this thesis that reducing the
number of dimensions in the representation of concepts is a key part of reducing
sparseness, and thus also crucial in discovering their SoM within a given corpus.
What follows is a discussion of the background theory supporting this research, the
methodology, and the results showing SoM being identified and empirically evaluated.
1.1 Background
Information in society, particularly online, is an ever growing, almost infinite collection
of opinions, reports, essays, emails and other forms of communication. No single person
could ever conceive of completely comprehending this massive amount of information.
Additionally, there are many situations where information is not so readily available,
yet needs to be monitored or analysed to find its core ideas and themes. Whether it is
the sheer volume of information available, or simply that people tend to focus on topics
close to them, it is clear that people do not have the capacity to keep up with the
amount of information available, and knowledge within separate fields is becoming more
and more isolated. In order to help quantify the largely unmonitored relationships
amongst the vast amounts of information on the Internet, scientists have developed
systems to model this data. In general this data is modelled using high dimensional
representations of text. Some of the relevant models are covered in the next section.
In addition to the ever widening gap between comprehension and the global scope of
information, words are ambiguous between different informational contexts. Words can
have many meanings; this is known as polysemy. Sometimes the ambiguities are as simple
as the difference between a verb and a noun, like “fly” the flying insect, and “fly”
the action. Other times the ambiguities are more pronounced, like the nouns “bank”,
the side of a river, and “bank”, where you save money. Since words can mean so many
things, a way is needed to find the different shades of a concept and utilise them.
This thesis is about developing effective computational models for inducing SoM around
a concept. Further research will be done on using the shades to find relevant
information more effectively.
1.2 The problem: Conceptual Ambiguity
While lexical ambiguity is a concern in text processing, conceptual ambiguity is
probably more prevalent in text corpora because of its more general nature. Conceptual
ambiguity, or the aspects of a concept, is much more subtle. Lexical ambiguity, or
homographs,¹ has often been described in terms of senses [43]. This thesis will argue
that in some cases the senses of a word differ from the aspects of a concept. These
aspects are referred to as Shades of Meaning (SoM). Consider a concept that has several
different conceptual “axes” or aspects associated with it; each of these conceptual
contexts is a shade of meaning. This is very common in general literature. For example,
the term “Reagan” immediately gives the association to the ex-president of the USA.
The ambiguity here comes from the fact that Reagan was actually involved in and known
for much more than just the fact that he was president. Other associated concepts
include the Iran-contra scandal, the Russian missile deal, his wife, his acting career
and others [51].
If a text processing engine builds a conceptual semantic space rather than a lexical
space and represents it in a two-dimensional visualisation, it could look like Figure
1.1: a series of concepts represented in a measurable space where some concepts are
closer together than others. These concepts represent the general concepts for each of
these terms, irrespective of occurring with “Reagan.” Since the model has been built out of
¹ Homographs, since text is being dealt with; or, more generally, homonyms.
[Figure: the concepts “reagan,” “acid rain,” “tax,” “missile deal” and “iran-contra” plotted as points in a two-dimensional space.]
Figure 1.1: An example of a semantic space locality for “Reagan.”
a series of documents about President Reagan, it occurs in the centre of all the other
concepts. The reason it can be found amongst them is that it is closely related to them.
The truth is that the “reagan” concept is actually subtly made up of the concepts
surrounding it, so each of these concepts feeds into the Reagan concept. If this were a
two dimensional space someone could walk around in, then standing in the “acid rain”
area and looking toward “reagan” would reveal an orange-green blur: the viewer is
seeing the Reagan concept, but from the point of view of acid rain. This is illustrated
in Figure 1.2.
Finally, it can be seen in Figure 1.3 that the “Reagan” concept is actually made up of
axes reflecting the other concepts around it. When the “Reagan” concept is referred to,
it could be by any one of these SoM, but normally it is by all of them. When “Reagan”
is referred to ambiguously on its own, it is the aim of this thesis to uncover the
fundamental axes surrounding the concept; these are the SoM.
[Figure: arrows from “missile deal,” “iran-contra,” “acid rain” and “tax” feeding into “reagan” from different directions.]
Figure 1.2: Concepts surrounding “Reagan” feed into it from different angles.
[Figure: “reagan” shown with axes running through it toward “acid rain,” “tax,” “missile deal” and “iran-contra.”]
Figure 1.3: “Reagan” can be understood in terms of axes of meaning derived from concepts surrounding it.
As an example, consider a text processing engine indexing large amounts of data and
searching for the term “Reagan”: it will find many instances of the word in many
varying contexts. Here are some example headlines it may find the word “Reagan” in.
1. REAGAN ADMITS IRAN ARMS OPERATION A MISTAKE
2. REAGAN PLEDGES TO INCREASE SPENDING ON ACID RAIN
3. IRAN INVESTIGATORS SEEK REAGAN TAPES
4. SPEAKES SAYS HE, REAGAN MISLED PUBLIC UNWITTINGLY
5. CANADA WELCOMES LATEST REAGAN ACID RAIN PLEDGE
6. REAGAN TO VETO 87.5 BILLION DLR HIGHWAY BILL
Figure 1.4: An example of some traces about President Ronald Reagan.
In the above headlines, 1 and 3 are clearly about the Iran-contra deal, headlines 2 and
5 are about the acid rain problem, and headline 6 is about the Highway Bill that Reagan
vetoed. The unusual sentence is 4: it is not particularly associated with any topic,
like Iran-contra; this one is discussed later. So all of these sentences contain the
word “REAGAN” but are about different aspects of Reagan. The text processing engine
would store all of these sentences under the word Reagan, leading to a single entry
looking similar to the text below. In practice the engine would keep some kind of value
associated with these terms, like their frequency within the document; this is just an
example to describe the ambiguity.
REAGAN: [’BILLION’, ’SAYS’, ’TAPES’, ’UNWITTINGLY’, ’ADMITS’, ’SPEAKES’, ’ACID’, ’SEEK’, ’INVESTIGATORS’, ’CANADA’, ’SPENDING’, ’ARMS’, ’INCREASE’, ’TO’, ’MISLED’, ’HE,’, ’OPERATION’, ’WELCOMES’, ’LATEST’, ’A’, ’IRAN’, ’PLEDGE’, ’VETO’, ’REAGAN’, ’RAIN’, ’DLR’, ’PLEDGES’, ’HIGHWAY’, ’MISTAKE’, ’ON’, ’87.5’, ’BILL’, ’PUBLIC’]
Figure 1.5: After traces have been processed, a word’s vector contains no original order or trace specific information.
This set of words represents a model of “Reagan,” but in a single combined unit. With
appropriate background knowledge, a human can identify some terms dealing with
particular SoM, but this is much more difficult for a computational model to recognise.
When this model is scaled up to millions of sentences, it becomes unmanageable to store
each sentence individually and still be able to process them efficiently and
effectively, so a machine must be taught to find the SoM encoded in this single vector
of words.
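The collapse from ordered traces into a single unordered entry can be sketched in a few lines of Python. The headlines and the collapsing step are simplified stand-ins for illustration, not the engine used in this thesis:

```python
# Two of the example traces containing the target term.
headlines = [
    "REAGAN ADMITS IRAN ARMS OPERATION A MISTAKE",
    "IRAN INVESTIGATORS SEEK REAGAN TAPES",
]

# Collapse every trace into one set of co-occurring terms.  Once this
# is done, neither word order nor trace membership is recoverable,
# which is exactly the ambiguity described above.
entry = sorted({word for h in headlines for word in h.split()})

print(entry)
```

Any algorithm for finding SoM must work from this flattened representation alone, since the individual traces are no longer stored.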
Another problem alluded to in the last paragraph is that headline 4 doesn’t seem to
relate to any topic in particular. In actual fact the headline is associated with an article
about the Iran-contra scandal as well. The fragment could have been made longer until
it included some explicit words that associated the headline with the actual SoM, but
that risks including spurious information that may also weaken its association to the
appropriate shade.
The problem of conceptual ambiguity is further exacerbated by metaphorical and
metonymic references to seemingly unrelated terms. In the case of Reagan, he got the
nickname “the Gipper” from his role in the movie Knute Rockne, All American. Most
Americans know who the nickname refers to, but this would confuse a machine. SoM could
be considered a very specific subset of polysemy. Conceptual ambiguity should also be a
consideration when processing large corpora of text because it is desirable to know the
SoM surrounding a word; this would help in many applications in both information
retrieval (IR) and computational linguistics (CL). It should be noted that in an IR
system, most ambiguity can be resolved to a reasonable level for a user by the addition
of terms to the query. Query expansion in this manner has been likened to semantic
priming [30, 29, 21]. This simply resolves lexical ambiguity by creating more context
for the query. It is preferable to resolve ambiguity at a conceptual level and more
exhaustively. The query “reagan iran-contra” would almost certainly give an IR system
enough context for a relevant enough list of documents to be returned. Many other
queries are often not satisfied, simply because the most prevalent use of a term may not
be the particular meaning you are looking for, or you may not know the word to qualify
the original query with in the first place. Section 2.1.1 will look at an example of
building an ambiguous representation with a high dimensional representation model.
Figure 1.6: Axes in a new dimensionally reduced subspace may be considered SoM.
As stated earlier, it is the key hypothesis of this thesis that reducing the number of
dimensions in the representation of concepts is a key part of reducing sparseness and
thus also crucial in discovering SoM within a given corpus. The use of dimensional
reduction to improve the representation of terms is a common theme among many
different strands of research.
A key part of Latent Semantic Analysis (LSA) is dimensional reduction, which is
generally a way to reduce the complexity of high dimensional matrix representations [25].
Karypis points out that it is possible to map a high dimensional space down onto a
lower dimensional subspace while still retaining important explicit information in the
data [22]. A side effect of this is that some latent connections are also brought out. He
says that by using dimensional reduction algorithms latent concepts can be found in a
document collection. This view is also supported by Landauer and this assertion forms
the basis of this thesis [25]. Cai and He further support this position in their work
with Orthogonal Locality Preserving Indexing [11]. In the context of SoM, the bases in
a new dimensionally reduced subspace may be the SoM surrounding that particular
context. In Figure 1.6, an abstract view of axes originating at the centre of “Reagan”
illustrates the view that the new bases might be SoM. This raises the question of
whether the bases corresponding to SoM should be orthogonal.
In summary, the aim of this thesis is to develop and empirically evaluate unsupervised
computational methods for inducing SoM around a concept relative to a corpus of text.
The models investigated will be matrix models of concepts, and a key component will be
to investigate the effect of dimension reduction on the quality of the SoM being
induced.
CHAPTER 2
Literature Review
To cover the groundwork required to start working with aspects, one should first look
at techniques that have been used to help resolve ambiguity in the past. Sanderson
gives a very brief and clear introduction to information retrieval and word ambiguity
and methods for improving it [43]. In particular he discusses mechanisms for “fixing”
a document’s representation in the system to give better results when resolving
ambiguity; it is these kinds of post-hoc manipulations on data that can drastically
improve the quality of operations on them. It is still important to choose the right
representation model, though. Although most IR researchers discuss semantic ambiguity
in the context of documents, often the discussion and solutions are useful in the
context of word ambiguity. The study of meaning has been given the name
“semiotic-cognitive information systems” [51]. Despite this thesis being primarily
interested in word meanings, there is considerable overlap between the areas of
document based IR and the semantic analysis of text in the context of symbolic meaning.
Some of the systems that often contain ambiguity will be discussed, along with
heuristics to help resolve ambiguity via preprocessing, indexing and post processing.
Though the results that have been found by using them in information retrieval and
semiotic-cognitive systems are of interest, this thesis is concerned with the
computational methods for SoM.
2.1 Representation Models That Contain Ambiguity
We are in agreement that high dimensional models are a promising way to represent
concepts [38]. In order to investigate the potential benefits of term level and
document level co-occurrence information, this chapter discusses high dimensional
models which have consistently been shown to approximate human cognitive processes
effectively while allowing ambiguous concepts to be represented [8, 13, 21, 18]. There
are a few ways of representing concepts in high dimensional models: one is at the term
level, others are at a higher syntactic level.¹
One way of representing meaning at a term level is by representing terms in a model
known as Hyperspace Analogue to Language (HAL) which attempts to approximate
human cognitive processes, for example, human semantic word association norms [8,
9, 10]. This model is generated by passing a window across the text and recording the
relative positions of words to each other. The value given to a word in the window
relative to another word is the inverse of the distance between them. This process
creates a very large sparse matrix of n × n dimensions, where n is the number of terms
in the system.
Another way to model concepts is in a term × document model where each document is
represented by a vector of word frequencies, or related weights. So models can have
term × document matrices, and HAL matrices which are term × term. Latent Semantic
Analysis (LSA) is a model that has been known to produce encouraging results in
representing the semantics of terms in a cognitively validated way [13]. LSA
¹ Like sentences, paragraphs or documents.
adopts a position firmly in the area of IR that focuses on “bag of words” style
representations of terms in a dimensionally reduced term × document matrix. A vector in
LSA carries information about a term and the contexts it has been found in.² There is
not a lot of literature comparing these different ways of representing data, merely
comparisons between different models using only term × term models, or more often,
term × document models. Studies have shown both models are relevant in different ways.
The types of information encoded in LSA and HAL vectors are different in nature, yet in
the literature about LSA its vectors are known as context vectors, confusingly similar
to the HAL literature, which also refers to HAL vectors as “context” vectors. Jones et
al. help with the disambiguation by calling the relationships in LSA context and those
in HAL order [21]. Lavelli et al. make the distinction clearer by calling LSA’s
information document occurrence representation (DOR) and HAL’s information term
co-occurrence representation (TCOR) [26]. Other research projects make the same
assessment of the different types of information and how they relate to each other
[21, 26].
2.1.1 Hyperspace Analogue to Language
The HAL model has shown promise as an effective way of using symbol co-occurrence
to help model the meaning of words [8, 9, 10].
HAL has been noted to compare favourably with human cognitive processing by
replicating human semantic word association norms [8, 9, 10]. HAL is a real valued
matrix, which allows terms and concepts to be represented measurably [33, 51].
Burgess et al. [8] call HAL vectors “context” vectors; while it is a suitable name,
other high dimensional semantic models also use the word context to mean something
different. The context that the term context actually refers to can mean a sentence, a
window, or a document. It will be seen later what other kinds of relationships there
can be. HAL is classified as a TCOR model, which implies a much smaller context window
than DOR models, which use the whole document as a window.
² In Landauer’s work, a context is typically a fragment of text the size of a paragraph.
A HAL model is referred to as a context space by Burgess et al., but as mentioned
earlier it is most clearly described as a TCOR model [8]. The reason for this is that the value
of a particular symbol in the HAL model is calculated as a sum of all the contexts the
particular term appeared in, where the context is normally much smaller than the size
of a document. This model is generated by passing a window of size l = 10 across the
text and recording the relative positions of words to each other, though other values of
l have been used as well. The value given to a word in the window relative to another
word is the inverse of the distance between them. This process typically creates a very
large sparse matrix of n × n dimensions, where n is the number of known terms in
the system. A row in this vector space is a vector for the values of words that appeared
before the chosen word in the window. Columns represent vectors of the words follow-
ing the word in the window. Burgess concatenated these two vectors together to give a
single vector representing the term in the HAL space. Bruza et al. found that
preserving word order information via the two separate vectors did not help with their
experiments, and instead summed the associated entries from the column and row vectors
[33, 51].
A term in the HAL semantic space is defined by the succinct mathematical notation in
Equation 2.1 [1].
HAL(t \mid t') = \sum_{k=1}^{K} w(k)\, n(t, k, t')    (2.1)
The HAL value for any term t_i with respect to another term t_j is given by Equation
2.1, where n(t_i, k, t_j) is the number of times t_i appears at distance k from t_j,
and w(k) = K − k + 1 is the strength of the relationship between the two terms given k.
The HAL semantic
space is denoted S where S[i, j] gives the strength of the co-occurrence relationship
between term ti and term tj . For example, consider the text “President Reagan ignorant
of the arms scandal”, with l = 5, the resulting HAL matrix would be as shown in Table
2.1. Each row i in the resulting matrix represents accumulated weighted associations of
word i with respect to other words which preceded i in a context window. Conversely,
column i represents accumulated weighted associations with words that appeared after
i in a window. Figure 2.1 shows a pseudocode implementation of the HAL algorithm.
As HAL produces vector representations of words, similarity metrics such as cosine
similarity [50, 6], Minkowski similarity [6, 51] and inferential metrics like information
flow [51, 50, 6, 49] can be calculated in the space. Since the information about every
context a symbol appears in is contained in a vector representation, all that is required
is to find the different aspects in the vector. The question is how this should be done.
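As a sketch of how one such metric operates on HAL row vectors, cosine similarity can be computed directly with NumPy. The two vectors below are illustrative only, not drawn from a real HAL space:

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    # Cosine of the angle between two term vectors: 1.0 for identical
    # directions, 0.0 for orthogonal vectors sharing no context.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Illustrative HAL-style row vectors over a seven-word lexicon.
reagan = np.array([0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0])
of_row = np.array([0.0, 5.0, 0.0, 3.0, 4.0, 0.0, 0.0])

print(round(cosine(reagan, of_row), 3))
```

Because every context a term appears in is summed into one vector, a single cosine score blends all of its shades together, which is why finding the separate aspects inside the vector is the harder problem.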
The problem of ambiguity is barely touched upon by much of the preceding work using
HAL. In general, ambiguity in semantic models, particularly HAL, is tackled with the
            arms  ignorant  of  president  reagan  scandal  the
arms           0         3   4          1       2        0    5
ignorant       0         0   0          4       5        0    0
of             0         5   0          3       4        0    0
president      0         0   0          0       0        0    0
reagan         0         0   0          5       0        0    0
scandal        5         2   3          0       1        0    4
the            0         4   5          2       3        0    0

Table 2.1: A simple term × term matrix computed by HAL, before combining rows and columns.
use of semantic priming [29]. Burgess illustrates ambiguity in HAL vectors by reducing
their representations to two dimensions [9]. In their graphic diagrams, examples like
the word “fix” lie in between the two semantic groups representing the use of drugs and
the act of repairing something. Burgess proposes a method of detecting whether a term
may be ambiguous. He says that through his research he has found that as a term appears
in more and more contexts it becomes more general. He suggests that finding the
“Context Attributes” of a HAL vector helps find the dominant context contained in the
vector. This is checked by finding how many standard deviations all the non-zero
dimensions of a vector are from the mean. The terms with the highest standard deviation
would then be taken into consideration as defining the meaning of the vector the most.
While this is very helpful for finding the aboutness of the vector, it is desirable to
know if there are any other aspects in the vector that may be getting overshadowed by
the dominant aspect. In addition to finding separate aspects, it is good to be able to
tell if a particular term is ambiguous amongst the contexts it is found in.
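A minimal sketch of this heuristic might look as follows. The function name and the z-score formulation are this author's reading of Burgess's description, not his published code:

```python
import numpy as np

def context_attributes(vec: np.ndarray, top: int = 3):
    """Rank the non-zero dimensions of a HAL vector by how many
    standard deviations they sit above the mean non-zero weight."""
    nonzero = np.flatnonzero(vec)
    weights = vec[nonzero]
    mu, sigma = weights.mean(), weights.std()
    if sigma == 0.0:
        # All non-zero weights are equal: no dimension dominates.
        return [(int(i), 0.0) for i in nonzero[:top]]
    scores = (weights - mu) / sigma
    order = np.argsort(scores)[::-1]          # strongest first
    return [(int(nonzero[i]), float(scores[i])) for i in order[:top]]

# One weight dominates, so dimension 3 defines the vector's aboutness.
print(context_attributes(np.array([0.0, 1.0, 0.0, 9.0, 1.0, 1.0])))
```

Note that this ranking only surfaces the dominant context; weaker shades sit close to the mean and are exactly the aspects the heuristic overlooks.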
def calc_HAL(word: int→int, n: int, l: int, S: matrix)
    for i in 1..n do
        for j in 1..l such that i-j > 0 do
            S[word(i), word(i-j)] += (l+1) - j
        end
    end
end
Figure 2.1: Algorithm to compute the HAL TCOR matrix. The algorithm is run cumulatively for each document and the function word returns the lexicon index of the i’th word of the document.
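The pseudocode in Figure 2.1 can be realised directly in Python. The sketch below stores the matrix as a nested dictionary keyed by the words themselves rather than by lexicon indices; this is an illustrative implementation, not the one used in the experiments, but it reproduces the weights of Table 2.1:

```python
from collections import defaultdict

def build_hal(tokens, l=5):
    # Cumulative HAL TCOR matrix: S[a][b] is the weighted count of
    # occurrences of word b before word a within a window of size l,
    # weighted by (l + 1) - distance so adjacent words score highest.
    S = defaultdict(lambda: defaultdict(int))
    for i in range(len(tokens)):
        for j in range(1, l + 1):
            if i - j >= 0:
                S[tokens[i]][tokens[i - j]] += (l + 1) - j
    return S

# The worked example behind Table 2.1:
S = build_hal("president reagan ignorant of the arms scandal".split())
print(S["arms"]["the"], S["scandal"]["arms"])   # 5 5, matching Table 2.1
```

Running the function cumulatively over every document in a corpus, as the figure describes, simply means calling it once per document with a shared matrix.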
2.1.2 Latent Semantic Analysis
The Latent Semantic Analysis (LSA) technique was first introduced in 1988 and has been
employed in a variety of fields such as IR, machine learning, computational linguistics
and cognitive science [16]. LSA starts with a term × document matrix S which is
factorised by a theorem from linear algebra known as singular value decomposition (SVD)
[13, 25, 16, 15, 2, pg. 44]. SVD produces three matrices which can be multiplied
together to recover the original matrix, S. While the matrix is decomposed into the
three resultant matrices of SVD, it is possible to perform operations that reduce the
rank and smooth the original matrix. If the original matrix is S, the three matrices
produced by SVD(S) are U, D and V^T. (See Equation 2.2.)
SVD(S) = U D V^T    (2.2)
U is a matrix of orthogonal eigenvectors representing the terms in the matrix; it has
dimensionality t × rank(S), where t is the number of terms in the vocabulary. D is a
diagonal matrix of singular values of rank rank(S). V^T is a matrix similar to U, but
is a matrix of orthogonal eigenvectors of dimensionality n × rank(S), where n is the
number of documents in the system. Dimensions in V^T represent documents from the
system. Deerwester et al. provide a good visual description of the matrices during and
after the LSA operation [13].
LSA truncates the singular value matrix D by taking the highest k singular values and
zeroing out the rest of the singular values on the diagonal. Thus when the matrices are
multiplied together again, the rank of the matrix S′ is reduced to k.
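The truncation step can be sketched with NumPy's SVD routine; the toy term × document matrix below is invented purely for illustration:

```python
import numpy as np

# A toy term x document matrix S (rows: terms, columns: documents).
S = np.array([[2.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 2.0]])

U, d, Vt = np.linalg.svd(S, full_matrices=False)

k = 2
d[k:] = 0.0                  # zero out all but the k largest singular values
S_k = U @ np.diag(d) @ Vt    # re-multiplying yields the rank-k matrix S'

print(np.linalg.matrix_rank(S_k))
```

The entries of `S_k` differ slightly from `S` everywhere, including cells that were originally zero, which is the smoothing effect discussed next.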
This has two effects. Firstly, noise is filtered out of the matrix, because the lower
singular values were discarded. Secondly, as a result of the noise filtration, latent
concepts in the matrix are exposed [13, 25, 16, 15]. What is found is that after
dimension reduction many weights in the matrix that were zero before are found to be
non-zero. These new non-zero weights represent the latent information in the matrix [33].
Both HAL and LSA have an established track record of building a model of a corpus as a
matrix. LSA employs SVD on the matrix, and the resulting eigenvectors each represent an
“axis of meaning” through the corpus. As HAL is a matrix, it could similarly be given
an eigenvector decomposition via SVD. In the following experimentation, a matrix will
not be built from a generic corpus of text as is often done with HAL and LSA; rather,
the corpus will be constructed around a single word. This then raises the question of
whether the eigenvectors of the corresponding dimensionally reduced matrix correspond
to SoM.
2.2 Word Sense Disambiguation,
Discrimination and Induction
Word Sense Disambiguation (WSD) is the process of finding the sense of a particular
word in a particular context. Discrimination normally refers to disambiguating the
senses of words without any kind of supervision. Induction extends discrimination
further by inducing the senses of a word from the corpus itself. This is in stark
contrast to most WSD systems, which check their classifications against some gold
standard. This is similar to what this thesis is hoping to achieve, but not exactly the
same. As noted in Section 1, aspects, or SoM, are a much more subtle form of ambiguity.
Some techniques in WSD will be assessed, and some other previous work will also be
discussed. It is important to discuss some of the methodologies and problems with
WSD in general as a precursor to investigating new methods. There is substantial work being done in the field
of WSD and it is a hard topic to deal with. While part-of-speech taggers achieve
upwards of 95% success tagging corpora, word sense taggers still suffer from poor
results [36]. While some are defeatist about word sense disambiguation [52], it should
instead be viewed as another challenge, especially since some current systems seem to
perform well [42].
There are three major approaches to word sense disambiguation: knowledge based (dis-
ambiguation), supervised (disambiguation) and unsupervised (discrimination). Knowl-
edge based and supervised methods rely on external resources as a reference to help
disambiguate terms [35]. Supervised methods of WSD use some crafted data in the form
of taxonomies, dictionaries or thesauri. The big problem with this approach, if it is
related back to the “Reagan” example, is that the senses for the word “Reagan” total
exactly one: the ex-president of the USA.3 So using hand-built thesauri or dictionaries
is not acceptable for the problem stated in this thesis. Unsupervised WSD is closer
to what this thesis achieves, since it gives aspects the capacity to show themselves
without being forced to fit into a categorisation system. As noted though, to find SoM,
one step further is required.
Some early WSD work by Wilks et al. to create a machine-tractable dictionary of words
showed enough promise with respect to building a statistical model of words that the
trend continued in most subsequent research [54]. The idea of building a machine-
tractable dictionary of words of different senses leads towards finding a dictionary of
words divided by their SoM.
Sanderson has reviewed several different disambiguation techniques [44]. Most lean
heavily in the direction of supervised techniques and he is quite critical of the others,
3 Justifiably, thesaurus.com does not even have a record of “reagan”, and WordNet has one sense, the obvious one.
except for one. He makes note of work by Schütze and Pedersen in the area of WSD
[45]. Of all the methods he discusses in his paper, the work of Schütze and Pedersen
matches most closely what this thesis achieves, in that it cannot be assumed that any
SoM exist around a particular topic; rather, they need to be ascertained from the text
itself. This quote about other approaches to WSD appropriately justifies their position:
“All these approaches share the problem of coverage: specialized domains
tend to exhibit rare words and specialized meanings, which are not covered
by generic lexical resources. The cost of customizing the resources is often
prohibitively high.” Schütze and Pedersen [45].
Schütze and Pedersen create a novel algorithm for calculating the sense of a word based
on the “company it keeps.” They start by taking the context of every appearance of a
word and counting the occurrences of words in the same context as it (i.e., a window of
size k = 40). Note that this is a frequency count, not a weighted measure of distance as
is the case with HAL. By doing this recording, Schütze and Pedersen build a term × term
TCOR (term co-occurrence representation) matrix, where the weights are the frequency
of terms co-occurring in the same window. To reduce computational complexity, two
rounds of clustering and the singular value decomposition algorithm are used to reduce
the size of the matrices involved in the calculation [47]. The first round of clustering is
done around the most frequent words in the corpus; the second round of clustering is
then done based on words that appear near words from the first round. This is done
for two reasons: to reduce the dimensionality of the data set for processing, and, more
importantly, to reduce the dimensionality while keeping a record of second-order word
co-occurrences. The third major operation is performing SVD on a matrix built out of
occurrence information from the first two cluster operations. This
method is very computationally intensive, but they achieve good results. To combine
the vectors to represent terms effectively in a query, Schütze and Pedersen sum the
vectors for the terms in the query. The sum is not a simple sum, though: the vectors are
scaled according to the tf.idf (term frequency, inverse document frequency) weighting
scheme.
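The first step, the windowed frequency count, can be sketched as follows. This is a simplified illustration (whitespace tokenisation, a small window); the clustering and SVD stages are omitted:

```python
from collections import defaultdict

def tcor_counts(tokens, window=40):
    """Term x term co-occurrence frequencies within a +/- window, as in
    Schütze and Pedersen: a plain count, with no HAL-style distance weighting."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[w][tokens[j]] += 1
    return counts

toks = "reagan met the press reagan spoke".split()
m = tcor_counts(toks, window=2)
```

Each cell of the resulting matrix holds the number of times two terms shared a window, which is exactly the frequency-based weighting contrasted with HAL above.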
Sanderson is quick to point out that Schütze and Pedersen use long queries, often over
100 words, which is unrealistic for a user feedback system. Interestingly, Burgess
et al. make reference to Schütze’s work but do not go into much detail about the dif-
ferences [10, 46]. There are two key differences between HAL and Schütze’s work. In
HAL, co-occurrence data is recorded as the inverse of the distance between two words;
this makes word meaning and relationships in HAL sensitive to the distance between
any two co-occurring terms, and makes HAL a logical extension of Schütze and Peder-
sen’s work. Unfortunately, most literature around HAL by Burgess et al. only mentions
dimension reduction in passing. Schütze and Pedersen, on the other hand, consider SVD
a key part of their work. The conclusion is that Schütze and Pedersen’s work is elegant
but does not include distance information from the corpus and may be too computa-
tionally intensive.
Similar to the work of Schütze and Pedersen is the work of Pantel and Lin, who un-
fortunately were unaware of the work of the former but nevertheless make a valuable
contribution to the field [39]. This oversight is also noted by Rapp [41]. Pantel and
Lin use a clustering mechanism they call Clustering By Committee, which clusters terms
based on point-wise mutual information (PMI) weightings. PMI is a probabilistic
measure of the association between two random variables. The Minipar parser arguably
makes the workings of their system “semi-supervised”, as it consults non-corpus-based
resources in order to quantify grammatical contexts. Although this method deviates
from purely “corpus-based” unsupervised principles, it seems to give good results, so
it may be worth investigating Minipar as an indexer in future research.
Rapp makes reference to Pantel and Lin and continues in a similar vein, with good
results and a lesser degree of supervision [41]. He uses only lemmatisation during the
text indexing stage; this is still not ideal, as it is preferable to evaluate purely
unsupervised systems. Rapp is actually using the process behind Latent Semantic
Analysis and acknowledges the fact in his evaluation. He outlines the key differences
as being his use of lemmatisation and of a sliding window of size two instead of whole
documents as context. He thus claims that TCOR works better than DOR in his
situation; this is also supported implicitly by Pantel and Lin in their use of
grammatical relationships to build PMI.
Some of the most recent work that builds upon Rapp, Schütze and Pedersen’s work is
that of Chatterjee et al. [12]. They propose using the Random Indexing (RI) method
for reducing a word space similar to HAL for the task of WSD. The method shows
promise and is very much aligned with the work in this thesis. However, because of the
heavily random element of RI, it is not used as a dimensional reduction strategy in this
thesis.
In [37] Neill proposes a system for finding word senses that is slightly different from
the other WSD systems discussed here, and is even closer to calculating SoM. He calls
it Sense Induction: the senses are determined from the corpus itself and evaluated
qualitatively by humans (in his case, by himself and his supervisor). Though his system
is probabilistic, a lot of the principles translate well into purely geometrical
representations. Sense induction uses a similarity matrix to build a graph-like
structure of the localities in the data set; this information is then used to seed a set
of clusters, each cluster representing a sense. Words are loosely grouped into their
clusters and can be members of several sense definitions. The most innovative work in
Neill’s paper is his use of conditional entropy to calculate the overall performance of
his proposed system. This is discussed further in Section 4.3.3, as it addresses many
of the evaluation issues.
Rapp continues his research into sense induction by further analysing the work of Pan-
tel and Lin [39] and Neill [37] and extending their ideas [42]. His work is of
particular interest because it steps away from using globally scoped vectors and
instead clusters vectors representing each context. Each sentence becomes a vector in
the system, with Boolean weights indicating whether a word exists in that context or
not. He then does some filtering on some terms, and applies SVD to the matrix. This
system gives him good overall results, about 86% unsupervised accuracy, far higher
than the approximate 75% supervised ceiling achieved in the original SENSEVAL tasks.
The biggest problem associated with Rapp’s procedure is the need to store each sen-
tence as an individual entity; as mentioned in Section 1, this could be a problem when
scaling to very large numbers of sentences. Fortunately, since this is not specifically
an IR system, the right number of sentences can be defined as “however many it takes
to get all senses of a word.” Rapp’s system is good and fits well with the hypothesis of
this research: that dimensional reduction helps uncover SoM in text. His ideas will be
pursued further after this present line of research. (See Future Work, Section 6, for
more information.)
In the field of WSD4 researchers overwhelmingly use TCOR as the basis for collecting
4 More generally, this is computational linguistics as opposed to information retrieval.
data about a particular corpus. Additionally, there have been good recent advances in
the field of unsupervised and semi-supervised methods for WSD. This young field of
research is faced with the problem of not having a standard system for quantifying re-
sults. The “Test of English as a Foreign Language” (TOEFL) is a popular method of
quantifying results amongst researchers in the field of WSD, but it does not suit the
test for aspects. Another way the researchers discussed here quantify the results of
their systems is to compare them against established WordNet sense taxonomies. The
problem of quantifying evaluation will be discussed in more depth in Section 4.
2.3 Summary
Several methods of dimensional reduction or clustering that could be applied to the
problem of inducing SoM have been discussed. Nearly all of the methods are used by
their creators to improve information retrieval performance, which is not the aim of
this thesis. Instead the aim is to disambiguate SoM in vectors that carry a lot of
context information. The models discussed in detail were Hyperspace Analogue to
Language (HAL) and Latent Semantic Analysis (LSA), whereby Singular Value
Decomposition (SVD) was discussed as the means for dimensionally reducing the model
to improve the quality of both term and document representations. While these models
have been used in the form of term × document or term × term matrices, these matrices
are typically constructed from the whole corpus. There seems to be no literature in
which the matrix model represents a single concept or word.
Based on this literature review, a system is proposed which computes a HAL TCOR
model derived from a set of sentences corresponding to a word or concept, resulting in
a matrix model corresponding to that concept. This matrix model is input into various
techniques for computing the SoM for the concept. Nearly all of these methods (except
Vector Negation) are used to process term × document DOR information. While some
of these systems have been used to process term co-occurrence information as well
as document co-occurrence information, there does not seem to be any literature on
the analysis of SoM within a single concept. This leads into a discussion of how this
can be investigated; in the next section the potential practical work that this research
leads to will be covered.
In summary, a system using a HAL TCOR space built over a set of sentences containing
a concept is proposed. The application of each of the reviewed systems to this semantic
space will potentially yield SoM. To evaluate whether this modified space contains a
good description of the shades of meaning, the conditional entropy strategy will be
used.
CHAPTER 3
Techniques for Calculating SoM
3.1 Singular Value Decomposition
The SVD of a matrix of term × document information has been described as an ‘intrinsic
semantic space’, which supports the hypothesis that eigenvectors could be highly
related to the SoM around a concept [14]. The intuition behind the term “intrinsic”
refers to the eigenvectors of the decomposition of the matrix being “axes of mean-
ing”. SVD has also been applied to term × term matrices, the eigenvector decomposition
of which should also produce axes of meaning.
Kontostathis and Pottenger provide an analysis of the eigenvectors produced by SVD
during the LSA process [24]. They use a term × term matrix T , calculated by taking the
product of U and D′ and multiplying it by its own transpose; the result is a symmetric
matrix. The equation is given below:
T = UD′(UD′)T (3.1)
In this term × term matrix, weights are still based on DOR; the weights are considered
the similarity between the terms in each dimension. This kind of approximation would
also be possible with HAL/TCOR vectors, and is explored by Bruza and McArthur [33].
Unfortunately, the data used in their experiments is not available to continue tests on.
The HAL algorithm shown in Equation 2.1 produces an n × n matrix M, where n is
the number of terms. Column i encodes accumulated co-occurrence weights of terms
appearing after term i in a context window. Conversely, row i encodes accumulated
co-occurrence weights of terms preceding term i in a context window. In this way, HAL
captures order information. In our case, we will not take word order into account, so
matrix M is added to its transpose, resulting in a symmetric n × n matrix S.
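The construction just described can be sketched as follows. Equation 2.1 itself is not reproduced in this section, so the linearly decaying weight window − d + 1, a common HAL formulation, is an assumption here, as is the toy token sequence:

```python
import numpy as np

def hal_matrix(tokens, vocab, window=5):
    """HAL-style accumulation: column i gathers weights for terms appearing
    after term i, with closer pairs weighted more heavily (window - d + 1)."""
    idx = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for i, w in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d < len(tokens):
                # tokens[i + d] appears d positions after w
                M[idx[tokens[i + d]], idx[w]] += window - d + 1
    return M

toks = "reagan president iran reagan".split()
vocab = sorted(set(toks))
M = hal_matrix(toks, vocab, window=2)
S = M + M.T            # discard word order: symmetric n x n matrix
```

Adding M to its transpose folds the "before" and "after" weights together, which is the symmetrising step required before the decomposition below.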
The SVD of S is computed as follows:
S = UDV T (3.2)
V = U (3.3)
∴ S = UDUT (3.4)
The SVD of a symmetric matrix is the eigenvalue decomposition of it [32]. Thus, by
using HAL matrices as the input to SVD, we are calculating their eigenvalue decompo-
sition. The question we will investigate is whether the eigenvectors correspond to the
SoM of the word represented by the matrix S. Another way of looking at it is through
the spectral theorem, which is applicable because the matrix is symmetric. The spectral
theorem shows that S can be rewritten as a sum of projectors [6]. (See Equation 3.5,
where ui is the ith column vector of U and σi is the singular value D[i, i].) The problem
of SoM can be described as the question of whether ui can accurately describe a SoM
for a particular word.
S = ∑_{i=1}^{n} σ_i u_i u_i^T (3.5)
By creating a rank-k approximation of S, where k < n, the SoM space can be reduced to
varying sizes by changing the parameter k. (See Equation 3.6.)
S ≈ ∑_{i=1}^{k} σ_i u_i u_i^T (3.6)
Because the singular vectors and values are ordered by the size of the singular values,
the eigenvectors in U are expected to always be in the same order: for k = 10, the first
five vectors will be the same as those produced when k = 5. As k is increased it will
introduce new vectors that have lower associated singular values but are orthogonal,
and thus each could be a new SoM.
The SVD algorithm used in this research is the implementation provided in the numpy
package1, which was robust enough to deal with the relatively small amount of data
required for this experiment. After removing the key term, the HAL matrix was
decomposed via SVD and the right singular vectors from V T were used as shades of
meaning, though the left singular vectors from U could equally have been used.
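The procedure just described can be sketched as follows; the matrix values are illustrative only, standing in for a real symmetric HAL matrix:

```python
import numpy as np

def som_vectors(S, key_index, k):
    """Remove the key term's row and column from the symmetric HAL matrix,
    decompose with numpy's SVD, and return the top-k right singular vectors
    as candidate shades of meaning."""
    keep = [i for i in range(S.shape[0]) if i != key_index]
    S2 = S[np.ix_(keep, keep)]
    U, d, Vt = np.linalg.svd(S2)
    return Vt[:k]                  # each row is one candidate SoM axis

S = np.array([[2., 1., 0.],
              [1., 3., 1.],
              [0., 1., 2.]])
shades = som_vectors(S, key_index=0, k=2)
```

Because the rows of V T are orthonormal, each returned shade is orthogonal to the others, matching the "axes of meaning" interpretation above.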
3.2 Concept Indexing
Concept indexing (CI) is a method proposed by Karypis and Han to project a term ×
document matrix into a lower subspace [22]. It is claimed to be much faster and more
efficient than LSA. By looking at the algorithms behind CI there may be a way to adapt
it to work on HAL vectors. From a high-level perspective CI fits right into the model of us-
1 http://numpy.scipy.org/
ing high dimensional spaces to represent concepts. Like LSA, it uses a term × document
matrix to represent a large collection of documents. CI then clusters the documents
into k clusters, with each cluster containing documents of a similar topic; that is, doc-
uments in the same cluster should be more similar to each other than to documents
in other clusters. Naturally, the success of CI relies heavily on the effectiveness of the
clustering algorithm. The centroid vectors of the clusters then become the new axes, or
basis vectors, for the reduced-dimension representation of the collection. Documents
can then be projected onto this k-dimensional space, and document similarity can be
measured by calculating the cosine similarity.
The centroid vectors provide an effective way to represent all the documents contained
within a cluster. Because the documents are all of a similar nature, the highly weighted
terms in the centroid actually provide a good way to summarise the members of the
cluster. To project a document down from its original high dimensional form into the
new lower dimensional subspace, the document is compared via the cosine similarity
function to each centroid vector in the reduced subspace, so the similarity to the ith
centroid becomes the ith element in the new document vector. This is found by calcu-
lating the dot product of the matrix of basis vectors with the document vector.
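This projection step can be sketched directly; the centroid and document values here are hypothetical:

```python
import numpy as np

def project(doc, centroids):
    """Project a document onto the k-dimensional CI subspace: element i of
    the result is the cosine similarity of the document to centroid i."""
    C = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return C @ (doc / np.linalg.norm(doc))   # dot product with the basis matrix

centroids = np.array([[1., 0., 0.],
                      [0., 1., 1.]])
doc = np.array([1., 1., 0.])
p = project(doc, centroids)
```

The document thus keeps a graded similarity to every centroid, rather than being assigned wholesale to its nearest cluster.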
As mentioned, the clustering algorithm plays a very important role in the reduction
of the document set to meaningful groups. Several clustering algorithms have been
benchmarked for clustering effectiveness and performance [4]. The clustering algo-
rithm used in CI is based on the partitional clustering system developed in the afore-
mentioned benchmarks and research. This clustering system is very similar to k-means
clustering, which outperforms traditional forms of clustering quite convincingly.
This partitioning starts by randomly selecting k documents and
using them as the basis for selection of more documents to be added to each cluster.
Documents are then iteratively added to the clusters based on the centroid that they
are most similar to, and at each iteration the centroids are recalculated to represent
the new clusters. In CI, this method is slightly modified so that at each iteration only
two partitions are created, but the algorithm is applied k − 1 times. It is claimed that
this keeps the sizes of the clusters more regular, which improves dimensional reduction.
At each step the algorithm must choose which cluster to split into two: it chooses the
cluster with the biggest spread of concepts. This is done by calculating the square of
the length of the centroid vector, which represents the average of the similarities of all
the documents in the cluster; subtracting this number from one gives the dissimilarity.
Thus the algorithm chooses the cluster with the highest dissimilarity to split further.
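A simplified sketch of this repeated-bisection scheme follows. It uses plain 2-means splits and the one-minus-squared-centroid-length dissimilarity; the five-restart selection that Karypis performs is omitted, and the data points are invented for illustration:

```python
import numpy as np

def dissimilarity(cluster):
    """One minus the squared length of the centroid of the normalised members:
    the 'spread of concepts' used to pick which cluster to split next."""
    C = cluster / np.linalg.norm(cluster, axis=1, keepdims=True)
    c = C.mean(axis=0)
    return 1.0 - c @ c

def bisect_cluster(X, k, rng=None):
    """Split the most spread-out cluster into two, k - 1 times, using a few
    2-means refinement steps per split."""
    rng = np.random.default_rng(0) if rng is None else rng
    clusters = [X]
    for _ in range(k - 1):
        i = int(np.argmax([dissimilarity(c) for c in clusters]))
        c = clusters.pop(i)
        seeds = c[rng.choice(len(c), size=2, replace=False)]
        for _ in range(10):
            Cn = c / np.linalg.norm(c, axis=1, keepdims=True)
            Sn = seeds / np.linalg.norm(seeds, axis=1, keepdims=True)
            labels = (Cn @ Sn.T).argmax(axis=1)
            seeds = np.array([c[labels == j].mean(axis=0) if (labels == j).any()
                              else seeds[j] for j in (0, 1)])
        clusters += [c[labels == 0], c[labels == 1]]
    return clusters

X = np.array([[9., 1.], [10., 0.], [1., 9.], [0., 10.]])
parts = bisect_cluster(X, k=2)
```

On this toy data the two obvious directional groups are recovered; the random seed choice criticised below is visible in the `rng.choice` call.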
It seems that not randomly choosing the seed vectors may help create better clusters;
perhaps this could be done by picking a vector, then picking a second only if its cosine
similarity to the first is sufficiently low, otherwise randomly picking another and
checking its similarity, and so on. This problem is somewhat mitigated by the fact that
in CI Karypis actually performs the clustering five times, and the best of the five
clustered sets is chosen. Even so, using a basic sanity check would hopefully give
better results with minimal overhead.2
Karypis discusses a system for calculating the effectiveness of the dimension reduction
from CI and LSA, which he calls retrieval improvement. This system works by checking
how tight a semantic neighbourhood is before dimensional reduction, and how much
tighter it becomes after the reduction. He observes that CI performs
2 Unless, of course, none of the documents are below the threshold, in which case it would be a wasted effort.
comparably with LSA.
One of the principal claimed benefits of CI is that it performed five to eight times faster
than LSA for Karypis; he says that this is a benefit of using clustering instead of SVD.
This claim does not seem to be based on any firm mathematical or even empirical re-
sults. After conducting empirical tests, it was found that the final matrix multiplication
step of LSA can be many times slower than the SVD calculation itself when calculated
in numpy for Python. The reason could be that the final matrix produced by LSA has
many more non-zero elements than the original full-rank matrix. Some empirical
results are as follows: SVD on a 3000 × 3000 matrix took 41 seconds, while the matrix
multiplication of UDV T without dimensional reduction took 384 seconds. In MATLAB
the results were reversed, which indicates that SVD being the primary reason why LSA
is slower than CI could actually be an implementation detail. One thing that concept
indexing is almost certainly better at dealing with is extremely large data sets, because
most implementations of SVD are not optimised to run on sparse matrices, and the
ones that are still have ceilings on the amount of data they can handle at once.
When applying CI to HAL vectors, the type of information that is being clustered is dif-
ferent: the data is context information for specific terms (TCOR), not the documents
they were found in (DOR). This means the final result is a reduced dimensional space
around terms. In IR, clusters that are built based on term distances are known as metric
clusters [2, pg. 126]. It is hoped that, given the right value for k, words would be
clustered into groups that represent aspects. What was tested in this regard is whether
using HAL vectors as opposed to document vectors gives significant improvements.
The reason this is mentioned is that in CI as performed by Karypis, ambiguous terms
are kept within their document and potentially their aspect. So when
dimensional reduction is performed, one would find ambiguous terms showing their
full conceptual meaning for the particular aspect they are contributing to. On the other
hand, there are still two axes (term × term) whereby terms, if treated as sub-parts of
concepts, could appear in multiple clusters. This is something that was tested; interest-
ingly, another overarching problem with applying CI to this task was discovered.
The biggest problem with CI is the fact that it randomly selects the seed vectors for the
clusters. This may perform well for information retrieval tasks but is not optimal with
respect to the accuracy of SoM. In general, information retrieval requires a relevant
document to be selected based on its general representation, not the exact centroid of
the cluster it is closest to. In concept indexing, the final stage of the process is the
projection of a document vector down onto the new subspace formed by the dimen-
sionally reduced vectors. This means that the essence of the document itself is pre-
served: it is not approximated by a nearest neighbour, it is a dimensionally reduced
version of itself. For shades of meaning the projection step is never performed; instead,
the centroids that best describe the original space are required. For this particular task
the ideal shades need to be selected to best represent something there is only a nebulous
idea about. It would be preferable to make the initial seed selection process less random.
The concept indexing algorithm proved somewhat troublesome and has not given the
originally envisaged results, because of its random selection of the initial vectors used
to seed the centroids. The intent was to build a new set of basis vectors based on the
centroids of the clusters built from the data. This process could well be very effective
for the information retrieval tasks CI was intended for, where documents are projected
down into the new space defined by the cluster centroids. But when it comes to
reducing the dimensionality of terms, the random selection method may not be
representative enough. This is an important distinction when it comes to IR. Generally,
when a system is queried, a combination of the query weightings and document
weightings is used to provide a ranked list of results. This is acceptable when precision
is measured by the overall result of the ranked list of resources. In the case of shades of
meaning, single concepts that best represent the original semantic space are required,
and for this the seed selection process is critical. The method of selecting random
vectors could be offset by implementing the seed selection process outlined in Neill’s
Sense Induction paper [37], but he himself also notes some lacklustre performances
due to the somewhat arbitrary selection process. This is still an open problem which
warrants further research. Early pilot tests with concept indexing found it lacking for
this particular task; as such, it was not empirically evaluated as a candidate method for
computing SoM.
3.3 Vector Negation
Vector negation could refer to any kind of negation operation on vectors; in this case it
specifically refers to a technique investigated by Widdows in Geometry and Meaning
[53]. The NOT operation defined by Widdows is a subtraction which does not com-
pletely subtract one vector from another, but merely orthogonalises one against the
other. Widdows’ WORDSPACE model is broadly similar to LSA in that it is based
around a term × document DOR model, but it does not deal with dimensional reduc-
tion. It simply represents documents as vectors of term frequencies, and terms as vec-
tors of frequencies across documents. He then creates term × term matrices, as adja-
cency matrices of pairwise term similarities of the normalised information from the
documents.
The very interesting concept here is the idea of having a NOT operator that works on
vectors in order to make them irrelevant to each other (orthogonal). This enables NOT
operations to be defined for queries. This may seem irrelevant to the present task, but
in reality it can be generalised to help find SoM. Exploring what negation actually
entails will enable further analysis of how it is relevant. Widdows defines the concept
of vector negation as follows:
“Two word vectors a and b in WORDSPACE are considered irrelevant to
one another if their vectors are orthogonal, i.e. a and b are irrelevant to one
another if a · b = 0.” Widdows [53].
The concept in the quote above is very simple: if a vector has a cosine similarity of zero
to another vector, then the two have nothing to do with each other. Widdows asserts
that simply computing the vector subtraction a − b would not work well, because it
would be a very brute-force removal of b from a. Instead, he suggests rescaling b such
that subtracting it from a leaves some elements in a that may not be there because of b
in the first place; that is, forming a vector a − λb. The question, then, is how to scale b
to get the desired result. This is described as making the vector b irrelevant to a, in
other words, orthogonal. To operationalise this theory, he describes the method of
orthogonalising vectors: given two vectors a and b, take the dot product a · b, scale b
by it, and subtract the resulting vector from a.
a NOT b = a − λb (3.7)
(a NOT b) · b = 0 (3.8)
λ = (a · b) / ‖b‖² (3.9)
‖b‖ = 1 (3.10)
a NOT b = a − (a · b)b (3.11)
Equation 3.7 shows that the a NOT b vector is the a vector with some scaled multiple of
b subtracted. Equation 3.8 states that a NOT b should have a cosine similarity of 0 to b.
Equation 3.9 is given by Widdows as the solution for λ. With normalised vectors the
length ‖b‖ is 1, which allows the equation to be reduced to the final line, giving the
operational formula for negating the vectors.
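Equation 3.11 is straightforward to implement; a minimal sketch with illustrative vectors:

```python
import numpy as np

def vector_not(a, b):
    """Widdows' NOT operation (Equation 3.11): remove from a its component
    in the direction of b, so that the result is orthogonal to b."""
    b = b / np.linalg.norm(b)        # normalise b so that lambda = a . b
    return a - (a @ b) * b

a = np.array([3., 4., 0.])
b = np.array([1., 0., 0.])
r = vector_not(a, b)
```

The result has zero dot product with b, which is exactly the "irrelevance" condition of Equation 3.8.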
The reason this is interesting is that by taking a vector and NOTing it with another
vector, one is effectively saying: given this vector, remove this concept from it. Now
if there is a vector for an ambiguous term, it could be presumed that taking the most
similar term to it and performing the NOT operation would remove an approximation
of that concept from the vector, allowing concepts that may have been overshadowed
before to come through more strongly. With regard to the “reagan” example, one would
hope to be able to take the reagan vector and apply the NOT operator to it to get a
vector weighted more heavily in the direction of another shade of meaning. So if
administration is the dimension with the highest value in the reagan vector, reagan
NOT administration would give the reagan vector orthogonalised to administration;
in other words, a vector where administration no longer represents a dominant shade
of meaning. This would hopefully allow another shade of meaning to shine through,
like something relating to the Iran-Contra affair. Widdows shows this idea working on
selected data in his book [53].
The next experimental step from this would be treating negation as a way of decom-
posing a vector into its SoM, by continuing to apply the NOT operator, k times in total.
This process involves taking a vector and performing the NOT operation, which gives
two vectors: the original, and one without the heavy weighting associated with the
original. If this step is repeated, the final result is a series of vectors, each with a
reduced amount of information. Hopefully each vector would be representative of a
particular shade of meaning. Since these vectors will be orthogonal, they could be used
as basis vectors defining a subspace corresponding to a concept, whereby each basis
vector corresponds to a SoM. Each of the concepts would be a dimension in this
reduced space, on which clustering could be performed. This is highly speculative, and
was tested thoroughly. (See Section 4.)
VN was expected to perform quite well, but is actually fundamentally incompatible with HAL vectors. This is best explained by an example. Say the vector for “reagan” is made up of many dimensions, the largest of which is the dimension for “president”. For this example, assume that the “reagan” vector only has three dimensions. It could look something like: reagan : [president : 20, iran : 15, missile : 5]. Decomposition would mean finding the “president” vector and performing the NOT operation with it and the “reagan” vector. Now, HAL vectors by their very nature are built out of co-occurrence information gathered from a corpus, so president and reagan co-occur often; thus the “president” dimension is large in the “reagan” vector, but the “reagan” dimension is not large in the “reagan” vector. The same holds for the “president” vector: it has no large “president” dimension of its own. This means that when the “reagan” vector is negated against the “president” vector, the “president” dimension will never be reduced in the “reagan” vector, because there is no “president” dimension in the “president” vector. This is a critical step in the negation process, because once the vectors are negated, the next largest dimension in the original vector is the one the operation is performed with next time. In the example that would be the “president” vector again, resulting in a useless loop. A workaround has been developed which keeps a record of which dimensions’ vectors have already been used to negate against the original vector. This method seems to give results when used with other data, but the algorithm does not work with HAL vectors.
The implementation of the Vector Negation principle is an interpretation and adaption of the original concept provided by Widdows. Unfortunately, worthwhile results were not achieved during pilot studies, and the method was consequently not empirically validated as a candidate for computing SoM. It intuitively seems like a good way to find shades of meaning and remains an open research question.
3.4 Non-negative Matrix Factorisation
Non-negative Matrix Factorisation (NMF) refers to a family of mathematical algorithms designed to factorise a matrix into two, or sometimes three, sub-matrices [3]. The operation is broadly similar to SVD except that it enforces different constraints on the data. The main constraint is that all of the values in the resultant sub-matrices must be non-negative.
The NMF operation most frequently decomposes a large matrix S into two smaller matrices W and H. W is considered a set of features of S, and H a set of variables describing how the original data relates to the entries in W. For some specific implementations the W matrix can be considered a set of cluster centroids, while H contains information on how closely each of the rows or columns in the original data is related to the centroids. The product WH gives an approximation of the original matrix, not the original matrix itself; as noted by Berry et al., it is therefore more appropriate to refer to NMF as non-negative matrix approximation. The NMF problem has no unique solution, which is noted in most research on the topic [3, 27, 55].
The actual factorisation of the matrix happens by minimising a particular function over a data set S to obtain the basis vectors, or underlying features, as the matrix W. Many different factorisations are possible for a single matrix, depending on the constraints and the distance/divergence functions used to do the factorisation. A generalisation of the operation given by Lee and Seung is shown in Equation 3.12 [27], where S is the original data set, WH is the result of the factorisation, and D is a measure of distance or divergence to be minimised between S and WH. In short, the factorisation seeks a representation of S that is much smaller than, but as close as possible to, S itself.
JNMF = D(S, WH) (3.12)
Different methods to do the factorisation have been developed to target different complexities of non-negative matrix factorisation. The issues can be broadly categorised into two groups: whether the factorisation converges, and how quickly it converges. Different algorithms have been developed to tackle each of these problems. One of the most troubling issues is the amount of time the algorithm usually takes to converge; NMF takes a long time to complete.
The NMF paper that stirred a lot of recent activity in NMF research uses Kullback-Leibler (KL) divergence as its distance metric [27]. There are other methods for measuring the distance D, including a Euclidean one using the Frobenius norm. For this research into shades of meaning, the KL method, considered the baseline for other implementations of NMF, is used. The KL-divergence function is described in Equation 3.13 [27].
F = Σ_{i=1}^{n} Σ_{µ=1}^{m} [S_{iµ} log(WH)_{iµ} − (WH)_{iµ}]   (3.13)
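The multiplicative update rules Lee and Seung derive for this KL objective can be sketched as follows. This is an illustrative numpy implementation under the standard update equations, not the publicly available implementation [34] actually used in the experiments; the small epsilon and iteration count are arbitrary choices for the sketch.

```python
import numpy as np

def nmf_kl(S, k, n_iter=300, eps=1e-9, seed=0):
    """Factorise a non-negative (n x m) matrix S into W (n x k) and
    H (k x m) using Lee and Seung's multiplicative updates for the
    KL divergence objective. A minimal sketch, not optimised."""
    rng = np.random.default_rng(seed)
    n, m = S.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        # H update: scale by W^T (S / WH), normalised by column sums of W.
        H *= (W.T @ (S / WH)) / (W.sum(axis=0)[:, None] + eps)
        WH = W @ H + eps
        # W update: scale by (S / WH) H^T, normalised by row sums of H.
        W *= ((S / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H
```

Because the updates are purely multiplicative and the factors start positive, W and H remain non-negative throughout, which is what enforces the NMF constraint.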
Because NMF produces matrices with no negative values, they intuitively make
more sense as something to compare strictly positive vectors to. SVD, on the other hand, creates eigenvectors that can contain negative values and in some cases lie entirely in the negative space. It will be interesting to see how NMF and SVD compare to each other in the results. By all accounts the vectors in the W matrix fit the intuition of what constitutes a shade of meaning, and this will be explored in the experimentation section of this research. Studies have shown that NMF can be described by a more general model of PCA, and moreover that NMF computed with KL divergence is equivalent to probabilistic Latent Semantic Analysis [7, 20]. This further supports the intuition that the column vectors of W may correspond to the SoM.
The NMF algorithm used in the experiments is publicly available [34].3 Because NMF is an iterative algorithm it can take a long time to converge. This is not a problem in this strand of research, but for an algorithm that shows such promise it is unfortunate that it can be prohibitively slow. The HAL matrix was stripped of the vectors for the key term and processed with the NMF algorithm to obtain the resulting matrices W and H. Each column vector of W was considered a shade of meaning.
3http://www.ma.utexas.edu/users/zmccoy/plsinmf.html
CHAPTER 4
Empirical Evaluation
The problem of evaluation is challenging for this area of research. With typical word sense disambiguation tasks there is a human-tagged golden standard corpus against which the results of a system can be tested. Unfortunately, for disambiguating SoM there is no such golden standard. This issue is also noted by Neill [37], who proposes several key challenges standing in the way of a successful evaluation methodology: choosing the right words to disambiguate, how to define a sense, and how many senses are inherent to each word. The challenge of finding SoM is thus more like the problem of sense induction, more complex than a typical set-and-run algorithm for information retrieval or word sense disambiguation. An appropriate way must be found to approximate using the system to disambiguate SoM. This includes choosing the words to find SoM for, and defining how to find shades and the number of them. The performance of this system then has to be evaluated and quantified.
4.1 Data
The data requirements for this thesis were a collection of traces: a trace is a phrase or sentence containing a word for which we are interested in computing the SoM.
(Examples of traces for “Reagan” are given in Figure 1.4.) The intuition here is that a semantic space can be built for a specific concept, rather than a generic one for a whole corpus. This semantic space holds many different aspects of a single concept encoded in a single matrix representation. Initially traces were selected for a few handpicked concepts from a chosen corpus, but this allowed for assessor bias, so a semiautomatic system for selecting traces had to be adopted. To find potential solutions, other related fields of research were reviewed for inspiration. For word sense disambiguation the established series of events and competitions are the SENSEVAL1 tests. The test task data are sentences built around a particular word, much like the training data used for the tests of SoM. Each sentence has a gold-standard tagged sense for the instance of that particular word in that particular sentence [23]. SENSEVAL systems are trained on the data and expected to disambiguate the sense of an instance of a word in a particular context. The data in the SENSEVAL corpora has proven excellent for testing early pilots of the various systems, and the algorithms mentioned here have given promising results.
Despite the SENSEVAL data being interesting to run the system on, it is not really representative of the problem of finding SoM. The SENSEVAL data has sub-senses listed as well as regular senses, and the sub-senses listed in the SENSEVAL golden standard dictionary do not necessarily map to a shade of meaning, mostly because the sub-senses are often defined by their syntactic effect on the words around them. Furthermore, some of the senses are overarching general senses which in reality could apply to many different SoM. As an example from the SENSEVAL dictionary, the noun sack is listed as having several main senses and various sub-senses within each of the main ones. The two major relevant senses are the sense related to “getting the sack”
1http://www.senseval.org/
and a sack like a “sack of things.” When decomposing our data into two shades, the two shades appear to represent each of these different senses, which seems reasonable. But when decomposing into five different SoM, instead of other senses appearing, similar senses are uncovered, associated with different contexts. So, for example, more sack shades are created, with the context of sacks of grain in one shade and sacks for garbage in another. This is a good sign for finding SoM, but not promising for evaluating against the SENSEVAL golden standard, which cares about senses, not shades. This is discussed further in the results section. This still leaves the problem of not having a good evaluation strategy to test the performance of the proposed systems. In this case it is suitable to try existing evaluation systems, despite their shortcomings, in conjunction with a new system based on the intuitive results that need to be quantified.
4.1.1 Word Selection
Word selection is an important part of the problem, primarily because there need to be some words that will clearly have SoM, as well as words that do not. What is meant by “clearly” is explained in the following.
Several words were selected from the SENSEVAL data: all of the nouns [23]. All the nouns were chosen so there could be no question about how they were selected. It is suspected that it would be possible to find shades of meaning around verbs and adjectives too, but since this is such an early strand of research that is left for future work. The words chosen from the SENSEVAL data were: [shirt, giant, rabbit, disability, behaviour, knee, excess, bet, sack, scrap, onion, promise, steering, float, accident]
In addition to the SENSEVAL words, several words were chosen from the Reuters-21578 [28] collection, from which a hand-crafted golden standard was created. The researcher and supervisor collaboratively selected words from the Reuters-21578 collection. These words were chosen so that there would be a good selection of words that should perform well and ones that more than likely will not. The words chosen from the collection are: [reagan, GATT, president, oil, economy, coffee]. These words show a spectrum of specificity, ranging from GATT (a very specific acronym) to general terms such as “oil” and “economy”.
If all occurrences of the chosen words in the Reuters-21578 collection were used, the resulting HAL space and vectors would have been extremely large and harder to process, especially with the SVD implementation available. Additionally, manually tagging all words from the corpus would have been impossible. To avoid this problem, only the occurrences of the words that appeared in titles were used. Titles are normally used to summarise the content of an article, so it was expected that they would give good conceptual information while not requiring so much data to be processed. Armed with a selection of words, it is now possible to look at the different stages of evaluation in this research.
4.1.2 Evaluation Data Sets
Both data sets fit a common format, which means similar tests can be run on both. Both the SENSEVAL data and the crafted data from the Reuters collection comprise words in context, generally a sentence or two. In the case of the SENSEVAL data the traces are a couple of sentences, longer than the simple headlines used for the Reuters data. For each chosen word there are a particular number of traces, and
the number of traces varies from term to term. Table 4.1 shows the number of traces for the data in the Reuters collection.
w          Number
coffee     114
economy    57
oil        416
GATT       35
president  67
Reagan     150
Table 4.1: Trace table
Reducing the data that is indexed to a subset of the original corpus really builds a very specific subspace, a locality of that particular term or concept. This is a powerful model because a global model of the text is not as important as a key concept and the different contexts surrounding it. A benefit of using this reduced data set is that it is much smaller and easier to process. The key point here is that the full text of each corpus is not used, which means higher-order relationships may not be modelled by the system very well. This is not a key topic of this research, though, and should not be an issue.
See Appendix A for some examples of text strings used in the experiments.
4.1.3 Human Tagging
The tagging that needed to be done manually was for the traces of the hand-crafted data from Reuters-21578.2 The data was tagged independently by both the researcher and supervisor, who then agreed on a common set of tags which were applied jointly to the data.
2The SENSEVAL traces are already tagged.
Some of the trials and tribulations associated with collecting human-tagged data are well documented by Kilgarriff and Rosenzweig [23]. Problems with human tagging include inter-assessor agreement and replicability. The requirement for a new set of golden standard data is definitely a weakness in the evaluation strategy that needs to be further investigated. Initially the shades were going to be tagged in addition to the traces, and a simple comparison of tags would have been the resulting accuracy measure. This was deemed insufficient for objectively evaluating the performance of the system, and a different method for evaluating performance was researched.
4.1.4 Preprocessing
There are several methods of preprocessing text that change the effectiveness of an information modelling system, including part-of-speech tagging and stemming. These are discussed briefly, simply to justify decisions made in this thesis.
Part of Speech Tagging
Part-of-speech (POS) tagging refers to an automated algorithm for pre-processing text that tags each word with a syntactic type [33]. It seems that adding syntactic structure to indexed corpora adds no benefit with respect to adding meaning or resolving ambiguity in an information retrieval system [43]. Though, as noted by Sanderson, the use of such syntactic information to derive semantic information may be useful. Further investigations into the use of POS tagging in semantic space models by Burgess [8] and Gärdenfors [19] suggest that POS tagging had no significant effect on their work. Pinker [40, p82], in his definition of “mentalese”, points out that a concept as represented in the consciousness of the human mind is devoid of conversation-specific words,
constructions, information about pronouncing words, and the order that words are in. While it is questionable whether this is completely true, it lends support to the use of semantic space models without POS tagging. Relating POS tagging back to the “Reagan” example, the part of speech for Reagan will be the same in nearly every context it is found in: it is a proper noun everywhere!3
Stop Words
Stop words are almost universally dropped from semantic models. Burgess and Gärdenfors are primarily concerned with the cognitive perspective of modelling language while still keeping operational mechanisms in mind. Burgess asserts that humans do not remember sentences because of their syntactic structure; they actually remember the concept representing the meaning of the sentence. Principles such as structural semantics take syntax into consideration when analysing the meaning of a sentence or body of text [8, 25]; in fact, “Chomskian” linguistics puts a heavy emphasis on punctuation and other syntactic markers [19]. Gärdenfors claims that syntax is only useful “for the subtlest of aspects of communication”.
Stemming
Stemming refers to the process of reducing words to their “stems”: “banking” to “bank”; “flooded”, “flooder” and “flooding” to “flood.” This reduces a series of terms to a single common concept [2, p168]. It reduces the complexity of the representation in terms of the number of terms, but it also makes term representations heavy with many different contexts, which is the exact problem this research seeks to resolve.
3Unless of course someone turns it into a verb for some reason. Let’s hope that never happens. :)
There are different levels of stemming available,4 different types of stemming, and often stems of words are not lexically the same (“brought”/“bring”) [43]. Thus stemming can be a very simple or a complex option. It is generally accepted that heavy word stemming can increase IR performance, but at the same time it introduces ambiguity. An example is “training” and “train”: the stems are identical, but each word has a very distinct meaning. It is noted that semantics researchers often avoid stemming altogether [51, 15]. Baeza-Yates, Ribeiro-Neto and Ziviani mention that stemming has been a controversial topic in the field of information retrieval, and that large-scale studies have often given inconclusive results [2, p168]. Pinker also places emphasis on the extra meaning that affixes give stems [40, p133]. Considering that ambiguity is introduced with stemming rather than resolved, and that both the linguistic and statistical sides of the discussion agree its value is questionable, it would be wise not to include stemming as a preprocessing task in semiotic-cognitive analysis of aspects. Stemming should not be completely discounted, though: there is one situation where introducing ambiguity via stemming is not a problem, namely when it can be effectively disambiguated later. In that case a reduced-complexity model exists that makes processing faster without losing the subtle nuances and disambiguation provided by term affixes. Given a way to disambiguate the different aspects around a term, stemming can be used to allow for the retrieval of more aspects that can then be effectively disambiguated into more potential SoM, adding to the existing shades around a concept.
4http://www.comp.lancs.ac.uk/computing/research/stemming/Links/weight.htm
4.2 Method
This section will discuss the different stages that were used for the creation and evalu-
ation of SoM. Additionally it will look at how to build semantic spaces via HAL, then
find shades of meaning in those spaces. Then finally look at how to create a vectorial
representation of the traces for comparison
4.2.1 Building the HAL Model
The semantic space, or HAL space, for the experiments was built using the HAL method outlined in Section 2.1.1 and Equation 2.1, with the parameter l set to 10. Earlier pilots were run with l = 5 and l = 15; on most data there was a noticeable decrease in performance for l = 5, and either a slight increase or decrease for l = 15 depending on the data set. l = 10 was chosen as an effective compromise between the overly large, heavy and often redundant l = 15 and the under-performing l = 5. During HAL processing, stop words and punctuation were filtered, and HAL was run over the resulting traces. Each trace was treated as a document, so HAL did not run across traces. A single HAL space, being the aggregation of all the traces, was built. Each word in the system has a row and column in the symmetric HAL space. (See Section 2.1.1.) Each of these rows is a sum of all the contexts that word has been found in: a weighted sum of all the words found around it. It is essentially these contexts that we would like to uncover in the word’s vector. In later calculations the HAL semantic space for a word w’s data set is denoted Sw.
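The construction described above can be sketched roughly as follows. The exact weighting scheme is given by Equation 2.1 in Section 2.1.1; the linear ramp l − d + 1 used here is an assumed stand-in, and stop-word filtering is presumed to have already been applied to the traces.

```python
import numpy as np

def build_hal(traces, l=10):
    """Build a symmetric HAL-style term-by-term matrix over traces.

    Each trace is a list of tokens. Co-occurrences within a sliding
    window of length l are weighted by l - d + 1, where d is the word
    distance (an assumed weighting; see Equation 2.1 for the thesis's
    actual scheme). Windows do not cross trace boundaries, since each
    trace is treated as a document.
    """
    vocab = sorted({w for t in traces for w in t})
    idx = {w: i for i, w in enumerate(vocab)}
    S = np.zeros((len(vocab), len(vocab)))
    for trace in traces:
        for i, w in enumerate(trace):
            for d in range(1, l + 1):
                if i + d >= len(trace):
                    break
                v = trace[i + d]
                weight = l - d + 1
                S[idx[w], idx[v]] += weight
                S[idx[v], idx[w]] += weight  # aggregate into a symmetric space
    return S, vocab
```

Each row of S is then the weighted sum of all contexts its word has occurred in, exactly the representation the rest of the method operates on.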
The HAL space created from a set of traces looks very different to a HAL space built over a regular document or set of documents. Because every trace contains the word of interest, the HAL vector for the term in question is very large, in that it contains a dimension for nearly every other term in the system. By way of example, on the first 100 Reuters-21578 [28] traces the HAL vector for Reagan contains non-zero dimensions for 274 other terms in the system, whereas the vector for Rome contains only five non-zero dimensions. President, presumably a very closely associated concept to Reagan, has a mere 49 non-zero dimensions. If these HAL vectors were represented as beams of light in a three-dimensional space, the Reagan beam would be so bright, and intrude so much on the space of the other beams, that all the other beams shining in the space would be very hard to see. The whole space is nearly defined by this single vector. This implies that the other vectors might as well all be equivalent when faced with the enormity of the vector looming over them. For this reason the vectors for the term in question (like “reagan”) were removed before the matrix decomposition was performed, to allow the other, weaker vectors to shine through. This means removing both the column and the row for “reagan.”
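Removing the focus term’s row and column can be sketched as below, assuming a vocabulary list kept parallel to the matrix axes (the `vocab` argument is an illustrative convention, not part of the original implementation).

```python
import numpy as np

def strip_term(S, vocab, term):
    """Remove the row and column for `term` from the symmetric
    term-by-term matrix S, returning the reduced matrix and the
    reduced vocabulary."""
    i = vocab.index(term)
    keep = [j for j in range(len(vocab)) if j != i]
    # np.ix_ selects the cross product of the kept rows and columns.
    return S[np.ix_(keep, keep)], [vocab[j] for j in keep]
```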
Traces are the building blocks for this thesis and are used to build the semantic space around a word, from which its SoM are computed. To find out the effectiveness of the SoM systems, a two-stage evaluation process is proposed. By using existing evaluation systems that may be lacking for this task, it may be possible to find out what SoM are not; then, by using the new system of evaluating SoM, it may be possible to define what SoM are. The first stage is to run the systems on the tests from SENSEVAL and first justify the results with an established method. The second stage is to create a SENSEVAL-like test composed of tagged traces with various SoM. The systems are then run on this new data to give a structured, repeatable and intuitive strategy for assessing the quality of each method. These stages are explained in the next few subsections, along with an overview of the word selection discussions. Before looking at evaluation further, there are some issues to resolve around which words to use for evaluation.
4.2.2 Computing the Shades of Meaning
Finding the shades of meaning is the most critical part of this research. Several different methods have shown promise for finding shades of meaning; each was run on the term × term matrix generated in the previous section. The required result was a set of shade vectors, each of which would represent the centre of a shade of meaning, and thus the best representation of its meaning. These vectors are used to represent shades of meaning in the rest of the experimentation.
The calculation for finding the shades of meaning is different for each algorithm, so to show that a set of computational shades CSw is found from a term × term matrix Sw, a generic equation is given. For a system SoM that reduces Sw into k shades, Equation 4.1 applies. For example, for the adaption of the NMF system the equation would read CSw = NMF(Sw, k).
CSw = SoM(Sw, k) (4.1)
When using the trace browser to analyse the results coming out of the different shade-finding algorithms, it became clear that the “bright light” problem mentioned in the previous section (4.2.1) was causing issues with the shades being uncovered. As mentioned, to avoid this problem the vectors for the term in question were removed from the term × term matrix, leaving only the vectors of the words surrounding the term itself. This allowed the more subtle meanings of the contexts to be used effectively to disambiguate traces. It is assumed that the term in question is highly related to the traces, since it appeared in all of them and all of them were about that concept. What is of more interest is disambiguating the subtle contexts around the concept to better understand what its key components are.
4.2.3 Representing the Traces
To evaluate the relationship between a SoM represented as a vector (s ∈ CSw) and a given trace (t ∈ Tw), a vectorial representation ~t of the trace is needed in the same space as s, which can then be used quantitatively to find the distance between s and ~t. To find the general meaning of a trace, a centroid vector of the terms it is composed of is built. To do this, each term’s vector is sourced from the term × term matrix Sw and the vectors are averaged. This gives the centroid meaning of the trace as the system knows it. The HAL vectors for each term are given by Equation 2.1. It is then possible to find the vector for a single trace in the data set using Equation 4.2. A trace’s words are the list t, with each individual word being t[i], where i ranges over the length of t. In the following, Sw,t[i] is the HAL vector corresponding to the ith word in t.
~t = (1/|t|) Σ_{i=1}^{|t|} Sw,t[i]   (4.2)
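Equation 4.2 can be sketched directly. The `focus` parameter anticipates the removal of the focus term described below; skipping words absent from the space is an added robustness assumption, not part of the equation.

```python
import numpy as np

def trace_centroid(S, idx, trace, focus=None):
    """Centroid vector for a trace (Equation 4.2): the average of the
    HAL row vectors of the trace's words. The focus term w is dropped
    first, as described in Section 4.2.3."""
    words = [w for w in trace if w != focus and w in idx]
    if not words:
        return np.zeros(S.shape[1])
    return sum(S[idx[w]] for w in words) / len(words)
```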
As mentioned in the previous section, there is a problem with keeping the main term’s vector in the space when doing comparisons. This also affected the building of centroid representations of the traces. If there is a dominant vector present in all the traces, they all become a lot more similar when viewed from a distance, because a single dominant vector accounts for most of the cluster’s meaning and becomes the real centroid. This would skew a vector pointing in one particular direction off to another, different direction, which could confound the meaning intended to be captured in a SoM. To avoid this problem the focus term w was removed from each trace when building its centroid.
4.2.4 Comparing SoM to Traces
After calculating centroid vectors for each trace, there is a set of vectors for the traces and a set of vectors for the shades, represented in the same vector space, enabling a comparison between the two. It is now possible to compare traces to shades and see which shade most accurately describes a trace. The method used to find the similarity between two vectors in a linear space is the cosine coefficient, or cosine similarity. The cosine similarity between two vectors is a value between -1 and 1; in practice values below zero do not occur very often, so essentially there is a value between zero and one, where zero means “not at all alike” and one means “the same” [53]. The problem this leaves is that there is only a value between zero and one to describe the relationship between the two vectors. It would be ideal to tag each shade with what fits human intuition, and also tag each trace; when a trace’s most closely related shade is found, the tags would be compared to see whether they match. If they do, that is a right answer for the system. A reliable way to tag the traces and shades is needed though, something that is not possible within the scope of this thesis. Instead, the relationship between a shade and its related traces is used to build a model of the overall picture. (See Section 4.3.3.)
To find the most closely related shade for a trace, the shade that yields the highest cosine similarity to the trace’s centroid is found. The shade sj ∈ CSw associated with a trace centroid ~ti from ~Tw (the set of trace centroids) is given by Equation 4.3.
simij = arg max_j (sj · ~ti) / (‖sj‖ ‖~ti‖)   (4.3)
The similarity simij is the cosine similarity for the shade that yields the highest value against the trace centroid in question. These similarities, and their association with a particular trace, allow the most highly related shade to be found for each trace.
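The assignment in Equation 4.3 can be sketched as a straightforward argmax over cosine similarities:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 for a zero vector."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 0.0 if denom == 0 else float(np.dot(a, b) / denom)

def nearest_shade(shades, trace_vec):
    """Return (index, similarity) of the shade in `shades` with the
    highest cosine similarity to the trace centroid (Equation 4.3)."""
    sims = [cosine(s, trace_vec) for s in shades]
    j = int(np.argmax(sims))
    return j, sims[j]
```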
4.3 Evaluation Metrics
The final data produced by the experiment was a set of values indicating the strength of the relationship between a trace and its most related shade. From this data two methods of evaluating the results are possible; these are discussed in the following sections.
4.3.1 Purity and Normalised Mutual Information
Purity and Normalised Mutual Information (NMI) treat traces grouped around a certain SoM as clusters, and the hand-tagged traces as classes. In this situation the evaluation task can be treated as a clustering problem [56, 31]. By using a golden standard of traces assigned to classes it is possible to use traditional clustering evaluation techniques to see how well the SoM represent the perceived classes in the data.
Computing the purity of a cluster entails assigning to it the class which occurs most often in the cluster. Purity is then calculated by finding the number of traces assigned to the right cluster and dividing by the total number of traces. In the following equations Ω is the set of clusters of traces grouped around computational shades and C is the set of classes of hand-tagged data.
purity(Ω, C) = (1/N) Σ_k max_j |ωk ∩ cj|   (4.4)
The problem with purity is that as the number of clusters approaches the number of classes, the purity score continues to improve until it reaches a perfect score when k = n. NMI is suggested as an improvement on purity, balancing the clustering quality against the number of clusters [31, 56].
NMI(Ω, C) = I(Ω; C) / ([H(Ω) + H(C)] / 2)   (4.5)
The NMI is essentially the mutual information divided by the average entropy of the clusters and the classes. The mutual information itself increases as the number of clusters increases, giving a similar situation to purity, where maximum information sharing occurs when k = n. Dividing by the average entropy of the classes and clusters means the number of clusters does not bias the final score. Because of this, NMI can be used to compare clusterings of different sizes, which addresses the monotonic increase in clustering performance often seen with other evaluation strategies.
In the context of traces and SoM, we will cluster the traces around the SoM and evaluate the performance using NMI.
4.3.2 Confusion Matrices
To find out how well each system performed at classifying traces into shade categories, a system is required that first gives an overall picture of the classifications, then quantifies how closely the classifications match the golden standard. Neill uses
acid iran tax highway economy
[per-shade trace counts for rows 0–9 not legible in this transcript]
Table 4.2: A reduced example of the “Reagan” confusion matrix at k = 10.
confusion matrices in combination with conditional entropy to create a meaningful measure of how his word sense induction system performs.
In artificial intelligence and machine learning confusion matrices are used to determine
how well a system has performed at a classification task. These confusion matrices
normally have an axis for answer classes and an axis for test classes. In the case of
the shades of meaning system the answer classes are the golden standard tags used for
traces and the test classes are the shades of meaning the system has created to represent
the data set. A confusion matrix is denoted C, or Cwk since there is a confusion matrix
for every w and k combination.
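Assuming each trace has been assigned to exactly one shade and carries exactly one golden-standard tag, a confusion matrix in the orientation of Table 4.2 (rows are shades, columns are tags) can be built as follows. The function name and argument shapes are hypothetical, chosen only to illustrate the construction.

```python
def confusion_matrix(shade_of, tag_of, k, tags):
    """Build C_wk: rows are the k computational shades, columns are the
    golden-standard tags; cell [i][j] counts the traces assigned to
    shade i that carry tag j."""
    m = [[0] * len(tags) for _ in range(k)]
    col = {t: j for j, t in enumerate(tags)}
    for trace, shade in shade_of.items():
        m[shade][col[tag_of[trace]]] += 1
    return m
```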
The confusion matrix in Table 4.2 is for a reduced version of the “Reagan” data set when decomposed into k = 10 shades of meaning by the Non-negative Matrix Factorisation algorithm. Along the top are the various tags used to tag the traces extracted from the Reuters-21578 corpus. Down the left-hand side are the numbered shades of meaning automatically generated by the algorithm under test. Note that k = |CSw|, so the height of the table is the number of computational shades found by the system. The numbers in the confusion
matrix represent the number of traces tagged with a particular tag that are most closely related to a particular shade. For example, at rank ten shades, the system perfectly groups all 13 traces tagged “highway.” The “highway” tag was used to tag all traces related to the multibillion-dollar highway bill vetoed by Reagan. These traces have a clearly defined scope around the corpus occurrences of “Reagan”. A less well defined scope is the “economy” tag, which is spread over nearly all shades of meaning, but still has the largest number of traces on the row where, incidentally, the majority of the “tax” traces were categorised. Since tax and the economy are closely related concepts, this makes sense too.
The spread of values in the confusion matrix can be used to gain an at-a-glance overview of how well the matrix method has performed. How tightly traces are clustered for a single tag is clearly important, as for “acid”, “tax” and “highway,” but clustering within rows matters too: more traces appearing in a single column of a row is better than being spread across it. A confusion matrix where tight groupings occur in both rows and columns performs best in the evaluation and fits the intuition that SoM can accurately represent the different shades of meaning. For more discussion of the results, with further tables, see the sections for each tested algorithm in Section 5.
4.3.3 Conditional Entropy
Conditional entropy, as used by Neill, is a good measure of how effectively the system performs because it takes the whole distribution of answer and test results into consideration. This means comparing the distribution of manually tagged traces to the classification of traces into shades induced by the systems
under test. The confusion matrices discussed in the previous section can be used to calculate the conditional entropy of a tag distribution compared to the induced test distribution. To calculate the entropy of the tag classifications, the formula given by Neill was used, where i varies over the tag classes and P(i) denotes the probability that a trace will be manually assigned to tag i:
H(i) = −Σ_i P(i) log2 P(i) (4.6)
Ignoring test results for now, a distribution of traces almost all classified with one tag has a low entropy, whereas a distribution of traces evenly mixed between several trace tags has a high entropy. High entropy means heavy mixing between different trace tags. For example, if there were 100 "reagan" traces, 50 of which were classified as "iran-contra" and the other 50 as "missile-deal", the entropy would be high because there is a large amount of mixing, or uncertainty, about which tag any one trace would receive.
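The 50/50 example above comes out at exactly one bit under Equation 4.6. A minimal sketch, assuming tags are given as a flat list, one per trace (the name `tag_entropy` is mine):

```python
import math
from collections import Counter

def tag_entropy(tags):
    """H(i) per Equation 4.6, over the empirical tag distribution."""
    n = len(tags)
    return -sum((m / n) * math.log2(m / n) for m in Counter(tags).values())
```

A single-tag distribution gives zero entropy; an even two-tag split gives one bit.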
To find the performance of the shade test distributions, the percentage decrease in entropy must be calculated: subtracting the conditional entropy of the induced test distribution from the entropy of the tag distribution gives the reduction in entropy attained by the system under test. To find the conditional entropy, the formula given in Equation 4.7 was used, where P(i, j) is the probability of finding a trace in a particular tag and shade class (P(i, j) = C_wk[i, j] / T_w).
H(i|j) = −Σ_i Σ_j P(i, j) log2 P(i|j) (4.7)
Conditional entropy looks at how mixed the test distributions are as compared to the
answer distributions. So if there are two trace tags, "missile-deal" and "iran-contra" as
4.3. EVALUATION METRICS 57
mentioned earlier, each with 50 associated traces, then if the test system assigns all 100 traces to a single shade it has performed as badly as possible: the answer classes are completely mixed within that shade, the conditional entropy equals the answer entropy, and the reduction is zero. Whereas if the system assigns the 50 "missile-deal" traces to one shade and the 50 "iran-contra" traces to another, no mixing has occurred: the conditional entropy is zero, and subtracting it from the answer's entropy yields a 100% reduction, meaning the system has performed perfectly. More generally, the larger the gap between the answer entropy and the conditional entropy, the better the system has performed. The percentage decrease in entropy is given by Neill in Equation 4.8.
[H(i) − H(i|j)] / H(i) × 100% (4.8)
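Equations 4.6–4.8 combine into a single routine over a confusion matrix. The sketch below assumes the orientation of Table 4.2 (rows are shades j, columns are tags i) and is illustrative rather than the thesis implementation:

```python
import math

def entropy_reduction(C):
    """Percentage reduction in entropy per Equation 4.8, from a confusion
    matrix C whose rows are shades (j) and columns are tags (i)."""
    n = sum(sum(row) for row in C)
    # Marginal tag distribution P(i): column sums over all shades.
    col = [sum(row[j] for row in C) / n for j in range(len(C[0]))]
    h_i = -sum(p * math.log2(p) for p in col if p)       # H(i), Eq. 4.6
    # Conditional entropy H(i|j), Eq. 4.7: P(i,j) = C[j][i] / n,
    # P(i|j) = C[j][i] / (row sum of shade j).
    h_cond = 0.0
    for row in C:
        rn = sum(row)
        if rn:
            h_cond -= sum((m / n) * math.log2(m / rn) for m in row if m)
    return 100.0 * (h_i - h_cond) / h_i                  # Eq. 4.8
```

A perfect separation of two equal tags over two shades gives a 100% reduction; collapsing everything onto one shade gives 0%.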
As one of the more drastic examples, see Table 4.3 (pg. 58). The percentage reduction in entropy for the NMF result was 19.3, while the percentage reduction for the SVD result was 2.07. At first glance the distribution in the NMF table does not look brilliant, but compared to the SVD table it has clearly done a better job of distinguishing between the different SoM: there is less spreading of traces amongst columns and rows. The fact that the “spring” sense of onion and the “veg” sense appear on the same row for NMF seems to be offset by the fact that for each tag the traces are grouped quite well. It can be concluded that the “spring” sense of the word is closely related to the “veg” sense, which is indeed the case, whereas no such conclusions can be drawn from the SVD table. More discussion of results like these is given in Section 5.
(a) NMF
basil veg plant spring
[per-shade trace counts for rows 0–4 not legible in this transcript]
(b) SVD
basil veg plant spring
[per-shade trace counts for rows 0–4 not legible in this transcript]
Table 4.3: “onion” at k = 5
CHAPTER 5
Results
Overall the results from this thesis are encouraging. In this chapter discussion revolves around the results with respect to the methods outlined in Section 4.3: normalised mutual information and conditional entropy. After that the conditional entropy results are discussed in more depth, with confusion matrices displayed to help with the analysis. For the full set of confusion matrices, see Appendix A.
It is clear from the results that SVD is an effective method for finding shades of meaning in a concept. There are often clear groupings of traces assigned to a particular tag and shade. Unfortunately, these groupings often sit on the same row out of all of the shades. This fits the understanding that SVD's eigenvectors and singular values seek to account for the largest variation in the original matrix within the leading singular values and vectors. Non-negative matrix factorisation consistently gave the best results. As can be seen from the example tables given here, NMF keeps clusters of data in single rows and columns far more effectively than SVD.
Of the evaluation methods, NMI seemed to be the most indicative of the effectiveness of the systems under test. The reasons for this are discussed below.
5.1 Normalised Mutual Information
The overall NMI results are presented as a percentage improvement over a baseline score. The most notable feature of the overall improvements listed in Table 5.1 is the difference between the SENSEVAL and Reuters results. The most logical conclusion that can be drawn from the difference between the two sets of data is that SENSEVAL traces are not classified based on SoM as defined by this thesis, but rather by the sense of the word of interest. While there may be some overlap between senses and semantic meaning, these results strongly suggest that SoM calculation is a different class of problem from sense disambiguation.
Data + Method   5      10     20     30     Overall
Reut:SVD        8.34   13.16  18.60  19.98  15.02
Reut:NMF        15.48  19.19  14.36  23.06  18.02
Reuters         12.54  14.41  14.49  19.81  15.62
SENS:SVD        5.35   3.93   6.37   4.51   5.04
SENS:NMF        5.05   4.97   7.62   5.40   5.76
SENSEVAL        5.18   4.31   6.24   4.39   5.40
Table 5.1: Average overall improvements in NMI.
The NMI results also show from an early stage that a higher value of k does not always mean a better division of SoM. The SoM for the Reuters data set were chosen to map to the human cognitive understanding of the different aspects of a concept. Thus, despite higher values of k giving a more fine-grained perspective on the concept, it may not always be cognitively economical to apply such a fine-grained conceptual filter. Between the systems under test, NMF gives the best overall improvements over the baseline.
The baseline for NMI is calculated by computing the NMI of clusters generated by simply iterating over the traces and placing them one at a time into each cluster in turn.
term      method  IK, B, nmi (% incr)      B, 5                  B, 10                 B, 20                 B, 30
coffee    nmf     9, 0.15, 0.31 (16.51%)   0.08, 0.17 (8.95%)    0.12, 0.37 (24.65%)   0.20, 0.39 (18.51%)   0.27, 0.40 (13.45%)
coffee    svd     9, 0.15, 0.28 (12.92%)   0.08, 0.29 (20.64%)   0.12, 0.29 (16.69%)   0.20, 0.33 (12.97%)   0.27, 0.38 (10.75%)
oil       nmf     40, 0.30, 0.53 (23.02%)  0.07, 0.29 (22.05%)   0.13, 0.40 (27.24%)   0.20, 0.47 (26.45%)   0.26, 0.51 (24.72%)
oil       svd     40, 0.30, 0.47 (17.51%)  0.07, 0.24 (16.21%)   0.13, 0.34 (21.26%)   0.20, 0.44 (23.59%)   0.26, 0.47 (20.79%)
gatt      nmf     7, 0.26, 0.44 (17.93%)   0.19, 0.37 (18.26%)   0.31, 0.57 (26.41%)   0.47, 0.62 (15.55%)   0.60, 0.68 (7.59%)
gatt      svd     7, 0.26, 0.50 (23.96%)   0.19, 0.45 (26.15%)   0.31, 0.55 (24.17%)   0.47, 0.56 (9.02%)    0.60, 0.56 (-4.62%)
reagan    nmf     21, 0.31, 0.62 (30.58%)  0.13, 0.33 (20.79%)   0.22, 0.49 (27.27%)   0.33, 0.60 (26.99%)   0.40, 0.64 (24.67%)
reagan    svd     21, 0.31, 0.53 (21.56%)  0.13, 0.32 (19.51%)   0.22, 0.42 (20.28%)   0.33, 0.52 (18.80%)   0.40, 0.55 (15.87%)
president nmf     9, 0.27, 0.37 (10.27%)   0.19, 0.26 (6.56%)    0.24, 0.40 (16.61%)   0.31, 0.52 (20.30%)   0.38, 0.55 (17.71%)
president svd     9, 0.27, 0.44 (17.62%)   0.19, 0.42 (23.00%)   0.24, 0.44 (20.24%)   0.31, 0.45 (13.68%)   0.38, 0.50 (12.12%)
economy   nmf     25, 0.64, 0.73 (8.95%)   0.34, 0.44 (9.53%)    0.47, 0.63 (16.20%)   0.60, 0.68 (7.36%)    0.67, 0.72 (4.76%)
economy   svd     25, 0.64, 0.64 (0.12%)   0.34, 0.49 (14.39%)   0.47, 0.56 (8.97%)    0.60, 0.61 (0.91%)    0.67, 0.62 (-4.89%)
Table 5.2: Reuters NMI results. IK is the ideal value for k. B is the baseline NMI value. The percentages detail improvements over the baseline.
Essentially this is what could be achieved by simply splitting the traces amongst the clusters; it represents the capacity of a small child to assign items to classes at random. This means that the baseline is different for every value of k.
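The baseline assignment described above can be sketched as a round-robin split; the function name is hypothetical, and the thesis does not give the exact iteration order, so this is one plausible reading.

```python
def baseline_clusters(traces, k):
    """The NMI baseline: deal traces out one at a time into each of the
    k clusters in turn, with no regard for their content."""
    clusters = [[] for _ in range(k)]
    for t, trace in enumerate(traces):
        clusters[t % k].append(trace)
    return clusters
```

The NMI of this clustering against the golden standard then serves as the per-k baseline B in Table 5.2.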
As can be seen in Table 5.2, there is a wide variety of results depending on the value of k. For “Reagan” at k = 21 with NMF we see the best result for that term. This matches the intuition that the different classes of traces selected most accurately match the SoM in the data. Notably, the NMI value for k = 21 is higher than that for k = 30, showing that a higher value of k does not necessarily mean a better result for representing SoM. In contrast, for “economy” the best results occur at k = 10, not at k = 25, which was chosen as the ideal value of k. “Economy” is also an example of a negative improvement over the baseline: for SVD at k = 30 the generated shades actually perform worse than the baseline. Since “economy” has nearly 70 traces and k = 30, it would appear that the baseline method of simply assigning traces to clusters has chanced upon a surprisingly accurate clustering of the traces. To achieve a higher NMI value there needs to be very little entropy in the guessed clusters and answer classes, or a lot of shared information between them.
The “oil” data set was the largest out of the Reuters corpus examples. It also shows
among the most consistent improvements in NMI. This is because with a larger data set it is harder to fluke a good clustering of traces around SoM.
5.2 Conditional Entropy
The conditional entropy results are presented as an alternative perspective on quantifying the performance of the different systems. As discussed, conditional entropy provides a way to tell how well the test distribution of traces to shades matches the answer distribution of traces to tags. The overall averages for each method on each data set are shown in Table 5.3. Despite some anomalies, it is safe to say that on the whole NMF performed better than SVD at reducing the entropy in the data sets.
Data + Method   5      10     20     30     Overall
Reut:SVD        31.71  43.32  53.72  59.14  46.97
Reut:NMF        27.52  49.42  64.59  74.35  53.97
Reuters         29.61  46.37  59.15  66.74  50.46
SENS:SVD        11.88  20.84  30.15  37.85  25.18
SENS:NMF        14.53  26.37  36.23  45.57  30.67
SENSEVAL        13.20  23.60  33.19  41.71  27.78
Table 5.3: Average overall reductions in entropy.
When comparing NMF to SVD on the Reuters data set there are several instances where SVD performs better than NMF: “coffee” at k = 5, “gatt” at k = 5 and “reagan” at k = 5. For the “coffee” case, consider Table 5.4, the confusion matrix for a run of the algorithm: there is a subtly better grouping of traces for SVD than for NMF. This can be seen in the “growth” column, where SVD groups traces together more. Additionally, NMF has put traces from more tags onto shade 1, which significantly hinders the reduction in entropy.
The entropy decrease results for the SENSEVAL data are in Table A.2 on Page 76. In this
table it can be seen that SVD outperforms NMF on more occasions than in the Reuters data, but not often enough to steal the crown in the overall scores. An example of SVD performing better is shown in Table 5.6 on Page 67, where the columns “sharesact” and “rod” show better grouping of traces for SVD than for NMF.
(a) NMF, the reduction in entropy is 17.22%.
decline marketing make quota rain growth export org import
[per-shade trace counts for rows 0–4 not legible in this transcript]
(b) SVD, the reduction in entropy is 27.50%.
decline marketing make quota rain growth export org import
[per-shade trace counts for rows 0–4 not legible in this transcript]
Table 5.4: “coffee” at k = 5. An example of SVD performing better than NMF in a run of the algorithm.
The first key point when looking at Table 5.5 (pg. 64), also raised by Neill in his work, is that the results improve as the number of shades increases. While this happens for the values of k documented above, it does not happen for every test case. For example, the “accident” term from the SENSEVAL data processed with SVD exhibits non-monotonic changes as k increases: at k = 2 the conditional entropy figure is 3.66, at k = 3 it is 10.27, and at k = 5 it goes back down to 7.52.
To explain these results we must look at the conditional entropy metric in more detail. If entropy is a measure of the amount of mixing between classes and conditional entropy
term      method  B      IK, ent (% decr)    5               10              20              30
coffee    nmf     3.51   9: 1.58 (35.45%)    2.04 (16.93%)   1.39 (43.16%)   1.19 (51.28%)   1.03 (57.88%)
coffee    svd     3.51   9: 1.71 (30.40%)    1.78 (27.50%)   1.66 (32.30%)   1.43 (41.76%)   1.23 (49.69%)
oil       nmf     0.96   40: 1.82 (56.43%)   3.24 (22.48%)   2.72 (34.84%)   2.26 (45.82%)   2.00 (52.26%)
oil       svd     0.96   40: 2.16 (48.37%)   3.47 (17.07%)   3.01 (28.04%)   2.43 (41.84%)   2.23 (46.66%)
gatt      nmf     11.43  7: 1.31 (46.05%)    1.54 (36.41%)   0.81 (66.52%)   0.45 (81.52%)   0.06 (97.64%)
gatt      svd     11.43  7: 1.23 (49.20%)    1.43 (41.11%)   0.95 (60.62%)   0.80 (66.93%)   0.80 (66.93%)
reagan    nmf     2.67   21: 1.35 (64.77%)   2.82 (26.49%)   2.12 (44.67%)   1.45 (62.35%)   1.13 (70.66%)
reagan    svd     2.67   21: 1.86 (51.44%)   2.90 (24.56%)   2.43 (36.66%)   1.92 (49.96%)   1.65 (57.04%)
president nmf     5.97   9: 1.06 (45.89%)    1.41 (27.54%)   0.98 (49.53%)   0.48 (75.26%)   0.26 (86.61%)
president svd     5.97   9: 0.94 (51.80%)    1.13 (42.28%)   0.92 (53.00%)   0.76 (61.25%)   0.58 (70.49%)
economy   nmf     7.02   25: 0.78 (79.40%)   2.46 (35.25%)   1.60 (57.79%)   1.09 (71.32%)   0.72 (81.07%)
economy   svd     7.02   25: 1.33 (65.00%)   2.37 (37.75%)   1.93 (49.31%)   1.50 (60.56%)   1.37 (64.03%)
Table 5.5: Reuters-21578 conditional entropy results
is the amount of mixing between answer and test classes the results observed fit the
description given by Neill.
For the “reagan” example, suppose we have 100 traces with 2 tags (“iran” has 50, “tax” has 50) and k = 3 (SOM1, SOM2, SOM3). If the algorithm splits the two sets of traces 50/50 over SOM1 and SOM2, there is a 100% decrease in entropy, which fits the intuition. Also, if the algorithm splits the two sets of traces 50/25/25 over SOM1, SOM2 and SOM3, the system still performs perfectly, because there has been no mixing of answer classes.
The intuition here is that the human taggers have missed a significant subset of meaning in the data. Consider the situation of adding more SOM to the system by increasing the value of k. Unless this causes the algorithm to start mixing traces of different classes together, the system will maintain a 100% reduction in entropy. To put it another way, if there is a larger amount of mixing and adding more SOM allows the system to better ‘unmix’ the mixed traces, the conditional entropy will decrease. That is, when traces from two classes are assigned to the same SOM and adding more SOM allows them to be more effectively split apart, the conditional entropy will decrease. Looking through the confusion matrices from the experiments, we see that this is exactly what is happening. So, unlike NMI, conditional entropy does not normalise the results when
there are larger values of k.
As for Neill's results, he uses lower values of k (2, 3, 5, 10), and the seed selection process is an extremely important part of his research, especially considering the complexity of his model. Taking k = 2 and k = 3 for example, it is possible that the seed selection process could select much better seeds for k = 2 and then choose sub-par seeds for k = 3. At such low values of k, if one bad seed is chosen, one half or one third of the traces could be very poorly classified by the system. When the system does make a bad seed selection, the proportion of erroneously classified shades is much higher than in systems where k is higher; thus the conditional entropy will be much higher and the reduction in entropy will be lower. Neill also notes that sometimes the performance does not increase as the number of seeds increases, which he attributes to the randomness of the seed selection process. In the pilot tests of this research the random seed problem was noticed with various clustering algorithms too; for this reason concept indexing was not included in the final experimentation.
One might expect that, because two repeated runs of NMF or SVD with the same implementation give the same results, randomness like that observed by Neill simply would not occur. Unfortunately this is not the case: it was observed that NMF does not give consistent results like the SVD algorithm. Most of the time it performed better than SVD, but sometimes it performed worse, not by a lot, but frequently enough to be noticeable. This makes sense, since each time NMF is run it is initialised with random data. It is said to have converged when, for a given iteration, the expression KL > thres × oldKL is true, where KL is the current KL
divergence, oldKL is the divergence from the previous iteration, and thres = 0.9999. This means the random initialisation causes a different factorisation each time, and as soon as the above expression is true the factorisation is considered complete. This behaviour is noted in all the NMF research reviewed [3, 27, 55].
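The procedure described above can be sketched with Lee and Seung's multiplicative updates for KL-divergence NMF. This is not the thesis implementation: the initialisation, epsilon guards and iteration cap are my own choices; only the stopping rule KL > thres × oldKL is taken from the text.

```python
import numpy as np

def nmf_kl(V, k, thres=0.9999, max_iter=500, seed=None):
    """KL-divergence NMF (V ≈ WH) via multiplicative updates, with random
    initialisation and the KL > thres * oldKL stopping rule."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + 1e-9          # random, strictly positive start
    H = rng.random((k, m)) + 1e-9
    old_kl = np.inf
    for _ in range(max_iter):
        WH = W @ H
        H *= (W.T @ (V / WH)) / W.sum(axis=0)[:, None]
        WH = W @ H
        W *= ((V / WH) @ H.T) / H.sum(axis=1)[None, :]
        WH = W @ H
        kl = np.sum(V * np.log((V + 1e-12) / WH) - V + WH)
        if kl > thres * old_kl:            # improvement below threshold
            break
        old_kl = kl
    return W, H
```

Because W and H start from random data, repeated runs can converge to different factorisations, matching the inconsistency observed above.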
The most extreme upper results in the data set are the k = 30 results for the “GATT” data set from the Reuters collection, where the NMF system records a 95.28% decrease in entropy, which seems a very unlikely number to achieve. Interestingly, this result makes a lot of sense when looking at the data set in Table A.3 on Page 77. There are 35 traces in this, the smallest data set in the system, so one would almost expect to see a single trace per shade. This is nearly the case for the NMF results. The SVD results, which are substantially lower, actually group some traces together onto a single shade, which intuitively makes more sense when one thinks about grouping traces with the same tag around a single shade. The reason SVD's reduction in entropy is lower is that it more often mixes traces with different tag assignments onto the same shade. This is because SVD can drive an eigenvector between shades of meaning, rather than through them.
In contrast to the “GATT” data set, the “float” data set is larger, with 70 traces, yet a large decrease in entropy still occurs when it is decomposed into shades, especially for SVD. This means that the shades of meaning present in the “float” data match what the different sense taggings mean more closely than in other data sets (see Table 5.6 on Page 67). Also, the “oil” data set, one of the largest with over 400 traces, shows significant reductions in entropy, suggesting that the size of the data set only impacts the results as the number of shades (k) gets much closer to the total number of traces.
(a) NMF, the reduction in entropy is 44.73%.
sharesact currencyact rod fiesta cash milk wobble device lorry
[per-shade trace counts for rows 0–9 not legible in this transcript]
(b) SVD, the reduction in entropy is 68.25%.
sharesact currencyact rod fiesta cash milk wobble device lorry
[per-shade trace counts for rows 0–9 not legible in this transcript]
Table 5.6: “float” at k = 5. Another example of SVD performing better than NMF.
In Table A.4 on Page 78 are the two confusion matrices created for the “reagan” data set. Aspects of “reagan” that were easy to identify and tag in the data set are easily found by both decompositions; examples of these clearly defined tags are “tax”, “highway” and “acid.” Also interesting is the difference between NMF and SVD, specifically where different tags have been clustered onto the same shade. For example, on shade 8 NMF puts 8 traces tagged “iran” next to 6 tagged “staff”; arguably these topics are all related once the corpus is seen, since after the Iran-contra controversy investigations were carried out on members of Reagan's staff. Other weakly represented
overlaps can be identified in the tables too, like that between “address” (meaning a presidential speech) and “economy.”
CHAPTER 6
Conclusions and Future Work
This thesis investigates the possibility that concepts have subtle aspects or shades of
meaning (SoM). The Hyperspace Analogue to Language (HAL) semantic space model
was used to model concepts in a high dimensional semantic space, a term × term ma-
trix. The data is created by running the HAL algorithm over several sets of data, each
representing a term by a series of traces. These sets of data can be broken up into two
different groups; the nouns from the SENSEVAL1 data, and handpicked sets of traces
built out of all titles containing that word in the Reuters-21578 collection. Two methods
of dimensional reduction were used to induce these shades of meaning about the word
from its HAL semantic space. The methods employed were Singular Value Decompo-
sition (SVD) and Non-negative Matrix Factorisation (NMF). Pilot studies revealed that
other potential methods: vector negation and concept indexing did not reveal promis-
ing enough results to warrant further investigation.
An evaluation framework based on treating the SoM as clusters, together with conditional entropy, was used to evaluate the overall performance of each system for different values of k ∈ {5, 10, 20, 30}. Using this framework it is clear that using dimensional reduction with a matrix representation of a concept can induce the shades of meaning surrounding a particular word, with NMF looking to be more effective than SVD. This is probably because the factorisation drives vectors through term clusters in the semantic space corresponding to a SoM, whereas the orthogonal eigenvector solution imposed by SVD has more chance of driving eigenvectors between such term clusters, degrading performance. This work has opened up a wealth of possibilities for further research; several have been mentioned briefly throughout this document, but some warrant further discussion. By way of closing this line of work, the sections below reflect on possibilities for improving the results of SoM and for applying the knowledge gained here to new applications.
6.1 SoM Accuracy Across Languages
An interesting approach to empirically measuring the performance of SoM systems is to see how well the shades translate from one language and culture to another. This would be possible by building a HAL semantic space over a parallel corpus in two languages. Widdows shows that semantic relationships between two parallel corpora are maintained through statistical model processes [53]. By taking the same shade of meaning in each language and showing them to someone who speaks both languages, it should be possible to get a good idea of whether the two shades of meaning mean the same thing.
6.2 Decomposing Term-Trace Data
In the present line of research the data being operated on is term × term data built by running the HAL algorithm over the traces. Similar but fundamentally different is the work of Rapp in his paper addressing word sense induction [42]. He follows on from the work of Neill, but proposes a new representation model [37]: instead of building local clusters of information out of a large global context, keep the traces in their original context and build a term × trace (or term × context, as he puts it) matrix of term frequency vectors. His work shows encouraging results and inspires a new angle on computing SoM using a hybrid of the HAL model presented in this thesis and his methods.
The newly proposed angle on SoM is to build a matrix of term × trace vectors, but instead of weighting by term frequency, to weight according to a single instance of a HAL window. For example, the trace “President Reagan ignorant of the arms scandal” with a HAL window size of l = 5 would be represented in the matrix as the vector [president: 5, reagan: 0, ignorant: 5, of: 4, the: 3, arms: 2, scandal: 1]. As can be seen in the vector, rather than term frequencies, the HAL weighting window is applied once to the trace, centred on the term in question.
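The single-window weighting described above can be sketched as follows: a word d positions from the term of interest (d ≤ l) receives weight l − d + 1. The function name is mine, and the sketch assumes the term occurs once in the trace.

```python
def hal_trace_vector(trace, term, l=5):
    """Apply the HAL weighting window once, centred on `term`: a word at
    distance d (0 < d <= l) gets weight l - d + 1; the term itself gets 0."""
    words = trace.lower().split()
    i = words.index(term)
    vec = {}
    for j, w in enumerate(words):
        d = abs(j - i)
        if 0 < d <= l:
            vec[w] = max(vec.get(w, 0), l - d + 1)
    vec[term] = 0
    return vec
```

Run on the “Reagan” trace above, this reproduces the example vector from the text.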
Applying SVD to this new style of matrix would reduce its sparsity and allow a term × term matrix to be built in a similar fashion to that discussed in Section 3.1 [24]. By then applying SVD or clustering to this term × term matrix, built from specific term context information, it may be possible to induce SoM. Using the evaluation strategy and data of this research, the effectiveness of this hybrid method could be evaluated.
6.3 Euclidean and Generalised Kernel NMF
There are other ways to solve the NMF problem that are not covered in this thesis. These include using the Frobenius norm to find the Euclidean NMF, as opposed to the KL-divergence NMF. This concept can be taken further by generalising NMF to a kernel-based model where any function can be used to measure the distance1 between the approximation and the original data. Preliminary experiments with a kernel-based NMF and the Frobenius norm are promising with respect to finding shades of meaning. Furthermore, point-wise mutual information and maybe even information flow are further candidates for investigation.
6.4 Abductive Reasoning
An area that shows a lot of potential for applying this work is abductive reasoning [6, 17]. The discovery of connections between Raynaud's disease and fish oil typifies the kind of relationships associated with abductive reasoning. Bruza et al. explore the relationship between semantic space models and abductive reasoning and find promising yet slightly underwhelming results. It should be possible to use shades of meaning to allow much more fine-grained analysis, allowing relevant but weaker relationships to be found more effectively.
The discovery of fish oil as a treatment for Raynaud's disease was serendipitous, made because of the common properties between what fish oil could treat and the effects of Raynaud's disease. There was no material directly linking the two, yet a connection could be made between them. This is somewhat formalised by modelling an A-B-C system, where A represents a potential cure, C the problem for which a cure is needed, and B the relationship that joins the two. The example B terms given for the Raynaud's disease example are “platelet aggregation,” “vascular reactivity,” and “blood viscosity.” By knowing C, and potential B's, the system should be able to narrow the bounds on the possible A's. So if there are
1 In the loosest sense of the word.
vectors for A and C, the intuition is that more B terms will be common to both when they are more closely related; this can be quantified using the cosine similarity. This could be done merely to test the validity of the A-B-C theory.
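The cosine measure over two sparse term vectors can be sketched directly; the dictionary representation and the function name are illustrative assumptions, not part of the A-B-C proposal itself.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term vectors (term -> weight).
    Shared B terms raise the dot product between the A and C vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Vectors sharing no terms score 0; identical vectors score 1.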
Bruza et al. go further by using SVD to reduce the dimensionality of the semantic space in which A and C are modelled. As mentioned in Section 2.1.2, it would be expected that some words that were zero before end up being non-zero after the SVD operation is performed. In the experiments conducted, dimensional reduction moves the word “fish” up to the top 1143 of the 28,799 dimensions in the Raynaud vector.
Concept Combination (CC) and Information Flow (IF) are methods proposed by Bruza
and Song [50, 48] for inferring a relationship between concepts or groups of concepts.
Groups of concepts can be created through the CC heuristic, and relationships then inferred through IF. By using CC, the SoM around a group of concepts, or a composite concept, can be induced.
Quantifying SoM in an operational abductive system could help identify A terms and
phrases with stronger relationships. This would be possible by using the SoM vectors
for C to find appropriate B and A terms. As the vectors represent a SoM they can be
used with CC and IF to find a much more specific concept.
6.4.1 Quantum Logic
An area of research closely linked with Abductive Reasoning is the field of Quantum
Informatics. Words can be considered superpositions of SoM, and when a word
is observed in context the superposition collapses into a particular SoM. By finding
the SoM around a word or concept the theory is moving towards a way of defining
eigenstates, or at least a new basis for words in a quantum-like model of meaning. This
opens up a new field of mathematics applicable to this problem [5].
APPENDIX A
Confusion Matrices
term method IK, B, nmi (% incr) B, 5 B, 10 B, 20 B, 30
shirt nmf 6, 0.06, 0.12 (5.94%) 0.05, 0.10 (5.01%) 0.06, 0.10 (4.01%) 0.09, 0.12 (2.95%) 0.14, 0.17 (2.58%)
shirt svd 6, 0.06, 0.11 (5.53%) 0.05, 0.11 (6.27%) 0.06, 0.10 (4.30%) 0.09, 0.12 (2.94%) 0.14, 0.13 (-1.03%)
onion nmf 4, 0.01, 0.07 (6.07%) 0.01, 0.12 (10.52%) 0.03, 0.17 (13.56%) 0.05, 0.17 (11.33%) 0.08, 0.17 (9.13%)
onion svd 4, 0.01, 0.02 (0.64%) 0.01, 0.01 (0.17%) 0.03, 0.05 (1.82%) 0.05, 0.09 (3.70%) 0.08, 0.09 (1.36%)
disability nmf 3, 0.01, 0.03 (1.86%) 0.06, 0.02 (-4.19%) 0.05, 0.08 (3.03%) 0.09, 0.11 (2.11%) 0.10, 0.14 (3.57%)
disability svd 3, 0.01, 0.02 (0.93%) 0.06, 0.03 (-2.86%) 0.05, 0.07 (1.08%) 0.09, 0.08 (-1.02%) 0.10, 0.12 (2.01%)
scrap nmf 12, 0.17, 0.21 (4.26%) 0.09, 0.17 (7.99%) 0.17, 0.25 (8.82%) 0.24, 0.29 (4.73%) 0.32, 0.36 (3.79%)
scrap svd 12, 0.17, 0.18 (1.34%) 0.09, 0.12 (3.16%) 0.17, 0.16 (-0.87%) 0.24, 0.25 (0.54%) 0.32, 0.31 (-1.31%)
float nmf 9, 0.20, 0.43 (23.51%) 0.13, 0.39 (25.29%) 0.19, 0.51 (31.88%) 0.33, 0.56 (23.71%) 0.38, 0.52 (14.31%)
float svd 9, 0.20, 0.69 (49.09%) 0.13, 0.41 (27.36%) 0.19, 0.68 (48.71%) 0.33, 0.67 (34.86%) 0.38, 0.70 (32.76%)
giant nmf 5, 0.13, 0.20 (7.51%) 0.13, 0.16 (3.33%) 0.16, 0.19 (3.44%) 0.20, 0.18 (-1.83%) 0.24, 0.24 (-0.18%)
giant svd 5, 0.13, 0.13 (0.01%) 0.13, 0.13 (0.01%) 0.16, 0.18 (2.12%) 0.20, 0.17 (-2.49%) 0.24, 0.23 (-1.80%)
sack nmf 6, 0.08, 0.12 (4.20%) 0.09, 0.17 (8.84%) 0.12, 0.15 (3.26%) 0.23, 0.22 (-0.95%) 0.23, 0.32 (9.33%)
sack svd 6, 0.08, 0.21 (12.74%) 0.09, 0.18 (9.31%) 0.12, 0.18 (5.65%) 0.23, 0.24 (1.05%) 0.23, 0.25 (1.82%)
behaviour nmf 3, 0.02, 0.03 (1.02%) 0.02, 0.03 (1.75%) 0.03, 0.04 (0.57%) 0.03, 0.03 (-0.22%) 0.05, 0.05 (-0.22%)
behaviour svd 3, 0.02, 0.00 (-1.54%) 0.02, 0.03 (1.56%) 0.03, 0.04 (0.46%) 0.03, 0.03 (-0.15%) 0.05, 0.04 (-0.77%)
knee nmf 13, 0.15, 0.21 (6.44%) 0.08, 0.18 (10.17%) 0.13, 0.23 (10.42%) 0.17, 0.22 (4.69%) 0.21, 0.24 (3.21%)
knee svd 13, 0.15, 0.22 (7.23%) 0.08, 0.19 (11.28%) 0.13, 0.20 (7.69%) 0.17, 0.20 (2.70%) 0.21, 0.24 (3.39%)
accident nmf 6, 0.04, 0.10 (5.20%) 0.04, 0.05 (1.37%) 0.05, 0.10 (5.85%) 0.08, 0.15 (7.29%) 0.09, 0.14 (5.26%)
accident svd 6, 0.04, 0.08 (3.45%) 0.04, 0.06 (2.35%) 0.05, 0.09 (4.45%) 0.08, 0.12 (4.78%) 0.09, 0.13 (4.41%)
promise nmf 7, 0.12, 0.11 (-1.09%) 0.11, 0.08 (-3.49%) 0.14, 0.16 (1.95%) 0.20, 0.22 (1.79%) 0.23, 0.23 (0.15%)
promise svd 7, 0.12, 0.13 (1.08%) 0.11, 0.13 (2.09%) 0.14, 0.14 (-0.11%) 0.20, 0.19 (-0.79%) 0.23, 0.26 (3.08%)
rabbit nmf 8, 0.09, 0.10 (0.88%) 0.08, 0.06 (-2.58%) 0.10, 0.11 (0.45%) 0.12, 0.12 (-0.31%) 0.12, 0.13 (0.77%)
rabbit svd 8, 0.09, 0.07 (-1.37%) 0.08, 0.05 (-3.19%) 0.10, 0.09 (-1.60%) 0.12, 0.11 (-0.72%) 0.12, 0.11 (-1.14%)
excess nmf 9, 0.09, 0.16 (6.90%) 0.06, 0.08 (2.62%) 0.09, 0.16 (6.32%) 0.16, 0.22 (5.98%) 0.21, 0.25 (4.20%)
excess svd 9, 0.09, 0.08 (-0.85%) 0.06, 0.05 (-0.91%) 0.09, 0.09 (-0.03%) 0.16, 0.16 (-0.39%) 0.21, 0.20 (-0.54%)
steering nmf 5, 0.05, 0.08 (2.67%) 0.05, 0.11 (6.09%) 0.08, 0.24 (16.57%) 0.10, 0.22 (11.68%) 0.16, 0.25 (9.31%)
steering svd 5, 0.05, 0.26 (21.11%) 0.05, 0.26 (21.11%) 0.08, 0.27 (19.08%) 0.10, 0.27 (17.14%) 0.16, 0.28 (11.56%)
bet nmf 15, 0.14, 0.23 (8.75%) 0.07, 0.10 (2.99%) 0.11, 0.15 (4.10%) 0.18, 0.26 (8.10%) 0.23, 0.32 (9.27%)
bet svd 15, 0.14, 0.20 (5.77%) 0.07, 0.09 (2.48%) 0.11, 0.14 (2.81%) 0.18, 0.23 (5.48%) 0.23, 0.28 (5.09%)
Table A.1: SENSEVAL NMI results
term method B IK, ent (% decr) 5 10 20 30
shirt nmf 2.27 6: 1.43 (13.84%) 1.48 (11.07%) 1.43 (14.05%) 1.31 (21.39%) 1.13 (32.01%)
shirt svd 2.27 6: 1.46 (12.41%) 1.49 (10.58%) 1.43 (14.01%) 1.33 (19.85%) 1.26 (24.22%)
onion nmf 1.96 4: 0.74 (10.60%) 0.67 (18.92%) 0.58 (30.03%) 0.49 (40.82%) 0.49 (41.17%)
onion svd 1.96 4: 0.81 (2.03%) 0.81 (2.07%) 0.74 (10.74%) 0.65 (21.90%) 0.63 (23.47%)
disability nmf 2.76 3: 0.92 (4.03%) 0.93 (3.10%) 0.78 (18.81%) 0.67 (30.66%) 0.57 (40.44%)
disability svd 2.76 3: 0.94 (2.29%) 0.92 (4.34%) 0.84 (12.58%) 0.77 (20.20%) 0.65 (32.17%)
scrap nmf 2.78 12: 2.09 (24.32%) 2.32 (15.70%) 2.00 (27.33%) 1.76 (35.98%) 1.43 (48.24%)
scrap svd 2.78 12: 2.25 (18.22%) 2.52 (8.53%) 2.37 (14.08%) 1.95 (29.05%) 1.68 (39.13%)
float nmf 5.63 9: 1.11 (51.13%) 1.41 (37.99%) 0.91 (59.73%) 0.55 (75.81%) 0.52 (76.90%)
float svd 5.63 9: 0.73 (67.79%) 1.59 (30.14%) 0.72 (68.25%) 0.51 (77.45%) 0.33 (85.42%)
giant nmf 5.41 5: 0.79 (29.70%) 0.86 (23.45%) 0.71 (37.01%) 0.66 (41.55%) 0.44 (60.83%)
giant svd 5.41 5: 0.96 (14.69%) 0.96 (14.69%) 0.81 (27.65%) 0.71 (36.69%) 0.51 (54.28%)
sack nmf 4.88 6: 1.37 (15.71%) 1.28 (20.96%) 1.24 (23.26%) 0.97 (40.04%) 0.59 (63.58%)
sack svd 4.88 6: 1.27 (21.88%) 1.33 (17.81%) 1.27 (21.64%) 1.02 (37.09%) 0.97 (39.92%)
behaviour nmf 1.61 3: 0.33 (7.10%) 0.31 (11.98%) 0.29 (18.92%) 0.28 (20.32%) 0.23 (34.59%)
behaviour svd 1.61 3: 0.36 (0.14%) 0.32 (9.27%) 0.30 (14.84%) 0.29 (18.86%) 0.26 (27.65%)
knee nmf 1.79 13: 1.76 (26.06%) 1.97 (17.21%) 1.75 (26.70%) 1.66 (30.18%) 1.53 (35.76%)
knee svd 1.79 13: 1.77 (25.52%) 1.97 (17.32%) 1.85 (22.39%) 1.76 (26.14%) 1.56 (34.36%)
accident nmf 1.59 6: 1.00 (14.94%) 1.09 (7.41%) 0.95 (19.42%) 0.77 (34.04%) 0.76 (35.44%)
accident svd 1.59 6: 1.05 (10.94%) 1.09 (7.53%) 0.99 (15.98%) 0.87 (26.23%) 0.80 (31.77%)
promise nmf 4.00 7: 1.42 (14.70%) 1.52 (8.84%) 1.27 (23.52%) 1.02 (38.75%) 0.92 (44.65%)
promise svd 4.00 7: 1.45 (13.10%) 1.48 (11.42%) 1.43 (14.12%) 1.27 (23.82%) 1.02 (38.93%)
rabbit nmf 2.23 8: 0.54 (24.29%) 0.62 (12.50%) 0.50 (29.82%) 0.43 (39.38%) 0.37 (47.52%)
rabbit svd 2.23 8: 0.60 (15.24%) 0.65 (7.90%) 0.56 (20.86%) 0.47 (33.99%) 0.44 (38.05%)
excess nmf 2.31 9: 1.89 (17.95%) 2.12 (8.20%) 1.88 (18.53%) 1.60 (30.70%) 1.46 (36.94%)
excess svd 2.31 9: 2.16 (6.69%) 2.23 (3.50%) 2.11 (8.59%) 1.89 (17.99%) 1.70 (26.22%)
steering nmf 2.33 5: 1.52 (8.99%) 1.46 (12.27%) 1.12 (32.98%) 1.10 (34.09%) 0.91 (45.33%)
steering svd 2.33 5: 1.24 (25.45%) 1.24 (25.45%) 1.11 (33.52%) 1.04 (37.66%) 1.00 (40.11%)
bet nmf 1.66 15: 2.37 (24.88%) 2.90 (8.29%) 2.67 (15.37%) 2.22 (29.68%) 1.89 (40.07%)
bet svd 1.66 15: 2.51 (20.57%) 2.92 (7.56%) 2.74 (13.35%) 2.36 (25.27%) 2.15 (32.04%)
Table A.2: SENSEVAL1 nouns conditional entropy results
airbus semi gatt subsidies agri herring dispute
[confusion matrix values omitted]
Table A.3: “GATT” at k = 30 showing a nearly perfect reduction in entropy score of 95.28%.
(a) NMF, entropy decrease is 43.84%.
euratom acid iran congress gold wife tax foreign credit japan oil export highway address central gulf economy soviet trip pres staff
[confusion matrix values omitted]
(b) SVD, entropy decrease is 37.07%.
euratom acid iran congress gold wife tax foreign credit japan oil export highway address central gulf economy soviet trip pres staff
[confusion matrix values omitted]
Table A.4: “Reagan” at k = 10 for SVD and NMF, an example of NMF outperforming SVD.
APPENDIX B
Early Pilots and Trace Browser
The biggest problem in gaining intuition about which algorithms were performing well was the sheer amount of raw numeric data that had to be manually scanned by a human reader. This was also a problem when trying to develop an evaluation strategy that was meaningful and not biased. Implementations were working, but no real decisions could be made about any of them. To make the data more readable, a “Trace Browser” (TB) was written to display the massive amounts of data in a way that gives a good idea of the results without needing to read every piece of data in the system or create an evaluation strategy just to understand them.
TB takes all the traces used to build the system and creates a vector to represent each one. This vector is the centroid of all the term vectors in a trace (see Section 4.2.3). The centroid of a cluster is accepted to be a good representative of the content of the constituent vectors [22]. Once the TB has all the trace centroids and all of the SoM built by the above algorithms, it compares them all (via cosine similarity) and creates a ranked list of traces that best match a particular shade of meaning. These ranked lists are then displayed alongside the trace they are best matched against. Only the top ten values from the shade of meaning are displayed, to make it possible to read through
the results quickly.
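The TB's centroid-and-rank step can be sketched as follows. The trace names and three-dimensional term vectors below are invented for illustration; the real system works over full semantic-space vectors.

```python
import numpy as np

def centroid(term_vectors):
    """Centroid of a trace: the mean of its term vectors (Section 4.2.3)."""
    return np.mean(term_vectors, axis=0)

def rank_traces(som_vector, trace_centroids):
    """Rank traces by cosine similarity to a shade-of-meaning vector."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    sims = [(name, cos(som_vector, c)) for name, c in trace_centroids.items()]
    return sorted(sims, key=lambda t: t[1], reverse=True)

# Hypothetical traces, each a small bundle of toy term vectors.
traces = {
    "trace_finance": centroid(np.array([[1.0, 0.2, 0.0], [0.8, 0.1, 0.1]])),
    "trace_sport":   centroid(np.array([[0.0, 1.0, 0.3], [0.1, 0.9, 0.2]])),
}
som = np.array([0.9, 0.1, 0.0])  # a shade of meaning close to "finance"
ranked = rank_traces(som, traces)
print(ranked[0][0])  # the best-matching trace for this shade of meaning
```

The ranked list for each shade of meaning is what the TB then displays alongside its best-matching trace.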
The TB gives an effective way to browse the results quickly and clearly, but it does not solve the problem of evaluation that is clearly an issue with this line of research. At least with the TB there were indicative results that helped make some decisions about evaluation. Figure B.1 shows a snapshot of the trace browser in action, with two shades of meaning in green and a ranked list of the traces shown in purple, with their similarities shown in blue.
Figure B.1: A screenshot of the trace browser.
BIBLIOGRAPHY
[1] Leif Azzopardi, Mark Girolami, and Malcolm Crowe. Probabilistic hyperspace analogue to language. In SIGIR ’05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 575–576, New York, NY, USA, 2005. ACM Press.
[2] Ricardo A. Baeza-Yates and Berthier A. Ribeiro-Neto. Modern Information Retrieval. ACM Press / Addison-Wesley, 1999.
[3] Michael W. Berry, Murray Browne, Amy N. Langville, Paul V. Pauca, and Robert J. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics & Data Analysis, 52(1):155–173, September 2007.
[4] D. Boley, M. Gini, R. Gross, S. Han, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore. Partitioning-based clustering for web document categorization. Decision Support Systems, 27(3):329–341, 1999.
[5] P. Bruza and R. Cole. Quantum logic of semantic space: An exploratory investigation of context effects in practical reasoning. In S. Artemov, H. Barringer, A. S. d’Avila Garcez, L. C. Lamb, and J. Woods, editors, We Will Show Them: Essays in Honour of Dov Gabbay, volume 1, pages 339–361. College Publications, London, 2005.
[6] P.D. Bruza, D.W. Song, and R.M. McArthur. Abduction in semantic space: Towards a logic of discovery. Logic Journal of the IGPL, 12(2):97–110, 2004.
[7] Wray Buntine. Extensions to EM and multinomial PCA. In Proc. European Conference on Machine Learning (ECML-02), pages 23–34, 2002.
[8] C. Burgess. From simple associations to the building blocks of language: Modeling meaning in memory with the HAL model. Behavior Research Methods, Instruments & Computers, 30(2):188–198, 1998.
[9] C. Burgess. Representing and resolving semantic ambiguity: A contribution from high-dimensional memory modeling. In D. S. Gorfein, editor, On the Consequences of Meaning Selection: Perspectives on Resolving Lexical Ambiguity, pages 233–261. American Psychological Association, 2001.
[10] C. Burgess, K. Livesay, and K. Lund. Explorations in context space: words, sentences, discourse. Discourse Processes, 25(2&3):211–257, 1998.
[11] Deng Cai and Xiaofei He. Orthogonal locality preserving indexing. In SIGIR ’05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 3–10, New York, NY, USA, 2005. ACM Press.
[12] Niladri Chatterjee and Shiwali Mohan. Discovering word senses from text using random indexing. In CICLing, pages 299–310, 2008.
[13] S. Deerwester, S. Dumais, T.K. Landauer, and G.W. Furnas. Indexing by latent semantic analysis. Journal of the American Society for Information Science and Technology, 41(6):391–407, 1990.
[14] C. Ding. A probabilistic model for latent semantic indexing. Journal of the American Society for Information Science and Technology, 56(6):597–608, 2005.
[15] Susan T. Dumais. LSI meets TREC: A status report. In Text REtrieval Conference, pages 137–152, 1992.
[16] Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Scott Deerwester, and Richard Harshman. Using latent semantic analysis to improve access to textual information. In Proceedings of the Conference on Human Factors in Computing Systems CHI’88, 1988.
[17] D. Gabbay and J. Woods. Abduction. Lecture notes from the European Summer School on Logic, Language and Information (ESSLLI 2000), 2000.
[18] P. Gärdenfors. Conceptual Spaces: The Geometry of Thought. MIT Press, 2000.
[19] Peter Gärdenfors and Massimo Warglien. Cooperation, conceptual spaces and the evolution of semantics. In P. Vogt, Y. Sugita, E. Tuci, and C. Nehaniv, editors, Symbol Grounding and Beyond, pages 16–30. Springer, Berlin Heidelberg, 2006.
[20] Eric Gaussier and Cyril Goutte. Relation between PLSA and NMF and implications. In Proc. 28th international ACM SIGIR conference on Research and development in information retrieval (SIGIR-05), pages 601–602, 2005.
[21] Michael N. Jones, Walter Kintsch, and Douglas J. Mewhort. High-dimensional semantic space accounts of priming. Journal of Memory and Language, 55(4):534–552, November 2006.
[22] George Karypis and Eui-Hong Han. Concept indexing: A fast dimensionality reduction algorithm with applications to document retrieval and categorization. Technical Report TR-00-0016, University of Minnesota, 2000.
[23] A. Kilgarriff and J. Rosenzweig. Framework and results for English SENSEVAL. Computers and the Humanities, 34:15–48, 2000.
[24] A. Kontostathis and W. Pottenger. Detecting patterns in the LSI term-term matrix. In IEEE ICDM’02 Workshop Proceedings, The Foundation of Data Mining and Knowledge Discovery (FDM’02), 2002.
[25] T.K. Landauer. On the computational basis of learning and cognition: Arguments from LSA. In B.H. Ross, editor, The Psychology of Learning and Motivation, volume 41, pages 43–84. Academic Press, 2002.
[26] Alberto Lavelli, Fabrizio Sebastiani, and Roberto Zanoli. Distributional term representations: an experimental comparison. In CIKM ’04: Proceedings of the thirteenth ACM international conference on Information and knowledge management, pages 615–624, New York, NY, USA, 2004. ACM Press.
[27] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, October 1999.
[28] David D. Lewis. Reuters-21578.
[29] K. Livesay and C. Burgess. Mediated priming in high-dimensional meaning space: What is "mediated" in mediated priming? In Cognitive Science Proceedings, LEA, pages 436–441, 1997.
[30] W. Lowe and S. McDonald. The direct route: Mediated priming in semantic space. In M. A. Gernsbacher and S. D. Derry, editors, 22nd Annual Meeting of the Cognitive Science Society, pages 675–680. Lawrence Erlbaum Associates, 2000.
[31] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 1st edition, July 2008.
[32] I. Matveeva, G. Levow, A. Farahat, and C. Royer. Generalized latent semantic analysis for term representation. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-05), pages 1–5, 2005.
[33] R. McArthur and P.D. Bruza. Discovery of implicit and explicit connections between people using email utterance. In Proceedings of the 8th European Conference on Computer-Supported Cooperative Work (ECSCW), pages 21–40. Kluwer Academic Publishers, 2003.
[34] Z. McCoy and M. Blom. Comparing PLSI and non-negative matrix factorization.
[35] Rada Mihalcea and Ted Pedersen. Advances in word sense disambiguation. http://www.d.umn.edu/∼tpederse/WSDTutorial.html, July 2005. Slides from the AAAI 2005 tutorial Advances in Word Sense Disambiguation.
[36] Roberto Navigli. Meaningful clustering of senses helps boost word sense disambiguation performance. In ACL ’06: Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the ACL, Sydney, pages 105–112, Morristown, NJ, USA, 2006. Association for Computational Linguistics.
[37] D. B. Neill. Fully automatic word sense induction by semantic clustering. Master’s thesis, M.Phil. in Computer Speech, Text, and Internet Technology, Churchill College, 2002.
[38] Charles Osgood, George Suci, and Percy Tannenbaum. The Measurement of Meaning. University of Illinois Press, 1957.
[39] Patrick Pantel and Dekang Lin. Discovering word senses from text. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 613–619, New York, NY, USA, 2002. ACM Press.
[40] S. Pinker. The Language Instinct: How the Mind Creates Language. HarperCollins, New York, 1994.
[41] R. Rapp. Word sense discovery based on sense descriptor dissimilarity. In Proceedings of the Ninth Machine Translation Summit, pages 315–322, 2003.
[42] Reinhard Rapp. A practical solution to the problem of automatic word sense induction. In Proceedings of the ACL 2004 on Interactive poster and demonstration sessions, page 26, Morristown, NJ, USA, 2004. Association for Computational Linguistics.
[43] Mark Sanderson. Word sense disambiguation and information retrieval. In Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval, pages 49–57, Dublin, IE, 1994.
[44] Mark Sanderson. Retrieving with good sense. Information Retrieval, 2(1):49–69, 2000.
[45] H. Schütze and J. Pedersen. Information retrieval based on word senses. In Symposium on Document Analysis and Information Retrieval (SDAIR), Las Vegas, NV, pages 161–175, 1995.
[46] Hinrich Schütze. Dimensions of meaning. In Proceedings of Supercomputing ’92, Minneapolis, pages 787–796, 1992.
[47] Hinrich Schütze and Jan O. Pedersen. A cooccurrence-based thesaurus and two applications to information retrieval. Inf. Process. Manage., 33(3):307–318, 1997.
[48] D. Song, P. D. Bruza, Z. Huang, and R. Lau. Classifying document titles based on information inference. In Proceedings of the 14th International Symposium on Methodologies for Intelligent Systems (ISMIS 2003), pages 297–306. Springer, 2003.
[49] D. Song, P.D. Bruza, R.M. McArthur, and T. Mansfield. Enabling management oversight in corporate blog space. In Computational Approaches to Analyzing Weblogs, AAAI Spring Symposium Series, Stanford University, March 27–29, 2006.
[50] D.W. Song and P.D. Bruza. Discovering information flow using a high dimensional conceptual space. In Proceedings of the 24th Annual ACM Conference of Research and Development in Information Retrieval (SIGIR’2001), pages 327–333. ACM Press, 2001.
[51] D.W. Song and P.D. Bruza. Towards context sensitive information inference. Journal of the American Society for Information Science and Technology, 54(3):321–334, 2003.
[52] Jean Veronis. Sense tagging: does it make sense? http://citeseer.ist.psu.edu/veronis01sense.html, accessed July 2008.
[53] Dominic Widdows. Geometry and Meaning. Center for the Study of Language and Information/SRI, 2004.
[54] Yorick Wilks, Dan Fass, Cheng-ming Guo, James E. McDonald, Tony Plate, and Brian M. Slator. Providing machine tractable dictionary tools. Machine Translation, 5(2):99–154, 1990.
[55] Wei Xu, Xin Liu, and Yihong Gong. Document clustering based on non-negative matrix factorization. In SIGIR ’03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pages 267–273, New York, NY, USA, 2003. ACM Press.