
Page 1: 2005 Probing the Properties of Determinologization

Probing the Properties of Determinologization - the DiaSketch

Jakob Halskov
Dept. of Computational Linguistics, Copenhagen Business School
e-mail: [email protected]

Abstract Identifying recurrent usage patterns of terms in non-specialized contexts may act as a filtering device and thus help increase the precision of web-based term extraction algorithms. This article presents a corpus-driven approach to the study of determinologization and investigates the claim in Melby [16] that terms are always characterized by special reference irrespective of the context in which they are used. An implementation of a system called the DiaSketch (Diachronic wordSketch) is outlined. The DiaSketch can detect changing co-occurrence patterns of mother terms in diachronic corpora and thus indirectly assess their termhood. Analyzing the usage of a term from the domain of Information Technology (IT) in specialized and non-specialized contexts by means of DiaSketches, it is shown that termhood seems to be a gradable property.

1 Introduction

Assessing termhood is the key challenge to Automatic Terminology Recognition (ATR)

software, whether based on statistical or linguistic methods. When using the dynamic but

chaotic Internet as a basis for the extraction of new terminology, however, the properties of

termhood become even more complex than with small, static corpora of highly specialized

discourse. Strings of natural language containing elements which would function as terms

behind the domain wall, may not do so in other contexts.

The goal of this article is to present the thoughts and theories behind the DiaSketch, a

system which can detect changing co-occurrence patterns of mother terms in diachronic

corpora and thereby assess the specificity of reference. The study was largely inspired by the

following quote from Rita Temmerman:

An attempt at getting more insight into how the meaning of terms evolves [...] could be a major research topic for Terminology [26]p.15


Section 2 of the article discusses what determinologization is, why it is an interesting

phenomenon to study and how one might operationalize the definition of determinologization

in [18]. While section 3 then briefly summarizes the discussion in Cabré ([3], [4], [5]),

Kageura [13], Melby [16], Pearson [20] and Sager ([21], [22], [23]) about the distinction

between terms and words, section 4 relates the DiaSketch approach to four theoretical schools

within the science of terminology, namely the General Theory of Terminology,

Communicative Theory of Terminology ([4]), Socioterminology ([9], [10], [11]) and

Sociocognitive terminology ([26]). Drawing heavily on work by Evert ([7], [8]), Kilgarriff

([14], [15]) and Schulze [24] section 5 then proceeds to a description of various

implementation issues and section 6 finally gives an example of the output from a beta

version of the DiaSketch implementation.

2 Determinologization

This section discusses what determinologization is, why it should be studied and how one

might proceed to study it with statistical methods from corpus linguistics.

2.1 What is determinologization?

It is not surprising that conceptual fuzziness tends to occur when non-specialists use

terminology in non-specialized communicative contexts. It seems intuitive that what is a term

(representing a clear-cut concept) to one person may be a (possibly unknown) word

representing a fuzzy category to another person who lacks the required specialist knowledge

to decode the term fully and correctly. It also seems probable that traces of this conceptual

fuzziness can be registered in linguistic usage. While one-off cases of creative or fuzzy usage

of terms pose no problem to web-based ATR software, it is a different story when large

numbers of non-specialists use terms from a domain forming strong collocations which,

formally speaking, may resemble terminological neologisms, while not functioning as such.

Although determinologization has received little attention and has yet to be studied in a

quantitative framework, it has been defined as "the ways in which terminological usage and

meaning can 'loosen' when a term captures the interest of the general public" [17]p.12 and

"det at ei eksisterende terminologisk ordform går over i allmennspråket"1 [19]p.112. The

1 That an existing terminological unit enters general language


definition in Meyer and Mackintosh is the more specific of the two and groups the

semantic/pragmatic changes caused by determinologization into two types:

1) Maintien des aspects fondamentaux du sens terminologique

2) Dilution du sens terminologique d'origine [18]pp.202, 205

To illustrate the difference between 1) preservation and 2) dilution of a terminological sense

we can consider the two phrases from The New York Times (1999) below:

• A large server

• The Internet business model needs a reboot

When the term server is modified by the conceptually fuzzy adjective large, it seems that the

reference of the combined phrase has somehow become less accurate. Are we talking about a

server, which is physically large, or about a server which is equipped with large amounts of

RAM? In spite of the superficial fuzziness, the reference to the domain specific concept2

seems to be largely intact. However, this is not the case with the figurative use of the term

reboot. When taking the context into consideration it becomes obvious that reboot no longer

refers to the original domain specific concept of shutting down and restarting an operating

system, but is being used in the more general sense of starting something afresh.

Clearly, there are many intermediate stages between determinologization of type 1) and 2), also known as sense modulation and sense selection in lexical semantics [6], and it is not at all clear whether terms which are exposed to this phenomenon will come to represent increasingly fuzzy categories or not. The rest of the article will outline an approach which may eventually answer this and other questions regarding determinologization.

2.2 Why should we study determinologization?

Having reviewed existing definitions of determinologization, we need to justify our interest in

this phenomenon. While determinologization ought to be an important field of research in

Socioterminology (see section 4.4), it also has important implications for the optimization of

term extraction algorithms. The extensive usage of terminology from a domain like IT by vast

2 a computer that controls or performs a particular job for all the computers in a network (definition from MacMillan English Dictionary for Advanced Learners, 2002)


numbers of non-experts in a variety of communicative settings complicates the automatic

extraction task. Although determinologized usage can be avoided by using corpora which

have been manually compiled and are known to represent specialized communication between

experts, such corpora are expensive to come by and age swiftly (especially in a domain like IT).

Thus using the Internet as a dynamic and inexhaustible treasure trove of terms is becoming

increasingly appealing to computational terminologists [2], but this involves tackling a

number of problems caused by determinologization. While these problems cannot be fully solved through statistical analysis of terminological usage in large general language corpora, we can at least get some indication of the linguistic characteristics of this phenomenon and thus a better understanding of factors which are important to the notions of termhood and domain.

Clearly, the concepts of certain domains are more exposed to determinologization than

others. The domain of IT has been chosen as the testing ground for the present study, but why

this particular domain?

[...] computerese has transcended its fundamental purpose: to describe and explain

computing. Although it still fulfills its original function, it frequently steps outside

these bounds to describe the human condition. Conversely, in the computer industry,

the human condition is frequently explained in terms of technological metaphors.

[1]p.xiiv

As a technical subject field, IT needs concepts which require a high degree of determinacy, but at the same time these specialized concepts are highly popularized, making them

particularly exposed to determinologization. While this is also the case for domains like

medicine, appliances and technology in general, IT is special in that neology by

terminologization (metaphorical extension of general language lexical units) is much more

frequent in this domain.

Having established what determinologization is and where to look for it, the following

section will explicate how the phenomenon can be investigated within the framework of

corpus linguistics.


2.3 How can we study determinologization?

While the definition in section 2.1 provided a starting point for an empirical investigation of

the linguistic properties of determinologization, it will be necessary to explicate the theory of

meaning adopted in this study in order to arrive at an operational definition of conceptual

fuzziness and determinologization.

From a statistical NLP perspective

it is [...] natural to think of meaning as residing in the distribution of contexts over

which words and utterances are used [25]p.16

In a contextual theory of meaning, syntax and semantics are intimately related and interdependent. Every word has a certain semantic potential, but it is the context, i.e. neighbouring words (and sometimes extra-linguistic context), which activates a particular

sense in a particular case. This theory of meaning is highly pragmatic and descriptive and can

be summarized by the famous Wittgenstein aphorism: "The meaning of a word is its use in the

language" [27]. Based on a contextual theory of meaning, determinologization can be

described as the process by which the combinatory potential, i.e. relational co-occurrence

patterns, of a term starts to resemble that of a comparable lexical unit from general language.
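One crude way to operationalize this resemblance of combinatory potential is to compare co-occurrence context vectors with cosine similarity. The sketch below is an illustration only: the toy sentences, the window size and the unweighted counts are assumptions, not part of the DiaSketch implementation.

```python
from collections import Counter
from math import sqrt

def context_vector(tokens, target, window=3):
    """Count the words co-occurring with `target` within +/- `window` tokens."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vec[tokens[j]] += 1
    return vec

def cosine(v1, v2):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(v1[w] * v2[w] for w in v1 if w in v2)
    norm = sqrt(sum(c * c for c in v1.values())) * sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

# Hypothetical contexts: "reboot" in two figurative uses and one IT-specific use
figurative = "the stalled peace process needs a reboot say commentators".split()
business = "the business model needs a reboot say analysts".split()
technical = "administrators reboot the crashed server after the upgrade".split()

sim_fig = cosine(context_vector(figurative, "reboot"), context_vector(business, "reboot"))
sim_tech = cosine(context_vector(figurative, "reboot"), context_vector(technical, "reboot"))
print(sim_fig, sim_tech)  # the two figurative uses pattern together
```

On realistic corpora one would of course use association-weighted rather than raw counts, but the principle is the same: the more a term's vector in general language drifts toward general-language neighbours, the weaker the evidence for intact special reference.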

While conceptual fuzziness can be measured synchronically, we should not forget that

determinologization is controlled by an extra-linguistic factor, namely the degree of diffusion

of domain concepts into the consciousness of the general public. The spread of domain-

specific concepts into non-specialized discourse is a diachronic phenomenon, and a

description of the linguistic properties of determinologization can thus only be attempted by

comparing a series of synchronic assessments of conceptual fuzziness. Such assessments are

performed by computing lexical profiles of terms in diachronic corpora.

Lexical profiling has primarily been used within lexicography, and an example of an

implementation for English is described in [14] and [15]. The basic technique involves the

calculation of the strength of association of key relational collocates of a word in a part of

speech tagged corpus and the subsequent listing of these collocates ordered by relation and

statistical salience. Figure 1 shows such a lexical profile, also known as a "wordsketch", of

the term server in the British National Corpus (BNC)3.

3 http://www.natcorp.ox.ac.uk/ (March 4, 2005)


Figure 1 - Lexical profile of server as computed by SketchEngine4
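The basic profiling technique can be sketched in miniature. The triples below stand in for the output of a POS-pattern or parsing step and are invented for illustration; the score used here is pointwise mutual information, a simplification of the salience statistic actually used by SketchEngine.

```python
from collections import Counter
from math import log2

# Hypothetical pre-extracted triples: (grammatical relation, headword, collocate)
triples = [
    ("object_of", "server", "crash"), ("object_of", "server", "reboot"),
    ("object_of", "server", "reboot"), ("modifier", "server", "web"),
    ("modifier", "server", "large"), ("modifier", "server", "dedicated"),
    ("object_of", "file", "open"), ("modifier", "file", "large"),
]

def word_sketch(triples, headword):
    """List (collocate, PMI score) per relation for `headword`, highest first."""
    n = len(triples)
    pair_freq = Counter((r, c) for r, h, c in triples if h == headword)
    head_freq = sum(1 for _, h, _ in triples if h == headword)
    coll_freq = Counter(c for _, _, c in triples)
    sketch = {}
    for (rel, coll), f in pair_freq.items():
        # how much more often than chance does this collocate pair with the headword?
        pmi = log2((f / n) / ((head_freq / n) * (coll_freq[coll] / n)))
        sketch.setdefault(rel, []).append((coll, round(pmi, 2)))
    for rel in sketch:
        sketch[rel].sort(key=lambda pair: -pair[1])
    return sketch

profile = word_sketch(triples, "server")
# "large" also modifies general nouns, so its salience for "server" is low
```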

So far wordsketching has only been applied to synchronic studies of lexical units, but the

technique seems very promising for diachronic studies of how terminological units behave in

general language corpora. Retrieving significant collocational changes from wordsketch to

wordsketch in successive time slices of a general language corpus will yield what we might

call a DiaSketch (Diachronic wordSketch). Classifying the speed and manner in which

changes might register in such a DiaSketch will then help us gain a greater knowledge of the

linguistic properties of determinologization and perhaps allow us to refine the definition

proposed in section 2.1. Before proceeding to an actual test run of the DiaSketch

implementation in section 6, sections 3 and 4 will summarize the theoretical debate on the

issue of termhood.
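The wordsketch-to-wordsketch comparison described above can be sketched as follows. The profile format (a mapping from relation to scored collocates), the change threshold and the invented salience scores are illustrative assumptions, not the actual DiaSketch implementation.

```python
def diasketch_changes(old, new, threshold=0.5):
    """Compare two lexical profiles ({relation: [(collocate, score), ...]}) and
    report collocates that appeared, vanished, or shifted by more than `threshold`."""
    changes = []
    for rel in set(old) | set(new):
        before = dict(old.get(rel, []))
        after = dict(new.get(rel, []))
        for coll in set(before) | set(after):
            b, a = before.get(coll), after.get(coll)
            if b is None:
                changes.append((rel, coll, "new"))
            elif a is None:
                changes.append((rel, coll, "gone"))
            elif abs(a - b) > threshold:
                changes.append((rel, coll, round(a - b, 2)))
    return changes

# Hypothetical salience scores for "server" in two time slices of a news corpus
sketch_1996 = {"modifier": [("file", 4.1), ("network", 3.8)]}
sketch_1999 = {"modifier": [("web", 5.0), ("file", 2.9)]}
changes = diasketch_changes(sketch_1996, sketch_1999)
```

Collocates tagged "new" or "gone", and large score shifts, are exactly the kind of diachronic signal a DiaSketch would surface for manual inspection.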

3 Terms vs. words

This section discusses the notion of termhood as a prelude to the summary of theoretical

approaches to terminology in section 4.

3.1 The ideal

A term is typically defined as a lexical unit which represents a concept inside a domain or "a

verbal designation of a general concept in a specific subject field" (ISO 1087-1/ISO 12620).

Ideally, there is a one-to-one mapping between a domain specific concept and the term which

designates or labels it. Although there may be terminological variants (like hard disk and hard

drive in the domain of IT), these refer to the same clear-cut concept, and this synonymy, or

4 http://www.sketchengine.co.uk/ (March 4, 2005)


superficial ambiguity [16]p.55, does not impede efficient and unambiguous specialized

communication between domain experts.

Ideally, a distinctive feature of terms, as opposed to words, is that the 1:1

correspondence between a term and the concept it labels is impervious to linguistic context,

and the meaning of a term can thus be fully decoded irrespective of the context in which it

was used. Words, on the other hand, represent fuzzy categories, which partly overlap with

adjacent categories, and for their meaning to be fully decoded, one normally needs to consider

both the linguistic and communicative contexts. In fact, statistical analysis of large amounts

of actual usage is needed to cluster the meanings of a polysemous word into a list of

predominant word senses, and such a list will necessarily be infinite and highly dynamic due

to the fundamental ambiguity [16]p.55 of general language. When we move beyond neat,

synchronic samples of expert-expert communication within a highly specialized subdomain,

the clear-cut line between terms and words starts to blur, however.

3.2 Critical voices

Cabré’s Theory of Doors (section 4.3) explains why we have yet to see a decisive theory of

the term which allows us to distinguish it from the word. The research object of terminology

is multidimensional, and the symbolic dimension (the level of the term) and the

representational dimension (the level of the concept) do not, in themselves, give us the full

picture. Since terms can rarely be distinguished from words by any formal means, as pointed

out in [13] and [23], we need to consider their communicative function as well. The

representational function of terms has been studied in detail within the conceptually-oriented

framework of the General Theory of Terminology (section 4.1), but Pearson argues that the

communicative function of terms in actual usage has largely been ignored:

it is futile to propose differences between words and terms without reference to the

circumstances in which they are used […we need to consider] what happens when

terms are actually used in text rather than simply as labels for concepts in knowledge

structures [20]pp.7-8

Kageura speculates along the same lines that termhood is perhaps more like an aspectual

category [13]p.26, and so does Cabré:


a lexical unit is by itself neither terminological nor general but [...] it is general by

default and acquires special or terminological meaning when this is activated by the

pragmatic characteristics of the discourse [my emphasis] [5]pp. 189-190

For Sager, the key to distinguishing terms from words is a “theory of reference” [21] and a multidimensional model of knowledge space. Words, he argued, map

onto notions through general reference, while terms map onto concepts (more restricted

segments of knowledge space) through special reference. Twenty years later, in [23], he

reemphasizes that terms are basically the end products of a double evolutionary process of

abstraction and subsequent specification in natural language. The process starts with unique

reference (proper names), the abstraction of which is general reference (nouns), and ends with

special reference (terms).

To avoid misunderstanding and make it possible to enhance the knowledge of

mankind, Sager argues that notions (or general representations) must be refined into concepts

or bundles of judgment. From this perspective a term is then

the name given to a set of judgements considered pro tem as a unit representing a

scientifically defined concept [23]p.53

By pro tem Sager acknowledges the dynamic nature of natural language and the fact that

concepts may degenerate into notions through a usage-governed process of

determinologization. Although it seems convincing that termhood is a function of context

(rather than a property which is established a priori), the next section will introduce a counterargument presented as an analogy.

3.3 Melby’s analogy

In [16] Alan Melby introduces an interesting clay/stone analogy of the difference between

words (lexical units) and terms (terminological units):

A word is thus a chunk of pliable clay and a term is a hard stone. One can think of a

stone as a blob of clay that has become transformed into a solid object through some

chemical process, just as a terminological unit receives a fixed meaning within the

context of a certain domain of knowledge. [16]pp.52-53


In Melby's view termhood is not a gradable property:

the continuum between a very general text and a highly domain-specific text is not a

gray scale along which words gradually become terms and terms gradually become

words [...] The ratio of the mix may change gradually, but usually words are words

and terms are terms, and different processing applies to each. [16]p.53

The reason why domain-specific concepts are non-overlapping and can be ordered into

hierarchical conceptual structures (ontologies) is that they were defined using a metalanguage

(general language) and are protected from conceptual fuzziness by a wall of conventions

surrounding the domain:

When we create a narrow, well thought-out domain we build a wall around it so that

from inside the domain the universe appears to be orderly and computable [16]p.101

Maintaining this wall allows domain experts to optimize the balance between three principles

which are vital to achieving efficient specialist communication, namely precision, economy of

expression and appropriateness [22]. Extreme precision could be achieved by always citing

the complete definition of the given concept, but this is disallowed by the principle of

economy of expression, and thus a compromise is gradually reached on the basis of

appropriateness, which is essentially the norm established by domain experts over time.

Ideally, this balance results in a means of communication which approximates an artificial

(or controlled) language.

However, when a domain attracts the sustained interest of the general public, and

terms from the domain are used by non-specialists outside the wall, we can no longer assume

that the linguistic and communicative contexts have no bearing on the meaning of these

terms. It is the hypothesis of this article that terms under these circumstances may come to

represent fuzzy categories with a prototypical core, rather than clear-cut, non-overlapping

concepts. In answer to the question posed at the beginning of this section: lexical units, which

function as terms in some contexts, do not necessarily fulfil the criteria for termhood (such as

special reference) in other contexts. The test run of the DiaSketch implementation in section 6

will provide more evidence against Melby’s claim that termhood is not a gradable property.


4 Theories of terminology

This section will further elaborate the discussion of termhood by juxtaposing the viewpoints of four theoretical schools of terminology and finally position the DiaSketch approach in this theoretical framework.

4.1 General Theory of Terminology (GTT)

Central to the science of terminology has been the apex of the semantic triangle, namely the concept. Extra-linguistic reality (objects or referents) is classified by identifying the distinctive properties of the corresponding mental representations, storing these properties in attribute-value matrices and ordering the resulting clear-cut concepts into conceptual hierarchies or ontologies. In classical terminology (as advocated in the posthumously published works of Eugen Wüster [28]) concepts are thus static, universal and non-overlapping, and the position of a particular concept in a given hierarchy is precisely determined by its definition, which typically specifies a genus proximum (nearest superordinate concept) and differentia specifica (specific differences).

Figure 2 – Melby’s clay/stone analogy

This objectivist approach to terminology is computationally tractable and has proven extremely successful in fields like knowledge engineering, ontology-based Information Retrieval and terminological standardization. The last few years, however, have seen a vigorous theoretical debate in which the explanatory adequacy of GTT as an all-embracing theory has been questioned. The attacks on GTT come from many branches of linguistics, establishing new terminological schools such as Socioterminology ([9], [10], [11]), Socio-Cognitive Terminology [26] and Communicative Theory of Terminology [4].

While some scholars contest the very status of terminology as an independent scientific discipline:

there is no substantial body of literature which could support the proclamation of terminology as a separate discipline and there is not likely to be. Everything of importance that can be said about terminology is more appropriately said in the context of linguistics, or information science or computational linguistics [22]p.1

other scholars criticize GTT for ignoring actual usage:

Wüster developed a theory about what terminology should be [my emphasis] in order to ensure unambiguous plurilingual communication and not about what terminology actually is in its great variety and plurality [5]p.167

Harsher critics claim that:

Traditional Terminology confuses principles, i.e. objectives to be aimed at, with facts which are the foundation of a science. By raising principles to the level of facts, it converts wishes into reality [26]p.15

As long as one recognizes that the primary objectives of GTT are knowledge structuring and standardization, rather than a descriptive account of terminological usage in various contexts, I find this bias perfectly legitimate. The following sections will briefly review alternative paradigms in terminology and finally explicate the theoretical foundations of the present study on determinologization.

4.2 SocioCognitive Terminology

While the GTT approach must be counted among the positivist or objectivist theories of science, Sociocognitive terminology, as advocated in Temmerman [26], is a hermeneutic or experientialist theory. The premise in experientialism is that reality does not exist independently of the perceiving subject. All knowledge comes from experience, and meaning cannot be completely objectified because it always involves a subject and is perceived and expressed through an inescapable filter (natural language). Inspired by recent findings in Cognitive Science which suggest that there is no clear separation between general and specialized knowledge, [26] thus claims that terms, more often than not, represent categories (notions in Sager’s terminology) which are as fuzzy and dynamic as those represented by words.

Temmerman argues that clear-cut concepts, which are not prototypical to some extent, are extremely rare outside of exact sciences like Mathematics and Chemistry [26]p.223. The analytical (intensional) definitions used in GTT are thus often inadequate because prototypical categories with gradable membership cannot be understood in a logical or ontological structure. The core of Temmerman’s criticism of GTT is that it rejects the unity of the linguistic sign by dissociating form (the term) from content (the concept) and thus reducing terms to context-independent labels for things.

In her Sociocognitive terminology Temmerman speaks of Units of Understanding (UU), rather than of concepts. These UUs typically have prototype structure and are in constant evolution. UUs can rarely be intensionally defined but should be interpreted by means of “templates of understanding”, which are composed of different modules of information depending on the receiver and the context.

On the whole I agree with Temmerman that GTT needs to be extended in various directions to achieve descriptive and explanatory adequacy, and I believe it is correct that terms, in certain contexts, represent categories as fuzzy as those represented by words, but I disagree with her that this is the general case. I think it requires a process of determinologization, and this process is only initiated when the domain to which the term belongs catches the interest of non-specialists. While the SocioCognitive approach to terminology shakes the very foundations of classical terminology, the Communicative Theory of Terminology outlined in the next section is much more inclusive of GTT.

4.3 Communicative Theory of Terminology

Cabré ([4], [5]) claims that the research object of terminology is not concepts, nor units of understanding, but rather Terminological Units (TU):

At the core of the knowledge field of terminology we, therefore, find the terminological unit seen as a polyhedron with three viewpoints: the cognitive (the concept), the linguistic (the term) and the communicative (the situation) [5]p.187

While GTT accounts for one dimension of the terminological polyhedron, namely the conceptual one, it fails to consider the other dimensions. This does not mean that GTT is flawed, because TUs are such complex and multidimensional phenomena that they can hardly be accessed on all fronts at once. It does mean, however, that GTT can only be an ancillary component in a more comprehensive theory, the outline of which has only recently manifested itself. Cabré argues that:

it is impossible to account for the complexity of terminology within a single theory [...] a number of integrated and complementary theories are required which deal with the different facets of terms [3]pp.12-13

Although Sager already discussed the communicative dimension of terms in [22], Cabré reiterates his arguments and calls for a Communicative Theory of Terminology (CTT) in which “each one of the three dimensions [the cognitive, linguistic and communicative], while being inseparable in the terminological unit, permits a direct access to the object” [5]p.187.

4.4 Socioterminology

Like Sociocognitive Terminology and CTT, Socioterminology, as outlined in Gambier ([9], [10]) and Gaudin [11], also argues that GTT needs to be extended. Socioterminology is basically a functionalist approach to terminology, which stipulates that we should include contextual factors like language change and social practices in the study of terminology:

Un terme ne peut pas être vu seulement par rapport à un système (adéquation de la désignation, rattachement à un réseau de notions…): il est aussi à voir dans son fonctionnement, sur le terrain des contradictions sociales. (Qui utilise quoi? Qui innove? Comment et par qui les termes se diffusent-ils? Comment s’opèrent les réajustements terminologiques, les reformulations? Etc.)5 [10]p.320

Its focal point is thus linguistic reality, or terminological “performance” in Chomskian terms, rather than terminological standardization or knowledge structuring as such:

En rupture avec les usages traditionnels: consultation d’experts, travaux sur les corpus limités, ignorance de la dimension orale, une attitude plus linguistique – la linguistique étant essentiellement une science descriptive – suppose que les termes soient étudiés dans leur dimension interactive et discursive6. [11]p.295

5 A term cannot be viewed exclusively with respect to a system (the adequacy of the designation, inclusion in a network of concepts…): it should also be viewed with respect to its function in the field of social contradictions. (Who uses what? Who innovates? How and by whom are terms spread? How are terms subjected to language change? Etc.)

This admittedly simplified survey of GTT and three newer theoretical schools shows how the

frameworks of sociolinguistics, cognitive science and communication theory are being

applied to terminology to increase the explanatory adequacy of classical terminological

theory. The following section will describe the theoretical foundations of the DiaSketch

approach to determinologization.

4.5 The DiaSketch approach

Having summed up the viewpoints of GTT, CTT, SocioCognitive Terminology and

Socioterminology, it is now time to position the DiaSketch approach in this theoretical

framework. Owing to its principle of synchrony and monosemy, the evolution of

terminological meaning and a phenomenon like determinologization cannot be studied in a

GTT framework. The framework of SocioCognitive Terminology as presented in [26] does

not seem attractive from a computational, corpus linguistic perspective. Temmerman's Units

of Understanding do not seem to offer a coherent alternative to the conceptual analysis of

classical terminology. As a corpus-driven approach, the theoretical foundations of DiaSketch

are best described as a mixture of CTT and Socioterminology. CTT highlights the impact

that communicative context has on terminological meaning, and Socioterminology stresses

the functional aspects of terms. These aspects are reflected in the composition of the corpora

on which the DiaSketch analyses are based (see section 5.2).

5 Methodology and issues

A corpus-based description of how domain specific terms are used in general language is

faced with two obvious problems:

1. since terms are specific to a domain we must expect their frequency of occurrence to be relatively low outside the domain in question;

2. when lexical units, which function as terms in specialized discourse (e.g. bus, server, driver), occur in non-specialized contexts, the most frequent senses are likely to be the non-specialized ones.

6 Breaking with traditional approaches (consulting experts, working with limited corpora, ignoring the oral dimension), a more linguistic approach (linguistics being essentially a descriptive science) presupposes that the terms are studied in their interactive and discursive dimension.

While the context of the domain allows us to presume monosemy, language outside Melby's

wall is rife with polysemy. Mother terms thus need to be disambiguated (sense tagged) before

any reliable DiaSketching can take place.

Table 1 lists twenty terms from the ANSDIT8 terminology compilation, which have the highest average relative frequency in a 1.8M word fragment of a specialized corpus (PcPlus) and a 64M word fragment of a newspaper corpus (New York Times). Not surprisingly, the lemmas seem to represent very superordinate concepts which all - with the possible exception of software, PC, computer, Internet and CD - have one or more general language senses in addition to the domain-specific sense. These general-language senses may cause more or less noise (the majority of the windows in New York Times are physical ones), but it is obvious that candidate mother terms need to be sense-tagged before they can be subjected to diasketching.

Table 1 – Terms from ANSDIT present in both the LSP and LGP corpora, ranked by average relative frequency in 2001 (2000 rank in parentheses); query: [lemma=term & pos="N.*"]

Rank  Lemma             NYT (abs)  PCP (abs)  NYT (rel)7  PCP (rel)  Avg. rel.
1     window (2)        6013       3946       92.7        2113.1     1102.9
2     PC (1)            2409       3823       37.1        2047.3     1042.2
3     file (4)          5001       3481       77.1        1864.1     970.6
4     software (5)      5714       2385       88.1        1277.2     682.7
5     image (13)        7301       2303       112.6       1233.3     672.9
6     user (7)          5021       2250       77.4        1204.9     641.2
7     information (10)  20778      1373       320.4       735.3      527.8
8     Internet (6)      17633      1399       271.9       749.2      510.5
9     feature (9)       10149      1493       156.5       799.5      478
10    service (11)      24993      1044       385.4       559.1      472.2
11    code              4131       1538       63.7        823.6      443.7
12    computer (16)     14812      1228       228.4       657.6      443
13    button (15)       1137       1345       17.5        720.3      368.9
14    screen (19)       4244       1207       65.4        646.4      355.9
15    object (12)       2073       1243       32          665.6      348.8
16    CD (17)           2872       1211       44.3        648.5      346.4
17    web (161)         1910       1212       29.5        649        339.2
18    server (14)       902        1221       13.9        653.9      333.9
19    memory (18)       4716       1066       72.7        570.9      321.8
20    package (26)      5174       1041       79.8        557.5      318.6

7 Relative frequencies are given in occurrences per million running words
8 American National Standard Dictionary of Information Technology (app. 5,500 terms), www.incits.org

5.1 The Yarowsky algorithm

Sense tagging can be performed using supervised or unsupervised methods. Since supervised methods presuppose a large corpus which has been manually disambiguated, they are labour-intensive and not appealing in the present case. In

his landmark paper from 1995, Yarowsky [29] proposes and evaluates a semi-unsupervised

WSD algorithm, which achieves accuracy rates rivaling those of the best supervised methods.

It does so by making two very simple, but powerful assumptions, namely that polysemy is

restricted by the fact that polysemous lexical units typically have one sense per discourse and

one sense per collocation. Thanks to these two assumptions, the algorithm only needs a few

seed collocates (sense indicators), and can then use the assumptions as "bridges" to new

contexts from which more (or better) sense indicators can be retrieved in an iterative fashion.

The basic steps in the Yarowsky algorithm are as follows:

1. identify all occurrences of the polysemous word and store the contexts

2. for each possible sense identify a seed word (e.g. through vs. pop-up for window)

3. using these sense indicators, extract a seed set of manually disambiguated occurrences (table 2)

4. for each collocation type in the seed set compute log(P(sense_a|collocation_i)/P(sense_b|collocation_i)) <=> log((f(sense_a)+0.5)/(f(sense_b)+0.5))9

5. order the collocations by numerical log-likelihood ratio to get a decision list (table 3)

6. apply the decision list on all contexts from step 1 to classify more occurrences

7. optionally apply the one-sense-per-discourse assumption to classify more occurrences

8. iterate steps 4-7

9. stop when the decision list is unchanged; new data can now be sense tagged.
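The core of the decision-list steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used here: the sense labels comp/phys follow table 2, while the data structures and seed contexts are invented for the example.

```python
from collections import defaultdict
from math import log

def build_decision_list(seed_set, smoothing=0.5):
    """Steps 4-5: score every collocate by smoothed log-likelihood ratio
    log((f(sense_a)+0.5)/(f(sense_b)+0.5)) and sort strongest-first."""
    counts = defaultdict(lambda: defaultdict(float))
    for context, sense in seed_set:          # context = set of collocates
        for collocate in context:
            counts[collocate][sense] += 1
    decisions = []
    for collocate, by_sense in counts.items():
        llr = log((by_sense["comp"] + smoothing) / (by_sense["phys"] + smoothing))
        decisions.append((abs(llr), collocate, "comp" if llr > 0 else "phys"))
    return sorted(decisions, reverse=True)   # strongest evidence first

def classify(context, decisions, default="phys"):
    """Step 6: the highest-ranked collocate present in the context decides."""
    for _, collocate, sense in decisions:
        if collocate in context:
            return sense
    return default
```

Re-running build_decision_list on the newly classified contexts (steps 7-9) would then grow the seed set iteratively until the list stabilizes.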

Table 2 – seed set of manually disambiguated occurrences of window

phys  at 5 a.m. when the bullet crashed through the <window> , law enforcement officials said . The
phys  When your turn comes , bark your order through the <window> . If for any reason they ignore
phys  she was shooting skeet , staring from the castle <window> , looking through the gloaming at t
phys  little children . The sea is visible through the <windows> of the room where these gargantuan
comp  or zoom-in tools that left me stranded in a pop-up <window> . My computer never crashed .
comp  lure to get you to wade through a swamp of pop-up <windows> and banners hawking their pr
comp  , often used on Web sites to create pop-up <windows> and navigational aids , can be embedd
…

9 Additive smoothing (+0.5) is used to avoid division by zero


This kind of (virtually) unsupervised approach will be used to sense tag all occurrences of

candidate mother terms in the final implementation of the DiaSketch, but even DiaSketches of

candidate mother terms which have not been semantically disambiguated may yield

interesting results as can be seen in the case study of section 6.

5.2 Corpus annotation and CQL

In a study of determinologization, which is essentially an aspect of language change, the

dimension of time is of course the main variable. To gain a better understanding of the

linguistic properties of conceptual fuzziness, however, the dimension of text type should also

be examined. The corpora listed in table 4 are included in the study and represent highly specialized discourse (computer science papers from The Computer Journal10), popular science (the British computer magazine PcPlus11), technical online discourse (newsgroup postings) and general language (newspaper corpora from the Gigaword corpus, which includes New York Times).

In order to identify relational co-occurrences the corpora are PoS tagged (Penn tagset) and lemmatized with the TreeTagger12 and subsequently phrase chunked with Yamcha13. An example of an annotated corpus fragment can be seen in table 5 (where B indicates the beginning of a phrase and I indicates a non-boundary). The corpora are finally converted into the special format required by Corpus WorkBench [24]. CWB includes a Corpus Query Language (CQL), which allows sophisticated queries using regular expressions over combinations of positional and/or structural attributes. In the case of the DiaSketch implementation we use the four positional attributes token, PoS, lemma and chunk.

Table 3 – initial decision list based on a seed set of 32 disambiguated occurrences of the polysemous word window

LogL   phys. sense  comp. sense  collocation    pos  sense
-1.46   0           14           pop-up         any  comp
-1.40   0           12           pop-up         -1   comp
 1.36  11            0           through        -2   phys
 1.28   9            0           the            -1   phys
 0.85   3            0           visible        any  phys
 0.85  17            2           through        any  phys
-0.70   0            2           create         any  comp
-0.70   0            2           programs       any  comp
-0.70   0            2           computer       any  comp
 0.70   2            0           stained-glass  any  phys
…

10 http://www3.oup.co.uk/computer_journal/ (March 7, 2005)
11 http://www.pcplus.co.uk/ (March 7, 2005)
12 http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger (March 4, 2005)
13 http://chasen.org/~taku/software/yamcha (March 4, 2005)


5.3 Contingency tables and UCS

Stefan Evert [8] describes the benefits and pitfalls of a number of statistical association

measures implemented by Evert himself in the Utilities for Co-occurrence Statistics (UCS)14

toolkit. He favours relational co-occurrence over simple positional co-occurrence because the

former reduces noise from grammatically unrelated n-grams and leads to more meaningful

results [7]. The UCS tables generated by Evert's Perl scripts are called frequency signatures and contain four values: the joint frequency (O11), the marginal frequencies (O12, O21) and the sample

size (N). This corresponds to a classical four-celled contingency table like table 6, where O11

is the number of times partition and server co-occur in the sample (of size N) of all noun

bigrams in the corpus, O12 is the number of times

partition occurs with another noun in this sample

and O21 the number of times server occurs with

another noun.

Comparing the observed with the expected

frequencies (E11-E22) provides a measure of the

strength of association of the two lemmas. This

association measure can be based on a number of

statistical models, but in the case study in section 6

Evert's implementation of Fisher's exact test is used because it provides p-values which are not approximations, and it "is now generally accepted as the most appropriate test for independence in a 2-by-2 contingency table"15. With these p-values it is straightforward to enforce a cut-off level of significance (for example p < 10⁻⁶).
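For a 2-by-2 table the expected frequencies and a one-sided Fisher p-value can be computed directly from the four observed cells. The sketch below is a self-contained illustration based on the hypergeometric distribution, not Evert's UCS code:

```python
from math import comb

def expected_frequencies(o11, o12, o21, o22):
    """E_ij = (row total * column total) / N, as in table 6."""
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22            # row totals
    c1, c2 = o11 + o21, o12 + o22            # column totals
    return (r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n)

def fisher_p_greater(o11, o12, o21, o22):
    """One-sided Fisher's exact test: the probability of observing a joint
    frequency of at least O11 under the independence hypothesis."""
    r1, c1 = o11 + o12, o11 + o21
    n = o11 + o12 + o21 + o22
    total = comb(n, c1)
    return sum(comb(r1, k) * comb(n - r1, c1 - k)
               for k in range(o11, min(r1, c1) + 1)) / total
```

A cut-off like p < 10⁻⁶ is then a simple comparison against the returned value.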

Table 5 – Annotated corpus slice

Token    PoS  Lemma    Chunk
He       PP   he       B-NP
reckons  VVZ  reckon   B-VP
the      DT   the      B-NP
current  JJ   current  I-NP
account  NN   account  I-NP
deficit  NN   deficit  I-NP
will     MD   will     B-VP

Table 4 – Diachronic corpora

                     Computer Journal  PcPlus     Newsgroups    LDC Gigaword
Tokens               4.5M              6M         100Bn16       1.7Bn
Time frame           1997-2004         2000-2004  1981-present  1994-2002
Assumed degree of
determinologization  none              low        low/moderate  high

14 http://www.collocations.de (March 4, 2005)
15 Evert (2004) - http://www.collocations.de/AM/ (March 13, 2005)
16 The Google archives contain more than 1 billion postings, and an average length of 100 words seems reasonable.


Table 6 – Contingency table

               v = server       v ≠ server
u = partition  O11              O12
u ≠ partition  O21              O22

u = partition  E11 = (R1*C1)/N  E12 = (R1*C2)/N
u ≠ partition  E21 = (R2*C1)/N  E22 = (R2*C2)/N

5.4 Computing DiaSketches with CQL and UCS

The original wordsketches as implemented in the sketchengine [15] plot statistically salient co-occurrence pairs from a set of twenty-odd grammatical relations. The present implementation of the DiaSketch accepts only nouns as input and identifies significant co-occurrences, which are noun or adjective modifiers of the node or predicates which subcategorize for the node as subject or object. In the case of the SUBJ_OF relation, a CQL query like

[pos="NN.*"][pos="WDT"]?[chunk=".*-VP"]*[pos="VV[ZPDG]?"]17

will retrieve (virtually) all these relations in the given corpus slice.

Table 7 – SUBJ_OF examples from New York Times

agency/NN/I-NP has/VHZ/B-VP been/VBN/I-VP recruited/VVN/I-VP to/TO/I-VP help/VV/I-VP convince/VV/I-VP
model/NN/I-NP that/WDT/B-NP made/VVD/B-VP
schools/NNS/I-NP serving/VVG/B-VP
industry/NN/I-NP might/MD/B-VP not/RB/I-VP exist/VV/I-VP

By specifying that the PoS of the rightmost verb must be simple present (VVZ), simple past

(VVD), gerund (VVG) or infinitive (VVP), we filter out passives (which have PoS=VVN)

from this set of SUBJ_OF relations. Moreover, by setting the matching strategy to "longest",

we make sure that we get the main verb of complex VPs like to help convince (cf. table 7). If

longer relative clauses intervene, however, the pattern will simply match the first main verb.

This can only be avoided by carrying out a computationally expensive full parsing.

The lemma pairs (i.e. agency/convince, model/make etc.) are then extracted and piped into UCS, yielding the N value mentioned in section 5.3. In case we want a sketch of the term server, the general query is simply transformed to a specific query by substituting [lemma="server"] for [pos="NN.*"]. Such a query will then provide all the O11, O12 and O21 values needed to complete the contingency table and compute the co-occurrence statistics.

17 A noun in plural or singular, possibly followed by a relative pronoun, any number of VP chunk elements and finally a mandatory full verb.
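The step from extracted lemma pairs to a frequency signature can be illustrated with a small sketch. The helper below is hypothetical (the actual pipeline pipes the pairs into Evert's UCS scripts), but it shows how O11, O12, O21 and N relate to the pair list:

```python
from collections import Counter

def frequency_signature(pairs, u, v):
    """Return (O11, O12, O21, N) for the pair type (u, v) in a list of
    extracted (noun, verb) lemma pairs, i.e. the values needed for the
    contingency table in table 6."""
    counts = Counter(pairs)
    o11 = counts[(u, v)]                                          # joint frequency
    o12 = sum(n for (a, b), n in counts.items() if a == u and b != v)
    o21 = sum(n for (a, b), n in counts.items() if a != u and b == v)
    return o11, o12, o21, len(pairs)                              # N = sample size
```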


6 Case study: "server"

According to the MacMillan English Dictionary for Advanced Learners (2002) the lexical unit

server has five senses:

1. a computer

2. player who starts to play

3. large spoon/fork/etc

4. sb who helps in church

5. sb who brings food

Judging by the collocations in the general language DiaSketch for server (figure 4), senses 2

through 5 seem to be virtually absent (except for the collocation altar server which is

indicative of sense 4). So in this case semantic tagging was not a critical issue. Defining what

counts as a significant collocation can be difficult, but in this case study we only include those

co-occurrences which defeat the Null Hypothesis at a significance level of p < 0.000001¹⁸ (or –log(p) > 6) and where the number of co-occurrences (O11) exceeds 2% of the marginal frequency of the term in question (O21). While the absolute number of occurrences of the lemma server in this relation is approximately the same in the two corpora (some 800 per time slice), the relative frequencies of course differ tremendously.

While figure 3 charts the most significant modifiers of the term server through time in

a 6M word slice of the British computer magazine PcPlus, figure 4 lists the most significant

modifiers of the same term in a 915M word slice of the New York Times (NYT) corpus.

Strength of association, as measured in negative logarithmic p-values, is indicated along the

y-axis and the collocation candidates are listed along the x-axis. A striking difference between

the two figures is that the three noun modifiers network, Internet and computer are prominent

collocations in the newspaper corpus (throughout the timeframe) but do not occur in the

DiaSketch for the corpus of computer magazines. While network server and Internet server

are infrequent19 variants of the term web server (which is highly salient in both corpora),

computer server is a case of determinologized usage. In this compound the noun modifier is

18 Mail correspondence with Stefan Evert has led me to believe that this is the standard significance level for collocation strength as computed by means of contingency tables.
19 http://scholar.google.com (May 4, 2005): computer server (680 hits), Internet server (3,320 hits), network server (3,800 hits), web server (62,800 hits)


used as a kind of domain label to distinguish the IT sense of the head noun from other senses

possible outside this domain (for example the altar server). The expression is no more fuzzy

than server on its own, but the modifier would be redundant in specialist communication and

might cause noise in an ATR system.

The only example of a collocation referring to a fuzzy category is powerful server

which climbs above the strict threshold values in two time slices of the NYT corpus (Fall of

1994 and 1999). While the adjectival modifier in PcPlus (virtual) combines with the mother

term to form a clear-cut subordinate concept in a generic relation to server, the collocation of

powerful with server modulates the special reference of server so that the combined phrase no

longer refers to a clear-cut domain specific concept. What exactly is denoted by powerful

server? Does powerful refer to the storage capacity of the server as measured in gigabytes or

to its clock rate as measured in MHz or rather to the data transmission rate of its network

interface as measured in gigabit per second?

Figure 3 – Modifiers of the term server in PcPlus (2000-2003)

[Bar chart: association strength (–log p) on the y-axis for the collocates mail, application, proxy, X, web, your, FTP, DNS, font, NIS, central and virtual, with one bar per yearly time slice (2000-2003).]


Figure 4 - Modifiers of the term server in New York Times (1994-2001)


7 Conclusion

Based on a lengthy discussion of the theoretical foundations of terminology (sections 3 and 4)

we arrived at 1) a functional definition of the term in which termhood is contingent upon

context and 2) an operational definition of determinologization as the process by which the

relational co-occurrence patterns of a term20 in non-specialized discourse come to resemble those of comparable lexical units from the general vocabulary. This contextual theory of

terminological meaning was the basis for the implementation of the DiaSketch (section 5)

with which an example of conceptual fuzziness was identified by extracting salient relational

collocations of the term server in a specialized and a general language corpus, respectively

(section 6). Judging by the DiaSketches of the term server in a 9-year fragment of the New

York Times corpus, we must conclude that Melby’s stone/clay analogy as described in section

3.3 seems to be inaccurate and that we are dealing with a greyscale where lexical units, which

invoke clear-cut concepts in some contexts, come to represent increasingly prototypical

categories in other contexts.

7.1 Further work

Although DiaSketches of a single term seem to indicate that termhood is gradable, usage

patterns of a wider range of (sense tagged) mother terms21 need to be analyzed to see if the

tendencies from the case of server can be generalized. While examining co-occurrence

patterns for relations like SUBJ_OF might provide a richer description of determinologized

usage, this may not be feasible due to data sparseness. However, it seems likely that even

identifying simple positional co-occurrence patterns will make it possible to improve the

precision of ATR software using the Internet as a corpus. A list of modifiers typically used

with mother terms in non-specialized discourse could for example be used as a document

classification device or a simple filtering device. The possible increase in precision brought

about by such a filter would then need to be evaluated by running the web-based ATR system

with and without the filter.

Finally, it would be interesting to see if certain terms are less context sensitive than

others. Are terms which have been terminologized by metaphor (e.g. bus, icon, mouse) more susceptible to subsequent determinologization than terms formed by formal neology (e.g. byte) or by compounding (e.g. operating system)?

20 Strictly speaking, a lexical unit which functions as a term in specialized communicative contexts
21 The list in table 1 could be a starting point


References

[1] Barry, John A. (1993) Technobabble MIT Press

[2] Baroni, Marco (2004) "BootCat: BootStrapping Corpora and terms from the web" In:

Proceedings of LREC 2004

[3] Cabré Castellví, María Teresa (1999) "Do We Need an Autonomous Theory of

Terms?" In: Terminology 5:1, pp. 5-19, John Benjamins

[4] Cabré Castellví, María Teresa (2000) "Elements for a theory of terminology: Towards

an alternative paradigm" In: Terminology 6:1, pp. 35-57, John Benjamins

[5] Cabré Castellví, María Teresa (2003) "Theories of terminology - their description,

prescription and explanation" In: Terminology 9:2, pp. 163-199, John Benjamins

[6] Cruse, D.A. (1986) Lexical Semantics Cambridge University Press

[7] Evert, Stefan; Brigitte Krenn (2003) "Computational approaches to collocations"

Introductory course at the European Summer School on Logic, Language and Information

(ESSLLI)

[8] Evert, Stefan (2004) The Statistics of Word Co-occurrences: Word Pairs and

Collocations PhD thesis, University of Stuttgart

[9] Gambier, Yves (1991) “Travail et vocabulaire spécialisés: Prolégomènes à une socio-

terminologie” In: Meta, 36(1), pp. 8-15, Les Presses de l'Université de Montréal

[10] Gambier, Yves (1987) ”Problèmes terminologiques des pluies acides: Pour une socio-

terminologie” In: Meta, 32(3), pp. 314-320, Les Presses de l'Université de Montréal

[11] Gaudin, François (1993) “Socioterminologie: du signe au sens, construction d’un

champ”, In: Meta, 38(2), pp. 293-301, Les Presses de l'Université de Montréal

[12] Järvi, Outi (2001) "From Precise Terms to Fuzzy Words - from Bad to Worse in

Terminology Science?" In: IITF Journal vol. 12, no. 1-2, pp. 85-88

[13] Kageura, Kyo (2002) The Dynamics of Terminology - a descriptive theory of term

formation and terminological growth John Benjamins

[14] Kilgarriff, Adam; David Tugwell (2001) "WORD SKETCH: Extraction and Display

of Significant Collocations for Lexicography" In: Proceedings of ACL 2001, Toulouse,

France, pp. 32-38

[15] Kilgarriff, Adam (2004) "The sketch engine" In: Proceedings of the 11th EuraLex

International Congress

[16] Melby, Alan K. (1995) The Possibility of Language John Benjamins


[17] Meyer, Ingrid; Kristen Mackintosh (2000) "When terms move into our everyday lives:

An overview of de-terminologization" In: Terminology vol. 6:1, pp. 111-138, John

Benjamins

[18] Meyer, Ingrid; Kristen Mackintosh (2000) "L'étirement du sens terminologique:

aperçu du phénomène de la déterminologisation" In: Le Sens en Terminologie Ed. by Henri

Béjoint and Philippe Thoiron, Presses universitaires de Lyon, pp. 198-217

[19] Myking, Johan (2000) "Sosioterminologi - Ein modell for Norden?" In: I

Terminologins tjänst - Festskrift för Heribert Picht på 60-årsdagen, University of Vaasa

[20] Pearson, Jennifer (1998) Terms in Context John Benjamins

[21] Sager, Juan C.; David Dungworth (1980) English Special Languages Brandsetter

Verlag, Wiesbaden

[22] Sager, Juan C. (1990) A practical course in terminology processing John Benjamins

[23] Sager, Juan C. (1999) "In search of a foundation: Towards a theory of the term" In:

Terminology 5:1, pp. 41-57, John Benjamins

[24] Schulze, Bruno M. (1994) Entwurf und Implementierung eines Anfragesystems für maschinelle Textcorpora. Master's thesis, Institut für Maschinelle Sprachverarbeitung

(IMS), Stuttgart University

[25] Schütze, Hinrich; Christopher D. Manning (1999) Foundations of Statistical Natural

Language Processing MIT Press

[26] Temmerman, Rita (2000) Towards New Ways of Terminology Description. The

sociocognitive approach John Benjamins

[27] Wittgenstein, Ludwig (1997) Philosophical Investigations translated by G. E. M.

Anscombe Basil Blackwell, orig. 1953

[28] Wüster, Eugen (1991) Einführung in die allgemeine Terminologielehre und

terminologische Lexikographie. Bonn: Romanistischer Verlag, 3. Auflage, orig. 1979

[29] Yarowsky, David (1995) "Unsupervised Word Sense Disambiguation Rivaling

Supervised Methods" In: Proceedings of ACL 33, pp. 189-196