Dictionaries
See Patrick Hanks “Lexicography” chapter 3 of Mitkov, R. (ed.) The Oxford Handbook of Computational Linguistics, Oxford: OUP, 2004.
Dictionaries/Lexicons
• Lexicography and the computer
• Corpus-based lexicography
• MRDs
• Dictionaries for NLP
• Thesauri: structured lexicons
Computational lexicography
• Restructuring and exploiting human dictionaries for use by computer programs
• Using computational techniques to compile (new) dictionaries
• Focus on English (and other well established languages)
• Significantly different issues for other languages, especially
  – Alphabetization and arrangement
  – Compilation from scratch for previously unstudied languages
Human dictionaries
• Traditional view of what a “dictionary” is
  – List of words, arranged (usually) alphabetically
  – Inclusion in dictionary lends authority, even proscriptively
  – Entry typically gives
    • spelling ... alternate spellings
    • POS, morphology (if irregular)
    • core definition (using defining vocab?)
    • pronunciation (using own transcription)
    • etymology
    • examples of usage
      – as justification for inclusion
      – as illustration of use (esp. learner’s dictionaries)
  – Entry typically doesn’t give
    • help with spelling
    • morphology (if regular), especially derivational
    • subcategorization information
    • contrastive examples of use
    • indications of possible metaphorical extensions to meaning
Human dictionaries
• Historically
  – bilingual dictionaries for translators
  – monolingual dictionary as (pre/proscriptive) definition of language, often polemical
  – OED (1884-1928) first dictionary on purely descriptive principle, relying on citations
• Deficiencies and difficulties
  – What to include? (neologisms, slang)
  – Inclusion of names
  – Differentiating senses
Differentiating word senses
• Dictionaries disagree widely
• Probably no right answer
• General principles (look for excuse to split vs look for reason to lump)
• Keep related words of different POS together?
• Etymology can be misleading (eg crane, pupil)
• Metaphorical extension of original meaning – how far do you go? (eg rose, bar)
• Purpose of dictionary may help decide, eg translation
Citations
• Senses and uses identified by collecting examples of use
  – Sent in on “slips” by informants
  – Lexicographer’s job is to collate these
• Criteria for a new word (or new meaning)
  – Number of citations
  – Source of citations
  – Veracity of use
Corpus-based dictionaries
• Corpus: a collection of texts, usually assembled with a specific purpose in mind
• British National Corpus: an attempt to capture a synchronic picture of BrE of the late 1980s (100m words)
• COBUILD “Bank of English”: a dynamic “monitor” corpus used to help lexicographers identify/define usage
Machine-readable dictionaries
• “Machine” means “computer”
• Dictionary stored in a format which makes it manipulable on a computer
• Originally, derived from MR version of print dictionary (from type-setter’s tapes)
• Now the other way round: data stored as a database from which hard copy can be printed (inter alia)
MRDs - advantages
• Flexibility of access and presentation
  – Not bound to alphabetical listing
  – Information presented can be filtered
  – Can be searched as a database
  – Different versions (for different users, serving different purposes) can be produced
• Increased storage capacity
  – More information can be stored, especially
    • Implicit information can be made explicit
    • More examples, including “negative data”
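The access advantages listed above can be sketched in a few lines of Python. This is only an illustration: the `Entry` fields, the toy entries and the `search` and `learner_view` helpers are all assumptions made for the example, not the format of any real MRD.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    headword: str
    pos: str
    definition: str
    examples: list[str] = field(default_factory=list)


# Toy data; the entries and field names are invented for illustration.
ENTRIES = [
    Entry("crane", "noun", "a large long-legged bird",
          ["A crane waded in the shallows."]),
    Entry("crane", "noun", "a machine for lifting heavy loads"),
    Entry("reckon", "verb", "to think or suppose"),
    Entry("rose", "noun", "a prickly shrub with fragrant flowers"),
]


def search(entries, pos=None, definition_contains=None):
    """Return the entries that match every filter that was supplied."""
    hits = []
    for entry in entries:
        if pos is not None and entry.pos != pos:
            continue
        if (definition_contains is not None
                and definition_contains not in entry.definition):
            continue
        hits.append(entry)
    return hits


def learner_view(entry):
    """One possible 'version': headword, definition and first example only."""
    example = entry.examples[0] if entry.examples else ""
    return f"{entry.headword}: {entry.definition}. {example}".strip()
```

Searching by part of speech or by a word in the definition is access that an alphabetical print listing cannot offer, and `learner_view` stands in for producing a different presentation from the same stored data.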
Lexicons for NLP
• Have to state everything we need to know about the word
  – Phonology: stress pattern, possible weak forms
  – Orthography: spelling alternatives, hyphenation
  – Morphology: inflectional paradigms, even if regular
  – Information about derivations
  – Syntax: explicit information about subcategorization
    • eg syntactic/semantic features of arguments
    • Any special interpretation of tenses
  – Lexical combinatorics: compounds, idioms
  – Semantics: definition, semantic features, semantic relations
  – Pragmatics: register, collocation, connotation
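As a rough sketch of what "stating everything" might look like as a data structure, the record below gives each level of description its own field. The class name `NLPEntry`, the field names, the stress notation and the sample values for bake are assumptions for illustration only. The `regular_verb_forms` helper is deliberately naive.

```python
from dataclasses import dataclass, field


@dataclass
class NLPEntry:
    lemma: str
    pos: str
    stress: str                      # phonology, e.g. "1" = one stressed syllable
    spellings: list[str]             # orthography: accepted variants
    inflections: dict[str, str]      # morphology: listed even when regular
    subcat_frames: list[list[str]]   # syntax: one list of complements per frame
    semantic_features: set[str] = field(default_factory=set)
    register: str = "neutral"        # pragmatics


def regular_verb_forms(lemma):
    """Spell out a regular paradigm (naive: no consonant doubling, no y -> ied)."""
    stem = lemma[:-1] if lemma.endswith("e") else lemma
    return {
        "3sg": lemma + "s",
        "past": stem + "ed",
        "past_participle": stem + "ed",
        "present_participle": stem + "ing",
    }


# Sample entry; the frames and features are illustrative guesses.
bake = NLPEntry(
    lemma="bake",
    pos="verb",
    stress="1",
    spellings=["bake"],
    inflections=regular_verb_forms("bake"),
    subcat_frames=[["subj"], ["subj", "obj"], ["subj", "obj", "obj2"]],
    semantic_features={"accomplishment"},
)
```

A human dictionary would leave the regular forms of bake to the reader. Here they are generated once and stored, so a program never has to guess them.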
Lexicons for NLP - example
• Information about derivations
• Agentive derivation (-er) is very productive
  – Usually means the actor doing the action of a verb, e.g. swimmer, dancer, killer
  – Not available for some verbs, e.g. *knower, *cycler, *hoper, *sayer (though cf soothsayer)
  – May have a specialised meaning instead of or as well as the derived meaning, e.g. revolver, computer, washer, hitter
  – In some cases can mean the object undergoing the action (via ergative use of verb), e.g. taster
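One way to picture this is as a productive rule plus lexically listed exceptions. The sketch below uses only the example words above. The spelling rule in `er_form` is a simplification (it would wrongly double the final consonant of some longer verbs), and the function names and return format are assumptions.

```python
VOWELS = set("aeiou")

# Lexical exceptions, taken from the examples above.
BLOCKED = {"know", "cycle", "say", "hope"}            # *knower, *cycler, ...
SPECIALISED = {"revolve", "compute", "wash", "hit"}   # revolver, computer, ...


def er_form(verb):
    """Attach -er with naive English spelling adjustments."""
    if verb.endswith("e"):
        return verb + "r"
    ends_cvc = (
        len(verb) >= 3
        and verb[-1] not in VOWELS
        and verb[-1] not in "wxy"
        and verb[-2] in VOWELS
        and verb[-3] not in VOWELS
    )
    if ends_cvc:
        return verb + verb[-1] + "er"   # swim -> swimmer
    return verb + "er"


def agentive(verb):
    """Return (form, note), or None if the lexicon blocks the derivation."""
    if verb in BLOCKED:
        return None
    note = "specialised" if verb in SPECIALISED else "agent"
    return er_form(verb), note
```

The rule covers the productive cases. The two exception sets are exactly the kind of information an NLP lexicon has to store per word, because nothing in the form of know or hope predicts the gap.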
Subcategorization
• Words are assigned to categories (ie parts of speech, POS), eg noun, verb
  – on basis of form, meaning, use
• Syntactic behaviour is predictable from (or determined by) category
• Within a category there are subcategories with specific patterns of behaviour, both syntactic and semantic, e.g.
  – transitive/intransitive verb: direct object? passivize?
Subcategorization
• Subcat frames indicate complement patterns and preferences, e.g.
  – subj, obj, double obj, prep-obj, infinitival complement, that complement etc
  – semantic features of complements, eg obj of eat normally edible
• Subcat information can help to disambiguate
  – cf He told [the man] [where the body was buried].
  – He found [the place [where the body was buried]].
• Much of this info can be captured in general rules
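A minimal sketch of how subcat frames could drive the choice between the two analyses of the tell and find sentences. The frame inventory in `SUBCAT` is a toy assumption covering just these two verbs, and `bracket` expresses a preference, not a full parser.

```python
# Toy frame inventory: each frame lists the complements after the subject.
SUBCAT = {
    "tell": {("obj",), ("obj", "wh_clause")},
    "find": {("obj",)},
}


def takes_wh_complement(lemma):
    """True if the verb licenses an object followed by a wh-clause."""
    return ("obj", "wh_clause") in SUBCAT.get(lemma, set())


def bracket(subject, verb_form, lemma, noun_phrase, wh_clause):
    """Bracket 'V NP wh-clause' according to the verb's subcat frames."""
    if takes_wh_complement(lemma):
        # The clause is a second complement of the verb.
        return f"{subject} {verb_form} [{noun_phrase}] [{wh_clause}]"
    # Otherwise attach the clause inside the object as a relative clause.
    return f"{subject} {verb_form} [{noun_phrase} [{wh_clause}]]"


print(bracket("He", "told", "tell", "the man", "where the body was buried"))
print(bracket("He", "found", "find", "the place", "where the body was buried"))
```

The two strings have the same surface shape. Only the lexical entry of the verb distinguishes the structures.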
• Have to state everything we need to know about the word, though not necessarily explicitly
  – There can be rules to capture inheritance of properties, e.g.
    • accomplishment + prog tense implies incompletion
    • cf She was baking a cake when she dropped dead (so no cake)
    • She was stroking the cat when she dropped dead (but the cat was stroked)
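A hedged sketch of such inheritance: the property is stated once per aspectual class and looked up through the verb's class, with room for a per-verb override. The class labels, the property name and the tables are assumptions for illustration. In practice the class often depends on the verb together with its object (bake a cake), which this toy lookup ignores.

```python
# Properties stated once, at class level.
CLASS_PROPERTIES = {
    "accomplishment": {"progressive_entails_completion": False},
    "activity": {"progressive_entails_completion": True},
}

# Each verb only names its class.
VERB_CLASS = {"bake": "accomplishment", "stroke": "activity"}

# Entry-level statements win over inherited ones (none needed here).
OVERRIDES = {}


def lookup(verb, prop):
    """Return the verb's own value if stated, else inherit from its class."""
    own = OVERRIDES.get(verb, {})
    if prop in own:
        return own[prop]
    return CLASS_PROPERTIES[VERB_CLASS[verb]][prop]


# "She was baking a cake when she dropped dead": no finished cake.
print(lookup("bake", "progressive_entails_completion"))
# "She was stroking the cat when she dropped dead": the cat was stroked.
print(lookup("stroke", "progressive_entails_completion"))
```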
Exploiting human dictionaries in NLP
• In all NLP applications, lexicon is major bottleneck
• Availability of MRD versions of human dictionaries provided possible solution
  – Obviously, MRD gives list of words, and some information
  – Extract further information about verb frames by analysing the examples
  – Identify semantic features from definitions, eg a plant which..., a person who...
  – Identify hidden arguments, eg to lock = to close sthg using a key
    • cf He locked the door. The key was heavy.
    • He emptied his pockets. *The key was heavy.
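Identifying semantic features from definitions can be approximated with a pattern over the opening words. The sketch below is an assumption-laden toy: the regular expression, the `FEATURES` table and the sample definitions in the usage lines are invented, and real definitions are far less regular.

```python
import re

# Matches definitions of the shape "a/an/the NOUN which|who|that ...".
GENUS_PATTERN = re.compile(
    r"^(?:an?|the)\s+(\w+)\s+(?:which|who|that)\b", re.IGNORECASE
)

# Hypothetical mapping from genus term to semantic feature.
FEATURES = {"person": "+human", "plant": "+plant"}


def genus_term(definition):
    """Return the head noun of an 'a X which/who ...' definition, if any."""
    match = GENUS_PATTERN.match(definition.strip())
    return match.group(1).lower() if match else None


def semantic_feature(definition):
    """Map the genus term to a feature, or None when nothing is recognised."""
    return FEATURES.get(genus_term(definition))


print(semantic_feature("a person who bakes bread"))
print(semantic_feature("a plant which grows in water"))
print(genus_term("to close something using a key"))
```

Verb definitions such as the one for lock do not fit this pattern. Finding their hidden arguments (the key) would need a different analysis.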
Exploiting human dictionaries in NLP
• Generic information about a word and its usage can be derived from definitions in which it occurs:
Wine: alcoholic drink made from fermented juices, especially of grapes
Vintage: a season’s yield of wine from a vineyard
Red wine: wine having a red colour derived from the skins of the grapes used ...
Vineyard: an orchard where grapes are grown for the purpose of wine making
Pinot noir: a dry red Californian table wine
Sake: Japanese rice wine
Claret: a dry red Bordeaux or Bordeaux-like wine
Sherry: a sweet white wine from the Jerez region of Spain
Riesling: a dessert wine made from white grapes grown historically in Germany ...
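A reverse index over the definitions is one simple way to find every entry whose definition mentions a given word. The sketch below loads the definitions above (truncated ones kept as they are) into a dictionary. The function names and the crude tokenisation are assumptions.

```python
import re
from collections import defaultdict

DEFINITIONS = {
    "Wine": "alcoholic drink made from fermented juices, especially of grapes",
    "Vintage": "a season's yield of wine from a vineyard",
    "Red wine": "wine having a red colour derived from the skins of the grapes used ...",
    "Vineyard": "an orchard where grapes are grown for the purpose of wine making",
    "Pinot noir": "a dry red Californian table wine",
    "Sake": "Japanese rice wine",
    "Claret": "a dry red Bordeaux or Bordeaux-like wine",
    "Sherry": "a sweet white wine from the Jerez region of Spain",
    "Riesling": "a dessert wine made from white grapes grown historically in Germany ...",
}


def build_reverse_index(definitions):
    """Map each lower-cased word to the headwords whose definitions use it."""
    index = defaultdict(set)
    for headword, text in definitions.items():
        for token in re.findall(r"[a-z]+", text.lower()):
            index[token].add(headword)
    return index


INDEX = build_reverse_index(DEFINITIONS)


def defined_with(word):
    """Headwords whose definition mentions the word, in sorted order."""
    return sorted(INDEX.get(word.lower(), set()))


print(defined_with("grapes"))
print(defined_with("wine"))
```

Looking up wine gathers the kinds of wine (claret, sherry, sake) along with related notions such as vintage and vineyard. None of this is stated in the entry for wine itself.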
Corpus-based lexicography revisited
• Similarly, analysis of real examples can reveal patterns of usage
  – Identify primary meaning: not always what you’d expect (example of reckon)
  – Identify possible complementation patterns, and their relative frequency
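Counting complementation patterns can be sketched as a tally over corpus lines. Everything here is a toy assumption: the five sentences are invented, not taken from any real corpus, the classification looks only at the word after the verb, and only the bare form reckon is matched (no lemmatisation).

```python
from collections import Counter

# Invented example sentences standing in for corpus lines.
CORPUS = [
    "I reckon that he will come",
    "they reckon on a quick sale",
    "I reckon she is right",
    "we reckon that it is too late",
    "you must reckon with the weather",
]


def complement_pattern(tokens, verb):
    """Classify what follows the first occurrence of the verb (very crude)."""
    position = tokens.index(verb)
    if position + 1 >= len(tokens):
        return "no complement"
    following = tokens[position + 1]
    if following == "that":
        return "that-clause"
    if following in {"on", "with"}:
        return f"prep-obj ({following})"
    return "other"


def pattern_counts(corpus, verb):
    """Tally complement patterns over the sentences that contain the verb."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.lower().split()
        if verb in tokens:
            counts[complement_pattern(tokens, verb)] += 1
    return counts


COUNTS = pattern_counts(CORPUS, "reckon")
for pattern, frequency in COUNTS.most_common():
    print(pattern, frequency)
```

Relative frequency is then just each count divided by the total. A real study would use a parsed or at least POS-tagged corpus instead of next-word lookup.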
Structured dictionaries
• Special type of dictionary in which words are grouped together according to their meaning: thesaurus
• Classic example Roget’s Thesaurus (1852)
• Structured vocabulary much used in field of terminology
• Also now a valuable resource for NLP: Miller’s (Princeton) WordNet (1985)
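To illustrate what grouping words by meaning buys a program, here is a toy hierarchy with two lookups. It is a hand-made sketch: the `HYPERNYM` table and the function names are assumptions and do not reproduce WordNet's data or its API.

```python
# Each word points to its more general term (a tiny hand-made hierarchy).
HYPERNYM = {
    "claret": "red wine",
    "red wine": "wine",
    "sherry": "wine",
    "sake": "wine",
    "wine": "alcoholic drink",
    "alcoholic drink": "drink",
}


def hypernym_chain(word):
    """List the increasingly general terms above a word."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain


def lowest_common_hypernym(first, second):
    """Most specific term that covers both words, or None."""
    covers_first = [first] + hypernym_chain(first)
    for candidate in [second] + hypernym_chain(second):
        if candidate in covers_first:
            return candidate
    return None


print(hypernym_chain("claret"))
print(lowest_common_hypernym("claret", "sake"))
```

An alphabetical dictionary keeps claret and sake far apart. A structured lexicon makes their relatedness a short walk up the hierarchy.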