Corpus annotation and retrieval: an introduction
Paul Rayson, Computing Department, Lancaster University
Dawn Archer, School of Humanities, University of Central Lancashire
Text Mining for Historians, July 17-18 2007, Glasgow University
Session outline
What is a corpus?
What is corpus linguistics?
Applying these techniques to historical data
What research questions can we answer with CL techniques
… in linguistics …?
… in computing …?
… in history …?
1. Background
Corpora, corpus linguistics, annotation, retrieval methods
Underlying assumption
Intuition is not enough to study language …
Reaction to Noam Chomsky’s focus on introspection in the 1950s/60s
Empirical observation of naturally occurring data versus theory of how human language processing is actually undertaken
What is a corpus?
Old meaning = “body of text” (Latin)
Now = (any) “collection of texts or language examples” – usually in an electronic format
Demonstrates extent to which CL-revival led by advances in computing technology
A corpus tends to be “representative”
i.e. a balanced sample of a language or a particular variety of language; cf. national corpora (British, American, Czech, Polish …)
Reasoning?
• Helps to remove intuitive bias
• Helps us to find common/rare phenomena
Exceptions …?
And large …
… because size helps us to:
Establish norms about the variety being studied
Reveal lots of cases of rare features of language
Zipf’s law
Size matters!
Brown/LOB    1960s          1 million
BNC          1990s          100 million
Web          Present day    ? billion
Birmingham corpus    1980    10 million
Collins Bank of English / Cambridge International Corpus / Oxford English Corpus    2006    600 million – 1 billion
Web    Future    ? billions
So what is corpus linguistics?
= the “study of language using corpora”
= empirical methodology
= a useful means of exploring:
• Synchronic and diachronic variation
• Syntax, semantics, pragmatics
• Lexicography
• Dialects, minority languages
• Not just English
Corpus techniques we utilise
Retrieval
• Frequency profiling
• Concordancing
• Collocations
• Key words
• Key domains
Annotation
• POS tagging
• Semantic tagging
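Two of the retrieval techniques listed above, frequency profiling and concordancing, can be sketched in a few lines. This is only an illustration under stated assumptions: the tokenizer is deliberately crude, the function names are hypothetical, and the sample sentence is invented rather than taken from any corpus mentioned in these slides.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_list(tokens, n=10):
    """Frequency profiling: the n most frequent tokens with their counts."""
    return Counter(tokens).most_common(n)


def concordance(tokens, node, window=3):
    """Concordancing: show each hit for `node` with `window` tokens either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}")
    return lines


# Invented toy text, for illustration only.
tokens = tokenize("The cat sat on the mat and the dog sat on the rug.")
print(frequency_list(tokens, 3))
for line in concordance(tokens, "sat", window=2):
    print(line)
```

Real concordancers add sorting of the left and right context and work over far larger texts, but the key-word-in-context idea is the same.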
Annotation: Part of speech tagging, Semantic field tagging
Retrieval: Frequency lists, Concordances
Key words
[Diagram: Text, compared with Text or reference corpus, gives Keywords]
What are “key words”?
And why are they so useful?
Key words
Word Clouds
If we compare text A
… with text B … we can discover the most significant items within text A
… and not only the frequent items
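The comparison of text A with text B can be sketched as follows. The slides do not name a significance statistic, so the log-likelihood measure used here is an assumption (it is one common choice for keyness); the function names and the two toy texts are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Keyness of a word seen a times in c tokens (text A) and b times in d tokens (text B)."""
    e1 = c * (a + b) / (c + d)  # expected frequency in A if usage were identical
    e2 = d * (a + b) / (c + d)  # expected frequency in B if usage were identical
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def key_words(tokens_a, tokens_b, top=5):
    """Rank the words that are relatively more frequent in A than in B."""
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    c, d = len(tokens_a), len(tokens_b)
    scored = []
    for word, a in freq_a.items():
        b = freq_b.get(word, 0)
        if a / c > b / d:  # keep only words over-used in A
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]


# Invented toy texts, for illustration only.
text_a = "the court asked the witness and the witness answered the court".split()
text_b = "the cat sat on the mat and the dog sat on the rug".split()
print(key_words(text_a, text_b, top=3))
```

In this toy case "the" is the most frequent word in text A but scores close to zero, because it is about equally common in text B; that is the sense in which key words are the significant items and not only the frequent ones.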
Collocations
Collocation = a relationship between words that tend to occur together in texts
Words that tend to occur near word X are the collocates of word X (consider “fish and XXXXX”)
Based on frequency (how frequent separate vs. how frequent together)
The company a word keeps: implicit associations or assumptions
Bachelor: eligible, flat, life, days
Spinster: elderly, widows, sisters, parish
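The frequency-based idea above (how often two words occur together versus how often each occurs separately) can be sketched with a mutual-information style score. This is one formulation among several and is an assumption here, as are the function name, the window size and the invented sample text.

```python
import math
from collections import Counter


def collocates(tokens, node, window=2, min_count=1):
    """Score the words found within `window` tokens of `node` by log2(observed / expected)."""
    n = len(tokens)
    freq = Counter(tokens)
    together = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            for j in range(max(0, i - window), min(n, i + window + 1)):
                if j != i:
                    together[tokens[j]] += 1
    scores = {}
    for word, observed in together.items():
        if observed >= min_count:
            # Co-occurrences expected by chance, given the separate frequencies.
            expected = freq[node] * freq[word] * 2 * window / n
            scores[word] = math.log2(observed / expected)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Invented toy text, for illustration only.
tokens = "strong tea and strong coffee but the tea was weak and the coffee was cold".split()
print(collocates(tokens, "strong", window=1))
```

On a text this small the scores mean little; with a large corpus, a minimum co-occurrence count (`min_count`) is usually needed to stop rare words dominating the ranking.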
Corpus software
Modern methods in an historical setting (focussing on EModE period)
• Tools/methods don’t take account of spelling variation
  • Variant spelling detector (VARD)
• The need to use historically valid taxonomies or thesauri, or revise our existing modern tagsets
  • Historical Thesaurus of English
  • Spevack (1993)
Using automated systems of annotation on historical texts is problematic …
EModE texts pose the following “problems”:
• Archaic –eth and –(e)st verb suffixes, e.g. doth, hath, hast, sayeth, etc., which persist in specialised contexts: religious and poetic usage
• Fused forms, e.g. ’Tis (It is)
• Spellings that are variable even in modern-day usage, e.g. center/centre, skilful/skillful/skilfull, the suffixes -or/-our, -ise/-ize
• Archaic forms like howbeit, betwixt, for which no obvious modern equivalent exists
• Compound words, e.g. it self, now adays, in stead
• Proper names of Latin origin that are sometimes modernised, e.g. Galilaeo (Galileo)
Due to different conventions and compositing practices
Previous work in …
Information Retrieval / Corpus linguistics / Natural language processing
• Fuzzy search engine
  • Aimed at successful retrieval for novice users without expertise in the text
  • Expand the search term using known letter replacements
• Changing dictionary built in to corpus annotation software
• Back-dating inbuilt dictionaries by adding historical variants
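Expanding a search term with known letter replacements, as mentioned above, might look like the following sketch. The replacement table (u/v, i/y, i/j) is an illustrative assumption, not the rule set of any particular search engine, and the function names are hypothetical.

```python
# Hypothetical letter replacements; a real system would use a larger, attested rule set.
REPLACEMENTS = {"u": ["v"], "v": ["u"], "i": ["y", "j"], "y": ["i"], "j": ["i"]}


def expand(term, max_changes=2):
    """Return the term plus every form reachable by up to max_changes letter replacements."""
    variants = {term}
    frontier = {term}
    for _ in range(max_changes):
        next_forms = set()
        for word in frontier:
            for pos, ch in enumerate(word):
                for alt in REPLACEMENTS.get(ch, []):
                    next_forms.add(word[:pos] + alt + word[pos + 1:])
        next_forms -= variants  # keep only forms not seen before
        variants |= next_forms
        frontier = next_forms
    return variants


def fuzzy_search(term, tokens):
    """Return the tokens that match the term or any of its expanded variants."""
    wanted = expand(term.lower())
    return [tok for tok in tokens if tok.lower() in wanted]


print(sorted(expand("love", max_changes=1)))
print(fuzzy_search("love", ["Loue", "is", "love", "lone"]))
```

The design choice is to expand the query and leave the text untouched, so a user can type a modern spelling and still retrieve historical variants; the cost is that the number of candidate forms grows quickly with `max_changes`.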
Our scenario
VARD: Detect variant spellings and insert modern equivalents
POS TAGGER
SEM TAGGER
An important point about the VARD
Although the VARD allows for the detection and “normalisation” of variants to their modern equivalents, it should be noted that ...
• The original variants are retained in the text
• We’re not carrying out spell checking per se (no “correct” spelling in EModE period) ...
Our ultimate aim is to develop a system that automatically regularises variants within a text to their modernised forms so that historical corpora become more amenable to further annotation and analysis.
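The principle of regularising variants while retaining the originals can be illustrated with a minimal sketch. It is not the VARD implementation: the lookup table is a tiny hand-made assumption (only Galilaeo/Galileo comes from these slides), and the `<reg orig="…">` output format and function names are hypothetical.

```python
# Hypothetical variant list; a real detector would draw on much larger resources.
VARIANTS = {
    "galilaeo": "Galileo",
    "hath": "has",
    "doth": "does",
}


def normalise(tokens):
    """Pair each token with its modern equivalent, keeping the original form."""
    return [(tok, VARIANTS.get(tok.lower(), tok)) for tok in tokens]


def render(pairs):
    """Write the modernised text, wrapping changed tokens so the original is retained."""
    out = []
    for orig, norm in pairs:
        out.append(norm if orig == norm else f'<reg orig="{orig}">{norm}</reg>')
    return " ".join(out)


pairs = normalise(["Galilaeo", "hath", "a", "telescope"])
print(render(pairs))
```

Because the original form travels with the modern one, later annotation tools can work on the regularised text while the researcher can still see, and search for, what was actually written.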
2. Historical data
Existing corpora
What is already available:
• LOB-family, Brown family (20th Century)
  15 genres: press, religion, skills & hobbies, biography, learned, fiction (detective, science, adventure), romance, humour
• Lampeter (1640-1740)
  Religion, Politics, Economy, Science, Law and Misc.
• Corpus of English Dialogues (1560-1760)
  Trial proceedings, depositions, drama, prose fiction
• Helsinki (Old, Middle and Early Modern English)
• Archer (1650-1990, sampled at 50 year periods)
  Journals, letters, fiction, news, medicine, science
Book Search
Other historical texts – not compiled for corpus linguistics
Changing English Across the 20th Century: a corpus-based study
ucrel.lancs.ac.uk/20thCenturyEnglish/
Leverhulme Trust (2005-7)
       1901         1931                 1961     1991/2
BrE    Lanc-1901    B-LOB (Lanc-1931)    LOB      F-LOB
AmE    ?            Pre-Brown31          Brown    Frown
Background:
Recent observations of significant shifts having occurred among expressions of obligation/necessity in the period 1961-1991 e.g.
• a decline of the central modals MUST and NEED
• a spread of the semi-modals HAVE TO, NEED TO
Research questions
• Are these changes recent?
• How do these changes compare to the development of the semantic field of OBLIGATION/NECESSITY as a whole?
Project outputs
• Compile a new corpus of British English called Lancaster1901
• Enhance the encoding and annotation of Lancaster1901 and the three existing corpora (Lancaster1931, LOB and FLOB)
• 10 conference presentations
• 1 book chapter
• 1 book
• 2 journal articles
Application 2: Historical CL
In particular, courtroom research (1640+), from a linguistic perspective
Utilise a specially designed corpus – Sociopragmatic Corpus – which has been annotated for:
• age, gender, status and role
• speech acts such as questions, requests and commands
<P 37>
[$ (^Record.^) $] <u stfunc="fol-ini" force="q" q="qy" qtype="d" qform="dec" speaker="s" spid="s4tgiles001" spsex="m" sprole1="re" spstatus="1" spage="8" addressee="s" adid="s4tgiles027" adsex="f" adrole1="w" adstatus="5" adage="x">He did not go out of your Company at all? </u>
[$ (^Ann.^) $] <u stfunc="res" force="h" a="ca" a2="ela" speaker="s" spid="s4tgiles027" spsex="f" sprole1="w" spstatus="5" spage="8" addressee="s" adid="s4tgiles001" adsex="m" adrole1="m" adstatus="1" adage="x">Yes about Ten a Clock.</u>
[$ (^Record.^) $] <u stfunc="fol" force="h" speaker="s" spid="s4tgiles001" spsex="m" sprole1="re" spstatus="1" spage="8" addressee="s" adid="s4tgiles027" adsex="f" adrole1="w" adstatus="5" adage="x">Woman you must be mistaken, he came to Town at Twelve or One, and might be in thy company, but it is plain he went to a Brokers in (^Long-lane^) , and so to the (^Artillery-Ground^) at (^Cripple-Gate^) , for I guess it might be so: Then they went to (^Whetstones-Park^) , and spent Six-Pence, and after that they went into (^Drury-lane^).</u>
[$ (^Giles,^) $] <u stfunc="rep" force="h" speaker="s" spid="s4tgiles005" spsex="m" sprole1="d" spstatus="1" spage="x" addressee="s" adid="s4tgiles001" adsex="m" adrole1="re" adstatus="1" adage="8">My Lord, she don't say she was with us all the while, but we came to an House where she was, and several other People our Neighbours. </u>
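Annotation of this kind makes it possible to retrieve utterances by speaker or speech-act attributes. The sketch below reads `<u …>` elements with regular expressions and filters them by attribute value. It assumes attribute values are quoted with straight double quotes, uses an abridged copy of two utterances from the excerpt above, and does not interpret the attribute codes; the function names are hypothetical.

```python
import re

UTTERANCE = re.compile(r"<u\s+([^>]*)>(.*?)</u>", re.DOTALL)
ATTRIBUTE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_utterances(markup):
    """Turn each <u ...>text</u> element into a dict of its attributes plus the text."""
    records = []
    for attrs, text in UTTERANCE.findall(markup):
        record = dict(ATTRIBUTE.findall(attrs))
        record["text"] = text.strip()
        records.append(record)
    return records


def select(utterances, **conditions):
    """Keep the utterances whose attributes equal all the given values."""
    return [u for u in utterances
            if all(u.get(key) == value for key, value in conditions.items())]


# Abridged from the excerpt above: most attributes are omitted for brevity.
sample = (
    '<u stfunc="fol-ini" force="q" spsex="m" sprole1="re">'
    'He did not go out of your Company at all? </u> '
    '<u stfunc="res" force="h" spsex="f" sprole1="w">Yes about Ten a Clock.</u>'
)
utterances = parse_utterances(sample)
for u in select(utterances, spsex="f"):
    print(u["sprole1"], u["text"])
```

Combining conditions, for example `select(utterances, force="q", sprole1="re")`, is how counts by role, gender or status of the kind reported in the findings could be gathered.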
Some important findings
Historical courtroom discourse is not just made up of questions and answers (even during examination sequences)
The frequency with which questions – and directives - were used, the function that they served, and their ability to achieve their social and/or interactional goal depended (in large part) on a number of socio-pragmatic factors:
• type and date of trial
• position in discourse
• role of user & addressee
• ultimate aim of interaction
1640-1760 was a period of emerging and changing roles
Now beginning to explore the nineteenth century, i.e. period in which the courtroom adopted advocacy in its modern form (Cairns 1998)
• Utilising full trials: emerging need to consider opening/closing statements
Corpus Linguistics
• Linguistic theory: empirical evidence to inform theory
• Natural language processing & Computational linguistics: statistical and rule-based language models
Historical text mining (HTM)
• Historical theory
3. Over to you …
What research questions would you like to answer, but can’t?
• Search engines for new text collections and digital libraries
• Named entity extraction for GIS
• Variant spellings
• Historical text mining
• New research methods in History