TRANSCRIPT
The Geneva Corpus of Middle English Poetry: its construction and possible applications
Richard Zimmermann
Université de Genève
November 16, 2013
Outline

Introduction: What is the GeCMEP? Why is the GeCMEP useful?
Construction: Step 1: Preprocessing; Step 2: POS-tagging; Step 3: Chunking
Database
Applications: Example 1: Verb-Object order; Example 2: Th and Wh elements
GeCMEP - overview

- The Geneva Corpus of Middle English Poetry (GeCMEP) is a fully annotated and syntactically parsed corpus.
- Time: 1150-1420 (Helsinki periods M1, M2, M3)
- Size: goal is 100,000 words before the end of the PhD, but in principle open-ended
- Parsed according to the rules of the Penn Parsed Corpus of Middle English (Kroch and Taylor 2000)

Go to the PPCME2 Manual
Example parse

(1) hwa swa ne forZefeD heore hating, ne god ne forZeueD him na þing
    who so not forgives their hating, no God not forgives them no thing
    'Whoever doesn't forgive their hate, God will not forgive them anything' (PatNost,111.67.220)
Major Middle English Prose Texts 1100-1400
→ ME verse can help to close the prose gap c. 1250-1350
Creation of a basic text file I

- Find an electronic version of the text you want to parse
- Example: Morris (1972) An Old English Miscellany Containing a Bestiary

Go to Archive.org

- Copy and paste into a .txt file with ANSI formatting so that special characters like thorn, yogh etc. can be read

Open Bestiary.txt

Some online resources...
Creation of a basic text file II

- Remove mark-ups, comments, footnotes, page & folio numbers, translations, critical apparatus etc. (UTF-8)
- Replace special characters, e.g. þ → +t, Z → +g etc.

Open Bestiary2.txt

- Reformat the file such that there is only one word per line

Open Bestiary3.txt

Replacement of D in Notepad++
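In the talk these replacements were done by hand in Notepad++, but the same step can be sketched in a few lines of Python. The replacement table below is an assumption based on the +t/+g conventions just mentioned; the file names follow the talk's examples.

```python
# Sketch of Step 1: replace Middle English special characters with
# PPCME2-style ASCII sequences and reformat to one word per line.
# The replacement table is an assumption based on the conventions above.
REPLACEMENTS = {
    "\u00fe": "+t",  # þ thorn
    "\u00de": "+T",  # Þ capital thorn
    "\u00f0": "+d",  # ð eth
    "\u021d": "+g",  # ȝ yogh
    "\u021c": "+G",  # Ȝ capital yogh
}

def preprocess(text):
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    # one word per line, as required by the tagger input in Step 2
    return "\n".join(text.split())

# preprocess(open("Bestiary.txt", encoding="utf-8").read()) would then
# yield the one-word-per-line content of Bestiary3.txt.
```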
Part-of-Speech Annotation

- POS-annotation with TreeTagger (Schmid 1995)
- takes a word and assigns to it a POS-tag and a lemma
- GeCMEP does not include lemmas; the lemma output is not used
- supervised tagger; training lexicon and training input taken from PPCME2 prose texts
- accuracy: c. 85% of all tags are assigned correctly

Example output of TreeTagger
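As a hedged sketch (not necessarily the exact setup used for the GeCMEP), such a supervised run could be driven through TreeTagger's standard command-line tools; every file name below is a hypothetical placeholder.

```python
# Sketch of Step 2: command lines for training TreeTagger on PPCME2
# prose material and for tagging a one-word-per-line text file.
# train-tree-tagger and tree-tagger are TreeTagger's standard binaries;
# all file names here are hypothetical placeholders.

def train_command(lexicon, open_class_tags, tagged_training_text, model):
    # train-tree-tagger <lexicon> <open class file> <training corpus> <model>
    return ["train-tree-tagger", lexicon, open_class_tags,
            tagged_training_text, model]

def tag_command(model, infile, outfile):
    # -token copies each input token into the output next to its tag
    return ["tree-tagger", "-token", model, infile, outfile]

# e.g. subprocess.run(train_command("ppcme2-lexicon.txt", "open-class.txt",
#                                   "ppcme2-training.txt", "me.par"))
#      subprocess.run(tag_command("me.par", "Bestiary3.txt", "Bestiary3.tag"))
# would invoke the actual binaries if TreeTagger is installed.
```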
Part-of-Speech Annotation
Demonstration
Tokenization

- sentence boundaries and IDs inserted manually in a spreadsheet
Shallow parsing procedure

- shallow parsing builds simple syntactic structures with regular expressions (Abney 1991)
- e.g. prepositional phrases can be built with an instruction like "whenever there is a P immediately before an NP, then bracket them together into a PP"
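That PP rule can be sketched as a regular expression over a flat bracket string; this string representation is assumed purely for illustration (it is not CorpusSearch's format), and nested chunks would need a richer pattern.

```python
import re

# Chunking rule from above: whenever a P chunk immediately precedes an
# NP chunk, bracket the two together into a PP.
# Flat, non-nested bracket strings are assumed for this illustration.
def chunk_pp(bracket_string):
    pattern = r"\(P [^()]*\) \(NP [^()]*\)"
    return re.sub(pattern, lambda m: "(PP " + m.group(0) + ")",
                  bracket_string)

print(chunk_pp("(P In) (NP +te font)"))
# -> (PP (P In) (NP +te font))
```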
CorpusSearch revision queries

- tokens are chunked with revision queries of CorpusSearch 2 (Randall 2004)

Example revision query:

    define: cs.def
    node: IP*|CP*
    copy corpus: t
    query: (IP*|CP* idoms {1}P)
       AND (IP*|CP* idoms {2}NP)
       AND (P iprecedes NP)
    add internal node{1,2}: PP

- a Windows .bat file runs many such revision queries; the output of one query feeds into the next
- a simple Python script converts the tokens into the right input format
- then manual correction until all tokens correspond to the PPCME2 guidelines
CorpusSearch revision queries
Demonstration
Online Documentation for the GeCMEP

- Currently the database includes information on 15 parsed text files and a general bibliography
- Text information specifies factors that might determine syntactic variation: date of composition, date of manuscript, dialect, versification, literary subjects
- In addition, cross-references to the three standard catalogues of ME (verse) texts: IMEV, Wells and MEC.

Go to the GeCMEP Database
Searching the corpus

- the corpus can be searched with query files of CorpusSearch 2 (Randall 2004)
- all functional labels, POS-tags and specific spellings can be searched for
- complex relations between elements can be specified (relative order of elements, number of words, identical indices, ...)
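For instance, a minimal query combining a functional label, a POS-tag and an order relation might look as follows; the syntax matches the query files shown in the talk, but this particular combination is an invented illustration (it would find clauses whose subject is a bare pronoun preceding a past-tense verb):

```
define: cs.def
node: IP*
query: (IP* idoms NP-SBJ*)
   AND (NP-SBJ* idomsonly PRO)
   AND (NP-SBJ* precedes VBD)
```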
Realization of verb-pronoun structures

(2) a. þe þurst him dede more wo þen hevede raþer his hounger do.
       the thirst him did more woe than had earlier his hunger done
       'The thirst caused him more misery than his hunger had done before' (FoxWolf,56.273.68)

    b. þe wox bicharde him, mid iwisse, For he ne fond nones kunnes blisse
       the fox deceived him, with certainty, for he not found none kind's bliss
       'The fox had deceived him indeed, because he didn't find any kind of bliss' (FoxWolf,224.278.295)
Search queries for verb-pronoun structures

- simple CorpusSearch query files to find such constructions:

opro-V.q

    define: cs.def
    node: IP*
    query: (IP* idoms NP-OB*)
       AND (NP-OB* idomsonly PRO)
       AND (IP* idoms VBP|VBD|DOP|DOD|HVP|HVD)
       AND (NP-OB* precedes VBP|VBD|DOP|DOD|HVP|HVD)

V-opro.q

    define: cs.def
    node: IP*
    query: (IP* idoms NP-OB*)
       AND (NP-OB* idomsonly PRO)
       AND (IP* idoms VBP|VBD|DOP|DOD|HVP|HVD)
       AND (VBP|VBD|DOP|DOD|HVP|HVD precedes NP-OB*)
Search queries for verb-pronoun structures
Demonstration
Development of verb-pronoun structures

→ Significant decline of object pronoun - verb orders (c. 80% to c. 40%) measurable in Middle English verse 1150-1400
Benefits

- Collecting data from parsed corpora with automated search queries ...
- ... saves a lot of time. Going through tens of thousands of words manually takes weeks or even months - with parsed corpora it's a matter of seconds.
- ... causes fewer mistakes. Humans can easily overlook examples or tally them up wrong - computers count correctly.
- ... assures replicability. Researchers can quickly double-check quantitative claims in the literature.
- ... increases objectivity. Search criteria must be made explicit in the query files, hence the ex-/inclusion of a particular sentence is independent of the individual researcher.

→ increased scientific rigour
Locative relatives: there → where ...

(3) a. his halie nome we nomen and beren. In þe font þer we iclensed weren.
       his holy name we took and bore in the font where we cleansed were
       'We took and bore his name in the font where we were cleansed.' (PatNost,36.59.70)

    b. To Oxenford is messageris he sende, that hi soghte This maide ware heo were ifounde and to him broghte.
       to Oxford his messengers he sent, that they sought this maid where she were found and to him brought
       'He sent his messengers to Oxford so that they might seek and find this maiden, where she was, and bring her to him.' (Fridesw,47.64)
... and temporal subordinators: then → when

(4) a. Ac ure drihten eft of deaþe hem aræreþ, So he alle men deþ, þonne domes dai cumeþ.
       but our Lord again of death them arises, as he all men does, when Doomsday comes
       'But our Lord will raise them up from death, as he will all men, when Doomsday comes.' (BodySoul,185.7.13.FragE)

    b. and al bi-fuliþ he his frend hwen he him vnfoldiþ.
       and all befouls he his friend when he him embraces
       'He wholly befouls his friend when he embraces him.' (ProvAlf,224.50.659.B32)
Development of Th and Wh elements

→ Significant increase of Wh- relative and adverbial clauses (c. 10% to c. 70%) in Middle English verse 1150-1400
Comparison verse - prose ...
... shows that verse can close the gap in prose texts.
Special thanks to Beatrice Santorini for running her queries on the GeCMEP files to find annotation mistakes. Thanks are also due to Benjamin Borschinger and Paola Merlo for useful advice on chunking.
Abney, S. (1991), Parsing by chunks, in R. Berwick and C. Tenny, eds, 'Principle-Based Parsing', Kluwer Academic Publishers, Dordrecht, pp. 257-278.

Kroch, A. and Taylor, A. (2000), Penn-Helsinki Parsed Corpus of Middle English, http://www.ling.upenn.edu/hist-corpora/PPCME2-RELEASE-3 (Accessed 10 April 2013), 2 edn, Department of Linguistics, University of Pennsylvania.

Randall, B. (2004), CorpusSearch 2, http://corpussearch.sourceforge.net/ (Accessed 7 November 2013), Sourceforge.net.

Schmid, H. (1995), 'Improvements in part-of-speech tagging with an application to German', Proceedings of the ACL SIGDAT-Workshop, Dublin, Ireland, pp. 47-50.