IMPACT final conference - language parallel sessions - Erjavec
Tomaž Erjavec
Department of Knowledge Technologies
Jožef Stefan Institute
Ljubljana
Resources for historical Slovene
IMPACT Conference 2011
October 24-25, 2011, London
Background
• Pre-story: AHLib (2004–08) (Deutsch-slowenische/kroatische Übersetzung 1848–1918)
  • Corpus / DL of ger→slv books
  • AAS: transcription correction and markup (TEI P4)
  • JSI: automatic annotation and editing environment
• Story: EU IP IMPACT (ext. 2010–2011)
  • Better OCR for historical texts
  • NUK: GTD transcriptions (PAGE/Aletheia)
  • JSI: (semi)manual lexicon construction
• Co-story: Google award (2011)
  • Developing language models for historical Slovene
  • ZRC SAZU: transcriptions of old texts (TEI P5)
  • JSI: annotating a corpus of old Slovene
Tomaž Erjavec: Slovene language resources
Methodology
• Develop 3 resources:
  • transcribed texts
  • hand-annotated corpus
  • lexicon of historical words
• Develop annotation tool, ToTrTaLe
• How to tag and lemmatise historical Slovene?
  Little chance of developing training data comparable to that for contemporary Slovene
• Basic idea:
  • modernise words then use models for modern Slovene
  • transcription is via fixed lexicon + transcription patterns
  • patterns implemented via LMU Vaam
  • mostly OK for XIX and XVIII century language
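The basic idea (look a word up in a fixed lexicon, otherwise apply transcription patterns) can be sketched as follows. This is a minimal illustration under stated assumptions, not the ToTrTaLe or Vaam implementation: the lexicon entries are taken from the variants listed in this talk, while the rewrite rules are hypothetical examples loosely modelled on Bohorič-alphabet spellings, and the names `LEXICON`, `RULES` and `modernise` are invented for the sketch.

```python
import re

# Fixed lexicon: historical word-form -> contemporary form.
# Entries are variants of "ljubezen" mentioned in the talk.
LEXICON = {
    "lubesen": "ljubezen",
    "ljubesen": "ljubezen",
    "ljubezin": "ljubezen",
}

# Hypothetical transcription patterns (assumption, not the real Vaam rules).
RULES = {"ſh": "š", "zh": "č", "sh": "ž", "ſ": "s"}

# One alternation, longest pattern first, so that all rules are applied in a
# single pass and the output of one rule is never rewritten by another.
_RULE_RE = re.compile(
    "|".join(re.escape(k) for k in sorted(RULES, key=len, reverse=True))
)


def modernise(word: str) -> str:
    """Return a modernised form: lexicon lookup first, patterns as fallback."""
    key = word.lower()
    if key in LEXICON:
        return LEXICON[key]
    return _RULE_RE.sub(lambda m: RULES[m.group(0)], key)
```

The modernised form would then be passed to a tagger and lemmatiser trained on contemporary Slovene. Applying the rules in one pass is a design choice of this sketch; a cascade of rules would need explicit ordering to avoid rewriting its own output.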
[Diagram: workflow with the components Texts, Historical lexicon, Contemporary models, ToTrTaLe, Annotators, Corpus]
Issues
• Tokenisation - words were split differently in historical language:
  • žnjo → z njo
  • po noči → ponoči
• Variability:
  • archaic forms: ljubezen ← lubesen, ljubesen, lubeſn, ljubezin, ljubesin
  • inflection: ljubezen ← ljubezni, ljubeznijo
  • both: ljubezen ← ljubezni, ljubesni, lubesen, ljubesen, lubesni, lubeſn, ljubeznijo, ljubezin, lubeſne, lubeſni, lubesne, ljubesnijo, ljubesin
• Extinct words:
  • zajhen / cajhen / znamenje
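The tokenisation issue means that one historical token may map to several modern ones and vice versa. A minimal sketch of such a retokenisation step, assuming simple lookup tables (the tables `SPLIT` and `MERGE` and the function `retokenise` are hypothetical names; only the two examples from the slide are used):

```python
# Historical token -> sequence of modern tokens (example from the slide).
SPLIT = {"žnjo": ["z", "njo"]}

# Pair of historical tokens -> single modern token (example from the slide).
MERGE = {("po", "noči"): "ponoči"}


def retokenise(tokens):
    """Map historical tokenisation to modern tokenisation, left to right."""
    out = []
    i = 0
    while i < len(tokens):
        # A one-element tuple at the end of the input never matches MERGE.
        pair = tuple(t.lower() for t in tokens[i:i + 2])
        if pair in MERGE:
            out.append(MERGE[pair])
            i += 2
        elif tokens[i].lower() in SPLIT:
            out.extend(SPLIT[tokens[i].lower()])
            i += 1
        else:
            out.append(tokens[i])
            i += 1
    return out
```

A table-driven merge like this is applied blindly wherever the pair occurs; whether a given pair should really be joined may depend on context, which the sketch does not model.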
Transcribed historical texts
• AHLib corpus/DL: 90 books, 10,000 pages, 2M words (> 1850)
• NUK GTD: 5,000 pages, 1M words
• Google Books: 30 books, 10,000 pages, 2M words (in progress)
• WikiSource (Lj Uni): 200 books, 5M words (in progress)
Total: ~ 10M words
• most texts have associated facsimiles
• can be made freely available
Initial Lexicon
• Development of initial lexicon (2010), using the data and tools at hand
  • AHLib collection (70 books > 1850)
  • Transcription rules + FidaPLUS lexicon of contemporary slv
  • LMU LeXtractor editing tool
  • produced 3,000 entries (word-forms)
Reference corpus goo300k
• Page sampled
• Each word annotated with:
  • Contemporary equivalent
  • Modern lemma
  • Part-of-speech tag
• First with ToTrTaLe
• Then manually correct
  • INL Cobalt Lexicon Tool
  • A team of annotators
  • Also correcting errors in transcription
  • Manual, cookbook, FAQ, mailing list, meetings…
• TEI P5 – bibliography, links to facsimiles & DL
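The per-word annotation and the two-step workflow (automatic first, then manual correction) could be modelled roughly as below. This is only a sketch: the field names, the `checked` flag, the placeholder tags "X" and "N", and the assumption that "lubesni" corresponds to modern "ljubezni" are illustrative and not taken from the goo300k encoding.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Token:
    form: str              # historical word-form as transcribed
    modern: str            # contemporary equivalent
    lemma: str             # modern lemma
    tag: str               # part-of-speech tag (placeholder tagset)
    checked: bool = False  # True once an annotator has reviewed the token


def correct(token: Token, **fixes) -> Token:
    """Return a reviewed copy of the token with the annotator's fixes applied."""
    return replace(token, checked=True, **fixes)


# Step 1: automatic annotation (here a deliberately poor guess).
auto = Token(form="lubesni", modern="lubesni", lemma="lubesni", tag="X")

# Step 2: manual correction; the transcribed form itself is left untouched.
gold = correct(auto, modern="ljubezni", lemma="ljubezen", tag="N")
```

Keeping tokens immutable and producing a corrected copy makes it easy to compare automatic and manual annotation afterwards, for example to measure tagger accuracy.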
Period      Units  Pages  Tokens
1584            1      8    6000
1695            1     27   10000
1751-1800       8    155   27000
1801-1850      12    206   74000
1851-1875      36    380  126000
1876-1900      23    224   51000
∑              81   1000  296000
INL Cobalt lexicon building tool
TEI corpus dump
Final lexicon
Composition:
• Initial LeXtractor lexicon (3k entries)
• Lexicon dump from goo300k
• Additional lexicon from full text collection
Format:
• TEI P5
• lemma oriented
• grammatical properties, glosses, historical spelling, (corpus) examples
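To make "lemma oriented" concrete, a lexicon entry might be assembled roughly as follows. The element names (entry, form, orth, gramGrp, pos, sense, def) come from the TEI P5 dictionaries module, but the exact structure of the actual lexicon is not shown in this talk, so the layout, the `type` attribute values and the function `make_entry` are assumptions of this sketch.

```python
import xml.etree.ElementTree as ET


def make_entry(lemma, pos, gloss, spellings):
    """Build a lemma-oriented, TEI-style dictionary entry (illustrative layout)."""
    entry = ET.Element("entry")

    # Headword: the modern lemma.
    lemma_form = ET.SubElement(entry, "form", type="lemma")
    ET.SubElement(lemma_form, "orth").text = lemma

    # Grammatical properties.
    gram = ET.SubElement(entry, "gramGrp")
    ET.SubElement(gram, "pos").text = pos

    # Gloss.
    sense = ET.SubElement(entry, "sense")
    ET.SubElement(sense, "def").text = gloss

    # Attested historical spellings, one form element each.
    for spelling in spellings:
        hist = ET.SubElement(entry, "form", type="historical")
        ET.SubElement(hist, "orth").text = spelling

    return entry


entry = make_entry("ljubezen", "noun", "love", ["lubesen", "ljubesen", "lubeſn"])
xml_text = ET.tostring(entry, encoding="unicode")
```

Corpus examples (attestations) could be attached in the same way, for instance as citation elements under each historical form; that part is left out here.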
goo300k         All  Historical
Lex. entries  56346       22849
Word-forms    53853       19627
Normalised    46996       15402
Modernised    37334       11396
Lemmas        19569        8605
Results
• Language resources for historical Slovene:
  • Text Collection hs5M: facsimile + transcription, DL (+ automatic annotation)
  • Annotated Corpus goo300k: page-sampled, hand-annotated
  • Structured Lexicon imp20k: grammar + glosses + forms + attestations
  • TEI P5, CC BY
• ToTrTaLe + resources for HS: tokenisation & transcription patterns
• Services: CUWI, (moderniser+archaiser)
• all still work in progress, available mid-2012
Further work
• Better IR for Digital Libraries: NUK
• Dictionary of historical Slovene: ZRC
• Beyond words: changes in syntax
• MT paradigm
• tweets & Croatian