
  • Kristín Bjarnadóttir, Jón Friðrik Daðason & Ludger Zeevaert

    The Árni Magnússon Institute for Icelandic Studies University of Iceland

    Icelandic: Modern Tools and Old Texts

    Nordic CLARIN Network Workshop on Historical Resources and Tools

    Gothenburg 6.10.2015

  • Topic

    • Ludger: The challenge: Njáls saga and NLP tools
    • Jón Friðrik: Applying modern NLP tools to old texts
    • Kristín: Why normalize to Modern Icelandic? Linguistic resources

  • Co-operation

    • Co-operation between the Manuscript Department and the Department for Lexicography

    • Aim: Adapt linguistic tools developed for Modern Icelandic to Old Icelandic corpora

  • The Gullskinna-Project

    • Funded by the Icelandic Research Fund (Rannís, grant number 152342-051)

    • Start: August 2015 (3 years)

    • Main questions:
      (1) a stemma of the Gullskinna manuscripts
      (2) an analysis of the treatment of a number of morphological and syntactic variables exhibited by different scribes

  • Linguistic Variables with Differences Between Manuscripts of Njáls saga

    • Historical present tense
    • Order of noun and modifier
    • Pronominal reference (reflexive or not)
    • Agreement (past participle/supine)
    • Indirect speech constructions

  • One Example: Indirect Speech

    • Direct speech
    • Conjunctional clause
    • Accusative with infinitive (a.c.i.)
    • Mixed constructions

  • One Short Example

    Hrappur segist vera íslenskur
    Hrappur-SBJ say-3.SG.PRS.MID be-INF Icelandic
    ‘Hrappur says himself to be Icelandic’

    Hrappur sagði að hann væri íslenskur
    Hrappur-SBJ said that he be-3.SG.PST.SBJV Icelandic
    ‘Hrappur said that he was Icelandic’

  • Our Approach

    • Tag a version of the text in Modern Icelandic spelling

    • Transform the tagged text to Menota-XML (Medieval Nordic Text Archive, menota.org) + syntactic tagging (TEI)

    • Correct the tagging

    • Add the text from the manuscript

    • Analyse and compare linguistic variables in different manuscripts
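
    The tag-then-transform steps can be illustrated with a minimal Python sketch. It assumes a tagger has already returned (normalized form, lemma, tag) triples. The element and attribute names (w, choice, dipl, norm, lemma, msa) are simplified stand-ins chosen for illustration, not the actual Menota schema, and the tag strings are placeholders rather than real tagset values.

```python
import xml.etree.ElementTree as ET


def word_element(normalized, lemma, tag, manuscript_form=None):
    """Build one <w> element for a tagged token (simplified, not the Menota schema)."""
    w = ET.Element("w", {"lemma": lemma, "msa": tag})
    choice = ET.SubElement(w, "choice")
    if manuscript_form is not None:
        # Corresponds to the later step "Add the text from the manuscript".
        ET.SubElement(choice, "dipl").text = manuscript_form
    ET.SubElement(choice, "norm").text = normalized
    return w


# Hypothetical tagger output: (normalized form, lemma, tag) triples.
# The tag strings are placeholders, not real IFD tags.
tagged = [
    ("Hrappur", "Hrappur", "NOUN"),
    ("segist", "segja", "VERB"),
    ("vera", "vera", "VERB"),
    ("íslenskur", "íslenskur", "ADJ"),
]

sentence = ET.Element("s")
for normalized, lemma, tag in tagged:
    sentence.append(word_element(normalized, lemma, tag))

xml_text = ET.tostring(sentence, encoding="unicode")
print(xml_text)
```

    Keeping the manuscript form optional mirrors the order of the steps: the normalized text is tagged and corrected first, and the manuscript reading is attached afterwards.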

  • Conjunctional Clause/ A.c.i.

    Hrappur kvaðst vera íslenskur (AM 162 B theta fol. “Þetabrotið”, ca 1325)

    Hrappur sagði hann væri utan af Íslandi (AM 133 fol. “Kálfalækjarbók”, ca 1325)

    Hrappur segist vera íslenskur (AM 137 fol. “Vigfúsarbók”, ca 1650)

    Hrappur kvaðst vera íslenskur maður (AM 163 i fol. “Saurbæjarbók”, 1668)

    Hrappur sagði að hann væri íslenskur (AM 135 fol., ca 1690)

  • Chronological Development or not?

    [Figure: stemma diagram relating Þeta, Gullskinna, Vigfúsarbók, Saurbæjarbók, Kálfalækjarbók and AM 135 fol., with the starred nodes *x1 and *x4]

  • Icelandic Taggers

    • IceNLP
      • Offers a hybrid rule-based/HMM tagger
      • Trained on the Icelandic Frequency Dictionary
        • 590,000 tokens
      • Tagging accuracy of 92.7%

    • IceStagger
      • An adaptation of Stagger, an averaged perceptron tagger
      • Tagging accuracy of 93.8%
      • An improvement over Stagger’s accuracy of 91.0%

  • Tagging Old Icelandic

    • Normalized Old Icelandic
      • IceNLP: 86.6% accuracy (vs. 92.7%)
      • Stagger: 84.9% accuracy (vs. 91.0%)

    • Extending the training data
      • Add 95,000 tokens from the Saga-Gold corpus (13th-14th century texts) to the training set

    • Improved tagging accuracy
      • IceNLP: 90.6% accuracy
      • Stagger: 91.8% accuracy
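
    One way to read these figures is as a relative reduction in tagging errors. The short Python sketch below is an illustration using only the accuracies quoted on this slide; the function name is ours, not part of any of the tools mentioned.

```python
def relative_error_reduction(acc_before, acc_after):
    """Share of the original errors removed, given two accuracies in percent."""
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    return (err_before - err_after) / err_before


# Accuracies on normalized Old Icelandic, before and after adding
# the Saga-Gold tokens to the training set (figures from the slide).
results = {"IceNLP": (86.6, 90.6), "Stagger": (84.9, 91.8)}

for tagger, (before, after) in results.items():
    reduction = relative_error_reduction(before, after)
    print(f"{tagger}: {reduction:.1%} fewer errors")
```

    On these numbers the extended training set removes roughly 30% of IceNLP's errors and roughly 46% of Stagger's.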

  • Lemmatizing Old Icelandic

    • Lemmald
      • A rule-based lemmatizer included with IceNLP
      • Trained on the Icelandic Frequency Dictionary (IFD)
        • Approx. 74,000 distinct word form/tag/lemma combinations

    • Nefnir
      • A new rule-based lemmatizer
      • Trained on the Database of Modern Icelandic Inflection
        • Approx. 6,000,000 distinct word form/tag/lemma combinations

  • A Sample Evaluation

    • Approx. 1,700 tokens from Njála
    • Without the extended training set

    Task                       IceNLP   IceStagger
    Tagging                    88.3%    83.5%
    Lemmatization (Lemmald)    92.3%    92.3%
    Lemmatization (Nefnir)     94.2%    93.2%
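
    Scores of this kind are token-level accuracies: the share of tokens whose predicted tag (or lemma) matches a hand-corrected gold standard. A minimal Python sketch follows; the four-token example data are invented for illustration and are not taken from the evaluation sample.

```python
def accuracy(gold, predicted):
    """Fraction of positions where the prediction equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sample")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Invented toy data: gold lemmas vs. the output of some lemmatizer.
gold_lemmas = ["Hrappur", "segja", "vera", "íslenskur"]
predicted_lemmas = ["Hrappur", "segja", "vera", "íslenska"]

print(f"{accuracy(gold_lemmas, predicted_lemmas):.1%}")
```

    With a sample of about 1,700 tokens, a single token corresponds to about 0.06 percentage points, so small differences between systems should be read with caution.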

  • Automating the Process

  • Goals

    • Make the tools as easy to use as possible
      • Tools can be added, replaced and updated as necessary
      • The less technical proficiency that is required of users, the better

    • Introduce a simple and fast process
      • Offers a much better starting point than manually annotating documents from scratch
      • Fully processed documents can be shared back to us in a common format
      • Documents shared back can be used to enlarge the training set

  • Normalization & Resources

    •  Existing NLP tools are made for Modern Icelandic

    •  For NLP use, all pre-20th-century texts have to be normalized, because spelling was not standardized before then

    •  Linguistic resources (lexicons) are needed for each period

  • Layers of Text: Normalization

  • Layers of Text

    1.  Photograph
    2.  Facsimile
    3.  Diplomatic: Modern character set
    4.  Normalized to standardized Old Norse
    5.  Normalized to Modern Icelandic

    Our aim is to normalize a diplomatic version to Modern Icelandic in order to be able to use the PoS Taggers and other NLP tools. (And to make the texts readable for everyone.)

  • Continuity

    • The cohesion of Icelandic word forms is extensive, apart from spelling variants

    • Relatively slight changes in morphology
      • The inflectional system is intact
      • Minor changes in individual inflectional classes
      • Drift of individual words between inflectional classes

    • Word formation: Unchanged rules …

    Result: Very high rate of recognisable word forms between periods with no clear cut-off point in time.

  • Why Modern Icelandic?

    Linguistic resources for the modern language are available:

    • Database of Modern Icelandic Inflection (BÍN): 380,000 paradigms, 5.8 million inflectional forms
    • The Tagged Icelandic Corpus (MÍM): 25 million running words
    • Íslenskur orðasjóður (Leipzig Wortschatz): 545 million running words from websites, 21st C.

    Producing new tools for each period of the language would be expensive and not always feasible because of data scarcity.

    Pre-standardized spelling is highly idiosyncratic, up to the 20th century.

    There is no fundamental difference in normalizing a very idiosyncratic 19th century text and an ‘easy’ 15th century one.

  • Historical Resources

    • Written Language Archive (WLA, post 1540): over 700,000 headwords (normalized to Modern Icelandic), 1.3 million citations in original spelling

    • Ordbog over de norrøne prosasprog (ONP): ca. 65,000 headwords (normalized to Old Norse)

    • Individual texts and indices, such as Andrea de Leeuw van Weenen: Lemmatized Index to the Icelandic Homily Book (2004)

    • … and all available texts, such as Ludger’s Njála project, etc.

  • Skrambi, the Normalizer

    • Skrambi is a spellchecker
      • Uses a statistical model for character substitution
      • Can adapt itself to the characteristics of individual documents
      • Is lexicon-based

    • Skrambi is used for
      • Modern language spell-checking
      • OCR correction
      • Normalization of older texts

    • Skrambi’s versatility depends on the lexicons used. By using a 19th C lexicon, 19th C OCR texts can be corrected and normalized, etc.
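
    The general idea of lexicon-based normalization with a character substitution model can be sketched in Python as follows. This is not Skrambi's implementation: it is a plain weighted edit distance, the substitution costs are hand-set assumptions (a statistical model would estimate them from data), and the four-word lexicon is a toy example.

```python
def weighted_distance(source, target, sub_cost):
    """Edit distance where listed character substitutions cost less than 1.0."""
    prev = [float(j) for j in range(len(target) + 1)]
    for i, s in enumerate(source, start=1):
        curr = [float(i)]
        for j, t in enumerate(target, start=1):
            cost = 0.0 if s == t else sub_cost.get((s, t), 1.0)
            curr.append(min(prev[j] + 1.0,         # delete s
                            curr[j - 1] + 1.0,     # insert t
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def normalize(word, lexicon, sub_cost):
    """Return the lexicon entry closest to `word` under the weighted distance."""
    return min(sorted(lexicon),
               key=lambda entry: weighted_distance(word, entry, sub_cost))


# Hypothetical, hand-set costs for common historical spelling variants.
SUB_COST = {("c", "k"): 0.1, ("i", "j"): 0.1, ("v", "u"): 0.1,
            ("v", "y"): 0.1, ("o", "u"): 0.1}
# Toy lexicon of modern forms.
LEXICON = {"drykkja", "drykkju", "drykkjur", "dreki"}

for form in ["dryckiu", "drvckio", "dryckia"]:
    print(form, "->", normalize(form, LEXICON, SUB_COST))
```

    Swapping in a lexicon for a different period changes which targets are reachable, which is the sense in which the versatility of such a tool depends on its lexicons.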

  • Spelling in the ONP / WLA

    • ONP (pre 1540): dryckiom, drvckio, drvkkio, dryckiar-, drykkior, dryckio, dryccior, dryckiu, dryckiona, drykkivr, drykkju, drykkiu, ðryckiu, dryckiv …

    • WLA (post 1540): dryckia, dryckiu, dryckju, drykkja, drykkju …

    Normalization rules:

    ONP                      WLA
    [ck|k|kk|cc] > kk        ck > kk
    i > j , v > u            i > j
    [u|o|v]$* > u
    v > y
    ^ð > d

    Cf. BÍN →
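
    Substitution rules of this kind can be tried out as ordered regular-expression rewrites. The Python sketch below covers only the attested forms of this one word; the context restrictions (for example, i > j only between kk and a following vowel letter) are our assumptions, added so that the rules do not over-apply, and are not stated on the slide.

```python
import re

# Ordered rewrite rules, loosely following the slide's ONP column.
# The lookbehind/lookahead contexts are assumptions made for this sketch.
RULES = [
    (re.compile(r"^ð"), "d"),                    # ^ð > d
    (re.compile(r"ck|kk|cc|k"), "kk"),           # [ck|k|kk|cc] > kk
    (re.compile(r"(?<=kk)i(?=[aouv])"), "j"),    # i > j before the ending vowel
    (re.compile(r"(?<=j)[ouv]"), "u"),           # [u|o|v] > u in the ending
    (re.compile(r"(?<=dr)v"), "y"),              # v > y in the stem
]


def normalize_spelling(form):
    """Apply each rule in order to one historical spelling."""
    for pattern, replacement in RULES:
        form = pattern.sub(replacement, form)
    return form


for form in ["dryckiu", "drvckio", "ðryckiu", "dryckiv", "dryccior", "dryckia"]:
    print(form, "->", normalize_spelling(form))
```

    Rule order matters here: the consonant cluster is normalized first so that the later rules can rely on kk as context.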

  • The Database of Modern Icelandic Inflection

    Section: D

  • Conclusion

    The challenge is to reduce unknown words to improve the accuracy of the tools
    • by linking and enlarging lexicons
    • by using our compound analyser (it produces binary constituent trees)
    • by developing a tool to normalize word boundaries

    The more resources (i.e., texts) we get, the better the results of the tools will be.

  • The Árni Magnússon Institute

    for Icelandic Studies

    Thank you for your attention

    Ludger Zeevaert, Jón Friðrik Daðason, Kristín Bjarnadóttir [email protected], jfd1 @hi.is, [email protected]

    Oct. 6th 2015 Gothenburg