Rafał L. Górski, The National Corpus of Polish: benefits of synergy


Post on 03-Apr-2018


  • 7/28/2019 Rafał L. Górski, The National Corpus of Polish: benefits of synergy

    1/16

    Rafał L. Górski


    4 corpora of Polish
    1. Institute of Polish Language, PAS
    2. PWN Publishers
    3. PELCRA, University of Łódź
    4. Institute of Computer Science, PAS


    PWN Publishers: large and balanced
    PELCRA: spoken component
    ICS: morphosyntactic annotation
    However, none of them met all the requirements of a general-reference corpus.


    large +
    balanced +
    written and spoken texts +
    annotated
    = perfect corpus


    1.5 billion running words
    300-million-word balanced subcorpus
    1-million-word manually annotated subcorpus
    new tools


    Ministry of Higher Education research grant, 2008-2011


    A 1-million-word corpus annotated manually at all levels:
      morphosyntax
      syntactic words
      syntactic groups
      named entity recognition
      word sense disambiguation


    50%: journalism
    16%: fiction literature
    5.5%: non-fiction literature
    5.5%: instructive writing and textbooks
    2%: academic writing and textbooks
    3%: miscellaneous written
    7%: internet
    10%: spoken and quasi-spoken
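
To make the proportions concrete, the sketch below converts the listed shares into approximate word counts. It assumes the shares describe the 300-million-word balanced subcorpus, which the slide itself does not state. The names `SHARES`, `BALANCED_SIZE` and `words_per_type` are introduced here for illustration only. The shares as transcribed add up to 99%, so the sketch reports that total rather than forcing it to 100%.

```python
# Shares of text types as listed on the slide (percent).
SHARES = {
    "journalism": 50.0,
    "fiction literature": 16.0,
    "non-fiction literature": 5.5,
    "instructive writing and textbooks": 5.5,
    "academic writing and textbooks": 2.0,
    "miscellaneous written": 3.0,
    "internet": 7.0,
    "spoken and quasi-spoken": 10.0,
}

# Assumption: the shares apply to the 300-million-word balanced subcorpus.
BALANCED_SIZE = 300_000_000


def words_per_type(shares, total_words):
    """Return the approximate number of words each text type contributes."""
    return {name: round(total_words * pct / 100) for name, pct in shares.items()}


counts = words_per_type(SHARES, BALANCED_SIZE)
listed_total = sum(SHARES.values())

for name, n in counts.items():
    print(f"{name:40s} {n:>12,d}")
print(f"listed shares add up to {listed_total:g}%")
```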


    Text typology is based on linguistic and functional features.
    The corpus is a representation of the perception of texts, i.e. the design of the corpus reflects the structure of readership in Poland.


    Anotatornia: a tool for manual annotation
    two search engines
    morphosyntactic tagger
    tools for word sense disambiguation
    named entity recognition
    shallow syntactic parsing


    based on TEI P5
    a ready solution picked from the treasure trove of TEI
    well documented: http://nlp.ipipan.waw.pl/TEI4NKJP/
    used for annotation at all textual and linguistic levels
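
As a rough illustration of what working with TEI-encoded text can look like, the sketch below reads tokens from a tiny TEI-namespaced fragment using Python's standard library. The fragment is hypothetical and heavily simplified; it is not the actual NKJP encoding, which is documented at the URL above and may be organised quite differently. `SAMPLE` and `read_tokens` are names invented for this example.

```python
import xml.etree.ElementTree as ET

# The TEI namespace is standard; everything in SAMPLE is a made-up example.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical, simplified fragment; NOT the actual NKJP schema.
SAMPLE = """\
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <p xml:id="p1">
        <s xml:id="s1">
          <w xml:id="w1" lemma="korpus" type="subst">Korpus</w>
          <w xml:id="w2" lemma="rosnąć" type="fin">rośnie</w>
        </s>
      </p>
    </body>
  </text>
</TEI>
"""


def read_tokens(xml_text):
    """Return (id, form, lemma, tag) tuples for every <w> element."""
    root = ET.fromstring(xml_text)
    tokens = []
    for w in root.iterfind(".//tei:w", TEI_NS):
        tokens.append((w.get(XML_ID), w.text, w.get("lemma"), w.get("type")))
    return tokens


tokens = read_tokens(SAMPLE)
for token in tokens:
    print(token)
```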

    Annotation of all levels:
      metadata
      text structure
      morphosyntax
      syntactic groups
      named entities
      word sense disambiguation


    Poliqarp
      enables sophisticated queries at the cost of complicated syntax
      own binary format
    PELCRA
      user friendly at the cost of limited linguistic functionality
      scalable
      good collocation component, diachronic profiles of words and phrases
      relational database
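
To give a sense of what a collocation component computes, the sketch below scores adjacent word pairs by pointwise mutual information (PMI). The slide does not say which association measure the PELCRA engine uses; PMI is chosen here only as a common, simple measure. The token list and the function name `pmi_scores` are invented for illustration.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=1):
    """Score adjacent word pairs by pointwise mutual information (PMI)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # skip pairs that are too rare to be scored
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Toy token list, invented for illustration only.
TOKENS = "national corpus of polish is a corpus of polish texts".split()
scores = pmi_scores(TOKENS)

# Print the three highest-scoring pairs.
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(pair, round(score, 2))
```

On real data a frequency threshold (`min_count`) matters, since PMI tends to overrate pairs of very rare words.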


    empirical basis for two dictionaries
    teaching at university level
      corpus linguistics
      translation
    still growing number of applications (it seems however it co


    Polish-Russian parallel corpus: in progress
    diachronic corpus of modern Polish (1960-2000): in progress
    corpus of the late 19th century: planned
    corpus of Polish comparable to the BNC: in progress


    a considerable number of texts and tools available from the beginning
    4 demo versions enable testing
    experience of the team in both corpus compiling and corpus linguistics