Rafał L. Górski, The National Corpus of Polish: benefits of synergy
TRANSCRIPT
-
7/28/2019 Rafał L. Górski, The National Corpus of Polish: benefits of synergy
Rafał L. Górski
-
4 corpora of Polish
1. Institute of Polish Language, PAS
2. PWN Publishers
3. PELCRA, University of Łódź
4. Institute of Computer Science, PAS
-
PWN Publishers: large and balanced
PELCRA: spoken component
ICS: morphosyntactic annotation
however, none of them met all the requirements of a general-reference corpus
-
Large +
balanced +
written and spoken texts +
annotated
= perfect corpus
-
1.5 billion running words
300-million-word balanced subcorpus
1-million-word subcorpus, manually annotated
new tools
-
Ministry of Higher Education research grant, 2008-2011
-
A 1-million-word corpus annotated manually at all levels:
morphosyntax
syntactic words
syntactic groups
named entity recognition
word sense disambiguation
-
50%: journalism
16%: fiction literature
5.5%: non-fiction literature
5.5%: instructive writing and textbooks
2%: academic writing and textbooks
3%: miscellaneous written
7%: internet
10%: spoken and quasi-spoken
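A minimal sketch of how the shares listed on this slide translate into target word counts per text type. The function name `target_word_counts` and the choice of a 1,000,000-word total are illustrative assumptions, not something the slides specify. The shares are copied as listed; they add up to 99%, so the sketch reports that sum rather than rescaling.

```python
# Shares (in percent) copied from the slide; names are used as dictionary keys.
GENRE_SHARES = {
    "journalism": 50.0,
    "fiction literature": 16.0,
    "non-fiction literature": 5.5,
    "instructive writing and textbooks": 5.5,
    "academic writing and textbooks": 2.0,
    "miscellaneous written": 3.0,
    "internet": 7.0,
    "spoken and quasi-spoken": 10.0,
}


def target_word_counts(total_words, shares=GENRE_SHARES):
    """Return the number of words each text type would receive.

    Hypothetical helper: it simply applies each percentage to the total
    and rounds to whole words; no rescaling is performed.
    """
    return {genre: round(total_words * pct / 100) for genre, pct in shares.items()}


if __name__ == "__main__":
    total = 1_000_000  # assumed corpus size, for illustration only
    counts = target_word_counts(total)
    for genre, words in counts.items():
        print(f"{genre:35s} {words:>9,d}")
    print(f"sum of listed shares: {sum(GENRE_SHARES.values())}%")
```

Keeping the shares in one table makes it easy to recompute the targets for any other corpus size.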
-
Text typology is based on linguistic and functional features.
The corpus is a representation of the perception of texts, i.e. the design of the corpus reflects the structure of readership in Poland.
-
Anotatornia: a tool for manual annotation
two search engines
morphosyntactic tagger
tools for:
  word sense disambiguation
  named entity recognition
  shallow syntactic parsing
-
based on TEI P5
a ready solution picked from the treasure trove of TEI
well documented: http://nlp.ipipan.waw.pl/TEI4NKJP/
used for annotation at all textual and linguistic levels
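A rough sketch of what reading TEI-style annotation could look like, using only the Python standard library. The XML fragment is a simplified, hypothetical example built from generic TEI P5 elements (`seg`, `fs`, `f`, `string`, `symbol`); it is not the actual corpus schema, which is documented at the URL above. The feature names `orth`, `base` and `ctag` and the helper `read_segments` are likewise assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical, simplified fragment: NOT the real corpus encoding.
SAMPLE = """\
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p><s xml:id="s1">
    <seg xml:id="seg1"><fs type="morph">
      <f name="orth"><string>Ala</string></f>
      <f name="base"><string>Ala</string></f>
      <f name="ctag"><symbol value="subst"/></f>
    </fs></seg>
    <seg xml:id="seg2"><fs type="morph">
      <f name="orth"><string>ma</string></f>
      <f name="base"><string>mieć</string></f>
      <f name="ctag"><symbol value="fin"/></f>
    </fs></seg>
    <seg xml:id="seg3"><fs type="morph">
      <f name="orth"><string>kota</string></f>
      <f name="base"><string>kot</string></f>
      <f name="ctag"><symbol value="subst"/></f>
    </fs></seg>
  </s></p></body></text>
</TEI>
"""


def read_segments(xml_text):
    """Return a list of (segment id, {feature name: value}) pairs."""
    ns = {"tei": TEI_NS}
    root = ET.fromstring(xml_text)
    rows = []
    for seg in root.iterfind(".//tei:seg", ns):
        feats = {}
        for f in seg.iterfind("tei:fs/tei:f", ns):
            if len(f) == 0:
                continue  # feature without a value element
            child = f[0]
            # <symbol> keeps its value in an attribute, <string> in its text
            value = child.get("value")
            if value is None:
                value = child.text or ""
            feats[f.get("name")] = value
        rows.append((seg.get(XML_ID), feats))
    return rows


if __name__ == "__main__":
    for seg_id, feats in read_segments(SAMPLE):
        print(seg_id, feats["orth"], feats["base"], feats["ctag"])
```

Because every segment carries an identifier, further annotation layers (syntactic groups, named entities, word senses) could point back to these ids instead of repeating the text; that pointing mechanism is not shown here.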
-
Annotation of all levels:
metadata
text structure
morphosyntax
syntactic groups
named entities
word sense disambiguation
-
Poliqarp
  enables sophisticated queries at the cost of complicated syntax
  own binary format
PELCRA
  user friendly at the cost of limited linguistic functionalities
  scalable
  good collocation component, diachronic profiles of words and phrases
  relational database
-
empirical basis for two dictionaries
teaching at university level
corpus linguistics
translation
still growing number of applications (it seems however it co
-
Polish-Russian parallel corpus: in progress
diachronic corpus of modern Polish (1960-2000): in progress
corpus of the late 19th century: planned
corpus of Polish comparable to BNC: in progress
-
a considerable number of texts and tools available from the beginning
4 demo versions enable testing
experience of the team in both corpus compiling and corpus linguistics