![Page 1: Experimen)ng+with+Slovak+ Wikipediaas+asource+for ...ikt.ui.sav.sk/publications/dlugolinsky_sk_wikipedia_slovko-2013.pdf · href5, anchor7 href3, anchor8 href1, anchor1 href1, anchor2](https://reader033.vdocuments.net/reader033/viewer/2022052009/601f51f5f1d622583a005ac5/html5/thumbnails/1.jpg)
Experimenting with Slovak Wikipedia as a source for Language Technologies
Michal Laclavík¹, Štefan Dlugolinský¹, Michal Blanárik²
¹Institute of Informatics, Slovak Academy of Sciences, Bratislava
²Faculty of Informatics and Information Technologies, Slovak University of Technology, Bratislava
Wikipedia
• A well-known source of human knowledge maintained by the crowd
• A lot of facts on various topics in a lot of articles:
  – 4,405,584 in English
  – 999,839 in Polish
  – 275,982 in Czech
  – 249,421 in Hungarian
  – 187,182 in Slovak
  – 138,292 in Slovene
• http://en.wikipedia.org/wikistats/EN/Sitemap.htm
Wikipedia as a text corpus
• Additional useful information:
  – articles represent information about entities
  – links represent relations between entities
  – anchor texts are alternative names, inflected forms, abbreviations of entities or entity properties
• Useful for NLP tasks such as NER, QA, MT, WSD, etc.
• NLP tools based on English Wikipedia:
  – WikipediaMiner
    • detecting and disambiguating Wikipedia topics when they are mentioned in documents
  – Illinois Wikifier
    • disambiguation to Wikipedia with local and global algorithms
  – DBpedia Spotlight
    • tool for annotating mentions of DBpedia resources in text (DBpedia is structured information in the form of RDF graphs)
Slovak Wikipedia
• Not explored as much as the English one so far
• Good source of inflected forms and alternative names of entities not included in available dictionaries, such as persons, organizations and locations
• We made two simple experiments showing possible uses of Slovak Wikipedia for NLP:
  – E1: Links and anchors extraction
  – E2: Named Entity Recognition
[Figure labels: Národné obrodenie, Uhorský snem]
E1: Link and anchor text extraction
• The point
  – Collect entities (articles), their alternative names (anchors) and related entities (via links), and explore search over titles and anchors
• Parsed XML dump of Slovak Wikipedia
  – 737.3 MB uncompressed, 30th April 2013
  – Size of the uncompressed English Wikipedia dump is about 44 GB!
• We used the Map-Reduce paradigm for this task (Hadoop implementation)
• Parsed results were indexed in Solr
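E1: Wikipedia parsing using Map-Reduce
[Diagram: INPUT → SPLIT → MAP → SHUFFLE → REDUCE → OUTPUT]
• INPUT: XML dump
• SPLIT: Article1, Article2, ..., ArticleN
• MAP: each article emits (href, anchor) pairs
  – Article1: (href1, anchor1), (href1, anchor2), (href2, anchor3)
  – Article2: (href1, anchor4), (href3, anchor5)
  – ArticleN: (href4, anchor6), (href4, anchor6), (href5, anchor7), (href3, anchor8)
• SHUFFLE: pairs grouped by href
  – href1: anchor1, anchor2, anchor4
  – href2: anchor3
  – href3: anchor5, anchor8
  – href4: anchor6, anchor6
  – href5: anchor7
• REDUCE: one record per linked article
  – Article1: anchor1, anchor2, anchor4
  – Article2: anchor3
  – Article3: anchor5, anchor8
  – Article4: anchor6, anchor6
  – Article5: anchor7
• OUTPUT: Solr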
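The map, shuffle and reduce stages from the diagram can be sketched in a few lines of plain Python. This is only an in-memory illustration of the data flow, not the Hadoop job used in the experiment; the function names and the shape of the output records are assumptions made for the example.

```python
from collections import defaultdict

# Toy input mirroring the diagram: each article with the (href, anchor) links found in it.
articles = {
    "Article1": [("href1", "anchor1"), ("href1", "anchor2"), ("href2", "anchor3")],
    "Article2": [("href1", "anchor4"), ("href3", "anchor5")],
    "ArticleN": [("href4", "anchor6"), ("href4", "anchor6"),
                 ("href5", "anchor7"), ("href3", "anchor8")],
}

def map_links(links):
    """MAP: emit one (href, anchor) pair per link in an article."""
    for href, anchor in links:
        yield href, anchor

def shuffle(pairs):
    """SHUFFLE: group all anchors by the href they point to."""
    groups = defaultdict(list)
    for href, anchor in pairs:
        groups[href].append(anchor)
    return groups

def reduce_anchors(href, anchors):
    """REDUCE: build one record per link target (hypothetical record shape)."""
    return {"target": href, "anchors": anchors}

pairs = [pair for links in articles.values() for pair in map_links(links)]
records = [reduce_anchors(href, anchors) for href, anchors in sorted(shuffle(pairs).items())]

for record in records:
    print(record)
```

Each resulting record corresponds to one document that could then be sent to a search index such as Solr.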
E1 Results
http://147.213.75.180:8080/stevo/skwikislovco/browse?q=hrad
E1 Results
• 310,571 articles processed from the XML dump, including redirects
• 4,212,467 outlinks with anchor texts extracted
• 3,977,843 inlinks
  – only outlinks to encyclopedia articles, lists, disambiguation pages, and encyclopedia redirects were converted to inlinks
• 696,874 articles indexed, including non-existing ones
• 5.71 inlinks per article on average
• 13.60 inlinks per article (considering only articles referred to more than once)
• The Wikipedia link structure together with anchor texts could be a resource for creating training sets for NLP methods such as lemmatization, stemming or NER
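The reported average of 5.71 inlinks per article is consistent with dividing the inlink count by the number of indexed articles. The short check below assumes that this is how the average was obtained; the variable names are illustrative.

```python
# Figures copied from the E1 results.
inlinks = 3_977_843
indexed_articles = 696_874

# Assumption: the average is total inlinks over all indexed articles.
average_inlinks = inlinks / indexed_articles
print(f"{average_inlinks:.2f}")  # prints 5.71
```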
E2: Named Entity Recognition
• The point
  – Annotate persons, locations and their inflected forms in Wikipedia texts and train a NER model on these texts
• Our goal was to automatically train a NER model on Wikipedia content and make it applicable to newswire texts for person and location recognition
E2: Person names extraction and annotation
• 42,500 unique person names were extracted and annotated in Slovak Wikipedia
• Approach:
  – Step 1 (22,511 unique lemmas of person names):
    • 16,454 names in Wikimedia markup, e.g. [[Ľudovít Štúr]] (* [[1815]])
    • 11,404 names in Infobox information fields
  – Step 2 (19,989 unique inflected person names):
    • Inflected forms discovered in anchor texts
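The markup pattern from Step 1, a wiki link followed by a birth year such as [[Ľudovít Štúr]] (* [[1815]]), can be matched with a regular expression. The sketch below is a heuristic built only from that one example; the exact rules used in the experiment are not given, so the pattern and the function name are assumptions.

```python
import re

# Assumed pattern: [[Name]] or [[Name|label]] directly followed by "(* [[year]]".
PERSON_PATTERN = re.compile(
    r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]"  # link target, optional piped label
    r"\s*\(\*\s*\[\[\d{3,4}\]\]"           # birth marker with a linked year
)

def extract_person_names(wikitext):
    """Return link targets that are followed by a birth-year marker."""
    return [match.group(1).strip() for match in PERSON_PATTERN.finditer(wikitext)]

sample = "[[Ľudovít Štúr]] (* [[1815]]) bol slovenský politik. [[Bratislava]] je mesto."
print(extract_person_names(sample))  # prints ['Ľudovít Štúr']
```

Links without a following birth marker, such as [[Bratislava]] in the sample, are ignored by this pattern.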
E2: Location names extraction and annotation
• Location names were extracted and annotated similarly to person names
  – Step 1: 37,121 lemmas of location names
  – Step 2: 482 inflected location names
• Total: 37,603 location names
E2: Model training
• Model training
  – Apache OpenNLP used with maximum entropy machine learning (no special tweaking made)
• Trained on annotated Wikipedia texts
  – Example of training data for person NEs:
<START:person> Newtonov <END> interpolačný polynóm alebo presnejšie interpolačný polynóm v <START:person> Newtonovom <END> tvare alebo skrátene len <START:person> Newtonov <END> polynóm je v numerickej matematike polynóm pomenovaný podľa <START:person> Isaaca Newtona <END> interpolujúci danú množinu bodov, ktorý má špecifický tvar, nazývaný '' <START:person> Newtonov <END> tvar.
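Training lines in this format can be produced by wrapping known name forms in `<START:person>` and `<END>` tags. The following is a minimal sketch, assuming a whitespace-tokenized sentence and a set of known name forms given as token tuples; the `annotate` helper is hypothetical and is not the annotation code used in the experiment.

```python
def annotate(tokens, names, label="person"):
    """Wrap the longest known name starting at each position in START/END tags."""
    max_len = max((len(name) for name in names), default=0)
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate span first so multi-token names win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(tokens[i:i + n])
            if span in names:
                out.append(f"<START:{label}> {' '.join(span)} <END>")
                i += n
                break
        else:
            # No known name starts here; keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return " ".join(out)

# Name forms taken from the training example above.
known_names = {("Newtonov",), ("Isaaca", "Newtona")}
sentence = "polynóm pomenovaný podľa Isaaca Newtona".split()
print(annotate(sentence, known_names))
# prints: polynóm pomenovaný podľa <START:person> Isaaca Newtona <END>
```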
E2: Training data and evaluation

Training data

| Sentences | Tagged person names |
| --- | --- |
| 184,602 | 91,915 |

| Sentences | Tagged person names |
| --- | --- |
| 40,579 | 38,538 |

Evaluation on training data

| Model | Precision | Recall | F1 |
| --- | --- | --- | --- |
| Person model | 0.901 | 0.617 | 0.867 |
| Location model | 0.998 | 0.995 | 0.997 |

Evaluation on test data: manually annotated news articles

| Model | Precision | Recall | F1 |
| --- | --- | --- | --- |
| Person model | 0.891 | 0.372 | 0.517 |
| Location model | 1.0 | 0.292 | 0.433 |
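For reference, the standard F1 score is the harmonic mean of precision and recall. The sketch below applies that textbook definition to the rounded precision and recall values from the evaluation tables; it is an assumption that the reported F1 column follows the same definition, and recomputing from rounded inputs need not reproduce the reported figures exactly.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded (precision, recall) pairs copied from the evaluation tables.
reported = {
    "person, training data": (0.901, 0.617),
    "location, training data": (0.998, 0.995),
    "person, test data": (0.891, 0.372),
    "location, test data": (1.0, 0.292),
}

for name, (precision, recall) in reported.items():
    print(f"{name}: F1 = {f1_score(precision, recall):.3f}")
```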
E2: Entity merging
• Trained NER models were able to discover entities in text, but could not recognize that different forms represent the same entity, e.g. [Michael Schumacher], [Michaela Schumachera], [Michaelom Schumacherom]
• Simple merging algorithm based on Levenshtein distance and suffixes
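The slide does not spell out the merging algorithm, so the following is only one possible sketch of merging by suffixes and edit distance: strip a known inflection suffix from each word, then group mentions whose normalized forms are within a small Levenshtein distance. The suffix list, the minimum stem length and the distance threshold are all assumptions chosen for the example.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# Assumed suffix list (longest first); a real list would be much longer.
SUFFIXES = ("om", "a")

def strip_suffix(word, min_stem=3):
    """Remove the first matching suffix if a long enough stem remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

def normalize(name):
    return " ".join(strip_suffix(word) for word in name.split())

def merge(mentions, max_distance=1):
    """Greedily group mentions whose normalized forms are close."""
    groups = []  # list of (normalized key, members)
    for mention in mentions:
        key = normalize(mention)
        for group_key, members in groups:
            if levenshtein(key, group_key) <= max_distance:
                members.append(mention)
                break
        else:
            groups.append((key, [mention]))
    return [members for _, members in groups]

mentions = ["Michael Schumacher", "Michaela Schumachera",
            "Michaelom Schumacherom", "Isaaca Newtona"]
print(merge(mentions))
```

Blind suffix stripping can also damage base forms that happen to end in a listed suffix, which is one reason to keep the distance check as a second safeguard.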
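E2: Most frequent suffixes of person names found in Wikipedia

| Suffix | Frequency | Suffix | Frequency | Suffix | Frequency |
| --- | --- | --- | --- | --- | --- |
| a | 5,543 | ov | 130 | ea | 55 |
| om | 4,406 | ová | 127 | s | 54 |
| ovi | 1,323 | ova | 103 | e | 53 |
| m | 779 | ovho | 88 | ových | 52 |
| ho | 589 | mu | 76 | ovom | 44 |
| ou | 541 | ove | 71 | ému | 41 |
| ej | 325 | ovou | 62 | o | 41 |
| ovej | 192 | vi | 59 | ovo | 41 |
| ého | 189 | eho | 56 | i | 40 |
| us | 143 | ovu | 56 | OTHER | 1,345 |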
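A suffix frequency table of this kind could be derived from pairs of a base form and an inflected form. The sketch below assumes the suffix is whatever remains of the inflected form after the prefix it shares with the base form; how the suffixes were actually determined is not stated, and the sample pairs are illustrative only.

```python
import os
from collections import Counter

def suffix_of(lemma, inflected):
    """Part of the inflected form left after the prefix shared with the lemma."""
    shared = os.path.commonprefix([lemma, inflected])
    return inflected[len(shared):]

# Illustrative (lemma, inflected form) pairs built from names used in this deck.
pairs = [
    ("Schumacher", "Schumachera"),
    ("Schumacher", "Schumacherom"),
    ("Newton", "Newtona"),
]

suffix_counts = Counter(suffix_of(lemma, form) for lemma, form in pairs)
for suffix, count in suffix_counts.most_common():
    print(suffix, count)
```

This simple definition breaks down when the stem itself changes during inflection, so real data would need extra handling for such cases.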
Conclusion
• The experiments do not provide ready-to-use solutions for NLP tasks, but show a usage pattern for the growing Wikipedia resource.
• Our intent was to show that Slovak Wikipedia can serve as a decent source for Language Technology training and evaluation.
Thank you!