RESOURCE-LIGHT BANTU PART-OF-SPEECH TAGGING
Guy De Pauw (UA), Gilles-Maurice de Schryver (UGent), Janneke van de Loo (UA)
Motivation
Many data-driven taggers are available, but
they need extensive annotated corpora.
Unsupervised part-of-speech tagging techniques
for resource-scarce languages show limited
results on Sub-Saharan languages.
Becoming increasingly available: digital
dictionaries, lexicons, word lists, ...
Research questions
• What information can we use for part-of-
speech tagging?
• Can we use this information to bootstrap
accurate part-of-speech taggers for the
languages under investigation?
• How does this technique compare to the
state-of-the-art in data-driven part-of-
speech tagging?
Bag-of-Substrings
Adam/PROPNAME alionekana/V chumbani/N kwake/PRON hana/NEG fahamu/N ./FULL_STOP
Train a maximum entropy classifier and compare it to a memory-based
tagger
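The slides do not spell out how the substrings are extracted, so the following is only a minimal sketch of what a bag-of-substrings feature extractor could look like. The function name `bag_of_substrings`, the `max_len` parameter, and the position markers (P for prefix, S for suffix, I for internal, W for whole word) are assumptions made for illustration, not the authors' implementation. The tagged words are taken from the example sentence on the slide.

```python
def bag_of_substrings(word, max_len=None):
    """Return the set of contiguous substrings of `word`, marked by position.

    Assumed markers (hypothetical, for illustration only):
    P = prefix, S = suffix, I = internal substring, W = the whole word.
    """
    n = len(word)
    limit = n if max_len is None else min(max_len, n)
    features = set()
    for start in range(n):
        # Substrings starting at `start`, at most `limit` characters long.
        for end in range(start + 1, min(start + limit, n) + 1):
            sub = word[start:end]
            if start == 0 and end == n:
                marker = "W"
            elif start == 0:
                marker = "P"
            elif end == n:
                marker = "S"
            else:
                marker = "I"
            features.add(f"{marker}={sub}")
    return features


# Words and tags from the example sentence on the slide.
tagged = [("Adam", "PROPNAME"), ("alionekana", "V"), ("chumbani", "N")]

# Each training instance pairs a word's feature set with its tag.
training_instances = [(bag_of_substrings(w.lower()), tag) for w, tag in tagged]
```

Feature sets of this kind could then be passed to any classifier that accepts sparse binary features, such as a maximum entropy model. Because every affix-like substring becomes a feature, the classifier can pick up on morphological cues (for example, a locative suffix such as `S=ni` in `chumbani`) without a hand-built morphological analyser.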
Experimental Results
Conclusion
In the absence of large annotated corpora, the
bag-of-substrings approach provides a low-resource,
high-accuracy bootstrapping method for
part-of-speech tagging of conjunctively written Bantu
languages.
Demos