A Toolkit for Analysis of Ainu Language



  • We described our research on developing a toolkit for Ainu language analysis.

    We described recent improvements to the previous system, as well as other enhancements made to help Ainu language researchers.

    In particular, we:

    • enhanced the POS tagger with analysis of morphological information,

    • added a translation support tool for Ainu language translators,

    • created a simple shallow parser (chunker).

    In the near future we plan to:

    - Compare different tokenization approaches (e.g., Huang et al., 2007).

    - Add other dictionaries (Nakagawa, 1995; Tamura, 1998).

    - Perform a robust evaluation of the annotations with the help of several experts and Ainu native speakers.

    - Create a deep syntactic parser.

    - Add named entity recognition.

    - Improve the system's performance.

    - Apply the toolkit to machine translation.

    A Toolkit for Analysis of Ainu Language

    Michal Ptaszynski (1), Fumito Masui (1) and Yoshio Momouchi (2)

    1) Kitami Institute of Technology, Department of Computer Science

    2) Professor Emeritus at Hokkai-Gakuen University

    Abstract

    The Ainu language is the language of the Ainu people living in northern Japan. It is critically endangered, yet it has a rich heritage including myths, stories and poetry. In numerous ongoing linguistic studies these resources have so far been analyzed by hand. We present a toolkit for the most necessary linguistic analyses to help Ainu language researchers and translators.

    Tokenization DL-LSM: (Dictionary Lookup with Longest String Matching)

    based on Longest Match Principle

    POS Tagging CON-POST: (Contextual Part of Speech Tagging)

    based on higher order HMM trained on dictionary examples (* HMM = bigrams, higher order HMM = bigrams, trigrams and longer)

    Token Translation CON-ToT: (Contextual Token Translation)

    translation selected specifically for the word selected in CON-POST

    System Description

    © Michał Ptaszyński 2013

    Output Options

    Previous Research on Ainu language

    Linguistic Studies:

    • collections of Ainu epic stories and myths (Chiri, 1978; Kayano, 1998; Piłsudski and Majewicz, 2004)

    • dictionaries and lexicons (Batchelor, 1905; Hattori, 1964; Chiri, 1975-1976; Nakagawa, 1995; Kayano, 1996; Tamura, 1998; Kirikae, 2003)

    • grammar descriptions (Chiri, 1974; Murasaki, 1979; Refsing, 1986; Kindaichi, 1993; Sato, 2008)

    NLP-related Studies:

    • an attempt to transform an Ainu language dictionary into an online database (Bugaeva, 2010)

    • automatic gathering of word translations from texts (Echizen-ya et al., 2004)

    • analysis and retrieval of hierarchical Ainu-Japanese translations (Azumi and Momouchi, 2009a,b)

    • annotation of Ainu “yukar” stories for a machine translation system (Momouchi et al., 2008)

    • a system for translation of Ainu place names (Momouchi and Kobayashi, 2010)

    Our previous work:

    • created POST-AL, a simple POS tagger for the Ainu language (Ptaszynski and Momouchi, 2012)

    View Selection

    • [default] Vertical (typical for POS taggers)

    1. Skye Hohmann. 2008. The Ainu’s modern struggle. In World Watch, Vol. 21, No. 6.

    2. Christopher Moseley (ed.). 2010. Atlas of the World’s Languages in Danger, 3rd ed. Paris, UNESCO Publishing. Online version: http://www.unesco.org/culture/languages-atlas/

    3. Yukie Chiri. 1978. Ainu shin-yoshu. Tokyo, Iwanami Shoten.

    4. Shigeru Kayano. 1998. Kayano no ainu shinwa shuusei [A collection of Ainu myths by Kayano]. vol. 1-10, Tokyo, Heibonsha.

    5. Bronisław Piłsudski (Author), Alfred F. Majewicz (Editor). 2004. The Collected Works of Bronislaw Pilsudski: Materials for the Study of the Ainu Language and Folklore, v.3, Pt. 2: Materials for the Study of the Ainu, (Trends in Linguistics: Documentation). Mouton de Gruyter (Oct 2004)

    6. Shiroo Hattori (ed.). 1964. An Ainu dialect dictionary. Tokyo, Iwanami Shoten.

    7. Mashiho Chiri. 1975-1976. Bunrui ainugo jiten [A classificational dictionary of the Ainu language], vol. 1-3, Tokyo, Heibonsha. Reprint of works from 1953, 1954 and 1962.

    8. Hiroshi Nakagawa. 1995. Ainugo Chitose Hogen Jiten: The Ainu-Japanese Dictionary: Chitose Dialect [In Japanese]. Sofukan.

    9. Shigeru Kayano. 1996. Kayano Shigeru no ainugo jiten [An Ainu dictionary by Kayano Shigeru]. Tokyo, Sanseido.

    10. Suzuko Tamura. 1998. Ainugo Saru Hogen Jiten: The Ainu-Japanese Dictionary: Saru Dialect [In Japanese]. Sofukan.

    11. Hideo Kirikae. 2003. Ainu shin-yoshu jiten: tekisuto bumpo kaisetsu tsuki (Lexicon to Yukie Chiri’s Ainu Shin-yosyu (Ainu Songs of Gods) with Text and Grammatical Notes) [In Japanese]. Sapporo: Hokkaido Daigaku Bungakubu Gengogaku.

    12. Mashiho Chiri. 1974. Ainu goho gaisetu (An outline of Ainu grammar). In Chiri Mashiho chosakushuu (Collection of works by Mashiho Chiri) [In Japanese], vol. 4, 3-197. Tokyo, Heibonsha. Reprint from 1936.

    13. Kyoko Murasaki. 1979. Karafuto ainugo. Bunpo-hen (Sakhalin Ainu. Grammar volume) [In Japanese]. Tokyo, Kokushokan-kokai.

    14. Kirsten Refsing. 1986. The Ainu language. The morphology and syntax of the Shizunai dialect. Aarhus, Aarhus University Press.

    15. Kyosuke Kindaichi. 1993. Ainu yukara goho tekiyo (An outline grammar of Ainu epic stories) [In Japanese]. In Ainugogaku kogi 2 (Lectures on Ainu studies 2). Kindaichi Kyosuke zenshu. Ainugo I, v. 5, 145-366. Tokyo, Sanseido. Reprint from Ainu jojishi yukara no kenkyu (Research on Ainu epic stories) 2, 1-233, Tokyo: Toyo Bunko, 1931.

    16. Tomomi Sato. 2008. Ainugo bunpo no kiso (The basics of Ainu grammar) [In Japanese]. Tokyo, Daigakushorin.

    17. Anna Bugaeva. 2010. Internet Applications for Endangered Languages: A Talking Dictionary of Ainu. Waseda Institute for Advanced Study Research Bulletin, No. 3, pp. 73-81.

    18. Hiroshi Echizen-ya, Kenji Araki, Yoshio Momouchi and Koji Tochinai. 2004. Acquisition of Word Translations Using Local Focus-Based Learning in Ainu-Japanese Parallel Corpora. Lecture Notes in Computer Science, Springer-Verlag, Vol. 2945, pp.300-304.

    19. Michal Ptaszynski and Yoshio Momouchi, "Part-of-Speech Tagger for Ainu Language Based on Higher Order Hidden Markov Model", Expert Systems With Applications, Vol. 39, Issue 14 (2012), pp. 11576–11582, Elsevier, 2012.

    20. Yasunori Azumi and Yoshio Momouchi. 2009a. Development of Analysis Tool for Hierarchical Ainu-Japanese Translation Data [In Japanese]. Bulletin of the Faculty of Engineering at Hokkai-Gakuen University, No.36, pp.175-193.

    21. Yasunori Azumi and Yoshio Momouchi. 2009b. Development of Tools for retrieving and analyzing Ainu-Japanese translation data and their applications to Ainu-Japanese machine translation system [In Japanese]. Engineering Research: The Bulletin of Graduate School of Engineering at Hokkai-Gakuen University, No.9, pp.37-58.

    22. Yoshio Momouchi, Yasunori Azumi and Yukio Kadoya. 2008. Research Note: Construction and Utilization of Electronic Data for “Ainu Shin-yosyu” [In Japanese]. Bulletin of the Faculty of Engineering at Hokkai-Gakuen University, No. 35, pp. 159-171.

    23. Yoshio Momouchi and Ryosuke Kobayashi. 2010. Dictionaries and Analysis Tools for the Componential Analysis of Ainu Place Names [In Japanese]. Engineering Research: The Bulletin of Graduate School of Engineering at Hokkai-Gakuen University, No.10, pp.39-49.

    24. Huang, C., Simon, P., Hsieh, S., & Prevot, L. (2007). Rethinking Chinese Word Segmentation: Tokenization, Character Classification, or Word break Identification, In Proceedings of the ACL 2007 Demo and Poster Sessions, pages 69-72, 2007

    25. John Batchelor. 1905. An Ainu-English-Japanese dictionary (including a grammar of the Ainu language). Tokyo Methodist Pub.House.

    Shallow Parser

    Introduction

    • The Ainu language is the language of the Ainu people, who live mostly in northern Japan.

    • The Ainu population is about 23 thousand people.

    • There are fewer than one hundred native speakers (Hohmann, 2008).

    • The Ainu language is critically endangered (Moseley, 2010).

    Purpose of this research:

    ↓ Create a language analysis toolkit including a POS tagger, a translation support tool and a shallow parser.

    ↓ Help in linguistic and language anthropology research and support translators of Ainu texts.

    Contribute to the process of reviving the Ainu language.

    1. POS standard selection

    • -n Nakagawa (1995) / compact

    • -t Tamura (1998) / sophisticated

    • [default] Kirikae (2003) / balanced

    • -s Simple (Japanese) / no POS description

    • -en English / English POS tag names

    • -e English (Abbreviated) / abbreviated English POS tag names (useful for linguists)

    2. Morphological Analysis

    • -m analysis of morphemes constituting each word

    Image source: http://en.wikipedia.org/wiki/Ainu_people

    Image source: http://www.amazon.co.jp

    • -h Horizontal (useful for language anthropologists)

    Additional options: -ts Translation support

    *) ORIGINAL TRANSLATION (私はそれを見て、安心をし流れに沿うて帰つて来た. [I saw it, felt relieved, and came back along the stream.])

    Conclusions and Future Work

    We applied the POS information from the POS tagger to create a simple shallow parser, or chunker, which:

    1. Divides the sentence into clauses containing the longest possible string of consecutive nouns and verbs and closely related morphemes (particles, prefixes, suffixes).

    2. Divides the clauses into noun phrases (NP) and verb phrases (VP).
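The two chunking steps can be sketched in Python as follows. This is a minimal illustration and not the toolkit's own code: the tag names (`NOUN`, `VERB`, `PARTICLE`, `PREFIX`, `SUFFIX`, `CONJ`), the placeholder tokens, and the attachment heuristic (prefixes join the following head, particles and suffixes join the preceding one) are all assumptions.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface form, POS tag); tag names are hypothetical

HEADS = {"NOUN": "NP", "VERB": "VP"}         # head tag -> phrase label
ATTACHED = {"PARTICLE", "PREFIX", "SUFFIX"}  # morphemes kept inside a clause


def split_clauses(tagged: List[Token]) -> List[List[Token]]:
    """Step 1: maximal runs of nouns, verbs and closely related morphemes."""
    clauses, current = [], []
    for token in tagged:
        if token[1] in HEADS or token[1] in ATTACHED:
            current.append(token)
        elif current:  # any other tag closes the current clause
            clauses.append(current)
            current = []
    if current:
        clauses.append(current)
    return clauses


def split_phrases(clause: List[Token]) -> List[Tuple[str, List[Token]]]:
    """Step 2: group a clause into NP and VP chunks (assumed heuristic)."""
    phrases: List[Tuple[str, List[Token]]] = []
    pending: List[Token] = []  # prefixes waiting for the next head
    for token in clause:
        tag = token[1]
        if tag in HEADS:
            label = HEADS[tag]
            if phrases and phrases[-1][0] == label:
                phrases[-1][1].extend(pending + [token])
            else:
                phrases.append((label, pending + [token]))
            pending = []
        elif tag == "PREFIX" or not phrases:
            pending.append(token)
        else:  # particle or suffix: attach to the preceding phrase
            phrases[-1][1].append(token)
    if pending:  # leftover morphemes with no following head
        if phrases:
            phrases[-1][1].extend(pending)
        else:
            phrases.append(("X", pending))
    return phrases


def chunk(tagged: List[Token]) -> List[List[Tuple[str, List[Token]]]]:
    return [split_phrases(clause) for clause in split_clauses(tagged)]


# Placeholder tokens, not real Ainu words.
example = [("n1", "NOUN"), ("p1", "PARTICLE"), ("x1", "PREFIX"), ("v1", "VERB"),
           ("c1", "CONJ"), ("n2", "NOUN"), ("n3", "NOUN"), ("v2", "VERB"),
           ("s1", "SUFFIX")]
for clause in chunk(example):
    print(clause)
```

Keeping the two steps as separate functions mirrors the numbered list above and makes it possible to change the phrase heuristic without touching clause detection.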

    It is difficult to create a fully functional deep parser.

    Ainu language grammar is sophisticated:

    • polysynthetic language (more morphemes than word stems)

    • words relate to each other and change meanings in context

    • incomplete model of grammar.

    However, the sentence structure of the Ainu language is generally similar to Japanese (SOV).

    General rules for sentence chunking should work.

    TAMURA STANDARD WITH MORPHOLOGICAL ANALYSIS

    • base dictionary for POST-AL: Ainu shin-yoshu jiten (Lexicon to Yukie Chiri’s Ainu Shin-yosyu (Ainu Songs of Gods)) by Kirikae (2003)

    • dictionary information transformed into an XML database with the following fields:

    1. token (word, morpheme, etc.)

    2. part of speech

    3. meaning (in Japanese)

    4. usage examples (not for all cases)

    5. reference to the story it appears in (not for all cases)
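A dictionary entry with these five fields could be stored and read as in the following Python sketch. The element names (`dictionary`, `entry`, `token`, `pos`, `meaning`, `example`, `story`) and all values are placeholders assumed for illustration; the actual POST-AL schema is not given here.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout; the second entry omits the optional fields.
SAMPLE = """<dictionary>
  <entry>
    <token>token_a</token>
    <pos>noun</pos>
    <meaning>meaning_a</meaning>
    <example>token_a token_b</example>
    <story>10</story>
  </entry>
  <entry>
    <token>token_b</token>
    <pos>verb</pos>
    <meaning>meaning_b</meaning>
  </entry>
</dictionary>"""


def load_entries(xml_text):
    """Parse entries into dicts; optional fields become [] or None."""
    root = ET.fromstring(xml_text)
    entries = []
    for node in root.findall("entry"):
        entries.append({
            "token": node.findtext("token"),
            "pos": node.findtext("pos"),
            "meaning": node.findtext("meaning"),
            "examples": [e.text for e in node.findall("example")],
            "story": node.findtext("story"),  # None when the field is absent
        })
    return entries


for entry in load_entries(SAMPLE):
    print(entry)
```

Treating usage examples and the story reference as optional follows the "not for all cases" notes in the field list.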

    Dictionary Construction

    Image source: http://www.amazon.co.jp

    Tokenization

    DL-LSM: (Dictionary Lookup with Longest String Matching)

    based on Longest Match Principle

    DL-P-LSM: (Dictionary Lookup with Partial-LSM)

    based on LMP with caesurae
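The Longest Match Principle behind DL-LSM can be illustrated with a minimal Python sketch. It is not the POST-AL implementation: the function name `tokenize_longest_match`, the toy lexicon of placeholder strings, and the fallback of emitting an unmatched character as its own token are assumptions made for illustration. DL-P-LSM is not covered.

```python
def tokenize_longest_match(text, lexicon):
    """Greedy dictionary lookup: at each position take the longest known entry."""
    max_len = max((len(word) for word in lexicon), default=1)
    tokens, i = [], 0
    while i < len(text):
        if text[i].isspace():  # whitespace only separates tokens
            i += 1
            continue
        match = None
        # Try the longest candidate first, then shrink one character at a time.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in lexicon:
                match = candidate
                break
        if match is None:  # assumed fallback for characters not in the lexicon
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens


# Placeholder lexicon, not real Ainu entries.
LEXICON = {"ab", "abc", "cd", "d"}
print(tokenize_longest_match("abcd", LEXICON))
```

With this lexicon the input `abcd` is split as `abc` + `d` rather than `ab` + `cd`, because the longer entry wins at the first position.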

    POS Tagging

    S-POST: (Statistical Part of Speech Tagging)

    all words of the same form are treated as one list,

    choose POS with the highest occurrence
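The S-POST idea (pool all entries sharing a word form and pick the most frequent POS) can be sketched in Python as below. The function names, the `UNK` tag for unseen forms, and the placeholder data are assumptions, not part of the original system.

```python
from collections import Counter, defaultdict


def build_pos_counts(entries):
    """Count how often each POS occurs for each word form."""
    counts = defaultdict(Counter)
    for form, pos in entries:
        counts[form][pos] += 1
    return counts


def tag_most_frequent(tokens, counts, unknown="UNK"):
    """Assign each token the POS with the highest occurrence count."""
    tagged = []
    for token in tokens:
        if token in counts:
            pos = counts[token].most_common(1)[0][0]
        else:
            pos = unknown  # assumed handling of forms missing from the dictionary
        tagged.append((token, pos))
    return tagged


# Placeholder (form, POS) pairs, not real dictionary data.
ENTRIES = [("w1", "noun"), ("w1", "verb"), ("w1", "noun"), ("w2", "particle")]
print(tag_most_frequent(["w1", "w2", "w3"], build_pos_counts(ENTRIES)))
```

Because the choice ignores context, an ambiguous form always receives the same tag, which is the limitation the contextual tagger addresses.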

    CON-POST: (Contextual Part of Speech Tagging)

    based on higher order HMM trained on dictionary examples

    (* HMM = bigrams, higher order HMM = bigrams, trigrams and longer)
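To show how context can resolve an ambiguous word form, here is a minimal bigram HMM tagger with Viterbi decoding in Python. It is a simplified stand-in and not CON-POST itself: it uses only bigram tag transitions (no trigrams or longer n-grams), and the additive smoothing constant `alpha`, the start symbol `<s>`, and the placeholder training data are assumptions.

```python
import math
from collections import Counter, defaultdict


def train_bigram_hmm(tagged_sentences):
    """Collect tag-transition and tag-to-word emission counts."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for sentence in tagged_sentences:
        prev = "<s>"
        for word, tag in sentence:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return trans, emit


def viterbi(words, trans, emit, alpha=0.1):
    """Return the most probable tag sequence under the bigram HMM."""
    if not words:
        return []
    tags = sorted(emit)
    vocab_size = len({w for c in emit.values() for w in c}) + 1  # +1 for unseen words

    def log_p(counter, key, size):
        # Additive smoothing keeps unseen events from having zero probability.
        return math.log((counter[key] + alpha) / (sum(counter.values()) + alpha * size))

    # Each column maps tag -> (best log score, previous tag on that path).
    best = [{t: (log_p(trans["<s>"], t, len(tags)) + log_p(emit[t], words[0], vocab_size), None)
             for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max((best[-1][p][0] + log_p(trans[p], t, len(tags)), p) for p in tags)
            column[t] = (score + log_p(emit[t], word, vocab_size), prev)
        best.append(column)

    # Backtrace from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))


# Placeholder training data: "w" is ambiguous between NOUN and VERB.
TRAIN = [[("a", "NOUN"), ("w", "VERB")],
         [("w", "NOUN"), ("b", "VERB")],
         [("a", "NOUN"), ("b", "VERB")]]
trans, emit = train_bigram_hmm(TRAIN)
print(viterbi(["w", "w"], trans, emit))
```

In this toy data the same form `w` is tagged differently depending on its position, which a frequency-only tagger cannot do.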

    Token Translation

    RAN-ToT: (Random Token Translation)

    translation selected randomly from the list of words of the same POS (S-POST extension)

    CON-ToT: (Contextual Token Translation)

    translation selected specifically for the word selected in CON-POST

    System Description

    input → tokenization (DL-LSM | DL-P-LSM) → POS tagging (S-POST | CON-POST) → token translation (RAN-ToT | CON-ToT) → output

    References

    Evaluation Dataset Description

    • 13 Ainu stories (yukar) from Ainu shin-yoshu (Ainu Songs of Gods) gathered by Chiri (1978).

    • All stories are tokenized (by Kirikae, 2003).

    • One yukar is annotated with POS and translations (by Momouchi et al., 2008):

    Yukar 10: Pon Okikirmuy yayeyukar “kutnisa kutunkutun” (The “Kutnisa kutunkutun” story told by Small Okikirmuy himself)

    Image source: http://www.amazon.co.jp

    Score Calculation

    Scores are calculated as the balanced F-score (F1) for all parts of POST-AL.
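The balanced F-score is the harmonic mean of precision and recall. The Python sketch below applies it to tokenization by comparing character spans; how precision and recall were actually counted for each part of POST-AL is not stated here, so the span-based comparison and the placeholder tokens are assumptions.

```python
def f1_score(predicted, gold):
    """Balanced F-score (F1) over two collections of items."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


def token_spans(tokens):
    """Convert a token sequence into (start, end) character offsets."""
    spans, start = [], 0
    for token in tokens:
        spans.append((start, start + len(token)))
        start += len(token)
    return spans


# Placeholder segmentations of the same five-character string.
gold = ["ab", "cd", "e"]
predicted = ["ab", "c", "de"]
print(f1_score(token_spans(predicted), token_spans(gold)))
```

For POS tagging or token translation the same `f1_score` function could be given (position, label) pairs instead of spans.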

    Evaluation

    Results

    Tokenization: DL-LSM was slightly better (98.46%).

    POS tagging: Contextual POS tagging was much better (96.96%) than statistical (90.11%).

    Token translation: Contextual token translation was much better (98.36%) than statistical (90.11%).

    Additional options: -o Additional information