paper nlp

Upload: khuongnn

Post on 04-Jun-2018

215 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/13/2019 Paper NLP

    1/7

    Vietnamese Text Accent Restoration With Statistical Machine Translation

    Luan-Nghia Pham

    Department of Information Technology

    Haiphong University

    Haiphong,Vietnam

    [email protected]

    Viet-Hong Tran

    University of Economic

    And Technical Industries

    Hanoi,Vietnam

    [email protected]

    Vinh-Van Nguyen

    University of Engineering and Technology

    Vietnam National University

    Hanoi,Vietnam

    [email protected]

    Abstract

    Vietnamese accentless texts exist on paral-lel with official vietnamese documents and

    play an important role in instant message,mobile SMS and online searching. Under-standing correctly these texts is not sim-ple because of the lexical ambiguity causedby the diversity in adding diacritics to agiven accentless sequence. There have beensome methods for solving the vietnameseaccentless texts problem known as accentprediction and they have obtained promis-ing results. Those methods are usually basedon distance matching, n-gram, dictionary ofwords and phrases and heuristic techniques.

    In this paper, we propose a new methodsolving the accent prediction. Our methodcombine the strength of previous methods(combining n-gram method and phrase dic-tionary in general). This method considersthe accent predicting as statistical machinetranslation (SMT) problem with source lan-guage as accentless texts and target languageas accent texts, respectively. We also im-prove quality of accent predicting by apply-ing some techniques such as adding dictio-nary, changing order of language model andtuning. The achieved result and the ability

    to enhance proposed system are obviouslypromising.

    1 Introduction

    Accent predicting problem refers to the situationwhere accents are removed (e.g. by some emailpreprocessing systems), cannot be entered (e.g. bystandard English keyboards), or not explicitly rep-resented in the text (e.g. in Arabic). We resolve

    the languages using Roman characters in writ-ing together with additional accent and diacriti-cal marks. These languages include European lan-

    guages such as Spanish and French and Asian lan-guages such as Chinese Pinyin and Vietnamese.

    Vietnamese accentless texts coexist with officialVietnamese texts and it is relatively common textson the internet. Official Vietnamese language isa complex language with many accent (includingacute, grave, hook, tilder, and dot-below) and Lat-inh alphabets. These are two inseparable compo-nents in Vietnamese. However, many Vietnamesechoose to use accentless Vietnamese because it iseasier and quickly to type. For example, a officialVietnamese sentence:chng ti s bay ti H Nivo ch nht(We will fly to Hanoi on Sunday)will be written as an Vietnamese accentless sen-tence aschung toi se bay toi Ha Noi vao chu nhat.Decoding such a sentence could be quite hard forboth human and machine because of lexical ambi-guity. For instance, the accentless term "toi"caneasily lead to confusion between the original Viet-namese "ti"(we) and the plausible alternative"ti" (to).

    Nowadays, the application of information tech-nology to exchanging information is more andmore popular. We daily receive many of emails,SMS but the majority of them are without ac-cents which may cause troubles for interpreting themeaning. Therefore, automatic accent restorationof accentless Vietnamese texts have many of ap-plications such as automatically inserting accentto emails, instant message, SMS are written with-out diacritics Vietnamese, or assistant for websiteadministration in which accent Vietnamese is re-quired. Therefore, it is essential to develop sup-porting tools which can automatically insert ac-cent to Vietnamese texts.

    Accent predicting problem is the particular

    problem of lexical disambiguation. The recent ap-proach to lexical disambiguation is corpus-basedsuch as n-gram, dictionary of phrases, ...

  • 8/13/2019 Paper NLP

    2/7

    In this paper, we propose the method for auto-matic accent restoration using Phrase-based SMT.Vietnamese accentless sentence and Vietnameseaccent sentence (office Vietnamese sentence) willbe source and target sentence in Phrase based

    SMT, respectively. We also improve quality of ac-cent predicting by applying some techniques suchas adding dictionary, changing size of n-gram oflanguage model. The experiment results with Viet-namese corpus showed that our approach achievespromising results.

    The rest of this paper is organised as follows.Related works are mentioned in Section 2. Themethods for accent restoration using SMT are pro-posed in Section 3. In Section 4, we describe theexperiments and results for evaluating the pro-

    posed methods. Finally, Section 5 concludes thepaper.

    2 Related works

    In the recent years, several different methods wereproposed to automatically restore accent for Viet-namese texts.

    The VietPad (Quan, 2002) used a dictionary fileand this one is stored all of words in Vietnamese.The idea of VietPad is based on the use of dictio-nary file and each non-diacritic word is mapped 1-1 into diacritic word. However, the dictionary filealso is stored many words which are rarely used sothese words is incorrectly restored accent. There-fore, VietPad can only restore Vietnamese accenttexts with accuracy about 60-85% and this accu-racy is depended on content of text.

    The AMPad (Tam, 2008) is an efficiency Viet-namese accent restoration tool and texts can berestored online. The idea of AMPad is based onthe statistical frequency of diacritics words which

    correspond with non-diacritics word. The programused selection algorithm and the most appropriateword is chosen. AMPad can restore Vietnameseaccent texts with accuracy about 80% or higher forpolitical commentary and popular science domain.However, It also restore Vietnamese accent withaccuracy about 50% for specialized documents orliterature and poetry documents which have com-plex sentence structure and multiple meaning.

    The VietEditor (Lan, 2005) is based on the ideaof VietPad but it is improved. This program used

    a dictionary file and this file is stored all of phrasewhich are often used in Vietnamese texts. Thisfile is called phrase dictionary. After each non-

    diacritic word is mapped 1-1 into diacritic word,the program check the phrase dictionary to find themost appropriate word. VietEditor restore Viet-namese accent texts more flexibly and accuratelythan VietPad.

    The viAccent (Truyen et al., 2008) allows youto restore Vietnamese accent texts online andwith several different speed. Generally, the slowerspeed is the better result is. The idea of programused n-gram language model and it is reported atthe conference PRICAI 2008 (The Pacific Rim In-ternational Conference on Artificial Intelligence).

    The VnMark (Toan, 2008) used n-gram lan-guage model to create a dictionary file. It is VN-MarkDic.txt file. This file shows occurrence prob-ability of phrase in Vietnamese texts and it will

    build the case restoration for word or phrase. Thiscombination will create sentences which are re-stored Vietnamese accents. When the weight ofeach sentence is identified, the best way will beselected accent restoration for Vietnamese text.However, Accuracy depends on the sentences ofthe dictionary file. Because the number of sen-tence is few so the result is still limited.

    3 Our method

    As mentioned in the section 2, the studies aboutaccent restoration for Vietnamese text are basedon experience. These studies used the dictionaryfile (such as VietPad) or n-gram language modelwith phrase (such as VnMark, AmPad, viAccent...) but They are not yet generalized because thiscombination has some limitation as following:

    The number of phrases are few (each phraseis only 2 words, 3 words) and there are nopriority when each phrase is chosen.

    The combination of language model andphrase dictionary is simple and It is mainlybased on experience.

    To overcome the above limitation, we present ageneral approach to restore accent for Vietnamesetext. This method is viewed as machine trans-lation from non-diacritical Vietnamese language(source language) into diacritical Vietnamese lan-guage (target language). This method has solvedtwo above limitation by the use of phrase-table

    with the priority levels (the length of the phrase isarbitrary and the corresponding translation prob-ability value) and It is combined language model

  • 8/13/2019 Paper NLP

    3/7

    with phrase-table by log-linear model (adding andturning of parameters for combination).

    3.1 Overview of Phrase-table Statistical

    Machine Translation

    In this section, we will present the basics ofPhrase-based SMT toolkit(Koehn et al., 2003).The goal of the model is translating a text fromsource language to target language. As describedby (1), we have sentences in source language (En-glish) eI

    1 = e1, . . . ej , . . . , eI which are trans-

    lated into target language (Vietnamese) vJ1 =

    v1, . . . , vj , . . . vJ. Each sentence can be found inthe target text then the we will choose sentence sothat:

    VJ1 = argmaxvI1

    Pr(vJ1 /eJ1 ) (1)

    The figure below illustrates the process of phrase-based translation:

    Figure 1: Phrase-based translation model

    In phrase translation model, sequence of con-secutive words (phrases) are translated into the tar-get language. The length of source phrase can bedifferent from target phrase. We divide source sen-tence into several phrases and each phrase is trans-lated into a target language, then it reorder thephrases so that the target sentence to satisfy theformula (1) and then they are grafted together. Fi-nally, we get a translation in the target language.

    Figure 2 shows the phrase-based statistical

    translation model. There are many translationknowledge which can be used as language mod-els, translation models, etc. The combination ofcomponent models (language model, translationmodel, word sense disambiguation, reorderingmodel...) is based on log-linear model (Koehn etal., 2003).

    3.2 Accent prediction based on Phrase

    Statistical Machine Translation.

    Vietnamese texts with accent are collected fromnewspapers, books, the internet, etc., and they arereprocessed to remove the excess characters. Then

    Figure 2: Diagram Phrase-based SMT translation basedon log-linear model

    vnTokenizer tool (Hong et al., 2008) is used to seg-ment words in Vietnamese language. After that,the text file with accent and a corresponding ac-centless text fileis are created. Two text files arethe same number of line and each line of accenttext file corresponds with an accentless line in an-other text file. Figure 3 shows some sentences incorpus.

    Figure 3: Some sentences in corpus

    The accents removal is processed by building map-ping table between the accents words and corre-sponding accentless Vietnamese words . For ex-ample:

    a = {a, , , , , , , , , , , , , }

    Mapping table of characters include uppercase andlowercase letters in Vietnamese. After preprocess-

    ing we have a corpus file and we processed train-ing data. Finish, we received phrase table and lan-guage model for machine learning. This phrase ta-ble is stored phrases with the different length. Lan-guage model with n-grams is covered nearly allVietnamese sentences and this training process isautomatically executed.

    The general architecture of our method is de-scribed on Figure 4:

    Phrase-based Statistical Translation model has

    three important components. They include trans-lation model, language model and decoder. Trans-lation results are calculated with parameters in the

  • 8/13/2019 Paper NLP

    4/7

    Figure 4: Accents restoration base on Phrase-basedSMT

    phrase table and language model. Example, ac-centless sentence restoration in Vietnamese:neil Young da bieu dien nhieu the loai nhac rockva blue.After the phrases is segmented. This sentence is:neil_Young da bieu_dien nhieu the loai nhac rockva blue.Results after the translation:neil_Young biu_din nhiu th loi nhc rockv blue .

    First, the source sentence will be divided intophrases neil_Young, neil_Young da, neil_Youngda bieu_dien, da bieu_dien, bieu_dien, bieu_diennhieu,....

    After that, the system find the probability ofeach phrase in the phrase table and languagemodel then the weight of sentence is computed .Example:We found weight of above phrases in the phrasetable and language model:da bieu_dien ||| biu_din ||| 1 1 1 0.857179

    2.718 ||| 1 1the loai ||| th loi ||| 1 1 1 0.0362146 2.718 ||| 2 2

    -3.436823 -0.3055001-2.309609 -0.5276677-4.109961 biu_din -0.2860174-4.168163 biu_din-2.227281 biu_din nhiu-1.628649 th loi

    Finally, the system is implemented by the de-coder process. For each translation choice will

    have a hypothesis. Suppose first selection wordis the neil_Young, this word is translated intoneil_Young (unchanged) because it do not find

    corresponding word. For simplicity, we translatefrom left to right of sentence. Next, da wordis translated, for example, it can be or .The probability of each hypothesis is calculatedand updated for the each new hypothesis. Next,

    bieu_dien is translated, this phrase is found in thephrase table, language model and a phrase is cho-sen that is biu_din phrase. Combining hypothe-ses can happen, da bieu_dien phrase is restored to biu_din. Continue until all the words of sen-tence are translated.

    4 Experiment Results

    4.1 Corpus Statistics and Experiments

    For evaluation, we used an accentless Vietnamese

    corpus with 206000 pairs, including 200000 pairsfor training, 1000 pairs for tunning and 5000 pairsfor development test set.

    The corpus for experiments was collected fromnewspapers, books,... on the internet with topicssuch as social, sports, science (Nguyen et al.,2008). The Table 1 shows the summary statisticalof our data sets. Several experiments are processedon the basis of Phrase-based Statistical MachineTranslation model with MOSES open-source

    decoder (Koehn et al., 2007). For training dataand turning parameters, we used standard settingsin the Moses toolkit (GIZA++ alignment, grow-diagfinal-and, lexical reordering models, MERTturning). To build the language model, we usedSRILM toolkit (Stolcke, 2002) with 3 and 4-gram.In this experiments, we evaluated the quality ofthe translation results by BLEU score (Papineni etal., 2002) and accuracy sentences.

    We performed experiments on MT_VR system

    and MT_VR+Dict system:

    MT_VR is a baseline Vietnamese restorationsystem. This system uses phrase-based statis-tical machine translation with standard set-tings in the Moses toolkit.

    MT_VR+Dict is a baseline Vietnameserestoration combine with dictionary informa-tion.

    4.2 Results

    We experimented on several different corpusand we evaluated translation quality.

  • 8/13/2019 Paper NLP

    5/7

    Corpus Statistical Vietnamese with accents Vietnamese without accentsTraining Sentences 200000

    Average Length 22.3 22.3Word 4474378 4474378

    Development Sentences 1000Average Length 24,3 24,3

    Word 24343 24343Test Sentences 5000

    Average Length 22,1 22,1Word 110729 110729

    Table 1: The Summary statistical of data sets

    Translate model with the corpus include50.000, 100.000, 150.000 and 200.000 sen-tence pairs. After successful training, wetested with 5.000 pairs of sentence.

    To improve the quality of the system we need tobuild a corpus with better quality as well as greatercoverage and we need to process accurately data.We have improved on some of the approach

    4.2.1 Improved models using dictionary

    The training from the raw corpus may havesome limitations due to the size of the corpus. Ifthe corpus is too small, the possibility of usefulphrases are not learned when building phrase ta-ble. However, if corpus is too larger could in ex-cess. In addition, we used the automatic segmentof phrase tool so that it can be some errors in theanalysis. We added Vietnamese dictionary of com-pound and syllable word into the phrase translationtable and we assigned weight 1 into the each word,we solved this problem. Results as following:

    Table 2: The accuracy of experiment systems

    Table 3: The accuracy of experiment systems

    4.2.2 Improved model using n-gram level

    changes

    Changing n-gram level, the increased levelof language model improved translation results.

    However, with level 4 or higher then the resultshas not been almost changed. Because in the Viet-namese language, the phrases including 3 or 4words are more related than each other. The re-sult with the weight assigned to 1 and level oflanguage model 4: BLEU score=0.9850; accuracywhen using MT_VR combine dictionary informa-tion:92%.

    Table 4: Results with different n-grams levels

    Figure 5: Compare BLUE score with experiment sys-tems

    Table 3 shows experimental results with differenttraining corpus. Experimental results show that us-ing MT_VR combine dictionary infomation with

    3-gram have the best result. Table 4 and Figure 5show that training with 200.000 pairs of sentencewill have the best accuracy accents prediction.

  • 8/13/2019 Paper NLP

    6/7

    Table 5: Accent prediction of some sentences

    4.2.3 Comparison with other methods

    We also compared our method with viAccentsystem(Truyen et al., 2008) because viAccent isthe newest and efficient method for Vietnameseaccent prediction. We conducted the experimentwith the same test corpus (5000 sentences) for vi-Accent. Bleu scores of both MT_VR+Dict and vi-Accent system were showed on Table 6.

    Table 6: Compared our method with viAccent system

    5 Conclusion

    The experimental results showed that our approachachieves significant improvements over viAccentsystem. Performance of accent prediction with ourmethod achieves better accuracy than that andsome examples in test corpus was showed on Table

    5. In this paper, we introduced the issues of ac-cents prediction for accentless Vietnamese texts

    and proposed a novel method to resolve this prob-lem. The our idea is based on Phrase-based Statis-tical Machine Translation to develop a Vietnamesetext accent restoration system.

    We combined the advantage of previous ap-proach such as n-gram languages model andphrase dictionary. In general, experimental resultsshowed that our approach achieves promised per-formance. The quality of accents prediction can beimproved if we have a better corpus or assignedappropriate weight to dictionary.

    6 Acknowledgments

    This work is funded by the Vietnam NationalFoundation for Science and Technology Develop-ment (NAFOSTED) under grant number 102.01-2011.08

    References

    Phuong-Le Hong, Huyen-Nguyen Thi Minh, AzimRoussanaly, and Vinh-Ho Tuong. 2008. A Hy-brid Approach to Word Segmentation of Vietnamese

    Texts. In Proceedings of the 2nd International Con-ference on Language and Automata Theory and Ap-plications, Springer, LNCS 5196.

  • 8/13/2019 Paper NLP

    7/7

    Philipp Koehn, Franz Josef Och, and Daniel Marcu.2003. Statistical phrase-based translation. In Pro-ceedings of HLT-NAACL 2003, pages 127133. Ed-monton, Canada.

    Philipp Koehn, Hieu Hoang, Alexandra Birch, ChrisCallison-Burch, Marcello Federico, Nicola Bertoldi,

    Brooke Cowan, Wade Shen, Christine Moran,Richard Zens, Chris Dyer, Ondrej Bojar, Alexan-dra Constantin, and Evan Herbst. 2007. Moses:Open source toolkit for statistical machine transla-tion. Proceedings of ACL, Demonstration Session.

    Phan Quoc Lan. 2005. Approach to add accents forVietnamese text without accent. Informatics Bache-lors thesis, VietNam National University of Ho ChiMinh City.

    Thai Phuong Nguyen, Akira Shimazu, Tu Bao Ho,Minh Le Nguyen, and Vinh Van Nguyen. 2008. Atree-to-string phrase-based model for statistical ma-

    chine translation. In Proceedings of the TwelfthConference on Computational Natural LanguageLearning (CoNLL 2008), pages 143150, Manch-ester, England, August. Coling 2008 OrganizingCommittee.

    Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for AutomaticEvaluation of Machine Translation. ACL.

    Nguyen Quan. 2002. VietPad. http://vietpad .source-forge.net.

    A. Stolcke. 2002. Srilm - an extensible languagemodeling toolkit, in Proceedings of International

    Conference on Spoken Lan-guage Processing.Tran Triet Tam. 2008. AMPad. http://www.echip.com.vn/echiproot/weblh/qcbg/duynghi/automark.

    Nguyen Van Toan. 2008. VnMark.Tran The Truyen, Dinh Q. Phung, and Svetha

    Venkatesh. 2008. Constrained Sequence Classifi-cation for Lexical Disambiguation. In Proceedingsof PRICAI 2008, pages 430441.