ieee rivf’09, 16 july 2009 vietnamese language processing: issues and challenges ho tu bao japan...

IEEE RIVF’09, 16 July 2009

Vietnamese Language Processing: Issues and Challenges

Ho Tu Bao

Japan Advanced Institute of Science

and Technology

Vietnamese Academy of Science and Technology

(Keynote talk at international conference IEEE RIVF 2009)


Japan Advanced Institute of Science and Technology

Institute of Information TechnologyVietnamese Academy of Science & Technology


Outline

Problems and progress in natural language processing

Issues and challenges in Vietnamese language processing

Our VLSP project (Vietnamese Language and Speech Processing)


Natural language processing?

Psychological view: Understand human language processing Alan Turing: Propose

to consider the question: “Can machine think?”

Engineering view: Build systems to process language

Alan Turing (1912-1954)Alan Turing (1912-1954)

Người máy ASIMO đưa đồ uống cho khách theo yêu cầu(http://world.honda.com/ASIMO/)


More languages than you might have thought

We meet here today to talk about processing of Vietnamese language and speech.

Aujourd'hui nous nous réunissons ici pour discuter le traitement de langue et de parole vietnamienne.

Cегодня мы встрачаемся здесь, чтобы говорить о обработке вьетнамского языкa и речи.

今日我々はここに集まりベトナム語と発言処理について議論します。 오늘 우리는 여기에 모여서 베트남어와 발언처리에 대하여

의론하겠습니다 .

لغة و الفيتنامية اللغة عن لنتحدث اليوم هنا نجتمع أنناالخطاب

Hôm nay chúng ta gặp nhau ở đây để nói về xử lý ngôn ngữ và tiếng nói tiếng Việt.

6912 distinct languages (230 spoken in Europe, 2197 in Asia)


54 ethnic groups in Vietnam

Language groups

Mon-Khmer

Tay-Thai

Tibeto-Burman

Malayo-Polysian

Kadai

Mong-Dao

Han


English websites and Vietnamese?


Translation and machine translation

Translate the following sentence into English

“Ông già đi nhanh quá”?

Many possible translations

1. [Ông già] [đi] [nhanh quá] The old man walks too fast

My father walks too fast

2. [Ông già] [đi] [nhanh quá] The old man died too fast

My father died too fast

3. [Ông] [già đi] [nhanh quá] You get old too fast

Grandfather gets old too fast

Ambiguity of language


Two approaches to machine translationLinguistic rule-based machine translation words are translated by

using linguistic rules about the two languages, the

correspondence transfer

between them (morphology, syntax, etc)

Requires understanding natural language

Statistical machine translation

generate translations using statistical learning methods based on bilingual text corpora (statistically similar)

Requires large and qualified bilingual text corpora.DOMINATING!


Lexical / Morphological Analysis

Syntactic Analysis

Semantic Analysis

Discourse Analysis

Tagging

Chunking

Word Sense Disambiguation

Grammatical Relation Finding

Named Entity Recognition

Reference Resolution

Shallow parsing

The woman will give Mary a book

The/Det woman/NN will/MD give/VB Mary/NNP a/Det book/NN

POS tagging

[The/Det woman/NN]NP [will/MD give/VB]VP [Mary/NNP]NP [a/Det book/NN]NP

chunking

[The woman] [will give] [Mary] [a book]

relation findingsubject

i-object object

text

meaning

From text to the meaningNatural Language Processing (NLP)


1990s–2000s: Statistical learning algorithms, evaluation, corpora

1980s: Standard resources and tasks Penn Treebank, WordNet, MUC

1970s: Kernel (vector) spaces clustering, information retrieval (IR)

1960s: Representation Transformation Finite state machines (FSM) and

Augmented transition networks (ATNs)

1960s: Representation—beyond the word level

lexical features, tree structures, networks

Trainable FSMs

Trainable parsers

Archeology of natural language processing

(Hovy, COLING 2004)

Natural language

processing

Information retrieval and Information

extraction


some ML/Stat no ML/Stat

(Pages 11-12 from Marie Claire, ECML/PKDD 2005)

ML and statistical methods in NLP


Recent learning methods in NLP


Large investment from the government and industry National Institute of Standards and Technology (NIST), ATR, NICT USA, CHINA, Singapore, etc.

NLP & CL organizations ACL (Assoc. Comp. Linguistics) NACL( North Amer. Assoc. on CL) EACL (Euro Association on CL) PACLIC (Pacific Assoc. on CL) ICCL (Inter. committee CL)

Many NLP people

Rich resources and tools

NLP R&D in other countries

Linguistic Data Consortium


Vietnamese language was established a long time ago

Chinese characters was used for a long time

Unique writing system of Vietnam called Chu Nom (字喃 ) in the 10th century

Romanced script to represent the Quốc Ngữ since the beginning of the 20th century

Nam quốc sơn hà Nam đế cư

南国山河南帝居Over Mountains and Rivers of the South, Reigns the Emperor of the

South

Vietnamese language


Vietnamese language

Vietnamese is an analytic language (words are composed of a single morpheme).

ngôn ngữ (analytic), lang-gua-ge (synthetic), 言語 (synthetic)

Vietnamese does not use morphological marking of case, gender, number, and tense.

Trưa nay tôi ăn ba thằng tôm

Syntax conforms to Subject Verb Object word order Cái thằng chồng em nó chẳng ra gì.

FOCUS CLASSIFIER husband I he not turn.out what

“That husband of mine, he is good for nothing.”


Most work aims at machine translation or other tasks at top layers but very few basic work at lower layers

Work done in isolation, no inheritance people have to do their work from the scratch without sharing and collaborating no standards.

Almost no resources and tools for VLSP

Vietnamese Language and Speech Processing

Many tools such as ChaSen, Yamcha, …

このひとことで元気になった

No tool to do such a simple task


National project with eleven active research VLSP groups from Ho Chi Minh City to Hanoi, with two objectives:

Building VLSP infrastructure, especially indispensable resources and tools for the VLSPdevelopment.Building and developing several typical VLSP products for public end-users.

VLSP national project (KC01.01.05/06-10)5.2007-8.2009

Natural language processing methods

Pragmatics: Speech, text and Web data mining

Tools, corpora,

resources


SP7.3Vietnamese treebank

SP7.3Vietnamese treebank

SP7.4E-V corpora of aligned

sentences

SP7.4E-V corpora of aligned

sentences

SP3English-Vietnamesetranslation system

SP4IREST: Internet use

support system

SP5Vietnamese spelling

checker

SP5Vietnamese spelling

checker

SP8.2 Vietnamese word

Segmentation


Segmentation

SP8.3 Vietnamese POS tagger


SP8.4 Vietnamese chunker

SP8.4 Vietnamese chunker

SP8.5Vietnamese syntax

analyser


analyser

SP7.1English-Vietnamese

dictionary


dictionary

SP7.2Viet dictionary


SP1Apllicationoriented systems based on Vietnamese speech

recognition & synthesis

SP2Speech recognition

system with large vocabulary

SP8.1 Speech analysis tools

SP8.1 Speech analysis tools

SP6.1Corpora for

speech recognition

SP6.1Corpora for

speech recognition

SP6.2Corpora for

speech synthesis

SP6.2Corpora for

speech synthesis

SP6.3Corpora for

specific words

SP6.3Corpora for

specific words

Project target products

To be standard for

long term development


VLSP website: open soon to the public


SP7.2: Viet Machine Readable Dictionary

Study other MRDs EDR Electronic

Dictionary FrameNet (UC Berkeley) TCL's Computational

Lexicon

Build a model of VCL (Vietnamese Computational Lexicon) The macroscopic structure The microscopic structure The content and VCL

structure Tool and VCL construction

Institute of Electronic Dictionary, 1980s-1990s

Japanese EDR


SP7.2: Viet Machine Readable Dictionary

Microscopic structure Morphological information Syntactic information, e.g.,

two kinds of verb Semantic information: logic

and semantic constraints, definition, context

VCL content and structure Tool for the construction

35,000 common used words in modern Vietnamese

Develop a tool for building VCL with XML representation.

Sub-V-Obj Sub-V

Lợn ăn rauXe ăn xăng

Chim bay bé ngủChó chạy bé đang ngủ


Ông già

S

NP VP

P V

đi

NP

T

nhanh quá

SP7.3: Viet Treebank

A Treebank or parsed corpus is a text corpus in which each sentence has been parsed, i.e. annotated with syntactic structure. English: Penn Treebank (4.5M words) and

many others; Chinese: Penn Chinese Treebank (507K

words), Sinica Treebank (61,087 trees, 361K words);

Japanese: ATR Dependency corpus, Kyoto Text Corpus, Verbmobil treebanks;

Korean: Korean Treebank (5078 trees, 54K words)

Viet Treebank (7.2007-5.2009): 10,000 trees 1,000,000 morphemes

Viet machine translation, info extraction, etc.

Viet Treebank

Viet syntactic parser

Viet chunker

Viet POS tagger

Viet word segmenter


Study various existing treebanks, modern theories for syntax and Vietnamese language

Build guidelines for word segmentation, POS, and syntax “Nhà cửa bề bộn quá” and “Ở nhà cửa ngõ chẳng đóng gì cả”

(“the house is in jumble” and “at home the door is not closed”)

“Cô ấy giữ gìn sắc đẹp” and “Bức này màu sắc đẹp hơn”

(She keeps her beauty” and “this painting has better color”)

Build the tools Labeling Agreement between labelers (95%)

SP7.3: Viet Treebank


SP7.4: English-Vietnamese parallel corpus

Parallel Corpus (L1-L2)

SentencesL1

WordsL2

Words

German-English 1,313,09634,700,3

6236,663,08

3

Greek-English 662,09018,834,7

5818,827,24

1

Spanish-English 1,304,11637,870,7

5136,429,27

4

Finnish-English 1,257,72024,895,7

9034,802,61

7

French-English 1,334,08041,573,1

1737,436,22

2

Italian-English 1,251,31536,411,1

6636,510,03

3

Dutch-English 1,326,41236,784,1

6836,690,39

2

Portuguese-English

1,287,75737,342,4

2636,355,90

7

Swedish-English 1,164,53628,882,1

4232,053,62

8(http://www.euromatrix.net)

Pairs of corresponding sentences in English and Vietnamese (size & quality)

Easy for many languages (LDC: English-French corpus of 2.8M sentences, source from Canadian Parliament)

No publicly available parallel corpus for Vietnamese

Building corpora needs time, money and human resources (boring job)


SP7.4: English-Vietnamese parallel corpus

Our corpus in first phase: 100,000 sentence pairs

Manual and semi-automatic collection of parallel text

Automatic alignment

Automatic cross-site parallel text discovery

Many Vietnamese news are translated from English source in other web sites from the Internet


Setting up the “standards” for VLSP

Importance of “standards” in VLSP: choose an appropriate view from different schools on Vietnamese language

Guide for words recognition and description: morphological, syntactic, semantic criteria

Guide for constituent labeling: noun phrase, verb phrase, clause, etc.

Guide for sentence split

Others

Challenge: Standards for sustainable development


Example: Guideline for POS tagging

36 word labels in English, from Penn Treebank (1989)

30 word labels in Chinese, from Chinese TreeBank (1998)

47 word labels in Thai, from Orchid corpus (1997)

How many for Vietnamese?


VLSP tools for the public

All the tools are constructed based on the same view of words, label assignment, sentences, and resources.

Using statistical and machine learning methods to build the tools with the corpora.

Tools and resources are to be given to the public.

SP7.3Vietnamese

treebank

SP7.3Vietnamese

treebank

SP7.4E-V corpora of

aligned sentences

SP7.4E-V corpora of

aligned sentences


segmentation


segmentation



SP8.4 Vietnamese

chunker

SP8.4 Vietnamese

chunker


analyser


analyser


dictionary


dictionary




o1 o2 o3 oT

s1 s2 s3 sT...

...

Hidden Markov Models (HMMs) [Baum et al. 1970; Rabiner 1989]

- Generative- Need independence assumption - Local optimums - Local normalization

o1 o2 o3 oT

s1 s2 s3 sT...

...

Maximum Entropy Markov Models (MEMMs)

[McCallum et al. 2000]

- Discriminative and exponential- No independence assumption- Global optimum- Local normalization

More accurate than HMMs

o1 o2 o3 oT

s1 s2 s3 sT...

...

Conditional Random Fields (CRFs)[Lafferty et al. 2001]

- Discriminative and exponential- No independence assumption- Global optimum- Global normalization

More accurate than HMMs and MEMMs

MEMMs and CRFs are

emerging techniques in

NLP & machine learning

CRF dynamic programming- O(|L|2T): first-order- O(|L|3T): second-order(L is the set of labels)

Using machine learning in creating toolsFinite state machines


CRFsOnline Learning

Data

Chunking models

VietnameseSentences

Decoding output

Anh ấy đang ăn cơm

NP [anh ấy] VP [đang ăn cơm]

SP8.4: Statistical learning in chunking


SP8.5: HGSP grammar for syntax analysis

Syntax tree

Text

Word segmentation

module

Analysis module

Word feature

dictionary

Rules with

constraints

Attribute elimination

Improved earley

algorithm

HGSP: Head-Driven Phrase Structure Grammar

HGSP allows us to represent the relations between words and constraints between syntax and semantics


SP3: Machine translation and EVSMT1.0

Statistical Analysis

Vietnamese


Broken English

English

Ông già đi nhanh quá

Died the old man too fast

The old man too fast died

The old man died too fast

Old man died the too fast

The old man died too fast

(Slides 31-32 adapted from tutorial on SMT, K. Knight and P. Koehn)

Vietnamese-English

Bilingual Text

EnglishText

SMT:statistical machine translation


Translation Model

Language Model

Decoding AlgorithmArgmax P(v|e) x P(e)



Vietnamese


Broken English

English

Vietnamese-English

Bilingual Text

EnglishText

SMT:statistical machine translation



Issues in Vietnamese SMT Corpus building Language Modeling Translation Model Decoder Others

Decoder (search problem)MOSES

Translation Model(phrase-based)-GIZA++-MOSES-MERT

Language ModelSRILM

Englishsentence

Vietnamesesentence

SMT core

- Standardization- Word segmentation (VNsegmenter)- POS tagger (CRF Postagger, VnQtag)- Morphological analyser (morpha)

Pre-processing

Vietnamese-English

Parallel corpus

Pre-processing

Vietnamesecorpus

Pre-processing

SMT Resource processing- Pre-process (sentence splitter, tokenizer, etc.), Web crawler- Sentence alignment tools

Raw materials(documents, books, …)

Automatic extract parallel text from the Web

Corpus collecting and building


Google: English-Vietnamese translation 26.9.08 (translate.google.com, 35 languages)


Machine translation issues and challenges SMT major difficulties: word choice, word order,

tense and aspect, pronoun, idioms

Target: Improve phrase-based SMT in two aspects of word order and word choice

Combination of tree-to-string SMT and phrase-based SMT (N.P. Thai et al., Machine translation, Vol. 20. No. 3 (2006), IJCPOL, Vol. 20, No. 2 (2007)

Focus on translating long and complex sentences by introducing CRF-based clause splitting and chunking parsing (N.V. Vinh, IJCPOL, 2009)


Different types of query entries

List of Websites in English

Danh sách Websites tiếng Việt

Search on Internet for Webpages having information related to the query

Selected Website in English

Trang Web được dịch qua tiếng Việt

Check each Website

Extract news related to the query

Text related to the query

Summarize the text

Summarized text in English

Tin tóm tắt được dịch sang tiếng Việt

Extract information related to the query

Summarize the text for its gist

Translate the gist into Vietnamese.

Translate the list of retrieved Webpages into Vietnamese

Translate the selected Website into Vietnamese

1

2

3

4

IREST: Support for exploiting the Internet(Information Retrieval, Extraction, Summarization, Translation)


Vietnamese named-entities on the webHo Chi Minh University of Technology


Sentence reduction by SVM

Long sentence

parsing

transforming

generating

Large parsed tree

Small parsed tree

Short sentence

Corpus

Tree set

Parsing all sentences

Set of contexts and actions

Generating training data

SVM learning

Rules

A rule shows relation of a context and an action

G

H A

C B D

Q

Z

R

c

a

d

eb

K

b

H

e

F D

a

G

{a, b, c, d, e}

{b, e, a}

Input list, CSTACK, RSTACK Actions: SHIFT, REDUCE,

DROP, ASSIGN TYPE, RESTORE Transforming tree is a

sequence of actions

(Minh et al., COLING 2004, J. CPOL’05, ACM Trans. ALP’05, IEICE’06)


M = (D, E, T, TR, TI, TV, f, g)

ETD: Detecting topics that are growing in interest and utility overtime from a corpus

Topic verification How to define

interest and utility functions and evaluate their

increase overtime?

Topic representation

Which features are necessary to

characterize topics (interest and utility

overtime)?

Topic identification How to extract these features

from the corpus for each topic?

Emerging trend detection(Le Minh Hoang, KSS journal, 2006)


ETD: Topic representation


Topic representation

Which features are necessary to

characterize topics (interest and utility

overtime)?

neural network

Define 6 types of citation


ETD: Topic identification


Topic identification How to extract these features

from the corpus for each topic?

Build 6 models corresponding to 6 types of citation

Using HMM, MEMM, an CRF to extract features

| = max ,i is

P O P s o

i

jj

P O| =

P O|O iP

i

= - .logO i O i O iH P P


ETD: Topic verification


Topic verification How to define

interest and utility functions and evaluate their

increase overtime?

}6,5,4,2{}6,5,3,1{

),(4

1 ,),(

4

1)(

jii

jii

kkii

jtGrowth)g(tUtilityjtGrowthtfInterest

time axis along the(j)}s {ttime-seriegrowth of , j)Growth(t

0 1 2

'f

f k kx

2

2''

ff k k

x

the speed of growing at x = k

the acceleration of growing at x = k

Speed > 0Acceleration >

0


ETD: Evaluation



Conclusion

Complete the first phase in VLSP infrastructure.

Advanced technologies and experience from processing of other languages, especially statistical learning from large corpora.

Work in collaboration and sharing

Look for investment from the government and industry for the next phase, and for collaboration.


Acknowledgements

The national project KC01.01.05/06-10

Projects members: Luong Chi Mai, Ngo Cao Son, Ho Bao Quoc, Dinh Dien, Cao Hoang Tru, Nguyen Thi Minh Huyen, Vu Luong, Le Thanh Huong, Nguyen Phuong Thai, Nguyen Le Minh, Le Minh Hoang, Phan Xuan Hieu, Pham Ngoc Khanh, Ha Thanh Le, Nguyen Phuong Thao, Nguyen Viet Cuong, VLSP forum, among others.

VLSP meeting, 21-25 Nov. 2005, JAIST


Korean and Arbic

오늘 우리는 여기에 모여서 베트남어와 발언처리에 대하여 의론하겠습니다 .

Ônưr wulinân iôkiê môiôshô Vietnamơwa balântsơriê têhaiô ưi rônhagếtsưnnita

أننا نجتمع هنا اليوم لنتحدث عن اللغة الفيتنامية و لغة الخطابenna ngtma hena alyom lenthds an alloga alvitnamya wa logh alkhytab

Inana nagtama huna alyom linatahades an: alloga alvitnemâyơ ôe loga alkhytab

Kyou, wareware wa kokoni atsumari,Betonamu-go to speech shori ni tsuite Giron shimasu


Search for parallel document

Observation: Many Vietnamese news are translated from English source in other web sites from the Internet

Methodology Make use of search engine to find English

candidates Queries are created from

Posted date News source’s URL Translational independent data: text data

unchanged during translation process. E.g. term, NE, number


Framework

• Queries are generated and executed from high ranks to low ranks

• Filtering

•Length-based

•TID-based

ieee rivf’09, 16 july 2009 vietnamese language processing: issues and challenges ho tu bao japan...

Documents