ieee rivf’09, 16 july 2009 vietnamese language processing: issues and challenges ho tu bao japan...

50
IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese Academy of Science and Technology (Keynote talk at international conference IEEE RIVF 2009)

Upload: virgil-terry

Post on 11-Jan-2016

214 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

Vietnamese Language Processing: Issues and Challenges

Ho Tu Bao

Japan Advanced Institute of Science

and Technology

Vietnamese Academy of Science and Technology

(Keynote talk at international conference IEEE RIVF 2009)

Page 2: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

Japan Advanced Institute of Science and Technology

Institute of Information TechnologyVietnamese Academy of Science & Technology

Page 3: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

Outline

Problems and progress in natural language processing

Issues and challenges in Vietnamese language processing

Our VLSP project (Vietnamese Language and Speech Processing)

Page 4: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

Natural language processing?

Psychological view: Understand human language processing Alan Turing: Propose

to consider the question: “Can machine think?”

Engineering view: Build systems to process language

Alan Turing (1912-1954)Alan Turing (1912-1954)

Người máy ASIMO đưa đồ uống cho khách theo yêu cầu(http://world.honda.com/ASIMO/)

Page 5: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

More languages than you might have thought

We meet here today to talk about processing of Vietnamese language and speech.

Aujourd'hui nous nous réunissons ici pour discuter le traitement de langue et de parole vietnamienne.

Cегодня мы встрачаемся здесь, чтобы говорить о обработке вьетнамского языкa и речи.

今日我々はここに集まりベトナム語と発言処理について議論します。 오늘 우리는 여기에 모여서 베트남어와 발언처리에 대하여

의론하겠습니다 .

لغة و الفيتنامية اللغة عن لنتحدث اليوم هنا نجتمع أنناالخطاب

Hôm nay chúng ta gặp nhau ở đây để nói về xử lý ngôn ngữ và tiếng nói tiếng Việt.

6912 distinct languages (230 spoken in Europe, 2197 in Asia)

Page 6: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

54 ethnic groups in Vietnam

Language groups

Mon-Khmer

Tay-Thai

Tibeto-Burman

Malayo-Polysian

Kadai

Mong-Dao

Han

Page 7: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

English websites and Vietnamese?

Page 8: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

Translation and machine translation

Translate the following sentence into English

“Ông già đi nhanh quá”?

Many possible translations

1. [Ông già] [đi] [nhanh quá] The old man walks too fast

My father walks too fast

2. [Ông già] [đi] [nhanh quá] The old man died too fast

My father died too fast

3. [Ông] [già đi] [nhanh quá] You get old too fast

Grandfather gets old too fast

Ambiguity of language

Page 9: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

Two approaches to machine translationLinguistic rule-based machine translation words are translated by

using linguistic rules about the two languages, the

correspondence transfer

between them (morphology, syntax, etc)

Requires understanding natural language

Statistical machine translation

generate translations using statistical learning methods based on bilingual text corpora (statistically similar)

Requires large and qualified bilingual text corpora.DOMINATING!

Page 10: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

Lexical / Morphological Analysis

Syntactic Analysis

Semantic Analysis

Discourse Analysis

Tagging

Chunking

Word Sense Disambiguation

Grammatical Relation Finding

Named Entity Recognition

Reference Resolution

Shallow parsing

The woman will give Mary a book

The/Det woman/NN will/MD give/VB Mary/NNP a/Det book/NN

POS tagging

[The/Det woman/NN]NP [will/MD give/VB]VP [Mary/NNP]NP [a/Det book/NN]NP

chunking

[The woman] [will give] [Mary] [a book]

relation findingsubject

i-object object

text

meaning

From text to the meaningNatural Language Processing (NLP)

Page 11: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

1990s–2000s: Statistical learning algorithms, evaluation, corpora

1980s: Standard resources and tasks Penn Treebank, WordNet, MUC

1970s: Kernel (vector) spaces clustering, information retrieval (IR)

1960s: Representation Transformation Finite state machines (FSM) and

Augmented transition networks (ATNs)

1960s: Representation—beyond the word level

lexical features, tree structures, networks

Trainable FSMs

Trainable parsers

Archeology of natural language processing

(Hovy, COLING 2004)

Natural language

processing

Information retrieval and Information

extraction

Page 12: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

some ML/Stat no ML/Stat

(Pages 11-12 from Marie Claire, ECML/PKDD 2005)

ML and statistical methods in NLP

Page 13: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

Recent learning methods in NLP

Page 14: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

Large investment from the government and industry National Institute of Standards and Technology (NIST), ATR, NICT USA, CHINA, Singapore, etc.

NLP & CL organizations ACL (Assoc. Comp. Linguistics) NACL( North Amer. Assoc. on CL) EACL (Euro Association on CL) PACLIC (Pacific Assoc. on CL) ICCL (Inter. committee CL)

Many NLP people

Rich resources and tools

NLP R&D in other countries

Linguistic Data Consortium

Page 15: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

Vietnamese language was established a long time ago

Chinese characters was used for a long time

Unique writing system of Vietnam called Chu Nom (字喃 ) in the 10th century

Romanced script to represent the Quốc Ngữ since the beginning of the 20th century

Nam quốc sơn hà Nam đế cư

南 国 山 河 南 帝 居Over Mountains and Rivers of the South, Reigns the Emperor of the

South

Vietnamese language

Page 16: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

Vietnamese language

Vietnamese is an analytic language (words are composed of a single morpheme).

ngôn ngữ (analytic), lang-gua-ge (synthetic), 言語 (synthetic)

Vietnamese does not use morphological marking of case, gender, number, and tense.

Trưa nay tôi ăn ba thằng tôm

Syntax conforms to Subject Verb Object word order Cái thằng chồng em nó chẳng ra gì.

FOCUS CLASSIFIER husband I he not turn.out what

“That husband of mine, he is good for nothing.”

Page 17: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

Most work aims at machine translation or other tasks at top layers but very few basic work at lower layers

Work done in isolation, no inheritance people have to do their work from the scratch without sharing and collaborating no standards.

Almost no resources and tools for VLSP

Vietnamese Language and Speech Processing

Many tools such as ChaSen, Yamcha, …

このひとことで元気になった

No tool to do such a simple task

Page 18: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

National project with eleven active research VLSP groups from Ho Chi Minh City to Hanoi, with two objectives:

Building VLSP infrastructure, especially indispensable resources and tools for the VLSPdevelopment.Building and developing several typical VLSP products for public end-users.

VLSP national project (KC01.01.05/06-10)5.2007-8.2009

Natural language processing methods

Pragmatics: Speech, text and Web data mining

Tools, corpora,

resources

Page 19: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

SP7.3Vietnamese treebank

SP7.3Vietnamese treebank

SP7.4E-V corpora of aligned

sentences

SP7.4E-V corpora of aligned

sentences

SP3English-Vietnamesetranslation system

SP4IREST: Internet use

support system

SP5Vietnamese spelling

checker

SP5Vietnamese spelling

checker

SP8.2 Vietnamese word

Segmentation

SP8.2 Vietnamese word

Segmentation

SP8.3 Vietnamese POS tagger

SP8.3 Vietnamese POS tagger

SP8.4 Vietnamese chunker

SP8.4 Vietnamese chunker

SP8.5Vietnamese syntax

analyser

SP8.5Vietnamese syntax

analyser

SP7.1English-Vietnamese

dictionary

SP7.1English-Vietnamese

dictionary

SP7.2Viet dictionary

SP7.2Viet dictionary

SP1Apllicationoriented systems based on Vietnamese speech

recognition & synthesis

SP2Speech recognition

system with large vocabulary

SP8.1 Speech analysis tools

SP8.1 Speech analysis tools

SP6.1Corpora for

speech recognition

SP6.1Corpora for

speech recognition

SP6.2Corpora for

speech synthesis

SP6.2Corpora for

speech synthesis

SP6.3Corpora for

specific words

SP6.3Corpora for

specific words

Project target products

To be standard for

long term development

Page 20: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

VLSP website: open soon to the public

Page 21: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

SP7.2: Viet Machine Readable Dictionary

Study other MRDs EDR Electronic

Dictionary FrameNet (UC Berkeley) TCL's Computational

Lexicon

Build a model of VCL (Vietnamese Computational Lexicon) The macroscopic structure The microscopic structure The content and VCL

structure Tool and VCL construction

Institute of Electronic Dictionary, 1980s-1990s

Japanese EDR

Page 22: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

SP7.2: Viet Machine Readable Dictionary

Microscopic structure Morphological information Syntactic information, e.g.,

two kinds of verb Semantic information: logic

and semantic constraints, definition, context

VCL content and structure Tool for the construction

35,000 common used words in modern Vietnamese

Develop a tool for building VCL with XML representation.

Sub-V-Obj Sub-V

Lợn ăn rauXe ăn xăng

Chim bay bé ngủChó chạy bé đang ngủ

Page 23: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

Ông già

S

NP VP

P V

đi

NP

T

nhanh quá

SP7.3: Viet Treebank

A Treebank or parsed corpus is a text corpus in which each sentence has been parsed, i.e. annotated with syntactic structure. English: Penn Treebank (4.5M words) and

many others; Chinese: Penn Chinese Treebank (507K

words), Sinica Treebank (61,087 trees, 361K words);

Japanese: ATR Dependency corpus, Kyoto Text Corpus, Verbmobil treebanks;

Korean: Korean Treebank (5078 trees, 54K words)

Viet Treebank (7.2007-5.2009): 10,000 trees 1,000,000 morphemes

Viet machine translation, info extraction, etc.

Viet Treebank

Viet syntactic parser

Viet chunker

Viet POS tagger

Viet word segmenter

Page 24: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

Study various existing treebanks, modern theories for syntax and Vietnamese language

Build guidelines for word segmentation, POS, and syntax “Nhà cửa bề bộn quá” and “Ở nhà cửa ngõ chẳng đóng gì cả”

(“the house is in jumble” and “at home the door is not closed”)

“Cô ấy giữ gìn sắc đẹp” and “Bức này màu sắc đẹp hơn”

(She keeps her beauty” and “this painting has better color”)

Build the tools Labeling Agreement between labelers (95%)

SP7.3: Viet Treebank

Page 25: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

SP7.4: English-Vietnamese parallel corpus

Parallel Corpus (L1-L2)

SentencesL1

WordsL2

Words

German-English 1,313,09634,700,3

6236,663,08

3

Greek-English 662,09018,834,7

5818,827,24

1

Spanish-English 1,304,11637,870,7

5136,429,27

4

Finnish-English 1,257,72024,895,7

9034,802,61

7

French-English 1,334,08041,573,1

1737,436,22

2

Italian-English 1,251,31536,411,1

6636,510,03

3

Dutch-English 1,326,41236,784,1

6836,690,39

2

Portuguese-English

1,287,75737,342,4

2636,355,90

7

Swedish-English 1,164,53628,882,1

4232,053,62

8(http://www.euromatrix.net)

Pairs of corresponding sentences in English and Vietnamese (size & quality)

Easy for many languages (LDC: English-French corpus of 2.8M sentences, source from Canadian Parliament)

No publicly available parallel corpus for Vietnamese

Building corpora needs time, money and human resources (boring job)

Page 26: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

SP7.4: English-Vietnamese parallel corpus

Our corpus in first phase: 100,000 sentence pairs

Manual and semi-automatic collection of parallel text

Automatic alignment

Automatic cross-site parallel text discovery

Many Vietnamese news are translated from English source in other web sites from the Internet

Page 27: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

Setting up the “standards” for VLSP

Importance of “standards” in VLSP: choose an appropriate view from different schools on Vietnamese language

Guide for words recognition and description: morphological, syntactic, semantic criteria

Guide for constituent labeling: noun phrase, verb phrase, clause, etc.

Guide for sentence split

Others

Challenge: Standards for sustainable development

Page 28: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

Example: Guideline for POS tagging

36 word labels in English, from Penn Treebank (1989)

30 word labels in Chinese, from Chinese TreeBank (1998)

47 word labels in Thai, from Orchid corpus (1997)

How many for Vietnamese?

Page 29: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

VLSP tools for the public

All the tools are constructed based on the same view of words, label assignment, sentences, and resources.

Using statistical and machine learning methods to build the tools with the corpora.

Tools and resources are to be given to the public.

SP7.3Vietnamese

treebank

SP7.3Vietnamese

treebank

SP7.4E-V corpora of

aligned sentences

SP7.4E-V corpora of

aligned sentences

SP8.2 Vietnamese word

segmentation

SP8.2 Vietnamese word

segmentation

SP8.3 Vietnamese POS tagger

SP8.3 Vietnamese POS tagger

SP8.4 Vietnamese

chunker

SP8.4 Vietnamese

chunker

SP8.5Vietnamese syntax

analyser

SP8.5Vietnamese syntax

analyser

SP7.1English-Vietnamese

dictionary

SP7.1English-Vietnamese

dictionary

SP7.2Viet dictionary

SP7.2Viet dictionary

Page 30: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

o1 o2 o3 oT

s1 s2 s3 sT...

...

Hidden Markov Models (HMMs) [Baum et al. 1970; Rabiner 1989]

- Generative- Need independence assumption - Local optimums - Local normalization

o1 o2 o3 oT

s1 s2 s3 sT...

...

Maximum Entropy Markov Models (MEMMs)

[McCallum et al. 2000]

- Discriminative and exponential- No independence assumption- Global optimum- Local normalization

More accurate than HMMs

o1 o2 o3 oT

s1 s2 s3 sT...

...

Conditional Random Fields (CRFs)[Lafferty et al. 2001]

- Discriminative and exponential- No independence assumption- Global optimum- Global normalization

More accurate than HMMs and MEMMs

MEMMs and CRFs are

emerging techniques in

NLP & machine learning

CRF dynamic programming- O(|L|2T): first-order- O(|L|3T): second-order(L is the set of labels)

Using machine learning in creating toolsFinite state machines

Page 31: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

CRFsOnline Learning

Data

Chunking models

VietnameseSentences

Decoding output

Anh ấy đang ăn cơm

NP [anh ấy] VP [đang ăn cơm]

SP8.4: Statistical learning in chunking

Page 32: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

SP8.5: HGSP grammar for syntax analysis

Syntax tree

Text

Word segmentation

module

Analysis module

Word feature

dictionary

Rules with

constraints

Attribute elimination

Improved earley

algorithm

HGSP: Head-Driven Phrase Structure Grammar

HGSP allows us to represent the relations between words and constraints between syntax and semantics

Page 33: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

SP3: Machine translation and EVSMT1.0

Statistical Analysis

Vietnamese

Statistical Analysis

Broken English

English

Ông già đi nhanh quá

Died the old man too fast

The old man too fast died

The old man died too fast

Old man died the too fast

The old man died too fast

(Slides 31-32 adapted from tutorial on SMT, K. Knight and P. Koehn)

Vietnamese-English

Bilingual Text

EnglishText

SMT:statistical machine translation

Page 34: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

Translation Model

Language Model

Decoding AlgorithmArgmax P(v|e) x P(e)

SP3: Machine translation and EVSMT1.0

Statistical Analysis

Vietnamese

Statistical Analysis

Broken English

English

Vietnamese-English

Bilingual Text

EnglishText

SMT:statistical machine translation

Page 35: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

SP3: Machine translation and EVSMT1.0

Issues in Vietnamese SMT Corpus building Language Modeling Translation Model Decoder Others

Decoder (search problem)MOSES

Translation Model(phrase-based)-GIZA++-MOSES-MERT

Language ModelSRILM

Englishsentence

Vietnamesesentence

SMT core

- Standardization- Word segmentation (VNsegmenter)- POS tagger (CRF Postagger, VnQtag)- Morphological analyser (morpha)

Pre-processing

Vietnamese-English

Parallel corpus

Pre-processing

Vietnamesecorpus

Pre-processing

SMT Resource processing- Pre-process (sentence splitter, tokenizer, etc.), Web crawler- Sentence alignment tools

Raw materials(documents, books, …)

Automatic extract parallel text from the Web

Corpus collecting and building

Page 36: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

Google: English-Vietnamese translation 26.9.08 (translate.google.com, 35 languages)

Page 37: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

Machine translation issues and challenges SMT major difficulties: word choice, word order,

tense and aspect, pronoun, idioms

Target: Improve phrase-based SMT in two aspects of word order and word choice

Combination of tree-to-string SMT and phrase-based SMT (N.P. Thai et al., Machine translation, Vol. 20. No. 3 (2006), IJCPOL, Vol. 20, No. 2 (2007)

Focus on translating long and complex sentences by introducing CRF-based clause splitting and chunking parsing (N.V. Vinh, IJCPOL, 2009)

Page 38: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

Different types of query entries

List of Websites in English

Danh sách Websites tiếng Việt

Search on Internet for Webpages having information related to the query

Selected Website in English

Trang Web được dịch qua tiếng Việt

Check each Website

Extract news related to the query

Text related to the query

Summarize the text

Summarized text in English

Tin tóm tắt được dịch sang tiếng Việt

Extract information related to the query

Summarize the text for its gist

Translate the gist into Vietnamese.

Translate the list of retrieved Webpages into Vietnamese

Translate the selected Website into Vietnamese

1

2

3

4

IREST: Support for exploiting the Internet(Information Retrieval, Extraction, Summarization, Translation)

Page 39: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

Vietnamese named-entities on the webHo Chi Minh University of Technology

Page 40: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

Sentence reduction by SVM

Long sentence

parsing

transforming

generating

Large parsed tree

Small parsed tree

Short sentence

Corpus

Tree set

Parsing all sentences

Set of contexts and actions

Generating training data

SVM learning

Rules

A rule shows relation of a context and an action

G

H A

C B D

Q

Z

R

c

a

d

eb

K

b

H

e

F D

a

G

{a, b, c, d, e}

{b, e, a}

Input list, CSTACK, RSTACK Actions: SHIFT, REDUCE,

DROP, ASSIGN TYPE, RESTORE Transforming tree is a

sequence of actions

(Minh et al., COLING 2004, J. CPOL’05, ACM Trans. ALP’05, IEICE’06)

Page 41: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

M = (D, E, T, TR, TI, TV, f, g)

ETD: Detecting topics that are growing in interest and utility overtime from a corpus

Topic verification How to define

interest and utility functions and evaluate their

increase overtime?

Topic representation

Which features are necessary to

characterize topics (interest and utility

overtime)?

Topic identification How to extract these features

from the corpus for each topic?

Emerging trend detection(Le Minh Hoang, KSS journal, 2006)

Page 42: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

ETD: Topic representation

ETD: Detecting topics that are growing in interest and utility overtime from a corpus

Topic representation

Which features are necessary to

characterize topics (interest and utility

overtime)?

neural network

Define 6 types of citation

Page 43: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

ETD: Topic identification

ETD: Detecting topics that are growing in interest and utility overtime from a corpus

Topic identification How to extract these features

from the corpus for each topic?

Build 6 models corresponding to 6 types of citation

Using HMM, MEMM, an CRF to extract features

| = max ,i is

P O P s o

i

jj

P O| =

P O|O iP

i

= - .logO i O i O iH P P

Page 44: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

ETD: Topic verification

ETD: Detecting topics that are growing in interest and utility overtime from a corpus

Topic verification How to define

interest and utility functions and evaluate their

increase overtime?

}6,5,4,2{}6,5,3,1{

),(4

1 ,),(

4

1)(

jii

jii

kkii

jtGrowth)g(tUtilityjtGrowthtfInterest

time axis along the(j)}s {ttime-seriegrowth of , j)Growth(t

0 1 2

'f

f k kx

2

2''

ff k k

x

the speed of growing at x = k

the acceleration of growing at x = k

Speed > 0Acceleration >

0

Page 45: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

ETD: Evaluation

ETD: Detecting topics that are growing in interest and utility overtime from a corpus

Page 46: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

Conclusion

Complete the first phase in VLSP infrastructure.

Advanced technologies and experience from processing of other languages, especially statistical learning from large corpora.

Work in collaboration and sharing

Look for investment from the government and industry for the next phase, and for collaboration.

Page 47: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

Acknowledgements

The national project KC01.01.05/06-10

Projects members: Luong Chi Mai, Ngo Cao Son, Ho Bao Quoc, Dinh Dien, Cao Hoang Tru, Nguyen Thi Minh Huyen, Vu Luong, Le Thanh Huong, Nguyen Phuong Thai, Nguyen Le Minh, Le Minh Hoang, Phan Xuan Hieu, Pham Ngoc Khanh, Ha Thanh Le, Nguyen Phuong Thao, Nguyen Viet Cuong, VLSP forum, among others.

VLSP meeting, 21-25 Nov. 2005, JAIST

Page 48: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

Korean and Arbic

오늘 우리는 여기에 모여서 베트남어와 발언처리에 대하여 의론하겠습니다 .

Ônưr wulinân iôkiê môiôshô Vietnamơwa balântsơriê têhaiô ưi rônhagếtsưnnita

أننا نجتمع هنا اليوم لنتحدث عن اللغة الفيتنامية و لغة الخطابenna ngtma hena alyom lenthds an alloga alvitnamya wa logh alkhytab

Inana nagtama huna alyom linatahades an: alloga alvitnemâyơ ôe loga alkhytab

Kyou, wareware wa kokoni atsumari,Betonamu-go to speech shori ni tsuite Giron shimasu

Page 49: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

Search for parallel document

Observation: Many Vietnamese news are translated from English source in other web sites from the Internet

Methodology Make use of search engine to find English

candidates Queries are created from

Posted date News source’s URL Translational independent data: text data

unchanged during translation process. E.g. term, NE, number

Page 50: IEEE RIVF’09, 16 July 2009 Vietnamese Language Processing: Issues and Challenges Ho Tu Bao Japan Advanced Institute of Science and Technology Vietnamese

IEEE RIVF’09, 16 July 2009

Framework

• Queries are generated and executed from high ranks to low ranks

• Filtering

•Length-based

•TID-based