ieee rivf’09, 16 july 2009 vietnamese language processing: issues and challenges ho tu bao japan...
TRANSCRIPT
IEEE RIVF’09, 16 July 2009
Vietnamese Language Processing: Issues and Challenges
Ho Tu Bao
Japan Advanced Institute of Science
and Technology
Vietnamese Academy of Science and Technology
(Keynote talk at international conference IEEE RIVF 2009)
IEEE RIVF’09, 16 July 2009
Japan Advanced Institute of Science and Technology
Institute of Information TechnologyVietnamese Academy of Science & Technology
IEEE RIVF’09, 16 July 2009
Outline
Problems and progress in natural language processing
Issues and challenges in Vietnamese language processing
Our VLSP project (Vietnamese Language and Speech Processing)
IEEE RIVF’09, 16 July 2009
Natural language processing?
Psychological view: Understand human language processing Alan Turing: Propose
to consider the question: “Can machine think?”
Engineering view: Build systems to process language
Alan Turing (1912-1954)Alan Turing (1912-1954)
Người máy ASIMO đưa đồ uống cho khách theo yêu cầu(http://world.honda.com/ASIMO/)
IEEE RIVF’09, 16 July 2009
More languages than you might have thought
We meet here today to talk about processing of Vietnamese language and speech.
Aujourd'hui nous nous réunissons ici pour discuter le traitement de langue et de parole vietnamienne.
Cегодня мы встрачаемся здесь, чтобы говорить о обработке вьетнамского языкa и речи.
今日我々はここに集まりベトナム語と発言処理について議論します。 오늘 우리는 여기에 모여서 베트남어와 발언처리에 대하여
의론하겠습니다 .
لغة و الفيتنامية اللغة عن لنتحدث اليوم هنا نجتمع أنناالخطاب
Hôm nay chúng ta gặp nhau ở đây để nói về xử lý ngôn ngữ và tiếng nói tiếng Việt.
6912 distinct languages (230 spoken in Europe, 2197 in Asia)
IEEE RIVF’09, 16 July 2009
54 ethnic groups in Vietnam
Language groups
Mon-Khmer
Tay-Thai
Tibeto-Burman
Malayo-Polysian
Kadai
Mong-Dao
Han
IEEE RIVF’09, 16 July 2009
English websites and Vietnamese?
IEEE RIVF’09, 16 July 2009
Translation and machine translation
Translate the following sentence into English
“Ông già đi nhanh quá”?
Many possible translations
1. [Ông già] [đi] [nhanh quá] The old man walks too fast
My father walks too fast
2. [Ông già] [đi] [nhanh quá] The old man died too fast
My father died too fast
3. [Ông] [già đi] [nhanh quá] You get old too fast
Grandfather gets old too fast
Ambiguity of language
IEEE RIVF’09, 16 July 2009
Two approaches to machine translationLinguistic rule-based machine translation words are translated by
using linguistic rules about the two languages, the
correspondence transfer
between them (morphology, syntax, etc)
Requires understanding natural language
Statistical machine translation
generate translations using statistical learning methods based on bilingual text corpora (statistically similar)
Requires large and qualified bilingual text corpora.DOMINATING!
IEEE RIVF’09, 16 July 2009
Lexical / Morphological Analysis
Syntactic Analysis
Semantic Analysis
Discourse Analysis
Tagging
Chunking
Word Sense Disambiguation
Grammatical Relation Finding
Named Entity Recognition
Reference Resolution
Shallow parsing
The woman will give Mary a book
The/Det woman/NN will/MD give/VB Mary/NNP a/Det book/NN
POS tagging
[The/Det woman/NN]NP [will/MD give/VB]VP [Mary/NNP]NP [a/Det book/NN]NP
chunking
[The woman] [will give] [Mary] [a book]
relation findingsubject
i-object object
text
meaning
From text to the meaningNatural Language Processing (NLP)
IEEE RIVF’09, 16 July 2009
1990s–2000s: Statistical learning algorithms, evaluation, corpora
1980s: Standard resources and tasks Penn Treebank, WordNet, MUC
1970s: Kernel (vector) spaces clustering, information retrieval (IR)
1960s: Representation Transformation Finite state machines (FSM) and
Augmented transition networks (ATNs)
1960s: Representation—beyond the word level
lexical features, tree structures, networks
Trainable FSMs
Trainable parsers
Archeology of natural language processing
(Hovy, COLING 2004)
Natural language
processing
Information retrieval and Information
extraction
IEEE RIVF’09, 16 July 2009
some ML/Stat no ML/Stat
(Pages 11-12 from Marie Claire, ECML/PKDD 2005)
ML and statistical methods in NLP
IEEE RIVF’09, 16 July 2009
Recent learning methods in NLP
IEEE RIVF’09, 16 July 2009
Large investment from the government and industry National Institute of Standards and Technology (NIST), ATR, NICT USA, CHINA, Singapore, etc.
NLP & CL organizations ACL (Assoc. Comp. Linguistics) NACL( North Amer. Assoc. on CL) EACL (Euro Association on CL) PACLIC (Pacific Assoc. on CL) ICCL (Inter. committee CL)
Many NLP people
Rich resources and tools
NLP R&D in other countries
Linguistic Data Consortium
IEEE RIVF’09, 16 July 2009
Vietnamese language was established a long time ago
Chinese characters was used for a long time
Unique writing system of Vietnam called Chu Nom (字喃 ) in the 10th century
Romanced script to represent the Quốc Ngữ since the beginning of the 20th century
Nam quốc sơn hà Nam đế cư
南 国 山 河 南 帝 居Over Mountains and Rivers of the South, Reigns the Emperor of the
South
Vietnamese language
IEEE RIVF’09, 16 July 2009
Vietnamese language
Vietnamese is an analytic language (words are composed of a single morpheme).
ngôn ngữ (analytic), lang-gua-ge (synthetic), 言語 (synthetic)
Vietnamese does not use morphological marking of case, gender, number, and tense.
Trưa nay tôi ăn ba thằng tôm
Syntax conforms to Subject Verb Object word order Cái thằng chồng em nó chẳng ra gì.
FOCUS CLASSIFIER husband I he not turn.out what
“That husband of mine, he is good for nothing.”
IEEE RIVF’09, 16 July 2009
Most work aims at machine translation or other tasks at top layers but very few basic work at lower layers
Work done in isolation, no inheritance people have to do their work from the scratch without sharing and collaborating no standards.
Almost no resources and tools for VLSP
Vietnamese Language and Speech Processing
Many tools such as ChaSen, Yamcha, …
このひとことで元気になった
No tool to do such a simple task
IEEE RIVF’09, 16 July 2009
National project with eleven active research VLSP groups from Ho Chi Minh City to Hanoi, with two objectives:
Building VLSP infrastructure, especially indispensable resources and tools for the VLSPdevelopment.Building and developing several typical VLSP products for public end-users.
VLSP national project (KC01.01.05/06-10)5.2007-8.2009
Natural language processing methods
Pragmatics: Speech, text and Web data mining
Tools, corpora,
resources
IEEE RIVF’09, 16 July 2009
SP7.3Vietnamese treebank
SP7.3Vietnamese treebank
SP7.4E-V corpora of aligned
sentences
SP7.4E-V corpora of aligned
sentences
SP3English-Vietnamesetranslation system
SP4IREST: Internet use
support system
SP5Vietnamese spelling
checker
SP5Vietnamese spelling
checker
SP8.2 Vietnamese word
Segmentation
SP8.2 Vietnamese word
Segmentation
SP8.3 Vietnamese POS tagger
SP8.3 Vietnamese POS tagger
SP8.4 Vietnamese chunker
SP8.4 Vietnamese chunker
SP8.5Vietnamese syntax
analyser
SP8.5Vietnamese syntax
analyser
SP7.1English-Vietnamese
dictionary
SP7.1English-Vietnamese
dictionary
SP7.2Viet dictionary
SP7.2Viet dictionary
SP1Apllicationoriented systems based on Vietnamese speech
recognition & synthesis
SP2Speech recognition
system with large vocabulary
SP8.1 Speech analysis tools
SP8.1 Speech analysis tools
SP6.1Corpora for
speech recognition
SP6.1Corpora for
speech recognition
SP6.2Corpora for
speech synthesis
SP6.2Corpora for
speech synthesis
SP6.3Corpora for
specific words
SP6.3Corpora for
specific words
Project target products
To be standard for
long term development
IEEE RIVF’09, 16 July 2009
VLSP website: open soon to the public
IEEE RIVF’09, 16 July 2009
SP7.2: Viet Machine Readable Dictionary
Study other MRDs EDR Electronic
Dictionary FrameNet (UC Berkeley) TCL's Computational
Lexicon
Build a model of VCL (Vietnamese Computational Lexicon) The macroscopic structure The microscopic structure The content and VCL
structure Tool and VCL construction
Institute of Electronic Dictionary, 1980s-1990s
Japanese EDR
IEEE RIVF’09, 16 July 2009
SP7.2: Viet Machine Readable Dictionary
Microscopic structure Morphological information Syntactic information, e.g.,
two kinds of verb Semantic information: logic
and semantic constraints, definition, context
VCL content and structure Tool for the construction
35,000 common used words in modern Vietnamese
Develop a tool for building VCL with XML representation.
Sub-V-Obj Sub-V
Lợn ăn rauXe ăn xăng
Chim bay bé ngủChó chạy bé đang ngủ
IEEE RIVF’09, 16 July 2009
Ông già
S
NP VP
P V
đi
NP
T
nhanh quá
SP7.3: Viet Treebank
A Treebank or parsed corpus is a text corpus in which each sentence has been parsed, i.e. annotated with syntactic structure. English: Penn Treebank (4.5M words) and
many others; Chinese: Penn Chinese Treebank (507K
words), Sinica Treebank (61,087 trees, 361K words);
Japanese: ATR Dependency corpus, Kyoto Text Corpus, Verbmobil treebanks;
Korean: Korean Treebank (5078 trees, 54K words)
Viet Treebank (7.2007-5.2009): 10,000 trees 1,000,000 morphemes
Viet machine translation, info extraction, etc.
Viet Treebank
Viet syntactic parser
Viet chunker
Viet POS tagger
Viet word segmenter
IEEE RIVF’09, 16 July 2009
Study various existing treebanks, modern theories for syntax and Vietnamese language
Build guidelines for word segmentation, POS, and syntax “Nhà cửa bề bộn quá” and “Ở nhà cửa ngõ chẳng đóng gì cả”
(“the house is in jumble” and “at home the door is not closed”)
“Cô ấy giữ gìn sắc đẹp” and “Bức này màu sắc đẹp hơn”
(She keeps her beauty” and “this painting has better color”)
Build the tools Labeling Agreement between labelers (95%)
SP7.3: Viet Treebank
IEEE RIVF’09, 16 July 2009
SP7.4: English-Vietnamese parallel corpus
Parallel Corpus (L1-L2)
SentencesL1
WordsL2
Words
German-English 1,313,09634,700,3
6236,663,08
3
Greek-English 662,09018,834,7
5818,827,24
1
Spanish-English 1,304,11637,870,7
5136,429,27
4
Finnish-English 1,257,72024,895,7
9034,802,61
7
French-English 1,334,08041,573,1
1737,436,22
2
Italian-English 1,251,31536,411,1
6636,510,03
3
Dutch-English 1,326,41236,784,1
6836,690,39
2
Portuguese-English
1,287,75737,342,4
2636,355,90
7
Swedish-English 1,164,53628,882,1
4232,053,62
8(http://www.euromatrix.net)
Pairs of corresponding sentences in English and Vietnamese (size & quality)
Easy for many languages (LDC: English-French corpus of 2.8M sentences, source from Canadian Parliament)
No publicly available parallel corpus for Vietnamese
Building corpora needs time, money and human resources (boring job)
IEEE RIVF’09, 16 July 2009
SP7.4: English-Vietnamese parallel corpus
Our corpus in first phase: 100,000 sentence pairs
Manual and semi-automatic collection of parallel text
Automatic alignment
Automatic cross-site parallel text discovery
Many Vietnamese news are translated from English source in other web sites from the Internet
IEEE RIVF’09, 16 July 2009
Setting up the “standards” for VLSP
Importance of “standards” in VLSP: choose an appropriate view from different schools on Vietnamese language
Guide for words recognition and description: morphological, syntactic, semantic criteria
Guide for constituent labeling: noun phrase, verb phrase, clause, etc.
Guide for sentence split
Others
Challenge: Standards for sustainable development
IEEE RIVF’09, 16 July 2009
Example: Guideline for POS tagging
36 word labels in English, from Penn Treebank (1989)
30 word labels in Chinese, from Chinese TreeBank (1998)
47 word labels in Thai, from Orchid corpus (1997)
How many for Vietnamese?
IEEE RIVF’09, 16 July 2009
VLSP tools for the public
All the tools are constructed based on the same view of words, label assignment, sentences, and resources.
Using statistical and machine learning methods to build the tools with the corpora.
Tools and resources are to be given to the public.
SP7.3Vietnamese
treebank
SP7.3Vietnamese
treebank
SP7.4E-V corpora of
aligned sentences
SP7.4E-V corpora of
aligned sentences
SP8.2 Vietnamese word
segmentation
SP8.2 Vietnamese word
segmentation
SP8.3 Vietnamese POS tagger
SP8.3 Vietnamese POS tagger
SP8.4 Vietnamese
chunker
SP8.4 Vietnamese
chunker
SP8.5Vietnamese syntax
analyser
SP8.5Vietnamese syntax
analyser
SP7.1English-Vietnamese
dictionary
SP7.1English-Vietnamese
dictionary
SP7.2Viet dictionary
SP7.2Viet dictionary
IEEE RIVF’09, 16 July 2009
o1 o2 o3 oT
s1 s2 s3 sT...
...
Hidden Markov Models (HMMs) [Baum et al. 1970; Rabiner 1989]
- Generative- Need independence assumption - Local optimums - Local normalization
o1 o2 o3 oT
s1 s2 s3 sT...
...
Maximum Entropy Markov Models (MEMMs)
[McCallum et al. 2000]
- Discriminative and exponential- No independence assumption- Global optimum- Local normalization
More accurate than HMMs
o1 o2 o3 oT
s1 s2 s3 sT...
...
Conditional Random Fields (CRFs)[Lafferty et al. 2001]
- Discriminative and exponential- No independence assumption- Global optimum- Global normalization
More accurate than HMMs and MEMMs
MEMMs and CRFs are
emerging techniques in
NLP & machine learning
CRF dynamic programming- O(|L|2T): first-order- O(|L|3T): second-order(L is the set of labels)
Using machine learning in creating toolsFinite state machines
IEEE RIVF’09, 16 July 2009
CRFsOnline Learning
Data
Chunking models
VietnameseSentences
Decoding output
Anh ấy đang ăn cơm
NP [anh ấy] VP [đang ăn cơm]
SP8.4: Statistical learning in chunking
IEEE RIVF’09, 16 July 2009
SP8.5: HGSP grammar for syntax analysis
Syntax tree
Text
Word segmentation
module
Analysis module
Word feature
dictionary
Rules with
constraints
Attribute elimination
Improved earley
algorithm
HGSP: Head-Driven Phrase Structure Grammar
HGSP allows us to represent the relations between words and constraints between syntax and semantics
IEEE RIVF’09, 16 July 2009
SP3: Machine translation and EVSMT1.0
Statistical Analysis
Vietnamese
Statistical Analysis
Broken English
English
Ông già đi nhanh quá
Died the old man too fast
The old man too fast died
The old man died too fast
Old man died the too fast
The old man died too fast
(Slides 31-32 adapted from tutorial on SMT, K. Knight and P. Koehn)
Vietnamese-English
Bilingual Text
EnglishText
SMT:statistical machine translation
IEEE RIVF’09, 16 July 2009
Translation Model
Language Model
Decoding AlgorithmArgmax P(v|e) x P(e)
SP3: Machine translation and EVSMT1.0
Statistical Analysis
Vietnamese
Statistical Analysis
Broken English
English
Vietnamese-English
Bilingual Text
EnglishText
SMT:statistical machine translation
IEEE RIVF’09, 16 July 2009
SP3: Machine translation and EVSMT1.0
Issues in Vietnamese SMT Corpus building Language Modeling Translation Model Decoder Others
Decoder (search problem)MOSES
Translation Model(phrase-based)-GIZA++-MOSES-MERT
Language ModelSRILM
Englishsentence
Vietnamesesentence
SMT core
- Standardization- Word segmentation (VNsegmenter)- POS tagger (CRF Postagger, VnQtag)- Morphological analyser (morpha)
Pre-processing
Vietnamese-English
Parallel corpus
Pre-processing
Vietnamesecorpus
Pre-processing
SMT Resource processing- Pre-process (sentence splitter, tokenizer, etc.), Web crawler- Sentence alignment tools
Raw materials(documents, books, …)
Automatic extract parallel text from the Web
Corpus collecting and building
IEEE RIVF’09, 16 July 2009
Google: English-Vietnamese translation 26.9.08 (translate.google.com, 35 languages)
IEEE RIVF’09, 16 July 2009
Machine translation issues and challenges SMT major difficulties: word choice, word order,
tense and aspect, pronoun, idioms
Target: Improve phrase-based SMT in two aspects of word order and word choice
Combination of tree-to-string SMT and phrase-based SMT (N.P. Thai et al., Machine translation, Vol. 20. No. 3 (2006), IJCPOL, Vol. 20, No. 2 (2007)
Focus on translating long and complex sentences by introducing CRF-based clause splitting and chunking parsing (N.V. Vinh, IJCPOL, 2009)
IEEE RIVF’09, 16 July 2009
Different types of query entries
List of Websites in English
Danh sách Websites tiếng Việt
Search on Internet for Webpages having information related to the query
Selected Website in English
Trang Web được dịch qua tiếng Việt
Check each Website
Extract news related to the query
Text related to the query
Summarize the text
Summarized text in English
Tin tóm tắt được dịch sang tiếng Việt
Extract information related to the query
Summarize the text for its gist
Translate the gist into Vietnamese.
Translate the list of retrieved Webpages into Vietnamese
Translate the selected Website into Vietnamese
1
2
3
4
IREST: Support for exploiting the Internet(Information Retrieval, Extraction, Summarization, Translation)
IEEE RIVF’09, 16 July 2009
Vietnamese named-entities on the webHo Chi Minh University of Technology
IEEE RIVF’09, 16 July 2009
Sentence reduction by SVM
Long sentence
parsing
transforming
generating
Large parsed tree
Small parsed tree
Short sentence
Corpus
Tree set
Parsing all sentences
Set of contexts and actions
Generating training data
SVM learning
Rules
A rule shows relation of a context and an action
G
H A
C B D
Q
Z
R
c
a
d
eb
K
b
H
e
F D
a
G
{a, b, c, d, e}
{b, e, a}
Input list, CSTACK, RSTACK Actions: SHIFT, REDUCE,
DROP, ASSIGN TYPE, RESTORE Transforming tree is a
sequence of actions
(Minh et al., COLING 2004, J. CPOL’05, ACM Trans. ALP’05, IEICE’06)
IEEE RIVF’09, 16 July 2009
M = (D, E, T, TR, TI, TV, f, g)
ETD: Detecting topics that are growing in interest and utility overtime from a corpus
Topic verification How to define
interest and utility functions and evaluate their
increase overtime?
Topic representation
Which features are necessary to
characterize topics (interest and utility
overtime)?
Topic identification How to extract these features
from the corpus for each topic?
Emerging trend detection(Le Minh Hoang, KSS journal, 2006)
IEEE RIVF’09, 16 July 2009
ETD: Topic representation
ETD: Detecting topics that are growing in interest and utility overtime from a corpus
Topic representation
Which features are necessary to
characterize topics (interest and utility
overtime)?
neural network
Define 6 types of citation
IEEE RIVF’09, 16 July 2009
ETD: Topic identification
ETD: Detecting topics that are growing in interest and utility overtime from a corpus
Topic identification How to extract these features
from the corpus for each topic?
Build 6 models corresponding to 6 types of citation
Using HMM, MEMM, an CRF to extract features
| = max ,i is
P O P s o
i
jj
P O| =
P O|O iP
i
= - .logO i O i O iH P P
IEEE RIVF’09, 16 July 2009
ETD: Topic verification
ETD: Detecting topics that are growing in interest and utility overtime from a corpus
Topic verification How to define
interest and utility functions and evaluate their
increase overtime?
}6,5,4,2{}6,5,3,1{
),(4
1 ,),(
4
1)(
jii
jii
kkii
jtGrowth)g(tUtilityjtGrowthtfInterest
time axis along the(j)}s {ttime-seriegrowth of , j)Growth(t
0 1 2
'f
f k kx
2
2''
ff k k
x
the speed of growing at x = k
the acceleration of growing at x = k
Speed > 0Acceleration >
0
IEEE RIVF’09, 16 July 2009
ETD: Evaluation
ETD: Detecting topics that are growing in interest and utility overtime from a corpus
IEEE RIVF’09, 16 July 2009
Conclusion
Complete the first phase in VLSP infrastructure.
Advanced technologies and experience from processing of other languages, especially statistical learning from large corpora.
Work in collaboration and sharing
Look for investment from the government and industry for the next phase, and for collaboration.
IEEE RIVF’09, 16 July 2009
Acknowledgements
The national project KC01.01.05/06-10
Projects members: Luong Chi Mai, Ngo Cao Son, Ho Bao Quoc, Dinh Dien, Cao Hoang Tru, Nguyen Thi Minh Huyen, Vu Luong, Le Thanh Huong, Nguyen Phuong Thai, Nguyen Le Minh, Le Minh Hoang, Phan Xuan Hieu, Pham Ngoc Khanh, Ha Thanh Le, Nguyen Phuong Thao, Nguyen Viet Cuong, VLSP forum, among others.
VLSP meeting, 21-25 Nov. 2005, JAIST
IEEE RIVF’09, 16 July 2009
Korean and Arbic
오늘 우리는 여기에 모여서 베트남어와 발언처리에 대하여 의론하겠습니다 .
Ônưr wulinân iôkiê môiôshô Vietnamơwa balântsơriê têhaiô ưi rônhagếtsưnnita
أننا نجتمع هنا اليوم لنتحدث عن اللغة الفيتنامية و لغة الخطابenna ngtma hena alyom lenthds an alloga alvitnamya wa logh alkhytab
Inana nagtama huna alyom linatahades an: alloga alvitnemâyơ ôe loga alkhytab
Kyou, wareware wa kokoni atsumari,Betonamu-go to speech shori ni tsuite Giron shimasu
IEEE RIVF’09, 16 July 2009
Search for parallel document
Observation: Many Vietnamese news are translated from English source in other web sites from the Internet
Methodology Make use of search engine to find English
candidates Queries are created from
Posted date News source’s URL Translational independent data: text data
unchanged during translation process. E.g. term, NE, number
IEEE RIVF’09, 16 July 2009
Framework
• Queries are generated and executed from high ranks to low ranks
• Filtering
•Length-based
•TID-based