Page 1: Language Resource and Language Technology

PAN Localization, Jan 12-16, 2009, Novotel, Vientiane, Lao PDR

Language Resource and Language Technology

Virach Sornlertlamvanich

NECTEC, Thailand

TCL, NICT

ALRC, AFNLP

Page 2: Language Resource and Language Technology

ALRC, AFNLP

ASIAN LANGUAGE RESOURCES COMMITTEE, ASIAN FEDERATION OF NATURAL LANGUAGE PROCESSING

Page 3: Language Resource and Language Technology

AFNLP

Jun’ichi Tsujii President

Key-Sun Choi Vice President

Keh-Yih Su Secretary General

Kam-Fai Wong Honorary Treasurer

Yuji Matsumoto Chair of CCC (Conference Coordinating Committee)

Haizhou Li Chair of CLC (Communications and Liaison Committee)

Virach Sornlertlamvanich Chair of ALRC (Asian Language Resources Committee)

Benjamin Tsou Chair of NCAC (Nominations and Constitutional Affairs Committee)

Mark Steedman ACL liaison member to AFNLP

Rajeev Sangal

Chengqing Zong

Page 5: Language Resource and Language Technology

Role of ALRC, AFNLP

1. ALR Workshop

Take the initiative in setting up the ALR Workshop every other year. It is to be considered as a workshop attached to a major conference such as IJCNLP. This involves appointing the workshop and program chairs. The process should start no later than the announcement of the call for workshop proposals, so that the workshop and program chairs can be announced at the appropriate time. The Chair must interact with the workshop chair to ensure that the workshop preparations are proceeding smoothly.

2. LR catalogue

Throughout the year, monitor the LR catalogue and keep it up to date.

Page 6: Language Resource and Language Technology

ALR Workshop in the Past

1. Tokyo, Japan, under the name of Symposium on Language Resources in Asia, 2001

2. Tokyo, Japan, in conjunction with the 6th Natural Language Processing Pacific Rim Symposium, National Center of Sciences, 2001

3. Taipei, Taiwan, in conjunction with Coling2002

4. Sanya City, Hainan Island, China, in conjunction with IJCNLP2004

5. Jeju Island, Korea, in conjunction with IJCNLP2005

6. Hyderabad, India, in conjunction with IJCNLP2008

Page 7: Language Resource and Language Technology

The 7th Workshop on Asian Language Resources

• Co-Chairs:
– Hammam Riza - IPTEKnet-BPPT, Indonesia
– Virach Sornlertlamvanich - NECTEC, Thailand

• Venue:
– Aug 7, 2009
– ACL-IJCNLP 2009, Singapore, Aug 2-7, 2009
– http://www.acl-ijcnlp-2009.org/main/workshops.html

• Important Dates:
– Paper submission due May 1, 2009
– Demo session requests due May 8, 2009
– Notification of acceptance July 1, 2009
– Camera-ready papers due June 7, 2009

Page 8: Language Resource and Language Technology

LR Catalogue

• http://www.tcllab.org/add
• http://www.shachi.org/

Page 11: Language Resource and Language Technology

ADD

ASIAN APPLIED NATURAL LANGUAGE PROCESSING FOR LINGUISTICS DIVERSITY AND LANGUAGE RESOURCE DEVELOPMENT

Page 12: Language Resource and Language Technology

Asian Applied Natural Language Processing for Linguistics Diversity and Language Resource Development (ADD)

• Objective:
– Build experts in NLP
– Build a human network of NLP experts for sharing experience and expertise, and for collaborating in studying and applying NLP
– Support the development of language resources for studying and evaluating the technology
– Support the development of standards for language resource development
– Support the research and development of NLP common utilities
– Support the implementation of the existing NLP utilities

Page 13: Language Resource and Language Technology

Asian Applied Natural Language Processing for Linguistics Diversity and Language Resource Development (ADD)

• Organizer and Supporter:
– NICT Asia Research Center
– Asian Language Resources Network Project (ALRN)
– National Electronics and Computer Technology Center (NECTEC)
– Sirindhorn International Institute of Technology (SIIT)
– Asia-Pacific Association for Machine Translation (AAMT)
– Asian Federation of Natural Language Processing (AFNLP)
– PAN Localization Project, CRULP

Page 14: Language Resource and Language Technology

ADD School and Workshop

• ADD-1: Introduction to NLP
– August 21–September 1, 2006
– SIIT, Bangkok, Thailand

• ADD-2: Advanced NLP (Special Topic on Morpho-Syntactic Analysis)
– March 6-14, 2007
– Thammasat University, Bangkok, Thailand

• ADD-3: Advanced NLP (Special Topic on Image and Speech Processing)
– February 25–March 1, 2008
– SIIT, Bangkok, Thailand

Page 15: Language Resource and Language Technology

ADD Applications (1)

ADD-1
• 27 from 34 applications, 12 countries
– Bhutan 1
– Cambodia 2
– Indonesia 2
– Lao 3
– Mongolia 1
– Myanmar 3
– Nepal 3
– Pakistan 3
– Sri Lanka 1
– Thailand open
– US 1
– Vietnam 7

ADD-2
• 36 from 42 applications, 13 countries
– Bangladesh 2
– Bhutan 1
– Cambodia 2
– India 1
– Indonesia 3
– Lao 5 (7)
– Mongolia 1
– Myanmar 1 (3)
– Nepal 3 (5)
– Pakistan 1
– Philippines 1
– Thailand 4
– Vietnam 11

* Figures in parentheses () are the number of applications

Page 16: Language Resource and Language Technology


ADD Applications (2)

ADD-3
• 37 from 39 applications, 12 countries
– Bangladesh 3
– Bhutan 3 (4)
– Indonesia 7
– Lao 3
– Mongolia 2
– Myanmar 4
– Nepal 2 (3)
– Pakistan 2
– Philippines 1
– Sri Lanka 2
– Thailand 1 [+18]
– Vietnam 7

* Figures in parentheses () are the number of applications
* The figure in square brackets [] is the number of sit-in participants

Page 17: Language Resource and Language Technology


CFP of ADD-4

• Theme:
– Language Resource Technology
POS tagging, word segmentation, terminology, Asian WordNet, tools for corpus development, tools for text mining, text summarization, categorization, approaches for morphological analysis

• Date:
– Feb 23-27, 2009

• Venue:
– NECTEC Academy, Bangkok

• Application:
– www.tcllab.org/add

Page 18: Language Resource and Language Technology


http://www.tcllab.org/add

Page 19: Language Resource and Language Technology

ALR SUMMIT

March 2009, Phuket

Page 20: Language Resource and Language Technology

ALR Summit

• March 2009, Phuket
• Discussion of Asian language resources in terms of developing, sharing, licensing, etc.
• Corpus, Terminology, WordNet, Language tools, etc.

Page 21: Language Resource and Language Technology

POLICY CONSIDERATIONS FOR DEVELOPMENT AND DEPLOYMENT OF LOCAL LANGUAGE COMPUTING AND CONTENT

Page 22: Language Resource and Language Technology

Asian WordNet

• Use English equivalents to link entries of an existing dictionary to WordNet

• The POS (n, v, adv, adj), the English equivalent, and the English equivalents of the entry's target-language synonyms are used to pinpoint the link

• The number of English equivalents matched in the synset confirms the appropriate link

• Experiments on Thai-English, Indonesian-English and Mongolian-English dictionaries

• http://asianwordnet.org/
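The linking idea in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the Asian WordNet implementation: the `Entry` class, the `link_entry` function, the toy synset table, and the rule "pick the same-POS synset with the most matched equivalents" are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One bilingual dictionary entry (hypothetical structure)."""
    headword: str
    pos: str                      # one of "n", "v", "adv", "adj"
    equivalents: set              # English equivalents of the headword
    synonym_equivalents: set = field(default_factory=set)  # English equivalents of its synonyms


# Toy synset table for illustration only: id -> (POS, English lemmas).
# The ids and lemma sets are made up, not taken from a real WordNet release.
SYNSETS = {
    "bank.n.01": ("n", {"bank"}),
    "bank.n.02": ("n", {"bank", "banking company", "depository financial institution"}),
    "bank.v.01": ("v", {"bank"}),
}


def link_entry(entry, synsets):
    """Return (synset_id, matched_count) for the best same-POS synset.

    Synsets with a different POS are skipped; the remaining ones are scored
    by how many of the entry's English equivalents they contain. On a tie,
    the first synset encountered wins. Returns (None, 0) if nothing matches.
    """
    candidates = entry.equivalents | entry.synonym_equivalents
    best_id, best_score = None, 0
    for synset_id, (pos, lemmas) in synsets.items():
        if pos != entry.pos:
            continue
        score = len(candidates & lemmas)
        if score > best_score:
            best_id, best_score = synset_id, score
    return best_id, best_score


# Example: a Thai noun whose own equivalent is "bank" and whose synonym
# translates to "banking company".
entry = Entry("ธนาคาร", "n", {"bank"}, {"banking company"})
result = link_entry(entry, SYNSETS)
```

Here the synonym's equivalent is what separates the two noun senses: both contain "bank", but only one also contains "banking company", so it gets the higher match count.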

Page 23: Language Resource and Language Technology

Asian WordNet Development

[Figure: Asian WordNet development diagram. Labels: GWN; AWN; WN; merged-WN; X-English, Thai-English, and Indonesian-English dictionaries; KUI; Correction, Voting, Lookup, Translation, Discussion, Addition; Applications: Dictionary, Ontology, CL-Search, MT, Summarization, IE/IR, ...]

Page 24: Language Resource and Language Technology


English-English

Page 25: Language Resource and Language Technology

Thai-English

Page 26: Language Resource and Language Technology

Thai-Indonesian

Page 27: Language Resource and Language Technology

Thai-Lao Phoneme-based MT

• Sharing of character set (similar but different encoding scheme)
• Sharing of phrase structure
• Sharing of vocabulary
• http://www.tcllab.org/th2lao

Phoneme mapping with a table of word exceptions

Page 28: Language Resource and Language Technology


Phoneme Mapping

Thai input text: เครื่องร่อน
↓ G2P
Thai phonetics: Khr-vv-ng -^2|r-@-n -^2|
↓ Phonetic conversion rule (phoneme mapping, word mapping)
Lao phonetics: Kh-vv-ng -^2|l-@-n -^2|
↓ Surface generation
Lao text: ເຄື່ອງລ່ອນ

Phoneme mapping in this example: khr -> kh, r -> l
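The phonetic conversion step on this slide can be sketched as a small rule lookup with an exception table consulted first. This is only an illustration under stated assumptions: the function names, the syllable format (syllables separated by `|`, onset before the first `-`), and the empty exception table are hypothetical, and the rule table holds only the two onset rules visible in the example (khr -> kh, r -> l).

```python
# Onset rewrite rules taken from the slide's example only; a real system
# would need a far larger table.
ONSET_RULES = {"khr": "kh", "r": "l"}

# Hypothetical word-exception table: whole Thai phonetic string -> Lao
# phonetic string. Left empty because the slides list no actual exceptions.
WORD_EXCEPTIONS = {}


def map_syllable(syllable):
    """Rewrite the onset of one 'onset-vowel-coda -^tone' syllable."""
    onset, sep, rest = syllable.partition("-")
    mapped = ONSET_RULES.get(onset.lower(), onset.lower())
    if onset[:1].isupper():
        # Keep the capitalisation used in the slide's transcription.
        mapped = mapped.capitalize()
    return mapped + sep + rest


def thai_to_lao_phonetics(phonetics):
    """Convert a Thai phonetic string to Lao phonetics.

    Word-level exceptions are checked first; otherwise each syllable's
    onset is mapped through ONSET_RULES.
    """
    if phonetics in WORD_EXCEPTIONS:
        return WORD_EXCEPTIONS[phonetics]
    return "|".join(map_syllable(s) if s else s for s in phonetics.split("|"))


lao = thai_to_lao_phonetics("Khr-vv-ng -^2|r-@-n -^2|")
```

Only the onset is rewritten here; vowels, codas, and tone marks pass through unchanged, which matches the single example shown but may not hold for the full rule set.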

Page 29: Language Resource and Language Technology


Sample of Consonant Phoneme Mapping

Thai mid | Sym | Lao mid
ก | k | ກ
จ | c | ຈ
ด, ฎ | d | ດ
ต, ฏ | t | ຕ
บ | b | ບ
ป | p | ປ
อ | z | ອ

Thai low | Thai high | Sym | Lao low | Lao high
ค ฆ | ข | kh | ຄ | ຂ
ช ฌ | ฉ | ch | ຊ | ສ
ซ | ส ศ ษ | ch | ຊ | ສ
ง | หง | ng | ງ | ຫງ
ญ ย | หญ หย | j | ຍ | ຫຍ
ฑ ฒ ธ | ฐ ถ | th | ທ | ຖ
ณ น | หน | n | ນ | ຫນ
พ ภ | ผ | ph | ພ | ຜ
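A consonant table like the low/high one above lends itself to a simple lookup structure. The sketch below is an assumption-laden illustration: the row split into low and high classes is a reading of the flattened slide table, and the names `ROWS` and `CONSONANT_MAP` are hypothetical.

```python
# Each row: (symbol, Thai low-class forms, Thai high-class forms,
#            Lao low-class form, Lao high-class form).
# The low/high split is an interpretation of the slide's flattened table.
ROWS = [
    ("kh", "ค ฆ", "ข", "ຄ", "ຂ"),
    ("ch", "ช ฌ", "ฉ", "ຊ", "ສ"),
    ("ch", "ซ", "ส ศ ษ", "ຊ", "ສ"),
    ("ng", "ง", "หง", "ງ", "ຫງ"),
    ("j", "ญ ย", "หญ หย", "ຍ", "ຫຍ"),
    ("th", "ฑ ฒ ธ", "ฐ ถ", "ທ", "ຖ"),
    ("n", "ณ น", "หน", "ນ", "ຫນ"),
    ("ph", "พ ภ", "ผ", "ພ", "ຜ"),
]

# Thai consonant (or ห-led cluster) -> (phoneme symbol, Lao consonant),
# keeping the consonant class: low maps to low, high maps to high.
CONSONANT_MAP = {}
for sym, thai_low, thai_high, lao_low, lao_high in ROWS:
    for thai in thai_low.split():
        CONSONANT_MAP[thai] = (sym, lao_low)
    for thai in thai_high.split():
        CONSONANT_MAP[thai] = (sym, lao_high)
```

For instance, `CONSONANT_MAP["ข"]` gives `("kh", "ຂ")`, so several Thai letters that share one phoneme and class collapse onto a single Lao letter.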

Page 30: Language Resource and Language Technology


Language Grid

• Led by Prof. Toru Ishida, Kyoto University and NICT
• Service of language resources and language computing
• Participation
– Language resource provider
– Computational resource provider
– Language service user
• NECTEC as a node of Langrid Operation
• http://www.langrid.org