

8/13/2004 NYCNLP (COLING 2004)

Cross-lingual Information Extraction System Evaluation

Kiyoshi Sudo

Satoshi Sekine

Ralph Grishman

New York University


Outline

1. Introduction

2. Cross-lingual IE system

• Translation-based QDIE system

• Cross-lingual QDIE system

3. Experiment

4. Discussion

5. Conclusion


Information Extraction

• Identifying entities in source text and mapping them from the source text to a pre-defined table.

“A smiling Palestinian suicide bomber triggered a massive explosion in the heavily policed heart of downtown Jerusalem today, …”

(Terrorism Activity)

Date: today

Location: downtown Jerusalem

Perpetrator: A … suicide bomber


Local Context

• Local context provides useful information for identifying entities.

Date: today

Location: downtown Jerusalem

Perpetrator: A … suicide bomber

“A smiling Palestinian suicide bomber triggered a massive explosion in the heavily policed heart of downtown Jerusalem today, …”


Extraction Patterns

• Extraction patterns have been widely used as an effective means to extract entities.

– Pre-defined template (Riloff 1993): (kidnapped in <x>)

– Predicate-Argument (Yangarber et al. 2000): (<org>, appoint, <person>)

– Dependency Tree (Sudo et al. 2003): (trigger (OBJ: explosion) (ADV: <date>))

• Because of the cost of porting IE systems, automatic pattern discovery techniques have become important.

– application of bootstrapping methods (Riloff and Jones 1999, Yangarber et al. 2000)


Pattern Discovery

QDIE = query-driven information extraction (Sudo et al. 2003)

• Preprocess source documents (NE-tagging, dependency parsing)

• Query (keyword, narrative) → IR

(1) Get relevant documents

(2) Score pattern candidates based on TF/IDF

– Pattern candidate: any subtree that contains at least one NE instance

(3) Use pattern matching
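
Step (2) can be sketched as follows. This is a minimal illustration, not the scoring function of Sudo et al. 2003: it assumes a plain TF×IDF form in which term frequency is counted over the retrieved relevant documents and document frequency over the whole source collection. Pattern candidates are plain strings standing in for dependency subtrees, and `score_patterns`, `relevant_docs`, and `all_docs` are hypothetical names.

```python
import math
from collections import Counter

def score_patterns(relevant_docs, all_docs):
    """Rank pattern candidates by a simple TF*IDF score.

    relevant_docs, all_docs: lists of documents, each a list of pattern
    candidates (strings standing in for dependency subtrees).
    Assumption: TF is counted in the relevant set, DF in the full collection.
    """
    tf = Counter(p for doc in relevant_docs for p in doc)
    df = Counter(p for doc in all_docs for p in set(doc))
    n = len(all_docs)
    # Skip candidates never seen in the collection to avoid dividing by zero.
    scores = {p: tf[p] * math.log(n / df[p]) for p in tf if df[p] > 0}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy collection (invented for illustration).
all_docs = [
    ["(appoint (OBJ: <person>))", "(say (SBJ: <person>))"],
    ["(appoint (OBJ: <person>))", "(say (SBJ: <person>))"],
    ["(say (SBJ: <person>))"],
    ["(say (SBJ: <person>))", "(rise (SBJ: <org>))"],
]
relevant_docs = all_docs[:2]  # pretend the IR step returned the first two
ranked = score_patterns(relevant_docs, all_docs)
```

In this toy collection the `appoint` pattern ranks first because it is concentrated in the relevant documents, while the `say` pattern occurs in every document and scores zero.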


Cross-lingual IE

• Assume we have

– Machine Translation System

– Basic linguistic tools for source and target language

  • Morphological analyzer, parser, NE-tagger, IR system

[Figure: system diagram with query, source document, MT system, E-QDIE, and J-QDIE; languages labeled Japanese and English]




Translation-based QDIE system

(1) Translate the source documents

(2) Use English QDIE system

[Figure: query and source documents, with Japanese and English sides labeled]


Cross-lingual QDIE system

(1) Translate the user’s query

(2) Use Japanese QDIE system

(3) Translate the extracted table

[Figure: query and source documents, with Japanese and English sides labeled]
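
The two architectures differ only in where translation happens. The sketch below is a rough illustration under stated assumptions: `translate`, `e_qdie`, and `j_qdie` are hypothetical callables, not real APIs, and the toy stand-ins exist only so the code runs end to end.

```python
def translation_based_qdie(query_en, docs_ja, translate, e_qdie):
    # (1) Translate the source documents, (2) run the English QDIE system.
    docs_en = [translate(d, "ja", "en") for d in docs_ja]
    return e_qdie(query_en, docs_en)

def crosslingual_qdie(query_en, docs_ja, translate, j_qdie):
    # (1) Translate the query, (2) run the Japanese QDIE system,
    # (3) translate each cell of the extracted table.
    query_ja = translate(query_en, "en", "ja")
    table_ja = j_qdie(query_ja, docs_ja)
    return [{slot: translate(value, "ja", "en") for slot, value in row.items()}
            for row in table_ja]

# Toy stand-ins (hypothetical): a word-by-word dictionary "MT system" and an
# "extractor" that reports the first token of every document matching the query.
TOY_MT = {
    ("en", "ja"): {"succession": "就任", "Akiyama": "秋山"},
    ("ja", "en"): {"就任": "succession", "秋山": "Akiyama"},
}

def toy_translate(text, src, tgt):
    table = TOY_MT[(src, tgt)]
    return " ".join(table.get(word, word) for word in text.split())

def toy_qdie(query, docs):
    return [{"Person": d.split()[0]} for d in docs if query in d]

docs_ja = ["秋山 就任"]
via_translation = translation_based_qdie("succession", docs_ja, toy_translate, toy_qdie)
via_crosslingual = crosslingual_qdie("succession", docs_ja, toy_translate, toy_qdie)
```

With a perfect toy translator both pipelines return the same table. The difference in practice is how much text passes through MT: whole documents in the first pipeline, only the query and the extracted entities in the second.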


Comparison of two systems

• Translation-based QDIE

– No source-language-specific tools are necessary except the MT system.

– Tools for the E-QDIE system were customized for English (not for the output of the MT system)

• Cross-lingual QDIE

– MT for short sentences or phrases (for query and extracted entities)

– Tools for the J-QDIE system were customized for Japanese.


Experiment

• Management Succession Extraction Task (simple version of MUC-6 task)

– Identify the entities involved in a succession event.

  • Person, Post, Organization

• Test document

– 100 articles (61 relevant, 39 irrelevant) accumulated from Yomiuri Newspaper 1999 (Japanese)

– Person(173/651), Post(210/626), Organization(111/709)

• Source document and tools

– 130,000 articles from Yomiuri Newspaper 1998 (Japanese)

– MT system: “King of Translation” (IBM)

– NE tagger: (Sekine and Nobata 2004).

• Extraction performance is measured by recall/precision of extracted entities.
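
Recall and precision over extracted entities can be computed as in the sketch below. It assumes exact-match scoring of (document, slot, entity) triples; the matching criterion used in the experiment is not stated on the slide, and the sample data is invented for illustration.

```python
def precision_recall(extracted, gold):
    """Precision and recall over sets of (document id, slot, entity) triples."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Invented example: four gold entities, three extracted, two of them correct.
gold = {
    ("d1", "Person", "Akiyama"),
    ("d1", "Post", "chairman"),
    ("d1", "Organization", "Kansai Economic Federation"),
    ("d2", "Person", "Okajima"),
}
extracted = {
    ("d1", "Person", "Akiyama"),
    ("d1", "Post", "chairman"),
    ("d2", "Post", "director"),
}
p, r = precision_recall(extracted, gold)
```

Varying how many top-ranked patterns are used for matching gives one (recall, precision) point per setting, which is how a recall/precision curve can be traced.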


Cross-lingual QDIE does better

• Maximum recall:

  • crosslingual system: 60%

  • translation-based system: 41%

[Figure: precision (y-axis, 0 to 80) vs. recall (x-axis, 0 to 80) for the Crosslingual and Translation systems]


Translation QDIE suffers from NE recognition errors

• NE tagger was customized for English (WSJ)

– many of the Japanese NEs do not occur in WSJ.

  • [ Kansai Economic Federation ] ORG → [ Kansai ] LOC [ Economic Federation ] ORG

– Translation errors

• result in fewer and noisier pattern candidates (Translation / Cross-lingual)

– Person: 4543 / 12096

– Post: 3924 / 14986

– Organization: 4014 / 11812

NE tagging by Cross-language Projection (inspired by Riloff et al. 2002)

• used Giza++ (Och et al. 2003) to make word alignments between original Japanese sentences and MT-ed English sentences.

• doubled the number of pattern candidates.

Japanese: 順天堂 大 の 水野 美邦 教授

MT output: Professor Mizuno 美邦 of 順天堂 large

(= Yoshikuni Mizuno, professor at Juntendo Univ.)

大 = abbreviation of 大学 (= Univ.), frequently mistranslated as “Large”
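
The projection step can be sketched as copying NE tags across word alignments. This is a minimal illustration, not the authors' implementation: the tag names and the alignment pairs are hand-written assumptions standing in for the output of the Japanese NE tagger and of a word aligner such as Giza++.

```python
def project_ne_tags(src_tags, alignment, tgt_len):
    """Copy NE tags from source tokens to their aligned target tokens.

    src_tags: one NE tag per source token ("O" means no entity).
    alignment: (source index, target index) pairs from a word aligner.
    """
    tgt_tags = ["O"] * tgt_len
    for s, t in alignment:
        if src_tags[s] != "O":
            tgt_tags[t] = src_tags[s]
    return tgt_tags

ja = ["順天堂", "大", "の", "水野", "美邦", "教授"]
ja_tags = ["ORG", "ORG", "O", "PERSON", "PERSON", "POST"]  # hypothetical tag set
mt = ["Professor", "Mizuno", "美邦", "of", "順天堂", "large"]
# Hand-written alignment (assumption): 順天堂→順天堂, 大→large, の→of,
# 水野→Mizuno, 美邦→美邦, 教授→Professor.
alignment = [(0, 4), (1, 5), (2, 3), (3, 1), (4, 2), (5, 0)]
mt_tags = project_ne_tags(ja_tags, alignment, len(mt))
```

Here the mistranslated token "large" still receives the ORG tag from 大, so the entity boundary survives the translation error.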


Still Cross-lingual QDIE does better

• Maximum recall:

  • crosslingual system: 60%

  • translation-based system with NE projection: 52%

  • translation-based system: 41%

[Figure: precision (y-axis, 0 to 80) vs. recall (x-axis, 0 to 80) for the Crosslingual, Translation, and Translation+NE systems]


Problems in Translation

• Incorrect dependency structure caused by MT errors.


Correct Translation:

On the sixth, since the financial reports for the fiscal year that ended in February, 1999 will end in a deficit, "Okajima" (Marunouchi, Kofu-city), the leading department store in the prefecture, announced that six of the thirteen full-time directors, including President Hiroyuki Okajima (40), two executive directors and a managing director, submitted the resignation letter and will formally resign at the general meeting of shareholders of the company.


MT Output:

From Muika the term settlement of accounts ended February , 99 having become the prospect of the first deficit settlement of accounts after the war etc. , six of President Hiroyuki Okajima ( 40 ) , two managing directors , one managing directors , the full-time directors that are 13 persons submitted the resignation report , “Okajima” of Marunouchi , Kofu-shi who is the major department store within the prefecture announced that he resigns formally by the fixed general meeting of shareholders of the company planned at the end of this month .


Problems in Translation

• Structural difference

– multiple translations of a single source language expression make pattern discovery more difficult on MT output

<post> に就任する。

→ be appointed to <post>

→ assume <post>

→ be inaugurated as <post> (translation error)
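
The effect on frequency-based pattern scoring can be shown with a small count. The numbers below are invented for illustration only: a single Japanese pattern keeps all of its occurrences, while its MT renderings split the same occurrences across three English surface forms.

```python
from collections import Counter

# Hypothetical counts: one Japanese pattern observed 9 times.
ja_counts = Counter({"<post> に就任する": 9})

# The same 9 occurrences after MT, split across three English renderings.
mt_renderings = (
    ["be appointed to <post>"] * 4
    + ["assume <post>"] * 3
    + ["be inaugurated as <post>"] * 2
)
en_counts = Counter(mt_renderings)

max_ja = max(ja_counts.values())  # highest frequency on the Japanese side
max_en = max(en_counts.values())  # highest frequency on the MT side
```

No English pattern reaches the frequency of the single Japanese pattern, so each one is weaker evidence for a frequency-based scorer.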


Related Work

• Riloff et al. 2002

– showed how CLIE systems can be developed with IE learning tools, bitext alignment and an MT system.

– conducted experiments on a relatively close language pair: English and French

  • “achieved roughly the same level of performance as the source-language IE system”

• We expect that the performance gap between translation-based IE and cross-lingual IE is more pronounced with a more divergent language pair like Japanese and English.


Conclusion

• We discussed the difficulty in cross-lingual information extraction caused by the translation of the source text.

• Cross-lingual QDIE performs better

– Translation-based QDIE suffers from NE recognition errors.

– Structural errors and incorrect dependency analysis in MT output caused fewer and noisier pattern candidates


Further Discussions

• Linguistic tools necessary for QDIE systems are available for major languages.

• Speculation from TIDES Surprise Language Exercise: development of tools in a new language

– Machine Translation

– Cross-lingual Information Retrieval

– Named Entity tagger

– (dependency/shallow/full) parser needs more work

• Additional performance gain for Cross-lingual QDIE may be achieved by the techniques for query translation + query expansion.


NE tagging by Cross-language Projection

• used Giza++ (Och et al. 2003) to make word alignments between original Japanese sentences and MT-ed English sentences.

• doubled the number of pattern candidates.

President Akiyama is inaugurated as the following chairman of Kansai Economic Federation.

秋山社長が関西経済連合会の次期会長に就任する。

(inspired by Riloff et al. 2002)