Terminology Extraction System Based on Vocabulary Space


Hiroshi Nakagawa

Information Technology Center,

The University of Tokyo

German-Japan NL WS in Sapporo, 2003/7/4

Really useful and interesting terminologies

• 歩留まり (Bu-Domari): success rate ??
• 横持ち (literally "side take"): transportation between the station of the main transportation method (such as an airport or train station) and the destination or starting point.
• 玉掛け (literally "ball hinge"): to operate a power shovel

• German
• German-Japan
• German-Japan natural
• German-Japan natural language
• German-Japan natural language processing
• German-Japan natural language processing workshop
• German-Japan natural language processing workshop program
• German-Japan natural language processing workshop program chair

Long Compound Nouns

German-Japan natural language processing workshop program chair and

German-Japan natural language processing workshop program chair and ACL

German-Japan natural language processing workshop program chair and ACL2003

German-Japan natural language processing workshop program chair and ACL2003 general

German-Japan natural language processing workshop program chair and ACL2003 general chair Professor

German-Japan natural language processing workshop program chair and ACL2003 general chair Professor Tsujii

German-Japan natural language processing workshop program chair and ACL2003 general chair Professor Tsujii’s laboratory

German-Japan natural language processing workshop program chair and ACL2003 general chair Professor Tsujii’s laboratory chief scientist Dr. Tsuruoka

Long compound nouns (NPs) are a source of information about terminology.

Objective

An up-to-date domain terminology dictionary is the gateway to various technological and academic fields.

To build one, we first need high quality terminology for the target domain.

What corpus? Ordinary corpus or Web pages?

Concepts

Methodological classification:

Supervised learning based extraction

I. finding heavily influenced features

II. surrounding patterns of target expression

III. technology developed by NE task

Statistics based extraction (our target)

I. document space based statistics

II. linguistic structure based formalism, such as syntactic or semantic structure

III. vocabulary space based statistics (our target)

Document space versus Vocabulary space

[Figure: documents on the Web, each containing term occurrences such as "ab, xy", "abc", "abc, lmnxy", "abc,abc,ablmn" and "xy,xy".]

Document space based statistics

Old fashioned

Weight term candidates based on their occurrence in the document space (a corpus or the Web), and rank them in descending order.

• term frequency or tf*idf for basic nouns
• to extract compound nouns: contingency matrix and co-occurrence based decision with MI, χ², Dice, etc.
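The co-occurrence based decision for a two-noun candidate can be illustrated with a minimal sketch. It assumes four counts taken from a 2x2 contingency matrix: `f_xy` (the two nouns occur adjacently), `f_x` and `f_y` (each noun's total count) and `n` (total number of adjacent pairs). The function names and the sample counts are hypothetical and not taken from the slides.

```python
import math


def dice(f_xy, f_x, f_y):
    """Dice coefficient of two nouns from their joint and marginal counts."""
    return 2.0 * f_xy / (f_x + f_y)


def mutual_information(f_xy, f_x, f_y, n):
    """Pointwise mutual information (base 2) of the noun pair."""
    return math.log2(f_xy * n / (f_x * f_y))


def chi_square(f_xy, f_x, f_y, n):
    """Chi-square statistic of the 2x2 contingency matrix of the pair."""
    a = f_xy              # x followed by y
    b = f_x - f_xy        # x followed by something else
    c = f_y - f_xy        # something else followed by y
    d = n - a - b - c     # neither x first nor y second
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Hypothetical counts: the pair is seen 8 times among 100 adjacent pairs.
scores = (dice(8, 10, 16), mutual_information(8, 10, 16, 100), chi_square(8, 10, 16, 100))
```

A pair whose score exceeds a chosen threshold would be accepted as a compound noun candidate; the threshold itself has to be tuned per corpus.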

Linguistic Structure based method

Syntactic structure
• POS pattern like {adj (noun)+}
• phrasal verbs, etc.

Semantic structure of compound nouns
• Predicate argument structure (e.g. Pustejovsky)
• Case frame of predicate

Single and compound nouns are not treated equally.

Vocabulary space based method

Statistics of vocabulary space such as
• Statistics of embedded relation (C-value)
• How many compound nouns the target noun makes (LR = our proposal)
• Application of link structure analysis of Web pages (PageRank, HITS)

Single and compound nouns are treated equally.

Our objective

Experimental analysis and evaluation of various term extraction methods with
• Test collection (TMREC) corpus
• Web page corpus
• Domain dictionaries on the Web or on CD-ROM as gold standard

Term extraction system repository

Gensen Web ( 言選 Web) http://gensen.dl.itc.u-tokyo.ac.jp/gensenweb_eng.html

Finally: an automatic builder for an up-to-date dictionary of domain terms

ATR by Compound Noun Statistics

言選 Web (Gensen Web)

Automatic term extraction from WEB pages

Step 1. Term candidate extraction

separating text at stop words (or using a morphological analyzer) to generate candidates

Step 2. Scoring candidates to rank them

our scoring mechanism is innovative and unique
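Step 1 can be illustrated with a minimal sketch for English text. It assumes a toy stop word list and simple regular expression tokenization; `STOP_WORDS` and `extract_candidates` are hypothetical names, and a Japanese system would need a morphological analyzer instead of this tokenizer.

```python
import re

# Hypothetical toy stop word list; a real list would be much larger.
STOP_WORDS = {"a", "an", "the", "of", "for", "is", "are", "by",
              "and", "in", "on", "to", "we", "use"}


def extract_candidates(text, stop_words=STOP_WORDS):
    """Split text at stop words and punctuation into word sequences."""
    candidates = []
    current = []
    # Tokens are either alphabetic words or single punctuation marks.
    for token in re.findall(r"[A-Za-z]+|[^A-Za-z\s]", text.lower()):
        if token.isalpha() and token not in stop_words:
            current.append(token)
        else:
            # A stop word or punctuation mark closes the running candidate.
            if current:
                candidates.append(tuple(current))
            current = []
    if current:
        candidates.append(tuple(current))
    return candidates


candidates = extract_candidates(
    "We use trigram statistics for the acquisition of word classes.")
```

Each candidate is a tuple of words, so a one-word tuple plays the role of a simple noun and a longer tuple the role of a compound noun in Step 2.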

Domain Specific Terms

expressing domain concepts

About 85%: compound nouns
About 15%: simple nouns

• Simple noun: cannot be divided into shorter nouns
• Compound noun: uninterrupted sequence of simple nouns

Our purpose is to automatically extract domain specific terms, including both compound and simple nouns, from a domain corpus.

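Scoring of Simple Nouns

Nouns that adjoin the simple noun N = "trigram" in compound nouns: n = 3 distinct left nouns L_i and m = 2 distinct right nouns R_j, with their frequencies.

Left noun L_i    freq.    N          Right noun R_j    freq.
noun             3        trigram    statistics        2
character        1                   acquisition       1
class            1

LN(trigram) = 3 + 1 + 1 = 5 (n = 3), RN(trigram) = 2 + 1 = 3 (m = 2)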

Principle: a simple noun that contributes to forming a large number of compound nouns gets a high score.
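The counts LN and RN for the "trigram" example can be reproduced with a minimal sketch. It assumes the compound noun candidates are available as tuples of simple nouns with their frequencies; `adjacency_counts` and the sample list are hypothetical and only mirror the numbers on the slide.

```python
from collections import Counter


def adjacency_counts(compounds):
    """Return (LN, RN): total frequency of nouns adjoining each simple noun.

    compounds: iterable of (tuple_of_simple_nouns, frequency) pairs.
    """
    ln, rn = Counter(), Counter()
    for nouns, freq in compounds:
        for left, right in zip(nouns, nouns[1:]):
            ln[right] += freq   # `left` appears directly to the left of `right`
            rn[left] += freq    # `right` appears directly to the right of `left`
    return ln, rn


# Hypothetical candidate list matching the slide's counts for "trigram".
compounds = [
    (("noun", "trigram"), 3),
    (("character", "trigram"), 1),
    (("class", "trigram"), 1),
    (("trigram", "statistics"), 2),
    (("trigram", "acquisition"), 1),
]
ln, rn = adjacency_counts(compounds)
```

With this input, `ln["trigram"]` is 5 and `rn["trigram"]` is 3, the same values as LN(trigram) and RN(trigram) in the example.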

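Scoring of compound nouns: GM(Compound Noun)

For a compound noun $CN = N_1 N_2 \cdots N_L$ consisting of $L$ simple nouns,

$$GM(CN) = \left( \prod_{i=1}^{L} \bigl(LN(N_i)+1\bigr)\bigl(RN(N_i)+1\bigr) \right)^{\frac{1}{2L}}$$

GM(CN) is a geometric mean, so it does not depend on the length of CN.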

New scoring function: FGM(CN)

If CN occurs independently, then

$$FGM(CN) = f(CN) \times GM(CN)$$

where f(CN) is the number of independent occurrences of the noun CN, that is, occurrences in which CN does not appear as part of a longer compound noun.

Ex. $GM(\text{trigram}) = \bigl((5+1) \times (3+1)\bigr)^{1/2} \approx 4.9$; if f(trigram) = 5, then FGM(trigram) ≈ 24.5.
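The two scores can be checked numerically with a minimal sketch. It assumes LN and RN are given as plain dictionaries keyed by simple noun and that a compound noun is a tuple of simple nouns; the function names `gm` and `fgm` and their signatures are hypothetical.

```python
def gm(compound, ln, rn):
    """Geometric mean score of a compound noun given as a tuple of simple nouns."""
    product = 1.0
    for noun in compound:
        # Nouns never seen next to another noun contribute (0+1)*(0+1) = 1.
        product *= (ln.get(noun, 0) + 1) * (rn.get(noun, 0) + 1)
    return product ** (1.0 / (2 * len(compound)))


def fgm(compound, ln, rn, independent_freq):
    """GM weighted by the number of independent occurrences f(CN)."""
    return independent_freq * gm(compound, ln, rn)


# Values from the slide's example: LN(trigram) = 5, RN(trigram) = 3, f = 5.
ln = {"trigram": 5}
rn = {"trigram": 3}
gm_trigram = gm(("trigram",), ln, rn)        # square root of 24
fgm_trigram = fgm(("trigram",), ln, rn, 5)   # 5 times the value above
```

Because the exponent is 1/(2L), repeating the same noun leaves the score unchanged: `gm(("trigram", "trigram"), ln, rn)` gives the same value as `gm(("trigram",), ln, rn)` up to rounding, which is the length independence stated above.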

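Modified C-value

Modify C-value (Frantzi & Ananiadou, 1996) to be able to score a simple noun.

length(a): number of simple nouns that make up a
freq(a): frequency of a
t(a): frequency of candidate compound nouns including a
c(a): number of distinct candidate compound nouns including a

$$\text{C-value}(a) = \bigl(\mathrm{length}(a) - 1\bigr)\left(\mathrm{freq}(a) - \frac{t(a)}{c(a)}\right)$$

$$\text{MC-value}(a) = \mathrm{length}(a)\left(\mathrm{freq}(a) - \frac{t(a)}{c(a)}\right)$$

With C-value, a simple noun (length 1) always scores 0; MC-value replaces the factor length(a) − 1 with length(a) so that simple nouns receive a score.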
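The difference between the two formulas can be seen in a minimal sketch. It assumes the four quantities are already counted; the function names are hypothetical, and the handling of c(a) = 0 (no longer candidate contains a, so nothing is subtracted) is an assumption of this sketch rather than something stated on the slide.

```python
def _nested_term(t, c):
    """t(a)/c(a), taken as 0 when no longer candidate contains a (assumption)."""
    return t / c if c > 0 else 0.0


def c_value(length, freq, t, c):
    """C-value: (length(a) - 1) * (freq(a) - t(a)/c(a))."""
    return (length - 1) * (freq - _nested_term(t, c))


def mc_value(length, freq, t, c):
    """MC-value: length(a) * (freq(a) - t(a)/c(a))."""
    return length * (freq - _nested_term(t, c))


# Hypothetical simple noun: seen 10 times, contained in 3 distinct longer
# candidates whose frequencies sum to 6.
simple_c = c_value(1, 10, 6, 3)     # 0.0, the simple noun cannot be ranked
simple_mc = mc_value(1, 10, 6, 3)   # 8.0
```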

Experimental Evaluations

The corpus is a manually POS-tagged Japanese corpus, and the gold standard is a set of manually extracted terms, both developed for the NTCIR-1 TMREC task (artificial intelligence field: 1,870 paper abstracts).

The gold standard consists of 8,843 manually extracted domain specific terms.

The data used in our experiment was developed by NII.

Complete and partial match by GM (baseline)

[Figure: number of extracted terms that coincide with the gold standard (y-axis, 0 to 3500) against the number of automatically extracted terms (x-axis, 0 to 3000), with one curve for complete match and one for partial match (contained).]

Number of completely matched terms by FGM and MC-value

[Figure: difference from the GM baseline in the number of extracted terms that coincide with the gold standard (y-axis, -50 to 350) against the number of automatically extracted terms (x-axis), with curves FGM − GM and MC-value − GM.]

Number of partially matched terms by FGM and MC-value

[Figure: difference from the GM baseline (y-axis, -450 to 0) against the number of automatically extracted terms (x-axis, 0 to 2800), with curves FGM − GM and MC-value − GM.]

Average length (every 100 terms) of extracted terms

[Figure: average length of extracted terms (y-axis, 0 to 5) against the number of extracted terms (x-axis, 0 to 3000), with curves for GM, FGM and MC-value.]

Top scored 20 terms by GM

candidate terms / frequency
1. 知識 (knowledge) 787 ○
2. 学習知識 (learning knowledge) 1 ○
3. 学習 (learning) 255 ○
4. 言語的知識 (linguistic knowledge) 2 ○
5. 知識システム (knowledge system) 14 ○
6. 学習システム (learning system) 16 ○
7. 問題知識 (problem knowledge) 3 ×
8. 学習問題 (learning problem) 5 ○
9. 言語的 (linguistic) 1 ○
10. システム (system) 861 ○
11. 問題 (problem) 561 ○
12. 論理的知識 (logical knowledge) 1 ○
13. 学習支援システム (learning assistance system) 3 ○
14. 設計知識 (design knowledge) 29 ○
15. 学習問題解決システム (learning problem solver) 1 ○
16. 学習支援 (learning assistance) 9 ○
17. 言語的情報 (linguistic information) 3 ○
18. 知識モデル (knowledge model) 3 ○
19. 設計システム (design system) 6 ○
20. システム設計 (system design) 1 ○

Top scored 20 terms by FGM

candidate terms / frequency
1. 知識 (knowledge) 787 ○
2. システム (system) 861 ○
3. 問題 (problem) 561 ○
4. 学習 (learning) 255 ○
5. 学習者 (learner) 383 ○
6. モデル (model) 356 ○
7. 情報 (information) 382 ○
8. 問題解決 (problem solving) 186 ○
9. 設計 (design) 183 ○
10. 知識ベース (knowledge base) 149 ○
11. 推論 (inference) 162 ○
12. 支援 (assistance) 87 ×
13. 知識表現 (knowledge representation) 74 ○
14. エージェント (agent) 256 ○
15. 学習者モデル (learner’s model) 57 ○
16. 機能 (function) 294 ×
17. 設計者 (designer) 69 ○
18. 対話 (dialogue) 205 ○
19. 言語 (language) 75 ○
20. 対象 (object) 293 ○

Top scored 20 terms by MC-value

candidate terms / frequency
1. 学習者 (learner) 383 ○
2. 問題解決 (problem solving) 186 ○
3. システム (system) 861 ○
4. 知識 (knowledge) 787 ○
5. 研究 (research) 651 ×
6. 本稿 (this paper) 594 ×
7. 手法 (method) 562 ×
8. 問題 (problem) 561 ○
9. 知識ベース (knowledge base) 149 ○
10. 論文 (paper) 453 ×
11. 方法 (method, way to do) 426 ×
12. 支援システム (assistance system) 18 ×
13. 計算機 (computer) 128 ○
14. 情報 (information) 382 ○
15. モデル (model) 356 ○
16. 自然言語 (natural language) 63 ○
17. 我々 (we) 332 ×
18. 有効性 (effectiveness) 160 ×
19. エキスパートシステム (expert system) 78 ○
20. ユーザ (user) 297 ○

Precision (complete match) of each method

terms        FGM     MC-value    N1      N2
1~1000       .773    .754        .705    .744
1001~2000    .635    .707        .607    .584
2001~3000    .562    .640        .618    .518

N1, N2: top two systems of NTCIR-1
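Precision per rank band, as reported in the table above, can be computed with a minimal sketch. It assumes a ranked list of extracted terms and a gold standard set, and it counts complete matches only; `precision_by_band` and the toy data are hypothetical and not the evaluation script used for the reported numbers.

```python
def precision_by_band(ranked_terms, gold_terms, band_size):
    """Precision (complete match) within consecutive bands of the ranking."""
    gold = set(gold_terms)
    precisions = []
    for start in range(0, len(ranked_terms), band_size):
        band = ranked_terms[start:start + band_size]
        hits = sum(1 for term in band if term in gold)
        precisions.append(hits / len(band))
    return precisions


# Toy data: six ranked candidates, three of them in the gold standard.
ranked = ["a", "b", "x", "c", "y", "z"]
gold = {"a", "b", "c"}
bands = precision_by_band(ranked, gold, band_size=3)
```

With `band_size=1000` and a ranking of 3,000 terms, the function would return three values corresponding to the rows 1~1000, 1001~2000 and 2001~3000.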

Precision (partial match) of each method

             FGM                   MC-value
terms        complete   partial    complete   partial
1~1000       .773       .948       .754       .801
1001~2000    .635       .951       .707       .810
2001~3000    .562       .941       .640       .857

Precision of each method when a large number of terms is extracted

terms          FGM     MC-value    N1      N2
1~3000         .657    .704        .643    .615
3001~6000      .495    .513        .499    .449
6001~9000      .470    .416        .254    .460
9001~12000     .408    .362        .284    .438
12001~15000    .330    .344        .337    .311

N1, N2: top two systems of NTCIR-1

Conclusions-1

New statistical methods for ATR, which are basically based on how many nouns adjoin the simple noun in question to form compound nouns.

FGM
・ best at extracting a small number (up to 1,400) of high quality domain specific terms
・ longer terms that include correct terms are better extracted by FGM or GM

MC-value
・ strong at extracting a large number (up to 6,000) of domain specific terms

Conclusions-2

The Web is perceived as a gigantic knowledge resource, but it has yet to be fully utilized.

Terminology in each domain is sure to be the gateway to that domain, for novices and even for experts.

More readily usable ATR is needed.
