Terminology extraction system based on vocabulary space

Terminology Extraction System based on Vocabulary Space. Hiroshi Nakagawa, Information Technology Center, The University of Tokyo. German-Japan NL WS in Sapporo, 2003/7/4


Page 1

Terminology Extraction System based on Vocabulary Space

Hiroshi Nakagawa

Information Technology Center,

The University of Tokyo

German-Japan NL WS in Sapporo, 2003/7/4

Page 2

• 歩留まり (Bu-Domari): Success rate ??

• 横持ち (literally "side take"): Transportation between a main transportation station (like an airport or train station) and the destination or starting point.

• 玉掛け (literally "ball hinge"): To operate a power shovel

Really useful and interesting terminologies

Page 3

• German
• German-Japan
• German-Japan natural
• German-Japan natural language
• German-Japan natural language processing
• German-Japan natural language processing workshop
• German-Japan natural language processing workshop program
• German-Japan natural language processing workshop program chair

Long Compound Nouns

Page 4

German-Japan natural language processing workshop program chair and

German-Japan natural language processing workshop program chair and ACL

German-Japan natural language processing workshop program chair and ACL2003

German-Japan natural language processing workshop program chair and ACL2003 general

Page 5

German-Japan natural language processing workshop program chair and ACL2003 general chair Professor

German-Japan natural language processing workshop program chair and ACL2003 general chair Professor Tsujii

German-Japan natural language processing workshop program chair and ACL2003 general chair Professor Tsujii’s laboratory

Page 6

German-Japan natural language processing workshop program chair and ACL2003 general chair Professor Tsujii’s laboratory

German-Japan natural language processing workshop program chair and ACL2003 general chair Professor Tsujii’s laboratory chief scientist Dr. Tsuruoka

Long compound nouns (NPs) are a source of information about terminology.

Page 7

Objective

An up-to-date domain terminology dictionary is the gateway to various technological and academic fields.

For this, we first need high-quality terminology for the target domain.

Which corpus should be used: an ordinary corpus or Web pages?

Page 8

Concepts

Methodological classification:

Supervised learning based extraction

I. finding heavily influential features

II. surrounding patterns of the target expression

III. technology developed for the NE (named entity) task

Statistics based extraction (our target)

I. document space based statistics

II. formalisms based on linguistic structure, such as syntactic or semantic structure

III. vocabulary space based statistics (our target)

Page 9

Document space versus Vocabulary space

[Figure: Web documents, each containing term occurrences such as "ab, xy", "abc", "abc, lmnxy", "abc,abc,ablmn", and "xy,xy".]

Page 10

Document space based statistics

Old-fashioned

Weight term candidates by their occurrence in document space (a corpus or the Web), and rank them in descending order.

• Term frequency or tf*idf for basic nouns
• To extract compound nouns: a contingency matrix and a co-occurrence based decision with MI, χ², Dice, etc.
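As a rough illustration of the co-occurrence measures named on this slide, the sketch below computes the Dice coefficient and pointwise mutual information for one adjacent word pair. The function name, the argument names, and the choice of the pointwise form of MI are assumptions made for this example; the slide does not say which variant was used.

```python
import math

def association_scores(n_xy, n_x, n_y, n_total):
    """Return (Dice, pointwise MI) for a word pair.

    n_xy: number of times the pair occurs together (hypothetical count)
    n_x, n_y: occurrences of each word on its own
    n_total: total number of observed positions
    """
    dice = 2 * n_xy / (n_x + n_y)
    # Pointwise MI: log2 of observed joint probability over the
    # probability expected if the two words were independent.
    pmi = math.log2((n_xy * n_total) / (n_x * n_y))
    return dice, pmi

dice, pmi = association_scores(n_xy=10, n_x=20, n_y=20, n_total=1000)
print(dice, pmi)
```

A pair whose score exceeds some threshold would then be accepted as a compound noun candidate; the threshold itself has to be tuned per corpus.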

Page 11

Linguistic Structure based method

Syntactic structure
• POS patterns like {adj (noun)+}
• phrasal verbs, etc.

Semantic structure of compound nouns
• Predicate argument structure (e.g. Pustejovsky)
• Case frame of the predicate

Single and compound nouns are not treated equally.
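The POS pattern {adj (noun)+} can be matched with an ordinary regular expression once each tag is encoded as a single character. This is a minimal sketch; the tag names "ADJ" and "NOUN" and the function name are assumptions for the example, and the input is assumed to be already POS-tagged.

```python
import re

def match_adj_noun_plus(tagged):
    """Return token sequences matching one ADJ followed by one or more NOUNs."""
    # Encode each tag as one character so a regex can scan the tag sequence.
    code = {"ADJ": "A", "NOUN": "N"}
    tags = "".join(code.get(tag, "x") for _, tag in tagged)
    return [
        " ".join(token for token, _ in tagged[m.start():m.end()])
        for m in re.finditer(r"AN+", tags)
    ]

sentence = [
    ("the", "DET"), ("natural", "ADJ"), ("language", "NOUN"),
    ("processing", "NOUN"), ("is", "VERB"), ("fun", "ADJ"),
]
print(match_adj_noun_plus(sentence))
```

Because the pattern requires a leading adjective, a bare simple noun never matches, which is one way to see why single and compound nouns are not treated equally by this family of methods.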

Page 12

Vocabulary space based method

Statistics of vocabulary space, such as
• Statistics of embedded relations (C-value)
• How many compound nouns the target noun makes (LR = our proposal)
• Application of link structure analysis of Web pages (PageRank, HITS)

Single and compound nouns are treated equally.

Page 13

Our objective

Experimental analysis and evaluation of various term extraction methods with
• Test collection (TMREC) corpus
• Web page corpus
• Domain dictionaries on the Web or on CD-ROM as the gold standard

Term extraction system repository

Gensen Web (言選 Web) http://gensen.dl.itc.u-tokyo.ac.jp/gensenweb_eng.html

Finally: an automatic builder for up-to-date domain term dictionaries

Page 14

ATR by Compound Noun Statistics

Page 15

言選 Gensen Web

Automatic term extraction from Web pages

Step 1. Term candidate extraction

separating text at stop-words (or using a morphological analyzer) to generate candidates

Step 2. Scoring candidates to rank them

our scoring mechanism is innovative and unique
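Step 1 can be sketched as follows for English text: the text is cut at stop-words and punctuation, and each remaining run of words becomes a candidate. The stop-word list and the function name are illustrative assumptions, not Gensen Web's actual resources; a real system would use a much larger list or a morphological analyzer.

```python
import re

# Illustrative stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "the", "of", "and", "is", "are",
              "in", "to", "for", "by", "with", "on"}

def extract_candidates(text):
    """Split text at stop-words and punctuation; return the remaining word runs."""
    candidates, current = [], []
    # Tokens are either words (letters, digits, hyphens) or single punctuation marks.
    for token in re.findall(r"[A-Za-z0-9-]+|[^\sA-Za-z0-9-]", text):
        if token.lower() in STOP_WORDS or not token[0].isalnum():
            if current:  # a separator closes the current candidate
                candidates.append(" ".join(current))
                current = []
        else:
            current.append(token.lower())
    if current:
        candidates.append(" ".join(current))
    return candidates

print(extract_candidates(
    "Extraction of character trigram statistics for the learning system."))
```

This toy version keeps every non-stop-word run, including verbs, so the scoring in Step 2 has to push such noise down the ranking.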

Page 16

Domain Specific Terms

expressing domain concepts

About 85%: compound nouns; about 15%: simple nouns

• Simple noun: a noun that cannot be divided further into shorter nouns
• Compound noun: an uninterrupted sequence of simple nouns

Our purpose is to extract domain-specific terms, including both compound and simple nouns, from a domain corpus automatically.

Page 17

Scoring of Simple Nouns

Left nouns L_i (freq.)    N          Right nouns R_j (freq.)
noun (3)                             statistics (2)
character (1)             trigram
class (1)                            acquisition (1)

n = 3 distinct left nouns, m = 2 distinct right nouns

LN(trigram) = 3 + 1 + 1 = 5, RN(trigram) = 2 + 1 = 3

Principle: A simple noun that contributes to forming a large number of compound nouns gets a high score.
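The counts on this slide can be reproduced with a small sketch that walks over compound nouns and tallies, for every simple noun, which nouns appear directly to its left and right. The input dictionary and the function name are assumptions chosen to mirror the "trigram" example; weighting each adjacency by the compound's frequency follows the "freq." columns of the slide.

```python
from collections import Counter, defaultdict

def adjacency_counts(compound_freqs):
    """Tally left and right neighbours of every simple noun.

    compound_freqs maps a space-separated compound noun to its frequency.
    """
    left = defaultdict(Counter)   # left[N][L_i]  = frequency of "L_i N"
    right = defaultdict(Counter)  # right[N][R_j] = frequency of "N R_j"
    for compound, freq in compound_freqs.items():
        nouns = compound.split()
        for prev, nxt in zip(nouns, nouns[1:]):
            left[nxt][prev] += freq
            right[prev][nxt] += freq
    return left, right

# Hypothetical compounds chosen to match the slide's example.
compounds = {
    "noun trigram": 3, "character trigram": 1, "class trigram": 1,
    "trigram statistics": 2, "trigram acquisition": 1,
}
left, right = adjacency_counts(compounds)
LN = sum(left["trigram"].values())   # total frequency of left neighbours
RN = sum(right["trigram"].values())  # total frequency of right neighbours
n, m = len(left["trigram"]), len(right["trigram"])
print(LN, n, m, RN)
```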

Page 18

Scoring of compound nouns: GM(Compound Noun)

$$\mathrm{GM}(CN) = \left( \prod_{i=1}^{L} \bigl(LN(N_i) + 1\bigr)\bigl(RN(N_i) + 1\bigr) \right)^{\frac{1}{2L}}$$

where $CN = N_1 N_2 \cdots N_L$.

GM(CN) is a geometric mean, so it does not depend on the length of CN.

Page 19

New scoring function: FGM(CN)

If CN occurs independently, then

$$\mathrm{FGM}(CN) = f(CN) \times \mathrm{GM}(CN)$$

where f(CN) is the number of independent occurrences of the noun CN (that is, occurrences in which CN does not appear as part of a longer CN).

Ex. GM(trigram) = ((5+1) × (3+1))^{1/2} = 4.9; if f(trigram) = 5, then FGM(trigram) = 24.5
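The two scores can be sketched in a few lines, assuming GM(CN) is the 2L-th root of the product of (LN(N_i)+1)(RN(N_i)+1) over the L simple nouns of CN, and FGM(CN) is GM(CN) multiplied by the independent frequency f(CN). The function names and the dictionary-based inputs are assumptions for this example; nouns missing from a dictionary are treated as having a count of 0.

```python
import math

def gm(nouns, ln, rn):
    """Geometric mean of (LN+1)(RN+1) over the simple nouns of a candidate."""
    product = 1.0
    for noun in nouns:
        product *= (ln.get(noun, 0) + 1) * (rn.get(noun, 0) + 1)
    # 2L-th root: two factors per noun, L nouns.
    return product ** (1.0 / (2 * len(nouns)))

def fgm(nouns, ln, rn, independent_freq):
    """GM weighted by the number of independent occurrences f(CN)."""
    return independent_freq * gm(nouns, ln, rn)

# Values from the slide's "trigram" example.
ln = {"trigram": 5}
rn = {"trigram": 3}
print(round(gm(["trigram"], ln, rn), 1))      # GM(trigram)
print(round(fgm(["trigram"], ln, rn, 5), 1))  # FGM(trigram) with f = 5
```

Adding 1 to each count keeps the product non-zero for nouns that never appear inside a compound, so such nouns still receive a score of 1 rather than 0.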

Page 20

Modified C-value

Modify C-value (Frantzi & Ananiadou, 1996) so that it can score a simple noun

$$\text{C-value}(a) = \bigl(\mathrm{length}(a) - 1\bigr)\left(\mathrm{freq}(a) - \frac{t(a)}{c(a)}\right)$$

$$\text{MC-value}(a) = \mathrm{length}(a)\left(\mathrm{freq}(a) - \frac{t(a)}{c(a)}\right)$$

length(a): number of simple nouns that make up a

freq(a): frequency of a

t(a): frequency of candidate compound nouns including a

c(a): number of distinct candidate compound nouns including a
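A minimal sketch of both scores follows, assuming C-value(a) = (length(a) - 1)(freq(a) - t(a)/c(a)) and MC-value(a) = length(a)(freq(a) - t(a)/c(a)). Treating the nested term t(a)/c(a) as 0 when no longer candidate contains a (c(a) = 0) is an assumption of this example, and the function names are hypothetical.

```python
def _nested_penalty(t, c):
    # Average frequency of the longer candidates that contain the term.
    # Assumption: the penalty is 0 when no longer candidate contains it.
    return t / c if c > 0 else 0.0

def c_value(length, freq, t, c):
    """C-value as written on the slide; always 0 for a simple noun (length 1)."""
    return (length - 1) * (freq - _nested_penalty(t, c))

def mc_value(length, freq, t, c):
    """Modified C-value: the factor length instead of length - 1."""
    return length * (freq - _nested_penalty(t, c))

# A simple noun seen 10 times, contained in 3 distinct longer candidates
# whose total frequency is 6 (hypothetical numbers).
print(c_value(1, 10, 6, 3), mc_value(1, 10, 6, 3))
```

The example shows the point of the modification: the original score is 0 for every simple noun, while the modified score still ranks them.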

Page 21

Experimental Evaluations

The corpus is a manually POS-tagged Japanese corpus, and the gold standard is a set of manually extracted terms developed for the NTCIR-1 TMREC task (Artificial Intelligence field: 1,870 paper abstracts).

The gold standard consists of 8,843 manually extracted domain-specific terms.

The data used in our experiment was developed by NII.

Page 22

Complete and Partial match by GM (baseline)

[Figure: x-axis: number of automatically extracted terms (0 to 3000); y-axis: number of extracted terms that coincide with the gold standard (0 to 3500); curves: complete match and partial match (contained).]

Page 23

Number of completely matched terms by FGM and MC-value

[Figure: x-axis: number of automatically extracted terms; y-axis: difference from the baseline (GM) in the number of extracted terms that coincide with the gold standard (-50 to 350); curves: FGM - GM and MC-value - GM.]

Page 24

Number of partially matched terms by FGM and MC-value

[Figure: x-axis: number of automatically extracted terms (0 to 2800); y-axis: difference from the baseline (GM) (-450 to 0); curves: FGM - GM and MC-value - GM.]

Page 25

Average length (every 100 terms) of extracted terms

[Figure: x-axis: number of automatically extracted terms (0 to 3000); y-axis: average length (0 to 5); curves: GM, FGM, and MC-value.]

Page 26

Top scored 20 terms by GM

candidate terms    frequency
1. 知識 (knowledge)    787    ○
2. 学習知識 (learning knowledge)    1    ○
3. 学習 (learning)    255    ○
4. 言語的知識 (linguistic knowledge)    2    ○
5. 知識システム (knowledge system)    14    ○
6. 学習システム (learning system)    16    ○
7. 問題知識 (problem knowledge)    3    ×
8. 学習問題 (learning problem)    5    ○
9. 言語的 (linguistic)    1    ○
10. システム (system)    861    ○

Page 27

Top scored 20 terms by GM (cont'd)

11. 問題 (problem)    561    ○
12. 論理的知識 (logical knowledge)    1    ○
13. 学習支援システム (learning assistance system)    3    ○
14. 設計知識 (design knowledge)    29    ○
15. 学習問題解決システム (learning problem solver)    1    ○
16. 学習支援 (learning assistance)    9    ○
17. 言語的情報 (linguistic information)    3    ○
18. 知識モデル (knowledge model)    3    ○
19. 設計システム (design system)    6    ○
20. システム設計 (system design)    1    ○

Page 28

Top scored 20 terms by FGM

candidate terms    frequency
1. 知識 (knowledge)    787    ○
2. システム (system)    861    ○
3. 問題 (problem)    561    ○
4. 学習 (learning)    255    ○
5. 学習者 (learner)    383    ○
6. モデル (model)    356    ○
7. 情報 (information)    382    ○
8. 問題解決 (problem solving)    186    ○
9. 設計 (design)    183    ○
10. 知識ベース (knowledge base)    149    ○

Page 29

Top scored 20 terms by FGM (cont'd)

11. 推論 (inference)    162    ○
12. 支援 (assistance)    87    ×
13. 知識表現 (knowledge representation)    74    ○
14. エージェント (agent)    256    ○
15. 学習者モデル (learner’s model)    57    ○
16. 機能 (function)    294    ×
17. 設計者 (designer)    69    ○
18. 対話 (dialogue)    205    ○
19. 言語 (language)    75    ○
20. 対象 (object)    293    ○

Page 30

Top scored 20 terms by MC-value

candidate terms    frequency
1. 学習者 (learner)    383    ○
2. 問題解決 (problem solving)    186    ○
3. システム (system)    861    ○
4. 知識 (knowledge)    787    ○
5. 研究 (research)    651    ×
6. 本稿 (this paper)    594    ×
7. 手法 (method)    562    ×
8. 問題 (problem)    561    ○
9. 知識ベース (knowledge base)    149    ○
10. 論文 (paper)    453    ×

Page 31

Top scored 20 terms by MC-value (cont'd)

11. 方法 (method, way to do)    426    ×
12. 支援システム (assistance system)    18    ×
13. 計算機 (computer)    128    ○
14. 情報 (information)    382    ○
15. モデル (model)    356    ○
16. 自然言語 (natural language)    63    ○
17. 我々 (we)    332    ×
18. 有効性 (effectiveness)    160    ×
19. エキスパートシステム (expert system)    78    ○
20. ユーザ (user)    297    ○

Page 32

Precision (complete match) of each method

terms        FGM    MC-value    N1      N2
1~1000       .773   .754        .705    .744
1001~2000    .635   .707        .607    .584
2001~3000    .562   .640        .618    .518

N1, N2: top two systems of NTCIR1
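A table of this kind can be produced from a ranked term list and a gold-standard set with a short sketch like the one below. It assumes that each row reports precision within its own rank band (for example ranks 1001 to 2000 only, not cumulative) and that a complete match means exact string equality with a gold-standard term; both are assumptions of this example, as are the function name and the toy data.

```python
def band_precision(ranked_terms, gold, band_size=1000):
    """Complete-match precision for each consecutive band of ranks."""
    precisions = []
    for start in range(0, len(ranked_terms), band_size):
        band = ranked_terms[start:start + band_size]
        hits = sum(1 for term in band if term in gold)
        precisions.append(hits / len(band))
    return precisions

# Toy data: four ranked candidates, bands of two.
ranked = ["knowledge", "system", "this paper", "learner"]
gold = {"knowledge", "system", "learner"}
print(band_precision(ranked, gold, band_size=2))
```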

Page 33

Precision (partial match) of each method

For each method, the first value is the complete-match precision and the second is the partial-match precision.

terms        FGM (complete / partial)    MC-value (complete / partial)
1~1000       .773 / .948                 .754 / .801
1001~2000    .635 / .951                 .707 / .810
2001~3000    .562 / .941                 .640 / .857

Page 34

Precision of each method when a large number of terms is extracted

terms          FGM    MC-value    N1      N2
1~3000         .657   .704        .643    .615
3001~6000      .495   .513        .499    .449
6001~9000      .470   .416        .254    .460
9001~12000     .408   .362        .284    .438
12001~15000    .330   .344        .337    .311

N1, N2: top two systems of NTCIR1

Page 35

Conclusions -1

New statistical methods for ATR, which are basically based on how many nouns adjoin the single noun in question to form compound nouns.

FGM

・ best at extracting a small number (up to 1,400) of high-quality domain-specific terms

・ longer terms that include correct terms are better extracted by FGM or GM

MC-value

・ strong at extracting a large number (up to 6,000) of domain-specific terms

Page 36

Conclusions -2

The Web is perceived as a gigantic knowledge resource, but it has yet to be fully utilized.

Terminology in each domain is sure to be the gateway to that domain for novices and even for experts.

More readily usable ATR is needed.