Terminology extraction system based on vocabulary space
German-Japan NL WS in Sapporo, 2003/7/4. Terminology Extraction System based on Vocabulary Space. Hiroshi Nakagawa, Information Technology Center, The University of Tokyo.
![Page 1: Term inology E xtraction System based on Vocabulary Space](https://reader036.vdocuments.net/reader036/viewer/2022081520/568153ea550346895dc1e71c/html5/thumbnails/1.jpg)
Terminology Extraction System based on Vocabulary Space
Hiroshi Nakagawa
Information Technology Center,
The University of Tokyo
German-Japan NL WS in Sapporo, 2003/7/4
![Page 2: Term inology E xtraction System based on Vocabulary Space](https://reader036.vdocuments.net/reader036/viewer/2022081520/568153ea550346895dc1e71c/html5/thumbnails/2.jpg)
歩留まり (Bu-Domari): Success rate ??
横持ち (literally "side take"): Transportation between a main transportation station (such as an airport or train station) and the destination or starting point.
玉掛け (literally "ball hinge"): To operate a power shovel
Really useful and interesting terminologies
![Page 3: Term inology E xtraction System based on Vocabulary Space](https://reader036.vdocuments.net/reader036/viewer/2022081520/568153ea550346895dc1e71c/html5/thumbnails/3.jpg)
• German
• German-Japan
• German-Japan natural
• German-Japan natural language
• German-Japan natural language processing
• German-Japan natural language processing workshop
• German-Japan natural language processing workshop program
• German-Japan natural language processing workshop program chair
Long Compound Nouns
![Page 4: Term inology E xtraction System based on Vocabulary Space](https://reader036.vdocuments.net/reader036/viewer/2022081520/568153ea550346895dc1e71c/html5/thumbnails/4.jpg)
German-Japan natural language processing workshop program chair and
German-Japan natural language processing workshop program chair and ACL
German-Japan natural language processing workshop program chair and ACL2003
German-Japan natural language processing workshop program chair and ACL2003 general
![Page 5: Term inology E xtraction System based on Vocabulary Space](https://reader036.vdocuments.net/reader036/viewer/2022081520/568153ea550346895dc1e71c/html5/thumbnails/5.jpg)
German-Japan natural language processing workshop program chair and ACL2003 general chair Professor
German-Japan natural language processing workshop program chair and ACL2003 general chair Professor Tsujii
German-Japan natural language processing workshop program chair and ACL2003 general chair Professor Tsujii’s laboratory
![Page 6: Term inology E xtraction System based on Vocabulary Space](https://reader036.vdocuments.net/reader036/viewer/2022081520/568153ea550346895dc1e71c/html5/thumbnails/6.jpg)
German-Japan natural language processing workshop program chair and ACL2003 general chair Professor Tsujii’s laboratory
German-Japan natural language processing workshop program chair and ACL2003 general chair Professor Tsujii’s laboratory chief scientist Dr. Tsuruoka
A long compound noun (NP) is a source of information about terminology.
![Page 7: Term inology E xtraction System based on Vocabulary Space](https://reader036.vdocuments.net/reader036/viewer/2022081520/568153ea550346895dc1e71c/html5/thumbnails/7.jpg)
Objective
An up-to-date domain terminology dictionary is the gateway to various technological and academic fields.
For this, we first need high-quality terminology for the target domain.
Which corpus: an ordinary corpus or Web pages?
![Page 8: Term inology E xtraction System based on Vocabulary Space](https://reader036.vdocuments.net/reader036/viewer/2022081520/568153ea550346895dc1e71c/html5/thumbnails/8.jpg)
Concepts
Methodological classification:
Supervised learning based extraction
I. finding highly influential features
II. surrounding patterns of the target expression
III. technology developed for the NE task
Statistics based extraction (our target)
I. document space based statistics
II. formalisms based on linguistic structure, such as syntactic or semantic structure
III. vocabulary space based statistics (our target)
![Page 9: Term inology E xtraction System based on Vocabulary Space](https://reader036.vdocuments.net/reader036/viewer/2022081520/568153ea550346895dc1e71c/html5/thumbnails/9.jpg)
Document space versus Vocabulary space
[Figure: documents on the Web, each containing term strings such as ab, abc, lmn and xy]
![Page 10: Term inology E xtraction System based on Vocabulary Space](https://reader036.vdocuments.net/reader036/viewer/2022081520/568153ea550346895dc1e71c/html5/thumbnails/10.jpg)
Document space based statistics
Old-fashioned.
Weight term candidates based on their occurrence in document space (a corpus or the Web), and rank them in descending order.
Term frequency or tf*idf for basic nouns.
To extract compound nouns: contingency matrix and co-occurrence based decisions with MI, χ², Dice, etc.
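A minimal sketch of document space scoring, assuming documents are already tokenized into nouns and using the common `tf * log(N / df)` form of tf*idf (the slides do not say which variant is meant). The function name `tf_idf_ranking` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf_ranking(documents):
    """Rank nouns by tf*idf summed over a list of tokenized documents."""
    n_docs = len(documents)
    # Document frequency: in how many documents each noun occurs.
    df = Counter(noun for doc in documents for noun in set(doc))
    scores = Counter()
    for doc in documents:
        for noun, tf in Counter(doc).items():
            # Assumed variant: raw term frequency times log(N / df).
            scores[noun] += tf * math.log(n_docs / df[noun])
    # Descending order of score, as described on the slide.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy corpus of three tokenized documents.
docs = [
    ["trigram", "statistics", "system"],
    ["trigram", "trigram", "system"],
    ["system", "paper"],
]
ranking = tf_idf_ranking(docs)
```

Here "system" occurs in every document, so its idf is zero and it falls to the bottom of the ranking, while "trigram" ranks first.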
![Page 11: Term inology E xtraction System based on Vocabulary Space](https://reader036.vdocuments.net/reader036/viewer/2022081520/568153ea550346895dc1e71c/html5/thumbnails/11.jpg)
Linguistic Structure based method
Syntactic structure
POS patterns like {adj (noun)+}, phrasal verbs, etc.
Semantic structure of compound nouns
Predicate argument structure (e.g. Pustejovsky)
Case frame of predicate
Single and compound nouns are not treated equally.
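A POS pattern such as {adj (noun)+} can be matched with a regular expression over the tag sequence. This is only a sketch: it reads the pattern literally as one adjective followed by one or more nouns, and it assumes the input is already POS-tagged. The tag names `ADJ`/`NOUN` and the function `match_pos_pattern` are hypothetical.

```python
import re

def match_pos_pattern(tagged_tokens):
    """Return word sequences whose tags match ADJ NOUN+."""
    # Encode each tag as one character so a regex can scan the tag sequence.
    codes = {"ADJ": "A", "NOUN": "N"}
    tag_string = "".join(codes.get(tag, "x") for _, tag in tagged_tokens)
    phrases = []
    for m in re.finditer(r"AN+", tag_string):
        # Character offsets in tag_string equal token offsets.
        words = [word for word, _ in tagged_tokens[m.start():m.end()]]
        phrases.append(" ".join(words))
    return phrases

# Hypothetical pre-tagged sentence.
tagged = [
    ("we", "PRON"), ("study", "VERB"),
    ("natural", "ADJ"), ("language", "NOUN"), ("processing", "NOUN"),
    ("and", "CONJ"),
    ("automatic", "ADJ"), ("term", "NOUN"), ("extraction", "NOUN"),
]
phrases = match_pos_pattern(tagged)
```

A pattern of this kind never returns a bare simple noun, which is one way single and compound nouns end up being treated differently.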
![Page 12: Term inology E xtraction System based on Vocabulary Space](https://reader036.vdocuments.net/reader036/viewer/2022081520/568153ea550346895dc1e71c/html5/thumbnails/12.jpg)
Vocabulary space based method
Statistics of vocabulary space such as
Statistics of embedded relation (C-value)
How many compound nouns the target noun makes (LR = our proposal)
Application of link structure analysis of Web pages (PageRank, HITS)
Single and compound nouns are treated equally
![Page 13: Term inology E xtraction System based on Vocabulary Space](https://reader036.vdocuments.net/reader036/viewer/2022081520/568153ea550346895dc1e71c/html5/thumbnails/13.jpg)
Our objective
Experimental analysis and evaluation of various term extraction methods with
Test collection (TMREC) corpus
Web page corpus
Domain dictionaries on the Web or on CD-ROM as gold standard
Term extraction system repository
Gensen Web (言選 Web) http://gensen.dl.itc.u-tokyo.ac.jp/gensenweb_eng.html
Finally: an automatic builder for up-to-date domain term dictionaries
![Page 14: Term inology E xtraction System based on Vocabulary Space](https://reader036.vdocuments.net/reader036/viewer/2022081520/568153ea550346895dc1e71c/html5/thumbnails/14.jpg)
ATR by Compound Noun Statistics
![Page 15: Term inology E xtraction System based on Vocabulary Space](https://reader036.vdocuments.net/reader036/viewer/2022081520/568153ea550346895dc1e71c/html5/thumbnails/15.jpg)
言選 Gensen Web
Automatic term extraction from WEB pages
Step 1. Term candidate extraction
Separate the text at stop words (or use a morphological analyzer) to generate candidates
Step 2. Scoring candidates to rank them
Our scoring mechanism is innovative and unique
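Step 1 can be sketched as follows for whitespace-tokenized English text. This is an illustration only, not the Gensen Web implementation: the stop-word list and the function `extract_candidates` are hypothetical, and Japanese text would need a morphological analyzer instead of `split()`.

```python
def extract_candidates(tokens, stop_words):
    """Split a token sequence at stop words; each remaining run is a candidate."""
    candidates, current = [], []
    for token in tokens:
        if token.lower() in stop_words:
            # A stop word closes the current run of content words.
            if current:
                candidates.append(" ".join(current))
            current = []
        else:
            current.append(token)
    if current:
        candidates.append(" ".join(current))
    return candidates

# Hypothetical stop-word list and input sentence.
STOP_WORDS = {"the", "of", "is", "a", "for", "and", "in", "we", "use"}
text = "we use trigram statistics for the class trigram acquisition"
candidates = extract_candidates(text.split(), STOP_WORDS)
```

Each candidate is a maximal run of non-stop words, so both simple and compound nouns come out of the same pass and can be scored in Step 2.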
![Page 16: Term inology E xtraction System based on Vocabulary Space](https://reader036.vdocuments.net/reader036/viewer/2022081520/568153ea550346895dc1e71c/html5/thumbnails/16.jpg)
Domain Specific Terms
expressing domain concepts
About 85% are compound nouns; about 15% are simple nouns
• Simple noun: cannot be divided further into shorter nouns
• Compound noun: an uninterrupted sequence of simple nouns
Our purpose is to extract domain-specific terms, including compound and simple nouns, from a domain corpus automatically.
![Page 17: Term inology E xtraction System based on Vocabulary Space](https://reader036.vdocuments.net/reader036/viewer/2022081520/568153ea550346895dc1e71c/html5/thumbnails/17.jpg)
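Scoring of Simple Nouns

| freq. of L_i | L_i (left noun, n kinds) | N | R_j (right noun, m kinds) | freq. of R_j |
|---|---|---|---|---|
| 3 | noun | | statistics | 2 |
| 1 | character | trigram | | |
| 1 | class | | acquisition | 1 |

LN(trigram) = 5, n = 3, m = 2, RN(trigram) = 3
Principle: A simple noun that contributes to forming a large number of compound nouns has a high score.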
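The counts LN and RN can be sketched as below. The compound-noun list is an assumption reconstructed from the slide's example (three left neighbours and two right neighbours of "trigram" with the frequencies shown); the function name `left_right_counts` is hypothetical.

```python
from collections import Counter

def left_right_counts(compound_counts):
    """Compute LN(N) and RN(N) from compound-noun frequencies.

    compound_counts maps a tuple of simple nouns to its frequency.
    """
    ln, rn = Counter(), Counter()
    for nouns, freq in compound_counts.items():
        for left, right in zip(nouns, nouns[1:]):
            rn[left] += freq   # `right` adjoins `left` on its right side
            ln[right] += freq  # `left` adjoins `right` on its left side
    return ln, rn

# Assumed compounds matching the slide's example for N = "trigram".
compounds = {
    ("noun", "trigram"): 3,
    ("character", "trigram"): 1,
    ("class", "trigram"): 1,
    ("trigram", "statistics"): 2,
    ("trigram", "acquisition"): 1,
}
ln, rn = left_right_counts(compounds)
print(ln["trigram"], rn["trigram"])  # prints "5 3"
```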
![Page 18: Term inology E xtraction System based on Vocabulary Space](https://reader036.vdocuments.net/reader036/viewer/2022081520/568153ea550346895dc1e71c/html5/thumbnails/18.jpg)
Scoring of compound nouns: GM(Compound Noun)
For a compound noun $CN = N_1 N_2 \cdots N_L$:

$$GM(CN) = \left( \prod_{i=1}^{L} \bigl(LN(N_i)+1\bigr)\bigl(RN(N_i)+1\bigr) \right)^{\frac{1}{2L}}$$

GM(CN) is a geometric mean which does not depend on the length of CN.
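A minimal sketch of the GM score, assuming LN and RN are available as dictionaries and that a noun missing from a dictionary has count 0. The function name `gm` is hypothetical; the values for "trigram" are the ones from the simple-noun example (LN = 5, RN = 3), while LN(statistics) = 2 is an assumed value for illustration.

```python
import math

def gm(compound, ln, rn):
    """Geometric mean of (LN+1)(RN+1) over the simple nouns of a compound."""
    product = math.prod(
        (ln.get(noun, 0) + 1) * (rn.get(noun, 0) + 1) for noun in compound
    )
    # The exponent 1/(2L) normalizes away the length L of the compound.
    return product ** (1.0 / (2 * len(compound)))

ln = {"trigram": 5, "statistics": 2}  # LN(statistics) is an assumed value
rn = {"trigram": 3}

score_simple = gm(("trigram",), ln, rn)                 # ((5+1)*(3+1)) ** (1/2)
score_compound = gm(("trigram", "statistics"), ln, rn)  # (24 * 3) ** (1/4)
```

Because the same function scores a one-noun tuple and a longer tuple, simple and compound nouns are ranked on one scale.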
![Page 19: Term inology E xtraction System based on Vocabulary Space](https://reader036.vdocuments.net/reader036/viewer/2022081520/568153ea550346895dc1e71c/html5/thumbnails/19.jpg)
New scoring function: FGM(CN)
If CN occurs independently, then

$$FGM(CN) = f(CN) \times GM(CN)$$

where f(CN) is the number of independent occurrences of the noun CN (that is, occurrences where CN does not appear as part of a longer CN).
Ex. GM(trigram) = ((5+1) × (3+1))^{1/2} = 4.9; if f(trigram) = 5, then FGM(trigram) = 24.5
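The independent frequency f(CN) can be sketched by counting whole extracted candidates only, so that a noun embedded in a longer candidate is not counted for the shorter one. The occurrence list below is hypothetical; the GM value uses LN(trigram) = 5 and RN(trigram) = 3 from the example.

```python
from collections import Counter

# Hypothetical list of extracted candidates, one entry per occurrence in the text.
# "trigram" inside ("trigram", "statistics") is not an independent occurrence.
occurrences = [("trigram",)] * 5 + [("trigram", "statistics")] * 2

# f(CN): number of times CN was extracted as a whole candidate.
f = Counter(occurrences)

gm_trigram = ((5 + 1) * (3 + 1)) ** 0.5        # GM(trigram), about 4.9
fgm_trigram = f[("trigram",)] * gm_trigram     # FGM = f(CN) * GM(CN)
print(round(fgm_trigram, 1))                   # prints 24.5
```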
![Page 20: Term inology E xtraction System based on Vocabulary Space](https://reader036.vdocuments.net/reader036/viewer/2022081520/568153ea550346895dc1e71c/html5/thumbnails/20.jpg)
Modified C-value
Modify C-value (Frantzi & Ananiadou, 1996) to be able to score a simple noun.
length(a): number of simple nouns constituting a
freq(a): frequency of a
t(a): frequency of candidate compound nouns including a
c(a): number of distinct candidate compound nouns including a

$$MC\text{-}value(a) = length(a) \left( freq(a) - \frac{t(a)}{c(a)} \right)$$

$$C\text{-}value(a) = \bigl(length(a) - 1\bigr) \left( freq(a) - \frac{t(a)}{c(a)} \right)$$
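A sketch of both scores, under two assumptions not stated on the slide: candidates are tuples of simple nouns with given frequencies, and the t(a)/c(a) term is taken as 0 when no longer candidate contains a. The function names and the frequencies are hypothetical.

```python
def nested_stats(term, candidate_freq):
    """Return t(a), c(a): total and distinct counts of longer candidates containing term."""
    n = len(term)
    t = c = 0
    for other, freq in candidate_freq.items():
        contains = any(other[i:i + n] == term for i in range(len(other) - n + 1))
        if len(other) > n and contains:
            t += freq
            c += 1
    return t, c

def c_value(term, candidate_freq, modified=False):
    """C-value, or MC-value when modified=True."""
    t, c = nested_stats(term, candidate_freq)
    penalty = t / c if c else 0.0  # assumption: no penalty for non-nested terms
    length = len(term) if modified else len(term) - 1
    return length * (candidate_freq[term] - penalty)

# Hypothetical candidate frequencies.
freqs = {
    ("trigram",): 10,
    ("trigram", "statistics"): 2,
    ("word", "trigram"): 4,
}
c_simple = c_value(("trigram",), freqs)                  # (1 - 1) * (10 - 6/2) = 0
mc_simple = c_value(("trigram",), freqs, modified=True)  # 1 * (10 - 6/2) = 7
```

With the factor length(a) - 1, every simple noun scores 0 under C-value; replacing it with length(a) is what lets MC-value rank simple nouns.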
![Page 21: Term inology E xtraction System based on Vocabulary Space](https://reader036.vdocuments.net/reader036/viewer/2022081520/568153ea550346895dc1e71c/html5/thumbnails/21.jpg)
Experimental Evaluations
The corpus is a manually POS-tagged Japanese corpus, and the gold standard is a set of manually extracted terms developed for the NTCIR-1 TMREC task
(Artificial Intelligence field: 1,870 paper abstracts).
The gold standard consists of 8,843 manually extracted domain-specific terms.
The data used in our experiment were developed by NII.
![Page 22: Term inology E xtraction System based on Vocabulary Space](https://reader036.vdocuments.net/reader036/viewer/2022081520/568153ea550346895dc1e71c/html5/thumbnails/22.jpg)
[Figure: Complete and partial match by GM (baseline). x-axis: number of automatically extracted terms (0 to 3000); y-axis: number of extracted terms that coincide with the gold standard (0 to 3500). Series: complete match; partial match (contained).]
![Page 23: Term inology E xtraction System based on Vocabulary Space](https://reader036.vdocuments.net/reader036/viewer/2022081520/568153ea550346895dc1e71c/html5/thumbnails/23.jpg)
[Figure: Number of completely matched terms by FGM and MC-value. x-axis: number of automatically extracted terms; y-axis: difference from the GM baseline in the number of extracted terms that coincide with the gold standard (-50 to 350). Series: FGM - GM; MC-value - GM.]
![Page 24: Term inology E xtraction System based on Vocabulary Space](https://reader036.vdocuments.net/reader036/viewer/2022081520/568153ea550346895dc1e71c/html5/thumbnails/24.jpg)
[Figure: Number of partially matched terms by FGM and MC-value. x-axis: number of automatically extracted terms (0 to 2800); y-axis: difference from the GM baseline (-450 to 0). Series: FGM - GM; MC-value - GM.]
![Page 25: Term inology E xtraction System based on Vocabulary Space](https://reader036.vdocuments.net/reader036/viewer/2022081520/568153ea550346895dc1e71c/html5/thumbnails/25.jpg)
[Figure: Average length (every 100 terms) of extracted terms. x-axis: number of automatically extracted terms (0 to 3000); y-axis: average length (0 to 5). Series: GM; FGM; MC-value.]
![Page 26: Term inology E xtraction System based on Vocabulary Space](https://reader036.vdocuments.net/reader036/viewer/2022081520/568153ea550346895dc1e71c/html5/thumbnails/26.jpg)
Top scored 20 terms by GM
candidate terms, frequency
1. 知識 (knowledge) 787 ○
2. 学習知識 (learning knowledge) 1 ○
3. 学習 (learning) 255 ○
4. 言語的知識 (linguistic knowledge) 2 ○
5. 知識システム (knowledge system) 14 ○
6. 学習システム (learning system) 16 ○
7. 問題知識 (problem knowledge) 3 ×
8. 学習問題 (learning problem) 5 ○
9. 言語的 (linguistic) 1 ○
10. システム (system) 861 ○
![Page 27: Term inology E xtraction System based on Vocabulary Space](https://reader036.vdocuments.net/reader036/viewer/2022081520/568153ea550346895dc1e71c/html5/thumbnails/27.jpg)
Top scored 20 terms by GM (cont'd)
11. 問題 (problem) 561 ○
12. 論理的知識 (logical knowledge) 1 ○
13. 学習支援システム (learning assistance system) 3 ○
14. 設計知識 (design knowledge) 29 ○
15. 学習問題解決システム (learning problem solver) 1 ○
16. 学習支援 (learning assistance) 9 ○
17. 言語的情報 (linguistic information) 3 ○
18. 知識モデル (knowledge model) 3 ○
19. 設計システム (design system) 6 ○
20. システム設計 (system design) 1 ○
![Page 28: Term inology E xtraction System based on Vocabulary Space](https://reader036.vdocuments.net/reader036/viewer/2022081520/568153ea550346895dc1e71c/html5/thumbnails/28.jpg)
Top scored 20 terms by FGM
candidate terms, frequency
1. 知識 (knowledge) 787 ○
2. システム (system) 861 ○
3. 問題 (problem) 561 ○
4. 学習 (learning) 255 ○
5. 学習者 (learner) 383 ○
6. モデル (model) 356 ○
7. 情報 (information) 382 ○
8. 問題解決 (problem solving) 186 ○
9. 設計 (design) 183 ○
10. 知識ベース (knowledge base) 149 ○
![Page 29: Term inology E xtraction System based on Vocabulary Space](https://reader036.vdocuments.net/reader036/viewer/2022081520/568153ea550346895dc1e71c/html5/thumbnails/29.jpg)
Top scored 20 terms by FGM (cont'd)
11. 推論 (inference) 162 ○
12. 支援 (assistance) 87 ×
13. 知識表現 (knowledge representation) 74 ○
14. エージェント (agent) 256 ○
15. 学習者モデル (learner’s model) 57 ○
16. 機能 (function) 294 ×
17. 設計者 (designer) 69 ○
18. 対話 (dialogue) 205 ○
19. 言語 (language) 75 ○
20. 対象 (object) 293 ○
![Page 30: Term inology E xtraction System based on Vocabulary Space](https://reader036.vdocuments.net/reader036/viewer/2022081520/568153ea550346895dc1e71c/html5/thumbnails/30.jpg)
Top scored 20 terms by MC-value
candidate terms, frequency
1. 学習者 (learner) 383 ○
2. 問題解決 (problem solving) 186 ○
3. システム (system) 861 ○
4. 知識 (knowledge) 787 ○
5. 研究 (research) 651 ×
6. 本稿 (this paper) 594 ×
7. 手法 (method) 562 ×
8. 問題 (problem) 561 ○
9. 知識ベース (knowledge base) 149 ○
10. 論文 (paper) 453 ×
![Page 31: Term inology E xtraction System based on Vocabulary Space](https://reader036.vdocuments.net/reader036/viewer/2022081520/568153ea550346895dc1e71c/html5/thumbnails/31.jpg)
Top scored 20 terms by MC-value (cont'd)
11. 方法 (method, way to do) 426 ×
12. 支援システム (assistance system) 18 ×
13. 計算機 (computer) 128 ○
14. 情報 (information) 382 ○
15. モデル (model) 356 ○
16. 自然言語 (natural language) 63 ○
17. 我々 (we) 332 ×
18. 有効性 (effectiveness) 160 ×
19. エキスパートシステム (expert system) 78 ○
20. ユーザ (user) 297 ○
![Page 32: Term inology E xtraction System based on Vocabulary Space](https://reader036.vdocuments.net/reader036/viewer/2022081520/568153ea550346895dc1e71c/html5/thumbnails/32.jpg)
Precision (complete match) of each method

| terms | FGM | MC-value | N1 | N2 |
|---|---|---|---|---|
| 1~1000 | .773 | .754 | .705 | .744 |
| 1001~2000 | .635 | .707 | .607 | .584 |
| 2001~3000 | .562 | .640 | .618 | .518 |

N1, N2: top two systems of NTCIR1
![Page 33: Term inology E xtraction System based on Vocabulary Space](https://reader036.vdocuments.net/reader036/viewer/2022081520/568153ea550346895dc1e71c/html5/thumbnails/33.jpg)
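Precision (partial match) of each method

| terms | FGM (complete) | FGM (partial) | MC-value (complete) | MC-value (partial) |
|---|---|---|---|---|
| 1~1000 | .773 | .948 | .754 | .801 |
| 1001~2000 | .635 | .951 | .707 | .810 |
| 2001~3000 | .562 | .941 | .640 | .857 |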
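Precision per rank band can be sketched as below. This is an illustration, not the NTCIR evaluation script: it assumes a partial match means that the extracted term contains some gold-standard term as a substring, and the function name `precision_by_band`, the gold set and the ranked list are all hypothetical.

```python
def precision_by_band(ranked_terms, gold, band_size):
    """Return (complete, partial) precision for each consecutive band of ranks."""
    results = []
    for start in range(0, len(ranked_terms), band_size):
        band = ranked_terms[start:start + band_size]
        # Complete match: the extracted term is itself a gold-standard term.
        complete = sum(term in gold for term in band)
        # Assumed partial match: the extracted term contains a gold-standard term.
        partial = sum(any(g in term for g in gold) for term in band)
        results.append((complete / len(band), partial / len(band)))
    return results

# Hypothetical gold standard and ranked system output.
gold = {"knowledge", "knowledge base", "learning"}
ranked = ["knowledge", "learning system", "paper", "knowledge base"]
bands = precision_by_band(ranked, gold, band_size=2)
```

Under this definition every complete match is also a partial match, so partial precision is never lower than complete precision within a band.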
![Page 34: Term inology E xtraction System based on Vocabulary Space](https://reader036.vdocuments.net/reader036/viewer/2022081520/568153ea550346895dc1e71c/html5/thumbnails/34.jpg)
Precision of each method when a large number of terms is extracted

| terms | FGM | MC-value | N1 | N2 |
|---|---|---|---|---|
| 1~3000 | .657 | .704 | .643 | .615 |
| 3001~6000 | .495 | .513 | .499 | .449 |
| 6001~9000 | .470 | .416 | .254 | .460 |
| 9001~12000 | .408 | .362 | .284 | .438 |
| 12001~15000 | .330 | .344 | .337 | .311 |

N1, N2: top two systems of NTCIR1
![Page 35: Term inology E xtraction System based on Vocabulary Space](https://reader036.vdocuments.net/reader036/viewer/2022081520/568153ea550346895dc1e71c/html5/thumbnails/35.jpg)
Conclusions -1
New statistical methods for ATR, which are basically based on how many nouns adjoin the simple noun in question to form compound nouns.
FGM
・Best at extracting a small number (up to 1,400) of high-quality domain-specific terms
・Longer terms that include correct terms are better extracted by FGM or GM
MC-value
・Strong at extracting a large number (up to 6,000) of domain-specific terms
![Page 36: Term inology E xtraction System based on Vocabulary Space](https://reader036.vdocuments.net/reader036/viewer/2022081520/568153ea550346895dc1e71c/html5/thumbnails/36.jpg)
Conclusions -2
The Web is perceived as a gigantic knowledge resource, but it has yet to be fully utilized.
Terminology in each domain is sure to be the gateway to that domain for novices and even for experts.
More readily usable ATR is needed.