Chinese Term Extraction Based on Delimiters

Yuhang Yang, Qin Lu, Tiejun Zhao

School of Computer Science and Technology, Harbin Institute of Technology
Department of Computing, The Hong Kong Polytechnic University

May, 2008



2

Outline

Introduction

Related Works

Methodology

Experiment and Discussion

Conclusion

3

Basic Concepts

Terms (terminology): lexical units of the most fundamental knowledge of a domain
Term extraction
    Term candidate extraction (unithood)
    Terminology verification (termhood)

4

Major Problems

Term boundary identification based on term features
    Fewer features are not enough
    More features lead to more conflicts
Limitation in scope
    Low frequency terms
    Long compound terms
    Dependency on Chinese segmentation

5

Main Idea

Delimiter-based term candidate extraction: identifying the relatively stable and domain-independent words immediately before and after terms
    扫描隧道显微镜是一种基于量子隧道效应的高分辨率显微镜
    (A scanning tunneling microscope is a kind of high-resolution microscope based on the quantum tunneling effect)
    社会主义制度是中华人民共和国的根本制度
    (The socialist system is the basic system of the People's Republic of China)
Potential advantages of the proposed approach
    No strict limits on frequency or word length
    No need for full segmentation
    Relatively domain independent

6

Related works: Statistic-based Measures

Internal measure (Schone and Jurafsky, 2001)
    Internal associative measures between constituents of the candidate characters, such as:
        Frequency
        Mutual information
Contextual measure
    Dependency of candidates on their context:
        The left/right entropy (Sornlertlamvanich et al., 2000)
        The left/right context dependency (Chien, 1999)
        Accessor variety criteria (Feng et al., 2004)
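To make the contextual measures concrete, here is a minimal Python sketch of left/right entropy and a simplified accessor variety for one candidate string. It uses the textbook definitions (Shannon entropy over the neighbouring characters, and the smaller of the two distinct-neighbour counts); the cited papers may differ in details such as how sentence boundaries are counted. The function names and the three-sentence corpus are hypothetical.

```python
import math
from collections import Counter


def neighbor_counts(corpus, candidate):
    """Count the characters immediately left and right of each occurrence."""
    left, right = Counter(), Counter()
    for sentence in corpus:
        start = sentence.find(candidate)
        while start != -1:
            end = start + len(candidate)
            if start > 0:
                left[sentence[start - 1]] += 1
            if end < len(sentence):
                right[sentence[end]] += 1
            start = sentence.find(candidate, start + 1)
    return left, right


def entropy(counts):
    """Shannon entropy (in bits) of a neighbour distribution."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def accessor_variety(left, right):
    """Simplified accessor variety: the smaller number of distinct neighbours.
    Sentence boundaries are ignored here, which is a simplification."""
    return min(len(left), len(right))


# Hypothetical toy corpus, for illustration only.
corpus = ["量子隧道效应的研究", "基于量子隧道效应的显微镜", "利用量子隧道效应进行成像"]
left, right = neighbor_counts(corpus, "量子隧道效应")
print(entropy(left), entropy(right), accessor_variety(left, right))
```

A candidate whose neighbours vary a lot (high entropy, high accessor variety) behaves like an independent unit; with only a handful of occurrences these estimates become unreliable, which is the data sparseness problem raised for low frequency terms.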

7

Hybrid Approaches

The UnitRate algorithm (Chen et al., 2006)
    Occurrence probability + marginal variety probability
The TCE_SEF&CV algorithm (Ji et al., 2007)
    Significance estimation function + C-value measure
Limitations
    Data sparseness for low frequency terms and long terms
    Cascading errors caused by full segmentation

8

Observations

Sentences are constituted by substantives and functional words

Domain specific terms (terms for short) are more likely to be domain substantives

Predecessors and successors of terms are more likely to be functional words or general substantives connecting terms

Predecessors and successors are markers of terms, referred to as term delimiters (or simply delimiters)

9

Delimiter Based Term Extraction

Characteristics of delimiters
    Mainly functional words and general substantives
    Relatively stable
    Domain independent
    Can be extracted more easily
Proposed model
    Identify the features of delimiters
    Identify terms by finding their predecessors and successors as their boundary words

10

Algorithm design

TCE_DI (Term Candidate Extraction – Delimiter Identification)

Input: Corpusextract (domain corpus), DList (delimiter list)
(1) Partition Corpusextract into character strings by punctuation.
(2) Partition the character strings by delimiters to obtain term candidates.
    If a string contains no delimiter, the whole string is regarded as a term candidate.

C1 ... Cib | Ci1 ... Cil | Cia ... Cjb | Cj1 ... Cjm | Cja ... Cn
    TC1    |     D1      |     TC2     |     D2      |    TC3
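The two TCE_DI steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the punctuation set, the longest-delimiter-first matching rule, and the four-word delimiter list in the usage note are all assumptions, since the slides do not specify them.

```python
import re

# Punctuation used to cut the corpus into character strings (step 1).
# This particular set is an assumption; the slides do not list one.
PUNCTUATION = "，。；：！？、（）,.;:!?()"


def tce_di(corpus_text, dlist):
    """Return term candidates from corpus_text using the delimiters in dlist."""
    # Step 1: partition the corpus into character strings by punctuation.
    splitter = "[" + re.escape(PUNCTUATION) + r"\s]+"
    char_strings = [s for s in re.split(splitter, corpus_text) if s]

    delimiters = [d for d in dlist if d]
    if not delimiters:
        # No delimiters at all: every character string is a candidate.
        return char_strings

    # Step 2: partition each character string by delimiters.
    # Longer delimiters are tried first (an assumed tie-breaking rule).
    alternation = "|".join(
        re.escape(d) for d in sorted(delimiters, key=len, reverse=True)
    )
    delimiter_re = re.compile(alternation)

    candidates = []
    for s in char_strings:
        # A string without any delimiter comes back whole from split().
        candidates.extend(piece for piece in delimiter_re.split(s) if piece)
    return candidates
```

With the hypothetical delimiter list `["是", "一种", "基于", "的"]`, calling `tce_di` on 扫描隧道显微镜是一种基于量子隧道效应的高分辨率显微镜 yields the three candidates 扫描隧道显微镜, 量子隧道效应 and 高分辨率显微镜, without segmenting the sentence into words first.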

11

Acquisition of DList

From a given stop word list
    Produced by experts or from a general corpus
    No training is needed
DList_Ext algorithm
    Given a training corpus CorpusD_training and a domain lexicon LexiconDomain

12

The DList_Ext algorithm

S1: For each term Ti in LexiconDomain, mark Ti in CorpusD_training as a lexical unit
S2: Segment the remaining text
S3: Extract the predecessors and successors of all Ti as delimiter candidates
S4: Remove all Ti from the delimiter candidates
S5: Rank the delimiter candidates by frequency
    Use a simple threshold NDI (the number of top-ranked candidates kept as delimiters)
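Steps S1 to S5 can be sketched as follows. This is a hedged illustration rather than the authors' code: the `segment` parameter stands in for a real Chinese word segmenter and defaults to a character-by-character split purely so the example runs on its own, and `dlist_ext`, `n_di` and the toy data in the usage note are hypothetical names.

```python
import re
from collections import Counter


def dlist_ext(sentences, lexicon, n_di, segment=list):
    """Return the n_di most frequent neighbours of lexicon terms.

    segment: callable turning non-term text into a list of words.
    The default (list) splits into single characters, a placeholder
    for a real segmenter.
    """
    terms = {t for t in lexicon if t}
    if not terms:
        return []

    # S1: mark lexicon terms as lexical units (longest terms first).
    term_re = re.compile(
        "(" + "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True)) + ")"
    )

    counts = Counter()
    for sentence in sentences:
        tokens, is_term = [], []
        # With one capturing group, re.split alternates between
        # non-term text (even index) and matched terms (odd index).
        for i, piece in enumerate(term_re.split(sentence)):
            if not piece:
                continue
            if i % 2 == 1:
                tokens.append(piece)
                is_term.append(True)
            else:
                # S2: segment the remaining text.
                for word in segment(piece):
                    tokens.append(word)
                    is_term.append(False)

        # S3: collect predecessors and successors of every marked term.
        for k, flag in enumerate(is_term):
            if not flag:
                continue
            for j in (k - 1, k + 1):
                if 0 <= j < len(tokens):
                    counts[tokens[j]] += 1

    # S4: remove lexicon terms from the delimiter candidates.
    for t in terms:
        counts.pop(t, None)

    # S5: rank by frequency and keep the top n_di candidates.
    return [word for word, _ in counts.most_common(n_di)]
```

For example, with the sentences 量子隧道效应的研究, 基于量子隧道效应的显微镜 and 显微镜的分辨率 and the lexicon 量子隧道效应, 显微镜, 分辨率, the top-ranked delimiter candidate is 的. Passing a real segmenter as `segment` would produce word-level delimiters such as 基于 instead of single characters.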

13

Experiments: Data Preparation

Delimiter lists
    DListIT: extracted by using CorpusIT_Small and LexiconIT
    DListLegal: extracted by using CorpusLegal_Small and LexiconLegal
    DListSW: 494 general stop words

14

Performance Measurements

Evaluation: precision (by sampling) and rate of new term extraction (RNTE)
Reference algorithms
    SEF&C-value (Ji et al., 2007) for term candidate extraction
    TFIDF (Frank et al., 1999) for both term candidate extraction and terminology verification
    LA_TV (Link Analysis based - Terminology Verification) for fair comparison

$R_{NTE} = \frac{N_{New}}{N_{TCList}}$

$precision_{TE} = \frac{N_{Lexicon} + N_{New}}{N_{TCList}}$
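The two measures can be computed as in the sketch below, under an assumed reading of the slide's symbols: precision is the share of extracted candidates that are either lexicon terms or verified new terms, and the rate of new term extraction is the share that are verified new terms. That reading is inferred from the symbol names (TCList, Lexicon, New) and is not confirmed by the slides; the function and argument names are hypothetical, and `verified_new_terms` stands for the result of manual checking.

```python
def evaluate(tc_list, lexicon, verified_new_terms):
    """Return (precision, r_nte) for a list of extracted term candidates.

    tc_list: extracted candidates (the evaluated sample).
    lexicon: set of known domain terms.
    verified_new_terms: candidates judged to be valid terms that are
        not in the lexicon (assumed to come from manual verification).
    """
    n_tclist = len(tc_list)
    if n_tclist == 0:
        return 0.0, 0.0
    n_lexicon = sum(1 for c in tc_list if c in lexicon)
    n_new = sum(
        1 for c in tc_list if c not in lexicon and c in verified_new_terms
    )
    precision = (n_lexicon + n_new) / n_tclist
    r_nte = n_new / n_tclist
    return precision, r_nte
```

Under these assumptions, four candidates of which one is in the lexicon and two are verified new terms give a precision of 0.75 and a new-term rate of 0.5.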

15

Evaluation: DList_Ext algorithm: NDI

Coverage of Delimiters on Different Corpora

Delimiter list         CorpusLegal_Large      CorpusIT_Large
                       (11,048 sentences)     (60,508 sentences)
DListIT (Top100)       77.6%                  89.1%
DListIT (Top300)       84.6%                  92.6%
DListIT (Top500)       90.3%                  93.4%
DListIT (Top700)       92.7%                  93.9%
DListLegal (Top100)    95.8%                  92.6%
DListLegal (Top300)    97.8%                  96.2%
DListLegal (Top500)    98.7%                  96.8%
DListLegal (Top700)    99.1%                  97.1%
DListSW                98.1%                  98.1%

16

Evaluation: DList_Ext algorithm: NDI

Frequency of Delimiters on Domain Corpora

17

Evaluation: DList_Ext algorithm: NDI

Performance of DListIT on CorpusIT_Large

Performance of DListLegal on CorpusIT_Large

18

NDI = 500

Performance of DListIT on CorpusLegal_Large

Performance of DListLegal on CorpusLegal_Large

19

Evaluation on Term Extraction

Performance of Different Algorithms on IT Domain and Legal Domain

20

Performance Analysis

Domain independent and stable delimiters
    Easily extracted and useful
Larger granularity of domain specific terms
    Keeps many noisy strings out
Less frequency sensitivity
    Concentrates on delimiters without regard to the frequencies of the candidates

21

Evaluation on New Term Extraction: RNTE

Performance of Different Algorithms for New Term Extraction

22

Error Analysis

Figure of speech phrases
    “不难看出” (it is not difficult to see that...)
    “新方法中” (in the new methods)
General words
    “思维状态” (mental state)
    “建筑” (architecture)
Long strings which contain short terms
    “访问共享资源” (access shared resources)
    “再次遍历” (traverse again)

23

Conclusion

A delimiter based approach for term candidate extraction
Advantages
    Less sensitivity to term frequency
    Requiring little prior domain knowledge, relatively less adaptation for new domains
    Quite significant improvements for term extraction
    Much better performance for new term extraction
Future works
    Improving overall term extraction algorithms
    Applying to related NLP tasks such as NER
    Applying to other languages

24

Thank You !

Q & A