
July 4, 2001 Chinese Information Extraction by Tianfang Yao

1

Chinese Information Extraction

Tianfang Yao

Department of Computer Science and Engineering

Shanghai Jiao Tong University

1954 Hua Shan Road

Shanghai 200030, China


2

Outline

• Introduction

• Word Segmentation

• Named Entity Extraction

• Entity Relation Extraction

• Conclusion


3

Introduction (1)

• Chinese Language

• Difficulties in Chinese NLP

• State-of-the-Art for Chinese Information Extraction


4

Introduction (2)

Chinese Language

• Chinese is typologically different from English or German.

• It has a large character set of about 44,908 characters.

• Although Chinese has a history of more than 6,000 years, a complete standard grammar of Chinese has still not been established.


5

Introduction (3)

Chinese Language

• The form of a Chinese character is related to its meaning. The script combines pictographs, e.g. 日 (sun) and 月 (moon); self-explanatory characters, e.g. 上 (above) and 下 (below); and associative compounds, e.g. 信 (believe), which is made up of 人 (man) and 言 (word) and means a message or something that can be believed or trusted.

• Chinese has many homophones, e.g. 锣 (gong), 螺 (spiral shell), 骡 (mule), 箩 (bamboo basket), etc.

• A Chinese word can be split or expanded, and its order can be changed, e.g. 吃饭 (take a meal) vs. 吃了一顿饭; 理发了 (had a haircut) vs. 发理了.


6

Introduction (4)

Difficulties in Chinese NLP

• Because there are no spaces between the characters in a Chinese sentence, the sentence must be segmented into words before its structure can be analyzed.

• Chinese characters have no inflection, so semantic structure is more important than syntactic structure for understanding Chinese sentences.

• The combination of Chinese words is flexible, changeable, succinct and implicit. Sometimes constituents are omitted from the sentence.

• A Chinese sentence sometimes contains runs of consecutive nouns or consecutive verbs.


7

Introduction (5)

State-of-the-Art for Chinese Information Extraction

• Knowledge engineering approaches

• Automatically trainable approaches

• Statistical approaches

• Hybrid approaches


8

Word Segmentation (1)

Research of Automatic Chinese Word Segmentation (Kaiying Liu. Computer Science Department, Shan Xi University, China)

1. Definitions

Definition 1: Ambiguous Phrase of Overlap Type

Assume that AJB is a character string and W is a word list. If AJ ∈ W and JB ∈ W, then AJB is called an ambiguous phrase of overlap type.

e.g. In the string “当代表” (act as a delegate), both “当代” (of our time) and “代表” (delegate) are words, so this string is an ambiguous phrase of overlap type.


9

Word Segmentation (2)

Definition 2: Chain Length

The number of ambiguous strings in a phrase is called its chain length.

e.g. There is one ambiguous string in the string “当代表”, so the chain length is 1.

Definition 3: Ambiguous Phrase of Combination Type

Assume that AB is a character string and W is a word list. If A ∈ W, B ∈ W and AB ∈ W, then AB is called an ambiguous phrase of combination type.

e.g. In the string “个人” (individual), “个” (quantifier), “人” (man) and “个人” are all words, so this string is an ambiguous phrase of combination type.
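Definitions 1 and 3 can be checked mechanically against a word list. The following is a minimal sketch, assuming the word list W is a Python set of strings; the function names are hypothetical and not taken from the original system.

```python
def overlap_ambiguities(text, words):
    """Return every split (A, J, B) of `text` where AJ and JB are both words."""
    found = []
    n = len(text)
    for i in range(1, n):            # A = text[:i]
        for j in range(i + 1, n):    # J = text[i:j], B = text[j:]
            a, j_part, b = text[:i], text[i:j], text[j:]
            if a + j_part in words and j_part + b in words:
                found.append((a, j_part, b))
    return found


def is_combination_ambiguity(text, words):
    """True if text = AB where A, B and AB are all words."""
    if text not in words:
        return False
    return any(text[:i] in words and text[i:] in words
               for i in range(1, len(text)))


# Toy word list built from the examples on the slides.
W = {"当代", "代表", "个", "人", "个人"}
print(overlap_ambiguities("当代表", W))
print(is_combination_ambiguity("个人", W))
```

The overlap check tries every way to cut the string into three non-empty parts, which is affordable because ambiguous phrases are short.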


10


2. Build the ambiguous phrase libraries

• 78,000 phrases for overlap type

• More than 3,000 phrases for combination type

• Statistical results for overlap type:

• Their chain lengths are mostly 1 or 2, about 95% of all.

• Among the ambiguous phrases like “ABCD” with a chain length of 2, 98% can be segmented into “AB|CD”.

• The segmentation of about 82% of the ambiguous phrases like “ABCDE” with a chain length of 3 depends on the leftmost three characters “ABC”.

• False ambiguous phrase: 94%

• Real ambiguous phrase: 6%


11


• False ambiguous phrase: It actually has only one segmentation result in real texts. e.g. “挨|批评” (be given a criticism)

• Real ambiguous phrase: It has more than one applicable segmentation result.

Case 1: The results have almost equal occurrence probabilities.

e.g. “应用于” (apply to) can be segmented into “应用|于” (apply … to …) or “应|用于” (should be used in …).

Case 2: The phrase is mostly segmented into only one result in real texts.

e.g. “解除了” (have dismissed) should mostly be segmented into “解除|了”.


12

Word Segmentation (5)

3. Approaches for segmenting ambiguous phrases of overlap type

• Statistics based approach

Build the wording capacity library: it holds frequency information for each ambiguous phrase “AJB” with a chain length of 1, that is, the frequencies with which its parts form words: FreqLeft(AJ), FreqRight(B), FreqLeft(A) and FreqRight(JB).

Rule 1: If FreqLeft(AJ) + FreqRight(B) > FreqLeft(A) + FreqRight(JB), “AJB” is segmented into “AJ|B”; otherwise into “A|JB”.
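Rule 1 is a direct comparison of two frequency sums. The following is a minimal sketch, assuming the wording capacity library is available as two dictionaries (`freq_left`, `freq_right`) mapping strings to counts; both names and all counts in the usage example are invented for illustration.

```python
def segment_chain1(a, j, b, freq_left, freq_right):
    """Apply Rule 1 to an overlap-type phrase AJB with chain length 1."""
    # Evidence for AJ|B: AJ as a left word plus B as a right word.
    score_aj_b = freq_left.get(a + j, 0) + freq_right.get(b, 0)
    # Evidence for A|JB: A as a left word plus JB as a right word.
    score_a_jb = freq_left.get(a, 0) + freq_right.get(j + b, 0)
    if score_aj_b > score_a_jb:
        return [a + j, b]
    return [a, j + b]


# Invented counts, only to show the call.
freq_left = {"当代": 2, "当": 9}
freq_right = {"表": 1, "代表": 8}
print(segment_chain1("当", "代", "表", freq_left, freq_right))
```

Missing entries are treated as zero counts here, which is a design choice of this sketch rather than something the slides specify.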


13

Word Segmentation (6)

(Depending on the statistical results for the ambiguous phrase library)

Rule 2: An ambiguous phrase with a chain length of 2, like “ABCD”, is segmented into “AB|CD”.

Rule 3: An ambiguous phrase with a chain length of 3, like “ABCDE”, is first segmented into “ABC|DE”; then the leading part “ABC” is segmented as an ambiguous phrase with a chain length of 1.

Rule 4: An ambiguous phrase with a chain length of 4, like “ABCDEF”, is segmented into “AB|CD|EF”.
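Rules 2 to 4 dispatch on the chain length and fall back to Rule 1 for the leading part of a length-3 chain. The following is a minimal sketch, assuming every letter in “ABCD…” stands for a single character (so a phrase has chain length + 2 characters) and that the chain-length-1 rule is supplied by the caller as `segment_chain1`, a hypothetical callable taking a three-character string.

```python
def segment_by_chain_length(phrase, chain_length, segment_chain1):
    """Segment an overlap-type phrase according to Rules 1-4."""
    if len(phrase) != chain_length + 2:
        raise ValueError("sketch assumes one character per letter")
    if chain_length == 1:
        return segment_chain1(phrase)
    if chain_length == 2:                       # Rule 2: AB|CD
        return [phrase[:2], phrase[2:]]
    if chain_length == 3:                       # Rule 3: ABC|DE, then split ABC
        return segment_chain1(phrase[:3]) + [phrase[3:]]
    if chain_length == 4:                       # Rule 4: AB|CD|EF
        return [phrase[:2], phrase[2:4], phrase[4:]]
    raise ValueError("no rule given for this chain length")


# Stand-in for Rule 1 that always prefers AJ|B, only for demonstration.
prefer_left = lambda s: [s[:2], s[2:]]
print(segment_by_chain_length("ABCDE", 3, prefer_left))
```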


14

Word Segmentation (7)

• Rules based approach

Rule 1: If an ambiguous phrase contains a directional verb whose preceding word is a verb, the directional verb is segmented as a separate word. e.g. “真正体现出” (really embody) should be segmented into “真正|体现|出”, because “出” (come up) is a directional verb and “体现” is a verb.

Rule 2: If the first character of an ambiguous phrase is a quantifier and the word preceding the phrase is a numeral, the quantifier is segmented as a separate word. e.g. “65层高楼” (a high building of 65 stories) should be segmented into “65|层|高楼”, because “层” is a quantifier and “65” is a numeral.


15

Word Segmentation (8)

4. Approaches for segmenting ambiguous phrases of combination type

• Statistics based approach

Among all ambiguous phrases, 30% usually have only one segmentation result. Therefore, a library of 133 phrases is built. The structure of the database is as follows:

FIELD NAME   TYPE     LENGTH   EXPLANATION
word         char     4        AB
nh           number   3        the number of times segmented into AB
nf           number   3        the number of times segmented into A|B

Assume freq = nh/(nh+nf), with thresholds α1 and α2, where α1 > α2. If freq > α1, “AB” is kept as “AB”; if freq < α2, it is segmented into “A|B”.
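The threshold test on freq = nh/(nh+nf) can be sketched as below. The threshold values 0.9 and 0.1 are placeholders, since the slides do not give α1 and α2, and returning `None` for the undecided band between the thresholds is an assumption of this sketch. The split point assumes AB is a two-character word, as the 4-byte field length suggests.

```python
def segment_combination(word, nh, nf, alpha1=0.9, alpha2=0.1):
    """Decide between AB and A|B from library counts nh and nf.

    Returns a list of segments, or None when the counts do not decide.
    """
    total = nh + nf
    if total == 0:
        return None                     # phrase never observed
    freq = nh / total
    if freq > alpha1:
        return [word]                   # keep as AB
    if freq < alpha2:
        return [word[:1], word[1:]]     # segment into A|B
    return None                         # left to another strategy


print(segment_combination("个人", nh=95, nf=5))
```

Phrases that fall between the two thresholds would need another clue, such as the part of speech of the neighbouring words.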


16

Word Segmentation (9)

• POS rule based approach

The segmentation of a word depends on the POS of its context words. If the word preceding “AB” is a numeral, “AB” is segmented into “A|B”; otherwise it is kept as “AB”.

e.g. In the sentence “他一个人睡在屋里” (He sleeps in his room by himself), AB = 个人. Because “一” is a numeral, “个人” should be segmented into “个|人”. But in the phrase “农民个人利益” (the individual interests of the peasantry), “个人” should not be segmented.
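The POS rule reduces to one test on the tag of the preceding word. The following is a minimal sketch, assuming a tagger that labels numerals with the string `"numeral"`; the tag names are hypothetical.

```python
def segment_by_pos(prev_pos, word):
    """Split a two-character word AB into A|B only after a numeral."""
    if prev_pos == "numeral":
        return [word[:1], word[1:]]
    return [word]


print(segment_by_pos("numeral", "个人"))  # context: 一 个人
print(segment_by_pos("noun", "个人"))     # context: 农民 个人
```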


17

Word Segmentation (10)

5. System architecture

[Figure: system architecture. Processing stages: Text Pre-Processing, Word Matching, Ambiguous Segmentation (Ambiguous Segmentation of Overlap Type, Ambiguous Segmentation of Combination Type). Resources: Basic Lexicon, Special Lexicon, User’s Lexicon, Wording Capacity Lib., Rule Set of Ambiguity of Overlap Type, Ambiguous Phrase Lib. of Combination Type, Rule Set of Ambiguity of Combination Type.]


18


6. System test results

The system has been tested on a corpus randomly chosen from Beijing Youth, which contains 607 ambiguous phrases of overlap type and 2292 ambiguous phrases of combination type. The precisions are 97% and 87% respectively.


19

Named Entity Extraction (1)

Description of the NTU System used for MET2

(Hsin-Hsi Chen et al., Natural Language Processing Lab., Department of Computer Science and Information Engineering, National Taiwan University)

• Processing Steps of Named Entity Extraction

(1) Transform Chinese texts in GB codes into texts in Big-5 codes

(2) Segment Chinese texts into a sequence of tokens

(3) Identify named people

(4) Identify named organizations

(5) Identify named locations

(6) Use n-gram model to identify named organizations/locations

(7) Identify the rest of named expressions

(8) Transform the results in Big-5 codes into the results in GB codes


20

Named Entity Extraction (2)

(1) Transform Chinese texts in GB codes into texts in Big-5 codes

The GB code is an internal code of the simplified Chinese character set, which is used in mainland China. Big-5, on the other hand, is an internal code of the traditional Chinese character set, which is used in Taiwan and Hong Kong.

e.g. simplified Chinese text vs. traditional Chinese text:

Simplified   Traditional   Meaning
人工智能     人工智慧      Artificial Intelligence
软件         軟體          Software
报道         報導          Report
新西兰       紐西蘭        New Zealand

The NTU System is designed for traditional Chinese text, and the test texts in MET2 are in GB code, so the test texts must be transformed from GB code into Big-5 code. This mapping is not always one-to-one; sometimes it is one-to-many.
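The one-to-many problem can be made concrete with a tiny character table. The following is a minimal sketch, assuming a hand-written simplified-to-traditional mapping (`S2T`, hypothetical and far from complete); it only enumerates the candidate conversions and does not choose among them.

```python
from itertools import product

# Hypothetical excerpt of a mapping table: 发 corresponds to either
# 發 (emit) or 髮 (hair) in traditional script.
S2T = {
    "发": ["發", "髮"],
    "头": ["頭"],
}


def traditional_candidates(text):
    """List every traditional rendering allowed by the mapping table."""
    options = [S2T.get(ch, [ch]) for ch in text]   # unmapped characters pass through
    return ["".join(combo) for combo in product(*options)]


print(traditional_candidates("理发"))
```

Choosing among the candidates needs context, for example a dictionary lookup on the surrounding word.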


21

Named Entity Extraction (3)

(2) Segment Chinese texts into a sequence of tokens

List all possible words by dictionary look-up, and then resolve ambiguities by segmentation strategies. The dictionary is trained from the CKIP corpus, whose articles are collected from Taiwan newspapers, magazines, etc.

(3) Identify named people

• Chinese person names

Most Han Chinese surnames are a single character, but some are two characters.

Most given names are two characters, but some are a single character.

Theoretically, every character can be used in a name. Thus the length of Chinese names ranges from 2 to 6 characters.

Three kinds of recognition strategies are adopted:

• Named-formulation rules

• Context clues, e.g., titles, positions, speech-act verbs, etc.

• Cache


22

Named Entity Extraction (4)

• Named-formulation rules

They are trained from a person name corpus in Taiwan, which contains 1 million Chinese names. Each entry contains the surname, given name and sex.

Possible candidates:

Model 1. Single character for surname

P(C1)*P(C2)*P(C3) using male (female) training table > threshold1(3) and

P(C2)*P(C3) using male (female) training table > threshold2(4)

Model 2. Two characters for surname

P(C2)*P(C3) using male (female) training table > threshold2(4)

Model 3. Two surnames together

P(C12)*P(C2)*P(C3) using female training table > threshold3

P(C2)*P(C3) using female training table > threshold4 and

P(C12)*P(C2)*P(C3) using female training table > P(C12)*P(C2)*P(C3) using male training table
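Model 1 can be sketched as two threshold tests per training table. This is only an illustration under several assumptions: each table is taken to be a dictionary from (position, character) to a probability, a candidate is accepted if it passes under either the male or the female table, and all probabilities and thresholds in the usage example are invented.

```python
def passes_model1(chars, table, full_threshold, given_threshold):
    """Check both Model 1 conditions for a three-character candidate."""
    c1, c2, c3 = chars
    p1 = table.get((1, c1), 0.0)   # surname character
    p2 = table.get((2, c2), 0.0)   # first given-name character
    p3 = table.get((3, c3), 0.0)   # second given-name character
    return p1 * p2 * p3 > full_threshold and p2 * p3 > given_threshold


def is_name_candidate(chars, male, female, t1, t2, t3, t4):
    """Male table uses thresholds 1 and 2; female table uses 3 and 4."""
    return (passes_model1(chars, male, t1, t2)
            or passes_model1(chars, female, t3, t4))


# Invented numbers, only to show the call.
male = {(1, "王"): 0.07, (2, "小"): 0.02, (3, "明"): 0.03}
print(is_name_candidate("王小明", male, {}, 1e-6, 1e-5, 1e-6, 1e-5))
```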


23

Named Entity Extraction (5)

• Context clues, e.g., titles, positions, speech-act verbs, etc.

Titles: 博士 (Dr.); 教授 (Prof.); 女士 (Mrs./Ms.); 小姐 (Miss); 先生 (Mr.)

Positions: 总统 (President); 导演 (Director); 总经理 (General Manager)

Speech-act verbs: 发言 (speak); 说 (say); 提出 (bring up)

• Cache

The cache provides a global clue, because a person name may appear more than once in a document. The cache is used to store the identified candidates. There are four cases when the cache is used:

(1) C1C2C3 and C1C2C4 are in the cache, and C1C2 is correct.

(2) C1C2C3 and C1C2C4 are in the cache, and both are correct.

(3) C1C2C3 and C1C2 are in the cache, and C1C2C3 is correct.

(4) C1C2C3 and C1C2 are in the cache, and C1C2 is correct.


24

Named Entity Extraction (6)

• Transliterated person names

Transliterated person names denote foreigners. The length of transliterated person names is not restricted to 2 to 6 characters.

Main strategies:

• Transliterated name set

The transliterated names trained from MET data are regarded as a built-in name set.

• Character condition

Two special character sets are retrieved from MET training data. The first character of a name must belong to a 280-character set, and the remaining characters must appear in a 411-character set. The character condition is a loose restriction and should be employed together with other clues.

• Titles

The titles used for Chinese person names are also applicable to transliterated person names.

• Name introducers

e.g. 叫 (be called), 名叫 (Her/His name is …), 尊称 (respectfully call sb. …)

• Special verbs

e.g. 发表 (issue/express/deliver), 暗示 (hint/imply)


25

Named Entity Extraction (7)

(4) Identify named organizations

The structure of organization names is more complex than that of person names. Basically, a complete organization name can be divided into a name and a keyword, such as:

names: 联合国 (UN), 美国 (USA), 罗伯逊 (Robertson)

keywords: 部队 (Army), 大使馆 (Embassy), 基金会 (Foundation)

There are some rules to recognize organization names:

OrganizationName -> OrganizationName + OrganizationNameKeyword

OrganizationName -> CountryName + OrganizationNameKeyword

OrganizationName -> PersonName + OrganizationNameKeyword

OrganizationName -> CountryName + {D|DD} + OrganizationNameKeyword

OrganizationName -> PersonName + {D|DD} + OrganizationNameKeyword

OrganizationName -> LocationName + {D|DD} + OrganizationNameKeyword

OrganizationName -> CountryName + OrganizationName

OrganizationName -> LocationName + OrganizationName

where D is a content word, such as 国际 (International), 文教 (culture and education), etc.
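The organization-name rules can be checked against a sequence of category tags. The following is a minimal sketch, assuming the tokens have already been tagged with the category names used in the rules and reading the braces as "one or two content words D" (an interpretation, not stated on the slide). `ORG_RULES` and `is_organization_name` are hypothetical names.

```python
KW = "OrganizationNameKeyword"

# Right-hand sides of the rules, with the one-or-two-D option expanded.
ORG_RULES = {
    ("OrganizationName", KW),
    ("CountryName", KW),
    ("PersonName", KW),
    ("CountryName", "D", KW), ("CountryName", "D", "D", KW),
    ("PersonName", "D", KW), ("PersonName", "D", "D", KW),
    ("LocationName", "D", KW), ("LocationName", "D", "D", KW),
    ("CountryName", "OrganizationName"),
    ("LocationName", "OrganizationName"),
}


def is_organization_name(tags):
    """True if the tag sequence matches one of the rules exactly."""
    return tuple(tags) in ORG_RULES


# 美国 (CountryName) + 大使馆 (keyword)
print(is_organization_name(["CountryName", KW]))
```

Because the first rule is recursive, a full recognizer would re-tag a matched span as OrganizationName and apply the rules again; the sketch shows a single matching step only.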


26

Named Entity Extraction (8)

(5) Identify named locations

The structure of location names is similar to that of organization names. The rules are like:

LocationName -> PersonName + LocationNameKeyword

LocationName -> LocationName + LocationNameKeyword

The following are some examples of location keywords:

山 (mountain); 中心 (center); 公路 (highway); 以北 (to the north of …); 市 (city)

Other strategies for recognizing location names without keywords:

• Locative verbs: 来自 (come from …); 前往 (go to …)

• Cache

• N-gram model: employ multiple occurrences to find a pattern


27

Named Entity Extraction (9)

(6) Use n-gram model to identify named organizations/locations

Although the cache mechanism and the n-gram model use the same feature, i.e., multiple occurrences, their concepts are totally different. For organization names, it is unclear when a pattern should be put into the cache, because its left boundary is hard to determine.

In the model, the patterns are selected to meet the following criteria:

• It must consist of a name and an organization name keyword

• Its length must be greater than two words

• It must not cross a sentence boundary or any punctuation mark

• It must occur at least twice
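The four selection criteria can be sketched as a counting pass over segmented sentences. This is an illustration under assumptions: input is a list of token lists, a pattern is read as any word n-gram that ends in an organization keyword, and the `max_len` cap and the punctuation set are choices of this sketch, not of the NTU System.

```python
from collections import Counter

PUNCT = set("，。、；：？！,.;:?!")


def candidate_patterns(sentences, keywords, max_len=5):
    """Return n-grams ending in a keyword that satisfy all four criteria."""
    counts = Counter()
    for tokens in sentences:
        segment = []
        for tok in list(tokens) + ["。"]:        # sentinel flushes the last segment
            if tok not in PUNCT:
                segment.append(tok)
                continue
            # Patterns never cross punctuation: work inside one segment.
            for end, word in enumerate(segment):
                if word in keywords:
                    lowest = max(0, end - max_len + 1)
                    for start in range(lowest, end - 1):   # more than two words
                        counts[tuple(segment[start:end + 1])] += 1
            segment = []
    return {pattern for pattern, n in counts.items() if n >= 2}


sentences = [
    ["他", "在", "亚洲", "开发", "银行", "工作", "。"],
    ["亚洲", "开发", "银行", "，", "宣布"],
]
print(candidate_patterns(sentences, {"银行"}))
```

Counting every left boundary and keeping only repeated patterns is what lets multiple occurrences settle the left boundary.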


28

Named Entity Extraction (10)

(7) Identify the rest of named expressions

The rule based approach is used for the following named expressions:

• Date expressions

DATE->NUMBER+YEAR

DATE->NUMBER+MTHUNIT

• Time expressions

TIME->NUMBER+HUNIT

TIME->TIME+BSTATE

• Monetary expressions

DMONEY->MOUNIT+NUMBER+MOUNIT

DMONEY->NUMBER+MOUNIT

• Percentage expressions

DPERCENT->PERCENT+NUMBER

DPERCENT->NUMBER+PERCENT
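A few of these rules can be sketched as regular expressions. The concrete terminals below (年 for YEAR, 月 for MTHUNIT, 时/点 for HUNIT, 元/美元 for MOUNIT, 百分之 and % for PERCENT) are assumptions, since the slides give only the non-terminal names; BSTATE and the prefixed money form are left out because their terminals are not shown.

```python
import re

# Arabic digits (half- and full-width) and common Chinese numerals.
NUMBER = r"[0-9０-９〇零一二三四五六七八九十百千万亿两]+"

RULES = [
    ("DATE", NUMBER + r"年"),              # DATE -> NUMBER + YEAR
    ("DATE", NUMBER + r"月"),              # DATE -> NUMBER + MTHUNIT
    ("TIME", NUMBER + r"[时点]"),          # TIME -> NUMBER + HUNIT
    ("DMONEY", NUMBER + r"(?:美元|元)"),   # DMONEY -> NUMBER + MOUNIT
    ("DPERCENT", r"百分之" + NUMBER),      # DPERCENT -> PERCENT + NUMBER
    ("DPERCENT", NUMBER + r"[%％]"),       # DPERCENT -> NUMBER + PERCENT
]


def tag_expressions(text):
    """Return (label, matched text) pairs in rule order."""
    found = []
    for label, pattern in RULES:
        for match in re.finditer(pattern, text):
            found.append((label, match.group()))
    return found


print(tag_expressions("1997年增长百分之五"))
```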


29

Named Entity Extraction (11)

(8) Transform the results in Big-5 codes into the results in GB codes

MET2 Testing Results

Named Entity        Recall(%)   Precision(%)
Person Name         91          74
Organization Name   78          85
Location Name       78          69
Date                94          88
Time                98          70
Money               98          98
Percent             83          98

F-MEASURES: P&R 79.61%, 2P&R 77.88%, P&2R 81.42%
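The three F-measure figures weight precision and recall differently. The following is a minimal sketch, assuming they follow the usual MUC-style definition F = (β² + 1)PR / (β²P + R), with β = 1 for P&R, β = 0.5 for 2P&R (precision weighted more) and β = 2 for P&2R (recall weighted more). The slides give neither the formula nor the overall precision and recall, so the numbers in the usage line are arbitrary.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)


# Arbitrary inputs with recall above precision.
p, r = 0.6, 0.8
print(f_measure(p, r, 0.5), f_measure(p, r, 1.0), f_measure(p, r, 2.0))
```

When recall exceeds precision, the recall-weighted score is the highest of the three, which matches the ordering of the reported figures.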


30

Entity Relation Extraction (1)

A Trainable Method for Extracting Chinese Entity Names and Their Relations

(Yimin Zhang et al., Intel China Research Center, Beijing, China)

The process can be divided into two stages. The first is the learning process, in which several classifiers are built from the training data. The second is the extracting process, in which Chinese entity names and their relations are extracted using the learned classifiers. The learning algorithm used in the learning process is memory-based learning (MBL), which is a classification-based supervised learning approach.
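The idea behind memory-based learning can be shown with a plain nearest-neighbour classifier: learning stores the examples, and classification finds the most similar stored cases. The sketch below uses a simple overlap distance and is not the IG-Tree variant used by the authors; the class name, feature values and labels are all illustrative.

```python
from collections import Counter


class MemoryBasedClassifier:
    """Nearest-neighbour classifier over symbolic feature tuples."""

    def __init__(self, k=1):
        self.k = k
        self.memory = []

    def fit(self, examples):
        # Learning is storage: keep every (features, label) pair.
        self.memory = list(examples)
        return self

    def predict(self, features):
        # Overlap metric: number of feature positions that differ.
        def distance(example):
            return sum(a != b for a, b in zip(example[0], features))

        nearest = sorted(self.memory, key=distance)[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]


training = [
    (("person", "company", "same_sentence"), "employee-of"),
    (("company", "product", "same_sentence"), "product-of"),
]
clf = MemoryBasedClassifier().fit(training)
print(clf.predict(("person", "company", "neighboring_sentence")))
```

An IG-Tree compresses the stored cases into a decision tree ordered by information gain, which trades some accuracy for much faster lookup than the linear scan shown here.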


31

Entity Relation Extraction (2)

[Figure: Memory-Based Learning Architecture. Labels: EXAMPLES, CASES, INPUT, OUTPUT, Learning, Storage, Computation of Metrics, Performance, Similarity-Based Reasoning.]


32

Entity Relation Extraction (3)

The main steps for the learning process:

(1) Prepare training data in which all noun phrases, entity names and relations are manually annotated.

(2) Segment, tag and partially parse the training data.

(3) Extract the training sets from the parsed training data. Four training sets are extracted for different tasks, related respectively to Chinese person names, entity names, noun phrases, and relations between entity names in the training data. The main features used in an example can be either local context features, e.g. dependency relation, or global context features, e.g. the feature of a word in the whole document, etc.

(4) Use the MBL algorithm to obtain an IG-Tree for each of the four training sets. An IG-Tree is a compressed representation of the training set that can be processed quickly in the classification process.


33

Entity Relation Extraction (4)

The main steps for the extracting process:

(1) Segment, tag and partially parse the Chinese documents.

(2) Identify Chinese person names using PersonName-IG-Tree.

(3) Identify Chinese organization names using the same method as the NTU System.

(4) Identify other entity names using the same method as the NTU System.

(5) Identify Chinese noun phrases (NP chunking) using NP-IG-Tree.

(6) Use entity names and noun phrases extracted to perform partial parsing again to fix the parsing errors.

(7) Use EntityName-IG-Tree to classify the noun phrases extracted. This step will identify entity names that are missed in the previous steps.

(8) Use Relation-IG-Tree to identify relations between the extracted entity names.


34

Entity Relation Extraction (5)

The entity relations extracted:

(1) Employee-of,

(2) Location-of,

(3) Product-of and

(4) No-relation

The features for this task:

(1) The features used in the CRYSTAL System,

(2) Some new features, such as the linear order of the entity names, the word(s) between the entity names, the relative position of the entity names (in the same sentence or in neighboring sentences), etc.
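The added features (linear order, words in between, relative position) can be computed from token spans. The following is a minimal sketch, assuming each entity is a dictionary with a type, a token span (`start` inclusive, `end` exclusive) and a sentence index; this representation and the feature names are hypothetical.

```python
def relation_features(tokens, first, second):
    """Build order, in-between and position features for an entity pair."""
    if first["start"] > second["start"]:
        first, second = second, first          # put the pair in text order
    gap = second["sentence"] - first["sentence"]
    if gap == 0:
        position = "same_sentence"
    elif gap == 1:
        position = "neighboring_sentence"
    else:
        position = "distant"
    return {
        "order": (first["type"], second["type"]),
        "between": tuple(tokens[first["end"]:second["start"]]),
        "position": position,
    }


tokens = ["浪潮集团", "作为", "国内", "著名", "的", "IT硬件设备", "制造商"]
company = {"type": "company", "start": 0, "end": 1, "sentence": 0}
product = {"type": "product", "start": 5, "end": 6, "sentence": 0}
print(relation_features(tokens, company, product))
```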


35

Entity Relation Extraction (6)

Example:

The phrase “联想总裁” (Legend’s President) in the subject position includes the features below. (Note: Legend = Legend Holdings Limited or Legend Group, a famous computer company in China.)

SUBJ-Terms-联想

SUBJ-Terms-总裁

SUBJ-Mod-Terms-联想

SUBJ-Head-Terms-总裁

SUBJ-Classes-Employee

SUBJ-Mod-Classes-Organization

SUBJ-Head-Classes-Organization (should be Position)


36

Entity Relation Extraction (7)

Learning and extracting processes:

For every two related entity names in the training data, a training example is identified and extracted. After all examples are extracted, they are fed to the MBL Learner to build the Relation-IG-Tree.

The extracting process enumerates all pairs of entity names in the same way as the learning process. Then the relation between every pair of entity names is derived by the Relation-IG-Tree.


37

Entity Relation Extraction (8)

Example 1:

“浪潮集团作为国内著名的 IT 硬件设备制造商，…”

As a famous manufacturer of IT hardware devices in China, the Lang Chao Group …

Company name: 浪潮集团

Product name: IT 硬件设备

Training example: Company name (作为 / 是) … Product name 制造商

Relation: product-of

Example 2:

“吴士宏再度成为媒体关注的焦点。不过，这次她是以 TCL 集团副总裁兼信息产业公司总经理的身份来上海的。”

Wu Shihong became the media focus once again. However, this time she came to Shanghai as the vice president of TCL Group and general manager of its IT company.

Person name: 吴士宏

Company name: TCL 集团

Training example: If a person name and a company name appear in neighboring sentences, and no other person names or company names are found in between, they tend to have an employee-of relation.

Relation: employee-of


38

Entity Relation Extraction (9)

System testing results:

To test this approach, a manually annotated corpus of about 200 business news articles is used. All the entity names (about 500 person names and 300 organization names), noun phrases, and relations in the corpus were manually annotated. Ten pairs of training and test sets were randomly selected from the corpus, with each set equal in size to half of the entire corpus. All data sets were tested; the results are as follows:

                    Recall(%)   Precision(%)
Person Name         86.3        83.2
Organization Name   73.4        89.3
Employee-of         75.6        92.3
Product-of          56.2        87.1
Location-of         67.2        75.6


39

Conclusion

• Chinese is typologically different from English or German. There are some special difficulties in Chinese NLP, such as word segmentation.

• There are mainly two types of ambiguous phrases in Chinese word segmentation: the overlap type and the combination type. Among overlap ambiguous phrases, chain lengths of 1 or 2 account for 95%. Among combination ambiguous phrases, 30% usually have only one possible segmentation. We can resolve ambiguity according to the ambiguity type.

• Chinese named entities are major constituents of Chinese documents. We can combine different methods to extract them, such as character conditions, statistical information, titles, punctuation marks, organization and location keywords, speech-act and locative verbs, cache and the n-gram model.

• We can view the determination of Chinese entity relations as a classification process. In the learning process, several classifiers are built from the training data. In the extracting process, the relations are extracted using the learned classifiers. Machine learning techniques have been used effectively in Chinese entity relation extraction.
