
A search-based Chinese Word Segmentation Method

——WWW 2007

Xin-Jing Wang: IBM China

Wen Liu: Huazhong Univ., China

Yong Qin: IBM China

Introduction

• Challenges in CWS:
  • Ambiguity
  • Unknown words

• Web and search technology:
  • Free from the OOV problem
  • Adaptive to different segmentation standards
  • Entirely unsupervised

The proposed approach

• Segments Collecting (see the sketch after the example query below):
  • Split the query sentence into sub-sentences (by punctuation)
  • Submit each sub-sentence to a search engine
  • Collect the highlights from the returned snippets

Example query: “我明天要去止锚湾玩” (“I am going to Zhimao Bay tomorrow for fun”)
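A minimal Python sketch of this collection step; search_snippets is a hypothetical stand-in for a real search-engine API, and highlights are assumed to be wrapped in <b>…</b> tags as in typical snippet HTML:

import re

def collect_segments(query, search_snippets):
    # Split the query sentence into sub-sentences at punctuation marks.
    sub_sentences = [s for s in re.split(r"[，。！？；、,.!?;]", query) if s]
    segments = set()
    for sub in sub_sentences:
        # search_snippets: hypothetical callable returning the top-N
        # result snippets for a query string.
        for snippet in search_snippets(sub):
            # Assume matched terms are highlighted with <b>...</b>.
            segments.update(re.findall(r"<b>(.*?)</b>", snippet))
    return segments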

The proposed approach

• Segments Scoring: score the collected segments so that a subset can be selected as the final segmentation

• Frequency-based: term frequency, i.e., a segment's number of occurrences divided by the total number of occurrences (sketched after this slide)

• SVM-based: an SVM classifier with an RBF kernel whose outputs are mapped into probabilities, used as the scores

• Reconstruct the query using the segmentation with the highest score
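A minimal Python sketch of the frequency-based score, assuming it is a segment's share of all segment occurrences observed in the snippets:

from collections import Counter

def frequency_scores(occurrences: Counter) -> dict:
    # Score each segment by its relative frequency: its number of
    # occurrences divided by the total number of occurrences.
    total = sum(occurrences.values())
    return {seg: count / total for seg, count in occurrences.items()}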

The proposed approach

• Segments Selecting (a sketch follows this list):
  • Valid subset: a subset whose member segments can exactly reconstruct the query
  • Score of a valid subset: the average score of its member segments
  • A greedy search is used to find valid subsets, for efficiency
  • The valid subset with the highest score is selected as the final segmentation
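A minimal Python sketch of one way to realize the greedy search over scored segments (the paper's exact strategy may differ):

def greedy_select(query, scores):
    # Build a valid subset greedily: walk through the query and always
    # take the highest-scoring segment matching at the current position,
    # so the chosen segments reconstruct the query exactly.
    result, pos = [], 0
    while pos < len(query):
        matches = [s for s in scores if s and query.startswith(s, pos)]
        if not matches:
            matches = [query[pos]]  # fall back to a single character
        best = max(matches, key=lambda s: scores.get(s, 0.0))
        result.append(best)
        pos += len(best)
    return result

def subset_score(subset, scores):
    # Score of a valid subset = average score of its member segments.
    return sum(scores.get(s, 0.0) for s in subset) / len(subset)

subset_score implements the average-score criterion used to compare competing valid subsets.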

Evaluations

• Experiment setting
  • SVM-based score (an SVM sketch follows):
    • Training set: 3,000 randomly selected sentences
    • Feature space: three-dimensional: TF, DF, LEN
      TF: term frequency
      DF: number of documents indexed by a segment
      LEN: number of characters in a segment
  • Frequency-based score: needs no training set
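A minimal Python sketch of the SVM-based scorer using scikit-learn (a toolkit assumption; the paper names none): an RBF-kernel SVM over the (TF, DF, LEN) features with Platt-scaled probability outputs:

from sklearn.svm import SVC

def train_svm_scorer(features, labels):
    # features: one (TF, DF, LEN) triple per training segment;
    # labels: 1 if the segment is a true word, 0 otherwise.
    clf = SVC(kernel="rbf", probability=True)  # RBF kernel, probabilistic output
    clf.fit(features, labels)
    return clf

def svm_score(clf, tf, df, length):
    # Probability of the positive class, used as the segment's score.
    return clf.predict_proba([[tf, df, length]])[0][1]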

Evaluations

• Comparison results on the SIGHAN’05 datasets

Evaluations

• The results are worse than those reported by others
  • Why is the SVM-based score worse? The feature space is too simple
• Advantages:
  • Requires only 3,000 training sentences, or no training set at all
  • Avoids the OOV problem
  • Better performance can be achieved when more search results are provided (Google + Yahoo!)

Evaluations

• Comparison to the IBM full parser

Conclusion

• The method is good at discovering new words (no OOV problem) and at adapting to different segmentation standards

• Entirely unsupervised, which saves the labor of labeling training data

• Future work: find more effective scoring methods, and combine the current approach with other types of segmentation methods for better performance

My ongoing work: Discriminative Reranking

——ACL 07 & 03

1. Michael Collins and Terry Koo

2. Zhongqiang Huang: Purdue Univ.

Background

• Reranking has been applied to many NLP applications: NER, parsing, sentence boundary detection

• It has not yet been tried on POS tagging

• Motivation (pipeline sketched below):

1. Rerank the output of an existing probabilistic tagger.

2. The base tagger produces a set of candidate tag sequences for each sentence.

3. A second model attempts to improve upon this initial ranking using additional features.
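A minimal Python sketch of that two-stage pipeline; base_tagger_nbest and rerank_score are hypothetical stand-ins for the HMM tagger and the trained reranker:

def tag_with_reranking(sentence, base_tagger_nbest, rerank_score, n=20):
    # base_tagger_nbest: hypothetical HMM tagger returning the n most
    # probable candidate tag sequences for the sentence.
    candidates = base_tagger_nbest(sentence, n)
    # rerank_score: hypothetical second model that rescores each
    # candidate using richer features than the base tagger.
    return max(candidates, key=lambda cand: rerank_score(sentence, cand))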

Collins’ Reranking Algorithm

• Training the reranker:
  • $n$ sentences $\{S_i : i = 1, \ldots, n\}$, each with $n_i$ candidates $\{X_{i,j} : j = 1, \ldots, n_i\}$, along with the log-probability $L(X_{i,j})$ produced by the HMM tagger
  • A “goodness” score $\mathrm{Score}(X_{i,j})$ measures the similarity between the candidate and the gold reference

Collins’ Reranking Algorithm

• Training data consists of a set of examples $\{X_{i,j} : i = 1, \ldots, n;\ j = 1, \ldots, n_i\}$, each along with a “goodness” score $\mathrm{Score}(X_{i,j})$ and a log-probability $L(X_{i,j})$

Collins’ Reranking Algorithm

• A set of indicator functions $\{h_k : k = 1, \ldots, m\}$ extracts binary features $\{h_k(X_{i,j}) : k = 1, \ldots, m\}$ on each example $X_{i,j}$

• Each indicator function $h_k$ is associated with a real-valued weight parameter $\alpha_k$

• The log-probability $L(X_{i,j})$ is associated with the weight $\alpha_0$

Collins’ Reranking Algorithm

• The ranking function (sketched below):

  $F(X_{i,j}) = \alpha_0 L(X_{i,j}) + \sum_{k=1}^{m} \alpha_k h_k(X_{i,j})$

• The objective of training: set the parameters $\bar{\alpha} = \{\alpha_0, \alpha_1, \ldots, \alpha_m\}$ to minimize a loss over the training data; in Collins and Koo's formulation this is the exponential loss

  $\mathrm{ExpLoss}(\bar{\alpha}) = \sum_{i=1}^{n} \sum_{j=2}^{n_i} e^{-\left(F(X_{i,1}) - F(X_{i,j})\right)}$

  where $X_{i,1}$ is the candidate with the highest goodness score
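A minimal Python sketch of the ranking function, assuming the binary features of a candidate are represented by the set of indices $k$ for which $h_k$ fires:

def ranking_score(log_prob, fired, alpha):
    # F(x) = alpha[0] * L(x) + sum of alpha[k] over every indicator
    # feature h_k with h_k(x) = 1.
    # fired: indices k in 1..m of the firing features;
    # alpha: weights, with alpha[0] tied to the log-probability.
    return alpha[0] * log_prob + sum(alpha[k] for k in fired)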

Experiments

• Using an HMM as the base model

• Data set: the most recently released Penn Chinese Treebank 5.2 (denoted CTB, released by LDC)

——33 POS tags

——500K words, 800K characters, 18K sentences

Experiments

• Divide the data into 20 chunks, with each chunk N-best tagged by the HMM model trained on the combination of the other 19 chunks (sketched below)
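A minimal Python sketch of this jackknife scheme; train_hmm and nbest_tag are hypothetical helpers for training the base tagger and producing N-best tag sequences:

def jackknife_nbest(sentences, train_hmm, nbest_tag, k=20, n=20):
    # Split the corpus into k chunks; N-best tag each chunk with an
    # HMM trained on the other k-1 chunks, so every sentence is
    # tagged by a model that never saw it in training.
    chunks = [sentences[i::k] for i in range(k)]
    tagged = []
    for i, chunk in enumerate(chunks):
        rest = [s for j, c in enumerate(chunks) if j != i for s in c]
        model = train_hmm(rest)
        tagged.extend(nbest_tag(model, s, n) for s in chunk)
    return tagged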

Experiments

• Results of the reranking models:
  • N-gram features (illustrated below)
  • N-gram + morphological features
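For illustration, a minimal Python sketch of the kind of n-gram indicator features a reranker might extract (hypothetical feature templates, not necessarily the paper's exact set):

def ngram_features(words, tags):
    # Emit indicator-feature strings from word/tag n-grams; each
    # distinct string corresponds to one binary feature h_k.
    feats = []
    for i, (w, t) in enumerate(zip(words, tags)):
        feats.append(f"w0={w}|t0={t}")                # current word + tag
        if i > 0:
            feats.append(f"t-1={tags[i-1]}|t0={t}")   # tag bigram
            feats.append(f"w-1={words[i-1]}|t0={t}")  # previous word + tag
    return feats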

Conclusion

• The reranking method is effective on the POS tagging task

• Future work:
  • Extract additional reranking features that utilize the characteristics of Mandarin more explicitly
  • Explore semi-supervised training methods for reranking