Part-of-speech tagging and chunking with log-linear models


Page 1: Part-of-speech tagging and chunking with log-linear models

Part-of-speech tagging and chunking with log-linear models

University of Manchester, National Centre for Text Mining (NaCTeM)

Yoshimasa Tsuruoka

Page 2: Part-of-speech tagging and chunking with log-linear models

Outline

• POS tagging and chunking for English
  – Conditional Markov Models (CMMs)
  – Dependency Networks
  – Bidirectional CMMs

• Maximum entropy learning

• Conditional Random Fields (CRFs)

• Domain adaptation of a tagger

Page 3: Part-of-speech tagging and chunking with log-linear models

Part-of-speech tagging

• The tagger assigns a part-of-speech tag to each word in the sentence.

The/DT peri-kappa/NN B/NN site/NN mediates/VBZ human/JJ immunodeficiency/NN virus/NN type/NN 2/CD enhancer/NN activation/NN in/IN monocytes/NNS …

Page 4: Part-of-speech tagging and chunking with log-linear models

Algorithms for part-of-speech tagging

• Tagging speed and accuracy on WSJ

Method / Tagging speed / Accuracy (%)

Dependency Net (2003) Slow 97.24

SVM (2004) Fast 97.16

Perceptron (2002) ? 97.11

Bidirectional CMM (2005) Fast 97.10

HMM (2000) Very fast 96.7*

CMM (1998) Fast 96.6*

* Evaluated on a different portion of WSJ.

Page 5: Part-of-speech tagging and chunking with log-linear models

Chunking (shallow parsing)

• A chunker (shallow parser) segments a sentence into non-recursive phrases

[NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only # 1.8 billion] [PP in] [NP September] .

Page 6: Part-of-speech tagging and chunking with log-linear models

Chunking (shallow parsing)

• Chunking tasks can be converted into a standard tagging task (a conversion sketch follows the example below)

• Different approaches:
  – Sliding window
  – Semi-Markov CRF
  – …

He/B-NP reckons/B-VP the/B-NP current/I-NP account/I-NP deficit/I-NP will/B-VP narrow/I-VP to/B-PP only/B-NP #/I-NP 1.8/I-NP billion/I-NP in/B-PP September/B-NP .
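To make the conversion concrete, here is a minimal sketch (not from the slides) of mapping chunk spans to B/I tags; the function name and the (start, end, label) span format are illustrative assumptions.

```python
def chunks_to_bio(tokens, chunks):
    """tokens: list of words; chunks: list of (start, end, label) spans,
    end exclusive. Tokens outside any chunk keep the tag 'O'."""
    tags = ["O"] * len(tokens)
    for start, end, label in chunks:
        tags[start] = "B-" + label          # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining tokens of the chunk
    return tags

tokens = "He reckons the current account deficit will narrow to".split()
chunks = [(0, 1, "NP"), (1, 2, "VP"), (2, 6, "NP"), (6, 8, "VP"), (8, 9, "PP")]
print(list(zip(tokens, chunks_to_bio(tokens, chunks))))
# [('He', 'B-NP'), ('reckons', 'B-VP'), ('the', 'B-NP'), ('current', 'I-NP'), ...]
```

A standard tagger can then be trained on these per-token labels and its output converted back to chunks.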

Page 7: Part-of-speech tagging and chunking with log-linear models

Algorithms for chunking

• Chunking speed and accuracy on Penn Treebank

Method / Chunking speed / Accuracy (%)

SVM + voting (2001) Slow? 93.91

Perceptron (2003) ? 93.74

Bidirectional CMM (2005) Fast 93.70

SVM (2000) Fast 93.48

Page 8: Part-of-speech tagging and chunking with log-linear models

Conditional Markov Models (CMMs)

• Left to right decomposition (with the first-order Markov assumption)

$$P(t_1 \ldots t_n \mid o) = \prod_{i=1}^{n} P(t_i \mid t_1 \ldots t_{i-1}, o) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1}, o)$$

(Figure: chain of tags t1, t2, t3 over the observation o.)

Page 9: Part-of-speech tagging and chunking with log-linear models

POS tagging with CMMs [Ratnaparkhi 1996; etc.]

• Left-to-right decomposition

– The local classifier uses information about the preceding tag.

He runs fast → PRP VBZ RB (each tag is predicted in turn, left to right)

$$P(t_1 t_2 t_3 \ldots \mid o) = P(t_1 \mid o)\, P(t_2 \mid t_1, o)\, P(t_3 \mid t_2, o) \cdots$$
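A minimal sketch of this left-to-right decoding, assuming a local classifier `local_prob(tag, prev_tag, words, i)` such as the maximum entropy model described later. Greedy decoding is shown for brevity; beam search or dynamic programming over the same local model is also common.

```python
def cmm_greedy_decode(words, tagset, local_prob):
    """Tag words left to right, conditioning each decision on the
    previously predicted tag and the whole observation."""
    tags = []
    for i in range(len(words)):
        prev_tag = tags[i - 1] if i > 0 else "<s>"
        # pick the tag with the highest local probability
        best = max(tagset, key=lambda t: local_prob(t, prev_tag, words, i))
        tags.append(best)
    return tags
```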

Page 10: Part-of-speech tagging and chunking with log-linear models

Examples of the features for local classification

Word unigram: w_i, w_{i-1}, w_{i+1}

Word bigram: w_{i-1} w_i, w_i w_{i+1}

Previous tag: t_{i-1}

Tag/word: t_{i-1} w_i

Prefix/suffix: up to length 10

Lexical features: hyphen, number, etc.

He/PRP runs/? fast (the tag of “runs” is being predicted; the previous tag PRP is available as a feature)
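A sketch of how such local features might be extracted for the word at position i; the template names are illustrative, not the exact ones used in the talk.

```python
def local_features(words, i, prev_tag):
    w = words[i]
    prev_w = words[i - 1] if i > 0 else "<s>"
    next_w = words[i + 1] if i + 1 < len(words) else "</s>"
    feats = {
        "w=" + w: 1.0, "w-1=" + prev_w: 1.0, "w+1=" + next_w: 1.0,          # word unigrams
        "w-1,w=" + prev_w + "_" + w: 1.0, "w,w+1=" + w + "_" + next_w: 1.0,  # word bigrams
        "t-1=" + prev_tag: 1.0,                                              # previous tag
        "t-1,w=" + prev_tag + "_" + w: 1.0,                                  # tag/word
        "has_hyphen": 1.0 if "-" in w else 0.0,                              # lexical features
        "has_digit": 1.0 if any(c.isdigit() for c in w) else 0.0,
    }
    for k in range(1, min(10, len(w)) + 1):      # prefixes/suffixes up to length 10
        feats["prefix=" + w[:k]] = 1.0
        feats["suffix=" + w[-k:]] = 1.0
    return feats

print(sorted(local_features("He runs fast".split(), 1, "PRP")))
```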

Page 11: Part-of-speech tagging and chunking with log-linear models

POS tagging with Dependency Network [Toutanova et al. 2003]

• Use the information on the following tag as well

$$\mathrm{Score}(t_1, \ldots, t_n \mid o) = \prod_{i=1}^{n} P(t_i \mid t_{i-1}, t_{i+1}, o)$$

This is no longer a probability. The following tag can be used as a feature in the local classification model.

(Figure: each tag t_i is connected to both neighbouring tags.)

Page 12: Part-of-speech tagging and chunking with log-linear models

POS tagging with a Cyclic Dependency Network [Toutanova et al. 2003]

• Training cost is small, almost equal to CMMs.
• Decoding can be performed with dynamic programming, but it is still expensive.
• Collusion: the model can lock onto conditionally consistent but jointly unlikely sequences.

(Figure: cyclic dependency network over t1, t2, t3.)

Page 13: Part-of-speech tagging and chunking with log-linear models

Bidirectional CMMs [Tsuruoka and Tsujii, 2005]

• Possible decomposition structures

• Bidirectional CMMs
  – The “best” structure and tag sequence can be found in polynomial time

(Figure: possible decomposition structures (a)-(d) over t1, t2, t3.)

Page 14: Part-of-speech tagging and chunking with log-linear models

Maximum entropy learning

• Log-linear modeling

$$p(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_i \lambda_i f_i(x, y)\Big), \qquad Z(x) = \sum_y \exp\Big(\sum_i \lambda_i f_i(x, y)\Big)$$

f_i: feature function; λ_i: feature weight
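As a rough illustration of the formula above, here is a small Python sketch that computes p(y|x) from sparse features; the representation (dicts from feature name to value) is an assumption of the sketch, not the implementation used in the talk.

```python
import math

def maxent_prob(weights, feature_fn, x, labels):
    # score(y) = sum_i lambda_i * f_i(x, y), with features as name -> value dicts
    scores = {y: sum(weights.get(name, 0.0) * value
                     for name, value in feature_fn(x, y).items())
              for y in labels}
    z = sum(math.exp(s) for s in scores.values())            # Z(x): normalizer over labels
    return {y: math.exp(s) / z for y, s in scores.items()}   # p(y | x)
```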

Page 15: Part-of-speech tagging and chunking with log-linear models

Maximum entropy learning

• Maximum likelihood estimation
  – Find the parameters that maximize the (log-)likelihood of the training data

• Smoothing
  – Gaussian prior [Berger et al., 1996]
  – Inequality constraints [Kazama and Tsujii, 2005]

$$LL(\lambda) = \sum_{x,y} \tilde{p}(y \mid x)\, \tilde{p}(x) \log p(y \mid x)$$

Page 16: Part-of-speech tagging and chunking with log-linear models

Parameter estimation

• Algorithms for maximum entropy
  – GIS [Darroch and Ratcliff, 1972], IIS [Della Pietra et al., 1997]

• General-purpose algorithms for numerical optimization
  – BFGS [Nocedal and Wright, 1999], LMVM [Benson and More, 2001]

• You need to provide the objective function and its gradient:
  – Likelihood of the training samples
  – Model expectation of each feature

$$LL(\lambda) = \sum_{x,y} \tilde{p}(y \mid x)\, \tilde{p}(x) \log p(y \mid x), \qquad \frac{\partial LL(\lambda)}{\partial \lambda_i} = E_{\tilde{p}}[f_i] - E_{p}[f_i]$$
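A sketch of the objective and gradient one would hand to such an optimizer, reusing the `maxent_prob` sketch from above; the data format and names are assumptions, and a Gaussian prior penalty could be added to both quantities for smoothing.

```python
import math
from collections import defaultdict

def log_likelihood_and_gradient(weights, data, feature_fn, labels):
    """data: list of (x, gold_label) pairs.  Returns LL and dLL/dlambda_i."""
    ll = 0.0
    grad = defaultdict(float)
    for x, y_gold in data:
        probs = maxent_prob(weights, feature_fn, x, labels)
        ll += math.log(probs[y_gold])                  # likelihood of this sample
        for name, value in feature_fn(x, y_gold).items():
            grad[name] += value                        # empirical expectation of f_i
        for y in labels:
            for name, value in feature_fn(x, y).items():
                grad[name] -= probs[y] * value         # model expectation of f_i
    return ll, grad
```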

Page 17: Part-of-speech tagging and chunking with log-linear models

Computing likelihood and model expectation

• Example
  – Two possible tags: “Noun” and “Verb”
  – Two types of features: “word” and “suffix”

He/Noun opened/Verb it/Noun

Candidate features for the word “opened”:
  tag = noun: (word = opened, tag = noun), (suffix = ed, tag = noun)
  tag = verb: (word = opened, tag = verb), (suffix = ed, tag = verb)
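A worked version of this example with made-up weights (the numbers are purely illustrative): the four active features score each candidate tag, and normalizing the exponentiated scores gives the probabilities used for the likelihood and the model expectations.

```python
import math

weights = {("noun", "word=opened"): 0.2, ("noun", "suffix=ed"): -0.3,
           ("verb", "word=opened"): 1.1, ("verb", "suffix=ed"):  0.8}

scores = {y: weights[(y, "word=opened")] + weights[(y, "suffix=ed")]
          for y in ("noun", "verb")}
z = sum(math.exp(s) for s in scores.values())           # Z(x)
probs = {y: math.exp(s) / z for y, s in scores.items()}
print(probs)  # log probs["verb"] is this token's contribution to the likelihood;
              # probs[y] times each active feature value gives its model expectation
```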

Page 18: Part-of-speech tagging and chunking with log-linear models

Conditional Random Fields (CRFs)

• A single log-linear model on the whole sentence

• One can use exactly the same techniques as maximum entropy learning to estimate the parameters.

• However, the number of classes (all possible tag sequences) is huge, so the computation is impractical to carry out naively.

$$P(t_1 \ldots t_n \mid o) = \frac{1}{Z(o)} \exp\Big(\sum_{i=1}^{F} \lambda_i f_i(t_1 \ldots t_n, o)\Big)$$

Page 19: Part-of-speech tagging and chunking with log-linear models

Conditional Random Fields (CRFs)

• Solution
  – Restrict the types of features

– Then, you can use a dynamic programming algorithm that drastically reduces the amount of computation

• Features you can use (in first-order CRFs)
  – Features defined on the tag

– Features defined on the adjacent pair of tags

Page 20: Part-of-speech tagging and chunking with log-linear models

Features

• Feature weights are associated with states and edges

(Figure: a lattice of Noun/Verb states over “He has opened it”. Example state feature: W0 = He & Tag = Noun. Example edge feature: Tag_left = Noun & Tag_right = Noun.)

Page 21: Part-of-speech tagging and chunking with log-linear models

A naive way of calculating Z(x)

Noun Noun Noun Noun = 7.2
Noun Noun Noun Verb = 1.3
Noun Noun Verb Noun = 4.5
Noun Noun Verb Verb = 0.9
Noun Verb Noun Noun = 2.3
Noun Verb Noun Verb = 11.2
Noun Verb Verb Noun = 3.4
Noun Verb Verb Verb = 2.5
Verb Noun Noun Noun = 4.1
Verb Noun Noun Verb = 0.8
Verb Noun Verb Noun = 9.7
Verb Noun Verb Verb = 5.5
Verb Verb Noun Noun = 5.7
Verb Verb Noun Verb = 4.3
Verb Verb Verb Noun = 2.2
Verb Verb Verb Verb = 1.9

Sum = 67.5

Page 22: Part-of-speech tagging and chunking with log-linear models

Dynamic programming

• Results of intermediate computation can be reused.

(Figure: the same Noun/Verb lattice over “He has opened it”; partial sums computed at each position are reused at the next.)
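A minimal sketch of this forward computation of Z(x) for a first-order CRF; `state_score` and `edge_score` stand in for the exponentiated sums of state and edge feature weights and are assumptions of the sketch.

```python
def forward_z(n, tags, state_score, edge_score):
    """n: sentence length; state_score(i, t) and edge_score(i, t_prev, t)
    return exp of the summed feature weights for that state or edge."""
    alpha = {t: state_score(0, t) for t in tags}         # partial sums after position 0
    for i in range(1, n):
        alpha = {t: state_score(i, t) *
                    sum(alpha[tp] * edge_score(i, tp, t) for tp in tags)
                 for t in tags}
    return sum(alpha.values())                            # Z(x): sum over all tag sequences
```

For the four-token, two-tag example above, this keeps only two partial sums per position instead of scoring all sixteen sequences.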

Page 23: Part-of-speech tagging and chunking with log-linear models

Maximum entropy learning and Conditional Random Fields

• Maximum entropy learning
  – Log-linear modeling + MLE
  – Parameter estimation
    • Likelihood of each sample
    • Model expectation of each feature

• Conditional Random Fields
  – Log-linear modeling on the whole sentence

– Features are defined on states and edges

– Dynamic programming

Page 24: Part-of-speech tagging and chunking with log-linear models

Named Entity Recognition

We have shown that interleukin-1 (IL-1) and IL-2 control IL-2 receptor alpha (IL-2R alpha) gene transcription in CD4-CD8- murine T lymphocyte precursors.

(Entities: interleukin-1, IL-1, IL-2 = protein; IL-2 receptor alpha (IL-2R alpha) gene = DNA; CD4-CD8- murine T lymphocyte precursors = cell_line)

Page 25: Part-of-speech tagging and chunking with log-linear models

Algorithms for Biomedical Named Entity Recognition

Recall Precision F-score

SVM+HMM (2004) 76.0 69.4 72.6

Semi-Markov CRF [Okanohara et al., 2006] 72.7 70.4 71.5

Sliding window 75.8 67.5 70.8

MEMM (2004) 71.6 68.6 70.1

CRF (2004) 70.3 69.3 69.8

• Shared task data from the COLING 2004 BioNLP workshop

Page 26: Part-of-speech tagging and chunking with log-linear models

Domain adaptation

• Large training data sets are available for general domains (e.g. Penn Treebank WSJ)

• NLP tools trained on general-domain data are less accurate on biomedical text

• Developing domain-specific annotated data requires considerable human effort

Page 27: Part-of-speech tagging and chunking with log-linear models

Tagging errors made by a tagger trained on WSJ

• Accuracy of the tagger on the GENIA POS corpus: 84.4%

… and/CC membrane/NN potential/NN after/IN mitogen/NN binding/JJ .
… two/CD factors/NNS, which/WDT bind/NN to/TO the/DT same/JJ kappa/NN B/NN enhancers/NNS …
… by/IN analysing/VBG the/DT Ag/VBG amino/JJ acid/NN sequence/NN .
… to/TO contain/VB more/RBR T-cell/JJ determinants/NNS than/IN …
Stimulation/NN of/IN interferon/JJ beta/JJ gene/NN transcription/NN in/IN vitro/NN by/IN …

Page 28: Part-of-speech tagging and chunking with log-linear models

Re-training of maximum entropy models

• Taggers trained as maximum entropy models

• Maximum entropy models are adapted to target domains by re-training with domain-specific data

$$p(x) = \frac{1}{Z} \exp\Big(\sum_{i=1}^{F} \lambda_i f_i(x)\Big)$$

f_i: feature function (given by the developer); λ_i: model parameter

Page 29: Part-of-speech tagging and chunking with log-linear models

Methods for domain adaptation

• Combined training data: a model is trained from scratch with the original and domain-specific data

• Reference distribution: the original model is used as the reference distribution for the domain-specific model

$$p_{\mathrm{new}}(x) = \frac{1}{Z}\, p_{\mathrm{orig}}(x) \exp\Big(\sum_{i=1}^{F} \lambda_i f_i(x)\Big)$$
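A sketch of the reference-distribution idea written for the conditional (tagging) setting: the original model's probability enters as a fixed factor, so only the new, domain-specific weights are trained. `orig_prob`, `feature_fn`, and the weight representation are assumptions of the sketch.

```python
import math

def adapted_prob(orig_prob, weights, feature_fn, x, labels):
    # p_new(y|x) is proportional to p_orig(y|x) * exp(sum_i lambda_i f_i(x, y))
    scores = {y: math.log(orig_prob(x, y)) +
                 sum(weights.get(name, 0.0) * v
                     for name, v in feature_fn(x, y).items())
              for y in labels}
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}
```

Because the original model is fixed, training only re-estimates the new weights on the small domain-specific corpus, which is why the reference-distribution setup trains so quickly in the results below.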

Page 30: Part-of-speech tagging and chunking with log-linear models

Adaptation of the part-of-speech tagger

• Relationships among training and test data are evaluated for the following corpora
  – WSJ: Penn Treebank WSJ
  – GENIA: GENIA POS corpus [Kim et al., 2003]
    • 2,000 MEDLINE abstracts selected by the MeSH terms Human, Blood cells, and Transcription factors
  – PennBioIE: Penn BioIE corpus [Kulick et al., 2004]
    • 1,100 MEDLINE abstracts about inhibition of the cytochrome P450 family of enzymes
    • 1,157 MEDLINE abstracts about molecular genetics of cancer
  – Fly: 200 MEDLINE abstracts on Drosophila melanogaster

Page 31: Part-of-speech tagging and chunking with log-linear models

Training and test sets

• Training sets (# tokens / # sentences)
  WSJ: 912,344 / 38,219
  GENIA: 450,492 / 18,508
  PennBioIE: 641,838 / 29,422
  Fly: ? / 1,024

• Test sets (# tokens / # sentences)
  WSJ: 129,654 / 5,462
  GENIA: 50,562 / 2,036
  PennBioIE: 70,713 / 3,270
  Fly: 7,615 / 326

Page 32: Part-of-speech tagging and chunking with log-linear models

Experimental results

Training data: accuracy (%) on WSJ / GENIA / PennBioIE / Fly, training time (sec.)

  WSJ+GENIA+PennBioIE: 96.68 / 98.10 / 97.65 / 96.35
  Fly only: ? / ? / ? / 93.91
  Combined: 96.69 / 98.12 / 97.65 / 97.94, time 30,632
  Ref. dist.: 95.38 / 98.17 / 96.93 / 98.08, time 21

Page 33: Part-of-speech tagging and chunking with log-linear models

Corpus size vs. accuracy (combined training data)

(Figure: accuracy (%) against the number of sentences, from 8 to 1,024; curves for the Fly, WSJ, GENIA, and Penn test sets.)

Page 34: Part-of-speech tagging and chunking with log-linear models

Corpus size vs. accuracy (reference distribution)

(Figure: accuracy (%) against the number of sentences, from 8 to 1,024; curves for the Fly, WSJ, GENIA, and Penn test sets.)

Page 35: Part-of-speech tagging and chunking with log-linear models

Summary

• POS tagging
  – MEMM-like approaches achieve good performance with reasonable computational cost. CRFs seem too computationally expensive at present.

• Chunking
  – CRFs yield good performance for NP chunking. Semi-Markov CRFs are promising, but their computational cost needs to be reduced.

• Domain adaptation
  – Information about the original domain can easily be used as the reference distribution.

Page 36: Part-of-speech tagging and chunking with log-linear models

References

• A. L. Berger, S. A. Della Pietra, and V. J. Della Pietra. (1996). A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics.
• Adwait Ratnaparkhi. (1996). A Maximum Entropy Part-Of-Speech Tagger. Proceedings of EMNLP.
• Thorsten Brants. (2000). TnT: A Statistical Part-Of-Speech Tagger. Proceedings of ANLP.
• Taku Kudo and Yuji Matsumoto. (2001). Chunking with Support Vector Machines. Proceedings of NAACL.
• John Lafferty, Andrew McCallum, and Fernando Pereira. (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. Proceedings of ICML.
• Michael Collins. (2002). Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. Proceedings of EMNLP.
• Fei Sha and Fernando Pereira. (2003). Shallow Parsing with Conditional Random Fields. Proceedings of HLT-NAACL.
• K. Toutanova, D. Klein, C. Manning, and Y. Singer. (2003). Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. Proceedings of HLT-NAACL.

Page 37: Part-of-speech tagging and chunking with log-linear models

References

• Xavier Carreras and Lluís Márquez. (2003). Phrase Recognition by Filtering and Ranking with Perceptrons. Proceedings of RANLP.
• Jesús Giménez and Lluís Márquez. (2004). SVMTool: A General POS Tagger Generator Based on Support Vector Machines. Proceedings of LREC.
• Sunita Sarawagi and William W. Cohen. (2004). Semi-Markov Conditional Random Fields for Information Extraction. Proceedings of NIPS.
• Yoshimasa Tsuruoka and Jun'ichi Tsujii. (2005). Bidirectional Inference with the Easiest-First Strategy for Tagging Sequence Data. Proceedings of HLT/EMNLP.
• Yuka Tateisi, Yoshimasa Tsuruoka, and Jun'ichi Tsujii. (2006). Subdomain Adaptation of a POS Tagger with a Small Corpus. Proceedings of the HLT-NAACL BioNLP Workshop.
• Daisuke Okanohara, Yusuke Miyao, Yoshimasa Tsuruoka, and Jun'ichi Tsujii. (2006). Improving the Scalability of Semi-Markov Conditional Random Fields for Named Entity Recognition. Proceedings of COLING/ACL.