EntityPro: Exploiting SVM for Italian Named Entity Recognition

EVALITA 2007

Frascati, September 10th 2007

Roberto Zanoli and Emanuele Pianta

TextPro

A suite of modular NLP tools developed at FBK-irst:

• TokenPro: tokenization
• MorphoPro: morphological analysis
• TagPro: Part-of-Speech tagging
• LemmaPro: lemmatization
• EntityPro: Named Entity recognition
• ChunkPro: phrase chunking
• SentencePro: sentence splitting

The architecture is designed to be efficient, scalable and robust:

• Cross-platform: Unix / Linux / Windows / MacOS X
• Multi-lingual models
• All modules integrated and accessible through a unified command line interface


EntityPro’s architecture

EntityPro is built with YamCha, an SVM-based machine learning environment. It exploits a rich set of linguistic features, including orthographic features, prefixes and suffixes, and occurrence in proper-noun gazetteers.

[Figure: EntityPro architecture diagram. Training side: training data, feature extraction (ortho, prefix, suffix, dictionary, collocation bigram), feature selection, YamCha learning, models. Classification side: test data, TagPro, feature extraction (ortho, prefix, suffix, dictionary, collocation bigram), feature selection, YamCha classification. A controller and a dictionary resource are also shown.]

YamCha

• Created as a generic, customizable, open source text chunker
• Can be adapted to many other tag-oriented NLP tasks
• Uses a state-of-the-art machine learning algorithm (SVM)
• Can redefine the context (window size), the parsing direction (forward/backward) and the algorithm for the multi-class problem (pairwise / one vs. rest)
• Practical chunking time (1 or 2 sec. per sentence)
• Available as a C/C++ library

Support Vector Machines

Support vector machines are based on the Structural Risk Minimization principle (Vladimir N. Vapnik, 1995) from computational learning theory.

Support vector machines map input vectors to a higher-dimensional space, where a maximum-margin separating hyperplane is constructed. Two parallel hyperplanes are placed on either side of the separating hyperplane, each touching the nearest data points of one class. The separating hyperplane chosen is the one that maximizes the distance (the margin) between these two parallel hyperplanes.
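For reference, the margin maximization described above is usually written as follows for the linearly separable case. The symbols are not from the slides and are introduced here only for illustration: x_i are the (mapped) input vectors, y_i in {-1, +1} their classes, w the normal vector of the hyperplane and b its offset.

```latex
\min_{\mathbf{w},\, b} \; \frac{1}{2}\,\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i \left( \mathbf{w} \cdot \mathbf{x}_i + b \right) \ge 1,
\qquad i = 1, \dots, n
```

Under this formulation the two parallel hyperplanes are w · x + b = +1 and w · x + b = -1, and the distance between them is 2 / ||w||, so minimizing ||w|| is the same as maximizing the margin.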

YamCha: Setting Window Size

Default setting is "F:-2..2:0.. T:-2..-1".

The window setting can be customized
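As a rough illustration of how such a window string can be read, the sketch below expands it into explicit positions. It assumes the usual reading of the string: `F` lists static features as a row range (token offsets relative to the current token) and a column range, and `T` lists the offsets of previously assigned tags. The function names `expand_range` and `parse_window` and the `n_feature_columns` parameter are hypothetical helpers, not part of YamCha.

```python
def expand_range(text, upper=None):
    """Expand 'a..b', 'a..' or a single integer into a list of ints."""
    if ".." in text:
        lo, hi = text.split("..")
        if hi == "" and upper is None:
            raise ValueError("open-ended range needs an upper bound")
        return list(range(int(lo), (int(hi) if hi else upper) + 1))
    return [int(text)]


def parse_window(spec, n_feature_columns):
    """Return (static, dynamic) positions described by a window string.

    static  -> list of (row_offset, column_index) pairs
    dynamic -> list of row offsets whose predicted tags are used
    """
    static, dynamic = [], []
    for part in spec.split():
        fields = part.split(":")
        rows = expand_range(fields[1])
        if fields[0] == "F":
            # An open column range such as '0..' means "up to the last column".
            cols = expand_range(fields[2], upper=n_feature_columns - 1)
            static.extend((r, c) for r in rows for c in cols)
        elif fields[0] == "T":
            dynamic.extend(rows)
    return static, dynamic


# Assumed toy setup: two feature columns per token (e.g. word and POS).
static, dynamic = parse_window("F:-2..2:0.. T:-2..-1", n_feature_columns=2)
print(len(static), dynamic)
```

With two feature columns, the default string covers five tokens (offsets -2 to +2) times two columns of static features, plus the tags of the two preceding tokens.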

Training and Tuning Set

The Evalita development set was randomly split into two parts:

• training: 92,241 tokens
• tuning: 40,348 tokens

FEATURES (1/3)

For each running word:

• WORD: the word itself (both unchanged and lower-cased)
  e.g. Casa, casa
• POS: the part of speech of the word (as produced by TagPro)
  e.g. Oggi: SS (singular noun)
• AFFIX: prefixes/suffixes (1, 2, 3 or 4 characters at the start/end of the word)
  e.g. Oggi: {o, og, ogg, oggi; i, gi, ggi, oggi}
• ORTHO: orthographic information (e.g. capitalization, hyphenation)
  e.g. Oggi: C (capitalized); oggi: L (lowercased)
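The WORD, AFFIX and ORTHO features listed on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not EntityPro's code: the function names are hypothetical, affixes are taken from the lower-cased word (as in the Oggi example), `_nil_` pads affixes longer than the word, and only the C/L orthographic classes named on the slide are modeled, with a made-up `X` class for everything else.

```python
NIL = "_nil_"  # padding value for affixes longer than the word


def affixes(word, max_len=4):
    """Return (prefixes, suffixes) of length 1..max_len of the lower-cased word."""
    w = word.lower()
    prefixes = [w[:n] if n <= len(w) else NIL for n in range(1, max_len + 1)]
    suffixes = [w[-n:] if n <= len(w) else NIL for n in range(1, max_len + 1)]
    return prefixes, suffixes


def ortho(word):
    """Coarse orthographic class: C = capitalized, L = lowercased.

    'X' (anything else, e.g. punctuation) is an assumption of this sketch.
    """
    if word[:1].isupper():
        return "C"
    if word.islower():
        return "L"
    return "X"


def word_features(word):
    """One feature row: word, lower-cased word, 4 prefixes, 4 suffixes, ortho class."""
    prefixes, suffixes = affixes(word)
    return [word, word.lower(), *prefixes, *suffixes, ortho(word)]


print(" ".join(word_features("Oggi")))
print(" ".join(word_features("oggi")))
```

POS tags are not computed here because they come from a separate tagger (TagPro in the slides).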

FEATURES (2/3)

COLLOC: collocation bigrams (36,000 bigrams from Italian newspapers, ranked by mutual information (MI) values)

e.g.

l’          O
avvocato    O
di          O
Rossi       O
Carlo       B-COL
Taormina    I-COL
ha          O
…
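A minimal sketch of how a bigram list could be turned into the B-COL / I-COL column shown in the example. The function name `tag_collocations`, the one-entry bigram set and the exact, case-sensitive matching are assumptions made for illustration; the slides do not describe how EntityPro performs the lookup.

```python
def tag_collocations(tokens, bigrams):
    """Label tokens with B-COL / I-COL where a known bigram starts, else O."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens) - 1:
        if (tokens[i], tokens[i + 1]) in bigrams:
            tags[i], tags[i + 1] = "B-COL", "I-COL"
            i += 2  # skip past the matched bigram so matches do not overlap
        else:
            i += 1
    return tags


tokens = ["l'", "avvocato", "di", "Rossi", "Carlo", "Taormina", "ha"]
bigrams = {("Carlo", "Taormina")}  # hypothetical one-entry collocation list

for token, tag in zip(tokens, tag_collocations(tokens, bigrams)):
    print(token, tag)
```

Storing the bigrams in a set keeps each lookup constant-time, which matters with a list of tens of thousands of entries.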

FEATURES (3/3): GAZETteers

• TOWNS: world (main), Italian (comuni) and Trentino (frazioni) towns (12,000, from various internet sites)
• STOCK-MARKET: Italian and American stock market organizations (5,000, from stock market sites)
• WIKI-GEO: Wikipedia geographical locations (3,200)
• PERSONS: person proper names or titles (154,000; Italian phone book, Wikipedia)

e.g.

difeso      O     O    O    O
dall'       O     O    O    O
avvocato    O     O    O    TRIG
Mario       O     O    O    B-NAM
De          O     O    O    B-SUR
Murgo       O     O    O    I-SUR
di          O     O    O    O
Vicenza     GPE   O    O    O
…

An Example of Feature Extraction

Input (token, POS, tag):

difeso      VSP   O
dall'       ES    O
avvocato    SS    O
Mario       SPN   B-PER
De          E     I-PER
Murgo       SPN   I-PER
,           XPW   O

Extracted features (one token per line):

difeso difeso d di dif dife o so eso feso L N O O O O O VSP O
dall' dall' d da dal dall ' l' ll' all' L A O O O O O ES O
avvocato avvocato a av avv avvo o to ato cato L N O O O TRIG O SS O
Mario mario m ma mar mari o io rio ario C N O O O B-NAM O SPN B-PER
De de d e _nil_ _nil_ e de _nil_ _nil_ C N O O O B-SUR B-COL E I-PER
Murgo murgo m mu mur murg o go rgo urgo C N O O O I-SUR I-COL SPN I-PER

Static vs Dynamic Features

• STATIC FEATURES: extracted for the current, previous and following word (WORD, POS, AFFIX, ORTHO, COLLOC, GAZET)
• DYNAMIC FEATURES: decided dynamically during tagging (the tags of the 3 tokens preceding the current token)

Finding the best features

Rows starting with "+" give the change relative to the baseline.

                     Pr      Re      F1
baseline             75.28   68.74   71.86
+POS                 +1.31   +2.78   +2.11
+GAZET               +6.09   +7.93   +7.09
+COLLOC              +0.37   +0.54   +0.46
+CLUSTER_5-class     -0.45   -0.04   -0.23
+POS+GAZET+COLLOC    +6.56   +9.14   +7.95

Baseline features: WORD (both unchanged and lower-cased), AFFIX, ORTHO.
Window size: STAT +2,-2; DYNAMIC -2.

Finding the best window-size

Starting from the best set of features (F1 = 79.81), we tried to improve the F1 measure by changing the window size. Rows after the first give the change relative to the first row.

STAT     DYN    Pr      Re      F1
+2,-2    -2     81.84   77.88   79.81
+3,-3    -3     +1.03   -1.17   -0.14
+6,-6    -6     +0.01   -3.14   -1.67
+1,-1    -1     +1.87   +2.46   +2.18
+1,-1    -3     +2.21   +3.04   +2.64
+1,-1           -7.70   -0.72   -4.19

Evaluating the best algorithm: PKI vs. PKE

       Pr      Re      F1      tokens/sec
PKI    84.05   80.92   82.45   1400
PKE    83.22   80.16   81.66   4200

YamCha provides two implementations of SVMs, PKI and PKE:

• Both are faster than the original SVMs.
• PKI produces the same accuracy as the original SVMs.
• PKE approximates the original SVMs: it is slightly less accurate but faster.

Feature Contribution to the best configuration

                                                      Pr      Re      F1
Best Configuration                                    84.05   80.92   82.45
no POS                                                +0.27   -0.71   -0.24
no GAZET                                              -8.25   -8.40   -8.33
no COLLOC                                             +0.01   -0.13   -0.06
no GAZET, no COLLOC (i.e. no external resources)      -8.26   -8.49   -8.38
no ORTHO                                              -0.96   -3.22   -2.14
no AFFIX                                              -1.30   -2.51   -1.93

The learning curve

Test Results

Test-Set    Pr      Re      F1
All         83.41   80.91   82.14
GPE         84.80   86.30   85.54
LOC         77.78   68.85   73.04
ORG         68.84   60.26   64.27
PER         91.62   92.63   92.12
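The F1 column is the harmonic mean of precision and recall. The short sketch below recomputes it from the Pr and Re values in the table; the function name `f1_score` is a hypothetical helper. Because the inputs are already rounded to two decimals, the recomputed values can differ from the reported ones in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in the same unit, e.g. percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (Pr, Re) pairs copied from the test results table.
rows = {
    "All": (83.41, 80.91),
    "GPE": (84.80, 86.30),
    "LOC": (77.78, 68.85),
    "ORG": (68.84, 60.26),
    "PER": (91.62, 92.63),
}

for label, (pr, re_) in rows.items():
    print(f"{label}: {f1_score(pr, re_):.2f}")
```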

Conclusion (1/2)

A statistical approach to Named Entity Recognition for Italian based on YamCha/SVMs.

Results confirm that SVMs can deal with a large number of features and that they perform at the state of the art.

Among the features, GAZETteers seem to be the most important (31% error reduction).

A large context (large window sizes, e.g. +6,-6) leads to a significant decrease in recall, about 3 points (data sparseness).
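The slides do not say how the 31% figure was computed. One reading that gives a similar number, assuming the error is taken as 100 minus F1, compares the test-set F1 without external resources (74.07, Appendix A) with the full system (82.14); this yields about 31%. The same calculation on the tuning set, going from the configuration without GAZET (82.45 - 8.33 = 74.12) to the best configuration (82.45), gives about 32%. The function name `error_reduction` is a hypothetical helper.

```python
def error_reduction(f1_before, f1_after):
    """Relative reduction of the error (100 - F1), in percent."""
    error_before = 100.0 - f1_before
    error_after = 100.0 - f1_after
    return 100.0 * (error_before - error_after) / error_before


# Test set: without external resources vs. full system.
print(f"{error_reduction(74.07, 82.14):.1f}%")

# Tuning set: best configuration without GAZET vs. best configuration.
print(f"{error_reduction(82.45 - 8.33, 82.45):.1f}%")
```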

Conclusion (2/2)

F1 values for both PER (92.12) and GPE (85.54) appear rather good, comparing well with those obtained in CoNLL-2003 for English.

Recognition of LOCs (F1: 73.04) seems more problematic: we suspect that the number of LOCs in the training data is insufficient for the learning algorithm.

ORGs appear to be highly ambiguous.

Examples

Token          Gold     Prediction

è              O        O
stato          O        O
denunciato     O        O
dai            O        O
carabinieri    B-ORG    O
di             O        O
Vigolo         B-GPE    B-GPE
Vattaro        I-GPE    I-GPE

è              O        O
stato          O        O
fermato        O        O
dai            O        O
carabinieri    O        O
ed             O        O
in             O        O
seguito        O        O
ad             O        O
un             O        O
controllo      O        O

Examples 2

Token          Gold     Prediction

Fontana        B-PER    B-PER
(              O        O
Villazzano     B-ORG    B-GPE
)              O        O
,              O        O
Campo          B-PER    B-PER
(              O        O
Baone          B-ORG    B-GPE
)              O        O
,              O        O
Rao            B-PER    B-PER
(              O        O
Alta           B-ORG    B-ORG
Vallagarina    I-ORG    I-ORG
)              O        O
.              O        O

dovrà          O        O
dare           O        O
a              O        O
via            B-ORG    B-LOC
Segantini      I-ORG    I-LOC
un             O        O
ruolo          O        O
diverso        O        O

Appendix A

Test-Set (without external resources)

        Pr      Re      F1
All     75.79   72.43   74.07
GPE     78.56   76.51   77.53
LOC     81.08   49.18   61.22
ORG     57.09   52.28   54.58
PER     85.71   85.50   85.60

EntityPro

EntityPro is a system for Named Entity Recognition (NER) built on YamCha, which provides the Support Vector Machine (SVM) implementation.

YamCha (Yet Another Multipurpose Chunk Annotator, by Taku Kudo) is a generic, customizable, open source text chunker.

EntityPro can exploit a rich set of linguistic features, such as part of speech, orthographic features and proper-name gazetteers.

The system is part of TextPro, a suite of NLP tools developed at FBK-irst.
