EntityPro: Exploiting SVM for Italian Named Entity Recognition

EVALITA 2007, Frascati, September 10th 2007

Roberto Zanoli and Emanuele Pianta



Page 1: EntityPro: Exploiting SVM for Italian Named Entity Recognition

EVALITA 2007

Frascati, September 10th 2007

Roberto Zanoli and Emanuele Pianta

Page 2: TextPro

A suite of modular NLP tools developed at FBK-irst:

• TokenPro: tokenization
• MorphoPro: morphological analysis
• TagPro: Part-of-Speech tagging
• LemmaPro: lemmatization
• EntityPro: Named Entity recognition
• ChunkPro: phrase chunking
• SentencePro: sentence splitting

Design:

• Architecture designed to be efficient, scalable and robust
• Cross-platform: Unix / Linux / Windows / MacOS X
• Multi-lingual models
• All modules integrated and accessible through a unified command line interface

Page 3: EntityPro's architecture

We used YamCha, an SVM-based machine learning environment, to build EntityPro, a system that exploits a rich set of linguistic features, such as orthographic features, prefixes and suffixes, and occurrence in proper noun gazetteers.

[Architecture diagram. Components shown: Controller; feature extraction (ortho, prefix, suffix, dictionary, collocation bigram) and feature selection, drawn for both the training data and the test data; dictionary; TagPro; YamCha learning, which produces the models; YamCha classification; EntityPro.]

Page 4: YamCha

• Created as a generic, customizable, open source text chunker

• Can be adapted to many other tag-oriented NLP tasks

• Uses a state-of-the-art machine learning algorithm (SVM)

• Can redefine:
  - context (window size)
  - parsing direction (forward/backward)
  - algorithm for the multi-class problem (pairwise / one vs. rest)

• Practical chunking time (1 or 2 sec. per sentence)

• Available as a C/C++ library
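The two multi-class strategies differ in how many binary SVMs have to be trained. The following is a minimal sketch, assuming an IOB2 tag set over the four entity types reported in the test results (GPE, LOC, ORG, PER); the function names are illustrative and are not part of YamCha.

```python
from itertools import combinations

def iob2_tagset(entity_types):
    """Build the IOB2 tag set: B-X and I-X for each entity type, plus O."""
    tags = ["O"]
    for etype in entity_types:
        tags += [f"B-{etype}", f"I-{etype}"]
    return tags

def pairwise_classifiers(tags):
    """Pairwise: one binary classifier per unordered pair of classes."""
    return list(combinations(tags, 2))

def one_vs_rest_classifiers(tags):
    """One vs. rest: one binary classifier per class."""
    return [(tag, "REST") for tag in tags]

tags = iob2_tagset(["GPE", "LOC", "ORG", "PER"])
print(len(tags))                           # 9 classes
print(len(pairwise_classifiers(tags)))     # 36 = 9 * 8 / 2
print(len(one_vs_rest_classifiers(tags)))  # 9
```

Pairwise training needs more classifiers, but each one sees only the examples of two classes.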

Page 5: Support Vector Machines

Support vector machines are based on the Structural Risk Minimization principle (Vladimir N. Vapnik, 1995) from computational learning theory.

Support vector machines map input vectors to a higher dimensional space where a maximal separating hyperplane is constructed. Two parallel hyperplanes are constructed on each side of the hyperplane that separates the data. The separating hyperplane is the hyperplane that maximizes the distance between the two parallel hyperplanes.
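As textbook background (not taken from the slides), the linear, separable case can be written down explicitly. Assuming training pairs $(\mathbf{x}_i, y_i)$ with $y_i \in \{-1, +1\}$, the separating hyperplane, the two parallel hyperplanes and the distance between them are:

```latex
\begin{aligned}
&\text{separating hyperplane:} && \mathbf{w}\cdot\mathbf{x} + b = 0 \\
&\text{parallel hyperplanes:} && \mathbf{w}\cdot\mathbf{x} + b = \pm 1 \\
&\text{distance between them:} && \frac{2}{\lVert \mathbf{w} \rVert} \\
&\text{training problem:} && \min_{\mathbf{w},\,b}\ \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 \ \text{for all } i
\end{aligned}
```

Minimizing $\lVert \mathbf{w} \rVert$ is therefore the same as maximizing the distance between the two parallel hyperplanes.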

Page 6: YamCha: Setting Window Size

Default setting is "F:-2..2:0.. T:-2..-1".

The window setting can be customized
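In the default setting, the `F` part selects the feature columns of the tokens at relative positions -2 to 2, and the `T` part selects the tags already assigned to the two preceding tokens. The sketch below illustrates that reading in plain Python; it does not call YamCha, and the function name, the output format and the `_pad_` placeholder for positions outside the sentence are assumptions made for illustration.

```python
def window_features(rows, tags, i, static=(-2, 2), dynamic=(-2, -1)):
    """Collect the context features for token i.

    rows:    per-token lists of feature columns, e.g. [word, pos]
    tags:    tags already assigned to the tokens before i (forward parsing)
    static:  inclusive range of relative positions for the feature columns
    dynamic: inclusive range of relative positions for previous tags
    """
    feats = []
    for off in range(static[0], static[1] + 1):
        j = i + off
        for col in range(len(rows[0])):
            value = rows[j][col] if 0 <= j < len(rows) else "_pad_"
            feats.append(f"F:{off:+d}:{col}:{value}")
    for off in range(dynamic[0], dynamic[1] + 1):
        j = i + off
        value = tags[j] if 0 <= j < len(tags) else "_pad_"
        feats.append(f"T:{off:+d}:{value}")
    return feats

rows = [["difeso", "VSP"], ["dall'", "ES"], ["avvocato", "SS"],
        ["Mario", "SPN"], ["De", "E"], ["Murgo", "SPN"]]
tags_so_far = ["O", "O", "O", "B-PER"]   # tags decided for tokens 0..3
feats = window_features(rows, tags_so_far, 4)
print(feats)
```

Widening `static` or `dynamic` adds more context features per token, which is the knob explored in the window-size experiments.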

Page 7: Training and Tuning Set

The Evalita development set was randomly split into two parts:

training: 92,241 tokens

tuning: 40,348 tokens
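The slides do not say at which level the split was made or which seed was used. A minimal sketch, assuming a sentence-level split and a training share of roughly 70% (the share implied by the token counts); the function name and the toy data are hypothetical.

```python
import random

def split_sentences(sentences, train_fraction=0.7, seed=0):
    """Randomly split a list of sentences into training and tuning parts."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy data: each sentence is a list of (token, tag) pairs.
sentences = [[("tok%d" % i, "O")] * (i % 5 + 1) for i in range(100)]
train, tune = split_sentences(sentences)
print(len(train), len(tune))
```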

Page 8: Features (1/3)

For each running word:

• WORD: the word itself (both unchanged and lower-cased), e.g. Casa, casa

• POS: the part of speech of the word (as produced by TagPro), e.g. Oggi SS (singular noun)

• AFFIX: prefixes/suffixes (1, 2, 3 or 4 chars at the start/end of the word), e.g. Oggi {o, og, ogg, oggi; i, gi, ggi, oggi}

• ORTHO: orthographic information (e.g. capitalization, hyphenation), e.g. Oggi C (capitalized), oggi L (lower-cased)
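The AFFIX and ORTHO features can be sketched in a few lines. This is an illustration, not EntityPro's code: the function names are hypothetical, the `_nil_` padding for words shorter than the affix length follows the token that appears in the feature extraction example later in the slides, and the `X` label for anything that is neither capitalized nor lower-cased is an assumption.

```python
def affixes(word, max_len=4, pad="_nil_"):
    """Prefixes and suffixes of length 1..max_len of the lower-cased word."""
    w = word.lower()
    prefixes = [w[:n] if n <= len(w) else pad for n in range(1, max_len + 1)]
    suffixes = [w[-n:] if n <= len(w) else pad for n in range(1, max_len + 1)]
    return prefixes, suffixes

def ortho(word):
    """Coarse orthographic class: C (capitalized) or L (lower-cased)."""
    if word[:1].isupper():
        return "C"
    if word.islower():
        return "L"
    return "X"  # assumed fallback label, not taken from the slides

print(affixes("Oggi"))   # (['o', 'og', 'ogg', 'oggi'], ['i', 'gi', 'ggi', 'oggi'])
print(ortho("Oggi"), ortho("oggi"))   # C L
```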

Page 9: Features (2/3)

COLLOC: collocation bigrams (36,000 bigrams from Italian newspapers, ranked by MI values)

e.g.

l'          O
avvocato    O
di          O
Rossi       O
Carlo       B-COL
Taormina    I-COL
ha          O
…
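A sketch of how such a COLLOC column could be produced from a bigram list. The matching strategy (exact, case-sensitive, greedy left to right, no overlaps) and the function name are assumptions; the slides only show the resulting tags.

```python
def tag_collocations(tokens, bigrams):
    """Mark tokens covered by a known bigram with B-COL / I-COL, others with O."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens) - 1:
        if (tokens[i], tokens[i + 1]) in bigrams:
            tags[i], tags[i + 1] = "B-COL", "I-COL"
            i += 2   # skip past the matched pair
        else:
            i += 1
    return tags

bigrams = {("Carlo", "Taormina")}   # toy stand-in for the 36,000-entry list
tokens = ["l'", "avvocato", "di", "Rossi", "Carlo", "Taormina", "ha"]
for token, tag in zip(tokens, tag_collocations(tokens, bigrams)):
    print(token, tag)
```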

Page 10: Features (3/3): Gazetteers

• TOWNS: World (main), Italian (comuni) and Trentino's (frazioni) towns (12,000, from various internet sites)

• STOCK-MARKET: Italian and American stock market organizations (5,000, from stock market sites)

• WIKI-GEO: Wikipedia geographical locations (3,200)

• PERSONS: Person proper names or titles (154,000, from the Italian phone book and Wikipedia)

e.g.

Token       TOWNS   STOCK-MARKET   WIKI-GEO   PERSONS
difeso      O       O              O          O
dall'       O       O              O          O
avvocato    O       O              O          TRIG
Mario       O       O              O          B-NAM
De          O       O              O          B-SUR
Murgo       O       O              O          I-SUR
di          O       O              O          O
Vicenza     GPE     O              O          O
…
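A gazetteer column of this kind can be sketched as a longest-match dictionary lookup. This is an illustration under assumptions: the entry format, the lower-casing and the function name are hypothetical, and the sketch only covers the B-/I- style labels (B-NAM, B-SUR, I-SUR), not unprefixed labels such as TRIG or GPE.

```python
def gazetteer_column(tokens, entries, max_len=3):
    """Produce one feature column from a gazetteer.

    entries maps a tuple of lower-cased tokens to a label; the longest match
    starting at each position wins, and matched tokens get B-/I- prefixes.
    """
    column = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = entries.get(tuple(lowered[i:i + n]))
            if label is not None:
                column[i] = f"B-{label}"
                for k in range(i + 1, i + n):
                    column[k] = f"I-{label}"
                i += n
                break
        else:
            i += 1   # no entry starts at this position
    return column

persons = {("mario",): "NAM", ("de", "murgo"): "SUR"}   # toy entries
tokens = ["difeso", "dall'", "avvocato", "Mario", "De", "Murgo", "di", "Vicenza"]
print(gazetteer_column(tokens, persons))
```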

Page 11: An Example of Feature Extraction

Input (token, POS, tag):

difeso VSP O
dall' ES O
avvocato SS O
Mario SPN B-PER
De E I-PER
Murgo SPN I-PER
, XPW O

Extracted features (one row per token):

difeso difeso d di dif dife o so eso feso L N O O O O O VSP O
dall' dall' d da dal dall ' l' ll' all' L A O O O O O ES O
avvocato avvocato a av avv avvo o to ato cato L N O O O TRIG O SS O
Mario mario m ma mar mari o io rio ario C N O O O B-NAM O SPN B-PER
De de d e _nil_ _nil_ e de _nil_ _nil_ C N O O O B-SUR B-COL E I-PER
Murgo murgo m mu mur murg o go rgo urgo C N O O O I-SUR I-COL SPN I-PER

Page 12: Static vs Dynamic Features

STATIC FEATURES: extracted for the current, previous and following word: WORD, POS, AFFIX, ORTHO, COLLOC, GAZET

DYNAMIC FEATURES: decided dynamically during tagging: the tags of the 3 tokens preceding the current token.

Page 13: Finding the best features

                       Pr       Re       F1
baseline               75.28    68.74    71.86
+POS                   +1.31    +2.78    +2.11
+GAZET                 +6.09    +7.93    +7.09
+COLLOC                +0.37    +0.54    +0.46
+CLUSTER_5-class       -0.45    -0.04    -0.23
+POS+GAZET+COLLOC      +6.56    +9.14    +7.95

Rows starting with + give the difference from the baseline.

Baseline: WORD (both unchanged and lower-cased), AFFIX, ORTHO

Window size: STAT: +2,-2; DYNAMIC: -2
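The F1 column is the harmonic mean of precision and recall, so the table can be checked directly. A small sketch, using only figures from the table; the function name is illustrative.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Baseline row.
print(round(f1(75.28, 68.74), 2))                 # 71.86
# Best feature set: baseline plus the deltas of the +POS+GAZET+COLLOC row.
print(round(f1(75.28 + 6.56, 68.74 + 9.14), 2))   # 79.81
```

The second value matches the baseline F1 plus the reported delta (71.86 + 7.95).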

Page 14: Finding the best window-size

Given the best set of features (F1 = 79.81), we tried to improve the F1 measure by changing the window size. Rows after the first give the difference from the first row.

STAT     DYN    Pr       Re       F1
+2,-2    -2     81.84    77.88    79.81
+3,-3    -3     +1.03    -1.17    -0.14
+6,-6    -6     +0.01    -3.14    -1.67
+1,-1    -1     +1.87    +2.46    +2.18
+1,-1    -3     +2.21    +3.04    +2.64
+1,-1           -7.70    -0.72    -4.19

Page 15: Evaluating the best algorithm: PKI vs. PKE

        Pr       Re       F1       tokens/sec
PKI     84.05    80.92    82.45    1400
PKE     83.22    80.16    81.66    4200

YamCha uses two implementations of SVMs: PKI and PKE.

• Both are faster than the original SVMs.

• PKI produces the same accuracy as the original SVMs.

• PKE approximates the original SVMs: slightly less accurate, but faster.

Page 16: Feature Contribution to the best configuration

                                                     Pr       Re       F1
Best Configuration                                   84.05    80.92    82.45
no POS                                               +0.27    -0.71    -0.24
no GAZET                                             -8.25    -8.40    -8.33
no COLLOC                                            +0.01    -0.13    -0.06
no GAZET, no COLLOC (i.e. no external resources)     -8.26    -8.49    -8.38
no ORTHO                                             -0.96    -3.22    -2.14
no AFFIX                                             -1.30    -2.51    -1.93

Page 17: The learning curve

[Figure: learning curve]

Page 18: Test Results

Test-Set    Pr       Re       F1
All         83.41    80.91    82.14
GPE         84.80    86.30    85.54
LOC         77.78    68.85    73.04
ORG         68.84    60.26    64.27
PER         91.62    92.63    92.12

Page 19: Conclusion (1/2)

A statistical approach to Named Entity Recognition for Italian based on YamCha/SVMs.

Results confirm that SVMs can deal with a large number of features and that they perform at the state of the art.

Among the features, the GAZETteers seem to be the most important: 31% error reduction.

A large context (large window sizes, e.g. +6,-6) causes a significant drop in recall, about 3 points, which we attribute to data sparseness.
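The slides do not state how the 31% figure was computed. One reading that reproduces it, offered here as an assumption, treats 100 - F1 as the error and compares the test-set F1 without external resources (74.07, Appendix A) with the full system (82.14). Note that this comparison removes the collocation list as well as the gazetteers.

```python
def error_reduction(f1_before, f1_after):
    """Relative reduction of the error (100 - F1) when moving from before to after."""
    return (f1_after - f1_before) / (100.0 - f1_before)

# Test-set F1 without external resources vs. the full system.
print(round(100 * error_reduction(74.07, 82.14), 1))   # 31.1
```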

Page 20: Conclusion (2/2)

F1 values for both PER (92.12) and GPE (85.54) appear rather good, comparing well with those obtained at CoNLL-2003 for English.

Recognition of LOCs (F1: 73.04) seems more problematic: we suspect that the number of LOCs in the training data is insufficient for the learning algorithm.

ORGs appear to be highly ambiguous.

Page 21: Examples

Token          Gold     Prediction
è              O        O
stato          O        O
denunciato     O        O
dai            O        O
carabinieri    B-ORG    O
di             O        O
Vigolo         B-GPE    B-GPE
Vattaro        I-GPE    I-GPE

Token          Gold     Prediction
è              O        O
stato          O        O
fermato        O        O
dai            O        O
carabinieri    O        O
ed             O        O
in             O        O
seguito        O        O
ad             O        O
un             O        O
controllo      O        O

Page 22: Examples 2

Token          Gold     Prediction
Fontana        B-PER    B-PER
(              O        O
Villazzano     B-ORG    B-GPE
)              O        O
,              O        O
Campo          B-PER    B-PER
(              O        O
Baone          B-ORG    B-GPE
)              O        O
,              O        O
Rao            B-PER    B-PER
(              O        O
Alta           B-ORG    B-ORG
Vallagarina    I-ORG    I-ORG
)              O        O
.              O        O

Token          Gold     Prediction
dovrà          O        O
dare           O        O
a              O        O
via            B-ORG    B-LOC
Segantini      I-ORG    I-LOC
un             O        O
ruolo          O        O
diverso        O        O
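Errors like these are easier to read at the entity level than at the token level. The sketch below converts B-/I-/O tag sequences into typed spans and compares gold and predicted spans. It assumes that an entity counts as correct only if both its boundaries and its type match, as in CoNLL-style scoring; the slides do not describe the scorer, and the function name is hypothetical.

```python
def iob2_spans(tags):
    """Convert a sequence of IOB2 tags into (start, end, type) spans, end exclusive."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel closes the last span
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans

tokens = ["dovrà", "dare", "a", "via", "Segantini", "un", "ruolo", "diverso"]
gold = ["O", "O", "O", "B-ORG", "I-ORG", "O", "O", "O"]
pred = ["O", "O", "O", "B-LOC", "I-LOC", "O", "O", "O"]

gold_spans, pred_spans = set(iob2_spans(gold)), set(iob2_spans(pred))
print(gold_spans, pred_spans)
print(gold_spans & pred_spans)   # empty: right boundaries, wrong type
```

Under this criterion "via Segantini" counts as both a missed ORG and a spurious LOC, even though the boundaries were found correctly.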

Page 23: Appendix A

Test-Set (without external resources)

        Pr       Re       F1
All     75.79    72.43    74.07
GPE     78.56    76.51    77.53
LOC     81.08    49.18    61.22
ORG     57.09    52.28    54.58
PER     85.71    85.50    85.60

Page 24: EntityPro

EntityPro is a system for Named Entity Recognition (NER) built on YamCha, which provides its Support Vector Machine (SVM) implementation.

YamCha (Yet Another Multipurpose Chunk Annotator, by Taku Kudo), is a generic, customizable, and open source text chunker.

EntityPro can exploit a rich set of linguistic features such as the Part of Speech, orthographic features and proper name gazetteers.

The system is part of TextPro, a suite of NLP tools developed at FBK-irst.