Unsupervised acquisition of verb subcategorization frames from shallow-parsed corpora

Post on 12-Jan-2016

Alessandro Lenci (Università di Pisa, Italy)

Barbara McGillivray (ILC-CNR / Università di Pisa, Italy)

Simonetta Montemagni (ILC-CNR, Italy)

Vito Pirrelli (ILC-CNR, Italy)

Outline

1. Subcategorization acquisition

2. MDL verb clustering

1. Subcategorization acquisition: summary

• Previous work

• Our acquisition process

• Evaluation of results

Previous work (1)

• Brent, 1991; Ushioda et al., 1993; Briscoe & Carroll, 1997; Korhonen, 2002

• These approaches presuppose a battery of predefined frames

• there are languages for which no such SCF repertoires are already available

Previous work (2)

• alternative: acquisition process as a “SCF discovery” process in corpora

• Basili et al., 1997; Zeman & Sarkar, 2000; Alonso et al., 2007; Bourigault & Frérot, 2005

• we present a variation of this “discovery approach” to SC acquisition for Italian verbs

Our SC extraction method

• simply requires a “chunked” corpus and a limited number of search heuristics that do not rely on any previous knowledge about SCFs

– languages other than English

– a looser notion of SCF including typical verb modifiers and strongly selected arguments

The acquisition process

0. experimental setting

– chunked PAROLE Corpus

• Italian general corpus

• 3 million word tokens

• chunked with CHUG-IT

– 47 communication verbs

The acquisition process (step 1)

1. extraction of verb local contexts (SLCs) from chunked texts

• Ex.: [N_C lo yen] [FV_C ha chiuso] [P_C a Tokio] [P_C a 120] [I_C dopo aver toccato] [P_C nel corso] [P_C della seduta] [N_C il massimo storico]

‘the yen closed down in Tokyo at 120 after reaching the maximum ever in the course of the session’
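
A minimal sketch of how the bracketed chunk notation in the example could be read programmatically. The function names and the choice to take every chunk after the finite-verb chunk (FV_C) as the local context are assumptions made for illustration; the slides do not specify the extraction window.

```python
import re

# One chunk looks like "[LABEL some words]"; the label is the first token.
CHUNK_RE = re.compile(r"\[(\S+)\s+([^\]]*)\]")

def parse_chunks(line):
    """Return (label, text) pairs from a bracketed chunked sentence."""
    return [(label, text.strip()) for label, text in CHUNK_RE.findall(line)]

def local_context(chunks, verb_label="FV_C"):
    """Return the chunks after the first finite-verb chunk.

    Assumption: the local context is everything to the right of the verb.
    """
    for i, (label, _) in enumerate(chunks):
        if label == verb_label:
            return chunks[i + 1:]
    return []

sentence = ("[N_C lo yen] [FV_C ha chiuso] [P_C a Tokio] [P_C a 120] "
            "[I_C dopo aver toccato] [P_C nel corso] [P_C della seduta] "
            "[N_C il massimo storico]")
chunks = parse_chunks(sentence)
context_labels = [label for label, _ in local_context(chunks)]
```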

The acquisition process (step 2)

2. Context carving: linguistically motivated criteria select only those chunks that are in the dependency scope of the verb v → noise is minimized

• Ex.:

[N_C lo yen] [FV_C ha chiuso] [P_C a Tokio] [P_C a 120] [I_C dopo aver toccato] [P_C nel corso] [P_C della seduta] [N_C il massimo storico]

‘the yen closed down in Tokyo at 120 after reaching the maximum ever in the course of the session’

The acquisition process (step 3)

3. induction of potential subcategorization frames (PSF)

a. assumption: all contextual chunks occurring immediately after the verb are very likely governed by it → potentially subcategorized slots (PSS)

b. Frequency filter on PSSs

c. a SLC is eligible as a PSF if its contextual chunks belong to the list of selected PSS

d. Frequency filter on PSFs

The acquisition process (step 3)

Verb: accettare ‘accept’

SLC                            PSF         Rel. freq.
[ ]                            [ ]         0.33
[CHE_C]                        [CHE_C]     0.05
[I_C-di]                       [I_C-di]    0.13
[N_C]                          [N_C]       0.45
[N_C][ADJ_C]                   [N_C]       0.45
[N_C][ADJPART_C]               [N_C]       0.45
[N_C][di_C]                    [N_C]       0.45
[N_C][NA_C]                    [N_C]       0.45
[N_C][P_C-a]                   [N_C]       0.45
[N_C][P_C-di]                  [N_C]       0.45
[N_C][P_C-di][ADJ_C]           [N_C]       0.45
[N_C][P_C-di][ADJPART_C]       [N_C]       0.45

PSS: [ ], [CHE_C], [I_C-di], [N_C]
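
Steps a to d can be sketched as follows. This is one possible reading, not the authors' implementation: the thresholds are hypothetical parameters, and reducing each SLC to its leading run of selected slots is an assumption inferred from the accettare table, where [N_C][ADJ_C] maps to [N_C]. Relative frequencies are taken over all contexts of the verb.

```python
from collections import Counter

EMPTY = ""  # marker for "no chunk after the verb", i.e. the [ ] slot

def induce_frames(slcs, pss_min, psf_min):
    """slcs: one tuple of chunk labels per occurrence of a single verb."""
    n = len(slcs)
    # (a) the slot immediately after the verb is a candidate PSS
    first_slots = Counter(slc[0] if slc else EMPTY for slc in slcs)
    # (b) frequency filter on PSSs (threshold is a hypothetical parameter)
    pss = {slot for slot, count in first_slots.items() if count / n >= pss_min}

    def to_frame(slc):
        # (c) assumption: keep the leading chunks that are selected PSSs;
        # a context that starts with an unselected chunk is discarded.
        if not slc:
            return () if EMPTY in pss else None
        kept = []
        for label in slc:
            if label not in pss:
                break
            kept.append(label)
        return tuple(kept) if kept else None

    frames = Counter(f for f in map(to_frame, slcs) if f is not None)
    # (d) frequency filter on PSFs
    return {f: c / n for f, c in frames.items() if c / n >= psf_min}

# Toy contexts, invented for illustration only.
toy = ([()] * 3 + [("N_C",)] * 3 + [("N_C", "ADJ_C")] * 2
       + [("CHE_C",)] + [("ADV_C",)])
frames = induce_frames(toy, pss_min=0.15, psf_min=0.15)
```

In the toy data, [CHE_C] and [ADV_C] fall below the slot threshold, so only the empty frame and [N_C] survive.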

Evaluation of results - Italian

• Evaluation of our SCF induction method

– extracted carved contexts: baseline (step 2)

– induced subcat frames (step 3)

o type precision: $P = \dfrac{\text{correctly acquired frames}}{\text{all acquired frames}}$

o type recall: $R = \dfrac{\text{correctly acquired frames}}{\text{all frames in the gold standard}}$

o F-measure: $F = \dfrac{2 \cdot P \cdot R}{P + R}$
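
The three measures can be computed over sets of frame types as in the sketch below. The frame strings in the example are invented for illustration and do not come from any of the gold standards.

```python
def f_measure(p, r):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

def evaluate(acquired, gold):
    """Type precision, type recall and F-measure for one verb."""
    acquired, gold = set(acquired), set(gold)
    correct = acquired & gold
    p = len(correct) / len(acquired) if acquired else 0.0
    r = len(correct) / len(gold) if gold else 0.0
    return p, r, f_measure(p, r)

# Hypothetical frame inventories for a single verb.
acquired = {"[ ]", "[N_C]", "[CHE_C]", "[P_C-a]"}
gold = {"[ ]", "[N_C]", "[CHE_C]", "[I_C-di]", "[N_C][P_C-a]"}
p, r, f = evaluate(acquired, gold)
```

As a check against the reported figures, f_measure(0.42, 0.08) is about 0.13, which matches the F value listed for IGS1.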

Evaluation - Italian (2)

• carried out against three gold standards (1–3), plus a manual evaluation (4)

1. IGS1: a general purpose computational lexicon (SIMPLE-PAROLE-CLIPS lexicon)

2. IGS2: Italian dictionary (Sabatini-Coletti 2006)

3. IGS3: merging IGS1 and IGS2

4. Manual evaluation

Evaluation - Italian (3)

               IGS1    IGS2    IGS3    Manual (4)
SCFs       P   42%     30%     52%     93%
           R    8%     84%     78%     NA
           F   13%     44%     62%     NA
baseline   P   23%     13%     27%     40%
           R   72%     68%     75%     NA
           F   35%     22%     38%     NA

Evaluation - English

• four gold standards

1. EGS1: general purpose computational lexicon (Valex5 Lexicon)

2. EGS2: Longman Dictionary (2006)

3. EGS3: biomedical English lexicon (SPECIALIST Lexicon)

4. EGS4: merging EGS1, EGS2 and EGS3

Evaluation – English (2)

               EGS1+EGS2    EGS3    EGS4
SCFs       P   69%          52%     83%
           R   48%          54%     51%
           F   57%          53%     63%
baseline   P   28%          17%     33%
           R   52%          49%     53%
           F   36%          25%     41%

2. Verb clustering: summary

• The MDL Principle

• Verb clustering using MDL

Why verb clustering?

• syntax-semantics lexical interface

• starting from the SCFs extracted, we aim at inducing clusters of verbs that share similar semantic properties

• each verb is represented as a vector whose dimensions report its statistical distribution with the automatically extracted SCFs

• a clustering of verb vectors is performed using the Minimum Description Length Principle (MDL)
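
A minimal sketch of the verb representation described above, assuming the vector holds the relative frequency of each extracted SCF for the verb. The function name and the counts are hypothetical.

```python
def frame_distribution(frame_counts, frames):
    """Map raw SCF counts for one verb to a relative-frequency vector.

    frames fixes the order of the dimensions, so that all verbs share
    the same vector space.
    """
    total = sum(frame_counts.get(f, 0) for f in frames)
    if total == 0:
        return [0.0] * len(frames)
    return [frame_counts.get(f, 0) / total for f in frames]

frames = ["[ ]", "[che]", "[I-di]", "[N]"]
counts = {"[ ]": 3, "[che]": 2, "[N]": 5}  # invented counts
vector = frame_distribution(counts, frames)
```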

The MDL Principle

• from information theory (Rissanen 1989)

• model description length: code length in bits for the encoding of the model itself → complexity of the model

• data description length: code length in bits for the encoding of the given data observed through the model → fit of the model to the data

• MDL: “any regularity in the data can be used to compress the data, i.e. to describe it using fewer symbols than needed to describe the data literally”

$M = \arg\min_{m} \left[ L(m) + L(D \mid m) \right]$

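Verb clustering using MDL

1) Baseline model: each verb belongs to one class

$M_0 : \{v_1\}, \ldots, \{v_r\}$

2) Compare $M_0$ with any model

$M_1(h,k) : \{v_j\},\ j = 1, \ldots, r,\ j \neq h, k;\ \{v_h, v_k\}$

3) Choose $(n_1, m_1)$ such that

$(n_1, m_1) = \arg\max_{(h,k)} \left[ L(M_0) - L(M_1(h,k)) \right]$, with $L(M_0) - L(M_1(n_1, m_1)) > 0$

4) Cluster $v_{n_1}$ and $v_{m_1}$ together into the class $\{v_{n_1}, v_{m_1}\}$, leaving $r - 1$ classes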
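
The merge criterion can be sketched as a greedy loop that repeats steps 2 to 4 until no merge shortens the description. The description-length function below is an assumption, not the coding scheme used in the slides: it charges 0.5·log2(N) bits per free parameter of each class distribution and encodes the data with the pooled frame distribution of each class.

```python
import math
from itertools import combinations

def description_length(clusters, counts, n_frames):
    """Hypothetical two-part code length in bits: model cost + data cost."""
    total = sum(sum(counts[v]) for cl in clusters for v in cl)
    # Model cost: (n_frames - 1) free parameters per class.
    model_bits = len(clusters) * (n_frames - 1) * 0.5 * math.log2(total)
    data_bits = 0.0
    for cl in clusters:
        pooled = [sum(counts[v][i] for v in cl) for i in range(n_frames)]
        size = sum(pooled)
        # Data cost: frame tokens encoded with the class distribution.
        data_bits -= sum(c * math.log2(c / size) for c in pooled if c)
    return model_bits + data_bits

def mdl_cluster(counts):
    """Greedily merge the pair of classes giving the largest positive gain."""
    n_frames = len(next(iter(counts.values())))
    clusters = [frozenset([v]) for v in counts]  # baseline model M_0
    current = description_length(clusters, counts, n_frames)
    while len(clusters) > 1:
        best_gain, best = 0.0, None
        for a, b in combinations(clusters, 2):
            merged = [c for c in clusters if c not in (a, b)] + [a | b]
            gain = current - description_length(merged, counts, n_frames)
            if gain > best_gain:
                best_gain, best = gain, merged
        if best is None:  # no merge reduces the description length
            break
        clusters, current = best, current - best_gain
    return clusters

# Invented frame counts for three placeholder verbs.
toy_counts = {"a": [40, 10, 0], "b": [38, 12, 0], "c": [0, 5, 45]}
result = mdl_cluster(toy_counts)
```

In the toy data, the verbs a and b have nearly identical frame distributions, so merging them saves model bits at almost no data cost, while c stays in a class of its own.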

[Figure: MDL-clustering dendrogram. Leaf labels recoverable from the slide, grouped as extracted:]

PROMETTERE, RISPONDERE

PARLARE, PROTESTARE

CHIEDERE, DIRE

ASSERIRE, MINACCIARE, COMANDARE

INSEGNARE, AMMONIRE

DICHIARARE, CONFESSARE

CHIARIRE, PROIBIRE

SUGGERIRE, COMUNICARE

ACCETTARE, PROPORRE, MOSTRARE

COMMENTARE, CHIAMARE, PREGARE

DISCUTERE, RIVELARE

RICHIAMARE, RIMPROVERARE

LEGGERE, SPIEGARE

REPLICARE, DESCRIVERE, RICHIEDERE

DENUNCIARE, OFFRIRE

RIMPIANGERE, ORDINARE

• 47 Italian communication verbs: 23 clustering steps

Conclusions

• a preliminary qualitative analysis of induced verb clusters shows encouraging results

• we expect to evaluate the coherence of the obtained lexico-semantic clusters and the coverage of the subcategorization behaviour of clustered verbs

• The verb classes are assigned a new cluster-based frame distribution

                             [ ]     [che]   [I-di]   [N]     [P-a]   [perché]
chiarire ‘clarify’           0.34    0.10    0        0.40    0       0.009
comunicare ‘communicate’     0.24    0.15    0        0.31    0.08    0
proibire ‘forbid’            0.21    0.03    0.03     0.51    0       0
suggerire ‘suggest’          0.24    0.10    0.009    0.42    0.02    0.02
verb class (cluster)         0.25    0.10    0.008    0.41    0.02    0.02
