Unsupervised acquisition of verb subcategorization frames from shallow-parsed corpora

Alessandro Lenci (Università di Pisa, Italy)

Barbara McGillivray (ILC-CNR / Università di Pisa, Italy)

Simonetta Montemagni (ILC-CNR, Italy)

Vito Pirrelli (ILC-CNR, Italy)

Outline

1. Subcategorization acquisition

2. MDL verb clustering

1. Subcategorization acquisition: summary

• Previous work

• Our acquisition process

• Evaluation of results

Previous work (1)

• Brent, 1991; Ushioda et al., 1993; Briscoe & Carroll, 1997; Korhonen, 2002

• These approaches presuppose a battery of predefined frames

• there are languages for which no such SCF repertoires are already available

Previous work (2)

• alternative: acquisition process as a “SCF discovery” process in corpora

• Basili et al., 1997; Zeman & Sarkar, 2000; Alonso et al., 2007; Bourigault & Frérot, 2005

• we present a variation of this “discovery approach” to SC acquisition for Italian verbs

Our SC extraction method

• simply requires a “chunked” corpus and a limited number of search heuristics that do not rely on any previous knowledge about SCFs

– languages other than English

– a looser notion of SCF including typical verb modifiers and strongly selected arguments

The acquisition process

0. experimental setting

– chunked PAROLE Corpus

• Italian general corpus

• 3 million word tokens

• chunked with CHUG-IT

– 47 communication verbs

The acquisition process (step 1)

1. extraction of verb local contexts (SLCs) from chunked texts

• Ex.: [N_C lo yen] [FV_C ha chiuso] [P_C a Tokio] [P_C a 120] [I_C dopo aver toccato] [P_C nel corso] [P_C della seduta] [N_C il massimo storico]

‘the yen closed down in Tokyo at 120 after reaching the maximum ever in the course of the session’

2. Context carving: linguistically motivated criteria select only those chunks that are in the dependency scope of the verb v, so that noisy information is minimized

• Ex.: [N_C lo yen] [FV_C ha chiuso] [P_C a Tokio] [P_C a 120] [I_C dopo aver toccato] [P_C nel corso] [P_C della seduta] [N_C il massimo storico]

‘the yen closed down in Tokyo at 120 after reaching the maximum ever in the course of the session’

3. induction of potential subcategorization frames (PSF)

a. assumption: all contextual chunks occurring immediately after the verb are very likely governed by it; these are the potentially subcategorized slots (PSS)

b. frequency filter on PSSs

c. an SLC is eligible as a PSF if its contextual chunks belong to the list of selected PSSs

d. frequency filter on PSFs

Verb accettare ‘accept’

SLC                          PSF         Rel. freq.
[ ]                          [ ]         0.33
[CHE_C]                      [CHE_C]     0.05
[I_C-di]                     [I_C-di]    0.13
[N_C]                        [N_C]       0.45
[N_C][ADJ_C]                 [N_C]       0.45
[N_C][ADJPART_C]             [N_C]       0.45
[N_C][di_C]                  [N_C]       0.45
[N_C][NA_C]                  [N_C]       0.45
[N_C][P_C-a]                 [N_C]       0.45
[N_C][P_C-di]                [N_C]       0.45
[N_C][P_C-di][ADJ_C]         [N_C]       0.45
[N_C][P_C-di][ADJPART_C]     [N_C]       0.45
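Steps 1 and 3 can be illustrated with a minimal Python sketch. It assumes the bracketed chunk notation of the yen example, treats the first FV_C chunk as the target verb, and uses hypothetical thresholds (`pss_min`, `psf_min`). The function names, and the rule that maps a longer SLC to its longest selected prefix (suggested by rows such as [N_C][ADJ_C] mapping to [N_C]), are assumptions for illustration, not the authors' implementation. Context carving (step 2) is not modelled.

```python
import re
from collections import Counter

# One chunk looks like "[LABEL some words]"
CHUNK = re.compile(r"\[(\w+) ([^\]]+)\]")

def right_context(chunked_sentence):
    """Labels of the chunks that follow the first finite-verb chunk (FV_C)."""
    labels = [label for label, _ in CHUNK.findall(chunked_sentence)]
    if "FV_C" not in labels:
        return None
    return tuple(labels[labels.index("FV_C") + 1:])

def induce_psfs(slcs, pss_min=0.05, psf_min=0.05):
    """slcs: one tuple of chunk labels per occurrence of a given verb.
    Thresholds are hypothetical relative-frequency cut-offs."""
    n = len(slcs)
    # (a) the chunk immediately after the verb is a candidate slot
    first = Counter(slc[0] for slc in slcs if slc)
    # (b) frequency filter on PSSs
    pss = {c for c, k in first.items() if k / n >= pss_min}
    # (c) an SLC is eligible if all its chunks are selected PSSs
    #     (the empty context is trivially eligible)
    eligible = Counter(slc for slc in slcs if all(c in pss for c in slc))
    # (d) frequency filter on PSFs
    psfs = {f for f, k in eligible.items() if k / n >= psf_min}
    return pss, psfs

def reduce_to_psf(slc, psfs):
    """Assumed mapping: longest prefix of the SLC that is a selected PSF."""
    for end in range(len(slc), -1, -1):
        if slc[:end] in psfs:
            return slc[:end]
    return None

sentence = ("[N_C lo yen] [FV_C ha chiuso] [P_C a Tokio] [P_C a 120] "
            "[I_C dopo aver toccato] [P_C nel corso] [P_C della seduta] "
            "[N_C il massimo storico]")
print(right_context(sentence))

# Toy SLCs for one verb (hypothetical counts, not corpus data)
slcs = [("N_C",)] * 5 + [("N_C", "ADJ_C")] + [()] * 3 + [("CHE_C",)]
pss, psfs = induce_psfs(slcs, pss_min=0.2, psf_min=0.2)
print(pss, psfs, reduce_to_psf(("N_C", "ADJ_C"), psfs))
```

With these toy counts only N_C survives the slot filter, so the selected frames are the empty frame and [N_C], and the longer context [N_C][ADJ_C] falls back to [N_C].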

Evaluation of results - Italian

• Evaluation of our SCF induction method

– extracted carved contexts: baseline (step

– induced subcat frames (step 4)

o type precision

o type recall

o F-measure

$P = \frac{\text{correctly acquired frames}}{\text{all acquired frames}}$

$R = \frac{\text{correctly acquired frames}}{\text{all frames in the gold standard}}$
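As a small illustration of the three scores, the sketch below computes type precision, type recall and F-measure over sets of frame types. The function name `type_scores` and the toy frame sets are hypothetical, and the F-measure is assumed to be the balanced harmonic mean of P and R.

```python
def type_scores(acquired, gold):
    """Type precision, type recall and balanced F-measure (assumed F1)."""
    acquired, gold = set(acquired), set(gold)
    correct = acquired & gold  # correctly acquired frame types
    p = len(correct) / len(acquired) if acquired else 0.0
    r = len(correct) / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Toy frame inventories for one verb (illustrative only)
acquired = {"[]", "[N_C]", "[CHE_C]", "[P_C-a]"}
gold = {"[]", "[N_C]", "[I_C-di]"}
p, r, f = type_scores(acquired, gold)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")
```

Because the scores are computed over frame types rather than tokens, a frame counts once however often it occurs in the corpus.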

Evaluation - Italian (2)

• carried out against three gold standards, plus a manual evaluation

1. IGS1: a general purpose computational lexicon (SIMPLE-PAROLE-CLIPS lexicon)

2. IGS2: Italian dictionary (Sabatini-Coletti 2006)

3. IGS3: merging IGS1 and IGS2

4. Manual evaluation

Evaluation - Italian (3)

                 IGS1    IGS2    IGS3    4 (manual)
SCFs       P     42%     30%     52%     93%
           R     8%      84%     78%     NA
           F     13%     44%     62%     NA
baseline   P     23%     13%     27%     40%
           R     72%     68%     75%     NA
           F     35%     22%     38%     NA

Evaluation - English

• four gold standards

1. EGS1: general purpose computational lexicon (Valex5 Lexicon)

2. EGS2: Longman Dictionary (2006)

3. EGS3: biomedical English lexicon (SPECIALIST Lexicon)

4. EGS4: merging EGS1, EGS2 and EGS3

Evaluation – English (2)

                 EGS1+EGS2    EGS3    EGS4
SCFs       P     69%          52%     83%
           R     48%          54%     51%
           F     57%          53%     63%
baseline   P     28%          17%     33%
           R     52%          49%     53%
           F     36%          25%     41%

2. Verb clustering: summary

• The MDL Principle

• Verb clustering using MDL

Why verb clustering?

• syntax-semantics lexical interface

• starting from the SCFs extracted, we aim at inducing clusters of verbs that share similar semantic properties

• each verb is represented as a vector whose dimensions report its statistical distribution with the automatically extracted SCFs

• a clustering of verb vectors is performed using the Minimum Description Length Principle (MDL)
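The verb representation described above can be sketched as follows. The helper `frame_vector`, the frame inventory and the counts are hypothetical; the sketch only assumes that each dimension holds the relative frequency of the verb with one extracted SCF.

```python
def frame_vector(frame_counts, inventory):
    """Relative frequency of each frame in `inventory` for one verb."""
    total = sum(frame_counts.values())
    if total == 0:
        return [0.0] * len(inventory)
    return [frame_counts.get(frame, 0) / total for frame in inventory]

# Toy counts for one verb (illustrative, not corpus data)
inventory = ["[]", "[che]", "[I-di]", "[N]"]
counts = {"[N]": 6, "[]": 3, "[che]": 1}
vector = frame_vector(counts, inventory)
print(vector)
```

Fixing one shared inventory for all verbs keeps the vectors comparable dimension by dimension, which is what the clustering step needs.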

The MDL Principle

• from information theory (Rissanen 1989)

• model description length: code length in bits for the encoding of the model itself, i.e. the complexity of the model

• data description length: code length in bits for the encoding of the given data observed through the model, i.e. the fit of the model to the data

• MDL: “any regularity in the data can be used to compress the data, i.e. to describe it using fewer symbols than needed to describe the data literally”

$M = \arg\min_{m} \left[ L(m) + L(D \mid m) \right]$

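1) Baseline model: each verb belongs to one class: $M_0 : \{v_1\}, \ldots, \{v_r\}$

2) Compare $M_0$ with any model $M(h,k) : \{v_j\},\ j = 1, \ldots, r,\ j \neq h, k;\ \{v_h, v_k\}$

3) Choose $(n_1, m_1)$ such that $M(n_1, m_1) = \arg\max_{h,k \,:\, L(M_0) - L(M(h,k)) > 0} \left[ L(M_0) - L(M(h,k)) \right]$

4) Cluster together $v_{n_1}$ and $v_{m_1}$ into the class $\{v_{n_1}, v_{m_1}\}$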
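The four steps above can be sketched in Python as a greedy merge loop. The slides do not spell out how the code lengths are computed, so the two-part code used here is an assumption: half of log2(N) bits per free parameter of each class for the model, and the frame occurrences encoded with the pooled frame distribution of their class for the data. Repeating the step until no merge shortens the description is also assumed. The names `description_length` and `mdl_cluster` and the toy counts are hypothetical.

```python
import math
from itertools import combinations

def description_length(classes, counts):
    """Total code length L(m) + L(D|m) in bits under an ASSUMED two-part code."""
    n_frames = len(next(iter(counts.values())))
    total = sum(sum(c) for c in counts.values())
    # L(m): (1/2) log2 N bits per free parameter, one distribution per class
    model_bits = len(classes) * (n_frames - 1) * 0.5 * math.log2(total)
    # L(D|m): frame occurrences coded with the pooled distribution of the class
    data_bits = 0.0
    for cls in classes:
        pooled = [sum(counts[v][i] for v in cls) for i in range(n_frames)]
        size = sum(pooled)
        data_bits -= sum(k * math.log2(k / size) for k in pooled if k > 0)
    return model_bits + data_bits

def mdl_cluster(counts):
    """Greedy clustering: merge the pair of classes with the largest gain."""
    classes = [frozenset([v]) for v in counts]  # baseline M_0: one verb per class
    while len(classes) > 1:
        base = description_length(classes, counts)
        best_gain, best_pair = 0.0, None
        for a, b in combinations(classes, 2):
            merged = [c for c in classes if c not in (a, b)] + [a | b]
            gain = base - description_length(merged, counts)
            if gain > best_gain:  # only merges that shorten the description
                best_gain, best_pair = gain, (a, b)
        if best_pair is None:
            break
        a, b = best_pair
        classes = [c for c in classes if c not in (a, b)] + [a | b]
    return classes

# Toy frame counts per verb over three frames (illustrative only)
toy = {"a": [40, 10, 0], "b": [38, 12, 0], "c": [0, 5, 45]}
clusters = mdl_cluster(toy)
print(clusters)
```

In this toy run the two verbs with similar frame distributions are merged because the saving in model bits outweighs the small loss in data fit, while the third verb stays in its own class.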

Verb clustering using MDL

[Figure: clustering of the Italian communication verbs. Verb labels appearing in the figure: PROMETTERE, RISPONDERE, PARLARE, PROTESTARE, CHIEDERE, DIRE, ASSERIRE, MINACCIARE, COMANDARE, INSEGNARE, AMMONIRE, DICHIARARE, CONFESSARE, CHIARIRE, PROIBIRE, SUGGERIRE, COMUNICARE, ACCETTARE, PROPORRE, MOSTRARE, COMMENTARE, CHIAMARE, PREGARE, DISCUTERE, RIVELARE, RICHIAMARE, RIMPROVERARE, LEGGERE, SPIEGARE, REPLICARE, DESCRIVERE, RICHIEDERE, DENUNCIARE, OFFRIRE, RIMPIANGERE, ORDINARE]

• 47 Italian communication verbs: 23 clustering steps

MDL-clustering

Conclusions

• a preliminary qualitative analysis of induced verb clusters shows encouraging results

• we expect to evaluate the coherence of the obtained lexico-semantic clusters and the coverage of the subcategorization behaviour of clustered verbs

• The verb classes are assigned a new cluster-based frame distribution

                             [ ]     [che]   [I-di]   [N]     [P-a]   [perché]
chiarire ‘clarify’           0.34    0.10    0        0.40    0       0.009
comunicare ‘communicate’     0.24    0.15    0        0.31    0.08    0
proibire ‘forbid’            0.21    0.03    0.03     0.51    0       0
suggerire ‘suggest’          0.24    0.10    0.009    0.42    0.02    0.02
verb class (cluster)         0.25    0.10    0.008    0.41    0.02    0.02

MDL-clustering
