Using DEDICOM for Completely Unsupervised Part-of-Speech Tagging
NAACL-HLT 2009 Workshop on Unsupervised and Minimally Supervised Learning of Lexical Semantics
June 5, 2009
Peter A. Chew, Brett W. Bader (Sandia National Laboratories)
Alla Rozovskaya (University of Illinois, Urbana-Champaign)
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under contract DE-AC04-94AL85000.
Outline
• Previous approaches to part-of-speech (POS) tagging
• The DEDICOM model
• Testing framework
• Preliminary results and discussion
Approaches to POS tagging (1)
• Supervised
– Rule-based (e.g. Harris 1962)
• Dictionary + manually developed rules
• Brittle – approach doesn’t port to new domains
– Stochastic (e.g. Stolz et al. 1965, Church 1988)
• Examples: HMMs, CRFs
• Relies on estimation of emission and transition probabilities from a tagged training corpus
• Again, difficulty in porting to new domains
Approaches to POS tagging (2)
• Unsupervised
– All approaches exploit distributional patterns
– Singular Value Decomposition (SVD) of term-adjacency matrix (Schütze 1993, 1995)
– Graph clustering (Biemann 2006)
– Our approach: DEDICOM of term-adjacency matrix
• Most similar to Schütze (1993, 1995)
• Advantages:
– can be reconciled to stochastic approaches
– like SVD and graph clustering, completely unsupervised
– initial results (to be shown) appear promising
Introduction to DEDICOM
• DEcomposition into DIrectional COMponents
• Harshman (1978)
• A linear-algebraic decomposition method comparable to SVD
• First used for analysis of marketing data
DEDICOM – an example (domain = shampoo marketing!)

Counts of evoked phrases for each stimulus phrase (rows = stimulus phrase, columns = evoked phrase):

Stimulus phrase
Body         44   5  23   1  19  1   3
Fulness      22   5   3   1   9  1   2
Holds set    17  21   5   0  17  0   5
Bouncy       15  12   3   1   5  0  14
Not limp     28  27   4  18   4  1   7
Manageable   17  13  11   2   0  0   3
Zesty         7   9   2  22   0  4  13
Natural       4   9   1   2   0  7   1
Loadings ("A") matrix:

Evoked/stimulus phrase   1 (Thickness)   2 (Vigor)
Body                          0.299        0.252
Fulness                       0.355       -0.158
Holds set                     0.041        0.213
Bouncy                        0.172        0.004
Not limp                     -0.048        0.420
Manageable                    0.150        0.013
Zesty                        -0.043        0.248
Natural                       0.074        0.010

Reduced ("R") matrix:

Stimulus \ Evoked "dimension"   1 (Thickness)   2 (Vigor)
1 (Thickness)                        248            44
2 (Vigor)                            216            24
• DEDICOM decomposes the 8 x 8 matrix into a simplified k x k “summary” (here k = 2), and a matrix showing the loadings for each phrase in each dimension
• A key assumption is that stimulus and evoked phrases are a “single set of objects”
DEDICOM – algebraic details
• Let X be the original data matrix
• Let R be the reduced matrix of directional relationships
• Let A be the “loadings” matrix

X ≈ A R Aᵀ

• Compare to SVD:

X = U S Vᵀ

– U, V and A are all dense
– But R is dense while S is diagonal, and U ≠ V
– In SVD, U and V differ; in DEDICOM, A is repeated as Aᵀ
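The decomposition can be fitted numerically by alternating least squares. The sketch below is our own illustration, not Harshman's original algorithm: the R-update is the exact least-squares solution for fixed A, while the A-update is a common heuristic that treats A's two occurrences separately and solves a stacked least-squares problem.

```python
import numpy as np

def dedicom(X, k, n_iter=25, seed=0):
    """Fit X ~= A @ R @ A.T by simple alternating least squares.

    A (n x k): loadings of each object on the k dimensions.
    R (k x k): asymmetric relationships between the dimensions.
    """
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((X.shape[0], k))
    for _ in range(n_iter):
        # Exact least-squares update of R for fixed A.
        Ap = np.linalg.pinv(A)
        R = Ap @ X @ Ap.T
        # Heuristic update of A: X ~= A @ (R @ A.T) and
        # X.T ~= A @ (R.T @ A.T); solve the stacked problem for A.
        B = np.hstack([R @ A.T, R.T @ A.T])
        A = np.hstack([X, X.T]) @ np.linalg.pinv(B)
    return A, R

# Toy check on an exactly rank-2, asymmetric matrix.
rng = np.random.default_rng(1)
A0 = rng.random((8, 2))
R0 = np.array([[248.0, 44.0], [216.0, 24.0]])
X = A0 @ R0 @ A0.T
A, R = dedicom(X, k=2)
err = np.linalg.norm(X - A @ R @ A.T) / np.linalg.norm(X)
```

On exactly low-rank data this scheme recovers the fit to machine precision; on real count matrices it only approximates X, which is the intended behavior.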
DEDICOM – application to POS tagging
• The assumption that terms are a “single set of objects”, whether they precede or follow, sets DEDICOM apart from SVD and other unsupervised approaches
• This assumption models the fact that tokens play the same syntactic role whether we view them as the first or second element in a bigram
Term adjacency matrix X (rows = preceding term, columns = following term, over terms 1 … n):

term 1   44   5  23   1  19  1   3
term 2   22   5   3   1   9  1   2
term 3   17  21   5   0  17  0   5
term 4   15  12   3   1   5  0  14
term 5   28  27   4  18   4  1   7
term 6   17  13  11   2   0  0   3
…         7   9   2  22   0  4  13
term n    4   9   1   2   0  7   1

‘R’ matrix:

Preceding \ Following POS   1 (det)   2 (noun)
1 (det)                         2       390
2 (noun)                       38        40

‘A’ matrix:

Term       1 (det)   2 (noun)
term 1      0.299     0.252
term 2      0.355    -0.158
term 3      0.041     0.213
term 4      0.172     0.004
term 5     -0.048     0.420
term 6      0.150     0.013
…          -0.043     0.248
term n      0.074     0.010
Comparing DEDICOM output to HMM input
• The output of DEDICOM is, in effect, a transition probability matrix (from R) and an emission probability matrix (from A)
• DEDICOM offers the possibility of getting the familiar transition and emission probabilities without training data
• Output of DEDICOM → input to HMM (after normalization of counts):
– ‘R’ matrix → transition probability matrix
– ‘A’ matrix → emission probability matrix
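The normalization step is just turning rows of counts into rows of probabilities. A minimal sketch, using the toy R and A values from the previous slide; clipping negative loadings to zero is our assumption, since the slides do not specify how negative entries of A are handled:

```python
import numpy as np

def rows_to_probs(M, eps=1e-12):
    """Row-normalize a count-like matrix into a stochastic matrix.
    Negative entries (possible in DEDICOM loadings) are clipped to 0."""
    M = np.clip(M, 0.0, None)
    return M / (M.sum(axis=1, keepdims=True) + eps)

# R (k x k): preceding-POS x following-POS counts -> transition probs.
R = np.array([[2.0, 390.0],
              [38.0, 40.0]])
transition = rows_to_probs(R)

# A (n terms x k dims): loadings -> emission probs, one row per state.
A = np.array([[0.299, 0.252],
              [0.355, -0.158],
              [0.041, 0.213]])
emission = rows_to_probs(A.T)   # k states x n terms
```

Each row of `transition` and `emission` then sums to 1, which is exactly the form an HMM decoder expects.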
Validation: method 1 (theoretical)
• Hypothetical example – suppose a tagged training corpus exists

Corpus:
The man walked the big dog
DT  NN  VBD    DT  JJ  NN

X: sparse matrix of bigram counts

          the  man  walked  big  dog  ROWSUM
the        0    1      0     1    0      2
man        0    0      1     0    0      1
walked     1    0      0     0    0      1
big        0    0      0     0    1      1
dog        0    0      0     0    0      0
COLSUM     1    1      1     1    1      5

A*: term-tag counts

          DT  NN  VBD  JJ  ROWSUM
the        2   0    0   0      2
man        0   1    0   0      1
walked     0   0    1   0      1
big        0   0    0   1      1
dog        0   1    0   0      1
COLSUM     2   2    1   1      6

R*: tag-adjacency counts

          DT  NN  VBD  JJ  ROWSUM
DT         0   1    0   1      2
NN         0   0    1   0      1
VBD        1   0    0   0      1
JJ         0   1    0   0      1
COLSUM     1   2    1   1      5

• By definition (subject to a difference of 1 for the final token):
– rowsums of X = colsums of X = rowsums of A*
– colsums of A* = rowsums of R* = colsums of R*
Validation: method 1 (theoretical)
• To turn A* and R* into emission and transition probability matrices, we simply multiply each by a diagonal matrix D whose entries are the inverses of the tag totals (the column sums of A*; by the identities above, these equal the rowsums of R* up to the final-token difference)
• But if the DEDICOM model is a good one, we should be able to multiply A* D R* D (A*)ᵀ to approximate the original matrix X
• In this case, A* D R* D (A*)ᵀ =

          the  man  walked  big  dog  ROWSUM
the        0   0.5     0     1   0.5     2
man        0    0     0.5    0    0     0.5
walked     1    0      0     0    0      1
big        0   0.5     0     0   0.5     1
dog        0    0     0.5    0    0     0.5
COLSUM     1    1      1     1    1      5
• Not only does this approximate X; it also captures syntactic regularities which aren’t instantiated in the corpus – for example, ‘the dog’ never occurs as a bigram, yet receives a nonzero value (this kind of generalization is one reason HMM-based POS tagging is successful)
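The toy computation above can be reproduced directly; D here is the inverse of the tag-count vector (the column sums of A*):

```python
import numpy as np

# "The man walked the big dog" tagged DT NN VBD DT JJ NN.
# Rows: the, man, walked, big, dog; columns: DT, NN, VBD, JJ.
A_star = np.array([[2, 0, 0, 0],    # the    -> DT (twice)
                   [0, 1, 0, 0],    # man    -> NN
                   [0, 0, 1, 0],    # walked -> VBD
                   [0, 0, 0, 1],    # big    -> JJ
                   [0, 1, 0, 0]],   # dog    -> NN
                  dtype=float)

# Tag-adjacency counts: DT->NN, DT->JJ, NN->VBD, VBD->DT, JJ->NN.
R_star = np.array([[0, 1, 0, 1],
                   [0, 0, 1, 0],
                   [1, 0, 0, 0],
                   [0, 1, 0, 0]], dtype=float)

# D: inverses of the tag totals (column sums of A*).
D = np.diag(1.0 / A_star.sum(axis=0))

# Reconstruction A* D R* D (A*)^T approximates the bigram matrix X.
X_hat = A_star @ D @ R_star @ D @ A_star.T

# 'the dog' never occurs in the corpus, yet X_hat[0, 4] == 0.5
```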
Validation: method 2 (empirical)
• Use a tagged corpus (CoNLL-2000) as gold standard
– CoNLL-2000 has 19,440 distinct terms
– There are 44 distinct tags in the tagset
• Tabulate X matrix (solely from bigram frequencies, blind to tags)
• Apply DEDICOM to ‘learn’ emission and transition probability matrices
• Use these as input to an HMM; tag each token with a numerical index (one of the DEDICOM ‘dimensions’)
• Evaluate by looking at correlation of induced tags with gold standard tags in a confusion matrix
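The first step, tabulating X solely from bigram frequencies with no reference to tags, can be sketched on the toy sentence:

```python
import numpy as np

# Build the term-adjacency matrix X from bigram counts alone
# (blind to tags), as in the empirical validation setup.
tokens = "the man walked the big dog".split()
vocab = sorted(set(tokens))
idx = {t: i for i, t in enumerate(vocab)}

# X[i, j] = number of times term i immediately precedes term j.
X = np.zeros((len(vocab), len(vocab)))
for prev, nxt in zip(tokens, tokens[1:]):
    X[idx[prev], idx[nxt]] += 1
```

For CoNLL-2000 the same loop yields a sparse 19,440 × 19,440 matrix, so in practice a sparse representation would be used instead of a dense array.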
Validation: method 2 (empirical)
• Examples of DEDICOM dimensions or clusters:
Tag  Top 10 types (by weight)
1    million (0.0246), share (0.0146), said (0.0129), . (0.0098), year (0.0088), billion (0.0069), inc. (0.0064), corp. (0.0061), years (0.0058), quarter (0.0054)
2    company (0.0264), u.s. (0.0136), new (0.0113), first (0.0095), market (0.0086), share (0.0086), year (0.0079), stock (0.0077), . (0.0065), government (0.0060)
3    the (0.2889), a (0.1194), new (0.0121), an (0.0094), other (0.0092), its (0.0085), any (0.0067), addition (0.0062), their (0.0062), 1988 (0.0057)
…
8    the (0.0935), its (0.0462), his (0.0208), about (0.0160), those (0.0096), their (0.0095), all (0.0088), u.s. (0.0077), . (0.0074), this (0.0071)
…
Validation: method 2 (empirical)

• Confusion matrix: correlation with ‘ideal’ diagonal matrix = 0.494
• Ideally, the confusion matrix would have one DEDICOM class per ‘gold standard’ tag – either a diagonal matrix or some permutation thereof – although this assumes the gold standard is the optimal tagging scheme
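The slides do not spell out how the correlation score is computed; one plausible reading is the Pearson correlation between the flattened confusion matrix and a same-shaped identity matrix, assuming each induced class has already been matched to the gold tag it best corresponds to:

```python
import numpy as np

def diagonal_correlation(confusion):
    """Pearson correlation between a flattened confusion matrix and a
    same-shaped identity matrix - one plausible reading of the
    'correlation with the ideal diagonal' metric. Assumes induced
    classes are already aligned with their best-matching gold tags."""
    C = np.asarray(confusion, dtype=float)
    I = np.eye(C.shape[0])
    return np.corrcoef(C.ravel(), I.ravel())[0, 1]

clean = np.eye(4) * 5.0                    # one induced class per tag
confused = np.array([[2., 1., 1., 1.],
                     [1., 2., 1., 1.],
                     [2., 1., 1., 1.],     # class 3 leaks into tag 1
                     [1., 1., 1., 2.]])
```

A perfectly diagonal confusion matrix scores 1.0; mass spread off the diagonal pulls the score toward 0, matching the intuition behind the reported 0.494.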
Conclusions

• DEDICOM, like other completely unsupervised POS-tagging methods, is hard to evaluate empirically
• But we believe it holds promise because:
– unlike other unsupervised approaches, it can be reconciled to stochastic approaches (like HMMs) which have a successful track record
– unlike traditional stochastic approaches, it is truly completely unsupervised
– initial objective and subjective results do appear promising

Future work

• We believe the key to evaluating DEDICOM, or other methods of POS tagging, is to do so within a larger system
• For example, use DEDICOM to disambiguate tokens which are ambiguous w.r.t. part of speech
– e.g. ‘claims’ (NN) versus ‘claims’ (VBZ)
• Then use this, for example, within an information retrieval system to establish separate indices (rows in a term-by-document matrix) for disambiguated terms
• Evaluate based on standard metrics such as precision; see if DEDICOM-based disambiguation results in improved precision