Using DEDICOM for Completely Unsupervised Part-of-Speech Tagging
NAACL-HLT 2009 Workshop on Unsupervised and Minimally Supervised Learning of Lexical Semantics
June 5, 2009
Peter A. Chew, Brett W. Bader (Sandia National Laboratories)
Alla Rozovskaya (University of Illinois, Urbana-Champaign)
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under contract DE-AC04-94AL85000.
Outline
• Previous approaches to part-of-speech (POS) tagging
• The DEDICOM model
• Testing framework
• Preliminary results and discussion
Approaches to POS tagging (1)
• Supervised
– Rule-based (e.g. Harris 1962)
• Dictionary + manually developed rules
• Brittle – approach doesn’t port to new domains
– Stochastic (e.g. Stolz et al. 1965, Church 1988)
• Examples: HMMs, CRFs
• Relies on estimation of emission and transition probabilities from a tagged training corpus
• Again, difficulty in porting to new domains
Approaches to POS tagging (2)
• Unsupervised
– All approaches exploit distributional patterns
– Singular Value Decomposition (SVD) of term-adjacency matrix (Schütze 1993, 1995)
– Graph clustering (Biemann 2006)
– Our approach: DEDICOM of term-adjacency matrix
• Most similar to Schütze (1993, 1995)
• Advantages:
– can be reconciled to stochastic approaches
– like SVD and graph clustering, completely unsupervised
– initial results (to be shown) appear promising
Introduction to DEDICOM
• DEcomposition into DIrectional COMponents
• Harshman (1978)
• A linear-algebraic decomposition method comparable to SVD
• First used for analysis of marketing data
DEDICOM – an example (domain = shampoo marketing!)

Counts of evoked phrases for each stimulus phrase (rows = stimulus phrase, columns = evoked phrase):

Stimulus phrase
Body         44   5  23   1  19  1   3
Fulness      22   5   3   1   9  1   2
Holds set    17  21   5   0  17  0   5
Bouncy       15  12   3   1   5  0  14
Not limp     28  27   4  18   4  1   7
Manageable   17  13  11   2   0  0   3
Zesty         7   9   2  22   0  4  13
Natural       4   9   1   2   0  7   1
Loadings ("A") matrix:

Evoked/stimulus phrase   1 (Thickness)   2 (Vigor)
Body                          0.299        0.252
Fulness                       0.355       -0.158
Holds set                     0.041        0.213
Bouncy                        0.172        0.004
Not limp                     -0.048        0.420
Manageable                    0.150        0.013
Zesty                        -0.043        0.248
Natural                       0.074        0.010

Reduced ("R") matrix:

Stimulus \ Evoked "dimension"   1 (Thickness)   2 (Vigor)
1 (Thickness)                        248            44
2 (Vigor)                            216            24
• DEDICOM decomposes the 8 x 8 matrix into a simplified k x k “summary” (here k = 2), and a matrix showing the loadings for each phrase in each dimension
• A key assumption is that stimulus and evoked phrases are a “single set of objects”
DEDICOM – algebraic details
• Let X be the original data matrix
• Let R be the reduced matrix of directional relationships
• Let A be the “loadings” matrix

X ≈ A R Aᵀ

• Compare to SVD:

X = U S Vᵀ

– U, V and A are all dense
– But R is dense while S is diagonal, and U ≠ V
– In SVD, U and V differ; in DEDICOM, A is repeated as Aᵀ
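The decomposition can be fitted numerically by alternating least squares. The sketch below is our own illustration, not Harshman's original algorithm: the R-update is the exact least-squares solution for fixed A, while the A-update is a common heuristic that treats A's two occurrences separately and solves a stacked least-squares problem.

```python
import numpy as np

def dedicom(X, k, n_iter=25, seed=0):
    """Fit X ~= A @ R @ A.T by simple alternating least squares.

    A (n x k): loadings of each object on the k dimensions.
    R (k x k): asymmetric relationships between the dimensions.
    """
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((X.shape[0], k))
    for _ in range(n_iter):
        # Exact least-squares update of R for fixed A.
        Ap = np.linalg.pinv(A)
        R = Ap @ X @ Ap.T
        # Heuristic update of A: X ~= A @ (R @ A.T) and
        # X.T ~= A @ (R.T @ A.T); solve the stacked problem for A.
        B = np.hstack([R @ A.T, R.T @ A.T])
        A = np.hstack([X, X.T]) @ np.linalg.pinv(B)
    return A, R

# Toy check on an exactly rank-2, asymmetric matrix.
rng = np.random.default_rng(1)
A0 = rng.random((8, 2))
R0 = np.array([[248.0, 44.0], [216.0, 24.0]])
X = A0 @ R0 @ A0.T
A, R = dedicom(X, k=2)
err = np.linalg.norm(X - A @ R @ A.T) / np.linalg.norm(X)
```

On exactly low-rank data this scheme recovers the fit to machine precision; on real count matrices it only approximates X, which is the intended behavior.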
DEDICOM – application to POS tagging
• The assumption that terms are a “single set of objects”, whether they precede or follow, sets DEDICOM apart from SVD and other unsupervised approaches
• This assumption models the fact that tokens play the same syntactic role whether we view them as the first or second element in a bigram
Term adjacency matrix X (rows = preceding term, columns = following term, over terms 1 … n):

term 1   44   5  23   1  19  1   3
term 2   22   5   3   1   9  1   2
term 3   17  21   5   0  17  0   5
term 4   15  12   3   1   5  0  14
term 5   28  27   4  18   4  1   7
term 6   17  13  11   2   0  0   3
…         7   9   2  22   0  4  13
term n    4   9   1   2   0  7   1

‘R’ matrix:

Preceding \ Following POS   1 (det)   2 (noun)
1 (det)                         2       390
2 (noun)                       38        40

‘A’ matrix:

Term       1 (det)   2 (noun)
term 1      0.299     0.252
term 2      0.355    -0.158
term 3      0.041     0.213
term 4      0.172     0.004
term 5     -0.048     0.420
term 6      0.150     0.013
…          -0.043     0.248
term n      0.074     0.010
Comparing DEDICOM output to HMM input
• The output of DEDICOM is, in effect, a transition probability matrix (from R) and an emission probability matrix (from A)
• DEDICOM offers the possibility of getting the familiar transition and emission probabilities without training data
• Output of DEDICOM → input to HMM (after normalization of counts):
– ‘R’ matrix → transition probability matrix
– ‘A’ matrix → emission probability matrix
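The normalization step is just turning rows of counts into rows of probabilities. A minimal sketch, using the toy R and A values from the previous slide; clipping negative loadings to zero is our assumption, since the slides do not specify how negative entries of A are handled:

```python
import numpy as np

def rows_to_probs(M, eps=1e-12):
    """Row-normalize a count-like matrix into a stochastic matrix.
    Negative entries (possible in DEDICOM loadings) are clipped to 0."""
    M = np.clip(M, 0.0, None)
    return M / (M.sum(axis=1, keepdims=True) + eps)

# R (k x k): preceding-POS x following-POS counts -> transition probs.
R = np.array([[2.0, 390.0],
              [38.0, 40.0]])
transition = rows_to_probs(R)

# A (n terms x k dims): loadings -> emission probs, one row per state.
A = np.array([[0.299, 0.252],
              [0.355, -0.158],
              [0.041, 0.213]])
emission = rows_to_probs(A.T)   # k states x n terms
```

Each row of `transition` and `emission` then sums to 1, which is exactly the form an HMM decoder expects.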
Validation: method 1 (theoretical)
• Hypothetical example – suppose a tagged training corpus exists

Corpus:
The man walked the big dog
DT  NN  VBD    DT  JJ  NN

X: sparse matrix of bigram counts

          the  man  walked  big  dog  ROWSUM
the        0    1      0     1    0      2
man        0    0      1     0    0      1
walked     1    0      0     0    0      1
big        0    0      0     0    1      1
dog        0    0      0     0    0      0
COLSUM     1    1      1     1    1      5

A*: term-tag counts

          DT  NN  VBD  JJ  ROWSUM
the        2   0    0   0      2
man        0   1    0   0      1
walked     0   0    1   0      1
big        0   0    0   1      1
dog        0   1    0   0      1
COLSUM     2   2    1   1      6

R*: tag-adjacency counts

          DT  NN  VBD  JJ  ROWSUM
DT         0   1    0   1      2
NN         0   0    1   0      1
VBD        1   0    0   0      1
JJ         0   1    0   0      1
COLSUM     1   2    1   1      5

• By definition (subject to a difference of 1 for the final token):
– rowsums of X = colsums of X = rowsums of A*
– colsums of A* = rowsums of R* = colsums of R*
Validation: method 1 (theoretical)
• To turn A* and R* into emission and transition probability matrices, we simply multiply each by a diagonal matrix D whose entries are the inverses of the tag totals (the column sums of A*; by the identities above, these equal the rowsums of R* up to the final-token difference)
• But if the DEDICOM model is a good one, we should be able to multiply A* D R* D (A*)ᵀ to approximate the original matrix X
• In this case, A* D R* D (A*)ᵀ =

          the  man  walked  big  dog  ROWSUM
the        0   0.5     0     1   0.5     2
man        0    0     0.5    0    0     0.5
walked     1    0      0     0    0      1
big        0   0.5     0     0   0.5     1
dog        0    0     0.5    0    0     0.5
COLSUM     1    1      1     1    1      5
• Not only does this approximate X; it also captures syntactic regularities which aren’t instantiated in the corpus – for example, ‘the dog’ never occurs as a bigram, yet receives a nonzero value (this kind of generalization is one reason HMM-based POS tagging is successful)
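The toy computation above can be reproduced directly; D here is the inverse of the tag-count vector (the column sums of A*):

```python
import numpy as np

# "The man walked the big dog" tagged DT NN VBD DT JJ NN.
# Rows: the, man, walked, big, dog; columns: DT, NN, VBD, JJ.
A_star = np.array([[2, 0, 0, 0],    # the    -> DT (twice)
                   [0, 1, 0, 0],    # man    -> NN
                   [0, 0, 1, 0],    # walked -> VBD
                   [0, 0, 0, 1],    # big    -> JJ
                   [0, 1, 0, 0]],   # dog    -> NN
                  dtype=float)

# Tag-adjacency counts: DT->NN, DT->JJ, NN->VBD, VBD->DT, JJ->NN.
R_star = np.array([[0, 1, 0, 1],
                   [0, 0, 1, 0],
                   [1, 0, 0, 0],
                   [0, 1, 0, 0]], dtype=float)

# D: inverses of the tag totals (column sums of A*).
D = np.diag(1.0 / A_star.sum(axis=0))

# Reconstruction A* D R* D (A*)^T approximates the bigram matrix X.
X_hat = A_star @ D @ R_star @ D @ A_star.T

# 'the dog' never occurs in the corpus, yet X_hat[0, 4] == 0.5
```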
Validation: method 2 (empirical)
• Use a tagged corpus (CoNLL-2000) as gold standard
– CoNLL-2000 has 19,440 distinct terms
– There are 44 distinct tags in the tagset
• Tabulate X matrix (solely from bigram frequencies, blind to tags)
• Apply DEDICOM to ‘learn’ emission and transition probability matrices
• Use these as input to an HMM; tag each token with a numerical index (one of the DEDICOM ‘dimensions’)
• Evaluate by looking at correlation of induced tags with gold standard tags in a confusion matrix
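The first step, tabulating X solely from bigram frequencies with no reference to tags, can be sketched on the toy sentence:

```python
import numpy as np

# Build the term-adjacency matrix X from bigram counts alone
# (blind to tags), as in the empirical validation setup.
tokens = "the man walked the big dog".split()
vocab = sorted(set(tokens))
idx = {t: i for i, t in enumerate(vocab)}

# X[i, j] = number of times term i immediately precedes term j.
X = np.zeros((len(vocab), len(vocab)))
for prev, nxt in zip(tokens, tokens[1:]):
    X[idx[prev], idx[nxt]] += 1
```

For CoNLL-2000 the same loop yields a sparse 19,440 × 19,440 matrix, so in practice a sparse representation would be used instead of a dense array.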
Validation: method 2 (empirical)
• Examples of DEDICOM dimensions or clusters:
Tag  Top 10 types (by weight)
1    million (0.0246), share (0.0146), said (0.0129), . (0.0098), year (0.0088), billion (0.0069), inc. (0.0064), corp. (0.0061), years (0.0058), quarter (0.0054)
2    company (0.0264), u.s. (0.0136), new (0.0113), first (0.0095), market (0.0086), share (0.0086), year (0.0079), stock (0.0077), . (0.0065), government (0.0060)
3    the (0.2889), a (0.1194), new (0.0121), an (0.0094), other (0.0092), its (0.0085), any (0.0067), addition (0.0062), their (0.0062), 1988 (0.0057)
…
8    the (0.0935), its (0.0462), his (0.0208), about (0.0160), those (0.0096), their (0.0095), all (0.0088), u.s. (0.0077), . (0.0074), this (0.0071)
…
Validation: method 2 (empirical)

• Confusion matrix: correlation with ‘ideal’ diagonal matrix = 0.494
• Ideally, the confusion matrix would have one DEDICOM class per ‘gold standard’ tag – either a diagonal matrix or some permutation thereof – although this assumes the gold standard is the optimal tagging scheme
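The slides do not spell out how the correlation score is computed; one plausible reading is the Pearson correlation between the flattened confusion matrix and a same-shaped identity matrix, assuming each induced class has already been matched to the gold tag it best corresponds to:

```python
import numpy as np

def diagonal_correlation(confusion):
    """Pearson correlation between a flattened confusion matrix and a
    same-shaped identity matrix - one plausible reading of the
    'correlation with the ideal diagonal' metric. Assumes induced
    classes are already aligned with their best-matching gold tags."""
    C = np.asarray(confusion, dtype=float)
    I = np.eye(C.shape[0])
    return np.corrcoef(C.ravel(), I.ravel())[0, 1]

clean = np.eye(4) * 5.0                    # one induced class per tag
confused = np.array([[2., 1., 1., 1.],
                     [1., 2., 1., 1.],
                     [2., 1., 1., 1.],     # class 3 leaks into tag 1
                     [1., 1., 1., 2.]])
```

A perfectly diagonal confusion matrix scores 1.0; mass spread off the diagonal pulls the score toward 0, matching the intuition behind the reported 0.494.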
Conclusions

• DEDICOM, like other completely unsupervised POS-tagging methods, is hard to evaluate empirically
• But we believe it holds promise because:
– unlike other unsupervised approaches, it can be reconciled to stochastic approaches (like HMMs) which have a successful track record
– unlike traditional stochastic approaches, it is truly completely unsupervised
– initial objective and subjective results do appear promising

Future work

• We believe the key to evaluating DEDICOM, or other methods of POS tagging, is to do so within a larger system
• For example, use DEDICOM to disambiguate tokens which are ambiguous w.r.t. part of speech
– e.g. ‘claims’ (NN) versus ‘claims’ (VBZ)
• Then use this, for example, within an information retrieval system to establish separate indices (rows in a term-by-document matrix) for disambiguated terms
• Evaluate based on standard metrics such as precision; see if DEDICOM-based disambiguation results in improved precision