Spectral Learning Algorithms for Natural Language Processing
TRANSCRIPT

Spectral Learning Algorithms for Natural Language Processing
Shay Cohen1, Michael Collins1, Dean Foster2, Karl Stratos1
and Lyle Ungar2
1Columbia University
2University of Pennsylvania
June 10, 2013
Spectral Learning for NLP 1
Latent-variable Models
Latent-variable models are used in many areas of NLP, speech, etc.:

- Latent-variable PCFGs (Matsuzaki et al.; Petrov et al.)
- Hidden Markov Models
- Naive Bayes for clustering
- Lexical representations: Brown clustering, Saul and Pereira, etc.
- Alignments in statistical machine translation
- Topic modeling
- etc.

The Expectation-Maximization (EM) algorithm is generally used for estimation in these models (Dempster et al., 1977)

Other relevant algorithms: co-training, clustering methods
Example 1: Latent-Variable PCFGs (Matsuzaki et al., 2005; Petrov et al., 2006)
(S (NP (D the) (N dog)) (VP (V saw) (P him)))

=⇒

(S1 (NP3 (D1 the) (N2 dog)) (VP2 (V4 saw) (P1 him)))
Example 2: Hidden Markov Models
S1 → S2 → S3 → S4
the   dog   saw   him

Parameterized by π(s), t(s|s′) and o(w|s)

EM is used for learning the parameters
Example 3: Naïve Bayes
H
X Y
p(h, x, y) = p(h)× p(x|h)× p(y|h)
(the, dog), (I, saw), (ran, to), (John, was), ...

- EM can be used to estimate parameters
Example 4: Brown Clustering and Related Models
[Diagram: w1 and w2 with their classes C(w1), C(w2)]

p(w2|w1) = p(C(w2)|C(w1)) × p(w2|C(w2)) (Brown et al., 1992)

[Diagram: hidden variable h between w1 and w2]

p(w2|w1) = Σ_h p(h|w1) × p(w2|h) (Saul and Pereira, 1997)
Example 5: IBM Translation Models
null  Por favor , desearía reservar una habitación .
Please , I would like to book a room .

Hidden variables are the alignments

EM is used to estimate the parameters
Example 6: HMMs for Speech
Phoneme boundaries are hidden variables
Co-training (Blum and Mitchell, 1998)
Examples come in pairs

Each view is assumed to be sufficient for classification

E.g. Collins and Singer (1999):

. . . , says Mr. Cooper, a vice president of . . .

- View 1. Spelling features: "Mr.", "Cooper"
- View 2. Contextual features: appositive=president
Spectral Methods
Basic idea: replace EM (or co-training) with methods based on matrix decompositions, in particular singular value decomposition (SVD)

SVD: given a matrix A with m rows and n columns, approximate it as

A_jk ≈ Σ_{h=1}^d σ_h U_{jh} V_{kh}

where the σ_h are "singular values", and U and V are m × d and n × d matrices

Remarkably, we can find the optimal rank-d approximation efficiently
Similarity of SVD to Naïve Bayes

H

X Y

P(X = x, Y = y) = Σ_{h=1}^d p(h) p(x|h) p(y|h)

A_jk ≈ Σ_{h=1}^d σ_h U_{jh} V_{kh}

- The SVD approximation minimizes squared loss, not log-loss
- The σ_h are not interpretable as probabilities
- U_{jh}, V_{kh} may be positive or negative, not probabilities

BUT we can still do a lot with SVD (and higher-order, tensor-based decompositions)
CCA vs. Co-training
- Co-training assumption: 2 views, each sufficient for classification
  - Several heuristic algorithms developed for this setting
- Canonical correlation analysis:
  - Take paired examples x(i),1, x(i),2
  - Transform to z(i),1, z(i),2
  - The z's are linear projections of the x's
  - Projections are chosen to maximize correlation between z1 and z2
  - Solvable using SVD!
  - Strong guarantees in several settings
One Example of CCA: Lexical Representations
- x ∈ R^d is a word

  dog = (0, 0, ..., 0, 1, 0, ..., 0, 0) ∈ R^{200,000}

- y ∈ R^{d′} is its context information

  dog-context = (11, 0, ..., 0, 917, 3, 0, ..., 0) ∈ R^{400,000}

- Use CCA on x and y to derive a low-dimensional x ∈ R^k

  dog = (0.03, −1.2, ..., 1.5) ∈ R^{100}
Spectral Learning of HMMs and L-PCFGs
Simple algorithms: require an SVD, then the method of moments in a low-dimensional space

Close connection to CCA

Guaranteed to learn (unlike EM) under assumptions on the singular values in the SVD
Spectral Methods in NLP
- Balle, Quattoni, Carreras, ECML 2011 (learning of finite-state transducers)
- Luque, Quattoni, Balle, Carreras, EACL 2012 (dependency parsing)
- Dhillon et al., 2012 (dependency parsing)
- Cohen et al., 2012, 2013 (latent-variable PCFGs)
Overview
Basic concepts
- Linear Algebra Refresher
- Singular Value Decomposition
- Canonical Correlation Analysis: Algorithm
- Canonical Correlation Analysis: Justification
Lexical representations
Hidden Markov models
Latent-variable PCFGs
Conclusion
Matrices
A ∈ R^{m×n}: m rows, n columns

A = [ 3 1 4
      0 2 5 ]

"matrix of dimensions m by n": here A ∈ R^{2×3}
Vectors
u ∈ R^n: n entries

u = ( 0
      2
      1 )

"vector of dimension n": here u ∈ R^3
Matrix Transpose
- A⊤ ∈ R^{n×m} is the transpose of A ∈ R^{m×n}

A = [ 3 1 4          A⊤ = [ 3 0
      0 2 5 ]  =⇒           1 2
                             4 5 ]
Matrix Multiplication
Matrices B ∈ R^{m×d} and C ∈ R^{d×n} multiply to give

A = B C
(m×n) = (m×d)(d×n)
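As a quick sanity check on the dimensions, here is a minimal NumPy sketch (the particular matrices are made up for illustration):

```python
import numpy as np

# B is m x d, C is d x n; the product A = B C is m x n.
m, d, n = 2, 4, 3
B = np.ones((m, d))
C = np.ones((d, n))
A = B @ C

print(A.shape)    # (2, 3)
print(A.T.shape)  # the transpose swaps dimensions: (3, 2)
```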
Singular Value Decomposition (SVD)
A (m×n) = Σ_{i=1}^d σ_i u_i (v_i)⊤

where each σ_i is a scalar, u_i is m×1, and (v_i)⊤ is 1×n

- d = min(m, n)
- σ1 ≥ ... ≥ σd ≥ 0
- u1 ... ud ∈ R^m are orthonormal: ||u_i||_2 = 1, u_i · u_j = 0 for all i ≠ j
- v1 ... vd ∈ R^n are orthonormal: ||v_i||_2 = 1, v_i · v_j = 0 for all i ≠ j
SVD in Matrix Form
A = U Σ V⊤
(m×n) = (m×d)(d×d)(d×n)

U = [u1 ... ud] ∈ R^{m×d}

Σ = diag(σ1, ..., σd) ∈ R^{d×d}

V = [v1 ... vd] ∈ R^{n×d}
Matrix Rank
A ∈ R^{m×n}, rank(A) ≤ min(m, n)

- rank(A) := number of linearly independent columns in A

[ 1 1 2        [ 1 1 2
  1 2 2          1 2 2
  1 1 2 ]        1 1 3 ]

 rank 2         rank 3 (full-rank)
Matrix Rank: Alternative Definition
- rank(A) := number of positive singular values of A

[ 1 1 2        [ 1 1 2
  1 2 2          1 2 2
  1 1 2 ]        1 1 3 ]

Σ = diag(4.53, 0.7, 0)        Σ = diag(5, 0.98, 0.2)

 rank 2         rank 3 (full-rank)
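The two example matrices above can be checked directly in NumPy; a sketch:

```python
import numpy as np

# The two matrices from the slide: identical except for the bottom-right entry.
A2 = np.array([[1, 1, 2], [1, 2, 2], [1, 1, 2]], dtype=float)
A3 = np.array([[1, 1, 2], [1, 2, 2], [1, 1, 3]], dtype=float)

# Rank = number of (numerically) positive singular values.
print(np.round(np.linalg.svd(A2, compute_uv=False), 2))  # last value ~0 -> rank 2
print(np.round(np.linalg.svd(A3, compute_uv=False), 2))  # all positive -> rank 3
print(np.linalg.matrix_rank(A2), np.linalg.matrix_rank(A3))  # 2 3
```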
SVD and Low-Rank Matrix Approximation
- Suppose we want to find B* such that

  B* = argmin_{B : rank(B) = r} Σ_{jk} (A_jk − B_jk)²

- Solution:

  B* = Σ_{i=1}^r σ_i u_i (v_i)⊤
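A minimal sketch of this fact, on a random matrix of my own choosing: truncating the SVD to the top r triples gives a rank-r matrix whose squared error is exactly the sum of the discarded squared singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 5))
r = 2

# Keep only the top-r singular triples: this is the optimal rank-r B*.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
B = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

# The squared error equals the sum of the discarded squared singular values.
err = np.sum((A - B) ** 2)
print(np.isclose(err, np.sum(s[r:] ** 2)))  # True
```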
SVD in Practice
- Black box, e.g., in Matlab
- Input: a matrix A; output: scalars σ1 ... σd and vectors u1 ... ud and v1 ... vd
- Efficient implementations
- Approximate, randomized approaches also available
- Can be used to solve a variety of optimization problems
  - For instance, Canonical Correlation Analysis (CCA)
Canonical Correlation Analysis (CCA)
- Data consists of paired samples: (x(i), y(i)) for i = 1 ... n
- As in co-training, x(i) ∈ R^d and y(i) ∈ R^{d′} are two "views" of a sample point

View 1                      View 2
x(1) = (1, 0, 0, 0)         y(1) = (1, 0, 0, 1, 0, 1, 0)
x(2) = (0, 0, 1, 0)         y(2) = (0, 1, 0, 0, 0, 0, 1)
...                         ...
x(100000) = (0, 1, 0, 0)    y(100000) = (0, 0, 1, 0, 1, 1, 1)
Example of Paired Data: Webpage Classification (Blum and Mitchell, 98)

- Determine if a webpage is a course home page

[Diagram: a course home page ("Announcements ... Lectures ... TAs ... Information") with incoming links from an instructor's homepage, a TA's homepage, and two other course pages]

- View 1. Words on the page: "Announcements", "Lectures"
- View 2. Identities of pages pointing to the page: instructor's home page, related course home pages
- Each view is sufficient for the classification!
Example of Paired Data: Named Entity Recognition (Collins and Singer, 99)

- Identify an entity's type as either Organization, Person, or Location

. . . , says Mr. Cooper, a vice president of . . .

- View 1. Spelling features: "Mr.", "Cooper"
- View 2. Contextual features: appositive=president
- Each view is sufficient to determine the entity's type!
Example of Paired Data: Bigram Model
H
X Y
p(h, x, y) = p(h)× p(x|h)× p(y|h)
(the, dog), (I, saw), (ran, to), (John, was), ...

- EM can be used to estimate the parameters of the model
- Alternatively, CCA can be used to derive vectors which can be used in a predictor

the =⇒ (0.3, ..., 1.1)        dog =⇒ (−1.5, ..., −0.4)
Projection Matrices
- Project samples to a lower-dimensional space

  x ∈ R^d =⇒ x′ ∈ R^p

- If p is small, we can learn with far fewer samples!
- CCA finds projection matrices A ∈ R^{d×p}, B ∈ R^{d′×p}
- The new data points are a(i) ∈ R^p, b(i) ∈ R^p where

  a(i) = A⊤ x(i)            b(i) = B⊤ y(i)
  (p×1) = (p×d)(d×1)        (p×1) = (p×d′)(d′×1)
Mechanics of CCA: Step 1
- Compute CXY ∈ R^{d×d′}, CXX ∈ R^{d×d}, and CYY ∈ R^{d′×d′}:

  [CXY]_jk = (1/n) Σ_{i=1}^n (x(i)_j − x̄_j)(y(i)_k − ȳ_k)

  [CXX]_jk = (1/n) Σ_{i=1}^n (x(i)_j − x̄_j)(x(i)_k − x̄_k)

  [CYY]_jk = (1/n) Σ_{i=1}^n (y(i)_j − ȳ_j)(y(i)_k − ȳ_k)

  where x̄ = Σ_i x(i)/n and ȳ = Σ_i y(i)/n
Mechanics of CCA: Step 2
- Do an SVD on C_XX^{−1/2} C_XY C_YY^{−1/2} ∈ R^{d×d′}:

  C_XX^{−1/2} C_XY C_YY^{−1/2} = U Σ V⊤ (SVD)

  Let U_p ∈ R^{d×p} be the top p left singular vectors, and V_p ∈ R^{d′×p} the top p right singular vectors.
Mechanics of CCA: Step 3
- Define projection matrices A ∈ R^{d×p} and B ∈ R^{d′×p}:

  A = C_XX^{−1/2} U_p        B = C_YY^{−1/2} V_p

- Use A and B to project each (x(i), y(i)) for i = 1 ... n:

  x(i) ∈ R^d =⇒ A⊤x(i) ∈ R^p
  y(i) ∈ R^{d′} =⇒ B⊤y(i) ∈ R^p
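The three steps can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation from the tutorial; the small ridge term `reg` and the toy two-view data are assumptions added for numerical stability and demonstration.

```python
import numpy as np

def cca(X, Y, p, reg=1e-8):
    """Steps 1-3: covariances, SVD of the whitened cross-covariance, projections.
    X is n x d, Y is n x d'; returns A (d x p) and B (d' x p)."""
    Xc = X - X.mean(axis=0)                 # Step 1: center, form covariances
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n

    def inv_sqrt(C):                        # symmetric inverse square root
        w, V = np.linalg.eigh(C)
        return V @ np.diag(w ** -0.5) @ V.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)  # Step 2
    return Wx @ U[:, :p], Wy @ Vt[:p, :].T   # Step 3: A and B

# Toy data: both views contain the same hidden signal h plus noise dimensions.
rng = np.random.default_rng(0)
h = rng.normal(size=(500, 1))
X = np.hstack([h, rng.normal(size=(500, 3))])
Y = np.hstack([-h, rng.normal(size=(500, 4))])
A, B = cca(X, Y, p=1)
a, b = X @ A, Y @ B
print(abs(np.corrcoef(a[:, 0], b[:, 0])[0, 1]))  # close to 1: CCA finds h
```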
Input and Output of CCA
x(i) = (0, 0, 0, 1, 0, 0, 0, 0, 0, ..., 0) ∈ R^{50,000}
  ↓
a(i) = (−0.3, ..., 0.1) ∈ R^{100}

y(i) = (497, 0, 1, 12, 0, 0, 0, 7, 0, 0, 0, 0, ..., 0, 58, 0) ∈ R^{120,000}
  ↓
b(i) = (−0.7, ..., −0.2) ∈ R^{100}
Justification of CCA: Correlation Coefficients
- The sample correlation coefficient for a1 ... an ∈ R and b1 ... bn ∈ R is

  Corr({ai}, {bi}) = Σ_{i=1}^n (ai − ā)(bi − b̄) / ( √(Σ_{i=1}^n (ai − ā)²) √(Σ_{i=1}^n (bi − b̄)²) )

  where ā = Σ_i ai/n and b̄ = Σ_i bi/n

[Scatter plot of a against b, points falling close to a line: Correlation ≈ 1]
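The formula can be checked on a tiny made-up example; an exactly linear relationship gives correlation 1:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = 2 * a + 0.5  # an exact (increasing) linear relationship

num = np.sum((a - a.mean()) * (b - b.mean()))
den = np.sqrt(np.sum((a - a.mean()) ** 2)) * np.sqrt(np.sum((b - b.mean()) ** 2))
print(num / den)                # 1.0 (up to floating point)
print(np.corrcoef(a, b)[0, 1])  # the same quantity via NumPy
```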
Simple Case: p = 1
- The CCA projection matrices are vectors u1 ∈ R^d, v1 ∈ R^{d′}
- Project x(i) and y(i) to scalars u1 · x(i) and v1 · y(i)
- What vectors does CCA find? Answer:

  u1, v1 = argmax_{u,v} Corr({u · x(i)}_{i=1}^n, {v · y(i)}_{i=1}^n)
Finding the Next Projections
- After finding u1 and v1, what vectors u2 and v2 does CCA find? Answer:

  u2, v2 = argmax_{u,v} Corr({u · x(i)}_{i=1}^n, {v · y(i)}_{i=1}^n)

  subject to the constraints

  Corr({u2 · x(i)}_{i=1}^n, {u1 · x(i)}_{i=1}^n) = 0
  Corr({v2 · y(i)}_{i=1}^n, {v1 · y(i)}_{i=1}^n) = 0
CCA as an Optimization Problem
- CCA finds, for j = 1 ... p (each column of A and B),

  uj, vj = argmax_{u,v} Corr({u · x(i)}_{i=1}^n, {v · y(i)}_{i=1}^n)

  subject to the constraints

  Corr({uj · x(i)}_{i=1}^n, {uk · x(i)}_{i=1}^n) = 0
  Corr({vj · y(i)}_{i=1}^n, {vk · y(i)}_{i=1}^n) = 0

  for k < j
Guarantees for CCA
H
X Y
- Assume the data is generated from a Naive Bayes model
- The latent variable H is of dimension k; the variables X and Y are of dimension d and d′ (typically k ≪ d and k ≪ d′)
- Use CCA to project X and Y down to k dimensions (needs (x, y) pairs only!)
- Theorem: the projected samples are as good as the original samples for prediction of H (Foster, Johnson, Kakade, Zhang, 2009)
- Because k ≪ d and k ≪ d′, we can learn to predict H with far fewer labeled examples
Guarantees for CCA (continued)
Kakade and Foster, 2007 - a co-training-style setting:

- Assume we have a regression problem: predict some value z given two "views" x and y
- Assumption: either view x or y is sufficient for prediction
- Use CCA to project x and y down to a low-dimensional space
- Theorem: if the correlation coefficients drop off to zero quickly, we will need far fewer samples to learn when using the projected representation
- Very similar setting to co-training, but:
  - No assumption of independence between the two views
  - CCA is an exact algorithm - no need for heuristics
Summary of the Section
- SVD is an efficient optimization technique
  - Low-rank matrix approximation
- CCA derives a new representation of paired data that maximizes correlation
  - SVD as a subroutine
- Next: use of CCA in deriving vector representations of words ("eigenwords")
Overview
Basic concepts

Lexical representations
- Eigenwords found using the thin SVD between words and context
  - capture distributional similarity
  - contain POS and semantic information about words
  - are useful features for supervised learning

Hidden Markov Models

Latent-variable PCFGs

Conclusion
Uses of Spectral Methods in NLP
- Word sequence labeling
  - Part of Speech tagging (POS)
  - Named Entity Recognition (NER)
  - Word Sense Disambiguation (WSD)
  - Chunking, prepositional phrase attachment, ...
- Language modeling
  - What is the most likely next word given a sequence of words (or of sounds)?
  - What is the most likely parse given a sequence of words?
Uses of Spectral Methods in NLP
- Word sequence labeling: semi-supervised learning
  - Use CCA to learn vector representations of words (eigenwords) on a large unlabeled corpus.
  - Eigenwords map from words to vectors, which are used as features for supervised learning.
- Language modeling: spectral estimation of probabilistic models
  - Use eigenwords to reduce the dimensionality of generative models (HMMs, ...)
  - Use those models to compute the probability of an observed word sequence
The Eigenword Matrix U
- U contains the singular vectors from the thin SVD of the bigram count matrix

           ate  cheese  ham  I  You
  ate       0     1      1   0   0
  cheese    0     0      0   0   0
  ham       0     0      0   0   0
  I         1     0      0   0   0
  You       2     0      0   0   0

  Corpus: "I ate ham", "You ate cheese", "You ate"
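The toy matrix above is small enough to decompose by hand; a minimal sketch (the choice of k = 2 here is just for illustration):

```python
import numpy as np

words = ["ate", "cheese", "ham", "I", "You"]
# Bigram counts from the toy corpus: "I ate ham", "You ate cheese", "You ate"
C = np.array([[0, 1, 1, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0],
              [2, 0, 0, 0, 0]], dtype=float)

# Thin SVD; the rows of U, truncated to k columns, are the eigenwords.
k = 2
U, s, Vt = np.linalg.svd(C, full_matrices=False)
for w, row in zip(words, np.round(U[:, :k], 2)):
    print(w, row)  # "I" and "You" point the same way: both precede "ate"
```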
The Eigenword Matrix U
- U contains the singular vectors from the thin SVD of the bigram matrix (w_{t−1}, w_t); analogous to LSA, but uses context instead of documents
  - The context can be multiple neighboring words (we often use the words before and after the target)
  - The context can be neighbors in a parse tree
  - Eigenwords can also be computed using the CCA between words and their contexts
- Words close in the transformed space are distributionally, semantically and syntactically similar
- We will later use U in HMMs and parse trees to project words to low-dimensional vectors.
Two Kinds of Spectral Models
- Context oblivious (eigenwords)
  - learn a vector representation of each word type based on its average context
- Context sensitive (eigentokens or state)
  - estimate a vector representation of each word token based on its particular context, using an HMM or parse tree
Eigenwords in Practice
- Works well with corpora of 100 million words
- We often use trigrams from the Google n-gram collection
- We generally use 30-50 dimensions
- Compute using fast randomized SVD methods
How Big Should Eigenwords Be?
- A 40-D cube has 2^40 (about a trillion) vertices.
- More precisely, in a 40-D space about 1.5^40 ≈ 11 million vectors can all be approximately orthogonal.
- So 40 dimensions gives plenty of space for a vocabulary of a million words
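The two counts quoted above can be checked directly:

```python
# 2^40 vertices of a 40-D cube, and the ~11 million quoted for 1.5^40.
print(2 ** 40)    # 1099511627776, about a trillion
print(1.5 ** 40)  # about 1.1e7, i.e. roughly 11 million
```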
Fast SVD: Basic Method
problem: Find a low-rank approximation to an n×m matrix M.

solution: Find an n×k matrix A such that M ≈ AA⊤M.

Construction: A is constructed by:
1. create a random m×k matrix Ω (iid normals)
2. compute MΩ
3. compute the thin SVD of the result: UDV⊤ = MΩ
4. set A = U

better: iterate a couple of times

"Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions" by N. Halko, P. G. Martinsson, and J. A. Tropp.
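The four construction steps can be sketched as follows. This is a minimal illustration of the idea, not the Halko et al. reference implementation; the rank-5 test matrix and the specific iteration count are assumptions for the demo.

```python
import numpy as np

def randomized_range(M, k, iters=2):
    # 1. random m x k matrix Omega (iid normals)
    rng = np.random.default_rng(0)
    Omega = rng.normal(size=(M.shape[1], k))
    # 2. compute M @ Omega (a random sample of M's column space)
    Y = M @ Omega
    # "better: iterate a couple times" (power iterations sharpen the subspace)
    for _ in range(iters):
        Y = M @ (M.T @ Y)
    # 3.-4. thin SVD of the result; A is its U factor
    U, _, _ = np.linalg.svd(Y, full_matrices=False)
    return U

rng = np.random.default_rng(1)
M = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 80))  # exactly rank 5
A = randomized_range(M, k=5)
err = np.linalg.norm(M - A @ (A.T @ M)) / np.linalg.norm(M)
print(err < 1e-8)  # True: the sketch recovers the rank-5 range, so M ~= A A^T M
```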
Eigenwords for ’Similar’ Words are Close
[Scatter plot of PC 1 vs. PC 2: similar words lie close together, e.g. kinship terms (man, woman, boy, girl, son, daughter, mother, father, wife, husband, brother, sister, uncle), professions (teacher, doctor, lawyer, farmer, citizen, boss, guy), physical quantities (pressure, temperature, stress, density, gravity, tension, viscosity, permeability), and units (miles, bytes, degrees, inches, pounds, acres, meters, tons, barrels)]
Eigenwords Capture Part of Speech
[Scatter plot of PC 1 vs. PC 2: nouns (home, car, house, word, river, dog, cat, boat, truck) separate cleanly from verbs (talk, agree, listen, carry, sleep, drink, eat, push, disagree)]
Eigenwords: Pronouns
[Scatter plot of PC 1 vs. PC 2: pronouns cluster by case, e.g. subject forms (i, you, we, they, he, she) versus object/possessive forms (us, our, them, her, his, him)]
Eigenwords: Numbers
[Scatter plot of PC 1 vs. PC 2: digits (1-10), spelled-out numbers (one ... ten), and years (1995-2009) form distinct clusters]
Eigenwords: Names
[Scatter plot of PC 1 vs. PC 2: male first names (john, david, michael, paul, robert, george, ...) and female first names (mary, elizabeth, jennifer, barbara, susan, ...) form separate clusters]
CCA has Nice Properties for Computing Eigenwords
I When computing the SVD of a word × context matrix (as above) we need to decide how to scale the counts
I Using raw counts gives more emphasis to common words
I Better: rescale
  I Divide each row by the square root of the total count of the word in that row
  I Rescale the columns to account for the redundancy
I CCA between words and their contexts does this automatically and optimally
I CCA ‘whitens’ the word-context covariance matrix
Spectral Learning for NLP 62
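The rescaling described above can be sketched as follows. The counts are made up, and this row/column normalization is only an approximation to what full CCA does automatically; it is not the tutorial's exact procedure.

```python
import numpy as np

# A toy word-by-context count matrix (made-up counts).
# Rows: words; columns: context features. Rows 0-1 and rows 2-3
# are given similar count profiles on purpose.
C = np.array([[40.0, 10.0,  2.0],
              [35.0, 12.0,  3.0],
              [ 2.0, 30.0, 28.0],
              [ 1.0, 28.0, 31.0]])

# Rescale: divide each row (and column) by the square root of its
# total count, so common words and redundant contexts are down-weighted.
r = C.sum(axis=1)                                  # word totals
c = C.sum(axis=0)                                  # context totals
W = C / np.sqrt(r)[:, None] / np.sqrt(c)[None, :]

# Thin SVD of the rescaled matrix: left singular vectors give eigenwords.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
eigenwords = U[:, :2]                              # k = 2 dimensional embeddings
```

With this rescaling, the two words with similar context profiles end up close in the embedding space, as in the plots above.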
![Page 69: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/69.jpg)
Semi-supervised Learning Problems
I Sequence labeling (Named Entity Recognition, POS, WSD, ...)
  I X = target word
  I Z = context of the target word
  I label = person / place / organization ...
I Topic identification
  I X = words in title
  I Z = words in abstract
  I label = topic category
I Speaker identification
  I X = video
  I Z = audio
  I label = which character is speaking
Spectral Learning for NLP 63
![Page 70: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/70.jpg)
Semi-supervised Learning using CCA
I Find CCA between X and Z
I Recall: CCA finds projection matrices A and B such that

  x̂ = Aᵀx   (x̂ ∈ R^{k×1}, Aᵀ ∈ R^{k×d}, x ∈ R^{d×1})
  ẑ = Bᵀz   (ẑ ∈ R^{k×1}, Bᵀ ∈ R^{k×d′}, z ∈ R^{d′×1})

I Project X and Z to estimate hidden state: (x̂, ẑ)
I Note: if x is the word and z is its context, then A is the matrix of eigenwords, x̂ is the (context-oblivious) eigenword corresponding to word x, and ẑ gives a context-sensitive “eigentoken”
I Use supervised learning to predict label from hidden state
  I and from hidden state of neighboring words
Spectral Learning for NLP 64
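A concrete sketch of the recipe above: estimate CCA projections A and B from paired samples of two views and project each view to a k-dimensional hidden-state estimate. The data here is synthetic (a shared latent state observed through two noisy linear maps), and `inv_sqrt` and all variable names are our own stand-ins, not the tutorial's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic paired views: a shared 2-d latent state H observed through
# two different noisy linear maps (stand-ins for word / context views).
N, k = 2000, 2
H = rng.normal(size=(N, k))
X = H @ rng.normal(size=(k, 5)) + 0.1 * rng.normal(size=(N, 5))
Z = H @ rng.normal(size=(k, 6)) + 0.1 * rng.normal(size=(N, 6))

def inv_sqrt(M):
    """Inverse square root of a symmetric positive-definite matrix."""
    w, Q = np.linalg.eigh(M)
    return Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T

X = X - X.mean(axis=0)
Z = Z - Z.mean(axis=0)
Cxx, Czz = X.T @ X / N, Z.T @ Z / N
Cxz = X.T @ Z / N

# CCA: SVD of the whitened cross-covariance; s holds canonical correlations.
Wx, Wz = inv_sqrt(Cxx), inv_sqrt(Czz)
U, s, Vt = np.linalg.svd(Wx @ Cxz @ Wz)
A = Wx @ U[:, :k]            # projection for view X
B = Wz @ Vt[:k].T            # projection for view Z

x_hat = X @ A                # hidden-state estimate from view X
z_hat = Z @ B                # hidden-state estimate from view Z
```

Because both views share a strong 2-d signal, the top two canonical correlations come out near 1; `x_hat` (and neighbors' `x_hat`) can then feed a supervised learner.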
![Page 71: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/71.jpg)
Theory: CCA has Nice Properties
I If one uses CCA to map from target word and context (two views, X and Z) to a reduced-dimension hidden state, and then uses that hidden state as features in a linear regression to predict some y, then the fit in the reduced dimension (e.g. 40) is provably almost as good as in the original dimension (e.g. a million-word vocabulary).
I In contrast, Principal Components Regression (PCR: regression based on PCA, which does not “whiten” the covariance matrix) can miss all the signal
[Foster and Kakade, ’06]
Spectral Learning for NLP 65
![Page 72: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/72.jpg)
Semi-supervised Results
I Find spectral features on unlabeled data
  I RCV-1 corpus: newswire
  I 63 million tokens in 3.3 million sentences
  I Vocabulary size: 300k
  I Size of embeddings: k = 50
I Use in discriminative model
  I CRF for NER
  I Averaged perceptron for chunking
I Compare against state-of-the-art embeddings
  I C&W, HLBL, Brown, ASO and Semi-Sup CRF
  I Baseline features based on identity of word and its neighbors
I Benefit
  I Named Entity Recognition (NER): 8% error reduction
  I Chunking: 29% error reduction
  I Add spectral features to discriminative parser: 2.6% error reduction
Spectral Learning for NLP 66
![Page 73: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/73.jpg)
Section Summary
I Eigenwords found using thin SVD between words and context
  I capture distributional similarity
  I contain POS and semantic information about words
  I perform competitively with a wide range of other embeddings
  I CCA version provides provable guarantees when used as features in supervised learning
I Next: eigenwords form the basis for fast estimation of HMMs and parse trees
Spectral Learning for NLP 67
![Page 74: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/74.jpg)
A Spectral Learning Algorithm for HMMs
I Algorithm due to Hsu, Kakade and Zhang (COLT 2009; JCSS 2012)
I Algorithm relies on singular value decomposition followed byvery simple matrix operations
I Close connections to CCA
I Under assumptions on singular values arising from the model,has PAC-learning style guarantees (contrast with EM, whichhas problems with local optima)
I It is a very different algorithm from EM
Spectral Learning for NLP 68
![Page 75: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/75.jpg)
Hidden Markov Models (HMMs)
H1 H2 H3 H4
the dog saw him
p(the dog saw him, 1 2 1 3)      [x1 . . . x4 = “the dog saw him”, h1 . . . h4 = 1 2 1 3]

= π(1) × t(2|1) × t(1|2) × t(3|1) × o(the|1) × o(dog|2) × o(saw|1) × o(him|3)
I Initial parameters: π(h) for each latent state h
I Transition parameters: t(h′|h) for each pair of states h′, h
I Observation parameters: o(x|h) for each state h, obs. x
Spectral Learning for NLP 69
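The joint probability on this slide can be computed directly from the three parameter tables. A toy sketch (all probabilities made up, states 0-indexed):

```python
import numpy as np

# A toy HMM with m = 3 hidden states and a 4-word vocabulary.
words = {"the": 0, "dog": 1, "saw": 2, "him": 3}

pi = np.array([0.5, 0.3, 0.2])          # pi(h): initial distribution
t = np.array([[0.2, 0.5, 0.6],          # t[h', h] = t(h'|h)
              [0.5, 0.2, 0.2],
              [0.3, 0.3, 0.2]])
o = np.array([[0.4, 0.1, 0.3, 0.2],     # o[h, x] = o(x|h)
              [0.1, 0.6, 0.1, 0.2],
              [0.2, 0.2, 0.2, 0.4]])

def joint_prob(xs, hs):
    """p(x1..xn, h1..hn) = pi(h1) * prod t(h_i|h_{i-1}) * prod o(x_i|h_i)."""
    p = pi[hs[0]]
    for prev, cur in zip(hs, hs[1:]):
        p *= t[cur, prev]
    for x, h in zip(xs, hs):
        p *= o[h, words[x]]
    return p
```

For example, `joint_prob(["the", "dog", "saw", "him"], [0, 1, 0, 2])` multiplies out exactly the eight factors shown on the slide.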
![Page 78: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/78.jpg)
Hidden Markov Models (HMMs)
H1 H2 H3 H4
the dog saw him
Throughout this section:
I We use m to refer to the number of hidden states
I We use n to refer to the number of possible words(observations)
I Typically, m ≪ n (e.g., m = 20, n = 50,000)
Spectral Learning for NLP 70
![Page 79: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/79.jpg)
HMMs: the forward algorithm
H1 H2 H3 H4
the dog saw him
p(the dog saw him) = Σ_{h1,h2,h3,h4} p(the dog saw him, h1 h2 h3 h4)

The forward algorithm:

f⁰_h = π(h)
f¹_h = Σ_{h′} t(h|h′) o(the|h′) f⁰_{h′}
f²_h = Σ_{h′} t(h|h′) o(dog|h′) f¹_{h′}
f³_h = Σ_{h′} t(h|h′) o(saw|h′) f²_{h′}
f⁴_h = Σ_{h′} t(h|h′) o(him|h′) f³_{h′}
p(the dog saw him) = Σ_h f⁴_h
Spectral Learning for NLP 71
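The recursion above, sketched in numpy on a toy HMM (made-up parameters), implemented exactly as written on the slide:

```python
import numpy as np

# Toy HMM (made-up numbers); pi, t, o as on the previous slides.
words = {"the": 0, "dog": 1, "saw": 2, "him": 3}
pi = np.array([0.5, 0.3, 0.2])
t = np.array([[0.2, 0.5, 0.6],          # t[h', h] = t(h'|h)
              [0.5, 0.2, 0.2],
              [0.3, 0.3, 0.2]])
o = np.array([[0.4, 0.1, 0.3, 0.2],     # o[h, x] = o(x|h)
              [0.1, 0.6, 0.1, 0.2],
              [0.2, 0.2, 0.2, 0.4]])

def forward(xs):
    """f0_h = pi(h); f_{i,h} = sum_{h'} t(h|h') o(x_i|h') f_{i-1,h'}."""
    f = pi.copy()
    for x in xs:
        f = t @ (o[:, words[x]] * f)    # fold in emission, then transition
    return f.sum()                      # p(x1..xn) = sum_h f_n(h)
```

Each update is O(m²), so a sentence of length n costs O(n·m²) instead of the O(mⁿ) brute-force sum over hidden sequences.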
![Page 87: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/87.jpg)
HMMs: the forward algorithm in matrix form
H1 H2 H3 H4
the dog saw him
I For each word x, define the matrix Ax ∈ R^{m×m} as

  [Ax]_{h′,h} = t(h′|h) o(x|h)    e.g., [Athe]_{h′,h} = t(h′|h) o(the|h)

I Define π as the vector with elements π_h, and 1 as the vector of all ones
I Then

  p(the dog saw him) = 1ᵀ × Ahim × Asaw × Adog × Athe × π

Forward algorithm through matrix multiplication!
Spectral Learning for NLP 72
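The matrix form can be checked numerically on the same toy HMM: build each Ax by scaling column h of the transition matrix by o(x|h), then multiply the chain out.

```python
import numpy as np

# Same toy HMM as before (made-up numbers).
words = {"the": 0, "dog": 1, "saw": 2, "him": 3}
pi = np.array([0.5, 0.3, 0.2])
t = np.array([[0.2, 0.5, 0.6],
              [0.5, 0.2, 0.2],
              [0.3, 0.3, 0.2]])
o = np.array([[0.4, 0.1, 0.3, 0.2],
              [0.1, 0.6, 0.1, 0.2],
              [0.2, 0.2, 0.2, 0.4]])

# [Ax]_{h',h} = t(h'|h) o(x|h): scale column h of t by o(x|h).
A = {x: t * o[:, i] for x, i in words.items()}

ones = np.ones(3)
p = ones @ A["him"] @ A["saw"] @ A["dog"] @ A["the"] @ pi
```

The product equals the forward-algorithm value: each matrix-vector multiply is one step of the recursion on the previous slide.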
![Page 93: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/93.jpg)
The Spectral Algorithm: definitions
H1 H2 H3 H4
the dog saw him
Define the following matrix P2,1 ∈ R^{n×n}:

  [P2,1]_{i,j} = P(X2 = i, X1 = j)

Easy to derive an estimate:

  [P̂2,1]_{i,j} = Count(X2 = i, X1 = j) / N
Spectral Learning for NLP 73
![Page 94: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/94.jpg)
The Spectral Algorithm: definitions
H1 H2 H3 H4
the dog saw him
For each word x, define the following matrix P3,x,1 ∈ R^{n×n}:

  [P3,x,1]_{i,j} = P(X3 = i, X2 = x, X1 = j)

Easy to derive an estimate, e.g.:

  [P̂3,dog,1]_{i,j} = Count(X3 = i, X2 = dog, X1 = j) / N
Spectral Learning for NLP 74
![Page 95: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/95.jpg)
Main Result Underlying the Spectral Algorithm

I Define the following matrix P2,1 ∈ R^{n×n}:

  [P2,1]_{i,j} = P(X2 = i, X1 = j)

I For each word x, define the following matrix P3,x,1 ∈ R^{n×n}:

  [P3,x,1]_{i,j} = P(X3 = i, X2 = x, X1 = j)

I SVD(P2,1) ⇒ U ∈ R^{n×m}, Σ ∈ R^{m×m}, V ∈ R^{n×m}
I Definition:

  Bx = Uᵀ × P3,x,1 × V × Σ⁻¹    (both Uᵀ P3,x,1 V and Σ⁻¹ are m×m)

I Theorem: if P2,1 is of rank m, then

  Bx = G Ax G⁻¹

  where G ∈ R^{m×m} is invertible
Spectral Learning for NLP 75
![Page 99: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/99.jpg)
Why does this matter?
I Theorem: if P2,1 is of rank m, then

  Bx = G Ax G⁻¹

  where G ∈ R^{m×m} is invertible

I Recall p(the dog saw him) = 1ᵀ Ahim Asaw Adog Athe π. Forward algorithm through matrix multiplication!
I Now note that

  Bhim × Bsaw × Bdog × Bthe
  = G Ahim G⁻¹ × G Asaw G⁻¹ × G Adog G⁻¹ × G Athe G⁻¹
  = G Ahim × Asaw × Adog × Athe G⁻¹

  The G’s cancel!!

I It follows that if we have b∞ = 1ᵀG⁻¹ and b0 = Gπ then

  b∞ × Bhim × Bsaw × Bdog × Bthe × b0 = 1ᵀ × Ahim × Asaw × Adog × Athe × π
Spectral Learning for NLP 76
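The cancellation can be verified numerically with an arbitrary invertible G standing in for the unknown similarity transform:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy HMM (made-up numbers) and its observation operators Ax.
words = {"the": 0, "dog": 1, "saw": 2, "him": 3}
pi = np.array([0.5, 0.3, 0.2])
t = np.array([[0.2, 0.5, 0.6],
              [0.5, 0.2, 0.2],
              [0.3, 0.3, 0.2]])
o = np.array([[0.4, 0.1, 0.3, 0.2],
              [0.1, 0.6, 0.1, 0.2],
              [0.2, 0.2, 0.2, 0.4]])
A = {x: t * o[:, i] for x, i in words.items()}   # [Ax]_{h',h} = t(h'|h) o(x|h)

# An arbitrary invertible G (a random matrix is invertible w.h.p.).
G = rng.normal(size=(3, 3))
Ginv = np.linalg.inv(G)

B = {x: G @ A[x] @ Ginv for x in A}   # Bx = G Ax G^{-1}
b_inf = np.ones(3) @ Ginv             # b_inf = 1^T G^{-1}
b0 = G @ pi                           # b0 = G pi

p_A = np.ones(3) @ A["him"] @ A["saw"] @ A["dog"] @ A["the"] @ pi
p_B = b_inf @ B["him"] @ B["saw"] @ B["dog"] @ B["the"] @ b0
# the G's cancel, so p_A == p_B up to rounding
```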
![Page 106: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/106.jpg)
The Spectral Learning Algorithm
1. Derive estimates

   [P̂2,1]_{i,j} = Count(X2 = i, X1 = j) / N

   and, for all words x,

   [P̂3,x,1]_{i,j} = Count(X3 = i, X2 = x, X1 = j) / N

2. SVD(P̂2,1) ⇒ U ∈ R^{n×m}, Σ ∈ R^{m×m}, V ∈ R^{n×m}

3. For all words x, define Bx = Uᵀ × P̂3,x,1 × V × Σ⁻¹ (both factors m×m). (Similar definitions for b0, b∞; details omitted.)

4. For a new sentence x1 . . . xn, calculate its probability, e.g.,

   p(the dog saw him) = b∞ × Bhim × Bsaw × Bdog × Bthe × b0
Spectral Learning for NLP 77
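A sketch of steps 1–4 in numpy on a toy HMM. For clarity it uses the exact population matrices P2,1 and P3,x,1 computed from known parameters rather than bigram/trigram counts, and it fills in b0 = Uᵀ P1 and b∞ = (P2,1ᵀ U)⁺ P1 following Hsu, Kakade and Zhang (details the slide omits); with estimated count matrices the same code applies.

```python
import numpy as np

# Toy HMM with m = 3 states, n = 4 words (made-up numbers).
m, n = 3, 4
pi = np.array([0.5, 0.3, 0.2])
T = np.array([[0.2, 0.5, 0.6],
              [0.5, 0.2, 0.2],
              [0.3, 0.3, 0.2]])          # T[h', h] = t(h'|h)
O = np.array([[0.4, 0.1, 0.3, 0.2],
              [0.1, 0.6, 0.1, 0.2],
              [0.2, 0.2, 0.2, 0.4]]).T   # O[x, h] = o(x|h)

# Step 1 (population versions; in practice: normalized n-gram counts).
P1 = O @ pi                                             # P(X1 = i)
P21 = O @ T @ np.diag(pi) @ O.T                         # P(X2 = i, X1 = j)
P3x1 = [O @ T @ np.diag(O[x]) @ T @ np.diag(pi) @ O.T   # P(X3=i, X2=x, X1=j)
        for x in range(n)]

# Step 2: SVD of P21, keeping the top m singular directions.
U, s, Vt = np.linalg.svd(P21)
U, V, Sinv = U[:, :m], Vt[:m].T, np.diag(1.0 / s[:m])

# Step 3: observable operators, plus b0 and b_inf (HKZ construction).
B = [U.T @ P3x1[x] @ V @ Sinv for x in range(n)]
b0 = U.T @ P1
b_inf = np.linalg.pinv(P21.T @ U) @ P1

# Step 4: sentence probability by operator multiplication.
def p_spectral(xs):
    v = b0
    for x in xs:            # apply B_{x1} first, ..., B_{xT} last
        v = B[x] @ v
    return float(b_inf @ v)
```

With exact P2,1 and P3,x,1 the recovered probabilities match the HMM exactly; with counts they match up to estimation error, as the guarantees on the next slide quantify.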
![Page 110: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/110.jpg)
Guarantees

I Throughout the algorithm we’ve used estimates P̂2,1 and P̂3,x,1 in place of P2,1 and P3,x,1
I If P̂2,1 = P2,1 and P̂3,x,1 = P3,x,1 then the method is exact. But we will always have estimation errors
I A PAC-style theorem: Fix some length T. To have

  Σ_{x1...xT} |p(x1 . . . xT) − p̂(x1 . . . xT)| ≤ ε      (the L1 distance between p and p̂)

  with probability at least 1 − δ, the number of samples required is polynomial in

  n, m, 1/ε, 1/δ, 1/σ, T

  where σ is the m’th largest singular value of P2,1
Spectral Learning for NLP 78
![Page 111: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/111.jpg)
Intuition behind the Theorem

I Define

  ||Â − A||₂ = sqrt( Σ_{j,k} (Â_{j,k} − A_{j,k})² )

I With N samples, with probability at least 1 − δ,

  ||P̂2,1 − P2,1||₂ ≤ ε    and    ||P̂3,x,1 − P3,x,1||₂ ≤ ε,    where

  ε = sqrt( (1/N) log(1/δ) ) + sqrt(1/N)

I Then need to carefully bound how the error ε propagates through the SVD step, the various matrix multiplications, etc. etc. The “rate” at which ε propagates depends on T, m, n, 1/σ
Spectral Learning for NLP 79
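The √(1/N) behaviour of the estimation error can be illustrated by simulation. The 4×4 joint distribution below is made up; we draw N bigram samples and measure the Frobenius error of the count estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up joint distribution P(X2 = i, X1 = j) over a 4-word vocabulary.
P21 = np.array([[0.10, 0.05, 0.05, 0.05],
                [0.05, 0.10, 0.05, 0.05],
                [0.05, 0.05, 0.10, 0.05],
                [0.05, 0.05, 0.05, 0.10]])
flat = P21.ravel()

def frob_error(N):
    """Draw N bigrams and return ||P21_hat - P21||_2 (Frobenius norm)."""
    counts = rng.multinomial(N, flat).reshape(4, 4)
    return np.linalg.norm(counts / N - P21)

# Average over 50 repetitions at two sample sizes.
errs = {N: np.mean([frob_error(N) for _ in range(50)]) for N in (100, 10000)}
```

Increasing N by a factor of 100 shrinks the average error by roughly a factor of 10, matching the sqrt(1/N) term in the bound.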
![Page 112: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/112.jpg)
Summary
I The problem solved by EM: estimate HMM parameters π(h), t(h′|h), o(x|h) from observation sequences x1 . . . xn
I The spectral algorithm:
  I Calculate estimates P̂2,1 (bigram counts) and P̂3,x,1 (trigram counts)
  I Run an SVD on P̂2,1
  I Calculate parameter estimates using simple matrix operations
I Guarantee: we recover the parameters up to linear transforms that cancel
Spectral Learning for NLP 80
![Page 113: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/113.jpg)
Overview
Basic concepts
Lexical representations
Hidden Markov models
Latent-variable PCFGs
  Background
  Spectral algorithm
  Justification of the algorithm
  Experiments
Conclusion
Spectral Learning for NLP 81
![Page 114: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/114.jpg)
Probabilistic Context-free Grammars
I Used for natural language parsing and other structured models
I Induce probability distributions over phrase-structure trees
Spectral Learning for NLP 82
![Page 115: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/115.jpg)
The Probability of a Tree
[S [NP [D the] [N dog]] [VP [V saw] [P him]]]

p(tree)
= π(S) × t(S → NP VP|S) × t(NP → D N|NP) × t(VP → V P|VP) × q(D → the|D) × q(N → dog|N) × q(V → saw|V) × q(P → him|P)
We assume PCFGs in Chomsky normal form
Spectral Learning for NLP 83
![Page 116: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/116.jpg)
PCFGs - Advantage
“Context-freeness” leads to generalization (“NP” - noun phrase):
Seen in data:
[S [NP [D the] [N dog]] [VP [V saw] [NP [D the] [N cat]]]]

Unseen in data (grammatical):
[S [NP [D the] [N cat]] [VP [V saw] [NP [D the] [N dog]]]]
An NP subtree can be combined anywhere an NP is expected
Spectral Learning for NLP 84
![Page 117: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/117.jpg)
PCFGs - Disadvantage
“Context-freeness” can lead to over-generalization:
Seen in data:
[S [NP [D the] [N dog]] [VP [V saw] [NP [P him]]]]

Unseen in data (ungrammatical):
[S [NP [N him]] [VP [V saw] [NP [D the] [N dog]]]]
Spectral Learning for NLP 85
![Page 118: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/118.jpg)
PCFGs - a Fix
Adding context to the nonterminals fixes that:
Seen in data:
[S [NPsbj [D the] [N dog]] [VP [V saw] [NPobj [P him]]]]

Low likelihood:
[S [NPobj [N him]] [VP [V saw] [NPsbj [D the] [N dog]]]]
Spectral Learning for NLP 86
![Page 119: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/119.jpg)
Idea: Latent-Variable PCFGs (Matsuzaki et al., 2005; Petrov et al., 2006)

[S [NP [D the] [N dog]] [VP [V saw] [P him]]]
⇒
[S1 [NP3 [D1 the] [N2 dog]] [VP2 [V4 saw] [P1 him]]]
The latent states for each node are never observed
Spectral Learning for NLP 87
![Page 120: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/120.jpg)
The Probability of a Tree
[S1 [NP3 [D1 the] [N2 dog]] [VP2 [V4 saw] [P1 him]]]

p(tree, 1 3 1 2 2 4 1)
= π(S1) × t(S1 → NP3 VP2|S1) × t(NP3 → D1 N2|NP3) × t(VP2 → V4 P1|VP2) × q(D1 → the|D1) × q(N2 → dog|N2) × q(V4 → saw|V4) × q(P1 → him|P1)

p(tree) = Σ_{h1...h7} p(tree, h1 h2 h3 h4 h5 h6 h7)
Spectral Learning for NLP 88
![Page 121: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/121.jpg)
Learning L-PCFGs
I Expectation-maximization (Matsuzaki et al., 2005)
I Split-merge techniques (Petrov et al., 2006)
Neither solves the issue of local maxima or statistical consistency
Spectral Learning for NLP 89
![Page 122: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/122.jpg)
Overview
Basic concepts
Lexical representations
Hidden Markov models
Latent-variable PCFGs
  Background
  Spectral algorithm
  Justification of the algorithm
  Experiments
Conclusion
Spectral Learning for NLP 90
![Page 123: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/123.jpg)
Inside and Outside Trees
At node VP:

[S [NP [D the] [N dog]] [VP [V saw] [P him]]]

Outside tree o = [S [NP [D the] [N dog]] VP]    (everything above and around the VP node)
Inside tree t = [VP [V saw] [P him]]

Conditionally independent given the label and the hidden state:

p(o, t|VP, h) = p(o|VP, h) × p(t|VP, h)
Spectral Learning for NLP 91
![Page 125: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/125.jpg)
Vector Representation of Inside and Outside Trees

Assume functions Z and Y:
Z maps any outside tree to a vector of length m.
Y maps any inside tree to a vector of length m.
Convention: m is the number of hidden states under the L-PCFG.
S
NP
D
the
N
dog
VP
VP
V
saw
P
him
Outside tree o ⇒ Z(o) = [1, 0.4, −5.3, . . . , 72] ∈ Rm
Inside tree t ⇒ Y(t) = [−3, 17, 2, . . . , 3.5] ∈ Rm
Spectral Learning for NLP 93
![Page 126: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/126.jpg)
Parameter Estimation for Binary Rules

Take M samples of nodes with rule VP → V NP.
At sample i
I o(i) = outside tree at VP
I t2(i) = inside tree at V

I t3(i) = inside tree at NP
t(VPh1 → Vh2 NPh3 | VPh1)

  = (count(VP → V NP) / count(VP)) × (1/M) Σ_{i=1}^{M} Zh1(o(i)) × Yh2(t2(i)) × Yh3(t3(i))
Spectral Learning for NLP 94
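The estimator above can be sketched with a single tensor average in NumPy. This is an illustrative sketch, not the tutorial's implementation: the arrays Z, Y2, Y3 are hypothetical stand-ins for the projected representations Z(o(i)), Y(t2(i)), Y(t3(i)), filled with random data here.

```python
import numpy as np

# Sketch of the binary-rule estimator. Z, Y2, Y3 hold the projected
# representations Z(o(i)), Y(t2(i)), Y(t3(i)) for M samples (random toys).
rng = np.random.default_rng(0)
M, m = 500, 3                       # samples, hidden states
Z = rng.normal(size=(M, m))         # Z_h1(o(i))
Y2 = rng.normal(size=(M, m))        # Y_h2(t2(i))
Y3 = rng.normal(size=(M, m))        # Y_h3(t3(i))
rule_count, nt_count = 120, 300     # count(VP -> V NP), count(VP)

# (count ratio) x (1/M) sum_i Z(o(i)) (x) Y(t2(i)) (x) Y(t3(i))
t_hat = (rule_count / nt_count) * np.einsum('ia,ib,ic->abc', Z, Y2, Y3) / M
print(t_hat.shape)  # (3, 3, 3)
```

The einsum accumulates the rank-one tensors Z(o(i)) ⊗ Y(t2(i)) ⊗ Y(t3(i)) over all samples in one pass.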
![Page 127: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/127.jpg)
Parameter Estimation for Unary Rules
Take M samples of nodes with rule N→ dog.
At sample i
I o(i) = outside tree at N
q(Nh → dog | Nh) = (count(N → dog) / count(N)) × (1/M) Σ_{i=1}^{M} Zh(o(i))
Spectral Learning for NLP 95
![Page 128: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/128.jpg)
Parameter Estimation for the Root
Take M samples of the root S.
At sample i
I t(i) = inside tree at S
π(Sh) = (count(root = S) / count(root)) × (1/M) Σ_{i=1}^{M} Yh(t(i))
Spectral Learning for NLP 96
![Page 129: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/129.jpg)
Deriving Z and Y
Design functions ψ and φ:
ψ maps any outside tree to a vector of length d′
φ maps any inside tree to a vector of length d
S
NP
D
the
N
dog
VP
VP
V
saw
P
him
Outside tree o ⇒ ψ(o) = [0, 1, 0, 0, . . . , 0, 1] ∈ Rd′
Inside tree t ⇒ φ(t) = [1, 0, 0, 0, . . . , 1, 0] ∈ Rd
Z and Y will be reduced dimensional representations of ψ and φ.
Spectral Learning for NLP 97
![Page 130: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/130.jpg)
Reducing Dimensions via a Singular Value Decomposition

Have M samples of a node with non-terminal a. At sample i, o(i) is the outside tree rooted at a and t(i) is the inside tree rooted at a.
I Compute a matrix Ωa ∈ Rd×d′ with entries

    [Ωa]j,k = (1/M) Σ_{i=1}^{M} φj(t(i)) ψk(o(i))

I An SVD:

    Ωa ≈ Ua Σa (Va)T    with Ωa ∈ Rd×d′, Ua ∈ Rd×m, Σa ∈ Rm×m, (Va)T ∈ Rm×d′

I Projection:

    Y(t(i)) = (Ua)T φ(t(i)) ∈ Rm

    Z(o(i)) = (Σa)−1 (Va)T ψ(o(i)) ∈ Rm
Spectral Learning for NLP 98
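The Ωa computation, the rank-m SVD, and both projections can be sketched in a few lines. Phi and Psi below are hypothetical stand-ins for the stacked inside/outside feature vectors φ(t(i)) and ψ(o(i)), filled with random toy data:

```python
import numpy as np

# Sketch of the SVD step for one non-terminal a.
rng = np.random.default_rng(0)
M, d, d_out, m = 1000, 20, 15, 4
Phi = rng.random(size=(M, d))       # rows: phi(t(i)), inside features
Psi = rng.random(size=(M, d_out))   # rows: psi(o(i)), outside features

# Omega^a in R^{d x d'}: empirical inside/outside cross-moment.
Omega = Phi.T @ Psi / M

# Rank-m SVD: Omega ~ U Sigma V^T.
U, S, Vt = np.linalg.svd(Omega, full_matrices=False)
U, S, Vt = U[:, :m], S[:m], Vt[:m, :]

# Projections: Y(t) = U^T phi(t),  Z(o) = Sigma^{-1} V^T psi(o).
Y = Phi @ U            # (M, m) inside-tree representations
Z = Psi @ Vt.T / S     # (M, m) outside-tree representations
print(Y.shape, Z.shape)
```

Dividing by S applies (Σa)−1 column-wise, since Σa is diagonal.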
![Page 133: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/133.jpg)
A Summary of the Algorithm
1. Design feature functions φ and ψ for inside and outside trees.
2. Use SVD to compute vectors
Y(t) ∈ Rm for inside trees
Z(o) ∈ Rm for outside trees
3. Estimate the parameters t, q, and π from the training data.
Spectral Learning for NLP 99
![Page 134: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/134.jpg)
Overview
Basic concepts
Lexical representations
Hidden Markov models
Latent-variable PCFGs
  Background
  Spectral algorithm
  Justification of the algorithm
  Experiments
Conclusion
Spectral Learning for NLP 100
![Page 135: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/135.jpg)
Justification of the Algorithm: Roadmap
How do we marginalize latent states? Dynamic programming
Succinct tensor form of representing the DP algorithm
Estimation guarantees explained through the tensor form
How do we parse? Dynamic programming again
Spectral Learning for NLP 101
![Page 136: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/136.jpg)
Calculating Tree Probability with Dynamic Programming: Revisited
S
NP
D
the
N
dog
VP
V
saw
P
him
b1h = Σ_{h2,h3} t(NPh → Dh2 Nh3 | NPh) × q(Dh2 → the | Dh2) × q(Nh3 → dog | Nh3)

b2h = Σ_{h2,h3} t(VPh → Vh2 Ph3 | VPh) × q(Vh2 → saw | Vh2) × q(Ph3 → him | Ph3)

b3h = Σ_{h2,h3} t(Sh → NPh2 VPh3 | Sh) × b1h2 × b2h3

p(tree) = Σ_h π(Sh) × b3h
Spectral Learning for NLP 102
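On a toy instance, the four equations above can be checked against brute-force marginalization over all hidden-state assignments h1...h7. A minimal sketch with random, unnormalized toy parameters (all names are illustrative):

```python
import numpy as np

# DP for p(tree) on (S (NP (D the) (N dog)) (VP (V saw) (P him))).
rng = np.random.default_rng(0)
m = 3
pi_S = rng.dirichlet(np.ones(m))                   # pi(S_h)
T_S, T_NP, T_VP = rng.random((3, m, m, m)) / m**2  # t(a_h -> b_h2 c_h3 | a_h)
q = {w: rng.random(m) for w in ['the', 'dog', 'saw', 'him']}

# Inside probabilities, bottom-up:
b1 = np.einsum('hij,i,j->h', T_NP, q['the'], q['dog'])  # b1_h at NP
b2 = np.einsum('hij,i,j->h', T_VP, q['saw'], q['him'])  # b2_h at VP
b3 = np.einsum('hij,i,j->h', T_S, b1, b2)               # b3_h at S
p_tree = pi_S @ b3

# Brute force over all m**7 hidden-state assignments agrees:
brute = sum(pi_S[h1] * T_S[h1, h2, h3] * T_NP[h2, h4, h5] * T_VP[h3, h6, h7]
            * q['the'][h4] * q['dog'][h5] * q['saw'][h6] * q['him'][h7]
            for h1 in range(m) for h2 in range(m) for h3 in range(m)
            for h4 in range(m) for h5 in range(m)
            for h6 in range(m) for h7 in range(m))
print(abs(p_tree - brute) < 1e-12)
```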
![Page 137: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/137.jpg)
Tensor Form of the Parameters
For each non-terminal a, define a vector πa ∈ Rm with entries
[πa]h = π(ah)
For each rule a → x, define a vector qa→x ∈ Rm with entries

    [qa→x]h = q(ah → x | ah)

For each rule a → b c, define a tensor T a→b c ∈ Rm×m×m with entries

    [T a→b c]h1,h2,h3 = t(ah1 → bh2 ch3 | ah1)
Spectral Learning for NLP 103
![Page 138: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/138.jpg)
Tensor Formulation of Dynamic Programming
I The dynamic programming algorithm can be represented much more compactly using basic tensor-matrix-vector products
Sh
NPh2
D
the
N
dog
VPh3
V
saw
P
him
Regular form:

    b3h = Σ_{h2,h3} t(Sh → NPh2 VPh3 | Sh) × b1h2 × b2h3

Equivalent tensor form:

    b3 = T S→NP VP(b1, b2)

where T S→NP VP ∈ Rm×m×m and

    [T S→NP VP]h,h2,h3 = t(Sh → NPh2 VPh3 | Sh)
Spectral Learning for NLP 104
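The tensor application T(y2, y3) is a single einsum. A minimal sketch checking it against the explicit double sum of the regular form (random toy tensor and vectors):

```python
import numpy as np

# [T(y2, y3)]_h = sum_{h2,h3} T[h, h2, h3] * y2[h2] * y3[h3]
def apply_tensor(T, y2, y3):
    return np.einsum('hjk,j,k->h', T, y2, y3)

rng = np.random.default_rng(1)
m = 4
T, y2, y3 = rng.random((m, m, m)), rng.random(m), rng.random(m)
b3 = apply_tensor(T, y2, y3)

# Agrees with the explicit double sum of the "regular form":
loop = np.array([sum(T[h, j, k] * y2[j] * y3[k]
                     for j in range(m) for k in range(m)) for h in range(m)])
print(np.allclose(b3, loop))
```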
![Page 139: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/139.jpg)
Dynamic Programming in Tensor Form
S
NP
D
the
N
dog
VP
V
saw
P
him
T S→NP VP(T NP→D N(qD→the, qN→dog), T VP→V P(qV→saw, qP→him)) πS

  = p(tree) = Σ_{h1...h7} p(tree, h1 h2 h3 h4 h5 h6 h7)
Spectral Learning for NLP 105
![Page 140: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/140.jpg)
Thought Experiment
I We want the parameters (in tensor form)
πa ∈ Rm
qa→x ∈ Rm
T a→b c(y2, y3) ∈ Rm
I What if we had an invertible matrix Ga ∈ Rm×m for every non-terminal a?

I And what if we had instead

    ca = Ga πa

    ca→x = qa→x (Ga)−1

    Ca→b c(y2, y3) = T a→b c(y2 Gb, y3 Gc) (Ga)−1
Spectral Learning for NLP 106
![Page 141: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/141.jpg)
Cancellation of the Linear Operators
S
NP
D
the
N
dog
VP
V
saw
P
him
CS→NP VP(CNP→D N(cD→the, cN→dog), CVP→V P(cV→saw, cP→him)) cS

  = T S→NP VP(T NP→D N(qD→the (GD)−1 GD, qN→dog (GN)−1 GN) (GNP)−1 GNP,
              T VP→V P(qV→saw (GV)−1 GV, qP→him (GP)−1 GP) (GVP)−1 GVP) (GS)−1 GS πS

  = T S→NP VP(T NP→D N(qD→the, qN→dog), T VP→V P(qV→saw, qP→him)) πS

  = p(tree) = Σ_{h1...h7} p(tree, h1 h2 h3 h4 h5 h6 h7)
Spectral Learning for NLP 107
![Page 142: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/142.jpg)
Estimation Guarantees
I Basic argument: if Ωa has rank m, the estimated parameters Ca→b c, ca→x, and ca converge to

    Ca→b c(y2, y3) = T a→b c(y2 Gb, y3 Gc) (Ga)−1

    ca→x = qa→x (Ga)−1

    ca = Ga πa

for some invertible Ga.

I The Ga are unknown, but they are there, canceling out perfectly
Spectral Learning for NLP 108
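The cancellation can be checked numerically: run a toy dynamic program once with true parameters and once with G-transformed operators built from random invertible Ga, and confirm the tree probability is identical. A sketch under those toy assumptions (all parameters are random and illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 3
pi_S = rng.dirichlet(np.ones(m))
T_S, T_NP, T_VP = rng.random((3, m, m, m)) / m**2
q = {w: rng.random(m) for w in ['the', 'dog', 'saw', 'him']}

def apply_T(T, y2, y3):
    # [T(y2, y3)]_h = sum_{h2,h3} T[h, h2, h3] * y2[h2] * y3[h3]
    return np.einsum('hjk,j,k->h', T, y2, y3)

# True-parameter dynamic program:
b1 = apply_T(T_NP, q['the'], q['dog'])
b2 = apply_T(T_VP, q['saw'], q['him'])
p_true = pi_S @ apply_T(T_S, b1, b2)

# Random invertible G^a per non-terminal:
G = {a: rng.random((m, m)) + np.eye(m) for a in ['S', 'NP', 'VP', 'D', 'N', 'V', 'P']}
inv = {a: np.linalg.inv(G[a]) for a in G}

# Transformed operators per the definitions above; run the same DP:
c_the, c_dog = q['the'] @ inv['D'], q['dog'] @ inv['N']
c_saw, c_him = q['saw'] @ inv['V'], q['him'] @ inv['P']
c1 = apply_T(T_NP, c_the @ G['D'], c_dog @ G['N']) @ inv['NP']  # C^{NP->D N}
c2 = apply_T(T_VP, c_saw @ G['V'], c_him @ G['P']) @ inv['VP']  # C^{VP->V P}
p_obs = (apply_T(T_S, c1 @ G['NP'], c2 @ G['VP']) @ inv['S']) @ (G['S'] @ pi_S)
print(abs(p_true - p_obs) < 1e-9)
```

Each (Ga)−1 introduced at a node is canceled by the Ga applied at its parent, so the final value is exactly p(tree).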
![Page 143: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/143.jpg)
Implications of Guarantees
I The dynamic programming algorithm calculates an estimate p̂(tree)

I As we have more data, p̂(tree) converges to the true p(tree)
But we are interested in parsing – trees are unobserved
Spectral Learning for NLP 109
![Page 144: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/144.jpg)
Cancellation of Linear Operators
Can compute any quantity that marginalizes out latent states
E.g.: the inside-outside algorithm can compute “marginals”
µ(a, i, j) : the probability that a spans words i through j
No latent states involved! They are marginalized out
They are used as auxiliary variables in the model
Spectral Learning for NLP 110
![Page 145: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/145.jpg)
Minimum Bayes Risk Decoding
Parsing algorithm:

I Find marginals µ(a, i, j) for each non-terminal a and span (i, j) in a sentence

I Compute using CKY the best tree t:

    arg max_t Σ_{(a,i,j)∈t} µ(a, i, j)
Minimum Bayes risk decoding (Goodman, 1996)
Spectral Learning for NLP 111
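A simplified, unlabeled version of this decoder can be sketched as CKY over span marginals. Here mu maps spans to scores with the non-terminal already maximized out (a simplification of the labeled algorithm), and the example sentence and scores are made up:

```python
# CKY search for the binary bracketing maximizing the sum of span
# marginals (minimum Bayes risk decoding, unlabeled sketch).
def mbr_cky(mu, n):
    best = {}  # span -> (best score, best split point)
    for i in range(n):
        best[(i, i)] = (mu.get((i, i), 0.0), None)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            k, score = max(((k, best[(i, k)][0] + best[(k + 1, j)][0])
                            for k in range(i, j)), key=lambda x: x[1])
            best[(i, j)] = (mu.get((i, j), 0.0) + score, k)
    return best

# 'the dog saw him': toy marginals favouring the (0,1)(2,3) bracketing.
mu = {(0, 1): 0.9, (2, 3): 0.8, (0, 3): 1.0, (1, 3): 0.1}
chart = mbr_cky(mu, 4)
print(chart[(0, 3)])  # best score and split point for the root span
```

The split stored at each span lets the best tree be read back top-down.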
![Page 146: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/146.jpg)
Overview
Basic concepts
Lexical representations
Hidden Markov models
Latent-variable PCFGs
  Background
  Spectral algorithm
  Justification of the algorithm
  Experiments
Conclusion
Spectral Learning for NLP 112
![Page 147: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/147.jpg)
Results with EM (section 22 of Penn treebank)
m = 8    86.87
m = 16   88.32
m = 24   88.35
m = 32   88.56

Vanilla PCFG maximum-likelihood estimation performance: 68.62%
We focus on m = 32
Spectral Learning for NLP 113
![Page 148: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/148.jpg)
Key Ingredients for Accurate Spectral Learning
Feature functions
Handling negative marginals
Scaling of features
Smoothing
Spectral Learning for NLP 114
![Page 149: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/149.jpg)
Inside Features Used

Consider the VP node in the following tree:
S
NP
D
the
N
cat
VP
V
saw
NP
D
the
N
dog

The inside features consist of:
I The pairs (VP, V) and (VP, NP)
I The rule VP → V NP
I The tree fragment (VP (V saw) NP)
I The tree fragment (VP V (NP D N))
I The pair of head part-of-speech tag with VP: (VP, V)
I The width of the subtree spanned by VP: (VP, 2)
Spectral Learning for NLP 115
![Page 150: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/150.jpg)
Outside Features Used

Consider the D node in the following tree:
S
NP
D
the
N
cat
VP
V
saw
NP
D
the
N
dog

The outside features consist of:

I The fragments (NP D* N), (VP V (NP D* N)), and (S NP (VP V (NP D* N)))

I The pair (D, NP) and triplet (D, NP, VP)

I The pair of head part-of-speech tag with D: (D, N)

I The widths of the spans left and right of D: (D, 3) and (D, 1)
Spectral Learning for NLP 116
![Page 151: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/151.jpg)
Accuracy (section 22 of the Penn treebank)
The accuracy out-of-the-box with these features is: 55.09%

EM’s accuracy: 88.56%
Spectral Learning for NLP 117
![Page 152: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/152.jpg)
Negative Marginals
Sampling error can lead to negative marginals
Signs of marginals are flipped
On certain sentences, this gives the world’s worst parser:
t* = arg max_t −score(t) = arg min_t score(t)
Taking the absolute value of the marginals fixes it
Likely to be caused by sampling error
Spectral Learning for NLP 118
![Page 153: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/153.jpg)
Accuracy (section 22 of the Penn treebank)
The accuracy with absolute-value marginals is:
80.23%
EM’s accuracy: 88.56%
Spectral Learning for NLP 119
![Page 154: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/154.jpg)
Scaling of Features by Inverse Variance
Features are mostly binary. Replace φi(t) by
φi(t) × √(1 / (count(i) + κ))

where κ = 5

This is an approximation to replacing φ(t) by

    C−1/2 φ(t)

where C = E[φφ>]
Closely related to canonical correlation analysis
Spectral Learning for NLP 120
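The scaling rule is a one-liner over a feature matrix. A sketch assuming Phi stacks the (mostly binary) feature vectors row-wise; the matrix below is a made-up toy:

```python
import numpy as np

# Scale each feature column i by sqrt(1 / (count(i) + kappa)), kappa = 5.
def scale_features(Phi, kappa=5.0):
    counts = Phi.sum(axis=0)  # count(i): how often feature i fires
    return Phi * np.sqrt(1.0 / (counts + kappa))

Phi = np.array([[1, 0, 1],
                [1, 1, 0],
                [1, 0, 0]], dtype=float)
print(scale_features(Phi)[0])  # rare features receive larger weight
```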
![Page 155: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/155.jpg)
Accuracy (section 22 of the Penn treebank)
The accuracy with scaling is: 86.47%

EM’s accuracy: 88.56%
Spectral Learning for NLP 121
![Page 156: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/156.jpg)
Smoothing

Estimates required:

    E(VPh1 → Vh2 NPh3 | VPh1) = (1/M) Σ_{i=1}^{M} Zh1(o(i)) × Yh2(t2(i)) × Yh3(t3(i))

Smooth using “backed-off” estimates, e.g.:

    λ E(VPh1 → Vh2 NPh3 | VPh1) + (1 − λ) F(VPh1 → Vh2 NPh3 | VPh1)

where

    F(VPh1 → Vh2 NPh3 | VPh1)
      = ((1/M) Σ_{i=1}^{M} Zh1(o(i)) × Yh2(t2(i))) × ((1/M) Σ_{i=1}^{M} Yh3(t3(i)))
Spectral Learning for NLP 122
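The interpolation can be sketched directly from the two moment averages. Z, Y2, Y3, and lambda below are illustrative stand-ins filled with random toy data:

```python
import numpy as np

# Backed-off smoothing: interpolate the full estimate E_hat with a
# factored estimate F_hat that averages the right child separately.
rng = np.random.default_rng(0)
M, m, lam = 200, 3, 0.8
Z = rng.normal(size=(M, m))    # Z_h1(o(i))
Y2 = rng.normal(size=(M, m))   # Y_h2(t2(i))
Y3 = rng.normal(size=(M, m))   # Y_h3(t3(i))

E_hat = np.einsum('ia,ib,ic->abc', Z, Y2, Y3) / M
# F_hat = ((1/M) sum_i Z (x) Y2) outer ((1/M) sum_i Y3)
F_hat = np.einsum('ab,c->abc', np.einsum('ia,ib->ab', Z, Y2) / M,
                  Y3.mean(axis=0))
smoothed = lam * E_hat + (1 - lam) * F_hat
print(smoothed.shape)  # (3, 3, 3)
```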
![Page 157: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/157.jpg)
Accuracy (section 22 of the Penn treebank)
The accuracy with smoothing is:
88.82%
EM’s accuracy: 88.56%
Spectral Learning for NLP 123
![Page 158: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/158.jpg)
Final Results
Final results on the Penn treebank
         section 22            section 23
         EM       spectral     EM       spectral
m = 8    86.87    85.60        —        —
m = 16   88.32    87.77        —        —
m = 24   88.35    88.53        —        —
m = 32   88.56    88.82        87.76    88.05
Spectral Learning for NLP 124
![Page 159: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/159.jpg)
Simple Feature Functions
Use rule above (for outside) and rule below (for inside)
Corresponds to parent annotation and sibling annotation
Accuracy: 88.07%

Accuracy of parent and sibling annotation: 82.59%
The spectral algorithm distills latent states
Avoids overfitting caused by Markovization
Spectral Learning for NLP 125
![Page 160: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/160.jpg)
Running Time
EM and the spectral algorithm are cubic in the number of latent states
But EM requires a few iterations
        single       EM            spectral algorithm
m       EM iter.     best model    total    SVD    a → b c    a → x
8       6m           3h            3h32m    36m    1h34m      10m
16      52m          26h6m         5h19m    34m    3h13m      19m
24      3h7m         93h36m        7h15m    36m    4h54m      28m
32      9h21m        187h12m       9h52m    35m    7h16m      41m
SVD with sparse matrices is very efficient
Spectral Learning for NLP 126
![Page 161: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/161.jpg)
Related Work
Spectral algorithms have been used for parsing in other settings:
I Dependency parsing (Dhillon et al., 2012)
I Split head automaton grammars (Luque et al., 2012)
I Probabilistic grammars (Bailly et al., 2010)
Spectral Learning for NLP 127
![Page 162: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/162.jpg)
Summary

Presented spectral algorithms as a method for estimating latent-variable models

Formal guarantees:

I Statistical consistency
I No issue with local maxima

Complexity:

I Most time is spent on aggregating statistics
I Much faster than the alternative, expectation-maximization
I Singular value decomposition step is fast

Widely applicable for latent-variable models:

I Lexical representations
I HMMs, L-PCFGs (and R-HMMs)
I Topic modeling
Spectral Learning for NLP 128
![Page 163: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/163.jpg)
Addendum: Spectral Learning for Topic Modeling
![Page 164: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/164.jpg)
Spectral Topic Modeling: Bag-of-Words
I Bag-of-words model with K topics and d words
I Model parameters: for i = 1 . . . K,

    wi ∈ R : probability of topic i
    µi ∈ Rd : word distribution of topic i

I Task: recover wi and µi for all topics i = 1 . . . K
Spectral Learning for NLP 130
![Page 165: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/165.jpg)
Spectral Topic Modeling: Bag-of-Words
I Estimate a matrix A ∈ Rd×d and a tensor T ∈ Rd×d×d defined by

    A = E[x1 x2>]        (expectation over bigrams)
    T = E[x1 ⊗ x2 ⊗ x3]  (expectation over trigrams)

I Claim: these are symmetric tensors in wi and µi:

    A = Σ_{i=1}^{K} wi µi µi>
    T = Σ_{i=1}^{K} wi µi ⊗ µi ⊗ µi

I We can decompose T using A to recover wi and µi (Anandkumar et al., 2012)
Spectral Learning for NLP 131
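For the single-topic model, the claimed moment identities can be built directly from toy parameters. This sketch sanity-checks that A and T, formed as sums of rank-one terms, are proper bigram and trigram distributions (K, d, w, Mu are made-up toys):

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 2, 4
w = np.array([0.3, 0.7])                  # topic probabilities w_i
Mu = rng.dirichlet(np.ones(d), size=K).T  # (d, K); column i is mu_i

A = Mu @ np.diag(w) @ Mu.T                       # sum_i w_i mu_i mu_i^T
T = np.einsum('i,ai,bi,ci->abc', w, Mu, Mu, Mu)  # sum_i w_i mu_i (x)3

# Sanity checks: A's entries sum to 1 (a bigram distribution over
# word pairs), and so do T's (a trigram distribution).
print(abs(A.sum() - 1) < 1e-12, abs(T.sum() - 1) < 1e-12)
```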
![Page 166: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/166.jpg)
Spectral Topic Modeling: LDA
I Latent Dirichlet Allocation model with K topics and d words

I Parameter vector α = (α1 . . . αK) ∈ RK

I Define α0 = Σi αi

I Dirichlet distribution over the probability simplex h ∈ Δ^{K−1}:

    pα(h) = (Γ(α0) / Πi Γ(αi)) × Πi hi^{αi−1}

I A document can be a mixture of topics:

    1. Draw topic distribution h = (h1 . . . hK) from Dir(α)
    2. Draw words x1 . . . xl from the word distribution h1µ1 + · · · + hKµK ∈ Rd

I Task: assume α0 is known; recover αi and µi for all topics i = 1 . . . K
Spectral Learning for NLP 132
![Page 167: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/167.jpg)
Spectral Topic Modeling: LDA
I Estimate a vector v ∈ Rd, a matrix A ∈ Rd×d, and a tensor T ∈ Rd×d×d defined by

    v = E[x1]

    A = E[x1 x2>] − (α0 / (α0 + 1)) vv>

    T = E[x1 ⊗ x2 ⊗ x3]
        − (α0 / (α0 + 2)) (E[x1 ⊗ x2 ⊗ v] + E[x1 ⊗ v ⊗ x2] + E[v ⊗ x1 ⊗ x2])
        + (2α0² / ((α0 + 2)(α0 + 1))) v ⊗ v ⊗ v
Spectral Learning for NLP 133
![Page 168: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/168.jpg)
Spectral Topic Modeling: LDA
I Claim: these are symmetric tensors in αi and µi:

    A = Σ_{i=1}^{K} (αi / ((α0 + 1) α0)) µi µi>

    T = Σ_{i=1}^{K} (2αi / ((α0 + 2)(α0 + 1) α0)) µi ⊗ µi ⊗ µi

I We can decompose T using A to recover αi and µi (Anandkumar et al., 2012)
Spectral Learning for NLP 134
![Page 169: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/169.jpg)
References I
[1] A. Anandkumar, D. Foster, D. Hsu, S. M. Kakade, and Y. Liu. A spectral algorithm for latent Dirichlet allocation. arXiv:1204.6703, 2012.

[2] A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent-variable models. arXiv:1210.7559, 2012.

[3] R. Bailly, A. Habrard, and F. Denis. A spectral approach for probabilistic grammatical inference on trees. In Proceedings of ALT, 2010.

[4] B. Balle and M. Mohri. Spectral learning of general weighted automata via constrained matrix completion. In Advances in Neural Information Processing Systems 25, pages 2168–2176, 2012.
Spectral Learning for NLP 135
![Page 170: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/170.jpg)
References II
[5] B. Balle, A. Quattoni, and X. Carreras. A spectral learning algorithm for finite state transducers. In Proceedings of ECML, 2011.

[6] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of COLT, 1998.

[7] P. F. Brown, P. V. deSouza, R. L. Mercer, V. J. Della Pietra, and J. C. Lai. Class-based n-gram models of natural language. Computational Linguistics, 18:467–479, 1992.

[8] S. B. Cohen, K. Stratos, M. Collins, D. P. Foster, and L. Ungar. Spectral learning of latent-variable PCFGs. In Proceedings of ACL, 2012.

[9] S. B. Cohen, K. Stratos, M. Collins, D. P. Foster, and L. Ungar. Experiments with spectral learning of latent-variable PCFGs. In Proceedings of NAACL, 2013.
Spectral Learning for NLP 136
![Page 171: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/171.jpg)
References III

[10] M. Collins and Y. Singer. Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 100–110, 1999.

[11] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1–38, 1977.

[12] P. Dhillon, J. Rodu, M. Collins, D. P. Foster, and L. H. Ungar. Spectral dependency parsing with latent variables. In Proceedings of EMNLP, 2012.

[13] J. Goodman. Parsing algorithms and metrics. In Proceedings of ACL, 1996.

[14] D. Hardoon, S. Szedmak, and J. Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12):2639–2664, 2004.
Spectral Learning for NLP 137
![Page 172: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/172.jpg)
References IV

[15] H. Hotelling. Relations between two sets of variates. Biometrika, 28:321–377, 1936.

[16] D. Hsu, S. M. Kakade, and T. Zhang. A spectral algorithm for learning hidden Markov models. In Proceedings of COLT, 2009.

[17] H. Jaeger. Observable operator models for discrete stochastic time series. Neural Computation, 12(6), 2000.

[18] T. K. Landauer, P. W. Foltz, and D. Laham. An introduction to latent semantic analysis. Discourse Processes, (25):259–284, 1998.

[19] F. M. Luque, A. Quattoni, B. Balle, and X. Carreras. Spectral learning for non-deterministic dependency parsing. In Proceedings of EACL, 2012.

[20] T. Matsuzaki, Y. Miyao, and J. Tsujii. Probabilistic CFG with latent annotations. In Proceedings of ACL, 2005.
Spectral Learning for NLP 138
![Page 173: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/173.jpg)
References V

[21] A. Parikh, L. Song, and E. P. Xing. A spectral algorithm for latent tree graphical models. In Proceedings of ICML, 2011.

[22] S. Petrov, L. Barrett, R. Thibaux, and D. Klein. Learning accurate, compact, and interpretable tree annotation. In Proceedings of COLING-ACL, 2006.

[23] L. Saul and F. Pereira. Aggregate and mixed-order Markov models for statistical language processing. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, pages 81–89, 1997.

[24] N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions. Technical Report No. 2009-05, 2009.
Spectral Learning for NLP 139
![Page 174: Spectral Learning Algorithms for Natural Language Processing](https://reader033.vdocuments.net/reader033/viewer/2022042922/626ab22a65c6f237a30f9c9d/html5/thumbnails/174.jpg)
References VI
[25] S. Vempala and G. Wang. A spectral algorithm for learning mixtures of distributions. Journal of Computer and System Sciences, 68(4):841–860, 2004.
Spectral Learning for NLP 140