Transcript
Page 1

Learning with Matrix Factorizations

Nati Srebro

Department of Electrical Engineering and Computer Science

Massachusetts Institute of Technology

Adviser: Tommi Jaakkola (MIT)

Committee: Alan Willsky (MIT), Tali Tishby (HUJI), Josh Tenenbaum (MIT)

Joint work with Shie Mannor (McGill), Noga Alon (TAU) and Jason Rennie (MIT)

Lots of help from John Barnett, Karen Livescu, Adi Shraibman, Michael Collins, Erik Demaine, Michel Goemans, David Karger and Dmitry Panchenko

Page 2

Dimensionality Reduction: low-dimensional representation capturing important aspects of high-dimensional data

observations → small representation (y → u):
• image → varying aspects of scene
• gene expression levels → description of biological process
• a user's preferences → characterization of user
• document → topics

• Compression (mostly to reduce processing time)
• Reconstructing latent signal
– biological processes through gene expression
• Capturing structure in a corpus
– documents, images, etc.
• Prediction: collaborative filtering

Page 3

Linear Dimensionality Reduction

y ≈ u1 × v1 + u2 × v2 + u3 × v3

Page 4

Linear Dimensionality Reduction

y ≈ u1 × v1 + u2 × v2 + u3 × v3

y: preferences of a specific user (a real-valued preference level for each title, e.g. +1, +2, -1)
v1, v2, v3: per-title attributes (comic value, dramatic value, violence)
u1, u2, u3: characteristics of the user (weight on each attribute)

Page 5

Linear Dimensionality Reduction

y ≈ u × V'    (y: preferences over titles; V': one column per title)

Page 6

Linear Dimensionality Reduction

Stacking all users: the data matrix Y (users × titles) ≈ U × V'

Page 7

Matrix Factorization

data Y ≈ X = U × V',  with X of rank k

• Non-Negativity [LeeSeung99]

• Stochasticity (convexity) [LeeSeung97][Barnett04]

• Sparsity
– Clustering as an extreme (when rows of U are sparse)

• Unconstrained: Low Rank Approximation

Page 8

Matrix Factorization

data Y = U × V' + Z,  Z i.i.d. Gaussian

• Additive Gaussian noise ⇒ minimize ‖Y − UV'‖Fro

log L(UV'; Y) = ∑ij log P(Yij | (UV')ij) = −(1/(2σ²)) ∑ij (Yij − (UV')ij)² + const
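Under additive i.i.d. Gaussian noise, the maximum-likelihood rank-k matrix is the one minimizing the Frobenius error, given by the truncated SVD (Eckart-Young). A minimal numpy sketch of that step (the names and dimensions here are my own, for illustration only):

```python
import numpy as np

def low_rank_approx(Y, k):
    """Rank-k X minimizing ||Y - X||_Fro: keep the k leading SVD components."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
X_true = rng.standard_normal((100, 2)) @ rng.standard_normal((2, 40))  # rank-2 signal
Y = X_true + 0.1 * rng.standard_normal((100, 40))                      # additive Gaussian noise
X_hat = low_rank_approx(Y, k=2)
print(np.linalg.norm(X_hat - X_true) / np.linalg.norm(X_true))         # relative recovery error
```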

Page 9

Matrix Factorization

data Y = U × V' + Z,  Zij i.i.d. with density p(Zij)

• Additive Gaussian noise ⇒ minimize ‖Y − UV'‖Fro

• General additive noise ⇒ minimize ∑ −log p(Yij − (UV')ij)

Page 10

Matrix Factorization

data Y generated from X = U × V' via p(Yij | (UV')ij)

• Additive Gaussian noise ⇒ minimize ‖Y − UV'‖Fro
• General additive noise ⇒ minimize ∑ −log p(Yij − (UV')ij)
• General conditional models ⇒ minimize ∑ −log p(Yij | (UV')ij)
– Multiplicative noise
– Binary observations: Logistic LRA
– Exponential PCA [Collins+01]
– Multinomial (pLSA [Hofman01])
• Other loss functions [Gordon02] ⇒ minimize ∑ loss((UV')ij; Yij)

Page 11

The Aspect Model (pLSA)  [Hoffman+99]

Y: word-document co-occurrence counts (documents i = 1..I, words j = 1..J)

X = [p(i,j)] has rank k:  p(i,j) = ∑t p(i,t) p(j|t),  with topics t = 1..k

Y|X ∼ Multinomial(N,X)

Yij|Xij ∼ Binomial(N,Xij)

N=∑ Yij
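As a toy illustration of this generative process (the dimensions and variable names below are made up, not from the slides), one can build a rank-k joint p(i,j) and draw the co-occurrence counts from it:

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_words, k, N = 20, 50, 3, 1000

p_it = rng.dirichlet(np.ones(n_docs * k)).reshape(n_docs, k)  # joint p(i,t), sums to 1 overall
p_jt = rng.dirichlet(np.ones(n_words), size=k)                 # p(j|t), each row sums to 1

X = p_it @ p_jt                  # rank-k matrix of probabilities p(i,j) = sum_t p(i,t) p(j|t)
Y = rng.multinomial(N, X.ravel()).reshape(n_docs, n_words)     # Y | X ~ Multinomial(N, X)
print(Y.sum())                   # equals N, the total number of co-occurrences
```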

Page 12

Low-Rank Models for Matrices of Counts, Occurrences or Frequencies

• Mean parameterization: 0 ≤ Xij ≤ 1, E[Yij|Xij] = Xij
– Independent Bernoulli: P(Yij=1) = Xij
– Independent Binomials: Yij|Xij ~ Bin(N, Xij): Aspect Model (pLSA) [Hoffman+99], ≈ NMF [Lee+01]
– Multinomial: ≡ NMF if ∑Xij = 1
• Natural parameterization: unconstrained Xij; p(Yij|Xij) ∝ exp(YijXij + F(Yij)), Exponential PCA [Collins+01]
– Independent Bernoulli: Logistic Low Rank Approximation [Schein+03], g(x) = 1/(1+e^(−x)) (cf. hinge loss)
– Independent Binomials: Yij|Xij ~ Bin(N, g(Xij))
– Multinomial: Sufficient Dimensionality Reduction [Globerson+02]

Page 13

Outline
• Finding Low Rank Approximations
– Weighted Low Rank Approximation: minimize ∑ij Wij(Yij − Xij)²
– Use WLRA as a basis for other losses / conditional models
• Consistency of Low Rank Approximation
When more data is available, do we converge to the correct solution? Not always…
• Matrix Factorization for Collaborative Filtering
– Maximum Margin Matrix Factorization
– Generalization Error Bounds (Low Rank & MMMF)

Page 14

Finding Low Rank Approximations

Find the rank-k X minimizing ∑ loss(Xij;Yij) (= −log p(Y|X))

• Non-convex:
– "X is rank-k" is a non-convex constraint
– ∑ loss((UV')ij;Yij) is not convex in U,V
• rank-k X minimizing ∑(Yij − Xij)²:
– non-convex, but no (non-global) local minima
– solution: leading components of the SVD
• For other loss functions, or with missing data: cannot use the SVD; local minima; difficult problem
• Weighted Low Rank Approximation: minimize ∑ Wij(Yij − Xij)²
with arbitrarily specified weights (part of the input)

Page 15

WLRA: Optimization

J(UV') = ∑ij Wij (Yij − (UV')ij)²        (U is n×k, V is m×k)

Alternate: for fixed V, find the optimal U; for fixed U, find the optimal V.

J*(V) = minU J(UV')
∂J*(V)/∂V = 2 ((U*V' − Y) ⊗ W)' U*        (⊗ = elementwise product)

Conjugate gradient descent on J*: optimize km parameters instead of k(n+m).

EM approach:  X ← LowRankApprox(W ⊗ Y + (1−W) ⊗ X)
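A minimal sketch of the EM-style update just stated, X ← LowRankApprox(W⊗Y + (1−W)⊗X); it assumes weights in [0, 1] and uses a plain truncated SVD for the unweighted step (the function names are mine):

```python
import numpy as np

def low_rank_approx(A, k):
    """Unweighted rank-k approximation: k leading SVD components."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

def wlra_em(Y, W, k, n_iters=200):
    """Weighted low-rank approximation: approximately minimize sum_ij W_ij (Y_ij - X_ij)^2
    by repeatedly filling in W*Y + (1-W)*X and re-solving the unweighted problem."""
    X = np.zeros_like(Y)
    for _ in range(n_iters):
        X = low_rank_approx(W * Y + (1.0 - W) * X, k)
    return X

rng = np.random.default_rng(0)
Y = rng.standard_normal((30, 20))
W = rng.uniform(size=Y.shape)          # arbitrary weights in [0, 1] (part of the input)
X = wlra_em(Y, W, k=2)
```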

Page 16

Newton Optimization for Non-Quadratic Loss Functions

minimize ∑ij loss(Xij;Yij),  with loss(Xij;Yij) convex in Xij

• Iteratively optimize quadratic approximations of objective

• Each such quadratic optimization is a weighted low rank approximation
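A sketch of that reduction, using logistic loss on ±1 observations as the convex per-entry loss: each quadratic (Newton) approximation uses the per-entry curvature as its weight and is solved as a WLRA. It assumes the wlra_em() helper from the sketch above is in scope; everything else here is my own illustration:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_lra_newton(Y, k, n_newton=20):
    """Rank-k X approximately minimizing sum_ij log(1 + exp(-Y_ij * X_ij))
    by iterating quadratic approximations, each solved as a weighted low-rank step."""
    X = np.zeros_like(Y, dtype=float)
    for _ in range(n_newton):
        p = sigmoid(Y * X)            # probability of the observed sign under the current X
        grad = -Y * (1.0 - p)         # d loss / d X_ij
        hess = p * (1.0 - p) + 1e-6   # d^2 loss / d X_ij^2, kept strictly positive (and <= 1)
        A = X - grad / hess           # target of the local quadratic approximation
        X = wlra_em(A, hess, k)       # assumes wlra_em() from the WLRA sketch above
    return X
```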

Page 17

Maximum Likelihood Estimation with Gaussian Mixture Noise

Y (observed) = X (rank k) + Z (noise), with Zij drawn from a mixture of Gaussians {pc, µc, σc} and hidden component indicators Cij.

E step: calculate the posteriors of C.
M step: WLRA with per-entry weights and targets
Wij = ∑c Pr(Cij = c) / σc²
Aij = ( ∑c Pr(Cij = c) (Yij − µc) / σc² ) / Wij

Page 18

Outline
• Finding Low Rank Approximations
– Weighted Low Rank Approximation: minimize ∑ij Wij(Yij − Xij)²
– Use WLRA as a basis for other losses / conditional models
• Consistency of Low Rank Approximation
When more data is available, do we converge to the correct solution? Not always…
• Matrix Factorization for Collaborative Filtering
– Maximum Margin Matrix Factorization
– Generalization Error Bounds (Low Rank & MMMF)

Page 19

Asymptotic Behavior of Low Rank Approximation

Y = X + Z:  X parameters (rank k), Z random, independent

A single observation of Y; the number of parameters is linear in the number of observables.
We can never approach correct estimation of the parameters (the entries of X).
What can be estimated is the row-space of X.

Page 20

Y = U V + Z:  Z random, independent

The number of parameters is linear in the number of observables.
What can be estimated is the row-space of X.

Page 21

Y = U V + Z:  U random, Z random, independent

What can be estimated is the row-space of X.

Page 22

What can be estimated is the row-space of X.

Each row y = u V + z is a sample: we get multiple samples of the random variable y.

Page 23

Probabilistic PCA  [Tipping Bishop 97]

y = x + z = u × V + z
u ~ Gaussian, z ~ Gaussian(σ²I), so x ~ Gaussian with rank-k covariance ΣX
estimated parameters: V (and σ²)

Maximum likelihood ≡ PCA

Page 24

Probabilistic PCA  [Tipping Bishop 97]
y = u × V + z,  u ~ Gaussian,  z ~ Gaussian(σ²I)
estimated parameters: V

Latent Dirichlet Allocation  [Blei Ng Jordan 03]
y ~ Multinomial(N, u × V),  u ~ Dirichlet(α)
estimated parameters: V

Page 25

Generative and Non-Generative Low Rank Models

Parametric generative models:
• Y = X + Gaussian: "Probabilistic PCA"
• Y ~ Binom(N,X): pLSA, Latent Dirichlet Allocation
Consistency of Maximum Likelihood estimation is guaranteed if the model assumptions hold (both on Y|X and on U).

Non-parametric generative models

Page 26

y = x + z = u × V + z,  z ~ Gaussian
u ~ PU: nonparametric, unconstrained (a nuisance parameter)
estimated parameters: the row-space of V, not the entries of V

Non-parametric model; estimation of a parametric part of the model.
Maximum Likelihood ≡ "Single Observed Y"

Page 27

Consistency of ML Estimation

[Figure: true V vs. estimated V]

Page 28

Consistency of ML Estimation

Ψ(V; x) = Ez[ maxu log pZ(x + z − uV) ]    (the expected contribution of x to the likelihood of V)

If, for all x, V maximizes Ψ(V; x) iff V spans x, then the ML estimator is consistent for any Pu.

When Z ~ Gaussian, −Ψ(V; x) is (up to constants) the expected L2 distance of x+z from V.
For i.i.d. Gaussian noise, ML estimation (PCA) of the low-rank subspace is consistent.

Page 29

Consistency of ML Estimation

Ψ(V; x) = Ez[ maxu log pZ(x + z − uV) ]    (the expected contribution of x to the likelihood of V)

The ML estimator is consistent for any Pu if:
• for all x, V maximizes Ψ(V; x) iff V spans x
• Ψ(V; 0) is constant for all V    (the x = 0 case)

Page 30

Consistency of ML Estimation

Laplace noise:  pZ(z[i]) = ½ e^(−|z[i]|)

[Plot: −Ψ(V; 0) as a function of the angle θ ∈ [0, 2π] of a one-dimensional subspace V; the value varies between 1 and 1.5, with marks where V is the horizontal or vertical axis, so Ψ(V; 0) is not constant.]

Page 31

Consistency of ML Estimation

General conditional model for Y|X, with X = UV':
Ψ(V; x) = EY|X=x[ maxu log pY|X(Y | uV) ]    (the expected contribution of x to the likelihood of V)

The ML estimator is consistent for any Pu if:
• for all x, V maximizes Ψ(V; x) iff V spans x
• Ψ(V; 0) is constant for all V

Page 32

Consistency of ML Estimation

Logistic model:  pY|X(Y[i]=1 | x[i]) = 1 / (1 + e^(−x[i]))
Ψ(V; x) = EY|X=x[ maxu log pY|X(Y | uV) ]

[Plot: −Ψ(V; 0) as a function of the angle θ ∈ [0, 2π] of a one-dimensional subspace V; again not constant.]

Page 33

Consistent Estimation
• Additive i.i.d. noise Y = X + Z:
Maximum Likelihood generally not consistent.
PCA is consistent (for any noise distribution): take the span of the k leading eigenvectors of ΣY = ΣX + σ²I, where ΣX has rank k.

Page 34

Consistent Estimation
• Additive i.i.d. noise Y = X + Z:
Maximum Likelihood generally not consistent.
PCA is consistent (for any noise distribution):
ΣY = ΣX + ΣZ = ΣX + σ²I
eigenvalues of ΣX:  s1, s2, …, sk, 0, 0, …, 0
eigenvalues of ΣY:  s1+σ², s2+σ², …, sk+σ², σ², σ², …, σ²

Page 35

Consistent Estimation
• Additive i.i.d. noise Y = X + Z:
Maximum Likelihood generally not consistent.
PCA is consistent (for any noise distribution).

[Plot: distance to the correct subspace vs. sample size (number of observed rows, 10 to 10000) for the ML and PCA estimators; noise: 0.99 N(0,1) + 0.01 N(0,100).]
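The curve above comes from a simulation of this kind; the following is a rough reimplementation of the PCA side only (not the slide's exact experiment, and all names are mine), measuring the distance from the estimated to the true subspace as the number of observed rows grows, under the same heavy-tailed noise:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 10, 2
V_true = np.linalg.qr(rng.standard_normal((d, k)))[0]   # orthonormal basis of the true subspace

def subspace_distance(V_est, V_ref):
    """Sine of the largest principal angle between the two column spaces."""
    return np.linalg.norm(V_ref - V_est @ (V_est.T @ V_ref), 2)

for n in [10, 100, 1000, 10000]:
    U = rng.standard_normal((n, k))                      # latent row coordinates
    X = U @ V_true.T                                     # rank-k signal rows
    wide = rng.random((n, d)) < 0.01                     # 1% of entries from the wide component
    Z = np.where(wide, 10.0, 1.0) * rng.standard_normal((n, d))   # 0.99 N(0,1) + 0.01 N(0,100)
    Y = X + Z
    evals, evecs = np.linalg.eigh(Y.T @ Y / n)           # sample second-moment matrix
    V_pca = evecs[:, -k:]                                # span of the k leading eigenvectors
    print(n, subspace_distance(V_pca, V_true))
```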

Page 36

Consistent Estimation
• Additive i.i.d. noise Y = X + Z:
Maximum Likelihood generally not consistent; PCA is consistent (for any noise distribution).
• Unbiased noise E[Y|X] = X:
Maximum Likelihood generally not consistent; PCA not consistent.
– Can correct by ignoring the diagonal of the covariance.
• Exponential PCA (X are natural parameters):
Maximum Likelihood generally not consistent; covariance methods not consistent; consistent estimation is an open question.

Page 37

Outline
• Finding Low Rank Approximations
– Weighted Low Rank Approximation: minimize ∑ij Wij(Yij − Xij)²
– Use WLRA as a basis for other losses / conditional models
• Consistency of Low Rank Approximation
When more data is available, do we converge to the correct solution? Not always…
• Matrix Factorization for Collaborative Filtering
– Maximum Margin Matrix Factorization
– Generalization Error Bounds (Low Rank & MMMF)

Page 38

Collaborative Filtering

Based on preferences so far, and the preferences of others ⇒ predict further preferences.

Implicit or explicit preferences? Type of queries?

[Figure: a users × movies matrix of observed ratings (values 1-5).]

Page 39

Matrix Completion

Based on a partially observed matrix ⇒ predict the unobserved entries ("Will user i like movie j?").

[Figure: a users × movies matrix with observed ratings (1-5) and unobserved entries marked "?".]

Page 40

Matrix Completion with Matrix Factorization

Fit a factorizable (low-rank) matrix X = UV' to the observed entries:
minimize ∑ loss(Xij;Yij) over the observed (i,j).
Use the matrix X to predict the unobserved entries.

[Figure: partially observed rating matrix Y (observation) and the full ±1 matrix obtained from X = UV' (prediction).]

[Sarwar+00] [Azar+01] [Hoffman04] [Marlin+04]
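A bare-bones sketch of the fitting step just described, with squared loss over the observed entries, solved by alternating least squares over U and V (one common choice, not necessarily the procedure used in the cited works; the names and the regularization are mine):

```python
import numpy as np

def complete_matrix(Y, observed, k, n_iters=50, reg=1e-3):
    """Fit X = U V' to the observed entries (boolean mask `observed`) by alternating
    ridge regressions, minimizing sum over observed (i,j) of (Y_ij - X_ij)^2;
    the returned full matrix X is then used to predict the unobserved entries."""
    n, m = Y.shape
    rng = np.random.default_rng(0)
    U = rng.standard_normal((n, k))
    V = rng.standard_normal((m, k))
    for _ in range(n_iters):
        for i in range(n):                      # solve for each row of U with V fixed
            Vi = V[observed[i]]
            U[i] = np.linalg.solve(Vi.T @ Vi + reg * np.eye(k), Vi.T @ Y[i, observed[i]])
        for j in range(m):                      # solve for each row of V with U fixed
            Uj = U[observed[:, j]]
            V[j] = np.linalg.solve(Uj.T @ Uj + reg * np.eye(k), Uj.T @ Y[observed[:, j], j])
    return U @ V.T
```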

Page 41

Matrix Completion with Matrix Factorization

When U is fixed, each column of Y poses a linear classification problem:
• rows of U are the feature vectors
• columns of V are the linear classifiers

Fitting U and V: learning features that work well across all the classification problems.

[Figure: ±1 label matrix with columns v1 … v12 as classifiers and real-valued rows of U as feature vectors.]

Page 42

Max-Margin Matrix Factorization

Instead of bounding the dimensionality of U,V, bound the norms of U,V.

Xij = ⟨Ui, Vj⟩, with U and V of low norm.
For observed Yij ∈ ±1:  Yij Xij ≥ Margin

[Figure: ±1 label matrix Y with factors U and V.]

Page 43

Max-Margin Matrix Factorization

Xij = ⟨Ui, Vj⟩, with U and V of low norm.
For observed Yij ∈ ±1:  Yij Xij ≥ Margin

When U is fixed: each column of V is an SVM.

bound norms on average:   (∑i |Ui|²) (∑j |Vj|²) ≤ 1
bound norms uniformly:    (maxi |Ui|²) (maxj |Vj|²) ≤ 1

Page 44

Max-Margin Matrix Factorization (same content as the previous page)

Page 45

Geometric Interpretation

(maxi |Ui|²) (maxj |Vj|²) ≤ 1

[Figure: columns of V (items) and rows of U (users) as points in a low-dimensional space, with the ±1 label matrix from before.]

Page 46

Geometric Interpretation (same content as the previous page)

Page 47

Finding Max-Margin Matrix Factorizations

maximize M  s.t.  Yij Xij ≥ M,  X = UV',  (maxi |Ui|²) (maxj |Vj|²) ≤ 1
maximize M  s.t.  Yij Xij ≥ M,  X = UV',  (∑i |Ui|²) (∑j |Vj|²) ≤ 1

Unlike rank(X) ≤ k, these are convex constraints!

Page 48

Finding Max-Margin Matrix Factorizations

[Figure: observed ±1 entries of Y and the full matrix X.]

X = UV' with (∑i |Ui|²) (∑j |Vj|²) ≤ 1, i.e. bounded trace norm |X|tr = ∑ (singular values of X)

Primal (soft margin):
minimize tr(A) + tr(B) + c ∑ ξij
s.t.  [A X; X' B] p.s.d.,  Yij Xij ≥ 1 − ξij

Dual: one variable Qij for each observed (i,j):
maximize ∑ Qij
s.t.  0 ≤ Qij ≤ c,  ‖Q ⊗ Y‖2 ≤ 1
(Q ⊗ Y is the sparse elementwise product, zero for unobserved entries)
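For small problems, the soft-margin trace-norm (average-norm) formulation can also be handed directly to a generic convex solver; a sketch using cvxpy (this is not the CSDP/dual implementation from these slides, and the data, mask, and constant c are invented for illustration):

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, m, c = 15, 10, 1.0
Y = np.sign(rng.standard_normal((n, m)))         # +/-1 labels
M = (rng.random((n, m)) < 0.3).astype(float)     # 1 where the entry is observed

X = cp.Variable((n, m))
# hinge loss on observed entries: max(0, 1 - Y_ij X_ij)
hinge = cp.sum(cp.multiply(M, cp.pos(1 - cp.multiply(Y, X))))
# trace norm |X|_tr = sum of singular values = min (|U|_Fro^2 + |V|_Fro^2)/2 over X = U V'
problem = cp.Problem(cp.Minimize(cp.normNuc(X) + c * hinge))
problem.solve()
print(problem.value)
```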

Page 49

Finding Max-Margin Matrix Factorizations

• Semi-definite program with a sparse dual: limited by the number of observations, not the matrix size (for both the average-norm and max-norm formulations).
• Current implementation: use CSDP (an off-the-shelf solver); up to 30k observations (e.g. 1000×1000, 3% observed).
• For large-scale problems: updates on the dual alone?

(Same primal/dual SDP as on the previous page, with one dual variable Qij per observed (i,j).)

Page 50

Loss Functions for Rankings

[Plot: loss(Xij;Yij) as a function of Xij, for a binary label Yij = +1, with 0 marked on the Xij axis.]

Page 51

Loss Functions for Rankings

[Plot: loss(Xij;Yij) as a function of Xij for ratings Yij = 1…7, with thresholds θ1 … θ6 separating the rating levels.]

Page 52

Loss Functions for Rankings

[Plot: loss(Xij;Yij) as a function of Xij for Yij = 3, with thresholds θ1 … θ6: immediate-threshold loss [Shashua Levin 03] vs. all-threshold loss.]

• The all-threshold loss is a bound on the absolute rank-difference.
• For both loss functions: learn per-user θ's.
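A sketch of the two losses for a single entry, in their hinge form with per-user thresholds θ (this is my own rendering of the definitions above, with an arbitrary example threshold vector):

```python
import numpy as np

def hinge(z):
    return np.maximum(0.0, z)

def all_threshold_loss(x, y, theta):
    """Hinge losses summed over every threshold; bounds the absolute rank-difference.
    theta: sorted thresholds theta_1 < ... < theta_(R-1); rating y in 1..R."""
    r = np.arange(1, len(theta) + 1)
    sign = np.where(r < y, 1.0, -1.0)      # x should lie above theta_r for r < y, below otherwise
    return hinge(1.0 - sign * (x - theta)).sum()

def immediate_threshold_loss(x, y, theta):
    """Hinge losses only for the two thresholds immediately bracketing rating y."""
    loss = 0.0
    if y > 1:
        loss += hinge(1.0 - (x - theta[y - 2]))   # exceed the threshold just below y
    if y <= len(theta):
        loss += hinge(1.0 - (theta[y - 1] - x))   # stay below the threshold just above y
    return loss

theta = np.array([-1.5, -0.5, 0.5, 1.5])          # 4 thresholds, ratings 1..5
print(all_threshold_loss(0.2, 3, theta), immediate_threshold_loss(0.2, 3, theta))
```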

Page 53

Experimental Results on a MovieLens Subset

Method                      Rank Difference   Zero-One Error
All-threshold MMMF          0.670             0.553
Immediate-threshold MMMF    0.715             0.542
K-medians (K=2)             0.674             0.558
Rank-1                      0.698             0.559
Rank-2                      0.714             0.553

100 users × 100 movies subset of MovieLens; 3515 training ratings, 3515 test ratings.

Page 54

Outline
• Finding Low Rank Approximations
– Weighted Low Rank Approximation: minimize ∑ij Wij(Yij − Xij)²
– Use WLRA as a basis for other losses / conditional models
• Consistency of Low Rank Approximation
When more data is available, do we converge to the correct solution? Not always…
• Matrix Factorization for Collaborative Filtering
– Maximum Margin Matrix Factorization
– Generalization Error Bounds (Low Rank & MMMF)

Page 55

Generalization Error Bounds

generalization error:  D(X;Y) = ∑ij loss(Xij;Yij) / nm

Y: unknown, assumption-free.

Prior work assuming a low-rank structure (eigengap):
• asymptotic behavior [Azar+01]
• sample complexity, query strategy [Drineas+02]

[Figure: full ±1 matrices X and Y.]

Page 56

Generalization Error Bounds

generalization error:  D(X;Y) = ∑ij loss(Xij;Yij) / nm
empirical error:       DS(X;Y) = ∑ij∈S loss(Xij;Yij) / |S|

S: random training set of observed entries; Y: unknown, assumption-free source; X: hypothesis.

∀Y  PrS ( ∀ rank-k X:  D(X;Y) < DS(X;Y) + ε ) > 1−δ

Page 57

Generalization Error Bounds

0/1 loss: loss(Xij;Yij) = 1 when sign(Xij) ≠ Yij

generalization error:  D(X;Y) = ∑ij loss(Xij;Yij) / nm
empirical error:       DS(X;Y) = ∑ij∈S loss(Xij;Yij) / |S|

For a particular X,Y, over the random training set S:  loss(Xij;Yij) ~ Bernoulli(D(X;Y))
Pr( DS(X;Y) < D(X;Y) − ε ) < e^(−2|S|ε²)

Union bound over all possible Xs:
∀Y  PrS ( ∀X:  D(X;Y) < DS(X;Y) + ε ) > 1−δ
ε = √[ ( k(n+m) log(8em/k) + log(1/δ) ) / (2|S|) ]
where k(n+m) log(8em/k) bounds log(# of possible sign configurations of rank-k X).

Page 58

Number of Sign Configurations of Rank-k Matrices

[Example: a 15×5 rank-2 matrix X factored as X = UV', with U ∈ R^(15×2) and V' ∈ R^(2×5) (numeric entries shown on the slide).]

Xi,j = ∑r Ui,r Vr,j : nm polynomials of degree 2 in the k(n+m) variables (nk in U, mk in V).

Warren (1968): the number of connected components of { x | ∀i Pi(x) ≠ 0 } is at most
( 4e · (degree) · (# polys) / (# variables) )^(# variables).

[Figure: three polynomial curves P1(x1,x2)=0, P2(x1,x2)=0, P3(x1,x2)=0 partitioning the plane into 8 sign regions.]

Based on [Alon95]

Page 59

Number of Sign Configurations of Rank-k Matrices

[Example: the 15×5 rank-2 matrix X = UV' from the previous page.]

Xi,j = ∑r Ui,r Vr,j : nm polynomials of degree 2 in the k(n+m) variables.

Warren (1968): the number of connected components of { x | ∀i Pi(x) ≠ 0 }, and hence the number of sign configurations, is at most
( 4e · (degree) · (# polys) / (# variables) )^(# variables)  =  ( 8e·nm / (k(n+m)) )^(k(n+m)).

Based on [Alon95]
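To see how much smaller this count is than the trivial 2^(nm), here is a small calculation using the bound as reconstructed above (the helper name and the example sizes are mine):

```python
import math

def log2_sign_configs_rank_k(n, m, k):
    """log2 of the Warren-based bound (8 e n m / (k(n+m)))^(k(n+m)) on the number
    of sign configurations of n-by-m matrices of rank at most k."""
    v = k * (n + m)                       # number of variables in U and V
    return v * math.log2(8 * math.e * n * m / v)

n, m, k = 1000, 1000, 5
print(log2_sign_configs_rank_k(n, m, k))  # roughly k(n+m) log(...): about 1.1e5 here
print(n * m)                              # log2 of the trivial 2^(nm) count: 1e6
```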

Page 60

Generalization Error Bounds: Low Rank Matrix Factorization

generalization error:  D(X;Y) = ∑ij loss(Xij;Yij) / nm
empirical error:       DS(X;Y) = ∑ij∈S loss(Xij;Yij) / |S|

∀Y  PrS ( ∀X of rank k:  D(X;Y) < DS(X;Y) + ε ) > 1−δ

0/1 loss (loss(Xij;Yij) = 1 when sign(Xij) ≠ Yij):
ε = √[ ( k(n+m) log(8em/k) + log(1/δ) ) / (2|S|) ]

General loss with loss(Xij;Yij) ≤ 1 (by bounding the pseudodimension):
ε = √[ ( k(n+m) log( em|S| / (k(n+m)) ) + log(1/δ) ) / |S| ]   (up to constant factors)

Page 61

Generalization Error Bounds: Large Margin Matrix Factorization

generalization error:  D(X;Y) = ∑ij loss(Xij;Yij) / nm,  with 0/1 loss: loss(Xij;Yij) = 1 when Xij Yij ≤ 0
empirical error:       DS(X;Y) = ∑ij∈S loss1(Xij;Yij) / |S|,  with margin loss: loss1(Xij;Yij) = 1 when Xij Yij ≤ 1

∀Y  PrS ( ∀X:  D(X;Y) < DS(X;Y) + ε ) > 1−δ

For (∑i |Ui|²/n)(∑j |Vj|²/m) ≤ R²:
ε = K R (ln n)^(1/4) √( (n+m)/|S| ) + √( log(1/δ)/|S| )
(K: the universal constant from [Seginer00]'s bound on the spectral norm of a random matrix)

For (maxi |Ui|²)(maxj |Vj|²) ≤ R²:
ε = 12 R √( (n+m)/|S| ) + √( log(1/δ)/|S| )

Page 62

Maximum Margin Matrix Factorization as a Convex Combination of Classifiers

{ UV' | (∑i |Ui|²)(∑j |Vj|²) ≤ 1 } = convex-hull( { uv' | u ∈ R^n, v ∈ R^m, |u| = |v| = 1 } )

conv( { uv' | u ∈ {±1}^n, v ∈ {±1}^m } ) ⊂ { UV' | (maxi |Ui|²)(maxj |Vj|²) ≤ 1 } ⊂ 2 conv( { uv' | u ∈ {±1}^n, v ∈ {±1}^m } )
(the second inclusion follows from Grothendieck's inequality)

Page 63

Summary
• Finding Low Rank Approximations
– Weighted Low Rank Approximations
– Basis for other loss functions: Newton, Gaussian mixtures
• Consistency of Low Rank Approximation
– ML for popular low-rank models is not consistent!
– PCA is consistent for additive noise; ignoring the covariance diagonal works for unbiased noise
– Efficient estimators?
– Consistent estimators for Exponential PCA?
• Maximum Margin Matrix Factorization
– Correspondence with large-margin linear classification
– Sparse SDPs for both average-norm and max-norm formulations
– Direct optimization of the dual would enable large-scale applications
• Generalization Error Bounds for Collaborative Prediction
– First "assumption free" bounds for matrix completion
– Both for Low-Rank and for Max-Margin
– Observation process?

Page 64

Average February Temperature (centigrade)

[Bar chart: average February temperature (°C, scale -20 to 30) for Tel-Aviv, Toronto, Boston, and Haifa (Technion); the bars are marked with question marks.]

