Fast ALS-based matrix factorization for explicit and implicit feedback datasets
TRANSCRIPT
Fast ALS-based matrix factorization for explicit and implicit feedback datasets
István Pilászy, Dávid Zibriczky, Domonkos Tikk
Gravity R&D Ltd., www.gravityrd.com
28 September 2010
Collaborative filtering
Problem setting
[Slide shows a sparse user-item rating matrix with a few known ratings (5, 4, 3, 4, 4, 2, 4, 1) and many missing entries.]
• Ridge Regression
• Model: y_i ≈ x_i^T w for each example, i.e. y ≈ Xw in matrix form
[Slide shows a worked numeric example: an n x K design matrix X with K = 5 features, a target vector y, and the unknown weight vector w.]
• Optimal solution: w = (X^T X + λI)^{-1} X^T y
• Ridge Regression
• Optimal w for the example: w ≈ (0.3, 0.3, 0.01, 0.4, 0.0)
[Slide repeats the example X and y with the solved w filled in.]
• Computing the optimal solution requires forming X^T X and X^T y
• Matrix inversion is costly
• Sum of squared errors of the optimal solution: 0.055
• Ridge Regression
[Slide shows the 5 x 5 matrix X^T X and the 5-vector X^T y for the example.]
• Computing the optimal solution: w = (X^T X + λI)^{-1} (X^T y)
• Matrix inversion is costly: O(K³)
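As a concrete reference, here is a minimal numpy sketch of this closed-form solve; the toy data, the λ value, and the function name are illustrative assumptions, not from the slides:

    import numpy as np

    def ridge_solve(X, y, lam=0.1):
        """Closed-form ridge regression: w = (X^T X + lam I)^{-1} X^T y."""
        K = X.shape[1]
        A = X.T @ X + lam * np.eye(K)   # K x K, so this solve is the O(K^3) step
        b = X.T @ y
        return np.linalg.solve(A, b)    # solve() instead of an explicit inverse

    # toy usage: n = 8 examples, K = 5 features
    rng = np.random.default_rng(0)
    X, y = rng.random((8, 5)), rng.random(8)
    w = ridge_solve(X, y)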
• RR1: RR with coordinate descent
• Idea: optimize only one variable of w at a time
• Start with zero: w = (0, 0, 0, 0, 0)
• Sum of squared errors: 24.6
[Slide shows the example X, y, and the current w; the same figure recurs on the following step slides.]
• RR1: RR with coordinate descent
• Idea: optimize only one variable of w at a time
• Start with zero, then optimize w1
• Sum of squared errors: 7.5
• RR1: RR with coordinate descent
• Idea: optimize only one variable of w at a time
• Start with zero, then optimize w1, then optimize w2
• Sum of squared errors: 6.2
• RR1: RR with coordinate descent
• Idea: optimize only one variable of w at a time
• Start with zero, then optimize w1, then w2, then w3
• Sum of squared errors: 5.7
• RR1: RR with coordinate descent
• Idea: optimize only one variable of w at a time
• … w4
• Sum of squared errors: 5.4
• RR1: RR with coordinate descent
• Idea: optimize only one variable of w at a time
• … w5
• Sum of squared errors: 5.0
• RR1: RR with coordinate descent
• Idea: optimize only one variable of w at a time
• … w1 again
• Sum of squared errors: 3.4
• RR1: RR with coordinate descent
• Idea: optimize only one variable of w at a time
• … w2 again
• Sum of squared errors: 2.9
• RR1: RR with coordinate descent
• Idea: optimize only one variable of w at a time
• … w3 again
• Sum of squared errors: 2.7
• RR1: RR with coordinate descent
• Idea: optimize only one variable of w at a time
• … after a while: w ≈ (0.3, 0.3, 0.01, 0.4, 0.0)
• Sum of squared errors: 0.055
• No remarkable difference from the exact RR solution
• Cost: O(Kne) for n examples and e epochs
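A minimal numpy sketch of this RR1 update; the one-variable ridge minimizer and the residual bookkeeping are standard, but the interface and names are mine:

    import numpy as np

    def rr1(X, y, w=None, epochs=1, lam=0.1):
        """RR1: ridge regression by cyclic coordinate descent, O(Kne) total."""
        n, K = X.shape
        w = np.zeros(K) if w is None else w.copy()  # warm start supported
        r = y - X @ w                               # residual, kept up to date
        for _ in range(epochs):
            for k in range(K):
                xk = X[:, k]
                # one-variable ridge minimizer; add x_k's old contribution
                # back into the residual before re-optimizing w_k
                wk = xk @ (r + xk * w[k]) / (xk @ xk + lam)
                r -= xk * (wk - w[k])
                w[k] = wk
        return w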
• Matrix factorization
• The rating matrix R of size (M x N) is approximated as the product of two lower-rank matrices: R ≈ PQ^T, i.e. r_ui ≈ p_u^T q_i
• P: user feature matrix of size (M x K)
• Q: item (movie) feature matrix of size (N x K)
• K: number of features
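A quick numeric illustration of r_ui ≈ p_u^T q_i; the sizes and random factors are arbitrary:

    import numpy as np

    M, N, K = 4, 6, 2                  # users, items, features
    rng = np.random.default_rng(0)
    P = rng.random((M, K))             # user feature matrix
    Q = rng.random((N, K))             # item feature matrix

    R_hat = P @ Q.T                    # predicted rating matrix, M x N
    u, i = 1, 3
    assert np.isclose(R_hat[u, i], P[u] @ Q[i])   # r_ui = p_u^T q_i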
Matrix factorization for explicit feedback
[Slide shows a worked example: a sparse rating matrix R with known entries (5, 5, 4, 3, 1, 2, ...), factor matrices P and Q with K = 2, and the approximation R ≈ PQ^T, r_ui ≈ p_u^T q_i.]
Finding P and Q
[Slide shows R with its known ratings and a randomly initialized Q with two feature rows, (0.3, 0.9, 0.7, 1.3, 0.5) and (0.6, 1.2, 0.3, 1.6, 1.1); p1 is still marked "??".]
• Init Q randomly
• Find p1
Finding p1 with RR
[Slide sets up the ridge regression for user 1: the feature vectors of the items she rated form X, her ratings (5, 4, 3) form y, and p1 plays the role of w.]
• Optimal solution: p1 ≈ (2.3, 3.2)
Finding p1 with RR
[Slide shows R, Q, and the solved p1 ≈ (2.3, 3.2) filled into P.]
• Alternating Least Squares (ALS)
• Initialize Q randomly
• Repeat:
• Recompute P:
• Compute p1 with RR
• Compute p2 with RR
• … (for each user)
• Recompute Q:
• Compute q1 with RR
• … (for each item)
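A compact sketch of this loop, reusing the ridge_solve sketch from earlier; representing missing ratings as np.nan in a dense R is my simplification for illustration:

    import numpy as np

    def als(R, K=2, lam=0.1, n_iters=10):
        """ALS for explicit feedback; R is M x N with np.nan for missing."""
        M, N = R.shape
        rng = np.random.default_rng(0)
        P = np.zeros((M, K))
        Q = rng.random((N, K))                  # init Q randomly
        for _ in range(n_iters):
            for u in range(M):                  # recompute P, user by user
                seen = ~np.isnan(R[u])
                P[u] = ridge_solve(Q[seen], R[u, seen], lam)
            for i in range(N):                  # recompute Q, item by item
                seen = ~np.isnan(R[:, i])
                Q[i] = ridge_solve(P[seen], R[seen, i], lam)
        return P, Q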
• ALS1: ALS with RR1
• ALS relies on RR:
• recomputation of the vectors with RR
• when recomputing p1, the previously computed value is ignored
• ALS1 relies on RR1:
• optimize the previously computed p1, one scalar at a time
• the previously computed value is not lost
• run RR1 only for one epoch
• ALS is just an approximation method. Likewise ALS1.
• Cost: O(Kne), with e := 1 (one epoch)
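ALS1 then differs from the ALS sketch only in the inner solver: each factor is refined by a single warm-started RR1 epoch instead of being recomputed from scratch. A sketch under the same assumptions, reusing rr1 from above:

    import numpy as np

    def als1(R, K=2, lam=0.1, n_iters=10):
        """ALS1: like ALS, but each factor is refined by one epoch of RR1,
        warm-started from its previous value instead of recomputed."""
        M, N = R.shape
        rng = np.random.default_rng(0)
        P = np.zeros((M, K))
        Q = rng.random((N, K))
        for _ in range(n_iters):
            for u in range(M):
                seen = ~np.isnan(R[u])
                P[u] = rr1(Q[seen], R[u, seen], w=P[u], epochs=1, lam=lam)
            for i in range(N):
                seen = ~np.isnan(R[:, i])
                Q[i] = rr1(P[seen], R[seen, i], w=Q[i], epochs=1, lam=lam)
        return P, Q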
Implicit feedback
[Slide shows a binary watch matrix R (1 = watched, 0 = not watched) and factor matrices P and Q, with R ≈ PQ^T, r_ui ≈ p_u^T q_i.]
• Implicit feedback: IALS
• The matrix is fully specified: for every user-item pair we know whether the user watched the item.
• Zeros are less important, but still important. Many 0s, few 1s.
• Recall that w = (X^T X + λI)^{-1} X^T y, built from X^T X and X^T y
• Idea (Hu, Koren, Volinsky):
• consider a user who watched nothing
• compute X^T X and X^T y for this user (the null-user) and cache them
• when recomputing p1, compare her to the null-user
• based on the cached X^T X and X^T y, update them according to the differences
• In this way, only the number of 1s affects performance, not the number of 0s
• IALS: alternating least squares with this trick.
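A minimal sketch of this caching trick for the user-side half-sweep; the confidence weighting c_ui = 1 + α for watched items follows Hu, Koren, Volinsky, while α, λ, and all names here are illustrative assumptions:

    import numpy as np

    def ials_sweep_users(R01, P, Q, alpha=40.0, lam=0.1):
        """One IALS half-sweep: recompute every user factor.

        R01 is the binary M x N watch matrix; only watched rows of Q
        are touched per user."""
        K = Q.shape[1]
        G0 = Q.T @ Q                     # Gram matrix of the null-user:
                                         # computed once, shared by all users
        for u in range(P.shape[0]):
            watched = np.flatnonzero(R01[u])
            Qw = Q[watched]
            # correct the cached Gram where this user differs from the
            # null-user: watched items carry confidence 1 + alpha, not 1
            A = G0 + alpha * (Qw.T @ Qw) + lam * np.eye(K)
            b = (1.0 + alpha) * Qw.sum(axis=0)   # X^T y: watched items only
            P[u] = np.linalg.solve(A, b)         # O(K^3 + n_u K^2) per user
        return P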
• Implicit feedback: IALS1
• The RR1 trick cannot be applied here
• But, wait…!
• Implicit feedback: IALS1
• X^T X is just a matrix.
• No matter how many items we have, its dimension is the same (K x K)
• If we are lucky, we can find K items which generate this matrix
• What if we are unlucky? We can still create synthetic items.
• Assume that the null-user did not watch these K items
• X^T X and X^T y stay the same, if the synthetic items are created appropriately
• Implicit feedback: IALS1
[Slide shows the K x K matrix X^T X for the example.]
• Can we find a matrix Z such that Z is small (K x K) and Z^T Z = X^T X?
• We can, by eigenvalue decomposition: X^T X = S Λ S^T, so Z := Λ^{1/2} S^T
[Slide shows the resulting 5 x 5 matrix Z.]
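A sketch of this synthetic-items construction with numpy's symmetric eigendecomposition; the toy X is an assumption, but the identity Z^T Z = X^T X is exact:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.random((1000, 5))          # toy data: 1000 items, K = 5 features

    G = X.T @ X                        # the K x K Gram matrix X^T X
    lam, S = np.linalg.eigh(G)         # G = S diag(lam) S^T (symmetric PSD)
    lam = np.clip(lam, 0.0, None)      # guard against tiny negative eigenvalues
    Z = np.sqrt(lam)[:, None] * S.T    # Z = Lambda^{1/2} S^T

    # Z^T Z = S Lambda S^T = X^T X: the K rows of Z act as K synthetic items
    assert np.allclose(Z.T @ Z, G)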
• Implicit feedback: IALS1
• If a user watched N items, we can run RR1 with N+K examples
• To recompute p_u, we need O((N+K)K) steps (assuming 1 epoch)
• Is it better in practice than the O(K³ + NK²) of IALS?
• Evaluation of ALS vs. ALS1
• Probe10 RMSE on the Netflix Prize dataset, after 25 epochs
• Evaluation of ALS vs. ALS1
• Time-accuracy tradeoff
• Evaluation of IALS vs. IALS1
• Average Relative Position on the test subset of a proprietary implicit feedback dataset, after 20 epochs. Lower is better.
• Evaluation of IALS vs. IALS1
• Time-accuracy tradeoff.
Conclusions
• We learned two tricks:
• ALS1: RR1 can be used instead of RR in ALS
• IALS1: we can create a few synthetic examples to replace the non-watching of many examples
• ALS and IALS are approximation algorithms anyway, so why not make them even more approximate?
• ALS1 and IALS1 offer better time-accuracy tradeoffs, especially when K is large.
• They can be even 10x faster (or even 100x faster, for unrealistic K values)
TODO:
Precision, recall, other datasets.
Thank you for your attention
?