Fast ALS-based matrix factorization for explicit and implicit feedback datasets
TRANSCRIPT
Fast ALS-based matrix factorization for explicit and implicit feedback datasets
István Pilászy, Dávid Zibriczky, Domonkos Tikk
Gravity R&D Ltd., www.gravityrd.com
28 September 2010
Collaborative filtering
Problem setting
[Slide shows a sparse user-item rating matrix with a few known ratings (5, 4, 3, 4, 4, 2, 4, 1) and many missing entries.]
• Ridge Regression
• Model: y_i ≈ x_i^T w for each example, i.e. y ≈ Xw in matrix form
[Slide shows a worked numeric example: an n x K design matrix X with K = 5 features, a target vector y, and the unknown weight vector w.]
• Optimal solution: w = (X^T X + λI)^{-1} X^T y
• Ridge Regression
• Optimal w for the example: w ≈ (0.3, 0.3, 0.01, 0.4, 0.0)
[Slide repeats the example X and y with the solved w filled in.]
• Computing the optimal solution requires forming X^T X and X^T y
• Matrix inversion is costly
• Sum of squared errors of the optimal solution: 0.055
• Ridge Regression
[Slide shows the 5 x 5 matrix X^T X and the 5-vector X^T y for the example.]
• Computing the optimal solution: w = (X^T X + λI)^{-1} (X^T y)
• Matrix inversion is costly: O(K³)
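As a concrete reference, here is a minimal numpy sketch of this closed-form solve; the toy data, the λ value, and the function name are illustrative assumptions, not from the slides:

    import numpy as np

    def ridge_solve(X, y, lam=0.1):
        """Closed-form ridge regression: w = (X^T X + lam I)^{-1} X^T y."""
        K = X.shape[1]
        A = X.T @ X + lam * np.eye(K)   # K x K, so this solve is the O(K^3) step
        b = X.T @ y
        return np.linalg.solve(A, b)    # solve() instead of an explicit inverse

    # toy usage: n = 8 examples, K = 5 features
    rng = np.random.default_rng(0)
    X, y = rng.random((8, 5)), rng.random(8)
    w = ridge_solve(X, y)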
• RR1: RR with coordinate descent
• Idea: optimize only one variable of w at a time
• Start with zero: w = (0, 0, 0, 0, 0)
• Sum of squared errors: 24.6
[Slide shows the example X, y, and the current w; the same figure recurs on the following step slides.]
• RR1: RR with coordinate descent
• Idea: optimize only one variable of w at a time
• Start with zero, then optimize w1
• Sum of squared errors: 7.5
• RR1: RR with coordinate descent
• Idea: optimize only one variable of w at a time
• Start with zero, then optimize w1, then optimize w2
• Sum of squared errors: 6.2
• RR1: RR with coordinate descent
• Idea: optimize only one variable of w at a time
• Start with zero, then optimize w1, then w2, then w3
• Sum of squared errors: 5.7
• RR1: RR with coordinate descent
• Idea: optimize only one variable of w at a time
• … w4
• Sum of squared errors: 5.4
• RR1: RR with coordinate descent
• Idea: optimize only one variable of w at a time
• … w5
• Sum of squared errors: 5.0
• RR1: RR with coordinate descent
• Idea: optimize only one variable of w at a time
• … w1 again
• Sum of squared errors: 3.4
• RR1: RR with coordinate descent
• Idea: optimize only one variable of w at a time
• … w2 again
• Sum of squared errors: 2.9
• RR1: RR with coordinate descent
• Idea: optimize only one variable of w at a time
• … w3 again
• Sum of squared errors: 2.7
• RR1: RR with coordinate descent
• Idea: optimize only one variable of w at a time
• … after a while: w ≈ (0.3, 0.3, 0.01, 0.4, 0.0)
• Sum of squared errors: 0.055
• No remarkable difference from the exact RR solution
• Cost: O(Kne) for n examples and e epochs
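A minimal numpy sketch of this RR1 update; the one-variable ridge minimizer and the residual bookkeeping are standard, but the interface and names are mine:

    import numpy as np

    def rr1(X, y, w=None, epochs=1, lam=0.1):
        """RR1: ridge regression by cyclic coordinate descent, O(Kne) total."""
        n, K = X.shape
        w = np.zeros(K) if w is None else w.copy()  # warm start supported
        r = y - X @ w                               # residual, kept up to date
        for _ in range(epochs):
            for k in range(K):
                xk = X[:, k]
                # one-variable ridge minimizer; add x_k's old contribution
                # back into the residual before re-optimizing w_k
                wk = xk @ (r + xk * w[k]) / (xk @ xk + lam)
                r -= xk * (wk - w[k])
                w[k] = wk
        return w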
• Matrix factorization
• The rating matrix R of size (M x N) is approximated as the product of two lower-rank matrices: R ≈ PQ^T, i.e. r_ui ≈ p_u^T q_i
• P: user feature matrix of size (M x K)
• Q: item (movie) feature matrix of size (N x K)
• K: number of features
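A quick numeric illustration of r_ui ≈ p_u^T q_i; the sizes and random factors are arbitrary:

    import numpy as np

    M, N, K = 4, 6, 2                  # users, items, features
    rng = np.random.default_rng(0)
    P = rng.random((M, K))             # user feature matrix
    Q = rng.random((N, K))             # item feature matrix

    R_hat = P @ Q.T                    # predicted rating matrix, M x N
    u, i = 1, 3
    assert np.isclose(R_hat[u, i], P[u] @ Q[i])   # r_ui = p_u^T q_i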
Matrix factorization for explicit feedback
[Slide shows a worked example: a sparse rating matrix R with known entries (5, 5, 4, 3, 1, 2, ...), factor matrices P and Q with K = 2, and the approximation R ≈ PQ^T, r_ui ≈ p_u^T q_i.]
Finding P and Q
[Slide shows R with its known ratings and a randomly initialized Q with two feature rows, (0.3, 0.9, 0.7, 1.3, 0.5) and (0.6, 1.2, 0.3, 1.6, 1.1); p1 is still marked "??".]
• Init Q randomly
• Find p1
Finding p1 with RR
[Slide sets up the ridge regression for user 1: the feature vectors of the items she rated form X, her ratings (5, 4, 3) form y, and p1 plays the role of w.]
• Optimal solution: p1 ≈ (2.3, 3.2)
Finding p1 with RR
[Slide shows R, Q, and the solved p1 ≈ (2.3, 3.2) filled into P.]
• Alternating Least Squares (ALS)
• Initialize Q randomly
• Repeat:
• Recompute P:
• Compute p1 with RR
• Compute p2 with RR
• … (for each user)
• Recompute Q:
• Compute q1 with RR
• … (for each item)
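A compact sketch of this loop, reusing the ridge_solve sketch from earlier; representing missing ratings as np.nan in a dense R is my simplification for illustration:

    import numpy as np

    def als(R, K=2, lam=0.1, n_iters=10):
        """ALS for explicit feedback; R is M x N with np.nan for missing."""
        M, N = R.shape
        rng = np.random.default_rng(0)
        P = np.zeros((M, K))
        Q = rng.random((N, K))                  # init Q randomly
        for _ in range(n_iters):
            for u in range(M):                  # recompute P, user by user
                seen = ~np.isnan(R[u])
                P[u] = ridge_solve(Q[seen], R[u, seen], lam)
            for i in range(N):                  # recompute Q, item by item
                seen = ~np.isnan(R[:, i])
                Q[i] = ridge_solve(P[seen], R[seen, i], lam)
        return P, Q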
• ALS1: ALS with RR1
• ALS relies on RR:
• recomputation of the vectors with RR
• when recomputing p1, the previously computed value is ignored
• ALS1 relies on RR1:
• optimize the previously computed p1, one scalar at a time
• the previously computed value is not lost
• run RR1 only for one epoch
• ALS is just an approximation method. Likewise ALS1.
• Cost: O(Kne), with e := 1 (one epoch)
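ALS1 then differs from the ALS sketch only in the inner solver: each factor is refined by a single warm-started RR1 epoch instead of being recomputed from scratch. A sketch under the same assumptions, reusing rr1 from above:

    import numpy as np

    def als1(R, K=2, lam=0.1, n_iters=10):
        """ALS1: like ALS, but each factor is refined by one epoch of RR1,
        warm-started from its previous value instead of recomputed."""
        M, N = R.shape
        rng = np.random.default_rng(0)
        P = np.zeros((M, K))
        Q = rng.random((N, K))
        for _ in range(n_iters):
            for u in range(M):
                seen = ~np.isnan(R[u])
                P[u] = rr1(Q[seen], R[u, seen], w=P[u], epochs=1, lam=lam)
            for i in range(N):
                seen = ~np.isnan(R[:, i])
                Q[i] = rr1(P[seen], R[seen, i], w=Q[i], epochs=1, lam=lam)
        return P, Q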
Implicit feedback
[Slide shows a binary watch matrix R (1 = watched, 0 = not watched) and factor matrices P and Q, with R ≈ PQ^T, r_ui ≈ p_u^T q_i.]
• Implicit feedback: IALS
• The matrix is fully specified: for every user-item pair we know whether the user watched the item.
• Zeros are less important, but still important. Many 0s, few 1s.
• Recall that w = (X^T X + λI)^{-1} X^T y, built from X^T X and X^T y
• Idea (Hu, Koren, Volinsky):
• consider a user who watched nothing
• compute X^T X and X^T y for this user (the null-user) and cache them
• when recomputing p1, compare her to the null-user
• based on the cached X^T X and X^T y, update them according to the differences
• In this way, only the number of 1s affects performance, not the number of 0s
• IALS: alternating least squares with this trick.
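A minimal sketch of this caching trick for the user-side half-sweep; the confidence weighting c_ui = 1 + α for watched items follows Hu, Koren, Volinsky, while α, λ, and all names here are illustrative assumptions:

    import numpy as np

    def ials_sweep_users(R01, P, Q, alpha=40.0, lam=0.1):
        """One IALS half-sweep: recompute every user factor.

        R01 is the binary M x N watch matrix; only watched rows of Q
        are touched per user."""
        K = Q.shape[1]
        G0 = Q.T @ Q                     # Gram matrix of the null-user:
                                         # computed once, shared by all users
        for u in range(P.shape[0]):
            watched = np.flatnonzero(R01[u])
            Qw = Q[watched]
            # correct the cached Gram where this user differs from the
            # null-user: watched items carry confidence 1 + alpha, not 1
            A = G0 + alpha * (Qw.T @ Qw) + lam * np.eye(K)
            b = (1.0 + alpha) * Qw.sum(axis=0)   # X^T y: watched items only
            P[u] = np.linalg.solve(A, b)         # O(K^3 + n_u K^2) per user
        return P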
• Implicit feedback: IALS1
• The RR1 trick cannot be applied here
• But, wait…!
• Implicit feedback: IALS1
• X^T X is just a matrix.
• No matter how many items we have, its dimension is the same (K x K)
• If we are lucky, we can find K items which generate this matrix
• What if we are unlucky? We can still create synthetic items.
• Assume that the null-user did not watch these K items
• X^T X and X^T y stay the same, if the synthetic items are created appropriately
• Implicit feedback: IALS1
[Slide shows the K x K matrix X^T X for the example.]
• Can we find a matrix Z such that Z is small (K x K) and Z^T Z = X^T X?
• We can, by eigenvalue decomposition: X^T X = S Λ S^T, so Z := Λ^{1/2} S^T
[Slide shows the resulting 5 x 5 matrix Z.]
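A sketch of this synthetic-items construction with numpy's symmetric eigendecomposition; the toy X is an assumption, but the identity Z^T Z = X^T X is exact:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.random((1000, 5))          # toy data: 1000 items, K = 5 features

    G = X.T @ X                        # the K x K Gram matrix X^T X
    lam, S = np.linalg.eigh(G)         # G = S diag(lam) S^T (symmetric PSD)
    lam = np.clip(lam, 0.0, None)      # guard against tiny negative eigenvalues
    Z = np.sqrt(lam)[:, None] * S.T    # Z = Lambda^{1/2} S^T

    # Z^T Z = S Lambda S^T = X^T X: the K rows of Z act as K synthetic items
    assert np.allclose(Z.T @ Z, G)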
• Implicit feedback: IALS1
• If a user watched N items, we can run RR1 with N+K examples
• To recompute p_u, we need O((N+K)K) steps (assuming 1 epoch)
• Is it better in practice than the O(K³ + NK²) of IALS?
• Evaluation of ALS vs. ALS1
• Probe10 RMSE on the Netflix Prize dataset, after 25 epochs
• Evaluation of ALS vs. ALS1
• Time-accuracy tradeoff
• Evaluation of IALS vs. IALS1
• Average Relative Position on the test subset of a proprietary implicit feedback dataset, after 20 epochs. Lower is better.
• Evaluation of IALS vs. IALS1
• Time-accuracy tradeoff.
Conclusions
• We learned two tricks:
• ALS1: RR1 can be used instead of RR in ALS
• IALS1: we can create a few synthetic examples to replace the non-watching of many examples
• ALS and IALS are approximation algorithms anyway, so why not make them even more approximate?
• ALS1 and IALS1 offer better time-accuracy tradeoffs, especially when K is large.
• They can be even 10x faster (or even 100x faster, for unrealistic K values)
TODO:
Precision, recall, other datasets.
Thank you for your attention
?