Skew-symmetric matrix completion for rank aggregation
DESCRIPTION
Slides from a talk at Purdue's Machine Learning Seminar on 2011-01-24.
TRANSCRIPT
Skew-symmetric matrix completion for rank aggregation (and other matrix computations). DAVID F. GLEICH, PURDUE UNIVERSITY COMPUTER SCIENCE DEPARTMENT. January 24th, 12pm
Purdue ML Seminar David Gleich, Purdue 1/40
Images copyright by their respective owners.
Matrix computations are the heart (and not the brains) of many methods of computing.
Matrix computations: Physics, Statistics, Engineering, Graphics, Databases, Machine learning, …
Matrix computations

A = [ A_{1,1}  A_{1,2}  ...  A_{1,n}
      A_{2,1}  A_{2,2}  ...   ...
       ...       ...    ...  A_{m-1,n}
      A_{m,1}   ...  A_{m,n-1} A_{m,n} ]

Ax = b (linear systems), min ||Ax - b|| (least squares), Ax = λx (eigenvalues)
NETWORK and MATRIX COMPUTATIONS
Why looking at networks of data as a matrix is a powerful and successful paradigm.
RAPr on Wikipedia
E[x(A)]: United States; C:Living people; France; United Kingdom; Germany; England; Canada; Japan; Poland; Australia
Std[x(A)]: United States; C:Living people; C:Main topic classif.; C:Contents; C:Ctgs. by country; United Kingdom; France; C:Fundamental; England; C:Ctgs. by topic
Gleich (Stanford) Random sensitivity Ph.D. Defense 23 / 41
A new matrix-based sensitivity analysis of Google's PageRank.
Presented at WAW2007 and WWW2010. Published in the J. Internet Mathematics.
Led to new results on uncertainty quantification in physical simulations, published in SIAM J. Matrix Analysis and SIAM J. Scientific Computing. Patent pending.
Improved web-spam detection!
Collaborators: Paul Constantine, Gianluca Iaccarino (physical simulation)
PageRank: (I - αP)x = (1 - α)v
SimRank, BlockRank, TrustRank, ObjectRank, HostRank, Random walk with restart, GeneRank, DiffusionRank, IsoRank, ItemRank, ProteinRank, SocialPageRank, FoodRank, FutureRank, TwitterRank
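As a concrete illustration of the PageRank system above, here is a minimal sketch that solves (I - αP)x = (1 - α)v directly; the small graph and the uniform teleportation vector v are made-up examples, not data from the talk.

```python
import numpy as np

# Solve the PageRank linear system (I - alpha * P) x = (1 - alpha) * v,
# where P is a column-stochastic transition matrix of a small 4-node graph.
n = 4
P = np.array([
    [0.0, 0.5, 0.5, 0.0],
    [1/3, 0.0, 0.5, 0.5],
    [1/3, 0.5, 0.0, 0.5],
    [1/3, 0.0, 0.0, 0.0],
])
alpha = 0.85
v = np.full(n, 1.0 / n)          # uniform teleportation vector
x = np.linalg.solve(np.eye(n) - alpha * P, (1 - alpha) * v)
print(x.sum())                   # sums to 1 when P is column-stochastic
```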
Network alignment
Matching and overlap. Squares produce overlap: a bonus for some x_i and x_j in a square.
Variables and data: x_i = edge indicator; w_i = weight of edge e_i; S_{ij} = squares in S; e_i ∈ L.

maximize_x  Σ_{i : e_i ∈ L} w_i x_i + Σ_{i,j ∈ S} x_i x_j   subject to x is a matching

equivalently

maximize_x  w^T x + (1/2) x^T S x   subject to Ax ≤ e, x_i ∈ {0, 1}
David F. Gleich (Stanford) Network alignment Southeast Ranking Workshop 11 / 29
Bayati, Gerritsen, Gleich, Saberi, and Wang, ICDM 2009; Bayati, Gleich, Saberi, and Wang, submitted.
David F. Gleich (Purdue) Network alignment INFORMS Seminar 17 / 40
NETWORK ALIGNMENT

maximize_x  α w^T x + (β/2) x^T S x   subject to Ax ≤ e, x_i ∈ {0, 1}
History: quadratic assignment, maximum common subgraph, pattern recognition, ontology matching, bioinformatics.
Sparse problems: a sparse L is often ignored (a few exceptions). Our paper tackles that case explicitly. We do large problems, too.
Conte et al., Thirty years of graph matching, 2004; Melnik et al., Similarity flooding, 2004; Blondel et al., SIREV 2004; Singh et al., RECOMB 2007; Klau, BMC Bioinformatics 10:S59, 2009.
Overlapping clusters for distributed computation. Andersen, Gleich, and Mirrokni, WSDM2012
[Figure: relative work vs. volume ratio for the Metis partitioner, comparing swapping probability and PageRank communication on the usroads and web-Google graphs.]
How much more of the graph we need to store.
Tweet along @dgleich
MAIN RESULTS – SLIDE THREE
David F. Gleich (Sandia) ICME la/opt seminar 4 of 50
TOP-K ALGORITHM FOR KATZ
Approximate the solution, where the right-hand side is sparse. Keep the iterates sparse too. Ideally, don't "touch" all of the graph.
David F. Gleich (Purdue) Univ. Chicago SSCS Seminar 34 of 47
Gleich et al., J. Internet Mathematics, to appear.
Local methods for massive network analysis
Can solve these problems in milliseconds even with 100M edges!
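To make the Katz setting concrete, here is a hedged sketch that approximates a Katz score vector by a truncated Neumann series; it illustrates the quantity being computed, not the talk's top-k algorithm, and the small graph, seed node, and α are illustrative choices.

```python
import numpy as np

# Katz scores relative to a seed node solve (I - alpha*A) k = e_seed.
# Approximate by the truncated series k ≈ sum_t (alpha*A)^t e_seed,
# which converges when alpha < 1/||A||_2.
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
alpha = 0.1
seed = np.zeros(4); seed[0] = 1.0
k = np.zeros(4); term = seed.copy()
for _ in range(50):               # residual shrinks geometrically
    k += term
    term = alpha * (A @ term)
exact = np.linalg.solve(np.eye(4) - alpha * A, seed)
print(np.allclose(k, exact, atol=1e-8))
```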
Rank aggregation
DAVID F. GLEICH (PURDUE) & LEK-HENG LIM (UNIV. CHICAGO)
Which is a better list of good DVDs?

Nuclear norm based rank aggregation (not matrix completion on the Netflix rating matrix): Lord of the Rings 3: The Return of …; Lord of the Rings 1: The Fellowship; Lord of the Rings 2: The Two Towers; Star Wars V: Empire Strikes Back; Raiders of the Lost Ark; Star Wars IV: A New Hope; Shawshank Redemption; Star Wars VI: Return of the Jedi; Lord of the Rings 3: Bonus DVD; The Godfather

Standard rank aggregation (the mean rating): Lord of the Rings 3: The Return of …; Lord of the Rings 1: The Fellowship; Lord of the Rings 2: The Two Towers; Lost: Season 1; Battlestar Galactica: Season 1; Fullmetal Alchemist; Trailer Park Boys: Season 4; Trailer Park Boys: Season 3; Tenchi Muyo!; Shawshank Redemption
Rank Aggregation
Given partial orders on subsets of items, rank aggregation is the problem of finding an overall ordering.
Voting: find the winning candidate. Program committees: find the best papers given reviews. Dining: find the best restaurant in Chicago.
Ranking is really hard
All rank aggregations involve some measure of compromise. A good ranking is the "average" ranking under a permutation distance.
Ken Arrow; John Kemeny; Dwork, Kumar, Naor, and Sivakumar
Kemeny's ranking is NP-hard to compute.
Given a hard problem, what do you do? Numerically relax! It'll probably be easier.
[Image: Embody chair, John Cantrell (flickr)]
Suppose we had scores
Let s_i be the score of the ith movie/song/paper/team to rank. Suppose we can compare the ith to the jth: Y_{ij} = s_i - s_j. Then Y is skew-symmetric, rank 2. Also works for ratios with an extra log.
Kemeny and Snell, Mathematical Models in the Social Sciences (1978)
Numerical ranking is intimately intertwined with skew-symmetric matrices
David F. Gleich (Purdue) KDD 2011 6/20
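The rank-2, skew-symmetric structure above is easy to check numerically; this minimal sketch uses a made-up score vector.

```python
import numpy as np

# If s_i scores item i, the pairwise matrix Y_ij = s_i - s_j is
# Y = s e^T - e s^T: skew-symmetric and (for non-constant s) rank 2.
s = np.array([3.0, 1.0, 4.0, 1.5, 9.0])   # illustrative scores
e = np.ones_like(s)
Y = np.outer(s, e) - np.outer(e, s)
print(np.allclose(Y, -Y.T))               # skew-symmetric
print(np.linalg.matrix_rank(Y))           # rank 2
```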
Using ratings as comparisons
Arithmetic mean. Log-odds.
Ratings induce various skew-symmetric matrices. From David, 1988, The Method of Paired Comparisons.
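One of these constructions, the arithmetic mean, can be sketched as follows: average the rating differences over users who rated both items. The tiny ratings table (rows are users, 0 means unrated) is made up for illustration.

```python
import numpy as np

# Arithmetic-mean pairwise comparison: Y_ij averages R[u, j] - R[u, i]
# over users u who rated both items i and j. The result is skew-symmetric.
R = np.array([
    [5, 3, 0],
    [4, 0, 2],
    [5, 4, 1],
], dtype=float)
n_items = R.shape[1]
Y = np.zeros((n_items, n_items))
for i in range(n_items):
    for j in range(n_items):
        both = (R[:, i] > 0) & (R[:, j] > 0)   # users who rated both items
        if both.any():
            Y[i, j] = np.mean(R[both, j] - R[both, i])
print(np.allclose(Y, -Y.T))
```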
Extracting the scores
Given Y with all entries, then s = (1/n) Y e is the Borda count, the least-squares solution to Y ≈ s e^T - e s^T.
How many comparisons do we have? Most. Do we trust them all? Not really. Netflix data: 17k movies, 500k users, 100M ratings; 99.17% filled.
[Figure: number of comparisons per movie pair, log-log scale.]
David F. Gleich (Purdue) KDD 2011 8/20
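The Borda-count extraction above is one line of linear algebra; this sketch checks, on a made-up score vector, that (1/n) Y e recovers the scores up to a constant shift.

```python
import numpy as np

# With a fully observed Y = s e^T - e s^T, the least-squares score
# estimate is sb = (1/n) Y e, which equals s - mean(s).
s = np.array([0.9, 0.1, 0.5, 0.7])        # illustrative scores
n = len(s); e = np.ones(n)
Y = np.outer(s, e) - np.outer(e, s)
sb = Y @ e / n
print(np.allclose(sb, s - s.mean()))
```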
Only partial info? COMPLETE IT!
Let Y_{ij} be known for (i, j) ∈ Ω. We trust these scores.
Goal: find the simplest skew-symmetric matrix that matches the data, either exactly (noiseless) or approximately (noisy). Both of these are NP-hard too.
David F. Gleich (Purdue) KDD 2011 9/20
Solution: GO NUCLEAR!
[Image: a French nuclear test in 1970, from http://picdit.wordpress.com/2008/07/21/8-insane-nuclear-explosions/]
The nuclear norm
The analog of the 1-norm or ℓ1-norm for matrices.
For vectors: minimizing the number of nonzeros is NP-hard, while minimizing the ℓ1-norm is convex and gives the same answer "under appropriate circumstances."
For matrices: let X = U Σ V^T be the SVD. The nuclear norm ||X||_* = Σ_i σ_i is the best convex under-estimator of rank on the unit ball.
David F. Gleich (Purdue) KDD 2011 11/20
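In code, the nuclear norm is just the sum of singular values, mirroring how the 1-norm sums absolute values; the matrix here is an arbitrary example.

```python
import numpy as np

# Nuclear norm ||X||_* = sum of singular values: the convex surrogate
# for rank, as ||x||_1 is for the number of nonzeros.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
nuclear = np.linalg.svd(X, compute_uv=False).sum()
print(nuclear)
```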
Only partial info? Complete it!
Let Y_{ij} be known for (i, j) ∈ Ω. We trust these scores.
Goal: find the simplest skew-symmetric matrix that matches the data. The rank-minimization formulation is NP-hard; replacing rank with the nuclear norm gives the convex heuristic.
David F. Gleich (Purdue) KDD 2011 12/20
Solving the nuclear norm problem
Use a LASSO formulation: minimize the residual on the observed entries subject to a rank constraint. Jain et al. propose SVP for this problem, without the skew-symmetric constraint.
1. Initialize
2. REPEAT
3. X = rank-k SVD of a gradient step on the observed residual
4. …
5. …
6. UNTIL converged
Jain et al., NIPS 2010. David F. Gleich (Purdue) KDD 2011 13/20
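A hedged sketch of the SVP (singular value projection) iteration of Jain et al.: a gradient step on the sampled squared residual, followed by projection to rank k via a truncated SVD. The problem sizes, sampling rate, and step size eta here are illustrative choices, not the paper's settings.

```python
import numpy as np

# SVP sketch on a skew-symmetric, rank-2 target Y = s e^T - e s^T.
rng = np.random.default_rng(0)
n, k, eta = 30, 2, 1.0
s = rng.random(n)
e = np.ones(n)
Ytrue = np.outer(s, e) - np.outer(e, s)
mask = rng.random((n, n)) < 0.5
mask = np.triu(mask, 1)
mask = mask | mask.T                 # symmetric sample positions, zero diagonal
X = np.zeros((n, n))
for _ in range(100):
    G = np.where(mask, X - Ytrue, 0.0)     # gradient of (1/2)||P_Omega(X - Y)||^2
    U, sv, Vt = np.linalg.svd(X - eta * G)
    X = U[:, :k] @ np.diag(sv[:k]) @ Vt[:k, :]   # rank-k projection
print(np.allclose(X, -X.T, atol=1e-6))     # skew-symmetry preserved
```

Because the sample positions are symmetric and the iterate starts skew-symmetric, every gradient step stays skew-symmetric, and the rank-2 SVD projection keeps it so, matching the "for free" claim on the next slide.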
Skew-symmetric SVDs
Let A be an n × n skew-symmetric matrix with eigenvalues ±ıλ_1, …; then the singular values of A come in pairs λ_k, with the SVD given by the U and V constructed in the proof.
Proof: use the Murnaghan-Wintner form and the SVD of a 2x2 skew-symmetric block.
This means that SVP will give us the skew-symmetric constraint "for free."
David F. Gleich (Purdue) KDD 2011 14/20
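The paired-singular-value structure above can be checked numerically; the random matrix here is only for illustration.

```python
import numpy as np

# Singular values of a skew-symmetric matrix come in equal pairs,
# reflecting its 2x2-block (Murnaghan-Wintner) structure.
rng = np.random.default_rng(1)
B = rng.standard_normal((6, 6))
A = B - B.T                                # skew-symmetric
sv = np.linalg.svd(A, compute_uv=False)
print(np.allclose(sv[0::2], sv[1::2]))     # sigma_1 = sigma_2, sigma_3 = sigma_4, ...
```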
Matrix completion
A fundamental question in matrix completion is: when do these problems have the same solution?
David F. Gleich (Purdue) KDD 2011 12/20
Exact recovery results
David Gross showed how to recover Hermitian matrices, i.e., the conditions under which we get the exact solution.
Note that ıY is Hermitian. Thus our new result!
Gross, arXiv 2010. David F. Gleich (Purdue) KDD 2011 15/20
indices. Instead we view the following theorem as providing intuition for the noisy problem.

Consider the operator basis for Hermitian matrices: H = S ∪ K ∪ D where

S = { (1/√2)(e_i e_j^T + e_j e_i^T) : 1 ≤ i < j ≤ n };
K = { (ı/√2)(e_i e_j^T - e_j e_i^T) : 1 ≤ i < j ≤ n };
D = { e_i e_i^T : 1 ≤ i ≤ n }.

Theorem 5. Let s be centered, i.e., s^T e = 0. Let Y = s e^T - e s^T where θ = max_i s_i^2 / (s^T s) and ρ = ((max_i s_i) - (min_i s_i)) / ||s||. Also, let Ω ⊂ H be a random set of elements with size |Ω| ≥ O(2nν(1 + β)(log n)^2) where ν = max((nθ + 1)/4, nρ^2). Then the solution of

minimize ||X||_*  subject to  trace(X^* W_i) = trace((ıY)^* W_i), W_i ∈ Ω

is equal to ıY with probability at least 1 - n^{-β}.

The proof of this theorem follows directly by Theorem 4 if ıY has coherence ν with respect to the basis H. We now show this result.

Definition 6 (Coherence, Gross [2010]). Let A be n × n, rank-r, and Hermitian. Let UU^* be an orthogonal projector onto range(A). Then A has coherence ν with respect to an operator basis {W_i}_{i=1}^{n^2} if both

max_i trace(W_i UU^* W_i) ≤ 2νr/n, and
max_i trace(sign(A) W_i)^2 ≤ νr/n^2.

For A = ıY with s^T e = 0:

UU^* = (s s^T)/(s^T s) + (1/n) e e^T  and  sign(A) = (1/(||s|| √n)) A.

Let S_p ∈ S, K_p ∈ K, and D_p ∈ D. Note that because sign(A) is Hermitian with no real-valued entries, both quantities trace(sign(A) D_p)^2 and trace(sign(A) S_p)^2 are 0. Also, because UU^* is symmetric, trace(K_i UU^* K_p) = 0. The remaining basis elements satisfy:

trace(S_p UU^* S_p) = 1/n + (s_i^2 + s_j^2)/(2 s^T s) ≤ (1/n) + θ
trace(D_p UU^* D_p) = 1/n + s_i^2/(s^T s) ≤ (1/n) + θ
trace(sign(A) K_p)^2 = 2(s_i - s_j)^2/(n s^T s) ≤ (2/n) ρ^2.

Thus, A has coherence ν with ν from Theorem 5 and with respect to H, and we have our recovery result. This theorem provides little practical benefit, though, unless both θ and ρ are O(1/n), which occurs when s is nearly uniform.
6. RESULTS
We implemented and tested this procedure in two synthetic scenarios, along with Netflix, MovieLens, and Jester joke-set ratings data. In the interest of space, we only present a subset of these results for Netflix.
[Figure 2 residue: two panels plotting fraction of trials recovered (left) and noise level (right) against the number of samples, with reference lines at 5n, 2n log(n), and 6n log(n).]
Figure 2: An experimental study of the recoverability of a ranking vector. These show that we need about 6n log n entries of Y to get good recovery in both the noiseless (left) and noisy (right) case. See §6.1 for more information.
6.1 Recovery
The first experiment is an empirical study of the recoverability of the score vector in the noiseless and noisy case. In the noiseless case, Figure 2 (left), we generate a score vector with uniformly distributed random scores between 0 and 1. These are used to construct a pairwise comparison matrix Y = s e^T - e s^T. We then sample elements of this matrix uniformly at random and compute the difference between the true score vector s and the output of steps 4 and 5 of Algorithm 2. If the relative 2-norm difference between these vectors is less than 10^{-3}, we declare the trial recovered. For n = 100, the figure shows that, once the number of samples is about 6n log n, the correct s is recovered in nearly all the 50 trials.

Next, for the noisy case, we generate a uniformly spaced score vector between 0 and 1. Then Y = s e^T - e s^T + εE, where E is a matrix of random normals. Again, we sample elements of this matrix randomly, and declare a trial successful if the order of the recovered score vector is identical to the true order. In Figure 2 (right), we indicate the fraction of successful trials as a gray value between black (all failure) and white (all successful). Again, the algorithm is successful for a moderate noise level, i.e., the value of ε, when the number of samples is larger than 6n log n.
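A stripped-down version of the noiseless experiment can be written in a few lines. This sketch recovers s from sampled differences by least squares rather than the paper's SVP pipeline, and the sizes, sampling rate, and seed are illustrative choices.

```python
import numpy as np

# Sample entries of Y = s e^T - e s^T and recover s from the observed
# differences x_i - x_j = Y_ij, pinning the free additive constant.
rng = np.random.default_rng(2)
n = 50
s = rng.random(n)
obs = [(i, j) for i in range(n) for j in range(n)
       if i != j and rng.random() < 0.2]   # sampled comparisons
A = np.zeros((len(obs) + 1, n))
b = np.zeros(len(obs) + 1)
for r, (i, j) in enumerate(obs):
    A[r, i], A[r, j] = 1.0, -1.0
    b[r] = s[i] - s[j]
A[-1, :] = 1.0                             # constraint: sum(x) = sum(s)
b[-1] = s.sum()
x = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.allclose(x, s, atol=1e-8))
```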
6.2 Synthetic
Inspired by Ho and Quinn [2008], we investigate recovering item scores in an item-response scenario. Let a_i be the center of user i's rating scale, and b_i be the rating sensitivity of user i. Let t_j be the intrinsic score of item j. Then we generate ratings from users on items as:

R_{i,j} = L[a_i + b_i t_j + E_{i,j}]

where L[α] is the discrete levels function:

L[α] = max(min(round(α), 5), 1)

and E_{i,j} is a noise parameter. In our experiment, we draw a_i ~ N(3, 1), b_i ~ N(0.5, 0.5), t_j ~ N(0.1, 1), and E_{i,j} ~ εN(0, 1). Here, N(µ, σ) is a normal distribution with mean µ and standard deviation σ, and ε is a noise parameter. As input to our algorithm, we sample ratings uniformly at random by specifying a desired number of average ratings per user. We then look at the Kendall τ correlation coefficient between the true scores t_j and the output of our algorithm using the arithmetic mean pairwise aggregation. A τ value of 1 indicates a perfect ordering correlation between the two sets of scores.
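The generator above translates directly into code; the user/item counts and noise level here are illustrative choices.

```python
import numpy as np

# Item-response ratings generator: R_ij = L[a_i + b_i * t_j + E_ij],
# with L clamping rounded values to the 1-5 rating scale.
rng = np.random.default_rng(3)
n_users, n_items, eps = 1000, 100, 0.2
a = rng.normal(3.0, 1.0, n_users)          # user rating centers
b = rng.normal(0.5, 0.5, n_users)          # user sensitivities
t = rng.normal(0.1, 1.0, n_items)          # intrinsic item scores
E = eps * rng.standard_normal((n_users, n_items))
R = np.clip(np.round(a[:, None] + b[:, None] * t[None, :] + E), 1, 5)
print(R.min() >= 1 and R.max() <= 5)       # ratings land on the 1-5 scale
```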
Gross, arXiv, 2010
Recovery Discussion and Experiments
Confession: in the noiseless case, just look at differences from a connected set. Constants? Not very good. Intuition for the truth.
David F. Gleich (Purdue) KDD 2011 16/20
The Ranking Algorithm
0. INPUT ratings data R and c (for trust on comparisons)
1. Compute Y from R
2. Discard entries of Y with fewer than c comparisons
3. Set Ω, b to the indices and values of what's left
4. X = SVP(Ω, b)
5. OUTPUT scores s from X
David F. Gleich (Purdue) KDD 2011 17/20
Synthetic evaluation
Item Response Model
The synthetic results came from a model inspired by Ho and Quinn [2008].
a_i: center rating for user i; b_i: sensitivity of user i; t_j: value of item j; ε: error level in ratings.
Sample ratings uniformly at random with a specified number of expected ratings per user.
David F. Gleich (Purdue) KDD 2011 21/20
Evaluation
[Figure 3 residue: four panels of median Kendall's tau vs. error, with legend entries 20, 10, 5, 2, and 1.5 average ratings per user.]
Figure 3: The performance of our algorithm (left) and the mean rating (right) at recovering the ordering given by item scores in an item-response theory model with 100 items and 1000 users. The various thick lines correspond to the average number of ratings each user performed (see the in-place legend). See §6.2 for more information.
Figure 3 shows the results for 1000 users and 100 items with 1.1, 1.5, 2, 5, and 10 ratings per user on average. We also vary the parameter ε between 0 and 1. Each thick line with markers plots the median value of τ in 50 trials. The thin adjacent lines show the 25th and 75th percentiles of the 50 trials. At all error levels, our algorithm outperforms the mean rating. Also, when there are few ratings per user and moderate noise, our approach is considerably more correlated with the true score. This evidence supports the anecdotal results from Netflix in Table 2.
6.3 Netflix
See Table 2 for the top movies produced by our technique in a few circumstances using all users. The arithmetic mean results in that table use only elements of Y with at least 30 pairwise comparisons (it is the am all 30 model in the code below). And see Figure 4 for an analysis of the residuals generated by the fit for different constructions of the matrix Y. Each residual evaluation of Netflix is described by a code. For example, sb all 0 is a strict-binary pairwise matrix Y from all Netflix users and c = 0 in Algorithm 2 (i.e., accept all pairwise comparisons). Alternatively, am 6 30 denotes an arithmetic-mean pairwise matrix Y from Netflix users with at least 6 ratings, where each entry in Y had 30 users supporting it. The other abbreviations are gm: geometric mean; bc: binary comparison; and lo: log-odds ratio.

These residuals show that we get better rating fits by only using frequently compared movies, but that there are only minor changes in the fits when excluding users that rate few movies. The difference between the score-based residuals ||Ω(s e^T - e s^T) - b|| (red points) and the SVP residuals ||Ω(U S V^T) - b|| (blue points) shows that excluding comparisons leads to "overfitting" in the SVP residual. This suggests that increasing the parameter c should be done with care and good checks on the residual norms.

To check that a rank-2 approximation is reasonable, we increased the target rank in the SVP solver to 4 to investigate. For the arithmetic mean (6, 30) model, the relative residual at rank 2 is 0.2838 and at rank 4 is 0.2514. Meanwhile, the nuclear norm increases from around 14000 to around 17000. These results show that the change in the fit is minimal and our rank-2 approximation and its scores should represent a reasonable ranking.
[Figure 4 residue: relative residuals between roughly 0.2 and 0.7 for the models am all 30, am 6 30, gm 6 30, gm all 30, am all 100, am 6 100, sb 6 30, sb all 30, gm all 100, gm 6 100, bc 6 30, bc all 30, bc 6 100, bc all 100, lo all 30, lo 6 30, lo 6 100, lo all 100, sb all 100, sb 6 100, am 6 0, am all 0, bc 6 0, bc all 0, lo 6 0, lo all 0, gm 6 0, gm all 0, sb 6 0, sb all 0.]
Figure 4: The labels on each residual show how wegenerated the pairwise scores and truncated the Net-flix data. Red points are the residuals from thescores, and blue points are the final residuals fromthe SVP algorithm. Please see the discussion in §6.3.
7. CONCLUSION
Existing principled techniques such as computing a Kemeny-optimal ranking or finding a minimum feedback arc set are NP-hard. These approaches are inappropriate in large-scale rank aggregation settings. Our proposal is (i) measure pairwise scores Y and (ii) solve a matrix completion problem to determine the quality of items. This idea is both principled and functional with significant missing data. The results of our rank aggregation on the Netflix problem (Table 2) reveal popular and high-quality movies. These are interesting results and could easily have a home on a "best movies in Netflix" web page. Such a page exists, but is regarded as having strange results. Computing a rank aggregation with this technique is not NP-hard. It only requires solving a convex optimization problem with a unique global minimum. Although we did not record computation times, the most time-consuming piece of work is computing the pairwise comparison matrix Y. In a practical setting, this could easily be done with a MapReduce computation.

To compute these solutions, we adapted the SVP solver for matrix completion [Jain et al., 2010]. This process involved (i) studying the singular value decomposition of a skew-symmetric matrix (Lemmas 1 and 2) and (ii) showing that the SVP solver preserves a skew-symmetric approximation through its computation (Theorem 3). Because the SVP solver computes with an explicitly chosen rank, these techniques work well for large-scale rank aggregation problems.

We believe the combination of pairwise aggregation and matrix completion is a fruitful direction for future research. We plan to explore optimizing the SVP algorithm to exploit the skew-symmetric constraint, extending our recovery result to the noisy case, and investigating additional data.
Acknowledgements. The authors would like to thank Amy Langville,
Carl Meyer, and Yuan Yao for helpful discussions.
Conclusions and Future Work
Our motto: "aggregate, then complete."
Rank aggregation with the nuclear norm is principled and easy to compute. The results are much better than simple approaches.
1. Additional comparisons
2. Noisy recovery! More realistic sampling.
3. Skew-symmetric Lanczos-based SVD?
Current research
Data-driven surrogate functions. Beyond spectral methods for UQ.
Graph spectra
[Figure residue: small example graphs with spectral quantities 1.5, 0.5; 1.33 (two!); 1.5; 1.5 (two); 1.833; 0.565741 to 1.767592; 0.725708 to 1.607625.]
Spectral spikes
Google "nuclear ranking gleich"