Big data matrix factorizations and Overlapping community detection in graphs


Upload: david-gleich

Post on 08-Sep-2014


DESCRIPTION

In a talk at the Chinese Academy of Sciences Institute of Automation, I discuss some of the MapReduce and community detection methods I've worked on.

TRANSCRIPT

Page 1: Big data matrix factorizations and Overlapping community detection in graphs

Big data matrix factorizations and

Overlapping community detection in graphs.

David F. Gleich, Purdue University

Joint work with Paul Constantine, Austin Benson, Jason Lee, Jeremy Templeton, Yangyang Hou, C. Seshadhri, Joyce Jiyoung Whang, and Inderjit S. Dhillon, supported by NSF CAREER 1149756-CCF and a DOE ASCR award

Code: bit.ly/dgleich-codes

Page 2: Big data matrix factorizations and Overlapping community detection in graphs

2

A (from the tinyimages collection)

Tall-and-skinny matrices (m ≫ n): many rows (like a billion), a few columns (under 10,000)

Used in: regression and general linear models with many samples; block iterative methods; panel factorizations; approximate kernel k-means; big-data SVD/PCA

David Gleich · Purdue

Page 3: Big data matrix factorizations and Overlapping community detection in graphs

A graphical view of the MapReduce programming model

David Gleich · Purdue 3

[Diagram: input data splits feed parallel Map tasks; each Map emits (key, value) pairs; a global Shuffle groups the values by key; Reduce tasks then process each key with its list of values.]

Map tasks read batches of data in parallel and do some initial filtering

Reduce is often where the computation happens

Shuffle is a global communication, like a group-by or an MPI_Alltoall
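To make the picture above concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce steps in plain Python. It is not Hadoop code; the word-count task, the function names, and the tiny data set are made up purely for illustration.

from collections import defaultdict

def mapper(record):
    # Map tasks read records in parallel and emit (key, value) pairs.
    for word in record.split():
        yield word, 1

def reducer(key, values):
    # Reduce sees all values for one key after the shuffle (a global group-by).
    yield key, sum(values)

def run(records):
    groups = defaultdict(list)
    for rec in records:                 # "map" phase
        for k, v in mapper(rec):
            groups[k].append(v)         # "shuffle": group values by key
    out = {}
    for k, vs in groups.items():        # "reduce" phase
        for k2, v2 in reducer(k, vs):
            out[k2] = v2
    return out

print(run(["a b a", "b c"]))  # {'a': 2, 'b': 2, 'c': 1}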

Page 4: Big data matrix factorizations and Overlapping community detection in graphs

PCA of 80,000,000 images

4/22

A

80,000,000 images

1000 pixels

First 16 columns of V as images

David Gleich · Purdue Constantine & Gleich, MapReduce 2010.

[Plots: fraction of variance explained vs. number of principal components (top 100 and all components).]

Figure 5: The 16 most important principal component basis functions (by rows) and the amount of variance explained by the top 100 (bottom left) and all principal components (bottom right).

4. CONCLUSION

In this manuscript, we have illustrated the ability of MapReduce architectures to solve massive least-squares problems through a tall and skinny QR factorization. We choose to implement these algorithms in a simple Hadoop streaming framework to provide prototype implementations so that others can easily adapt the algorithms to their particular problem. These codes are all available online.¹ We envision that the TSQR paradigm will find a place in block-analogues of the various iterative methods in the Mahout project. These methods are based on block analogues of the Lanczos process, which replace vector normalization steps with QR factorizations. Because the TSQR routine solves linear regression problems, it can also serve as the least-squares subroutine for an iteratively reweighted least-squares algorithm for fitting general linear models.

A key motivation for our MapReduce TSQR implementation comes from a residual minimizing model reduction method [5] for approximating the output of a parameterized differential equation model. Methods for constructing reduced order models typically involve a collection of solutions (dubbed snapshots [16]), each computed at its respective input parameters. Storing and managing the terascale data from these solutions is itself challenging, and the hard disk storage of MapReduce is a natural fit.

¹See http://www.github.com/dgleich/mrtsqr.

5. REFERENCES

[1] E. Agullo, C. Coti, J. Dongarra, T. Herault, and J. Langou. QR factorization of tall and skinny matrices in a grid computing environment. In Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, pages 1-11, April 2010.
[2] Å. Björck. Numerical Methods for Least Squares Problems. SIAM, Philadelphia, Penn., 1996.
[3] K. Bosteels. Fuzzy techniques in the usage and construction of comparison measures for music objects, 2009.
[4] J. Choi, J. Demmel, I. S. Dhillon, J. Dongarra, S. Ostrouchov, A. Petitet, K. Stanley, D. W. Walker, and R. C. Whaley. ScaLAPACK: A portable linear algebra library for distributed memory computers - design issues and performance. PARA, pages 95-106, 1995.
[5] P. G. Constantine and Q. Wang. Residual minimizing model reduction for parameterized nonlinear dynamical systems, arXiv:1012.0351, 2010.
[6] B. Dagnon and B. Hindman. TSQR on EC2 using the Nexus substrate. http://www.cs.berkeley.edu/~agearh/cs267.sp10/files/writeup_dagnon.pdf, 2010. Class project writeup for CS267, University of California, Berkeley.
[7] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI 2004), pages 137-150, 2004.
[8] J. Demmel, L. Grigori, M. Hoemmen, and J. Langou. Communication-avoiding parallel and sequential QR factorizations. arXiv, 0806.2159, 2008.
[9] P. Drineas, M. W. Mahoney, S. Muthukrishnan, and T. Sarlós. Faster least squares approximation. Numerische Mathematik, 117(2):219-249, 2011.
[10] J. G. F. Francis. The QR transformation: a unitary analogue to the LR transformation - part 1. The Computer Journal, 4:265-271, 1961.
[11] P. E. Gill, W. Murray, and M. H. Wright. Practical Optimization. Academic Press, 1981.
[12] G. H. Golub and C. F. van Loan. Matrix Computations. The Johns Hopkins University Press, third edition, October 1996.
[13] D. Heller. A survey of parallel algorithms in numerical linear algebra. SIAM Rev., 20:740-777, 1978.
[14] J. Langou. Computing the R of the QR factorization of tall and skinny matrix using mpi_reduce. arXiv, math.NA:1002.4250, 2010.
[15] T. E. Oliphant. Guide to NumPy. Provo, UT, Mar. 2006.
[16] L. Sirovich. Turbulence and the dynamics of coherent structures. Part 1: Coherent structures. Quarterly of Applied Mathematics, 45(3):561-571, 1987.
[17] A. Stathopoulos and K. Wu. A block orthogonalization procedure with constant synchronization requirements. SIAM J. Sci. Comput., 23:2165-2182, June 2001.
[18] A. Torralba, R. Fergus, and W. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 30(11):1958-1970, November 2008.
[19] L. N. Trefethen and D. Bau, III. Numerical Linear Algebra. SIAM, Philadelphia, 1997.
[20] Various. Hadoop version 0.21. http://hadoop.apache.org, 2010.
[21] F. Wang. Implement linear regression. https://issues.apache.org/jira/browse/MAHOUT-529. Mahout-529 JIRA, accessed on February 10, 2011.
[22] R. C. Whaley and J. Dongarra. Automatically tuned linear algebra software. In SuperComputing 1998: High Performance Networking and Computing, 1998.
[23] B. White. hadoopy. http://bwhite.github.com/hadoopy.

Acknowledgments. We are exceedingly grateful to Mark Hoemmen for many discussions about the TSQR factorization. We would also like to thank James Demmel for suggesting examining the reference streaming time. Finally, we are happy to acknowledge the fellow MapReduce "computers" at Sandia for general Hadoop help: Craig Ulmer, Todd Plantenga, Justin Basilico, Art Munson, and Tamara G. Kolda.

Page 5: Big data matrix factorizations and Overlapping community detection in graphs

Regression with 80,000,000 images

The goal was to approx. how much red there was in a picture from the value of the grayscale pixels only. We get a measure of how much “redness” each pixel contributes to the whole.

Table 3: Results when varying block size. The best performance results are bolded. See §3.3 for details.

Cols.  Blks.  Maps   Iter. 1 Secs.  Iter. 2 Secs.
50     2      8000   424            21
-      3      -      399            19
-      5      -      408            19
-      10     -      401            19
-      20     -      396            20
-      50     -      406            18
-      100    -      380            19
-      200    -      395            19
100    2      7000   410            21
-      3      -      384            21
-      5      -      390            22
-      10     -      372            22
-      20     -      374            22
1000   2      6000   493            199
-      3      -      432            169
-      5      -      422            154
-      10     -      430            202
-      20     -      434            202

3.4 Split size

There are three factors that control the TSQR tree on Hadoop: the number of mappers, the number of reducers, and the number of iterations. In this section, we investigate the trade-off between decreasing the number of mappers, which is done by increasing the minimum split size in HDFS, and using additional iterations. Using additional iterations provides the opportunity to exploit parallelism via a reduction tree. Table 4 shows the total computation time for our C++ code when used with various split sizes and one or two iterations. The block size used was the best performing case from the previous experiment. Each row states the number of columns, the number of iterations used (for two iterations, we used 250 reducers in the first iteration), the split size, and the total computation time. With a split size of 512 MB, each mapper consumes an entire input file of the matrix (recall that the matrices are constructed by 1000 reducers, and hence 1000 files). The two iteration test used 250 reducers in the first iteration and 1 reducer in the second iteration. The one iteration test used 1 reducer in the first iteration, which is required to get the correct final answer. (In the 1000 column test, using a smaller split size of 64 or 256 MB generated too much data from the mappers for a single reducer to handle efficiently.)

The results are different between 50 columns and 1000 columns. With 50 columns, a one iteration approach is faster, and increasing the split size dramatically reduces the computation time. This results from two intertwined behaviors: first, using a larger split size sends less data to the final reducer, making it run faster; and second, using a larger split size reduces the overhead with Hadoop launching additional map tasks. With 1000 columns, the two iteration approach is faster. This happens because each R matrix output by the mappers is 400 times larger than with the 50 column experiment. Consequently, the single reducer takes much longer in the one iteration case. Using an additional iteration allows us to handle this reduction with more parallelism.

Table 4: Results when varying split size. See §3.4.

Cols.  Iters.  Split (MB)  Maps   Secs.
50     1       64          8000   388
-      -       256         2000   184
-      -       512         1000   149
-      2       64          8000   425
-      -       256         2000   220
-      -       512         1000   191
1000   1       512         1000   666
-      2       64          6000   590
-      -       256         2000   432
-      -       512         1000   337

3.5 Tinyimages: regression and PCA

Our final experiment shows this algorithm applied to a real world dataset. The tinyimages collection is a set of almost 80,000,000 images. Each image is 32-by-32 pixels. The image collection is stored in a single file, where each 3072 byte segment consists of the red, green, and blue values for each of the 1024 pixels in the image. We wrote a custom Hadoop InputFormat to read this file directly and transmit the data to our Hadoop streaming programs as a set of bytes. We used the Dumbo python framework for these experiments. In the following two experiments, we translated all the color pixels into shades of gray. Consequently, this dataset represents a 79,302,017-by-1024 matrix.

We first solved a regression problem by trying to predict the sum of red-pixel values in each image as a linear combination of the gray values in each image. Formally, if r_i is the sum of the red components in all pixels of image i, and G_{i,j} is the gray value of the jth pixel in image i, then we wanted to find min_s ∑_i ( r_i − ∑_j G_{i,j} s_j )². There is no particular importance to this regression problem; we use it merely as a demonstration. The coefficients s_j are displayed as an image at the right. They reveal regions of the image that are not as important in determining the overall red component of an image. The color scale varies from light-blue (strongly negative) to blue (0) and red (strongly positive). The computation took 30 minutes using the Dumbo framework and a two-iteration job with 250 intermediate reducers.

We also solved a principal component problem to find a principal component basis for each image. Let G be the matrix of G_{i,j}'s from the regression and let u_i be the mean of the ith row in G. The principal components of the images are given by the right singular vectors of the matrix G − ue^T, where u is the vector of all the mean values and e is the 1024-by-1 vector of ones. That is, let G − ue^T = UΣV^T be the SVD; then the principal components are the columns of V. We compute V by first doing a TSQR of G − ue^T, and then computing an SVD of the final R, which is a small 1024-by-1024 matrix. The principal components are plotted as images in Figure 5. These images show a reasonable basis for images and are reminiscent of the basis in a discrete cosine transform.

A

80,000,000 images

1000 pixels

David Gleich · Purdue 5
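The PCA computation described in Section 3.5 can be sketched in a few lines of in-memory numpy: subtract the row means, take a QR factorization (the role TSQR plays at scale), then an SVD of the small R to get the principal components V. The matrix sizes and random data below are stand-ins chosen only so the sketch runs on a laptop.

import numpy as np

m, n = 10000, 32                     # in the talk: ~80,000,000 x 1024
G = np.random.rand(m, n)

u = G.mean(axis=1, keepdims=True)    # row means (u in the text)
Gc = G - u                           # G - u e^T

R = np.linalg.qr(Gc, mode='r')       # small n x n R; TSQR computes this at scale
_, S, Vt = np.linalg.svd(R)          # SVD of R: singular values and V
V = Vt.T                             # columns of V are the principal components

# Check against an SVD of the full centered matrix (feasible only at this size).
_, S_full, _ = np.linalg.svd(Gc, full_matrices=False)
print(np.allclose(S, S_full))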

Page 6: Big data matrix factorizations and Overlapping community detection in graphs

Models and algorithms for high performance matrix and network computations

David Gleich · Purdue 6


[Figure 4.5 panels: (a) Error, s = 0.39 cm; (b) Std, s = 0.39 cm; (c) Error, s = 1.95 cm; (d) Std, s = 1.95 cm.]

Fig. 4.5: Error in the reduced order model compared to the prediction standard deviation for one realization of the bubble locations at the final time for two values of the bubble radius, s = 0.39 and s = 1.95 cm. (Colors are visible in the electronic version.)

the varying conductivity fields took approximately twenty minutes to construct using Cubit after substantial optimizations.

Working with the simulation data involved a few pre- and post-processing steps: interpret 4TB of Exodus II files from Aria, globally transpose the data, compute the TSSVD, and compute predictions and errors. The preprocessing steps took approximately 8-15 hours. We collected precise timing information, but we do not report it as these times are from a multi-tenant, unoptimized Hadoop cluster where other jobs with sizes ranging between 100GB and 2TB of data sometimes ran concurrently. Also, during our computations, we observed failures in hard disk drives and issues causing entire nodes to fail. Given that the cluster has 40 cores, there was at most 2400 cpu-hours consumed via these calculations, compared to the 131,072 hours it took to compute 4096 heat transfer simulations on Red Sky. Thus, evaluating the ROM was about 50-times faster than computing a full simulation.

We used 20,000 reducers to convert the Exodus II simulation data. This choice determined how many map tasks each subsequent step utilized: around 33,000. We also found it advantageous to store matrices in blocks of about 16MB per record. The reduction in the data enabled us to use a laptop to compute the coefficients of the ROM and apply to the far face for the UQ study in Section 4.4.

Here are a few pertinent challenges we encountered while performing this study. Generating 8192 meshes with different material properties and running independent

Tensor eigenvalues and a power method

28

Tensor methods for network alignment

Network alignment is the problem of computing an approximate isomorphism between two networks. In collaboration with Mohsen Bayati, Amin Saberi, Ying Wang, and Margot Gerritsen, the PI has developed a state of the art belief propagation method (Bayati et al., 2009).

FIGURE 6 – Previous work from the PI tackled network alignment with matrix methods for edge overlap: edges (i, i') and (j, j') matched across networks A and B through the link graph L. This proposal is for matching triangles using tensor methods: edges (i, i'), (j, j'), and (k, k') forming a triangle in A and a triangle in B. If x_i, x_j, and x_k are indicators associated with the edges (i, i'), (j, j'), and (k, k'), then we want to include the product x_i x_j x_k in the objective, yielding a tensor problem.

We propose to study tensor methods to perform network alignment with triangle and other higher-order graph moment matching. Similar ideas were proposed by Svab (2007); Chertok and Keller (2010) also proposed using triangles to aid in network alignment problems. In Bayati et al. (2011), we found that triangles were a key missing component in a network alignment problem with a known solution. Given that preserving a triangle requires three edges between two graphs, this yields a tensor problem:

maximize  ∑_{i∈L} w_i x_i + ∑_{i∈L} ∑_{j∈L} x_i x_j S_{i,j} + ∑_{i∈L} ∑_{j∈L} ∑_{k∈L} x_i x_j x_k T_{i,j,k}   (the triple sum is the triangle overlap term)

subject to  x is a matching.

Here, T_{i,j,k} = 1 when the edges corresponding to i, j, and k in L result in a triangle in the induced matching. Maximizing this objective is an intractable problem. We plan to investigate a heuristic based on a rank-1 approximation of the tensor T and using a maximum-weight matching based rounding. Similar heuristics have been useful in other matrix-based network alignment algorithms (Singh et al., 2007; Bayati et al., 2009). The work involves enhancing the Symmetric-Shifted-Higher-Order Power Method due to Kolda and Mayo (2011) to incredibly large and sparse tensors. On this aspect, we plan to collaborate with Tamara G. Kolda. In an initial evaluation of this triangle matching on synthetic problems, using the tensor rank-1 approximation alone produced results that identified the correct solution whereas all matrix approaches could not.

vision for the future

All of these projects fit into the PI's vision for modernizing the matrix-computation paradigm to match the rapidly evolving space of network computations. This vision extends beyond the scope of the current proposal. For example, the web is a huge network with over one trillion unique URLs (Alpert and Hajaj, 2008), and search engines have indexed over 180 billion of them (Cuil, 2009). Yet, why do we need to compute with the entire network? By way of analogy, note that we do not often solve partial differential equations or model macro-scale physics by explicitly simulating the motion or interaction of elementary particles. We need something equivalent for the web and other large networks. Such investigations may take many forms: network models, network geometry, or network model reduction. It is the vision of the PI that the language, algebra, and methodology of matrix computations will


maximize  ∑_{ijk} T_{ijk} x_i x_j x_k   subject to  ‖x‖₂ = 1

Human protein interaction network: 48,228 triangles. Yeast protein interaction network: 257,978 triangles. The tensor T has ~100,000,000,000 nonzeros.

We work with it implicitly

[x^(next)]_i = ρ · ( ∑_{jk} T_{ijk} x_j x_k + γ x_i ),  where ρ rescales to a unit 2-norm

SSHOPM method due to Kolda and Mayo
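A toy, dense version of the shifted higher-order power iteration written above can be sketched with numpy's einsum. The talk's tensor T is huge, sparse, and used implicitly; the small random symmetric tensor and the particular shift value below are assumptions made only so the sketch runs.

import numpy as np

n = 10
T = np.random.rand(n, n, n)
# Symmetrize over all permutations of the three indices.
T = (T + T.transpose(0, 2, 1) + T.transpose(1, 0, 2) +
     T.transpose(1, 2, 0) + T.transpose(2, 0, 1) + T.transpose(2, 1, 0)) / 6.0

gamma = 1.0                       # shift (an assumed value for the demo)
x = np.ones(n) / np.sqrt(n)

for _ in range(100):
    y = np.einsum('ijk,j,k->i', T, x, x) + gamma * x   # sum_{jk} T_ijk x_j x_k + gamma x_i
    x = y / np.linalg.norm(y)                          # rho rescales to a unit 2-norm

lam = np.einsum('ijk,i,j,k->', T, x, x, x)             # tensor eigenvalue estimate
print(lam, x)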

Big data methods: SIMAX '09, SISC '11, MapReduce '11, ICASSP '12

Network alignment: ICDM '09, SC '11, TKDE '13

Fast & scalable network centrality: SC '05, WAW '07, SISC '10, WWW '10, …

Data clustering: WSDM '12, KDD '12, CIKM '13 …

Ax = b,  min ‖Ax − b‖,  Ax = λx

Massive matrix computations

on multi-threaded and distributed architectures

Page 7: Big data matrix factorizations and Overlapping community detection in graphs

PCA of 80,000,000 images

7/22

A: 80,000,000 images by 1000 pixels

[Pipeline diagram: MapReduce: zero-mean the rows, then TSQR produces R; post-processing: SVD of R produces V.]

First 16 columns of V as images; top 100 singular values (principal components)

David Gleich · Purdue Constantine & Gleich, MapReduce 2010.

Page 8: Big data matrix factorizations and Overlapping community detection in graphs

Input: 500,000,000-by-100 matrix
Each record: 1-by-100 row
HDFS size: 423.3 GB
Time to compute colsum(A): 161 sec.
Time to compute R in qr(A): 387 sec.

David Gleich · Purdue 8

Page 9: Big data matrix factorizations and Overlapping community detection in graphs

How to store tall-and-skinny matrices in Hadoop

David Gleich · Purdue 9

A1, A2, A3, A4 (submatrices of A)

A: m × n, m ≫ n. The key is an arbitrary row-id; the value is the 1 × n array for a row (or a b × n block). Each submatrix Ai is the input to a map task.
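A small sketch of this storage scheme: each record is an arbitrary integer key paired with one row (or a b × n block) of the matrix, so that any group of records forms a submatrix Ai for a map task. The helper name and the plain Python lists are illustrative only; the real codes go through Hadoop streaming / hadoopy record I/O.

import random
import numpy as np

def matrix_to_records(A, block_rows=1):
    """Yield (key, value) records for a tall-and-skinny matrix A."""
    m = A.shape[0]
    for start in range(0, m, block_rows):
        block = A[start:start + block_rows]
        key = random.randint(0, 2_000_000_000)   # arbitrary row/block id
        yield key, block.tolist()

A = np.random.rand(8, 4)
for key, value in matrix_to_records(A, block_rows=2):
    print(key, value)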

Page 10: Big data matrix factorizations and Overlapping community detection in graphs

Numerical stability was a problem for prior approaches

[Plot: norm(Q^T Q − I) vs. condition number for AR^{-1}, AR^{-1} + iterative refinement, and 4. Direct TSQR (Benson, Gleich, Demmel, BigData'13). Prior work: 1. Constantine & Gleich, MapReduce 2011; 2. Benson, Gleich, Demmel, BigData'13; 3. Benson, Gleich, Demmel, BigData'13.]

Previous methods couldn't ensure that the matrix Q was orthogonal

David Gleich · Purdue 10

Page 11: Big data matrix factorizations and Overlapping community detection in graphs

[Diagram: Mapper 1 (Serial TSQR) factors A1, A2, A3, A4 block by block, producing Q2 R2, Q3 R3, Q4 R4, and emits R4. Mapper 2 (Serial TSQR) does the same for A5, A6, A7, A8 and emits R8. Reducer 1 (Serial TSQR) factors the stacked R4 and R8 and emits Q and the final R.]

Algorithm: Data: rows of a matrix. Map: QR factorization of rows. Reduce: QR factorization of rows.

Communication-avoiding QR (Demmel et al. 2008) on MapReduce (Constantine and Gleich, 2011)

11

David Gleich · Purdue
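The tree of QR factorizations on this slide can be sketched serially with numpy: factor each row block and keep only its R (the map step), then stack the R factors and factor again (the reduce step). The block count and matrix below are arbitrary; the check at the end confirms the final R agrees with a direct QR up to row signs.

import numpy as np

def tsqr_r(A, nblocks=4):
    blocks = np.array_split(A, nblocks, axis=0)
    Rs = [np.linalg.qr(B, mode='r') for B in blocks]   # map: local QR, keep R
    return np.linalg.qr(np.vstack(Rs), mode='r')        # reduce: QR of stacked Rs

A = np.random.rand(1000, 10)
R_tree = tsqr_r(A)
R_full = np.linalg.qr(A, mode='r')
print(np.allclose(np.abs(R_tree), np.abs(R_full)))      # equal up to row signs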

Page 12: Big data matrix factorizations and Overlapping community detection in graphs

More about how to compute a regression

min ‖Ax − b‖² = min ∑_i ( ∑_j A_{ij} x_j − b_i )²

[Diagram: Mapper 1 (Serial TSQR) processes blocks A1, A2, A3, A4 with local qr steps and carries the right-hand side along: b2 = Q2^T b1.]

David Gleich · Purdue 12
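A serial numpy sketch of this regression scheme: stream the row blocks of A together with the matching pieces of b, keep only the current R and Q^T b (the b2 = Q2^T b1 update on the slide), and solve the small triangular system at the end. The sizes and random data are placeholders.

import numpy as np

def tsqr_lstsq(A_blocks, b_blocks):
    R, beta = None, None
    for Ab, bb in zip(A_blocks, b_blocks):
        if R is None:
            stackedA, stackedb = Ab, bb
        else:
            stackedA = np.vstack([R, Ab])          # previous R on top of new rows
            stackedb = np.concatenate([beta, bb])  # previous Q^T b on top of new b
        Q, R = np.linalg.qr(stackedA)              # local QR (the "qr" box on the slide)
        beta = Q.T @ stackedb                      # b_next = Q^T b, as in b2 = Q2^T b1
    return np.linalg.solve(R, beta)                # solve R x = Q^T b

A = np.random.rand(1000, 5)
b = np.random.rand(1000)
x = tsqr_lstsq(np.array_split(A, 4), np.array_split(b, 4))
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))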

Page 13: Big data matrix factorizations and Overlapping community detection in graphs

Too many maps cause too much data to one reducer!

Each image is 5k. Each HDFS block has 12,800 images; 6,250 total blocks. Each map outputs a 1000-by-1000 matrix. One reducer gets a 6.25M-by-1000 matrix (50 GB).

David Gleich · Purdue 13
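A rough size check behind the 50 GB figure, assuming the emitted R factors are stored as 8-byte doubles (an assumption; the slide does not state the element size):

\[
6{,}250 \text{ map outputs} \times 1000 \text{ rows each} = 6.25\times 10^{6} \text{ rows},
\]
\[
6.25\times 10^{6} \times 1000 \text{ columns} \times 8 \text{ bytes} = 5\times 10^{10} \text{ bytes} \approx 50\ \text{GB}.
\]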

Page 14: Big data matrix factorizations and Overlapping community detection in graphs

Too many maps cause too much data to one reducer!

[Diagram, iteration 1: Mappers 1-1 through 1-4 each run a Serial TSQR on their block of A and emit R1 through R4; after a shuffle, Reducers 1-1 through 1-3 run Serial TSQR on groups of these R factors and emit R2,1 through R2,3. Iteration 2: an identity map and a single Reducer 2-1 run Serial TSQR to produce the final R.]

David Gleich · Purdue 14

Page 15: Big data matrix factorizations and Overlapping community detection in graphs

The rest of the talk: Full TSQR code in hadoopy

15

David Gleich · Purdue

import random, numpy, hadoopy

class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        self.bsize = blocksize
        self.data = []
        if isreducer:
            self.__call__ = self.reducer
        else:
            self.__call__ = self.mapper

    def compress(self):
        R = numpy.linalg.qr(numpy.array(self.data), 'r')
        # reset data and re-initialize to R
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])

    def collect(self, key, value):
        self.data.append(value)
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()

    def close(self):
        self.compress()
        for row in self.data:
            key = random.randint(0, 2000000000)
            yield key, row

    def mapper(self, key, value):
        self.collect(key, value)

    def reducer(self, key, values):
        for value in values:
            self.mapper(key, value)

if __name__ == '__main__':
    mapper = SerialTSQR(blocksize=3, isreducer=False)
    reducer = SerialTSQR(blocksize=3, isreducer=True)
    hadoopy.run(mapper, reducer)
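A small local sanity check of the SerialTSQR class above, run entirely in memory with no Hadoop: push the rows of a random matrix through a mapper instance and compare the compressed result with numpy's R factor (they agree up to the sign of each row). This assumes the class definition above is available in the same file.

import numpy as np

A = np.random.rand(50, 3)

tsqr = SerialTSQR(blocksize=3, isreducer=False)
for i, row in enumerate(A):
    tsqr.collect(i, list(row))             # what mapper() does for each input row
R_rows = [row for _, row in tsqr.close()]  # close() compresses and emits rows of R

print(np.allclose(np.abs(np.array(R_rows)),
                  np.abs(np.linalg.qr(A, mode='r'))))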

Page 16: Big data matrix factorizations and Overlapping community detection in graphs

Non-negative matrix factorization

David Gleich · Purdue 16

(a) PCA  (b) NMF  (c) Manifold Learning

FIGURE 3 – Three standard examples of dimensionality reduction. The top row shows three raw datasets and the bottom row shows the output of the technique. Each datapoint is uniquely colored to show its identity before and after the transformation. In PCA, the direction of maximum variance is the dominant line, which is "unrotated" by projecting onto the dominant principal component. The dataset for NMF is a mixture of three "samples" represented by the vectors. Projecting into the non-negative factors (NNFs) recovers the underlying mixture. The manifold learning example successfully finds a linear structure underlying the curved S.

PCA. Principal components identify the directions of maximum variance in the data. If the data lie on a k-dimensional linear subspace, then the first k principal components will span this subspace. These are computable at scale (Constantine and Gleich, 2011; Halko et al., 2011a).

NMF. When the original data are a non-negative linear mixture of components – which is common in hyperspectral imaging where each pixel's spectrum is a non-negative mixture of elemental spectra – then the non-negative matrix factorization will recover both the underlying components and the mixture. This unmixing, or decomposition into parts, is the key difference from PCA (Lee and Seung, 1999). The underlying NMF computation is NP-complete (Vavasis, 2009), but recent work has identified a special case of NMF (Donoho and Stodden, 2004) when the problem has a polynomial time solution (Arora et al., 2012). Many new NMF algorithms and analyses exploit this case (Esser et al., 2012; Bittorf et al., 2012; Gillis, 2013).

Manifold learning. Manifold learning is a recent approach for dimensionality reduction, when the data sets lie on a nonlinear manifold rather than in a linear subspace. As our prior work shows (§4.5), many high dimensional climate data sets lie on a low-dimensional manifold embedded in a high-dimensional feature space. A number of algorithms have been proposed for this problem, among which are Isomap (Tenenbaum et al., 2000); Locally Linear Embedding (Roweis and Saul, 2000); Laplacian Eigenmaps (Belkin and Niyogi, 2003); and Hessian Locally Linear Embedding (Donoho and Grimes, 2003). Laplacian Eigenmaps has been applied to climate data by Giannakis, Majda, and Tung, which they call Nonlinear Laplacian Spectral Analysis (NLSA) (Giannakis and Majda, 2012a; Giannakis et al., 2012).

The various approaches proposed to solve the manifold learning problem differ in computational complexity, asymptotic optimality (if any!), whether they solve a local embedding problem or a

NMF: Find W, H ≥ 0 where A ≈ WH

Separable NMF: Find H ≥ 0 and A(:,K) where A ≈ A(:,K)H

Page 17: Big data matrix factorizations and Overlapping community detection in graphs

There are good algorithms for separable NMF that avoid alternating between W, H.

David Gleich · Purdue 17

NMF: Find W, H ≥ 0 where A ≈ WH

Separable NMF: Find H ≥ 0 and A(:,K) where A ≈ A(:,K)H

Page 18: Big data matrix factorizations and Overlapping community detection in graphs

Separable NMF algorithms

1.  Find the columns of A. 2.  Find the values of W.

David Gleich · Purdue 18


Separable NMF: Find H ≥ 0 and A(:,K) where A ≈ A(:,K)H

Page 19: Big data matrix factorizations and Overlapping community detection in graphs

Separable NMF algorithms are really geometry

1.  Find the columns of A. Equivalent to "Find the extreme points of a convex set."

2.  These are preserved under linear transformations

David Gleich · Purdue 19


Separable NMF: Find H ≥ 0 and A(:,K) where A ≈ A(:,K)H

Page 20: Big data matrix factorizations and Overlapping community detection in graphs

We use our tall-and-skinny QR to get an orthogonal transformation that makes the problem easily solvable.

David Gleich · Purdue 20

Page 21: Big data matrix factorizations and Overlapping community detection in graphs

David Gleich · Purdue 21

[Diagram: SVD A = U S V^T; NMF A ≈ A(:,K) H]

1. Compute QR using the TSQR method.

2. Run a separable NMF method on S V^T.

3. Find H by solving a small non-negative least-squares problem in each column. These are tiny.
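A small dense sketch of steps 2 and 3: run the successive projection algorithm (SPA) on column-normalized data to pick extreme columns, then recover H one column at a time with non-negative least squares. At scale these steps run on the small S V^T produced by TSQR; here they run directly on a tiny synthetic separable matrix. The SPA variant and helper names below are written from the description in the talk, not taken from the paper's code.

import numpy as np
from scipy.optimize import nnls

def spa(M, r):
    """Pick r extreme column indices of M by successive projection."""
    R = M.astype(float).copy()
    cols = []
    for _ in range(r):
        j = int(np.argmax(np.sum(R * R, axis=0)))   # column with largest norm
        cols.append(j)
        u = R[:, j] / np.linalg.norm(R[:, j])
        R -= np.outer(u, u @ R)                      # project that direction out
    return cols

def separable_nmf(A, r):
    An = A / A.sum(axis=0)                           # normalize columns before SPA
    K = spa(An, r)
    # Step 3: one tiny non-negative least-squares problem per column of A.
    H = np.column_stack([nnls(A[:, K], A[:, j])[0] for j in range(A.shape[1])])
    return K, H

# Synthetic separable data: A = A(:, K) H with H >= 0 and K = columns 0..4.
W = np.random.rand(40, 5)
H_true = np.hstack([np.eye(5), np.random.rand(5, 15)])
A = W @ H_true
K, H = separable_nmf(A, 5)
print(sorted(K), np.linalg.norm(A - A[:, K] @ H))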

Page 22: Big data matrix factorizations and Overlapping community detection in graphs

All of the hard analysis is on the small dimension of the matrix, which makes this very useful in practice.

David Gleich · Purdue 22

Page 23: Big data matrix factorizations and Overlapping community detection in graphs

Our methods vs. the competition

David Gleich · Purdue 23

Figure 1: Relative error in the separable factorization as a function of nonnegative rank (r) for the three algorithms. The matrix was synthetically generated to be separable. SPA and GP capture all of the true extreme columns when r = 20 (where the residual is zero). Since we are using the greedy variant of XRAY, it takes r = 21 to capture all of the extreme columns.

Figure 2: First 20 extreme columns selected by SPA, XRAY, and GP along with the true columns used in the synthetic matrix generation. A marker is present for a given column index if and only if that column is a selected extreme column. SPA and GP capture all of the true extreme columns. Since we are using the greedy variant of XRAY, it does not select all of the true extreme columns (the columns marked Generation).

3.5 Communication costs for NMF on MapReduce

There are two communication costs that we analyze for MapReduce. The first is the time to read the input data. In Hadoop, data is stored on disk and loading the data is frequently the dominant cost in numerical algorithms. The second is the time spent shuffling data. This can be roughly measured by the number and size of the key-value pairs sorted in the shuffle step. Current implementations of TSQR and TSSVD in MapReduce can compute R or ΣV^T in a single MapReduce iteration [5]. For the dimension reduction, the data matrix only needs to be read once. Although algorithms such as Hott Topixx, SPA, and Gaussian projection require normalized columns, we showed that the column norms can be computed at the same time as TSQR (see Section 3.3). For Gaussian projection, we cannot compute the factor H in the same projected space. To remedy this, we combine TSQR with the Gaussian projection in a single pass over the data. Following this initial step, the H matrix is computed as in Section 2.4.

The map processes in the MapReduce implementations for TSQR, TSSVD, and Algorithm 1 emit O(n · #(map tasks)) keys to the shuffle stage (one for each row of the reduced matrix). The key-value pairs are O(n) in length – each pair represents a partial row sum of the resultant n × n matrix. For tall-and-skinny matrices, n may as well be considered a constant as it is often incredibly small. Thus, our communication is optimal.

4. TESTING ON SYNTHETIC MATRICES

In this section, we test our dimension reduction techniques on tall-and-skinny matrices that are synthetically generated to be separable or near-separable. All experiments were conducted on a 10-node, 40-core MapReduce cluster at Stanford's Institute for Computational and Mathematical Engineering (ICME). Each node has 6 2-TB disks, 24 GB of RAM, and a single Intel Core i7-960 3.2 GHz processor. They are connected via Gigabit ethernet. We test the following three algorithms:

1. Dimension reduction with the SVD followed by SPA. As a pre-processing step for SPA, the columns are normalized.

2. Dimension reduction with the SVD followed by the greedy variant of the XRAY algorithm. The greedy method is not exact in the separable case but works well in practice [25]. The algorithm does not normalize columns and thus requires only one pass over the data.

3. Gaussian projection (GP) as described in Section 2.3. Columns are normalized in a pre-processing step.

Using our dimension reduction technique, all three algorithms require only one pass over the data. The algorithms were selected to be a representative set of the approaches in the literature, and we will refer to the three algorithms as SPA, XRAY, and GP. As discussed in Section 2.2, the choice of QR or SVD does not matter for these algorithms (although it may matter for other NMF algorithms). Thus, we only consider the SVD transformation in the subsequent numerical experiments. We generate a separable matrix X with m = 200 million rows and n = 200 columns. The nonnegative rank (r in


200 million rows, 200 columns, separation rank 20.

Page 24: Big data matrix factorizations and Overlapping community detection in graphs

David Gleich · Purdue 24

Nonlinear heat transfer model in random media Each run takes 5 hours on 8 processors, outputs 4M (node) by 9 (time-step) simulation

We did 8192 runs (128 samples of bubble locations, 64 bubble radii) 4.5 TB of data in Exodus II (NetCDF)

Apply heat; look at the temperature.

https://www.opensciencedatacloud.org/publicdata/heat-transfer/

Page 25: Big data matrix factorizations and Overlapping community detection in graphs

[Plot: proportion of temp. > 475 K vs. bubble radius; curves for True, ROM, and RS, with an inset for radii 15-25.]

David Gleich · Purdue 25

[Plot: proportion of temp. > 475 K vs. bubble radius, with the insulator regime and non-insulator regime marked.]

Page 26: Big data matrix factorizations and Overlapping community detection in graphs

David Gleich · Purdue 26

A: each simulation is a column; a 5B-by-64 matrix, 2.2 TB

[Diagram: SVD A = U S V^T; NMF A ≈ A(:,K) H]

Run a "standard" NMF algorithm on S V^T

Page 27: Big data matrix factorizations and Overlapping community detection in graphs

David Gleich · Purdue 27

Figure 9: Coefficient matrix H for SPA, XRAY, and GP for the heat transfer simulation data when r = 10. In all cases, the non-extreme columns are conic combinations of two of the selected columns, i.e., each column in H has at most two non-zero values. Specifically, the non-extreme columns are conic combinations of the two extreme columns that "sandwich" them in the matrix. See Figure 10 for a closer look at the coefficients.

Figure 8: First 10 extreme columns selected by SPA, XRAY, and GP for the heat transfer simulation data. The separation rank r = 10 was chosen based on the residual curves in Figure 7. For the heat transfer simulation data, the columns with larger indices are more extreme. However, the algorithms still select different extreme columns.

Figure 10: Value of the H matrix for columns 1 through 34 for the SPA algorithm on the heat transfer simulation data matrix with separation rank r = 10. Columns 1 and 34 were selected as extreme columns by the algorithm, while columns 2 through 33 were not. The two curves show the value of the matrix H in rows 1 and 34 for many columns. For these columns of H, the value is negligible for other rows.

...cific targets on the surface of blood cells. The phenotype and function of individual cells can be identified by decoding these label combinations. The analyzed data set contains measurements of 40,000 single cells. The measurement fluorescence intensity conveying the abundance information was collected at five different bands corresponding to the FITC, PE, ECD, PC5, and PC7 fluorescent labels tagging antibodies against CD4, CD8, CD19, CD45, and CD3 epitopes.

The results are represented as the data matrix A of size 40,000 × 5. Our interest in the presented analysis was to study pairwise interactions in the data (cell vs. cell, and marker vs. marker). Thus, we are interested in the matrix X = A ⊗ A, the Kronecker product of A with itself. Each row of X corresponds to a pair of cells and each column to a pair of marker abundance values. X has dimension 40,000² × 5² and occupies 345 GB on HDFS.

Figure 11 shows the residuals for the three algorithms applied to the FC data for varying values of the separation rank. In contrast to the heat transfer simulation data, the relative errors are quite large for small r. In fact, SPA has large relative error until nearly all columns are selected (r = 22). Figure 12 shows the columns selected when r = 16. XRAY and GP only disagree on one column. SPA chooses different columns, which is not surprising given the relative residual error. Interestingly, the columns involving the sec-


Page 28: Big data matrix factorizations and Overlapping community detection in graphs

A bunch of papers

Constantine & Gleich, MapReduce 2011
Benson, Gleich & Demmel, BigData 2013
Benson, Gleich, Rajwa & Lee, arXiv 2014
Constantine, Gleich, Hou, Templeton, SISC (in press)
Code online: github.com/arbenson

David Gleich · Purdue 28

Page 29: Big data matrix factorizations and Overlapping community detection in graphs

Next talk

1.  Personalized PageRank"based community detection

2.  The best community detection algorithm?

David Gleich · Purdue 29

Page 30: Big data matrix factorizations and Overlapping community detection in graphs

A community is a set of vertices that is denser inside than out.

David Gleich · Purdue 30

Page 31: Big data matrix factorizations and Overlapping community detection in graphs

250 node GEOP network in 2 dimensions 31

Page 32: Big data matrix factorizations and Overlapping community detection in graphs

250 node GEOP network in 2 dimensions 32

Page 33: Big data matrix factorizations and Overlapping community detection in graphs

We can find communities using Personalized PageRank (PPR) [Andersen et al. 2006]. PPR is a Markov chain on nodes:

1.  with probability α, follow a random edge

2.  with probability 1−α, restart at a seed

aka the random surfer, aka random walk with restart; it has a unique stationary distribution
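Written as a linear system, the unique stationary distribution mentioned above satisfies the usual personalized PageRank equation. Here A is the adjacency matrix, D the diagonal degree matrix, and s the normalized indicator of the seed set; this notation is mine, not from the slide.

\[
x = \alpha\, A D^{-1} x + (1-\alpha)\, s
\qquad\Longleftrightarrow\qquad
(I - \alpha A D^{-1})\, x = (1-\alpha)\, s .
\]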

David Gleich · Purdue 33

Page 34: Big data matrix factorizations and Overlapping community detection in graphs

Personalized PageRank community detection 1.  Given a seed, approximate the

stationary distribution. 2.  Extract the community.

Both are local operations.

David Gleich · Purdue 34

Page 35: Big data matrix factorizations and Overlapping community detection in graphs

Conductance communities. Conductance is one of the most important community scores [Schaeffer07]. The conductance of a set of vertices is the ratio of edges leaving to total edges. Equivalently, it's the probability that a random edge leaves the set. Small conductance ⇔ good community.

φ(S) = cut(S) / min( vol(S), vol(S̄) ) = (edges leaving the set) / (total edges in the set)

Example: cut(S) = 7, vol(S) = 33, vol(S̄) = 11, so φ(S) = 7/11.

David Gleich · Purdue 35
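A small sketch of this conductance computation for a graph stored as a dictionary-of-sets (the same representation the PPR code later in the talk uses); undirected edges and a toy graph are assumed.

def conductance(G, S):
    S = set(S)
    cut = sum(1 for v in S for u in G[v] if u not in S)    # edges leaving S
    vol_S = sum(len(G[v]) for v in S)                       # sum of degrees in S
    vol_rest = sum(len(G[v]) for v in G if v not in S)
    return cut / min(vol_S, vol_rest)

# Toy example: a 4-cycle plus a pendant vertex.
G = {0: {1, 3}, 1: {0, 2}, 2: {1, 3, 4}, 3: {0, 2}, 4: {2}}
print(conductance(G, {0, 1, 2, 3}))   # 1 edge leaves, vol(S)=9, vol(S̄)=1 -> 1.0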

Page 36: Big data matrix factorizations and Overlapping community detection in graphs

Andersen-Chung-Lang personalized PageRank community theorem [Andersen et al. 2006]

Informally Suppose the seeds are in a set of good conductance, then the personalized PageRank method will find a set with conductance that’s nearly as good. … also, it’s really fast.

David Gleich · Purdue 36

Page 37: Big data matrix factorizations and Overlapping community detection in graphs

import collections

# G is the graph as a dictionary-of-sets; seed is the list of seed vertices
alpha = 0.99
tol = 1e-4

x = {}                        # store x, r as dictionaries
r = {}                        # initialize residual
Q = collections.deque()       # initialize queue
for s in seed:
    r[s] = 1./len(seed)
    Q.append(s)
while len(Q) > 0:
    v = Q.popleft()           # v has r[v] > tol*deg(v)
    if v not in x: x[v] = 0.
    x[v] += (1-alpha)*r[v]
    mass = alpha*r[v]/(2*len(G[v]))
    for u in G[v]:            # for each neighbor u of v
        if u not in r: r[u] = 0.
        if r[u] < len(G[u])*tol and \
           r[u] + mass >= len(G[u])*tol:
            Q.append(u)       # add u to the queue if its residual became large
        r[u] = r[u] + mass
    r[v] = mass*len(G[v])

David Gleich · Purdue 37
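The "extract the community" step is typically a sweep cut over the vector computed above: order vertices by x[v]/deg(v) and keep the prefix with the smallest conductance (Andersen et al. 2006). A minimal sketch, reusing the conductance() helper sketched earlier; recomputing conductance from scratch at each step keeps the code short but is not how one would implement it at scale.

def sweep_cut(G, x):
    order = sorted(x, key=lambda v: x[v] / len(G[v]), reverse=True)
    best_set, best_cond = None, float('inf')
    S = set()
    for v in order:
        S.add(v)
        if len(S) == len(G):            # stop before S is the whole graph
            break
        cond = conductance(G, S)
        if cond < best_cond:
            best_cond, best_set = cond, set(S)
    return best_set, best_cond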

Page 38: Big data matrix factorizations and Overlapping community detection in graphs

Problem 1, which seeds?

David Gleich · Purdue 38

Page 39: Big data matrix factorizations and Overlapping community detection in graphs

Whang-Gleich-Dhillon, CIKM2013 [upcoming…]

1.  Extract part of the graph that might have overlapping communities.

2.  Compute a partitioning of the network into many pieces (think sqrt(n)) using Graclus.

3.  Find the center of these partitions. 4.  Use PPR to grow egonets of these centers.

David Gleich · Purdue 39

Page 40: Big data matrix factorizations and Overlapping community detection in graphs

Table 4: Returned number of clusters and graph coverage of each algorithm

Graph        Metric           random   egonet   graclus ctr.  spread hubs  demon    bigclam
HepPh        coverage (%)     97.1     72.1     100           100          88.8     62.1
             no. of clusters  97       241      109           100          5,138    100
AstroPh      coverage (%)     97.6     71.1     100           100          94.2     62.3
             no. of clusters  192      282      256           212          8,282    200
CondMat      coverage (%)     92.4     99.5     100           100          91.2     79.5
             no. of clusters  199      687      257           202          10,547   200
DBLP         coverage (%)     99.9     86.3     100           100          84.9     94.6
             no. of clusters  21,272   8,643    18,477        26,503       174,627  25,000
Amazon       coverage (%)     99.9     100      100           100          79.2     99.2
             no. of clusters  21,553   14,919   20,036        27,763       105,828  25,000
Flickr       coverage (%)     76.0     54.0     100           93.6         -        52.1
             no. of clusters  14,638   24,150   16,347        15,349       -        15,000
LiveJournal  coverage (%)     88.9     66.7     99.8          99.8         -        43.9
             no. of clusters  14,850   34,389   16,271        15,058       -        15,000
Myspace      coverage (%)     91.4     69.1     100           99.9         -        -
             no. of clusters  14,909   67,126   16,366        15,324       -        -

[Figure 2 panels (a) AstroPh, (b) HepPh, (c) CondMat, (d) Flickr, (e) LiveJournal, (f) Myspace: maximum conductance vs. coverage (percentage) for egonet, graclus centers, spread hubs, random, demon, and bigclam seeding.]

Figure 2: Conductance vs. graph coverage – lower curve indicates better communities. Overall, "graclus centers" outperforms other seeding strategies, including the state-of-the-art methods Demon and Bigclam.

Flickr social network: 2M vertices, 22M edges. We can cover 95% of the network with communities of conductance ~0.15.

A good partitioning helps.

David Gleich · Purdue 40

flickr sample - 2M verts, 22M edges

Page 41: Big data matrix factorizations and Overlapping community detection in graphs

[Bar charts: F1 and F2 measures on DBLP and Amazon for demon, bigclam, graclus centers, spread hubs, random, and egonet.]

Figure 3: F1 and F2 measures comparing our algorithmic communities to ground truth – a higher bar indicates better communities.

[Bar chart: run time in hours on Amazon and DBLP for demon, bigclam, graclus centers, spread hubs, random, and egonet.]

Figure 4: Runtime on Amazon and DBLP – the seed set expansion algorithm is faster than Demon and Bigclam.

5.4 Comparison of running times

Finally, we compare the algorithms by runtime. Figure 4 and Table 5 show the runtime of each algorithm. We run the single-thread version of Bigclam for the HepPh, AstroPh, CondMat, DBLP, and Amazon networks, and use the multi-threaded version with 20 threads for the Flickr, Myspace, and LiveJournal networks.

As can be seen in Figure 4, the seed set expansion methods are much faster than Demon and Bigclam on the DBLP and Amazon networks. On small networks (HepPh, AstroPh, CondMat), our algorithm with "spread hubs" is faster than Demon and Bigclam. On large networks (Flickr, LiveJournal, Myspace), our seed set expansion methods are much faster than Bigclam even though we compare a single-threaded implementation of our method with 20 threads for Bigclam.

6. DISCUSSION AND CONCLUSION

We now discuss the results from our experimental investigations. First, we note that our seed set expansion method was the only method that worked on all of the problems. Also, our method is faster than both Bigclam and Demon.

Our seed set expansion algorithm is also easy to parallelize because each seed can be expanded independently. This property indicates that the runtime of the seed set expansion method could be further reduced in a multi-threaded version. Also, we can use any other high quality partitioning scheme instead of Graclus, including those with parallel and distributed implementations [25]. Perhaps surprisingly, the major difference in cost between using Graclus centers for the seeds and the other seed choices does not result from the expense of running Graclus. Rather, it arises because the personalized PageRank expansion technique takes longer for the seeds chosen by Graclus and spread hubs. When the PageRank expansion method has a larger input set, it tends to take longer, and the input sets we provide for the spread hubs and Graclus seeding strategies are the neighborhood sets of high degree vertices.

Another finding that emerges from our results is that using random seeds outperforms both Bigclam and Demon. We believe there are two reasons for this finding. First, random seeds are likely to be in some set of reasonable conductance, as also discussed by Andersen and Lang [5]. Second, and importantly, a recent study by Abrahao [2] showed that personalized PageRank clusters are topologically similar to real-world clusters [2]. Any method that uses this technique will find clusters that look real.

Finally, we wish to address the relationship between our results and some prior observations on overlapping communities. The authors of Bigclam found that the dense regions of a graph reflect areas of overlap between overlapping communities. By using a conductance measure, we ought to find only these dense regions – however, our method produces much larger communities that cover the entire graph. The reason for this difference is that we use the entire vertex neighborhood as the restart for the personalized PageRank expansion routine. We avoid seeding exclusively inside a dense region by using an entire vertex neighborhood as a seed, which grows the set beyond the dense region. Thus, the communities we find likely capture a combination of communities given by the egonet of the original seed node. To expand on this point, in experiments we omit due to space, we found that seeding solely on the node itself – rather than us-

Using datasets from Yang and Leskovec (WSDM 2013) with known overlapping community structure, our method outperforms current state-of-the-art overlapping community detection methods. Even randomly seeded!

And it helps to find real-world overlapping communities too.

David Gleich · Purdue 41

Page 42: Big data matrix factorizations and Overlapping community detection in graphs

Proposed Algorithm

Seed Set Expansion: carefully select seeds, then greedily expand communities around the seed sets.

The algorithm: Filtering Phase, Seeding Phase, Seed Set Expansion Phase, Propagation Phase.

Joyce Jiyoung Whang, The University of Texas at Austin Conference on Information and Knowledge Management (8/44)

David Gleich · Purdue 42

Page 43: Big data matrix factorizations and Overlapping community detection in graphs

David Gleich · Purdue 43

Filtering Phase

Joyce Jiyoung Whang, The University of Texas at Austin Conference on Information and Knowledge Management (9/44)

Filtering Phase

Joyce Jiyoung Whang, The University of Texas at Austin Conference on Information and Knowledge Management (14/44)

Page 44: Big data matrix factorizations and Overlapping community detection in graphs

David Gleich · Purdue 44

Seeding Phase

Joyce Jiyoung Whang, The University of Texas at Austin Conference on Information and Knowledge Management (16/44)

Seed Set Expansion Phase

Joyce Jiyoung Whang, The University of Texas at Austin Conference on Information and Knowledge Management (25/44)

Run clustering, and choose centers or pick an independent set of high degree nodes

Run personalized PageRank

Page 45: Big data matrix factorizations and Overlapping community detection in graphs

David Gleich · Purdue 45

Propagation Phase

Joyce Jiyoung Whang, The University of Texas at Austin Conference on Information and Knowledge Management (28/44)

Propagation Phase

Joyce Jiyoung Whang, The University of Texas at Austin Conference on Information and Knowledge Management (30/44)

We can prove that this only improves the objective

Page 46: Big data matrix factorizations and Overlapping community detection in graphs

Conclusion & Discussion

PPR community detection is fast "[Andersen et al. FOCS06]

PPR communities look real "[Abrahao et al. KDD2012; Zhu et al. ICML2013]

Partitioning for seeding yields high coverage & real communities. "Caveman" communities?

David Gleich · Purdue 46

Gleich & Seshadhri, KDD2012
Whang, Gleich & Dhillon, CIKM2013
PPR sample: bit.ly/18khzO5
Egonet seeding: bit.ly/dgleich-code

References

Best conductance cut at intersection of communities?