Tall-and-Skinny QR Factorizations in MapReduce Architectures
DESCRIPTION
An update on how to compute tall-and-skinny QR factorizations in MapReduce
TRANSCRIPT
Tall-and-Skinny QR Factorizations in MapReduce

David F. Gleich, Purdue University, Computer Science Department
Paul G. Constantine, Stanford University
Austin Benson and James Demmel, UC Berkeley

ICME Seminar, David Gleich, Purdue
Tall-and-skinny matrices (m ≫ n) arise in:
- regression with many samples
- block iterative methods
- panel factorizations
- model reduction problems
- general linear models with many samples
- tall-and-skinny SVD/PCA

All of these applications need a QR factorization of a tall-and-skinny matrix; some only need R.

[Figure: a tall-and-skinny matrix A, from the tinyimages collection.]
MapReduce is great for TSQR! You don't need A^T A.
Data: a tall-and-skinny (TS) matrix, stored by rows
Map: QR factorization of local rows
Reduce: QR factorization of local rows

Input: 500,000,000-by-100 matrix
Each record: 1-by-100 row
HDFS size: 423.3 GB
Time to compute Ae (column sums): 161 sec.
Time to compute R in qr(A): 387 sec.
Time to compute Q in qr(A) stably: ... stay tuned ...
Quick review of QR

Let A be a real m-by-n matrix with m ≥ n. The QR factorization is

    A = Q [R; 0],

where Q is orthogonal (Q^T Q = I) and R is upper triangular.

Using QR for regression: the solution of min_x ||Ax - b|| is given by the solution of R x = Q^T b.

QR is "block normalization": the usual way to "normalize" a vector generalizes to computing the Q in the QR factorization.

David Gleich (Sandia) · MapReduce 2011
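As a concrete sketch of the regression use (the sizes and random data here are hypothetical), solve min ||Ax - b|| through R x = Q^T b with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 10                      # tall and skinny: m >> n
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Reduced QR: Q is m-by-n with orthonormal columns, R is n-by-n upper triangular.
Q, R = np.linalg.qr(A)

# The least-squares solution of min ||Ax - b|| solves R x = Q^T b.
x = np.linalg.solve(R, Q.T @ b)

# It matches the library's least-squares solver.
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
```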
Quiz for ICME qual-takers! Is R unique?
PCA of 80,000,000 images

[Figure: pipeline. A is 80,000,000 images by 1000 pixels. MapReduce stage: zero-mean the rows, then TSQR produces R. Post-processing: an SVD of R gives V. Shown: the first 16 columns of V as images, and the top 100 singular values (the principal components).]

The principal components look like a 2D cosine transform.
Thanks Wikipedia!
Input parameters s map to the time history of a simulation, f(s). The database stores the pairs s1 -> f1, s2 -> f2, ..., sk -> fk.

The simulation as a vector: stack the quantity q at every mesh point and every time step,

    f(s) = [ q(x1, t1, s); ...; q(xn, t1, s); q(x1, t2, s); ...; q(xn, t2, s); ...; q(xn, tk, s) ],

where each block q(x1, tj, s), ..., q(xn, tj, s) is a single simulation at one time step.

The database as a very tall-and-skinny matrix:

    X = [ f(s1)  f(s2)  ...  f(sp) ].

CS&E Seminar David Gleich · Purdue
MapReduce
Intro to MapReduce

Originated at Google for indexing web pages and computing PageRank.

The idea: bring the computations to the data. Express algorithms in data-local operations. Implement one type of communication: shuffle. Shuffle moves all data with the same key to the same reducer.

[Figure: maps (M) feed a shuffle into reduces (R). Input is stored in triplicate; map output is persisted to disk before the shuffle; reduce input/output is on disk. Data scalable; fault-tolerance by design.]
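The map-shuffle-reduce pattern can be sketched in a few lines of plain Python (an in-memory toy, not Hadoop; all names are illustrative), here running the classic word count:

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    # Map: each record becomes zero or more (key, value) pairs.
    shuffled = defaultdict(list)
    for rec in records:
        for key, value in mapper(rec):
            # Shuffle: all values with the same key go to the same reducer.
            shuffled[key].append(value)
    # Reduce: aggregate each key's values independently.
    return {key: reducer(key, values) for key, values in shuffled.items()}

# Word count, the classic example.
docs = ["the cat", "the dog"]
counts = map_reduce(
    docs,
    mapper=lambda doc: [(w, 1) for w in doc.split()],
    reducer=lambda w, ones: sum(ones),
)
# counts == {"the": 2, "cat": 1, "dog": 1}
```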
Mesh point variance in MapReduce

[Figure: data from Run 1, Run 2, and Run 3, each with snapshots at T=1, T=2, T=3.]
Mesh point variance in MapReduce

1. Each mapper outputs the mesh points with the same key.
2. Shuffle moves all values from the same mesh point to the same reducer.
3. Reducers just compute a numerical variance.

Bring the computations to the data!
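A minimal sketch of that variance computation, with made-up run data (the nested-list layout is an assumption for illustration):

```python
from collections import defaultdict
from statistics import pvariance

# Made-up data: runs[r][t][p] is the value at mesh point p,
# snapshot t, run r (three runs, two snapshots, two points).
runs = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[1.5, 2.5], [3.5, 4.5]],
    [[0.5, 1.5], [2.5, 3.5]],
]

# Map: emit ((time, point), value), so the same mesh point in
# every run gets the same key.
shuffled = defaultdict(list)
for run in runs:
    for t, snapshot in enumerate(run):
        for p, value in enumerate(snapshot):
            shuffled[(t, p)].append(value)

# Reduce: just compute a numerical variance per mesh point.
variance = {key: pvariance(vals) for key, vals in shuffled.items()}
```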
MapReduce vs. Hadoop

MapReduce: a computation model with
  Map - a local data transform
  Shuffle - a grouping function
  Reduce - an aggregation

Hadoop: an implementation of MapReduce using the HDFS parallel file system.

Others: Phoenix++, Twister, Google MapReduce, Spark, ...
That is, we store the runs.

[Figure: supercomputer -> data computing cluster -> engineer.]

Each multi-day HPC simulation generates gigabytes of data. A data cluster can hold hundreds or thousands of old simulations, enabling engineers to query and analyze months of simulation data for statistical studies and uncertainty quantification, and to build the interpolant from the pre-computed data.
Tall-and-Skinny QR
Communication-avoiding TSQR (Demmel et al. 2008)

First, do QR factorizations of each local matrix. Second, compute a QR factorization of the new "R".

Demmel et al. 2008. Communication-avoiding parallel and sequential QR.
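A small NumPy sketch of the two-stage construction (the block count and matrix sizes are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4000, 50))

# First, QR-factor each local block of rows (the map step).
blocks = np.array_split(A, 8)
local_Rs = [np.linalg.qr(blk, mode="r") for blk in blocks]

# Second, stack the small R factors and QR-factor the stack (the reduce step).
R = np.linalg.qr(np.vstack(local_Rs), mode="r")

# R agrees with the R of a direct QR of A, up to the sign of each row.
R_direct = np.linalg.qr(A, mode="r")
```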
Fully serial TSQR (Demmel et al. 2008)

Compute the QR of one block, read the next block, update the QR, and so on.
Tall-and-skinny matrix storage in MapReduce

Key: an arbitrary row id. Value: the array for a row. Each submatrix (A1, A2, A3, A4) is an input split.

Ask Austin about block row storage!
[Figure: Mapper 1 runs serial TSQR on A1, A2, A3, A4 — qr of (A1; A2) gives Q2 and R2, qr of (R2; A3) gives Q3 and R3, qr of (R3; A4) gives Q4 and R4 — and emits R4. Mapper 2 does the same on A5, A6, A7, A8 and emits R8. Reducer 1 runs serial TSQR on R4 and R8 and emits the final R.]

Algorithm
Data: rows of a matrix
Map: QR factorization of rows
Reduce: QR factorization of rows
Key Limitations
Computes only R and not Q
Can get Q via Q = AR+ with another MR iteration. (We currently use this for computing the SVD.)
Dubious numerical stability; iterative refinement helps.
Why MapReduce?
Full code in hadoopy

import random, numpy, hadoopy

class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        self.bsize = blocksize
        self.data = []
        if isreducer: self.__call__ = self.reducer
        else: self.__call__ = self.mapper

    def compress(self):
        R = numpy.linalg.qr(numpy.array(self.data), 'r')
        # reset data and re-initialize to R
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])

    def collect(self, key, value):
        self.data.append(value)
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()

    def close(self):
        self.compress()
        for row in self.data:
            key = random.randint(0, 2000000000)
            yield key, row

    def mapper(self, key, value):
        self.collect(key, value)

    def reducer(self, key, values):
        for value in values:
            self.mapper(key, value)

if __name__ == '__main__':
    mapper = SerialTSQR(blocksize=3, isreducer=False)
    reducer = SerialTSQR(blocksize=3, isreducer=True)
    hadoopy.run(mapper, reducer)
Performance results (simulated faults)

We can still run with P(fault) = ... with only ~... performance penalty. However, with P(fault) small, we still see a performance hit.

Fault injection

[Figure: time to completion (sec.) vs. 1/Prob(failure), the mean number of successes per failure (10 to 1000). Curves: no faults and faults for a 200M-by-200 matrix, and no faults and faults for an 800M-by-10 matrix.]

With 1/5 tasks failing, the job only takes twice as long.
How to get Q?
Idea 1 (unstable)

[Figure: a TSQR pass computes R; distribute R to the mappers; Mapper 1 forms Q1 = A1 R^+, Q2 = A2 R^+, Q3 = A3 R^+, Q4 = A4 R^+.]
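A NumPy sketch of why Idea 1 is unstable (the ill-conditioned test matrix is a made-up construction): forming Q = A R^+ reproduces A, but orthogonality degrades in proportion to cond(A).

```python
import numpy as np

# An ill-conditioned tall-and-skinny matrix: orthogonal factors
# around a wide spread of singular values (cond(A) ~ 1e8).
rng = np.random.default_rng(2)
U, _ = np.linalg.qr(rng.standard_normal((1000, 6)))
V, _ = np.linalg.qr(rng.standard_normal((6, 6)))
A = U @ np.diag(np.logspace(0, -8, 6)) @ V.T

# Idea 1: take R from TSQR, then form Q = A R^+ in a second pass.
R = np.linalg.qr(A, mode="r")
Q = A @ np.linalg.inv(R)

# The computed Q is far from orthogonal: norm(Q^T Q - I) is much
# larger than machine precision.
err = np.linalg.norm(Q.T @ Q - np.eye(6))
```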
Idea 2 (better)

[Figure: as in Idea 1, a TSQR pass computes R; distribute R and form Qi = Ai R^+. Then run TSQR again on the computed Q (Mapper 2): it produces a second small triangular factor T; distribute the combined triangular factor S (the product of T and R) and form the refined Qi = Ai S^+.]

There's a famous quote, attributed to Parlett, that "two iterations of iterative refinement are enough."
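A sketch of one refinement step in NumPy (same made-up ill-conditioned matrix as the Idea 1 sketch; here the second TSQR is applied directly to the computed Q):

```python
import numpy as np

rng = np.random.default_rng(3)
U, _ = np.linalg.qr(rng.standard_normal((1000, 6)))
V, _ = np.linalg.qr(rng.standard_normal((6, 6)))
A = U @ np.diag(np.logspace(0, -8, 6)) @ V.T   # cond(A) ~ 1e8

I = np.eye(6)

# Pass 1: Q1 = A R^+ loses orthogonality for ill-conditioned A.
R = np.linalg.qr(A, mode="r")
Q1 = A @ np.linalg.inv(R)

# Refinement: TSQR the computed Q1 to get a small triangular S,
# then take Q = Q1 S^+. The combined triangular factor is S @ R.
S = np.linalg.qr(Q1, mode="r")
Q = Q1 @ np.linalg.inv(S)

before = np.linalg.norm(Q1.T @ Q1 - I)
after = np.linalg.norm(Q.T @ Q - I)
```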
Idea 3 (best!)

[Figure: Mapper 1 computes local factors Q1 R1, Q2 R2, Q3 R3, Q4 R4, writing the Q output and R output to separate files. Task 2 collects R1, ..., R4, computes the final R and the pieces Q1^1, Q2^1, Q3^1, Q4^1. Mapper 3 forms the true Q from the Qi and Qi^1.]

1. Output local Q and R in separate files.
2. Collect R on one node, compute Qs for each piece.
3. Distribute the pieces of Q^1 and form the true Q.
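The three steps can be sketched in NumPy (the sizes and the four-way split are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((2000, 20))
blocks = np.array_split(A, 4)

# Step 1: each mapper outputs its local Q_i and R_i.
local = [np.linalg.qr(blk) for blk in blocks]
Qs = [q for q, _ in local]
Rs = [r for _, r in local]

# Step 2: collect the R_i on one task; QR the stack to get the final R
# and the small inner factors Q_i^1 (one 20-row block per mapper).
Q_inner, R = np.linalg.qr(np.vstack(Rs))
inner_blocks = np.array_split(Q_inner, 4)

# Step 3: distribute the pieces of Q^1 and form the true Q = Q_i @ Q_i^1.
Q = np.vstack([Qi @ Qi1 for Qi, Qi1 in zip(Qs, inner_blocks)])
```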
Achieving numerical stability

[Figure: norm(Q^T Q - I) vs. condition number (10^2 to 10^8) for three methods: AR^+, AR^+ with iterative refinement, and full TSQR.]
The price is right!

[Figure: running time in seconds (roughly 500 to 2500).]

Full TSQR is faster than refinement for few columns, and not any slower for many columns.
Improving performance

Lots of data? Add an iteration. Too many maps? Add an iteration!

[Figure: Iteration 1 — Mappers 1-1 through 1-4 run serial TSQR on A1, ..., A4 and emit R1map, ..., R4map; a shuffle sends these to Reducers 1-1 through 1-3, which run serial TSQR and emit R2,1, R2,2, R2,3. Iteration 2 — an identity map and a single Reducer 2-1 run serial TSQR on the stack S(2) to produce the final R.]
Summary of parameters (mrtsqr)

Blocksize: how many rows to read before computing a QR factorization, expressed as a multiple of the number of columns.
Splitsize: the size of each local matrix.
Reduction tree: the number of reducers and iterations to use. (See paper.)

[Figure: Iteration 1 maps the splits of A and shuffles into reducers producing S(1); iterations 2 and 3 reduce S(1) to S(2) to the final R.]
Varying splitsize and the reduction tree (synthetic data)

Cols.  Iters.  Split (MB)  Maps  Secs.
50     1       64          8000  388
50     1       256         2000  184
50     1       512         1000  149
50     2       64          8000  425
50     2       256         2000  220
50     2       512         1000  191
1000   1       512         1000  666
1000   2       64          6000  590
1000   2       256         2000  432
1000   2       512         1000  337

Increasing split size improves performance (it accounts for Hadoop data movement). Increasing iterations helps for problems with many columns. (1000 columns with a 64-MB split size overloaded the single reducer.)
Currently investigating how to do multiple iterations with Full TSQR.
MapReduce TSQR summary

MapReduce is great for TSQR!
Data: a tall-and-skinny (TS) matrix, stored by rows
Map: QR factorization of local rows
Reduce: QR factorization of local rows

Input: 500,000,000-by-100 matrix; each record a 1-by-100 row; 423.3 GB on HDFS.
Time to compute the column sums: 161 sec. Time to compute R in qr(A): 387 sec.
On a 64-node Hadoop cluster with 4x2TB disks, one Core i7-920, and 12 GB RAM per node.

Demmel et al. showed that this construction works to compute a QR factorization with minimal communication.
Do I have to write in Java?
Hadoop streaming frameworks

Framework  Iter 1 QR (secs.)  Iter 1 Total (secs.)  Iter 2 Total (secs.)  Overall Total (secs.)
Dumbo      67725              960                   217                   1177
Hadoopy    70909              612                   118                   730
C++        15809              350                   37                    387
Java       —                  436                   66                    502

Synthetic data test: 100,000,000-by-500 matrix (~500 GB). Codes implemented in MapReduce streaming. Matrix stored as TypedBytes lists of doubles. Python frameworks use NumPy+Atlas; a custom C++ TypedBytes reader/writer with Atlas; plus a new non-streaming Java implementation. All timing results are from the Hadoop job tracker.

C++ in streaming beats a native Java implementation.
What’s next? Investigate randomized algorithms for computing SVDs for fatter matrices.
From Halko, Martinsson, Tropp, SIREV 2011:

Algorithm: Randomized PCA. Given an m × n matrix A, the number k of principal components, and an exponent q, this procedure computes an approximate rank-2k factorization UΣV^T. The columns of U estimate the first 2k principal components of A.

Stage A:
1. Generate an n × 2k Gaussian test matrix Ω.
2. Form Y = (AA^T)^q AΩ by multiplying alternately with A and A^T.
3. Construct a matrix Q whose columns form an orthonormal basis for the range of Y.

Stage B:
1. Form B = Q^T A.
2. Compute an SVD of the small matrix: B = Û Σ V^T.
3. Set U = Q Û.

The singular spectrum of a data matrix often decays quite slowly. To address this difficulty, the algorithm incorporates q steps of a power iteration, where q = 1 or 2 usually suffices in practice. The procedure requires only 2(q+1) passes over the matrix, so it is efficient even for matrices stored out-of-core. The flop count is

    T_randPCA ≈ q k T_mult + k^2 (m + n),

where T_mult is the cost of a matrix-vector multiply with A or A^T.

Theorem 1.2. Suppose that A is a real m × n matrix. Select an exponent q and a target number k of principal components, where 2 ≤ k ≤ 0.5 min{m, n}. Execute the randomized PCA algorithm to obtain a rank-2k factorization UΣV^T. Then

    E ||A - UΣV^T|| ≤ [ 1 + 4 sqrt( 2 min{m, n} / (k - 1) ) ]^{1/(2q+1)} σ_{k+1},

where E denotes expectation with respect to the random test matrix and σ_{k+1} is the (k+1)th singular value of A. The power iteration drives the leading constant to one exponentially fast as q increases. Since a rank-k approximation can never achieve error smaller than σ_{k+1}, the randomized procedure estimates 2k principal components that carry essentially as much variance as the first k actual principal components.

Halko, Martinsson, Tropp. SIREV 2011.
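A compact NumPy sketch of the two-stage procedure (the matrix sizes and the synthetic spectrum are made up for illustration):

```python
import numpy as np

def randomized_pca(A, k, q=2, rng=None):
    # Stage A: Gaussian test matrix plus q steps of power iteration.
    rng = rng or np.random.default_rng()
    m, n = A.shape
    Omega = rng.standard_normal((n, 2 * k))      # n-by-2k test matrix
    Y = A @ Omega
    for _ in range(q):                           # Y = (A A^T)^q A Omega
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)                       # orthonormal basis for range(Y)
    # Stage B: SVD of the small projected matrix B = Q^T A.
    B = Q.T @ A
    U_hat, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ U_hat, s, Vt

# A test matrix with known, rapidly decaying singular values.
rng = np.random.default_rng(5)
Uo, _ = np.linalg.qr(rng.standard_normal((500, 80)))
Vo, _ = np.linalg.qr(rng.standard_normal((80, 80)))
sigma = np.logspace(0, -6, 80)
A = Uo @ np.diag(sigma) @ Vo.T

# The leading singular values are recovered accurately.
U, s, Vt = randomized_pca(A, k=10, q=2, rng=rng)
```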
Questions?
Most recent code at http://github.com/arbenson/mrtsqr
Varying blocksize

[Figure: performance as blocksize varies, using the C++ framework.]