Tall-and-Skinny QR Factorizations in MapReduce Architectures



DESCRIPTION

An update on how to compute tall-and-skinny QR factorizations in MapReduce

TRANSCRIPT

Page 1: Tall-and-skinny QR factorizations in MapReduce architectures

Tall-and-Skinny QR Factorizations in MapReduce

DAVID F. GLEICH, PURDUE UNIVERSITY, COMPUTER SCIENCE DEPARTMENT
PAUL G. CONSTANTINE, STANFORD UNIVERSITY
AUSTIN BENSON AND JAMES DEMMEL, UC BERKELEY

Page 2: Tall-and-skinny QR factorizations in MapReduce architectures

2

Tall-and-skinny matrices (m ≫ n) arise in

regression with many samples
block iterative methods
panel factorizations
model reduction problems
general linear models with many samples
tall-and-skinny SVD/PCA

All of these applications need a QR factorization of a tall-and-skinny matrix; some only need R.

(Figure: a tall-and-skinny matrix A, drawn from the tinyimages collection.)

Page 3: Tall-and-skinny QR factorizations in MapReduce architectures

MapReduce is great for TSQR! You don't need A^T A.

Data: a tall-and-skinny (TS) matrix, stored by rows
Map: QR factorization of local rows
Reduce: QR factorization of local rows

Input: 500,000,000-by-100 matrix
Each record: 1-by-100 row
HDFS size: 423.3 GB
Time to compute Ae (column sums): 161 sec.
Time to compute R in qr(A): 387 sec.
Time to compute Q in qr(A) stably: … stay tuned …

Page 4: Tall-and-skinny QR factorizations in MapReduce architectures

Quick review of QR

Let A be a real m-by-n matrix with m ≥ n. Then A = QR, where Q is m-by-n and orthogonal (Q^T Q = I) and R is n-by-n and upper triangular. (Figure: A = QR, with zeros below the triangular part of R.)

Using QR for regression: the least-squares solution of min ||Ax - b|| is given by the solution of Rx = Q^T b.

QR is block normalization: "normalizing" a vector usually generalizes to computing the Q factor in the QR factorization.
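
A minimal numpy sketch (not from the slides) of the regression use above: once A = QR is available, the least-squares problem reduces to a small triangular solve against R.

import numpy as np

A = np.random.randn(1000, 5)          # tall-and-skinny design matrix
b = np.random.randn(1000)
Q, R = np.linalg.qr(A)                # thin QR: Q is 1000x5, R is 5x5
x = np.linalg.solve(R, Q.T @ b)       # solve R x = Q^T b
# agrees with the normal-equations solution without ever forming A^T A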

Page 5: Tall-and-skinny QR factorizations in MapReduce architectures

Quiz for ICME qual-takers! Is R unique?


Page 6: Tall-and-skinny QR factorizations in MapReduce architectures

PCA of 80,000,000 images

A is 80,000,000 images by 1000 pixels.

MapReduce stage: zero-mean the rows, then TSQR produces R.
Post-processing stage: an SVD of R gives V, the principal components.

First 16 columns of V shown as images; top 100 singular values (principal components).
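
A hedged sketch of the post-processing step: once TSQR has produced the small n-by-n factor R of the zero-mean data, the right singular vectors (principal components) and singular values come from an SVD of R alone, since A = QR and Q has orthonormal columns. The matrix below is a stand-in, not the slides' data.

import numpy as np

n = 1000                               # e.g., pixels per image
R = np.triu(np.random.randn(n, n))     # stand-in for the R from TSQR
_, svals, Vt = np.linalg.svd(R)        # same singular values/right vectors as A
components = Vt[:16, :]                # first 16 principal components
top100 = svals[:100]                   # top 100 singular values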

Page 7: Tall-and-skinny QR factorizations in MapReduce architectures

The principal components look like a 2d cosine transform


Thanks Wikipedia!

Page 8: Tall-and-skinny QR factorizations in MapReduce architectures

Input "Parameters

Time history"of simulation

s f

The Database

s1 -> f1 s2 -> f2

sk -> fk

f(s) =

2

66666666666664

q(x1, t1, s)...

q(xn

, t1, s)q(x1, t2, s)

...q(x

n

, t2, s)...

q(xn

, t

k

, s)

3

77777777777775

A single simulation at one time step

X =⇥f(s1) f(s2) ... f(sp)

The database as a very"tall-and-skinny matrix Th

e sim

ulatio

n as

a v

ecto

r

CS&E Seminar David Gleich · Purdue 8

Page 9: Tall-and-skinny QR factorizations in MapReduce architectures

MapReduce


Page 10: Tall-and-skinny QR factorizations in MapReduce architectures

Intro to MapReduce

Originated at Google for indexing web pages and computing PageRank.

The idea: bring the computations to the data. Express algorithms in data-local operations. Implement one type of communication: shuffle. Shuffle moves all data with the same key to the same reducer.

(Figure: maps, shuffle, reduce. Input stored in triplicate; map output persisted to disk before the shuffle; reduce input/output on disk.)

Data scalable. Fault-tolerance by design.

Page 11: Tall-and-skinny QR factorizations in MapReduce architectures

Mesh point variance in MapReduce

(Figure: three simulation runs, each with mesh data at T=1, T=2, T=3.)

Page 12: Tall-and-skinny QR factorizations in MapReduce architectures

Mesh point variance in MapReduce

1. Each mapper outputs the mesh points with the same key.
2. Shuffle moves all values from the same mesh point to the same reducer.
3. Reducers just compute a numerical variance.

Bring the computations to the data!
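
A toy sketch of this job, with hypothetical record names (run_id, mesh_values), not the slides' actual code: the map emits (mesh-point id, value) pairs, the shuffle groups by mesh point, and the reduce computes a variance per point.

import numpy as np

def mapper(run_id, mesh_values):
    # emit one (mesh point id, value) pair per mesh point in this run/time step
    for point_id, value in enumerate(mesh_values):
        yield point_id, value

def reducer(point_id, values):
    # all values for one mesh point, across runs and time steps
    yield point_id, float(np.var(list(values)))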

Page 13: Tall-and-skinny QR factorizations in MapReduce architectures

MapReduce vs. Hadoop.

MapReduce: a computation model with
Map: a local data transform
Shuffle: a grouping function
Reduce: an aggregation

Hadoop: an implementation of MapReduce using the HDFS parallel file system.

Others: Phoenix++, Twisted, Google MapReduce, Spark, …

Page 14: Tall-and-skinny QR factorizations in MapReduce architectures

That is, we store the runs. (Figure: supercomputer → data computing cluster → engineer.)

Each multi-day HPC simulation generates gigabytes of data. A data cluster can hold hundreds or thousands of old simulations, enabling engineers to query and analyze months of simulation data for statistical studies and uncertainty quantification, and to build the interpolant from the pre-computed data.

Page 15: Tall-and-skinny QR factorizations in MapReduce architectures

Tall-and-Skinny QR


Page 16: Tall-and-skinny QR factorizations in MapReduce architectures

Communication-avoiding QR (Demmel et al. 2008): communication-avoiding TSQR

Demmel et al. 2008. Communication-avoiding parallel and sequential QR.

First, do QR factorizations of each local matrix A_i.
Second, compute a QR factorization of the new "R" (the stacked local R factors).
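
A minimal numpy sketch of the two-stage idea, under the assumption of four local blocks (it is not the MapReduce implementation): factor each block, then factor the stack of the local R factors.

import numpy as np

A = np.random.randn(4000, 50)
blocks = np.array_split(A, 4)                        # the "local matrices"
Rs = [np.linalg.qr(Ai, mode='r') for Ai in blocks]   # first stage: local R factors
R = np.linalg.qr(np.vstack(Rs), mode='r')            # second stage: QR of the stacked R's
# R matches the R from np.linalg.qr(A) up to signs of its rows.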

Page 17: Tall-and-skinny QR factorizations in MapReduce architectures

Serial QR factorizations (Demmel et al. 2008): fully serial TSQR

Demmel et al. 2008. Communication-avoiding parallel and sequential QR.

Compute the QR of the first block, read the next block, update the QR, …
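
A sketch of the fully serial variant (assumptions: blocks arrive as an iterable of row blocks): keep a running R and fold in one block at a time, so only one block plus a small R is ever in memory.

import numpy as np

def serial_tsqr(blocks):
    R = None
    for Ai in blocks:                       # read A1, A2, ... in turn
        stacked = Ai if R is None else np.vstack([R, Ai])
        R = np.linalg.qr(stacked, mode='r')  # update the running R
    return R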

Page 18: Tall-and-skinny QR factorizations in MapReduce architectures

Tall-and-skinny matrix storage in MapReduce

Key is an arbitrary row id; value is the array of entries for that row.

Each submatrix A_i is an input split.

(Figure: A partitioned by rows into submatrices A1, A2, A3, A4.)

Ask Austin about block row storage!

Page 19: Tall-and-skinny QR factorizations in MapReduce architectures

(Figure: Mapper 1 runs serial TSQR over blocks A1–A4 and emits R4; Mapper 2 runs serial TSQR over blocks A5–A8 and emits R8; Reducer 1 runs serial TSQR over R4 and R8 and emits the final R.)

Algorithm
Data: rows of a matrix
Map: QR factorization of rows
Reduce: QR factorization of rows

Page 20: Tall-and-skinny QR factorizations in MapReduce architectures

Key Limitations

Computes only R and not Q

Can get Q via Q = AR+ with another MR iteration. (We currently use this for computing the SVD.)

Dubious numerical stability; iterative refinement helps.


Page 21: Tall-and-skinny QR factorizations in MapReduce architectures

Why MapReduce?


Page 22: Tall-and-skinny QR factorizations in MapReduce architectures

Full code in hadoopy

import random, numpy, hadoopy

class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        self.bsize = blocksize
        self.data = []
        if isreducer:
            self.__call__ = self.reducer
        else:
            self.__call__ = self.mapper

    def compress(self):
        R = numpy.linalg.qr(numpy.array(self.data), 'r')
        # reset data and re-initialize to R
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])

    def collect(self, key, value):
        self.data.append(value)
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()

    def close(self):
        self.compress()
        for row in self.data:
            key = random.randint(0, 2000000000)
            yield key, row

    def mapper(self, key, value):
        self.collect(key, value)

    def reducer(self, key, values):
        for value in values:
            self.mapper(key, value)

if __name__ == '__main__':
    mapper = SerialTSQR(blocksize=3, isreducer=False)
    reducer = SerialTSQR(blocksize=3, isreducer=True)
    hadoopy.run(mapper, reducer)


Page 23: Tall-and-skinny QR factorizations in MapReduce architectures

Performance results (simulated faults)

Fault injection: we can still run with faults injected, paying only a performance penalty; however, even with a small fault probability we still see a performance hit.

(Figure: time to completion in seconds, roughly 100–200, versus 1/Prob(failure), the mean number of successes per failure, from 10 to 1000; curves for faults and no faults on a 200M-by-200 matrix and an 800M-by-10 matrix.)

With 1/5 of tasks failing, the job only takes twice as long.

Page 24: Tall-and-skinny QR factorizations in MapReduce architectures

How to get Q?


Page 25: Tall-and-skinny QR factorizations in MapReduce architectures

Idea 1 (unstable)

(Figure: TSQR computes R; R is distributed back to the mappers; each mapper forms its piece of Q as Q_i = A_i R+.)
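
A numpy sketch of Idea 1, on a small in-memory example rather than the MapReduce job: after TSQR produces R, each local block recovers its piece of Q with one multiply by the inverse of R. This is a single extra pass, but the computed Q loses orthogonality when A is ill conditioned.

import numpy as np

A = np.random.randn(2000, 20)
R = np.linalg.qr(A, mode='r')
Q = A @ np.linalg.inv(R)                       # Q = A R+ (the unstable construction)
print(np.linalg.norm(Q.T @ Q - np.eye(20)))    # departure from orthogonality grows with cond(A)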

Page 26: Tall-and-skinny QR factorizations in MapReduce architectures

Idea 2 (better)

(Figure: as in Idea 1, TSQR computes R, R is distributed, and each mapper forms Q_i = A_i R+. Then a second pass runs TSQR on the computed Q to get T; S = RT is distributed, and each mapper forms its refined piece of Q with S+.)

There's a famous quote, attributed to Parlett, that "two iterations of iterative refinement are enough."
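
A sketch of one refinement step in numpy, again on an in-memory example and not the MapReduce pipeline: apply the Idea-1 construction once, then TSQR the resulting Q and apply the correction factor T.

import numpy as np

A = np.random.randn(2000, 20)
R = np.linalg.qr(A, mode='r')
Q1 = A @ np.linalg.inv(R)              # first pass: Q1 = A R+
T = np.linalg.qr(Q1, mode='r')         # TSQR of the computed Q gives the correction T
Q2 = Q1 @ np.linalg.inv(T)             # refined Q; the refined factorization is A = Q2 (T R)
print(np.linalg.norm(Q2.T @ Q2 - np.eye(20)))   # much closer to orthogonal than Q1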

Page 27: Tall-and-skinny QR factorizations in MapReduce architectures

Communication-avoiding TSQR, revisited (Demmel et al. 2008)

Demmel et al. 2008. Communication-avoiding parallel and sequential QR.

First, do QR factorizations of each local matrix A_i.
Second, compute a QR factorization of the new "R".

Page 28: Tall-and-skinny QR factorizations in MapReduce architectures

Idea 3 (best!)

(Figure: Mapper 1 factors A1–A4 into local Q_i and R_i, writing the Q output and R output to separate files. Task 2 collects R1–R4 on one node, computes the QR of the stacked R's, and produces the final R along with pieces Q11, Q21, Q31, Q41 of the inner Q. Mapper 3 distributes those pieces and multiplies each local Q_i by its piece to form the true Q.)

1. Output local Q and R in separate files.
2. Collect R on one node, compute Qs for each piece.
3. Distribute the pieces of Q*1 and form the true Q.
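
A compact numpy sketch of the Full TSQR construction, assuming four equal row blocks (the distributed version writes the local Q's and R's to files instead of keeping them in memory): keep the first-level Q blocks, QR the stacked R's, then multiply each local Q by its slice of the second-level Q.

import numpy as np

A = np.random.randn(4000, 20)
n = A.shape[1]
blocks = np.array_split(A, 4)
QRs = [np.linalg.qr(Ai) for Ai in blocks]                 # step 1: local (Q_i, R_i)
Q2, R = np.linalg.qr(np.vstack([Ri for _, Ri in QRs]))    # step 2: QR of the stacked R's
Q = np.vstack([Qi @ Q2[i * n:(i + 1) * n, :]              # step 3: Q_i times its piece of Q2
               for i, (Qi, _) in enumerate(QRs)])
print(np.linalg.norm(Q @ R - A), np.linalg.norm(Q.T @ Q - np.eye(n)))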

Page 29: Tall-and-skinny QR factorizations in MapReduce architectures

Achieving numerical stability

(Figure: norm(Q^T Q - I) versus condition number from 10^2 to 10^8, for three methods: AR+, AR+ with iterative refinement, and Full TSQR.)

Page 30: Tall-and-skinny QR factorizations in MapReduce architectures

The price is right!

(Figure: running time in seconds, roughly 500–2500, versus number of columns.)

Full TSQR is faster than refinement for few columns … and not any slower for many columns.

Page 31: Tall-and-skinny QR factorizations in MapReduce architectures

Improving performance


Page 32: Tall-and-skinny QR factorizations in MapReduce architectures

Lots of data? Add an iteration.

(Figure, iteration 1: Mappers 1-1 through 1-4 run serial TSQR on blocks of A and emit first-level R factors; a shuffle sends them to Reducers 1-1 through 1-3, which run serial TSQR and emit R2,1, R2,2, R2,3. Iteration 2: an identity map and a single Reducer 2-1 run serial TSQR on those factors and emit the final R.)

Too many maps? Add an iteration!
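
A numpy sketch of the "add an iteration" idea (the fan_in parameter and helper name are illustrative, not the mrtsqr interface): when one reducer cannot absorb all of the first-level R factors, reduce them in rounds, each round standing in for one extra MapReduce iteration.

import numpy as np

def reduce_round(Rs, fan_in=8):
    # group fan_in R factors per simulated reducer and QR each stack
    groups = [Rs[i:i + fan_in] for i in range(0, len(Rs), fan_in)]
    return [np.linalg.qr(np.vstack(g), mode='r') for g in groups]

Rs = [np.linalg.qr(np.random.randn(200, 50), mode='r') for _ in range(64)]
while len(Rs) > 1:
    Rs = reduce_round(Rs)     # each pass is one more iteration of the reduction tree
R = Rs[0]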

Page 33: Tall-and-skinny QR factorizations in MapReduce architectures

Summary of parameters (mrtsqr)

Blocksize: how many rows to read before computing a QR factorization, expressed as a multiple of the number of columns (see paper).
Splitsize: the size of each local matrix.
Reduction tree: the number of reducers and iterations to use.

Page 34: Tall-and-skinny QR factorizations in MapReduce architectures

Varying splitsize and the tree (synthetic data)

Cols.   Iters.   Split (MB)   Maps   Secs.
50      1        64           8000   388
–       –        256          2000   184
–       –        512          1000   149
–       2        64           8000   425
–       –        256          2000   220
–       –        512          1000   191
1000    1        512          1000   666
–       2        64           6000   590
–       –        256          2000   432
–       –        512          1000   337

Increasing split size improves performance (it accounts for Hadoop data movement).
Increasing iterations helps for problems with many columns.
(1000 columns with a 64-MB split size overloaded the single reducer.)

Page 35: Tall-and-skinny QR factorizations in MapReduce architectures

Currently investigating how to do multiple iterations with Full TSQR.


Page 36: Tall-and-skinny QR factorizations in MapReduce architectures

MapReduce TSQR summary

MapReduce is great for TSQR!
Data: a tall-and-skinny (TS) matrix by rows
Map: QR factorization of local rows
Reduce: QR factorization of local rows

Input: 500,000,000-by-100 matrix
Each record: 1-by-100 row
HDFS size: 423.3 GB
Time to compute the norm of each column: 161 sec.
Time to compute R in qr(A): 387 sec.

On a 64-node Hadoop cluster with 4x2TB disks, one Core i7-920, and 12 GB RAM per node.

Demmel et al. showed that this construction computes a QR factorization with minimal communication.

Page 37: Tall-and-skinny QR factorizations in MapReduce architectures

Do I have to write in Java?


Page 38: Tall-and-skinny QR factorizations in MapReduce architectures

Hadoop streaming performance

Framework   Iter 1 QR (secs.)   Iter 1 Total (secs.)   Iter 2 Total (secs.)   Overall Total (secs.)
Dumbo       67725               960                    217                    1177
Hadoopy     70909               612                    118                    730
C++         15809               350                    37                     387
Java        –                   436                    66                     502

Synthetic data test: 100,000,000-by-500 matrix (~500 GB).
Codes implemented in MapReduce streaming.
Matrix stored as TypedBytes lists of doubles.
Python frameworks use NumPy+Atlas.
Custom C++ TypedBytes reader/writer with Atlas.
New non-streaming Java implementation too.

All timing results from the Hadoop job tracker.

C++ in streaming beats a native Java implementation.

Page 39: Tall-and-skinny QR factorizations in MapReduce architectures

What’s next? Investigate randomized algorithms for computing SVDs for fatter matrices.


(Excerpt reproduced on the slide from Halko, Martinsson, and Tropp.)

Algorithm: Randomized PCA. Given an m-by-n matrix A, the number k of principal components, and an exponent q, this procedure computes an approximate rank-2k factorization UΣV*. The columns of U estimate the first 2k principal components of A.

Stage A:
1. Generate an n-by-2k Gaussian test matrix Ω.
2. Form Y = (AA*)^q A Ω by multiplying alternately with A and A*.
3. Construct a matrix Q whose columns form an orthonormal basis for the range of Y.

Stage B:
1. Form B = Q*A.
2. Compute an SVD of the small matrix: B = Û Σ V*.
3. Set U = Q Û.

The singular spectrum of the data matrix often decays quite slowly. To address this difficulty, we incorporate q steps of a power iteration, where q = 1, 2 is usually sufficient in practice. This procedure requires only 2(q + 1) passes over the matrix, so it is efficient even for matrices stored out-of-core. The flop count is T_randPCA ≈ q k T_mult + k^2 (m + n), where T_mult is the cost of a matrix–vector multiply with A or A*.

Theorem 1.2. Suppose that A is a real m-by-n matrix. Select an exponent q and a target number k of principal components, where 2 ≤ k ≤ 0.5 min{m, n}. Execute the randomized PCA algorithm to obtain a rank-2k factorization UΣV*. Then

E ||A - UΣV*|| ≤ [ 1 + 4 sqrt( 2 min{m, n} / (k - 1) ) ]^{1/(2q+1)} σ_{k+1},   (1.8)

where E denotes expectation with respect to the random test matrix and σ_{k+1} is the (k+1)th singular value of A.

This result is new. The bracket in (1.8) is essentially the same as the bracket in the basic error bound, and the power iteration drives the leading constant to one exponentially fast as the power q increases. Since the rank-k approximation of A can never achieve an error smaller than σ_{k+1}, the randomized procedure estimates 2k principal components that carry essentially as much variance as the first k actual principal components. Truncating the approximation to the first k terms typically produces very accurate results.

Halko, Martinsson, Tropp. SIREV 2011
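
A hedged numpy sketch of the randomized PCA procedure quoted above, for a matrix that fits in memory; it is not the slides' MapReduce implementation, and the function name is illustrative.

import numpy as np

def randomized_pca(A, k, q=1):
    m, n = A.shape
    Omega = np.random.randn(n, 2 * k)            # Stage A1: Gaussian test matrix
    Y = A @ Omega
    for _ in range(q):                           # Stage A2: Y = (A A^T)^q A Omega
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)                       # Stage A3: orthonormal basis for range(Y)
    B = Q.T @ A                                  # Stage B1: small 2k-by-n matrix
    Uhat, S, Vt = np.linalg.svd(B, full_matrices=False)   # Stage B2
    return Q @ Uhat, S, Vt                       # Stage B3: approximate rank-2k factorization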

Page 40: Tall-and-skinny QR factorizations in MapReduce architectures


Questions?

Most recent code at http://github.com/arbenson/mrtsqr

Page 41: Tall-and-skinny QR factorizations in MapReduce architectures

Varying blocksize

Using the C++ framework. (Figure: timing as the blocksize parameter varies.)