Tall-and-Skinny QR Factorizations in MapReduce Architectures
DESCRIPTION
An update on how to compute tall-and-skinny QR factorizations in MapReduce
TRANSCRIPT
Tall-and-Skinny QR Factorizations in MapReduce

David F. Gleich, Purdue University, Computer Science Department
Paul G. Constantine, Stanford University
Austin Benson and James Demmel, UC Berkeley

ICME Seminar, David Gleich, Purdue
Tall-and-skinny matrices (m ≫ n) arise in:
- regression with many samples
- block iterative methods
- panel factorizations
- model reduction problems
- general linear models with many samples
- tall-and-skinny SVD/PCA

All of these applications need a QR factorization of a tall-and-skinny matrix; some only need R.

[Figure: a tall-and-skinny matrix A, from the tinyimages collection.]
MapReduce is great for TSQR! You don't need A^T A.
Data: a tall-and-skinny (TS) matrix, stored by rows
Map: QR factorization of local rows
Reduce: QR factorization of local rows

Input: 500,000,000-by-100 matrix
Each record: 1-by-100 row
HDFS size: 423.3 GB
Time to compute Ae (column sums): 161 sec.
Time to compute R in qr(A): 387 sec.
Time to compute Q in qr(A) stably: ... stay tuned ...
Quick review of QR

Let A be a real m-by-n matrix with m ≥ n. The QR factorization is

    A = Q [R; 0],

where Q is orthogonal (Q^T Q = I) and R is upper triangular.

Using QR for regression: the solution of min_x ||Ax - b|| is given by the solution of R x = Q^T b.

QR is "block normalization": the usual way to "normalize" a vector generalizes to computing the Q in the QR factorization.

David Gleich (Sandia) · MapReduce 2011
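As a concrete sketch of the regression use (the sizes and random data here are hypothetical), solve min ||Ax - b|| through R x = Q^T b with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 10                      # tall and skinny: m >> n
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Reduced QR: Q is m-by-n with orthonormal columns, R is n-by-n upper triangular.
Q, R = np.linalg.qr(A)

# The least-squares solution of min ||Ax - b|| solves R x = Q^T b.
x = np.linalg.solve(R, Q.T @ b)

# It matches the library's least-squares solver.
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
```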
Quiz for ICME qual-takers! Is R unique?
PCA of 80,000,000 images

[Figure: pipeline. A is 80,000,000 images by 1000 pixels. MapReduce stage: zero-mean the rows, then TSQR produces R. Post-processing: an SVD of R gives V. Shown: the first 16 columns of V as images, and the top 100 singular values (the principal components).]

The principal components look like a 2D cosine transform.
Thanks Wikipedia!
Input parameters s map to the time history of a simulation, f(s). The database stores the pairs s1 -> f1, s2 -> f2, ..., sk -> fk.

The simulation as a vector: stack the quantity q at every mesh point and every time step,

    f(s) = [ q(x1, t1, s); ...; q(xn, t1, s); q(x1, t2, s); ...; q(xn, t2, s); ...; q(xn, tk, s) ],

where each block q(x1, tj, s), ..., q(xn, tj, s) is a single simulation at one time step.

The database as a very tall-and-skinny matrix:

    X = [ f(s1)  f(s2)  ...  f(sp) ].

CS&E Seminar David Gleich · Purdue
MapReduce
Intro to MapReduce

Originated at Google for indexing web pages and computing PageRank.

The idea: bring the computations to the data. Express algorithms in data-local operations. Implement one type of communication: shuffle. Shuffle moves all data with the same key to the same reducer.

[Figure: maps (M) feed a shuffle into reduces (R). Input is stored in triplicate; map output is persisted to disk before the shuffle; reduce input/output is on disk. Data scalable; fault-tolerance by design.]
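The map-shuffle-reduce pattern can be sketched in a few lines of plain Python (an in-memory toy, not Hadoop; all names are illustrative), here running the classic word count:

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    # Map: each record becomes zero or more (key, value) pairs.
    shuffled = defaultdict(list)
    for rec in records:
        for key, value in mapper(rec):
            # Shuffle: all values with the same key go to the same reducer.
            shuffled[key].append(value)
    # Reduce: aggregate each key's values independently.
    return {key: reducer(key, values) for key, values in shuffled.items()}

# Word count, the classic example.
docs = ["the cat", "the dog"]
counts = map_reduce(
    docs,
    mapper=lambda doc: [(w, 1) for w in doc.split()],
    reducer=lambda w, ones: sum(ones),
)
# counts == {"the": 2, "cat": 1, "dog": 1}
```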
Mesh point variance in MapReduce

[Figure: data from Run 1, Run 2, and Run 3, each with snapshots at T=1, T=2, T=3.]
Mesh point variance in MapReduce

1. Each mapper outputs the mesh points with the same key.
2. Shuffle moves all values from the same mesh point to the same reducer.
3. Reducers just compute a numerical variance.

Bring the computations to the data!
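A minimal sketch of that variance computation, with made-up run data (the nested-list layout is an assumption for illustration):

```python
from collections import defaultdict
from statistics import pvariance

# Made-up data: runs[r][t][p] is the value at mesh point p,
# snapshot t, run r (three runs, two snapshots, two points).
runs = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[1.5, 2.5], [3.5, 4.5]],
    [[0.5, 1.5], [2.5, 3.5]],
]

# Map: emit ((time, point), value), so the same mesh point in
# every run gets the same key.
shuffled = defaultdict(list)
for run in runs:
    for t, snapshot in enumerate(run):
        for p, value in enumerate(snapshot):
            shuffled[(t, p)].append(value)

# Reduce: just compute a numerical variance per mesh point.
variance = {key: pvariance(vals) for key, vals in shuffled.items()}
```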
MapReduce vs. Hadoop

MapReduce: a computation model with
  Map - a local data transform
  Shuffle - a grouping function
  Reduce - an aggregation

Hadoop: an implementation of MapReduce using the HDFS parallel file system.

Others: Phoenix++, Twister, Google MapReduce, Spark, ...
That is, we store the runs.

[Figure: supercomputer -> data computing cluster -> engineer.]

Each multi-day HPC simulation generates gigabytes of data. A data cluster can hold hundreds or thousands of old simulations, enabling engineers to query and analyze months of simulation data for statistical studies and uncertainty quantification, and to build the interpolant from the pre-computed data.
Tall-and-Skinny QR
Communication-avoiding TSQR (Demmel et al. 2008)

First, do QR factorizations of each local matrix. Second, compute a QR factorization of the new "R".

Demmel et al. 2008. Communication-avoiding parallel and sequential QR.
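A small NumPy sketch of the two-stage construction (the block count and matrix sizes are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4000, 50))

# First, QR-factor each local block of rows (the map step).
blocks = np.array_split(A, 8)
local_Rs = [np.linalg.qr(blk, mode="r") for blk in blocks]

# Second, stack the small R factors and QR-factor the stack (the reduce step).
R = np.linalg.qr(np.vstack(local_Rs), mode="r")

# R agrees with the R of a direct QR of A, up to the sign of each row.
R_direct = np.linalg.qr(A, mode="r")
```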
Fully serial TSQR (Demmel et al. 2008)

Compute the QR of one block, read the next block, update the QR, and so on.
Tall-and-skinny matrix storage in MapReduce

Key: an arbitrary row id. Value: the array for a row. Each submatrix (A1, A2, A3, A4) is an input split.

Ask Austin about block row storage!
[Figure: Mapper 1 runs serial TSQR on A1, A2, A3, A4 — qr of (A1; A2) gives Q2 and R2, qr of (R2; A3) gives Q3 and R3, qr of (R3; A4) gives Q4 and R4 — and emits R4. Mapper 2 does the same on A5, A6, A7, A8 and emits R8. Reducer 1 runs serial TSQR on R4 and R8 and emits the final R.]

Algorithm
Data: rows of a matrix
Map: QR factorization of rows
Reduce: QR factorization of rows
Key Limitations
Computes only R and not Q
Can get Q via Q = AR+ with another MR iteration. (We currently use this for computing the SVD.)
Dubious numerical stability; iterative refinement helps.
Why MapReduce?
Full code in hadoopy

import random, numpy, hadoopy

class SerialTSQR:
    def __init__(self, blocksize, isreducer):
        self.bsize = blocksize
        self.data = []
        if isreducer: self.__call__ = self.reducer
        else: self.__call__ = self.mapper

    def compress(self):
        R = numpy.linalg.qr(numpy.array(self.data), 'r')
        # reset data and re-initialize to R
        self.data = []
        for row in R:
            self.data.append([float(v) for v in row])

    def collect(self, key, value):
        self.data.append(value)
        if len(self.data) > self.bsize * len(self.data[0]):
            self.compress()

    def close(self):
        self.compress()
        for row in self.data:
            key = random.randint(0, 2000000000)
            yield key, row

    def mapper(self, key, value):
        self.collect(key, value)

    def reducer(self, key, values):
        for value in values:
            self.mapper(key, value)

if __name__ == '__main__':
    mapper = SerialTSQR(blocksize=3, isreducer=False)
    reducer = SerialTSQR(blocksize=3, isreducer=True)
    hadoopy.run(mapper, reducer)
Performance results (simulated faults)

We can still run with P(fault) = ... with only ~... performance penalty. However, with P(fault) small, we still see a performance hit.

Fault injection

[Figure: time to completion (sec.) vs. 1/Prob(failure), the mean number of successes per failure (10 to 1000). Curves: no faults and faults for a 200M-by-200 matrix, and no faults and faults for an 800M-by-10 matrix.]

With 1/5 tasks failing, the job only takes twice as long.
How to get Q?
Idea 1 (unstable)

[Figure: a TSQR pass computes R; distribute R to the mappers; Mapper 1 forms Q1 = A1 R^+, Q2 = A2 R^+, Q3 = A3 R^+, Q4 = A4 R^+.]
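A NumPy sketch of why Idea 1 is unstable (the ill-conditioned test matrix is a made-up construction): forming Q = A R^+ reproduces A, but orthogonality degrades in proportion to cond(A).

```python
import numpy as np

# An ill-conditioned tall-and-skinny matrix: orthogonal factors
# around a wide spread of singular values (cond(A) ~ 1e8).
rng = np.random.default_rng(2)
U, _ = np.linalg.qr(rng.standard_normal((1000, 6)))
V, _ = np.linalg.qr(rng.standard_normal((6, 6)))
A = U @ np.diag(np.logspace(0, -8, 6)) @ V.T

# Idea 1: take R from TSQR, then form Q = A R^+ in a second pass.
R = np.linalg.qr(A, mode="r")
Q = A @ np.linalg.inv(R)

# The computed Q is far from orthogonal: norm(Q^T Q - I) is much
# larger than machine precision.
err = np.linalg.norm(Q.T @ Q - np.eye(6))
```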
Idea 2 (better)

[Figure: as in Idea 1, a TSQR pass computes R; distribute R and form Qi = Ai R^+. Then run TSQR again on the computed Q (Mapper 2): it produces a second small triangular factor T; distribute the combined triangular factor S (the product of T and R) and form the refined Qi = Ai S^+.]

There's a famous quote, attributed to Parlett, that "two iterations of iterative refinement are enough."
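A sketch of one refinement step in NumPy (same made-up ill-conditioned matrix as the Idea 1 sketch; here the second TSQR is applied directly to the computed Q):

```python
import numpy as np

rng = np.random.default_rng(3)
U, _ = np.linalg.qr(rng.standard_normal((1000, 6)))
V, _ = np.linalg.qr(rng.standard_normal((6, 6)))
A = U @ np.diag(np.logspace(0, -8, 6)) @ V.T   # cond(A) ~ 1e8

I = np.eye(6)

# Pass 1: Q1 = A R^+ loses orthogonality for ill-conditioned A.
R = np.linalg.qr(A, mode="r")
Q1 = A @ np.linalg.inv(R)

# Refinement: TSQR the computed Q1 to get a small triangular S,
# then take Q = Q1 S^+. The combined triangular factor is S @ R.
S = np.linalg.qr(Q1, mode="r")
Q = Q1 @ np.linalg.inv(S)

before = np.linalg.norm(Q1.T @ Q1 - I)
after = np.linalg.norm(Q.T @ Q - I)
```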
Idea 3 (best!)

[Figure: Mapper 1 computes local factors Q1 R1, Q2 R2, Q3 R3, Q4 R4, writing the Q output and R output to separate files. Task 2 collects R1, ..., R4, computes the final R and the pieces Q1^1, Q2^1, Q3^1, Q4^1. Mapper 3 forms the true Q from the Qi and Qi^1.]

1. Output local Q and R in separate files.
2. Collect R on one node, compute Qs for each piece.
3. Distribute the pieces of Q^1 and form the true Q.
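The three steps can be sketched in NumPy (the sizes and the four-way split are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((2000, 20))
blocks = np.array_split(A, 4)

# Step 1: each mapper outputs its local Q_i and R_i.
local = [np.linalg.qr(blk) for blk in blocks]
Qs = [q for q, _ in local]
Rs = [r for _, r in local]

# Step 2: collect the R_i on one task; QR the stack to get the final R
# and the small inner factors Q_i^1 (one 20-row block per mapper).
Q_inner, R = np.linalg.qr(np.vstack(Rs))
inner_blocks = np.array_split(Q_inner, 4)

# Step 3: distribute the pieces of Q^1 and form the true Q = Q_i @ Q_i^1.
Q = np.vstack([Qi @ Qi1 for Qi, Qi1 in zip(Qs, inner_blocks)])
```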
Achieving numerical stability

[Figure: norm(Q^T Q - I) vs. condition number (10^2 to 10^8) for three methods: AR^+, AR^+ with iterative refinement, and full TSQR.]
The price is right!

[Figure: running time in seconds (roughly 500 to 2500).]

Full TSQR is faster than refinement for few columns, and not any slower for many columns.
Improving performance

Lots of data? Add an iteration. Too many maps? Add an iteration!

[Figure: Iteration 1 — Mappers 1-1 through 1-4 run serial TSQR on A1, ..., A4 and emit R1map, ..., R4map; a shuffle sends these to Reducers 1-1 through 1-3, which run serial TSQR and emit R2,1, R2,2, R2,3. Iteration 2 — an identity map and a single Reducer 2-1 run serial TSQR on the stack S(2) to produce the final R.]
Summary of parameters (mrtsqr)

Blocksize: how many rows to read before computing a QR factorization, expressed as a multiple of the number of columns.
Splitsize: the size of each local matrix.
Reduction tree: the number of reducers and iterations to use. (See paper.)

[Figure: Iteration 1 maps the splits of A and shuffles into reducers producing S(1); iterations 2 and 3 reduce S(1) to S(2) to the final R.]
Varying splitsize and the reduction tree (synthetic data)

Cols.  Iters.  Split (MB)  Maps  Secs.
50     1       64          8000  388
50     1       256         2000  184
50     1       512         1000  149
50     2       64          8000  425
50     2       256         2000  220
50     2       512         1000  191
1000   1       512         1000  666
1000   2       64          6000  590
1000   2       256         2000  432
1000   2       512         1000  337

Increasing split size improves performance (it accounts for Hadoop data movement). Increasing iterations helps for problems with many columns. (1000 columns with a 64-MB split size overloaded the single reducer.)
Currently investigating how to do multiple iterations with Full TSQR.
MapReduce TSQR summary

MapReduce is great for TSQR!
Data: a tall-and-skinny (TS) matrix, stored by rows
Map: QR factorization of local rows
Reduce: QR factorization of local rows

Input: 500,000,000-by-100 matrix; each record a 1-by-100 row; 423.3 GB on HDFS.
Time to compute the column sums: 161 sec. Time to compute R in qr(A): 387 sec.
On a 64-node Hadoop cluster with 4x2TB disks, one Core i7-920, and 12 GB RAM per node.

Demmel et al. showed that this construction works to compute a QR factorization with minimal communication.
Do I have to write in Java?
Hadoop streaming frameworks

Framework  Iter 1 QR (secs.)  Iter 1 Total (secs.)  Iter 2 Total (secs.)  Overall Total (secs.)
Dumbo      67725              960                   217                   1177
Hadoopy    70909              612                   118                   730
C++        15809              350                   37                    387
Java       —                  436                   66                    502

Synthetic data test: 100,000,000-by-500 matrix (~500 GB). Codes implemented in MapReduce streaming. Matrix stored as TypedBytes lists of doubles. Python frameworks use NumPy+Atlas; a custom C++ TypedBytes reader/writer with Atlas; plus a new non-streaming Java implementation. All timing results are from the Hadoop job tracker.

C++ in streaming beats a native Java implementation.
What’s next? Investigate randomized algorithms for computing SVDs for fatter matrices.
From Halko, Martinsson, Tropp, SIREV 2011:

Algorithm: Randomized PCA. Given an m × n matrix A, the number k of principal components, and an exponent q, this procedure computes an approximate rank-2k factorization UΣV^T. The columns of U estimate the first 2k principal components of A.

Stage A:
1. Generate an n × 2k Gaussian test matrix Ω.
2. Form Y = (AA^T)^q AΩ by multiplying alternately with A and A^T.
3. Construct a matrix Q whose columns form an orthonormal basis for the range of Y.

Stage B:
1. Form B = Q^T A.
2. Compute an SVD of the small matrix: B = Û Σ V^T.
3. Set U = Q Û.

The singular spectrum of a data matrix often decays quite slowly. To address this difficulty, the algorithm incorporates q steps of a power iteration, where q = 1 or 2 usually suffices in practice. The procedure requires only 2(q+1) passes over the matrix, so it is efficient even for matrices stored out-of-core. The flop count is

    T_randPCA ≈ q k T_mult + k^2 (m + n),

where T_mult is the cost of a matrix-vector multiply with A or A^T.

Theorem 1.2. Suppose that A is a real m × n matrix. Select an exponent q and a target number k of principal components, where 2 ≤ k ≤ 0.5 min{m, n}. Execute the randomized PCA algorithm to obtain a rank-2k factorization UΣV^T. Then

    E ||A - UΣV^T|| ≤ [ 1 + 4 sqrt( 2 min{m, n} / (k - 1) ) ]^{1/(2q+1)} σ_{k+1},

where E denotes expectation with respect to the random test matrix and σ_{k+1} is the (k+1)th singular value of A. The power iteration drives the leading constant to one exponentially fast as q increases. Since a rank-k approximation can never achieve error smaller than σ_{k+1}, the randomized procedure estimates 2k principal components that carry essentially as much variance as the first k actual principal components.

Halko, Martinsson, Tropp. SIREV 2011.
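A compact NumPy sketch of the two-stage procedure (the matrix sizes and the synthetic spectrum are made up for illustration):

```python
import numpy as np

def randomized_pca(A, k, q=2, rng=None):
    # Stage A: Gaussian test matrix plus q steps of power iteration.
    rng = rng or np.random.default_rng()
    m, n = A.shape
    Omega = rng.standard_normal((n, 2 * k))      # n-by-2k test matrix
    Y = A @ Omega
    for _ in range(q):                           # Y = (A A^T)^q A Omega
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)                       # orthonormal basis for range(Y)
    # Stage B: SVD of the small projected matrix B = Q^T A.
    B = Q.T @ A
    U_hat, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ U_hat, s, Vt

# A test matrix with known, rapidly decaying singular values.
rng = np.random.default_rng(5)
Uo, _ = np.linalg.qr(rng.standard_normal((500, 80)))
Vo, _ = np.linalg.qr(rng.standard_normal((80, 80)))
sigma = np.logspace(0, -6, 80)
A = Uo @ np.diag(sigma) @ Vo.T

# The leading singular values are recovered accurately.
U, s, Vt = randomized_pca(A, k=10, q=2, rng=rng)
```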
Questions?
Most recent code at http://github.com/arbenson/mrtsqr
Varying blocksize

[Figure: performance as blocksize varies, using the C++ framework.]