recursion unrolling for divide and conquer programs
Recursion Unrolling for Divide and Conquer Programs
Radu Rugina and Martin Rinard
Laboratory for Computer Science
Massachusetts Institute of Technology
What This Talk Is About
•Automatic generation of efficient large base cases for divide and conquer programs
Outline
1. Motivating Example
2. Computation Structure
3. Transformations
4. Related Work
5. Conclusion
1. Motivating Example
Divide and Conquer Matrix Multiply
• Divide matrices into sub-matrices: A0, A1, A2, etc.
• Use blocked matrix multiply equations
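A B = R:
\[
\begin{pmatrix} A_0 & A_1 \\ A_2 & A_3 \end{pmatrix}
\begin{pmatrix} B_0 & B_1 \\ B_2 & B_3 \end{pmatrix}
=
\begin{pmatrix}
A_0 B_0 + A_1 B_2 & A_0 B_1 + A_1 B_3 \\
A_2 B_0 + A_3 B_2 & A_2 B_1 + A_3 B_3
\end{pmatrix}
\]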
Divide and Conquer Matrix Multiply
• Recursively multiply sub-matrices
Divide and Conquer Matrix Multiply
• Terminate recursion with a simple base case
A B = R for 1 x 1 matrices:
\[
(a_0)\,(b_0) = (a_0 b_0)
\]
Divide and Conquer Matrix Multiply
void matmul(int *A, int *B, int *R, int n) {
  if (n == 1) {
    (*R) += (*A) * (*B);
  } else {
    matmul(A, B, R, n/4);
    matmul(A, B+(n/4), R+(n/4), n/4);
    matmul(A+2*(n/4), B, R+2*(n/4), n/4);
    matmul(A+2*(n/4), B+(n/4), R+3*(n/4), n/4);
    matmul(A+(n/4), B+2*(n/4), R, n/4);
    matmul(A+(n/4), B+3*(n/4), R+(n/4), n/4);
    matmul(A+3*(n/4), B+2*(n/4), R+2*(n/4), n/4);
    matmul(A+3*(n/4), B+3*(n/4), R+3*(n/4), n/4);
  }
}
Implements R += A B
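A minimal sketch of how this procedure can be exercised. The slides do not spell out the storage layout, so this assumes each matrix is stored quadrant by quadrant (A0, A1, A2, A3 consecutively, each quadrant stored the same way) and that n is the total number of elements, a power of 4. The helper matmul_2x2_demo is a hypothetical name used only for illustration.

```c
#include <assert.h>

/* R += A * B for matrices stored quadrant by quadrant (assumed layout).
 * n is the total number of elements, assumed to be a power of 4. */
void matmul(int *A, int *B, int *R, int n) {
    if (n == 1) {
        (*R) += (*A) * (*B);
    } else {
        int q = n / 4;  /* number of elements in one quadrant */
        matmul(A,       B,       R,       q);  /* R0 += A0 * B0 */
        matmul(A,       B + q,   R + q,   q);  /* R1 += A0 * B1 */
        matmul(A + 2*q, B,       R + 2*q, q);  /* R2 += A2 * B0 */
        matmul(A + 2*q, B + q,   R + 3*q, q);  /* R3 += A2 * B1 */
        matmul(A + q,   B + 2*q, R,       q);  /* R0 += A1 * B2 */
        matmul(A + q,   B + 3*q, R + q,   q);  /* R1 += A1 * B3 */
        matmul(A + 3*q, B + 2*q, R + 2*q, q);  /* R2 += A3 * B2 */
        matmul(A + 3*q, B + 3*q, R + 3*q, q);  /* R3 += A3 * B3 */
    }
}

/* Hypothetical helper: multiplies two fixed 2 x 2 matrices and
 * returns element idx (0..3) of the result. */
int matmul_2x2_demo(int idx) {
    int A[4] = {1, 2, 3, 4};
    int B[4] = {5, 6, 7, 8};
    int R[4] = {0, 0, 0, 0};
    assert(idx >= 0 && idx < 4);
    matmul(A, B, R, 4);
    return R[idx];
}
```

For a 2 x 2 matrix the quadrant order coincides with row-major order, so the result can be checked by hand against the blocked equations.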
Divide and Conquer Matrix Multiply
Divide matrices into sub-matrices and recursively multiply sub-matrices
Divide and Conquer Matrix Multiply
Identify sub-matrices with pointers
Divide and Conquer Matrix Multiply
Use a simple algorithm for the base case
Divide and Conquer Matrix Multiply
• Advantage of small base case: simplicity
• Code is easy to:
  • Write
  • Maintain
  • Debug
  • Understand
Divide and Conquer Matrix Multiply
• Disadvantage: inefficiency
• Large control flow overhead:
  • Most of the time is spent dividing the matrix into sub-matrices
Hand Coded Implementation
void serialmul(block *As, block *Bs, block *Rs)
{
  int i, j;
  DOUBLE *A = (DOUBLE *) As;
  DOUBLE *B = (DOUBLE *) Bs;
  DOUBLE *R = (DOUBLE *) Rs;
  for (j = 0; j < 16; j += 2) {
    DOUBLE *bp = &B[j];
    for (i = 0; i < 16; i += 2) {
      DOUBLE *ap = &A[i * 16];
      DOUBLE *rp = &R[j + i * 16];
      register DOUBLE s0_0 = rp[0], s0_1 = rp[1];
      register DOUBLE s1_0 = rp[16], s1_1 = rp[17];
      s0_0 += ap[0] * bp[0];     s0_1 += ap[0] * bp[1];
      s1_0 += ap[16] * bp[0];    s1_1 += ap[16] * bp[1];
      s0_0 += ap[1] * bp[16];    s0_1 += ap[1] * bp[17];
      s1_0 += ap[17] * bp[16];   s1_1 += ap[17] * bp[17];
      s0_0 += ap[2] * bp[32];    s0_1 += ap[2] * bp[33];
      s1_0 += ap[18] * bp[32];   s1_1 += ap[18] * bp[33];
      s0_0 += ap[3] * bp[48];    s0_1 += ap[3] * bp[49];
      s1_0 += ap[19] * bp[48];   s1_1 += ap[19] * bp[49];
      s0_0 += ap[4] * bp[64];    s0_1 += ap[4] * bp[65];
      s1_0 += ap[20] * bp[64];   s1_1 += ap[20] * bp[65];
      s0_0 += ap[5] * bp[80];    s0_1 += ap[5] * bp[81];
      s1_0 += ap[21] * bp[80];   s1_1 += ap[21] * bp[81];
      s0_0 += ap[6] * bp[96];    s0_1 += ap[6] * bp[97];
      s1_0 += ap[22] * bp[96];   s1_1 += ap[22] * bp[97];
      s0_0 += ap[7] * bp[112];   s0_1 += ap[7] * bp[113];
      s1_0 += ap[23] * bp[112];  s1_1 += ap[23] * bp[113];
      s0_0 += ap[8] * bp[128];   s0_1 += ap[8] * bp[129];
      s1_0 += ap[24] * bp[128];  s1_1 += ap[24] * bp[129];
      s0_0 += ap[9] * bp[144];   s0_1 += ap[9] * bp[145];
      s1_0 += ap[25] * bp[144];  s1_1 += ap[25] * bp[145];
      s0_0 += ap[10] * bp[160];  s0_1 += ap[10] * bp[161];
      s1_0 += ap[26] * bp[160];  s1_1 += ap[26] * bp[161];
      s0_0 += ap[11] * bp[176];  s0_1 += ap[11] * bp[177];
      s1_0 += ap[27] * bp[176];  s1_1 += ap[27] * bp[177];
      s0_0 += ap[12] * bp[192];  s0_1 += ap[12] * bp[193];
      s1_0 += ap[28] * bp[192];  s1_1 += ap[28] * bp[193];
      s0_0 += ap[13] * bp[208];  s0_1 += ap[13] * bp[209];
      s1_0 += ap[29] * bp[208];  s1_1 += ap[29] * bp[209];
      s0_0 += ap[14] * bp[224];  s0_1 += ap[14] * bp[225];
      s1_0 += ap[30] * bp[224];  s1_1 += ap[30] * bp[225];
      s0_0 += ap[15] * bp[240];  s0_1 += ap[15] * bp[241];
      s1_0 += ap[31] * bp[240];  s1_1 += ap[31] * bp[241];
      rp[0] = s0_0;
      rp[1] = s0_1;
      rp[16] = s1_0;
      rp[17] = s1_1;
    }
  }
}
cilk void matrixmul(long nb, block *A, block *B, block *R)
{
  if (nb == 1) {
    serialmul(A, B, R);
  } else if (nb >= 4) {
    spawn matrixmul(nb/4, A, B, R);                              /* R0 += A0 B0 */
    spawn matrixmul(nb/4, A, B+(nb/4), R+(nb/4));                /* R1 += A0 B1 */
    spawn matrixmul(nb/4, A+2*(nb/4), B, R+2*(nb/4));            /* R2 += A2 B0 */
    spawn matrixmul(nb/4, A+2*(nb/4), B+(nb/4), R+3*(nb/4));     /* R3 += A2 B1 */
    sync;
    spawn matrixmul(nb/4, A+(nb/4), B+2*(nb/4), R);              /* R0 += A1 B2 */
    spawn matrixmul(nb/4, A+(nb/4), B+3*(nb/4), R+(nb/4));       /* R1 += A1 B3 */
    spawn matrixmul(nb/4, A+3*(nb/4), B+2*(nb/4), R+2*(nb/4));   /* R2 += A3 B2 */
    spawn matrixmul(nb/4, A+3*(nb/4), B+3*(nb/4), R+3*(nb/4));   /* R3 += A3 B3 */
    sync;
  }
}
Goal
• The programmer writes simple code with small base cases
• The compiler automatically generates efficient code with large base cases
2. Computation Structure
Running Example – Array Increment
void f(char *p, int n) {
  if (n == 1) {
    /* base case: increment one element */
    (*p) += 1;
  } else {
    f(p, n/2);      /* increment first half */
    f(p+n/2, n/2);  /* increment second half */
  }
}
Dynamic Call Tree for n=4
Execution of f(p,4). Each node of the tree is an activation frame on the stack and lists the instructions executed in that frame:
• n=4 (root): Test n=1, Call f, Call f
• n=2 (two nodes): Test n=1, Call f, Call f
• n=1 (four leaves): Test n=1, Inc *p
Control Flow Overhead
Execution of f(p,4):
• Call overhead: the Call f instructions in the n=4 and n=2 frames
• Test overhead: the Test n=1 instruction in every frame
• Computation: only the Inc *p instructions in the four n=1 leaves
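The split between overhead and computation can be made concrete by instrumenting the running example. This is a minimal sketch: the counters calls, tests, and incs and the helper count_f are hypothetical names, and calls counts activation frames, so it includes the initial call to f.

```c
#include <assert.h>

/* Hypothetical counters for the three kinds of work in the call tree. */
int calls, tests, incs;

/* The running example, instrumented. */
void f_counted(char *p, int n) {
    calls++;                 /* one activation frame per call */
    tests++;                 /* the n == 1 test runs in every frame */
    if (n == 1) {
        (*p) += 1;           /* the only useful computation */
        incs++;
    } else {
        f_counted(p, n / 2);
        f_counted(p + n / 2, n / 2);
    }
}

/* Hypothetical helper: resets the counters, then runs f_counted. */
void count_f(char *p, int n) {
    calls = tests = incs = 0;
    f_counted(p, n);
}
```

For n = 4 the tree has 1 + 2 + 4 = 7 frames, so this counts 7 calls and 7 tests for only 4 increments.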
Large Base Cases = Reduced Overhead
Execution of f(p,4) with a base case for n=2:
• n=4 (root): Test n=2, Call f, Call f
• n=2 (two leaves): Test n=2, Inc *p, Inc *(p+1)
3. Transformations
Transformation 1: Recursion Inlining
void f(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    f(p, n/2);
    f(p+n/2, n/2);
  }
}
Start with the original recursive procedure
Transformation 1: Recursion Inlining
void f1(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    f1(p, n/2);
    f1(p+n/2, n/2);
  }
}
void f2(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    f2(p, n/2);
    f2(p+n/2, n/2);
  }
}
Make two copies of the original procedure
Transformation 1: Recursion Inlining
void f1(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    f2(p, n/2);
    f2(p+n/2, n/2);
  }
}
void f2(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    f1(p, n/2);
    f1(p+n/2, n/2);
  }
}
Transform direct recursion to mutual recursion
Transformation 1: Recursion Inlining
Inline procedure f2 at call sites in f1
Transformation 1: Recursion Inlining
void f1(char *p, int n) {
  if (n == 1) {
    (*p) += 1;
  } else {
    if (n/2 == 1) {
      *p += 1;
    } else {
      f1(p, n/2/2);
      f1(p+n/2/2, n/2/2);
    }
    if (n/2 == 1) {
      *(p+n/2) += 1;
    } else {
      f1(p+n/2, n/2/2);
      f1(p+n/2+n/2/2, n/2/2);
    }
  }
}
Transformation 1: Recursion Inlining
• Reduced procedure call overhead
• More code exposed at the intra-procedural level
• Opportunities to simplify control flow in the inlined code:
• identical condition expressions
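A small sketch that checks the inlined procedure against the original on power-of-two sizes. The names f1_inlined and same_result_inlined are hypothetical, and the buffers are limited to 64 elements for the purpose of the check.

```c
#include <assert.h>
#include <string.h>

/* Original recursive procedure. */
void f(char *p, int n) {
    if (n == 1) {
        (*p) += 1;
    } else {
        f(p, n / 2);
        f(p + n / 2, n / 2);
    }
}

/* Result of inlining one level of recursion; note the two
 * identical n/2 == 1 conditions left behind by inlining. */
void f1_inlined(char *p, int n) {
    if (n == 1) {
        (*p) += 1;
    } else {
        if (n / 2 == 1) {
            *p += 1;
        } else {
            f1_inlined(p, n / 2 / 2);
            f1_inlined(p + n / 2 / 2, n / 2 / 2);
        }
        if (n / 2 == 1) {
            *(p + n / 2) += 1;
        } else {
            f1_inlined(p + n / 2, n / 2 / 2);
            f1_inlined(p + n / 2 + n / 2 / 2, n / 2 / 2);
        }
    }
}

/* Hypothetical helper: returns 1 if both procedures produce the same
 * array for a problem of size n (1 <= n <= 64). */
int same_result_inlined(int n) {
    char a[64], b[64];
    assert(n >= 1 && n <= 64);
    memset(a, 0, sizeof a);
    memset(b, 0, sizeof b);
    f(a, n);
    f1_inlined(b, n);
    return memcmp(a, b, sizeof a) == 0;
}
```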
Transformation 2: Conditional Fusion
void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else {
    f1(p, n/2/2);
    f1(p+n/2/2, n/2/2);
    f1(p+n/2, n/2/2);
    f1(p+n/2+n/2/2, n/2/2);
  }
}
Merge if statements with identical conditions
• Reduced branching overhead and bigger basic blocks
• Larger base case for n/2 = 1
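To illustrate the reduced branching, the sketch below counts how many conditions each variant evaluates. The counter tests_run, the wrapper counted, and the two count_tests_* helpers are hypothetical names introduced for this illustration only.

```c
#include <assert.h>
#include <string.h>

int tests_run;  /* hypothetical counter: number of conditions evaluated */

/* Evaluates a condition and counts the evaluation. */
static int counted(int cond) { tests_run++; return cond; }

/* After inlining only: n/2 == 1 is tested twice per frame. */
void f1_inlined(char *p, int n) {
    if (counted(n == 1)) {
        *p += 1;
    } else {
        if (counted(n / 2 == 1)) {
            *p += 1;
        } else {
            f1_inlined(p, n / 2 / 2);
            f1_inlined(p + n / 2 / 2, n / 2 / 2);
        }
        if (counted(n / 2 == 1)) {
            *(p + n / 2) += 1;
        } else {
            f1_inlined(p + n / 2, n / 2 / 2);
            f1_inlined(p + n / 2 + n / 2 / 2, n / 2 / 2);
        }
    }
}

/* After conditional fusion: the two identical tests are merged. */
void f1_fused(char *p, int n) {
    if (counted(n == 1)) {
        *p += 1;
    } else if (counted(n / 2 == 1)) {
        *p += 1;
        *(p + n / 2) += 1;
    } else {
        f1_fused(p, n / 2 / 2);
        f1_fused(p + n / 2 / 2, n / 2 / 2);
        f1_fused(p + n / 2, n / 2 / 2);
        f1_fused(p + n / 2 + n / 2 / 2, n / 2 / 2);
    }
}

/* Hypothetical helpers: run one variant on a zeroed buffer
 * (1 <= n <= 64) and return the number of conditions evaluated. */
int count_tests_inlined(int n) {
    char buf[64];
    assert(n >= 1 && n <= 64);
    memset(buf, 0, sizeof buf);
    tests_run = 0;
    f1_inlined(buf, n);
    return tests_run;
}

int count_tests_fused(int n) {
    char buf[64];
    assert(n >= 1 && n <= 64);
    memset(buf, 0, sizeof buf);
    tests_run = 0;
    f1_fused(buf, n);
    return tests_run;
}
```

For n = 2 the inlined version evaluates 3 conditions and the fused version 2; for n = 8 the counts are 15 and 10.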
Unrolling Iterations
Repeatedly apply inlining and conditional fusion
Second Unrolling Iteration
void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else {
    f1(p, n/2/2);
    f1(p+n/2/2, n/2/2);
    f1(p+n/2, n/2/2);
    f1(p+n/2+n/2/2, n/2/2);
  }
}
void f2(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else {
    f2(p, n/2);
    f2(p+n/2, n/2);
  }
}
Second Unrolling Iteration
void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else {
    f2(p, n/2/2);
    f2(p+n/2/2, n/2/2);
    f2(p+n/2, n/2/2);
    f2(p+n/2+n/2/2, n/2/2);
  }
}
void f2(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else {
    f1(p, n/2);
    f1(p+n/2, n/2);
  }
}
Result of Second Unrolling Iteration
void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else if (n/2/2 == 1) {
    *p += 1;
    *(p+n/2/2) += 1;
    *(p+n/2) += 1;
    *(p+n/2+n/2/2) += 1;
  } else {
    f1(p, n/2/2/2);
    f1(p+n/2/2/2, n/2/2/2);
    f1(p+n/2/2, n/2/2/2);
    f1(p+n/2/2+n/2/2/2, n/2/2/2);
    f1(p+n/2, n/2/2/2);
    f1(p+n/2+n/2/2/2, n/2/2/2);
    f1(p+n/2+n/2/2, n/2/2/2);
    f1(p+n/2+n/2/2+n/2/2/2, n/2/2/2);
  }
}
Unrolling Iterations
• The unrolling process stops when the number of iterations reaches the desired unrolling factor
• The unrolled recursive procedure:
  • Has base cases for larger problem sizes
  • Divides the given problem into more sub-problems of smaller sizes
• In our example:
  • Base cases for n=1, n=2, and n=4
  • Problems are divided into 8 problems of 1/8 size
Speedup for Matrix Multiply
Matrix of 512 x 512 elements
[Chart: speedup (axis 0 to 10) versus unrolling factor (1, 2); series: inline, inline+fusion]
Speedup for Matrix Multiply
Matrix of 1024 x 1024 elements
[Chart: speedup (axis 0 to 10) versus unrolling factor (1, 2); series: inline, inline+fusion]
Efficiency of Unrolled Recursive Part
• Because the recursive part is also unrolled, recursion may not exercise the large base cases
• Which base case is executed depends on the size of the input problem
• In our example:
  • For a problem of size n=8, the base case for n=1 is executed
  • For a problem of size n=16, the base case for n=2 is executed
  • The efficient base case for n=4 is not executed in these cases
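The effect can be observed by recording which base case the unrolled procedure reaches. This is a sketch under the assumption that n is a power of two no larger than 64; the array hits and the helper run_unrolled are hypothetical names.

```c
#include <assert.h>
#include <string.h>

/* hits[0]: base case n == 1, hits[1]: n/2 == 1, hits[2]: n/2/2 == 1 */
int hits[3];

/* Unrolled procedure after two iterations, instrumented. */
void f1_unrolled(char *p, int n) {
    if (n == 1) {
        hits[0]++;
        *p += 1;
    } else if (n / 2 == 1) {
        hits[1]++;
        *p += 1;
        *(p + n / 2) += 1;
    } else if (n / 2 / 2 == 1) {
        hits[2]++;
        *p += 1;
        *(p + n / 2 / 2) += 1;
        *(p + n / 2) += 1;
        *(p + n / 2 + n / 2 / 2) += 1;
    } else {
        int s = n / 2 / 2 / 2;  /* size of each of the 8 sub-problems */
        f1_unrolled(p, s);
        f1_unrolled(p + s, s);
        f1_unrolled(p + n / 2 / 2, s);
        f1_unrolled(p + n / 2 / 2 + s, s);
        f1_unrolled(p + n / 2, s);
        f1_unrolled(p + n / 2 + s, s);
        f1_unrolled(p + n / 2 + n / 2 / 2, s);
        f1_unrolled(p + n / 2 + n / 2 / 2 + s, s);
    }
}

/* Hypothetical helper: runs f1_unrolled on a zeroed buffer and returns 1
 * if each of the first n elements was incremented exactly once. */
int run_unrolled(int n) {
    char buf[64];
    int i;
    assert(n >= 1 && n <= 64);
    memset(buf, 0, sizeof buf);
    hits[0] = hits[1] = hits[2] = 0;
    f1_unrolled(buf, n);
    for (i = 0; i < n; i++) {
        if (buf[i] != 1) return 0;
    }
    return 1;
}
```

With n = 8 all 8 sub-problems land in the n == 1 case, and with n = 16 all 8 land in the n/2 == 1 case; only a size such as n = 32 reaches the largest base case.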
Solution: Recursion Re-Rolling
• Roll back the recursive part of the unrolled procedure after the large base cases are generated
• Re-Rolling ensures that larger base cases are always executed, independent of the input problem size
• The compiler unrolls the recursive part only temporarily, to generate the base cases
Transformation 3: Recursion Re-Rolling
Identify the recursive part
void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else if (n/2/2 == 1) {
    *p += 1;
    *(p+n/2/2) += 1;
    *(p+n/2) += 1;
    *(p+n/2+n/2/2) += 1;
  } else {
    /* recursive part */
    f1(p, n/2/2/2);
    f1(p+n/2/2/2, n/2/2/2);
    f1(p+n/2/2, n/2/2/2);
    f1(p+n/2/2+n/2/2/2, n/2/2/2);
    f1(p+n/2, n/2/2/2);
    f1(p+n/2+n/2/2/2, n/2/2/2);
    f1(p+n/2+n/2/2, n/2/2/2);
    f1(p+n/2+n/2/2+n/2/2/2, n/2/2/2);
  }
}
Transformation 3: Recursion Re-Rolling
Replace with the recursive part of the original procedure
Final Result
void f1(char *p, int n) {
  if (n == 1) {
    *p += 1;
  } else if (n/2 == 1) {
    *p += 1;
    *(p+n/2) += 1;
  } else if (n/2/2 == 1) {
    *p += 1;
    *(p+n/2/2) += 1;
    *(p+n/2) += 1;
    *(p+n/2+n/2/2) += 1;
  } else {
    f1(p, n/2);
    f1(p+n/2, n/2);
  }
}
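A sketch that records which base case the re-rolled procedure reaches, assuming n is a power of two no larger than 64. The array hits and the helper run_rerolled are hypothetical names.

```c
#include <assert.h>
#include <string.h>

/* hits[0]: base case n == 1, hits[1]: n/2 == 1, hits[2]: n/2/2 == 1 */
int hits[3];

/* Re-rolled procedure: large base cases, original recursive part. */
void f1_rerolled(char *p, int n) {
    if (n == 1) {
        hits[0]++;
        *p += 1;
    } else if (n / 2 == 1) {
        hits[1]++;
        *p += 1;
        *(p + n / 2) += 1;
    } else if (n / 2 / 2 == 1) {
        hits[2]++;
        *p += 1;
        *(p + n / 2 / 2) += 1;
        *(p + n / 2) += 1;
        *(p + n / 2 + n / 2 / 2) += 1;
    } else {
        f1_rerolled(p, n / 2);
        f1_rerolled(p + n / 2, n / 2);
    }
}

/* Hypothetical helper: runs f1_rerolled on a zeroed buffer and returns 1
 * if each of the first n elements was incremented exactly once. */
int run_rerolled(int n) {
    char buf[64];
    int i;
    assert(n >= 1 && n <= 64);
    memset(buf, 0, sizeof buf);
    hits[0] = hits[1] = hits[2] = 0;
    f1_rerolled(buf, n);
    for (i = 0; i < n; i++) {
        if (buf[i] != 1) return 0;
    }
    return 1;
}
```

Because the recursion now halves the problem one step at a time, every power-of-two size of at least 4 bottoms out in the largest base case: n = 8 reaches it twice and n = 16 four times.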
Speedup for Matrix Multiply
Matrix of 512 x 512 elements
[Chart: speedup (axis 0 to 10) versus unrolling factor (1, 2, 3); series: inline, inline+fusion, inline+fusion+reroll]
Speedup for Matrix Multiply
Matrix of 1024 x 1024 elements
[Chart: speedup (axis 0 to 10) versus unrolling factor (1, 2, 3); series: inline, inline+fusion, inline+fusion+reroll]
Other Optimizations
• Inlining moves code from the inter-procedural level to the intra-procedural level
• Conditional fusion brings code from the inter-basic-block level to the intra-basic-block level
• Together, inlining and conditional fusion give subsequent compiler passes the opportunity to perform more aggressive optimizations
Comparison to Hand Coded Programs
• Two applications: Matrix multiply, LU decomposition
• Three machines: Pentium III, Origin 2000, PowerPC
• Two different problem sizes
• Compare automatically unrolled programs to optimized, hand coded versions from the Cilk benchmarks
• Best automatically unrolled version performs:
  • Between 2.2 and 2.9 times worse for matrix multiply
  • As good as hand coded version for LU
Related Work
• Procedure Inlining:
  • Scheifler (1977)
  • Richardson, Ganapathi (1989)
  • Chambers, Ungar (1989)
  • Cooper, Hall, Torczon (1991)
  • Appel (1992)
  • Chang, Mahlke, Chen, Hwu (1992)
Conclusion
• Recursion Unrolling
  • analogous to the loop unrolling transformation
• Divide and Conquer Programs
  • The programmer writes simple base cases
  • The compiler automatically generates large base cases
• Key Techniques
  • Inlining: conceptually inline recursive calls
  • Conditional Fusion: simplify intra-procedural control flow
  • Re-Rolling: ensure that large base cases are executed
Comparison to Hand Coded Programs
• Matrix multiply for 512 x 512 elements:
  • Best automatically unrolled program: 2.55 sec.
  • Hand coded with three nested loops: 3.46 sec.
  • Hand coded Cilk program: 1.16 sec.
• Matrix multiply for 1024 x 1024 elements:
  • Best automatically unrolled program: 20.47 sec.
  • Hand coded with three nested loops: 27.40 sec.
  • Hand coded Cilk program: 9.19 sec.
Correctness
• Recursion unrolling preserves the semantics of the program:
• The unrolled program terminates if and only if the original recursive program terminates
• When both the original and the unrolled program terminate, they yield the same result
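The second property can be spot-checked on the array increment example. This is only a test over sizes 1 to 64, including sizes that are not powers of two, and not a proof; the names f1_rerolled and same_result_up_to are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Original recursive procedure. */
void f(char *p, int n) {
    if (n == 1) {
        (*p) += 1;
    } else {
        f(p, n / 2);
        f(p + n / 2, n / 2);
    }
}

/* Unrolled and re-rolled procedure with base cases up to n/2/2 == 1. */
void f1_rerolled(char *p, int n) {
    if (n == 1) {
        *p += 1;
    } else if (n / 2 == 1) {
        *p += 1;
        *(p + n / 2) += 1;
    } else if (n / 2 / 2 == 1) {
        *p += 1;
        *(p + n / 2 / 2) += 1;
        *(p + n / 2) += 1;
        *(p + n / 2 + n / 2 / 2) += 1;
    } else {
        f1_rerolled(p, n / 2);
        f1_rerolled(p + n / 2, n / 2);
    }
}

/* Hypothetical helper: returns 1 if both procedures leave identical
 * arrays for every size n in 1..max_n (max_n is capped at 64). */
int same_result_up_to(int max_n) {
    char a[64], b[64];
    int n;
    if (max_n > 64) max_n = 64;
    for (n = 1; n <= max_n; n++) {
        memset(a, 0, sizeof a);
        memset(b, 0, sizeof b);
        f(a, n);
        f1_rerolled(b, n);
        if (memcmp(a, b, sizeof a) != 0) return 0;
    }
    return 1;
}
```

Sizes below 1 are excluded because neither procedure terminates for n = 0, which is consistent with the first property.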
Speedup charts
Each chart plots speedup (axis 0 to 10) versus unrolling factor (1, 2, 3) for three series: inline, inline+fusion, inline+fusion+reroll.
• Speedup for Matrix Multiply, Pentium III, matrix of 512 x 512 elements
• Speedup for Matrix Multiply, Pentium III, matrix of 1024 x 1024 elements
• Speedup for Matrix Multiply, Power PC, matrix of 512 x 512 elements
• Speedup for Matrix Multiply, Power PC, matrix of 1024 x 1024 elements
• Speedup for Matrix Multiply, Origin 2000, matrix of 512 x 512 elements
• Speedup for Matrix Multiply, Origin 2000, matrix of 1024 x 1024 elements
• Speedup for LU, Pentium III, matrix of 512 x 512 elements
• Speedup for LU, Pentium III, matrix of 1024 x 1024 elements
• Speedup for LU, Power PC, matrix of 512 x 512 elements
• Speedup for LU, Power PC, matrix of 1024 x 1024 elements
• Speedup for LU, Origin 2000, matrix of 1024 x 1024 elements
• Speedup for LU, Origin 2000, matrix of 512 x 512 elements