OPTIMIZING LU FACTORIZATION IN CILK++
Nathan Beckmann
Silas Boyd-Wickizer
THE PROBLEM
LU is a common matrix operation with a broad range of applications
Writes a matrix as a product of L and U
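Example: PA = LU

$$
\begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix}
=
\begin{pmatrix} l_{11} & 0 & 0 \\ l_{21} & l_{22} & 0 \\ l_{31} & l_{32} & l_{33} \end{pmatrix}
\begin{pmatrix} u_{11} & u_{12} & u_{13} \\ 0 & u_{22} & u_{23} \\ 0 & 0 & u_{33} \end{pmatrix}
$$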
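To make the PA = LU identity concrete, here is a minimal sequential sketch in plain Python. It is not the code from the talk: the function names are illustrative, and it assumes the common convention of a unit diagonal on L (the slide writes general entries l11, l22, l33). The example matrix is chosen so that the permutation swaps the first two rows, as in the P shown above.

```python
def matmul(x, y):
    """Dense matrix product of two lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def lu_partial_pivot(a):
    """Return (perm, L, U) such that row i of L*U equals a[perm[i]]."""
    n = len(a)
    u = [[float(v) for v in row] for row in a]  # working copy, becomes U
    l = [[0.0] * n for _ in range(n)]
    perm = list(range(n))
    for k in range(n):
        # Partial pivoting: largest-magnitude entry in column k, rows k..n-1.
        p = max(range(k, n), key=lambda i: abs(u[i][k]))
        if u[p][k] == 0.0:
            raise ValueError("matrix is singular")
        u[k], u[p] = u[p], u[k]
        l[k], l[p] = l[p], l[k]
        perm[k], perm[p] = perm[p], perm[k]
        l[k][k] = 1.0  # assumption: unit-diagonal L
        for i in range(k + 1, n):
            l[i][k] = u[i][k] / u[k][k]
            for j in range(k, n):
                u[i][j] -= l[i][k] * u[k][j]  # update of the trailing rows
    return perm, l, u


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [2.0, 3.0, 1.0]]
perm, l, u = lu_partial_pivot(a)
pa = [a[i] for i in perm]   # rows of A reordered by the permutation
lu = matmul(l, u)           # should match pa up to rounding
```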
THE PROBLEM
[Figure labels: small parallelism, small parallelism, big parallelism]
OUTLINE
Overview
Results
Conclusion
OVERVIEW
Four implementations of LU
  PLASMA (highly optimized third-party library)
  Sivan Toledo's algorithm in Cilk++ (courtesy of Bradley)
  Parallel standard "right-looking" in Cilk++
  Right-looking in pthreads
All implementations use the same base case
  GotoBLAS2 matrix routines
Analyze performance
  Machine architecture
  Cache behavior
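As a rough illustration of how a recursive LU differs from the right-looking loop, the sketch below assumes that "Toledo's algorithm" refers to the recursive formulation: factor the left half of the columns, update the right half, then recurse on the trailing block. This is sequential plain Python, not the Cilk++ code from the talk. The names `recursive_lu` and `lu_product` are hypothetical, and for simplicity it swaps whole rows immediately rather than applying pivots panel by panel.

```python
def recursive_lu(a, lo=0, hi=None, perm=None):
    """Factor columns [lo, hi) of `a` in place, using rows lo..m-1.

    Afterwards the strict lower triangle holds L (unit diagonal implied),
    the upper triangle holds U, and row i corresponds to original row perm[i].
    Assumes at least as many rows as columns.
    """
    m = len(a)
    if hi is None:
        hi = len(a[0])
    if perm is None:
        perm = list(range(m))
    if hi - lo == 1:
        # Base case: a single column. Pivot, swap whole rows, scale.
        p = max(range(lo, m), key=lambda i: abs(a[i][lo]))
        if a[p][lo] == 0.0:
            raise ValueError("matrix is singular")
        a[lo], a[p] = a[p], a[lo]
        perm[lo], perm[p] = perm[p], perm[lo]
        for i in range(lo + 1, m):
            a[i][lo] /= a[lo][lo]
        return perm
    mid = (lo + hi) // 2
    recursive_lu(a, lo, mid, perm)        # factor the left half
    # Update the right half: rows lo..mid-1 are a triangular solve,
    # rows mid..m-1 are the Schur-complement update.
    for i in range(lo, m):
        for k in range(lo, min(i, mid)):
            lik = a[i][k]
            for j in range(mid, hi):
                a[i][j] -= lik * a[k][j]
    recursive_lu(a, mid, hi, perm)        # factor the trailing block
    return perm


def lu_product(a):
    """Multiply the packed L and U factors back together (square case)."""
    n = len(a)
    return [[sum((a[i][k] if k < i else float(k == i)) *
                 (a[k][j] if k <= j else 0.0) for k in range(n))
             for j in range(n)] for i in range(n)]


original = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [2.0, 3.0, 1.0]]
packed = [row[:] for row in original]
perm = recursive_lu(packed)
rebuilt = lu_product(packed)  # row i should match original[perm[i]]
```

In a parallel version, the update of the right half is where most of the independent work would sit; the sketch above makes no attempt to express that.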
OUTLINE
Overview
Results
  Summary
  Architectural heterogeneity
  Cache effects
  Parallelism
  Scheduling
  Code size
Conclusion
METHODOLOGY
Machine configurations:
  AMD16: Quad-quad AMD Opteron 8350 @ 2.0 GHz
  Intel16: Quad-quad Intel Xeon E7340 @ 2.0 GHz
  Intel8: Dual-quad Intel Xeon E5530 @ 2.4 GHz
"Xen" indicates running a Xen-enabled kernel
  All tests still ran in dom0 (outside a virtual machine)
PERFORMANCE SUMMARY
Performance varies significantly across machine architectures
Large impact from caches
LU performance (GFLOPS on 4k x 4k, 8 cores)

          AMD16   Intel16   Intel16Xen   Intel8Xen
PLASMA    28.7    21.5      20.6         31.1
Toledo    17.2    19.6      17.4         32.5
Right     7.72    8.53      7.38         23.2
Pthread   12.5    11.2      10.8         22.1
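To relate these GFLOPS figures to wall-clock time, the sketch below converts them using the standard approximate operation count for LU, (2/3)n^3. Two assumptions are not stated on the slide: that "4k" means n = 4096, and that the reported rates were computed from this flop count. The rates are the AMD16 column of the table above.

```python
def lu_flops(n):
    """Approximate floating-point operations for LU of an n x n matrix."""
    return 2.0 * n ** 3 / 3.0


def seconds_at(gflops, n=4096):
    """Run time implied by a sustained rate, under the assumptions above."""
    return lu_flops(n) / (gflops * 1e9)


amd16_gflops = {"PLASMA": 28.7, "Toledo": 17.2, "Right": 7.72, "Pthread": 12.5}
for name, rate in amd16_gflops.items():
    print(f"{name:8s} {seconds_at(rate):.2f} s")
```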
LU SCALING
ARCHITECTURAL VARIATION (BY ARCH.)
[Graphs: AMD16, Intel16, Intel8Xen]
ARCHITECTURAL VARIATION (BY ALG’THM)
XEN INTERFERENCE
Strange behavior with increasing core count on Intel16
[Graphs: Intel16Xen, Intel16]
CACHE INTERFERENCE
Noticed scaling problem with Toledo algorithm
Tested with matrices of size 2^n
Caused conflict misses in processor cache
CACHE INTERFERENCE: EXAMPLE
AMD Opteron has 64-byte cache lines and a 64 KB 2-way set-associative cache:
  512 sets, 2 cache lines each
  Every 32 KB (or 4096 doubles) maps to the same set
Address bits: tag [63:15], set [14:6], offset [5:0]
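The arithmetic above can be checked with a small sketch. It treats addresses as byte offsets from a base aligned to the cache size and assumes a row-major matrix of doubles with 4096 elements per row. Both are simplifying assumptions for illustration, and the helper names are hypothetical.

```python
LINE_BYTES = 64                                   # cache line size
CACHE_BYTES = 64 * 1024                           # total cache capacity
WAYS = 2                                          # associativity
NUM_SETS = CACHE_BYTES // (LINE_BYTES * WAYS)     # 512 sets
DOUBLE_BYTES = 8


def set_index(byte_addr):
    """Bits [14:6] of the address select the set."""
    return (byte_addr // LINE_BYTES) % NUM_SETS


def column_sets(row_len_doubles, num_rows, col=0):
    """Cache sets touched when walking down one column of the matrix."""
    stride = row_len_doubles * DOUBLE_BYTES
    return {set_index(r * stride + col * DOUBLE_BYTES) for r in range(num_rows)}


# With 4096 doubles per row the stride is 32 KB, so every row of a column
# lands in the same set, and a 2-way cache can hold only two of those lines.
touched = column_sets(4096, 64)
```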
CACHE INTERFERENCE: EXAMPLE
4096 elements
SOLUTION: PAD MATRIX ROWS
4096 elements
8 element pad
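A short sketch of why the pad helps, under the same assumptions as the cache example (64-byte lines, 512 sets, row-major doubles, aligned base address). An 8-element pad adds exactly one cache line (64 bytes) to each row, so consecutive rows of a column shift by one set instead of colliding. The function name is illustrative.

```python
def distinct_sets(row_len_doubles, num_rows, line_bytes=64, num_sets=512):
    """Number of distinct cache sets touched by one column walk."""
    stride = row_len_doubles * 8  # bytes per row of doubles
    return len({(r * stride // line_bytes) % num_sets for r in range(num_rows)})


unpadded = distinct_sets(4096, 512)        # stride 32768 bytes
padded = distinct_sets(4096 + 8, 512)      # stride 32832 bytes = 513 lines
```

With the padded stride of 513 lines, row r of a column maps to set r mod 512, so the column walk spreads over every set in the cache.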
CACHE INTERFERENCE (GRAPHS)
Before:
After:
PARALLELISM
Toledo shows higher parallelism, particularly in burdened parallelism and large matrices
Still doesn’t explain poor scaling of right at low numbers of cores
              Toledo                      Right-looking
Matrix size   Parallelism   Burdened      Parallelism   Burdened
2048x2048     15.8          15.5          16.0          12.2
4096x4096     38.1          37.4          34.6          26.0
8192x8192     92.6          91.1          72.8          57.3
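To see why these parallelism figures alone do not explain poor scaling at low core counts, the sketch below applies two textbook relations: speedup on P cores cannot exceed min(P, parallelism), and a greedy scheduler achieves roughly T_P ≈ T_1/P + T_inf, where parallelism = T_1/T_inf. These are generic models rather than measurements from the talk, and the function names are illustrative.

```python
def speedup_upper_bound(cores, parallelism):
    """Speedup can exceed neither the core count nor the parallelism."""
    return min(cores, parallelism)


def greedy_speedup_estimate(cores, parallelism):
    """Estimate from T_P ~ T_1/P + T_inf, with T_inf = T_1 / parallelism."""
    return 1.0 / (1.0 / cores + 1.0 / parallelism)


# Burdened parallelism for the 4096x4096 case, taken from the table above.
burdened = {"Toledo": 37.4, "Right-looking": 26.0}
for cores in (2, 8):
    for name, par in burdened.items():
        est = greedy_speedup_estimate(cores, par)
        print(f"{name:14s} {cores} cores: bound {speedup_upper_bound(cores, par)}, "
              f"estimate {est:.2f}")
```

Under this model both algorithms have ample parallelism for 2 to 8 cores, which is consistent with the slide's remark that parallelism does not account for the gap.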
SYSTEM FACTORS (LOAD LATENCY)
Performance of Right relative to Toledo
SYSTEM FACTORS (LOAD LATENCY)
Performance of Tile relative to Toledo
SCHEDULING
Cilk++ provides dynamic scheduler
PLASMA, pthread use static schedule
Compare performance under multiprogrammed workload
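A toy model can illustrate why the comparison matters. Below, equal-cost tasks run on workers of different effective speeds, with one worker slowed to stand in for a core shared with another program. The static variant gives every worker the same number of tasks up front; the dynamic variant hands the next task to whichever worker frees up first. This is a greedy central-queue model chosen for brevity. It is not Cilk++'s work-stealing scheduler, and it does not model the schedulers used by PLASMA or the pthread code.

```python
import heapq


def static_makespan(num_tasks, speeds):
    """Each worker gets an equal share; the slowest worker sets the finish time."""
    share = num_tasks / len(speeds)
    return max(share / s for s in speeds)


def dynamic_makespan(num_tasks, speeds):
    """Next task goes to the worker that becomes free first."""
    free_at = [(0.0, i) for i in range(len(speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(num_tasks):
        t, i = heapq.heappop(free_at)
        t += 1.0 / speeds[i]          # unit-cost task on worker i
        finish = max(finish, t)
        heapq.heappush(free_at, (t, i))
    return finish


speeds = [1.0, 1.0, 1.0, 0.5]          # hypothetical: one core is half-shared
static_time = static_makespan(64, speeds)
dynamic_time = dynamic_makespan(64, speeds)
```

In this model the static schedule waits for the slowed worker to finish its full share, while the dynamic one shifts work toward the faster workers.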
SCHEDULING GRAPH
Cilk++ implementations degrade more gracefully
PLASMA does OK; pthread right ("tile") doesn't
CODE STYLE
Comparing different languages
Expected large difference, but they are similar
  Complexity is in base case
  Base cases are shared

Lines of code
             Toledo   Right-looking   PLASMA   Pthread Right
Just LU      111      121             143      134
Everything   238      257             269      934*

* Includes base case wrappers
CONCLUSION
Cilk++ can perform competitively with optimized math libraries
Cache behavior is the most important factor
Cilk++ degrades more gracefully when other workloads are running
  Especially compared to hand-coded pthread versions
Code size is not a major factor