TRANSCRIPT
Saving Energy in Sparse and Dense Linear Algebra Computations
P. Alonso∗, M. F. Dolz†, F. Igual‡, R. Mayo†, E. S. Quintana-Ortí†, V. Roca†
∗Univ. Politècnica de València, Spain  †Univ. Jaume I de Castellón, Spain  ‡The Univ. of Texas at Austin, TX
Saving Energy in Sparse and Dense Linear Algebra Computations 1 SIAM PPSC–Savannah (GA), Feb. 2012
Introduction and motivation
Motivation
Reduce energy consumption!
Costs over the lifetime of an HPC facility often exceed acquisition costs
Carbon dioxide is a risk for health and the environment
Heat reduces hardware reliability
Personal view
Hardware features mechanisms and modes to save energy
Scientific apps. are in general energy-oblivious
Introduction and motivation
Motivation (Cont’d)
Scheduling of task-parallel linear algebra algorithms
Examples: Cholesky, QR and LU factorizations, etc.
ILUPACK
Energy saving tools available for multi-core processors
Example: Dynamic Voltage and Frequency Scaling (DVFS)
Scheduling tasks + DVFS
⇓
Energy-aware execution on multi-core processors

Different “scheduling” strategies:
SRA: Reduce the frequency of cores that will execute non-critical tasks to decrease idle times without sacrificing the total performance of the algorithm
RIA: Execute all tasks at the highest frequency to “enjoy” longer inactive periods
Introduction and motivation
Outline
1 LU factorization with partial pivoting
  1.1 Slack Reduction Algorithm
  1.2 Race-to-Idle Algorithm
  1.3 Simulation
  1.4 Experimental Results
2 ILUPACK
  2.1 Race-to-Idle Algorithm
  2.2 Experimental Results
3 Conclusions
LUPP
1. LU Factorization with Partial Pivoting (LUPP)
LU factorization
Factor
A = LU,
with L ∈ R^{n×n} unit lower triangular and U ∈ R^{n×n} upper triangular

For numerical stability, permutations are introduced to prevent operations with small pivot elements:
PA = LU,
with P ∈ R^{n×n} a permutation matrix that interchanges rows of A
LUPP
Blocked algorithm for LUPP (no PP, for simplicity)
for k = 1 : s do
    A_{k:s,k} = L_{k:s,k} · U_{kk}                          {LU factorization}
    for j = k + 1 : s do
        A_{kj} ← L_{kk}^{-1} · A_{kj}                       {Triangular solve}
        A_{k+1:s,j} ← A_{k+1:s,j} − A_{k+1:s,k} · A_{kj}    {Matrix-matrix product}
    end for
end for
DAG with a matrix consisting of 5 × 5 blocks (s=5)
[Figure: DAG with nodes G11–G55 (panel factorizations), T21–T54 (triangular solves) and M21–M54 (matrix-matrix products), connected by dependency edges.]
LUPP Slack Reduction Algorithm
1.1 Slack Reduction Algorithm
Strategy
Search for “slacks” (idle periods) in the DAG associated with the algorithm, and try to minimize them by applying, e.g., DVFS

1 Search for slacks via the Critical Path Method (CPM) on the DAG of dependencies:
Nodes ⇒ tasks; edges ⇒ dependencies
ES_i / LF_i: earliest and latest times task T_i, with cost C_i, can start/finish without increasing the total execution time of the algorithm
S_i: slack (time) by which task T_i can be delayed without increasing the total execution time
Critical path: collection of directly connected tasks, from the initial to the final node of the graph, with total slack = 0
2 Minimize the slack of tasks with S_i > 0 by reducing their execution frequency via the SRA
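As a sketch, the forward/backward passes of CPM on a toy DAG (task names and costs are invented, not taken from the LUPP graph):

```python
# Toy DAG: cost of each task and its successors (all names hypothetical).
cost = {"A": 3.0, "B": 1.0, "C": 2.0, "D": 2.0}
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
order = ["A", "B", "C", "D"]          # a topological order

pred = {t: [] for t in order}
for t in order:
    for s in succ[t]:
        pred[s].append(t)

# Forward pass: earliest start (ES) and earliest finish (EF).
ES, EF = {}, {}
for t in order:
    ES[t] = max((EF[p] for p in pred[t]), default=0.0)
    EF[t] = ES[t] + cost[t]
makespan = max(EF.values())

# Backward pass: latest finish (LF) and slack (S).
LF, S = {}, {}
for t in reversed(order):
    LF[t] = min((LF[s] - cost[s] for s in succ[t]), default=makespan)
    S[t] = LF[t] - EF[t]

critical_path = [t for t in order if S[t] == 0.0]   # total slack = 0
```

Here the critical path is A → C → D, and task B carries a slack of 1.0: it could be delayed, or slowed down via DVFS, without stretching the makespan.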
LUPP Slack Reduction Algorithm
Application of CPM to the DAG of the LUPP of a 5 × 5 blocked matrix:
[Figure: the same 5 × 5 LUPP DAG, annotated for the CPM analysis.]
Task | C       | ES      | LF      | S
G11  | 109.011 | 0.000   | 109.011 | 0
T21  | 30.419  | 109.011 | 139.430 | 0
T41  | 30.419  | 109.011 | 287.374 | 147.944
T51  | 30.419  | 109.011 | 326.306 | 186.877
T31  | 30.419  | 109.011 | 225.082 | 85.652
...  | ...     | ...     | ...     | ...
The duration (cost, C) of tasks of type G and M depends on the iteration! Evaluate the time of 1 flop for each type of task and, from its theoretical cost, approximate the execution time
LUPP Slack Reduction Algorithm
Slack Reduction Algorithm
1 Frequency assignment: Set initial frequencies
2 Critical subpath extraction
3 Slack reduction
1 Frequency assignment
[Figure: the LUPP DAG with every task initially annotated with 2.00 GHz.]
Discrete range of frequencies: {2.00, 1.50, 1.20, 1.00, 0.80} GHz
The duration of tasks of type G and M depends on the iteration
Initially, all tasks are set to run at the highest frequency: 2.00 GHz
LUPP Slack Reduction Algorithm
Slack Reduction Algorithm
1 Frequency assignment
2 Critical subpath extraction: Identify critical subpaths
3 Slack reduction
2 Critical subpath extraction
[Figure: the LUPP DAG at 2.00 GHz, with the critical path and the critical subpaths highlighted.]
Critical subpath extraction:
1 Identify and extract the critical (sub)path(s)
2 Eliminate the graph nodes/edges that belong to it
3 Repeat the process until the graph is empty

Results:
Critical path: CP0
3 critical subpaths (from longest to shortest): CP1 > CP2 > CP3
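The iterative extraction can be sketched as follows. This is a simplified stand-in for the slide's procedure: it repeatedly peels off the longest path among the remaining nodes; the task names and costs are invented:

```python
def longest_path(cost, succ, nodes):
    """One longest (critical) path through `nodes`, given in topological order."""
    pred = {t: [] for t in nodes}
    for t in nodes:
        for s in succ[t]:
            if s in pred:                 # ignore edges to removed nodes
                pred[s].append(t)
    dist, back = {}, {}
    for t in nodes:
        best = max(pred[t], key=lambda p: dist[p], default=None)
        dist[t] = cost[t] + (dist[best] if best is not None else 0.0)
        back[t] = best
    t = max(nodes, key=lambda x: dist[x])
    path = []
    while t is not None:
        path.append(t)
        t = back[t]
    return path[::-1]

cost = {"A": 3.0, "B": 1.0, "C": 2.0, "D": 2.0}
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
nodes = ["A", "B", "C", "D"]

subpaths, remaining = [], list(nodes)
while remaining:          # peel off (sub)paths until the graph is empty
    p = longest_path(cost, succ, remaining)
    subpaths.append(p)
    remaining = [t for t in remaining if t not in p]
```

The first extracted path is the critical path; each further iteration yields a critical subpath of the reduced graph.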
LUPP Slack Reduction Algorithm
Slack Reduction Algorithm
1 Frequency assignment
2 Critical subpath extraction
3 Slack reduction: Determine execution frequency for each task
3 Slack reduction
[Figure: the LUPP DAG after slack reduction; tasks on the critical subpaths are lowered to 1.50 GHz, while critical-path tasks remain at 2.00 GHz.]
Slack reduction algorithm:
1 Reduce the frequency of the tasks on the largest unprocessed subpath
2 Check the lowest frequency reduction ratio for each task of that subpath
3 Repeat the process until all subpaths are processed

Results:
Tasks of CSP1, CSP2 and CSP3 are assigned to run at 1.5 GHz!
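Steps 1–2 amount to picking, for each task, the lowest available frequency whose slowdown still fits within the task's slack. A sketch under the (simplistic, assumed) model that task time scales inversely with frequency:

```python
FREQS = [2.00, 1.50, 1.20, 1.00, 0.80]   # GHz, the range from the slides

def pick_frequency(duration_fmax, slack, freqs=FREQS):
    """Lowest frequency whose extra runtime still fits within `slack`.

    Assumes execution time ~ 1/f, which ignores the memory-bound part of
    real tasks; a real SRA would use measured per-frequency task costs.
    """
    fmax = max(freqs)
    for f in sorted(freqs):                    # try the slowest first
        stretched = duration_fmax * fmax / f
        if stretched - duration_fmax <= slack:
            return f
    return fmax
```

For example, a 10 s task with 4 s of slack can afford 1.50 GHz but not 1.20 GHz under this model, matching the 1.5 GHz assignments on the slide's subpaths.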
LUPP Race-to-Idle Algorithm
1.2 Race-to-Idle Algorithm
Race-to-Idle ⇒ complete the execution as soon as possible, executing the tasks of the algorithm at the highest frequency so as to “enjoy” longer inactive periods
Tasks are executed at the highest frequency
During idle periods, the CPU frequency is reduced to the lowest possible
Why?
Current processors are quite efficient at saving power when idle
Power of an idle core is much lower than power during working periods
The DAG requires no processing, unlike with the SRA
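On Linux, the frequency switching used here is exposed through the cpufreq sysfs interface. A minimal sketch (requires root to actually write; the paths assume the standard cpufreq driver layout):

```python
from pathlib import Path

def governor_path(core: int) -> Path:
    """sysfs entry holding the cpufreq governor of one core."""
    return Path(f"/sys/devices/system/cpu/cpu{core}/cpufreq/scaling_governor")

def set_governor(core: int, governor: str) -> None:
    """Switch a core's governor, e.g. 'performance' while tasks are running
    and 'powersave' during idle periods (root privileges required)."""
    governor_path(core).write_text(governor + "\n")
```

A race-to-idle runtime would set 'performance' while ready tasks exist and drop to a low-power governor when a core runs out of work.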
LUPP Simulation
1.3 Simulation
Use of a simulator to evaluate the performance of the two strategies
Input parameters:
DAG capturing the tasks and dependencies of a blocked algorithm, plus the frequencies recommended by the Slack Reduction Algorithm and the Race-to-Idle Algorithm

A simple description of the target architecture:
Number of sockets (physical processors)
Number of cores per socket
Discrete range of frequencies and associated voltages
Frequency changes at core/socket level
Collection of real power data for each combination of frequency and idle/busy state per core
Cost (overhead) required to perform frequency changes

Static priority list scheduler:
The duration of tasks at each available frequency is known in advance
Tasks that lie on the critical path are prioritized
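The scheduler part can be sketched as a greedy priority-list simulation. This is an assumed, simplified model (invented data; the real simulator also tracks frequencies, idle power states and DVFS overhead):

```python
def list_schedule(cost, succ, prio, n_cores):
    """Greedy static priority-list scheduling of a task DAG.

    prio[t] is the task's priority rank (critical-path tasks first);
    returns the simulated makespan.
    """
    pred = {t: [] for t in cost}
    for t in cost:
        for s in succ[t]:
            pred[s].append(t)
    finish = {}                       # finish time of scheduled tasks
    core_free = [0.0] * n_cores       # time at which each core becomes free
    while len(finish) < len(cost):
        # tasks whose predecessors have all been scheduled
        ready = [t for t in cost if t not in finish
                 and all(p in finish for p in pred[t])]
        t = min(ready, key=lambda x: prio[x])        # highest priority first
        c = core_free.index(min(core_free))          # earliest-free core
        start = max(core_free[c],
                    max((finish[p] for p in pred[t]), default=0.0))
        finish[t] = start + cost[t]
        core_free[c] = finish[t]
    return max(finish.values())

cost = {"A": 3.0, "B": 1.0, "C": 2.0, "D": 2.0}
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
prio = {"A": 0, "C": 1, "B": 2, "D": 3}     # critical path A, C, D first
makespan = list_schedule(cost, succ, prio, n_cores=2)
```

Replacing each task's cost by its duration at the frequency recommended by SRA or RIA turns this into the time half of the simulator; the energy half multiplies each busy/idle interval by the corresponding measured power.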
LUPP Simulation
AMD Opteron 6128 (8 cores):
Evaluate the time of 1 flop for each type of task and, from its theoretical cost, approximate the execution time
Frequency changes at core level, with f ∈ {2.00, 1.50, 1.20, 1.00, 0.80} GHz

Blocked algorithm for LUPP:
Simulation independent of the actual implementation (LAPACK, libflame, etc.)
Matrix size n from 768 to 5,632; block size b = 256
Metrics:
Compare the simulated time and energy consumption of the original algorithm with those of the modified SRA/RIA algorithms

Simulated time: T_SRA/RIA vs. T_original
Impact of SRA/RIA on time: I_T = (T_SRA/RIA / T_original) · 100

Simulated energy: E_SRA/RIA = Σ_{i=1}^{n} W_i · T_i vs. E_original = W(f_max) · T(f_max)
Impact of SRA/RIA on energy: I_E = (E_SRA/RIA / E_original) · 100
LUPP Simulation
Impact of SRA/RIA on simulated time/energy for LUPP (only power/energy due to the workload, i.e., the application!):
[Figure: % impact of SRA/RIA on simulated time (left) and simulated energy (right) vs. matrix size n = 768, ..., 5,632, normalized to the original algorithm.]
SRA: time is compromised, increasing the consumption for the largest problem sizes
The increase in execution time is due to the SRA being oblivious to the actual resources
RIA: time is not compromised and consumption is reduced for large problem sizes
LUPP Simulation
Impact of SRA/RIA on simulated time/application energy for LUPP (only power/energy due to the workload, i.e., the application!):
[Figure: % impact of SRA/RIA on simulated time (left) and application energy (right) vs. matrix size n = 768, ..., 5,632, normalized to the original algorithm.]
SRA: time is compromised, increasing the consumption for the largest problem sizes
The increase in execution time is due to the SRA being oblivious to the actual resources
RIA: time is not compromised and consumption is reduced for large problem sizes
LUPP Experimental Results
1.4 Experimental Results
Integrate RIA into a runtime for the task-parallel execution of dense linear algebra algorithms: libflame + SuperMatrix
[Figure: SuperMatrix runtime. A symbolic analysis of the algorithm builds a queue of pending tasks + dependencies (the DAG); ready tasks (no dependencies) are dispatched from a ready queue to worker threads 1..p, each bound to a core.]
Two energy-aware techniques:
RIA1: Reduce the operating frequency when there are no ready tasks (Linux governor)
RIA2: Remove polling when there are no ready tasks (while ensuring a quick recovery)

Applicable to any runtime: SuperMatrix, SMPSs, Quark, etc.!
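RIA2's “remove polling, but recover quickly” idea can be sketched with a condition variable: idle workers block instead of spinning, and a notification wakes one as soon as a task becomes ready. A generic sketch, not the SuperMatrix code:

```python
import threading
from collections import deque

tasks = deque()                 # ready-task queue
cv = threading.Condition()      # workers sleep here instead of polling

def worker(results):
    while True:
        with cv:
            while not tasks:
                cv.wait()       # block: the idle core can drop its frequency
            task = tasks.popleft()
        if task is None:        # shutdown sentinel
            return
        results.append(task())

def submit(task):
    with cv:
        tasks.append(task)
        cv.notify()             # quick recovery: wake one sleeping worker

results = []
t = threading.Thread(target=worker, args=(results,))
t.start()
submit(lambda: 6 * 7)
submit(None)                    # tell the worker to exit
t.join()
```

Compared with busy polling, a blocked thread lets the OS idle the core (and RIA1 lower its frequency), at the cost of one wake-up latency per task.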
LUPP Experimental Results
“Doing nothing well” – David E. Culler
AMD Opteron 6128, 1 core:
[Figure: power (Watts, 40–120) over time (0–30 s) for different thread activities: MKL dgemm at 2.00 GHz, blocking at 800 MHz, polling at 2.00 GHz, polling at 800 MHz.]
LUPP Experimental Results
AMD Opteron 6128 (8 cores):
Frequency changes at core level, with f_max = 2.00 GHz and f_min = 800 MHz

Blocked algorithm for LUPP:
Routine FLA_LU_piv from libflame v5.0
Matrix size n from 2,048 to 12,288; block size b = 256

Metrics:
Compare the actual time and energy consumption of the original runtime with those of the modified RIA1/RIA2 runtime
Actual time: T_RIA vs. T_original
Impact of RIA on time: I_T = (T_RIA / T_original) · 100

Actual energy: E_RIA and E_original measured using an internal powermeter with sampling frequency f = 25 Hz
Impact of RIA on energy: I_E = (E_RIA / E_original) · 100
LUPP Experimental Results
Impact of RIA1/RIA2 on actual time/energy for LUPP (only power/energy due to the workload, i.e., the application!):
[Figure: % impact of RIA1, RIA2 and RIA1+RIA2 on actual time (left) and energy (right) vs. matrix size n = 2,048, ..., 12,288.]
Small impact on execution time
Consistent savings of around 5% for RIA1+RIA2:
Not much hope for larger savings because there is no opportunity for that: the cores are busy most of the time
LUPP Experimental Results
Impact of RIA1/RIA2 on actual time/application energy for LUPP (only power/energy due to the workload, i.e., the application!):
[Figure: % impact of RIA1, RIA2 and RIA1+RIA2 on actual time (left) and application energy (right) vs. matrix size n = 2,048, ..., 12,288.]
Small impact on execution time
Consistent savings of around 7–8% for RIA1+RIA2:
Not much hope for larger savings because there is no opportunity for that: the cores are busy most of the time
ILUPACK
2. ILUPACK
Sequential version – M. Bollhöfer
Iterative solver for large-scale sparse linear systems
Multilevel ILU preconditioners for general and complex symmetric/Hermitian positive definite systems
Based on inverse-based ILUs: incomplete LU decompositions that control the growth of the inverse triangular factors

Multi-threaded version – J. I. Aliaga, M. Bollhöfer, A. F. Martín, E. S. Quintana-Ortí
Real symmetric positive definite systems
Construction of the preconditioner and PCG solver
Algebraic parallelization based on a task tree
Leverages task-parallelism in the tree
Dynamic scheduling via a tailored run-time (OpenMP)
ILUPACK
[Figure: nested dissection of the adjacency graph G_A of A, the corresponding task tree T, and the permuted matrix PA; tasks are labeled (level, index): (1,1)–(1,4), (2,1), (2,2), (3,1).]
ILUPACK
Integrate RIA into multi-threaded version of ILUPACK
[Figure: ILUPACK runtime. METIS/SCOTCH partitioning of the problem data A yields a queue of pending tasks (binary tree); ready tasks (no dependencies) are dispatched to worker threads 1..p, each bound to a core.]
Same energy-aware techniques:
RIA1+RIA2: Reduce the operating frequency when there are no ready tasks (Linux governor) and remove polling when there are no ready tasks (while ensuring a quick recovery)
ILUPACK
AMD Opteron 6128 (8 cores)
Linear system associated with a Laplacian equation (n ≈ 16M)
Impact of RIA1+RIA2 on actual power/application power for ILUPACK (preconditioner computation only):
[Figure: % impact of RIA1+RIA2 on total power (left) and application power (right) over ~1,800 powermeter samples.]
Conclusions
3. Conclusions
Goal: combine scheduling + DVFS to save energy during the execution of linear algebra algorithms on multi-threaded architectures
For (dense) LUPP:
Slack Reduction Algorithm:
Requires processing of the DAG
Currently does not take into account the number of resources
Increases execution time as the matrix size grows
Also increases energy consumption

Race-to-Idle Algorithm:
Applied on-the-fly; no pre-processing needed
Maintains the execution time in all cases
Reduces application energy consumption (by around 7–8%)
As the ratio between the problem size and the number of resources grows, the opportunities to save energy decrease!
Conclusions
For (sparse) ILUPACK:
Race-to-Idle Algorithm:
The DAG requires no processing
Applied on-the-fly
Maintains the execution time in all cases
Reduces the power dissipated by the application by up to 40%
Significant opportunities to save energy!
General:
We need architectures that know how to do nothing better!
More information
“Improving power-efficiency of dense linear algebra algorithms on multi-core processors via slack control”
P. Alonso, M. F. Dolz, R. Mayo, E. S. Quintana-Ortí
Workshop on Optimization Issues in Energy Efficient Distributed Systems – OPTIM 2011, Istanbul (Turkey), July 2011

“DVFS-control techniques for dense linear algebra operations on multi-core processors”
P. Alonso, M. F. Dolz, F. Igual, R. Mayo, E. S. Quintana-Ortí
2nd Int. Conf. on Energy-Aware High Performance Computing – EnaHPC 2011, Hamburg (Germany), Sept. 2011

“Saving energy in the LU factorization with partial pivoting on multi-core processors”
P. Alonso, M. F. Dolz, F. Igual, R. Mayo, E. S. Quintana-Ortí
20th Euromicro Conf. on Parallel, Distributed and Network-based Processing – PDP 2012, Garching (Germany), Feb. 2012