
Saving Energy in Sparse and Dense Linear Algebra Computations

P. Alonso∗, M. F. Dolz†, F. Igual‡, R. Mayo†, E. S. Quintana-Ortí†, V. Roca†

∗Univ. Politécnica de Valencia, Spain    †Univ. Jaume I de Castellón, Spain    ‡The Univ. of Texas at Austin, TX

[email protected]

SIAM PPSC – Savannah (GA), Feb. 2012


Introduction and motivation

Motivation

Reduce energy consumption!

Costs over the lifetime of an HPC facility often exceed the acquisition costs
Carbon dioxide is a risk for health and the environment
Heat reduces hardware reliability

Personal view

Hardware features mechanisms and modes to save energy

Scientific apps. are in general energy-oblivious


Introduction and motivation

Motivation (Cont’d)

Scheduling of task-parallel linear algebra algorithms. Examples: Cholesky, QR and LU factorizations, etc.

ILUPACK

Energy-saving tools available for multi-core processors. Example: Dynamic Voltage and Frequency Scaling (DVFS)

Scheduling tasks + DVFS

⇓
Energy-aware execution on multi-core processors

Different “scheduling” strategies:
SRA: Reduce the frequency of cores that will execute non-critical tasks, to decrease idle times without sacrificing the total performance of the algorithm

RIA: Execute all tasks at highest frequency to “enjoy” longer inactive periods
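Both strategies lean on DVFS, which Linux exposes through the cpufreq sysfs files; a minimal, hypothetical sketch (real use requires root and the userspace governor; the `base` parameter only exists here so the sketch can be exercised against any directory tree):

```python
import os

def set_core_frequency(core, khz, base="/sys/devices/system/cpu"):
    """Pin `core` to `khz` kHz via Linux's cpufreq userspace governor."""
    knob = os.path.join(base, f"cpu{core}", "cpufreq")
    with open(os.path.join(knob, "scaling_governor"), "w") as f:
        f.write("userspace")          # allow explicit frequency selection
    with open(os.path.join(knob, "scaling_setspeed"), "w") as f:
        f.write(str(khz))             # e.g. 800000 -> 800 MHz
```

For instance, `set_core_frequency(0, 800000)` would drop core 0 to 800 MHz, the lowest frequency of the processor used later in the talk.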


Introduction and motivation

Outline

1 LU factorization with partial pivoting
  1.1 Slack Reduction Algorithm
  1.2 Race-to-Idle Algorithm
  1.3 Simulation
  1.4 Experimental Results

2 ILUPACK
  2.1 Race-to-Idle Algorithm
  2.2 Experimental Results

3 Conclusions


LUPP

1. LU Factorization with Partial Pivoting (LUPP)

LU factorization

Factor

A = LU,

with L ∈ R^{n×n} unit lower triangular and U ∈ R^{n×n} upper triangular

For numerical stability, permutations are introduced to prevent operation with small pivot elements:

PA = LU,

with P ∈ R^{n×n} a permutation matrix that interchanges rows of A


LUPP

Blocked algorithm for LUPP (no PP, for simplicity)

for k = 1 : s do
    A_{k:s,k} = L_{k:s,k} · U_{kk}                          (LU factorization)
    for j = k + 1 : s do
        A_{kj} ← L_{kk}^{-1} · A_{kj}                       (triangular solve)
        A_{k+1:s,j} ← A_{k+1:s,j} − A_{k+1:s,k} · A_{kj}    (matrix-matrix product)
    end for
end for
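A minimal NumPy sketch of the loop above, assuming no pivoting and n a multiple of the block size (an illustration of the blocked scheme, not the libflame routine used later):

```python
import numpy as np

def blocked_lu(A, b):
    """In-place blocked LU factorization without pivoting.

    On return, A holds U on and above the diagonal and the strictly lower
    part of the unit lower triangular L below it. `b` is the block size;
    n is assumed to be a multiple of b for simplicity.
    """
    n = A.shape[0]
    for k in range(0, n, b):
        # Panel factorization: A[k:n, k:k+b] = L[k:n, k:k+b] * U[k:k+b, k:k+b]
        for j in range(k, k + b):
            A[j + 1:, j] /= A[j, j]
            A[j + 1:, j + 1:k + b] -= np.outer(A[j + 1:, j], A[j, j + 1:k + b])
        if k + b < n:
            # Triangular solve: A_kj <- L_kk^{-1} * A_kj, for blocks right of the panel
            Lkk = np.tril(A[k:k + b, k:k + b], -1) + np.eye(b)
            A[k:k + b, k + b:] = np.linalg.solve(Lkk, A[k:k + b, k + b:])
            # Trailing update (matrix-matrix product)
            A[k + b:, k + b:] -= A[k + b:, k:k + b] @ A[k:k + b, k + b:]
    return A
```

Without pivoting this is only safe on matrices with well-conditioned pivots (e.g. diagonally dominant ones), which is why the talk's LUPP adds the permutation P.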

DAG with a matrix consisting of 5 × 5 blocks (s=5)

[Figure: DAG of the blocked LUPP with tasks G kk (panel factorization), T jk (triangular solves) and M ij (matrix-matrix products)]


LUPP Slack Reduction Algorithm

1.1 Slack Reduction Algorithm

Strategy

Search for “slacks” (idle periods) in the DAG associated with the algorithm, and try to minimize them by applying, e.g., DVFS

1 Search for slacks via the Critical Path Method (CPM) on the DAG of dependencies:

Nodes ⇒ Tasks; Edges ⇒ Dependencies

ES_i / LF_i: earliest and latest times task T_i, with cost C_i, can start/finalize without increasing the total execution time of the algorithm

S_i: slack (time) task T_i can be delayed without increasing the total execution time

Critical path: collection of directly connected tasks, from the initial to the final node of the graph, with total slack = 0

2 Minimize the slack of tasks with S_i > 0, reducing their execution frequency via SRA
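The CPM quantities can be sketched with a forward and a backward pass over a topological order (a toy illustration of the definitions above, not the authors' code):

```python
def cpm(cost, deps):
    """Critical Path Method on a task DAG.

    cost: {task: duration}; deps: iterable of (u, v) edges, u before v.
    Returns (ES, LF, slack); tasks with slack == 0 form the critical path.
    """
    succ = {t: [] for t in cost}
    pred = {t: [] for t in cost}
    for u, v in deps:
        succ[u].append(v)
        pred[v].append(u)
    # Topological order (Kahn's algorithm)
    indeg = {t: len(pred[t]) for t in cost}
    ready = [t for t in cost if indeg[t] == 0]
    order = []
    while ready:
        t = ready.pop()
        order.append(t)
        for s in succ[t]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    ES = {}                               # forward pass: earliest start times
    for t in order:
        ES[t] = max((ES[p] + cost[p] for p in pred[t]), default=0.0)
    makespan = max(ES[t] + cost[t] for t in cost)
    LF = {}                               # backward pass: latest finish times
    for t in reversed(order):
        LF[t] = min((LF[s] - cost[s] for s in succ[t]), default=makespan)
    slack = {t: LF[t] - ES[t] - cost[t] for t in cost}
    return ES, LF, slack
```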


LUPP Slack Reduction Algorithm

Application of CPM to the DAG of the LUPP of a 5 × 5 blocked matrix:

[Figure: CPM-annotated DAG of the blocked LUPP]

Task   C        ES       LF       S
G 11   109.011  0.000    109.011  0
T 21   30.419   109.011  139.430  0
T 41   30.419   109.011  287.374  147.944
T 51   30.419   109.011  326.306  186.877
T 31   30.419   109.011  225.082  85.652
...

The duration (cost, C) of tasks of type G and M depends on the iteration! Evaluate the time of 1 flop for each type of task and, from its theoretical cost, approximate the execution time


LUPP Slack Reduction Algorithm

Slack Reduction Algorithm

1 Frequency assignment: Set initial frequencies

2 Critical subpath extraction

3 Slack reduction

1 Frequency assignment

[Figure: DAG with every task initially annotated with 2.00 GHz]

Discrete range of frequencies: {2.00, 1.50, 1.20, 1.00, 0.80} GHz
The duration of tasks of type G and M depends on the iteration.
Initially, all tasks are set to run at the highest frequency: 2.00 GHz


LUPP Slack Reduction Algorithm

Slack Reduction Algorithm

1 Frequency assignment

2 Critical subpath extraction: Identify critical subpaths

3 Slack reduction

2 Critical subpath extraction

[Figure: DAG with the critical path and critical subpaths highlighted]

Critical subpath extraction:
1 Identify and extract the critical (sub)path(s)
2 Eliminate the graph nodes/edges that belong to it
3 Repeat the process until the graph is empty

Results:

Critical path: CP0

3 critical subpaths (from largest to shortest): CP1 > CP2 > CP3


LUPP Slack Reduction Algorithm

Slack Reduction Algorithm

1 Frequency assignment

2 Critical subpath extraction

3 Slack reduction: Determine execution frequency for each task

3 Slack reduction

[Figure: DAG after slack reduction; tasks on the critical subpaths are lowered to 1.50 GHz]

Slack reduction algorithm:
1 Reduce the frequency of the tasks of the largest unprocessed subpath
2 Check the lowest frequency reduction ratio for each task of that subpath
3 Repeat the process until all subpaths are processed

Results:

Tasks of CSP1, CSP2 and CSP3 are assigned to run at 1.5 GHz!
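A minimal sketch of the per-task frequency choice, assuming execution time scales inversely with frequency (a simplification; the actual SRA processes whole critical subpaths and uses measured costs):

```python
FREQS = [2.00, 1.50, 1.20, 1.00, 0.80]   # GHz, available P-states, highest first

def slack_frequency(cost, slack):
    """Lowest discrete frequency whose slowdown still fits in the task's slack.

    Running at f stretches `cost` to cost * FREQS[0] / f (runtime ~ 1/f),
    so we keep the lowest f with stretched cost <= cost + slack.
    """
    chosen = FREQS[0]
    for f in FREQS:                       # descending, so the last fit is the lowest
        if cost * FREQS[0] / f <= cost + slack:
            chosen = f
    return chosen
```

A task with zero slack (on the critical path) keeps the top frequency, while, e.g., task T 41 from the CPM table (C = 30.419, S = 147.944) could afford the lowest one under this model.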


LUPP Race-to-Idle Algorithm

1.2 Race-to-Idle Algorithm

Race-to-Idle ⇒ complete the execution as soon as possible by executing the tasks of the algorithm at the highest frequency, to “enjoy” longer inactive periods

Tasks are executed at highest frequency

During idle periods CPU frequency is reduced to lowest possible

Why?

Current processors are quite efficient at saving power when idle

Power of an idle core is much lower than power during working periods

DAG requires no processing, unlike SRA
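The policy can be sketched as a small driver loop; `set_freq` is a hypothetical platform hook (e.g. the cpufreq interface), and None is used here as an end-of-work sentinel:

```python
import queue

F_MAX, F_MIN = 2.00, 0.80   # GHz; the frequency range of the Opteron 6128 in the talk

def race_to_idle(tasks, run_task, set_freq):
    """Execute every ready task at F_MAX; fall back to F_MIN whenever idle.

    `tasks` is a queue.Queue of work items terminated by None;
    `set_freq` changes the core's frequency. Sketch only.
    """
    while True:
        task = tasks.get()        # blocks while there is nothing to do
        if task is None:          # sentinel: computation finished
            set_freq(F_MIN)
            return
        set_freq(F_MAX)           # race: run the task as fast as possible
        run_task(task)
        if tasks.empty():
            set_freq(F_MIN)       # idle: drop to the lowest frequency
```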


LUPP Simulation

1.3 Simulation

Use of a simulator to evaluate the performance of the two strategies

Input parameters:
DAG capturing the tasks and dependencies of a blocked algorithm, and the frequencies recommended by the Slack Reduction Algorithm and the Race-to-Idle Algorithm

A simple description of the target architecture:
  Number of sockets (physical processors)
  Number of cores per socket
  Discrete range of frequencies and associated voltages
  Frequency changes at core/socket level
  Collection of real power for each combination of frequency and idle/busy state per core
  Cost (overhead) required to perform frequency changes

Static priority list scheduler:
  Duration of tasks at each available frequency is known in advance
  Tasks that lie on the critical path are prioritized


LUPP Simulation

AMD Opteron 6128 (8 cores):
Evaluate the time of 1 flop for each type of task and, from its theoretical cost, approximate the execution time
Frequency changes at core level, with f ∈ {2.00, 1.50, 1.20, 1.00, 0.80} GHz

Blocked algorithm for LUPP:
Simulation independent of the actual implementation (LAPACK, libflame, etc.)
Matrix size n from 768 to 5,632; block size b = 256

Metrics: compare the simulated time and energy consumption of the original algorithm with those of the modified SRA/RIA algorithms

Simulated time: T_SRA/RIA vs. T_original
Impact of SRA/RIA on time: I_T = (T_SRA/RIA / T_original) · 100

Simulated energy: E_SRA/RIA = Σ_{i=1}^{n} W_i · T_i, and E_original = W(f_max) · T(f_max)
Impact of SRA/RIA on energy: I_E = (E_SRA/RIA / E_original) · 100


LUPP Simulation

Impact of SRA/RIA on simulated time/energy for LUPP: only power/energy due to the workload (application)!

[Plots: % impact on time and % impact on energy vs. matrix size n (768–5,632), for RIA and SRA]

SRA: Time is compromised, increasing the consumption for largest problem sizes

Increase in execution time due to SRA being oblivious to the actual resources

RIA: Time is not compromised and consumption is reduced for large problem sizes


LUPP Simulation

Impact of SRA/RIA on simulated time/application energy for LUPP: only power/energy due to the workload (application)!

[Plots: % impact on time and % impact on application energy vs. matrix size n (768–5,632), for RIA and SRA]

SRA: Time is compromised, increasing the consumption for largest problem sizes

Increase in execution time due to the SRA being oblivious to the actual resources

RIA: Time is not compromised and consumption is reduced for large problem sizes


LUPP Experimental Results

1.4 Experimental Results

Integrate RIA into a runtime for the task-parallel execution of dense linear algebra algorithms: libflame + SuperMatrix

[Diagram: the algorithm passes through a symbolic analysis stage that builds a queue of pending tasks with dependencies (DAG); a dispatch stage moves ready tasks (no pending dependencies) to a ready queue served by worker threads 1..p, one per core]

Two energy-aware techniques:
RIA1: Reduce the operation frequency when there are no ready tasks (Linux governor)

RIA2: Remove polling when there are no ready tasks (while ensuring a quick recovery)

Applicable to any runtime: SuperMatrix, SMPSs, Quark, etc!
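RIA2's replacement of polling by blocking can be sketched with a condition variable; this is a hypothetical stand-in for the runtime's ready queue, not SuperMatrix code:

```python
import threading
from collections import deque

class BlockingReadyQueue:
    """Idle workers sleep on a condition variable instead of busy-polling,
    yet wake as soon as a task (or shutdown) arrives: quick recovery."""

    def __init__(self):
        self._tasks = deque()
        self._cv = threading.Condition()
        self._closed = False

    def put(self, task):
        with self._cv:
            self._tasks.append(task)
            self._cv.notify()             # wake one sleeping worker

    def close(self):
        with self._cv:
            self._closed = True
            self._cv.notify_all()

    def get(self):
        """Return the next task, or None once the queue is closed and drained."""
        with self._cv:
            while not self._tasks and not self._closed:
                self._cv.wait()           # a sleeping core dissipates far less power
            return self._tasks.popleft() if self._tasks else None
```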


LUPP Experimental Results

“Doing nothing well” – David E. Culler

AMD Opteron 6128, 1 core:

[Plot: power (Watts) vs. time (s) for different thread activities: MKL dgemm at 2.00 GHz, polling at 2.00 GHz, polling at 800 MHz, and blocking at 800 MHz]


LUPP Experimental Results

AMD Opteron 6128 (8 cores):
Frequency changes at core level, with f_max = 2.00 GHz and f_min = 800 MHz

Blocked algorithm for LUPP:
Routine FLA_LU_piv from libflame v5.0
Matrix size n from 2,048 to 12,288; block size b = 256

Metrics: compare the actual time and energy consumption of the original runtime with those of the modified RIA1/RIA2 runtime

Actual time: T_RIA vs. T_original
Impact of RIA on time: I_T = (T_RIA / T_original) · 100

Actual energy: E_RIA and E_original measured using an internal powermeter with sampling frequency f = 25 Hz
Impact of RIA on energy: I_E = (E_RIA / E_original) · 100


LUPP Experimental Results

Impact of RIA1/RIA2 on actual time/energy for LUPP: only power/energy due to the workload (application)!

[Plots: % impact on time and % impact on energy vs. matrix size n (2,048–12,288), for RIA1, RIA2 and RIA1+RIA2]

Small impact on execution time

Consistent savings around 5 % for RIA1+RIA2:

Not much hope for larger savings because there is no opportunity for them: the cores are busy most of the time


LUPP Experimental Results

Impact of RIA1/RIA2 on actual time/application energy for LUPP: only power/energy due to the workload (application)!

[Plots: % impact on time and % impact on application energy vs. matrix size n (2,048–12,288), for RIA1, RIA2 and RIA1+RIA2]

Small impact on execution time

Consistent savings around 7–8% for RIA1+RIA2:

Not much hope for larger savings because there is no opportunity for them: the cores are busy most of the time


ILUPACK

2. ILUPACK

Sequential version – M. Bollhöfer

Iterative solver for large-scale sparse linear systems

Multilevel ILU preconditioners for general and complexsymmetric/Hermitian positive definite systems

Based on inverse-based ILUs: incomplete LU decompositions that control the growth of the inverse triangular factors

Multi-threaded version – J. I. Aliaga, M. Bollhöfer, A. F. Martín, E. S. Quintana-Ortí

Real symmetric positive definite systems

Construction of preconditioner and PCG solver

Algebraic parallelization based on a task tree

Leverage task-parallelism in the tree

Dynamic scheduling via tailored run-time (OpenMP)


ILUPACK

[Figure: nested partitioning of the matrix A and the associated task tree that drives the algebraic parallelization]


ILUPACK

Integrate RIA into multi-threaded version of ILUPACK

[Diagram: the problem data A is partitioned with METIS/SCOTCH into a queue of pending tasks (binary tree); ready tasks (no pending dependencies) are served by worker threads 1..p, one per core]

Same energy-aware techniques:
RIA1+RIA2: Reduce the operation frequency when there are no ready tasks (Linux governor) and remove polling when there are no ready tasks (while ensuring a quick recovery)


ILUPACK

AMD Opteron 6128 (8 cores)

Linear system associated with a Laplacian equation (n ≈ 16M)

Impact of RIA1+RIA2 on actual power/application power for ILUPACK (preconditioner only):

[Plots: % impact on total power and on application power per sample (25 Hz), for RIA1+RIA2]


Conclusions

3. Conclusions

Goal: combine scheduling + DVFS to save energy during the execution of linear algebra algorithms on multi-threaded architectures

For (dense) LUPP:

Slack Reduction Algorithm

DAG requires pre-processing

Currently does not take into account the number of resources

Increases execution time as the matrix size grows

Also increases energy consumption

Race-to-Idle Algorithm

Algorithm is applied on-the-fly, no pre-processing needed

Maintains the execution time in all cases

Reduces application energy consumption (around 7–8%)

As the ratio between the problem size and the number of resources grows, the opportunities to save energy decrease!


For (sparse) ILUPACK:

Race-to-Idle Algorithm

DAG requires no processing

Algorithm is applied on-the-fly

Maintains the execution time in all cases

Reduces the power dissipated by the application by up to 40%

Significant opportunities to save energy!

General: we need architectures that know how to do nothing better!


More information

“Improving power-efficiency of dense linear algebra algorithms on multi-core processors via slack control”
P. Alonso, M. F. Dolz, R. Mayo, E. S. Quintana-Ortí
Workshop on Optimization Issues in Energy Efficient Distributed Systems – OPTIM 2011, Istanbul (Turkey), July 2011

“DVFS-control techniques for dense linear algebra operations on multi-core processors”
P. Alonso, M. F. Dolz, F. Igual, R. Mayo, E. S. Quintana-Ortí
2nd Int. Conf. on Energy-Aware High Performance Computing – EnaHPC 2011, Hamburg (Germany), Sept. 2011

“Saving energy in the LU factorization with partial pivoting on multi-core processors”
P. Alonso, M. F. Dolz, F. Igual, R. Mayo, E. S. Quintana-Ortí
20th Euromicro Conf. on Parallel, Distributed and Network-based Processing – PDP 2012, Garching (Germany), Feb. 2012
