power profiling of cholesky and qr factorizations on ... · power profiling of cholesky and qr...

26
Power Profiling of Cholesky and QR Factorizations on Distributed Memory Systems George Bosilca 1 Hatem Ltaief 2 Jack Dongarra 134 1 Innovative Computing Laboratory University of Tennessee Knoxville 2 KAUST Supercomputing Laboratory Thuwal, Saudi Arabia 3 Oak Ridge National Lab 4 University of Manchester International Conference on Energy-Aware High Performance Computing Hamburg, Germany Bosilca, Ltaief, Dongarra (KAUST, UTK) Power Profiling of DLA Algorithms ENAHPC’12 1 / 26

Upload: vannhi

Post on 28-Aug-2018

221 views

Category:

Documents


0 download

TRANSCRIPT

Power Profiling of Cholesky and QRFactorizations on Distributed Memory Systems

George Bosilca1 Hatem Ltaief2 Jack Dongarra1 3 4

1Innovative Computing LaboratoryUniversity of Tennessee Knoxville

2KAUST Supercomputing LaboratoryThuwal, Saudi Arabia

3Oak Ridge National Lab

4University of Manchester

International Conference on Energy-Aware High PerformanceComputing

Hamburg, GermanySept 12, 2012Bosilca, Ltaief, Dongarra (KAUST, UTK) Power Profiling of DLA Algorithms ENAHPC’12 1 / 26

Outline

1 Motivations

2 From LAPACK to PLASMA

3 From PLASMA to DPLASMA

4 Power Measurements Technique

5 Power Measurements Results

6 Summary and Future Work

Bosilca, Ltaief, Dongarra (KAUST, UTK) Power Profiling of DLA Algorithms ENAHPC’12 2 / 26

Motivations

The Top500 List

Rank Name Vendor Cores Rmax (Tflop/s) Power (KW)1 Sequoia IBM BG/Q 1572864 16324.75 (81%) 7890.02 K Computer Fujitsu SPARC64 705024 10510.00 (93%) 12659.93 Mira IBM BG/Q 786432 8162.38 (81%) 3945.04 SuperMUC Intel Xeon E5 147456 2897.00 (91%) 3422.75 Tianhe-1A Intel Xeon X5670 + M2050 186368 2566.00 (55%) 4040.06 Jaguar Cray XK6 Opteron + M2090 298592 1941.00 (74%) 5142.07 Fermi IBM BG/Q 163840 1725.49 (82%) 821.98 JuQUEEN IBM BG/Q 131072 1380.39 (82%) 657.59 Curie thin nodes Intel Xeon E5 77184 1359.00 (81%) 2251.010 Nebulae Intel Xeon X5670 + M2050 120640 1271.00 (42%) 2580.0

Bosilca, Ltaief, Dongarra (KAUST, UTK) Power Profiling of DLA Algorithms ENAHPC’12 3 / 26

Motivations

The Top500 List

Rank Name Vendor Cores Rmax (Tflop/s) Power (KW)1 Sequoia IBM BG/Q 1572864 16324.75 (81%) 7890.02 K Computer Fujitsu SPARC64 705024 10510.00 (93%) 12659.93 Mira IBM BG/Q 786432 8162.38 (81%) 3945.04 SuperMUC Intel Xeon E5 147456 2897.00 (91%) 3422.75 Tianhe-1A Intel Xeon X5670 + M2050 186368 2566.00 (55%) 4040.06 Jaguar Cray XK6 Opteron + M2090 298592 1941.00 (74%) 5142.07 Fermi IBM BG/Q 163840 1725.49 (82%) 821.98 JuQUEEN IBM BG/Q 131072 1380.39 (82%) 657.59 Curie thin nodes Intel Xeon E5 77184 1359.00 (81%) 2251.010 Nebulae Intel Xeon X5670 + M2050 120640 1271.00 (42%) 2580.0

Human brain: 20 PetaFLOPS! (cf Kurzweil)

Bosilca, Ltaief, Dongarra (KAUST, UTK) Power Profiling of DLA Algorithms ENAHPC’12 4 / 26

Motivations

Today’s Special Meal

8 MW needed to feed the babyExascale roadmap says up to 20 MW Power EnvelopeHuge challenge: achieving 2 orders of magnitude inperformance by only doubling the power rateHigh level of concurrencyIngredients: Fine-grain parallelism, Dynamic runtime systems,Power EfficiencyFlops are cheap, Data movement is expensiveCo-designed Hardware and Software solutions

Bosilca, Ltaief, Dongarra (KAUST, UTK) Power Profiling of DLA Algorithms ENAHPC’12 5 / 26

From LAPACK to PLASMA

Block Algorithms

Panel-Update SequenceTransformations are blocked/accumulated within the Panel(Level 2 BLAS)Transformations applied at once on the trailing submatrix(Level 3 BLAS)Parallelism hidden inside the BLASFork-join Model

Bosilca, Ltaief, Dongarra (KAUST, UTK) Power Profiling of DLA Algorithms ENAHPC’12 6 / 26

From LAPACK to PLASMA

One-Sided Block Algorithms: LU

Bosilca, Ltaief, Dongarra (KAUST, UTK) Power Profiling of DLA Algorithms ENAHPC’12 7 / 26

From LAPACK to PLASMA

Block Algorithms: Fork-Join Paradigm

Bosilca, Ltaief, Dongarra (KAUST, UTK) Power Profiling of DLA Algorithms ENAHPC’12 8 / 26

From LAPACK to PLASMA

Tile Data Layout Format

LAPACK: column-major format PLASMA: tile format

Bosilca, Ltaief, Dongarra (KAUST, UTK) Power Profiling of DLA Algorithms ENAHPC’12 9 / 26

From LAPACK to PLASMA

PLASMA: Tile Algorithms

PLASMA: Parallel Linear Algebra for Scalable Multi-coreArchitectures =⇒ http://icl.cs.utk.edu/plasma/

Parallelism is brought to the foreMay require the redesign of linear algebra algorithmsTile data layout translationRemove unnecessary synchronization points betweenPanel-Update sequencesDAG execution where nodes represent tasks and edgesdefine dependencies between themDynamic runtime system environment QUARK

Bosilca, Ltaief, Dongarra (KAUST, UTK) Power Profiling of DLA Algorithms ENAHPC’12 10 / 26

From LAPACK to PLASMA

Dynamic Runtime System QUARK

Basic Ideas:

Conceptually similar to out-of-order processor schedulingDynamic runtime DAG schedulerOut-of-order execution flow of fine-grained tasksTask scheduling as soon as dependencies are satisfiedProducer-Consumer

Similar projects: SuperMatrix, OMPSs, StarPU

Bosilca, Ltaief, Dongarra (KAUST, UTK) Power Profiling of DLA Algorithms ENAHPC’12 11 / 26

From LAPACK to PLASMA

DataFlow Programming Model

Five decades OLD conceptProgramming paradigm that models a program as adirected graph of the data flowing between operations (cf.Wikipedia)Think "how things connect" rather than "how things happen"Assembly lineInherently parallel

Bosilca, Ltaief, Dongarra (KAUST, UTK) Power Profiling of DLA Algorithms ENAHPC’12 12 / 26

From PLASMA to DPLASMA

2D Block Cyclic Distribution

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

0

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3 2

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

(a) Column-major data layout for-mat within a block.

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

0

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3 2

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

3

0

2

1

(b) Tile data layout format within ablock.

Figure: Two-Dimensional Block Cyclic Data Distribution.

Bosilca, Ltaief, Dongarra (KAUST, UTK) Power Profiling of DLA Algorithms ENAHPC’12 13 / 26

From PLASMA to DPLASMA

DAGuE Dynamic Runtime Scheduler

Bosilca et. al, UTKBosilca, Ltaief, Dongarra (KAUST, UTK) Power Profiling of DLA Algorithms ENAHPC’12 14 / 26

From PLASMA to DPLASMA

DAGuE Dynamic Runtime Scheduler

Bosilca et. al, UTKBosilca, Ltaief, Dongarra (KAUST, UTK) Power Profiling of DLA Algorithms ENAHPC’12 15 / 26

Power Measurements Technique

The PowerPack Framework

Dual-socket quad-core Intel Xeon system from Virginia Tech,clocked at 2.8GHz with 8GB of memoryMeasurements from power meters attached to thehardware of the systemFine-grain measurement (100ms) allows power consumptionto be measured on a per-component basisCPU, memory, hard disk, motherboard and System (as awhole)N = 40000 for all experiments

Bosilca, Ltaief, Dongarra (KAUST, UTK) Power Profiling of DLA Algorithms ENAHPC’12 16 / 26

Power Measurements Technique

The PowerPack Framework

K. Cameron et. al, Virginia Tech

Bosilca, Ltaief, Dongarra (KAUST, UTK) Power Profiling of DLA Algorithms ENAHPC’12 17 / 26

Power Measurements Results

0

5000

10000

15000

20000

25000

0 5 10 15 20 25 30 35

Po

we

r (W

att

s)

Time (seconds)

System

CPU

Memory

Network

(a) block size = 32.

0

5000

10000

15000

20000

25000

0 5 10 15 20 25 30 35

Po

we

r (W

att

s)

Time (seconds)

System

CPU

Memory

Network

(b) block size = 128.

0

5000

10000

15000

20000

25000

0 5 10 15 20 25 30 35

Po

we

r (W

att

s)

Time (seconds)

System

CPU

Memory

Network

(c) block size = 512.

Figure: Impact of the block size onthe power profiles (Watts) of theScaLAPACK Cholesky.

0

5000

10000

15000

20000

25000

0 5 10 15 20 25 30 35

Po

we

r (W

att

s)

Time (seconds)

System

CPU

Memory

Network

(a) tile size = 48.

0

5000

10000

15000

20000

25000

0 5 10 15 20 25 30 35

Po

we

r (W

att

s)

Time (seconds)

System

CPU

Memory

Network

(b) tile size = 192.

0

5000

10000

15000

20000

25000

0 5 10 15 20 25 30 35P

ow

er

(Wa

tts)

Time (seconds)

System

CPU

MemoryNetwork

(c) tile size = 768.

Figure: Impact of the tile size onthe power profiles (Watts) of theDPLASMA Cholesky.

Bosilca, Ltaief, Dongarra (KAUST, UTK) Power Profiling of DLA Algorithms ENAHPC’12 18 / 26

Power Measurements Results

0

5000

10000

15000

20000

25000

0 20 40 60 80 100

Po

we

r (W

att

s)

Time (seconds)

System

CPU

Memory

Network

(a) block size = 32.

0

5000

10000

15000

20000

25000

0 20 40 60 80 100

Po

we

r (W

att

s)

Time (seconds)

System

CPU

Memory

Network

(b) block size = 128.

5000

10000

15000

20000

0 20 40 60 80 100

Po

we

r (W

att

s)

Time (seconds)

System

CPU

Memory

Network

(c) block size = 512.

Figure: Impact of the block size onthe power profiles (Watts) of theScaLAPACK QR Factorization.

0

5000

10000

15000

20000

25000

0 50 100 150 200 250 300 350 400 450

Po

we

r (W

att

s)

Time (seconds)

System

CPU

Memory

Network

(a) tile size = 48.

0

5000

10000

15000

20000

25000

0 20 40 60 80 100

Po

we

r (W

att

s)

Time (seconds)

System

CPU

Memory

Network

(b) tile size = 192.

0

5000

10000

15000

20000

25000

0 20 40 60 80 100P

ow

er

(Wa

tts)

Time (seconds)

System

CPU

Memory

Network

(c) tile size = 768.

Figure: Impact of the tile size onthe power profiles (Watts) of theDPLASMA QR Factorization.

Bosilca, Ltaief, Dongarra (KAUST, UTK) Power Profiling of DLA Algorithms ENAHPC’12 19 / 26

Power Measurements Results

0

1000

2000

3000

4000

5000

6000

7000

0 5 10 15 20 25 30 35

Po

we

r (W

att

s)

Time (seconds)

System

CPU

Memory

Network

(a) Number of cores = 128.

0

5000

10000

15000

20000

25000

0 5 10 15 20 25 30 35

Po

we

r (W

att

s)

Time (seconds)

System

CPU

Memory

Network

(b) Number of cores = 512.

Figure: Impact of the number ofcores on the power profiles (Watts)of the ScaLAPACK CholeskyFactorization.

0

1000

2000

3000

4000

5000

6000

7000

0 5 10 15 20 25 30 35

Po

we

r (W

att

s)

Time (seconds)

System

CPU

Memory

Network

(a) Number of cores = 128.

0

5000

10000

15000

20000

25000

0 5 10 15 20 25 30 35

Po

we

r (W

att

s)

Time (seconds)

System

CPU

Memory

Network

(b) Number of cores = 512.

Figure: Impact of the number ofcores on the power profiles (Watts)of the DPLASMA CholeskyFactorization.

Bosilca, Ltaief, Dongarra (KAUST, UTK) Power Profiling of DLA Algorithms ENAHPC’12 20 / 26

Power Measurements Results

0

1000

2000

3000

4000

5000

6000

7000

0 20 40 60 80 100

Po

we

r (W

att

s)

Time (seconds)

System

CPU

Memory

Network

(a) Number of cores = 128.

0

5000

10000

15000

20000

25000

0 20 40 60 80 100

Po

we

r (W

att

s)

Time (seconds)

System

CPU

Memory

Network

(b) Number of cores = 512.

Figure: Impact of the number ofcores on the power profiles (Watts)of the ScaLAPACK QR.

0

1000

2000

3000

4000

5000

6000

7000

0 20 40 60 80 100

Po

we

r (W

att

s)

Time (seconds)

System

CPU

Memory

Network

(a) Number of cores = 128.

0

5000

10000

15000

20000

25000

0 20 40 60 80 100

Po

we

r (W

att

s)

Time (seconds)

System

CPU

Memory

Network

(b) Number of cores = 512.

Figure: Impact of the number ofcores on the power profiles (Watts)of the DPLASMA QR.

Bosilca, Ltaief, Dongarra (KAUST, UTK) Power Profiling of DLA Algorithms ENAHPC’12 21 / 26

Power Measurements Results

0

5000

10000

15000

20000

25000

0 5 10 15 20 25 30 35

Po

we

r (W

att

s)

Time (seconds)

System

CPU

Memory

Network

(a) ScaLAPACK.

0

5000

10000

15000

20000

25000

0 5 10 15 20 25 30 35

Po

we

r (W

att

s)

Time (seconds)

System

CPU

Memory

Network

(b) DPLASMA.

Figure: Power Profiles of the Cholesky Factorization.

Bosilca, Ltaief, Dongarra (KAUST, UTK) Power Profiling of DLA Algorithms ENAHPC’12 22 / 26

Power Measurements Results

0

5000

10000

15000

20000

25000

0 20 40 60 80 100

Po

we

r (W

att

s)

Time (seconds)

System

CPU

Memory

Network

(a) ScaLAPACK.

0

5000

10000

15000

20000

25000

0 20 40 60 80 100

Po

we

r (W

att

s)

Time (seconds)

System

CPU

Memory

Network

(b) DPLASMA.

Figure: Power Profiles of the QR Factorization.

Bosilca, Ltaief, Dongarra (KAUST, UTK) Power Profiling of DLA Algorithms ENAHPC’12 23 / 26

Power Measurements Results

# Cores Library Cholesky QR

128ScaLAPACK 192000 672000DPLASMA 128000 540000

256ScaLAPACK 240000 816000DPLASMA 96000 540000

512ScaLAPACK 325000 1000000DPLASMA 125000 576000

Figure: Total amount of energy (joule) used for each test based on thenumber of cores

Bosilca, Ltaief, Dongarra (KAUST, UTK) Power Profiling of DLA Algorithms ENAHPC’12 24 / 26

Summary and Future Work

Conclusion

Stressing the system’s componentsDPLASMA Cholesky algorithms decrease the energyconsumption up to 62% compared to ScaLAPACK CholeskyDPLASMA QR algorithms decrease the energy consumptionup to 40% compared to ScaLAPACK QRAsynchronous execution runtime and adapted algorithmscan lead to significantly improved efficiencies and powersaving

Bosilca, Ltaief, Dongarra (KAUST, UTK) Power Profiling of DLA Algorithms ENAHPC’12 25 / 26

Summary and Future Work

What’s next?

Power analysis of advanced numerical algorithms ondistributed systems (two-sided transformations, treereduction, mixed precisions)Comparisons with other DLA libraries: Elemental, Eigen-K,ELPADistributed heterogeneous architecturesScheduler interaction through DVFS/Intel RAPL technologyRunning on IBM BG P/Q and exploit embedded powercollection hardware tools.

Bosilca, Ltaief, Dongarra (KAUST, UTK) Power Profiling of DLA Algorithms ENAHPC’12 26 / 26