
Analysis of Applications on a High Performance–Low Energy Computer

Collaborative Research Center 912: HAEC − Highly Adaptive Energy-Efficient Computing

Florina M. Ciorba, Thomas Ilsche, Elke Franz, Stefan Pfennig, Christian Scheunert,

Ulf Markwardt, Joseph Schuchart, Daniel Hackenberg, Robert Schöne,

Andreas Knüpfer, Wolfgang E. Nagel, Eduard A. Jorswieck, and Matthias S. Müller

7th Workshop on UnConventional High Performance Computing, August 26, 2014, Porto


Talk Outline

•  Motivation
•  Modeling Applications
•  Modeling a High Performance–Low Energy Computer
•  Mapping Applications to Systems
•  Modeling Communication
•  Simulation Results
•  Summary and Future Work


The Challenge

Given a parallel application and a high performance–low energy computer, how can the computer execute the application as fast as possible while consuming the least amount of energy?


Our Approach

•  Simulation and analysis workflow

[Figure: simulation and analysis workflow. Recoverable elements:
  -  parallel application (source code)
  -  desired tracing features (tracing granularity, performance counters, etc.)
  -  desired energy measurements (accuracy, sampling rate, measurement scope, etc.)
  -  instrumented execution (test systems, production systems)
  -  recorded app. trace (existing system)
  -  mapping and trace visualization and analysis
  -  analysis and evaluation of input
  -  simulation input: architecture abstraction models (topology, performance/energy of computation and communication); software abstraction models (mapping, runtime environment, energy-aware software); HAEC Box parameters (latency, bandwidth, errors); energy/utility function; desired simulation goals
  -  haec_sim (simulation)
  -  simulation output: simulated app. trace and mapping (HAEC Box)
  -  analysis and evaluation of simulation
Legend: application, configuration, process, models, display, trace, influences, visualization, feedback]


Our Simulation Framework & State of the Art

State of the art

•  TDS, or traces used in some fashion
•  TDS + Execution-Based Simulation (EBS, replay) (xSim, BigSim, MPI-NetSim, OMNEST, PSINS, SILAS, MPI-SIM)
  -  Offer scalability
  -  Avoid the need to model complex interconnection networks
•  Sequential TDS (DIMEMAS, HeSSE, LogGOPSim, TaskSim, Tsim)
•  Parallel TDS (xSim, BigSim, OMNEST, PSINS, SILAS, SIMCaN)
•  Non-hybrid communication network (xSim, BigSim, DIMEMAS, LogGOPSim, SILAS, TaskSim, Tsim)
•  Hybrid communication network (HeSSE, OMNEST)
•  No focus on energy measurements
•  Focus on I/O architectures: SIMCaN
•  Application OR system performance modeling OR network modeling

Our framework

•  Trace-driven simulation (TDS)
•  No execution-based simulation (replay)
  -  Offers increased accuracy
  -  Increases modeling complexity for the hybrid interconnection networks
•  Parallel TDS
•  Hybrid (& dynamic) communication network
•  Trace format contains energy measurements (performance metrics)
•  Application AND system performance AND energy consumption modeling


Modeling Applications

•  Performance, scalability, and energy
•  NPB lu.C.81 on 6 Taurus nodes, with node-level energy counters (1 Sa/s)

[Figure: trace visualization of the run; annotated duration ~30 s]


Modeling Applications

•  lu.C.81 on Taurus
  -  Accumulated exclusive time: 69.9% communication, 30.1% computation
  -  Very high number of point-to-point (unicast) messages (11,639,408)

[Figures: communication matrix; process graph]


Modeling a High Performance – Low Energy Computer

[Figure: HAEC Box. Circles: compute nodes; blue lines: optical links; green lines: wireless links]

Wireless Interconnections
•  On-chip/on-package antenna fields
•  8x8 or 16x16 Butler matrices
•  Analog/digital beam steering and interference suppression
•  200 GHz channel / bandwidth / operating range
•  100 Gbit/s @ 200 GHz / Z direction
•  10 μs latency
•  1D mesh topology (at the moment)

Optical Interconnections
•  Adaptive analog/digital circuits for E/O transceiver
•  Embedded polymer waveguides
•  Packaging technologies (e.g., 3D stacking of Si/III-V hybrids)
•  Optical switch (MOEMS) for reconfigurable networks
•  250 Gbit/s via 10 optical channels / XY direction
•  1 μs latency
•  2D mesh topology


Mapping Applications onto HAEC Box

[Figure: static mapping of lu.C.81 onto the 3×3×3 HAEC Box; panels: xyz, block xyz, random]

IePLC – inter-process logical communication
IaNLC – intra-node logical (local) communication
IeNLC – inter-node logical communication

Table 1: Comparison of three process-to-node mappings for lu.C.81

Mapping     IePLC        IaNLC       IeNLC        AVG IeNLC   MIN IeNLC   MAX IeNLC
xyz         11,639,408   0           11,639,408   228,223     161,658     242,490
block xyz   11,639,408   4,364,778   7,274,630    173,205     80,829      242,488
random      11,639,408   646,633     10,992,775   99,934      80,829      242,488

the highest number of point-to-point (unicast) communications and a very small number of collective (multicast) communications. We use class C of this problem and execute it with 81 MPI processes (denoted lu.C.81) on 6 compute nodes of our current HPC production system¹ in which five compute nodes had 14 MPI processes while the sixth compute node had 11 MPI processes.

In preparation for the simulation experiments described in §4.2, we used the above strategies to map the lu.C.81 benchmark to the 27 nodes of the HAEC platform. The total number of inter-process logical unicast communications (IePLC) of the benchmark is 11,639,408. These communications are illustrated in Fig. 2b where lu.C.81 is mapped to the HAEC platform using block xyz. As comparison metrics (cf. Table 1), we use the number of intra-node logical communications (IaNLC), number of inter-node logical communications (IeNLC), and the average, minimum, and maximum number of IeNLC between any node pair. The block xyz strategy yields the smallest IeNLC value, which results in the largest IaNLC value. Thus, it is expected that block xyz results in the best overall simulated performance.

In reality, a single MPI process of lu represents more than single ‘thin’ core parallelism (e.g., as it is the case in the multi-zone version of this benchmark). In our approach, we abstract this parallelism and consider that a single MPI process partially or entirely exploits the available intra-node parallelism. When multiple MPI processes are mapped to the same compute node, we assume that they equally share the ‘thin’ cores of the node. In this work we concentrate on the inter-node communication requirements of applications mapped to the HAEC platform, and model them explicitly.

4.2 Communication model and its impact on application performance

It is possible that the HAEC platform topology dynamically changes at runtime given the presence of wireless links. To accurately model the communication behavior of applications running on the HAEC platform, the communication models must account for the shape and characteristics of the interconnection

¹ https://doc.zih.tu-dresden.de/hpc-wiki/bin/view/Compendium/SystemTaurus
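The IaNLC/IeNLC split used to compare mappings can be sketched in a few lines of Python. This is only an illustration: the traffic pattern is a toy one, and the two mapping functions (`cyclic` and `block`) are hypothetical stand-ins, not the definitions of the xyz, block xyz, or random strategies.

```python
from collections import Counter

NUM_NODES = 27   # 3x3x3 HAEC Box
NUM_RANKS = 81   # MPI processes of lu.C.81


def count_logical_comms(messages, node_of):
    """Split logical messages into intra-node (IaNLC) and inter-node (IeNLC) counts.

    messages: dict mapping (src_rank, dst_rank) to a message count
    node_of:  function mapping an MPI rank to a node index (the mapping)
    """
    intra = 0
    inter_per_pair = Counter()
    for (src, dst), count in messages.items():
        a, b = node_of(src), node_of(dst)
        if a == b:
            intra += count
        else:
            # Unordered node pair, so traffic in both directions is pooled.
            inter_per_pair[frozenset((a, b))] += count
    inter = sum(inter_per_pair.values())
    return intra, inter, inter_per_pair


# Hypothetical toy traffic: every rank sends 10 messages to its successor.
messages = {(r, r + 1): 10 for r in range(NUM_RANKS - 1)}


# Two illustrative mappings (assumptions, not the strategies on the slide).
def cyclic(rank):
    return rank % NUM_NODES


def block(rank):
    return rank // (NUM_RANKS // NUM_NODES)


for name, mapping in (("cyclic", cyclic), ("block", block)):
    intra, inter, per_pair = count_logical_comms(messages, mapping)
    print(name, "IaNLC =", intra, "IeNLC =", inter, "IePLC =", intra + inter)
```

Whatever the mapping, IaNLC + IeNLC equals IePLC, which is also how the rows of Table 1 can be cross-checked.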



Modeling Communication for Parallel Applications running on the HAEC Box

•  Message passing
  -  Point-to-point
•  Links
  -  Homogeneous
•  Topology
  -  3D mesh
•  Path selection
  -  Single path
  -  XYZ
•  Routing
  -  Dimension order routing
•  Network coding
  -  Practical network coding
•  Assumptions
  -  Error-free transmission
  -  With acknowledgements

[Figure: communication model hierarchy. Application communication model (e.g., MPI): blocking communication, non-blocking communication; point-to-point, collective, remote memory access. HAEC communication model: links, topology, path selection, network coding; optical communication; performance, energy]
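As an illustration of single-path XYZ (dimension order) routing in a 3D mesh, the following Python sketch corrects the x coordinate first, then y, then z. The linear node numbering in `node_coords` is an assumption made for this example and is not taken from the HAEC Box specification.

```python
def node_coords(node, dim=3):
    """Convert a linear node index into (x, y, z) coordinates of a dim^3 mesh.

    The numbering scheme (x fastest, z slowest) is a hypothetical choice.
    """
    return (node % dim, (node // dim) % dim, node // (dim * dim))


def xyz_path(src, dst):
    """Return the coordinates visited from src to dst, one hop per entry.

    Dimension order routing: finish all x hops, then all y hops, then all z hops.
    """
    pos = list(src)
    path = [tuple(pos)]
    for axis in range(3):
        step = 1 if dst[axis] > pos[axis] else -1
        while pos[axis] != dst[axis]:
            pos[axis] += step
            path.append(tuple(pos))
    return path


# Corner-to-corner route in the 3x3x3 mesh: the longest possible path.
path = xyz_path(node_coords(0), node_coords(26))
hops = len(path) - 1
print(path, hops)
```

In a 3×3×3 mesh each dimension contributes at most 2 hops, so a path has between 0 hops (same node) and 6 hops (opposite corners).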



Multicast: Routing vs Network Coding

Multicast: S wants to transmit both messages m1 and m2 to E and F
Topology: butterfly

Routing (RT): two timeslots for transmitting m1 and m2 over C-D to both E and F

Network coding (NC): one timeslot for transmitting m1 and m2 over C-D to E and F
→ Reduces delay and energy costs, increases throughput

[Figure: butterfly network with nodes S, A, B, C, D, E, F, shown three times. With routing, link C-D carries m1 in one timeslot and m2 in another; with network coding, link C-D carries the single combined message m12]
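A minimal sketch of the butterfly case, assuming the combined message m12 is the bitwise XOR of m1 and m2 (the usual textbook choice; the slide does not state which combination is used). The payloads are made-up examples.

```python
def xor_bytes(a, b):
    """Bitwise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("messages must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


m1 = b"HAEC"   # hypothetical payload, forwarded by A to both C and E
m2 = b"Box!"   # hypothetical payload, forwarded by B to both C and F

# C combines both messages and sends one coded message over C-D.
m12 = xor_bytes(m1, m2)

# D forwards m12 to E and F. Each sink already holds one plain message
# and recovers the other with a second XOR.
m2_at_E = xor_bytes(m1, m12)
m1_at_F = xor_bytes(m2, m12)
print(m2_at_E, m1_at_F)
```

The link C-D is used once instead of twice, which is the source of the delay and energy savings claimed for network coding.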



Unicast: Routing vs Network Coding

Unicast: S wants to transmit a message (as data packets) to C
Topology: linear array
Unreliable links: failures or attacks

Routing (RT): a data packet lost over A-B has to be resent

Network coding (NC): further linearly independent combinations are sufficient

[Figure: linear array S, A, B, C, shown twice. With routing, the packets p1, p2, p3, ... are forwarded individually and a lost packet is resent; with network coding, linear combinations such as p1 + p2, p1 + 4 p2, and p3 + 2 p4 are sent]
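Why "further linearly independent combinations are sufficient" can be shown with a small decoder. The sketch below solves for the original packets by Gauss-Jordan elimination over a prime field; the field GF(257) and the scalar packet values are assumptions chosen for readability, not parameters of the coding scheme used in haec_sim.

```python
P = 257  # small prime field, chosen for illustration only


def decode(coeff_rows, payloads, p=P):
    """Recover the original packets from received linear combinations (mod p).

    coeff_rows: one coefficient vector per received coded packet
    payloads:   the corresponding coded values
    Returns the decoded packets, or None if the received combinations are
    not linearly independent yet (more coded packets are needed).
    """
    n = len(coeff_rows[0])
    rows = [list(c) + [v] for c, v in zip(coeff_rows, payloads)]
    pivot_row = 0
    for col in range(n):
        # Find a row at or below pivot_row with a non-zero entry in this column.
        found = next(
            (r for r in range(pivot_row, len(rows)) if rows[r][col] % p), None
        )
        if found is None:
            return None
        rows[pivot_row], rows[found] = rows[found], rows[pivot_row]
        inv = pow(rows[pivot_row][col], -1, p)  # modular inverse (Python 3.8+)
        rows[pivot_row] = [(x * inv) % p for x in rows[pivot_row]]
        for r in range(len(rows)):
            if r != pivot_row and rows[r][col] % p:
                factor = rows[r][col]
                rows[r] = [
                    (x - factor * y) % p for x, y in zip(rows[r], rows[pivot_row])
                ]
        pivot_row += 1
    return [rows[i][n] for i in range(n)]


# Hypothetical packets p1 = 42 and p2 = 7. The receiver got p1 + p2 and
# p1 + 4*p2; any other coded packet may have been lost on the way.
print(decode([[1, 1], [1, 4]], [49, 70]))
```

The receiver does not need a specific packet to be resent: any two independent combinations of p1 and p2 decode both, whereas dependent ones (for example p1 + p2 and 2 p1 + 2 p2) make `decode` return `None`.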



Modeling Communication Delays

[Figure: delay model for a transfer between node j and node j+1. Each node is drawn with the layers application, memory, processor, and network; the channel between the nodes comprises encoding, transmission, and decoding. The data packet path is annotated with $d_{mpi}$, $d_s$ | $d_i$ | $d_a$, $d_{out}$, $l + s_p / b$, $d_{in}$, $d_r$ | $d_i$ | $d_a$, $d_{mpi}$, and $d_{h,p}$; the acknowledgment path with $d_{in}$, $l + s_a / b$, $d_{out}$, and $d_{h,a}$]

$d_s$      process a data packet of size $s_p$ by the sender
$d_r$      process a data packet of size $s_p$ by the receiver
$d_i$      process a data packet by an intermediate node
$d_a$      process an acknowledgment of size $s_a$
$d_{h,p}$  send a data packet over one hop
$d_{h,a}$  send an acknowledgment over one hop
$d_{out}$  write out to channel
$d_{in}$   read in from channel
$d_{mpi}$  write out to/read in from network buffer
$l$        latency for channel coding



Modeling Transfer Times

•  Transfer time $tt(x)$ for sending $x > 0$ packets over $h \in [0,6]$ hops without errors or acknowledgments

Assumption: $d_{h,p} \ge d_s$ and $d_{h,p} \ge d_r$

$$tt(x) = 2 \cdot d_{mpi} + d_s + (h + x - 1) \cdot d_{h,p} + (h - 1) \cdot d_i + d_r \quad \forall\, h > 0 \qquad (1)$$

$$tt(x) = 2 \cdot d_{mpi} \quad \text{if } h = 0 \text{ (intra-node communication)}$$

•  Complete transfer time $T(n_p)$ for sending $n_p$ packets over $h \in [0,6]$ hops without errors, with acknowledgments (only the final ACK/generation needs to be considered)

$$T(n_p) = tt(s_w) \cdot n_w + tt(n_r) + h \cdot (n_w + \lceil n_r / s_w \rceil) \cdot (d_{h,a} + d_a) \qquad (2)$$
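Equations (1) and (2) translate directly into code. In the sketch below the symbols s_w, n_w, and n_r are passed in as plain parameters because the slide does not define them; the values used for d_mpi, d_i, and d_a are hypothetical placeholders, while the other delays are the ones listed among the haec_sim simulation parameters.

```python
import math


def tt(x, h, d_mpi, d_s, d_r, d_i, d_hp):
    """Transfer time of x packets over h hops, following Eq. (1).

    Assumes d_hp is at least as large as d_s and d_r, as stated on the slide.
    """
    if not x > 0:
        raise ValueError("Eq. (1) is defined for x > 0 packets only")
    if h == 0:
        return 2 * d_mpi  # intra-node communication
    return 2 * d_mpi + d_s + (h + x - 1) * d_hp + (h - 1) * d_i + d_r


def complete_transfer_time(n_w, s_w, n_r, h, d_ha, d_a, **delays):
    """Complete transfer time with acknowledgments, following Eq. (2).

    n_w, s_w, n_r are used exactly as they appear in Eq. (2); their meaning
    is not spelled out on the slide. This sketch requires n_r > 0.
    """
    ack_term = h * (n_w + math.ceil(n_r / s_w)) * (d_ha + d_a)
    return tt(s_w, h, **delays) * n_w + tt(n_r, h, **delays) + ack_term


# Delays in ns. d_mpi, d_i, and d_a are hypothetical placeholder values.
delays = dict(d_mpi=50.0, d_s=200.0, d_r=200.0, d_i=100.0, d_hp=1209.216)

one_packet_one_hop = tt(1, 1, **delays)
total = complete_transfer_time(
    n_w=2, s_w=4, n_r=3, h=2, d_ha=1200.192, d_a=50.0, **delays
)
print(one_packet_one_hop, total)
```

The term (h + x - 1) · d_{h,p} reflects pipelining: the first packet needs h hops, and each further packet arrives one per-hop delay later.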



Modeling Communication for Parallel Applications running on the HAEC Box

[Figure: XYZ path selection for lu.C.81 communication over the physical links of the 3×3×3 HAEC Box; panels: xyz mapping, block xyz mapping, random mapping]

IePLC – inter-process logical communication
IePPC – inter-process physical communication
IaNPC – intra-node physical (local) communication
IeNPC – inter-node physical communication

Table 2: Comparison between process-to-node mappings for lu.C.81

Mapping     IePLC        IePPC        IaNPC       IeNPC        AVG IeNPC   MIN IeNPC   MAX IeNPC
xyz         11,639,408   16,004,186   0           16,004,186   333,420     121,242     484,976
block xyz   11,639,408   14,549,260   4,364,778   10,184,482   212,176     80,829      484,976
random      11,639,408   31,280,908   646,633     30,364,275   567,301     161,657     1,050,780


Modeling the Performance of Communication in Parallel Applications on the HAEC Box

•  Simulation parameters (haec_sim)
  -  latency 1 μs
  -  bandwidth 250 Gbit/s
  -  packet size 288 bytes
  -  delay per packet per hop 1,209.216 ns
  -  delay per ACK per hop 1,200.192 ns
  -  sender delay 200 ns or 203.125 ns
  -  receiver delay 200 ns or 215.625 ns

[Figure: lu.C.81 on Taurus (69.9% communication, 30.1% computation) compared with lu.C.81 on the HAEC Box (xyz mapping) under dimension order routing and practical network coding; regions annotated "faster" and "slower"; time labels 41.793 s and 23.7-24.1 s]
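The per-hop delays among the simulation parameters can be related to latency, bandwidth, and packet size with a short calculation. The sketch assumes that the per-hop packet delay is d_out + l + s_p/b + d_in; under that assumption the slide's numbers imply d_out + d_in = 200 ns and an acknowledgment size of 48 bits. Both derived values are inferences, not figures stated in the talk.

```python
LATENCY_NS = 1_000.0           # l: 1 μs link latency
BANDWIDTH_BIT_PER_NS = 250.0   # b: 250 Gbit/s equals 250 bit/ns
PACKET_BYTES = 288             # s_p
DELAY_PER_PACKET_HOP_NS = 1_209.216
DELAY_PER_ACK_HOP_NS = 1_200.192

# Time to push one data packet onto the link: s_p / b.
serialization_ns = PACKET_BYTES * 8 / BANDWIDTH_BIT_PER_NS

# Channel part of the per-hop delay: l + s_p / b.
channel_ns = LATENCY_NS + serialization_ns

# Assumption: whatever remains of the per-hop delay is d_out + d_in.
io_overhead_ns = DELAY_PER_PACKET_HOP_NS - channel_ns

# Under the same assumption, the ACK delay implies the ACK size s_a in bits.
ack_bits = round(
    (DELAY_PER_ACK_HOP_NS - LATENCY_NS - io_overhead_ns) * BANDWIDTH_BIT_PER_NS
)

print(serialization_ns, channel_ns, io_overhead_ns, ack_bits)
```

With a 1 μs latency, serialization of a 288-byte packet at 250 Gbit/s accounts for less than 1% of the per-hop delay.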




[Figure: lu.C.81 on Taurus (69.9% communication, 30.1% computation) compared with lu.C.81 on the HAEC Box (random mapping); regions annotated "slower"; time labels 41.793 s and 23.7-24.1 s]







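[Figure: lu.C.81 on Taurus (69.9% communication, 30.1% computation) compared with lu.C.81 on the HAEC Box (block xyz mapping); regions annotated "faster" and "slower", and marked relative to the other mappings as either less than both xyz and random, or greater than xyz and less than random; time labels 41.793 s and 23.7-24.1 s]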





Summary

•  HAEC Box: unconventional architecture sharing important concerns with HPC systems
  -  Performance and energy (computation + communication)
•  Two communication models
  -  Dimension order routing
  -  Practical network coding (novel for HPC applications)
•  Simulation-based performance analysis using a trace-driven simulator (haec_sim)


Future Work

•  Model more applications (HPC and beyond)
  -  Point-to-point communication
  -  Collective communication
  -  Combinations thereof
•  Develop energy consumption models
  -  Computation and communication operations
•  Develop optimal mapping strategies
  -  Communication- and topology-aware
•  Extend the communication models
  -  Point-to-point: with errors/attacks
  -  Collective: without and with errors/attacks
  -  Heterogeneous links (dynamic latency, bandwidth, path selection, topology)
•  Simulation
  -  Implement local resource managers (nodes, links) to enable contention modeling
  -  Implement runtime process migration (after optimal initial mapping)


Thank you

HAEC website: http://tu-dresden.de/sfb912
