Analysis of Applications on a High Performance–Low Energy Computer
Collaborative Research Center 912: HAEC − Highly Adaptive Energy-Efficient Computing
Florina M. Ciorba, Thomas Ilsche, Elke Franz, Stefan Pfennig, Christian Scheunert,
Ulf Markwardt, Joseph Schuchart, Daniel Hackenberg, Robert Schöne,
Andreas Knüpfer, Wolfgang E. Nagel, Eduard A. Jorswieck, and Matthias S. Müller
7th Workshop on UnConventional High Performance Computing, August 26, 2014, Porto
Talk Outline
¨ Motivation
¨ Modeling Applications
¨ Modeling a High Performance–Low Energy Computer
¨ Mapping Application to Systems
¨ Modeling Communication
¨ Simulation Results
¨ Summary and Future Work
The Challenge
Given a parallel application and a high performance-low energy computer,
how can the computer execute the application as fast as possible while consuming the
least amount of energy?
Our Approach
¨ Simulation and analysis workflow
[Figure: simulation and analysis workflow.
Boxes: parallel application (source code); instrumented execution (test systems, production systems); recorded app. trace (existing system); trace visualization and analysis; mapping and simulation (haec_sim); simulated app. trace and mapping (HAEC Box); analysis and evaluation of input; analysis and evaluation of simulation output.
Inputs and models: desired tracing features (tracing granularity, performance counters, etc.); desired energy measurements (accuracy, sampling rate, measurement scope, etc.); architecture abstraction models (topology, performance/energy of computation and communication); software abstraction models (mapping, runtime environment, energy-aware software); HAEC Box parameters (latency, bandwidth, errors); energy/utility function; desired simulation goals; simulation input.
Legend: application, configuration, process, models, display, trace, influences, visualization, feedback.]
Our Simulation Framework & State of the Art
State of the art
¨ TDS or use traces in some fashion
¨ TDS+Execution-Based Simulation (EBS, replay) (xSim, BigSim, MPI-NetSim, OMNEST, PSINS, SILAS, MPI-SIM)
¤ Offer scalability
¤ Avoid the need to model complex interconnection networks
¨ Sequential TDS (DIMEMAS, HeSSE, LogGOPSim, TaskSim, Tsim)
¨ Parallel TDS (xSim, BigSim, OMNEST, PSINS, SILAS, SIMCaN)
¨ Non-hybrid communication network (xSim, BigSim, DIMEMAS, LogGOPSim, SILAS, TaskSim, Tsim)
¨ Hybrid communication network (HeSSE, OMNEST)
¨ No focus on energy measurements
¨ Focus on I/O architectures: SIMCaN
¨ Application OR system performance modeling OR network modeling
Our framework
¨ Trace-driven simulation (TDS)
¨ No execution-based simulation (replay)
¤ Offers increased accuracy
¤ Increases modeling complexity for the hybrid interconnection networks
¨ Parallel TDS
¨ Hybrid (& dynamic) communication network
¨ Trace format contains energy measurements (performance metrics)
¨ Application AND system performance AND energy consumption modeling
Modeling Applications
¨ Performance, scalability, and energy
¨ NPB lu.C.81 on 6 Taurus nodes and node-level energy counters (1 Sa/s)
[Figure annotation: ~ 30 s]
Modeling Applications
¨ lu.C.81 on Taurus
¤ Accumulated exclusive time: 69.9% communication, 30.1% computation
¤ Very high number of point-to-point (unicast) messages (11,639,408)
[Figures: communication matrix; process graph]
Modeling a High Performance – Low Energy Computer
[Figure: HAEC Box. Circles: compute nodes; blue lines: optical links; green lines: wireless links]
Wireless Interconnections
• On-chip/on-package antenna fields
• 8x8 or 16x16 Butler matrices
• Analog/digital beam steering and interference suppression
• 200 GHz channel / bandwidth / operating range
• 100 Gbit/s @ 200 GHz / Z direction
• 10 μs latency
• 1D mesh topology (at the moment)
Optical Interconnections
• Adaptive analog/digital circuits for E/O transceiver
• Embedded polymer waveguides
• Packaging technologies (e.g., 3D stacking of Si/III-V hybrids)
• Optical switch (MOEMS) for reconfigurable networks
• 250 Gbit/s via 10 optical channels / XY direction
• 1 μs latency
• 2D mesh topology
Mapping Applications onto HAEC Box
[Figure: static mapping of lu.C.81 onto the 3×3×3 HAEC Box; panels: xyz, block xyz, random]
IePLC – inter-process logical communication
IaNLC – intra-node logical (local) communication
IeNLC – inter-node logical communication

Table 1: Comparison of three process-to-node mappings for lu.C.81
Mapping     IePLC        IaNLC       IeNLC        AVG IeNLC   MIN IeNLC   MAX IeNLC
xyz         11,639,408   0           11,639,408   228,223     161,658     242,490
block xyz   11,639,408   4,364,778   7,274,630    173,205     80,829      242,488
random      11,639,408   646,633     10,992,775   99,934      80,829      242,488

Table 2: Comparison between process-to-node mappings for lu.C.81
Mapping     IePLC        IePPC        IaNPC       IeNPC        AVG IeNPC   MIN IeNPC   MAX IeNPC
xyz         11,639,408   16,004,186   0           16,004,186   333,420     121,242     484,976
block xyz   11,639,408   14,549,260   4,364,778   10,184,482   212,176     80,829      484,976
random      11,639,408   31,280,908   646,633     30,364,275   567,301     161,657     1,050,780

[...] the highest number of point-to-point (unicast) communications and a very small number of collective (multicast) communications. We use class C of this problem and execute it with 81 MPI processes (denoted lu.C.81) on 6 compute nodes of our current HPC production system¹ in which five compute nodes had 14 MPI processes while the sixth compute node had 11 MPI processes.
In preparation for the simulation experiments described in §4.2, we used the above strategies to map the lu.C.81 benchmark to the 27 nodes of the HAEC platform. The total number of inter-process logical unicast communications (IePLC) of the benchmark is 11,639,408. These communications are illustrated in Fig. 2b where lu.C.81 is mapped to the HAEC platform using block xyz. As comparison metrics (cf. Table 1), we use the number of intra-node logical communications (IaNLC), number of inter-node logical communications (IeNLC), and the average, minimum, and maximum number of IeNLC between any node pair. The block xyz strategy yields the smallest IeNLC value, which results in the largest IaNLC value. Thus, it is expected that block xyz results in the best overall simulated performance.
In reality, a single MPI process of lu represents more than single ‘thin’ core parallelism (e.g., as it is the case in the multi-zone version of this benchmark). In our approach, we abstract this parallelism and consider that a single MPI process partially or entirely exploits the available intra-node parallelism. When multiple MPI processes are mapped to the same compute node, we assume that they equally share the ‘thin’ cores of the node. In this work we concentrate on the inter-node communication requirements of applications mapped to the HAEC platform, and model them explicitly.
4.2 Communication model and its impact on application performance
It is possible that the HAEC platform topology dynamically changes at runtime given the presence of wireless links. To accurately model the communication behavior of applications running on the HAEC platform, the communication models must account for the shape and characteristics of the interconnection [...]
¹ https://doc.zih.tu-dresden.de/hpc-wiki/bin/view/Compendium/SystemTaurus
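The logical-communication metrics used to compare mappings (IePLC, IaNLC, IeNLC) can be computed for any process-to-node mapping in a few lines. The following is a minimal sketch, not the authors' tool. All helper names are hypothetical; the round-robin and blocked placements are only one plausible reading of "xyz" and "block xyz"; and the 9×9 nearest-neighbour traffic with one message per directed pair is an assumed stand-in for the recorded lu.C.81 trace.

```python
import random
from collections import Counter

N_PROCS, N_NODES = 81, 27  # lu.C.81 on the 3x3x3 HAEC Box


def round_robin_mapping():
    # ASSUMED reading of "xyz": process p is placed on node p mod 27.
    return [p % N_NODES for p in range(N_PROCS)]


def blocked_mapping():
    # ASSUMED reading of "block xyz": 3 consecutive processes per node.
    return [p // (N_PROCS // N_NODES) for p in range(N_PROCS)]


def random_mapping(seed=0):
    # Random placement that still puts exactly 3 processes on each node.
    slots = [n for n in range(N_NODES) for _ in range(N_PROCS // N_NODES)]
    random.Random(seed).shuffle(slots)
    return slots


def grid_traffic(side=9, msgs=1):
    # HYPOTHETICAL traffic: every process sends `msgs` messages to each of
    # its north/south/east/west neighbours on a side x side process grid.
    traffic = Counter()
    for p in range(side * side):
        r, c = divmod(p, side)
        for dr, dc in ((0, 1), (0, -1), (1, 0), (-1, 0)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < side and 0 <= cc < side:
                traffic[(p, rr * side + cc)] += msgs
    return traffic


def count_logical(mapping, traffic):
    # Returns (IePLC, IaNLC, IeNLC) for a mapping and a traffic matrix.
    total = sum(traffic.values())
    intra = sum(n for (src, dst), n in traffic.items()
                if mapping[src] == mapping[dst])
    return total, intra, total - intra


if __name__ == "__main__":
    traffic = grid_traffic()
    for name, mapping in (("round robin", round_robin_mapping()),
                          ("blocked", blocked_mapping()),
                          ("random", random_mapping())):
        print(name, count_logical(mapping, traffic))
```

Under these assumptions the round-robin placement keeps no message on-node, while the blocked placement keeps 108 of 288 messages (37.5%) on-node, the same fraction as 4,364,778 of 11,639,408 in Table 1.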
Modeling Communication for Parallel Applications running on the HAEC Box
¤ Message passing: point-to-point
¤ Links: homogeneous
¤ Topology: 3D mesh
¤ Path selection: single path, XYZ
¤ Routing: dimension order routing
¤ Network coding: practical network coding
¤ Assumptions: error-free transmission, with acknowledgements
[Figure: application communication model (e.g., MPI): point-to-point, collective, remote memory access; blocking and non-blocking communication. HAEC communication model: links, topology, path selection, network coding; optical communication; performance, energy.]
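Single-path XYZ selection with dimension order routing on a 3D mesh can be illustrated as follows. This is a small sketch under stated assumptions: the function names and the node numbering (x varies fastest, then y, then z) are hypothetical, and the mesh is assumed to have no wrap-around links.

```python
def node_coords(node, dims=(3, 3, 3)):
    # ASSUMED numbering: x varies fastest, then y, then z.
    x = node % dims[0]
    y = (node // dims[0]) % dims[1]
    z = node // (dims[0] * dims[1])
    return (x, y, z)


def dor_path(src, dst):
    # Dimension order routing on a mesh without wrap-around links:
    # correct the x coordinate first, then y, then z.
    path = [tuple(src)]
    cur = list(src)
    for axis in range(3):
        step = 1 if dst[axis] > cur[axis] else -1
        while cur[axis] != dst[axis]:
            cur[axis] += step
            path.append(tuple(cur))
    return path


def hop_count(src, dst):
    # Number of links traversed along the XYZ path.
    return len(dor_path(src, dst)) - 1


if __name__ == "__main__":
    print(dor_path((0, 0, 0), (2, 1, 0)))
    print(hop_count(node_coords(0), node_coords(26)))
```

In a 3×3×3 mesh the longest such path, corner to opposite corner, has 2 + 2 + 2 = 6 hops.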
Multicast: Routing vs Network Coding
Multicast: S wants to transmit both messages m1 and m2 to E and F
Topology: butterfly
Routing (RT): two timeslots for transmitting m1 and m2 over C-D to both E and F
Network coding (NC): one timeslot for transmitting m1 and m2 over C-D to E and F → reduces delay and energy costs, increases throughput
[Figure: butterfly network with nodes S, A, B, C, D, E, F. With routing, link C-D carries m1 and m2 in separate timeslots; with network coding, link C-D carries the combined message m12.]
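The butterfly example can be made concrete with a few lines of code. The sketch below assumes that the combined message m12 is the bytewise XOR of m1 and m2, which is the usual textbook choice but is not stated on the slide; the function names and message contents are hypothetical.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    # Bytewise XOR of two equally long messages.
    if len(a) != len(b):
        raise ValueError("messages must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def butterfly_demo(m1: bytes, m2: bytes):
    # C combines both messages and sends ONE coded message over C-D
    # (ASSUMPTION: the combination is XOR).
    m12 = xor_bytes(m1, m2)
    # E already holds m1 (received via A) and recovers m2 from m12.
    m2_at_e = xor_bytes(m1, m12)
    # F already holds m2 (received via B) and recovers m1 from m12.
    m1_at_f = xor_bytes(m2, m12)
    return m12, m2_at_e, m1_at_f


if __name__ == "__main__":
    coded, at_e, at_f = butterfly_demo(b"HAEC", b"BOX!")
    print(coded.hex(), at_e, at_f)
```

Because XOR is its own inverse, each sink needs only the message it received directly plus the single coded message, so link C-D is used once instead of twice.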
Unicast: Routing vs Network Coding
Unicast: S wants to transmit a message (as data packets) to C
Topology: linear array
Unreliable links: failures or attacks
Routing (RT): a data packet lost over A-B has to be resent
Network coding (NC): further linearly independent combinations are sufficient
[Figure: linear array with nodes S, A, B, C. Routing: packets p1, p2, p3, ...; p2 is lost over A-B and resent. Network coding: coded packets such as p1 + p2, p1 + 4 p2, and p3 + 2 p4 are forwarded; the combinations that arrive are enough for decoding.]
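Why any set of linearly independent combinations suffices can be shown with a short decoder. This is an illustrative sketch only: the prime field size 257, the function names, and the packet contents are assumptions, and practical network coding schemes typically work over binary extension fields rather than a prime field.

```python
P = 257  # ASSUMED prime field size; large enough for one byte per symbol


def encode(packets, coeffs):
    # One coded packet: symbol-wise linear combination of the source
    # packets, computed modulo P.
    return [sum(c * pkt[i] for c, pkt in zip(coeffs, packets)) % P
            for i in range(len(packets[0]))]


def decode(coeff_rows, coded):
    # Gauss-Jordan elimination modulo P on the matrix [coeffs | coded].
    n = len(coeff_rows[0])
    rows = [list(c) + list(d) for c, d in zip(coeff_rows, coded)]
    for col in range(n):
        pivot = next((r for r in range(col, len(rows))
                      if rows[r][col] % P), None)
        if pivot is None:
            raise ValueError("combinations are not linearly independent")
        rows[col], rows[pivot] = rows[pivot], rows[col]
        inv = pow(rows[col][col], -1, P)  # modular inverse (Python 3.8+)
        rows[col] = [v * inv % P for v in rows[col]]
        for r in range(len(rows)):
            if r != col and rows[r][col] % P:
                f = rows[r][col]
                rows[r] = [(v - f * w) % P
                           for v, w in zip(rows[r], rows[col])]
    return [row[n:] for row in rows[:n]]


if __name__ == "__main__":
    p1, p2 = [10, 20, 30], [7, 8, 9]
    sent = [(1, 1), (2, 3), (1, 4)]          # coefficient vectors
    coded = [encode([p1, p2], c) for c in sent]
    # Suppose the second coded packet is lost: the other two still decode.
    print(decode([sent[0], sent[2]], [coded[0], coded[2]]))
```

The receiver does not need a specific packet to be resent; any further combination that is linearly independent of those already received restores decodability.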
Modeling Communication Delays
[Figure: delay model for a transfer from node j to node j+1. Each node is shown with application, memory, processor, and network; the channel between the nodes covers encoding, transmission, and decoding. Delay labels in the figure: $d_{mpi}$; $d_s$ | $d_i$ | $d_a$; $d_{out}$; $l + s_p/b$; $d_{in}$; $d_r$ | $d_i$ | $d_a$; $d_{h,p}$; $l + s_a/b$; $d_{h,a}$.]
$d_s$: process a data packet of size $s_p$ by the sender
$d_r$: process a data packet of size $s_p$ by the receiver
$d_i$: process a data packet by an intermediate node
$d_a$: process an acknowledgment of size $s_a$
$d_{h,p}$: send a data packet over one hop
$d_{h,a}$: send an acknowledgment over one hop
$d_{out}$: write out to channel
$d_{in}$: read in from channel
$d_{mpi}$: write out to/read in from network buffer
$l$: latency for channel coding
Modeling Transfer Times
¨ Transfer time $t_t(x)$ for sending $x > 0$ packets over $h \in [0,6]$ hops without errors or acknowledgments
Assumption: $d_{h,p} \ge d_s$ and $d_{h,p} \ge d_r$
$t_t(x) = 2 \cdot d_{mpi} + d_s + (h + x - 1) \cdot d_{h,p} + (h - 1) \cdot d_i + d_r \quad \forall h > 0$    (1)
$t_t(x) = 2 \cdot d_{mpi}$ if $h = 0$ (intra-node communication)
¨ Complete transfer time $T(n_p)$ for sending $n_p$ packets over $h \in [0,6]$ hops without errors, with acknowledgments (only the final ACK/generation needs to be considered)
$T(n_p) = t_t(s_w) \cdot n_w + t_t(n_r) + h \cdot (n_w + \lceil n_r / s_w \rceil) \cdot (d_{h,a} + d_a)$    (2)
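Equations (1) and (2) translate directly into code. The sketch below is a hedged illustration rather than the haec_sim implementation. It assumes that s_w is the number of packets per window (generation), n_w the number of full windows, and n_r the number of remaining packets, so that n_p = n_w · s_w + n_r; these meanings are not spelled out on the slide. It also assumes that the t_t(n_r) term is dropped when n_r = 0. All names and the delay values in the demo are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Delays:
    d_mpi: float  # write out to / read in from network buffer
    d_s: float    # sender processing of a data packet
    d_r: float    # receiver processing of a data packet
    d_i: float    # intermediate-node processing of a data packet
    d_a: float    # processing of an acknowledgment
    d_hp: float   # data packet over one hop
    d_ha: float   # acknowledgment over one hop


def transfer_time(x, h, d):
    # Eq. (1): x > 0 packets over h hops, no errors, no acknowledgments.
    if x <= 0:
        raise ValueError("x must be positive")
    if h == 0:
        return 2 * d.d_mpi  # intra-node communication
    return (2 * d.d_mpi + d.d_s + (h + x - 1) * d.d_hp
            + (h - 1) * d.d_i + d.d_r)


def complete_transfer_time(n_p, s_w, h, d):
    # Eq. (2). ASSUMED reading: n_w full windows of s_w packets plus
    # n_r remaining packets; one acknowledgment per window and per hop.
    n_w, n_r = divmod(n_p, s_w)
    total = n_w * transfer_time(s_w, h, d)
    if n_r:  # ASSUMPTION: no remainder term when n_r == 0
        total += transfer_time(n_r, h, d)
    total += h * (n_w + math.ceil(n_r / s_w)) * (d.d_ha + d.d_a)
    return total


if __name__ == "__main__":
    # Hypothetical delay values, chosen only to make the arithmetic easy.
    d = Delays(d_mpi=1, d_s=2, d_r=3, d_i=4, d_a=1, d_hp=10, d_ha=5)
    print(transfer_time(3, 2, d))
    print(complete_transfer_time(5, 2, 2, d))
```

The (h + x - 1) factor reflects pipelining: the first packet needs h hop delays to arrive, and each further packet follows one hop delay later.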
Modeling Communication for Parallel Applications running on the HAEC Box
[Figure: XYZ path selection for lu.C.81 communication over the physical links of the 3×3×3 HAEC Box; panels: xyz mapping, block xyz mapping, random mapping]
IePLC – inter-process logical communication
IePPC – inter-process physical communication
IaNPC – intra-node physical (local) communication
IeNPC – inter-node physical communication
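The step from logical to physical communication counts can be sketched as follows. The counting rule used here is an assumption inferred from the metric definitions: a message between processes on the same node counts as one intra-node physical communication, and a message between different nodes counts once per hop of its XYZ path. The function names and the tiny example data are hypothetical.

```python
def hops(a, b):
    # Hop count of the XYZ path on a mesh without wrap-around links,
    # i.e. the Manhattan distance between the node coordinates.
    return sum(abs(p - q) for p, q in zip(a, b))


def count_physical(mapping, coords, traffic):
    # mapping: process -> node, coords: node -> (x, y, z),
    # traffic: (src process, dst process) -> number of messages.
    # Returns (IePPC, IaNPC, IeNPC) under the ASSUMED counting rule.
    intra = inter = 0
    for (src, dst), n in traffic.items():
        h = hops(coords[mapping[src]], coords[mapping[dst]])
        if h == 0:
            intra += n          # stays on the node
        else:
            inter += n * h      # one physical transfer per hop
    return intra + inter, intra, inter


if __name__ == "__main__":
    coords = {0: (0, 0, 0), 1: (1, 0, 0), 2: (2, 1, 0)}
    mapping = [0, 0, 1, 2]                      # 4 processes on 3 nodes
    traffic = {(0, 1): 5, (1, 2): 2, (0, 3): 1}
    print(count_physical(mapping, coords, traffic))
```

With this rule, a mapping that places communicating processes on distant nodes inflates the physical count well beyond the logical one, which is the effect the random mapping shows in Table 2.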
Modeling the Performance of Communication in Parallel Applications on the HAEC Box
[Figure: lu.C.81 on Taurus (69.9% communication, 30.1% computation) compared with lu.C.81 on the HAEC Box (xyz mapping) under dimension order routing and practical network coding; results are annotated "faster" or "slower"; times shown: 41.793 s and 23.7-24.1 s]
¨ Simulation parameters (haec_sim)
¤ latency 1 μs
¤ bandwidth 250 Gbit/s
¤ packet size 288 bytes
¤ delay per packet per hop 1,209.216 ns
¤ delay per ACK per hop 1,200.192 ns
¤ sender delay 200 ns or 203.125 ns
¤ receiver delay 200 ns or 215.625 ns
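The per-hop delays among the simulation parameters can be cross-checked against the latency, bandwidth, and packet size. The sketch below shows one decomposition that reproduces the quoted values; it is an inference, not a statement of how haec_sim computes them. In particular, the 200 ns fixed overhead per hop and the 6-byte acknowledgment size are assumptions chosen because they match the totals.

```python
import math

LATENCY_NS = 1000.0        # link latency of 1 us (from the slide)
BANDWIDTH_GBIT_S = 250.0   # link bandwidth (from the slide)
PACKET_BYTES = 288         # data packet size (from the slide)
FIXED_OVERHEAD_NS = 200.0  # ASSUMPTION: channel write-out plus read-in
ACK_BYTES = 6              # ASSUMPTION: inferred from the 0.192 ns remainder


def serialization_ns(size_bytes, bandwidth_gbit_s=BANDWIDTH_GBIT_S):
    # bits divided by Gbit/s gives nanoseconds.
    return size_bytes * 8 / bandwidth_gbit_s


def per_hop_delay_ns(size_bytes):
    # Assumed decomposition: fixed overhead + latency + serialization.
    return FIXED_OVERHEAD_NS + LATENCY_NS + serialization_ns(size_bytes)


if __name__ == "__main__":
    packet = per_hop_delay_ns(PACKET_BYTES)
    ack = per_hop_delay_ns(ACK_BYTES)
    print(f"packet per hop: {packet:.3f} ns")
    print(f"ACK per hop:    {ack:.3f} ns")
    print(math.isclose(packet, 1209.216), math.isclose(ack, 1200.192))
```

At 250 Gbit/s a 288-byte packet takes 9.216 ns to serialize, so the 1 μs link latency dominates the per-hop delay by two orders of magnitude.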
Modeling the Performance of Communication in Parallel Applications on the HAEC Box
[Figure: lu.C.81 on Taurus (69.9% communication, 30.1% computation) compared with lu.C.81 on the HAEC Box (random mapping) under dimension order routing and practical network coding; results are annotated "slower"; times shown: 41.793 s and 23.7-24.1 s]
¨ Simulation parameters (haec_sim)
¤ latency 1 μs
¤ bandwidth 250 Gbit/s
¤ packet size 288 bytes
¤ delay per packet per hop 1,209.216 ns
¤ delay per ACK per hop 1,200.192 ns
¤ sender delay 200 ns or 203.125 ns
¤ receiver delay 200 ns or 215.625 ns
Modeling the Performance of Communication in Parallel Applications on the HAEC Box
[Figure: lu.C.81 on Taurus (69.9% communication, 30.1% computation) compared with lu.C.81 on the HAEC Box (block xyz mapping) under dimension order routing and practical network coding; results are annotated "faster" or "slower" and compared with the other mappings ("<xyz, random" or ">xyz, <random"); times shown: 41.793 s and 23.7-24.1 s]
¨ Simulation parameters (haec_sim)
¤ latency 1 μs
¤ bandwidth 250 Gbit/s
¤ packet size 288 bytes
¤ delay per packet per hop 1,209.216 ns
¤ delay per ACK per hop 1,200.192 ns
¤ sender delay 200 ns or 203.125 ns
¤ receiver delay 200 ns or 215.625 ns
Summary
¨ HAEC Box: unconventional architecture sharing important concerns with HPC systems
¤ Performance and energy (computation + communication)
¨ Two communication models
¤ Dimension order routing
¤ Practical network coding (novel for HPC applications)
¨ Simulation-based performance analysis using a trace-driven simulator (haec_sim)
Future Work
¨ Model more applications (HPC and beyond)
¤ Point-to-point communication
¤ Collective communication
¤ Combinations thereof
¨ Develop energy consumption models
¤ Computation and communication operations
¨ Develop optimal mapping strategies
¤ Communication- and topology-aware
¨ Extend the communication models
¤ Point-to-point: with errors/attacks
¤ Collective: without and with errors/attacks
¤ Heterogeneous links (dynamic latency, bandwidth, path selection, topology)
¨ Simulation
¤ Implement local resource managers (nodes, links): enable contention modeling
¤ Implement runtime process migration (after optimal initial mapping)