Rechen- und Kommunikationszentrum (RZ)
Accelerators in Technical Computing: Is it Worth the Pain? A TCO Perspective
Sandra Wienke, Dieter an Mey, Matthias S. Müller
Center for Computing and Communication
JARA – High-Performance Computing
RWTH Aachen University
TCO of Accelerators
Sandra Wienke | Center for Computing and Communication 2
Agenda
Introduction
Modeling
Total Cost of Ownership (TCO)
Comparison Metrics
Case Study on Accelerators
Programming Models & System Types
TCO Components @ RWTH
Real-World Application
Results
Conclusion & Outlook
Introduction
Today: variety of HPC clusters
Usage of accelerators (NVIDIA GPU, Intel Xeon Phi) motivated by promising performance-per-watt ratio
System comparison by performance or performance per watt not sufficient for purchase decision
Total cost of ownership (TCO)
Acquisition costs, housing, operation costs, etc.
Inclusion of manpower costs (administration & programming)
Comparison of costs per program run (application-dependent)
Investigation of a real-world software package
OpenMP on Intel Sandy Bridge
OpenMP + LEO on Intel Xeon Phi
OpenCL, OpenACC on NVIDIA Fermi GPU
Open question: what is the impact of manpower effort and programming model?
Modeling – Total Cost of Ownership (TCO)
Basis: single compute node, extrapolated to cluster size
Investment $I = \mathrm{TCO}(n, \tau) = C_{ot}(n) + C_{pa}(n) \cdot \tau$
One-time costs Cot
Per node: HW acquisition, building/infrastructure, OS/ env. installation
Per node type: OS/ env. installation, programming effort
Annual costs Cpa
Per node: HW maintenance, building/infrastructure, OS/ env.
maintenance, power consumption
Per node type: OS/ env. maintenance, compiler/software, application
maintenance
TCO depends on architecture & application
$n$: number of nodes; $\tau$: system lifetime
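The TCO formula above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' spreadsheet: it assumes that both cost terms are linear in the number of nodes (a per-node share times n plus a fixed per-node-type share), and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NodeTypeCosts:
    """Cost parameters for one node type, in EUR (illustrative names)."""
    one_time_per_node: float  # e.g. HW acquisition
    one_time_per_type: float  # e.g. programming effort
    annual_per_node: float    # e.g. HW maintenance, power consumption
    annual_per_type: float    # e.g. compiler/software, application maintenance


def one_time_costs(c: NodeTypeCosts, n: float) -> float:
    """C_ot(n): assumed linear in the number of nodes n."""
    return c.one_time_per_node * n + c.one_time_per_type


def annual_costs(c: NodeTypeCosts, n: float) -> float:
    """C_pa(n): assumed linear in the number of nodes n."""
    return c.annual_per_node * n + c.annual_per_type


def tco(c: NodeTypeCosts, n: float, tau_years: float) -> float:
    """TCO(n, tau) = C_ot(n) + C_pa(n) * tau."""
    return one_time_costs(c, n) + annual_costs(c, n) * tau_years


# Hypothetical numbers, chosen only to exercise the formula.
example = NodeTypeCosts(5000.0, 1000.0, 800.0, 0.0)
print(tco(example, n=10, tau_years=4))
```

Keeping the per-node and per-node-type shares separate matters for the later comparison: programming effort is paid once per node type, so its weight shrinks as the cluster grows.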
Modeling – Comparison Metrics
Costs per program run Cppr
Includes investment/TCO & application performance
$C_{ppr}(n, \tau) = \dfrac{\mathrm{TCO}(n, \tau)}{n_{ex}(\tau) \cdot n}$ with $n_{ex}(\tau) = \dfrac{k \cdot \tau}{t_{par}}$
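Baseline for comparing a system X: Intel Sandy Bridge (SNB) + OpenMP
$\dfrac{C_{ppr,X}(n_X, \tau) - C_{ppr,OMP}(n_{OMP}, \tau)}{C_{ppr,OMP}(n_{OMP}, \tau)} \begin{cases} < 0 & \text{if } X \text{ beneficial} \\ \geq 0 & \text{if OMP beneficial} \end{cases}$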
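As a rough sketch, the costs per program run and the relative comparison against the OpenMP baseline could be computed as follows. The function names are illustrative, and the code assumes that the lifetime and the parallel runtime are given in the same time unit (seconds here), so that k · τ / t_par is a plain count of executions per node.

```python
def executions_per_node(k: float, tau_seconds: float, t_par_seconds: float) -> float:
    """n_ex(tau) = k * tau / t_par: application runs per node over the lifetime."""
    return k * tau_seconds / t_par_seconds


def cost_per_program_run(tco_value: float, n: float, k: float,
                         tau_seconds: float, t_par_seconds: float) -> float:
    """C_ppr(n, tau) = TCO(n, tau) / (n_ex(tau) * n)."""
    return tco_value / (executions_per_node(k, tau_seconds, t_par_seconds) * n)


def relative_cppr(cppr_x: float, cppr_omp: float) -> float:
    """Relative difference to the baseline; negative means X is beneficial."""
    return (cppr_x - cppr_omp) / cppr_omp


# Hypothetical inputs: 4 years of lifetime, 80% usage, 100 s per run.
FOUR_YEARS_S = 4 * 365 * 24 * 3600
cppr = cost_per_program_run(100_000.0, n=10, k=0.8,
                            tau_seconds=FOUR_YEARS_S, t_par_seconds=100.0)
print(cppr, relative_cppr(0.8 * cppr, cppr))
```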
Break-even investments
Min. budget needed so that system X is beneficial over OpenMP on SNB
Solve for $I$ with given fixed lifetime $\tau$:
$C_{ppr,X}(n_X, \tau) - C_{ppr,OMP}(n_{OMP}, \tau) = 0$ with $\mathrm{TCO}(n, \tau) = I$
$n$: number of nodes; $\tau$: system lifetime; $n_{ex}$: number of application executions; $k$: system usage rate; $t_{par}$: parallel runtime
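One way to solve the break-even condition is in closed form, under assumptions that are not stated on the slide: TCO is linear in n (per-node cost c times n plus a fixed per-type cost T, both already summed over the lifetime), fractional node counts are allowed, and both systems share the same usage rate and lifetime. With the same budget I on both sides, C_ppr is proportional to t_par / n and n = (I − T) / c, so the condition reduces to a linear equation in I. All names below are hypothetical.

```python
from typing import Optional


def nodes_for_budget(budget: float, per_node: float, per_type: float) -> float:
    """Nodes affordable when TCO(n, tau) = budget and TCO is linear in n."""
    return (budget - per_type) / per_node


def break_even_investment(per_node_x: float, per_type_x: float, t_par_x: float,
                          per_node_omp: float, per_type_omp: float,
                          t_par_omp: float) -> Optional[float]:
    """Budget I at which C_ppr,X equals C_ppr,OMP (same k and tau assumed).

    per_node_* : lifetime cost per node (one-time + annual * tau)
    per_type_* : lifetime cost per node type (e.g. programming effort)
    """
    a = t_par_x * per_node_x      # runtime-weighted node cost of X
    b = t_par_omp * per_node_omp  # runtime-weighted node cost of the baseline
    if a >= b:
        # X is not cheaper per run even for very large budgets,
        # so no minimum budget exists in the sense used here.
        return None
    return (b * per_type_x - a * per_type_omp) / (b - a)


# Hypothetical inputs: X is twice as fast, costlier per node,
# and carries 2000 EUR of programming effort.
budget = break_even_investment(10_000.0, 2_000.0, 50.0, 8_000.0, 0.0, 100.0)
print(budget)
```

Above the returned budget the fixed per-type cost of X is spread over enough nodes that its shorter runtime wins; below it the baseline stays cheaper per run.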
Case Study on Accelerators – Programming Models & System Types
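Programming Model            | Accelerator                             | Host                                   | Compiler
Serial                       | -                                       | 2x Intel Sandy Bridge, 16 cores, 2 GHz | Intel 13.0.1
OpenMP (simple, vectorized)  | -                                       | 2x Intel Sandy Bridge, 16 cores, 2 GHz | Intel 13.0.1
LEO + OpenMP                 | Intel Xeon Phi 5110P, 60 cores          | 1x Intel Westmere, 4 cores, 2.4 GHz    | Intel 13.0.1
OpenACC                      | NVIDIA Tesla C2050 (Fermi), ECC on      | 1x Intel Westmere, 4 cores, 2.4 GHz    | PGI 12.9
OpenCL                       | NVIDIA Tesla C2050 (Fermi), ECC on      | 1x Intel Westmere, 4 cores, 2.4 GHz    | Intel 13.0.1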
Case Study on Accelerators – TCO Components @ RWTH
One-time costs
HW purchase: list prices from Bull
Building/infrastructure: treated as annual costs, since the building is amortized over 25 years
OS/env. installation: -
Programming effort: Full-time employee costs 285.71€ a day
Annual costs
HW maintenance: 5% of HW purchase costs
Building/infrastructure: 200,000€ per year for 1.6 MW; cost per node = 200,000€ / 1.6 MW × max. power consumption of the node
OS/env. maintenance: 4 admins spending 75% of their time on cluster maintenance (~2300 nodes): 180,000€ / 2300 ≈ 78€ per node and year
Software/compiler: -
Power: PUE 1.5, regional electricity costs 0.15 €/kWh
Application maintenance: - (small kernels)
Given a lifetime of 4 years and an investment: derive Cppr, #nodes, #executions (usage rate 80%)
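The per-node figures listed above can be reproduced with a short calculation. This is a sketch using the constants from the slide; the function names are illustrative, the 300 W node is a hypothetical value, and the power cost assumes the node draws the given power for all 8760 hours of the year, which the slide does not specify.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 h


def infrastructure_cost_per_node(max_power_w: float,
                                 annual_building_cost: float = 200_000.0,
                                 site_power_w: float = 1.6e6) -> float:
    """Annual building share: 200,000 EUR / 1.6 MW times the node's max. power."""
    return annual_building_cost / site_power_w * max_power_w


def admin_cost_per_node(annual_admin_cost: float = 180_000.0,
                        cluster_nodes: int = 2300) -> float:
    """Annual OS/env. maintenance per node: 180,000 EUR / 2300 nodes."""
    return annual_admin_cost / cluster_nodes


def power_cost_per_node(power_w: float, pue: float = 1.5,
                        price_per_kwh: float = 0.15) -> float:
    """Annual electricity cost, assuming constant draw over the whole year."""
    return power_w / 1000.0 * pue * HOURS_PER_YEAR * price_per_kwh


def hw_maintenance_cost(hw_purchase_price: float, rate: float = 0.05) -> float:
    """Annual HW maintenance: 5% of the purchase price."""
    return hw_purchase_price * rate


def programming_cost(effort_days: float, day_rate: float = 285.71) -> float:
    """One-time programming effort at the full-time employee day rate."""
    return effort_days * day_rate


# Hypothetical 300 W node and 5 days of programming effort.
print(infrastructure_cost_per_node(300.0), admin_cost_per_node(),
      power_cost_per_node(300.0), programming_cost(5.0))
```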
Case Study on Accelerators – Real-World Application
KegelSpan [2]
3D simulation of bevel gear cutting process
Kernel share of runtime artificially increased from 25% to 90%
Basis
Serial version
Small kernel
Assumption: homogeneous app. landscape
[Image source: BMW, ZF, Klingelnberg]
[2] C. Brecher, C. Gorgels, and A. Hardjosuwito. Simulation based Tool Wear Analysis in Bevel Gear Cutting. In International Conference on Gears, volume 2108.2 of VDI-Berichte, pp. 1381–1384, Düsseldorf, VDI Verlag, 2010.
Case Study on Accelerators – TCO Components of Application
[Figure: programming effort [days], power consumption [W], and runtime [s] for OpenCL (GPU), OpenACC (GPU), OpenMP+LEO (Phi), OpenMP-vec (SNB), and OpenMP-simp (SNB). Data labels recovered from the chart: effort 5.0, 1.5, 4.5, 3.5, 0.5 days; further labels 119, 140, 158.]
Case Study on Accelerators – Results
[Figure: break-even investment [€] per variant, with data labels 7,787, 1,809, 7,231; costs per program run relative to OpenMP-simp (-20% to 20%) over investment (0 to 200 K€), with data labels -17.15%, 3.62%, -12.09%, -16.82%. Variants: OpenCL (GPU), OpenACC (GPU), OpenMP+LEO (Phi), OpenMP-vec (SNB).]
Conclusion
Are accelerators beneficial? “It depends”
TCO spreadsheet [1] available for own computations
Our results (with 90% kernel portion; baseline SNB-OMP, 4 years, 250 K€) show
GPU Fermi beneficial over 2-socket Intel SNB server (-17% Cppr)
Intel Xeon Phi results disappointing for now (+4% Cppr)
Mainly due to high acquisition costs
NVIDIA Kepler probably similar
Programming effort impacts break-even investment (see OpenACC vs. OpenCL)
Bigger codes: break-even investment grows roughly in proportion to kernel size
Projections possible (e.g. hybrid codes)
[1] Wienke, S., an Mey, D., Müller, M.S.: Accelerators for Technical Computing: Is it Worth the Pain? TCO Spreadsheet. https://sharepoint.campus.rwth-aachen.de/units/rz/HPC/public/Shared%20Documents/WienkeEtAl_Accelerators-TCO-Perspective.xlsx, 2013
Outlook
Hybrid code implementation (compare with projections)
Model extensions
New programming models & architectures (OpenMP 4.0, NVIDIA Kepler)
Network communication (MPI)
Mixed job execution (heterogeneous application landscape)
Assessment of the value of decreased runtime / gaining more results
Comprehensive TCO calculation with predictive powers
Performance, power consumption, manpower
Towards exascale computing, architectures might get more complex
More difficult to manage & program
Impact of manpower effort might get stronger
Thank you for your attention!