Rechen- und Kommunikationszentrum (RZ)

Accelerators in Technical Computing: Is it Worth the Pain? A TCO Perspective

Sandra Wienke, Dieter an Mey, Matthias S. Müller

Center for Computing and Communication

JARA – High-Performance Computing

RWTH Aachen University

TCO of Accelerators

Sandra Wienke | Center for Computing and Communication

Agenda

Introduction

Modeling

Total Cost of Ownership (TCO)

Comparison Metrics

Case Study on Accelerators

Programming Models & System Types

TCO Components @ RWTH

Real-World Application

Results

Conclusion & Outlook

Introduction

Today: variety of HPC clusters

Usage of accelerators (NVIDIA GPU, Intel Xeon Phi) motivated by promising performance-per-watt ratio

System comparison by performance or performance per watt is not sufficient for a purchase decision

Total cost of ownership (TCO)

Acquisition costs, housing, operation costs, ...

Inclusion of manpower costs (administration & programming)

Comparison of costs per program run (application-dependent)

Investigation of a real-world software package

OpenMP on Intel Sandy Bridge

OpenMP + LEO on Intel Xeon Phi

OpenCL, OpenACC on NVIDIA Fermi GPU

Impact of manpower effort / programming model?

Modeling – Total Cost of Ownership (TCO)

Basis: a single compute node, extrapolated to cluster size

Investment $I = \mathrm{TCO}(n, \tau) = C_{ot}(n) + C_{pa}(n) \cdot \tau$

One-time costs $C_{ot}$

Per node: HW acquisition, building/infrastructure, OS/env. installation

Per node type: OS/env. installation, programming effort

Annual costs $C_{pa}$

Per node: HW maintenance, building/infrastructure, OS/env. maintenance, power consumption

Per node type: OS/env. maintenance, compiler/software, application maintenance

TCO depends on architecture & application

$n$: number of nodes, $\tau$: system lifetime
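A minimal sketch of the TCO formula above. It assumes that the one-time and annual costs each split into a part that scales with the number of nodes and a part paid once per node type, as the "Per node" and "Per node type" items suggest. The class name, field names and example figures are hypothetical and not taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class NodeTypeCosts:
    """Cost parameters for one node type, in EUR (illustrative names)."""
    one_time_per_node: float  # e.g. HW acquisition, installation per node
    one_time_per_type: float  # e.g. programming effort, paid once per node type
    annual_per_node: float    # e.g. HW maintenance, power, building share
    annual_per_type: float    # e.g. compiler licences, application maintenance


def tco(costs: NodeTypeCosts, n: float, tau: float) -> float:
    """TCO(n, tau) = C_ot(n) + C_pa(n) * tau for n nodes over tau years."""
    # Assumption: both cost functions are linear in the node count n.
    c_ot = n * costs.one_time_per_node + costs.one_time_per_type
    c_pa = n * costs.annual_per_node + costs.annual_per_type
    return c_ot + c_pa * tau


# Made-up example values, only to show the call.
example = NodeTypeCosts(
    one_time_per_node=5000.0,
    one_time_per_type=1000.0,
    annual_per_node=800.0,
    annual_per_type=0.0,
)
print(tco(example, n=10, tau=4))
```

Keeping the per-type part separate matters for the comparison: programming effort is paid once, so its weight shrinks as the cluster grows.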

Modeling – Comparison Metrics

Costs per program run $C_{ppr}$

Includes investment/TCO & application performance

$C_{ppr}(n, \tau) = \frac{\mathrm{TCO}(n, \tau)}{n_{ex}(\tau) \cdot n}$ with $n_{ex}(\tau) = \frac{k \cdot \tau}{t_{par}}$

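Baseline for comparing a system X: Intel Sandy Bridge (SNB) + OpenMP

$\frac{C_{ppr,X}(n_X, \tau) - C_{ppr,OMP}(n_{OMP}, \tau)}{C_{ppr,OMP}(n_{OMP}, \tau)} \begin{cases} < 0 & \text{if } X \text{ beneficial} \\ \ge 0 & \text{if } OMP \text{ beneficial} \end{cases}$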
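The cost-per-run metric and its comparison against the baseline can be sketched as follows. The function names are hypothetical. The sketch assumes the lifetime and the parallel runtime are given in the same time unit (seconds here), and the example figures are made up.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def executions(k: float, tau: float, t_par: float) -> float:
    """n_ex(tau) = k * tau / t_par: program runs per node over the lifetime.

    tau and t_par must use the same time unit.
    """
    return k * tau / t_par


def cost_per_run(tco_total: float, n: float, k: float,
                 tau: float, t_par: float) -> float:
    """C_ppr(n, tau) = TCO(n, tau) / (n_ex(tau) * n)."""
    return tco_total / (executions(k, tau, t_par) * n)


def relative_difference(cppr_x: float, cppr_omp: float) -> float:
    """Negative: system X is beneficial. Zero or positive: the baseline is."""
    return (cppr_x - cppr_omp) / cppr_omp


# Made-up example: same budget and lifetime, different node counts and runtimes.
tau = 4 * SECONDS_PER_YEAR
baseline = cost_per_run(250_000.0, n=25, k=0.8, tau=tau, t_par=150.0)
system_x = cost_per_run(250_000.0, n=20, k=0.8, tau=tau, t_par=100.0)
print(relative_difference(system_x, baseline))
```

In this made-up case system X affords fewer nodes for the same budget but still comes out ahead, because its shorter runtime more than compensates.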

Break-even investments

Min. budget needed for system X to be beneficial over OpenMP on SNB

Solve for $I$ with given fixed lifetime $\tau$:

$C_{ppr,X}(n_X, \tau) - C_{ppr,OMP}(n_{OMP}, \tau) = 0$ with $\mathrm{TCO}(n, \tau) = I$

$n$: number of nodes, $\tau$: system lifetime, $n_{ex}$: number of application executions, $k$: system usage rate, $t_{par}$: parallel runtime
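One way to solve the break-even condition in closed form, as a sketch under stated assumptions rather than the authors' method. Assume that, over the fixed lifetime, each system has a cost per node `v` and a fixed cost per node type `f` (for example programming effort), that node counts may be fractional, and that both systems share the same usage rate and lifetime. Then $n = (I - f)/v$, and equating the two costs per run gives $I = (a f_{OMP} - b f_X)/(a - b)$ with $a = v_X t_{par,X}$ and $b = v_{OMP} t_{par,OMP}$. All names and figures below are hypothetical.

```python
from typing import Optional


def break_even_investment(v_x: float, f_x: float, t_x: float,
                          v_omp: float, f_omp: float,
                          t_omp: float) -> Optional[float]:
    """Investment I at which C_ppr of system X equals that of the baseline.

    v_*: lifetime cost per node, f_*: lifetime fixed cost per node type,
    t_*: parallel runtime. Returns None when X's cost-time product per node
    is not lower than the baseline's, i.e. no budget above which X wins.
    """
    a = v_x * t_x
    b = v_omp * t_omp
    if a >= b:
        return None
    return (a * f_omp - b * f_x) / (a - b)


def cost_per_run(investment: float, v: float, f: float, t_par: float,
                 k: float, tau: float) -> float:
    """C_ppr when the whole investment is spent: n = (I - f) / v nodes."""
    n = (investment - f) / v
    n_ex = k * tau / t_par
    return investment / (n_ex * n)


# Made-up example: X costs more per node and needs porting effort,
# but halves the runtime.
i_star = break_even_investment(v_x=12_000.0, f_x=1_500.0, t_x=50.0,
                               v_omp=10_000.0, f_omp=0.0, t_omp=100.0)
print(i_star)
```

Above the returned investment, the fixed porting cost is spread over enough nodes for system X to be cheaper per run; below it, the baseline wins.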

Case Study on Accelerators – Programming Models & System Types

Programming Model           | Accelerator                        | Host                                   | Compiler
Serial                      |                                    | 2x Intel Sandy Bridge, 16 cores, 2 GHz | Intel 13.0.1
OpenMP (simple, vectorized) |                                    |                                        |
LEO + OpenMP                | Intel Xeon Phi 5110P, 60 cores     | 1x Intel Westmere, 4 cores, 2.4 GHz    | Intel 13.0.1
OpenACC                     | NVIDIA Tesla C2050 (Fermi), ECC on |                                        | PGI 12.9
OpenCL                      |                                    |                                        | Intel 13.0.1

Case Study on Accelerators – TCO Components @ RWTH

One-time costs

HW purchase: list prices from Bull

Building/infrastructure: accounted for as annual costs, since it is amortized over 25 years

OS/env. installation: -

Programming effort: Full-time employee costs 285.71€ a day

Annual costs

HW maintenance: 5% of HW purchase costs

Building/infrastructure: 200,000€ per year; costs per node: division by 1.6 MW, multiplication by max. power consumption of each node

OS/env. maintenance: 4 admins, 75% of their time on cluster maintenance (~2300 nodes): 180,000€ / 2300 = 78€ per node and year

Software/compiler: -

Power: PUE 1.5, regional electricity costs 0.15 €/kWh

Application maintenance: - (small kernels)

Given a lifetime of 4 years and an investment: #nodes and #executions (usage rate 80%) follow, and from them $C_{ppr}$
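The per-node arithmetic behind these components can be sketched as below. The constants come from the slide; the function names, the 400 W node and the assumption that a node draws its given power for all 8760 hours of a year are illustrative only.

```python
HOURS_PER_YEAR = 8760  # assumption: the node is powered all year


def admin_cost_per_node(total_admin_cost: float = 180_000.0,
                        nodes: int = 2300) -> float:
    """OS/env. maintenance: admin costs spread evenly over all nodes."""
    return total_admin_cost / nodes


def building_cost_per_node(max_power_w: float,
                           building_cost: float = 200_000.0,
                           capacity_w: float = 1.6e6) -> float:
    """Building share: annual cost per watt of capacity times node max power."""
    return building_cost / capacity_w * max_power_w


def power_cost_per_node(power_w: float, pue: float = 1.5,
                        eur_per_kwh: float = 0.15) -> float:
    """Annual electricity cost, including the PUE overhead."""
    return power_w / 1000.0 * HOURS_PER_YEAR * pue * eur_per_kwh


def programming_cost(effort_days: float, eur_per_day: float = 285.71) -> float:
    """One-time programming effort, paid per node type."""
    return effort_days * eur_per_day


# Hypothetical 400 W node; the wattage is not taken from the slides.
annual = (admin_cost_per_node()
          + building_cost_per_node(400.0)
          + power_cost_per_node(400.0))
print(round(annual, 2), round(programming_cost(5.0), 2))
```

For a node of this hypothetical size, electricity dominates the annual per-node costs, while a few days of programming effort are a one-time cost of similar magnitude to a year of electricity.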

Case Study on Accelerators – Real-World Application

Basis

Serial version

Small kernel

Assumption: homogeneous app. landscape

KegelSpan [2]

3D simulation of bevel gear cutting process

Kernel portion artificially increased from 25% to 90%

Source: BMW, ZF, Klingelnberg

[2] C. Brecher, C. Gorgels, and A. Hardjosuwito. Simulation based Tool Wear Analysis in Bevel Gear Cutting. In International Conference on Gears, volume 2108.2 of VDI-Berichte, pp. 1381–1384, Düsseldorf, VDI Verlag, 2010.
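As a rough illustration of why the kernel portion matters (this is Amdahl's law, not a formula from the slides): if only the kernel, a fraction $p$ of the serial runtime, is accelerated by a factor $s$, the overall speedup is bounded by the part that stays on the host.

$$S(p, s) = \frac{1}{(1 - p) + p/s} \le \frac{1}{1 - p}, \qquad p = 0.25 \Rightarrow S \le 1.33, \qquad p = 0.9 \Rightarrow S \le 10$$

With a 25% kernel, even an arbitrarily fast accelerator could shorten the runtime by at most a quarter, so the comparison would say little about the accelerators themselves.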

Case Study on Accelerators – TCO Components of Application

[Figure: programming effort [days], power consumption [W] and runtime [s] for OpenCL (GPU), OpenACC (GPU), OpenMP+LEO (Phi), OpenMP-vec (SNB) and OpenMP-simp (SNB). Data labels that survived extraction: effort 5.0, 1.5, 4.5, 3.5 and 0.5 days; further labels 119, 140 and 158 (series not recoverable).]

Case Study on Accelerators – Results

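[Figure: break-even investment [€], and costs per program run relative to OpenMP-simp [%] as a function of investment (0€ to 200K€), for OpenCL (GPU), OpenACC (GPU), OpenMP+LEO (Phi) and OpenMP-vec (SNB). Data labels that survived extraction: break-even investments 7,787€, 1,809€ and 7,231€; relative costs per program run -17.15%, 3.62%, -12.09% and -16.82% (assignment to series not recoverable).]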

Conclusion

Are accelerators beneficial? “It depends”

TCO spreadsheet [1] available for own computations

Our results (w/ 90% kernel portion) show

GPU Fermi beneficial over 2-socket Intel SNB server

Intel Xeon Phi results disappointing for now

Mainly due to high acquisition costs

NVIDIA Kepler probably similar

Programming effort impacts break-even investment (see OpenACC vs. OpenCL)

Bigger codes: increase of kernel size ~ increase of break-even investment

Projections possible (e.g. hybrid codes)

[Figure callouts: baseline SNB-OMP (4 years, 250 K€); -17% $C_{ppr}$; +4% $C_{ppr}$]

[1] Wienke, S., an Mey, D., Müller, M.S.: Accelerators for Technical Computing: Is it Worth the Pain? TCO Spreadsheet. https://sharepoint.campus.rwth-aachen.de/units/rz/HPC/public/Shared%20Documents/WienkeEtAl_Accelerators-TCO-Perspective.xlsx, 2013

Outlook

Hybrid code implementation (compare to projections)

Model extensions

New programming models & architectures (OpenMP 4.0, NVIDIA Kepler)

Network communication (MPI)

Mixed job execution (heterogeneous application landscape)

Assessment of decrease in runtime / gain of more results

Comprehensive TCO calculation with predictive powers

Performance, power consumption, manpower

Towards exascale computing, architectures might get more complex

More difficult to manage & program

Impact of manpower effort might get stronger

Thank you for your attention!