investigations of future computing platforms for high energy ......• power9 (based on powerisa...

Investigations ofFuture Computing Platforms for High Energy PhysicsDAVID AbDURACHMANOV (UNL), Felice Pantaleo (CERN), Omar Awile (CERN), Marco Guerri (CERN), Aritz Brosa Iartza (Universidad de Oviedo)PRESENTED AT CHEP 2016

Gartner reported in 2010 that energy-related costs are the fastest-rising cost in the data center

2

MOTIVATION

United States Data CenterEnergy Usage Report (LBNL-1005775), June 2016

Annual growth for server shipments have fallen significantly, same applies for power consumption, yet we continue to have “a drastic increase in demand for data center services”

New report:

The on-chip power density limitations are driving the computing market towards a greater variety of solutions• ARM announced ARMv8.{0,1,2} which covers server market

• Designs with up to 64 cores announced & to be used in supercomputers• ARM recently announced Scalable Vector Extension (SVE) for HPC

• Up to 2K vector sizes & binary compatibility• IBM announced OpenPOWER Foundation which allows other

companies to build PowerPC CPUs• POWER9 (based on PowerISA 3.0) is the first one after OpenPOWER

Foundation was launched• IBM recently announced POWER8+ with NVLink 1.0 support• Intel recently announced KNL (many-core and vector CPU) with

legacy support for x86_64

3

THE NEW ERA: THE CHANGE

Administrators and users shouldn’t notice that they are running on different architectures – it’s yet another PC• All provide LP64 data model and little-endian support• All can run RHEL/CentOS/Fedora distributions• Minimal porting efforts (i.e. enablement)• All have announced NVIDIA GPGPU (CUDA) support

But each vendor needs to differentiate• POWER8+ has NVLink with up to 160GB/s bi-directional traffic• KNL contains many-cores with 512-bit vectors• ARMv8 SVE support from 128-bit to 2K-bit vectors in binary

compatible way• Bake custom silicon with general availability ISA

Geopolitics might play a role4

THE NEW ERA: BEING BORING and UNIQUE

SELECTED PROCESSORSCaviumThunderX

AMDA1100

APMX-Gene1

Intel XeonD-1540

Intel XeonE5-2699v3

Intel Xeon E5-2699v4

Cores/Threads 48C/48T 8C/8T 8C/8T 8C/16C 18C/36T 22C/44T

Speed(GHz) 2.5 2.0 2.4 2.0 (2.6) 2.3(3.6) 2.2(3.6)

Cache(L3) 16MB 8MB 8MB 12MB 45MB 55MB

SMT 1 1 1 2 2 2

Year Q4/15 Q1/2014 Q3/13 Q1/15 Q3/14 Q1/16

5

Intel XeonE5-2698v3

Intel XeonE5-2683v4

Intel XeonE5-2680v4

Intel XeonD-1581

IBMPOWER8 IBMPOWER8

Cores/Threads 16C/32T 16C/32T 14C/28T 16C/32T 8C/64T 10C/80T

Speed 2.3(3.6) 2.1(3.0) 2.4(3.3) 1.8(2.4) 3.857 3.491


SMT 2 2 2 2 8 8

Year Q3/14 Q1/16 Q1/16 Q1/16 Q3/13 Q3/13

NOTE: All systems are production grade


AMDA1100

APMX-Gene1

Intel XeonD-1540

Intel XeonE5-2699v3



Speed(GHz) 2.5 2.0 2.4 2.0 (2.6) 2.3(3.6) 2.2(3.6)


SMT 1 1 1 2 2 2

Year Q4/15 Q1/2014 Q3/13 Q1/15 Q3/14 Q1/16

6

Intel XeonE5-2698v3

Intel XeonE5-2683v4

Intel XeonE5-2680v4

Intel XeonD-1581

IBMPOWER8 IBMPOWER8


Speed 2.3(3.6) 2.1(3.0) 2.4(3.3) 1.8(2.4) 3.857 3.491


SMT 2 2 2 2 8 8

Year Q3/14 Q1/16 Q1/16 Q1/16 Q3/13 Q3/13


AMDA1100

APMX-Gene1

Intel XeonD-1540

Intel XeonE5-2699v3



Speed(GHz) 2.5 2.0 2.4 2.0 (2.6) 2.3(3.6) 2.2(3.6)


SMT 1 1 1 2 2 2

Year Q4/15 Q1/2014 Q3/13 Q1/15 Q3/14 Q1/16

7

Intel XeonE5-2698v3

Intel XeonE5-2683v4

Intel XeonE5-2680v4

Intel XeonD-1581

IBMPOWER8 IBMPOWER8


Speed 2.3(3.6) 2.1(3.0) 2.4(3.3) 1.8(2.4) 3.857 3.491


SMT 2 2 2 2 8 8

Year Q3/14 Q1/16 Q1/16 Q1/16 Q3/13 Q3/13


AMDA1100

APMX-Gene1

Intel XeonD-1540

Intel XeonE5-2699v3



Speed(GHz) 2.5 2.0 2.4 2.0 (2.6) 2.3(3.6) 2.2(3.6)


SMT 1 1 1 2 2 2

Year Q4/15 Q1/2014 Q3/13 Q1/15 Q3/14 Q1/16

8

Intel XeonE5-2698v3

Intel XeonE5-2683v4

Intel XeonE5-2680v4

Intel XeonD-1581

IBMPOWER8 IBMPOWER8


Speed 2.3(3.6) 2.1(3.0) 2.4(3.3) 1.8(2.4) 3.857 3.491


SMT 2 2 2 2 8 8

Year Q3/14 Q1/16 Q1/16 Q1/16 Q3/13 Q3/13


AMDA1100

APMX-Gene1

Intel XeonD-1540

Intel XeonE5-2699v3



Speed(GHz) 2.5 2.0 2.4 2.0 (2.6) 2.3(3.6) 2.2(3.6)


SMT 1 1 1 2 2 2

Year Q4/15 Q1/2014 Q3/13 Q1/15 Q3/14 Q1/16

9

Intel XeonE5-2698v3

Intel XeonE5-2683v4

Intel XeonE5-2680v4

Intel XeonD-1581

IBMPOWER8 IBMPOWER8


Speed 2.3(3.6) 2.1(3.0) 2.4(3.3) 1.8(2.4) 3.857 3.491


SMT 2 2 2 2 8 8

Year Q3/14 Q1/16 Q1/16 Q1/16 Q3/13 Q3/13

CMSSW RECO: RAW PERFORMANCE – SINGLE CORE

10

NOTE:Allnumbersnormalized toasinglecoreThunderXperformance

CaviumThunderX AMDA1110 APMX-Gene1 XeonD-1540

XeonE5-2699v3

XeonE5-2699v3NO_TB

XeonE5-2699v4NO_TB

XeonE5-2698v3

XeonE5-2683v4

XeonE5-2680v4 POWER8-8 POWER8-10

events/s 1.00 1.94 1.59 4.66 6.44 4.24 3.74 5.37 5.35 5.87 3.40 3.08

0.00

1.00

2.00

3.00

4.00

5.00

6.00

7.00EventThrou

ghpu

t(events/s)[no

rmalizedto

Thu

nderX1core]

CMSSW RECO: RAW PERFORMANCE – SINGLE SOCKET

11

NOTE: All numbers normalized to a single core Thunder X performance.Multi-threaded (4 threads) jobs were used to fill full socket due to memory constraints.


XeonE5-2699v3

XeonE5-2699v3NO_TB

XeonE5-2699v4NO_TB

XeonE5-2698v3

XeonE5-2683v4


events/s 42.89 14.24 10.89 39.30 88.92 87.39 92.91 78.35 87.35 86.83 56.88 63.93

0.00

10.00

20.00

30.00

40.00

50.00

60.00

70.00

80.00

90.00

100.00EventThrou

ghpu

t(events/s)[no

rmalizedto

Thu

nderX1core]

CMSSW RECO: RAW PERFORAMNCE – SCALABILITY

12

NOTE: Multi-threaded (4 threads) jobs were used to fill full socket due to memory constraints.

0.0000

0.5000

1.0000

1.5000

2.0000

2.5000

0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00

Even

ts/s

ThreadsPerCore

CaviumThunderX AMDA110 APMX-Gene1 XeonD-1540 XeonE5-2699v3 XeonE5-2699v3NO_TB

XeonE5-2699v4NO_TB XeonE5-2698v3 XeonE5-2683v4 XeonE5-2680v4 POWER8-8 POWER8-10

CMSSW RECO: RAW PERFORAMNCE – HYPERTHREADING

13


0.00

20.00

40.00

60.00

80.00

100.00

120.00

140.00

160.00

180.00

0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00

%gaine

dbyHT

ThreadsPerCore

XeonD-1540 XeonE5-2699v3 XeonE5-2699v3NO_TB XeonE5-2699v4NO_TB XeonE5-2698v3

XeonE5-2683v4 XeonE5-2680v4 POWER8-8 POWER8-10

CMSSW RECO: RAW PERFORAMNCE – TURBO BOOST

14


0.0000

0.0200

0.0400

0.0600

0.0800

0.1000

0.1200

0.1400

0.1600

0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00

Even

ts/s/thread

ThreadsPerCore

CaviumThunderX AMDA110 APMX-Gene1 XeonD-1540 XeonE5-2699v3 XeonE5-2699v3NO_TB

XeonE5-2699v4NO_TB XeonE5-2698v3 XeonE5-2683v4 XeonE5-2680v4 POWER8-8 POWER8-10

Haswell TB vs NO_TB

ParFullCMS: SCALABILITY – XEON v3 vs v4, FREQ LOCKED

15

0.00

2.00

4.00

6.00

8.00

10.00

12.00

14.00

16.00

18.00

20.00

1 2 4 8 16 24 32 48

Even

ts/s

Cores#

[email protected]

[email protected]

ParFullCMS: SCALABILITY – XEON V3 vs v4, TURBOboost

16

0.00

2.00

4.00

6.00

8.00

10.00

12.00

14.00

16.00

18.00

20.00

1 2 4 8 16 24 32 48

Even

ts/s

Cores#

[email protected]

[email protected]

[email protected]

[email protected]

ParFullCMS: SCALABILITY – ARM64/aarch64

17

0.00

2.00

4.00

6.00

8.00

10.00

12.00

14.00

16.00

18.00

20.00

1 2 4 8 16 24 32 48

Even

ts/s

Cores#

[email protected]

[email protected]

[email protected]

[email protected]

[email protected]

ParFullCMS: SCALABILITY – XEON-D (BROADWELL)

18

0.00

2.00

4.00

6.00

8.00

10.00

12.00

14.00

16.00

18.00

20.00

1 2 4 8 16 24 32 48

Even

ts/s

Cores#

[email protected]

[email protected]

[email protected]

[email protected]

[email protected]

[email protected]

[email protected]

FKDtree: SCALABILITY - SPEEDUP

19

0

20

40

60

80

100

1201 5 9 13 17 21 25 29 33 37 41 45 49 53 57 61 65 69 73 77 81 85 89 93 97 101

105

109

113

117

121

125

129

133

137

141

145

149

153

157

161

165

169

173

177

181

185

189

193

197

201

205

209

213

217

221

225

229

233

237

241

245

249

253

IBMPOWER8+ IntelKNL(7210)

NOTE: FKDtree is a heapified KDTree with parallel search implemented in OpenMP, TBB, OpenCL, and CUDA. It is used for nearest neighbor finding in tracking and clustering.This is box to box comparison, POWER8+ is 2S Minsky system.

POWER8+x17.85

KNLx62.94

FKDtree: THROUGHPUT - searches/mS

20

NOTE: P100-SXM2 isconnectedwithNVLink 1.0,butweonlytransferdataonce.Implementation is using SP.

0

2000

4000

6000

8000

10000

120001 5 9 13 17 21 25 29 33 37 41 45 49 53 57 61 65 69 73 77 81 85 89 93 97 101

105

109

113

117

121

125

129

133

137

141

145

149

153

157

161

165

169

173

177

181

185

189

193

197

201

205

209

213

217

221

225

229

233

237

241

245

249

253

IBMPOWER8+ IntelKNL(7210) NvidiaK40 NvidiaPascalP100

x2.25

FKDtree: POWER EFFICIENCY - SEARCHES/J

21

NOTE: PowermeasurementsareforCPUs/GPGPUs,notforthefullbox.

0

10000

20000

30000

40000

50000

60000

70000

80000

90000

100000

IBMPOWER8+ IntelKNL(7210) NvidiaK40 NvidiaPascalP100 IntelHDGraphics515(SkylakeCoreM3)

112W11W

380W 147W

131W

• Intel continues to improve their process node (industry leader) and update microarchitecture at steady pace• Intel continues to push boundaries with many-core & vector

CPUs (KNL)• 2017-2018 (next two years) are important for ARMv8 64-bit

and PowerPC (POWER9)• Nvidia P100 provides a major improvement from K40 in

terms of performance and power efficiency

22

RACE IS HEATING UP

For providing hardware and / or helping setting it up we would like to thank:• Intel• CERN Techlab and Openlab• Supermicro• IBM• Rackspace

23

THANK you!

24

BACKUP

CMSSW RECO: RAW PERFORMANCE – SINGLE JOB

25

NOTE: All numbers normalized to a single core Thunder X performance.The job is configured with 4 threads to represent 2016 executions.


XeonE5-2699v3

XeonE5-2699v3NO_TB

XeonE5-2699v4NO_TB

XeonE5-2698v3

XeonE5-2683v4


events/s 3.79 7.30 6.09 17.39 22.51 16.41 13.92 18.97 18.10 20.09 12.26 10.36

0.00

5.00

10.00

15.00

20.00

25.00EventThrou

ghpu

t(events/s)[no

rmalizedto

Thu

nderX1core]

MAIN MEMORY BANDWIDTH (Intel RDT) – 1S, E5-2683 v4

26

ParFullCMS HEP SPEC 2006

MB/s MB/s

STREAM Triad best rate: 40329.8 MB/s (75%) or 53773 MB/s (100%) [Similar number reported by RDT]All are running with 32 threads (16C/32T, full 1S), 40MB LLC

investigations of future computing platforms for high energy ......• power9 (based on powerisa...

Documents