investigations of future computing platforms for high energy ......• power9 (based on powerisa...
TRANSCRIPT
-
Investigations ofFuture Computing Platforms for High Energy PhysicsDAVID AbDURACHMANOV (UNL), Felice Pantaleo (CERN), Omar Awile (CERN), Marco Guerri (CERN), Aritz Brosa Iartza (Universidad de Oviedo)PRESENTED AT CHEP 2016
-
Gartner reported in 2010 that energy-related costs are the fastest-rising cost in the data center
2
MOTIVATION
United States Data CenterEnergy Usage Report (LBNL-1005775), June 2016
Annual growth for server shipments have fallen significantly, same applies for power consumption, yet we continue to have “a drastic increase in demand for data center services”
New report:
-
The on-chip power density limitations are driving the computing market towards a greater variety of solutions• ARM announced ARMv8.{0,1,2} which covers server market
• Designs with up to 64 cores announced & to be used in supercomputers• ARM recently announced Scalable Vector Extension (SVE) for HPC
• Up to 2K vector sizes & binary compatibility• IBM announced OpenPOWER Foundation which allows other
companies to build PowerPC CPUs• POWER9 (based on PowerISA 3.0) is the first one after OpenPOWER
Foundation was launched• IBM recently announced POWER8+ with NVLink 1.0 support• Intel recently announced KNL (many-core and vector CPU) with
legacy support for x86_64
3
THE NEW ERA: THE CHANGE
-
Administrators and users shouldn’t notice that they are running on different architectures – it’s yet another PC• All provide LP64 data model and little-endian support• All can run RHEL/CentOS/Fedora distributions• Minimal porting efforts (i.e. enablement)• All have announced NVIDIA GPGPU (CUDA) support
But each vendor needs to differentiate• POWER8+ has NVLink with up to 160GB/s bi-directional traffic• KNL contains many-cores with 512-bit vectors• ARMv8 SVE support from 128-bit to 2K-bit vectors in binary
compatible way• Bake custom silicon with general availability ISA
Geopolitics might play a role4
THE NEW ERA: BEING BORING and UNIQUE
-
SELECTED PROCESSORSCaviumThunderX
AMDA1100
APMX-Gene1
Intel XeonD-1540
Intel XeonE5-2699v3
Intel Xeon E5-2699v4
Cores/Threads 48C/48T 8C/8T 8C/8T 8C/16C 18C/36T 22C/44T
Speed(GHz) 2.5 2.0 2.4 2.0 (2.6) 2.3(3.6) 2.2(3.6)
Cache(L3) 16MB 8MB 8MB 12MB 45MB 55MB
SMT 1 1 1 2 2 2
Year Q4/15 Q1/2014 Q3/13 Q1/15 Q3/14 Q1/16
5
Intel XeonE5-2698v3
Intel XeonE5-2683v4
Intel XeonE5-2680v4
Intel XeonD-1581
IBMPOWER8 IBMPOWER8
Cores/Threads 16C/32T 16C/32T 14C/28T 16C/32T 8C/64T 10C/80T
Speed 2.3(3.6) 2.1(3.0) 2.4(3.3) 1.8(2.4) 3.857 3.491
Cache(L3) 40MB 40MB 35MB 24MB 64MB 80MB
SMT 2 2 2 2 8 8
Year Q3/14 Q1/16 Q1/16 Q1/16 Q3/13 Q3/13
NOTE: All systems are production grade
-
SELECTED PROCESSORSCaviumThunderX
AMDA1100
APMX-Gene1
Intel XeonD-1540
Intel XeonE5-2699v3
Intel Xeon E5-2699v4
Cores/Threads 48C/48T 8C/8T 8C/8T 8C/16C 18C/36T 22C/44T
Speed(GHz) 2.5 2.0 2.4 2.0 (2.6) 2.3(3.6) 2.2(3.6)
Cache(L3) 16MB 8MB 8MB 12MB 45MB 55MB
SMT 1 1 1 2 2 2
Year Q4/15 Q1/2014 Q3/13 Q1/15 Q3/14 Q1/16
6
Intel XeonE5-2698v3
Intel XeonE5-2683v4
Intel XeonE5-2680v4
Intel XeonD-1581
IBMPOWER8 IBMPOWER8
Cores/Threads 16C/32T 16C/32T 14C/28T 16C/32T 8C/64T 10C/80T
Speed 2.3(3.6) 2.1(3.0) 2.4(3.3) 1.8(2.4) 3.857 3.491
Cache(L3) 40MB 40MB 35MB 24MB 64MB 80MB
SMT 2 2 2 2 8 8
Year Q3/14 Q1/16 Q1/16 Q1/16 Q3/13 Q3/13
-
SELECTED PROCESSORSCaviumThunderX
AMDA1100
APMX-Gene1
Intel XeonD-1540
Intel XeonE5-2699v3
Intel Xeon E5-2699v4
Cores/Threads 48C/48T 8C/8T 8C/8T 8C/16C 18C/36T 22C/44T
Speed(GHz) 2.5 2.0 2.4 2.0 (2.6) 2.3(3.6) 2.2(3.6)
Cache(L3) 16MB 8MB 8MB 12MB 45MB 55MB
SMT 1 1 1 2 2 2
Year Q4/15 Q1/2014 Q3/13 Q1/15 Q3/14 Q1/16
7
Intel XeonE5-2698v3
Intel XeonE5-2683v4
Intel XeonE5-2680v4
Intel XeonD-1581
IBMPOWER8 IBMPOWER8
Cores/Threads 16C/32T 16C/32T 14C/28T 16C/32T 8C/64T 10C/80T
Speed 2.3(3.6) 2.1(3.0) 2.4(3.3) 1.8(2.4) 3.857 3.491
Cache(L3) 40MB 40MB 35MB 24MB 64MB 80MB
SMT 2 2 2 2 8 8
Year Q3/14 Q1/16 Q1/16 Q1/16 Q3/13 Q3/13
-
SELECTED PROCESSORSCaviumThunderX
AMDA1100
APMX-Gene1
Intel XeonD-1540
Intel XeonE5-2699v3
Intel Xeon E5-2699v4
Cores/Threads 48C/48T 8C/8T 8C/8T 8C/16C 18C/36T 22C/44T
Speed(GHz) 2.5 2.0 2.4 2.0 (2.6) 2.3(3.6) 2.2(3.6)
Cache(L3) 16MB 8MB 8MB 12MB 45MB 55MB
SMT 1 1 1 2 2 2
Year Q4/15 Q1/2014 Q3/13 Q1/15 Q3/14 Q1/16
8
Intel XeonE5-2698v3
Intel XeonE5-2683v4
Intel XeonE5-2680v4
Intel XeonD-1581
IBMPOWER8 IBMPOWER8
Cores/Threads 16C/32T 16C/32T 14C/28T 16C/32T 8C/64T 10C/80T
Speed 2.3(3.6) 2.1(3.0) 2.4(3.3) 1.8(2.4) 3.857 3.491
Cache(L3) 40MB 40MB 35MB 24MB 64MB 80MB
SMT 2 2 2 2 8 8
Year Q3/14 Q1/16 Q1/16 Q1/16 Q3/13 Q3/13
-
SELECTED PROCESSORSCaviumThunderX
AMDA1100
APMX-Gene1
Intel XeonD-1540
Intel XeonE5-2699v3
Intel Xeon E5-2699v4
Cores/Threads 48C/48T 8C/8T 8C/8T 8C/16C 18C/36T 22C/44T
Speed(GHz) 2.5 2.0 2.4 2.0 (2.6) 2.3(3.6) 2.2(3.6)
Cache(L3) 16MB 8MB 8MB 12MB 45MB 55MB
SMT 1 1 1 2 2 2
Year Q4/15 Q1/2014 Q3/13 Q1/15 Q3/14 Q1/16
9
Intel XeonE5-2698v3
Intel XeonE5-2683v4
Intel XeonE5-2680v4
Intel XeonD-1581
IBMPOWER8 IBMPOWER8
Cores/Threads 16C/32T 16C/32T 14C/28T 16C/32T 8C/64T 10C/80T
Speed 2.3(3.6) 2.1(3.0) 2.4(3.3) 1.8(2.4) 3.857 3.491
Cache(L3) 40MB 40MB 35MB 24MB 64MB 80MB
SMT 2 2 2 2 8 8
Year Q3/14 Q1/16 Q1/16 Q1/16 Q3/13 Q3/13
-
CMSSW RECO: RAW PERFORMANCE – SINGLE CORE
10
NOTE:Allnumbersnormalized toasinglecoreThunderXperformance
CaviumThunderX AMDA1110 APMX-Gene1 XeonD-1540
XeonE5-2699v3
XeonE5-2699v3NO_TB
XeonE5-2699v4NO_TB
XeonE5-2698v3
XeonE5-2683v4
XeonE5-2680v4 POWER8-8 POWER8-10
events/s 1.00 1.94 1.59 4.66 6.44 4.24 3.74 5.37 5.35 5.87 3.40 3.08
0.00
1.00
2.00
3.00
4.00
5.00
6.00
7.00EventThrou
ghpu
t(events/s)[no
rmalizedto
Thu
nderX1core]
-
CMSSW RECO: RAW PERFORMANCE – SINGLE SOCKET
11
NOTE: All numbers normalized to a single core Thunder X performance.Multi-threaded (4 threads) jobs were used to fill full socket due to memory constraints.
CaviumThunderX AMDA1110 APMX-Gene1 XeonD-1540
XeonE5-2699v3
XeonE5-2699v3NO_TB
XeonE5-2699v4NO_TB
XeonE5-2698v3
XeonE5-2683v4
XeonE5-2680v4 POWER8-8 POWER8-10
events/s 42.89 14.24 10.89 39.30 88.92 87.39 92.91 78.35 87.35 86.83 56.88 63.93
0.00
10.00
20.00
30.00
40.00
50.00
60.00
70.00
80.00
90.00
100.00EventThrou
ghpu
t(events/s)[no
rmalizedto
Thu
nderX1core]
-
CMSSW RECO: RAW PERFORAMNCE – SCALABILITY
12
NOTE: Multi-threaded (4 threads) jobs were used to fill full socket due to memory constraints.
0.0000
0.5000
1.0000
1.5000
2.0000
2.5000
0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00
Even
ts/s
ThreadsPerCore
CaviumThunderX AMDA110 APMX-Gene1 XeonD-1540 XeonE5-2699v3 XeonE5-2699v3NO_TB
XeonE5-2699v4NO_TB XeonE5-2698v3 XeonE5-2683v4 XeonE5-2680v4 POWER8-8 POWER8-10
-
CMSSW RECO: RAW PERFORAMNCE – HYPERTHREADING
13
NOTE: Multi-threaded (4 threads) jobs were used to fill full socket due to memory constraints.
0.00
20.00
40.00
60.00
80.00
100.00
120.00
140.00
160.00
180.00
0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00
%gaine
dbyHT
ThreadsPerCore
XeonD-1540 XeonE5-2699v3 XeonE5-2699v3NO_TB XeonE5-2699v4NO_TB XeonE5-2698v3
XeonE5-2683v4 XeonE5-2680v4 POWER8-8 POWER8-10
-
CMSSW RECO: RAW PERFORAMNCE – TURBO BOOST
14
NOTE: Multi-threaded (4 threads) jobs were used to fill full socket due to memory constraints.
0.0000
0.0200
0.0400
0.0600
0.0800
0.1000
0.1200
0.1400
0.1600
0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00
Even
ts/s/thread
ThreadsPerCore
CaviumThunderX AMDA110 APMX-Gene1 XeonD-1540 XeonE5-2699v3 XeonE5-2699v3NO_TB
XeonE5-2699v4NO_TB XeonE5-2698v3 XeonE5-2683v4 XeonE5-2680v4 POWER8-8 POWER8-10
Haswell TB vs NO_TB
-
ParFullCMS: SCALABILITY – XEON v3 vs v4, FREQ LOCKED
15
0.00
2.00
4.00
6.00
8.00
10.00
12.00
14.00
16.00
18.00
20.00
1 2 4 8 16 24 32 48
Even
ts/s
Cores#
-
ParFullCMS: SCALABILITY – XEON V3 vs v4, TURBOboost
16
0.00
2.00
4.00
6.00
8.00
10.00
12.00
14.00
16.00
18.00
20.00
1 2 4 8 16 24 32 48
Even
ts/s
Cores#
-
ParFullCMS: SCALABILITY – ARM64/aarch64
17
0.00
2.00
4.00
6.00
8.00
10.00
12.00
14.00
16.00
18.00
20.00
1 2 4 8 16 24 32 48
Even
ts/s
Cores#
-
ParFullCMS: SCALABILITY – XEON-D (BROADWELL)
18
0.00
2.00
4.00
6.00
8.00
10.00
12.00
14.00
16.00
18.00
20.00
1 2 4 8 16 24 32 48
Even
ts/s
Cores#
-
FKDtree: SCALABILITY - SPEEDUP
19
0
20
40
60
80
100
1201 5 9 13 17 21 25 29 33 37 41 45 49 53 57 61 65 69 73 77 81 85 89 93 97 101
105
109
113
117
121
125
129
133
137
141
145
149
153
157
161
165
169
173
177
181
185
189
193
197
201
205
209
213
217
221
225
229
233
237
241
245
249
253
IBMPOWER8+ IntelKNL(7210)
NOTE: FKDtree is a heapified KDTree with parallel search implemented in OpenMP, TBB, OpenCL, and CUDA. It is used for nearest neighbor finding in tracking and clustering.This is box to box comparison, POWER8+ is 2S Minsky system.
POWER8+x17.85
KNLx62.94
-
FKDtree: THROUGHPUT - searches/mS
20
NOTE: P100-SXM2 isconnectedwithNVLink 1.0,butweonlytransferdataonce.Implementation is using SP.
0
2000
4000
6000
8000
10000
120001 5 9 13 17 21 25 29 33 37 41 45 49 53 57 61 65 69 73 77 81 85 89 93 97 101
105
109
113
117
121
125
129
133
137
141
145
149
153
157
161
165
169
173
177
181
185
189
193
197
201
205
209
213
217
221
225
229
233
237
241
245
249
253
IBMPOWER8+ IntelKNL(7210) NvidiaK40 NvidiaPascalP100
x2.25
-
FKDtree: POWER EFFICIENCY - SEARCHES/J
21
NOTE: PowermeasurementsareforCPUs/GPGPUs,notforthefullbox.
0
10000
20000
30000
40000
50000
60000
70000
80000
90000
100000
IBMPOWER8+ IntelKNL(7210) NvidiaK40 NvidiaPascalP100 IntelHDGraphics515(SkylakeCoreM3)
112W11W
380W 147W
131W
-
• Intel continues to improve their process node (industry leader) and update microarchitecture at steady pace• Intel continues to push boundaries with many-core & vector
CPUs (KNL)• 2017-2018 (next two years) are important for ARMv8 64-bit
and PowerPC (POWER9)• Nvidia P100 provides a major improvement from K40 in
terms of performance and power efficiency
22
RACE IS HEATING UP
-
For providing hardware and / or helping setting it up we would like to thank:• Intel• CERN Techlab and Openlab• Supermicro• IBM• Rackspace
23
THANK you!
-
24
BACKUP
-
CMSSW RECO: RAW PERFORMANCE – SINGLE JOB
25
NOTE: All numbers normalized to a single core Thunder X performance.The job is configured with 4 threads to represent 2016 executions.
CaviumThunderX AMDA1110 APMX-Gene1 XeonD-1540
XeonE5-2699v3
XeonE5-2699v3NO_TB
XeonE5-2699v4NO_TB
XeonE5-2698v3
XeonE5-2683v4
XeonE5-2680v4 POWER8-8 POWER8-10
events/s 3.79 7.30 6.09 17.39 22.51 16.41 13.92 18.97 18.10 20.09 12.26 10.36
0.00
5.00
10.00
15.00
20.00
25.00EventThrou
ghpu
t(events/s)[no
rmalizedto
Thu
nderX1core]
-
MAIN MEMORY BANDWIDTH (Intel RDT) – 1S, E5-2683 v4
26
ParFullCMS HEP SPEC 2006
MB/s MB/s
STREAM Triad best rate: 40329.8 MB/s (75%) or 53773 MB/s (100%) [Similar number reported by RDT]All are running with 32 threads (16C/32T, full 1S), 40MB LLC