investigations of future computing platforms for high energy ......• power9 (based on powerisa...

26
Investigations of Future Computing Platforms for High Energy Physics DAVID AbDURACHMANOV (UNL), Felice Pantaleo (CERN), Omar Awile (CERN), Marco Guerri (CERN), Aritz Brosa Iartza (Universidad de Oviedo) PRESENTED AT CHEP 2016

Upload: others

Post on 26-Jan-2021

1 views

Category:

Documents


0 download

TRANSCRIPT

  • Investigations ofFuture Computing Platforms for High Energy PhysicsDAVID AbDURACHMANOV (UNL), Felice Pantaleo (CERN), Omar Awile (CERN), Marco Guerri (CERN), Aritz Brosa Iartza (Universidad de Oviedo)PRESENTED AT CHEP 2016

  • Gartner reported in 2010 that energy-related costs are the fastest-rising cost in the data center

    2

    MOTIVATION

    United States Data CenterEnergy Usage Report (LBNL-1005775), June 2016

    Annual growth for server shipments have fallen significantly, same applies for power consumption, yet we continue to have “a drastic increase in demand for data center services”

    New report:

  • The on-chip power density limitations are driving the computing market towards a greater variety of solutions• ARM announced ARMv8.{0,1,2} which covers server market

    • Designs with up to 64 cores announced & to be used in supercomputers• ARM recently announced Scalable Vector Extension (SVE) for HPC

    • Up to 2K vector sizes & binary compatibility• IBM announced OpenPOWER Foundation which allows other

    companies to build PowerPC CPUs• POWER9 (based on PowerISA 3.0) is the first one after OpenPOWER

    Foundation was launched• IBM recently announced POWER8+ with NVLink 1.0 support• Intel recently announced KNL (many-core and vector CPU) with

    legacy support for x86_64

    3

    THE NEW ERA: THE CHANGE

  • Administrators and users shouldn’t notice that they are running on different architectures – it’s yet another PC• All provide LP64 data model and little-endian support• All can run RHEL/CentOS/Fedora distributions• Minimal porting efforts (i.e. enablement)• All have announced NVIDIA GPGPU (CUDA) support

    But each vendor needs to differentiate• POWER8+ has NVLink with up to 160GB/s bi-directional traffic• KNL contains many-cores with 512-bit vectors• ARMv8 SVE support from 128-bit to 2K-bit vectors in binary

    compatible way• Bake custom silicon with general availability ISA

    Geopolitics might play a role4

    THE NEW ERA: BEING BORING and UNIQUE

  • SELECTED PROCESSORSCaviumThunderX

    AMDA1100

    APMX-Gene1

    Intel XeonD-1540

    Intel XeonE5-2699v3

    Intel Xeon E5-2699v4

    Cores/Threads 48C/48T 8C/8T 8C/8T 8C/16C 18C/36T 22C/44T

    Speed(GHz) 2.5 2.0 2.4 2.0 (2.6) 2.3(3.6) 2.2(3.6)

    Cache(L3) 16MB 8MB 8MB 12MB 45MB 55MB

    SMT 1 1 1 2 2 2

    Year Q4/15 Q1/2014 Q3/13 Q1/15 Q3/14 Q1/16

    5

    Intel XeonE5-2698v3

    Intel XeonE5-2683v4

    Intel XeonE5-2680v4

    Intel XeonD-1581

    IBMPOWER8 IBMPOWER8

    Cores/Threads 16C/32T 16C/32T 14C/28T 16C/32T 8C/64T 10C/80T

    Speed 2.3(3.6) 2.1(3.0) 2.4(3.3) 1.8(2.4) 3.857 3.491

    Cache(L3) 40MB 40MB 35MB 24MB 64MB 80MB

    SMT 2 2 2 2 8 8

    Year Q3/14 Q1/16 Q1/16 Q1/16 Q3/13 Q3/13

    NOTE: All systems are production grade

  • SELECTED PROCESSORSCaviumThunderX

    AMDA1100

    APMX-Gene1

    Intel XeonD-1540

    Intel XeonE5-2699v3

    Intel Xeon E5-2699v4

    Cores/Threads 48C/48T 8C/8T 8C/8T 8C/16C 18C/36T 22C/44T

    Speed(GHz) 2.5 2.0 2.4 2.0 (2.6) 2.3(3.6) 2.2(3.6)

    Cache(L3) 16MB 8MB 8MB 12MB 45MB 55MB

    SMT 1 1 1 2 2 2

    Year Q4/15 Q1/2014 Q3/13 Q1/15 Q3/14 Q1/16

    6

    Intel XeonE5-2698v3

    Intel XeonE5-2683v4

    Intel XeonE5-2680v4

    Intel XeonD-1581

    IBMPOWER8 IBMPOWER8

    Cores/Threads 16C/32T 16C/32T 14C/28T 16C/32T 8C/64T 10C/80T

    Speed 2.3(3.6) 2.1(3.0) 2.4(3.3) 1.8(2.4) 3.857 3.491

    Cache(L3) 40MB 40MB 35MB 24MB 64MB 80MB

    SMT 2 2 2 2 8 8

    Year Q3/14 Q1/16 Q1/16 Q1/16 Q3/13 Q3/13

  • SELECTED PROCESSORSCaviumThunderX

    AMDA1100

    APMX-Gene1

    Intel XeonD-1540

    Intel XeonE5-2699v3

    Intel Xeon E5-2699v4

    Cores/Threads 48C/48T 8C/8T 8C/8T 8C/16C 18C/36T 22C/44T

    Speed(GHz) 2.5 2.0 2.4 2.0 (2.6) 2.3(3.6) 2.2(3.6)

    Cache(L3) 16MB 8MB 8MB 12MB 45MB 55MB

    SMT 1 1 1 2 2 2

    Year Q4/15 Q1/2014 Q3/13 Q1/15 Q3/14 Q1/16

    7

    Intel XeonE5-2698v3

    Intel XeonE5-2683v4

    Intel XeonE5-2680v4

    Intel XeonD-1581

    IBMPOWER8 IBMPOWER8

    Cores/Threads 16C/32T 16C/32T 14C/28T 16C/32T 8C/64T 10C/80T

    Speed 2.3(3.6) 2.1(3.0) 2.4(3.3) 1.8(2.4) 3.857 3.491

    Cache(L3) 40MB 40MB 35MB 24MB 64MB 80MB

    SMT 2 2 2 2 8 8

    Year Q3/14 Q1/16 Q1/16 Q1/16 Q3/13 Q3/13

  • SELECTED PROCESSORSCaviumThunderX

    AMDA1100

    APMX-Gene1

    Intel XeonD-1540

    Intel XeonE5-2699v3

    Intel Xeon E5-2699v4

    Cores/Threads 48C/48T 8C/8T 8C/8T 8C/16C 18C/36T 22C/44T

    Speed(GHz) 2.5 2.0 2.4 2.0 (2.6) 2.3(3.6) 2.2(3.6)

    Cache(L3) 16MB 8MB 8MB 12MB 45MB 55MB

    SMT 1 1 1 2 2 2

    Year Q4/15 Q1/2014 Q3/13 Q1/15 Q3/14 Q1/16

    8

    Intel XeonE5-2698v3

    Intel XeonE5-2683v4

    Intel XeonE5-2680v4

    Intel XeonD-1581

    IBMPOWER8 IBMPOWER8

    Cores/Threads 16C/32T 16C/32T 14C/28T 16C/32T 8C/64T 10C/80T

    Speed 2.3(3.6) 2.1(3.0) 2.4(3.3) 1.8(2.4) 3.857 3.491

    Cache(L3) 40MB 40MB 35MB 24MB 64MB 80MB

    SMT 2 2 2 2 8 8

    Year Q3/14 Q1/16 Q1/16 Q1/16 Q3/13 Q3/13

  • SELECTED PROCESSORSCaviumThunderX

    AMDA1100

    APMX-Gene1

    Intel XeonD-1540

    Intel XeonE5-2699v3

    Intel Xeon E5-2699v4

    Cores/Threads 48C/48T 8C/8T 8C/8T 8C/16C 18C/36T 22C/44T

    Speed(GHz) 2.5 2.0 2.4 2.0 (2.6) 2.3(3.6) 2.2(3.6)

    Cache(L3) 16MB 8MB 8MB 12MB 45MB 55MB

    SMT 1 1 1 2 2 2

    Year Q4/15 Q1/2014 Q3/13 Q1/15 Q3/14 Q1/16

    9

    Intel XeonE5-2698v3

    Intel XeonE5-2683v4

    Intel XeonE5-2680v4

    Intel XeonD-1581

    IBMPOWER8 IBMPOWER8

    Cores/Threads 16C/32T 16C/32T 14C/28T 16C/32T 8C/64T 10C/80T

    Speed 2.3(3.6) 2.1(3.0) 2.4(3.3) 1.8(2.4) 3.857 3.491

    Cache(L3) 40MB 40MB 35MB 24MB 64MB 80MB

    SMT 2 2 2 2 8 8

    Year Q3/14 Q1/16 Q1/16 Q1/16 Q3/13 Q3/13

  • CMSSW RECO: RAW PERFORMANCE – SINGLE CORE

    10

    NOTE:Allnumbersnormalized toasinglecoreThunderXperformance

    CaviumThunderX AMDA1110 APMX-Gene1 XeonD-1540

    XeonE5-2699v3

    XeonE5-2699v3NO_TB

    XeonE5-2699v4NO_TB

    XeonE5-2698v3

    XeonE5-2683v4

    XeonE5-2680v4 POWER8-8 POWER8-10

    events/s 1.00 1.94 1.59 4.66 6.44 4.24 3.74 5.37 5.35 5.87 3.40 3.08

    0.00

    1.00

    2.00

    3.00

    4.00

    5.00

    6.00

    7.00EventThrou

    ghpu

    t(events/s)[no

    rmalizedto

    Thu

    nderX1core]

  • CMSSW RECO: RAW PERFORMANCE – SINGLE SOCKET

    11

    NOTE: All numbers normalized to a single core Thunder X performance.Multi-threaded (4 threads) jobs were used to fill full socket due to memory constraints.

    CaviumThunderX AMDA1110 APMX-Gene1 XeonD-1540

    XeonE5-2699v3

    XeonE5-2699v3NO_TB

    XeonE5-2699v4NO_TB

    XeonE5-2698v3

    XeonE5-2683v4

    XeonE5-2680v4 POWER8-8 POWER8-10

    events/s 42.89 14.24 10.89 39.30 88.92 87.39 92.91 78.35 87.35 86.83 56.88 63.93

    0.00

    10.00

    20.00

    30.00

    40.00

    50.00

    60.00

    70.00

    80.00

    90.00

    100.00EventThrou

    ghpu

    t(events/s)[no

    rmalizedto

    Thu

    nderX1core]

  • CMSSW RECO: RAW PERFORAMNCE – SCALABILITY

    12

    NOTE: Multi-threaded (4 threads) jobs were used to fill full socket due to memory constraints.

    0.0000

    0.5000

    1.0000

    1.5000

    2.0000

    2.5000

    0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00

    Even

    ts/s

    ThreadsPerCore

    CaviumThunderX AMDA110 APMX-Gene1 XeonD-1540 XeonE5-2699v3 XeonE5-2699v3NO_TB

    XeonE5-2699v4NO_TB XeonE5-2698v3 XeonE5-2683v4 XeonE5-2680v4 POWER8-8 POWER8-10

  • CMSSW RECO: RAW PERFORAMNCE – HYPERTHREADING

    13

    NOTE: Multi-threaded (4 threads) jobs were used to fill full socket due to memory constraints.

    0.00

    20.00

    40.00

    60.00

    80.00

    100.00

    120.00

    140.00

    160.00

    180.00

    0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00

    %gaine

    dbyHT

    ThreadsPerCore

    XeonD-1540 XeonE5-2699v3 XeonE5-2699v3NO_TB XeonE5-2699v4NO_TB XeonE5-2698v3

    XeonE5-2683v4 XeonE5-2680v4 POWER8-8 POWER8-10

  • CMSSW RECO: RAW PERFORAMNCE – TURBO BOOST

    14

    NOTE: Multi-threaded (4 threads) jobs were used to fill full socket due to memory constraints.

    0.0000

    0.0200

    0.0400

    0.0600

    0.0800

    0.1000

    0.1200

    0.1400

    0.1600

    0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00

    Even

    ts/s/thread

    ThreadsPerCore

    CaviumThunderX AMDA110 APMX-Gene1 XeonD-1540 XeonE5-2699v3 XeonE5-2699v3NO_TB

    XeonE5-2699v4NO_TB XeonE5-2698v3 XeonE5-2683v4 XeonE5-2680v4 POWER8-8 POWER8-10

    Haswell TB vs NO_TB

  • ParFullCMS: SCALABILITY – XEON v3 vs v4, FREQ LOCKED

    15

    0.00

    2.00

    4.00

    6.00

    8.00

    10.00

    12.00

    14.00

    16.00

    18.00

    20.00

    1 2 4 8 16 24 32 48

    Even

    ts/s

    Cores#

    [email protected]

    [email protected]

  • ParFullCMS: SCALABILITY – XEON V3 vs v4, TURBOboost

    16

    0.00

    2.00

    4.00

    6.00

    8.00

    10.00

    12.00

    14.00

    16.00

    18.00

    20.00

    1 2 4 8 16 24 32 48

    Even

    ts/s

    Cores#

    [email protected]

    [email protected]

    [email protected]

    [email protected]

  • ParFullCMS: SCALABILITY – ARM64/aarch64

    17

    0.00

    2.00

    4.00

    6.00

    8.00

    10.00

    12.00

    14.00

    16.00

    18.00

    20.00

    1 2 4 8 16 24 32 48

    Even

    ts/s

    Cores#

    [email protected]

    [email protected]

    [email protected]

    [email protected]

    [email protected]

  • ParFullCMS: SCALABILITY – XEON-D (BROADWELL)

    18

    0.00

    2.00

    4.00

    6.00

    8.00

    10.00

    12.00

    14.00

    16.00

    18.00

    20.00

    1 2 4 8 16 24 32 48

    Even

    ts/s

    Cores#

    [email protected]

    [email protected]

    [email protected]

    [email protected]

    [email protected]

    [email protected]

    [email protected]

  • FKDtree: SCALABILITY - SPEEDUP

    19

    0

    20

    40

    60

    80

    100

    1201 5 9 13 17 21 25 29 33 37 41 45 49 53 57 61 65 69 73 77 81 85 89 93 97 101

    105

    109

    113

    117

    121

    125

    129

    133

    137

    141

    145

    149

    153

    157

    161

    165

    169

    173

    177

    181

    185

    189

    193

    197

    201

    205

    209

    213

    217

    221

    225

    229

    233

    237

    241

    245

    249

    253

    IBMPOWER8+ IntelKNL(7210)

    NOTE: FKDtree is a heapified KDTree with parallel search implemented in OpenMP, TBB, OpenCL, and CUDA. It is used for nearest neighbor finding in tracking and clustering.This is box to box comparison, POWER8+ is 2S Minsky system.

    POWER8+x17.85

    KNLx62.94

  • FKDtree: THROUGHPUT - searches/mS

    20

    NOTE: P100-SXM2 isconnectedwithNVLink 1.0,butweonlytransferdataonce.Implementation is using SP.

    0

    2000

    4000

    6000

    8000

    10000

    120001 5 9 13 17 21 25 29 33 37 41 45 49 53 57 61 65 69 73 77 81 85 89 93 97 101

    105

    109

    113

    117

    121

    125

    129

    133

    137

    141

    145

    149

    153

    157

    161

    165

    169

    173

    177

    181

    185

    189

    193

    197

    201

    205

    209

    213

    217

    221

    225

    229

    233

    237

    241

    245

    249

    253

    IBMPOWER8+ IntelKNL(7210) NvidiaK40 NvidiaPascalP100

    x2.25

  • FKDtree: POWER EFFICIENCY - SEARCHES/J

    21

    NOTE: PowermeasurementsareforCPUs/GPGPUs,notforthefullbox.

    0

    10000

    20000

    30000

    40000

    50000

    60000

    70000

    80000

    90000

    100000

    IBMPOWER8+ IntelKNL(7210) NvidiaK40 NvidiaPascalP100 IntelHDGraphics515(SkylakeCoreM3)

    112W11W

    380W 147W

    131W

  • • Intel continues to improve their process node (industry leader) and update microarchitecture at steady pace• Intel continues to push boundaries with many-core & vector

    CPUs (KNL)• 2017-2018 (next two years) are important for ARMv8 64-bit

    and PowerPC (POWER9)• Nvidia P100 provides a major improvement from K40 in

    terms of performance and power efficiency

    22

    RACE IS HEATING UP

  • For providing hardware and / or helping setting it up we would like to thank:• Intel• CERN Techlab and Openlab• Supermicro• IBM• Rackspace

    23

    THANK you!

  • 24

    BACKUP

  • CMSSW RECO: RAW PERFORMANCE – SINGLE JOB

    25

    NOTE: All numbers normalized to a single core Thunder X performance.The job is configured with 4 threads to represent 2016 executions.

    CaviumThunderX AMDA1110 APMX-Gene1 XeonD-1540

    XeonE5-2699v3

    XeonE5-2699v3NO_TB

    XeonE5-2699v4NO_TB

    XeonE5-2698v3

    XeonE5-2683v4

    XeonE5-2680v4 POWER8-8 POWER8-10

    events/s 3.79 7.30 6.09 17.39 22.51 16.41 13.92 18.97 18.10 20.09 12.26 10.36

    0.00

    5.00

    10.00

    15.00

    20.00

    25.00EventThrou

    ghpu

    t(events/s)[no

    rmalizedto

    Thu

    nderX1core]

  • MAIN MEMORY BANDWIDTH (Intel RDT) – 1S, E5-2683 v4

    26

    ParFullCMS HEP SPEC 2006

    MB/s MB/s

    STREAM Triad best rate: 40329.8 MB/s (75%) or 53773 MB/s (100%) [Similar number reported by RDT]All are running with 32 threads (16C/32T, full 1S), 40MB LLC