Internal GPGPUs on Dedicated x16 Slots – Are They Needed for HPC? Elliott Berger HPC Solution Architect Dell Research Computing [email protected]


Post on 26-Sep-2020


Page 1: Internal GPGPUs on Dedicated x16 Slots Are They Needed for ...saahpc.ncsa.illinois.edu/11/presentations/berger.pdf · • DDR performance for interconnect ... iPASS cable iPASS cable

Internal GPGPUs on Dedicated x16 Slots – Are They Needed for HPC?

Elliott Berger

HPC Solution Architect

Dell Research Computing

[email protected]

Agenda

• “Gold” Standards in GPGPU Computing
- Tesla S-series (S1060, S1070, S2050, S2070)
- Generic 1U server with 2 GPUs
• C410x Design Goals and Description
- Production-Ready
• C410x Benchmark Results
- Internal vs. External Performance
- GPU:Host Scaling
• Model for Enabling GPU Computing

GPGPU “Gold” Standards

NVIDIA Tesla S-Series

• 1U chassis (external)
• Up to 4 GPUs
• Connects to a “host” via a PCI-e HIC card
- Host is typically a 1U node/server
- HIC = Host Interface Card
• Two (2) PCI-e inputs per chassis
- Each one addresses up to 2 GPUs

• 4 GPUs / 1 RackU (Tesla alone)
• 4 GPUs / 2 RackU (with one 1U host)
• 4 GPUs / 3 RackU (with two 1U hosts)
• GPUs are external to the host
• 2 GPUs / RackU or ~1.3 GPUs / RackU
• 2 GPUs sharing a x16
• Single power supply

SuperMicro SuperServer family

• 1U chassis (internal)
• Includes the host and GPUs
• Up to 2 GPUs
- Each GPU on a dedicated x16

• 2 GPUs / RackU
• GPUs are internal to the host
• 2 GPUs on separate x16’s
• DDR performance for interconnect (x4 slot)
• Single power supply

Dell PE C410x: Design Goals

Dell PowerEdge C410x Design Goals

• Increase density (more GPUs per RackU)
• Introduce “flexibility”
- GPU/Host ratio = 1:1, 2:1, 3:1, 4:1, …, 8:1
• “Commercial Grade” – or – production-ready

“Today, most GPGPU buyers plan to use these processors for exploration and applications development rather than for production work. Assuming this experimentation period goes well, the transition to production computing will follow.”
- IDC, February 19, 2010

Dell PE C410x: Description

Dell PowerEdge C410x:

• 3U chassis (external)
- “Room-and-Board” for PCIe Gen-2 x16 devices
- Up to 8 hosts
• Sixteen (16) x16 Gen-2 Devices
- Initial Target = GPGPUs
- Support for any FH/HL or HH/HL device
- Each slot Double-Wide
- Individually Serviceable
• N+1 Power (3+1)
- Gold (90%)
• N+1 Cooling (7+1)

Dell PowerEdge C410x:

• Sixteen (16) x16 Gen-2 Modules
- PCIe Gen-2 x16 compliant
- Independently serviceable

[Figure: GPU module. Callouts: board-to-board connector for x16 Gen-2 PCIe signals and power; power connector for GPGPU card; GPU card; LED and On/Off.]

Dell PowerEdge C410x:

• Common Carrier supports low-profile, half-length, single-width PCI Express cards with standard full-height bracket
• Allows external cabling (network fabric, etc.)

C410x Flexibility: One Host with a Single x16
1:1, 2:1, 3:1, 4:1, 8:1

[Diagram, built up over four slides: the host's HIC connects through an iPASS cable to a PCI switch in the C410x over one x16 link; the switch fans out to 1, 2, 3, or 4 GPUs, each on its own x16. Panels: 1 GPU / x16, 2 GPUs / x16, 3 GPUs / x16, 4 GPUs / x16.]

C410x Flexibility: One Host with Dual x16
1:1, 2:1, 3:1, 4:1, 8:1

[Diagram, built up over four slides: a host with two HICs connects through iPASS cables to two PCI switches in the C410x, one x16 link each; each switch fans out to GPUs on their own x16. Panels: 2 GPUs (1 GPU / x16), 4 GPUs (2 GPUs / x16), 6 GPUs (3 GPUs / x16), 8 GPUs (4 GPUs / x16).]
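One way to reason about these GPU:host ratios is the worst-case share of the host link that each GPU gets when every GPU behind one x16 uplink transfers at the same time. The sketch below assumes the nominal PCIe Gen-2 rate of 500 MB/s per lane per direction (8 GB/s for x16) and an even split across the switch. The even-split model and the function name `worst_case_share_gb_s` are illustrative assumptions, not measurements or terminology from the presentation.

```python
PCIE_GEN2_MB_PER_LANE = 500  # nominal rate per lane, per direction, after 8b/10b encoding


def worst_case_share_gb_s(gpus_per_x16: int, lanes: int = 16) -> float:
    """Per-GPU host bandwidth in GB/s if all GPUs on one uplink transfer at once.

    Assumes the switch divides the uplink evenly (a simplification).
    """
    if gpus_per_x16 < 1:
        raise ValueError("need at least one GPU per uplink")
    link_gb_s = lanes * PCIE_GEN2_MB_PER_LANE / 1000
    return link_gb_s / gpus_per_x16


for n in (1, 2, 3, 4):
    print(f"{n} GPU(s) / x16: {worst_case_share_gb_s(n):.2f} GB/s each")
```

This is only an upper bound on contention: applications that keep data resident on the GPU rarely saturate the link, which is consistent with the small internal-versus-external differences reported later in the deck.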

Recommended Host: Dell PowerEdge C6100

• Four 2-Socket Nodes in 2U
- Intel Westmere-EP
• Each Node:
- 12 DIMMs
- 2 GigE (Intel)
- 1 Daughter Card (PCIe x8): 10GigE or QDR IB
- One PCIe x16 (half-length, half-height)
- Optional SAS controller (in place of IB)
• Chassis Design:
- Hot Plug, Individually Serviceable System Boards / Nodes
- Up to 12 x 3.5” drives (3 per node)
- Up to 24 x 2.5” drives (6 per node)
- N+1 Power supplies (1100W or 1400W)
• NVIDIA HIC certified
• DDR and QDR IB PCIe card certified

C410x “Sandwich”

[Diagram: one C410x (3U) and two C6100 (2U each) chassis, 7U in total.]

8-Card C410x Sandwich         16-Card C410x Sandwich
2 x C6100                     2 x C6100
8 GPUs                        16 GPUs
1 – QDR IB daughtercard       1 – QDR IB daughtercard
7U total                      7U total
8 GPUs total                  16 GPUs total
8 nodes total                 8 nodes total
8/7 nodes / U                 8/7 nodes / U
8/7 GPUs per U                16/7 GPUs per U
1 GPU per PCIe x16            2 GPUs per PCIe x16
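The density figures for the sandwich can be set beside the two “gold standard” designs described earlier by dividing GPU count by total rack units. The following is a minimal sketch of that comparison; the `Config` class and the configuration names are illustrative, and the GPU and rack-unit counts are the ones quoted on the slides.

```python
from dataclasses import dataclass


@dataclass
class Config:
    name: str
    gpus: int
    rack_units: int  # GPU chassis plus hosts

    @property
    def gpus_per_u(self) -> float:
        return self.gpus / self.rack_units


configs = [
    Config("Tesla S-series + one 1U host", gpus=4, rack_units=2),
    Config("Tesla S-series + two 1U hosts", gpus=4, rack_units=3),
    Config("1U server with 2 internal GPUs", gpus=2, rack_units=1),
    Config("8-card C410x sandwich", gpus=8, rack_units=7),
    Config("16-card C410x sandwich", gpus=16, rack_units=7),
]

# Print densest configuration first.
for c in sorted(configs, key=lambda c: c.gpus_per_u, reverse=True):
    print(f"{c.name:32s} {c.gpus_per_u:.2f} GPUs / U")
```

Under these counts only the 16-card sandwich exceeds 2 GPUs per rack unit, while the 8-card sandwich trades density for a dedicated x16 per GPU.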

Dell PowerEdge C410x:

• Increased density (more GPUs per RackU)
• Introduced “flexibility”
- GPU/Host ratio = 1:1, 2:1, 3:1, 4:1, …, 8:1
• Purposely separates the Host from the GPUs
• Purpose-built to power, cool and manage PCI-e devices
- (N+1) Power (3+1 “Gold” power supplies)
- (N+1) Cooling (7+1 fans)
- Onboard BMC Web interface to monitor, manage & configure
- Each PCI-e Module is individually serviceable
-- no un-cabling
-- no un-racking
-- no opening of compute nodes
-- no bumped DIMMs
-- no disturbed dust
-- vertical insertion

Performance Testing

• Internal vs. External Performance
- Direct comparison between internal and external GPUs
- Identical configurations
- Identical applications
• GPU:Host Scaling
- Examine GPU performance improvements (vs. CPUs)
- Examine performance scaling with GPU count

Application Workloads

• NAMD v2.7b2 (STMV benchmark)
- A parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems
• LAMMPS (15 Jan 2010 version) (LJ-Cut & Gay-Berne)
- A classical molecular dynamics package written for parallel machines
• CUDASW++ v2.0
- Bioinformatics software for Smith-Waterman protein database searches
• GPU-HMMER v0.92
- Bioinformatics software that does protein sequence alignment using profile HMMs
- Running a search of the uniprot_sprot file for different profile HMM lengths
• HOOMD-blue v0.8.2 (Python 2.4)
- General-purpose particle dynamics package
- Running the included LJ and Polymer benchmarks

System Configurations

                SuperMicro (Internal 2-x16)       C410x / C6100 (External 1-x16)
CPU             2 x Intel Xeon E5520 2.26 GHz     2 x Intel Xeon E5520 2.26 GHz
Memory          6 x 4GB 1333MHz DIMMs             6 x 4GB 1333MHz DIMMs
GPU             2 x NVIDIA Tesla M1060            2 x NVIDIA Tesla M1060
GPU Driver      3.0 linux-64 195.36.15            3.0 linux-64 195.36.15
OS              RHEL 5.3                          RHEL 5.5
CUDA Toolkit    3.0 linux-64 RHEL 5.3             3.0 linux-64 RHEL 5.3
CUDA SDK        GPU Computing 3.0 linux           GPU Computing 3.0 linux

Internal vs. External x16 Slots

Internal vs. External: NAMD

[Bar chart: NAMD – STMV Benchmark. Y-axis: Steps/Second (0 to 1.2). Internal 2-x16 (2): 0.95; C410x / C6100 (2): 0.82.]

13.8% performance difference.
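The quoted difference can be approximated from the two bar labels on this chart (0.95 and 0.82 steps/second). The sketch below assumes the difference is measured relative to the internal configuration; the function name is illustrative. Because the chart labels are rounded to two decimals, the result comes out near 13.7% rather than exactly the 13.8% on the slide, which was presumably computed from unrounded measurements.

```python
def relative_drop_percent(reference: float, other: float) -> float:
    """Percent by which `other` falls short of `reference` (higher is better)."""
    return (reference - other) / reference * 100


internal = 0.95  # steps/second, Internal 2-x16 (2), as printed on the chart
external = 0.82  # steps/second, C410x / C6100 (2), as printed on the chart

drop = relative_drop_percent(internal, external)
print(f"External is {drop:.1f}% slower than internal")
```

For the wall-clock charts that follow, lower values are better, so the same comparison would use the time increase rather than the throughput drop.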

Internal vs. External: LAMMPS LJ-Cut

[Bar chart: LAMMPS LJ-Cut. X-axis: Number of Particles; Y-axis: Wall Clock (s), 0 to 1200.]

Number of Particles     256000    500000    1000188
Internal 2-x16 (2)      292       562       1,072
C410x / C6100 (2)       310       572       1,111

1.9 to 6.1% performance difference.

Internal vs. External: LAMMPS Gay-Berne

[Bar chart: LAMMPS Gay-Berne. X-axis: Number of Particles; Y-axis: Wall Clock (s), 0 to 600.]

Number of Particles     15625     32768     64000
Internal 2-x16 (2)      148       283       522
C410x / C6100 (2)       157       300       554

5.7 to 6.2% performance difference.

Internal vs. External: CUDASW++

[Chart: CUDASW++. X-axis: Query Length; Y-axis: GFLOPS (0 to 30). Series: C410x / C6100 (2), Internal 2-x16 (2).]

0.03 to 2.9% performance difference.

Internal vs. External: GPU-HMMER

[Chart: GPU-HMMER. X-axis: HMM Length (415, 983, 1419, 2293); Y-axis: Wall clock (s), 0 to 3500. Series: Internal 2-x16 (2), C410x / C6100 (2).]

0.04 to 0.23% performance difference.

Internal vs. External: HOOMD

[Bar chart: HOOMD Benchmarks. X-axis: 64000 Particle LJ-Liquid, 64000 Particle Polymer; Y-axis: Time Steps/Second (0 to 450). Series: Internal 2-x16 (2), C410x / C6100 (2). Data labels: 379, 389, 381, 363.]

0.5% to 6.6% performance difference.

Results Summary

Application       Negatively Affected by External      Scales Beyond
                  GPUs and/or Shared x16’s             2 GPUs (4:2)
NAMD              Y (13.8%)
LAMMPS LJ-CUT     Y (6.1%)
LAMMPS GB         Y (6.2%)
CUDASW++          N (0.03 to 2.9%)
GPU-HMMER         N (0.04 to 0.23%)
HOOMD             Y (6.6%)

GPU:Host Ratio Performance

GPU:Host Scaling: NAMD

[Bar chart: NAMD, STMV benchmark. Y-axis: Steps/Second (0 to 1.6).]

Configuration          Steps/Second    Speedup
CPU                    0.10            -
C410x / C6100 (1)      0.47            4.7X
C410x / C6100 (2)      0.82            8.2X
C410x / C6100 (4)      1.52            15.2X
Internal 2-x16 (2)     0.95            9.5X
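The speedup labels on this chart are the steps/second values divided by the CPU result of 0.10. A related quantity the slide does not state is how close the C410x results come to perfect linear scaling from the 1-GPU run. The sketch below computes both from the chart labels; the efficiency metric and the function names are illustrative additions, not figures from the presentation.

```python
CPU_STEPS_PER_SEC = 0.10

# GPU count -> steps/second for the C410x / C6100 configurations (chart labels)
c410x = {1: 0.47, 2: 0.82, 4: 1.52}


def speedup_vs_cpu(rate: float) -> float:
    """Throughput relative to the CPU-only run."""
    return rate / CPU_STEPS_PER_SEC


def scaling_efficiency(gpus: int) -> float:
    """Throughput relative to perfect linear scaling from the 1-GPU result."""
    return c410x[gpus] / (gpus * c410x[1])


for gpus, rate in c410x.items():
    print(f"{gpus} GPU(s): {speedup_vs_cpu(rate):.1f}X vs. CPU, "
          f"{scaling_efficiency(gpus):.0%} of linear scaling")
```

With these labels, four GPUs retain roughly four fifths of linear scaling, which is why NAMD is counted among the applications that scale beyond 2 GPUs.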

GPU:Host Scaling: LAMMPS LJ-Cut

[Bar chart: LAMMPS LJ CPU vs. GPU. X-axis: Number of Particles; Y-axis: Wall Clock (s), 0 to 16000.]

Number of Particles     256000      500000      1000188
CPU                     3,844.94    7,547.89    15,094.30
C410x / C6100 (1)       529.19      903.99      1,784.48
Speedup                 7.2X        8.3X        8.5X
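For wall-clock results like these, lower is better, so the speedup is the CPU time divided by the GPU time. A minimal sketch using the data labels from this chart follows; the pairing of labels with particle counts follows the order in which they appear and is an assumption, and the function name is illustrative.

```python
particles = [256000, 500000, 1000188]
cpu_seconds = [3844.94, 7547.89, 15094.30]
gpu_seconds = [529.19, 903.99, 1784.48]  # C410x / C6100 (1)


def wall_clock_speedup(baseline_s: float, accelerated_s: float) -> float:
    """Speedup for timings: baseline time divided by accelerated time."""
    return baseline_s / accelerated_s


speedups = [wall_clock_speedup(c, g) for c, g in zip(cpu_seconds, gpu_seconds)]
for n, s in zip(particles, speedups):
    print(f"{n:>8d} particles: {s:.2f}X")
```

The computed ratios agree with the 7.2X, 8.3X and 8.5X labels to within rounding.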

GPU:Host Scaling: LAMMPS LJ-Cut

[Bar chart: LAMMPS LJ GPU Scaling. X-axis: Number of Particles (256000, 500000, 1000188); Y-axis: Wall clock (s), 0 to 2000.]

Configuration          Speedup
C410x / C6100 (1)      8.5X
C410x / C6100 (2)      13.5X
C410x / C6100 (4)      14.4X
Internal 2-x16 (2)     14.0X

GPU:Host Scaling: LAMMPS Gay-Berne

[Bar chart: LAMMPS GB CPU vs. GPU. X-axis: Number of Particles; Y-axis: Wall Clock (s), log scale from 1 to 100000.]

Number of Particles     15625       32768       64000
CPU                     13,513.9    25,743.8    45,466.4
C410x / C6100 (1)       289.0       575.5       1,092.8
Speedup                 46X         44X         41X

GPU:Host Scaling: LAMMPS Gay-Berne

[Bar chart: LAMMPS GB GPU Scaling. X-axis: Number of Particles (15625, 32768, 64000); Y-axis: Wall Clock (s), 0 to 1200.]

Configuration          Speedup
C410x / C6100 (1)      41X
C410x / C6100 (2)      82X
C410x / C6100 (4)      142X
Internal 2-x16 (2)     87X

GPU:Host Scaling: CUDASW++

[Chart: CUDASW++. X-axis: Query Length; Y-axis: GFLOPS (0 to 40).]

Configuration          Speedup
C410x / C6100 (2)      1.8X
C410x / C6100 (4)      2.4X
Internal 2-x16 (2)     1.8X

GPU:Host Scaling: GPU-HMMER

[Bar chart: GPU-HMMER CPU vs. GPU. X-axis: Length of HMM; Y-axis: Wall Clock (s), 0 to 12000.]

Length of HMM           415      983      1419     2293
CPU                     2,042    4,635    6,598    10,691
C410x / C6100 (1)       697      1,654    2,386    5,805
Speedup                 2.9X     2.8X     2.7X     1.8X

GPU:Host Scaling: GPU-HMMER

[Bar chart: GPU-HMMER: GPU Scaling. X-axis: Length of HMM (415, 983, 1419, 2293); Y-axis: Wall clock (s), 0 to 7000.]

Configuration          Speedup
C410x / C6100 (1)      1.8X
C410x / C6100 (2)      3.6X
C410x / C6100 (4)      7.2X
Internal 2-x16 (2)     3.6X

GPU:Host Scaling: HOOMD

[Bar chart: HOOMD. X-axis: 64000 Particle LJ-Liquid, 64000 Particle Polymer; Y-axis: Steps/S (0 to 450). CPU data labels: 7.94, 7.88.]

Configuration          Speedup
CPU                    1.0X
C410x / C6100 (1)      36X
C410x / C6100 (2)      46X
C410x / C6100 (4)      41X
Internal 2-x16 (2)     49X

Summary

Application       Negatively Affected by External      Scales Beyond
                  GPUs and/or Shared x16’s             2 GPUs (4:2)
NAMD              Y (13.8%)                            Y (1.8X)
LAMMPS LJ-CUT     Y (6.1%)                             N
LAMMPS GB         Y (6.2%)                             Y (1.6X)
CUDASW++          N (0.03 to 2.9%)                     Y (1.3X)
GPU-HMMER         N (0.04 to 0.23%)                    Y (1.9X)
HOOMD             Y (6.6%)                             N
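The 4:2 column can be approximated by dividing each application's 4-GPU speedup by its 2-GPU speedup, using the labels printed on the GPU:Host scaling charts. The sketch below does that; the pairing of speedup labels with GPU counts follows the legend order and is an assumption, and the 1.2 cut-off for a “Y” is hypothetical since the presentation does not state one. Because the inputs are rounded chart labels, some ratios differ slightly from the slide (for example about 1.7X rather than 1.6X for LAMMPS GB).

```python
# Speedup labels from the scaling charts: (2 GPUs, 4 GPUs) on C410x / C6100
speedups = {
    "NAMD": (8.2, 15.2),
    "LAMMPS LJ-CUT": (13.5, 14.4),
    "LAMMPS GB": (82, 142),
    "CUDASW++": (1.8, 2.4),
    "GPU-HMMER": (3.6, 7.2),
    "HOOMD": (46, 41),
}

SCALES_THRESHOLD = 1.2  # hypothetical cut-off, not stated in the presentation


def four_to_two_ratio(two_gpu: float, four_gpu: float) -> float:
    """Gain from going from 2 to 4 GPUs; 2.0 would be perfect scaling."""
    return four_gpu / two_gpu


for app, (two, four) in speedups.items():
    ratio = four_to_two_ratio(two, four)
    verdict = "Y" if ratio >= SCALES_THRESHOLD else "N"
    print(f"{app:14s} {ratio:.2f}X  scales beyond 2 GPUs: {verdict}")
```

With this cut-off the Y/N pattern matches the table: LAMMPS LJ-Cut barely gains from four GPUs and HOOMD slows down, while the other four applications benefit.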

Conclusions

• GPU:Host Ratio is critically important for performance
- application dependent
- applications exist which scale beyond 1 or 2 GPUs
- flexibility is important in unknown situations
• Dedicated x16’s may not be important for performance
• External Connections minimally affect performance
• C410x
- Introduces flexibility while increasing density
- Purpose-built to power, cool & manage PCI-e devices
- Separates the Host/GPU; addresses technology curves

Model for Enabling GPU Computing

• Develop anywhere
• Test in the office
• Deploy in the data center

Dell Precision M4600
GPU development anywhere
Mobile 15” workstation starting at 6.15 lbs

Maximized Performance

• Latest technology

• Up to Intel Core i7 Extreme Edition

• DDR3 1333MHz or 1600MHz

• Workstation graphics

• NVIDIA Quadro 1000M

• NVIDIA Quadro 2000M

• AMD FirePro M5950

• UltraSharp Display

Application Empowering

• ISV certified

• Avid/DigiDesign Tier 1 Certification

• 95 applications from 35 key industry vendors

• GPU support

• Quadro 2000M = 192 CUDA cores with 2GB

• OS support

• Red Hat Linux WS 6

• Windows 7

Mobile & Scalable

• Expandable

• Up to 32GB RAM at 1333MHz or 16GB RAM at 1600MHz

• 2 Storage drives

• Flexible

• e-SATA interface for additional storage

• SSD Mini card option

• Dual integrated high quality speakers and dual integrated noise cancelling digital array microphones

Total Control of Ownership

• Manageability

• Transition periods

• Intel vPro support

• Remote systems management

• Security

• Dell ControlVault

• FIPS FP reader

• Contactless SC reader

• FIPS Encrypted drives

• Dell Services

• Lifecycle Services

• ImageDirect

Dell Precision M6600
Portable high-performance GPU development
Mobile 17” workstation starting at 8.1 lbs

Unparalleled Performance

• Latest technology

• Up to Intel Core i7 Extreme Edition

• DDR3 1600MHz

• RAID 0, 1, or RAID 5 SSD based storage options

• Workstation graphics

• NVIDIA Quadro 3000M

• NVIDIA Quadro 4000M

• AMD ATI FirePro M9800

• Dell UltraSharp Display

Application Empowering

• ISV certified

• Avid/DigiDesign Tier 1 Certification

• 95 applications from 35 key industry vendors

• GPU support

• 4000M = 336 CUDA cores with 2GB

• COMING SOON: Quadro 5010M = 384 CUDA cores with 4GB

• OS support

• Red Hat Linux WS 6

• Windows 7

Massive Scalability

• Expandable

• Up to 32GB RAM

• 4 DIMM slots

• 2 Storage drives

• Up to over 1.5TB storage capacity

• Flexible

• e-SATA interface for additional storage

• USB 3.0 option

• SSD Mini card option

• Dual integrated high quality speakers and dual integrated noise cancelling digital array microphones

Total Control of Ownership

• Manageability

• Transition periods

• Global Standard Platforms

• Remote systems management

• Security

• Dell ControlVault

• FIPS FP reader

• Contactless SC reader

• FIPS Encrypted drives

• Dell Services

• Lifecycle Services

• ImageDirect

Delivering a personal supercomputer workstation
Desktop GPU Computing enables rapid development and at-speed testing in the office

• Dell Precision T5500 or T7500 workstation: NVIDIA® Tesla™ C1060 with 240 cores
• Dell Precision T7500 workstation: NVIDIA® Tesla™ C2050 with 448 cores


More information?

http://www.dellhpcsolutions.com

Thank You!

Elliott Berger

HPC Solution Architect

Dell Research Computing

[email protected]
