
Upload: others

Post on 10-Jul-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 5RDG 7R 7DVNLQJ - developer.download.nvidia.com · 7KH 5RFN\ 5RDG 7R 7DVNLQJ 0DUFK ,YR .DEDGVKRZ /DXUD 0RUJHQVWHUQ -¼OLFK 6XSHUFRPSXWLQJ &HQWUH MemberoftheHelmholtzAssociation

The Rocky Road To Tasking

March 21, 2019 Ivo Kabadshow, Laura Morgenstern Jülich Supercomputing Centre

Member of the Helmholtz Association

Page 2: 5RDG 7R 7DVNLQJ - developer.download.nvidia.com · 7KH 5RFN\ 5RDG 7R 7DVNLQJ 0DUFK ,YR .DEDGVKRZ /DXUD 0RUJHQVWHUQ -¼OLFK 6XSHUFRPSXWLQJ &HQWUH MemberoftheHelmholtzAssociation

HPC ≠ HPC

[Figure: time axis from ns to h (ns, μs, ms, s, min, h) with CPU cycle and network latency as reference marks, showing the critical walltime of High Frequency Trading, MD, Game Dev, Deep Learning and Astrophysics.]

Requirements for MD

Strong scalability

Performance portability

Member of the Helmholtz Association March 21, 2019 Slide 1

Our Motivation

Solving the Coulomb problem for Molecular Dynamics (MD)

Task: Compute all pairwise interactions of N particles

N-body problem: O(N²) → O(N) with the Fast Multipole Method (FMM)

Why is that an issue?

MD targets < 1 ms runtime per time step

MD runs millions or billions of time steps

Not compute-bound, but synchronization-bound

No libraries (like BLAS) to do the heavy lifting

We might have to look under the hood ... and get our hands dirty.
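To make the O(N²) cost of the classical approach concrete, here is a minimal sketch of a direct Coulomb summation in C++. The `Particle` struct, the function name and the unit system (all constants set to 1) are assumptions made for illustration; they are not taken from the talk.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical particle type: position and charge, all physical constants set to 1.
struct Particle {
    double x, y, z, q;
};

// Direct summation of the Coulomb energy over all pairs i < j.
// The double loop visits N * (N - 1) / 2 pairs, hence O(N^2) work.
double coulomb_energy_direct(const std::vector<Particle>& p)
{
    double energy = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        for (std::size_t j = i + 1; j < p.size(); ++j) {
            const double dx = p[i].x - p[j].x;
            const double dy = p[i].y - p[j].y;
            const double dz = p[i].z - p[j].z;
            energy += p[i].q * p[j].q / std::sqrt(dx * dx + dy * dy + dz * dz);
        }
    }
    return energy;
}
```

Every pair is independent of every other pair, which is why this variant is easy to parallelize but too expensive for large N.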

Parallelization Potential

[Figure: algorithmic complexity (high to low) plotted against parallelization (easy to hard), with the classical O(N²) approach marked.]

Classical Approach

Lots of independent parallelism

Parallelization Potential

[Figure: algorithmic complexity (high to low) plotted against parallelization (easy to hard), with the classical O(N²) approach and the O(N) FMM marked.]

Fast Multipole Method (FMM)

Many dependent phases

Varying amount of parallelism

Coarse-Grained Parallelization

[Figure: FMM phases in order: Input, P2M, M2M, M2L, L2L, L2P, P2P, Output, with the synchronization points marked.]

Different amount of available loop-level parallelism within each phase

Some phases contain sub-dependencies

Synchronizations might be problematic
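The coarse-grained scheme can be sketched as one parallel loop per phase, followed by a join. This is a simplified illustration in standard C++ with `std::thread`; the helper name `run_phase` and the round-robin distribution of loop iterations are assumptions, not the authors' implementation.

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Run one phase: distribute the index range [0, n) round-robin over
// num_threads threads and wait for all of them to finish.
void run_phase(std::size_t n, unsigned num_threads,
               const std::function<void(std::size_t)>& body)
{
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&body, n, t, num_threads] {
            for (std::size_t i = t; i < n; i += num_threads) {
                body(i);
            }
        });
    }
    for (std::thread& w : workers) {
        w.join();  // synchronization point: the next phase starts only after this
    }
}
```

A phase with fewer loop iterations than threads leaves the remaining threads idle until the join, which is one way such synchronization points can become a bottleneck.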

FMM Algorithmic Flow

Multipole to multipole (M2M), shifting multipoles upwards

[Figure: tree over the depths d = 0 to 4; multipoles 𝜔 are accumulated ("+") from each level into the level above.]

Dataflow – Fine-grained Dependencies
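One common way to express such fine-grained dependencies is a counter per tree node: a node's M2M task becomes ready once all of its children have finished. The sketch below assumes a complete binary tree in heap layout and reduces each multipole to a single `double`; both simplifications, and the function name `upward_pass`, are illustrative assumptions and not the data structures used in the talk.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Upward pass on a complete binary tree in heap layout (root = 0, children of
// node i are 2i+1 and 2i+2). Returns the order in which nodes became ready.
std::vector<std::size_t> upward_pass(std::vector<double>& omega, std::size_t num_leaves)
{
    assert(num_leaves >= 1 && omega.size() == 2 * num_leaves - 1);
    const std::size_t first_leaf = num_leaves - 1;  // inner nodes are stored first
    std::vector<int> pending(first_leaf, 2);        // unfinished children per inner node
    std::queue<std::size_t> ready;                  // holds ready-to-execute tasks only
    std::vector<std::size_t> order;

    for (std::size_t leaf = first_leaf; leaf < omega.size(); ++leaf) {
        ready.push(leaf);                           // leaves have no dependencies
    }
    while (!ready.empty()) {
        const std::size_t node = ready.front();
        ready.pop();
        order.push_back(node);
        if (node == 0) {
            break;                                  // root reached, upward pass done
        }
        const std::size_t parent = (node - 1) / 2;
        omega[parent] += omega[node];               // stand-in for the M2M shift
        if (--pending[parent] == 0) {
            ready.push(parent);                     // last child done: parent is ready
        }
    }
    return order;
}
```

A parent is enqueued as soon as its own two children are done, regardless of whether the rest of the level has finished, so no level-wide barrier is needed.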

[Figure: dataflow graph over the operators p2m, m2m, m2l, l2l and l2p; the edges into and out of m2m carry multipoles 𝜔.]

FMM Algorithmic Flow

Multipole to local (M2L), translating remote multipoles into local Taylor moments

[Figure: tree over the depths d = 0 to 4; remote multipoles are accumulated ("+") into local Taylor moments 𝜇.]

Dataflow – Fine-grained Dependencies

[Figure: dataflow graph over the operators p2m, m2m, m2l, l2l and l2p; m2l reads multipoles 𝜔 and writes local Taylor moments 𝜇.]

FMM Algorithmic Flow

Local to local (L2L), shifting Taylor moments downwards

[Figure: tree over the depths d = 0 to 4; Taylor moments 𝜇 are accumulated ("+") from each level into the level below.]

Dataflow – Fine-grained Dependencies

[Figure: dataflow graph over the operators p2m, m2m, m2l, l2l and l2p; the edges into and out of l2l carry Taylor moments 𝜇.]

CPU Tasking Framework

[Figure: framework components: Core, ThreadingWrapper, Thread, Scheduler, Queue, Dispatcher, TaskFactory, LoadBalancer.]

CPU Tasking Framework

Task life-cycle per thread

[Figure: task life-cycle between Dispatcher, TaskFactory, LoadBalancer and Queues, with arrows labelled "task execution", "new task" and "task".]

Tasks can be prioritized by task type

Only ready-to-execute tasks are stored in queue

Workstealing from other threads is possible
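The three properties above can be illustrated with one queue per worker thread, bucketed by task type. The sketch below is a hypothetical design: the `TaskType` values, their priority order, and the policy of stealing the newest low-priority task are assumptions chosen for the example, not details given in the talk.

```cpp
#include <cstddef>
#include <deque>
#include <map>
#include <mutex>
#include <vector>

// Hypothetical task types; a smaller value means a higher priority here.
enum class TaskType { M2M = 0, M2L = 1, L2L = 2, P2P = 3 };

struct Task {
    TaskType type;
    int id;
};

// One queue per worker thread; it only ever holds ready-to-execute tasks.
class WorkerQueue {
public:
    void push(const Task& t)
    {
        std::lock_guard<std::mutex> guard(lock_);
        buckets_[t.type].push_back(t);
    }

    // Owner side: oldest task of the highest-priority non-empty type.
    bool pop(Task& out)
    {
        std::lock_guard<std::mutex> guard(lock_);
        for (auto& bucket : buckets_) {
            if (!bucket.second.empty()) {
                out = bucket.second.front();
                bucket.second.pop_front();
                return true;
            }
        }
        return false;
    }

    // Thief side: newest task of the lowest-priority non-empty type.
    bool steal(Task& out)
    {
        std::lock_guard<std::mutex> guard(lock_);
        for (auto it = buckets_.rbegin(); it != buckets_.rend(); ++it) {
            if (!it->second.empty()) {
                out = it->second.back();
                it->second.pop_back();
                return true;
            }
        }
        return false;
    }

private:
    std::mutex lock_;
    std::map<TaskType, std::deque<Task>> buckets_;  // iterated in priority order
};

// A worker serves its own queue first and steals from the others otherwise.
bool next_task(std::vector<WorkerQueue>& queues, std::size_t self, Task& out)
{
    if (queues[self].pop(out)) {
        return true;
    }
    for (std::size_t victim = 0; victim < queues.size(); ++victim) {
        if (victim != self && queues[victim].steal(out)) {
            return true;
        }
    }
    return false;
}
```

Stealing from the low-priority end is one possible choice; it keeps the victim's most urgent tasks local while still giving idle threads something to do.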

Tasking Without Workstealing

103 680 particles on 2×Intel Xeon E5-2680 v3 (2×12 cores)

[Figure: number of active threads (0 to 24) over runtime in seconds (0 to 0.8), broken down by FMM phase: P2P, P2M, M2M, M2L, L2L, L2P.]

Tasking With Workstealing

103 680 particles on 2×Intel Xeon E5-2680 v3 (2×12 cores)

[Figure: number of active threads (0 to 24) over runtime in seconds (0 to 0.8), broken down by FMM phase: P2P, P2M, M2M, M2L, L2L, L2P.]

GPU Tasking

Goal

Provide same features as CPU tasking:

Static and dynamic load balancing

Priority queues

Ready-to-execute tasks

GPU Tasking

Uniform Programming Model for CPUs and GPUs

Pitfalls

Performance Portability

Diverse GPU programming approaches:

OpenCL

CUDA

SYCL

Our requirements:

Strong subset of C++11

Portability between GPU vendors

Tasking features

Maturity

(Intermediate) Solution

Use CUDA for reasons of performance, specific tasking features and maturity. Take the loss of not being portable out of the box.

Pitfalls

Performance Portability

For performance portability we consider diverse GPU programming approaches:

OpenCL

CUDA

SYCL

Unsatisfying (Intermediate) Solution

Use CUDA for reasons of performance and specific features. Take the loss of not being portable out of the box.

Pitfalls

Architectural Differences

Pitfalls for Load Balancing

No thread pinning

No cache coherency

Pitfalls for Mutual Exclusion

Weak memory consistency

Missing forward progress guarantees

Pitfalls

Load Balancing

No possibility to pin threads to streaming multiprocessors

No direct access to shared memory of other streaming multiprocessors

Work stealing requires multi-producer multi-consumer queues → Mechanism for mutual exclusion?
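As a point of reference for the question above, a lock-based multi-producer multi-consumer queue can be as simple as a ring buffer whose two ends are guarded by one lock. The sketch below is host-side C++ with `std::mutex` standing in for whatever mutual exclusion mechanism is available on the device; the class name `BoundedMpmcQueue` and the fixed capacity are assumptions for illustration.

```cpp
#include <array>
#include <cstddef>
#include <mutex>

// Fixed-capacity multi-producer multi-consumer queue. A single lock protects
// both ends, so any thread may enqueue or dequeue at any time.
template <typename T, std::size_t Capacity>
class BoundedMpmcQueue {
public:
    bool try_enqueue(const T& value)
    {
        std::lock_guard<std::mutex> guard(lock_);
        if (count_ == Capacity) {
            return false;  // queue is full
        }
        slots_[(head_ + count_) % Capacity] = value;
        ++count_;
        return true;
    }

    bool try_dequeue(T& out)
    {
        std::lock_guard<std::mutex> guard(lock_);
        if (count_ == 0) {
            return false;  // queue is empty
        }
        out = slots_[head_];
        head_ = (head_ + 1) % Capacity;
        --count_;
        return true;
    }

private:
    std::mutex lock_;
    std::array<T, Capacity> slots_{};
    std::size_t head_ = 0;   // index of the oldest element
    std::size_t count_ = 0;  // number of stored elements
};
```

The fixed capacity avoids dynamic allocation inside the queue; the price is that `try_enqueue` can fail and the caller has to handle a full queue.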

Pitfalls

Mutual Exclusion

Weak memory consistency

Warp-synchronous deadlocks due to lockstep execution

How to prove thread safety?

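Pitfalls

Mutex Implementation

class Mutex
{
public:
    // Spin until the lock word changes from 0 (free) to 1 (taken).
    __inline__ __device__ void lock()
    {
        while (atomicCAS(&mutex, 0, 1) != 0)
        {
        }
        // Keep the critical section's memory accesses behind the acquisition.
        __threadfence();
    }

    // Make the critical section's writes visible, then release the lock.
    __inline__ __device__ void unlock()
    {
        __threadfence();
        atomicExch(&mutex, 0);
    }

private:
    int mutex = 0;  // 0 = free, 1 = taken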

};
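For comparison, the same spin-lock pattern can be written in standard C++ on the host, where the acquire and release memory orders take over the role of the explicit fences. This is an analogue under that assumption, not the authors' code; `SpinMutex` and `count_with_lock` are names invented for the example.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Host-side counterpart of a device spin lock.
class SpinMutex {
public:
    void lock()
    {
        int expected = 0;
        // Try to swap 0 (free) for 1 (taken); on failure reset and retry.
        while (!state_.compare_exchange_weak(expected, 1, std::memory_order_acquire,
                                             std::memory_order_relaxed)) {
            expected = 0;
        }
    }

    void unlock()
    {
        state_.store(0, std::memory_order_release);
    }

private:
    std::atomic<int> state_{0};
};

// Let num_threads threads increment a plain int under the lock.
int count_with_lock(int num_threads, int increments_per_thread)
{
    SpinMutex mutex;
    int counter = 0;  // deliberately not atomic: the lock has to protect it
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&mutex, &counter, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) {
                mutex.lock();
                ++counter;
                mutex.unlock();
            }
        });
    }
    for (std::thread& w : workers) {
        w.join();
    }
    return counter;
}
```

A stress test like `count_with_lock` can expose a broken lock through lost increments, but a passing run does not prove thread safety.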

Very First Evaluation

Conditions

Tasking with global queue only

Measurements without workload to determine enqueue and dequeue overhead

Measurements on P100 with 56 thread blocks with 1024 threads each

Measurements on V100 with 80 thread blocks with 1024 threads each
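The idea of measuring pure tasking overhead can be sketched on the host: push empty tasks through one global queue and time the round trip. The code below is a single-threaded CPU stand-in with invented names (`OverheadResult`, `measure_queue_overhead`); it does not reproduce the GPU setup or its thread-block configuration.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <utility>

struct OverheadResult {
    std::size_t executed;  // number of tasks that ran
    double milliseconds;   // wall time for enqueue plus dequeue
};

// Push num_tasks empty tasks into one global queue, then drain it.
OverheadResult measure_queue_overhead(std::size_t num_tasks)
{
    std::queue<std::function<void()>> global_queue;
    std::mutex lock;
    std::size_t executed = 0;

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < num_tasks; ++i) {
        std::lock_guard<std::mutex> guard(lock);
        global_queue.push([&executed] { ++executed; });  // no payload besides counting
    }
    while (true) {
        std::function<void()> task;
        {
            std::lock_guard<std::mutex> guard(lock);
            if (global_queue.empty()) {
                break;
            }
            task = std::move(global_queue.front());
            global_queue.pop();
        }
        task();  // executed outside the lock
    }
    const auto stop = std::chrono::steady_clock::now();
    return {executed, std::chrono::duration<double, std::milli>(stop - start).count()};
}
```

Because the tasks do no real work, the measured time is dominated by locking and queue operations, which is the quantity of interest in such an overhead measurement.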

First Evaluation

Tasking Overhead on P100 and V100

[Figure: log-log plot of runtime in ms (10⁻¹ to 10⁵) over the number of tasks (10⁰ to 10⁶) for P100 and V100.]

GPU Tasking

Conclusion

Fine-grained task parallelism pays off on CPUs

Developed mapping between CPU and GPU concepts

(Partly) overcome pitfalls:

Lock-based mutual exclusion

Reusability of CPU tasking code

Architectural differences between CPU and GPU

Successfully transferred parts of CPU tasking to GPUs

Next Steps

Analyze and solve performance issues in dependency resolution

Use memory pool for dynamic allocations

Implement hierarchical queues

Transfer priority queue to GPU

Exploit data-parallelism through warps

Consider the use of lock-free data structures

Implement FMM based on GPU tasking

Thank You to Our Sponsor!

NVIDIA Tesla V100 and NVIDIA Tesla P100 were provided by
