
1

Tyler Sorensen

Adviser: Jade Alglave

University College London

WPLI 2015, April 12, 2015

GPU Concurrency: Weak Behaviours and Programming Assumptions

2

Based on our ASPLOS '15 paper:

Jade Alglave¹,², Mark Batty³, Alastair F. Donaldson⁴, Ganesh Gopalakrishnan⁵, Jeroen Ketema⁴, Daniel Poetzl⁶, Tyler Sorensen¹,⁵, John Wickerson⁴

¹ University College London, ² Microsoft Research, ³ University of Cambridge, ⁴ Imperial College London, ⁵ University of Utah, ⁶ University of Oxford

GPU Concurrency: Weak Behaviours and Programming Assumptions

3

Intel Core i7 4500 CPU

5

Nvidia Tesla C2075 GPU

6

Roadmap

• what happened to the pony (background)
• how we found the bug (methodology)
• how we are able to fix the pony (contribution)

7

What happened to the pony?

• the visualization bugs are due to weak memory behaviours on GPUs

8

Weak memory models

• consider the test known as message passing (mp)
• an instance of this test appears in the pony code
• initial state: x and y are memory locations
• thread ids
• program: for each thread id
• assertion: question about the final state of registers

13

Message passing (mp) test

• Tests how to implement a handshake idiom

[Figure: handshake between two threads; labels: Data, Flag, Stale Data]

assertion cannot be satisfied by interleavings

this is known as Lamport's sequential consistency (or SC)
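The claim that no interleaving satisfies the assertion can be checked by brute force. A minimal Python sketch, assuming the usual mp shape (T0 stores to x and then to y; T1 loads y into r1 and then x into r2) and initial values of 0. The encoding and helper names are illustrative and are not part of the litmus tool:

```python
from itertools import combinations

# Hypothetical encoding of mp: T0 writes data (x) then flag (y);
# T1 reads flag (y) into r1, then data (x) into r2.
T0 = [("store", "x", 1), ("store", "y", 1)]
T1 = [("load", "y", "r1"), ("load", "x", "r2")]


def interleavings(a, b):
    """Yield every merge of a and b that keeps each thread's program order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        it_a, it_b = iter(a), iter(b)
        yield [next(it_a) if i in slots else next(it_b) for i in range(n)]


def run(schedule):
    """Execute one interleaving against a single shared memory (SC)."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for op, loc, arg in schedule:
        if op == "store":
            mem[loc] = arg
        else:
            regs[arg] = mem[loc]
    return regs["r1"], regs["r2"]


sc_outcomes = {run(s) for s in interleavings(T0, T1)}
print(sorted(sc_outcomes))    # [(0, 0), (0, 1), (1, 1)]
print((1, 0) in sc_outcomes)  # False: r1=1 /\ r2=0 needs a non-SC execution
```

Only six interleavings exist for two threads of two instructions each, so exhaustive enumeration is cheap here.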

22


Weak memory models

• can we assume assertion will never pass? No!

24

Weak memory models

• Alglave and Maranget report that this assertion is satisfied 41 million times out of 5 billion test runs on a Tegra2 ARM processor¹

¹ http://diy.inria.fr/cats/tables.html

25

Weak memory models

• what happened?

• architectures implement weak memory models where the hardware is allowed to re-order certain memory instructions.

• weak memory models can allow weak behaviours (executions that do not correspond to any interleaving)
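To see how re-ordering produces an execution that is not an interleaving, here is a toy Python sketch. It assumes, purely for illustration, that the hardware may re-order the two accesses within each thread of mp freely; real memory models are more constrained than this. All names are hypothetical:

```python
from itertools import combinations, permutations

T0 = [("store", "x", 1), ("store", "y", 1)]       # data, then flag
T1 = [("load", "y", "r1"), ("load", "x", "r2")]   # flag, then data


def merges(a, b):
    """Yield every merge of a and b that keeps the order given in a and b."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        it_a, it_b = iter(a), iter(b)
        yield [next(it_a) if i in slots else next(it_b) for i in range(n)]


def run(schedule):
    """Execute a schedule against one shared memory; return (r1, r2)."""
    mem, regs = {"x": 0, "y": 0}, {}
    for op, loc, arg in schedule:
        if op == "store":
            mem[loc] = arg
        else:
            regs[arg] = mem[loc]
    return regs["r1"], regs["r2"]


def outcomes(allow_reordering):
    """Final (r1, r2) pairs, with or without per-thread re-ordering."""
    orders0 = list(permutations(T0)) if allow_reordering else [tuple(T0)]
    orders1 = list(permutations(T1)) if allow_reordering else [tuple(T1)]
    return {run(s) for a in orders0 for b in orders1 for s in merges(a, b)}


# Outcomes that appear only once re-ordering is allowed:
print(sorted(outcomes(True) - outcomes(False)))  # [(1, 0)]
```

The only new outcome is r1=1 and r2=0: the flag is seen as set while the data is still stale.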

26

GPU memory models

• what type of memory model do current GPUs implement?

• documentation is sparse

• CUDA has 1 page + 1 example
• PTX has 1 page + 0 examples

• given in English prose

• we need to know this if we are to write correct GPU programs!

27

GPU programming

[Figure: threads are grouped into CTA 0, CTA 1, ..., CTA n; each CTA has its own shared memory (Shared Memory for CTA 0, ..., Shared Memory for CTA n); all threads can access Global Memory]

Within CTAs, threads are grouped into warps (32 threads per warp in Nvidia GPUs)
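The grouping of threads into CTAs and warps can be sketched as index arithmetic. This Python sketch assumes a one-dimensional launch in which threads are numbered consecutively across CTAs and warps are formed from consecutive thread ids; the function names are hypothetical:

```python
WARP_SIZE = 32  # threads per warp on Nvidia GPUs, as stated above


def locate(global_tid, threads_per_cta):
    """Map a flat thread index to (CTA, warp within CTA, lane within warp)."""
    cta, tid_in_cta = divmod(global_tid, threads_per_cta)
    warp, lane = divmod(tid_in_cta, WARP_SIZE)
    return cta, warp, lane


def relation(t0, t1, threads_per_cta):
    """Narrowest level of the hierarchy shared by two threads."""
    c0, w0, _ = locate(t0, threads_per_cta)
    c1, w1, _ = locate(t1, threads_per_cta)
    if c0 != c1:
        return "different CTAs"
    return "same warp" if w0 == w1 else "same CTA, different warps"


print(locate(130, 128))       # (1, 0, 2)
print(relation(0, 32, 128))   # same CTA, different warps
print(relation(0, 128, 128))  # different CTAs
```

Which of these relations holds between two communicating threads decides whether they can use shared memory at all and, as the later slides show, which fence is needed.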

32


Roadmap

• what happened to the pony
• how we found the bug
• how we are able to fix the pony

33

Methodology

GPU litmus tests

GPU hardware

formal model

compare results

34

GPU tests

• GPU litmus test considerations
• PTX instructions
• what memory region (shared or global) are x and y in?
• are T0 and T1 in the same CTA or different CTAs?

Scope Tree (device (cta T0) (cta T1))
x: global, y: global

40

Running tests

• we extend the litmus CPU testing tool of Alglave and Maranget to run GPU tests

• given a GPU litmus test, it generates executable CUDA or OpenCL code for the test

41

Heuristics

• memory stress: extra threads read and write to scratch memory

T0: run T0 test program
T1: run T1 test program
extra thread 1, ..., extra thread n: loop: read or write to scratchpad

42

Heuristics

• random threads: randomize the location of threads

T0

T1

43

Heuristics

# of weak behaviours in 100,000 runs for different heuristics on a Nvidia Tesla C2075

test     none   random threads   memory stress   memory stress + random threads
gpu-mp   0      0                139             522


50

How we found the pony bug


test     none   random threads   memory stress   memory stress + random threads
gpu-mp   0      0                139             522

This is the idiom and these are the heuristics that caused the bug!

51


Roadmap

• what happened to the pony
• how we found the bug
• how we are able to fix the pony

52

GPU fences

• PTX gives 2 fences to disallow reading stale data

• membar.cta – gives ordering intra-CTA

• membar.gl – gives ordering over device
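The effect of a scoped fence on mp can be sketched with a toy model in Python. The assumptions are strong and purely illustrative: accesses in a thread may be re-ordered freely unless an effective fence separates them, membar.gl is always effective, and membar.cta is effective only when both testing threads are in the same CTA. This is not the PTX model from the paper, and all names are hypothetical:

```python
from itertools import combinations, permutations


def thread_orders(program, same_cta):
    """Orders of one thread's accesses that never cross an effective fence."""
    segments = [[]]
    for ins in program:
        if ins[0] == "fence":
            if ins[1] == "membar.gl" or same_cta:
                segments.append([])      # effective fence: start a new segment
        else:
            segments[-1].append(ins)
    orders = [[]]
    for seg in segments:
        orders = [o + list(p) for o in orders for p in permutations(seg)]
    return orders


def merges(a, b):
    """Yield every merge of a and b that keeps the order given in a and b."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        it_a, it_b = iter(a), iter(b)
        yield [next(it_a) if i in slots else next(it_b) for i in range(n)]


def run(schedule):
    """Execute a schedule against one shared memory; return (r1, r2)."""
    mem, regs = {"x": 0, "y": 0}, {}
    for op, loc, arg in schedule:
        if op == "store":
            mem[loc] = arg
        else:
            regs[arg] = mem[loc]
    return regs["r1"], regs["r2"]


def mp_outcomes(fence, same_cta):
    """Final (r1, r2) pairs of mp with an optional fence in both threads."""
    mid = [("fence", fence)] if fence else []
    t0 = [("store", "x", 1)] + mid + [("store", "y", 1)]
    t1 = [("load", "y", "r1")] + mid + [("load", "x", "r2")]
    return {run(s)
            for a in thread_orders(t0, same_cta)
            for b in thread_orders(t1, same_cta)
            for s in merges(a, b)}


for fence in (None, "membar.cta", "membar.gl"):
    print(fence, (1, 0) in mp_outcomes(fence, same_cta=False))
```

For threads in different CTAs the stale-data outcome stays reachable in this toy model with no fence and with membar.cta, and disappears with membar.gl.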

53

GPU fences

• Test amended with a parameterizable fence


54

GPU fences

# of weak behaviours in 100,000 runs for different fences on a Nvidia Tesla C2075

test     none   membar.cta   membar.gl
gpu-mp   3380   2            0


57

How do we fix the pony

Tesla C2075 Nvidia GPU

58

How do we fix the pony

• adding fences to the code

Tesla C2075 Nvidia GPU (with fences)

59

GPU testing campaign

• we extend the diy CPU litmus test generation tool of Alglave and Maranget to generate GPU tests

• generates litmus tests based on cycles

• enumerates the tests over the GPU thread and memory hierarchy
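Enumerating one test shape over the thread and memory hierarchy can be sketched as a cross product. This Python sketch is only an illustration of the idea and does not reproduce the extended diy tool: the two scope trees are the ones shown on the test slides, and the filter that drops shared memory for threads in different CTAs is an assumption based on shared memory being private to a CTA:

```python
from itertools import product

# Scope trees as written on the test slides; the dictionary keys are hypothetical.
SCOPE_TREES = {
    "intra-CTA": "(grid(cta(warp T0) (warp T1)))",
    "inter-CTA": "(device (cta T0) (cta T1))",
}
REGIONS = ("shared", "global")


def variants(locations=("x", "y")):
    """Yield (placement, scope tree, memory map) for each sensible variant."""
    placements = product(REGIONS, repeat=len(locations))
    for (name, tree), regions in product(SCOPE_TREES.items(), placements):
        if name == "inter-CTA" and "shared" in regions:
            continue  # threads in different CTAs cannot share shared memory
        yield name, tree, dict(zip(locations, regions))


for name, tree, memory_map in variants():
    print(name, tree, memory_map)
```

For the two locations of mp this yields five variants: four memory placements inside one CTA and one placement across CTAs.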

60


• Using our tools, we generated and ran 10930 tests over 5 Nvidia chips:

chip          year   architecture
GTX 750 ti    2014   Maxwell
GTX Titan     2013   Kepler
GTX 660       2012   Kepler
GTX 540m      2011   Fermi
Tesla C2075   2011   Fermi

61


• Results are hosted at: http://virginia.cs.ucl.ac.uk/sunflowers/asplos15/flat.html

62

Modeling

• we extended the CPU axiomatic memory modeling tool herd of Alglave and Maranget to GPUs

• we developed an axiomatic memory model for PTX which is able to simulate all of our tests

• our model is sound with respect to all of our hardware observations

63

Modeling

• Demo of web interface

64

More results

• surprising and buggy behaviours observed:

• GPU mutex implementations allow stale data to be read (found in the CUDA by Example book and other academic papers¹,²)

  this led to an erratum issued by Nvidia

• Hardware re-orders loads from the same address in Nvidia Fermi and Kepler

• Some testing on AMD GPUs

¹ J. A. Stuart and J. D. Owens, "Efficient synchronization primitives for GPUs" CoRR, 2011, http://arxiv.org/pdf/1110.4623.pdf.

² B. He and J. X. Yu, "High-throughput transaction executions on graphics processors" PVLDB 2011.

65

Related work (CPU memory models)

• Alglave et al. have done extensive work on testing and modeling CPUs (notably IBM Power and ARM) and created the tools diy, litmus, and herd, which we extended for this work

• Collier tested CPU memory models using the ARCHTEST tool

66

Related work (GPU memory models)

• Hower et al. have proposed several SC-for-race-free language-level memory models for GPUs

Questions?

Nvidia Tesla C2075 GPU (with fences)

Nvidia Tesla C2075 GPU

Intel Core i7 4500 CPU

project page: http://virginia.cs.ucl.ac.uk/sunflowers/asplos15/

68

CUDA by Example

Intel Core i7 4500 CPU

69

CUDA by Example

Nvidia Tesla C2075 GPU

70

CUDA by Example

Nvidia Tesla C2075 GPU (with fences)

71

Read-after-Read Hazard

72

Ignore after this

73

Results

• Surprising and buggy behaviours observed:

• SC-per-location violations on NVIDIA Fermi and Kepler architecture:

todo: add CORR test

74

Limitations

• warps: we do not test intra-warp behaviours as the lock step behaviour of warps is not compatible with some of our heuristics

• grids: we do not test inter-grid behaviours as we did not find any examples in the literature

75

GPU programming

• GPUs are SIMT (Single Instruction, Multiple Thread)

• Nvidia GPUs may be programmed using CUDA or OpenCL

76

Roadmap

• background and motivation
• approach
• GPU tests
• running tests
• modeling

77

Heuristics

• two additional heuristics:

• synchronization: testing threads synchronize immediately before running the test program

• general bank conflicts: generate memory accesses that conflict with the accesses in the memory stress heuristic

78

Challenges

• PTX optimizing assembler may reorder or remove instructions

• We developed a tool optcheck which compares the litmus test with the binary and checks for optimizations

79

Roadmap

• background and motivation
• approach
• GPU tests
• running tests
• modeling

80

GPU tests

• concrete GPU test

T0               | T1               ;
st.cg.s32 [x], 1 | ld.cg.s32 r1,[y] ;
st.cg.s32 [y], 1 | ld.cg.s32 r2,[x] ;

ScopeTree
(grid(cta(warp T0) (warp T1)))

x: shared, y: global

exists (1:r1=1 /\ 1:r2=0)
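The final `exists` clause is a question about the registers once the test has run. A small Python sketch of how such a clause could be checked against one observed final state, assuming the clause is a plain conjunction of `thread:register=value` atoms as in this test. The parser is hypothetical and is not the one used by the litmus tool:

```python
import re


def parse_exists(clause):
    """Parse 'exists (T:reg=val /\\ ...)' into a list of (thread, reg, value)."""
    body = clause.strip()
    if not body.startswith("exists"):
        raise ValueError("not an exists clause")
    body = body[len("exists"):].strip().strip("()")
    atoms = []
    for atom in body.split("/\\"):
        m = re.fullmatch(r"\s*(\d+):(\w+)=(-?\d+)\s*", atom)
        if m is None:
            raise ValueError(f"unsupported atom: {atom!r}")
        atoms.append((int(m.group(1)), m.group(2), int(m.group(3))))
    return atoms


def satisfied(atoms, final_regs):
    """True if every atom holds in final_regs, keyed by (thread, register)."""
    return all(final_regs.get((tid, reg)) == val for tid, reg, val in atoms)


cond = parse_exists(r"exists (1:r1=1 /\ 1:r2=0)")
print(cond)                                          # [(1, 'r1', 1), (1, 'r2', 0)]
print(satisfied(cond, {(1, "r1"): 1, (1, "r2"): 0}))  # True: the weak outcome
print(satisfied(cond, {(1, "r1"): 1, (1, "r2"): 1}))  # False
```

Counting the runs for which the check returns True gives the kind of tally reported in the results tables.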


83

GPU programming

explicit hierarchical concurrency model

• thread hierarchy:
  • thread
  • warp
  • CTA (Cooperative Thread Array)
  • grid

• memory hierarchy:
  • shared memory
  • global memory

84

GPU background

Images from Wikipedia [15,16,17]

• GPU is a highly parallel co-processor

• currently found in devices from tablets to top supercomputers

• not just used for visualization anymore!

85

References

[1] L. Lamport, "How to make a multiprocessor computer that correctly executes multi-process programs" Trans. Comput. 1979.

[2] J. Alglave, L. Maranget, S. Sarkar, and P. Sewell, "Litmus: Running tests against hardware" TACAS 2011.

[3] J. Alglave, L. Maranget, and M. Tautschnig, "Herding cats: modelling, simulation, testing, and data-mining for weak memory" TOPLAS 2014.

[4] NVIDIA, "CUDA C programming guide, version 6 (July 2014)" http://docs.nvidia.com/cuda/pdf/CUDA%20C%20Programming%20Guide.pdf

[5] NVIDIA, "Parallel Thread Execution ISA: Version 4.0 (Feb. 2014)" http://docs.nvidia.com/cuda/parallel-thread-execution

[6] J. Alglave, L. Maranget, S. Sarkar, and P. Sewell, "Fences in weak memory models (extended version)" FMSD 2012.

[7] J. Sanders and E. Kandrot, "CUDA by Example: An Introduction to General-Purpose GPU Programming" Addison-Wesley Professional, 2010.

86

References

[8] J. A. Stuart and J. D. Owens, "Efficient synchronization primitives for GPUs" CoRR, 2011, http://arxiv.org/pdf/1110.4623.pdf.

[9] B. He and J. X. Yu, "High-throughput transaction executions on graphics processors" PVLDB 2011.

[10] W. W. Collier, "Reasoning About Parallel Architectures" Prentice-Hall, Inc., 1992.

[11] D. R. Hower, B. M. Beckmann, B. R. Gaster, B. A. Hechtman, M. D. Hill, S. K. Reinhardt, and D. A. Wood, "Sequential consistency for heterogeneous-race-free" MSPC 2013.

[12] D. R. Hower, B. A. Hechtman, B. M. Beckmann, B. R. Gaster, M. D. Hill, S. K. Reinhardt, and D. A. Wood, "Heterogeneous-race-free memory models" ASPLOS 2014.

[13] T. Sorensen, G. Gopalakrishnan, and V. Grover, "Towards shared memory consistency models for GPUs" ICS 2013.

[14] W.-m. W. Hwu, "GPU Computing Gems Jade Edition" Morgan Kaufmann Publishers Inc., 2011.

87

References

[15] http://en.wikipedia.org/wiki/Samsung_Galaxy_S5

[16] http://en.wikipedia.org/wiki/Titan_(supercomputer)

[17] http://en.wikipedia.org/wiki/Barnes_Hut_simulation

88

Roadmap

• what happened to the pony (background)• how we found the bug (methodology)• how we are able to fix the pony (contribution)

89


• Tests how to implement a handshake idiom
• Found in Octree code for the pony visualization

90



Data

Data

91



Flag

Flag

92

Methodology

• empirically explore the hardware memory model implemented on deployed NVIDIA and AMD GPUs

• develop hardware memory model testing tools for GPUs

• analyze classic (i.e. CPU) memory model properties and communication idioms in CUDA applications

• run large families of tests on GPUs as a basis for modeling and bug hunting

93



Stale Data

94

Running tests

• however, unlike CPUs, simply running the tests did not yield any weak memory behaviours for Nvidia chips!

• we developed heuristics to run tests under a variety of stress to expose weak behaviours
