sequential consistency for heterogeneous-race-free derek r. hower, bradford m. beckmann, benedict r....

29
Sequential Consistency for Heterogeneous-Race- Free DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K. REINHARDT, DAVID A. WOOD JUNE 12, 2013

Upload: douglas-russell

Post on 16-Jan-2016

213 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Sequential Consistency for Heterogeneous-Race-Free DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K

Sequential Consistency for Heterogeneous-Race-Free

DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K. REINHARDT, DAVID A. WOOD

JUNE 12, 2013

Page 2: Sequential Consistency for Heterogeneous-Race-Free DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K

2MSPC 2013 | JUNE 21, 2013

EXECUTIVE SUMMARY

Existing GPU memory models ambiguous, dense for programmers

CPUs: Sequential Consistency for Data-Race-Free (SC for DRF)– Relaxed HW, precise semantics, programmer-friendly– Problem: GPUs use scoped synchronization

Sequential Consistency for Heterogeneous-Race-Free (SC for HRF)– SC for DRF + Scopes

HRF0: Basic scope synchronization– Two threads communicate use identical scope synchronization– Works well with existing, regular GPU codes

Beyond HRF0?– There are limits

Page 3: Sequential Consistency for Heterogeneous-Race-Free DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K

3MSPC 2013 | JUNE 21, 2013

OUTLINE

Background and SetupHRF0: Basic scope synchronizationFuture directions

Page 4: Sequential Consistency for Heterogeneous-Race-Free DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K

4MSPC 2013 | JUNE 21, 2013

BACKGROUND AND SETUP

CPU Programmers use memory model to understand memory behavior – Sequential Consistency (SC) [1979]: threads interleave like multitasking

uniprocessor– HW/Compiler actually implements TSO [1991] or more relaxed model– JavaTM [2005] and C++ [2008] insure SC for data-race-free (DRF) programs

Programmers need a GPU memory model for abstraction and portability– Currently GPU models expose ad hoc HW mechanisms– SC for DRF is a start, BUT…

OpenCLTM Execution Model

Page 5: Sequential Consistency for Heterogeneous-Race-Free DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K

5MSPC 2013 | JUNE 21, 2013

SEQUENTIAL CONSISTENCY FOR DATA-RACE-FREE

Two memory accesses participate in a data race if they– access the same location– at least one access is a store– can occur simultaneously

• i.e. appear as adjacent operations in interleaving.

A program is data-race-free if no possible execution results in a data race.

Sequential consistency for data-race-free programs– Avoid everything else

GPUs: Not good enough!

Page 6: Sequential Consistency for Heterogeneous-Race-Free DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K

6MSPC 2013 | JUNE 21, 2013

DATA-RACE-FREE IS NOT ENOUGH

t1 t2 t3 t4ST X = 1GroupLock.Release() GroupLock.Acquire() ... GlobalLock.Release() GlobalLock.Acquire() LD X (??)

Workgroup #1-2 Workgroup #3-4

Two ordinary memory accesses participate in a data race if theyAccess same locationAt least one is a storeCan occur simultaneously

Not a data race… Is it SC?

t4t3t1 t2

SGlobal

S12 S34

Page 7: Sequential Consistency for Heterogeneous-Race-Free DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K

7MSPC 2013 | JUNE 21, 2013

WHAT WILL HAPPEN?

“waits until all…memory accesses made by the calling thread prior to [{Group, Global}Lock.Release()] are visible to…

all threads in the {group, device}”

t1 t2 t3 t4ST X = 1GroupLock.Release() GroupLock.Acquire() ... GlobalLock.Release() GlobalLock.Acquire() LD X (??)

Workgroup #1-2 Workgroup #3-4

CUDATM __threadfence{_block} definition:

Page 8: Sequential Consistency for Heterogeneous-Race-Free DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K

8MSPC 2013 | JUNE 21, 2013

WHAT WILL HAPPEN?

t1 t2 t3 t4ST X = 1GroupLock.Release() GroupLock.Acquire() ... GlobalLock.Release() GlobalLock.Acquire() LD X (??)

Workgroup #1-2 Workgroup #3-4

Heterogeneous Race!

Makes X =1 visible to t2

Makes … visible to t3

Page 9: Sequential Consistency for Heterogeneous-Race-Free DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K

9MSPC 2013 | JUNE 21, 2013

SEQUENTIAL CONSISTENCY FOR DATA-RACE-FREE

Two memory accesses participate in a data race if they– access the same location– at least one access is a store– can occur simultaneously

• i.e. appear as adjacent operations in interleaving.

A program is data-race-free if no possible execution results in a data race.

Sequential consistency for data-race-free programs– Avoid everything else

Page 10: Sequential Consistency for Heterogeneous-Race-Free DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K

10MSPC 2013 | JUNE 21, 2013

SEQUENTIAL CONSISTENCY FOR HETEROGNEOUS-RACE-FREE

Two memory accesses participate in a heterogeneous race if– access the same location– at least one access is a store– can occur simultaneously

• i.e. appear as adjacent operations in interleaving.

– Are not synchronized with “enough” scope

A program is heterogeneous-race-free if no possible execution results in a heterogeneous race.

Sequential consistency for heterogeneous-race-free programs

– Avoid everything else

Page 11: Sequential Consistency for Heterogeneous-Race-Free DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K

11MSPC 2013 | JUNE 21, 2013

A FIRST CUT

HRF0: Basic Scope Synchronization– “enough” = both threads synchronize using identical scope

Recall example:– Contains a heterogeneous race in HRF0

t1 t2 t3 t4ST X = 1GroupLock.Release() GroupLock.Acquire() ... GlobalLock.Release() GlobalLock.Acquire() LD X (??)

Workgroup #1-2 Workgroup #3-4

Different Scope

HRF0 Conclusion: This is bad. Don’t do it.

Page 12: Sequential Consistency for Heterogeneous-Race-Free DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K

12MSPC 2013 | JUNE 21, 2013

IS HRF0 USEFUL?

Use smallest scope that includes all producers/consumers of shared data

HRF0 Scope Selection Guideline

Implication: Producers/consumers must be known at synchronization time

Want: For performance, use smallest scope possibleWhat is safe in HRF0?

Is this a valid assumption?

Page 13: Sequential Consistency for Heterogeneous-Race-Free DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K

13MSPC 2013 | JUNE 21, 2013

REGULAR GPGPU WORKLOADS

N

M

DefineProblem Space

Partition Hierarchically

CommunicateLocally

N times

CommunicateGlobally

M times

Well defined (regular) data partitioning +

Well defined (regular) synchronization pattern =

Producer/consumers are always known

Generally: HRF0 works well with existing regular GPGPU workloads

Page 14: Sequential Consistency for Heterogeneous-Race-Free DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K

14MSPC 2013 | JUNE 21, 2013

IRREGULAR GPGPU WORKLOADS

HRF0: example is race– Must upgrade Group -> Global

Current hardware: – LD X will see value (1)!

t1 t2 t3 t4ST X = 1GroupLock.Release() GroupLock.Acquire() ... GlobalLock.Release() GlobalLock.Acquire() LD X (??)

Workgroup #1-2 Workgroup #3-4

Page 15: Sequential Consistency for Heterogeneous-Race-Free DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K

15MSPC 2013 | JUNE 21, 2013

HRFX: THE NEXT ITERATION

HRF0 is overly conservative on existing HW– Makes fast irregular parallelism hard

Other HRF definitions are possible– e.g., define behavior when different scopes interact

What are the gotchas? (there will be many…)

The answer: ???

Page 16: Sequential Consistency for Heterogeneous-Race-Free DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K

16MSPC 2013 | JUNE 21, 2013

CONCLUSIONS & FUTURE DIRECTIONS

GPUs Currently expose ad-hoc scoped synchronization

From CPU world: mask low-level details with SC for DRF– GPU scope synchronization incompatible w/ SC for DRF

GPUs: SC for HRF– HRF0: Basic scope synchronization

+ Easy-ish to define/understand+ Safe interpretation of existing models+ Permits most HW optimizations − Prohibits some SW opts in current hardware

– HRFx: models to exploit hierarchy• What happens when different scopes interact?

Let’s not wait 30 years this time

Page 17: Sequential Consistency for Heterogeneous-Race-Free DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K

THANKS! QUESTIONS?

Page 18: Sequential Consistency for Heterogeneous-Race-Free DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K

BACKUP

Page 19: Sequential Consistency for Heterogeneous-Race-Free DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K

19MSPC 2013 | JUNE 21, 2013

BASIC APPLICATION EXAMPLE – SEQUENCE ALIGNMENT

Well defined (regular) synchronization pattern +

Well defined (regular) data partitioning =

Producer/consumers are always known

Smallest box: WavefrontIntermediate box: WorkgroupFull box: NDRange

NW N

W

Generally: HRF0 works well with existing regular GPGPU workloads

Page 20: Sequential Consistency for Heterogeneous-Race-Free DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K

21MSPC 2013 | JUNE 21, 2013

SCOPED SYNCHRONIZATION

Some operations are not global, have visibility limited to workgroup or device– HSAIL: st_{rel, part_rel}, ld_{acq, part_acq}, etc.– CUDATM: threadfence_{block, system}, __syncthreads, etc.– PTX: membar.{cta, gl, sys}, bar, etc.

t4t3t1 t2

SGlobal

S12 S34

Page 21: Sequential Consistency for Heterogeneous-Race-Free DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K

22MSPC 2013 | JUNE 21, 2013

EXECUTIVE SUMMARY

GPUs: streaming memory system, hierarchical programming model– Use partial (scoped) synchronization– Existing GPU memory models ambiguous, dense for programmers

From the CPU world: SC for DRF– Relaxed HW, precise semantics, programmer-friendly– Let’s not wait 30 years this time

This work: apply same principle to GPU world– Sequential Consistency for Heterogeneous-Race-Free (SC for HRF)

HRF0: Basic scope synchronization– Two threads communicate use identical scope synchronization– Works well with existing GPU codes

Others possible. HRF0 is:– Difficult to use efficiently w/ irregular synchronization– Overly conservative for current implementations

Page 22: Sequential Consistency for Heterogeneous-Race-Free DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K

23MSPC 2013 | JUNE 21, 2013

INSERTING SCOPED SYNCHRONIZATION

Global

Barrie

r

Global

Barrie

r

Global

Barrie

r

Global

Barrie

r

Global

Barrie

r

Global

Barrie

r

Time

Synchronization Scope:WF on N, W edge of WG use global acquireWF on S, E edge of WG use global release

All other sync is local

Acquire_S?

cell[id] = fn(cell[N], cell[NW], cell[W])Release_S?

Page 23: Sequential Consistency for Heterogeneous-Race-Free DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K

24MSPC 2013 | JUNE 21, 2013

EXAMPLE AMBIGUITY

What value of X does t3 see?

t1 t2 t3 t4ST X = 1Release_S12

Acquire_S12

LD X (1) Release_SGlobal

Acquire_SGlobal

LD X (??)

t4t3t1 t2

SGlobal

S12 S34

Answer not obvious -- depends on system

Page 24: Sequential Consistency for Heterogeneous-Race-Free DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K

25MSPC 2013 | JUNE 21, 2013

EXAMPLE: CONVENTIONAL BASE SYSTEM

Conventional write-through/combining cache hierarchy:– Local release flush stores from queue– Local acquire stall until queue is empty

– Global acquire Invalidate all valid locations in L1 cache– Global release Flush all dirty locations in L1 cache

L1 L1

L2

t1 t2 t3 t4

wf1 wf2 wf3 wf4ST X = 1Release_S12

Acquire_S12

LD X (1) Release_SGlobal

Acquire_SGlobal

LD X (??) WF3 ALWAYS sees (1)

X = 1

ST X = 1Release_S12

Acquire_S12

LD X (1)

X = 1

Release_SGlobal

Acquire_SGlobal

LD X (??)

X = 1

X = 1

Page 25: Sequential Consistency for Heterogeneous-Race-Free DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K

26MSPC 2013 | JUNE 21, 2013

EXAMPLE: OPTIMIZED BASE SYSTEM

Optimized write-combining cache hierarchy– Local release flush stores from queue– Local acquire stall until queue is empty

– Global acquire Inv. locations read by acquiring WF in L1– Global release Flush locations written by releasing WF in L1

L1 L1

L2

t1 t2 t3 t4

t1 t2 t3 wf4ST X = 1Release_S12

Acquire_S12

LD X (1) Release_SGlobal

Acquire_SGlobal

LD X (??) WF3 sees (0) or (1)

X = 1

ST X = 1Release_S12

Acquire_S12

LD X (1)

X = 1

Release_SGlobal

Acquire_SGlobal

LD X (??)

X = 0

X = 0

Page 26: Sequential Consistency for Heterogeneous-Race-Free DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K

27MSPC 2013 | JUNE 21, 2013

ASSUMPTIONS/SIMPLIFICATIONS

1. Ignore scratchpad (local/group/shared) memory• i.e., all memory in single, global shared address space

2. Use scoped acquire/release synchronization• Generalizes to other forms of synchronization

3. Examples: two-level scope hierarchy with simple threads

t4t3t1 t2

SGlobal

S12 S34

Page 27: Sequential Consistency for Heterogeneous-Race-Free DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K

28MSPC 2013 | JUNE 21, 2013

APPLICATION EXAMPLE – TASK PARALLEL RUNTIME

Global Queue

Local Queue

WF1 WF2

Local Queue

WF3 WF4

Hierarchical task queue: - WF produce/consume tasks independently - Use local queue until: - Local queue empty pull from global - Local queue full push to global

Observation: - Push/pull performed on behalf of workgroup

HRF0: - Either all sync has to be global or WFs coordinate to push/pullHRF1: - Single WF can push/pull independently

Beware: HRF1 consumes significant brain power

Page 28: Sequential Consistency for Heterogeneous-Race-Free DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K

29MSPC 2013 | JUNE 21, 2013

SCOPES

Scope: A subset of threads

Scoped synchronization: synchronization w.r.t. a scope– OpenCLTM: mem_fence– HSAIL: st_{rel, part_rel}, ld_{acq, part_acq}, etc.– CUDATM: threadfence_{block, system}, __syncthreads, etc.– PTX: membar.{cta, gl, sys}, bar, etc.

Scopes introduce new class of races:– What happens when threads (wavefronts/warps) use different scopes?

WF4WF3WF1 WF2

SGlobal

S12 S34L1 L1

L2

WF1 WF2 WF3 WF4

Page 29: Sequential Consistency for Heterogeneous-Race-Free DEREK R. HOWER, BRADFORD M. BECKMANN, BENEDICT R. GASTER, BLAKE A. HECHTMAN, MARK D. HILL, STEVEN K

30MSPC 2013 | JUNE 21, 2013

DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.