Cache Coherence in Scalable Machines (IV)
99-6-7 2
Dealing with Correctness Issues
- Serialization of operations
- Deadlock
- Livelock
- Starvation
Serialization of Operations
Need a serializing agent:
- home memory is a good candidate, since all misses go there first
Possible mechanism: FIFO buffering of requests at the home, until previous requests forwarded from the home have returned replies to it
- but the input-buffer problem becomes acute at the home
Possible solution: let the input buffer overflow into main memory (MIT Alewife)
Serialization of Operations (cont’d)
Don’t buffer at home, but forward to the owner node when the directory is in a busy state (Stanford DASH):
- serialization is determined by the home when the block is clean, by the owner when it is exclusive
- if the request cannot be satisfied at the “owner” (e.g. the block was written back or ownership was given up), it is NACKed back to the requestor without being serialized, and serialized when retried
Don’t buffer at home; use a busy state to NACK (Origin):
- the serialization order is the order in which requests are accepted (not NACKed)
Maintain the FIFO buffer in a distributed way (SCI)
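The Origin-style busy/NACK approach above can be sketched in a few lines (a toy Python model with invented names, not the actual Origin logic): the home NACKs any request for a block that has an outstanding transaction, and the serialization order is simply the order in which requests are accepted.

```python
class HomeNode:
    """Toy model: per-block busy state with NACKs (Origin-style sketch)."""

    def __init__(self):
        self.busy = set()        # blocks with an outstanding transaction
        self.serialized = []     # accepted requests, in serialization order

    def request(self, requestor, block):
        if block in self.busy:
            return "NACK"        # requestor must retry later
        self.busy.add(block)
        self.serialized.append((requestor, block))
        return "ACCEPT"

    def complete(self, block):
        self.busy.discard(block)  # transaction done; block is free again

home = HomeNode()
assert home.request("P1", "A") == "ACCEPT"
assert home.request("P2", "A") == "NACK"    # A is busy: P2 must retry
home.complete("A")
assert home.request("P2", "A") == "ACCEPT"  # the retried request is serialized now
assert home.serialized == [("P1", "A"), ("P2", "A")]
```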
Serialization to a Location (cont’d)
Having a single entity determine the order is not enough. Example:

  P1            P2
  rd A (i)      wr A
  barrier       barrier
  rd A (ii)

The second read of A should return the value written by P2. Both serializations are legal:
  rd A (i) -> wr A -> rd A (ii)
  wr A -> rd A (i) -> rd A (ii)
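The claim can be checked mechanically with a minimal sketch (invented names): enumerate the two legal serializations; the racing first read may see either value, but the second read sees P2's write in both.

```python
# The serializing agent may order the racing rd A (i) and wr A either way,
# but the barrier forces rd A (ii) after wr A in every legal order.
legal_orders = [
    ["rd_i", "wr", "rd_ii"],   # rd A (i) serialized before the write
    ["wr", "rd_i", "rd_ii"],   # rd A (i) serialized after the write
]

def run(order):
    A = 0                      # initial value of A
    results = {}
    for op in order:
        if op == "wr":
            A = 1              # P2's write of A
        else:
            results[op] = A    # a read returns the currently serialized value
    return results

for order in legal_orders:
    assert run(order)["rd_ii"] == 1     # the second read must see P2's write
assert run(legal_orders[0])["rd_i"] == 0
assert run(legal_orders[1])["rd_i"] == 1
```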
Serialization to a Location (cont’d)
Having a single entity determine the order is not enough: it may not know when all transactions for that operation are done everywhere.
[Figure: message sequence among P1, P2, and the Home node; numbered transactions as described below]
1. P1 issues read request to home node for A
2. P2 issues a read-exclusive request to the home, corresponding to its write of A. But the home won’t process it until it is done with the read.
3. Home receives 1, and in response sends a reply to P1 (and sets the directory presence bit). Home now thinks the read is complete. Unfortunately, the reply does not get to P1 right away.
4. In response to 2, home sends invalidate to P1; it reaches P1 before transaction 3 (no point-to-point order among requests and replies).
5. P1 receives and applies invalidate, sends ack to home.
6. Home sends data reply to P2 corresponding to request 2.
Finally, transaction 3 (read reply) reaches P1.
Possible Solutions
Solution 1: have read replies themselves be acknowledged; the home goes on to process the next request only after it receives this ack.
Solution 2: the requestor does not allow another access, such as an invalidation, to be applied to the block until its own outstanding request completes. (SGI Origin)
Solution 3: apply the invalidation even before the read reply is received; consider the reply invalid and retry the read. (DASH)
The resulting order may differ between the two solutions.
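Solution 2 can be sketched as a toy Python model (invented names; not the actual SGI Origin implementation): an invalidation that arrives while the requestor has an outstanding read for the block is held back and applied only after the read reply is received.

```python
class Requestor:
    """Toy model of Solution 2: an incoming invalidation is not applied
    to a block while the requestor has an outstanding request for it."""

    def __init__(self):
        self.cache = {}              # block -> value
        self.outstanding = set()     # blocks with a request in flight
        self.deferred_inv = []       # invalidations held back

    def issue_read(self, block):
        self.outstanding.add(block)

    def recv_invalidate(self, block):
        if block in self.outstanding:
            self.deferred_inv.append(block)   # hold until the reply arrives
        else:
            self.cache.pop(block, None)

    def recv_read_reply(self, block, value):
        self.cache[block] = value
        self.outstanding.discard(block)
        value_seen = value                    # the read completes with this value
        for b in self.deferred_inv:           # now apply the held invalidations
            self.cache.pop(b, None)
        self.deferred_inv.clear()
        return value_seen

p1 = Requestor()
p1.issue_read("A")
p1.recv_invalidate("A")            # the invalidate overtakes the read reply
v = p1.recv_read_reply("A", 0)     # the reply finally arrives with the old value
assert v == 0                      # the read still completes (serialized before the write)
assert "A" not in p1.cache         # but the block is invalid afterwards
```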
Deadlock
Two networks are not enough when the protocol is not request-reply; additional networks are expensive and underutilized.
Use two networks, but detect potential deadlock and circumvent it:
- e.g. when the input-request and output-request buffers fill beyond a threshold, and the request at the head of the input queue is one that generates more requests
- or when the output request buffer is full and has had no relief for T cycles
Two major techniques:
- take requests out of the queue and NACK them, until the one at the head will not generate further requests or the output request queue has eased up (DASH)
- fall back to strict request-reply (Origin): instead of NACKing, send a reply telling the requestor to ask the owner directly; better because NACKs can lead to many retries, and even livelock
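The request-reply fallback can be sketched roughly (a toy model with invented names): when forwarding would overflow the output request queue, the home answers with a reply redirecting the requestor to the owner instead.

```python
def handle_request(home, block, output_queue, capacity):
    """Toy sketch of the Origin-style fallback: if forwarding the request
    would overflow the output request queue, answer with a reply that
    redirects the requestor to the owner instead of forwarding."""
    owner = home[block]
    if len(output_queue) >= capacity:
        return ("reply:ask_owner", owner)      # strict request-reply fallback
    output_queue.append(("forward", block, owner))
    return ("forwarded", owner)

home = {"A": "P3"}
q = []
assert handle_request(home, "A", q, capacity=1) == ("forwarded", "P3")
assert handle_request(home, "A", q, capacity=1) == ("reply:ask_owner", "P3")
```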
Livelock
Classical problem: two processors repeatedly trying to write a block.
Origin solves it with busy states and NACKs: the first to get there makes progress, the others are NACKed.
Problems with NACKs:
- useful for resolving race conditions (as above)
- not so good when used to ease contention in deadlock-prone situations, where they can cause livelock
- e.g. DASH NACKs may cause all requests to be retried immediately, regenerating the problem continually; the DASH implementation avoids this by using a large enough input buffer
No livelock when backing off to strict request-reply.
Starvation
Not a problem with FIFO buffering, but that approach has the problems discussed earlier.
NACKs can cause starvation.
Possible solutions:
- do nothing; starvation shouldn’t happen often (DASH)
- random delay between request retries
- priorities (Origin)
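Random retry delay can be sketched as follows (a toy model with invented names; real hardware would implement this differently): a NACKed requestor waits a random, growing interval before retrying, so competing requestors do not retry in lock-step.

```python
import random

def retry_with_backoff(attempt_fn, max_tries=50, base_delay=1, rng=None):
    """Retry a NACKed request with a random delay between attempts."""
    rng = rng or random.Random()
    total_delay = 0
    for i in range(max_tries):
        if attempt_fn():
            return i, total_delay            # number of NACKs, delay accumulated
        total_delay += rng.uniform(0, base_delay * 2 ** min(i, 8))
    raise RuntimeError("starved: request never accepted")

# A request that is NACKed twice, then accepted:
outcomes = iter([False, False, True])
tries, _ = retry_with_backoff(lambda: next(outcomes), rng=random.Random(0))
assert tries == 2
```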
Synchronization
R10000: load-locked / store-conditional; the Hub provides uncached fetch&op.
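The LL/SC pair, and a fetch&op built on top of it, can be modeled in a few lines (a software sketch of the semantics, not the R10000 hardware):

```python
class Memory:
    """Toy single-word memory with load-linked / store-conditional
    semantics: any successful store breaks all outstanding links."""

    def __init__(self, value=0):
        self.value = value
        self.links = set()        # processors holding a valid link

    def load_linked(self, proc):
        self.links.add(proc)
        return self.value

    def store_conditional(self, proc, value):
        if proc not in self.links:
            return False          # link broken: another store intervened
        self.value = value
        self.links.clear()        # this store breaks every other link
        return True

def fetch_and_add(mem, proc, delta):
    """fetch&op built from LL/SC: retry until the SC succeeds."""
    while True:
        old = mem.load_linked(proc)
        if mem.store_conditional(proc, old + delta):
            return old

m = Memory(10)
assert fetch_and_add(m, "P1", 5) == 10
assert m.value == 15

old = m.load_linked("P1")                     # P1 starts an LL/SC sequence...
m.load_linked("P2")
assert m.store_conditional("P2", 99)          # ...and P2's store intervenes
assert not m.store_conditional("P1", old + 1) # P1's SC fails; it must retry
assert m.value == 99
```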
Back-to-back Latencies (unowned)
Measured by pointer chasing, since the R10000 does not stall on a read miss.
Satisfied in   Back-to-back latency (ns)   Hops
L1 cache       5.5                         0
L2 cache       56.9                        0
local mem      472                         0
4P mem         690                         1
8P mem         890                         2
16P mem        990                         3
Protocol latencies
Home     Owner     Unowned   Clean-Exclusive   Modified
Local    Local     472       707               1,036
Remote   Local     704       930               1,272
Local    Remote    472*      930               1,159
Remote   Remote    704*      917               1,097
(latencies in ns)
Application Speedups
[Figure: speedup vs. number of processors (1-31), two plots, y-axis 0-30. Left: Barnes-Hut (512K and 16K bodies), Ocean (n=1026, n=514), Radix (4M and 1M keys). Right: LU (n=4096, n=2048), Raytrace (ball4 (128), car (1024)), Radiosity (room, largeroom).]
Summary
In a directory protocol there is substantial implementation complexity below the logical state diagram:
- directory vs. cache states, transient states, race conditions, conditional actions, speculation
Real systems reflect the interplay of design issues at several levels.
Origin philosophy:
- memory-less: a node reacts to incoming events using only local state
- an operation does not hold shared resources while requesting others
Hardware/Software Trade-Offs
HW/SW Trade-offs
Potential limitations of directory-based CC systems:
- high waiting time at memory operations
  - SC and HW optimization (SGI Origin2000)
  - relaxing the consistency model
- limited capacity for replication
  - caching data in main memory and keeping this data coherent (COMA)
- high design and implementation cost
  - HW solutions: separate CA
  - SW solutions
Memory Consistency Models
99-6-7 CS258 S99 19
Memory Consistency Model
A formal specification of memory semantics: how the memory system will appear to the programmer.
Sequential consistency greatly restricts the use of many performance optimizations commonly used by uniprocessor hardware and compiler designers.
Relaxed consistency models alleviate this problem.
Memory Consistency Models - Who Should Care?
The model affects programmability.
The model affects the performance of the system.
The model affects portability, due to a lack of consensus on a single model.
The memory model influences the writing of parallel programs from the programmer’s perspective, and virtually all aspects of designing a parallel system (the processor, memory, interconnection network, compiler, and programming language) from a system designer’s perspective.
Memory Semantics in Uniprocessor Systems
Most high-level uniprocessor languages present simple sequential semantics for memory operations:
- all memory operations appear to occur one at a time, in the sequential order specified by the program
- yet this still allows a wide range of efficient system designs: register allocation, code motion, loop transformations; pipelining, multiple issue, write-buffer bypassing and forwarding, lockup-free caches
Sequential Consistency
Sequential consistency (Lamport): the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.
- a simple and intuitive programming model
- but it disallows many hardware and compiler optimizations that are possible in uniprocessors
Condition for SC?
Reasoning with Sequential Consistency
Program order: (a) -> (b) and (c) -> (d)
Claim: (x, y) == (1, 0) cannot occur
- x == 1 => (b) -> (c)
- y == 0 => (d) -> (a)
- thus (a) -> (b) -> (c) -> (d) -> (a), i.e. (a) -> (a): a contradiction

initial: A, flag, x, y == 0

  p1                 p2
  (a) A := 1;        (c) x := flag;
  (b) flag := 1;     (d) y := A
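The argument can also be checked exhaustively (a sketch in Python): enumerate every interleaving that respects both program orders and confirm that (x, y) == (1, 0) never occurs.

```python
from itertools import permutations

# Enumerate every interleaving that preserves each processor's program
# order ((a) before (b), (c) before (d)) and run it under SC.
def run(order):
    A = flag = x = y = 0
    for op in order:
        if op == "a": A = 1          # p1: A := 1
        elif op == "b": flag = 1     # p1: flag := 1
        elif op == "c": x = flag     # p2: x := flag
        elif op == "d": y = A        # p2: y := A
    return x, y

results = {
    run(p)
    for p in permutations("abcd")
    if p.index("a") < p.index("b") and p.index("c") < p.index("d")
}
assert (1, 0) not in results                # the claimed outcome never occurs
assert results == {(0, 0), (0, 1), (1, 1)}  # all other outcomes are possible
```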
Then again, . . .
Many variables are not used to affect the flow of control, but only to share data:
- synchronizing variables
- non-synchronizing variables

initial: A, flag, x, y == 0

  p1                 p2
  (a) A := 1;        (c) x := flag;
      B := 3.1415
      C := 2.78
  (b) flag := 1;     (d) y := A + B + C
Sequential Consistency
Condition for SC:
- maintain program order among operations from individual processors
- maintain a single sequential order among operations from all processors (atomic memory operations)

[Figure: processors P1, P2, P3, ... connected to a single Memory]
Lamport’s Requirement for SC
Each processor issues memory requests in the order specified by its program.
Memory requests from all processors issued to an individual memory module are serviced from a single FIFO queue; issuing a memory request consists of entering the request on this queue.
This assumes stores execute atomically:
- the newly written value becomes visible to all processors at the same time it is inserted into the FIFO queue
- not so with caches and a general interconnect
Memory Operations
Issuing: the request has left the processor environment.
Performing:
- a LOAD is considered performed at a point in time when the issuing of a STORE to the same address cannot affect the value returned by the LOAD
- a STORE on X by processor i is considered performed at a point in time when a subsequently issued LOAD to the same address returns the value defined by a STORE in the sequence {Si(X)}+
Atomicity: memory accesses are atomic in a system if the value stored by a WRITE becomes readable at the same time for all processors.
Memory Operations
Performing with respect to a processor:
- a STORE by processor i is considered performed wrt processor k at a point in time when a subsequently issued LOAD to the same address by processor k returns the value defined by a STORE in the sequence {Si(X)/k}+
- a LOAD by processor i ...
Performing an access globally:
- a STORE is globally performed when it is performed wrt all processors
- a LOAD is globally performed if it is performed wrt all processors and if the STORE that is the source of the returned value has been globally performed
Requirements for SC (Dubois & Scheurich)
Each processor issues memory requests in the order specified by the program.
After a store operation is issued, the issuing processor should wait for the store to complete before issuing its next operation (the STORE is globally performed).
After a load operation is issued, the issuing processor should wait for the load to complete, and for the store whose value is being returned by the load to complete, before issuing its next operation (the LOAD is globally performed).
- the last point ensures that stores appear atomic to loads
- note: in an invalidation-based protocol, if a processor has a copy of a block in the dirty state, then a store to the block can complete immediately, since no other processor could access an older value
Architecture Implications
Need write completion for atomicity and access ordering:
- without caches, acknowledge writes
- with caches, acknowledge all invalidates
Atomicity: delay access to the new value until all invalidates are acked.
Access ordering: delay each access until the previous one completes.

[Figure: nodes, each with a processor/memory (PM) and a communication assist (CA), connected by a network]
Implementing Sequential Consistency
Architectures with caches raise three additional issues:
- multiple copies -> a cache coherence protocol
- detecting when a write is complete -> more transactions
- propagating changes -> non-atomic operations
Architectures with Caches: Cache Coherence and SC
Several definitions of cache coherence:
- a synonym for SC
- a set of conditions commonly associated with a cache coherence protocol:
  1) a write is eventually made visible to all processors
  2) writes to the same location appear to be seen in the same order by all processors (also referred to as serialization of writes to the same location)
Are the above conditions sufficient for satisfying SC?
Architectures with Caches: Cache Coherence and SC
SC requires:
1) writes to all locations to be seen in the same order by all processors
2) the operations of a single processor to appear to execute in program order
With this view, a CC protocol can be simply defined as the mechanism that propagates a newly written value, by invalidating or updating.
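To see why per-location write serialization alone is not sufficient for SC, the earlier flag example can be replayed with only p2 held to program order (a sketch; reordering p1's writes to different locations, e.g. by a write buffer, violates neither coherence condition, since each location is written only once):

```python
from itertools import permutations

# Coherence orders writes per location; it does not forbid a processor's
# accesses to *different* locations from being reordered. If p1's writes
# to A and to flag may swap, the outcome (x, y) == (1, 0), impossible
# under SC, becomes reachable.
def run(order):
    A = flag = x = y = 0
    for op in order:
        if op == "a": A = 1          # p1: A := 1
        elif op == "b": flag = 1     # p1: flag := 1
        elif op == "c": x = flag     # p2: x := flag
        elif op == "d": y = A        # p2: y := A
    return x, y

relaxed = {
    run(p)
    for p in permutations("abcd")
    if p.index("c") < p.index("d")   # p2 stays in program order;
}                                    # p1's (a) and (b) may be reordered
assert (1, 0) in relaxed             # e.g. the order b, c, d, a
```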