Cache Coherence in Scalable Machines (IV)
99-6-7 2
Dealing with Correctness Issues
- Serialization of operations
- Deadlock
- Livelock
- Starvation
Serialization of Operations
Need a serializing agent:
- home memory is a good candidate, since all misses go there first
Possible mechanism: FIFO buffering of requests at the home, until previous requests forwarded from the home have returned replies to it
- but the input-buffer problem becomes acute at the home
Possible solution: let the input buffer overflow into main memory (MIT Alewife)
Serialization of Operations (cont’d)
Don’t buffer at home, but forward to the owner node when the directory is in a busy state (Stanford DASH):
- serialization is determined by the home when the block is clean, by the owner when it is exclusive
- if the request cannot be satisfied at the “owner” (e.g. the block was written back or ownership was given up), it is NACKed back to the requestor without being serialized, and serialized when retried
Don’t buffer at home; use a busy state to NACK (Origin):
- the serialization order is the order in which requests are accepted (not NACKed)
Maintain the FIFO buffer in a distributed way (SCI)
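The Origin-style busy/NACK approach above can be sketched in a few lines (a toy Python model with invented names, not the actual Origin logic): the home NACKs any request for a block that has an outstanding transaction, and the serialization order is simply the order in which requests are accepted.

```python
class HomeNode:
    """Toy model: per-block busy state with NACKs (Origin-style sketch)."""

    def __init__(self):
        self.busy = set()        # blocks with an outstanding transaction
        self.serialized = []     # accepted requests, in serialization order

    def request(self, requestor, block):
        if block in self.busy:
            return "NACK"        # requestor must retry later
        self.busy.add(block)
        self.serialized.append((requestor, block))
        return "ACCEPT"

    def complete(self, block):
        self.busy.discard(block)  # transaction done; block is free again

home = HomeNode()
assert home.request("P1", "A") == "ACCEPT"
assert home.request("P2", "A") == "NACK"    # A is busy: P2 must retry
home.complete("A")
assert home.request("P2", "A") == "ACCEPT"  # the retried request is serialized now
assert home.serialized == [("P1", "A"), ("P2", "A")]
```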
Serialization to a Location (cont’d)
Having a single entity determine the order is not enough. Example:

  P1            P2
  rd A (i)      wr A
  barrier       barrier
  rd A (ii)

The second read of A should return the value written by P2. Both serializations are legal:
  rd A (i) -> wr A -> rd A (ii)
  wr A -> rd A (i) -> rd A (ii)
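The claim can be checked mechanically with a minimal sketch (invented names): enumerate the two legal serializations; the racing first read may see either value, but the second read sees P2's write in both.

```python
# The serializing agent may order the racing rd A (i) and wr A either way,
# but the barrier forces rd A (ii) after wr A in every legal order.
legal_orders = [
    ["rd_i", "wr", "rd_ii"],   # rd A (i) serialized before the write
    ["wr", "rd_i", "rd_ii"],   # rd A (i) serialized after the write
]

def run(order):
    A = 0                      # initial value of A
    results = {}
    for op in order:
        if op == "wr":
            A = 1              # P2's write of A
        else:
            results[op] = A    # a read returns the currently serialized value
    return results

for order in legal_orders:
    assert run(order)["rd_ii"] == 1     # the second read must see P2's write
assert run(legal_orders[0])["rd_i"] == 0
assert run(legal_orders[1])["rd_i"] == 1
```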
Serialization to a Location (cont’d)
Having a single entity determine the order is not enough: it may not know when all transactions for that operation are done everywhere.
[Figure: message sequence among P1, P2, and the Home node; numbered transactions as described below]
1. P1 issues read request to home node for A
2. P2 issues a read-exclusive request to the home, corresponding to its write of A. But the home won’t process it until it is done with the read.
3. Home receives 1, and in response sends a reply to P1 (and sets the directory presence bit). Home now thinks the read is complete. Unfortunately, the reply does not get to P1 right away.
4. In response to 2, home sends invalidate to P1; it reaches P1 before transaction 3 (no point-to-point order among requests and replies).
5. P1 receives and applies invalidate, sends ack to home.
6. Home sends data reply to P2 corresponding to request 2.
Finally, transaction 3 (read reply) reaches P1.
Possible Solutions
Solution 1: have read replies themselves be acknowledged; the home goes on to process the next request only after it receives this ack.
Solution 2: the requestor does not allow another access, such as an invalidation, to be applied to the block until its own outstanding request completes. (SGI Origin)
Solution 3: apply the invalidation even before the read reply is received; consider the reply invalid and retry the read. (DASH)
The resulting order may differ between the two solutions.
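Solution 2 can be sketched as a toy Python model (invented names; not the actual SGI Origin implementation): an invalidation that arrives while the requestor has an outstanding read for the block is held back and applied only after the read reply is received.

```python
class Requestor:
    """Toy model of Solution 2: an incoming invalidation is not applied
    to a block while the requestor has an outstanding request for it."""

    def __init__(self):
        self.cache = {}              # block -> value
        self.outstanding = set()     # blocks with a request in flight
        self.deferred_inv = []       # invalidations held back

    def issue_read(self, block):
        self.outstanding.add(block)

    def recv_invalidate(self, block):
        if block in self.outstanding:
            self.deferred_inv.append(block)   # hold until the reply arrives
        else:
            self.cache.pop(block, None)

    def recv_read_reply(self, block, value):
        self.cache[block] = value
        self.outstanding.discard(block)
        value_seen = value                    # the read completes with this value
        for b in self.deferred_inv:           # now apply the held invalidations
            self.cache.pop(b, None)
        self.deferred_inv.clear()
        return value_seen

p1 = Requestor()
p1.issue_read("A")
p1.recv_invalidate("A")            # the invalidate overtakes the read reply
v = p1.recv_read_reply("A", 0)     # the reply finally arrives with the old value
assert v == 0                      # the read still completes (serialized before the write)
assert "A" not in p1.cache         # but the block is invalid afterwards
```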
Deadlock
Two networks are not enough when the protocol is not request-reply; additional networks are expensive and underutilized.
Use two networks, but detect potential deadlock and circumvent it:
- e.g. when the input-request and output-request buffers fill beyond a threshold, and the request at the head of the input queue is one that generates more requests
- or when the output request buffer is full and has had no relief for T cycles
Two major techniques:
- take requests out of the queue and NACK them, until the one at the head will not generate further requests or the output request queue has eased up (DASH)
- fall back to strict request-reply (Origin): instead of NACKing, send a reply telling the requestor to ask the owner directly; better because NACKs can lead to many retries, and even livelock
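The request-reply fallback can be sketched roughly (a toy model with invented names): when forwarding would overflow the output request queue, the home answers with a reply redirecting the requestor to the owner instead.

```python
def handle_request(home, block, output_queue, capacity):
    """Toy sketch of the Origin-style fallback: if forwarding the request
    would overflow the output request queue, answer with a reply that
    redirects the requestor to the owner instead of forwarding."""
    owner = home[block]
    if len(output_queue) >= capacity:
        return ("reply:ask_owner", owner)      # strict request-reply fallback
    output_queue.append(("forward", block, owner))
    return ("forwarded", owner)

home = {"A": "P3"}
q = []
assert handle_request(home, "A", q, capacity=1) == ("forwarded", "P3")
assert handle_request(home, "A", q, capacity=1) == ("reply:ask_owner", "P3")
```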
Livelock
Classical problem: two processors repeatedly trying to write a block.
Origin solves it with busy states and NACKs: the first to get there makes progress, the others are NACKed.
Problems with NACKs:
- useful for resolving race conditions (as above)
- not so good when used to ease contention in deadlock-prone situations, where they can cause livelock
- e.g. DASH NACKs may cause all requests to be retried immediately, regenerating the problem continually; the DASH implementation avoids this by using a large enough input buffer
No livelock when backing off to strict request-reply.
Starvation
Not a problem with FIFO buffering, but that approach has the problems discussed earlier.
NACKs can cause starvation.
Possible solutions:
- do nothing; starvation shouldn’t happen often (DASH)
- random delay between request retries
- priorities (Origin)
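Random retry delay can be sketched as follows (a toy model with invented names; real hardware would implement this differently): a NACKed requestor waits a random, growing interval before retrying, so competing requestors do not retry in lock-step.

```python
import random

def retry_with_backoff(attempt_fn, max_tries=50, base_delay=1, rng=None):
    """Retry a NACKed request with a random delay between attempts."""
    rng = rng or random.Random()
    total_delay = 0
    for i in range(max_tries):
        if attempt_fn():
            return i, total_delay            # number of NACKs, delay accumulated
        total_delay += rng.uniform(0, base_delay * 2 ** min(i, 8))
    raise RuntimeError("starved: request never accepted")

# A request that is NACKed twice, then accepted:
outcomes = iter([False, False, True])
tries, _ = retry_with_backoff(lambda: next(outcomes), rng=random.Random(0))
assert tries == 2
```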
Synchronization
R10000: load-locked / store-conditional; the Hub provides uncached fetch&op.
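The LL/SC pair, and a fetch&op built on top of it, can be modeled in a few lines (a software sketch of the semantics, not the R10000 hardware):

```python
class Memory:
    """Toy single-word memory with load-linked / store-conditional
    semantics: any successful store breaks all outstanding links."""

    def __init__(self, value=0):
        self.value = value
        self.links = set()        # processors holding a valid link

    def load_linked(self, proc):
        self.links.add(proc)
        return self.value

    def store_conditional(self, proc, value):
        if proc not in self.links:
            return False          # link broken: another store intervened
        self.value = value
        self.links.clear()        # this store breaks every other link
        return True

def fetch_and_add(mem, proc, delta):
    """fetch&op built from LL/SC: retry until the SC succeeds."""
    while True:
        old = mem.load_linked(proc)
        if mem.store_conditional(proc, old + delta):
            return old

m = Memory(10)
assert fetch_and_add(m, "P1", 5) == 10
assert m.value == 15

old = m.load_linked("P1")                     # P1 starts an LL/SC sequence...
m.load_linked("P2")
assert m.store_conditional("P2", 99)          # ...and P2's store intervenes
assert not m.store_conditional("P1", old + 1) # P1's SC fails; it must retry
assert m.value == 99
```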
Back-to-back Latencies (unowned)
Measured by pointer chasing, since the R10000 does not stall on a read miss.
Satisfied in   Back-to-back latency (ns)   Hops
L1 cache       5.5                         0
L2 cache       56.9                        0
local mem      472                         0
4P mem         690                         1
8P mem         890                         2
16P mem        990                         3
Protocol latencies
Home     Owner     Unowned   Clean-Exclusive   Modified
Local    Local     472       707               1,036
Remote   Local     704       930               1,272
Local    Remote    472*      930               1,159
Remote   Remote    704*      917               1,097
(latencies in ns)
Application Speedups
[Figure: speedup vs. number of processors (1-31), two plots, y-axis 0-30. Left: Barnes-Hut (512K and 16K bodies), Ocean (n=1026, n=514), Radix (4M and 1M keys). Right: LU (n=4096, n=2048), Raytrace (ball4 (128), car (1024)), Radiosity (room, largeroom).]
Summary
In a directory protocol there is substantial implementation complexity below the logical state diagram:
- directory vs. cache states, transient states, race conditions, conditional actions, speculation
Real systems reflect the interplay of design issues at several levels.
Origin philosophy:
- memory-less: a node reacts to incoming events using only local state
- an operation does not hold shared resources while requesting others
Hardware/Software Trade-Offs
HW/SW Trade-offs
Potential limitations of directory-based CC systems:
- high waiting time at memory operations
  - SC and HW optimization (SGI Origin2000)
  - relaxing the consistency model
- limited capacity for replication
  - caching data in main memory and keeping this data coherent (COMA)
- high design and implementation cost
  - HW solutions: separate CA
  - SW solutions
Memory Consistency Models
99-6-7 CS258 S99 19
Memory Consistency Model
A formal specification of memory semantics: how the memory system will appear to the programmer.
Sequential consistency greatly restricts the use of many performance optimizations commonly used by uniprocessor hardware and compiler designers.
Relaxed consistency models alleviate this problem.
Memory Consistency Models - Who Should Care?
The model affects programmability.
The model affects the performance of the system.
The model affects portability, due to a lack of consensus on a single model.
The memory model influences the writing of parallel programs from the programmer’s perspective, and virtually all aspects of designing a parallel system (the processor, memory, interconnection network, compiler, and programming language) from a system designer’s perspective.
Memory Semantics in Uniprocessor Systems
Most high-level uniprocessor languages present simple sequential semantics for memory operations:
- all memory operations appear to occur one at a time, in the sequential order specified by the program
- yet this still allows a wide range of efficient system designs: register allocation, code motion, loop transformations; pipelining, multiple issue, write-buffer bypassing and forwarding, lockup-free caches
Sequential Consistency
Sequential consistency (Lamport): the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.
- a simple and intuitive programming model
- but it disallows many hardware and compiler optimizations that are possible in uniprocessors
Condition for SC?
Reasoning with Sequential Consistency
Program order: (a) -> (b) and (c) -> (d)
Claim: (x, y) == (1, 0) cannot occur
- x == 1 => (b) -> (c)
- y == 0 => (d) -> (a)
- thus (a) -> (b) -> (c) -> (d) -> (a), i.e. (a) -> (a): a contradiction

initial: A, flag, x, y == 0

  p1                 p2
  (a) A := 1;        (c) x := flag;
  (b) flag := 1;     (d) y := A
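The argument can also be checked exhaustively (a sketch in Python): enumerate every interleaving that respects both program orders and confirm that (x, y) == (1, 0) never occurs.

```python
from itertools import permutations

# Enumerate every interleaving that preserves each processor's program
# order ((a) before (b), (c) before (d)) and run it under SC.
def run(order):
    A = flag = x = y = 0
    for op in order:
        if op == "a": A = 1          # p1: A := 1
        elif op == "b": flag = 1     # p1: flag := 1
        elif op == "c": x = flag     # p2: x := flag
        elif op == "d": y = A        # p2: y := A
    return x, y

results = {
    run(p)
    for p in permutations("abcd")
    if p.index("a") < p.index("b") and p.index("c") < p.index("d")
}
assert (1, 0) not in results                # the claimed outcome never occurs
assert results == {(0, 0), (0, 1), (1, 1)}  # all other outcomes are possible
```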
Then again, . . .
Many variables are not used to affect the flow of control, but only to share data:
- synchronizing variables
- non-synchronizing variables

initial: A, flag, x, y == 0

  p1                 p2
  (a) A := 1;        (c) x := flag;
      B := 3.1415
      C := 2.78
  (b) flag := 1;     (d) y := A + B + C
Sequential Consistency
Condition for SC:
- maintain program order among operations from individual processors
- maintain a single sequential order among operations from all processors (atomic memory operations)

[Figure: processors P1, P2, P3, ... connected to a single Memory]
Lamport’s Requirement for SC
Each processor issues memory requests in the order specified by its program.
Memory requests from all processors issued to an individual memory module are serviced from a single FIFO queue; issuing a memory request consists of entering the request on this queue.
This assumes stores execute atomically:
- the newly written value becomes visible to all processors at the same time it is inserted into the FIFO queue
- not so with caches and a general interconnect
Memory Operations
Issuing: the request has left the processor environment.
Performing:
- a LOAD is considered performed at a point in time when the issuing of a STORE to the same address cannot affect the value returned by the LOAD
- a STORE on X by processor i is considered performed at a point in time when a subsequently issued LOAD to the same address returns the value defined by a STORE in the sequence {Si(X)}+
Atomicity: memory accesses are atomic in a system if the value stored by a WRITE becomes readable at the same time for all processors.
Memory Operations
Performing with respect to a processor:
- a STORE by processor i is considered performed wrt processor k at a point in time when a subsequently issued LOAD to the same address by processor k returns the value defined by a STORE in the sequence {Si(X)/k}+
- a LOAD by processor i ...
Performing an access globally:
- a STORE is globally performed when it is performed wrt all processors
- a LOAD is globally performed if it is performed wrt all processors and if the STORE that is the source of the returned value has been globally performed
Requirements for SC (Dubois & Scheurich)
Each processor issues memory requests in the order specified by the program.
After a store operation is issued, the issuing processor should wait for the store to complete before issuing its next operation (the STORE is globally performed).
After a load operation is issued, the issuing processor should wait for the load to complete, and for the store whose value is being returned by the load to complete, before issuing its next operation (the LOAD is globally performed).
- the last point ensures that stores appear atomic to loads
- note: in an invalidation-based protocol, if a processor has a copy of a block in the dirty state, then a store to the block can complete immediately, since no other processor could access an older value
Architecture Implications
Need write completion for atomicity and access ordering:
- without caches, acknowledge writes
- with caches, acknowledge all invalidates
Atomicity: delay access to the new value until all invalidates are acked.
Access ordering: delay each access until the previous one completes.

[Figure: nodes, each with a processor/memory (PM) and a communication assist (CA), connected by a network]
Implementing Sequential Consistency
Architectures with caches raise three additional issues:
- multiple copies -> a cache coherence protocol
- detecting when a write is complete -> more transactions
- propagating changes -> non-atomic operations
Architectures with Caches: Cache Coherence and SC
Several definitions of cache coherence:
- a synonym for SC
- a set of conditions commonly associated with a cache coherence protocol:
  1) a write is eventually made visible to all processors
  2) writes to the same location appear to be seen in the same order by all processors (also referred to as serialization of writes to the same location)
Are the above conditions sufficient for satisfying SC?
Architectures with Caches: Cache Coherence and SC
SC requires:
1) writes to all locations to be seen in the same order by all processors
2) the operations of a single processor to appear to execute in program order
With this view, a CC protocol can be simply defined as the mechanism that propagates a newly written value, by invalidating or updating.
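To see why per-location write serialization alone is not sufficient for SC, the earlier flag example can be replayed with only p2 held to program order (a sketch; reordering p1's writes to different locations, e.g. by a write buffer, violates neither coherence condition, since each location is written only once):

```python
from itertools import permutations

# Coherence orders writes per location; it does not forbid a processor's
# accesses to *different* locations from being reordered. If p1's writes
# to A and to flag may swap, the outcome (x, y) == (1, 0), impossible
# under SC, becomes reachable.
def run(order):
    A = flag = x = y = 0
    for op in order:
        if op == "a": A = 1          # p1: A := 1
        elif op == "b": flag = 1     # p1: flag := 1
        elif op == "c": x = flag     # p2: x := flag
        elif op == "d": y = A        # p2: y := A
    return x, y

relaxed = {
    run(p)
    for p in permutations("abcd")
    if p.index("c") < p.index("d")   # p2 stays in program order;
}                                    # p1's (a) and (b) may be reordered
assert (1, 0) in relaxed             # e.g. the order b, c, d, a
```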