CS 213 Lecture 11: Multiprocessor 3: Directory Organization

Page 1: CS 213 Lecture 11:   Multiprocessor 3: Directory Organization


CS 213 Lecture 11: Multiprocessor 3: Directory Organization

Page 2: CS 213 Lecture 11:   Multiprocessor 3: Directory Organization


Example Directory Protocol Contd.

– Write miss: the block has a new owner. A message is sent to the old owner, causing its cache to send the block's value to the directory, from which it is forwarded to the requesting processor, which becomes the new owner. Sharers is set to the identity of the new owner, and the state of the block is made Exclusive.

Cache-to-cache transfer: can occur on a remote read or write miss. Idea: transfer the block directly from the cache holding the exclusive copy to the requesting cache. Why go through the directory? Instead, inform the directory after the block has been transferred => 3 transfers over the interconnection network (IN) instead of 4.
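To make the write-miss case concrete, here is a minimal Python sketch written for this transcript (not the lecture's own code; the Directory class and the message names are invented) of a directory handling a write miss to a block that another node holds Exclusive:

```python
# Hypothetical sketch of the directory's write-miss handling when the block is
# Exclusive at another node; "send" stands in for the interconnect.
EXCLUSIVE = "Exclusive"

class Directory:
    def __init__(self):
        self.entries = {}          # block -> {"state": ..., "sharers": set(...)}

    def write_miss(self, block, requester, send):
        entry = self.entries[block]
        if entry["state"] == EXCLUSIVE:
            (old_owner,) = entry["sharers"]
            # ask the old owner's cache to return the block's value to the directory
            value = send(old_owner, ("fetch_invalidate", block))
            # forward it to the requesting processor, which becomes the new owner
            send(requester, ("data_reply", block, value))
            # Sharers := identity of the new owner; the state stays Exclusive
            entry["sharers"] = {requester}

# toy use: P1 owns A1 with value 10; P2 takes a write miss on A1
d = Directory()
d.entries["A1"] = {"state": EXCLUSIVE, "sharers": {"P1"}}
d.write_miss("A1", "P2", send=lambda node, msg: 10)
```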

Page 3: CS 213 Lecture 11:   Multiprocessor 3: Directory Organization


Basic Directory Transactions

[Figure: basic directory transactions among a requestor node, the directory (home) node for the block, and other nodes; each node comprises a processor (P), cache (C), and memory/directory with communication assist (A, M/D).

(a) Read miss to a block in dirty state: 1. read request to directory; 2. reply with owner identity; 3. read request to owner; 4a. data reply to requestor; 4b. revision message to directory.

(b) Write miss to a block with two sharers: 1. RdEx request to directory; 2. reply with sharers' identity; 3a/3b. invalidation requests to the sharers; 4a/4b. invalidation acknowledgements.]

Page 4: CS 213 Lecture 11:   Multiprocessor 3: Directory Organization


Protocol Enhancements for Latency

• Forwarding messages: memory-based protocols

[Figure: three message flows among the local (requesting) node L, the home node H, and the remote owner R.

(a) Strict request-reply: 1. request L to H; 2. reply (with owner identity) H to L; 3. intervention L to R; 4a. revise R to H; 4b. response R to L.

(b) Intervention forwarding: 1. request L to H; 2. intervention H to R; 3. response R to H; 4. reply H to L.

(c) Reply forwarding: 1. request L to H; 2. intervention H to R; 3a. revise R to H; 3b. response R to L.]

An intervention is like a request, but it is issued in reaction to a request and is sent to a cache rather than to memory.
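To make the latency comparison concrete, here is a small illustrative Python sketch (written for this transcript, not part of the lecture) that counts network traversals on the requestor's critical path for the three flows in the figure, assuming every message costs one traversal and that the 3a/3b and 4a/4b message pairs travel in parallel:

```python
# Critical-path message counts for the three forwarding variants above
# (L = local/requesting node, H = home, R = remote owner).
critical_path = {
    # L->H req, H->L reply, L->R intervention, R->L response (R->H revise overlaps)
    "strict request-reply":    ["req", "reply", "intervention", "response"],
    # L->H req, H->R intervention, R->H response, H->L reply
    "intervention forwarding": ["req", "intervention", "response", "reply"],
    # L->H req, H->R intervention, R->L response (R->H revise overlaps)
    "reply forwarding":        ["req", "intervention", "response"],
}
for scheme, msgs in critical_path.items():
    print(f"{scheme}: {len(msgs)} network traversals on the critical path")
```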

Page 5: CS 213 Lecture 11:   Multiprocessor 3: Directory Organization


Example

Reference stream (A1 and A2 map to the same cache block):
  P1: Write 10 to A1
  P1: Read A1
  P2: Read A1
  P2: Write 20 to A1
  P2: Write 40 to A2

Trace columns: Processor 1 (State, Addr, Value) | Processor 2 (State, Addr, Value) | Interconnect (Action, Proc., Addr, Value) | Directory (Addr, State, {Procs}) | Memory (Value)

Page 6: CS 213 Lecture 11:   Multiprocessor 3: Directory Organization


Example

step               | P1          | P2          | Interconnect   | Directory        | Mem
P1: Write 10 to A1 |             |             | WrMs P1 A1     | A1 Excl. {P1}    |
                   | Excl. A1 10 |             | DaRp P1 A1 0   |                  |

A1 and A2 map to the same cache block.

Page 7: CS 213 Lecture 11:   Multiprocessor 3: Directory Organization


Example

step               | P1          | P2          | Interconnect   | Directory        | Mem
P1: Write 10 to A1 |             |             | WrMs P1 A1     | A1 Excl. {P1}    |
                   | Excl. A1 10 |             | DaRp P1 A1 0   |                  |
P1: Read A1        | Excl. A1 10 |             |                |                  |

A1 and A2 map to the same cache block.

Page 8: CS 213 Lecture 11:   Multiprocessor 3: Directory Organization


Example

step               | P1          | P2          | Interconnect   | Directory        | Mem
P1: Write 10 to A1 |             |             | WrMs P1 A1     | A1 Excl. {P1}    |
                   | Excl. A1 10 |             | DaRp P1 A1 0   |                  |
P1: Read A1        | Excl. A1 10 |             |                |                  |
P2: Read A1        |             | Shar. A1    | RdMs P2 A1     |                  |
                   | Shar. A1 10 |             | Ftch P1 A1 10  |                  | 10
                   |             | Shar. A1 10 | DaRp P2 A1 10  | A1 Shar. {P1,P2} | 10

A1 and A2 map to the same cache block. (The fetch of A1 from P1's cache also writes the block back, so memory now holds 10.)

Page 9: CS 213 Lecture 11:   Multiprocessor 3: Directory Organization


Example

step               | P1          | P2          | Interconnect   | Directory        | Mem
P1: Write 10 to A1 |             |             | WrMs P1 A1     | A1 Excl. {P1}    |
                   | Excl. A1 10 |             | DaRp P1 A1 0   |                  |
P1: Read A1        | Excl. A1 10 |             |                |                  |
P2: Read A1        |             | Shar. A1    | RdMs P2 A1     |                  |
                   | Shar. A1 10 |             | Ftch P1 A1 10  |                  | 10
                   |             | Shar. A1 10 | DaRp P2 A1 10  | A1 Shar. {P1,P2} | 10
P2: Write 20 to A1 |             | Excl. A1 20 | WrMs P2 A1     |                  | 10
                   | Inv.        |             | Inval. P1 A1   | A1 Excl. {P2}    | 10

A1 and A2 map to the same cache block.

Page 10: CS 213 Lecture 11:   Multiprocessor 3: Directory Organization


Example

step               | P1          | P2          | Interconnect   | Directory        | Mem
P1: Write 10 to A1 |             |             | WrMs P1 A1     | A1 Excl. {P1}    |
                   | Excl. A1 10 |             | DaRp P1 A1 0   |                  |
P1: Read A1        | Excl. A1 10 |             |                |                  |
P2: Read A1        |             | Shar. A1    | RdMs P2 A1     |                  |
                   | Shar. A1 10 |             | Ftch P1 A1 10  |                  | 10
                   |             | Shar. A1 10 | DaRp P2 A1 10  | A1 Shar. {P1,P2} | 10
P2: Write 20 to A1 |             | Excl. A1 20 | WrMs P2 A1     |                  | 10
                   | Inv.        |             | Inval. P1 A1   | A1 Excl. {P2}    | 10
P2: Write 40 to A2 |             |             | WrMs P2 A2     | A2 Excl. {P2}    | 0
                   |             |             | WrBk P2 A1 20  | A1 Unca. {}      | 20
                   |             | Excl. A2 40 | DaRp P2 A2 0   | A2 Excl. {P2}    | 0

A1 and A2 map to the same cache block. (Writing A2 displaces A1 from P2's cache, forcing the write-back of the dirty value 20.)

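The completed trace above can be reproduced with a small simulator. The sketch below is illustrative Python written for this transcript, not the lecture's code: each cache holds a single block (so A1 and A2 conflict), and the state and message names follow the table (WrMs, RdMs, Ftch, Inval., WrBk, DaRp).

```python
# Minimal directory-protocol sketch that replays the example above.
class Cache:
    def __init__(self, name):
        self.name, self.addr, self.state, self.value = name, None, "Inv", None

class System:
    def __init__(self, procs):
        self.caches = {p: Cache(p) for p in procs}
        self.mem = {}                       # addr -> value (0 if never written)
        self.dir = {}                       # addr -> [state, set_of_sharers]
        self.log = []

    def msg(self, action, proc, addr, value=""):
        self.log.append(f"{action:6s} {proc} {addr} {value}")

    # --- cache side -------------------------------------------------------
    def access(self, proc, addr, write, value=None):
        c = self.caches[proc]
        if c.addr == addr and (c.state == "Excl" or (c.state == "Shar" and not write)):
            if write: c.value = value       # write hit in Exclusive
            return                          # hits need no messages
        if c.state == "Excl":               # replacing a dirty block: write it back
            self.msg("WrBk", proc, c.addr, c.value)
            self.mem[c.addr] = c.value
            self.dir[c.addr] = ["Unca", set()]
        if write:
            self.msg("WrMs", proc, addr)
            self.write_miss(addr, proc)
            c.addr, c.state, c.value = addr, "Excl", value
        else:
            self.msg("RdMs", proc, addr)
            c.addr, c.state, c.value = addr, "Shar", self.read_miss(addr, proc)

    # --- directory side ---------------------------------------------------
    def read_miss(self, addr, req):
        state, sharers = self.dir.get(addr, ["Unca", set()])
        if state == "Excl":                 # fetch the dirty data from the owner
            (owner,) = sharers
            o = self.caches[owner]
            self.msg("Ftch", owner, addr, o.value)
            self.mem[addr], o.state = o.value, "Shar"
        self.dir[addr] = ["Shar", sharers | {req}]
        self.msg("DaRp", req, addr, self.mem.get(addr, 0))
        return self.mem.get(addr, 0)

    def write_miss(self, addr, req):
        state, sharers = self.dir.get(addr, ["Unca", set()])
        if state == "Excl":                 # fetch and invalidate the old owner
            (owner,) = sharers
            o = self.caches[owner]
            self.msg("Ftch", owner, addr, o.value)
            self.mem[addr], o.state = o.value, "Inv"
        elif state == "Shar":               # invalidate every other sharer
            for s in sharers - {req}:
                self.msg("Inval.", s, addr)
                self.caches[s].state = "Inv"
        if req not in sharers:              # upgrades already hold the data
            self.msg("DaRp", req, addr, self.mem.get(addr, 0))
        self.dir[addr] = ["Excl", {req}]

machine = System(["P1", "P2"])
machine.access("P1", "A1", write=True, value=10)
machine.access("P1", "A1", write=False)
machine.access("P2", "A1", write=False)
machine.access("P2", "A1", write=True, value=20)
machine.access("P2", "A2", write=True, value=40)
print("\n".join(machine.log))
```

Running it prints the same sequence of interconnect actions as the table, except that the write-back of A1 is logged just before the WrMs for A2 rather than just after.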

Page 12: CS 213 Lecture 11:   Multiprocessor 3: Directory Organization


Assume Network latency = 25 cycles
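As an illustration (this arithmetic is added here, not taken from the slide), the 25-cycle assumption can be applied to the message counts from the earlier slides:

```python
# Back-of-the-envelope critical-path latency, assuming each protocol message
# costs one 25-cycle network traversal and ignoring controller occupancy.
NETWORK_LATENCY = 25
print("4-message remote miss (strict request-reply):", 4 * NETWORK_LATENCY, "cycles")  # 100
print("3-message remote miss (reply forwarding):", 3 * NETWORK_LATENCY, "cycles")      # 75
```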

Page 13: CS 213 Lecture 11:   Multiprocessor 3: Directory Organization


Reducing Storage Overhead

• Optimizations for full bit vector schemes
  – increase cache block size (reduces storage overhead proportionally)
  – use multiprocessor nodes (bit per multiprocessor node, not per processor)
  – still scales as P*M, but reasonable for all but very large machines
    » 256 procs, 4 per cluster, 128B line: 6.25% ovhd. (checked below)

• Reducing “width”
  – addressing the P term?

• Reducing “height”
  – addressing the M term?
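The 6.25% figure above can be checked directly (a quick calculation added for this transcript; it counts only the presence bits and ignores the entry's state bits):

```python
# Full-bit-vector overhead with one presence bit per 4-processor node and 128 B lines.
procs, procs_per_node, line_bytes = 256, 4, 128
presence_bits = procs // procs_per_node        # 64 bits per directory entry
block_bits = line_bytes * 8                    # 1024 data bits per block
print(f"{presence_bits / block_bits:.2%}")     # 6.25%
```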

Page 14: CS 213 Lecture 11:   Multiprocessor 3: Directory Organization


Storage Reductions

• Width observation:
  – most blocks are cached by only a few nodes
  – don't keep a bit per node; instead the entry contains a few pointers to sharing nodes
  – P = 1024 => 10-bit pointers, so we can use 100 pointers and still save space (see the sketch after this list)
  – sharing patterns indicate a few pointers should suffice (five or so)
  – need an overflow strategy when there are more sharers

• Height observation:
  – number of memory blocks >> number of cache blocks
  – most directory entries are useless at any given time
  – organize the directory as a cache, rather than having one entry per memory block
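The width observation can be made concrete with a quick comparison using the P = 1024 example from the list above (code added for this transcript):

```python
# Per-entry storage: a limited number of pointers vs. a full bit vector.
import math
P = 1024                                   # nodes
ptr_bits = math.ceil(math.log2(P))         # 10 bits per pointer
for pointers in (5, 100):
    print(f"{pointers:3d} pointers: {pointers * ptr_bits:4d} bits  "
          f"(full bit vector: {P} bits)")
```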

Page 15: CS 213 Lecture 11:   Multiprocessor 3: Directory Organization


Insight into Directory Requirements

• If most misses involve O(P) transactions, might as well broadcast!

=> Study inherent program characteristics:
  – frequency of write misses?
  – how many sharers on a write miss?
  – how do these scale?

• Also provides insight into how to organize and store directory information

Page 16: CS 213 Lecture 11:   Multiprocessor 3: Directory Organization


Cache Invalidation Patterns

[Figure: histograms of invalidations per shared write (x-axis: number of invalidations, 0 through 7 and then bins of 4 up to 60-63; y-axis: % of shared writes).

LU invalidation patterns: 8.75% of shared writes cause 0 invalidations, 91.22% cause exactly 1, and essentially none cause more (0.22% in the 60-63 bin).

Ocean invalidation patterns: 80.98% of shared writes cause exactly 1 invalidation, 15.06% cause 2, 3.04% cause 3, and the tail quickly drops below 0.5% (0.49%, 0.34%, 0.03%, ..., 0.02% in the 60-63 bin).]

Page 17: CS 213 Lecture 11:   Multiprocessor 3: Directory Organization


Cache Invalidation Patterns

[Figure: histograms of invalidations per shared write, same axes as the previous slide.

Barnes-Hut invalidation patterns: 1.27% of shared writes cause 0 invalidations, 48.35% cause 1, 22.87% cause 2, 10.56% cause 3, 5.33% cause 4, with a slowly decaying tail (2.87%, 1.88%, 1.4%, 2.5% in the 8-11 bin, ..., 0.33% in the 60-63 bin).

Radiosity invalidation patterns: 6.68% cause 0 invalidations, 58.35% cause 1, 12.04% cause 2, 4.16% cause 3, with a long thin tail (2.24%, 1.59%, 1.16%, 0.97%, 3.28% in the 8-11 bin, ..., 0.91% in the 60-63 bin).]

Page 18: CS 213 Lecture 11:   Multiprocessor 3: Directory Organization


Sharing Patterns Summary

• Generally few sharers at a write; scales slowly with P
  – Code and read-only objects (e.g., scene data in Raytrace)
    » no problem, as they are rarely written
  – Migratory objects (e.g., cost array cells in LocusRoute)
    » even as the number of PEs scales, only 1-2 invalidations
  – Mostly-read objects (e.g., root of tree in Barnes)
    » invalidations are large but infrequent, so little impact on performance
  – Frequently read/written objects (e.g., task queues)
    » invalidations usually remain small, though frequent
  – Synchronization objects
    » low-contention locks result in small invalidations
    » high-contention locks need special support (SW trees, queueing locks)

• Implies directories are very useful in containing traffic
  – if organized properly, traffic and latency shouldn't scale too badly

• Suggests techniques to reduce storage overhead

Page 19: CS 213 Lecture 11:   Multiprocessor 3: Directory Organization


Overflow Schemes for Limited Pointers

• Broadcast (Dir_i B): the directory entry holds i pointers. If more copies are needed (overflow), set a broadcast bit so that, on a write, the invalidation is broadcast to all processors
  – bad for widely-shared, frequently read data

• No-broadcast (Dir_i NB): don't allow more than i copies to be present at any time; on overflow, a new sharer replaces (invalidates) one of the existing copies
  – bad for widely read data

• Coarse vector (Dir_i CV) (sketched below)
  – change the representation to a coarse vector, 1 bit per k nodes
  – on a write, invalidate all nodes that a set bit corresponds to

Ref: Chaiken, et al “Directory-Based Cache Coherence in Large-Scale Multiprocessors”, IEEE Computer, June 1990.

[Figure: a 16-node example (P0-P15) of a limited-pointer directory entry.
(a) No overflow: overflow bit = 0; the entry holds 2 pointers to the sharing nodes.
(b) Overflow: overflow bit = 1; the entry is reinterpreted as an 8-bit coarse vector, 1 bit per group of 2 nodes.]
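A small Python sketch of the coarse-vector representation in the figure (written for this transcript; 16 nodes and 8 coarse bits, as drawn):

```python
# Illustrative Dir_i CV overflow representation: 1 coarse bit per group of nodes.
NODES, COARSE_BITS = 16, 8
GROUP = NODES // COARSE_BITS            # 2 nodes per bit

def coarse_vector(sharers):
    bits = 0
    for node in sharers:                # set the bit covering this node's group
        bits |= 1 << (node // GROUP)
    return bits

def nodes_to_invalidate(bits):
    # a write must invalidate every node whose group bit is set
    return [n for n in range(NODES) if bits & (1 << (n // GROUP))]

v = coarse_vector({0, 5, 9})
print(bin(v), nodes_to_invalidate(v))   # extra nodes 1, 4, 8 get invalidated too
```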

Page 20: CS 213 Lecture 11:   Multiprocessor 3: Directory Organization


Overflow Schemes (contd.)

• Software (Dir_i SW)
  – trap to software, use any number of pointers (no precision loss)
    » MIT Alewife: 5 pointers, plus one bit for the local node
  – but extra cost of handling the overflow in software
    » processor overhead and occupancy
    » latency
      • 40 to 425 cycles for a remote read in Alewife
      • 84 cycles for 5 invalidations, 707 for 6

• Dynamic pointers (Dir_i DP)
  – use pointers from a hardware free list in a portion of memory
  – manipulation done by a hardware assist, not software
  – e.g., Stanford FLASH
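A hedged sketch of the dynamic-pointer idea (illustrative Python for this transcript; the real schemes manage this with a hardware assist, and no free-list exhaustion handling is shown):

```python
# Directory entries chain sharer pointers allocated from a shared free list
# kept in a region of memory.
class DynPtrDirectory:
    def __init__(self, pool_size):
        self.next = list(range(1, pool_size)) + [None]  # free-list "next" links
        self.node = [None] * pool_size                  # sharer stored in each cell
        self.free = 0                                   # head of the free list
        self.head = {}                                  # block -> first cell index

    def add_sharer(self, block, node):
        cell, self.free = self.free, self.next[self.free]   # pop a free cell
        self.node[cell] = node
        self.next[cell] = self.head.get(block)              # push onto the block's list
        self.head[block] = cell

    def sharers(self, block):
        cell, out = self.head.get(block), []
        while cell is not None:
            out.append(self.node[cell])
            cell = self.next[cell]
        return out

d = DynPtrDirectory(8)
d.add_sharer("A1", 3)
d.add_sharer("A1", 7)
print(d.sharers("A1"))   # [7, 3]
```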

Page 21: CS 213 Lecture 11:   Multiprocessor 3: Directory Organization


Some Data

– 64 procs, 4 pointers, normalized to full bit vector
– Coarse vector quite robust

• General conclusions:
  – full bit vector simple and good for moderate scale
  – several schemes should be fine for large scale

[Figure: bar chart of invalidations, normalized to the full-bit-vector scheme, for LocusRoute, Cholesky, and Barnes-Hut under the broadcast (B), no-broadcast (NB), and coarse-vector (CV) overflow schemes; y-axis from 0 to 800.]

Page 22: CS 213 Lecture 11:   Multiprocessor 3: Directory Organization


Summary of Directory Organizations

• Flat schemes:
  • Issue (a): finding the source of directory data
    – go to the home node, based on the address
  • Issue (b): finding out where the copies are
    – memory-based: all the info is in the directory at the home node
    – cache-based: the home node has a pointer to the first element of a distributed linked list
  • Issue (c): communicating with those copies (contrast sketched below)
    – memory-based: point-to-point messages (perhaps coarser on overflow)
      » can be multicast or overlapped
    – cache-based: part of the point-to-point linked-list traversal used to find them
      » serialized

• Hierarchical schemes:
  – all three issues handled by sending messages up and down the tree
  – no single explicit list of sharers
  – only direct communication is between parents and children
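The contrast in issue (c) can be sketched as follows (illustrative Python added for this transcript):

```python
# Memory-based directories can send invalidations to all copies at once, while a
# cache-based (linked-list) scheme walks the sharing list one node at a time.
def invalidate_memory_based(sharers, send):
    for s in sharers:              # point-to-point messages, can be overlapped
        send(s)

def invalidate_cache_based(head, next_ptr, send):
    node = head                    # the home node only knows the first element
    while node is not None:
        send(node)                 # each hop is serialized behind the previous one
        node = next_ptr[node]

invalidate_memory_based({"P1", "P3"}, print)
invalidate_cache_based("P1", {"P1": "P3", "P3": None}, print)
```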

Page 23: CS 213 Lecture 11:   Multiprocessor 3: Directory Organization


Summary of Directory Approaches

• Directories offer scalable coherence on general networks

– no need for broadcast media

• Many possibilities for organizing directory and managing protocols

• Hierarchical directories not used much
  – high latency, many network transactions, and a bandwidth bottleneck at the root

• Both memory-based and cache-based flat schemes are alive

– for memory-based, full bit vector suffices for moderate scale

» measured in nodes visible to directory protocol, not processors

– will examine case studies of each

Page 24: CS 213 Lecture 11:   Multiprocessor 3: Directory Organization


Summary

• Caches contain all information on state of cached memory blocks

• Snooping and Directory Protocols similar; bus makes snooping easier because of broadcast (snooping => uniform memory access)

• Directory has extra data structure to keep track of state of all cache blocks

• Distributing the directory => scalable shared-address multiprocessor => cache-coherent, non-uniform memory access (CC-NUMA)