CS162 Computer Architecture
Lecture 16: Multiprocessors 2: Directory Protocol, Interconnection Networks
Larger MPs

• Separate memory per processor
• Local or remote access via memory controller
• One cache-coherency solution: non-cached pages
• Alternative: directory per cache that tracks the state of every block in every cache
  – which caches have copies of the block, dirty vs. clean, ...
• Info per memory block vs. per cache block?
  – PLUS: in memory => simpler protocol (centralized/one location)
  – MINUS: in memory => directory is f(memory size) vs. f(cache size)
• Prevent the directory from becoming a bottleneck? Distribute directory entries with memory, each keeping track of which processors have copies of their blocks
Network Examples
• Bi-directional ring – e.g., HP V-Class
• 2-D mesh and hypercube – SGI Origin and Cray T3E
• Crossbar and Omega network – SMPs, IBM SP3, and IP routers
• Clusters using Ethernet, Gigabit Ethernet, Myrinet, etc.

Properties of the various networks will be discussed later.
CC-NUMA Multiprocessor: Directory Protocol
• What is Cache-Coherent Non-Uniform Memory Access (CC-NUMA)?
• Similar to the snoopy protocol: three states
  – Shared: ≥ 1 processors have the data; memory is up-to-date
  – Uncached: no processor has it; not valid in any cache
  – Exclusive: 1 processor (the owner) has the data; memory is out-of-date
• In addition to cache state, must track which processors have the data when in the Shared state (usually a bit vector: bit i is 1 if processor i has a copy)
• Directory size: big => limited directory schemes (not discussed here)
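To make the bookkeeping concrete, here is a minimal C sketch of one directory entry with a full bit vector of sharers. The names (dir_state_t, dir_entry_t, MAX_PROCS) are illustrative, not taken from any real machine.

#include <stdint.h>

#define MAX_PROCS 64            /* full bit vector: one bit per processor */

typedef enum {
    UNCACHED,                   /* no cache holds the block; memory is current */
    SHARED,                     /* >= 1 readers; memory is up-to-date */
    EXCLUSIVE                   /* exactly one owner; memory is stale */
} dir_state_t;

typedef struct {
    dir_state_t state;
    uint64_t    sharers;        /* bit i set => processor i has a copy; in the
                                   EXCLUSIVE state the single set bit is the owner */
} dir_entry_t;

/* Example: record that processor p gained a shared copy. */
static inline void add_sharer(dir_entry_t *e, int p) {
    e->sharers |= (uint64_t)1 << p;
    e->state = SHARED;
}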
Directory Protocol

• No bus, and we don't want to broadcast:
  – the interconnect is no longer a single arbitration point
  – all messages have explicit responses
• Terms: typically 3 processors are involved
  – Local node: where a request originates
  – Home node: where the memory location of an address resides
  – Remote node: has a copy of the cache block, whether exclusive or shared
• Example messages on the next slide: P = processor number, A = address
Example Directory Protocol

• A message sent to the directory causes two actions:
  – update the directory
  – send more messages to satisfy the request
• Block is in the Uncached state: the copy in memory is the current value; the only possible requests for that block are:
  – Read miss: the requesting processor is sent the data from memory, and the requestor is made the only sharing node; the state of the block is made Shared.
  – Write miss: the requesting processor is sent the value and becomes the sharing node. The block is made Exclusive to indicate that the only valid copy is cached; Sharers indicates the identity of the owner.
• Block is Shared => the memory value is up-to-date:
  – Read miss: the requesting processor is sent the data from memory, and the requesting processor is added to the sharing set.
  – Write miss: the requesting processor is sent the value. All processors in the set Sharers are sent invalidate messages, and Sharers is set to the identity of the requesting processor. The state of the block is made Exclusive.
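These two cases can be sketched in C, reusing the dir_entry_t type from the earlier sketch; send_data(), send_invalidate(), and memory_read() are hypothetical stand-ins for the real message layer, not actual machine interfaces.

extern uint64_t memory_read(uint64_t addr);            /* hypothetical */
extern void send_data(int proc, uint64_t data);        /* hypothetical */
extern void send_invalidate(int proc, uint64_t addr);  /* hypothetical */

void handle_miss_uncached_or_shared(dir_entry_t *e, int requestor,
                                    int is_write, uint64_t addr) {
    if (e->state == UNCACHED) {
        send_data(requestor, memory_read(addr));      /* memory is current */
        e->sharers = (uint64_t)1 << requestor;        /* requestor is the only holder */
        e->state = is_write ? EXCLUSIVE : SHARED;
    } else {                                          /* SHARED: memory up-to-date */
        send_data(requestor, memory_read(addr));
        if (is_write) {
            /* invalidate every current sharer, then hand over ownership */
            for (int p = 0; p < MAX_PROCS; p++)
                if (e->sharers & ((uint64_t)1 << p))
                    send_invalidate(p, addr);
            e->sharers = (uint64_t)1 << requestor;
            e->state = EXCLUSIVE;
        } else {
            e->sharers |= (uint64_t)1 << requestor;   /* join the sharing set */
        }
    }
}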
Example Directory Protocol

• Block is Exclusive: the current value of the block is held in the cache of the processor identified by the set Sharers (the owner) => three possible directory requests:
  – Read miss: the owner processor is sent a data fetch message, causing the state of the block in the owner's cache to transition to Shared and causing the owner to send the data to the directory, where it is written to memory and sent back to the requesting processor. The identity of the requesting processor is added to the set Sharers, which still contains the identity of the processor that was the owner (since it still has a readable copy). The state is made Shared.
  – Data write-back: the owner processor is replacing the block and hence must write it back, making the memory copy up-to-date (the home directory essentially becomes the owner); the block is now Uncached, and the Sharers set is empty.
  – Write miss: the block has a new owner. A message is sent to the old owner, causing its cache to send the value of the block to the directory, from which it is sent to the requesting processor, which becomes the new owner. Sharers is set to the identity of the new owner, and the state of the block is made Exclusive.
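Continuing the same sketch, the Exclusive-state cases and the write-back might look as follows; fetch_from_owner() and fetch_invalidate_owner() are again hypothetical helpers, and send_data() is reused from the previous sketch.

extern uint64_t fetch_from_owner(int owner, uint64_t addr);       /* owner -> directory;
                                                                     also updates memory */
extern uint64_t fetch_invalidate_owner(int owner, uint64_t addr); /* fetch + invalidate */

static inline int owner_of(const dir_entry_t *e) {
    return __builtin_ctzll(e->sharers);      /* the single set bit is the owner */
}

void handle_miss_exclusive(dir_entry_t *e, int requestor,
                           int is_write, uint64_t addr) {
    int owner = owner_of(e);
    if (!is_write) {
        /* Read miss: the fetch demotes the owner's copy to Shared; the data is
           written back to memory and forwarded to the requestor. */
        send_data(requestor, fetch_from_owner(owner, addr));
        e->sharers |= (uint64_t)1 << requestor;   /* old owner stays a sharer */
        e->state = SHARED;
    } else {
        /* Write miss: fetch/invalidate removes the old owner's copy; the
           requestor becomes the new (sole) owner. */
        send_data(requestor, fetch_invalidate_owner(owner, addr));
        e->sharers = (uint64_t)1 << requestor;
        e->state = EXCLUSIVE;
    }
}

/* Data write-back from the owner: memory is current again. */
void handle_writeback(dir_entry_t *e) {
    e->sharers = 0;
    e->state = UNCACHED;
}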
Interconnection Topologies
• Classes of networks scaling with N
• Logical properties:
  – distance, degree
• Physical properties:
  – length, width
• Static vs. dynamic networks
• Fully connected network
  – diameter = 1
  – degree = N
  – cost?
    » bus => O(N), but BW is O(1) – actually worse
    » crossbar => O(N²) for BW O(N)
• VLSI technology determines switch degree
What characterizes a network?
• Topology (what)
  – the physical interconnection structure of the network graph
  – direct: a node is connected to every switch
  – indirect: nodes are connected to a specific subset of switches
• Routing algorithm (which)
  – restricts the set of paths that messages may follow
  – many algorithms with different properties
    » deadlock avoidance?
• Switching strategy (how)
  – how the data in a message traverses its route
  – circuit switching vs. packet switching
• Flow control mechanism (when)
  – when a message, or portions of it, traverses its route
  – what happens when traffic is encountered?
Flow Control
• What do you do when push comes to shove?
  – Ethernet: collision detection and retry after delay
  – FDDI, token ring: arbitration token
  – TCP/WAN: buffer, drop, adjust rate
  – any solution must adjust to the output rate
• Link-level flow control

[Figure: link-level flow control between two switches, signaled with Data and Ready wires]
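As one illustration of link-level flow control, here is a minimal C sketch of the Data/Ready handshake in the figure: the receiver asserts ready only while buffer space remains, so the sender stalls rather than dropping flits. The buffer size and all names are made up for the example.

#include <stdbool.h>

#define BUF_SLOTS 4                               /* receive buffer depth */

typedef struct {
    int buf[BUF_SLOTS];
    int head, tail, count;
} link_rx_t;

static bool rx_ready(const link_rx_t *rx) {       /* the "Ready" wire */
    return rx->count < BUF_SLOTS;
}

static bool link_send(link_rx_t *rx, int flit) {  /* the "Data" wire */
    if (!rx_ready(rx))
        return false;                             /* sender must stall */
    rx->buf[rx->tail] = flit;
    rx->tail = (rx->tail + 1) % BUF_SLOTS;
    rx->count++;
    return true;
}

static bool link_recv(link_rx_t *rx, int *flit) { /* drain one buffered flit,
                                                     re-asserting Ready */
    if (rx->count == 0)
        return false;
    *flit = rx->buf[rx->head];
    rx->head = (rx->head + 1) % BUF_SLOTS;
    rx->count--;
    return true;
}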
Topological Properties

• Routing distance – the number of links on a route
• Diameter – the maximum routing distance between any two nodes in the network
• Average distance – the sum of distances between nodes / the number of nodes
• Degree of a node – the number of links connected to a node => cost is high if the degree is high
• A network is partitioned by a set of links if their removal disconnects the graph
• Fault tolerance – the number of alternate paths between two nodes in the network
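These definitions are easy to check by brute force. The sketch below (our own example, not from the slides) computes the diameter and average distance of a 16-node linear array and ring over all ordered node pairs.

#include <stdio.h>
#include <stdlib.h>

static int dist_linear(int a, int b)      { return abs(a - b); }
static int dist_ring(int a, int b, int n) {
    int d = abs(a - b);
    return d < n - d ? d : n - d;         /* take the shorter way around */
}

int main(void) {
    int n = 16, diam_l = 0, diam_r = 0;
    long sum_l = 0, sum_r = 0, pairs = 0;
    for (int a = 0; a < n; a++)
        for (int b = 0; b < n; b++) {
            if (a == b) continue;
            int dl = dist_linear(a, b), dr = dist_ring(a, b, n);
            if (dl > diam_l) diam_l = dl;
            if (dr > diam_r) diam_r = dr;
            sum_l += dl; sum_r += dr; pairs++;
        }
    /* expect diameter N-1 vs. N/2, average ~N/3 vs. ~N/4 */
    printf("linear array: diameter %d, avg %.2f\n", diam_l, (double)sum_l / pairs);
    printf("ring:         diameter %d, avg %.2f\n", diam_r, (double)sum_r / pairs);
    return 0;
}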
Review: Performance Metrics
[Figure: message timeline between sender and receiver — sender overhead (processor busy), transmission time (size ÷ bandwidth), time of flight, transmission time at the receiver, receiver overhead (processor busy); transport latency covers time of flight plus transmission, and total latency spans the whole exchange]

Total Latency = Sender Overhead + Time of Flight + Message Size ÷ BW + Receiver Overhead

Includes header/trailer in BW calculation?
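Plugging made-up numbers into the equation shows how transmission time dominates for large messages; every value below is hypothetical.

#include <stdio.h>

int main(void) {
    double sender_ovhd_us    = 1.0;     /* sender overhead, microseconds */
    double receiver_ovhd_us  = 1.0;     /* receiver overhead */
    double time_of_flight_us = 0.5;
    double bw_bytes_per_s    = 100e6;   /* 100 MB/s */
    double msg_bytes         = 4096.0;

    double transmission_us = msg_bytes / bw_bytes_per_s * 1e6;
    double total_us = sender_ovhd_us + time_of_flight_us
                    + transmission_us + receiver_ovhd_us;
    printf("total latency = %.2f us\n", total_us);   /* 1 + 0.5 + 40.96 + 1 = 43.46 */
    return 0;
}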
Example Static Network: 2-D Mesh Architecture
[Figure (a): a 16-node mesh structure, Node 0 through Node 15]
More Static Networks: Linear Arrays and Rings
• Linear array
  – Diameter?
  – Average distance?
  – Bisection bandwidth?
  – Route A -> B given by the relative address R = B - A
• Torus?
• Examples: FDDI, SCI, Fibre Channel Arbitrated Loop, KSR1

[Figure: linear array; torus; torus arranged to use short wires]
Multidimensional Meshes and Tori
• d-dimensional array
  – n = k_{d-1} × ... × k_0 nodes
  – described by a d-vector of coordinates (i_{d-1}, ..., i_0)
• d-dimensional k-ary mesh: N = k^d
  – k = N^(1/d), the d-th root of N
  – described by a d-vector of radix-k coordinates
• d-dimensional k-ary torus (or k-ary d-cube)?

Examples: Intel Paragon (2D), SGI Origin (hypercube), Cray T3E (3D mesh)

[Figure: 2D grid; 3D cube]
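Routing on such meshes is commonly dimension-order: correct one coordinate at a time, lowest dimension first. A minimal C sketch for a 2-dimensional 4-ary mesh follows; the sizes and names are illustrative.

#include <stdio.h>

#define D 2                    /* dimensions */
#define K 4                    /* radix: K^D = 16 nodes */

/* Walk from src to dst, fixing dimension 0 first, then dimension 1. */
static void route(const int src[D], const int dst[D]) {
    int cur[D];
    for (int i = 0; i < D; i++) cur[i] = src[i];
    for (int dim = 0; dim < D; dim++)
        while (cur[dim] != dst[dim]) {
            cur[dim] += (dst[dim] > cur[dim]) ? 1 : -1;   /* one hop */
            printf("hop to (%d,%d)\n", cur[0], cur[1]);
        }
}

int main(void) {
    int a[D] = {0, 0}, b[D] = {3, 2};
    route(a, b);               /* 3 hops in dimension 0, then 2 in dimension 1 */
    return 0;
}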
Hypercubes

• Also called binary n-cubes. Number of nodes N = 2^n.
• O(log N) hops
• Good bisection BW
• Complexity
  – out-degree is n = log N; correct dimensions in order
  – with random communication, 2 ports per processor

[Figure: hypercubes from 0-D through 5-D]
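Hypercube ("e-cube") routing makes "correct dimensions in order" concrete: R = src xor dst marks the differing dimensions, which are then flipped in a fixed order. A small C sketch (names are ours):

#include <stdio.h>

static void ecube_route(unsigned src, unsigned dst, int n) {
    unsigned cur = src, R = src ^ dst;       /* set bits = dimensions to fix */
    for (int bit = 0; bit < n; bit++)        /* correct dimensions in order */
        if (R & (1u << bit)) {
            cur ^= 1u << bit;                /* one hop along this dimension */
            printf("hop to node %u\n", cur);
        }
}

int main(void) {
    ecube_route(0u, 13u, 4);   /* 4-cube: 0 -> 1 -> 5 -> 13, three hops */
    return 0;
}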
Origin Network
• Each router has six pairs of 1.56 GB/s unidirectional links
  – two to nodes, four to other routers
  – latency: 41 ns pin-to-pin across a router
• Flexible cables up to 3 ft long
• Four "virtual channels": request, reply, and two more for priority or I/O
[Figure: Origin network configurations built from nodes (N) and routers — 4-node, 8-node, 16-node, 32-node, and 64-node systems, the largest using meta-routers]
Case Study: Cray T3D
• Build up info in 'shell'
• Remote memory operations encoded in address

[Figure: T3D node block diagram — 150-MHz DEC Alpha (64-bit), 8-KB instruction + 8-KB data caches, 43-bit virtual address, MMU and DTB, 32-bit physical address; the shell adds a prefetch queue (16 × 64), a message queue (4,080 × 4 × 64), special registers (swaperand, fetch&add, barrier, PE# + FC), DMA, and a block transfer engine between DRAM and the request/response network ports; features include prefetch, load-lock/store-conditional, 32- and 64-bit memory and byte operations, and nonblocking stores with memory barrier; nodes sit in a 3D torus of pairs of PEs that share the network and BLT, up to 2,048 PEs with 64 MB each]
Trees
• Diameter and average distance are logarithmic
  – k-ary tree, height d = log_k N
  – address specified by a d-vector of radix-k coordinates describing the path down from the root
• Fixed degree
• Route up to the common ancestor and down (see the sketch below)
  – R = B xor A
  – let i be the position of the most significant 1 in R; route up i+1 levels
  – down in the direction given by the low i+1 bits of B
• H-tree space is O(N) with O(√N) long wires
• Bisection BW?
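A minimal C sketch of that up/down computation for a binary tree (k = 2); the function name and output format are illustrative.

#include <stdio.h>

static void tree_route(unsigned A, unsigned B) {
    unsigned R = A ^ B;
    int i = 31;
    while (i >= 0 && !(R & (1u << i))) i--;   /* position of most significant 1 */
    if (i < 0) return;                        /* A == B: nothing to do */
    printf("route up %d levels\n", i + 1);
    for (int lvl = i; lvl >= 0; lvl--)        /* descend following B's low bits */
        printf("down %s\n", ((B >> lvl) & 1) ? "right" : "left");
}

int main(void) {
    tree_route(5u, 3u);   /* leaves 101 and 011: up 3 levels, down left-right-right */
    return 0;
}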
Real Machines
• Wide links, smaller routing delay
• Tremendous variation

Machine        Topology    Cycle Time (ns)  Channel Width (bits)  Routing Delay (cycles)  Flit (data bits)
nCUBE/2        Hypercube   25               1                     40                      32
TMC CM-5       Fat tree    25               4                     10                      4
IBM SP-2       Banyan      25               8                     5                       16
Intel Paragon  2D mesh     11.5             16                    2                       16
Meiko CS-2     Fat tree    20               8                     7                       8
Cray T3D       3D torus    6.67             16                    2                       16
DASH           Torus       30               16                    2                       16
J-Machine      3D mesh     31               8                     2                       8
Monsoon        Butterfly   20               16                    2                       16
SGI Origin     Hypercube   2.5              20                    16                      160
Myricom        Arbitrary   6.25             16                    50                      16