Post on 14-Jan-2016
Graduate Computer Architecture I
Lecture 11: Distributed Memory Multiprocessors
2 - CSE/ESE 560M – Graduate Computer Architecture I
Natural Extensions of Memory System
[Figure: three natural extensions — shared cache (P1…Pn share a first-level cache through a switch to interleaved main memory); centralized "dance hall" memory (UMA: processors with private caches reach memory modules over an interconnection network); distributed memory (NUMA: a memory module sits with each processor node on the network). Scalability increases from left to right.]
Fundamental Issues
1. Naming
2. Synchronization
3. Performance: latency and bandwidth
Fundamental Issue #1: Naming
• Naming
– what data is shared
– how it is addressed
– what operations can access the data
– how processes refer to each other
• Choice of naming affects
– code produced by a compiler
• via load, where the compiler need only remember an address, versus keeping track of a processor number and local virtual address for message passing
– replication of data
• via load in the cache memory hierarchy, or via software replication and consistency
Fundamental Issue #1: Naming
• Global physical address space
– any processor can generate, address, and access it in a single operation
– memory can be anywhere: virtual address translation handles it
• Global virtual address space
– if the address space of each process can be configured to contain all shared data of the parallel program
• Segmented shared address space
– locations are named <process number, address> uniformly for all processes of the parallel program
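The segmented scheme above can be sketched as a simple translation step; a minimal illustration (the segment size and function names are assumptions, not from the slides):

```python
# Segmented shared address space: every location is named by a
# <process number, address> pair, uniformly for all processes.
# Assume each process's segment is mapped at a fixed base in a flat space.

SEGMENT_SIZE = 1 << 20  # hypothetical 1 MiB per-process segment

def to_global(process_number, local_address):
    """Map a <process number, address> name to a flat global address."""
    assert 0 <= local_address < SEGMENT_SIZE
    return process_number * SEGMENT_SIZE + local_address

def to_segmented(global_address):
    """Invert the mapping: recover the <process number, address> name."""
    return divmod(global_address, SEGMENT_SIZE)

# Process 3, local address 0x10 names the same location for every process:
g = to_global(3, 0x10)
assert to_segmented(g) == (3, 0x10)
```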
Fundamental Issue #2: Synchronization
• Message passing
– implicit coordination
– transmission of data
– arrival of data
• Shared address
– explicit coordination
– write a flag
– awaken a thread
– interrupt a processor
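The shared-address style of explicit coordination (write a flag, awaken a thread) can be sketched with Python threads, an `Event` playing the role of the flag:

```python
import threading

# Shared-address synchronization: one thread writes the shared data,
# then sets a flag; the other waits on the flag before reading.
# The Event both holds the flag and "awakens" the waiting thread.
data = None
flag = threading.Event()

def producer():
    global data
    data = 42          # write the shared data
    flag.set()         # write the flag / awaken the consumer

results = []
def consumer():
    flag.wait()        # wait until the flag is set
    results.append(data)

t1 = threading.Thread(target=consumer)
t2 = threading.Thread(target=producer)
t1.start(); t2.start(); t1.join(); t2.join()
assert results == [42]
```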
Parallel Architecture Framework
• Programming Model
– Multiprogramming
• lots of independent jobs
• no communication
– Shared address space
• communicate via memory
– Message passing
• send and receive messages
• Communication Abstraction
– Shared address space
• load, store, atomic swap
– Message passing
• send, receive library calls
– Debate over this topic
• ease of programming vs. scalability
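The two abstractions can be contrasted in a toy, single-process sketch (a `Queue` stands in for the network; the example is illustrative, not an actual machine interface):

```python
from queue import Queue

# Shared address space: both "processors" load/store one memory array.
memory = [0] * 8
memory[3] = 7          # P1: store
assert memory[3] == 7  # P2: load sees the stored value

# Message passing: data moves only through explicit send/receive
# library calls on a channel; there is no shared memory to load from.
channel = Queue()
channel.put(7)             # P1: send
assert channel.get() == 7  # P2: receive
```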
Scalable Machines
• Design trade-offs for the machines
– specialized vs. commodity nodes
– capability of the node-to-network interface
– supported programming models
• Scalability
– avoids inherent design limits on resources
– bandwidth increases as resources are added
– latency does not increase
– cost increases slowly as resources are added
Bandwidth Scalability
• Fundamental limits on bandwidth
– amount of wires
– bus vs. network switch
[Figure: four processor–memory nodes (P, M) attached through switches (S); typical switch realizations are a bus, multiplexers, or a crossbar.]
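The bus-versus-switch contrast can be made concrete with a toy bandwidth model (a sketch; the unit link bandwidth and the formulas are assumptions, not from the slides):

```python
# Toy bandwidth-scalability model: a shared bus offers one fixed set of
# wires no matter how many nodes attach, while a crossbar of switches
# provides an independent path per node.

LINK_BW = 1.0  # assumed bandwidth of one bus/link, arbitrary units

def bus_aggregate_bw(p):
    return LINK_BW            # all p nodes contend for the same wires

def crossbar_aggregate_bw(p):
    return p * LINK_BW        # p disjoint source->dest paths can be busy

for p in (4, 16, 64):
    assert bus_aggregate_bw(p) == 1.0        # does not scale with p
assert crossbar_aggregate_bw(64) == 64.0     # grows with the machine
```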
Dancehall Multiprocessor Organization
[Figure: dance-hall organization — processors with caches (P, $) on one side of a scalable network of switches, and all memory modules (M) on the other side.]
Generic Distributed System Organization
[Figure: generic distributed organization — each node attaches a processor, cache, memory, and communication assist (Comm Assist) to a switch of the scalable network.]
Key Property of Distributed System
• Large number of independent communication paths between nodes
– allows a large number of concurrent transactions using different wires
• Independent initialization
• No global arbitration
• Effect of a transaction is only visible to the nodes involved
– effects are propagated through additional transactions
Programming Models Realized by Protocols

[Figure: layered view — parallel applications (CAD, database, scientific modeling) are written to programming models (multiprogramming, shared address, message passing, data parallel); the communication abstraction forms the user/system boundary, realized by compilation or library plus operating-system support; below the hardware/software boundary sit the communication hardware and the physical communication medium, which carry network transactions.]
Network Transaction
• Interpretation of the message
– complexity of the message
• Processing in the communication assist
– processing power
[Figure: node architecture — each node's processor and memory (PM) attach through a communication assist (CA) to the scalable network. A message undergoes output processing at the source (checks, translation, formatting, scheduling) and input processing at the destination (checks, translation, buffering, action).]
Shared Address Space Abstraction
• Fundamentally a two-way request/response protocol
– writes have an acknowledgement
• Issues
– fixed or variable length (bulk) transfers
– remote virtual or physical address
– deadlock avoidance and input buffer full
– memory coherency and consistency
[Figure: remote read timeline for "Load [Global address]" — source: (1) initiate memory access, (2) address translation, (3) local/remote check, (4) read-request transaction, then wait; destination: (5) remote memory access, (6) read-response (reply) transaction; source: (7) complete memory access.]
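The seven steps can be sketched as a toy request/response exchange (the node class, message format, and address layout are illustrative assumptions, not from the slides):

```python
# Toy model of a shared-address remote read: translate the global
# address, check local vs. remote, issue a request transaction to the
# home node, access memory there, and return the reply.

class Node:
    def __init__(self, node_id, memory):
        self.node_id = node_id
        self.memory = memory          # this node's slice of memory

    def load(self, global_addr, nodes, words_per_node):
        home, offset = divmod(global_addr, words_per_node)  # (2) + (3)
        if home == self.node_id:
            return self.memory[offset]                      # local access
        request = ("read", self.node_id, offset)            # (4) request
        return nodes[home].handle(request)                  # wait for (6)

    def handle(self, request):
        kind, src, offset = request
        assert kind == "read"
        return self.memory[offset]                          # (5) access

nodes = [Node(0, [10, 11]), Node(1, [20, 21])]
assert nodes[0].load(3, nodes, 2) == 21   # remote read served by node 1
assert nodes[1].load(0, nodes, 2) == 10   # remote read served by node 0
```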
Shared Physical Address Space
[Figure: hardware realization — a load (Ld R, Addr) passes through the processor's memory management unit; the source communication assist acts as a pseudo-memory (output processing: memory access, response) and the remote assist as a pseudo-processor (input processing: parse request, complete the read). Request packets carry Dest/Read/Addr/Src/Tag fields and replies carry Data/Tag/Rrsp/Src fields.]
Shared Address Abstraction
• Source and destination data addresses are specified by the source of the request
– a degree of logical coupling and trust
• No storage logically "outside the address space"
– may employ temporary buffers for transport
• Operations are fundamentally request/response
• Remote operations can be performed on remote memory
– logically they do not require intervention of the remote processor
Message Passing
• Bulk transfers
• Synchronous
– send completes after the matching recv and the source data have been sent
– receive completes after the data transfer from the matching send is complete
• Asynchronous
– send completes as soon as the send buffer may be reused
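The two completion rules can be sketched with a toy channel (threads and an `Event` here are stand-ins for the message layer, not an actual implementation):

```python
from queue import Queue
import threading

# Asynchronous send: completes as soon as the message is buffered, so
# the send buffer may be reused immediately. Synchronous send:
# completes only after a matching receive has taken the data.

channel = Queue()

def async_send(msg):
    channel.put(msg)        # buffered; sender may reuse its buffer now

def sync_send(msg):
    done = threading.Event()
    channel.put((msg, done))
    done.wait()             # block until the matching receive runs

def sync_recv():
    msg, done = channel.get()
    done.set()              # the receive also completes the sender
    return msg

async_send("hello")
assert channel.get() == "hello"

t = threading.Thread(target=sync_send, args=("world",))
t.start()
assert sync_recv() == "world"
t.join()                    # the sender finished only after the recv
```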
Synchronous Message Passing
• Constrained programming model
• Destination contention very limited
• User/system boundary
[Figure: synchronous message-passing timeline — source: (1) initiate Send (Pdest, local VA, len), (2) address translation on Psrc, (3) local/remote check, (4) send-ready request, then wait; destination: posts Recv (Psrc, local VA, len), (5) remote check for posted receive (assume success) with tag check, (6) receive-ready reply; (7) bulk data transfer from source VA to destination VA or ID.]
Asynch Message Passing: Optimistic
• More powerful programming model
• Wildcard receive → non-deterministic
• Storage required within the message layer?
[Figure: optimistic asynchronous send timeline — source: (1) initiate Send (Pdest, local VA, len), (2) address translation, (3) local/remote check, (4) send data; destination: (5) remote check for a posted receive — on failure, allocate a data buffer; a later Recv (Psrc, local VA, len) tag-matches against the buffered data.]
Active Messages
• User-level analog of a network transaction
– transfer a data packet and invoke a handler to extract it from the network and integrate it with the on-going computation
• Request/reply
• Event notification: interrupts, polling, events?
• May also perform memory-to-memory transfer
[Figure: active-message pattern — a request packet invokes a handler at the destination; that handler's reply packet invokes a handler back at the source.]
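A minimal sketch of the active-message idea with a polled dispatch loop (the handler registry, handler names, and the list standing in for the network are all hypothetical):

```python
# Toy active-message layer: each packet names a user-level handler that
# runs on arrival and integrates the data with the computation.

handlers = {}

def register(name):
    def deco(fn):
        handlers[name] = fn
        return fn
    return deco

network = []  # stands in for the wire

def send(handler_name, *payload):
    network.append((handler_name, payload))

def poll():
    """Event notification by polling: run handlers for arrived packets."""
    while network:
        name, payload = network.pop(0)
        handlers[name](*payload)

accumulator = []

@register("put")
def put_handler(value):            # extracts the data from the network ...
    accumulator.append(value)
    send("ack", len(accumulator))  # ... and issues the reply packet

@register("ack")
def ack_handler(count):
    accumulator.append(("acked", count))

send("put", 99)   # request
poll()            # runs put_handler, whose reply runs ack_handler
assert accumulator == [99, ("acked", 1)]
```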
Message Passing Abstraction
• Source knows the send data address, destination knows the receive data address
– after the handshake, they both know both
• Arbitrary storage "outside the local address spaces"
– may post many sends before any receives
– non-blocking asynchronous sends reduce the requirement to an arbitrary number of descriptors
• Fundamentally a 3-phase transaction
– includes a request/response
– can use an optimistic 1-phase protocol in limited "safe" cases
Data Parallel
• Operations can be performed in parallel
– on each element of a large regular data structure, such as an array
– data-parallel programming languages lay out data to processors
• Processing Elements (PEs)
– one control processor broadcasts to many PEs
– when computers were large, the control portion could be amortized over many replicated PEs
– a condition flag per PE allows individual PEs to skip an operation
– data is distributed across the PE memories
• Early 1980s: VLSI SIMD rebirth
– 32 1-bit PEs plus memory on a single chip
Data Parallel
• Architecture development
– vector processors have similar ISAs, but no data-placement restriction
– SIMD led to data-parallel programming languages
– Single Program Multiple Data (SPMD) model: all processors execute an identical program
• Advanced VLSI technology
– single-chip FPUs
– fast microprocessors (making SIMD less attractive)
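The SPMD model can be sketched minimally: every "processor" runs the identical program, and behavior differs only through its rank and data slice (the function and the cyclic data layout are illustrative assumptions):

```python
# Toy SPMD: one program, parameterized by processor rank; each rank
# works on its own slice of the shared problem.

def program(rank, nprocs, data):
    """The single program every processor executes."""
    my_slice = data[rank::nprocs]       # cyclic slice of the array
    return sum(x * x for x in my_slice)

data = list(range(8))
nprocs = 4
partials = [program(r, nprocs, data) for r in range(nprocs)]  # "in parallel"
assert sum(partials) == sum(x * x for x in data)  # all elements covered once
```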
Cache Coherent System
• Invoking the coherence protocol
– the state of the line is maintained in the cache
– the protocol is invoked if an "access fault" occurs on the line
• Actions to maintain coherence
– look at the state of the block in other caches
– locate the other copies
– communicate with those copies
Scalable Cache Coherence
[Figure: the generic distributed node (P, $, CA, M, switch) on a scalable network, annotated — scalable networks support many simultaneous transactions; programming models are realized through network-transaction protocols (an efficient node-to-network interface that interprets transactions); caches naturally replicate data, with coherence via bus-snooping protocols on small machines; scalable distributed memory therefore needs cache-coherence protocols that scale, with no broadcast or single point of order.]
Bus-based Coherence
• All actions are done as broadcasts on the bus
– the faulting processor sends out a "search"
– the others respond to the search probe and take the necessary action
• Could do it in a scalable network too
– broadcast to all processors, and let them respond
• Conceptually simple, but doesn't scale with p
– on a bus, bus bandwidth doesn't scale
– on a scalable network, every fault leads to at least p network transactions
One Approach: Hierarchical Snooping
• Extend the snooping approach
– hierarchy of broadcast media
– processors are in bus- or ring-based multiprocessors at the leaves
– parents and children are connected by two-way snoopy interfaces
– main memory may be centralized at the root or distributed among the leaves
• Actions are handled similarly to a bus, but without full broadcast
– the faulting processor sends out a "search" bus transaction on its own bus
– it propagates up and down the hierarchy based on snoop results
• Problems
– high latency: multiple levels, and a snoop/lookup at every level
– bandwidth bottleneck at the root
Scalable Approach: Directories
• Directory
– maintains the set of cached block copies
– maintains memory block states
– on a miss to a block in its own memory, a node
• looks up the directory entry
• communicates only with the nodes holding copies
– scalable networks
• communication through network transactions
• Different ways to organize the directory
Basic Directory Transactions
[Figure: (a) read miss to a block in dirty state — 1. the requestor sends a read request to the directory (home) node; 2. the directory replies with the owner's identity; 3. the requestor sends a read request to the owner; 4a. the owner returns a data reply to the requestor and 4b. sends a revision message to the directory. (b) Write miss to a block with two sharers — 1. the requestor sends a read-exclusive (RdEx) request to the directory; 2. the directory replies with the sharers' identities; 3a/3b. the requestor sends invalidation requests to each sharer; 4a/4b. the sharers return invalidation acks.]
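The two transactions can be acted out with a toy single-block directory (a sketch: the classes and method names are illustrative; states loosely follow the slides — caches E/S/I, directory D(irty)/S(hared)/U(ncached)):

```python
# Toy directory protocol for one memory block, covering:
# (a) a read miss to a dirty block, (b) a write miss with two sharers.

class Directory:
    def __init__(self, nodes):
        self.state, self.sharers, self.memory = "U", set(), 0
        self.nodes = nodes

    def read_miss(self, requestor):
        if self.state == "D":                    # (a): ask the owner
            (owner,) = self.sharers
            self.memory = self.nodes[owner].writeback()  # revision msg
        self.sharers.add(requestor)
        self.state = "S"
        return self.memory                       # data reply

    def write_miss(self, requestor):
        for s in self.sharers - {requestor}:     # (b): invalidate sharers
            self.nodes[s].invalidate()           # each returns an ack
        self.sharers = {requestor}
        self.state = "D"

class Cache:
    def __init__(self):
        self.state, self.value = "I", None
    def writeback(self):
        self.state = "S"
        return self.value
    def invalidate(self):
        self.state = "I"

nodes = [Cache(), Cache(), Cache()]
home = Directory(nodes)

home.write_miss(0)
nodes[0].state, nodes[0].value = "E", 5      # node 0 holds the block dirty

assert home.read_miss(1) == 5                # (a) read miss to dirty block
assert home.state == "S" and nodes[0].state == "S"

home.write_miss(2)                           # (b) write miss, two sharers
nodes[2].state = "E"
assert nodes[0].state == "I" and nodes[1].state == "I"
assert home.state == "D" and home.sharers == {2}
```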
Example Directory Protocol (1st Read)

[Figure: P1 executes ld vA → rd pA; the read request reaches the home memory's directory controller, the directory entry for pA moves from U(ncached) to S(hared) recording P1 as a sharer, and the reply installs the block in P1's cache in state S.]
Example Directory Protocol (Read Share)

[Figure: P2 also executes ld vA → rd pA; the directory entry for pA remains S(hared), now recording both P1 and P2, and the block is installed in P2's cache in state S alongside P1's copy.]
Example Directory Protocol (Wr to shared)

[Figure: with pA shared by P1 and P2, P1 executes st vA → wr pA. P1 issues a write/read-to-update request for exclusive access; the directory sends an invalidate for pA to P2, whose line goes to I and returns an inv ack (with a data reply if dirty); the directory entry moves from S to D recording P1 as the exclusive owner, and P1's line becomes E(xclusive) so the store can complete.]
A Popular Middle Ground
• Two-level "hierarchy"
– coherence across nodes is directory-based
• the directory keeps track of nodes, not individual processors
– coherence within a node is snooping or directory
• orthogonal, but needs a good interface of functionality
• Examples
– Convex Exemplar: directory-directory
– Sequent, Data General, HAL: directory-snoopy
Two-level Hierarchies

[Figure: the four two-level organizations — (a) snooping-snooping: snooping buses (B1) at the leaves bridged by snooping adapters onto a global bus or ring (B2); (b) snooping-directory: snooping buses at the leaves attached through assists to a scalable network; (c) directory-directory: directory-based nodes on Network1 bridged by directory adapters onto Network2; (d) directory-snooping: directory-based nodes on Network1 bridged by dir/snoopy adapters.]
Memory Consistency
• Memory coherence
– provides a consistent view of memory
– but does not ensure how consistent it is
– nor in what order of execution

[Figure: P1, P2, P3, each with memory (A:0, flag:0→1), on an interconnection network. P1 executes "A=1; flag=1;" while P3 executes "while (flag==0); print A;". If the write 2: flag=1 takes a fast path while the write 1: A=1 is delayed on a congested path, P3's 3: load A can observe flag==1 while A is still 0.]
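The hazard in the flag example can be reproduced in a toy delivery model (a sketch: modeling the network as an arbitrary delivery order of the two writes is an assumption for illustration):

```python
import itertools

# P1 performs two writes (A=1 then flag=1); with no ordering guarantee,
# the network may deliver them to P3's memory in either order. P3 waits
# for flag and then reads A.

def p3_observes(delivery_order):
    mem = {"A": 0, "flag": 0}
    for var, val in delivery_order:   # writes arrive in this order
        mem[var] = val
        if mem["flag"] == 1:          # P3's wait on flag succeeds ...
            return mem["A"]           # ... and it immediately reads A

writes = [("A", 1), ("flag", 1)]
outcomes = {p3_observes(order) for order in itertools.permutations(writes)}
# In-order delivery yields A==1; the reordered (congested-path) delivery
# lets P3 print the stale A==0.
assert outcomes == {0, 1}
```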
Memory Consistency
• Relaxed consistency
– allows out-of-order completion
– different read and write ordering models
– increases performance, but admits possible errors
• Current systems
– use relaxed models
– expect programs to be properly synchronized
– use standard synchronization libraries