Multi-threaded Architectures - Sima, Fountain and Kacsuk, Chapter 16 (CSE462)
2
Slides by David Abramson, 2004. Material from Sima, Fountain and Kacsuk, Addison-Wesley, 1997.
Memory and Synchronization Latency
Scalability of a system is limited by its ability to handle memory latency and algorithmic synchronization delays
Overall solution is well known
– Do something else whilst waiting
Remote memory accesses
– Much slower than local accesses
– Varying delay depending on
  • Network traffic
  • Memory traffic
3
Processor Utilization
Utilization = P / T
• P = time spent processing
• T = total time
Equivalently, utilization = P / (P + I + S)
• I = time spent waiting on other tasks
• S = time spent switching tasks
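A quick worked example of the utilization formula (the numbers are illustrative, not from the slides):

```python
# Illustrative numbers (not from the slides).
P = 60.0   # time spent processing
I = 30.0   # time spent waiting on other tasks
S = 10.0   # time spent switching tasks

T = P + I + S              # total time
utilization = P / T        # equivalently P / (P + I + S)
print(f"Utilization = {utilization:.2f}")   # Utilization = 0.60
```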
4
Basic ideas - Multithreading
Fine Grain – task switch every cycle
Coarse Grain – task switch every n cycles
[Figure: timing diagrams for fine- and coarse-grain multithreading, showing blocked periods and task-switch overhead]
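A minimal scheduling sketch of the two policies (my own illustration, not from the book): fine grain rotates threads every cycle, coarse grain runs one thread until it blocks and then pays a switch overhead.

```python
# Illustrative scheduling sketch (not from the book).
# Each thread is a list of operations: 'c' = compute cycle,
# 'm' = long-latency access on which the thread would block.

def fine_grain(threads):
    """Switch threads every cycle."""
    pending = [list(t) for t in threads]
    trace, i = [], 0
    while any(pending):
        t = pending[i % len(pending)]
        if t:
            trace.append((i % len(pending), t.pop(0)))
        i += 1
    return trace

def coarse_grain(threads, switch_cost=2):
    """Run one thread until it blocks, then pay a task-switch overhead."""
    pending = [list(t) for t in threads]
    trace, i = [], 0
    while any(pending):
        t = pending[i % len(pending)]
        while t and t[0] == 'c':
            trace.append((i % len(pending), t.pop(0)))
        if t and t[0] == 'm':                 # blocked: switch threads
            t.pop(0)
            trace.extend([('switch', '-')] * switch_cost)
        i += 1
    return trace

threads = [['c', 'c', 'm', 'c'], ['c', 'm', 'c', 'c']]
print(fine_grain(threads))
print(coarse_grain(threads))
```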
5
Design Space
Multi-threaded architectures – four dimensions:
Computational model
– Von Neumann (sequential control flow)
– Hybrid von Neumann/dataflow
  • Parallel control flow based on parallel control operators
  • Parallel control flow based on control tokens
Granularity
– Fine grain
– Coarse grain
Memory organization
– Physical shared memory
– Distributed shared memory
– Cache-coherent distributed shared memory
Number of threads per processor
– Small (4 – 10)
– Middle (10 – 100)
– Large (over 100)
6
Classification of multi-threaded architectures
Multi-threaded architectures
– Von Neumann based architectures
  • HEP
  • Tera
  • MIT Alewife & Sparcle
– Hybrid von Neumann/dataflow architectures
  • RISC-like: P-RISC, *T
  • Decoupled: USC, McGill MGDA & SAM
  • Macro dataflow: MIT Hybrid Machine, EM-4
8
Sequential control flow (von Neumann)
Flow of control and data separated
Executed sequentially (or at least with sequential semantics – see Chapter 7)
Control flow changed with JUMP/GOTO/CALL instructions
Data stored in rewritable memory
– Flow of data does not affect execution order
9
Sequential Control Flow Model
R = (A - B) * (B + 1)
L1:  m1 := A - B
L2:  m2 := B + 1
L3:  R := m1 * m2
(control flow passes sequentially from L1 to L3)
10
Dataflow
Control is tied to data
An instruction "fires" when its data is available
– Otherwise it is suspended
Order of instructions in the program has no effect on execution order
– Cf. von Neumann
No shared rewritable memory
– Write-once semantics
Code is stored as a dataflow graph
Data is transported as tokens
Parallelism occurs if multiple instructions can fire at the same time
– Needs a parallel processor
Nodes are self-scheduling
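A minimal sketch of the firing rule for R = (A - B) * (B + 1); the node names, token format and Python representation are my own illustration, not the book's notation.

```python
# Minimal dataflow-firing sketch for R = (A - B) * (B + 1).
# Node names and token format are illustrative, not from the book.
import operator

# Each node: (operation, number of inputs, list of (target node, operand slot))
graph = {
    'sub': (operator.sub, 2, [('mul', 0)]),
    'add': (operator.add, 2, [('mul', 1)]),
    'mul': (operator.mul, 2, [('R', 0)]),
}
operands = {name: {} for name in graph}   # arrived tokens, per node
results = {}

def send(node, slot, value):
    """Deliver a token; fire the node once all of its operands have arrived."""
    if node == 'R':
        results['R'] = value
        return
    op, arity, targets = graph[node]
    operands[node][slot] = value
    if len(operands[node]) == arity:       # node becomes fireable
        out = op(operands[node][0], operands[node][1])
        for tgt, tgt_slot in targets:
            send(tgt, tgt_slot, out)

# Initial tokens: A, B and the constant 1. Arrival order does not matter.
A, B = 7, 3
send('add', 1, 1)      # constant 1
send('sub', 0, A)
send('add', 0, B)
send('sub', 1, B)
print(results['R'])    # (7 - 3) * (3 + 1) = 16
```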
11
Dataflow – arbitrary execution order
R = (A - B) * (B + 1)
[Figure: dataflow graph with (-), (+) and (*) nodes; inputs A, B and the constant 1; output R]
13
Dataflow – Parallel Execution
R = (A - B) * (B + 1)
[Figure: the same dataflow graph; the (-) and (+) nodes can fire in parallel]
14
Implementation
The dataflow model requires a very different execution engine
Data must be stored in a special matching store
Instructions must be triggered when both operands are available
Parallel operations must be scheduled to processors dynamically
– Don't know a priori when they will be available
Instruction operands are pointers
– To an instruction
– To an operand number
15
Dataflow model of execution
[Figure: dataflow code for R = (A - B) * (B + 1); each operand is addressed as instruction/operand-number, e.g. L1 computes A and B and sends them to L2/2 and L3/1, the (-) at L2 sends its result to L4/1, the (+) at L3 (with constant 1) sends its result to L4/2, and the (*) at L4 sends its result to L6/1]
16
Parallel Control flow
Sometimes called macro dataflow
– Data flows between blocks of sequential code
– Has the advantages of both dataflow and von Neumann
  • Context switch overhead reduced
  • Compiler can schedule instructions statically
  • Don't need a fast matching store
Requires additional control instructions
– FORK/JOIN (see the sketch below)
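A sketch of the FORK/JOIN idea using ordinary Python threads (illustrative only): the two sequential blocks run concurrently and join before the multiply.

```python
# Sketch of fork/join macro dataflow for R = (A - B) * (B + 1) (illustrative).
import threading

def compute(A, B):
    m = {}

    def block1():            # L2: m1 := A - B
        m['m1'] = A - B

    def block2():            # L4: m2 := B + 1
        m['m2'] = B + 1

    t = threading.Thread(target=block2)   # FORK L4
    t.start()
    block1()
    t.join()                              # JOIN 2
    return m['m1'] * m['m2']              # L6: R := m1 * m2

print(compute(7, 3))   # (7 - 3) * (3 + 1) = 16
```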
17
Macro Dataflow (Hybrid Control/Dataflow)
R = (A - B) * (B + 1)
L1:  FORK L4
L2:  m1 := A - B
L3:  GOTO L5
L4:  m2 := B + 1
L5:  JOIN 2
L6:  R := m1 * m2
(two sequential control flows run between the FORK and the JOIN)
18
Issues for Hybrid dataflow
Blocks of sequential instructions need to be large enough to absorb overheads of context switching
Data memory is the same as in an MIMD machine
– Can be partitioned or shared
– Synchronization instructions required
  • Semaphores, test-and-set (see the sketch below)
Control tokens are required to synchronize threads.
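A minimal sketch of busy-wait synchronization built on test-and-set (my own illustration; a Python lock stands in for the atomic instruction that real hardware provides):

```python
# Sketch of a spin lock built on an atomic test-and-set (illustrative only).
import threading

class TestAndSetLock:
    def __init__(self):
        self._flag = False
        self._atomic = threading.Lock()   # stands in for hardware atomicity

    def test_and_set(self):
        """Atomically set the flag and return its previous value."""
        with self._atomic:
            old, self._flag = self._flag, True
            return old

    def acquire(self):
        while self.test_and_set():        # spin until the old value was False
            pass

    def release(self):
        self._flag = False

lock, counter = TestAndSetLock(), [0]

def worker():
    for _ in range(1000):
        lock.acquire()
        counter[0] += 1
        lock.release()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter[0])   # 4000
```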
20
Denelcor HEP
Designed to tolerate memory latency
Fine-grain interleaving of threads
Processor pipeline contains 8 stages
Each time step a new thread enters the pipeline
Threads are taken from the Process Status Word (PSW) queue
After a thread is taken from the PSW queue, its instruction and operands are fetched
When an instruction is executed, another one is placed on the PSW queue
Threads are interleaved at the instruction level.
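A rough sketch of the PSW-queue interleaving (an illustration, not the real HEP microarchitecture): each cycle the thread at the head of the queue issues one instruction and is requeued.

```python
# Rough sketch of HEP-style interleaving (illustrative, not the real machine).
from collections import deque

psw_queue = deque([
    ('T0', ['i0', 'i1', 'i2']),
    ('T1', ['j0', 'j1']),
    ('T2', ['k0', 'k1', 'k2']),
])

cycle = 0
while psw_queue:
    name, instrs = psw_queue.popleft()
    print(f"cycle {cycle}: {name} issues {instrs[0]}")
    cycle += 1
    if instrs[1:]:
        psw_queue.append((name, instrs[1:]))   # thread re-enters the queue
```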
21
Denelcor HEP
Memory latency tolerance is handled by the Scheduler Function Unit (SFU)
Memory words are tagged as full or empty
Attempting to read an empty word suspends the current thread
– The current PSW entry is moved to the SFU
When the data is written, the entry is taken from the SFU and placed back on the PSW queue.
22
Synchronization on the HEP
All registers have a Full/Empty/Reserved bit
Reading an empty register causes the thread to be placed back on the PSW queue without updating its program counter
Thread synchronization is busy-wait
– But other threads can run
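A sketch of the full/empty-bit idea (illustrative; the HEP does this in hardware): reading an empty cell just puts the reading thread back on the queue, so the wait is a busy-wait but other threads keep running.

```python
# Sketch of full/empty-bit synchronization (illustrative, not HEP hardware).
from collections import deque

class Cell:
    def __init__(self):
        self.full, self.value = False, None

    def write(self, v):
        self.value, self.full = v, True

    def read(self):
        return self.value if self.full else None   # None => "suspend and retry"

cell = Cell()

def producer():
    yield "compute"
    cell.write(42)
    yield "wrote 42"

def consumer():
    while True:
        v = cell.read()
        if v is None:
            yield "cell empty, requeued"        # thread goes back on the queue
        else:
            yield f"read {v}"
            return

ready = deque([("producer", producer()), ("consumer", consumer())])
while ready:
    name, thread = ready.popleft()
    try:
        print(name, "->", next(thread))
        ready.append((name, thread))
    except StopIteration:
        pass
```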
23
HEP Architecture
[Figure: HEP processor organization – PSW queue, increment control, program memory, registers, operand fetch with operand hands 1 and 2, matching unit, function units 1..N, and the SFU connecting to/from data memory]
24
HEP configuration
Up to 16 processors
Up to 128 data memories
Connected by a high-speed switch
Limitations
– Threads can have only 1 outstanding memory request
– Thread synchronization puts bubbles in the pipeline
– Maximum of 64 threads, causing problems for software
  • Need to throttle loops
– If parallelism is lower than 8, full utilisation is not possible.
25
MIT Alewife Processor
512 processors in a 2-dimensional mesh
Sparcle processor
Physically distributed memory
Logically shared memory
Hardware-supported cache coherence
Hardware-supported user-level message passing
Multi-threading
26
Threading in Alewife
Coarse-grained multithreading
Pipeline works on a single thread as long as no remote memory access or synchronization is required
Can exploit register optimization in the pipeline
Integration of multi-threading with hardware-supported cache coherence
27
The Sparcle Processor
Extension of the Sun SPARC architecture
Tolerant of memory latency
Fine-grained synchronisation
Efficient user-level message passing
28
Fast context switching
SPARC has 8 overlapping register windows
Used in Sparcle in pairs to represent 4 independent, non-overlapping contexts
– Three for user threads
– One for traps and message handlers
Each context contains 32 general-purpose registers and
– PSR (Processor State Register)
– PC (Program Counter)
– nPC (next Program Counter)
Thread states
– Active
– Loaded
  • State stored in registers – can become active
– Ready
  • Not suspended and not loaded
– Suspended
Thread switching
– Fast if one thread is active and the other is loaded (see the sketch below)
– Need to flush the pipeline (cf. HEP)
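A data-structure sketch of the four contexts and the fast/slow switch paths (my own illustration; the names and the eviction policy are invented, not Sparcle's):

```python
# Sketch of four register contexts with a fast switch between loaded threads
# (illustrative only). Switching to an unloaded thread means filling a
# context from memory first.
def new_context(thread=None):
    return {"thread": thread, "regs": [0] * 32, "psr": 0, "pc": 0, "npc": 4}

contexts = [new_context() for _ in range(4)]   # 3 for user threads, 1 for traps
cp = 0                                          # current context pointer

def switch_to(thread, saved_states):
    """Fast switch if the thread is loaded; otherwise fill a context from memory."""
    global cp
    for i in range(3):                          # user contexts only
        if contexts[i]["thread"] == thread:
            cp = i                              # fast path: thread is Loaded
            return "fast"
    victim = (cp + 1) % 3                       # evict another user context
    contexts[victim] = saved_states.get(thread, new_context(thread))
    contexts[victim]["thread"] = thread
    cp = victim
    return "loaded from memory"

contexts[0]["thread"], contexts[1]["thread"] = "T0", "T1"
print(switch_to("T1", {}))   # fast
print(switch_to("T2", {}))   # loaded from memory
```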
29
Sparcle Architecture
[Figure: four register contexts 0:R0–R31 … 3:R0–R31, each with its own PSR, PC and nPC; the context pointer (CP) selects the active thread]
30
MIT Alewife and Sparcle
[Figure: an Alewife node – Sparcle processor with FPU, 64-kbyte cache, main memory, CMMU and NR]
NR = network router
CMMU = communication & memory management unit
FPU = floating-point unit
32
Figure 16.10  Thread states in Sparcle
[Figure: loaded threads keep their process state (PSR/PC/nPC frames, register frames 0:R0–R31 … 3:R0–R31 and global registers G0–G7) in the hardware contexts, with CP pointing at the active thread; unloaded threads keep their state in memory on the ready and suspended queues]
33
Figure 16.11  Structure of a typical static dataflow PE
[Figure: activity store, fetch unit, instruction queue, function units 1..N and an update unit, connected to/from other PEs]
34
Figure 16.12  Structure of a typical tagged-token dataflow PE
[Figure: matching unit with matching store, fetch unit, instruction/data memory, token queue, function units 1..N and an update unit sending tokens to other PEs]
35
Figure 16.13  Organization of the I-structure storage
[Figure: data storage slots k..k+4, each with presence bits (A = Absent, P = Present, W = Waiting); a waiting slot holds a deferred-read list of tags (tag X, tag Y, tag Z) instead of a datum]
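A sketch of the I-structure protocol the figure shows (my own illustration): a slot is Absent, Present or Waiting; a read of an absent slot is deferred on a tag list, and the (write-once) write releases all deferred reads.

```python
# Sketch of I-structure semantics with presence bits (illustrative only).
class IStructure:
    def __init__(self, size):
        self.state = ["A"] * size      # A = absent, P = present, W = waiting
        self.data = [None] * size      # datum, or list of deferred read tags

    def read(self, k, tag, deliver):
        if self.state[k] == "P":
            deliver(tag, self.data[k])
        elif self.state[k] == "A":
            self.state[k], self.data[k] = "W", [tag]     # first deferred reader
        else:                                            # already waiting
            self.data[k].append(tag)

    def write(self, k, value, deliver):
        waiting = self.data[k] if self.state[k] == "W" else []
        self.state[k], self.data[k] = "P", value         # write-once
        for tag in waiting:
            deliver(tag, value)

deliver = lambda tag, value: print(f"token <{value}> sent to {tag}")
istore = IStructure(5)
istore.read(2, "tag X", deliver)     # deferred: slot 2 is still absent
istore.write(2, 99, deliver)         # releases the deferred read
istore.read(2, "tag Y", deliver)     # immediate: slot 2 is now present
```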
36
Figure 16.14  Coding in explicit token-store architectures, parts (a) and (b)
[Figure: input tokens <12, <FP, IP>> and <35, <FP, IP>> make the (-) instruction fire, producing result tokens <23, <FP, IP+1>> and <23, <FP, IP+2>> for the (+) and (*) instructions]
37
Figure 16.14  Coding in explicit token-store architectures, part (c)
[Figure: instruction memory (indexed by IP) holds SUB, ADD and MUL entries with frame offsets (2, 3, 4) and destination offsets; frame memory slots FP+2 .. FP+4 each carry a presence bit and hold the operand value (e.g. 35, then 23) that waits for its partner before the instruction fires]
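A rough sketch of the explicit token-store matching rule the figure illustrates (simplified, my own Python rendering): a token <value, <FP, IP>> checks the presence bit of its frame slot, waits there if it is the first operand, and fires the instruction when the partner arrives.

```python
# Rough sketch of explicit token-store matching (illustrative, simplified).
# A token is <value, <FP, IP>>. Each instruction names an operation, a frame
# offset r used for matching, and destination instructions for its result.
import operator

instructions = {                 # IP -> (op, frame offset r, destination IPs)
    0: (operator.sub, 2, [2]),   # SUB  matches in frame slot FP+2
    1: (operator.add, 3, [2]),   # ADD  matches in frame slot FP+3
    2: (operator.mul, 4, []),    # MUL  matches in frame slot FP+4, result is R
}
frame = {}                       # (FP + r) -> waiting operand (presence bit set)

def arrive(value, fp, ip):
    op, r, dests = instructions[ip]
    slot = fp + r
    if slot not in frame:              # presence bit clear: store and wait
        frame[slot] = value
        return
    partner = frame.pop(slot)          # presence bit set: fire the instruction
    result = op(partner, value)        # note: assumes the left operand arrived first
    if not dests:
        print("R =", result)
    for d in dests:
        arrive(result, fp, d)

# R = (A - B) * (B + 1) with A = 7, B = 3, frame pointer FP = 100
A, B, FP = 7, 3, 100
arrive(A, FP, 0); arrive(B, FP, 0)     # operands of SUB
arrive(B, FP, 1); arrive(1, FP, 1)     # operands of ADD -> fires MUL -> R = 16
```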
38
Figure 16.15  Structure of a typical explicit token-store dataflow PE
[Figure: fetch unit, effective-address generation, presence-bit check and frame-store operation on the frame memory, function units 1..N, and form-token units exchanging tokens with other PEs]
39
Figure 16.16  Scale of von Neumann/dataflow architectures
[Figure: a spectrum from pure dataflow, through macro dataflow, decoupled hybrid dataflow and RISC-like hybrid, to pure von Neumann]
40
Figure 16.17  Structure of a typical macro dataflow PE
[Figure: matching unit, fetch unit, instruction/frame memory and token queue feeding a function unit that forms an internal control pipeline (program counter-based sequential execution), with a form-token unit connected to/from other PEs]
41
Figure 16.18  Organization of a PE in the MIT Hybrid Machine
[Figure: an enabled-continuation (token) queue feeds a pipeline of instruction fetch (PC, FBR), decode unit, operand fetch and execution unit, with registers, instruction memory and frame memory, connected to/from global memory]
42
Figure 16.19  Comparison of (a) SQ and (b) SCB macro nodes
[Figure: the same dataflow graph (nodes l1–l6 with inputs a, b, c) partitioned into macro nodes SQ1 and SQ2 in (a) and into macro nodes SCB1 and SCB2 in (b)]
43
Figure 16.20  Structure of the USC Decoupled Architecture
[Figure: clusters (Cluster 0, …), each with a cluster graph memory and GC/DFGE on the graph side and CC/CE on the computation side, linked by AQ and RQ queues; the graph side connects to the network in graph virtual space and the computation side in computation virtual space]
44
Figure 16.21  Structure of a node in the SAM
[Figure: main memory with APU, LEU, ASU and SEU units exchanging fire/done signals, connected to/from the network]
45
Figure 16.22  Structure of the P-RISC processing element
[Figure: a token queue feeds an internal control pipeline (conventional RISC processor: instruction fetch, operand fetch, function unit, operand store), with instruction memory, frame memory and local memory; Start and Load/Store messages travel to/from other PEs' memories]
46
Figure 16.23  Transformation of dataflow graphs into control flow graphs: (a) dataflow graph, (b) control flow graph
[Figure: a dataflow graph of (*), (-) and (+) nodes becomes a control flow graph in which a fork L1 starts a second thread and explicit joins synchronize the (+), (*) and (-) operations]
47
Figure 16.24  Structure of a *T node
[Figure: data processor (dIP, dFP, dV1, dV2), synchronization coprocessor (sIP, sFP, sV1, sV2) and remote memory request coprocessor share the local memory; a continuation queue of <IP, FP> pairs, message queues and a message formatter connect the node to the network through the network interface]