Multi-threaded Architectures - Sima, Fountain and Kacsuk, Chapter 16 (CSE462)
2
Slides by David Abramson, 2004. Material from Sima, Fountain and Kacsuk, Addison-Wesley, 1997.
Memory and Synchronization Latency
Scalability of a system is limited by its ability to handle memory latency and algorithmic synchronization delays
Overall solution is well known
– Do something else whilst waiting
Remote memory accesses
– Much slower than local accesses
– Varying delay depending on
  • Network traffic
  • Memory traffic
3
Processor Utilization
Utilization = P / T
• P = time spent processing
• T = total time
Equivalently, utilization = P / (P + I + S)
• I = time spent waiting on other tasks
• S = time spent switching tasks
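A quick worked example of the utilization formula (the numbers are illustrative, not from the slides):

```python
# Illustrative numbers (not from the slides).
P = 60.0   # time spent processing
I = 30.0   # time spent waiting on other tasks
S = 10.0   # time spent switching tasks

T = P + I + S              # total time
utilization = P / T        # equivalently P / (P + I + S)
print(f"Utilization = {utilization:.2f}")   # Utilization = 0.60
```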
4
Basic ideas - Multithreading
Fine Grain – task switch every cycle
Coarse Grain – task switch every n cycles
[Figure: timing diagrams for fine- and coarse-grain multithreading, showing blocked periods and task-switch overhead]
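A minimal scheduling sketch of the two policies (my own illustration, not from the book): fine grain rotates threads every cycle, coarse grain runs one thread until it blocks and then pays a switch overhead.

```python
# Illustrative scheduling sketch (not from the book).
# Each thread is a list of operations: 'c' = compute cycle,
# 'm' = long-latency access on which the thread would block.

def fine_grain(threads):
    """Switch threads every cycle."""
    pending = [list(t) for t in threads]
    trace, i = [], 0
    while any(pending):
        t = pending[i % len(pending)]
        if t:
            trace.append((i % len(pending), t.pop(0)))
        i += 1
    return trace

def coarse_grain(threads, switch_cost=2):
    """Run one thread until it blocks, then pay a task-switch overhead."""
    pending = [list(t) for t in threads]
    trace, i = [], 0
    while any(pending):
        t = pending[i % len(pending)]
        while t and t[0] == 'c':
            trace.append((i % len(pending), t.pop(0)))
        if t and t[0] == 'm':                 # blocked: switch threads
            t.pop(0)
            trace.extend([('switch', '-')] * switch_cost)
        i += 1
    return trace

threads = [['c', 'c', 'm', 'c'], ['c', 'm', 'c', 'c']]
print(fine_grain(threads))
print(coarse_grain(threads))
```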
5
Design Space
Multi-threaded architectures – four dimensions:
Computational model
– Von Neumann (sequential control flow)
– Hybrid von Neumann/dataflow
  • Parallel control flow based on parallel control operators
  • Parallel control flow based on control tokens
Granularity
– Fine grain
– Coarse grain
Memory organization
– Physical shared memory
– Distributed shared memory
– Cache-coherent distributed shared memory
Number of threads per processor
– Small (4 – 10)
– Middle (10 – 100)
– Large (over 100)
6
Classification of multi-threaded architectures
Multi-threaded architectures
– Von Neumann based architectures
  • HEP
  • Tera
  • MIT Alewife & Sparcle
– Hybrid von Neumann/dataflow architectures
  • RISC-like: P-RISC, *T
  • Decoupled: USC, McGill MGDA & SAM
  • Macro dataflow: MIT Hybrid Machine, EM-4
8
Sequential control flow (von Neumann)
Flow of control and data separated
Executed sequentially (or at least with sequential semantics – see Chapter 7)
Control flow changed with JUMP/GOTO/CALL instructions
Data stored in rewritable memory
– Flow of data does not affect execution order
9
Sequential Control Flow Model
R = (A - B) * (B + 1)
L1:  m1 := A - B
L2:  m2 := B + 1
L3:  R := m1 * m2
(control flow passes sequentially from L1 to L3)
10
Dataflow
Control is tied to data
An instruction "fires" when its data is available
– Otherwise it is suspended
Order of instructions in the program has no effect on execution order
– Cf. von Neumann
No shared rewritable memory
– Write-once semantics
Code is stored as a dataflow graph
Data is transported as tokens
Parallelism occurs if multiple instructions can fire at the same time
– Needs a parallel processor
Nodes are self-scheduling
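A minimal sketch of the firing rule for R = (A - B) * (B + 1); the node names, token format and Python representation are my own illustration, not the book's notation.

```python
# Minimal dataflow-firing sketch for R = (A - B) * (B + 1).
# Node names and token format are illustrative, not from the book.
import operator

# Each node: (operation, number of inputs, list of (target node, operand slot))
graph = {
    'sub': (operator.sub, 2, [('mul', 0)]),
    'add': (operator.add, 2, [('mul', 1)]),
    'mul': (operator.mul, 2, [('R', 0)]),
}
operands = {name: {} for name in graph}   # arrived tokens, per node
results = {}

def send(node, slot, value):
    """Deliver a token; fire the node once all of its operands have arrived."""
    if node == 'R':
        results['R'] = value
        return
    op, arity, targets = graph[node]
    operands[node][slot] = value
    if len(operands[node]) == arity:       # node becomes fireable
        out = op(operands[node][0], operands[node][1])
        for tgt, tgt_slot in targets:
            send(tgt, tgt_slot, out)

# Initial tokens: A, B and the constant 1. Arrival order does not matter.
A, B = 7, 3
send('add', 1, 1)      # constant 1
send('sub', 0, A)
send('add', 0, B)
send('sub', 1, B)
print(results['R'])    # (7 - 3) * (3 + 1) = 16
```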
11
Dataflow – arbitrary execution order
R = (A - B) * (B + 1)
[Figure: dataflow graph with (-), (+) and (*) nodes; inputs A, B and the constant 1; output R]
13
Dataflow – Parallel Execution
R = (A - B) * (B + 1)
[Figure: the same dataflow graph; the (-) and (+) nodes can fire in parallel]
14
Implementation
The dataflow model requires a very different execution engine
Data must be stored in a special matching store
Instructions must be triggered when both operands are available
Parallel operations must be scheduled to processors dynamically
– Don't know a priori when they will be available
Instruction operands are pointers
– To an instruction
– To an operand number
15
Dataflow model of execution
[Figure: dataflow code for R = (A - B) * (B + 1); each operand is addressed as instruction/operand-number, e.g. L1 computes A and B and sends them to L2/2 and L3/1, the (-) at L2 sends its result to L4/1, the (+) at L3 (with constant 1) sends its result to L4/2, and the (*) at L4 sends its result to L6/1]
16
Parallel Control flow
Sometimes called macro dataflow
– Data flows between blocks of sequential code
– Has the advantages of both dataflow and von Neumann
  • Context switch overhead reduced
  • Compiler can schedule instructions statically
  • Don't need a fast matching store
Requires additional control instructions
– FORK/JOIN (see the sketch below)
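A sketch of the FORK/JOIN idea using ordinary Python threads (illustrative only): the two sequential blocks run concurrently and join before the multiply.

```python
# Sketch of fork/join macro dataflow for R = (A - B) * (B + 1) (illustrative).
import threading

def compute(A, B):
    m = {}

    def block1():            # L2: m1 := A - B
        m['m1'] = A - B

    def block2():            # L4: m2 := B + 1
        m['m2'] = B + 1

    t = threading.Thread(target=block2)   # FORK L4
    t.start()
    block1()
    t.join()                              # JOIN 2
    return m['m1'] * m['m2']              # L6: R := m1 * m2

print(compute(7, 3))   # (7 - 3) * (3 + 1) = 16
```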
17
Macro Dataflow (Hybrid Control/Dataflow)
R = (A - B) * (B + 1)
L1:  FORK L4
L2:  m1 := A - B
L3:  GOTO L5
L4:  m2 := B + 1
L5:  JOIN 2
L6:  R := m1 * m2
(two sequential control flows run between the FORK and the JOIN)
18
Issues for Hybrid dataflow
Blocks of sequential instructions need to be large enough to absorb overheads of context switching
Data memory is the same as in an MIMD machine
– Can be partitioned or shared
– Synchronization instructions required
  • Semaphores, test-and-set (see the sketch below)
Control tokens are required to synchronize threads.
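A minimal sketch of busy-wait synchronization built on test-and-set (my own illustration; a Python lock stands in for the atomic instruction that real hardware provides):

```python
# Sketch of a spin lock built on an atomic test-and-set (illustrative only).
import threading

class TestAndSetLock:
    def __init__(self):
        self._flag = False
        self._atomic = threading.Lock()   # stands in for hardware atomicity

    def test_and_set(self):
        """Atomically set the flag and return its previous value."""
        with self._atomic:
            old, self._flag = self._flag, True
            return old

    def acquire(self):
        while self.test_and_set():        # spin until the old value was False
            pass

    def release(self):
        self._flag = False

lock, counter = TestAndSetLock(), [0]

def worker():
    for _ in range(1000):
        lock.acquire()
        counter[0] += 1
        lock.release()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter[0])   # 4000
```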
20
Denelcor HEP
Designed to tolerate memory latency
Fine-grain interleaving of threads
Processor pipeline contains 8 stages
Each time step a new thread enters the pipeline
Threads are taken from the Process Status Word (PSW) queue
After a thread is taken from the PSW queue, its instruction and operands are fetched
When an instruction is executed, another one is placed on the PSW queue
Threads are interleaved at the instruction level.
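A rough sketch of the PSW-queue interleaving (an illustration, not the real HEP microarchitecture): each cycle the thread at the head of the queue issues one instruction and is requeued.

```python
# Rough sketch of HEP-style interleaving (illustrative, not the real machine).
from collections import deque

psw_queue = deque([
    ('T0', ['i0', 'i1', 'i2']),
    ('T1', ['j0', 'j1']),
    ('T2', ['k0', 'k1', 'k2']),
])

cycle = 0
while psw_queue:
    name, instrs = psw_queue.popleft()
    print(f"cycle {cycle}: {name} issues {instrs[0]}")
    cycle += 1
    if instrs[1:]:
        psw_queue.append((name, instrs[1:]))   # thread re-enters the queue
```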
21
Denelcor HEP
Memory latency tolerance is handled by the Scheduler Function Unit (SFU)
Memory words are tagged as full or empty
Attempting to read an empty word suspends the current thread
– The current PSW entry is moved to the SFU
When the data is written, the entry is taken from the SFU and placed back on the PSW queue.
22
Synchronization on the HEP
All registers have a Full/Empty/Reserved bit
Reading an empty register causes the thread to be placed back on the PSW queue without updating its program counter
Thread synchronization is busy-wait
– But other threads can run
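A sketch of the full/empty-bit idea (illustrative; the HEP does this in hardware): reading an empty cell just puts the reading thread back on the queue, so the wait is a busy-wait but other threads keep running.

```python
# Sketch of full/empty-bit synchronization (illustrative, not HEP hardware).
from collections import deque

class Cell:
    def __init__(self):
        self.full, self.value = False, None

    def write(self, v):
        self.value, self.full = v, True

    def read(self):
        return self.value if self.full else None   # None => "suspend and retry"

cell = Cell()

def producer():
    yield "compute"
    cell.write(42)
    yield "wrote 42"

def consumer():
    while True:
        v = cell.read()
        if v is None:
            yield "cell empty, requeued"        # thread goes back on the queue
        else:
            yield f"read {v}"
            return

ready = deque([("producer", producer()), ("consumer", consumer())])
while ready:
    name, thread = ready.popleft()
    try:
        print(name, "->", next(thread))
        ready.append((name, thread))
    except StopIteration:
        pass
```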
23
HEP Architecture
[Figure: HEP processor organization – PSW queue, increment control, program memory, registers, operand fetch with operand hands 1 and 2, matching unit, function units 1..N, and the SFU connecting to/from data memory]
24
HEP configuration
Up to 16 processors
Up to 128 data memories
Connected by a high-speed switch
Limitations
– Threads can have only 1 outstanding memory request
– Thread synchronization puts bubbles in the pipeline
– Maximum of 64 threads, causing problems for software
  • Need to throttle loops
– If parallelism is lower than 8, full utilisation is not possible.
25
MIT Alewife Processor
512 processors in a 2-dimensional mesh
Sparcle processor
Physically distributed memory
Logically shared memory
Hardware-supported cache coherence
Hardware-supported user-level message passing
Multi-threading
26
Threading in Alewife
Coarse-grained multithreading
Pipeline works on a single thread as long as no remote memory access or synchronization is required
Can exploit register optimization in the pipeline
Integration of multi-threading with hardware-supported cache coherence
27
The Sparcle Processor
Extension of the Sun SPARC architecture
Tolerant of memory latency
Fine-grained synchronisation
Efficient user-level message passing
28
Fast context switching
SPARC has 8 overlapping register windows
Used in Sparcle in pairs to represent 4 independent, non-overlapping contexts
– Three for user threads
– One for traps and message handlers
Each context contains 32 general-purpose registers and
– PSR (Processor State Register)
– PC (Program Counter)
– nPC (next Program Counter)
Thread states
– Active
– Loaded
  • State stored in registers – can become active
– Ready
  • Not suspended and not loaded
– Suspended
Thread switching
– Fast if one thread is active and the other is loaded (see the sketch below)
– Need to flush the pipeline (cf. HEP)
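A data-structure sketch of the four contexts and the fast/slow switch paths (my own illustration; the names and the eviction policy are invented, not Sparcle's):

```python
# Sketch of four register contexts with a fast switch between loaded threads
# (illustrative only). Switching to an unloaded thread means filling a
# context from memory first.
def new_context(thread=None):
    return {"thread": thread, "regs": [0] * 32, "psr": 0, "pc": 0, "npc": 4}

contexts = [new_context() for _ in range(4)]   # 3 for user threads, 1 for traps
cp = 0                                          # current context pointer

def switch_to(thread, saved_states):
    """Fast switch if the thread is loaded; otherwise fill a context from memory."""
    global cp
    for i in range(3):                          # user contexts only
        if contexts[i]["thread"] == thread:
            cp = i                              # fast path: thread is Loaded
            return "fast"
    victim = (cp + 1) % 3                       # evict another user context
    contexts[victim] = saved_states.get(thread, new_context(thread))
    contexts[victim]["thread"] = thread
    cp = victim
    return "loaded from memory"

contexts[0]["thread"], contexts[1]["thread"] = "T0", "T1"
print(switch_to("T1", {}))   # fast
print(switch_to("T2", {}))   # loaded from memory
```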
29
Sparcle Architecture
[Figure: four register contexts 0:R0–R31 … 3:R0–R31, each with its own PSR, PC and nPC; the context pointer (CP) selects the active thread]
30
MIT Alewife and Sparcle
[Figure: an Alewife node – Sparcle processor with FPU, 64-kbyte cache, main memory, CMMU and NR]
NR = network router
CMMU = communication & memory management unit
FPU = floating-point unit
32
Figure 16.10  Thread states in Sparcle
[Figure: loaded threads keep their process state (PSR/PC/nPC frames, register frames 0:R0–R31 … 3:R0–R31 and global registers G0–G7) in the hardware contexts, with CP pointing at the active thread; unloaded threads keep their state in memory on the ready and suspended queues]
33
Figure 16.11  Structure of a typical static dataflow PE
[Figure: activity store, fetch unit, instruction queue, function units 1..N and an update unit, connected to/from other PEs]
34
Figure 16.12  Structure of a typical tagged-token dataflow PE
[Figure: matching unit with matching store, fetch unit, instruction/data memory, token queue, function units 1..N and an update unit sending tokens to other PEs]
35
Figure 16.13  Organization of the I-structure storage
[Figure: data storage slots k..k+4, each with presence bits (A = Absent, P = Present, W = Waiting); a waiting slot holds a deferred-read list of tags (tag X, tag Y, tag Z) instead of a datum]
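A sketch of the I-structure protocol the figure shows (my own illustration): a slot is Absent, Present or Waiting; a read of an absent slot is deferred on a tag list, and the (write-once) write releases all deferred reads.

```python
# Sketch of I-structure semantics with presence bits (illustrative only).
class IStructure:
    def __init__(self, size):
        self.state = ["A"] * size      # A = absent, P = present, W = waiting
        self.data = [None] * size      # datum, or list of deferred read tags

    def read(self, k, tag, deliver):
        if self.state[k] == "P":
            deliver(tag, self.data[k])
        elif self.state[k] == "A":
            self.state[k], self.data[k] = "W", [tag]     # first deferred reader
        else:                                            # already waiting
            self.data[k].append(tag)

    def write(self, k, value, deliver):
        waiting = self.data[k] if self.state[k] == "W" else []
        self.state[k], self.data[k] = "P", value         # write-once
        for tag in waiting:
            deliver(tag, value)

deliver = lambda tag, value: print(f"token <{value}> sent to {tag}")
istore = IStructure(5)
istore.read(2, "tag X", deliver)     # deferred: slot 2 is still absent
istore.write(2, 99, deliver)         # releases the deferred read
istore.read(2, "tag Y", deliver)     # immediate: slot 2 is now present
```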
36
Figure 16.14  Coding in explicit token-store architectures, parts (a) and (b)
[Figure: input tokens <12, <FP, IP>> and <35, <FP, IP>> make the (-) instruction fire, producing result tokens <23, <FP, IP+1>> and <23, <FP, IP+2>> for the (+) and (*) instructions]
37
Figure 16.14  Coding in explicit token-store architectures, part (c)
[Figure: instruction memory (indexed by IP) holds SUB, ADD and MUL entries with frame offsets (2, 3, 4) and destination offsets; frame memory slots FP+2 .. FP+4 each carry a presence bit and hold the operand value (e.g. 35, then 23) that waits for its partner before the instruction fires]
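A rough sketch of the explicit token-store matching rule the figure illustrates (simplified, my own Python rendering): a token <value, <FP, IP>> checks the presence bit of its frame slot, waits there if it is the first operand, and fires the instruction when the partner arrives.

```python
# Rough sketch of explicit token-store matching (illustrative, simplified).
# A token is <value, <FP, IP>>. Each instruction names an operation, a frame
# offset r used for matching, and destination instructions for its result.
import operator

instructions = {                 # IP -> (op, frame offset r, destination IPs)
    0: (operator.sub, 2, [2]),   # SUB  matches in frame slot FP+2
    1: (operator.add, 3, [2]),   # ADD  matches in frame slot FP+3
    2: (operator.mul, 4, []),    # MUL  matches in frame slot FP+4, result is R
}
frame = {}                       # (FP + r) -> waiting operand (presence bit set)

def arrive(value, fp, ip):
    op, r, dests = instructions[ip]
    slot = fp + r
    if slot not in frame:              # presence bit clear: store and wait
        frame[slot] = value
        return
    partner = frame.pop(slot)          # presence bit set: fire the instruction
    result = op(partner, value)        # note: assumes the left operand arrived first
    if not dests:
        print("R =", result)
    for d in dests:
        arrive(result, fp, d)

# R = (A - B) * (B + 1) with A = 7, B = 3, frame pointer FP = 100
A, B, FP = 7, 3, 100
arrive(A, FP, 0); arrive(B, FP, 0)     # operands of SUB
arrive(B, FP, 1); arrive(1, FP, 1)     # operands of ADD -> fires MUL -> R = 16
```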
38
Figure 16.15  Structure of a typical explicit token-store dataflow PE
[Figure: fetch unit, effective-address generation, presence-bit check and frame-store operation on the frame memory, function units 1..N, and form-token units exchanging tokens with other PEs]
39
Figure 16.16  Scale of von Neumann/dataflow architectures
[Figure: a spectrum from pure dataflow, through macro dataflow, decoupled hybrid dataflow and RISC-like hybrid, to pure von Neumann]
40
Figure 16.17  Structure of a typical macro dataflow PE
[Figure: matching unit, fetch unit, instruction/frame memory and token queue feeding a function unit that forms an internal control pipeline (program counter-based sequential execution), with a form-token unit connected to/from other PEs]
41
Figure 16.18  Organization of a PE in the MIT Hybrid Machine
[Figure: an enabled-continuation (token) queue feeds a pipeline of instruction fetch (PC, FBR), decode unit, operand fetch and execution unit, with registers, instruction memory and frame memory, connected to/from global memory]
42
Figure 16.19  Comparison of (a) SQ and (b) SCB macro nodes
[Figure: the same dataflow graph (nodes l1–l6 with inputs a, b, c) partitioned into macro nodes SQ1 and SQ2 in (a) and into macro nodes SCB1 and SCB2 in (b)]
43
Figure 16.20  Structure of the USC Decoupled Architecture
[Figure: clusters (Cluster 0, …), each with a cluster graph memory and GC/DFGE on the graph side and CC/CE on the computation side, linked by AQ and RQ queues; the graph side connects to the network in graph virtual space and the computation side in computation virtual space]
44
Figure 16.21  Structure of a node in the SAM
[Figure: main memory with APU, LEU, ASU and SEU units exchanging fire/done signals, connected to/from the network]
45
Figure 16.22  Structure of the P-RISC processing element
[Figure: a token queue feeds an internal control pipeline (conventional RISC processor: instruction fetch, operand fetch, function unit, operand store), with instruction memory, frame memory and local memory; Start and Load/Store messages travel to/from other PEs' memories]
46
Figure 16.23  Transformation of dataflow graphs into control flow graphs: (a) dataflow graph, (b) control flow graph
[Figure: a dataflow graph of (*), (-) and (+) nodes becomes a control flow graph in which a fork L1 starts a second thread and explicit joins synchronize the (+), (*) and (-) operations]
47
Figure 16.24  Structure of a *T node
[Figure: data processor (dIP, dFP, dV1, dV2), synchronization coprocessor (sIP, sFP, sV1, sV2) and remote memory request coprocessor share the local memory; a continuation queue of <IP, FP> pairs, message queues and a message formatter connect the node to the network through the network interface]