Advanced Computer Architecture 5MD00 / 5Z033 Multi-Processing Henk Corporaal www.ics.ele.tue.nl/ ~heco/courses/aca [email protected] TUEindhoven 2012


Page 1

Advanced Computer Architecture5MD00 / 5Z033

Multi-Processing

Henk Corporaalwww.ics.ele.tue.nl/~heco/courses/aca

[email protected]

2012

Page 2

04/21/23 ACA H.Corporaal 2

Trends:
• #transistors follows Moore's law
• but not frequency and performance/core

[Figure: technology scaling trends over time; annotations: Core i7, 3 GHz, 100 W. Crisis?]

Page 3

Topics
• Flynn's taxonomy
• Why parallel processors
• Communication models
• Challenge of parallel processing
• Coherence problem
• Consistency problem
• Synchronization

• Book: Chapter 4, appendix E, H

Page 4

Flynn's Taxonomy
• SISD (Single Instruction, Single Data)
  – Uniprocessors
• SIMD (Single Instruction, Multiple Data)
  – Vector architectures also belong to this class
  – Multimedia extensions (MMX, SSE, VIS, AltiVec, …)
  – Examples: Illiac-IV, CM-2, MasPar MP-1/2, Xetal, IMAP, Imagine, GPUs, …
• MISD (Multiple Instruction, Single Data)
  – Systolic arrays / stream-based processing
• MIMD (Multiple Instruction, Multiple Data)
  – Examples: Sun Enterprise 5000, Cray T3D/T3E, SGI Origin
  – Flexible
  – Most widely used

Compare with the classification presented earlier!

Page 5

Flynn's taxonomy

[Figure: the four classes as block diagrams of instruction memories, data memories, and processing elements (PE) — SISD: one instruction memory, one PE, one data memory; SIMD: one instruction memory driving multiple PEs, each with its own data memory; MISD: multiple instruction memories and PEs on a single data stream; MIMD: multiple independent instruction memories, PEs, and data memories.]

Page 6

Why parallel processing
• Performance drive
• Diminishing returns for exploiting ILP and OLP
• Multiple processors fit easily on a chip
• Cost effective (just connect existing processors or processor cores)
• Low power: parallelism may allow lowering Vdd

However:
• Parallel programming is hard

Page 7

Low power through parallelism
• Sequential processor
  – Switching capacitance C
  – Frequency f
  – Voltage V
  – P1 = f·C·V²
• Parallel processor (two times the number of units)
  – Switching capacitance 2C
  – Frequency f/2
  – Voltage V' < V
  – P2 = (f/2)·2C·V'² = f·C·V'² < P1

[Figure: one CPU vs. two CPUs (CPU1, CPU2) running at half the clock.]

Page 8

Parallel Architecture
• Parallel architecture extends traditional computer architecture with a communication network
  – abstractions (HW/SW interface)
  – organizational structure to realize the abstraction efficiently

[Figure: several processing nodes connected by a communication network.]

Page 9

Communication models: Shared Memory

[Figure: processes P1 and P2 both read and write a shared memory.]

• Coherence problem
• Memory consistency issue
• Synchronization problem

Page 10

Communication models: Shared memory
• Shared address space
• Communication primitives:
  – load, store, atomic swap

Two varieties:
• Physically shared => Symmetric Multi-Processors (SMP)
  – usually combined with local caching
• Physically distributed => Distributed Shared Memory (DSM)

Page 11

SMP: Symmetric Multi-Processor
• Memory: centralized with uniform access time (UMA), bus interconnect, I/O
• Examples: Sun Enterprise 6000, SGI Challenge, Intel

[Figure: several processors, each with one or more cache levels, connected to main memory and the I/O system; the interconnect can be 1 bus, N busses, or any network.]

Page 12

DSM: Distributed Shared Memory
• Nonuniform access time (NUMA) and scalable interconnect (distributed memory)

[Figure: processor + cache + memory nodes connected by an interconnection network, plus an I/O system.]

Page 13

Shared Address Model Summary
• Each processor can name every physical location in the machine
• Each process can name all data it shares with other processes
• Data transfer via load and store
• Data size: byte, word, ... or cache blocks
• Memory hierarchy model applies:
  – communication moves data to the local processor cache

Page 14

Communication models: Message Passing
• Communication primitives
  – e.g., send, receive library calls
  – standard MPI: Message Passing Interface
    • www.mpi-forum.org
• Note that MP can be built on top of SM and vice versa!

[Figure: processes P1 and P2 exchanging send/receive pairs through FIFOs.]

Page 15

Message Passing Model
• Explicit message send and receive operations
• Send specifies local buffer + receiving process on the remote computer
• Receive specifies sending process on the remote computer + local buffer to place the data
• Typically blocking communication, but may use DMA

[Figure: message structure — header, data, trailer.]

Page 16

Message passing communication

[Figure: processor + cache + memory nodes, each with a DMA engine and a network interface, connected by an interconnection network.]

Page 17

Communication Models: Comparison
• Shared Memory
  – Compatibility with well-understood (language) mechanisms
  – Ease of programming for complex or dynamic communication patterns
  – Shared-memory applications; sharing of large data structures
  – Efficient for small items
  – Supports hardware caching
• Message Passing
  – Simpler hardware
  – Explicit communication
  – Implicit synchronization (with any communication)

Page 18

Network: Performance metrics
• Network Bandwidth
  – Need high bandwidth in communication
  – How does it scale with the number of nodes?
• Communication Latency
  – Affects performance, since the processor may have to wait
  – Affects ease of programming, since it requires more thought to overlap communication and computation

How can a mechanism help hide latency?
• overlap message send with computation
• prefetch data
• switch to another task or thread
• pipeline tasks, or pipeline iterations

Page 19

Challenges of parallel processing
(check yourself: Amdahl's and Gustafson's laws)

Q1: can we get linear speedup?
Suppose we want speedup 80 with 100 processors. What fraction of the original computation can be sequential (i.e. non-parallel)?

Answer: from Amdahl's law, 80 = 1 / (fseq + (1 − fseq)/100), so fseq = 0.25%

Q2: how important is communication latency?
Suppose 0.2% of all accesses are remote, and require 100 cycles on a processor with base CPI = 0.5.

What's the communication impact?

Page 20

Three fundamental issues for shared memory multiprocessors
• Coherence: Do I see the most recent data?
• Consistency: When do I see a written value?
  – e.g. do different processors see writes at the same time (w.r.t. other memory accesses)?
• Synchronization: How to synchronize processes?
  – how to protect access to shared data?

Page 21

Coherence problem, in single CPU system

[Figure: a CPU with a cache (copies a', b') over a memory (locations a, b), plus I/O. Initially a' = a = 100 and b' = b = 200.
1) CPU writes 550 to a: the write-back cache holds a' = 550 while memory still holds a = 100 — not coherent.
2) I/O writes 440 to b in memory: memory holds b = 440 while the cache still holds b' = 200 — not coherent.]

Page 22

Coherence problem, in Multi-Proc system

[Figure: CPU-1's cache holds a' = 550 after writing a, while memory still holds a = 100 and CPU-2's cache holds the stale copy a'' = 100; b is consistent at 200 everywhere (b' = b'' = b = 200).]

Page 23

What Does Coherency Mean?
• Informally:
  – "Any read must return the most recent write (to the same address)"
  – Too strict and too difficult to implement
• Better:
  – A write followed by a read by the same processor P, with no writes in between, returns the value written
  – "Any write must eventually be seen by a read"
    • If P writes to X and P' reads X, then P' will see the value written by P if the read and write are sufficiently separated in time
  – Writes to the same location by different processors are seen in the same order by all processors ("serialization")
    • Suppose P1 writes location X, followed by P2 also writing to X. Without serialization, some processors would read the value of P1 and others that of P2. Serialization guarantees that all processors read the same sequence.

Page 24

Two rules to ensure coherency
• "If P1 writes x and P2 reads it, P1's write will be seen by P2 if the read and write are sufficiently far apart"
• Writes to a single location are serialized: seen in one order
  – The latest write will be seen
  – Otherwise writes could be seen in an illogical order (an older value after a newer value)

Page 25

Potential HW Coherency Solutions
• Snooping Solution (Snoopy Bus):
  – Send all requests for data to all processors (or local caches)
  – Processors snoop to see if they have a copy and respond accordingly
  – Requires broadcast, since the caching information is at the processors
    • Works well with a bus (natural broadcast medium)
  – Dominates for small scale machines (most of the market)
• Directory-Based Schemes
  – Keep track of what is being shared in one centralized place
  – Distributed memory => distributed directory for scalability (avoids bottlenecks, hot spots)
  – Scales better than snooping
  – Actually existed BEFORE snooping-based schemes

Page 26

Example Snooping protocol
• 3 states for each cache line:
  – invalid, shared (read only), modified (also called exclusive; you may write it)
• FSM per cache; gets requests from processor and bus

[Figure: four processors, each with a private cache, on a shared bus to main memory and the I/O system.]

Page 27

Snooping Protocol 1: Write Invalidate
• Get exclusive access to a cache block (invalidate all other copies) before writing it
• When a processor reads an invalid cache block it is forced to fetch a new copy
• If two processors attempt to write simultaneously, one of them is first (having a bus helps). The other one must obtain a new copy, thereby enforcing serialization

Example: address X in memory initially contains value '0'

Processor activity   | Bus activity       | Cache CPU A | Cache CPU B | Memory addr. X
                     |                    |             |             | 0
CPU A reads X        | Cache miss for X   | 0           |             | 0
CPU B reads X        | Cache miss for X   | 0           | 0           | 0
CPU A writes 1 to X  | Invalidation for X | 1           | invalidated | 0
CPU B reads X        | Cache miss for X   | 1           | 1           | 1

Page 28

Basics of Write Invalidate
• Use the bus to perform invalidates
  – To perform an invalidate, acquire bus access and broadcast the address to be invalidated
    • all processors snoop the bus, listening to addresses
    • if the address is in my cache, invalidate my copy
• Bus access enforces write serialization
• Where is the most recent value?
  – Easy for write-through caches: in the memory
  – For write-back caches, again use snooping
• Can use cache tags to implement snooping
  – Might interfere with cache accesses coming from the CPU
  – Duplicate the tags, or employ a multilevel cache with inclusion

[Figure: processor core with cache, connected via a bus to memory.]

Page 29

Snoopy-Cache State Machine-I
• State machine for CPU requests, for each cache block

[State diagram over Invalid, Shared (read-only), and Exclusive (read/write):
– Invalid → Shared: CPU read; place read miss on bus
– Invalid → Exclusive: CPU write; place write miss on bus
– Shared → Shared: CPU read hit; or CPU read miss (place read miss on bus)
– Shared → Exclusive: CPU write; place write miss on bus
– Exclusive → Exclusive: CPU read hit / CPU write hit; or CPU write miss (write back cache block, place write miss on bus)
– Exclusive → Shared: CPU read miss; write back block, place read miss on bus]

Page 30

Snoopy-Cache State Machine-II
• State machine for bus requests, for each cache block

[State diagram:
– Shared → Invalid: write miss for this block
– Exclusive → Invalid: write miss for this block; write back block (abort memory access)
– Exclusive → Shared: read miss for this block; write back block (abort memory access)]

Page 31

Response to Events Caused by the Processor

Event      | State of block in cache | Action
Read hit   | shared or exclusive     | read data from cache
Read miss  | invalid                 | place read miss on bus
Read miss  | shared                  | wrong block (conflict miss): place read miss on bus
Read miss  | exclusive               | conflict miss: write back block, then place read miss on bus
Write hit  | exclusive               | write data in cache
Write hit  | shared                  | place write miss on bus (invalidates all other copies)
Write miss | invalid                 | place write miss on bus
Write miss | shared                  | conflict miss: place write miss on bus
Write miss | exclusive               | conflict miss: write back block, then place write miss on bus

Page 32

Response to Events on the Bus

Event      | State of addressed cache block | Action
Read miss  | shared                         | No action: memory services the read miss
Read miss  | exclusive                      | Attempt to share data: place block on bus and change state to shared
Write miss | shared                         | Attempt to write: invalidate block
Write miss | exclusive                      | Another processor attempts to write "my" block: write back the block and invalidate it

Page 33

Snooping protocol 2: Write update/broadcast
• Update all cached copies. To keep bandwidth requirements under control, we need to track whether words are shared or private

Example: address X in memory initially contains value '0'

Processor activity   | Bus activity          | Cache CPU A | Cache CPU B | Memory addr. X
                     |                       |             |             | 0
CPU A reads X        | Cache miss for X      | 0           |             | 0
CPU B reads X        | Cache miss for X      | 0           | 0           | 0
CPU A writes 1 to X  | Write broadcast for X | 1           | 1           | 1
CPU B reads X        |                       | 1           | 1           | 1

Page 34

Qualitative Performance Differences
• Compare write invalidate versus write update:
  – Multiple writes to the same word require:
    • multiple write broadcasts in a write update protocol
    • one invalidation in write invalidate
  – When a cache block contains multiple words, each word written to the block requires:
    • multiple write broadcasts in a write update protocol
    • one invalidation in write invalidate
    • write invalidate works on cache blocks, write update on words/bytes
  – The delay between writing a word on one processor and reading the new value on another is smaller with write update
• And the winner is?
  – write invalidate, because bus bandwidth is most precious

Page 35

True vs. False Sharing
• True sharing: the word(s) being read is (are) the same as the word(s) being written
• False sharing: the word being read is different from the word being written, but they are in the same cache block

Assume words X1 and X2 share a cache block, initially cached by both P1 and P2:

Time | P1       | P2       | Comment
1    | Write X1 |          | True sharing miss; invalidation required in P2
2    |          | Read X2  | False sharing miss, since X2 was invalidated by P1's write of X1; the cache block is now in shared state
3    | Write X1 |          | False sharing miss, since the block is shared again after P2 read it
4    |          | Write X2 | False sharing miss, since the block is invalid in P2 after P1's write of X1
5    | Read X2  |          | True sharing miss, since it reads X2, which was invalidated

Page 36

Why This Protocol Works
• Every valid cache block is either
  – in shared state in multiple caches, or
  – in exclusive state in one cache
• Transition to exclusive requires a write miss on the bus
  – this invalidates all other copies
  – if another cache has the block exclusively, that cache generates a write back!
• If a read miss occurs on the bus to an exclusive block, the owning cache makes its state shared
  – if the corresponding processor again wants to write, it needs to re-gain exclusive access

Page 37

Complication: Write Races
• Cannot update the cache until the bus is obtained
  – Otherwise, another processor may get the bus first, and then write the same cache block!
• Two step process:
  – Arbitrate for the bus
  – Place the miss on the bus and complete the operation
• If a miss occurs to the block while waiting for the bus, handle the miss (an invalidate may be needed) and then restart
• Split transaction bus:
  – A bus transaction is not atomic: there can be multiple outstanding transactions for a block
  – Multiple misses can interleave, allowing two caches to grab the block in the Exclusive state
  – Must track and prevent multiple misses for one block

Page 38

Implementation Simplifications
• We place a write miss on the bus even if we have a write hit to a shared block
  – ownership or upgrade misses
  – real protocols support invalidate operations
• Real protocols also distinguish between shared and private data
  – no need to get exclusive access to private data

Page 39

Three fundamental issues for shared memory multiprocessors
• Coherence: Do I see the most recent data?
• Synchronization: How to synchronize processes?
  – how to protect access to shared data?
• Consistency: When do I see a written value?
  – e.g. do different processors see writes at the same time (w.r.t. other memory accesses)?

Page 40

What's the Synchronization problem?
• Assume: the computer system of a bank has a credit process (P_c) and a debit process (P_d)

/* Process P_c */          /* Process P_d */
shared int balance;        shared int balance;
private int amount;        private int amount;

balance += amount;         balance -= amount;

lw  $t0,balance            lw  $t2,balance
lw  $t1,amount             lw  $t3,amount
add $t0,$t0,$t1            sub $t2,$t2,$t3
sw  $t0,balance            sw  $t2,balance

Page 41

Critical Section Problem
• n processes all competing to use some shared data
• Each process has a code segment, called the critical section, in which the shared data is accessed
• Problem: ensure that when one process is executing in its critical section, no other process is allowed to execute in its critical section
• Structure of a process:

while (TRUE) {
    entry_section();
    critical_section();
    exit_section();
    remainder_section();
}

Page 42

Attempt 1 – Strict Alternation

shared int turn;

/* Process P0 */              /* Process P1 */
while (TRUE) {                while (TRUE) {
    while (turn != 0);            while (turn != 1);
    critical_section();           critical_section();
    turn = 1;                     turn = 0;
    remainder_section();          remainder_section();
}                             }

Two problems:
• Satisfies mutual exclusion, but not progress (works only when both processes strictly alternate)
• Busy waiting

Page 43

Attempt 2 – Warning Flags

shared int flag[2];

/* Process P0 */              /* Process P1 */
while (TRUE) {                while (TRUE) {
    flag[0] = TRUE;               flag[1] = TRUE;
    while (flag[1]);              while (flag[0]);
    critical_section();           critical_section();
    flag[0] = FALSE;              flag[1] = FALSE;
    remainder_section();          remainder_section();
}                             }

• Satisfies mutual exclusion:
  – P0 in its critical section implies flag[0] && !flag[1]
  – P1 in its critical section implies !flag[0] && flag[1]
• However, it contains a deadlock (both flags may be set to TRUE!)

Page 44

Software solution: Peterson's Algorithm
(combining warning flags and alternation)

shared int flag[2];
shared int turn;

/* Process P0 */                 /* Process P1 */
while (TRUE) {                   while (TRUE) {
    flag[0] = TRUE;                  flag[1] = TRUE;
    turn = 0;                        turn = 1;
    while (turn==0 && flag[1]);      while (turn==1 && flag[0]);
    critical_section();              critical_section();
    flag[0] = FALSE;                 flag[1] = FALSE;
    remainder_section();             remainder_section();
}                                }

• The software solution is slow!
• Difficult to extend to more than 2 processes

Page 45

Hardware solution for Synchronization
• For large scale MPs, synchronization can be a bottleneck; techniques are needed to reduce the contention and latency of synchronization
• Hardware primitives needed
  – all solutions are based on "atomically inspect and update a memory location"
• Higher level synchronization solutions can be built on top

Page 46

Uninterruptable Instructions to Fetch and Update Memory
• Atomic exchange: interchange a value in a register for a value in memory
  – 0 => synchronization variable is free
  – 1 => synchronization variable is locked and unavailable
• Test-and-set: tests a value and sets it if the value passes the test
  – similar: Compare-and-swap
• Fetch-and-increment: returns the value of a memory location and atomically increments it
  – 0 => synchronization variable is free

Page 47

Build a 'spin-lock' using the exchange primitive
• Spin lock: the processor continuously tries to acquire, spinning around a loop trying to get the lock:

        LI   R2,#1      ;load immediate
lockit: EXCH R2,0(R1)   ;atomic exchange
        BNEZ R2,lockit  ;already locked?

• What about an MP with cache coherency?
  – Want to spin on a cached copy to avoid full memory latency
  – Likely to get cache hits for such variables
  – Problem: exchange includes a write, which invalidates all other copies; this generates considerable bus traffic
  – Solution: start by simply repeatedly reading the variable; when it changes, then try the exchange ("test and test&set"):

try:    LI   R2,#1      ;load immediate
lockit: LW   R3,0(R1)   ;load var
        BNEZ R3,lockit  ;not free => spin
        EXCH R2,0(R1)   ;atomic exchange
        BNEZ R2,try     ;already locked?

Page 48

Alternative to Fetch and Update
• Hard to have read & write in 1 instruction: use 2 instead
• Load Linked (or load locked) + Store Conditional
  – Load linked returns the initial value
  – Store conditional returns 1 if it succeeds (no other store to the same memory location since the preceding load) and 0 otherwise

• Example doing atomic swap with LL & SC:
try:    OR   R3,R4,R0   ; R3 = R4 (move)
        LL   R2,0(R1)   ; load linked
        SC   R3,0(R1)   ; store conditional
        BEQZ R3,try     ; branch if store fails (R3 = 0)

• Example doing fetch & increment with LL & SC:
try:    LL   R2,0(R1)   ; load linked
        ADDI R3,R2,#1   ; increment
        SC   R3,0(R1)   ; store conditional
        BEQZ R3,try     ; branch if store fails (R3 = 0)

Page 49

Three fundamental issues for shared memory multiprocessors
• Coherence: Do I see the most recent data?
• Synchronization: How to synchronize processes?
  – how to protect access to shared data?
• Consistency: When do I see a written value?
  – e.g. do different processors see writes at the same time (w.r.t. other memory accesses)?

Page 50

Memory Consistency

Memory consistency: when, and in which order, are writes observed?

OpenMP semantics:

Page 51

Memory Consistency: The Problem

Process P1            Process P2
A = 0;                B = 0;
...                   ...
A = 1;                B = 1;
L1: if (B==0) ...     L2: if (A==0) ...

• Observation: if writes take effect immediately (are immediately seen by all processors), it is impossible that both if-statements evaluate to true
• But what if the write invalidate is delayed?
  – Should this be allowed, and if so, under what conditions?

Page 52

Sequential Consistency

Lamport (1979): A multiprocessor is sequentially consistent if the result of any execution is the same as if the (memory) operations of all processors were executed in some sequential order, and the operations of each individual processor occur in this sequence in the order specified by its program

[Figure: processors P1–P4 all accessing a single memory.]

This means that all processors 'see' all loads and stores happening in the same order!

Page 53

How to implement Sequential Consistency
• Delay the completion of any memory access until all invalidations caused by that access are completed
• Delay the next memory access until the previous one is completed
  – delay the read of A and B (A==0 or B==0 in the example) until the write has finished (A=1 or B=1)
• Note: under sequential consistency, we cannot place the write in a write buffer and continue

Page 54

Sequential consistency overkill?
• Schemes for faster execution than sequential consistency
• Observation: most programs are synchronized
  – A program is synchronized if all accesses to shared data are ordered by synchronization operations
• Example:

P1: write(x) ... release(s) {unlock} ...
P2: acquire(s) {lock} ... read(x)

The write and the read are ordered by the release/acquire pair.

Page 55

Relaxed Memory Consistency Models
• Key: (partially) allow reads and writes to complete out-of-order
• Orderings that can be relaxed:
  – relax W→R ordering
    • allows reads to bypass earlier writes (to different memory locations)
    • called processor consistency or total store ordering
  – relax W→W
    • allows writes to bypass earlier writes
    • called partial store order
  – relax R→W and R→R
    • weak ordering, release consistency, Alpha, PowerPC
• Note, sequential consistency means: W→R, W→W, R→W and R→R are all ordered