Transcript
Sept-13, 2016 · CSCE 6610: Advanced Computer Architecture


Sept-13, 2016 1

Review

New Amdahl's law
A possible idea for a term project:

Explore my idea about changing frequency based on serial fraction to maintain fixed energy or keep the same execution time

Memory consistency – memory ordering, atomicity
Weak ordering
Use of locks or synchronization variables

memory fences/barriers

Store atomicity

Next Reading assignments (2):
Transactional memory coherence and consistency
Core Fusion

Sept-13, 2016 2


Dataflow to Synthesis, May 18, 2007 6

Potential violations of Serializability: Example 1

Thread 1      Thread 2
S x,1         S y,3
Fence         Fence
S y,2         S x,4
L y = 3       L x = 1?

[Figure: ordering graph over the operations S x,1; S y,2; L y; S y,3; S x,4; L x]

Rule 1: Predecessor Stores of a Load are ordered before its source


Sept-13, 2016 3


What if L y sees S y 2 in Thread 1?

Thread 1      Thread 2
S x,1         S y,3
Fence         Fence
S y,2         S x,4
L y = 2       L x = 1?

Now L x = 1 in Thread 2 is possible.

We can have the following order (other orders are also possible):
Th 2: S y,3
      Fence
      S x,4
Th 1: S x,1
      Fence
Th 2: L x = 1
Th 1: S y,2
      L y = 2

In other words, we need to know what is possible under different orders of stores

Sept-13, 2016 4


Since L y returned y = 3, it means L y in Th 1 happened after S y,3 in Th 2.

This automatically restricts the possible interleaving of instructions to the following:

It is not possible for L x in Th 2 to return a 1 (set by S x, 1 in Th 1)

Interleaving 1      Interleaving 2      Interleaving 3
S x,1  Th1          S x,1  Th1          S x,1  Th1
S y,2  Th1          S y,2  Th1          S y,2  Th1
S y,3  Th2          S y,3  Th2          S y,3  Th2
L y,3  Th1          S x,4  Th2          S x,4  Th2
S x,4  Th2          L x    Th2          L y,3  Th1
L x    Th2          L y,3  Th1          L x    Th2
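This reasoning can be checked mechanically. The sketch below (an added illustration, not from the slides) enumerates every interleaving of the two fenced threads that respects program order, keeps those in which L y returns 3, and confirms that L x then never returns 1:

```python
# Program order within each thread (the fences forbid reordering here).
T1 = [("S", "x", 1), ("S", "y", 2), ("L", "y", None)]
T2 = [("S", "y", 3), ("S", "x", 4), ("L", "x", None)]

def interleavings(a, b):
    # All merges of a and b that preserve each thread's internal order.
    if not a: yield list(b); return
    if not b: yield list(a); return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

results = set()
for order in interleavings(T1, T2):
    mem, loads = {}, {}
    for op, var, val in order:
        if op == "S":
            mem[var] = val          # a store makes its value current
        else:
            loads[var] = mem[var]   # a load returns the latest store
    if loads["y"] == 3:             # the case analyzed on the slide
        results.add(loads["x"])

print(results)   # {4}
```

Every interleaving in which L y = 3 (including the three tabulated above) yields L x = 4, never 1.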

On the other hand, what if L y in Th 1 sees S y,2 in Thread 1?


Sept-13, 2016 5



Potential violations of Serializability: Example 2

Thread 1      Thread 2
S x,1         S y,3
S x,2         S y,5
Fence         Fence
L y = 3       L x = 1?

[Figure: ordering graph over the operations S x,1; S x,2; L y; S y,3; S y,5; L x]

Rule 2: Successor Stores of a Store are ordered after its observer

Sept-13, 2016 6


Consider the example for the second rule.
Rule 2: Successor Stores of a Store are ordered after its observer

Thread 1      Thread 2
S x,1         S y,3
S x,2         S y,5
Fence         Fence
L y = 3       L x = 1?

This is not possible: S y,5 (in Thread 2) happens after L y = 3 in Thread 1.

This means S x,2 happened before L y = 3, and hence before S y,5, the Fence, and L x in Thread 2.

Thread 2 can only get x = 2 when it does L x.


Sept-13, 2016 7


Now L x = 1 is possible.

Th 2: S y,3
      S y,5
      Fence
Th 1: S x,1
Th 2: L x = 1
Th 1: S x,2
      Fence
      L y = 5

However, what if L y = 5 in Thread 1?

Thread 1      Thread 2
S x,1         S y,3
S x,2         S y,5
Fence         Fence
L y = 5       L x = 1?

In other words, L y = 5 happened after S y,5.

Sept-13, 2016 8



Potential violations of Serializability: Example 3

Thread 1      Thread 2      Thread 3
S x,1         S y,2         S y,4
Fence         Fence         Fence
L y = 2       S z,6         L z = 6
L y = 4                     Fence
                            S x,8
                            L x = 1?

[Figure: ordering graph over the operations S x,1; L y; L y; S y,2; S z,6; S y,4; L z; S x,8; L x]

Rule 3: "mutual ancestors" of unordered Loads are ordered before "mutual successors" of the Stores they observe


Sept-13, 2016 9

Rule 3: "mutual ancestors" of unordered Loads are ordered before "mutual successors" of the Stores they observe

Thread 1      Thread 2      Thread 3
S x,1         S y,2         S y,4
Fence         Fence         Fence
L y = 2       S z,6         L z = 6
L y = 4                     Fence
                            S x,8
                            L x = 1?

This is not possible. Note that without a fence, instructions can be REORDERED; thus L y = 2 and L y = 4 in Thread 1 can be reordered. But S x,1 in Thread 1 must happen before these loads, since there is a fence after S x,1 – the common ancestor of the loads.

Likewise, S x,8 is the common successor of S y,2 (in Thread 2) and S y,4 (in Thread 3).
S z,6 in Thread 2 happens before L z = 6 in Thread 3.
L x happens after S x,8, so it returns 8, not the 1 set by S x,1.

L x = 1 is possible under other orders – such as when there are no common ancestors.

Sept-13, 2016 10

Such rules allow us (or compilers and dynamic instruction scheduling) to reorder instructions as long as they do not violate the "serializability" rules.

Note that we cannot violate RAW dependencies. That is, if we have a read of variable X and a write of X, we need to make sure the order of the read and the write is the same as in the program (or thread).

In the previous example, Thread 3 executes:
S y,4
Fence
L z = 6
Fence
S x,8
L x = 1?

We cannot reorder the last two statements. It is possible that some other thread has a store to x after S x,8, such as S x,1 in Thread 1.

So, when we talk about memory orders, we are talking about:
non-dependent loads and stores
dependent loads and stores in different threads/programs


Sept-13, 2016 11

Transactional Memories

Consider the following examples

a) for (j=0; j<n; j++)
       hist[j]++;

b) for (j=0; j<n; j++)
       if (local_min(j) < global_min)
           global_min = local_min(j);

Can we consider concurrent implementations using
(i) Multithreading (shared memory)
(ii) MPI (distributed memory)

Sept-13, 2016 12

Shared Memory Model

void compute_min(j) {
    ...
    local_min = min();
    lock(lock_variable);
    if (local_min < global_min)
        global_min = local_min;
    unlock(lock_variable);
    exit();
}

b) for (j=0; j<n; j++)
       thread_id[j] = spawn_thread(compute_min, j);
   for (j=0; j<n; j++)
       join(thread_id[j]);
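A runnable version of this sketch, using Python's threading module in place of the slide's spawn_thread/join pseudocode (the per-thread local_min values here are made up for the illustration):

```python
import threading

values = [7, 3, 9, 3, 5, 2, 8]        # made-up per-thread local minima
global_min = float("inf")
lock_variable = threading.Lock()

def compute_min(j):
    global global_min
    local_min = values[j]              # stands in for the slide's min()
    with lock_variable:                # lock(lock_variable) ... unlock(...)
        if local_min < global_min:
            global_min = local_min

thread_id = [threading.Thread(target=compute_min, args=(j,))
             for j in range(len(values))]
for t in thread_id:
    t.start()                          # spawn_thread(compute_min, j)
for t in thread_id:
    t.join()                           # join(thread_id[j])

print(global_min)   # 2
```

The `with lock_variable:` block plays the role of lock/unlock; without it, the check-and-update of global_min would be a race.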


Sept-13, 2016 13

Optimistic/Speculative Model

void compute_min(j) {
    ...
    local_min = min();
    if (local_min < global_min)
        atomic global_min = local_min;
    exit();
}

Consider a system that does not use locks to acquire shared data

But, all updates to shared data are still atomic

Is there a problem with this?
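There is: although each store is atomic, the compare and the store together are not. Another thread can update global_min between a thread's comparison and its store, and the smaller value is then overwritten. The sketch below (an added illustration with made-up values) replays one such bad interleaving deterministically:

```python
# Shared variable; each store to it is atomic, but the compare-then-store
# sequence in each thread is not.
global_min = 10

# Thread 1 has local_min = 2, Thread 2 has local_min = 5 (made-up values).
t1_local, t2_local = 2, 5

# One bad interleaving, replayed step by step:
t1_sees = t1_local < global_min      # T1: compare (2 < 10) -> True
t2_sees = t2_local < global_min      # T2: compare (5 < 10) -> True
if t1_sees:
    global_min = t1_local            # T1: atomic store, global_min = 2
if t2_sees:
    global_min = t2_local            # T2: atomic store, global_min = 5 (!)

print(global_min)   # 5 -- the true minimum 2 was lost
```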

Sept-13, 2016 14

Optimistic/Speculative Model

Consider a different application that uses speculative execution:

while (continue_cond) {
    ...
    x = hash[index1];
    hash[index2] = y;
    ...
}

Can we execute the loop in parallel?

Is there a dependency between index1 and index2?
And can we determine this dependency at compile time?


Sept-13, 2016 15

Optimistic/Speculative Model

Speculative execution

Sept-13, 2016 16

Parallelism and issues

In most parallel programming languages we use
mutual exclusion
monitors
or message communication

to communicate and synchronize among computations.

The key is: shared data cannot be accessed outside critical sections.

And: locks must be acquired in the same order to avoid deadlocks.


Sept-13, 2016 17

Atomic Transactions

Atomicity of actions:
When two concurrent computations interact, we need to assure atomicity of the interactions.
Note that we are not talking about the data directly, but implicitly.

Concurrency in databases is based on atomic transactions.

Consider booking airline tickets:
Each "agent" works on a copy of the data
and either commits or retries the results.

Sept-13, 2016 18

Atomic Transactions

Atomic Transactions can cover both
mutual exclusion
speculative execution

However, still need programming discipline

Thread 1:
atomic {
    while (!flagA);
    flagB = true;
}

Thread 2:
atomic {
    flagA = true;
    while (!flagB);
}


Sept-13, 2016 19

Atomic Transactions

The code is not correct: since each atomic block must execute in isolation, the loops cannot terminate unless the respective flags are already true before the blocks start.

So, one cannot assume that mutual exclusion can be mechanically replaced with atomic transactions.

We still need to make sure that shared objects are accessed only by atomic transactions.

What we need is the ability to notice changes to our variables and undo our computations.

Sept-13, 2016 20

Implementing Atomicity and Speculation

1. Need to be able to roll back -- changes must be buffered
2. Need to recognize failures
   Maintain read and write sets with each transaction
   Cache coherency extensions
   Version numbers
3. Need to differentiate between speculated values and regular values
   additional instructions such as speculative-load/store
   start transaction, end transaction
4. Determine the order of committing/write-back
   program order
   one thread at a time (no specified order)
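Points 1–4 can be sketched in a few lines. The toy software TM below (an illustrative sketch, not the API of any real TM system) buffers writes, tracks read and write sets, and validates version numbers at commit; a failed validation rolls back by simply discarding the buffer:

```python
class Transaction:
    def __init__(self, memory, versions):
        self.memory, self.versions = memory, versions
        self.read_set = {}      # addr -> version seen at first read
        self.write_buf = {}     # buffered (speculative) stores

    def load(self, addr):
        if addr in self.write_buf:            # reads see our own updates
            return self.write_buf[addr]
        self.read_set.setdefault(addr, self.versions[addr])
        return self.memory[addr]

    def store(self, addr, value):
        self.write_buf[addr] = value          # buffered, not yet visible

    def commit(self):
        # Validate: every location we read is still at the version we saw.
        for addr, ver in self.read_set.items():
            if self.versions[addr] != ver:
                self.write_buf.clear()        # rollback: discard the buffer
                return False
        for addr, value in self.write_buf.items():   # write-back
            self.memory[addr] = value
            self.versions[addr] += 1
        return True

memory = {"x": 0}
versions = {"x": 0}

t1 = Transaction(memory, versions)
t2 = Transaction(memory, versions)
t1.store("x", t1.load("x") + 1)
t2.store("x", t2.load("x") + 1)   # t2 read x before t1 committed
ok1 = t1.commit()                 # succeeds
ok2 = t2.commit()                 # fails validation: x's version changed
print(ok1, ok2, memory["x"])      # True False 1
```

Here t2 would be retried after its failed commit; real systems differ in when they detect the conflict and which transaction they abort, which is exactly the design space discussed below.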


Sept-13, 2016 21

Some Architectural Background

How is atomicity achieved in current systems?

Implementation of locks in hardware
Test&Set instructions (and Fetch&Add type instructions)
Load Linked and Store Conditional instructions
Memory Barriers (or fences)

Performance issues
Where to locate locks (local cache?)
Tree and queue locks
Shadow locks

Sept-13, 2016 22

Load Linked and Store Conditional

LL remembers the memory address of the load.
If some other access (either on the same processor or another processor) to the same address is made, the remembered address is lost.
On an SC, if the remembered address is the same as that of the SC, the store will be successful.

Let us take a simple example of implementing the test and set using these instructions.

Try: MOV  R3, R4     ; move value to be exchanged
     LL   R2, 0(R1)  ; load linked from memory
     SC   R3, 0(R1)  ; store conditional to the same memory location
     BEQZ R3, Try    ; if the SC was unsuccessful, try again
     BNEZ R2, Try    ; if a nonzero value was read by the LL, try again
                     ; (the lock is held by someone else)
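The LL/SC mechanics can be mimicked in software. The sketch below (an illustration of the semantics, not real hardware: real implementations may also lose the link on cache evictions, interrupts, etc.) keeps one remembered address per processor; an intervening store to that address clears it, making the SC fail:

```python
class Memory:
    def __init__(self):
        self.data = {}
        self.link = {}                 # processor id -> remembered address

    def ll(self, proc, addr):          # load linked
        self.link[proc] = addr
        return self.data.get(addr, 0)

    def store(self, proc, addr, value):  # ordinary store
        # Any store to a remembered address breaks the other links to it.
        for p, a in list(self.link.items()):
            if a == addr:
                del self.link[p]
        self.data[addr] = value

    def sc(self, proc, addr, value):   # store conditional
        if self.link.get(proc) != addr:
            return 0                   # fail: the link was lost
        self.store(proc, addr, value)
        return 1                       # success

mem = Memory()
old = mem.ll(0, "lock")          # P0: LL reads 0 (lock is free)
mem.store(1, "lock", 1)          # P1 writes the same address
ok = mem.sc(0, "lock", 1)        # P0's SC fails: its link was broken
print(old, ok)                   # 0 0
retry = mem.ll(0, "lock")        # P0 retries and sees the lock held (1)
```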


Sept-13, 2016 23

Cache Coherency

Why do caches cause a problem in multicore systems?

Snooping Protocols -- MESI (and MOESI)

Each cache line is associated with one of four states:
Invalid (I), Modified (M), Exclusive (E), or Shared (S).

Before modifying data, if it is shared, need to either send an Invalidation message, or send updates to other caches
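A minimal sketch of that invalidation step (illustrative only; a full MESI protocol also handles bus reads, write-backs, and E-state fills):

```python
# Per-cache state for one cache line: 'M', 'E', 'S', or 'I'.
caches = {0: "S", 1: "S", 2: "I"}   # the line is shared by caches 0 and 1

def write(cache_id):
    """Cache cache_id wants to modify the line."""
    if caches[cache_id] in ("M", "E"):
        caches[cache_id] = "M"        # already exclusive: write silently
        return
    # Shared or Invalid: broadcast an invalidation, then take ownership.
    for other in caches:
        if other != cache_id and caches[other] != "I":
            caches[other] = "I"       # invalidation message
    caches[cache_id] = "M"

write(0)
print(caches)   # {0: 'M', 1: 'I', 2: 'I'}
```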

Sept-13, 2016 24

Transactional Memory Semantics

Operational semantics:
Transactions should behave as if the execution uses a single lock.

Note that if every computation is a transaction, this semantics is sufficient.

But if a program includes both (atomic) transactional and non-transactional computations, then we can still have race conditions.


Sept-13, 2016 25

Transactional Memory Semantics

Weak and Strong atomicity (isolation) properties

Transactions on objects are “linearized”

Strong semantics requires (or automatically converts) all operations to behave as atomic transactions.

Weak semantics applies only to atomic transactions (and race conditions may exist with non-transactional operations).

Sept-13, 2016 26

Transactional Memory Semantics

Nested Transactions
What happens if an inner transaction aborts?

Flattened transactions:

int x = 1;
atomic {
    x = 2;
    atomic flatten {
        x = 3;
        abort;
    }
}

x = 1 always. Aborting any nested transaction causes abortion of the outer transactions.


Sept-13, 2016 27

Transactional Memory Semantics

Closed transactions:

int x = 1;
atomic {
    x = 2;
    atomic closed {
        x = 3;
        abort;
    }
}

x = 2 if the outer transaction commits. Aborting inner transactions will not cause outer transactions to abort, and inner transactions can only commit if the outer transaction commits.

Open transactions:

int x = 1;
atomic {
    x = 2;
    atomic open {
        x = 3;
    }
    abort;
}

x = 3 even if the outer transaction aborts. Inner transactions can commit even if outer transactions fail.
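The three outcomes for x can be reproduced with a toy nested-transaction machine (an added illustration; the stack-of-write-buffers design is one possible implementation, not the only one):

```python
class TxAbort(Exception):
    pass

class TxSystem:
    """Toy nested TM: one buffered write set per nesting level."""
    def __init__(self, memory):
        self.memory = memory
        self.stack = []

    def begin(self):
        self.stack.append({})

    def store(self, addr, value):
        self.stack[-1][addr] = value

    def abort(self):
        self.stack.pop()                # discard this level's writes
        raise TxAbort()

    def commit(self, mode="closed"):
        buf = self.stack.pop()
        if mode == "open" or not self.stack:
            self.memory.update(buf)     # publish to real memory
        else:
            self.stack[-1].update(buf)  # closed: merge into the parent

def example(inner_mode):
    """Replay the slides' nested example under one nesting semantics."""
    mem = {"x": 1}
    tx = TxSystem(mem)
    tx.begin()                          # outer:  atomic {
    try:
        tx.store("x", 2)                #   x = 2;
        tx.begin()                      #   atomic <inner_mode> {
        try:
            tx.store("x", 3)            #     x = 3;
            if inner_mode != "open":
                tx.abort()              #     abort;  (flattened/closed)
            tx.commit("open")           #   }  the open inner commits now
        except TxAbort:
            if inner_mode == "flattened":
                raise                   # flattened: abort the outer too
        if inner_mode == "open":
            tx.abort()                  #   abort;  (the open example)
        tx.commit()                     # }
    except TxAbort:
        pass                            # the outer transaction aborted
    return mem["x"]

print(example("flattened"), example("closed"), example("open"))  # 1 2 3
```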

Sept-13, 2016 28

Transactional Memory Semantics

o Flattened transactions are easy to implement (but not good, since aborted library transactions will abort your programs)

o Closed transactions lead to higher retries and overheads

o Open transactions do not guarantee sequential consistency


Sept-13, 2016 29

Transactional Memory Design Issues

Granularity
Object granularity -- typically in software TMs
Word or cache-block granularity -- typically in hardware TMs

Performance impacts:
need to maintain metadata with each data item
to track all transactions that access them
read sets and write sets

Sept-13, 2016 30

Transactional Memory Design Issues

Direct or deferred updates
Direct -- modify an object directly
    must be able to undo modifications
Deferred -- modify a copy
    must ensure local reads see the updates
    discard local copies if aborted

If conflicts are rare, direct (or in-place) updates are efficient.
What if private copies cause buffer overflows,
or we need to write cached copies?


Sept-13, 2016 31

Transactional Memory Design Issues

Conflict detection
Early detection -- detect when opened
    assuming instructions exist to open/close:
    OpenForReading(..)
    OpenForWriting(..)
Late detection -- detect on commit

Performance issues
wasted computation
metadata needed to track read sets and write sets
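Whichever detection time is chosen, the conflict test itself is a set intersection: two transactions conflict if one's write set overlaps the other's read or write set. A small sketch (illustrative; the transaction contents are made up):

```python
def conflicts(t1, t2):
    """t1 and t2 are dicts with 'reads' and 'writes' location sets."""
    return bool(t1["writes"] & (t2["reads"] | t2["writes"]) or
                t2["writes"] & t1["reads"])

T1 = {"reads": {"x", "y"}, "writes": set()}
T2 = {"reads": set(), "writes": {"x"}}
T3 = {"reads": set(), "writes": {"z"}}

print(conflicts(T1, T2))   # True  -- T2 writes x, which T1 reads
print(conflicts(T1, T3))   # False -- disjoint locations
```

Eager systems run this test as locations are opened; lazy systems run it once, at commit.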

Sept-13, 2016 32

Transactional Memory Design Issues

In this example, in part (a) we will detect the conflict using either eager or lazy detection.

In part (b), only eager detection detects the conflict. Lazy detection allows both transactions to commit (that is, the ReadX of T1 is ordered before the WriteX of T2).


Sept-13, 2016 33

Transactional Memory Design Issues

In early (eager) conflict detection, T1 may be aborted twice: ReadX may conflict with the WriteX of T2 (if T2 commits before T1), and ReadY may conflict with the WriteY of T3 (if T3 commits first).

In lazy (delayed) conflict detection, T1 will only be aborted once (when it tries to commit at the end).

[Figure: timelines of T1 (StartT, ReadX, ReadY, EndT/Commit), T2 (StartT, WriteX, EndT/Commit), and T3 (StartT, WriteY, EndT/Commit); T3 commits first, then T2, then T1 attempts to commit.]

Sept-13, 2016 34

Transactional Memory Design Issues

Another issue is which transaction to abort when a conflict is detected.

In this example, when the conflict between T1 and T2 is detected, we can either abort T1 or T2. If T2 is aborted, then we need to decide again between T1 and T3 when the second conflict, between T1 and T3, is detected (two transactions are aborted).

If we abort T1 instead of T2, then both T2 and T3 can commit (only one transaction aborted).


Sept-13, 2016 35

Transactional Memory Design Issues

Read and write sets
Each transaction maintains its own read sets
    or makes the read set public
Maintain read/write sets with objects
    notify readers of conflicts
    read sets may be very large

Sept-13, 2016 36

Software TM Implementations

General outline
Depending on the granularity, need to maintain:
    indirection to permit atomicity of commit
    buffers for updates
    read and write lists for validation


Sept-13, 2016 37

Object TM Implementations

Sept-13, 2016 38

Object TM Implementations


Sept-13, 2016 39

Hardware TM Implementations

TCC (Stanford)

Sept-13, 2016 40

LogTM


Sept-13, 2016 41

LogTM

Sept-13, 2016 42

Implicit Transactions in Kilo

Views computations as regions.

Check-pointing and rollback mechanisms are used to either commit the computations of a region or redo them.

