Sept-13, 2016 1
CSCE 6610: Advanced Computer Architecture
Review
New Amdahl's law: a possible idea for a term project
Explore my idea about changing frequency based on the serial fraction, to maintain fixed energy or keep the same execution time

Memory consistency: memory ordering, atomicity
Weak ordering
Use of locks or synchronization variables
Memory fences/barriers
Store atomicity

Next: Reading assignments (2)
Transactional memory coherence and consistency
Core Fusion
Sept-13, 2016 2
Dataflow to Synthesis, May 18, 2007 6
Potential violations of Serializability: Example 1

Thread 1      Thread 2
S x,1         S y,3
Fence         Fence
S y,2         S x,4
L y = 3       L x = 1?

[figure: ordering diagram relating S x,1, S y,2, L y (Thread 1) and S y,3, S x,4, L x (Thread 2)]

Rule 1: Predecessor Stores of a Load are ordered before its source.
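Rule 1 can be checked mechanically. The sketch below (illustrative code, not from the lecture) enumerates every sequentially consistent interleaving of the two threads, keeping each thread in program order, and asks which (L y, L x) outcomes can occur; `outcome_possible` and the operation encoding are my own naming.

```c
#include <assert.h>

typedef struct { char var; int val; } Op;   /* val < 0 marks a load */

/* Thread 1: S x,1; S y,2; L y    Thread 2: S y,3; S x,4; L x
   The fences only enforce program order here, which every
   sequentially consistent interleaving preserves anyway. */
static const Op t1[3] = { {'x', 1}, {'y', 2}, {'y', -1} };
static const Op t2[3] = { {'y', 3}, {'x', 4}, {'x', -1} };

/* Return 1 if some interleaving makes L y return want_ly and
   L x return want_lx; 0 if no interleaving can. */
int outcome_possible(int want_ly, int want_lx) {
    for (int mask = 0; mask < 64; mask++) { /* bit=1: next op from t2 */
        int x = 0, y = 0, i1 = 0, i2 = 0, ly = -1, lx = -1, ok = 1;
        for (int step = 0; step < 6 && ok; step++) {
            const Op *op = 0;
            if ((mask >> step) & 1) {
                if (i2 < 3) op = &t2[i2++]; else ok = 0;
            } else {
                if (i1 < 3) op = &t1[i1++]; else ok = 0;
            }
            if (!ok) break;
            if (op->val < 0) {              /* load: record current value */
                if (op->var == 'y') ly = y; else lx = x;
            } else {                        /* store: update memory */
                if (op->var == 'y') y = op->val; else x = op->val;
            }
        }
        if (ok && ly == want_ly && lx == want_lx)
            return 1;
    }
    return 0;
}
```

Running it confirms the slide's claim: L y = 3 together with L x = 1 never occurs, while the other outcomes discussed on the following slides do.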
Sept-13, 2016 3
What if L y sees S y,2 in Thread 1?

Thread 1      Thread 2
S x,1         S y,3
Fence         Fence
S y,2         S x,4
L y = 2       L x = 1?

Now L x = 1 in Thread 2 is possible.

We can have the following order (other orders are also possible):
Th 2: S y,3
Th 2: Fence
Th 2: S x,4
Th 1: S x,1
Th 1: Fence
Th 2: L x = 1
Th 1: S y,2
Th 1: L y = 2

In other words, we need to know what outcomes are possible under different orders of stores.
Sept-13, 2016 4
Since L y returned 3, L y in Th 1 happened after S y,3 in Th 2.

This automatically restricts the possible interleavings of instructions to the following:

Interleaving 1      Interleaving 2      Interleaving 3
S x,1  Th1          S x,1  Th1          S x,1  Th1
S y,2  Th1          S y,2  Th1          S y,2  Th1
S y,3  Th2          S y,3  Th2          S y,3  Th2
L y,3  Th1          S x,4  Th2          S x,4  Th2
S x,4  Th2          L x    Th2          L y,3  Th1
L x    Th2          L y,3  Th1          L x    Th2

It is not possible for L x in Th 2 to return 1 (the value set by S x,1 in Th 1).

On the other hand, what if L y in Th 1 sees S y,2 from Thread 1 itself?
Sept-13, 2016 5
Potential violations of Serializability: Example 2
Thread 1      Thread 2
S x,1         S y,3
S x,2         S y,5
Fence         Fence
L y = 3       L x = 1?

[figure: ordering diagram relating S x,1, S x,2, L y (Thread 1) and S y,3, S y,5, L x (Thread 2)]

Rule 2: Successor Stores of a Store are ordered after its observer.
Sept-13, 2016 6
Consider the example for the second rule.
Rule 2: Successor Stores of a Store are ordered after its observer.

Thread 1      Thread 2
S x,1         S y,3
S x,2         S y,5
Fence         Fence
L y = 3       L x = 1?

This is not possible: S y,5 (in Thread 2) happens after L y = 3 in Thread 1.
This means S x,2 happened before L y = 3, and hence before S y,5, the Fence, and L x in Thread 2.
Thread 2 can only get x = 2 when it does L x.
Sept-13, 2016 7
However, what if L y = 5 in Thread 1?

Thread 1      Thread 2
S x,1         S y,3
S x,2         S y,5
Fence         Fence
L y = 5       L x = 1?

In other words, L y = 5 happened after S y,5.

Now L x = 1 is possible:
Th 2: S y,3
Th 2: S y,5
Th 2: Fence
Th 1: S x,1
Th 2: L x = 1
Th 1: S x,2
Th 1: Fence
Th 1: L y = 5
Sept-13, 2016 8
Potential violations of Serializability: Example 3
Thread 1      Thread 2      Thread 3
S x,1         S y,2         S y,4
Fence         Fence         Fence
L y = 2       S z,6         L z = 6
L y = 4                     Fence
                            S x,8
                            L x = 1?

[figure: ordering diagram relating S x,1, L y, L y (Thread 1), S y,2, S z,6 (Thread 2), and S y,4, L z, S x,8, L x (Thread 3)]

Rule 3: "mutual ancestors" of unordered Loads are ordered before "mutual successors" of the Stores they observe.
Sept-13, 2016 9
Rule 3: "mutual ancestors" of unordered Loads are ordered before "mutual successors" of the Stores they observe.
Thread 1      Thread 2      Thread 3
S x,1         S y,2         S y,4
Fence         Fence         Fence
L y = 2       S z,6         L z = 6
L y = 4                     Fence
                            S x,8
                            L x = 1?
This is not possible. Note that without a fence, instructions can be REORDERED; thus L y = 2 and L y = 4 in Thread 1 can be reordered. But S x,1 in Thread 1 must happen before these loads, since there is a fence after S x,1: it is the common ancestor of the loads.

Likewise, S x,8 is the common successor of S y,2 (in Thread 2) and S y,4 (in Thread 3).
S z,6 in Thread 2 happens before L z = 6 in Thread 3.
L x happens after S x,8, so it must return 8, not 1.

L x = 1 is possible under other orders, such as when there are no common ancestors.
Sept-13, 2016 10
Such rules allow us (or compilers and dynamic instruction scheduling) to reorder instructions as long as they do not violate the "serializability" rules.
Note that we cannot violate RAW dependencies. That is, if we have a read of variable X and a write of X, we need to make sure the order of the read and the write is the same as in the program (or thread).
For example, in the previous example, Thread 3 is:
S y,4
Fence
L z = 6
Fence
S x,8
L x = 1?

We cannot reorder the last two statements: it is possible some other thread has a store to x after S x,8, such as S x,1 in Thread 1.

So, when we talk about memory orders, we are talking about:
non-dependent loads and stores
dependent loads and stores in different threads/programs
Sept-13, 2016 11
Transactional Memories
Consider the following examples
a) for (j = 0; j < n; j++)
       hist[j]++;

b) for (j = 0; j < n; j++)
       if (local_min(j) < global_min)
           global_min = local_min(j);

Can we consider concurrent implementations using
(i) multithreading (shared memory)
(ii) MPI (distributed memory)?
Sept-13, 2016 12
Shared Memory Model
void compute_min(j)
{
    ...
    local_min = min();
    lock(lock_variable);
    if (local_min < global_min)
        global_min = local_min;
    unlock(lock_variable);
    exit();
}

b) for (j = 0; j < n; j++)
       thread_id[j] = spawn_thread(compute_min, j);
   for (j = 0; j < n; j++)
       join(thread_id[j]);
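The pseudocode above can be fleshed out into a runnable pthreads sketch. The array, the chunking, and the `parallel_min` wrapper are illustrative additions of mine; the point is the lock-protected update of `global_min`, exactly as in the slide.

```c
#include <pthread.h>
#include <limits.h>
#include <assert.h>

static int global_min;
static pthread_mutex_t lock_variable = PTHREAD_MUTEX_INITIALIZER;

struct task { const int *a; int lo, hi; };   /* one thread's slice */

static void *compute_min(void *arg) {
    struct task *t = arg;
    int local_min = INT_MAX;
    for (int j = t->lo; j < t->hi; j++)      /* local phase: no lock */
        if (t->a[j] < local_min) local_min = t->a[j];
    pthread_mutex_lock(&lock_variable);      /* critical section */
    if (local_min < global_min)
        global_min = local_min;
    pthread_mutex_unlock(&lock_variable);
    return NULL;
}

/* Spawn nthreads (<= 16) workers over a[0..n), join, return the min. */
int parallel_min(const int *a, int n, int nthreads) {
    pthread_t tid[16];
    struct task tsk[16];
    int chunk = (n + nthreads - 1) / nthreads;
    global_min = INT_MAX;
    for (int t = 0; t < nthreads; t++) {
        tsk[t].a = a;
        tsk[t].lo = t * chunk;
        tsk[t].hi = (t + 1) * chunk < n ? (t + 1) * chunk : n;
        pthread_create(&tid[t], NULL, compute_min, &tsk[t]);
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
    return global_min;
}
```

Note that the expensive scan runs without the lock; only the short compare-and-update of the shared variable is serialized.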
Sept-13, 2016 13
Optimistic/Speculative Model
void compute_min(j)
{
    ...
    local_min = min();
    if (local_min < global_min)
        atomic global_min = local_min;
    exit();
}
Consider a system that does not use locks to acquire shared data
But, all updates to shared data are still atomic
Is there a problem with this?
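One concrete reading of "no locks, but atomic updates" is a compare-and-swap loop: it makes the check and the update a single atomic step, which the unlocked check-then-store above is not (another thread could lower `global_min` between the comparison and the assignment). This C11 sketch is illustrative; `atomic_update_min` is my own naming.

```c
#include <stdatomic.h>
#include <assert.h>

atomic_int global_min = 1 << 30;   /* "infinity" starting value */

/* Atomically: if local_min < global_min, install local_min.
   atomic_compare_exchange_weak only stores local_min if
   global_min still holds the value we last observed in cur;
   on failure it reloads cur and we recheck and retry. */
void atomic_update_min(int local_min) {
    int cur = atomic_load(&global_min);
    while (local_min < cur &&
           !atomic_compare_exchange_weak(&global_min, &cur, local_min))
        ;  /* cur was refreshed by the failed CAS; loop */
}
```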
Sept-13, 2016 14
Optimistic/Speculative Model
Consider a different application that uses speculative execution:

while (continue_cond) {
    ...
    x = hash[index1];
    hash[index2] = y;
    ...
}
Can we execute the loop in parallel?
Is there a dependency between index1 and index2?And can we determine this dependency at compile time?
Sept-13, 2016 15
Optimistic/Speculative Model
Speculative execution
Sept-13, 2016 16
Parallelism and issues
In most parallel programming languages we use
mutual exclusion
monitors
or message communication
to communicate and synchronize among computations.

The key is: shared data cannot be accessed outside critical sections.
And: must use the same order on locks to avoid deadlocks
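The "same order on locks" rule can be sketched as follows: when a computation needs two locks, every thread acquires them in one global order (here, by address), so two threads can never each hold one lock while waiting for the other. The `account`/`transfer` example is illustrative, not from the slides.

```c
#include <pthread.h>
#include <assert.h>

struct account { pthread_mutex_t m; int balance; };

/* Move amt from one account to another while holding both locks.
   Locks are always taken in ascending address order, regardless of
   which account is the source: transfer(a,b) and transfer(b,a)
   running concurrently can therefore never deadlock. */
void transfer(struct account *from, struct account *to, int amt) {
    struct account *first  = from < to ? from : to;
    struct account *second = from < to ? to : from;
    pthread_mutex_lock(&first->m);
    pthread_mutex_lock(&second->m);
    from->balance -= amt;
    to->balance += amt;
    pthread_mutex_unlock(&second->m);
    pthread_mutex_unlock(&first->m);
}
```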
Sept-13, 2016 17
Atomic Transactions
Atomicity of actions
When two concurrent computations interact, we need to assure atomicity of the interactions.
Note that we are not talking about the data directly, but implicitly.

Concurrency in databases is based on atomic transactions.

Consider booking airline tickets: each "agent" works on a copy of the data, and either commits or retries the results.
Sept-13, 2016 18
Atomic Transactions
Atomic transactions can cover both
mutual exclusion
speculative execution

However, we still need programming discipline. Consider:

Thread 1
atomic {
    while (!flagA);
    flagB = true;
}

Thread 2
atomic {
    flagA = true;
    while (!flagB);
}
Sept-13, 2016 19
Atomic Transactions
The code is not correct, since the loops do not terminate unless the respective flags are already true before the blocks start.

So one cannot assume that mutual exclusion can be mechanically replaced with atomic transactions.

We still need to make sure that shared objects are accessed only by atomic transactions.

What we need is the ability to notice changes to our variables and undo our computations.
Sept-13, 2016 20
Implementing Atomicity and Speculation
1. Need to be able to roll back -- changes must be buffered
2. Need to recognize failures
   Maintain read and write sets with each transaction
   Cache coherency extensions
   Version numbers
3. Need to differentiate between speculated values and regular values
   additional instructions such as speculative-load/store
   start transaction, end transaction
4. Determine the order of committing/write-back
   program order
   one thread at a time (no specified order)
Sept-13, 2016 21
Some Architectural Background
How is atomicity achieved in current systems?
Implementation of locks in hardware
Test&Set instructions (and Fetch&Add type instructions)
Load Linked and Store Conditional instructions
Memory barriers (or fences)

Performance issues
Where to locate locks (local cache?)
Tree and queue locks
Shadow locks
Sept-13, 2016 22
Load Linked and Store Conditional
LL remembers the memory address of the load. If some other access to the same address is made (either by the same processor or by another processor), the remembered address is lost. On an SC, if the remembered address matches the address of the SC, the store succeeds.
Let us take a simple example of implementing the test and set using these instructions.
Try:  MOV  R3, R4    ; move value to be exchanged
      LL   R2, 0(R1) ; load linked from memory
      SC   R3, 0(R1) ; store conditional to the same memory location
      BEQZ R3, Try   ; if the store was unsuccessful, try again
      BNEZ R2, Try   ; if a nonzero value was read by the LL, try again:
                     ; the lock is held by someone else
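For comparison, the same spin lock can be written portably with C11 atomics: `atomic_exchange` plays the role of the LL/SC pair (or a Test&Set instruction), atomically writing 1 and returning the previous value. This is an illustrative sketch, not part of the lecture.

```c
#include <stdatomic.h>
#include <assert.h>

atomic_int lock_word = 0;   /* 0 = free, 1 = held */

void acquire(void) {
    /* Atomically write 1 and get the old value, like the LL/SC
       exchange above: keep retrying while the old value was 1,
       i.e. while the lock is held by someone else. */
    while (atomic_exchange(&lock_word, 1) != 0)
        ;  /* spin */
}

void release(void) {
    atomic_store(&lock_word, 0);   /* free the lock */
}
```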
Sept-13, 2016 23
Cache Coherency
Why do caches cause a problem in multicore systems?
Snooping Protocols -- MESI (and MOESI)
Each cache line will be associated with one of four states:
Invalid (I), Modified (M), Exclusive (E), or Shared (S)
Before modifying data, if it is shared, need to either send an Invalidation message, or send updates to other caches
Sept-13, 2016 24
Transactional Memory Semantics
Operational semantics
Transactions should behave as if the execution uses a single lock.

Note that if every computation is a transaction, this semantics is sufficient.

But if a program includes both (atomic) transactional and non-transactional computations, then we can still have race conditions.
Sept-13, 2016 25
Transactional Memory Semantics
Weak and Strong atomicity (isolation) properties
Transactions on objects are “linearized”
Strong semantics requires (or automatically converts) all operations to behave as atomic transactions.

Weak semantics applies only to atomic transactions (race conditions may exist with non-transactional operations).
Sept-13, 2016 26
Transactional Memory Semantics

Nested Transactions
What happens if an inner transaction aborts?

Flattened transactions:

int x = 1;
atomic {
    x = 2;
    atomic flatten {
        x = 3;
        abort;
    }
}

x == 1 always: aborting any nested transaction causes the outer transactions to abort as well.
Sept-13, 2016 27
Transactional Memory Semantics

Closed transactions:

int x = 1;
atomic {
    x = 2;
    atomic closed {
        x = 3;
        abort;
    }
}

x == 2 if the outer transaction commits: aborting an inner transaction does not cause the outer transaction to abort, and inner transactions can only commit if the outer transaction commits.

Open transactions:

int x = 1;
atomic {
    x = 2;
    atomic open {
        x = 3;
    }
    abort;
}

x == 3 even if the outer transaction aborts: inner open transactions can commit even if the outer transactions fail.
Sept-13, 2016 28
Transactional Memory Semantics

o Flattened transactions are easy to implement (but not good, since an aborted library transaction will abort your whole program)
o Closed transactions lead to more retries and higher overheads
o Open transactions do not guarantee sequential consistency
Sept-13, 2016 29
Transactional Memory Design Issues

Granularity
Object granularity -- typically in software TMs
Word or cache-block granularity -- typically in hardware TMs

Performance impacts
need to maintain metadata with each data item to track all transactions that access it
read sets and write sets
Sept-13, 2016 30
Transactional Memory Design Issues

Direct or deferred updates
Direct -- modify an object directly
    must be able to undo modifications
Deferred -- modify a copy
    must ensure local reads see the updates
    discard local copies if aborted

If conflicts are rare, direct (or in-place) updates are efficient.
What if private copies cause buffer overflows, or we need to write cached copies?
Sept-13, 2016 31
Transactional Memory Design Issues

Conflict detection
Early detection -- detect when an object is opened
    assuming instructions exist to open/close:
    OpenForReading(..)
    OpenForWriting(..)
Late detection -- detect on commit

Performance issues
Wasted computation
metadata needed to track read sets and write sets
Sept-13, 2016 32
Transactional Memory Design Issues
In this example, in part (a) we will detect the conflict using either eager or lazy detection.

In part (b), only eager detection detects the conflict. Lazy detection allows both transactions to commit (that is, the ReadX of T1 happened before the WriteX of T2).
Sept-13, 2016 33
Transactional Memory Design Issues
In early (eager) conflict detection, T1 may be aborted twice: ReadX may conflict with the WriteX of T2 (if T2 commits before T1), and ReadY may conflict with the WriteY of T3 (if T3 commits first).

In lazy (delayed) conflict detection, T1 will only be aborted once (when it tries to commit at the end).
[figure: transaction timeline -- T1: StartT, ReadX, ReadY, EndT(Commit); T2: StartT, WriteX, EndT(Commit); T3: StartT, WriteY, EndT(Commit); T2 and T3 both commit before T1]
Sept-13, 2016 34
Transactional Memory Design Issues
Another issue is which transaction to abort when a conflict is detected.
In this example, when the conflict between T1 and T2 is detected, we can abort either T1 or T2. If T2 is aborted, then we must decide again between T1 and T3 when the conflict between T1 and T3 is detected (two transactions are aborted in total).

If we abort T1 instead of T2, then both T2 and T3 can commit (only one transaction is aborted).
Sept-13, 2016 35
Transactional Memory Design Issues

Read and write sets
Each transaction maintains its own read sets
    or makes the read set public
Maintain read/write sets with objects
    notify readers of conflicts
    read sets may be very large
Sept-13, 2016 36
Software TM Implementations

General outline
Depending on the granularity, need to maintain:
    indirection to permit atomicity of commit
    buffers for updates
    read and write lists for validation
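A minimal sketch of the read/write-set bookkeeping mentioned above: under commit-time (lazy) validation, a transaction must abort if its read set intersects the write set of a transaction that committed during its execution. Addresses are modeled as plain integers and the fixed-size sets are my own simplification; this is illustrative, not a real STM.

```c
#include <assert.h>

enum { MAX_SET = 32 };

struct txn {
    int read_set[MAX_SET];   /* addresses this transaction read */
    int n_reads;
    int write_set[MAX_SET];  /* addresses this transaction wrote */
    int n_writes;
};

static int in_set(const int *s, int n, int addr) {
    for (int i = 0; i < n; i++)
        if (s[i] == addr) return 1;
    return 0;
}

/* Return 1 if t must abort, given that `committed` already
   committed: some location t read was overwritten, so t's
   reads no longer form a consistent snapshot. */
int conflicts(const struct txn *t, const struct txn *committed) {
    for (int i = 0; i < t->n_reads; i++)
        if (in_set(committed->write_set, committed->n_writes,
                   t->read_set[i]))
            return 1;
    return 0;
}
```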
Sept-13, 2016 37
Object TM Implementations
Sept-13, 2016 38
Object TM Implementations
Sept-13, 2016 39
Hardware TM Implementations
TCC (Stanford)
Sept-13, 2016 40
LogTM
Sept-13, 2016 41
LogTM
Sept-13, 2016 42
Implicit Transactions in Kilo
Views computations as regions.

Checkpointing and rollback mechanisms are used to either commit the computations of a region or redo them.