TRANSCRIPT
ECE5917 SoC Architecture: MP SoC
Tae Hee Han: [email protected]
Semiconductor Systems Engineering
Sungkyunkwan University
Outline
n Overview
n Parallelism
n Data-Level Parallelism
n Instruction-Level Parallelism
n Thread-Level Parallelism
n Processor-Level Parallelism
n Multi-core
2
Parallelism – Thread Level Parallelism
3
Superscalar (In)Efficiency
4
[Figure: instruction issue slots (issue width × time) in a 4-wide superscalar]
n Completely idle cycle (vertical waste): introduced when the processor issues no instructions in a cycle
n Partially filled cycle, i.e., IPC < 4 (horizontal waste): occurs when not all issue slots can be filled in a cycle
Thread
n Definition
n A discrete sequence of related instructions
n Executed independently of other such sequences
n Every program has at least one thread
n Initializes
n Executes instructions
n May create other threads
n Each thread maintains its current state
n OS maps a thread to hardware resources
5
Multithreading
n On a single processor, multithreading generally occurs by time-division multiplexing (as in multitasking) – context switching
n On a multiprocessor or multi-core systems, threads can be truly concurrent, with every processor or core executing a separate thread simultaneously
n Many modern OS directly support both time-sliced and multiprocessor threading with a process scheduler
n The kernel of an OS allows programmers to manipulate threads via the system call interface
6
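As a concrete illustration of that system-call-level interface, here is a minimal POSIX-threads sketch (not from the slides; the worker function and its argument are made up for the example):

#include <pthread.h>
#include <stdio.h>

/* Work executed by the newly created thread. */
static void *worker(void *arg) {
    int id = *(int *)arg;
    printf("hello from thread %d\n", id);
    return NULL;
}

int main(void) {
    pthread_t tid;
    int id = 1;
    pthread_create(&tid, NULL, worker, &id);  /* kernel schedules the thread onto a core */
    pthread_join(tid, NULL);                  /* wait for the thread to finish */
    return 0;
}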
Thread Level Parallelism (TLP)
n Interaction with OS
n OS perceives each core as a separate processor
n OS scheduler maps threads/processes to different logical (or virtual) cores
n Most major OSes support multithreading today
n TLP explicitly represented by the use of multiple threads of execution that are inherently parallel
n Goal: Use multiple instruction streams to improve
n Throughput of computers that run many programs
n Execution time of multi-threaded programs
n TLP could be more cost-effective than ILP
7
[Figure: two processes with separate virtual memory spaces (ASID 1 and ASID 2) mapped onto physical memory. Process 1 has two threads and Process 2 has three; each thread keeps its own stack, registers, and PC. The OS thread scheduler maps the threads onto Processor Core 1 and Processor Core 2, each e.g. 2-way SMT.]
Multithreaded Execution
n Multithreading: multiple threads share functional units of 1 processor via overlapping
n Processor must duplicate independent state of each thread
n Separate copy of register file, PC
n Separate page table if different process
n Memory sharing via virtual memory mechanisms
n Already supports multiple processes
n HW for fast thread switch
n Must be much faster than full process switch (which is 100s to 1000s of clocks)
n When to switch?
n Alternate instruction per thread (fine grain) – round robin
n When thread is stalled (coarse grain)
n e.g., cache miss
8
Conceptual Illustration of Multithreaded Architecture
9
[Figure: concurrent threads of computation (serial code and sub-problems A, B, C, with loop iterations i = 1..n and j = 1..m) running in parallel and mapped onto hardware streams; unused streams, an instruction-ready pool, and a pipeline of executing instructions.]
Sources of Wasted Issue Slots
Source | Possible latency-hiding or latency-reducing technique
TLB miss | Increase TLB sizes, HW instruction prefetching, HW or SW data prefetching, faster servicing of TLB misses
I-cache miss | Increase cache size, more associativity, HW instruction prefetching
D-cache miss | Increase cache size, more associativity, HW or SW data prefetching, improved instruction scheduling, more sophisticated dynamic execution
Branch misprediction | Improved branch prediction scheme, lower branch misprediction penalty
Control hazard | Speculative execution, more aggressive if-conversion
Load delays (L1 cache hits) | Shorter load latency, improved instruction scheduling, dynamic scheduling
Short integer delay | Improved instruction scheduling
Long integer, short FP, long FP delays | Shorter latencies, improved instruction scheduling
Memory conflict | Improved instruction scheduling
10
Fine-Grained Multithreading
n Switches between threads on each instruction, interleaving execution of multiple threads
n Usually done round-robin, skipping stalled threads
n CPU must be able to switch threads every clock
n Advantage: can hide both short and long stalls
n Instructions from other threads always available to execute
n Easy to insert on short stalls
n Disadvantage: slows individual threads
n A thread ready to execute without stalls will be delayed by instructions from other threads
n Used on Sun (now Oracle) Niagara (UltraSPARC T1) – Nov. 2005
11
Coarse-Grained Multithreading
n Switches threads only on costly stalls: e.g., L2 cache misses
n Advantages
n Relieves need to have very fast thread switching
n Doesn't slow down the individual thread
n Other threads only issue instructions when the main one would stall (for a long time) anyway
n Disadvantage: pipeline startup costs make it hard to hide throughput losses from shorter stalls
n Pipeline must be emptied or frozen on a stall, since the CPU issues instructions from only one thread
n New thread must fill the pipe before instructions can complete
n Thus, better for reducing the penalty of high-cost stalls where pipeline refill << stall time
n Used in IBM AS/400
12
Simple Multithreaded Pipeline
n Additional state: One copy of architected state per thread (e.g., PC, GPR)
n Thread select: Round-robin logic; Propagate Thread-ID down pipeline to access correct state (e.g., GPR1 versus GPR2)
n OS perceives multiple logical CPUs
13
[Figure: simple two-thread multithreaded pipeline – per-thread PCs (PC1, PC2) and register files (GPR1, GPR2), a thread-select mux, I$, instruction register, execute stages X and Y, D$, and a +1 PC update.]
Cycle Interleaved MT (Fine-Grain MT)
14
[Figure: issue slots (issue width × time) with the second thread interleaved cycle-by-cycle; partially filled cycles, i.e., IPC < 4 (horizontal waste), remain.]
Cycle-interleaved multithreading reduces vertical waste with cycle-by-cycle interleaving. However, horizontal waste remains.
Chip Multiprocessing (CMP)
15
[Figure: issue slots (issue width × time) with the two threads running on two narrower cores.]
Chip multiprocessing reduces horizontal waste with simple (narrower) cores. However, (1) vertical waste remains and (2) ILP is bounded.
Ideal Superscalar Multithreading [Tullsen, Eggers, Levy, UW, 1995]
n Interleave multiple threads to multiple issue slots with no restrictions
16
[Figure: issue slots (issue width × time) with instructions from multiple threads filling slots in the same cycle.]
Simultaneous Multithreading (SMT) Motivation
n Fine-grain Multithreading
n HEP, Tera, MASA, MIT Alewife
n Fast context switching among multiple independent threads
n Switch threads on cache miss stalls – Alewife
n Switch threads on every cycle – Tera, HEP
n Targets vertical waste only
n At any cycle, issue instructions from only a single thread
n Single-chip MP
n Coarse-grain parallelism among independent threads in different processors
n Also exhibits both vertical and horizontal waste in each individual processor pipeline
17
Simultaneous Multithreading (SMT)
n An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide issue processors (superscalars)
n SMT has the potential of greatly enhancing superscalar processor computational capabilities by
n Exploiting thread-level parallelism in a single processor core, simultaneously issuing, executing and retiring instructions from different threads during the same cycle
n A single physical SMT processor core acts as a number of logical processors each executing a single thread
n Providing multiple hardware contexts, hardware thread scheduling and context switching capability
n Providing effective long-latency hiding
n e.g., FP, branch misprediction, memory latency
18
Simultaneous Multithreading (SMT)
n Intel’s HyperThreading (2-way SMT)
n IBM Power7 (4/6/8 cores, 4-way SMT); IBM Power5/6 (2 cores - each 2-way SMT, 4 chips per package): Power5 has OoO cores, Power6 In-order cores;
n Basic ideas: Conventional MT + Simultaneous issue + Sharing common resources
19
[Figure: superscalar pipeline extended for SMT – multiple PCs and fetch units sharing the I-cache and decode, per-thread register renamers and register files, shared RS & ROB + physical register file, shared execution units (ALU 1, ALU 2, Fadd 2 cycles, Fmult 4 cycles, unpipelined Fdiv 16 cycles, load/store with variable latency), and a shared D-cache.]
Overview of SMT Hardware Changes
n For an N-way (N threads) SMT, we need:
n Ability to fetch from N threads
n N sets of registers (including PCs)
n N rename tables (RATs)
n N virtual memory spaces
n But we don’t need to replicate the entire OOO execution engine (schedulers, execution units, bypass networks, ROBs, etc.)
20
Multithreading: Classification
21
[Figure: execution time × functional units (FU1–FU4), slots colored by Thread 1–Thread 5 or unused, for: conventional superscalar (single-threaded), fine-grained multithreading (cycle-by-cycle interleaving), coarse-grained multithreading (block interleaving), simultaneous multithreading (SMT), and chip multiprocessor (CMP or multi-core).]
SMT Performance
n When it works, it fills idle “issue slots” with work from other threads; throughput improves
n But sometimes it can cause performance degradation!
22
Time(finish one task, then do the other) < Time(do both at the same time using SMT)
How?
n Cache thrashing
23
[Figure: Thread0 just fits in the level-1 caches (I$, D$) and executes reasonably quickly due to high cache hit rates. After a context switch to Thread1, Thread1 also fits nicely in the caches. Run together, the caches were just big enough to hold one thread's data, but not two threads' worth, so both threads now have significantly higher cache miss rates (spilling to L2). → Intel Smart Cache!]
Multithreading: How Many Threads?
n With more HW threads:
n Larger/multiple register files
n Replicated & partitioned resources → lower utilization, lower single-thread performance
n Shared resources → utilization vs. interference and thrashing
n Impact of MT/MC on memory hierarchy?
24
Source: Guz et al. "Many-Core vs. Many-Thread Machines: Stay Away From the Valley," IEEE COMPUTER ARCHITECTURE LETTERS, VOL. 8, NO. 1, 2009
SMT: Intel vs. ARM
n In 2010, ARM said it might include SMT in its chips in the future; however this was rejected for their 2012 64-bit design
n Intel conceded SMT will not be supported on its Silvermont processor cores in order to save power
25
Noel Hurley (VP of marketing and strategy in ARM's processor division) said ARM rejected SMT as an option. Although it can be used to hide the latency of memory accesses in parallel applications – a technique used heavily in GPUs – multithreading complicates the design of the pipeline itself. The tradeoff did not make sense for the engineering team, he said. ( http://www.techdesignforums.com/blog/2012/10/30/arm-64bit-cortex-a53-a57-launch/ )
MT Analysis by ARM for Mobile Applications
n Evaluation tests by ARM have shown that MT is not efficient for mobile devices
n Increasing performance by 50% will cost more than a 50% increase in power
n The performance of MT is much less predictable than that of multi-core solutions
n In MT, the implementation cost of ‘sleep mode’ becomes higher due to more sharing of resources between multiple threads
n For high-end mobile apps that require superscalar, OoO-based multi-core, single-threaded multi-core implementations such as big.LITTLE are the most efficient solution
26
[Chart: relative mW vs. relative DMIPS (scale 0–4) for Cortex-A7, Dual Cortex-A7, Cortex-A12, and an estimate of Cortex-A12 with MT.]
Source: http://www.arm.com/files/pdf/Multi-threading_Technology.pdf
Summary
n Limits to ILP (power efficiency, compilers, dependencies, …) seem to cap practical designs at 3- to 6-issue
n Data-level parallelism and/or thread-level parallelism is exploited to improve performance
n Coarse-grain vs. fine-grain multithreading
n Switch only on big stalls vs. on every clock cycle
n Simultaneous Multithreading is fine-grained multithreading based on an OoO superscalar microarchitecture
n Instead of replicating registers, reuse rename registers
27
Parallelism – Processor Level Parallelism
28
Beyond ILP (Instruction Level Parallelism)
n Performance is limited by the serial fraction
n Coarse-grain parallelism in the post-ILP era
n Thread, process and data parallelism
n Learn from the lessons of the parallel processing community
n Revisit the classifications and architectural techniques
29
[Figure: execution time split into a serial portion and a parallelizable portion for 1, 2, 3, and 4 CPUs.]
“Automatic” Parallelism in Modern Machines
n Bit-level parallelism
n Within floating point operations, etc.
n Instruction-level parallelism (ILP)
n Multiple instructions execute per clock cycle
n Memory system parallelism
n Overlap of memory operations with computation
n OS parallelism
n Multiple jobs run in parallel on commodity SMPs
Limits to all of these -- for very high performance, need user to identify, schedule and coordinate parallel tasks
30
Principles of Parallel Computing
n Finding enough parallelism (Amdahl’s Law)
n Granularity
n Locality
n Load balance
n Coordination and synchronization
n Performance modeling
31
All of these things make parallel programming even harder than sequential programming
Finding Enough Parallelism
n Suppose only part of an application seems parallel
n Amdahl’s lawn Let s be the fraction of work done sequentially, so (1-s) is fraction
parallelizablen P = number of processors
Speedup(P) = Time(1)/Time(P)
           = 1/(s + (1-s)/P)
           ≈ 1/s as P grows large
n Even if the parallel part speeds up perfectly, performance is limited by the sequential part
32
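A minimal C sketch of the formula above, with made-up example values for s and P:

#include <stdio.h>

/* Amdahl's law: speedup with sequential fraction s on P processors. */
static double speedup(double s, int p) {
    return 1.0 / (s + (1.0 - s) / p);
}

int main(void) {
    double s = 0.05;                       /* illustrative: 5% sequential work */
    for (int p = 4; p <= 1024; p *= 4)
        printf("P = %4d  speedup = %.2f (bound 1/s = %.1f)\n",
               p, speedup(s, p), 1.0 / s);
    return 0;
}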
Overhead of Parallelism
n Given enough parallel work, this is the biggest barrier to getting desired speedup
n Parallelism overheads include:
n cost of starting a thread or process
n cost of communicating shared data
n cost of synchronizing
n extra (redundant) computation
n Each of these can be in the range of milliseconds (=millions of flops) on some systems
n Tradeoff: Algorithm needs sufficiently large units of work to run fast in parallel (i.e. large granularity), but not so large that there is not enough parallel work
33
Locality and Parallelism
n Large memories are slow, fast memories are small
n Storage hierarchies are large and fast on average
n Parallel processors, collectively, have a large, fast cache
n The slow accesses to "remote" data we call "communication"
n Algorithm should do most work on local data
34
[Figure: conventional storage hierarchy – each processor with its own L1, L2, and L3 caches in front of memory; in a parallel machine, potential interconnects link the L3 caches and memories of different processors.]
Load Imbalance
n Load imbalance is the time that some processors in the system are idle due to
n Insufficient parallelism (during that phase)
n Unequal size tasks
n Examples of the latter
n Adapting to "interesting parts of a domain"
n Tree-structured computations
n Fundamentally unstructured problems
n Algorithm needs to balance load
35
Computer Architecture Classifications
Processor Organizations
n Single Instruction, Single Data Stream (SISD) → Uniprocessor
n Single Instruction, Multiple Data Stream (SIMD) → Vector processor, Array processor
n Multiple Instruction, Single Data Stream (MISD)
n Multiple Instruction, Multiple Data Stream (MIMD)
n Centralized Shared Memory Architecture (UMA)
n Distributed Memory Architecture → Distributed Shared Memory (NUMA), Message Passing
36
Multiprocessors
n Why do we need multiprocessors?
n Uniprocessor speed keeps improving
n But there are things that need even more speed
n Wait for a few years for Moore's law to catch up?
n Or use multiple processors and do it now?
n Multiprocessor software problem
n Most code is sequential (for uniprocessors)
n MUCH easier to write and debug
n Correct parallel code very, very difficult to write
n Efficient and correct is even harder
n Debugging even more difficult (Heisenbugs)
37
ø Heisenbug is a computer programming jargon term for a software bug that seems to disappear or alter its behavior when one attempts to study it. The term is a pun on the name of Werner Heisenberg, the physicist who first asserted the observer effect of quantum mechanics, which states that the act of observing a system inevitably alters its state.
MIMD Multiprocessors
38
Centralized Shared Memory Distributed Memory
Centralized-Memory Machines
n Also “Symmetric Multiprocessors” (SMP)
n "Uniform Memory Access" (UMA)
n All memory locations have similar latencies
n Data sharing through memory reads/writes
n P1 can write data to a physical address A, P2 can then read physical address A to get that data
n Problem: Memory Contention
n All processors share the one memory
n Memory bandwidth becomes the bottleneck
n Used only for smaller machines
n Most often 2, 4, or 8 processors
39
Distributed-Memory Machines
n Two kinds
n Distributed Shared-Memory (DSM)
n All processors can address all memory locations
n Data sharing like in SMP
n Also called NUMA (non-uniform memory access)
n Latencies of different memory locations can differ (local access faster than remote access)
n Message-Passing
n A processor can directly address only local memory
n To communicate with other processors, must explicitly send/receive messages
n Also called multicomputers or clusters
n Most accesses local, so less memory contention (can scale to well over 1,000 processors)
40
Another Classification
n Two Models for Communication and Memory Architecture
1. Communication occurs by explicitly passing messages among the processors: Message-passing multiprocessors
2. Communication occurs through a shared address space (via loads and stores): Shared-memory multiprocessors, either
n UMA (Uniform Memory Access time) for shared-address, centralized-memory MP
n NUMA (Non-Uniform Memory Access time) for shared-address, distributed-memory MP
n In the past, confusion whether "sharing" means sharing physical memory (Symmetric MP) or sharing address space
41
Process Coordination: Shared Memory vs. Message Passing
n Shared memory
n Efficient, familiar
n Not always available
n Potentially insecure
n Message passing
n Extensible to communication in distributed systems
42
Canonical syntax:
send (process : process_id, message : string)
receive (process : process_id, var message : string)

Shared-memory example:
global int x

process foo
begin
  ...
  x := ...
  ...
end foo

process bar
begin
  ...
  y := x
  ...
end bar
Message Passing Protocols
n Explicitly send data from one thread to another
n Need to track IDs of other CPUs
n Broadcast may need multiple sends
n Each CPU has its own memory space
n Hardware: send/recv queues between CPUs
n Program components can be run on the same or different systems, so can use 1,000s of processors.
n "Standard" libraries exist to encapsulate messages:
n Parasoft's Express (commercial)
n PVM (Parallel Virtual Machine, non-commercial)
n MPI (Message Passing Interface, also non-commercial)
43
[Figure: send/receive queues between CPU0 and CPU1.]
Message Passing Machines
n A cluster of computers
n Each with its own processor and memory
n An interconnect to pass messages between them
n Producer-Consumer Scenario:
n P1 produces data D, uses a SEND to send it to P2
n The network routes the message to P2
n P2 then calls a RECEIVE to get the message
n Two types of send primitives
n Synchronous: P1 stops until P2 confirms receipt of message
n Asynchronous: P1 sends its message and continues
n Standard libraries for message passing: most common is MPI – Message Passing Interface
44
Communication Performance
n Metrics for Communication Performance
n Communication Bandwidth
n Communication Latency
n Sender overhead + transfer time + receiver overhead
n Communication latency hiding
n Characterizing Applications
n Communication to Computation Ratio
n Work done vs. bytes sent over network
n Example: 146 bytes per 1000 instructions
45
Parallel Performance
n Serial sections
n Very difficult to parallelize the entire application
n Amdahl's law
n Large remote access latency (100s of ns)
n Overall IPC goes down
Speedup_Overall = 1 / ((1 - F_Parallel) + F_Parallel / Speedup_Parallel)

Speedup_Parallel = 1024, F_Parallel = 0.5  →  Speedup_Overall = 1.998
Speedup_Parallel = 1024, F_Parallel = 0.99 →  Speedup_Overall = 91.2

CPI = CPI_Base + RemoteRequestRate × RemoteRequestCost
CPI_Base = 0.4, RemoteRequestRate = 0.002
RemoteRequestCost = 400 ns / 0.33 ns per cycle = 1200 cycles
CPI = 0.4 + 0.002 × 1200 = 2.8
We need at least 7 processors just to break even!
This cost reduced with CMP/multi-core
46
Message Passing Pros and Cons
n Pros
n Simpler and cheaper hardware
n Explicit communication makes programmers aware of costly (communication) operations
n Cons
n Explicit communication is painful to program
n Requires manual optimization
n If you want a variable to be local and accessible via LD/ST, you must declare it as such
n If other processes need to read or write this variable, you must explicitly code the needed sends and receives to do this
47
Message Passing: A Program
n Calculating the sum of array elements

#define ASIZE 1024
#define NUMPROC 4

double myArray[ASIZE/NUMPROC];     /* must manually split the array */
double mySum = 0;
for (int i = 0; i < ASIZE/NUMPROC; i++)
    mySum += myArray[i];

if (myPID == 0) {                  /* "master" processor adds up partial sums and prints the result */
    for (int p = 1; p < NUMPROC; p++) {
        double pSum;
        recv(p, pSum);
        mySum += pSum;
    }
    printf("Sum: %lf\n", mySum);
} else {
    send(0, mySum);                /* "slave" processors send their partial results to master */
}
48
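For comparison, a minimal sketch of the same partial-sum pattern using the MPI library mentioned above (illustrative only; each rank simply fills its chunk with ones, and rank 0 acts as the master):

#include <mpi.h>
#include <stdio.h>

#define ASIZE 1024

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    double myArray[ASIZE];                         /* each rank only uses its own chunk */
    for (int i = 0; i < ASIZE; i++) myArray[i] = 1.0;

    double mySum = 0;
    for (int i = 0; i < ASIZE / nproc; i++)
        mySum += myArray[i];

    if (rank == 0) {                               /* master collects partial sums */
        for (int p = 1; p < nproc; p++) {
            double pSum;
            MPI_Recv(&pSum, 1, MPI_DOUBLE, p, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            mySum += pSum;
        }
        printf("Sum: %lf\n", mySum);
    } else {                                       /* slaves send partial results to master */
        MPI_Send(&mySum, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}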
Shared Memory Model
n The processors are all connected to a "globally available" memory, via either a SW or HW means
n The operating system usually maintains its memory coherence
n That's basically it…
n Need to fork/join threads, synchronize (typically locks)
49
[Figure: CPU0 writes X and CPU1 reads X through a globally shared main memory.]
Shared Memory Multiprocessors: Roughly Two Styles
n UMA (Uniform Memory Access)
n The time to access main memory is the same for all processors since they are equally close to all memory locations
n Machines that use UMA are called Symmetric Multiprocessors (SMPs)
n In a typical SMP architecture, all memory accesses are posted to the same shared memory bus
n Contention – as more CPUs are added, competition for access to the bus leads to a decline in performance
n Thus, scalability is limited to about 32 processors
50
[Figure: conceptual UMA model – several processors, each with a cache, connected through an interconnection network to a single shared memory.]
Shared Memory Multiprocessors: Roughly Two Styles
n NUMA (Non-Uniform Memory Access)
n Since memory is physically distributed, it is faster for a processor to access its own local memory than non-local memory (memory local to another processor or shared between processors)
n Unlike SMPs, all processors are not equally close to all memory locations
n A processor's own internal computations can be done in its local memory, leading to reduced memory contention
n Designed to surpass the scalability limits of SMPs
51
[Figure: NUMA – processors P1…Pn, each with a cache, a local memory, and a directory, connected by an interconnect.]
The "Interconnect" usually includes
§ a cache directory to reduce snoop traffic
§ a Remote Cache to reduce access latency (think of it as an L3)
Cache-Coherent NUMA systems (CC-NUMA) vs. Non-Cache-Coherent NUMA (NCC-NUMA)
Modern Multiprocessor System: Mixed NUMA & UMA
n In this complex hierarchical scheme, processors are grouped by their physical location on one or the other multi-core CPU package or “node”
n Processors within a node share access to memory modules as per the UMA shared memory architecture
n At the same time, they may also access memory from the remote node using a shared interconnect, but with slower performance as per the NUMA shared memory architecture
52
Source: intel http://software.intel.com/en-us/articles/optimizing-applications-for-numa
[Figure: two multi-core nodes; within each node, several processors with caches share a memory through a local interconnection network (UMA), and the two nodes are linked by a further interconnection network (NUMA).]
Shared Memory: A Program
n Calculating the sum of array elements

#define ASIZE 1024
#define NUMPROC 4

shared double array[ASIZE];        /* array is shared */
shared double allSum = 0;
shared mutex sumLock;

double mySum = 0;
for (int i = myPID*ASIZE/NUMPROC; i < (myPID+1)*ASIZE/NUMPROC; i++)
    mySum += array[i];             /* each processor sums up "its" part of the array */

lock(sumLock);                     /* each processor adds its partial sum to the final result */
allSum += mySum;
unlock(sumLock);

if (myPID == 0)                    /* "master" processor prints the result */
    printf("Sum: %lf\n", allSum);
53
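The same computation as a runnable pthreads sketch (an assumed illustration, not the slide's pseudocode): the array and the accumulated sum are shared by all threads, and a mutex protects the final accumulation.

#include <pthread.h>
#include <stdio.h>

#define ASIZE 1024
#define NUMPROC 4

static double array[ASIZE];                        /* shared array */
static double allSum = 0;                          /* shared result */
static pthread_mutex_t sumLock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    int id = *(int *)arg;
    double mySum = 0;
    for (int i = id * ASIZE / NUMPROC; i < (id + 1) * ASIZE / NUMPROC; i++)
        mySum += array[i];                         /* sum "my" part of the array */
    pthread_mutex_lock(&sumLock);                  /* protect the shared accumulation */
    allSum += mySum;
    pthread_mutex_unlock(&sumLock);
    return NULL;
}

int main(void) {
    pthread_t tid[NUMPROC];
    int id[NUMPROC];
    for (int i = 0; i < ASIZE; i++) array[i] = 1.0; /* sample data */
    for (int t = 0; t < NUMPROC; t++) {
        id[t] = t;
        pthread_create(&tid[t], NULL, worker, &id[t]);
    }
    for (int t = 0; t < NUMPROC; t++) pthread_join(tid[t], NULL);
    printf("Sum: %lf\n", allSum);
    return 0;
}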
Shared Memory Pros and Cons
n Pros
n Communication happens automatically
n More natural way of programming
n Easier to write correct programs and gradually optimize them
n No need to manually distribute data (but it can help if you do)
n Cons
n Needs more hardware support
n Easy to write correct, but inefficient programs (remote accesses look the same as local ones)
54
Communication/Connection Options for MPs
n Multiprocessors come in two main configurations: a single bus connection, and a network connection
n The choice of the communication model and the physical connection depends largely on the number of processors in the organization
n Notice that the scalability of NUMA makes it ideal for a network configuration
n UMA, however, is best suited to a bus connection
55
Category | Choice | Typical number of processors
Communication model | Message passing | 8 ~ thousands
Communication model | Shared address, UMA | 2 ~ 64
Communication model | Shared address, NUMA | 8 ~ 256
Physical connection | Network | 8 ~ thousands
Physical connection | Bus | 2 ~ 32
Focus on Shared Memory Multiprocessors…
n We are more interested in single chip multi-core processor architecture rather than MPP systems in Data Centers
n It implements a memory system with a single global physical address space (usually)
n Goal 1: Minimize memory latency
n Use co-location & caches
n Goal 2: Maximize memory bandwidth
n Use parallelism & caches
56
Focus on Shared Memory Multiprocessors: Let’s See
57
[Figure: ARM CoreLink CCN-504 Cache Coherent Network – four quad Cortex-A57 clusters, each with an L2 cache shared between its 4 cores; a snoop filter and an 8-16MB L3 cache shared between the 4 clusters; GIC-500, MMU-500, two CoreLink DMC-520 memory controllers (x72 DDR4-3200), NIC-400 network interconnects, and peripherals (10-40 GbE, PCIe, DSP, SATA, USB, crypto, DPI, flash, GPIO).]
Source: ARM (2013)
Cache Coherence Problem
n Cache coherent processors
n Reading processor must get the most current value
n Most current value is the last write
n Cache coherency problem
n Updates from one processor not known to others
n Mechanisms for maintaining cache coherency
n Coherency state associated with a block of data
n Bus/Interconnect operations on shared data change the state
n For the processor that initiates an operation
n For other processors that have the data of the operation resident in their caches
58
[Figure: P0 and P1 both load A (value 0) from memory into their caches; P0 then stores 1 to A in its own cache, while P1's cached copy and memory still hold the stale value 0.]
Possible Causes of Incoherence
n Sharing of writeable data
n Cause most commonly considered
n Process migration
n Can occur even if independent jobs are executing
n I/O
n Often fixed via OS cache flushes
59
Defining Coherent Memory System
n A memory system is coherent if
1. A read R from address X on processor P1 returns the value written by the most recent write W to X on P1, if no other processor has written to X between W and R.
2. If P1 writes to X and P2 reads X after a sufficient time, and there are no other writes to X in between, P2’s read returns the value written by P1’s write.
3. Writes to the same location are serialized: two writes to location X are seen in the same order by all processors.
60
Cache Coherence Definition
n Property 1. preserves program order
n It says that in the absence of sharing, each processor behaves as a uniprocessor would
n Property 2. says that any write to an address must eventually be seen by all processors
n If P1 writes to X and P2 keeps reading X, P2 must eventually see the new value
n Property 3. preserves causality
n Suppose X starts at 0. Processor P1 increments X and processor P2 waits until X is 1 and then increments it to 2. Processor P3 must eventually see that X becomes 2.
n If different processors could see writes in different orders, P2 can see P1's write and do its own write, while P3 first sees the write by P2 and then the write by P1. Now we have two processors that will forever disagree about the value of X.
61
Maintaining Cache Coherence
n Snooping Solution (Snoopy Bus):
n Send all requests for data to all processors
n Processors snoop to see if they have a copy and respond accordingly
n Requires broadcast, since caching information is at processors
n Works well with bus (natural broadcast medium)
n Dominates for small-scale machines (most of the market)
n Directory-Based Schemes
n Keep track of what is being shared in one centralized place
n Distributed memory → distributed directory (avoids bottlenecks)
n Send point-to-point requests to processors
n Scales better than snooping
n Actually existed BEFORE snoop-based schemes
62
Snooping vs. Directory-based (1/3)
n Snooping protocols tend to be faster, if enough bandwidth is available, since all transactions are a request/response seen by all processors
n The drawback is that snooping is not scalable
n Every request must be broadcast to all nodes in a system, meaning that as the system gets larger, the size of the (logical or physical) bus and the bandwidth it provides must grow
n In broadcast snoop systems the coherency traffic is proportional to N×(N-1), where N is the number of coherent masters
n For each master the broadcast goes to all other masters except itself, so coherency traffic for 1 master is proportional to N-1
63
Snooping vs. Directory-based (2/3)
n Directories, on the other hand, tend to have longer latencies (with a 3 hop request/forward/respond) but use much less bandwidth since messages are point to point and not broadcast
n In the best case if all shared data is shared only by two masters and we count the directory lookup and the snoop as separate transactions then traffic scales at order 2N
n In the worst case where all traffic is shared by all masters, a directory doesn't help and the traffic scales at order N×((N-1)+1) = N², where the '+1' is the directory lookup
n In reality, data is probably rarely shared amongst more than 2 masters except in certain special-case scenarios
n For this reason, many of the larger systems (>64 processors) use this type of cache coherence
64
Snooping vs. Directory-based (3/3)
n Actually, these two schemes are really two ends of a continuum of approaches
n A snoop based system can be enhanced with snoop filters that can filter out unnecessary broadcast snoops by using partial directories
n Thus snoop filters enable larger scaling of snoop-based systems
n A directory-based system is akin to a snoop-based system with perfect, fully populated snoop filters
65
Snooping
n Typically used for bus-based (SMP) multiprocessors
n Serialization on the bus used to maintain coherence property 3
n Two flavors
n Write-update (write broadcast)
n A write to shared data is broadcast to update all copies
n All subsequent reads will return the new written value (property 2)
n All see the writes in the order of broadcasts: one bus == one order seen by all (property 3)
n Write-invalidate
n Write to shared data forces invalidation of all other cached copies
n Subsequent reads miss and fetch new value (property 2)
n Writes ordered by invalidations on the bus (property 3)
66
Update vs. Invalidate
n A burst of writes by a processor to one address
n Update: each write sends an update
n Invalidate: possibly only the first invalidation is sent
n Writes to different words of a block
n Update: an update is sent for each word
n Invalidate: possibly only the first invalidation is sent
n Producer-consumer communication latency
n Update: producer sends an update, consumer reads the new value from its cache
n Invalidate: producer invalidates consumer's copy, consumer's read misses and has to request the block
n Which is better depends on the application
n But write-invalidate is simpler and implemented in most MP-capable processors today
67
Cache Coherency Protocols
n Invalidation-based protocols
n Simple 2-state write-through invalidate protocol
n 3-state (MSI) write-back invalidate protocol
n 4-state MESI write-back invalidate protocol
n 5-state MOESI write-back invalidate protocol
n And many variants
n Update-based protocols
n Dragon
n …
68
2-State Invalidate Protocols
n Write-through caches
n Invalidation-based protocols
n The snooping cache monitors the bus for writes
n If it detects that another processor has written to a block it is caching, it invalidates its copy
n This requires each cache controller to perform a tag match operation
n Cache tags can be made dual-ported
69
[State diagram: two states, Valid and Invalid.
Invalid → Valid on Load / OwnGETS or Store / OwnGETX
Valid self-edges: Load / --, Store / OwnGETX, OtherGETS / --
Valid → Invalid on OtherGETX / --
Invalid self-edges: OtherGETS / --, OtherGETX / --]
3-State Write-Back Invalidate Protocol
n 2-State Protocol
n + Simple hardware and protocol
n – Uses lots of bandwidth (every write goes on the bus!)
n 3-State Protocol (MSI)
n Modified
n One cache exclusively has the valid (modified) copy → owner
n Memory is stale
n Shared
n >= 1 cache and memory have a valid copy (memory = owner)
n Invalid (only memory has a valid copy and memory is owner)
n Must invalidate all other copies before entering modified state
n Requires bus transaction (order and invalidate)
70
MSI Processor and Bus Actions
n Processor Actions:
n Load: load data in the cache line
n Store: store data into the cache line
n Eviction: processor wants to replace the cache block
n Bus Actions:
n GETS: request to get data in shared state
n GETX: request for data in modified state (i.e., eXclusive access)
n UPGRADE: request for exclusive access to data owned in shared state
n Cache Controller Actions:
n Source: this cache provides the data to the requesting cache (your copy is more recent than the copy in memory)
n Writeback: this cache updates the block in memory
71
MSI Snoopy Protocol
72
All edges are labeled with the activity that causes the transition; any value after the / represents an action place on the bus. All edges not shown are self edges that perform no actions (or are actions that are not possible)
[State diagram: states Invalid, Shared, Modified.
Invalid → Shared on Load / GETS
Invalid → Modified on Store / GETX
Shared → Modified on Store / UPGRADE
Shared → Invalid on Eviction, or on observed GETX or UPGRADE
Modified → Shared on observed GETS / SOURCE, WRITEBACK
Modified → Invalid on Eviction / WRITEBACK; a further GETS / SOURCE edge appears in the figure]
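As an illustration only (assumed C pseudologic, not hardware), the MSI transitions above can be written as two small functions – one for the local processor's requests and one for snooped bus transactions:

#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } msi_state_t;
typedef enum { LOAD, STORE, EVICT } cpu_op_t;
typedef enum { BUS_GETS, BUS_GETX, BUS_UPGRADE } bus_op_t;

/* Next state for this cache's own processor request; prints any bus transaction issued. */
static msi_state_t on_cpu(msi_state_t s, cpu_op_t op) {
    switch (op) {
    case LOAD:
        if (s == INVALID) { puts("issue GETS"); return SHARED; }
        return s;                                   /* S or M: hit */
    case STORE:
        if (s == INVALID) { puts("issue GETX"); return MODIFIED; }
        if (s == SHARED)  { puts("issue UPGRADE"); return MODIFIED; }
        return MODIFIED;                            /* already M: hit */
    case EVICT:
        if (s == MODIFIED) puts("writeback");
        return INVALID;
    }
    return s;
}

/* Next state when a transaction from another cache is snooped on the bus. */
static msi_state_t on_snoop(msi_state_t s, bus_op_t op) {
    if (s == MODIFIED && op == BUS_GETS) { puts("source + writeback"); return SHARED; }
    if (s == MODIFIED && op == BUS_GETX) { puts("source data"); return INVALID; }
    if (s == SHARED && (op == BUS_GETX || op == BUS_UPGRADE)) return INVALID;
    return s;                                       /* otherwise unaffected */
}

int main(void) {
    msi_state_t s = INVALID;
    s = on_cpu(s, LOAD);           /* I -> S, issues GETS */
    s = on_cpu(s, STORE);          /* S -> M, issues UPGRADE */
    s = on_snoop(s, BUS_GETS);     /* M -> S, sources data and writes back */
    printf("final state = %d\n", s);
    return 0;
}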
4-State MESI Invalidation Protocol
n MSI + new state: "Exclusive"
n Data that is a clean and unique copy (matches memory)
n Benefit: bandwidth reduction
73
MSI cache line states: (I)nvalid, or Valid and either Clean – (S)hared – or Dirty – (M)odified
MESI cache line states: (I)nvalid, or Valid and either Clean – (S)hared or (E)xclusive – or Dirty – (M)odified
MESI Protocol
n Let's consider what happens if we read a block and then subsequently wish to modify it
n This will require two bus transactions using the 3-state MSI protocol
n But if we know that we have the only copy of the block, the transaction required to transition from state S to M is really unnecessary
n We could safely, and silently, transition from S to M
n The E → M transition doesn't require a bus transaction
n Improvement over MSI depends on the number of E → M transitions
74
MOESI Protocol
n MESI + new state: "Owned"
n Data that is both modified and shared
n Benefit: bandwidth reduction
75
MESI cache line states: (I)nvalid, or Valid and either Clean – (S)hared or (E)xclusive – or Dirty – (M)odified
MOESI cache line states: (I)nvalid, or Valid and either Clean – (S)hared or (E)xclusive – or Dirty – (M)odified or (O)wned
MOESI Protocol
n An important assumption:
n Cache-to-cache transfer is possible, so a cache with the data in the modified state can supply that data to another reader without transferring it to memory
n O(wned) state
n Other shared copies of this block exist, but memory is stale
n This cache (the owner) is responsible for supplying the data when it observes the relevant bus transaction
n This avoids the need to write modified data back to memory when another processor wants to read it
n Look at the M to S transition in the MSI protocol
76
Cache-to-Cache Transfers
n Problem
n P1 has block B in M state
n P2 wants to read B, puts a RdReq on the bus
n If P1 does nothing, memory will supply the data to P2
n What does P1 do?
n Solution 1: abort/retry
n P1 cancels P2's request, issues a write back
n P2 later retries RdReq and gets data from memory
n Too slow (two memory latencies to move data from P1 to P2)
n Solution 2: intervention
n P1 indicates it will supply the data ("intervention" bus signal)
n Memory sees that, does not supply the data, and waits for P1's data
n P1 starts sending the data on the bus, memory is updated
n P2 snoops the transfer during the write-back and gets the block
77
Cache-to-Cache Transfers
n Intervention works if some cache has data in M state
n Nobody else has the correct data, so it is clear who supplies the data
n What if a cache has the requested data in S state?
n There might be others who have it; who should supply the data?
n Solution 1: let memory supply the data
n Solution 2: whoever wins arbitration supplies the data
n Solution 3: a separate state similar to S that indicates there are maybe others who have the block in S state, but if anybody asks for the data we should supply it
78
Coherence in Distributed Memory Multiprocessors
n Distributed memory systems are typically larger → bus-based snooping may not work well
n Option 1: software-based mechanisms – message-passing systems or software-controlled cache coherence
n Option 2: hardware-based mechanisms – directory-based cache coherence
79
Directory-Based Cache Coherence
n Typically in distributed shared memory
n For every local memory block, local directory has an entry
n Directory entry indicates
n Who has cached copies of the block
n In what state they have the block
80
[Figure: four nodes, each with a processor and caches, memory, I/O, and a directory, connected by an interconnection network.]
Basic Directory Scheme
n Read from main memory by processor i:
n If dirty-bit OFF then { read from main memory; turn p[i] ON; }
n If dirty-bit ON then { recall line from dirty proc (cache state to shared); update memory; turn dirty-bit OFF; turn p[i] ON; supply recalled data to i; }
n Write to main memory by processor i:
n If dirty-bit OFF then { supply data to i; send invalidations to all caches that have the block; turn dirty-bit ON; turn p[i] ON; ... }
n K processors
n With each cache-block in memory: K presence bits, 1 dirty-bit
n With each cache-block in cache: 1 valid bit, and 1 dirty (owner) bit
81
[Figure: processors P1…Pn, each with a cache, connected by an interconnect to memory; the directory holds, per memory block, presence bits and a dirty bit.]
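A minimal C sketch (assumed illustration) of the per-block directory entry and the read/write handling described above, using a 32-bit presence vector (so up to 32 processors):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* One directory entry per memory block. */
typedef struct {
    bool     dirty;               /* one cache holds an exclusive (modified) copy */
    uint32_t presence;            /* bit i set => processor i has a cached copy */
} dir_entry_t;

/* Read of a block by processor i (logic from the slide). */
static void dir_read(dir_entry_t *e, int i) {
    if (e->dirty) {
        /* recall line from the dirty processor (its cache state goes to shared),
           update memory, then clear the dirty bit */
        e->dirty = false;
    }
    e->presence |= 1u << i;       /* turn p[i] ON; supply the data to i */
}

/* Write by processor i: invalidate all other sharers, mark the block dirty. */
static void dir_write(dir_entry_t *e, int i) {
    /* send invalidations to every cache whose presence bit is set (except i) */
    e->presence = 1u << i;        /* only the writer keeps a copy */
    e->dirty = true;
}

int main(void) {
    dir_entry_t e = { false, 0 };
    dir_read(&e, 1);              /* P1 reads: presence = {P1} */
    dir_write(&e, 2);             /* P2 writes: invalidate P1, dirty, presence = {P2} */
    dir_read(&e, 3);              /* P3 reads: recall from P2, clean, presence = {P2, P3} */
    printf("dirty=%d presence=0x%x\n", e.dirty, (unsigned)e.presence);
    return 0;
}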
Directory Protocol
n Similar to Snoopy Protocol: three states
n Shared: ≥ 1 processors have data, memory up-to-date
n Uncached: no processor has it; not valid in any cache
n Exclusive: 1 processor (owner) has data; memory out-of-date
n Terms: typically 3 processors involved
n Local node: where a request originates
n Home node: where the memory location of an address resides
n Remote node: has a copy of a cache block, whether exclusive or shared
82
(Execution) Latency vs. Bandwidth
n Desktop processing
n Typically want an application to execute as quickly as possible (minimize latency)
n Server/Enterprise processing
n Often throughput oriented (maximize bandwidth)
n Latency of individual task less important
n ex. Amazon processing thousands of requests per minute: it’s ok if an individual request takes a few seconds more so long as total number of requests are processed in time
83
Implementing MP Machines
n One approach: add sockets to your MOBO
n Minimal changes to existing CPUs
n Power delivery, heat removal and I/O not too bad since each chip has its own set of pins and cooling
84
[Figure: four CPU sockets (CPU0–CPU3) on one motherboard.]
Chip-Multiprocessing
n Simple SMP on the same chip
85
Intel “Smithfield” Block Diagram AMD Dual-Core Athlon FX
Shared Caches
n Resources can be shared between CPUs
n ex. IBM Power5
[Figure: CPU0 and CPU1 share the L2 cache (no need to keep two copies coherent); the L3 cache is also shared (only tags are on-chip; data are off-chip).]
86
Benefits?
n Cheaper than mobo-based SMP
n All/most interface logic integrated onto the main chip (fewer total chips, single CPU socket, single interface to main memory)
n Less power than mobo-based SMP as well (communication on-die is more power-efficient than chip-to-chip communication)
n Performance
n On-chip communication is faster
n Efficiency
n Potentially better use of hardware resources than trying to make a wider/more OOO single-threaded CPU
87
Performance vs. Power
n 2x CPUs not necessarily equal to 2x performance
n 2x CPUs → ½ power for each
n Maybe a little better than ½ if resources can be shared
n Back-of-the-envelope calculation:
n 3.8 GHz CPU at 100 W
n Dual-core: 50 W per CPU
n P ∝ V³: V_orig³ / V_CMP³ = 100 W / 50 W → V_CMP ≈ 0.8 V_orig
n f ∝ V: f_CMP ≈ 3.0 GHz
88
Benefit of SMP: Full power budget per socket!
Summary
n Cache Coherence
n Coordinate accesses to shared, writeable data
n Coherence protocol defines cache line states, state transitions, actions
n Snooping and Directory Protocols similar; bus makes snooping easier because of broadcast (snooping ⇒ uniform memory access)
n Directory has an extra data structure to keep track of the state of all cache blocks
n Synchronization
n Locks and ISA support for atomicity
n Memory Consistency
n Defines programmers' expected view of memory
n Sequential consistency imposes ordering on loads/stores
89
Multi-core
90
Multi-core Architectures
n SMPs on a single chip
n Chip Multi-Processors (CMP)
n Pros
n Efficient exploitation of available transistor budget
n Improves throughput and speed of parallelized applications
n Allows tight coupling of cores
n Better communication between cores than in SMP
n Shared caches
n Low power consumption
n Low clock rates
n Idle cores can be suspended
n Cons
n Only improves speed of parallelized applications
n Increased gap to memory speed
91
Multi-core Architectures
n Design decisions
n Homogeneous vs. Heterogeneous
n Specialized accelerator cores
n SIMD
n GPU operations
n Cryptography
n DSP functions (e.g. FFT)
n FPGA (programmable circuits)
n Access to memory
n Own memory area (distributed memory)
n Via cache hierarchy (shared memory)
n Connection of cores
n Internal bus / crossbar connection
n Cache architecture
92
Multi-core Architectures: Examples
[Figure (left): homogeneous multi-core with shared caches and a crossbar – four cores with private L1s, pairs sharing an L2, a shared L3, two memory modules, and I/O.
Figure (right): heterogeneous multi-core with caches, local stores, and a ring bus – one 2x SMT core with L1 and L2 plus four cores with local stores, a memory module, and I/O.]
93
Shared Cache Design
94
[Figure: traditional design – multiple single-core chips, each with its own L1 and switch, sharing an off-chip L2 in front of main memory – versus a multi-core architecture with the shared L2 cache on-chip.]
Is a Multi-core really better off?
95
DEEP BLUE
480 chess chips, can evaluate 200,000,000 moves per second!!
IBM Watson Jeopardy! Competition (2011.2)
n POWER7 chips (2,880 cores) + 16TB memory
n Massively parallel processing
n Combine: Processing power, Natural language processing, AI, Search, Knowledge extraction
96
Major Challenges for Multi-core Designs
n Communication
n Memory hierarchy
n Data allocation (you have a large shared L2/L3 now)
n Interconnection network
n AMD HyperTransport
n Intel QPI
n Scalability
n Bus bandwidth – how to get there?
n Power-Performance — win or lose?
n Borkar's multi-core arguments
n 15% per-core performance drop → 50% power saving
n Giant, single core wastes power when the task is small
n How about leakage?
n Process variation and yield
n Programming Model
97
Intel Core 2 Duo
n Homogeneous cores
n Bus based on chip interconnect
n Shared on-die Cache Memory
n Traditional I/O
98
Classic OOO: Reservation Stations, Issue ports, Schedulers…etc
Large, shared set associative, prefetch, etc.
Source: Intel Corp.
Core 2 Duo Microarchitecture
99
Why Sharing on-die L2?
n What happens when L2 is too large?
100
101
CoreTM μArch — Wide Dynamic Execution
102
CoreTM μArch — Wide Dynamic Execution
CoreTM μArch — MACRO Fusion
n Common “Intel 32” instruction pairs are combined
n 4-1-1-1 decoder that sustains 7 μop’s per cycle
n 4+1 = 5 “Intel 32” instructions per cycle
103
Micro(-ops) Fusion (from Pentium M)
n A misnomer..
n Instead of breaking up an Intel32 instruction into μop, they decide not to break it up…
n A better naming scheme would call the previous techniques — “IA32 fission”
n To fuse
n Store-address and store-data μops
n Load-and-op μops (e.g. ADD (%esp), %eax)
n Extend each RS entry to take 3 operands
n To reduce
n Micro-ops (10% reduction in the OOO logic)
n Decoder bandwidth (a simple decoder can decode fusion-type instructions)
n Energy consumption
n Performance improved by 5% for INT and 9% for FP (Pentium M data)
104
105
Smart Memory Access
AMD Quad-Core Processor (Barcelona)
n True 128-bit SSE (as opposed 64 in prior Opteron)
n Sideband Stack Optimizer
n Parallelizes many POPs and PUSHes (which were dependent on each other)
n Converts them into pure load/store instructions
n No uops in FUs for stack pointer adjustment
106
On a different power plane from the cores
ø Source: AMD
Barcelona’s Cache Architecture
107
ø Source: AMD
108
Intel Penryn Dual-Core (First 45nm mprocessor)
• High-K dielectric metal gate
• 47 new SSE4 instructions
• Up to 12MB L2
• > 3GHz
ø Source: Intel
109
Intel Arrandale Processor
• 32nm
• Unified 3MB L3
• Power sharing (Turbo Boost) between cores and GFX via DFS
110
AMD 12-Core “Magny-Cours” Opteron
n 45nm
n 4 memory channels
111
IBM Power8
112
Cores
§ 12 cores (SMT8)
§ 8 dispatch, 10 issue, 16 exec pipes
§ 2X internal data flows/queues
§ Enhanced prefetching
§ 64K data cache, 32K instruction cache

Accelerators
§ Crypto & memory expansion
§ Transactional Memory
§ VMM assist
§ Data Move / VM Mobility

Caches
§ 512KB SRAM L2 / core
§ 96MB eDRAM shared L3
§ Up to 128MB eDRAM L4 (off-chip)

Memory
§ Up to 230 GB/s sustained bandwidth

Bus Interfaces
§ Durable open memory attach interface
§ Integrated PCIe Gen3
§ SMP Interconnect
§ CAPI (Coherent Accelerator Processor Interface)

Energy Management
§ On-chip Power Management Micro-controller
§ Integrated Per-core VRM
§ Critical Path Monitors

Technology
§ 22nm SOI, eDRAM, 15 ML, 650mm2
IBM Power8 Core
113
Execution Improvement vs. POWER7
§ SMT4 → SMT8
§ 8 dispatch
§ 10 issue
§ 16 execution pipes: 2 FXU, 2 LSU, 2 LU, 4 FPU, 2 VMX, 1 Crypto, 1 DFU, 1 CR, 1 BR
§ Larger issue queues (4x16-entry)
§ Larger global completion, Load/Store reorder
§ Improved branch prediction
§ Improved unaligned storage access

Larger Caching Structures vs. POWER7
§ 2x L1 data cache (64 KB)
§ 2x outstanding data cache misses
§ 4x translation cache

Wider Load/Store
§ 32B → 64B L2 to L1 data bus
§ 2x data cache to execution dataflow bus

Enhanced Prefetch
§ Instruction speculation awareness
§ Data prefetch depth awareness
§ Adaptive bandwidth awareness
§ Topology awareness

Core Performance vs. POWER7
~1.6x Single Thread
~2x Max SMT
POWER8 On Chip Caches
n L2: 512 KB 8 way per core
n L3: 96 MB (12 x 8 MB 8 way Bank)
n "NUCA" cache policy (Non-Uniform Cache Architecture)
n Scalable bandwidth and latency
n Migrate "hot" lines to local L2, then local L3 (replicate L2-contained footprint)
n Chip interconnect: 150 GB/sec per direction per segment
114
Cache Bandwidths
n GB/sec shown assuming 4 GHz
n Product frequency will vary based on model type
n Across the 12-core chip
n 4 TB/sec L2 BW
n 3 TB/sec L3 BW
115
POWER8 Memory Organization
n Up to 8 high-speed channels, each running up to 9.6 GB/s, for up to 230 GB/s sustained
n Up to 32 total DDR ports yielding 410 GB/s peak at the DRAM
n Up to 1 TB memory capacity per fully configured processor socket (at initial launch)
116
POWER8 Memory Buffer Chip
n Intelligence Moved into Memory
n Scheduling logic, caching structures
n Energy management, RAS decision point
n Formerly on processor
n Moved to memory buffer
n Processor Interface
n 9.6 GB/s high-speed interface
n More robust RAS
n "On-the-fly" lane isolation/repair
n Extensible for innovation build-out
n Performance Value
n End-to-end fastpath and data retry (latency)
n Cache → latency/bandwidth, partial updates
n Cache → write scheduling, prefetch, energy
n 22nm SOI for optimal performance/energy
n 15 metal levels (latency, bandwidth)
117
POWER8 Integrated PCIe Gen 3
n Native PCI Gen 3 Support
n Direct processor integration
n Replaces proprietary GX/Bridge
n Low latency
n High Gen 3 bandwidth (8 Gb/s) → high utilization realizable
n Transport Layer for CAPI Protocol
n Coherently Attach Devices connected via PCI
n Protocol encapsulated in PCI
118
POWER8 CAPI (Coherence Attach Processor Interface)
n Virtual Addressing
n Accelerator can work with the same memory addresses that the processor uses
n Pointers de-referenced same as the host application
n Removes OS & device driver overhead
n Hardware Managed Cache Coherence
n Enables the accelerator to participate in "locks" as a normal thread
n Lowers latency over an IO communication model
119
n Customizable Hardware Application Accelerator
n Specific system SW, middleware, or user application
n Written to durable interface provided by PSL
n Processor Service Layer (PSL)
n Presents robust, durable interfaces to applications
n Offloads complexity / content from CAPP
n PCI Gen 3: transport for encapsulated messages
POWER8
n Significant performance at thread, core and system
n Optimization for VM density & efficiency
n Strong enablement of autonomic system optimization
n Excellent Big Data analytics capability
120
Summary
n High frequency -> high power consumption
n Trend towards multiple cores on chip
n Broad spectrum of designs: homogeneous, heterogeneous, specialized, general purpose, number of cores, cache architectures, local memories, simultaneous multithreading, …
n Problem: memory latency and bandwidth
121
ARM MPCore Intra-Cluster Coherency Technology & ACE
122
ARM MPCore Intra-Cluster Coherency Technology
n ARM introduced MPCoreTM multi-core coherency technology in the ARM11 MPCore and subsequently in the Cortex-A5 and Cortex-A9 MPCore, which enables cache-coherency within a cluster of 2 to 4 processors
123
[Figure: ARM11 MPCore cluster – a configurable number (between 1 and 4) of symmetric CPUs, each CPU/VFP with its own L1 memory, connected to the Snoop Control Unit (SCU) over a 64-bit I&D bus and a coherency control bus; per-CPU timer, watchdog, CPU interface and IRQ, an interrupt distributor, private FIQ lines, and configurable HW interrupt lines.]
ARM MPCore Intra-Cluster Coherency Technology
n In Cortex-A15 MPCore, it is extended with AMBA 4 ACE coherency capability and thus supports
n Multiple CPU clusters, enabling systems containing more than 4 cores
n Heterogeneous systems consisting of multiple CPUs and cached accelerators
n An improved MESI protocol:
n Enables direct cache-to-cache copy of clean data and direct cache-to-cache move of dirty data within the cluster, without the write back to memory required in a normal MESI-based processor
n Further enhanced by the 'Snoop Control Unit' (SCU), which maintains a copy of all L1 data cache tag RAMs acting as a local, low-latency directory, enabling it to direct transfers only to the L1 caches as needed
n This increases performance, because unnecessary snoop traffic to the L1 caches would otherwise reduce the processor's own access to the L1 cache and so increase effective L1 access latency
124
ARM MPCore Intra-Cluster Coherency Technology
n Also supports an optional Accelerator Coherency Port (ACP), which enables un-cached accelerators access to the processor cache hierarchy, enabling ‘one-way’ coherency where the accelerator can read and write data within the CPU caches without a write-back to RAM
n But ACP cannot support cached accelerators since the CPU has no way to snoop accelerator caches, and the accelerator caches may contain stale data if the CPU writes accelerator-cached data
n Effectively the ACP acts like an additional master port into the SCU and the ACP interface consists of a regular AXI3 slave interface
125
Different meanings of “protocol”
n Cache coherent protocols
n System communication policies
n ACE protocol
n Interface communication protocol
n Interconnect responsibilities
n The ACE protocol does not guarantee coherency ⇒ ACE is a support for coherency
126
Different Kinds of Components
n Interconnect: called CCI (Cache Coherent Interconnect)
n ACE Masters: masters with caches
n ACE-Lite Masters: components without caches snooping other caches
n ACE-Lite/AXI Slaves: components not initiating snoop transactions
127
[Figure: the ARM CoreLink CCN-504 system shown earlier – quad Cortex-A57 clusters with per-cluster L2 caches, snoop filter and 8-16MB L3 cache, GIC-500, MMU-500, DMC-520 memory controllers, NIC-400 interconnects, and I/O.]
ACE Cache Coherency States
n ACE states of a cache line: 5-state cache model
n Each cache line is either Valid or Invalid
n The ACE states can be mapped directly onto the MOESI cache coherency model states; however, ACE is designed to support components that use a variety of internal cache state models, including MESI, MOESI, MEI and others
128
[Figure: ACE 5-state model – a line is Invalid or Valid; a Valid line is Unique or Shared and Clean or Dirty, giving UniqueDirty (UD), SharedDirty (SD), UniqueClean (UC), SharedClean (SC), plus Invalid.]
ARM ACE | MOESI | ACE Meaning
UniqueDirty | M (Modified) | Not shared, dirty, must be written back
SharedDirty | O (Owned) | Shared, dirty, must be written back to memory
UniqueClean | E (Exclusive) | Not shared, clean
SharedClean | S (Shared) | Shared, no need to write back, may be clean or dirty
Invalid | I (Invalid) | Invalid
ACE Cache Coherency States
n ACE does not prescribe the cache states a component can use → some components may not support all ACE transactions
n ARM Cortex-A15 MPCore internally uses MESI states for the L1 data cache, meaning the cache cannot be in the SharedDirty (Owned) state
n To emphasize that ACE is not restricted to the MOESI cache state model, ACE does not use the familiar MOESI terminology
129
[Figure and table: the same ACE 5-state diagram and ACE-to-MOESI mapping as on the previous slide.]
ACE Design Principle
n Lines held in more than one cache must be held in the Shared state
n Only one copy can be in the SharedDirty state, and that is the one that is responsible for updating memory
n Devices are not required to support all 5 states in the protocol internally → flexible
n System interconnect is responsible for coordinating the progress of all shared (coherent) transactions and can handle these in various manners, e.g.
n The interconnect may present snoop addresses to all masters in parallel simultaneously, or it may present snoop addresses one at a time serially
130
ACE Design Principle
n System interconnect may choose either
n to perform speculative reads to lower latency,
n or to wait until snoop responses have been received, to reduce system power consumption by minimizing external memory reads
n The interconnect may include a directory or snoop filter, or it may broadcast snoops to all masters
n ACE has been designed to enable performance and power optimizations by avoiding wherever possible unnecessary external memory accesses
n ACE facilitates direct master-to-master data transfer wherever possible
131
ACE Additional Signals and Channels
n AMBA 4 ACE is backwards-compatible with AMBA 4 AXI adding additional signals and channels to the AMBA 4 AXI interface
n The AXI interface consists of 5 channels
n In AXI, the read and write channels each have their own dedicated address and control channel
n The BRESP channel is used to indicate the completion of write transactions
132
[Figure: the five AXI channels – read address (ARADDR), read data (RDATA), write address (AWADDR), write data (WDATA), and write response (BRESP).]
ACE Additional Signals and Channels
133
AXI4 Channel | Signal | Source | Description
Read address | ARDOMAIN[1:0] | Master | Indicates the shareability domain of a read transaction
Read address | ARSNOOP[3:0] | Master | Indicates the transaction type for Shareable read transactions
Read address | ARBAR[1:0] | Master | Indicates a read barrier transaction
Write address | AWDOMAIN[1:0] | Master | Indicates the shareability domain of a write transaction
Write address | AWSNOOP[2:0] | Master | Indicates the transaction type for Shareable write transactions
Write address | AWBAR[1:0] | Master | Indicates a write barrier transaction
Write address | AWUNIQUE | Master | Indicates that a line is permitted to be held in a Unique state
Read data | RRESP[3:2] | Slave | Read response. The additional read response bits provide the information required to complete a Shareable read transaction.
ACE Additional Signals and Channels
n 3 new channels are supported, these are the snoop address channel, the snoop data channel, and the snoop response channel
n The snoop address (AC) channel is an input to a cached master that provides the address and associated control information for snoop transactions
n The snoop response (CR) channel is an output channel from a cached master that provides a response to a snoop transaction
n Every snoop transaction has a single response associated with it
n The snoop response indicates if an associated data transfer on the CD channel is expected
n The snoop data (CD) channel is an optional output channel that passes snoop data out from a master
n Typically, this occurs for a read or clean snoop transaction when the master being snooped has a copy of the data available to return
134
ACE Additional Signals and Channels
135
ACE-specific Channel | Signal | Source | Description
Snoop address | ACVALID | Slave | Snoop address and control information is valid
Snoop address | ACREADY | Master | Snoop address ready
Snoop address | ACADDR[ac-1:0] | Slave | Snoop address
Snoop address | ACSNOOP[3:0] | Slave | Snoop transaction type
Snoop address | ACPROT[2:0] | Slave | Snoop protection type
Snoop response | CRVALID | Master | Snoop response valid
Snoop response | CRREADY | Slave | Snoop response ready
Snoop response | CRRESP[4:0] | Master | Snoop response
Snoop data | CDVALID | Master | Snoop data valid
Snoop data | CDREADY | Slave | Snoop data ready
Snoop data | CDDATA[cd-1:0] | Master | Snoop data
Snoop data | CDLAST | Master | Indicates the last data transfer of a snoop transaction
(Acknowledge) | RACK | Master | Read acknowledge
(Acknowledge) | WACK | Master | Write acknowledge
ACE Transactions
136
n Non-shared: Read, Write
n Read Shareable: ReadClean, ReadNotSharedDirty, ReadShared
n Non-cached: ReadOnce, WriteUnique, WriteLineUnique
n Write Shareable: ReadUnique, CleanUnique, MakeUnique
n Cache Maintenance: CleanShared, CleanInvalid, MakeInvalid
n Memory update: WriteBack, WriteClean, Evict
n (The original figure also marks the ACE-Lite transaction subset and the role of Evict with a snoop filter)
Summary: ACE
n ACE states of a cache line: 5-state cache model
n ACE channels
n Read channels (AR, R)
n Write channels (AW, W, B)
n Snoop channels (AC, CR, CD)
n ACE supported policies
n 100% snoop
n Directory based
n Anything in between (snoop filter)
137
[Figure: ACE state diagram — ACE_UD, ACE_UC, ACE_SD, ACE_SC, ACE_I arranged along the Valid/Invalid, Unique/Shared, and Dirty/Clean axes]
Appendix
138
Non-Uniform Cache Architecture
n Proposed by UT-Austin at ASPLOS 2002
n Facts
n Large shared on-die L2
n Wire delay dominates on-die cache access time
139
n L2 access latency scaling with technology:
n 1 MB @ 180 nm (1999): 3 cycles
n 4 MB @ 90 nm (2004): 11 cycles
n 16 MB @ 50 nm (2010): 24 cycles
140
Multi-banked L2 cache
n 2 MB @ 130 nm, bank size = 128 KB
n Access latency: 11 cycles (bank access time = 3 cycles, interconnect delay = 8 cycles)
141
Multi-banked L2 cache
n 16 MB @ 50 nm, bank size = 64 KB
n Access latency: 47 cycles (bank access time = 3 cycles, interconnect delay = 44 cycles)
142
Static NUCA-1
n Use private per-bank channel
n Each bank has its distinct access latency
n Statically decide data location for a given address (see the address-to-bank sketch below)
n Average access latency = 34.2 cycles
n Wire overhead = 20.9% → an issue
[Figure: S-NUCA-1 organization — per-bank private address/data buses from the tag array to the banks; each bank contains sub-banks with a predecoder, wordline drivers/decoders, and sense amplifiers]
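A minimal sketch of the S-NUCA idea (illustrative only; the bank count, index bits, and per-bank latency table are made-up values, not figures from the paper): the bank a line maps to is a fixed function of its address, and each bank has its own fixed access latency.

#include <stdint.h>

#define NUM_BANKS 16
#define LINE_BITS 6            /* 64-byte cache lines (assumed) */

/* Assumed per-bank latencies: banks closer to the controller respond faster. */
static const int bank_latency[NUM_BANKS] = {
     9, 11, 13, 15, 17, 19, 21, 23,
    25, 27, 29, 31, 33, 35, 37, 39
};

/* S-NUCA: the bank is chosen statically from the address bits. */
static int nuca_bank(uint64_t addr) {
    return (addr >> LINE_BITS) & (NUM_BANKS - 1);
}

/* Access latency therefore depends only on where the address maps. */
static int nuca_latency(uint64_t addr) {
    return bank_latency[nuca_bank(addr)];
}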
Static NUCA-2
n Use a 2D switched network to alleviate wire area overhead
n Average access latency = 24.2 cycles
n Wire overhead = 5.9%
143
[Figure: S-NUCA-2 organization — banks connected by a 2D switched network; switches, data bus, tag array, predecoder, and wordline drivers/decoders]
Multiprocessors
n Shared-memory Multiprocessors
n Provide a shared-memory abstraction
n Enables a familiar and efficient programmer interface
144
[Figure: four processors P1–P4, each with a cache, memory, and network interface, connected by an interconnection network]
Processors and Memory – UMA
n Uniform Memory Access (UMA)
n Access all memory locations with the same latency
n Pros: Simplifies software. Data placement does not matter
n Cons: Lowers peak performance. Latency defined by worst case
n Implementation: Bus-based UMA for symmetric multiprocessor (SMP)
145
[Figure: UMA organization — four CPUs (each with cache) and memory modules on a shared interconnect]
Processors and Memory – NUMA
n Non-Uniform Memory Access (NUMA)
n Access local memory locations faster
n Pros: Increases peak performance
n Cons: Increases software complexity; data placement matters
n Implementation: Network-based NUMA with various network topologies, which require routers (R)
146
[Figure: NUMA organization — four CPUs (each with cache), each with local memory, connected through routers (R)]
Networks and Topologies
n Shared Networks
n Every CPU can communicate with every other CPU via a bus or crossbar
n Pros: lower latency
n Cons: lower bandwidth and more difficult to scale with processor count (e.g., 16)
n Point-to-Point Networks
n Every CPU can talk to specific neighbors (depending on topology)
n Pros: higher bandwidth and easier to scale with processor count (e.g., 100s)
n Cons: higher multi-hop latencies
147
[Figure: shared network (CPUs and memories on one interconnect with routers) vs. point-to-point network (each CPU/memory node with its own router linked to neighboring nodes)]
Topology 1 – Bus
n Network Topology
n Defines organization of network nodes
n Topologies differ in connectivity, latency, bandwidth, and cost
n Notation: f(1) denotes constant independent of p, f(p) denotes linearly increasing cost with p, etc.
n Bus
n Direct interconnect style
n Latency: f(1) wire delay
n Bandwidth: f(1/p), not scalable (p <= 4)
n Cost: f(1) wire cost
n Supports ordered broadcast only
148
Topology 2 – Crossbar Switch
n Network Topology
n Defines organization of network nodes
n Topologies differ in connectivity, latency, bandwidth, and cost
n Notation: f(1) denotes constant independent of p, f(p) denotes linearly increasing cost with p, etc.
n Crossbar Switch
n Indirect interconnect
n Switches implemented as big multiplexors
n Latency: f(1) constant latency
n Bandwidth: f(1)
n Cost: f(2P) wires, f(P^2) switches
149
Topology 3 – Multistage Network
n Network Topology
n Defines organization of network nodes
n Topologies differ in connectivity, latency, bandwidth, and cost
n Notation: f(1) denotes constant independent of p, f(p) denotes linearly increasing cost with p, etc.
n Multistage Network
n Indirect interconnect
n Routing done by address decoding
n k: switch arity (number of inputs or outputs)
n d: number of network stages = log_k P
n Latency: f(d)
n Bandwidth: f(1)
n Cost: f(d×P/k) switches, f(P×d) wires
n Commonly used in large UMA systems
150
Topology 4 – 2D Torus
n Network Topology
n Defines organization of network nodes
n Topologies differ in connectivity, latency, bandwidth, and cost
n Notation: f(1) denotes constant independent of p, f(p) denotes linearly increasing cost with p, etc.
n 2D Torus
n Direct interconnect
n Latency: f(P^(1/2)) (a small hop-count sketch follows)
n Bandwidth: f(1)
n Cost: f(2P) wires
n Scalable and widely used
n Variants: 1D torus, 2D mesh, 3D torus
151
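As a small illustration of why 2D-torus latency scales as f(P^(1/2)) (a sketch assuming a square sqrt(P) x sqrt(P) layout with dimension-order routing; not from the slides), the hop count between two nodes is the wraparound Manhattan distance, which is at most about sqrt(P):

/* Hop count between nodes (x0,y0) and (x1,y1) on an n x n torus (P = n*n nodes).
 * Each dimension can wrap around, so the per-dimension distance is at most n/2,
 * giving a worst-case path of roughly n = sqrt(P) hops. */
static int torus_dist_1d(int a, int b, int n) {
    int d = a > b ? a - b : b - a;
    return d < n - d ? d : n - d;   /* take the shorter way around the ring */
}

static int torus_hops(int x0, int y0, int x1, int y1, int n) {
    return torus_dist_1d(x0, x1, n) + torus_dist_1d(y0, y1, n);
}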
Challenges in Shared Memory
n Cache Coherence
n “Common Sense”
n Synchronization
n Atomic read/write operations
n Memory Consistency
n What behavior should programmers expect from shared memory?
n Provide a formal definition of memory behavior to the programmer
n Example: When will a written value be seen?
n Example: P1-Write[X] <<10ps>> P2-Read[X]. What happens?
152
P1-Read[X] → P1-Write[X] → P1-Read[X] : read returns X
P1-Write[X] → P2-Read[X] : read returns value written by P1
P1-Write[X] → P2-Write[X] : writes serialized
All P’s see writes in the same order
Example Execution
153
Processor 0 and Processor 1 each execute:

0: addi r1, accts, r3   # get addr for account
1: ld 0(r3), r4         # load balance into r4
2: blt r4, r2, 6        # check for sufficient funds
3: sub r4, r2, r4       # withdraw
4: st r4, 0(r3)         # store new balance
5: call give-cash
n Two withdrawals from one account, from two ATMs
n Withdraw value: r2 (e.g., $100)
n Account memory address: accts+r1
n Account balance: r4
Scenario 1 – No Caches
154
Processor 0 runs the withdrawal code, then Processor 1 runs the same code:

Processor 0                      Processor 1
0: addi r1, accts, r3
1: ld 0(r3), r4
2: blt r4, r2, 6
3: sub r4, r2, r4
4: st r4, 0(r3)
5: call give-cash
                                 0: addi r1, accts, r3
                                 1: ld 0(r3), r4
                                 2: blt r4, r2, 6
                                 3: sub r4, r2, r4
                                 4: st r4, 0(r3)
                                 5: call give-cash

n Processors have no caches
n Withdrawals update the balance without a problem
n Memory balance: 500 → 400 (after P0's store) → 300 (after P1's store)
Scenario 2a – Cache Incoherence
155
Both processors execute the withdrawal code, Processor 0 first, then Processor 1. Cache/memory trace:

P0 cache    P1 cache    Mem
  -            -        500     (initial balance)
V: 500         -        500     (P0 loads the balance)
D: 400         -        500     (P0 stores 400; write-back cache, memory not updated)
D: 400      V: 500      500     (P1 misses and loads the stale 500 from memory)
D: 400      D: 400      500     (P1 stores 400; one withdrawal is lost)

n Processors have write-back caches
n Processor 0 updates the balance in its cache, but does not write it back to memory
n Multiple copies of memory location [accts+r1] exist
n Copies may become inconsistent
Scenario 2b – Cache Incoherence
156
Both processors execute the withdrawal code again. Cache/memory trace:

P0 cache    P1 cache    Mem
  -            -        500     (initial balance)
V: 500         -        500     (P0 loads the balance)
D: 400         -        400     (P0 stores 400; write-through updates memory)
D: 400      V: 400      400     (P1 loads 400)
V: 400      D: 400      300     (memory is updated to 300, but a stale 400 remains cached)

n Processors have write-through caches
n What happens if processor 0 performs another withdrawal?
Hardware Coherence Protocols
n Absolute Coherence
n All cached copies have the same data at the same time
n Slow and hard to implement
n Relative Coherence
n Temporary incoherence is OK (e.g., write-back caches) as long as no load reads incoherent data
n Coherence Protocol
n Finite state machine that runs for every cache line
n (1) Defines states per cache line
n (2) Defines state transitions based on bus activity
n (3) Requires a coherence controller to examine bus traffic (address, data)
n (4) Invalidates or updates cache lines
157
[Figure: CPU with D$ data and D$ tag arrays; a coherence controller (CC) sits between the cache and the bus]
Protocol 1 – Write Invalidate
n Mechanics
n Processor P performs a write and broadcasts the address on the bus
n All other processors (!P) snoop the bus; if the address is locally cached, !P invalidates its local copy
n Processor P performs a read and broadcasts the address on the bus
n All other processors (!P) snoop the bus; if the address is locally cached, !P writes back its local copy
n Example
158
Processor Activity | Bus Activity | Data in Cache-A | Data in Cache-B | Data in Mem[X]
(initial) | | | | 0
CPU-A reads X | Cache miss for X | 0 | | 0
CPU-B reads X | Cache miss for X | 0 | 0 | 0
CPU-A writes 1 to X | Invalidation for X | 1 | | 0
CPU-B reads X | Cache miss for X | 1 | 1 | 1
Cache Coherent Systems
n Provide Coherence Protocol
n States
n State transition diagram
n Actions
n Implement Coherence Protocol
n (0) Determine when to invoke the coherence protocol
n (1) Find the state of the cache line to determine the action
n (2) Locate other cached copies
n (3) Communicate with other cached copies (invalidate, update)
n Implementation Variants
n (0) is done in the same way for all systems: maintain additional state per cache line and invoke the protocol based on that state
n (1)–(3) have different approaches
159
Implementation 1 – Snooping
n Bus-based Snooping
n All cache/coherence controllers observe and react to all bus events
n Protocol relies on globally visible events
n i.e., all processors see all events
n Protocol relies on globally ordered events
n i.e., all processors see all events in the same sequence
n Bus Events
n Processor events (initiated by own processor P)
n read (R), write (W), write-back (WB)
n Bus events (initiated by other processors !P)
n bus read (BR), bus write (BW)
160
Three-State Invalidate Protocol
n Implement protocol for every cache line.
n Add state bits to every cache line to indicate (1) invalid, (2) shared, (3) exclusive (see the sketch below)
161
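A minimal C sketch of such a three-state (MSI-style) per-line state machine, assuming the bus events listed earlier (R and W from the local processor, BR and BW from others); the state names and event encoding here are illustrative, not taken from the slides:

typedef enum { INVALID, SHARED, EXCLUSIVE } line_state_t;   /* (1)(2)(3) */
typedef enum { EV_READ, EV_WRITE, EV_BUS_READ, EV_BUS_WRITE } event_t;

/* Next state for one cache line, given an observed event. */
static line_state_t next_state(line_state_t s, event_t e) {
    switch (e) {
    case EV_READ:       /* local read: fetch the line if we miss */
        return (s == INVALID) ? SHARED : s;
    case EV_WRITE:      /* local write: gain exclusive ownership, others invalidate */
        return EXCLUSIVE;
    case EV_BUS_READ:   /* another processor reads: demote an exclusive copy to shared */
        return (s == EXCLUSIVE) ? SHARED : s;
    case EV_BUS_WRITE:  /* another processor writes: our copy becomes stale */
        return INVALID;
    }
    return s;
}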
Example
162
P1 read (A)
P2 read (A1)
P1 write (B)
P2 read (C)
P1 write (D)
P2 write (E)
P2 write (F-Z)
Implementation 2 – Directory
n Bus-based Snooping – Limitations
n Snooping scalability is limited
n Bus has insufficient data bandwidth for coherence traffic
n Processor has insufficient snooping bandwidth for coherence traffic
n Directory-based Coherence – Scalable Alternative
n Directory contains state for every cache line
n Directory identifies processors with cached copies and their states
n In contrast to snoopy protocols, processors observe/act only on relevant memory events; the directory determines whether a processor is involved
163
Directory Communication
n Processor sends coherence events to the directory (see the sketch below)
1. Find the directory entry
2. Identify processors with copies
3. Communicate with processors, if necessary
164
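A minimal sketch of what a directory entry and the three lookup steps above might look like in C (the field names, the bit-vector sharer list, the 64-processor limit, and the messaging hook are illustrative assumptions, not details from the slides):

#include <stdint.h>

#define MAX_PROCS 64

typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } dir_state_t;

/* One directory entry per memory block. */
struct dir_entry {
    dir_state_t state;
    uint64_t    sharers;   /* bit i set => processor i has a copy */
};

/* Assumed messaging hook; a real system would send a network message. */
static void send_invalidate(int proc, uint64_t block) { (void)proc; (void)block; }

/* Handle a write request: (1) find the entry, (2) identify sharers,
 * (3) communicate only with processors that actually hold a copy. */
void dir_handle_write(struct dir_entry *dir, uint64_t block, int requester) {
    struct dir_entry *e = &dir[block];                 /* (1) find directory entry */
    for (int p = 0; p < MAX_PROCS; p++) {              /* (2) identify copies */
        if (p != requester && (e->sharers & (1ULL << p)))
            send_invalidate(p, block);                 /* (3) communicate */
    }
    e->state   = DIR_EXCLUSIVE;
    e->sharers = 1ULL << requester;
}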
Challenges in Shared Memory
n Cache Coherence
n “Common Sense”
n Synchronization
n Atomic read/write operations
n Memory Consistency
n What behavior should programmers expect from shared memory?
n Provide a formal definition of memory behavior to the programmer
n Example: When will a written value be seen?
n Example: P1-Write[X] <<10ps>> P2-Read[X]. What happens?
165
P1-Read[X] → P1-Write[X] → P1-Read[X] : read returns X
P1-Write[X] → P2-Read[X] : read returns value written by P1
P1-Write[X] → P2-Write[X] : writes serialized
All P’s see writes in the same order
Synchronization
n Regulate access to data shared by processors
n The synchronization primitive is a lock
n A critical section is a code segment that accesses shared data
n A processor must acquire the lock before entering the critical section
n A processor should release the lock when exiting the critical section
n Spin Locks – Broken Implementation

acquire (lock)     # if lock == 0, then set lock = 1, else spin
critical section
release (lock)     # lock = 0
Inst-0: ldw R1, lock      # load lock into R1
Inst-1: bnez R1, Inst-0   # check lock; if lock != 0, go back to Inst-0
Inst-2: stw 1, lock       # acquire lock, set to 1
<<critical section>>      # access shared data
Inst-n: stw 0, lock       # release lock, set to 0
166
Implementing Spin Locks
n Problem: Lock acquire is not atomic
n A set of atomic operations either all complete or all fail; during a set of atomic operations, no other processor can interject
n A spin lock requires an atomic load-test-store sequence
167
Processor 0                               Processor 1
Inst-0: ldw R1, lock
Inst-1: bnez R1, Inst-0   # P0 sees lock is free
                                          Inst-0: ldw R1, lock
                                          Inst-1: bnez R1, Inst-0   # P1 sees lock is free
Inst-2: stw 1, lock       # P0 acquires lock
                                          Inst-2: stw 1, lock       # P1 acquires lock
...                                       ...   # P0/P1 in critical section at the same time
Inst-n: stw 0, lock
Implementing Spin Locks
n Solution: Test-and-set instruction
n Add a single instruction for load-test-store (t&s R1, lock)
n Test-and-set atomically executes
    ld R1, lock   # load previous lock value
    st 1, lock    # store 1 to set/acquire
n If lock is initially free (0), t&s acquires the lock (sets it to 1)
n If lock is initially busy (1), t&s does not change it
n The instruction is un-interruptible/atomic by definition
168
Inst-0: t&s R1, lock      # atomically load, check, and set lock = 1
Inst-1: bnez R1, Inst-0   # if previous value of R1 not 0, acquire unsuccessful, retry
....
Inst-n: stw 0, lock       # release lock
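A hedged C11 sketch of the same idea, using atomic_exchange as the test-and-set primitive (the function and variable names are illustrative; real code would also consider memory ordering, discussed later):

#include <stdatomic.h>

static atomic_int lock = 0;              /* 0 = free, 1 = held */

void ts_acquire(void) {
    /* atomic_exchange is the test-and-set: write 1 and return the old value */
    while (atomic_exchange(&lock, 1) != 0)
        ;                                /* old value was 1: lock busy, spin */
}

void ts_release(void) {
    atomic_store(&lock, 0);              /* set lock back to 0 */
}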
Test-and-Set Inefficiency
n Test-and-set works…
n …but performs poorly
n Suppose Processor 2 (not shown) has the lock
n Processors 0/1 must…
n Execute a loop of t&s instructions
n Issue multiple store instructions
n Generate useless interconnection traffic
169
Processor 0                               Processor 1
Inst-0: t&s R1, lock
Inst-1: bnez R1, Inst-0   # P0 does not acquire, retries
                                          Inst-0: t&s R1, lock
                                          Inst-1: bnez R1, Inst-0   # P1 does not acquire, retries
Test-and-Test-and-Set Locks
n Solution: Test-and-test-and-set
n Advantages
n Spins locally without stores
n Reduces interconnect traffic
n Not a new instruction, simply new software (lock implementation)
170
Inst-0: ld R1, lock       # test with a load, see if lock changed
Inst-1: bnez R1, Inst-0   # if lock == 1, spin
Inst-2: t&s R1, lock      # if lock looked free, test-and-set
Inst-3: bnez R1, Inst-0   # if it could not acquire, spin
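The same pattern in the hedged C11 style used above (an illustrative sketch): spin on a plain load first, and only attempt the atomic exchange when the lock looks free.

#include <stdatomic.h>

static atomic_int lock = 0;                  /* 0 = free, 1 = held */

void tts_acquire(void) {
    for (;;) {
        while (atomic_load(&lock) != 0)      /* test: spin on the cached copy, no stores */
            ;
        if (atomic_exchange(&lock, 1) == 0)  /* test-and-set only when it looked free */
            return;                          /* acquired */
    }
}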
Semaphores
n Semaphore (semaphore S, integer N)
n Allows up to N parallel threads to access a shared variable
n If N = 1, equivalent to a lock
n Requires an atomic fetch-and-add
171
Function Init (semaphore S, integer N) {
    S = N;
}

Function P (semaphore S) {     # “Proberen”, to test
    while (S == 0) { };
    S = S - 1;
}

Function V (semaphore S) {     # “Verhogen”, to increment
    S = S + 1;
}
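A hedged C11 sketch of P/V built on atomic read-modify-write operations (compare-and-swap here, standing in for the fetch-and-add the slide mentions; the names are illustrative):

#include <stdatomic.h>

static atomic_int sem;                       /* initialized to N by sem_init_n() */

void sem_init_n(int n) { atomic_store(&sem, n); }

void sem_P(void) {                           /* wait, "Proberen" */
    for (;;) {
        int v = atomic_load(&sem);
        /* only decrement if the count is positive, and do it atomically */
        if (v > 0 && atomic_compare_exchange_weak(&sem, &v, v - 1))
            return;
    }
}

void sem_V(void) {                           /* signal, "Verhogen" */
    atomic_fetch_add(&sem, 1);
}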
Challenges in Shared Memory
n Cache Coherence
n “Common Sense”
n Synchronization
n Atomic read/write operations
n Memory Consistency
n What behavior should programmers expect from shared memory?
n Provide a formal definition of memory behavior to the programmer
n Example: When will a written value be seen?
n Example: P1-Write[X] <<10ps>> P2-Read[X]. What happens?
172
P1-Read[X] → P1-Write[X] → P1-Read[X] : read returns X
P1-Write[X] → P2-Read[X] : read returns value written by P1
P1-Write[X] → P2-Write[X] : writes serialized
All P’s see writes in the same order
Memory Consistency
n Execution Example
n Intuition – P1 should print A=1
n Coherence – Makes no guarantees!
173
A = Flag = 0 initially

Processor 0            Processor 1
A = 1                  while (!Flag) { }
Flag = 1               print A
Consistency and Caches
n Execution Example
n Caching Scenario
1. P0 writes A=1. Misses in the cache. Puts the write into a store buffer.
2. P0 continues execution.
3. P0 writes Flag=1. Hits in the cache. Completes the write (with coherence).
4. P1 reads Flag=1.
5. P1 exits the spin loop.
6. P1 prints A=0.
n Caches, buffering, and other performance mechanisms can cause strange behavior
174
A = Flag = 0 initially

Processor 0            Processor 1
A = 1                  while (!Flag) { }
Flag = 1               print A
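A hedged C11 sketch of this scenario (the function names and the use of relaxed atomics are illustrative; run each function on its own thread): with relaxed ordering, nothing stops the Flag update from becoming visible before the data, so the consumer may legally print 0.

#include <stdatomic.h>
#include <stdio.h>

atomic_int A    = 0;
atomic_int Flag = 0;

void producer(void) {                 /* Processor 0 */
    atomic_store_explicit(&A, 1, memory_order_relaxed);
    atomic_store_explicit(&Flag, 1, memory_order_relaxed);
}

void consumer(void) {                 /* Processor 1 */
    while (atomic_load_explicit(&Flag, memory_order_relaxed) == 0)
        ;                             /* spin until Flag is set */
    /* With relaxed ordering this may print 0; with seq_cst (or release/acquire
       ordering on Flag) it must print 1. */
    printf("A = %d\n", atomic_load_explicit(&A, memory_order_relaxed));
}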
Sequential Consistency (SC)
n Definition of Sequential Consistency
n Formal definition of programmers’ expected view of memory
(1) Each processor P sees its own loads/stores in program order
(2) Each processor P sees !P loads/stores in program order
(3) All processors see the same global load/store ordering
n P and !P loads/stores may be interleaved into some order, but all processors see the same interleaving/ordering
n Definition of Multiprocessor Ordering [Lamport]
n Multiprocessor ordering corresponds to some sequential interleaving of uniprocessor orderings; multiprocessor ordering should be indistinguishable from a multi-programmed uniprocessor
175
Enforcing SC
n Consistency and Coherence
n SC definition: loads/stores globally ordered
n SC implication: coherence events of all loads/stores globally ordered
n Implementing Sequential Consistency
n All loads/stores commit in order
n Delay completion of a memory access until all invalidations caused by the access are complete
n Delay a memory access until the previous memory access is complete
n Delay a memory read until the previous write completes; cannot place writes in a buffer and continue with reads
n Simple for the programmer but constrains HW/SW performance optimizations
176
Weaker Consistency Models
n Assume programs are synchronized
n SC required only for lock variables
n Other variables are either (1) in a critical section and cannot be accessed in parallel, or (2) not shared
n Use fences to restrict reordering
n Increases opportunity for HW optimization but increases programmer effort
n Memory fences stall execution until write buffers empty
n Allows load/store reordering inside the critical section
n Slows lock acquire and release
177
acquire
memory fence
critical section
memory fence     # ensures all writes from critical section
release          # are cleared from buffer
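A hedged C11 sketch of that acquire/fence/release pattern, reusing the illustrative test-and-set lock from earlier (atomic_thread_fence is the standard's memory fence; the fence placement mirrors the slide, not any particular ISA):

#include <stdatomic.h>

static atomic_int lock = 0;
int shared_data;                               /* protected by lock */

void locked_update(int v) {
    while (atomic_exchange_explicit(&lock, 1, memory_order_relaxed) != 0)
        ;                                      /* acquire (spin) */
    atomic_thread_fence(memory_order_acquire); /* memory fence after acquire */

    shared_data = v;                           /* critical section */

    atomic_thread_fence(memory_order_release); /* fence: flush critical-section writes */
    atomic_store_explicit(&lock, 0, memory_order_relaxed);  /* release */
}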