TRANSCRIPT
ECE5917 SoC Architecture: MP SoC
Tae Hee Han: [email protected]
Semiconductor Systems Engineering
Sungkyunkwan University
Outline
n Overview
n Parallelism
n Data-Level Parallelism
n Instruction-Level Parallelism
n Thread-Level Parallelism
n Processor-Level Parallelism
n Multi-core
2
Parallelism – Thread Level Parallelism
3
Superscalar (In)Efficiency
4
[Figure: instruction issue slots (issue width × time) in a 4-wide superscalar]
n Completely idle cycle (vertical waste): introduced when the processor issues no instructions in a cycle
n Partially filled cycle, i.e., IPC < 4 (horizontal waste): occurs when not all issue slots can be filled in a cycle
Thread
n Definition
n A discrete sequence of related instructions
n Executed independently of other such sequences
n Every program has at least one thread
n Initializes
n Executes instructions
n May create other threads
n Each thread maintains its current state
n OS maps a thread to hardware resources
5
Multithreading
n On a single processor, multithreading generally occurs by time-division multiplexing (as in multitasking) – context switching
n On a multiprocessor or multi-core systems, threads can be truly concurrent, with every processor or core executing a separate thread simultaneously
n Many modern OS directly support both time-sliced and multiprocessor threading with a process scheduler
n The kernel of an OS allows programmers to manipulate threads via the system call interface
6
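As a concrete illustration of that system-call-level interface, here is a minimal POSIX-threads sketch (not from the slides; the worker function and its argument are made up for the example):

#include <pthread.h>
#include <stdio.h>

/* Work executed by the newly created thread. */
static void *worker(void *arg) {
    int id = *(int *)arg;
    printf("hello from thread %d\n", id);
    return NULL;
}

int main(void) {
    pthread_t tid;
    int id = 1;
    pthread_create(&tid, NULL, worker, &id);  /* kernel schedules the thread onto a core */
    pthread_join(tid, NULL);                  /* wait for the thread to finish */
    return 0;
}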
Thread Level Parallelism (TLP)
n Interaction with OS
n OS perceives each core as a separate processor
n OS scheduler maps threads/processes to different logical (or virtual) cores
n Most major OSes support multithreading today
n TLP explicitly represented by the use of multiple threads of execution that are inherently parallel
n Goal: Use multiple instruction streams to improve
n Throughput of computers that run many programs
n Execution time of multi-threaded programs
n TLP could be more cost-effective than ILP
7
[Figure: two processes with separate virtual memory spaces (ASID 1 and ASID 2) mapped onto physical memory. Process 1 has two threads and Process 2 has three; each thread keeps its own stack, registers, and PC. The OS thread scheduler maps the threads onto Processor Core 1 and Processor Core 2, each e.g. 2-way SMT.]
Multithreaded Execution
n Multithreading: multiple threads share functional units of 1 processor via overlapping
n Processor must duplicate independent state of each thread
n Separate copy of register file, PC
n Separate page table if different process
n Memory sharing via virtual memory mechanisms
n Already supports multiple processes
n HW for fast thread switch
n Must be much faster than full process switch (which is 100s to 1000s of clocks)
n When to switch?
n Alternate instruction per thread (fine grain) – round robin
n When thread is stalled (coarse grain)
n e.g., cache miss
8
Conceptual Illustration of Multithreaded Architecture
9
[Figure: concurrent threads of computation (serial code and sub-problems A, B, C, with loop iterations i = 1..n and j = 1..m) running in parallel and mapped onto hardware streams; unused streams, an instruction-ready pool, and a pipeline of executing instructions.]
Sources of Wasted Issue Slots
Source | Possible latency-hiding or latency-reducing technique
TLB miss | Increase TLB sizes, HW instruction prefetching, HW or SW data prefetching, faster servicing of TLB misses
I-cache miss | Increase cache size, more associativity, HW instruction prefetching
D-cache miss | Increase cache size, more associativity, HW or SW data prefetching, improved instruction scheduling, more sophisticated dynamic execution
Branch misprediction | Improved branch prediction scheme, lower branch misprediction penalty
Control hazard | Speculative execution, more aggressive if-conversion
Load delays (L1 cache hits) | Shorter load latency, improved instruction scheduling, dynamic scheduling
Short integer delay | Improved instruction scheduling
Long integer, short FP, long FP delays | Shorter latencies, improved instruction scheduling
Memory conflict | Improved instruction scheduling
10
Fine-Grained Multithreading
n Switches between threads on each instruction, interleaving execution of multiple threads
n Usually done round-robin, skipping stalled threads
n CPU must be able to switch threads every clock
n Advantage: can hide both short and long stalls
n Instructions from other threads always available to execute
n Easy to insert on short stalls
n Disadvantage: slows individual threads
n A thread ready to execute without stalls will be delayed by instructions from other threads
n Used on Sun (now Oracle) Niagara (UltraSPARC T1) – Nov. 2005
11
Coarse-Grained Multithreading
n Switches threads only on costly stalls: e.g., L2 cache misses
n Advantages
n Relieves need to have very fast thread switching
n Doesn't slow down the individual thread
n Other threads only issue instructions when the main one would stall (for a long time) anyway
n Disadvantage: pipeline startup costs make it hard to hide throughput losses from shorter stalls
n Pipeline must be emptied or frozen on a stall, since the CPU issues instructions from only one thread
n New thread must fill the pipe before instructions can complete
n Thus, better for reducing the penalty of high-cost stalls where pipeline refill << stall time
n Used in IBM AS/400
12
Simple Multithreaded Pipeline
n Additional state: One copy of architected state per thread (e.g., PC, GPR)
n Thread select: Round-robin logic; Propagate Thread-ID down pipeline to access correct state (e.g., GPR1 versus GPR2)
n OS perceives multiple logical CPUs
13
[Figure: simple two-thread multithreaded pipeline – per-thread PCs (PC1, PC2) and register files (GPR1, GPR2), a thread-select mux, I$, instruction register, execute stages X and Y, D$, and a +1 PC update.]
Cycle Interleaved MT (Fine-Grain MT)
14
[Figure: issue slots (issue width × time) with the second thread interleaved cycle-by-cycle; partially filled cycles, i.e., IPC < 4 (horizontal waste), remain.]
Cycle-interleaved multithreading reduces vertical waste with cycle-by-cycle interleaving. However, horizontal waste remains.
Chip Multiprocessing (CMP)
15
[Figure: issue slots (issue width × time) with the two threads running on two narrower cores.]
Chip multiprocessing reduces horizontal waste with simple (narrower) cores. However, (1) vertical waste remains and (2) ILP is bounded.
Ideal Superscalar Multithreading [Tullsen, Eggers, Levy, UW, 1995]
n Interleave multiple threads to multiple issue slots with no restrictions
16
[Figure: issue slots (issue width × time) with instructions from multiple threads filling slots in the same cycle.]
Simultaneous Multithreading (SMT) Motivation
n Fine-grain Multithreading
n HEP, Tera, MASA, MIT Alewife
n Fast context switching among multiple independent threads
n Switch threads on cache miss stalls – Alewife
n Switch threads on every cycle – Tera, HEP
n Targets vertical waste only
n At any cycle, issue instructions from only a single thread
n Single-chip MP
n Coarse-grain parallelism among independent threads in different processors
n Also exhibits both vertical and horizontal waste in each individual processor pipeline
17
Simultaneous Multithreading (SMT)
n An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide issue processors (superscalars)
n SMT has the potential of greatly enhancing superscalar processor computational capabilities by
n Exploiting thread-level parallelism in a single processor core, simultaneously issuing, executing and retiring instructions from different threads during the same cycle
n A single physical SMT processor core acts as a number of logical processors each executing a single thread
n Providing multiple hardware contexts, hardware thread scheduling and context switching capability
n Providing effective long-latency hiding
n e.g., FP, branch misprediction, memory latency
18
Simultaneous Multithreading (SMT)
n Intel’s HyperThreading (2-way SMT)
n IBM Power7 (4/6/8 cores, 4-way SMT); IBM Power5/6 (2 cores - each 2-way SMT, 4 chips per package): Power5 has OoO cores, Power6 In-order cores;
n Basic ideas: Conventional MT + Simultaneous issue + Sharing common resources
19
[Figure: superscalar pipeline extended for SMT – multiple PCs and fetch units sharing the I-cache and decode, per-thread register renamers and register files, shared RS & ROB + physical register file, shared execution units (ALU 1, ALU 2, Fadd 2 cycles, Fmult 4 cycles, unpipelined Fdiv 16 cycles, load/store with variable latency), and a shared D-cache.]
Overview of SMT Hardware Changes
n For an N-way (N threads) SMT, we need:
n Ability to fetch from N threads
n N sets of registers (including PCs)
n N rename tables (RATs)
n N virtual memory spaces
n But we don’t need to replicate the entire OOO execution engine (schedulers, execution units, bypass networks, ROBs, etc.)
20
Multithreading: Classification
21
[Figure: execution time × functional units (FU1–FU4), slots colored by Thread 1–Thread 5 or unused, for: conventional superscalar (single-threaded), fine-grained multithreading (cycle-by-cycle interleaving), coarse-grained multithreading (block interleaving), simultaneous multithreading (SMT), and chip multiprocessor (CMP or multi-core).]
SMT Performance
n When it works, it fills idle “issue slots” with work from other threads; throughput improves
n But sometimes it can cause performance degradation!
22
Time(finish one task, then do the other) < Time(do both at the same time using SMT)
How?
n Cache thrashing
23
[Figure: Thread0 just fits in the level-1 caches (I$, D$) and executes reasonably quickly due to high cache hit rates. After a context switch to Thread1, Thread1 also fits nicely in the caches. Run together, the caches were just big enough to hold one thread's data, but not two threads' worth, so both threads now have significantly higher cache miss rates (spilling to L2). → Intel Smart Cache!]
Multithreading: How Many Threads?
n With more HW threads:
n Larger/multiple register files
n Replicated & partitioned resources → lower utilization, lower single-thread performance
n Shared resources → utilization vs. interference and thrashing
n Impact of MT/MC on memory hierarchy?
24
Source: Guz et al. "Many-Core vs. Many-Thread Machines: Stay Away From the Valley," IEEE COMPUTER ARCHITECTURE LETTERS, VOL. 8, NO. 1, 2009
SMT: Intel vs. ARM
n In 2010, ARM said it might include SMT in its chips in the future; however this was rejected for their 2012 64-bit design
n Intel conceded SMT will not be supported on its Silvermont processor cores in order to save power
25
Noel Hurley (VP of marketing and strategy in ARM's processor division) said ARM rejected SMT as an option. Although it can be used to hide the latency of memory accesses in parallel applications – a technique used heavily in GPUs – multithreading complicates the design of the pipeline itself. The tradeoff did not make sense for the engineering team, he said. ( http://www.techdesignforums.com/blog/2012/10/30/arm-64bit-cortex-a53-a57-launch/ )
MT Analysis by ARM for Mobile Applications
n Evaluation tests by ARM have shown that MT is not efficient for mobile devices
n Increasing performance by 50% will cost more than a 50% increase in power
n The performance of MT is much less predictable than that of multi-core solutions
n In MT, the implementation cost of ‘sleep mode’ becomes higher due to more sharing of resources between multiple threads
n For high-end mobile apps that require superscalar, OoO-based multi-core, single-threaded multi-core implementations such as big.LITTLE are the most efficient solution
26
[Chart: relative mW vs. relative DMIPS (scale 0–4) for Cortex-A7, Dual Cortex-A7, Cortex-A12, and an estimate of Cortex-A12 with MT.]
Source: http://www.arm.com/files/pdf/Multi-threading_Technology.pdf
Summary
n Limits to ILP (power efficiency, compilers, dependencies, …) seem to cap practical designs at 3- to 6-issue
n Data-level parallelism and/or thread-level parallelism is exploited to improve performance
n Coarse-grain vs. fine-grain multithreading
n Switch only on big stalls vs. on every clock cycle
n Simultaneous Multithreading is fine-grained multithreading based on an OoO superscalar microarchitecture
n Instead of replicating registers, reuse rename registers
27
Parallelism – Processor Level Parallelism
28
Beyond ILP (Instruction Level Parallelism)
n Performance is limited by the serial fraction
n Coarse-grain parallelism in the post-ILP era
n Thread, process and data parallelism
n Learn from the lessons of the parallel processing community
n Revisit the classifications and architectural techniques
29
[Figure: execution time split into a serial portion and a parallelizable portion for 1, 2, 3, and 4 CPUs.]
“Automatic” Parallelism in Modern Machines
n Bit-level parallelism
n Within floating point operations, etc.
n Instruction-level parallelism (ILP)
n Multiple instructions execute per clock cycle
n Memory system parallelism
n Overlap of memory operations with computation
n OS parallelism
n Multiple jobs run in parallel on commodity SMPs
Limits to all of these -- for very high performance, need user to identify, schedule and coordinate parallel tasks
30
Principles of Parallel Computing
n Finding enough parallelism (Amdahl’s Law)
n Granularity
n Locality
n Load balance
n Coordination and synchronization
n Performance modeling
31
All of these things make parallel programming even harder than sequential programming
Finding Enough Parallelism
n Suppose only part of an application seems parallel
n Amdahl’s lawn Let s be the fraction of work done sequentially, so (1-s) is fraction
parallelizablen P = number of processors
Speedup(P) = Time(1)/Time(P)
           = 1/(s + (1-s)/P)
           ≈ 1/s as P grows large
n Even if the parallel part speeds up perfectly, performance is limited by the sequential part
32
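A minimal C sketch of the formula above, with made-up example values for s and P:

#include <stdio.h>

/* Amdahl's law: speedup with sequential fraction s on P processors. */
static double speedup(double s, int p) {
    return 1.0 / (s + (1.0 - s) / p);
}

int main(void) {
    double s = 0.05;                       /* illustrative: 5% sequential work */
    for (int p = 4; p <= 1024; p *= 4)
        printf("P = %4d  speedup = %.2f (bound 1/s = %.1f)\n",
               p, speedup(s, p), 1.0 / s);
    return 0;
}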
Overhead of Parallelism
n Given enough parallel work, this is the biggest barrier to getting desired speedup
n Parallelism overheads include:
n cost of starting a thread or process
n cost of communicating shared data
n cost of synchronizing
n extra (redundant) computation
n Each of these can be in the range of milliseconds (=millions of flops) on some systems
n Tradeoff: Algorithm needs sufficiently large units of work to run fast in parallel (i.e. large granularity), but not so large that there is not enough parallel work
33
Locality and Parallelism
n Large memories are slow, fast memories are small
n Storage hierarchies are large and fast on average
n Parallel processors, collectively, have a large, fast cache
n The slow accesses to "remote" data we call "communication"
n Algorithm should do most work on local data
34
[Figure: conventional storage hierarchy – each processor with its own L1, L2, and L3 caches in front of memory; in a parallel machine, potential interconnects link the L3 caches and memories of different processors.]
Load Imbalance
n Load imbalance is the time that some processors in the system are idle due to
n Insufficient parallelism (during that phase)
n Unequal size tasks
n Examples of the latter
n Adapting to "interesting parts of a domain"
n Tree-structured computations
n Fundamentally unstructured problems
n Algorithm needs to balance load
35
Computer Architecture Classifications
Processor Organizations
n Single Instruction, Single Data Stream (SISD) → Uniprocessor
n Single Instruction, Multiple Data Stream (SIMD) → Vector processor, Array processor
n Multiple Instruction, Single Data Stream (MISD)
n Multiple Instruction, Multiple Data Stream (MIMD)
n Centralized Shared Memory Architecture (UMA)
n Distributed Memory Architecture → Distributed Shared Memory (NUMA), Message Passing
36
Multiprocessors
n Why do we need multiprocessors?
n Uniprocessor speed keeps improving
n But there are things that need even more speed
n Wait for a few years for Moore's law to catch up?
n Or use multiple processors and do it now?
n Multiprocessor software problem
n Most code is sequential (for uniprocessors)
n MUCH easier to write and debug
n Correct parallel code very, very difficult to write
n Efficient and correct is even harder
n Debugging even more difficult (Heisenbugs)
37
ø Heisenbug is a computer programming jargon term for a software bug that seems to disappear or alter its behavior when one attempts to study it. The term is a pun on the name of Werner Heisenberg, the physicist who first asserted the observer effect of quantum mechanics, which states that the act of observing a system inevitably alters its state.
MIMD Multiprocessors
38
Centralized Shared Memory Distributed Memory
Centralized-Memory Machines
n Also “Symmetric Multiprocessors” (SMP)
n "Uniform Memory Access" (UMA)
n All memory locations have similar latencies
n Data sharing through memory reads/writes
n P1 can write data to a physical address A, P2 can then read physical address A to get that data
n Problem: Memory Contention
n All processors share the one memory
n Memory bandwidth becomes the bottleneck
n Used only for smaller machines
n Most often 2, 4, or 8 processors
39
Distributed-Memory Machines
n Two kinds
n Distributed Shared-Memory (DSM)
n All processors can address all memory locations
n Data sharing like in SMP
n Also called NUMA (non-uniform memory access)
n Latencies of different memory locations can differ (local access faster than remote access)
n Message-Passing
n A processor can directly address only local memory
n To communicate with other processors, must explicitly send/receive messages
n Also called multicomputers or clusters
n Most accesses local, so less memory contention (can scale to well over 1,000 processors)
40
Another Classification
n Two Models for Communication and Memory Architecture
1. Communication occurs by explicitly passing messages among the processors: Message-passing multiprocessors
2. Communication occurs through a shared address space (via loads and stores): Shared-memory multiprocessors, either
n UMA (Uniform Memory Access time) for shared-address, centralized-memory MP
n NUMA (Non-Uniform Memory Access time) for shared-address, distributed-memory MP
n In the past, confusion whether "sharing" means sharing physical memory (Symmetric MP) or sharing address space
41
Process Coordination: Shared Memory vs. Message Passing
n Shared memory
n Efficient, familiar
n Not always available
n Potentially insecure
n Message passing
n Extensible to communication in distributed systems
42
Canonical syntax:
send (process : process_id, message : string)
receive (process : process_id, var message : string)

Shared-memory example:
global int x

process foo
begin
  ...
  x := ...
  ...
end foo

process bar
begin
  ...
  y := x
  ...
end bar
Message Passing Protocols
n Explicitly send data from one thread to another
n Need to track IDs of other CPUs
n Broadcast may need multiple sends
n Each CPU has its own memory space
n Hardware: send/recv queues between CPUs
n Program components can be run on the same or different systems, so can use 1,000s of processors.
n "Standard" libraries exist to encapsulate messages:
n Parasoft's Express (commercial)
n PVM (Parallel Virtual Machine, non-commercial)
n MPI (Message Passing Interface, also non-commercial)
43
[Figure: send/receive queues between CPU0 and CPU1.]
Message Passing Machines
n A cluster of computers
n Each with its own processor and memory
n An interconnect to pass messages between them
n Producer-Consumer Scenario:
n P1 produces data D, uses a SEND to send it to P2
n The network routes the message to P2
n P2 then calls a RECEIVE to get the message
n Two types of send primitives
n Synchronous: P1 stops until P2 confirms receipt of message
n Asynchronous: P1 sends its message and continues
n Standard libraries for message passing: most common is MPI – Message Passing Interface
44
Communication Performance
n Metrics for Communication Performance
n Communication Bandwidth
n Communication Latency
n Sender overhead + transfer time + receiver overhead
n Communication latency hiding
n Characterizing Applications
n Communication to Computation Ratio
n Work done vs. bytes sent over network
n Example: 146 bytes per 1000 instructions
45
Parallel Performance
n Serial sections
n Very difficult to parallelize the entire application
n Amdahl's law
n Large remote access latency (100s of ns)
n Overall IPC goes down
Speedup_Overall = 1 / ((1 - F_Parallel) + F_Parallel / Speedup_Parallel)

Speedup_Parallel = 1024, F_Parallel = 0.5  →  Speedup_Overall = 1.998
Speedup_Parallel = 1024, F_Parallel = 0.99 →  Speedup_Overall = 91.2

CPI = CPI_Base + RemoteRequestRate × RemoteRequestCost
CPI_Base = 0.4, RemoteRequestRate = 0.002
RemoteRequestCost = 400 ns / 0.33 ns per cycle = 1200 cycles
CPI = 0.4 + 0.002 × 1200 = 2.8
We need at least 7 processors just to break even!
This cost reduced with CMP/multi-core
46
Message Passing Pros and Cons
n Pros
n Simpler and cheaper hardware
n Explicit communication makes programmers aware of costly (communication) operations
n Cons
n Explicit communication is painful to program
n Requires manual optimization
n If you want a variable to be local and accessible via LD/ST, you must declare it as such
n If other processes need to read or write this variable, you must explicitly code the needed sends and receives to do this
47
Message Passing: A Program
n Calculating the sum of array elements

#define ASIZE 1024
#define NUMPROC 4

double myArray[ASIZE/NUMPROC];     /* must manually split the array */
double mySum = 0;
for (int i = 0; i < ASIZE/NUMPROC; i++)
    mySum += myArray[i];

if (myPID == 0) {                  /* "master" processor adds up partial sums and prints the result */
    for (int p = 1; p < NUMPROC; p++) {
        double pSum;
        recv(p, pSum);
        mySum += pSum;
    }
    printf("Sum: %lf\n", mySum);
} else {
    send(0, mySum);                /* "slave" processors send their partial results to master */
}
48
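For comparison, a minimal sketch of the same partial-sum pattern using the MPI library mentioned above (illustrative only; each rank simply fills its chunk with ones, and rank 0 acts as the master):

#include <mpi.h>
#include <stdio.h>

#define ASIZE 1024

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    double myArray[ASIZE];                         /* each rank only uses its own chunk */
    for (int i = 0; i < ASIZE; i++) myArray[i] = 1.0;

    double mySum = 0;
    for (int i = 0; i < ASIZE / nproc; i++)
        mySum += myArray[i];

    if (rank == 0) {                               /* master collects partial sums */
        for (int p = 1; p < nproc; p++) {
            double pSum;
            MPI_Recv(&pSum, 1, MPI_DOUBLE, p, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            mySum += pSum;
        }
        printf("Sum: %lf\n", mySum);
    } else {                                       /* slaves send partial results to master */
        MPI_Send(&mySum, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}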
Shared Memory Model
n The processors are all connected to a "globally available" memory, via either a SW or HW means
n The operating system usually maintains its memory coherence
n That's basically it…
n Need to fork/join threads, synchronize (typically locks)
49
[Figure: CPU0 writes X and CPU1 reads X through a globally shared main memory.]
Shared Memory Multiprocessors: Roughly Two Styles
n UMA (Uniform Memory Access)
n The time to access main memory is the same for all processors since they are equally close to all memory locations
n Machines that use UMA are called Symmetric Multiprocessors (SMPs)
n In a typical SMP architecture, all memory accesses are posted to the same shared memory bus
n Contention – as more CPUs are added, competition for access to the bus leads to a decline in performance
n Thus, scalability is limited to about 32 processors
50
[Figure: conceptual UMA model – several processors, each with a cache, connected through an interconnection network to a single shared memory.]
Shared Memory Multiprocessors: Roughly Two Styles
n NUMA (Non-Uniform Memory Access)
n Since memory is physically distributed, it is faster for a processor to access its own local memory than non-local memory (memory local to another processor or shared between processors)
n Unlike SMPs, all processors are not equally close to all memory locations
n A processor's own internal computations can be done in its local memory, leading to reduced memory contention
n Designed to surpass the scalability limits of SMPs
51
[Figure: NUMA – processors P1…Pn, each with a cache, a local memory, and a directory, connected by an interconnect.]
The "Interconnect" usually includes
§ a cache directory to reduce snoop traffic
§ a Remote Cache to reduce access latency (think of it as an L3)
Cache-Coherent NUMA systems (CC-NUMA) vs. Non-Cache-Coherent NUMA (NCC-NUMA)
Modern Multiprocessor System: Mixed NUMA & UMA
n In this complex hierarchical scheme, processors are grouped by their physical location on one or the other multi-core CPU package or “node”
n Processors within a node share access to memory modules as per the UMA shared memory architecture
n At the same time, they may also access memory from the remote node using a shared interconnect, but with slower performance as per the NUMA shared memory architecture
52
Source: intel http://software.intel.com/en-us/articles/optimizing-applications-for-numa
[Figure: two multi-core nodes; within each node, several processors with caches share a memory through a local interconnection network (UMA), and the two nodes are linked by a further interconnection network (NUMA).]
Shared Memory: A Program
n Calculating the sum of array elements

#define ASIZE 1024
#define NUMPROC 4

shared double array[ASIZE];        /* array is shared */
shared double allSum = 0;
shared mutex sumLock;

double mySum = 0;
for (int i = myPID*ASIZE/NUMPROC; i < (myPID+1)*ASIZE/NUMPROC; i++)
    mySum += array[i];             /* each processor sums up "its" part of the array */

lock(sumLock);                     /* each processor adds its partial sum to the final result */
allSum += mySum;
unlock(sumLock);

if (myPID == 0)                    /* "master" processor prints the result */
    printf("Sum: %lf\n", allSum);
53
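The same computation as a runnable pthreads sketch (an assumed illustration, not the slide's pseudocode): the array and the accumulated sum are shared by all threads, and a mutex protects the final accumulation.

#include <pthread.h>
#include <stdio.h>

#define ASIZE 1024
#define NUMPROC 4

static double array[ASIZE];                        /* shared array */
static double allSum = 0;                          /* shared result */
static pthread_mutex_t sumLock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    int id = *(int *)arg;
    double mySum = 0;
    for (int i = id * ASIZE / NUMPROC; i < (id + 1) * ASIZE / NUMPROC; i++)
        mySum += array[i];                         /* sum "my" part of the array */
    pthread_mutex_lock(&sumLock);                  /* protect the shared accumulation */
    allSum += mySum;
    pthread_mutex_unlock(&sumLock);
    return NULL;
}

int main(void) {
    pthread_t tid[NUMPROC];
    int id[NUMPROC];
    for (int i = 0; i < ASIZE; i++) array[i] = 1.0; /* sample data */
    for (int t = 0; t < NUMPROC; t++) {
        id[t] = t;
        pthread_create(&tid[t], NULL, worker, &id[t]);
    }
    for (int t = 0; t < NUMPROC; t++) pthread_join(tid[t], NULL);
    printf("Sum: %lf\n", allSum);
    return 0;
}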
Shared Memory Pros and Cons
n Pros
n Communication happens automatically
n More natural way of programming
n Easier to write correct programs and gradually optimize them
n No need to manually distribute data (but it can help if you do)
n Cons
n Needs more hardware support
n Easy to write correct, but inefficient programs (remote accesses look the same as local ones)
54
Communication/Connection Options for MPs
n Multiprocessors come in two main configurations: a single bus connection, and a network connection
n The choice of the communication model and the physical connection depends largely on the number of processors in the organization
n Notice that the scalability of NUMA makes it ideal for a network configuration
n UMA, however, is best suited to a bus connection
55
Category | Choice | Typical number of processors
Communication model | Message passing | 8 ~ thousands
Communication model | Shared address, UMA | 2 ~ 64
Communication model | Shared address, NUMA | 8 ~ 256
Physical connection | Network | 8 ~ thousands
Physical connection | Bus | 2 ~ 32
Focus on Shared Memory Multiprocessors…
n We are more interested in single chip multi-core processor architecture rather than MPP systems in Data Centers
n It implements a memory system with a single global physical address space (usually)
n Goal 1: Minimize memory latency
n Use co-location & caches
n Goal 2: Maximize memory bandwidth
n Use parallelism & caches
56
Focus on Shared Memory Multiprocessors: Let’s See
57
[Figure: ARM CoreLink CCN-504 Cache Coherent Network – four quad Cortex-A57 clusters, each with an L2 cache shared between its 4 cores; a snoop filter and an 8-16MB L3 cache shared between the 4 clusters; GIC-500, MMU-500, two CoreLink DMC-520 memory controllers (x72 DDR4-3200), NIC-400 network interconnects, and peripherals (10-40 GbE, PCIe, DSP, SATA, USB, crypto, DPI, flash, GPIO).]
Source: ARM (2013)
Cache Coherence Problem
n Cache coherent processors
n Reading processor must get the most current value
n Most current value is the last write
n Cache coherency problem
n Updates from one processor not known to others
n Mechanisms for maintaining cache coherency
n Coherency state associated with a block of data
n Bus/Interconnect operations on shared data change the state
n For the processor that initiates an operation
n For other processors that have the data of the operation resident in their caches
58
[Figure: P0 and P1 both load A (value 0) from memory into their caches; P0 then stores 1 to A in its own cache, while P1's cached copy and memory still hold the stale value 0.]
Possible Causes of Incoherence
n Sharing of writeable data
n Cause most commonly considered
n Process migration
n Can occur even if independent jobs are executing
n I/O
n Often fixed via OS cache flushes
59
Defining Coherent Memory System
n A memory system is coherent if
1. A read R from address X on processor P1 returns the value written by the most recent write W to X on P1, if no other processor has written to X between W and R.
2. If P1 writes to X and P2 reads X after a sufficient time, and there are no other writes to X in between, P2’s read returns the value written by P1’s write.
3. Writes to the same location are serialized: two writes to location X are seen in the same order by all processors.
60
Cache Coherence Definition
n Property 1. preserves program order
n It says that in the absence of sharing, each processor behaves as a uniprocessor would
n Property 2. says that any write to an address must eventually be seen by all processors
n If P1 writes to X and P2 keeps reading X, P2 must eventually see the new value
n Property 3. preserves causality
n Suppose X starts at 0. Processor P1 increments X and processor P2 waits until X is 1 and then increments it to 2. Processor P3 must eventually see that X becomes 2.
n If different processors could see writes in different orders, P2 can see P1's write and do its own write, while P3 first sees the write by P2 and then the write by P1. Now we have two processors that will forever disagree about the value of X.
61
Maintaining Cache Coherence
n Snooping Solution (Snoopy Bus):
n Send all requests for data to all processors
n Processors snoop to see if they have a copy and respond accordingly
n Requires broadcast, since caching information is at processors
n Works well with bus (natural broadcast medium)
n Dominates for small-scale machines (most of the market)
n Directory-Based Schemes
n Keep track of what is being shared in one centralized place
n Distributed memory → distributed directory (avoids bottlenecks)
n Send point-to-point requests to processors
n Scales better than snooping
n Actually existed BEFORE snoop-based schemes
62
Snooping vs. Directory-based (1/3)
n Snooping protocols tend to be faster, if enough bandwidth is available, since all transactions are a request/response seen by all processors
n The drawback is that snooping is not scalable
n Every request must be broadcast to all nodes in a system, meaning that as the system gets larger, the size of the (logical or physical) bus and the bandwidth it provides must grow
n In broadcast snoop systems the coherency traffic is proportional to N×(N-1), where N is the number of coherent masters
n For each master the broadcast goes to all other masters except itself, so coherency traffic for 1 master is proportional to N-1
63
Snooping vs. Directory-based (2/3)
n Directories, on the other hand, tend to have longer latencies (with a 3 hop request/forward/respond) but use much less bandwidth since messages are point to point and not broadcast
n In the best case if all shared data is shared only by two masters and we count the directory lookup and the snoop as separate transactions then traffic scales at order 2N
n In the worst case where all traffic is shared by all masters, a directory doesn't help and the traffic scales at order N×((N-1)+1) = N², where the '+1' is the directory lookup
n In reality, data is probably rarely shared amongst more than 2 masters except in certain special-case scenarios
n For this reason, many of the larger systems (>64 processors) use this type of cache coherence
64
Snooping vs. Directory-based (3/3)
n Actually, these two schemes are really two ends of a continuum of approaches
n A snoop based system can be enhanced with snoop filters that can filter out unnecessary broadcast snoops by using partial directories
n Thus snoop filters enable larger scaling of snoop-based systems
n A directory-based system is akin to a snoop-based system with perfect, fully populated snoop filters
65
Snooping
n Typically used for bus-based (SMP) multiprocessors
n Serialization on the bus used to maintain coherence property 3
n Two flavors
n Write-update (write broadcast)
n A write to shared data is broadcast to update all copies
n All subsequent reads will return the new written value (property 2)
n All see the writes in the order of broadcasts: one bus == one order seen by all (property 3)
n Write-invalidate
n Write to shared data forces invalidation of all other cached copies
n Subsequent reads miss and fetch new value (property 2)
n Writes ordered by invalidations on the bus (property 3)
66
Update vs. Invalidate
n A burst of writes by a processor to one address
n Update: each write sends an update
n Invalidate: possibly only the first invalidation is sent
n Writes to different words of a block
n Update: an update is sent for each word
n Invalidate: possibly only the first invalidation is sent
n Producer-consumer communication latency
n Update: producer sends an update, consumer reads the new value from its cache
n Invalidate: producer invalidates consumer's copy, consumer's read misses and has to request the block
n Which is better depends on the application
n But write-invalidate is simpler and implemented in most MP-capable processors today
67
Cache Coherency Protocols
n Invalidation-based protocols
n Simple 2-state write-through invalidate protocol
n 3-state (MSI) write-back invalidate protocol
n 4-state MESI write-back invalidate protocol
n 5-state MOESI write-back invalidate protocol
n And many variants
n Update-based protocols
n Dragon
n …
68
2-State Invalidate Protocols
n Write-through caches
n Invalidation-based protocols
n The snooping cache monitors the bus for writes
n If it detects that another processor has written to a block it is caching, it invalidates its copy
n This requires each cache controller to perform a tag match operation
n Cache tags can be made dual-ported
69
[State diagram: two states, Valid and Invalid.
Invalid → Valid on Load / OwnGETS or Store / OwnGETX
Valid self-edges: Load / --, Store / OwnGETX, OtherGETS / --
Valid → Invalid on OtherGETX / --
Invalid self-edges: OtherGETS / --, OtherGETX / --]
3-State Write-Back Invalidate Protocol
n 2-State Protocol
n + Simple hardware and protocol
n – Uses lots of bandwidth (every write goes on the bus!)
n 3-State Protocol (MSI)
n Modified
n One cache exclusively has the valid (modified) copy → owner
n Memory is stale
n Shared
n >= 1 cache and memory have a valid copy (memory = owner)
n Invalid (only memory has a valid copy and memory is owner)
n Must invalidate all other copies before entering modified state
n Requires bus transaction (order and invalidate)
70
MSI Processor and Bus Actions
n Processor Actions:
n Load: load data in the cache line
n Store: store data into the cache line
n Eviction: processor wants to replace the cache block
n Bus Actions:
n GETS: request to get data in shared state
n GETX: request for data in modified state (i.e., eXclusive access)
n UPGRADE: request for exclusive access to data owned in shared state
n Cache Controller Actions:
n Source: this cache provides the data to the requesting cache (your copy is more recent than the copy in memory)
n Writeback: this cache updates the block in memory
71
MSI Snoopy Protocol
72
All edges are labeled with the activity that causes the transition; any value after the / represents an action place on the bus. All edges not shown are self edges that perform no actions (or are actions that are not possible)
[State diagram: states Invalid, Shared, Modified.
Invalid → Shared on Load / GETS
Invalid → Modified on Store / GETX
Shared → Modified on Store / UPGRADE
Shared → Invalid on Eviction, or on observed GETX or UPGRADE
Modified → Shared on observed GETS / SOURCE, WRITEBACK
Modified → Invalid on Eviction / WRITEBACK; a further GETS / SOURCE edge appears in the figure]
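As an illustration only (assumed C pseudologic, not hardware), the MSI transitions above can be written as two small functions – one for the local processor's requests and one for snooped bus transactions:

#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } msi_state_t;
typedef enum { LOAD, STORE, EVICT } cpu_op_t;
typedef enum { BUS_GETS, BUS_GETX, BUS_UPGRADE } bus_op_t;

/* Next state for this cache's own processor request; prints any bus transaction issued. */
static msi_state_t on_cpu(msi_state_t s, cpu_op_t op) {
    switch (op) {
    case LOAD:
        if (s == INVALID) { puts("issue GETS"); return SHARED; }
        return s;                                   /* S or M: hit */
    case STORE:
        if (s == INVALID) { puts("issue GETX"); return MODIFIED; }
        if (s == SHARED)  { puts("issue UPGRADE"); return MODIFIED; }
        return MODIFIED;                            /* already M: hit */
    case EVICT:
        if (s == MODIFIED) puts("writeback");
        return INVALID;
    }
    return s;
}

/* Next state when a transaction from another cache is snooped on the bus. */
static msi_state_t on_snoop(msi_state_t s, bus_op_t op) {
    if (s == MODIFIED && op == BUS_GETS) { puts("source + writeback"); return SHARED; }
    if (s == MODIFIED && op == BUS_GETX) { puts("source data"); return INVALID; }
    if (s == SHARED && (op == BUS_GETX || op == BUS_UPGRADE)) return INVALID;
    return s;                                       /* otherwise unaffected */
}

int main(void) {
    msi_state_t s = INVALID;
    s = on_cpu(s, LOAD);           /* I -> S, issues GETS */
    s = on_cpu(s, STORE);          /* S -> M, issues UPGRADE */
    s = on_snoop(s, BUS_GETS);     /* M -> S, sources data and writes back */
    printf("final state = %d\n", s);
    return 0;
}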
4-State MESI Invalidation Protocol
n MSI + new state: "Exclusive"
n Data that is a clean and unique copy (matches memory)
n Benefit: bandwidth reduction
73
MSI cache line states: (I)nvalid, or Valid and either Clean – (S)hared – or Dirty – (M)odified
MESI cache line states: (I)nvalid, or Valid and either Clean – (S)hared or (E)xclusive – or Dirty – (M)odified
MESI Protocol
n Let's consider what happens if we read a block and then subsequently wish to modify it
n This will require two bus transactions using the 3-state MSI protocol
n But if we know that we have the only copy of the block, the transaction required to transition from state S to M is really unnecessary
n We could safely, and silently, transition from S to M
n The E → M transition doesn't require a bus transaction
n Improvement over MSI depends on the number of E → M transitions
74
MOESI Protocol
n MESI + new state: "Owned"
n Data that is both modified and shared
n Benefit: bandwidth reduction
75
MESI cache line states: (I)nvalid, or Valid and either Clean – (S)hared or (E)xclusive – or Dirty – (M)odified
MOESI cache line states: (I)nvalid, or Valid and either Clean – (S)hared or (E)xclusive – or Dirty – (M)odified or (O)wned
MOESI Protocol
n An important assumption:
n Cache-to-cache transfer is possible, so a cache with the data in the modified state can supply that data to another reader without transferring it to memory
n O(wned) state
n Other shared copies of this block exist, but memory is stale
n This cache (the owner) is responsible for supplying the data when it observes the relevant bus transaction
n This avoids the need to write modified data back to memory when another processor wants to read it
n Look at the M to S transition in the MSI protocol
76
Cache-to-Cache Transfers
n Problem
n P1 has block B in M state
n P2 wants to read B, puts a RdReq on the bus
n If P1 does nothing, memory will supply the data to P2
n What does P1 do?
n Solution 1: abort/retry
n P1 cancels P2's request, issues a write back
n P2 later retries RdReq and gets data from memory
n Too slow (two memory latencies to move data from P1 to P2)
n Solution 2: intervention
n P1 indicates it will supply the data ("intervention" bus signal)
n Memory sees that, does not supply the data, and waits for P1's data
n P1 starts sending the data on the bus, memory is updated
n P2 snoops the transfer during the write-back and gets the block
77
Cache-to-Cache Transfers
n Intervention works if some cache has data in M state
n Nobody else has the correct data, so it is clear who supplies the data
n What if a cache has the requested data in S state?
n There might be others who have it; who should supply the data?
n Solution 1: let memory supply the data
n Solution 2: whoever wins arbitration supplies the data
n Solution 3: a separate state similar to S that indicates there are maybe others who have the block in S state, but if anybody asks for the data we should supply it
78
Coherence in Distributed Memory Multiprocessors
n Distributed memory systems are typically larger → bus-based snooping may not work well
n Option 1: software-based mechanisms – message-passing systems or software-controlled cache coherence
n Option 2: hardware-based mechanisms – directory-based cache coherence
79
Directory-Based Cache Coherence
n Typically in distributed shared memory
n For every local memory block, local directory has an entry
n Directory entry indicates
n Who has cached copies of the block
n In what state they have the block
80
[Figure: four nodes, each with a processor and caches, memory, I/O, and a directory, connected by an interconnection network.]
Basic Directory Scheme
n Read from main memory by processor i:
n If dirty-bit OFF then { read from main memory; turn p[i] ON; }
n If dirty-bit ON then { recall line from dirty proc (cache state to shared); update memory; turn dirty-bit OFF; turn p[i] ON; supply recalled data to i; }
n Write to main memory by processor i:
n If dirty-bit OFF then { supply data to i; send invalidations to all caches that have the block; turn dirty-bit ON; turn p[i] ON; ... }
n K processors
n With each cache-block in memory: K presence bits, 1 dirty-bit
n With each cache-block in cache: 1 valid bit, and 1 dirty (owner) bit
81
[Figure: processors P1…Pn, each with a cache, connected by an interconnect to memory; the directory holds, per memory block, presence bits and a dirty bit.]
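A minimal C sketch (assumed illustration) of the per-block directory entry and the read/write handling described above, using a 32-bit presence vector (so up to 32 processors):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* One directory entry per memory block. */
typedef struct {
    bool     dirty;               /* one cache holds an exclusive (modified) copy */
    uint32_t presence;            /* bit i set => processor i has a cached copy */
} dir_entry_t;

/* Read of a block by processor i (logic from the slide). */
static void dir_read(dir_entry_t *e, int i) {
    if (e->dirty) {
        /* recall line from the dirty processor (its cache state goes to shared),
           update memory, then clear the dirty bit */
        e->dirty = false;
    }
    e->presence |= 1u << i;       /* turn p[i] ON; supply the data to i */
}

/* Write by processor i: invalidate all other sharers, mark the block dirty. */
static void dir_write(dir_entry_t *e, int i) {
    /* send invalidations to every cache whose presence bit is set (except i) */
    e->presence = 1u << i;        /* only the writer keeps a copy */
    e->dirty = true;
}

int main(void) {
    dir_entry_t e = { false, 0 };
    dir_read(&e, 1);              /* P1 reads: presence = {P1} */
    dir_write(&e, 2);             /* P2 writes: invalidate P1, dirty, presence = {P2} */
    dir_read(&e, 3);              /* P3 reads: recall from P2, clean, presence = {P2, P3} */
    printf("dirty=%d presence=0x%x\n", e.dirty, (unsigned)e.presence);
    return 0;
}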
Directory Protocol
n Similar to Snoopy Protocol: three states
n Shared: ≥ 1 processors have data, memory up-to-date
n Uncached: no processor has it; not valid in any cache
n Exclusive: 1 processor (owner) has data; memory out-of-date
n Terms: typically 3 processors involved
n Local node: where a request originates
n Home node: where the memory location of an address resides
n Remote node: has a copy of a cache block, whether exclusive or shared
82
(Execution) Latency vs. Bandwidth
n Desktop processing
n Typically want an application to execute as quickly as possible (minimize latency)
n Server/Enterprise processing
n Often throughput oriented (maximize bandwidth)
n Latency of individual task less important
n ex. Amazon processing thousands of requests per minute: it’s ok if an individual request takes a few seconds more so long as total number of requests are processed in time
83
Implementing MP Machines
n One approach: add sockets to your MOBO
n Minimal changes to existing CPUs
n Power delivery, heat removal and I/O not too bad since each chip has its own set of pins and cooling
84
[Figure: four CPU sockets (CPU0–CPU3) on one motherboard.]
Chip-Multiprocessing
n Simple SMP on the same chip
85
Intel “Smithfield” Block Diagram AMD Dual-Core Athlon FX
Shared Caches
n Resources can be shared between CPUs
n ex. IBM Power5
[Figure: CPU0 and CPU1 share the L2 cache (no need to keep two copies coherent); the L3 cache is also shared (only tags are on-chip; data are off-chip).]
86
Benefits?
n Cheaper than mobo-based SMP
n All/most interface logic integrated onto the main chip (fewer total chips, single CPU socket, single interface to main memory)
n Less power than mobo-based SMP as well (communication on-die is more power-efficient than chip-to-chip communication)
n Performance
n On-chip communication is faster
n Efficiency
n Potentially better use of hardware resources than trying to make a wider/more OOO single-threaded CPU
87
Performance vs. Power
n 2x CPUs not necessarily equal to 2x performance
n 2x CPUs → ½ power for each
n Maybe a little better than ½ if resources can be shared
n Back-of-the-envelope calculation:
n 3.8 GHz CPU at 100 W
n Dual-core: 50 W per CPU
n P ∝ V³: V_orig³ / V_CMP³ = 100 W / 50 W → V_CMP ≈ 0.8 V_orig
n f ∝ V: f_CMP ≈ 3.0 GHz
88
Benefit of SMP: Full power budget per socket!
Summary
n Cache Coherence
n Coordinate accesses to shared, writeable data
n Coherence protocol defines cache line states, state transitions, actions
n Snooping and Directory Protocols similar; bus makes snooping easier because of broadcast (snooping ⇒ uniform memory access)
n Directory has an extra data structure to keep track of the state of all cache blocks
n Synchronization
n Locks and ISA support for atomicity
n Memory Consistency
n Defines programmers' expected view of memory
n Sequential consistency imposes ordering on loads/stores
89
Multi-core
90
Multi-core Architectures
n SMPs on a single chip
n Chip Multi-Processors (CMP)
n Pros
n Efficient exploitation of available transistor budget
n Improves throughput and speed of parallelized applications
n Allows tight coupling of cores
n Better communication between cores than in SMP
n Shared caches
n Low power consumption
n Low clock rates
n Idle cores can be suspended
n Cons
n Only improves speed of parallelized applications
n Increased gap to memory speed
91
Multi-core Architectures
n Design decisions
n Homogeneous vs. Heterogeneous
n Specialized accelerator cores
n SIMD
n GPU operations
n Cryptography
n DSP functions (e.g. FFT)
n FPGA (programmable circuits)
n Access to memory
n Own memory area (distributed memory)
n Via cache hierarchy (shared memory)
n Connection of cores
n Internal bus / crossbar connection
n Cache architecture
92
Multi-core Architectures: Examples
[Figure (left): homogeneous multi-core with shared caches and a crossbar – four cores with private L1s, pairs sharing an L2, a shared L3, two memory modules, and I/O.
Figure (right): heterogeneous multi-core with caches, local stores, and a ring bus – one 2x SMT core with L1 and L2 plus four cores with local stores, a memory module, and I/O.]
93
Shared Cache Design
94
[Figure: traditional design – multiple single-core chips, each with its own L1 and switch, sharing an off-chip L2 in front of main memory – versus a multi-core architecture with the shared L2 cache on-chip.]
Is a Multi-core really better off?
95
DEEP BLUE
480 chess chips, can evaluate 200,000,000 moves per second!!
IBM Watson Jeopardy! Competition (2011.2)
n POWER7 chips (2,880 cores) + 16TB memory
n Massively parallel processing
n Combine: Processing power, Natural language processing, AI, Search, Knowledge extraction
96
Major Challenges for Multi-core Designs
n Communication
n Memory hierarchy
n Data allocation (you have a large shared L2/L3 now)
n Interconnection network
n AMD HyperTransport
n Intel QPI
n Scalability
n Bus bandwidth – how to get there?
n Power-Performance — win or lose?
n Borkar's multi-core arguments
n 15% per-core performance drop → 50% power saving
n Giant, single core wastes power when the task is small
n How about leakage?
n Process variation and yield
n Programming Model
97
Intel Core 2 Duo
n Homogeneous cores
n Bus based on chip interconnect
n Shared on-die Cache Memory
n Traditional I/O
98
Classic OOO: Reservation Stations, Issue ports, Schedulers…etc
Large, shared set associative, prefetch, etc.
Source: Intel Corp.
Core 2 Duo Microarchitecture
99
Why Sharing on-die L2?
n What happens when L2 is too large?
100
101
CoreTM μArch — Wide Dynamic Execution
102
CoreTM μArch — Wide Dynamic Execution
CoreTM μArch — MACRO Fusion
n Common “Intel 32” instruction pairs are combined
n 4-1-1-1 decoder that sustains 7 μop’s per cycle
n 4+1 = 5 “Intel 32” instructions per cycle
103
Micro(-ops) Fusion (from Pentium M)
n A misnomer..
n Instead of breaking up an Intel32 instruction into μop, they decide not to break it up…
n A better naming scheme would call the previous techniques — “IA32 fission”
n To fuse
n Store-address and store-data μops
n Load-and-op μops (e.g. ADD (%esp), %eax)
n Extend each RS entry to take 3 operands
n To reduce
n Micro-ops (10% reduction in the OOO logic)
n Decoder bandwidth (a simple decoder can decode fusion-type instructions)
n Energy consumption
n Performance improved by 5% for INT and 9% for FP (Pentium M data)
104
105
Smart Memory Access
AMD Quad-Core Processor (Barcelona)
n True 128-bit SSE (as opposed 64 in prior Opteron)
n Sideband Stack Optimizer
n Parallelizes many POPs and PUSHes (which were dependent on each other)
n Converts them into pure load/store instructions
n No uops in FUs for stack pointer adjustment
106
On a different power plane from the cores
ø Source: AMD
Barcelona’s Cache Architecture
107
ø Source: AMD
108
Intel Penryn Dual-Core (First 45nm mprocessor)
• High-K dielectric metal gate
• 47 new SSE4 instructions
• Up to 12MB L2
• > 3GHz
ø Source: Intel
109
Intel Arrandale Processor
• 32nm
• Unified 3MB L3
• Power sharing (Turbo Boost) between cores and GFX via DFS
110
AMD 12-Core “Magny-Cours” Opteron
n 45nm
n 4 memory channels
111
IBM Power8
112
Cores
§ 12 cores (SMT8)
§ 8 dispatch, 10 issue, 16 exec pipes
§ 2X internal data flows/queues
§ Enhanced prefetching
§ 64K data cache, 32K instruction cache

Accelerators
§ Crypto & memory expansion
§ Transactional Memory
§ VMM assist
§ Data Move / VM Mobility

Caches
§ 512KB SRAM L2 / core
§ 96MB eDRAM shared L3
§ Up to 128MB eDRAM L4 (off-chip)

Memory
§ Up to 230 GB/s sustained bandwidth

Bus Interfaces
§ Durable open memory attach interface
§ Integrated PCIe Gen3
§ SMP Interconnect
§ CAPI (Coherent Accelerator Processor Interface)

Energy Management
§ On-chip Power Management Micro-controller
§ Integrated Per-core VRM
§ Critical Path Monitors

Technology
§ 22nm SOI, eDRAM, 15 ML, 650mm2
IBM Power8 Core
113
Execution Improvement vs. POWER7
§ SMT4 → SMT8
§ 8 dispatch
§ 10 issue
§ 16 execution pipes: 2 FXU, 2 LSU, 2 LU, 4 FPU, 2 VMX, 1 Crypto, 1 DFU, 1 CR, 1 BR
§ Larger issue queues (4x16-entry)
§ Larger global completion, Load/Store reorder
§ Improved branch prediction
§ Improved unaligned storage access

Larger Caching Structures vs. POWER7
§ 2x L1 data cache (64 KB)
§ 2x outstanding data cache misses
§ 4x translation cache

Wider Load/Store
§ 32B → 64B L2 to L1 data bus
§ 2x data cache to execution dataflow bus

Enhanced Prefetch
§ Instruction speculation awareness
§ Data prefetch depth awareness
§ Adaptive bandwidth awareness
§ Topology awareness

Core Performance vs. POWER7
~1.6x Single Thread
~2x Max SMT
POWER8 On Chip Caches
n L2: 512 KB 8 way per core
n L3: 96 MB (12 x 8 MB 8 way Bank)
n "NUCA" cache policy (Non-Uniform Cache Architecture)
n Scalable bandwidth and latency
n Migrate "hot" lines to local L2, then local L3 (replicate L2-contained footprint)
n Chip interconnect: 150 GB/sec per direction per segment
114
Cache Bandwidths
n GB/sec shown assuming 4 GHz
n Product frequency will vary based on model type
n Across the 12-core chip
n 4 TB/sec L2 BW
n 3 TB/sec L3 BW
115
POWER8 Memory Organization
n Up to 8 high-speed channels, each running up to 9.6 GB/s, for up to 230 GB/s sustained
n Up to 32 total DDR ports yielding 410 GB/s peak at the DRAM
n Up to 1 TB memory capacity per fully configured processor socket (at initial launch)
116
POWER8 Memory Buffer Chip
n Intelligence Moved into Memory
n Scheduling logic, caching structures
n Energy management, RAS decision point
n Formerly on processor
n Moved to memory buffer
n Processor Interface
n 9.6 GB/s high-speed interface
n More robust RAS
n "On-the-fly" lane isolation/repair
n Extensible for innovation build-out
n Performance Value
n End-to-end fastpath and data retry (latency)
n Cache → latency/bandwidth, partial updates
n Cache → write scheduling, prefetch, energy
n 22nm SOI for optimal performance/energy
n 15 metal levels (latency, bandwidth)
117
POWER8 Integrated PCIe Gen 3
n Native PCI Gen 3 Support
n Direct processor integration
n Replaces proprietary GX/Bridge
n Low latency
n High Gen 3 bandwidth (8 Gb/s) → high utilization realizable
n Transport Layer for CAPI Protocol
n Coherently Attach Devices connected via PCI
n Protocol encapsulated in PCI
118
POWER8 CAPI (Coherence Attach Processor Interface)
n Virtual Addressing
n Accelerator can work with the same memory addresses that the processor uses
n Pointers de-referenced same as the host application
n Removes OS & device driver overhead
n Hardware Managed Cache Coherence
n Enables the accelerator to participate in "locks" as a normal thread
n Lowers latency over an IO communication model
119
n Customizable Hardware Application Accelerator
n Specific system SW, middleware, or user application
n Written to durable interface provided by PSL
n Processor Service Layer (PSL)
n Presents robust, durable interfaces to applications
n Offloads complexity / content from CAPP
n PCI Gen 3: transport for encapsulated messages
POWER8
n Significant performance at thread, core and system
n Optimization for VM density & efficiency
n Strong enablement of autonomic system optimization
n Excellent Big Data analytics capability
120
Summary
n High frequency -> high power consumption
n Trend towards multiple cores on chip
n Broad spectrum of designs: homogeneous, heterogeneous, specialized, general purpose, number of cores, cache architectures, local memories, simultaneous multithreading, …
n Problem: memory latency and bandwidth
121
ARM MPCore Intra-Cluster Coherency Technology & ACE
122
ARM MPCore Intra-Cluster Coherency Technology
n ARM introduced MPCoreTM multi-core coherency technology in the ARM11 MPCore and subsequently in the Cortex-A5 and Cortex-A9 MPCore, which enables cache-coherency within a cluster of 2 to 4 processors
123
[Figure: ARM11 MPCore cluster – a configurable number (between 1 and 4) of symmetric CPUs, each CPU/VFP with its own L1 memory, connected to the Snoop Control Unit (SCU) over a 64-bit I&D bus and a coherency control bus; per-CPU timer, watchdog, CPU interface and IRQ, an interrupt distributor, private FIQ lines, and configurable HW interrupt lines.]
ARM MPCore Intra-Cluster Coherency Technology
n In Cortex-A15 MPCore, it is extended with AMBA 4 ACE coherency capability and thus supports
n Multiple CPU clusters, enabling systems containing more than 4 cores
n Heterogeneous systems consisting of multiple CPUs and cached accelerators
n An improved MESI protocol:
n Enables direct cache-to-cache copy of clean data and direct cache-to-cache move of dirty data within the cluster, without the write back to memory required in a normal MESI-based processor
n Further enhanced by the 'Snoop Control Unit' (SCU), which maintains a copy of all L1 data cache tag RAMs acting as a local, low-latency directory, enabling it to direct transfers only to the L1 caches as needed
n This increases performance, because unnecessary snoop traffic to the L1 caches would otherwise reduce the processor's own access to the L1 cache and so increase effective L1 access latency
124
ARM MPCore Intra-Cluster Coherency Technology
n Also supports an optional Accelerator Coherency Port (ACP), which enables un-cached accelerators access to the processor cache hierarchy, enabling ‘one-way’ coherency where the accelerator can read and write data within the CPU caches without a write-back to RAM
n But ACP cannot support cached accelerators since the CPU has no way to snoop accelerator caches, and the accelerator caches may contain stale data if the CPU writes accelerator-cached data
n Effectively the ACP acts like an additional master port into the SCU and the ACP interface consists of a regular AXI3 slave interface
125
Different meanings of “protocol”
n Cache coherent protocols
n System communication policies
n ACE protocol
n Interface communication protocol
n Interconnect responsibilities
n The ACE protocol does not guarantee coherency ⇒ ACE is a support for coherency
126
Different Kinds of Components
n Interconnect: called CCI (Cache Coherent Interconnect)
n ACE Masters: masters with caches
n ACE-Lite Masters: components without caches snooping other caches
n ACE-Lite/AXI Slaves: components not initiating snoop transactions
127
[Figure: the ARM CoreLink CCN-504 system shown earlier – quad Cortex-A57 clusters with per-cluster L2 caches, snoop filter and 8-16MB L3 cache, GIC-500, MMU-500, DMC-520 memory controllers, NIC-400 interconnects, and I/O.]
ACE Cache Coherency States
n ACE states of a cache line: 5-state cache model
n Each cache line is either Valid or Invalid
n The ACE states can be mapped directly onto the MOESI cache coherency model states; however, ACE is designed to support components that use a variety of internal cache state models, including MESI, MOESI, MEI and others
128
[Figure: ACE 5-state model – a line is Invalid or Valid; a Valid line is Unique or Shared and Clean or Dirty, giving UniqueDirty (UD), SharedDirty (SD), UniqueClean (UC), SharedClean (SC), plus Invalid.]
ARM ACE | MOESI | ACE Meaning
UniqueDirty | M (Modified) | Not shared, dirty, must be written back
SharedDirty | O (Owned) | Shared, dirty, must be written back to memory
UniqueClean | E (Exclusive) | Not shared, clean
SharedClean | S (Shared) | Shared, no need to write back, may be clean or dirty
Invalid | I (Invalid) | Invalid
ACE Cache Coherency States
n ACE does not prescribe the cache states a component can use → some components may not support all ACE transactions
n ARM Cortex-A15 MPCore internally uses MESI states for the L1 data cache, meaning the cache cannot be in the SharedDirty (Owned) state
n To emphasize that ACE is not restricted to the MOESI cache state model, ACE does not use the familiar MOESI terminology
129
[Figure and table: the same ACE 5-state diagram and ACE-to-MOESI mapping as on the previous slide.]
ACE Design Principle
n Lines held in more than one cache must be held in the Shared state
n Only one copy can be in the SharedDirty state, and that is the one that is responsible for updating memory
n Devices are not required to support all 5 states in the protocol internally → flexible
n System interconnect is responsible for coordinating the progress of all shared (coherent) transactions and can handle these in various manners, e.g.
n The interconnect may present snoop addresses to all masters in parallel simultaneously, or it may present snoop addresses one at a time serially
130
ACE Design Principle
n System interconnect may choose either
n to perform speculative reads to lower latency,
n or to wait until snoop responses have been received, to reduce system power consumption by minimizing external memory reads
n The interconnect may include a directory or snoop filter, or it may broadcast snoops to all masters
n ACE has been designed to enable performance and power optimizations by avoiding wherever possible unnecessary external memory accesses
n ACE facilitates direct master-to-master data transfer wherever possible
131
ACE Additional Signals and Channels
n AMBA 4 ACE is backwards-compatible with AMBA 4 AXI adding additional signals and channels to the AMBA 4 AXI interface
n The AXI interface consists of 5 channels
n In AXI, the read and write channels each have their own dedicated address and control channel
n The BRESP channel is used to indicate the completion of write transactions
132
[Figure: the five AXI channels – read address (ARADDR), read data (RDATA), write address (AWADDR), write data (WDATA), and write response (BRESP).]
ACE Additional Signals and Channels
133
AXI4 Channel | Signal | Source | Description
Read address | ARDOMAIN[1:0] | Master | Indicates the shareability domain of a read transaction
Read address | ARSNOOP[3:0] | Master | Indicates the transaction type for Shareable read transactions
Read address | ARBAR[1:0] | Master | Indicates a read barrier transaction
Write address | AWDOMAIN[1:0] | Master | Indicates the shareability domain of a write transaction
Write address | AWSNOOP[2:0] | Master | Indicates the transaction type for Shareable write transactions
Write address | AWBAR[1:0] | Master | Indicates a write barrier transaction
Write address | AWUNIQUE | Master | Indicates that a line is permitted to be held in a Unique state
Read data | RRESP[3:2] | Slave | Read response. The additional read response bits provide the information required to complete a Shareable read transaction.
ACE Additional Signals and Channels
n 3 new channels are supported, these are the snoop address channel, the snoop data channel, and the snoop response channel
n The snoop address (AC) channel is an input to a cached master that provides the address and associated control information for snoop transactions
n The snoop response (CR) channel is an output channel from a cached master that provides a response to a snoop transaction
n Every snoop transaction has a single response associated with it
n The snoop response indicates if an associated data transfer on the CD channel is expected
n The snoop data (CD) channel is an optional output channel that passes snoop data out from a master
n Typically, this occurs for a read or clean snoop transaction when the master being snooped has a copy of the data available to return
134
ACE Additional Signals and Channels
135
ACE-specific Channel | Signal | Source | Description
Snoop address | ACVALID | Slave | Snoop address and control information is valid
Snoop address | ACREADY | Master | Snoop address ready
Snoop address | ACADDR[ac-1:0] | Slave | Snoop address
Snoop address | ACSNOOP[3:0] | Slave | Snoop transaction type
Snoop address | ACPROT[2:0] | Slave | Snoop protection type
Snoop response | CRVALID | Master | Snoop response valid
Snoop response | CRREADY | Slave | Snoop response ready
Snoop response | CRRESP[4:0] | Master | Snoop response
Snoop data | CDVALID | Master | Snoop data valid
Snoop data | CDREADY | Slave | Snoop data ready
Snoop data | CDDATA[cd-1:0] | Master | Snoop data
Snoop data | CDLAST | Master | Indicates the last data transfer of a snoop transaction
(Acknowledge) | RACK | Master | Read acknowledge
(Acknowledge) | WACK | Master | Write acknowledge
ACE Transactions
136
n Non-shared: Read, Write
n Read Shareable: ReadClean, ReadNotSharedDirty, ReadShared
n Non-cached: ReadOnce, WriteUnique, WriteLineUnique
n Write Shareable: ReadUnique, CleanUnique, MakeUnique
n Cache Maintenance: CleanShared, CleanInvalid, MakeInvalid
n Memory update: WriteBack, WriteClean, Evict
n (The original figure also marks the ACE-Lite transaction subset and the role of Evict with a snoop filter)
Summary: ACE
n ACE states of a cache line: 5-state cache model
n ACE channels
n Read channels (AR, R)
n Write channels (AW, W, B)
n Snoop channels (AC, CR, CD)
n ACE supported policies
n 100% snoop
n Directory based
n Anything in between (snoop filter)
137
[Figure: ACE state diagram — ACE_UD, ACE_UC, ACE_SD, ACE_SC, ACE_I arranged along the Valid/Invalid, Unique/Shared, and Dirty/Clean axes]
Appendix
138
Non-Uniform Cache Architecture
n Proposed by UT-Austin at ASPLOS 2002
n Facts
n Large shared on-die L2
n Wire delay dominates on-die cache access time
139
n L2 access latency scaling with technology:
n 1 MB @ 180 nm (1999): 3 cycles
n 4 MB @ 90 nm (2004): 11 cycles
n 16 MB @ 50 nm (2010): 24 cycles
140
Multi-banked L2 cache
n 2 MB @ 130 nm, bank size = 128 KB
n Access latency: 11 cycles (bank access time = 3 cycles, interconnect delay = 8 cycles)
141
Multi-banked L2 cache
n 16 MB @ 50 nm, bank size = 64 KB
n Access latency: 47 cycles (bank access time = 3 cycles, interconnect delay = 44 cycles)
142
Static NUCA-1
n Use private per-bank channel
n Each bank has its distinct access latency
n Statically decide data location for a given address (see the address-to-bank sketch below)
n Average access latency = 34.2 cycles
n Wire overhead = 20.9% → an issue
[Figure: S-NUCA-1 organization — per-bank private address/data buses from the tag array to the banks; each bank contains sub-banks with a predecoder, wordline drivers/decoders, and sense amplifiers]
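A minimal sketch of the S-NUCA idea (illustrative only; the bank count, index bits, and per-bank latency table are made-up values, not figures from the paper): the bank a line maps to is a fixed function of its address, and each bank has its own fixed access latency.

#include <stdint.h>

#define NUM_BANKS 16
#define LINE_BITS 6            /* 64-byte cache lines (assumed) */

/* Assumed per-bank latencies: banks closer to the controller respond faster. */
static const int bank_latency[NUM_BANKS] = {
     9, 11, 13, 15, 17, 19, 21, 23,
    25, 27, 29, 31, 33, 35, 37, 39
};

/* S-NUCA: the bank is chosen statically from the address bits. */
static int nuca_bank(uint64_t addr) {
    return (addr >> LINE_BITS) & (NUM_BANKS - 1);
}

/* Access latency therefore depends only on where the address maps. */
static int nuca_latency(uint64_t addr) {
    return bank_latency[nuca_bank(addr)];
}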
Static NUCA-2
n Use a 2D switched network to alleviate wire area overhead
n Average access latency = 24.2 cycles
n Wire overhead = 5.9%
143
[Figure: S-NUCA-2 organization — banks connected by a 2D switched network; switches, data bus, tag array, predecoder, and wordline drivers/decoders]
Multiprocessors
n Shared-memory Multiprocessors
n Provide a shared-memory abstraction
n Enables a familiar and efficient programmer interface
144
[Figure: four processors P1–P4, each with a cache, memory, and network interface, connected by an interconnection network]
Processors and Memory – UMA
n Uniform Memory Access (UMA)
n Access all memory locations with the same latency
n Pros: Simplifies software. Data placement does not matter
n Cons: Lowers peak performance. Latency defined by worst case
n Implementation: Bus-based UMA for symmetric multiprocessor (SMP)
145
[Figure: UMA organization — four CPUs (each with cache) and memory modules on a shared interconnect]
Processors and Memory – NUMA
n Non-Uniform Memory Access (NUMA)
n Access local memory locations faster
n Pros: Increases peak performance
n Cons: Increases software complexity; data placement matters
n Implementation: Network-based NUMA with various network topologies, which require routers (R)
146
[Figure: NUMA organization — four CPUs (each with cache), each with local memory, connected through routers (R)]
Networks and Topologies
n Shared Networks
n Every CPU can communicate with every other CPU via a bus or crossbar
n Pros: lower latency
n Cons: lower bandwidth and more difficult to scale with processor count (e.g., 16)
n Point-to-Point Networks
n Every CPU can talk to specific neighbors (depending on topology)
n Pros: higher bandwidth and easier to scale with processor count (e.g., 100s)
n Cons: higher multi-hop latencies
147
[Figure: shared network (CPUs and memories on one interconnect with routers) vs. point-to-point network (each CPU/memory node with its own router linked to neighboring nodes)]
Topology 1 – Bus
n Network Topology
n Defines organization of network nodes
n Topologies differ in connectivity, latency, bandwidth, and cost
n Notation: f(1) denotes constant independent of p, f(p) denotes linearly increasing cost with p, etc.
n Bus
n Direct interconnect style
n Latency: f(1) wire delay
n Bandwidth: f(1/p), not scalable (p <= 4)
n Cost: f(1) wire cost
n Supports ordered broadcast only
148
Topology 2 – Crossbar Switch
n Network Topology
n Defines organization of network nodes
n Topologies differ in connectivity, latency, bandwidth, and cost
n Notation: f(1) denotes constant independent of p, f(p) denotes linearly increasing cost with p, etc.
n Crossbar Switch
n Indirect interconnect
n Switches implemented as big multiplexors
n Latency: f(1) constant latency
n Bandwidth: f(1)
n Cost: f(2P) wires, f(P^2) switches
149
Topology 3 – Multistage Network
n Network Topology
n Defines organization of network nodes
n Topologies differ in connectivity, latency, bandwidth, and cost
n Notation: f(1) denotes constant independent of p, f(p) denotes linearly increasing cost with p, etc.
n Multistage Network
n Indirect interconnect
n Routing done by address decoding
n k: switch arity (number of inputs or outputs)
n d: number of network stages = log_k P
n Latency: f(d)
n Bandwidth: f(1)
n Cost: f(d×P/k) switches, f(P×d) wires
n Commonly used in large UMA systems
150
Topology 4 – 2D Torus
n Network Topology
n Defines organization of network nodes
n Topologies differ in connectivity, latency, bandwidth, and cost
n Notation: f(1) denotes constant independent of p, f(p) denotes linearly increasing cost with p, etc.
n 2D Torus
n Direct interconnect
n Latency: f(P^(1/2)) (a small hop-count sketch follows)
n Bandwidth: f(1)
n Cost: f(2P) wires
n Scalable and widely used
n Variants: 1D torus, 2D mesh, 3D torus
151
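As a small illustration of why 2D-torus latency scales as f(P^(1/2)) (a sketch assuming a square sqrt(P) x sqrt(P) layout with dimension-order routing; not from the slides), the hop count between two nodes is the wraparound Manhattan distance, which is at most about sqrt(P):

/* Hop count between nodes (x0,y0) and (x1,y1) on an n x n torus (P = n*n nodes).
 * Each dimension can wrap around, so the per-dimension distance is at most n/2,
 * giving a worst-case path of roughly n = sqrt(P) hops. */
static int torus_dist_1d(int a, int b, int n) {
    int d = a > b ? a - b : b - a;
    return d < n - d ? d : n - d;   /* take the shorter way around the ring */
}

static int torus_hops(int x0, int y0, int x1, int y1, int n) {
    return torus_dist_1d(x0, x1, n) + torus_dist_1d(y0, y1, n);
}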
Challenges in Shared Memory
n Cache Coherence
n “Common Sense”
n Synchronization
n Atomic read/write operations
n Memory Consistency
n What behavior should programmers expect from shared memory?
n Provide a formal definition of memory behavior to the programmer
n Example: When will a written value be seen?
n Example: P1-Write[X] <<10ps>> P2-Read[X]. What happens?
152
P1-Read[X] → P1-Write[X] → P1-Read[X] : read returns X
P1-Write[X] → P2-Read[X] : read returns value written by P1
P1-Write[X] → P2-Write[X] : writes serialized
All P’s see writes in the same order
Example Execution
153
Processor 0 and Processor 1 each execute:

0: addi r1, accts, r3   # get addr for account
1: ld 0(r3), r4         # load balance into r4
2: blt r4, r2, 6        # check for sufficient funds
3: sub r4, r2, r4       # withdraw
4: st r4, 0(r3)         # store new balance
5: call give-cash
n Two withdrawals from one account, from two ATMs
n Withdraw value: r2 (e.g., $100)
n Account memory address: accts+r1
n Account balance: r4
Scenario 1 – No Caches
154
Processor 0 runs the withdrawal code, then Processor 1 runs the same code:

Processor 0                      Processor 1
0: addi r1, accts, r3
1: ld 0(r3), r4
2: blt r4, r2, 6
3: sub r4, r2, r4
4: st r4, 0(r3)
5: call give-cash
                                 0: addi r1, accts, r3
                                 1: ld 0(r3), r4
                                 2: blt r4, r2, 6
                                 3: sub r4, r2, r4
                                 4: st r4, 0(r3)
                                 5: call give-cash

n Processors have no caches
n Withdrawals update the balance without a problem
n Memory balance: 500 → 400 (after P0's store) → 300 (after P1's store)
Scenario 2a – Cache Incoherence
155
Both processors execute the withdrawal code, Processor 0 first, then Processor 1. Cache/memory trace:

P0 cache    P1 cache    Mem
  -            -        500     (initial balance)
V: 500         -        500     (P0 loads the balance)
D: 400         -        500     (P0 stores 400; write-back cache, memory not updated)
D: 400      V: 500      500     (P1 misses and loads the stale 500 from memory)
D: 400      D: 400      500     (P1 stores 400; one withdrawal is lost)

n Processors have write-back caches
n Processor 0 updates the balance in its cache, but does not write it back to memory
n Multiple copies of memory location [accts+r1] exist
n Copies may become inconsistent
Scenario 2b – Cache Incoherence
156
Both processors execute the withdrawal code again. Cache/memory trace:

P0 cache    P1 cache    Mem
  -            -        500     (initial balance)
V: 500         -        500     (P0 loads the balance)
D: 400         -        400     (P0 stores 400; write-through updates memory)
D: 400      V: 400      400     (P1 loads 400)
V: 400      D: 400      300     (memory is updated to 300, but a stale 400 remains cached)

n Processors have write-through caches
n What happens if processor 0 performs another withdrawal?
Hardware Coherence Protocols
n Absolute Coherence
n All cached copies have the same data at the same time
n Slow and hard to implement
n Relative Coherence
n Temporary incoherence is OK (e.g., write-back caches) as long as no load reads incoherent data
n Coherence Protocol
n Finite state machine that runs for every cache line
n (1) Defines states per cache line
n (2) Defines state transitions based on bus activity
n (3) Requires a coherence controller to examine bus traffic (address, data)
n (4) Invalidates or updates cache lines
157
[Figure: CPU with D$ data and D$ tag arrays; a coherence controller (CC) sits between the cache and the bus]
Protocol 1 – Write Invalidate
n Mechanics
n Processor P performs a write and broadcasts the address on the bus
n All other processors (!P) snoop the bus; if the address is locally cached, !P invalidates its local copy
n Processor P performs a read and broadcasts the address on the bus
n All other processors (!P) snoop the bus; if the address is locally cached, !P writes back its local copy
n Example
158
Processor Activity | Bus Activity | Data in Cache-A | Data in Cache-B | Data in Mem[X]
(initial) | | | | 0
CPU-A reads X | Cache miss for X | 0 | | 0
CPU-B reads X | Cache miss for X | 0 | 0 | 0
CPU-A writes 1 to X | Invalidation for X | 1 | | 0
CPU-B reads X | Cache miss for X | 1 | 1 | 1
Cache Coherent Systems
n Provide Coherence Protocol
n States
n State transition diagram
n Actions
n Implement Coherence Protocol
n (0) Determine when to invoke the coherence protocol
n (1) Find the state of the cache line to determine the action
n (2) Locate other cached copies
n (3) Communicate with other cached copies (invalidate, update)
n Implementation Variants
n (0) is done in the same way for all systems: maintain additional state per cache line and invoke the protocol based on that state
n (1)–(3) have different approaches
159
Implementation 1 – Snooping
n Bus-based Snooping
n All cache/coherence controllers observe and react to all bus events
n Protocol relies on globally visible events
n i.e., all processors see all events
n Protocol relies on globally ordered events
n i.e., all processors see all events in the same sequence
n Bus Events
n Processor events (initiated by own processor P)
n read (R), write (W), write-back (WB)
n Bus events (initiated by other processors !P)
n bus read (BR), bus write (BW)
160
Three-State Invalidate Protocol
n Implement protocol for every cache line.
n Add state bits to every cache line to indicate (1) invalid, (2) shared, (3) exclusive (see the sketch below)
161
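A minimal C sketch of such a three-state (MSI-style) per-line state machine, assuming the bus events listed earlier (R and W from the local processor, BR and BW from others); the state names and event encoding here are illustrative, not taken from the slides:

typedef enum { INVALID, SHARED, EXCLUSIVE } line_state_t;   /* (1)(2)(3) */
typedef enum { EV_READ, EV_WRITE, EV_BUS_READ, EV_BUS_WRITE } event_t;

/* Next state for one cache line, given an observed event. */
static line_state_t next_state(line_state_t s, event_t e) {
    switch (e) {
    case EV_READ:       /* local read: fetch the line if we miss */
        return (s == INVALID) ? SHARED : s;
    case EV_WRITE:      /* local write: gain exclusive ownership, others invalidate */
        return EXCLUSIVE;
    case EV_BUS_READ:   /* another processor reads: demote an exclusive copy to shared */
        return (s == EXCLUSIVE) ? SHARED : s;
    case EV_BUS_WRITE:  /* another processor writes: our copy becomes stale */
        return INVALID;
    }
    return s;
}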
Example
162
P1 read (A)
P2 read (A1)
P1 write (B)
P2 read (C)
P1 write (D)
P2 write (E)
P2 write (F-Z)
Implementation 2 – Directory
n Bus-based Snooping – Limitations
n Snooping scalability is limited
n Bus has insufficient data bandwidth for coherence traffic
n Processor has insufficient snooping bandwidth for coherence traffic
n Directory-based Coherence – Scalable Alternative
n Directory contains state for every cache line
n Directory identifies processors with cached copies and their states
n In contrast to snoopy protocols, processors observe/act only on relevant memory events; the directory determines whether a processor is involved
163
Directory Communication
n Processor sends coherence events to the directory (see the sketch below)
1. Find the directory entry
2. Identify processors with copies
3. Communicate with processors, if necessary
164
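A minimal sketch of what a directory entry and the three lookup steps above might look like in C (the field names, the bit-vector sharer list, the 64-processor limit, and the messaging hook are illustrative assumptions, not details from the slides):

#include <stdint.h>

#define MAX_PROCS 64

typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } dir_state_t;

/* One directory entry per memory block. */
struct dir_entry {
    dir_state_t state;
    uint64_t    sharers;   /* bit i set => processor i has a copy */
};

/* Assumed messaging hook; a real system would send a network message. */
static void send_invalidate(int proc, uint64_t block) { (void)proc; (void)block; }

/* Handle a write request: (1) find the entry, (2) identify sharers,
 * (3) communicate only with processors that actually hold a copy. */
void dir_handle_write(struct dir_entry *dir, uint64_t block, int requester) {
    struct dir_entry *e = &dir[block];                 /* (1) find directory entry */
    for (int p = 0; p < MAX_PROCS; p++) {              /* (2) identify copies */
        if (p != requester && (e->sharers & (1ULL << p)))
            send_invalidate(p, block);                 /* (3) communicate */
    }
    e->state   = DIR_EXCLUSIVE;
    e->sharers = 1ULL << requester;
}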
Challenges in Shared Memory
n Cache Coherence
n “Common Sense”
n Synchronization
n Atomic read/write operations
n Memory Consistency
n What behavior should programmers expect from shared memory?
n Provide a formal definition of memory behavior to the programmer
n Example: When will a written value be seen?
n Example: P1-Write[X] <<10ps>> P2-Read[X]. What happens?
165
P1-Read[X] → P1-Write[X] → P1-Read[X] : read returns X
P1-Write[X] → P2-Read[X] : read returns value written by P1
P1-Write[X] → P2-Write[X] : writes serialized
All P’s see writes in the same order
Synchronization
n Regulate access to data shared by processors
n The synchronization primitive is a lock
n A critical section is a code segment that accesses shared data
n A processor must acquire the lock before entering the critical section
n A processor should release the lock when exiting the critical section
n Spin Locks – Broken Implementation

acquire (lock)     # if lock == 0, then set lock = 1, else spin
critical section
release (lock)     # lock = 0
Inst-0: ldw R1, lock      # load lock into R1
Inst-1: bnez R1, Inst-0   # check lock; if lock != 0, go back to Inst-0
Inst-2: stw 1, lock       # acquire lock, set to 1
<<critical section>>      # access shared data
Inst-n: stw 0, lock       # release lock, set to 0
166
Implementing Spin Locks
n Problem: Lock acquire is not atomic
n A set of atomic operations either all complete or all fail; during a set of atomic operations, no other processor can interject
n A spin lock requires an atomic load-test-store sequence
167
Processor 0                               Processor 1
Inst-0: ldw R1, lock
Inst-1: bnez R1, Inst-0   # P0 sees lock is free
                                          Inst-0: ldw R1, lock
                                          Inst-1: bnez R1, Inst-0   # P1 sees lock is free
Inst-2: stw 1, lock       # P0 acquires lock
                                          Inst-2: stw 1, lock       # P1 acquires lock
...                                       ...   # P0/P1 in critical section at the same time
Inst-n: stw 0, lock
Implementing Spin Locks
n Solution: Test-and-set instruction
n Add a single instruction for load-test-store (t&s R1, lock)
n Test-and-set atomically executes
    ld R1, lock   # load previous lock value
    st 1, lock    # store 1 to set/acquire
n If lock is initially free (0), t&s acquires the lock (sets it to 1)
n If lock is initially busy (1), t&s does not change it
n The instruction is un-interruptible/atomic by definition
168
Inst-0: t&s R1, lock      # atomically load, check, and set lock = 1
Inst-1: bnez R1, Inst-0   # if previous value of R1 not 0, acquire unsuccessful, retry
....
Inst-n: stw 0, lock       # release lock
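A hedged C11 sketch of the same idea, using atomic_exchange as the test-and-set primitive (the function and variable names are illustrative; real code would also consider memory ordering, discussed later):

#include <stdatomic.h>

static atomic_int lock = 0;              /* 0 = free, 1 = held */

void ts_acquire(void) {
    /* atomic_exchange is the test-and-set: write 1 and return the old value */
    while (atomic_exchange(&lock, 1) != 0)
        ;                                /* old value was 1: lock busy, spin */
}

void ts_release(void) {
    atomic_store(&lock, 0);              /* set lock back to 0 */
}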
Test-and-Set Inefficiency
n Test-and-set works…
n …but performs poorly
n Suppose Processor 2 (not shown) has the lock
n Processors 0/1 must…
n Execute a loop of t&s instructions
n Issue multiple store instructions
n Generate useless interconnection traffic
169
Processor 0                               Processor 1
Inst-0: t&s R1, lock
Inst-1: bnez R1, Inst-0   # P0 does not acquire, retries
                                          Inst-0: t&s R1, lock
                                          Inst-1: bnez R1, Inst-0   # P1 does not acquire, retries
Test-and-Test-and-Set Locks
n Solution: Test-and-test-and-set
n Advantages
n Spins locally without stores
n Reduces interconnect traffic
n Not a new instruction, simply new software (lock implementation)
170
Inst-0: ld R1, lock       # test with a load, see if lock changed
Inst-1: bnez R1, Inst-0   # if lock == 1, spin
Inst-2: t&s R1, lock      # if lock looked free, test-and-set
Inst-3: bnez R1, Inst-0   # if it could not acquire, spin
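The same pattern in the hedged C11 style used above (an illustrative sketch): spin on a plain load first, and only attempt the atomic exchange when the lock looks free.

#include <stdatomic.h>

static atomic_int lock = 0;                  /* 0 = free, 1 = held */

void tts_acquire(void) {
    for (;;) {
        while (atomic_load(&lock) != 0)      /* test: spin on the cached copy, no stores */
            ;
        if (atomic_exchange(&lock, 1) == 0)  /* test-and-set only when it looked free */
            return;                          /* acquired */
    }
}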
Semaphores
n Semaphore (semaphore S, integer N)
n Allows up to N parallel threads to access a shared variable
n If N = 1, equivalent to a lock
n Requires an atomic fetch-and-add
171
Function Init (semaphore S, integer N) {
    S = N;
}

Function P (semaphore S) {     # “Proberen”, to test
    while (S == 0) { };
    S = S - 1;
}

Function V (semaphore S) {     # “Verhogen”, to increment
    S = S + 1;
}
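A hedged C11 sketch of P/V built on atomic read-modify-write operations (compare-and-swap here, standing in for the fetch-and-add the slide mentions; the names are illustrative):

#include <stdatomic.h>

static atomic_int sem;                       /* initialized to N by sem_init_n() */

void sem_init_n(int n) { atomic_store(&sem, n); }

void sem_P(void) {                           /* wait, "Proberen" */
    for (;;) {
        int v = atomic_load(&sem);
        /* only decrement if the count is positive, and do it atomically */
        if (v > 0 && atomic_compare_exchange_weak(&sem, &v, v - 1))
            return;
    }
}

void sem_V(void) {                           /* signal, "Verhogen" */
    atomic_fetch_add(&sem, 1);
}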
Challenges in Shared Memory
n Cache Coherence
n “Common Sense”
n Synchronization
n Atomic read/write operations
n Memory Consistency
n What behavior should programmers expect from shared memory?
n Provide a formal definition of memory behavior to the programmer
n Example: When will a written value be seen?
n Example: P1-Write[X] <<10ps>> P2-Read[X]. What happens?
172
P1-Read[X] → P1-Write[X] → P1-Read[X] : read returns X
P1-Write[X] → P2-Read[X] : read returns value written by P1
P1-Write[X] → P2-Write[X] : writes serialized
All P’s see writes in the same order
Memory Consistency
n Execution Example
n Intuition – P1 should print A=1
n Coherence – Makes no guarantees!
173
A = Flag = 0 initially

Processor 0            Processor 1
A = 1                  while (!Flag) { }
Flag = 1               print A
Consistency and Caches
n Execution Example
n Caching Scenario
1. P0 writes A=1. Misses in the cache. Puts the write into a store buffer.
2. P0 continues execution.
3. P0 writes Flag=1. Hits in the cache. Completes the write (with coherence).
4. P1 reads Flag=1.
5. P1 exits the spin loop.
6. P1 prints A=0.
n Caches, buffering, and other performance mechanisms can cause strange behavior
174
A = Flag = 0 initially

Processor 0            Processor 1
A = 1                  while (!Flag) { }
Flag = 1               print A
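A hedged C11 sketch of this scenario (the function names and the use of relaxed atomics are illustrative; run each function on its own thread): with relaxed ordering, nothing stops the Flag update from becoming visible before the data, so the consumer may legally print 0.

#include <stdatomic.h>
#include <stdio.h>

atomic_int A    = 0;
atomic_int Flag = 0;

void producer(void) {                 /* Processor 0 */
    atomic_store_explicit(&A, 1, memory_order_relaxed);
    atomic_store_explicit(&Flag, 1, memory_order_relaxed);
}

void consumer(void) {                 /* Processor 1 */
    while (atomic_load_explicit(&Flag, memory_order_relaxed) == 0)
        ;                             /* spin until Flag is set */
    /* With relaxed ordering this may print 0; with seq_cst (or release/acquire
       ordering on Flag) it must print 1. */
    printf("A = %d\n", atomic_load_explicit(&A, memory_order_relaxed));
}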
Sequential Consistency (SC)
n Definition of Sequential Consistency
n Formal definition of programmers’ expected view of memory
(1) Each processor P sees its own loads/stores in program order
(2) Each processor P sees !P loads/stores in program order
(3) All processors see the same global load/store ordering
n P and !P loads/stores may be interleaved into some order, but all processors see the same interleaving/ordering
n Definition of Multiprocessor Ordering [Lamport]
n Multiprocessor ordering corresponds to some sequential interleaving of uniprocessor orderings; multiprocessor ordering should be indistinguishable from a multi-programmed uniprocessor
175
Enforcing SC
n Consistency and Coherence
n SC definition: loads/stores globally ordered
n SC implication: coherence events of all loads/stores globally ordered
n Implementing Sequential Consistency
n All loads/stores commit in order
n Delay completion of a memory access until all invalidations caused by the access are complete
n Delay a memory access until the previous memory access is complete
n Delay a memory read until the previous write completes; cannot place writes in a buffer and continue with reads
n Simple for the programmer but constrains HW/SW performance optimizations
176
Weaker Consistency Models
n Assume programs are synchronized
n SC required only for lock variables
n Other variables are either (1) in a critical section and cannot be accessed in parallel, or (2) not shared
n Use fences to restrict reordering
n Increases opportunity for HW optimization but increases programmer effort
n Memory fences stall execution until write buffers empty
n Allows load/store reordering inside the critical section
n Slows lock acquire and release
177
acquire
memory fence
critical section
memory fence     # ensures all writes from critical section
release          # are cleared from buffer
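A hedged C11 sketch of that acquire/fence/release pattern, reusing the illustrative test-and-set lock from earlier (atomic_thread_fence is the standard's memory fence; the fence placement mirrors the slide, not any particular ISA):

#include <stdatomic.h>

static atomic_int lock = 0;
int shared_data;                               /* protected by lock */

void locked_update(int v) {
    while (atomic_exchange_explicit(&lock, 1, memory_order_relaxed) != 0)
        ;                                      /* acquire (spin) */
    atomic_thread_fence(memory_order_acquire); /* memory fence after acquire */

    shared_data = v;                           /* critical section */

    atomic_thread_fence(memory_order_release); /* fence: flush critical-section writes */
    atomic_store_explicit(&lock, 0, memory_order_relaxed);  /* release */
}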