Multiprocessors: Large vs. Small Scale
Small-Scale MIMD Designs
– Memory: centralized with uniform memory access time (UMA) and bus interconnect
– Examples: SPARCcenter
Large-Scale MIMD Designs
– Memory: distributed with non-uniform memory access time (NUMA) and scalable interconnect
– Examples: Cray T3D, Intel Paragon, CM-5
Communication Models
Shared Memory
– Communication via shared address space
– Advantages:
  • Ease of programming
  • Lower latency
  • Easier to use hardware-controlled caching
Message passing
– Processors have private memories, communicate via messages
– Advantages:
  • Less hardware, easier to design
  • Focuses attention on costly non-local operations
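The two models can be contrasted with a minimal sketch. It uses Python threads, which all live in one address space, so the message-passing half only models private memories by convention. The function names `shared_memory_sum` and `message_passing_sum` are illustrative assumptions, not part of the slides.

```python
import queue
import threading

def _run(worker, chunks):
    """Start one thread per chunk and wait for all of them."""
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def shared_memory_sum(chunks):
    """Workers communicate by updating one shared variable."""
    total = 0
    lock = threading.Lock()

    def worker(chunk):
        nonlocal total
        partial = sum(chunk)
        with lock:              # synchronize access to the shared location
            total += partial

    _run(worker, chunks)
    return total

def message_passing_sum(chunks):
    """Workers keep their data private and send results as messages."""
    mailbox = queue.Queue()

    def worker(chunk):
        mailbox.put(sum(chunk))  # the communication step is explicit

    _run(worker, chunks)
    return sum(mailbox.get() for _ in chunks)

data = [[1, 2], [3, 4], [5]]
print(shared_memory_sum(data), message_passing_sum(data))
```

In the shared-memory version communication is implicit in an ordinary assignment. In the message-passing version every non-local operation is a visible `put` or `get`, which is the "focuses attention" advantage listed above.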
Communication Properties
Bandwidth
– Need high bandwidth in communication
– Limits in network, memory, and processor
Latency
– Affects performance, since the processor waits
– Affects ease of programming: how to overlap communication and computation
Latency Hiding
– How can a mechanism help hide latency?
– Examples: overlap message send with computation, prefetch
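Overlapping communication with computation can be sketched as follows. The sketch is an illustration under assumptions: `fetch_remote` is a hypothetical stand-in for a slow remote access (simulated with a sleep), and a background thread plays the role of the hardware or network doing the transfer.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_remote(value, delay=0.05):
    """Hypothetical remote access; the sleep models communication latency."""
    time.sleep(delay)
    return value

def local_work(n):
    """Computation that does not depend on the remote value."""
    return sum(i * i for i in range(n))

def blocking(value, n):
    remote = fetch_remote(value)       # processor waits for the full latency
    return remote + local_work(n)

def overlapped(value, n):
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_remote, value)  # issue the request early
        local = local_work(n)                       # useful work while it is in flight
        return pending.result() + local             # wait only for what remains

print(blocking(10, 1000) == overlapped(10, 1000))
```

Both versions return the same result. The overlapped one can finish sooner because part of the latency is hidden behind `local_work`.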
Small-Scale: Shared Memory
Caches serve to:
– Increase bandwidth versus bus/memory
– Reduce latency of access
– Valuable for both private data and shared data
What about cache consistency?
The Problem of Cache Coherency
Value of X in memory is 1
CPU A reads X: its cache now contains 1
CPU B reads X: its cache now contains 1
CPU A stores 0 into X
– CPU A's cache contains a 0
– CPU B's cache contains a 1
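The sequence above can be replayed in a small simulation. This is a sketch under assumptions: the `Cache` class is hypothetical, the caches are write-back (the slide does not say which write policy is used), and there is deliberately no coherence protocol.

```python
class Cache:
    """Private cache with no coherence protocol (assumed write-back)."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}

    def read(self, addr):
        if addr not in self.lines:              # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]                 # hit: use the local copy

    def write(self, addr, value):
        self.lines[addr] = value                # only the local copy changes

memory = {"X": 1}
cpu_a, cpu_b = Cache(memory), Cache(memory)

cpu_a.read("X")        # CPU A's cache now contains 1
cpu_b.read("X")        # CPU B's cache now contains 1
cpu_a.write("X", 0)    # CPU A stores 0 into X

print("A sees", cpu_a.read("X"))   # 0
print("B sees", cpu_b.read("X"))   # 1, a stale value
```

CPU B keeps hitting in its own cache, so nothing ever tells it that X has changed.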
Multicore Systems
Multicore Computers (chip multiprocessors)
Combine two or more processors (cores) on a single piece of silicon
Each core consists of ALU, registers, pipeline hardware, L1 instruction and data caches
Multithreading is used
Pollack’s Rule
Performance increase is roughly proportional to the square root of the increase in complexity
performance ∝ √complexity
Power consumption increase is roughly linearly proportional to the increase in complexity
power consumption ∝ complexity
Pollack’s Rule
complexity   power   performance
1            1       1
4            4       2
25           25      5
100s of low-complexity cores, each operating at very low power
Ex: Four small cores
complexity   power   performance
4 × 1        4 × 1   4
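The rows of the tables above follow directly from the two proportionalities. A minimal sketch, assuming a baseline core of complexity 1 with power 1 and performance 1; the helper names `pollack` and `small_cores` are illustrative, and the small-core figure assumes the workload parallelizes perfectly.

```python
import math

def pollack(complexity):
    """One core: power grows linearly, performance with the square root."""
    power = complexity
    performance = math.sqrt(complexity)
    return power, performance

def small_cores(n):
    """n cores of complexity 1, assuming perfect parallel scaling."""
    return n * 1, n * 1.0   # (power, performance)

for c in (1, 4, 25):
    print(c, pollack(c))

print("one core of complexity 4:  ", pollack(4))      # (4, 2.0)
print("four cores of complexity 1:", small_cores(4))  # (4, 4.0)
```

For the same complexity and power budget of 4, four small cores offer twice the performance of one large core, provided the software supplies enough parallelism.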
Increasing CPU Performance
Manycore Chip
Composed of hybrid cores
• Some general purpose
• Some graphics
• Some floating point
Exascale Systems
Board composed of multiple manycore chips sharing memory
Rack composed of multiple boards
A room full of these racks
Millions of cores
Exascale systems (10^18 Flop/s)
Moore’s Law Reinterpreted
Number of cores per chip doubles every 2 years
Number of threads of execution doubles every 2 years
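As a worked example of the doubling rule, the projection below assumes an idealized, exact doubling every 2 years. The function name and the starting core count are illustrative, not data from the slides.

```python
def projected_cores(cores_now, years, doubling_period=2):
    """Cores per chip after `years`, assuming one doubling per period."""
    return cores_now * 2 ** (years / doubling_period)

# Starting from a hypothetical 4-core chip:
for years in (2, 4, 10):
    print(years, "years:", projected_cores(4, years))
```

Ten years is five doublings, so the hypothetical 4-core chip would reach 4 × 2^5 = 128 cores under this assumption.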
Shared Memory MIMD
Shared memory
• Single address space
• All processes have access to the pool of shared memory
[Figure: four processors (P) connected by a bus to a shared memory]
Shared Memory MIMD
Each processor executes different instructions asynchronously, using different data
[Figure: shared-memory MIMD organization with four control units (CU), four processing elements (PE), instruction and data paths, and a shared memory]
Symmetric Multiprocessors (SMP)
MIMD, shared memory, UMA
[Figure: SMP organization: processors, each with L1 and L2 caches, connected by a system bus to main memory and I/O]
Symmetric Multiprocessors (SMP)
Characteristics:
Two or more similar processors
Processors share the same memory and I/O facilities
Processors are connected by bus or other internal connection scheme, such that memory access time is the same for each processor
All processors share access to I/O devices
All processors can perform the same functions
The system is controlled by the operating system
Symmetric Multiprocessors (SMP)
Operating system:
Provides tools and functions to exploit the parallelism
Schedules processes or threads across all of the processors
Takes care of
• scheduling of threads and processes on processors
• synchronization among processors
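From the programmer's side, this division of labor can be sketched as follows: the program creates threads, and the operating system decides which processor runs each one. This is an illustration under assumptions: `task` and `run_tasks` are hypothetical names, and in CPython the global interpreter lock limits how much of this work truly runs in parallel.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor

def task(n):
    """Any worker can run any task: the processors are interchangeable."""
    return n * n, threading.current_thread().name

def run_tasks(values):
    workers = os.cpu_count() or 1   # number of processors the OS reports
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The program never names a processor; the OS scheduler places threads.
        return list(pool.map(task, values))

for result, worker_name in run_tasks(range(5)):
    print(result, "computed by", worker_name)
```

The worker names printed may differ from run to run, since thread placement is left to the scheduler.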
Multicore Computers
Dedicated L1 Cache
(ARM11 MPCore)
[Figure: CPU cores 1 to n, each with dedicated L1-I and L1-D caches; a single L2 cache, main memory, and I/O]
Multicore Computers
Dedicated L2 Cache
(AMD Opteron)
[Figure: CPU cores 1 to n, each with dedicated L1-I, L1-D, and L2 caches; main memory and I/O]
Multicore Computers
Shared L2 Cache
(Intel Core Duo)
[Figure: CPU cores 1 to n, each with dedicated L1-I and L1-D caches, sharing one L2 cache; main memory and I/O]
Multicore Computers
Shared L3 Cache
(Intel Core i7)
[Figure: CPU cores 1 to n, each with dedicated L1-I, L1-D, and L2 caches, sharing one L3 cache; main memory and I/O]
Multicore Computers
Advantages of Shared L2 cache
• Reduced overall miss rate: a thread on one core may cause a frame to be brought into the cache, and a thread on another core may then access the same location, which is already in the cache
• Data shared by multiple cores is not replicated
• The amount of shared cache allocated to each core may be dynamic
• Interprocessor communication is easy to implement
Advantages of Dedicated L2 cache
• Each core can access its private cache more rapidly
L3 cache
• When the amount of memory and number of cores grow, L3 cache provides better performance
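The miss-rate and replication arguments can be checked with a toy model. It is a sketch under strong assumptions: caches are unbounded sets, there are no evictions and no timing, and the access trace is made up for illustration.

```python
def simulate(accesses, shared):
    """accesses: list of (core, address). Returns (misses, cached copies)."""
    caches = {}
    misses = 0
    for core, addr in accesses:
        key = "shared" if shared else core   # one cache for all, or one per core
        lines = caches.setdefault(key, set())
        if addr not in lines:
            misses += 1                      # bring the data into this cache
            lines.add(addr)
    copies = sum(len(lines) for lines in caches.values())
    return misses, copies

# Two cores touching the same two locations (hypothetical trace).
trace = [(0, "a"), (1, "a"), (0, "b"), (1, "b"), (1, "a")]

print("shared L2:   ", simulate(trace, shared=True))    # (2, 2)
print("dedicated L2:", simulate(trace, shared=False))   # (4, 4)
```

With a shared cache, core 1 hits on data that core 0 already fetched, and each location is stored once. With dedicated caches, each core misses separately and the data is replicated. The model ignores access time, which is where dedicated caches have the advantage.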
Multicore Computers
On-chip interconnects
• Bus
• Crossbar
Off-chip communication (CPU-to-CPU or I/O): Bus-based
Multicore Computers
Multithreading
A multithreaded processor provides a separate program counter (PC) for each thread (hardware multithreading)
Implicit multithreading
• Concurrent execution of multiple threads extracted from a single sequential program
Explicit multithreading
• Execute instructions from different explicit threads by interleaving instructions from different threads on shared or parallel pipelines
Multicore Computers
Explicit Multithreading
Fine-grained multithreading (Interleaved multithreading)
• Processor deals with two or more thread contexts at a time
• Switching from one thread to another at each clock cycle
Coarse-grained multithreading (Blocked multithreading)
• Instructions of a thread are executed sequentially until an event that causes a delay (e.g. cache miss) occurs
• This event causes a switch to another thread
Simultaneous multithreading (SMT)
• Instructions are simultaneously issued from multiple threads to the execution units of a superscalar processor
• Thread-level parallelism is combined with instruction-level parallelism (ILP)
Chip multiprocessing (CMP)
• Each processor of a multicore system handles separate threads
Coarse-grained, Fine-grained, Simultaneous Multithreading, CMP
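The difference between the fine-grained and coarse-grained policies shows up in the order in which instructions are issued. The sketch below is a simplified model under assumptions: one instruction is issued per cycle, pipeline timing is ignored, and an instruction name ending in `*` marks a delay event such as a cache miss. The function names are illustrative.

```python
from collections import deque

def fine_grained(threads):
    """Interleaved: switch to the next thread at every clock cycle."""
    ready = deque(deque(t) for t in threads if t)
    issued = []
    while ready:
        thread = ready.popleft()
        issued.append(thread.popleft())
        if thread:
            ready.append(thread)        # go to the back of the round-robin
    return issued

def coarse_grained(threads):
    """Blocked: stay on one thread until it hits a delay event ('*')."""
    ready = deque(deque(t) for t in threads if t)
    issued = []
    while ready:
        thread = ready[0]
        instr = thread.popleft()
        issued.append(instr)
        if not thread:
            ready.popleft()             # thread finished
        elif instr.endswith("*"):
            ready.rotate(-1)            # delay event: switch to another thread
    return issued

threads = [["a1", "a2*", "a3"], ["b1", "b2", "b3"]]
print(fine_grained(threads))    # ['a1', 'b1', 'a2*', 'b2', 'a3', 'b3']
print(coarse_grained(threads))  # ['a1', 'a2*', 'b1', 'b2', 'b3', 'a3']
```

SMT and CMP are not modeled here, since both issue instructions from several threads in the same cycle rather than choosing one thread per cycle.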
GPUs (Graphics Processing Units)
Characteristics of GPUs
GPUs are accelerators for CPUs
SIMD
GPUs have many parallel processors and many concurrent threads (e.g. 10 or more cores; 100s or 1000s of threads per core)
CPU-GPU combination is an example of heterogeneous computing
GPGPU (general-purpose GPU): using a GPU to run applications traditionally handled by the CPU
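The programming style behind GPGPU can be sketched as one small kernel executed by many logical threads, each working on its own data element. This is a sequential emulation in plain Python, not real GPU code: `launch` and `saxpy_kernel` are hypothetical names, and an actual GPU would run the threads in parallel.

```python
def saxpy_kernel(i, a, x, y, out):
    """Body run by logical thread i: same instructions, different data."""
    out[i] = a * x[i] + y[i]

def launch(kernel, n_threads, *args):
    """Sequential stand-in for a GPU launch of n_threads threads."""
    for i in range(n_threads):
        kernel(i, *args)

x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(x)

launch(saxpy_kernel, len(x), 2.0, x, y, out)
print(out)   # [12.0, 24.0, 36.0, 48.0]
```

Because every thread executes the same instruction stream on a different index, the work maps naturally onto SIMD hardware.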
GPUs