Lecture 37: Chapter 7: Multiprocessors
• Today’s topic
  – Introduction to multiprocessors
  – Parallelism in software
  – Memory organization
  – Cache coherence
Introduction
• Goal: connecting multiple computers to get higher performance
  – Multiprocessors
  – Scalability, availability, power efficiency
• Job-level (process-level) parallelism
  – High throughput for independent jobs
• Parallel processing program
  – Single program run on multiple processors
• Multicore microprocessors
  – Chips with multiple processors (cores)
Hardware and Software
• Hardware
  – Serial: e.g., Pentium 4
  – Parallel: e.g., quad-core Xeon E5345
• Software
  – Sequential: e.g., matrix multiplication
  – Concurrent: e.g., operating system
• Sequential/concurrent software can run on serial/parallel hardware
  – Challenge: making effective use of parallel hardware
Parallel Programming
• Parallel software is the problem
• Need to get significant performance improvement
  – Otherwise, just use a faster uniprocessor, since it’s easier!
• Difficulties
  – Partitioning
  – Coordination
  – Communications overhead
Amdahl’s Law
• Sequential part can limit speedup
• Example: 100 processors, 90× speedup?
  – Speedup = 1 / ((1 − Fparallelizable) + Fparallelizable/100) = 90
  – Solving: Fparallelizable = 0.999
• Need sequential part to be 0.1% of original time
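A quick sanity check of this example in Python (a sketch of ours, not from the lecture; the function name amdahl_speedup is made up):

    def amdahl_speedup(f_parallel, n_procs):
        # Amdahl's Law: overall speedup when a fraction f_parallel of the
        # original execution time can be spread across n_procs processors.
        return 1.0 / ((1.0 - f_parallel) + f_parallel / n_procs)

    # A 0.1% sequential part (f_parallel = 0.999) on 100 processors gives
    # roughly the 90x speedup asked for in the example.
    print(amdahl_speedup(0.999, 100))  # ~91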
Scaling Example
• Workload: sum of 10 scalars, and 10 × 10 matrix sum
  – Speed up from 10 to 100 processors
• Single processor: Time = (10 + 100) × tadd
• 10 processors
  – Time = 10 × tadd + 100/10 × tadd = 20 × tadd
  – Speedup = 110/20 = 5.5 (55% of potential)
• 100 processors
  – Time = 10 × tadd + 100/100 × tadd = 11 × tadd
  – Speedup = 110/11 = 10 (10% of potential)
Scaling Example (cont)
• What if matrix size is 100 × 100?
• Single processor: Time = (10 + 10000) × tadd
• 10 processors
  – Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd
  – Speedup = 10010/1010 = 9.9 (99% of potential)
• 100 processors
  – Time = 10 × tadd + 10000/100 × tadd = 110 × tadd
  – Speedup = 10010/110 = 91 (91% of potential)
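Both scaling slides use the same time model: the 10 scalar additions stay sequential, while the matrix-sum additions are split evenly across processors. A small Python sketch (our own; times are in units of tadd) reproduces all four speedups:

    def time_on(procs, matrix_elems, seq_adds=10):
        # Sequential scalar adds + matrix-sum adds split across processors.
        return seq_adds + matrix_elems / procs

    def speedup(procs, matrix_elems):
        return time_on(1, matrix_elems) / time_on(procs, matrix_elems)

    print(speedup(10, 100))      # 110/20 = 5.5
    print(speedup(100, 100))     # 110/11 = 10.0
    print(speedup(10, 10000))    # 10010/1010 ~ 9.9
    print(speedup(100, 10000))   # 10010/110  ~ 91.0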
Strong vs Weak Scaling
• Strong scaling: problem size fixed
  – As in example
• Weak scaling: problem size proportional to number of processors
  – 10 processors, 10 × 10 matrix
    • Time = 20 × tadd
  – 100 processors, 32 × 32 matrix (≈1000 elements)
    • Time = 10 × tadd + 1000/100 × tadd = 20 × tadd
  – Constant performance in this example
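The weak-scaling arithmetic can be checked with the same kind of helper (repeated here so the sketch runs standalone; the slide rounds 32 × 32 = 1024 down to 1000 elements):

    def time_on(procs, matrix_elems, seq_adds=10):
        # Sequential scalar adds + matrix-sum adds split across processors.
        return seq_adds + matrix_elems / procs

    print(time_on(10, 100))    # 20 x tadd: 10 processors, 10 x 10 matrix
    print(time_on(100, 1000))  # 20 x tadd: 100 processors, ~32 x 32 matrix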
Memory Organization - I
• Centralized shared-memory multiprocessor, also called a symmetric shared-memory multiprocessor (SMP)
• Multiple processors connected to a single centralized memory – since all processors see the same memory organization, this is called uniform memory access (UMA)
• Shared-memory because all processors can access the entire memory address space
• Can centralized memory become a bandwidth bottleneck? – not if you have large caches and employ fewer than a dozen processors
SMPs or Centralized Shared-Memory
[Figure: four processors, each with its own caches, all connected by a shared bus to a single main memory and the I/O system]
Memory Organization - II
• For higher scalability, memory is distributed among the processors – distributed-memory multiprocessors
• If one processor can directly address the memory local to another processor, the address space is shared – a distributed shared-memory (DSM) multiprocessor
• If memories are strictly local, we need messages to communicate data – a cluster of computers, or multicomputer
• Non-uniform memory access (NUMA), since local memory has lower latency than remote memory
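As a toy illustration of the UMA/NUMA distinction (entirely our own sketch; the latency numbers are invented), a processor’s access cost in a NUMA machine depends on which node the address lives on:

    # Invented latencies, just to illustrate the local/remote asymmetry.
    LOCAL_NS = 100    # access to this node's own memory
    REMOTE_NS = 300   # access over the interconnection network

    def access_ns(cpu_node, mem_node):
        # UMA: every access would cost the same regardless of node.
        # NUMA: local accesses are cheaper than remote ones.
        return LOCAL_NS if cpu_node == mem_node else REMOTE_NS

    print(access_ns(0, 0))  # 100 ns: local memory
    print(access_ns(0, 3))  # 300 ns: remote memory on node 3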
Distributed Memory Multiprocessors
[Figure: four processor-and-cache nodes, each with its own local memory and I/O, connected to one another by an interconnection network]
SMPs
• Centralized main memory and many caches → many copies of the same data
• A system is cache coherent if a read returns the most recently written value for that word

Time   Event                  Cache-A   Cache-B   Memory
 0                               -         -        1
 1     CPU-A reads X             1         -        1
 2     CPU-B reads X             1         1        1
 3     CPU-A stores 0 in X       0         1        0

After time 3, Cache-B still holds the stale value 1, so this system is not coherent without a protocol.
Cache Coherence
A memory system is coherent if:
• P writes to X; no other processor writes to X; P reads X and receives the value previously written by P
• P1 writes to X; no other processor writes to X; sufficient time elapses; P2 reads X and receives the value written by P1
• Two writes to the same location by two processors are seen in the same order by all processors – write serialization
• The memory consistency model defines how much time must elapse before the effect of a processor’s write is seen by others
Cache Coherence Protocols
• Directory-based: A single location (directory) keeps track of the sharing status of a block of memory
• Snooping: Every cache block is accompanied by the sharing status of that block – all cache controllers monitor the shared bus so they can update the sharing status of the block, if necessary
• Write-invalidate: a processor gains exclusive access to a block before writing by invalidating all other copies
• Write-update: when a processor writes, it updates other shared copies of that block
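To make write-invalidate concrete, here is a minimal sketch of ours (one shared word, write-through memory, and no real bus arbitration, so this is far from a full MSI/MESI protocol). It replays the X example from the earlier table, but with snooping invalidations keeping the caches coherent:

    class Cache:
        def __init__(self, name):
            self.name = name
            self.value = None              # None = not cached (invalid)

    memory = {"X": 1}
    caches = [Cache("A"), Cache("B")]

    def read(cache, addr):
        if cache.value is None:            # miss: fetch from memory
            cache.value = memory[addr]
        return cache.value

    def write(cache, addr, value):
        for other in caches:               # snoop: invalidate other copies
            if other is not cache:
                other.value = None
        cache.value = value                # now the exclusive copy
        memory[addr] = value               # write-through, for simplicity

    read(caches[0], "X")                   # CPU-A reads X = 1
    read(caches[1], "X")                   # CPU-B reads X = 1
    write(caches[0], "X", 0)               # CPU-A writes 0; B is invalidated
    print(read(caches[1], "X"))            # B misses, re-fetches: 0, not stale 1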