Lecture 37: Chapter 7: Multiprocessors
• Today’s topic
  – Introduction to multiprocessors
  – Parallelism in software
  – Memory organization
  – Cache coherence
Introduction
• Goal: connecting multiple computers to get higher performance
  – Multiprocessors
  – Scalability, availability, power efficiency
• Job-level (process-level) parallelism
  – High throughput for independent jobs
• Parallel processing program
  – Single program run on multiple processors
• Multicore microprocessors
  – Chips with multiple processors (cores)
Hardware and Software
• Hardware
  – Serial: e.g., Pentium 4
  – Parallel: e.g., quad-core Xeon E5345
• Software
  – Sequential: e.g., matrix multiplication
  – Concurrent: e.g., operating system
• Sequential/concurrent software can run on serial/parallel hardware
  – Challenge: making effective use of parallel hardware
Parallel Programming
• Parallel software is the problem
• Need to get significant performance improvement
  – Otherwise, just use a faster uniprocessor, since it’s easier!
• Difficulties
  – Partitioning
  – Coordination
  – Communications overhead
Amdahl’s Law
• Sequential part can limit speedup
• Example: 100 processors, 90× speedup?
  – Speedup = 1 / ((1 − Fparallelizable) + Fparallelizable/100) = 90
  – Solving: Fparallelizable = 0.999
• Need sequential part to be 0.1% of original time
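A quick sanity check of this example in Python (a sketch of ours, not from the lecture; the function name amdahl_speedup is made up):

    def amdahl_speedup(f_parallel, n_procs):
        # Amdahl's Law: overall speedup when a fraction f_parallel of the
        # original execution time can be spread across n_procs processors.
        return 1.0 / ((1.0 - f_parallel) + f_parallel / n_procs)

    # A 0.1% sequential part (f_parallel = 0.999) on 100 processors gives
    # roughly the 90x speedup asked for in the example.
    print(amdahl_speedup(0.999, 100))  # ~91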
Scaling Example
• Workload: sum of 10 scalars, and 10 × 10 matrix sum
  – Speed up from 10 to 100 processors
• Single processor: Time = (10 + 100) × tadd
• 10 processors
  – Time = 10 × tadd + 100/10 × tadd = 20 × tadd
  – Speedup = 110/20 = 5.5 (55% of potential)
• 100 processors
  – Time = 10 × tadd + 100/100 × tadd = 11 × tadd
  – Speedup = 110/11 = 10 (10% of potential)
Scaling Example (cont)
• What if matrix size is 100 × 100?
• Single processor: Time = (10 + 10000) × tadd
• 10 processors
  – Time = 10 × tadd + 10000/10 × tadd = 1010 × tadd
  – Speedup = 10010/1010 = 9.9 (99% of potential)
• 100 processors
  – Time = 10 × tadd + 10000/100 × tadd = 110 × tadd
  – Speedup = 10010/110 = 91 (91% of potential)
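Both scaling slides use the same time model: the 10 scalar additions stay sequential, while the matrix-sum additions are split evenly across processors. A small Python sketch (our own; times are in units of tadd) reproduces all four speedups:

    def time_on(procs, matrix_elems, seq_adds=10):
        # Sequential scalar adds + matrix-sum adds split across processors.
        return seq_adds + matrix_elems / procs

    def speedup(procs, matrix_elems):
        return time_on(1, matrix_elems) / time_on(procs, matrix_elems)

    print(speedup(10, 100))      # 110/20 = 5.5
    print(speedup(100, 100))     # 110/11 = 10.0
    print(speedup(10, 10000))    # 10010/1010 ~ 9.9
    print(speedup(100, 10000))   # 10010/110  ~ 91.0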
Strong vs Weak Scaling
• Strong scaling: problem size fixed
  – As in example
• Weak scaling: problem size proportional to number of processors
  – 10 processors, 10 × 10 matrix
    • Time = 20 × tadd
  – 100 processors, 32 × 32 matrix (≈1000 elements)
    • Time = 10 × tadd + 1000/100 × tadd = 20 × tadd
  – Constant performance in this example
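The weak-scaling arithmetic can be checked with the same kind of helper (repeated here so the sketch runs standalone; the slide rounds 32 × 32 = 1024 down to 1000 elements):

    def time_on(procs, matrix_elems, seq_adds=10):
        # Sequential scalar adds + matrix-sum adds split across processors.
        return seq_adds + matrix_elems / procs

    print(time_on(10, 100))    # 20 x tadd: 10 processors, 10 x 10 matrix
    print(time_on(100, 1000))  # 20 x tadd: 100 processors, ~32 x 32 matrix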
Memory Organization - I
• Centralized shared-memory multiprocessor, also called a symmetric shared-memory multiprocessor (SMP)
• Multiple processors connected to a single centralized memory – since all processors see the same memory organization, this is called uniform memory access (UMA)
• Shared-memory because all processors can access the entire memory address space
• Can centralized memory become a bandwidth bottleneck? – not if you have large caches and employ fewer than a dozen processors
SMPs or Centralized Shared-Memory
[Figure: four processors, each with its own caches, all connected by a shared bus to a single main memory and the I/O system]
Memory Organization - II
• For higher scalability, memory is distributed among the processors – distributed-memory multiprocessors
• If one processor can directly address the memory local to another processor, the address space is shared – a distributed shared-memory (DSM) multiprocessor
• If memories are strictly local, we need messages to communicate data – a cluster of computers, or multicomputer
• Non-uniform memory access (NUMA), since local memory has lower latency than remote memory
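As a toy illustration of the UMA/NUMA distinction (entirely our own sketch; the latency numbers are invented), a processor’s access cost in a NUMA machine depends on which node the address lives on:

    # Invented latencies, just to illustrate the local/remote asymmetry.
    LOCAL_NS = 100    # access to this node's own memory
    REMOTE_NS = 300   # access over the interconnection network

    def access_ns(cpu_node, mem_node):
        # UMA: every access would cost the same regardless of node.
        # NUMA: local accesses are cheaper than remote ones.
        return LOCAL_NS if cpu_node == mem_node else REMOTE_NS

    print(access_ns(0, 0))  # 100 ns: local memory
    print(access_ns(0, 3))  # 300 ns: remote memory on node 3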
Distributed Memory Multiprocessors
[Figure: four processor-and-cache nodes, each with its own local memory and I/O, connected to one another by an interconnection network]
SMPs
• Centralized main memory and many caches → many copies of the same data
• A system is cache coherent if a read returns the most recently written value for that word

Time   Event                  Cache-A   Cache-B   Memory
 0                               -         -        1
 1     CPU-A reads X             1         -        1
 2     CPU-B reads X             1         1        1
 3     CPU-A stores 0 in X       0         1        0

After time 3, Cache-B still holds the stale value 1, so this system is not coherent without a protocol.
Cache Coherence
A memory system is coherent if:
• P writes to X; no other processor writes to X; P reads X and receives the value previously written by P
• P1 writes to X; no other processor writes to X; sufficient time elapses; P2 reads X and receives the value written by P1
• Two writes to the same location by two processors are seen in the same order by all processors – write serialization
• The memory consistency model defines how much time must elapse before the effect of a processor’s write is seen by others
Cache Coherence Protocols
• Directory-based: A single location (directory) keeps track of the sharing status of a block of memory
• Snooping: Every cache block is accompanied by the sharing status of that block – all cache controllers monitor the shared bus so they can update the sharing status of the block, if necessary
• Write-invalidate: a processor gains exclusive access to a block before writing by invalidating all other copies
• Write-update: when a processor writes, it updates other shared copies of that block
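To make write-invalidate concrete, here is a minimal sketch of ours (one shared word, write-through memory, and no real bus arbitration, so this is far from a full MSI/MESI protocol). It replays the X example from the earlier table, but with snooping invalidations keeping the caches coherent:

    class Cache:
        def __init__(self, name):
            self.name = name
            self.value = None              # None = not cached (invalid)

    memory = {"X": 1}
    caches = [Cache("A"), Cache("B")]

    def read(cache, addr):
        if cache.value is None:            # miss: fetch from memory
            cache.value = memory[addr]
        return cache.value

    def write(cache, addr, value):
        for other in caches:               # snoop: invalidate other copies
            if other is not cache:
                other.value = None
        cache.value = value                # now the exclusive copy
        memory[addr] = value               # write-through, for simplicity

    read(caches[0], "X")                   # CPU-A reads X = 1
    read(caches[1], "X")                   # CPU-B reads X = 1
    write(caches[0], "X", 0)               # CPU-A writes 0; B is invalidated
    print(read(caches[1], "X"))            # B misses, re-fetches: 0, not stale 1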