Lecture 19 Beyond Low-level Parallelism


TRANSCRIPT

Page 1

Lecture 19 Beyond Low-level Parallelism

© Wen-mei Hwu and S. J. Patel, 2002, ECE 412, University of Illinois

Page 2

Outline

• Models for exploiting large grained parallelism

• Issues in parallelization: communication, synchronization, Amdahl’s Law

• Basic Single-Bus Shared Memory Machines

Page 3

Dimension 1: Granularity of Concurrency

• Intra-Instruction (pipelining)

• Parallel Instructions (wide issue)

• Loop-level Parallelism (blocks, loops)

• Algorithmic Thread Parallelism (loops, plus more)

• Concurrent programs (multiple programs)


Page 4

Dimension 2: Flynn’s Classification

• Single Instruction (stream), Single Data (stream) (SISD) – simple CPU (1-wide pipeline)

• Single Instruction, Multiple Data (SIMD) – vector computer, multimedia extensions

• Multiple Instruction, Single Data (MISD) – ???

• Multiple Instruction, Multiple Data (MIMD) – multiprocessor (now also, multithreaded CPU)

Page 5

Dimension 3: Parallel Programming Models

• Data Parallel Model:
  – The same computation is applied to a large data set. All of these computations can be done in parallel.
  – E.g., matrix multiplication: all elements of the product array can be generated in parallel.

• Function Parallel Model:
  – A program is partitioned into parallel modules.
  – E.g., compiler: the phases of a compiler can be made into parallel modules: parser, semantic analyzer, optimizer 1, optimizer 2, and code generator. They can be organized into a high-level pipeline (sketched below).
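The lecture does not give code for the function-parallel model; the following is a minimal C++ sketch, assuming a hypothetical two-stage pipeline in place of the full set of compiler phases named above. The stages are ordinary threads connected by a small work queue, so they overlap in time like a pipeline.

// Minimal sketch of the function-parallel (pipeline) model. The stage names
// ("parser", "codegen") and the Channel class are assumptions for illustration.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <optional>
#include <queue>
#include <string>
#include <thread>
#include <utility>

// A small thread-safe queue connecting two pipeline stages.
template <typename T>
class Channel {
  std::queue<T> q_;
  std::mutex m_;
  std::condition_variable cv_;
  bool closed_ = false;
public:
  void put(T v) {
    { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
    cv_.notify_one();
  }
  void close() {
    { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
    cv_.notify_all();
  }
  std::optional<T> get() {               // empty result means drained and closed
    std::unique_lock<std::mutex> lk(m_);
    cv_.wait(lk, [&] { return !q_.empty() || closed_; });
    if (q_.empty()) return std::nullopt;
    T v = std::move(q_.front());
    q_.pop();
    return v;
  }
};

int main() {
  Channel<std::string> parsed;           // output of stage 1, input of stage 2

  // Stage 1: "parser" module produces work items.
  std::thread parser([&] {
    for (int i = 0; i < 3; ++i) parsed.put("function_" + std::to_string(i));
    parsed.close();
  });

  // Stage 2: "code generator" module consumes items as they become available,
  // so the two phases run concurrently like a high-level pipeline.
  std::thread codegen([&] {
    while (auto item = parsed.get()) std::cout << "codegen: " << *item << "\n";
  });

  parser.join();
  codegen.join();
}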

Page 6

Dimension 4: Memory Models

• How processors view each other’s memory space

[Figure: two organizations. Shared Memory Model: several CPUs connected through an interconnect to a single shared memory (amenable to a single address space). Message Passing Model: each CPU has its own local memory, and the CPUs communicate over the interconnect (amenable to multiple address spaces).]

Page 7

Dimension 4: Memory Models

• Shared memory
  – All processors have common access to a shared memory space. When a program accesses data, it just gives the memory system an address for the data.

  – DO 10 I = 1, N
      DO 20 J = 1, N
        A(I,J) = 0
        DO 30 K = 1, N
          A(I,J) = A(I,J) + B(I,K) * C(K,J)
        END DO
      END DO
    END DO
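The point of the shared-memory model is that no explicit data movement is needed: every processor can simply index the same arrays. A minimal C++ sketch of parallelizing the outer I loop of the matrix multiply above, where the thread count and problem size are assumptions rather than lecture values:

// Shared-memory parallelization sketch: each thread computes a contiguous
// range of rows of A, and all threads index the same shared arrays.
#include <thread>
#include <vector>

constexpr int N = 512;                       // assumed problem size
constexpr int P = 4;                         // assumed number of processors

std::vector<double> A(N * N), B(N * N), C(N * N);   // shared address space

void multiply_rows(int i_begin, int i_end) {
  for (int i = i_begin; i < i_end; ++i)
    for (int j = 0; j < N; ++j) {
      double sum = 0.0;
      for (int k = 0; k < N; ++k)
        sum += B[i * N + k] * C[k * N + j];
      A[i * N + j] = sum;
    }
}

int main() {
  std::vector<std::thread> workers;
  int rows_per_proc = N / P;                 // assumes P divides N evenly
  for (int p = 0; p < P; ++p)
    workers.emplace_back(multiply_rows, p * rows_per_proc, (p + 1) * rows_per_proc);
  for (auto& t : workers) t.join();
}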

Page 8

Memory Architecture (Cont.)

• Message passing (or Distributed Memory)
  – Each processor has a local memory. For a processor to access a piece of data not present in its own local memory, the processor has to exchange messages with the processor whose local memory contains the data.

• For each data access, the programmer must determine if the thread is executed by the processor whose local memory contains the data.
  – In general, the programmer has to partition the data and assign it explicitly to the processor local memories.
  – The programmer then parallelizes the program and assigns the threads to processors. Using these assignment decisions, each access is then determined to be either a local memory access or a message exchange (a minimal sketch follows).
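The lecture does not prescribe a particular message-passing library; MPI is used below only as one concrete realization. The sketch assumes rank 0's local memory holds a block of data that rank 1 needs, so the access becomes an explicit message exchange rather than a load.

// Minimal message-passing sketch (MPI chosen here as an example realization).
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const int BLOCK = 1000;                      // assumed block size
  std::vector<double> block(BLOCK);

  if (rank == 0) {
    // The data lives in rank 0's local memory.
    for (int i = 0; i < BLOCK; ++i) block[i] = i * 0.5;
    MPI_Send(block.data(), BLOCK, MPI_DOUBLE, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);
  } else if (rank == 1) {
    // Rank 1 cannot load the data directly; it must receive a message.
    MPI_Recv(block.data(), BLOCK, MPI_DOUBLE, /*source=*/0, /*tag=*/0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    std::printf("rank 1 received block[0] = %f\n", block[0]);
  }

  MPI_Finalize();
}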

Page 9

A Simple Problem

min = D[0];
for (i = 0; i < n; i++) {
  if (D[i] < min) min = D[i];
}
cout << min;

Program to find the minimum of a large set of input data.

Page 10

Simple Problem, Data Parallelized

pstart = p# * 100000;
min[p#] = D[0];
for (i = pstart; i < pstart + 100000; i++) {
  if (D[i] < min[p#]) min[p#] = D[i];
}

barrier();

if (p# == 0) {
  real_min = min[0];
  for (i = 1; i < pMax; i++) {
    if (min[i] < real_min) real_min = min[i];
  }
  cout << real_min;
}
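The code above is pseudocode: p# stands for the processor number and 100000 for the per-processor chunk size. A minimal runnable C++ sketch, where each thread plays the role of one processor, the sample data is an assumption, and join() stands in for barrier():

// Runnable sketch of the data-parallel minimum above.
#include <algorithm>
#include <cmath>
#include <iostream>
#include <thread>
#include <vector>

int main() {
  const int P = 4;                       // number of "processors" (assumed)
  const int CHUNK = 100000;              // per-processor chunk, as on the slide
  std::vector<double> D(P * CHUNK);
  for (std::size_t i = 0; i < D.size(); ++i) D[i] = std::sin(double(i));  // sample data

  std::vector<double> min_p(P, D[0]);    // min[p#] on the slide
  std::vector<std::thread> threads;
  for (int p = 0; p < P; ++p)
    threads.emplace_back([&, p] {
      int pstart = p * CHUNK;
      for (int i = pstart; i < pstart + CHUNK; ++i)
        if (D[i] < min_p[p]) min_p[p] = D[i];
    });

  for (auto& t : threads) t.join();      // plays the role of barrier()

  // "Processor 0" combines the per-processor results.
  double real_min = min_p[0];
  for (int p = 1; p < P; ++p) real_min = std::min(real_min, min_p[p]);
  std::cout << real_min << "\n";
}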

Page 11

Loop Level Parallelization

• DOALL: There is no dependence between loop iterations. All can be executed in parallel.

  DOALL 30 J = 1, J1
    X(II1+J) = X(II1+J) * SC1
    Y(II1+J) = Y(II1+J) * SC1
    Z(II1+J) = Z(II1+J) * SC1
  END DO

Page 12

Loop Level Parallelization

• DOACROSS: There are dependences between loop iterations. The dependences are enforced by two synchronization constructs:
  – advance(synch_pt): signals that the current iteration has passed the synchronization point identified by synch_pt.
  – await(synch_pt, depend_distance): forces the execution of the current iteration to wait for a previous iteration to pass the synchronization point identified by synch_pt. The iteration waited for is the current iteration number minus depend_distance.

Page 13

DOACROSS Example

  DOACROSS 40 I = 4, ILA
    AWAIT(1, 3)
    X(I) = Y(I) + X(I-3)
    ADVANCE(1)
  END DO

• The current iteration depends on synch point 1 of the iteration three iterations ago.
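The lecture does not show how advance/await are implemented. One minimal C++ sketch keeps a flag per (synchronization point, iteration): advance sets the flag for the current iteration, and await spins on the flag of the iteration depend_distance behind it. The sizes, the cyclic schedule, and the worker names are assumptions for illustration.

// Sketch of DOACROSS advance/await for the example above.
#include <atomic>
#include <thread>
#include <vector>

constexpr int NUM_SYNCH_PTS = 2;
constexpr int MAX_ITER = 1024;
std::atomic<bool> passed[NUM_SYNCH_PTS][MAX_ITER];   // zero-initialized to false

void advance(int synch_pt, int iter) {
  passed[synch_pt][iter].store(true, std::memory_order_release);
}

void await(int synch_pt, int iter, int depend_distance) {
  while (!passed[synch_pt][iter - depend_distance].load(std::memory_order_acquire))
    std::this_thread::yield();
}

// DOACROSS 40 I = 4, ILA from the slide, iterations dealt out cyclically.
constexpr int ILA = 512;
double X[MAX_ITER], Y[MAX_ITER];

void worker(int tid, int nthreads) {
  for (int i = 4 + tid; i <= ILA; i += nthreads) {
    await(0, i, 3);                  // synch point "1" on the slide, index 0 here
    X[i] = Y[i] + X[i - 3];
    advance(0, i);
  }
}

int main() {
  for (int i = 1; i < 4; ++i)        // iterations before the loop count as passed
    advance(0, i);
  std::vector<std::thread> threads;
  for (int t = 0; t < 3; ++t) threads.emplace_back(worker, t, 3);
  for (auto& t : threads) t.join();
}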

Page 14

Issue 1: Communication

• The objective is to write parallel programs that can use as many processors as are available.

• Information needs to be exchanged between processors.
  – In the example, the min[] array needs to be communicated to processor 0.

• Parallelization is limited by communication:
  – in terms of latency,
  – in terms of the amount of communication.
  – At some point, the communication overhead will outweigh the benefit of parallelism.

Page 15

Issue 2: Synchronization

• Often, parallel tasks need to coordinate and wait for each other.

• Example: barrier in parallel min algorithm

• The art of parallel programming involves minimizing synchronization

Page 16

Mutual Exclusion

• Atomic updates to data structures so that the actions from different processors are serialized.
  – This is especially important for data structures that need to be updated in multiple steps. Each processor should be able to lock out other processors, update the data structure, leave the data structure in a consistent state, and unlock.
  – This is usually realized by using atomic Read-Modify-Write instructions (sketched below).

• Parking space example, file editing example
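A minimal C++ sketch of mutual exclusion built on an atomic read-modify-write operation: std::atomic_flag::test_and_set maps to an atomic RMW instruction such as test-and-set or compare-and-exchange. The guarded two-field structure is a hypothetical example, not from the lecture.

// Spinlock protecting a multi-step update.
#include <atomic>
#include <thread>

std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void lock() {
  while (lock_flag.test_and_set(std::memory_order_acquire))  // atomic RMW
    std::this_thread::yield();                                // spin until free
}

void unlock() {
  lock_flag.clear(std::memory_order_release);
}

// A data structure that must be updated in multiple steps.
struct Account { long balance = 0; long updates = 0; } shared_account;

void deposit(long amount) {
  lock();                               // lock out other processors
  shared_account.balance += amount;     // step 1
  shared_account.updates += 1;          // step 2: structure stays consistent
  unlock();                             // release
}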

Page 17

Issue 3: Serial Component

• No matter how much you parallelize, the performance is ultimately limited by the serial component.

• Better known as Amdahl’s Law.

• Jet service to Chicago:
  – Jets and propeller planes have very different flying times but similar taxi times.
  – Taxi time is the serial section here.
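The slide does not state the formula, but the usual form of Amdahl's Law is:

  Speedup(N) = 1 / ((1 - f) + f / N)

where f is the fraction of execution time that can be parallelized and N is the number of processors. As an illustration with assumed numbers: if f = 0.9, then even with N = 16 processors the speedup is 1 / (0.1 + 0.9/16) ≈ 6.4, and no number of processors can push it past 1 / 0.1 = 10.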

Page 18

Example Speedup Graph

[Figure: speedup over a single processor plotted against the number of processors, showing an ideal speedup line and an observed speedup curve that falls below it.]

Page 19

Message Passing Machines

• Network characteristics are of primary importance

• Latency, bisection bandwidths, node bandwidth, occupancy of communication

Latency = sender overhead + transport latency + receiver overhead
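As an illustration with assumed numbers (not from the lecture): if sender overhead is 5 µs, transport latency is 2 µs, and receiver overhead is 5 µs, the end-to-end latency is 12 µs. A faster network fabric attacks only the 2 µs transport term, while the 10 µs of software overhead remains.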

Page 20

Synchronization support for shared memory machines

• Example: two processors trying to increment a shared variable

• Requires an atomic memory update instruction
  – examples: test-and-set, or compare-and-exchange

• Using cmpxchg, implement barrier(); (one possible sketch follows below)

• Scalability issues with traditional synchronization
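One possible sketch of barrier() built on an atomic compare-and-exchange (std::atomic's compare_exchange is what the x86 cmpxchg instruction implements): a simple counter barrier with a generation number. NUM_PROCS and the variable names are assumptions for illustration.

// Counter barrier using a compare-and-exchange loop.
#include <atomic>
#include <thread>

constexpr int NUM_PROCS = 4;
std::atomic<int> arrived{0};        // how many processors have reached the barrier
std::atomic<int> generation{0};     // incremented once per completed barrier

void barrier() {
  int my_gen = generation.load();

  // Atomically increment 'arrived' with a cmpxchg loop (fetch_add would also
  // work, but this shows the compare-and-exchange pattern the slide asks about).
  int count = arrived.load();
  while (!arrived.compare_exchange_weak(count, count + 1)) {
    // 'count' is reloaded with the current value on failure; retry.
  }

  if (count + 1 == NUM_PROCS) {      // last arrival releases everyone
    arrived.store(0);
    generation.fetch_add(1);
  } else {
    while (generation.load() == my_gen)   // wait for the generation to advance
      std::this_thread::yield();
  }
}

Note that every processor spins on the same shared locations, which generates exactly the kind of traffic behind the scalability concern in the last bullet.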

Page 21

Coherence

[Figure: two bus-based organizations. Without caches, every CPU access goes to memory: easy hardware, bad performance. With a cache per CPU, performance is better, but... what about keeping the cached data valid?]

Page 22

Memory Coherence, continued

[Figure: CPU 0 and CPU 1 each hold location X with value A in their caches; memory also holds value A at location X.]

Essential question: what happens when CPU 0 writes value B to memory location X?

Page 23

One solution: snooping caches

[Figure: two CPUs on a shared bus to memory, each with a cache and a duplicate set of snoop tags attached to the bus.]

• Snoop logic watches bus transactions and either invalidates or updates matching cache lines.

• Requires an extra set of cache tags.

• Generally, write-back caches are used. Why?

• What about the L1 cache?

Page 24

Cache Coherence Protocol

• Basic premise: writes must update other copies or invalidate other copies.

• How do we make this work while not choking the bus with transactions OR flooding memory OR bringing each CPU to a crawl?

• One solution: The Illinois Protocol

Page 25

The Illinois Protocol

[Figure: a cache block's tag extended with state bits: M / E / S / I.]

The tag for each cache block now contains 2 bits that specify whether the block is Modified, Exclusively owned, Shared, or Invalid.

Page 26

Illinois Protocol from the CPU’s perspective

[State diagram, reconstructed as a transition list. States: Invalid (or no tag match), Exclusive, Shared, Modified.]

• Invalid, read, data supplied by memory (no other cache has a copy) → Exclusive
• Invalid, read, data supplied by another CPU → Shared
• Invalid, write, but invalidate others → Modified
• Shared, read → Shared
• Shared, write, but invalidate others → Modified
• Exclusive, read → Exclusive
• Exclusive, write → Modified
• Modified, read or write → Modified
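As a compact restatement of the diagram above, here is a sketch of the processor-side transitions as a C++ function. The type and parameter names are assumptions; other_cache_has_copy models whether a read miss is serviced by another CPU's cache or by memory, and bus actions are reduced to comments.

// Sketch of the Illinois (MESI) protocol from the CPU's perspective.
enum class State { Invalid, Shared, Exclusive, Modified };
enum class CpuOp { Read, Write };

State cpu_transition(State s, CpuOp op, bool other_cache_has_copy) {
  switch (s) {
    case State::Invalid:
      if (op == CpuOp::Read)
        return other_cache_has_copy ? State::Shared      // supplied by another CPU
                                    : State::Exclusive;  // supplied by memory
      return State::Modified;                            // write: invalidate others
    case State::Shared:
      return (op == CpuOp::Write) ? State::Modified      // write: invalidate others
                                  : State::Shared;
    case State::Exclusive:
      return (op == CpuOp::Write) ? State::Modified      // silent upgrade, no bus traffic
                                  : State::Exclusive;
    case State::Modified:
      return State::Modified;                            // read or write: stay Modified
  }
  return s;                                              // unreachable
}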

Page 27

Illinois Protocol from the bus’s perspective

[State diagram, reconstructed as a transition list: what a cache does when its snoop logic observes another processor's bus transaction for a block it holds.]

• Modified, bus read observed → supply the data, go to Shared
• Modified, bus write observed → first supply the data, go to Invalid
• Exclusive, bus read observed → Shared
• Exclusive, bus write (miss) observed → Invalid
• Shared, bus read observed → Shared
• Shared, bus write or invalidation observed → Invalid
• Invalid (or no tag match): no action

Page 28

Issues with a single bus

• Keep adding CPUs and eventually the bus saturates.

• Coherence protocol adds extra traffic

• Buses are slower than point-to-point interconnects