Lecture 19 Beyond Low-level Parallelism


TRANSCRIPT

Page 1

Lecture 19 Beyond Low-level Parallelism

© Wen-mei Hwu and S. J. Patel, 2002, ECE 412, University of Illinois

Page 2

Outline

• Models for exploiting large grained parallelism

• Issues in parallelization: communication, synchronization, Amdahl’s Law

• Basic Single-Bus Shared Memory Machines

Page 3

Dimension 1: Granularity of Concurrency

• Intra-Instruction (pipelining)

• Parallel Instructions (wide issue)

• Loop-level Parallelism (blocks, loops)

• Algorithmic Thread Parallelism (loops, plus more)

• Concurrent programs (multiple programs)


Page 4

Dimension 2: Flynn’s Classification

• Single Instruction (stream), Single Data (stream) (SISD) – simple CPU (1-wide pipeline)

• Single Instruction, Multiple Data (SIMD) – vector computer, multimedia extensions

• Multiple Instruction, Single Data (MISD) – ???

• Multiple Instruction, Multiple Data (MIMD) – multiprocessor (now also, multithreaded CPU)

Page 5

Dimension 3: Parallel Programming Models

• Data Parallel Model:
  – The same computation is applied to a large data set. All of these computations can be done in parallel.
  – E.g., matrix multiplication: all elements of the product array can be generated in parallel.

• Function Parallel Model:
  – A program is partitioned into parallel modules.
  – E.g., compiler: the phases of a compiler can be made into parallel modules: parser, semantic analyzer, optimizer 1, optimizer 2, and code generator. They can be organized into a high-level pipeline (sketched below).
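The lecture does not give code for the function-parallel model; the following is a minimal C++ sketch, assuming a hypothetical two-stage pipeline in place of the full set of compiler phases named above. The stages are ordinary threads connected by a small work queue, so they overlap in time like a pipeline.

// Minimal sketch of the function-parallel (pipeline) model. The stage names
// ("parser", "codegen") and the Channel class are assumptions for illustration.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <optional>
#include <queue>
#include <string>
#include <thread>
#include <utility>

// A small thread-safe queue connecting two pipeline stages.
template <typename T>
class Channel {
  std::queue<T> q_;
  std::mutex m_;
  std::condition_variable cv_;
  bool closed_ = false;
public:
  void put(T v) {
    { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
    cv_.notify_one();
  }
  void close() {
    { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
    cv_.notify_all();
  }
  std::optional<T> get() {               // empty result means drained and closed
    std::unique_lock<std::mutex> lk(m_);
    cv_.wait(lk, [&] { return !q_.empty() || closed_; });
    if (q_.empty()) return std::nullopt;
    T v = std::move(q_.front());
    q_.pop();
    return v;
  }
};

int main() {
  Channel<std::string> parsed;           // output of stage 1, input of stage 2

  // Stage 1: "parser" module produces work items.
  std::thread parser([&] {
    for (int i = 0; i < 3; ++i) parsed.put("function_" + std::to_string(i));
    parsed.close();
  });

  // Stage 2: "code generator" module consumes items as they become available,
  // so the two phases run concurrently like a high-level pipeline.
  std::thread codegen([&] {
    while (auto item = parsed.get()) std::cout << "codegen: " << *item << "\n";
  });

  parser.join();
  codegen.join();
}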

Page 6

Dimension 4: Memory Models

• How processors view each other’s memory space

[Figure: two organizations. Shared Memory Model: several CPUs connected through an interconnect to a single shared memory (amenable to a single address space). Message Passing Model: each CPU has its own local memory, and the CPUs communicate over the interconnect (amenable to multiple address spaces).]

Page 7

Dimension 4: Memory Models

• Shared memory
  – All processors have common access to a shared memory space. When a program accesses data, it just gives the memory system an address for the data.

  – DO 10 I = 1, N
      DO 20 J = 1, N
        A(I,J) = 0
        DO 30 K = 1, N
          A(I,J) = A(I,J) + B(I,K) * C(K,J)
        END DO
      END DO
    END DO
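The point of the shared-memory model is that no explicit data movement is needed: every processor can simply index the same arrays. A minimal C++ sketch of parallelizing the outer I loop of the matrix multiply above, where the thread count and problem size are assumptions rather than lecture values:

// Shared-memory parallelization sketch: each thread computes a contiguous
// range of rows of A, and all threads index the same shared arrays.
#include <thread>
#include <vector>

constexpr int N = 512;                       // assumed problem size
constexpr int P = 4;                         // assumed number of processors

std::vector<double> A(N * N), B(N * N), C(N * N);   // shared address space

void multiply_rows(int i_begin, int i_end) {
  for (int i = i_begin; i < i_end; ++i)
    for (int j = 0; j < N; ++j) {
      double sum = 0.0;
      for (int k = 0; k < N; ++k)
        sum += B[i * N + k] * C[k * N + j];
      A[i * N + j] = sum;
    }
}

int main() {
  std::vector<std::thread> workers;
  int rows_per_proc = N / P;                 // assumes P divides N evenly
  for (int p = 0; p < P; ++p)
    workers.emplace_back(multiply_rows, p * rows_per_proc, (p + 1) * rows_per_proc);
  for (auto& t : workers) t.join();
}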

Page 8

Memory Architecture (Cont.)

• Message passing (or Distributed Memory)
  – Each processor has a local memory. For a processor to access a piece of data not present in its own local memory, the processor has to exchange messages with the processor whose local memory contains the data.

• For each data access, the programmer must determine if the thread is executed by the processor whose local memory contains the data.
  – In general, the programmer has to partition the data and assign it explicitly to the processor local memories.
  – The programmer then parallelizes the program and assigns the threads to processors. Using these assignment decisions, each access is then determined to be either a local memory access or a message exchange (a minimal sketch follows).
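The lecture does not prescribe a particular message-passing library; MPI is used below only as one concrete realization. The sketch assumes rank 0's local memory holds a block of data that rank 1 needs, so the access becomes an explicit message exchange rather than a load.

// Minimal message-passing sketch (MPI chosen here as an example realization).
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const int BLOCK = 1000;                      // assumed block size
  std::vector<double> block(BLOCK);

  if (rank == 0) {
    // The data lives in rank 0's local memory.
    for (int i = 0; i < BLOCK; ++i) block[i] = i * 0.5;
    MPI_Send(block.data(), BLOCK, MPI_DOUBLE, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);
  } else if (rank == 1) {
    // Rank 1 cannot load the data directly; it must receive a message.
    MPI_Recv(block.data(), BLOCK, MPI_DOUBLE, /*source=*/0, /*tag=*/0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    std::printf("rank 1 received block[0] = %f\n", block[0]);
  }

  MPI_Finalize();
}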

Page 9

A Simple Problem

min = D[0];
for (i = 0; i < n; i++) {
  if (D[i] < min) min = D[i];
}
cout << min;

Program to find the minimum of a large set of input data.

Page 10

Simple Problem, Data Parallelized

pstart = p# * 100000;
min[p#] = D[0];
for (i = pstart; i < pstart + 100000; i++) {
  if (D[i] < min[p#]) min[p#] = D[i];
}

barrier();

if (p# == 0) {
  real_min = min[0];
  for (i = 1; i < pMax; i++) {
    if (min[i] < real_min) real_min = min[i];
  }
  cout << real_min;
}
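The code above is pseudocode: p# stands for the processor number and 100000 for the per-processor chunk size. A minimal runnable C++ sketch, where each thread plays the role of one processor, the sample data is an assumption, and join() stands in for barrier():

// Runnable sketch of the data-parallel minimum above.
#include <algorithm>
#include <cmath>
#include <iostream>
#include <thread>
#include <vector>

int main() {
  const int P = 4;                       // number of "processors" (assumed)
  const int CHUNK = 100000;              // per-processor chunk, as on the slide
  std::vector<double> D(P * CHUNK);
  for (std::size_t i = 0; i < D.size(); ++i) D[i] = std::sin(double(i));  // sample data

  std::vector<double> min_p(P, D[0]);    // min[p#] on the slide
  std::vector<std::thread> threads;
  for (int p = 0; p < P; ++p)
    threads.emplace_back([&, p] {
      int pstart = p * CHUNK;
      for (int i = pstart; i < pstart + CHUNK; ++i)
        if (D[i] < min_p[p]) min_p[p] = D[i];
    });

  for (auto& t : threads) t.join();      // plays the role of barrier()

  // "Processor 0" combines the per-processor results.
  double real_min = min_p[0];
  for (int p = 1; p < P; ++p) real_min = std::min(real_min, min_p[p]);
  std::cout << real_min << "\n";
}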

Page 11

Loop Level Parallelization

• DOALL: There is no dependence between loop iterations. All can be executed in parallel.

  DOALL 30 J = 1, J1
    X(II1+J) = X(II1+J) * SC1
    Y(II1+J) = Y(II1+J) * SC1
    Z(II1+J) = Z(II1+J) * SC1
  END DO

Page 12

Loop Level Parallelization

• DOACROSS: There are dependences between loop iterations. The dependences are enforced by two synchronization constructs:
  – advance(synch_pt): signals that the current iteration has passed the synchronization point identified by synch_pt.
  – await(synch_pt, depend_distance): forces the execution of the current iteration to wait for a previous iteration to pass the synchronization point identified by synch_pt. The iteration waited for is the current iteration number minus depend_distance.

Page 13

DOACROSS Example

  DOACROSS 40 I = 4, ILA
    AWAIT(1, 3)
    X(I) = Y(I) + X(I-3)
    ADVANCE(1)
  END DO

• The current iteration depends on synch point 1 of the iteration three iterations ago.
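The lecture does not show how advance/await are implemented. One minimal C++ sketch keeps a flag per (synchronization point, iteration): advance sets the flag for the current iteration, and await spins on the flag of the iteration depend_distance behind it. The sizes, the cyclic schedule, and the worker names are assumptions for illustration.

// Sketch of DOACROSS advance/await for the example above.
#include <atomic>
#include <thread>
#include <vector>

constexpr int NUM_SYNCH_PTS = 2;
constexpr int MAX_ITER = 1024;
std::atomic<bool> passed[NUM_SYNCH_PTS][MAX_ITER];   // zero-initialized to false

void advance(int synch_pt, int iter) {
  passed[synch_pt][iter].store(true, std::memory_order_release);
}

void await(int synch_pt, int iter, int depend_distance) {
  while (!passed[synch_pt][iter - depend_distance].load(std::memory_order_acquire))
    std::this_thread::yield();
}

// DOACROSS 40 I = 4, ILA from the slide, iterations dealt out cyclically.
constexpr int ILA = 512;
double X[MAX_ITER], Y[MAX_ITER];

void worker(int tid, int nthreads) {
  for (int i = 4 + tid; i <= ILA; i += nthreads) {
    await(0, i, 3);                  // synch point "1" on the slide, index 0 here
    X[i] = Y[i] + X[i - 3];
    advance(0, i);
  }
}

int main() {
  for (int i = 1; i < 4; ++i)        // iterations before the loop count as passed
    advance(0, i);
  std::vector<std::thread> threads;
  for (int t = 0; t < 3; ++t) threads.emplace_back(worker, t, 3);
  for (auto& t : threads) t.join();
}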

Page 14

Issue 1: Communication

• The objective is to write parallel programs that can use as many processors as are available.

• Information needs to be exchanged between processors.
  – In the example, the min[] array needs to be communicated to processor 0.

• Parallelization is limited by communication:
  – in terms of latency,
  – in terms of the amount of communication.
  – At some point, the communication overhead will outweigh the benefit of parallelism.

Page 15

Issue 2: Synchronization

• Often, parallel tasks need to coordinate and wait for each other.

• Example: barrier in parallel min algorithm

• The art of parallel programming involves minimizing synchronization

Page 16

Mutual Exclusion

• Atomic updates to data structures so that the actions from different processors are serialized.
  – This is especially important for data structures that need to be updated in multiple steps. Each processor should be able to lock out other processors, update the data structure, leave the data structure in a consistent state, and unlock.
  – This is usually realized by using atomic Read-Modify-Write instructions (sketched below).

• Parking space example, file editing example
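A minimal C++ sketch of mutual exclusion built on an atomic read-modify-write operation: std::atomic_flag::test_and_set maps to an atomic RMW instruction such as test-and-set or compare-and-exchange. The guarded two-field structure is a hypothetical example, not from the lecture.

// Spinlock protecting a multi-step update.
#include <atomic>
#include <thread>

std::atomic_flag lock_flag = ATOMIC_FLAG_INIT;

void lock() {
  while (lock_flag.test_and_set(std::memory_order_acquire))  // atomic RMW
    std::this_thread::yield();                                // spin until free
}

void unlock() {
  lock_flag.clear(std::memory_order_release);
}

// A data structure that must be updated in multiple steps.
struct Account { long balance = 0; long updates = 0; } shared_account;

void deposit(long amount) {
  lock();                               // lock out other processors
  shared_account.balance += amount;     // step 1
  shared_account.updates += 1;          // step 2: structure stays consistent
  unlock();                             // release
}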

Page 17

Issue 3: Serial Component

• No matter how much you parallelize, the performance is ultimately limited by the serial component.

• Better known as Amdahl’s Law.

• Jet service to Chicago:
  – Jets and propeller planes have very different flying times but similar taxi times.
  – Taxi time is the serial section here.
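The slide does not state the formula, but the usual form of Amdahl's Law is:

  Speedup(N) = 1 / ((1 - f) + f / N)

where f is the fraction of execution time that can be parallelized and N is the number of processors. As an illustration with assumed numbers: if f = 0.9, then even with N = 16 processors the speedup is 1 / (0.1 + 0.9/16) ≈ 6.4, and no number of processors can push it past 1 / 0.1 = 10.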

Page 18

Example Speedup Graph

[Figure: speedup over a single processor plotted against the number of processors, showing an ideal speedup line and an observed speedup curve that falls below it.]

Page 19

Message Passing Machines

• Network characteristics are of primary importance

• Latency, bisection bandwidths, node bandwidth, occupancy of communication

Latency = sender overhead + transport latency + receiver overhead
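As an illustration with assumed numbers (not from the lecture): if sender overhead is 5 µs, transport latency is 2 µs, and receiver overhead is 5 µs, the end-to-end latency is 12 µs. A faster network fabric attacks only the 2 µs transport term, while the 10 µs of software overhead remains.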

Page 20

Synchronization support for shared memory machines

• Example: two processors trying to increment a shared variable

• Requires an atomic memory update instruction
  – examples: test-and-set, or compare-and-exchange

• Using cmpxchg, implement barrier(); (one possible sketch follows below)

• Scalability issues with traditional synchronization
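One possible sketch of barrier() built on an atomic compare-and-exchange (std::atomic's compare_exchange is what the x86 cmpxchg instruction implements): a simple counter barrier with a generation number. NUM_PROCS and the variable names are assumptions for illustration.

// Counter barrier using a compare-and-exchange loop.
#include <atomic>
#include <thread>

constexpr int NUM_PROCS = 4;
std::atomic<int> arrived{0};        // how many processors have reached the barrier
std::atomic<int> generation{0};     // incremented once per completed barrier

void barrier() {
  int my_gen = generation.load();

  // Atomically increment 'arrived' with a cmpxchg loop (fetch_add would also
  // work, but this shows the compare-and-exchange pattern the slide asks about).
  int count = arrived.load();
  while (!arrived.compare_exchange_weak(count, count + 1)) {
    // 'count' is reloaded with the current value on failure; retry.
  }

  if (count + 1 == NUM_PROCS) {      // last arrival releases everyone
    arrived.store(0);
    generation.fetch_add(1);
  } else {
    while (generation.load() == my_gen)   // wait for the generation to advance
      std::this_thread::yield();
  }
}

Note that every processor spins on the same shared locations, which generates exactly the kind of traffic behind the scalability concern in the last bullet.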

Page 21

Coherence

[Figure: two bus-based organizations. Without caches, every CPU access goes to memory: easy hardware, bad performance. With a cache per CPU, performance is better, but... what about keeping the cached data valid?]

Page 22

Memory Coherence, continued

[Figure: CPU 0 and CPU 1 each hold location X with value A in their caches; memory also holds value A at location X.]

Essential question: what happens when CPU 0 writes value B to memory location X?

Page 23

One solution: snooping caches

[Figure: two CPUs on a shared bus to memory, each with a cache and a duplicate set of snoop tags attached to the bus.]

• Snoop logic watches bus transactions and either invalidates or updates matching cache lines.

• Requires an extra set of cache tags.

• Generally, write-back caches are used. Why?

• What about the L1 cache?

Page 24

Cache Coherence Protocol

• Basic premise: writes must update other copies or invalidate other copies.

• How do we make this work while not choking the bus with transactions OR flooding memory OR bringing each CPU to a crawl?

• One solution: The Illinois Protocol

Page 25

The Illinois Protocol

[Figure: a cache block's tag extended with state bits: M / E / S / I.]

The tag for each cache block now contains 2 bits that specify whether the block is Modified, Exclusively owned, Shared, or Invalid.

Page 26

Illinois Protocol from the CPU’s perspective

[State diagram, reconstructed as a transition list. States: Invalid (or no tag match), Exclusive, Shared, Modified.]

• Invalid, read, data supplied by memory (no other cache has a copy) → Exclusive
• Invalid, read, data supplied by another CPU → Shared
• Invalid, write, but invalidate others → Modified
• Shared, read → Shared
• Shared, write, but invalidate others → Modified
• Exclusive, read → Exclusive
• Exclusive, write → Modified
• Modified, read or write → Modified
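As a compact restatement of the diagram above, here is a sketch of the processor-side transitions as a C++ function. The type and parameter names are assumptions; other_cache_has_copy models whether a read miss is serviced by another CPU's cache or by memory, and bus actions are reduced to comments.

// Sketch of the Illinois (MESI) protocol from the CPU's perspective.
enum class State { Invalid, Shared, Exclusive, Modified };
enum class CpuOp { Read, Write };

State cpu_transition(State s, CpuOp op, bool other_cache_has_copy) {
  switch (s) {
    case State::Invalid:
      if (op == CpuOp::Read)
        return other_cache_has_copy ? State::Shared      // supplied by another CPU
                                    : State::Exclusive;  // supplied by memory
      return State::Modified;                            // write: invalidate others
    case State::Shared:
      return (op == CpuOp::Write) ? State::Modified      // write: invalidate others
                                  : State::Shared;
    case State::Exclusive:
      return (op == CpuOp::Write) ? State::Modified      // silent upgrade, no bus traffic
                                  : State::Exclusive;
    case State::Modified:
      return State::Modified;                            // read or write: stay Modified
  }
  return s;                                              // unreachable
}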

Page 27

Illinois Protocol from the bus’s perspective

[State diagram, reconstructed as a transition list: what a cache does when its snoop logic observes another processor's bus transaction for a block it holds.]

• Modified, bus read observed → supply the data, go to Shared
• Modified, bus write observed → first supply the data, go to Invalid
• Exclusive, bus read observed → Shared
• Exclusive, bus write (miss) observed → Invalid
• Shared, bus read observed → Shared
• Shared, bus write or invalidation observed → Invalid
• Invalid (or no tag match): no action

Page 28

Issues with a single bus

• Keep adding CPUs and eventually the bus saturates.

• Coherence protocol adds extra traffic

• Buses are slower than point-to-point interconnects