TKT-2431 SoC Design
Lec 9 – Parallelism
Erno Salminen
Department of Computer Systems, Tampere University of Technology
Fall 2008
Contents
Introduction
Amdahl’s law once again
Computer classification by Flynn
Parallelization methods: data-parallel, function-parallel
Communication scheme and memories
Case study: data-parallel video encoder
Copyright notice
Part of the slides adapted from the slide set by Alberto Sangiovanni-Vincentelli, course EE249 at University of California, Berkeley: http://www-cad.eecs.berkeley.edu/~polis/class/lectures.shtml
Part of the figures from: Ireneusz Karkowski and Henk Corporaal, Exploiting Fine- and Coarse-grain Parallelism in Embedded Programs, International Conference on Parallel Architectures and Compilation Techniques (PACT'98), Paris, October 1998, pp. 60-67
At first
Make sure that simple things work before even trying more complex ones
Von Neumann (vN) Architecture Is Reaching Its Limits ...
stolen from Bob Colwell
vN: unbalanced
[Figure: CPU (processor core) connected through the vN bottleneck to cache and memory]
Source: R. Hartenstein, Univ. Kaiserslautern
E. Maehle, E-Seminar IFIP Working Group 10.3 (Concurrent Systems) June 7, 2005
Benefits of parallel execution
Increased performance, or similar performance with lower cost (lower frequency and lower Vdd!)
Partial shutdown allows more aggressive power reduction
Fault tolerance: overcomes single point of failure
Encapsulation, plug-and-play design process: parts are specialized for certain tasks
Note! Very important!
Multi-level parallelism
Parallelism appears at many levels:
1. Inside processors
2. Between parallel processing elements
Use a combination for maximum impact
Fig. from [O. Lehtoranta, PhD Thesis, TUT 2006]
Note! Very important!
Parallelization methods
[Kulmala, JES, 2006]
The best method depends – not surprisingly – on the application
Also on non-functional requirements
E.g. temporal-parallel [Nang] has higher latency than data-parallel: not suited for real-time encoding, but good for off-line encoding
Scenarios
Either 1 computation or 1 communication at a time: e.g. the CPU controls all transfers; it is either processing or copying data
1 comp + 1 comm in parallel: double buffering; the producer writes data to a different buffer than the one currently processed, and the buffers are switched for the next data (see the sketch below)
n comp + 1 comm: many CPUs but a single bus
n comp + n comm: many CPUs, memories, and accelerators; multiple parallel links in the network
(1 comp + n comm seems paradoxical)
[Figure: producer (camera) and consumer (CPU) sharing a double-buffered memory with buffers ’0’ and ’1’]
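To make the double-buffering scenario concrete, here is a minimal C sketch. It is not from the slides: produce() and consume() are made-up stand-ins for the camera DMA and the encoder, which would run concurrently in a real system.

#include <stdio.h>

#define BUF_WORDS 4

static int buf[2][BUF_WORDS];            /* the two buffers, ’0’ and ’1’ */

/* Stand-ins for the producer (camera/DMA) and the consumer (CPU). */
static void produce(int *dst, int frame)
{
    for (int i = 0; i < BUF_WORDS; i++)
        dst[i] = frame * 100 + i;
}

static void consume(const int *src)
{
    for (int i = 0; i < BUF_WORDS; i++)
        printf("%d ", src[i]);
    printf("\n");
}

int main(void)
{
    int active = 0;                      /* buffer being consumed */
    produce(buf[active], 0);             /* prefill buffer 0 */
    for (int frame = 1; frame <= 3; frame++) {
        produce(buf[1 - active], frame); /* fill the idle buffer... */
        consume(buf[active]);            /* ...while this one is processed */
        active = 1 - active;             /* switch buffers for the next data */
    }
    consume(buf[active]);                /* drain the last buffer */
    return 0;
}

In a real system the two calls inside the loop overlap in time (1 comp + 1 comm in parallel); here they are listed sequentially only for illustration.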
Amdahl’s Law (again!)
ExTime_new = ExTime_old × ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

Speedup_overall = ExTime_old / ExTime_new = 1 / ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
[H. Corporaal, course material Adv. Computer architectures, Univ. Delft, 2001]
Amdahl’s Law Example
Floating point instructions improved to run 2×, but only 10% of actual instructions are FP
ExTime_new = ExTime_old × (0.9 + 0.1/2) = 0.95 × ExTime_old
Speedup_overall = 1 / 0.95 = 1.053
Max. Speedup_overall = 1 / (1 − Fraction_enhanced) = 1 / 0.9 ≈ 1.11
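The arithmetic can be checked with a few lines of C (a sketch, not course material):

#include <stdio.h>

/* Amdahl's law: overall speedup for a given enhanced fraction. */
static double speedup_overall(double fraction, double speedup_enh)
{
    return 1.0 / ((1.0 - fraction) + fraction / speedup_enh);
}

int main(void)
{
    printf("overall: %.3f\n", speedup_overall(0.10, 2.0)); /* 1.053 */
    printf("max:     %.3f\n", speedup_overall(0.10, 1e9)); /* -> 1/0.9 = 1.111 */
    return 0;
}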
Parallelism and Amdahl’s law
Speedup can be achieved via parallel computation
(1 − Fraction_enhanced) = Fraction_sequential
Computation cannot be distributed in arbitrarily sized blocks
E.g. use two processing elements for computation, with two tasks of size 0.55×orig and 0.45×orig: speedup = 1/0.55 = 1.81 instead of 2.0
Seq. part may increase as parallelism increases
More communication, and the ”master” has a harder time distributing the load
Amdahl (4)
[Plot: speedup vs. n_cpu (0–60) for seq_part = 1–10%, min task size = 0%, communication overhead 0%; curves for ideal and seq_part = 10%, 5%, 2%, 1%]
[Plot: speedup vs. n_cpu for seq_part = 2–10%, min task size = 1–2%, communication overhead 0%; curves for ideal, original Amdahl, and the seq_part/task-size combinations]
Quantized load = multiples of 1/100 or 1/50 of the original load
The speedup improves in steps when the number of tasks is an integer multiple of n_cpu
Amdahl (5)
Seq. part increases 0/1/5% per added CPU; combines the seq. part increase with the quantized load
[Plot: speedup vs. n_cpu for seq_part = 2%, min task size = 0%, communication overhead 0–5% per n_cpu; curves for ideal and overhead = 0%, 1%, 5%]
[Plot: speedup vs. n_cpu for seq_part = 2%, min task size = 2%, communication overhead 0–5% per n_cpu; curves for ideal and overhead = 0%, 1%, 5%]
With overhead, the speedup eventually even decreases as CPUs are added
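The curves on these two slides can be reproduced with a small model. The sketch below is my reading of the slides' assumptions (a sequential part, load quantized to fixed-size tasks, and an overhead proportional to the CPU count), not the original script behind the figures.

#include <math.h>
#include <stdio.h>

/* Extended Amdahl model: the slowest CPU gets ceil(n_tasks / n_cpu)
   tasks, and every added CPU adds a fixed communication overhead. */
static double speedup(int n_cpu, double seq, double task, double ovh)
{
    double par = 1.0 - seq;                            /* parallelizable fraction */
    double par_time;
    if (task > 0.0)
        par_time = ceil((par / task) / n_cpu) * task;  /* quantized load */
    else
        par_time = par / n_cpu;                        /* ideal split */
    return 1.0 / (seq + par_time + ovh * n_cpu);
}

int main(void)
{
    for (int n = 1; n <= 60; n++)
        printf("n_cpu=%2d  speedup=%.2f\n", n, speedup(n, 0.02, 0.02, 0.01));
    return 0;
}

With overhead > 0 the curve peaks and then turns down, which is exactly the effect noted above.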
Computer classification
Computer classification: Flynn Categories
1. SISD (Single Instruction Single Data): traditional uniprocessor
2. MISD (Multiple Instruction Single Data): systolic arrays / stream-based processing; rare
I = number of simultaneously executed instructions, D = number of simultaneously processed data:

I \ D        Single       Multiple
Single       1. SISD      3. SIMD
Multiple     2. MISD      4. MIMD
Flynn Categories (2)
3. SIMD (Single Instruction Multiple Data)
Simple programming model (e.g. Intel MMX), low overhead
Now applied as sub-word parallelism: compute four 8-bit ADD operations with one 32-bit ALU (see the sketch after this list)
Only one program counter (PC)
4. MIMD (Multiple Instruction Multiple Data)
Multiple program counters, a “real multiprocessor”
Flexible, may use off-the-shelf micros
Special case: Single Program, Multiple Data (SPMD): all CPUs use the same program code, which saves memory!
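As an illustration of the sub-word idea, here is a SWAR-style sketch in plain C (my illustration; real SIMD ISAs such as MMX keep the lanes apart in hardware):

#include <stdint.h>
#include <stdio.h>

/* Four 8-bit additions with one 32-bit operation path: mask the top
   bit of each lane so carries cannot spill into the neighboring lane,
   then patch the top bits back in. */
static uint32_t add4x8(uint32_t a, uint32_t b)
{
    uint32_t low = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    return low ^ ((a ^ b) & 0x80808080u);
}

int main(void)
{
    printf("%08x\n", add4x8(0x01020304u, 0x10203040u)); /* prints 11223344 */
    return 0;
}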
Multiprocessor SoC (MP-SoC)
Tremendous growth of integration capabilities allows building chip multiprocessors (CMP)
A.k.a. multiprocessor SoC (MP-SoC); even on a single FPGA!
Heterogeneous: IBM Cell Broadband Engine (PowerPC PPE + 8 SPEs), TI OMAP (ARM + TI DSP)
Homogeneous: Intel/AMD Dual/Quad Core, DACI MP-SoC on FPGA, Intel TeraFLOPS
[Figure: IBM Cell die photo, from Wikipedia]
Reminder: chip multiprocessors
Past: 1 CPU, accelerator(s), memories
Today/future: multiple (tens of) processors, accelerator(s), memories
Ramchan Woo, Tampere SoC, 2004.
Data-parallel and function-parallel
Parallelization methods (1)
Example code:
for i = 1:9 loop
  A
  B
  C
end loop
A1 refers to the first iteration of operation A
Operation A can be a single instruction or a function
Two methods:
1. Data-parallel (or index set partitioning)
2. Operation-parallel (or functional pipelining)
Note! Very important!
Data-parallel
All CPUs perform the same operations but on different data sets
Usually the CPUs must know their own ID
The example below assumes that all iterations of A can be parallelized
If Ai must be executed before Ai+1 and B after A, CPU 2 must wait until A3 is ready

Data-parallel method (time runs left to right):
CPU 1: A1 A2 A3 B1 B2 B3 C1 C2 C3
CPU 2: A4 A5 A6 B4 B5 B6 C4 C5 C6
CPU 3: A7 A8 A9 B7 B8 B9 C7 C8 C9
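A minimal SPMD-style sketch in C with pthreads (my illustration, not the course code): every thread runs the same function and uses its own ID to pick its slice of the index set.

#include <pthread.h>
#include <stdio.h>

#define N_CPU  3
#define N_ITER 9

static int data[N_ITER];

static void *worker(void *arg)
{
    long id = (long)arg;                 /* each "CPU" knows its own ID */
    int per_cpu = N_ITER / N_CPU;        /* assumes an even split */
    for (int i = id * per_cpu; i < (id + 1) * per_cpu; i++)
        data[i] = i * i;                 /* stands in for A, B, C on iteration i */
    return NULL;
}

int main(void)
{
    pthread_t t[N_CPU];
    for (long id = 0; id < N_CPU; id++)
        pthread_create(&t[id], NULL, worker, (void *)id);
    for (int id = 0; id < N_CPU; id++)
        pthread_join(t[id], NULL);
    for (int i = 0; i < N_ITER; i++)
        printf("%d ", data[i]);
    printf("\n");
    return 0;
}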
Operation-parallel
Each CPU performs different operations
The figure assumes that Ai must be executed before Bi (which is executed before Ci)
All operations should have roughly equal execution time to have a balanced pipeline
Balancing may prove difficult

Operation-parallel method (time runs left to right; B and C start one step late due to the dependencies):
CPU 1: A1 A2 A3 A4 A5 A6 A7 A8 A9
CPU 2:    B1 B2 B3 B4 B5 B6 B7 B8 B9
CPU 3:       C1 C2 C3 C4 C5 C6 C7 C8 C9
Instruction-level parallelism (ILP)
CPU has many functional units, e.g. ALU, multiplier, shifter, and load/store
Dependent instructions cannot be executed in parallel, e.g.:
ld r3, [r2]   ; load r3 from memory
add r3, 5     ; depends on the just-loaded r3
ILP in kernels is often restricted to < 5
That limits the maximum speedup of single-threaded programs through ILP
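Conversely, independent operations can issue in parallel; a small C fragment (illustration only):

/* The first two additions have no dependency and can execute in
   parallel on two ALUs; the multiply depends on both and must wait. */
int ilp_example(int a, int b, int c, int d)
{
    int x = a + b;   /* independent */
    int y = c + d;   /* independent */
    return x * y;    /* depends on x and y */
}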
Fine grain + coarse grain parallelism
ILP is too fine-grained to be used alone
Combine:
fine grain: ILP, SIMD
coarse grain: multiple CPUs
Multiprocessor speedup
<nag> One should NOT use background color for graphs/figures </nag>
Speedup is never equal to the number of CPUs
A task cannot be split into equally sized sub-tasks; there is always some part executed sequentially
The more CPUs, the more they have to communicate with each other
Data-parallel vs operation-parallel
A (large) data set is easier to split into equally sized chunks than the operations in an operation-parallel scheme
Data-parallel (DP) is often a better method than operation-parallel (OP)
Note a few exceptions in Fig. 15, such as mulaw
They should be combined for best performance
Hierarchical parallelization
Each CPU optimized for certain functions
Utilize:
ILP and SIMD within a CPU
data-parallel and operation-parallel approaches for the multi-CPU system
minimal inter-CPU communication
Hierarchical parallelization (2)
Combining all the aforementioned approaches yields the biggest speedup
The best combination of approaches depends on the application
A traditional CPU has performance = 1
Automatic parallelization
Research topic for several years; only small-scale success so far
Tim Sweeney: http://chipsandbs.blogspot.com/2006/03/auto-parallelization-of-c-code-is-not.html (actually from http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2377&p=3)
"Auto-parallelization of C++ code is not a serious notion. These techniques applied to C/C++ programs are completely infeasible on the scale of real applications. Writing multithreaded software is very hard. It's about as unnatural to support multithreading in C++ as it was to write object-oriented software in assembly language. The whole industry is starting to do it now, but it's pretty clear that a new programming model is needed if we're going to scale to ever more parallel architectures."
Communication scheme
Communication models
1. Shared Memory (SM): shared address space
e.g. load, store, atomic swap
Simpler programming if cache coherency is implemented
CPUs process the data inside the shared memory
Shared data protected with a mutex (see the sketch below)
2. Message Passing (MP)
e.g. send and receive library calls
Explicit communication, a bit harder to program
CPUs process the data in their local, private memories
Note that MP can be built on top of SM and vice versa
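A minimal shared-memory sketch in C with pthreads (an illustration, not from the slides): several threads update the same variable, and a mutex keeps the updates from interleaving.

#include <pthread.h>
#include <stdio.h>

#define N_THREADS 4

static long counter = 0;                           /* shared data */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);                 /* enter critical section */
        counter++;                                 /* update shared data */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[N_THREADS];
    for (int i = 0; i < N_THREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < N_THREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);            /* always 400000 */
    return 0;
}

In a message-passing scheme the same update would instead travel through explicit send/receive calls between private memories.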
Communication models (2)
1. Shared Memory (SM)
Does not scale as well as MP to large systems
Sensitive to memory latency: the CPU is stalled after a miss
2. Message Passing (MP)
Less sensitive to latency
Pipelining of computation and communication
[Figure: ”typical” shared-memory computer, e.g. Dual Core: CPUs with caches and network interfaces (NI) connected through a network to a main memory holding a shared arr[x]]
[Figure: ”typical” message-passing computer, e.g. TeraFLOPS: CPUs, each with a private memory and NI, connected through a network]
Centralized vs. distributed memory
Physically shared – Symmetric Multiprocessors (SMP)
Centralized memory (memories)
Equal access time for all CPUs
Physically distributed – Distributed Shared Memory (DSM)
Each memory adjacent to one CPU
The closest CPU has fast access, the others have slower
Unequal access times to different parts of the memory space
More common than SMP in large systems
Practically all message-passing systems use distributed memory
Explicitly managed local memories are also called ”scratch-pad memories”
Memory
Memory is the number one candidate for the system bottleneck: guilty until proven innocent
Memories need a big area (which also causes leakage)
The amount is limited, especially inside an FPGA
The number of ports affects memory area, power, and delay dramatically
More than two ports are seldom used; use many parallel memory banks instead
How to locate data efficiently?
Off-chip memory is impractical for the same reasons
Make a clear distinction between bits (b) and bytes (B) when presenting memory sizes
Distinguish kilos (k, 1000) and kibis (Ki, 1024)
Memory (2)
Build a hierarchy of memories
Cache vs. explicitly managed transfers
Combination of fast, small memories and large, slow memories
Cache coherency is difficult in parallel systems
i.e. all CPUs must have a consistent view of all data
What if data inside the cache of one CPU is modified?
Dynamic memory allocation causes unpredictable execution times
What happens if the program runs out of memory?
Memory hierarchy
[M. Erez, Stream Architectures –Programmability and Efficiency, Tampere SoC, Nov. 2004]
[Figure: memory hierarchy levels plotted by capacity vs. aggregate bandwidth]
Synchronization
Tasks running in parallel must be synchronized every now and then
1. Process (task) synchronization: multiple processes join up or handshake at a certain point
2. Data synchronization: keeps multiple copies of a data set coherent with one another to maintain data integrity
Process synchronization primitives are commonly used to implement data synchronization
Synchronization (2)
1. Semaphore (lock)
Variable for communicating status between parallel tasks
Ensures that only one or a limited number of tasks updates a shared data structure
Requires atomic test-and-set functionality
Mutual exclusion (mutex) is a binary semaphore
2. Barrier
Ensures that all tasks have reached a specific point in the program
Tasks read the barrier semaphore, stop, and keep waiting until the last task ”arrives” (see the sketch below)
Also other types, such as thread join, non-blocking synchronization, synchronous communication
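A barrier sketch with POSIX threads (an illustration; pthread_barrier_t is a POSIX feature, hence the feature-test macro):

#define _XOPEN_SOURCE 600   /* for pthread_barrier_t */
#include <pthread.h>
#include <stdio.h>

#define N_TASKS 4

static pthread_barrier_t barrier;

static void *task(void *arg)
{
    long id = (long)arg;
    printf("task %ld reached the barrier\n", id);
    pthread_barrier_wait(&barrier);   /* wait until the last task arrives */
    printf("task %ld continues\n", id);
    return NULL;
}

int main(void)
{
    pthread_t t[N_TASKS];
    pthread_barrier_init(&barrier, NULL, N_TASKS);
    for (long id = 0; id < N_TASKS; id++)
        pthread_create(&t[id], NULL, task, (void *)id);
    for (int id = 0; id < N_TASKS; id++)
        pthread_join(t[id], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}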
HIBI Multiprocessor Example: Scalable Video Encoder
H.263 Video Encoder
Objective: show how easily HIBI scales
Processor-independent C source code
Master + a scalable number of processors, generated automatically
[Figure: master node (ARM7 with I/O, DMA, DPRAM, ROM, and a HIBI wrapper; YUV in, H.263 out) connected via HIBI wrappers to a scalable number of slave nodes, each an ARM7 + DMA]
Data Parallel Mapping (3 slaves example)
[Figure: frame foreman.cif (352x288, 18 macroblock rows) in the master's DPSRAM, split into top, middle, and bottom slices; DMA transfers each slice to one ARM7TDMI slave (Slave 1-3)]
Load Balancing
[Figure: frame splits for 1-5 slaves with one macroblock row as the minimum unit; well-balanced vs. poorly balanced splits, with increasing inter-communication as slaves are added]
Note! Very important!
Balance the load (= computation) so that:
All processing elements (PE) have equal load
Communication between PEs is minimized
Poor balance decreases performance
Load balance example
[Plot: frame rate (fps, QCIF) at 100 MHz vs. number of slaves (1-10), with INTRA and INTER curves]
The frame is divided into 9 rows
The slave with the biggest number of rows defines the performance
Speedup is not linear
Balance:
1: 9
2: 4+5
3: 3+3+3
4: 2+2+2+3
5: 1+2+2+2+2
6: 1+1+1+2+2+2
7: 1+1+1+1+1+2+2
8: 1+1+1+1+1+1+1+2
9: 1+1+1+1+1+1+1+1+1
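The list above follows directly from ceiling division; a short C sketch (illustration only) computes the slowest slave and the resulting speedup bound:

#include <stdio.h>

#define N_ROWS 9   /* macroblock rows per frame */

int main(void)
{
    for (int slaves = 1; slaves <= N_ROWS; slaves++) {
        int max_rows = (N_ROWS + slaves - 1) / slaves;  /* ceil(9 / slaves) */
        printf("%d slaves: slowest gets %d rows -> speedup <= %.2f\n",
               slaves, max_rows, (double)N_ROWS / max_rows);
    }
    return 0;
}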
Load balance example (2)
A larger picture size has better balance
DMA allows simultaneous computation and communication
Load balance example (3)
Simulation of a WLAN terminal on multiple PCs
See also the integration example from Lec 6
Manual load balancing
Minimal communication between PCs
Good scalability
Applicable also on a multiprocessor server
Simulation task split into several small processes
Spread the simulation tasks to multiple CPUs instead of one
J. Riihimäki et al., Practical Distributed Simulation of a Network of Wireless Terminals, Tampere SoC, November 2004
Conclusion
Amdahl’s law offers an idealistic but fundamental limit
Combine fine and coarse grain parallelism
Two basic methods: function-parallel and data-parallel
Two basic communication schemes: shared memory and message passing
Load balancing is crucial in parallel systems
A larger data set simplifies balancing in the data-parallel scheme
Communication between components has a great impact on performance