
Page 1: Lec 9 – Parallelism

TKT-2431 SoC Design
Erno Salminen
Department of Computer Systems, Tampere University of Technology
Fall 2008

Page 2: Contents

Introduction
Amdahl's law once again
Computer classification by Flynn
Parallelization methods
  Data-parallel
  Function-parallel
Communication scheme and memories
Case study: Data-parallel video encoder

Page 3: Copyright notice

Part of the slides adapted from a slide set by Alberto Sangiovanni-Vincentelli, course EE249 at University of California, Berkeley
http://www-cad.eecs.berkeley.edu/~polis/class/lectures.shtml

Part of the figures from: Ireneusz Karkowski and Henk Corporaal, Exploiting Fine- and Coarse-grain Parallelism in Embedded Programs, International Conference on Parallel Architectures and Compilation Techniques (PACT'98), Paris, October 1998, pp. 60-67

Page 4: At first

Make sure that simple things work before even trying more complex ones.

Page 5: Von Neumann (vN) Architecture Is Reaching Its Limits...

stolen from Bob Colwell

vN: unbalanced

[Figure: the von Neumann bottleneck between the CPU (processor core), cache, and memory]

Source: R. Hartenstein, Univ. Kaiserslautern
E. Maehle, E-Seminar IFIP Working Group 10.3 (Concurrent Systems), June 7, 2005

Page 6: Benefits of parallel execution

Increased performance, or similar performance at lower cost (lower frequency and lower Vdd!)
Partial shutdown: allows more aggressive power reduction
Fault-tolerance: overcomes single points of failure
Encapsulation, plug-and-play design process: parts are specialized to certain tasks

Note! Very important!

Page 7: Multi-level parallelism

Parallelism appears at many levels:
1. Inside processors
2. Between parallel processing elements
Use a combination of both for maximum impact

Fig. from [O. Lehtoranta, PhD Thesis, TUT 2006]

Note! Very important!

Page 8: Parallelization methods

[Kulmala, JES, 2006]

The best method depends, not surprisingly, on the application
Also on non-functional requirements
E.g. temporal-parallel [Nang] has higher latency than data-parallel
  Not suited for real-time encoding
  Good for off-line encoding

Page 9: Scenarios

Either 1 computation or 1 communication
  e.g. CPU controls all transfers; it is either processing or copying data
1 computation + 1 communication in parallel
  Double buffering: the producer writes data to a different buffer (blue) than the one currently being processed (red); buffers are switched for the next data
n computations + 1 communication
  Many CPUs but a single bus
n computations + n communications
  Many CPUs, memories, and accelerators; multiple parallel links in the network
(1 computation + n communications seems paradoxical)

[Figure: producer (camera) and consumer (CPU) sharing a memory with two buffers, '0' and '1']
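A minimal C sketch of the double-buffering idea described above; camera_capture() and encode() are placeholder stubs (assumptions for illustration, not from the slides) standing in for, e.g., a camera DMA and a video encoder.

/* Minimal double-buffering sketch. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BUF_SIZE 16

static uint8_t buffer[2][BUF_SIZE];            /* the two buffers, '0' and '1' */

static void camera_capture(uint8_t *dst, int frame)   /* producer stub */
{
    memset(dst, frame, BUF_SIZE);
}

static void encode(const uint8_t *src)                 /* consumer stub */
{
    printf("encoding frame %d\n", src[0]);
}

int main(void)
{
    int write_idx = 0;
    camera_capture(buffer[write_idx], 0);        /* fill the first buffer */

    for (int frame = 1; frame < 5; frame++) {
        int read_idx = write_idx;                /* consumer gets the full buffer   */
        write_idx ^= 1;                          /* producer switches to the other  */
        /* On real hardware these two calls would run in parallel,
         * e.g. DMA transfer + CPU processing. */
        camera_capture(buffer[write_idx], frame);
        encode(buffer[read_idx]);
    }
    encode(buffer[write_idx]);                   /* last captured frame */
    return 0;
}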

Page 10: Amdahl's Law (again!)

ExTime_new = ExTime_old * ( (1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced )

Speedup_overall = ExTime_old / ExTime_new
                = 1 / ( (1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced )

[H. Corporaal, course material Adv. Computer Architectures, Univ. Delft, 2001]

Page 11: Amdahl's Law Example

Floating-point instructions improved to run 2x, but only 10% of actual instructions are FP

ExTime_new = ExTime_old * (0.9 + 0.1/2) = 0.95 * ExTime_old

Speedup_overall = 1 / 0.95 = 1.053

Max. Speedup_overall = 1 / (1 - Fraction_enhanced)

Page 12: Parallelism and Amdahl's law

Speedup can be achieved via parallel computation
  (1 - Fraction_enhanced) = Fraction_sequential
Computation cannot be distributed in arbitrarily sized blocks
  e.g. use two processing elements for computation
  two tasks: 0.55*orig and 0.45*orig
  speedup = 1/0.55 = 1.81 instead of 2.0
Sequential part may increase as parallelism increases
  More communication; the "master" has a harder time distributing the load
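A small C sketch of the two effects above, using the simple Amdahl model from the previous slides; the function name and printed cases are illustrative, the 0.55/0.45 split is the example from this slide.

/* Amdahl speedup for a sequential fraction 'seq' and n processors. */
#include <stdio.h>

static double amdahl(double seq, int n)
{
    return 1.0 / (seq + (1.0 - seq) / n);
}

int main(void)
{
    /* Ideal Amdahl with no sequential part: speedup = n. */
    printf("seq=0%%,  n=2:  %.2f\n", amdahl(0.0, 2));     /* 2.00 */

    /* Uneven split into 0.55*orig and 0.45*orig on two PEs:
     * finishing time is set by the larger chunk. */
    printf("uneven split:  %.2f\n", 1.0 / 0.55);          /* ~1.8 instead of 2.0 */

    /* A 10% sequential part limits speedup even with many CPUs. */
    printf("seq=10%%, n=50: %.2f\n", amdahl(0.10, 50));   /* ~8.5 */
    return 0;
}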

Page 13: Amdahl (4)

[Figure: Speedup vs. n_cpu (0-60), seq_part = 1-10%, min task size = 0%, communication overhead 0%; curves for seq_part = 10%, 5%, 2%, 1% and the ideal case]

[Figure: Speedup vs. n_cpu (0-60), seq_part = 2-10%, min task size = 1-2%, communication overhead 0%; curves for the original Amdahl law and for the combinations of seq_part = 2%/10% and task size = 1%/2%]

Quantized load = multiples of 1/100 or 1/50 of the original
Speedup improves in steps when n_cpu is an integer multiple of the task size

Page 14: Amdahl (5)

Seq. part increases 0/1/5% per added CPU
Combine seq. part increase and quantized load

[Figure: Speedup vs. n_cpu (0-60), seq_part = 2%, min task size = 0%, communication overhead 0-5% per n_cpu; curves for ideal and overhead = 0%, 1%, 5%]

[Figure: Speedup vs. n_cpu (0-60), seq_part = 2%, min task size = 2%, communication overhead 0-5% per n_cpu; curves for ideal and overhead = 0%, 1%, 5%]

Speedup eventually decreases as CPUs are added when the overhead grows
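A hedged C sketch of the kind of model these plots appear to use: a sequential part, load quantized into fixed-size tasks, and a per-CPU communication overhead. The exact formula and coefficients are assumptions for illustration, not taken from the slides.

/* speedup(n) under seq part, quantized load, and per-CPU overhead. */
#include <math.h>
#include <stdio.h>

static double speedup(double seq, double task, double ovh, int n_cpu)
{
    double par = 1.0 - seq;                       /* parallelizable work          */
    double t_par;
    if (task > 0.0) {
        double n_tasks = ceil(par / task);        /* quantized load               */
        t_par = ceil(n_tasks / n_cpu) * task;     /* busiest CPU dominates        */
    } else {
        t_par = par / n_cpu;                      /* ideal, infinitely divisible  */
    }
    return 1.0 / (seq + t_par + ovh * n_cpu);     /* overhead grows with n_cpu    */
}

int main(void)
{
    for (int n = 1; n <= 60; n += 10)
        printf("n=%2d  speedup=%5.2f\n", n, speedup(0.02, 0.02, 0.01, n));
    return 0;
}

With a 1% per-CPU overhead the speedup peaks and then falls, which is the "speedup decreases" behaviour shown in the figures.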

Page 15: Computer classification

Page 16: Computer classification: Flynn Categories

1. SISD (Single Instruction, Single Data)
   Traditional uniprocessor
2. MISD (Multiple Instruction, Single Data)
   Systolic arrays / stream-based processing
   Rare

I = number of simultaneously executed instructions
D = number of simultaneously processed data items

                 D: Single    D: Multiple
I: Single        1. SISD      3. SIMD
I: Multiple      2. MISD      4. MIMD

Page 17: Flynn Categories (2)

3. SIMD (Single Instruction, Multiple Data)
   Simple programming model (e.g. Intel MMX)
   Low overhead
   Now applied as sub-word parallelism
     Count four 8-bit ADD operations with one 32-bit ALU
   Only one program counter (PC)
4. MIMD (Multiple Instruction, Multiple Data)
   Multiple program counters, "real multiprocessor"
   Flexible, may use off-the-shelf microprocessors
   Special case: Single Program, Multiple Data (SPMD)
     All CPUs use the same program code, which saves memory!
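A small C sketch of the sub-word idea: four 8-bit additions packed into one 32-bit word, emulating in software what an MMX-style ALU does in hardware. The masking scheme is a standard trick used here for illustration, not something prescribed by the slides.

/* Add four 8-bit lanes packed in one 32-bit word, without letting
 * carries cross lane boundaries (sub-word / SIMD-style ADD). */
#include <stdint.h>
#include <stdio.h>

static uint32_t add4x8(uint32_t a, uint32_t b)
{
    uint32_t sum_low = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);  /* add low 7 bits of each lane */
    uint32_t msb     = (a ^ b) & 0x80808080u;                  /* top bit of each lane        */
    return sum_low ^ msb;                                      /* carries stay inside lanes   */
}

int main(void)
{
    uint32_t a = 0x01020304u;               /* lanes 1, 2, 3, 4       */
    uint32_t b = 0x10FF2030u;               /* lanes 16, 255, 32, 48  */
    printf("%08x\n", add4x8(a, b));         /* 11012334: 17, 1 (wraps), 35, 52 */
    return 0;
}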

Page 18: Multiprocessor SoC (MP-SoC)

Tremendous growth of integration capability allows building chip multiprocessors (CMP)
  Aka multiprocessor SoC (MP-SoC)
  Even on a single FPGA!
Heterogeneous
  IBM Cell Broadband Engine (PowerPC + 8 SPEs)
  TI OMAP (ARM + TI DSP)
Homogeneous
  Intel/AMD Dual/Quad Core
  DACI MP-SoC on FPGA
  Intel TeraFLOPS

[Figure: IBM Cell, from Wikipedia]

Page 19: Reminder: chip multiprocessors

Past: 1 CPU, accelerator(s), memories
Today/future: multiple (tens of) processors, accelerator(s), memories

[Ramchan Woo, Tampere SoC, 2004]

Page 20: Data-parallel and function-parallel

Page 21: Parallelization methods (1)

Example code:
  for i = 1:9 loop
    A
    B
    C
  end loop

A1 refers to the first iteration of operation A
Operation A can be a single instruction or a function

Two methods:
1. Data-parallel (or index set partitioning)
2. Operation-parallel (or functional pipelining)

Note! Very important!

Page 22: Data-parallel

All CPUs perform the same operations but on different data sets
Usually CPUs must know their own ID
The example below assumes that all iterations of A can be parallelized
  If Ai must be executed before Ai+1 and B after A, CPU 2 must wait until A3 is ready

Data-parallel method (time runs left to right):
  CPU 1: A1 A2 A3 B1 B2 B3 C1 C2 C3
  CPU 2: A4 A5 A6 B4 B5 B6 C4 C5 C6
  CPU 3: A7 A8 A9 B7 B8 B9 C7 C8 C9
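A minimal C sketch of the data-parallel split above: every CPU runs the same code (SPMD style) on its own slice of the index set, derived from the CPU's own ID. A(), B(), C() and the worker() helper are placeholders introduced for illustration.

/* Data-parallel sketch for the 9-iteration example loop. */
#include <stdio.h>

#define N_ITER 9
#define N_CPU  3

static void A(int i) { printf("A%d ", i); }
static void B(int i) { printf("B%d ", i); }
static void C(int i) { printf("C%d ", i); }

/* The same function is executed by every CPU; the ID selects the slice. */
static void worker(int cpu_id)
{
    int chunk = N_ITER / N_CPU;            /* 3 iterations per CPU                */
    int first = cpu_id * chunk + 1;        /* CPU 0 -> 1..3, CPU 1 -> 4..6, ...   */
    int last  = first + chunk - 1;

    for (int i = first; i <= last; i++) A(i);   /* A1..A3 on CPU 0, etc. */
    for (int i = first; i <= last; i++) B(i);
    for (int i = first; i <= last; i++) C(i);
    printf("\n");
}

int main(void)
{
    /* On a real MP-SoC each call would run on its own processor;
     * here they simply run one after another. */
    for (int id = 0; id < N_CPU; id++)
        worker(id);
    return 0;
}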

Page 23: Operation-parallel

Each CPU performs a different operation
The figure assumes that Ai must be executed before Bi (which is executed before Ci)
All operations should have roughly equal execution time to have a balanced pipeline
  Balancing may prove difficult

Operation-parallel method (time runs left to right):
  CPU 1: A1 A2 A3 A4 A5 A6 A7 A8 A9
  CPU 2: B1 B2 B3 B4 B5 B6 B7 B8 B9
  CPU 3: C1 C2 C3 C4 C5 C6 C7 C8 C9

Page 24: Instruction-level parallelism (ILP)

CPU has many functional units
  e.g. ALU, multiplier, shifter, and load/store
Instructions may have dependencies and then cannot be executed in parallel:
  ld  r3, [r2]    ; load r3 from memory
  add r3, 5       ; uses r3, must wait for the load
ILP in kernels is often restricted to < 5
  That is the maximum speedup for single-threaded programs through ILP

Page 25: Fine-grain + coarse-grain parallelism

ILP is too fine-grained to be used alone
Combine:
  fine grain: ILP, SIMD
  coarse grain: multiple CPUs

Page 26: Multiprocessor speedup

<nag> One should NOT use background color for graphs/figures </nag>

Speedup is never equal to the number of CPUs
  A task cannot be split into equally sized sub-tasks
  There is always some part executed sequentially
  The more CPUs there are, the more they have to communicate with each other

Page 27: Data-parallel vs. operation-parallel

A (large) data set is easier to split into equally sized chunks than the operations in the task-parallel scheme
Data-parallel (DP) is often a better method than operation-parallel (OP)
  Note a few exceptions in Fig. 15, such as mulaw
They should be combined for best performance

Page 28: Hierarchical parallelization

Each CPU is optimized for certain functions
Utilize:
  ILP and SIMD within a CPU
  data-parallel and operation-parallel approaches for the multi-CPU system
  minimal inter-CPU communication

Page 29: Hierarchical parallelization (2)

Combining all the aforementioned approaches results in the biggest speedup
The best combination of approaches depends on the application
A traditional CPU has performance = 1 (as the baseline)

Page 30: Automatic parallelization

Research topic for several years
Only small-scale success so far

Tim Sweeney:
http://chipsandbs.blogspot.com/2006/03/auto-parallelization-of-c-code-is-not.html
Actually from http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2377&p=3

"Auto-parallelization of C++ code is not a serious notion. These techniques applied to C/C++ programs are completely infeasible on the scale of real applications. Writing multithreaded software is very hard. It's about as unnatural to support multithreading in C++ as it was to write object-oriented software in assembly language. The whole industry is starting to do it now, but it's pretty clear that a new programming model is needed if we're going to scale to ever more parallel architectures."

Page 31: Communication scheme

Page 32: Communication models

1. Shared Memory (SM): shared address space
   e.g., load, store, atomic swap
   Simpler programming if cache coherency is implemented
   CPUs process the data inside the shared memory
   Shared data is protected with a mutex
2. Message Passing (MP)
   e.g., send and receive library calls
   Explicit communication, a bit harder to program
   CPUs process the data in their local, private memories
Note that MP can be built on top of SM and vice versa
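A short C/pthreads sketch of the shared-memory model: two threads update one shared data structure protected with a mutex, as described above. In the message-passing model the threads would instead exchange explicit send/receive messages and keep the data in private memory. The variable and function names are illustrative.

/* Shared-memory model: mutex-protected shared data. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                       /* shared data            */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);             /* only one CPU at a time */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);        /* 200000 */
    return 0;
}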

Page 33: Communication models (2)

1. Shared Memory (SM)
   Does not scale as well as MP to large systems
   Sensitive to memory latency - CPU stalled after a miss
2. Message Passing (MP)
   Less sensitive to latency
   Pipelining of computation and communication

[Figure: a "typical" shared-memory computer, e.g. Dual Core: CPUs with caches and network interfaces (NI) connected through a network to a main memory holding the shared data (arr[x]); a "typical" message-passing computer, e.g. TeraFLOPS: CPUs each with a local memory and an NI connected through a network]

Page 34: Centralized vs. distributed memory

Physically shared - Symmetric Multi-Processors (SMP)
  Centralized memory (memories)
  Equal access time for all CPUs
Physically distributed - Distributed Shared Memory (DSM)
  Each memory is adjacent to one CPU
  The closest CPU has fast access, the others have slower
  Unequal access times to different parts of the memory space
  More common than SMP in large systems
Practically all message-passing systems use distributed memory
Explicitly managed local memories are also called "scratch-pad memories"

Page 35: Memory

Memory is the no. 1 candidate for the system bottleneck
  Guilty until proven innocent
Memories need a big area (which also causes leakage)
  The amount is limited, especially inside an FPGA
The number of ports affects memory area, power, and delay dramatically
  More than two ports is seldom used
  Use many parallel memory banks instead
  How to locate data efficiently?
Off-chip memory is impractical for the same reasons
Make a clear distinction between bits (b) and bytes (B) when presenting memory sizes
  Distinguish kilos (k, 1000) and kibis (Ki, 1024)

Page 36: Memory (2)

Build a hierarchy of memories
  Cache vs. explicitly managed transfers
  Combination of fast, small memories and large, slow memories
Cache coherency is difficult in parallel systems
  i.e. all CPUs must have a consistent view of all data
  What if data inside the cache of one CPU is modified?
Dynamic memory allocation causes unpredictable execution times
  What happens if the program runs out of memory?

Page 37: Memory hierarchy

[Figure: memory hierarchy showing capacity vs. aggregate bandwidth at each level]

[M. Erez, Stream Architectures - Programmability and Efficiency, Tampere SoC, Nov. 2004]

Page 38: Synchronization

Tasks running in parallel must be synchronized every now and then
1. Process (task) synchronization
   Multiple processes are to join up or handshake at a certain point
2. Data synchronization
   Keeps multiple copies of a dataset coherent with one another to maintain data integrity
Process synchronization primitives are commonly used to implement data synchronization

Page 39: Synchronization (2)

1. Semaphore (lock)
   Variable for communicating status between parallel tasks
   Ensures that only one or a limited number of tasks updates a shared data structure
   Requires atomic test-and-set functionality
   Mutual exclusion (mutex) is a binary semaphore
2. Barrier
   Ensures that all tasks have reached a specific point in the program
   Tasks read the barrier semaphore, stop, and keep waiting until the last task "arrives"
Also other types, such as thread join, non-blocking synchronization, synchronous communication
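A short C/pthreads sketch of the barrier idea: each task finishes its own phase, then waits until the last task arrives before continuing. POSIX barriers are used here for illustration; the slides do not prescribe a specific API.

/* Barrier synchronization sketch. */
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdio.h>

#define N_TASKS 4

static pthread_barrier_t barrier;

static void *task(void *arg)
{
    int id = *(int *)arg;
    printf("task %d: phase 1 done\n", id);
    pthread_barrier_wait(&barrier);              /* wait until all tasks arrive */
    printf("task %d: phase 2 starts\n", id);     /* runs only after the last one */
    return NULL;
}

int main(void)
{
    pthread_t threads[N_TASKS];
    int ids[N_TASKS];

    pthread_barrier_init(&barrier, NULL, N_TASKS);
    for (int i = 0; i < N_TASKS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, task, &ids[i]);
    }
    for (int i = 0; i < N_TASKS; i++)
        pthread_join(threads[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}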

Page 40: HIBI Multiprocessor Example - Scalable Video Encoder

Page 41: H.263 Video Encoder

Objective: show how easily HIBI scales
  Processor-independent C source code
  Master + a scalable number of processors
  Generated automatically

[Figure: master node (ARM7, DMA, DPRAM, ROM, I/O, HIBI wrapper) receiving YUV input and producing the H.263 stream, connected via HIBI wrappers to a scalable number of slave nodes (ARM7 + DMA)]

Page 42: Data Parallel Mapping (3 slaves example)

[Figure: the frame in the master's DPSRAM, foreman.cif (352x288, 18 MB rows), is split into top, middle, and bottom slices that are sent via DMA data transfers to Slave 1, Slave 2, and Slave 3 (ARM7TDMI each); the master is also an ARM7TDMI]

Page 43: Load Balancing

[Figure: frame split into 1-MB-row slices for 1-5 slaves; some splits are well balanced, others poorly balanced; inter-communication increases with the number of slaves]

Note! Very important!

Balance the load (= computation) so that
  All processing elements (PE) have equal load
  Communication between PEs is minimized
Poor balance decreases performance

Page 44: Load balance example

[Figure: frame rate (fps, QCIF) @ 100 MHz vs. number of slaves (1-10), for INTRA and INTER coding]

The frame is divided into 9 rows
The slave with the biggest number of rows defines the performance
  e.g. with 4 slaves one slave gets 3 rows, so the speedup is at most 9/3 = 3 instead of 4
  Speedup is not linear
Balance:
  1 slave:  9 rows
  2 slaves: 4+5
  3 slaves: 3+3+3
  4 slaves: 2+2+2+3
  5 slaves: 1+2+2+2+2
  6 slaves: 1+1+1+2+2+2
  7 slaves: 1+1+1+1+1+2+2
  8 slaves: 1+1+1+1+1+1+1+2
  9 slaves: 1+1+1+1+1+1+1+1+1
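A tiny C sketch of the bound implied by the balance list above, assuming one row is the minimum unit of work (as on the slide): the slave with the most rows, ceil(9/n), sets the pace.

/* Speedup bound when 9 rows are distributed over n slaves. */
#include <stdio.h>

int main(void)
{
    const int rows = 9;
    for (int n = 1; n <= rows; n++) {
        int max_rows = (rows + n - 1) / n;            /* ceil(rows / n) */
        printf("%d slaves: max %d rows/slave, speedup <= %.2f\n",
               n, max_rows, (double)rows / max_rows);
    }
    return 0;
}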

Page 45: Load balance example (2)

A larger picture size has better balance
DMA allows simultaneous computation and communication

Page 46: Load balance example (3)

Simulation of a WLAN terminal on multiple PCs
  See also the integration example from Lec 6
Manual load balancing
  Minimal communication between PCs
Good scalability
  Applicable also on a multiprocessor server
  Simulation task split into several small processes
  Spread simulation tasks to multiple CPUs instead of one

[J. Riihimäki et al., Practical Distributed Simulation of a Network of Wireless Terminals, Tampere SoC, November 2004]

Page 47: Conclusion

Amdahl's law offers an idealistic but fundamental limit
Combine fine- and coarse-grain parallelism
Two basic methods: function-parallel and data-parallel
Two basic communication schemes: shared memory and message passing
Load balancing is crucial in parallel systems
  A larger data set simplifies balancing in the data-parallel scheme
Communication between components has a great impact on performance