TKT-2431 SoC Design
Lec 9 – Parallelism
Erno Salminen
Department of Computer Systems, Tampere University of Technology
Fall 2008
Contents
Introduction
Amdahl’s law once again
Computer classification by Flynn
Parallelization methods: data-parallel, function-parallel
Communication scheme and memories
Case study: data-parallel video encoder
Copyright notice
Part of the slides adapted from the slide set by Alberto Sangiovanni-Vincentelli, course EE249 at University of California, Berkeley: http://www-cad.eecs.berkeley.edu/~polis/class/lectures.shtml
Part of the figures from: Ireneusz Karkowski and Henk Corporaal, Exploiting Fine- and Coarse-grain Parallelism in Embedded Programs, International Conference on Parallel Architectures and Compilation Techniques (PACT'98), Paris, October 1998, pp. 60-67
At first
Make sure that simple things work before even trying more complex ones
Von Neumann (vN) Architecture Is Reaching Its Limits ...
stolen from Bob Colwell
vN: unbalanced
[Figure: CPU (processor core) connected through the vN bottleneck to cache and memory]
Source: R. Hartenstein, Univ. Kaiserslautern
E. Maehle, E-Seminar IFIP Working Group 10.3 (Concurrent Systems) June 7, 2005
Benefits of parallel execution
Increased performance, or similar performance with lower cost (lower frequency and lower Vdd!)
Partial shutdown allows more aggressive power reduction
Fault tolerance: overcomes single point of failure
Encapsulation, plug-and-play design process: parts are specialized for certain tasks
Note! Very important!
Multi-level parallelism
Parallelism appears at many levels:
1. Inside processors
2. Between parallel processing elements
Use a combination for maximum impact
Fig. from [O. Lehtoranta, PhD Thesis, TUT 2006]
Note! Very important!
Parallelization methods
[Kulmala, JES, 2006]
The best method depends – not surprisingly – on the application
Also on non-functional requirements
E.g. temporal-parallel [Nang] has higher latency than data-parallel: not suited for real-time encoding, but good for off-line encoding
Scenarios
Either 1 computation or 1 communication at a time: e.g. the CPU controls all transfers; it is either processing or copying data
1 comp + 1 comm in parallel: double buffering; the producer writes data to a different buffer than the one currently processed, and the buffers are switched for the next data (see the sketch below)
n comp + 1 comm: many CPUs but a single bus
n comp + n comm: many CPUs, memories, and accelerators; multiple parallel links in the network
(1 comp + n comm seems paradoxical)
[Figure: producer (camera) and consumer (CPU) sharing a double-buffered memory with buffers ’0’ and ’1’]
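To make the double-buffering scenario concrete, here is a minimal C sketch. It is not from the slides: produce() and consume() are made-up stand-ins for the camera DMA and the encoder, which would run concurrently in a real system.

#include <stdio.h>

#define BUF_WORDS 4

static int buf[2][BUF_WORDS];            /* the two buffers, ’0’ and ’1’ */

/* Stand-ins for the producer (camera/DMA) and the consumer (CPU). */
static void produce(int *dst, int frame)
{
    for (int i = 0; i < BUF_WORDS; i++)
        dst[i] = frame * 100 + i;
}

static void consume(const int *src)
{
    for (int i = 0; i < BUF_WORDS; i++)
        printf("%d ", src[i]);
    printf("\n");
}

int main(void)
{
    int active = 0;                      /* buffer being consumed */
    produce(buf[active], 0);             /* prefill buffer 0 */
    for (int frame = 1; frame <= 3; frame++) {
        produce(buf[1 - active], frame); /* fill the idle buffer... */
        consume(buf[active]);            /* ...while this one is processed */
        active = 1 - active;             /* switch buffers for the next data */
    }
    consume(buf[active]);                /* drain the last buffer */
    return 0;
}

In a real system the two calls inside the loop overlap in time (1 comp + 1 comm in parallel); here they are listed sequentially only for illustration.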
Amdahl’s Law (again!)
ExTime_new = ExTime_old × ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

Speedup_overall = ExTime_old / ExTime_new = 1 / ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
[H. Corporaal, course material Adv. Computer architectures, Univ. Delft, 2001]
Amdahl’s Law Example
Floating point instructions improved to run 2×, but only 10% of actual instructions are FP
ExTime_new = ExTime_old × (0.9 + 0.1/2) = 0.95 × ExTime_old
Speedup_overall = 1 / 0.95 = 1.053
Max. Speedup_overall = 1 / (1 − Fraction_enhanced) = 1 / 0.9 ≈ 1.11
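The arithmetic can be checked with a few lines of C (a sketch, not course material):

#include <stdio.h>

/* Amdahl's law: overall speedup for a given enhanced fraction. */
static double speedup_overall(double fraction, double speedup_enh)
{
    return 1.0 / ((1.0 - fraction) + fraction / speedup_enh);
}

int main(void)
{
    printf("overall: %.3f\n", speedup_overall(0.10, 2.0)); /* 1.053 */
    printf("max:     %.3f\n", speedup_overall(0.10, 1e9)); /* -> 1/0.9 = 1.111 */
    return 0;
}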
Parallelism and Amdahl’s law
Speedup can be achieved via parallel computation
(1 − Fraction_enhanced) = Fraction_sequential
Computation cannot be distributed in arbitrarily sized blocks
E.g. use two processing elements for computation, with two tasks of size 0.55×orig and 0.45×orig: speedup = 1/0.55 = 1.81 instead of 2.0
Seq. part may increase as parallelism increases
More communication, and the ”master” has a harder time distributing the load
Amdahl (4)
[Plot: speedup vs. n_cpu (0–60) for seq_part = 1–10%, min task size = 0%, communication overhead 0%; curves for ideal and seq_part = 10%, 5%, 2%, 1%]
[Plot: speedup vs. n_cpu for seq_part = 2–10%, min task size = 1–2%, communication overhead 0%; curves for ideal, original Amdahl, and the seq_part/task-size combinations]
Quantized load = multiples of 1/100 or 1/50 of the original load
The speedup improves in steps when the number of tasks is an integer multiple of n_cpu
Amdahl (5)
Seq. part increases 0/1/5% per added CPU; combines the seq. part increase with the quantized load
[Plot: speedup vs. n_cpu for seq_part = 2%, min task size = 0%, communication overhead 0–5% per n_cpu; curves for ideal and overhead = 0%, 1%, 5%]
[Plot: speedup vs. n_cpu for seq_part = 2%, min task size = 2%, communication overhead 0–5% per n_cpu; curves for ideal and overhead = 0%, 1%, 5%]
With overhead, the speedup eventually even decreases as CPUs are added
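The curves on these two slides can be reproduced with a small model. The sketch below is my reading of the slides' assumptions (a sequential part, load quantized to fixed-size tasks, and an overhead proportional to the CPU count), not the original script behind the figures.

#include <math.h>
#include <stdio.h>

/* Extended Amdahl model: the slowest CPU gets ceil(n_tasks / n_cpu)
   tasks, and every added CPU adds a fixed communication overhead. */
static double speedup(int n_cpu, double seq, double task, double ovh)
{
    double par = 1.0 - seq;                            /* parallelizable fraction */
    double par_time;
    if (task > 0.0)
        par_time = ceil((par / task) / n_cpu) * task;  /* quantized load */
    else
        par_time = par / n_cpu;                        /* ideal split */
    return 1.0 / (seq + par_time + ovh * n_cpu);
}

int main(void)
{
    for (int n = 1; n <= 60; n++)
        printf("n_cpu=%2d  speedup=%.2f\n", n, speedup(n, 0.02, 0.02, 0.01));
    return 0;
}

With overhead > 0 the curve peaks and then turns down, which is exactly the effect noted above.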
Computer classification
Computer classification: Flynn Categories
1. SISD (Single Instruction Single Data): traditional uniprocessor
2. MISD (Multiple Instruction Single Data): systolic arrays / stream-based processing; rare
I = number of simultaneously executed instructions, D = number of simultaneously processed data:

I \ D        Single       Multiple
Single       1. SISD      3. SIMD
Multiple     2. MISD      4. MIMD
Flynn Categories (2)
3. SIMD (Single Instruction Multiple Data)
Simple programming model (e.g. Intel MMX), low overhead
Now applied as sub-word parallelism: compute four 8-bit ADD operations with one 32-bit ALU (see the sketch after this list)
Only one program counter (PC)
4. MIMD (Multiple Instruction Multiple Data)
Multiple program counters, a “real multiprocessor”
Flexible, may use off-the-shelf micros
Special case: Single Program, Multiple Data (SPMD): all CPUs use the same program code, which saves memory!
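As an illustration of the sub-word idea, here is a SWAR-style sketch in plain C (my illustration; real SIMD ISAs such as MMX keep the lanes apart in hardware):

#include <stdint.h>
#include <stdio.h>

/* Four 8-bit additions with one 32-bit operation path: mask the top
   bit of each lane so carries cannot spill into the neighboring lane,
   then patch the top bits back in. */
static uint32_t add4x8(uint32_t a, uint32_t b)
{
    uint32_t low = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    return low ^ ((a ^ b) & 0x80808080u);
}

int main(void)
{
    printf("%08x\n", add4x8(0x01020304u, 0x10203040u)); /* prints 11223344 */
    return 0;
}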
Multiprocessor SoC (MP-SoC)
Tremendous growth of integration capabilities allows building chip multiprocessors (CMP)
A.k.a. multiprocessor SoC (MP-SoC); even on a single FPGA!
Heterogeneous: IBM Cell Broadband Engine (PowerPC PPE + 8 SPEs), TI OMAP (ARM + TI DSP)
Homogeneous: Intel/AMD Dual/Quad Core, DACI MP-SoC on FPGA, Intel TeraFLOPS
[Figure: IBM Cell die photo, from Wikipedia]
Reminder: chip multiprocessors
Past: 1 CPU, accelerator(s), memories
Today/future: multiple (tens of) processors, accelerator(s), memories
Ramchan Woo, Tampere SoC, 2004.
Data-parallel and function-parallel
Parallelization methods (1)
Example code:
for i = 1:9 loop
  A
  B
  C
end loop
A1 refers to the first iteration of operation A
Operation A can be a single instruction or a function
Two methods:
1. Data-parallel (or index set partitioning)
2. Operation-parallel (or functional pipelining)
Note! Very important!
Data-parallel
All CPUs perform the same operations but on different data sets
Usually the CPUs must know their own ID
The example below assumes that all iterations of A can be parallelized
If Ai must be executed before Ai+1 and B after A, CPU 2 must wait until A3 is ready

Data-parallel method (time runs left to right):
CPU 1: A1 A2 A3 B1 B2 B3 C1 C2 C3
CPU 2: A4 A5 A6 B4 B5 B6 C4 C5 C6
CPU 3: A7 A8 A9 B7 B8 B9 C7 C8 C9
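A minimal SPMD-style sketch in C with pthreads (my illustration, not the course code): every thread runs the same function and uses its own ID to pick its slice of the index set.

#include <pthread.h>
#include <stdio.h>

#define N_CPU  3
#define N_ITER 9

static int data[N_ITER];

static void *worker(void *arg)
{
    long id = (long)arg;                 /* each "CPU" knows its own ID */
    int per_cpu = N_ITER / N_CPU;        /* assumes an even split */
    for (int i = id * per_cpu; i < (id + 1) * per_cpu; i++)
        data[i] = i * i;                 /* stands in for A, B, C on iteration i */
    return NULL;
}

int main(void)
{
    pthread_t t[N_CPU];
    for (long id = 0; id < N_CPU; id++)
        pthread_create(&t[id], NULL, worker, (void *)id);
    for (int id = 0; id < N_CPU; id++)
        pthread_join(t[id], NULL);
    for (int i = 0; i < N_ITER; i++)
        printf("%d ", data[i]);
    printf("\n");
    return 0;
}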
Operation-parallel
Each CPU performs different operations
The figure assumes that Ai must be executed before Bi (which is executed before Ci)
All operations should have roughly equal execution time to have a balanced pipeline
Balancing may prove difficult

Operation-parallel method (time runs left to right; B and C start one step late due to the dependencies):
CPU 1: A1 A2 A3 A4 A5 A6 A7 A8 A9
CPU 2:    B1 B2 B3 B4 B5 B6 B7 B8 B9
CPU 3:       C1 C2 C3 C4 C5 C6 C7 C8 C9
Instruction-level parallelism (ILP)
CPU has many functional units, e.g. ALU, multiplier, shifter, and load/store
Dependent instructions cannot be executed in parallel, e.g.:
ld r3, [r2]   ; load r3 from memory
add r3, 5     ; depends on the just-loaded r3
ILP in kernels is often restricted to < 5
That limits the maximum speedup of single-threaded programs through ILP
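Conversely, independent operations can issue in parallel; a small C fragment (illustration only):

/* The first two additions have no dependency and can execute in
   parallel on two ALUs; the multiply depends on both and must wait. */
int ilp_example(int a, int b, int c, int d)
{
    int x = a + b;   /* independent */
    int y = c + d;   /* independent */
    return x * y;    /* depends on x and y */
}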
Fine grain + coarse grain parallelism
ILP is too fine-grained to be used alone
Combine:
fine grain: ILP, SIMD
coarse grain: multiple CPUs
Multiprocessor speedup
<nag> One should NOT use background color for graphs/figures </nag>
Speedup is never equal to the number of CPUs
A task cannot be split into equally sized sub-tasks; there is always some part executed sequentially
The more CPUs, the more they have to communicate with each other
Data-parallel vs operation-parallel
A (large) data set is easier to split into equally sized chunks than the operations in an operation-parallel scheme
Data-parallel (DP) is often a better method than operation-parallel (OP)
Note a few exceptions in Fig. 15, such as mulaw
They should be combined for best performance
Hierarchical parallelization
Each CPU optimized for certain functions
Utilize:
ILP and SIMD within a CPU
data-parallel and operation-parallel approaches for the multi-CPU system
minimal inter-CPU communication
Hierarchical parallelization (2)
Combining all the aforementioned approaches yields the biggest speedup
The best combination of approaches depends on the application
A traditional CPU has performance = 1
Automatic parallelization
Research topic for several years; only small-scale success so far
Tim Sweeney: http://chipsandbs.blogspot.com/2006/03/auto-parallelization-of-c-code-is-not.html (actually from http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2377&p=3)
"Auto-parallelization of C++ code is not a serious notion. These techniques applied to C/C++ programs are completely infeasible on the scale of real applications. Writing multithreaded software is very hard. It's about as unnatural to support multithreading in C++ as it was to write object-oriented software in assembly language. The whole industry is starting to do it now, but it's pretty clear that a new programming model is needed if we're going to scale to ever more parallel architectures."
Communication scheme
Communication models
1. Shared Memory (SM): shared address space
e.g. load, store, atomic swap
Simpler programming if cache coherency is implemented
CPUs process the data inside the shared memory
Shared data protected with a mutex (see the sketch below)
2. Message Passing (MP)
e.g. send and receive library calls
Explicit communication, a bit harder to program
CPUs process the data in their local, private memories
Note that MP can be built on top of SM and vice versa
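A minimal shared-memory sketch in C with pthreads (an illustration, not from the slides): several threads update the same variable, and a mutex keeps the updates from interleaving.

#include <pthread.h>
#include <stdio.h>

#define N_THREADS 4

static long counter = 0;                           /* shared data */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);                 /* enter critical section */
        counter++;                                 /* update shared data */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[N_THREADS];
    for (int i = 0; i < N_THREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < N_THREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);            /* always 400000 */
    return 0;
}

In a message-passing scheme the same update would instead travel through explicit send/receive calls between private memories.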
Communication models (2)
1. Shared Memory (SM)
Does not scale as well as MP to large systems
Sensitive to memory latency: the CPU is stalled after a miss
2. Message Passing (MP)
Less sensitive to latency
Pipelining of computation and communication
[Figure: ”typical” shared-memory computer, e.g. Dual Core: CPUs with caches and network interfaces (NI) connected through a network to a main memory holding a shared arr[x]]
[Figure: ”typical” message-passing computer, e.g. TeraFLOPS: CPUs, each with a private memory and NI, connected through a network]
Centralized vs. distributed memory
Physically shared – Symmetric Multiprocessors (SMP)
Centralized memory (memories)
Equal access time for all CPUs
Physically distributed – Distributed Shared Memory (DSM)
Each memory adjacent to one CPU
The closest CPU has fast access, the others have slower
Unequal access times to different parts of the memory space
More common than SMP in large systems
Practically all message-passing systems use distributed memory
Explicitly managed local memories are also called ”scratch-pad memories”
Memory
Memory is the number one candidate for the system bottleneck: guilty until proven innocent
Memories need a big area (which also causes leakage)
The amount is limited, especially inside an FPGA
The number of ports affects memory area, power, and delay dramatically
More than two ports are seldom used; use many parallel memory banks instead
How to locate data efficiently?
Off-chip memory is impractical for the same reasons
Make a clear distinction between bits (b) and bytes (B) when presenting memory sizes
Distinguish kilos (k, 1000) and kibis (Ki, 1024)
Memory (2)
Build a hierarchy of memories
Cache vs. explicitly managed transfers
Combination of fast, small memories and large, slow memories
Cache coherency is difficult in parallel systems
i.e. all CPUs must have a consistent view of all data
What if data inside the cache of one CPU is modified?
Dynamic memory allocation causes unpredictable execution times
What happens if the program runs out of memory?
Memory hierarchy
[M. Erez, Stream Architectures –Programmability and Efficiency, Tampere SoC, Nov. 2004]
[Figure: memory hierarchy levels plotted by capacity vs. aggregate bandwidth]
Synchronization
Tasks running in parallel must be synchronized every now and then
1. Process (task) synchronization: multiple processes join up or handshake at a certain point
2. Data synchronization: keeps multiple copies of a data set coherent with one another to maintain data integrity
Process synchronization primitives are commonly used to implement data synchronization
Synchronization (2)
1. Semaphore (lock)
Variable for communicating status between parallel tasks
Ensures that only one or a limited number of tasks updates a shared data structure
Requires atomic test-and-set functionality
Mutual exclusion (mutex) is a binary semaphore
2. Barrier
Ensures that all tasks have reached a specific point in the program
Tasks read the barrier semaphore, stop, and keep waiting until the last task ”arrives” (see the sketch below)
Also other types, such as thread join, non-blocking synchronization, synchronous communication
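A barrier sketch with POSIX threads (an illustration; pthread_barrier_t is a POSIX feature, hence the feature-test macro):

#define _XOPEN_SOURCE 600   /* for pthread_barrier_t */
#include <pthread.h>
#include <stdio.h>

#define N_TASKS 4

static pthread_barrier_t barrier;

static void *task(void *arg)
{
    long id = (long)arg;
    printf("task %ld reached the barrier\n", id);
    pthread_barrier_wait(&barrier);   /* wait until the last task arrives */
    printf("task %ld continues\n", id);
    return NULL;
}

int main(void)
{
    pthread_t t[N_TASKS];
    pthread_barrier_init(&barrier, NULL, N_TASKS);
    for (long id = 0; id < N_TASKS; id++)
        pthread_create(&t[id], NULL, task, (void *)id);
    for (int id = 0; id < N_TASKS; id++)
        pthread_join(t[id], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}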
HIBI Multiprocessor Example: Scalable Video Encoder
H.263 Video Encoder
Objective: show how easily HIBI scales
Processor-independent C source code
Master + a scalable number of processors, generated automatically
[Figure: master node (ARM7 with I/O, DMA, DPRAM, ROM, and a HIBI wrapper; YUV in, H.263 out) connected via HIBI wrappers to a scalable number of slave nodes, each an ARM7 + DMA]
Data Parallel Mapping (3 slaves example)
[Figure: frame foreman.cif (352x288, 18 macroblock rows) in the master's DPSRAM, split into top, middle, and bottom slices; DMA transfers each slice to one ARM7TDMI slave (Slave 1-3)]
Load Balancing
[Figure: frame splits for 1-5 slaves with one macroblock row as the minimum unit; well-balanced vs. poorly balanced splits, with increasing inter-communication as slaves are added]
Note! Very important!
Balance the load (= computation) so that:
All processing elements (PE) have equal load
Communication between PEs is minimized
Poor balance decreases performance
Load balance example
[Plot: frame rate (fps, QCIF) at 100 MHz vs. number of slaves (1-10), with INTRA and INTER curves]
The frame is divided into 9 rows
The slave with the biggest number of rows defines the performance
Speedup is not linear
Balance:
1: 9
2: 4+5
3: 3+3+3
4: 2+2+2+3
5: 1+2+2+2+2
6: 1+1+1+2+2+2
7: 1+1+1+1+1+2+2
8: 1+1+1+1+1+1+1+2
9: 1+1+1+1+1+1+1+1+1
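The list above follows directly from ceiling division; a short C sketch (illustration only) computes the slowest slave and the resulting speedup bound:

#include <stdio.h>

#define N_ROWS 9   /* macroblock rows per frame */

int main(void)
{
    for (int slaves = 1; slaves <= N_ROWS; slaves++) {
        int max_rows = (N_ROWS + slaves - 1) / slaves;  /* ceil(9 / slaves) */
        printf("%d slaves: slowest gets %d rows -> speedup <= %.2f\n",
               slaves, max_rows, (double)N_ROWS / max_rows);
    }
    return 0;
}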
Load balance example (2)
A larger picture size has better balance
DMA allows simultaneous computation and communication
Load balance example (3)
Simulation of a WLAN terminal on multiple PCs
See also the integration example from Lec 6
Manual load balancing
Minimal communication between PCs
Good scalability
Applicable also on a multiprocessor server
Simulation task split into several small processes
Spread the simulation tasks to multiple CPUs instead of one
J. Riihimäki et al., Practical Distributed Simulation of a Network of Wireless Terminals, Tampere SoC, November 2004
Conclusion
Amdahl’s law offers an idealistic but fundamental limit
Combine fine and coarse grain parallelism
Two basic methods: function-parallel and data-parallel
Two basic communication schemes: shared memory and message passing
Load balancing is crucial in parallel systems
A larger data set simplifies balancing in the data-parallel scheme
Communication between components has a great impact on performance