TRANSCRIPT
Algorithm Mapping on Massively Parallel and Reconfigurable Architectures
Markus Rullmann, Rainer Schaffer, Renate Merker
Department of Electrical Engineering and Information Technology, Institute of Circuits and Systems
Karlsruhe, April 27, 2009
April 27, 2009 Algorithm Mapping on Massively Parallel and Reconfigurable Architectures
Content
1. Introduction
2. Target Architecture → Massively Parallel Architecture
3. Algorithm Class
4. Multi-Level Copartitioning
5. Algorithm Mapping with Multi-Level Copartitioning
1. Level: Subword Parallelism
2. Level: Processor Array
3. Level: Adaptation to the Cache Sizes
6. Test Results
7. Conclusions
Introduction
Demand:
• Large computational resources
• Low power consumption
Trend:
• Increasing density on a chip
Solution:
• Parallel processing → massively parallel architectures
Massively Parallel Architecture (MPA)
• Processor array with processor elements (PEs)
• Processor elements with several functional units (FUs)
• Functional units permit subword-parallel processing
[Figure: Massively parallel architecture (MPA) — main memory, an L2 cache, and per-PE L1 caches feed the processor array; each processor element (PE) contains functional units (FU 1, FU 2) whose registers hold full-length words (FLW) divided into subwords.]
Massively Parallel Architecture (MPA)
Parallel processing of 2, 4, or 8 subwords by splitting the full-length word (FLW) of a functional unit
• Advantages:
– Higher data throughput
• Disadvantages:
– Data have to be packed into the FLWs for parallel processing → additional packing instructions
[Figure: A functional unit with subword parallelism, operating on full-length words (FLW) with 2 subwords, behaves like a linear processor array.]
Subword Parallelism (SWP)
Algorithm Class
[Figure: Iteration space over indices i1, i2 with a data dependency d_s's^k; reduced dependence graph with nodes s, s', s''.]
System of Uniform Recurrence Equations
s :  y_s[i]  = F_s( …, y_s'[i − d_s's^k], … ),   i ∈ I_s
s' : y_s'[i] = F_s'( … ),                        i ∈ I_s'
⋮
Algorithm Class
A for-loop program can be transformed into a system of uniform recurrence equations.
FIR-Filter:
y : y[i] = y[i − d_y] + a[i] ⋅ x[i],   i ∈ I
x : x[i] = x[i − d_x],                 i ∈ I
a : a[i] = a[i − d_a],                 i ∈ I
System of Uniform Recurrence Equations
for (k = 0; k < K; k++) {
    y[k] = 0;
    for (j = 0; j < J; j++) {
        y[k] += a[j] * x[k-j];
    }
}
For-Loop Program
I = { i = (k, j)^T : 0 ≤ k < K, 0 ≤ j < J }
Multi-Level Copartitioning
Subdivision of the iteration space in partitions
Partitioning
Multi-Level Copartitioning
Locally parallel, globally sequential (LPGS) partitioning
• Each node of a partition is assigned to a PE; these PEs run in parallel.
• The partitions themselves are processed sequentially.
Processor array
LPGS Partitioning
Multi-Level Copartitioning
Locally sequential, globally parallel (LSGP) partitioning
• Each partition is assigned to a PE; these PEs run in parallel.
• The elements within a partition are processed sequentially.
Processor array
LSGP Partitioning
Multi-Level Copartitioning
Combination of
– LSGP partitioning and
– LPGS partitioning
An execution order is defined for the sequential parts.
LSGP partition / LPGS partition
Copartitioning
Processor array
Multi-Level Copartitioning
A copartitioned iteration space can itself be copartitioned again, several times over.
Processor array
Multi-Level Copartitioning
Algorithm Mapping
[Diagram: The algorithm (e.g. a loop program) is mapped onto the MPA level by level:
1. Level — Subword parallelism: allocation to the FUs, determined by the number of subwords.
2. Level — Allocation to the PEs, determined by the size of the processor array and the number of registers.
3…n. Level — Utilization of the caches, determined by the cache sizes.
The result is a schedule for the FUs.]
Subword Parallelism
Mapping the algorithm onto a functional unit with subword parallelism (SWP)
• FU with SWP → linear processor array
• Number of subwords → number of PEs
Aims of the Mapping
• Utilization of the subword parallelism
• Fewer packing instructions
First Level Copartitioning
Allocation to the Processor Elements
Mapping onto the processor elements of the processor array
Aims of the Mapping
• Utilization of the processor array parallelism
• Exploitation of the registers in the PEs
Second Level Copartitioning
Utilization of Caches
The sequential execution order of the copartitions is changed by additional copartitionings.
Aim of the Mapping
• Increase the local data re-use
[Figure: Third-level copartitioning, adapted to the main memory]
[Figure: A further copartitioning adapts the execution order to the L1 cache.]
Utilization of Caches
Test Results for Edge Detection Algorithm
Massively Parallel Architecture
• 2x4 processor array (PA)
• Subword parallelism (SWP) with 4 subwords in a FLW
Speed-up
• SWP: 3.00 or 3.99 (theor. 4)
• PA: 7.93 (theor. 8)
• SWP & PA: 23.74 or 31.65 (theor. 32)
[Chart: Speed-up (scale 0–35) for the variants R3, R4, R9, comparing 1x1 w. SWP vs. 1x1 w/o SWP, 4x2 w. SWP vs. 1x1 w. SWP, and 4x2 w. SWP vs. 1x1 w/o SWP.]

PE resources per variant:
Variant | Gen. ALU | Pack.
R3      | 4        | —
R4      | 1        | 1
R9      | 2        | 3
Conclusions
Systematic mapping of an algorithm to a massively parallel architecture that exploits the different levels of parallelism and utilizes the local registers and caches
Mapping methodology
→ Multi-Level Copartitioning
Goals of the Algorithm Mapping
• Exploitation of the parallelism of all levels
→ Minimal execution time
• Increasing the data re-use in the processor array and in the caches
→ Minimal power consumption
Edge Detection Algorithm
Speedup Factor:
up to 31.65 (theor. 32)