TRANSCRIPT
Algorithm Mapping on Massively Parallel and Reconfigurable Architectures
Markus Rullmann, Rainer Schaffer, Renate Merker
Department of Electrical Engineering and Information Technology, Institute of Circuits and Systems
Karlsruhe, April 27, 2009
April 27, 2009 Algorithm Mapping on Massively Parallel and Reconfigurable Architectures
Content
1. Introduction
2. Target Architecture → Massively Parallel Architecture
3. Algorithm Class
4. Multi-Level Copartitioning
5. Algorithm Mapping with Multi-Level Copartitioning
1. Level: Subword Parallelism
2. Level: Processor Array
3. Level: Adaptation to the Cache Sizes
6. Test Results
7. Conclusions
Introduction
Demand:
• Large computational resources
• Low power consumption
Trend:
• Increasing density on a chip
Solution:
• Parallel processing → massively parallel architectures
Massively Parallel Architecture (MPA)
• Processor array with processor elements (PEs)
• Processor elements with several functional units (FUs)
• Functional units permit subword-parallel processing
[Figure: Massively parallel architecture (MPA) — main memory, an L2 cache, and per-PE L1 caches feed the processor array; each processor element (PE) contains functional units (FU 1, FU 2) whose registers hold full-length words (FLW) divided into subwords.]
Massively Parallel Architecture (MPA)
Parallel processing of 2, 4, or 8 subwords by splitting the full-length word (FLW) of a functional unit
• Advantages:
– Higher data throughput
• Disadvantages:
– Data have to be packed into the FLWs for parallel processing → additional packing instructions
[Figure: A functional unit with subword parallelism, operating on full-length words (FLW) with 2 subwords, behaves like a linear processor array.]
Subword Parallelism (SWP)
Algorithm Class
[Figure: Iteration space over indices i1, i2 with a data dependency d_s's^k; reduced dependence graph with nodes s, s', s''.]
System of Uniform Recurrence Equations
s :  y_s[i]  = F_s( …, y_s'[i − d_s's^k], … ),   i ∈ I_s
s' : y_s'[i] = F_s'( … ),                        i ∈ I_s'
⋮
Algorithm Class
A for-loop program can be transformed into a system of uniform recurrence equations.
FIR-Filter:
y : y[i] = y[i − d_y] + a[i] ⋅ x[i],   i ∈ I
x : x[i] = x[i − d_x],                 i ∈ I
a : a[i] = a[i − d_a],                 i ∈ I
System of Uniform Recurrence Equations
for (k = 0; k < K; k++) {
    y[k] = 0;
    for (j = 0; j < J; j++) {
        y[k] += a[j] * x[k-j];
    }
}
For-Loop Program
I = { i = (k, j)^T : 0 ≤ k < K, 0 ≤ j < J }
Multi-Level Copartitioning
Subdivision of the iteration space in partitions
Partitioning
Multi-Level Copartitioning
Locally parallel, globally sequential (LPGS) partitioning
• Each node of a partition is assigned to a PE; these PEs run in parallel.
• The partitions themselves are processed sequentially.
Processor array
LPGS Partitioning
Multi-Level Copartitioning
Locally sequential, globally parallel (LSGP) partitioning
• Each partition is assigned to a PE; these PEs run in parallel.
• The elements within a partition are processed sequentially.
Processor array
LSGP Partitioning
Multi-Level Copartitioning
Combination of
– LSGP partitioning and
– LPGS partitioning
An execution order is defined for the sequential parts.
LSGP partition / LPGS partition
Copartitioning
Processor array
Multi-Level Copartitioning
A copartitioned iteration space can itself be copartitioned again, several times over.
Processor array
Multi-Level Copartitioning
Algorithm Mapping
[Diagram: The algorithm (e.g. a loop program) is mapped onto the MPA level by level:
1. Level — Subword parallelism: allocation to the FUs, determined by the number of subwords.
2. Level — Allocation to the PEs, determined by the size of the processor array and the number of registers.
3…n. Level — Utilization of the caches, determined by the cache sizes.
The result is a schedule for the FUs.]
Subword Parallelism
Mapping the algorithm onto a functional unit with subword parallelism (SWP)
• FU with SWP → linear processor array
• Number of subwords → number of PEs
Aims of the Mapping
• Utilization of the subword parallelism
• Fewer packing instructions
First Level Copartitioning
Allocation to the Processor Elements
Mapping onto the processor elements of the processor array
Aims of the Mapping
• Utilization of the processor array parallelism
• Exploitation of the registers in the PEs
Second Level Copartitioning
Utilization of Caches
The sequential execution order of the copartitions is changed by additional copartitionings.
Aim of the Mapping
• Increase the local data re-use
[Figure: Third-level copartitioning, adapted to the main memory]
[Figure: A further copartitioning adapts the execution order to the L1 cache.]
Utilization of Caches
Test Results for Edge Detection Algorithm
Massively Parallel Architecture
• 2x4 processor array (PA)
• Subword parallelism (SWP) with 4 subwords in a FLW
Speed-up
• SWP: 3.00 or 3.99 (theor. 4)
• PA: 7.93 (theor. 8)
• SWP & PA: 23.74 or 31.65 (theor. 32)
[Chart: Speed-up (scale 0–35) for the variants R3, R4, R9, comparing 1x1 w. SWP vs. 1x1 w/o SWP, 4x2 w. SWP vs. 1x1 w. SWP, and 4x2 w. SWP vs. 1x1 w/o SWP.]

PE resources per variant:
Variant | Gen. ALU | Pack.
R3      | 4        | —
R4      | 1        | 1
R9      | 2        | 3
Conclusions
Systematic mapping of an algorithm to a massively parallel architecture that exploits the different levels of parallelism and utilizes the local registers and caches
Mapping methodology
→ Multi-Level Copartitioning
Goals of the Algorithm Mapping
• Exploitation of the parallelism of all levels
→ Minimal execution time
• Increasing the data re-use in the processor array and in the caches
→ Minimal power consumption
Edge Detection Algorithm
Speedup Factor:
up to 31.65 (theor. 32)