
Algorithm Mapping on Massively Parallel and Reconfigurable Architectures

Markus Rullmann, Rainer Schaffer, Renate Merker

Department of Electrical Engineering and Information Technology, Institute of Circuits and Systems

Karlsruhe, April 27, 2009


Content

1. Introduction

2. Target Architecture → Massively Parallel Architecture

3. Algorithm Class

4. Multi-Level Copartitioning

5. Algorithm Mapping with Multi-Level Copartitioning

– Level 1: Subword Parallelism

– Level 2: Processor Array

– Level 3: Adaptation to the Cache Sizes

6. Test Results

7. Conclusions


Introduction

Demand:
• Large computational resources
• Low power consumption

Trend:
• Increasing density on a chip

Solution:
• Parallel processing → massively parallel architectures


Massively Parallel Architecture (MPA)

• Processor array with processor elements (PEs)
• Processor elements with several functional units (FUs)
• Functional units permit subword parallel processing

[Figure: massively parallel architecture (MPA) with main memory, L2 and L1 caches, and a processor element (PE) containing functional units (FU 1, FU 2), registers, and full length words (FLW) split into subwords]


Massively Parallel Architecture (MPA)

Parallel processing of 2, 4, or 8 subwords by splitting the full length word (FLW) of a functional unit

• Advantages:

– Higher data throughput

• Disadvantages:

– Data have to be packed in the FLWs for parallel processing.

→ Additional packing instructions (a plain-C sketch follows below)

[Figure: functional unit with subword parallelism, operating on full length words (FLW) with 2 subwords; the FU with SWP acts like a linear processor array]

Subword Parallelism (SWP)
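The slides do not show the MPA's actual SWP instruction set, so the following is a minimal plain-C sketch that emulates subword-parallel addition of two 16-bit subwords packed into a 32-bit FLW, including the extra packing step mentioned above. The names pack2 and swar_add16 and the values are illustrative, not from the slides.

#include <stdint.h>
#include <stdio.h>

/* Pack two 16-bit subwords into one 32-bit full length word (FLW):
 * this is the additional packing work the slide refers to. */
static uint32_t pack2(uint16_t hi, uint16_t lo) {
    return ((uint32_t)hi << 16) | lo;
}

/* Add two FLWs subword-wise: the masks keep carries from crossing the
 * subword boundary (standard SWAR addition). */
static uint32_t swar_add16(uint32_t a, uint32_t b) {
    uint32_t sum = (a & 0x7FFF7FFFu) + (b & 0x7FFF7FFFu);
    return sum ^ ((a ^ b) & 0x80008000u);
}

int main(void) {
    uint32_t a = pack2(1000, 2000);           /* packing instructions */
    uint32_t b = pack2(30, 40);
    uint32_t y = swar_add16(a, b);            /* both subwords processed at once */
    printf("%u %u\n", (unsigned)(y >> 16), (unsigned)(y & 0xFFFFu));  /* 1030 2040 */
    return 0;
}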


Algorithm Class

[Figure: iteration space spanned by i1 and i2 with a data dependency d_{s's}^k, and the corresponding reduced dependence graph with nodes s, s', s'']

System of Uniform Recurrence Equations

$s:\ y_s[i] = F_s(\ldots,\, y_{s'}[i - d^{k}_{s's}],\, \ldots), \quad i \in I_s$

$s':\ y_{s'}[i] = F_{s'}(\ldots), \quad i \in I_{s'}$


Algorithm Class

A for-loop program can be transformed into a system of uniform recurrence equations (a C sketch of the resulting recurrences follows below).

FIR-Filter:

y: $y[i] = y[i - d_y] + a[i] \cdot x[i], \quad i \in I$

x: $x[i] = x[i - d_x], \quad i \in I$

a: $a[i] = a[i - d_a], \quad i \in I$

System of Uniform Recurrence Equations

for( k = 0; k < K; k++ ){
    y[k] = 0;
    for( j = 0; j < J; j++ ){
        y[k] += a[j]*x[k-j];
    }
}

For-Loop Program

$I = \{\, i = (k, j)^T \mid 0 \le k < K,\ 0 \le j < J \,\}$
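As a concrete illustration, here is a plain-C sketch of the FIR filter written directly as the system of uniform recurrence equations over the 2-D iteration space i = (k, j). The dependence vectors d_a = (1,0), d_x = (1,1), d_y = (0,1) are an assumption chosen so that the recurrences reproduce y[k] = sum_j a[j]*x[k-j]; the slides name d_a, d_x, d_y but do not give their values.

#include <stdio.h>

#define K 8   /* number of output samples (example value) */
#define J 4   /* number of filter taps (example value)    */

int main(void) {
    static double a_in[J] = {1, 2, 3, 4};   /* coefficients a[j]                */
    static double x_in[K + J - 1];          /* x_in[t + J - 1] = x(t); the      */
                                            /* leading J-1 entries stay 0       */
    static double y[K][J], a[K][J], x[K][J];/* URE variables over i = (k, j)    */

    for (int t = 0; t < K; t++) x_in[t + J - 1] = t + 1;   /* example input */

    for (int k = 0; k < K; k++) {
        for (int j = 0; j < J; j++) {
            /* a[i] = a[i - d_a]: coefficients propagated along k */
            a[k][j] = (k == 0) ? a_in[j] : a[k - 1][j];
            /* x[i] = x[i - d_x]: samples propagated along the diagonal */
            x[k][j] = (k == 0 || j == 0) ? x_in[(k - j) + (J - 1)]
                                         : x[k - 1][j - 1];
            /* y[i] = y[i - d_y] + a[i]*x[i]: accumulation along j */
            y[k][j] = ((j == 0) ? 0.0 : y[k][j - 1]) + a[k][j] * x[k][j];
        }
        printf("y[%d] = %g\n", k, y[k][J - 1]);  /* FIR output for sample k */
    }
    return 0;
}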


Multi-Level Copartitioning

Subdivision of the iteration space into partitions

Partitioning


Multi-Level Copartitioning

Locally parallel, globally sequential (LPGS) partitioning

• Each node of a partition is mapped to a PE; the nodes of a partition are processed in parallel (see the loop sketch below).

• The partitions are processed sequentially.

[Figure: LPGS partitioning of the iteration space and its mapping onto the processor array]
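A minimal loop sketch of LPGS partitioning on a 1-D iteration space, assuming N iterations and P processor elements; the parallelism across PEs is only shown as a loop over PE indices, and compute() is a placeholder, not from the slides.

#include <stdio.h>

#define N 64   /* iteration space size (example value)              */
#define P 8    /* partition size = number of PEs (example value)    */

/* placeholder per-iteration task */
static void compute(int i) { printf("iteration %d\n", i); }

void lpgs(void) {
    /* global loop: the partitions are processed one after another */
    for (int t = 0; t < N / P; t++) {
        /* local loop: the P iterations of one partition are assigned to
         * the P processor elements and would execute in parallel */
        for (int pe = 0; pe < P; pe++) {
            compute(t * P + pe);
        }
    }
}

int main(void) { lpgs(); return 0; }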


Multi-Level Copartitioning

Locally sequential, globally parallel (LSGP) partitioning

• Each partition is mapped to a PE; the partitions are processed in parallel (see the loop sketch below).

• The elements within a partition are processed sequentially.

[Figure: LSGP partitioning of the iteration space and its mapping onto the processor array]
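The mirror sketch for LSGP partitioning, under the same assumptions as above: each of the P partitions is assigned to one PE, the PEs work conceptually in parallel, and each PE steps through its partition sequentially.

#include <stdio.h>

#define N 64                     /* iteration space size (example)       */
#define P 8                      /* number of partitions = number of PEs */
#define B (N / P)                /* iterations per partition             */

static void compute(int i) { printf("iteration %d\n", i); }

void lsgp(void) {
    /* global loop: one partition per PE, conceptually in parallel */
    for (int pe = 0; pe < P; pe++) {
        /* local loop: the PE processes its B iterations sequentially */
        for (int t = 0; t < B; t++) {
            compute(pe * B + t);
        }
    }
}

int main(void) { lsgp(); return 0; }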


Multi-Level Copartitioning

Combination of

– LSGP partitioning and

– LPGS partitioning

Definition of an execution order for the sequential parts (see the combined loop sketch below)

[Figure: copartitioning combining LSGP and LPGS partitions, mapped onto the processor array]
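A sketch of one possible reading of the combination on a 1-D iteration space: LSGP tiles of size B are grouped into LPGS partitions of P tiles each. The LPGS partitions are processed sequentially; inside a partition, each of the P tiles is assigned to one PE (parallel), and each PE works through its tile sequentially. The sizes and the index layout are illustrative assumptions.

#include <stdio.h>

#define N 64
#define P 4                       /* PEs = LSGP tiles per LPGS partition */
#define B 4                       /* iterations per LSGP tile            */
#define C (N / (P * B))           /* number of LPGS partitions           */

static void compute(int i) { printf("iteration %d\n", i); }

void copartition(void) {
    for (int c = 0; c < C; c++) {            /* sequential over LPGS partitions */
        for (int pe = 0; pe < P; pe++) {     /* parallel: one LSGP tile per PE  */
            for (int t = 0; t < B; t++) {    /* sequential inside the tile      */
                compute((c * P + pe) * B + t);
            }
        }
    }
}

int main(void) { copartition(); return 0; }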


Multi-Level Copartitioning

An already copartitioned iteration space is copartitioned again, several times (multi-level copartitioning).

[Figure: multi-level copartitioning mapped onto the processor array]


Multi-Level Copartitioning

Algorithm Mapping

[Figure: mapping flow from the algorithm class (e.g. loop program) onto the MPA]

• Level 1: Subword parallelism, allocation to the FUs (determined by the number of subwords)

• Level 2: Allocation to the PEs (determined by the size of the processor array and the number of registers)

• Levels 3...n: Utilization of the caches (determined by the cache sizes)

• Result: schedule and allocation to the FUs



Subword Parallelism

Mapping the algorithm onto a functional unit with subword parallelism (SWP); a packed-loop sketch follows below.

• FU with SWP → linear processor array

• Number of subwords → number of PEs

Aims of the Mapping

• Utilization of the subword parallelism

• Fewer packing instructions

First Level Copartitioning
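A sketch of this first copartitioning level on an element-wise vector addition, assuming 2 subwords per FLW: pairs of consecutive iterations are packed into one FLW, so the two subwords act like a linear array of two PEs. The SWAR add is the same emulation scheme as in the earlier sketch; all names and sizes are illustrative.

#include <stdint.h>
#include <stdio.h>

#define N 8                      /* iteration space size (example) */

/* subword-parallel add of two 16-bit subwords in a 32-bit FLW */
static uint32_t swar_add16(uint32_t a, uint32_t b) {
    uint32_t sum = (a & 0x7FFF7FFFu) + (b & 0x7FFF7FFFu);
    return sum ^ ((a ^ b) & 0x80008000u);
}

int main(void) {
    uint16_t a[N], b[N], y[N];
    for (int i = 0; i < N; i++) { a[i] = (uint16_t)i; b[i] = (uint16_t)(10 * i); }

    /* first-level copartitioning: partitions of 2 iterations, one iteration
     * per subword "PE"; packing/unpacking is the overhead noted above */
    for (int p = 0; p < N; p += 2) {
        uint32_t pa = ((uint32_t)a[p + 1] << 16) | a[p];   /* pack   */
        uint32_t pb = ((uint32_t)b[p + 1] << 16) | b[p];
        uint32_t py = swar_add16(pa, pb);                  /* 2 iterations at once */
        y[p]     = (uint16_t)(py & 0xFFFFu);               /* unpack */
        y[p + 1] = (uint16_t)(py >> 16);
    }

    for (int i = 0; i < N; i++) printf("y[%d] = %u\n", i, y[i]);
    return 0;
}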



Allocation to the Processor Elements

Mapping onto the processor elements of the processor array

Aims of the Mapping

• Utilization of the processor array parallelism

• Exploitation of the registers in the PEs (see the sketch below)

Second Level Copartitioning
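A sketch of this second level: blocks of the (already subword-partitioned) iteration space are distributed over a 2x4 processor array, as in the test architecture, and each PE keeps its intermediate result in a register-resident variable while processing its block sequentially. The block size and the reduction body are illustrative assumptions.

#include <stdio.h>

#define ROWS 2                   /* processor array rows (from the test setup) */
#define COLS 4                   /* processor array columns                    */
#define BLOCK 16                 /* iterations per PE block (example)          */

int main(void) {
    double data[ROWS * COLS * BLOCK];
    for (int i = 0; i < ROWS * COLS * BLOCK; i++) data[i] = 1.0;

    /* global: one block per PE of the 2x4 array (conceptually parallel) */
    for (int pr = 0; pr < ROWS; pr++) {
        for (int pc = 0; pc < COLS; pc++) {
            double acc = 0.0;                 /* register-resident accumulator */
            int base = (pr * COLS + pc) * BLOCK;
            /* local: the PE processes its block sequentially */
            for (int t = 0; t < BLOCK; t++) {
                acc += data[base + t];
            }
            printf("PE(%d,%d): %g\n", pr, pc, acc);
        }
    }
    return 0;
}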



Utilization of Caches

The sequential execution order of the copartitions is changed by additional copartitionings (see the blocking sketch below).

Aim of the Mapping
• Increase the local data re-use, so that data is served from the L1 cache instead of the main memory

[Figure: third-level copartitioning; the data re-use moves from the main memory to the L1 cache]

→ Third Level Copartitioning
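A sketch of such an additional copartitioning level: the second-level copartitions form a 2-D grid, and one extra blocking step changes the order in which they are processed sequentially, so that copartitions sharing data run back-to-back while that data is still resident in the cache. The grid size, the block size B, and process() are illustrative assumptions.

#include <stdio.h>

#define C1 8                /* copartitions in direction 1 (example)            */
#define C2 8                /* copartitions in direction 2 (example)            */
#define B  4                /* copartitions per cache block: chosen so that the
                               working set of B copartitions fits in L1         */

static void process(int c1, int c2) { printf("copartition (%d,%d)\n", c1, c2); }

int main(void) {
    /* without the third level: plain row-major order,
     *   for (c1 ...) for (c2 ...) process(c1, c2);
     * with the third level: c2 is blocked, so the data of one c2 block is
     * re-used for all c1 before moving on to the next block */
    for (int cb = 0; cb < C2; cb += B) {          /* additional (third) level */
        for (int c1 = 0; c1 < C1; c1++) {
            for (int c2 = cb; c2 < cb + B; c2++) {
                process(c1, c2);
            }
        }
    }
    return 0;
}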


Test Results for Edge Detection Algorithm

Massively Parallel Architecture

• 2x4 processor array (PA)

• Subword parallelism (SWP) with 4 subwords in a FLW

Speed-up

• SWP: 3.00 or 3.99 (theor. 4)

• PA: 7.93 (theor. 8)

• SWP & PA: 23.74 or 31.65 (theor. 32)

[Figure: speedup (0 to 35) for implementations R3, R4, and R9, comparing 1x1 w. SWP vs. 1x1 w/o SWP, 4x2 w. SWP vs. 1x1 w. SWP, and 4x2 w. SWP vs. 1x1 w/o SWP; table of PE resources (Gen., ALU, Pack.) for R3, R4, and R9]


Conclusions

Systematic mapping of an algorithm onto a massively parallel architecture, exploiting the different levels of parallelism and utilizing the local registers and caches

Mapping methodology

→ Multi-Level Copartitioning

Goals of the Algorithm Mapping

• Exploitation of the parallelism at all levels

→ Minimal execution time

• Increasing the data re-use in the processor array and in the caches

→ Minimal power consumption

Edge Detection Algorithm

Speedup Factor:

up to 31.65 (theor. 32)