Loop Dissevering: A Technique for Temporally Partitioning Loops in Dynamically Reconfigurable Computing Platforms

10th Reconfigurable Architectures Workshop (RAW 2003), Nice, France, April 22, 2003
17th Annual Int’l Parallel & Distributed Processing Symposium (IPDPS 2003)

João M. P. Cardoso
University of Algarve, Faro, INESC-ID, Lisboa
Portugal



João M. P. Cardoso, RAW 2003

Motivation

for(int i=0; i<8; i++)
  for(int j=0; j<8; j++)
    CosTrans[j+8*i] = CosBlock[i+8*j];

for(int i=0; i<8; i++)
  for(int j=0; j<8; j++) {
    TempBlock[i+j*8] = 0;
    for(int k=0; k<8; k++)
      TempBlock[i+j*8] += InIm[i+k*8] * CosTrans[k+j*8];
  }

How to map sets of computational structures requiring more resources than available?


How to map sets of computational structures requiring more resources than available? Temporal Partitioning

Other motivations for partitioning computations in time:
  - each design is simpler
  - may lead to better performance: amortize some configuration time by overlapping execution stages
  - use of smaller reconfigurable arrays to implement complex applications

For more info: see Cardoso and Weinhardt, DATE 2003


Computational structures for each loop or set of nested loops are implemented in a single partition

But what to do with a loop requiring more resources than available?

Outline

Motivation

Configure-Execute Paradigm (execution stages)

Target Architecture

PACT XPP Architecture

XPP Configuration Flow

XPP-VC Compilation Flow

Temporal Partitioning of Loops

Experimental Results

Conclusions & Future Work

Configure-Execute Paradigm (Execution Stages)

Execution stages: Fetch (f), Configure (c), Compute (comp)

[Figure: timelines of the stages f1, c1, comp1, f2, c2, comp2 along a time axis for four cases:
  - the program in a single configuration
  - two configurations without on-chip context planes and without partial reconfiguration
  - with partial reconfiguration (configuration 2 split into c2_1 and c2_2)
  - with on-chip context planes]

PACT XPP Architecture (briefly)

X × Y coarse-grained array:
  - Processing elements (PEs): compute typical ALU operations
  - Two columns of SRAMs (Ms)
  - I/O ports for data streaming

Ready/ack. protocol for each programmable interconnection:
  - Flow of data (pre-foundry parameterized bit-widths)
  - Flow of events (1-bit lines)

Dynamically reconfigurable:
  - On-chip configuration cache and configuration manager
  - Partial reconfiguration (only those used resources are configured)

[Figure: array of PEs and memories (M) with the Configuration Manager (CM), the Configuration Cache (CC), the fetch and configure paths, and the ports CMPort0 and CMPort1]

XPP Configuration Flow

Uses 3 stages to execute each configuration: Fetch (f), Configure (c), Compute (comp)

Array may request the next configuration

Configuration manager accepts requests and proceeds without intervention from external host

  c0;
  If(CMPort0) then c1;
  If(CMPort1) then c2;

[Figure: configurations c0, c1, and c2 in the Configuration Cache (CC); the Configuration Manager (CM) fetches and configures them; a "<N" comparison in the array is shown next to CMPort0 and CMPort1]

XPP-VC Compilation Flow

[Figure: compilation flow from the C program through Preprocessing + Dependence Analysis, TempPart (Temporal Partitioning), MODGen ("Module Generation", with pipelining), and NML Control Code Generation (Reconfiguration) to the NML file, which xmap translates to XPP binary code]

TempPart: partitions and generates reconfiguration statements which are executed by the Configuration Manager

MODGen: maps C subset to NML (PACT proprietary structural language with reconfiguration primitives)

For more info: see Cardoso and Weinhardt, FPL 2002

Temporal Partitioning

One partition for each node in the Hierarchical Task Graph (HTG) TOP level

Merge adjacent nodes if combination of both can be mapped to XPP device and if merge does not degrade overall performance

If HTG node too large, create separate partition for each node of the inner-HTG and call algorithm recursively

[Figure: example HTG from start to end with the nodes Loop 1, Loop 2, Loop 3, and Loop 4 and the data items x, coef, tmp, and y]

Temporal Partitioning of Loops

What to do when loops in the program cannot be mapped due to the lack of enough resources?

Software/reconfigware approach
  - control of the loop in software
  - migrates to reconfigware inner-code sections, each one mapped to a single configuration

Loop Distribution
  - transforms a loop into two or more loops
  - each loop with the same iteration-space traversal of the original loop
  - inner statements of the original loop are split among the loops

Loop Dissevering
  - transforms a loop into a set of configurations
  - cyclic behavior implemented by the configuration flow

Temporal Partitioning of Loops

Example for Loop Distribution and Loop Dissevering:

…
for(nx=0; nx<X_DIM_BLK; nx++)
  for(ny=0; ny<Y_DIM_BLK; ny++) {
    // Inner Loop 1
    for(i=0; i<N; i++)
      for(j=0; j<N; j++) {
        tmp = 0;
        for(k=0; k<N; k++)
          tmp += X[i+ny*N][k+nx*N] * CosBlock[j][k];
        TempBlock[i][j] = tmp;
      }
    // to be partitioned here
    // Inner Loop 2
    for(i=0; i<N; i++)
      for(j=0; j<N; j++) {
        tmp = 0;
        for(k=0; k<N; k++)
          tmp += TempBlock[k][j] * CosBlock[i][k];
        Y[i+ny*N][j+nx*N] = tmp;
      }
  }
…

Loop Distribution

…
for(nx=0; nx<X_DIM_BLK; nx++)
  for(ny=0; ny<Y_DIM_BLK; ny++)
    for(i=0; i<N; i++)
      for(j=0; j<N; j++) {
        // Inner Loop 1
        TempBlock[i+ny*N][j+nx*N] = tmp;
      }

for(nx=0; nx<X_DIM_BLK; nx++)
  for(ny=0; ny<Y_DIM_BLK; ny++)
    for(i=0; i<N; i++)
      for(j=0; j<N; j++) {
        tmp = 0;
        for(k=0; k<N; k++)
          tmp += TempBlock[k+ny*N][j+nx*N] * CosBlock[i][k];
        Y[i+ny*N][j+nx*N] = tmp;
      }
…

[Figure: configuration flow begin, Conf. 1, Conf. 2, end; a callout shows the original statement tmp += TempBlock[k][j] * CosBlock[i][k]; for comparison with the expanded TempBlock accesses]

Loop Distribution

Cannot be applied to all loops
  - it must not break cycles in the dependence graph of the original loop

Use of auxiliary array variables
  - for each loop-independent flow dependence of a scalar variable (known as scalar expansion)
  - for each control dependence in the place where we want to partition the loop

Expansion of some arrays

But it preserves the software pipelining potential, and may improve parallelization, cache hit/miss ratio, etc.

Loop Dissevering

…
        nx=0; write nx;
L1:     read nx;
        If(nx>=X_DIM_BLK) goto Finish;
        ny=0; write ny;
L3:     read ny; read nx;
        If(ny>=Y_DIM_BLK) goto L4;
        for(i=0; i<N; i++)
          for(j=0; j<N; j++) {
            // Inner Loop 1
            TempBlock[i][j] = tmp;
          }
        read ny; read nx;
        for(i=0; i<N; i++)
          for(j=0; j<N; j++) {
            // Inner Loop 2
            Y[i+ny*N][j+nx*N] = tmp;
          }
        ny++; write ny;
        goto L3;
L4:     nx++; write nx;
        goto L1
Finish: …

[Figure: configuration flow from begin to end through Conf. 1, Conf. 2, Conf. 3, Conf. 4, and Conf. 5]

Loop Dissevering

Applicable to every loop

Only relies on a configuration manager to execute complex loops

May relieve the host microprocessor to execute other tasks

No array or scalar expansion (only scalar communication)

But:
  - Besides its use to furnish feasible mappings, is it worth applying?
  - Does it lead to efficient solutions (in terms of performance)?
  - What are the improvements if the architecture can switch between configurations in a few clock cycles?

Experimental Results

Compared architectures (both with runtime support for partial reconfiguration):
  - ARCH-A: word-grained partial reconfiguration
  - ARCH-B: context planes with switching between contexts in a few clock cycles

[Figure: timelines of the fetch (f), configure (c), and compute (comp) stages of two configurations, one with configuration 2 split into c2_1 and c2_2 and one with f2 and c2 overlapped with computation]

Experimental Results

Benchmarks

Benchmark  Description                                 #LoC  #loops  #loops after loop dist.
DCT        8×8 Discrete Cosine Transform on an image   80    8       10
BPIC       Binary pattern image coding                 151   8       10
Life       Conway’s game of life algorithm             118   10      -

Experimental Results (resource savings)

Using loop dissevering: when compared to implementations without loop dissevering, only 44% (DCT), 66% (BPIC), and 85% (Life) of resources are used

Benchmark  w/o loop dissevering    w/ loop dissevering    Ratio (#PEs)
           #configs  #PEs          #configs  #PEs
DCT        1         123           5         54/132       0.44
BPIC       1         148           5         97/189       0.66
Life       4         144/304       6         123/416      0.85

Experimental Results (speedups)

Architecture A (ARCH-A): word-grained partial reconfiguration

Architecture B (ARCH-B): context planes

[Figure: bar chart of DCT speedups (vertical axis from 0.0 to 3.5) for 6x6 w/ ldis, 6x6 w/ ldis+unr, 6x6 w/ ld, and 6x6 w/ ld+unr]

Experimental Results (speedups)

Life, applying Loop Dissevering: benefits of ARCH-B are negligible when partitions “in the loop” compute for long times

[Figure: Life speedups (vertical axis from 1.2 to 2.4) for ARCH. A and ARCH. B over 8x8_2, 8x8_4, 8x8_8, 8x8_16, 16x16_2, 16x16_4, 16x16_8, 16x16_16, 32x32_2, 32x32_4, and 32x32_8]

Conclusions

Temporal Partitioning + Loop Dissevering
  - guarantees the mapping of theoretically unlimited computational structures

Loop Dissevering and Loop Distribution may lead to
  - performance enhancements
  - saving of resources

Loop Dissevering
  - applicable to every loop
  - performance-efficient implementations may require fast reconfiguration
  - the resultant performance may decrease
      - when innermost loops are partitioned (no more potential for loop pipelining)
      - when each active partition computes for short times (does not amortize the reconfiguration time)

Future Work

More study on the impact of Loop Dissevering and Loop Distribution
  - to understand the impact of the number of context planes, configuration cache size, etc.
  - to evaluate loop partitioning when mapping to FPGAs

Automatic implementation of Loop Distribution

Methods to decide between Loop Dissevering and Loop Distribution

Acknowledgments (in the paper)

Part of this work was done while the author was with PACT XPP Technologies, Inc., Munich, Germany.

We gratefully acknowledge the support of all the members of PACT XPP Technologies, Inc., especially the help of Daniel Bretz, Armin Strobl, and Frank May regarding the XDS tools. Special thanks to Markus Weinhardt for the fruitful discussions about loop dissevering and the XPP-VC compiler.