



IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1

Exploiting Application Data-Parallelism on Dynamically Reconfigurable Architectures: Placement and Architectural Considerations

Sudarshan Banerjee, Member, IEEE, Elaheh Bozorgzadeh, Member, IEEE, and Nikil Dutt, Fellow, IEEE

Abstract—Partial dynamic reconfiguration, often called run-time reconfiguration (RTR), is a key feature in modern reconfigurable platforms. In this paper, we present parallelism granularity selection (PARLGRAN), an application mapping approach that maximizes performance of application task chains on architectures with such capability. PARLGRAN essentially selects a suitable granularity of data-parallelism for individual data-parallel tasks while considering key issues such as significant reconfiguration overhead and placement constraints. It integrates granularity selection very effectively in a joint scheduling and placement formulation, necessary due to constraints imposed by partial RTR. As a key step to validating PARLGRAN, we additionally present an exact strategy (integer linear programming formulation). We demonstrate that PARLGRAN generates high-quality schedules with: 1) a set of small test cases where we compare our results with the exact strategy; 2) a very large set of synthetic experiments with over a thousand data-points where we compare it with a simpler strategy that tries to statically maximize data-parallelism, i.e., only considers resource availability; and 3) a detailed application case study of JPEG encoding. The application case study confirms that blindly maximizing data-parallelism can result in schedules even worse than those generated by a simple (but RTR-aware) approach oblivious to data-parallelism. Last, but very important, we demonstrate that our approach is well-suited for true on-demand computing with detailed execution time estimates on a typical embedded processor. Heuristic execution time is comparable to task execution time, i.e., it is feasible to integrate PARLGRAN in a run-time scheduler for dynamically reconfigurable architectures.


I. INTRODUCTION

RECONFIGURABLE architectures are popular for applications with intensive computation such as image processing, cryptography, etc., since a limited amount of logic can be customized to set up deep pipelines, and/or exploit more coarse-grain parallelism. Partial dynamic reconfiguration, or run-time reconfiguration (RTR), allows additional customization during application execution, enabling true on-demand computing. A device with partial RTR capability is shared by multiple dynamically invoked applications. Each dynamically invoked application is assigned a set of logic resources depending upon system capacity and resource requirements of other applications concurrently active on the same device, and partial RTR makes it feasible for the application to obtain higher performance from a limited set of resources [1]. In this context, our overall goal is to maximize performance of individual applications represented as precedence-constrained task directed acyclic graphs (DAGs) on single-context architectures with partial RTR (Xilinx Virtex-II is a commercial instance of such architectures). Some key issues in mapping applications onto such devices are the significant reconfiguration delay overhead, physical (placement) constraints, etc.

Manuscript received June 21, 2007; revised November 30, 2007. S. Banerjee is currently with Liga Systems, Sunnyvale, CA USA (e-mail: [email protected]). E. Bozorgzadeh and N. Dutt are with the Center for Embedded Computer Systems, University of California, Irvine, CA 92697 USA. Digital Object Identifier 10.1109/TVLSI.2008.2003490

In this paper, we focus on precedence-constrained task chains, common in image-processing applications such as JPEG encoding, Sobel filters, Laplace filters, etc. [2], [3]. In such applications, area-execution time characteristics of key tasks such as IDCT, Quantize, etc., are predictable because of complete pipelining. Additionally, many computation-intensive tasks such as DCT are completely data-parallel, i.e., results of task execution on a block of data are invariant even if the task processed some other disjoint block of data before the current data block.1 On an architecture with partial RTR, it is possible to improve application execution time by dynamically adjusting the parallelism granularity of such data-parallel tasks, i.e., reconfiguring the architecture to instantiate multiple copies of such tasks during application execution—each copy (instance) uses an identical amount of hardware logic resources, but processes only part of the data. Due to complete pipelining, execution time of such tasks is directly proportional to the volume of data processed, and thus, reducing the data volume proportionately improves (reduces) the application execution time. Note that on architectures with no partial RTR, the scope of exploiting such data-parallelism is much more limited—partial RTR enables resource reuse, significantly expanding the potential of exploiting data-parallelism.

As an example, we consider a simple chain with two tasks, as shown in Fig. 1. Assuming that there are enough resources to simultaneously execute three copies of the first task or two copies of the second, Fig. 1(b) and (c) show some possible task graph configurations after such a transformation. However, such

1 Huffman encoding is a well-known example of a task that does not have the data-parallelism property.

1063-8210/$25.00 © 2008 IEEE



Fig. 1. Granularity of individual data-parallel tasks.

Fig. 2. On-demand computing environment.

a transformation can be quite costly on architectures with partial RTR—each new task instance (copy) adds a significant reconfiguration overhead. Therefore, the transformations need to be guided by selecting the right granularity of parallelism that masks the reconfiguration overhead and maximizes performance. One important issue is that because of the reconfiguration overhead, multiple instances of a task are typically unable to start execution at the same time—thus, individual execution time (workload) of the multiple instances may vary.

As the capacity of modern reconfigurable architectures continues to increase rapidly, suitably sized devices can accommodate many of our target applications without needing partial RTR. In this context, it is important to clarify that our work is motivated by the following applications:

• maximizing performance from available smaller (and cheaper) devices—this aspect includes huge multimedia applications such as MPEG-4;

• more importantly, maximizing performance in a true on-demand computing environment such as that shown in Fig. 2. An embedded system includes a dynamically reconfigurable component shared by multiple on-demand applications. When an on-demand application is ready to execute, it is granted resources (logic area, etc.) subject to resource usage of other currently active applications, and partial RTR enables us to maximally exploit the available resources. Note that such an embedded system may be implemented completely on a large modern device such as the Xilinx Virtex-5 XC5VLX330. Part of the logic in such a device is statically configured for invariant functionality (such as timers, memory controllers, etc.) and the other part is dedicated to accelerating applications on demand using partial RTR.

In such an on-demand computing environment, we additionally require a semi-online application mapping (scheduling) approach to maximize application performance with partial RTR. We define a semi-online application execution scenario as follows.

1) The application structure is known statically, i.e., the predecessor and successor of each individual task do not change during application execution. This property is satisfied by typical image-processing applications such as Sobel filtering [4], JPEG decoding, etc. Additionally, key scheduling parameters such as the logic resource requirement of each individual task are also available statically.

2) When the application is invoked dynamically, it is allocated a set of logic resources depending upon its run-time environment.

3) A semi-online scheduling approach generates an application execution schedule based on the static scheduling parameters and two run-time parameters: (a) allocated logic resources and (b) input image size.

4) The scheduled application starts execution.

That is, a semi-online scheduling approach allows an application to adapt to key changes in its run-time environment—as examples, a change in available logic resources and a change in input image size directly affect the potential for performance improvement. This necessitates that the execution time of a semi-online approach is low enough for it to be included in an operating system for dynamically reconfigurable architectures [5], [6].2 Thus, the measure of viability for a semi-online approach is the cumulative execution time, defined as the sum of the schedule length generated by the approach and the execution time of the approach on an embedded processor.
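The viability measure above reduces to a one-line computation; the sketch below only illustrates the trade-off it captures, and all numeric values are invented for illustration, not measurements from the paper.

```python
def cumulative_time(schedule_length_ms: float, scheduler_runtime_ms: float) -> float:
    """Cumulative execution time: schedule length produced by the scheduling
    approach plus the time the approach itself takes to run on the processor."""
    return schedule_length_ms + scheduler_runtime_ms

# A smarter scheduler pays off only if the schedule it produces is shorter
# by more than its own extra run time (all values hypothetical):
simple = cumulative_time(120.0, 0.5)    # fast but naive scheduler
parlgran = cumulative_time(95.0, 3.0)   # slower heuristic, better schedule
assert parlgran < simple
```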

Paper Contributions:
• Granularity selection, scheduling, placement: In this paper, we propose such a semi-online scheduling approach, parallelism granularity selection (PARLGRAN), that maximizes application performance on architectures with partial RTR by choosing the right parallelism granularity for each individual data-parallel task. We define granularity as both the number of instances (copies) of that task, and, the workload (execution time) of each instance. Our approach considers physical (placement) constraints, and utilizes configuration prefetch [7] to reduce the latency. The key constraints of such architectures necessitate joint scheduling and placement [8], [9]. Our approach, therefore, incorporates granularity selection as an integral part of simultaneous scheduling and placement. To the best of our knowledge, ours is the first effort towards this difficult problem of semi-online application restructuring.
• ILP, detailed experiments, case study: As a key step in understanding the problem and to validate the quality of schedules generated by our approach, we additionally present an exact strategy (ILP). We present extensive experimental evidence to validate our proposed approach. First, we demonstrate that PARLGRAN generates results close to those of the exact (ILP) formulation with a set of small test cases. Since the exact strategy is very time consuming, we next present results on a very large set of over a thousand synthetic experiments where we compare

2 In scenarios where the image size and allocated resources are invariant at run-time, we would of course precompute the schedule statically with much less consideration for the one-time computation overhead.



against a simpler heuristic that tries to statically maximize performance gain from data-parallelism based on resource availability only—average improvement in schedule length is over 20%. We follow up our experiments on synthetic test cases with a detailed case study of JPEG encoding—the experiments demonstrate that a simple (RTR-unaware) static parallelization approach can end up generating schedules much worse than a RTR-aware approach completely oblivious to data-parallelism.

• Semi-online capability: Finally, we have obtained detailed execution time estimates of our approach on a typical embedded processor, the PPC405 operating at a clock frequency of 400 MHz. The data indicates that the execution time of our approach is comparable to task execution time. Equally importantly, for our application case study, cumulative execution time monotonically improves: for a given image size, as available area increases, (execution time of heuristic + schedule length generated by heuristic) monotonically decreases. Thus, PARLGRAN is well-qualified for semi-online scheduling, i.e., for inclusion in an operating system for dynamically reconfigurable devices.

II. RELATED WORK

While there exists a large body of work in mapping task chains typical in image processing to reconfigurable architectures, a significant amount of work such as [2] does not consider dynamic reconfiguration. More recently, there has been a spurt in work focussed on exploiting the powerful capabilities of partial dynamic reconfiguration for image-processing/multimedia applications [4], [10]–[12], etc. Our work is closely related to work such as [4], [10], etc., that focuses on task graph scheduling with RTR-related constraints.

Recent work on scheduling application task graphs with RTR-related constraints [4], [10] often does not focus on the critical role played by placement on such architectures. Our work focusses on joint scheduling and placement required on architectures with partial RTR, similar to [8] and [13]. However, prior work in joint scheduling and placement typically ignores key architectural constraints such as the resource contention due to a single reconfiguration controller, configuration prefetch to reduce the reconfiguration latency, etc. Ignoring these key issues makes the problem closer to the rectangle packing problem [14] and does not realistically exploit RTR. Other recent work such as [15] focuses on the problem of configuration reuse as an alternative strategy to reduce the reconfiguration overhead, an aspect we do not address in this work.

Additionally, work on task-graph scheduling for such architectures [9], [10] typically does not include application restructuring considerations. While [4] presents some application restructuring considerations, their work is completely oblivious to placement concerns. Also, their target device is a multicontext architecture with multiple concurrently active reconfiguration processes. Commercially available devices with partial RTR are single-context architectures where only a single reconfiguration process is active at any instant. (True multicontext architectures such as Morphosys [16] incur a significant area overhead.) One

Fig. 3. Target dynamic architecture.

work that exploits data-parallelism in a single-context architecture [17] simply considers the problem of maximizing performance for a single task, without considering dependencies between tasks in a task graph—it also does not include detailed placement considerations. To the best of our knowledge, this work is the first effort that focuses specifically on techniques for transforming applications on single-context architectures and includes very detailed consideration of all partial RTR related constraints such as placement, resource contention due to the sequential reconfiguration mechanism, etc.

There is of course a vast body of knowledge in the compiler domain on extracting parallelism from programs at different levels of granularity [18]. Such compile-time techniques [19] are typically unaware of partial RTR constraints (such as placement, reconfiguration overhead)—equally importantly, compile-time techniques also incur a high execution overhead, since they are not intended for execution in an embedded environment. Another related work [10] has proposed a strategy based on Pareto points to precompute schedules when there are a (known) limited number of parameter variations at run-time. Along with being placement-unaware, this strategy is not fully capable of exploiting a true on-demand computing environment. In our paper, we explicitly focus on a low execution-complexity approach capable of maximally exploiting variability in task execution parameters (data size) and resource allocation in a run-time scheduling environment.

III. PROBLEM OVERVIEW

A. Target Architecture

Our target dynamically reconfigurable device, as shown in Fig. 3, consists of a set of configurable logic blocks (CLB) arranged in a 2-D matrix. The basic unit of configuration for such a device is a frame spanning the height of the device. A column of resources consists of multiple frames. A task occupies a contiguous set of columns. The reconfiguration time of a task is directly proportional to the number of columns (frames) occupied by the task implementation. One key constraint is that only one task reconfiguration can be active at any time instant. An example of our target device is the Xilinx Virtex-II architecture, where constraints such as dynamic tasks occupying a contiguous set of columns are critical for realization of partial run-time reconfiguration.



Note that our target device is capable of supporting multiple concurrently executing applications, as shown in Fig. 3. When an application is invoked, it is allocated a set of resources (CLB columns) dependent on the current system state—partial RTR enables the application to maximize performance from the limited set of allocated logic resources. Also, while we do not explicitly show it in Fig. 3, part of the target device is used for static system functionality such as UARTs, operating system functionality (such as a scheduler), etc.

B. Application Specification

A task T_i executing on such a system can be represented as a three-tuple (c_i, t_i, r_i), where c_i is the number of resource columns occupied by the task, and t_i and r_i are the execution time and reconfiguration overhead, respectively. Each task needs to be reconfigured before its execution is scheduled. The physical constraints on such a device necessitate joint scheduling and placement [8], [9].

In image processing applications, we often find chains (linear sequences) of such tasks. For a chain of n tasks, T_1, ..., T_n, each task in the chain has exactly one predecessor and one successor. Of course, the first task, T_1, has no predecessor, and the last task, T_n, has no successor. A predecessor task utilizes a shared memory mechanism to communicate necessary data to its successor—this shared memory can be physically mapped to local on-chip memory and/or off-chip memory depending upon memory requirements of the application. Detailed discussion on the specifics of memory organization, including strategies for on-chip versus off-chip data mapping, is beyond the scope of this paper—we simply assume that the logical shared memory provides sufficient bandwidth for our target applications. One important aspect of our shared memory abstraction is that communication time between two tasks is independent of their physical placement. (An example of a system architecture capable of providing such an abstraction is available in [20].)
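As a concrete illustration of this task model, the three-tuple and chain structure might be represented as follows; the task names and all numeric values here are hypothetical, chosen for illustration rather than taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    columns: int       # c_i: contiguous resource columns occupied
    exec_time: float   # t_i: execution time (proportional to data volume)
    reconfig: float    # r_i: reconfiguration overhead (proportional to columns)

# A chain: each task has exactly one predecessor and one successor,
# except the first (no predecessor) and last (no successor).
chain = [
    Task("DCT",      columns=4, exec_time=12.0, reconfig=3.0),
    Task("Quantize", columns=2, exec_time=8.0,  reconfig=1.5),
    Task("Huffman",  columns=3, exec_time=10.0, reconfig=2.2),
]
```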

Such a chain of tasks can be executed in a pipelined fashion when sufficient resources are available to instantiate all tasks in the chain. In this work, we specifically focus on scenarios with limited logic, when all tasks cannot be simultaneously placed in the available logic. Detailed consideration of inter-task pipelining versus data-parallelism (when insufficient resources are available to place all tasks in a pipeline) is out of scope of this work, and will be considered in future work.

C. Problem Objective

Our overall goal is to maximize performance (minimize schedule length) under physical and architectural constraints, given a resource constraint of C columns available for the application, where C is less than that required to map the entire application,3 i.e., C < c_1 + c_2 + ... + c_n. An additional key goal is that our approach should have a low computational overhead suitable for implementation on a typical embedded processor.

Fig. 4. Effect of significant reconfiguration overhead. (a) Sequential. (b) Ideal parallel. (c) Actual parallel.

IV. DETAILED PROBLEM SPECIFICATION AND EXACT MATHEMATICAL FORMULATION (ILP)

In this section, we first motivate our problem and follow up with a detailed problem specification. Next, we provide an exact mathematical formulation (ILP).

A. Motivation and Detailed Problem Specification

Ideally, the degree of parallelism for a data-parallel task is limited only by the availability of HW resources. Let us consider a chain with only a single data-parallel task T_1 that executes in time t_1 using c_1 columns, as shown in Fig. 4(a). Given a resource constraint of C columns, we expect performance to be maximized (schedule length minimized) when this task is instantiated N = floor(C / c_1) times, as in Fig. 4(b). In these figures, the x-axis represents the columnar area constraint and the y-axis represents the schedule length. For sequential tasks (0 degree of data-parallelism), the execution of task T_1 is represented as E_1, as in Fig. 4(a). Each individual task requires reconfiguration before execution—however, for ease of presentation, we show all our schedules (and corresponding equations) as starting from execution of the first task in the chain. For data-parallel tasks, we additionally denote execution of the j-th instance (copy) of task T_1 as E_1^j, as in Fig. 4(b).

Unfortunately, as we discuss next, the performance gain in Fig. 4(b) is typically not achievable while considering realistic issues on such architectures.

1) Significant Reconfiguration Overhead: For modern single-context architectures that support partial RTR, the large reconfiguration delay is a key bottleneck in achieving parallelism. To illustrate this, we consider Fig. 4(c). In this figure, reconfiguration for the j-th instance (copy) of T_1 is denoted as R_1^j. Similar to our convention of not explicitly showing reconfiguration for task T_1 in Fig. 4(a), we do not explicitly show reconfiguration for the first data-parallel instance E_1^1 in Fig. 4(c). We simply assume that the reconfiguration controller is available at the beginning of the execution of the first instance E_1^1. Next, we attempt to maximize performance by instantiating an additional copy E_1^2 and distributing the

3 We assume of course that the largest task fits into the available area, i.e., max_i c_i <= C.



Fig. 5. Parallelism degree determined by reconfiguration overhead.

workload (execution time) equally between the two instances E_1^1 and E_1^2. However, execution of the second instance can start only after the reconfiguration overhead r_1. Thus, instead of the ideal workload of t_1/2, the workload of the second task instance is only (t_1 - r_1)/2, leading to less performance improvement than expected. The actual schedule length is (t_1 + r_1)/2 instead of the ideal schedule length of t_1/2.

For a single task, a simple equation suffices to compute the optimal workload leading to maximum performance improvement, as shown in the following lemma.

Lemma 1: For parallelizing a task with execution time t and reconfiguration overhead r into N instances, and given that the reconfiguration controller is available at the beginning of execution of the first instance, the best performance (least execution time) is obtained when the workload (execution time) of the k-th instance is: w_k = (N - k) * r + (t - N(N-1)r/2) / N.

Proof: The proof follows directly from the simple example of parallelizing a task into two task instances.

When the N-th task instance is ready for execution (reconfiguration for E^N is complete), the workload completed by instance 1 is (N-1)r, the workload completed by instance 2 is (N-2)r, ..., and the workload completed by instance N-1 is r. The aggregate workload completed before E^N starts is

r + 2r + ... + (N-1)r = N(N-1)r/2.

To maximize performance (minimize schedule length), the remaining workload t - N(N-1)r/2 is distributed equally between all N task instances, i.e., the workload assigned to instance k beyond this point is (t - N(N-1)r/2) / N, giving w_k = (N - k) * r + (t - N(N-1)r/2) / N.

Lemma 1 clearly demonstrates that maximizing performance involves unequal workload (execution time) distribution between multiple copies of a task to compensate for the significant reconfiguration delay and the sequential reconfiguration mechanism.
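The workload formula of Lemma 1 (as reconstructed here, with t the task execution time and r the per-instance reconfiguration overhead) can be checked with a short sketch; the numeric values are illustrative.

```python
def lemma1_workloads(t: float, r: float, n: int) -> list:
    """Workload of instance k per Lemma 1:
    w_k = (n - k)*r + (t - n*(n-1)/2 * r) / n, for k = 1..n.
    Assumes an equal reconfiguration overhead r per instance and that the
    remaining workload t - n(n-1)r/2 is non-negative (see Lemma 2)."""
    base = (t - n * (n - 1) / 2 * r) / n
    return [(n - k) * r + base for k in range(1, n + 1)]

# Two instances with t = 100, r = 20: workloads (t+r)/2 = 60 and (t-r)/2 = 40,
# so the schedule length is 60 instead of the ideal 50, as in the text.
ws = lemma1_workloads(100.0, 20.0, 2)
assert ws == [60.0, 40.0]
assert sum(ws) == 100.0  # the instances together always cover the full workload
```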

Along with reducing expected performance, the large reconfiguration delay also potentially prevents performance improvement if more than a few copies of a task are instantiated, as shown in Fig. 5. Even though enough resources are available to instantiate four copies of a task, instantiating the fourth copy does not improve the schedule length any further. In fact, assigning any nonzero workload to the fourth instance leads to a longer schedule than a schedule with only three instances. Similar to the previous lemma for computing optimal workload, a simple equation suffices to compute the optimal number of instances leading to maximum performance improvement, as shown in the following.

Lemma 2: For parallelizing a task and given that the reconfiguration controller is available at the beginning of execution of the first instance, the best performance (least execution time) is obtained when there are exactly

N = min( floor(C / c), ceil((1 + sqrt(1 + 8t/r)) / 2) - 1 )

instances.

Proof: The first term, floor(C / c), states that one trivial bound on the number of instances is simply the maximum number of task copies that fit in the available area. The second term follows from Lemma 1 as shown in the following.

If the n-th instance does not improve performance, the aggregate workload completed before E^n starts execution is greater than the task workload, i.e.,

r + 2r + ... + (n-1)r >= t, i.e., n(n-1)r/2 >= t

or

n^2 - n - 2t/r >= 0.

Solving the previous quadratic equation, performance does not improve if

n >= (1 + sqrt(1 + 8t/r)) / 2.

Thus, the maximum number of instances such that performance definitely improves is given by ceil((1 + sqrt(1 + 8t/r)) / 2) - 1, leading to the second term in the lemma.

Thus, the granularity of data-parallelism, which includes selecting the number of instances, is determined by two factors: along with the very obvious factor of the number of instances that fit in the given space, the other key factor is the ratio of task execution time to task reconfiguration time. In the experimental section, we have conducted experiments on images with different sizes—for such experiments, the reconfiguration time for individual tasks is invariant, while the task execution time is proportional to the image size. The experimental results validate this lemma, i.e., there is more performance improvement with increasing image size (resulting from potentially more instances for each data-parallel task).
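Under the same assumptions, Lemma 2 (as reconstructed here) reduces to a few lines of code; C is the column budget, c the columns per instance, and the numeric values below are illustrative only.

```python
import math

def lemma2_instances(t: float, r: float, C: int, c: int) -> int:
    """Best number of instances: the minimum of the area bound floor(C/c)
    and the reconfiguration bound ceil((1 + sqrt(1 + 8t/r)) / 2) - 1."""
    fit_bound = C // c
    gain_bound = math.ceil((1 + math.sqrt(1 + 8 * t / r)) / 2) - 1
    return min(fit_bound, gain_bound)

# t/r = 5: a 4th copy would require 4*3/2 = 6 >= 5 units of work to complete
# before it starts, so only 3 instances help even though 4 fit (cf. Fig. 5).
assert lemma2_instances(t=100.0, r=20.0, C=16, c=4) == 3
```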

The simple equations in Lemmas 1 and 2 provide the underlying principles for our proposed approach. However, they are not sufficient to compute the schedule length for a precedence-constrained task chain. We next consider the additional complications introduced in our problem by precedence constraints.

2) Precedence Constraints: For precedence-constrained application task chains, there is interaction between the resource demands of parent and child tasks, as shown in Fig. 6 for a simple chain with two tasks. The HW resource constraint allows three copies (instances) of the first task, or two copies of the second, to be executing simultaneously. One possible approach is to exactly follow Lemmas 1 and 2, i.e., instantiate all three copies


Fig. 6. Problem space explosion with precedence constraints.

of the first task to maximize its performance, and then instantiate two copies of the second, as in Fig. 6(c). However, it is potentially possible to improve the schedule length further by instantiating only two copies of the first task and using the remaining space to reconfigure (instantiate) one copy of the second—once the two copies of the first task end, the first instance of the second task is able to start execution immediately, as in Fig. 6(b). Note that in our execution model, all instances of a parent task must finish execution before any instance of a child task starts execution.

Obviously, the problem space explodes with the introduction of precedence constraints. Effectively, for a chain of tasks, we want to determine the best possible performance over all candidate transformed task graphs, whose number grows with the product of the feasible instance counts of the individual tasks.

To better understand the problem difficulty, note that scheduling a simple task chain under partial RTR constraints is NP-complete even without any data-parallelism considerations. The detailed proof, based on reduction from set-partitioning, is out of scope for this paper—the interested reader should contact the authors [21] for further details. Also, configuration prefetch [7] plays a critical role—in Fig. 6(b), the gap introduced between completion of reconfiguration and start of execution for the child task instance is crucial to improving latency in the presence of significant reconfiguration delay. Thus, our detailed problem specification is as follows.

Problem Inputs:
• precedence-constrained application task chain, where some tasks have the data-parallelism property;
• strict bound on the number of contiguous columns available for mapping the task chain.

Problem Output:
• number of copies for each data-parallel task;
• workload (execution time) of each copy of a data-parallel task;
• placed task schedule where every task (instance) is assigned an execution start time and an execution start column.

Problem Objective:
• minimize schedule length (maximize performance).

As mentioned previously, an additional objective is that any solution should have low execution complexity, suitable for implementation on typical embedded processors. However, for the sake of completeness (and as a key tool to evaluate the quality of our proposed heuristics), we next present an exact mathematical formulation (ILP) of the previous problem.

B. Mathematical Formulation (ILP) of Problem

In this subsection, we present an integer linear program (ILP) that provides an exact solution to our problem. Our underlying model is a 2-D grid where task placement is modelled along one axis while time is represented on the other axis. Previous work [9] has addressed the problem of exact scheduling (and placement) for a task graph with partial-RTR-related constraints (including configuration prefetch) based on such a grid representation. Unlike [9], our objective is to determine the structure of a task graph that maximizes performance—attempting to determine the number of task instances and the execution time of each instance while satisfying all constraints related to columnar partial RTR makes the ILP formulation additionally challenging.

1) Core Principles: To formulate such an ILP, we essentially start with an expanded series-parallel graph. For each data-parallel task, we implicitly instantiate as many task copies as possible subject to the resource constraint. For each such task instance we add precedence edges to the child task (or, to every instance of the child task, if it is data-parallel).

Next, we introduce a Boolean (0-1) variable (invalid) for every task instance in the expanded graph—a nonzero value of this variable denotes that the corresponding task instance is not required for maximizing performance. To determine task instance execution time along with task instance start time, we introduce two sets of variables: the (start execution) variable for a task instance denotes the execution start time of the task instance, while the (is executing) variable denotes that a task instance is processing data in a given time-step.

The following indices are key to properly specifying the ILP variables: the number of tasks in the chain; the number of task instances in the expanded graph; the number of instances of each task; and an upper bound on the schedule length.

2) ILP Variables: The complete set of 0-1 (decision) variables is as follows.

• starts execution: 1 if the task instance starts execution in a given time-step with a given leftmost occupied column; 0 otherwise.


• is executing: 1 if the task instance is executing in a given time-step with a given leftmost occupied column; 0 otherwise.
• finishes execution: 1 if the task instance finishes execution in a given time-step; 0 otherwise.
• valid: 1 if the task instance is in an optimal solution; 0 otherwise.
• starts reconfiguration: 1 if reconfiguration for the task instance starts at a given time-step with a given leftmost occupied column; 0 otherwise.

Some of the constraints necessitate the introduction of additional binary variables to represent logical conditions; all such auxiliary variables are likewise 0-1.

3) Constraints:
1) Simple task execution constraints:
a) Each valid task instance is executed exactly once. (1)
b) Task instance execution time is non-negative, i.e., the execution finish time for a task instance is greater than or equal to its execution start time. (2)
Note that this holds for all task instances. If a task instance is not required, its corresponding (start execution) and (finish execution) variables are all assigned a value of zero.

2) Core data-parallelism constraints:
a) The execution time for a task equals the aggregate execution time of all instantiated copies. (3)
This equation holds trivially for all non-data-parallel tasks, which have a single instance each.

b) Precedence constraints between task instances: each valid instance of a child task starts execution only after every instance of its parent task finishes execution. (4)
We can rewrite this constraint in a form that enables us to apply the if-then transformation as in [22].4

3) Core column-based partial RTR constraints:
a) Each valid task instance needs to be reconfigured; also, the start column for reconfiguration is the same as the start column for execution. (5)
b) Each valid task instance can start processing data only after the task is reconfigured, i.e., only after the task's reconfiguration delay has elapsed from the start of its reconfiguration. (6)
We can apply the if-then transform similarly to (4).

c) Resource constraints on the field-programmable gate array (FPGA): the total number of columns being used for task instance executions plus the number of columns being reconfigured is limited by the total number of FPGA columns. (7)
d) At every time-step, at most a single task instance is being reconfigured. (8)

4If-then transform for the constraint: if f(x) > 0, then g(x) >= 0. Introduce a 0-1 variable w and the constraints

-g(x) <= Mw
f(x) <= M(1 - w)
w in {0, 1}

where M is a large number such that f(x) <= M and -g(x) <= M for all x satisfying all other constraints.


e) At every time-step, mutual exclusion of execution and reconfiguration for every column. (9)
f) For every column, at every time-step, the total number of reconfigurations is at most one less than the number of executions started using that column. (10)
g) Simple placement constraint: a task can start execution in a column only if there are sufficient available columns to its right. (11)
4) Equations relating task execution variables:

a) For each task instance, if it starts execution in a given time-step (the starts execution variable is 1), the is executing variables are zero in all prior time-steps (12) and 1 in that time-step (13).
b) For each task instance, if it is executing in a given column, the corresponding starts execution variable is true for that column. (14)
c) For each valid task instance, the task instance execution time equals the number of nonzero is executing variables. (15)
All of the previous equations can be simplified using the if-then transform described earlier.

5) Objective function to minimize schedule length: this is equivalent to minimizing the end time of every instance of the last task in the chain. By introducing a new sink task with precedence edges from all instances of the last task in the chain, the objective function is simply the execution start time of this new task. If we additionally assign a suitable column width to this new sink task, the objective function simplifies to: minimize the execution start time of the sink task.

6) Additional constraints: along with the necessary constraints, we also introduce additional constraints, such as simple ASAP/ALAP timing constraints, to help reduce the search space (and correspondingly reduce the time required by the ILP solver to find a solution).

V. HEURISTIC APPROACHES

In this section, we first present MFF, a heuristic for scheduling simple task chains. While MFF is oblivious to data-parallelism, it provides the core concepts underlying PARLGRAN, our proposed approach for chains with some data-parallel tasks.

A. Modified First Fit (MFF)

For architectures with partial RTR, the physical (placement) constraints and the architectural constraint of the single reconfiguration mechanism make it difficult to achieve the ideal schedule length. In fact, this simple problem of minimizing the schedule length for a chain, under constraints related to partial RTR, is actually NP-complete, as mentioned in Section IV (the detailed proof by reduction from set-partitioning is out of scope for this paper). MFF, our proposed heuristic for this problem, essentially tries to satisfy task resource constraints and attempts simple local optimizations to reduce fragmentation and, hence, the schedule length.

Approach: MFF (Modified First Fit)
  Place first task starting from leftmost column
  for each subsequent task T in the chain
    place T by first fit
    if (T aligned with rightmost column)
      local optimization: adjust immediate ancestor placement (and start time) if possible to improve start time of T
  endfor
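The first-fit core that MFF refines can be sketched as follows. This is a simplified model, not the authors' implementation: columns are tracked by their release times, the single reconfiguration controller is modelled by one availability time, configuration prefetch is allowed, and "first fit" is approximated by picking the leftmost column region that yields the earliest start. All names (`Task`, `schedule_chain`, field names) are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Task:
    cols: int      # contiguous columns the task occupies
    t_exec: float  # execution time
    t_rcfg: float  # reconfiguration delay

def schedule_chain(tasks, total_cols):
    """Sketch of RTR-aware chain scheduling: each task's columns must be
    reconfigured (single controller, prefetch allowed) before it executes,
    and it must wait for its chain predecessor to finish."""
    col_free = [0.0] * total_cols   # time each column becomes free
    rcfg_free = 0.0                 # reconfiguration controller free time
    pred_finish = 0.0               # finish time of chain predecessor
    schedule = []
    for task in tasks:
        best = None
        # leftmost region giving the earliest start (ties go left)
        for c in range(total_cols - task.cols + 1):
            region_free = max(col_free[c:c + task.cols])
            r_start = max(rcfg_free, region_free)
            start = max(r_start + task.t_rcfg, pred_finish)
            if best is None or start < best[0]:
                best = (start, c, r_start)
        start, c, r_start = best
        rcfg_free = r_start + task.t_rcfg
        finish = start + task.t_exec
        for cc in range(c, c + task.cols):
            col_free[cc] = finish
        pred_finish = finish
        schedule.append((start, finish, c))
    return schedule
```

Note how, for the second task, the reconfiguration is prefetched onto free columns while the first task is still executing, which is exactly the overlap the "diagonal" intuition below describes.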

MFF is based on a first-fit approach. To get intuition behind why a first-fit approach works well in practical scenarios, we take a look at Fig. 7(b). The tasks are essentially laid out in the form of diagonals running from the top-left of the placed schedule towards the bottom-right (the diagonal in the figure results from each task in the chain being placed to the right of its predecessor; a task can start only after its predecessor completes). As long as a task does not "fall off" the diagonal, it is possible to overlap at least part of the reconfiguration overhead with the execution of its immediate ancestor. Once a task "falls off" the


Fig. 7. Simple chain-right placement of task 2. (a) Less fragmentation. (b) More fragmentation.

Fig. 8. Exploiting slack in reconfiguration controller. (a) More fragmentation. (b) Less fragmentation.

diagonal and is placed at the leftmost column, it is essentially trying to reuse the area of ancestor tasks higher up in the chain. Given that for tasks in a chain the execution components have to be in sequence, a more distant ancestor is guaranteed to finish earlier than a closer ancestor. This significantly increases the possibility of being able to overlap the reconfiguration of this task with the execution of ancestors that are closer to it in the chain. Effectively, the chain property creates a "window" of tasks: tasks within a window affect each other much more strongly than tasks outside the window.

1) Simple Fragmentation Reduction: One simple modification for reducing fragmentation in MFF compared to pure first-fit is shown in Fig. 7. Our observations indicate that in tightly-constrained scenarios (few columns available for task mapping), placing the second task adjacent to the first, as in Fig. 7(b), often leads to immediate fragmentation—though enough area is available to reconfigure the next task in parallel with the ongoing executions, this area is not contiguous, and thus that task gets delayed. MFF takes care of this by placing the second task at the right-hand corner. Of course, this simple technique does not improve fragmentation in all possible scenarios.

2) Local Optimization—Exploiting Slack in Reconfiguration Controller: A more interesting local optimization to reduce fragmentation is shown in Fig. 8(a). While scheduling a task, we notice that it is possible to exploit slack in the reconfiguration mechanism to postpone the reconfiguration of a previously scheduled task without delaying that task's actual execution. We can thus make better use of the available area (HW resources) to reschedule (and change the placement of) the earlier task—as a result, its reconfiguration can now execute in parallel with the execution of its predecessor

Fig. 9. Static pruning based on timing.

, leading to a reduction in schedule length, as shown in Fig. 8(b).

Before proceeding to PARLGRAN, it is important to understand that the fragmentation problems we try to address in MFF (and PARLGRAN) arise because we are trying to jointly schedule and place while satisfying a host of other constraints—thus, other free-space coalescing techniques for partially reconfigurable architectures, such as [23], are not directly applicable.

B. PARLGRAN

We use the insights obtained from the chain-scheduling problem as the basis for our granularity selection approach. Detailed analysis of chain-scheduling shows that applying local optimizations can improve the performance. We additionally want to design an approach whose execution time is comparable to the execution time of the tasks. Thus, our proposed granularity selection approach is simple and greedy, but uses specific problem properties to try and improve the solution quality.

Our approach consists of the following two steps:
• static pruning;
• dynamic granularity selection.

1) Static Pruning: First, we utilize some simple facts to statically prune regions of the search space. As an example of pruning, consider Fig. 9. If we schedule exactly one copy each of the parent and child tasks, the child can start as soon as the parent ends, as in Fig. 9(a). If we schedule another copy of the parent task, the execution time of the parent improves. However, the reconfiguration controller now becomes the bottleneck, as shown in Fig. 9(b): the child can now start only at a later time instant than before. Effectively, the number of copies of a task is limited by the impact of its reconfiguration overhead on its successors. Note that such simple static pruning is based primarily on timing considerations.
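The timing side of this pruning test can be sketched with a small calculation, under the simplified model of the lemmas: the first parent copy is preconfigured, the remaining copies are reconfigured back to back, all parent copies finish together (Lemma-1 workloads), and the child's reconfiguration can only start after the last parent reconfiguration. The function and parameter names are hypothetical.

```python
def child_start(n_copies, w_parent, r_parent, r_child):
    """Earliest start of the child task when the parent is split into
    n_copies instances (sketch of the static-pruning timing test)."""
    # parent copy j starts at (j-1)*r_parent; with balanced workloads,
    # all copies finish together at this instant
    finish_parents = w_parent / n_copies + r_parent * (n_copies - 1) / 2.0
    # the child's reconfiguration waits for all parent reconfigurations
    child_ready = (n_copies - 1) * r_parent + r_child
    return max(finish_parents, child_ready)
```

For a parent with workload 4 and reconfiguration delay 1, and a child with reconfiguration delay 4, one parent copy lets the child start at time 4, while a second copy pushes the child's start to 5: the extra copy is pruned, exactly the situation in Fig. 9(b).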

2) Dynamic Granularity Selection: We next consider work distribution (load balancing) issues for the multiple task copies.

a) Uneven Finish Times: From our initial discussion on data-parallelism (as shown earlier in Fig. 5), it seems a good idea to always generate as many copies as possible, subject to performance improvement, and have them all finish at the same time instant. However, with the introduction of task dependencies, it is necessary to modify this strategy in certain cases to improve performance, as shown in Fig. 10. In Fig. 10(a), the earlier copy of the parent task ends first, freeing area in which the child task can be reconfigured, so the child can start as soon as the remaining copy finishes. However,


Fig. 10. Uneven finish times.

Fig. 11. Left placement for instances of first task. (a) More fragmentation. (b) Less fragmentation.

if both copies of the parent end at the same time instant, as shown in Fig. 10(b), this time instant is later than in Fig. 10(a). As a result, reconfiguration for the child task gets delayed, and execution of the child task can only start correspondingly later. Of course, if the area of the child task is greater than the area of a single parent copy, letting both copies of the parent end at the same time instant would lead to a shorter schedule.

b) Adjacent Task Instances: Another simple observation that improves MFF specifically for parallelism granularity selection is shown in Fig. 11. Placing multiple copies of a task adjacent to each other intuitively helps reduce fragmentation.

PARLGRAN is an adaptation of MFF that essentially tries to greedily add multiple copies of data-parallel tasks as long as it estimates that adding a new copy is beneficial for performance (shorter schedule length). The concept of dynamically adjusting the workload, combined with local optimizations, makes it effective. We summarize our PARLGRAN approach as follows.

PARLGRAN (Parallelism Granularity Selection)
  Place first copy of first task starting from leftmost column
  for each subsequent task T in the chain
    Compute earliest execution start of T (space search by last-fit)
    if (parent task of T is data-parallel)
      while (no degradation in start time of T)
        add new copy of parent (assign start time, physical location)
        adjust workload of existing scheduled copies of parent
    Schedule (and place) T
    apply local optimizations from MFF for improving schedule
  endfor

In the previous code segment, one task is considered in each iteration of the outer loop. Using a last-fit-based placement strategy, we obtain the earliest time instant at which this task can start execution. The inner loop selects the granularity of its parent task, i.e., selects the number of instances and workload of the parent task such that the current task can start execution earlier (of course, this is applicable only to parent tasks that are data-parallel). As each new instance of a parent task is added, the algorithm uses the fragmentation reduction strategies described earlier to try and reduce the schedule length. When it is not possible to improve the start time of the current task any further, the inner loop terminates and the next task in the chain is selected. Note that granularity selection is very tightly integrated with placement—at each step, the schedule length is computed based on the physical locations of tasks on the device.
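Under the same simplified timing model used for the lemmas (placement and fragmentation effects omitted), the stopping rule of the inner loop can be sketched as: keep adding parent copies while the child's earliest start keeps improving. The names and the timing model are illustrative assumptions, not the paper's code.

```python
def select_granularity(w_parent, r_parent, r_child, max_copies):
    """Greedy inner-loop sketch: grow the parent's instance count while
    the child's earliest start time strictly improves."""
    def child_start(n):
        # balanced workloads: parent copy j starts at (j-1)*r_parent,
        # and all n copies finish together
        parents_done = w_parent / n + r_parent * (n - 1) / 2.0
        # child reconfiguration queues behind the parent reconfigurations
        child_ready = (n - 1) * r_parent + r_child
        return max(parents_done, child_ready)

    n = 1
    while n < max_copies and child_start(n + 1) < child_start(n):
        n += 1
    return n
```

With a large parent workload relative to the reconfiguration delays, the loop runs up to the resource bound; when the child's reconfiguration dominates, it stops at a single copy, mirroring the pruning example of Fig. 9.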

While this approach may appear simplistic, experimental results in the following section show that it typically does better than statically deciding to parallelize each task to its maximum degree. For real image-processing applications such as JPEG encoding, blind parallelization can lead to significantly inferior results, even worse than RTR-aware first-fit, because of the reconfiguration overhead and the physical (placement) constraints.

VI. EXPERIMENTS

We conducted a wide variety of experiments to validate our proposed approach. We demonstrate the quality of schedules generated by our heuristics with a very large set of synthetic experiments (consisting of over a thousand data-points) along with a detailed application case study. Additionally, we demonstrate the semi-online applicability of PARLGRAN with a detailed analysis of estimated execution time on a typical embedded processor, the PPC405 (PowerPC) processor with an operating frequency of 400 MHz.

It is important to remember that our goal is to maximize performance (minimize schedule length) for an application task chain in an on-demand computing scenario, where a dynamically invoked application is assigned logic resources (area for mapping the application task graph) depending on the number and resource requirements of other applications simultaneously executing on the reconfigurable device. Thus, while it is possible to fit our applications onto suitably sized target devices, we assume for experimental purposes that the hard resource constraint is less than the aggregate size of all tasks.


A. Experimental Setup

We assumed a target device with 24 columns, similar to the Xilinx XC2V2000.5 From the XC2V2000 data sheet, we estimate that the reconfiguration overhead for the smallest task, occupying one column on our architecture, is 0.38 ms at the maximum suggested reconfiguration frequency of 66 MHz. We obtained area and timing data for well-known tasks such as Huffman encoding, DCT, etc., by synthesizing them with columnar placement and routing constraints on the XC2V2000, similar to the Xilinx methodology suggested for reconfigurable modules.6

We explored a large set of scenarios with the following strategy:

1) We generated task chains of varying chain length, ranging from 4 to 15 tasks in the chain.

2) For a task chain of given chain length, each individual task was assigned a set of parameters (execution time, reconfiguration delay, number of columns) randomly selected from our database of synthesized tasks. Thus, we generated multiple task chains for a given chain length.

3) Finally, for each individual task chain, we conducted multiple experiments by varying the area constraint across a wide range, to represent situations with less area as well as situations with more area available for mapping the application.

Our overall strategy resulted in a set of over a thousand individual experiments. Note that the database of task parameters included information corresponding to images of various sizes—since each individual task is completely pipelined, the reconfiguration delay and the number of columns occupied by a task are independent of the image size, but the execution time is directly proportional to the image size.

In subsequent discussions, the following notation denotes the schedule length generated by the various approaches, including our proposed approach, the exact formulation, and the other heuristics implemented to evaluate the quality of the schedules generated by our approach:
• L_ILP: corresponds to our exact (ILP) formulation;
• L_MFF: corresponds to our MFF approach for scheduling chains with no data-parallelism considerations;
• L_PARLGRAN: corresponds to our proposed PARLGRAN approach;
• L_FF: corresponds to a simple first-fit (FF) approach [9] for scheduling chains with no data-parallelism considerations;
• L_MAXPARL: corresponds to the maximum parallelization (MAXPARL) approach.

Along with our implementations of MFF, PARLGRAN, and the ILP, we additionally implemented MAXPARL to evaluate the quality of our schedules. MAXPARL attempts to maximize parallelization by statically selecting the maximum number of copies possible for each task, subject to resource constraints only, and assigning equal workload to each such task instance. Note that MAXPARL includes detailed configuration prefetch

5While the XC2V2000 datasheet specifies a 56 × 48 matrix of logic blocks, architectural constraints for partial RTR necessitate that one dynamically reconfigurable column equals two columns of logic on the physical device, as shown online: http://www.xilinx.com/bvdocs/appnotes/xapp290.pdf.

6[Online]. Available: http://www.xilinx.com/bvdocs/appnotes/xapp290.pdf

TABLE I
PARLGRAN VERSUS ILP FOR SMALL TESTS

considerations. Because of the equal workload distribution, multiple instances of a task finish at different time instants, unlike Lemma 1—however, the freed-up area is utilized to instantiate multiple copies of any data-parallel child task. Thus, the schedules generated by MAXPARL are of reasonable quality, and significantly better than an approach with no configuration prefetch considerations.
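The timing cost of MAXPARL's equal split, relative to the balanced workloads of Lemma 1, can be seen in a two-line model. The names are hypothetical, and the model assumes copy j becomes ready at (j−1)·t_rcfg under a single reconfiguration controller.

```python
def finish_equal_split(n, t_exec, t_rcfg):
    """Last finish time under MAXPARL-style equal workload split:
    copy j is ready at (j-1)*t_rcfg and every copy gets t_exec/n work,
    so the last-configured copy finishes last."""
    return (n - 1) * t_rcfg + t_exec / n

def finish_balanced(n, t_exec, t_rcfg):
    """Finish time under Lemma-1 workloads: all copies finish together."""
    return t_exec / n + t_rcfg * (n - 1) / 2.0
```

With n = 3, t_exec = 12, and t_rcfg = 2, the equal split finishes at 8 time units against 6 for the balanced split; the gap, t_rcfg·(n−1)/2, is exactly the staggering of the reconfigurations that the balanced workloads absorb.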

B. Schedule Quality on Synthetic Experiments

1) Schedule Quality of MFF (Compared to FF): Our first set of experiments compared the schedule lengths generated by MFF with those of first-fit, on the set of experiments described above.

The experimental data confirmed that the schedules generated by MFF were almost always equal to or better than those of FF. The schedule lengths generated by MFF were better in 207 out of 1096 tests, i.e., approximately 19% of the tests, and worse in only 6 out of 1096 tests. In 114 tests, around 10% of the total, MFF was better by at least 3%; in the worst experiment for MFF, its schedule was longer than first-fit's by only 0.44%. Overall, on longer chains (more tasks) and looser constraints (more columns), both algorithms were almost equally able to hide the reconfiguration overhead. However, on more constrained problems, with shorter chains and tighter area constraints, MFF tends to generate better schedules.

2) Comparing PARLGRAN Schedule Length With ILP for Small Tests: Our next set of experiments compared the schedule length generated by PARLGRAN with that generated by the exact formulation. The implementation of the ILP using the commercial solver CPLEX7 requires hours for even very small testcases on our implementation platform (SunOS 5.9 with a 502-MHz Sparcv9 processor). Thus, for experiments involving the exact formulation, we report results on a small set of synthetic experiments with short chains, where the chain length varies between 3 and 5 tasks (the experiments also include one testcase (jpg3) from the detailed case study presented in the next subsection).

In Table I, the second column represents schedule lengths generated by the ILP, while the third column represents schedule lengths generated by PARLGRAN. For this set of experiments, the schedule length is reported in time-steps, where one time-step corresponds to the reconfiguration delay for a single CLB column.

As the table shows, the schedules generated by PARLGRAN for small experiments (short chains) are reasonably close to those of the exact approach.

7[Online]. Available: http://www.ilog.com/products/cplex


TABLE II
REDUCTION IN SCHEDULE LENGTH FOR COMPLETELY DATA-PARALLEL CHAINS WITH PARLGRAN

3) Overall Schedule Quality of PARLGRAN: Next, in Table II, we present a summary of results covering the entire set of synthetic experiments. The data in each row of the table corresponds to experiments on chains of the corresponding length—as an example, the data in the second row (chain length 7-9) was obtained from experiments on chains with at least 7 tasks and at most 9 tasks. Note that this set of experiments is identical to the one used to validate MFF—we additionally assume that each task in the chain is completely data-parallel. For comparison with MAXPARL and FF, our quality measure is simply the percentage increase in the schedule length generated by the other approach relative to PARLGRAN. As an example, for comparison with MAXPARL, the quality measure is

(schedule length of MAXPARL − schedule length of PARLGRAN) / (schedule length of PARLGRAN) × 100%.

The second column in Table II represents the average percentage improvement of PARLGRAN as compared to FF. Each entry in the second column is an average over a large number of experiments conducted on chains of the corresponding length. The third, fourth, and fifth columns, respectively, represent the average, the best, and the worst performance of our approach compared to MAXPARL. As an example, the data in the second row, fourth column, states that over a large number of experiments with chain length between 7 and 9 tasks, the best result generated by our approach corresponds to an experiment where MAXPARL generated a schedule 139% longer.

Expectedly, there is significant improvement in schedule length with PARLGRAN compared to the sequential (FF) approach, as shown in the second column of the table. More importantly, the data in the third column clearly shows that our proposed granularity selection heuristic, PARLGRAN, generates increasingly better results compared to MAXPARL as more space becomes available. Intuitively, with more available area, it is possible to create more instances of the data-parallel tasks. However, with each additional instance, the workload (execution time) per instance decreases, resulting in execution times comparable to the reconfiguration overhead—PARLGRAN is better able to decide when to stop instantiating multiple copies, as opposed to MAXPARL. The local optimizations in PARLGRAN play an active role in such circumstances to help improve the schedule length.

One key aspect of the data in Table II is that for smaller chains, our presented results cover a very large range of varying area constraints—for longer chains, the presented results cover scenarios where the available HW area is at most 40%-45% of the aggregate HW area of the tasks. For chains with more than 9-10 tasks, a loose area constraint results in even more

Fig. 12. JPEG encoder task graph.

TABLE III
CASE STUDY OF JPEG ENCODING: SCHEDULE LENGTH WITH DIFFERENT IMAGE SIZES AND AREA CONSTRAINTS

significant improvement with PARLGRAN compared to other approaches.

C. Detailed Application Case Study: JPEG Encoding

After conducting a wide range of experiments on synthetic graphs, we conducted a detailed application case study of the JPEG encoding algorithm, represented as a chain of four key tasks, shown in Fig. 12. Note that Huffman is a sequential task (no data-parallelism) while the remaining three tasks are data-parallel. Table III presents some results from our case study. Entries in the first column, Case, denote the image size—256 × 256 denotes experiments on a 256 × 256 color image. For each case, we varied the number of columns and observed the resulting schedule lengths (the aggregate area requirement of all tasks in the chain is 11 columns). The second column represents the area constraint in columns. The third, fourth, and fifth columns correspond to schedule lengths (in milliseconds) generated by MFF, MAXPARL, and PARLGRAN, respectively.

The data in Table III demonstrates that as the available area increases, our proposed approach PARLGRAN consistently generates shorter schedules. As an example, for the 256 × 256 image, we consider the data corresponding to the tighter area constraint and the data corresponding to the looser one; the corresponding transformed task graphs are shown in Figs. 13 and 14, respectively. The DCT task is the most computation-intensive task in the chain (maximum execution time). However, the tighter area constraint does not allow multiple instances of the DCT task. Thus, PARLGRAN improves performance by adding one instance of the RGB2YCRCB task, as shown in Fig. 13.


Fig. 13. Transformed JPEG task graph (image size: 256 × 256).

Fig. 14. Transformed JPEG task graph (image size: 256 × 256).

Fig. 15. JPEG task graph with maximum parallelization.

However, with more area, PARLGRAN is capable of deciding that it is more beneficial to instantiate two copies of the DCT and have only a single instance of the RGB2YCRCB task. For comparison, we note that an approach oblivious to partial RTR constraints would generate four instances of the RGB2YCRCB task, as shown in Fig. 15.

Next, we observe how our approach adapts to varying data size with Figs. 14 and 16. For the same area constraint

Fig. 16. Transformed JPEG task graph (image size: 512 × 512).

, the transformed task graph for the 256 × 256 image has six tasks while that for the 512 × 512 image has seven tasks. For the larger image, the task execution times are significantly higher than the task reconfiguration times, resulting in more scope for exploiting data-parallelism, as in Lemma 2.

Next, we focus on the experimental results for the 256 × 256 image. For this set of experiments, where the reconfiguration overheads are comparable to the task execution times, our approach frequently does much better than statically parallelizing everything (MAXPARL). Additionally, the data demonstrates that such blind parallelization can lead to results worse than a simple sequential scheduling approach. For an area constraint of eight columns, the schedule length of FF is longer than that of PARLGRAN, and blind (static) parallelization leads to a significantly worse schedule that is longer still. This is in spite of the fact that the effective transformed graph from MAXPARL (see Fig. 15) consists of nine tasks with apparently more parallelism, while the transformed graph from PARLGRAN (Fig. 14) consists of only six tasks.

For the 512 × 512 image, each task execution time is significantly greater than the reconfiguration overhead. In such a scenario, where, additionally, the chain length is short, MAXPARL generates good results; of course, PARLGRAN typically does somewhat better. But both parallelizing approaches result in significant speedups.

D. Applicability in Semi-Online Scenario

The experimental data clearly demonstrates that PARLGRAN generates high-quality schedules. However, our objective is for PARLGRAN to be applicable in a semi-online scenario where the task precedence relations and the task area-timing characteristics are available at compile-time, while the available HW area for mapping the application is known only at run-time. Task management under such dynamic resource availability is a key issue in modern operating systems for reconfigurable architectures [5]. So, we next obtained detailed execution-time estimates for MFF, PARLGRAN, and MAXPARL on the PPC405 operating at 400 MHz; such a processor is available in the Xilinx Virtex-II Pro platform.
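The compile-time/run-time split described above can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: the task names, area figures, and timings are hypothetical, and the sequential baseline is a placeholder for an MFF-like scheduler.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    columns: int       # area requirement (FPGA columns)
    t_exec: float      # execution time in ms (hypothetical)
    data_parallel: bool

# Known at compile time: the application chain and per-task metrics.
CHAIN = [
    Task("RGB2YCRCB", 2, 1.5, True),
    Task("DCT",       3, 4.0, True),
    Task("Quantize",  2, 1.0, False),
]

def schedule_on_demand(available_columns, scheduler):
    """Run-time entry point: the area constraint arrives only now, so
    granularity selection must execute online (e.g., on the PPC405)."""
    return scheduler(CHAIN, available_columns)

def sequential_baseline(chain, available_columns):
    # MFF-like stand-in: ignore data-parallelism and simply run the
    # chain task by task, provided each task fits in the given area.
    assert all(t.columns <= available_columns for t in chain)
    return sum(t.t_exec for t in chain)

print(schedule_on_demand(8, sequential_baseline))  # prints 6.5
```

The key design point is that only `available_columns` is a run-time input; everything else is fixed data, which is what makes an on-demand invocation of the scheduler on an embedded processor feasible.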

We obtained heuristic execution-time estimates for the JPEG encoding application with three different image sizes: 256 × 256, 384 × 384, and 512 × 512.


Fig. 17. Schedule length + heuristic execution time: JPEG encoding, 256 × 256.

Fig. 18. Schedule length + heuristic execution time: JPEG encoding, 384 × 384.

For each image size, we varied the area constraint and obtained the cumulative execution time, as shown in Figs. 17, 18, and 19. In each of these figures, the x-axis represents the area constraint as a percentage of the aggregate area required by all tasks, and the y-axis represents the cumulative execution time (schedule length computed by the heuristic + execution time of the heuristic).
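The metric plotted on the y-axis can be made concrete with a small sketch. The numbers below are hypothetical, loosely patterned on the trends in Figs. 17-19: a heuristic wins overall only if its shorter schedule more than pays for its own run time on the embedded processor.

```python
def cumulative(schedule_ms, heuristic_ms):
    # Cumulative execution time as defined above: the schedule length
    # produced by a heuristic plus the run time of the heuristic itself.
    return schedule_ms + heuristic_ms

# Hypothetical per-heuristic numbers (ms) for one image size and area.
candidates = {
    "MFF":      cumulative(schedule_ms=14.0, heuristic_ms=0.1),
    "PARLGRAN": cumulative(schedule_ms=9.1,  heuristic_ms=0.8),
    "MAXPARL":  cumulative(schedule_ms=11.0, heuristic_ms=1.5),
}
best = min(candidates, key=candidates.get)
print(best)  # prints PARLGRAN
```

Under these illustrative numbers, PARLGRAN's longer analysis time is more than repaid by its shorter schedule, which is the pattern the following paragraphs report.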

MFF, of course, has the least execution-time overhead. Thus, for short chains with a very tight area constraint, the cumulative execution time with MFF is comparable to that of the other heuristics, as in Fig. 17. However, as the available area or the image size increases, the scope for exploiting data-parallelism increases. In such scenarios, PARLGRAN and MAXPARL generate shorter schedules that more than compensate for the increased heuristic execution time.

This is explicitly demonstrated in Fig. 20, where we present data for three different image sizes scheduled with the same

Fig. 19. Schedule length + heuristic execution time: JPEG encoding, 512 × 512.

Fig. 20. Schedule length + heuristic execution time: JPEG encoding, loose area constraint.

(relaxed) area constraint. Note that an increase in image size results in an increased ratio of task execution time to reconfiguration overhead, making more data-parallel instances feasible (as in Lemma 2). As image size increases, the cumulative execution time with MFF increases almost linearly, i.e., the cumulative execution time for a 512 × 512 image is almost 4 times that of the 256 × 256 image. However, with approaches that attempt to exploit data-parallelism, the cumulative execution time increases at a slower rate: for PARLGRAN, the cumulative execution time for the 512 × 512 image is only around 2.5 times that for the 256 × 256 image.

The heuristic execution times of all approaches increase as more area is available for mapping the application. However, MAXPARL is significantly more sensitive, as shown in Figs. 17 and 18. This is because MAXPARL attempts to maximize parallelism by scheduling a graph with the maximum number of tasks possible in the given area.

Comparing the data in Table III with that in Fig. 17 shows that the PARLGRAN execution-time overhead is approximately 0.77 ms for the 256 × 256 image. This


is quite low compared to the schedule length of 9.08 ms. Much more importantly, for all experiments on the JPEG application, the cumulative execution time for PARLGRAN decreases monotonically, confirming its viability in a semi-online environment.

Our wide range of experiments and case studies confirms that PARLGRAN generates high-quality schedules in all situations: tightly constrained problems with shorter chains and fewer columns, as well as problems with more degrees of freedom, i.e., longer chains and more available columns. Additionally, the estimated run-time of our approach on a typical embedded processor is comparable to the HW task execution times.

VII. CONCLUSION

In this paper, we proposed PARLGRAN, a semi-online scheduling approach that selects the granularity of data-parallelism to maximize the performance of application task chains executing on architectures with partial-RTR capability. Our approach selects both the number of instances of a data-parallel task and the execution time (workload) of each such instance; it is integrated in a joint scheduling and placement formulation, necessitated by the underlying physical and architectural constraints imposed by partial RTR.

To evaluate our proposed heuristic, we have formulated and implemented an exact (ILP) approach, as well as a simpler strategy that attempts to statically maximize data-parallelism. Results on smaller experiments show that PARLGRAN generates schedules reasonably close in quality to those of the exact approach. Experimental results on a very large space of over a thousand synthetic experiments confirm that PARLGRAN generates schedules that are on average 20% better than those of the simpler strategy, which attempts to statically maximize data-parallelism based on logic availability only.

A detailed case study on JPEG encoding confirms that in realistic scenarios, the simpler approach that tries to maximize data-parallelism without accounting for the underlying RTR-related constraints can end up generating schedules much worse than even a data-parallelism-oblivious (but RTR-aware) approach. Finally, detailed execution-time estimates indicate that our approach is suitable for integration in a semi-online scheduling methodology where the goal is to maximize the performance of an application given an area constraint and input characteristics (image size) available only at run-time.

While our approach demonstrates the potential for significant performance improvement, there are some key aspects that we want to address in our future work. Most importantly, we have assumed in this work that we are not constrained by memory/communication bandwidth. Our initial estimates indicate that even with increased parallelism, the additional bandwidth requirement for realistic applications (including the JPEG application) can be satisfied by a typical memory-bus configuration such as PC3200 DDR memory integrated with a suitable bus. However, we agree that with increased task granularity (more instances) and ever-increasing device sizes (enabling more applications to execute concurrently), the data transfer to and from memory, both on-chip and off-chip, has the potential to become a bottleneck, and will be considered in future work. Last, but not least, we have focused on exploiting data-parallelism with partial RTR. An operating system [5] that handles resource management in a true on-demand computing environment requires a toolbox that can suitably mix and match a variety of techniques, including but not limited to inter-task pipelining, clustering, etc.

REFERENCES

[1] M. J. Wirthlin, “Improving functional density through run-time circuit reconfiguration,” Ph.D. dissertation, Elect. Comput. Eng. Dept., Brigham Young Univ., Provo, UT, 1997.

[2] H. Quinn, L. A. S. King, M. Leeser, and W. Meleis, “Runtime assignment of reconfigurable hardware components for image processing pipelines,” in Proc. IEEE Symp. Field Program. Custom Comput. Mach. (FCCM), 2003, pp. 173–182.

[3] J. Noguera and R. M. Badia, “Power-performance trade-offs for reconfigurable computing,” in Proc. IEEE/ACM/IFIP Int. Conf. Hardw.-Softw. Codesign Syst. Synth. (CODES+ISSS), 2004, pp. 116–121.

[4] J. Noguera and R. M. Badia, “Performance and energy analysis of task-level graph transformation techniques on dynamically reconfigurable architectures,” in Proc. Int. Conf. Field Program. Logic Appl. (FPL), 2005, pp. 563–567.

[5] C. Steiger, H. Walder, and M. Platzner, “Operating systems for reconfigurable embedded platforms: Online scheduling of real-time tasks,” IEEE Trans. Comput., vol. 53, no. 11, pp. 1393–1407, Nov. 2004.

[6] G. Brebner, “A virtual hardware operating system for the Xilinx XC6200,” in Proc. Int. Workshop Field-Program. Logic, 1996, pp. 327–336.

[7] S. Hauck, “Configuration pre-fetch for single context reconfigurable processors,” in Proc. ACM/SIGDA Int. Symp. Field Program. Gate Arrays (FPGA), 1998, pp. 65–74.

[8] S. P. Fekete, E. Kohler, and J. Teich, “Optimal FPGA module placement with temporal precedence constraints,” in Proc. Des. Automat. Test Europe (DATE), 2001, pp. 658–667.

[9] S. Banerjee, E. Bozorgzadeh, and N. Dutt, “Integrating physical constraints in HW-SW partitioning for architectures with partial dynamic reconfiguration,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 14, no. 11, pp. 1189–1202, Nov. 2006.

[10] J. Resano, D. Mozos, and F. Catthoor, “A hybrid prefetch scheduling heuristic to minimize at run-time the reconfiguration overhead of dynamically reconfigurable architectures,” in Proc. Des. Automat. Test Europe (DATE), 2005, pp. 106–111.

[11] C. Bobda, M. Majer, A. Ahmadiniya, T. Haller, A. Linarth, and J. Teich, “The Erlangen slot machine: Increasing flexibility in FPGA-based reconfigurable platforms,” in Proc. Field-Program. Technol. (FPT), 2005, pp. 37–42.

[12] N. Sedcole, P. Y. K. Cheung, G. A. Constantinides, and W. Luk, “A reconfigurable platform for real-time embedded video image processing,” in Proc. Field Program. Logic Appl. (FPL), 2003, pp. 606–615.

[13] P.-H. Yuh, C.-L. Yang, Y.-W. Chang, and H.-L. Chen, “Temporal floorplanning using the T-tree formulation,” in Proc. Int. Conf. Comput.-Aided Des. (ICCAD), 2004, pp. 300–305.

[14] H. Murata, K. Fujiyoshi, S. Nakatake, and Y. Kajitani, “Rectangle-packing based module placement,” in Proc. Int. Conf. Comput.-Aided Des. (ICCAD), 1995, pp. 472–479.

[15] J. Harkin, T. M. McGinnity, and L. P. Maguire, “Modeling and optimizing run-time reconfiguration using evolutionary computation,” ACM Trans. Embedded Comput. Syst., vol. 3, no. 4, pp. 661–685, 2004.

[16] H. Singh, G. Lu, E. M. C. Filho, R. Maestre, M.-H. Lee, F. J. Kurdahi, and N. Bagherzadeh, “MorphoSys: Case study of a reconfigurable computing system targeting multimedia applications,” in Proc. Des. Automat. Conf. (DAC), 2000, pp. 573–578.

[17] K. N. Vikram and V. Vasudevan, “Mapping data-parallel tasks onto partially reconfigurable hybrid processor architectures,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 14, no. 9, pp. 1010–1023, Sep. 2006.

[18] S. Muchnick, Advanced Compiler Design and Implementation. San Francisco, CA: Morgan Kaufmann, 1997.

[19] T. Stefanov, B. Kienhuis, and E. Deprettere, “Algorithmic transformation techniques for efficient exploration of alternative application instances,” in Proc. Int. Symp. Hardw./Softw. Codesign (CODES), 2002, pp. 7–12.

[20] J. Noguera, “Energy-efficient hardware/software co-design for dynamically reconfigurable architectures,” Ph.D. dissertation, Dept. Comput. Arch., Techn. Univ. Catalonia, Barcelona, Spain, 2005.


[21] J. Augustine, personal communication, 2005. [Online]. Available: john.augustine22tcs.com

[22] W. L. Winston and M. Venkataraman, Introduction to Mathematical Programming, 4th ed. Boston, MA: Thomson Brooks Cole, 2003.

[23] M. Handa and R. Vemuri, “An efficient algorithm for finding empty space for online FPGA placement,” in Proc. Des. Automat. Conf. (DAC), 2004, pp. 960–965.

Sudarshan Banerjee (M’99) received the B.Tech. and M.Tech. degrees in computer science and engineering from the Indian Institute of Technology, Kharagpur, India, and the Ph.D. degree in information and computer science from the University of California, Irvine.

He is currently working on hardware-assisted simulation with Liga Systems, Sunnyvale, CA. He has extensive experience in the development of industry-leading logic verification tools as an employee of Synopsys and Cadence. His current research interests include partitioning and scheduling for HW-SW codesign, and dynamically reconfigurable FPGAs.

Elaheh Bozorgzadeh (S’00–M’03) received the B.S. degree in electrical engineering from Sharif University of Technology, Iran, in 1998, the M.S. degree in computer engineering from Northwestern University, Evanston, IL, in 2000, and the Ph.D. degree in computer science from the University of California, Los Angeles, in 2003.

She is currently an Assistant Professor with the Department of Computer Science, University of California, Irvine. Her research interests include design automation for embedded systems, reconfigurable computing, and VLSI/FPGA CAD. She has coauthored over 45 conference and journal papers.

Prof. Bozorgzadeh was a recipient of a Best Paper Award at IEEE FPL 2006. She is a member of ACM.

Nikil Dutt (F’<PLEASE PROVIDE YEAR.>) received the Ph.D. degree in computer science from the University of Illinois at Urbana-Champaign in 1989.

He is currently a Chancellor’s Professor with the University of California, Irvine, with academic appointments in the Computer Science and the Electrical Engineering and Computer Science Departments. His research interests include embedded systems, electronic design automation, computer architecture, optimizing compilers, system specification techniques, and distributed embedded systems.

Prof. Dutt was a recipient of Best Paper Awards from CHDL89, CHDL91, VLSI Design 2003, CODES+ISSS 2003, CNCC 2006, and ASPDAC 2006. He currently serves as Editor-in-Chief of the ACM Transactions on Design Automation of Electronic Systems (TODAES) and as an Associate Editor of the ACM Transactions on Embedded Computer Systems (TECS) and of the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS. He was an ACM SIGDA Distinguished Lecturer during 2001–2002 and an IEEE Computer Society Distinguished Visitor for 2003–2005. He has served on the steering, organizing, and program committees of several premier CAD and embedded system design conferences and workshops, including ASPDAC, CASES, CODES+ISSS, DATE, ICCAD, ISLPED, and LCTES. He serves or has served on the advisory boards of ACM SIGBED and ACM SIGDA, and previously served as Vice-Chair of ACM SIGDA and of IFIP WG 10.5. He is an ACM Distinguished Scientist and an IFIP Silver Core Awardee.