816 ieee transactions on very large scale …sridevan/index_files/01664903.pdfexploiting statistical...

14
816 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 14, NO. 8, AUGUST 2006 Exploiting Statistical Information for Implementation of Instruction Scratchpad Memory in Embedded System Andhi Janapsatya, Member, IEEE, Aleksandar Ignjatovic, and Sri Parameswaran, Member, IEEE Abstract—A method to both reduce energy and improve perfor- mance in a processor-based embedded system is described in this paper. Comprising of a scratchpad memory instead of an instruc- tion cache, the target system dynamically (at runtime) copies into the scratchpad code segments that are determined to be beneficial (in terms of energy efficiency and/or speed) to execute from the scratchpad. We develop a heuristic algorithm to select such code segments based on a metric, called concomitance. Concomitance is derived from the temporal relationships of instructions. A hardware controller is designed and implemented for managing the scratchpad memory. Strategically placed custom instructions in the program inform the hardware controller when to copy instructions from the main memory to the scratchpad. A novel heuristic algorithm is implemented for determining locations within the program where to insert these custom instructions. For a set of realistic benchmarks, experimental results indicate the method uses 41.9% lower energy (on average) and improves per- formance by 40.0% (on average) when compared to a traditional cache system which is identical in size. Index Terms—Embedded system, scratchpad memory. I. INTRODUCTION P ROCESSORS are increasingly replacing gates as the basic design block in digital circuits. This rise is neither extraor- dinary nor rapid. Just as transistors slowly gave way to gates and gates to more complex circuits, processors are progressively be- coming the predominant component in embedded systems. As microprocessors become cheap and the time to market critical, it is a natural progression for designers to abandon gates in favor of processors as the main building components. The utilization of processors in embedded systems gives rise to a plethora of opportunities to optimize designs, which are neither needed nor available to the designer of an application-specific integrated circuit (ASIC). One critical area of optimization is the reduction of power in systems while increasing or at least maintaining the perfor- mance. The criticality stems from usage of embedded systems in battery-powered devices as well as the reduced reliability of systems which operate while emanating excessive amounts of Manuscript received June 30, 2005; revised January 6, 2006. A. Janapsatya is with the School of Computer Science and Engineering, Uni- versity of New South Wales, Sydney, NSW 2052, Australia (e-mail: andhij@cse. unsw.edu.au). A. Ignjatovic and S. Parameswaran are with the School of Computer Sci- ence and Engineering, University of New South Wales, Sydney, NSW 2052, Australia, and also with NICTA, The University of New South Wales, Kens- ington, NSW 2052, Australia (e-mail: [email protected]; sridevan@cse. unsw.edu.au). Digital Object Identifier 10.1109/TVLSI.2006.878470 Fig. 1. Instruction memory hierarchy and address space partitioning. (a) Cache configuration. (b) SPM configuration. heat. Examples of existing techniques for achieving reduced en- ergy consumption in embedded systems are: shutting down parts of the processor [1], voltage scaling [2], addition of application specific instructions [3], [4], feature size reduction [5], and ad- ditional cache levels [6]. Cache memory is one of the highest energy consuming com- ponents of a modern processor. The instruction cache memory alone is reported to consume up to 27% of the processor energy [7]. We present a method for reducing instruction memory energy consumed in an embedded processor by replacing the existing instruction cache [see Fig. 1(a)] with instruction scratchpad memory (SPM); see Fig. 1(b). SPM consumes less energy per access compared to cache memory because SPM dispenses with the tag-checking that is necessary in the cache architecture. In embedded systems design, the processor architecture and the application that is to be executed are both known a priori. It is therefore possible to extensively profile the application and find code segments which are executed most frequently. This information can be analyzed and decisions made about which code segments should be executed from the SPM. Various schemes for managing SPM have been introduced in the literature and can be broadly divided into two types: static and dynamic. Both of these schemes are usually applied to the data section and/or the code section of the program. Static man- agement refers to the partitioning of data or code prior to ex- ecution and storing the appropriate partitions in the SPM with no transfers occurring between these memories during the ex- ecution phase (occasionally, the SPM is filled from the main memory at start up). Memory words are fetched to the processor 1063-8210/$20.00 © 2006 IEEE

Upload: others

Post on 15-Mar-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 816 IEEE TRANSACTIONS ON VERY LARGE SCALE …sridevan/index_files/01664903.pdfExploiting Statistical Information for Implementation of Instruction Scratchpad Memory in Embedded System

816 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 14, NO. 8, AUGUST 2006

Exploiting Statistical Information for Implementationof Instruction Scratchpad Memory

in Embedded SystemAndhi Janapsatya, Member, IEEE, Aleksandar Ignjatovic, and Sri Parameswaran, Member, IEEE

Abstract—A method to both reduce energy and improve perfor-mance in a processor-based embedded system is described in thispaper. Comprising of a scratchpad memory instead of an instruc-tion cache, the target system dynamically (at runtime) copies intothe scratchpad code segments that are determined to be beneficial(in terms of energy efficiency and/or speed) to execute from thescratchpad. We develop a heuristic algorithm to select such codesegments based on a metric, called concomitance. Concomitanceis derived from the temporal relationships of instructions. Ahardware controller is designed and implemented for managingthe scratchpad memory. Strategically placed custom instructionsin the program inform the hardware controller when to copyinstructions from the main memory to the scratchpad. A novelheuristic algorithm is implemented for determining locationswithin the program where to insert these custom instructions. Fora set of realistic benchmarks, experimental results indicate themethod uses 41.9% lower energy (on average) and improves per-formance by 40.0% (on average) when compared to a traditionalcache system which is identical in size.

Index Terms—Embedded system, scratchpad memory.

I. INTRODUCTION

PROCESSORS are increasingly replacing gates as the basicdesign block in digital circuits. This rise is neither extraor-

dinary nor rapid. Just as transistors slowly gave way to gates andgates to more complex circuits, processors are progressively be-coming the predominant component in embedded systems. Asmicroprocessors become cheap and the time to market critical,it is a natural progression for designers to abandon gates in favorof processors as the main building components. The utilizationof processors in embedded systems gives rise to a plethora ofopportunities to optimize designs, which are neither needed noravailable to the designer of an application-specific integratedcircuit (ASIC).

One critical area of optimization is the reduction of powerin systems while increasing or at least maintaining the perfor-mance. The criticality stems from usage of embedded systemsin battery-powered devices as well as the reduced reliability ofsystems which operate while emanating excessive amounts of

Manuscript received June 30, 2005; revised January 6, 2006.A. Janapsatya is with the School of Computer Science and Engineering, Uni-

versity of New South Wales, Sydney, NSW 2052, Australia (e-mail: [email protected]).

A. Ignjatovic and S. Parameswaran are with the School of Computer Sci-ence and Engineering, University of New South Wales, Sydney, NSW 2052,Australia, and also with NICTA, The University of New South Wales, Kens-ington, NSW 2052, Australia (e-mail: [email protected]; [email protected]).

Digital Object Identifier 10.1109/TVLSI.2006.878470

Fig. 1. Instruction memory hierarchy and address space partitioning. (a) Cacheconfiguration. (b) SPM configuration.

heat. Examples of existing techniques for achieving reduced en-ergy consumption in embedded systems are: shutting down partsof the processor [1], voltage scaling [2], addition of applicationspecific instructions [3], [4], feature size reduction [5], and ad-ditional cache levels [6].

Cache memory is one of the highest energy consuming com-ponents of a modern processor. The instruction cache memoryalone is reported to consume up to 27% of the processor energy[7]. We present a method for reducing instruction memoryenergy consumed in an embedded processor by replacingthe existing instruction cache [see Fig. 1(a)] with instructionscratchpad memory (SPM); see Fig. 1(b). SPM consumesless energy per access compared to cache memory becauseSPM dispenses with the tag-checking that is necessary in thecache architecture. In embedded systems design, the processorarchitecture and the application that is to be executed are bothknown a priori. It is therefore possible to extensively profile theapplication and find code segments which are executed mostfrequently. This information can be analyzed and decisionsmade about which code segments should be executed from theSPM.

Various schemes for managing SPM have been introduced inthe literature and can be broadly divided into two types: staticand dynamic. Both of these schemes are usually applied to thedata section and/or the code section of the program. Static man-agement refers to the partitioning of data or code prior to ex-ecution and storing the appropriate partitions in the SPM withno transfers occurring between these memories during the ex-ecution phase (occasionally, the SPM is filled from the mainmemory at start up). Memory words are fetched to the processor

1063-8210/$20.00 © 2006 IEEE

Page 2: 816 IEEE TRANSACTIONS ON VERY LARGE SCALE …sridevan/index_files/01664903.pdfExploiting Statistical Information for Implementation of Instruction Scratchpad Memory in Embedded System

JANAPSATYA et al.: EXPLOITING STATISTICAL INFORMATION FOR IMPLEMENTATION OF INSTRUCTION SPM IN EMBEDDED SYSTEM 817

Fig. 2. Architecture of cache in comparison to SPM. (a) Cache architecture.(b) SPM architecture.

directly from either of the memory partitions. Dynamic manage-ment, on the other hand, moves highly utilized memory wordsfrom the main memory to the SPM, before transferring to theprocessor, thereby allowing the code or data executed from theSPM to have a larger total memory footprint than the SPM.

In this paper, we present a novel architecture, containing aspecial hardware unit to manage dynamic copying of code seg-ments from the main memory to the SPM. We describe the useof a specially created instruction which triggers copying fromthe main memory to the SPM. We further set forth heuristic al-gorithms which rapidly search for the sets of code segments tobe copied into the SPM and where to place the specially createdinstructions to maximize benefit. The whole system was evalu-ated using benchmarks from Mediabench [37] and UTDSP [38].

The remainder of this paper is organized as follows.Section II provides the motivation and the assumptions madefor this study. Section III describes previous works on SPM andpresents the contributions of our work. Section IV introducesour hardware strategy for using the SPM. Section V definesthe concomitance information, Section VI formally describesthe problem. Section VII presents the heuristic algorithm forpartitioning the application code. Section VIII provides the ex-perimental setup and results, and, finally, Section IX concludesthis paper.

II. MOTIVATION AND ASSUMPTIONS

A. Motivation

An SPM is a memory comprised of only the data array(SRAM cells) and the decoding logic. This is illustrated inFig. 2(b). A cache shown in Fig. 2(a), in comparison, has dataarray (SRAM cells), decoding logic, tag array (SRAM cells),and comparator logic. Comparator logic checks if the existingcontent in cache matches the current CPU request and thus de-termines whether a cache hit or a cache miss has occurred. Theaccess time and the access energy of individual componentswithin the cache have been estimated using the tool CACTI[35]. CACTI results show that, for a cache system consistingof 8 KB with associativities ranging from 1 to 32, the accesstime of the cache tag-RAM array is 38.9% longer on averagecompared with the access time of the cache data-RAM array,

and the access energy of the data-RAM is, on average, 72.5%of the total cache access energy.

Each piece of data is placed into the cache according toits main memory location, identified by the tag. Thus, cachememory is transparent to the CPU. The SPM is used as a re-placement of (level 1) instruction cache. The absence of the tagarray within the SPM means that the CPU has to know wherea particular instruction resides within the SPM. To allow theCPU to locate data, SPM and the DRAM span a single memoryaddress space partitioned between them.

Selecting the correct code segments for placement onto theSPM requires a careful analysis of the way such code segmentsare executed. All prior work in this area has focused upon loopanalysis of the trace of a program as the method for finding theappropriate code segments to be placed in the SPM. Loop anal-ysis has several drawbacks: 1) the structure and relationship ofloops can be very complex and thus difficult to analyse; 2) thisstructure can significantly vary for different inputs; and 3) theprecise structure of loops is irrelevant for the placement of in-structions in the SPM because only relative (temporal) prox-imity of executions of blocks of code matters, rather than theprecise order of these instructions as provided by the loop anal-ysis; see [9]–[12]. Thus, in this paper, instead of looking at theloop structure, we analyze temporal correlation of executions ofblocks of code using statistical methods with a signal processingflavor. The trace is seen as a “signal” on which we perform a sta-tistical rather than a structural analysis.

We call the measure we introduce for temporal proximity ofconsecutive executions of the same block of the code the self-concomitance of the block, and we use it to decide if the blockshould be executed from the SPM or from the main memory.Similarly, the measure for temporal proximity of interleavedexecutions of two different blocks of the code we call the con-comitance, and we use it to decide whether two blocks should beplaced in SPM simultaneously or if they can overlap in the SPM.

Such analysis of the trace has proven to yield an algorithmfor SPM placement with performance results that are superiorto the loop analysis method, and the algorithm is also muchsimpler, more efficient, and adaptive to different types of appli-cations by changing only the distance function used to define theconcomitance and self-concomitance. Recently, we found outthat a related but cruder and less general idea has been used forcache management in [13]. It uses a simple counting functionrather than a decreasing function of distance, has no notion cor-responding to our self-concomitance concept, and operates on acoarser level of granularity (procedures rather than basic blocks).

B. Motivational Example for Concomitance

The benefits of using temporal proximity information isillustrated in the following example. Consider an “if-clause”shown in Fig. 3(a), with the control flow graph (CFG) shownin Fig. 3(b). For and , the profiling woulddetermine that the first 50 executions of the “if-clause” willyield 50 executions of the block corresponding to ,while the next 50 executions of the “if-clause” will yield 50executions of the block for . Thus, executions of theblock corresponding to will have a large temporaldistance to the executions of the block for , and we

Page 3: 816 IEEE TRANSACTIONS ON VERY LARGE SCALE …sridevan/index_files/01664903.pdfExploiting Statistical Information for Implementation of Instruction Scratchpad Memory in Embedded System

818 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 14, NO. 8, AUGUST 2006

Fig. 3. Example code. (a) If-clause code segment. (b) If-clause CFG.

will demonstrate that, in such a situation, the concomitance ofthese two blocks will be small. Thus, blocks for and

need not be placed into the SPM simultaneously,but can be placed to overlap with one another in the SPM, yetwithout degrading the performance because their executiondoes not overlap in time. Note that, if only the frequency ofexecution was used, the CFG would only inform the algorithmthat both blocks were executed 50 times, without providingcrucial information that the pattern of execution is such thatthey can overlap on the scratchpad without any penalty. Notethat, should the “if-clause” be of the form if , thenexecutions of and would alternate.Under such circumstances, executions of the blocks corre-sponding to these two functions would be interleaved withshort temporal distances. We will see that this makes theirconcomitance large, and the SPM placement algorithm wouldbe informed that these two blocks must be placed into the SPMsimultaneously and thus without overlap.

C. Assumptions

During the design and implementation of our methodology,the following assumptions are made.

• The size of the SPM or cache memory is only available inpowers of two, and the size is passed to the algorithm. Thisis a reasonable assumption since the program is to be exe-cuted in an embedded system where the underlying hard-ware is known in advance. This assumption allows betteroptimization. If a number of processors with differing sizesof SPM have to be serviced, the SPM size can be made anadjustable parameter for the SPM management algorithm.

• The program size is larger than the SPM size. This is avalid assumption, because energy consumptions and costconstraints dictate architectures with relatively small sizeSPMs, compared with the size of a typical application.

• The size of the largest basic block is less than or equal tothe SPM size. This assumption is quite valid in embeddedsystems where basic blocks are usually rather small. How-ever, if the basic block is too large for the SPM, it can besplit into a smaller part that fits into the SPM and the re-mainder that is executed from the main memory.

• Each instruction is exclusively executed from either theSPM or the main memory. This ensures that it is nevernecessary to have duplicates of parts of the code or to alterbranch destinations during the execution of the program.

• Program traces used for profiling provide an accurate de-piction of the program behavior during its execution. This

assumption is reasonable when a sufficiently large inputspace has been applied. The amount of profiling needed toobtain a particular confidence interval is given in [32].

• We do not consider higher level caches. In an embeddedsystem, where frequently there is no cache at all, it is un-likely that more than a single level of cache is available.Moreover, having higher level caches would not reduce theeffectiveness of the approach.

III. RELATED WORK

In the past, use of the SPM replacing cache memory hasbeen shown to improve the performance and reduce energy con-sumption [8]. Existing works on cache optimization techniquesrely on careful placement of instructions and/or data within thememory to ensure low cache miss rates. Cache optimizationmethods generally increase the program memory size [14]–[21].

In 1997, Panda et al. [22] presented a scheme to staticallymanage an SPM for use as data memory. They describe a parti-tioning problem for deciding whether data variables should belocated in the SPM or DRAM (accessed via the data cache).Their algorithm maximizes cache hit rates by allowing data vari-ables to be executed from the SPM. Avissar et al. [23] presenteda different static scheme for utilizing the SPM as data memory.They presented an algorithm that can partition the data variablesas well as the data stack among different memory units, suchas SPM , DRAM, and ROM to maximize performance. Exper-imental results shows that Avissar et al. were able to achieve30% performance improvement over traditional direct-mappeddata caches. By allowing part of the data stack to operate fromthe SPM, a performance improvement of 44.2% was obtainedcompared with a system without SPM.

In 2001, Kandemir et al. introduced dynamic managementof the SPM for data memory [24]. Their memory architectureconsisted of an SPM and a DRAM which was accessed throughcache. To transfer data from the DRAM into the SPM, they cre-ated two new software data transfer calls; one to read instruc-tions from DRAM and another to write instructions into theSPM. Multiple optimization strategies were presented to par-tition data variables between the SPM and DRAM. The bestresults were obtained from the hand-optimized version. Theirresults shows a 29.4% improvement in performance comparedwith a traditional cache memory system. In addition, they alsoshow that it is possible to obtain up to 10.2% performance im-provement when using a dynamic management scheme for SPMcompared with the static version. The work presented in [24]applied only to well-structured scientific and multimedia appli-cations, with each loop nest needing independent analysis.

In 2003, Udayakumaran and Barua [25] improved upon [24]and presented a more general scheme for dynamic managementof the SPM as data memory. The method from [25] analyzes thewhole program at once and aims to exploit locality throughoutthe whole program, instead of individual loop nesting. The re-ported reduction in execution time is by an average of 31.2%over the optimal static allocation scheme.

In 2002, Steinke [26] presented a statically managed SPM forboth instruction and data. Their results show, on average, a 23%reduction in energy consumption over a cache solution.

Page 4: 816 IEEE TRANSACTIONS ON VERY LARGE SCALE …sridevan/index_files/01664903.pdfExploiting Statistical Information for Implementation of Instruction Scratchpad Memory in Embedded System

JANAPSATYA et al.: EXPLOITING STATISTICAL INFORMATION FOR IMPLEMENTATION OF INSTRUCTION SPM IN EMBEDDED SYSTEM 819

In 2003, Angiolini et al. presented another SPM static man-agement scheme for data memory and instruction memory thatemploys a polynomial time algorithm for partitioning data andinstructions into DRAM and SPM [27]. Their results show en-ergy improvement ranging from 39% to 84% over an unparti-tioned SPM. In 2004, Angiolini et al. presented a different ap-proach to static usage scheme for the SPM [28] by mapping ap-plications to existing hardware.

In [29], Steinke et al. presented a dynamic managementscheme for an instruction SPM. For each instruction to becopied into the SPM, they insert a load and store instruction.The load instruction brings the instruction to be copied from themain memory to the processor, and a store instruction stores itback into the SPM. Their algorithm for determining locations inthe program for inserting load and store instructions is based ona solution of a integer linear programming (ILP) problem that isobtained by using an ILP solver. They conducted experimentswith small applications (e.g., bubble sort, heap sort, and “bi-quad” from DSP stone benchmark suit), and their results showan average 29.9% energy reduction and a 25.2% performanceimprovement compared with a cache solution.

In 2004, Janapsatya et al. presented a hardware/software ap-proach for a dynamic management scheme of an SPM [30].Their approach introduces a specially designed hardware com-ponent to manage the copying of instructions from the mainmemory into the SPM. They utilize instructions execution fre-quency information to model applications as graph and performgraph partitioning to determine locations within the program forinitiating the copying process.

A. Contributions

Our work improves upon the state of the art in the followingways.

• We design a novel architecture to perform dynamic man-agement of instruction code between SPM and mainmemory.

• We replace a difficult structural loop analysis of the pro-gram by an essentially statistical method for automated de-cision making regarding which basic blocks should be ex-ecuted from the SPM and, among them, which groups ofbasic blocks should be placed in the SPM simultaneouslywithout an overlap.

• We develop a novel graph partitioning algorithm for split-ting the instruction code into segments that will eitherbe copied into the SPM or executed directly from themain memory in a way that reduces the overall energydissipation.

We evaluated the system by using realistic embedded applica-tions and show performance and energy comparisons with pro-cessors of various cache sizes and differing associativities.

IV. SYSTEM ARCHITECTURE

A. Hardware Implementation

The proposed methodology modifies the processor by addingan SPM controller, which is responsible for copying instructionsfrom DRAM to SPM and stalling the CPU whenever copying is

Fig. 4. SPM system architecture.

Fig. 5. Architecture of the SPM controller.

in progress. Fig. 4 shows the block diagram of the SPM con-troller, SPM, DRAM, and the CPU. A new instruction calledScratchpad Managing Instruction (SMI) is implemented. TheSMI is used to activate the SPM controller; SMIs are insertedwithin the program and are executed whenever the CPU encoun-ters them. The cost of the CPU stalling while the SPM is beingcopied is identical to servicing a cache miss, and the number oftimes the SMI was executed provides useful information for de-termining the application execution time. Further performanceimprovement is possible if copying can be performed a few cy-cles before the instructions are to be executed, assuming thatthere is no conflict in the SPM location between the instructionsto be executed and the instructions to be copied.

The micro-architecture of the SPM controller is shown inFig. 5 and includes the basic block table (BBT). The BBT storesthe following static information: start addresses of the basicblocks to be read from within the DRAM; how many instruc-tions are to be copied; and the start addresses for storing thebasic blocks into the SPM. Content of the BBT is filled by thealgorithm shown in Fig. 6.

With the addition of SPM, an example of program executionis as follows: at the start of a program, the first instruction isalways fetched from DRAM. When an SMI is executed, theCPU will activate the SPM controller. The SPM controller willthen stall the CPU, read information from the BBT, and provide

Page 5: 816 IEEE TRANSACTIONS ON VERY LARGE SCALE …sridevan/index_files/01664903.pdfExploiting Statistical Information for Implementation of Instruction Scratchpad Memory in Embedded System

820 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 14, NO. 8, AUGUST 2006

Fig. 6. Methodology for implementing the SPM.

addresses to DRAM and SPM for copying instructions fromDRAM into SPM. After copying is completed, the SPM con-troller will release the CPU, which will then continue to fetchand execute instructions from DRAM. Whenever a branchinginstruction leads program execution to the SPM, the CPU con-tinues to fetch instructions from the SPM.

Fig. 6 shows the steps for implementing the proposedmethodology. The methodology operates on the assembly/machine code level. Profiling and tracing (using modifiedsimplescalar/Pisa 3.0d [31] to output instruction trace into agzip file) is performed on the assembly/machine code to obtainan accurate representation of the program behavior. Programtrace was generated using SimpleScalar 3.0d out-of-orderexecution. To ensure proper order of execution of SMI in anout-of-order machine, the instructions to be copied should bemade dependent on the SMI. This is similar to a load operationfollowed by register operation upon the loaded data. Programbehavior is captured by extracting the concomitance infor-mation as explained below. The SPM placement algorithmuses the concomitance information to determine appropriatelocations for inserting SMI within the program and to constructa table (BBT) that specifies the positions where the blocks willbe copied.

B. Software Modification

The new instruction, SMI, is an instruction with a singleoperand. The operand of the SMI represents the address ofa BBT entry. For a 32-b instruction with a 16-b operand, itis possible to have up to 65 536 entries in the BBT, allowing65 536 SMIs to be inserted into a program.

The code can be modified by insertion of an SMI in the fol-lowing three cases.

1) An SMI can be inserted just before an unconditionalbranch [see Fig. 7(a)]; in this case, the destination ad-dress of the unconditional branch is altered to point tothe correct address within the SPM.

2) An SMI can be inserted to be the new destination of aconditional branch [see Fig. 7(b)]; if the branch testspositive, such an SMI will copy the basic block to be exe-cuted from the appropriate location in the SPM; an extra

Fig. 7. Condition for insertion of SMI into programs.

jump instruction is added immediately after the SMI, totransfer the program execution to that SPM location.

3) An SMI instruction can be inserted just after a condi-tional branch instruction [see Fig. 7(c)]. In this case,if the conditional branch tests negative, the SMI willcopy the basic block that is to be executed following thebranch instruction. Again, an extra jump instruction isadded immediately after the SMI to transfer program ex-ecution to the SPM.

An extra branch instruction may also be needed at the end ofthe basic block if execution flow is required to jump to anotherlocation within memory. If there is an unconditional branch atthe end of the basic block, then the branch destination can besimply altered, or modified as in 1). If there are conditionalbranches, then we may have to modify as per 2) or 3) dependingupon whether they are to be executed from DRAM or SPM.

The algorithm shown in Fig. 6 outputs the modified software,including the addition of any extra branching instructions.

V. CONCOMITANCE

A basic block, by definition, is the largest chain of consecu-tive instructions that has the following properties: 1) if the firstinstruction of the block is executed, then all instructions in thebasic block will also be executed consecutively and 2) any in-struction of the basic block is executed only as a part of the con-secutive execution of the whole block.

The distance between two consecutive executions andof a basic block in the trace of a run of a program

is defined as follows. If, between the executions andof , there are no other occurrences of in , we count thenumber of distinct instruction steps executed between and

, including . We call this value the distance betweenand and denote it by . For example, assumethat “ ” is a sequence of consecutive executionsin a trace and that each of the basic blocks , , , andcontains ten distinct instructions; then the distance betweenand is 40, because only , , and appear between the twoexecutions of the basic block (and we include itself in thecount).

The weight function is used to give a decreasing significanceto the two consecutive executions of the same block that arefurther apart in the sense of the above notion of distance. Thus, itis a nonnegative real function that is decreasing, i.e., suchthat, if , are real numbers and , then .

The trace concomitance gives information abouthow tightly interleaved the executions of two distinct basicblocks and in the trace are. Thus, for a basic block ,

Page 6: 816 IEEE TRANSACTIONS ON VERY LARGE SCALE …sridevan/index_files/01664903.pdfExploiting Statistical Information for Implementation of Instruction Scratchpad Memory in Embedded System

JANAPSATYA et al.: EXPLOITING STATISTICAL INFORMATION FOR IMPLEMENTATION OF INSTRUCTION SPM IN EMBEDDED SYSTEM 821

we consider all of its consecutive executions and inthe trace , for which there exists at least one execution of theblock between the executions and ; we denote sucha fact by . We now also reverse the roles ofand , and define by

Here, in the sum means that ranges over all ex-ecutions of the basic block that appear in the trace . Notethat, for two distinct basic blocks and , the concomitance ofthese two blocks will be large just in case is often executed be-tween two consecutive executions of that are a short distanceapart, and/or if is often executed between two consecutive ex-ecutions of that are also a short distance apart, in the sense ofdistance defined above.

The trace self-concomitance of a basic block is ameasure of how clustered consecutive executions of the blockare and is defined as

Thus, trace self-concomitance has a large value for thosebasic blocks whose executions appear in clusters, with allsuccessive pairs of executions within each cluster separated byshort distances. Note that, even if is executed relatively fre-quently, but such executions of are dispersed in the tracerather than clustered, then self-concomitance will stillbe low. On the other hand, if, for a certain input, a particular loopis frequently executed, then the trace self-concomitanceof each basic block from this loop will be large. Thus, theloop structure of a program is reflected in the statistics of theconcomitance values if such statistics are taken over executionswith a sufficient number of inputs reasonably representing whatis expected in practice. This is the motivation for the followingdefinitions.

The concomitance of a pair of basic blocks andfor given probability distribution of inputs is the correspondingexpected value of trace concomitance .

The self-concomitance of a basic block for a given aprobability distribution of inputs is the corresponding expectedvalue of trace self-concomitance .

To conveniently use the concomitance and self-concomitancein our scratchpad placement algorithms, we construct the con-comitance table by the following profiling procedure.

• Chose a suitable weight function . In our experi-ments so far, we have studied two types of weight func-tions: and , where is aconstant depending on the size of the scratchpad.

• Run the program with inputs that reasonably represent theprobability distribution of inputs expected in practice.

• Calculate the average value of the trace self-concomitanceobtained from such runs, thus obtaining the self-concomi-tance value .

• Set a threshold of significance for the value of self-con-comitance of basic blocks. The set of all blocks with sig-nificant self-concomitance (i.e., larger than the threshold)is formed.

• Calculate the concomitance for all pairs of basic blocksfrom such set , by finding the average of all trace con-comitances obtained from the runs of the program and thenform the corresponding table. Since the concomitance iscommutative, such a table is symmetric; thus, we recordonly its lower left triangle. The self-concomitance isconveniently placed on the diagonal of the table.

The size of the table is given by

Size

where is the total number of basic blocks in the set .Time complexity of the construction of the concomitancetable is bounded by , where is the size of the trace ininstructions.

VI. PROBLEM DESCRIPTION

To copy the important code segments into SPM, SMIs areinserted into strategic locations within the program. To deter-mine strategic locations for insertion of SMIs, we transformedthe problem into a graph-partitioning problem.

Given any program, it can be represented as a graph as fol-lows. The vertices represent basic blocks belonging to the set

of basic blocks having large self-concomitance. The weightof each vertex represents the number of instructions within thebasic block (size of the basic block), and the weight of each edgerepresents the concomitance value between the blocks joined bythe edge. An illustration of the basic block is given in Fig. 8(a),and its corresponding concomitance graph is shown in Fig. 8(b).

The problem is to find vertices within the graph (i.e., the basicblocks of instructions) where one should insert the custom in-struction, SMI.

VII. CODE-PARTITIONING ALGORITHM

A. Graph-Partitioning Problem

Consider the following graph-partitioning problem.Assume that we are given:

1) a graph with a set of vertices and a set of edges ;2) for each vertex , we are given a vertex weight

and for each edge an edge weight (theconcomitance value);

3) a constant (size of SPM).Find a partition of the graph, such that each subgraph has atotal vertex weight less than or equal to , and the sum ofedge weights of all of the edges connecting these subgraphs isminimal.

It is easy to see that the “Minimum Cut into Bounded SetsProblem” [33] is P-time Turing reducible to our graph-parti-tioning problem; thus, we cannot expect to find a polynomial

Page 7: 816 IEEE TRANSACTIONS ON VERY LARGE SCALE …sridevan/index_files/01664903.pdfExploiting Statistical Information for Implementation of Instruction Scratchpad Memory in Embedded System

822 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 14, NO. 8, AUGUST 2006

Fig. 8. Example of the concomitance graph. (a) Basic block. (b) Corresponding concomitance graph.

Fig. 9. Graph-partitioning procedure.

time algorithm, and we must use a heuristic approximation al-gorithm to work on real applications. We implemented an al-gorithm consisting of two procedures: graph-partitioning pro-cedure and removal of insignificant subgraphs procedure, asshown in Fig. 9.

We start with the set of basic blocks whose self-concomi-tance is larger than the chosen threshold and consider a graphwhose vertices are all elements of . All vertices of this graphare initially pairwise connected, thus forming a complete graph(a clique). The edges have weights that are equal to the concomi-tance between their two vertices. We now order the set of edges

according to their weights in an ascending way; thus, the edgeat the top of the list is one of the edges with the lowest concomi-tance value. Notice that, since we start with a complete graph,

. We eliminate all of the edges of thegraph which have such a minimal weight value. This usually in-duces a partition of the graph into several connected subgraphs;the total weight of the vertices for each of these connected sub-graphs is calculated and compared with . We then apply the

same procedure to all connected subgraphs whose total vertexweight is larger than , and the procedure stops when all of theconnected subgraphs have a total vertex weight smaller or equalto . It is easy to see that the complexity of the graph-parti-tioning procedure is .

After we have produced the subgraphs using the concomi-tance, we now consider the CFG graph with edges havingweights equal to the total number of traversals of the edgeduring execution.

First of all, there can be subgraphs that are not worthwhileto copy and execute from SPM. For example, the graph-parti-tioning procedure can produce results such as those shown inFig. 11. The dotted lines indicate the partitions of the graph.It can be seen that partition 2 resulted in basic block B as asubgraph and it will only ever be executed once. Thus, it is notcost-effective to first read it from DRAM and copy it to SPM,since one can just read it and execute it directly from DRAM.To search and remove these type of subgraphs, another heuristicprocedure is implemented.

B. Removal of Insignificant Subgraphs Procedure

Such a heuristic procedure is shown in Fig. 10. It starts bycalculating the energy cost of executing each subgraph fromDRAM and the energy cost of executing such subgraph fromSPM. Energy cost of executing from DRAM iscalculated using

DRAM (1)

where DRAM is the energy cost of a single DRAM accessper instruction, is the number of instructions in the basicblock corresponding to the vertex , and is the total numberof executions of the basic block corresponding to the vertex .

Page 8: 816 IEEE TRANSACTIONS ON VERY LARGE SCALE …sridevan/index_files/01664903.pdfExploiting Statistical Information for Implementation of Instruction Scratchpad Memory in Embedded System

JANAPSATYA et al.: EXPLOITING STATISTICAL INFORMATION FOR IMPLEMENTATION OF INSTRUCTION SPM IN EMBEDDED SYSTEM 823

Fig. 10. Removal of insignificant subgraph procedure.

Equation (2) shows the energy cost of executing in-structions within the subgraph from SPM, including the costof accessing the DRAM once and copying to SPM

SPM (2)

where SPM is the energy cost of a single SPM access per in-struction. is defined as (3), shown at the bottom ofthe page, where is the set of all of the edges that are in-coming to the subgraph from all other subgraphs,is the energy cost of executing the special instruction includingreading from DRAM and copying to SPM, and is thecost of adding a branch instruction if necessary. This happens ifthe proper execution of the branching instruction should changethe flow from DRAM to SPM or vice versa. If such an extrabranching instruction is necessary for an edge , constantis equal to 1, else is 0.

The energy calculation is used to classify which subgraphsare to be executed from SPM and which subgraphs are to beexecuted from DRAM; only subgraphs with

will be executed from SPM. We call such a sub-graph an SPM-subgraph .

The remainder of the procedure inserts the SMI as follows.For every SPM-subgraph, it will examine all incoming edges tothis subgraph . We determine if such an edge is includedin a path either from another SPM-subgraph or from the startof the program, and only in such cases an SMI is inserted. Thismeans that, if all paths including this edge emanate from thesame SPM-subgraph , then an SMI need not be inserted. Thecomplexity of this procedure is , where is the numberof vertices in the graph.

Fig. 11. Subgraph representation.

TABLE ISIMPLESCALAR CONFIGURATION

For example, in the CFG shown in Fig. 11, there are three sub-graphs created by the graph partitioning procedure. Subgraph 1consists of the basic block B only. Assume that the energy costestimation using (1) and (2) classified basic block B to be exe-cuted from DRAM. Consider the edge from B to D. All pathscontaining the edge from B to D also contain the edge from Ato B. Since A belongs to the same SPM subgraph as D, by ourprocedure no SMI will be inserted in the edge from B to D. Inthis way, we avoid unnecessary replacement of instructions thatare already in the SPM.

The algorithm described in this section assumed the use of theconcomitance value for the partitioning of the graph. Similaralgorithm can be used with frequency (as in [30]) instead ofconcomitance, and the results are compared in Section VIII.

VIII. EXPERIMENTAL RESULTS

A. Experimental Setup

We simulated a number of benchmarks using simplescalar/PISA 3.0d [31] to obtain memory access statistics. Power figuresfor the CPU were calculated using Wattch [34] (for a 0.18- mprocess). CACTI 3.2 [35] was used as the energy model for thecache memory. The energy model for the SPM was extractedfrom CACTI as in [8]. The DRAM power figures were takenfrom IBM embedded DRAM SA-27E [36]. The configurationfor the simulated CPU is shown in Table I.

DRAM SPM (3)

Page 9: 816 IEEE TRANSACTIONS ON VERY LARGE SCALE …sridevan/index_files/01664903.pdfExploiting Statistical Information for Implementation of Instruction Scratchpad Memory in Embedded System

824 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 14, NO. 8, AUGUST 2006

TABLE IICOST OF ADDING AND EXECUTING COPY INSTRUCTIONS

Fig. 12. Cache tag-RAM size compared with the BBT size in bits.

All benchmarks were obtained from the Mediabench suite[37] except for the benchmark histogram which were from theUTDSP test suite [38]. The total number of instructions exe-cuted in each benchmark is tabulated in Table II.

B. Area Cost for Inclusion of SMI Hardware Controller

In cache, the tag array memory cells keep record of the entrieswithin the cache data array while, for the SPM system proposedhere, the BBT keeps a record of where each set of code seg-ment(s) is to be placed within the SPM. Each BBT entry needsto store one DRAM address, one SPM address, and the numberof instructions to be copied. Since the most significant bits ofthe SPM address is known from the memory map [as shown inFig. 2(b)], only the least significant bits of the SPM address needto be stored within the BBT.

For example, given DRAM size of 2 MB, SPM size of 1 KB,and instruction size of 8 B, there exists enough space to storeup to 256 K instructions within the DRAM and 128 instructionswithin the SPM. Thus, the number of instructions that need tobe copied ranges from 1 to 128 requiring 7 b per entry. EachDRAM address requires 18 b per entry (to manage 256 K in-structions), and each SPM address requires 7 b per entry. Intotal, each BBT entry requires enough space to store 32 b.

Fig. 12 shows a comparison between the number of bits ina cache tag RAM cells and the average size of the BBT fordifferent size cache/SPM with a main memory size of 2 MB. Theaverage size of the BBT is calculated from all of the benchmarksshown in Table II. From Fig. 12, it can be seen that cache tagRAMcells grow linearly as the cache size grows exponentially,but the BBT size decreases as the SPM grows exponentially.

BBT size decreases as SPM size increases because, with a largerSPM, fewer SMIs are needed.

Table III shows a comparison of the total number of memoryaccesses. Performance and energy results measured from the ex-periments are shown in Tables IV and V, and the cost of addingcopy locations within the program is shown in Table II.

In Table II, column 1 shows the application name, column 2gives the size of the program, column 3 the numberof instructionsexecuted, column 4 gives the average number of copy locationsto be inserted (average is taken from varying SPM sizes rangingfrom 1 KB to 16 KB), column 5 shows the average numberof instructions that need to be copied into the SPM using theconcomitance method, and columns 6 and 7 present the resultsfrom using frequency as the edge value [30]. Comparing figuresin Table II, columns 5 and 7 show that the number of instructionsto be copied into the SPM is significantly reduced using theconcomitance method compared to the frequency method.

An SPM controller with BBT size of 128 entries( b) was implemented in Verilog. TheSPM controller is a finite state machine that accesses theBBT and forwards the output of the BBT as DRAM and SPMmemory addresses. Power estimation of the SPM controllerwas done using a commercial power estimation tool with thefollowing settings: Clock Frequency MHz; Voltage1.8 V; using a 0.18- m technology. Power estimation showedthat it consumed 2.94 mW of power. This is in comparisonwith cache tag power of 159 mW, obtained from CACTI [35],for a 0.18- m 64 bytes of direct map cache. The cache powerwould progressively increase for greater ways of associativity.In addition, we also synthesized the SPM controller usingSynopsys tools [39], for a 0.18- m process. The result showsthat the SPM controller is approximately 1108 gates in size.Comparing the size of the SPM controller with a synthesizedversion of the simplescalar CPU (obtained from [40], thesize of the synthesized CPU without the cache memories isapproximately 82 200 gates.) shows that the SPM controller isapproximately 1.35% of the total CPU size.

C. Performance Cost to Accommodate the Execution of SMI

Table III compares the total number of memory accesses of acache system, the system with scratchpad using the frequencymethod for capturing the program behavior [30], and the systemwith scratchpad using the concomitance metric as a method forcapturing the program behavior. Column 1 in Table III gives

Page 10: 816 IEEE TRANSACTIONS ON VERY LARGE SCALE …sridevan/index_files/01664903.pdfExploiting Statistical Information for Implementation of Instruction Scratchpad Memory in Embedded System

JANAPSATYA et al.: EXPLOITING STATISTICAL INFORMATION FOR IMPLEMENTATION OF INSTRUCTION SPM IN EMBEDDED SYSTEM 825

TABLE IIITOTAL MEMORY ACCESS COMPARISON. (CACHE RESULT SHOWS THE TOTAL

CACHE MISSES. DRAM ACCESS IS THE TOTAL NUMBER OF INSTRUCTIONS

EXECUTED FROM DRAM PLUS THE NUMBER OF INSTRUCTIONS COPIED

FROM DRAM TO THE SPM)

the application name; column 2 shows the cache or SPM size;columns 3–7 show the total number of cache misses for differentcache associativities; column 8 gives the total DRAM accessesusing the frequency optimization method; column 9 showsthe total DRAM accesses obtained from using the concomi-tance optimization method; column 10 shows the percentage

TABLE IVSYSTEM’S ENERGY COMPARISON

improvement of the concomitance method over the averagevalues of the results of the cache system (in columns 3–7); andcolumn 11 compares the percentage improvement of the con-comitance method over the frequency method. In column 11,it is shown that the concomitance method reduces the totalnumber of DRAM accesses by an average of 52.94% comparedwith the frequency method. When comparison is made with a

Page 11: 816 IEEE TRANSACTIONS ON VERY LARGE SCALE …sridevan/index_files/01664903.pdfExploiting Statistical Information for Implementation of Instruction Scratchpad Memory in Embedded System

826 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 14, NO. 8, AUGUST 2006

TABLE VSYSTEM’S PERFORMANCE COMPARISON

cache based system (columns 3–7), it is shown that the totalnumber of DRAM accesses for the SPM system can be largerthan the total number of cache misses. Despite the highertotal DRAM accesses in a SPM system compared to the cachesystem, it does not translate to worse energy consumption andworse performance compared to a cache system (such casesare highlighted in bold in Tables III–V.) This is because theenergy and time cost per SPM access is far less compared with

the energy and time cost per cache access (especially whencompared with the energy and time cost of accessing a 16-wayset associative cache.). It can also be noted that total numberof DRAM accesses for an SPM system is comprised of boththe number of instructions to be executed from DRAM andthe total number of instructions copied from DRAM to SPM.Copying instructions from DRAM to SPM causes a sequentialDRAM access which consumes less power and time comparedwith a random DRAM access that happens on each cache miss.

D. Analysis of Results

The experimental setup for measuring the performanceand energy consumption is shown in Fig. 13. Performance ofthe memory architecture is evaluated by calculating the totalmemory accesses for a complete program execution. Estimationof memory access time is possible due to known SPM accesstime, cache access time, DRAM access time, hit rates of cache,and number of times the SPM contents are changed.

Total access time of the cache architecture system is calcu-lated using

cache (ns) cache cache cache (ns)

cache DRAM (ns) (4)

where cache is the total number of cache hits, cache is thenumber of cache misses, cache is the access time to fetchone instruction from the cache, and DRAM is the amountof time spent to access one DRAM instruction.

We estimate the total memory access time for the SPM archi-tecture using

SPM (ns) SPM SPM (ns)

SMI copy DRAM (ns)

DRAM DRAM (ns) (5)

where SPM is the number of instructions executed fromSPM, SPM is the amount of time needed to fetch one in-struction from the SPM, SMI is the total number of times allSMIs are executed, copy is the total number of instructionscopied from DRAM to SPM during program runtime, andDRAM is the number of instruction executed from DRAM.

SPM memory access time is compared with cache memoryaccess time using the following equation:

SPM (ns)cache (ns)

(6)

Energy consumption comparison is shown in Table IV. Thecache energy consumption is calculated using

cache cache cache

cache cache

(7)

where is the energy cost per DRAM access andis the energy consumed by the CPU.

Page 12: 816 IEEE TRANSACTIONS ON VERY LARGE SCALE …sridevan/index_files/01664903.pdfExploiting Statistical Information for Implementation of Instruction Scratchpad Memory in Embedded System

JANAPSATYA et al.: EXPLOITING STATISTICAL INFORMATION FOR IMPLEMENTATION OF INSTRUCTION SPM IN EMBEDDED SYSTEM 827

Fig. 13. Experimental setup.

Fig. 14. Energy Comparison for 8 KB SRAM. (y-axes are energy in mJ).

The SPM energy is calculated using

SPM SPM DRAM

SPM (8)

where is the energy cost per SPM access, is the en-ergy cost per execution of the special instruction, is the totalnumber of times SMIs are executed, is the energy costfor any additional branch instructions, and is the total numberof times any additional branch instructions are executed.

The percentage difference between the SPM energy con-sumption and the cache energy consumption is calculated usingthe following equation:

SPM

(9)

Results of the percentage energy improvement are shown inTable IV.

Table IV shows energy comparison of: 1) a cache system; 2)a scratchpad system using the frequency method [30]; and 3)

a scratchpad system using the concomitance method. The tablestructure is identical to Table III except that the comparison isnow for energy. Fig. 14 shows the energy improvement compar-ison between the cache system, the frequency SPM allocationprocedure [30], and the concomitance SPM allocation methodfor all the benchmarks. The energy comparison shows that theconcomitance method almost always performs better comparedwith the cache system, and superior results are seen when com-pared with results of the frequency method. On average, theenergy consumption by utilizing the concomitance method is41.9% better than cache system energy and 27.1% better thanthe frequency method. In particular, the concomitance methodis superior in cases where negative improvements over cachewere shown using the frequency method.

Performance results are shown in Table V. The structure ofthe table is identical to that of Table IV; columns 3–7 showthe execution time of a cache based system; column 8 showsthe performance results of the frequency method; column 9shows the performance measurement results of the concomi-tance method; column 10 shows the average performanceimprovement of the concomitance method over cache system;and column 11 shows performance improvement over thefrequency method. The concomitance method improves the

Page 13: 816 IEEE TRANSACTIONS ON VERY LARGE SCALE …sridevan/index_files/01664903.pdfExploiting Statistical Information for Implementation of Instruction Scratchpad Memory in Embedded System

828 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 14, NO. 8, AUGUST 2006

execution time by 40.0% compared with cache and 23.6%compared with the frequency method.

Shorter instruction memory access time does not alwaysimply a shorter execution time, due to data memory accesstime and the time required by the CPU to execute multicycleinstructions. For the purposes of evaluation, we minimized theeffect of data memory access on the execution time by settinga large data cache so that data cache miss rates are less than0.01%. Thus, the data cache has a very small to negligible missrate and, hence, minimal effect on the program execution time.

For multicycle instructions, it is not possible to accuratelyestimate the execution time without knowing which multicycleinstructions were executed. We perform comparisons betweencache execution times obtained by Simplescalar simulation, tocache memory access times obtained from (4). We found that,on average, a 9.5% error is seen between the two values with amaximum error of 17%.

Thus, it is clear from the results that the method is feasible forutilizing a dynamic SPM allocation scheme. We show that thenumber of copy instructions inserted are far fewer than for theexisting methods described in [29] and [30]. In addition, for theapplications shown here, we show performance improvementand energy savings over conventional cache-based systems.

IX. CONCLUSION

We have presented a method to lower energy consumptionand improve the performance of embedded systems. We pre-sented a methodology for selecting appropriate code segmentsto be executed from the SPM. The methodology determined thecode segments that are to be stored within the SPM by using theconcomitance information. By using a custom hardware SPMcontroller to dynamically manage the SPM, we have success-fully avoided the need to insert many instructions into a programfor managing the content of the SPM. Instead, we implementedheuristic algorithms to strategically insert custom instructions,SMI, to activate the hardware SPM controller. Experimental re-sults show that our SPM scheme can lower energy consumptionby an average of 41.9% compared with traditional cache archi-tecture, and performance is improved by an average of 40.0%.

REFERENCES

[1] K. Lahiri, S. Dey, and A. Raghunathan, “Communication architecturebased power management for battery efficient system design,” in Proc.39th Annu. Design Autom. Conf. , Jun. 10–14, 2002, pp. 691–696.

[2] J. Luo and N. K. Jha, “Power-profile driven variable voltage scaling forheterogeneous distributed real-time embedded systems,” in Proc. VLSIDesign, Jan. 4–8, 2003, pp. 369–375.

[3] F. Sun, S. Ravi, A. Raghunathan, and N. K. Jha, “Custom-instructionsynthesis for extensible-processor platforms,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 23, no. 2, pp. 216–228, Feb.2004.

[4] N. Cheung, S. Parameswaran, and J. Henkel, “A quantitative study andestimation models for extensible instructions in embedded processors,”in Proc. Int. Conf. Comput.-Aided Design, 2004, pp. 183–189.

[5] J. M. Rabaey, Digital Integrated Circuits: A Design Perspective. En-glewood Cliffs, NJ: Prentice-Hall, 1996.

[6] J. Kin, G. Munish, and W. H. Mangione-Smith, “The filter cache: Anenergy efficient memory structure,” in Proc. 30th Annu. IEEE Conf.Microarchitecture, 1997, pp. 184–193.

[7] J. Montanaro, “A 160 MHz, 32 b, 0.5 W CMOS RISC microprocessor,”IEEE J. Solid-State Circuits, vol. 31, no. 11, pp. 1703–1714, Nov. 1996.

[8] R. Banakar, “Scratchpad memory: A design alternative for cacheon-chip memory in embedded systems,” in Proc. Conf. Hard-ware/Software Codesign, May 6–8, 2002, pp. 73–78.

[9] G. Ramalingam, “On loops, dominators, and dominance frontier,” inProc. Conf. Programming Language Design Implementation, 2000, pp.233–241.

[10] P. Havlak, “Nesting of reducible and irreducible loops,” ACM Trans.Programming Languages Syst., vol. 19, no. 4, pp. 557–567, Jul. 1997.

[11] V. C. Sreedhar, G. R. Gao, and Y. F. Lee, “Identifying loops using DJgraphs,” ACM Trans. Programming Languages Syst., vol. 18, no. 6, pp.649–658, Nov. 1996.

[12] B. Steensgaard, “Sequentializing program dependence graphs for ir-reducible programs,” Microsoft Research, Redmond, WA, Tech. Rep.MSR_TR-93-14, 1993.

[13] N. Gloy and M. D. Smith, “Procedure placement using temporal-or-dering information,” ACM Trans. Programming Languages Syst., vol.32, no. 5, pp. 977–1027, Sep. 1999.

[14] S. Parameswaran and J. Henkel, “I-CoPES: Fast instruction code place-ment for embedded systems to improve performance and energy effi-ciency,” in Proc. Int. Conf. Comput.-Aided Design, 2001, pp. 635–641.

[15] P. P. Chang et al., “IMPACT: An architectural framework for multiple-instruction-issue processors,” Comput. Architecture News, vol. 19, no.3, pp. 266–275, 1991.

[16] S. McFarling, “Program optimization for instruction caches,” in Proc.3rd Conf. Architectural Support for Programming Language OperatingSyst., 1989, pp. 183–191, .

[17] ——, “Procedure merging with instruction caches,” SIGPLAN Notices,vol. 26, no. 6, pp. 71–79, 1991.

[18] P. R. Panda, N. D. Dutt, and A. Nicolau, “Memory organization forimproved data cache performance in embedded processors,” in Proc.9th Int. Symp. Syst. Synthesis, 1996, pp. 90–95.

[19] P. R. Panda, H. Nakamura, N. D. Dutt, and A. Nicolau, “A data align-ment technique for improving cache performance,” in Proc. Int. Conf.Comput. Design, 1997, pp. 587–592.

[20] H. Tomiyama and H. Yasuura, “Optimal code placement of embeddedsoftware for instruction cache,” in Proc. Eur. Conf. Exhib. DesignAutom. Test, 1996, pp. 96–101.

[21] S. Bartolini and C. A. Prete, “A cache-aware program transformationtechnique suitable for embedded systems,” Inf. Software Technol., vol.44, no. 13, pp. 783–795, 2002.

[22] P. R. Panda, N. D. Dutt, and A. Nicolau, “Efficient utilization ofscratch-pad memory in embedded processor applications,” in Proc.Eur. Design Test Conf., 1997, pp. 7–11.

[23] O. Avissar and R. Barua, “An optimal memory allocation scheme forscratch-pad-based embedded systems,” ACM Trans. Embedded Com-puting Syst., vol. 1, pp. 6–26, 2002.

[24] M. Kandemir and A. Choudhary, “Compiler-directed scratch padmemory hierarchy design and management,” in Proc. 39th DesignAutom. Conf., 2002, pp. 628–633.

[25] S. Udayakumaran and R. Barua, “Compiler-decided dynamic memoryallocation for scratch-pac based embedded systems,” in Proc. Int. Conf.Compilers, Architecture, and Synthesis for Embedded Syst., 2003, pp.276–286.

[26] S. Steinke, L. Wehmeyer, B. Lee, and P. Marwedel, “Assigning pro-gram and data objects to scratchpad for energy reduction,” in Proc.Conf. Design, Autom. Test Europe, 2002, pp. 409–415.

[27] F. Angiolini, L. Benini, and A. Caprara, “Polynomial-time algorithmfor on-chip scratchpad memory partitioning,” in Proc. Int. Conf. Com-pilers, Architecture, Synthesis Embedded Syst., 2003, pp. 318–326.

[28] F. Angiolini, “A post-compiler approach to scratchpad mappingof code,” in Proc. Int. Conf. Compilers, Architecture, Synthesis forEmbedded Syst., 2004, pp. 259–267.

[29] S. Steinke, “Reducing energy consumption by dynamic copying of in-structions onto onchip memory,” in Proc. 15th Int. Symp. Syst. Syn-thesis, 2002, pp. 213–218.

[30] A. Janapsatya, A. Ignjatovic, and S. Parameswaran, “Hardware/soft-ware managed scratchpad memory for embedded system,” in Proc. Int.Conf. Comput.-Aided Design, 2004, pp. 370–377.

[31] D. Burger and T. M. Austin, The SimpleScalar Tool Set, Version 2.0Univ. of Wisconsin-Madison, 1997, TR-CS-1342.

[32] P. B. Bartlett, “Profiling in the ASP codesign environment,” J. Syst.Architecture, vol. 46, no. 14, pp. 1263–1274, Dec. 2000.

[33] M. R. Garey and D. S. Johnson, Computers and Intractability. NewYork: W. H. Freeman, 2000.

[34] D. Brooks, V. Tiwari, and M. Martonosi, “Wattch: A framework forarchitectural-level power analysis and optimizations,” in Proc. 27th Int.Symp. Comput. Architecture, 2000, pp. 83–94.

Page 14: 816 IEEE TRANSACTIONS ON VERY LARGE SCALE …sridevan/index_files/01664903.pdfExploiting Statistical Information for Implementation of Instruction Scratchpad Memory in Embedded System

JANAPSATYA et al.: EXPLOITING STATISTICAL INFORMATION FOR IMPLEMENTATION OF INSTRUCTION SPM IN EMBEDDED SYSTEM 829

[35] P. Shivakumar and N. P. Jouppi, Cacti 3.0: An integrated cache timing,power, and area model Compaq Computer Corporation, Tech. Rep.2001/2, 2001.

[36] Embedded DRAM SA-27E IBM Microelectronics Division, 2002 [On-line]. Available: http://ibm.com/chips

[37] C. Lee, “Mediabench: A tool for evaluating multimedia and communi-cations systems,” in Proc. 30th Annu. IEEE Conf. Microarchitecture,1997, pp. 330–335.

[38] C. Lee and M. Stoodley, UTDSP Benchmark Suite. (1992) [Online].Available: http://www.eecg. toronto.edu/ corinna/DSP/infrastructure/UTDSP.html

[39] Synopsys – Electronic Design Automation Solutions and Services Syn-opsys Inc., 2005 [Online]. Available: http://www.synopsys.com

[40] J. Peddersen, S. L. Shee, A. Janapsatya, and S. Parameswaran, “Rapidembedded hardware/software system generation,” in Proc. 18th Int.Conf. VLSI Design, 2005, pp. 111–116.

Andhi Janapsatya (S’00–M’06) received the B.E.(hon.) degree in electrical engineering from TheUniversity of Queensland, Brisbane, Australia, in2001. He is currently working toward the Ph.D.degree at The University of New South Wales,Sydney, Australia.

His research interests include low-power design,application-specific hardware design, and energy andperformance optimization of application-specificsystems.

Aleksandar Ignjatovic received the B.S. and M.S.degrees in mathematics from the University of Bel-grade, Belgrade, in 1981 and 1984, respectively, andthe Ph.D. degree in mathematical logic from the Uni-versity of California, Berkeley, in 1990.

He was an Assistant Professor of mathemat-ical logic and philosophy with Carnegie MellonUniversity, Pittsburgh, PA, before joining KromosTechnology Inc. Presently, he is a Senior Lecturerwith the School of Computer Science and Engi-neering, University of New South Wales, Sydney,

Australia. His interests include applications of mathematical logic to complexitytheory, approximation theory, algorithms, and embedded systems design.

Sri Parameswaran (M’95) received the B.Eng.degree from Monash University in 1986 and the Ph.D.degree from the The University of Queensland in1991.

He is currently an Associate Professor with theSchool of Computer Science and Engineering, Uni-versity of New South Wales, Sydney, Australia. Healso serves as the Program Director for ComputerEngineering with the University of New South Wales.His research interests are in system-level synthesis,low-power systems, high-level systems, and network

on chips.Prof. Parameswaran has served on the Program Committees of numerous in-

ternational conferences, such as Design and Test in Europe (DATE), the Interna-tional Conference on Computer Aided Design (ICCAD), the International Con-ference on Hardware/Software Codesign and System Synthesis (CODES-ISSS),Asia South Pacific Design Automation Conference (ASP-DAC), and the Inter-national Conference on Compilers, Architectures and Synthesis for EmbeddedSystems (CASES). He was the recipient of the Faculty of Engineering TeachingStaff Excellence Award in 2005 and Best Paper nominations from the Interna-tional Conference on VLSI Design and the Asia South Pacific Design Automa-tion Conference.