

Applying Pay-Burst-Only-Once Principle for Periodic Power Management in Hard Real-Time Pipelined Multiprocessor Systems

GANG CHEN, Technische Universitat Muenchen
KAI HUANG, Technische Universitat Muenchen and Sun Yat-sen University
CHRISTIAN BUCKL, Fortiss GmbH
ALOIS KNOLL, Technische Universitat Muenchen

Pipelined computing is a promising paradigm for embedded system design. Designing a power management policy to reduce the power consumption of a pipelined system with nondeterministic workload is, however, nontrivial. In this article, we study the problem of energy minimization for coarse-grained pipelined systems under hard real-time constraints and propose new approaches based on an inverse use of the pay-burst-only-once principle. We formulate the problem by means of the resource demands of individual pipeline stages and propose two new approaches, a quadratic programming-based approach and a fast heuristic, to solve the problem. In the quadratic programming approach, the problem is transformed into a standard quadratic programming problem with box constraints and then solved by a standard quadratic programming solver. Observing that the problem is NP-hard, we design the fast heuristic to solve the problem more efficiently. Our approach is scalable with respect to the number of pipeline stages. Simulation results using real-life applications are presented to demonstrate the effectiveness of our methods.

Categories and Subject Descriptors: C.3 [Special-Purpose and Application-based System]—Real-Time and Embedded Systems

General Terms: Algorithms

Additional Key Words and Phrases: Scheduling, energy, pay-burst-only-once, periodic power management, real-time system

ACM Reference Format:
Gang Chen, Kai Huang, Christian Buckl, and Alois Knoll. 2015. Applying pay-burst-only-once principle for periodic power management in hard real-time pipelined multiprocessor systems. ACM Trans. Des. Autom. Electron. Syst. 20, 2, Article 26 (February 2015), 27 pages.
DOI: http://dx.doi.org/10.1145/2699865

1. INTRODUCTION

With increasing requirements for high performance, multicore architectures are believed to be the major solution for future embedded systems. Many real-time applications, especially streaming applications, can be executed on multiple processors simultaneously to achieve parallel processing. When real-time applications are executed on multicore architectures powered by batteries, minimizing the energy consumption is one of the major design goals, because an energy-efficient design will increase the lifetime, increase the reliability, and decrease the heat dissipation of the system.

A preliminary version of a portion of this article appears in Proceedings of the Conference on Design Automation and Test in Europe 2013.
This work has been partly supported by China Scholarship Council, German BMBF project ECU (grant no. 13N11936) and Car2X (grant no. 13N11933). The authors would like to thank their sponsors for their support.
Authors' addresses: G. Chen, Department of Informatics, Technische Universitat Muenchen (TUM), Boltzmannstraße 3, 85748, Garching bei Munchen, Germany; K. Huang (corresponding author), School of Mobile Information Engineering, Sun Yat-sen University, Zhu Hai, China; email: [email protected]; C. Buckl, Fortiss GmbH, Germany; A. Knoll, Department of Informatics, Technische Universitat Muenchen (TUM), Boltzmannstraße 3, 85748, Garching bei Munchen, Germany.
© 2015 ACM 1084-4309/2015/02-ART26 $15.00

ACM Transactions on Design Automation of Electronic Systems, Vol. 20, No. 2, Article 26, Pub. date: February 2015.

Pipelined computing is a promising paradigm for embedded system design, which can, in principle, provide high throughput and low energy consumption [Carta et al. 2007]. For instance, a streaming application can be split into a sequence of functional blocks that are computed by a pipeline of processors where power-gating techniques can be applied to achieve energy efficiency.

Performance constraints of a streaming application are usually imposed on two principal metrics, that is, throughput and latency. The latency is the main concern for applications such as video/telephone conferencing and automatic pattern recognition, where latency beyond a certain bound is not tolerated. In the case of pipelined real-time systems, the latency of a streaming application can be expressed as the end-to-end deadline by which the application must be processed through the pipeline.

Designing the scheduling policy for the pipeline stages under the requirements of both energy efficiency and timing guarantee is, however, nontrivial. In general, energy efficiency and timing guarantee are conflicting objectives, that is, techniques that reduce the energy consumption of the system will usually pay the price of longer execution time, and vice versa. Previous work on this topic either requires precise timing information of the system [Yu and Prasanna 2002; Xu et al. 2007] or tackles only soft real-time requirements [Javaid et al. 2011b; Carta et al. 2007]. However, this precise timing of task arrivals might not be guaranteed in practice. Thus, the previous approaches cannot guarantee the worst-case deadline and cannot be applied to those embedded systems where violating deadlines could be disastrous. Compared to the preceding work, our work tackles a pipelined event stream with nondeterministic workloads in hard real-time systems by an inverted use of the pay-burst-only-once principle for energy efficiency.

This article studies the energy minimization problem of coarse-grained pipelined systems under hard real-time requirements. We consider a streaming application that is split into a sequence of coarse-grained functional blocks which are mapped to a pipeline architecture for processing. The workload of the streaming application is abstracted as an event stream and the event arrivals of the stream are modeled as arrival curves in the interval domain [Le Boudec and Thiran 2001]. The event stream has an end-to-end deadline requirement, that is, the time that any event in the stream takes to travel through the pipeline should be no longer than this required deadline. The objective is thereby to find optimal scheduling policies for the individual stages of the pipeline with minimal energy consumption while the deadline requirement of the event stream is guaranteed.

Intuitively, the problem can be solved by partitioning the end-to-end deadline into sub-deadlines for individual pipeline stages and optimizing the energy consumption based on the partitioned sub-deadlines. However, any partition strategy based on the end-to-end deadline and the follow-up optimization method will suffer from counting the burst of the event stream multiple times, which will inevitably overestimate the needed resource for every pipeline stage and lead to poor energy saving. A motivation example in Section 4 will demonstrate this drawback in detail. Therefore, a more sophisticated method is needed to tackle this problem.

In this article, we develop a new approach to solve the energy minimization problem for pipelined multiprocessor embedded systems while guaranteeing the worst-case end-to-end delay. This article summarizes and extends the results built in Chen et al. [2013]. Our idea to solve this problem lies in an inverse use of the well-known pay-burst-only-once principle [Le Boudec and Thiran 2001]. Rather than directly partitioning the end-to-end deadline, we compute for the entire pipeline one service curve that serves as a constraint for the minimal resource demand. The energy minimization problem is then formulated with respect to the individual resource demands of pipeline stages. To solve this problem, we propose two heuristics, that is, a quadratic programming heuristic and a fast heuristic. In the quadratic programming heuristic, the minimization problem is transformed to a standard quadratic programming problem with box constraints and then solved by a standard solver. Observing that the formulated problem is NP-hard, we present a fast heuristic that finds a suboptimal solution by analyzing the properties of the optimal solution and runs with complexity O(mn) (where m and n are the stage number and sample step number, respectively). For simplicity, we consider power-gating energy minimization and use the periodic dynamic power management in Huang et al. [2009b, 2011a] to reduce the leakage power, that is, to periodically turn on and off the processors of the pipeline. In this work we compute periodic power management schemes offline, and the fixed Ton/Toff for the processors of every pipeline stage are applied during runtime. With this approach, we can not only guarantee the overall end-to-end deadline requirement but also retrieve the pay-burst-only-once phenomenon, achieving a significant reduction of the energy consumption. In addition, our methods are scalable with respect to the number of pipeline stages. The contributions of this article are summarized as follows.

—A new method is developed to solve the energy minimization problem for pipelined multiprocessor embedded systems by inversely using the pay-burst-only-once principle.

—A minimization problem is formulated based on the needed resource of individual stages of the pipeline architecture and a transformation of the formulation to a standard quadratic programming problem with box constraints. The formulated problem is proved NP-hard.

—A quadratic programming heuristic is developed to solve the formulated problem and a formal proof is provided to show the correctness of our approach, that is, guarantee on the end-to-end deadline requirement.

—A fast heuristic is developed to solve the formulated problem, running with the complexity O(mn).

The rest of the article is organized as follows. Section 2 reviews related work in the literature. Section 3 presents basic models and the definition of the studied problem. Section 4 presents the motivation example and Section 5 describes the proposed approach. Experimental evaluation is presented in Section 6, and Section 7 concludes.

2. RELATED WORK

Pipelined computing is a promising paradigm for embedded system design, which can in principle provide high performance and low energy consumption. Pipelined multiprocessor systems are widely applied as a viable platform for high performance implementation of multimedia applications [Shee and Parameswaran 2007; Javaid and Parameswaran 2009; Shee et al. 2006; Karkowski and Corporaal 1997]. Energy optimization for pipelined multiprocessor systems is an interesting topic where a number of techniques have been proposed in the literature. Carta et al. [2007] and Alimonda et al. [2009] proposed a feedback control technique for dynamic voltage/frequency scaling (DVFS) in a pipelined MPSoC architecture with soft real-time constraints, aimed at minimizing energy consumption with throughput guarantees. Each pipelined processor is associated with a dedicated controller that monitors the occupancy level of the queues to determine when to increase or decrease the voltage-frequency levels of the processor. Javaid et al. [2011b] proposed an adaptive pipelined MPSoC architecture and a runtime balancing approach based on workload prediction to achieve energy efficiency. The authors in Javaid et al. [2011a] proposed a dynamic power management scheme for adaptive pipelined MPSoCs. In this work, the duration of idle periods is determined based on future workload prediction and used to select an appropriate power state for the idle processor. However, the prior approaches target soft real-time constraints and therefore cannot be applied to hard real-time systems.

There are also methods [Davare et al. 2007; Hong et al. 2011; de Langen and Juurlink 2006, 2009; Liu et al. 2014; Yu and Prasanna 2002] for hard real-time systems. To guarantee the end-to-end delay, the authors in Liu et al. [2014] studied the problem of minimizing the number of processors required for scheduling end-to-end deadline-constrained streaming applications modeled as CSDF graphs, where the actors of a CSDF are executed as strictly periodic tasks. In Davare et al. [2007], the authors optimized periods for dependent tasks on hard real-time distributed automotive systems in order to meet the end-to-end constraints. In Hong et al. [2011], the authors proposed a distributed approach to assign local deadlines for periodic tasks on distributed systems to meet the end-to-end deadline constraints. To reduce the energy consumption, Yu and Prasanna [2002] presented an integer linear programming (ILP) formulation for the problem of frequency assignment of a set of periodic independent tasks on a heterogeneous multiprocessor system. The authors in de Langen and Juurlink [2006, 2009] proposed leakage-aware scheduling heuristics to reduce the energy consumption by translating real-time applications with periodic tasks to DAGs using the frame-based scheduling paradigm and considering the trade-offs among DVFS, DPM, and the number of processors. These methods, however, require precise timing information such as periodic real-time events. In practice, this precise timing information of task arrivals might not be determined in advance. The nondeterminism in the timing of event arrivals results from two main causes: (a) An event may be triggered by the physical environment, which, in general, cannot be accurately predicted. (b) When a distributed system is considered, an event might be triggered by other events on different processing components in which variable execution workloads would make the prediction of precise information on event arrivals extremely complicated. In the aforesaid research, there is no guarantee that an event will arrive in time. Therefore, these approaches cannot be applied to guarantee the worst-case deadline in embedded systems where violating deadlines could be disastrous. Unlike previous work, we focus on improving energy efficiency in hard real-time embedded systems while guaranteeing the system satisfies the worst-case deadline constraint.

To model irregular event arrivals, Real-Time Calculus (RTC) [Thiele et al. 2000], which is based on network calculus [Le Boudec and Thiran 2001], can be applied. Specifically, the arrival curve in the RTC models an upper bound and a lower bound of the number of event arrivals or the demand of computation under a specified time interval domain. Considering the DVFS system, Maxiaguine et al. [2005] computed a safe frequency at periodic intervals to prevent buffer overflow of a system. By adopting RTC models, Chen et al. [2009] explored the schedulability for the online DVFS scheduling algorithms proposed in Yao et al. [1995]. Combining optimistic and pessimistic DVFS scheduling, Perathoner et al. [2010] presented an adaptive scheme for the scheduling of arbitrary event streams. When only considering dynamic power management (DPM), Huang et al. [2009b, 2011a] presented an algorithm to find periodic time-driven patterns to turn on/off the processor for energy saving. Online algorithms are proposed in Huang et al. [2009a, 2011b] and Lampka et al. [2011] to adaptively control the power mode of a system, procrastinating the processing of arrived events as late as possible. In one algorithm in Huang et al. [2009a, 2011b], a tight bound of event arrivals is computed based on historical information of event arrivals in the recent past. Instead of using historical information, a dynamic counter technique [Lampka et al. 2011] is used to predict the future workload. Compared to preceding work, the distinct difference of our work is that we can tackle the correlation of a pipelined event stream by an inverted use of the pay-burst-only-once principle. With this new method, which retrieves the correlation of the same event stream between different pipeline stages, we can compute longer deadlines for each pipeline stage and reduce the overall power consumption of the system.

Fig. 1. System model.

3. MODELS AND PROBLEM DEFINITION

3.1. Hardware Model

The hardware architecture we have chosen is a simplified one with no shared cache and shared bus among different processing cores. The processing cores are connected in a pipelined fashion via dedicated FIFOs. We consider the system with pipeline architecture shown in Figure 1(a). Subtasks of a partitioned application are mapped and executed in different processors. The processors communicate data only through distributed memory units. Each memory unit can be organized as one or several FIFOs. The data communication and synchronization among processors are realized by blocking read and write SW primitives. This kind of hardware architecture has been realized in Nikolov et al. [2008]. As the service curve of each stage can be computed for energy efficiency by our proposed approaches offline, the worst-case FIFO size of each stage can be determined by applying the analysis approach in Wandeler et al. [2006].

Each processor in the pipelined system has three power consumption modes, namely active, standby, and sleep modes, as shown in Figure 1(b). To serve events, the processor must be in the active mode with power consumption Pa. When there is no event to process, the processor can switch to sleep mode with lower power consumption Pσ. However, mode switching from sleep mode to active mode will cause additional energy and latency penalty, respectively denoted as Esw,on and tsw,on. To prevent the processor from frequent mode switches, the processor can stay at standby mode with power consumption Ps, which is less than Pa but more than Pσ, that is, Pa > Ps > Pσ. Moreover, the mode switch from active (standby) mode to sleep mode will cause energy and time overhead, respectively denoted by Esw,sleep and tsw,sleep.

Considering the overhead of switching the system from active mode to sleep mode, the system break-even time TBET denotes the minimum time length that the system should stay at sleep mode. If the interval that the system can stay at sleep mode is smaller than TBET, the mode-switch overheads are larger than the energy saving, and therefore switching mode is not worthwhile. The break-even time TBET can be defined as follows:

$$T_{BET} = \max\left(t_{sw},\ \frac{E_{sw}}{P_s - P_\sigma}\right), \tag{1}$$

where tsw = tsw,on + tsw,sleep and Esw = Esw,on + Esw,sleep.
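The break-even time in Eq. (1) can be evaluated with a few lines of Python. The following is only a minimal sketch: the function name `break_even_time`, its parameter names, and the numeric values are hypothetical and chosen for illustration, not taken from the article.

```python
def break_even_time(t_sw_on, t_sw_sleep, e_sw_on, e_sw_sleep, p_standby, p_sleep):
    """Eq. (1): T_BET = max(t_sw, E_sw / (P_s - P_sigma))."""
    if p_standby <= p_sleep:
        raise ValueError("the hardware model assumes P_s > P_sigma")
    t_sw = t_sw_on + t_sw_sleep    # total switching latency
    e_sw = e_sw_on + e_sw_sleep    # total switching energy
    return max(t_sw, e_sw / (p_standby - p_sleep))


# Hypothetical values in consistent units (for example ms, mJ, and W).
t_bet = break_even_time(t_sw_on=1.0, t_sw_sleep=1.0,
                        e_sw_on=2.0, e_sw_sleep=2.0,
                        p_standby=1.5, p_sleep=0.5)
print(t_bet)
```

With these illustrative numbers the energy term dominates the latency term, so sleeping pays off only for idle intervals longer than the returned value.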



3.2. Energy Model

The analytical processor energy model in Martin et al. [2002], Wang and Mishra [2010], Jejurikar et al. [2004], and de Langen and Juurlink [2009] is adopted in this article, whose accuracy has been verified with SPICE simulation [Martin et al. 2002; Wang and Mishra 2010; de Langen and Juurlink 2009]. The dynamic power consumption of the core on one voltage/frequency level (Vdd, f) can be given by

$$P_{dyn} = C_{eff} \cdot V_{dd}^{2} \cdot f, \tag{2}$$

where Vdd is the supply voltage, f the operating frequency, and Ceff the effective switching capacitance. The cycle length tcycle is given by a modified alpha power model

$$t_{cycle} = \frac{L_d \cdot K_6}{(V_{dd} - V_{th})^{\alpha}}, \tag{3}$$

where K6 is a technology constant and Ld is estimated by the average logic depth of all instructions' critical paths in the processor. The threshold voltage Vth is given as

Vth = Vth1 − K1 · Vdd − K2 · Vbs, (4)

where Vth1, K1, K2 are technology constants and Vbs is the body bias voltage.

The static power is mainly contributed by the subthreshold leakage current Isubn, the reverse bias junction current Ij, and the number of devices in the circuit Lg. It can be presented by

Psta = Lg · (Vdd · Isubn + |Vbs| · Ij), (5)

where the reverse bias junction current Ij is approximated as a constant and the subthreshold leakage current Isubn can be determined as

$$I_{subn} = K_3 \cdot e^{K_4 V_{dd}} \cdot e^{K_5 V_{bs}}, \tag{6}$$

where K3, K4, and K5 are technology constants. To avoid junction leakage power overriding the gain in lowering Isubn, Vbs should be constrained between 0 and −1V. Thus, the power consumption at active mode and at standby mode, that is, Pa and Ps, under one voltage/frequency (Vdd, f) can be respectively computed as

Pa = Pdyn + Psta + Pon, (7)

Ps = Psta + Pon, (8)

where Pon is an inherent power needed for keeping the processor on.
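To make the chain from Eqs. (2) through (8) concrete, the sketch below evaluates the model in Python. It is an illustration only: the function names are hypothetical, and every numeric argument used with them is a placeholder rather than a real technology constant from the cited works.

```python
import math


def threshold_voltage(v_dd, v_bs, v_th1, k1, k2):
    """Eq. (4): V_th = V_th1 - K1 * V_dd - K2 * V_bs."""
    return v_th1 - k1 * v_dd - k2 * v_bs


def cycle_time(v_dd, v_th, l_d, k6, alpha):
    """Eq. (3): t_cycle = L_d * K6 / (V_dd - V_th)^alpha."""
    return l_d * k6 / (v_dd - v_th) ** alpha


def dynamic_power(c_eff, v_dd, f):
    """Eq. (2): P_dyn = C_eff * V_dd^2 * f."""
    return c_eff * v_dd ** 2 * f


def static_power(v_dd, v_bs, l_g, i_j, k3, k4, k5):
    """Eqs. (5) and (6): leakage-dominated static power."""
    i_subn = k3 * math.exp(k4 * v_dd) * math.exp(k5 * v_bs)  # Eq. (6)
    return l_g * (v_dd * i_subn + abs(v_bs) * i_j)           # Eq. (5)


def mode_powers(p_dyn, p_sta, p_on):
    """Eqs. (7) and (8): return (P_a, P_s) for one (V_dd, f) level."""
    return p_dyn + p_sta + p_on, p_sta + p_on


# Placeholder inputs, not measured constants.
p_dyn = dynamic_power(c_eff=1.0, v_dd=2.0, f=3.0)
p_sta = static_power(v_dd=1.0, v_bs=-0.5, l_g=2.0, i_j=4.0, k3=1.0, k4=0.0, k5=0.0)
p_a, p_s = mode_powers(p_dyn, p_sta, p_on=1.0)
```

Note that the difference Pa − Ps equals Pdyn in this model, which is the quantity multiplying the executed workload in the energy expression of Section 3.4.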

3.3. Task Model

This article considers streaming applications that can be split into a sequence of tasks. As shown in Figure 1(a), an H.263 decoder is represented as four tasks (i.e., PD1, deQ, IDCT, MC) implemented in a pipelined fashion [Oh and Ha 2002]. To model the workload of the application, the concept of arrival curve $\alpha(\Delta) = [\alpha^u(\Delta), \alpha^l(\Delta)]$, originating from network calculus [Le Boudec and Thiran 2001], is adopted. $\alpha^u(\Delta)$ and $\alpha^l(\Delta)$ provide the upper and lower bounds on the number of arrival events for the stream S in any time interval $\Delta$. Many other traditional timing models of event streams can be unified in the concept of arrival curves. For example, a periodic event stream can be modeled by a set of step functions, where $\alpha^u(\Delta) = \lfloor \Delta/p \rfloor + 1$ and $\alpha^l(\Delta) = \lfloor \Delta/p \rfloor$. For a sporadic event stream with minimal interarrival distance $p$ and maximal interarrival distance $p'$, the upper and lower arrival curves are $\alpha^u(\Delta) = \lfloor \Delta/p \rfloor + 1$ and $\alpha^l(\Delta) = \lfloor \Delta/p' \rfloor$, respectively. Moreover, a widely used model to specify an arrival curve is the PJD model, where the arrival curve is characterized by period $p$, jitter $j$, and minimal interarrival distance $d$. In the PJD model, the upper arrival curve can be determined as $\alpha^u(\Delta) = \min\{\lceil (\Delta + j)/p \rceil, \lceil \Delta/d \rceil\}$. Figure 2 depicts arrival curves for the previous cases.

Fig. 2. Examples for arrival curves: (a) periodic events with period p; (b) events with minimal interarrival distance p and maximal interarrival distance p′ = 1.3p; (c) events with period p, jitter j = p, and minimal interarrival distance d = 0.75p.
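The arrival-curve models described above (periodic, sporadic, and PJD) can be written as small Python functions. This is a minimal sketch with hypothetical function names; returning 0 for non-positive interval lengths is an assumed convention that is common in Real-Time Calculus but is not stated in the text.

```python
import math


def periodic_upper(delta, p):
    """Upper arrival curve of a periodic stream: floor(delta / p) + 1."""
    return 0 if delta <= 0 else math.floor(delta / p) + 1


def periodic_lower(delta, p):
    """Lower arrival curve of a periodic stream: floor(delta / p)."""
    return 0 if delta <= 0 else math.floor(delta / p)


def sporadic_lower(delta, p_max):
    """Lower curve of a sporadic stream with maximal interarrival distance p_max."""
    return 0 if delta <= 0 else math.floor(delta / p_max)


def pjd_upper(delta, p, j, d):
    """PJD upper curve: min(ceil((delta + j) / p), ceil(delta / d)), assuming d > 0."""
    if delta <= 0:
        return 0
    return min(math.ceil((delta + j) / p), math.ceil(delta / d))


# Illustrative parameters matching the shape of Fig. 2(c): j = p, d = 0.75 p.
p = 10.0
print(periodic_upper(25.0, p), periodic_lower(25.0, p))
print(pjd_upper(25.0, p, j=p, d=0.75 * p))
```

The sporadic upper curve has the same form as `periodic_upper` with `p` taken as the minimal interarrival distance, so it is not repeated here.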

Analogous to arrival curves that provide an abstract event stream model, a tuple $\beta(\Delta) = [\beta^u(\Delta), \beta^l(\Delta)]$ defines an abstract resource model which provides upper and lower bounds on the available resources in any time interval $\Delta$. For further details, refer to Thiele et al. [2000]. Note that arrival curves are event based, meaning they specify the number of events of the stream in one interval of time, while service curves are based on the amount of computation time. Therefore, service curve $\beta$ has to be transformed to an event-based curve $\bar{\beta}$ to indicate the number of events of the stream that the processor can process in a specified interval of time. Suppose that the execution time of an event is $c$; the transformation of the service curves can be done by $\bar{\beta}^l = \lfloor \beta^l / c \rfloor$ and $\bar{\beta}^u = \lceil \beta^u / c \rceil$. With these definitions, a processor with lower service curve $\beta^{Gl}(\Delta)$ is said to satisfy the deadline $D$ for the event stream specified by $\alpha^u(\Delta)$ if the following condition holds.

$$\beta^{Gl}(\Delta) \geq \alpha^u(\Delta - D), \quad \forall \Delta \geq 0 \tag{9}$$

Note that we adopt the same assumption as Maxiaguine et al. [2005], Huang et al. [2009a, 2009b], Lampka et al. [2011], and Chen et al. [2009] and assume the worst-case execution time (WCET) of each task can be predefined and considered as system input in the article. As mentioned in the previous section, the hardware architecture that we have chosen is a simplified one with no shared cache and shared bus among different processing cores. In this sense, we can safely assume the WCET of the running tasks as system inputs.
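The deadline condition in Eq. (9) can be explored numerically. The sketch below, with hypothetical helper names, converts a time-based lower service curve into an event-based one using the WCET c and then tests the condition on a finite grid of interval lengths. A sampled check of this kind can only falsify the condition or build confidence in it; it is not a proof for all Δ. Treating the arrival curve as 0 for non-positive arguments is an assumed convention.

```python
import math


def events_served_lower(beta_l, c):
    """Event-based lower service curve: floor(beta_l(delta) / c)."""
    return lambda delta: math.floor(beta_l(delta) / c)


def meets_deadline_sampled(beta_events, alpha_u, deadline, horizon, step):
    """Check beta(delta) >= alpha_u(delta - D) on the grid 0, step, ..., horizon."""
    n = int(round(horizon / step))
    for k in range(n + 1):
        delta = k * step
        shifted = delta - deadline
        demand = alpha_u(shifted) if shifted > 0 else 0  # assumed: alpha_u = 0 for shifted <= 0
        if beta_events(delta) < demand:
            return False
    return True


# Illustrative inputs: periodic stream with period 10, WCET 5, and a
# processor that is fully available after an initial latency of 4.
alpha_u = lambda delta: math.floor(delta / 10.0) + 1
beta_l = lambda delta: max(0.0, delta - 4.0)
beta_events = events_served_lower(beta_l, c=5.0)

ok_loose = meets_deadline_sampled(beta_events, alpha_u, deadline=10.0, horizon=100.0, step=0.5)
ok_tight = meets_deadline_sampled(beta_events, alpha_u, deadline=2.0, horizon=100.0, step=0.5)
```

In this example the looser deadline passes on the sampled grid, while the tighter one fails because the first event is demanded before the processor has delivered one WCET of service.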

3.4. Problem Statement

This article considers periodic power management [Huang et al. 2009b] that periodically turns on and off a processor. In each period $T = T_{on} + T_{off}$, it switches the processor to active (standby) mode for $T_{on}$ time units, followed by $T_{off}$ time units in sleep mode, as shown in Figure 1(b). Given a time interval $L$, where $L \gg T$ and $L/T$ is an integer, suppose that $\gamma(L)$ is the number of events of event stream S served in $L$. If all the served events finish within $L$, the energy consumption $E(L, T_{on}, T_{off})$ by applying this periodic scheme is

$$\begin{aligned}
E(L, T_{on}, T_{off}) &= \frac{L}{T_{on} + T_{off}}\,(E_{sw,on} + E_{sw,sleep}) + \frac{L \cdot T_{on}}{T_{on} + T_{off}}\,P_s + \frac{L \cdot T_{off}}{T_{on} + T_{off}}\,P_\sigma + c \cdot \gamma(L)(P_a - P_s) \\
&= \frac{L \cdot E_{sw}}{T_{on} + T_{off}} + \frac{L \cdot T_{on}(P_s - P_\sigma)}{T_{on} + T_{off}} + L \cdot P_\sigma + c \cdot \gamma(L)(P_a - P_s),
\end{aligned}$$

where Esw is Esw,on + Esw,sleep for brevity. Given a sufficiently large L, without changing the scheduling policy, the minimization of energy consumption E(L, Ton, Toff) of a single processor is to find Toff and Ton such that the average idle power consumption P(Ton, Toff) is minimized.

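$$P(T_{on}, T_{off}) \stackrel{\text{def}}{=} \frac{\dfrac{L \cdot E_{sw}}{T_{on} + T_{off}} + \dfrac{L \cdot T_{on} \cdot (P_s - P_\sigma)}{T_{on} + T_{off}}}{L} = \frac{E_{sw} + T_{on} \cdot (P_s - P_\sigma)}{T_{on} + T_{off}}. \tag{10}$$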

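By defining $K = \frac{T_{on}}{T_{on} + T_{off}}$, the average idle power consumption P in (10) can be defined by $T_{off}$ and $K$ ($0 \leq K \leq 1$) as follows:

$$P(K, T_{off}) \stackrel{\text{def}}{=} \frac{E_{sw}}{T_{off}} + \left((P_s - P_\sigma) - \frac{E_{sw}}{T_{off}}\right) \cdot K. \tag{11}$$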

By analyzing (11), it is obvious that the following properties hold.

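Property 1. $\forall T_{off}$, $T_{off} \geq \frac{E_{sw}}{P_s - P_\sigma}$, $P(K, T_{off})$ gets its minimum when $K$ gets its minimum.

Property 2. $\forall T_{off}$, $T_{off} < \frac{E_{sw}}{P_s - P_\sigma}$, $P(K, T_{off})$ gets its minimum as $P_s - P_\sigma$ when $K = 1$.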

According to Properties 1 and 2, when $T_{off} > \frac{E_{sw}}{P_s - P_\sigma}$ holds, the processing unit should turn on as briefly as possible in one period. When $T_{off} \leq \frac{E_{sw}}{P_s - P_\sigma}$ holds, the processing unit should turn on all the time with $T_{off} = 0$. In this context, $\frac{E_{sw}}{P_s - P_\sigma}$ can be seen as the break-even time of the processing unit.
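The two forms of the average idle power, Eqs. (10) and (11), and the sign change behind Properties 1 and 2 can be checked with a short Python sketch. The function names and the numeric values are illustrative assumptions.

```python
import math


def avg_idle_power(t_on, t_off, e_sw, p_s, p_sigma):
    """Eq. (10): (E_sw + T_on * (P_s - P_sigma)) / (T_on + T_off)."""
    return (e_sw + t_on * (p_s - p_sigma)) / (t_on + t_off)


def avg_idle_power_k(k, t_off, e_sw, p_s, p_sigma):
    """Eq. (11): E_sw / T_off + ((P_s - P_sigma) - E_sw / T_off) * K, for T_off > 0."""
    return e_sw / t_off + ((p_s - p_sigma) - e_sw / t_off) * k


# Illustrative values: E_sw / (P_s - P_sigma) = 3 is the break-even time.
e_sw, p_s, p_sigma = 3.0, 1.5, 0.5

# Both forms agree for T_on = 2, T_off = 6, that is, K = 0.25.
p10 = avg_idle_power(2.0, 6.0, e_sw, p_s, p_sigma)
p11 = avg_idle_power_k(0.25, 6.0, e_sw, p_s, p_sigma)

# Property 1: T_off = 6 exceeds the break-even time, so P grows with K.
grows_with_k = avg_idle_power_k(0.2, 6.0, e_sw, p_s, p_sigma) < avg_idle_power_k(0.8, 6.0, e_sw, p_s, p_sigma)

# Property 2: T_off = 2 is below the break-even time, so the minimum is at K = 1.
p_at_k1 = avg_idle_power_k(1.0, 2.0, e_sw, p_s, p_sigma)
```

The coefficient of K in Eq. (11) is positive exactly when Toff exceeds Esw/(Ps − Pσ), which is why the preferred duty cycle flips at the break-even time.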

Based on (10), the energy minimization problem of an m-stage pipeline can be formulated as minimizing the function

$$P(\vec{T}_{on}, \vec{T}_{off}) = \sum_{i=1}^{m} \frac{E^{i}_{sw} + T^{i}_{on} \cdot (P^{i}_{s} - P^{i}_{\sigma})}{T^{i}_{on} + T^{i}_{off}}, \tag{12}$$

where $\vec{T}_{on} = [T^{1}_{on}\ T^{2}_{on}\ \ldots\ T^{m}_{on}]$ and $\vec{T}_{off} = [T^{1}_{off}\ T^{2}_{off}\ \ldots\ T^{m}_{off}]$. Now we can define the problem that we studied as follows.

Given a pipelined platform with m stages, an event stream S processed by this pipeline, and an end-to-end deadline requirement D, we are to find a set of periodic power managements characterized by $\vec{T}_{on}$ and $\vec{T}_{off}$ that minimize the average idle power consumption P defined in (12) while guaranteeing that the worst-case end-to-end delay does not exceed D.

4. MOTIVATION EXAMPLE

A phenomenon called pay-burst-only-once is well known and can give a closer upper estimate on the delay when an end-to-end service curve is derived prior to delay computations [Fidler 2003]. When a workload flow with a burst traverses a number of stages in sequence, the effect of the burst of the flow on the end-to-end delay bound is the same as if the flow traversed only one node. The end-to-end delay bound computed with this property can be tighter than the sum of delay bounds of each node.

Fig. 3. Motivation example.

This section presents a motivation example where an event stream passes through a two-stage pipeline with a deadline requirement D. For simplicity, arrival curves in the leaky-bucket form and service curves in rate-latency form [Le Boudec and Thiran 2001] are used. In this representation, an arrival curve is modeled as $\alpha(\Delta) = b + r \cdot \Delta$, where b is the burst and r the leaky rate. Correspondingly, a service curve is modeled as $\beta(\Delta) = R \cdot (\Delta - T)$, where R is the service rate and T the delay. A graphical illustration of the example is shown in Figure 3, where D = 20, b = 5, r = 0.5, and R1 = R2 = 1.

We first inspect the strategy of partitioning the end-to-end deadline and using the partitioned sub-deadlines for the two pipeline stages. For simplicity, we split D equally, that is, D/2 for each stage. As shown in Figure 3(a), given the D/2 deadline requirement for the first pipeline stage, we obtain the maximal $T_1 = \frac{D}{2} - \frac{b}{R_1} = 5$, corresponding to the minimal service demand $\beta_1 = \Delta - 5$. Deriving the minimal $\beta_2$ for the second stage of the pipeline is more involved. We need the output arrival curve $\alpha'$ from the first stage. According to Le Boudec and Thiran [2001], $\alpha'(\Delta) = b + r \cdot T_1 + r \cdot \Delta$. Now again, with a deadline requirement D/2 for $\alpha'$, we have $T_2 = \frac{D}{2} - \frac{b + r \cdot T_1}{R_2} = 2.5$.

Let's take a close look at this solution. According to the concatenation theorem $\beta_{R_1,T_1} \otimes \beta_{R_2,T_2} = \beta_{\min(R_1,R_2),\,T_1+T_2}$, we get a concatenated service curve $\beta = \Delta - (T_1 + T_2) = \Delta - 7.5$. With this concatenated service curve, the worst-case end-to-end delay under $\beta_1$ and $\beta_2$ is 12.5 (the latency 7.5 plus the burst term $b/\min(R_1, R_2) = 5$), which is far stricter than D = 20. This example indicates that the obtained $\beta_1$ and $\beta_2$ based on partitioning the end-to-end deadline are too pessimistic.

The reason for the pessimism comes from paying the burst $b/R_1$ for the second stage of the pipeline as well as the additional delay $\frac{r \cdot T_1}{R_2}$ from the first stage, as the pay-burst-only-once principle points out. These effects will be accumulated for every stage of the pipeline, leading to even more pessimistic results as the number of pipeline stages increases. In addition, computing the resource demand of each stage requires the lower bound of the output arrival curve from the previous stage. Computing this output curve requires numerical min-plus convolution that will incur considerable computational and memory overheads. In conclusion, the strategy based on partitioning the end-to-end deadline is not a viable approach, in particular for those cases of pipelined systems with many stages.

On the other hand, one can first derive the total concatenated service demand $\beta^{l}_{T}$, in this case T = 15 as shown in Figure 3(b). Any partition based on this T will result in smaller but valid service curves for each pipeline stage, as we can always retrieve the original end-to-end deadline by means of the pay-burst-only-once principle. For example, by an equal partition of T, both T1 and T2 are 7.5 and D is still preserved. This brings the basic idea of our approach that will be presented in the next section.
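The arithmetic of the motivation example can be replayed in a few lines of Python. The sketch relies on the standard network-calculus delay bound T + b/R for a leaky-bucket flow served by a rate-latency curve with R ≥ r; the variable names are illustrative.

```python
# Parameters of the example in Fig. 3.
D, b, r = 20.0, 5.0, 0.5
R1 = R2 = 1.0
R = min(R1, R2)  # rate of the concatenated rate-latency curve

# Strategy 1: split the deadline, D/2 per stage.
T1 = D / 2 - b / R1              # latency budget of stage 1
b_out = b + r * T1               # burst of the output arrival curve of stage 1
T2 = D / 2 - b_out / R2          # latency budget of stage 2
delay_partitioned = (T1 + T2) + b / R

# Strategy 2: derive the total latency budget first, then split it.
T_total = D - b / R
T1_inv = T2_inv = T_total / 2
delay_inverse = (T1_inv + T2_inv) + b / R

print(T1, T2, delay_partitioned)       # budgets and bound when partitioning D
print(T1_inv, T2_inv, delay_inverse)   # budgets and bound with the inverse use
```

Partitioning the deadline leaves 7.5 time units of the end-to-end budget unused, whereas splitting the concatenated latency T gives each stage a longer latency budget while the end-to-end bound still equals D.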

5. PROPOSED APPROACH

Our approach lies in an inverse use of the pay-burst-only-once principle, as mentioned in the previous section. Rather than directly partitioning the end-to-end deadline, we compute one service curve for the entire pipeline, which serves as a constraint for the minimal resource demand. The energy minimization problem is then formulated with respect to the resource demands for individual pipeline stages. To solve this minimization problem, the formulation is transformed into a quadratic programming form and solved by a 2-phase heuristic.

Without loss of generality, a pipelined system with m heterogeneous stages (m ≥ 2) is considered. The processor of the i-th stage can provide minimal service $\beta^{Gl}_i$. Since periodic power management is considered, the minimal service $\beta^{Gl}_i$ can be modeled as a $T^{i}_{on}$ and $T^{i}_{off}$ pair:

$$\beta^{Gl}_i(\Delta) = \left(T^{i}_{on} \left\lceil \frac{\Delta - T^{i}_{off}}{T^{i}_{on} + T^{i}_{off}} \right\rceil\right) \otimes \Delta. \tag{13}$$

The derivation of Eq. (13) is presented in Lemma A.1 in the appendix section. In addition, to obtain a tight lower bound of the service curve of the entire pipeline, we restrict $T^{i}_{on}$ as a multiple of the worst-case execution time $c_i$, that is, $T^{i}_{on} = n_i c_i$, $n_i \in \mathbb{N}^{+}$.
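To get a feel for the shape of the service curve in Eq. (13), the following Python sketch evaluates it for a single stage. It is an assumption-laden illustration, not the article's derivation: the min-plus convolution with Δ is computed in closed form by observing that the infimum is attained either at s = Δ or at the start of the most recent on phase, and the staircase is clipped at 0. The function names are hypothetical and T_on > 0 is assumed.

```python
import math


def ppm_staircase(s, t_on, t_off):
    """T_on * ceil((s - T_off) / (T_on + T_off)), clipped at 0."""
    return max(0.0, t_on * math.ceil((s - t_off) / (t_on + t_off)))


def ppm_lower_service(delta, t_on, t_off):
    """Closed-form evaluation of inf over 0 <= s <= delta of staircase(s) + (delta - s)."""
    if delta <= t_off:
        return 0.0  # worst case: the processor starts with a full off phase
    period = t_on + t_off
    k = math.floor((delta - t_off) / period)   # complete periods after the first off phase
    ramp = delta - (k + 1) * t_off             # candidate s = t_off + k * period
    return min(ppm_staircase(delta, t_on, t_off), ramp)


# Illustrative stage: on for 2 time units, off for 3 time units.
for delta in (3.0, 4.0, 6.0, 9.0):
    print(delta, ppm_lower_service(delta, t_on=2.0, t_off=3.0))
```

The resulting curve stays at 0 during the first off phase, rises with slope 1 while the processor is on, and is flat again during each subsequent off phase.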

5.1. Problem Formulation

Regarding the problem formulation, we first present an approximation approach (see Lemma 5.1) to derive a lower bound of the PPM service curve. By using this approximated curve, we derive the concatenated service curve directly (see Lemma 5.2), which can be used to guarantee the real-time properties (see Theorem 5.3). Then, the energy minimization problem is formulated with respect to the resource demands for individual pipeline stages. Before presenting the formulation, we first state a few basics. By defining $K_i = \frac{T^{i}_{on}}{T^{i}_{on} + T^{i}_{off}}$, we have the following two lemmas.

LEMMA 5.1. βGli (�) ≥ Ki

ci(� − T i

off − ci).

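PROOF. According to the definition of the min-plus convolution operation $\otimes$, the inequality $\lfloor a + b \rfloor \ge \lfloor a \rfloor + \lfloor b \rfloor$, and Eq. (13), we have

$$\beta^{Gl}_i(\Delta) \ge \left\lfloor \frac{T^i_{on}\left\lceil \frac{\Delta - T^i_{off}}{T^i_{on} + T^i_{off}} \right\rceil}{c_i} \right\rfloor \otimes \left\lfloor \frac{\Delta}{c_i} \right\rfloor.$$

With the restriction $T^i_{on} = n_i c_i$, $n_i \in \mathbb{N}^+$ and $\lceil a \rceil \ge a$, we have

$$\left\lfloor \frac{T^i_{on}\left\lceil \frac{\Delta - T^i_{off}}{T^i_{on} + T^i_{off}} \right\rceil}{c_i} \right\rfloor = n_i \cdot \left\lceil \frac{\Delta - T^i_{off}}{T^i_{on} + T^i_{off}} \right\rceil \ge \frac{K_i}{c_i}\left(\Delta - T^i_{off}\right).$$

According to $\lfloor a \rfloor \ge a - 1$, we have $\left\lfloor \frac{\Delta}{c_i} \right\rfloor \ge \frac{1}{c_i}(\Delta - c_i)$.

According to the rule of min-plus convolution of rate-latency service curves, $\beta_{R_1,T_1} \otimes \beta_{R_2,T_2} = \beta_{\min(R_1,R_2),\,T_1+T_2}$, in Le Boudec and Thiran [2001] and $K_i \le 1$, we have

$$\frac{K_i}{c_i}\left(\Delta - T^i_{off}\right) \otimes \frac{1}{c_i}(\Delta - c_i) = \min\left(\frac{K_i}{c_i}, \frac{1}{c_i}\right)\left(\Delta - T^i_{off} - c_i\right) = \frac{K_i}{c_i}\left(\Delta - T^i_{off} - c_i\right).$$

Then, we get the right side of the inequality.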
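As a sanity check of Lemma 5.1, the following Python sketch compares an event-based PPM service curve with the rate-latency bound on a grid of window lengths. It rests on two assumptions that are not spelled out at this point of the article: the time-based curve is evaluated with the closed form of Eq. (21) from the appendix, and the event-based service is taken as $\lfloor \beta / c_i \rfloor$. The function names and the parameter values ($T_{on} = 6$, $T_{off} = 4$, $c = 2$) are purely illustrative.

```python
from fractions import Fraction
from math import ceil, floor


def beta_ppm(delta, t_on, t_off):
    """Time-based PPM service in any window of length delta, closed form of Eq. (21)."""
    period = t_on + t_off
    return max(floor(delta / period) * t_on,
               delta - ceil(delta / period) * t_off)


def lemma51_bound(delta, t_on, t_off, c):
    """Rate-latency lower bound (K / c) * (delta - T_off - c) from Lemma 5.1."""
    k = Fraction(t_on, t_on + t_off)
    return k / c * (delta - t_off - c)


def bound_holds(t_on, t_off, c, horizon=100, steps_per_unit=100):
    """Check the bound on a grid; exact rational arithmetic avoids rounding noise."""
    for n in range(horizon * steps_per_unit + 1):
        delta = Fraction(n, steps_per_unit)
        # Assumption: served events = floor(time-based service / WCET).
        events = floor(beta_ppm(delta, t_on, t_off) / c)
        if events < lemma51_bound(delta, t_on, t_off, c):
            return False
    return True


if __name__ == "__main__":
    # T_on = 6 is a multiple of c = 2, as required by the restriction T_on = n * c.
    print(bound_holds(t_on=6, t_off=4, c=2))
```

A grid check of this kind can only falsify the bound for a given parameter set; it does not replace the proof.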

LEMMA 5.2. $\bigotimes_{i=1}^{m} \beta^{Gl}_i \ge \min_{i=1}^{m}\left(\frac{K_i}{c_i}\right)\left(\Delta - \sum_{i=1}^{m}\left(T^i_{off} + c_i\right)\right)$.

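PROOF. According to the rule of min-plus convolution of rate-latency service curves, $\beta_{R_1,T_1} \otimes \beta_{R_2,T_2} = \beta_{\min(R_1,R_2),\,T_1+T_2}$, in Le Boudec and Thiran [2001] and Lemma 5.1, we have

$$\bigotimes_{i=1}^{m} \beta^{Gl}_i \ge \bigotimes_{i=1}^{m} \frac{K_i}{c_i}\left(\Delta - T^i_{off} - c_i\right) = \min_{i=1}^{m}\left(\frac{K_i}{c_i}\right)\left(\Delta - \sum_{i=1}^{m}\left(T^i_{off} + c_i\right)\right).$$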

With Lemma 5.2, we state the next theorem.

THEOREM 5.3. Assuming an event stream modeled with arrival curve $\alpha$ is processed by an m-stage pipeline and the lower service curve of each pipeline stage is defined by a $T^i_{on}$ and $T^i_{off}$ pair, the pipelined system satisfies an end-to-end deadline D if the following condition holds.

$$\min_{i=1}^{m}\left(\frac{K_i}{c_i}\right)\left(\Delta - \sum_{i=1}^{m}\left(T^i_{off} + c_i\right)\right) \ge \alpha^u(\Delta - D) \qquad (14)$$

PROOF. In Lemma 5.2, the right-hand side of the inequality is a lower bound of $\bigotimes_{i=1}^{m} \beta^{Gl}_i$, which is the concatenated service curve of the pipeline. With $\bigotimes_{i=1}^{m} \beta^{Gl}_i \ge \alpha^u(\Delta - D)$, the end-to-end delay of the pipeline is no more than D according to the pay-burst-only-once principle. Therefore, the theorem holds.

The left-hand side of inequality (14) can be considered as a bounded delay function $bdf(\Delta, \rho_0, b_0) = \max(0, \rho_0(\Delta - b_0))$ with slope $\rho_0 = \min_{i=1}^{m}\left(\frac{K_i}{c_i}\right)$ and bounded delay $b_0 = \sum_{i=1}^{m}\left(T^i_{off} + c_i\right)$. For the stream S with deadline D, a set of minimum bounded delay functions $bdf_{min}(\Delta, \rho, b)$ can be derived by varying b (see Section 5.2). Therefore, we should find a solution $[\vec{K}, \vec{T}_{off}]$ such that the resulting bounded delay function $bdf(\Delta, \rho_0, b_0)$ is no less than the minimum bounded delay function $bdf_{min}(\Delta, \rho, b)$. We can thus formulate our optimization problem as follows:

$$\begin{aligned}
\underset{\vec{K},\,\vec{T}_{off}}{\text{minimize}} \quad & P(\vec{K}, \vec{T}_{off}) \\
\text{subject to} \quad & \min_{i=1}^{m}\left(\frac{K_i}{c_i}\right) \ge \rho \\
& \sum_{i=1}^{m}\left(T^i_{off} + c_i\right) \le b \\
& 0 \le K_i \le 1, \quad i = 1, \ldots, m \\
& T^i_{off} \ge 0, \quad i = 1, \ldots, m,
\end{aligned} \qquad (15)$$



where $\vec{K} = [K_1, \ldots, K_m]$. $P(\vec{K}, \vec{T}_{off})$ is obtained as follows by applying the transformation $K_i = \frac{T^i_{on}}{T^i_{on} + T^i_{off}}$ to the average power consumption (10) of each stage.

$$P(\vec{K}, \vec{T}_{off}) = \sum_{i=1}^{m}\left(\frac{E^i_{sw}(1 - K_i)}{T^i_{off}} + \left(P^i_s - P^i_\sigma\right)K_i\right).$$

The advantage of formulation (15) is twofold. First of all, the service curves of individual pipeline stages are the variables of the optimization problem, which, on the one hand, overcomes the problem of paying the burst multiple times while, on the other hand, avoiding the costly $\otimes$ computation during the optimization. Second, this formulation allows us to use a more efficient method to analyze the problem, as presented in the following sections.

5.2. Quadratic Programming Transformation

How to solve the minimization problem (15) is not obvious. The constraints b and $\rho$, indeed, are not fixed values and, in addition, these two constraints are correlated. For a fixed b, the minimum bounded delay function $bdf_{min}(\Delta, \rho, b)$ can be determined by computing $\rho$.

$$\rho = \inf\{\rho : bdf(\Delta, \rho, b) \ge \alpha^u(\Delta - D),\ \forall \Delta \ge 0\}. \qquad (16)$$

In this article, we conduct the optimization by varying b and computing $\rho$ for every possible b. For a fixed b, we can transform (15) into a quadratic programming problem with box constraints (QPB), as stated in the following lemma.
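To make Eq. (16) concrete, the sketch below computes the smallest slope $\rho$ for a given bounded delay b. It is only an illustration under assumptions that go beyond this section: the arrival curve is taken to be the commonly used PJD upper curve $\lceil (\Delta + j)/p \rceil$ with the minimal interarrival distance d ignored, a right-continuous variant of that staircase is used so that the result is conservative at the jump points, and only a finite number of jumps is inspected. The names `alpha_u_right` and `min_slope` are hypothetical.

```python
from fractions import Fraction


def alpha_u_right(x, p, j):
    """Right-continuous PJD upper arrival curve without the d term (assumed form)."""
    if x < 0:
        return 0
    return (x + j) // p + 1


def min_slope(p, j, deadline, b, n_jumps=1000):
    """Smallest rho with rho * (Delta - b) >= alpha_u(Delta - D) at all inspected points."""
    if b >= deadline:
        raise ValueError("no finite slope exists when b >= D")
    rho = Fraction(1, p)  # long-run arrival rate, the limit for large windows
    candidates = [0] + [n * p - j for n in range(1, n_jumps + 1) if n * p - j > 0]
    for x in candidates:
        # x = Delta - D is the left end of a step of the staircase.
        rho = max(rho, Fraction(alpha_u_right(x, p, j)) / (x + deadline - b))
    return rho


if __name__ == "__main__":
    # Period 100, jitter 50, deadline 150, bounded delay 30 (illustrative values).
    print(min_slope(100, 50, 150, 30))
```

Because the staircase is constant between jumps while the denominator grows, only the left ends of the steps need to be inspected, which is why the loop runs over jump points rather than over a dense grid.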

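LEMMA 5.4. The minimization problem in (15) can be transformed into the following quadratic programming problem with box constraints:

$$\begin{aligned}
\underset{\vec{x} = [x_1 \ldots x_m]}{\text{minimize}} \quad & \vec{x}^{T} Q \vec{x} \\
\text{subject to} \quad & 0 \le x_i \le \sqrt{E^i_{sw}(1 - \rho c_i)}, \quad i = 1, \ldots, m,
\end{aligned} \qquad (17)$$

where $Q = A - B$, A is an $m \times m$ matrix of ones, and B is an $m \times m$ diagonal matrix with ith diagonal element $\frac{\left(b - \sum_{j=1}^{m} c_j\right)\left(P^i_s - P^i_\sigma\right)}{E^i_{sw}}$.

Denote $\vec{x}^*$ as the optimal solution for the QPB problem in (17); then the optimal solution for (15) can be obtained with $K_i = 1 - \frac{(x^*_i)^2}{E^i_{sw}}$ and $T^i_{off} = \frac{x^*_i}{\sum_{j=1}^{m} x^*_j}\left(b - \sum_{j=1}^{m} c_j\right)$.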

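PROOF. With the Cauchy-Buniakowski-Schwartz inequality, we get

$$\sum_{i=1}^{m} T^i_{off} \cdot \sum_{i=1}^{m} \frac{E^i_{sw}(1 - K_i)}{T^i_{off}} \ge \left(\sum_{i=1}^{m} \sqrt{E^i_{sw}(1 - K_i)}\right)^2.$$

The minimum value of $\sum_{i=1}^{m} \frac{E^i_{sw}(1 - K_i)}{T^i_{off}}$ is obtained at $\frac{\left(\sum_{i=1}^{m} \sqrt{E^i_{sw}(1 - K_i)}\right)^2}{b - \sum_{j=1}^{m} c_j}$ when the following equation holds.

$$T^i_{off} = \frac{\sqrt{E^i_{sw}(1 - K_i)}}{\sum_{j=1}^{m} \sqrt{E^j_{sw}(1 - K_j)}}\left(b - \sum_{j=1}^{m} c_j\right).$$

Then the optimization formulation in (15) can be rewritten as

$$\begin{aligned}
\underset{K_1, K_2, \ldots, K_m}{\text{minimize}} \quad & \frac{\left(\sum_{i=1}^{m} \sqrt{E^i_{sw}(1 - K_i)}\right)^2}{b - \sum_{j=1}^{m} c_j} + \sum_{i=1}^{m}\left(P^i_s - P^i_\sigma\right)K_i \\
\text{subject to} \quad & \rho c_i \le K_i \le 1, \quad i = 1, \ldots, m.
\end{aligned}$$

By defining $x_i = \sqrt{E^i_{sw}(1 - K_i)}$, formulation (15) can be transformed into the QPB problem in (17).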
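The transformation of Lemma 5.4 can be sketched in a few lines of Python. The snippet builds $Q = A - B$, maps a vector $\vec{x}$ back to $[\vec{K}, \vec{T}_{off}]$, and checks numerically that $\vec{x}^T Q \vec{x}$ equals $(b - \sum_j c_j)\,(P(\vec{K}, \vec{T}_{off}) - \sum_i (P^i_s - P^i_\sigma))$, that is, the objective of (15) up to a positive scale factor and a constant. All function names are hypothetical, the vector `x` is an arbitrary feasible point rather than a solver output, and the units (ms, mW, and μJ = mW·ms) are an assumption made for the example.

```python
from math import isclose


def build_q(e_sw, p_s, p_sigma, c, b):
    """Q = A - B: A is all ones, B is diagonal with (b - sum c) (P_s - P_sigma) / E_sw."""
    slack = b - sum(c)
    m = len(c)
    return [[1.0 - (slack * (p_s[i] - p_sigma[i]) / e_sw[i] if i == j else 0.0)
             for j in range(m)] for i in range(m)]


def recover_solution(x, e_sw, c, b):
    """Map a QPB point x back to (K, T_off) as stated in Lemma 5.4 (needs sum(x) > 0)."""
    slack = b - sum(c)
    total = sum(x)
    k = [1.0 - xi * xi / e for xi, e in zip(x, e_sw)]
    t_off = [xi / total * slack for xi in x]
    return k, t_off


def average_power(k, t_off, e_sw, p_s, p_sigma):
    """Objective P(K, T_off) of formulation (15)."""
    return sum(e * (1.0 - ki) / ti + (ps - pg) * ki
               for ki, ti, e, ps, pg in zip(k, t_off, e_sw, p_s, p_sigma))


def quad_form(q, x):
    """Evaluate x^T Q x for a dense list-of-lists matrix."""
    return sum(x[i] * q[i][j] * x[j] for i in range(len(x)) for j in range(len(x)))


if __name__ == "__main__":
    e_sw, p_s, p_sigma = [483.0, 483.0], [390.0, 390.0], [0.05, 0.05]
    c, b, x = [10.0, 20.0], 60.0, [10.0, 15.0]
    k, t_off = recover_solution(x, e_sw, c, b)
    lhs = quad_form(build_q(e_sw, p_s, p_sigma, c, b), x)
    offset = sum(ps - pg for ps, pg in zip(p_s, p_sigma))
    rhs = (b - sum(c)) * (average_power(k, t_off, e_sw, p_s, p_sigma) - offset)
    print(isclose(lhs, rhs, rel_tol=1e-9))
```

The recovered sleep times always sum to $b - \sum_j c_j$, so the delay constraint of (15) is met with equality.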

Note that there is a feasible region for b. To guarantee that all the resulting $T^i_{off} \ge 0$, the bounded delay b should not be less than $\sum_{i=1}^{m} c_i$. According to (14), the maximum slope $\rho$ of the bounded delay function will not exceed $\frac{1}{\max_{i=1}^{m} c_i}$. Correspondingly, we derive the minimum bounded delay function $bdf_{min}\left(\Delta, \frac{1}{\max_{i=1}^{m} c_i}, b\right)$. By inverting (16), we can derive the maximum delay $b_u$ by (18), which guarantees that all the resulting $K_i$ will not exceed 1. In summary, the feasible region $b \in [b_l, b_u]$ can be bounded as follows:

$$\begin{aligned}
b_u &= \sup\left\{d : bdf\left(\Delta, \frac{1}{\max_{i=1}^{m} c_i}, d\right) \ge \alpha^u(\Delta - D),\ \forall \Delta \ge 0\right\} \\
b_l &= \sum_{i=1}^{m} c_i.
\end{aligned} \qquad (18)$$

5.3. Quadratic Programming Heuristic

With the preceding information, we can now present the overall algorithm for the energy minimization problem defined in Section 3.4. Basically, the bounded delay b is scanned by step $\varepsilon$ within the range $[b_l, b_h]$. For each b, we first solve the subproblem (17) with a QPB solver, and then the obtained solution is repaired to fulfill further constraints (this will be explained later on). The pseudocode of the algorithm is depicted in Algorithm 1.

ALGORITHM 1: Quadratic Programming Heuristic
Input: $\alpha^u$, $b_l$, $b_h$, $\varepsilon$, and $P_{min} = \infty$
Output: $\vec{K}_{opt}$, $\vec{T}_{off,opt}$
1: for $b = b_l$ to $b_h$ with step $\varepsilon$ do
2:    compute $\rho$ by Eq. (16);
3:    obtain $\vec{K}$ and $\vec{T}_{off}$ by solving (17);
4:    repair $\vec{K}$ and $\vec{T}_{off}$;
5:    if $P(\vec{K}, \vec{T}_{off}) < P_{min}$ then
6:       $\vec{K}_{opt} \leftarrow \vec{K}$; $\vec{T}_{off,opt} \leftarrow \vec{T}_{off}$;
7:       $P_{min} \leftarrow P(\vec{K}_{opt}, \vec{T}_{off,opt})$;
8:    end if
9: end for

THEOREM 5.5. If $\exists i \in \{1, 2, \ldots, m\}$ such that $\frac{E^i_{sw}}{P^i_s - P^i_\sigma} < b - \sum_{j=1}^{m} c_j$, then the problem is NP-hard.

PROOF. If there exists a stage $p_i$ for which the condition $\frac{E^i_{sw}}{P^i_s - P^i_\sigma} < b - \sum_{j=1}^{m} c_j$ holds, the ith diagonal element of the matrix Q in Lemma 5.4 is negative, so Q is not positive semi-definite. Thus, QPB is a nonconvex quadratic programming problem, which is NP-hard [Jeyakumar et al. 2006].



To solve the subproblem (line 3 in Algorithm 1), we apply an existing QPB solver. According to Theorem 5.5, QPB is NP-hard when the scanned bounded delay b is big enough (i.e., $\frac{E^i_{sw}}{P^i_s - P^i_\sigma} < b - \sum_{j=1}^{m} c_j$). It is in general difficult to solve the problem optimally. Nevertheless, there are approximation schemes [Fu et al. 1998] that can efficiently solve the nonconvex QPB and there are many excellent off-the-shelf software packages [Chen and Burer 2012] available. In this article, the state-of-the-art finite B&B algorithm [Chen and Burer 2012] is applied to solve our QPB problem.

After obtaining a pair of $\vec{K}$ and $\vec{T}_{off}$, the repair phase (line 4 in Algorithm 1) is conducted to fulfill further constraints. This repair scheme is presented in Algorithm 2. First of all, the resulting $T^i_{off}$ of pipeline stage i may be smaller than $t^i_{sw}$. In the case where $T^i_{off} < t^i_{sw}$, turning off the processor of stage i is not possible; therefore, the solution for stage i is repaired to $[K'_i, T^{i\prime}_{off}] = [1, 0]$, that is, stage i is on all the time (line 2 in Algorithm 2). However, this repair step leads to a loss of sleep time $Q_i$ for each such stage (line 21 in Algorithm 2). We record this loss and try to reassign it to other stages at the end of the algorithm (lines 21–32 in Algorithm 2) to reduce the power consumption further. Second, the resulting $T^i_{on}$ may not be a multiple of $c_i$, which is one of our basic requirements. The repair steps are conducted to make $T^i_{on}$ a multiple of $c_i$ (lines 6–20 in Algorithm 2). To guarantee that the resulting $K'_i$ stays equal to $K_i$, $T^{i\prime}_{off}$ should be adjusted to $\frac{T^{i\prime}_{on}}{K_i} - T^{i\prime}_{on}$, and $\Delta T_i$ indicates by how much the sleep time of stage i should be adjusted compared to the original $T^i_{off}$ (line 14 in Algorithm 2).

If $\Delta T_i > 0$ holds, $T^{i\prime}_{on}$ has decreased and stage i should decrease its sleep time $T^i_{off}$ to keep $K'_i$ constant (line 16 in Algorithm 2), which results in the loss $Q_i$; this part can be reassigned to prolong the sleep time of other stages. $\Delta T_i \le 0$ indicates that $T^{i\prime}_{on}$ has increased and the stage should increase its sleep time $T^i_{off}$ to keep $K'_i$ constant. For this case, we keep $T^{i\prime}_{off}$ equal to $T^i_{off}$, which results in an increase of $K'_i$ and a power consumption increase $\Delta E_i$ (line 18 in Algorithm 2). In the end, the total loss Q is reassigned to the stages with $\Delta T_i < 0$ to reduce the power consumption further (lines 21–32 in Algorithm 2). The reassignment heuristic uses the power increase $\Delta E_i$ as a metric to decide which stage should be served first. Specifically, the heuristic iterates through all stages that need to compensate and, in each iteration, picks the stage with the maximum power increase $\Delta E_i$ and increases its $T^i_{off}$ without causing $K'_i < K_i$. The reassignment heuristic terminates when there is no loss left to reassign or no stage needs to compensate. It is worth noting that the repair phase still guarantees that the repaired solution satisfies the constraints, as stated in Lemma 5.6.

LEMMA 5.6. The solution repaired by Algorithm 2 satisfies the constraints in (15).

PROOF. The operations in lines 2–20 neither increase the term $\sum_{i=1}^{m} T^i_{off}$ nor cause $K'_i < K_i$, so the constraints in (15) remain satisfied. The reassignment heuristic (lines 21–32) reassigns the total loss Q to the stages that need to compensate and increases their sleep time $T^i_{off}$ without increasing the total sleep time $\sum_{i=1}^{m} T^i_{off}$ and without causing $K'_i < K_i$. Thus, the solution repaired by Algorithm 2 satisfies the constraints in (15).

5.4. Fast Heuristic

In Section 5.3, we presented a quadratic programming heuristic with QPB transformation. According to Theorem 5.5, QPB is NP-hard when the scanned bounded delay b is big enough. Assuming that the bounded delay b is scanned in n steps, the heuristic in Section 5.3 needs to solve this NP-hard problem several times, which is time consuming.



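ALGORITHM 2: Repair Scheme
Input: solution of QPB problem: $[\vec{K}, \vec{T}_{off}]$
Output: $[\vec{K}', \vec{T}'_{off}]$
1: compute the stage set: $S_1 = \{p_i \mid T^i_{off} < t^i_{sw}\}$;
2: repair $[K', T'_{off}]$ of the stage $p \in S_1$ as $[1, 0]$;
3: update budget $\Delta T_i \leftarrow T^i_{off}$ and power increase $\Delta E_i \leftarrow 0$ for stages $p \in S_1$;
4: compute $T_{on}$ and the stage set: $S_2 = \{p_i \mid T^i_{off} \ge t^i_{sw}\}$;
5: for each stage $p_i \in S_2$ do
6:    if $T^i_{on} < c_i$ then
7:       $T^{i\prime}_{on} \leftarrow c_i$;
8:    else
9:       $T^{i\prime}_{on} \leftarrow \lfloor T^i_{on} / c_i \rfloor\, c_i$;
10:      if $T^{i\prime}_{on} / K_i - T^{i\prime}_{on} < t^i_{sw}$ then
11:         $T^{i\prime}_{on} \leftarrow \lceil T^i_{on} / c_i \rceil\, c_i$;
12:      end if
13:   end if
14:   compute budget $\Delta T_i = T^i_{off} - \left(T^{i\prime}_{on} / K_i - T^{i\prime}_{on}\right)$;
15:   if $\Delta T_i \ge 0$ then
16:      $T^{i\prime}_{off} \leftarrow T^{i\prime}_{on} / K_i - T^{i\prime}_{on}$; $\Delta E_i \leftarrow 0$;
17:   else
18:      $T^{i\prime}_{off} \leftarrow T^i_{off}$; $\Delta E_i \leftarrow P(T^i_{on}, T^i_{off}) - P(T^{i\prime}_{on}, T^{i\prime}_{off})$;
19:   end if
20: end for
21: compute total budget $Q = \sum_{\Delta T_i > 0} \Delta T_i$;
22: while $Q > 0$ do
23:    find stage $p_i$ with maximum power increase $\Delta E_i$;
24:    if $\Delta T_i < 0$ then
25:       compute available allocation $allo = \min(Q, |\Delta T_i|)$;
26:       $T^{i\prime}_{off} \leftarrow T^{i\prime}_{off} + allo$; $\Delta T_i \leftarrow \Delta T_i + allo$;
27:       $\Delta E_i \leftarrow P(T^{i\prime}_{on}, T^{i\prime}_{off}) - P(T^i_{on}, T^i_{off})$;
28:       $Q \leftarrow Q - allo$;
29:    else
30:       break;
31:    end if
32: end while
33: update $[\vec{K}', \vec{T}'_{off}]$;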

Besides, in the first optimization step, the quadratic programming heuristic does not consider the break-even time constraint (i.e., $T^i_{off}$ of pipeline stage i is not smaller than $T^i_{BET}$), which could also make the result pessimistic. To overcome these drawbacks, we present a fast heuristic to find a suboptimal solution, running with O(mn) time complexity (m is the stage number). Different from the heuristic in Section 5.3, we consider the break-even time constraint in the optimization phase and partition the stage set P into two stage sets according to this constraint, rather than decoupling the break-even time constraint and the optimization. Based on this stage-set partition, we can derive a suboptimal solution as stated in Lemma 5.7.

LEMMA 5.7. Given a fixed bounded delay b, denote $[\vec{K}, \vec{T}_{off}]$ as the optimal solution for the problem. Partition the stage set P into two subsets $S_1$ and $S_2$, where $S_1 = \{p_i \mid T^i_{off} < T^i_{BET}\}$ and $S_2 = \{p_i \mid T^i_{off} \ge T^i_{BET}\}$. Then, the optimal solution $[\vec{K}, \vec{T}_{off}]$ can be determined as follows.

(1) For the stage $p_i \in S_1$, $[K_i, T^i_{off}] = [1, 0]$.

(2) For the stage $p_i \in S_2$, $[K_i, T^i_{off}] = [\rho c_i, x_i]$, where $x_i = \frac{w_i}{\sum_{p_i \in S_2} w_i}\left(b - \sum_{i=1}^{m} c_i\right)$ and $w_i = \sqrt{E^i_{sw}(1 - \rho \cdot c_i)}$.

PROOF. For the stage subset $S_2$, $T^i_{off} \ge T^i_{BET} \ge \frac{E^i_{sw}}{P^i_s - P^i_\sigma}$ holds. The average power consumption $P(K_i, T^i_{off})$ gets its minimum at $K_i = \rho c_i$ according to Property 1. Thus, the average power consumption of the stage subset $S_2$ can be transformed into $\sum_{p_i \in S_2} \frac{w_i^2}{T^i_{off}} + \sum_{p_i \in S_2} \rho c_i\left(P^i_s - P^i_\sigma\right)$ with the constraint $\sum_{p_i \in S_2} T^i_{off} \le b - \sum_{i=1}^{m} c_i - \sum_{p_i \in S_1} T^i_{off}$. According to the Cauchy-Buniakowski-Schwartz inequality, the optimal average power consumption of the stage subset $S_2$ can be determined as (19) when $[K_i, T^i_{off}] = \left[\rho c_i, \frac{w_i}{\sum_{p_i \in S_2} w_i}\left(b - \sum_{i=1}^{m} c_i - \sum_{p_i \in S_1} T^i_{off}\right)\right]$ holds.

$$\sum_{p_i \in S_2} P\left(K_i, T^i_{off}\right) = \frac{\left(\sum_{p_i \in S_2} w_i\right)^2}{b - \sum_{i=1}^{m} c_i - \sum_{p_i \in S_1} T^i_{off}} + \sum_{p_i \in S_2} \rho c_i\left(P^i_s - P^i_\sigma\right). \qquad (19)$$

According to (19), the average power consumption of the stage subset $S_2$ gets its minimum when $\sum_{p_i \in S_1} T^i_{off}$ gets its minimum.

For the stage set $S_1$, there are two cases: (a) $T^i_{BET} = t^i_{sw}$. In this case $T^i_{off} < t^i_{sw}$, so turning off the processor of stage i is not possible, as we stated for the repair scheme, due to the hardware requirement that the sleep time $T^i_{off}$ of the processor should not be smaller than the overhead $t^i_{sw}$. Thus, the solution for stage i is forced to $[K_i, T^i_{off}]_{p_i \in S_1} = [1, 0]$; (b) $T^i_{BET} = \frac{E^i_{sw}}{P^i_s - P^i_\sigma}$. In this case, with Property 2, the average power consumption of the stage subset $S_1$ gets its minimum at $[K_i, T^i_{off}]_{p_i \in S_1} = [1, 0]$. At this point, $\sum_{p_i \in S_1} T^i_{off}$ gets its minimum of 0, thus the average power consumption of the stage subset $S_2$ gets its minimum.

According to Lemma 5.7, the optimal solution can be derived directly once the stage partition $P = \{S_1, S_2\}$ is determined. Thus, the optimal solution can be derived by exhaustively exploring all possible stage partitions, with complexity $O(2^m)$ for m stages. When the stage number increases, the complexity increases exponentially. To reduce this complexity, a fast stage partition scheme is proposed in this article. In this scheme, we first greedily put all stages into the stage set $S_2 = \{p_i \mid T^i_{off} \ge T^i_{BET}\}$ (i.e., we assume all the stages can enter sleep mode). Under this greedy partitioning, we compute the optimal $T_{off}$ according to Lemma 5.7, as described in lines 1 and 2 in Algorithm 3. Then, we assign the stages by checking whether the resulting optimal $T_{off}$ under the greedy partition is no smaller than $T^i_{BET}$ (see lines 3–9 in Algorithm 3). The feasibility of this partition scheme is guaranteed by Lemma 5.8.

LEMMA 5.8. Stage partition $P = \{S_1, S_2\}$ generated by Algorithm 3 is feasible.

PROOF. In Algorithm 3, $\frac{w_i}{\sum_{p_i \in P} w_i}\left(b - \sum_{i=1}^{m} c_i\right) \ge T^i_{BET}$ holds for $S_2$. According to Lemma 5.7, $T^i_{off}$ in $S_2$ can be determined as $\frac{w_i}{\sum_{p_i \in S_2} w_i}\left(b - \sum_{i=1}^{m} c_i\right)$. As $S_2 \subseteq P$, we get that $T^i_{off} \ge \frac{w_i}{\sum_{p_i \in P} w_i}\left(b - \sum_{i=1}^{m} c_i\right) \ge T^i_{BET}$ holds for $S_2$; thus the stage partition generated by Algorithm 3 is feasible.

ALGORITHM 3: Greedy Partition Scheme
Input: $\rho$, b, P
Output: $S_1$, $S_2$
1: Compute $w_i = \sqrt{E^i_{sw}(1 - \rho \cdot c_i)}$ for each stage $p_i$;
2: Compute $x_i = \frac{w_i}{\sum_{p_i \in P} w_i}\left(b - \sum_{i=1}^{m} c_i\right)$ for each stage $p_i$;
3: for $p_i \in P$ do
4:    if $x_i < T^i_{BET}$ then
5:       Insert stage $p_i$ into set $S_1$;
6:    else
7:       Insert stage $p_i$ into set $S_2$;
8:    end if
9: end for
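The greedy partition of Algorithm 3 and the closed-form assignment of Lemma 5.7 can be sketched together in Python as follows. This is a minimal illustration rather than the authors' Matlab implementation: the function names are hypothetical, stages are identified by list index, $1 - \rho c_i \ge 0$ is assumed for every stage, and the numeric values in the example are made up.

```python
from math import sqrt


def stage_weights(rho, e_sw, c):
    """w_i = sqrt(E_sw^i * (1 - rho * c_i)); assumes rho * c_i <= 1."""
    return [sqrt(e * (1.0 - rho * ci)) for e, ci in zip(e_sw, c)]


def greedy_partition(rho, b, e_sw, c, t_bet):
    """Algorithm 3: split stage indices into S1 (always on) and S2 (may sleep)."""
    w = stage_weights(rho, e_sw, c)
    slack = b - sum(c)
    x = [wi / sum(w) * slack for wi in w]
    s1 = [i for i in range(len(c)) if x[i] < t_bet[i]]
    s2 = [i for i in range(len(c)) if x[i] >= t_bet[i]]
    return s1, s2


def lemma57_solution(rho, b, e_sw, c, s2):
    """Lemma 5.7: [K_i, T_off^i] = [1, 0] on S1 and [rho * c_i, x_i] on S2."""
    w = stage_weights(rho, e_sw, c)
    slack = b - sum(c)
    k = [1.0] * len(c)
    t_off = [0.0] * len(c)
    w_s2 = sum(w[i] for i in s2)
    for i in s2:
        k[i] = rho * c[i]
        t_off[i] = w[i] / w_s2 * slack
    return k, t_off


if __name__ == "__main__":
    rho, b = 0.02, 60.0
    e_sw, c, t_bet = [483.0, 483.0], [10.0, 20.0], [10.0, 15.0]
    s1, s2 = greedy_partition(rho, b, e_sw, c, t_bet)
    print(s1, s2, lemma57_solution(rho, b, e_sw, c, s2))
```

In this example the second stage would receive a sleep time below its break-even time under the all-sleep assumption, so it is moved to $S_1$ and the whole slack $b - \sum_i c_i$ goes to the first stage.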

For each b, we can first obtain a suboptimal partition by the greedy partition scheme depicted in Algorithm 3, and then the optimal solution under the obtained partition can be determined. The pseudocode of the algorithm is depicted in Algorithm 4.

ALGORITHM 4: Fast Heuristic
Input: $\alpha^u$, $b_l$, $b_h$, $\varepsilon$, $P_{min} = \infty$
Output: $[K_i, T^i_{off}]_{i=1}^{m}$
1: for $b = b_l$ to $b_h$ with step $\varepsilon$ do
2:    compute $\rho$ by Eq. (16);
3:    generate the feasible partition $S_1$ and $S_2$ by Algorithm 3;
4:    obtain $\vec{K}$ and $\vec{T}_{off}$ according to Lemma 5.7;
5:    repair $\vec{K}$ and $\vec{T}_{off}$ by Algorithm 2;
6:    if $P(\vec{K}, \vec{T}_{off}) < P_{min}$ then
7:       $\vec{K}_{opt} \leftarrow \vec{K}$; $\vec{T}_{off,opt} \leftarrow \vec{T}_{off}$;
8:       $P_{min} \leftarrow P(\vec{K}_{opt}, \vec{T}_{off,opt})$;
9:    end if
10: end for

6. PERFORMANCE EVALUATIONS

In this section, we demonstrate the effectiveness of our approach. We compare three approaches in this section: (1) the pay-burst-only-once algorithm based on quadratic programming (PBOOA-QP) presented in Section 5.3; (2) the pay-burst-only-once algorithm based on the fast heuristic (PBOOA-FH) presented in Section 5.4; and (3) the deadline partition algorithm (DPA). DPA partitions the end-to-end deadline into sub-deadlines for individual pipeline stages and explores all the possible deadline partition combinations to find the deadline partition with the minimum energy consumption. For each deadline partition combination, DPA uses the scheme in Huang et al. [2009b] to minimize the energy consumption of individual pipeline stages to optimize the overall energy consumption. To show the effects of our scheme, we report the average idle power computed as Eq. (10) as well as the computation time of all the schemes. The simulation is implemented in Matlab using the RTC toolbox [Wandeler and Thiele 2006] and the finite B&B algorithm [Chen and Burer 2012] is used to solve QPB. All results are obtained from a 2.83 GHz processor with 4GB memory.



Table I. Constants for 70nm Technology [Martin et al. 2002; Wang and Mishra 2010]

Const   Value            Const   Value            Const   Value
K1      0.063            K6      5.26 × 10^−12    Vth1    0.244
K2      0.153            K7      −0.144           Ij      4.8 × 10^−10
K3      5.38 × 10^−7     Vdd     [0.5, 1]         Ceff    0.43 × 10^−9
K4      1.83             Vbs     [−1, 0]          Ld      37
K5      4.19             α       1.5              Lg      4 × 10^6

Table II. Power Parameters

Vdd     Pa       Ps       Pσ      Esw      tsw
0.7V    656mW    390mW    50μW    483μJ    10ms

6.1. Simulation Setup

The experiments are conducted based on the classical energy model of a 70nm technology processor in Martin et al. [2002], Wang and Mishra [2010], Jejurikar et al. [2004], and Chen et al. [2014], whose accuracy has been verified with SPICE simulation. Table I lists the energy parameters under 70nm technology [Martin et al. 2002; Wang and Mishra 2010; Jejurikar et al. 2004; Chen et al. 2014]. According to Jejurikar et al. [2004], executing at $V_{dd}$ = 0.7V is more energy efficient than executing at lower voltage levels. To achieve the minimization of the overall energy consumption of the system, we assume that the processor runs at this critical frequency level when the processor is in the active state. From Wang and Mishra [2010] and Jejurikar et al. [2004], the body bias voltage $V_{bs}$ is obtained as −0.7V. From Jejurikar et al. [2004], $P_{on}$ related to idle power can be obtained as 100mW and the power consumption in sleep mode $P_\sigma$ is set as 50μW. From Jejurikar et al. [2004], we can obtain the energy overhead $E_{sw}$ of the state transition as 483μJ. We set the time overhead $t_{sw}$ of the state transition as 10ms. According to the energy parameters in Table I and the energy model in Section 3.2, we can calculate the corresponding active power $P_a$ and standby power $P_s$ under voltage level $V_{dd}$ = 0.7V. Table II lists all the power parameters used in the experiment.

An event stream is specified by the PJD model with period p, jitter j, and minimal interarrival distance d. It is worth noting that a worst-case execution time c is associated with the service curve of different stages, as stated in Section 3.3. The jitter j and the relative deadline D of the stream are respectively defined as j = ϕ · p and D = γ · p and vary according to the corresponding factors.

To evaluate the effectiveness of our approach, we conduct the experiments with three applications. We collected results for these three applications with deadline and jitter varying with the corresponding factors γ and ϕ. In the following, we give a brief overview of the three applications. The H.263 decoder application [Oh and Ha 2002] was modeled by four tasks consisting of packet decoding (PD1), an inverse quantization operation (deQ), an inverse DCT operation (IDCT), and motion compensation (MC). The execution time of each subtask in the H.263 decoder application can be found in Oh and Ha [2002]. The activation period of the H.263 decoder application is 100ms, with varying jitter and end-to-end deadline. The MP3 decoder application is implemented in a pipelined fashion [Oh and Ha 2002] that can be split into five tasks, including packet decoding (PD2), Huffman decoding (HD), the inverse quantization operation (deQ), an inverse DCT operation (IDCT), and antialiasing (FB). The execution time of each subtask in the MP3 decoder application can be found in Oh and Ha [2002]. The activation period of the MP3 decoder application is 100ms, with varying jitter and end-to-end deadline. Time Delay Equalization (TDE) comes from the GMTI (Ground Moving Target Indicator) application obtained from StreamIt benchmarks [Thies and Amarasinghe 2010]. The TDE application contains 4 tasks: FFT reorder, combined DFT, FFT reorder, and combined IDFT. We set the activation period of the consumer application as 30ms.

Table III. Average Power Savings with Respect to DPA

            H.263      MP3        TDE        H.263      MP3        TDE
            2-stages   2-stages   2-stages   3-stages   3-stages   3-stages
PBOOA-QP    10.46%     11.57%     39.62%     23.59%     26.37%     30.60%
PBOOA-FH    10.46%     11.57%     39.65%     23.31%     25.69%     34.09%

6.2. Simulation Result

We first evaluate how the power consumptions of the compared approaches change as the jitter and deadline vary. Cases of 2-stage and 3-stage pipeline architectures with homogeneous 70nm processors are evaluated. We vary the jitter factor ϕ from 0–3 with step 0.5 and the deadline factor γ from 1.5–2 with step 0.5. The simulation results of the three approaches are shown in Figure 4. In the figure, each line represents the average energy consumption under the varied jitter factor settings with the fixed deadline factor and task mapping. From these, we can make the following observations: (1) pay-burst-only-once-based approaches always outperform the deadline partition approach for all settings on both pipeline architectures. We list the average normalized power savings of PBOOA-QP and PBOOA-FH with respect to DPA in Table III; (2) the average idle power consumptions of the three approaches increase as jitter increases, since bigger jitter requires longer $T_{on}$ to guarantee the worst-case end-to-end deadline; (3) the average idle power consumptions of the three approaches decrease as the end-to-end deadline increases. This is expected because a looser end-to-end deadline requirement can result in smaller execution time $T_{on}$ and longer sleep time $T_{off}$; (4) one interesting observation is that pay-burst-only-once-based approaches achieve more power savings on the 3-stage pipeline than on the 2-stage pipeline for different jitter and deadline settings. This is because DPA pays the burst more times on the 3-stage pipeline than on the 2-stage pipeline, which leads PBOOA-QP and PBOOA-FH to achieve more power savings on the 3-stage pipeline.

Next, we conduct an experiment to show the impact of the time overhead of state transition $t_{sw}$ on the effectiveness of our approaches. An H.263 application with jitter factor ϕ = 0.5 and deadline factor γ = 1 runs in 3-stage pipeline architectures with homogeneous 70nm processors. We vary the time overhead of state transition $t_{sw}$ from 5ms–15ms with fixed step size 1ms. Figure 5 illustrates the average power consumptions for the three compared approaches. In this figure, we can observe that our approaches find efficient solutions and outperform DPA in all $t_{sw}$ settings. Besides, when $t_{sw}$ increases, the average power consumption of DPA increases faster than that of the pay-burst-only-once-based approaches. This is because DPA generates less idle time than the pay-burst-only-once-based approaches, since it pays the burst multiple times, as we show in Section 4. The increase of $t_{sw}$ reduces the opportunities for turning off the processor, which means that entering sleep mode becomes more difficult for DPA.

Then, we discuss the impact of the period setting on the effectiveness of the approaches. The MP3 application with jitter factor ϕ = 1 and deadline factor γ = 1.5 runs in two-stage pipeline architectures with homogeneous 70nm processors, where we vary period settings from 70–130ms with fixed step size 10ms. Figure 6 illustrates the average power consumptions for the three compared approaches under different period settings. From Figure 6, we can see that the pay-burst-only-once-based approaches outperform DPA at all period settings. Furthermore, the average power consumption of all approaches decreases when the period increases. This is expected because a bigger period of the application can prolong the idle intervals.



Fig. 4. Average idle power consumption for three applications on 2-stage and 3-stage pipeline architectures.

Fig. 5. Average power consumption with varying $t_{sw}$.

Fig. 6. Average power consumption with varying period.

In the end, we demonstrate the scalability of our approaches. We test them on pipelines of up to 20 heterogeneous stages. The execution times of the subtasks mapped on each stage are randomly generated between 5–15ms. According to the power model presented in Section 3.2, the power profile of each stage is generated by randomly selecting the voltage $V_{dd}$ between 0.5V–0.8V. The activation period of the event stream is 40ms with jitter factor ϕ = 1. The end-to-end deadline for the test case with different stage numbers is determined by n · 20, where n is the stage number. The overhead values of state transition $t_{sw}$ and $E_{sw}$ of different stages are randomly selected in [1ms, 5ms] and [400μJ, 800μJ], respectively. Based on the observation that the deadline partition algorithm may suffer from deadline combination explosion and the costly $\otimes$ computation, we set the search step as 5 for the three compared approaches.

Figure 7 shows the power consumption and computation overhead on different pipeline architectures. From this figure, we have the following observations: (1) as shown in Figure 7(a), the computation overhead of the deadline partition algorithm increases exponentially. When the stage number exceeds 10, the Deadline Partition Algorithm (DPA) fails to generate the results due to the expiration of the time budget of 8 hours. For the case of the 9-stage pipeline, DPA takes almost 420 minutes, which is 9182× longer than the 3-stage pipeline case. This is expected because the deadline combinations increase exponentially as the stage number increases. In addition, as the stage number increases, the time for computing the resource demand of each following stage, which requires the lower bound of the output arrival curve from the previous stage, increases. Computing this output curve requires numerical min-plus convolution, which incurs considerable computational and memory overheads; (2) compared to the deadline partition algorithm, pay-burst-only-once-based approaches are fast and the computation time increases slowly with respect to the stage number, especially for PBOOA-FH. For the 20-stage pipeline, the PBOOA-QP approach takes 3.7 minutes, 124× more computing time than the 3-stage case. PBOOA-FH takes only 0.08 minutes to generate the result, only 7.5× more than the 3-stage case; (3) in the context of average idle power consumption, pay-burst-only-once-based approaches are more energy efficient than the deadline partition algorithm. In Figure 7(b), we can see that the PBOOA-QP and PBOOA-FH approaches always outperform DPA for all pipeline architectures, indicating that our approaches are not only faster but also more energy efficient than the DPA approach. Besides, as observed in prior experiments, the gap in power consumption between the deadline partition algorithm and the pay-burst-only-once-based algorithms increases as the stage number increases. This is expected because, as the stage number increases, the number of times DPA pays the burst also increases. In contrast, the proposed approaches pay the burst only once, which leads to a tighter end-to-end delay bound and prolongs the idle intervals of the stages for energy efficiency; (4) the PBOOA-FH approach achieves almost identical average idle power consumption with respect to the PBOOA-QP approach with almost 10× speedup. In some cases, PBOOA-FH can even achieve more energy savings than PBOOA-QP. This is because, in contrast to the PBOOA-QP approach, PBOOA-FH integrates break-even time constraints into the optimization phase, which leads the PBOOA-FH approach to find better solutions than the PBOOA-QP approach.

Fig. 7. Computation time and power consumption for heterogeneous pipelined system.

7. CONCLUSION

This article presents new approaches to minimize energy consumption for pipelined systems. Targeting streaming applications with nondeterministic workload arrivals under hard real-time constraints, our approaches can not only guarantee the original end-to-end deadline requirement but also retrieve the pay-burst-only-once phenomena, resulting in a significant reduction in both the energy consumption and computing overhead. Moreover, our approaches are scalable with respect to the number of pipelined stages. Simulation results demonstrate the effectiveness of our approaches. In the future, we intend to extend our approaches to dynamic voltage-frequency scaling (DVFS) to reduce dynamic power for pipelined systems. Another interesting future work would be to target multidimensional issues such as energy and thermal constraints simultaneously. In addition, how to combine our approaches with consideration of the mapping of the application is also deemed worthy for our future work.

APPENDIX

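LEMMA A.1. The service curve of periodic power management specified by $T_{on}$ and $T_{off}$ can be represented as follows.

$$\beta^{Gl}_i(\Delta) = \left(T^i_{on}\left\lceil \frac{\Delta - T^i_{off}}{T^i_{on} + T^i_{off}} \right\rceil\right) \otimes \Delta. \qquad (20)$$

PROOF. According to Huang et al. [2009b], the service curve of periodic power management specified by $T_{on}$ and $T_{off}$ can be represented as Eq. (21).

$$\beta^{Gl}(\Delta) = \max\left(\left\lfloor \frac{\Delta}{T_{on} + T_{off}} \right\rfloor \cdot T_{on},\ \Delta - \left\lceil \frac{\Delta}{T_{on} + T_{off}} \right\rceil \cdot T_{off}\right). \qquad (21)$$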

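This proof presents the derivation of Eq. (20), which is used to represent the service curve of periodic power management, to show that Eqs. (21) and (20) are equivalent.

According to the definition of the min-plus convolution,

$$\beta^{Gl}(\Delta) = \left(T_{on}\left\lceil \frac{\Delta - T_{off}}{T_{on} + T_{off}} \right\rceil\right) \otimes \Delta = \inf_{0 \le s \le \Delta}\left(\Delta - s + T_{on} \cdot \left\lceil \frac{s - T_{off}}{T_{on} + T_{off}} \right\rceil\right).$$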
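The claimed equivalence can also be checked numerically. The sketch below evaluates the infimum of the min-plus convolution in Eq. (20) by brute force and compares it with the closed form of Eq. (21). It assumes integer values for $T_{on}$, $T_{off}$, and $\Delta$; in that case the infimum is attained at an integer s (either at $s = \Delta$ or at a point $s = T_{off} + n(T_{on} + T_{off})$), so scanning integer s is sufficient. The function names are hypothetical.

```python
def ceil_div(a, b):
    """Exact integer ceiling of a / b for b > 0 (also correct for negative a)."""
    return -(-a // b)


def beta_conv(delta, t_on, t_off):
    """Eq. (20): inf over 0 <= s <= delta of delta - s + T_on * ceil((s - T_off) / T)."""
    period = t_on + t_off
    return min(delta - s + t_on * ceil_div(s - t_off, period)
               for s in range(delta + 1))


def beta_closed(delta, t_on, t_off):
    """Eq. (21): max(floor(delta / T) * T_on, delta - ceil(delta / T) * T_off)."""
    period = t_on + t_off
    return max((delta // period) * t_on,
               delta - ceil_div(delta, period) * t_off)


if __name__ == "__main__":
    # Compare both forms over several periods for two illustrative parameter pairs.
    for t_on, t_off in [(6, 4), (3, 7)]:
        print(all(beta_conv(d, t_on, t_off) == beta_closed(d, t_on, t_off)
                  for d in range(0, 61)))
```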

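We make some transformations as follows.

$$\begin{aligned}
T &= T_{on} + T_{off}, \\
\Delta &= k_\Delta \cdot T + r_\Delta, \quad k_\Delta \in \mathbb{N}^+,\ 0 \le r_\Delta < T, \\
s &= k_s \cdot T + r_s, \quad k_s \in \mathbb{N}^+,\ 0 \le r_s < T.
\end{aligned}$$

Then, we have

$$\beta^{Gl}(\Delta) = \inf_{0 \le s \le \Delta}\left((k_\Delta - k_s) \cdot T + (r_\Delta - r_s) + T_{on} \cdot k_s + T_{on} \cdot \left\lceil \frac{r_s - T_{off}}{T} \right\rceil\right). \qquad (22)$$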

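As $s \le \Delta$, there are two possibilities for the parameters $r_\Delta$ and $r_s$: (1) when $k_s = k_\Delta$, $r_s \le r_\Delta$ must hold for $s \le \Delta$; (2) when $k_s \le k_\Delta - 1$, there is no constraint between $r_\Delta$ and $r_s$ because $k_s \le k_\Delta - 1$ is sufficient to guarantee $s \le \Delta$.

Case 1: $k_s \le k_\Delta - 1$. For this case, there are no constraints between $r_\Delta$ and $r_s$, thus we have Eq. (23) by calling Eq. (22).

$$\begin{aligned}
\beta^{Gl}_{k_s \le k_\Delta - 1}(\Delta) &= \inf_{0 \le s \le \Delta}\left((k_\Delta - k_s) \cdot T + (r_\Delta - r_s) + k_s \cdot T_{on} + T_{on} \cdot \left\lceil \frac{r_s - T_{off}}{T} \right\rceil\right) \\
&= \inf_{0 \le s \le \Delta}\left(k_\Delta \cdot T + r_\Delta - k_s \cdot (T - T_{on}) - r_s + T_{on} \cdot \left\lceil \frac{r_s - T_{off}}{T} \right\rceil\right) \\
&= \inf_{0 \le s \le \Delta}\left(k_\Delta \cdot T + r_\Delta - k_s \cdot T_{off} - r_s + T_{on} \cdot \left\lceil \frac{r_s - T_{off}}{T} \right\rceil\right).
\end{aligned} \qquad (23)$$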



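—When $T_{off} < r_s < T$ holds, the ceiling term in Eq. (23) equals 1, and we have Eq. (24).

$$\begin{aligned}
& k_\Delta \cdot T + r_\Delta - k_s \cdot T_{off} - r_s + T_{on} \\
&\quad > k_\Delta \cdot T + r_\Delta - k_s \cdot T_{off} - T + T_{on} \\
&\quad = k_\Delta \cdot T + r_\Delta - k_s \cdot T_{off} - T_{off} \\
&\quad \ge k_\Delta \cdot T + r_\Delta - k_\Delta \cdot T_{off}.
\end{aligned} \tag{24}$$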

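—When $0 \le r_s \le T_{off}$ holds, the ceiling term in Eq. (23) equals 0, and we have Eq. (25).

$$\left( k_\Delta \cdot T + r_\Delta - k_s \cdot T_{off} - r_s \right) \underset{r_s = T_{off},\, k_s = k_\Delta - 1}{\overset{\inf}{\Longrightarrow}} k_\Delta \cdot T + r_\Delta - k_\Delta \cdot T_{off}. \tag{25}$$

Combining the preceding two subcases, the infimum $\beta^{Gl}_{k_s \le k_\Delta - 1}(\Delta)$ for the case $k_s \le k_\Delta - 1$ follows from Eqs. (24) and (25) as Eq. (26).

$$\beta^{Gl}_{k_s \le k_\Delta - 1}(\Delta) = k_\Delta \cdot T + r_\Delta - k_\Delta \cdot T_{off} = \Delta - k_\Delta \cdot T_{off}. \tag{26}$$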

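Case 2: $k_s = k_\Delta$. In this case, $r_s \le r_\Delta$ must hold to ensure $s \le \Delta$, so Eq. (22) yields Eq. (27).

$$\beta^{Gl}_{k_s = k_\Delta}(\Delta) = \inf_{0 \le s \le \Delta} \left( (r_\Delta - r_s) + k_\Delta \cdot T_{on} + T_{on} \cdot \left\lceil \frac{r_s - T_{off}}{T} \right\rceil \right). \tag{27}$$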

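As $r_s$ is constrained by $r_\Delta$, there are two cases for $r_\Delta$.

—$r_\Delta \le T_{off}$. In this case, we have $0 \le r_s \le r_\Delta \le T_{off}$, so Eq. (27) yields Eq. (28).

$$\beta^{Gl}_{k_s = k_\Delta,\, r_\Delta \le T_{off}}(\Delta) = \inf_{0 \le s \le \Delta} \left( (r_\Delta - r_s) + k_\Delta \cdot T_{on} \right) \underset{r_s = r_\Delta}{\overset{\inf}{\Longrightarrow}} k_\Delta \cdot T_{on}. \tag{28}$$

By integrating the cases $k_s = k_\Delta$ and $k_s \le k_\Delta - 1$, Eqs. (26) and (28) give Eq. (29), where the last equality holds because $\Delta - k_\Delta \cdot T_{off} = k_\Delta \cdot T_{on} + r_\Delta \ge k_\Delta \cdot T_{on}$.

$$\beta^{Gl}_{r_\Delta \le T_{off}}(\Delta) = \min\left( \Delta - k_\Delta \cdot T_{off},\; k_\Delta \cdot T_{on} \right) = k_\Delta \cdot T_{on}. \tag{29}$$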

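—$r_\Delta > T_{off}$. In this case, there are two subcases for $r_s$, that is, $0 \le r_s \le T_{off} < r_\Delta < T$ and $T_{off} < r_s \le r_\Delta < T$.

  —$0 \le r_s \le T_{off} < r_\Delta < T$. In this subcase, Eq. (27) yields Eq. (30).

$$\left( (r_\Delta - r_s) + k_\Delta \cdot T_{on} \right) \underset{r_s = T_{off}}{\overset{\inf}{\Longrightarrow}} r_\Delta - T_{off} + k_\Delta \cdot T_{on} < T_{on} + k_\Delta \cdot T_{on}. \tag{30}$$

  —$T_{off} < r_s \le r_\Delta < T$. In this subcase, Eq. (27) yields Eq. (31).

$$\left( (r_\Delta - r_s) + k_\Delta \cdot T_{on} + T_{on} \right) \underset{r_s = r_\Delta}{\overset{\inf}{\Longrightarrow}} T_{on} + k_\Delta \cdot T_{on}. \tag{31}$$

Combining the preceding two subcases, the infimum $\beta^{Gl}_{k_s = k_\Delta,\, r_\Delta > T_{off}}(\Delta)$ follows from Eqs. (30) and (31) as Eq. (32).

$$\beta^{Gl}_{k_s = k_\Delta,\, r_\Delta > T_{off}}(\Delta) = r_\Delta - T_{off} + k_\Delta \cdot T_{on} = \Delta - (k_\Delta + 1) \cdot T_{off}. \tag{32}$$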



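By integrating the cases $k_s = k_\Delta$ and $k_s \le k_\Delta - 1$, Eqs. (32) and (26) give Eq. (33).

$$\beta^{Gl}_{r_\Delta > T_{off}}(\Delta) = \min\left( \Delta - k_\Delta \cdot T_{off},\; \Delta - (k_\Delta + 1) \cdot T_{off} \right) = \Delta - (k_\Delta + 1) \cdot T_{off}. \tag{33}$$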

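With Eqs. (29) and (33), we can obtain the service curve as Eq. (34).

$$\beta^{Gl}(\Delta) = \begin{cases} k_\Delta \cdot T_{on}, & r_\Delta \le T_{off}, \\ \Delta - (k_\Delta + 1) \cdot T_{off}, & r_\Delta > T_{off}. \end{cases} \tag{34}$$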
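The case split of Eq. (34) maps directly onto an integer division. The following Python sketch is a hypothetical helper (the name `service_curve` and the integer time base are assumptions made for illustration) that decomposes $\Delta = k_\Delta \cdot T + r_\Delta$ with `divmod` and returns the branch selected by $r_\Delta$.

```python
def service_curve(delta: int, t_on: int, t_off: int) -> int:
    """Piecewise service curve of Eq. (34) on an integer time base."""
    # Decompose delta = k * T + r with 0 <= r < T, where T = Ton + Toff.
    k, r = divmod(delta, t_on + t_off)
    if r <= t_off:
        # Worst case: the remainder falls entirely into an off interval.
        return k * t_on
    # Otherwise one more off interval has fully elapsed.
    return delta - (k + 1) * t_off


if __name__ == "__main__":
    # Assumed example values Ton = 3, Toff = 2.
    for delta in (4, 5, 7, 9):
        print(delta, service_curve(delta, t_on=3, t_off=2))
```

With these example values, the curve stays flat at $k_\Delta \cdot T_{on}$ while $r_\Delta \le T_{off}$ and grows with slope one afterwards.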

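When $0 \le r_\Delta \le T_{off}$ holds, we have $k_\Delta \cdot T_{on} \ge \Delta - (k_\Delta + 1) \cdot T_{off}$ and $k_\Delta = \left\lfloor \frac{\Delta}{T_{on} + T_{off}} \right\rfloor$. When $r_\Delta > T_{off}$ holds, we have $k_\Delta \cdot T_{on} < \Delta - (k_\Delta + 1) \cdot T_{off}$ and $k_\Delta + 1 = \left\lceil \frac{\Delta}{T_{on} + T_{off}} \right\rceil$. Then, we can obtain the service curve as Eq. (35).

$$\beta^{Gl}(\Delta) = \max\left( k_\Delta \cdot T_{on},\; \Delta - (k_\Delta + 1) \cdot T_{off} \right). \tag{35}$$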

By transforming Eq. (35), we obtain Eq. (21); thus, the service curve of periodic power management can be represented as Eq. (20).
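As an illustration with assumed values that are not taken from the evaluation, let $T_{on} = 3$ and $T_{off} = 2$, so that $T = 5$. The worked example below evaluates Eqs. (34) and (21) for $\Delta = 7$ (where $k_\Delta = 1$, $r_\Delta = 2 \le T_{off}$) and for $\Delta = 9$ (where $k_\Delta = 1$, $r_\Delta = 4 > T_{off}$); both representations agree in each case.

```latex
\begin{align*}
\Delta = 7:\quad & \text{Eq. (34)}:\ \beta^{Gl}(7) = 1 \cdot 3 = 3, \\
                 & \text{Eq. (21)}:\ \beta^{Gl}(7) = \max\left( \left\lfloor \tfrac{7}{5} \right\rfloor \cdot 3,\; 7 - \left\lceil \tfrac{7}{5} \right\rceil \cdot 2 \right) = \max(3, 3) = 3, \\
\Delta = 9:\quad & \text{Eq. (34)}:\ \beta^{Gl}(9) = 9 - (1 + 1) \cdot 2 = 5, \\
                 & \text{Eq. (21)}:\ \beta^{Gl}(9) = \max\left( \left\lfloor \tfrac{9}{5} \right\rfloor \cdot 3,\; 9 - \left\lceil \tfrac{9}{5} \right\rceil \cdot 2 \right) = \max(3, 5) = 5.
\end{align*}
```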

REFERENCES

A. Alimonda, S. Carta, A. Acquaviva, A. Pisano, and L. Benini. 2009. A feedback-based approach to DVFS in data-flow applications. IEEE Trans. Comput.-Aided Des. Integr. Circ. Syst. 28, 11, 1691–1704.

S. Carta, A. Alimonda, A. Pisano, A. Acquaviva, and L. Benini. 2007. A control theoretic approach to energy-efficient pipelined computation in MPSoCs. ACM Trans. Embedd. Comput. Syst. 6, 4.

G. Chen, K. Huang, C. Buckl, and A. Knoll. 2013. Energy optimization with worst-case deadline guarantee for pipelined multiprocessor systems. In Proceedings of the Design, Automation and Test in Europe Conference (DATE’13).

G. Chen, K. Huang, and A. Knoll. 2014. Energy optimization for real-time multiprocessor system-on-chip with optimal DVFS and DPM combination. ACM Trans. Embedd. Comput. Syst. 13, 3.

J. Chen and S. Burer. 2012. Globally solving nonconvex quadratic programming problems via completely positive programming. Math. Program. Comput. 4, 1, 33–52.

J. J. Chen, N. Stoimenov, and L. Thiele. 2009. Feasibility analysis of on-line DVS algorithms for scheduling arbitrary event streams. In Proceedings of the 30th IEEE Real-Time Systems Symposium (RTSS’09).

A. Davare, Q. Zhu, M. Di Natale, C. Pinello, S. Kanajan, and A. Sangiovanni-Vincentelli. 2007. Period optimization for hard real-time distributed automotive systems. In Proceedings of the 44th ACM/IEEE Design Automation Conference (DAC’07).

P. de Langen and B. Juurlink. 2006. Leakage-aware multiprocessor scheduling for low power. In Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS’06).

P. de Langen and B. Juurlink. 2009. Leakage-aware multiprocessor scheduling. J. Signal Process. Syst. 57, 1, 73–88.

M. Fidler. 2003. Extending the network calculus pay bursts only once principle to aggregate scheduling. In Proceedings of the 2nd International Workshop on Quality of Service in Multiservice IP Networks (QoS-IP’03). 19–34.

M. Fu, Z. Luo, and Y. Ye. 1998. Approximation algorithms for quadratic programming. J. Combinat. Optim. 2, 1, 29–50.

S. Y. Hong, T. Chantem, and X. S. Hu. 2011. Meeting end-to-end deadlines through distributed local deadline assignments. In Proceedings of the 32nd IEEE Real-Time Systems Symposium (RTSS’11).

K. Huang, J. J. Chen, and L. Thiele. 2011a. Energy-efficient scheduling algorithms for periodic power management for real-time event streams. In Proceedings of the 17th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA’11).

K. Huang, L. Santinelli, J. J. Chen, L. Thiele, and G. C. Buttazzo. 2009a. Adaptive dynamic power management for hard real-time systems. In Proceedings of the 30th IEEE Real-Time Systems Symposium (RTSS’09).

K. Huang, L. Santinelli, J. J. Chen, L. Thiele, and G. C. Buttazzo. 2009b. Periodic power management schemes for real-time event streams. In Proceedings of the 48th IEEE International Conference on Decision and Control (CDC’09).



K. Huang, L. Santinelli, J. J. Chen, L. Thiele, and G. C. Buttazzo. 2011b. Applying real-time interface and calculus for dynamic power management in hard real-time systems. Real-Time Syst. 47, 2, 163–193.

H. Javaid and S. Parameswaran. 2009. A design flow for application specific heterogeneous pipelined multiprocessor systems. In Proceedings of the 46th ACM/IEEE Annual Design Automation Conference (DAC’09).

H. Javaid, M. Shafique, J. Henkel, and S. Parameswaran. 2011a. System-level application-aware dynamic power management in adaptive pipelined MPSoCs for multimedia. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD’11).

H. Javaid, M. Shafique, S. Parameswaran, and J. Henkel. 2011b. Low-power adaptive pipelined MPSoCs for multimedia: An H.264 video encoder case study. In Proceedings of the 48th ACM/EDAC/IEEE Design Automation Conference (DAC’11).

R. Jejurikar, C. Pereira, and R. Gupta. 2004. Leakage aware dynamic voltage scaling for real-time embedded systems. In Proceedings of the 41st ACM/IEEE Design Automation Conference (DAC’04).

V. Jeyakumar, A. M. Rubinov, and Z. Y. Wu. 2006. Sufficient global optimality conditions for non-convex quadratic minimization problems with box constraints. J. Global Optim. 36, 3, 471–481.

I. Karkowski and H. Corporaal. 1997. Design of heterogeneous multi-processor embedded systems: Applying functional pipelining. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT’97). 156.

K. Lampka, K. Huang, and J. J. Chen. 2011. Dynamic counters and the efficient and effective online power management of embedded real-time systems. In Proceedings of the 7th IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS’11).

J. Y. Le Boudec and P. Thiran. 2001. Network Calculus: A Theory of Deterministic Queuing Systems for the Internet. Springer.

D. Liu, J. Spasic, J. T. Zhai, T. Stefanov, and G. Chen. 2014. Resource optimization of CSDF-modeled streaming applications with latency constraints. In Proceedings of the Design, Automation and Test in Europe Conference (DATE’14).

S. M. Martin, K. Flautner, T. Mudge, and D. Blaauw. 2002. Combined dynamic voltage scaling and adaptive body biasing for lower power microprocessors under dynamic workloads. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD’02).

S. Maxiaguine, A. Chakraborty, and L. Thiele. 2005. DVS for buffer-constrained architectures with predictable QoS-energy tradeoffs. In Proceedings of the IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS’05).

H. Nikolov, T. Stefanov, and E. Deprettere. 2008. Systematic and automated multiprocessor system design, programming, and implementation. IEEE Trans. Comput.-Aided Des. Integr. Circ. Syst. 27, 3, 542–555.

H. Oh and S. Ha. 2002. Hardware-software cosynthesis of multi-mode multi-task embedded systems with real-time constraints. In Proceedings of the 10th International Symposium on Hardware/Software Codesign (CODES+ISSS’02).

S. Perathoner, K. Lampka, N. Stoimenov, L. Thiele, and J. J. Chen. 2010. Combining optimistic and pessimistic DVS scheduling: An adaptive scheme and analysis. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD’10).

S. L. Shee, A. Erdos, and S. Parameswaran. 2006. Heterogeneous multiprocessor implementations for JPEG: A case study. In Proceedings of the 4th International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS’06).

S. L. Shee and S. Parameswaran. 2007. Design methodology for pipelined heterogeneous multiprocessor system. In Proceedings of the 44th Annual Design Automation Conference (DAC’07).

L. Thiele, S. Chakraborty, and M. Naedele. 2000. Real-time calculus for scheduling hard real-time systems. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS’00).

W. Thies and S. Amarasinghe. 2010. An empirical characterization of stream programs and its implications for language and compiler design. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT’10).

E. Wandeler and L. Thiele. 2006. Real-time calculus (rtc) toolbox. http://www.mpa.ethz.ch/Rtctoolbox.

E. Wandeler, L. Thiele, M. Verhoef, and P. Lieverse. 2006. System architecture evaluation using modular performance analysis - A case study. Int. J. Softw. Tools Technol. Transfer 8, 6, 649–667.

W. X. Wang and P. Mishra. 2010. Leakage-aware energy minimization using dynamic voltage scaling and cache reconfiguration in real-time systems. In Proceedings of the 23rd International Conference on VLSI Design (VLSID’10).

R. B. Xu, R. Melhem, and D. Mosse. 2007. Energy-aware scheduling for streaming applications on chip multiprocessors. In Proceedings of the 28th IEEE International Real-Time Systems Symposium (RTSS’07).



F. Yao, A. Demers, and S. Shenker. 1995. A scheduling model for reduced CPU energy. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science (FOCS’95).

Y. Yu and V. K. Prasanna. 2002. Power-aware resource allocation for independent tasks in heterogeneous real-time systems. In Proceedings of the 9th IEEE International Conference on Parallel and Distributed Systems (ICPADS’02).

Received November 2013; revised July 2014; accepted November 2014

