
Enabling Dynamic Voltage and Frequency Scaling in
Multicore Architectures
by
Amithash Prasad
B.S., Visveswaraya Technical University, 2005
A thesis submitted to the
Faculty of the Graduate School of the
University of Colorado in partial fulfillment
of the requirements for the degree of
Master of Science
Department of Electrical and Computer Engineering
2009

This thesis entitled:
Enabling Dynamic Voltage and Frequency Scaling in Multicore Architectures
written by Amithash Prasad
has been approved for the Department of Electrical and Computer Engineering
Prof. Dan Connors (chair)
Prof. Manish Vachharajani
Date
The final copy of this thesis has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards of scholarly work in the above mentioned discipline.

Prasad, Amithash (M.S., Electrical Engineering)
Enabling Dynamic Voltage and Frequency Scaling in Multicore Architectures
Thesis directed by Prof. Dan Connors (chair)
Traditional operating system methodologies for controlling the voltage and frequency configuration of the machine are mostly based on ad-hoc means, thermal emergencies, or constraining the power consumption of the system. Issues such as memory latency, cache sharing, and limited memory bandwidth have long hindered clock speed scaling. A higher clock speed no longer indicates better performance.
This thesis presents a scheduling methodology for multicore processors which maps running applications to cores executing at varied clock speeds, based on their runtime performance characteristics. Two schemes of mapping tasks to cores were designed. An asynchronously run power optimizer was devised to adapt the core clock speeds of a multicore system to the needs of the current workload, utilizing the information provided by the scheduler about those needs. The entire system was implemented as a module for the Linux kernel.

In addition to these contributions, this thesis performs an extensive analysis of the system for six selected workloads to analyze the effects on performance, power, and energy efficiency. It also presents possible future applications of the developed infrastructure that would let compiler and user-space runtime designers utilize the framework and provide useful information about task characteristics and phase behavior to the operating system scheduler, for better task-to-clock-speed assignments.

Dedication
This thesis is dedicated to my fiancée Sushma: for listening to me whine during the initial part of this thesis when whatever I touched seemed to crash; for listening to me gloat in pride when things worked for a change; and finally for proofreading my thesis when I realized that I had forgotten the entire English language.

Acknowledgements
There is a saying in Sanskrit, "guru devo bavaha", roughly translating to "a professor is equivalent to god". Thus I would like to begin my acknowledgements by thanking those who have imparted to me the most important wealth: knowledge.
First, I would like to thank my advisers Dan Connors and Manish Vachharajani for their advice, tutoring, and incredible patience. I am grateful for the courses they taught, which got me interested in computer engineering and lured me away from control systems, my initial idea for a master's degree.
Second, I'd like to thank Professor Dirk Grunwald for his motivation and suggestions every time I stumbled into his office without an appointment, which shows the level of his patience and his willingness to help.
Third, I'd like to thank Tipp Moseley, a fellow graduate student, for helping me out during my initial introduction to kernel hacking, and for answering my questions whenever I felt that I was somehow within the grasp of a black hole. This does not exclude the very helpful and friendly Linux kernel community, who helped a novice transition into a real kernel hacker. This would not be complete without thanking Linus Torvalds for creating the Linux kernel in the first place.

Finally, I'd like to thank my parents for motivating me all along, for understanding when I barely call them, and for having faith in me and all my endeavors (no matter how foolish they might sound).

Contents
Chapter
1 Introduction

2 Impact of clock speed scaling on application performance
2.1 Performance behavior of SPEC workloads
2.2 Quantifying performance

3 Performance directed scheduling on multicore processors
3.1 Hierarchical processor organization
3.2 Hardware performance counters
3.3 Scheduling methodology
3.3.1 Performance state estimation with the Ladder approach (LEA)
3.3.2 Performance state estimation with the select approach (SEA)
3.4 Experimental setup
3.5 Static/Fixed layout of processor cores

4 Power-aware throughput management on multicore systems
4.1 Common power management systems
4.2 Overview of the Linux cpufreq architecture
4.3 Proposed system overview
4.4 Problem definition
4.5 The delta constrained mutation algorithm
4.5.1 Initialization
4.5.2 The Manhattan matrix
4.5.3 The Manhattan weight vector
4.5.4 Cooperative demand distribution: The demand field
4.5.5 Greedy performance state selection
4.5.6 Greedy processor selection and transition
4.5.7 Parameter change and termination conditions

5 Experimental setup and results
5.1 Trends along delta and interval
5.2 Trends along maximum allowed mutation rate
5.3 Comparing various methodologies and varying workloads

6 Future work

7 Summary and Conclusion

Bibliography

Appendix
A AMD Opteron capabilities
B Mutation time-line per workload
C Source code

Tables
Table
2.1 Classification of the SPEC 2006 benchmarks
2.2 Workload members
3.1 Threshold values for the select evaluation
3.2 Experimental layouts
A.1 Correlation between performance states with clock speed and power consumption

Figures
Figure
2.1 IPC classification and speedup trends for the SPEC 2006 benchmarks
2.2 Speedup achieved by characterized workloads
2.3 Speedup dependence with IPC
2.4 Energy consumption per instruction variance with IPC
3.1 Multi-processor organization
3.2 Hardware performance monitoring counters
3.3 Scheduling state diagram
3.4 Performance directed migration
3.5 Performance state estimation with the Ladder approach (LEA)
3.6 Performance state estimation with the select approach (SEA)
3.7 Slowdown vs power consumption for fixed layouts
4.1 The Seeker governor
4.2 Hypothetical time line displaying the invocation frequencies of the scheduler and the mutator
4.3 Example mutation with the Manhattan distance between L (layout before mutation) and L′ (layout after mutation) equal to 3
4.4 Mutation algorithm
4.5 State transition diagram for processor i currently in performance state Li
5.1 The Seeker infrastructure
5.2 Variation of average slowdown with delta and interval for the ladder (a) and the select (b) estimation approach
5.3 Variation of average power savings with delta and interval for the ladder (a) and the select (b) estimation approach
5.4 Variation of average energy efficiency (EPI) improvement with delta and interval for the ladder (a) and the select (b) estimation approach
5.5 Variation of average power savings, slowdown and energy efficiency improvement with maximum mutation rate for the LEA (a) and the SEA (b) scheduling systems
5.6 Variation of power savings, energy efficiency improvement and slowdown with each workload for the ladder estimation approach (LEA) with the delta mutation engine at ∆ = 4
5.7 Variation of power savings, energy efficiency improvement and slowdown with each workload for the select estimation approach (SEA) with the delta mutation engine at ∆ = 4
5.8 Variation of power savings, energy efficiency improvement and slowdown with each workload for the Ondemand governor
B.1 Adaptation time-line for the High workload with the ladder (a) and select (b) scheduling
B.2 Adaptation time-line for the Low workload with the ladder (a) and select (b) scheduling
B.3 Adaptation time-line for the Low-High workload with the ladder (a) and select (b) scheduling
B.4 Adaptation time-line for the PHigh-PLow workload with the ladder (a) and select (b) scheduling
B.5 Adaptation time-line for the PLow-Low workload with the ladder (a) and select (b) scheduling
B.6 Adaptation time-line for the PLow-High workload with the ladder (a) and select (b) scheduling

Chapter 1
Introduction
Emerging technology constraints are slowing the rate of performance growth in com-
puter systems. Specifically, designers are finding difficulties addressing strict processor power
consumption and cooling constraints. Design modifications to address power consumption gen-
erally limit processor performance and reduce peak operating frequency, thus changing the
trend of providing increased system performance every new processor generation. As such,
modern architectures have diverged from the clock speed race into the multicore era with mul-
tiple processing cores on a single chip. While the multicore design strategy improves chip power density, there remains significant potential for improving run-time power utilization by dynamically changing per-core clock frequency and voltage.
Power consumption is directly proportional to the clock speed (frequency) and the square of the operating voltage of the processor (Power ∝ Frequency × Voltage²). In order to conserve energy, designers have explored methods to manipulate these parameters using both circuit design techniques and runtime software-driven techniques. Implementations of DVFS (dynamic voltage and frequency scaling) schemes divide processor operation into performance states, or P-states, each with a fixed operating voltage and clock speed. Architectures supporting multiple such P-states require active runtime support to fully explore the performance and power-saving potential of the DVFS approach.
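For reference, the proportionality above follows from the standard first-order model of dynamic (switching) power in CMOS logic,

P_{\text{dynamic}} = \alpha \, C \, V^2 \, f

where α is the activity factor, C the switched capacitance, V the supply voltage and f the clock frequency. As an illustration with hypothetical operating points (the voltages of real P-states are hardware specific), lowering a core from 2.2 GHz at 1.2 V to 1.1 GHz at 0.9 V would, under this model, scale dynamic power by (0.9/1.2)² × (1.1/2.2) ≈ 0.28.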
The schemes for manipulating the DVFS configuration of a processor can be subdivided into three broad categories. Hardware approaches ([1], [2] and [3]) traditionally propose monitoring hardware to predict execution patterns and manipulate the system's DVFS configuration; hardware methods are immutable implementations in silicon which cannot be changed to match design and policy variations, and are hence avoided by many chip manufacturers. The second category is the management of power controlled by compiler and user-space runtime systems, which expect such capabilities to be exported by the operating system, as discussed in [4], [5] and [6], along with trace-driven methods for power management, as discussed in [7] and [8]. Operating system techniques ([9], [10], [11] and [12]) generally use some of the runtime performance monitoring capabilities exported by the hardware to assist in detecting runtime phases, and include software policies to ultimately make power management decisions.
The majority of DVFS implementations at the operating system level are load based, where the system's DVFS configuration is varied based upon extremely coarse-grain levels of processor activity. Even though effective in conserving power, such approaches are largely unaware of the workload currently executing on the system. Load-based decisions are best left to technologies which decide whether the processing element is active or asleep (the state in which power consumption is minimal), as studied in [11], which predicts usage patterns of various devices on the system.
This work proposes techniques which tie the scheduler to the system power optimizer in order to constrain DVFS level changes, in both frequency over time and magnitude, with the goal of saving energy without harming individual performance or throughput. Limiting DVFS level transitions is important to limit the instabilities endured by the system due to rapid DVFS transitions, as explored in [13].
This thesis makes the following contributions to this area of study:
(1) A scheduling method to adjust DVFS with strict constraints limiting the amount of performance lost compared to full-frequency operation. The system provides significant power savings of up to 40%, with a performance loss of up to 20% and a median performance loss of around 10-15%.
(2) A method to schedule tasks based on their performance requirements in a multiprocessor environment with varying per-core clock speeds, in order to maximize power savings while minimizing performance loss.

(3) An investigation of the impact of the number of DVFS level changes per second on the achievable power savings and the performance impact.

(4) The design of a multi-task scheduling model that accounts for a joules-per-instruction work metric.
This thesis is organized as follows: Chapter 2 discusses the performance variation of real-world applications with increases in clock speed. Chapter 3 introduces the performance directed scheduler. Chapter 4 proposes a methodology to mutate the voltage and frequency configuration of a system based on workload demands. Chapter 5 describes the experimental procedure and the results obtained. Chapter 6 comments on future work in this area, and finally Chapter 7 concludes this thesis.

Chapter 2
Impact of clock speed scaling on application performance
Throughput-based computing has long been solely dependent on clock speed advances for faster application execution. With the onset of multicore processors, a large portion of research has been devoted to the parallelization of sequential applications. The performance of most real-world applications is dependent on either memory speeds or I/O (input/output) latency. Threads executing in parallel on a multicore system eventually saturate memory or I/O devices and thus obtain no advantage from clock speed scaling. This chapter explores the impact of scaling clock speeds on application performance and energy efficiency in multicore processors.
2.1 Performance behavior of SPEC workloads
In order to study the behavior of real applications on different voltage and frequency
settings, fourteen of the SPEC2006 [14] benchmark suite were run with varied clock speed from
1100 MHz to 2200 MHz on an AMD Opteron quad-core processor and were plotted showing
the percentage duration spent at an IPC (Instruction per Cycle) higher than 1.0 and the speed-up
in terms of execution time achieved from that when run with the lowest clock speed (1100 MHz)
shown in Figure 2.1. The X-axis of this graph is the percentage of time the benchmark spent
running at an IPC level greater than 1.0, the Y-axis is the clock speed at which the benchmark
was run at, and finally the Z-axis is the Speed-up achieved. It can be observed that the return

5
on investment (Higher clock speed) is low for tasks with a low percentage of runtime at an IPC
greater than 1.0.
Figure 2.1: IPC classification and speedup trends for the SPEC 2006 benchmarks
The SPEC2006 [14] benchmark suite is a set of real-world applications designed to run with minimal I/O, and is hence well suited to characterizing and studying the CPU (central processing unit) and memory subsystem of modern processors. The applications are typical of workloads used in the field of throughput-based computation, and the suite is widely used to test and design processor systems. Based on the observation above, the benchmarks were categorized into four categories, Low, High, Phased Low and Phased High, as shown in Table 2.1.

Six workloads (Low, High, Low-High, PHigh-PLow, PLow-High and PLow-Low), comprising four benchmarks each, were created based on the IPC classification of Table 2.1; their members are shown in Table 2.2. It must be noted that even though these workloads were grouped based on their IPC characteristics, it is not uncommon for real-world workloads to comprise similar subsets of benchmarks. The six workloads provide a fair but experimentally feasible set of applications with which to estimate the impact of architecture or operating system design.

Category                                                Benchmarks
Low (90%+ executes below and equal to IPC=1.0)          mcf, milc, libquantum, lbm, omnetpp
High (90%+ executes above IPC=1.0)                      bzip2, povray, hmmer, sjeng, h264ref
Phased Low (50%+ executes above IPC=1.0)                gobmk, sphinx3
Phased High (50%+ executes below and equal to IPC=1.0)  astar, xalancbmk

Table 2.1: Classification of the SPEC 2006 benchmarks
Workload     Benchmarks
Low          milc, libquantum, lbm, omnetpp
High         bzip2, povray, hmmer, sjeng
Low-High     bzip2, povray, milc, omnetpp
PHigh-PLow   gobmk, sphinx3, astar, xalancbmk
PLow-High    astar, xalancbmk, povray, hmmer
PLow-Low     astar, xalancbmk, milc, omnetpp
Table 2.2: Workload members
Figure 2.2 shows that the High workload category exhibits the greatest improvement with frequency, achieving nearly a 2x improvement when using the highest frequency for each core in the quad-core system. The other categories of workloads also appear to benefit from higher frequency, but are bounded to 1.5x of the baseline execution time. The Low workload is observed to have a very high variance in terms of speedup, and shows that clock speeds above 1700 MHz have no effect on the speedup achieved. This motivates the construction of a performance directed scheduler aware of the IPC level at which tasks are executing, described in the following sections.
Figure 2.2: Speedup achieved by characterized workloads
2.2 Quantifying performance
An adaptive system needs a quantitative measure of performance. Section 2.1 indicated
that IPC can be used as such a measure. In order to study the throughput and energy consump-
tion behavior of workloads, a hardware monitoring-based framework was constructed to col-
lect and correlate per-application IPC data. Figure 2.3 displays the speedup achieved in terms of throughput for each sample when compared to the same sample executed at the base clock speed of 1100 MHz. The analysis aids in observing the exact IPC values at which a higher clock speed no longer increases throughput. Figure 2.3 shows that beyond IPC = 1.0 (box "High IPC"), a higher clock speed will always provide better throughput to the system, as demonstrated by the separation between the curves corresponding to each frequency. Samples with IPC between 0.5 and 1.0 (box "Moderate IPC") show that clock speeds greater than 2000 MHz have no significant effect in improving throughput, demonstrated by the insignificant separation between the curves corresponding to 2000 MHz and 2200 MHz. Finally, IPC lower than 0.5 (box "Low IPC") indicates that clock speed increases do not dramatically improve the throughput of the system (there is no clearly defined separation between the curves). This motivates a system which manages the DVFS configuration by detecting the IPC phase of applications, conserving power by scaling the clock speed down during phases within box "Low IPC".
Figure 2.3: Speedup dependence with IPC
The energy consumed during application execution, and hence the monetary cost of maintaining a server system (including cooling costs), is directly proportional to the product of the operational power consumption and the execution time, and is measured in joules. The energy utilization efficiency of a processor can be measured as the energy consumed per executed instruction, or EPI (joules/instruction). The data obtained in the above experiment aided in computing the energy consumption of each sample by multiplying the power (obtained from the correlation provided in Appendix A) with the execution time of the sample. The EPI value of the sample was then derived by dividing the energy consumption by the total number of instructions in the sample. Figure 2.4 demonstrates that executing applications with low IPC at higher clock speeds results in poor energy utilization. EPI falls exponentially at higher values of IPC, making the energy utilization consistent across all clock speeds. This should further motivate the use of DVFS techniques to improve energy utilization efficiency for throughput-based systems.
Figure 2.4: Energy consumption per instruction variance with IPC
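Concretely, the EPI of a sample follows directly from the definitions above,

\text{EPI} = \frac{E}{N} = \frac{P \times t}{N}

where P is the sample's power draw, t its execution time and N its retired instruction count. For a hypothetical sample retiring 10⁹ instructions in one second at 60 W (illustrative numbers, not measured data), EPI = 60 J / 10⁹ instructions = 60 nJ per instruction.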

Chapter 3
Performance directed scheduling on multicore processors
Current implementations of schedulers in modern operating systems treat the environment as homogeneous in nature, while technologies such as DVFS introduce an inherent asymmetry into a multicore system. Considerable exploration of asymmetric scheduling support has been carried out in [15] and [16]. However, that work ignores the fact, discovered in Chapter 2, that certain workloads gain no significant advantage from increased clock speeds, which could be exploited to conserve power by scaling the clock speed down during phases of low IPC.
In a multicore environment, there are many facets to performance loss, the most common reasons being high I/O and memory latencies, which are further aggravated by cache sharing. A novel statistical learning method to predict the voltage and frequency requirements of tasks was proposed in [5], but the method required rigorous statistical learning, making it time consuming and impossible to implement in the context of a scheduler. Moreover, even with a quad-core system and a scheduling frequency of 1000 Hz, there is a possibility of 1000 × 4 = 4000 performance state transitions every second, which can cause instabilities as shown in [13]. A simpler method was proposed in [10] and [2], but it comes at the cost of possible rapid voltage and frequency transitions which may lead to instabilities in emergencies.
3.1 Hierarchical processor organization
Chapter 2 provides strong motivation to scale the voltage and frequency of processors based on the performance of the workload. While most of the research pertaining to workload-based DVFS directs its attention to changing voltage and frequency levels based on a task's demand, none considers the possibility that the required performance state might already be available on another core, in which case a simple migration suffices without causing rapid performance state transitions. The motivation to substitute migration for frequency transition is further strengthened by the current multicore race, under which the probability of such a situation existing grows proportionally with the core count.
In order to enable scheduling on asymmetric multicore processors (cores executing at varied clock speeds), the first implementation step was to organize the processor set into groups with equal performance states. The term performance state, or P-state, will be used to differentiate the voltage and frequency configurations that a processor is allowed to have, and is indicated by P0 through Pm−1, where P0 is the configuration with the lowest clock speed and voltage (the lowest power consumption state). Each set maintains its usage in terms of actively running tasks in the particular group, to enable load balancing between groups of varied performance states. This view of the processor set is shown in Figure 3.1, and is mathematically viewed as the vector of sets shown in Equation (3.1), where C_{P_i} is the set of processors which are at performance state P_i.
L_s = \begin{bmatrix} C_{P_0} & C_{P_1} & \cdots & C_{P_{m-1}} \end{bmatrix} \quad (3.1)

Figure 3.1: Multi-processor organization
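The grouped view of Equation (3.1) can be sketched in C roughly as follows. This is an illustrative userspace approximation, not the thesis kernel module, and the type and field names are assumptions:

#include <stdint.h>

#define NR_CPUS     4   /* n: on-line cores */
#define NR_PSTATES  5   /* m: performance states P0..P4 */

/* One entry per performance state: the set C_{P_i} of Equation (3.1). */
struct pstate_set {
    uint32_t cpu_mask;    /* bit c set => CPU c currently at this P-state */
    int      nr_running;  /* runnable tasks on the CPUs in this set */
};

/* The layout vector L_s, indexed by performance state, lowest speed first. */
static struct pstate_set layout[NR_PSTATES];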
3.2 Hardware performance counters
In order to quantitatively measure the performance of each task, a driver was developed to read and configure the hardware performance counters present in AMD and Intel architectures. Two counters were configured to measure the sample IPC of an executing workload as
shown in Figure 3.2. The first counter tracked the total number of executed instructions while
the second tracked the total number of real clocks. It must be noted that the real clock rate (real
clocks per second) is the clock speed of the current DVFS configuration.
Every request for IPC made to the performance monitoring subsystem causes both of these performance counters to be read, from which the IPC (instructions per clock) is computed as a fixed-point value with a precision of 0.125. After each successful transaction, the retired instruction and real clock counters are cleared and continue incrementing through the next execution sample. The performance monitoring subsystem is unaware of the currently executing task, and it is the scheduler's responsibility to read the IPC level before switching tasks out of the running state.
Figure 3.2: Hardware performance monitoring counters
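A minimal sketch of the fixed-point IPC computation described above, assuming hypothetical helpers for the two counter reads (the real driver programs model-specific registers):

#include <stdint.h>

uint64_t read_retired_instructions(void);  /* hypothetical counter read */
uint64_t read_real_clocks(void);           /* hypothetical counter read */

/* IPC as a fixed-point value with 3 fractional bits (precision 1/8 = 0.125). */
static inline uint32_t sample_ipc_fp(void)
{
    uint64_t insts  = read_retired_instructions();
    uint64_t clocks = read_real_clocks();

    if (clocks == 0)
        return 0;
    return (uint32_t)((insts * 8) / clocks);  /* IPC scaled by 8 */
}

A returned value of 9, for example, corresponds to IPC = 9/8 = 1.125.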

3.3 Scheduling methodology
Most time-sharing scheduling systems regularly switch out a running task for another in order to provide fair, equal execution time to all tasks. This execution slice is commonly referred to as the scheduling quantum. The Linux scheduling system maintains separate run-queues for each processing element in order to reduce the scheduling algorithmic complexity to O(1). Migration, the procedure of moving a runnable task from one processor to another, is reduced to moving the task from one run-queue to another.
Figure 3.3 shows the state diagram of the performance directed scheduler (PDS), which is a modification of the default Linux scheduler with an additional pseudo state (performance directed migration) introduced at the transition away from the running state. In this state, as shown in Figure 3.4, the hardware performance counters are queried for the current IPC (instructions per cycle). Based on this measure of the task's performance, a mapping from IPC to a required performance state (P′) is made. This mapping can potentially be done in a number of ways, of which two were implemented and evaluated: Ladder, where the clock speed is raised or lowered one step depending on whether the IPC is above or below preset thresholds, and Select, where the clock speed is determined by comparing the IPC against a table of IPC ranges, each mapping to a specific clock speed. These methods are further described in the following two subsections.
Once P′ is estimated, the layout L_s is queried for the total number of processors in the set C_{P′} and the total number of tasks in the ready or running state on processors in C_{P′}. If the set C_{P′} is populated and the total number of tasks executing on the processors in C_{P′} is less than or equal to the number of processors in C_{P′}, then P′′ = P′. If this is not the case, the layout L_s is searched for a set C_{P′′} such that P′′ is closest to P′, C_{P′′} is populated, and the total number of tasks executing in the set C_{P′′} is less than the number of processors in C_{P′′}. If all of this fails, the layout is searched for a state P′′ such that C_{P′′} is populated and the load (computed as the total number of tasks divided by the number of processors) is minimal. Finally, the task is migrated to the set of processors in C_{P′′}. The decision about the exact processor a task executes on is made by the underlying native scheduler of the operating system; the PDS is completely oblivious to any further detail. As an interface to the mutator, which will be explained in Chapter 4, the PDS increments the cell corresponding to P′ in the demand vector D (D_{P′} = D_{P′} + 1).
Figure 3.3: Scheduling state diagram
Figure 3.4: Performance directed migration
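The target-set search just described can be sketched as follows. This is an illustrative reconstruction of the logic, with cpus_in_set[] and tasks_in_set[] as assumed bookkeeping arrays rather than the thesis data structures:

#define NR_PSTATES 5

static int cpus_in_set[NR_PSTATES];   /* |C_{P_s}| for each state s */
static int tasks_in_set[NR_PSTATES];  /* runnable tasks on CPUs in C_{P_s} */

int pick_target_state(int p_req)
{
    int s, best = -1;

    /* 1. The requested state, if populated and not oversubscribed. */
    if (cpus_in_set[p_req] > 0 && tasks_in_set[p_req] <= cpus_in_set[p_req])
        return p_req;

    /* 2. The closest populated state with spare capacity. */
    for (int d = 1; d < NR_PSTATES; d++)
        for (int sign = -1; sign <= 1; sign += 2) {
            s = p_req + sign * d;
            if (s < 0 || s >= NR_PSTATES || cpus_in_set[s] == 0)
                continue;
            if (tasks_in_set[s] < cpus_in_set[s])
                return s;
        }

    /* 3. Fall back to the populated state with minimum load,
     *    comparing tasks/cpus ratios without floating point. */
    for (s = 0; s < NR_PSTATES; s++) {
        if (cpus_in_set[s] == 0)
            continue;
        if (best < 0 ||
            tasks_in_set[s] * cpus_in_set[best] <
            tasks_in_set[best] * cpus_in_set[s])
            best = s;
    }
    return best;
}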
3.3.1 Performance state estimation with the Ladder approach (LEA)

The Ladder Estimation Approach sports a simple decision procedure. If IPC > H, P′ is chosen to be one state higher than the state at which the task is currently executing (P′ = P + 1). If IPC < L, P′ is chosen to be one state lower (P′ = P − 1). Otherwise P′ is chosen to equal P. The threshold values H and L were chosen to be 0.875 and 0.5 respectively, by observing in Figure 2.3 (Chapter 2) the edges of the High IPC and Low IPC ranges. The advantage of the procedure is the small number of choices (two: H and L) exposed to the system administrator. It is clear that a wrong choice of these thresholds can cause undesirable power and performance behavior of the system.
Figure 3.5: Performance state estimation with the Ladder approach (LEA)
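In code, the ladder decision reduces to a few comparisons. The sketch below expresses the thresholds in the 1/8 fixed-point IPC representation of Section 3.2 (7/8 = 0.875, 4/8 = 0.5); clamping to the valid state range is left to the caller:

/* LEA: one step up, one step down, or stay (ipc_fp = IPC * 8). */
int lea_next_state(int p_cur, unsigned int ipc_fp)
{
    const unsigned int H = 7;  /* 0.875 in 1/8 fixed point */
    const unsigned int L = 4;  /* 0.500 in 1/8 fixed point */

    if (ipc_fp > H)
        return p_cur + 1;
    if (ipc_fp < L)
        return p_cur - 1;
    return p_cur;
}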
3.3.2 Performance state estimation with the select approach (SEA)
A logical extension of the LEA system is the Select Estimation Approach, which maps clock speeds to specific ranges of IPC as shown in Figure 3.6. Such a system was implemented with the threshold values shown in Table 3.1. These values were selected by carefully observing Figure 2.3 (Chapter 2) and approximating the IPC values at which each clock speed saturates (i.e., increasing the clock speed gains no significant improvement in speedup). The advantage of this method is its independence from the current performance state, but it comes at the cost of complexity, requiring rigorous experimental evaluation for each architecture.

Figure 3.6: Performance state estimation with the select approach (SEA)
Threshold   Value
T0          0.25
T1          0.50
T2          0.75
T3          1.25
Table 3.1: Threshold values for the select evaluation
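A sketch of the select decision using the thresholds of Table 3.1, again in the 1/8 fixed-point IPC representation (0.25, 0.50, 0.75 and 1.25 scale to 2, 4, 6 and 10); the mapping of the five resulting ranges onto P0 through P4 is an assumption consistent with Figure 3.6:

/* SEA: map an IPC sample directly to one of five performance states. */
int sea_next_state(unsigned int ipc_fp)  /* ipc_fp = IPC * 8 */
{
    static const unsigned int t[] = { 2, 4, 6, 10 };  /* T0..T3, scaled by 8 */
    int s = 0;

    while (s < 4 && ipc_fp > t[s])
        s++;
    return s;  /* 0..4, i.e., P0..P4 */
}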
A final note on the difference between the LEA and SEA systems: the LEA system chooses the next higher or lower performance state based on thresholds, while the SEA system chooses a specific performance state directly. The SEA and LEA systems were both evaluated with the delta based mutation engine described in Chapter 4.
3.4 Experimental setup
The experiments were conducted on an AMD quad-core Barcelona, which allows individual processor cores to be set to different clock speeds. The set of workloads described in Table 2.2, Chapter 2 (groups of SPEC2006 benchmarks) was started on each of the layouts provided in Table 3.2 (increasing from a to n, each possessing faster cores and in effect higher power consumption), and the execution time of each member benchmark was measured and summed to arrive at a single execution time per workload. The experiment was repeated three times, and the mean of the execution times was compared with that of the [4, 4, 4, 4] layout to obtain the percentage slowdown.
Layout Name   PC0   PC1   PC2   PC3
a             0     0     0     0
b             0     0     1     1
c             0     0     2     2
d             0     0     0     4
e             1     1     2     2
f             0     0     3     3
g             0     0     4     4
h             1     1     3     3
i             1     1     4     4
j             2     2     3     3
k             2     2     4     4
l             0     4     4     4
m             3     3     4     4
n             4     4     4     4
Table 3.2: Experimental layouts
3.5 Static/Fixed layout of processor cores
The results of the experiment described above are plotted in Figure 3.7, with the X-axis being the average power consumption of the processor layout and the Y-axis being the percentage slowdown observed. One obvious trend is that the percentage slowdown decreases as the power provided per core (through higher available performance states) increases. The problem with fixed static layouts is obvious from the peaks observed whenever there is an asymmetry in the layout. An obvious example is layout l ([0, 4, 4, 4]), where processor C0 is in state P0 while the remaining processors C1, C2 and C3 are in P4. Layouts k and l are only a couple of watts apart in terms of power consumption, but layout l displays 40% higher slowdown than k ([2, 2, 4, 4]). This behavior has been observed to be common when more tasks battle over a smaller number of processors, where the time spent in migrations becomes noticeable and adds to the slowdown of the tasks. The Low and High workloads are the least affected by varying layouts, which demonstrates that static layouts are best when all workloads exhibit equal and non-varying phase behavior.
Figure 3.7: Slowdown vs power consumption for fixed layouts

Chapter 4
Power-aware throughput management on multicore systems
Power and energy management are critical in high-performance and high-throughput environments. Servers are usually limited in the peak power and energy they may consume, in order to reduce maintenance cost; as an example, power and energy considerations govern server space expansion in industry. To combat these constraints, power management systems are becoming increasingly common in the server space. Two common methodologies exist to reduce the power consumption of a compute element: the first is to turn off processors during idle time; the second is to keep the system active but lower the operating frequency and voltage of the processor so that it draws proportionally lower power.
In order to fully utilize server nodes and minimize idle time, the current trend in industry is to run multiple virtual machines on a single server rack, making varied workloads commonplace in otherwise single-purpose server usage. This further strengthens the requirement for dynamic, non-trace-driven power management systems. [17] discusses a power optimizer specialized for a virtual machine farm, but fails to recognize the workload characteristics of these systems.
Lowering the frequency has the pleasant benefit of reducing power consumption, and hence energy and cooling costs. But as shown in Chapter 2 (Figures 2.3 and 2.4), when the average IPC of an application is high, a reduction in frequency only causes the application to execute for a proportionally longer duration, yielding no energy benefit (Figure 2.4). The only advantage of reducing the frequency in that case is the reduction of energy supplied to the system per unit time, and hence possibly lower heat dissipation.
[5], [10] and [2] propose adapting the clock speed of the processing element based on the current application's demand. These methodologies assume that DVFS transitions are local to a processor core. However, some multicore processors have dependencies between processor cores (transitioning one might potentially transition the other), and systems with simultaneous multi-threading features expose a single processor core to the operating system as multiple virtual processors. Applications executing on such processing elements are thus tied to each other, and any DVFS transition made on behalf of one application might affect another negatively. Chapter 3 showed that such transitions could be replaced with a simple processor migration, requiring power optimizers to react to the needs of the scheduler.
4.1 Common power management systems
The most popular among power management techniques are load directed systems, which transition a processor to higher or lower performance states based on the current load of the system. Two of these techniques, ondemand [18] and conservative, are implemented within the thesis infrastructure. The ondemand system raises the performance state to the highest possible level at high loads, and gradually reduces it towards the lowest state during lower load. The new performance state P′_i of processor i is decided from its current performance state P_i and its load Load_i:

P'_i = \begin{cases} P_{max} & : Load_i \geq 0.8 \\ P_i - 1 & : Load_i < 0.8 \end{cases} \quad (4.1)

The characteristic of this system is to respond rapidly to load increases while conservatively lowering the performance state at the end of the active period. The conservative system, on the other hand, gradually changes the
performance state in either direction:

P'_i = \begin{cases} P_i + 1 & : Load_i \geq 0.8 \\ P_i - 1 & : Load_i < 0.8 \end{cases} \quad (4.2)

Here the performance state of processor i after a transition is again decided from the load Load_i of each processor; the typical characteristic is to gradually meet the system's needs, which is preferred for servers with typically short-running workloads.
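The two load based policies of Equations (4.1) and (4.2) condense into a few lines of C. The sketch below is illustrative, expressing the 0.8 load threshold as a percentage to avoid floating point and clamping to the valid state range:

/* Load given as a percentage (0..100); threshold 80 corresponds to 0.8. */
int ondemand_next(int p_cur, int p_max, int load_pct)
{
    return load_pct >= 80 ? p_max : (p_cur > 0 ? p_cur - 1 : 0);  /* Eq. 4.1 */
}

int conservative_next(int p_cur, int p_max, int load_pct)
{
    if (load_pct >= 80)
        return p_cur < p_max ? p_cur + 1 : p_max;                 /* Eq. 4.2 */
    return p_cur > 0 ? p_cur - 1 : 0;
}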
4.2 Overview of the Linux cpufreq architecture
In order to maintain architecture independence and the separation of mechanism and policy, the subsystem in the Linux kernel responsible for managing the voltage and frequency configuration separates the procedure in two: cpufreq drivers (the region enclosed by "Drivers" in Figure 4.1) are responsible for the actual P-state transition and register with the intermediate cpufreq layer, while cpufreq governors are responsible for policy and likewise register with the cpufreq layer. Policy attributes are:

• How often changes to the DVFS configuration are made.

• The magnitude of changes in configuration settings.

Once a working configuration is initialized, the cpufreq governor instructs the cpufreq driver, in an indirect fashion, on the required transition. As most of the experimentation mentioned in this text relates to the AMD Barcelona, the powernow-k8 cpufreq driver was used to initiate the transitions.
The procedure of changing the DVFS configuration of a multicore processor system is termed, for the remainder of this thesis, mutation. The entity making policy decisions about the nature of the mutation (which processor transitions to which performance state) is referred to as the mutator. The interval at which the mutation decision is performed is termed the mutation interval.

A cpufreq governor, seeker, was developed with kernel-exported interfaces through which the mutator requests transitions. A simple registration method was introduced to allow the mutator to be informed every time a transition completes, as shown in Figure 4.1. The asynchronous callback mechanism allows the mutator to update its data structures even when an entity other than itself (for example, through the sysfs interface) requests a performance state change.
Figure 4.1: The Seeker governor
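The registration and notification mechanism can be sketched as follows; this is an assumed design illustrating the asynchronous callback described above, not the actual seeker source:

#define MAX_CALLBACKS 8

typedef void (*transition_cb)(int cpu, int new_pstate);

static transition_cb callbacks[MAX_CALLBACKS];
static int nr_callbacks;

/* Called by the mutator (or any other interested entity) at init time. */
int seeker_register_callback(transition_cb cb)
{
    if (nr_callbacks >= MAX_CALLBACKS)
        return -1;
    callbacks[nr_callbacks++] = cb;
    return 0;
}

/* Called by the governor after the cpufreq driver completes any transition,
 * including transitions requested through sysfs by entities other than
 * the mutator. */
static void seeker_notify_transition(int cpu, int new_pstate)
{
    for (int i = 0; i < nr_callbacks; i++)
        callbacks[i](cpu, new_pstate);
}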
4.3 Proposed system overview
In contrast to the load based power management systems described in Section 4.1, a performance directed power management system was implemented. [5], [10] and [2] describe methodologies where DVFS transitions are invoked based on the performance requirements of the workload executing on the system. Chapter 3 demonstrated a novel method of substituting task migrations for DVFS transitions in a multi-processor system where each processor may be at a different performance state, with the processor layout L_s being a vector of sets of length m, which allowed the performance directed scheduler to schedule tasks on a set with an algorithmic complexity of O(m).
Another possible view of the processor layout is

L_m = \begin{bmatrix} P_{C_0} & P_{C_1} & \cdots & P_{C_{n-1}} \end{bmatrix} \quad (4.3)
where L_m is a vector of integers of length n and each element L_{m_i} is the performance state of processor i. It is clear that L_s and L_m are different views of the same information, and the proposed mutation scheme must take responsibility for keeping these two views of the processor layout consistent. As L_s is optimized for the scheduler, it will not be discussed further in this chapter, and any reference to the processor layout or L will be with respect to the mutation engine's view: L_m. The processor layout thus provides n per-core performance state settings for the current workload.
Chapter 3 demonstrated the shortcomings of a static processor layout in the face of workloads varying in characteristics and number. In order to alleviate the negative aspects of a static layout, a mutation scheme was developed allowing the processor layout to adapt, on a fixed interval called the mutation interval, to the demands of the workload (mutation interval >> scheduling quantum). Figure 4.2 shows the time-line interaction of the performance directed scheduler and the mutation engine.
Every scheduling quantum the performance directed scheduler is invoked; it measures the performance (IPC) of the task, and based on this the preferred performance state P′ is chosen, varying with the performance state evaluation method (ladder or select). The corresponding element of the demand vector D is then incremented (D_{P′} = D_{P′} + 1). At the expiration of every mutation interval, possibly incorporating numerous such evaluations (mutation interval >> scheduling quantum), this demand vector D provides the mutation engine with an understanding of the current performance requirements of the processor layout L. At the end of the mutation algorithm, the demand vector D is cleared (∀i ∈ N, i < m : D_i = 0). This creates the illusion of a dynamic processor layout which adapts to the needs of the machine's workload.

Figure 4.2: Hypothetical time line displaying the invocation frequencies of the scheduler and the mutator
Since rapid and drastic DVFS transitions can affect the reliability of the system [13], operating system management policies are critical. The delta constrained mutation scheme was developed to provide the ability to control the rate of transitions, by limiting DVFS transitions to be performed globally and at fixed interval lengths (the mutation interval). Additionally, the delta mutation policy can limit the magnitude of a mutation, thus eliminating rapid transitions. The PDS described in Chapter 3 operates at the resolution of a scheduling quantum, while DVFS transitions are performed at the resolution of the mutation interval, a multiple of the scheduling quantum. In order to place an absolute upper bound on DVFS transitions, a parameter called the delta constraint (∆) limits the maximum number of mutations that can be performed at any particular instant (hence limiting the maximum mutations per second to ∆ / mutation interval).
Delta (∆) for a system with n processors, where each processor i is transitioned from performance state P_i to P′_i, can be defined as

\Delta \geq \sum_{i=0}^{n-1} |P_i - P'_i| \quad (4.4)
and can also be defined as the maximum Manhattan distance allowed between the layout vector
L before the mutation and the layout vector L′ after the mutation. Thus the example mutation
as shown in Figure 4.3 is allowed only for a system with a delta constraint ∆ ≥ 3.
Figure 4.3: Example mutation with the Manhattan distance between L (layout before mutation)and L′ (layout after mutation) equal to 3
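A sketch of the constraint check of Equation (4.4); the mutation of Figure 4.3 has a Manhattan distance of 3 and is therefore permitted only when ∆ ≥ 3:

#include <stdlib.h>

/* Returns 1 if moving from layout L to Lnew respects the delta constraint. */
int mutation_allowed(const int *L, const int *Lnew, int n, int delta)
{
    int dist = 0;

    for (int i = 0; i < n; i++)
        dist += abs(L[i] - Lnew[i]);  /* Manhattan distance of Eq. (4.4) */
    return dist <= delta;
}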
4.4 Problem definition
An overview of the system required is as follows:
(1) The scheduler estimates the required performance state for each task and maintains the
demand, a monotonically increasing count, representing the number of times each state
was requested.
(2) Each processor can take performance states from 0 to m− 1.
(3) The total movements: the Manhattan distance from the current layout to the next layout
should always be less than or equal to the delta constraint (∆).
(4) The system should be partial to maintaining a core’s current performance state if such
a performance state is requested.

The problem can be expressed as a multiple-choice knapsack problem, with each variable x_i (corresponding to each processor) capable of m options, where

x_{ij} \in \{0, 1\}, \quad i = 0, 1, \ldots, n-1; \; j = 0, 1, \ldots, m-1 \quad (4.5)

is a 0-1 choice of selecting performance state j for processor i. As at most one performance state can be selected for a processor, the problem is restricted by

\sum_{j=0}^{m-1} x_{ij} = 1, \quad \forall i = 0, 1, \ldots, n-1 \quad (4.6)

which allows only one of the m performance state choices j for each processor. The knapsack capacity is the delta constraint (∆), making the constraint of the optimization problem

\text{subject to: } \sum_{i=0}^{n-1} \sum_{j=0}^{m-1} |L_i - j| \, x_{ij} \leq \Delta \quad (4.7)

where x_{ij} is the selection of the transition of processor i from performance state L_i to j. Finally, the problem is to optimize the transitions so as to best satisfy the demand vector D:

\max \sum_{i=0}^{n-1} \sum_{j=0}^{m-1} D_j \, x_{ij} \quad (4.8)
With the problem defined this way, the memoization based dynamic programming method described in [19] was implemented. Early experimental results showed that the algorithm was inefficient because it has no concept of transition direction; that is, the algorithm attempts to fill the knapsack (performing as many transitions as fit within the constraint ∆) without distinguishing transitions which increase performance from transitions which decrease performance.

Another effect not accounted for by the dynamic programming approach is that the algorithm is repeated periodically, with its current result becoming the problem input for the next mutation interval. The dynamic programming approach tries to perfect the selection even though the layout adaptation can always be refined in future mutation intervals; the required nature of the adaptation is merely to gravitate towards the optimal layout. This motivated the development of the iterative, direction based greedy algorithm described below.

4.5 The delta constrained mutation algorithm
Figure 4.4 describes the steps involved in the proposed iterative greedy algorithm; each sub-step is described in the following sections. The initialization procedure is described in Section 4.5.1. Based on the iteration delta constraint δ, a matrix, the Manhattan matrix (S), is constructed as described in Section 4.5.2. The Manhattan weight vector W is described in Section 4.5.3. The cooperative demand transformation is described in Section 4.5.4. The greedy selection of the winning state w is described in Section 4.5.5. The selection of the processor c to transition to the winning state w is described in Section 4.5.6. Finally, the parameter adjustments and iteration termination conditions are described in Section 4.5.7.
Figure 4.4: Mutation algorithm

4.5.1 Initialization
In order to effectively provide only the needed number of processors, the mutator queries the operating system for the total number of tasks in the ready state, T. The load of the system, in terms of the number of processors required, is computed as

Load = \begin{cases} T & : T \leq n \\ n & : T > n \end{cases} \quad (4.9)

where T is the total number of tasks in the ready state in a multicore system with n on-line cores. Using an estimated load based on processor idle times was found to be unfruitful, as it can be unrepresentative of the actual computing capacity demanded by the number of active tasks. This can be aggravated in situations where multiple tasks are queued on a single processor.
4.5.1.1 Processor demand
The performance directed scheduler described in Chapter 3 maintains the demand for each state in the demand vector D. The contents of this vector cannot be consumed directly, as their raw values do not describe the demand for individual performance states in units of cores. Consequently, the vector demand is computed based on the projected load of the system and the demand vector D:

demand_i = \frac{D_i \times Load}{\sum_{j=0}^{m-1} D_j} \quad (4.10)

where D_i is the total number of times performance state P_i was requested by the performance directed scheduler in the previous mutation interval, and Load is the estimated total load of the system. The vector demand is also referred to as the core demand, as each element demand_i is the total number of cores demanded by the performance directed scheduler at state P_i in order to optimally schedule the current workload.
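A sketch of the core-demand computation of Equation (4.10) in integer arithmetic; note that truncating division may leave the elements summing to slightly less than Load, which a real implementation would have to redistribute:

/* demand[i] = D[i] * load / sum(D), all in integer arithmetic. */
void compute_demand(const unsigned *D, int m, unsigned load, unsigned *demand)
{
    unsigned total = 0;

    for (int j = 0; j < m; j++)
        total += D[j];
    for (int i = 0; i < m; i++)
        demand[i] = total ? (D[i] * load) / total : 0;
}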

4.5.1.2 Present and required power characteristic of the system
power_P, the present or current power number of the multicore system, is

power_P = \sum_{i=0}^{n-1} L_i \quad (4.11)

where n is the total number of on-line cores and L is the current layout of the system. Thus power_P is an integer value proportional to the total power drawn by the multicore system. power_R, the required power number,

power_R = \sum_{i=0}^{m-1} i \times demand_i \quad (4.12)

is computed based on the processor demand vector demand, where each core is capable of m performance states and demand_i is the total number of cores demanded at performance state P_i. Thus power_R is proportional to the power that would be drawn by the multicore system if the cores were transitioned so as to satisfy the performance directed scheduler in optimally scheduling its current workload.
4.5.1.3 Transition direction
Based on the parameters power_P and power_R, the transition direction trans, a tri-state value trans ∈ {−1, 0, +1} describing the nature of mutation required, is computed:

trans = \begin{cases} +1 & : power_R > power_P \\ 0 & : power_R = power_P \\ -1 & : power_R < power_P \end{cases} \quad (4.13)

This is used by the mutation engine to evaluate the direction in which the DVFS configuration must be moved. A transition direction of trans = +1 implies that the scheduler requires higher performance from the multicore system, while trans = −1 implies the exact opposite: an indication to conserve power by reducing the performance states of the multicore processor. A transition direction equal to zero implies stability, and mutations must be avoided if possible.
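Equations (4.11) through (4.13) combine into a straightforward computation, sketched below:

/* Power numbers (Eqs. 4.11, 4.12) and transition direction (Eq. 4.13). */
int transition_direction(const int *L, int n, const unsigned *demand, int m)
{
    int power_p = 0, power_r = 0;

    for (int i = 0; i < n; i++)
        power_p += L[i];               /* present power number */
    for (int i = 0; i < m; i++)
        power_r += i * demand[i];      /* required power number */

    if (power_r > power_p)
        return 1;                      /* scheduler wants more performance */
    if (power_r < power_p)
        return -1;                     /* conserve power */
    return 0;                          /* stable: avoid mutations */
}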
4.5.1.4 Poison vector and iteration delta constraint initialization
Before starting the iterations, the poison vector X of length n is initialized to all ones,

\forall i \in \mathbb{N}, i < n : X_i = 1 \quad (4.14)

where each element X_i ∈ {0, 1} is a binary indicator of a unique processor's assignment status. A value X_i = 0 indicates that processor i has already been allocated a performance state and must not be re-transitioned. Finally, the iteration delta constraint δ is initialized to ∆ (δ = ∆).
4.5.2 The Manhattan matrix
The Manhattan matrix S, with n rows (one per core) and m columns (one per performance state), can be viewed as a probability matrix in which each element S_{ij} indicates the weight (relative probability) of transitioning processor i to performance state j, for a system with n cores each capable of m performance states, as depicted in Figure 4.5. Thus S_{23} describes the weight of processor C2 transitioning to performance state P3. The matrix was originally conceived as a probability density function for each processor, but was soon modified to integer weights due to the lack of floating point operations in kernel space (even though floating point is possible, the kernel does not save the floating point context, making its use dangerous as the kernel is fully preemptible).

Figure 4.5: State transition diagram for processor i currently in performance state Li
The Manhattan matrix (n × m) is computed as

S_{ij} = \begin{cases}
0 & : j < L_i - \delta \\
\frac{m}{2}\,(trans^2 - trans + 2) + j - L_i & : L_i - \delta \leq j < L_i \\
2m & : j = L_i \\
\frac{m}{2}\,(trans^2 + trans + 2) - j + L_i & : L_i < j \leq L_i + \delta \\
0 & : j > L_i + \delta
\end{cases} \quad (4.15)

where L_i is the current performance state of processor i in a multicore environment with a total of n cores, each capable of m performance states, and the transition direction trans is estimated to be −1, 0 or +1.
In order to help in understanding the consequence of Equation (4.15), an example is considered describing the matrix for each of the three possible values of the transition direction trans = +1, 0, −1. A hypothetical multicore with a total of 4 cores (n = 4), each capable of 5 performance states (m = 5), is considered. The current layout L is assumed to be L = [0, 1, 2, 3], implying that processors C0, C1, C2 and C3 are in performance states P0, P1, P2 and P3 respectively. For all three variations, the iteration delta constraint is assumed to be δ = 2.
Evaluating the Manhattan matrix given in Equation (4.15) for δ = 2, n = 4, m = 5, L = [0, 1, 2, 3] and all values of trans,

S_{trans=+1} = \begin{bmatrix} 10 & 9 & 8 & 0 & 0 \\ 4 & 10 & 9 & 8 & 0 \\ 3 & 4 & 10 & 9 & 8 \\ 0 & 3 & 4 & 10 & 9 \end{bmatrix} \quad (4.16)

S_{trans=0} = \begin{bmatrix} 10 & 4 & 3 & 0 & 0 \\ 4 & 10 & 4 & 3 & 0 \\ 3 & 4 & 10 & 4 & 3 \\ 0 & 3 & 4 & 10 & 4 \end{bmatrix} \quad (4.17)

S_{trans=-1} = \begin{bmatrix} 10 & 4 & 3 & 0 & 0 \\ 9 & 10 & 4 & 3 & 0 \\ 8 & 9 & 10 & 4 & 3 \\ 0 & 8 & 9 & 10 & 4 \end{bmatrix} \quad (4.18)
is obtained. The following observations and reasoning can be made about these variations of
the Manhattan matrix:
• For each row, i, the highest weight of 2m (10), is assigned to the element Sij when
j = Li.
◦ As L0 = 0, the element S00 = 10.
◦ The highest weight is given to a transition favoring a processor to maintain its
current performance state if required.
• For each row, i, elements Sij are assigned a value of 0 when they are more than δ
elements away (|j − Li| > δ) from the current performance state.
◦ Elements S03 and S04 are assigned a value of 0.
◦ This honors the iteration delta constraint δ by assigning a weight (probability)
equal to zero to such transitions.
• For each row, i, and column Li < j ≤ (Li + δ), integer weights are assigned to Sij
such that they are in decreasing order starting from 2m − 1 (9) for trans = +1 or
starting from m− 1 (4) for trans = 0 and trans = −1.
◦ Weights for transitions linearly reduce as they are further from the current perfor-
mance state.
◦ Strans=+1: Elements S01, S02 are given values 9 and 8 respectively and weights
linearly reduce from 2m− 1 (9) as transition direction trans = +1 indicates that
mutations (DVFS reconfigurations) must favor transitions increasing the perfor-
mance of the multicore.
◦ Strans=0,−1: Elements S01, S02 are given values 4 and 3 respectively and reduce
from m − 1 for trans = −1, 0 as mutations must not favor transitions increas-
ing the power consumption of the multicore system for trans = −1 or avoid
transitions for trans = 0.
• For each row, i, and column (Li− δ) ≤ j < Li, integer values are assigned to Sij such
that they are in decreasing order starting from 2m − 1 (9) for trans = −1 or starting
from m− 1 (4) for trans = 0 and trans = +1.
◦ Weights for transitions linearly reduce as they are further from the current perfor-
mance state.
◦ Strans=−1: Elements S21, S20 are given values 9 and 8 respectively and weights
linearly reduce from 2m − 1 (9) as transition direction trans = −1 indicates
that mutations (DVFS reconfigurations) must favor transitions reducing the power
consumption of the multicore.
◦ Strans=0,+1: Elements S21, S20 are given values 4 and 3 respectively and reduce
from m − 1 for trans = +1, 0 as mutations must not favor transitions reducing
the performance of the multicore system for trans = +1 or avoid transitions for

34
trans = 0.
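The case analysis of Equation (4.15) translates directly into integer C: since trans² − trans and trans² + trans are always even, the m/2 factor stays exact in integer arithmetic. A sketch computing a single entry:

/* Weight of transitioning a processor with current state Li to state j. */
int manhattan_entry(int Li, int j, int m, int delta, int trans)
{
    if (j == Li)
        return 2 * m;                  /* favor keeping the current state */
    if (j < Li - delta || j > Li + delta)
        return 0;                      /* honors the delta constraint */
    if (j < Li)                        /* lowering the performance state */
        return m * (trans * trans - trans + 2) / 2 + j - Li;
    /* j > Li: raising the performance state */
    return m * (trans * trans + trans + 2) / 2 - j + Li;
}

As a check against Equation (4.16), manhattan_entry(0, 1, 5, 2, +1) returns 9, matching element S01 of S_{trans=+1}.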
4.5.3 The Manhattan weight vector
The Manhattan weight vector W is constructed as

W_j = \sum_{i=0}^{n-1} S_{ij}, \quad \forall j \in \mathbb{N}, j < m \quad (4.19)

where n is the total number of cores and m is the total number of performance states available on each core. The Manhattan weight vector is obtained by summing along each column of the Manhattan matrix, and hence provides insight into the locality of each state. A higher value of W_j indicates a higher probability that there exist active processors in or around performance state j, while a null value W_j = 0 indicates that performance state j can never be reached under the current delta constraint.
4.5.4 Cooperative demand distribution: The demand field
With lower values of ∆ (or, with respect to the current iteration, δ), there is a possibility that a state i could be requested with a high core demand (demand_i) but, due to the current layout and delta constraint, have a Manhattan weight W_i = 0, implying that a core in performance state i can never be provided. A method was therefore developed to transform the vector demand in such a way that such null performance states cooperatively give up their demand to the closest performance state which has a possibility of being selected. This procedure is described in Algorithm 1.
The demand field F is essentially a replica of demand; it differs only when, for a particular performance state i, Wi = 0 and demandi > 0, in which case the state's demand is distributed to a friend f chosen based on the transition direction trans. If transition direction trans = 1, then the demand is given to the state f closest to and less than i with Wf > 0. If the

Input: Demand vector demand, Manhattan weight vector W
Output: Demand field F, Proxy vector Proxy
foreach Performance state i do
    Proxyi = i;
    Fi = demandi;
end
foreach Performance state i do
    if Wi = 0 and demandi > 0 then
        if trans = +1 then
            f = max{x : x ∈ N and x < i and Wx > 0};
        end
        if trans = −1 then
            f = min{x : x ∈ N and x > i and Wx > 0};
        end
        if trans = 0 then
            f = x : x ∈ N and Wx > 0 and |x − i| is minimum;
        end
        Ff = Ff + Fi;
        Fi = 0;
        Proxyf = i;
    end
end
Algorithm 1: Demand field computation
transition direction trans = −1, then a friend is searched in the opposite direction with Wf > 0. Finally, when trans = 0, the closest state f to i with Wf > 0 is chosen as the friend. Once the
friend state f is computed, demandi is transferred to f. In order to track contributions from other
performance states, a vector Proxy is maintained indicating the source of the demand. Under
normal circumstances, Proxyi = i.
It must be noted that if the LEA system (Chapter 3) is used, this algorithm yields F = demand, as the demand will always be within a range of 1 from the current performance state of each processor. The algorithm comes into play with the SEA system.
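Algorithm 1 translates directly into C. The sketch below is illustrative rather than the module's actual source: the function and array names, the linear scans used to find the friend state, and the representation of trans as an int are all assumptions of this example.

void compute_demand_field(const int *demand, const int *W, int trans,
                          int *F, int *proxy, int n_states)
{
    int i, x;

    /* Start with F as a replica of demand and every state as its own proxy. */
    for (i = 0; i < n_states; i++) {
        proxy[i] = i;
        F[i] = demand[i];
    }

    for (i = 0; i < n_states; i++) {
        int f = -1;

        if (W[i] != 0 || demand[i] == 0)
            continue; /* Reachable state, or nothing to give up. */

        if (trans == 1) {
            /* Closest reachable state below i. */
            for (x = i - 1; x >= 0; x--)
                if (W[x] > 0) { f = x; break; }
        } else if (trans == -1) {
            /* Closest reachable state above i. */
            for (x = i + 1; x < n_states; x++)
                if (W[x] > 0) { f = x; break; }
        } else {
            /* Closest reachable state in either direction. */
            int d;
            for (d = 1; d < n_states && f < 0; d++) {
                if (i - d >= 0 && W[i - d] > 0)
                    f = i - d;
                else if (i + d < n_states && W[i + d] > 0)
                    f = i + d;
            }
        }

        if (f >= 0) {
            F[f] += F[i];  /* Cooperatively give up the demand. */
            F[i] = 0;
            proxy[f] = i;  /* Track the demand's origin. */
        }
    }
}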
4.5.5 Greedy performance state selection
Once W and F are computed, performance state w is selected as the winning state when
the value of the product Ww × Fw is maximum, hence reacting to demand while maintaining locality. The winning state w is

w = arg max_x (Wx × Fx)    (4.20)
4.5.6 Greedy processor selection and transition
With the winning performance state selected, the processor to be transitioned is deter-
mined by the row c
c = arg max_x (Sxw × Xx)    (4.21)
in the Manhattan matrix whose value Scw × Xc is maximum. Once the selection is made,
processor c is transitioned to performance state w by invoking the seeker cpufreq governor
described in Section 4.2. An addition, omitted from Figure 4.4 in order to maintain clarity, is to abort the iteration algorithm if the value Scw = 0. This condition implies that there are no more available processors capable of transitions under the iteration delta constraint.
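Equations (4.20) and (4.21) amount to two linear scans for a maximum product. The following C sketch illustrates both selections under the same assumed (illustrative) data layout as the earlier sketches; it is not the module's actual source.

void greedy_select(const int *S, const int *W, const int *F,
                   const int *X, int n_cores, int n_states,
                   int *w, int *c)
{
    int i, best;

    /* Winning performance state (Equation 4.20). */
    *w = 0;
    for (i = 1; i < n_states; i++)
        if (W[i] * F[i] > W[*w] * F[*w])
            *w = i;

    /* Winning processor (Equation 4.21). Poisoned processors
     * (X[i] == 0) can never produce a nonzero product, and *c is
     * left at -1 when no processor can transition (S[c][w] == 0),
     * which aborts the iteration. */
    *c = -1;
    best = 0;
    for (i = 0; i < n_cores; i++) {
        int v = S[i * n_states + *w] * X[i];
        if (v > best) {
            best = v;
            *c = i;
        }
    }
}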
4.5.7 Parameter change and termination conditions
At the end of every iteration, the parameters δ, Lc and demandw are updated to reflect the transition. First, as a transition was just performed, the iteration delta constraint must be reduced,

δ = δ − |Lc − w|    (4.22)

since part of the iteration delta constraint δ is consumed by the current transition. Second, as processor c is no longer in performance state Lc, it is updated to its new value w,

Lc = w    (4.23)
thus future iterations are aware of the current locality. Third, as a processor is assigned to performance state w, the corresponding demand,

demandProxyw = demandProxyw − 1    (4.24)

can be decremented. Note that, due to the distortion introduced by the cooperative demand distribution, demandProxyw is updated instead of demandw. Lastly, the poison vector's position for processor c,

Xc = 0    (4.25)

is set to zero, hence preventing processor c from participating in future transitions. It can be observed from Equation (4.21) that a value of 0 for Xi ensures that processor i will never be selected, as the resulting product Xi × Siw will always be zero, achieving the desired behavior.
This concludes a single iteration of the greedy direction-based algorithm, which is repeated for up to Load iterations as long as the iteration delta constraint δ > 0. The upper bound of Load (which can be at most n) on the number of iterations ensures early termination of the iterative algorithm when further iterations are unnecessary.
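The four updates of Equations (4.22) through (4.25) can be illustrated as a single bookkeeping routine. As before, this is a sketch with assumed names, not the actual module code.

void update_after_transition(int *delta, int *L, int *demand,
                             const int *proxy, int *X, int c, int w)
{
    int dist = (L[c] > w) ? L[c] - w : w - L[c];

    *delta -= dist;      /* Equation 4.22: consume part of delta.     */
    L[c] = w;            /* Equation 4.23: record the new locality.   */
    demand[proxy[w]]--;  /* Equation 4.24: decrement proxied demand.  */
    X[c] = 0;            /* Equation 4.25: poison processor c.        */
}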
The algorithmic complexity of the entire procedure, assuming n ≫ m, can be evaluated to be O(n²), as the number of iterations can be at most n, and each iteration involves the evaluation of the Manhattan matrix, which can have up to n rows. Note that the complexity of the described procedure is the same as that of the dynamic programming approach (with memoization) but is more efficient in practice due to the consideration of transition direction. The complexity is affordable as the procedure is executed at a much coarser interval (the mutation interval); if not, the mutation interval can be increased to reduce the effects of the evaluation time.
Chapter 5
Experimental setup and results
A Linux kernel module, shown in Figure 5.1, was developed incorporating the ideas discussed in Chapters 3 and 4. The scheduling routines in the Linux kernel (specifically, schedule and switch_to) were extended using kprobes [20] to add the performance directed scheduling features. The choice of kprobes [20], a feature intended for debugging and instrumentation, from the context of a kernel module poses two problems. First, kprobes themselves introduce additional code paths and interrupts, causing some, if not significant, overhead (less than 3%). Second, certain functions relating to migration are not exported to kernel modules, and hence indirect and suboptimal procedures were chosen to obtain the necessary functionality. In spite of these disadvantages, having the project as a kernel module accelerated the development process.
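For illustration, the skeleton of such a kprobe-based extension looks roughly as follows. This is a minimal sketch of the general kprobes pattern, not the actual Seeker module source; the handler body is a placeholder.

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/kprobes.h>

/* Runs just before schedule(); the performance directed scheduling
 * hooks would inspect tasks and performance counters here. */
static int schedule_pre_handler(struct kprobe *p, struct pt_regs *regs)
{
    /* Placeholder: collect per-task statistics, redirect tasks, etc. */
    return 0;
}

static struct kprobe schedule_probe = {
    .symbol_name = "schedule",
    .pre_handler = schedule_pre_handler,
};

static int __init probe_init(void)
{
    return register_kprobe(&schedule_probe);
}

static void __exit probe_exit(void)
{
    unregister_kprobe(&schedule_probe);
}

module_init(probe_init);
module_exit(probe_exit);
MODULE_LICENSE("GPL");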
For a system with m total performance states and n total processors, ∆ can at most take a value of n × (m − 1). Experiments were carried out on a quad-core AMD Opteron (Barcelona) running a patched version of the Linux 2.6.28 kernel. The logging interface shown in Figure 5.1 collects task specific and system wide statistics which aid in studying the behavior of both the performance directed scheduler and the delta constrained mutator. The statistics collected by the logging system aided in the computation of the total energy consumption of each processor and workload. In order to measure percentage slowdown, the six workloads mentioned in Chapter 2 were run at full clock speed, recording their cumulative execution times. The power values
mentioned in [21] (provided in Appendix A for convenience) were utilized to compute the average power and energy consumption. Appendix B displays the adaptation time-line graphs showcasing each of the two schedulers along with the delta constrained mutator.
The system call interface allows an application runtime (or an application itself) to provide hints to the scheduler on the performance state demanded, and allows compiler and runtime power optimizing frameworks to interact with the underlying kernel based power optimizer to create a more robust system. The system call interface is beyond the scope of this thesis and will not be discussed further; it is shown only as an example of possible extensions to the discussed power management system.
Figure 5.1: The Seeker infrastructure
5.1 Trends along delta and interval
Two parameters, namely the delta constraint (∆) and the mutation interval, were introduced in Chapter 4. In order to further study the system, it is important to characterize the variation of three important effects, namely slowdown, power savings and energy efficiency improvement, as a function of these parameters. A fully factorial experiment was conducted varying delta from 1 through 16 and the interval from 125ms through 1000ms. All six workloads in Table 2.2 (Chapter 2) were run for each experiment. Slowdown for each workload was computed by comparing its execution time to that when run on a multicore system with all cores executing at the maximum frequency. Power savings was computed by comparing the average power consumption to the highest, being 115.0W per core. Finally, energy efficiency improvement was computed by comparing the EPI of each experiment to that when all cores execute at the maximum possible clock speed.
Figure 5.2 shows the trends observed with respect to slowdown for the LEA and SEA scheduling systems. An unnaturally high slowdown can be observed for small values of delta (∆), which can be attributed to the plasticity in adaptation. The mutation interval has little significance towards slowdown for delta values higher than 3, but begins to play a vital role in differentiating the experiments once the delta constraint is lowered below that value. It can be observed that at lower values of delta, reducing the mutation interval has the effect of decreasing the slowdown and effectively reducing the performance lost. Comparing Figure 5.2(a) with 5.2(b) shows that lower values of delta have a more profound negative effect on the SEA system than on the LEA system. At ∆ = 1, the SEA system exhibits at least 5% higher slowdown, but both rapidly fall to a plateau close to 10% beyond a delta constraint of 4.
The logging interface was configured to sample the executing workload's characteristics, namely the execution time and the current performance state (which was used to arrive
(a) LEA
(b) SEA
Figure 5.2: Variation of average slowdown with delta and interval for the ladder (a) and the select (b) estimation approach
at the power consumption by utilizing Table A.1), which were further utilized to compute the total energy consumption and finally the average power consumption (average power consumption = total energy consumption / total execution time). Percentage power savings was estimated by comparing this value with the maximum possible power consumption of 115.0W (at the maximum possible clock speed of 2200MHz, taken from Table A.1).
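As a concrete illustration of this computation, the following user-space C sketch combines the per-core power values of Table A.1 with a hypothetical two-sample log; the structure layout and the sample values are invented for the example and do not reflect the actual logging format.

#include <stdio.h>

/* Per-core power (W) for performance states 0..4, from Table A.1. */
static const double state_power[5] = { 64.6, 75.6, 88.8, 108.2, 115.0 };

struct sample {
    int state;       /* Performance state during the sample. */
    double duration; /* Sample execution time in seconds.    */
};

int main(void)
{
    /* Illustrative log: 2 s at state 4, then 3 s at state 1. */
    struct sample log[] = { { 4, 2.0 }, { 1, 3.0 } };
    double energy = 0.0, time = 0.0;
    int i;

    for (i = 0; i < 2; i++) {
        energy += state_power[log[i].state] * log[i].duration;
        time += log[i].duration;
    }

    /* Average power = total energy consumption / total execution time. */
    printf("energy = %.1f J, avg power = %.1f W\n", energy, energy / time);
    /* Percentage power savings relative to 115.0 W at full clock speed. */
    printf("savings = %.1f%%\n", 100.0 * (1.0 - energy / time / 115.0));
    return 0;
}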
Figure 5.3 shows the variation of percentage power savings with varied values of delta (∆) and mutation interval. A trend similar to percentage slowdown is observed for power savings: a high value of power savings for lower values of delta, which plateaus at ∆ = 3. Even though high power savings are desirable, the effective slowdown accompanying lower delta constraints can be disadvantageous. A trend unique to power savings is the continued effect of the mutation interval, whose decrease marginally deteriorates power savings for all values of delta. During experimentation, mutation intervals lower than 100ms were observed to cause system instabilities and frequent crashes, and hence the study was performed only for values starting from 125ms. Even though this phenomenon can be attributed to the implementation, [13] warns system designers to avoid rapid DVFS transitions, and this warning should not be ignored. The SEA system can be observed to be largely affected by delta: a sharp decline in power savings can be observed for the SEA system, which also provides lower power savings overall.
Figure 5.4 shows the variation of the percentage improvement in energy efficiency (improvement in observed EPI) relative to the same experiments with all processors executing at the maximum clock speed. Figures 5.2 and 5.3 indicate the weakness of the SEA system in comparison to the LEA system. Going by the maximum values of energy efficiency improvement, however, it can be noted that the SEA system does better by a margin of 5%. The plot for the SEA system sports an even plane, indicating independence from the parameters delta and interval. The LEA system shows a maximum at the point corresponding to a delta value of 8 and an interval of 500ms, while the SEA system does better with lower interval values and has a minimum line at an interval value of 250ms. The LEA system deteriorates at a delta value of 16 for lower interval values (probably relating to unnecessary elasticity exhibited by the system).
Based on the observed trends of power savings, slowdown and energy efficiency improvement, it can be concluded that a delta value of ∆ = 4 provides the best compromise among all, and the interval can be chosen based on the arrival rate and average execution times of typical jobs on the server or compute node. Having a very low mutation interval can hamper potential power savings for systems with a high mean time between execution workloads; hence, system designers should consider all parameters before deploying the delta constrained mutator. Average job duration can also affect possible deployments with respect
(a) LEA
(b) SEA
Figure 5.3: Variation of average power savings with delta and interval for the ladder (a) and the select (b) estimation approach
to mutation interval. Care must be taken to choose the mutation interval as a fraction of the
average execution times of typical jobs running on the system.
5.2 Trends along maximum allowed mutation rate
Figures 5.2, 5.3 and 5.4 show the variation of slowdown, average power savings and energy efficiency improvement, respectively, as a function of the mutation interval and the delta
(a) LEA
(b) SEA
Figure 5.4: Variation of average energy efficiency (EPI) improvement with delta and interval for the ladder (a) and the select (b) estimation approach
constraint. The parameter maximum allowed mutation rate is the amalgamation of these two parameters, describing the rate at which mutations are allowed in the system, expressed in mutations per second:

Maximum allowed mutation rate = ∆ / Mutation interval    (5.1)
Even though less descriptive for a system administrator choosing deployment parameters, the results presented in Figures 5.2, 5.3 and 5.4 can be expressed as a function of the maximum mutation rate, as shown in Figure 5.5, making the trends easier to observe. Both the LEA and SEA systems show that increasing the maximum allowed mutation rate above 20 causes the slowdown to become less prominent, a direct effect of the greater elasticity of the adaptation system. It can also be observed that further increases in the maximum allowed mutation rate have no significant effect on reducing the slowdown or improving the power savings. The energy efficiency sports a bell curve for the LEA system, reaching a maximum of approximately 2.5% at a rate of 40. As observed in Figure 5.4, a flat energy efficiency improvement variation is observed for the SEA system, consistent at 2.5% and dipping slightly around a maximum allowed mutation rate of 45.
(a) LEA (b) SEA
Figure 5.5: Variation of average power savings, slowdown and energy efficiency improvement with maximum mutation rate for the LEA (a) and the SEA (b) scheduling systems
5.3 Comparing various methodologies and varying workloads
Chapter 3 described two methods of evaluating the performance state, hence providing two distinct methods of scheduling which affect the adaptation of the delta system.
Chapter 4 introduced a common power management system, the ondemand power optimizer. Experiments were conducted varying the mutation interval from 125ms to 1000ms for all three systems. Section 5.1 recommends a minimum delta value of ∆ = 4 in order to minimize the effects of slowdown due to adaptation plasticity, and hence the delta mutator with ∆ = 4 was utilized for the experiments proposed in this section. Utilizing the logging interface, the per-sample performance state (and in turn the power consumption), execution time and retired instructions were recorded to later compute, for each workload, the average power consumption, the energy efficiency (measured as energy consumption per executed instruction) and the slowdown. These experiments were repeated for each of the power management systems:
• Ondemand governor, mutation interval: 125ms, 250ms, 500ms, 1000ms
• The ladder performance directed scheduling system with the delta constrained mutator
with a delta value, ∆ = 4 and mutation interval: 125ms, 250ms, 500ms, 1000ms
• The select performance directed scheduling system with the delta constrained mutator
with a delta value, ∆ = 4 and mutation interval: 125ms, 250ms, 500ms, 1000ms
For each experiment, the percentage power savings, percentage slowdown and EPI (energy per instruction) improvement were computed and tabulated. These experiments were clustered based on workload in order to further categorize the results.
Due to the adaptive nature of the power management mechanism and its high dependence on the IPC characteristic of workloads, it can be hypothesized that workloads with low IPC (the Low workload) should expect the maximum power savings, while workloads with high IPC (the High workload) should expect the minimum slowdown and minimum power savings.
Figure 5.6 shows the average values of the parameter space for each workload under varying mutation intervals for the LEA system and the delta mutation engine. The High workload suffers an insignificant slowdown (around 1%) but also gains no power savings. There is a small deterioration of energy efficiency, which can be perceived as insignificant. Meanwhile, the Low workload achieves in excess of 40% power savings with a
meager 20% performance loss and a considerable gain in energy efficiency of 25%. These two workloads agree with the hypothesis made above. The PLow-Low workload shows a drastic amount of slowdown (30%) and is the most affected in the group. The median slowdown for the LEA system can be estimated to be around the 16% margin, with a median power savings of 12%. Three out of six workloads (Low, Low-High and PLow-Low) show an improvement in energy efficiency.
Figure 5.6: Variation of power savings, energy efficiency improvement and slowdown with each workload for the ladder estimation approach (LEA) with the delta mutation engine at ∆ = 4
Figure 5.7 shows the average values of the parameter space for each workload under varying mutation intervals for the SEA system and the delta mutation engine. Observations similar to those for the LEA system can be made, except that the SEA system does worse in terms of slowdown, ranging from a minimum of 4% (High workload) to a very high 35% (PLow-Low workload). The power savings achieved by the SEA system is considerably lower than that of the LEA system, ranging from 1% to 28%. Even though worse in terms of slowdown and power savings, the SEA system does better in terms of energy efficiency improvement, with 4 out of 6
workloads (Low, Low-High, PHigh-PLow and PLow-Low) achieving a positive value, while the two other workloads (High and PLow-High) show only meager deterioration. The sharp increase in slowdown can possibly be attributed to a poor choice of the thresholds mentioned in Chapter 3; further investigation with varying threshold values is needed.
Figure 5.7: Variation of power savings, energy efficiency improvement and slowdown with each workload for the select estimation approach (SEA) with the delta mutation engine at ∆ = 4
Finally, Figure 5.8 shows the average values of the parameter space for each workload under varying mutation intervals for the Ondemand system. There is an insignificant (less than 0.5%) amount of achievable power savings, while the workloads are still susceptible to varied amounts of slowdown ranging from zero to 5%. Even though in theory the ondemand system should never cause slowdown, the observed values can best be attributed to the overhead of the particular implementation of the scheduling system explained in the introductory paragraph, implying that a production implementation of the scheduling and delta mutation system could reduce the slowdowns described above by a margin of 5%. Insignificant improvements in energy efficiency are observed. The PLow-Low workload suffers the minimum amount of slowdown, indicating characteristics dissimilar to those of the other workloads; this is also observed in the graphs for the LEA and SEA systems.
Figure 5.8: Variation of power savings, energy efficiency improvement and slowdown with each workload for the Ondemand governor
Any power management system will impose varied amounts of slowdown on the executing workloads based on its design. The experiments showcase the return on the slowdown incurred by the proposed system, with significant power savings ranging from 1% to 40%. However, the observed slowdown indicates room for improvement in the scheduling and mutation systems. Figures 5.6, 5.7 and 5.8 show that any amount of power savings is accompanied by some degree of slowdown. The goal is to maximize power savings while minimizing suffered slowdown. Appendix B shows the mutation time-lines of all the workloads with both the LEA and SEA systems, which should provide insight into the time-line mutation process.
Chapter 6
Future work
The scheduling methodology can integrate the work presented in [22] for higher accuracy in predicting the performance state required by a particular task. Further analysis of the relationship between workloads and clock speed invariance may prove valuable in improving the select scheduling system and in deriving a more optimal methodology for selecting its threshold values. The fact that there exists a correlation between IPC and the speedup achievable with increased clock speed can be further utilized to produce a more robust system. The first step would be to provide strong support within the Linux kernel for asymmetric multicore processors, which will enable further study into performance directed scheduling.
A system call interface was integrated into the existing system; it can be utilized to develop compiler or runtime frameworks, which would potentially have more information about program characteristics, in building a hybrid system that provides higher power savings with lower performance loss.
Chapter 7
Summary and Conclusion
This thesis presented an approach to scheduling tasks on multicore processors with respect to their performance requirements, and to performing constrained system-wide dynamic voltage and frequency transitions in reaction to the demands of the performance characteristics of the executing workloads. A novel methodology was presented for controlling the magnitude of these transitions to improve the stability and longevity of the multicore system. The methodology was shown to possess characteristics that minimize the slowdown of workloads while improving the power savings achieved.

Workloads were shown to possess characteristics that can make them immune to clock speed improvements, strengthening the motivation for a performance directed power management system to conserve power during the system's active state. Experiments with various static layouts were conducted to characterize and showcase the performance directed scheduler. It was demonstrated that a statically assigned performance state layout can drastically deteriorate the performance of the system in an environment with non-deterministic workload sets. Two scheduling systems, the ladder and the select, were developed to schedule tasks based on their performance characteristics.

A system was developed to allow a performance directed scheduler to direct an asynchronously run power optimizer to manage the performance states of each core in an otherwise homogeneous multicore environment. The problem was shown to be NP-hard, and an efficient
direction-based iterative greedy algorithm was developed to solve the resulting multiple-choice knapsack problem.
Experiments were conducted to measure slowdown and power savings along the sys-
tem’s entire parameter space to determine the most favorable constraint on the magnitude of the
allowed DVFS transitions. This was utilized to compare the power management system with a
popular load based power optimizer, the ondemand governor. The combination of the perfor-
mance directed scheduler and the delta constrained mutator was shown to achieve higher power
efficiency and better power savings with marginal slowdown.
Bibliography
[1] W. Kim, M. S. Gupta, G.-Y. Wei, and D. Brooks, “System level analysis of fast, per-core dvfs using on-chip switching regulators,” in 2008 IEEE 14th International Symposium on High Performance Computer Architecture. Piscataway, NJ, USA: IEEE, 2008, pp. 123–134.

[2] K. M. P. R. M. K. M. J. Irwin, “Phase-aware adaptive hardware selection for power-efficient scientific computations,” in Proceedings of the 2007 International Symposium on Low Power Electronics and Design. Portland, OR, USA: ACM, 2007, pp. 403–406.

[3] M. Kondo, H. Sasaki, and H. Nakamura, “Improving fairness, throughput and energy-efficiency on a chip multiprocessor through dvfs,” vol. 35, no. 1. New York, NY, USA: ACM, 2007, pp. 31–38.

[4] Q. Wu, M. Martonosi, D. W. Clark, V. J. Reddi, D. Connors, Y. Wu, J. Lee, and D. Brooks, “A dynamic compilation framework for controlling microprocessor energy and performance,” in MICRO 38: Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture. Washington, DC, USA: IEEE Computer Society, 2005, pp. 271–282.

[5] H. Sasaki, Y. Ikeda, M. Kondo, and H. Nakamura, “An intra-task dvfs technique based on statistical analysis of hardware events,” in Proceedings of the 4th International Conference on Computing Frontiers. Ischia, Italy: ACM, 2007, pp. 123–130.

[6] A. Rangasamy, R. Nagpal, and Y. Srikant, “Compiler-directed frequency and voltage scaling for a multiple clock domain microarchitecture,” in CF ’08: Proceedings of the 5th Conference on Computing Frontiers. New York, NY, USA: ACM, 2008, pp. 209–218.

[7] B. Rountree, D. K. Lowenthal, B. R. de Supinski, M. Schulz, V. W. Freeh, and T. Bletsch, “Adagio: making dvs practical for complex hpc applications,” in ICS ’09: Proceedings of the 23rd International Conference on Supercomputing. New York, NY, USA: ACM, 2009, pp. 460–469.

[8] B. Rountree, D. K. Lowenthal, S. Funk, V. W. Freeh, B. R. de Supinski, and M. Schulz, “Bounding energy consumption in large-scale mpi programs,” in SC ’07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing. New York, NY, USA: ACM, 2007, pp. 1–9.
[9] K. Meng, R. Joseph, R. P. Dick, and L. Shang, “Multi-optimization power management for chip multiprocessors,” in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques. Toronto, Ontario, Canada: ACM, 2008, pp. 177–186.

[10] C. Isci, G. Contreras, and M. Martonosi, “Live, runtime phase monitoring and prediction on real systems with application to dynamic power management,” in Proceedings of the 39th International Symposium on Microarchitecture (MICRO-39). Los Alamitos, CA, USA: IEEE Computer Society, 2006, 12 pp.

[11] Y.-H. Lu, L. Benini, and G. D. Micheli, “Operating-system directed power reduction,” in ISLPED ’00: Proceedings of the 2000 International Symposium on Low Power Electronics and Design. Rapallo, Italy: ACM, 2000, pp. 37–42.

[12] C. Boneti, R. Gioiosa, F. J. Cazorla, and M. Valero, “A dynamic scheduler for balancing hpc applications,” in SC ’08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. Piscataway, NJ, USA: IEEE Press, 2008, pp. 1–12.

[13] N. Soundararajan, N. Vijaykrishnan, and A. Sivasubramaniam, “Impact of dynamic voltage and frequency scaling on the architectural vulnerability of gals architectures,” in ISLPED ’08: Proceedings of the Thirteenth International Symposium on Low Power Electronics and Design. New York, NY, USA: ACM, 2008, pp. 351–356.

[14] Standard Performance Evaluation Corporation, “The SPEC CPU 2006 benchmark suite,” 2006. [Online]. Available: http://www.spec.org

[15] T. Li, D. Baumberger, D. A. Koufaty, and S. Hahn, “Efficient operating system scheduling for performance-asymmetric multi-core architectures,” in SC ’07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing. New York, NY, USA: ACM, 2007, pp. 1–11.

[16] D. P. Gulati, C. Kim, S. Sethumadhavan, S. W. Keckler, and D. Burger, “Multitasking workload scheduling on flexible-core chip multiprocessors,” in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques. Toronto, Ontario, Canada: ACM, 2008, pp. 187–196.

[17] R. Nathuji and K. Schwan, “Virtualpower: coordinated power management in virtualized enterprise systems,” in SOSP ’07: Proceedings of the Twenty-first ACM SIGOPS Symposium on Operating Systems Principles. New York, NY, USA: ACM, 2007, pp. 265–278.

[18] V. Pallipadi and A. Starikovskiy, “The on-demand governor,” in 2006 Linux Symposium, 2006. [Online]. Available: http://ftp.funet.fi/pub/Linux/kernel/people/lenb/acpi/doc/OLS2006-ondemand-paper.pdf

[19] J. C. Bean, “Multiple choice knapsack functions,” Technical Report, Department of Industrial and Operations Engineering, University of Michigan, Ann Arbor, Michigan 48109, 1987.

[20] R. Krishnakumar, “Kernel korner: kprobes-a kernel debugger,” Linux J., vol. 2005, no. 133, p. 11, 2005.
[21] AMD, AMD Family 10h Server and Workstation Processor Power and Thermal Data Sheet, June 2009. [Online]. Available: http://support.amd.com/us/Processor TechDocs/43374.pdf

[22] M. Curtis-Maury, A. Shah, F. Blagojevic, D. S. Nikolopoulos, B. R. de Supinski, and M. Schulz, “Prediction models for multi-dimensional power-performance optimization on many cores,” in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques. Toronto, Ontario, Canada: ACM, 2008, pp. 250–259.
Appendix A
AMD Opteron capabilities
Performance State    Clock Speed (MHz)    Power Consumption per Processor Core (Watts)
        0                  1100                           64.6
        1                  1400                           75.6
        2                  1700                           88.8
        3                  2000                          108.2
        4                  2200                          115.0
Table A.1: Correlation between performance states with clock speed and power consumption
Appendix B
Mutation time-line per workload
For each figure, the X-axis is time with a granularity of one mutation interval. The Y-axis is the stacked processor count for each performance state. Thus, for each vertical slice, the width of each color corresponds to the total number of cores executing at that particular performance state.
(a) Ladder (b) Select
Figure B.1: Adaptation time-line for the High workload with the ladder (a) and select (b) scheduling
(a) Ladder (b) Select
Figure B.2: Adaptation time-line for the Low workload with the ladder (a) and select (b) scheduling
(a) Ladder (b) Select
Figure B.3: Adaptation time-line for the Low-High workload with the ladder (a) and select (b) scheduling
(a) Ladder (b) Select
Figure B.4: Adaptation time-line for the PHigh-PLow workload with the ladder (a) and select (b) scheduling
(a) Ladder (b) Select
Figure B.5: Adaptation time-line for the PLow-Low workload with the ladder (a) and select (b) scheduling
(a) Ladder (b) Select
Figure B.6: Adaptation time-line for the PLow-High workload with the ladder (a) and select (b) scheduling
Appendix C
Source code
All the source code has been uploaded to a public git repository and is available at:
git://github.com/amithash/seeker-scheduler.git.
Note that in order to check out the source code, you will need to have the git source control management utilities installed.