
UNIVERSITY OF CALGARY

Experimental Evaluation of Speed Scaling Systems

by

Arsham Bryan Skrenes

A THESIS

SUBMITTED TO THE FACULTY OF GRADUATE STUDIES

IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE

DEGREE OF MASTER OF SCIENCE

GRADUATE PROGRAM IN COMPUTER SCIENCE

CALGARY, ALBERTA

SEPTEMBER, 2016

© Arsham Bryan Skrenes 2016


Abstract

Speed scaling policies are a critical component in modern operating systems, impacting both

energy efficiency and performance. Energy efficiency is important from a sustainability

standpoint, especially since datacenters account for roughly 2% of global energy consumption, a share that is growing by about 6% per year.

Understanding the features of modern processors facilitates the development of more effective

policies. As a first contribution, this thesis provides such information, along with the details

necessary to properly interpret experimental measurement results. The second contribution is a

profiler that makes it easy to perform controlled workloads made up of precise units of work at

defined speeds, and produces high-resolution timing and energy measurement data broken down

by process and workload.

The profiler is used to collect empirical data about several theoretical speed scaling policies

using a modern processor, with detailed analysis and comparisons to the most common policy on

contemporary operating systems.


Acknowledgements

First and foremost, I would like to thank Carey Williamson. From the very first email to arrange

our initial meeting, and throughout my entire program, he has continually been kind and

encouraging, exemplifying outstanding leadership and inspiring not only me, but also those who have had the privilege of knowing him. I cannot imagine a better supervisor, nor can words adequately express my gratitude.

I am also immensely thankful to my wife, Allison Skrenes, who was never doubtful and always

encouraging, even when I felt demotivated. Her advice and strategies, forged from her own

academic successes, were instrumental. Our children, Tyrian and Nyala, have had to endure this

long haul, but when I had to “work on the thesis”, they had the most amazing mother at their

side.

I must also thank my parents and in-laws for their encouragement. My mother was the person

who initially inspired me to pursue graduate studies. She also gave me valuable advice that

helped me finish the thesis.

Finally, I am grateful to my defense committee members, Dr. Mea Wang, Dr. Diwaker

Krishnamurthy, and Dr. Carey Williamson, who took the time to read my thesis and provide their

feedback.


Table of Contents

Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures and Illustrations
List of Abbreviations

CHAPTER 1: INTRODUCTION
   1.1 Objectives
   1.2 Thesis Roadmap

CHAPTER 2: BACKGROUND AND RELATED WORK
   2.1 The Importance of Energy Efficiency
   2.2 Theory and Practice
   2.3 Speed Scaling Process Schedulers
   2.4 Summary

CHAPTER 3: PROBLEM FORMULATION
   3.1 Profiling Schedulers on Real Systems
   3.2 High-Resolution Hardware Timers
   3.3 Running Average Power Limit (RAPL)
      3.3.1 Using the RAPL MSRs within Profilo
   3.4 Summary

CHAPTER 4: DESIGN AND IMPLEMENTATION
   4.1 Design Choices
      4.1.1 User Mode
      4.1.2 Kernel Mode
      4.1.3 Kernel/User Mode Hybrid
      4.1.4 Workload
      4.1.5 Units of Work
   4.2 Implementation
      4.2.1 Kernel Module
      4.2.2 Lockup Detectors
      4.2.3 Clocks
      4.2.4 User Mode Application
         4.2.4.1 Processing the Arguments
         4.2.4.2 CPU Compatibility
         4.2.4.3 Environment Setup
         4.2.4.4 Preparations for Profiling
         4.2.4.5 Profiling
         4.2.4.6 Concluding Profiling
      4.2.5 Idler Utility
   4.3 Summary


CHAPTER 5: MICRO-BENCHMARKING
   5.1 Platform
   5.2 ACPI Specification
   5.3 Total System Power Consumption
      5.3.1 Global Sleep States
      5.3.2 Normalizing Measurements
      5.3.3 Idle Power Consumption
         5.3.3.1 Linux cpuidle
         5.3.3.2 CPU Sleep States
      5.3.4 Active Power Consumption
   5.4 Profilo Workload Benchmarking
   5.5 Mode and Context Switches
   5.6 Switch Test
   5.7 Summary

CHAPTER 6: PROFILING COUPLED AND DECOUPLED SPEED SCALERS
   6.1 Workloads
   6.2 Building the Trace Files
      6.2.1 The PS Generator
      6.2.2 The FSP-PS Generator
      6.2.3 The YDS Generator
   6.3 Profiling Results
   6.4 Summary

CHAPTER 7: CONCLUSIONS AND FUTURE WORK
   7.1 Thesis Summary
      7.1.1 The Importance of Speed Scaling Scheduling Policies
      7.1.2 Building a Profiler
      7.1.3 Examining Modern Architectures
      7.1.4 Experimental Evaluation of Speed Scaling Schedulers
   7.2 Conclusions
   7.3 Relevance and Contributions
   7.4 Future Work

APPENDIX A

APPENDIX B

APPENDIX C

REFERENCES


List of Tables

Table 5.1 Power Consumption of Components

Table 5.2 Ivy Bridge C1 Power Savings

Table 5.3 Ivy Bridge C-State Power Measurements

Table 5.4 Ivy Bridge and Broadwell C-State Durations

Table 5.5 Single-Core Busy Wait

Table 5.6 Single-Core Trial Division Primality Test

Table 5.7 Benchmark Results (150 Primes)

Table 5.8 Mode and Context Switch Test

Table 5.9 Switch Test Results

Table 5.10 Mode Switch Durations and Energy Estimates

Table 6.1 Profiling Results


List of Figures and Illustrations

Figure 3.1 Structural Illustration of RAPL MSRs

Figure 4.1 Trial Division Primality Test Workload

Figure 4.2 Conceptual Overview of Profilo

Figure 6.1 Profiling Graph (Workloads 1-5)

Figure 6.2 Profiling Graph (Workloads 4-5)

Figure A.1 Inline Assembly to Read MSRs

Figure B.1 Workload C Programming Language Code

Figure C.1 PS Trace File

Figure C.2 Verbose Profilo Summary for PS Example

Figure C.3 FSP-PS Trace File

Figure C.4 Profilo Summary for FSP-PS Example


List of Abbreviations

AC Alternating Current

ACPI Advanced Configuration and Power Interface

AES Advanced Encryption Standard

AES-NI Advanced Encryption Standard New Instructions

AKS Agrawal–Kayal–Saxena Primality Test

ALU Arithmetic Logic Unit

API Application Programming Interface

AVX Advanced Vector Extensions

BLE Bluetooth Low Energy

CPU Central Processing Unit

CSV Comma-Separated Values

DC Direct Current

DDR Double Data Rate

DMA Direct Memory Access

DMI Direct Media Interface

DRAM Dynamic Random-Access Memory

DVFS Dynamic Voltage and Frequency Scaling

EDF Earliest Deadline First

EPA Environmental Protection Agency

FIVR Fully-Integrated Voltage Regulator

FPU Floating-Point Unit

FSP Fair Sojourn Protocol


GB Gigabyte

GDDR Graphics Double Data Rate Synchronous Dynamic Random-Access

Memory

GHz Gigahertz

GIMPS Great Internet Mersenne Prime Search

GPR General Purpose Register

GPU Graphics Processing Unit

HLT Halt (x86 Instruction)

HPET High Precision Event Timer

ICT Information and Communications Technology

IO Input/Output

IRQ Interrupt Request

ISA Instruction Set Architecture

KB Kilobyte

LAN Local Area Network

LCD Liquid-Crystal Display

LTS Long Term Support

MB Megabyte

MHz Megahertz

MMX Multimedia Extension

MRT Mean Response Time

MSR Machine Specific Register

NEMA National Electrical Manufacturers Association


NMI Non-Maskable Interrupt

OEM Original Equipment Manufacturer

OpenGL Open Graphics Library

OS Operating System

OSX Mac Operating System 10

PCH Platform Controller Hub

PCI Peripheral Component Interconnect

PCU Power Control Unit

PIT Programmable Interval Timer

PKG Package Power Plane (Entire CPU)

PLL Phase-Locked Loop

PM Power Management

POS Power On Suspend

PP0 Power Plane 0 (Cores)

PP1 Power Plane 1 (Uncore Devices / GPU)

PS Processor Sharing

RAM Random Access Memory

RAPL Running Average Power Limit

RMS Root Mean Square

RPM Revolutions Per Minute

RTC Real-Time Clock

SI International System of Units (Système International d'unités)

SIMD Single Instruction, Multiple Data


SMI System Management Interrupt

SMM System Management Mode

SMP Symmetric Multiprocessing

SRAM Static Random-Access Memory

SRPT Shortest Remaining Processing Time

SSE Streaming SIMD Extensions

TDP Thermal Design Power

TSC Time Stamp Counter

TSS Task State Segment

VCC IC Power-Supply Pin (Positive)

WLAN Wireless Local Area Network

YDS Yao-Demers-Shenker Speed Scaling Scheduling Algorithm


Chapter 1: Introduction

Energy efficiency has always been a consideration in processor design, but in recent years it has

topped the list of priorities. When Intel abandoned their NetBurst architecture in 2004, forfeiting a couple of years of costly research and development in favour of a considerably different design

that promised greater energy efficiency, they made their priorities very clear. Since then, they

have reduced their thermal design power (TDP), increased the number of idle and sleep states,

and added features that manage the behaviour of the processor to maximize energy efficiency

and minimize cooling requirements.

This change in priorities has also been reflected in the design of process schedulers – the

software component that manages how jobs are executed on the processor. The concept of

changing the service rates at which jobs are executed is called speed scaling. Incorporating speed

scaling into scheduling policies has led to additional complexity that has attracted analysis in the

literature. However, for the sake of analysis and comparison, assumptions are made and

characteristics of hardware are not modeled, which separates theory from practice.

Connecting theory and practice offers the possibility of improvements to the systems

community, while understanding the features of modern processors can give rise to improved

algorithms that are first described and analysed in theory. Bridging this gap can serve as a

catalyst for progress. Unfortunately, modern processor features tend to be described only in datasheets, website reviews, and blogs, which often lack a quantitative assessment. On the

theory side, algorithms can make simplifying assumptions and omit important behaviours that

make experimental evaluation challenging.


This thesis tries to address these concerns by bridging the two perspectives. To accomplish this,

the features and characteristics of Intel’s newest microarchitecture need to be described and

quantified through careful micro-benchmarks. This includes an examination of all sleep, idle,

and active states, describing and quantifying well-known properties like the power rating of the

maximum discrete speed, as well as properties like the exit latency of the deepest idle state. This

information helps answer what processor states and features are available to improve

performance and energy efficiency. It also addresses what behaviours are necessary for a modern

processor to accomplish work while saving power.

The second thing needed to address the gap between theory and systems is a profiling tool that

allows different schedulers and speed scalers to be evaluated on real hardware. While there may

have been attempts at doing this in the past, to the best of my knowledge, there has never been a

generalized profiler that produces high-resolution timing and multi-domain energy consumption

data. Furthermore, the profiler described in this thesis can provide this information without

ancillary (and costly) equipment, requiring only an Intel processor from 2011 or later.

Such a profiler can help answer these two important and pragmatic questions:

1. How different are the speed scaling scheduling policies that make different trade-offs to

attain certain optimizations?

2. How do the best of these policies compare to the implementations found on modern

operating system kernels?


1.1 Objectives

The objectives of this research are as follows:

• Determine the performance and energy saving features of modern processors, as well as

their analytical properties, and then quantify this information through micro-benchmarks.

• Develop a generalized profiler that makes it easy to execute controlled workloads made

up of precise units of work at defined speeds, and receive high-resolution timing and

energy measurements, broken down by process and workload.

• Profile decoupled speed scaling and other interesting speed scaling and scheduling

policies on real hardware to see what gains can be made compared to the policies found

in modern operating systems.

1.2 Thesis Roadmap

This thesis is organized as follows. Chapter 2 provides further background and motivation for

this research, and describes the prior research work from both the theoretical and systems

communities. Chapter 3 delves into the problem formulation and explains, in detail, the technical

resources that are used to address these challenges. Chapter 4 presents the details of the design

and implementation of Profilo, the profiler that fulfills one of the thesis objectives. Chapter 5

presents the architectural features and micro-benchmarking results of the test platform. It also

sets the context for the subsequent chapter. Chapter 6 examines the profiling results of the speed

scaling scheduling policies described in Chapter 2. It compares two theoretical policies to the


most common implementation in modern operating systems, breaking the results down through

several metrics, thereby satisfying the final thesis objective.


Chapter 2: Background and Related Work

The goal of this chapter is to provide some of the background information necessary to

understand the motivation for Profilo. It illustrates the importance of process schedulers and

speed scaling policies from both a theoretical and systems perspective.

2.1 The Importance of Energy Efficiency

In recent years, the power efficiency of computer processors has become increasingly important.

In a 2007 report, it was estimated that the total footprint for the ICT sector accounted for 2% of

the world’s carbon emissions from human activity [59]. This same report estimated a growth rate

of 6% per year, at least until 2020. A more recent 2013 report suggests that the entire ICT

ecosystem may be responsible for as much as 10% of the world electricity consumption [40]. In

response to a 2007 Congressional request, the United States Environmental Protection Agency

(EPA) estimated that the energy consumption for operating data centers located in the United

States consumed 61 billion kilowatt-hours or roughly 1.5% of the nation’s electricity use [74].

On the other end of the spectrum, mobile devices also benefit greatly when improvements to

energy efficiency can be made, due to their limited battery capacity.

The most energy-intensive component of almost any computer is its processor(s). While some

datacenters may use graphics processing units (GPUs) as compute resources, all datacenters utilize central processing units (CPUs). The CPU is therefore the focus for Profilo and this thesis.


There have been many ideas to improve processor power efficiency. One solution has been

Dynamic Voltage and Frequency Scaling (DVFS); in modern processors this is more commonly

referred to as CPU speed scaling. There are typically several discrete voltage/frequency states

that can be used. Because power consumption grows in proportion to the frequency multiplied by

the square of the voltage (which must be increased to maintain stability at higher frequencies), it

is substantially more energy efficient to run a processor at half the frequency for a given job,

taking twice as long, than it is to run the same processor at its full frequency. To properly

accomplish DVFS, a software component is required. More explicitly, the kernel scheduler in

conjunction with its speed scaling policy controls the processor frequency (and voltage).
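
That relationship can be made concrete with the standard first-order model of dynamic (switching) power. The notation below is the conventional textbook form, not a formula taken from one of the cited references:

$P_{\text{dyn}} \approx C_{\text{eff}} \, V^2 f$

where $C_{\text{eff}}$ is the effective switched capacitance, $V$ the supply voltage, and $f$ the clock frequency. Under the idealization that voltage scales linearly with frequency, halving the frequency reduces power to roughly one eighth; the job then takes twice as long, so its energy $E = P \cdot t$ falls to roughly one quarter.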

2.2 Theory and Practice

In dynamic CPU speed scaling systems, the speed at which the processor executes is adjusted

over time based on the ambient workload demand. If no jobs are present, the processor can enter

a rest state (e.g., “sleep”, “idle”, or “gated off”) to drastically reduce power consumption. In the

presence of one job, the processor can run at a modest baseline speed. If the number of active

jobs increases, the processor speed can be increased, perhaps to some maximum rate, to dissipate

the backlog quickly. Speed scaling strategies produce interesting trade-offs between response

time, fairness, and energy consumption. These issues have fostered research efforts in both the

systems community and the theory community.

In the published literature, there is a dichotomy between the speed scaling results for the systems

and theory research communities. The theoretical work tends to provide elegant results on the


optimality and efficiency of speed scaling algorithms [3], [5], [9], [11], [12], [62], albeit under

many assumptions (e.g., weighted cost functions for delay and energy consumption; known job

sizes; energy-proportional operation; job-count-based speed scaling; continuous and unbounded

speeds; zero cost for context switches, speed changes, or return from sleep states). Simulation is

sometimes used to augment the evaluation of speed scaling systems, but the simulators often

have similar assumptions as the analytical work.

In the systems community, research tends to focus on DVFS. In this context, practical issues

such as processor utilization, heat dissipation, and job size variability are primary considerations

[39], [47], [48], [56], [61], while optimality is not. Other concerns arise regarding granularity of

control, the set of discrete speeds available, non-linear power consumption effects, and unknown

job characteristics [54]–[56], [60], [61]. Practical energy saving techniques include threshold-based control (Section 4.2.4.3), race-to-idle [4], and power-gating [23], [31].

In the theory community, speed scaling typically assumes a continuous and unbounded range of

available speeds, with the choice of speed determined either by job deadlines [65] or system

occupancy [6], [9]. Albers et al. have done extensive work on energy-efficient algorithms [3],

[5]. Some of this work optimizes the trade-off between energy consumption and mean response

time. Several studies on this metric suggest that energy-proportional speed scaling is near

optimal [6], [9]. An alternative approach has focused on minimizing the response time in

systems, given a fixed energy budget [11], [12].


Andrew, Lin, and Wierman formally consider the trade-offs between response time, fairness, and

energy consumption [6]. Their paper identifies algorithms that can optimize up to two of these

metrics, but not all three. For example, SRPT (Shortest Remaining Processing Time) is optimal

for response time [50], but can be unfair, while PS (Processor Sharing) is always fair, but

suboptimal for both response time and energy [6], [21]. Decoupled speed scaling divorces speed

selection from system occupancy [21]. This violates the definition of “natural” speed scaling in

[6], since it can require speed changes at arbitrary points in job execution, even if occupancy

remains the same. While decoupled speed scaling provides an elegant theoretical model, it has

not been implemented nor evaluated in a practical system [53]. The primary motivation for

creating Profilo was to provide this empirical data on a real system, the results of which are

explored in Chapter 6.

2.3 Speed Scaling Process Schedulers

Most operating system kernels today, including the Linux kernel, employ variations on the

round-robin scheduling algorithm [14]. This scheduler has its origins well established in the early

days of operating systems as a vast improvement to non-preemptive multiprogramming. In the

theoretical community, this algorithm is idealized as processor sharing (PS) [34]. It is often cited

in the literature because it sets the benchmark for fairness [8], [28] and is free of problems like

process starvation. Even in speed scaling systems, as long as the chosen speed is a function of

the number of jobs in the run queue, PS remains the criterion for fairness [6]. That is the reason

this early algorithm persists in both the theoretical and systems community.


One of the design goals with speed scaling process schedulers is to run the system just fast

enough to complete all of the work in a timely fashion, but no faster. By doing so, the energy

consumption for completing the workload is as low as possible. One of the earliest systems

papers on speed scaling was by Weiser et al. [61]. In their work, they considered a diverse mix of

processes in a Unix system, and attempted to determine the energy savings if the jobs were

executed using different system speeds. A subset of the same authors later contributed to one of

the seminal theoretical papers on speed scaling [65]. Their paper proposed an optimal offline

algorithm for speed scaling (now known as YDS, based on the names of the authors), with the

objective of minimizing power consumption.

The YDS algorithm works by calculating the minimum speed required to finish each job by a

certain deadline (the job departure times under PS can be used), based on each job getting

uninterrupted execution to completion in earliest deadline first (EDF) order. Execution speeds

are calculated by considering the set of jobs with the most stringent deadlines, which defines the

work intensity. By remaining at such a speed, all jobs will finish at or before the deadline. The

algorithm then recursively calculates the minimum speed required for the remaining jobs. Yao et

al. also proposed a heuristic online algorithm for the same problem. Both algorithms are

deadline-based, and require knowledge of job sizes and deadlines.
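
The quantity driving YDS can be stated compactly. In the standard formulation of the algorithm (summarized here, not reproduced from the thesis), each job $j$ has release time $r_j$, deadline $d_j$, and work $w_j$, and the intensity of a time interval $I$ is

$g(I) = \dfrac{\sum_{j:\,[r_j,d_j] \subseteq I} w_j}{|I|}$

YDS repeatedly selects the interval $I^*$ of maximum intensity, runs the jobs contained in it at speed $g(I^*)$ in EDF order, removes those jobs and that interval from the instance, and recurses on the remainder.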

SRPT also requires knowledge of job size, but uses it to optimize mean response time. Under

certain job size distributions, SRPT can improve every job over PS [10], while consuming less

energy [21]. However, SRPT can be unfair to the large jobs [63]. There are several scheduling

policies that try to maintain this response time advantage while considering fairness. The Fair


Sojourn Protocol (FSP) [25] scheduling policy incorporates qualities of PS while honouring parts

of SRPT. It conceptually runs a “virtual PS” queue to determine the order that jobs would

complete under PS scheduling, and then devotes full processing resources to each of the jobs in

that order. FSP dominates PS in the sense that no job finishes later with FSP than in PS, while

mean response time is improved. As a result, the property of fairness is maintained.

The trouble with FSP is that it is ill-defined for job-count-based speed scaling, unless the speed

scaling policy is decoupled, as described by Elahi et al [21]. This concept of decoupled speed

scaling involves separating the scheduling policy that determines what the processor works on

from the speed scaling policy that determines the speed at which the processor does work.

Therefore, a scheduling policy that optimizes mean response time can be used in conjunction

with a speed scaling policy that optimizes power efficiency. Furthermore, under this decoupled

regime, a policy like FSP is well defined, and retains its property of dominating PS. This mixing

and matching of policies allows for numerous interesting combinations and may indeed make it

possible to simultaneously attain fairness, robustness, and near optimality [21].

2.4 Summary

This chapter began with the motivation and importance of energy efficient practices, particularly

given the global impact of the ICT sector. This is one of several metrics used to evaluate speed

scaling scheduling policies. A number of policies were then described. A few of notable interest

are: the YDS policy, which minimizes power consumption; the PS policy, which epitomises

fairness; and the FSP-PS decoupled system (i.e., FSP as scheduler and virtual speed changes


under PS used for speed scaling policy), which is provably efficient and has simulation results

that suggest a noteworthy performance advantage over PS [21]. These three policies are

evaluated on a real system (described in Section 5.1) with results in Chapter 6.


Chapter 3: Problem Formulation

From the review of background research in the previous chapter, it is clear that the speed scaling

models used by the theoretical community are oversimplified, with several limiting assumptions.

The best way to determine the effectiveness of scheduling and/or speed scaling algorithms is to

test them on a real system. Building a custom kernel is not practical every time a new algorithm

is to be evaluated. Furthermore, some algorithms require known job sizes or other information

that is not possible on real systems. However, knowing how an algorithm performs on a real

system is helpful for determining potential savings, and perhaps could lead to a practical

implementation on a real operating system.

To bridge the gap between theory and practice, a profiling tool named Profilo was created that

allows different scheduling and speed scaling algorithms to be run on real systems. This chapter

describes the technical requirements for such a profiler, and the features of modern processors

that facilitate this endeavour.

3.1 Profiling Schedulers on Real Systems

As outlined in Section 2.1, the energy impact of the ICT ecosystem is significant and global. It is

therefore fitting for energy consumption/efficiency to be an important metric in the evaluation of

speed scaling scheduling policies. Traditionally, this would have been difficult to accomplish

because it would require an expensive, complicated, pre-calibrated power meter. Such a meter

would need to export data, which in turn need processing and analysis to isolate the scheduling

policy activity from the rest of the system’s energy consumption. However, most CPUs and


GPUs manufactured in the last few years support accurate energy readings that isolate their

specific domain. Of particular interest in this thesis is the Intel Running Average Power Limit

(RAPL) Model Specific Registers (MSRs) supported on Sandy Bridge and newer Intel x86

microarchitectures [69, Vol. 3B Ch. 14.9.1]. This is currently the only energy interface supported

by Profilo, although there is no technical reason preventing other vendors from being supported.

All consumer and server grade operating systems have standard timing libraries with a resolution

that is similar to a time quantum (the maximum amount of time a process can occupy the

processor without preemption). It is therefore important to have access to a much higher

resolution timing system to properly evaluate speed scaling schedulers. Fortunately, most

modern processors contain high-resolution hardware timers.

On the Linux operating system, high-resolution timers are supported through a secondary timing

subsystem that is described in Section 3.2. With regards to the RAPL interface (described in

Section 3.3), Linux does not directly support any type of energy profiling hardware; however,

most distributions ship with an MSR kernel module that gives any process with root privileges a

relatively straightforward mechanism to access any MSR. It is for this reason, as well as those

discussed in Section 4.1.2, that Linux is the chosen operating system for building Profilo.
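
As an illustration of that mechanism, the following minimal sketch (not Profilo's actual code) reads a 64-bit MSR from user mode through the msr module's per-CPU device file; the address 0x611 is the documented MSR_PKG_ENERGY_STATUS register:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Read one 64-bit MSR on CPU 0 via the msr kernel module; the file
     * offset passed to pread() selects the MSR address. Requires root. */
    static int read_msr(uint32_t addr, uint64_t *value)
    {
        int fd = open("/dev/cpu/0/msr", O_RDONLY);
        if (fd < 0)
            return -1;
        ssize_t n = pread(fd, value, sizeof(*value), addr);
        close(fd);
        return n == (ssize_t)sizeof(*value) ? 0 : -1;
    }

    int main(void)
    {
        uint64_t raw;
        if (read_msr(0x611, &raw) == 0)   /* MSR_PKG_ENERGY_STATUS */
            printf("PKG energy counter: %u\n", (uint32_t)raw);
        return 0;
    }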

Incidentally, despite the flexibility, ingenuity, and carefully designed features such as file-based

user/kernel mode boundaries, the Linux kernel implements a variation of PS as its scheduler and

a relatively simple threshold-based speed scaler (described in Section 4.2.4.3). This is not unique

to Linux; most contemporary operating system kernels are similarly implemented. Therefore, a


final criterion for a good profiler is its ability to control the CPU directly, by circumventing its

hosting kernel. This design is described in detail throughout Chapter 4.

3.2 High-Resolution Hardware Timers

All operating systems require kernel timers for preemption, drivers, timeouts, and user mode

services. The Linux kernel has two separate timer subsystems. The standard timer framework is a

low resolution solution based on units of jiffies, which on most platforms is 10 milliseconds [68].

Its primary structure is a timer wheel, which is made up of 5 buckets that represent

logarithmically sized blocks of time in the future. Timers are moved into smaller blocks until

they are expired. Each bucket contains the timers in a linked list. Therefore, timer insertions are

O(1) complexity, but they cascade to expiration in O(n) time, where n is the number of timers

[32]. Since the initial and predominant usage of this framework is driver time-outs that then

generate errors, the cascades are often cut short because the timers are removed before they

expire.

The second timer framework is the high-resolution timers subsystem, called hrtimers. It is better

suited to precise measurement than countdown timers, though the latter is supported through a

similar API to the standard timer framework or through external kernel functions (e.g.,

usleep_range). Unlike the standard framework, hrtimers does not use a timer wheel, but rather a time-ordered red-black tree [13], [27] (a type of self-balancing binary search tree), implemented as a kernel library. The timers are ordered at insertion to minimize


processing at activation, since these timers tend to not be cancelled, unlike the general case for

the standard timers.

As the name suggests, hrtimers is based on a much higher resolution unit of one nanosecond,

with a typically high level of accuracy and no aliasing, since the common hardware clock sources on modern systems often have resolutions 3 or 4 times finer than a nanosecond (i.e., in the low

hundreds of picoseconds).

The hrtimers subsystem is used by several kernel features, such as the lockup detectors, which

make use of the hrtimers interrupt and callback functions (discussed in Section 4.2.2). The

subsystem is also exposed to user mode applications through the clock_gettime function that

Profilo uses to time processes as accurately as possible (see Section 4.2.4.4).

Modern systems have dozens of clocks and even several high-resolution clocks with various

levels of precision and reading costs. The differences span more than an order of magnitude.

This is why this subsystem has a clock manager, which selects the most suitable clock, and

supports user intervention to force a clock source. A thorough discussion of system clocks is

featured in Section 4.2.3.
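
A minimal user-mode example of the interface described above, timing a region of code; the choice of CLOCK_MONOTONIC_RAW here is illustrative, and Section 4.2.4.4 covers the clock Profilo actually uses:

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        struct timespec start, end;

        clock_gettime(CLOCK_MONOTONIC_RAW, &start);
        /* ... region under measurement ... */
        clock_gettime(CLOCK_MONOTONIC_RAW, &end);

        long long ns = (end.tv_sec - start.tv_sec) * 1000000000LL
                     + (end.tv_nsec - start.tv_nsec);
        printf("elapsed: %lld ns\n", ns);
        return 0;
    }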

3.3 Running Average Power Limit (RAPL)

When Intel introduced the 80386 processor in 1986, it included two test registers that were not

guaranteed to be included in future versions of the x86 instruction set architecture (ISA). These


registers were called model-specific registers (MSRs) or machine-specific registers. The

subsequent 80486 processor, introduced in 1989, included a generalized rdmsr instruction to read

an MSR and a wrmsr instruction to write to an MSR. It also included a CPUID instruction

(discussed in Section 4.2.4.2) to help determine the features present on a particular processor.

These instructions have persisted to present day microarchitectures. Some of the MSRs have

been inherited from previous generations without modification, and have become architectural

MSRs. Modern processors have hundreds of MSRs, with an entire chapter of the Intel 64 and IA-32 Architectures Software Developer’s Manual [69, Vol. 3C, Ch. 35] dedicated to their

addressing, scope, and brief description.

The Intel Running Average Power Limit (RAPL) interface is available on Sandy Bridge (2011)

and newer Intel microarchitectures. It is comprised of several non-architectural MSRs that allow

policies to be set for several domains that in turn manage the thermal output and energy

consumption of the processor and memory [20], via throttling, to keep it within defined

envelope(s). For this approach to work, accurate power meters are required. This is done through

a digital event-based pre-calibrated power model that uses performance counters and I/O to

increment energy counters [69, Sec. 14.9.1]. Fortunately, Intel exposes the raw values of the

power meters along with conversion values to transform the counters to SI based units. The

values produced by the power meter have independently been found to match the actual power

measurements [29], [49].

There are four domains in the RAPL interface; however only three of these domains are

available on any given CPU [69, Vol. 3B Ch. 14.9.3]. The CPUs targeting the server market have


Package (PKG), Power Plane 0 (PP0), and DRAM domains. The CPUs targeting the

client/consumer market have PKG, PP0, and PP1 domains. For compatibility with both product

categories, Profilo only makes use of PP0 (the processor cores) and the PKG (the entire CPU)

domains.

The PP0 domain energy counter is available through an MSR called

MSR_PP0_ENERGY_STATUS, while the PKG domain energy counter is available through an

MSR called MSR_PKG_ENERGY_STATUS. The values to convert the RAPL counters to SI

based units are contained in the MSR called MSR_RAPL_POWER_UNIT. Figure 3.1 shows the

structure of these MSRs, with MSR_PP0_ENERGY_STATUS at the top (sourced from [69, Vol. 3B, Fig. 14–34]) and MSR_RAPL_POWER_UNIT at the bottom (sourced from [69, Vol. 3B,

Fig. 14–32]). The format of the MSR_PKG_ENERGY_STATUS register is the same as

MSR_PP0_ENERGY_STATUS.


Figure 3.1 Structural Illustration of RAPL MSRs

Profilo only makes use of the “energy status units” portion of MSR_RAPL_POWER_UNIT, which is an unsigned integer using bits 8 to 12. This value is used with the formula

$\text{energy unit} = \dfrac{1}{2^{\text{energy status units}}}$

to create a unit in joules. The current default in binary is 10000 (i.e., 16), which means the energy status unit is in $\frac{1}{2^{16}} \approx 15.3$ microjoule increments. It is important to note the reserved bits in these MSRs. In particular, half of the MSR_PP0_ENERGY_STATUS and MSR_PKG_ENERGY_STATUS MSRs are reserved for future use, leaving only 32 bits for the PP0/PKG energy counters. Taking into account the value of the current energy status unit, this means that the energy counters wrap around every $(2^{32}-1) \cdot \frac{1}{2^{16}} \approx 65536$ joules. On the

testing platform, with a measured peak power consumption of 72 watts for the package (see


Section 5.3.4), the energy counter would wrap approximately every 15 minutes. Intel

recommends tallying the results, taking into account integer overflow, at least every 60 seconds

[69, Vol. 3B Ch. 14.9.3].
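
Decoding the unit field described above takes only a couple of operations. The fragment below is a sketch of the arithmetic, with the field position (bits 8 to 12) taken from the text:

    #include <stdint.h>

    /* Convert the "energy status units" field (bits 8..12) of
     * MSR_RAPL_POWER_UNIT into the joule value of one counter tick.
     * The default field value of 16 yields 1/2^16 J (~15.3 uJ). */
    static double rapl_joules_per_tick(uint64_t power_unit_msr)
    {
        unsigned esu = (unsigned)((power_unit_msr >> 8) & 0x1f);
        return 1.0 / (double)(1u << esu);
    }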

Another important detail regarding the energy counters is their update frequency. Updates occur

approximately every millisecond most of the time, but the jitter can be as high as 20 milliseconds [29], which is why Profilo runs should be sufficiently long. There is also a system management

interrupt (SMI) that puts these processors into System Management Mode (SMM). This mode is

used to handle power management functions, system hardware control, and run OEM-designed

system firmware. As a result, SMM is outside of the control of general-purpose systems

software, including operating system kernels, even when interrupts are disabled. According to

analysis performed by [29] on their Sandy Bridge based machine, the switch to SMM is periodic

and occurs every 16 milliseconds. This can cause update delays to the RAPL counters in excess

of 120 milliseconds. In their work, they mitigate these issues by delaying their benchmark until

the first RAPL counter updates immediately after the processor exits SMM. When the

benchmark is complete, they note the time, and then measure the delay until the next RAPL

counter update, making estimates to subtract the energy associated with this delay.

In the interest of simplifying the implementation, Profilo does not take into consideration the

SMM. If profile runs are much longer than 16 milliseconds, this should not impose too much

inaccuracy. The Profilo kernel module exposes two sysfs files, called sleep_busy and sleep_deep

(see Section 4.2.1), which wait for both PKG and PP0 energy counters to update before storing

the energy and high-resolution timer values. Before terminating, they once again wait for both of


the energy counters to change before calculating the final duration and energy consumption. This

is why both sysfs files have a minimum duration of roughly 1 millisecond. The user mode

portion of Profilo is even simpler and only waits for the energy counters to change before

profiling. To avoid introducing more variance to the duration and overestimating the energy

usage, when profiling is complete, it calculates the energy consumption from the last reading of

the MSRs, without waiting for them to change.

It is important to emphasize that when waiting for the energy counters to change, both of them

need to be checked. The reason for this is that they update asynchronously (roughly 0.4

microseconds apart on the test platform described in Section 5.1), with no guarantees on which

counter updates first.

3.3.1 Using the RAPL MSRs within Profilo

The RAPL implementation details are handled in user mode by a function call to

startRAPLmetric. This function has one argument: a pointer to the custom cpuData structure

described in Section 4.2.4.5, which essentially stores the user mode application’s global context

for the energy counters, and an open file descriptor to the MSR kernel module sysfs file (see

Section 4.2.4.2). The function initializes two temporary integers that are equal to the bottom 32-

bits (using a mask implemented with the bitwise and operator) of the

MSR_PP0_ENERGY_STATUS and MSR_PKG_ENERGY_STATUS MSRs. It then runs a loop

function that continuously reinitializes the cpuData copies of the MSRs until they both differ


from the temporary variables. Each loop iteration takes roughly 10 nanoseconds and when both

RAPL counters have changed, the function returns.
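
A sketch of this synchronization loop appears below. Its structure mirrors the description above, but the helper read_msr32 and the variable names are illustrative, not Profilo's actual identifiers:

    #include <stdint.h>

    #define MSR_PKG_ENERGY_STATUS 0x611
    #define MSR_PP0_ENERGY_STATUS 0x639

    /* Assumed helper: returns the low 32 bits of an MSR, e.g. read
     * through the msr module's device file as sketched earlier. */
    uint32_t read_msr32(uint32_t addr);

    /* Block until BOTH RAPL energy counters have ticked over, so that
     * profiling starts on a counter-update boundary. The counters
     * update asynchronously, so each must be checked independently. */
    static void wait_for_rapl_update(uint32_t *pp0, uint32_t *pkg)
    {
        uint32_t pp0_old = read_msr32(MSR_PP0_ENERGY_STATUS);
        uint32_t pkg_old = read_msr32(MSR_PKG_ENERGY_STATUS);

        do {
            *pp0 = read_msr32(MSR_PP0_ENERGY_STATUS);
            *pkg = read_msr32(MSR_PKG_ENERGY_STATUS);
        } while (*pp0 == pp0_old || *pkg == pkg_old);
    }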

The cpuData structure also contains the “energy status units” portion of the power unit MSR

(MSR_RAPL_POWER_UNIT), from which a calculation is used to convert the energy counters

to joules. It also has two unsigned 64-bit integers called pp0LastRead and pkgLastRead that are

copies of the MSR_PP0_ENERGY_STATUS and MSR_PKG_ENERGY_STATUS MSRs,

respectively. Finally, it contains two double precision floating point variables called

pp0Consumed and pkgConsumed.

To guard against integer overflow, there is a function called tallyRAPLmetric that takes a pointer

to cpuData. It first reads from the core and package MSRs, using the open file descriptor to the

MSR module. It then calculates the difference (taking into account integer overflow) since the

last read of the MSRs, using 64-bit integers. Finally, it multiplies this difference by the pre-calculated energy unit and adds it to the double-precision floating point variables, which contain

the number of joules consumed since the startRAPLmetric function returned. It concludes by

updating pp0LastRead and pkgLastRead. The implementation of Profilo follows Intel’s

conservative RAPL overflow advice, and tries to run this tally function every minute through a

mechanism described in Section 4.2.4.5. The function is always run at the end of profiling to

calculate the total number of joules consumed.
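
The overflow handling reduces to modular arithmetic on the 32-bit counters. The sketch below captures the idea; the names parallel those in the description (pp0LastRead, pp0Consumed), but the code is an illustration rather than the Profilo source:

    #include <stdint.h>

    struct cpu_data {
        double   joules_per_tick; /* derived from MSR_RAPL_POWER_UNIT  */
        uint32_t pp0_last_read;   /* low 32 bits of previous reading   */
        double   pp0_consumed;    /* accumulated joules                */
    };

    /* Fold the current PP0 counter into the running total. Because the
     * hardware counter is 32 bits wide, taking the difference modulo
     * 2^32 (i.e., unsigned subtraction) absorbs a single wraparound. */
    static void tally_pp0(struct cpu_data *cpu, uint32_t pp0_now)
    {
        uint32_t delta = pp0_now - cpu->pp0_last_read;
        cpu->pp0_consumed += (double)delta * cpu->joules_per_tick;
        cpu->pp0_last_read = pp0_now;
    }

Calling such a function at least once per wrap interval (per Intel's conservative advice, every 60 seconds) guarantees that no wraparound is missed.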

The startRAPLmetric and tallyRAPLmetric functions are part of the user mode application and

rely on the MSR kernel module. As mentioned in the previous section, the Profilo kernel module


exposes two sysfs files, called sleep_busy and sleep_deep (details in Section 4.2.1), which both

rely on hrtimers and the RAPL MSRs, just like the user mode application. The hrtimers

subsystem is easily accessible through variables like ktime_t and functions like usleep_range,

both of which are used by the Profilo kernel module. However, there is no formal way of

accessing MSRs, much less the RAPL MSRs. Therefore, to accomplish this in kernel mode,

inline assembly language code had to be written (see Appendix A).

The idea behind the assembly code is fairly straightforward. First, the address of the desired

MSR is loaded into the ECX general purpose register (GPR). Second, the rdmsr instruction is

executed, which dumps the lower 32-bits of the MSR into the general purpose EAX register, and

the upper 32-bits of the MSR into the general purpose EDX register. The upper 32 bits of the 64-bit GPRs (RAX and RDX) are cleared by the rdmsr instruction. This may seem like an odd approach, since the GPRs are 64-bit as well, but this ensures that rdmsr remains compatible with older 32-bit processors with 32-bit GPRs and some 64-bit MSRs. The code and its line-by-line explanation are available in Appendix A.
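
The fragment below is merely a sketch of the same idea in GCC extended inline assembly, using register constraints that match the description above; Profilo's actual code is the version in Appendix A:

    #include <stdint.h>

    /* Read a 64-bit MSR from kernel mode. rdmsr takes the MSR address
     * in ECX and returns the low half in EAX and the high half in EDX;
     * on 64-bit processors the upper 32 bits of RAX/RDX are cleared. */
    static inline uint64_t rdmsr64(uint32_t addr)
    {
        uint32_t lo, hi;
        __asm__ volatile ("rdmsr"
                          : "=a" (lo), "=d" (hi) /* outputs: EAX, EDX */
                          : "c" (addr));         /* input:   ECX      */
        return ((uint64_t)hi << 32) | lo;
    }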

3.4 Summary

The objective of this chapter was to explain the motivation for profiling speed scaling schedulers

on real systems, along with describing the technical requirements for achieving this goal. The

high-resolution timing subsystem and Intel RAPL interface were explained conceptually and

with some implementation points. The following chapter presents the design and implementation

details that were required to create Profilo.


Chapter 4: Design and Implementation

This chapter describes the design and implementation for Profilo, which is the application that is

used to control the micro-benchmarking experiments and collect energy profiles. This chapter

begins with the design choices to accurately measure timing and power consumption. Finally, it

describes the implementation details of Profilo and its accompanying utilities.

4.1 Design Choices

Because Profilo was designed to model certain datacenter workloads, it could not simply be a

single process application executing synthetic workloads, even with multiple threads. Hardware

context switches are a necessary component of the measurement, both for execution time and for

energy consumption. Therefore Profilo had to create a multi-process environment.

4.1.1 User Mode

The initial version of Profilo was written as a multi-process user mode application to measure

hardware context switches. The relative simplicity of the initial version allowed the application

to be created quickly, and the different speed scaling algorithms and processor schedulers to be

tested to see if there were repeatable and statistically significant differences in timing and energy

consumption.

One of the challenges with a fully user mode application, however, is that the kernel can

pre-empt processes at any time due to higher priority processes and/or interrupts. Furthermore,


regular processes are confined to occupying the CPU for at most a maximum duration,

commonly called a time slice, quantum, or jiffy, which on most platforms is 10 milliseconds

[68]. This is problematic since the Fair Sojourn Protocol (FSP) scheduling policy requires

uninterrupted execution for the entire lifetime of the process [25].

4.1.2 Kernel Mode

To address the aforementioned issues, kernel mode code is necessary. An open source operating

system is therefore the best starting point. Naturally, Linux (Ubuntu 14.04 LTS) was chosen

because it has the largest community and hardware support, including an x86 Model Specific

Register (MSR) module described in section 4.2.4.2.

Kernel mode in Linux can be accessed in two ways:

1. Modifying the kernel: This approach requires hijacking the kernel to perform the

profiling operations on synthetic workloads. The modified kernel would still need to be

able to interact with the user to take instructions and follow the desired

scheduler/workload as well as to log the timing and energy consumption information.

This would be complex and have the added disadvantage of requiring a rebuild and

system reboot every time the kernel is modified. Portability would also become a serious

hurdle and community patches would no longer be compatible.

2. Kernel module: In most cases, this is the preferred approach since it is portable across

different distributions and kernel versions. Modules are also loadable and unloadable,

without the need to recompile the entire kernel or reboot the system, which makes


development significantly faster and safer. Finally, kernel mode code can be exposed via

an API to user mode by using the sysfs virtual file system. However, contemporary Linux

kernels provide multicore, pre-emptive scheduling, so “interrupt disable” code is required

to effectively hijack the CPU and avoid unwanted context switches.

4.1.3 Kernel/User Mode Hybrid

The final version of Profilo was implemented as a hybrid with a kernel module that takes care of

performing uninterrupted work, busy waiting, and sleeping, while performing high-resolution

timing, and energy profiling. It exposes these features through sysfs files, which a user mode

application interacts with to perform all the other features such as reading the tasks, setting CPU

affinity, collecting data, processing, and output. Retaining as much as possible in user mode

makes programming and debugging substantially easier, due in part to the myriad of user mode

libraries and virtualization of resources.

4.1.4 Workload

The power profiles of a sleeping, busy-waiting, and working processor are quantifiably different,

which is shown in Chapter 5. Therefore, to model compute-intensive datacenter jobs, a simple,

controllable workload has to be associated with each process. Profilo uses the trial division

primality test algorithm for this purpose. The rationale for this choice is as follows:


1. It is CPU-bound and fully contained within the processor package (e.g., core, cache, etc.).

This feature means that the Running Average Power Limit (RAPL) counters can be used

to profile the energy consumption [69, Vol. 3B Ch. 14.9.2].

2. It is easily implemented in kernel mode without the need for complicated mathematical

operations and/or floating point units.

3. It utilizes superscalar and pipelined integer architecture, while reasonably disrupting

branch prediction [58].

4. It is easily parameterized to generate jobs that range from microseconds to hours.

The trial division primality test is the simplest of primality algorithms, with an exponential

running time, when measured in terms of the size of the input in bits. A future improvement to

Profilo is to use an optimized trial division algorithm, which prevents the divisor (in the

modulo operation) from exceeding the square root of the candidate prime number, similar to the

sieve of Eratosthenes, bearing in mind that the latter still has a better theoretical complexity [41].

In addition to improving the runtime (which is not really a goal), the optimized trial division

algorithm would better utilize the arithmetic logic unit (ALU). The code and its explanation are

available in Appendix B.
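
As a minimal sketch of the unoptimized algorithm (the actual kernel implementation is in Appendix B), a "work unit" that finds the first n primes might look like:

    /* Find the first n consecutive primes, starting from two, using
     * unoptimized trial division; integer-only, so it suits kernel mode. */
    static unsigned long findPrimes(unsigned long n)
    {
        unsigned long candidate = 2, found = 0, divisor;

        while (found < n) {
            int isPrime = 1;
            for (divisor = 2; divisor < candidate; divisor++) {
                if (candidate % divisor == 0) {
                    isPrime = 0;
                    break;
                }
            }
            if (isPrime)
                found++;
            candidate++;
        }
        return candidate - 1;   /* the n-th prime */
    }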

Prime number sieves, such as the ancient sieve of Eratosthenes, the more recent sieve of

Sundaram [2], the faster sieve of Atkin [7], and the wheel sieves [44], [45], do not make good

benchmark algorithms because they tend to be memory-bound. Memory-bound processes tend to

under-utilize the CPU, and are not easily measured with the RAPL energy counters.


There are faster, general, and deterministic primality test algorithms, such as the AKS algorithm

[1], which was the first to have a provable (unconditional) polynomial runtime. A further

improvement is primality testing with Gaussian periods [35]; however, these implementations

are much more complex, especially in kernel mode.

Unfortunately, this trade-off for simplicity means that many components of these new

microarchitectures are never used, such as the floating point unit (FPU), the single instruction,

multiple data (SIMD) [24] extensions, Advanced Encryption Standard New Instructions

(AES-NI), and virtualization extensions. Nevertheless, this is arguably satisfactory to model simple

compute-intensive workloads, such as unencrypted Web server based datacenters.

Having said that, the workload code can be replaced with another algorithm without too much

effort. The trace file that Profilo takes as input (details in section 4.2.4.1) has a column that

specifies the amount of work the process needs to do. That work is expressed as an integral

multiple of a basic “work unit”. Having 2 or more units of work simply means reiterating the

workload associated with a “work unit” that many times. This linearizes work even if the

workload algorithm itself is non-linear, such as the class of primality test algorithms. In other

words, a process that has 50 units of work (fifty iterations of the workload associated with the

“work unit”) has exactly twice the work of a process with 25 units of work, which itself has five

times the work of a process with 5 units of work, and so on.

With the trial division workload, the “work unit” is the number of consecutive primes to find,

starting from two. The “work unit” integer is another argument to Profilo. Figure 4.1 shows the


execution time for various definitions of “work unit”. Different traces should only be compared

if they share the same workload and “work unit”.

Figure 4.1 Trial Division Primality Test Workload

[Figure 4.1 plots execution duration in seconds (y-axis, 0 to 1.4) against the number of prime calculations per "work unit" (x-axis, 0 to 10,000).]


4.1.5 Units of Work

A subtle but significant detail about Profilo is that processes are modeled on units of work, not

units of time. For example, a line in the trace file may state that process “Small” instructs the

CPU to find the first 200 primes (a “work unit” for this run of Profilo), 80 times, at speed 1

(which may translate to 1.2 GHz). This is in contrast to telling the CPU to calculate primes for 70

milliseconds at speed 1. Units of work deliver more consistent, repeatable results because the

kernel module can keep control of the processor, performing the work required without invoking

context switches to set, check, or be interrupted by a timer, and without repeatedly reading a clock and calculating

durations. To a lesser but measurable degree, clocks, timers, and counters also experience

inconsistencies based on their state (i.e., between ticks) and the state of the processor and its

management cycles [15].

Profilo requires a “work unit” argument to define what a single work unit is in the trace. By

definition, a “work unit” is the minimum unit of work, which roughly corresponds to a unit of

time at a given frequency. Naturally, with a fixed “work unit” value, the unit of time for each

discrete speed on a given processor will be inversely proportional to the frequency. Therefore if

one wishes to approximate a time slice for each discrete speed, for instance for a PS trace, a

certain number of loops, unique to each speed and proportional to the frequency, needs to be

determined. Section 5.4 explores the empirical values for loop count at each discrete speed with

relation to various values of “work unit”.

Another argument to Profilo related to work is “primes per second”. Since the trial division

primality algorithm has an exponential running time, this argument is based upon the “work


unit”. More precisely, “primes per second” is the reciprocal of the duration (in seconds) to

perform a “work unit” at the slowest discrete speed, multiplied by “work unit”. Internally, this

value is used to estimate one minute of work at the lowest frequency to conservatively tally the

RAPL MSRs and prevent data loss due to integer overflow discussed in Chapter 3. The reason

this is provided as an argument, and not calculated at runtime, is to maintain consistency across

multiple runs of Profilo and allow different schedulers to be compared.

Amongst the alternative runtimes for Profilo is a benchmark mode that allows one to determine

appropriate “work unit” and “primes per second” arguments, along with the respective loop

values for each discrete speed for any desired time slice, on the host processor.

4.2 Implementation

Profilo is composed of two parts: the kernel module, and the user mode application, which itself

has a few runtime modes. There is also an accompanying idler utility that makes use of

components in the Profilo kernel module that the user mode application does not use. Future

work for Profilo could include integrating the idler utility into the user mode Profilo application.

Figure 4.2 provides a conceptual overview of Profilo and its components.


Figure 4.2 Conceptual Overview of Profilo

4.2.1 Kernel Module

The kernel module makes use of the sysfs virtual file system provided by the Linux kernel [14, p.

527]. This virtual file system is intended to expose kernel subsystems and device drivers, and to

provide two-way communication between kernel functionality and user mode. Although sysfs was

introduced to the Linux kernel over a decade ago, some kernel modules still incorrectly use procfs,

which is now intended only for providing information and configuration for processes and related

system information [38, p. 360]. This dichotomy is exacerbated by the lack of documentation on

sysfs, often requiring a developer to peruse kernel source code.

Nonetheless, there are a number of clever macros and well-designed functions that inspire

elegant sysfs code. It is common and sometimes required to use a number of macros to properly

build a kernel module. A few examples are the license, author, description, and supported

devices macros that describe a module when probed by a user using modinfo. There are also

init and exit macros used to tell the kernel which functions to call when loading and unloading the

module, respectively.

[Figure 4.2 diagram: the user mode application processes the arguments and trace file, sets up the environment, profiles, and summarizes, communicating through the sysfs files work_unit, do_work, sleep_busy, and sleep_deep with the kernel module; the idler utility also uses the module's sysfs interface.]


In addition to those functions, every sysfs virtual file can have a show and/or store function. A

show function is invoked whenever a virtual file is read; a store function is invoked whenever a

virtual file is written. The Profilo kernel module has four sysfs files:

• work_unit: as the name suggests, this file defines what a unit of work is -- specifically the

number of consecutive primes to find, using the trial division primality algorithm

described in Section 4.1.4. Writing an integer value to this file sets the number of primes,

while reading it returns its current value.

• do_work: this file runs the actual workload. To use it, simply write the number of

units/loops of work_unit to the file. Reading from the file only displays the kernel module

name and version.

• sleep_busy: writing an integer value to this file busy-loops the processor without context

switches until that duration, in microseconds, has elapsed. Subsequently reading the file

reveals the formerly written value as well as the actual amount of time (in microseconds)

that the processor busy-waited (including the time it took for the RAPL MSRs to

change), and the difference in the PP0/PKG RAPL MSRs.

• sleep_deep: writing an integer value, interpreted in microseconds, to this file sleeps the

processor for that duration. Similar to sleep_busy, subsequently reading the file reveals

the formerly written value as well as the actual amount of time (in microseconds) that the

processor slept, plus any busy-wait period waiting for the RAPL counters to increment.

Concatenated to this output is also the difference in the PP0/PKG RAPL MSRs.
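
To make the interface concrete, a minimal user mode interaction with one of these files might look like the following sketch; the sysfs path shown is illustrative, since the exact location depends on where the module registers its kernel object:

    /* Illustrative sketch: busy-wait for 1000 microseconds, then read back
     * the requested time, the actual time, and the PP0/PKG RAPL deltas. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[256];
        int fd = open("/sys/kernel/profilo/sleep_busy", O_RDWR); /* hypothetical path */
        write(fd, "1000", 4);                           /* invokes the store function */
        ssize_t n = pread(fd, buf, sizeof(buf) - 1, 0); /* invokes the show function */
        if (n > 0) {
            buf[n] = '\0';
            printf("%s", buf);
        }
        close(fd);
        return 0;
    }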

The change in the PP0/PKG RAPL MSR values coming from the Profilo kernel module must be

multiplied by the processor-defined energy unit, which is calculated from the


MSR_RAPL_POWER_UNIT MSR (done in user mode where floating point operations are easy),

to establish the energy consumed by those specific domains. Section 3.3 elaborates on the RAPL

MSRs, with some kernel code details in Section 3.3.1.

To simplify the code for handling the high-resolution timers, the Profilo kernel module enforces

a directive making it only compatible with the 64-bit Linux kernel. Both sleep_busy and

sleep_deep make use of high-resolution ktime_t, which is a 64-bit integer where each unit is a

nanosecond. In addition, sleep_deep makes use of the kernel function usleep_range, which is

also built on top of hrtimers, the subsystem for high-resolution kernel timers (see Section 3.2).

The accompanying show functions for those sysfs files output their respective time values (in

microseconds) with three decimal places to retain the nanosecond precision.
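
A sketch of the timing pattern inside sleep_deep, using the kernel primitives named above, is (variable names are illustrative):

    /* Fragment from a hypothetical store handler: time a deep sleep with
     * ktime_t and the hrtimer-backed usleep_range. */
    ktime_t start = ktime_get();
    usleep_range(usecs, usecs);      /* sleep for the requested microseconds */
    s64 elapsedNs = ktime_to_ns(ktime_sub(ktime_get(), start));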

The state of the kernel module can be stored in either a kernel object structure or in static global

variables. For simple kernel modules like this one, static global variables make the code base

smaller and easier to read/understand. However, larger modules, or modules with independent

sysfs files, would benefit from creating custom structures for each sysfs file with their respective

variables using kernel macros (e.g., container_of) for ad hoc polymorphism.

When the kernel module is loaded, the init function allocates, initializes, and adds a kernel object

structure, which contains an array of attribute structures and a sysfs operation structure. Each

attribute structure contains the name of the file and its permissions, which are all set to 0666,

corresponding to the Unix permission for read/write access for user/group/other. The sysfs

operation structure allows the definition of a single show function and a single store function.


At first glance, this may seem problematic, since each file’s read and/or write behaviour

has to be different. To resolve this issue, the signature for a show/store function has several

arguments: a pointer to a kernel object structure, a pointer to an attribute structure, and a

character pointer (string). Intuitively, the store function signature differs only in that the

character pointer is a constant and there is an additional string length argument.

By using the pointer to the attribute structure, the virtual file that is being read from or written to

can be determined. A clever container_of kernel macro takes a pointer, type, and member as

arguments, and subtracts the memory offset between the type and its member from the pointer,

returning a pointer to the parent structure. By encapsulating every attribute in a parent structure

that includes a function pointer to a custom show/store function for that particular virtual file,

one can use the container_of macro to get the parent structure and then call the respective

show/store function. This resembles the Delegation Design Pattern [37] and reduces the default

show/store function to an elegant two lines of code without branches.
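
A sketch of this delegation pattern, with illustrative structure and function names, is:

    #include <linux/kobject.h>
    #include <linux/sysfs.h>

    /* Each sysfs file's attribute is embedded in a parent structure that
     * carries a per-file show function; the default show only delegates. */
    struct profiloAttribute {
        struct attribute attr;
        ssize_t (*show)(struct kobject *kobj, char *buf);
    };

    static ssize_t defaultShow(struct kobject *kobj, struct attribute *attr,
                               char *buf)
    {
        struct profiloAttribute *pa =
            container_of(attr, struct profiloAttribute, attr);
        return pa->show(kobj, buf);
    }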

The last thing that the init function does is initialize a global spinlock. Since the Linux kernel is

pre-emptive, even single core systems behave like a symmetric multiprocessing (SMP) system

with regards to concurrency [38, p. 201]. The spinlock is used to enter a critical section where

the interrupt request (IRQ) state is saved and interrupts are then disabled [38, p. 130]. The process work is

done without interruption on that core, then the IRQ state is restored and the spinlock is

unlocked.
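
The critical section can be sketched with the standard kernel primitives as follows (the function name is illustrative):

    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(profiloLock);

    static void runUninterrupted(unsigned long loops)
    {
        unsigned long flags;

        spin_lock_irqsave(&profiloLock, flags);      /* save state, disable IRQs */
        /* ... perform the requested units of work without interruption ... */
        spin_unlock_irqrestore(&profiloLock, flags); /* restore IRQ state */
    }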


4.2.2 Lockup Detectors

There is one more precondition for uninterrupted process execution: the hard lockup detector,

which is implemented with a non-maskable interrupt (NMI), needs to be disabled. The Linux

kernel has both soft and hard lockup detection. A soft lockup is a situation where the scheduler is

unable to give processes a chance to run within a certain period of time. A hard lockup is similar

with the addition that interrupts are not able to run. Both base their timeouts on the

procfs-configurable value /proc/sys/kernel/watchdog_thresh, which is typically 10 (seconds).

The soft lockup detector consists of a watchdog task, implemented as a high priority kernel thread

that updates a timestamp every time it runs, and an hrtimers interrupt with a period of

2*watchdog_thresh/5. If that timestamp is older than 2*watchdog_thresh seconds, the soft lockup

detector inside the hrtimers callback function dumps debug information (registers, call trace,

etc.) to the system log as a kernel warning. On many Linux distributions, the console log level

(exposed at /proc/sys/kernel/printk) is set to print kernel warnings and higher priority messages.

This results in I/O time that adds 10-15 milliseconds.

The hard lockup detector is implemented inside the callback function for an NMI perf event that

has a period of watchdog_thresh. If the soft lockup detector’s hrtimers interrupt callback

function has not been run since the last NMI perf event, bearing in mind that it should have run 2

or 3 times in this duration, it dumps debug information (registers, call trace, etc.) to the system

log as a kernel notice. The kernel documentation [75] states that this should be a kernel warning,

but on Ubuntu 14.04 it is a kernel notice, which has a lower priority than a kernel warning and


therefore is not output to the console, causing no I/O. As a result, dumping to the kernel message

buffer only takes about 0.2 milliseconds.

The watchdog_thresh, soft lockup detector timeout value (rcu_cpu_stall_timeout), hrtimer, and

NMI perf event periods are all configurable. There are also configuration settings to make the

kernel panic when there is a soft and/or hard lockup detected. Finally, the soft lockup detection

and hard lockup detection can be respectively suppressed and disabled, which is done by the user

mode implementation of Profilo, described in section 4.2.4.3.

4.2.3 Clocks

The Linux kernel has a simple 64-bit counter called jiffies that is incremented based on the

inverse of the compile-time value of HZ, which is 100 on the x86 instruction set architecture

(ISA) [68]. It is implemented across all architectures using interrupts and is used in various

kernel functions and kernel modules [19, Ch. 7.1], especially legacy code.

Contemporary x86 systems, such as Intel's Sandy Bridge and later microarchitectures, have

many clocks, including, but not limited to, the High Precision Event Timer (HPET), the

Advanced Configuration and Power Interface Power Management Timer (ACPI PM), the

Programmable Interval Timer (PIT), and the Real-Time Clock (RTC). A comprehensive list of

clock sources can be found in Chapter 6 (Timing Measurements) of Understanding the Linux

Kernel [14].


Each processor core also has a 64-bit register called the Time Stamp Counter (TSC), which was

first introduced in the Pentium processor [69, Vol. 3B Ch. 17.13]. The TSC counts the number of

clock cycles, starting at zero when the core goes online. With SpeedStep, Intel’s implementation

of dynamic voltage and frequency scaling (DVFS), the TSC could no longer be used to

accurately time events until recent microarchitectures, when Intel changed the counter to an

invariant version that ticks at the processor’s highest rate, regardless of ACPI P/C/T state [67].

This once again makes the TSC a good clock source, not only because of its high precision, but

also because of the very low overhead (i.e., not having to access main memory or a platform

resource).

The Linux kernel evaluates most system clock sources based on their overhead, precision, and

accuracy. This sophisticated clock source manager goes through detection, calibration, and

verification phases, at boot time and when restoring to an ACPI G0/S0 state (i.e., waking up

from sleep, hibernation, etc.). This culminates in a prioritized list of each clock source that can

be seen in /sys/devices/system/clocksource/clocksource0/available_clocksource. To satisfy the

hrtimers subsystem’s need for high-resolution timers, this list is usually limited to the TSC,

HPET, and ACPI PM, in that order, on Intel Sandy Bridge and newer x86 microarchitectures.

The differences between even these clocks can be very large. For instance, in the Realtime

Reference Guide from Red Hat, a benchmark evaluating the cost of 10 million reads from the

respective clock sources resulted in the TSC taking 0.6 seconds, the HPET taking 12.3 seconds,

and the ACPI PM taking 24.5 seconds [16].


Unfortunately, when interrupts are disabled long enough to cause a large drift between the jiffies

counter (which would no longer be advancing) and the TSC, the kernel wrongly assumes the

TSC is unstable and switches the clock source to the less precise and heavier HPET. It therefore

becomes important for the Profilo user mode application to check and retain the current clock

source before running, and then to reinstate that clock source, if necessary, before exiting, as

described in section 4.2.4.3.

4.2.4 User Mode Application

The Profilo user mode application is a command line tool that begins by reading, translating, and

sanity-checking the shell arguments and trace file. It then modifies the operating system

environment and profiles the CPU using its accompanying kernel module and an efficient

representation of the trace file (see Appendix C for an example). At the end, it returns the

modified operating system parameters back to their former settings and then tallies and

summarizes the results. This section provides the details of how this is done.

4.2.4.1 Processing the Arguments

The application supports both short option (single dash followed by a letter), and long option

(double-dash followed by a word) argument formats. In addition to the self-explanatory

“version” and “help” arguments, there is a “processor” argument that displays information about

the processor, including if it is compatible with Profilo, and if so, the resolution and units of the

RAPL counters.


If an input (trace) filename of jobs to execute is given, then the user is expected to either add a

“check” argument or provide the “work unit” and “primes per second” arguments discussed in

section 4.1.5. As the name implies, the “check” argument instructs Profilo to verify that the trace

file is suitably formatted and semantically correct, without performing any additional work. The

trace file is a comma-separated-values (CSV) file with a header row and three columns:

1. Process Name: This is a case-sensitive and whitespace-sensitive name that becomes the

unique identifier for a process.

2. Work: This is a positive integer with the amount of contiguous “work units” that the

process is to perform at this instant in the trace.

3. Speed: This is a positive integer that is mapped to a processor frequency to perform the

work at this instant in the trace.

Even when the “check” argument is not present, Profilo silently performs the checks. If there are

problems, it outputs the error(s) to the console and terminates. If there are no issues and the

“check” argument is present, a summary is displayed with the number of unique processes, total

units of work, the number of speeds in the trace (including the minimum and maximum value),

the number of discrete frequencies mapped from the speeds (including the minimum and

maximum value), and the number of rows (sometimes called tasks), which corresponds to the

number of preemptions.

If the “check” argument is not present, Profilo creates a single instance of a custom

kernelVariables object (C struct), which contains a number of dynamically allocated arrays,


structures, and variables, and thus has an associated destructor function. Functions pass this

object around, which circumvents the problems with global variables [64], and allows the user

mode application to be composed of reusable components for the alternative runtimes (i.e.,

benchmark, processor information, etc.) and the utilities.

When Profilo is building the kernelVariables object, it starts by building an array called

schedule, made up of scheduleTask structures containing three unsigned integers: procID, work,

and speed. The schedule array is an optimized copy of the trace file and therefore has the same

number of elements as the trace file has lines, less the header. The procID is a monotonically

increasing unique integer identifier, starting at zero, for each process in the trace file. The work

variable holds the same value as in the trace file. Finally, the speed variable is an index into the

speedLookup array, elaborated on below.

Similar to speed, the procID also functions as the index to an array called processes, which is

made up of process structures. These process structures have the name corresponding to the

name of the process in the trace file, the workLeft (to determine when the process leaves the

system), and two struct timespec variables that store the startTime and endTime for when the

process first gets CPU time, and when it finished all of its work.

The speeds from the trace file are essentially mapped to a frequency by the following formula:

minFreq + (maxFreq - minFreq) * (speed - minSpeed) / (maxSpeed - minSpeed)

, where minFreq is the processor’s

slowest frequency, maxFreq is the fastest frequency, minSpeed is the smallest speed value in the


entire trace file, and maxSpeed is the largest speed value in the entire trace file. This value may

not be an integer, so it is stored as a floating point number that is then rounded to the closest

available discrete frequency for the host processor. If the calculated frequency falls exactly

halfway between two available discrete frequencies, it is rounded up.

frequencies for the first logical processor (cpu0) is available in the following sysfs file:

/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies.
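
In code, the mapping and rounding steps might be sketched as follows (variable names are illustrative, and error handling is omitted):

    #include <math.h>

    /* Linearly interpolate the trace speed onto the frequency range. */
    double target = minFreq + (maxFreq - minFreq) *
                    ((double)(speed - minSpeed) / (maxSpeed - minSpeed));

    /* Round to the nearest available discrete frequency, rounding up on ties. */
    double best = freqs[0];
    for (size_t i = 1; i < numFreqs; i++) {
        double d = fabs(freqs[i] - target), dBest = fabs(best - target);
        if (d < dBest || (d == dBest && freqs[i] > best))
            best = freqs[i];
    }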

Setting the frequency of the first logical processor is also done through a sysfs file

(/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed), which means it must be printed as a

string to be handled by its store function. Converting an integer to a string on every context

switch is an unnecessary overhead and storing a string value for each process creates a very large

memory footprint that is made worse by cache misses. Therefore, to prevent both of these

situations, a temporary linked list is used.

The temporary linked list is made up of node structures called speedStruct. In this structure are

two unsigned integers called uiSpeed and uiMapped. The uiSpeed variable is the same value

from the trace file and uiMapped is initially assigned to zero. The speedStruct is treated as a

mathematical set, so before any insertions occur, the list must be checked to ensure that uiSpeed

is unique. The list is insertion sorted. When the trace file has been completely parsed, the next

step is to map from uiSpeed to uiMapped using the mapping formula given previously. This

mapping function is many-to-one; so when it is complete, the number of unique uiMapped

values should be known, and be less than or equal to the number of discrete frequencies on the

processor.


If the “check” argument is present, the number of unique speeds (the size of the speedStruct

linked list), along with the minimum and maximum values are printed in the summary. The

summary section about the number of discrete frequencies mapped from the speeds, with the

minimum and maximum values, comes from the unique uiMapped values.

If the “check” argument is not present, the number of unique uiMapped values is used to create

the speedLookup array of strings, which are the string versions of all the uiMapped values. The

uiMapped values are then changed to be the index of speedLookup, where the string versions of

the discrete processor frequencies are found. Finally, the speed variable in each of the

scheduleTask structures is changed from the uiSpeed value to the associated uiMapped value.

The result is that during profiling, if the new speed is different from the previous speed, then a

single line of code, which writes the speed element of the cached speedLookup array to the

scaling_setspeed file, changes the processor frequency.

The processes array is similarly created, while the trace file is being read, by first using a

temporary linked list made up of node structures called processList. These nodes contain the

same procID, name, and workLeft variables in the process structures (the elements of the

processes array). As the scheduleTask structures of the schedule array are being built up, the

process names that are being read from the trace file are crosschecked against the processList

linked list to determine the value of procID. If a process name does not exist in the linked list, it

is added to the head of the list, for maximum insertion efficiency (constant time). Typically, this

also makes searching for the process quicker, since the most recently added processes are at the


start of the list. When the trace file is complete, the processes array is allocated based on the

number of elements in the linked list. The process elements are then initialized using the

variables in the linked list, from the end of the array to the start, so that the procID values in the

linked list correspond to the index of the process elements in the processes array.

In summary, the speedStruct and processList linked lists are temporarily created, while reading

the trace file, to either generate the printed summary (when the “check” argument is present) or

to create the respective speedLookup and processes arrays. The linked lists are then properly

deallocated. With the arguments processed and the contents of the trace file loaded and ready for

profiling, the next step is to check the CPU for compatibility.

4.2.4.2 CPU Compatibility

As discussed in Section 3.3, Profilo uses some of the Intel RAPL MSRs to capture the energy

profiles. These are present in the Intel Sandy Bridge and newer microarchitectures. An easy and

universally compatible way of detecting the CPU vendor and microarchitecture on Linux is to

read /proc/cpuinfo. This procfs file uses the CPUID instruction on contemporary x86

microarchitectures to get the processor’s identity and features [69, Vol. 1 Ch. 17.1]. Unfortunately, there is no

RAPL feature flag on Sandy Bridge or any of the subsequent microarchitectures (Ivy Bridge,

Haswell, and Broadwell); however, Advanced Vector Extensions (AVX), which was introduced

with Sandy Bridge, has a CPUID feature flag [69, Vol. 1 Ch. 5.13].


Consequently, Profilo’s CPU compatibility function starts by first ensuring that cpuinfo has a

vendor_id of GenuineIntel, and then looking for the presence of avx in the flags section. If

Profilo is run with the “processor” argument, this function also prints out the CPU model and

frequency range on the console. In addition, it uses the low-level open function to create a direct,

unbuffered I/O file descriptor to the MSR file, supplied by the MSR module.

The MSR module is a kernel module that provides an interface to x86 processors through the

virtual file /dev/cpu/0/msr. The file offset corresponds to an MSR address. Reading/writing to the

file is interpreted in 8-byte (64-bit) chunks. Larger chunks correspond to multiple read/writes to

the same MSR, unless, of course, one opens the MSR file with a file abstraction layer that

automatically advances the file offset. Reading/writing to the file typically requires elevated

(root) privileges. When Profilo opens the file, it is implicitly checking for root privileges, which

is also required for changing some of the environment settings in section 4.2.4.3. Profilo uses the

low-level pread function with the MSR file because, in addition to working with the low-level

open function’s file descriptor, it takes the file offset and the number of bytes to read as

arguments.
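
A sketch of this access pattern, using the RAPL MSR addresses documented in the Intel manuals, is:

    /* Read MSR_PKG_ENERGY_STATUS (address 0x611 in the Intel manuals)
     * through the MSR module; the MSR address doubles as the file offset. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <unistd.h>

    #define MSR_PKG_ENERGY_STATUS 0x611

    static uint64_t readPkgEnergy(int MSRfd)
    {
        uint64_t raw = 0;
        pread(MSRfd, &raw, sizeof(raw), MSR_PKG_ENERGY_STATUS);
        return raw & 0xFFFFFFFFULL;   /* mask the reserved upper bits */
    }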

At this point, if the CPU compatibility function has been instructed to report (when Profilo has

been run with the “processor” argument), the MSR containing the RAPL energy, power, and

time units will be read, interpreted, and then printed in a summary about the CPU, which

includes the model and frequency range of the processor. If the CPU compatibility function has

not been instructed to report, the file descriptor associated with the MSR module is closed and

the function ends, indicating that the processor is compatible. Just like with argument checking,


every step is checked for errors, and if present, prints helpful messages, indicating where and

why a test did not pass, followed by Profilo gracefully exiting.

4.2.4.3 Environment Setup

With the RAPL energy unit read, interpreted, and stored within the kernelVariables object (see

section 4.2.4.1), Profilo moves on to setting up the environment for profiling. Ultimately, it is the

user’s responsibility to set up a testing environment with as few kernel modules/drivers, services,

and other interfering tasks as possible. However, some measures can be taken by the application

to minimize interference.

Linux, like most operating systems, supports process priorities. These priorities are numbers,

ranging from -20 to +19, called “nice levels”, “nice values”, or “niceness”. This nomenclature is

inherited from Unix. Its etymology comes from the idea that a process with a higher value of

“niceness” is nicer to other processes. Therefore, from a scheduling perspective, a nice value

of -20 is the highest priority, while +19 is the lowest priority. The default priority for processes is

a nice value of zero. On most distributions, setting a nice value of less than zero requires

elevated privileges. Profilo programmatically invokes the setpriority system call to set its nice

value to -20. Recall that at this point in its execution, Profilo will have already implicitly verified

that it has elevated privileges, by collecting the energy unit for the RAPL MSRs.
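
This step amounts to a single system call, sketched as:

    /* Raise the process priority to the maximum; a "who" of zero means
     * the calling process. Requires elevated privileges. */
    #include <sys/resource.h>

    setpriority(PRIO_PROCESS, 0, -20);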

In addition to managing priorities, the kernel also needs to continuously load balance processes

in SMP environments. Each logical processor has its own scheduler, complete with its own


runqueue. The load balancer ensures that no runqueue has more than 25% more processes than

any other runqueue. When this threshold is exceeded, the load balancer tries to move non-running,

least cache-hot processes using spin locks and special migration threads. While this

balances computing resources and maximizes parallelism, it can also result in longer context

switches and cache-misses.

The load balancer tries to impose some process affinity, based on the topology of the logical

cores and their association to one another. For instance, Intel Hyper-Threading Technology (their

proprietary implementation of simultaneous multi-threading) is treated as two highly coupled

logical processors, while cores on the same package have a greater affinity to one another than

cores from different packages/processors. Sometimes it is advantageous to lock processes to

particular logical processors for performance reasons, which can sometimes be quite substantial

[57], or due to licensing restrictions. Linux supports this with its cpus_allowed bit mask in each

process’s task_struct structure, which by default is all ones (can run on all available logical

processors). Both the load balancer and scheduler honour this bit mask.

The initial version of Profilo made use of processor affinities, setting the bit mask for all its

processes to run on the first logical core. This changed in the final version of Profilo. As

discussed in section 4.2.1, the Profilo kernel module disables interrupts on the logical processor

it is running on, before performing its workload. With the contemporary SMP Linux kernel,

these system calls no longer apply to any of the other logical processors. This is particularly

problematic since even the finest granularity for the RAPL counters is the domain of all of the


cores (PP0). So to capture accurate power profiles, Profilo needs to be the only active task on all

of the cores.

To address these issues, Profilo instead disables all but the first logical processor. The kernel

takes care of all the migrations (processes, interrupts, etc.) and then puts the deactivated cores

into a C6/C7 state (see section 5.3.3.2) by executing the MWAIT(C7) instruction; this causes the

cores to save their state, flush their level 1 and 2 (L1/L2) cache into level 3 (L3) cache, which is

shared amongst all the cores, and then reduce the voltage of the core to zero volts. On the only

active core, the load balancer is implicitly disabled, since it is no longer necessary when the

system is relegated to being a (logically) uniprocessor system. Disabling the logical processors is

easily performed by writing zero to the sysfs file /sys/devices/system/cpu/cpun/online where n is

greater than zero. Before this is done, however, Profilo first reads and retains the online status of

each processor, so that it can restore the environment back to its previous state. In the

architectures compatible with Profilo, there will typically be 8 logical processors, due to

Hyper-Threading on a quad-core processor. However, this number can range from 4 up to 288 logical

processors, on an 8-socket motherboard, each featuring an 18-core Haswell-EX processor.
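
A sketch of the retain-then-offline step, with illustrative variable names and error handling omitted, is:

    #include <stdio.h>

    /* Save each secondary logical CPU's online status, then take it
     * offline by writing "0"; the saved values are restored on exit. */
    for (int n = 1; n < numLogicalCPUs; n++) {
        char path[80];
        snprintf(path, sizeof(path), "/sys/devices/system/cpu/cpu%d/online", n);
        FILE *f = fopen(path, "r");
        if (f) { wasOnline[n] = (fgetc(f) == '1'); fclose(f); }
        f = fopen(path, "w");
        if (f) { fputs("0", f); fclose(f); }
    }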

Next, Profilo changes the current governor to assert control of the processor’s frequency. The

Linux kernel implements processor frequency scaling through infrastructure called cpufreq. This

infrastructure contains a generic governor interface that allows software-defined speed scaling

policies to be implemented when the processor is busy. The Linux kernel uses a separate cpuidle

subsystem when there is no work left for the processor (see section 5.3.3). Most x86 based Linux


distributions have five governors that rely on sysfs files located in

/sys/devices/system/cpu/cpun/cpufreq/ (where n is an integer associated with a logical processor):

• performance: This governor sets the processor to the frequency defined by the sysfs file

scaling_max_freq, which is usually the highest frequency, unless a user with elevated

privileges changes the value. This can be beneficial if the workload alternates between

prolonged CPU-bound work and similarly protracted idle durations, keeping the

processor either fully utilized or allowing it to remain in its halted state (ACPI C1 or

greater). Another usage for this governor is if the processor is being performance

benchmarked.

• powersave: This governor sets the processor to the frequency defined by the sysfs file

scaling_min_freq, which is usually the lowest frequency, unless a user with elevated

privileges changes the value. This can be helpful if there is a power or heat limitation on

the system infrastructure. However, if the processor is tasked with a heavy workload, this

governor may consume more energy than even the performance governor because the

latter can finish faster and then go into an energy saving ACPI C1 or greater state. This

can further result in cascading power savings across other components, thereby reducing

the idle waste power. Section 5.3.3 elaborates on this perhaps less obvious theme.

• ondemand: Unlike the previous two static governors, this governor dynamically changes

the processor’s frequency based on system load. When the system load is above the

up_threshold (typically 80 percent), it sets the processor frequency to scaling_max_freq.

When the system load drops below down_threshold (typically 20 percent), it decreases

the processor frequency to the next lower frequency in scaling_available_frequencies.

Because there is a cost to evaluating and switching the frequency, there is a sysfs


configurable sample rate (sampling_rate in microseconds) that is loosely based on the

CPU’s transition latency (cpuinfo_transition_latency in nanoseconds). This sampling rate

for determining system load and adjusting the frequency is typically 10ms on

contemporary x86 ISAs.

• conservative: This governor is very similar to the ondemand governor, but instead of

jumping to scaling_max_freq when up_threshold is exceeded, it moves up the discrete

frequencies in scaling_available_frequencies at each sampling_rate. So while ondemand

is biased towards high performance, conservative treats frequency increases the same as

frequency decreases.

• userspace: This governor changes the frequency of the processor based on input from

scaling_setspeed. This gives full control to user mode processes/users (with elevated

privileges) to set the frequency.

The list of governors in a particular system is available in scaling_available_governors. The

default governor is ondemand. To change the governor, the name of a valid governor must be

written to scaling_governor. Profilo retains the name of the current governor from

scaling_governor, and then writes userspace to that sysfs file. At this point, the kernel stops all

frequency scaling, giving Profilo full control to set the frequency, which it does during profiling,

when it writes the speed element of the cached speedLookup array to the scaling_setspeed file, as

described in section 4.2.4.1.

Similarly, the status of the lockup detectors (section 4.2.2) and selected clock source (section

4.2.3) need to be retained. Specifically, the value of the sysfs file located at


/sys/module/rcupdate/parameters/rcu_cpu_stall_suppress is stored in a variable. Afterwards,

writing zero to the file, if that is not already its value, suppresses the soft lockup detection. The

exact same thing is done for retaining and disabling the hard lockup detection using the file

/proc/sys/kernel/nmi_watchdog. Finally, the current clock source is retained by reading the

contents of the sysfs file /sys/devices/system/clocksource/clocksource0/current_clocksource into

a string buffer.
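
Each of these settings follows the same retain-then-write pattern; a sketch for the hard lockup detector, with an illustrative variable name, is:

    #include <stdio.h>

    /* Retain the current nmi_watchdog value, then disable the detector;
     * savedNMI is written back to the same file when Profilo exits. */
    char savedNMI[8] = "";
    FILE *f = fopen("/proc/sys/kernel/nmi_watchdog", "r");
    if (f) { fgets(savedNMI, sizeof(savedNMI), f); fclose(f); }
    f = fopen("/proc/sys/kernel/nmi_watchdog", "w");
    if (f) { fputs("0", f); fclose(f); }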

4.2.4.4 Preparations for Profiling

With the operating system configured for Profilo, there are only a few things left to do before

profiling takes place. First, the Profilo kernel module sysfs file, work_unit, described in section

4.2.1, needs to be set by writing the respective value that was provided as an argument to Profilo.

This is actually handled by the function in section 4.2.4.1 that processes the arguments. Once this

is done, the Profilo kernel module sysfs file, do_work, can be used. Just like with the MSR

module, a write-enabled low-level open system call is invoked, which creates a direct,

unbuffered I/O file descriptor to the virtual file. In the case of the MSR module, the file

descriptor is stored in a variable called MSRfd in the cpuData structure (described in section

4.2.4.5), which is inside the custom kernelVariables object. The file descriptor for do_work is

simply a local variable.

With the governor set to userspace (section 4.2.4.3), the same open system call is used to create

a file descriptor to the sysfs file /sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed, which is


used to set the first logical processor’s frequency. This interface is used in conjunction with the

speedLookup array from section 4.2.4.1.

Because profiling can sometimes take a long time, there needs to be a way for the user to

prematurely abort the process. A common and efficient way of accomplishing this goal is

through signals, another communications channel between kernel mode and user mode. A user

can interrupt a process with an interrupt character (usually Ctrl+C), which the kernel sends to

listening processes in the form of a SIGINT signal. A process can listen for a signal by giving the

kernel a handler function, using the signal system call. Most kernels, by default, kill processes

that do not register a handler for the SIGINT signal. This is particularly undesirable in Profilo’s case, since it

shuts down all but the first logical processor and then takes control of that processor’s frequency

scaling by changing the CPU governor, as well as changing the functionality of the lockup

detectors, and possibly the clock source. So to provide some summary information on what has

been profiled up to the interruption, and to return the operating system back to its former state,

Profilo delegates a handler for the SIGINT signal. This handler simply sets the ops variable

within the custom kernelVariables object to zero, signalling a section within the RAPL overflow

prevention code to terminate Profilo gracefully. The details are in the next section (4.2.4.5).

Since the Profilo kernel module, do_work, disables interrupts, if an interrupt character is

generated during this time, the signal handler will not be invoked until the sysfs file has

completed its work.
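
The signal wiring can be sketched as follows; the handler only flags termination (in Profilo, by zeroing the ops variable), and the actual cleanup happens later, inside the RAPL overflow prevention code:

    #include <signal.h>

    /* Illustrative sketch: kv is assumed to point at the kernelVariables
     * object described in section 4.2.4.1. */
    static void onSigint(int sig)
    {
        kv->ops = 0;   /* section 4.2.4.5 detects this and exits gracefully */
    }

    /* during setup: */
    signal(SIGINT, onSigint);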

Next, Profilo creates an array called isRunning that contains the same number of elements as the

processes array, which is the number of unique processes from the trace file. Recall, from section


4.2.4.1, that each unique process has its own process ID, which corresponds to the index of the

processes array. Therefore, the index locations in isRunning can be treated the same way. During

allocation, each element in isRunning is initialized to zero (using the calloc function) to indicate

that no processes are running. A value of one indicates that the process is running.

At this juncture, profiling can begin, so the last thing to do is to read the energy and timing

counters. The startRAPLmetric function, from Section 3.3.1, is invoked. Recall that this reads the

core (MSR_PP0_ENERGY_STATUS) and package (MSR_PKG_ENERGY_STATUS) energy

MSRs into a temporary variable, and then continuously rereads the MSRs into a custom cpuData

structure (supplied as an argument to the function), until both values have changed. This is an

unpredictable amount of time that typically takes up to a millisecond, although it can sometimes

take much longer, as discussed in that section. This is why the energy MSR is read before the

clock. When it returns, the clock_gettime function is invoked, using the raw high-resolution

clock (CLOCK_MONOTONIC_RAW), which on architectures compatible with Profilo is the

TSC clocksource from section 4.2.3, with nanosecond resolution. The result is stored in the first

of two locally scoped struct timespec variables. The second variable stores the end time for the

profiling.

4.2.4.5 Profiling

The entire profiling section of the application is contained within a for-loop that systematically

traverses the schedule array of scheduleTask structures, described in section 4.2.4.1. Therefore, the number of loop

iterations is equal to the number of lines in the trace file (minus the header). Each run begins by


checking the isRunning array to see if the current process is already running. If the process is not

running (equal to zero), the startTime structure for the current process is initialized with the same

clock_gettime function from section 4.2.4.4, and then the isRunning element is set to one.

Next, the value of the speed element for the current scheduleTask is used as an index value to the

speedLookup array, which contains the string value for the desired processor frequency. This

string is written to the already open file descriptor for the scaling_setspeed file. With the

processor at the intended frequency, the work value (also from scheduleTask) is written to the

also previously opened file descriptor for the do_work file. At this point, a context switch occurs

and the work is done in kernel mode without interruptions. When the write operation returns, the

work has been done. The workLeft value, located at the current process’ index value in the

processes array, is decremented by the work value. If workLeft is zero, the endTime structure for

the current process is initialized, exactly the same way it was for startTime.

Before the loop draws to a conclusion, a swift assessment is made to guard against RAPL

overflow (see Section 3.3.1). As discussed in Section 4.1.5, one of the arguments to Profilo is

“primes per second”. This argument is multiplied by 60 (seconds) and then stored within the

custom kernelVariables object as a variable called ops. To guard against RAPL overflow, ops is

compared to a variable called calculations, which stores the sum of work (from scheduleTask)

sent to do_work. If calculations is larger than ops, then the tallyRAPLmetric function is invoked,

and calculations is reset to zero.
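
Putting these pieces together, the profiling loop can be condensed into the following sketch; variable names follow the text, but control flow and error handling are simplified:

    /* One iteration per line of the trace file (minus the header). */
    for (size_t i = 0; i < numTasks; i++) {
        struct scheduleTask *t = &schedule[i];

        if (!isRunning[t->procID]) {              /* first time on the CPU */
            clock_gettime(CLOCK_MONOTONIC_RAW,
                          &processes[t->procID].startTime);
            isRunning[t->procID] = 1;
        }
        write(setspeedFd, speedLookup[t->speed],
              strlen(speedLookup[t->speed]));     /* set the frequency */
        dprintf(doWorkFd, "%u", t->work);         /* work runs in kernel mode */

        processes[t->procID].workLeft -= t->work;
        if (processes[t->procID].workLeft == 0)   /* process leaves the system */
            clock_gettime(CLOCK_MONOTONIC_RAW,
                          &processes[t->procID].endTime);

        calculations += t->work;                  /* RAPL overflow guard */
        if (ops == 0)
            break;                                /* SIGINT: exit gracefully */
        if (calculations > ops) {
            tallyRAPLmetric(&cpu);                /* cpu is the cpuData structure */
            calculations = 0;
        }
    }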


The tallyRAPLmetric function takes the cpuData structure as the argument. As described in

section 4.2.4.4, the cpuData structure contains an open file descriptor to the MSR module, called

MSRfd. In addition, it contains the interpreted energy units (in joules) for the processor

(described in Section 3.3). It also has two unsigned 64-bit integers called pp0LastRead and

pkgLastRead that are copies (with the reserved portions masked to zero) of the

MSR_PP0_ENERGY_STATUS and MSR_PKG_ENERGY_STATUS MSRs, respectively. Finally,

it contains two double precision floating point variables called pp0Consumed and pkgConsumed.

The job of tallyRAPLmetric is to first read from the core and package MSRs, using the open file

descriptor to the MSR module. It then calculates the difference (taking into account integer

overflow) since the last read of the MSRs, using 64-bit integers. Finally, it multiplies this

difference by the interpreted energy unit and adds it to the double-precision floating point

variables, which contain the number of joules consumed since the startRAPLmetric function

returned. It concludes by updating pp0LastRead and pkgLastRead.
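
The following sketch captures this arithmetic, reusing the read_msr helper and cpuData structure from the earlier sketch, extended with the energy unit and the two accumulators described above (the field name energyUnit is illustrative):

    void tallyRAPLmetric(struct cpuData *cpu)
    {
        uint64_t pp0 = read_msr(cpu->MSRfd, 0x639);
        uint64_t pkg = read_msr(cpu->MSRfd, 0x611);

        /* Once the reserved bits are masked, the counters are 32 bits wide,
           so modular subtraction absorbs a single wraparound correctly. */
        uint64_t pp0Diff = (pp0 - cpu->pp0LastRead) & 0xFFFFFFFFULL;
        uint64_t pkgDiff = (pkg - cpu->pkgLastRead) & 0xFFFFFFFFULL;

        cpu->pp0Consumed += (double)pp0Diff * cpu->energyUnit; /* joules */
        cpu->pkgConsumed += (double)pkgDiff * cpu->energyUnit;

        cpu->pp0LastRead = pp0;
        cpu->pkgLastRead = pkg;
    }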

If Profilo is given properly matched “work unit” and “primes per second” arguments for the

processor at its slowest speed, the tallyRAPLmetric function should be invoked at most every 60

seconds. However, recall from section 4.2.4.4 that the SIGINT signal handler sets the ops

variable to zero. Reusing the overflow check in this way is a trivial optimization that avoids an extra if-statement comparison on every profiling loop iteration. Within the RAPL overflow prevention code, it

checks if ops is zero (ordinarily an invalid value) and only calls tallyRAPLmetric if this is not the

case. If ops is zero, Profilo terminates gracefully, which includes reversing all of the potential

environment changes from section 4.2.4.3.
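
Putting these pieces together, the core of the profiling loop can be sketched as follows. The control flow mirrors the description above, but the variable names and the textual encoding written to the kernel module files are simplifying assumptions:

    char buf[32];

    for (int i = 0; i < numTasks; i++) {
        struct task *t = &scheduleTask[i];

        if (!isRunning[t->process]) {        /* first time slice for process */
            clock_gettime(CLOCK_MONOTONIC_RAW,
                          &processes[t->process].startTime);
            isRunning[t->process] = 1;
        }

        /* Set the P-state, then perform the work in kernel mode; the
           write to do_work returns only once the work is complete. */
        write(setspeedFD, speedLookup[t->speed], strlen(speedLookup[t->speed]));
        int len = snprintf(buf, sizeof(buf), "%u", t->work);
        write(doWorkFD, buf, len);

        processes[t->process].workLeft -= t->work;
        if (processes[t->process].workLeft == 0)
            clock_gettime(CLOCK_MONOTONIC_RAW,
                          &processes[t->process].endTime);

        /* RAPL overflow guard: tally at most once per ~60 seconds of
           work; ops == 0 doubles as the SIGINT termination flag. */
        calculations += t->work;
        if (calculations > kernelVariables.ops) {
            if (kernelVariables.ops == 0)
                break;                       /* terminate gracefully */
            tallyRAPLmetric(&cpu);
            calculations = 0;
        }
    }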

4.2.4.6 Concluding Profiling

As soon as the profiling for-loop is concluded, the clock_gettime function is once again called

using the raw high-resolution clock (CLOCK_MONOTONIC_RAW) and the second locally

scoped struct timespec variable. Immediately afterwards, tallyRAPLmetric is invoked. Recall that

the RAPL MSRs could contain energy readings that are up to a millisecond old. Unlike

startRAPLmetric, the tallyRAPLmetric function does not wait for the RAPL MSRs to change, as

this would over-estimate the energy usage. A future improvement could be to wait for the MSRs

to change and then subtract the estimated power usage for the portion of the wait period;

however, this task is not straightforward (see Section 7.4) and only marginally improves

accuracy. It is because of the reduced accuracy and precision of the RAPL energy counters (in

comparison to the high-resolution clock), and because of the additional calculations in

tallyRAPLmetric, that the clock_gettime function is called first.

Profilo then undoes the possible changes to the environment, described in section 4.2.4.3. It

restores the settings for the soft and hard lockup detection. If the kernel changed the clocksource,

it is restored. The online status settings for the logical processors are returned to their prior states.

Lastly, the former cpufreq governor is reinstated.

Next, the data from profiling is summarized, which includes calculating and displaying the total

duration for the profiling. The processes array is traversed so that the durations can be further

analyzed for each process, including their relative start and end times. The summary also

includes the overall energy and power consumption for both the cores and the package (see

Appendix C for an example). Finally, all file descriptors are closed and dynamically allocated

structures are freed. Profilo then exits.

4.2.5 Idler Utility

While the Profilo application takes processor schedules in the form of a trace file and then

measures the execution time and energy profile, the Idler utility performs an energy benchmark

on the processor. More specifically, Idler measures the power profile of the processor in its

halted states and in a busy loop, at all of the CPU’s frequencies. To accomplish this, it uses the

sleep_deep and sleep_busy features of the Profilo kernel module, described in section 4.2.1.

Naturally, the implementation of Idler shares many similarities with Profilo. Idler only takes one

mandatory argument and one optional argument, so argument processing is far more

straightforward. As a result, there are no complex data structures or optimizations required

either. The mandatory argument is the number of microseconds to pass to sleep_deep and

sleep_busy. The optional argument is a verbosity flag that, when present, prints human-friendly

messages to the console. When the flag is absent, the output is comma-separated-values (CSV)

format, which can easily be imported into a spreadsheet application.

Idler makes the same environment changes that Profilo makes in section 4.2.4.3, likewise

retaining the values so that the changes can be reversed. Low-level open system calls are made to

create direct, unbuffered I/O file descriptors to the kernel module sleep files. Since both

sleep_busy and sleep_deep provide the high-resolution breakdown of execution time and the

change in the energy readings, a file descriptor to the MSR file is not needed. The MSR module

is only briefly used to gather the RAPL energy unit.
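
Gathering the energy unit amounts to a single MSR read. The sketch below assumes the architectural MSR_RAPL_POWER_UNIT register (0x606), in which bits 12:8 hold the Energy Status Units (ESU); one counter tick corresponds to 1/2^ESU joules (with the default ESU of 16, roughly 15.3 microjoules):

    double rapl_energy_unit(int msrFD)
    {
        uint64_t units = 0;
        pread(msrFD, &units, sizeof(units), 0x606);  /* MSR_RAPL_POWER_UNIT */
        return 1.0 / (double)(1ULL << ((units >> 8) & 0x1F));
    }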

Idler then goes into its primary for-loop, which has as many iterations as there are frequencies

for the first logical processor. This number comes from reading scaling_available_frequencies,

first described in section 4.2.4.1. Within each loop, the frequency is set and then sleep_busy is

written with the duration argument to Idler. The measured duration (using the high-resolution

timer) and the energy consumed are read back from sleep_busy, and then the power is calculated

(energy in joules divided by duration in seconds). The duration, energy, and power are stored in

double-precision floating point variables that are subsequently divided by a thousand and then

printed to the console in respective units of milliseconds, millijoules, and milliwatts, to be

consistent with the units of the duration argument. The same thing is done with sleep_deep,

except with the addition that reading from sleep_deep breaks down the measured duration into

the sleep and busy component. The busy component is time spent waiting for the RAPL counter

to change. This duration breakdown is conveyed to the console. At this stage, Idler returns the

environment back to the former values and then terminates.
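
A sketch of Idler's measurement loop for the busy-wait case is shown below. The format read back from sleep_busy is an assumption (duration in microseconds followed by energy in microjoules), as are the variable names; the unit conversions match the description above:

    char buf[64];

    for (int i = 0; i < numFrequencies; i++) {
        double duration_us, energy_uJ, power_uW;

        write(setspeedFD, frequencies[i], strlen(frequencies[i]));
        dprintf(busyFD, "%lu", microseconds);        /* start the busy wait */

        lseek(busyFD, 0, SEEK_SET);
        memset(buf, 0, sizeof(buf));
        read(busyFD, buf, sizeof(buf) - 1);          /* duration and energy */
        sscanf(buf, "%lf %lf", &duration_us, &energy_uJ);

        power_uW = energy_uJ / (duration_us / 1e6);  /* µJ per second = µW */
        printf("%s,%.3f,%.3f,%.3f\n", frequencies[i],
               duration_us / 1e3, energy_uJ / 1e3, power_uW / 1e3);
    }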

4.3 Summary

This chapter began by presenting an overview of the design choices made to create Profilo. The

implementation details for both the kernel module and user mode components were explained.

At times, this necessitated details of the processor architecture as well as kernel functions and implementation. The chapter then transitioned into the implementation details of the separate user-mode

application, Idler. The next chapter delves into the experimental evaluation of Idler and Profilo,

with the subsequent chapter using Profilo, exclusively, to assess and compare established and

theoretical schedulers and speed scaling algorithms.

Chapter 5: Micro-Benchmarking

This chapter uses the tools described in Chapter 4 to analyze the performance and energy

characteristics of a quad-core, simultaneous multithreading (Intel Hyper-Threading Technology)

Ivy Bridge CPU under several conditions and across numerous tests. Through this analysis, the

strengths and weaknesses of both the platform and tools are explored. This chapter therefore

provides the background and context by which to investigate the schedulers in Chapter 6.

5.1 Platform

Before describing the testing methodology and results, it is important to contextualize the

platform in terms of the system specifications and its total system power consumption. All of the

results from Chapters 5 and 6 were gathered from a mid-2012 Apple MacBook Pro Retina

equipped with a 2.3 GHz (base) quad-core Intel Core i7-3615QM Ivy Bridge processor with 32

KB of level-1 (L1) instruction cache per core, 32 KB of level-1 (L1) data cache per core, 256 KB

of level-2 (L2) cache per core, 6 MB of on-chip level-3 (L3) shared cache, and 8 GB of 1600

MHz DDR3 random access memory (RAM).

This processor has 12 controllable discrete frequencies that range from 1200 MHz to 2300 MHz

in 100 MHz increments. There is also one additional discrete frequency with a misleading value

of 2301 MHz in scaling_available_frequencies, the sysfs file first described in Section 4.2.4.1.

This discrete speed is a special value used to enable Intel’s Turbo Boost Technology. This

feature allows the processor to clock one or more of its cores above its top-rated frequency, if

there is some thermal design power (TDP) headroom and the power, current, and thermal limit

are within a specific range. In the case of this processor, the maximum turbo frequency is 3.3

GHz for a single active core, 3.2 GHz for two simultaneously active cores, and 3.1 GHz for three

or four simultaneously active cores. While all of the micro-benchmarking results in this chapter

include the “2301 MHz” turbo mode, that speed setting is avoided in Chapter 6 because it is not

directly controllable, even in Ring 0 of the hierarchical protection domains (protection rings).

As described in Section 3.3, there are a total of four Running Average Power Limit (RAPL)

domains; however, there are only three of these domains available on any given CPU [69, Vol.

3B Ch. 14.9.3]. The CPUs targeting the server market have Package (PKG), Power Plane 0

(PP0), and DRAM domains. The CPUs targeting the client/consumer market have PKG, PP0,

and PP1 domains. The Intel Core i7-3615QM CPU falls under the latter category. For

compatibility with both product categories, Profilo only makes use of PP0 (the processor cores)

and the PKG (the entire CPU) domains. While this captures most of the dynamics in CPU-bound

activities, the platform is composed of a lot more than just the CPU. Therefore, to understand the

CPU in the context of the entire system, a crude but effective consumer-grade power monitor

was used between the wall receptacle and the switched power supply for the entire system.

The specific power monitor used was the Kill A Watt EZ (Model 4460.01) by P3 International

[73]. It has a NEMA 5-15 plug and receptacle that accepts line-in AC voltages of 85-125 volts

RMS at up to 15 amps RMS. It is then capable of reading the outgoing voltage and current draw,

and therefore the active power in watts. In the range of 90-125 volts and 0.2-15 amps, the typical

active power accuracy is within 0.5%, with a 2% maximum when outside that range [71].

One of the limitations of the Kill A Watt is its slow (1 Hertz) refresh rate, and lack of automatic

data logging [33]. Some have taken to modifying the power monitor and have tapped into the

two sensor outputs of the quad op-amp LM2902N chip to add automatic data logging [22][43].

Although this is reasonably straightforward to do, it would have necessitated substantial

additional software development. Instead, measurement was kept considerably simpler: three

types of tests were devised to keep the CPU at a consistent performance level for several

seconds. The tests were scripted, and while they were running, both the laptop display and the

power monitor were filmed. The video was then played back, so that each of the power monitor’s

updates could be added to a row in a spreadsheet. The median values for each configuration in

every type of test were retained. These values had a high correlation to the RAPL PKG readings,

suggesting that the power meter was sufficiently accurate under these conditions.

5.2 ACPI Specification

The first of the three types of tests with the power monitor involved measuring the total system

power while the processor was in an idle state. On a modern Advanced Configuration and Power

Interface (ACPI) system, “idle” is actually a multitude of states. The specification defines four global states (G0 to G3) and six sleep states (S0 to S5) [67]. Larger integer values correspond to greater

power savings, usually at the cost of wake-up latency. The nil global state (G0) corresponds to

the nil sleep state (S0), which represents normal running operation of the system. The first power

saving global state (G1) encapsulates the first few power-saving sleep states (S1-S4). The next

global state (G2) corresponds to the last sleep state (S5), and is essentially a “soft off”, whereby

the system can be brought back to the G0/S0 state with a button, keyboard, mouse, etc. The final

global state (G3) state is simply mechanical off, such as a system that is unplugged and without a

battery.

The specification defines numerous other state types, including device states (D0-D3), CPU

states (C0-C3), and performance states (P0-P15). The zero integer values imply full performance

while larger integer values provide better power savings, generally at the cost of performance or

wakeup latency. Unfortunately, beyond the global states, the specification is poorly written and

lacks consistency [18]. The intention is likely to leave the details to hardware vendors; however,

the specification actually contradicts itself, resulting in unavoidable misinterpretations. For

instance, in revision 5.1 of the ACPI specification, the S1 state is described as a low

wake-latency sleeping state where “no system context is lost (CPU or chip set) and hardware

maintains all system context” [67, Sec. 2.4]. It goes on to read that the S2 state is “similar to the

S1 sleeping state except that the CPU and system cache context is lost”. However, in section

16.1.1, the S1 state is defined as being a state where “all system context is preserved with the

exception of CPU caches”. The example in the subsequent section of the specification then

describes asserting a standby signal to the phase-locked loop (PLL) to stop all system clocks,

except for the real-time clock (RTC), essentially disabling the entire CPU. Hardware

manufacturers implement this example by power-gating the processor [31]. This is a benign

example of the inconsistency in the specification. It is unsurprising that Linus Torvalds, the

initial creator of the Linux kernel, describes ACPI as “a complete design disaster in every way”

[51]. Perhaps this is the source of unreliable sleep implementations that plagued computers 15

years ago and continue to be a problem today, albeit to a lesser degree [30]. Therefore, my

discussion with respect to the ACPI standard is restricted to the testing platform in its typical

configuration, with supplementary information on the latest Intel microarchitecture (Broadwell),

when applicable.

5.3 Total System Power Consumption

This section quantifies the supported ACPI states and extensions from the perspective of the total

system power consumption, using the power monitor and RAPL MSRs. It begins with

measurements of the global sleep states. Those tests are used to normalize the measurements in

the subsequent subsections, which include an examination of the global working state with the

CPU in its various C-states, in a busy wait, and under load.

5.3.1 Global Sleep States

The Profilo-compatible platforms do not implement all of the ACPI states. This is a result of the

ambiguous specifications and/or overlap with other state types. However, these gaps are

inconsequential, since there are numerous performance and power saving states. The idle test

type is comprised of four different power meter measurement tests. Each measures a different

level of idle: one in G2/S5, one in G1/S3, one in G1/S1, and one in G0/S0.

Recall that the G3 state has the most power savings. By definition, it has a power consumption of

zero, because it has no power source. The next state is G2/S5. This is a “soft off” state

consuming 0.3 watts to power circuitry that allows the system to be turned on with a button or

RTC alarm. Restoring from this power state requires cold booting the operating system, which

can take seconds to minutes, depending largely on the performance of the non-volatile storage. If

Wi-Fi is enabled, power consumption doubles (0.6 watts), but the system can be turned on

through Wake on Wireless LAN (WoWLAN).

Incidentally, Wi-Fi (Broadcom BCM4331KML1G 3x3 802.11 a/b/g/n) consumes 0.3 watts,

while idle, in all of the power states. The Bluetooth 4.0 device (Broadcom BCM20702), which is

Bluetooth low energy (BLE) compliant, consumes an imperceptible amount of power while idle.

Finally, the keyboard backlight can consume up to 0.5 watts at full brightness. To reiterate, all of

these features are disabled in the profiling Linux environment.

The next power state is G1/S3 or more commonly, “Suspend to RAM”. In this state, the

processor caches are flushed to RAM and then the CPU package is powered off, along with

almost every ancillary component, except for the DRAM modules. The DRAM modules perform

a minimal self-refresh to maintain state and minimize power consumption. This is called S3-

Cold. Intel processors do not support S3-Hot [72, Sec. 4.1.1], likely because the wake-up latency

would be no different, but power consumption would be higher. The result is the system

consumes only 0.7 watts but can return to the previously functional state within a few seconds,

without rebooting the operating system.

There is actually a G1/S4 state, commonly called “Hibernation” or “Suspend to Disk”, which

flushes the RAM to non-volatile storage and then goes into a power configuration identical to

G2/S5 (0.3 watts). The only difference is that when the system comes out of “soft off”, it

accesses a ring 0 defined hibernation file location and reads its contents back into RAM prior to

continuing the same bootstrap as G1/S3. Restoring from the G1/S4 state can take seconds to

minutes depending on the amount of system RAM and the sequential transfer rate of the non-

volatile storage. As a side note, when Mac OS is told to “Suspend” or the lid is closed, it creates

the hibernation file, and then goes into the G1/S3 state. If the remaining battery capacity

becomes low, an interrupt is invoked (similar to how the RTC alarm works), which wakes the

processor long enough to put it into a G1/S4 state (this happens very quickly since the

hibernation file already exists).

The least-effective power-saving global state is G1/S1, often called Power on Suspend (POS) or

just “suspend”. As already mentioned, this state lacks a good definition and may soon be

deprecated, like the G1/S2 state, since power-gating makes implementing these states even more

ambiguous. In fact, Intel makes no reference to this state anywhere in the processor datasheet

[66], [72]. The datasheet explicitly lists support for only the following system states: G0/S0,

G1/S3-Cold (“S3-Hot is not supported by the processor”), G1/S4, G2/S5, and G3. Nevertheless,

the system is able to enter the G1/S1 state under Linux, although the graphics context becomes

corrupt after resuming. In this state, the system consumes 0.8 watts. One can only speculate

where the extra 0.1 watt is consumed: it could be the CPU package, or the Nvidia GeForce GT

650M graphics processing unit (GPU) in a “D3 Hot” state (described in the next subsection)

with insufficient power to the GDDR5 memory, resulting in the corruption, or the Platform

Controller Hub (PCH) in an unsupported state.

5.3.2 Normalizing Measurements

One of the features of this MacBook Pro Retina is that it has two GPUs: the capable Intel HD

Graphics 4000 on the CPU package (that uses system DDR3 RAM) and the higher performance

Nvidia GeForce GT 650M with dedicated 1 GB GDDR5 memory. Under most circumstances,

when high-performance GPU-acceleration is not required, the significantly more energy efficient

HD4000 GPU is used. When an application requests high-performance GPU acceleration or creates an OpenGL context, or when a user connects a secondary display, Mac OS seamlessly switches

to the 650M. Once the 650M is no longer required, and the HD4000 is adequate, Mac OS

seamlessly switches back to the HD4000. When the display is turned off (login screen timeout,

screensaver, or user invoked), the active GPU (ACPI D0 state) is put into a low power state

(ACPI D3). For the HD4000, this state is the “D3 Cold” state, which cuts power to the power

plane associated with the GPU (PP1). For the 650M, this state is the “D3 Hot” state, which

provides auxiliary power to keep the GDDR5 memory powered. Therefore, even in the D3 state,

the HD4000 is more energy efficient.

Unfortunately, this level of power management sophistication is unavailable in Linux.

Furthermore, at this time the only (reliably) available GPU in Linux is the 650M. This means

that the RAPL PP1 domain can be ignored since it stays zero. As a result, the PKG domain does

not need to be reduced by the value in the PP1 domain to isolate CPU activity. This simplifies

things, especially considering that had the HD4000 been active with this version of Linux, the

energy consumption reported by the PKG domain may have been offset by more than just the

power to the PP1 domain. For instance, the power control unit (PCU), which is accounted for in

the PKG domain, but not in either of the PP0 or PP1 domains, may consume more energy if it

manages the dynamic frequency adjustments of the HD4000. Most properly configured headless

servers would have their GPU disabled to save power. The exception would be if the GPU were

an active compute resource.

To better understand the impact of using the 650M under Linux, the power monitor was used to

measure the system in the G0/S0/C7 state using the HD4000 and then the 650M in Mac OS. All

tests were performed with Wi-Fi, Bluetooth, and the keyboard backlight disabled. When the

HD4000 is in the D3 (cold) state, the power monitor reads 5.7 watts. The 650M in its D3 (hot)

state consumes an additional 2.1 watts. When the HD4000 is put into the D0 state (with the

display turned off), the power consumption rises to 8.1 watts, suggesting that the HD4000 GPU

consumes 2.4 watts at idle. With the 650M in the same situation, the total power consumption is

10.5 watts, which is 2.4 watts more than the HD4000 in the D0 state. This suggests that the

650M (and its GDDR5 memory) consumes 4.8 watts at idle – double the HD4000. With the

display turned on and at full brightness, the power consumption rises by 8.4 watts. One would

expect to see the same 18.9 watt power consumption from Linux with the processor, 650M, and

display in the exact same power state; however, the power consumption is 25.4 watts. The extra

6.5 watts could possibly be a result of the PCI Express links and/or Direct Media Interface

(DMI) being in the active transfer state (L0) instead of a lower power state (i.e., L1-L3), and/or

devices (Wi-Fi, Bluetooth, etc.) being in higher power states (i.e., D0-D2) rather than being off

(D3), despite most device drivers being disabled. The extra power consumption also raises the

temperature of the processor, necessitating active thermal dissipation from the two DC brushless

fans that can consume up to 2W (rated) each.

Rather than attempting to correct the relatively poor power optimization of the system under

Linux, which could possibly require driver fixes, the results gathered from Profilo were

normalized against the Mac OS optimized system configuration in the G0/S0/C7/D3 state from

the first measurement (5.7 watts). Therefore, to capture the total system power consumption, including the switched power supply, the median values for each scenario in every type of test, except for the idle (S1 or deeper) measurement types, were recorded and then reduced by 19.7 watts (the 25.4 watt Linux idle measurement minus the 5.7 watt Mac OS baseline). The resulting power consumption is more in line with properly set up, power-efficient headless servers found in well-managed datacenters.

Table 5.1 summarizes the power consumption of these isolated components. An important

consideration with this normalized approach is that the RAPL PKG values do not include energy

associated with using the HD4000 (if there is any in the D3 state). Energy associated with the

GPU (if any) is included in the “Other” values, which is the power consumption outside of the

PKG domain.

Table 5.1 Power Consumption of Components

Component Power Consumption

Wi-Fi & Bluetooth 0.3 W

Keyboard Backlight 0.5 W

650M (D3 Hot) 2.1 W

HD4000 (D0) 2.4 W

650M (D0) 4.8 W

LCD Display 8.4 W

DC Brushless Fans 0 W to 4.3 W

Linux Idle (C7) 25.4 W

OSX Idle (C7; Headless) 5.7 W

5.3.3 Idle Power Consumption

The global working state (G0/S0) is where all of the CPU’s power saving C-states function. The

characteristics and trade-offs of the C-states are arguably as important as the performance and

energy characteristics of the P-states, which are the processor’s dynamic voltage and frequency

scaling (DVFS) states. Until recently, discussion around processor idle states has generally been

restricted to “race-to-idle”, which is running the processor at its maximum speed and then

putting it into an idle state [4]. However, this scheme has been shown to be suboptimal in a

variety of situations [26]. The topic has become further complicated by the fact that there are

now many idle states that make compromises between power savings and exit latency. Multiple

cores also allow the physical processor to be in a myriad of hybrid states. From the perspective

of energy efficiency, choosing how a processor rests is as important as choosing how it works.

The ACPI specification loosely defines C0 through C3; however, Intel extends the number of

states to further reduce power consumption for situations where the wake-up latency and cache

warm-up time is acceptable. Over time, some power saving techniques have been deprecated

because of newer methods that increase savings with equal or improved wake-up latencies. As a

result, contemporary processors have gaps in the numbering of the C-states. On the Ivy Bridge

microarchitecture, there are six C-states: C0, C1, C1E, C3, C6, and C7. The Broadwell

microarchitecture adds to this with three more C-states: C8, C9, and C10.

Just like the other nil states, the C0 state is the active state of the CPU. The power consumption

in this state depends on the voltage, frequency, and activity on the CPU, since inactive execution

units are power-gated [31]. Nevertheless, even nominal power consumption on the Ivy Bridge

processor in the C0 state (at the lowest frequency) is 59% higher than the first power saving state

(C1) and 4.1 times higher than the last power saving state (C7). The rest of Section 5.3.3

discusses the details and power measurements of the remaining C-states. Section 5.3.4 focuses

almost exclusively on power consumption in the C0 state.

5.3.3.1 Linux cpuidle

Just as the Linux kernel has the cpufreq infrastructure for managing the processor frequency (P-

states), there is a subsystem for managing the CPU’s idle states, called cpuidle. Since every

processor has different idle characteristics, prerequisites, side effects, states, and actions to

enter/leave those states, this complexity is abstracted to a driver layer. A driver defines a

cpuidle_state structure for every processor state. In the case of the x86 instruction set

architecture (ISA), these are the C-states. The driver then defines a cpuidle_device structure for

each logical processor, which is usually the same for every logical processor on the same CPU,

although technically, they could be different. The driver then registers the device(s) with cpuidle.

Similar to cpufreq, the cpuidle subsystem has a governor interface that supports the

implementation of different idle policies. A governor makes decisions based on information it

gets from the device. Each of the states defined in a device has a number of predefined

parameters, such as its name, desc (description), flags, exit_latency, power_usage, and

target_residency.

The name and desc are for users since the governor and driver/devices are made available

through sysfs. The flags indicate predefined characteristics that are important for the governor to

make decisions. For instance, if a state has the CPUIDLE_FLAG_CHECK_BM flag, this

indicates that the state is not compatible with bus-mastering direct memory access (DMA)

activity. Entering such a state (i.e., deep sleep that is unresponsive to snooping but keeps the last

level cache active) while bus-mastering DMA activity is occurring, could result in the processor

caches failing to update in response to DMA, leading to data corruption.

The exit_latency is the number of microseconds it takes for the processor to return to the

operating state. If the state turns off caches, there could be additional delays due to cache misses.

The power_usage is the number of milliwatts consumed by the processor in this power state. It is

important to remember that transitioning to an energy saving state and then back again costs both

time and energy. Therefore, the target_residency is the minimum number of microseconds that

the processor should stay in the state to save any energy and make the transition worthwhile.
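
To make these parameters concrete, an entry in a driver's static C-state table might look like the following sketch, populated with the Ivy Bridge C3 values from Table 5.4. The field names come from the kernel's struct cpuidle_state; the flag shown is the one the kernel uses for states that flush the TLB, and enter_c_state is a hypothetical driver callback:

    static struct cpuidle_state ivb_c3 = {
        .name             = "C3-IVB",
        .desc             = "MWAIT 0x10",   /* hint passed to MWAIT(C3) */
        .flags            = CPUIDLE_FLAG_TLB_FLUSHED,
        .exit_latency     = 59,             /* microseconds (Table 5.4) */
        .target_residency = 156,            /* microseconds (Table 5.4) */
        .enter            = &enter_c_state, /* driver-supplied callback */
    };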

When a logical processor has nothing left to do, the select function to its cpuidle governor is

invoked. This is where the governor applies its heuristics and returns the integer index to a target

state in the states array defined by the device for the logical processor. Selecting an appropriate

sleep state involves several critical steps, such as looking at upcoming timing events,

determining if there is DMA activity, and checking power-management quality-of-service latency requirements, which are supported by the pm_qos subsystem that provides a kernel (e.g., drivers)

and user mode interface for registering performance expectations. Currently, there is only one

governor implemented, called menu, which takes into account all of these considerations and

then immediately picks the deepest possible idle state. An Intel conference paper suggests the

possibility of another governor called ladder, which takes a progressively deeper step-wise

approach to selecting an idle state [42], similar to cpufreq governors, in conjunction with a tick-

based kernel.
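
In essence, the select function reduces to choosing the deepest state whose requirements are satisfied, falling back to a busy wait when even the shallowest state does not qualify. The following is a hedged sketch of that rule, not the actual menu governor code:

    /* Pick the deepest state whose target residency fits the predicted
       idle period and whose exit latency fits the pm_qos constraint;
       return -1 to request a busy wait instead. */
    int select_state(const struct cpuidle_state *states, int n,
                     unsigned int predicted_us, unsigned int latency_req_us)
    {
        int chosen = -1;
        for (int i = 0; i < n; i++) {        /* deeper states appear later */
            if (states[i].target_residency <= predicted_us &&
                states[i].exit_latency <= latency_req_us)
                chosen = i;
        }
        return chosen;
    }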

Intel actively develops and supports a cpuidle driver for Sandy Bridge and newer

microarchitectures [17]. From the source code, it appears that power_usage is an optional

parameter that is not used by the governor interface to make decisions; it is not defined anywhere in the Intel driver. The driver first detects the microarchitecture of the processor by

invoking the CPUID instruction and then gathering the vendor, family, and model, just like

cpuinfo in section 4.2.4.2. Based on this information, the states and device(s) are statically

created and registered.

5.3.3.2 CPU Sleep States

The first C-state on these microarchitectures is C1. It can be entered with the HLT (halt)

instruction or the MWAIT(C1) instruction. This state offers power savings by stopping the

instruction clock, but allows the core to return to C0 almost instantaneously. The

bus/interconnect frequency is left unchanged, so unlike the deeper C-states, the C1 state power

savings is dependent on the clock frequency. The C-state of the entire CPU package is governed

by the core with the lowest C-state value (least power savings). For example, if one core goes

into the C1 state and the remaining cores are in C7 states, the package transitions into the C1

state. The package C1 state is purely a semantic state that is no different than C0. On Ivy Bridge

(the testing platform), Intel defines the target residency as 1 microsecond and the exit latency as

1 microsecond. On Broadwell, these values are both 2 microseconds. When the cpuidle governor

decides the system needs to idle for less than the C1 target residency or have a shorter response

time than the exit latency, it invokes a busy wait instead. This means that in the 1 to 2

microsecond range, the older Ivy Bridge microarchitecture is actually more energy efficient.

Table 5.2 shows that transitioning all cores from C0 to C1 results in the processor dropping

power consumption by 37% to 52%, depending on whether the cores are clocked at the lowest or

highest frequency. This table includes the power consumption breakdown of the PP0 domain

(cores), the entire package (processor), the estimated fan power consumption, and components

outside of the processor package (other), as well as the normalized grand total power

consumption from the wall, as discussed in Section 5.3.2. It is important to emphasize that PP0 is

a subset of PKG and that both are a subset of total (the last column).

Table 5.2 Ivy Bridge C1 Power Savings

Core 0 Core 1-7 Frequency (MHz) PP0 (W) PKG (W) Fan (W) Other (W) Total (W)

C0 C0 1200 8.8 12.7 1.5 2.9 17.1

C0 C7 1200 3.0 6.8 1.5 2.9 11.2

C1 C1 1200 4.2 8.0 1.5 2.9 12.4

C0 C0 1300 9.3 13.3 1.6 2.9 17.8

C0 C7 1300 3.2 7.0 1.5 2.9 11.4

C1 C1 1300 4.4 8.2 1.5 2.9 12.6

C0 C0 1400 9.9 13.8 1.6 2.9 18.3

C0 C7 1400 3.4 7.2 1.5 2.9 11.6

C1 C1 1400 4.6 8.4 1.5 2.9 12.8

C0 C0 1500 10.5 14.4 1.7 3.0 19.1

C0 C7 1500 3.6 7.4 1.5 3.0 11.9

C1 C1 1500 4.8 8.6 1.5 3.0 13.1

C0 C0 1600 11.0 15.0 1.7 3.0 19.7

C0 C7 1600 3.7 7.5 1.5 3.0 12.0

C1 C1 1600 5.0 8.8 1.5 3.0 13.3

C0 C0 1700 11.6 15.5 1.8 3.0 20.3

C0 C7 1700 3.9 7.7 1.5 3.0 12.2

C1 C1 1700 5.2 9.0 1.5 3.0 13.5

C0 C0 1800 12.0 16.0 1.8 3.0 20.8

C0 C7 1800 4.1 7.9 1.5 3.0 12.4

C1 C1 1800 5.4 9.2 1.5 3.0 13.7

C0 C0 1900 12.6 16.6 1.9 3.0 21.5

C0 C7 1900 4.3 8.1 1.5 3.0 12.6

C1 C1 1900 5.6 9.4 1.5 3.0 13.9

C0 C0 2000 13.2 17.2 1.9 3.0 22.1

C0 C7 2000 4.5 8.3 1.5 3.0 12.8

C1 C1 2000 5.8 9.6 1.5 3.0 14.1

C0 C0 2100 13.7 17.7 2.0 3.1 22.8

C0 C7 2100 4.7 8.5 1.5 3.1 13.1

C1 C1 2100 6.0 9.8 1.5 3.1 14.4

C0 C0 2200 14.2 18.2 2.0 3.1 23.3

C0 C7 2200 4.8 8.7 1.5 3.1 13.3

C1 C1 2200 6.2 10.1 1.5 3.1 14.7

C0 C0 2300 15.1 19.1 2.1 3.1 24.3

C0 C7 2300 5.3 9.1 1.5 3.1 13.7

C1 C1 2300 6.7 10.5 1.5 3.1 15.1

C0 C0 2301 27.5 31.4 4.3 7.3 43.0

C0 C7 2301 11.3 15.3 1.5 4.0 20.8

C1 C1 2301 11.3 15.0 1.5 4.0 20.5

The table also includes a scenario where the first logical processor is left active (C0) and the rest

are put into the deepest sleep (C7), resulting in an additional 13% to 15% power savings over all

cores being in the C1 state, with the exception of the turbo frequency (described in Section 5.1).

This presents an interesting class of speed scaling algorithms, where most cores can be put into

states with high wake-up latencies (i.e., C7), while one or more cores are left in states with low

or no wake-up latencies (i.e., C0).

The Intel driver disables a feature called C1E (state) auto-promotion, which can be turned on and

off through the IA32_MISC_ENABLE model specific register (MSR). When enabled, this causes

the core to automatically transition from the C1 state into the C1E (C1-Enhanced or enhanced

halt) state, which further reduces power consumption by transitioning to the lowest supported

clock frequency and reducing voltage to the core. The state can alternatively be entered by

executing the MWAIT(C1E) instruction. If the other cores are in a C1E or deeper state, the

package also enters the C1E state, which reduces bus/interconnect frequencies and voltages.

Naturally, the target residency increases to 20 microseconds, while the exit latency climbs to 10

microseconds, on both Ivy Bridge and Broadwell. The power consumption for the Ivy Bridge

processor drops to 7.6 watts, which is 5% to 49% (depending on frequency) less than when the

processor is in the C1 state.

The next consecutive state is C3, which saves power by flushing the core’s L1 caches (32 KB

instruction and 32 KB data) and L2 cache (256 KB) to the shared L3 cache (6 MB on testing

platform) and then stopping all core clocks. In addition to power-gating [23], the package does

not need to wake the core when a snoop is detected or an active core accesses cacheable

memory. A core enters this state with the MWAIT(C3) instruction. The target residency for this

state jumps to 156 microseconds on Ivy Bridge and a more efficient 100 microseconds on

Broadwell. The exit latency on Ivy Bridge is defined as 59 microseconds, while Broadwell is

faster at 40 microseconds. Since the caches are cold, there are additional penalties for misses;

however, some or all of these misses may be served by the L3 cache, which is roughly an order of

magnitude faster than RAM. The Ivy Bridge processor in the C3 state consumes 4.7 watts,

resulting in a 38% power savings over C1E.

With the exception of the package C1 and C7 states, the Intel datasheets are strikingly vague on

the package power savings [66, Sec. 4.2.5], [72, Sec. 4.2.5.1]. On Broadwell, there is an

additional package C2 state that is internal and cannot be explicitly requested by software. The

package can fall into this state when cores and graphics are in C3 or greater states and there is a

“memory access request received” (perhaps a delayed response from a microcontroller or

compute module, such as Intel’s Many Integrated Core Architecture). One possible reason for

the elusive details is that they are irrelevant to firmware and kernel developers and therefore kept

as a trade secret.

The deepest power saving state with respect to the cores is C6. When a core is instructed to enter

C6, through the MWAIT(C6) instruction, it flushes the L1/L2 caches and saves its architectural

state to dedicated SRAM. The entire core and its phase-locked loop (PLL) are then powered off.

The target residency jumps to 300 microseconds on Ivy Bridge and a less efficient 400

microseconds on Broadwell. Likewise, the exit latency climbs to 80 microseconds on Ivy Bridge

and a slower 133 microseconds on Broadwell. Just like the C3 state, the exit latency does not

include the extra delays associated with L1/L2 cache misses. The power consumption on the Ivy

Bridge processor in the C6 state is 3.3 watts, which is a 30% drop from C3.

If all of the cores execute the MWAIT(C7) instruction, they each do the exact same routines as in

C6, with the exception that the last transitioning core is responsible for flushing the contents of

the L3 cache before powering down. With all of the cores in the C7 state, the processor can

transition into the package C7 state. This involves disabling the L3 cache and notifying the

platform of the state transition so that it knows that the CPU possesses no snoopable information,

thus not waking the CPU unnecessarily. The package also powers down other components in the

uncore (components in close proximity to the core and essential for core performance) and then

enters a low power state. On Broadwell, the package may enter an even lower voltage state

called Package C7 Plus. These savings come with the target residency cost of 300 microseconds

for Ivy Bridge and a much longer 500 microseconds for Broadwell. The exit latency is 87

microseconds on Ivy Bridge and almost twice that for Broadwell at 166 microseconds.

Furthermore, exiting does not immediately re-enable the L3 cache; this happens only after the processor has stayed

out of C6 or deeper for a preset amount of time – and even then, there are undisclosed internal

heuristics that govern how the L3 cache is gradually expanded. This saves energy by power-

gating the L3 cache and preventing unnecessary and expensive repopulation/flushing cycles. The

Ivy Bridge processor in the C7 state, which is its deepest C-state, measures 3.1 watts, which is

only a 6% power savings over the C6 measurement; however, the power consumption for the C6

and greater states will measure higher if the L3 cache is saturated, resulting in greater relative

power savings in the C7 state.

A breakdown of the power consumption in the various C-states of the Ivy Bridge testing

platform is contained in Table 5.3. This table has the same format as Table 5.2 and summarizes

its results by including the lowest, highest, and turbo frequency for the C0 and C1 states, as well

as the hybrid C0/C7 state, touched on at the beginning of this section. The corresponding

durations for the C-states can be found in Table 5.4.

Table 5.3 Ivy Bridge C-State Power Measurements

Core 0 Core 1-7 Frequency (MHz) PP0 (W) PKG (W) Fan (W) Other (W) Total (W)

C7 C7 n/a 0.3 3.1 0.7 2.6 6.4

C6 C6 n/a 0.5 3.3 0.7 2.7 6.7

C3 C3 n/a 1.8 4.7 0.7 2.8 8.2

C1E C1E n/a 3.9 7.6 1.2 2.9 11.7

C1 C1 1200 4.2 8.0 1.5 2.9 12.4

C1 C1 2300 6.7 10.5 1.5 3.1 15.1

C1 C1 2301 11.3 15.0 1.5 4.0 20.5

C0 C0 1200 8.8 12.7 1.5 2.9 17.1

C0 C0 2300 15.1 19.1 2.1 3.1 24.3

C0 C0 2301 27.5 31.4 4.3 7.3 43.0

C0 C7 1200 3.0 6.8 1.5 2.9 11.2

C0 C7 2300 5.3 9.1 1.5 3.1 13.7

C0 C7 2301 11.3 15.3 1.5 4.0 20.8

As a result of having a fully-integrated voltage regulator (FIVR), first introduced in Haswell, the

Broadwell microarchitecture also has package C8, C9, and C10 states, which can be entered by

executing MWAIT(C8), MWAIT(C9), and MWAIT(C10), respectively. The cores initiate the

transition exactly the same as for C7. The package C8 state builds on the package C7 state by

turning off all internally generated voltage rails and reducing the input VCC from 1.3 volts to 1.15

volts. This pushes the target residency to 900 microseconds and the exit latency to 300

microseconds. The package C9 state reduces the input VCC to zero volts but at a cost of doubling

both the target residency (1800 microseconds) and exit latency (600 microseconds). Finally, the

C10 package state puts the single-phase core controller (VR12.6) into a quiescent mode (PS4)

that consumes only 0.5 milliwatts. Despite the datasheet stating that the exit latency for the

controller is 48 microseconds [70], the target residency jumps to a considerably larger 7.7

milliseconds with an exit latency of 2.6 milliseconds.

Table 5.4 Ivy Bridge and Broadwell C-State Durations

Target Residency (µs) Exit Latency (µs)

State Ivy Bridge Broadwell Ivy Bridge Broadwell

C1 1 2 1 2

C1E 20 20 10 10

C3 156 100 59 40

C6 300 400 80 133

C7 300 500 87 166

C8 n/a 900 n/a 300

C9 n/a 1800 n/a 600

C10 n/a 7700 n/a 2600

Core and package states can be statically restricted at the hardware level (through configuration

MSRs) or dynamically regulated through a feature called auto-demotion. In essence, this feature

allows the processor to use the per-core immediate residency history and interrupt rate to demote

C6 and deeper requests to the C3 state, or C3 and deeper states to C1, based on the configuration

options. None of these features are enabled on the testing platform.

This concludes the coverage of idle states, from both the system and the CPU perspectives, on

both the Ivy Bridge testing platform and the latest Intel platform at the time of writing

(Broadwell). While Table 5.3 showcases the power consumption of each of the Ivy Bridge C-

states when all of the logical processors are in the same state, the table could be significantly expanded to cover all of the permutations of C-states that the individual cores can be in. As shown with the C0/C7 hybrid state, there exist interesting cases where the CPU can remain active and

consume less power than when all cores are in a non-functional intermediate sleep state. Other

non-synchronous C-state options could be exercised to increase power savings, while retaining

better exit latencies. Further discussion on this topic is reserved for future work.

5.3.4 Active Power Consumption

As detailed in section 5.1, the testing platform has 12 controllable discrete frequencies and an

Intel Turbo Boost Technology state, yielding 13 ACPI performance states or P-states. The ACPI

specification currently has a defined limit of up to 16 P-states [67]. While the relative power

savings are large between the C-states, the absolute power consumption, in watts, is most

significant between the P-states. This is because the power consumption of processors is

proportional to the square of the voltage, multiplied by the frequency [46]. Since the voltage

must be increased with the frequency, the power requirements of processors grow in accordance

to this polynomial function. The valid range of the voltage identification value for the Intel Core

i7-3615QM Ivy Bridge processor is 0.65 to 1.35 volts [72, Sec. 7.9.1]. Using the Idler utility,

described in Section 4.2.5, each of the discrete frequencies was measured from the perspective of

the core power plane (PP0), processor package (PKG), and total system power from the wall

with the first core busy-waiting, while the remaining cores were “offline” (C7 state). The results

are summarized in Table 5.5. Once again, the “Total” column, as with the previous tables, is the

normalized (described in Section 5.3.2) total power consumption from the wall, with PKG being

a subset of this total, and PP0 being a subset of PKG. The last two columns are the relative

change from the prior frequency.

Table 5.5 Single-Core Busy Wait

Frequency (MHz) PP0 (W) PKG (W) Total (W) ΔFrequency ΔPP0

1200 2.8 6.5 11.8

1300 3.0 6.7 11.9 8.3% 0.15 (+5.5%)

1400 3.1 6.9 12.2 7.7% 0.16 (+5.4%)

1500 3.3 7.0 12.3 7.1% 0.15 (+4.9%)

1600 3.4 7.2 12.6 6.7% 0.17 (+5.1%)

1700 3.6 7.3 12.7 6.3% 0.16 (+4.6%)

1800 3.8 7.5 12.9 5.9% 0.17 (+4.6%)

1900 3.9 7.7 13.1 5.6% 0.19 (+5.0%)

2000 4.1 7.8 13.1 5.3% 0.15 (+3.8%)

2100 4.3 8.0 13.3 5.0% 0.17 (+4.1%)

2200 4.4 8.2 13.5 4.8% 0.15 (+3.5%)

2300 4.7 8.5 13.8 4.5% 0.32 (+7.1%)

2301 (3300) 9.9 13.7 20.2 43.5% 5.17 (+109.4%)

Curiously, the PP0 power consumption increases linearly and with an almost perfect correlation

to the frequency, up until 2200 MHz. This means that Intel maintains the same voltage for all

user-selectable frequencies, except the last (2300 MHz). This is supported by further

observations from Table 5.6. This voltage scheme is clearly suboptimal, particularly for the

lowest frequencies. The last user-selectable frequency (2300 MHz) seems to experience a

voltage increase. With only one core active, the maximum turbo frequency is 3.3 GHz, including

a dramatic spike in power consumption, signifying that the voltage is further increased.
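
These measurements line up with the standard first-order model for dynamic CMOS power cited above [46], which can be written as

    P_{dynamic} \approx C_{eff} \cdot V^2 \cdot f

where C_{eff} is the effective switched capacitance, V is the supply voltage, and f is the clock frequency. With V held constant, power grows linearly in f, matching the PP0 trend from 1200 MHz to 2200 MHz; the jumps at 2300 MHz and again in turbo mode are then best explained by increases in V, which enters the model quadratically.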

Despite this substantially larger power consumption, power-gating still yields meaningful energy savings. Power-gating is a technique that saves power by cutting power to inactive execution units [31]. The trial

division primality test algorithm is a heavy, integer-based workload, for which the power

consumption increases over a simple busy-wait loop, as seen in Table 5.6. The “Total” column is

the same as the previous tables. The last column in this table is the relative increase in the PP0

power consumption over the same frequency with a busy-wait loop on the first core (remaining

cores in C7, just like in Table 5.5). This is the power consumption of an active virtual process

(calculating the first 200 prime numbers, 10000 times) in Profilo, repeated at all of the discrete

speeds available on the Ivy Bridge testing platform.

Table 5.6 Single-Core Trial Division Primality Test

Frequency (MHz) Duration (s) PP0 (W) PKG (W) Total (W) ΔPP0

1200 8.68 3.1 6.9 12.0 12.3%

1300 8.02 3.3 7.1 12.4 12.6%

1400 7.44 3.5 7.3 12.6 12.1%

1500 6.95 3.7 7.5 12.7 12.5%

1600 6.51 3.9 7.6 12.9 12.4%

1700 6.13 4.1 7.9 13.1 13.3%

1800 5.79 4.3 8.0 13.4 13.2%

1900 5.49 4.5 8.3 13.5 12.9%

2000 5.21 4.6 8.4 13.7 12.9%

2100 4.96 4.8 8.6 13.9 13.3%

2200 4.74 5.0 8.9 14.0 14.2%

2300 4.53 5.4 9.2 14.4 14.3%

2301 (3300) 3.16 11.5 15.3 21.8 15.9%

At every single speed, the duration to complete the work is almost exactly inversely proportional to the frequency, including the turbo mode, which runs at precisely the full 3300 MHz. For example, scaling the 1200 MHz duration by the frequency ratio gives 8.68 s × (1200/2300) ≈ 4.53 s, exactly the measured duration at 2300 MHz. This means that performance scales perfectly with frequency and that none of the frequencies suffer from timing anomalies. Furthermore, this supports the PP0 observations of Table 5.5 that the

voltage is constant from 1.2 GHz to 2.2 GHz, and that it increases at 2.3 GHz, and then again for

the turbo speed (3.3 GHz). This claim is supported because the only way more energy is being

consumed is either by a voltage increase or wasteful idle cycles (due to timing issues), and the

duration measurements eliminate the latter supposition.

With regard to power consumption, the PP0 values show that even a single active core with this workload, on an optimally configured system (see Section 5.3.2), can consume over half of the total system power consumption from the wall. On the other hand, the portion of the package (PKG) power consumption outside of PP0 remains within a narrow 3.7 to 3.9 watts (C0 state), depending on frequency and workload.

An interesting observation is that under this workload, the total system power consumption,

outside of the package, is fairly consistent at 5.3 watts, except for when the core is at its turbo

speed, where it jumps up to 6.5 watts. The components outside of the package, such as the

Platform Controller Hub (PCH), and its various I/O, seem to be independent of the processor’s

load and clock frequency until the turbo mode is engaged, where the frequency and/or voltage of

one or more external components seems to increase. A final observation is how strongly the consumer-grade power meter (Kill A Watt) readings correlate with the PP0/PKG values (correlation coefficients of 0.999/0.998) across all user-controllable frequencies. This is reassuring evidence that the power meter offers precise (+/- 0.1 watt) and likely accurate readings.

As mentioned in Section 4.1.4, the trial division primality test algorithm is an integer-based

workload that was chosen for numerous reasons, including its simplicity to code in kernel mode.

As a result, many execution units, such as the floating point unit (FPU), the single instruction,

multiple data (SIMD) [24] extensions, Advanced Encryption Standard New Instructions (AES-

NI), and virtualization extensions, are never used. The Great Internet Mersenne Prime Search

(GIMPS) foundation has a tool called Prime95, which is heavily optimized to take advantage of

some of these extra components to accelerate its search for Mersenne prime numbers. When

executing on all of the logical processors, the power consumption at the wall is a substantial 72

watts, with each core consuming roughly 15.4 watts. These additional execution units increase

the power consumption by 30% to 40% over the trial division primality test algorithm,

approaching the peak power consumption of the processor. Under this workload, the processor

consumes 80% (when hot) to 90% (when cool) of the total power drawn from the wall. The difference depends on the thermal envelope, which triggers throttling in turbo mode, and on active heat dissipation, which spins both 2 W fans up to 6000 revolutions per minute (RPM) to cool the processor.

5.4 Profilo Workload Benchmarking

Recall from Section 4.1.4 that the trial division primality test algorithm has an exponential

running time, when measured in terms of the size of the input in bits. This is visualized in Figure

4.1. Section 4.1.5 covers the implications of this algorithm and how Profilo uses a “work unit”

argument to essentially linearize any workload, including the trial division primality test

algorithm. The caveat is that with a fixed “work unit” value, the duration to complete a “work

unit” will be architecture-dependent and inversely proportional to the frequency. Therefore, if one

wishes to approximate a time slice for each discrete speed, for instance for a PS trace, a “work

unit” along with the number of loops, unique to each speed, needs to be determined.
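To make the workload concrete, a minimal C sketch of the “work unit” mechanics is shown below. This is an illustration only, not Profilo’s actual kernel-module code: one “work unit” finds the first w primes by trial division, and a task repeats that computation for a given number of loops.

    #include <stdbool.h>

    /* Trial division primality test: the integer workload of Section 4.1.4.
     * Illustrative sketch only, not Profilo's actual kernel-module code. */
    static bool is_prime(unsigned int n)
    {
        if (n < 2)
            return false;
        for (unsigned int d = 2; d * d <= n; d++)
            if (n % d == 0)
                return false;
        return true;
    }

    /* One "work unit" finds the first w primes; a task repeats it loops times. */
    static void do_work(unsigned int w, unsigned int loops)
    {
        for (unsigned int l = 0; l < loops; l++) {
            unsigned int found = 0;
            for (unsigned int n = 2; found < w; n++)
                if (is_prime(n))
                    found++;
        }
    }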


Included as an alternative runtime for Profilo is a benchmark mode that allows one to determine

appropriate “work unit” and “primes per second” arguments for the host processor, along with

the respective loop values for each discrete speed to achieve a particular time slice. For Chapter

6, the desired time slice is 10 ms, since this is the maximum duration that regular processes are

allowed to occupy the CPU [68].

There are many different combinations for the “work unit” and loop values for the discrete

speeds. The smallest “work unit” is 1, but then all of the benefits of the workload from Section

4.1.4, such as branch prediction disruption, would be bypassed since the processor would not be

doing very much work, besides decrementing a counter each time it returns a constant.

Technically, the largest “work unit” is determined by the slowest discrete speed that is able to

calculate the defined number of prime numbers within the desired period (i.e., 10 ms). However,

the loop value for the next discrete speed would either have to be the same, which is incorrect, or

incremented by one, doubling the number of calculations and therefore grossly missing the target

duration.

For example, on the Ivy Bridge platform at its lowest speed (1200 MHz), the processor can

calculate the first 850 prime numbers in 9.8 milliseconds. At the next discrete speed (1300

MHz), it performs the exact same work in 8.3 milliseconds. For the lowest speed, the loop value

would be 1, underestimating the target time slice by 0.2 milliseconds. With the second speed, a

loop value of 1 would be logically incorrect, because the same amount of work would be

completed in a time slice as the lowest speed, despite it calculating significantly faster.

Furthermore, the time slice would be underestimated by 1.7 milliseconds. Unfortunately, with a


loop value of two, the work would double, taking 16.6 milliseconds, overestimating the time

slice. The overestimation, with unique loop values, would get considerably worse at higher

speeds (e.g., at 1600 MHz, the fifth discrete speed, the time slice would be 24.19 milliseconds).

Therefore, the set of satisfactory combinations has to be one where the time slice divided by the

duration to perform the work unit results in a unique loop integer for each discrete speed.

Subsequently choosing a combination from that set is based upon the dichotomy of having the

largest “work unit” (because of the benefits of the workload) and better approximating the time

slice.

In the interest of time, the benchmark calculates “work unit” values 1 through 10, then every 10

to 100, then every 50 to 1000, and finally every 250 to 10000. Each discrete speed is timed with

a high-resolution timer (the same as described at the end of Section 4.2.4.4) and then the results

are displayed, rounded to the nearest microsecond. A future improvement to Profilo is to enter

the desired time-slice (in milliseconds) and maximum error (in percent), and have it output the

set of satisfactory combinations, pruned of combinations that are fully dominated by another combination

(i.e., equal or lower absolute over/underestimation for each discrete speed, with a higher “work

unit” value).
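As an illustration of this selection criterion, the following C sketch (using the measured per-loop durations for the “work unit” of 150 that appear in Table 5.7) derives the loop value and deviation for each discrete speed, and checks that the loop integers are unique:

    #include <math.h>
    #include <stdio.h>

    #define NSPEEDS  13
    #define SLICE_US 10000.0                 /* target time slice: 10 ms */

    int main(void)
    {
        /* Measured duration (in microseconds) of one loop of the candidate
         * "work unit" at each discrete speed; values are from Table 5.7. */
        double dur[NSPEEDS] = {468, 432, 401, 374, 351, 330, 312,
                               295, 281, 267, 255, 244, 170};
        int prev = 0, unique = 1;
        for (int s = 0; s < NSPEEDS; s++) {
            int loops = (int)round(SLICE_US / dur[s]);  /* nearest integer */
            double dev = loops * dur[s] - SLICE_US;     /* deviation, us   */
            if (loops <= prev)
                unique = 0;        /* loop values must be strictly unique */
            prev = loops;
            printf("speed %2d: %2d loops, %+4.0f us\n", s + 1, loops, dev);
        }
        puts(unique ? "satisfactory combination" : "not satisfactory");
        return 0;
    }

Compiled with -lm, this sketch reproduces the loop values and deviations reported in Table 5.7.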

It is important to reiterate from Section 5.1 that while the results include every available speed,

the turbo mode is avoided in Chapter 6 because it is not directly controllable, even in Ring 0 of

the hierarchical protection domains (protection rings). Therefore, while the benchmark results below include the turbo mode for completeness, the Chapter 6 results do not.


With the aforementioned discrete values for “work unit”, the test platform has a range of choices

from 50 to 200. A “work unit” of 200 results in a maximum underestimation of 397

microseconds (4.0%), a maximum overestimation of 10 microseconds (0.1%), and an average

case of underestimating the 10-millisecond time slice by 152 microseconds (1.5%). The next

tested “work unit” is 150, which results in a maximum underestimation of 172 microseconds

(1.7%), a maximum overestimation of 116 microseconds (1.2%), and an average case of

underestimating the 10-millisecond time slice by 36 microseconds (0.4%). Because of the

somewhat arbitrary desire to have an average case error less than 1% (i.e., ±100 microseconds

for a 10-millisecond time slice), the “work unit” of 150 was chosen for use throughout Chapter 6.

A summary of the benchmark results for this “work unit” is featured in Table 5.7. The duration

column is the execution time to find 150 prime numbers at the corresponding discrete speed. The

loop value is the number of rounds for a 10-millisecond time slice. The last column is the

deviation from 10 milliseconds.


Table 5.7 Benchmark Results (150 Primes)

Frequency Duration Loop Value Deviation

1200 MHz 468 µs 21 -172 µs

1300 MHz 432 µs 23 -64 µs

1400 MHz 401 µs 25 +25 µs

1500 MHz 374 µs 27 +98 µs

1600 MHz 351 µs 28 -172 µs

1700 MHz 330 µs 30 -100 µs

1800 MHz 312 µs 32 -16 µs

1900 MHz 295 µs 34 +30 µs

2000 MHz 281 µs 36 +116 µs

2100 MHz 267 µs 37 -121 µs

2200 MHz 255 µs 39 -55 µs

2300 MHz 244 µs 41 +4 µs

2301 MHz (3300 MHz) 170 µs 59 +30 µs

5.5 Mode and Context Switches

Since Profilo is a hybrid application, implemented in both kernel mode code and user mode

code, it is important to describe and quantify this domain crossing as well as contrast it to

process switching. Crossing between kernel mode and user mode requires a mode switch, which

involves changing to/from a supervisor/privileged mode, supported by hardware as hierarchical


protection domains, commonly referred to as protection rings. The number of modifications to

registers and segments is minimal. A context switch, which occurs in kernel mode, requires that

all general non-floating-point registers be saved/swapped (although optimizations can be made

with software-implemented context switches), in addition to appropriately updating control

registers.

Linux does not use hardware context switches because full software context switches are about

the same speed; however, the latter allows the kernel to optimize and speed up the switches,

maintain counters and metrics, and verify registers for security. Nevertheless, the kernel is forced

to set up each online logical processor’s Task State Segment (TSS) to store hardware contexts.

There are two main reasons for this:

1. When x86 processors switch from user mode to kernel mode, they fetch the address of

the kernel mode stack from the TSS [14, Ch. 4, 10].

2. When a user mode process attempts to access I/O, the processor may verify its port

privileges by referencing the I/O Permission Bitmap stored in the TSS.

To optimize context switches, floating-point registers such as those used by the Floating Point

Unit (FPU) and single instruction, multiple data (SIMD) instruction sets (MMX/SSE/AVX) are

only changed if a process tries to use them. This is supported in hardware with the Task-Switched (TS) flag, which is set in the CR0 control register on every context switch. If the process executes an ESCAPE instruction or an MMX/SIMD/AVX instruction, the processor raises a

"Device not available" exception, allowing a handler to load floating-point registers with values

saved in the TSS for that process.


In the case of Profilo, floating-point instructions are only used in the user mode application to

tally and summarize the metrics at the end of profiling. The workload also makes use of very

little memory, so cache misses are not a consideration either. Therefore a synthetic benchmark to

quantify the duration of a context switch is a good model. This is achieved with the futex (fast

userspace mutex) system call, which is used between two processes that take turns contending

for the mutex. Quantifying a mode switch is accomplished by using the gettid (get thread ID)

system call, which may be the cheapest Linux system call.
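A minimal sketch of the mode switch timing, in the spirit of that benchmark (the actual test code was sourced from [52]), might look as follows; each gettid invocation entails one user-to-kernel and one kernel-to-user mode switch:

    #include <stdio.h>
    #include <sys/syscall.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        const long iters = 10 * 1000 * 1000;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++)
            syscall(SYS_gettid);   /* invoked directly to avoid any caching */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns per gettid round trip\n", ns / iters);
        return 0;
    }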

The mode switch and context switch test is further broken down with a script that disables all but

the first logical processor, and then sets the governor to userspace, similar to Section 4.2.4.3.

The frequency is then set to the base frequency, after which a mode switch test is performed, followed by a context switch test. The C code for the tests was sourced from [52]. This is

repeated for every frequency including the turbo frequency (described in Section 5.1).

Afterwards, all of the logical processors are re-enabled, their governors set to userspace,

followed by testing again at each discrete frequency, including the turbo frequency. The system

is then returned to its prior state.
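The context switch half of the test ping-pongs two processes on a shared futex word, roughly as in the following sketch. It assumes (as the script above arranges) that only one logical processor is enabled, so each ping-pong round trip forces two context switches:

    #include <linux/futex.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <sys/wait.h>
    #include <time.h>
    #include <unistd.h>

    static long fwait(int *uaddr, int val)  /* sleep while *uaddr == val */
    {
        return syscall(SYS_futex, uaddr, FUTEX_WAIT, val, NULL, NULL, 0);
    }

    static long fwake(int *uaddr)           /* wake one waiter */
    {
        return syscall(SYS_futex, uaddr, FUTEX_WAKE, 1, NULL, NULL, 0);
    }

    int main(void)
    {
        const int iters = 100000;
        /* Shared futex word: 0 = parent's turn, 1 = child's turn. */
        int *turn = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        *turn = 0;

        if (fork() == 0) {                            /* child */
            for (int i = 0; i < iters; i++) {
                while (*turn != 1)
                    fwait(turn, 0);
                *turn = 0;
                fwake(turn);
            }
            _exit(0);
        }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < iters; i++) {             /* parent */
            *turn = 1;
            fwake(turn);
            while (*turn != 0)
                fwait(turn, 1);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        wait(NULL);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns per context switch\n", ns / (2.0 * iters));
        return 0;
    }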

The results for the testing platform (described in Section 5.1) show that mode switches

consistently take 45 ns to 123 ns, depending on the frequency of the processor. The context

switches take a less consistent 1.1 µs to 3.2 µs, depending on the frequency. As already

discussed, this is an optimized software-based context switch with no floating-point instructions

within the processes, nor cache misses. When cache misses occur, context switches take two


orders of magnitude longer [36]. This means that generally context switches can take between 1

and 4 orders of magnitude longer than mode switches depending on the processor frequency,

instruction set usage, and cache pollution. The results are summarized in Table 5.8. The last two

columns of Table 5.8 are the mean of the samples and the standard deviation in the following

format: X̄ ± δ.

Table 5.8 Mode and Context Switch Test

Processors Frequency (MHz) Mode Switch (ns) Context Switch (ns)

One 1200 123.1 ± 0.2 3166.5 ± 8.9

One 1300 113.4 ± 0.0 2885.9 ± 24.3

One 1400 105.3 ± 0.0 2708.8 ± 13.0

One 1500 98.6 ± 0.5 2526.3 ± 9.4

One 1600 92.1 ± 0.0 2369.4 ± 23.8

One 1700 86.7 ± 0.0 2212.5 ± 7.9

One 1800 81.9 ± 0.0 2118.1 ± 25.2

One 1900 78.3 ± 1.2 1998.5 ± 25.5

One 2000 73.7 ± 0.0 1897.5 ± 20.8

One 2100 70.2 ± 0.0 1807.9 ± 31.5

One 2200 67.0 ± 0.0 1708.3 ± 2.8

One 2300 64.2 ± 0.2 1634.1 ± 8.9

One 2301 (3300) 44.8 ± 0.2 1139.5 ± 11.6

All 1200 122.9 ± 0.1 3098.8 ± 64.6


All 1300 113.4 ± 0.0 2861.7 ± 19.2

All 1400 107.0 ± 2.9 2672.4 ± 66.8

All 1500 99.9 ± 2.7 2503.9 ± 46.6

All 1600 92.1 ± 0.0 2316.0 ± 28.7

All 1700 86.7 ± 0.0 2192.1 ± 18.2

All 1800 81.9 ± 0.0 2033.4 ± 24.6

All 1900 77.8 ± 0.3 1966.4 ± 53.1

All 2000 73.7 ± 0.0 1809.2 ± 15.0

All 2100 70.7 ± 0.5 1771.1 ± 20.7

All 2200 67.3 ± 0.5 1703.1 ± 19.8

All 2300 64.1 ± 0.0 1629.4 ± 18.7

All 2301 (3300) 45.8 ± 1.2 1198.9 ± 45.9

5.6 Switch Test

One of the interesting benefits of Profilo is that in addition to being able to compare different

speed scaling algorithms and processor schedulers, a few interesting micro-benchmarks can be

performed to analyze the characteristics of the processor(s). The switch test is a set of traces that

micro-benchmark the testing platform and roughly measures the cost of frequency changes in

terms of duration and power consumption. It also supplements the ground-truth timing data

presented in Section 5.5 by cross-calibrating the data presented in this section to both estimate


Profilo’s overhead and estimate the energy costs of mode and context switches. The switch test is

composed of four subtests:

1. Locking the processor at a discrete frequency and comparing one continuous task (i.e., no

context switches) to one that is broken up into 100,000 discrete tasks. This is done for

every single discrete frequency and sets a baseline for Profilo process switches only,

including a power consumption metric, which is not included in Section 5.5.

2. Locking the processor at its slowest speed for half the task and then switching to the

second lowest speed (e.g., 1300 MHz on the testing platform) and finishing the task

compared to the same workload broken up into 100,000 tasks that alternate between the

lowest and the second lowest speed for each task. This aims to measure the additional

cost of frequency switches, in terms of time and energy, over the first test.

3. The same as test 2, but with the second speed being the highest non-turbo frequency (e.g.,

2300 MHz on the testing platform). This aims to determine if there is a difference

between the “closest” speed and the “furthest” speed.

4. The same as test 3, but with the second speed being the turbo frequency (e.g., labeled

“2301 MHz”, which turns out to be 3300 MHz). This aims to determine if there is a

difference between the slowest speed and the highest possible speed for the processor, in

comparison with other frequency changes.

Recall from Section 4.2.1 that the Profilo kernel module provides a sysfs file called do_work,

which ensures that a task can be performed without context switches. However, when the work is

complete, the kernel module releases control back to the operating system kernel, which

consequently allows the user mode component of Profilo (Section 4.2.4) to run. Because of this


lack of control in user mode, it is possible for a competing user mode process to run, despite the

user mode component of Profilo having the highest priority (Section 4.2.4.3). Therefore, each

task in Profilo involves at least one mode switch and the possibility of context switches.

Furthermore, when a task requires a frequency change, this is done through an invocation to

scaling_setspeed (see Section 4.2.4.1) from user mode, which requires a switch to kernel mode

(it is a sysfs file) before returning back to user mode. A future improvement to Profilo is to

change the frequency from within the kernel module (in the do_work store function) and avoid

these extra mode switches and potential context switches.
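For reference, the user mode frequency change amounts to a write to that sysfs file, along the lines of the sketch below. It assumes the userspace governor is active on core 0; scaling_setspeed takes a value in kHz:

    #include <stdio.h>

    /* Request 1.2 GHz on core 0 via the userspace cpufreq governor. */
    static int set_speed(long khz)
    {
        FILE *f = fopen(
            "/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed", "w");
        if (!f)
            return -1;
        fprintf(f, "%ld", khz);
        return fclose(f);   /* the write enters the kernel by here */
    }

    int main(void)
    {
        return set_speed(1200000) ? 1 : 0;
    }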

In summary, the continuous run of test 1 makes no switches; the switching run of test 1 requires

a mode switch for every task and possibly context switches; and each task in tests 2 through 4

requires at least two mode switches (and a frequency switch), with the possibility of context

switches. With a properly configured testing environment, free of unnecessary modules, drivers,

services, and processes, the switches should be few in number (i.e., next to no context switches).

The goals of these tests are to measure the overhead of Profilo relative to Section 5.5, and then

use that data to micro-benchmark the testing platform and estimate the cost of process and

frequency switches. For these tests, a “work unit” (see Section 4.1.5) of 1 (i.e., no work) is

appropriate. The tests were run with an ops value (see Section 4.2.4.5) of 71,555,000. The large

ops value is to minimize the interference from the RAPL overflow prevention code. Bearing in

mind that it is interpreted as the approximate number of loops of the “work unit” that the

processor can perform in a second, it is multiplied by 60 and then cast to a 32-bit integer. To

prevent integer overflow of the 32-bit variable, the largest ops value is 71,582,788. The


71,555,000 value is close to this upper bound and easy to read in the script. Considering that

even in the switching version of each test, and at the lowest discrete speed, the testing platform

can perform nearly half a billion process switches, this ops value still leads to RAPL overflow

prevention code running 6 to 19 times per second, instead of once per minute. A future

improvement to Profilo is therefore to store the ops value and calculations variable as 64-bit

integers.
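The 71,582,788 bound follows directly from the 32-bit cast:

    \left\lfloor \frac{2^{32} - 1}{60} \right\rfloor
      = \left\lfloor \frac{4{,}294{,}967{,}295}{60} \right\rfloor
      = 71{,}582{,}788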

Because the first trace of every test continuously runs the respective processes without a context

switch, the workload is essentially reduced to incrementing a loop counter 100,000 times. This

occurs very quickly, without much overhead from Profilo, including RAPL overflow prevention

(because the calculations value is checked against the ops value between processes). Tests 2

through 4 split this into two processes with a speed switch in-between. However, this occurs

much faster than the finest resolution of 1 millisecond in the summary report. The energy

consumption of these tests is also below the 1 millijoule resolution of the summary report. As a

consequence, the data in Table 5.9 contains only the results of the second (switching) part of

each test. Furthermore, the results in the table represent the per-process cost. This value is

calculated by dividing the Profilo summary results by 100,002 (the first two tasks scale the

minimum and maximum speed according to the formula in Section 4.2.4.1) and then multiplying

the values by one million to yield duration in microseconds and energy consumption in

microjoules. Every test was run 10 times with the last three columns being comprised of the

mean of the samples and the standard deviation in the following format: X̄ ± δ.


Table 5.9 Switch Test Results

Test F1 (MHz) F2 (MHz) Duration (µs) PP0 (µJ) PKG (µJ)

1 (Switching) 1200 2.09 ± 0.03 7.8 ± 0.3 16.1 ± 0.5

1 (Switching) 1300 1.93 ± 0.02 7.6 ± 0.2 15.2 ± 0.3

1 (Switching) 1400 1.81 ± 0.03 7.5 ± 0.2 14.6 ± 0.4

1 (Switching) 1500 1.67 ± 0.03 7.2 ± 0.1 13.8 ± 0.2

1 (Switching) 1600 1.56 ± 0.02 7.0 ± 0.1 13.2 ± 0.1

1 (Switching) 1700 1.47 ± 0.02 6.9 ± 0.1 12.7 ± 0.2

1 (Switching) 1800 1.38 ± 0.02 6.8 ± 0.1 12.2 ± 0.2

1 (Switching) 1900 1.32 ± 0.03 6.7 ± 0.2 12.0 ± 0.3

1 (Switching) 2000 1.26 ± 0.02 6.6 ± 0.1 11.6 ± 0.2

1 (Switching) 2100 1.20 ± 0.02 6.5 ± 0.1 11.3 ± 0.2

1 (Switching) 2200 1.14 ± 0.02 6.4 ± 0.1 11.0 ± 0.2

1 (Switching) 2300 1.09 ± 0.02 6.6 ± 0.1 10.9 ± 0.2

1 (Switching) 2301 (3300) 0.76 ± 0.01 9.6 ± 0.2 12.6 ± 0.2

2 (Switching) 1200 1300 4.60 ± 0.02 28.9 ± 9.6 47.2 ± 9.5

3 (Switching) 1200 2300 6.44 ± 0.03 27.2 ± 0.2 52.8 ± 0.4

4 (Switching) 1200 2301 (3300) 4.62 ± 0.03 28.7 ± 9.7 47.1 ± 9.8

The standard deviation for the duration across all tests is below 2%. The standard deviation for

the energy consumption across the PP0/PKG domains in the first test is below 4%, with most

results being below 2%. The energy consumption for test 2 and test 4 varies by a substantial


amount, despite the fairly consistent duration. Without further tests and isolating the frequency

switches from the kernel function calls, it is unclear why this occurs. It is particularly

puzzling since the third switch test does not have a large standard deviation. However, taking

into account the means and standard deviations, the values are statistically comparable. As

discovered in Section 5.3.4, test 2 requires no voltage change, while tests 3 and 4 both require

voltage changes.

Since the first switch test takes 0.76 to 2.09 µs per process, depending on the frequency, it is

clear that this is done without a context switch, but rather with only a mode switch. In fact, when compared

to Table 5.8 at the specific frequencies, the mode switch represents a very consistent 5.9% of the

overhead. Put another way, Profilo’s overhead (without the mode switch) at each discrete speed

is equivalent to 16 times the duration of a mode switch, which seems somewhat high. This may

have to do with the high number of RAPL overflow prevention function calls generating

overhead that otherwise would not be present when Profilo has a workload. With a “work unit”

of 150, used in Chapter 6, based on the discussion from Section 5.4, the relative overhead would

likely be at least an order of magnitude smaller.

Because the mode switch represents such a consistent fraction of the per-process time, Table 5.10 estimates the energy required to perform a mode switch at each discrete frequency. Without actually performing such

a micro-benchmark directly, there are no guarantees on accuracy or precision. The last three

columns of Table 5.9 and Table 5.10 use the same format (X̄ ± δ).
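Although the derivation is not spelled out above, the Table 5.10 estimates are consistent with scaling the per-process energy of Table 5.9 by the fraction of per-process time spent in the mode switch. For example, at 1200 MHz in the PP0 domain:

    E_{\mathrm{mode}} \approx \frac{t_{\mathrm{mode}}}{t_{\mathrm{process}}} \times E_{\mathrm{process}}
      = \frac{123.1\,\mathrm{ns}}{2.09\,\mu\mathrm{s}} \times 7.8\,\mu\mathrm{J}
      \approx 459\,\mathrm{nJ}

which agrees with the 462 ± 18 nJ entry in Table 5.10.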


Table 5.10 Mode Switch Durations and Energy Estimates

Frequency (MHz) Mode Switch (ns) PP0 (nJ) PKG (nJ)

1200 123.1 ± 0.2 462 ± 18 948 ± 28

1300 113.4 ± 0.0 446 ± 12 893 ± 18

1400 105.3 ± 0.0 436 ± 14 852 ± 21

1500 98.6 ± 0.5 425 ± 8 814 ± 13

1600 92.1 ± 0.0 417 ± 5 781 ± 8

1700 86.7 ± 0.0 409 ± 6 753 ± 11

1800 81.9 ± 0.0 401 ± 6 725 ± 10

1900 78.3 ± 1.2 398 ± 9 708 ± 16

2000 73.7 ± 0.0 389 ± 8 680 ± 13

2100 70.2 ± 0.0 382 ± 5 660 ± 10

2200 67.0 ± 0.0 377 ± 6 644 ± 10

2300 64.2 ± 0.2 386 ± 6 640 ± 10

2301 (3300) 44.8 ± 0.2 566 ± 11 744 ± 14

Despite Profilo’s large overhead in the switch test, that overhead represents only 62.5% of a context switch.

So Profilo is still actually over-optimized compared to context switches between real processes

managed by a contemporary Linux kernel, especially when considering that there are additional

delays created by kernel management code. However, if a Linux environment is not contending

between different processes, a process may get a consecutive time slice, experiencing only a

mode switch in between the time slices. Therefore, with traces that result in a pseudo runqueue


(see Section 4.2.4.3) that have a consistent depth greater than 1, the high number of context

switches would result in Profilo underestimating the real-world duration/energy. Conversely, in

situations where there is only one process in the runqueue, Profilo would over-estimate the real-

world time/energy. In summary though, it is reasonable to suggest that Profilo’s overhead can be

ignored with mixed workloads.

A future improvement to Profilo is a run time option that generates a mode/context switch calibration file for each frequency. This file could subsequently be taken as input, letting Profilo add and subtract overhead duration/energy depending on the situation, yielding a sanitized and extremely accurate summary that is also precisely repeatable on subsequent runs.

It can be observed from the first test in Table 5.9 that as the frequency increases, the per-process

duration for switching goes down. Conversely, the power goes up, but the energy used goes

down due to less idle overhead, which is as expected and explained in detail in Section 5.3.

When frequency switching is added (test 2), the duration increases by a factor of 2.2, despite

running 8.3% faster for half the time. The results are even more interesting with test 3, where the

duration is 40% longer than test 2, despite the upper frequency running 77% faster! It would

seem that it takes longer to go to frequencies that are further away – so much so that the speed

boost from even the highest non-turbo frequencies cannot compensate for switch time. Even test

4 has a similar duration to test 2, despite its second (turbo) frequency being clocked over 2.5 times faster.


The situation gets worse in terms of energy consumption. For the second test, consumption

increases 3.7-fold and 2.9-fold, on average, across the respective PP0 and PKG domains.

Strangely, the energy consumption varies wildly for the second and fourth test but not for the

third test. The reason is not apparent, but taking into account the standard deviation, the values

are all comparable.

It would seem that changing the frequency in tests two through four requires a context switch;

however, this may not necessarily be true. If we take the results for the first test at the lowest

speed (1200 MHz) and second lowest speed (1300 MHz), and average the results, then subtract

the average mode switch time for both frequencies, we get an average overhead for both

frequencies of 1.89 µs per process. This would leave only 2.71 µs per process for mode/context

and frequency switching. This is notably less than the combined average ground truth context

switch time for both frequencies of 3.03 µs ± 0.15 µs. It therefore seems more plausible that

frequency changes are performed without a context switch. To be certain, investigation of the

kernel code used by the sysfs file (/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed) would

be necessary.
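Restated compactly with the values from Table 5.8 and Table 5.9:

    \mathrm{overhead} = \frac{2.09 + 1.93}{2}\,\mu\mathrm{s}
      - \frac{0.1231 + 0.1134}{2}\,\mu\mathrm{s} \approx 1.89\,\mu\mathrm{s}

    \mathrm{remaining} = 4.60\,\mu\mathrm{s} - 1.89\,\mu\mathrm{s}
      = 2.71\,\mu\mathrm{s} < 3.03 \pm 0.15\,\mu\mathrm{s}

where 3.03 ± 0.15 µs is the ground-truth context switch time averaged over the two frequencies.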

Carrying this analysis further, with this conjecture, we calculate the average mode switch time

for both frequencies, which is 0.12 µs. There are two of them: one to/from the Profilo kernel

module and user mode application, and then another to/from the sysfs file referenced above that

enables user mode applications to change the frequency. The second test takes 4.60 µs per

process, so taking into consideration the average overhead (1.89 µs) and combined mode

switches (0.24 µs), this means it takes 2.47 µs just to switch from the lowest to the second lowest


speed (1200 MHz to 1300 MHz). Similar calculations for the third test yield a substantially

longer 4.76 µs to switch from the lowest speed to the highest non-turbo frequency (1200 MHz to

2300 MHz). Oddly though, in the fourth test, where the second speed is the turbo frequency

(3300 MHz), the calculated switch duration is 3.11 µs, which is still longer than the second test,

but less than the third. The reduced time when compared to the highest non-turbo frequency may

have to do with it running 43% faster. The voltage change from 2200 MHz to 2300 MHz

(observed in Section 5.3.4) could be responsible for this increased duration. Expanding the test to

analyze each discrete frequency would be a relatively straightforward task with Profilo, but is

unfortunately out of scope for this subsection.

If the isolated frequency switch time, as a fraction of the total duration of each test, is multiplied by the respective energy consumption, the estimated energy for the second test is 15.5 ± 5.1 µJ per switch for PP0 and 25.4 ± 5.1 µJ per switch for PKG. For the third test, the

estimated results are 20.1 ± 0.1 µJ per switch for PP0 and 39.0 ± 0.3 µJ per switch for PKG. For

the fourth test, the estimated results are 19.3 ± 6.6 µJ per switch for PP0 and 31.7 ± 6.6 µJ per

switch for PKG.
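For example, the test 2 PP0 estimate follows from the isolated switch time computed above:

    E_{\mathrm{switch}} \approx \frac{t_{\mathrm{switch}}}{t_{\mathrm{total}}} \times E_{\mathrm{total}}
      = \frac{2.47\,\mu\mathrm{s}}{4.60\,\mu\mathrm{s}} \times 28.9\,\mu\mathrm{J}
      \approx 15.5\,\mu\mathrm{J}\ \mathrm{(PP0)}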

When the energy values are instead calculated by subtracting from the total energy both the average energy for the discrete speeds in the first test and the estimated energy consumption of a mode switch at the discrete speeds, the values are in close agreement. The PP0/PKG values for the second, third, and fourth tests are 20.7/30.6, 19.6/38.5, and 19.5/31.9 microjoules per switch, respectively.


Since these two sets of calculations are in the general vicinity of one another, it is safe to say that the estimates are not far from the actual measurement values.

isolated frequency switching costs, in terms of duration and energy, the Profilo kernel module

would need to be extended to add frequency switching directly (via assembly, similar to how the

RAPL MSRs are read). This would not only give more precise control to Profilo during regular

run time, it would support a frequency switching benchmark feature that generates a matrix of

duration and energy requirements from and to all of the frequencies available on the processor(s)

– an idea introduced, in concept, at the beginning of this section.

In summary, it can be observed that the cost of switching the discrete frequency at which a processor does work depends on the destination frequency and possibly the source frequency. Furthermore,

the amount of time to perform the switch is non-trivial at around one or two times that of a

context switch. The amount of energy required to make a switch appears to be even more

significant, costing roughly two or three times more energy than a context switch.

5.7 Summary

Before this chapter could delve into micro-benchmarking, the testing platform was described in

detail, including its hardware specifications and limitations. The platform was also often

compared to Intel’s latest microarchitecture to get a sense of the evolution of Intel’s implementation of the x86 instruction set architecture (ISA). The first subsection was dedicated

to the testing platform description and the tools and methodology for capturing the total system

power consumption, outside of the scope of the RAPL MSRs. The ACPI specification was also


introduced with its goals, shortcomings, nomenclature, and its relationship with modern

platforms.

Following this background information, the subsequent sections focused on micro-benchmarks,

beginning with the total system power consumption. The energy values for most of the devices in

their numerous states were evaluated. The most energy efficient configuration was used as a

yardstick to normalize the testing operating system.

The measurement values for the system in its various global states (G-states) and system states

(S-states) were studied. This led into the last, but important, evaluation of power saving idle

states: the processor idle states (C-states). The characteristics of these states were explored and

measured, along with how Linux makes use of them. In examining these states, the justification

for why idle states are as important for energy efficiency as the active performance (P-states)

states was made. As an example, a hybrid state, available only on multi-core/processor platforms, allows work to be conducted on one or more cores while the remaining cores stay in their deepest sleep state (with a long wake-up latency). This hybrid state was found to save a substantial amount of power compared to keeping all cores in an idle state with a shorter wake-up latency (to meet response-time requirements), where no useful work can be conducted by any of the cores.

Finally, the active DVFS states (P-states) were investigated and quantified. Voltage switches

were discernible from the power characteristics, giving rise to the observation that Intel has

suboptimal voltage mapping, particularly for the lowest frequencies (at least on the Ivy Bridge


microarchitectures). It was also observed through Profilo that power-gating can offer sizable

power savings.

To complete the analysis of power states, the workload was examined and the rationale for

choosing the particular “work unit” in the next chapter was explained with observed benchmark

data. The “work unit” value of 150, combined with unique loop values at each discrete speed,

was chosen for the testing platform to create 10 millisecond time slices. The focus then moved to

mode switches and context switches, which occur between time slices. The time and energy costs

of mode, context, and frequency switches were quantified. Improvements to Profilo were noted

for future work to make it an even more accurate model of real world systems.

In conclusion, this chapter provides the background information and up-to-date analysis of

Intel’s recent instruction set architecture. It also qualifies Profilo, describing its usefulness and

limitations. Armed with this knowledge, almost any kind of scheduler and speed scaler can be

synthesized, tested, and analyzed. Some interesting examples are explored in the next chapter.


Chapter 6: Profiling Coupled and Decoupled Speed Scalers

Having described the implementation details of Profilo in Chapter 4 and the hardware

characteristics of the testing platform in Chapter 5, this chapter builds on that foundation by

profiling both established and theoretical speed scaling schedulers from Chapter 2 using five

different job batches. Throughout this chapter, a “work unit” of 150 is implied, based on the

discussion from Section 5.4. It must also be stressed that only the non-turbo discrete speeds of

the test platform (described in Section 5.1) are used.

6.1 Workloads

Before any scheduler or speed scaling policy can be evaluated, several workloads need to be

defined that ideally represent the gamut of valid workloads that a real system may encounter. To

utilize every available (non-turbo) discrete speed, using job-count based speed scaling, on the

test platform (described in Section 5.1), each workload needs to contain at least 12 jobs. For

simplicity, none of the workloads contain more than 12 jobs.

Workload 1 is a batch of homogeneous jobs that each take about 1-2 seconds to complete on the

test platform, depending on the speed of execution. This workload is somewhat contrived but

provides a simple baseline for comparison with the other heterogeneous workloads. Workload 2

is a batch of jobs whose sizes differ additively in a simple arithmetic progression. The smallest

job takes less than a second while the largest job takes 5-12 seconds, depending on the speed of

execution. Workload 3 is a batch of jobs whose sizes differ by successive factors of 2. The


smallest job takes tens of milliseconds while the largest job takes 50-100 seconds to complete,

depending on the execution speed, which alone accounts for half of the total system work.

The final two workloads are composed of heterogeneous jobs with less variability than the

second and third workloads. Workload 4 is calculated by the formula b³ + 1399, where b is a

base value that starts from 1 and arithmetically increases by 1.5, with the last job having a base

value of 17.5. The smallest job takes roughly half a second, while the largest job takes 2-3

seconds depending on the execution speed. Workload 5 is calculated by the formula b⁴ + 4999,

where b is a base value that starts from 1 and arithmetically increases by 1, with the last job

having a base value of 12. Depending on the execution speed, the smallest job takes 1-2 seconds,

while the largest job takes 6-12 seconds.

6.2 Building the Trace Files

As described in Section 2.3, there are three speed scaling schedulers that are of interest to

evaluate with Profilo: the YDS policy, which minimizes power consumption; the PS policy,

which epitomises fairness; and the FSP-PS decoupled system, which is provably efficient and

has simulation results that suggest a noteworthy performance advantage over PS [21]. These

policies are fundamentally different and illustrate the generality and flexibility of Profilo.

The first step to profiling these schedulers is to create the trace file using the workloads

described in the previous section. The workloads vary from 35,214 to 409,500 loops of the

“work unit” (150). PS time slices can execute 21 to 41 loops depending on the discrete speed


(see Table 5.7), which means that the PS trace files will have 1087 to 16,839 lines. Since the

other policies also require knowledge of PS execution, manually generating the trace files would

be tedious and error-prone (see Appendix C for a simple example). Instead, small utility

applications were written for each of the speed scaling schedulers, taking into account the PS

time slice loop values for each discrete speed, that accept job number and their respective sizes

as input, and generate the comma-separated-values (CSV) trace file as output.

6.2.1 The PS Generator

The PS generator contains a speed array equal in size to the number of (non-turbo) discrete

speeds, where each element is the loop value corresponding to a PS time slice in Table 5.7. It

also has a tasks array that holds each of the jobs and their respective size (loop values) in

increasing order to simplify the code in the loop block below. After generating the trace header,

the trace file is scaled with a single-unit task at speed 10 and another at speed 121. Then each job

can run at a speed equal to ten times the number of jobs left (e.g., the number of non-zero

elements in the tasks array) in the system. This results in the correct job-based discrete speed

scaling policy according to the formula in Section 4.2.4.1, while preventing any task from

running at speed 121, which is mapped to the turbo frequency (described with its caveats in

Section 5.1).

Next the utility enters a loop where each iteration writes a line to the trace file, going through the

elements of the tasks array. The amount of work performed for each task is found in the speed

array, where the index is the number of jobs left, unless the value of this element is larger than


the amount of work left, in which case this latter value is used. When there are no more jobs left

in the system, the utility terminates, and the trace file is complete.
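A minimal C sketch of this loop is given below. It is illustrative only: the line format is simplified to a speed,loops pair, the two single-unit scaling tasks are omitted, and the job sizes are hypothetical placeholders:

    #include <stdio.h>

    #define NSPEEDS 12
    #define NJOBS   12

    int main(void)
    {
        /* PS time-slice loop values per non-turbo speed (Table 5.7). */
        int slice[NSPEEDS] = {21, 23, 25, 27, 28, 30, 32, 34, 36, 37, 39, 41};
        /* Job sizes in loops of the "work unit", in increasing order
         * (hypothetical homogeneous batch, for illustration only). */
        long tasks[NJOBS] = {3000, 3000, 3000, 3000, 3000, 3000,
                             3000, 3000, 3000, 3000, 3000, 3000};
        int left = NJOBS;

        while (left > 0) {
            for (int i = 0; i < NJOBS; i++) {
                if (tasks[i] == 0)
                    continue;
                long work = slice[left - 1];          /* one PS time slice */
                if (work > tasks[i])
                    work = tasks[i];                  /* or what is left   */
                printf("%d,%ld\n", 10 * left, work);  /* speed = 10 x jobs */
                tasks[i] -= work;
                if (tasks[i] == 0)
                    left--;                           /* departure: slower */
            }
        }
        return 0;
    }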

6.2.2 The FSP-PS Generator

The FSP-PS generator has mostly the same code prior to the loop as the PS generator. The major

exception is a variable called fspWork and an fspTasks array that is an exact copy of the tasks

array. The loop runs through a “virtual PS” routine (described in Section 2.3), which is similar to

the PS loop above, except instead of writing the values, it adds the work to be done to the

fspWork variable.

When a virtual process is complete, a speed change is imminent, so an inner loop begins. The

inner loop goes through the fspTasks array, writing a line to the trace file whose amount of work is the smaller of the work the process has left and the fspWork variable, and decrementing both by the value used. If the fspWork variable is larger, the inner loop

reiterates with the next process in fspTasks. When fspWork is zero, the inner loop terminates and

the outer loop moves onto a new iteration.

When the number of processes is exhausted, all of the elements in both the tasks and fspTasks

arrays will be zero. The trace file will then be complete, having devoted continuous execution to

each process (in SRPT order, since this is a batch workload) breaking up the tasks across

multiple lines where a PS speed change would have occurred.
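Reusing the scaffolding of the PS sketch above, the FSP-PS logic might look like the following (again illustrative, with the same simplified line format and hypothetical job sizes):

    #include <stdio.h>

    #define NSPEEDS 12
    #define NJOBS   12

    int main(void)
    {
        int  slice[NSPEEDS] = {21, 23, 25, 27, 28, 30, 32, 34, 36, 37, 39, 41};
        long tasks[NJOBS]   = {3000, 3000, 3000, 3000, 3000, 3000,
                               3000, 3000, 3000, 3000, 3000, 3000};
        long fspTasks[NJOBS];
        long fspWork = 0;
        int left = NJOBS, j = 0;

        for (int i = 0; i < NJOBS; i++)
            fspTasks[i] = tasks[i];           /* exact copy, SRPT order */

        while (left > 0) {
            for (int i = 0; i < NJOBS; i++) {
                if (tasks[i] == 0)
                    continue;
                long work = slice[left - 1];
                if (work > tasks[i])
                    work = tasks[i];
                tasks[i] -= work;             /* "virtual PS": nothing written */
                fspWork  += work;
                if (tasks[i] == 0) {          /* virtual completion: flush the  */
                    int speed = 10 * left;    /* accumulated work at this speed */
                    while (fspWork > 0) {
                        long w = fspTasks[j] < fspWork ? fspTasks[j] : fspWork;
                        printf("%d,%ld\n", speed, w);
                        fspTasks[j] -= w;
                        fspWork     -= w;
                        if (fspTasks[j] == 0)
                            j++;              /* next-shortest job gets service */
                    }
                    left--;
                }
            }
        }
        return 0;
    }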


6.2.3 The YDS Generator

Recall from Section 2.3 that the offline YDS algorithm is recursive, requiring knowledge of the

highest intensity jobs as it successively steps down the speed (if any steps are necessary).

However, the EDF job execution order and the rounding up of the chosen execution speed (to

meet the deadlines) allow the job size variability to be enormous before the highest intensity job

becomes anything other than the last job of a single batch workload (i.e., where the jobs in a

system all have the same arrival time). Despite the large variability of the five chosen workloads,

none of them have sufficiently large variability to overcome this characteristic of batch

workloads. In other words, for these five workloads, the YDS algorithm arrives at a speed based

on the deadline of the largest job, which is the same as for the entire workload.

As a result, for these specific workloads, a simplified, non-recursive implementation suffices to

produce the correct YDS traces. The implementation begins the same as the FSP-PS generator,

swapping the fspWork variable for an ops and switches variable. A similar loop to the “virtual

PS” routine runs through the round-robin PS algorithm, accumulating the number of loop

operations and context/mode switches performed. When there are no more virtual processes left

(i.e., all of the elements of the tasks array are empty), the speed array is traversed, looking for the

first element that is greater than or equal to the ceiling of ops/switches. The index value of this

element (plus one) corresponds to the execution speed for all of the jobs in that specific

workload. The trace file is then built giving each process full execution to completion in EDF

order, at the calculated minimum speed.
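The speed selection step of this simplified generator then reduces to a few lines, sketched here with hypothetical ops and switches tallies:

    #include <stdio.h>

    #define NSPEEDS 12

    int main(void)
    {
        int slice[NSPEEDS] = {21, 23, 25, 27, 28, 30, 32, 34, 36, 37, 39, 41};
        /* Hypothetical tallies from a virtual-PS pass over one workload. */
        long ops = 36000, switches = 1100;
        long target = (ops + switches - 1) / switches;  /* ceil(ops/switches) */

        int s = 0;
        while (s < NSPEEDS - 1 && slice[s] < target)
            s++;
        /* Each job is then emitted, in EDF order, at speed 10 * (s + 1). */
        printf("YDS speed: %d\n", 10 * (s + 1));
        return 0;
    }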


6.3 Profiling Results

Profilo was run with each of the traces using a “work unit” of 150 (see Section 5.4) and a “primes per second” argument (see Section 4.1.5) of 320,000. The latter argument is based on the slowest discrete speed performing a single run of the “work unit” in 468 µs (see Table 5.7), which means that it would perform (1 / (468 × 10⁻⁶)) × 150 ≅ 320,513 primes per second.

The slower speed is used for the calculation to establish a more conservative “primes per

second” value that guards against RAPL overflow (see Section 3.3).

The results of all the traces are shown in Table 6.1. The columns list the workload batch, the

evaluated policy, the total elapsed wall-clock time to complete the workload under the given

policy, the mean response time (E[T]) for the 12 jobs, and the amount of energy (in joules)

consumed in the PP0 (cores) and PKG (entire CPU package) domains. To clarify, the mean

response time is the average duration, from arrival to departure, of each job in the respective

workload. A graphical summary of the results is in Figure 6.1.


Table 6.1 Profiling Results

Workload Scheduler Duration (s) MRT (s) PP0 (J) PKG (J)

1 PS 14.57 14.55 76.8 131.5

1 FSP-PS 14.57 7.89 76.8 131.6

1 YDS 14.55 7.88 76.5 130.9

2 PS 46.23 30.16 200.0 373.0

2 FSP-PS 46.21 16.33 199.4 372.4

2 YDS 45.80 17.81 198.8 369.9

3 PS 166.15 38.10 562.5 1184.4

3 FSP-PS 166.08 25.43 560.3 1180.8

3 YDS 163.12 27.15 560.9 1170.0

4 PS 10.71 6.57 46.5 86.6

4 FSP-PS 10.70 3.61 46.5 86.7

4 YDS 10.34 4.01 45.3 84.1

5 PS 37.29 21.95 160.4 300.6

5 FSP-PS 37.27 12.12 160.7 300.9

5 YDS 35.43 13.42 155.8 289.3

As observed, the total duration of each workload is similar across all of the schedulers. This

makes sense since the job deadlines are all set by the PS policy, with job departures that are in

the same order as under SRPT with these simple (single batch) workloads [25]. Theoretically, the

overall duration of these algorithms should be identical; however, the YDS traces for workloads 4


and 5 are 3.5% and 5% shorter because the execution speed is rounded up to the discrete

frequency necessary to meet the job deadlines, which favours these particular workloads a little

more than the others. The total execution time of a decoupled speed scaling system, in theory,

ought to be set by the speed scaler, regardless of the decisions the scheduler makes; however, on

real systems, there are costs for mode/context switches (quantified in Sections 5.5 and 5.6),

which are the domain of the scheduler. The benefits of having fewer process switches with the

FSP-PS scheduler can be observed by the fact that when a difference in duration is observed, it

tends to favour the FSP-PS scheduler. A similar trend exists with the energy consumption,

although it is not as obvious due to the lower resolution and higher jitter of the RAPL counters,

the cause for which is explained in Section 3.3.

With respect to the individual jobs, the mean response time improvements of FSP-PS and YDS

over PS are dramatic. In Workload 1, where the jobs are homogeneous, the mean response time of

PS is nearly twice as long. That is because PS keeps all the jobs in the system until the end, when

the jobs start leaving the system one at a time. As the jobs depart, the execution speed decreases

and the amount of work per time slice is reduced. Since the difference between the slowest and

fastest (non-turbo) discrete speeds on the test platform is less than a factor of two, it is possible

that the last jobs to leave the system require an extra time slice, on top of waiting for the jobs

before them, and having their service rate reduced. If the workload is sufficiently large though,

this exit behaviour will have a relatively small impact. In the case of Workload 1, this impact is

less than 1% of the total duration of the workload since each job exits without requiring another

time slice.


The improvements to mean response time for the heterogeneous workloads under FSP-PS and

YDS relative to PS still exhibit a 29% to 46% reduction. While the FSP-PS and YDS strategy of

exclusive service to one job at a time results in more timely departures, the advantage becomes

less pronounced when the final job accounts for a greater portion of the total execution time.

With Workload 1, the mean response time of the YDS policy is almost the same as FSP-PS. For

the remaining workloads, FSP-PS reduces the mean response time by 6% to 10% in comparison

to YDS. This improvement is due to the higher service rate that FSP-PS invokes early in the

workload compared to the “blended average” service rate of YDS. In principle, YDS optimizes

for energy efficiency, losing on the mean response time compared to FSP-PS.

The energy optimization of YDS is barely discernible with a homogeneous workload, as is the

case for Workload 1. This should come as no surprise since the service rates of these policies are

almost the same in this case, except for at the very end, when the jobs under PS (and “virtual PS”

for FSP-PS) sequentially terminate. This exit behaviour, as described above as contributing to

less than 1% of the total workload runtime, affords YDS slight energy consumption reduction.

With the other (heterogeneous) workloads, the energy consumption is reduced by 1% to 4%,

which is notable considering YDS does this while completing the entire workload in 1% to 5% less

time.

Taking into consideration all of these metrics, it can be observed that FSP-PS dominates PS in

the sense that it performs at least as well or better than PS. Its greatest strength is its dramatic


improvement in mean response time, reducing it by 33% to 46% on these workloads, with no

disadvantages compared to PS.

When comparing FSP-PS with YDS, it can be observed that Workload 1, with no job variability

(i.e., homogeneous), results in comparable metrics due to the mostly similar execution speed and

power profiles. When the workload is heterogeneous, FSP-PS always has a distinct mean

response time advantage over YDS of 6% to 10%. To a lesser degree, YDS instead improves the

overall workload execution time and power efficiency. The results suggest that YDS only

appreciably improves on power efficiency with workloads that have medium job size variability.

Workload 2, with low job size variability, results in simple, relatively high service rates for all of

the policies, similar to the homogenous case. Workload 3, with its high job size variability,

creates a situation with the last job consuming a substantial amount of the overall execution time,

resulting in a relatively low service rate for all of the policies.


Figure 6.1 Profiling Graph (Workloads 1-5)

However, Workloads 4 and 5, with their intermediate job size variability, result in the greatest

differences between FSP-PS and YDS. This can be visually observed in Figure 6.2. In relation to

one another, FSP-PS reduces the mean response time by 10%, and YDS reduces the overall

execution time by 5%, while also reducing the overall energy consumption by 3-4%. The fusion

of high and low service rates (with a PS speed scaling policy) drives the differences between

these two policies. While FSP-PS reduces the response time of the smallest jobs by running at

the higher frequencies, YDS chooses a single blended service rate that optimizes for energy

efficiency, making no further frequency switches, and results in the longer jobs having a shorter



response time than with FSP-PS. This combined with fewer context switches (equal to the

number of processes) results in the entire workload completing sooner.

Figure 6.2 Profiling Graph (Workloads 4-5)

6.4 Summary

This chapter showcased Profilo using five batch workloads under the PS, decoupled FSP-PS, and

YDS speed scaling scheduling policies described in Chapter 2. The implementation details of the

trace generators were described along with the arguments used with Profilo. The results were

broken down in terms of overall duration, mean response time of jobs, and energy consumption

across two power domains. FSP-PS was found to match or improve upon PS across every



workload and metric. FSP-PS was also shown to have a clear mean response time advantage over

YDS in all heterogeneous workloads. The strengths of YDS were shown to be somewhat minor

compared to FSP-PS, with the differences most evident on the workloads with medium job size

variability. Under those workloads, YDS finished the overall batch sooner and with less energy

than FSP-PS – and with a dramatic improvement in mean response time relative to PS.


Chapter 7: Conclusions and Future Work

This chapter summarizes the information presented in this thesis, the conclusions that can be

made from its work, and the contributions it adds to the field. Future research and improvements

are suggested before the chapter draws to a close.

7.1 Thesis Summary

This thesis can be broken down into four major parts:

1. The motivation for effective speed scaling policies, and the difficulties of bridging

between theoretical work and systems work.

2. Designing and implementing a profiler using the features of modern processors to gather

accurate, high-resolution, measurement data.

3. Describing and quantifying the energy usage and overheads of modern micro-

architectures.

4. The experimental evaluation of theoretical and practical speed scaling schedulers.

7.1.1 The Importance of Speed Scaling Scheduling Policies

The thesis begins by describing the need for energy efficient computing from the perspectives of

both theoretical and systems communities. The gaps that exist between these communities and

the challenges of bridging them are also described in general and specific terms. Fortunately,

modern processors now have features, aimed at governing energy consumption, that integrate


efficient and precise energy meters. These features are described with the explicit intent of using

them to test theoretical speed scaling solutions in a simple but useful way.

7.1.2 Building a Profiler

Chapter 4 of this thesis presents the design and implementation of Profilo. The design decisions

that led to the current implementation are justified. The implementation is also described in

detail. Finally, the strengths and weaknesses of Profilo are discussed through qualitative and

quantitative evaluations.

7.1.3 Examining Modern Architectures

As explained in the early chapters, examining modern architectural features and subsequently quantifying them is not only important for the design of speed scaling schedulers, it is also a prerequisite for interpreting profiling results. Chapter 5 of this thesis is devoted to this micro-benchmarking objective.

7.1.4 Experimental Evaluation of Speed Scaling Schedulers

The final objective of the thesis is to determine the impact of speed scaling scheduling policies

from a performance and energy perspective. This part of the thesis makes use of Profilo to

examine a few policies and make comparisons with the most commonly implemented policy on

current operating systems.


7.2 Conclusions

The experimental systems work in this thesis leads to the following conclusions:

• Processor idle states can offer greater relative power savings than DVFS, while DVFS

can offer greater absolute power savings. Both are equally significant, and therefore

important to incorporate into speed scaling schedulers.

• On multicore systems, idling all cores in shallow or intermediate sleep states (e.g., to save some power before another batch of jobs arrives, or to meet a latency service level) can consume more power than keeping one core active (and able to do work) with the rest in their deepest idle state. Other win-win hybrid states exist.

• The delay and energy consumption of mode, context, and frequency switches are non-negligible; their costs are evident even in small workloads.

• The experimental evaluation of PS shows that decoupled speed scaling (specifically FSP-PS) drastically improves the mean response time of jobs, with a small but measurable improvement in power savings and batch execution time.

• The experimental evaluation of YDS shows that FSP-PS has a better mean response time,

but under certain workloads, YDS is better for energy efficiency and batch execution

time.


7.3 Relevance and Contributions

Speed scaling scheduling policies are numerous and impactful in terms of performance and

energy efficiency. Understanding the characteristics and features of modern processors fosters

the development of more effective policies. Being able to test even purely theoretical policies on

real hardware with performance and energy metrics is something that, to my knowledge, was not possible until the implementation of the profiler described in this thesis.

The following contributions are made by this thesis:

• Micro-benchmarks that describe and quantify important behaviours and states of modern processors, leading to improved and better-suited scheduling and speed scaling policies.

This information also provides the knowledge necessary to properly analyze the results

that are generated from the profiler.

• A profiler that makes it easy to perform controlled workloads made up of precise units of

work at defined speeds, and produces high-resolution timing and energy data broken

down by process and workload.

• Empirical data from theoretical speed scaling scheduling policies using a modern

processor with detailed analysis that makes direct comparisons to the most common

policy on contemporary operating systems.

7.4 Future Work

There are several possible future directions to build upon the work in this thesis:

• Improve Profilo:


o Increase the accuracy and reduce the jitter of the RAPL energy counters by

applying all of the techniques described in Section 3.3.

o Implement support for a profiling argument that points to a calibration file containing mode/context switch duration/energy costs for every discrete speed (a hypothetical representation is sketched after this list). The summary results at the end of profiling can then be corrected to represent a true production operating system result that has neither profiling overhead nor over-optimizations.

o Expand the benchmark runtime mode:

§ Accept an optional argument indicating the consecutive prime numbers to

find, and then output the amount of time and energy it took to do so at

each discrete speed.

§ Accept optional “time slice” and “loop tolerance” arguments that can be

used to determine the ideal “work unit” for a desired “time slice” given the

“loop tolerance”.

§ Mode and context switch benchmark that calculates the duration and

energy cost per switch.

§ Frequency switch benchmark, with a source and target frequency, which

calculates duration and energy cost per switch, performed directly without

a mode switch.

§ Idle benchmark that calculates the energy consumption, exit latency, and

target residency of every available C-state as well as the busy-wait power

rating for every available speed.

o Assimilate the idler utility into Profilo as an additional runtime mode.


• Support multiple logical processors:

o This would allow the hybrid C-states from Section 5.3.3.2 to be studied.

o This would enable multiprocessor schedulers to be profiled. Modern operating systems continuously load balance logical processors: a scheduler is assigned to each logical processor and makes decisions without any knowledge of the other cores. This is made worse by the load balancer, which moves processes between logical processors with no regard to topology. It may move a process to an adjacent virtual core (e.g., via simultaneous multithreading) that shares L1/L2 caches, to an adjacent physical core that shares only L3, or to a different package (on multi-CPU systems) that shares only RAM. This is suboptimal because it causes unnecessary cross-talk, as well as code and data cache misses that hurt performance and energy efficiency.

• Support idle states and arrival times in the trace file.

o This would allow full versatility in workload specification, enabling dynamic/online scheduling policies to be profiled.

• Profile and analyze more speed scaling policies.
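As a concrete illustration of the calibration file idea above, the following is a minimal sketch of how one calibration record might be represented in C. The struct name and fields are hypothetical, invented for illustration; Profilo does not currently define this format.

struct switch_costs {
    unsigned int frequency_mhz;   /* discrete speed this record applies to */
    double mode_switch_us;        /* mode switch duration at this speed */
    double mode_switch_uj;        /* mode switch energy cost */
    double context_switch_us;     /* context switch duration */
    double context_switch_uj;     /* context switch energy cost */
};

One record per discrete speed would be loaded from the calibration file, and the corresponding costs subtracted from the profiling summary.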


Appendix A

Since there is no formal way of accessing MSRs in kernel mode, retrieving the RAPL energy counters requires the inline assembly shown in Figure A.1.

Figure A.1 Inline Assembly to Read MSRs

The inline assembly in Figure A.1 begins with the keyword volatile, which tells the compiler that the assembly has side effects and must not be optimized away. The second line is the custom assembly code, which runs only the rdmsr instruction. Before it executes, the compiler emits code that loads the input variable (ecx) from line 4 into the ECX GPR. Line 3 instructs the compiler to emit code, after the custom assembly, that moves the contents of the EAX register into the output variable (eax). The final line tells the compiler which register(s) it cannot use because they get clobbered by the custom assembly. In this case, it is the EDX GPR, which receives the reserved upper portion of the RAPL MSRs for the PP0/PKG energy counters.

asm volatile(
    "rdmsr;"        // assembly code
    : "=a" (*eax)   // output: read from eax
    : "c" (ecx)     // input: load into ecx
    : "%edx"        // clobbered register
);
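To show how this snippet fits into a measurement routine, the following is a minimal kernel-mode sketch built around the same inline assembly. The MSR addresses (0x606 and 0x611) and the energy status unit field are documented in the Intel SDM [69]; the helper names and the microjoule conversion are hypothetical, not part of Profilo, and stdint.h types are used for brevity.

#include <stdint.h>

#define MSR_RAPL_POWER_UNIT   0x606  /* RAPL unit definitions */
#define MSR_PKG_ENERGY_STATUS 0x611  /* package-level energy counter */

/* Hypothetical wrapper around the inline assembly of Figure A.1:
 * returns the low 32 bits (EAX) of the MSR selected by ecx.
 * Must run in kernel mode (ring 0), or rdmsr faults. */
static inline uint32_t read_msr_low(uint32_t ecx)
{
    uint32_t eax;
    asm volatile(
        "rdmsr;"
        : "=a" (eax)
        : "c" (ecx)
        : "edx");
    return eax;
}

/* Convert the raw counter to microjoules without floating point
 * (the FPU is normally unavailable in kernel mode). Bits 12:8 of
 * MSR_RAPL_POWER_UNIT hold the energy status unit (ESU); each
 * count is 1/2^ESU joules (2^-16 J by default on recent CPUs). */
static uint64_t pkg_energy_microjoules(void)
{
    uint32_t esu = (read_msr_low(MSR_RAPL_POWER_UNIT) >> 8) & 0x1F;
    uint64_t raw = read_msr_low(MSR_PKG_ENERGY_STATUS);
    return (raw * 1000000ull) >> esu;
}

Energy over an interval would then be measured as the (wrap-aware) difference between two such readings.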


Appendix B

Profilo uses the trial division primality test algorithm for its workload. This is the simplest of primality algorithms, with an exponential running time when measured in terms of the size of the input in bits. The C code of the workload is shown in Figure B.1.

Figure B.1 Workload C Programming Language Code

The code is composed of three progressively nested for-loops. The outermost for-loop

implements the loop value described in Section 4.2.1 and analyzed in Section 5.4. Starting from

2, the consecutive prime number defined by work_unit¹ is found a total of loop_value times.

The trial division primality algorithm begins with the middle for-loop. This for-loop increments

the candidate prime number (defined as i) until the consecutive number of prime numbers is

¹ The relationship between the “work unit” and the loop value is explained in Section 4.1.5.

for (j = 0; j < loop_value; j++) {
    count = 2;
    for (i = 3; count <= work_unit; i++) {
        for (check = 2; check <= i - 1; check++) {
            if (i % check == 0)
                break;
        }
        if (check == i)
            count++;
    }
}


found, which occurs when count exceeds work_unit. The reason the less-than-equal operator is

used is to ensure that a work_unit of 1 is not the same as a work_unit of 2. The nested code of

this middle for-loop contains the innermost for-loop that increments a check variable, starting at

2 and terminating when it is equal to the candidate prime number. Each iteration performs a

modulo operation to see if check is a factor of the candidate prime number. If the remainder is

zero, the loop is prematurely broken. The subsequent if-statement determines if the number is

prime by checking if the previous loop was prematurely broken. If it was, the number is not

prime and another iteration of the middle for-loop begins. If the check value is equal to the candidate prime number, then the candidate is indeed prime, so count is incremented.

A future improvement to this workload is to use an optimized trial division algorithm, which prevents the divisor (in the modulo operation) from exceeding the square root of the candidate prime number, similar to the sieve of Eratosthenes, bearing in mind that the latter still has a better theoretical complexity [41]. In addition to improving the runtime (which is not really a goal), the optimized trial division algorithm would better utilize the arithmetic logic unit (ALU). A sketch of this optimization follows.
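The following is a minimal sketch of that optimization, assuming the same j, i, check, count, work_unit, and loop_value variables as Figure B.1. It additionally skips even candidates; it is illustrative only, not the code used in the experiments.

for (j = 0; j < loop_value; j++) {
    count = 2;  /* same counting convention as Figure B.1 */
    for (i = 3; count <= work_unit; i += 2) {  /* even numbers > 2 are composite */
        int is_prime = 1;
        /* A divisor above sqrt(i) cannot reveal a new factor,
         * so stop once check * check exceeds i. */
        for (check = 3; check * check <= i; check += 2) {
            if (i % check == 0) {
                is_prime = 0;
                break;
            }
        }
        if (is_prime)
            count++;
    }
}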


Appendix C

As a simple example, imagine a workload with three jobs named P1, P2, and P3 that perform

1000, 2000, and 3000 units of work, respectively, at the lowest discrete speed, with a scheduler

that executes 1000 units of work before preempting. The PS trace file may look like Figure C.1.

Figure C.1 PS Trace File

The first line is the header, which is ignored by Profilo. The second line, with job P0, scales the

speed using the formula in Section 4.2.4.1 so that speed 2 is mapped to the highest available

frequency, and speed 1 is mapped to the lowest available frequency. Running this trace through a

debug/verbose version of Profilo would produce output like that shown in Figure C.2.

Process,Work,Speed
P0,1,2
P1,1000,1
P2,1000,1
P3,1000,1
P2,1000,1
P3,1000,1
P3,1000,1
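As an aside, the speed-to-frequency mapping can be sketched as a simple linear interpolation. The exact formula is the one in Section 4.2.4.1; the helper below is hypothetical and only assumes that the lowest and highest trace speeds are pinned to the lowest and highest available frequencies, which reproduces the 1200 MHz and 2301 MHz endpoints seen in Figure C.2.

/* Hypothetical linear mapping from a trace speed to a frequency:
 * speed_min maps to freq_min_mhz and speed_max to freq_max_mhz. */
static unsigned int speed_to_mhz(unsigned int speed,
                                 unsigned int speed_min,
                                 unsigned int speed_max,
                                 unsigned int freq_min_mhz,
                                 unsigned int freq_max_mhz)
{
    if (speed_max == speed_min)
        return freq_max_mhz;  /* degenerate trace with a single speed */
    return freq_min_mhz + (speed - speed_min) *
           (freq_max_mhz - freq_min_mhz) / (speed_max - speed_min);
}

In practice, the result would be snapped to the nearest discrete P-state supported by the processor.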


Figure C.2 Verbose Profilo Summary for PS Example

If we consider the trace file as a batch workload that arrives at the same time, we would calculate

the mean response time using only the time the jobs ended. Because we do not count job P0, the

value would be (1 + 4 + 6)/3 ≈ 3.7 seconds. Using the same example, an FSP-PS trace file may look

like Figure C.3.

Figure C.3 FSP-PS Trace File

Starting Profilo virtual kernel.
P0: Started...
P0: Doing 1 unit(s) of work at 2301 MHz.
P0: Terminating...
P1: Started...
P1: Doing 1000 unit(s) of work at 1200 MHz.
P1: Terminating...
P2: Started...
P2: Doing 1000 unit(s) of work at 1200 MHz.
P3: Started...
P3: Doing 1000 unit(s) of work at 1200 MHz.
P2: Doing 1000 unit(s) of work at 1200 MHz.
P2: Terminating...
P3: Doing 1000 unit(s) of work at 1200 MHz.
P3: Doing 1000 unit(s) of work at 1200 MHz.
P3: Terminating...
The profiling took 6.00 seconds.
Process P0 took 0.000 seconds. It started at 0.000s and ended at 0.000s.
Process P1 took 1.000 seconds. It started at 0.000s and ended at 1.000s.
Process P2 took 3.000 seconds. It started at 1.000s and ended at 4.000s.
Process P3 took 4.000 seconds. It started at 2.000s and ended at 6.000s.
The processor cores (PP0) consumed 18.000 joules (3.000 watts).
The processor package (PKG) consumed 39.000 joules (6.500 watts).

Process,Work,Speed
P0,1,2
P1,1000,1
P2,2000,1
P3,3000,1


The jobs leave the system in the same order as PS, but instead get continuous execution until

complete, without preemption. The corresponding release/non-verbose version of Profilo would

produce an output similar to Figure C.4.

Figure C.4 Profilo Summary for FSP-PS Example

Calculating the mean response time in the same way as for PS would result in (1 + 3 + 6)/3 ≈ 3.3

seconds. This is a 9% reduction in the mean response time. For the sake of simplicity, the speed

scaling policy in this example is static, and PS finishes the workload in only 6 time slices. As a

result, most of the popular schedulers would have the same overall profiling time and energy

consumption. However, in job-count-based speed scaling workloads, such as the suite of workloads described in Section 6.1, there would certainly be differences in mean response time, overall profiling time, and energy consumption, driven by the chosen execution frequencies and the number of mode/context switches.

Starting Profilo virtual kernel.
The profiling took 6.00 seconds.
Process P0 took 0.000 seconds. It started at 0.000s and ended at 0.000s.
Process P1 took 1.000 seconds. It started at 0.000s and ended at 1.000s.
Process P2 took 2.000 seconds. It started at 1.000s and ended at 3.000s.
Process P3 took 3.000 seconds. It started at 3.000s and ended at 6.000s.
The processor cores (PP0) consumed 18.000 joules (3.000 watts).
The processor package (PKG) consumed 39.000 joules (6.500 watts).


References

[1] M. Agrawal, N. Kayal, and N. Saxena, “PRIMES is in P,” Ann. Math., pp. 781–793,

2004.

[2] V. Aiyar, “Sundaram’s Sieve for Prime Numbers,” Math. Stud., vol. 2, no. 2, p. 73, 1934.

[3] S. Albers, “Energy-Efficient Algorithms,” Commun. ACM, vol. 53, no. 5, pp. 86–96,

2010.

[4] S. Albers and A. Antoniadis, “Race to Idle: New Algorithms for Speed Scaling with a

Sleep State,” ACM Trans. Algorithms, vol. 10, no. 2, p. 9, 2014.

[5] S. Albers, F. Müller, and S. Schmelzer, “Speed Scaling on Parallel Processors,”

Algorithmica, vol. 68, no. 2, pp. 404–425, 2014.

[6] L. Andrew, M. Lin, and A. Wierman, “Optimality, Fairness, and Robustness in Speed

Scaling Designs,” in Proceedings of ACM SIGMETRICS, 2010, pp. 37–48.

[7] A. Atkin and D. Bernstein, “Prime Sieves using Binary Quadratic Forms,” Math.

Comput., vol. 73, no. 246, pp. 1023–1030, 2004.

[8] B. Avi-Itzhak and H. Levy, “On Measuring Fairness in Queues,” Adv. Appl. Probab., pp.

919–936, 2004.

[9] N. Bansal, H.-L. Chan, and K. Pruhs, “Speed Scaling with an Arbitrary Power Function,”

in Proceedings of ACM-SIAM, 2009, pp. 693–701.

[10] N. Bansal and M. Harchol-Balter, “Analysis of SRPT Scheduling: Investigating

Unfairness,” in Proceedings of ACM SIGMETRICS, 2001, vol. 29.


[11] N. Bansal, T. Kimbrel, and K. Pruhs, “Speed Scaling to Manage Energy and

Temperature,” J. ACM, vol. 54, no. 1, p. 3, 2007.

[12] N. Bansal, K. Pruhs, and C. Stein, “Speed Scaling for Weighted Flow Time,” SIAM J.

Comput., vol. 39, no. 4, pp. 1294–1308, 2009.

[13] R. Bayer, “Symmetric Binary B-trees: Data structure and Maintenance Algorithms,” Acta

Inform., vol. 1, no. 4, pp. 290–306, 1972.

[14] D. Bovet and M. Cesati, Understanding the Linux Kernel. O’Reilly Media, Inc., 2005.

[15] B. Brandenburg, H. Leontyev, and J. Anderson, “An Overview of Interrupt Accounting

Techniques for Multiprocessor Real-Time Systems,” J. Syst. Archit., vol. 57, no. 6, pp. 638–654,

2011.

[16] L. Brindley, A. Young, and C. Tan, “Example 15.1. Comparing the Cost of Reading

Hardware Clock Sources,” Red Hat Enterprise MRG 2 Realtime Reference Guide. [Online].

Available: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_MRG/2/html/Realtime_Reference_Guide/chap-Realtime_Reference_Guide-Timestamping.html#example-Hardware_Clock_Cost_Comparison. [Accessed: 17-Feb-2015].

[17] A. Brown, “Linux/drivers/idle/intel_idle.c,” Linux Cross Reference - Free Electrons.

[Online]. Available: http://lxr.free-electrons.com/source/drivers/idle/intel_idle.c. [Accessed: 11-Apr-2015].

[18] A. Brown, “The State of ACPI in the Linux Kernel,” in Linux Symposium, 2004, p. 128.


[19] J. Corbet, A. Rubini, and G. Kroah-Hartman, Linux Device Drivers. O’Reilly Media, Inc.,

2005.

[20] H. David, E. Gorbatov, U. Hanebutte, R. Khanna, and C. Le, “RAPL: Memory Power

Estimation and Capping,” in 2010 ACM/IEEE International Symposium on Low-Power

Electronics and Design (ISLPED), 2010, pp. 189–194.

[21] M. Elahi, C. Williamson, and P. Woelfel, “Decoupled Speed Scaling: Analysis and

Evaluation,” Perform. Eval., vol. 73, pp. 3–17, Mar. 2014.

[22] J. Enos, C. Steffen, J. Fullop, M. Showerman, G. Shi, K. Esler, V. Kindratenko, J. Stone,

and J. Phillips, “Quantifying the Impact of GPUs on Performance and Energy Efficiency in HPC

Clusters,” in Green Computing Conference, 2010 International, 2010, pp. 317–324.

[23] K. Flautner, N. Kim, S. Martin, D. Blaauw, and T. Mudge, “Drowsy Caches: Simple

Techniques for Reducing Leakage Power,” in Proceedings of ISCA, 2002, pp. 148–157.

[24] M. Flynn, “Some Computer Organizations and their Effectiveness,” IEEE Trans.

Comput., vol. 100, no. 9, pp. 948–960, 1972.

[25] E. Friedman and S. Henderson, “Fairness and Efficiency in Web Server Protocols,” in

Proceedings of ACM SIGMETRICS, 2003, pp. 229–237.

[26] A. Gandhi, M. Harchol-Balter, R. Das, and C. Lefurgy, “Optimal Power Allocation in

Server Farms,” in Proceedings of ACM SIGMETRICS, 2009, vol. 37, pp. 157–168.

[27] L. Guibas and R. Sedgewick, “A Dichromatic Framework for Balanced Trees,” in Proceedings of FOCS, 1978.

[28] E. Hahne, “Round-Robin Scheduling for Max-Min Fairness in Data Networks,” IEEE J.

Sel. Areas Commun., vol. 9, no. 7, pp. 1024–1039, 1991.


[29] M. Hähnel, B. Döbel, M. Völp, and H. Härtig, “Measuring Energy Consumption for

Short Code Paths Using RAPL,” ACM SIGMETRICS Perform. Eval. Rev., vol. 40, no. 3, pp. 13–

17, Jan. 2012.

[30] F. Hu and J. Evans, “Linux Kernel Improvement: Toward Dynamic Power Management

of Beowulf Clusters,” in Proceedings of the 8th LCI International Conference on High-

Performance Clustered Computing (CDROM), 2007.

[31] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose,

“Microarchitectural Techniques for Power Gating of Execution Units,” in Proceedings of the

2004 International Symposium on Low Power Electronics and Design, 2004, pp. 32–37.

[32] T. Jones, “Kernel API’s, Part 3: Timers and Lists in the 2.6 Kernel,” 30-Mar-2010.

[Online]. Available: https://www.ibm.com/developerworks/library/l-timers-list/. [Accessed: 23-

Jun-2016].

[33] K. Kasichayanula, D. Terpstra, P. Luszczek, S. Tomov, S. Moore, and G. Peterson,

“Power Aware Computing on GPUs,” IEEE, 2012.

[34] L. Kleinrock, “Time-Shared Systems: A Theoretical Treatment,” J. ACM, vol. 14, no. 2,

pp. 242–261, 1967.

[35] H. Lenstra Jr and C. Pomerance, “Primality Testing with Gaussian Periods,” Lect. Notes

Comput. Sci., 2011.

[36] C. Li, C. Ding, and K. Shen, “Quantifying the Cost of Context Switch,” in Proceedings of

the 2007 Workshop on Experimental Computer Science, 2007.


[37] H. Lieberman, “Using Prototypical Objects to Implement Shared Behavior in Object-

Oriented Systems,” ACM Sigplan Not., vol. 21, no. 11, pp. 214–223, 1986.

[38] R. Love, Linux Kernel Development. Pearson Education, 2010.

[39] D. Lu, H. Sheng, and P. Dinda, “Size-Based Scheduling Policies with Inaccurate

Scheduling Information,” in Proceedings of IEEE MASCOTS, 2004, pp. 31–38.

[40] M. Mills, “The Cloud Begins with Coal,” Digital Power Group, Aug. 2013.

[41] M. O’Neill, “The Genuine Sieve of Eratosthenes,” J. Funct. Program., vol. 19, no. 01,

pp. 95–106, Jan. 2009.

[42] V. Pallipadi, S. Li, and A. Belay, “cpuidle: Do Nothing, Efficiently,” in Proceedings of

the Linux Symposium, 2007, vol. 2, pp. 119–126.

[43] D. Petersen, J. Steele, and J. Wilkerson, “WattBot: A Residential Electricity Monitoring

and Feedback System,” in CHI’09 Extended Abstracts on Human Factors in Computing Systems,

2009, pp. 2847–2852.

[44] P. Pritchard, “Explaining the Wheel Sieve,” Acta Inform., vol. 17, no. 4, pp. 477–485,

1982.

[45] P. Pritchard, “Improved Incremental Prime Number Sieves,” in Cornell University, 1994,

pp. 280–288.

[46] J. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits, vol. 2. Prentice

Hall Englewood Cliffs, 2002.


[47] I. Rai, G. Urvoy-Keller, and E. Biersack, “Analysis of LAS Scheduling for Job Size

Distributions with High Variance,” in Proceedings of ACM SIGMETRICS, 2003, vol. 31, pp.

218–228.

[48] T. Rauber and G. Rünger, “Energy-Aware Execution of Fork-Join-Based Task

Parallelism,” in Proceedings of IEEE MASCOTS, 2012, pp. 231–240.

[49] E. Rotem, A. Naveh, A. Ananthakrishnan, E. Weissmann, and D. Rajwan, “Power-

Management Architecture of the Intel Microarchitecture Code-Named Sandy Bridge,” IEEE

Micro, vol. 32, no. 2, pp. 20–27, Mar. 2012.

[50] L. Schrage, “A Proof of the Optimality of the Shortest Remaining Processing Time

Discipline,” Oper. Res., vol. 16, no. 3, pp. 687–690, 1968.

[51] D. Searls, “Linus & the Lunatics, Part II,” Linux Journal, 23-Nov-2003. [Online].

Available: http://www.linuxjournal.com/article/7279. [Accessed: 31-Mar-2015].

[52] B. Sigoure, “Tsuna’s Blog: How Long Does It Take To Make A Context Switch?,” 14-

Nov-2010. [Online]. Available: http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html. [Accessed: 15-Mar-2016].

[53] A. Skrenes and C. Williamson, “Experimental Calibration and Validation of a Speed

Scaling Simulator,” in Proceedings of IEEE MASCOTS, 2016.

[54] D. Snowdon, E. Le Sueur, S. Petters, and G. Heiser, “Koala: A Platform for OS-Level

Power Management,” in Proceedings of the 4th ACM European Conference on Computer

Systems, 2009, pp. 289–302.


[55] D. Snowdon, S. Petters, and G. Heiser, “Accurate On-line Prediction of Processor and

Memory Energy Usage Under Voltage Scaling,” in Proceedings of the 7th ACM & IEEE

International Conference on Embedded Software, 2007, pp. 84–93.

[56] V. Spiliopoulos, A. Sembrant, and S. Kaxiras, “Power-Sleuth: A Tool for Investigating

Your Program’s Power Behavior,” in Proceedings of IEEE MASCOTS, 2012, pp. 241–250.

[57] M. Squillante and E. Lazowska, “Using Processor-Cache Affinity Information in Shared-

Memory Multiprocessor Scheduling,” IEEE Trans. Parallel Distrib. Syst., vol. 4, no. 2, pp. 131–

143, Feb. 1993.

[58] C. Stolte, R. Bosche, P. Hanrahan, and M. Rosenblum, “Visualizing Application

Behavior on Superscalar Processors,” in 1999 IEEE Symposium on Information Visualization,

1999. (Info Vis ’99) Proceedings, 1999, pp. 10–17, 141.

[59] The Climate Group on behalf of the Global eSustainability Initiative (GeSI), “Smart 2020

Report: Global ICT Solution Case Studies,” 2008. [Online]. Available:

http://www.smart2020.org/publications/. [Accessed: 10-Jun-2016].

[60] V. Weaver, M. Johnson, K. Kasichayanula, J. Ralph, P. Luszczek, D. Terpstra, and S.

Moore, “Measuring Energy and Power with PAPI,” in 2012 41st International Conference on

Parallel Processing Workshops (ICPPW), 2012, pp. 262–268.

[61] M. Weiser, B. Welch, A. Demers, and S. Shenker, “Scheduling for Reduced CPU

Energy,” in Mobile Computing, Springer, 1994, pp. 449–471.

[62] A. Wierman, L. Andrew, and A. Tang, “Power-Aware Speed Scaling in Processor

Sharing Systems,” in Proceedings of IEEE INFOCOM, 2009, pp. 2007–2015.


[63] A. Wierman and M. Harchol-Balter, “Classifying Scheduling Policies with Respect to

Unfairness in an M/GI/1,” in Proceedings of ACM SIGMETRICS, 2003, vol. 31, pp. 238–249.

[64] W. Wulf and M. Shaw, “Global Variable Considered Harmful,” ACM Sigplan Not., vol.

8, no. 2, pp. 28–34, 1973.

[65] F. Yao, A. Demers, and S. Shenker, “A Scheduling Model for Reduced CPU Energy,” in

Proceedings of Annual Symposium on Foundations of Computer Science, 1995, pp. 374–382.

[66] “5th Generation Intel® Core™ Processor Family Datasheet, Vol. 1.”

[67] “ACPI Specification Version 5.1,” Unified Extensible Firmware Interface Forum.

[Online]. Available: http://www.uefi.org/sites/default/files/resources/ACPI_5_1release.pdf.

[Accessed: 18-Feb-2015].

[68] “include/uapi/asm-generic/param.h,” Linux Cross Reference - Free Electrons. [Online].

Available: http://lxr.free-electrons.com/source/include/uapi/asm-generic/param.h#L5. [Accessed:

17-Feb-2015].

[69] “Intel® 64 and IA-32 Architectures Software Developer Manuals,” Intel. [Online].

Available: http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html. [Accessed: 14-Feb-2015].

[70] “Intersil ISL95813 Single Phase Core Controller for VR12.6 Datasheet.” Intersil, 15-

May-2013.

[71] “Kill A Watt EZ (P4460) Manual.” P3 International Corporation.

[72] “Mobile 3rd Generation Intel® Core™ Processor Family: Datasheet, Vol. 1.”


[73] “P3 Kill A Watt EZ Power Monitor,” P3 International Corporation. [Online]. Available:

http://www.p3international.com/products/p4460.html. [Accessed: 29-Mar-2015].

[74] “Report to Congress on Server and Data Center Energy Efficiency,” Aug-2007. [Online].

Available: https://www.energystar.gov/buildings/tools-and-resources/report-congress-server-and-data-center-energy-efficiency-opportunities. [Accessed: 10-Jun-2016].

[75] “Softlockup Detector and Hardlockup Detector (aka nmi_watchdog),” The Linux Kernel

Archives. [Online]. Available: https://www.kernel.org/doc/Documentation/lockup-watchdogs.txt.

[Accessed: 13-Feb-2015].