
Compiler Support for Long-life, Low-overhead Intermittent Computation on Energy Harvesting Flash-based Devices

Saim Ahmad

Thesis submitted to the Faculty of the

Virginia Polytechnic Institute and State University

in partial fulfillment of the requirements for the degree of

Master of Science

in

Computer Science and Application

Matthew Hicks, Chair

Changwoo Min

Ali Butt

May 7, 2021

Blacksburg, Virginia

Keywords: Compiler Optimizations, Intermittent Computation, Energy Harvesting, Flash

Storage

Copyright 2021, Saim Ahmad


Compiler Support for Long-life, Low-overhead Intermittent Computation on Energy Harvesting Flash-based Devices

Saim Ahmad

(ABSTRACT)

With the advent of energy harvesters, supporting fast and efficient computation on energy harvesting devices has become a key challenge in the field of ubiquitous computing. Computation on an energy harvesting device amounts to spreading the execution time of a long-running application over short, frequent cycles of power. However, we must ensure that intermittently executing an application produces results congruent to those produced by executing it on a device with a continuous source of power. The current state-of-the-art systems that enable intermittent computation

on energy harvesters make use of novel compiler analysis techniques as well as on-board

hardware on devices to measure the energy remaining for useful computation. However,

currently available programming models, which mostly target devices with FRAM as the

NVM, would cause failure on devices that employ Flash as the primary NVM, resulting in a non-universal solution that is restricted by the choice of NVM. This is primarily the result of Flash's limited write/erase endurance.

This research aims to contribute to the world of energy harvesting devices by providing

solutions that would enable intermittent computation regardless of the choice of NVM on

a device by utilizing only the SRAM to save state and perform computation. Utilizing the

SRAM further reduces run-time overhead as SRAM reads/writes are less costly than NVM reads/writes. Our proposed solutions rely on programmer guidance and compiler analysis to ensure correct and efficient intermittent computation. We then extend our system to provide a


complete compiler-based solution without programmer intervention. Our system is able to

run applications that would otherwise render any device with Flash as NVM useless in a

matter of hours.


Compiler Support for Long-life, Low-overhead Intermittent Computation on Energy Harvesting Flash-based Devices

Saim Ahmad

(GENERAL AUDIENCE ABSTRACT)

As batteries continue to take up space and make small-scale sensors hefty, battery-less devices

have grown increasingly popular for non-resource intensive computations. From tracking air

pressure in vehicle tires to monitoring room temperature, battery-less devices have countless

applications in various walks of life. These devices function by periodically harvesting energy

from their surroundings to power short bursts of computation. When device energy levels reach a lower-bound threshold, these devices must power off to scavenge useful energy from the environment before performing further short bursts of computation. Usually, energy harvesting devices draw power from solar, thermal, or RF energy; the source depends largely on the build of the device, a microprocessor (a processing unit built to perform small-scale computations). Because these devices constantly power on and off, performing continuous computation on them is far more difficult than on systems with a continuous source of power.

Since applications can require more time to complete than one power cycle of such devices,

by default, applications running on these devices will restart execution from the beginning at

the start of every power cycle. Therefore, it is necessary for such devices to have mechanisms

to remember where they were before the device lost power. The past decade has seen many solutions proposed to aid an application in resuming execution rather than recomputing everything from the beginning. Solutions utilize different categories of devices with different storage technologies, as well as the different software and hardware utilities available to programmers


in this domain. In this research, we propose two different low-overhead, long-life computation

models to support intermittent computation on a subset of energy harvesting devices which

use Flash-based memory to store persistent data. Our approaches are heavily dependent

on programmer guidance and different program analysis techniques to sustain computation

across power cycles.


Acknowledgments

A special thanks to my research advisor, Matthew Hicks, and my research associate, Harrison Williams, for helping me prepare my first research publication, which forms a part of this dissertation.


Contents

List of Figures x

List of Tables xii

1 Introduction 1

2 A Difference World: High-performance, NVM-invariant, Software-only

Intermittent Computation 4

2.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.3 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.3.1 Why Intermittent Computation on Flash Devices? . . . . . . . . . . 10

2.3.2 Existing Programmer-guided Systems Kill Flash . . . . . . . . . . . . 12

2.3.3 SRAM’s Time-dependent Non-volatility . . . . . . . . . . . . . . . . 13

2.3.4 Intermittent Off Times are Short . . . . . . . . . . . . . . . . . . . . 15

2.4 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.4.1 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.4.2 Detecting Unexpectedly Long Off Times . . . . . . . . . . . . . . . . 18

2.4.3 Bimodal Recovery Routine . . . . . . . . . . . . . . . . . . . . . . . . 20


2.4.4 CAMEL Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.4.5 CAMEL Compiler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.5 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.5.1 Compiler Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.5.2 Compiler Modifications . . . . . . . . . . . . . . . . . . . . . . . . . 27

2.5.3 Recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

2.5.4 Correctness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

2.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

2.6.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

2.6.2 Time to death—Flash Failure . . . . . . . . . . . . . . . . . . . . . . 32

2.6.3 Run-time Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

2.6.4 Binary size overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

2.6.5 Automatic Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

2.7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

2.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

3 SABLE: A Compiler-only Alternative to CAMEL 42

3.1 Programmer-intervention vs Compiler-analysis — tradeoff . . . . . . . . . . 43

3.2 SABLE Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

3.2.1 Naive Canary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44


3.2.2 Idempotent Canary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

3.2.3 Naive CRC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

3.2.4 Batch CRC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

3.3 SABLE Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

3.4 SABLE Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

4 Conclusion 52

Bibliography 53


List of Figures

2.1 Flash/device lifetime for existing programmer-guided intermittent computation approaches. Incessant checkpointing to Flash quickly renders the device unusable. . . . . . . 10

2.2 The maximum time before a single bit fails in SRAM, across temperature changes, for three capacitor sizes. The horizontal bars represent the maximum off-times from our meta-analysis of off-times reported in the energy harvesting literature. . . . . . . 14

2.3 Interaction amongst components within CAMEL. . . . . . . . . . . . . . . . . 16

2.4 (a) Shows unmodified source code, (b) shows the task-divided code according to the conventions described in §2.4.4, and (c) shows the execution of the code in (b) after it has been instrumented by the compiler. . . . . . . 21

2.5 (1) Shows the start of a task, (2) shows a power failure midway through execution of a task, and (3) shows undo-logging before any task begins execution. The state of the non-volatile and volatile buffers is shown after each of the three steps. . . . . . . 23

2.6 Shows how tasks use data in the differential buffers. The only non-idempotent

variable is result since it undergoes write-after-read. It is first read in line 11

and also written to. This sequence of instructions in assembly would result

in a write-after-read violation. . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.7 Shows the pipeline for the generation of a CAMEL-certified executable. . . . . 26


2.8 CAMEL run-time overhead for Flash-based devices. The global buffer size for each benchmark is stated in parentheses on the x-axis. . . . . . . 34

2.9 CAMEL run-time overhead for FRAM-based devices. The global buffer size

for each benchmark is stated in parentheses on the x-axis. . . . . . . . . . . 34

2.10 CAMEL binary size increase compared to current state-of-the-art. . . . . . . . 38

3.1 Cumulative Density Curve for SABLE batch CRC which helps in determining

the ideal number of stores to batch . . . . . . . . . . . . . . . . . . . . . . . 48


List of Tables

2.1 Deployment lifetime (in hours) for existing programmer-guided systems on

Flash-based devices with expected time until a silent data corruption for

several CAMEL configurations. . . . . . . . . . . . . . . . . . . . . . . . . . . 32

2.2 Checkpoints each system makes per benchmark. Each system makes a comparable number of checkpoints, hence the difference in run-time and binary-size overheads cannot be a result of a different number of checkpoints. . . . . . . 37

3.1 Summary of checkpoints placed in the Naive and Idempotent variants of the canary version. . . . . . . 46

3.2 Summary of checkpoints placed in Naive and Batch variants of the CRC version. 49

3.3 Relative run-time for different SABLE implementations . . . . . . . . . . . . . . . . . 51

3.4 Relative binary size for different SABLE implementations . . . . . . . . . . . . . . . . 51


List of Abbreviations

AR: Activity Recognition

BC: Bit Count

CEM: Cold-Chain Equipment Monitoring

CF: Cuckoo Filter

CRC: Cyclic Redundancy Check

FRAM: Ferroelectric Random Access Memory

IDEM: Idempotent

IR: Intermediate Representation

LLVM: Low-level Virtual Machine

MCU: Micro-controller Unit

NVM: Non-volatile memory

SRAM: Static Random Access Memory

TDNV: Time-dependent non-volatility


Chapter 1

Introduction

No matter what electronic device we use, it is always accompanied by a battery to power it, which takes up at least half of the total space on the device, if not more. Energy harvesting [33, 39, 48] is an essential first step toward dumping batteries permanently, thereby freeing the remaining space for more essential components. To attain complete ubiquity

[42], we require that future devices take advantage of the capabilities of energy harvesters

which harvest energy from the surroundings in the form of solar, thermal and RF energy.

However, dumping batteries and moving to a system that scavenges energy from the environment proves to be a challenge of its own. Energy harvesters require frequent power offs

to salvage energy from the environment to enable them to power on again. This means that

applications executing on such a device may fail to run to completion when the time required for them to finish execution is greater than the average power-on time of an energy harvesting device. Such challenges have led to the advent of intermittent computation, a

research area which works to ensure software can be intermittently executed in a correct and

efficient way, always resulting in completion of the running application.

The last decade saw an influx of programming models for intermittent computation on

energy harvesting devices. Probably the first system to revolutionize the field of energy har-

vesting was Mementos [37]. Mementos [37] utilized compile-time analysis techniques along

with hardware to measure available voltage at run-time to successfully execute programs

intermittently. This opened the gateway for multiple ideas that could be utilized to build


intermittent computation models. Since the dawn of Mementos [37], we have seen different

classes of models arise. Some techniques utilize software-only and compiler-based techniques

to enable applications to successfully execute on intermittent power [46]. Recent systems

have seen the incorporation of programmer-guidance along with compiler analysis to make

software techniques easy to implement [4, 27, 30]. Programmer-guidance involves the pro-

grammer rewriting source code, usually by dividing it into a set of functions that comply with

the primitives defined by the underlying system the application is being ported to. These

schemes are termed continuous checkpointing models coupled with programmer-guidance.

Another class of systems only uses hardware to measure the amount of energy available at run-time and checkpoints accordingly before the device powers off [2]. These techniques are

just-in-time approaches, as they checkpoint just before losing power. All of the aforemen-

tioned systems utilized either Flash or FRAM as non-volatile memory to checkpoint

and store state. However, a new class of the state-of-the-art has emerged that is NVM

invariant [44]. TotalRecall is a just-in-time programming model that uses the SRAM to

checkpoint data thus reducing the checkpointing overhead to a fraction of what other sys-

tems could achieve. This is made possible by the observation that the SRAM can retain

data for short periods of time after a device loses power. This discovery was made by [44]

and is elaborated further in §2.3.3.

This research will introduce programming models that are built using the observation that

the SRAM can retain state successfully for minutes after an energy harvesting device powers

off. These programming models are continuous-checkpointing, compiler-based, software-only

and NVM invariant. They can work on either the Flash or FRAM based systems without

requiring any modifications to the application source or programming model. Furthermore,

they do not require on-board hardware to continuously monitor available energy, thus making

these systems more adaptable to devices that do not provide energy measuring hardware.


§2 presents a manuscript of ours that builds a system by coupling continuous checkpointing

with programmer-direction. §3 introduces an alternative to the system in §2 that requires no programmer-direction and is a compiler-only approach. However, this system is still in its development and testing phase.


Chapter 2

A Difference World:

High-performance, NVM-invariant,

Software-only Intermittent

Computation

In this chapter we present our research paper, edited and modified to meet the requirements

of this dissertation. This paper was submitted to SOSP’21 on May 6th, 2021.

2.1 Abstract

Supporting long life, high performance, intermittent computation is an essential challenge

in allowing energy harvesting devices to fulfill the vision of smart dust. Intermittent compu-

tation is the extension of long-running computation across the frequent, unexpected, power

cycles that result from replacing batteries with harvested energy. The most promising ap-

proaches combine programmer direction and compiler analysis to minimize run-time over-

head and provide programmer control—without specialized hardware support. While such

strategies succeed in reducing the size of non-volatile memory writes due to checkpoint-

ing, they must checkpoint continuously. Unfortunately, for Flash-based devices, writing


checkpoints is slow and gradually kills the device. Without intervention, Flash devices

and software-only intermittent computation are fundamentally incompatible. To enable

programmer-guided intermittent computation on Flash devices, we design and implement

CAMEL. The key idea behind CAMEL is the systematic bifurcation of program state into two

“worlds” of differing volatility. Programmers compose intermittent programs by stitching

together atomic units of computation called tasks. The CAMEL compiler ensures that all

within-task data is placed in the volatile world. CAMEL places all data that is communi-

cated between tasks in the non-volatile world. Between tasks, CAMEL swaps the worlds,

atomically locking-in the forward progress of the preceding task. In preparation for the next

task, CAMEL resolves differences in world view by copying only differences due to the pre-

ceding task’s updates. This systematic decomposition into a mixed-volatility memory allows

programmer-guided intermittent computation on Flash devices while improving performance

for all NVM types. CAMEL extends correct operation from minutes to 1000s of years, in-

creases performance up to 42x on Flash devices, and improves performance on FRAM devices

by 50%.

2.2 Introduction

Energy harvesting [33, 39, 48] is the key to realizing the vision of ubiquitous computing [23,

42]. Aggressive transistor scaling brings us to an inflection point: computing devices are

smaller than a grain of rice and operate on nano-watts of power [47], but the batteries

required to power them remain largely unchanged, leaving them large, heavy, expensive, and

sometimes flammable [3]. This asymmetric scaling means attaining ubiquity demands that

current and future devices shed batteries in favor of harvested energy.

The transition to harvested energy brings a new challenge: how can we support long-running


programs in the face of the frequent, unpredictable, power cycles brought on by the rela-

tive trickle of energy supplied by energy harvesting? Existing programs and programmers

alike assume a continuous supply of energy, while energy harvesters provide only enough

energy for short bursts of computation. Attempting to execute unmodified programs on

such short bursts of power dooms long-running programs to a never-ending series of restarts.

A naive application of existing checkpointing schemes [37] is inadequate as previous work

shows that, without careful attention to the memory durability ramifications of power cycles,

semantically incorrect executions occur [36]. Thus intermittent computation is born.

Intermittent computation approaches fit into one of two high-level classes:

• Just-in-time checkpointing: special-purpose hardware monitors available power,

committing a checkpoint and ceasing computation when power dips below a pre-defined

voltage threshold [1, 2, 22, 37, 44].

• Continuous checkpointing: a program is decomposed (by a programmer [4, 27, 30],

compiler [31, 46], or hardware [10, 28, 29, 41]) into a series of inherently-restartable sub-

computations, glued together by checkpoints. This results in power-failure-agnostic

program execution.

Programmer-guided continuous checkpointing intermittent computation systems are favored

when programmers require guarantees about execution [4, 27, 30]. In many real-world de-

ployments, some operations cannot be interrupted and then resumed arbitrarily following a

power cycle. Consider interacting with an external radio or a sensor; these devices cannot

handle power loss mid-transaction [27]. This forces the programmer to anticipate

the effects of a power loss at every point in the code and write routines to mitigate the

consequences. This is unscalable and error-prone.

Programmer-guided approaches relieve programmers of this burden through a combination


of compiler analysis and a C programming interface that exposes forward progress atomicity

as a first-class programming abstraction. Compiler analysis is comprehensive and error-free,

while the programming interface allows programmers to reason about the system-level effects

of computation—at a granularity that they are comfortable with or that mirrors the device’s

interface/protocol.

Programmer-guided approaches divide programs into a series of checkpoint-connected tasks.

Tasks represent atomic, restartable, units of computation, i.e., they either complete entirely

or not at all. Regardless of any power cycles, the result of task execution is semantically

consistent with the code. The fundamental principle is that tasks keep their changes private

until they complete. Early work conservatively versions all within-task data [27], while

follow-on work introduces novel classes of cross-task data-communication channels [4]. The

most recent approach further reduces overhead by using idempotence analysis to minimize

data copying [30].

This paper addresses two limitations in state-of-the-art programmer-guided intermittent

computation approaches:

• Flash device lifetime: the frequent non-volatile memory writes of continuous check-

pointing strategies—no matter the size—exhaust Flash’s limited write endurance [17].

• Performance: current approaches copy redundant data to within-task buffers due to

not reusing existing data.

Fixing the first flaw makes programmer-guided intermittent computation possible

and performant on Flash-based devices. Fixing the second flaw increases perfor-

mance across both FRAM and Flash-based devices.

We design and implement CAMEL, an extension to C and compiler support that enables


long-life, low-overhead intermittent computation on Flash-based systems—without hardware

support—as well as increasing performance on both FRAM and Flash devices. Our solu-

tion leverages the idea of two worlds from ARM TrustZone [34], but replaces security with

data non-volatility. Two worlds exist within a CAMEL-instrumented program: a non-volatile

world across tasks (and for recovery) and a volatile world within a task. To implement this

selective mixed-volatility world abstraction on top of SRAM, we leverage recent work on

time-dependent non-volatility [44]. Instead of creating a wholly-non-volatile SRAM, which

§2.6 shows is not a viable solution, CAMEL reserves non-volatility for the non-volatile world

alone. Treating within-task data as volatile makes CAMEL performant on Flash-based de-

vices. Fine-grain idempotence analysis coupled with differential state analysis allows CAMEL

to efficiently update and transition between worlds. We validate CAMEL’s ability to ex-

tend long-running computation across frequent power cycles using Flash- and FRAM-based

MSP430 microcontrollers and a superset of benchmarks from previous work. Experiments

show that CAMEL provides practically unbounded deployment lifetime, while reducing av-

erage run-time overhead by 7x–42x over previous systems running on Flash-based devices.

Compared to a naïve software-only variant of a recent Flash-based intermittent computation

system, CAMEL improves performance by up to 455%. Even on FRAM devices, CAMEL’s

advanced compiler analyses cut run-time overhead in half compared to the state-of-the-art.

This paper makes the following contributions:

1. We expose that existing continuous checkpointing approaches have poor performance

and eventually kill Flash-based systems due to checkpoint-induced Flash memory

writes/erases (§2.3.1).

2. We propose the notion of the controlled-volatility worlds to enable high-performance

programmer-guided intermittent computation on Flash devices (§2.4.4).


3. We present a NVM-invariant performance improvement: reusable differential buffers

(§2.4.5).

4. We expose to the designer and quantify the trade space between pre-deployment effort

and run-time overhead (i.e., canary vs. CRC) (§2.4.2).

5. We evaluate CAMEL against state-of-the-art programmer-guided intermittent compu-

tation systems; results show that CAMEL outperforms previous approaches in both

lifetime and performance, regardless of non-volatile memory type (§2.6).

2.3 Motivation

There exists a succession of programmer-guided intermittent computation systems, each

refining the interface exposed to programmers and reducing run-time overhead. Why is

another approach needed? This section answers this question through analysis that shows

that by ignoring Flash-based energy-harvesting platforms, we exclude the most

ubiquitous, most available, lowest cost, and highest performance systems from

the benefits of intermittent computation. Experiments with existing approaches show

that due to the performance and lifetime consequences of Flash writes/erases and the high-

frequency checkpoints endemic to continuous checkpointing, a new approach is required.

Lastly, we show that achieving suitable performance is more challenging than a direct

extension of previous work targeting Flash devices. This analysis motivates a new,

in-place checkpointing, approach to programmer-guided intermittent computation: CAMEL.


Figure 2.1: Flash/device lifetime for existing programmer-guided intermittent computationapproaches. Incessant checkpointing to Flash quickly renders the device unusable.

2.3.1 Why Intermittent Computation on Flash Devices?

Frequent checkpointing allows programmer-guided approaches to remove the requirement

of special-purpose voltage monitoring hardware. Frequent checkpointing also means that

the performance of existing programmer-guided systems depends on the performance of

the non-volatile memory (NVM) that it commits checkpoints to. For several decades, the

only mass-market option for NVM in energy-harvesting-class devices was Flash memory. In

the last five years, a new NVM emerged: Ferroelectric Random-Access Memory (FRAM).

Following this trend, early energy-harvesting platforms used Flash-based microcontrollers

(e.g., WISP 4 [39] and Moo [48]), while the more recent energy-harvesting platforms use the

more esoteric FRAM-based microcontrollers (e.g., WISP 5 [33]). According to the WISP 5

developers, the impetus for the transition to FRAM-based devices is the lower cost of writes

compared to Flash.


While write latency is one metric to compare NVM technologies, other metrics become

important in a world where NVM writes/erases are no longer the limiting factor. Flash-

based devices provide several advantages over similar FRAM-based devices: Flash devices are more widely available and have a larger pool of developers and suppliers, since they have been around for decades, compared to less than a decade for FRAM devices. This trend is unlikely to change soon, given that Flash-based devices remain more widely available today. Flash also provides

a performance advantage, as shown by comparing Dhrystone results [44]. Even with the

same processor core, operating at the same clock frequency, FRAM requires memory access

wait-states when the clock surpasses 8 MHz, while Flash operates wait-state-free up to 25

MHz [15, 16]. Sub-linear power-frequency scaling in low-power microcontrollers enables more

energy-efficient operation at high clock speeds: for example, the MSP430F5529 consumes

360 µA/MHz at 1 MHz and 333 µA/MHz at 12 MHz [19]. Recent work highlights other

advantages of switching to a high-energy, high-efficiency operating point [5]. FRAM devices

require additional memory-access wait states as clock speeds increase, eliminating much of

the advantage of faster operation. Finally, Flash devices tend to contain more SRAM at

equivalent NVM sizes [18, 20].

Given the advantages of Flash, why do the most recent energy harvesting platform [33] and

programmer-guided intermittent computation systems [4, 27, 30] target FRAM? Despite the

availability and performance advantages of Flash, its slow, high-energy writes/erases are

antithetical to the high-frequency checkpoints of continuous checkpointing systems. Pro-

gramming Flash (i.e., writing) is energy and time-intense as it requires collecting enough

charge to raise the voltage of a Flash cell high enough to force charge to flow across the

cell’s dielectric (e.g., from 2.2V up to 12V). Worse, this process is uni-directional. Thus,

to change any single bit of Flash requires copying a segment (512 B) to SRAM, erasing

the entire segment, updating the desired bits in SRAM, and writing the updated segment


back to Flash. The common-case nature of checkpointing in programmer-guided systems,

the cost of writing/erasing Flash memory, and Flash’s untimely failure eclipse any bene-

fit Flash offers. Without an alternative to checkpointing to Flash, the vast majority of microcontrollers, including the most performant, will not support software-only intermittent computation.

The goal of this paper is to provide the most performant programmer-guided

checkpointing approach that works across popular NVM technologies.

2.3.2 Existing Programmer-guided Systems Kill Flash

Flash cells can endure only a limited number of write/erase cycles before they fail [17].

To understand how this impacts existing programmer-guided intermittent computation ap-

proaches, we evaluate how long each system takes to render the Flash—and therefore the

system—unusable. We use the benchmark set from §2.6, which is a superset of bench-

marks from previous work. We start by adapting each system for the limitations of Flash’s

write/erase granularity. As a consequence, for systems like Alpaca that use idempotence to

reduce NVM writes, the entire buffer must be updated in Flash environments, regardless.

We apply two types of wear leveling to maximize lifetime: (1) we pack as many whole-

checkpoints as will fit in a Flash segment. For example, the AR buffer is 164 bytes in DINO;

this allows for three buffers per Flash segment. (2) we employ optimistic wear leveling.

Figure 2.1 shows the time taken to exhaust the average-case Flash endurance if each ap-

plication runs continuously on the MSP430G2553 microcontroller. We determine Flash’s

lifetime by calculating how long it takes for Flash to reach its maximum write endurance

(100,000 for the MSP430G2553 [17]). For this we use each benchmark’s checkpoint size and

rate, Flash segment size, and the number of free Flash segments for each benchmark. We as-

sume constant operation and perfect wear-leveling—with no added cost or complexity. Most


configurations last for just hours, with a best-case lifetime of less than 40 hours. Thus,

existing approaches kill Flash devices quickly.
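To make the lifetime estimate concrete, the following is a minimal C sketch of the calculation; the constants shown (number of free segments, checkpoint size, and checkpoint rate) are illustrative assumptions, not measured values for any particular benchmark.

    #include <stdio.h>

    /* Sketch: estimate Flash lifetime under continuous checkpointing,
       assuming perfect wear leveling across all free segments.        */
    int main(void) {
        const double endurance     = 100000.0; /* write/erase cycles per segment (MSP430G2553) */
        const double segment_bytes = 512.0;    /* Flash segment size                            */
        const double free_segments = 16.0;     /* assumed number of free segments               */
        const double ckpt_bytes    = 164.0;    /* example checkpoint size (AR buffer in DINO)   */
        const double ckpts_per_sec = 100.0;    /* assumed checkpoint rate                       */

        /* Whole checkpoints packed per segment; each packed group shares one erase. */
        double ckpts_per_erase = (double)(int)(segment_bytes / ckpt_bytes);
        double total_erases    = endurance * free_segments;
        double lifetime_sec    = (total_erases * ckpts_per_erase) / ckpts_per_sec;

        printf("estimated lifetime: %.1f hours\n", lifetime_sec / 3600.0);
        return 0;
    }

Under these assumed parameters the estimate lands in the tens of hours, consistent with the lifetimes reported in Figure 2.1.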

2.3.3 SRAM’s Time-dependent Non-volatility

Many existing intermittent computation works discuss SRAM as if it loses its state com-

pletely as soon as the microcontroller stops computing. We observe that, due to capacitance

in the system, the voltage of a system’s power rail gradually reduces from the microcon-

troller’s brown-out voltage to 0V. Due to the difference in the microcontroller’s brown-out

voltage (e.g., 1.6V) and SRAM’s data retention voltage (≈0.4 V [13, 35]), SRAM scavenges

the otherwise wasted charge to retain state. We refer to this as SRAM’s time-dependent

non-volatility: for a period after computation ceases, SRAM acts as a non-volatile memory.

This presents an opportunity to leverage SRAM’s time-dependent non-volatility to serve as a

non-volatile checkpoint storage location—as long as SRAM retains data perfectly for longer

than the off time.

To verify this opportunity, we first quantify how long SRAM provides perfect data retention

for. For this experiment, we use a Flash-based MSP430 development board [21] that is

representative of energy harvesting-class devices. The literature indicates that two factors

dominate the discharge time of a capacitor: capacitor size and temperature. To explore

the impact of these variables on SRAM’s data retention time, we modify the development

board, replacing its 10µF decoupling capacitor with 47µF, 100µF, and 330µF versions. We

select the 47µF capacitor as it represents what the most popular energy harvesting devices

use [33, 39]. We use larger capacitor sizes to show how system designers can tune the

retention time through the capacitor. To control temperature, we perform the experiments

in a Test Equity 123H thermal chamber, varying temperature between 20℃ and 50℃.


[Figure 2.2 plot: 100% SRAM data retention time, in seconds, versus temperature from 20 °C (office) to 55 °C (Death Valley) for 47 µF, 100 µF, and 330 µF capacitors, with horizontal off-time markers for Solar: 300 s, Thermal: 14 s, RF: 10 s, and Piezoelectric: 2 s.]

Figure 2.2: The maximum time before a single bit fails in SRAM, across temperature changes, for three capacitor sizes. The horizontal bars represent the maximum off-times from our meta-analysis of off-times reported in the energy harvesting literature.

Because SRAM fails bi-directionally, in a board- and noise-dependent pattern [11], for each

temperature/capacitor combination, we perform five trials where we write all 1’s and five

where we write all 0’s, checking for data loss at each trial. Figure 2.2 shows the retention time

of our MSP430 microcontroller across a range of temperatures and the three energy storage

capacitor sizes. These results show that—even without system designer awareness of SRAM’s

time-dependent non-volatility—current energy harvesting platforms provide relatively long

data retention times.


2.3.4 Intermittent Off Times are Short

Given that SRAM provides perfect data retention for between 50 seconds and almost 4 hours,

the next question to answer is how long unexpected off-times1 are for the most common

energy sources. To answer this question, we perform a meta-analysis of the energy harvesting

literature. The goal is to identify common energy sources and, for each source, a realistic

upper bound for off times. This task is complicated by the fact that previous work focuses

on on-times due to its reliance on the long-term data retention guarantees of non-volatile

memories. Fortunately, by looking at on-times and the frequency of power-on events, we

are able to deduce approximate off times. We add the off-times for four energy sources

as horizontal lines in Figure 2.2: RF [8, 28, 37], Thermal [28], Piezoelectric [28, 40], and

Solar [8, 28]. When a given capacitor’s line is above the horizontal line, the capacitor

provides enough perfect data retention time to support operation at the temperature and

below. To summarize the results of our meta-analysis: off-times for most sources are

much shorter than the data retention time provided by existing energy harvesting

platforms. In this paper, we design and implement a system that reliably uses SRAM

as a low overhead, long lifetime, non-volatile memory for the short off times

common to intermittent computing, falling back to existing checkpointing to support

longer and expected power-off events.


[Figure 2.3 diagram: programmer decomposition of uninstrumented code into task-divided code, compiler analysis producing an instrumented executable, program execution with CRC/canary checkpoints and commits to SRAM/Flash, and a recovery path that performs CRC/canary verification before restoring from the checkpoint or restarting.]

Figure 2.3: Interaction amongst components within CAMEL.

2.4 Design

We develop CAMEL, a programmer-guided, continuous checkpointing system with the goal

of enabling long-life, high-performance intermittent computation on Flash-based devices.

CAMEL avoids continuously writing program state to non-volatile memory by preserving

in-place SRAM data using differential reusable buffers. Our differential buffer model al-

lows checkpointing just enough data to restart program state, as opposed to the entire

SRAM as implemented previously [44]. We maintain semantically correct execution by ensur-

ing the in-place data remains consistent across power cycles and that tasks always re-execute

with known-good data. CAMEL is an amalgamation of three components, working together

to guarantee the correct execution of a program on harvested energy. These components

are: (1) CAMEL Recovery Routine; (2) CAMEL Tasks; and (3) CAMEL Compiler.

1 We differentiate between expected and unexpected off times. The challenge for intermittent computation is dealing with unexpected power cycles and their off times; thus that is our focus. In contrast, solar-powered systems experience long off times at night, but this is (predictable) power loss akin to turning off your computer—not intermittent computation.


2.4.1 System Overview

Figure 2.3 gives a high level overview of how different components interact to make CAMEL

function. The CAMEL programming model allows the programmer to ensure forward progress

of applications on any energy harvesting platform by decomposing source code into a set of

individually re-executable tasks. Tasks manipulate data in the differential buffer to perform

useful computation. The CAMEL compiler analyzes how tasks interact with shared data

in the differential buffers to ensure in-place SRAM data is consistent at run time, despite

re-executions after power failures. CAMEL performs idempotence analysis and produces a

ready-to-run executable that can be flashed to a board of the programmer’s choosing. CAMEL

implements the volatile and non-volatile world concept using two differential, swappable

buffers—at any given point in execution, a volatile world and a non-volatile world exist.

Tasks interact with data exclusively in the volatile world, whereas recovery pulls

data exclusively from the non-volatile world. Between tasks, CAMEL atomically swaps

which buffer represents each world—locking-in forward progress by rendering the updated

buffer effectively non-volatile (§2.4.2). After the swap and before the next task, CAMEL

resolves the differences between the up-to-date newly-non-volatile world and the outdated

newly-volatile world by copying only the data that the preceding task modified. Following a

power failure, CAMEL invokes the recovery routine which continues execution either (1) from

the most recent checkpoint in the common case when SRAM retains its data, or (2) from the

beginning of the program, when an uncommonly long power failure causes SRAM to lose its

data.


2.4.2 Detecting Unexpectedly Long Off Times

SRAM transitions from non-volatility to semi-volatility, gradually approaching full-volatility

as supply voltage falls. For any stage beyond non-volatility, the SRAM cells begin losing

state, jeopardizing recovery. We employ two methods to detect unexpectedly long off times.

Canary Values: During a power failure, the SRAM cells that fail first and the direction

of failure are decided by manufacturing-time process variation; hence each device exhibits

unique yet temporally consistent failure patterns [11, 12]. We leverage this predictable failure

pattern for a low-overhead check of SRAM data retention by writing a pre-determined value

to the canary memory and checking for it after a power failure. If the first-to-fail SRAM

cells retain their canary values, then we know all of the SRAM data is intact and can

restart from a checkpoint; otherwise, data may be corrupted and we must restart execution

from the beginning. SRAM canary values require chip characterization for two purposes:

1) identifying the cells that fail first and 2) identifying the value those cells fail to, which

prevents silent failures from the cell failing into the chosen canary value.

Pre-deployment characterization works for three reasons: (1) given a device, SRAM cells fail

to retain state in a mostly total ordered fashion—especially the tail cells [12]; (2) SRAM cell

failure ordering is preserved across temperature and voltage fluctuations [9]; and (3) The

weakest cells have a reliable power-on state [45]. Thus, to determine, for a given device, the

location and values of the canary cells, the user performs a binary search of off-times, looking

for the first cells to fail at the shortest off-time. The cells are set to all 1’s, then the power is

disconnected, then reconnected after the desired off time, then the SRAM state is read-back,

looking for failed cells. The user does the same for all 0’s. If a cell fails for either case, then

it is marked as a failure for that time. For our experiments, we perform 3 such (Bernoulli)

trials at each time step to eliminate noise. We then store the address(es) of the first-to-fail


SRAM cell(s) in the non-volatile memory and the value that exposes their failure. The

checkpoint routine writes the value to the address(es), while the recovery routine validates it

before resuming execution. Programmers can always use multiple canary cells for increased

resilience. Canary values guarantee data integrity for the cost of a few memory comparisons,

but pre-characterization of each chip may be impractical for some deployments.
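The following is a minimal C sketch of how the canary write and check might look at run time; the symbol names (canary_addrs, canary_vals, NUM_CANARIES) and the idea of holding the characterization results in arrays are illustrative assumptions, not CAMEL's actual interface.

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_CANARIES 4   /* assumed: number of characterized first-to-fail cells */

    /* Populated from per-chip characterization and stored in non-volatile memory. */
    extern volatile uint8_t * const canary_addrs[NUM_CANARIES]; /* first-to-fail SRAM cells          */
    extern const uint8_t            canary_vals[NUM_CANARIES];  /* pattern opposite each cell's
                                                                    known failure value               */

    /* Called by the checkpoint routine: arm the canaries before computation continues. */
    static void canary_write(void) {
        for (int i = 0; i < NUM_CANARIES; i++)
            *canary_addrs[i] = canary_vals[i];
    }

    /* Called by the recovery routine: if every canary survived, all stronger SRAM cells
       survived too, so the in-place checkpoint can be trusted.                          */
    static bool canary_intact(void) {
        for (int i = 0; i < NUM_CANARIES; i++)
            if (*canary_addrs[i] != canary_vals[i])
                return false;   /* unexpectedly long off time: restart from the beginning */
        return true;
    }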

Algorithm 1 Software CRC routine.
 1: MASK ← 0xFF00
 2: CRC ← 0x0
 3: P_TABLE ← &CRC_TABLE
 4: P_BUF ← SAFE
 5: while P_BUF ≠ SAFE_END do
 6:     INDEX ← CRC[7:0]
 7:     INDEX ← *P_BUF xor INDEX
 8:     P_BUF ← P_BUF + 1
 9:     INDEX ← INDEX + INDEX
10:     INDEX ← INDEX + P_TABLE
11:     CRC ← CRC and MASK
12:     CRC ← *INDEX xor CRC
13: end while
14: return CRC

Cyclic Redundancy Checks (CRC): To verify data integrity without chip characteri-

zation, we provide a second implementation based on Cyclic Redundancy Checks. The basic

algorithm is shown in Algorithm 1. CRCs are common mechanisms for communication sys-

tems and applications that need to verify the integrity of received data with high confidence.

Hardware support for CRCs is also common to low-powered micro-controllers that send and

receive data; both of our evaluation devices include hardware CRC engines; software CRC

implementations also exist.
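For illustration, the following is a minimal C rendering of Algorithm 1; the table contents, the SAFE/SAFE_END region bounds, and the symbol names are assumptions that depend on the chosen generator polynomial and the linker layout.

    #include <stdint.h>

    extern const uint16_t CRC_TABLE[256];   /* assumed 256-entry table for the polynomial  */
    extern uint8_t SAFE[], SAFE_END[];      /* assumed bounds of the protected SRAM region */

    static uint16_t camel_crc(void) {
        uint16_t crc = 0x0;
        for (const uint8_t *p_buf = SAFE; p_buf != SAFE_END; p_buf++) {
            uint8_t index = (uint8_t)(crc & 0x00FF) ^ *p_buf;  /* CRC[7:0] xor next data byte  */
            crc = (crc & 0xFF00) ^ CRC_TABLE[index];           /* fold table entry into the CRC */
        }
        return crc;
    }

The doubling of INDEX and the addition of P_TABLE in Algorithm 1 correspond to byte-addressed indexing into the 16-bit table, which array indexing handles implicitly here.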

The CRC algorithm divides the data by a predetermined generator polynomial using repeated

shifts and XOR operations. The output of the CRC algorithm is the remainder of this

division, which is stored alongside the data in volatile memory. The state of the application

data in the SRAM changes after every task, hence we recompute the CRC between tasks.


To verify the integrity of the data after recovery, we recalculate the CRC over the trusted

in-place application data and compare the resulting remainder to the one previously stored

in memory. The two remainders must match to conclude that data remains integral.

CRCs guarantee up to G bits of error detection, where G is determined by the variant of

CRC used; a 16-bit CRC detects up to 3 flipped bits whereas a 32-bit CRC detects up to

5 flipped bits. Both CRC variants additionally detect all odd-bit errors. For other errors,

CRCs provide probabilistic error detection with a chance of missing an error of 1/2^m, where

m is the width of the CRC [24]. For a multi-bit error to go undetected, the checksum of

the corrupted and un-corrupted data must be the same. The probability of undetected data

corruption is further reduced because there is a 50% chance that a failing cell will fail into

the value it is already holding; thus, CRCs provide a high-confidence general solution to

verify SRAM’s data integrity.

2.4.3 Bimodal Recovery Routine

As shown in Figure 2.3, following a power failure during program execution, the recovery

routine passes control to the program from either the start of the main function or

the last executed task. We arrive at this decision by re-computing the CRC or checking the

canary value, depending on the variant of CAMEL deployed.

To resume execution from the last in-place checkpoint (1) the recomputed CRC should

match the one stored in the SRAM or (2) the canary value must be correct, indicating that

the data in the non-volatile world was preserved over the power cycle. After passing the

integrity check, CAMEL copies the data in the non-volatile world over the volatile world,

ensuring that the first task begins with correct data. CAMEL restores control to the program

at the beginning of the last partially-completed task by copying the saved register values back


Instrumented Code

struct {  int x;  int y;  int z;}global;

task_sample() {  int i = getReading();  GV(x) = i;}

task_transform() {  int offset = GV(y) + GV(z);  GV(x) = GV(x) + offset;}

main() {  while (1) {    task_sample();    task_transform();  }}

main()  task_sample()    int i = getReading()    GV(x) = i  task_transform()    int offset = GV(y) + GV(z)                  FAILURE

Execution

            RECOVERY

task_transform()  int offset = GV(y) + GV(z)  GV(x) = GV(x) + offset

      

main() {  int x = 0;  int y = 2;  int z = 5;  while (1) {    x = getReading();    int offset = y + z;    x = x + offset;  }}

Source Code

(a) (b) (c)

Figure 2.4: (a) Shows unmodified source code (b) shows the task divided code according tothe conventions described §2.4.4 (c) shows the execution of the code in (b) after it has beeninstrumented by the compiler.

to the register file, restoring the saved program counter value at the last checkpoint at the

end. In the uncommon case—when SRAM fails to retain data due to an unexpectedly long

power failure—CAMEL passes control to the beginning of the program and restarts execution.
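A minimal sketch of this bimodal decision is shown below, assuming a CRC-variant build; the function and variable names (stored_crc, restore_registers_and_jump, restart_from_main) are illustrative, not CAMEL's actual symbols.

    #include <stdint.h>
    #include <string.h>

    extern uint16_t stored_crc;            /* CRC saved with the last commit       */
    extern uint8_t *nonvolatile_world;     /* buffer committed by the last task    */
    extern uint8_t *volatile_world;        /* buffer the next task will work on    */
    extern size_t   world_size;

    uint16_t camel_crc(void);              /* Algorithm 1, computed over the safe region   */
    void restore_registers_and_jump(void); /* resume at the last partially-completed task  */
    void restart_from_main(void);          /* cold start: begin the program from scratch   */

    void camel_recover(void) {
        if (camel_crc() == stored_crc) {
            /* Common case: SRAM survived the off time. Undo any partial task updates by
               copying the trusted non-volatile world over the volatile world, then resume. */
            memcpy(volatile_world, nonvolatile_world, world_size);
            restore_registers_and_jump();
        } else {
            /* Uncommon case: an unexpectedly long off time corrupted SRAM. */
            restart_from_main();
        }
    }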

2.4.4 CAMEL Tasks

The CAMEL programming model is task-based, providing the programmer with an interface to

divide source code into small, reusable and atomic tasks. This facilitates the implementation

and management of our volatile and non-volatile worlds. This division enables tracking intra-

task idempotence [46] by the compiler. We label a section of code as idempotent if no variable

undergoes a Write-after-Read dependency [25] within that section. A variable is subject to

the Write-after-Read dependency [25] when it is first read and then later written in the


same task. The CAMEL compiler identifies this sequence by a load followed by a store to the

same memory location. Variables within a task are tracked to ensure the consistency of the

differential buffers upon task entry or re-entry after a power failure.

Programmers define functions serving as volatile tasks using the task_ keyword to mark

them for tracking by the compiler. In accordance to our task-based model, CAMEL expects

all variables that are to be used by multiple tasks or reused by multiple executions of the

same task to be declared as task-shared variables. All task-shared variables are declared as

part of a global structure which serves as the buffer for the volatile and non-volatile worlds.

Task-shared variables need not be passed to every task individually; instead, they are directly

accessible by the use of the GV() keyword.
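One plausible realization of this interface is sketched below: the global structure, a GV() accessor that indirects through a pointer to whichever buffer is currently the volatile world, and a task that uses it. The macro expansion and the cur_volatile pointer are assumptions for illustration; Figure 2.4 shows the programmer-facing form CAMEL actually expects.

    /* Task-shared state lives in one global structure that backs both worlds. */
    struct camel_globals {
        int x;
        int y;
        int z;
    };

    /* Two copies of the structure; cur_volatile points at whichever copy is the
       volatile world for the currently executing task (assumed mechanism).     */
    extern struct camel_globals world_a, world_b;
    extern struct camel_globals *cur_volatile;

    #define GV(field) (cur_volatile->field)   /* assumed expansion of the GV() accessor */

    int getReading(void);

    /* Tasks carry the task_ prefix so the compiler tracks their buffer accesses. */
    void task_sample(void) {
        int i = getReading();   /* within-task locals stay purely volatile */
        GV(x) = i;              /* task-shared data goes through GV()      */
    }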

After dividing the application into different worlds using tasks, the flow of the program must

be described in main—each task must be called in an order that would result in an identical

execution to that of the unmodified version of the program. We limit the use of main to

calling tasks and read-only conditionals that determine the next task, as CAMEL does not

track idempotency outside of tasks.

2.4.5 CAMEL Compiler

Our static analysis ensures (1) idempotency of tasks across power cycles and (2) consistency

of the differential shared buffers across tasks. Any writes made to the buffer by a task will

persist across power failures, causing the re-execution of the task after recovery to yield

different results than expected. The compiler statically tracks data in the volatile world

and inserts code between tasks to ensure data idempotence upon entry or re-entry in a task

after a power failure. This ensures the system-level atomicity of tasks—the results of a

task are never committed to the non-volatile world until the task is complete. To achieve


[Figure 2.5 diagram: the volatile and non-volatile buffers holding prev/curr values (16 and 20) across (1) task start, (2) a mid-task power failure and restart, and (3) undo-logging, driven by the example code GV(curr) = GV(prev) + 4; ... GV(prev) = GV(curr);]

Figure 2.5: (1) Shows the start of a task, (2) shows a power failure midway through execution of a task, and (3) shows undo-logging before any task begins execution. The state of the non-volatile and volatile buffers is shown after each of the three steps.

idempotency and atomicity of a task, we implement a differential double buffer solution,

using the difference between the two buffers to ensure forward progress as well as re-execute

a task in case of a power failure. At any given point in the program, we have two live copies

of the buffer termed volatile and non-volatile. Tasks work on global variables in the volatile

buffer and do not interact directly with the non-volatile buffer. The non-volatile buffer serves

as the fail-safe against inconsistencies in memory due to power failures. CAMEL calculates

the CRC of the checkpoint registers and the non-volatile buffer, which is copied over the

volatile working buffer in the case of a power cycle to prevent memory inconsistencies. We

refer to this process as undo-logging, whereby we undo all non-idempotent variables changed

by the task in the volatile buffer. We illustrate this process in Figure 2.5. Undo-logging

takes effect before entry into a task, undoing all changes made in the volatile buffer by a

partially-executed task interrupted by a power failure.

The successful execution of a task is marked by a commit, which involves swapping the

volatile and non-volatile buffers and re-calculating the CRC on the updated non-volatile

buffer. Swapping the two buffers is essential because the volatile buffer contains updated


1.  struct {
2.    int x;
3.    int y;
4.    int z;
5.    int temp;
6.    int result;
7.  };
8.
9.  void task_compute() {
10.   GV(temp) = GV(x) + GV(y);
11.   GV(result) = GV(result) + GV(temp);
12. }

Figure 2.6: Shows how tasks use data in the differential buffers. The only non-idempotent variable is result, since it undergoes write-after-read: it is first read in line 11 and also written to. This sequence of instructions in assembly would result in a write-after-read violation.

state after the successful execution of a task. Crucially, the buffer swap can be implemented

as a pointer re-assignment, which reduces redundant data movement: instead of copying

data from a dedicated non-volatile buffer, modifying it in a private volatile buffer, and re-

committing it to the non-volatile store, tasks work directly on the data in the volatile world,

which is later rendered non-volatile by CAMEL as part of the commit process. We discuss

enforcing the atomicity of the commit in §2.5.3 to prevent incorrect execution stemming from

an interrupted commit. In order to ensure correct forward progress, the CAMEL compiler

resolves differences between the two worlds to keep program state consistent between tasks.

The compiler establishes idempotence of tasks by only undo-logging variables that undergo

the write-after-read dependency in a task. These variables cause memory inconsistencies and

incorrect execution if a power failure and re-execution occur after the write and before the

commit, as the preceding read will read a different value from the last execution. We refer

to these variables as non-idempotent; non-idempotent variables must be undo-logged before

a task to ensure the task’s idempotence.
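A minimal sketch of the commit/undo-log skeleton the compiler emits around a task call site might look as follows; the helper names (camel_commit, camel_undo_log_task_compute) and the pointer-based world swap are assumptions consistent with the description above, not CAMEL's literal output.

    #include <stdint.h>

    struct camel_globals { int x, y, z, temp, result; };  /* task-shared buffer from Figure 2.6 */

    extern struct camel_globals world_a, world_b;
    extern struct camel_globals *cur_volatile, *cur_nonvolatile;
    extern uint16_t stored_crc;

    uint16_t camel_crc(void);
    void task_compute(void);

    /* Commit: swap which buffer is the non-volatile world (a pointer re-assignment,
       so the just-updated data is locked in without copying), then refresh the CRC. */
    static void camel_commit(void) {
        struct camel_globals *tmp = cur_nonvolatile;
        cur_nonvolatile = cur_volatile;
        cur_volatile    = tmp;
        stored_crc = camel_crc();
    }

    /* Undo-log: restore the non-idempotent variables (here, result from Figure 2.6)
       in the volatile world from the non-volatile world, so a re-execution of the
       task after a power failure reads the same values as the original attempt.     */
    static void camel_undo_log_task_compute(void) {
        cur_volatile->result = cur_nonvolatile->result;
    }

    void run_task_compute(void) {
        camel_undo_log_task_compute();  /* inserted before the task call site    */
        task_compute();                 /* task works only on the volatile world */
        camel_commit();                 /* lock in forward progress              */
    }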

Maintaining buffer consistency requires identifying the difference between the volatile and


non-volatile buffers after the commit following every task. Each changed variable must be

updated in the volatile buffer to ensure the two buffers remain consistent between tasks. We

resolve the volatile/non-volatile difference using the functions defined below.

Data Types: The compiler is packaged with the functionality to copy different types of

variables from the non-volatile buffer to the volatile buffer. We cover the three types of

variables supported by C: scalars, compounds, and unions.

The copy_scalar method copies different types of scalars from the non-volatile buffer to the

volatile buffer. For contiguous variables, we implement several different methods to efficiently

handle the required logging. The compiler can choose from three different mechanisms

while logging arrays. The most basic mechanism is logging the entire array, referred to as

copy_array, which is essentially a memcpy over the entire array. copy_array_scalar provides

the functionality to log only one index of an array given that the index is stored in a scalar

which is part of the differential buffers. Finally, we implement the copy_array_scalar_local

mechanism for when tasks use local variables to index into and modify global shared arrays.

This method saves the value of the local variable when it is used to index into the shared

array and uses it to perform only the required copies during logging. As structures and

unions are also contiguous variables, much like arrays, we reuse the mechanisms developed

for arrays for both of these types.
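To make these logging helpers concrete, the sketch below shows one plausible shape for them; the signatures (the two worlds modeled as raw byte buffers plus offsets) are our illustration, not CAMEL's exact interface.

#include <string.h>
#include <stdint.h>

/* Illustrative signatures only: CAMEL's real helpers operate on the volatile
   and non-volatile world buffers maintained by its runtime.                  */
void copy_scalar(void *dst_world, const void *src_world,
                 size_t offset, size_t size) {
    /* copy one scalar that lives at the same offset in both worlds */
    memcpy((uint8_t *)dst_world + offset,
           (const uint8_t *)src_world + offset, size);
}

void copy_array(void *dst_world, const void *src_world,
                size_t offset, size_t nbytes) {
    /* log an entire array: essentially a memcpy over the whole array */
    memcpy((uint8_t *)dst_world + offset,
           (const uint8_t *)src_world + offset, nbytes);
}

void copy_array_scalar(void *dst_world, const void *src_world,
                       size_t array_off, size_t elem_size, size_t index) {
    /* log only one element; the index itself is a scalar kept in the buffers.
       copy_array_scalar_local is analogous, with the index saved from a
       task-local variable at the point of use.                               */
    size_t off = array_off + index * elem_size;
    memcpy((uint8_t *)dst_world + off,
           (const uint8_t *)src_world + off, elem_size);
}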

2.5 Implementation

We implement the compiler portion of CAMEL as an LLVM pass [26]. LLVM's ability to

generate a detailed intermediate representation of code written in C using the Clang frontend

proves beneficial for us in functionally verifying CAMEL. We use LLVM version 10.0.0 to develop our compiler pass.

[Figure: source code is lowered to LLVM IR by Clang; static analysis (set formation over the global buffer's reads, writes, and read-first variables) and compiler modification produce the instrumented IR; msp430-gcc then produces the executable.]

Figure 2.7: Shows the pipeline for the generation of a CAMEL-certified executable.

The IR is then compiled to msp430 native assembly by the

LLVM compiler using the target=msp430 tag. The final step uses msp430-gcc to generate

an executable, ready to be flashed on a board.

2.5.1 Compiler Analysis

The compiler’s aim is to populate sets of read and written variables to find non-idempotent

memory accesses. Figure 2.7 illustrates the pipeline of our analysis from source code to an

executable. Our pass statically analyzes the structured, architecture-independent LLVM IR generated from the task-divided code, written by the programmer using the conventions highlighted in §2.4.4. LLVM provides interfaces to traverse, interact with, and change the IR.

Our pass analyzes every function declared with the prefix task_. We focus our analysis on

instructions in IR that are directly involved in interacting with memory locations, namely

load, store and memcpy. Furthermore, we are only interested in said instructions if their

operands are a part of the volatile world global buffer as only that buffer is impacted by

task execution.

Our pass begins by performing intra-module static analysis on the LLVM IR, examining all

function declarations to determine whether they are tasks. Once a task is identified, our pass


traverses the control-flow graph of the function, searching for loads, stores and memcpys.

After identifying the instructions of interest in the control-flow graph, we backtrack from the

operands of the instructions to their first use in a task. At this stage, we only add variables

backed by the global world buffer to their respective read/write set. In addition to a set of

read and written variables, we maintain a set of read-first variables—variables that are read

before they are written in a task. Once all sets are populated, we take the intersection of the

read-first and write sets. This produces a set with the variables subject to a write-after-read

dependency, which Figure 2.6 demonstrates. Note that our analysis is context-insensitive:

when the compiler cannot predict which branch of a conditional statement will execute and

one of the paths would mark the variable as read-first, the compiler considers the variable

read-first regardless of the execution path. This static analysis guarantees the detection of

all idempotent violations within a task by pessimistically analyzing all execution paths.
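As an illustration of the set formation (the task below is hypothetical, not one of our benchmarks, and the GV() macro definition shown is our stand-in for CAMEL's actual accessor), consider how the analysis classifies each global-buffer variable; the comments show the resulting sets and why the conditional forces a pessimistic decision.

struct globals { int a, b, c, flag; };
struct globals *vol_world;            /* the volatile world buffer           */
#define GV(x) (vol_world->x)          /* same convention as Figure 2.6       */

void task_branchy() {
    GV(a) = GV(b) + 1;        /* b: read-first; a: written only              */
    if (GV(flag)) {           /* flag: read-first                            */
        GV(c) = GV(c) * 2;    /* on this path c is read before being written */
    } else {
        GV(c) = 0;            /* on this path c is only written              */
    }
}
/* read-first = {b, flag, c}; writes = {a, c}; their intersection is {c}, so
   only c must be undo-logged before the task. The analysis keeps c in the
   read-first set even though the else path never reads it.                   */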

2.5.2 Compiler Modifications

The compiler inspects main() to locate task call sites. It proceeds to insert 1) code to undo-

log data before a task call site and 2) code to copy data between the volatile and non-volatile

world to ensure buffer consistency after successfully executing a task. We copy variables in

the write-after-read-vulnerable set from non-volatile to volatile using the functions imple-

mented in §2.4.5.

For arrays, the user may choose to update a specific index of the array using a variable

defined in the volatile world buffer or a local, task-defined variable. The compiler can

insert logging code for any of these two variants. If the index is part of the volatile world

buffer, the compiler loads the value of the variable from the buffer, uses the built-in LLVM

getelementptr instruction to get the array from the buffer and logs the variable. If however,


the index is a local variable (i.e., not part of the buffer), we insert code in tasks to store the

value of the index in the volatile world buffer at the time of change. We then use this global

variable to log the specific index of the array. For structures, we choose to log the entire

structure rather than specific elements using memcpy() since our benchmarks use structures

which consist of mostly two or three scalars. Hence, logging the entire structure is not

significantly costlier than logging a scalar within the structure.
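Putting the pieces together, a task call site in main() ends up shaped roughly like the sketch below; the helper names other than task_compute are illustrative stand-ins for the code the compiler actually emits as LLVM IR, so treat this as a sketch, not CAMEL's API.

extern void task_compute(void);            /* the task from Figure 2.6             */
extern void camel_commit(void);            /* Algorithm 2: swap worlds, CRC/canary */
extern void camel_copy(const char *var);   /* stand-in for the §2.4.5 copy helpers */

void run_task_compute(void) {
    camel_copy("result");   /* undo-log: result is the only variable of
                               task_compute in the write-after-read set       */
    task_compute();         /* executes entirely against the volatile world   */
    camel_commit();         /* the volatile world becomes the committed world */
    camel_copy("temp");     /* buffer consistency: variables the task wrote   */
    camel_copy("result");   /* are refreshed in the (new) volatile world      */
}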

2.5.3 Recovery

The recovery component has two major elements: the commit and the recovery functions.

Algorithm 2 describes our commit procedure. Volatile and non-volatile worlds are imple-

mented as pointers to global buffers, which allows us to swap the values of each pointer to

swap world views. To ensure the atomicity of commits, the decision of which pointer points

to which buffer is determined by a flag value that is inverted at the end of each commit.

The commit procedure saves all registers to a protected region in the non-volatile buffer

then updates the canary value or re-calculates the CRC, enabling the recovery routine to

correctly verify SRAM’s data. We implement the function to save registers and calculate

the CRC in native MSP430 assembly; the argument to the SAVE_REGISTERS function is the

memory location to place the registers in. When the CRC is used, it is calculated over the

saved register file and non-volatile world buffer, excluding the CRC result itself.

Algorithm 2 Task Commit Routine
1: NON_VOLATILE ← FLAG ? &BUF1 : &BUF2
2: VOLATILE ← FLAG ? &BUF2 : &BUF1
3: SAVE_REGISTERS(NON_VOLATILE->reg_file)
4: Guard_BUFFER_REGS_integrity(CRC_MODE)
5: FLAG ← not FLAG
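A minimal C rendering of Algorithm 2 is sketched below; the buffer layout, sizes, and the two external routines (the register save is MSP430 assembly, the guard is the CRC or canary update) are simplified stand-ins, not CAMEL's exact implementation.

#include <stdint.h>

#define GLOBAL_BUF_SIZE 512                 /* placeholder size               */

struct world {
    uint8_t  data[GLOBAL_BUF_SIZE];         /* differential buffer            */
    uint16_t reg_file[16];                  /* protected register-file region */
};

struct world BUF1, BUF2;                    /* the two SRAM buffers           */
volatile uint8_t FLAG;                      /* decides which buffer is which  */
struct world *NON_VOLATILE, *VOLATILE;

extern void SAVE_REGISTERS(uint16_t *reg_file);            /* MSP430 assembly */
extern void guard_buffer_regs_integrity(struct world *w);  /* CRC or canary   */

void camel_commit(void) {
    NON_VOLATILE = FLAG ? &BUF1 : &BUF2;    /* select which buffer plays the
                                               non-volatile role              */
    VOLATILE     = FLAG ? &BUF2 : &BUF1;    /* and which is volatile next     */
    SAVE_REGISTERS(NON_VOLATILE->reg_file);
    guard_buffer_regs_integrity(NON_VOLATILE); /* update canary / recompute CRC */
    FLAG = !FLAG;                           /* inverted at the end of each
                                               commit to keep the commit atomic */
}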

We modify the MSP430 reset vector to point to the recovery function when the device

regains power. The recovery function passes control to the program by either restarting


from the beginning of the program or from the last in-place checkpoint in the SRAM, based

on whether the SRAM integrity check passes. The recovery routine first reads the flag

value, which determines which global buffer represents which world. Then, depending on

the integrity check strategy, the routine either recomputes the CRC over the non-volatile

world buffer or validates the canary value’s integrity. If the non-volatile world’s contents

are integral, recovery commences: 1) the non-volatile world is copied to the volatile world;

2) the platform is re-initialized; 3) register values are restored from the non-volatile world

buffer; and 4) the program counter is restored, resuming execution from the last in-place

checkpoint.
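Reusing the illustrative types from the commit sketch above, the recovery path installed at the reset vector can be pictured as follows; the platform and register-restore routines are assembly in CAMEL and appear here only as declarations, and the flag polarity is glossed over.

extern int  sram_integrity_ok(struct world *nv);  /* recompute CRC or check canary */
extern void platform_reinit(void);
extern void RESTORE_REGISTERS(const uint16_t *reg_file); /* assembly; also restores PC */
extern void restart_program(void);

void camel_recover(void) {
    /* the flag read at boot identifies which buffer holds the committed world */
    struct world *nv = FLAG ? &BUF2 : &BUF1;
    struct world *v  = FLAG ? &BUF1 : &BUF2;
    if (!sram_integrity_ok(nv)) {
        restart_program();                     /* off-time too long: SRAM decayed  */
        return;
    }
    *v = *nv;                                  /* 1) copy the non-volatile world   */
    platform_reinit();                         /* 2) re-initialize the platform    */
    RESTORE_REGISTERS(nv->reg_file);           /* 3)-4) restore registers and resume
                                                  from the last in-place checkpoint */
}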

2.5.4 Correctness

Correctness was a first-class part of our design and implementation processes. Our correctness measures validate that our CAMEL implementation: 1) generates instrumented programs capable of running on harvested energy; 2) generates programs that result in final states equivalent to those of the uninstrumented programs on continuous power, regardless of power cycle frequency; and 3) detects, through both the CRC and canary strategies, unexpectedly long off times that corrupt SRAM. To obtain a golden reference of what to expect from the compiler, we

manually instrument all benchmarks and manually compare the generated assembly against

the CAMEL-instrumented assembly. Our comparison shows that there are no differences be-

tween the two, meaning CAMEL is capable of inserting fault-free code to log data used across

different benchmarks. Our second line of defense is a set of regression tests that capture

corner cases, the data types available in C, and bugs in earlier versions of CAMEL. Third,

we conduct 10 trials of execution for every benchmark to ensure correctness. In each trial,

we execute the uninstrumented and instrumented benchmarks, semantically comparing the

final state of the system after completion. While executing the instrumented programs, we


introduce approximately 20 random on- and off-times, reflective of real-world energy harvest-

ing traces [8]. Finally, we simulate the uncommon case of extended off times to validate the

effectiveness of the CRC and the canary.

2.6 Evaluation

We evaluate CAMEL against the only existing in-place SRAM-based system [44] and other

programmer-guided, task-based systems [4, 27, 30]. For the competing systems, we use their

publicly-available implementations, without modification. However, for [44], we adapt it

to a continuous checkpointing version of itself to fairly evaluate against CAMEL. To compare

performance across these systems, we evaluate each on benchmarks that are used to evaluate

previous programmer-guided systems. Our evaluation demonstrates that CAMEL:

• enables long-life, hardware-support-free intermittent computation on Flash-based de-

vices

• outperforms existing systems on both Flash and FRAM devices

We implement CAMEL on the MSP430 platform using both Flash and FRAM-based devices.

We evaluate CAMEL on the MSP430F5529, a Flash-based device, and show that CAMEL

enables long-life, high-performance computation using continuous in-place checkpointing on

a wider range of devices than past work by eschewing common-case NVM writes. Next,

we evaluate CAMEL on the MSP430FR6989, an ultra-low-power MCU containing FRAM

as its NVM. Implementing CAMEL on the MSP430FR6989 allows direct comparison between

CAMEL prior work. We choose these devices because they are representative of the capabilities

and limitations of microcontrollers found in deployed energy harvesting systems [33, 39, 48].


Additionally, to facilitate reproducibility, they are available from Texas Instruments as part

of development boards.

We test CAMEL on five benchmarks developed in past work [4, 27, 30] to represent the types

of applications found in energy harvesting systems:

• Activity Recognition (AR): AR uses simulated samples from a three-axis ac-

celerometer to train a nearest neighbor classifier to determine whether a device is

stationary or moving.

• Bit Count (BC): BC uses seven different algorithms to count the set bits in a given

sequence.

• Cold-Chain Equipment Monitoring (CEM): CEM simulates input data from a

temperature sensor and later compresses the data using LZW compression [43].

• Cuckoo Filter (CF): A Cuckoo filter is a data structure used to efficiently test for

set membership [6]. This benchmark stores random data in a Cuckoo filter and later

queries the filter to recover the data.

• Data Encryption (RSA): RSA is a widely used public-key cryptosystem [38]. It

encrypts a given string using a user-defined encryption key that is stored in the memory.

2.6.1 Experimental Setup

Our experimental setup draws motivation from previous work [8] on emulating environmental

conditions for real-world energy harvesting use cases in experimental and in-lab setups. We

run our benchmarks on actual hardware, MSP430F5529 and MSP430FR6989, connected to

a variable voltage supply to emulate intermittent computation on harvested energy. For


                    AR     BC     CEM    CF     RSA    avg.
DINO [27]           32.02  39.48  11.53  11.26  16.86  22.23
Chain [4]           28.98  78.75  11.90  11.28  23.63  30.91
Alpaca [30]         22.29  40.15  10.79  10.56  13.05  19.37
TOTALRECALL [44]    -      -      -      -      -      ∞
CAMELcrc (20°C)     -      -      -      -      -      1126k yrs
CAMELcrc (55°C)     -      -      -      -      -      256k yrs
CAMELcanary         -      -      -      -      -      ∞

Table 2.1: Deployment lifetime (in hours) for existing programmer-guided systems on Flash-based devices, alongside the expected time until a silent data corruption for several CAMEL configurations.

voltage control, we use an in-house power control platform that is capable of delivering

arbitrary voltage that can mimic arbitrary energy traces; this allows for both randomization

and replay of energy availability scenarios. We flash each benchmark discussed previously

on the desired device and connect the device to our power controller. We instrument the

benchmarks to drive a GPIO pin high on successful completion of the benchmark and connect it to a low-power LED to signify successful completion and, hence, correct

execution under harvested energy. Furthermore, we conduct all experiments at 20◦C to

reduce the impact of temperature on SRAM’s data retention time. As in §2.5.4, we perform

10 trials of each benchmark to filter randomness.

2.6.2 Time to death—Flash Failure

We extend the experiments in §2.3.2 to include CAMEL. We consider the comparable time

to system failure with CAMEL to be the first Silent Data Corruption (SDC) stemming from

the CRC failing to detect an error. The time until SDC varies with power-cycle frequency

and length; the worst-case for CAMEL is repeated power-cycles just long enough to produce

corruption that might evade the CRC check. We compare the system lifetimes in Table 2.1


and observe that CAMEL extends platform lifetime to well beyond typical deployment times. (CAMEL's lifetime does not vary with program behavior; therefore, we only model the CAMEL-related values once.)

All implementations of CAMEL avoid premature failure by avoiding Flash checkpoints. In the

Canary implementation of CAMEL, platform lifetime is unbounded—carefully-chosen canary

locations prevent all SDCs stemming from SRAM data volatility.

We determine Flash’s lifetime on the state-of-the-art by calculating how long it would take

for Flash to reach its maximum write/erase endurance (100,000 write/erase cycles [17] for

MSP430; TI's Flash endurance estimates match what we see experimentally). As in §2.3.2, Table 2.1 values are calculated using optimistic conditions (e.g., perfect wear-leveling for free), given each benchmark's checkpoint size, checkpoint

rate, Flash segment size, and the number of free Flash segments for each benchmark. The

checkpoint size is simply the size of data that is written to the NVM at the time of taking

the checkpoint. The checkpoint rate is obtained by running the unmodified C application

and calculating the CPU cycles to completion on the FRAM-based MSP430FR6989 board.

We then divide the CPU cycles by the number of checkpoints each system makes for each benchmark, as listed in Table 2.2. The Flash segment size comes from the MSP430

documentation [17] and is important, because only entire segments of the Flash can be

erased and written. Note that TOTALRECALL [44] has an unbounded Flash lifetime since it

only utilizes the SRAM for checkpoints.
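To make the shape of this lifetime calculation concrete, a back-of-the-envelope version is sketched below under the perfect-wear-leveling assumption; the constants are placeholders for illustration only and are not the per-benchmark values behind Table 2.1.

#include <stdio.h>

int main(void) {
    /* Optimistic model: with perfect wear-leveling, the device dies once every
       free Flash segment has seen its maximum number of write/erase cycles.   */
    const double endurance_cycles      = 100000.0; /* per segment [17]           */
    const double free_segments         = 32.0;     /* placeholder                */
    const double erases_per_checkpoint = 1.0;      /* checkpoint fits one segment */
    const double checkpoints_per_sec   = 50.0;     /* placeholder; derived from
                                                      the measured checkpoint rate */

    double total_erases = endurance_cycles * free_segments;
    double erase_rate   = checkpoints_per_sec * erases_per_checkpoint;
    double lifetime_hrs = total_erases / erase_rate / 3600.0;
    printf("estimated Flash lifetime: %.1f hours\n", lifetime_hrs);
    return 0;
}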

2.6.3 Run-time Overhead

To evaluate CAMEL against its predecessors in terms of run-time, we measure the CPU cycles

to completion for each benchmark using the MSP430FR6989 and model the overhead each

system would incur on a Flash-based device. We evaluate all systems and benchmarks on

continuous power rather than harvested energy, as we aim to characterize the overhead of the in-place checkpoint, which is the most expensive and recurring operation in contrast to the recovery routine. Figures 2.8 and 2.9 present the overhead incurred by CAMEL along with the overhead for the systems we evaluate against. Our numbers are similar to what one would see when executing on harvested energy.

Figure 2.8: CAMEL run-time overhead for Flash-based devices. The global buffer size for each benchmark is stated in parentheses on the x-axis.

Figure 2.9: CAMEL run-time overhead for FRAM-based devices. The global buffer size for each benchmark is stated in parentheses on the x-axis.

CAMEL vs TOTALRECALL: We compare CAMEL to the only SRAM-based system developed

to date, TOTALRECALL [44]. We adapt TOTALRECALL from a just-in-time to a continuous

checkpointing system for a fair comparison with CAMEL and other continuous checkpointing,

task-based approaches we evaluate against. While the canary is easily adaptable, the CRC

proves to be much more of a challenge since every potential write to the memory violates

the existing CRC. A correct solution provides the abstraction of concurrent, atomic memory

writes and corresponding CRC update. To fulfill this abstraction, we record both the cur-

rent and future value of a soon-to-be-overwritten memory location as part of the checkpoint.

Upon recovery, the system uses both values and the CRC over memory to construct two

potential CRCs. No matter where the power cycle occurs, one of the CRCs will match if

memory contents are integral. This naive approach results in a large number of checkpoints

being taken by the continuous TOTALRECALL extension, as can be seen in Table 2.2. We im-

plement both of these systems and evaluate their run-time overhead using our benchmark set.

These run-time overheads are over an order-of-magnitude worse than existing approaches.

Observe that CAMEL outperforms TOTALRECALL by 3x and 4x for the canary and CRC vari-

ants of the systems respectively. We identify that the cause of the poor performance

is a surplus of non-volatility. CAMEL introduces the notion of selective non-volatility

through the introduction of the volatile and non-volatile worlds (tasks) to ensure that not

every part of the SRAM is treated as non-volatile between power-offs, resulting in a

significant decrease in checkpointing overhead.


CAMEL vs task-based state-of-the-art: We compare CAMEL against the state of the art

on both platforms. For Flash-based systems (Figure 2.8), CAMEL performs significantly bet-

ter than each of the systems we evaluate against. CAMELcanary performs 50x better while

CAMELcrc performs 7x better than previous systems on average, highlighting the run-time

advantages of avoiding NVM writes on Flash platforms. We isolate the effects of CAMEL’s

differential buffer design on execution time by running it on the FRAM platform, where

CAMEL does not have to calculate a CRC to ensure non-volatile shared data integrity. The

results in Figure 2.9 indicate that CAMEL’s buffer design to reduce data movement overhead

yields a significant improvement over the state of the art, finishing each benchmark on av-

erage twice as fast as the next best system (Alpaca).

State-of-the-art—FRAM vs Flash: The state-of-the-art systems exhibit different run-time over-

heads when evaluated on boards with different choices of persistent memory (apart from

TOTALRECALL [44]). Systems showcase opposite trends for some benchmarks on Flash-based

devices when compared to FRAM-based devices. This is exhibited in the numbers for CF in

Figure 2.8, where DINO [27] performs better than its successor, Alpaca [30]. However, we

observe a different trend in Figure 2.9. This change is due to the different number of NVM

checkpoints made by the two competing systems as can be seen in Table 2.2. FRAM writes

are significantly less costly than Flash writes, hence a surplus of checkpoints on FRAM-based

boards will not affect the run-time overhead by a large percentage. However, on Flash-based

boards, a larger number of checkpoints result in a more significant difference in run-time

performance due to the time cost of Flash writes.

Trade-off—CAMELcanary vs CAMELcrc: Due to a less costly data corruption detection mech-

anism, CAMELcanary performs better than CAMELcrc for every benchmark. Computing the

CRC over the non-volatile differential world (buffer) after every task makes the run-time


                    AR     BC    CEM   CF    RSA   avg.
DINO [27]           1136   717   259   324   1830  788
Chain [4]           2008   717   231   452   315   744
Alpaca [30]         2008   717   225   452   315   743
TOTALRECALL [44]    124k   18k   2272  6720  27k   36k
CAMEL               1999   709   114   385   254   692

Table 2.2: Checkpoints each system makes per benchmark. We can see that each system makes a comparable number of checkpoints; hence, the difference in run-time and binary-size overheads cannot be a result of a different number of checkpoints.

impact of each checkpoint heavily dependent on the size of the buffer. Canary values trade

off run-time overhead for pre-deployment effort: by characterizing each device and deter-

mining which SRAM cells fail first, users can reduce the CAMEL integrity check from a

CRC calculation to a handful of canary data comparisons. Our evaluation indicates that

CAMELcanary finishes benchmarks approximately 5 times faster than CAMELcrc.

Commit routine: The commit routine which follows every task is a main source of run-time

overhead. The commit can either deploy with the canary or the CRC as the sentinel value,

depending on the variant of CAMEL in use. The commit coupled with canary values results

in a constant run-time overhead across all benchmarks (∼20 CPU cycles) since CAMELcanary

only stores and compares against a pre-determined value on run-time. However, since the

CRC guards the differential buffers by computing a value over the data in them, the run-time

overhead for CAMELcrc is a function of the size of the buffers. The average run-time overhead

incurred by the commit across all benchmarks is 2007 CPU cycles.

2.6.4 Binary size overhead

We compare the binary size overhead between CAMEL and the state-of-the-art in Figure 2.10. CAMELcanary produces smaller binaries when compared to CAMELcrc because the canary recovery routine takes up fewer lines of code than the CRC.

Figure 2.10: CAMEL binary size increase compared to current state-of-the-art.

Our evaluation shows that CAMELcanary reduces binary size when compared to past work,

while CAMELcrc’s binary overhead impact is comparable (approximately 1% larger than Al-

paca [30]).

The commit routine which accompanies every CAMEL task can be implemented as an inline

or a naked function. Figure 2.10 shows the overhead incurred by both of these versions. The

trade-off for both approaches is run-time for binary size. Both approaches produce binaries

that scale linearly with the number of tasks in a benchmark; the inline version produces

larger but faster executables, while the naked function approach produces smaller but slower

executables. However, it is noteworthy that a naked function call incurs a fraction of the run-

time overhead of a regular function call. It increases the run-time overhead of the commit

routine by a constant number of CPU cycles (∼5), resulting in a negligible change in the

overall run-time performance.


2.6.5 Automatic Systems

Automatic continuous checkpointing intermittent computation systems [31, 46] are an alter-

native to programmer-guided systems; they remove all burden from the programmer, but also remove all control from them, as it is impossible to specify forward-progress-level atomicity.

In addition to removing all programmer control, fully-automatic approaches exacerbate the

problems that CAMEL addresses: they have a higher rate of checkpoints that kill Flash even

faster than programmer-guided approaches. Therefore, the best-case Flash performance for

these systems is much worse than the worst-case listed in Table 2.1.

Because it is not the focus of the paper, we exclude extensive overhead results for automatic

systems, but our experiments show that CAMEL is approximately 4x better than Ratchet [46]

on FRAM-based boards. Extending this to Flash-based boards results in even higher over-

head for Ratchet [46], because of time cost of Flash writes/erases. As for Chinchilla [31] (an

extension of Ratchet that aims to dynamically elide checkpoints), given that its run-time

overhead is comparable to Alpaca [31] and CAMEL performs 2x better than Alpaca [30] on

FRAM-based devices (Figure 2.9), we expect a similar margin for Chinchilla.

2.7 Related Work

While CAMEL shows the possibilities of combining time-dependent non-volatility [44] with

programmer-guided continuous checkpointing, other checkpointing approaches exist.

One-time checkpointing approaches back up volatile state to non-volatile memory in a just-in-

time manner, i.e., with just enough energy remaining to write the checkpoint [1, 2, 22, 37].

This requires the ability to measure the amount of energy stored in the system’s energy

storage capacitor. This increases system complexity and cost, as well as increasing its overall


energy usage—limiting the energy available for useful computation [46]. While one-time

checkpointing approaches minimize run-time overhead, the need for hardware support limits

their deployment.

Continuous checkpointing approaches eschew a single, large, just-in-time, checkpoint of the

entire volatile program state for many small checkpoints of only the essential volatile pro-

gram state to resume execution. Continuous checkpointing enables intermittent computation

without special hardware support at the cost of decreased performance. No matter if archi-

tecture, programmer, or compiler driven, all existing continuous checkpointing approaches

are incompatible with Flash devices due to Flash's endurance, performance, and power limita-

tions [17].

Continuous checkpointing fits naturally with sequential hardware design if you consider the

flip-flops as non-volatile state and the combinational logic between them as volatile state. Ide-

tic [32] employs this model to support intermittent computation at the circuit level using

existing high-level synthesis tools. While this works for simple applications that readily

map to hardware circuits, it is not generalizable. Conventional processor pipelines are also

compatible with continuous checkpointing if you consider pipeline registers as non-volatile

state and the operations between stages as volatile state. Non-volatile processors [28, 29]

leverage this observation by implementing pipeline registers with non-volatile flip-flops (e.g.,

FRAM). Recent work improves performance [7] by allowing for approximate results. An

alternative to non-volatile processors is Clank [10], which enforces dynamic memory idem-

potency. Architecture-driven continuous checkpointing approaches tend to be at an extreme,

with a large number of small checkpoints.

CAMEL builds on existing programmer-guided intermittent computation systems: combin-

ing Chain’s [4] expressive programming interface with idempotence analysis as used by Al-

paca [30]. From this, CAMEL introduces the idea of swappable mixed-volatility worlds


backed by a differential analysis that allows data reuse across tasks. This improves perfor-

mance regardless of NVM type by reducing redundant data copying. To enable programmer-

guided approaches on Flash devices, CAMEL bifurcates program data into a non-volatile world

between tasks and a volatile world within a task.

Ratchet [46] and Chinchilla [31] replace programmer reasoning with compiler analysis to

produce an automatic approach to supporting intermittent computation—at the cost of

removing the abstraction of forward progress atomicity from the programmer. Ratchet de-

composes programs into restartable units using idempotence analysis while Chinchilla [31]

builds on Ratchet with a smart timer and basic-block-level energy estimation to elide check-

points at run time. While Chinchilla eliminates up to 99% of Ratchet’s checkpoints, it too

quickly kills Flash devices.

2.8 Conclusion

This paper exposes and addresses the unexpected lifetime and performance limitations of cur-

rent programmer-guided approaches to intermittent computation on both Flash and FRAM

devices. The improvements center on the abstraction of two worlds that co-exist during

program execution: a non-volatile world that contains the data that tasks use to communi-

cate between each other and that is used for post-power-cycle recovery and a volatile world

that contains data used by a task. The non-volatile world’s state is protected from corrup-

tion on Flash-based systems by either a CRC or a canary location. The proposed approach

also advances performance on FRAM-based platforms by minimizing the data copied when

transitioning between worlds, when going between tasks. The result is the first programmer-

guided intermittent computation system that runs on both Flash and FRAM devices while

providing the highest performance on both.


Chapter 3

SABLE: A Compiler-only Alternative to

CAMEL

In this chapter we present an alternative to CAMEL's (§2) programmer-guided, continuous checkpointing model: SABLE. Unlike CAMEL (§2), SABLE does not require any intervention from the programmer and is entirely dependent on compiler analysis and modification to store and restore program state before and after a power failure. Like CAMEL (§2), SABLE relies on continuous run-time checkpoints to take frequent snapshots of the program state that can later be used to continue execution. SABLE exploits the SRAM's time-dependent non-volatility, an idea that CAMEL (§2) borrows from [44]. However, extending the SRAM's time-dependent non-volatility to a continuous, task-based approach required the employment of significant novel techniques, as discussed in §2.

SABLE draws motivation from Ratchet [46] which is the only wholly automatic, compiler-

based intermittent computation model. Furthermore, extending the SRAM’s time-dependent

non-volatility [44] to a continuous checkpointing system without programmer reasoning re-

quires refining and developing new checkpoint and recovery routines for SABLE.

The following will be explored in the coming sections:

1. We explore the trade space between programmer-guided and compiler-based approaches to identify how one approach compares to the other in terms of design, implementation complexity, and run-time overhead (§3.1).

2. We then describe SABLE's design and the implementation techniques employed to bring SABLE to life (§3.2, §3.3).

3. Lastly, we showcase some preliminary numbers that exhibit SABLE's run-time and binary size overhead (§3.4).

3.1 Programmer-intervention vs. Compiler-analysis Trade-off

In this section, we explore the trade space between programmer intervention and compiler analysis in terms of intermittent computation on energy harvesters. Programmer intervention requires a programmer to decompose the structure of an application using a set of pre-defined rules, similar to those defined in §2.4.4. Such decomposition aims to lower the complexity of the analysis that needs to be performed to enable continuous checkpointing and to decrease the run-time checkpointing overhead. However, the degree to which the overhead is lowered depends on the nature of the application and how well a programmer can reason about its decomposition. Being too optimistic about decomposing an application can be a hindrance to forward progress: if a task requires more energy than can possibly be available in a single power cycle, the application will fail to complete. However, being too pessimistic about application decomposition can lead to a higher degree of checkpointing overhead. Hence, decomposing an application into its most efficient version can prove to be quite a challenge for programmers.

Using a wholly compiler-based approach can help solve the aforementioned challenges. Compiler-

analysis is absolute and requires no intervention from the programmer, making such a system


easy to deploy without having to modify any real-world applications. However, performing

analysis on raw source code can prove to be more difficult since programming languages

consist of a large number of constructs and it would be difficult and time-consuming to

handle all edge cases which are otherwise removed by application decomposition. Further-

more, compiler-based approaches tend to have a higher overhead than programmer-guided

approaches. In programmer-guided approaches, programmers can reason about whether a

checkpoint is needed or not but a compiler will insert checkpoints according to a pre-defined

set of rules.

3.2 SABLE Design

We design SABLE as a series of small systems, each system building upon the previous one to

arrive at the most efficient solution. All of our systems either make use of the canary values

or the CRC to validate data retention in the SRAM upon recovering from a power failure. In

this section, we will introduce SABLE's four independent modules that can be used to support

intermittent computation and the challenges we faced while coming up with their design.

3.2.1 Naive Canary

This is the first iteration of our attempt at using the canary values with a compiler-based,

continuous checkpointing system without programmer intervention. SABLE's naive canary

variant places a checkpoint before every memory altering instruction, which we identify to be

only the store and memcpy instructions in the LLVM IR. Like in CAMEL (§2), the checkpoint

stores the state of the register file in the SRAM. These values can then be used to recover

the state of the register file upon regaining power. Since a checkpoint is placed before every


memory altering instruction, this ensures that the state of the SRAM remains consistent

across power cycles, ensuring correct execution of long-running applications.

SABLE detects unexpectedly long off-times by ensuring the canary value remains in place (§2.4.2) when the device reboots. This, coupled with hardware enrollment and pre-characterization (§2.4.2), provides a high degree of corruption detection with low checkpointing and recovery

run-time overhead.

3.2.2 Idempotent Canary

Idempotent canary is the final and most efficient variant of SABLE to make use of the canary

values. We reuse CAMEL’s idempotent memory analysis for this variant.

As an extension of our naive implementation of the canary version, we optimize and extend

our solution to improve performance. Using the idea of idempotence [25] of an instruction

we reduce the total number of checkpoints that need to be inserted in our module, thereby

reducing the run-time checkpointing and recovery overhead.

Idempotent canary only places a checkpoint before a memory altering instruction to a vari-

able that is subject to the write-after-read dependency (§2.4.4). Variables that are only written in the scope of a function cannot cause a program to showcase inconsistent behaviour. A variable can only cause inconsistency if it is read before being written [25]. This is similar

to what we described in §2.4.4. Hence, we can reduce the total number of checkpoints by

only considering stores to variables that are undergoing the write-after-read dependency.
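As a hypothetical illustration (the function below is not from our benchmark set, and checkpoint() is an illustrative name for SABLE's in-SRAM register-file checkpoint), Naive canary would place a checkpoint before both stores, whereas Idempotent canary keeps only the one guarding the write-after-read variable:

extern void checkpoint(void);   /* SABLE's in-SRAM register-file checkpoint   */

int count, total;               /* globals, i.e. memory-backed shared state   */

void update(int sample) {
    count = sample;             /* write-only here: idempotent on re-execution,
                                   so Idempotent canary inserts no checkpoint */
    checkpoint();               /* kept: the next store is non-idempotent     */
    total = total + sample;     /* read-then-write (write-after-read)         */
}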

To compare Naive and Idempotent canary, we summarize the number of checkpoints placed

by each approach in Table 3.1. We can see a reduction of at least 33% in the number

of checkpoints placed in our benchmarks. This shows the degree of efficiency going from

Naive to Idempotent canary.


         AR   BC   CEM  CF   RSA
Naive    92   75   64   64   158
Idem     34   50   14   13   65

Table 3.1: Summary of checkpoints placed in the Naive and Idempotent variants of the canary version.

3.2.3 Naive CRC

The naive CRC implementation is the first variant of SABLE to make use of the CRC [24]

to detect unexpectedly long off-times. One might think adapting the CRC calculation to an

application that is continuously taking checkpoints is a trivial task, but that is not the case.

Much like CAMEL §2, we rethink our approach of how to calculate a recovery value which

is not rendered useless whilst the program is executing. CAMEL employs a double buffer

routine, ensuring that one of the buffers remain in a state which corresponds to the CRC

value stored in memory. However, doing that without programmer intervention can prove

to be tricky.

Naive CRC follows the naive canary implementation by placing checkpoints before every

memory altering instruction to store a snapshot of the program which can be used to resume

execution. As described in §2.4.2, the CRC recovery value is calculated over the program

stack and the entire SRAM. Upon regaining power, the recovery routine recalculates this

value and asserts against our pre-calculated value. If the assertion is successful, we restore

program state. Otherwise, we restart the execution of the program.

The caveat here is that each memory altering instruction renders the CRC useless since it changes the program stack. We employ a simple fix to adapt our CRC to ensure it stays valid until the next checkpoint is taken. At every checkpoint, SABLE calculates the CRC

without taking into account the memory locations that are written by the memory altering

instruction after the checkpoint. This ensures that if power is lost after a checkpoint and its succeeding store have executed, our stored CRC value remains valid until the SRAM begins

losing state [44]. One may argue that skipping the memory location that is written in the

CRC calculation would mean that SABLE would not detect the corruption of that specific

memory location. However, we do not care about that specific memory location since it

is being written. Consider the case where the specific memory location corrupted before

regaining power. On power on, the same memory location is going to have a value written

back to it, hence rewriting the corrupted data with new data. This is possible because

the MSP430 is a register-to-memory architecture [14], which means that a memory location that

is being written to another memory location must have its contents stored in a register before

they can directly be written to the destination memory location.
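A minimal sketch of this skip-the-next-write idea appears below; the streaming CRC helper, the region symbols, and the way the to-be-written location is passed to the checkpoint are assumptions for illustration, not SABLE's exact code.

#include <stdint.h>
#include <stddef.h>

extern uint16_t crc16_update(uint16_t crc, const uint8_t *data, size_t len);
extern uint8_t  SRAM_REGION[];       /* region the checkpoint protects         */
extern const size_t SRAM_REGION_LEN;

uint16_t saved_crc;

/* Checkpoint inserted before a store to `skip` (skip_len bytes): the CRC is
   computed over everything except the location the following store is about
   to overwrite, so the saved value stays valid once that store executes.
   (Register saving is omitted; how the skipped range is made known to the
   recovery-side recomputation is not shown.)                                 */
void checkpoint_crc(const uint8_t *skip, size_t skip_len) {
    size_t before = (size_t)(skip - SRAM_REGION);
    uint16_t c = crc16_update(0xFFFF, SRAM_REGION, before);
    c = crc16_update(c, skip + skip_len,
                     SRAM_REGION_LEN - before - skip_len);
    saved_crc = c;
}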

3.2.4 Batch CRC

We extend naive CRC to reduce the number of checkpoints placed by SABLE and thereby

reduce the run-time checkpointing overhead. The general idea this version of SABLE remains

constant — placing checkpoints before stores. However, instead of placing a checkpoint

before every memory altering instruction, we batch these instructions and skip the memory

locations written by all of these instructions. This allows us to merge multiple naive CRC

calculations into one, optimal CRC calculation. However, a number of challenges needed to

be addressed before we could develop Batch CRC.

The first problem to be solved before implementing batched CRC is to figure out an opti-

mum number of stores to batch together into one CRC calculation. To do this, we analyze

the number of stores per idempotent section and count the occurrence of each number of

stores in the program. An idempotent section is the code between two non-idempotent stores. We summarize our results in the cumulative probability distribution curves for our benchmarks in Figure 3.1. This helped us decide on an optimum number to use for our batching (4).

[Figure: cumulative probability distribution of the number of stores per idempotent region (x-axis: number of stores per idempotent region, 0 to 12; y-axis: probability, 0.3 to 1.0) for the AR, BC, CEM, CF, and RSA benchmarks.]

Figure 3.1: Cumulative density curve for SABLE batch CRC, which helps in determining the ideal number of stores to batch.

We isolate all functions that are part of the program source by placing checkpoints at the

start of the function and before every return instruction. This ensures that the number of

batched stores is always 0 when we enter or exit a function that has been declared and defined

as a part of our module. Furthermore, this ensures that our analysis is intra-procedural only.

We perform an intra-procedural control-flow analysis to batch stores pessimistically in all the

possible execution paths taken by a specific procedure. Pessimistic analysis enables taking


         AR   BC   CEM  CF   RSA
Naive    92   75   64   64   158
Batch    70   77   30   32   108

Table 3.2: Summary of checkpoints placed in Naive and Batch variants of the CRC version.

into account dynamic execution information, ensuring we never exceed the total number

of stores to be batched. We traverse the control-flow graph from the function exit block

backwards to batch stores.

In dealing with loops (for and while), we isolate the loop to ensure that zero stores are batched at the start of the loop. We then treat the loop as a continuous block of

instructions. This ensures that the maximum number of batched stores in a loop will not

exceed our maximum batch number.

We only batch stores between non-idempotent stores, until a maximum of 4 stores are batched

or we encounter a call to a procedure that is defined inside our module.
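Continuing the illustrative helper from the naive CRC sketch, a batched region then looks roughly like the following: one checkpoint excludes all (up to four) upcoming store targets from the CRC instead of checkpointing before each store. The checkpoint_crc_batch helper and the globals below are assumptions for illustration.

#include <stddef.h>

extern void checkpoint_crc_batch(const void *skips[], const size_t lens[], int n);

int a, b, c, d;   /* globals written by the batched, idempotent stores below  */

void batched_region(int x) {
    const void  *skips[] = { &a, &b, &c, &d };
    const size_t lens[]  = { sizeof a, sizeof b, sizeof c, sizeof d };
    checkpoint_crc_batch(skips, lens, 4);   /* one checkpoint for the batch   */
    a = x;
    b = x + 1;
    c = x + 2;
    d = x + 3;
}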

The resulting checkpoint reduction when we go from Naive CRC to Batch CRC is summarized in Table 3.2. This shows that even though batch CRC is not as efficient at reducing checkpoints as idempotence memory analysis, it still manages to reduce the number of checkpoints by at least 20%. However, we can also observe that there was no reduction in the BC benchmark. This is largely due to the nature of the BC application, which is filled with idempotent violations. According to the definition of batch CRC, we must

checkpoint before every idempotent violation and only batch between idempotent violations.

3.3 SABLE Implementation

We implement SABLE as a series of independent LLVM passes that can be incorporated in

the compilation pipeline of any C application which targets the MSP430 architecture. The


LLVM passes traverse the IR of the module under inspection function by function. We locate

stores in each function as they are the insertion points of our checkpoints in all variants. The

LLVM infrastructure provides API to check the type of every instruction, which we utilize

to locate all memory altering instructions. For non-naive versions of SABLE, we also make

use of the memory idempotence analysis developed as part of CAMEL. This analysis lists all

possible memory altering instructions that can potentially cause an idempotent violation.

The details of this analysis can be found in §2.4.5. Lastly, for Batch CRC, as described in §3.2.4, we must determine the maximum number of batched stores at the entry point of every basic block in the control-flow graph of a function. This is done by

traversing the control-flow graph multiple times using a breadth-first search traversal and

calculating the maximum number of stores at the entry of every basic block. This is repeated

until the point of convergence — the maximum value at the start of every basic block does

not change for any basic block. Our analysis draws inspiration from the classic iterative data-flow analysis framework; we use it to perform a static control-flow analysis that ensures applications execute faultlessly at run time.
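The following self-contained sketch shows the shape of that fixed-point computation over a toy control-flow graph; it is not the LLVM pass itself, the CFG encoding is ours, and the saturating transfer function is a conservative stand-in for the pass's handling of checkpoint resets.

#include <string.h>

#define MAX_PREDS   4
#define BATCH_LIMIT 4

/* Toy CFG for illustration: each basic block records its predecessors and
   how many stores it contains.                                              */
struct block { int npreds; int preds[MAX_PREDS]; int stores; };

/* Conservative transfer function: pending stores saturate at BATCH_LIMIT,
   which over-approximates the real pass (where hitting the limit inserts a
   checkpoint and resets the count), so the bound is never exceeded.         */
static int transfer(int in, const struct block *b) {
    int out = in + b->stores;
    return out > BATCH_LIMIT ? BATCH_LIMIT : out;
}

/* Iterate to a fixed point: in[b] is the pessimistic (maximum over all
   predecessors) number of already-batched stores on entry to block b.       */
void max_pending_stores(const struct block *cfg, int nblocks,
                        int in[], int out[]) {
    memset(in,  0, nblocks * sizeof(int));
    memset(out, 0, nblocks * sizeof(int));
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int b = 0; b < nblocks; b++) {
            int in_b = 0;
            for (int p = 0; p < cfg[b].npreds; p++)
                if (out[cfg[b].preds[p]] > in_b)
                    in_b = out[cfg[b].preds[p]];
            int out_b = transfer(in_b, &cfg[b]);
            if (in_b != in[b] || out_b != out[b]) {
                in[b] = in_b;
                out[b] = out_b;
                changed = 1;
            }
        }
    }
}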

3.4 SABLE Evaluation

Since SABLE is currently under development, we only evaluate SABLE across its different variants to show the effects of our compiler optimization techniques. We evaluate

the four different systems that we built using two metrics: run-time overhead and binary

size overhead.

Table 3.3 shows how the run-time overhead differs for every SABLE variant. These numbers

are attained by running the unmodified benchmark to completion and measuring the CPU


        Naive canary  Naive CRC  Idem. canary  Batched CRC
AR      1.30          6.59       1.15          4.77
BC      5.11          33.1       4.33          36.5
CEM     3.14          80.6       1.53          7.95
CF      4.61          50.5       2.55          29.0
RSA     4.39          80.0       3.59          37.22
avg.    3.71          50.18      2.63          23.1

Table 3.3: Relative run-time for different SABLE implementations

        Naive canary  Naive CRC  Idem. canary  Batched CRC
AR      1.11          1.23       1.07          1.09
BC      1.45          1.89       1.38          1.47
CEM     1.35          1.75       1.26          1.31
CF      1.20          1.43       1.13          1.17
RSA     1.48          2.10       1.29          1.42
avg.    1.32          1.68       1.21          1.29

Table 3.4: Relative binary size for different SABLE implementations

cycles taken. We then attain the CPU cycles each benchmark takes on every SABLE variant

and divide these by the CPU cycles of the unmodified benchmark. The overhead decreases

as a function of the number of statically placed checkpoints in every benchmark. We can

also observe that our final solutions (Batch CRC, Idempotent Canary) perform significantly

better than their naive predecessors.

Similarly, for binary size, we can see a downward trend from the naive implementations

of SABLE to their final, efficient solutions. This aligns with the fact that the number of

checkpoints decreases as we go from our naive to final solutions for both the CRC and the canary versions.


Chapter 4

Conclusion

In this research we present two different continuous-checkpoint-based intermittent computation models. We build our systems to utilize the time-dependent non-volatility of the SRAM (§2.3.3), an idea first explored by TOTALRECALL [44]. Our systems employ novel compiler analysis techniques to arrive at the most efficient solution. We then showcase in §2.6.2

why our systems are a necessity. Our systems provide long lifetimes for Flash-based devices,

which the current state-of-the-art fail to do.

The current state-of-the-art continuous checkpointing systems cannot be adapted to support

the Flash as they will end up rendering it useless within a matter of hours. This is due to the

high volume of NVM checkpoints taken by these systems to store snapshots of a program.

Our systems are the first continuous checkpointing models that ensure low-overhead and

long-life intermittent computation on devices regardless of the choice of NVM. Our evaluation

shows that CAMEL does indeed extend the lifetime of Flash-based devices and does so

efficiently, minimizing all sources of overhead. Furthermore, SABLE follows in the footsteps of CAMEL. Though it is currently just a prototype, SABLE further aims to make energy harvesting devices more accessible to programmers by ridding the programmer of the burden of making alterations to the application source code. Both systems presented in this paper have trade-offs, but both can be used for high-performance, NVM-invariant, software-only intermittent computation without rendering a Flash-based device useless.


Bibliography

[1] D. Balsamo, A. S. Weddell, A. Das, A. R. Arreola, D. Brunelli, B. M. Al-Hashimi,

G. V. Merrett, and L. Benini. Hibernus++: A self-calibrating and adaptive system for

transiently-powered embedded devices. IEEE Transactions on Computer-Aided Design

of Integrated Circuits and Systems, 35(12):1968–1980, March 2016.

[2] Domenico Balsamo, Alex Weddell, Geoff Merrett, Bashir Al-Hashimi, Davide Brunelli,

and Luca Benini. Hibernus: Sustaining Computation during Intermittent Supply for

Energy-Harvesting Systems. In IEEE Embedded Systems Letters, 2014.

[3] BBC News. Samsung confirms battery faults as cause of Note 7 fires, January 2017.

https://www.bbc.com/news/business-38714461.

[4] Alexei Colin and Brandon Lucia. Chain: Tasks and channels for reliable intermit-

tent programs. In International Conference on Object-Oriented Programming, Systems,

Languages, and Applications, OOPSLA, pages 514–530, October 2016.

[5] H. Desai and B. Lucia. A power-aware heterogeneous architecture scaling model for

energy-harvesting computers. IEEE Computer Architecture Letters, 19(1):68–71, 2020.

[6] Bin Fan, Dave G. Andersen, Michael Kaminsky, and Michael D. Mitzenmacher. Cuckoo

filter: Practically better than bloom. In Proceedings of the 10th ACM International on

Conference on Emerging Networking Experiments and Technologies, CoNEXT ’14, page

75–88, New York, NY, USA, 2014. Association for Computing Machinery.

[7] K. Ganesan, J. San Miguel, and N. Enright Jerger. The what’s next intermittent com-


puting architecture. In IEEE International Symposium on High Performance Computer

Architecture, HPCA, pages 211–223, Feb 2019.

[8] Josiah Hester, Timothy Scott, and Jacob Sorber. Ekho: Realistic and repeatable ex-

perimentation for tiny energy-harvesting sensors. In Proceedings of the 12th ACM

Conference on Embedded Network Sensor Systems, 2014.

[9] Josiah Hester, Nicole Tobias, Amir Rahmati, Lanny Sitanayah, Daniel Holcomb, Kevin

Fu, Wayne P. Burleson, and Jacob Sorber. Persistent clocks for batteryless sensing

devices. ACM Transactions on Embedded Computer Systems, 15(4):77:1–77:28, August

2016.

[10] Matthew Hicks. Clank: Architectural support for intermittent computation. In Inter-

national Symposium on Computer Architecture, ISCA, pages 228–240, 2017.

[11] D. E. Holcomb, W. P. Burleson, and K. Fu. Power-Up SRAM State as an Identifying

Fingerprint and Source of True Random Numbers. IEEE Transactions on Computers,

58(9):1198–1210, September 2009.

[12] Daniel E. Holcomb, Amir Rahmati, Mastooreh Salajegheh, Wayne P. Burleson, and

Kevin Fu. Drv-fingerprinting: Using data retention voltage of sram cells for chip identi-

fication. In Proceedings of the 8th International Conference on Radio Frequency Identi-

fication: Security and Privacy Issues, RFIDSec’12, pages 165–179, Berlin, Heidelberg,

2013. Springer-Verlag.

[13] G. Huang, L. Qian, S. Saibua, D. Zhou, and X. Zeng. An efficient optimization based

method to evaluate the drv of sram cells. IEEE Transactions on Circuits and Systems

I: Regular Papers, 60(6):1511–1520, June 2013.


[14] Texas Instruments. MSP430x2xx Family User’s Guide (Rev. J), 2013. http://www.ti.

com/lit/ug/slau144j/slau144j.pdf.

[15] Texas Instruments. Maximizing Write Speed on the MSP430 FRAM, 2015. https:

//www.ti.com/lit/an/slaa498b/slaa498b.pdf.

[16] Texas Instruments. MSP432P4xx SimpleLink microcontrollers technical reference man-

ual, March 2015. http://www.ti.com/lit/ug/slau356i/slau356i.pdf.

[17] Texas Instruments. MSP430 Flash Memory Characteristics (Rev. B), 2018. http:

//www.ti.com/lit/an/slaa334b/slaa334b.pdf.

[18] Texas Instruments. MSP430F5438A—MSP430F543xA, MSP430F541xA Mixed-

Signal Microcontrollers, September 2018. http://www.ti.com/lit/ds/symlink/

msp430f5438a.pdf.

[19] Texas Instruments. MSP430F552x, MSP430F551x Mixed-Signal Microcontrollers,

September 2018. https://www.ti.com/lit/ds/symlink/msp430f5529.pdf.

[20] Texas Instruments. MSP430FR5964—MSP430FR599x, MSP430FR596x Mixed-Signal

Microcontrollers, August 2018. http://www.ti.com/lit/ds/symlink/msp430fr5964.

pdf.

[21] Texas Instruments. MSP430G2553 LaunchPad Development Kit (MSP‑EXP430G2ET),

2018. http://www.ti.com/lit/ug/slau772/slau772.pdf.

[22] Hrishikesh Jayakumar, Arnab Raha, and Vijay Raghunathan. QUICKRECALL: A

Low Overhead HW/SW Approach for Enabling Computations across Power Cycles in

Transiently Powered Computers. In International Conference on VLSI Design and

International Conference on Embedded Systems, 2014.


[23] Joseph Kahn, Randy Katz, and Kristofer Pister. Next Century Challenges: Mobile

Networking for ”Smart Dust”. In Conference on Mobile Computing and Networking

(MobiCom), 1999.

[24] P. Koopman and T. Chakravarty. Cyclic redundancy code (crc) polynomial selection for

embedded networks. In International Conference on Dependable Systems and Networks,

2004, pages 145–154, June 2004.

[25] Marc de Kruijf, Karthikeyan Sankaralingam, and Somesh Jha. Static analysis and

compiler design for idempotent processing. In Conference on Programming Language

Design and Implementation, PLDI, pages 475–486, 2012.

[26] Chris Lattner and Vikram Adve. Llvm: A compilation framework for lifelong program

analysis & transformation. In In International Symposium on Code Generation and

Optimization, CGO, pages 75–86, 2004.

[27] Brandon Lucia and Benjamin Ransford. A simpler, safer programming and execution

model for intermittent systems. In Conference on Programming Language Design and

Implementation, PLDI, pages 575–585, 2015.

[28] K. Ma, Y. Zheng, S. Li, K. Swaminathan, X. Li, Y. Liu, J. Sampson, Y. Xie, and

V. Narayanan. Architecture exploration for ambient energy harvesting nonvolatile pro-

cessors. In IEEE International Symposium on High Performance Computer Architecture,

HPCA, pages 526–537, Feb 2015.

[29] Kaisheng Ma, Xueqing Li, Karthik Swaminathan, Yang Zheng, Shuangchen Li, Yongpan

Liu, Yuan Xie, John Sampson, and Vijaykrishnan Narayanan. Nonvolatile Processor

Architectures: Efficient, Reliable Progress with Unstable Power. In IEEE Micro, Volume

36, Issue 3, 2016.


[30] Kiwan Maeng, Alexei Colin, and Brandon Lucia. Alpaca: Intermittent execution with-

out checkpoints. In International Conference on Object-Oriented Programming, Sys-

tems, Languages, and Applications, OOPSLA, pages 96:1–96:30, October 2017.

[31] Kiwan Maeng and Brandon Lucia. Adaptive dynamic checkpointing for safe efficient

intermittent computing. In USENIX Conference on Operating Systems Design and

Implementation, OSDI, pages 129–144, November 2018.

[32] A. Mirhoseini, E. M. Songhori, and F. Koushanfar. Idetic: A high-level synthesis ap-

proach for enabling long computations on transiently-powered ASICs. In International

Conference on Pervasive Computing and Communications, PerCom, pages 216–224,

March 2013.

[33] University of Washington. WISP 5 GitHub, April 2014. http://www.github.com/

wisp/wisp5.

[34] Sandro Pinto and Nuno Santos. Demystifying ARM TrustZone: A comprehensive sur-

vey. ACM Computing Surveys, 51(6), January 2019.

[35] Huifang Qin, Yu Cao, Dejan Markovic, Andrei Vladimirescu, and Jan Rabaey. Sram

leakage suppression by minimizing standby supply voltage. In Proceedings of the 5th

International Symposium on Quality Electronic Design, ISQED ’04, pages 55–60, Wash-

ington, DC, USA, 2004. IEEE Computer Society.

[36] Benjamin Ransford and Brandon Lucia. Nonvolatile Memory is a Broken Time Machine.

In Workshop on Memory Systems Performance and Correctness, 2014.

[37] Benjamin Ransford, Jacob Sorber, and Kevin Fu. Mementos: System Support for

Long-Running Computation on RFID-Scale Devices. In Architectural Support for Pro-

gramming Languages and Operating Systems (ASPLOS), 2011.


[38] R. L. Rivest, A. Shamir, and L. Adleman. A method for obtaining digital signatures

and public-key cryptosystems. Commun. ACM, 21(2):120–126, February 1978.

[39] A. P. Sample, D. J. Yeager, P. S. Powledge, A. V. Mamishev, and J. R. Smith. Design

of an rfid-based battery-free programmable sensing platform. IEEE Transactions on

Instrumentation and Measurement, 57(11):2608–2615, Nov 2008.

[40] Henry Sodano, Gyuhae Park, and Daniel Inman. Estimation of Electric Charge Output

for Piezoelectric Energy Harvesting. In Strain, Volume 40, 2004.

[41] Fang Su, Kaisheng Ma, Xueqing Li, Tongda Wu, Yongpan Liu, and Vijaykrishnan

Narayanan. Nonvolatile processors: Why is it trending? In Proceedings of the Confer-

ence on Design, Automation & Test in Europe, DATE ’17, pages 966–971, 3001 Leuven,

Belgium, Belgium, 2017. European Design and Automation Association.

[42] Mark Weiser. Ubiquitous computing. Computer, 10:71–72, 1993.

[43] Welch. A technique for high-performance data compression. Computer, 17(6):8–19,

1984.

[44] Harrison Williams, Xun Jian, and Matthew Hicks. Forget failure: Exploiting SRAM

data remanence for low-overhead intermittent computation. In International Conference

on Architectural Support for Programming Languages and Operating Systems, ASPLOS,

pages 69–84, March 2020.

[45] Harrison Williams, Alexander Lind, Kishankumar Parikh, and Matthew Hicks. Silicon

Dating. arXiv, abs/2009.04002, 2020.

[46] Joel Van Der Woude and Matthew Hicks. Intermittent computation without hardware

support or programmer intervention. In USENIX Symposium on Operating Systems

Design and Implementation, OSDI, pages 17–32, November 2016.


[47] X. Wu, I. Lee, Q. Dong, K. Yang, D. Kim, J. Wang, Y. Peng, Y. Zhang, M. Saliganc,

M. Yasuda, K. Kumeno, F. Ohno, S. Miyoshi, M. Kawaminami, D. Sylvester, and

D. Blaauw. A 0.04MM3 16NW wireless and batteryless sensor system with integrated

cortex-m0+ processor and optical communication for cellular temperature measurement.

In IEEE Symposium on VLSI Circuits, pages 191–192, June 2018.

[48] Hong Zhang, Jeremy Gummeson, Benjamin Ransford, and Kevin Fu. Moo: A Battery-

less Computational RFID and Sensing Platform. In Technical Report UMCS-2011-020,

2011.
