APPROVED:

Song Fu, Major Professor
Yan Huang, Committee Member
Krishna Kavi, Committee Member
Xiaohui Yuan, Committee Member
Barrett Bryant, Chair of the Department of Computer Science and Engineering
Costas Tsatsoulis, Dean of the College of Engineering
Mark Wardell, Dean of the Toulouse Graduate School
AUTONOMIC FAILURE IDENTIFICATION AND DIAGNOSIS FOR BUILDING
DEPENDABLE CLOUD COMPUTING SYSTEMS
Qiang Guan
Dissertation Prepared for the Degree of
DOCTOR OF PHILOSOPHY
UNIVERSITY OF NORTH TEXAS
May 2014
Guan, Qiang. Autonomic Failure Identification and Diagnosis for Building Dependable
Cloud Computing Systems. Doctor of Philosophy (Computer Science), May 2014, 121 pp., 9
tables, 53 figures, bibliography, 112 titles.
The increasingly popular cloud-computing paradigm provides on-demand access to
computing and storage with the appearance of unlimited resources. Users are given access to a
variety of data and software utilities to manage their work. Users rent virtual resources and pay
for only what they use. In spite of the many benefits that cloud computing promises, the lack of
dependability in shared virtualized infrastructures is a major obstacle for its wider adoption,
especially for mission-critical applications.
Virtualization and multi-tenancy increase system complexity and dynamicity. They
introduce new sources of failure degrading the dependability of cloud computing systems. To
assure cloud dependability, in my dissertation research, I develop autonomic failure
identification and diagnosis techniques that are crucial for understanding emergent, cloud-wide
phenomena and for self-managing resource burdens to enhance cloud availability and productivity.
I study runtime cloud performance data collected from a cloud test-bed and traces from
production cloud systems. I define cloud signatures comprising the metrics that are most
relevant to failure instances.
I exploit profiled cloud performance data in both the time and frequency domains to
identify anomalous cloud behaviors, and I leverage cloud metric subspace analysis to automate
the diagnosis of observed failures. I implement a prototype of the anomaly identification system
and conduct experiments on an on-campus cloud computing test-bed and on the Google
datacenter traces. The experimental results show that the proposed anomaly detection
mechanism achieves 93% detection sensitivity while keeping the false positive rate as low as
6.1%, outperforming the other tested anomaly detection schemes. In addition, the anomaly
detector adapts itself by recursively learning from newly verified detection results to refine
future detection.
Copyright 2014
by
Qiang Guan
ACKNOWLEDGMENTS
This dissertation would have been impossible without the continuous support and supervision
of many people, and I would like to take this opportunity to thank them. First, I would like to
thank my advisor, Dr. Song Fu, for his guidance, support, and supervision over the past four
years. I am proud, and greatly honored, to be his first Ph.D. graduate. I also want to thank
Dr. Yan Huang, Dr. Krishna Kavi, and Dr. Xiaohui Yuan for their comments and suggestions on
this work. I would like to thank Dr. Nathan Debardeleben, Dr. Mike Lang, and Mr. Sean Blanchard
from the Ultrascale Systems Research Center, New Mexico Consortium, Los Alamos National
Laboratory, for their mentoring and advising. I would also like to thank the Department Chair,
Dr. Barrett Bryant, the graduate advisor, Dr. Bill Buckles, and Dr. Armin R. Mikler for their
guidance and generous help in my academic career. I am thankful to my friends, Dongyu Ang,
K.J. Buckles, Guangchun Cheng, Chi-Chen Qiu, Song Huang, Tommy Janjusic, Zhi Liu, Husanbir
Pannu, Devender Singh, Yanan Tao, Dr. Shijun Tang, Yiwen Wan, Ziming Zhang, Chengyang Zhang,
and Shunli Zhao, as well as all the team-mates of Highland Guerilla and friends in Highland
Baptist Church, for their friendship and support.
I would like to thank my parents for their support during the whole journey. I want to give
special thanks to my wife, Dr. Xiaoyi Fang, for her love, patience, understanding, and support
through these days and nights.
TABLE OF CONTENTS
Page
ACKNOWLEDGMENTS iii
LIST OF TABLES viii
LIST OF FIGURES ix
CHAPTER 1 INTRODUCTION AND MOTIVATION 1
1.1. Introduction 1
1.2. Terms and Definitions 2
1.3. Motivation and Research Tasks 2
1.3.1. Characterizing System Dependability in Cloud Computing Infrastructures 2
1.3.2. Metric Dimensionality Reduction for Cloud Anomaly Identification 3
1.3.3. Soft Errors (SE) and Silent Data Corruption (SDC) 5
1.4. Contributions 6
1.4.1. Cloud Dependability Characterization and Analysis 6
1.4.2. Metric Selection and Extraction for Characterizing Cloud Health 8
1.4.3. Exploring Time and Frequency Domains of Cloud Performance Data
for Accurate Anomaly Detection 9
1.4.4. Most Relevant Principal Components based Anomaly Identification
and Diagnosis 9
1.4.5. SEFI : A Soft Error Fault Injection Tool for Profiling the Application
Vulnerability 10
1.5. Dissertation Organization 11
CHAPTER 2 BACKGROUND AND RELATED WORK 13
2.1. Metrics Selection and Extraction 13
2.2. Anomaly Detection and Failure Management 14
2.3. State of the Art of Fault Injection 16
2.3.1. Dynamic Binary Instrumentation-based Fault Injection 16
2.3.2. Virtualization-based Fault Injection 17
CHAPTER 3 A CLOUD DEPENDABILITY ANALYSIS FRAMEWORK FOR
CHARACTERIZING THE SYSTEM DEPENDABILITY IN CLOUD
COMPUTING INFRASTRUCTURES 19
3.1. Introduction 19
3.2. Overview of the Cloud Dependability Analysis Framework 20
3.3. Cloud Dependability Analysis Methodologies 21
3.4. Cloud Computing Testbed and Performance Profiling 23
3.5. Impact of Virtualization on Cloud Dependability 25
3.5.1. Analysis of CPU-Related Failures 26
3.5.2. Analysis of Memory-Related Failures 27
3.5.3. Analysis of Disk-Related Failures 28
3.5.4. Analysis of Network-Related Failures 30
3.5.5. Analysis of All Types of Failures 30
3.6. Summary 32
CHAPTER 4 A METRIC SELECTION AND EXTRACTION FRAMEWORK FOR
DESCRIBING CLOUD PERFORMANCE ANOMALIES 33
4.1. Introduction 33
4.2. Cloud Metric Space Reduction Algorithms 34
4.2.1. Metric Selection 34
4.2.2. Metric Space Combination and Separation 36
4.3. Performance Evaluation 39
4.3.1. Experimental Results of Metric Selection and Extraction 40
4.4. Summary 42
CHAPTER 5 EFFICIENT AND ACCURATE CLOUD ANOMALY DETECTION 45
5.1. Introduction 45
5.2. Cloud Anomaly Detection Mechanisms 45
5.2.1. Wavelet-Based Multi-Scale Anomaly Detection Mechanism 46
5.2.2. Sliding-Window Cloud Anomaly Detection 48
5.2.3. Mother Wavelet Selection and Adaptation 49
5.3. Performance Evaluation 51
5.3.1. Cloud Testbed and Performance Metrics 51
5.3.2. Mother Wavelets 51
5.3.3. Performance of Anomaly Identification 53
5.4. Summary 56
CHAPTER 6 EXPLORING METRIC SUBSPACE ANALYSIS FOR ANOMALY
IDENTIFICATION AND DIAGNOSIS 58
6.1. Introduction 58
6.2. A Motivating Example 59
6.3. MRPC-Based Adaptive Cloud Anomaly Identification 61
6.3.1. Dynamic Normalization 62
6.3.2. MRPC Selection 63
6.3.3. Adaptive Cloud Anomaly Identification 65
6.4. Analysis of Cloud Anomalies 66
6.4.1. Anomaly Detection and Diagnosis Results 66
6.4.2. MRPCs and Diagnosis of Memory Related Failures 66
6.4.3. MRPCs and Diagnosis of Disk Related Failures 68
6.4.4. MRPCs and Diagnosis of CPU Related Failures 70
6.4.5. MRPCs and Diagnosis of Network Related Failures 72
6.4.6. The Accuracy of Anomaly Identification 72
6.4.7. Experimental Results using Google Datacenter Traces 73
6.5. Summary 74
CHAPTER 7 F-SEFI: A FINE-GRAINED SOFT ERROR FAULT INJECTION
FRAMEWORK 78
7.1. Introduction 78
7.2. A Coarse-Grained Soft Error Fault Injection (C-SEFI) Mechanism 79
7.2.1. C-SEFI Startup 79
7.2.2. C-SEFI Probe 79
7.2.3. C-SEFI Fault Injection 81
7.2.4. Performance Evaluation of C-SEFI 82
7.3. A Fine-Grained Soft Error Fault Injection (F-SEFI) Framework 83
7.3.1. F-SEFI Design Objectives 85
7.3.2. F-SEFI Fault Model 86
7.3.3. F-SEFI Fault Injection Mechanisms 87
7.3.4. Case Studies 91
7.4. Discussions 100
7.5. Summary 102
CHAPTER 8 CONCLUSION AND FUTURE WORK 104
8.1. Conclusion 104
8.1.1. Characterizing Cloud Dependability 104
8.1.2. Detecting and Diagnosing Cloud Anomalies 104
8.1.3. Soft Error Fault Injection 105
8.1.4. List of Publications in My PhD Study 105
8.2. Future Work 107
8.2.1. Self-Adaptive Failure-Aware Resource Management in the Cloud 108
8.2.2. Tolerating Silent Data Corruptions in Large Scale Computing Systems 108
BIBLIOGRAPHY 110
LIST OF TABLES
Page
Table 2.1. Existing fault injection technologies. 17
Table 3.1. Description of the injected faults. 23
Table 3.2. The metrics that are highly correlated with failure occurrences in the cloud
testbed using four-level failure-metric DAGs. 31
Table 4.1. Normalized mutual information values for 12 metrics of CPU and memory
related statistics. 40
Table 6.1. MRPCs ranked by correlation with faults. (For each major type, 25 faults
are injected into the testbed.) 68
Table 6.2. Performance metrics in the Google datacenter traces. 74
Table 7.1. Fault types for injection 87
Table 7.2. Benchmarks and target functions for fine-grained fault injection 91
Table 7.3. K-Means clustering centroids with and without fault injection showing the
impact of corrupted data in the centroid calculations and clustering calculations
for individual particles. 98
LIST OF FIGURES
Page
Figure 1.1. A dependable cloud computing infrastructure. 7
Figure 3.1. Architecture of the cloud dependability analysis (CDA) framework. 20
Figure 3.2. A sampling of cloud performance metrics that are often correlated with failure
occurrences in our experiments. In total, 518 performance metrics are profiled
with 182 metrics for the hypervisor, 182 metrics for virtual machines, and 154
metrics for hardware performance counters (four cores on most of the cloud
servers). 24
Figure 3.3. Failure-metric DAG for CPU-related failures in the cloud testbed. 26
Figure 3.4. Failure-metric DAG for CPU-related failures in the non-virtualized system. 26
Figure 3.5. Failure-metric DAG for memory-related failures in the cloud testbed. 27
Figure 3.6. Failure-metric DAG for memory-related failures in the non-virtualized system. 27
Figure 3.7. Failure-Metric DAG for disk-related failures in the cloud testbed. 28
Figure 3.8. Failure-metric DAG for disk-related failures in the non-virtualized system. 28
Figure 3.9. Failure-metric DAG for network-related failures in the cloud testbed. 29
Figure 3.10. Failure-metric DAG for network-related failures in the non-virtualized system. 29
Figure 3.11. Failure-metric DAG for all types of failures in the cloud testbed. 31
Figure 3.12. Failure-metric DAG for all types of failures in the non-virtualized system. 32
Figure 4.1. Quantified redundancy and relevance among metrics based on their mutual
information values. 41
Figure 4.2. Results from metric extraction (Algorithm 2) and metric selection
(Algorithm 1). 41
Figure 4.3. Results from metric extraction (Algorithm 2) only. 41
Figure 4.4. Distribution of normal (blue marker) and abnormal (red marker) cloud
system states represented by the metrics that are selected and extracted by
Algorithms 1, 2, and 3. 43
Figure 5.1. Three-level details “Di” and approximations “Ai” of performance metric
%memused profiled on our cloud testbed. Performance metric %memused
is divided into high frequency components (details) and low frequency
components (approximations). The approximation is further decomposed into
new details and approximations at each level. 47
Figure 5.2. A sliding detection window (NsWin = 80 measurements of a cloud
performance metric) for a mother wavelet with Nmother = 16 measurements
and the scale coefficient s = 5. A failure is illustrated with the spike. 49
Figure 5.3. An example of Haar mother wavelet. 50
Figure 5.4. Mother wavelets derived by employing measurement windows of different
sizes. As the window size increases, the peak at the tail is sharpened while other
peaks are smoothed. From the perspective of frequency, more failure-related
signals in both the low-frequency and higher-frequency bands are included for
large measurement windows. 52
Figure 5.5. The numbers of truly identified anomalies vs. the numbers of validated false
positives with mother wavelets of different sizes. Small windows result in low
detection accuracy, while big windows bring in more false positives. 54
Figure 5.6. Wavelet coefficients for mother wavelets with different Nmother (1 ≤ s ≤
16, 0 ≤ τ ≤ 200). A memory-related fault is injected at the 100th time
point. The states of the system are learned from the wavelet coefficients based on
the mother wavelet at different scales. A smaller mother wavelet
(i.e., 8 measurements or 12 measurements) brings more noise to the wavelet
coefficients, while a bigger mother wavelet (i.e., 24 measurements) requires a
larger scale to achieve a high detection accuracy. 55
Figure 5.7. Performance comparison of our wavelet-based cloud anomaly detection
mechanism with four other detection algorithms. Our approach achieves the
best TPR with the least FPR. It can identify anomalies more accurately than
other methods. 56
Figure 6.1. Examples of memory related faults injected into a cloud testbed. The memory
utilization and CPU utilization time series are plotted. 59
Figure 6.2. Distribution of data variance retained by the first 50 principal components. 60
Figure 6.3. Time series of principal components and their correlation with the memory
related faults. 61
Figure 6.4. MRPCs of memory-related failures. (a) plots the time series of the 3rd principal
component. (b) shows that the performance metric avgrq-sz displays the highest
contribution to the MRPC. 67
Figure 6.5. MRPCs of disk-related failures. (a) plots the time series of the 5th principal
component. (b) shows that the performance metric rd-sec/s dev-253 displays the
highest contribution to the MRPC. 69
Figure 6.6. MRPCs of CPU-related failures. (a) plots the time series of the 35th principal
component. (b) shows that the performance metric ldavg displays the highest
contribution to the MRPC. 70
Figure 6.7. MRPCs of network-related failures. (a) plots the time series of the 29th
principal component. (b) shows that the performance metric %user 4 displays the
highest contribution to the MRPC. 71
Figure 6.8. Correlation between the principal components and different types of failures. 76
Figure 6.9. Correlation between principal components and failure events using the Google
datacenter trace. 77
Figure 6.10. Performance of the proposed MRPC-based anomaly detector compared with
four other detection algorithms on the Google datacenter trace. 77
Figure 7.1. Overview of C-SEFI. 79
Figure 7.2. C-SEFI's startup phase. 80
Figure 7.3. C-SEFI's probe phase. 80
Figure 7.4. C-SEFI's fault injection phase. 81
Figure 7.5. The multiplication experiment uses the floating point multiply instruction,
where a variable initially set to 1.0 is repeatedly multiplied by 0.9. For
five different experiments a random bit was flipped in the output of the multiply
at iteration 10, simulating a soft error in the logic unit or output register. 84
Figure 7.6. Experiments with the focus on the injection point. It can be seen that each
of the five separately injected faults causes the value of y to change: once
radically, the other times slightly. 84
Figure 7.7. Experiments with the focus on the effects on the final solution. It can be seen
that the final output of the algorithm differs due to these injected faults. 85
Figure 7.8. The overall system infrastructure of F-SEFI 88
Figure 7.9. The components of F-SEFI Broker 88
Figure 7.10. A subset of the function symbol table (FST) for the K-means clustering
algorithm studied in section 7.3.4. This is extracted during the profiling
stage and used to trace where the application is at runtime for targeted fault
injections. 89
Figure 7.11. Instruction profiles for the benchmarks studied. Each benchmark is reported
as a whole application (coarse-grained) and one or two functions that were
targeted (fine-grained). While both FFT and K-Means have a large number of
FADD and FMUL instructions, the BMM benchmark is almost entirely XOR. 92
Figure 7.12. 1-D FFT algorithm with soft errors injected by F-SEFI 93
Figure 7.13. Comparative outputs with four different types of fault injections into the
extended split radix 1-D FFT algorithm. The output is represented in
magnitude and phase. The single FADD fault shown causes significant SDC in
both magnitude and phase. 94
Figure 7.14. The relative root mean square (RMS) error of 1-D FFT outputs with different
problem sizes, showing that for the faults I injected into FMUL instructions the
output varied only slightly. 95
Figure 7.15. 2-D FFT algorithm with soft errors injected by F-SEFI 95
Figure 7.16. 8x8 spiral images with FADD and FMUL fault injections. 96
Figure 7.17. The Bit Matrix Multiply algorithm compresses the 64 bits of output to a 9-bit
signature code used to checksum the result. 97
Figure 7.18. Faults injected into two different functions of the K-Means Cluster algorithm
cause different effects. Clusters are colored by cluster number and the centroids
are marked by triangles. 99
Figure 7.19. The number of mislabeled particles in the K-Means Clustering Algorithm
under fault injection as a function of the total number of particles. An FADD
fault injected into kmeans clustering causes about 28% of the particles
to be mislabeled. 100
CHAPTER 1
INTRODUCTION AND MOTIVATION
1.1. Introduction
The increasingly popular cloud computing paradigm provides on-demand access to com-
puting and storage with the appearance of unlimited resources [6]. Users are given access to a
variety of data and software utilities to manage their work. Users rent virtual resources and pay
for only what they use. Underlying these services are data centers that provide virtual machines
(VMs) [90]. Virtual machines make it easy to host computation and applications for large numbers
of distributed users by giving each the illusion of a dedicated computer system. It is anticipated
that cloud platforms and services will increasingly play a critical role in academic, government
and industry sectors, and will have widespread societal impact.
Production cloud computing systems continue to grow in their scale and complexity. Mod-
ern cloud computing systems contain hundreds to thousands of computing and storage servers.
Such a scale, combined with ever-growing system complexity, is introducing a key challenge to
failure and resource management for dependable cloud computing [6]. Despite great efforts on
the design of ultra-reliable components [9], the increase of cloud scale and complexity has out-
paced the improvement in component reliability. On the other hand, the states of cloud systems are
changing dynamically as well due to the addition and removal of system components, changing
execution environments and workloads, frequent updates and upgrades, online repairs and more.
In such large-scale complex and dynamic systems, failures are common [104, 45]. Results from
recent studies [64] show that the reliability of existing data centers and cloud computing systems is
constrained by a system mean time between failure (MTBF) on the order of 10-100 hours. Failure
occurrence as well as its impact on cloud performance and operating costs is becoming an increas-
ingly important concern to cloud designers and operators [6, 106]. The success of cloud computing
will depend on its ability to provide dependability at scale.
In this dissertation, I aim to design, implement, and evaluate a framework that can facilitate
the development of dependable cloud computing systems. I first provide the definitions that are
used in this dissertation and then elaborate the research tasks.
1.2. Terms and Definitions
The definitions of the following terms used in this dissertation are adopted from [91, 98,
73, 88].
Fault: the cause of an error (e.g., a stuck bit, an alpha particle, or temperature).
Error: the part of the system state that may lead to a failure.
Failure: a transition to incorrect or unavailable services (e.g., a web service disruption).
Performance anomaly: a performance anomaly arises when the system performance
behavior deviates from the expectation. Usually the expectation threshold is defined in the
Service Level Agreement (SLA).
Soft error: a type of transient fault that occurs due to random events.
Dependability: the ability to avoid service failures that are more frequent and more severe
than is acceptable.
Resilience: the collection of techniques for keeping applications running to a correct
solution in a timely and efficient manner despite underlying system faults.
1.3. Motivation and Research Tasks
The occurrence of failures may cause affected tasks to fail or abort and, furthermore, force
the system to roll back to the nearest checkpoint to restart the relevant jobs and tasks. A
dependable cloud system should be able to proactively tackle cloud anomalies before they crash
the system and trigger a checkpoint restart, since reactive approaches consume extra computation
cycles and power budgets.
1.3.1. Characterizing System Dependability in Cloud Computing Infrastructures
Dependability assurance is crucial for building reliable cloud services. Current solutions
to enhancing cloud dependability include VM replication [22] and checkpointing [72]. Proactive
approaches, such as failure prediction [96, 32, 28, 29] and VM migration [4, 39], have also been
explored. However, a fundamental question, i.e., "What is the uniqueness of cloud computing
systems in terms of their dependability?" or "What impact does virtualization have on cloud
dependability?", has never been answered. There also exists research on characterizing cloud
hardware reliability [104], modeling cloud service availability [23], and injecting faults into
cloud software [46]. Still, none of them evaluate the influence of virtualization on the system
dependability in cloud computing infrastructures.
As virtualization has become the de facto enabling technology for cloud computing [6],
dependability evaluation of the cloud is no longer confined to the hardware, operating system, and
application layers. A new virtualized environment, which consists of virtual machines (VMs) with
virtualized hardware and hypervisors, should be analyzed to characterize the cloud dependabil-
ity. VM-related operations, such as VM creation, cloning, migration, and accesses to physical
resources via virtualized devices, cause more points of failure. They also make failure detec-
tion/prediction and diagnosis more complex. Moreover, virtualization introduces richer perfor-
mance metrics to evaluate the cloud dependability. Traditional approaches [68, 30] that ignore
those cloud-oriented metrics may not model cloud dependability accurately or effectively.
I need to design an analytical framework to evaluate the system dependability in both
virtualized and non-virtualized environments, in order to characterize the impact of
virtualization on cloud dependability.
1.3.2. Metric Dimensionality Reduction for Cloud Anomaly Identification
To characterize cloud behavior, identify anomalous states, and pinpoint the causes of
failures, I need the runtime performance data collected from utility clouds. However, continuous
monitoring and the large system scale lead to an overwhelming volume of data collected by health
monitoring tools. The size of system logs from large-scale production systems can easily reach
hundreds or even thousands of terabytes [70, 87]. In addition to the data size, the large number
of metrics that are measured makes the data model extremely complex. Moreover, the existence
of interacting metrics and external environmental factors introduces measurement noise into the
collected data. For the collected health-related data, there might be a maximum number of metrics
above which the performance of anomaly detection will degrade rather than improve. High metric
dimensionality will cause low detection accuracy and high computational complexity. Another
challenge of failure identification from measurement data originates from the dynamics of cloud
computing systems. It is common in those systems that user behaviors and server loads are always
changing. The cloud hardware and software components are also frequently replaced or updated.
This requires the failure detection mechanisms to distinguish normal cloud variations from real
failures.
Anomaly detection and failure management based on analysis of system logs is an active
research topic. Anomaly detection techniques developed in the machine learning and statistical
domains are surveyed in [42]. Structured and broad overviews of recent research on anomaly
detection and proactive failure management techniques are presented in [15, 84]. Most of the
existing anomaly detection work focuses on the detection techniques, while putting little emphasis
on metric selection. There is a lack of systematic approaches to effectively identifying and
selecting principal metrics for anomaly detection. On the other hand, metric selection is vital:
its performance directly affects the efficiency and accuracy of anomaly detection.
The conventional methods of failure detection rely on statistical learning models to
approximate the dependency of failure occurrences on various performance attributes; see [15] for
a comprehensive review and [59, 13] for examples. The underlying assumption of these methods is
that the training dataset is labeled, i.e., for each measurement used to train a failure detector,
the designer knows whether it corresponds to a normal execution state or a failure. However,
labeled data are not always available in real-world cloud computing environments, especially for
newly managed or deployed systems. Moreover, these methods do not exploit the detected failures
to improve the accuracy of future detections. In these methods, the undetected failures are also
never considered by the detectors to identify new types of failures in the future. How to
accurately and adaptively detect and forecast failure occurrences in such complex and dynamic
environments without prior failure histories is challenging.
Cloud environments need to have visibility not only into the cloud performance, but also
into the different computing resources and architectures where the applications reside. Typically
it is easier to identify performance anomalies on a single server than to study the performance
of applications that are pulling computing resources from different sources. This issue becomes
more complex when the system dynamicity is taken into consideration. Current utility clouds are
unable to validate the performance of a heterogeneous set of applications in the cloud, and no
technology is in place to assure end-users of the honesty of the cloud service providers.
Therefore, cloud vendors need independent performance anomaly detection tools in place, not only
to report the quality of the services they are providing, but also to enable root-cause analysis
of problems as they occur in the backend.
I need a metric selection mechanism for proactive online anomaly detection and root-cause
analysis. This mechanism should be efficient, accurate, and adaptive to the dynamicity of the
system.
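The core of such a mechanism is selecting metrics that are relevant to failures but not redundant with each other; Chapter 4 does this with normalized mutual information. As a hedged sketch of that idea (not the dissertation's Algorithm 1), the following ranks hypothetical metrics by relevance to a failure indicator while penalizing redundancy, using a simple histogram-based mutual information estimate:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information I(X;Y) in bits for two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def discretize(series, bins=4):
    """Equal-width binning so continuous metrics become discrete symbols."""
    lo, hi = min(series), max(series)
    width = (hi - lo) / bins or 1.0
    return [min(int((v - lo) / width), bins - 1) for v in series]

def select_metrics(metrics, failures, k=2):
    """Greedy selection: maximize relevance to the failure indicator,
    penalize redundancy with already-selected metrics (mRMR-style)."""
    disc = {name: discretize(vals) for name, vals in metrics.items()}
    selected = []
    while len(selected) < k:
        best, best_score = None, -float("inf")
        for name, xs in disc.items():
            if name in selected:
                continue
            relevance = mutual_information(xs, failures)
            redundancy = (sum(mutual_information(xs, disc[s]) for s in selected)
                          / len(selected)) if selected else 0.0
            if relevance - redundancy > best_score:
                best, best_score = name, relevance - redundancy
        selected.append(best)
    return selected
```

With a synthetic trace in which a memory-usage metric tracks failures and a temperature metric does not, the memory metric scores highest; the metric names here are illustrative, not the profiled metrics of Chapter 4.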
1.3.3. Soft Errors (SE) and Silent Data Corruption (SDC)
Exascale supercomputers are likely to encounter failures at higher rates than current high
performance computing systems. Next generation machines are expected to consist of a much
larger component count than current petascale machines. In addition to the increase in components,
it is expected that each individual component will be built on smaller feature sizes which may prove
to be more vulnerable than current parts. This vulnerability may be aggravated by near-threshold
voltage designs meant to dramatically decrease power consumption in the data center [48].
Due to high error rates, it is estimated that exascale systems may waste as much as 60%
[25] of their computation cycles and power on the overhead of reliability assurance. These
high error rates pose a serious threat to the prospect of exascale systems.
Soft errors fall into three categories [91]: detected and corrected errors (DCE), detected
but uncorrectable errors (DUE), and silent errors (SE). Most DRAM in supercomputers is protected
by Chipkill, which makes DUE events rare and SE events even rarer. SRAM in cache layers, however,
is generally protected by SECDED or parity and is therefore more vulnerable to SE events. In
addition, logic circuits have varying levels of internal protection, and I expect these error
rates to be on the rise as well.
Silent errors pose a serious issue when they lead to silent data corruption (SDC) in user
applications. If undetected by the application, a single SDC can corrupt data causing applications
to output incorrect results, malfunction or hang.
Unfortunately, detecting and correcting SDC events at the application layer is challenging.
It requires expert knowledge of the algorithm involved to determine where an application might
be most vulnerable and how it will behave if an SDC should occur. Even with such knowledge it
is difficult to test any mitigation techniques an application author might attempt since SDC events
occur rarely and in most cases randomly.
Currently, many cloud vendors provide computing services (e.g., Google Compute Engine
(GCE) and Amazon Elastic Compute Cloud (EC2)) to satisfy the requirements of scientific
computing in research institutes and universities. However, cloud systems are composed of less
reliable hardware components compared to High Performance Computing (HPC) systems. Components
in the cloud are more susceptible to in-field hardware bit upsets (soft errors) due to radiation
from energetic particle strikes. Soft errors that go undetected will further corrupt the data in
computation or memory cells, and the computation results are delivered to end-users incorrectly.
In order to guarantee service dependability, cloud service providers have to address the high
system soft error rate and tackle the silent data corruption due to soft errors.
In this dissertation, I need to investigate the impact of soft errors on the correctness of
computation results and inject soft errors to analyze the resilience of applications to soft errors.
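The effect of a single bit upset on a floating-point value can be illustrated with a small stand-alone sketch. This is only an emulation of the fault model; the actual injection framework of Chapter 7 (F-SEFI) operates at the virtual machine instruction level, and the function names below are illustrative:

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant) of the IEEE-754 double
    representation of `value`, emulating a single-event upset in a
    register or memory cell."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    return struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))[0]

def faulty_decay(iterations=20, fault_iter=10, fault_bit=51):
    """The multiplication experiment of Figure 7.5 reduced to plain Python:
    y starts at 1.0 and is repeatedly multiplied by 0.9; one bit of the
    product is flipped at iteration `fault_iter` to emulate a faulty
    FMUL output."""
    y = 1.0
    for i in range(iterations):
        y *= 0.9
        if i == fault_iter:
            y = flip_bit(y, fault_bit)
    return y
```

Flipping a low mantissa bit barely perturbs the value, while flipping an exponent bit changes it radically, which matches the behavior reported for Figures 7.5 and 7.6 (once radically, the other times slightly).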
1.4. Contributions
In this dissertation, I consider the influence of the cloud infrastructure on system
dependability and develop new techniques in a systematic way. The proposed autonomic failure
identification and diagnosis framework with its mechanisms will enable cloud computing systems
to continuously monitor their health, accurately and efficiently detect anomalous behaviors,
diagnose the causes, and inject soft errors to evaluate and deliver dependable cloud services.
1.4.1. Cloud Dependability Characterization and Analysis
The goal is to evaluate cloud dependability in virtualized environments and compare
it with that of traditional, non-virtualized systems. To the best of my knowledge, this is the
first work to analyze the impact of virtualization on cloud dependability. I propose a cloud
dependability analysis (CDA) framework with mechanisms to characterize failure behavior in cloud
computing infrastructures. I design failure-metric DAGs to model and quantify the correlation of
various performance metrics with failure events in virtualized and non-virtualized systems. I
study multiple
FIGURE 1.1. A dependable cloud computing infrastructure.
types of failures, including CPU-, memory-, disk-, and network-related failures. By comparing the
generated DAGs in the two environments, I gain insights into the effects of virtualization on the
cloud dependability.
To build dependable cloud computing systems, I propose a reconfigurable distributed vir-
tual machine (RDVM) infrastructure, which leverages the virtualization technologies to facilitate
failure-aware cloud resource management. An RDVM, as illustrated in Figure 1.1, consists of a set
of virtual machines running on top of physical servers in a cloud. Each VM encapsulates execution
states of cloud services and running client applications, serving as the basic unit of management
for RDVM construction and reconfiguration. Each cloud server hosts multiple virtual machines.
They multiplex resources of the underlying physical server. The virtual machine monitor (VMM,
also called a hypervisor) is a thin layer that manages hardware resources and exports a uniform
interface to the upper guest OSs.
When a client application is submitted with its computation and storage requirement to the
cloud, the cloud coordinator (described in Section 3.2) evaluates the qualifications of available
cloud servers. It selects one or a set of them for the application, initiates the creation of VMs
on them, and then dispatches the application instances for execution. Virtual machines on a cloud
server are managed locally by a RDVM daemon, which is also responsible for communication with
the cloud resource manager, cloud anomaly detector and cloud coordinator. The RDVM daemon
monitors the health status of the corresponding cloud server, collects runtime performance data of
local VMs and sends them to the Cloud Anomaly Detector which characterizes cloud behaviors,
identifies anomalous states, and reports the identified anomalies to cloud operators. Based on the
performance data and anomaly reports, the cloud resource manager analyzes the workload distri-
bution, online availability and available cloud resources, and then makes RDVM reconfiguration
decisions. The Anomaly Detector and Resource Manager form a closed feedback control loop to
deal with dynamics and uncertainty of the cloud computing environment.
To identify anomalous behaviors, the Anomaly Detector needs the runtime cloud perfor-
mance data. The performance data collected periodically by the RDVM daemons include the ap-
plication execution status and the runtime utilization information of various virtualized resources
on virtual machines. RDVM daemons also work with hypervisors to record the performance of
hypervisors and monitor the utilization of underlying hardware resources/devices. These data and
information from multiple system levels (i.e., hardware, hypervisor, virtual machine, RDVM, and
the cloud) are valuable for accurate assessment of the cloud health and for identifying anomalies
and pinpointing failures. They constitute the health-related cloud performance dataset, which is
explored for autonomic anomaly identification.
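As a rough sketch of this collection pipeline, an RDVM daemon's sampling loop might look like the following (the metric names, the stubbed values, and the reporting callback are illustrative assumptions, not the dissertation's implementation):

```python
import time

def sample_metrics():
    """Collect one record of multi-level metrics. Values are stubbed here;
    a real daemon would query the hypervisor and per-VM counters."""
    return {
        "hw.cpu_util": 0.42,       # hardware level
        "hv.vcpu_wait": 0.03,      # hypervisor level
        "vm1.mem_used_mb": 512.0,  # virtual machine level
    }

def daemon_loop(report, period_s=10.0, rounds=None):
    """Periodically sample metrics and push them to the anomaly detector."""
    done = 0
    while rounds is None or done < rounds:
        report(sample_metrics())
        done += 1
        if rounds is None or done < rounds:
            time.sleep(period_s)

records = []
daemon_loop(records.append, period_s=0.0, rounds=3)  # three sampling rounds
```

In the actual infrastructure the records would be streamed to the Cloud Anomaly Detector rather than accumulated locally.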
1.4.2. Metric Selection and Extraction for Characterizing Cloud Health
I propose a metric selection framework for efficient health characterization in the cloud.
Among the large number of metrics profiled, I select the most essential ones by applying met-
ric selection and extraction methods. Mutual information is exploited to quantify the relevance
and redundancy among metrics. An incremental search algorithm is designed to select metrics
by enforcing maximal relevance and minimal redundancy. We apply metric space combination
and separation to extract essential metrics and further reduce the metric dimension. I implement
a prototype of the proposed metric selection framework and evaluate its performance on a cloud
computing testbed. Experimental results show that the proposed approaches can significantly re-
duce the metric dimension by finding the most essential metrics.
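The relevance/redundancy trade-off described above can be sketched as follows, assuming a histogram-based mutual-information estimate; the function names and bin count are illustrative, not the framework's actual implementation:

```python
import numpy as np

def mutual_info(x, y, bins=10):
    """Histogram estimate of the mutual information I(X; Y) in nats."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def select_metrics(X, f, k):
    """Incrementally pick k columns of X with maximal relevance to the
    failure indicator f and minimal redundancy among themselves (mRMR)."""
    n = X.shape[1]
    relevance = [mutual_info(X[:, i], f) for i in range(n)]
    selected = [int(np.argmax(relevance))]       # most relevant metric first
    while len(selected) < k:
        best, best_score = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            redundancy = np.mean([mutual_info(X[:, i], X[:, j]) for j in selected])
            if relevance[i] - redundancy > best_score:
                best, best_score = i, relevance[i] - redundancy
        selected.append(best)
    return selected
```

In a real deployment, f would come from verified failure records, and continuous metrics would typically be discretized more carefully before estimating mutual information.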
1.4.3. Exploring Time and Frequency Domains of Cloud Performance Data for Accurate Anomaly
Detection
I propose a wavelet-based multi-scale anomaly identification mechanism to detect anoma-
lous cloud behaviors. It analyzes the profiled cloud performance metrics in both time and frequency
domains and identifies anomalous behaviors by checking both domains for cloud anomaly detec-
tion. I leverage learning technologies to construct and adapt mother wavelets that capture the
characteristic properties of failure events occurring in the cloud. To tackle cloud dynamicity and
improve detection accuracy, I devise a sliding-window approach to identify anomalies by using
the updated mother wavelet in a recent detection period. We develop a prototype of the proposed
cloud anomaly identification mechanism and evaluate its performance. Experimental results show
that the wavelet-based anomaly detector can identify cloud failures accurately. It achieves 93.3%
detection sensitivity and a 6.1% false positive rate, which makes it suitable for building highly
dependable cloud systems. To the best of our knowledge, this is the first work that considers both the
time and frequency domains to identify anomalies in cloud computing systems.
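A much-simplified sketch of the idea, using a single-level Haar transform in place of the learned mother wavelets (the window size, threshold, and correlation test are illustrative assumptions, not the mechanism's actual parameters):

```python
import numpy as np

def haar_detail(window):
    """Single-level Haar wavelet detail coefficients of a 1-D window."""
    w = np.asarray(window, dtype=float)
    if len(w) % 2:
        w = w[:-1]
    return (w[0::2] - w[1::2]) / np.sqrt(2.0)

def detect(signal, template, win, threshold=0.8):
    """Slide a window over the metric series; flag windows whose detail
    coefficients correlate with the failure template above the threshold."""
    hits = []
    t = template / (np.linalg.norm(template) + 1e-12)
    for s in range(0, len(signal) - win + 1):
        d = haar_detail(signal[s:s + win])
        d = d / (np.linalg.norm(d) + 1e-12)
        if abs(float(d @ t)) >= threshold:
            hits.append(s)
    return hits
```

The adaptive part of the mechanism would replace `template` with an updated mother wavelet each time newly verified failures arrive in the recent detection period.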
1.4.4. Most Relevant Principal Components based Anomaly Identification and Diagnosis
I propose an adaptive mechanism that leverages PCA to identify and diagnose cloud per-
formance anomalies. Different from existing PCA-based approaches, the proposed mechanism
characterizes cloud health dynamics and finds the most relevant principal components (MRPCs)
for each type of possible failures. The selection of MRPCs is motivated by the observation in
experiments that higher order principal components possess strong correlation with failure occur-
rences, even though they maintain less variance of the cloud performance data. By exploiting
MRPCs and learning techniques, I design an adaptive anomaly detection mechanism in the cloud.
It adapts its anomaly detector by learning from the newly verified detection results to refine future
detections. I compare the anomaly identification accuracy of several algorithms using the receiver
operating characteristic (ROC) curves. The experimental results show that the proposed mecha-
nism can achieve 91.4% true positive rate while keeping the false positive rate as low as 3.7%. I
also conduct experiments by using traces from a Google data center. Our MRPC-based anomaly
identification mechanism performs well on the traces from the production system with the true
positive rate reaching 81.5%. The mechanism is lightweight: it takes only several seconds to
initialize the detector and a couple of seconds for adaptation and anomaly identification. Thus, it
is well suited for building dependable cloud computing systems.
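The core of MRPC selection, ranking components by failure correlation rather than by explained variance, can be sketched as follows (the standardization, correlation measure, and the `most_relevant_pcs` helper name are assumptions for illustration):

```python
import numpy as np

def most_relevant_pcs(X, f, k=3):
    """Rank principal components by |correlation| of their scores with the
    failure indicator f, instead of by the variance they retain."""
    Xc = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * S                        # sample projections onto each PC
    corr = np.array([abs(np.corrcoef(scores[:, i], f)[0, 1])
                     for i in range(scores.shape[1])])
    order = np.argsort(corr)[::-1]        # most failure-relevant first
    return order[:k], corr[order[:k]], Vt
```

The rows of `Vt` give each component's per-metric loadings, which is what enables tracing a failure-correlated component back to the implicated metrics for root-cause analysis.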
These two anomaly identification approaches have different goal orientations and pros and
cons in anomaly identification.
The wavelet-based approach only monitors a few samples to identify the occurrences of
cloud anomalies and to update the anomaly mother wavelet when new anomalies are verified. The
MRPCs approach requires more samples (a bigger sliding window size) to build the most relevant
principal component space that correlates to the specific faults.
The selected MRPCs facilitate the creation of an anomaly inference codebook that helps
the anomaly diagnoser determine the failure type rapidly. Moreover, the MRPCs-based
approach is capable of finding the root causes of faults by learning the contributions of metrics
in the MRPCs. This analysis information can aid the diagnoser to track a failure instance to the
faulty hardware and the faulty VM with anomalous behaviors and guide the resource allocation
and reconfiguration.
Both approaches are able to adapt to new types of failures. For the wavelet-based
approach, mother wavelets need to be updated to include the features of new failures. For the
MRPCs-based approach, MRPCs must be re-selected when failure instances of a new
type occur.
1.4.5. SEFI: A Soft Error Fault Injection Tool for Profiling Application Vulnerability
I develop F-SEFI, a Fine-grained Soft Error Fault Injector, for profiling software robustness
against soft errors. I rely on logic soft error injection to mimic the impact of soft errors on logic
circuits. Leveraging the open-source virtual machine hypervisor QEMU, F-SEFI enables users to
modify the emulated machine instructions to introduce soft errors by amending the process of
Tiny Code Generation (TCG) in the QEMU hypervisor. Neither intimate knowledge of applica-
tions (e.g., source code) nor dynamic instruction analysis is required. F-SEFI can semantically
control which application, which sub-function, and when and how to inject soft errors at different
granularities, without any interference with other applications that share the same environment and
without any revisions to the application source code, compilers, or operating systems. F-SEFI
demonstrates its soft error injection capability on a selected set of applications in campaign studies
of application vulnerability to soft errors.
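F-SEFI itself operates inside QEMU's Tiny Code Generator; purely as a language-level illustration of the fault model (not of the tool), the following flips one bit of an IEEE-754 double, showing how a single soft error can range from negligible to catastrophic:

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Return `value` with bit `bit` (0 = LSB) of its IEEE-754 64-bit
    representation inverted, mimicking a single-event upset."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    return struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))[0]

# A flip in the low mantissa bits perturbs the value only slightly (silent
# data corruption), while a flip in the exponent bits can change it by
# orders of magnitude or produce a non-finite result.
```

This is the kind of faulty behavior (SDC vs. crash) that the campaign studies classify per application.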
1.5. Dissertation Organization
The remainder of this dissertation is organized as follows.
Chapter 2 reviews the related work on failure detection and diagnosis in distributed sys-
tems, especially in clusters, grids, and clouds. I first discuss the existing studies on characterizing
failure behaviors. Then, I present approaches to failure identification. I also describe existing
proactive and reactive failure management mechanisms. In the end, I survey the soft error injection
techniques.
Chapter 3 presents the cloud dependability analysis framework and the key findings on the
properties of the cloud dependability that guide the development of dependable cloud systems.
Chapter 4 presents the proposed performance metric selection and combination approaches
to reduce the dimensionality of the collected runtime cloud performance data. The reduced dataset
will be used in developing efficient and accurate anomaly identification mechanisms.
Chapter 5 describes the design, implementation, and evaluation of the proposed proactive
cloud anomaly detection framework. I first present the wavelet-based multi-scale anomaly de-
tection approach. Then I apply the proposed approach to the collected cloud performance data
using a sliding window mechanism and show the experimental results.
In Chapter 6, I describe a motivating example that demonstrates the limitations of tra-
ditional Principal Component Analysis (PCA) for anomaly identification and diagnosis. Then, I
describe the proposed anomaly identification approach based on metric subspace analysis. Ex-
perimental results from our cloud testbed and on the Google datacenter traces are presented to
show the root-cause analysis of different failures using MRPCs.
Chapter 7 presents the soft error fault injection framework for profiling application vulner-
ability. First, the design of a coarse-grained injection platform is described based on an attached
GDB debugger. I then improve it into a fine-grained soft error injection facility. In case
studies, I show the effects of SEFI on multiple applications.
Chapter 8 concludes the dissertation with a summary of this work and remarks on the
directions of future research.
CHAPTER 2
BACKGROUND AND RELATED WORK
Production cloud computing systems continue to grow in their scale and complexity. They
are changing dynamically as well due to the addition and removal of system components, changing
execution environments and workloads, frequent updates and upgrades, online repairs and more.
In such large-scale, complex, and dynamic systems, failures are common [45]. In addition, failure
management requires a significantly higher level of automation. Examples include anomaly detec-
tion and failure diagnosis based on realtime streams of system events, and performing continuous
monitoring of cloud servers and services. The core of autonomic computing [51] is the ability to
analyze data in realtime and to identify anomalies accurately and efficiently. The goal is to avoid
catastrophic failures through prompt execution of remedial actions.
In this chapter, I survey the existing work on metric selection and extraction, anomaly de-
tection and failure diagnosis and fault injection, which provides the background for this dissertation
research.
2.1. Metrics Selection and Extraction
Anomaly detection is an important topic and attracts considerable attention [97]. Typical
methods treat anomalies as deviations from normal system performance [62], [27]. Examples include
diagnosis and prediction based on probabilistic or analytical models [36], real-time streams of
computer events [53], [107], and continuous monitoring of runtime services [112], [50]. These
model-based diagnosis methods analyze the system by deriving a probabilistic or analytical
model. Models are trained with the help of prior knowledge. Examples include a naive Bayesian
model for hardware disk failure [36], an EM-algorithm model for the highest-likelihood mix-
ture [69], and a Hidden Markov model for online failure detection [85]. These approaches all
depend on studying a huge amount of health-related data during training, which is particularly
challenging when mining daily logs on the order of gigabytes. Furthermore, the
scarcity of failure data makes it difficult to accurately diagnose the root causes.
To overcome these drawbacks, other researchers employ performance
metric selection and extraction [67], [77] as a pre-processing step, preserving as much of the
dataset's variance as possible while shrinking its size to speed up analysis. Metric
selection and extraction aim to remove redundancy in the dataset. Still, they share
the weaknesses inherent in model-based approaches: maintaining the
accuracy of a model is difficult for a complex system, especially when the system moves to a cloud
environment featuring virtualization. Meanwhile, any maintenance, update, or upgrade
of the cloud system imposes the requirement of rebuilding the model. Moreover, any change to the
virtual architecture (e.g., virtual machine migration, initialization, and shutdown) makes the model
ineffective and inadequate. In addition, many cloud infrastructures never take the behavior of the
workload into account, though it may be a dominant factor affecting the results of metric selection
and extraction. Besides, these methods also suffer from overfitting to specific training data sets.
2.2. Anomaly Detection and Failure Management
Anomaly detection based on analysis of performance logs has been the topic of many stud-
ies. Hodge and Austin [42] provide an extensive survey of failure/anomaly detection techniques.
A structured and broad overview of extensive research on failure/anomaly detection techniques has
been presented in [15]. There exist many methods for failure detection, typically based on statis-
tical techniques. Specifically, Cohen et al. [19] developed an approach in the SLIC project that
statistically clusters metrics with respect to SLOs to create system signatures. Chen et al. [17] pro-
posed Pinpoint using clustering/correlation analysis for problem determination. Concerning data
center management, Agarwala et al. [1, 2] proposed profiling tools, E2EProf and SysProf, that can
capture monitoring information at different levels of granularity. They however address different
sets of VM behaviors, focusing on relationships among VMs rather than anomalies.
Broadly speaking, existing approaches can be classified into two categories: model-based
and data-driven. A model-based approach derives a probabilistic or analytical model of a sys-
tem. A warning is triggered when a deviation from the model is detected [41]. Examples include
an adaptive statistical data fitting method called MSET presented in [100], naive Bayesian based
models for disk failure prediction [37], and Semi-Markov reward models described in [31]. In
large-scale systems, errors may propagate from one component to others, thereby making it dif-
ficult to identify the root causes of failures. A common solution is to develop fault propagation
models, such as causality graphs or dependency graphs [93]. Generating dependency graphs, how-
ever, requires a priori knowledge of the system structure and the dependencies among different
components, which is hard to obtain in large-scale systems. The major limitation of model-based
methods is their difficulty of generating and maintaining an accurate model, especially given the
unprecedented size and complexity of production cloud computing systems.
Data mining and statistical learning theories have received growing attention for anomaly
detection and failure management. These methods extract failure patterns from systems’ normal
behaviors, and detect abnormal observations based on the learned knowledge [74]. For example,
the RAD laboratory at UC-Berkeley applied statistical learning techniques for failure diagnosis in
Internet services [17, 110]. The SLIC (Statistical Learning, Inference and Control) project at HP
explored similar techniques for automating failure management of IT systems [19]. In [82, 103],
the authors presented several methods to forecast failure events in IBM clusters. In [63], Liang et al.
examined several statistical methods for failure prediction in IBM Blue Gene/L systems. In [54],
Lan et al. investigated a meta-learning based method for improving failure prediction. These results
greatly help the checkpointing and restarting mechanisms of proactive failure manage-
ment. Approaches based on knowledge of the failure distribution, however, face two obstacles: failure
data are rare, and they require a consistent system setting, which is usually hard to guarantee given
possible system updates and reconfigurations. Moreover, without timely failure validation, this kind
of unsupervised model bears a very high false alarm rate.
Research in [33, 87, 63, 83] characterized failure behaviors in production cloud and net-
worked computer systems. They found that failures are common in large-scale systems and their
occurrences are quite dynamic, displaying uneven distributions in both time and space domains.
There exist the time-of-day and day-of-week patterns in long time spans [87, 83]. Weibull dis-
tributions were used to model time-between-failure in [40]. Failure events, depending on their
types, display strong spatial correlations: a small fraction of computers may experience most of
the failures in a coalition system [83] and multiple computers may fail almost simultaneously [63].
Failure diagnosis techniques localize the root causes of a failure to a group of system per-
formance metrics, notifying the system administrator to validate the inferred alarms by analyzing
hardware failures, operating system faults, and user-level applications. The most widely used
commercial tools for failure diagnosis are rule-based. IBM Tivoli Enterprise
Console provides a platform for users to define and develop new rules to trace back to the root
causes of failures. Chopstix [10] collects profiled runtime system status and builds inference rules offline.
These rules are used to map system behaviors to known diagnosis rules. X-ray [7] and
Crosscut [18] troubleshoot performance anomalies with dynamic instrumentation techniques, re-
playing processes at runtime to gain knowledge of the root cause. Besides, machine learning
techniques are also used to build and train failure diagnostic models. Draco [49] ad-
dresses chronic problems in distributed systems using a scalable Bayesian learner.
Decision trees [110, 35] and clustering techniques [71, 34] serve failure diagnosis by as-
signing resource usage metrics to the correlated anomalies. Rule-based diagnosis techniques
cannot cope with the dynamics of cloud environments and the many unforeseen failures that propagate
from other nodes and VM instances. On the other hand, the performance of machine learning based
failure diagnosis techniques is limited by the overhead caused by data volume and dimensionality.
2.3. State of the Art of Fault Injection
Studying the behavior of applications in the face of soft errors has been growing in popu-
larity in recent years. I identified two main techniques used in existing research to inject faults into
running applications: dynamic binary instrumentation and virtual machine based injection.
2.3.1. Dynamic Binary Instrumentation-based Fault Injection
One category of recent research focuses on injecting soft errors into an application binary
dynamically. Thomas et al. [99] propose LLFI, a programming-level fault injection tool that uses
the LLVM [57] Just-In-Time Compiler (JITC) to allow injections into intermediate code based on
data categories (pointer data and control data). The performance of LLFI is restricted by the compiler
TABLE 2.1. Existing fault injection technologies.

Methodology                       | Heuristic | Application Knowledge | Semantic Intrusion | Granularity     | Compiler Dependency | Output Fault Control
--------------------------------- | --------- | --------------------- | ------------------ | --------------- | ------------------- | --------------------
LLFI [99]                         | Yes       | Yes                   | Yes                | Fine and Coarse | Yes                 | Yes
BIFIT [61]                        | Yes       | Yes                   | Yes                | Fine and Coarse | Yes                 | No
Virtual Hardware FI [60]          | No        | No                    | No                 | Coarse          | No                  | No (Crash Only)
CriticalFault [105], Relyzer [86] | Yes       | No                    | No                 | Coarse          | Yes                 | No
VarEMU [65]                       | Yes       | Yes                   | Yes                | Fine and Coarse | No                  | No
Xen SWIFI [58]                    | Yes       | No                    | Yes                | Coarse          | No                  | No
specification (only GCC is supported), the heuristic study of the application source code, and by
the instrumented Intermediate Representation (IR) of the machine code.
Li et al. [61] designed BIFIT to investigate an application’s vulnerability to soft errors by
injecting faults at arbitrarily chosen execution points and data structures. BIFIT is closely inte-
grated with PIN [66], a binary instrumentation tool. BIFIT relies on application knowledge by
profiling the application to generate a memory reference table of all data objects. This approach
becomes less practical when the application uses a random seed to dynamically initialize the in-
put data set. Moreover, due to the limitations of PIN, BIFIT is constrained to specific (although
popular) hardware architectures.
2.3.2. Virtualization-based Fault Injection
Virtualization provides an infrastructure-independent environment for injecting hardware
errors with minimal modification to the system or application. In addition, virtualization can be
used to evaluate a variety of hardware and explore new architectures.
Levy et al. [60] propose a virtualization-based framework that injects errors into a guest’s
physical address and evaluates fault tolerance technologies for HPC systems (e.g. Palacios [56]).
Their Virtual Hardware Fault Injector is only able to inject crash failures and IDE disk failures.
CriticalFault [105] and Relyzer [86] are based on Simics, a commercial simulator. With
static pruning and dynamic profiling of the application, instructions are categorized into several
classes, i.e., control, data store and address. This process reduces the number of potential fault in-
jection sites by pruning the injection space. Depending on the test scenario, soft errors are injected
into different categories, which produce different faulty outputs, e.g., crash or SDC. However,
CriticalFault and Relyzer cannot always establish correlations between instruction-level fault
injection and the faulty behaviors, because tracing back from instructions to high-level languages
is difficult. Therefore, only coarse-grained injection is available.
Wanner et al. [65] designed VarEMU, an emulation testbed built on top of QEMU. Injection
is controlled by a guest OS system call. In order to inject faults, users have to import a library to
interface with system calls and insert fault injection code into the applications that run in the guest
OS. The user-space controls cannot guarantee that injections are applied to the specific user-space
application: if another user-space application is running the same type of instructions and sharing
the CPU with the application targeted by the injector, the consequence of fault injection is unpredictable
and uncontrollable.
Winter et al. [58] designed a software-implemented fault injector (SWIFI) based on the
Xen Virtual Machine Monitor. Using Xen Hypercalls, Xen SWIFI can inject faults into the code,
memory and registers of Para-Virtualization (PV) and Fully-Virtualization (FV) virtual machines.
PV intrudes into the original system via modifications to kernel device drivers. Because the injec-
tion targets registers (i.e., EIP, ESP, EFL and EAX) that Xen SWIFI does not directly control, it is
not possible to be certain that an injection affects the application of interest.
CHAPTER 3
A CLOUD DEPENDABILITY ANALYSIS FRAMEWORK FOR CHARACTERISING THE
SYSTEM DEPENDABILITY IN CLOUD COMPUTING INFRASTRUCTURES
3.1. Introduction
Due to the inherent complexity and large scale, production cloud computing systems are
prone to various runtime problems caused by hardware and software failures. Dependability as-
surance is crucial for building sustainable cloud computing services. Although many techniques
have been proposed to analyze and enhance reliability of distributed systems, there is little work
on understanding the dependability of cloud computing environments.
As virtualization has become the de facto enabling technology for cloud computing [7],
dependability evaluation of the cloud is no longer confined to the hardware, operating system, and
application layers. A new virtualized environment, which consists of virtual machines (VMs) with
virtualized hardware and hypervisors, should be analyzed to characterize the cloud dependabil-
ity. VM-related operations, such as VM creation, cloning, migration, and accesses to physical
resources via virtualized devices, cause more points of failure. They also make failure detec-
tion/prediction and diagnosis more complex. Moreover, virtualization introduces richer perfor-
mance metrics to evaluate the cloud dependability. Traditional approaches [68, 30] that ignore
those cloud-oriented metrics may not model cloud dependability accurately or effectively.
In this dissertation task, I aim to evaluate cloud dependability in virtualized environ-
ments and compare it with traditional, non-virtualized systems. To achieve this goal, I propose
a cloud dependability analysis (CDA) framework with mechanisms to characterize failure behav-
ior in cloud computing infrastructures. I design failure-metric DAGs to model and quantify the
correlation of various performance metrics with failure events in virtualized and non-virtualized
systems. I study multiple types of failures, including CPU-, memory-, disk-, and network-related
failures. By comparing the generated DAGs in the two environments, I gain insight into the effects
of virtualization on the cloud dependability.
In this chapter, I present an overview of the cloud dependability analysis framework in
FIGURE 3.1. Architecture of the cloud dependability analysis (CDA) framework.
Section 3.2. Details of the failure-metric DAG based analysis method are described in Section 3.3.
We present the cloud testbed and the runtime cloud performance profiling system in Section 3.4.
Analytical results for different types of failures are shown and discussed in Section 3.5. Section 3.6
summarizes this chapter.
3.2. Overview of the Cloud Dependability Analysis Framework
Figure 3.1 depicts the architecture of our cloud dependability analysis (CDA) framework.
The cloud computing environment consists of a large number of cloud servers, each of which can
accommodate a set of virtual machines (VMs). A VM encapsulates the execution states of cloud
services and runs a client application. These VMs multiplex resources of the underlying physical
servers. The virtual machine monitor (VMM, also called hypervisor) is a thin layer that manages
hardware resources and exports a uniform interface to the upper VMs [81].
Our CDA system is distributed in nature. I leverage fault injection techniques [43, 76]
to evaluate cloud dependability. Each cloud server has a fault injection agent, which injects
random or specific faults into the host system. Faults can be injected into multiple layers of a system,
including the hypervisor, VMs, and possibly client applications. There is also a health monitoring
sensor residing on each cloud server. It periodically records the values of a list of health-related
performance metrics from the hardware, hypervisor, and VMs. Both the fault injection agent and the
health monitoring sensor run in a privileged domain, such as Dom0 in a Xen-based virtualization
environment, in order to access privileged server resources.
A Coordinator controls the fault injection operations. It determines the time and location
of an injection and the type of fault to be injected or lets the fault injection agents inject faults
randomly. It also schedules client applications to run on the cloud servers. By communicating with
the health monitoring sensors, the Coordinator collects the raw or aggregated cloud performance
data for cloud dependability analysis. In a small-scale cloud computing testbed, one Coordinator
can perform these tasks. However, the Coordinator will become a performance bottleneck in large-
scale cloud computing environments. To tackle this problem, I propose to employ a hierarchy
of Coordinators. A lower-level Coordinator manages multiple or all cloud servers in a rack. It
receives fault injection requests from and sends aggregated performance data and dependability
analysis results to the upper-level Coordinators.
3.3. Cloud Dependability Analysis Methodologies
To analyze the cloud dependability, I need a representation that captures those aspects of
cloud state that serve as a fingerprint of a particular cloud condition. I aim at capturing the essential
cloud state that contributes to cloud component failures, and to do so using a representation that
provides information useful in the diagnosis of the state.
CDA continuously measures a collection of metrics that characterize the low-level opera-
tions of a cloud computing system, from hardware devices, hypervisors, to virtual machines. This
information can come from system facilities, commercial operation monitoring tools, server logs,
etc.
As a starting point, I can simply use the raw values of the metrics as the cloud fingerprint. I
then automatically build models that identify the set of metrics that correlate with failure instances
as a concise representation of the cloud fingerprint. I compare the fingerprint from virtualized
computing environments with those from traditional, non-virtualized environments, and investigate
the influence of virtualization on cloud dependability analysis.
The metric attribution problem is a pattern classification problem in supervised learning.
Let f_t denote the state of the failure occurrence at time t. In this case, f can take one of two states
from the set {normal, failure} for binary classification. Let {0, 1} denote the two states. When
we distinguish different types of failures, e.g. hardware (CPU, memory, disk, node controller,
rack, network, etc.) failures and software (hypervisor, VM, application, scheduler, etc.) failures,
a multi-class classification model is employed with each failure type being assigned a class value.
Let x_t denote a record of values for the n collected performance metrics m_1, m_2, ..., m_n at time t.
The pattern classification problem is to learn a classifier function C such that C(x_t) = f_t.
I use a directed acyclic graph (DAG) to represent the classification function C. Each node
in the DAG corresponds to a cloud performance metric. An arc between two nodes represents a
probabilistic correlation. Let x = (x_1, x_2, ..., x_n) be a cloud performance data point described by
the n performance metrics m_1, m_2, ..., m_n, respectively. Using the DAG, we can compute the
probability of data point x by
(1)    P(x) = \prod_{i=1}^{n} P(x_i \mid m_j),
where metric mj is the immediate predecessor of metric mi in the DAG. To find the essential
metrics that can characterize the correlation between cloud performance and failure events, we
compute the conditional probability of every metric on failure occurrences, i.e., P (mk|failure),
and select those metrics whose conditional probabilities are greater than a threshold τ . The selected
metrics constitute the cloud fingerprint.
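As a concrete illustration of the thresholding step, the sketch below estimates the conditional probability of each (binarized) metric given a failure and keeps metrics above a threshold τ. The data, metric count, and function name are illustrative, not the dissertation's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy monitoring data: rows are time steps, columns are binarized metrics
# (1 = metric above its normal range); the failure label tracks metric m0.
X = rng.integers(0, 2, size=(200, 4))          # metrics m0..m3 (illustrative)
labels = X[:, 0] | (rng.random(200) < 0.05)    # 0/1 failure state per step

def fingerprint(X, labels, tau=0.5):
    """Return indices of metrics with P(metric = 1 | failure) > tau."""
    failure_rows = X[labels == 1]
    cond_prob = failure_rows.mean(axis=0)      # estimates P(m_k = 1 | failure)
    return [k for k, p in enumerate(cond_prob) if p > tau]

selected = fingerprint(X, labels, tau=0.5)     # the cloud fingerprint indices
```

Because the synthetic failures follow metric m0, m0 passes the threshold while unrelated metrics hover near chance level.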
A DAG is automatically built from a set of cloud performance data records, R = {x1, x2, . . . , xl}.
For a cloud performance metric mi, let metric mp denote a parent of mi. The probability P (mi =
mij|mp = mpk) is computed and denoted by wijpk. The DAG building mechanism searches for
the wijpk values that best model the cloud performance data. In essence, it tries to maximize the
probability
(2)    P_w(R) = \prod_{r=1}^{l} P_w(x_r).
This is done by an iterative process. wijpk is initialized to random probability values for any i,
j, p, and k. In each iteration, for each cloud performance data record, xr, in R, our mechanism
computes
(3)    \frac{\partial \ln P_w(R)}{\partial w_{ijpk}} = \sum_{r=1}^{l} \frac{P(m_i = m_{ij}, m_p = m_{pk} \mid x_r)}{w_{ijpk}}.
TABLE 3.1. Description of the injected faults.
Type of Injected Faults Symptom
CPU Fault Infinite loop
Memory Fault Keep allocating the memory space
I/O Fault Keep copying files to the disk
Network Fault Keep sending and receiving packets
Then, the values of wijpk are updated by
(4)    w_{ijpk} = w_{ijpk} + \alpha \, \frac{\partial \ln P_w(R)}{\partial w_{ijpk}},
where α is a learning rate and ∂lnPw(R)/∂wijpk is computed from Equation (3). The value of α is
set to a small constant for quick convergence. Before the next iteration starts, the values of wijpk
are normalized to be between 0 and 1.
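The iterative update in Equations (3) and (4) can be sketched for a drastically simplified case: one binary child metric with one binary parent, so that w[j, k] plays the role of w_ijpk, i.e., an estimate of P(m_i = j | m_p = k). The data, learning rate, and iteration count are illustrative assumptions, not the dissertation's settings.

```python
import numpy as np

rng = np.random.default_rng(1)
records = rng.integers(0, 2, size=(500, 2))    # columns: (m_i, m_p) values

w = rng.random((2, 2))
w /= w.sum(axis=0, keepdims=True)              # start with normalized columns

alpha = 1e-3                                   # small learning rate
for _ in range(50):
    grad = np.zeros_like(w)
    for mi, mp in records:
        # Equation (3): each record with (m_i = j, m_p = k) observed
        # contributes 1 / w[j, k] to the gradient of ln P_w(R).
        grad[mi, mp] += 1.0 / w[mi, mp]
    w += alpha * grad                          # Equation (4): gradient step
    w = np.clip(w, 1e-6, None)
    w /= w.sum(axis=0, keepdims=True)          # renormalize to [0, 1]
```

The fixed point of this loop makes w[j, k] proportional to the empirical count of (m_i = j, m_p = k) within each parent value k, i.e., the empirical conditional probability.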
3.4. Cloud Computing Testbed and Performance Profiling
The cloud computing system under test consists of 16 servers. The cloud servers are
equipped with 4 to 8 Intel Xeon or AMD Opteron cores and 2.5 to 16 GB of RAM. I have installed Xen 3.1.2 hypervisors on the cloud servers. The operating system on a virtual machine is Linux 2.6.18 as distributed with Xen 3.1.2. Each cloud server hosts up to ten VMs. A VM is assigned up to two VCPUs, and the number of active ones depends on the applications. The amount of memory allocated to a VM is set to 512 MB. I run the RUBiS [14] distributed online
service benchmark and MapReduce [24] jobs as cloud applications on VMs. The applications are
submitted to the cloud testbed through a web based interface. I have developed a fault injection
tool, which is able to inject four major types and 12 sub-types of faults to cloud servers by adjusting the levels of intensity. They mimic the faults of CPU, memory, disk, and network. All four major types of injected faults are summarized in Table 3.1.
I exploit third-party monitoring tools: sysstat [94] to collect runtime performance data in
the hypervisor and virtual machines, and a modified perf [75] to obtain the values of performance
counters from the Xen hypervisor on each server in the cloud testbed. In total, 518 metrics are
FIGURE 3.2. A sampling of cloud performance metrics that are often correlated
with failure occurrences in our experiments. In total, 518 performance metrics are
profiled with 182 metrics for the hypervisor, 182 metrics for virtual machines, and
154 metrics for hardware performance counters (four cores on most of the cloud
servers).
profiled, i.e., 182 for the hypervisor and 182 for virtual machines by sysstat and 154 for perfor-
mance counters by perf, every minute. They cover the statistics of every component of cloud
servers, including the CPU usage, process creation, task switching activity, memory and swap
space utilization, paging and page faults, interrupts, network activity, I/O and data transfer, power
management, and more. Table 3.2 lists and describes a sampling of the performance metrics that
are often correlated with failure occurrences in our experiments. I tested the system from May 22,
2011 to February 18, 2012. In total, about 813.6 GB of performance data were collected and recorded
from the cloud computing testbed in that period of time.
To tackle the big data problem and analyze the cloud dependability efficiently, our cloud
dependability analysis (CDA) system removes those performance metrics that are least relevant to
failure occurrences. First, CDA searches for the metrics that display zero variance. Among all of the 518 metrics, 112 have constant values, which provide no contribution to cloud dependability analysis. After removing them, 406 non-constant metrics are kept. Then, CDA calculates the correlation between the remaining metrics and the “failure” label (0/1 for normal/failure
classification and multi-classes for different types of failures). CDA removes those metrics whose
correlations with failure occurrences are less than a threshold τcorr.
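The two filtering passes above can be sketched as follows. The metric names, sample data, and threshold value are illustrative stand-ins, not the metrics or the τcorr used in the dissertation.

```python
import numpy as np

rng = np.random.default_rng(2)
n_samples = 300
data = {
    "cpu_user": rng.random(n_samples),         # hypothetical metric names
    "mem_used": rng.random(n_samples),
    "const_counter": np.full(n_samples, 7.0),  # constant -> zero variance
}
failure = (data["cpu_user"] > 0.8).astype(float)  # failures track cpu_user

def filter_metrics(data, failure, tau_corr=0.3):
    """Drop zero-variance metrics, then metrics weakly correlated with failure."""
    kept = {}
    for name, values in data.items():
        if np.std(values) == 0.0:              # pass 1: zero variance
            continue
        corr = np.corrcoef(values, failure)[0, 1]
        if abs(corr) >= tau_corr:              # pass 2: weak correlation
            kept[name] = values
    return kept

kept = filter_metrics(data, failure)
```

On this toy data, the constant counter is removed in the first pass and the unrelated metric in the second, leaving only the metric that drives the failure label.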
3.5. Impact of Virtualization on Cloud Dependability
This work aims to find out and model the impact of virtualization on system dependability
in cloud computing infrastructures. To this end, our cloud dependability analysis (CDA) system
compares the correlation of various performance metrics with failure occurrences in virtualization
and traditional non-virtualization environments. CDA exploits the DAGs described in Section 3.3
for the analysis and comparison.
To build a failure-metric DAG using a training set from the collected cloud performance
data, CDA sets the root node as “failure” for all types of failure events or a specific type of failures
for finer-grain analysis. Each node, except for the root node, is allowed to have multiple parents.
The maximal number of parents can be configured. For example, in our experiments it is set to
two, which means each metric node in the DAG can have only one more parent in addition to the
root node. Moreover, a continuous metric is discretized to a certain number of bins based on the
nature of the metric.
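The discretization step can be sketched with equal-width binning; the bin count, data, and function name are illustrative assumptions (the dissertation states only that the bin count depends on the nature of the metric).

```python
import numpy as np

rng = np.random.default_rng(3)
cpu_util = rng.random(1000) * 100.0            # hypothetical %CPU samples

def discretize(values, n_bins=4):
    """Map continuous values to bin indices 0..n_bins-1 (equal-width bins)."""
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # digitize against the interior edges yields indices in 0..n_bins-1
    return np.clip(np.digitize(values, edges[1:-1]), 0, n_bins - 1)

bins = discretize(cpu_util, n_bins=4)
```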
In this section, I focus on the failures caused by CPU (Section 3.5.1), memory (Section 3.5.2), disk (Section 3.5.3), network (Section 3.5.4), and all (Section 3.5.5) faults, and model the impact of virtualization on cloud dependability. I present the DAGs for virtualized and non-virtualized systems and compare the results. Due to the space limitation, only the top three levels of each DAG are plotted.

FIGURE 3.3. Failure-metric DAG for CPU-related failures in the cloud testbed.

FIGURE 3.4. Failure-metric DAG for CPU-related failures in the non-virtualized system.
3.5.1. Analysis of CPU-Related Failures
To characterize the cloud dependability under CPU failures, the Coordinators in the CDA
system control the fault injection agents to inject CPU related faults, including randomly changing
one or multiple bits of the outputs of arithmetic or logic operations, continuously using up all CPU
cycles, and more. These faults are injected to one, some, or all of the processor core(s) on a cloud
server. The health monitoring sensors collect the runtime performance data on each cloud server,
pre-process the data, and report them to the Coordinators, which build the failure-metric DAGs
and analyze the system health status of a management domain or the entire cloud.
Figure 3.3 depicts the DAG for CPU related failures in the cloud computing testbed with
virtualization support. For comparison, I also conduct experiments on a traditional distributed
system without virtualization. Figure 3.4 presents the corresponding DAG.
From Figure 3.4, I can see that 13 metrics display strong correlation with the occurrences
FIGURE 3.5. Failure-metric DAG for memory-related failures in the cloud testbed.

FIGURE 3.6. Failure-metric DAG for memory-related failures in the non-virtualized system.
of CPU related failures in the non-virtualized system. Among them, four (i.e., %usr all, %nice all,
%sys all, and %iowait all) are metrics for all processor cores, while the others (i.e., %usr n,
%nice n, %sys n, %iowait n, and %soft n) are for individual cores.
In the cloud computing environment (Figure 3.3), 12 metrics are highly correlated with
the failures. Metric %usr all from the privileged domain, Dom0, is the direct child of the root
node, showing the highest correlation. Among the 12 metrics, 11 are metrics collected from Dom0
(metrics from user virtual machines, DomU, are located at lower levels of the DAG). They are %usr,
%sys, and %iowait of all or individual processor cores. %steal all is a new metric that is cor-
related with failure occurrences, compared with Figure 3.4. In addition, a performance counter
metric, DTLB-load-miss, also has a strong dependency on CPU related failures, while other performance counters have higher correlation with performance metrics of either the hypervisor or
virtual machines.
3.5.2. Analysis of Memory-Related Failures
To characterize the cloud dependability under memory related failures, memory faults are
injected by the fault injection agents to cloud servers. This type of faults includes flipping one or
multiple bits of memory to the opposite state, using up all available memory space, and more. The
Coordinators collect the runtime performance data from the health monitoring sensors on cloud
servers, and generate failure-metric DAGs for cloud dependability analysis.
FIGURE 3.7. Failure-Metric DAG for disk-related failures in the cloud testbed.
FIGURE 3.8. Failure-metric DAG for disk-related failures in the non-virtualized system.
Figure 3.5 shows the DAG for memory related failures in the cloud computing testbed. The
result from the non-virtualized system is presented in Figure 3.6.
In Figure 3.6, six metrics display strong correlation with the occurrences of memory related
failures in the non-virtualized system. They are %usr all and %sys all for all processor cores,
%usr n and %iowait n of some individual cores, and %memused, which indicates the memory
utilization.
In the cloud computing environment (Figure 3.5), seven metrics are highly correlated with
the failures. All of the seven metrics come from Dom0. Compared with Figure 3.6, the metric
%usr all is the direct child of the root node in both cases. However, in the cloud computing envi-
ronment, the metric %memused is not a significant identifier of memory related failures. Instead,
%soft n becomes more closely correlated with the occurrences of memory failures.
3.5.3. Analysis of Disk-Related Failures
Disks are also prone to faults [87, 104]. In our experiments, the fault injection agents inject disk faults by blocking certain disk I/O operations or running background micro-benchmark programs that continuously copy large files to disks to saturate disk I/O bandwidth. Again, the
FIGURE 3.9. Failure-metric DAG for network-related failures in the cloud testbed.
FIGURE 3.10. Failure-metric DAG for network-related failures in the non-
virtualized system.
Coordinators collect the cloud-wide performance data and analyze the cloud dependability.
Figure 3.7 presents the DAG for failures caused by disk faults in the cloud computing
testbed. Figure 3.8 shows the result from the non-virtualized system. From the two figures, I
observe that more metrics are correlated with the failure occurrences.
In the non-virtualized system (Figure 3.8), 15 metrics highly correlate with the occurrences
of disk related failures. In addition to other CPU metrics, %iowait n and %nice n are directly
affected by disk I/O operations. It is interesting to notice that metrics such as rd sec/s and wr sec/s
are not included in the top correlated metrics. This is because these metrics have a more direct
influence on the values of processor related metrics.
In the cloud computing environment (Figure 3.7), 12 metrics are the top ones that are
correlated with the failures. Among them, the metric %sys all from Dom0 is the direct child
of the root node, which is different from the non-virtualized case. Compared with Figure 3.8,
virtualization has more significant impact on the metrics including %steal n and pgpgout/s for
disk related failures.
3.5.4. Analysis of Network-Related Failures
Networking hardware/software in cloud servers and switches and routers in the core net-
work may fail at runtime [111]. To generate network related failures, the fault injection agents
inject network faults by dropping certain incoming/outgoing network packets, flipping one or multiple bits of packets to the opposite state, or attempting to use up the network bandwidth by
continuously transferring large files through the network. After the performance data are collected
from cloud servers, failure-metric DAGs are generated to analyze the cloud dependability under
network related failures. Figures 3.9 and 3.10 show the DAGs for the cloud computing testbed and
the non-virtualized system, respectively.
From Figure 3.10, I observe the occurrences of network failures are strongly correlated
with 12 metrics in the non-virtualized environment. Two metrics, %iowait n and %usr all, are the
direct children of the root node. In contrast, 16 metrics are included within the top three levels of
the DAG in Figure 3.9. For the cloud computing testbed, one metric, %usr all, is the direct child of
the root node. Three new metrics profiled from Dom0, fault/s, tcp-tw, and await dev8, are highly correlated with the occurrences of network failures. They are closely related to the networking operations, including the number of packets, the number of TCP sockets, and the average processing
time by networking devices. Moreover, two metrics from user virtual machines, DomU, are among
the most significant ones. They are U %usr all and U %steal n, accounting for the time to process a large number of network packets and to switch between virtual processors.
3.5.5. Analysis of All Types of Failures
In addition to studying individual types of failures, I analyze the cloud dependability under
any type of failures. The goal is to identify a set of metrics that can characterize all types of failures
and to understand the impact of virtualization on the metric selection.
To generate the failure-metric DAGs for this purpose, the Coordinators mix the cloud per-
formance data records together. The label of each record takes one of the two values: 0 or 1
denoting a “normal” or “failure” state. Figures 3.11 and 3.12 depict the DAGs for the cloud com-
puting testbed and the non-virtualized system, respectively. The root nodes represent the generic
failures.
TABLE 3.2. The metrics that are highly correlated with failure occurrences in the
cloud testbed using four-level failure-metric DAGs.
Failure type No. of correlated metrics No. of metrics from Dom0 No. of metrics from DomU
CPU-related failures 45 44 1
Memory-related failures 29 26 2
Disk-related failures 34 25 9
Network-related failure 32 31 1
All failures 25 24 1
FIGURE 3.11. Failure-metric DAG for all types of failures in the cloud testbed.
By comparing these two figures, I can find out the influence of virtualization on the system
dependability. In both cases, processor related metrics are the dominant ones.[1] Certain metrics in
these two DAGs also appear in the DAGs for individual types of failures. For the non-virtualized
case (Figure 3.12), a metric related with memory and disk operations, %vmeff, has a strong depen-
dency with the generic failures. In contrast, a hardware performance counter metric, DTLB-stores,
is highly correlated with failure occurrences in the cloud computing environment as shown in
Figure 3.11. Moreover, in Figure 3.11 and also preceding DAGs for the cloud computing environ-
ment, most of the correlated metrics are associated with Dom0. If more levels of the DAGs are
considered, more metrics from user virtual machines, DomU, correlate with failure occurrences.
However, there is little work on understanding the dependability of cloud computing environments. As virtualization has been an enabling technology for cloud computing, it is imperative to investigate the impact of virtualization on the cloud dependability, which is the focus of this work.

[1] Only the first three levels of the DAGs are depicted due to the limited space. When more levels are considered, metrics for other system components are incorporated for dependability analysis.
FIGURE 3.12. Failure-metric DAG for all types of failures in the non-virtualized system.

3.6. Summary
Large-scale and complex cloud computing systems are susceptible to software and hard-
ware failures, which significantly affect the cloud performance and management. It is imperative
to understand the failure behavior in cloud computing infrastructures. In this work, I study the
impact of virtualization, which has become an enabling technology for cloud computing, on the
cloud dependability. I present a cloud dependability analysis (CDA) framework with mechanisms
to characterize failure behavior in virtualized environments. I exploit failure-metric DAGs to an-
alyze the correlation of various cloud performance metrics with failure events in virtualized and
non-virtualized systems. We study multiple types of failures, including CPU-, memory-, disk-,
and network-related failures. By comparing the generated DAGs in the two environments, I gain
insight into the effects of virtualization on the cloud dependability.
CHAPTER 4
A METRIC SELECTION AND EXTRACTION FRAMEWORK FOR DESCRIBING CLOUD
PERFORMANCE ANOMALIES
4.1. Introduction
To characterize cloud behavior, identify anomalous states, and pinpoint the causes of fail-
ures, I need the runtime performance data collected from utility clouds. However, continuous
monitoring and large system scale lead to the overwhelming volume of data collected by health
monitoring tools. The size of system logs from large-scale production systems can easily reach
hundreds and even thousands of terabytes [70, 87]. In addition to the data size, the large number of metrics that are measured makes the data model extremely complex. Moreover, the existence of interacting metrics and external environmental factors introduces measurement noise in the collected data. For the collected health-related data, there might be a maximum number of metrics
above which the performance of anomaly detection will degrade rather than improve. High metric
dimensionality will cause low detection accuracy and high computational complexity. However,
there is a lack of systematic approaches to effectively identifying and selecting principal metrics
for anomaly detection.
In this chapter, I present a metric selection framework for online anomaly detection in the
cloud. Among the large number of metrics profiled, I aim at selecting the most essential ones
by applying metric selection and extraction methods. Mutual information is exploited to quantify
the relevance and redundancy among metrics. An incremental search algorithm is proposed to
select metrics by enforcing maximal relevance and minimal redundancy. We apply metric space
combination and separation to extract essential metrics and further reduce the metric dimension.
The remainder of this chapter is organized as follows. Section 4.2 presents the proposed
metric selection framework with three mechanisms. Experimental evaluation and discussion are
described in Section 4.3. Section 4.4 presents the summary.
4.2. Cloud Metric Space Reduction Algorithms
To make anomaly detection tractable and yield high accuracy, we apply dimensionality re-
duction which transforms the collected health-related performance data to a new metric space with
only the most important metrics preserved [38]. I propose two approaches to reducing dimension-
ality: metric selection using mutual information and metric extraction by metric space combination
and separation. Metric selection refers to methods that select the best subset of the original metric set.
The term metric extraction refers to methods that create new metrics based on transformations or
combinations of the original metric set. The data presented in a low-dimensional subspace are
easier to classify into distinct groups, which facilitates anomaly detection.
4.2.1. Metric Selection
The metric selection process can be formalized as follows. Given the input health-related
performance data D including L records of N metrics M = {mi, i = 1, . . . , N}, and the classification variable c, the task is to find from the N-dimensional measurement space, R^N, a subspace of n metrics, R^n, that optimally characterizes c.
In this section, I present the metric selection algorithm based on mutual information (MI) [21]
as a measure of relevance and redundancy among metrics to select a desirable subset. MI has two
main properties that distinguish it from other selection methods. First, MI has the capability of
measuring any type of relationship between variables, because it does not rely on statistics of any
grade or order. The second property is MI’s invariance under space transformation.
The mutual information of two random variables quantifies their mutual dependence. Let mi and mj be two metrics in M . Their mutual information is defined as I(mi;mj) =
H(mi) + H(mj) − H(mi,mj), where H(·) refers to the Shannon entropy [21]. Metrics in the
health-related performance data collected periodically from a cloud computing system usually
take discrete values. The marginal probability p(mi) of metric mi and the joint probability mass function p(mi,mj) of two metrics mi and mj can be calculated using the collected data. Then, the MI
of mi and mj is computed as
(5)    I(m_i; m_j) = \sum_{m_i \in M} \sum_{m_j \in M} p(m_i, m_j) \log \frac{p(m_i, m_j)}{p(m_i)\, p(m_j)}.
Intuitively, the MI between two metrics, I(mi;mj), measures the amount of information shared
between mi and mj . Metrics with high co-relevance have high MI. As special cases, the normalized I(mi;mi) equals 1, while I(mi;mj) = 0 if mi and mj are independent.
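A minimal sketch of Equation (5) for two discrete metrics, computed from empirical joint and marginal frequencies (illustrative data; natural logarithm). The variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
mi_metric = rng.integers(0, 3, size=2000)      # discrete metric m_i
mj_metric = mi_metric.copy()                   # m_j: a noisy copy of m_i
flip = rng.random(2000) < 0.3
mj_metric[flip] = rng.integers(0, 3, size=flip.sum())

def mutual_information(a, b):
    """I(a; b) = sum_xy p(x, y) * log( p(x, y) / (p(x) p(y)) )."""
    joint = np.zeros((a.max() + 1, b.max() + 1))
    for x, y in zip(a, b):
        joint[x, y] += 1
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)      # marginal p(x)
    py = joint.sum(axis=0, keepdims=True)      # marginal p(y)
    mask = joint > 0                           # skip zero-probability cells
    return float((joint[mask] * np.log(joint[mask] / (px @ py)[mask])).sum())

mi_self = mutual_information(mi_metric, mi_metric)   # equals H(m_i)
mi_pair = mutual_information(mi_metric, mj_metric)   # shared information
```

As the text notes, the self-MI is the entropy of the metric (its maximum), while the noisy copy shares strictly less information.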
The goal of metric selection is to find from the original N metrics a subset S with n metrics {mi, i = 1, . . . , n}, which jointly have the largest dependency on the class c. This can be accomplished by using two criteria in metric selection: maximal relevance and minimal redundancy. I use
the mean value of all MI values between individual metric mi and class c to define the relevance.
The maximal relevance criterion is specified as,
(6)    \max \ \mathrm{relevance}(S), \qquad \mathrm{relevance} = \frac{1}{|S|} \sum_{m_i \in S} I(m_i; c),
where |S| is the cardinality of S. By applying Equation (6), irrelevant metrics can be removed.
However, the remaining metrics may have rich redundancy. As a result, the dependency
among these metrics may still be high. When two metrics highly depend on each other, their
class-discriminative capabilities do not change much, if one of them is removed. Therefore, I
additionally apply a minimal redundancy criterion to select independent metrics.
(7)    \min \ \mathrm{redundancy}(S), \qquad \mathrm{redundancy} = \frac{1}{|S|^2} \sum_{m_i, m_j \in S} I(m_i; m_j).
I combine the two criteria (6 and 7) together to define the dependency of the selected metrics on
the class, dependency(S). To optimize relevance and redundancy simultaneously, I can use the
following equation.
(8)    \max \ \mathrm{dependency}(S), \qquad \mathrm{dependency} = \mathrm{relevance}(S) - \mathrm{redundancy}(S).
The N metrics in the original metric set M define a 2^N search space. Finding the optimal
metric subset is NP-hard [5]. To find the near-optimal metrics satisfying the criterion (8), I apply an
incremental search method. Given Sk−1, a metric subset with (k−1) metrics, I try to select the kth
metric that maximizes dependency(·) from the remaining metrics in (M − Sk−1). By including
Equations (6) and (7), the metric search algorithm looks for the kth metric that optimizes the
following condition.
(9)    \max_{m_i \in M - S_{k-1}} \left\{ I(m_i; c) - \frac{1}{k-1} \sum_{m_j \in S_{k-1}} I(m_i; m_j) \right\}.
The metric selection algorithm works as follows.
ALGORITHM 1. Metric selection algorithm
MetricSelection() {
1: Apply the incremental search following Equation
(9) to select n metrics sequentially from the
original metric set M . The value of n can be preset
with a large number. The search process produces n
nested metric sets, S1 ⊂ S2 ⊂ . . . ⊂ Sn.
2: Check these metric sets S1, . . . , Si, . . . , Sn to find
the range of i, where the cross-validation error erri
has small mean and small variance.
3: Within the range, look for the smallest error err∗.
The optimal size of the metric subset n∗ equals
the smallest i, for which Si has error err∗. The
corresponding Sn∗ is the selected metric subset.
4: }
The computational complexity of the incremental search method is O(|S| ·N).
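The incremental search of Equation (9) can be sketched as a greedy loop: at each step, add the metric that maximizes relevance to the class minus the mean redundancy to the metrics already chosen. The `mi` helper and the synthetic metrics below are illustrative stand-ins, not the dissertation's data.

```python
import numpy as np

rng = np.random.default_rng(5)

def mi(a, b):
    """Empirical mutual information of two discrete arrays (natural log)."""
    joint = np.zeros((a.max() + 1, b.max() + 1))
    for x, y in zip(a, b):
        joint[x, y] += 1
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    m = joint > 0
    return float((joint[m] * np.log(joint[m] / (px @ py)[m])).sum())

def incremental_select(metrics, c, n):
    """Greedy max-(relevance - redundancy) selection of n metric indices."""
    selected = []
    while len(selected) < n:
        best, best_score = None, -np.inf
        for i in range(len(metrics)):
            if i in selected:
                continue
            relevance = mi(metrics[i], c)
            redundancy = (np.mean([mi(metrics[i], metrics[j]) for j in selected])
                          if selected else 0.0)           # Equation (9)
            if relevance - redundancy > best_score:
                best, best_score = i, relevance - redundancy
        selected.append(best)
    return selected

def flip(a, p):                                # flip bits with probability p
    f = rng.random(a.size) < p
    return np.where(f, 1 - a, a)

c = rng.integers(0, 2, size=1000)              # 0/1 failure label
m0 = flip(c, 0.10)                             # relevant metric
m1 = flip(m0, 0.05)                            # relevant but redundant with m0
m2 = rng.integers(0, 2, size=1000)             # irrelevant noise

chosen = incremental_select([m0, m1, m2], c, n=2)
```

On this toy data the search first picks the most relevant metric, then skips its near-duplicate in favor of the independent one, illustrating the minimal-redundancy criterion.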
4.2.2. Metric Space Combination and Separation
The metric extraction process creates new metrics by transformation or combination of the
original metrics. It applies a mapping x′ = g(x) : R^n → R^{n′} to transform a measurement x in an n-dimensional space to a point x′ in an n′-dimensional space with n′ < n. It creates a subset
of new metrics by transformation of the original ones. The information most relevant to anomaly
detection in Rn is preserved. The goal is to reconstruct the health-related cloud performance dataset
to a space of fewer dimensions for more efficient and accurate anomaly identification. I explore
both metric space combination and metric space separation to find the most useful metrics and
reduce the dimension of metric space.
After the metric selection process (Section 4.2.1) is completed, the health-related cloud
performance dataset D contains L records (x1, x2, . . . , xL) with n metrics. Metric space combination transforms the L records from the n-dimensional space to L records (x′1, x′2, . . . , x′L) in a new n′-dimensional space.
Let m1,m2, . . . ,mn denote the n performance metrics. A measurement xi in D can be
represented with {xj,i}, the value of the jth metric mj of xi. That is xi = [x1,i, x2,i, . . . , xn,i]T .
Then, the cloud performance dataset D is represented by a matrix D = [x1, x2, . . . , xL]. To find
the optimal combination of the metric space, I calculate the covariance matrix of D as V = DD^T .
According to [26], in order to minimize the mean-squared error of representing the dataset by
n′ orthonormal metrics, the eigenvalues of the covariance matrix V are used. We calculate the
eigenvalues {λi} of V and sort them in a descending order as λ1 > λ2 > . . . > λn.
The metrics with the largest variance caused by a changing faulty condition are identified by
checking their directions. I utilize this property to combine metrics for efficient anomaly detection.
An iterative algorithm is employed to search for the new combined metrics. The first n′ eigenvalues
that satisfy the following requirement are chosen.
(10)    \frac{\sum_{i=1}^{n'} \lambda_i}{\sum_{i=1}^{n} \lambda_i} \ge \tau,
where τ is a threshold and τ ∈ (0, 1). The corresponding n′ eigenvectors are the new metrics,
denoted by S ′ = {m′i, i = 1, . . . , n′}. The eigenvectors for metric space transformation are used
to select the most sensitive and relevant metrics. An iterative algorithm is employed to search for
{ej}:
ALGORITHM 2. Metric space combination based metric extraction
MetricExtraction1() {
1: n′ = the number of essential axes or eigenvectors required to estimate;
2: Compute the covariance matrix S;
3: for j = 1 upto n′ do
4:   Initialize randomly eigenvector e_j of size n × 1;
5:   while (1 − |e_j^T e_j|) > ε do
6:     e_j = S e_j;
7:     e_j = e_j − \sum_{k=1}^{j−1} (e_j^T e_k) e_k;
8:     e_j = e_j / ‖e_j‖;
9:   end while
10: end for
11: return e;
12:}
In Algorithm 2, Steps 7 and 8 apply the Gram-Schmidt orthogonalization process [44]
and then normalization. ε is a small constant, which is used to test the convergence of ej , i.e. if
(1− |eTj ej|) < ε, then ej converges, otherwise ej is updated iteratively.
Algorithm 2 converges quickly. It usually takes only two to five iterations to find an eigen-
vector. The computational complexity of the algorithm is O(n2n′ + n2L), where n is the number
of metrics after metric selection (Section 4.2.1). To determine the value of n′, i.e. the number of essential metrics, a common practice is to first set a threshold for the percentage of total variance to be preserved; n′ is then the smallest number of essential metrics that achieves this threshold.
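A runnable sketch of Algorithm 2 (power iteration with Gram-Schmidt deflation) on illustrative low-rank data: D holds L records of n metrics, and the top n′ eigenvectors of the covariance matrix define the combined metric space. The sizes, seed, and cap on iterations are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(6)
L, n, n_prime = 500, 5, 2
latent = rng.normal(size=(L, 2))               # two underlying factors
mixing = rng.normal(size=(2, n))
D = (latent @ mixing).T                        # n x L records, ~zero-mean

S = D @ D.T                                    # covariance matrix (n x n)
eps = 1e-12

vectors = []
for j in range(n_prime):
    e = rng.normal(size=n)
    e /= np.linalg.norm(e)                     # random unit start (Step 4)
    for _ in range(10000):                     # iteration cap for safety
        e_new = S @ e                          # Step 6: power-iteration step
        for ek in vectors:                     # Step 7: Gram-Schmidt deflation
            e_new -= (e_new @ ek) * ek
        e_new /= np.linalg.norm(e_new)         # Step 8: normalization
        done = 1.0 - abs(e_new @ e) < eps      # Step 5: convergence test
        e = e_new
        if done:
            break
    vectors.append(e)

E = np.stack(vectors)                          # n' x n transformation
combined = E @ D                               # records in the reduced space
```

Each pass converges to the next-largest eigenvector of S; the deflation step keeps successive vectors orthogonal, as in the algorithm's Steps 7 and 8.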
In addition to the metric space combination, I also apply metric extraction approaches
based on metric space separation. They separate desired data from mixed data. They define a set
of new basis vectors for metric space separation. Let us use A to denote the matrix with elements
xj,i, x = [x1, x2, . . . , xL]^T , and e = [e1, e2, . . . , e_{n′}]^T the base vectors. Then, x = Ae. After
estimating the matrix A from x, I calculate its inverse, denoted by W . Hence, the base vectors can
be computed by e = Wx.
Before applying the metric extraction algorithm, the anomaly detector performs some pre-
processing on the cloud performance dataset. The mean vector is subtracted from each data record so that the records have zero mean. A linear transformation is also applied to the dataset, which makes its components uncorrelated and of unit variance. The goal of metric space separation is to find an optimal transformation matrix W so that {ej} are maximally independent. An
iterative algorithm is employed to search for W and hence the new separated metrics.
The metric extraction algorithm that computes the matrix W works as follows.
ALGORITHM 3. Metric space separation based metric extraction
MetricExtraction2() {
1: Initialize randomly the matrix W = [w_1, w_2, . . . , w_{n′}]^T ;
2: while (1 − |w_j^T w_j|) > ε for any j = 1 . . . n′ do
3:   for p = 1 upto n′ do
4:     w_{p+1} = w_{p+1} − \sum_{j=1}^{p} w_{p+1}^T w_j w_j;
5:     w_{p+1} = w_{p+1} / (w_{p+1}^T w_{p+1})^{1/2};
6:   end for
7: end while
8: return W ;
8: return W ;
9: }
In Algorithm 3, ε is a small constant, which is used to test the convergence of W , i.e. if
(1 − |wTj wj|) < ε for all j = 1 . . . n′, then W converges, otherwise W is updated iteratively. The
algorithm converges fast.
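The preprocessing and the deflation/normalization steps of Algorithm 3 can be sketched as below on illustrative mixed data. Note that a single Gram-Schmidt pass already yields orthonormal rows, which is consistent with the fast convergence noted above; the data shapes and names are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(400, 3)) @ rng.normal(size=(3, 3))  # mixed records

# Preprocessing: zero-mean records, then a linear whitening transform so the
# components are uncorrelated with unit variance.
Xc = X - X.mean(axis=0)
cov = (Xc.T @ Xc) / len(Xc)
vals, vecs = np.linalg.eigh(cov)
white = Xc @ vecs @ np.diag(vals ** -0.5) @ vecs.T       # whitened data

# Orthonormalization loop (Algorithm 3's deflation + normalization steps):
# each row of W is deflated against earlier rows, then normalized.
n = 3
W = rng.normal(size=(n, n))
for p in range(n):
    for j in range(p):
        W[p] -= (W[p] @ W[j]) * W[j]                     # Step 4: deflation
    W[p] /= np.sqrt(W[p] @ W[p])                         # Step 5: normalize

E = white @ W.T                                          # e = W x per record
```

Because the data are whitened and W is orthonormal, the separated components E remain uncorrelated with unit variance, which is the starting point for the independence-maximizing search.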
4.3. Performance Evaluation
As a proof of concept, I implement a prototype of the proposed metric selection framework
and evaluate its performance based on the collected performance metrics data from our cloud
TABLE 4.1. Normalized mutual information values for 12 metrics of CPU and
memory related statistics.
Metrics proc/s cswch/s intr/s pswpin/s pswpout/s pgpgin/s pgpgout/s fault/s majflt/s pgfree/s pgscank/s %vmeff
proc/s 1 0.173432 0.146242 0.026211 0.010373 0.173137 0.166139 0.774422 0.113095 0.59518 0.006642 0.010004
cswch/s 0.173432 1 1.19829 0.027594 0.037947 0.077614 0.22167 0.240895 0.069647 0.267043 0.024364 0.066146
intr/s 0.146242 1.19829 1 0.106931 0.202099 0.197885 0.432027 0.210116 0.203809 0.259266 0.222368 0.321142
pswpin/s 0.026211 0.027594 0.106931 1 0.191225 0.646559 0.104967 0.033024 0.907643 0.064447 0.123519 0.158348
pswpout/s 0.010373 0.037947 0.202099 0.191225 1 0.155811 0.179029 0.030781 0.208582 0.105926 0.22015 0.314369
pgpgin/s 0.173137 0.077614 0.197885 0.646559 0.155811 1 0.208409 0.25285 1.161042 0.210407 0.210457 0.226447
pgpgout/s 0.166139 0.22167 0.432027 0.104967 0.179029 0.208409 1 0.298546 0.196957 0.720304 0.217444 0.296459
fault/s 0.774422 0.240895 0.210116 0.033024 0.030781 0.25285 0.298546 1 0.151257 0.794756 0.049625 0.060143
majflt/s 0.113095 0.069647 0.203809 0.907643 0.208582 1.161042 0.196957 0.151257 1 0.196055 0.243348 0.295593
pgfree/s 0.59518 0.267043 0.259266 0.064447 0.105926 0.210407 0.720304 0.794756 0.196055 1 0.116328 0.181414
pgscank/s 0.006642 0.024364 0.222368 0.123519 0.22015 0.210457 0.217444 0.049625 0.243348 0.116328 1 0.351326
%vmeff 0.010004 0.066146 0.321142 0.158348 0.314369 0.226447 0.296459 0.060143 0.295593 0.181414 0.351326 1
testbed. The experiment settings are discussed in Section 3.4. I present the experimental results in
this section.
4.3.1. Experimental Results of Metric Selection and Extraction
I explore mutual information (MI) to quantify the relevance and redundancy of pair-wise
metrics. For N metrics, we only need to compute \binom{N}{2} mutual information values. After removing the zero-variance metrics, I have N = 406. In total, I compute \binom{406}{2} = 82,215 MI values. To present
the results in a limited space, I show a portion of the entire 406×406 MI matrix in Table 4.1.
This matrix includes the pair-wise MI values of 12 metrics, which are related to the CPU and memory usage statistics. They are the CPU related metrics: proc/s, cswch/s and intr/s, and the memory related metrics: pswpin/s, pswpout/s, pgpgin/s, pgpgout/s, fault/s, majflt/s, pgfree/s, pgscank/s and %vmeff. From the table, I can see the matrix is symmetric, with all diagonal elements equal to one, since each metric shares complete information with itself. A smaller MI value indicates that the corresponding pair of metrics shares less information.
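As a concrete illustration of how such a table can be produced, the sketch below estimates a normalized MI value for a pair of metric series. The equal-width binning and the normalization by the square root of H(X)·H(Y) are my assumptions for illustration, not necessarily the estimator used in this work:

```python
import math
from collections import Counter

def normalized_mi(x, y, bins=10):
    """Normalized mutual information between two metric series.

    Illustrative choices: equal-width binning and normalization by
    sqrt(H(X) * H(Y)), so a metric paired with itself yields 1.
    """
    def discretize(v):
        lo, hi = min(v), max(v)
        width = (hi - lo) / bins or 1.0
        return [min(int((s - lo) / width), bins - 1) for s in v]

    xd, yd = discretize(x), discretize(y)
    n = len(xd)
    px, py, pxy = Counter(xd), Counter(yd), Counter(zip(xd, yd))

    def entropy(counts):
        return -sum(c / n * math.log(c / n) for c in counts.values())

    mi = sum(c / n * math.log((c / n) / (px[a] / n * py[b] / n))
             for (a, b), c in pxy.items())
    hx, hy = entropy(px), entropy(py)
    return mi / math.sqrt(hx * hy) if hx > 0 and hy > 0 else 1.0

x = [0.1, 0.5, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4]
print(round(normalized_mi(x, x), 3))  # diagonal entry: 1.0
```

With this normalization the diagonal entries of the matrix come out as one, matching Table 4.1.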
Then I compute the redundancy and relevance among the metrics, and thus their depen-
dency according to Equation 9. For ease of comparison and visualization, I calculate the inverse
FIGURE 4.1. Quantified redundancy and relevance among metrics based on their mutual information values.
FIGURE 4.2. Results from metric extraction (Algorithm 2) and metric selection (Algorithm 1).
FIGURE 4.3. Results from metric extraction (Algorithm 2) only.
of the dependency, i.e. (redundancy - relevance), and search for the metrics with minimal values.
Figure 4.1 shows the normalized (redundancy - relevance) values of the 406 non-constant metrics.
I set the threshold λd = 0.15 (under 95% confidence level for sensitivity) and select metrics whose
normalized (redundancy - relevance) values are no larger than λd. In total, 14 metrics satisfy this condition. They are cswch/s, pswpout/s, pgscank/s, %vmeff, %system, intr/s, %iowait, ITLB-loads, kbmemused, kbbuffers, wr_sec/s, dev253-1, rxpck/s, and totsck. The dimension of the metric space is reduced by 96.6%. The efficiency of processing the metric space is improved correspondingly.
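The selection step itself can be sketched as follows, assuming the pairwise MI matrix and a per-metric relevance vector (MI with the failure indicator) have already been computed; the metric names, the mean-based redundancy, and the min-max normalization are illustrative simplifications of the approach, not the exact formulation of Equation 9:

```python
def select_metrics(mi, relevance, names, lam=0.15):
    """Keep metrics whose normalized (redundancy - relevance) score
    is no larger than the threshold lam (cf. lambda_d = 0.15).

    mi:        precomputed pairwise MI matrix (list of lists)
    relevance: assumed MI of each metric with the failure indicator
    names:     metric names, parallel to the rows of mi
    """
    n = len(names)
    scores = []
    for i in range(n):
        # Redundancy of metric i: mean MI with every other metric.
        redundancy = sum(mi[i][j] for j in range(n) if j != i) / (n - 1)
        scores.append(redundancy - relevance[i])
    # Min-max normalize so the threshold lam is applied on [0, 1].
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0
    return [name for name, s in zip(names, scores)
            if (s - lo) / span <= lam]

# Toy run: 'b' is relevant and non-redundant, so only it survives.
mi = [[1.0, 0.9, 0.9], [0.9, 1.0, 0.1], [0.9, 0.1, 1.0]]
print(select_metrics(mi, [0.1, 0.8, 0.2], ['a', 'b', 'c']))  # ['b']
```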
I then apply the metric extraction algorithms to reduce the metric dimension further. Al-
gorithm 2 in Section 4.2.2 transforms the metric space by combining metrics to find the essential
ones. The new metric space can present the original dataset in a more concise way. Figure 4.2
shows the results after performing the metric extraction on the 14 metrics selected in the preceding
step. From the figure, I can see that the first three essential metrics can capture most (i.e. 81.3%) of
the variance from the original dataset. Therefore, the dimension of the metric space is further reduced to three. In addition, I evaluate the performance of metric extraction without applying metric selection first. Figure 4.3 depicts the results. From the figure, it is clear that the essential metrics capture less variance than those in Figure 4.2; only 50.1% of the variance is captured by the first three metrics. I also test the metric extraction algorithm (Algorithm 3), which exploits metric separation. Figure 4.4 shows the distribution of the samples of the essential components that are selected and extracted by Algorithms 1, 2 and 3. The normal states are marked in blue and anomalous states in red. As shown in Figures 4.4(a) and 4.4(c), when Algorithm 1 is applied, most of the normal samples are aggregated together, while the samples scattered far from the center of the majority, which represents the normal states, are mostly anomalous states. With the metric selection and metric extraction algorithms, both unsupervised clustering techniques and supervised classification algorithms can be used for further anomaly identification and diagnosis.
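As a minimal example of the unsupervised route, the sketch below flags samples in the reduced essential-component space that lie far from the centroid of the majority; the mean-plus-k-sigma distance threshold is a hypothetical stand-in for a real clustering technique:

```python
import math

def flag_anomalies(samples, k=1.5):
    """Flag samples far from the centroid of the majority.

    samples: points in the reduced (essential-component) space.
    A sample is flagged when its distance to the centroid exceeds
    mean + k * std of all distances -- a hypothetical stand-in for
    an actual unsupervised clustering step.
    """
    dim = len(samples[0])
    centroid = [sum(p[d] for p in samples) / len(samples) for d in range(dim)]
    dists = [math.dist(p, centroid) for p in samples]
    mu = sum(dists) / len(dists)
    sigma = math.sqrt(sum((d - mu) ** 2 for d in dists) / len(dists))
    return [d > mu + k * sigma for d in dists]

# Four clustered (normal) samples and one far-away (anomalous) sample.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(flag_anomalies(pts))  # [False, False, False, False, True]
```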
4.4. Summary
Large-scale and complex utility cloud computing systems are susceptible to software and
hardware failures and administrators’ mistakes, which significantly affect the cloud performance
and management. Anomaly detection and proactive failure management provide a vehicle for
self-managing cloud resources and enhancing system dependability. In this work, I focus on the
FIGURE 4.4. Distribution of normal (blue marker) and abnormal (red marker) cloud system states represented by the top three essential components: (a) with metric selection (Algorithm 1) and extraction (Algorithm 2); (b) with metric extraction (Algorithm 2) only; (c) with metric selection (Algorithm 1) and extraction (Algorithm 3); (d) with metric extraction (Algorithm 3) only.
metric selection issue for efficient and accurate anomaly detection in utility clouds. We present
a metric selection framework with metric selection and extraction mechanisms. The mutual in-
formation based approach selects metrics that maximize the mutual relevance and minimize their
redundancy. Then the essential metrics are further extracted by means of combining or separating
the selected metric space. The reduced dimensionality of metric space significantly improves the
computational efficiency of anomaly detection. We evaluate the performance of the metric selection framework and two illustrative anomaly detection approaches. The selected and extracted metric set contributes to highly efficient and accurate anomaly detection.
The proposed metric selection framework is an open framework. Many metric selection
and extraction algorithms can be explored to implement the framework. I study the mutual infor-
mation based metric selection approach, because of its competence in capturing the relevance and
redundancy among metrics.
CHAPTER 5
EFFICIENT AND ACCURATE CLOUD ANOMALY DETECTION
5.1. Introduction
With the ever-growing complexity and dynamicity of cloud systems, autonomic failure
management is an effective approach to enhance cloud dependability [6]. Anomaly detection is a
key technique [15]. It identifies activities that do not conform to expected, normal cloud behavior.
The importance of anomaly detection is due to the fact that anomalies in the cloud performance
data may translate to significant and critical component or system failures which disrupt cloud ser-
vices and waste useful computation. Anomaly detectors provide valuable information for resource
allocation, virtual machine reconfiguration and cloud maintenance [84].
Efficient and accurate anomaly detection in cloud systems is challenging due to the dy-
namics of runtime cloud states, heterogeneity of configuration, nonlinearity of failure occurrences,
and overwhelming volume of performance data in production environments. Recent work has de-
veloped various technologies to tackle the problems of anomaly detection. They, however, detect
anomalies either in the time domain [74, 30] or in the frequency domain [8]. Little work considers both domains to identify system anomalies.
In this chapter, we present a wavelet-based anomaly detection approach based on the analysis of the profiled cloud runtime performance metrics in the time and frequency domains. Section 5.2 presents the key components of the proposed wavelet-based multi-scale cloud anomaly detection mechanism that explores both the time and frequency domains. Experimental results are shown and discussed in Section 5.3. Section 5.4 summarizes this chapter.
5.2. Cloud Anomaly Detection Mechanisms
Production clouds usually consist of large numbers of commodity computers which expe-
rience frequent software updates, hardware upgrades, system maintenance, repairs and reboots.
As a result, component failures are the norm rather than the exception [33]. Moreover, the mean time
between failures (MTBF) of a cloud system changes as the system evolves. Considering the high
dynamicity and changing failure rates, we propose a wavelet-based multi-scale model to charac-
terize the failure dynamics in both the time and frequency domains and to identify anomalous
behaviors in cloud computing systems. We also exploit multi-layer neural networks to make the
anomaly detectors adaptive to cloud dynamics and derive mother wavelets.
Wavelet analysis [11] is suitable for failure characterization and anomaly detection in cloud
systems because it can transform failure event signals from the time domain to the time-frequency
domain and is capable of analyzing failure dynamics with regard to both time and frequency prop-
erties. Wavelet transform provides a satisfactory time resolution of critical events at high frequen-
cies and a satisfactory frequency resolution at low frequencies [20]. It can highlight subtle changes
in the morphology of cloud event signals, which enables wavelet transform to detect transient,
aperiodic, and other non-stationary features in cloud performance and health data.
5.2.1. Wavelet-Based Multi-Scale Anomaly Detection Mechanism
Let m be the number of performance metrics monitored and profiled from cloud servers.
We construct a cloud health related matrix H with the profiled performance metrics, which is rep-
resented by H(t) = [h1(t), h2(t), ..., hm(t)], where hi(t) is a vector of values of the ith metric
profiled on a cloud server at time t. Our wavelet-based multi-scale anomaly detection mechanism
decomposes the matrix H into a hierarchical structure of details (D), which characterize anoma-
lous behaviors within a given period, and approximations (A), which include the normal workloads
running in the cloud. At each level, the approximation is decomposed as follows.
(11) A_i(t) = c_a A_{i+1}(t) + c_d D_{i+1}(t),
where ca and cd are the decomposition coefficients. Then, the cloud performance matrix H is
rewritten as
(12) H(t) = Σ_{i=1..k} c_d D_i(t) + c_a A_k(t),
where k specifies the number of levels by which the decomposition will be performed. As an
illustration of the preceding process, Figure 5.1 shows a three-level hierarchical structure of a
performance metric, %memused, profiled on our cloud testbed. The details at different levels
provide more information from different aspects of the dynamics of the cloud performance metric,
FIGURE 5.1. Three-level details "Di" and approximations "Ai" of performance metric %memused profiled on our cloud testbed. Performance metric %memused is divided into high frequency components (details) and low frequency components (approximations). The approximation is further decomposed into new details and approximations at each level.
which helps us identify anomalous behaviors. To determine the details (Di) and approximations
(Ai) of the profiled cloud performance metrics at level i, we employ the following wavelet function
(13) ψ^i_{s,τ}(t) = 2^{−s/2} ψ^i(2^{−s} t − τ),
where s and τ are two parameters representing the scale and translation, respectively. We
convert the s scale to a characteristic frequency of the wavelet, such as the spectral peak frequency
that is associated with anomalous states, or the passband centre that describes patterns in workload
fluctuation. The spectral frequency is inversely proportional to the scale coefficient as f = fc/s,
where fc is the characteristic frequency of the mother wavelet, i.e., the archetype wavelet at scale
s = 1 and translation τ = 0.
Then, the ith-level details and approximations of the cloud performance metrics are com-
puted by
(14) D_i(t) = ψ_{2s}(t) = Σ_{τ=0..+∞} g(t) ψ_i(2t − τ),
     A_i(t) = ψ_{2s+1}(t) = Σ_{τ=0..+∞} l(t) ψ_i(2t − τ),
where g(t) and l(t) denote the coefficients of the high-pass and low-pass filters, respectively, so the details capture the high-frequency behavior and the approximations the low-frequency behavior.
Thus, each profiled cloud performance metric hi is decomposed by the following.
(15) h_i(t) = Σ_{j=1..2s} h^j_{s,i}(t),
     h^j_{s,i}(t) = Σ_{τ=0..+∞} c^j_{s,τ,i} ψ^j_{s,τ,i}(t),
where the wavelet function in Equation (13) is applied to the cloud performance metric hi and cjs,τ,i
is the wavelet coefficient, which is computed by
(16) c^j_{s,τ,i} = ∫_0^{+∞} h_i(t) ψ^j_{s,τ,i}(t) dt.
The wavelet coefficient c^j_{s,τ,i} represents the similarity between the profiled cloud performance metric and the characteristic mother wavelet in both the time and frequency domains. We explore it for cloud anomaly detection.
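The hierarchical decomposition of Figure 5.1 can be sketched with the Haar filter pair; the function names and the plain pairwise-average/difference formulation are illustrative simplifications:

```python
def haar_step(signal):
    """One Haar DWT level: pairwise scaled sums give the approximation
    (low-pass) and pairwise scaled differences give the detail
    (high-pass); each output is half the input length."""
    s = 2 ** -0.5
    approx = [s * (signal[i] + signal[i + 1]) for i in range(0, len(signal), 2)]
    detail = [s * (signal[i] - signal[i + 1]) for i in range(0, len(signal), 2)]
    return approx, detail

def decompose(signal, levels):
    """Hierarchical decomposition as in Figure 5.1: the approximation
    is decomposed again at each level; details from all levels are kept."""
    details, approx = [], list(signal)
    for _ in range(levels):
        approx, d = haar_step(approx)
        details.append(d)
    return approx, details

# A spike in an otherwise flat metric shows up only in the details.
approx, details = decompose([1, 1, 1, 1, 1, 1, 9, 1], levels=3)
print(round(details[0][3], 3))  # 5.657: the spike in the level-1 detail
```

A flat signal yields all-zero details, which is why anomalous transients stand out in the detail coefficients.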
5.2.2. Sliding-Window Cloud Anomaly Detection
To achieve online anomaly identification in the cloud, we design a sliding-window approach that addresses the cloud dynamicity and improves detection accuracy. In this work, a sliding detection window of size N_sWin and the mother wavelet with window size N_mother are applied to compute the wavelet coefficient c^j_{s,τ} at multiple scale levels. Figure 5.2 presents an illustration of sliding windows with a 16-measurement mother wavelet.
FIGURE 5.2. A sliding detection window (NsWin = 80 measurements of a cloud
performance metric) for a mother wavelet with Nmother = 16 measurements and
the scale coefficient s = 5. A failure is illustrated with the spike.
The size of the sliding detection window, N_sWin, is determined by the product of the mother wavelet's window size, N_mother, and the value of the scale parameter, s, as shown below.
(17) N_sWin = s · N_mother.
To identify anomalies based on the wavelet coefficient c^j_{s,τ}, we employ the threshold λ = k_1 μ_normal + k_2 σ, a weighted sum of the mean of the normal states in the mother wavelet and the standard deviation of the measurements in the mother wavelet. We define the coefficient α as the proportion of anomalous states in the mother wavelet. The mean of normal states and the standard deviation of anomalous states are important properties of the mother wavelet. The values of the coefficient α and the weights k_1 and k_2 depend on the failure types and the length of the mother wavelet. We discuss the selection of their values in Section 5.3.
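The sliding-window detection loop can be sketched as follows. Two simplifying assumptions: the coefficient is taken as the normalized dot product of each segment with the mother wavelet, and μ and σ are estimated from the window's own coefficients rather than from the mother wavelet's state statistics:

```python
import math

def detect(window, mother, k1=1.0, k2=2.0):
    """Slide the mother wavelet across the detection window and flag
    positions whose coefficient crosses lambda = k1 * mu + k2 * sigma.

    Simplification: each coefficient is the normalized dot product of
    a window segment with the mother wavelet.
    """
    n = len(mother)
    m_norm = math.sqrt(sum(v * v for v in mother))
    coeffs = []
    for t in range(len(window) - n + 1):
        seg = window[t:t + n]
        s_norm = math.sqrt(sum(v * v for v in seg)) or 1.0
        coeffs.append(sum(a * b for a, b in zip(seg, mother)) / (s_norm * m_norm))
    mu = sum(coeffs) / len(coeffs)
    sigma = math.sqrt(sum((c - mu) ** 2 for c in coeffs) / len(coeffs))
    lam = k1 * mu + k2 * sigma
    return [t for t, c in enumerate(coeffs) if c > lam]

# A failure-shaped burst injected at t = 10 is the only flagged position.
window = [0.0] * 20
window[10:14] = [1.0, 1.0, -1.0, -1.0]
print(detect(window, [1.0, 1.0, -1.0, -1.0]))  # [10]
```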
5.2.3. Mother Wavelet Selection and Adaptation
We propose a learning-based approach to select and adapt mother wavelets to characterize
the major properties of failures in the cloud. It is composed of the following two phases.
Constructing the Initial Mother Wavelet.
There are many choices to construct the initial mother wavelet. We adopt the Haar wavelet [102]
as it is widely used and it is simple and effective to represent the two (normal and anomalous) states.
By applying the Haar wavelet, the function of a mother wavelet can be represented as follows.
(18) ψ(k) = 1 for 0 ≤ k ≤ N_mother/a, and ψ(k) = −1 for N_mother/a ≤ k ≤ N_mother,
where a is a parameter that is used to separate the two states. Its value can be set according to the measured failure rate. An example of a Haar mother wavelet is shown in Figure 5.3.
FIGURE 5.3. An example of Haar mother wavelet.
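Constructing the initial mother wavelet of Equation (18) is a one-liner; resolving the overlapping interval boundaries in (18) with a strict cutoff is an implementation choice of this sketch:

```python
def haar_mother(n_mother, a):
    """Initial mother wavelet per Equation (18): +1 over the first
    n_mother // a samples and -1 over the rest. The strict boundary
    (k < n_mother // a) is an implementation choice, since the two
    closed intervals in (18) overlap at the boundary point."""
    split = n_mother // a
    return [1 if k < split else -1 for k in range(n_mother)]

# a = 2 recovers the classic half-and-half Haar shape.
print(haar_mother(16, 2))  # eight +1 samples followed by eight -1 samples
```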
To adapt the mother wavelets at runtime, we exploit statistical learning technologies to up-
date mother wavelets based on observed failure events in a sliding detection window. In this work,
we use neural networks [109] because of their capability of combining the existing mother wavelets with new measurements of cloud performance metrics in deriving the new mother wavelets. A
neural network consists of multiple levels of processing elements, called neurons, each of which
performs the following transfer function.
(19) y_i(k) = f_i( Σ_{j=1..n} ω_{ij} x_j(k) − θ_i ),
where x_j is the jth input to the neuron, n is the size of a mother wavelet, ω_{ij} is the connection weight coefficient between neurons i and j, and θ_i is the bias of the neuron. As for the function f_i, we employ
the nonlinear Gaussian function.
Adaptation of Mother Wavelets.
Our cloud anomaly identification mechanism is adaptive to the dynamicity of cloud systems
by updating the neural network with verified anomalies and undetected but reported failures to
derive new mother wavelets.
The neural network adapter has n (where n = N_mother) input neurons to receive the measurements of ψ(k), k = 1, 2, ..., n, three hidden layers, and n output neurons ψ′(k), k = 1, 2, ..., n.
The Gaussian transfer function is applied to each layer with the Resilient Back Propagation Train-
ing algorithm [79], which adds the last n measurements prior to the occurrence of failures into the
training set to update the neural network. The learning process exploits a gradient descent method
in order to minimize the mean square values of errors between the output and input. The error is
defined as
(20) E = (1/2) Σ_{k=1..n} (ψ_D(k) − ψ′(k))²,
(21) Δw_{ij}(t) = −ξ ∂E/∂w_{ij},
where ψ_D(k) is the desired output vector of wavelet coefficients, ξ is the learning factor, and w_{ij} is the weight from neuron i to neuron j. The weight matrix is repeatedly updated until the error converges.
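A toy version of the update in Equations (20) and (21) for a single Gaussian neuron (not the three-hidden-layer RProp network used here) can make the mechanics concrete; the gradient is taken numerically to keep the sketch short:

```python
import math

def forward(x, w, theta):
    """One neuron, Equation (19), with the Gaussian transfer
    f(z) = exp(-z^2); the exact form of f is an assumption here."""
    z = sum(wj * xj for wj, xj in zip(w, x)) - theta
    return math.exp(-z * z)

def grad_step(x, w, theta, target, xi=0.1, eps=1e-6):
    """One update per Equation (21), w <- w - xi * dE/dw, with the
    error E = 0.5 * (target - y)^2 as in Equation (20)."""
    def err(weights):
        return 0.5 * (target - forward(x, weights, theta)) ** 2
    new_w = []
    for j in range(len(w)):
        bumped = list(w)
        bumped[j] += eps
        # Forward-difference estimate of dE/dw_j.
        new_w.append(w[j] - xi * (err(bumped) - err(w)) / eps)
    return new_w
```

Iterating `grad_step` drives E toward a local minimum, mirroring the convergence loop described above.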
5.3. Performance Evaluation
5.3.1. Cloud Testbed and Performance Metrics
The experiment settings are discussed in Section 3.4. We apply the metric space combination-
based metric extraction algorithm proposed in Chapter 4 to pre-process the performance data col-
lected from our cloud testbed. By maintaining the most variance of the performance data from
the original metric space, cloud performance metrics are extracted and then explored by wavelet
analysis for cloud anomaly identification. We present the experimental results in this section.
5.3.2. Mother Wavelets
The mother wavelet maintains the essential information of abnormal behaviors of cloud
servers. Our mechanism keeps updating the mother wavelet when anomalies are verified by cloud
operators or new undetected failures are reported to the anomaly detector. To adapt the mother
wavelet, measurements of cloud performance metrics prior to the time of a failure are collected
FIGURE 5.4. Mother wavelets derived by employing measurement windows of different sizes: (a) 8, (b) 16, (c) 24, and (d) 32 measurements. As the window size increases, the peak at the tail is sharpened while other peaks are smoothed. From the perspective of frequency, more failure-related signals in both the low-frequency and higher-frequency bands are included for large measurement windows.
and the neural network derives a new mother wavelet based on the existing one and metric mea-
surements for the recent failure event.
We use the cloud performance data of the first month to train the neural network and derive
the mother wavelet. The derived mother wavelets are shown in Figure 5.4 by using four different
measurement window sizes. As the window size increases, the peak at the tail is sharpened while
other peaks are smoothed. From the perspective of frequency, more failure-related signals in both
the low-frequency and higher-frequency bands are included for large measurement windows.
The coefficient α is defined as the proportion of anomalous states in the mother wavelet.
In Figure 5.4, for the mother wavelet using eight measurements, the last two samples are validated
as failure states and thus α = 0.25. If the mother wavelet is derived from more measurements, the
value of α gets smaller. In the experiments, we also observe that for most software-related failures, α is larger than for hardware-related failures, because software-related failures are more likely to display deteriorating trends before the failures while hardware faults usually cause abrupt changes.
The selection of the size of the mother wavelet is system dependent. We need to consider
the trade-off between the computation overhead and the anomaly detection accuracy. If the size
is too small, anomalous states cannot be captured and represented sufficiently, which causes more
false positives. On the other hand, a large mother wavelet may involve multiple anomalous states
of different types and more noise is included as well, which affect the accuracy of the anomaly
detection. Our cloud anomaly detector can try different sizes of measurements when generating
the mother wavelet and learn from the performance of anomaly detection to adaptively choose
the best size. In our experiments, the optimal size of the mother wavelet found by our anomaly
detector is 16 measurements, as shown in Figure 5.5. This size yields a good detection precision
and a small number of false alarms.
5.3.3. Performance of Anomaly Identification
Figure 5.6 shows the contours of wavelet coefficients c_{s,τ} for the first 200 time bins as the scale parameter s changes from 1 to 16. The sizes of mother wavelets are set to 8, 12, 16, and 24 cloud performance measurements. Memory-related faults are injected. We use a color bar to denote the normalized value in each figure. A small mother wavelet (i.e., 8 measurements and 12 measurements in Figures 5.6(a) and 5.6(b)) brings noise to the wavelet coefficients. In Figures 5.6(c) and 5.6(d), if we set the threshold λ = 0.8, one anomaly is detected, i.e., at time bin = 100. If we used this threshold in the cases of Figures 5.6(a) and 5.6(b), more false positive anomalies would be identified. In Figures 5.6(c) and 5.6(d), the mother wavelet derived from 16 measurements is more effective
FIGURE 5.5. The numbers of truly identified anomalies vs. the numbers of validated false positives with mother wavelets of different sizes. Small windows result in low detection accuracy, while big windows bring in more false positives.
than the one using 24 measurements. This is because in Figure 5.6(d), as the scale parameter s increases from 1 to 16, the wavelet coefficients increase steadily, which indicates that a higher scale is needed for high detection accuracy. Choosing a larger scale also requires a larger sliding detection window, as shown in Equation (17), which causes higher computational complexity while the wavelet coefficients are similar. As a result, we choose the value of the scale parameter s = 10.
We employ ROC curves to present and compare the performance of anomaly detection by
several algorithms. We compute the true positive rate (TPR) and the false positive rate (FPR) of the
detection results by those algorithms. A larger area under the ROC curve implies higher detection
sensitivity and specificity.
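The TPR/FPR computation behind such ROC curves can be sketched as follows; the labels and scores in the example are hypothetical, not the dissertation's dataset:

```python
def roc_point(labels, scores, threshold):
    """TPR and FPR at one threshold: a sample is flagged anomalous
    when its detector score exceeds the threshold."""
    tp = sum(1 for l, s in zip(labels, scores) if l and s > threshold)
    fp = sum(1 for l, s in zip(labels, scores) if not l and s > threshold)
    pos = sum(labels)
    neg = len(labels) - pos
    return tp / pos, fp / neg

def roc_curve(labels, scores):
    """Sweep thresholds over the observed scores to trace the curve;
    a larger area under it means better sensitivity and specificity."""
    return [roc_point(labels, scores, t) for t in sorted(set(scores), reverse=True)]

labels = [1, 1, 0, 0]           # 1 = true failure record
scores = [0.9, 0.8, 0.7, 0.1]   # hypothetical detector outputs
print(roc_point(labels, scores, 0.85))  # (0.5, 0.0)
```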
We compare the performance of our wavelet-based multi-scale anomaly identification mech-
anism with four widely used anomaly detection approaches, which use the decision tree, radial
basis function (RBF) network, Bayesian network, and support vector machine (SVM). We train
the models using the same training set and compare their detection results. The ROC curves are
shown in Figure 5.7. In the test dataset, there are 18 failure records in 2217 measurements. Our
(a) Wavelet coefficients with 8 measurements. (b) Wavelet coefficients with 12 measurements.
(c) Wavelet coefficients with 16 measurements. (d) Wavelet coefficients with 24 measurements.
FIGURE 5.6. Wavelet coefficients for mother wavelets with different N_mother (1 ≤ s ≤ 16, 0 ≤ τ ≤ 200). A memory-related fault is injected at the 100th time point. The states of the system are learned from the wavelet coefficients based on the anomaly mother wavelet at different scales. A smaller mother wavelet (i.e., 8 measurements or 12 measurements) brings more noise to the wavelet coefficients, while a bigger mother wavelet (i.e., 24 measurements) requires a larger scale to achieve a high detection accuracy.
wavelet-based approach reaches a 93.3% TPR with a 6.1% FPR. Among the other detection algorithms, the Bayesian network achieves an 81.5% TPR at about 10% FPR. The decision tree, RBF network, and SVM perform worse, with TPRs below 75%. In our current cloud performance
and failure datasets, a small number of faults are injected. Our wavelet-based anomaly detection
mechanism can detect almost all of them. We plan to find datasets from production cloud systems
FIGURE 5.7. Performance comparison of our wavelet-based cloud anomaly detection mechanism with four other detection algorithms. Our approach achieves the best TPR with the least FPR; it identifies anomalies more accurately than the other methods.
and test our mechanism on them.
5.4. Summary
Large-scale and complex cloud computing systems are susceptible to software and hard-
ware failures and administrator mistakes, which significantly affect the cloud performance and
management. Anomaly identification and proactive failure management provides a vehicle for
self-managing cloud resources and enhancing system dependability.
In this chapter, we employ a wavelet-based multi-scale cloud anomaly identification mechanism with learning-aided mother wavelet selection and sliding detection windows for adaptive failure detection. Unlike other anomaly identification approaches, it does not require prior knowledge of failure distributions, it can self-adapt by learning from observed failures at runtime, and it analyzes both the time and frequency domains to identify anomalous cloud behaviors. We
test a prototype implementation of our cloud anomaly identification mechanism on a cloud com-
puting system. Experimental results show our approach can identify cloud failures with the highest
accuracy among several widely used approaches.
The proposed anomaly identification mechanism in this research can also aid failure predic-
tion. Complementing existing failure prediction methods, results from this research can be utilized
to determine the potential localization of failures by analyzing the runtime cloud performance data.
We also note that even with the most advanced learning algorithms, the prediction accuracy cannot reach 100%. As a remedy, reactive failure management techniques, such as checkpointing and redundant execution, can be exploited to deal with mis-predictions. We will integrate these
two failure management approaches and enhance the cloud dependability further.
CHAPTER 6
EXPLORING METRIC SUBSPACE ANALYSIS FOR ANOMALY IDENTIFICATION AND
DIAGNOSIS
6.1. Introduction
Effective anomaly identification and diagnosis in cloud systems is challenging due to the
dynamics of runtime cloud states, heterogeneity of configuration, nonlinearity of failure occur-
rences, and overwhelming volume of performance data in production environments. Principal
component analysis (PCA) is a well-known dimensionality reduction technique [26]. It performs
linear transformation to map a set of data points to new axes (i.e., principal components). Each
principal component has the property that it points in the direction of maximum variance remaining in the data. PCA has been widely used in feature extraction and network anomaly detection [53], [12],
[55], [89]. However, indiscriminately choosing the first several principal components does not al-
ways yield desirable accuracy of anomaly diagnosis, as different types of failures display different
correlations with cloud performance metrics (An illustrating example is presented in Section 6.2.).
A more subtle analysis of the relation between principal components and failure types is necessary
in order to develop effective anomaly diagnosis mechanisms in cloud systems.
In this chapter, I propose an adaptive mechanism that explores PCA to identify and di-
agnose anomalies in the cloud. Different from existing PCA-based diagnosis approaches, our
proposed mechanism characterizes cloud health dynamics and finds the most relevant principal
components (MRPCs) for each type of possible failures. The selection of MRPCs is motivated
by our observation in experiments that higher order principal components possess strong correla-
tion with failure occurrences, even though they maintain less variance of the cloud performance
data. By exploiting MRPCs and learning techniques, I propose an effective anomaly diagnosis
mechanism in the cloud.
The remainder of this chapter is organized as follows. In Section 6.2, I present an example to illustrate the limitations of traditional principal component based anomaly detection and diagnosis
approaches. In Section 6.3, I describe the proposed most relevant principal component approach
FIGURE 6.1. Examples of memory-related faults injected to a cloud testbed. The memory utilization and CPU utilization time series are plotted.
for anomaly diagnosis. Experimental results are presented in Sections 6.4. Finally, I summarize
this chapter in Section 6.5.
6.2. A Motivating Example
Virtualization plays a key role in cloud computing infrastructures, because it makes it possible to significantly reduce the number of servers in data centers by having each server host multiple
independent virtual machines (VMs) which are managed by a virtual machine monitor (VMM) of-
ten referred to as a hypervisor [90]. By enabling the consolidation of multiple applications on a
small number of physical servers, virtualization promises significant cost savings resulting from
higher resource utilization and lower system management costs.
However, virtualization also complicates the interactions among applications sharing the
physical cloud infrastructure, which is referred to as multi-tenancy. The inability to predict such
interactions and adapt the system accordingly makes it difficult to provide dependability assurance
in terms of availability and responsiveness to failures. Virtualization and multi-tenancy introduce
new sources of failure degrading the dependability of cloud computing systems and making anom-
aly identification more challenging.
As an example, Figure 6.1 depicts several memory-related failures, which are denoted by
red circles, observed on a cloud testbed. The corresponding CPU utilization and memory utiliza-
tion profiled from the testbed are also shown. From the figure, I can see these failures are difficult
to identify if a single performance metric is explored by approaches such as [107]. Moreover, rule-based failure diagnosis techniques are not applicable. High CPU utilization and low memory
FIGURE 6.2. Distribution of data variance retained by the first 50 principal components.
utilization alone cannot confirm a memory fault, especially when the symptom of the anomalous behavior does not deviate significantly from normal states. To improve anomaly detection and diagnosis accuracy, more metrics should be considered. Cloud computing systems are
large-scale and complex. Hundreds and even thousands of performance metrics are profiled peri-
odically [6]. The large metric dimension and the overwhelming volume of cloud performance data
dramatically increase the processing overhead and reduce the anomaly detection accuracy. Feature
extraction techniques have been widely applied to reduce the metric dimensionality. Among these
techniques, principal component analysis (PCA) is a well-known one, which extracts a smaller set
of metrics while retaining as much data variance as possible [26].
Figure 6.2 shows the distribution of variance retained by the first 50 principal components
(PCs) for the memory related failures. In total, 651 performance metrics are profiled on the cloud
testbed. The 1st and 2nd PCs retain 27% and 8% of the data variance, respectively, while the 3rd
and 5th PCs each retain less than 5%. Their time series are plotted in Figure 6.3.
Figures 6.3(a) and 6.3(b) show that the low-order PCs are less significant for identifying failures even
though they retain a large portion of the data variance. Some failure events are overwhelmed by
normal data records with high resource utilization and some are indistinguishable from noises.
In contrast, the higher-order PCs, such as the 3rd and 5th PCs in Figures 6.3(c) and 6.3(d), manifest
stronger correlation with the failure events. I call these PCs the most relevant principal components
(MRPCs) for a failure type. They are usually higher order PCs, which are not explored by existing
anomaly detection techniques, yet they provide important insights into failure occurrences.

FIGURE 6.3. Time series of principal components and their correlation with the
memory-related faults: (a) the 1st, (b) the 2nd, (c) the 3rd, and (d) the 5th principal component.

In the following sections, we discuss how to find the MRPCs and exploit them to identify anomalies
in cloud computing systems.
6.3. MRPC-Based Adaptive Cloud Anomaly Identification
To find the most relevant principal components (MRPCs) and identify cloud anomalies,
I explore the runtime performance data profiled from multiple layers of cloud servers. I install
profiling instruments in hypervisors and operating systems in virtual machines to measure perfor-
mance metrics from the hardware performance counters, resource utilization and critical events
from the hypervisors and virtual machines (VMs), while user applications are running in VMs. To
capture the dynamics of cloud systems, I employ a rolling window to analyze the profiled cloud
performance data and dynamic normalization to pre-process the data. Then, MRPCs will be se-
lected from cloud performance principal components by examining their correlation with failure
events. In this section, I present the three key parts of our proposed MRPC-based adaptive cloud
anomaly identification mechanism, i.e., dynamic normalization, MRPC selection, and adaptive
anomaly identification.
6.3.1. Dynamic Normalization
The cloud performance metrics are profiled from various components at runtime. The
collected data include performance and runtime states of hardware devices, hypervisors, virtual
machines and applications. The unit and scale vary among these metrics. To explore them in cloud
anomaly detection, normalization is necessary.
In our work, I employ a dynamic normalization approach with a rolling window to pre-
process the cloud performance data and scale the metrics to [0, 1]. Let X(t) = (m1(t), m2(t), ..., mM(t))
represent a vector of M cloud performance metrics sampled at time t. For X(t) within a rolling
window of size N, the normalization function is defined as follows:
(22) XN(t) = (X(t) − minN(t)) / (maxN(t) − minN(t)), if maxN(t) ≠ minN(t); XN(t) = 0, if maxN(t) = minN(t)
where maxN(t) and minN(t) denote the synthetic maximum and minimum values of a cloud
performance metric in a time window ending at time t. They are updated iteratively based on
maxN(t − 1), minN(t − 1), and the new measurement of the metric, following the equations
below.
(23) maxN(t + 1) = maxN(t) · λ · e^(X(t+1) − maxN(t))

(24) minN(t + 1) = minN(t) · λ · e^(X(t+1) − minN(t))
where the coefficient λ represents the adjustment ratio. It controls the ascending or descending
rate. For a new measurement of the metric as the time window moves forward, if the measurement
is smaller than the current maximum, the maximum value decreases. If the measurement is larger
than the current minimum, the minimum value increases. In this way, the measurements close to the
end of the rolling window have less impact on the normalized value than the measurements at the
head of the time window.
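The rolling normalization above can be sketched as follows. This is a minimal illustration for a single metric stream; the decaying max/min update is a simplification of the exponential update in Equations (23) and (24), and the adjustment ratio lam is an illustrative value, not one from this work.

```python
def dynamic_normalize(stream, lam=0.05):
    """Sketch of dynamic normalization to [0, 1] (cf. Eq. 22).

    The synthetic extrema decay toward each new measurement with
    illustrative ratio `lam`, so older samples lose influence as the
    window rolls forward (a simplification of Eqs. 23-24).
    """
    max_n = min_n = stream[0]
    out = []
    for x in stream:
        # synthetic extrema: pulled toward the newest measurement
        max_n = max(x, max_n - lam * (max_n - x))
        min_n = min(x, min_n + lam * (x - min_n))
        # Eq. 22: scale to [0, 1]; a degenerate window maps to 0
        out.append(0.0 if max_n == min_n else (x - min_n) / (max_n - min_n))
    return out
```

Sweeping lam trades responsiveness to workload shifts against stability of the normalized values.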
6.3.2. MRPC Selection
The most relevant principal components (MRPCs) are the set of PCs which have strong
correlation with occurrences of failures. For each failure type, I select a corresponding set of
MRPCs. As described in Section 6.3.1, I employ the dynamic normalization approach to re-scale
performance metrics in a rolling time window. To adapt to the dynamics in a time window and at
the same time consider the values of cloud performance metrics in previous windows, I propose an
adaptive learning approach that exploits neural networks to compute principal components (PCs)
from normalized values of cloud performance metrics in time windows. The approach proceeds as
follows.
1: Initialize a neural network R^m → R^l with small random synaptic weights wji, where i and j
index the input and output neurons, at time t = 1. Assign a small learning rate ε.
2: For j = 1, 2, ..., l and i = 1, 2, ..., m, calculate
yj(t) = Σ(i=1..m) wji(t) xi(t)
Δwji(t) = ε [yj(t) xi(t) − yj(t) Σ(k=1..j) wki(t) yk(t)]
where xi(t) is the ith component of the m-by-1 input vector x(t) and l is the dimension of the
principal component space.
3: Increment t by 1 and repeat step 2 until the synaptic weight wji converges to the ith component
of the eigenvector associated with the jth eigenvalue of the correlation matrix of the input
vector x(t).
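The three steps above describe a Sanger-style generalized Hebbian update. A minimal sketch, assuming plain Python lists and an illustrative learning rate eps:

```python
def gha_step(w, x, eps=0.01):
    """One update of the generalized Hebbian learning rule (step 2).

    w is an l-by-m weight matrix (one row per output neuron), x an
    m-vector; eps is an illustrative learning rate. Returns the updated
    weights and the outputs y.
    """
    l, m = len(w), len(x)
    # outputs y_j(t) = sum_i w_ji x_i
    y = [sum(w[j][i] * x[i] for i in range(m)) for j in range(l)]
    for j in range(l):
        for i in range(m):
            # the back-projection term sum_{k<=j} w_ki y_k decorrelates
            # successive components (Gram-Schmidt-like)
            back = sum(w[k][i] * y[k] for k in range(j + 1))
            w[j][i] += eps * (y[j] * x[i] - y[j] * back)
    return w, y
```

Iterating gha_step over the normalized metric vectors of a window drives row j of w toward the jth eigenvector of the input correlation matrix, matching step 3.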
Then, the MRPC selection process decomposes the PCs in set S into k subsets Ssub(fi) (one
for each failure type) and a noise subset Snoise. The decomposition can be expressed as:

(25) S = Ssub(f1) ∪ Ssub(f2) ∪ ... ∪ Ssub(fk) ∪ Snoise

Each failure-specific subset characterizes a failure type fi. There may be intersections between
two subsets, as some PCs are correlated with multiple failure types. The noise subset consists
of PCs that are not correlated with any failure type; they can be modeled as Gaussian noise.
Algorithm 4 shows the pseudocode of the MRPC selection process.
ALGORITHM 4. MRPC Selection Algorithm
MRPCSelect() {
1: l: dimension of the principal component space
2: k: number of failure types
3: fi: failure type i
4: N: window size
5: coeff_ji: correlation coefficient of Yj to fi
6: Ssub(fi): fi-specific subset
7: Yj(t, N) = (yj(t−N), yj(t−N+1), ..., yj(t))
8: Labeli(t, N) = (Li(t−N), Li(t−N+1), ..., Li(t))
9: Ssub(fi) = ∅
10: for j = 1, 2, ..., l do
11: for i = 1, 2, ..., k do
12: coeff_ji = correlation(Yj(t, N), Labeli(t, N))
13: if coeff_ji > σ then
14: Ssub(fi) = Ssub(fi) ∪ {j}
15: end if
16: end for
17: end for
18: }
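Algorithm 4 can be rendered in a few lines of Python. This is a sketch, assuming a plain Pearson correlation as the correlation() routine and an illustrative threshold sigma:

```python
def pearson(a, b):
    """Plain Pearson correlation coefficient of two equal-length series."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a) ** 0.5
    vb = sum((y - mb) ** 2 for y in b) ** 0.5
    return 0.0 if va == 0 or vb == 0 else cov / (va * vb)

def select_mrpcs(pc_series, fault_labels, sigma=0.2):
    """Sketch of Algorithm 4: pc_series[j] is the windowed time series of
    the jth PC, fault_labels[i] the 0/1 label series of failure type i.
    sigma mirrors the text's correlation cutoff (illustrative).
    Returns {failure type index -> set of PC indices}, i.e. Ssub(fi).
    """
    subsets = {i: set() for i in range(len(fault_labels))}
    for j, pcs in enumerate(pc_series):
        for i, labels in enumerate(fault_labels):
            if abs(pearson(pcs, labels)) > sigma:
                subsets[i].add(j)
    return subsets
```

Because the test is per failure type, one PC may land in several subsets, consistent with the possible intersections noted above.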
6.3.3. Adaptive Cloud Anomaly Identification
To identify anomalies using MRPCs, I leverage adaptive Kalman filters [47], because they
are dynamic and do not require any prior failure history. A Kalman filter is optimized to achieve
the best estimation of the next state. As the rolling window moves forward, the Kalman filter
detects anomalies based on the corresponding cloud performance data record and the prior states by
following Equation (26). Then it uses the measurement and the estimation to update the uncertainty
for the next time window by employing Equation (27) adaptively. Each iteration consists of anomaly
detection and correction.
The anomaly detector identifies anomalous cloud behaviors by applying the following
model.
(26) X−t = Φ Xt−1, P−t = Φ Pt−1 + Q
The correction phase can be presented as follows.
(27) Kt = P−t / (P−t + R), Xt = X−t + Kt (Mt − Ψ X−t), Pt = (1 − Kt Ψ) P−t
where X−t and Xt represent the prior and posterior states at time t, and P−t and Pt denote the
prior and posterior error covariances. Φ and Ψ are detector parameters. Kt is the gain that controls
the weight of the measurement relative to the prior estimate. Mt is the measurement at time t.
Q and R are the variances of the estimation noise and measurement noise, respectively.
If the difference |Xt − Mt| between the measurement and the estimation on MRPCs at a time point
is greater than a threshold, an anomaly is detected and reported to the cloud operators. The cloud
operators check the identified anomalies to verify them as either true failures or false alarms. The
records of true failures are used to update the corresponding anomaly detector model following
Equation (27). The cloud operators also input observed but undetected failure records to the
anomaly detector to generate a new MRPC subset. Algorithm 5 presents the adaptive anomaly
detection process.
ALGORITHM 5. Adaptive Anomaly Identification Algorithm
AnomalyDetector() {
1: while (TRUE) do
2: On receiving a cloud performance data record xt
3: if |Xt − Mt| > τ then
4: Report the anomalous state
5: end if
6: On receiving a verified failure or an observed
but undetected failure record
7: MRPCSelect()
8: end while
9: }
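A minimal sketch of the detection loop built from Equations (26) and (27) and Algorithm 5 follows. The parameters phi, psi, q, r, and the threshold tau are illustrative values, not values from this work:

```python
def kalman_anomaly(measurements, phi=1.0, psi=1.0, q=1e-4, r=1e-2, tau=0.5):
    """Sketch of the Kalman-filter anomaly detector.

    `measurements` is the time series of one MRPC. Each step predicts
    the next state (Eq. 26), flags a point whose prediction error
    exceeds tau, then corrects state and covariance (Eq. 27).
    Returns the indices of the flagged time points.
    """
    x, p = measurements[0], 1.0
    alarms = []
    for t, m in enumerate(measurements[1:], start=1):
        # prediction (Eq. 26)
        x_prior = phi * x
        p_prior = phi * p + q
        # anomaly check: |estimate - measurement| against the threshold
        if abs(x_prior - m) > tau:
            alarms.append(t)
        # correction (Eq. 27)
        k = p_prior / (p_prior + r)
        x = x_prior + k * (m - psi * x_prior)
        p = (1 - k * psi) * p_prior
    return alarms
```

Because the filter keeps adapting its state to the data it sees, no prior failure history is needed, which matches the adaptivity argument above.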
6.4. Analysis of Cloud Anomalies
6.4.1. Anomaly Detection and Diagnosis Results
In this section, I study the four types of failures caused by CPU-related faults, memory-
related faults, disk-related faults, and network-related faults. For each failure type, I present the
experimental results on MRPC selection and discuss the root cause analysis on each MRPC.
6.4.2. MRPCs and Diagnosis of Memory Related Failures
Figure 6.8(a) shows the correlation between PCs and the memory-related faults. As I have
discussed, the 1st and 2nd principal components do not possess high causal correlation with the
occurrences of failures (only 0.16 and 0.04). This indicates the memory related failures have little
dependency upon them. On the contrary, the 3rd, 5th, 8th and 31st PCs display high correlation
with the failure records (greater than 0.2), as listed in Table 6.1. Figure 6.4(a) shows that the 3rd PC
clearly distinguishes the failure states from the normal states.
FIGURE 6.4. MRPCs of memory-related failures. (a) plots the time series of the 3rd
principal component. (b) shows that the performance metric avgrq-sz displays the high-
est contribution to the MRPC.
Based on the procedure described in Section 6.3.2, the synaptic weights wji represent the
quantified impact from the original metric space on the anomaly-specific subsets. Considering that
these synaptic weights could be either positive or negative, I exploit |wji| to quantify the contribution
of each performance metric to the anomaly-specific subsets. The computed weights are
shown in Figure 6.4(b) for the 3rd principal component, which is selected as the top-ranked MRPC
with regard to memory-related failures. In addition, one performance metric has a dominant
contribution to this MRPC, with a weight of 0.65, while multiple other performance metrics have
weights around 0.1-0.2. By checking the performance metric list, we find that the highly weighted
metric is "avgrq-sz dev253-1", which is "the average size (in sectors) of the requests that were
issued to the hard drive device 253-1" [2]. Given that the memory-related failures are injected by
continuously allocating memory over a short period, the swap space is put into use once the physical
memory is exhausted. As a result, this process issues more requests to the hard disk.

TABLE 6.1. MRPCs ranked by correlation with faults. (For each major type, 25
faults are injected into the testbed.)

Fault Type      Rank  Order of PC  Correlation Coefficient to Fault
Memory Fault    1     3            0.3898
                2     5            0.2840
                3     8            0.2522
                4     31           0.2043
I/O Fault       1     5            0.4283
                2     7            0.2961
                3     3            0.2402
CPU Fault       1     35           0.3738
                2     40           0.3424
                3     103          0.2559
Network Fault   1     29           0.3532
                2     27           0.2733
                3     23           0.2715
6.4.3. MRPCs and Diagnosis of Disk Related Failures
Disk-related faults are injected by continuously issuing a big volume of disk requests to
saturate the I/O bandwidth. The causal correlation with the disk related failures is computed for
FIGURE 6.5. MRPCs of disk-related failures. (a) plots the time series of the 5th
principal component. (b) shows that the performance metric rd_sec/s dev-253 displays
the highest contribution to the MRPC.
each principal component, as shown in Figure 6.8(b). The top-ranked MRPCs are listed in
Table 6.1. The 5th PC possesses the highest correlation with the disk-related failures, as its causal
correlation is more than 0.42. Analysis of the time series of the 5th principal component, plotted
in Figure 6.5(a), shows that most of the anomalies could be identified by setting a proper threshold.
From Figure 6.5(b), the performance metric named "rd_sec/s dev-253", with a coefficient of
0.4423, contributes to the 5th principal component more than any other performance metric. It refers
FIGURE 6.6. MRPCs of CPU-related failures. (a) plots the time series of the 35th
principal component. (b) shows that the performance metric ldavg displays the highest
contribution to the MRPC.
to the number of sectors read from the device, which is an indicator characterizing the symptom
of I/O-related failures.
6.4.4. MRPCs and Diagnosis of CPU Related Failures
CPU-related faults are injected by employing infinite loops that use up all CPU cycles.
Table 6.1 lists the MRPCs with the highest correlation with the CPU-related failures. Figure 6.6(a)
FIGURE 6.7. MRPCs of network-related failures. (a) plots the time series of the
29th principal component. (b) shows that the performance metric %user 4 displays
the highest contribution to the MRPC.
presents the time series of the 35th principal component. From the figure, I can see that some CPU-
related failures are not easily identifiable, e.g., the failures that occurred around the 1250th and
1500th minutes. Figure 6.6(b) plots the weights of the cloud performance metrics for the 35th
principal component. With the largest weight of 0.3874, "ldavg-15" refers to "the load average
calculated as the average number of runnable or running tasks (R state) and the number of tasks
in uninterruptible sleep (D state) over the past 15 minutes". The second and third largest weights
correspond to the
performance metrics "ldavg-5" and "%sys all", respectively. "%sys all" refers to the average
system-level CPU utilization over all processors. All three performance metrics characterize
process behavior under failures.
6.4.5. MRPCs and Diagnosis of Network Related Failures
Network-related faults are injected by saturating the network bandwidth by continuously
transferring large files between servers. In cloud computing systems, denial-of-service attacks,
virus infections, and failures of switches and routers may cause this type of anomaly. The MRPCs
are listed in Table 6.1. The 29th principal component is highly correlated with the network-
related failures. The 27th and 23rd principal components are ranked second and third, as
shown in Figure 6.8(d). In Figure 6.7(a), I can see that the 29th principal component is sensitive to
the occurrences of network-related failures and that the failures are distinguishable from normal
states. Figure 6.7(b) shows that "%usr 4" possesses the highest weight, 0.292. The second and third highest
weights are associated with performance metrics ”%idle 4” and ”svctm dev8-0” (i.e., ”The average
service time (in milliseconds) for I/O requests that were issued to the device.”). Both %usr 4 and
%idle 4 represent the states of processor core 4, which is assigned to the virtual machine where
the network-related faults are injected. Therefore, MRPCs can assist cloud operators not only in
identifying anomalies, but also in localizing faults, even within virtual machines.
6.4.6. The Accuracy of Anomaly Identification
I study the performance of several anomaly detection techniques including our proposed
MRPC-based detection approach. I use the receiver operating characteristic (ROC) curves to
present the experimental results. An ROC curve displays the true positive rate (TPR) and the
false positive rate (FPR) of the anomaly detection results. The area under the curve is used to
evaluate the detection performance. A larger area indicates higher sensitivity and specificity.
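For reference, one TPR/FPR point at a given detection threshold can be computed as below; sweeping the threshold then traces the ROC curve. The scores and labels are illustrative, not data from this work:

```python
def roc_point(scores, labels, threshold):
    """TPR and FPR of a detector at one threshold.

    scores are anomaly scores, labels the 0/1 ground truth; a sample is
    flagged when its score reaches the threshold.
    """
    tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l)
    fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and not l)
    pos = sum(labels)
    neg = len(labels) - pos
    return (tp / pos if pos else 0.0, fp / neg if neg else 0.0)
```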
I compare the performance of the proposed MRPC-based anomaly detection approach
with four widely used detection algorithms: decision tree, Bayesian network, support vector
machine (SVM), and the 1st principal component (using a Kalman filter to detect anomalies). Our
MRPC-based anomaly detector achieves the best performance, with the true positive rate reaching 91.4%
while keeping the false positive rate as low as 3.7%. By applying only the first principal compo-
nent, the false positive rate is higher than 40% in order to achieve a 90% true positive rate. Among
other detection algorithms, the Bayesian network is relatively better, reaching 74.1% TPR with
a low FPR. The experimental results show that using only the 1st principal component yields the
worst performance in identifying the performance anomalies.
On average, it takes 6.81 seconds for a control node in the cloud to process the cloud
performance data, select MRPCs, and make an anomaly detection.
6.4.7. Experimental Results using Google Datacenter Traces
In addition to the experiments on our cloud computing testbed, I evaluate the performance
of the proposed MRPC-based anomaly detection mechanism by using the performance and events
traces collected from a Google datacenter [80]. The Google datacenter trace is the first publicly
available dataset collected from a large number (about 13,000) of multi-purpose servers over 29
days. In the dataset, multiple task-related events are recorded. Among them, I focus on the failure
events. In total, there are 13 resource usage metrics profiled periodically, which are listed in Table
6.2. The measurement period is typically 5 minutes (300 s); within each measurement period,
measurements are usually taken at 1-second intervals. By applying the MRPC selection algorithm
presented in Section 6.3.2, we obtain the causal correlation between the principal components and
the failure events, which is plotted in Figure 6.9. The 13th principal component retains the highest
correlation (i.e., 0.18) with the failures.
The ROC curves in Figure 6.10 show the performance of the proposed anomaly detection
approach and the other four detection algorithms. By exploiting MRPCs, I can achieve 81.5% TPR
at 27% FPR. These results outperform all the other detection methods by 22.9%-68.7% in TPR
at the same FPR. The performance of the proposed anomaly identification mechanism is somewhat
worse on the Google traces. This is caused by the higher dynamicity and variety of the workloads,
more complex interactions among system components, a smaller number of performance metrics,
and incomplete information about failure types. Our anomaly detector still provides valuable
information about failure dynamics, which helps system operators proactively reconfigure
resources and schedule workloads.
TABLE 6.2. Performance metrics in the Google datacenter traces
Index Performance Metrics
1 Number of running tasks
2 CPU rate
3 Canonical memory usage
4 Assigned memory usage
5 Unmapped page cache
6 Total page cache
7 Maximum memory usage
8 Disk I/O time
9 Local disk space usage
10 Maximum CPU rate
11 Maximum disk I/O time
12 Cycles per instruction
13 Memory accesses per instruction
The main contribution of this work is that, to the best of our knowledge, it is the first to use
subsets of principal components as the most relevant metrics for different types of failures. Through
the analysis of each failure type, I show that anomalies are highly correlated with specific principal
component subsets. Moreover, MRPCs can be applied to uncover the root causes of failures and
to guide timely maintenance.
6.5. Summary
Modern large-scale and complex cloud computing systems are susceptible to software and
hardware failures, which significantly affect cloud dependability and performance. In this
chapter, I present an adaptive anomaly identification mechanism for cloud computing systems. I
start by analyzing the correlation of the principal components with failure occurrences, where I
find that the PCs retaining the highest variance cannot effectively characterize the failure events, while
lower-order PCs display high correlation with the occurrences of failures. I then propose to exploit
the most relevant principal components (MRPCs) to describe failure events and devise a learning
based approach to identify and diagnose cloud anomalies by leveraging MRPCs. The anomaly
detector adapts itself by recursively learning from these newly verified detection results to refine
future detections. Meanwhile, it exploits the observed but undetected failure records reported by
the cloud operators to identify new types of failures. Experimental results from an on-campus
cloud computing testbed show that the proposed MRPC-based anomaly identification mechanism
can accurately detect failures while incurring low overhead. Learning from the MRPC subspaces
that relate to each type of failure, I gain knowledge of the root causes of failures.
FIGURE 6.8. Correlation between the principal components and different types of
failures: (a) memory-related, (b) disk-related, (c) CPU-related, and (d) network-related failures.
FIGURE 6.9. Correlation between principal components and failure events using
the Google datacenter trace.
FIGURE 6.10. Performance of the proposed MRPC-based anomaly detector com-
pared with four other detection algorithms on the Google datacenter trace.
CHAPTER 7
F-SEFI: A FINE-GRAINED SOFT ERROR FAULT INJECTION FRAMEWORK
7.1. Introduction
In order to facilitate the testing of application resilience methods, I present a fine-grained
soft error fault injector named F-SEFI. F-SEFI allows for the targeted injection of soft errors into
instructions belonging to applications of interest, and even into those applications' individual
subroutines. F-SEFI leverages the QEMU [92] virtual machine (VM) and its hypervisor. QEMU uses Tiny Code
Generation (TCG) to reference and translate instruction sets between the guest and host architec-
tures before the instructions are delivered to the host system for execution. F-SEFI provides the
ability to emulate soft errors and corrupt data at runtime by intercepting instructions and replacing
them with contaminated versions during the TCG translation. With the addition of a binary symbol
table, F-SEFI supports a tunable fine-grained injection strategy where soft errors can be injected
into chosen instructions in specified functions of an application. In addition, F-SEFI supports
multiple fault models that mimic upsets in hardware (e.g., a probabilistic model, a single-bit fault
model, and a multi-bit fault model). Overall, F-SEFI manages the fault injections, and the user decides
where, when, and how to inject faults.
I implemented a prototype F-SEFI system and conducted the fault injection campaign on
multiple HPC applications. The experimental results show that the effect of the injected faults is
amplified when the fault propagates to other software components, resulting in a number of silent
data corruptions (SDCs) at multiple sites. F-SEFI provides sufficient instruction-level soft error samples
under different fault models, which helps programmers understand the vulnerabilities of the underlying
HPC applications and further helps in designing resilience strategies to mitigate the impact of SDCs.
The rest of this chapter is organized as follows. Section 7.2 presents the coarse-grained soft
error fault injection (C-SEFI) platform, which requires gdb to manually snoop on and inject soft
errors into specific applications. Section 7.3 describes the design goals of the fine-grained soft error
fault injector (F-SEFI) and its capabilities. The fault definitions and models supported
in F-SEFI are presented in Section 7.3.2. Section 7.3.3 depicts the fault injection mechanism and
FIGURE 7.1. Overview of C-SEFI
the implementation of the components of F-SEFI. Case studies on three widely used benchmarks
are demonstrated in Section 7.3.4. Discussion and conclusion are presented in Section 7.4 and
Section 7.5.
7.2. A Coarse-Grained Soft Error Fault Injection (C-SEFI) Mechanism
C-SEFI’s logic soft error injection operational flow is roughly depicted in Figure 7.1. First,
the guest environment is booted and the application to inject faults into is started. Next, I probe
the guest operating system for information related to the code region of the target application and
notify the VM which code regions to watch. Then the application is released, allowing it to run.
The VM observes the instructions occurring on the machine and augments those of interest. A more
detailed explanation of these techniques follows.
7.2.1. C-SEFI Startup
Initial startup of C-SEFI begins by simply booting a debug-enabled Linux kernel within a
standard QEMU virtual machine. QEMU allows us to start a gdbserver within the QEMU monitor
such that I can attach to the running Linux kernel with an external gdb instance. This allows us
to set breakpoints and extract kernel data structures from outside the guest operating system as
well as from outside QEMU itself. This is a fairly standard technique used by many Linux kernel
developers. Figure 7.2 depicts the startup phase.
7.2.2. C-SEFI Probe
Once the guest Linux operating system is fully booted and sitting idle, I use the attached
external gdb to set a breakpoint at the end of the sys_exec call tree, but before an application is
sent to a CPU to be executed. I currently focus only on ELF binaries and have therefore
set the breakpoint at the end of the load_elf_binary routine. This is trivial to generalize to other
FIGURE 7.2. C-SEFI's startup phase
FIGURE 7.3. C-SEFI’s probe phase
binary formats in future work. With the breakpoint set, I am free to issue a continue via gdb to
allow the Linux kernel to operate. The application of interest can now be started and will almost
immediately hit our set breakpoint and bring the kernel back to a stopped state. By this point in the
exec procedure the kernel has already loaded an application’s text section into physical memory in
a memory region denoted by the start code and end code elements of the task’s mm struct memory
structure. I can now extract the location in memory assigned to our application by the kernel by
walking the task list in the kernel. Starting with the symbol init task, I can find the application of
interest either by comparing a binary name to the task struct’s comm field or by searching for a
known pid which is also contained in the task struct. The physical addresses within the VM of the
application’s text region can now be fed into our fault injection code in the modified QEMU virtual
machine. Currently this is done by hand but I have plans to automate this discovery and transfer
using scripts and hypervisor calls.
Figure 7.3 depicts the probe phase of C-SEFI.
7.2.3. C-SEFI Fault Injection
Once QEMU has the code segment range of the target application, the application is resumed.
Next, when any opcode that I am interested in injecting faults into is called in the guest
hardware, QEMU checks the current instruction pointer register (EIP). If that instruction pointer
address is within the range of the target application (obtained during the probe phase), QEMU
is aware that the targeted application is running this particular instruction. At this point I am
able to inject any number of faults with confidence that only the desired application is affected.
This approach is novel for several reasons. Causing opcodes in emulated machine hardware
to produce wrong results is not particularly novel or complex. What is complex is doing it
only in applications of interest and not every time that instruction is called. For instance, causing
every add operation to be faulty on the machine would be neither interesting nor allow the kernel to
boot. Our technique of pinpointing which instructions are being executed by an application affords
us this capability.
FIGURE 7.4. C-SEFI’s fault injection phase
Figure 7.4 depicts this fault injection phase of the C-SEFI logic plug-in. In the first step
of this phase, QEMU brings in the code segment range obtained in the previous, probing, phase.
This range is passed into QEMU by a new hypervisor call that I added to QEMU. Next, the gdb
breakpoint is removed. The application is then resumed and continues operation as normal. Once
the application makes calls to opcodes that I am monitoring, the fault injection code inside of
QEMU can determine if, and how, to insert a simulated soft error in that opcode. Finally, the
81
application continues to run in this state and I observe and analyze how the injected fault is handled
in the application.
The opcode fault injection code has several capabilities. Firstly, it can flip a bit in the inputs of the operation, which simulates a soft error in the input registers used for that operation. Secondly, it can flip a bit in the output of the operation. This simulates either a soft error in the actual operation of the logic unit (such as a faulty multiplier) or a soft error in the register after the data value is stored. Currently the bit flipping is random but can be seeded to produce errors in a specified bit range. Thirdly, opcode fault injection can perform complicated changes to the output of operations by flipping multiple bits in a pattern consistent with an error in part, but not all, of an opcode's physical circuitry. For example, consider the difference in the output of adding two floating point numbers of differing exponents if a transient error occurs for one of the numbers while the significant digits are being aligned so that they can be added. By carefully considering the elements of such an operation, I can alter the output to reflect all the different possible incorrect outputs that might occur.
The fault injector also has the ability to let some calls to the opcode go unmodified. It is possible to cause the faults to occur after a certain number of calls or with some probability. In this way the fault can occur every time, which closely emulates permanently damaged hardware, or a single call can be made faulty to emulate a transient soft error.
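The call-count and probability triggers described above can be sketched as a small C injector. The struct layout and function names are assumptions for illustration, not the dissertation's actual implementation:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical injector state: fire on the Nth monitored call, or with
 * a given per-call probability. */
typedef struct {
    unsigned calls_seen;
    unsigned trigger_call;   /* inject on this call number (0 = disabled) */
    double   probability;    /* per-call injection probability in [0, 1) */
    unsigned lo_bit, hi_bit; /* inclusive bit range eligible for flipping */
} injector_t;

/* Flip one randomly chosen bit of `val` inside [lo, hi]; the PRNG can
 * be seeded (srand) to confine errors to a reproducible bit range. */
static uint64_t flip_random_bit(uint64_t val, unsigned lo, unsigned hi)
{
    unsigned bit = lo + (unsigned)(rand() % (hi - lo + 1));
    return val ^ (1ULL << bit);
}

/* Called once per monitored opcode; returns the (possibly corrupted)
 * result. A call count or a probability decides whether to inject. */
static uint64_t maybe_inject(injector_t *inj, uint64_t result)
{
    inj->calls_seen++;
    int fire = (inj->trigger_call && inj->calls_seen == inj->trigger_call)
            || ((double)rand() / RAND_MAX < inj->probability);
    return fire ? flip_random_bit(result, inj->lo_bit, inj->hi_bit) : result;
}
```

Setting `trigger_call` to fire on every call models permanently damaged hardware; firing exactly once models a transient soft error.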
Most importantly, whenever I cause a fault to occur I know precisely what the instruction pointer was at that time. Using this information I should be able to reference back to the original source code. One obvious complication is that there is no readily available one-to-one mapping between high-level language source code and the machine code generated by the compiler and assembler. However, if the target application is compiled with debug symbols, I can recognize at the very least which function the application was in when I injected the fault. This, coupled with careful code organization, should make this mapping more feasible.
7.2.4. Performance Evaluation of C-SEFI
To demonstrate C-SEFI's capability to inject errors in specific instructions I present two simple experiments. For each experiment I modified the translation instructions inside of QEMU for each instruction of interest. Once the instruction was called, the modified QEMU would check the current instruction pointer (EIP) to see if the address was within the range of the target application. If so, then a fault could be injected. I performed two experiments in this way, injecting faults into the floating point multiply and floating point add operations.
For this experiment I instrumented the floating point multiply operation, fmul, in QEMU. I created a toy application which iteratively evaluates Equation 28 for 40 iterations. The variable, y, is initialized to 1.0.
(28) y = y ∗ 0.9
Then, at iteration 10 I injected a single fault into the multiplication operation by flipping a random bit in the output. Figure 7.5 plots the results of this experiment. The large, solid line represents the output as it would be without any faults. The other five lines represent separate executions of the application with different random faults injected. Each fault introduces a numerical error in the results which persists through the lifetime of the program.
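The host-side effect of this experiment can be reproduced directly, without an emulator, by flipping a chosen bit of the product at iteration 10. This is a minimal sketch of the experiment's arithmetic, not the instrumented QEMU code; the function names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Flip bit `b` (0..63) of a double's IEEE-754 representation. */
static double flip_double_bit(double x, unsigned b)
{
    uint64_t u;
    memcpy(&u, &x, sizeof u);   /* reinterpret bits without aliasing UB */
    u ^= 1ULL << b;
    memcpy(&x, &u, sizeof u);
    return x;
}

/* Run the toy loop y = y * 0.9 for 40 iterations; at iteration
 * `fault_iter` flip bit `fault_bit` of the product, mimicking a soft
 * error in the FMUL output register. fault_iter < 0 means a clean run. */
static double toy_multiply(int fault_iter, unsigned fault_bit)
{
    double y = 1.0;
    for (int i = 1; i <= 40; i++) {
        y = y * 0.9;
        if (i == fault_iter)
            y = flip_double_bit(y, fault_bit);
    }
    return y;
}
```

A clean run converges to 0.9^40 ≈ 0.0148, while a single flipped exponent bit at iteration 10 propagates through every remaining multiplication, just as the faulty traces in Figure 7.5 diverge from the solid line.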
I focus on two areas of interest from the plot, shown in Figures 7.6 and 7.7. In Figure 7.6 the plot is zoomed in on the point where the five faults are injected so as to make them easier to see. Figure 7.7 focuses on the final results of the application. In this figure it becomes clear that each fault caused an error that manifests in the application through to the final results.
7.3. A Fine-Grained Soft Error Fault Injection (F-SEFI) Framework
In the previous section I discussed the coarse-grained SEFI. It validates the idea of designing a soft error fault injector with minimal modifications to the environment and source code. But C-SEFI is impractical because it requires the user to pause the application and extract application knowledge by hand in order to inject faults into a specific application. Moreover, C-SEFI only implements a coarse-grained injection granularity. It is difficult to inject faults into specific sub-routines, which limits its capability to profile an application's vulnerability to soft errors. Based on the study of C-SEFI, I propose a fine-grained soft error fault injection platform that not only inherits the key features of C-SEFI, but also combines all of the features I desired in a tool meant to study the behavior of applications in the presence of soft errors. The key features of F-SEFI are summarized as follows:
FIGURE 7.5. The multiplication experiment uses the floating point multiply instruction, where a variable initially set to 1.0 is repeatedly multiplied by 0.9. For five different experiments a random bit was flipped in the output of the multiply at iteration 10, simulating a soft error in the logic unit or output register.
FIGURE 7.6. Experiments with the focus on the injection point. It can be seen that each of the five separately injected faults causes the value of y to change - once radically, the other times slightly.
FIGURE 7.7. Experiments with the focus on the effects on the final solution. It can be seen that the final output of the algorithm differs due to these injected faults.
7.3.1. F-SEFI Design Objectives
Non-intrusion: in designing F-SEFI I was keenly focused on providing fault injection with as little impact on the operating environment as possible. Our approach is non-intrusive in that it requires no modifications to the application source code, compilers, third-party software, or operating system. It does not require custom hardware and runs entirely in user space, so it can be run on production supercomputers alongside scientific applications. These constraints are pragmatic at a production DOE facility and also exclude any possibility of side effects due to intrusive changes. Additionally, our approach allows other applications to run alongside the application under fault injection. In particular, this facilitates studies in resilient libraries and helper applications.
Infrastructure Independence: F-SEFI is designed as a module of the QEMU hypervisor and, therefore, benefits from virtualization. Since the hypervisor supports a wide range of platforms, so does our fault injection capability. This enables us to explore hardware that I might not physically have, as well as to explore new hardware approaches. For instance, I can implement triple-modular redundancy (TMR) in certain instructions and generate errors probabilistically to evaluate classes of applications that might be resilient on such hardware. In addition, since all guest OSs are isolated, multiple target guest OSs from different architectures can work at the same time without any interference. Faults can then be contained, and I can run multiple applications in different guest OSs and inject faults into them concurrently. Similarly, since F-SEFI can target a specific application, I can inject into multiple applications running within the same guest OS. This can help reduce the effect of the virtualization overhead by studying multiple applications (or input sets) concurrently.
Application Knowledge: F-SEFI performs binary injection dynamically, without augmentation of the source code. Moreover, it adapts to the dynamicity of data objects, covering all static and dynamic data. This is especially useful for applications that operate on random data or whose fault characteristics vary when given different input datasets. F-SEFI does not require the memory access information of the data objects at runtime. All injections target the instructions, covering the opcodes, addresses, and data in registers copied from memory.
Tunable Injection Granularity: F-SEFI supports a tunable injection granularity, allowing it to inject faults semantically. Faults can target the entire application or focus on specific functions. Furthermore, the faults can be configured to infect specific operands and specific bit ranges. Particularly with function-level injection, F-SEFI can provide a gprof-like [6] vulnerability profile, which is useful to programmers analyzing vulnerability coverage. While fine-grained tunability relies on the symbol table extracted from an unstripped binary, F-SEFI can still inject faults into random locations in the application if the symbol table is not available.
Injection Efficiency: F-SEFI can be configured to inject faults only in specific micro-operations and stay out of the way of others. As such, it can be configured to cause only SDCs by flipping bits in mathematical operations. Or, it can be used to explore control corruptions (such as looping and jumps) or crashes (accessing protected memory, etc.). This generality allows a user of F-SEFI to focus their attention on studying the effects of specific SDC scenarios.
7.3.2. F-SEFI Fault Model
In this work I consider soft errors that occur in the functional units (e.g., ALU and FPU) of the processor. In order to produce SDCs, I corrupt the micro-operations executed in the ALU (e.g., XOR) and FPU (e.g., FADD and FMUL) by tainting values in the registers during instruction execution. Fault characteristics can be configured in several ways to comprehensively study how an application responds.

Faulty Instruction(s): soft errors can be injected into any machine instruction. In this work I study corrupted FADD, FMUL, and XOR instructions (Table 7.1). Since QEMU does guest-to-host instruction translation, I merely modify this process to perform the type of corruption I want to study.

TABLE 7.1. Fault types for injection

Fault Type | Description
FADD | Bit-flip in the floating point addition micro-operation.
FMUL | Bit-flip in the floating point multiplication micro-operation.
XOR  | Bit-flip in the XOR micro-operation.
Random and Targeted: F-SEFI offers both random (for coarse-grained) and targeted (for fine-grained) fault injection. Initial development of the tool demonstrated coarse-grained injection by randomly choosing instructions to corrupt in an application [4]. This technique provides limited resilience evaluation at the application level. F-SEFI now also has the ability to do targeted fault injection into specific instructions and functions of an application. This allows a finer-grained study of the vulnerabilities of an application.
Single and Multiple-Bit Corruption: any number of bits can be corrupted in an instruction using F-SEFI. This allows for the study of how applications would behave without different forms of error protection, as well as of faults that cause silent data corruption.
Deterministic and Probabilistic: while injecting faults into instructions, F-SEFI can deterministically flip any bit of the input or output register(s). It can also be configured to apply a probability function to determine which bits are more vulnerable than others. For example, one can target the exponent, mantissa, or sign bit(s) of a floating point value.
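Targeting a specific IEEE-754 field amounts to choosing which bit range of the 64-bit representation to flip. This sketch shows the field selection deterministically; the enum and function names are assumptions, and a probabilistic configurator would simply weight the choice of field:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* IEEE-754 binary64 layout: bit 63 = sign, bits 62-52 = exponent,
 * bits 51-0 = mantissa. */
enum fp_field { FP_SIGN, FP_EXPONENT, FP_MANTISSA };

/* Flip one bit of `x` confined to the requested field; `offset`
 * selects the bit within that field. */
static double flip_in_field(double x, enum fp_field f, unsigned offset)
{
    uint64_t u, bit;
    memcpy(&u, &x, sizeof u);
    switch (f) {
    case FP_SIGN:     bit = 63;                break;
    case FP_EXPONENT: bit = 52 + (offset % 11); break; /* bits 52..62 */
    default:          bit = offset % 52;        break; /* bits 0..51  */
    }
    u ^= 1ULL << bit;
    memcpy(&x, &u, sizeof u);
    return x;
}
```

Flipping a sign or high exponent bit changes a value radically, while a low mantissa bit perturbs it only slightly, which is exactly why field-targeted injection is useful for vulnerability studies.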
7.3.3. F-SEFI Fault Injection Mechanisms
F-SEFI leverages extensive open source work on the QEMU processor emulator and virtualizer by interfacing with the hypervisor as a plug-in module. After the QEMU hypervisor starts a virtual machine image, the F-SEFI broker is loaded dynamically. As instructions are issued by applications running within a guest OS, F-SEFI intercepts these and potentially corrupts them before sending them on to the host kernel. This interaction is depicted in Figure 7.8.

FIGURE 7.8. The overall system infrastructure of F-SEFI: guest OSs run on the QEMU hypervisor, which hosts the F-SEFI broker and its log, above the host kernel and hardware.

F-SEFI runs entirely in user space and can be run as a "black box" on the command line. This launches the tool, performs the fault injections, tracks how the application responds, and logs all the results back to the host file system. This is particularly useful for batch-mode analysis in campaign studies of the vulnerability of applications. F-SEFI consists of five major components: profiler, configurator, probe, injector, and tracker, which are shown in Figure 7.9. These are explained in more detail in the next few sections.

FIGURE 7.9. The components of the F-SEFI broker: the profiler collects information about target instructions; the configurator configures the probe and injector; the probe snoops the EIP before the execution of each guest code block; the injector uses a bit-flipper to contaminate the target application/function; and the tracker logs all injection events.
Profiler: as with most dynamic fault injectors, F-SEFI profiles the application to gather information about it before injecting faults. As described in Section 7.3.2, F-SEFI can target specific instructions for corruption. It is in this profiling stage that F-SEFI gathers information about how many occurrences of each instruction there are, as well as their relative locations within the binary. It is also in this profiling stage that the function symbol table (FST) is extracted from the unstripped binary. This allows F-SEFI to understand where the application's functions start and end. Then the Execution Instruction Pointer (EIP) is observed through QEMU to trace where the application is at runtime. Figure 7.10 shows the relevant information for a sample symbol table used in a later case study.

Num: Value    Size Type Bind   Vis     Ndx Name
 65: 08048130  136 FUNC GLOBAL DEFAULT  13 find_nearest_point
 86: 080489a0  143 FUNC GLOBAL DEFAULT  13 clusters
101: 080491f0  661 FUNC GLOBAL DEFAULT  13 kmeans_clustering
105: 08048a70 1713 FUNC GLOBAL DEFAULT  13 main

FIGURE 7.10. A subset of the function symbol table (FST) for the K-means clustering algorithm studied in Section 7.3.4. This is extracted during the profiling stage and used to trace where the application is at runtime for targeted fault injections.
Configurator: the configuration contains all the specifics related to the faults that will be
injected. This includes what application is to be studied, functions to target, the injection granular-
ity, and the instructions to alter. Additionally, probabilities of alteration can be assigned to specific
bit regions where injections are desired.
As an example, in [101], the authors present an application that is highly resilient except to
data corruption in high order bits. This configuration stage makes it possible to target injections,
for instance, at only the most significant bits in a 64-bit register from the 52nd bit to the 63rd
bit. This is precisely the kind of study that is enabled by F-SEFI. Another example use would
be choosing a probability of corruption related to the neutron flux where the application will run
(sea level, high altitude terrestrial, aerial, satellite, etc.). The configurator allows a great deal of
flexibility in the way instructions can be targeted. For example, one can skip over N instances of a
target instruction and only then begin injecting faults.
Probe: once profiled and configured, the application under analysis is run within the guest OS. The F-SEFI probe component then dynamically observes the guest instruction stream before it is sent to the host hardware for execution. This instruction stream is snooped at the block level, where QEMU organizes instructions into blocks to reduce overhead. The probe monitors the Execution Instruction Pointer (EIP), and if it enters the memory region belonging to the target application, the probe switches to instruction-level monitoring. At this more fine-grained level the probe begins checking the micro-operations of each instruction that passes to the host. If the underlying instruction satisfies the conditions defined in the configuration phase, the probe activates the injector. The algorithm for the probing process is shown as follows.
ALGORITHM 6. Probing Algorithm
PROBE() {
1: Load Probe configuration
2: Load Injector configuration
3: FOR each TB intercepted by F-SEFI DO
4:   Extract PS name of current TB
5:   IF PS name == target application THEN
6:     Start F-SEFI Tracker
7:     Extract critical memory region
8:     FOR each instruction to execute DO
9:       IF instruction address resides within critical memory region THEN
10:        Start Injector
11:      END IF
12:    END FOR
13:  END IF
14: END FOR
15: }
Injector: QEMU has translation functions describing how to translate each instruction on a guest architecture into an instruction (or series of instructions) on a host architecture. The injector phase
of F-SEFI substitutes the original helper function with a modified one. The new corrupted ver-
sion is controlled by the configuration to conditionally flip bits in the registers used during the
calculation. This translation is entirely transparent to the QEMU hypervisor and allows F-SEFI to
closely emulate faulty hardware without the associated overheads and limitations of hardware fault
injection.
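The helper-substitution idea can be sketched as follows. This is an illustrative model of the mechanism, not QEMU's real helper signature or F-SEFI's code; the names and the one-shot flag are assumptions:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static int inject_now;  /* set by the probe when injection conditions hold */

/* The pristine behaviour the original helper would provide. */
static double helper_fmul(double a, double b) { return a * b; }

/* F-SEFI-style replacement: identical semantics unless the probe has
 * armed an injection, in which case one bit of the product is flipped
 * before the result is handed back to the translated guest code. */
static double helper_fmul_faulty(double a, double b)
{
    double r = helper_fmul(a, b);
    if (inject_now) {
        uint64_t u;
        memcpy(&u, &r, sizeof u);
        u ^= 1ULL << 30;   /* corrupt one mantissa bit */
        memcpy(&r, &u, sizeof u);
        inject_now = 0;    /* one-shot: model a transient error */
    }
    return r;
}
```

Because the substitution happens in the translation layer, the guest OS and the application see only an FMUL that occasionally returns a wrong value, exactly as faulty hardware would.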
Tracker: F-SEFI maintains very detailed logs of what happens during the monitoring of an application, as well as carefully tracking fault injections. When it decides to inject a fault, it reports information about what instruction was being executed and the state of the registers before and after injection. In this way it is possible to analyze post-mortem the way in which the application behaved when faults occurred.

TABLE 7.2. Benchmarks and target functions for fine-grained fault injection

Benchmarks | Target Functions for Injection
FFT [95]: Fast Fourier Transform using Radix-2, 3, 4, 5 and 8 FFT routines | fft4b: Radix-4 routine; fft8b: Radix-8 routine
BMM [78]: Bit Matrix Multiply algorithm from the CPU Suite benchmark | maketable: construct the lookup tables; bmm_update: apply 64-bit matrix multiply based on lookup tables
Kmeans [16]: K-means clustering algorithm from the Rodinia benchmark suite | kmeans_clustering: update the cluster center coordinates; find_nearest_point: update membership
7.3.4. Case Studies
In this section I demonstrate F-SEFI injecting faults into three benchmark applications: Fast Fourier Transform (FFT), bit matrix multiplication, and K-means clustering. These experiments were conducted using the QEMU virtual machine hypervisor. The guest kernel used was Linux version 2.6.0 running on Ubuntu 9.04. The host specifications are unimportant, as all that is required is that QEMU can run on it in user space.
Table 7.2 gives specifics about the benchmarks I studied, including the functions targeted for fault injection. Each benchmark was profiled to determine the number of floating point addition (FADD), floating point multiplication (FMUL), and exclusive-or (XOR) operations. These results are shown in Figure 7.11 and are the basis for the instructions that are targeted in the following experiments.
1-D Fast Fourier Transform (FFT). After profiling the benchmark I chose to target the fft4b function. This function comprises a large percentage of the FADD and FMUL instructions. F-SEFI was configured to inject one and two single-bit errors into randomly selected FADD and FMUL instructions in the fft4b routine. The injection procedure is shown in Figure 7.12. Selected results from these four experiments are shown in Figure 7.13 and are presented in magnitude and phase. In each of the figures the thick blue line represents the correct output, without faults injected. The thinner red line shows what happens when F-SEFI injects faults into the specified regions.

FIGURE 7.11. Instruction profiles for the benchmarks studied. Each benchmark is reported as a whole application (coarse-grained) and one or two functions that were targeted (fine-grained). While both FFT and K-Means have a large number of FADD and FMUL instructions, the BMM benchmark is almost entirely XOR.
In Figure 7.13 I can see the differences in the high frequency area. This can be explained by signal processing theory: a transient, sharp peak or trough occurring at a certain time strengthens the power of the high frequency components. Therefore, the variation in FFT outputs implies a significant spike or trough in the input time series due to data corruption by F-SEFI. It is interesting that in the double-fault experiments for both FADD (Figure 7.13(c)) and FMUL (Figure 7.13(d)), two overlapping waves are visible.

For the FMUL faults shown in Figures 7.13(b) and 7.13(d) and the double FADD faults shown in Figure 7.13(c), the difference in the magnitude output is less significant compared with that of the single FADD fault. However, in the phase output the faults are distinguishable. In order to quantify the output difference introduced by F-SEFI, I use the forward error to compare the faulty 1-D FFT outputs with the original 1-D FFT outputs. The forward error is calculated as
(29) forward error = ‖F − O‖_n / ‖O‖_n,

where F and O are the faulty and original output vectors, respectively, ‖F − O‖_n is the L_n norm of the difference between the two output vectors, and ‖O‖_n is the L_n norm of the output vector of the original FFT. They are further calculated as:

(30) ‖F − O‖_n = (∑_{i=1}^{n} |F_i − O_i|^n)^{1/n}, ‖O‖_n = (∑_{i=1}^{n} |O_i|^n)^{1/n}.
The L2 error is also called the relative root mean square (RMS) error and, therefore, I use the relative RMS to quantify the impact of the injected soft errors on the application execution. The FFT problem size was varied and the above four fault injection experiments were performed. The results are shown in Figure 7.14. While it is visually obvious from Figure 7.13 that all of the faults I injected cause output differences, the RMS calculations show that the FADD faults caused the most noticeable output variations. Furthermore, as the problem size increases, the FADD faults cause more significant differences.
2-D FFT. I also tested a 2-D FFT for fault injection using F-SEFI, shown in Figure 7.15. In the 2-D FFT an image is transformed by FFT into the frequency domain. Then, an Inverse FFT (IFFT) algorithm is performed which converts it back to the original image. For easy visualization of the data corruption output I chose to use an 8x8 spiral gray image. This original image is shown in the left-most picture in Figure 7.16.

FIGURE 7.13. Comparative outputs with four different types of fault injections into the extended split-radix 1-D FFT algorithm. The output is represented in magnitude and phase (Bode plots), each faulty run compared against the fault-free output: (a) single FADD fault, (b) single FMUL fault, (c) double FADD fault, (d) double FMUL fault. The single FADD fault causes significant SDC in both magnitude and phase.

I chose to inject into the fft4b function during the FFT portion of the algorithm. This corrupts the original image in its conversion to the frequency domain. Then, the inverse FFT correctly converts the corrupted image back and I am able to visualize the output. For this experiment, single FADD and FMUL faults were injected, separately, to see how the image would be affected.
FIGURE 7.14. The relative root mean square (RMS) error of 1-D FFT outputs at different problem sizes (1024 to 32768), showing that for the faults I injected into FMUL instructions the output varied only slightly.
FIGURE 7.15. 2-D FFT algorithm with soft errors injected by F-SEFI: the original image is transformed by FFT, a soft error is injected, and the IFFT produces the faulty image.
The FADD fault appears in the center picture in Figure 7.16. I can see that the (x, y) locations (6, 4) and (6, 8) were affected by a single fault injection and exhibit data corruption. In contrast, the FMUL fault shown on the right in Figure 7.16 affects the image to a lesser extent. The difference cannot be distinguished visually, but careful analysis shows each pixel is on average 3% different from the original image. This difference may not be visually apparent, but for an application using this result it could be catastrophic.
FIGURE 7.16. 8x8 spiral images (panels: Original, Single FADD, Single FMUL) with FADD and FMUL fault injections; the FADD panel shows clear SDCs.

Bit-Matrix Multiply (BMM): bit-matrix multiplication is similar to numerical matrix multiplication, where numerical multiplications are replaced by bit-wise AND operations and numerical additions are replaced by bit-wise XOR operations. This algorithm is used in various fields including symmetric-key cryptography [108] and bioinformatics [3] [52]. BMM can be generally defined as:
(31) Y(m) = ∑_{n=0}^{N−1} B(m,n) A(n)

where A(n) (n = 0, 1, ..., N − 1) is the N-dimensional input vector, Y(m) (m = 0, 1, ..., M − 1) is the M-dimensional output vector, and B(m,n) is an M × N matrix.
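With AND for multiplication and XOR for addition, Equation 31 over 64-bit words reduces each output bit to the parity of a row of B ANDed with the input word. This is a hedged sketch of that definition, not VersaBench's implementation; the function name is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Bit-matrix multiply over GF(2): output bit m is the XOR-sum
 * (i.e. parity) of B's row m ANDed with the input word A, following
 * Equation 31 with AND as multiply and XOR as add. */
static uint64_t bmm64(const uint64_t B[64], uint64_t A)
{
    uint64_t y = 0;
    for (int m = 0; m < 64; m++) {
        uint64_t t = B[m] & A;
        /* fold t down to its parity bit */
        t ^= t >> 32; t ^= t >> 16; t ^= t >> 8;
        t ^= t >> 4;  t ^= t >> 2;  t ^= t >> 1;
        y |= (t & 1ULL) << m;
    }
    return y;
}
```

With B set to the identity matrix (row m holding only bit m), the product returns the input unchanged, which makes a convenient sanity check. A single flipped bit in any XOR along the way flips exactly one output bit, which is why BMM corruption counts stay small in the experiments below.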
Here I present fault injection results from the BMM benchmark in the VersaBench benchmark suite [78]. This benchmark uses a randomly generated input matrix of 2^17 64-bit elements. Each multiplication result is compressed to a 9-bit signature code as shown in Figure 7.17, and a 512-entry vector is used to statistically accumulate the frequency of code occurrences. This vector is used as a checksum for validation.

The BMM loop is repeated for 8 iterations and performs a total of 2^17 × 8 = 1,048,576 64-bit "multiply" operations (implemented as XOR as explained above). As shown in the profiling results in Figure 7.11, there are only XOR instructions in the BMM algorithm. Two core functions, bmm_update and maketable, contain 93% and 0.24% of the XOR operations, respectively. I inject one and two single-bit faults into the XOR instruction used in the functions bmm_update and maketable and compare the output checksum vector with that from a fault-free BMM run.

FIGURE 7.17. The Bit Matrix Multiply algorithm compresses the 64 bits of output to a 9-bit signature code used to checksum the result, through a cascade of AND, OR, and ADD steps on progressively narrower halves of the word.

For our fault injections the total number of corrupted outputs was very small (between 200 and 300 corrupted outputs out of approximately one million). I found that for the bmm_update function, when I injected two single-bit errors the number of corrupted outputs increased on average by 26%. For the maketable function I saw that two single-bit errors increased the number of corrupted outputs by an average of 47%. Whether this is significant enough to matter for this algorithm is difficult to tell without examining the results in the context of the parent application that uses this algorithm.
K-means Clustering: the K-means clustering algorithm (and variations of it) is used in a wide range of scientific and engineering fields including computer vision, astronomy, and biology when dealing with large data sets. K-means takes as input N k-dimensional particles to be clustered into n clusters. The cluster centers and the cluster membership vector (which cluster each particle belongs to) are produced as output.

There are two key functions in the K-means algorithm: kmeans_clustering and find_nearest_point. At the beginning of each iteration, the kmeans_clustering function updates the center of each cluster by calculating the centroid based on the distances between particles. For the newly generated cluster centers, the find_nearest_point function updates the membership of each particle by searching for the nearest cluster center.

As shown in Figure 7.11, the find_nearest_point function contains 82.13% of the FADD micro-operations and 100% of the FMUL micro-operations. The kmeans_clustering function has no FMUL micro-operations and has the remaining 17.87% of the FADD ones. I chose to inject faults into the FADD instruction in these two functions.

Results from these experiments appear in Table 7.3 and are shown visually in Figure 7.18. This experiment uses a small data set so that the corruptions are easily visible. The dataset consisted of randomly generated, uniformly distributed 2-D data points for clustering, and I chose to find five clusters.
TABLE 7.3. K-means clustering centroids with and without fault injection, showing the impact of corrupted data in the centroid calculations and clustering calculations for individual particles (300 particles).

Cluster Index | Center w/o Faults | Center w/ Faults (kmeans_clustering) | Center w/ Faults (find_nearest_point)
1 | (194.92, 66.32)  | (194.04, 67.07)    | (195.88, 66.29)
2 | (126.39, 183.90) | (68.26, 202.14)    | (126.39, 183.90)
3 | (44.53, 200.69)  | (52.38, 4.2E+19)   | (44.53, 200.69)
4 | (66.45, 68.25)   | (66.35, 76.43)     | (67.34, 68.25)
5 | (210.85, 181.85) | (195.63, 185.41)   | (210.85, 181.85)
By injecting a fault into the kmeans_clustering function, the centroid for cluster 3 was sent far from the other data points. This causes the data points to effectively cluster into only four groups, as shown in Figure 7.18(a) (the cluster 3 centroid is not shown for clarity).
When a fault was injected into the find_nearest_point function, a single particle became mislabeled. This is shown in Figure 7.18(b): the fault causes the particle to be grouped with cluster 4 instead of cluster 1. This also alters the centroids for those clusters, as can be seen in Table 7.3. This makes sense, as the find_nearest_point function compares distances between a particle and the cluster centroids. The particle is assigned to the closest cluster, and corrupting this calculation makes it incorrectly assign a single particle to the wrong cluster.
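The distance comparison at the heart of this behavior can be sketched as a nearest-centroid search. This is an illustrative reimplementation in the spirit of find_nearest_point, not Rodinia's actual code:

```c
#include <assert.h>
#include <float.h>

/* Assign a 2-D point to its nearest cluster center by squared
 * Euclidean distance. The accumulation `d += diff * diff` is dominated
 * by FADD and FMUL micro-operations, so a single corrupted FADD here
 * can flip the comparison and mislabel exactly one particle. */
static int find_nearest(const double pt[2], const double centers[][2],
                        int nclusters)
{
    int best = 0;
    double best_d = DBL_MAX;
    for (int c = 0; c < nclusters; c++) {
        double d = 0.0;
        for (int k = 0; k < 2; k++) {
            double diff = pt[k] - centers[c][k];
            d += diff * diff;   /* FMUL + FADD: the injection target */
        }
        if (d < best_d) { best_d = d; best = c; }
    }
    return best;
}
```

Because each call assigns only the one particle being examined, a transient fault in this function corrupts a single membership, whereas a fault in the centroid update propagates to every particle assigned in later iterations.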
0 50 100 150 200 2500
50
100
150
200
250
X Coordinate
Y C
oord
inat
e
K-Means Clustering Algorithm Without Faults
0 50 100 150 200 2500
50
100
150
200
250
X Coordinate
Y C
oord
inat
e
K-Means Clustering Algorithm with FADD Fault
(a) K-Means Clustering without (left) and with (right) one single-bit FADD fault injected into the
kmeans clustering function. The uniformly distributed data clusters into only four clusters
when one centroid is relocated far away.
[Two scatter plots with the same axes; the right panel marks a single "Mislabeled Particle".]
(b) K-Means Clustering without (left) and with (right) one single-bit FADD fault injected into the find_nearest_point function. A single particle is affected and gets mislabeled. This also causes the centroids for these clusters to change slightly.
FIGURE 7.18. Faults injected into two different functions of the K-Means Clustering algorithm cause different effects. Clusters are colored by cluster number and the centroids are marked by triangles.
Therefore, faults confined to this function will only affect a single particle, while faults that affect the kmeans_clustering function cause much larger differences.
To demonstrate this, I performed FADD injections into the kmeans_clustering function while scaling up the number of particles. These results are shown in Figure 7.19. For our uniformly distributed dataset, removing one of the cluster centroids causes approximately 28% of the data points to be clustered incorrectly, and the centroids for each cluster are then also wrong.
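This scaling behavior can be reproduced qualitatively: when one of the five centroids is relocated far away, roughly the points in that centroid's region get absorbed by the remaining clusters, so the mislabeled fraction stays about constant as the particle count grows. A sketch using synthetic uniform data and the Table 7.3 centroids (the exact fraction depends on the data distribution and centroid layout, so it need not match the 28% of the real experiment):

```python
import random

def nearest(p, centers):
    """Index of the center closest to point p (squared distance suffices)."""
    return min(range(len(centers)),
               key=lambda i: (p[0] - centers[i][0])**2 + (p[1] - centers[i][1])**2)

random.seed(0)
centers = [(194.92, 66.32), (126.39, 183.90), (44.53, 200.69),
           (66.45, 68.25), (210.85, 181.85)]
# Cluster 3's centroid is "sent far from the other data points".
corrupted = centers[:2] + [(52.38, 4.2e19)] + centers[3:]

fractions = []
for n in (300, 3000, 30000):
    pts = [(random.uniform(0, 250), random.uniform(0, 250)) for _ in range(n)]
    mislabeled = sum(nearest(p, centers) != nearest(p, corrupted) for p in pts)
    fractions.append(mislabeled / n)
    print(n, fractions[-1])  # fraction stays roughly constant as n grows
```

The mislabeled set is exactly the Voronoi cell of the lost centroid, which is why the count scales linearly with the number of particles.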
[Bar chart: Total Number of Particles (300; 3,000; 30,000) vs. Number of Mislabeled Particles (0-9,000), labeled "FADD fault in kmeans_clustering()". The bars read 82, 854, and 8376 mislabeled particles.]
FIGURE 7.19. The number of mislabeled particles in the K-Means Clustering Algorithm under fault injection as a function of the total number of particles. An FADD fault injected into kmeans_clustering causes about 28% of the particles to be mislabeled.
7.4. Discussions
Overhead: fault injection techniques are notoriously slow. Vendors use a combination of Register-Transfer Level (RTL) simulations and hardware fault injection to study chip behavior. Both approaches are limited: RTL simulations require proprietary information about hardware designs, and hardware-based approaches require specialized hardware. By contrast, our approach trades some fidelity for flexibility, working inside a freely available virtual machine and running in user space.
F-SEFI adds an overhead of about 30% on top of the QEMU virtualization overhead. In contrast to other virtualization-based fault injection techniques that add about 200% overhead, our approach compares favorably. The QEMU overhead itself can be substantial due to processor emulation and, in our experience, might add as much as a 200x slowdown over running an application natively. This performance impact is still generally smaller than that of RTL-based approaches.
However, it is important to realize that studying applications for their vulnerabilities is an offline exercise. Furthermore, processor emulation gives us considerable capability to study interesting new hardware approaches and how they affect reliability. Finally, if applications have a small enough memory footprint, they can be run concurrently inside the same guest OS, and F-SEFI can perform fault injection experiments in parallel without interaction between the processes.
Characterizing Fault Propagation: it is often difficult for a software designer to understand how an error might propagate through their application and what it might affect. F-SEFI provides a means for studying that propagation, which I demonstrated with the K-Means Clustering algorithm. For that algorithm, I saw that the kmeans_clustering function is particularly vital to protect, as a fault there has an impact on all of the centroids and mislabeled about 28% of the particles. In contrast, the find_nearest_point function only affects the classification of a single particle and some of the related centroids.
Based on analysis like this, the functions of an application can be divided into several
vulnerability zones that characterize the impact of soft errors. These vulnerability zones can then
be translated into a vulnerability map that quantifies with ranked scores how data corruption affects
individual portions of an application. This is valuable information for a programmer to decide
where to focus their attention to provide resilience techniques.
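Such a vulnerability map could be assembled by ranking per-function impact scores from an injection campaign. A minimal sketch: the function names echo this chapter's case study, but the euclid_dist_2 entry and all the rate numbers except the 28% and single-particle (1/300) figures are hypothetical:

```python
# Hypothetical per-function outcomes of a fault-injection campaign:
# fraction of injections causing an SDC, and mean fraction of the
# output affected when one does occur.
campaign = {
    "kmeans_clustering":  {"sdc_rate": 0.61, "output_affected": 0.28},
    "find_nearest_point": {"sdc_rate": 0.43, "output_affected": 1 / 300},
    "euclid_dist_2":      {"sdc_rate": 0.05, "output_affected": 0.001},
}

def vulnerability_map(campaign):
    """Rank functions by a simple impact score: rate times blast radius."""
    scored = {f: m["sdc_rate"] * m["output_affected"] for f, m in campaign.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

for func, score in vulnerability_map(campaign):
    print(f"{func:20s} {score:.4f}")
# kmeans_clustering ranks first: protect it before the others.
```

The scoring function is deliberately crude; any monotone combination of rate and impact yields a ranked map a programmer can act on.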
Effective SDC Outcomes: faults can cause application crashes, data corruption, and hangs, or have no impact at all (be benign). Many applications can tolerate crashes much better than data corruption, where getting the wrong answer can have drastic impacts on what actions are taken within a code or on the scientific integrity of a result. As such, in this work I have particularly focused on the "getting the right answer" portion of the problem. While it is possible to study crashes using F-SEFI by corrupting moves, jumps, and conditional instructions, I have not focused on that at this time.
One of the benefits of our approach is that, since F-SEFI can target faults with extreme precision, I can study how data corruption at specific points of an application affects the results. Also, our focus on causing data corruption rather than crashes makes large-scale studies more practical. Many similar approaches require an extremely large number of fault injections to cause data corruption. While those approaches are valuable for studying data corruption probabilities in a code, they make it difficult to study the question I am focused on with F-SEFI: how do corruptions at specific locations in an application cause it to behave? Our approach makes studying this question more effective and practical.
7.5. Summary
For a number of reasons, there is cause for concern about rates of silent data corruption in applications on next-generation machines. Many of the applications run on leadership-class supercomputers are intolerant of data corruption, and these applications are often used for science, where knowing you have the right answer is extremely important. As the HPC community goes through this transition period towards exascale, there is an opportunity to redesign applications so that they are made more resilient to these types of errors. However, knowing where to focus the attention has historically been difficult. Studies on the presence and impact of silent data corruption have been lacking due to the silent nature of the errors: they are difficult to reproduce and identify and are largely environmentally dependent. I have presented F-SEFI, which makes that effort easier by allowing an application programmer to survey where an application is vulnerable and where it is more resilient.
F-SEFI leverages a robust and actively developed open-source processor emulator, QEMU, to emulate faults as close to the hardware as is possible in software. By intercepting instructions during the translation of architectural instruction sets, F-SEFI is capable of injecting soft errors that cause incorrect execution results. F-SEFI can control when and how to inject the errors, into which function, and in which application, with different granularities. I demonstrated the use of F-SEFI on a variety of benchmark applications and have shown how data corruption can alter results. The tool is capable of conducting large campaign studies of fault injections into applications and provides access to a rich set of fault models. It can be useful both in studying new approaches to resilience at the hardware and software levels and in actually quantifying their benefits.
CHAPTER 8
CONCLUSION AND FUTURE WORK
8.1. Conclusion
This dissertation research aims to characterize and enhance cloud dependability and system
resilience. I summarize my work as follows.
8.1.1. Characterizing Cloud Dependability
As virtualization is an enabling technology for cloud computing, its impact on dependability must be well understood. The goal of this work is to assess cloud dependability in virtualized environments and compare it with that of traditional, non-virtualized systems. I propose a cloud dependability analysis (CDA) framework with mechanisms to characterize failure dynamics in cloud computing infrastructures. I have proposed failure-metric DAGs (directed acyclic graphs) to model and quantify the correlation of various performance metrics with failure events in virtualized and non-virtualized systems. I have investigated multiple types of failures, including CPU-, memory-, disk-, and network-related failures. By comparing the DAGs generated in the two environments, and the failure-related performance metric sets selected for each, I gain insight into the effects of virtualization on cloud dependability.
8.1.2. Detecting and Diagnosing Cloud Anomalies
Given ever-increasing cloud sizes coupled with the complexity of system components, continuous monitoring leads to an overwhelming volume of data collected by health monitoring tools. I address the metric selection problem for efficient and accurate anomaly detection in the cloud. I present a framework with metric selection and extraction mechanisms. The mutual-information-based approach selects metrics that maximize relevance to failures and minimize redundancy among themselves. The essential metrics are then further extracted by combining or separating dimensions of the selected metric space. The reduced dimensionality of the metric space significantly improves the computational efficiency of anomaly detection.
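The select-relevant, penalize-redundant criterion is in the spirit of minimum-redundancy maximum-relevance (mRMR) feature selection. A greedy sketch over discretized metrics, with made-up metric traces (this is not the dissertation's implementation, only the idea):

```python
import math
from collections import Counter

def mutual_info(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * math.log((c / n) / (px[x] / n * py[y] / n))
               for (x, y), c in pxy.items())

def select_metrics(metrics, label, k):
    """Greedy mRMR-style pick: relevance to label minus mean redundancy."""
    chosen = []
    while len(chosen) < k:
        def score(name):
            red = (sum(mutual_info(metrics[name], metrics[c]) for c in chosen)
                   / len(chosen)) if chosen else 0.0
            return mutual_info(metrics[name], label) - red
        best = max((m for m in metrics if m not in chosen), key=score)
        chosen.append(best)
    return chosen

# Toy traces: cpu_sys exactly duplicates cpu_util, so after cpu_util is
# picked, redundancy steers the second pick to the weaker but novel net_io.
label   = [0, 0, 1, 1, 0, 1, 0, 1]
metrics = {"cpu_util": [0, 0, 1, 1, 0, 1, 1, 1],
           "cpu_sys":  [0, 0, 1, 1, 0, 1, 1, 1],
           "net_io":   [0, 1, 1, 0, 0, 1, 0, 1]}
print(select_metrics(metrics, label, 2))  # → ['cpu_util', 'net_io']
```

Continuous metrics would first be discretized (or estimated with a continuous MI estimator); the greedy loop itself is unchanged.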
To detect performance anomalies, I propose a wavelet-based multi-scale cloud anomaly identification mechanism with learning-aided mother wavelet selection and sliding detection windows. Unlike other anomaly identification approaches, it does not require prior knowledge of failure distributions, it can self-adapt by learning from observed failures at runtime, and it analyzes both the time and frequency domains to identify anomalous cloud behaviors.
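The time-frequency idea can be illustrated with a fixed Haar mother wavelet and one decomposition level (the actual mechanism selects the mother wavelet by learning and works across multiple scales; the metric trace below is synthetic):

```python
import math

def haar_detail(window):
    """One-level Haar decomposition: the detail (high-frequency) coefficients."""
    return [(window[i] - window[i + 1]) / math.sqrt(2.0)
            for i in range(0, len(window) - 1, 2)]

def sliding_anomaly_scores(series, win=8):
    """Score each sliding window by the energy of its detail coefficients:
    a smooth trend scores low, an abrupt deviation scores high."""
    scores = []
    for start in range(len(series) - win + 1):
        d = haar_detail(series[start:start + win])
        scores.append(sum(c * c for c in d))
    return scores

# Smooth periodic metric trace with one abrupt spike injected at t = 40.
trace = [math.sin(2 * math.pi * t / 32) for t in range(64)]
trace[40] += 5.0
scores = sliding_anomaly_scores(trace)
print(scores.index(max(scores)))  # a window containing the spike scores highest
```

The detail coefficients isolate the frequency content while the sliding window keeps the time localization, which is exactly why an abrupt anomaly stands out against a smooth workload trend.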
To diagnose the identified anomalies, I start by analyzing the correlation between principal components (PCs) and failure occurrences, where I find that the PCs retaining the highest variance cannot effectively characterize the failure events, while lower-order PCs display high correlation with occurrences of failures. I propose to exploit the most relevant principal components (MRPCs) to describe failure events and devise a learning-based approach that leverages MRPCs to diagnose cloud anomalies.
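The effect is easy to see in a two-metric toy example where the principal directions can be written by hand (a 45-degree rotation): the high-variance component tracks the workload, while the low-variance component captures the failure signature. All data below is fabricated for illustration:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Two metrics that rise together with the workload, plus a small
# divergence between them whenever a failure occurs.
workload = [10, 20, 30, 40, 50, 60, 70, 80]
failure  = [0, 1, 0, 0, 0, 0, 1, 0]
m1 = [w + 2 * f for w, f in zip(workload, failure)]
m2 = [w - 2 * f for w, f in zip(workload, failure)]

pc1 = [(a + b) / math.sqrt(2) for a, b in zip(m1, m2)]  # highest-variance PC
pc2 = [(a - b) / math.sqrt(2) for a, b in zip(m1, m2)]  # low-variance PC

print(round(pearson(pc1, failure), 2))  # near 0: top PC tracks workload only
print(round(pearson(pc2, failure), 2))  # near 1: the lower-order PC is the MRPC
```

Keeping only the top-variance PCs would discard pc2 and with it the failure signal, which is precisely why MRPCs are selected by correlation with failures rather than by variance.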
8.1.3. Soft Error Fault Injection
To assess application vulnerability, I develop a fine-grained soft error fault injector, F-SEFI. F-SEFI leverages a robust and actively developed open-source processor emulator, QEMU, to emulate faults as close to the hardware as is possible in software. By intercepting instructions during the translation of architectural instruction sets, F-SEFI is capable of injecting soft errors that cause incorrect execution results. F-SEFI can control when and how to inject the errors, into which function, and in which application, with different granularities. I demonstrated the use of F-SEFI on a variety of benchmark applications and have shown how data corruption can alter results. The tool is capable of conducting large campaign studies of fault injections into applications and provides access to a rich set of fault models. It can be useful both in studying new approaches to resilience at the hardware and software levels and in actually quantifying their benefits.
8.1.4. List of Publications in My PhD Study
1. Qiang Guan and Song Fu, "Exploring Time and Frequency Domains for Accurate and Automated Anomaly Detection in Cloud Computing Systems", in Proceedings of the 19th IEEE/IFIP International Symposium on Dependable Computing (PRDC), 10 pages, 2013.
2. Qiang Guan and Song Fu, "Wavelet-Based Multi-scale Anomaly Identification in Cloud Computing Systems", in Proceedings of the IEEE Global Communications Conference (GlobeCom), 6 pages, 2013.
3. Qiang Guan and Song Fu, "Adaptive Anomaly Identification by Exploring Metric Subspace in Cloud Computing Infrastructures", in Proceedings of the 32nd IEEE International Symposium on Reliable Distributed Systems (SRDS), 10 pages, 2013.
4. Husanbir S. Pannu, Jianguo Liu, Qiang Guan and Song Fu, "AFD: Adaptive Failure Detection System for Cloud Computing Infrastructures", in Proceedings of the 31st IEEE International Performance Computing and Communications Conference (IPCCC), 10 pages, 2012.
5. Ziming Zhang, Qiang Guan and Song Fu, "An Adaptive Power Management Framework for Autonomic Resource Configuration in Cloud Computing Infrastructures", in Proceedings of the 31st IEEE International Performance Computing and Communications Conference (IPCCC), 10 pages, 2012.
6. Qiang Guan, Chi-Chen Chiu and Song Fu, "A Cloud Dependability Analysis Framework for Characterizing System Dependability in Cloud Computing Infrastructures", in Proceedings of the 18th IEEE/IFIP International Symposium on Dependable Computing (PRDC), 10 pages, 2012.
7. Qiang Guan, Ziming Zhang and Song Fu, "A Failure Detection and Prediction Mechanism for Enhancing Dependability of Data Centers", in International Journal of Computer Theory and Engineering, 4 (5), 726-730, 2012.
8. Qiang Guan, Ziming Zhang and Song Fu, "Ensemble of Bayesian Predictors and Decision Trees for Proactive Failure Management in Cloud Computing Systems", in Journal of Communications, 7 (1), 52-61, 2012.
9. Qiang Guan, Ziming Zhang and Song Fu, "Efficient and Accurate Anomaly Identification Using Reduced Metric Space in Utility Clouds", in Proceedings of the IEEE International Conference on Networking, Architecture, and Storage (NAS), 10 pages, 2012.
10. Qiang Guan, Ziming Zhang and Song Fu, "Proactive Failure Management by Integrated Unsupervised and Semi-Supervised Learning for Dependable Cloud Systems", in Proceedings of the IEEE International Conference on Availability, Reliability and Security (ARES), 2011.
11. Nathan DeBardeleben, Sean Blanchard, Qiang Guan, Ziming Zhang, and Song Fu, "Experimental Framework for Injecting Logic Errors in a Virtual Machine to Profile Applications for Soft Error Resilience", in Proceedings of Resilience, International European Conference on Parallel and Distributed Computing (Euro-Par), 2011.
12. Qiang Guan, Ziming Zhang and Song Fu, "Ensemble of Bayesian Predictors for Autonomic Failure Management in Cloud Computing", in Proceedings of the 20th IEEE International Conference on Computer Communications and Networks (ICCCN), 2011.
13. Qiang Guan and Song Fu, "auto-AID: A Data Mining Framework for Autonomic Anomaly Identification in Networked Computer Systems", in Proceedings of the 29th IEEE International Performance Computing and Communications Conference (IPCCC), 2010.
14. Qiang Guan, Derek Smith and Song Fu, "Anomaly Detection in Large-Scale Coalition Clusters for Dependability Assurance", in Proceedings of the 17th IEEE International Conference on High Performance Computing (HiPC), 2010.
15. Derek Smith, Qiang Guan and Song Fu, "An Anomaly Detection Framework for Autonomic Management of Compute Cloud Systems", in Proceedings of CloudApp, the 34th IEEE International Conference on Computer Software and Applications (COMPSAC), 2010.
8.2. Future Work
With the rapid development of and ever-increasing demands on cloud computing, it becomes more and more important to build dependable cloud computing infrastructures that sustain the trust between users and cloud service providers. Dependability-as-a-Service (DaaS) would be a solution for users that require highly dependable cloud services. To achieve this goal, two areas should be considered. I need to develop failure-aware resource management mechanisms that proactively avoid assigning applications to, or migrate VMs away from, cloud servers that are about to fail, while still satisfying the service level agreements. Moreover, efforts should be made towards designing resilient programming models and libraries to ensure the correctness of cloud services and applications.
8.2.1. Self-Adaptive Failure-Aware Resource Management in the Cloud
Based on the understanding of cloud dependability, resilience against service overload, hardware failures, software bugs, and operator errors should be studied by answering the following questions: how many resources should be adaptively allocated to an application that will run on multiple VMs across cloud servers, and how should applications be migrated and resources reconfigured when failures occur, in order to satisfy the requirements of cloud dependability and performance with low overhead? To address these questions, a self-adaptive failure-aware resource management system needs to possess a number of properties: 1) the capability to adapt to cloud dynamics in virtualized environments; 2) the ability to self-learn over various applications and cloud infrastructures; 3) low overhead, preserving system performance during failure management and resource configuration operations.
Current resource management techniques manage cloud resources based on the performance of systems under different intensities of workloads and applications. I can optimize the cloud system by allocating cloud applications to VMs or cloud nodes with a guarantee of dependability, proactively avoiding Service Level Agreement (SLA) violations or inefficient resource utilization due to varying hardware/software health conditions and workload fluctuations. Furthermore, these new resource management strategies could also benefit the energy efficiency of large-scale cloud systems.
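One simple form of such a failure-aware policy is to place each VM on the node with the lowest predicted failure probability among those with enough spare capacity. A sketch under that assumption (node names, capacities, and failure probabilities are all invented; a real system would also weigh SLAs, migration cost, and energy):

```python
def place_vm(vm_demand, nodes):
    """Pick the node with the lowest predicted failure probability
    among those that can hold the VM without overcommitment."""
    feasible = [n for n in nodes if n["free"] >= vm_demand]
    if not feasible:
        raise RuntimeError("no node can host the VM; consider scale-out")
    best = min(feasible, key=lambda n: n["p_fail"])
    best["free"] -= vm_demand
    return best["name"]

nodes = [
    {"name": "node-a", "free": 8,  "p_fail": 0.20},  # healthy but nearly full
    {"name": "node-b", "free": 16, "p_fail": 0.02},  # healthy, plenty of room
    {"name": "node-c", "free": 32, "p_fail": 0.70},  # predicted to fail soon
]
print(place_vm(4, nodes))   # → node-b (lowest failure risk with capacity)
print(place_vm(16, nodes))  # → node-c (now the only node with enough room)
```

The second placement shows the tension the dissertation points at: when capacity runs out, the scheduler is forced onto risky nodes, which is where proactive migration and SLA-aware reconfiguration must step in.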
8.2.2. Tolerating Silent Data Corruptions in Large Scale Computing Systems
As programming languages and models have been used to reduce the complexity of parallel programming, I plan to leverage a new programming model that guarantees the correctness of applications running on unreliable hardware to enhance application-level resilience. This will free applications from the threat of silent data corruptions (SDCs). Based on our developed F-SEFI soft error fault injector, I plan to achieve application-level SDC fault tolerance by preventing and correcting SDCs.
Soft Error SDC Prevention: I propose to design a programming model for developing resilient applications. For example, a standard library can be extended to a "resilient" library by defining resilient patterns and abstractions that are robust to SDCs. This can be achieved by comprehensively studying the implementation of these functions and determining the resilient features of those functions. These features will then be quantified and categorized as guidelines for designing a resilient programming model. This idea can be further explored at the algorithm level: F-SEFI will help to profile different implementations of an algorithm and find the most resilient one.
Soft Error SDC Correction: I also plan to develop an "on-site" correction mechanism that uses redundant resources to execute error-prone instructions. The vulnerable portions (either subroutines or floating-point operands) can be identified by vulnerability profiling. The proposed SDC fault tolerance mechanisms will facilitate the design of next-generation exascale computing systems by enhancing their resilience, especially when service/system providers cannot keep the SDC rate low at an acceptable performance cost.
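The on-site correction idea can be sketched as redundant execution with majority voting over the vulnerable computation. The fragile_sum function below is a stand-in for a kernel flagged by vulnerability profiling, and the staged "bit flip" in its second replica is purely illustrative:

```python
from collections import Counter

def vote(results):
    """Majority vote over redundant executions; no majority means the
    corruption was detected but cannot be corrected by voting alone."""
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("SDC detected but not correctable by voting")
    return value

def redundant(fn, replicas=3):
    """Wrap an error-prone computation: run it several times and vote."""
    def wrapped(*args):
        return vote([fn(*args) for _ in range(replicas)])
    return wrapped

calls = {"n": 0}
def fragile_sum(xs):
    calls["n"] += 1
    result = sum(xs)
    if calls["n"] == 2:       # the second replica suffers a simulated bit flip
        result += 2 ** 40
    return result

safe_sum = redundant(fragile_sum)
print(safe_sum([1, 2, 3]))  # → 6: the corrupted replica is outvoted
```

Voting on exact equality suits integer results; floating-point kernels would instead vote within a tolerance or compare checksummed intermediate state, and the replica count trades overhead against the SDC rate the profile predicts.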
109
BIBLIOGRAPHY
[1] S. Agarwala, F. Alegre, K. Schwan, and J. Mehalingham, E2EProf: Automated end-to-
end performance management for enterprise systems, Proc. of IEEE/IFIP Intl. Conf. on
Dependable Systems and Networks (DSN), 2007.
[2] Sandip Agarwala and Karsten Schwan, Sysprof: Online distributed behavior diagnosis
through fine-grain system monitoring, Proc. of IEEE Intl. Conf. on Distributed Computing
Systems (ICDCS), 2006.
[3] Tatsuya Akutsu, Satoru Miyano, and Satoru Kuhara, Algorithms for identifying boolean net-
works and related biological networks based on matrix multiplication and fingerprint func-
tion, Proceedings of the fourth annual international conference on Computational molecular
biology (New York, NY, USA), RECOMB ’00, ACM, 2000, pp. 8–14.
[4] Samer Al-Kiswany, Dinesh Subhraveti, Prasenjit Sarkar, and Matei Ripeanu, Vmflock: vir-
tual machine co-migration for the cloud, Proc. of ACM Intl. Symp. on High Performance
Distributed Computing (HPDC), 2011.
[5] Edoardo Amaldi and Viggo Kann, On the approximability of minimizing nonzero variables
or unsatisfied relations in linear systems, Theoretical Computer Science 209 (1998), no. 1-2,
237–260.
[6] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy Katz, Andy
Konwinski, Gunho Lee, David Patterson, Ariel Rabkin, Ion Stoica, and Matei Zaharia, A
view of cloud computing, Communications of the ACM 53 (2010), no. 4, 50–58.
[7] Mona Attariyan, Michael Chow, and Jason Flinn, X-ray: automating root-cause diagnosis
of performance anomalies in production software, Proceedings of the 10th USENIX confer-
ence on Operating Systems Design and Implementation, OSDI’12, 2012.
[8] Paul Barford, Jeffery Kline, David Plonka, and Amos Ron, A signal analysis of network
traffic anomalies, Proc. of ACM SIGCOMM Workshop on Internet measurment, 2002.
[9] David Bernick, Bill Bruckert, Paul Del Vigna, David Garcia, Robert Jardine, Jim Klecka,
110
and Jim Smullen, Nonstop advanced architecture, Proc. of IEEE Conf. on Dependable Sys-
tems and Networks (DSN), 2005.
[10] Sapan Bhatia, Abhishek Kumar, Marc E. Fiuczynski, and Larry Peterson, Lightweight,
high-resolution monitoring for troubleshooting production systems, Proceedings of the 8th
USENIX conference on Operating systems design and implementation, OSDI’08, 2008.
[11] Boualem Boashash, Time frequency signal analysis and processing : a comprehensive ref-
erence, Elsevier, 2003.
[12] D. Brauckhoff, K. Salamatian, and M. May, Applying pca for traffic anomaly detection:
Problems and solutions, INFOCOM 2009, IEEE, 2009, pp. 2866–2870.
[13] Daniela Brauckhoff, Kave Salamatian, and Martin May, A signal processing view on packet
sampling and anomaly detection, Proc. of IEEE Intl. Conf. on Information Communications
(INFOCOM), 2010.
[14] Emmanuel Cecchet, Julie Marguerite, and Willy Zwaenepoel, Performance and scalabil-
ity of EJB applications, Proc. of ACM Conf. on Object-Oriented Programming, Systems,
Languages, and Applications (OOPSLA), 2002.
[15] Varun Chandola, Arindam Banerjee, and Vipin Kumar, Anomaly detection: A survey, ACM
Computing Surveys 41 (2009), no. 3, 15:1–15:58.
[16] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee,
and Kevin Skadron, Rodinia: A benchmark suite for heterogeneous computing, Proceed-
ings of the 2009 IEEE International Symposium on Workload Characterization (IISWC)
(Washington, DC, USA), IISWC ’09, IEEE Computer Society, 2009, pp. 44–54.
[17] Mike Y. Chen, Emre Kiciman, Eugene Fratkin, Armando Fox, and Eric Brewer, Pinpoint:
Problem determination in large, dynamic internet services, Proc. of IEEE/IFIP Intl. Conf.
on Dependable Systems and Networks (DSN), 2002.
[18] Jim Chow, Dominic Lucchetti, Tal Garfinkel, Geoffrey Lefebvre, Ryan Gardner, Joshua
Mason, Sam Small, and Peter M. Chen, Multi-stage replay with crosscut, Proceedings of the
6th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments,
VEE ’10, 2010.
111
[19] Ira Cohen, Moises Goldszmidt, Terence Kelly, Julie Symons, and Jeffrey S. Chase, Cor-
relating instrumentation data to system states: a building block for automated diagnosis
and control, Proc. of USENIX Symp. on Opearting Systems Design and Implementation
(OSDI), 2004.
[20] J.M. Combes, A. Grossmann, and P. Tchamitchian, Wavelets: time-frequency methods and
phase space, Springer-Verlag, 1990.
[21] Thomas M. Cover and Joy A. Thomas, Elements of information theory, Wiley, New York,
1991.
[22] Brendan Cully, Geoffrey Lefebvre, Dutch Meyer, Mike Feeley, Norm Hutchinson, and An-
drew Warfield, Remus: high availability via asynchronous virtual machine replication, Proc.
of USENIX Symp. on Networked Systems Design and Implementation (NSDI), 2008.
[23] Yuan-Shun Dai, Bo Yang, Jack Dongarra, and Gewei Zhang, Cloud service reliability:
Modeling and analysis, Proc. of IEEE Pacific Rim Intl. Symp. on Dependable Computing
(PRDC), 2009.
[24] Jeffrey Dean and Sanjay Ghemawat, Mapreduce: simplified data processing on large clus-
ters, Communications of the ACM 51 (2008), no. 1, 107–113.
[25] Nathan DeBardeleben, James Laros, John T. Daly, Stephen L. Scott, Christian Engelmann,
and Bill Harrod, High-end computing resilience: Analysis of issues facing the HEC commu-
nity and path-forward for research and development, Whitepaper, December 2009.
[26] Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern classification, Wiley-
Interscience, 2000.
[27] Daniel Ford, Francois Labelle, Florentina I. Popovici, Murray Stokely, Van-Anh Truong,
Luiz Barroso, Carrie Grimes, and Sean Quinlan, Availability in globally distributed stor-
age systems, Proceedings of the 9th USENIX conference on Operating systems design and
implementation, OSDI’10, 2010, pp. 1–7.
[28] S. Fu, Dependability enhancement for coalition clusters with autonomic failure manage-
ment, Proc. of IEEE Intl. Symp. on Computers and Communications (ISCC), 2010.
[29] Song Fu, Failure-aware construction and reconfiguration of distributed virtual machines
112
for high availability computing, Proc. of IEEE/ACM Intl. Symp. on Cluster Computing and
the Grid (CCGrid), 2009.
[30] Song Fu and Chengzhong Xu, Exploring event correlation for failure prediction in coali-
tions of clusters, Proc. of ACM/IEEE Supercomputing Conf. (SC), 2007.
[31] Sachin Garg, Antonio Puliafito, and Kishor S. Trivedi, Analysis of software rejuvenation
using markov regenerative stochastic petri net, Proc. of IEEE Intl. Symp. on Software Reli-
ability Engineering (ISSRE), 1995.
[32] Q. Guan, Z. Zhang, and S. Fu, Proactive failure management by integrated unsupervised
and semi-supervised learning for dependable cloud systems, Proc. of IEEE Intl. Conf. on
Availability, Reliability and Security (ARES), 2011.
[33] Qiang Guan, Chi-Chen Chiu, and Song Fu, A cloud dependability analysis framework for
assessing system dependability in cloud computing infrastructures, Proc. of IEEE/IFIP Pa-
cific Rim Intl. Symp. on Dependable Computing (PRDC), 2012.
[34] Qiang Guan, Chi-Chen Chiu, Ziming Zhang, and Song Fu, Efficient and accurate anomaly
identification using reduced metric space in utility clouds, Proc. of IEEE Intl. Conf. on
Networking, Architecture and Storage (NAS), 2012.
[35] Qiang Guan, Ziming Zhang, and Song Fu, Ensemble of bayesian predictors and decision
trees for proactive failure management in cloud computing systems, Journal of Communi-
cations 7 (2012), no. 1, 52–61.
[36] G Hamerly and C Elkan, Bayesian approaches to failure prediction for disk drives, Proceed-
ings of the Eighteenth International Conference on Machine Learning, ICML ’01, 2001.
[37] Greg Hamerly and Charles Elkan, Bayesian approaches to failure prediction for disk drives,
Proc. of Conf. on Machine Learning (ICML), 2001.
[38] J. Han and M. Kamber, Data mining: Concepts and techniques, Morgan Kaufmann Pub-
lishers Inc., 2005.
[39] Jacob Gorm Hansen and Eric Jul, Lithium: virtual machine storage for the cloud, Proc. of
ACM Symp. on Cloud Computing (SOCC), 2010.
[40] Taliver Heath, Richard P. Martin, and Thu D. Nguyen, Improving cluster availability using
113
workstation validation, Proc. of ACM Intl. Conf. on Measurement and modeling of com-
puter systems (SIGMETRICS), 2002.
[41] Joseph L. Hellerstein, Fan Zhang, and Perwez Shahabuddin, A statistical approach to pre-
dictive detection, Computer Networks: The Intl. Journal of Computer and Telecommunica-
tions Networking 35 (2001), no. 1, 77–95.
[42] Victoria Hodge and Jim Austin, A survey of outlier detection methodologies, Artificial In-
telligence Review 22 (2004), 85–126.
[43] M. Hsueh, T.K. Tsai, and R.K. Iyer, Fault injection techniques and tools, IEEE Computer
30 (1997), 75–82.
[44] Alan Jeffrey, Advanced engineering mathematics, Academic Press, 2001.
[45] Kaustabh Joshi, Guy Bunker, Farnham Jahanian, Aard van Moorsel, and Joe Weinman,
Dependability in the cloud: Challenges and opportunities, Proc. of IEEE/IFIP Intl. Conf.
on Dependable Systems and Networks (DSN), 2009.
[46] Pallavi Joshi, Haryadi S. Gunawi, and Koushik Sen, PREFAIL: a programmable tool for
multiple-failure injection, Proc. of ACM Intl. Conf. on Object Oriented Programming Sys-
tems Languages and Applications (OOPSLA), 2011.
[47] R. E. Kalman, Transactions of the ASME Journal of Basic Engineering, no. 82 (Series D),
35–45.
[48] Himanshu Kaul, Mark Anders, Steven Hsu, Amit Agarwal, Ram Krishnamurthy, and
Shekhar Borkar, Near-threshold voltage (ntv) design: opportunities and challenges, Pro-
ceedings of the 49th Annual Design Automation Conference (New York, NY, USA), DAC
’12, ACM, 2012, pp. 1153–1158.
[49] S.P. Kavulya, S. Daniels, K. Joshi, M. Hiltunen, R. Gandhi, and P. Narasimhan, Draco:
Statistical diagnosis of chronic problems in large distributed systems, 2012 42nd Annual
IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2012.
[50] Kamal Kc and Xiaohui Gu, Elt: Efficient log-based troubleshooting system for cloud com-
puting infrastructures, Proceedings of the 2011 IEEE 30th International Symposium on
Reliable Distributed Systems, SRDS ’11, 2011, pp. 11–20.
114
[51] Jeffrey O. Kephart and David M. Chess, The vision of autonomic computing, IEEE Com-
puter 36 (2003), no. 1, 41–50.
[52] Mehmet Koyuturk, Wojciech Szpankowski, and Ananth Grama, Biclustering gene-feature
matrices for statistically significant dense patterns, Proceedings of the 2004 IEEE Com-
putational Systems Bioinformatics Conference (Washington, DC, USA), CSB ’04, IEEE
Computer Society, 2004, pp. 480–484.
[53] Anukool Lakhina, Mark Crovella, and Christophe Diot, Diagnosing network-wide traffic
anomalies, Proceedings of the 2004 conference on Applications, technologies, architectures,
and protocols for computer communications, SIGCOMM ’04, 2004, pp. 219–230.
[54] Zhiling Lan, Jiexing Gu, Ziming Zheng, Rajeev Thakur, and Susan Coghlan, A study of
dynamic meta-learning for failure prediction in large-scale systems, Journal of Parallel and
Distributed Computing 70 (2010), no. 6, 630–643.
[55] Zhiling Lan, Ziming Zheng, and Yawei Li, Toward automated anomaly identification in
large-scale systems, Parallel and Distributed Systems, IEEE Transactions on 21 (2010),
no. 2, 174–187.
[56] J. Lange, K. Pedretti, T. Hudson, P. Dinda, Zheng Cui, Lei Xia, P. Bridges, A. Gocke,
S. Jaconette, M. Levenhagen, and R. Brightwell, Palacios and kitten: New high performance
operating systems for scalable virtualized and native supercomputing, Parallel Distributed
Processing (IPDPS), 2010 IEEE International Symposium on, 2010, pp. 1–12.
[57] C. Lattner and V. Adve, Llvm: a compilation framework for lifelong program analysis trans-
formation, Code Generation and Optimization, 2004. CGO 2004. International Symposium
on, 2004, pp. 75–86.
[58] Michael Le and Yuval Tamir, Fault injection in virtualized systems - challenges and appli-
cations, IEEE Transactions on Dependable and Secure Computing 99 (2014), no. PrePrints,
1.
[59] Matthew Leeke, Saima Arif, Arshad Jhumka, and Sarabjot Singh Anand, A methodology for
the generation of efficient error detection mechanisms, Proc. of IEEE/IFIP SIntl. Conf. on
Dependable Systems and Networks(DSN), 2011.
115
[60] Scott Levy, Matthew G. F. Dosanjh, Patrick G. Bridges, and Kurt B. Ferreira, Using unreli-
able virtual hardware to inject errors in extreme-scale systems, FTXS’13, 2013, pp. 21–26.
[61] Dong Li, J.S. Vetter, and Weikuan Yu, Classifying soft error vulnerabilities in extreme-scale
scientific applications using a binary instrumentation tool, High Performance Computing,
Networking, Storage and Analysis (SC), 2012 International Conference for, 2012, pp. 1–11.
[62] Zhichun Li, Ming Zhang, Zhaosheng Zhu, Yan Chen, Albert Greenberg, and Yi-Min Wang,
WebProphet: Automating performance prediction for web services, Proceedings of the 7th
USENIX Conference on Networked Systems Design and Implementation (NSDI), 2010.
[63] Y. Liang, Y. Zhang, A. Sivasubramaniam, R. Sahoo, J. Moreira, and M. Gupta, Filtering
failure logs for a BlueGene/L prototype, Proc. of IEEE Conf. on Dependable Systems and
Networks (DSN), 2005.
[64] F. Longo, R. Ghosh, V.K. Naik, and K.S. Trivedi, A scalable availability model for
infrastructure-as-a-service cloud, Proc. of IEEE Conf. on Dependable Systems and Net-
works (DSN), 2011.
[65] Lucas Wanner, Salma Elmalaki, Liangzhen Lai, Puneet Gupta, and Mani Srivastava,
VarEMU: An emulation testbed for variability-aware software, Proceedings of the 11th
International Conference on Hardware/Software Codesign and System Synthesis
(CODES+ISSS), 2013.
[66] Chi-Keung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney,
Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood, Pin: building customized pro-
gram analysis tools with dynamic instrumentation, Proceedings of the 2005 ACM SIG-
PLAN conference on Programming language design and implementation (New York, NY,
USA), PLDI ’05, ACM, 2005, pp. 190–200.
[67] Jordan McBain and Markus Timusk, Feature extraction for novelty detection as applied to
fault detection in machinery, Pattern Recogn. Lett. 32 (2011), no. 7, 1054–1061.
[68] James W. Mickens and Brian D. Noble, Exploiting availability prediction in distributed sys-
tems, Proc. of USENIX Symp. on Networked Systems Design and Implementation (NSDI),
2006.
[69] Joseph F. Murray, Gordon F. Hughes, and Dale Schuurmans, Machine learning methods
for predicting failures in hard drives: A multiple-instance application, Journal of Machine
Learning Research 6 (2005), 783–816.
[70] Adam Oliner and Jon Stearley, What supercomputers say: A study of five system logs, Proc.
of IEEE Conf. on Dependable Systems and Networks (DSN), 2007.
[71] Husanbir Pannu, Jianguo Liu, and Song Fu, AAD: Adaptive anomaly detection system for
cloud computing infrastructures, Proc. of IEEE Symp. on Reliable Distributed Systems
(SRDS), 2012.
[72] Eunbyung Park, Bernhard Egger, and Jaejin Lee, Fast and space-efficient virtual machine
checkpointing, Proc. of ACM Intl. Conf. on Virtual Execution Environments (VEE), 2011.
[73] K. Patel, S. Parameswaran, and R.G. Ragel, Architectural frameworks for security and
reliability of MPSoCs, IEEE Transactions on Very Large Scale Integration (VLSI) Systems
19 (2011), no. 9, 1641–1654.
[74] Wei Peng and Tao Li, Mining logs files for computing system management, Proc. of IEEE
Intl. Conf. on Autonomic Computing (ICAC), 2005.
[75] perf Subsystem, perf: Linux profiling with performance counters, Available at:
http://perf.wiki.kernel.org/.
[76] Cuong Pham, Daniel Chen, Zbigniew Kalbarczyk, and Ravishankar K. Iyer, Cloudval: A
framework for validation of virtualization environment in cloud infrastructure, Proc. of
IEEE/IFIP Intl. Conf. on Dependable Systems and Networks (DSN), 2011.
[77] Guangzhi Qu, S. Hariri, and M. Yousif, A new dependency and correlation analysis for
features, IEEE Transactions on Knowledge and Data Engineering 17 (2005), no. 9,
1199–1207.
[78] Rodric Rabbah and Anant Agarwal, Versatility and VersaBench: A new metric and a
benchmark suite for flexible architectures, 2004.
[79] Russell D. Reed and Robert J. Marks, Neural smithing: Supervised learning in feedforward
artificial neural networks, MIT Press, 1998.
[80] Charles Reiss, John Wilkes, and Joseph L. Hellerstein, Google cluster-usage traces: format
+ schema, Tech. report, Google Inc., November 2011.
[81] M. Rosenblum and T. Garfinkel, Virtual machine monitors: current technology and future
trends, IEEE Computer 38 (2005), no. 5, 39–47.
[82] R. K. Sahoo, A. J. Oliner, I. Rish, M. Gupta, J. E. Moreira, S. Ma, R. Vilalta, and A. Siva-
subramaniam, Critical event prediction for proactive management in large-scale computer
clusters, Proc. of ACM Intl. Conf. on Knowledge Discovery and Data Mining (KDD), 2003.
[83] Ramendra K. Sahoo, Anand Sivasubramaniam, Mark S. Squillante, and Yanyong Zhang,
Failure data analysis of a large-scale heterogeneous server environment, Proc. of IEEE
Conf. on Dependable Systems and Networks (DSN), 2004.
[84] Felix Salfner, Maren Lenk, and Miroslaw Malek, A survey of online failure prediction meth-
ods, ACM Computing Surveys 42 (2010), no. 3, 10:1–10:42.
[85] Felix Salfner and Miroslaw Malek, Using hidden semi-markov models for effective online
failure prediction, Proceedings of the 26th IEEE International Symposium on Reliable Dis-
tributed Systems, SRDS ’07, 2007, pp. 161–174.
[86] S.K. Sastry Hari, S.V. Adve, H. Naeimi, and P. Ramachandran, Relyzer: Application
resiliency analyzer for transient faults, IEEE Micro 33 (2013), no. 3, 58–66.
[87] Bianca Schroeder and Garth A. Gibson, A large-scale study of failures in high-performance
computing systems, IEEE Transactions on Dependable and Secure Computing 7 (2010),
337–351.
[88] Kai Shen, Christopher Stewart, Chuanpeng Li, and Xin Li, Reference-driven performance
anomaly identification, SIGMETRICS Perform. Eval. Rev. 37 (2009), no. 1.
[89] Mei-Ling Shyu, Shu-Ching Chen, Kanoksri Sarinnapakorn, and LiWu Chang, Principal
component-based anomaly detection scheme, Foundations and Novel Approaches in Data
Mining (Tsau Young Lin, Setsuo Ohsuga, Churn-Jung Liau, and Xiaohua Hu, eds.), Studies
in Computational Intelligence, vol. 9, Springer Berlin Heidelberg, 2006, pp. 311–329.
[90] James E. Smith and Ravi Nair, The architecture of virtual machines, IEEE Computer 38
(2005), no. 5, 32–38.
[91] Marc Snir, Robert W. Wisniewski, Jacob A. Abraham, Sarita V. Adve, Saurabh Bagchi,
Pavan Balaji, Bill Carlson, Andrew A. Chien, Pedro Diniz, Christian Engelmann, Rinku
Gupta, Fred Johnson, Jim Belak, Pradip Bose, Franck Cappello, Paul Coteus, Nathan A. De-
bardeleben, Mattan Erez, Saverio Fazzari, Al Geist, Sriram Krishnamoorthy, Sven Leyffer,
Dean Liberty, Subhasish Mitra, Todd Munson, Rob Schreiber, Jon Stearley, and Eric Van
Hensbergen, Addressing failures in exascale computing, Workshop report, August 4-11,
2013.
[92] Dawn Song, David Brumley, Heng Yin, Juan Caballero, Ivan Jager, Min Gyung Kang,
Zhenkai Liang, James Newsome, Pongsin Poosankam, and Prateek Saxena, BitBlaze: A
new approach to computer security via binary analysis, Proceedings of the 4th International
Conference on Information Systems Security (keynote invited paper), Hyderabad, India,
December 2008.
[93] M. Steinder and A.S. Sethi, A survey of fault localization techniques in computer networks,
Science of Computer Programming 53 (2004), no. 2, 165–194.
[94] SYSSTAT Utilities, SYSSTAT: Performance monitoring tools, Available at:
http://sebastien.godard.pagesperso-orange.fr/.
[95] D. Takahashi, An extended split-radix FFT algorithm, IEEE Signal Processing Letters 8
(2001), no. 5, 145–147.
[96] Yongmin Tan, Xiaohui Gu, and Haixun Wang, Adaptive system anomaly prediction for
large-scale hosting infrastructures, Proc. of ACM Symp. on Principles of Distributed Com-
puting (PODC), 2010.
[97] Yongmin Tan, Hiep Nguyen, Zhiming Shen, Xiaohui Gu, C. Venkatramani, and D. Rajan,
Prepare: Predictive performance anomaly prevention for virtualized cloud systems, Dis-
tributed Computing Systems (ICDCS), 2012 IEEE 32nd International Conference on, 2012,
pp. 285–294.
[98] Yongmin Tan, Hiep Nguyen, Zhiming Shen, Xiaohui Gu, Chitra Venkatramani, and Deepak
Rajan, Prepare: Predictive performance anomaly prevention for virtualized cloud systems.,
ICDCS, 2012.
[99] Anna Thomas and Karthik Pattabiraman, Error detector placement for soft computation,
Proc. of 43rd Annual IEEE/IFIP International Conference on Dependable Systems and
Networks (DSN), 2013, pp. 1–12.
[100] Kalyan Vaidyanathan and Kenny Gross, MSET performance optimization for detection of
software aging, Proc. of IEEE Intl. Symp. on Software Reliability Engineering (ISSRE),
2003.
[101] Hubertus J. J. van Dam, Abhinav Vishnu, and Wibe A. de Jong, A case for soft error detec-
tion and correction in computational chemistry, Journal of Chemical Theory and Computa-
tion 9 (2013), no. 9, 3995–4005.
[102] Brani Vidakovic, Statistical modeling by wavelets, Wiley, 2010.
[103] R. Vilalta and S. Ma, Predicting rare events in temporal domains, Proc. of IEEE Intl. Conf.
on Data Mining (ICDM), 2002.
[104] Kashi Venkatesh Vishwanath and Nachiappan Nagappan, Characterizing cloud computing
hardware reliability, Proceedings of ACM Symposium on Cloud computing (SOCC), 2010.
[105] Xin Xu and Man-Lap Li, Understanding soft error propagation using efficient vulnerability-
driven fault injection, Proc. of 42nd Annual IEEE/IFIP International Conference on
Dependable Systems and Networks (DSN), 2012, pp. 1–12.
[106] Bo Yang, Feng Tan, Yuan-Shun Dai, and Suchang Guo, Performance evaluation of cloud
service considering fault recovery, Proc. of IEEE Intl. Conf. on Cloud Computing (Cloud-
Com), 2009.
[107] Lingyun Yang, Chuang Liu, Jennifer M. Schopf, and Ian Foster, Anomaly detection and
diagnosis in grid environments, Proceedings of the 2007 ACM/IEEE conference on Super-
computing, SC ’07, 2007, pp. 33:1–33:9.
[108] Ying Yang, Chang-Tsun Li, Xingming Sun, and Hengfu Yang, Removable visible image
watermarking algorithm in the discrete cosine transform domain, Journal of Electronic
Imaging 17 (2008), no. 3, 033008.
[109] Xin Yao, Evolving artificial neural networks, Proceedings of the IEEE 87 (1999), no. 9,
1423–1447.
[110] Alice X. Zheng, Jim Lloyd, and Eric Brewer, Failure diagnosis using decision trees, Proc.
of IEEE Intl. Conf. on Autonomic Computing (ICAC), 2004.
[111] Qiang Zheng, Guohong Cao, Tom La Porta, and Ananthram Swami, Optimal recovery from
large-scale failures in ip networks, Proc. of IEEE ICDCS, 2012.
[112] Ziming Zheng, Li Yu, Wei Tang, Zhiling Lan, Rinku Gupta, Narayan Desai, Susan Coghlan,
and Daniel Buettner, Co-analysis of RAS log and job log on Blue Gene/P, Proceedings of
the 2011 IEEE International Parallel & Distributed Processing Symposium, IPDPS ’11,
2011, pp. 840–851.