Università degli Studi di Modena e Reggio Emilia
Facoltà di Scienze Matematiche, Fisiche e Naturali
Corso di Laurea Magistrale in Informatica
TESI DI LAUREA
Stochastic Analyses for Internet Data Centers Management
Advisor: Prof. Michele Colajanni
Co-advisor: Ing. Sara Casolari
Candidate: Dott.ssa Stefania Tosi
Academic Year 2009/2010
“On fait la science avec des faits, comme on fait une maison avec des pierres ; mais une accumulation de faits n’est pas plus une science qu’un tas de pierres n’est une maison.”

“Science is made of data as a house is made of stones. But a mass of data is no more science than a pile of stones is a house.”

Jules-Henri Poincaré
Contents

1 Introduction
2 Whole system analysis
  2.1 Problem definition
  2.2 Multi-phase methodology
3 Stochastic analyses of system resource measures
  3.1 Deterministic component
    3.1.1 Correlogram
    3.1.2 Periodogram
  3.2 Spike component
    3.2.1 Sigma threshold test
  3.3 Noise component
    3.3.1 Color of noise
    3.3.2 Distribution of noise
4 PCA-based technique on collected data
  4.1 Data collection
    4.1.1 Heterogeneous resources of a single server
    4.1.2 Homogeneous resources of different servers
  4.2 Principal Component Analysis
    4.2.1 PCA on heterogeneous resources of a single server
    4.2.2 PCA on homogeneous resources of different servers
  4.3 Analyzing eigenresources
    4.3.1 A taxonomy of eigenresources
    4.3.2 Understanding eigenresources
  4.4 Extraction of representative eigenresources
5 Tracking models
  5.1 Trend extraction
  5.2 Problem definition
    5.2.1 Interpolation techniques
    5.2.2 Smoothing techniques
  5.3 Interpolation estimators
    5.3.1 Simple Regression (SR)
    5.3.2 Cubic Spline (CS)
  5.4 Smoothing estimators
    5.4.1 Simple Moving Average (SMA)
    5.4.2 Exponential Weighted Moving Average (EWMA)
    5.4.3 Auto-Regressive (AR)
    5.4.4 Auto-Regressive Integrated Moving Average (ARIMA)
  5.5 Quantitative performance analysis
    5.5.1 Computational cost
    5.5.2 Estimation quality
6 Forecasting models
  6.1 Time series prediction
  6.2 Prediction models
    6.2.1 Simple Regression (SR)
    6.2.2 Cubic Spline (CS)
    6.2.3 Exponential Weighted Moving Average (EWMA)
    6.2.4 Holt’s Model (Holt’s)
    6.2.5 Auto-Regressive (AR)
    6.2.6 Auto-Regressive Integrated Moving Average (ARIMA)
  6.3 Quantitative analysis
    6.3.1 Computational cost
    6.3.2 Prediction quality
  6.4 Performance analysis
7 Runtime models
  7.1 State change detection
    7.1.1 Problem definition
    7.1.2 Wavelet Cusum state change detection model
    7.1.3 Other state change detection models
    7.1.4 Quantitative analysis
    7.1.5 Performance analysis
    7.1.6 On-line state change detection for IDC management
  7.2 Anomaly detection
    7.2.1 Problem definition
    7.2.2 Point anomaly detection
    7.2.3 Point anomaly detection techniques
    7.2.4 On-line anomaly detection for IDC management
    7.2.5 Collective anomaly detection
    7.2.6 Collective anomaly detection techniques
    7.2.7 On-line collective anomaly detection for IDC management
    7.2.8 Conclusions
8 Related work
9 Conclusions
List of Figures

2.1 The proposed multi-phase framework for whole system analysis.
3.1 Examples of monitored time series.
3.2 Example of a trend time series.
3.3 Example of a seasonal time series.
3.4 Example of a trend and seasonal time series.
3.5 Example of a time series with a “hidden” seasonality and the corresponding correlogram.
3.6 Example of periodogram.
3.7 Example of a spike component.
3.8 Example of 3σ threshold test for spike component.
3.9 Example of 1σ and 5σ threshold tests for spike component.
3.10 Example of a noise component.
3.11 Example of autocovariance function of a white noise time series.
3.12 Example of autocovariance function of a colored noise time series.
3.13 Examples of Q-Q plots.
4.1 First phase of the multi-phase framework.
4.2 Examples of heterogeneous resource measures of a database server.
4.3 Examples of homogeneous resource measures - CPU utilization.
4.4 Second phase of the multi-phase framework.
4.5 Example of 1D projection of 2D points in the original space.
4.6 Example of eigenresource and corresponding principal components.
4.7 Example of a scree plot.
4.8 Scree plots for the resource time series of a database server.
4.9 PCA resulting principal eigenresources on heterogeneous resources of a database server.
4.10 Scree plots for CPU utilization time series.
4.11 PCA resulting principal eigenresources on homogeneous resources (CPU utilizations) of a database server.
4.12 Examples of the three types of eigenresources.
4.13 Classifying eigenresources by using three statistical tests.
4.14 Deterministic eigenresources and corresponding correlograms.
4.15 Example of correlogram showing a multi-seasonal behavior in a two weeks resource sampling.
4.16 Spike eigenresources and corresponding sigma threshold tests.
4.17 Noise eigenresources and corresponding autocovariance functions.
4.18 Representative eigenresources.
4.19 Representative eigenresources with spike and noise subclasses.
5.1 Third phase: modeling the system behavior in the past.
5.2 Trend estimation techniques classification.
5.3 Graphical example of linear interpolation techniques.
5.4 Graphical example of non-linear interpolation techniques.
5.5 Graphical examples of smoothing techniques.
5.6 Example of time series and approximate confidence interval.
5.7 Trend curves with respect to the approximate confidence interval.
6.1 Third phase: forecasting the system behavior in the future.
6.2 Example of time series prediction and corresponding prediction interval (c = 0.95).
6.3 Raw and treated CPU utilization time series.
6.4 Holt’s prediction model, k = 10, on raw data set.
6.5 Holt’s prediction model, k = 10, on trend estimation.
7.1 Third phase: analyzing the system behavior in the present.
7.2 The problem of detecting relevant state changes.
7.3 F-measures - ρe = 0.
7.4 F-measures.
7.5 Qualitative evaluation - Step profile - σe = 0.6 and ρe = 0.3.
7.6 Qualitative evaluation - Multi-step profile - σe = 0.9 and ρe = 0.3.
7.7 Relevant state changes on the representative deterministic eigenresource.
7.8 Performance evaluation of state change detection models on representative deterministic eigenresource.
7.9 Example of point anomalies.
7.10 A box plot for a univariate data set.
7.11 Performance evaluation of 5σ test for point anomaly detection on the representative spike eigenresource.
7.12 Box plot of representative spike eigenresource.
7.13 Performance evaluation of box plot rule for point anomaly detection on the representative spike eigenresource.
7.14 Example of collective anomaly.
7.15 Prediction results on the representative deterministic eigenresource.
7.16 Performance evaluation of collective anomaly detection model on the representative deterministic eigenresource.
List of Tables

4.1 System monitor’s syntax and corresponding resource measure.
4.2 Occurrence of eigenresource types in order of importance.
4.3 Contributions of eigenresource types.
5.1 CPU time (msec) for the computation of a trend value.
6.1 CPU time (msec) of prediction models and policies.
6.2 PE of the prediction policies, k = 10.
6.3 PI of the prediction policies, k = 10.
7.1 Recall - ρe = 0.
7.2 Precision - ρe = 0.
Acknowledgements

My first thanks can only go to Professor Colajanni, for giving me the chance to realize this dream. I am grateful to him for the enormous opportunities he has granted me over these two years, for his sincere advice, and for his disarming ability to convey a lesson in every sentence. I thank him for giving me trust and courage, and for passing on to me a fraction of his boundless passion for research. Thank you.

I thank Sara, for taking me by the hand and accompanying me step by step, for her advice and her precious teachings. I learned more by her side, admiring her work, than in all the years spent on books. I discovered in her a patient teacher, an amusing colleague, but above all a special friend. Thank you.

With particular affection I thank Mirco, Michele, Claudia, Riccardo, Mauro and Luca, who brought a touch of joy and lightheartedness to the many days in the office. I thank them for welcoming me as one of their own into the research group and for making me discover that, after all, engineers are not so bad =). Thank you.

With just as much affection I thank all my university classmates: those I lost, those I found, and those who have always been at my side along this path of study. If, digging through my memory, only beautiful recollections resurface, it is entirely thanks to them. Thank you.

To all my friends who, near or far, have stood by me over these years. To those who listened to me, those who put up with me, and anyone who brightened my day with a gesture or a smile. To those who showed they were proud of me, those who gave me sincere advice, and anyone who gave me the courage to believe in myself and in my dreams. Thank you.

The most heartfelt thanks, however, I reserve for three extraordinary people. First of all, my mum and dad. For always being there, day after day, with their attention, their questions, their advice. For supporting my choices. For sharing and believing in my dreams. For never ceasing to show enthusiasm at every milestone of mine. Seeing the joy in their eyes is the greatest satisfaction of any success of mine. Thank you.

The most precious thanks of all, though, I want to address to Andrea, because he is the person I admire most in the world, and there is no greater happiness than making him proud of his sister. Thank you.
Chapter 1
Introduction
The advent of large Internet Data Centers providing any kind of service through Web-related technologies has changed the traditional processing paradigm. These modern infrastructures must accommodate varying demands for different types of processing within certain time constraints. Overall performance analysis and runtime management in these contexts are becoming extremely complex, because they are a function not only of individual applications, but also of their interactions as they contend for processing and I/O resources, both internal and external.
The majority of critical Internet-based services run on shared application infrastructures that have to satisfy scalability, adaptability and availability requirements, and have to avoid performance degradation and system overload. Managing these systems requires a large set of runtime decision algorithms oriented to load balancing and load sharing [9, 24, 99], overload and admission control [33, 34, 52, 91, 93], and job dispatching and redirection [25]. The recently widespread paradigm of Utility Computing further increases the necessity for runtime management algorithms that take important actions on the basis of present and future load conditions of the system resources.
Existing models, methodologies and frameworks commonly applied to other contexts are often inadequate to efficiently support the runtime management of present and future Internet-based systems, because of two main problems.
• The large majority of the literature related to Internet-based systems management proposes decision systems relying on the modeling of server resource usages. However, the stochasticity and randomness of these processes make it hard to model their behavior and prevent the use of parametric techniques for this purpose. Unlike existing models and schemes, which are oriented to evaluating system performance by extracting the needed parameters from the usage traces, in this work we rely on stochastic analyses making no modeling assumptions.
• Most available algorithms and mechanisms for runtime decisions evaluate the performance of information infrastructures or systems through the periodic sampling of resource measures obtained from the monitoring of each server in isolation, and use these values (or simple combinations of them) as a basis for determining the present and future system condition. However, a wide range of important problems faced by computer researchers today (including computer engineering, system design, anomaly detection, change detection and capacity planning) require modeling and analysis of system behaviors considering all resource measures of all servers simultaneously.
In general, whole resources analysis - that is, modeling the metrics of all the resources of a node simultaneously - and whole servers analysis - that is, modeling the behaviors of all the servers of a system simultaneously - are difficult objectives, made harder by the fact that modeling time series behavior for a single resource of a single server is itself a complex task. Whole system analysis - arising from the synthesis of whole resources and whole servers analyses - therefore remains an important and unmet challenge.
One way to address the problem of whole resources analysis is to recognize that the behavior observed for different resources of a single server is not independent, but is in fact determined by a common external workload and typical resource features. The superimposition of different resources, as determined by their features, gives rise to the overall behavior of a server.
At a higher level, if we consider an infrastructure composed of several servers, its behavior can be evaluated as the join of the single server behaviors, overlapped on the basis of the routing matrix and internal policies of the infrastructure. Thus, instead of studying servers through their single resources and systems through their single servers, a more direct and fundamental focus for whole system study is the analysis of the behavior of the resources set of the servers set of an infrastructure.
However, analyzing the whole resources behavior and its flow inside even a simple infrastructure suffers from several difficulties. The first challenge is that there are several resources that can be monitored on a single server, each one with its typical features and its own impact on the overall behavior of the server. We expect that some resources influence the final performance of the server more than others, and that some could be ignored in the analysis of the whole system capability. For this reason, a first imposing obstacle is the ranking of system resource measures in order of importance and the determination of their weights on the basis of their contribution to the overall behavior of a server node.
Even if this problem is itself a huge challenge, linking different servers together into even a simple infrastructure exponentially increases the complexity of solving the whole system problem. Once the resource measures collected on the different servers have been ranked, the next challenge is the analysis of server node interactions inside the whole infrastructure. Thus, every system analysis requires knowledge of the routing policies of the system and the main characteristics of its constituent servers.
Another central problem one confronts in facing whole system analysis is the so-called “curse of dimensionality” [46], coming from the fact that the servers in an infrastructure can number in the hundreds. This means that resource interactions inside the system form a high dimensional multivariate structure. For example, even a moderate-size infrastructure may be composed of several dozens of servers, each one characterized by tens of resource metrics; the resulting set of time series has thousands of dimensions. The high dimensionality of the system resources matrix is in fact another main source of difficulty in addressing whole system analysis, and it tends to become ever more prominent with the spread of large-size Internet Data Centers.
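To make the scale concrete, the following minimal sketch builds the kind of resource matrix such an infrastructure produces. The sizes are hypothetical (they are not measurements from the thesis testbed), chosen only to show how quickly the dimensionality grows:

```python
import numpy as np

# Hypothetical sizes for a moderate Internet Data Center.
n_servers = 50    # several dozens of servers
n_metrics = 20    # tens of resource metrics per server (CPU, memory, I/O, ...)
n_samples = 1440  # e.g., one day of measurements at 1-minute intervals

# Each monitored resource contributes one time series, i.e., one column
# of the system resources matrix (here filled with placeholder values).
rng = np.random.default_rng(0)
X = rng.random((n_samples, n_servers * n_metrics))

# The multivariate structure has n_servers * n_metrics dimensions:
# already a thousand-dimensional set of time series.
print(X.shape)
```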
The aim of this work is to face whole system analysis in the context of modern Internet Data Centers. We propose a multi-phase methodology aiming to solve all the previously mentioned problems. It analyzes the system resource measures of the Internet Data Center servers and collects their main features into a representative vision of the whole system state. This vision provides an exhaustive representation of the Internet Data Center performance that permits whole system analysis and helps management decisions.
The rest of the work is organized as follows. Chapter 2 discusses the motivations of this research and the high dimensionality and stochasticity of resource measures in a system, providing the necessary foundations of the multi-phase methodology. We give a brief description of the four phases of the proposed approach. In Chapter 3 we carry out a first analysis of the typical behavior of resource measures monitored on the servers of Internet-based systems and the management problems due to their highly variable and stochastic behavior. In Chapter 4 we present the steps taken to construct time series from the monitoring of the system resources of an Internet Data Center used as testbed. We then apply the proposed technique to the collected data in order to evaluate its ability to solve the whole system analysis problem. We elaborate on the notion of representative eigenresources and show how they can be interpreted, understood and used to support some fundamental tasks characterizing runtime management decisions in Internet Data Centers. Chapters 5 to 7 examine some classic problems characterizing Internet Data Centers, concerning their past, present and future behaviors. In this work, we consider trend extraction, state change detection, point and collective anomaly detection and time series forecasting as examples of system state evaluations, and demonstrate how these applications can take advantage of the eigenresource representations for whole system analysis and management at runtime. Chapter 8 compares our contribution with the state of the art. Concluding remarks and our ongoing work are presented in Chapter 9.
Chapter 2
Whole system analysis
This chapter discusses the high complexity of whole system analysis and proposes
a novel methodology to face it.
2.1 Problem definition
Many application contexts, ranging from financial to social applications, or from hydrological phenomena to information and computer systems, base their management decisions on information coming from the monitoring of fundamental processes, such as price behavior, population growth, temperature fluctuations or system performance/utilization. Efficient process management requires suitable algorithms able to make runtime decisions on the basis of the current and past behaviors of the time series coming from monitoring systems.
Over the last twenty years, there has been a significant increase in the number of real problems concerned with questions such as fault detection and diagnosis, state change and anomaly detection, safety and quality control, prediction of future events, and estimation of expected capacity needs. These problems result from the increasing complexity of most novel processes, the availability of sophisticated sensors in both the technological and natural worlds, and the existence of sophisticated information processing systems.
Central to all the above issues is understanding the characteristics of the process, evaluating its behavior and generating a reliable representation of it, in order to define the past, present and future state of the process. The key difficulty is to generate a representative view of process behavior that collects and assembles all (and only) the relevant characteristics of the process.
Let us give a biological example, considering the health of a patient as our context of interest. We can monitor several biological processes of a human being, such as respiration, digestion, response to stimulus, interaction between organs and so on, and compute a health index for each of them. Each index gives an idea of the good or bad condition of the person with respect to a particular process. For example, a low index for the respiration process means that the person has trouble breathing. However, looking at this index on its own does not give an overall idea of the health of the person: a high respiration index does not mean that he feels good! In the same way, a low respiration index cannot give an idea of the effective illness of the patient if it is not associated with some information about heart power, intercostal muscles, mouth and nose pathways, etc. Information about not strictly related processes, such as blood pressure, diabetes or asthma conditions, is also useful to understand the breathing condition of the person. Only through the aggregation of all available information in a suitable way can we get a clear idea about the health of the patient.
Similarly, an information system can be seen as a very simple human being, with its several organs - the server nodes of the system -, its various processes - the resource usages -, and its overall health - typically expressed in terms of system performance. The system performance arises from the contributions of different resource usages on different servers, just as the health of a person is given by all the biological processes of all the organs of the body. Thus, information system performance arises from the superimposition of the behaviors of different resources on different servers.
For this reason, a thorough understanding of component features is essential for modeling system behavior, and for addressing a wide variety of problems including computer engineering, system design, capacity planning [92], forecasting and anomaly detection [66]. All these problems in computer and information systems require a reliable and exhaustive representation of the state of their resources. For this reason, the most important Key Performance Indicators (KPIs) are continuously monitored, and the data are passed to statistical models that decide whether new management decisions have to be taken or not.
All commercial methods and software for measuring the performance of information systems or infrastructures consider different KPIs and use them to estimate partial visions of the principal components of the entire system/infrastructure. Analyzing every KPI by itself allows us to get an idea of the capability of the component the KPI refers to, and makes it possible to understand how it performs. This helps system administrators better understand the behavior of that server and better plan for its usage. In order to get a global vision of the principal system components and to decide whether or not to allocate resources, we should apply a similar method to each server of the infrastructure.
This approach suffers from many drawbacks.
• In modern Internet Data Centers there may be hundreds of servers whose resources are monitored simultaneously. This means modeling thousands of KPIs, making whole system analysis a very expensive task.

• Since they refer to highly variable and stochastic processes, system resource metrics are hard to model through the parametric techniques typically used by state-of-the-art decision mechanisms.

• Each KPI refers only to a single resource usage of one server, giving a measure of the relative performance of a single hardware/software component of the entire system.
Analyzing every KPI on its own, besides being extremely time consuming, gives a reductive and incomplete vision of the whole system, since it does not take into account the interactions of components under a software load. The reasons are obvious: the whole system’s performance is affected not only by the behavior of each single resource, but also by the resulting interaction of resources on several different servers combined together. This means that the overall system performance is given by the superimposition of several contributions that, in some instances, are not independent.
As a consequence, all present solutions for measuring system performance, characterizing system behavior and detecting state changes and anomalous events lack efficiency and risk leading to erroneous management decisions, since they rest upon partial and deficient representations of the entire system state.
2.2 Multi-phase methodology
To address most of the raised issues, we propose a novel multi-phase methodology that gives a representative vision of the whole system behavior, able to characterize the majority of the resource variability of all the servers of the Internet Data Center.
The performance of an Internet-based system depends heavily on the characteristics of its load and on the impact the load has on the different resources of the constituent servers. Starting from a test phase, it is possible to rely on some data samples (time series), collected from the servers’ log files and referring to all the monitored resources. These data can be used to create a model that describes, or approximates, the actual resource behavior of the server. From this model, one can predict the impact of a load change on its resource usages. Much of the work done in this direction focuses on single resource behavior characterization, which considers one server at a time, creating a model from observation (e.g., a stochastic model) and extracting the needed parameters from the resource time series.
In this work, we propose an alternative approach for whole system analysis. Since a single resource time series analysis is itself a complex task, modeling the whole system’s behavior is even more difficult. The reason is that all the information coming from the monitoring of the resource measures of the servers of an Internet-based system is stochastic and forms a high dimensional structure.
However, one can suppose that some system resources share common behaviors as a function of time. For example, several resources could share the same periodic behavior, due to steady peaks of utilization during business hours and low utilizations during the lunch hour and other non-business hours in the evening and on weekends. On the other hand, some resources could present simultaneous short bursts (or spikes) of high demand, called flash crowds, often triggered by a special and usually unexpected event.
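These shared behaviors can be illustrated by synthesizing a resource series as a daily periodic component plus an occasional spike. This is a minimal sketch with made-up parameters, not data from the thesis testbed:

```python
import numpy as np

rng = np.random.default_rng(42)
t = np.arange(1440)  # one day of 1-minute samples (hypothetical rate)

# Shared periodic behavior: utilization peaking during business hours.
seasonal = 50.0 + 30.0 * np.sin(2 * np.pi * t / 1440)

# Flash crowd: a short burst of high demand at an unexpected moment.
spikes = np.zeros_like(seasonal)
spikes[700:710] = 80.0

# Background measurement noise.
noise = rng.normal(0.0, 2.0, size=t.size)

series = seasonal + spikes + noise
```

Several such synthetic series sharing the same `seasonal` component would be strongly correlated, which is exactly what makes a lower-dimensional representation plausible.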
These observations lead us to believe that the high dimensional structure of resource time series, which appears to be complex, could be governed by a small set of features (e.g., correlated periodicity, simultaneous demand spikes, common underlying noise) and, therefore, could be represented approximately by a lower-dimension representation.
When presented with the need to analyze a high dimensional structure, a common and powerful approach is to seek an alternate lower-dimensional approximation of the structure that preserves its important properties. It can often be the case that a structure appearing complex because of its high dimension is governed by a small set of independent variables, and so can be well approximated by a lower dimensional representation. Dimension analysis and dimension reduction techniques attempt to find these simple variables and can therefore be a useful tool to understand the original structure.
The most popular technique to analyze high dimensional structures is the Principal Component Analysis (PCA) [67]. Given a high dimensional object and its associated coordinate space, PCA finds a new coordinate space which is the best one to use for dimension reduction of the given object. Once the object is placed into this new coordinate space, projecting the object onto a subset of axes can be done in a way that minimizes the error. When a high dimensional object can be well approximated in a smaller number of dimensions, we refer to the smaller number of dimensions as the object's intrinsic dimensionality.
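As an illustration of intrinsic dimensionality, the following sketch (a hypothetical example on synthetic data, not the thesis data set, assuming NumPy) builds one hundred "resource measures" that are all driven by just two latent signals, then counts how many principal components are needed to capture 95% of the total variance:

```python
import numpy as np

def pca_intrinsic_dim(X, energy=0.95):
    """Number of principal components capturing a given fraction of variance."""
    Xc = X - X.mean(axis=0)                       # center each resource measure
    # Singular values of the centered data give the principal variances.
    s = np.linalg.svd(Xc, compute_uv=False)
    var = s**2 / np.sum(s**2)
    return int(np.searchsorted(np.cumsum(var), energy) + 1)

rng = np.random.default_rng(5)
t = np.linspace(0, 14 * np.pi, 2016)
# 100 synthetic "resource measures" mixing only 2 latent signals plus small noise.
latent = np.stack([np.sin(t), np.cos(t / 7)])
X = rng.normal(0, 1, (100, 2)) @ latent + rng.normal(0, 0.05, (100, t.size))

k = pca_intrinsic_dim(X.T)  # rows: time samples, columns: resource measures
# Although the data set has 100 dimensions, its intrinsic dimensionality is 2.
```

The apparently 100-dimensional structure is well approximated by two axes of the new coordinate space, which is exactly the situation PCA is meant to expose.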
Finding the intrinsic dimensionality of the whole system analysis is the initial purpose of the approach presented in this work. In this study, we use PCA and apply some analysis steps to an explicative data set extracted from Internet Data Center servers. By determining whether the whole system has low intrinsic dimensionality, it is possible to create a workload model that is described only by a small set of features, such as deterministic, noisy and spiky. We then use the extracted features for several purposes: detecting relevant state changes, signaling punctual anomalies and identifying uncommon patterns in the expected server behavior. Several analyses other than those presented in this work can benefit from this study. They can inform decision systems directed to the runtime management of Internet Data Centers.
We now give a brief overview of the multi-phase methodology proposed in this study. It aims to separate the complex whole system analysis and management into four main phases, as described in Figure 2.1.
Let us outline the different phases that will be described in the following chapters:
Figure 2.1: The proposed multi-phase framework for whole system analysis.
1. Data collection
In this work, we analyze stochastic time series coming from system monitors related to the most relevant internal resource measures of the servers of an Internet Data Center. They form a high dimensional data set of stochastic measurements, that is comprehensive but not representative of the whole system under examination.
2. PCA-based technique
This phase analyzes the collected time series in order to extract a representative vision of the monitored system. We apply the Principal Component Analysis to the collected time series, with the aim to obtain an exhaustive representation of them through a smaller amount of relevant information, that we call eigenresources. The decomposition is followed by an analysis of eigenresource characteristics and an investigation of their typical behavior. We see that, in the context of Internet Data Center resource measures, eigenresource components are easy to map into three behavioral classes: deterministic, noise and spike. In light of what we discovered in our application context, the PCA-based technique ends with the assembling of three representative eigenresources, one for each class. These representations collect the contributions of all the server nodes under examination and therefore compose a representative vision of the entire system. This vision captures the entire intrinsic dimensionality of the whole system analysis problem, by reducing its complexity to only three dimensions.
3. System state evaluation
This phase proposes some mechanisms that use the previous representations as a basis for evaluating important information about the past (e.g., trend extraction), the present (e.g., state change detection, point anomaly and collective anomaly detection) and the future (e.g., time series prediction) behavior of the Internet Data Center state. These evaluations refer to representative views of the system, collecting the contributions of all the resource measures of all the servers of the architecture, and thus carrying reliable information about the previous, current and expected behaviors of the Internet Data Center.
4. System management
We take advantage of the PCA results and the application outcomes to manage the system through some runtime decisions typical of Internet Data Centers. Applying management algorithms to representative eigenresources allows system administrators to make appropriate and reliable operating decisions. In this way, the complex problem of the whole system analysis and management of the considered Internet Data Center is reduced to the investigation of a small set of dimensions, still carrying the main characteristics of the entire Data Center.
In this work, we show the results of applying the multi-phase methodology to a typical Internet Data Center. Our purpose is to transform the huge collected data set of measurements related to its servers into a small set of representative information, which constitutes the basis for runtime decisions/actions.
In the examined context, the multi-phase methodology produces interesting benefits in understanding Internet Data Center characteristics, evaluating its past, present and future performance, and making runtime decisions for the whole Data Center administration. In particular, it:
• collects into three representative visions the relevant information about the behavior of the Internet Data Center, giving an exhaustive characterization that includes all the contributions of the system's server nodes;
• reduces to only three the number of time series to analyze in order to manage the overall behavior of the Internet Data Center, in spite of its complexity, its purposes and the number of servers;
• reveals behavioral trends in server activities, isolating their deterministic components from deviations due to noise or spikes;
• helps anomaly detection, since statistical tests for outlier identification can be applied to the spiky features;
• makes it possible to choose suited detection models, on the basis of the different typologies of change that may occur in the state of the system and the peculiar characteristics of the representative visions coming from the PCA-based technique;
• allows the forecasting of the behavior of the system, using its deterministic components to predict the future activity of the entire set of servers;
• enables the detection of unexpected events in system activity and the signaling of anomalies in typical server behavior.
A detailed application of the multi-phase methodology for Internet Data Center analysis and management is addressed in the following chapters. After an initial overview of stochastic resource measures in the next chapter, we describe the first two phases of the methodology in Chapter 4, and the third phase in Chapters 5, 6 and 7, which analyze the past, future and present behavior of the Internet Data Center, respectively. All the analyses are preparatory for a system management phase, which is only hinted at in this work but which we intend to investigate in depth in future works.
Chapter 3
Stochastical analyses of system resource measures
Before applying the proposed multi-phase methodology, let us introduce some stochastical analyses that turn out to be useful in our ongoing work. In this chapter, we propose a detailed analysis of the statistical behavior of the most important system resource measures of an Internet Data Center and we review stochastic methods that are useful for their analysis.
We consider these measures (or samples) as stochastic data sets that are continuously provided by system monitors. The term stochastic is due to the non-deterministic behavior of the monitored resources and the fact that the Internet Data Center's state is determined both by the resources' predictable actions and by a random element. The term data set refers to an ordered collection of i samples, starting at time t1 and covering events up to the current time ti. We denote the data set by Xi = [x1, x2, . . . , xi−1, xi], where the j-th element xj, with 1 ≤ j ≤ i, denotes the value of the resource measure of interest at time tj.
Since the data set's samples are measured at successive times spaced at uniform time intervals, we refer to the data set Xi as a time series. Figure 3.1 reports the typical behavior of two time series obtained from the monitoring of two internal resources (CPU and Memory utilization) of an Internet-based server. System monitors capture resource measures every 5 minutes during an observation interval lasting a week. Seven days of monitoring build up a time series of i = 2016 values.
Figure 3.1: Examples of monitored time series. (a) CPU utilization; (b) Memory usage.
The properties and the characteristics of the time series related to the system resources of an Internet Data Center require in-depth investigations to achieve a useful interpretation and an adequate positioning of the resource states with respect to the capacity of the system. Due to the stochastic behavior of the monitored time series, we do not deal with only one possible reality of how the data set might evolve over time; rather, in a stochastic time series there is some indeterminacy in its future evolution, described by probability distributions. This means that, even if the initial condition (or starting point) is known, there are many paths the resource measures might follow, with some paths more probable than others.
One way to simplify the analysis of stochastic time series and the understanding of their evolution is to separate the time series into their constituent components. Analyzing and studying one feature at a time helps in reducing the complexity of managing random time series and in applying suitable models to the components' characteristics. As in most time series analyses, the system resource measures consist of:
• a deterministic component - usually a set of systematic trend and periodic patterns;
• a spike component - usually caused by isolated and occasional bursts and dips;
• a noise component - usually making the pattern difficult to identify.
In the following sections, we analyze the three main components and present several statistical analyses useful for their investigation.
3.1 Deterministic component
The deterministic component represents data set patterns appearing to be rela-
tively predictable. Most time series patterns can be described in terms of two
basic classes of components:trendandseasonality.
Trend pattern
Thetrend is one of the dominant feature of many time series. It represents a
general systematic linear or (most often) non-linear component that changes
over time and does not repeat or at least does not repeat within the time
range captured by the data. Such a trend can be upward or downward, it can
be step or not, and it can be exponential or approximately linear.
In Figure 3.2 we give an example of a time series with an upward linear trend. We can observe a manifest increase of sample values in time and random disturbances that do not affect the prevalent linear growth of the time series.
Figure 3.2: Example of a trend time series.
It is worth noting that not all time series show a trend component. However, when the trend component in the time series is present and monotonic (consistently increasing or decreasing), trend estimation is a useful method to interpret the data set, because it complements the seasonality statistics to fully understand the deterministic component characterizing the stochastic data set.
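As a minimal sketch of trend estimation (hypothetical synthetic data, assuming NumPy), a least-squares fit of a degree-1 polynomial recovers a linear trend and yields a detrended residual:

```python
import numpy as np

rng = np.random.default_rng(6)
t = np.arange(2016)  # one week of 5-minute samples
x = 0.005 * t + rng.normal(0, 1.0, t.size)  # upward linear trend plus noise

# Least-squares fit of a degree-1 polynomial estimates the trend.
slope, intercept = np.polyfit(t, x, deg=1)
detrended = x - (slope * t + intercept)  # residual after trend removal
```

The residual series `detrended` is what remains for the seasonality, spike and noise analyses of the following sections.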
Seasonal pattern
When time series are monitored for a sufficiently long time period (e.g., weeks, months or years), it is often the case that such a series displays seasonal patterns. A seasonal pattern is similar in nature to the trend component, but it repeats itself at systematic intervals over time. This is typically the case of Internet-based services, where system measures increase during diurnal activities and decrease during the night or weekend. Figure 3.3 shows a typical example of a time series displaying seasonality.
Figure 3.3: Example of a seasonal time series.
Seasonality may also be reflected in the variance of a time series. For example, the time series variability may be highest on specific days of the week (e.g., Thursday and Friday, rather than Monday to Wednesday), because of the specific characteristics of Internet Data Center services.
Trend and seasonal patterns usually coexist in real-life data sets, and the amplitude of the seasonal changes increases with the overall trend. Together they form what we call the deterministic component of a time series. An example of the two contributions together in the same data set is given in Figure 3.4. In this example, the relative amplitude of seasonal changes is constant over time, thus it is related to the trend.
Figure 3.4: Example of a trend and seasonal time series.
The correlogram and the periodogram are among the most useful tools for determining the deterministic component of a data set.
3.1.1 Correlogram
The correlogram (or autocorrelogram) displays graphically and numerically the autocorrelation function (ACF) of a data set and is useful to reveal trend and seasonal patterns [32].
Given a stochastic data set, the ACF describes the correlation between the data set values at different instants. The presence of autocorrelation between different values of a time series means that a temporal dependence between its samples exists and that a strong deterministic component influences the time series behavior.
Let Xi be the data set collected at time ti. If Xi has mean value µ and variance σ², then the autocorrelation function is defined as:

ACF(Xi, Xi+k) = E[(Xi − µ)(Xi+k − µ)] / σ²    (3.1)

where E is the expected value operator and k stands for the time lag. The autocorrelation at lag k is defined as the correlation between samples separated by k time periods.
This expression is not well defined for all data sets, since the variance σ² may be zero (for a constant data set) or infinite for some heavy-tailed distributions. If the ACF is well defined, its value must lie in the range [−1, 1], with 1 indicating perfect correlation and −1 indicating perfect anticorrelation.
The correlogram displays serial autocorrelation coefficients (and their standard errors) for consecutive lags k in a specified range (e.g., k ∈ {1, . . . , 100}). Correlograms are useful techniques for trend and seasonal pattern identification because, if the time series contains a seasonal fluctuation, the correlogram exhibits an oscillation at the same frequency.
Figure 3.5 gives an example of application, showing a time series and the corresponding correlogram. The autocorrelation function of the data set in Figure 3.5(a) is computed at different lags k, ranging in the interval [1, 100], and the result is represented by the heights of the columns of Figure 3.5(b). At a superficial analysis, the first figure displays a stationary time series, apparently independent and identically distributed. The information deriving from the computation of the autocorrelation function, instead, disproves this belief. The correlogram corresponding to the time series reveals a strong seasonal behavior that was "hidden" in the data set. The correlogram provides relevant extra information about the deterministic component that was not clearly present in the time plot of the data. By searching for the peaks in the autocorrelation function, it is possible to discover the time lag at which the time series repeats itself periodically.
The correlogram also carries useful information for time series prediction. A rapid decrease of the ACF curve means that the observed data set values exhibit low (or null) autocorrelation. Predictions on non-correlated data tend to produce unreliable future estimations. On the other hand, a slow decay of the ACF in the correlogram indicates that the time series shows a dependency among its values, and thus is rationally predictable.
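The ACF of Equation 3.1 can be sketched as follows (a hypothetical example on synthetic data, assuming NumPy; `acf` is an illustrative helper, not from the thesis). The peaks of the resulting correlogram reveal a seasonality of period 50 that is hard to see in the raw series:

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation function (Eq. 3.1) for lags 1..max_lag."""
    x = np.asarray(x, dtype=float)
    mu, var = x.mean(), x.var()
    n = len(x)
    return np.array([np.mean((x[:n - k] - mu) * (x[k:] - mu)) / var
                     for k in range(1, max_lag + 1)])

# A stationary-looking series with a "hidden" seasonality of period 50.
rng = np.random.default_rng(0)
t = np.arange(2016)
x = np.sin(2 * np.pi * t / 50) + rng.normal(0, 1.0, t.size)

r = acf(x, 100)  # r[k-1] is the autocorrelation at lag k
# The correlogram peaks around lag 50 and dips around lag 25,
# exposing the seasonal fluctuation buried in the noise.
```

Plotting `r` as column heights reproduces a correlogram in the style of Figure 3.5(b).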
Figure 3.5: Example of a time series with a "hidden" seasonality and the corresponding correlogram. (a) Plot of the time series; (b) Correlogram.
3.1.2 Periodogram
A basic idea in mathematics and statistics is to take a complicated object, such as a stochastic time series, break it up into a sum of simple objects that can be studied separately, see which ones can be thrown away as unimportant, and then add what is left back together again to obtain an approximation of the original object. The periodogram [112] of a time series is the result of such a procedure.
The periodogram is a very useful tool for describing a time series and identifying trend and seasonal patterns at unknown periods. It is much more useful than the correlogram, but it does require some training to interpret properly.
The basic idea is that time series of long period are smooth in appearance, whereas those of short period are very wiggly. Thus, if a time series appears to be very smooth (wiggly), then the values of the periodogram for long (short) periods will be large relative to its other values. In this case, we say that the data set has an excess of long (short) periods. For a purely stochastic series, all of the sinusoids should be of equal importance and thus the periodogram will vary randomly around a constant. Instead, if a time series has a strong sinusoidal signal at some period, then there will be a peak in the periodogram at that period. If a large peak is observed, it may well provide a clue to some important source of seasonality in the data. A strong seasonal component at a frequency such as 1/10 of the sampling period will result in a large spike, as shown in Figure 3.6.
Figure 3.6: Example of periodogram.
Periodograms can also show small peaks at multiples of the fundamental period, which reflect the fact that the seasonal oscillation is not perfectly sinusoidal. The contributions at very long periods come from the overall trend in the series.
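A periodogram along these lines can be sketched with the discrete Fourier transform (a hypothetical example on synthetic data, assuming NumPy; `periodogram` is an illustrative helper). The largest peak away from frequency zero recovers the period of a hidden sinusoid:

```python
import numpy as np

def periodogram(x):
    """Periodogram: power of the series at each frequency (cycles/sample)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    power = np.abs(np.fft.rfft(x - x.mean())) ** 2 / n
    freqs = np.fft.rfftfreq(n)
    return freqs, power

rng = np.random.default_rng(1)
t = np.arange(1000)
# Noisy series with a strong sinusoidal component of period 10 samples.
x = 2.0 * np.sin(2 * np.pi * t / 10) + rng.normal(0, 1.0, t.size)

freqs, power = periodogram(x)
peak_freq = freqs[np.argmax(power[1:]) + 1]  # skip the zero-frequency bin
dominant_period = 1.0 / peak_freq            # recovers the period of 10 samples
```

Away from the dominant peak the power varies randomly around a constant, matching the purely stochastic case described above.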
3.2 Spike component
All non-deterministic components of a time series are considered stochastic errors, that is, deviations of the time series from the expected systematic pattern. Random errors of data sets coming from process monitoring typically include a noteworthy spike component. It collects short-lived bursts departing from the time series mean in correspondence of unexpected and uncommon events in the sampled resource measure.
Figure 3.7 shows the typical behavior of spike components. Most data samples oscillate around the time series mean (equal to zero, in this case), assembling what we can consider the in-control state of the resource. Sample values strongly and abruptly diverging from the resource state are named spikes and correspond to out-of-control data samples.
Figure 3.7: Example of a spike component.
Spike samples are due to accidental and unexpected events in system activity that should prompt urgent inspection of their causes. Thus, the spike component plays a relevant role in stochastic time series analysis, since it assembles random errors that, unlike the noise component, cannot be ignored as irrelevant, but need further investigation for an appropriate resource control. For this reason, spike identification is critical for Internet Data Center management.
Since a spike signal should trigger examination activities on the monitored resource, it is extremely important that the technique for spike detection is precise, accurate, and instantaneous. We explore a popular spike detection technique: the sigma threshold test.
3.2.1 Sigma threshold test
A simple spike analysis technique, often used in the process quality control domain [114], is to declare as a spike all data instances that are more than sσ away from the time series mean µ, where σ is the standard deviation of the time series. The s parameter of the sigma threshold test is a positive integer that should be set according to the definition of spike in the different application contexts, the characteristics of the time series, and the performance the spike detection technique is required to achieve.
Figure 3.8 shows the effect of applying a sigma threshold test with s = 3. The black dotted lines above and below in the figure are plotted at three times the standard deviation of the gray time series. Mathematically, the µ ± 3σ region contains 99.7% of the data instances.
Figure 3.8: Example of 3σ threshold test for spike component.
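A minimal sketch of the sigma threshold test on synthetic data (hypothetical example, assuming NumPy; `sigma_threshold_spikes` is an illustrative name):

```python
import numpy as np

def sigma_threshold_spikes(x, s=3):
    """Flag as spikes all samples farther than s standard deviations from the mean."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    return np.abs(x - mu) > s * sigma

rng = np.random.default_rng(2)
x = rng.normal(0, 1.0, 2000)   # in-control oscillation around the mean
x[[100, 900, 1500]] += 12.0    # three injected out-of-control bursts

spikes = sigma_threshold_spikes(x, s=3)
spike_positions = np.flatnonzero(spikes)  # indices of the detected spikes
```

With s = 3, the three injected bursts fall well outside the µ ± 3σ band and are flagged, while almost all in-control samples remain inside it.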
According to control chart theory [114], the two dotted lines indicate the threshold at which the process output is considered statistically "unlikely". On the other hand, the points falling inside the central band indicate that the monitored process is currently under control. Here, the time series is stable, with variation coming only from sources common to the process.
Any observation outside the limits suggests the introduction of a new and likely unanticipated source of variation, causing a spike in the time series. Since such a variation means something unusual and unexpected, a sigma threshold test "signaling" the presence of a spike requires immediate investigation.
Control chart theory sets s = 3, since the choice of three-sigma control limits is supported by statistical theorems [125] and by empirical investigations of sundry probability distributions, revealing that at least 99% of observations occur within three standard deviations of the mean. Setting s < 3 leads to the signaling of a higher number of out-of-control values in the time series. It is appropriate only for very critical characteristics, or for monitored processes in which each detected spike does not cause a cascade of time-consuming investigations. Values of s higher than three detect stronger shifts in the data set's mean. They are used for time series that monitor less critical characteristics, or characteristics whose inspection procedures have a high impact every time samples are out-of-control.
Figure 3.9 gives a graphical example. A technique with s = 1 leads to the identification of a huge number of spikes, as evinced by the many samples with values outside the small central band in Figure 3.9(a). Every time the time series exceeds (in the positive or negative direction) one standard deviation of its values, a spike is signaled and an investigation procedure is activated.
Figure 3.9: Example of 1σ and 5σ threshold tests for spike component. (a) 1σ threshold test; (b) 5σ threshold test.
Opposite results are obtained for s = 5. In Figure 3.9(b), only a few values strongly departing from the time series mean correspond to spikes. This reduces the number of procedures activated to face unexpected behaviors of the monitored resource.
There is no optimal value for the s parameter. It must be tuned considering the monitored resource and the time series characteristics, as well as the context requirements and the impact of investigation activities. Besides, in Internet Data Center management there are always changes in the monitored resources, which make fixed control limits invalid. If the process improves and the monitored resource becomes more stable, fixed limits remain too wide and the sσ limits will not properly signal out-of-control samples. This translates into missed detections of true spikes. On the other hand, if the process worsens and the investigated resource increases its variability, the control limits become too narrow. This results in the spike technique reporting as out-of-control samples that would fall within naturally calculated sσ limits, and thus in the signaling of spikes that should instead be rejected. A runtime tuning of the s parameter is essential to avoid poor performance of the sigma threshold test caused by stochastic process variations.
3.3 Noise component
Beside spikes, stochastic noises are also deviations of the time series from the expected deterministic pattern.
In computing and information contexts, noise is typically considered unwanted data without meaning, that is, data that is not used to transmit a signal, but is simply produced as an unwanted by-product of other activities. Even if unwanted, the noise component carries important information that can be extensively used for a better understanding of the noise itself or of the stochastic time series behavior. Noise can be the key to analyze phenomena that are difficult to explain in a fully deterministic regime, such as the Internet Data Center activity.
Figure 3.10 displays a finite-length, discrete-time monitoring of a noise component generated from a typical resource usage of an Internet-based server.
Figure 3.10: Example of a noise component.
Many characteristics of the noise component deserve deep investigation, ranging from the more salient ones (such as variance and standard deviation) to the more hidden ones. Among them, we investigate the color of noise and the distribution of noise, which will be useful for future investigations and classifications of noise components in this work.
3.3.1 Color of noise
While noise is by definition derived from a stochastic signal, it can have different characteristic statistical properties corresponding to different mappings from a source of randomness to a real noise. Spectral density, that is, the power distribution in the frequency spectrum [6], is a property that can be used to distinguish different types of noise. This classification, carried out through spectral density, gives the so-called color terminology that assigns a different color name to each type of noise.
An important type of stochastic signals are the so-called white noise time series. "White" is used because, in some sense, white noise contains equal amounts of all frequency components, analogously to white light which contains all colors.
White noise has zero mean value:

µwhite = 0    (3.2)
There is no covariance or relation between sample values at different time indexes, and hence the autocovariance is zero for all lags k except for k = 0. This results in a random scattering component that is difficult to treat. The absence of relation among samples makes it impossible to model white noise.
Given a stochastic time series Xi, the autocovariance is a measure of how much the time series varies together with a time-shifted version of itself. Naming E[Xi] = µi the mean of each state of the time series and being Xj the shifted data set, the autocovariance is given by:

COVXX(i, j) = E[(Xi − µi)(Xj − µj)] = E[Xi · Xj] − µi · µj    (3.3)

where E is the expectation operator.
Thus, for a white signal, the autocovariance function follows the characteristic behavior shown in Figure 3.11: its main characteristic is a single relatively large value at lag k = 0.
Figure 3.11: Example of autocovariance function of a white noise time series. (a) White noise time series; (b) Autocovariance function.
White noise is an important signal in estimation theory because the purely random noise, which is always present in stochastic measurements, can be represented by white noise.
In contrast to white noise, colored noise does not vary completely randomly:

µcoloured ≠ 0    (3.4)

In other words, there is a covariance between the sample values at different time indexes. As a consequence, the autocovariance COVXX is non-zero for lags k ≠ 0. COVXX has a maximum value at k = 0, and decreases for increasing k. The autocovariance function of a colored noise time series has a behavior similar to that shown in Figure 3.12.
Figure 3.12: Example of autocovariance function of a colored noise time series. (a) Colored noise time series; (b) Autocovariance function.
Most practical contexts are nonlinear time-varying stochastic systems and, owing to the effect of feedback control, colored noise may arise [129]. Both white and colored noise may affect the analysis of the observed data coming from a stochastic system. Distinguishing between them is a relevant step in time series analysis. While the relation among colored noise samples can be somehow modeled for management purposes, white noise is hard to treat. It represents the unwanted error affecting time series that does not carry any useful information and can therefore be discarded. For these reasons, a deep analysis of noise characteristics and their effect in perturbing resource measures is fundamental to eliminate contaminated samples and retain only the main information useful for an efficient time series management.
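The contrast between white and colored noise can be sketched with the autocovariance of Equation 3.3 (a hypothetical example on synthetic data, assuming NumPy; the colored series is generated here with an AR(1) recursion, one simple way to obtain correlated noise):

```python
import numpy as np

def normalized_autocovariance(x, max_lag):
    """Sample autocovariance (Eq. 3.3) at lags 0..max_lag, scaled so COV(0) = 1."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    cov = np.array([np.dot(x[:n - k], x[k:]) / n for k in range(max_lag + 1)])
    return cov / cov[0]

rng = np.random.default_rng(3)
white = rng.normal(0, 1.0, 4000)  # white noise: unrelated samples

# Colored noise via an AR(1) recursion: each sample retains 90% of the
# previous one, so nearby samples co-vary.
colored = np.empty_like(white)
colored[0] = white[0]
for i in range(1, len(white)):
    colored[i] = 0.9 * colored[i - 1] + white[i]

cov_white = normalized_autocovariance(white, 50)
cov_colored = normalized_autocovariance(colored, 50)
# White: negligible values for all k > 0, as in Figure 3.11(b).
# Colored: maximum at k = 0 and a slow decay, as in Figure 3.12(b).
```

The two autocovariance profiles reproduce the qualitative shapes of Figures 3.11(b) and 3.12(b) and give a simple numerical criterion to tell the two noise types apart.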
3.3.2 Distribution of noise
Most methods for time series analysis assume that the data set is corrupted by Gaussian noise. This hypothesis is not always true, and needs statistical analyses to be confirmed. The Shapiro-Wilk normality test [113] is the most common technique to verify whether the noise distribution is normal.
The test can be done through a Q-Q plot [60], that is, a plot of the quantiles of two distributions against each other, or a plot based on estimates of the quantiles. Quantiles are points taken at regular intervals from the cumulative distribution function (CDF) of a stochastic variable, which describes its probability distribution. The pattern of points in the Q-Q plot is used to compare the two distributions.
Thus, a Q-Q plot is a probability plot, which is a graphical method for comparing two probability distributions by plotting their quantiles against each other. If the two distributions are similar, the points in the Q-Q plot will approximately lie on the line y = x. If the distributions are linearly related, the points in the Q-Q plot approximately lie on a line, but not necessarily on the line y = x. Q-Q plots can also be used as a graphical means of estimating parameters in a location-scale family of distributions, such as the Gaussian distribution.
We use Q-Q plots to compare the data set noise component to the standard normal distribution N(0, 1). This provides a graphical assessment of "goodness of fit". Since Q-Q plots compare distributions, there is no need for the values to be observed as pairs, as in a scatterplot, or even for the numbers of values in the two groups being compared to be equal.
Figure 3.13 shows three examples of Q-Q plots for normality testing. The points
plotted in a Q-Q plot are always non-decreasing when viewed from left to right. If
the noise distribution and the Gaussian one are identical, the Q-Q plot follows the
45° line y = x, as in Figure 3.13(a). If the two distributions agree after linearly
transforming the values in one of them, then the Q-Q plot follows
some line, but not necessarily the line y = x, as in Figure 3.13(b). If the general
trend of the Q-Q plot is flatter than the line y = x, the distribution plotted on the
horizontal axis is more dispersed than the distribution plotted on the vertical axis.
Conversely, if the general trend of the Q-Q plot is steeper than the line y = x,
the distribution plotted on the vertical axis is more dispersed than the distribution
plotted on the horizontal axis. Q-Q plots are often arced, or "S" shaped, as in
Figure 3.13(c), indicating that one of the distributions is more skewed than the
other, or that one of the distributions has heavier tails than the other.
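As a sketch of this comparison (the sample size and the number of quantile levels below are illustrative assumptions, not values taken from the thesis data), the quantile pairs of a Q-Q plot against N(0, 1) can be computed as follows:

```python
from statistics import NormalDist
import numpy as np

def qq_pairs(sample, n=99):
    """Pairs (Gaussian quantile, sample quantile) for a Q-Q plot of the
    sample against the standard normal distribution N(0, 1)."""
    probs = [(i + 0.5) / n for i in range(n)]             # regular CDF levels
    gauss_q = np.array([NormalDist().inv_cdf(p) for p in probs])
    sample_q = np.quantile(sample, probs)                 # empirical quantiles
    return gauss_q, sample_q

# For Gaussian noise, the points lie close to the line y = x
rng = np.random.default_rng(0)
noise = rng.normal(0.0, 1.0, 2016)    # e.g., one week of 5-minute samples
gq, sq = qq_pairs(noise)
```

For a heavier-tailed or skewed noise component, the extreme points drift away from the line y = x, producing the arced or "S" shapes discussed above.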
Figure 3.13: Examples of Q-Q plots (Gaussian quantiles vs. noise quantiles): (a) Q-Q plot of identical distributions; (b) Q-Q plot of linearly related distributions; (c) "S"-shaped Q-Q plot.
Chapter 4
PCA-based technique on collected data
This chapter details the first two phases of the multi-phase methodology proposed
in Chapter 2.
We use Principal Component Analysis to explore the intrinsic dimensionality
and structure of system resource behaviors, using data collected from a typical
Internet Data Center as described in Section 4.1. The data come from the monitoring
of both heterogeneous and homogeneous resource measures on a single server
and on the different servers of the system, as reported in Sections 4.1.1 and 4.1.2,
respectively.
Even though typical Data Centers have thousands of servers, we show in Sec-
tion 4.2 that, on long time scales (days to weeks), the structure of heterogeneous
and homogeneous resource measures can be captured well by remarkably
few dimensions. We find that, using fewer than 15 dimensions, one can accurately
approximate the behavior of a thousand monitored resource measures in the sys-
tem. In order to explore the nature of this low dimensionality, we introduce the
notion of eigenresources. An eigenresource, derived from a PCA of heteroge-
neous (see Section 4.2.1) or homogeneous (see Section 4.2.2) resource measures,
is a time series that captures a particular source of temporal variability (a "fea-
ture") in the resource behavior. Each resource time series can be expressed as
a weighted sum of eigenresources; the weights capture the extent to which each
feature is present in the given resource time series.
The proposed PCA-based technique uses several analysis steps, detailed in Sec-
tion 4.3, to classify eigenresources into a few main classes on the basis of their
main statistical properties. These properties are easy to map onto those of
the stochastic time series investigated in the previous chapter. The analyses
presented in Chapter 3 make it possible to understand eigenresource characteris-
tics and peculiar behaviors in Section 4.3.2. In Section 4.4, the eigenresources of
the same class are summed to create what we call a representative eigenresource,
which contains all the contributions of the server resources of the system to that
behavioral class. The PCA results, in terms of this small set of principal eigenresources,
give an exhaustive and complete representation of the behavioral components of
the entire Internet Data Center, since they collect the contributions of the moni-
tored resource measures of all the servers of the system.
4.1 Data collection
Let us start from the first phase of our methodology, highlighted in Figure 4.1.
The data considered in this work are collected from 50 servers of an Inter-
net Data Center of thousands of servers supporting several critical services. The
50 servers monitored in this study include Web servers, application servers and
database servers, following the typical multi-tier infrastructure for supporting Web-
based services. The servers run MWA (MeasureWare Agent) [37] to collect per-
formance data.
The data considered in this work are collected during a one-week period from
03/08/2010 to 03/14/2010. The decision to consider only one week's worth of
data is supported by several empirical tests: applying the PCA-based technique
described later to samples of two or more weeks gives essentially the same results
as those obtained on seven-day time series. For simplicity, in this work we
decide to report the outcomes of applying the multi-phase methodology to time
series collecting measures for a week.
System monitors aggregate (average) the resource measures of the 50 servers
every five minutes. These measures refer to the most important resources of
a server and compute the most interesting performance metrics for a complete
characterization of the server.
Figure 4.1: First phase of the multi-phase framework.
In Section 4.1.1, we describe the heterogeneous resource measures monitored
on each server node, while a general comparison of the behaviors of homogeneous
resource measures monitored on different servers of the Internet Data Center is
given in Section 4.1.2.
4.1.1 Heterogeneous resources of a single server
MWA system monitors collect several performance metrics of servers, such as:
• CPU utilization, queue lengths, and related processor metrics;
• Memory usage, caching, and other memory-related metrics;
• Network utilization, errors, and other metrics pertaining to network activity;
• CPU, memory and network usage broken down by specific applications;
• End-to-end transaction times.
System monitors average the performance measures collected over five-minute
intervals and write them to a log file, which is subsequently extracted and stored in
a central repository. Among the monitored resource performance metrics, 21 have
been selected as interesting for server state characterization. For each of the 21
metrics, Table 4.1 reports the syntax used by system monitors and the metric it
corresponds to.
From the collected data, we generate a heterogeneous resource measurement
matrix χhetero for each server of the system. It is a t × p matrix, where the number
of rows t is the number of time intervals (t = 2016 five-minute intervals within
the one-week period) and the number of columns p is the number of resource
measures considered on the server (p = 21).
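As a minimal sketch (with placeholder values standing in for the monitored metrics), χhetero can be assembled by stacking the p metric series as columns:

```python
import numpy as np

t, p = 2016, 21                  # one week of 5-minute samples, 21 metrics
rng = np.random.default_rng(1)
series = [rng.random(t) for _ in range(p)]   # placeholder resource time series
chi_hetero = np.column_stack(series)         # t x p measurement matrix
```

Each column of chi_hetero is then one resource time series.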
Every column of the matrix is a resource time series: in Figure 4.2 we report
some examples of time series monitored for a week on a database server of the
Internet Data Center under examination. All these figures share the common trait
that the performance metrics coming from system monitors are extremely variable
and stochastic. Despite that, different resources and measures show different
sample value domains, different characteristic behaviors, and different weights of
the constituent deterministic and error components.
CPU utilization in Figure 4.2(a) is a bounded resource measure strongly driven
by the seasonal component. This is evident in the periodic behavior of the time
series, where increases during diurnal activity are followed by decreases during
the night. The same seasonal pattern guides the network packet rate measure in
Figure 4.2(c), even though the time series values reach different orders of magnitude.
Memory usage measures in Figure 4.2(b) seem primarily driven by an increasing
non-linear trend component, while the time series of system call rates shows
periodic spikes in Figure 4.2(d).
Each time series gives a representation of the behavior of a single monitored
resource with respect to the precise measure it collects. Studying each time series
alone can give interesting information for the management of one resource of a
Syntax           Description
Syscall Rate     The average rate of system calls made during the interval
Uptime Hours     The system up-time of the monitored system
Active CPUs      The number of CPUs on-line on the system
CPU%             The percentage of time the CPU was not idle during the interval
CPU Time         The total time, in seconds, that the CPU was not idle in the interval
Idle CPU%        The percentage of time the CPU is not processing instructions
Idle Time        The time, in seconds, that the CPU was idle during the interval.
                 This is the total idle time, including waiting for I/O
Pk FS Sp%        The percentage of occupied disk space to total disk space
                 for the fullest file system found during the interval
Cache Rd Rt      The amount of physical memory (in MBs unless otherwise
                 specified) used by the buffer cache during the interval
Memory%          The percentage of physical memory in use during the interval.
                 This includes system memory, buffer cache and user memory
User Mem%        The percentage of physical memory allocated
                 to user code and data at the end of the interval
Pg Out Rate      The number of KBs per second of pages paged-out from
                 system memory to disk during the monitoring interval
Page Out         The total number of pages paged-out from system memory
                 to disk per second during the monitoring interval
Sys+Cache%       The percentage of physical memory used by the system
                 during the interval, including buffer cache
SysMem%          The percentage of physical memory used by the system (kernel)
                 during the interval
In Pkt Rate      The number of successful packets received through all
                 network interfaces during the interval
Out Pkt Rate     The number of successful packets sent through all network
                 interfaces during the interval
Network Pkt Rt   The number of successful packets per second (both inbound
                 and outbound) for all network interfaces during the interval
Alive Proc       The sum of the alive-process-time/interval time ratios
                 for every process
Active Proc      The sum of the alive-process-time/interval time ratios of every
                 process that is active (uses any CPU time) during an interval
Run Time         The elapsed time since a process started, in seconds

Table 4.1: System monitor's syntax and corresponding resource measure.
Figure 4.2: Examples of heterogeneous resource measures of a database server (2016 five-minute samples): (a) CPU utilization; (b) memory usage; (c) network packet rate; (d) system call rate.
single server, but it is useless in the context of Internet Data Center management.
As we consider 50 servers, we collect fifty χhetero matrices, each one composed
of 2016 × 21 measurements. Each matrix is itself a high-dimensional structure
residing in a high-dimensional space. Together, the measurements of the different
servers within the Internet Data Center form a high-dimensional multivariate
time series.
Since the heterogeneous resource measures of a single server node are the result
of the same external workload, we suppose that the resource time series share
common characteristics. We should expect the columns of χhetero to be related, so
that the intrinsic dimensionality of χhetero is less than p. Principal Component
Analysis, described in detail in Section 4.2, is a powerful approach to verify this
presumption quantitatively. Via PCA, in Section 4.2.1 we extract the most
relevant features of the χhetero matrix for each of the 50 monitored servers.
4.1.2 Homogeneous resources of different servers
Once a representative view of each server is obtained, the interest moves to the
entire set of servers and their interactions.
Homogeneous resources are monitored on the 50 servers of the Internet Data
Center, and the system monitor entries are collected in a homogeneous resource
measurement matrix χhomo for each resource measure of the system. It is a t × p
matrix, where t is the number of time intervals (as before, t = 2016) and the
number of columns p is the number of monitored servers in the Internet Data
Center (p = 50).
Since the number of monitored resource measures is 21, we obtain twenty-
one χhomo matrices of 2016 × 50 measurements. Each column of a matrix is a
time series representing the same resource measure collected on a different server.
A view of the same resource measure behavior on the different servers of the
system is given in Figure 4.3; the plotted resource measure is CPU utilization.
Figure 4.3: Examples of homogeneous resource measures (CPU utilization): (a) Web server; (b) application server; (c) database server.
Figures 4.3(a), (b) and (c) report, respectively, the CPU utilization monitored
contemporaneously on a Web server, an application server and a database server
of the Internet Data Center. What is evident is the different behavior of the same
metric on different servers of the infrastructure. This is clear evidence that
decisions made on the basis of the information given by a single server cannot be
extended to the management of the other servers of the Internet Data Center.
Because a resource measure on related servers is the result of a common user
activity, we should expect the columns of χhomo to be related. Also
in this case, a very useful method to verify this assumption quantitatively
is dimensionality analysis via PCA.
4.2 Principal Component Analysis
We now consider the second phase of the multi-phase methodology (see Fig-
ure 4.4) and give a complete explanation of the PCA-based technique.
We apply the Principal Component Analysis (PCA) technique to characterize each
resource behavior. We show that the monitored resource measures of the servers
of the Internet Data Center can be characterized by just a few features that are
sufficient to describe the whole system behavior. These features change from server to
server, and from resource to resource: we apply PCA on heterogeneous resources
monitored on a single server node in Section 4.2.1, and on homogeneous resources
monitored on the different nodes of the Data Center in Section 4.2.2.
In order to facilitate the subsequent discussion, we recall some relevant no-
tation introduced in the previous section. For each resource, let p denote the number
of time series referring to the resource, and t denote the number of successive time
intervals of interest. In this work, we study a system having on the order of
fifty servers and tens of monitored performance metrics, over long time scales
(days to weeks) and with sampling intervals of 5 minutes, so that t ≫ p. Let χhetero and
χhomo be the t × p measurement matrices, which denote, respectively, the heteroge-
neous resource time series of a single server node and the homogeneous resource
time series of all the nodes of the system.
Figure 4.4: Second phase of the multi-phase framework.
When considering χhetero, each column i denotes the time series of the i-th moni-
tored resource measure, and each row j represents an instance of all the monitored
resource measures on that server at time j. In the case of homogeneous resources
coming from different servers, each column i of χhomo denotes the resource time
series of the i-th server, and each row j represents an instance of the monitored
resource measure on all the servers at time j.
To simplify the presentation, we use the common name χ when referring
indifferently to both measurement matrices. We refer to individual columns of a
matrix using a single subscript, so the measurements of column i are denoted by χi.
Note that the χ matrices thus defined have rank at most p. Finally, all vectors in this
work are column vectors, unless otherwise noted.
58 PCA-based technique on collected data
PCA is a coordinate transformation method that maps the measured data onto
a new set of axes. These axes are called the principal axes or components. Each
principal component has the property that it points in the direction of maximum
variation or energy (with respect to the Euclidean norm) remaining in the data,
given the energy already accounted for in the preceding components. As such,
the first principal component captures the energy of the original data to the
maximal degree possible on a single axis. The next principal components then
capture the maximum residual energy among the remaining orthogonal directions.
In this sense, the principal axes are ordered by the amount of energy in the data
they capture.
The method of PCA can be motivated by a geometric illustration. An applica-
tion of PCA to a two-dimensional dataset is shown in Figure 4.5.
Figure 4.5: Example of 1D projection of 2D points in the original space.
The first principal axis points in the direction of maximum energy in the data.
To generalize to higher dimensions, as in the case of χ, take the rows of χ as
points in Euclidean space, so that we have a dataset of t points in Rp. Mapping
the data onto the first r principal axes places the data into an r-dimensional hy-
perplane.
Shifting from the geometric interpretation to a linear algebraic formulation,
calculating the principal components is equivalent to solving the symmetric eigen-
value problem for the matrix χTχ. The matrix χTχ is a measure of the covariance
between the server resource time series. Each principal component vi is the i-th
eigenvector computed from the spectral decomposition of χTχ:

χTχ vi = λi vi,   i = 1, . . . , p     (4.1)
where λi is the eigenvalue corresponding to vi. Furthermore, because χTχ is
symmetric positive definite, its eigenvectors are orthogonal and the corresponding
eigenvalues are non-negative reals. By convention, the eigenvectors have unit norm
and the eigenvalues are arranged from large to small, so that λ1 ≥ λ2 ≥ . . . ≥ λp.
Once the data have been mapped into the principal component space, it can be
useful to examine the transformed data one dimension at a time. Considering
the data mapped onto the principal components, we see that the contribution of
principal axis i as a function of time is given by χ vi. This vector can be normalized
to unit length by dividing by σi = √λi. Thus, we have for each principal axis i:

ui = χ vi / σi,   i = 1, . . . , p     (4.2)
The ui are vectors of size t and are orthogonal by construction. The above equa-
tion shows that all the server resource behaviors, when weighted by vi, produce
one dimension of the transformed data. Thus, the vector ui captures the temporal
variation common to all time series along principal axis i. Since the principal axes are
in order of contribution to the overall energy, u1 captures the strongest temporal
trend common to all server resource measures, u2 captures the next strongest,
and so on. Because the set {ui}, i = 1, . . . , p, captures the time-varying trends
common to the resource behaviors, we refer to the ui as the eigenresources of χ.
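Equations (4.1) and (4.2) translate directly into code. The sketch below uses a random matrix as a stand-in for the measurement data and verifies that the resulting eigenresources are orthonormal:

```python
import numpy as np

rng = np.random.default_rng(0)
t, p = 2016, 21
chi = rng.random((t, p))            # stand-in for the measurement matrix

# Spectral decomposition of chi^T chi (Eq. 4.1)
lambdas, V = np.linalg.eigh(chi.T @ chi)
order = np.argsort(lambdas)[::-1]   # arrange eigenvalues from large to small
lambdas, V = lambdas[order], V[:, order]

# Eigenresources (Eq. 4.2): u_i = chi v_i / sigma_i with sigma_i = sqrt(lambda_i)
sigmas = np.sqrt(lambdas)
U = (chi @ V) / sigmas              # columns of U are the eigenresources u_i
```

The orthonormality of the columns of U follows from the decomposition itself: U^T U = diag(1/σ) V^T (χ^T χ) V diag(1/σ) = I.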
In Figure 4.6 we show a typical example of an eigenresource ui and its cor-
responding principal axis vi. The eigenresource captures a pattern of temporal
variation common to the set of time series referring to the CPU utilizations of dif-
ferent servers, and the extent to which this particular temporal pattern is present
in each monitored server's CPU utilization is given by the entries of vi. In
this case, we can see that this eigenresource feature is most strongly present in
server 44 (the strongest peak in vi).
The elements of {σi}, i = 1, . . . , p, are called singular values. Note that each
singular value is the square root of the corresponding eigenvalue, which in turn is
Figure 4.6: Example of an eigenresource and the corresponding principal component: (a) eigenresource 2 over one week (Mon to Sun); (b) the entries of principal component 2 across the 50 servers.
the energy attributable to the respective principal component. Thus, the singular
values are useful for gauging the potential for reduced dimensionality in the data,
often simply through their visual examination in a scree plot.
A scree plot shows, in descending order of magnitude, the singular values
of χ. Such a plot, when read left to right across the abscissa, can often show a
clear "elbow" that separates the "most important" components from the "least
important" ones. That sharp drop in the plot signals that the subsequent components
are negligible. An example of a scree plot is shown in Figure 4.7: it shows a "big
gap" between the ninth and the tenth singular values, so the first nine principal
components are retained and the rest are discarded.
Figure 4.7: Example of a scree plot (singular value magnitudes in descending order).
A commonly used quantitative guideline to choose how many dimensions to retain
is the Kaiser criterion [70], which states that we retain only the factors with singular
values greater than 1. In essence, this is like saying that, unless a singular value
extracts at least as much as the equivalent of one original time series, we drop it.
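A minimal sketch of the criterion as stated here, i.e., counting singular values greater than 1 (note that the classical Kaiser rule is usually phrased in terms of eigenvalues of the correlation matrix):

```python
import numpy as np

def kaiser_r(chi):
    """Number of dimensions to retain: count of the singular values of chi
    that are greater than 1 (the criterion as stated in the text)."""
    s = np.linalg.svd(chi, compute_uv=False)   # singular values, descending
    return int(np.sum(s > 1.0))

# Toy matrix with known singular values 3, 2 and 0.5: retain r = 2
r = kaiser_r(np.diag([3.0, 2.0, 0.5]))
```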
Finding that only r singular values are non-negligible implies that χ resides on
an r-dimensional subspace of Rp. In that case, we can approximate the original χ
as:

χ′ ≈ σ1 u1 v1T + σ2 u2 v2T + . . . + σr ur vrT     (4.3)

where r < p is the effective intrinsic dimension of χ.
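The approximation in Eq. (4.3) can be checked numerically with a truncated singular value decomposition; the dimensions and the noise level in this sketch are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
t, p, r = 200, 20, 4
# A matrix lying close to an r-dimensional subspace of R^p
chi = rng.random((t, r)) @ rng.random((r, p)) + 1e-3 * rng.random((t, p))

U, s, Vt = np.linalg.svd(chi, full_matrices=False)
chi_r = (U[:, :r] * s[:r]) @ Vt[:r]   # sigma_1 u_1 v_1^T + ... + sigma_r u_r v_r^T

rel_err = np.linalg.norm(chi - chi_r) / np.linalg.norm(chi)
```

Since almost all the energy lies in the first r components, the relative reconstruction error is tiny.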
In the next sections we apply the PCA technique on the heterogeneous resources
of a single server node and on the homogeneous resources of different servers, with
the aim of extracting the intrinsic dimension of the resource data set and then
reducing the complexity of the whole-system problem.
4.2.1 PCA on heterogeneous resources of a single server
In this section we apply the PCA technique to the heterogeneous resources monitored
on a single server of the information infrastructure. This allows us to extract the
relevant resource information of a server and then reduce the number of resource
time series to be analyzed for its characterization. As there are 21 monitored
resources on each server, sampled every 5 minutes during a one-week period,
χhetero is the 2016 × 21 measurement matrix whose columns denote the time
series of each monitored resource of the server.
A first important result of applying PCA to the 21 resource time series of the
servers of the infrastructure is that only a small set of eigenresources is necessary
for a reasonably accurate reconstruction of system server behavior. This means that
the resource measures of a server form a multivariate time series of low effective
dimension.
The energy contributed by each eigenresource to the aggregate resource measures
is summarized in Figure 4.8(a), which shows the scree plot obtained
by applying PCA to the time series of the resource set of a database server, chosen as
representative. The unexpected result is that the vast majority of resource measure
variability is contributed by the first few eigenresources. The curve shows a very
sharp knee, revealing that only four eigenresources contribute most of the server
variability. In other terms, this result denotes that the resource measures together
form a structure with an effective dimension of 4, much lower than the number of time
series monitored on a server (21 in this case).
Figure 4.8: Scree plots for the resource time series of a database server: (a) resource time series; (b) normalized time series.
We now ask what the reason is for this low dimensionality in the resource
set data. There are at least two ways in which this low dimensionality may arise.
First, if the magnitude of variation among dimensions in the original time series
differs greatly, then the data may have low effective dimension for that reason.
This occurs when the variation along a small set of dimensions in the original data
is dominant. Second, a multivariate time series may exhibit low dimensionality
if there are common underlying patterns or trends across dimensions, in other
words, if the dimensions show non-negligible correlation.
We can distinguish between these cases in the resource analysis by normalizing the
resource time series before performing PCA. The standard approach is to normalize
each resource measure to zero mean and unit variance. Since normalization is
applied to both the χhetero and χhomo matrices, we have:

χ′i = (χi − µi) / σi,   i = 1, . . . , p     (4.4)

where µi ≡ µ(χi) is the sample mean of χi and σi is the sample standard deviation of χi.
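Equation (4.4) amounts to column-wise standardization; a minimal sketch with placeholder measurements:

```python
import numpy as np

def standardize_columns(chi):
    """Normalize each resource time series (column of chi) to zero mean
    and unit variance, as in Eq. (4.4)."""
    mu = chi.mean(axis=0)        # sample mean of each column
    sigma = chi.std(axis=0)      # sample standard deviation of each column
    return (chi - mu) / sigma

rng = np.random.default_rng(3)
chi = 50.0 + 10.0 * rng.random((2016, 21))   # placeholder measurements
chi_norm = standardize_columns(chi)
```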
If we find that the CPU utilization time series still exhibits low dimensionality
after normalization, we can infer that the remaining effect is due to common
temporal patterns among time series.
The result of applying PCA to the normalized version of the whole dataset is shown
in Figure 4.8(b). The most striking feature of this figure is that the sharp knee
from Figure 4.8(a) remains, but in correspondence with the second eigenresource.
That means that the first eigenresource collects most of the energy of the resource
measures of the server.
If we apply the Kaiser criterion to decide how many dimensions to retain, the
number of eigenresources with singular values higher than 1 is four, thus confirm-
ing the result previously obtained on non-normalized data. Furthermore, these
results on normalized resources ensure that the cause of the low dimensionality lies
entirely in the correlations among resource time series and in common behavioral
patterns. They give the additional information that the principal pattern of the
resources is carried by the first eigenresource, which contributes almost all the
energy of the data set.
A plot of the first eigenresource is reported in Figure 4.9(a): it shows an evi-
dent periodic behavior, following the diurnal activity of the database server pro-
cesses.
The other three principal eigenresources are reported in the remaining panels.
Figure 4.9(b) shows the manifest spike behavior of the second eigenresource: it
collects occasional bursts in the server activity. A decreasing trend is clearly
manifested by the third eigenresource in Figure 4.9(c), while the fourth eigenre-
source in Figure 4.9(d) retains all the random deviations of the resource time series
from the previous components.
These results allow us to consider only four time series to represent the overall
behavior of the database server, thus reducing the complexity of the whole resource
analysis. Now the problem can be solved by examining only the first 4 eigenre-
sources resulting from PCA and applying to them (or to a proper combination of
them) all the statistical models needed for server management.
Equivalent results are obtained on the resource sets of all the other servers
of the considered system. This reinforces the thesis that the complex behavioral
structure of a server can be reduced to a very small number of time series.
Figure 4.9: The principal eigenresources resulting from PCA on heterogeneous resources of a database server: (a) first; (b) second; (c) third; (d) fourth, each plotted over one week (Mon to Sun).
4.2.2 PCA on homogeneous resources of different servers
We now focus on homogeneous resource measures coming from the different
servers of the system. Through PCA, the ensemble of resources is decomposed
into its constituent set of eigenresources. In this case, the number p of time series
is equal to 50, as many as the servers of the considered infrastructure. As in the
case of heterogeneous resources, t is equal to 2016, since we consider time intervals
of 5 minutes over a timescale of a week.

The PCA results concerning eigenresources differ from resource to resource: con-
sidering CPU utilization time series of different servers or memory occupancy
gives different outcomes. We focus on the results of the PCA-based technique applied
to CPU utilization, which is the most representative resource for the application of
the proposed methodology on homogeneous resources of different servers. Need-
less to say, the technique can be applied to time series referring to any monitored
resource.
As obtained for heterogeneous resources, applying PCA on the 50 CPU uti-
lization time series of the servers returns a small set of relevant eigenresources
needed for an accurate reconstruction of the data set. This means that the CPU
measures form a multivariate time series of low effective dimension. Looking at the
resulting scree plot in Figure 4.10(a), we obtain the same effect discovered for het-
erogeneous resource measures: the first few eigenresources contribute the vast
majority of the data set variability. The knee of the curve shows that a handful of
eigenresources, from 4 to 9, contribute most of the CPU utilization variability. In
other terms, this result reveals that the CPU utilization measures together form a
structure with an effective dimension between 4 and 9, much lower than the number
of time series and servers (50 in this case).
As we are interested in underlying patterns or common trends across dimen-
sions, we show the results of applying PCA to the normalized version of the whole
dataset in Figure 4.10(b). Even in this case, we can see that the knee of Figure 4.10(a)
remains, even if less sharp, in nearly the same location. It is also clear that the
relative significance of the first few eigenresources has diminished somewhat.
Figure 4.10: Scree plots for CPU utilization time series: (a) CPU utilization time series; (b) normalized time series.
Taken together, these observations suggest that, while differences in time series
size contribute to the low dimensionality of CPU utilization measures, correla-
tions among time series (common underlying resource patterns) play a significant
role. As the previous discussion points out, these common underlying resource
patterns are in fact the eigenresources.
According to the Kaiser criterion, we evaluate the first 12 dimensions, shown in
Figure 4.11, as the most representative.
Figure 4.11: The twelve principal eigenresources resulting from PCA on homogeneous resources (CPU utilizations), first through twelfth, each plotted over one week (Mon to Sun).
Focusing on these dimensions and rejecting the other, irrelevant information allows
us to simplify the whole system analysis problem. Indeed, this problem can now be
solved on the basis of the information carried by a small number (12, in this case)
of time series, with the certainty of retaining all relevant information about system
behavior.
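As an illustrative sketch of how this dimensionality choice can be automated, the Kaiser criterion can be applied to the eigenvalue spectrum of the standardized measures. This is not the thesis code: the synthetic data set, the 5-minute sampling assumption (288 samples per day) and all numeric values below are hypothetical.

```python
import numpy as np

def kaiser_num_components(X):
    """Number of principal components to retain according to the
    Kaiser criterion: keep the components whose eigenvalue exceeds
    the average eigenvalue (equivalently > 1 on standardized data)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    s = np.linalg.svd(Z, compute_uv=False)   # singular values of the data
    eigvals = s ** 2 / (Z.shape[0] - 1)      # eigenvalues of the correlation matrix
    return int(np.sum(eigvals > eigvals.mean()))

# Hypothetical data set: one week at 5-minute sampling (2016 samples)
# of 100 CPU utilization series sharing a daily pattern.
rng = np.random.default_rng(0)
t = np.arange(2016)
daily = np.sin(2 * np.pi * t / 288)          # 288 samples = one day
X = np.outer(daily, rng.uniform(0.5, 1.5, 100)) + rng.normal(0, 0.1, (2016, 100))
k = kaiser_num_components(X)                 # few dominant dimensions survive
```

With strongly correlated series, a handful of eigenvalues dominate and the criterion retains only those dimensions.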
In the rest of the work, we will focus on CPU utilization as the most representative
resource. Since we are primarily interested in common temporal patterns,
we focus the analysis on the normalized resources. In fact, normalization ensures
that the common patterns captured by the eigenresources are not skewed by
differences in mean CPU utilization rates.
4.3 Analyzing eigenresources
To understand the information carried by the eigenresources, we inspect their
properties in Section 4.3.1, describe the three most common types and deepen
their analysis in Section 4.3.2.
4.3.1 A taxonomy of eigenresources
We analyze the complete set of resource measures of the servers of our infrastructure,
following the technique proposed for network flows in [10]. Although
we focus on CPU utilization, we find results similar to those obtained for network
traffic flow analysis. Across all of the eigenresources, there appear to be only
three distinctly different types. Representative examples of each eigenresource
type from server CPU utilization are shown in Figure 4.12.
Figure 4.12(a) shows an example of an eigenresource that exhibits strong periodicities.
The periodicities clearly reflect diurnal activity, as well as the difference
between weekday and weekend activity. Because this eigenresource appears to be
relatively predictable and shows strong trend and seasonal components, we refer
to it as a deterministic eigenresource.
Figure 4.12(b) shows an example of an eigenresource that exhibits strong, short-lived
spikes. This spike eigenresource shows isolated values that can be many
standard deviations (e.g., 4 or 5) from the eigenresource mean.
Spike eigenresources capture the occasional CPU utilization bursts and dips that are common
features of Web-based system behavior. The majority of eigenresources in the
considered dataset appear to be of this type.
[Figure: three panels over one week (Mon-Sun), (a) Deterministic eigenresource, (b) Spike eigenresource, (c) Noise eigenresource.]
Figure 4.12: Examples of the three types of eigenresources.
Figure 4.12(c) shows an example of an eigenresource that appears roughly stationary
and Gaussian. This noise eigenresource captures the remaining random
variation that arises as the result of multiplexing many individual server sources.
These three categories of eigenresources are only heuristically distinguished.
It is not our intent to suggest that any eigenresource can be unambiguously categorized
in this way. Nonetheless, we observe that these categories are distinct,
and that almost all eigenresources of our data set can be easily placed into one of
these categories.
To demonstrate this, we evaluate each eigenresource according to the following
criteria:
1. Does the autocorrelation function of the eigenresource have a strong periodicity
of 12 or 24 hours?
2. Does the eigenresource contain at least one outlier that exceeds 5 standard
deviations from its mean?
3. Does the eigenresource have a marginal distribution that appears to be nearly
Gaussian?
We judge whether each eigenresource meets one of these criteria by applying
some of the analyses described in Chapter 3.
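As an illustration, the three criteria can be wired into a simple classifier. This is a hedged sketch, not the procedure actually used in the thesis: the ACF strength threshold (0.3), the Q-Q correlation bound (0.99) and the assumption of 288 samples per day are illustrative choices.

```python
import numpy as np
from statistics import NormalDist

def _qq_correlation(z):
    """Correlation between sorted samples and theoretical normal
    quantiles: a numeric stand-in for judging Q-Q plot straightness."""
    n = len(z)
    probs = (np.arange(1, n + 1) - 0.5) / n
    theo = np.array([NormalDist().inv_cdf(p) for p in probs])
    return float(np.corrcoef(np.sort(z), theo)[0, 1])

def classify_eigenresource(u, samples_per_day=288):
    """Heuristic three-criteria classification of an eigenresource."""
    z = (u - u.mean()) / u.std()
    n = len(z)

    def acf(lag):
        # biased sample autocorrelation at the given lag
        return np.dot(z[:-lag], z[lag:]) / n

    # Criterion 1: strong 12 h or 24 h periodicity of the ACF
    periodic = max(abs(acf(samples_per_day // 2)),
                   abs(acf(samples_per_day))) > 0.3
    # Criterion 2: at least one outlier beyond 5 standard deviations
    has_spike = bool(np.any(np.abs(z) > 5))
    # Criterion 3: nearly Gaussian marginal distribution
    gaussian = _qq_correlation(z) > 0.99

    if periodic:
        return "deterministic"
    if has_spike:
        return "spike"
    if gaussian:
        return "noise"
    return "indeterminate"
```

Checking the criteria in this order mirrors the observation that, on our data, one and only one criterion holds for almost every eigenresource.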
The first criterion is evaluated through the correlogram (see Section 3.1) of the
eigenresource. An evident periodicity in the behavior of the autocorrelation function is
proof of the intrinsic periodicity of the corresponding eigenresource. This analysis
also allows us to evaluate the lag of samples after which the time series shows
a temporal repetition. An example of applying this criterion to an eigenresource
of the considered data set is shown in Figure 4.13(a). It is evident that the eigenresource
that we visually identified as deterministic has a distinct
periodic behavior of its autocorrelation function, with a lag of 288 samples (corresponding
to a one day interval). The maximum peak found at sample 288 corresponds
to the fundamental frequency of the time series. We can see that the ACF
repeats with this fundamental period (peaks are found at samples 576, 864,
1152 and 1440).
The second criterion is assessed through the 5σ threshold test (see Section 3.2),
evaluating whether some data samples of the eigenresource time series exceed five times
its standard deviation from the mean value. The choice of a threshold parameter
s = 5 confirms the criterion presented in [10]. Several tests with different sσ
thresholds have been exercised on the eigenresources coming from PCA. In the
application context of Internet Data Centers, the setting s = 5 demonstrated
the best performance in terms of detecting all and only the effective spike
components in the eigenresource set. In Figures 4.13(b.1) and (b.2) we show
two examples of visually identified spike eigenresources that actually have 5σ
excursions from the mean.
The last criterion is evaluated through the Q-Q plot test (see Section 3.3). In
Figure 4.13(c) we show the eigenresource that is visually categorized as a noise
eigenresource, manifesting a marginal distribution that is nearly Gaussian. The almost
straight crossed line indicates a close fit of the eigenresource to the standard
normal distribution N(0, 1).
[Figure: four panels, (a) Deterministic eigenresource (correlogram), (b.1) and (b.2) Spike eigenresources (5σ threshold test), (c) Noise eigenresource (Q-Q plot test against standard normal quantiles).]
Figure 4.13: Classifying eigenresources by using three statistical tests.
We use these tools to classify all the eigenresources of the considered data set.
Eigenresources for which none of the criteria above holds true are categorized as
“indeterminate”. On the CPU utilization time series, only one is indeterminate (contributing
4.77% to the overall energy). For all of the remaining eigenresources, one
and only one criterion holds true. Since in Section 4.2.2 we have demonstrated
that the first 12 eigenresources alone retain most of the variability of the CPU
utilization time series, we focus on this smaller set of dimensions.
By using the criteria above, we see that 3 of those 12 principal eigenresources
show a deterministic behavior and 3 a noise behavior. The remaining eigenresources
all have short-lived spikes. We underline that the different
eigenresource types appear in different regions when the eigenresources are
ordered by overall importance (i.e., by singular value). As presented in Table 4.2,
deterministic and noise eigenresources are found among the first seven eigenresources.
The next five in order are all classified as spike eigenresources.
Order  Eigenresource Type   Order  Eigenresource Type   Order  Eigenresource Type
1      Deterministic        5      Indeterminate        9      Spike
2      Noise                6      Noise                10     Spike
3      Noise                7      Deterministic        11     Spike
4      Deterministic        8      Spike                12     Spike

Table 4.2: Occurrence of eigenresource types in order of importance.
This result reveals that the most important source of variation in CPU utilization
measures is the systematic change due to periodic trends. After these
periodic trends, noise dispersions are next in importance. The least significant
contribution to CPU utilization variability comes from bursts or spikes.
These conclusions are confirmed in a more quantitative way by the data in
Table 4.3, which shows the fraction of the total energy that can be assigned to each
of the three eigenresource types. Deterministic eigenresources provide more than
two times the contribution of the noise class, and almost six times the contribution
of the spike class.
This is the case for the PCA-based technique applied to a resource measure
strongly dependent on periodic activities, such as CPU utilization.

               Deterministic    Spike            Noise            Indeterminate
               Eigenresources   Eigenresources   Eigenresources   Eigenresource
Contribution   58.49%           25.90%           10.84%           4.77%

Table 4.3: Contributions of eigenresource types.

Different resource measures may return different behavioral classes. What we can say is that,
on every one of the 21 resource measures tested, the PCA technique always
extracts three types of classes, never more.
To better understand the three behavioral classes, in the next section we investigate
the characteristics of deterministic, spike and noise eigenresources, in order to
improve the proposed methodology for whole system analysis.
4.3.2 Understanding eigenresources
The analysis of eigenresources has emphasized the central role of the three behavioral
classes in which all eigenresources can be placed. Nevertheless, eigenresources
belonging to the same class can carry different information. We now extend
the basic analysis proposed in [10] in order to better understand the statistical
characteristics of eigenresources and apply these results to improve Internet Data
Center management.
Let us start from several examples. In Figure 4.14 we report the eigenresources
classified as deterministic in Section 4.3.1, with the corresponding correlograms.
The first and the seventh eigenresources in Figures 4.14(a.1) and (c.1) have autocorrelation
functions with periodic peaks repeating every 288 samples. Thus, we
can infer that these two deterministic eigenresources have a seasonal behavior with
a temporal lag of one day. A periodic repetition is shown also by the ACF of the
fourth eigenresource, as can be seen in the correlogram of Figure 4.14(b.2).
However, there is a different time window in which the function repeats itself:
its values iterate with a lag of 576 samples, corresponding to two days of resource
measures. Thus, the fourth eigenresource collects the seasonal system behavior
that repeats every 48 hours of activity. This information is useful in choosing
which eigenresources are worth investigating to make management decisions.
[Figure: six panels showing the first, fourth and seventh eigenresources over one week (a.1, b.1, c.1) and the corresponding ACF tests (a.2, b.2, c.2).]
Figure 4.14: Deterministic eigenresources and corresponding correlograms.
We also collected PCA results on time series referring to longer time scales
(e.g., two weeks). In these contexts we can find dimensions whose autocorrelation
functions show different lags and multiple periodic behaviors. An example of
this multi-seasonal behavior is reported in Figure 4.15.
[Figure: ACF plotted over two weeks (Mon-Sun, twice).]
Figure 4.15: Example of correlogram showing a multi-seasonal behavior in a two-week resource sampling.
In all these cases, the discovered periodicity lags of the deterministic eigenresources
are multiples of one another. Considering them as equivalent within a
unique vision may not have a remarkable impact on the accuracy of the whole system
analysis, but this may not always be the case. If the considered data set is
influenced by different non-multiplicative periodicities concurring together in the
Internet Data Center seasonal activity, collecting them into one representative vision
may give meaningless results.
We assert that an accurate analysis of the deterministic eigenresources is a crucial
step in the proposed multi-phase methodology: it allows us to collect information
in a suitable way, gives meaningful information on the different periodic lags
that may influence the activity of the studied Web-based infrastructure, and guides
the setting of the right parameter values for those management algorithms that take
seasonal properties into account.
Careful investigation should also extend to the other two
classes of eigenresources. Dimensions with overall spike behavior
may likewise differ from one another in a meaningful way. Let us start from an
illustrative example.
Figure 4.16 shows the eigenresources resulting from PCA that satisfy the 5σ threshold
test. There is a clear difference between Figures 4.16(a)-(b) and Figures 4.16(c)-(d)-(e).
The eighth and ninth eigenresources show several consecutive instantaneous
spikes taking place during the entire monitored week. They preserve the characteristic
of immediacy, but lose their singularity and sparseness. The tenth to twelfth
eigenresources, by contrast, exhibit isolated, sporadic and uncommon bursts that
manifestly depart from the mean behavior of the time series.
In our context, the spike category can be split into two subclasses: that of recurrent
spike eigenresources, and that of sporadic spike eigenresources. The
former subclass includes all those spike eigenresources exhibiting frequent short-lived
spikes, repeating in an unpredictable way but consistently present during
the entire sampling period. The latter subclass comprises spike eigenresources
with strong occasional values departing many standard deviations from the eigenresource
mean. The examples in Figures 4.16(a)-(b) are assigned to the recurrent spike
subclass; the examples in Figures 4.16(c)-(d)-(e) to the sporadic subclass.
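One possible way to automate this split is to count the 5σ excursions: frequent excursions suggest the recurrent subclass, rare ones the sporadic subclass. The threshold of 10 excursions per monitored week is our illustrative assumption, not a value taken from the analysis above.

```python
import numpy as np

def spike_subclass(u, n_sigma=5, recurrent_threshold=10):
    """Split a spike eigenresource into 'recurrent' or 'sporadic'
    by counting its n-sigma excursions from the mean."""
    z = (u - u.mean()) / u.std()
    n_excursions = int(np.sum(np.abs(z) > n_sigma))
    return "recurrent" if n_excursions >= recurrent_threshold else "sporadic"
```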
[Figure: five panels, (a) Eighth through (e) Twelfth eigenresources over one week (Mon-Sun).]
Figure 4.16: Spike eigenresources and corresponding sigma threshold tests.
The noise class also needs deeper investigation. As discussed in Section 3.3,
noise signals can be classified as white or colored. White noise presents no covariance
or relation between time series values at different time samples, and
hence its autocovariance function is zero for all lags k except k = 0. Colored
noise does not vary completely randomly and its autocovariance is non-zero
for lags k ≠ 0. It is often incorrectly assumed that Gaussian noise is necessarily
white noise, yet neither property implies the other. Gaussianity refers to the
probability distribution with respect to the value, that is, the probability that the
signal has a certain given value, while the term “white” refers to the way the signal
power is distributed over time or among frequencies. We can therefore find
Gaussian white noise, but also Poisson, Cauchy, etc. white noises, as well as
colored Gaussian noise. Thus, once proved that the noise is normally distributed,
further investigations about the color of the noise are needed in order to obtain an
exhaustive knowledge of the noise component.
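A rough way to check the color of a noise component is to compare its sample autocorrelations against the approximate 95% confidence band ±1.96/√n expected for white noise. This is a common heuristic sketch, not the exact procedure used in the thesis; the 15% out-of-band decision threshold is our assumption.

```python
import numpy as np

def noise_color(x, max_lag=100):
    """Rough white-vs-colored check: count the sample autocorrelations
    at lags 1..max_lag falling outside the approximate 95% white-noise
    confidence band +/- 1.96/sqrt(n)."""
    z = (x - x.mean()) / x.std()
    n = len(z)
    acf = np.array([np.dot(z[:-k], z[k:]) / n for k in range(1, max_lag + 1)])
    outside = int(np.sum(np.abs(acf) > 1.96 / np.sqrt(n)))
    # about 5% of lags are expected outside the band for white noise;
    # the 15% decision threshold is an illustrative choice
    return "colored" if outside > 0.15 * max_lag else "white"
```

White noise keeps almost all lags inside the band, while a time-correlated (colored) series pushes a long run of early lags outside it.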
Figure 4.17 shows the three PCA resulting time series classified as noise eigenresources,
with the corresponding autocovariance test for the determination of the
noise color. As can be seen in Figure 4.17(b.2) and Figure 4.17(c.2), the third
and sixth eigenresources show the typical behavior of white noise, with a sudden
fast degradation of the autocovariance function from its peak at lag 0. We can say that
the third and sixth dimensions vary completely randomly as a function of time. A
different behavior is manifested by the second eigenresource: the autocorrelation
function displayed in Figure 4.17(a.2) is typical of colored noise, with positive
values for a wide window of lags centered at sample 0.
[Figure: six panels showing the second, third and sixth eigenresources over one week (a.1, b.1, c.1) and the corresponding autocovariance tests over lags from -200 to 200 samples (a.2, b.2, c.2).]
Figure 4.17: Noise eigenresources and corresponding autocovariance functions.
Thanks to these results, we can divide the noise eigenresource class into two
subclasses on the basis of the type of noise characterizing the time series: the white
noise eigenresources subclass and the colored noise eigenresources subclass. All the
noise eigenresources showing an ACF that abruptly decreases as soon as it departs
from k = 0 belong to the former subclass. In the considered context, the third
and sixth eigenresources are assigned to this subclass. The colored noise subclass
comprises, instead, the noise eigenresources that do not vary completely randomly in
time, as the second eigenresource does.
This classification is of great interest since it allows us to separate noise contributions
that are somehow time-related and predictable from those that are
completely random and hard to model.
An accurate examination of the characteristics of the three behavioral classes
helps to better understand system characteristics and the contributions of the relevant
dimensions, and to choose suitable methods and algorithms for system management. In
the next section, we show how the three main classes of eigenresources (deterministic,
spike and noise) contribute to the generation of three representative visions
of the entire system behavior, which can be used as a reliable starting point to solve
the whole system analysis problem.
4.4 Extraction of representative eigenresources
In this section, we show how the understanding of the three classes of eigenresources,
in light of the previous results, can yield the generation of three representative
eigenresources to solve the whole system analysis.
This is the main innovation of the proposed PCA-based technique. We collect all
the contributions of the resource measures monitored in a complex Internet Data
Center into an extremely simplified representation that is able, in its simplicity,
to carry all the relevant information of the system. This representation comprises
only three time series, whose investigation replaces the complex and time-consuming
analysis of thousands and thousands of resource time series that, on their
own, do not give any reliable information about the whole system state.
To evaluate the relative impact of the three classes on the overall behavior of
the system, we collect all the contributions of the eigenresources belonging to the
same class into an aggregate vision. For every monitored resource measure, we
create three representative eigenresources, one for each type of behavior. In
particular, for the CPU utilization time series we compute:
1. a representative deterministic eigenresource, Rdeterministic, including all
common trend and seasonal components of the CPU utilization time series;
2. a representative spike eigenresource, Rspike, collecting short-lived spikes
and all the contributions due to occasional bursts and dips in system CPU
utilization;
3. a representative noise eigenresource, Rnoise, capturing the random variations
of CPU utilization of the Internet Data Center servers.
The three representative eigenresources come from the weighted sum of all
the eigenresources in the set u_i, i = 1, ..., 12, showing that type of behavior. Each
eigenresource contribution is weighted on the basis of the corresponding singular
value σ_i, as follows:

Rdeterministic = Σ_{i | u_i ∈ deterministic class} u_i σ_i    (4.5)

Rspike = Σ_{i | u_i ∈ spike class} u_i σ_i    (4.6)

Rnoise = Σ_{i | u_i ∈ noise class} u_i σ_i    (4.7)

where i ∈ [1, 12].
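Equations (4.5)-(4.7) translate directly into code. A minimal sketch, assuming the eigenresources are available as the columns of a matrix U, with their singular values in sigma and class labels from the previous classification step (all names here are hypothetical):

```python
import numpy as np

def representative_eigenresources(U, sigma, labels):
    """Weighted sums of Equations (4.5)-(4.7): for each behavioral
    class, sum the eigenresources u_i (columns of U) weighted by the
    corresponding singular values sigma_i."""
    sigma = np.asarray(sigma, dtype=float)
    reps = {}
    for cls in ("deterministic", "spike", "noise"):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        reps[cls] = U[:, idx] @ sigma[idx] if idx else np.zeros(U.shape[0])
    return reps
```

The singular-value weights make the dimensions that carry more energy dominate each representative time series, exactly as the equations prescribe.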
Through the singular value weights, the three representative eigenresources give
more importance to those dimensions contributing more energy to the overall
resource measure of the system. These representations integrate all the relevant
dimensions in the total energy of the Internet Data Center (12 in this case).
Assembling the three system visions is a simple procedure that leads to an important
result, quite original in Internet-based contexts.
The representative time series still preserve the characteristics of their constitutive
dimensions. Figure 4.18 shows the three representative eigenresources resulting
from the application of the PCA-based technique to the CPU utilization time series.
The representative deterministic eigenresource in Figure 4.18(a) is a comprehensive
representation of the systematic component of the system, since it collects
all relevant information and does not contain contributions due to eigenresources
belonging to the other behavioral classes. It reveals a strong seasonal
component, coming from the periodic contributions of all the deterministic eigenresources
of the system. It consists of all the linear or non-linear elements that
change over time and repeat within the time range captured by the data. Since
trend and seasonal contributions add up in a weighted way, the representative
deterministic vision maintains higher magnitudes than the spike and noise ones,
even though the spike dimensions outnumber the other two classes.
[Figure: three panels over one week (Mon-Sun), (a) Representative deterministic, (b) Representative spike, (c) Representative noise.]
Figure 4.18: Representative eigenresources.
The contributions of the system error components are carried by the representative
eigenresources of the noise and spike classes. The representative spike eigenresource
shows several isolated spikes departing from the time series mean value, collecting
all the occasional bursts detected by the spike eigenresources. As shown in
Figure 4.18(b), adding up all the contributions of the spike eigenresources spoils the
characteristics of occasional bursts and dips: spikes in the representative vision
become more frequent, even if still unrelated.
A different behavior can be appreciated in Figure 4.18(c): the representative
noise eigenresource maintains the roughly stationary behavior of its constituent
noise eigenresources and collects into one exhaustive time series all the random
variations in Internet Data Center CPU utilization.
Some application contexts, such as anomaly detection and time series forecasting,
may need more precise system representations taking into consideration the
behavioral subclasses discovered in Section 4.3.2. Figure 4.19 adds the contributions
given by the split of the spike and noise classes into their constitutive subclasses.
It shows the representative sporadic spike eigenresource in Figure 4.19(b.1) and
the representative recurrent spike eigenresource in Figure 4.19(b.2), as well as
the representative white noise eigenresource and the representative colored noise
eigenresource in Figure 4.19(c.1) and Figure 4.19(c.2), respectively.
Beyond this refinement, the main outcome of the PCA-based technique lies in
isolating, into only a few representative time series, all the deterministic patterns and
all the error components of the resource measures of the servers. In the next chapters
we demonstrate that the occurrence of something strange or unexpected in
Internet Data Center activity properly reflects in one or more of the representations.
This is an important outcome that strengthens the value of the proposed
PCA-based technique for system management. Accidental events reflect in the
system representations only in the case of relevant incidents impacting the whole
system state and functioning. Episodes influencing the state of only one or a few
servers but having no important consequence for Internet Data Center activity are
not manifested by the representative eigenresources. This is a demonstration that
the representative time series are more effective for whole system management
than the single monitored resource time series coming from the single servers,
which also carry information that may not be useful to understand the state of the
[Figure: five panels over one week (Mon-Sun), (a) Representative deterministic, (b.1) Representative sporadic spike, (b.2) Representative recurrent spike, (c.1) Representative white noise, (c.2) Representative colored noise.]
Figure 4.19: Representative eigenresources with spike and noise subclasses.
entire Internet-based system.
Several kinds of applications could benefit from this, for instance all sorts of
algorithms for system management that base their decisions on the evaluation
of the whole system state. Thanks to the proposed PCA-based technique, decisions
could be made working on a small input, that is, the three representative eigenresources,
instead of the high number of system resource time series. Among these
applications, in this work we consider some mechanisms extracting interesting
information about the past, the present and the future behavior of an Internet-based
system. In particular, we model the past system behavior and forecast its
future performance, with the goal of an efficient runtime management of the Internet
Data Center in the present. To this purpose, Chapter 5 and Chapter 6 report
an overview of the modeling and forecasting problems, respectively. We introduce
the commonly used parametric models to address these problems and evaluate
their performance in stochastic contexts, such as that of Internet Data Centers.
Modeling and forecasting applications are then addressed to the on-line analysis
of present system performance in Chapter 7, in order to make runtime decisions
for whole system management.
Chapter 5
Tracking models
In this chapter, we consider application contexts where it is important to model
the past behavior of the Internet Data Center for an efficient management, either
off-line or on-line. We consider trend extraction for the modeling of the whole
system state in a relevant past.
5.1 Trend extraction
Trend extraction is a useful method to characterize time series behavior in a significant
past. It clarifies increasing and decreasing tendencies, seasonal periodicities,
cyclical patterns and other deterministic components in the time series. Trend estimation
provides meaningful information that can be used to understand time
series behavioral patterns, or as input for further management purposes. In this
work, the trend estimates are used as input data for time series forecasting in
Chapter 6 and as a state representation for state change detection in Chapter 7.
Most runtime techniques for Internet-based systems management rely on models
built on predictable trends and periodicities, which are in turn isolated
from noise and spike influences. For these models, one of the main difficulties
is to isolate the underlying meaningful time series patterns from the trivial error
components.
Another problem of most existing decision algorithms is that they work separately
on time series coming from the monitoring of a single resource of a server,
and make decisions on the basis of single representations.

Figure 5.1: Third phase: modeling the system behavior in the past.

This approach requires the investigation of as many time series as there are resource measures on each
server. Hence, the trend algorithm must be applied to thousands and thousands
of time series, each one with its own behavior and its own parameters to set.
The proposed multi-phase methodology solves most of these problems
and reduces the complexity of whole system management at runtime.
First of all, the multi-phase methodology isolates the deterministic components
of the Internet Data Center servers from random errors. This is done without
any assumption about system characteristics or any previous off-line study on the
choice of the best model parameters.
Second, it works on only one Rdeterministic time series that assembles all the relevant
deterministic information of system resources, and diminishes the time spent
in analyzing input time series and in applying decision models to each one of
them.
Moreover, the statistical characteristics of the representative deterministic eigenresource
guide the choice of suitable management algorithms and the correct setting
of their parameters. This makes it possible to adapt runtime decision models
to the specific context of the information system under investigation.
Finally, decisions made on the basis of the representative vision conform
to the state of the whole system, and not to the behavior of a specific server or one
of its peculiar resource measures.
5.2 Problem definition
There are many techniques for tracking the trend of a time series. The trend
represents a general systematic linear or (most often) non-linear component that
changes over time and does not repeat, or at least does not repeat within the time
range captured by the data. Relatively simple techniques, such as simple means or
medians, can provide acceptable results in some contexts. When data are stochastic
or volatile, or when the early identification of turning points is critical, it is necessary
to use more sophisticated mathematical models, which fall into two main categories:
interpolation and smoothing techniques, as shown in Figure 5.2.
5.2.1 Interpolation techniques
An interpolation function reveals the trend T passing through a certain number p
of selected points {x_1, ..., x_p} belonging to the observed data set. It is a specific
case of curve fitting, in which a function f must go exactly through the p data
points:

T = f(x_j),   j = 1, ..., p    (5.1)

with x_j ∈ X_i. On the basis of the main characteristics of the f function, the
interpolation methods can be classified into two main classes: linear interpolation
and non-linear interpolation models.
Linear interpolation
Figure 5.2: Trend estimation techniques classification.
Linear interpolation is a method of curve fitting through an f function computing
linear polynomials. Typical examples of linear interpolation are the
piecewise constant interpolation and the simple regression.
Consider the example in Figure 5.3. Figure 5.3(a) displays the selected
points x_j belonging to the data set. In this simple example, p = 7. The
trend estimation faces the problem of approximating the value for a non-given
point x_k in some space, x_k ∉ {x_1, ..., x_p}, when given the values
of some points around it.
The simplest interpolation method is to locate the data point x_j ∈ {x_1, ..., x_p}
nearest to x_k, and to assign to x_k the same value, f(x_k) = f(x_j), as shown
in Figure 5.3(b). The horizontal black lines passing through the data points
compose the estimation of the trend T resulting from the application of the
piecewise constant interpolation technique. In one dimension, there are seldom
good reasons to choose this simple method over regression, which is
almost as cheap. However, in higher dimensional multivariate interpolation
this can be a favorable choice for its speed and simplicity.
An example of simple regression interpolation is given in Figure 5.3(c). Suppose we want to determine f(2.5). Since 2.5 lies midway between 2 and 3, it is reasonable to take f(2.5) midway between f(2) and f(3). The black line of Figure 5.3(c) estimates the trend T as straight continuous segments linking all the data set points x_j. Linear interpolation is quick and easy, but it is not precise. Another disadvantage is that the trend T is not differentiable at the points x_j.

Figure 5.3: Graphical example of linear interpolation techniques: (a) plot of the data points x_j; (b) piecewise constant interpolation; (c) simple regression interpolation.
All linear interpolation techniques have low computational costs and can
provide acceptable results when the data set is subject to linear trends [62].
On the other hand, when the data set is characterized by a non-stationary and highly variable behavior, linear interpolation is not a reliable technique for trend identification. In this context, non-linear interpolation gives better results.
Non-linear interpolation
Non-linear interpolation is a trend estimation technique able to model highly curved time series through non-linear polynomials. The linear models fit a straight line or a flat plane to the data samples. Usually, the true relationship that we want to model is curved, rather than flat. To fit it, we need non-linear models, such as polynomial and spline interpolations.

Given some points of the data set, polynomial interpolation techniques estimate the trend T through polynomials of degree higher than 1 passing through the points x_j. Referring to the previous example, the sixth-degree polynomial in Figure 5.4(a) goes through all the seven points x_j. Generally, if we have p data points, there is exactly one polynomial of degree at most p − 1 passing through all the data points. The interpolation error is proportional to the distance between the data points to the power p [17]. Furthermore, the interpolant is a polynomial and thus infinitely differentiable. So, we see that polynomial interpolation solves all the problems of simple regression. However, polynomial interpolation also has some disadvantages. Calculating the interpolating polynomial is computationally expensive compared to simple regression. Furthermore, polynomial interpolation may exhibit oscillatory artifacts, especially at the end points.
These disadvantages can be avoided through the spline interpolation model [104, 127], which uses low-degree polynomials in each of the intervals [x_j, x_{j+1}] and chooses the polynomial pieces such that they fit smoothly together. The resulting function is called a spline. Figure 5.4(b) shows the trend T estimated by a cubic spline, where the polynomial pieces are of degree 3. For instance, the cubic spline is piecewise cubic and twice continuously differentiable.
Figure 5.4: Graphical example of non-linear interpolation techniques: (a) polynomial interpolation; (b) spline interpolation.
Like polynomial interpolation, spline interpolation incurs a smaller error
than that of linear interpolation and the interpolant is smoother. Moreover,
the spline interpolant is easier to evaluate than the high-degree polynomials
used in polynomial interpolation. It also does not suffer from Runge’s phe-
nomenon [109]. Despite that, both non-linear techniques have high compu-
tational costs and are often inadequate to work in contexts with short-term
real-time requirements.
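A minimal sketch of polynomial interpolation (with the same invented sample points as before): with p = 7 points, the unique degree-6 polynomial passes exactly through all of them.

```python
import numpy as np

# Seven hypothetical data points, invented for the example
xs = np.arange(1.0, 8.0)
fs = np.array([0.5, -0.3, 0.8, 0.1, -0.6, 0.4, 0.9])

# The unique polynomial of degree p - 1 = 6 through all p points
coeffs = np.polyfit(xs, fs, deg=len(xs) - 1)
poly = np.poly1d(coeffs)

# The interpolant reproduces every data point (up to round-off)
print(np.allclose(poly(xs), fs))

# Between the data points the interpolant may swing well outside the
# data range, a hint of the oscillatory end-point artifacts noted above
print(poly(1.5))
```

Evaluating the fitted polynomial between the outermost points typically shows the oscillations that motivate the spline alternative.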
5.2.2 Smoothing techniques
A smoothing technique is a function that aims to capture important patterns in the
data set, while leaving out noise. Some common smoothing algorithms are the
moving average, the autoregressive models and the filtering theory.
Moving average
Moving average techniques smooth out the observed data set and reduce the
effect of out-of-scale values. They are fairly easy to compute at runtime and
are commonly used as trend indicators [81].
The most used moving average techniques are the Simple Moving Average (SMA) and the Exponential Weighted Moving Average (EWMA), which compute a uniform and a non-uniform weighted mean of the past measures, respectively. These techniques tend to introduce an excessive delay in trend
representation when the number of past measures is large, while they do
not eliminate all noises when working on a small set of past samples. The
problem of choosing the best past data set size can be addressed when the
time series are stable.
Autoregressive
Autoregressive models comprise a group of linear smoothing formulas that attempt to filter a time series on the basis of the previous raw and filtered samples. A model that depends only on the previous filtered samples is called an Auto-Regressive (AR) model, while a model depending only on the raw data samples is called a Moving Average (MA) model. A model based on both raw and filtered samples is an Auto-Regressive Moving Average (ARMA) model. These models are adequate for stationary time series.
When the data set shows evidence of non-stationarity, it is preferable to use the Auto-Regressive Integrated Moving Average (ARIMA) model, which is a generalization of the ARMA model. It provides an initial differencing step, corresponding to the "integrated" part of the model, applied to remove the non-stationarity of the time series. The ARIMA model has the advantage that few terms are needed to describe a wide variety of time series processes, fewer than AR and MA models [120].
ARFIMA and ARCH [50,82] are further accurate autoregressive techniques
useful in modeling time series with long memory or exhibiting time-varying
volatility clustering, that is, periods of swings followed by periods of relative calm.
Filtering theory
Filtering theory is useful to reveal trends in time series. Its purpose is to remove some unwanted component or feature from a signal.

Recursive filters re-use one or more of their outputs as an input. If both the time series and the unwanted error component are Gaussian and uncorrelated, there is an optimal recursive filter, namely the Kalman Filter. It is a
set of mathematical equations that provides an efficient computational (re-
cursive) means to estimate the state of a process, in a way that minimizes
the mean of the squared error. This filter is very powerful in several aspects:
it supports estimations of past, present, and even future states, even when
the nature of the modeled time series is unknown [18].
Discrete Wavelet Transforms (DWT) and Discrete Fourier Transforms (DFT) are two representative techniques based on the filtering theory. These techniques belong to a popular and computationally efficient family of multi-scale basis functions for the decomposition of a signal into levels or scales and for the extraction of a denoised data set representation [102]. In the DWT, the data set is passed through filters with different cut-off frequencies at different levels, while the DFT decomposes the time series into a sum of periodic harmonics. The main difference is that wavelets are localized in both time and frequency, whereas the standard Fourier transform is only localized in frequency. Wavelets often give a better representation of the data set trend and are computationally more efficient than the Discrete Fourier Transform: a DFT of length p takes on the order of p log_2 p operations, as compared to the approximately p operations required by a DWT [61].
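As a small sketch of filtering-based trend extraction, the following low-pass DFT filter keeps only the first few harmonics of a synthetic noisy series (the cut-off of 5 harmonics and the data are arbitrary choices for the example, not values used in the thesis):

```python
import numpy as np

rng = np.random.default_rng(42)
t = np.linspace(0, 1, 256, endpoint=False)
signal = np.sin(2 * np.pi * 2 * t)                  # slow trend component
noisy = signal + 0.3 * rng.standard_normal(t.size)  # stochastic measurements

# DFT low-pass: keep only the first few harmonics, zero the rest
spectrum = np.fft.rfft(noisy)
spectrum[5:] = 0.0                                  # arbitrary cut-off for the sketch
trend = np.fft.irfft(spectrum, n=t.size)

# The reconstruction is much closer to the clean trend than the raw data
print(np.mean((trend - signal) ** 2) < np.mean((noisy - signal) ** 2))
```

Zeroing high-frequency bins discards most of the noise power while preserving the slow harmonics that carry the trend, which is the essence of the denoising step described above.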
Figure 5.5 shows the results of applying smoothing techniques to a stochastic and highly variable time series. An example of trend estimation is given for each category of smoothing techniques. The black line in Figure 5.5(b) represents the trend resulting from the Exponential Weighted Moving Average computed on the gray time series of Figure 5.5(a). In this example, the EWMA model considers 10 past measures. It produces a spiky and reactive representation of the data set, following all the variabilities of the time series, even the smallest ones. Similar results are achieved by the ARIMA(1,1,1) model in Figure 5.5(c). This autoregressive technique tracks the data set and smooths out only the major fringes of variability, thus resulting in a fluctuating representation strongly dependent on the values of the data samples. On the other hand, the DWT technique in Figure 5.5(d) cuts out almost all time series variability, thus resulting in the smoothest representation. This filtering technique represents well the overall trend of the data set and removes almost all the variability of the time series.

Figure 5.5: Graphical examples of smoothing techniques: (a) plot of the time series; (b) EWMA smoothing; (c) ARIMA smoothing; (d) DWT smoothing.
It is unreasonable to define which smoothing technique better estimates a
trend because the performance of each model must be related to the appli-
cation context and the time series characteristics. These requirements guide
the choice of the technique for trend estimation that best fits our interests,
and, equally important, the choice of the parameter values suitable to our
purposes. A careful setting of the number and values of the model parameters is fundamental to time series trend estimation.

In the next sections, we detail some interpolation and smoothing techniques, with a particular emphasis on their implementation and the parameters they depend on. These techniques are used in the following applications of this study as time series representations for state change detection and for time series forecasting.
5.3 Interpolation estimators
We discuss the simple regression, chosen among the linear interpolation techniques, and the cubic spline, as a non-linear interpolation technique.
5.3.1 Simple Regression (SR)
Simple regression fits a straight line through the set of the n past monitored sample values of the data set X_i, that are, X_{i,n} = [x_{i−(n−1)}, . . . , x_{i−1}, x_i]. Thus, the simple regression trend estimation SR(X_{i,n}) is computed as follows:

    SR(X_{i,n}) = α_i x_i + β_i    (5.2)

where the coefficient α_i is equal to the degree of variation between the first and the last sample of the data set X_{i,n}, that is:

    α_i = (x_i − x_{i−(n−1)}) / n    (5.3)

while β_i is set as:

    β_i = x_{i−(n−1)} − α_i n    (5.4)

as suggested in [8].
An SR-based trend estimator evaluates a new SR(X_{i,n}) value for each measure x_i collected during the observation period. The number of considered past samples n is a parameter of the interpolation model, hence hereafter we use the notation SR_n to indicate a simple regression tracker based on n past measures. Since simple regression models linear trends, it risks being inefficient when the data set is characterized by a non-stationary and highly variable behavior. Cubic splines are typically used to overcome this limit.
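The SR_n tracker can be sketched as follows. Note that this is only an illustrative variant: it fits the straight line by ordinary least squares over the window indices rather than with the endpoint-based coefficients of Eqs. (5.2)-(5.4), and the data are invented.

```python
import numpy as np

def sr_tracker(window):
    """Simple-regression trend value for the latest sample of the window.

    Fits a least-squares straight line over the n past samples and
    evaluates it at the most recent index (a least-squares variant of
    the endpoint-based coefficients of Eqs. (5.2)-(5.4)).
    """
    n = len(window)
    idx = np.arange(n)
    alpha, beta = np.polyfit(idx, window, deg=1)  # slope and intercept
    return alpha * (n - 1) + beta                 # line evaluated at the last index

# On perfectly linear data, the tracker reproduces the last value
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(sr_tracker(x))   # ≈ 5.0
```

On a stochastic window the returned value is the denoised, linearly extrapolated level of the most recent sample rather than the raw measure itself.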
5.3.2 Cubic Spline (CS)
An empirical analysis induces us to consider the cubic spline function [104], in the version proposed by Forsythe et al. [56]. This choice is motivated by the observation that lower-order spline curves (that is, with a degree less than 3) do not react quickly enough to time series changes, while spline curves with a degree higher than 3 are unnecessarily complex, introduce undesired ripples and are computationally too expensive to be applied in runtime contexts.
To define cubic spline functions, let us choose some control points (t_j, x_j) in the set of measured data values, where t_j is the measurement time of the sample x_j. A cubic spline function CS_J(t), based on J control points, is a set of J − 1 piecewise third-order polynomials p_j(t), where j ∈ [1, J − 1], that satisfies the following properties.
Property 1. The control points are connected through third-order polynomials:

    CS_J(t_j) = x_j,   j = 1, . . . , J
    CS_J(t) = p_j(t),  t_j < t < t_{j+1},  j = 1, . . . , J − 1
    (5.5)

Property 2. To guarantee a C^2 behavior at each control point, the first- and second-order derivatives of p_j(t) and p_{j+1}(t) are set equal at time t_{j+1}, ∀j ∈ {1, . . . , J − 2}:

    dp_j(t_{j+1})/dt = dp_{j+1}(t_{j+1})/dt
    d^2 p_j(t_{j+1})/dt^2 = d^2 p_{j+1}(t_{j+1})/dt^2
    (5.6)
If we combine Properties 1 and 2, we obtain the following definition for CS_J(t):

    CS_J(t) = [z_{j+1}(t − t_j)^3 + z_j(t_{j+1} − t)^3] / (6 h_j)
            + (x_{j+1}/h_j − (h_j/6) z_{j+1})(t − t_j)
            + (x_j/h_j − (h_j/6) z_j)(t_{j+1} − t)    (5.7)

∀j ∈ {1, . . . , J − 1}, where h_j = t_{j+1} − t_j and the x_j are the measured values. The z_j coefficients are obtained by solving the following system of equations:

    z_1 = 0
    h_{j−1} z_{j−1} + 2(h_{j−1} + h_j) z_j + h_j z_{j+1} = 6[(x_{j+1} − x_j)/h_j − (x_j − x_{j−1})/h_{j−1}],  j = 2, . . . , J − 1
    z_J = 0
    (5.8)
The spline-based trend estimation model CS(X_{i,n}), at time t_i, is defined as the cubic spline function CS_J(t_i), obtained through a subset of J control points belonging to the vector X_{i,n} of n past sample measures. We denote it as CS_n.

Although the cubic spline load tracker has two parameters and is computationally more expensive than linear interpolation techniques, it is commonly used in approximation and trend extraction contexts [51, 104, 127]. The cubic spline has the advantage of being reactive to load changes, and it is independent of the time series characteristics. Its computational complexity is compatible with runtime decision systems, especially if we choose a small number of control points J.
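The construction of Eqs. (5.5)-(5.8) can be sketched as follows: a natural cubic spline on a handful of invented control points, with the tridiagonal system of Eq. (5.8) solved by a dense solver for brevity.

```python
import numpy as np

def natural_cubic_spline(t, x, tq):
    """Evaluate at tq the natural cubic spline through control points (t[j], x[j])."""
    t, x = np.asarray(t, float), np.asarray(x, float)
    J = len(t)
    h = np.diff(t)                                # h_j = t_{j+1} - t_j
    # Second derivatives z_j from the tridiagonal system of Eq. (5.8),
    # with natural boundary conditions (z at both ends set to zero)
    z = np.zeros(J)
    if J > 2:
        A = np.zeros((J - 2, J - 2))
        b = np.zeros(J - 2)
        for j in range(1, J - 1):
            r = j - 1
            A[r, r] = 2.0 * (h[j - 1] + h[j])
            if r > 0:
                A[r, r - 1] = h[j - 1]
            if r < J - 3:
                A[r, r + 1] = h[j]
            b[r] = 6.0 * ((x[j + 1] - x[j]) / h[j] - (x[j] - x[j - 1]) / h[j - 1])
        z[1:-1] = np.linalg.solve(A, b)
    # Locate the interval containing tq and evaluate Eq. (5.7)
    j = int(np.clip(np.searchsorted(t, tq) - 1, 0, J - 2))
    d0, d1 = tq - t[j], t[j + 1] - tq
    return ((z[j + 1] * d0**3 + z[j] * d1**3) / (6.0 * h[j])
            + (x[j + 1] / h[j] - h[j] * z[j + 1] / 6.0) * d0
            + (x[j] / h[j] - h[j] * z[j] / 6.0) * d1)

# Hypothetical control points: the spline interpolates them exactly
t = [0.0, 1.0, 2.0, 3.0]
x = [0.0, 1.0, 0.0, 1.0]
print(natural_cubic_spline(t, x, 1.0))   # ≈ 1.0
```

Between the control points the piecewise cubics join with matching first and second derivatives, which is exactly the C^2 requirement of Property 2.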
5.4 Smoothing estimators
Among smoothing techniques for time series denoising, we first consider the class
of moving averagemodels. Moving averages are commonly used as trend indi-
cators [44, 81, 122], since they smooth out observed data, reduce the effect of
out-of-scale values and are fairly easy to compute at runtime. We consider two
classes of moving average algorithms (Simple Moving Average(SMA) andEx-
ponential Weighted Moving Average(EWMA) ) and some popular linear autore-
gressive models (Auto-Regressive(AR) andAuto-Regressive Integrated Moving
Average(ARIMA) ).
5.4.1 Simple Moving Average (SMA)
Simple Moving Average is the unweighted mean of the n past monitored sample values of the data set X_i, that are, X_{i,n} = [x_{i−(n−1)}, . . . , x_{i−1}, x_i]:

    SMA(X_{i,n}) = ( Σ_{i−(n−1) ≤ j ≤ i} x_j ) / n    (5.9)

An SMA-based trend estimator evaluates a new SMA(X_{i,n}) value for each measure x_i collected during the observation period. The number of considered past samples n is a parameter of the smoothing model, hence hereafter we use the notation SMA_n to indicate a Simple Moving Average tracker based on n past measures. Since the Simple Moving Average assigns an equal weight to each of the considered past data values, this model tends to introduce a significant delay in the time series representation, especially when the size of the subset X_{i,n} increases. Exponential Moving Average models are usually applied with the purpose of limiting this delay effect.
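A minimal SMA_n tracker, sketched on a synthetic series invented for the example:

```python
import numpy as np

def sma(window):
    """SMA(X_{i,n}): unweighted mean of the n past samples, Eq. (5.9)."""
    return float(np.mean(window))

# Sliding evaluation: one SMA value per newly collected measure
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])
n = 3
trend = [sma(x[i - (n - 1): i + 1]) for i in range(n - 1, len(x))]
print(trend)   # [4.0, 6.0, 8.0, 10.0]
```

The output lags the raw series by roughly (n − 1)/2 samples, which is the delay effect discussed above.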
5.4.2 Exponential Weighted Moving Average (EWMA)
Exponential Weighted Moving Average is the weighted mean of the n past monitored sample values, X_{i,n}, where the weights assigned to the samples decrease exponentially. An EWMA-based load tracker EWMA(X_{i,n}), at time t_i, is equal to:

    EWMA(X_{i,n}) = α x_i + (1 − α) EWMA(X_{i−1,n})    (5.10)

where the parameter α = 2/(n + 1) is the smoothing factor.

The initial value EWMA(X_{n,n}) is set to the arithmetic mean of the first n measures:

    EWMA(X_{n,n}) = ( Σ_{1 ≤ j ≤ n} x_j ) / n    (5.11)

Similarly to the SMA model, the number n of considered past data values is a parameter of the EWMA model, hence with EWMA_n we denote an Exponential Weighted Moving Average based on n past measures.
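The EWMA_n recursion of Eqs. (5.10)-(5.11) can be sketched as follows (the series is invented for the example):

```python
import numpy as np

def ewma_track(x, n):
    """EWMA_n over a series x: Eq. (5.10) with alpha = 2/(n + 1),
    initialized to the mean of the first n measures, Eq. (5.11)."""
    alpha = 2.0 / (n + 1)
    est = float(np.mean(x[:n]))            # EWMA(X_{n,n})
    out = [est]
    for xi in x[n:]:
        est = alpha * xi + (1.0 - alpha) * est
        out.append(est)
    return out

x = [1.0, 3.0, 2.0, 4.0, 10.0]
print(ewma_track(x, 3))   # [2.0, 3.0, 6.5]
```

Because recent measures receive exponentially larger weights, the tracker reacts to the final jump to 10.0 faster than an SMA of the same window size would.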
5.4.3 Auto-Regressive (AR)
The Auto-Regressive model is a weighted linear combination of the past p observed data values of the vector X_i, that are, X_{i,p} = [x_{i−(p−1)}, . . . , x_{i−1}, x_i]. An AR-based trend estimation model, at time t_i, can be written as:

    AR(X_{i,p}) = φ_1 x_i + . . . + φ_p x_{i−(p−1)} + e_i    (5.12)

where e_i ∼ WN(0, σ^2) is an independent and identically distributed sequence (called the residuals sequence). x_i, . . . , x_{i−(p−1)} are the data samples weighted by p linear coefficients, φ_1, . . . , φ_p, which are estimated from the first p values of the auto-correlation function computed on the X_i vector. The order p of the AR process is determined by the lag at which the partial autocorrelation function becomes negligible [21, 75]. It is a parameter of the AR model, hence with AR(p) we denote an autoregressive tracker based on p values.
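A sketch of AR coefficient estimation from the autocorrelation structure. Here the φ coefficients are obtained by solving the Yule–Walker equations built from the sample autocovariances, one standard way of relating them to the auto-correlation function (the AR(1) test process is synthetic):

```python
import numpy as np

def yule_walker(x, p):
    """Estimate AR(p) coefficients phi_1..phi_p via the Yule-Walker equations."""
    x = np.asarray(x, float) - np.mean(x)
    n = len(x)
    # Sample autocovariances r_0 .. r_p
    r = np.array([x[: n - k] @ x[k:] / n for k in range(p + 1)])
    # Toeplitz system R phi = r[1:]
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:])

# Synthetic AR(1) process with phi = 0.8: the estimate should land close to it
rng = np.random.default_rng(0)
x = np.zeros(2000)
for t in range(1, 2000):
    x[t] = 0.8 * x[t - 1] + rng.standard_normal()
print(yule_walker(x, 1))
```

The estimated coefficient converges to the true φ as the window grows, which is why AR trackers need a sufficiently long history to stabilize their parameters.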
5.4.4 Auto-Regressive Integrated Moving Average (ARIMA)
The Auto-Regressive Integrated Moving Average model is obtained by differencing d times a non-stationary sequence and by fitting an ARMA model, which is composed of the auto-regressive model (AR(p)) and the moving average model (MA(q)). The moving average part is a linear combination of the past q residual terms, e_i, . . . , e_{i−q} [21, 75]. An ARIMA model can be written as:

    ARIMA(X_{i,p,d,q}) = φ_1 x_i + . . . + φ_{p+d} x_{i−(p+d−1)} + θ_0 e_i + . . . + θ_q e_{i−q}    (5.13)

where θ_0, . . . , θ_q are linear coefficients.

An ARIMA model is guided by three parameters. Thus, we use the notation ARIMA(p,d,q), where p is the number of considered past values in the data set, q is the number of residual terms, and d is the number of differencing steps. An ARIMA model requires frequent updates of its parameters when the characteristics of the data set change.
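The "integrated" part of ARIMA is simply repeated differencing. The short sketch below (on an invented trending series) shows that one differencing step (d = 1) removes a linear trend, after which an ARMA model could be fitted to the stationary remainder:

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(500, dtype=float)
series = 0.05 * t + rng.standard_normal(500)   # linear trend + noise: non-stationary

diffed = np.diff(series)                       # the d = 1 differencing step

# The slope of the original series is visible; after differencing it is gone
slope_before = np.polyfit(t, series, 1)[0]
slope_after = np.polyfit(t[1:], diffed, 1)[0]
print(slope_before, slope_after)
```

In an actual ARIMA(p,1,q) fit, the AR and MA coefficients would then be estimated on `diffed` and the forecasts re-integrated by cumulative summation.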
5.5 Quantitative performance analysis
Trend estimation models should be evaluated in terms ofcomputational costand
estimation quality. For on-line decision contexts, we consider acceptable only
trend estimators having a computational complexity compatible with runtime requirements. In this section, we compare the computational cost of the described trend estimation models, in order to evaluate their efficacy for the on-line management of Internet Data Centers. We also report some interesting results about the quality of trend estimators applied in several experiments.
5.5.1 Computational cost
We evaluate the CPU time required by each described model to compute a new
value of the trend representation, in order to evaluate the possibility of applying it
to a runtime environment. Collected times do not include the system and commu-
nication times that are necessary to fill the observed data set.
The results evaluated on an average PC machine and reported in Table 5.1 refer
to a realistic system subject to heavy service demand, but they can be considered
representative of any workload. Computational costs are estimated for different
numbers of past samples (n) considered by the models. Behind the choice of
the parameters of the AR and ARIMA models there is an evaluation of the auto-
correlation and partial auto-correlation functions as in [21, 75]. For this analysis,
we choose the AR(32) and ARIMA(1,0,1) models as the best parameter settings for the considered workload. The table demonstrates that the computational cost of all the considered trend estimation models is compatible with runtime constraints, because all the models have a CPU time below 10 msec.
          n = 30   n = 60   n = 90   n = 120   n = 240
SR         0.462    0.448    0.456     0.461     0.494
CS         2.100    3.426    4.242     6.231    12.215
SMA        0.560    1.039    1.461     1.990     3.785
EWMA       0.059    0.059    0.059     0.059     0.059
AR         5.752    5.978    5.998     6.070     6.417
ARIMA      7.233    7.536    7.765     7.228     8.141
Table 5.1: CPU time (msec) for the computation of a trend value.
These results lead us to consider the previously described models adequate to support runtime decision systems in stochastic and highly variable workload scenarios.
5.5.2 Estimation quality
The evaluation of the trend estimation quality requires a representation of the effective time series trend, against which to compare the one estimated by the model. Due to the stochasticity of the time series, the simple mean is not a good indicator of the central tendency of the data set [81], hence we prefer to evaluate the effective time series trend as the approximate confidence interval CI = [T^U, T^L] [19]. It is an indicator of the approximate central tendency of the time series in specific periods of relative stability of the observed data set. T^U and T^L represent the upper bound and the lower bound, respectively, of this central tendency, and thus limit the region inside which the trend estimates should fall.

Since in our experiments we control the load generators, it is possible to compute off-line the periods of relative stability, considered as the time intervals during which we generate the same number of user requests, that is, we have the same number of active emulated browsers. For the estimation quality evaluation, we consider the data set shown in Figure 5.6, where the horizontal lines represent the upper T^U and lower T^L bounds of the approximate confidence interval.
Thanks to this definition, the estimation quality of the models can be computed
in terms of accuracy and responsiveness.
Accuracy
Accuracy evaluates the capacity of having small oscillations around the approximate confidence interval. The higher the accuracy, the better the model tracks the trend of the time series.
The accuracy error of a trend estimation model is the sum of the distances between each estimated value l_i computed at time t_i, i = 1, . . . , n, and the corresponding upper T^U_i or lower T^L_i bound of the approximate confidence interval at time t_i.

Figure 5.6: Example of time series and approximate confidence interval.

It is computed as:

    Σ_{i=1}^{n} d_i    (5.14)

where

    d_i = l_i − T^U_i,  if l_i > T^U_i
    d_i = T^L_i − l_i,  if l_i < T^L_i
    d_i = 0,            otherwise
    (5.15)
The accuracy error corresponds to the sum of the vertical distances between
each estimated value outside the approximate confidence interval and the
approximate confidence interval bounds.
For the sake of comparing different trend estimation models, we prefer to use a normalized value, such as the relative accuracy error. As a normalization factor, we consider the accuracy error of the observed data set. The relative accuracy error for any acceptable trend estimation model lies between 0 and 1; otherwise, the trend model is considered completely inaccurate and discarded.
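The accuracy error and its normalized form can be sketched as follows (the toy trend values and confidence bounds are invented for the example):

```python
import numpy as np

def accuracy_error(l, tu, tl):
    """Sum of the distances d_i of Eq. (5.15) from the interval [T^L_i, T^U_i]."""
    l, tu, tl = map(np.asarray, (l, tu, tl))
    d = np.where(l > tu, l - tu, np.where(l < tl, tl - l, 0.0))
    return float(d.sum())

# Hypothetical bounds and estimated trend values
tu = np.array([1.0, 1.0, 1.0, 1.0])
tl = np.array([0.0, 0.0, 0.0, 0.0])
trend = np.array([0.5, 1.4, -0.2, 0.8])     # one overshoot, one undershoot

err = accuracy_error(trend, tu, tl)
print(err)   # 0.4 + 0.2 = 0.6 (up to round-off)

# Relative accuracy error: normalize by the error of the raw data set
raw = np.array([0.5, 2.0, -1.0, 0.8])
rel = err / accuracy_error(raw, tu, tl)
print(0.0 <= rel <= 1.0)
```

Values inside the interval contribute nothing, so a tracker that oscillates but stays inside the bounds is still considered perfectly accurate by this metric.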
Responsiveness
Responsivenessevaluates the capacity of reaching as soon as possible the
representative load interval. It is a temporal requirementthat aims to repre-
sent the ability of a trend estimation model to quickly adaptitself to signif-
icant load variations.
Let tk, 1 ≤ k ≤ n, denote the time at which the representative trend exhibits
a new stable load condition that is associated to a significant change in the
number of users. (For example, in the data set shown in Figure5.6, we have
five changes andk ∈ C = {200, 340, 500, 700, 820}.) A model is more
responsive when its curve reaches the new approximate confidence interval
as soon as possible. LettK denote the instant in which the estimated trend
reaches for the first time one of the borders of the approximative confidence
interval associated to a new load condition.
The responsiveness errorof a trend estimation model is measured as the
sum of the horizontal differences between the initial instant tk of the new
load condition and the corresponding timetK at which the estimated trend
reaches the new interval. That means:
∑k∈C
|tk − tK | (5.16)
For reasons of comparison, we normalize the sum of the time delays by the
total number of changes, thus obtaining arelative responsiveness error.
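The responsiveness error of Eq. (5.16) can be sketched as follows; the change instants mimic those of Figure 5.6, while the arrival times of the estimated trend are invented:

```python
def responsiveness_error(change_times, arrival_times):
    """Eq. (5.16): sum over the change instants t_k of |t_k - t_K|,
    plus its normalization by the number of changes (relative error)."""
    delays = [abs(tk - tK) for tk, tK in zip(change_times, arrival_times)]
    return sum(delays), sum(delays) / len(delays)

# Hypothetical load changes and the times at which a tracker first
# re-enters the approximate confidence interval after each change
t_k = [200, 340, 500, 700, 820]
t_K = [215, 352, 510, 730, 828]

total, relative = responsiveness_error(t_k, t_K)
print(total, relative)   # 75 15.0
```

A smoother tracker (larger n) would show larger arrival delays and hence a larger relative responsiveness error, which is exactly the trade-off discussed below.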
We run several experiments to compare the accuracy of the trend estimation models and to evaluate their responsiveness for different settings of the model parameters. We carried out a large set of experiments, and in Figure 5.7 we graphically report a subset of the results, aware that the main conclusions of the experiments are representative of the typical behavior of the trend estimation models.
The SR, SMA and EWMA trend estimators are characterized by an interesting trade-off between accuracy and responsiveness. Working on a small (n ≤ 30) or large (n ≥ 200) amount of past samples causes a lower estimation quality than the one achieved by intermediate-size vectors (30 ≤ n ≤ 200). The motivations are different: for small values of n, the poor quality is caused by a high accuracy error, due to excessive oscillations of the estimated trend. For large values of n, instead, the low quality is the effect of a high responsiveness error, due to excessive delays of the estimates in reaching the approximate confidence interval.
For example, the SMA_30 curve in Figure 5.7(c.1) soon touches the representative load interval, but its accuracy is low because of many oscillations. On the other hand, the SMA_240 curve in Figure 5.7(c.2) is highly smoothed, but it follows the real load with too much delay, causing in this case a poor responsiveness. Similar results are achieved by the SR and EWMA models with n = 30 and n = 240. Better results are achieved working on an intermediate number of past samples. The best quality is reached by the SR_90 and EWMA_90 curves in Figures 5.7(a.1) and (d.1), which follow the approximate confidence interval more regularly, guaranteeing the best trade-off between accuracy and responsiveness. The AR and ARIMA models show low accuracy due to their jittery nature, as shown in Figures 5.7(e) and (f). The cubic spline model has a quite interesting behavior, because working on larger sets of n past samples leads to a monotonic improvement of the CS accuracy. Comparing Figures 5.7(b.1) and (b.2), we can appreciate how, for n = 240, the curve follows the approximate confidence interval much better than the cubic spline for n = 30, which scatters much more.
A comparison of all the results collected, and not reported here for reasons of space, shows that the AR and ARIMA models have the lowest accuracy. The best results of the SR, EWMA and SMA models are comparable, and are all achieved working on a set of n = 90 observed past values. Their accuracy is even better than that of the best cubic spline model, that is, CS_240.
The large set of experiments carried out on the models for the evaluation of their trend estimation quality leads to several conclusions that are interesting for Internet Data Center management.
First, there exists a clear relationship between the dispersion (that is, the standard deviation) of the observed data set and the choice of the best model parameters. A high dispersion of the observed data set, such as that of the heavy service demand, requires trend estimators working on a higher number n of observed past samples. On the other hand, the amount of past samples needed to obtain a precise trend estimate decreases when the workload causes a lower dispersion of the observed data set. The proposal of a theoretical methodology to find the "best" parameter for any trend estimation model, any workload and any application is out of the scope of this thesis. However, a large set of experimental results points out the existence of a set of feasible parameter values that guarantee an acceptable performance of the trend extraction models. This range of feasible values depends on the standard deviation of the observed data set.
Second, all the considered models are affected by a trade-off between the capacity of reaching as soon as possible the real behavioral trend of a time series, and that of having small oscillations around it. The two quality properties are in conflict, hence the perfect trend model with optimal accuracy and responsiveness does not exist. This trade-off can be solved only by considering the goals of the applications of the trend models. A runtime decision system that must take immediate actions may prefer a highly reactive trend estimator at the price of some inaccuracy. This is the case of trend estimates used as state representations for state change detection (see Section 7.1). On the other hand, when an action has to be carefully evaluated, a decision system prefers an accurate trend model even if it is less reactive. This is the choice in the case of the detection of collective anomalies (see Section ??).
Figure 5.7: Trend curves with respect to the approximate confidence interval: (a.1) SR_90 and (a.2) SR_240; (b.1) CS_30 and (b.2) CS_240; (c.1) SMA_30 and (c.2) SMA_240; (d.1) EWMA_90 and (d.2) EWMA_240; (e) AR(32); (f) ARIMA(1,0,1).
Chapter 6
Forecasting models
We now consider time series prediction models, which are oriented to forecasting the expected performance of an Internet Data Center. This chapter formalizes the problem and gives an overview of state-of-the-art models suitable for on-line forecasting in Internet-based contexts.
6.1 Time series prediction
On-line time series prediction is a classic problem for the estimation of the future load behavior and for guiding management decisions in complex Internet-based infrastructures.

Prediction models work on an ordered set of historical information. We define the historical information at sample i as an ordered collection of r data, S[r]_i, that starts at time t_{i−(r−1)}, covering measures up to a final time t_i, that is:

    S[r]_i = {s_j},  i − (r − 1) ≤ j ≤ i    (6.1)

where the i-th element is a pair s_i = (f_i, t_i). The first element of the pair, f_i, denotes the time series information, which can correspond to the monitored raw data or to a filtered representation of it. The second element of the pair, t_i, indicates its occurrence time.

A predicted value at step i is the output of a function conditioned on S[r]_i:

    f_{i+k} = g(S[r]_i) + ε_i    (6.2)
Figure 6.1: Third phase: forecasting the system behavior in the future.
in which g() is the function capturing the predictable component of the data set, ε_i models the possible noise, and k denotes the number of future steps to predict, that is, the so-called prediction window.
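A toy instance of Eq. (6.2) may help fix ideas. Here g() is a hypothetical drift extrapolator that projects the mean increment of the window k steps ahead; the function name and data are invented for the sketch:

```python
import numpy as np

def drift_predict(history, k):
    """Toy g(S[r]_i): extrapolate the average step of the window k steps ahead."""
    history = np.asarray(history, float)
    drift = (history[-1] - history[0]) / (len(history) - 1)   # mean increment
    return history[-1] + k * drift

# Window of r = 5 filtered measures, prediction window k = 3
s = [10.0, 11.0, 12.0, 13.0, 14.0]
print(drift_predict(s, 3))   # 17.0
```

Any of the models surveyed below can be read as a more sophisticated choice of g(), trading computational cost for a better treatment of the noise term ε_i.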
Different approaches have been proposed to perform time series forecasting in computer environments, ranging from simple heuristics to sophisticated modelling frameworks. We cite just the most important classes: linear time series models [44], neural networks [119], wavelet analysis [102], Support Vector Machines [38], and Fuzzy systems [119]. The choice of the most appropriate prediction model depends on the nature of the time series, on the amount of available a-priori knowledge, on the required forecasting accuracy, as well as on the requirements of the application context.
Most prediction models are designed for off-line applications. This is the case of genetic algorithms, neural networks, SVMs and Fuzzy systems, which may achieve a valid prediction quality only after long execution times. More complex prediction techniques, such as Kalman filtering [71], rely on parameters whose identification proves to be difficult in practical settings, particularly when no a-priori knowledge on the time series is available. Hence it is difficult or impossible to use them in dynamic runtime environments such as Internet-based systems.
The literature on time series prediction proposes many models to support on-line prediction. Each model has been developed to work best in a specific application context on the basis of the statistical properties of the time series under examination, such as variability, correlation, non-stationary or non-deterministic behavior. On-line prediction models can use different statistical methodologies to estimate their parameters. Choosing an adequate methodology for parameter estimation is crucial for the performance of prediction models, since this choice impacts not only the prediction quality, but also the computational cost of the prediction models. Consequently, the methodology used to estimate model parameters can limit the applicability of a prediction model to the different application contexts.

On the basis of the parameter estimation, we can distinguish static and adaptive prediction models.
Static models
A static prediction is characterized by a static choice of the model parameters. This means that the selection of the number of parameters is not optimized for every time series. Static solutions have a low impact on the computational cost of the prediction models and, for this reason, they are typically used in application contexts having short-term time requirements.
Adaptive models
An adaptive prediction model computes dynamically the number and the value of its parameters, in order to optimize its performance. It is especially useful in non-stationary and variable application contexts, where the prediction model needs to dynamically modify its parameters at every change in time series behavior. Choosing the best parameters improves the prediction quality at the price of a higher computational cost.
The possibility of applying a predictor in an on-line way decreases with the flexibility of the model or, equivalently, with the number of necessary parameters. Therefore, the more flexible the model, the less usable it is in practice. Besides that, as we consider time series evolving in time, all the model parameters must be updated on-line during the time series evolution. There exists a trade-off between the ability of a model to properly fit the signal and the number of parameters to compute at each update. An efficient trade-off is achieved by a wide range of on-line prediction models developed to forecast the behavior of internal resource measures of Internet-based applications.
We describe some forecasting models by distinguishing static and adaptive
estimation of their parameters.
6.2 Prediction models
We consider six main classes of time series forecasting models based on historical information that can be adapted to runtime contexts [20]: Simple Regression (SR) and Cubic Spline (CS) are based on interpolation trend estimation; Exponential Weighted Moving Average (EWMA), Holt's model (Holt's), Auto-Regressive (AR) and Auto-Regressive Integrated Moving Average (ARIMA) are based on smoothing trend estimators.
The models considered in this work lack the learning capabilities of other, more complex prediction algorithms, but in a runtime decision context it is mandatory to achieve good (not necessarily optimal) predictions quickly, rather than looking for the optimal decision in an unpredictable amount of time.
6.2.1 Simple Regression (SR)
A simple regression prediction k steps ahead at time t_i is equal to:

f̂_{i+k} = α_i k + β_i    (6.3)
where the coefficients α and β of the equation are chosen differently in the static and in the dynamic implementation of the model.

In the static-SR, the coefficient α_i is equal to the degree of variation between the first and the last sample of the data set S[r]_i, that is:

α_i = (f_i − f_{i−(r−1)}) / r    (6.4)

while β_i is set as:

β_i = f_{i−(r−1)} − α_i r    (6.5)
as suggested in [8].
This prediction method intercepts two points, (f_i, t_i) and (f_{i−(r−1)}, t_{i−(r−1)}), that are statically chosen in the data set S[r]_i. The simplicity of the model guarantees a very low prediction cost. The static-SR prediction quality is good when the data set is stable or is subject to long-term variations. On the other hand, when the data set is characterized by short-term variations, the SR model tends to overestimate the changes of the data set values, with a consequent low prediction quality.
Among the several adaptive-SR models proposed in the literature, we consider the Baryshnikov et al. model [14]. In this implementation, the coefficients α_i and β_i are dynamically chosen in order to minimize the mean quadratic deviation Σ_{j=i−(r−1)}^{i} [f_j − f̂_j]² between the data set S[r]_i and the predicted data set Ŝ_i. That means:

α_i = Σ_{j=i−(r−1)}^{i} (f_j − E[S[r]_i])(f̂_j − E[Ŝ_i]) / Σ_{j=i−(r−1)}^{i} (f_j − E[S[r]_i])²    (6.6)

β_i = E[Ŝ_i] − α_i E[S[r]_i]    (6.7)

where E[S[r]_i] and E[Ŝ_i] are the means of the time series values and of the predicted time series values, respectively.
The parameter optimization makes it possible to overcome the limits of the static-SR model, providing reliable predictions also when the time series changes its behavior frequently, but at the price of a higher complexity, leading to an increase of the model computation cost.
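As a sketch of the adaptive idea, one adaptive-SR step can be written with an ordinary least-squares fit over the time index of the window. Note this is a simplification we introduce for illustration: the thesis's Eqs. (6.6) and (6.7) regress against the predicted series, while here the slope and intercept come from a standard regression on the window itself:

```python
def adaptive_sr(window, k):
    """One adaptive-SR step (sketch): least-squares line over the r samples
    in the window, extrapolated k steps past the last sample.
    Index j runs 0..r-1 inside the window."""
    r = len(window)
    xs = list(range(r))
    mx = sum(xs) / r
    my = sum(window) / r
    num = sum((x - mx) * (y - my) for x, y in zip(xs, window))
    den = sum((x - mx) ** 2 for x in xs)
    alpha = num / den                 # slope, cf. alpha_i
    beta = my - alpha * mx            # intercept, cf. beta_i
    return alpha * (r - 1 + k) + beta

print(adaptive_sr([1.0, 2.0, 3.0, 4.0], k=2))  # 6.0: a perfect line extrapolates exactly
```

Re-estimating alpha and beta at every sample is what makes the model adaptive, and also what raises its cost with respect to the static variant.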
6.2.2 Cubic Spline (CS)
For the definition of the cubic spline model, let us choose P control points f_p, where p ∈ [1, P − 1], that are equally spaced samples of the data set S[r]_i. A CS model based on P control points is a set of P − 1 piecewise third-order polynomials that models the data set for a k-step ahead prediction as follows:

f̂_{i+k} = [z_{p+1}(i + k − p)³ + z_p(p + 1 − i − k)³] / (6 h_p)
         + (f_{p+1}/h_p − (h_p/6) z_{p+1})(i + k − p)
         + (f_p/h_p − (h_p/6) z_p)(p + 1 − i − k)    (6.8)

∀p ∈ {1, . . . , P − 1}, where h_p is the number of samples in the data set S[r]_i comprised between the control points f_p and f_{p+1}.
The z_p coefficients are obtained by solving the following system of equations:

z_0 = 0
h_{p−1} z_{p−1} + 2(h_{p−1} + h_p) z_p + h_p z_{p+1} = 6((f_{p+1} − f_p)/h_p − (f_p − f_{p−1})/h_{p−1})
z_n = 0    (6.9)
The CS prediction model is obtained through a subset of P control points from a data set S[r]_i of length r, and has the advantage of being reactive to changes in the data set behavior. Its computational complexity is compatible with on-line decision systems, especially if we choose a small number of control points P.

The static-CS is based on a constant number of control points that, in order to guarantee a low computational cost, must be low, as suggested in [8]. Since the quantity of information used by the CS predictor depends on the number of control points and on their position in the data set, the static-CS risks being unreliable, especially in non-stationary application contexts.
The adaptive-CS dynamically estimates the optimal number of control points and their position in the data set. This solution is based on the methodology presented in [59], which provides a control point sequence able to create the best interpolation of the data set S[r]_i. The adaptive-CS is particularly useful in non-stationary contexts characterized by a non-linear trend of the time series behavior.
Spline interpolation is the most suited method in highly variable (both in mean and variance) contexts. However, the computational cost of the adaptive-CS (which increases with the number of control points) risks limiting its applicability in those contexts having short-term time requirements.
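The tridiagonal system of Eq. (6.9) can be solved in linear time with the Thomas algorithm. The following pure-Python sketch (an illustration, not the thesis implementation) computes the z_p curvature coefficients under the natural boundary conditions z_0 = z_{P−1} = 0:

```python
def natural_spline_z(f, h):
    """Solve the tridiagonal system of Eq. (6.9) for the z_p coefficients of a
    natural cubic spline, given control values f[0..P-1] and spacings h[p].
    Pure-Python Thomas algorithm (forward elimination + back substitution)."""
    P = len(f)
    # Right-hand side for the interior points p = 1..P-2
    d = [6.0 * ((f[p + 1] - f[p]) / h[p] - (f[p] - f[p - 1]) / h[p - 1])
         for p in range(1, P - 1)]
    a = [h[p - 1] for p in range(1, P - 1)]                 # sub-diagonal
    b = [2.0 * (h[p - 1] + h[p]) for p in range(1, P - 1)]  # main diagonal
    c = [h[p] for p in range(1, P - 1)]                     # super-diagonal
    n = len(d)
    for i in range(1, n):            # forward elimination
        m = a[i] / b[i - 1]
        b[i] -= m * c[i - 1]
        d[i] -= m * d[i - 1]
    z_int = [0.0] * n
    if n:                            # back substitution
        z_int[-1] = d[-1] / b[-1]
        for i in range(n - 2, -1, -1):
            z_int[i] = (d[i] - c[i] * z_int[i + 1]) / b[i]
    return [0.0] + z_int + [0.0]     # natural boundary: z_0 = z_{P-1} = 0

# Control points lying on a straight line have zero curvature everywhere:
print(natural_spline_z([0.0, 1.0, 2.0, 3.0], [1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0, 0.0]
```

With the z_p in hand, evaluating Eq. (6.8) on the last segment gives the k-step extrapolation; the O(P) cost of this solve is exactly why the adaptive-CS becomes expensive as the number of control points grows.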
6.2.3 Exponential Weighted Moving Average (EWMA)
EWMA models predict the future value k steps ahead as a weighted average of the last available sample f_i and the previously predicted value f̂_i:

f̂_{i+k} = γ f_i + (1 − γ) f̂_i    (6.10)

where γ is called the smoothing factor.
The static-EWMA sets γ as follows:

γ = 2 / (r + 1)    (6.11)

where r is the size of the data set S[r]_i [94].
This choice leads to a simple linear algorithm characterized by a very low prediction cost. Its accuracy depends on the data set characteristics: in stable conditions, it exhibits a good prediction quality; when the data set is unstable, the prediction quality decreases accordingly. Besides the accuracy problem, another main issue is that the static-EWMA model generates a future value with a delay that is proportional to the size of the considered data set S[r]_i. This problem may prevent a valid EWMA application in runtime contexts that require reactive predictions.
In the adaptive-EWMA, the dynamic estimation at sample i of the parameter γ_i is:

γ_i = 2Φ / (σ²_{f̂_i} + Φ)    (6.12)

where Φ is the accepted noise component (estimated in terms of the variance of the modeled data set) and σ²_{f̂_i} is the on-line estimation of the process variance [101].
This dynamic choice of γ_i improves the model accuracy and limits the delay problem affecting the static-EWMA. The adaptive-EWMA is very useful in those application contexts with a variable noise component of the time series that have to guarantee the desired prediction performance under every noise condition.
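A minimal sketch of both EWMA variants follows. The static branch uses γ = 2/(r+1) from Eq. (6.11); the adaptive branch re-estimates γ_i at each step from a plain sample variance over the last r points, a simplification we assume in place of the on-line variance estimator behind Eq. (6.12):

```python
def ewma_predict(series, r, phi=None):
    """EWMA recursion of Eq. (6.10), returning the final smoothed prediction.
    phi=None  -> static gamma = 2/(r+1)           (Eq. 6.11)
    phi=value -> adaptive gamma_i = 2*phi/(var+phi) (simplified Eq. 6.12)"""
    gamma_static = 2.0 / (r + 1)
    pred = series[0]
    for i, f in enumerate(series):
        if phi is None:
            gamma = gamma_static
        else:
            window = series[max(0, i - r + 1): i + 1]
            mu = sum(window) / len(window)
            var = sum((x - mu) ** 2 for x in window) / len(window)
            gamma = 2.0 * phi / (var + phi)   # adaptive smoothing factor
        pred = gamma * f + (1.0 - gamma) * pred
    return pred

print(ewma_predict([10.0] * 20, r=5))  # stays at 10.0 on a constant series
```

The delay problem of the static variant is visible in this form: with γ fixed, a level shift in `series` needs on the order of 1/γ ≈ (r+1)/2 steps before `pred` catches up.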
6.2.4 Holt's Model (Holt's)
Holt's model is an extension of the EWMA one and is often used when the time series exhibits a linear trend. At step i, a prediction of a value k steps ahead is computed as:

f̂_{i+k} = l_i + b_i k    (6.13)
where l_i and b_i are recursively computed as follows:

l_i = ν f_i + (1 − ν)(l_{i−1} + b_{i−1})    b_i = η(l_i − l_{i−1}) + (1 − η) b_{i−1}    (6.14)

In the static-Holt's, starting values for these recursions are often set to:

l_{i−(r−1)} = f_{i−(r−1)}    b_{i−(r−1)} = f_{i−(r−2)} − f_{i−(r−1)}    (6.15)
The parameters ν and η are constants that are statically chosen in the ranges 0 ≤ ν ≤ 1 and 0 ≤ η ≤ 1 [68]. The prediction quality and the limits of the static-Holt's are quite similar to those of the static-EWMA. The static-Holt's suffers from high delays for increasing values of the parameter ν. Moreover, the Holt's prediction quality is particularly conditioned by the noise component of the data set.
The adaptive-Holt's provides a versatile solution: it estimates the model parameters by minimizing the conditional likelihood (see [68] for additional details). This adaptive methodology guarantees a dynamic support to forecast time series that exhibit non-linear behaviors and that are subject to a variable noise component.
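The level/trend recursions of Eqs. (6.13)-(6.15) translate directly into code. The constants ν and η below are illustrative choices, not values prescribed by the thesis:

```python
def holts_predict(series, k, nu=0.5, eta=0.3):
    """Static-Holt's sketch: run the level (l) and trend (b) recursions of
    Eq. (6.14) over the series, then extrapolate k steps via Eq. (6.13)."""
    l = series[0]                 # level init, Eq. (6.15)
    b = series[1] - series[0]     # trend init, Eq. (6.15)
    for f in series[1:]:
        l_prev = l
        l = nu * f + (1 - nu) * (l + b)          # level update
        b = eta * (l - l_prev) + (1 - eta) * b   # trend update
    return l + b * k              # k-step-ahead forecast, Eq. (6.13)

# On an exact linear trend the recursions lock onto slope 1:
print(holts_predict([float(x) for x in range(10)], k=3))  # ~12.0
```

Because the forecast is l_i + b_i k rather than a flat l_i, Holt's model keeps following a ramp k steps out, which plain EWMA cannot do.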
6.2.5 Auto-Regressive (AR)
A k-step ahead prediction through an AR model is a weighted linear combination of p values. These p values are constituted by the k − 1 predicted values f̂_{i+k−1}, . . . , f̂_{i+1} coming from the previous k − 1 steps, and by p − k values of the data set (f_i, . . . , f_{i−(p−k)}). These values are weighted by p linear coefficients ϕ_1, . . . , ϕ_p, which are the first p values of the auto-correlation function evaluated on S[r]_i. The order p of the AR process is defined by a statistical test based on the partial auto-correlation function that is described in [22, 75]. The last element of the AR model is the component ε_i, which is obtained as a function of the residual sequence (see [22] for additional details).

Hence, an AR-based predictor at step i can be written as:

f̂_{i+k} = ϕ_1 f̂_{i+(k−1)} + · · · + ϕ_p f_{i−(p−k)} + ε_i    (6.16)
When the data set is stable, the AR model represents an appreciable solution to the trade-off between prediction cost and prediction quality [44].

The static-AR computes the model parameters ϕ_1, . . . , ϕ_p through an initial training on a subset of the entire experiment data. The static-AR quality risks being low in highly variable scenarios [22] where the characteristics of the time series change over time.

The adaptive-AR, instead, uses a continuous update of the AR parameters at every prediction step. Updating ϕ for every new value f_{i+k} allows the model to capture the non-stationary and highly variable behavior that characterizes many system resources [28].
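For illustration only, the simplest instance p = 1 can be written down directly: the single coefficient ϕ_1 reduces to the lag-1 autocorrelation of the mean-centered window, and the k-step forecast iterates the one-step recursion. This is a simplified sketch of Eq. (6.16) that ignores the residual term ε_i:

```python
def ar1_fit_predict(series, k):
    """Adaptive-AR sketch for p = 1: re-fit phi_1 as the lag-1 autocorrelation
    of the current window, then iterate f <- phi_1 * f for k steps
    (on the mean-centered series), cf. Eq. (6.16) with the residual omitted."""
    n = len(series)
    mu = sum(series) / n
    c = [x - mu for x in series]                       # mean-centered window
    denom = sum(x * x for x in c)
    phi1 = sum(c[t] * c[t - 1] for t in range(1, n)) / denom
    f = c[-1]
    for _ in range(k):                                 # k-step-ahead iteration
        f = phi1 * f
    return f + mu

print(ar1_fit_predict([1.0, -1.0, 1.0, -1.0, 1.0, -1.0], k=1))  # 5/6 = 0.8333...
```

Re-running the fit on every new window is the adaptive-AR behavior described above; the static-AR would freeze phi1 after an initial training window.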
6.2.6 Auto-Regressive Integrated Moving Average (ARIMA)
A k-step ahead prediction through an ARIMA model is obtained by differencing d times the non-stationary sequence of the filtered values in S[r]_i and by fitting an Auto-Regressive Moving Average (ARMA) model, composed of the auto-regressive (AR) model described in Equation 6.16 and a moving average (MA) model.

The moving average part is a linear combination of the past (q − k) noise terms, e_i, . . . , e_{i−(q−k)}, weighted by the linear coefficients ϑ_1, . . . , ϑ_{q−k} [22, 75]. Hence, an ARIMA model is usually denoted as ARIMA(p, d, q), where p is the number of considered time series values, q − k is the number of residual values, and d is the order of differencing.
An ARIMA model for a k-step ahead prediction can be written as:

f̂_{i+k} = ϕ_0 + ϕ_1 f̂_{i+(k−1)} + · · · + ϕ_{p+d} f_{i−p−d+k} + ϑ_1 e_i + · · · + ϑ_{q−k} e_{i−(q−k)}    (6.17)

The ARIMA prediction model requires a careful choice of the model parameters, which is typically based on the evaluation of the auto-correlation and partial auto-correlation functions of the time series [22].
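The "integrated" part of the model is just repeated differencing; a minimal sketch (illustrative, not the thesis code) shows how the order d removes polynomial trend before the ARMA fit:

```python
def difference(series, d=1):
    """The 'I' step of ARIMA: difference the series d times so that a
    polynomial trend of degree d is removed before fitting ARMA."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

print(difference([1.0, 3.0, 6.0, 10.0], d=1))  # [2.0, 3.0, 4.0]
print(difference([1.0, 3.0, 6.0, 10.0], d=2))  # [1.0, 1.0]
```

After forecasting on the differenced series, the inverse operation (a cumulative sum starting from the last observed values) maps the prediction back to the original scale.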
The static-ARIMA evaluates the parameters considering an initial subset of the experiment data, and on this subset computes the statistical analysis of the auto-correlation and partial auto-correlation functions. This static solution has some difficulty predicting accurate values when the data set is extremely variable.

The adaptive-ARIMA improves on the static-ARIMA performance with a continuous re-evaluation of its parameters at every prediction of a new value f_{i+k}. This adaptive solution leads to a better prediction quality, but at the price of a high computational cost that risks limiting the applicability of the adaptive-ARIMA in many application contexts, such as short-term ones, which have time requirements in the order of seconds.
6.3 Quantitative analysis
We evaluate the performance of the two classes of prediction models in the case of static or dynamic parameter estimation. The combination of models and parameter choices yields the following four classes of approaches.
Static prediction on Data set: SD-policy
The raw data observed at different time scales do not receive any data treatment and are directly modeled by a predictor based on a static configuration of its parameters. This policy is often used in application contexts with data sets showing a stable behavior with a low noise component.
Static prediction on Trend estimation: ST-policy
The sampled data first undergoes a trend extraction treatment based on the on-line Discrete Wavelet Transform (DWT), described in [95], since it is considered the optimal solution to remove noise from the signal in stochastic contexts such as Internet-based applications. Subsequently, the new data representation is predicted using models based on a static parameter estimation. The ST-policy is useful for application contexts showing undesirable effects of the noise component and that do not change their statistical properties over time.
Adaptive prediction on Data set: AD-policy
No data treatment is applied to the observed data set, which is directly predicted by adaptive models that dynamically update their parameters on the basis of the statistical properties of the time series. This solution can be adopted in application contexts subject to a low noise component and non-stationary behavior.
Adaptive prediction on Trend estimation: AT-policy
The data treatment based on the on-line DWT is applied to the observed data set to minimize its noise component and to eliminate the presence of outliers. The new data representation is predicted using adaptive prediction models. The AT-policy is particularly indicated for the on-line prediction of non-stationary and stochastic measurements.
The performance evaluation of the prediction models under the different policies enables us to select those models that are effectively applicable to the considered context, characterized by non-deterministic and non-stationary time series, stringent time constraints and short- or medium-term predictions. The performance is evaluated in terms of the computational cost and the prediction quality of the model.
6.3.1 Computational cost
Computational cost evaluates the CPU time required by each prediction model to estimate a new value for one system resource. In the case of the ST- and AT-policies, the computational cost also includes the time spent for trend extraction.

The rows of Table 6.1 show the CPU time spent by the SR, CS, EWMA, Holt's, AR and ARIMA models to predict one value of one system resource measure using the prediction policy indicated in the first column.
The first row of Table 6.1 reports the computational costs of the different static models applied to monitored time series. Thanks to its distinctively low computational cost, the SD-policy is particularly adequate in application contexts with short-term time requirements.

The data treatment phase required by the ST-policy brings a significant increase of the computational costs of all the predictors. Using the on-line DWT, the cost of prediction increases by 23.1 msec, as reported in the second row of Table 6.1.
The application of adaptive models increases the computational complexity of all predictions, as evidenced by the computational costs in the third and fourth rows. This increase depends on the statistical criteria used by each model to dynamically estimate its parameters. The SR, Holt's, AR and ARIMA models are the most affected by the cost of their parameter updates.
Policy   SR      CS      EWMA    Holt's   AR      ARIMA
SD       1.78    2.10    0.05    1.44     5.97    7.53
ST       24.88   25.20   23.15   24.54    29.07   30.63
AD       22.42   2.91    3.48    21.11    45.33   114.56
AT       45.52   26.01   26.58   44.21    68.43   137.66
Table 6.1: CPU time (msec) of prediction models and policies.
If the requirement is to predict just a few values, the computational cost of any considered model is compatible with the time constraints of short-term and medium-term application contexts. However, we must point out that these models are typically applied to complex clusters consisting of hundreds of nodes, where the state of each node may require observations of several internal resources. In these contexts, the application of the AT-policy for short-term prediction is critical or impossible.
Since a trade-off between computational cost and prediction quality exists, we expect that the high CPU time spent by dynamic policies or complex models will be offset by a higher quality of the prediction.
6.3.2 Prediction quality
Prediction quality estimates the ability of the model in time series forecasting by considering two important features: the precision in modeling the future data samples in hypothetical stable conditions, and the adaptability in following time series variations when conditions are unstable.
We measure the following metrics to evaluate the prediction quality:

Prediction Error (PE)
The PE computes the mean distance between an ideal representation of the data set and the predicted values:

PE = Σ_{i=1}^{N} (f̂_i − f*_i)² / N    (6.18)

where N is the total number of predictions, f̂_i is the predicted value and f*_i is the off-line filtered data set standing for the reference representation generated from the observed raw data. As suggested in [95], we consider the DWT-based filter for the off-line representation. PE takes into account the precision in forecasting the future sample values and the ability to follow the possibly variable conditions of the data set.
Prediction Interval (PI)
The PI, widely described in [19], estimates the interval in which future observations will fall with a certain probability c. Through this metric, a predicted value f̂_{i+k} at sample i + k is associated to a prediction interval [l_{i+k}, u_{i+k}] by the following equation:

c = Pr(l_{i+k} ≤ f_{i+k} ≤ u_{i+k})    (6.19)

where l_{i+k} is the lower limit and u_{i+k} is the upper limit of the prediction interval.

Considering a probability c = 0.95, the lower and the upper limits at sample i + k are defined as follows:

l_{i+k} = f̂_{i+k} − 1.96 σ_i/√r    u_{i+k} = f̂_{i+k} + 1.96 σ_i/√r    (6.20)
where σ_i is the standard deviation at sample i of the predicted data set, F̂[r]_i = (f̂_{i−(r−1)}, . . . , f̂_i), and r is the size of the data set.

Comparing prediction intervals obtained under the same probability value c, we can argue that larger prediction intervals are associated to less reliable prediction models.
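Both metrics are direct to compute; the following sketch (the function names are ours, not the thesis's) implements Eq. (6.18) and the c = 0.95 limits of Eq. (6.20):

```python
import math

def prediction_error(pred, ref):
    """PE of Eq. (6.18): mean squared distance between the predicted values
    and the off-line filtered reference representation."""
    n = len(pred)
    return sum((p - r) ** 2 for p, r in zip(pred, ref)) / n

def prediction_interval(f_hat, sigma, r, z=1.96):
    """95% PI of Eq. (6.20): f_hat +/- 1.96 * sigma / sqrt(r)."""
    half = z * sigma / math.sqrt(r)
    return f_hat - half, f_hat + half

print(prediction_error([1.0, 2.0, 3.0], [1.0, 2.0, 4.0]))  # 1/3 = 0.3333...
lo, hi = prediction_interval(10.0, sigma=2.0, r=16)
print(lo, hi)  # 9.02 10.98
```

Note the two metrics pull in opposite directions in the experiments below: a policy can have a narrow PI (reliable-looking band) yet a high PE if its predictions are systematically delayed.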
In Figure 6.2 we give an example of time series prediction and the corresponding prediction interval. The continuous gray line represents the data samples, while the black one represents the predicted values. The black dotted lines delimit the prediction interval.
Figure 6.2: Example of time series prediction and corresponding prediction interval (c = 0.95).
We evaluate the PE of the prediction models for every considered policy by setting a prediction window k = 10. Table 6.2 reports the prediction errors computed on the time series coming from the monitoring of the CPU utilization of a database server of the Internet Data Center considered as testbed in this work. The results confirm that the AT-policy has the lowest prediction error for any prediction model. This result is the consequence of the combined use of data treatment and adaptive prediction, which allows the model both to reduce the noise component of the signal and to adapt its parameters to the data set behavior. The other policies have higher prediction errors because of a static choice of the model parameters (e.g., SD-policy and ST-policy) or because they also consider perturbing information due to outliers and noise in their prediction (e.g., SD-policy and AD-policy).
Policy   SR      CS      EWMA    Holt's   AR      ARIMA
SD       0.135   0.127   0.102   0.124    0.130   0.128
ST       0.033   0.037   0.031   0.035    0.035   0.032
AD       0.054   0.047   0.027   0.046    0.510   0.040
AT       0.028   0.032   0.021   0.027    0.320   0.029
Table 6.2: PE of the prediction policies, k = 10.
To better understand the prediction quality of the models, we combine the previous results with those in Table 6.3, which reports the prediction intervals of the models. The data trend extraction phase leads to lower prediction intervals, as evinced by the PI performance of the ST-policy and AT-policy. Policies that envisage a filtering step before the forecasting phase generate more faithful predictions. On the other hand, noise and outliers deteriorate the prediction reliability of all the considered models under the SD-policy and AD-policy, which work on raw data sets and show PI values around 20%.
Policy   SR       CS       EWMA     Holt's   AR       ARIMA
SD       19.54%   20.20%   22.49%   21.44%   22.32%   23.54%
ST       7.02%    5.35%    7.12%    7.98%    7.52%    7.41%
AD       15.24%   16.32%   18.99%   16.86%   19.29%   17.91%
AT       5.90%    3.04%    5.25%    6.01%    6.34%    6.16%
Table 6.3: PI of the prediction policies, k = 10.
An evident trade-off between prediction quality and computational cost exists. Even though the ST-policy and AT-policy always guarantee a better prediction quality (both in terms of PE and of PI) than the policies without data treatment, they have a higher computational complexity that can limit their usage in application contexts with strong time constraints.
6.4 Performance analysis
We present the behavior of the different prediction policies in a realistic scenario. We test the considered models on the prediction of the time series coming from the monitoring of three days of CPU utilization of a database server of the Internet Data Center under examination.
The time series is displayed in Figure 6.3 (gray line), together with the result of the on-line data treatment and the ideal representation of the time series. The continuous black line is the filtered data representation coming from a trend extraction treatment based on the on-line DWT. It represents the result of the pre-filtering step of the ST-policy and AT-policy; the SD-policy and AD-policy, instead, work directly on the gray un-treated line. All policy prediction results are compared to the black dotted line, that is, the ideal time series representation, for the computation of the prediction error. The closer the predicted values are to that dotted line, the smaller the prediction error and the more reliable the predictor.
Figure 6.3: Raw and treated CPU utilization time series.
The difference between the four approaches can be graphically evinced in Figure 6.4 and Figure 6.5. They report the off-line filtered curve based on the DWT filter (gray line), the predicted curve (continuous black line) and the prediction interval (black dotted lines) for the four prediction policies, SD-policy, ST-policy, AD-policy and AT-policy, all using the Holt's prediction model.
(a) SD-policy    (b) AD-policy

Figure 6.4: Holt's prediction model, k = 10, on raw data set.
(a) ST-policy    (b) AT-policy

Figure 6.5: Holt's prediction model, k = 10, on trend estimation.
The differences between treated and un-treated policies are evident: the ST-policy and the AT-policy in Figure 6.5 (a) and (b) reach more accurate predictions than those in Figure 6.4, thanks to a very small prediction interval. Much higher PIs are obtained by the policies that do not extract the data trend before the prediction step. Since they base the prediction on a stochastic data set affected by great oscillations, the SD-policy and AD-policy in Figures 6.4 (a) and (b) have a low prediction quality due to larger prediction intervals.
Despite that, it is of crucial importance to observe the evident prediction delay introduced by the on-line treatment in Figures 6.5 (a) and (b). The ST-policy and AT-policy guarantee excellent PIs at the cost of relevant prediction errors due to a visible forward shift of the predicted values. The prediction follows the on-line filtered line very well, but it refers to a delayed representation that does not correspond to the actual ideal behavior. This causes a shift between predicted and ideal values that increases the prediction error and thus affects the performance of the ST-policy and AT-policy. On the other hand, bypassing the data treatment phase avoids introducing undesired delays in the prediction. The SD-policy and AD-policy in Figures 6.4 (a) and (b) predict the future behavior in a timely way, guaranteeing that the future time series values will fall into the prediction interval.
Chapter 7
Runtime models
This chapter deals with models for the runtime management of Internet Data Centers. We show how the three representative eigenresources coming from the proposed PCA-based technique can be used as an input for on-line decision algorithms, which benefit from working on a small set of data and on reliable representations of the whole system state.

In this chapter we consider two important on-line problems: the state change detection and anomaly detection problems.
7.1 State change detection
One of the key goals for supporting adaptive and self-adaptive mechanisms is the ability to detect one or several changes in some characteristic properties of the system. Many runtime management decisions related to Internet-based services are activated after a notification that a significant load variation has occurred in some system resource(s). Request redirection, process migration, access control and limitation are some examples of processes that are activated after the detection of a significant and non-transient system state change. We call a state change any modification in the data sets of the system that occurs either instantaneously or rapidly with respect to the sampling period and that lasts for a significant number of consecutive measurements.
State change detectors working at runtime have an unquestioned central role in system management, because the immediate detection of relevant changes in the state of monitored processes is crucial for efficacious decisions.

Figure 7.1: Third phase: analyzing the system behavior in the present.

Structural defect, damage detection and novelty detection techniques detect structural anomalies in the system, e.g., defects due to unforeseen circumstances, unexpected user attacks, and sudden deviations (increases or decreases) of system activity. State change detection techniques try to detect changes in the data collected from a structure. Typical measures and their representations tend to be stable over time and might have spatial correlations. When the data suddenly depart from their stable characteristics or lose their typical correlation, we assume that something anomalous has occurred in the system. We want to detect only relevant state changes, not minor instabilities.
7.1.1 Problem definition
The on-line detection of a relevant state change refers to techniques and methodologies that are able to decide whether such a change occurs. This is usually done by evaluating the statistical characteristics (e.g., mean and variance) of the samples flowing from monitors, possibly after passing through on-line filters extracting the trend of the monitored samples.
On-line state change detection in highly variable contexts is addressed through two steps: state representation and detection rule.
State representation
The huge number of existing state representation models can be classified on the basis of how they treat the time series generated by the monitored process. We distinguish representation techniques in two main classes: with process models or without process models.

The former techniques require a preliminary knowledge of the statistical properties of the time series (e.g., the Kalman Filter [63], the Sequential Monte Carlo Method [77] and Bayesian Bootstrap Filtering [7]) or an empirical evaluation of them (e.g., Particle Filtering [41, 80]). In this work, we cannot adopt representations based on process models because the non-deterministic and stochastic behavior of the considered time series prevents the possibility of knowing or evaluating its statistical characteristics on-line.
The second class of representation algorithms is based on interpolation and smoothing filtering methods, such as those presented in Chapter 5. The most common solutions use linear filtering, such as mean filtering and exponential smoothing [94]. However, we will see that these models are not effective for state change detection in the considered stochastic and highly variable contexts. Non-linear filtering techniques should be more suitable to these contexts because they are able to reduce the time series noise and to guarantee a reliable detection quality.
Detection rule
In the literature there are detection rules strictly related to a prior knowledge about all the states that characterize the monitored process, and models that use a detection rule quite independent of this information. We prefer to follow this latter approach, and choose the Cumulative Sum (Cusum) statistical model as our fundamental detection rule [15, 94].

Other on-line detection rules are based on machine learning algorithms that have to acquire some initial knowledge about the time series, especially about the probability of state changes and their distributions (e.g., Principal Component Analysis, neural networks [84]). Other widely adopted methods use one or more thresholds for detecting relevant state changes [107]. Both these alternative classes of detection rules seem unsuitable to the highly variable context of Internet Data Center resources. Nevertheless, we consider a threshold-based model for comparison purposes.
Let Xi be the time series of monitored samples. The state of a time series is
a reference representation of the monitored process that is dynamically updated
on the basis of the stability of the statistical characteristics of the time series Xi.
A relevant state change corresponds to a significant variation of the statistical
characteristics of the time series Xi [15].
The problem we consider in this application is to detect at runtime the occur-
rence of relevant state changes, with a minimum or null rate of false detections in
non-stationary and highly variable contexts. This is done on the basis of a statis-
tical change detection rule that is typically obtained by comparing, at each sample
i, the test statistic gi with a characteristic statistical threshold H.
The state change may occur in an increasing or decreasing direction on the
basis of the following rule:

    detection rule =  increasing change,  if gi ≥ H
                      decreasing change,  if gi ≤ −H
                      no change,          otherwise        (7.1)
The test statistic gi is an indicator of the detection model: during a stable
state it should be close to zero, and it departs from zero when a change in the time
series occurs. The choice of the characteristic threshold H has been extensively
addressed in the quality control literature [94] and depends on the test statistic gi
and on the required performance [126].
In Figure 7.2, we give an example of the issues that may affect a state change
detection model. The time series is composed of N = 600 samples, shown by the
spiked curve. The data profile, denoted by the continuous line, shows 8 relevant
state changes at samples 50, 150, 200, 250, 315, 440, 475, 540. The vertical line
with a circle at the bottom denotes a false detection, that is, a signaled change that
does not correspond to a real state change. The vertical line with a cross at the
bottom denotes a right detection.
[Plot: sample values vs. samples (0-600); legend: time series, representative state, true detection, false detection.]
Figure 7.2: The problem of detecting relevant state changes.
This figure shows the two main problems when the time series is highly vari-
able:
• several false detections, as evidenced by 10 lines with a circle at the bottom;
• the absence of detections of relevant state changes, as at sample 315.
This preliminary analysis confirms that a highly variable behavior of the time series
limits the detection quality of existing models, which do not achieve good perfor-
mance especially when applied to time series characterized by the stochastic and
highly variable behavior typical of Internet Data Center management. We propose
a new on-line model for state change detection that uses an on-line wavelet-based
filtering for state representation and an adaptive implementation of the Cusum as
its detection rule.
7.1.2 Wavelet Cusum state change detection model
The performance of change detection algorithms is highly dependent on the statis-
tical structure of the measured time series [94]. Because of the inherent variability
and non-stationary behavior of many monitored processes, such as computer sys-
tem resource usage, we have found that the direct use of standard detection tech-
niques, e.g., the Cusum algorithm, on the original time series, as reported by system
monitors, yields extremely poor results due to the high variance of the time series.
For change detection, we find it convenient to consider not the original time
series Xi, but a so called rectified time series Yi = [y1, . . . , yi] [95]. Yi retains
the significant features of Xi but removes (most of) the variability that can be
ascribed to noise and to short-term oscillations of resource usage. We regard Yi as
the state representation of the monitored process and we apply change detection
algorithms to this time series Yi.
Wavelet-based representation
Linear low pass filtering techniques, such as mean filtering and exponential
smoothing, are the most commonly used methods to remove a noisy compo-
nent from a time series, because of their simplicity and the fact that they can
be used on-line. In the case of exponential smoothing, also known as exponentially
weighted moving average (EWMA), we have at sample i:

    yi = α·xi + (1 − α)·yi−1        (7.2)

where α, 0 ≤ α ≤ 1, is a weighting factor (which is related to the cutoff
frequency of the low pass filter).
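As an illustration, the EWMA recursion of Equation (7.2) can be sketched in a few lines of Python. The choice α = 0.3 and the initialization of the filter with the first sample are our own illustrative assumptions, not prescribed in this chapter:

```python
def ewma(samples, alpha=0.3):
    """Exponential smoothing (Eq. 7.2): yi = alpha*xi + (1 - alpha)*yi-1."""
    smoothed = []
    y = samples[0]  # assumed initialization: start from the first sample
    for x in samples:
        y = alpha * x + (1 - alpha) * y
        smoothed.append(y)
    return smoothed
```

A small α yields a smoother representation (a lower cutoff frequency), at the cost of a slower reaction to genuine state changes.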
Simplicity comes at a significant cost: since linear filtering methods basi-
cally remove all frequencies above a cutoff value, in the resulting smoothed
representation Yi we remove not only noise, but also some significant fea-
tures of the time series, e.g., abrupt changes. For change detection purposes
this translates into false detections or significant detection delays.
Following the approach presented in [95], we propose the use of wavelet-
based filtering/rectification. The wavelet transform [89] has emerged as a
powerful tool for statistical time series analysis. The wavelet transform
represents a time series x as the sum of shifted and scaled versions of a
base wavelet function ψ and shifted versions of a low pass scale function φ.
With a proper choice of the wavelet and scale functions, the resulting families
of functions

    ψmk(n) = √(2^−m) ψ(2^−m·n − k)        (7.3)
    φmk(n) = √(2^−m) φ(2^−m·n − k)        (7.4)

where m and k are the dilation and translation parameters, respectively,
form an orthonormal basis. A time series Xi can be conveniently rewritten
as follows:
    xi = Σ(k=1..n·2^−L) aLk φLk(i) + Σ(m=1..L) Σ(k=1..n·2^−m) dmk ψmk(i)        (7.5)

where aLk is the k-th scaling function coefficient at the coarsest scale L,
dmk is the k-th wavelet coefficient at scale m, and n is the time series length.
The coefficients aLk and dmk are computed by inner products of x with the base
functions. Computation of the transform and its inverse can be done in
O(n). As indicated in [95], in our implementation we set the coarsest scale
L equal to 5 if the time series is perturbed by white noise, and L = 4 otherwise.
A key feature of this representation is that the wavelet decomposition cap-
tures significant signal features in a few relatively large coefficients, while
noise results decorrelated. As a result, noise - and noise only - can be effec-
tively removed by setting to zero the wavelet coefficients smaller than a
threshold.
We consider as state representation of the time series Xi the time series Yi
obtained as follows:

1. Compute the wavelet transform of the original time series Xi. We use
the standard Haar function [36] as a base wavelet, which consists of a
simple rectangular impulse function;

2. Set to zero the wavelet coefficients that are lower than a suitable
threshold tm (where m is the dilation parameter). As indicated in [95],
we set the threshold tm = σm √(2 log n), where σm = median{|dmk|} / 0.6745;

3. Compute the inverse wavelet transform to obtain Yi.

This rectification technique has been proved to be superior to many other
approaches [47], but it is restricted to off-line operations. We adopt the on-
line version proposed in [95], which considers a moving window of dyadic
length to compute Yi.
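The three steps above can be sketched with a minimal pure-Python Haar transform. This is the off-line variant on a single dyadic window: the hard thresholding with the universal threshold of step 2 and the simple upper-median estimator are spelled out in the code; the number of decomposition levels and the power-of-two input length are assumptions of this sketch:

```python
import math

def haar_forward(x):
    # one orthonormal Haar analysis step: pairwise averages and details
    a = [(x[2*i] + x[2*i+1]) / math.sqrt(2) for i in range(len(x) // 2)]
    d = [(x[2*i] - x[2*i+1]) / math.sqrt(2) for i in range(len(x) // 2)]
    return a, d

def haar_inverse(a, d):
    # exact inverse of haar_forward
    x = []
    for ai, di in zip(a, d):
        x.append((ai + di) / math.sqrt(2))
        x.append((ai - di) / math.sqrt(2))
    return x

def wavelet_rectify(x, levels=4):
    """Rectify x: Haar transform, hard-threshold the detail coefficients
    scale by scale, then transform back (steps 1-3 above)."""
    n = len(x)  # assumed to be a multiple of 2**levels (dyadic window)
    a, details = list(x), []
    for _ in range(levels):
        a, d = haar_forward(a)
        details.append(d)
    kept = []
    for d in details:
        med = sorted(abs(c) for c in d)[len(d) // 2]
        sigma = med / 0.6745                        # robust noise estimate
        t = sigma * math.sqrt(2 * math.log(n))      # universal threshold
        kept.append([c if abs(c) >= t else 0.0 for c in d])
    for d in reversed(kept):
        a = haar_inverse(a, d)
    return a
```

On a noiseless series the detail coefficients at noise scales are zero, so the reconstruction is exact; on a noisy series the small, decorrelated coefficients are zeroed while the few large ones carrying abrupt changes survive.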
Cusum-based detection rules
The proposed model integrates the wavelet-based representation with a novel
load change detection mechanism that is based on the Cusum change detec-
tion rule. The Cusum detection rule was proposed in [98] and is used in different
contexts [15, 27]. It is considered the best choice for the statistical quality
control of many processes [94].
Given the time series of the state representation, Yi, the one-sided Cusum
for detecting an increase in the mean computes the following test statistic:

    g+0 = 0        (7.6)
    g+i = max{0, g+i−1 + yi − (μ0 + K+)}        (7.7)

which measures positive deviations from a reference value μ0.
The test statistic g+i accumulates deviations of yi from μ0 that are greater
than a pre-defined constant K+, and resets to 0 on becoming negative. The
term K+, which is known as the allowance or slack value, determines the
minimum deviation that the statistic g+i accounts for. A positive change is
signaled when g+i exceeds a design-chosen threshold H+.
The one-sided Cusum test for detecting negative deviations is defined simi-
larly as:

    g−0 = 0        (7.8)
    g−i = max{0, g−i−1 + (μ0 − K−) − yi}        (7.9)

A negative change is signaled when g−i exceeds a design threshold H−.
A two-sided test to detect both increases and decreases is obtained by apply-
ing the two tests simultaneously. For the sake of simplicity we will consider
the symmetric case whereby K+ = K− = K and H+ = H− = H.
When a shift is detected, the Cusum test also provides an estimate of the new
reference value μ1 as follows:

    μ1 = μ0 + K + g+i / N+        if g+i > H
    μ1 = μ0 − K − g−i / N−        if g−i > H        (7.10)

where N+ (N−) denotes the number of steps elapsed since the last time g+i
(g−i) was set to zero, that is N+ = i − inf{j | g+j = 0}, and similarly for
N−.
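The symmetric two-sided test of Equations (7.6)-(7.10) can be sketched as follows. Resetting both statistics, and updating μ0 to the estimate μ1, after each alarm is our own assumption about how the test is restarted; the equations themselves do not fix this:

```python
def cusum_detect(y, mu0, K, H):
    """Two-sided Cusum (Eqs. 7.6-7.10): return (index, direction, mu1)
    for every signaled change."""
    gp = gm = 0.0
    n_pos = n_neg = 0  # steps since g+ / g- were last zero (N+, N-)
    alarms = []
    for i, yi in enumerate(y):
        gp = max(0.0, gp + yi - (mu0 + K))   # Eq. 7.7
        gm = max(0.0, gm + (mu0 - K) - yi)   # Eq. 7.9
        n_pos = n_pos + 1 if gp > 0 else 0
        n_neg = n_neg + 1 if gm > 0 else 0
        if gp > H:
            mu0 = mu0 + K + gp / n_pos       # Eq. 7.10, increasing change
            alarms.append((i, "up", mu0))
            gp = gm = 0.0
            n_pos = n_neg = 0                # assumed restart after an alarm
        elif gm > H:
            mu0 = mu0 - K - gm / n_neg       # Eq. 7.10, decreasing change
            alarms.append((i, "down", mu0))
            gp = gm = 0.0
            n_pos = n_neg = 0
    return alarms
```

For instance, on a unit step with K = ∆/2 = 0.5 and H = 2, the statistic g+i grows by 0.5 per sample after the shift, crosses H a few samples later, and Equation (7.10) recovers the new mean.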
The performance of the Cusum test is expressed in terms of the so called Av-
erage Run Lengths (ARL): ARL0 denotes the average number of samples
between false alarms when no change has occurred; ARL1 denotes the av-
erage number of samples needed to detect a change when it does occur. Both ARL
measures are affected by the design parameters H and K. To achieve a good
detection quality, the suggested value for K is ∆/2, where ∆ is the minimum
shift to be detected. In the considered environment, which is characterized by
a variable variance of the process, it is important to provide a dynamical esti-
mation of the H parameter. We propose to dynamically adjust the threshold
H in order to provide a target ARL0 performance and limit the number of
wrong detections, as presented by the ARL0 Cusum in [27].
These settings guarantee a good detection quality in terms of both recall and
precision (see Section 7.1.4), since we are able to dynamically adjust the
value of the threshold H to reflect the variation of the time series behavior.
7.1.3 Other state change detection models
Most existing change detection techniques are based on the trend extracted by the mod-
els presented in Chapter 5, used as state representations. In this section, we outline
the most popular on-line state change detection algorithms, which are presented
for performance comparison only. For each algorithm, we consider the parameter val-
ues that guarantee the best evaluation metrics (see Section 7.1.4) for Internet Data
Center management.
Threshold-based detector
This model uses a state representation based on the time series filtered with
an exponential moving average with n = 5. The detection rule is based
on the double threshold model described in [87]. The high threshold is equal
to thH = ∆, where ∆ is the smallest shift to detect, and the low threshold
is thL = p∆. We choose the p coefficient on the basis of the traditional
method described in [87], which is based on the ROC curve, in order to adapt
thL to the statistical characteristics of the time series. We denote this detection
model as Th-EMA5.
EWMA-based detectors
The EWMA model is applied in several online contexts (for example in in-
formation and computer systems, and in financial and social applications) [94].
In this thesis, it is used for both state representation [75] and detection
rule [94]. The performance of the EWMA detectors depends on the choice
of the number of past values n used for state representation and on the length of the con-
trol limit M used by the detection rule. In our analysis, we consider M = 3
and two values n = 5 and n = 10; all these values are suggested by Mont-
gomery et al. [94] as the most popular choices able to provide a good and
reliable detection quality. On the basis of the n value, we denote the respec-
tive detectors as EWMA5 and EWMA10. EWMA5 should maximize the
number of correct detections at the cost of some false ones. The EWMA10
model should minimize the number of false detections, likely at the cost of
missing some detections.
Baseline Cusum
It uses an exponential moving average with n = 5 as its state representation,
and the Cusum detection rule [94, 98]. Its detection quality is conditioned
by the choice of the statistical threshold H and by the slack value K, which
represents the minimum deviation that the detection rule of the baseline
Cusum accounts for. The recommended values of the model parameters are
H = 5σx and K = ∆/2.
ARL0 Cusum
The online state representation is an exponential moving average with n =
5. The ARL0 Cusum uses a Cusum detection rule that was proposed by some
of the authors [27]. Its performance depends on the design parameters H
and K. It uses an adaptive estimation of the parameter H that is dynamically
evaluated on the basis of the ARL0 value, chosen in order to ensure a very
small false detection rate. The ARL0 Cusum uses K = ∆/2.
7.1.4 Quantitative analysis
As we are interested in on-line detection in non-stationary and highly variable time
series coming from non-deterministic application contexts, in the following anal-
ysis we evaluate the detection quality of the proposed Wavelet Cusum model on
a wide range of time series based on two typical time series profiles. The perfor-
mance of the considered detection techniques is evaluated in terms of common
evaluation metrics, by considering several intensities of the noise components of
the time series.
Time series profiles
The evaluation of the proposed model is carried out on a wide range of time
series based on two state profiles:

• Step profile describes a sudden increment of the time series values
from a relatively low to a higher state [111]. The lower state keeps
constant for 200 samples, then it is suddenly increased for 200 sam-
ples. The increase is followed by a similar decrease.

• Multi-step profile describes an alternating increase and decrease of
the time series state characterized by different lengths. The time series
is subject to 8 state changes at samples 50, 125, 200, 275, 350, 400,
475, 550.

To facilitate the presentation, the two profiles are normalized and both have
unit increases and decreases.
Noise components
Since the proposed solution aims to improve existing detection models, we
apply the detection models to time series characterized by different levels of
noise components. The noise dispersion, σe, and the correlation index, ρe,
as described in [21, 45], are the most important statistical properties that
characterize the noise error, and they are the ones considered in our evaluation.
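As a reproducible sketch of this setup, the step profile and a correlated noise component can be generated as follows. Interpreting the correlation index ρe as the lag-one coefficient of an AR(1) process is our own assumption; the references [21, 45] define the noise model precisely:

```python
import math
import random

def step_profile(n=600):
    # normalized step profile: low state, unit increase, then back down
    third = n // 3
    return [0.0] * third + [1.0] * third + [0.0] * (n - 2 * third)

def add_noise(profile, sigma_e, rho_e, seed=0):
    """Superimpose a noise component with dispersion sigma_e and lag-one
    correlation rho_e, modeled here (assumption) as an AR(1) process."""
    rng = random.Random(seed)
    # innovation scale chosen so the stationary noise variance is sigma_e^2
    innov = sigma_e * math.sqrt(1.0 - rho_e ** 2)
    noisy, e = [], 0.0
    for v in profile:
        e = rho_e * e + rng.gauss(0.0, innov)
        noisy.append(v + e)
    return noisy
```

With sigma_e = 0 the generator returns the clean profile, which is a convenient sanity check before sweeping the σe and ρe grids used in the experiments.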
Evaluation metrics
The detection quality of the considered models is evaluated in terms of
recall and precision [96].
To formalize these metrics, let us consider a time series subject to state
changes. All detected samples that are signaled correctly by the detection
model are called true positives (TP). If the model does not detect one
or more changes, the related samples are classified as false negative (FN)
detections. When the time series is in a stable state, the detection of a change
is classified as a false positive (FP); otherwise, a non-detection in a stable
state is a true negative (TN). The numbers of true positives, false negatives,
true negatives, and false positives add up to 100% of the time series.
Recall is the fraction of detections that are relevant to the time series and
that are successfully retrieved:

    recall = TP / (TP + FN)        (7.11)

It can be looked at as the probability that a relevant state change is detected
by the model. To achieve a recall value equal to 1, the detection model must
signal all relevant changes.
The value of the recall alone is not enough: it must be supported by
some information related to the number of non-relevant detections, such as
the precision, that is, the fraction of relevant detections:

    precision = TP / (TP + FP)        (7.12)

where TP is the number of true positive detections and (TP + FP) is the
total number of detections. The precision gives information on the ability of
the detection model to limit unnecessary detections of a state change. A pre-
cision equal to 1 means that the model detects only relevant state changes,
while low precision values are caused by a detection model that signals
many non-relevant changes.
A trade-off between recall and precision exists, hence these two met-
rics are usually combined into a single measure, namely the F-measure,
which gives a global estimation of the detection quality. The F-measure is the
weighted harmonic mean of precision and recall, that is:

    F-measure = 2 · (precision · recall) / (precision + recall)        (7.13)

When the detection quality is good, the F-measure value is close to 1, while
it is low for unreliable detection models characterized by false positive and
false negative detections.
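Equations (7.11)-(7.13) translate directly into code; the guards against zero denominators are our own defensive addition:

```python
def detection_metrics(tp, fp, fn):
    """Recall, precision and F-measure as in Eqs. (7.11)-(7.13)."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return recall, precision, f_measure
```

For example, a detector that signals all 8 relevant changes plus 2 spurious ones has recall 1, precision 0.8, and F-measure 8/9.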
In this section, we evaluate how the performance metrics of the detection mod-
els are affected by the noise dispersion of the time series. In the first set of ex-
periments, we test all σe ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1}, set ρe to 0,
and consider the step profile time series.
The recall and precision results for all considered detection models and for
several σe values are reported in Table 7.1 and Table 7.2. Table 7.1 shows that
when the noise dispersion is low (σe ≤ 0.5) all the considered models achieve
recall values close to 1. This means that all methods recognize correctly all rele-
vant state changes, with the exception of the threshold-based model that does not
signal some detections. When the noise dispersion increases (σe > 0.5), the mod-
els using EWMA as detection rule, EWMA5 and EWMA10, worsen significantly
their recall values: they risk being completely unreliable in highly variable con-
texts since they do not signal many relevant state changes. Other models, such as
the threshold- and Cusum-based ones, are able to maintain good recall values (> 0.9) also
in time series with an intense noise dispersion.
σe    Th-EMA5  EWMA5  EWMA10  Baseline Cusum  ARL0 Cusum  Wavelet Cusum
0.1   1        1      1       1               1           1
0.2   0.97     1      1       1               1           1
0.3   0.96     1      1       1               1           1
0.4   0.92     1      1       1               1           1
0.5   0.89     1      0.98    0.99            1           1
0.6   0.83     0.99   0.87    1               1           1
0.7   0.85     0.95   0.31    1               1           1
0.8   0.92     0.88   0.02    1               0.99        1
0.9   0.94     0.71   0.01    0.99            0.99        1
1     1        0.64   0.01    1               1           1

Table 7.1: Recall - ρe = 0.
σe    Th-EMA5  EWMA5  EWMA10  Baseline Cusum  ARL0 Cusum  Wavelet Cusum
0.1   1        0.23   0.31    1               1           1
0.2   0.33     0.43   0.50    0.99            1           1
0.3   0.13     0.50   1       0.95            1           1
0.4   0.08     0.97   1       0.89            1           1
0.5   0.06     0.98   1       0.77            0.99        0.99
0.6   0.05     0.81   1       0.65            0.92        0.96
0.7   0.05     0.7    1       0.50            0.87        0.96
0.8   0.05     0.61   1       0.37            0.73        0.93
0.9   0.04     0.53   1       0.29            0.64        0.87
1     0.04     0.52   1       0.25            0.55        0.84

Table 7.2: Precision - ρe = 0.
However, the recall metric is not sufficient to offer a complete understanding
of the model quality. If we consider the precision results in Table 7.2, we notice
that the threshold-based method is too sensitive to the noise variations of the
time series: even if it is able to detect all relevant state changes, this is done
at the price of a high number of false positive detections, as confirmed by the low
precision values of the Th-EMA5. The Cusum-based methods are all able to achieve
good precision in time series characterized by a noise dispersion σe ≤ 0.5. In
more variable and noisy contexts, σe > 0.5, only the proposed Wavelet Cusum
model preserves a high detection quality, since it maintains a precision close to 0.85
also in highly variable time series with a large noise dispersion.
The combined effect of recall and precision can be appreciated in Figure 7.3,
which shows the behavior of the F-measure as a function of the standard deviation σe
for all the considered detection models. With the exception of the models based on
the EWMA detection rule, all algorithms worsen for increasing values of σe. Nev-
ertheless, the Wavelet Cusum achieves the best F-measure values for every σe. It
is worth observing that it is the only model guaranteeing reliable results even for
very high noise dispersions.
For example, at σe = 1, its F-measure remains consistently higher than 0.9,
against an F-measure around 0.7 of the ARL0 Cusum, which is the best exist-
ing on-line detection model [27]. The threshold-based method is characterized
by an exponential decay of the detection quality for increasing values of σe. This
behavior limits the model applicability especially in non-stationary and non-deter-
ministic contexts. The EWMA-based solutions have the peculiarity of an F-measure
that first improves and then decreases for σe > 0.5. This is due to the so called inertia
limit [94], that is, the inability to react quickly to state changes when the size of
the smallest shift to detect (∆) is significantly higher than σe. This limit and the
F-measure degradation for high σe values show that the performance of the
EWMA-based models is unacceptable, because too sensitive to the statis-
tical characteristics of the time series and to the choice of the model parameters.
Existing Cusum-based methods are characterized by a small decay of the F-measure
for low values of σe. On the other hand, the F-measure decreases slowly when
σe > 0.5. These results show that even existing Cusum-based models do not
work well when a time series is highly variable.
[Plot: F-measure vs. standard deviation σe (0.1-1) for Th-EMA5, EWMA5, EWMA10, Baseline Cusum, ARL0 Cusum, Wavelet Cusum.]

Figure 7.3: F-measures - ρe = 0.
We now examine the effects of the correlation of the noise com-
ponent ρe on the detection quality by considering different correlation indexes,
ρe ∈ {0, 0.1, 0.2, 0.3}, for fixed values of the noise dispersion σe.
In Figure 7.4 we report the F-measure values of all the considered models in two
cases of high noise dispersion: σe = 0.6 and σe = 0.9. The results confirm that
the Wavelet Cusum model improves the performance of all traditional detectors
for every correlation value and any σe. Its performance remains acceptable also
in the most chaotic context of intense variance and strong correlation of the noise
component (σe = 0.9, ρe = 0.3), improving by more than 50% the perfor-
mance of the best existing model, the ARL0 Cusum. A second important result is that
the Wavelet Cusum is less sensitive to the statistical characteristics of the time series
than any other detection model. This robustness is useful in all real contexts
characterized by variable, non-stationary and non-deterministic behaviors.
[Two plots: F-measure vs. noise correlation index (0-0.3) for all detectors; (a) σe = 0.6, (b) σe = 0.9.]

Figure 7.4: F-measures.
7.1.5 Performance analysis
In this section we evaluate the detection quality of the considered models on time
series characterized by different numbers of state changes and different lengths of
the periods of stability.
Figure 7.5 and Figure 7.6 represent the behavior of a selected subset of the
detection models (EWMA5, Baseline Cusum, ARL0 Cusum and Wavelet Cusum)
on the step profile and the multi-step profile, respectively. Both figures display the curve
of the online state representation (continuous gray line), the curve of the time
series profile (continuous black line) and the vertical lines with a circle at the
bottom for false detections and a cross for true detections.
The number of false detections is an important measure of the precision of a
model. On the other hand, the absence of a detection in occurrence of a relevant
state change affects the detection recall. On the step profile, the Wavelet Cusum in
Figure 7.5(a) is the only model that detects timely and correctly all relevant state
changes, without false detections. The improvement of the proposed model with
respect to the Baseline Cusum is impressive. The performance of the Baseline
Cusum in chaotic contexts is very low, as a consequence of the large number of
false detections (Figure 7.5(b)). The ARL0 Cusum and the EWMA5 models (Fig-
ure 7.5(c) and Figure 7.5(d), respectively) show the same detection quality, char-
acterized by two wrong detections during a stable state.
[Four panels, one per detector, each showing data value vs. sample (0-600) with the on-line state representation, the step profile, and true/false positive detections: (a) Wavelet Cusum, (b) Baseline Cusum, (c) ARL0 Cusum, (d) EWMA5.]

Figure 7.5: Qualitative evaluation - Step profile - σe = 0.6 and ρe = 0.3.

For increasing complexity of the time series profile, all existing models reduce
significantly their quality, while the Wavelet Cusum in Figure 7.6(a) shows
a high quality also in time series characterized by multiple state changes. In these
contexts, the ARL0 Cusum and the EWMA5 models diversify their behaviors: the
former model, in Figure 7.6(c), keeps on detecting all relevant state changes, but
its precision decreases due to a higher number of false detections; the second
model, in Figure 7.6(d), misses many relevant state changes, thus strongly affect-
ing its recall quality.
[Four panels, one per detector, each showing data value vs. sample (0-600) with the on-line state representation, the multi-step profile, and true/false positive detections: (a) Wavelet Cusum, (b) Baseline Cusum, (c) ARL0 Cusum, (d) EWMA5.]

Figure 7.6: Qualitative evaluation - Multi-step profile - σe = 0.9 and ρe = 0.3.
We can conclude that the Wavelet Cusum detector provides the best results,
both in terms of recall and precision, regardless of the complexity of the time series
profile, the number and length of the stable states and the characteristics of the
noise components.
7.1.6 On-line state change detection for IDC management
The previous results on the on-line detection of relevant state changes are now applied
to a real Internet Data Center context. The difficulty here is to detect intrinsic
changes that are not directly observed and that are measured together with other
types of perturbations. We expect that working on a system representation col-
lecting only the deterministic patterns of the Data Center servers solves the
problems of existing algorithms. The representative deterministic eigenresource
can be used as state representation for proper detection rules in order to signal
relevant state changes in the system.
In this section, we report the results of several experiments run on the 50
servers of the Internet Data Center, in order to understand the impact of changing
workloads on the resulting PCA representations. We exercise the Data Center
servers through some representative synthetic user scenarios referring to several
workload models having different impacts on system resources. Every experiment
generates multiple data sets referring to the system resource measures of the 50
servers. We use these data sets to evaluate the impact of non-transient changes in
the state of the server resources, and as input resource measures for the proposed
multi-phase methodology. This allows us to evaluate how and whether workload
changes have an impact on the representative eigenresources resulting from the PCA-
based technique.
By applying the proposed methodology to the considered Internet Data Center
we can observe that:

• sudden changes in the number of emulated browsers have influence only on
deterministic eigenresources;

• sudden changes in the impact that the simulated workload models have on
system resources influence only noise eigenresources (this result is not in-
vestigated in this thesis but left for future work);

• sudden changes in the number of emulated browsers under low workload
scenarios influence the resources of the servers, but are not registered by
the representative deterministic eigenresource. They impact deterministic
dimensions having a low contribution to the overall energy of the system
(that is, a low singular value). This reflects in a small addition to the
Rdeterministic vision and has no relevant effect on the Internet Data Center
representative time series;

• sudden changes in the number of emulated browsers under heavy workload
scenarios have influence on the resources of the servers, and are registered
by the representative deterministic eigenresource. They impact deter-
ministic dimensions having a high contribution to the overall energy of the
system, and cause a relevant state change in the Rdeterministic time series.
These results confirm the scientific value of the proposed methodology: rep-
resentative eigenresources register relevant state changes only in the case of highly
impacting events in the service demand. Applying state change detection models to
the representative deterministic representation guarantees that only state changes
having strong repercussions on the Internet Data Center activity are signaled. This ap-
proach prevents the activation of system procedures when there are state changes
that affect the state of few servers but do not have an impact on the work of the
whole system.
In the case of relevant non-transient changes in the Internet Data Center activ-
ity, all the experiments demonstrate that the representative deterministic eigenre-
source reflects the sudden increase or decrease of the load conditions.
Figure 7.7 shows the representative deterministic eigenresource time series
resulting from the PCA-based technique applied to the measurements of one ex-
periment. We choose a step scenario describing a sudden load increment from a
relatively unloaded to a more loaded system, followed by a subsequent decrease
[111]. The population is kept at 120 emulated browsers for the first 66 hours of
the week, then it is suddenly increased to 200 emulated browsers for an equivalent
period of time. Then, a sudden decrease re-establishes the initial conditions.
All experiments verify that the increase and decrease in the number of requests
is registered by the representative deterministic eigenresource.
[Plot: Rdeterministic vs. samples over one week (Mon-Sun).]

Figure 7.7: Relevant state changes on the representative deterministic eigenresource.

Figure 7.8 represents the behavior of the EWMA5, Baseline Cusum, ARL0 Cusum
and Wavelet Cusum detection models on the representative deterministic eigenre-
source. The figures display the representative deterministic time series (continu-
ous gray line) and the vertical lines with a circle at the bottom for false detections
and a cross for true detections.
In this experiment, the Wavelet Cusum in Figure 7.8(a) is the only model that
detects timely and correctly the two relevant state changes without false detec-
tions. The Baseline Cusum and the ARL0 Cusum models (Figure 7.8(b) and Fig-
ure 7.8(c), respectively) are affected by two wrong detections. The Baseline Cusum
detects several consecutive signals in correspondence of every load change, thus
demonstrating its scarce performance during a transition state. The ARL0 Cusum is
affected by false detections when the time series is characterized by stationary
conditions. This reveals its inability to maintain a stable state. In the presented
stochastic contexts, the EWMA5 model misses the two relevant state changes, and this
result affects its recall quality.
The Wavelet Cusum detector applied to the representative deterministic eigen-
resource is a reliable, both in term of recall and precision,novel solution for the
management of the considered Internet Data Center. Relevant state changes in the
state of the whole system are timely detected and no false alarms activate useless
management decisions.
Figure 7.8: Performance evaluation of state change detection models on the representative deterministic eigenresource: (a) Wavelet Cusum, (b) Baseline Cusum, (c) ARL0 Cusum, (d) EWMA5.
7.2 Anomaly detection
Anomaly detection refers to the problem of finding patterns in the time series that do not conform to the expected behavior. These non-conforming patterns are often referred to as anomalies, outliers, discordant observations, exceptions, aberrations, surprises, peculiarities or contaminants in different application domains [31]. Among these definitions, anomalies and outliers are the two terms most commonly used in the context of anomaly detection.
Anomaly detection algorithms are applied to a large variety of applications, such as fraud detection for credit cards, insurance or health care, intrusion detection for cyber-security, military surveillance for enemy activities, and, as our interest, fault detection in critical information systems.
The importance of anomaly detection is due to the fact that anomalies in data translate to significant (and often critical) actionable information in a wide
variety of application domains. For example, an anomalous traffic pattern in a
computer network could mean that a hacked computer is sending out sensitive
data to an unauthorized destination [79]. An anomalous MRI image may indicate
presence of malignant tumors [118]. Anomalies in credit card transaction data
could indicate credit card or identity theft [5] or anomalous readings from a space
craft sensor could signify a fault in some component of the space craft [58].
In this work, we are interested in anomaly detection in time series coming from the monitoring of Internet-based system resources. In Internet-based contexts, anomaly detection is usually directed to intrusion detection, that is, the detection of malicious activity (break-ins, penetrations, and other forms of computer abuse) in a computer-related system [103]. These malicious activities or intrusions are interesting from a computer security perspective. An intrusion is different from the normal behavior of the system, and hence anomaly detection techniques are applicable in the intrusion detection domain. Identifying anomalies rapidly and accurately is crucial to the efficient operation of large computer systems. Identifying, diagnosing and treating anomalies in a timely fashion is a fundamental part of day-to-day computer operations. Without this kind of capability, systems are not able to operate efficiently or reliably.
7.2.1 Problem definition
Accurate identification and diagnosis of anomalies depends first on robust and timely data, and second on established methods for isolating anomalous signals within the data. The key challenge for anomaly detection in the Internet-based systems domain is the huge volume of data: anomaly detection techniques need to be computationally efficient to handle these large-sized inputs. Moreover, the data typically comes in a streaming fashion, thereby requiring on-line analysis. Another issue which arises because of the large-sized input is the false alarm rate. Since the data amounts to millions of data objects, even a few percent of false alarms can make analysis overwhelming for an analyst. Finally, system operators usually use data sets collected from the server resources of the system and analyze them separately to identify resource anomalies. These discordant observations refer to
an isolated view of the system, concerning only one resource and not taking into account the interactions among servers.
Applying anomaly detection techniques to only one resource at a time causes two drawbacks:
• reductive view, since it does not give a reliable detection of non-conforming events in the entire system behavior, but can only reveal that something discordant has happened in the behavior of one resource measure;
• high dimensionality, since it implies the investigation and the application of an anomaly detection model on each resource time series of each server of the system. These time series are numerous and many samples long, and make whole system anomaly detection at runtime an intractable problem.
Applying anomaly detection techniques to the representative eigenresources coming from the PCA-based technique proposed in Chapter 4 allows us to solve the two mentioned problems. First of all, representative eigenresources collect the relevant information about the state of the system, thus embodying an exhaustive view of its behavior. Second, anomaly detection models must be applied to only three time series, thus reducing the dimensionality of the whole system anomaly detection problem.
We focus on on-line anomaly detection, with particular interest in two of the many accepted meanings of the word "anomaly". We consider point anomaly detection in Section 7.2.2 and collective anomaly detection in Section 7.2.5. Both applications aim at signaling unexpected events in Data Center activity. They help Data Center operators to adapt the system to changing environments, to identify malicious activities, and to activate management procedures to avoid undesired consequences.
7.2.2 Point anomaly detection
If an individual data instance can be considered anomalous with respect to the rest of the data, then the instance is termed a point anomaly. This is the simplest type of anomaly and is the focus of the majority of research on anomaly detection.
For example, in Figure 7.9 the data samples marked by a cross depart from the behavior of the normal region, and hence are point anomalies, since they are different from normal data points. In the time series domain, a data sample whose value is very high or very low compared to the normal range of variability for that time series is a point anomaly.
Figure 7.9: Example of point anomalies.
As a real-life example, consider credit card fraud detection. Let the data set correspond to an individual's credit card transactions. For the sake of simplicity, let us assume that the data is defined using only one feature: amount spent. A transaction for which the amount spent is very high compared to the normal range of expenditure for that person is a point anomaly.
7.2.3 Point anomaly detection techniques
Appropriate techniques for point anomaly detection in time series analysis are the statistical ones. Statistical techniques for point anomaly detection are based on the following key assumption:
Assumption. Normal data instances occur in the high probability regions of a stochastic model, while anomalies occur in the low probability regions of the stochastic model.
Statistical techniques fit a statistical model (usually for normal behavior) to the given data and then apply a statistical inference test to determine whether an unseen instance belongs to this model or not. Instances that have a low probability of being generated from the learned model, based on the applied test statistic, are declared as anomalies.
The test used for the identification of spike eigenresources (see Section 3.2) is a simple outlier detection model among the statistical techniques. The sσ threshold test [114] declares anomalous all time series instances that are more than s standard deviations away from the time series mean. Naming µ the time series mean and σ its standard deviation, and setting s = 3, the µ ± 3σ region contains 99.7% of the data instances.
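As an illustration, the sσ threshold test can be sketched in a few lines. This is a minimal sketch, not the thesis implementation; the function name and the injected spike are our own:

```python
import numpy as np

def sigma_threshold_test(series, s=3.0):
    """Flag every sample lying more than s standard deviations from the mean."""
    series = np.asarray(series, dtype=float)
    mu, sigma = series.mean(), series.std()
    return np.abs(series - mu) > s * sigma

# Toy usage: a Gaussian series with one injected (hypothetical) spike.
rng = np.random.default_rng(0)
series = rng.normal(0.0, 1.0, 1000)
series[100] = 8.0  # injected anomalous sample
mask = sigma_threshold_test(series, s=3.0)
```

With s = 3 the µ ± 3σ band covers 99.7% of Gaussian samples, so a handful of the 1000 normal points may still be flagged by chance, while the large injected spike is always caught.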
More sophisticated statistical tests include the box plot rule, often used to detect point anomalies. A box plot depicts the data using summary attributes such as the smallest non-anomaly observation (min), lower quartile (Q1), median, upper quartile (Q3), and largest non-anomaly observation (max). The quantity Q3 − Q1 is called the Inter Quartile Range (IQR). The box plot also indicates the limits beyond which any observation is treated as an anomaly. A data instance that lies more than 1.5 IQR lower than Q1 or 1.5 IQR higher than Q3 is declared an anomaly. The region between Q1 − 1.5 IQR and Q3 + 1.5 IQR contains 99.3% of the observations.
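The box plot rule above can be sketched as follows; this is our own minimal sketch, where the whisker weight k generalizes the 1.5 factor discussed in the text:

```python
import numpy as np

def boxplot_rule(series, k=1.5):
    """Flag samples outside [Q1 - k*IQR, Q3 + k*IQR]."""
    series = np.asarray(series, dtype=float)
    q1, q3 = np.percentile(series, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (series < lower) | (series > upper)

# A value far above the upper whisker is declared an anomaly.
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
flags = boxplot_rule(data)
```

Increasing k lengthens the whiskers and makes the rule less restrictive, which is exactly the knob tuned later in this section.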
A box plot example is given in Figure 7.10: the bottom and top of the box are always the 25-th and 75-th percentiles (the lower and upper quartiles, respectively), and the band near the middle of the box is always the 50-th percentile (the median).
The ends of the whiskers can represent several possible alternative values [90], among them:
• the minimum and maximum of all the data;
• the lowest sample still within 1.5 IQR of the lower quartile, and the highest datum still within 1.5 IQR of the upper quartile;
Figure 7.10: A box plot for a univariate data set.
• one standard deviation above and below the mean of the data;
• the 9-th percentile and the 91-st percentile;
• the 2-nd percentile and the 98-th percentile.
Any data not included between the whiskers is plotted as an outlier with a circle and corresponds to an unexpected value, that is, an anomaly. Thus, the choice of the ends of the whiskers plays the same relevant role as the setting of the s parameter in the sigma threshold test. To establish the whisker limits, context requirements and resource characteristics must be taken into account, in order to obtain the best box plot rule performance.
Box plots have some strengths with respect to the sigma threshold test. Besides graphically displaying a data set's location and spread at a glance, they provide some indication of the data's symmetry and skewness. Moreover, by drawing box plots for many time series side-by-side on the same graph, one can easily compare data sets. This is not easy to obtain through a sigma threshold test which, on the other hand, has the advantage of showing the data sample values and the occurrence in time of the anomalous samples.
7.2.4 On-line anomaly detection for IDC management
We now show an application of the two anomaly detection techniques presented in the previous section. They are applied to the representative spike eigenresource with the aim of detecting point anomalies in the behavior of the Internet Data Center. Applying the anomaly detection techniques to the Rspike time series ensures that each signal corresponds to an effective unexpected event in the life of the system. Each detection needs an accurate investigation of its causes and the activation of management procedures to face uncommon behaviors.
Starting from the sσ threshold test, we test the performance of the anomaly detector on the Rspike time series for different values of the s parameter. Several tests lead us to set s = 5 as the best choice in our application context, guaranteeing an acceptable trade-off between the number of detected true and false point anomalies.
Figure 7.11 shows the results of the 5σ test for point anomaly detection on the representative spike eigenresource. The horizontal dotted lines are plotted at 5σ distance from the zero mean of the time series, in the positive and negative directions. The upper threshold line intersects some eigenresource samples that are therefore detected as anomalous. In correspondence of the time series instances with an upper cross, something peculiar has happened, causing a relevant instantaneous change in the normal behavior of the Internet Data Center.
Once the 5σ threshold values are learned on a significant set of past samples, they can be fixed for on-line anomaly detection. Given the spike eigenresource sample xi at time ti, it is compared to the upper and lower thresholds. If xi exceeds (in the positive or negative direction) one of the two thresholds, a point anomaly is detected at time ti. This helps in the immediate investigation of the causes of the uncommon event and the instantaneous activation of suitable management procedures.
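The on-line procedure just described can be sketched as below; the class name and the synthetic training history are illustrative assumptions, not part of the thesis:

```python
import numpy as np

class OnlineSigmaDetector:
    """Learn mu +/- s*sigma thresholds on past samples, then test new samples one by one."""

    def __init__(self, history, s=5.0):
        history = np.asarray(history, dtype=float)
        mu, sigma = history.mean(), history.std()
        self.lower, self.upper = mu - s * sigma, mu + s * sigma

    def check(self, x_i):
        # Point anomaly if the new sample exceeds either fixed threshold.
        return bool(x_i < self.lower or x_i > self.upper)

# Learn thresholds off-line on a hypothetical history, then apply on-line.
rng = np.random.default_rng(1)
detector = OnlineSigmaDetector(rng.normal(0.0, 1.0, 5000), s=5.0)
```

Each incoming sample then costs only two comparisons, which is what makes the test viable at runtime.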
The threshold values can be kept constant until a relevant state change is detected. Relating the outcomes of the state change detection models described in Section 7.1.6 to the setting of the anomaly detection technique parameters makes it possible to adapt them to the evolution of the notion of normal behavior and to keep the parameter values representative also under new state conditions.
Figure 7.11: Performance evaluation of the 5σ test for point anomaly detection on the representative spike eigenresource.
The other point anomaly detector we test is the box plot rule. Also for this technique, we have examined different settings of its parameters and collected the relative performance. Fixing the ends of the box plot whiskers equal to (Q1 − 1.5 IQR) and (Q3 + 1.5 IQR) implies that 99.3% of the data set observations fall within the whiskers.
Even though considering as anomalous the 0.7% of observations departing from a Gaussian distribution may seem a strict rule, the application context of the box plot rule must also be taken into account. In the context of Internet Data Center management, every anomaly detection causes several operations, generally expensive and time consuming. Reducing the number of detections to only those really relevant for the sake of the system permits limiting the activation of control procedures and avoiding false alarms in the system. In our experiments, we see that a 1.5 weight for the Inter Quartile Range causes an excessive number of point anomaly detections, most of them irrelevant and not linked to a real problem in the system.
For this reason, the proposed application of the box plot rule to the representative spike eigenresource sets a higher weight for the IQR, thus giving a technique that is less restrictive than the 5σ test but still signals unexpected events that can be traced back to something strange happening in the Internet Data Center.
Figure 7.12: Box plot of the representative spike eigenresource.
Figure 7.12 shows the box plot of the representative spike eigenresource time series. In this case, the IQR weight is set equal to 5. We can appreciate the Gaussian distribution with zero mean and very low standard deviation of a small set of samples, as evinced by the short rectangle enclosing the Inter Quartile Range. The choice of a 5 IQR weight motivates the long whiskers, beyond which samples are declared anomalous. All but one of those samples fall in the positive tail of the Gaussian distribution, thus suggesting point anomalies due to unexpected bursts in the system activity, caused by a sudden increase of CPU utilization.
Searching for those entries in the representative spike time series makes it possible to identify the anomalous samples and their occurrence in the period of monitoring. Figure 7.13 graphically reports the results of the box plot rule on the representative spike eigenresource. Each cross at the top or at the bottom of the figure denotes an anomaly found in the upper or lower portion of the box plot in Figure 7.12. Comparing these results to those of the 5σ test in Figure 7.11, we can appreciate an evident increase in the number of detected point anomalies. The higher number of crosses at the top and at the bottom of the last figure evidences how the choice of the weight parameter gives rise to an anomaly detection technique less stringent than the previous one. Despite that, all detected point anomalies can be related to strange events in the system behavior, even though not so critical as to cause a fault or an overload in the system.
Figure 7.13: Performance evaluation of the box plot rule for point anomaly detection on the representative spike eigenresource.
This second anomaly detection technique can be used for the removal of meaningless samples from the deterministic and noise representative eigenresources. An anomalous event in the system behavior detected at time ti by the spike eigenresource may also affect the corresponding samples xi captured by the other two representations, making them unreliable. An accurate investigation of all three representative eigenresources in correspondence of a box plot test signal is crucial for the correct interpretation of the point anomaly and for the cleaning of the deterministic and noise time series from spurious samples.
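One possible way to clean the deterministic and noise series at the flagged instants is linear interpolation over the non-anomalous neighbors; this is a hypothetical sketch of ours, not the cleaning procedure of the thesis:

```python
import numpy as np

def clean_series(series, anomaly_mask):
    """Replace samples flagged as point anomalies with linearly interpolated values."""
    series = np.asarray(series, dtype=float).copy()
    anomaly_mask = np.asarray(anomaly_mask, dtype=bool)
    idx = np.arange(series.size)
    # Interpolate the flagged positions from the surrounding clean samples.
    series[anomaly_mask] = np.interp(idx[anomaly_mask],
                                     idx[~anomaly_mask],
                                     series[~anomaly_mask])
    return series
```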
Besides that, the representative deterministic eigenresource also carries crucial information for the detection of anomalies in the system. The information carried by the deterministic class of eigenresources is of interest for the identification of collective anomalies in the system activity.
7.2.5 Collective anomaly detection
A collective anomaly occurs when a collection of related samples is anomalous with respect to the entire time series. The individual data instances in a collective anomaly may not be anomalies by themselves, but their occurrence together as a collection is anomalous [31]. Let us give a graphical example.
Figure 7.14 shows an example of a deterministic eigenresource time series. The encircled region denotes a collective anomaly because the time series values with low variability persist for an abnormally long time (corresponding to a day of normal expected activity of the Data Center). Note that those values by themselves are not anomalous.
Figure 7.14: Example of collective anomaly.
While point anomalies can occur in any time series, collective anomalies can occur only in time series in which the samples are related. Therefore, the representative spike and noise eigenresources are worthless for searching for uncommon patterns in the Internet Data Center behavior. On the other hand, we can apply collective anomaly detection techniques to the representative deterministic eigenresource resulting from our multi-phase methodology.
The anomaly detection technique used for the search of collective anomalies is based on time series prediction, addressed in Chapter 6. Once the future expected values of the Rdeterministic time series are predicted, a relevant gap between the prediction and the real eigenresource sample is a symptom of an unexpected pattern in the trend and seasonal components of the system. This approach eases the investigation of collective anomalies in the behavior of the Internet Data Center.
7.2.6 Collective anomaly detection techniques
A collective anomaly occurs only in time series in which samples have some kind of deterministic relation, since this type of anomaly is composed of a collection of related samples that is anomalous with respect to the entire time series. Thus, the relationships among samples are fundamental for the occurrence and the identification of collective anomalies.
For this reason, the preliminary analysis for collective anomaly detection techniques is directed to the verification of correlations among the time series values. This property can be verified through the correlograms and periodograms presented in Chapter 3. If these analyses demonstrate the existence of a deterministic component underlying the time series behavior, then collective anomalies may occur and searching for them is worthwhile.
In comparison to the rich literature on point anomaly detection techniques, the research on collective anomaly detection has been limited. Broadly, such techniques can be classified into two categories. The first category of techniques reduces a collective anomaly detection problem to a point anomaly detection problem, while the second category models the structure in the data and uses the model to detect anomalies. We adopt the latter solution.
A generic technique in this category can be described as follows. A model is learned from the data set, through which the expected behavior is predicted with respect to a given context. If the expected behavior is significantly different from the observed behavior, an anomaly is declared. A simple example of this generic technique is regression, in which the contextual attributes can be used to predict the behavioral attribute by fitting a regression line to the data.
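The regression example can be sketched as follows; this illustration, with its arbitrary residual threshold, is ours and not taken from the thesis:

```python
import numpy as np

def regression_anomalies(t, y, threshold):
    """Fit a regression line to (t, y) and flag points whose residual exceeds threshold."""
    slope, intercept = np.polyfit(t, y, 1)
    residuals = y - (slope * t + intercept)
    return np.abs(residuals) > threshold

# Toy usage: one sample breaks the otherwise linear structure.
t = np.arange(10.0)
y = 2.0 * t + 1.0
y[5] += 10.0
flags = regression_anomalies(t, y, threshold=5.0)
```

Here the contextual attribute (time) predicts the behavioral attribute (the measured value), and the anomaly is declared on the residual.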
Like state change detection, on-line collective anomaly detection is typically addressed through two steps: state representation and detection rule.
State representation
In the time series context, several regression-based techniques for data set state representation, such as robust regression [108], autoregressive models [57], ARMA models [2, 3], and ARIMA models [16], have been developed for collective anomaly detection. Thus, among the forecasting models described in Section 6.2, AR and ARIMA models are the most used for collective anomaly detection.
The reason for this is easy to evince graphically from the following comparison among the prediction models. Let us forecast the representative deterministic eigenresource for 2 days in the future. In Figure 7.15 we report the results of the different models in predicting the Rdeterministic time series coming from the application of the PCA-based technique. All predictors are tested under an SD-policy and in an off-line fashion. We set the values of the model parameters on the basis of the static implementations of the prediction models presented in Section 6.2.
All figures display the representative deterministic eigenresource for the last three days of the week and the predicted values for the next two days. The former is represented by the gray line, while the latter are shown by the subsequent black line. The dotted lines delimit the prediction intervals computed for c = 0.80 and c = 0.95. In Figure 7.15 we can appreciate two main behavioral classes: the class of predictors that reproduce in the future only the common trend of the time series, and the one that also recreates its seasonal behavior. The SR, CS, EWMA, and Holt's techniques belong to the former class. Figures 7.15(a), (b), (c) and (d) clearly show prediction results that do not take into account the periodic component of the time series. The predicted values only follow the main tendency of the time series, reproducing it for the next two days in the future. The autoregressive models in Figures 7.15(e) and (f), instead, learn the systematic component from the
Figure 7.15: Prediction results on the representative deterministic eigenresource: (a) SR, (b) CS, (c) EWMA, (d) Holt's, (e) AR, (f) ARIMA.
past samples, and recreate it in their future predictions.
Since we are interested in time series forecasting directed to the identification of collective anomalies, we focus on the AR and ARIMA predictors and their reproduction of the seasonal behavior of the time series. The reason for choosing autoregressive models is that the key factor in differentiating between normal and anomalous behavior is the co-occurrence of events. Thus, time series predictors that lose the periodic correlation among samples are useless for collective anomaly detection. Since we search for unexpected behaviors departing from the common seasonal activity of the system, we need a prediction technique able to model this seasonal component.
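To make the idea concrete, even a minimal seasonal autoregression (regressing each sample on the one a full period earlier) already reproduces a periodic component in its forecasts. This toy model is our own sketch, not the AR/ARIMA implementation of Chapter 6:

```python
import numpy as np

def seasonal_ar_forecast(series, period, steps):
    """Fit x_t ~ a * x_{t-period} + b by least squares and forecast iteratively."""
    series = np.asarray(series, dtype=float)
    a, b = np.polyfit(series[:-period], series[period:], 1)
    history = list(series)
    preds = []
    for _ in range(steps):
        preds.append(a * history[-period] + b)
        history.append(preds[-1])
    return np.array(preds)

# A purely periodic series is continued in the forecast.
x = np.sin(2 * np.pi * np.arange(200) / 24.0)
forecast = seasonal_ar_forecast(x, period=24, steps=48)
```

A trend-only predictor (EWMA, Holt's) cannot reproduce such a continuation, which is exactly why it is useless for detecting departures from seasonal behavior.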
Let us read the results obtained in Section 6.3 in this direction. For anomaly detection purposes, stringent time constraints are not required, since this type of anomaly can be discovered in periodic behaviors repeating every 24 hours. For our scope, all prediction policies are applicable in terms of computational cost, including the pre-filtering ones using complex prediction models, such as adaptive ARIMA under an AT-policy.
What is crucial in collective anomaly detection is prediction quality. We aim at low prediction errors and small prediction intervals, since our intent is to detect all and only the real collective anomalies. This is possible only through reliable time series predictions. The most reliable prediction is guaranteed by the AT-policy, both in terms of prediction error and prediction interval. For this reason, in this section we consider prediction methods providing a trend representation of the time series on which a prediction model is applied that dynamically adapts the number of parameters and their values to the inconstant characteristics of the data set. This choice is particularly suited to highly variable and stochastic contexts, such as that of Internet-based systems.
Having chosen the AT-policy for our purpose, we focus on the autoregressive models. As discovered in Section 6.3, AR and ARIMA models have similar PI performances. Their behavior in terms of prediction error is otherwise very different. We discover a big gap between the AR and ARIMA performance in terms of prediction errors under the AT-policy, with a PE = 0.32 for the AR model and a PE = 0.029 for the ARIMA one. Since an order of magnitude separates the prediction quality metrics of the two autoregressive models, we choose to adopt the ARIMA prediction technique to detect collective anomalies in the considered Internet Data Center.
Detection rule
The collective anomaly detection rule adopted in this work follows the approach of the statistical rule described for state change detection (see Section 7.1.2). The problem we consider now is to detect at runtime the occurrence of a collective anomaly in the state of the system, with a minimum or null rate of false detections. As for state change detection, this is done on the basis of a statistical change detection rule obtained by comparing, at each sample i, the test statistic gi with a characteristic statistical threshold H. A collective anomaly is detected on the basis of the following equation:

detection rule = { collective anomaly, if gi ≥ H
                 { no anomaly,         otherwise.     (7.14)
The test statistic gi is an indicator of the detection model: during normal behavior it should be close to zero, and it departs from zero when an anomalous behavior occurs.
At time ti, we consider the time series S[r]i and the predicted values computed from 1 to k steps ahead, {fi+1, . . . , fi+k}. Every fi+j, with 1 ≤ j ≤ k, goes with the corresponding prediction interval PIj = [li+j, ui+j]. The considered rule for detecting a collective anomaly computes the following test statistics:

g0 = 0                                                  (7.15)

     { |fi+j − li+j|, if fi+j < li+j
gi = max { |fi+j − ui+j|, if fi+j > ui+j                (7.16)
     { 0,              otherwise

which measure the positive deviations of the monitored value fi+j from the prediction interval.
The test statistic gi accumulates the deviations of fi+j from li+j and ui+j, in the case of c = 0.95. A collective anomaly is detected when gi exceeds a statistical threshold H. Through an empirical evaluation, we dynamically set H = 0.05·σi, where σi is the standard deviation of the time series Xi at time ti. This setting of the H parameter value is suited to the characteristics of the representative deterministic eigenresource, which is cleaned of all perturbations due to noise and spikes. All the information carried by the Rdeterministic time series is meaningful for the identification of the periodic behavior, thus even a small H value achieves good detection quality.
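The detection rule of Equations (7.14)-(7.16) can be sketched as below. The array names are our own, and we compare each value fi+j against its prediction interval [li+j, ui+j] as in the text:

```python
import numpy as np

def collective_anomaly_rule(f, lower, upper, H):
    """g_i = max deviation of f from [lower, upper]; anomaly declared if g_i >= H."""
    f, lower, upper = (np.asarray(v, dtype=float) for v in (f, lower, upper))
    # Deviation is |f - l| below the interval, |f - u| above it, 0 inside (Eq. 7.16).
    deviations = np.where(f < lower, lower - f,
                          np.where(f > upper, f - upper, 0.0))
    g_i = deviations.max()
    return g_i >= H, g_i  # Eq. 7.14

# One value exceeds its upper prediction limit by 1.0.
detected, g = collective_anomaly_rule([0.0, 1.0, 3.0],
                                      lower=[-1.0, -1.0, -1.0],
                                      upper=[2.0, 2.0, 2.0],
                                      H=0.5)
```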
7.2.7 On-line collective anomaly detection for IDC management
Figure 7.16 shows the results of the collective anomaly detection technique applied to the representative deterministic eigenresource coming from PCA. The predicted values are computed for three days in the future through an ARIMA forecasting model that adapts its parameters at each prediction step.
The gray line is the representative deterministic eigenresource. The continuous black line represents the predicted values, setting k = 864, enclosed by the prediction intervals computed for values of c equal to 0.80 and 0.95. The limits of the prediction interval, ui+j and li+j, are displayed through the black dotted lines above and below the prediction. A time series sample that falls outside PIj = [li+j, ui+j] with c = 0.95 contributes positively to the test statistic gi and thus to the detection of a collective anomaly. In correspondence of the cross displayed at the bottom of the figure, the test statistic gi overcomes the statistical threshold H and thus an anomaly is detected.
Investigating the time series in correspondence of the detection, we can assess that it is a correct signal. At the detection point, the eigenresource loses its characteristic periodic behavior, with a long-lasting period of stable values that is anomalous with respect to the seasonal past behavior of the system activity.

Figure 7.16: Performance evaluation of the collective anomaly detection model on the representative deterministic eigenresource.

This is probably a symptom of something strange happening in the Internet Data Center and thus deserves in-depth investigation. In Internet-based systems, anomalous subsequences may translate to malicious programs, unauthorized behaviors and policy violations. The detection of each of the unexpected events addressed in this chapter guides investigation procedures towards a better Internet Data Center management.
7.2.8 Conclusions
The proposed multi-phase methodology provides some advantages to the runtime
models for anomaly detection considered in this work.
• Helps in the choice of suitable anomaly detection techniques
Over time, a variety of anomaly detection techniques have been devel-
oped in several research communities. Many of these techniques have been
specifically developed for certain application domains, while others are more
generic, but all need a contextual definition of the normal domain behavior.
Defining a normal region which encompasses every possible normal behavior is very difficult. In addition, the boundary between normal and anomalous behavior is often not precise. Thus, an anomalous observation which lies close to the boundary can actually be normal, and vice versa. The studies carried out in Chapter 4 on the three representative eigenresources coming from the application of the PCA-based method allow the definition of normal and anomalous behaviors suited to the eigenresource characteristics. This is a crucial starting point for the choice of anomaly detection techniques appropriate for whole system analysis and management.
• Permits a specific definition of anomaly
The exact notion of anomaly is different for different application domains.
For example, in CPU utilization a small deviation from normal (e.g., an increase of 20%) might be an anomaly, while a similar deviation in the network packet rate might be considered normal. Thus, applying a technique developed in one domain to another is not straightforward. Working on known time series, with studied characteristics and features, allows a definition of anomaly suited to the domain under examination. All the results obtained in Chapter 4 highlight the main characteristics of the eigenresources, and thus enable a specific definition of anomaly for each behavioral class.
• Extrapolates pattern contributions
Setting apart seasonal and trend components allows the identification of
surprising actions departing from the periodic behavior. This topic is related
to the detection of the so-called collective anomalies: sets of related time series instances that are anomalous with respect to the entire time series. The individual data instances in a collective anomaly may not be anomalies by themselves, but their occurrence together as a collection is
anomalous. By investigating the representative deterministic eigenresource
coming from the application of the PCA-based technique it is possible to
identify collective anomalies in the whole system behavior, otherwise hard
to find.
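A minimal sketch of this idea: compare each period-length window of the series with the average seasonal profile, so that a window whose samples are individually unremarkable but collectively flat is still flagged. The period, the threshold and the synthetic data below are illustrative assumptions:

```python
import numpy as np

def collective_anomalies(series, period, thresh=0.5):
    """Flag period-length windows whose mean absolute deviation from the
    average seasonal profile exceeds `thresh` (in units of the series'
    standard deviation); the threshold is an illustrative choice."""
    x = np.asarray(series, float)
    n = len(x) // period
    cycles = x[: n * period].reshape(n, period)
    profile = cycles.mean(axis=0)              # average seasonal template
    scores = np.abs(cycles - profile).mean(axis=1) / (x.std() or 1.0)
    return [c for c, s in enumerate(scores) if s > thresh]

# Ten sine cycles; cycle 7 is "stuck" at zero. No single sample is extreme,
# but the flat subsequence is anomalous with respect to the whole series.
t = np.arange(500)
x = np.sin(2 * np.pi * t / 50)
x[350:400] = 0.0
print(collective_anomalies(x, period=50))    # → [7]
```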
• Isolates noise components
Distinguishing the noise often present in time series makes it possible to eliminate those components that tend to resemble actual anomalies and hence can affect their detection. Noise can be defined as a phenomenon in the time series which is not of interest to the analyst, but acts as a hindrance to time series analysis. Noise removal is a fundamental step of anomaly detection techniques, since it discards the unwanted samples that may be mistaken for anomalies. Isolating the representative noise eigenresource makes it possible to remove the unwanted objects before anomaly detection is performed on the system representations.
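As a minimal sketch of this step, a centered moving average can stand in for the noise isolation: the raw series is split into a smooth component and a noise residual, and subsequent analysis is run on the smooth part. The filter length and the synthetic series are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(500)
clean = np.sin(2 * np.pi * t / 100)           # deterministic component
noisy = clean + rng.normal(0.0, 0.3, t.size)  # plus additive white noise

kernel = np.ones(11) / 11                     # centered moving average
smooth = np.convolve(noisy, kernel, mode="same")
noise = noisy - smooth                        # isolated noise component

# Detection would run on `smooth`: it tracks the deterministic pattern
# far more closely than the raw series does.
err_raw = np.abs(noisy - clean).mean()
err_smooth = np.abs(smooth - clean).mean()
print(err_smooth < err_raw)                   # → True
```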
Chapter 8
Related work
Most existing runtime techniques for Internet-based system management rely on models built on predictable trends and periodicities, which are in turn isolated from noise and spike influences. One of their main obstacles lies in isolating underlying meaningful patterns from trivial error components. Another problem is that they take decisions on the basis of a representation of single system resources.
The proposed multi-phase methodology overcomes these problems by reducing the computational complexity of runtime system management. It helps a system administrator answer a variety of questions, such as: (1) how to
size the IT infrastructure; (2) which servers are most used and need to be better
investigated; (3) how much and when to add (or remove) physical hardware when
computing demand increases or changes; (4) which could be the best plan for
capacity usage of the entire Data Center, in order to satisfy pre-specified service
level objectives (SLOs).
The four phases of the proposed multi-phase methodology apply known models and algorithms at runtime, but the innovation of this work resides in their application to a new representation of the whole system state. We evidence and discuss the following main contributions:
• Whole system view vs. Single component view;
• Principal Component Analysis vs. Parametric techniques;
• Realistic Internet-based system vs. Simulated systems;
• Runtime decision algorithms vs. Off-line algorithms.
Whole system view vs. Single component view
Modeling resource time series in a single server node has attracted considerable research. Various metrics are collected, analyzed and visualized
for various purposes, such as traffic modeling, capacity planning and re-
source management. On the basis of this information, several researchers
have characterized the Web workload by fitting distributions to data (e.g.,
heavy-tailed distributions [11,13,30], burst arrivals [69] and hot spots [14])
and by proposing performance models driven by such distributions [48].
All the analyses confirm that the external traffic reaching an Internet-based
system shows some periodic behavior [14] that facilitates its interpretation
and management. Hence, existing results are useful for capacity planning
and system dimensioning goals on servers, but they are useless to estimate
at runtime the state of an Internet Data Center and to guide runtime man-
agement methods.
Principal Component Analysis vs. Parametric techniques
Common methods to represent the resource state are based on the periodic
collection of samples through server monitors and on the direct use of these
values. Some low-pass filtering of network throughput samples has been
proposed in [110], but the majority of resource state interpretation algo-
rithms for the runtime management of Internet-based systems are based on
functions that work directly on resource measures [9, 12, 29, 33, 52, 73, 93,
99, 106, 116, 128]. Other studies based on a control theoretical approach to
prevent overload or to provide guaranteed levels of performance in Web sys-
tems [1,72] refer to direct resource measures (e.g., CPU utilization, average
Web object response time) as feedback signals.
Other works [34, 44] have proposed parametric models based on moving averages and on linear regression. Their problem is that modern Internet-based systems are characterized by complex hardware/software architectures and
by stochastic and highly variable workloads that cause instability of system
resource measures. The observed measures of the internal resources are characterized by noise, heteroscedasticity and short-term dependencies, which prevent an initial optimal setting of the parameters and require their continuous update.
The context of Internet-based systems is typically subject to stochastic loads, heavy-tailed distributions [13] and flash crowds [69], extreme variability and a tendency to become obsolete rather quickly [39]. Hence, parametric models based on a static setting of their parameter values are unable to follow the continuous changes of monitored resource measures. On the other hand, techniques providing a dynamic estimation of their parameters are of little help in the face of stochastic and highly variable time series, affected by random errors and strong perturbations. In these contexts, the tuning of their parameters is impossible or extremely time consuming and risks suggesting completely wrong actions.
Principal Component Analysis is a non-parametric technique that helps to distinguish overload conditions from transient peaks, to understand load trends and seasonality, and to isolate undesired noise components. PCA does not make any assumption about the statistical characteristics of resource measures and does not need any parameter setting. It works on the set of monitored samples and extracts their intrinsic dimensionality from the information contained in the time series.
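The parameter-free extraction of intrinsic dimensionality can be sketched as follows: fifty synthetic "resource" series driven by only two hidden patterns are reduced by PCA (via SVD of the centered data matrix), and the number of components needed to explain 95% of the variance recovers the hidden dimensionality. The data, the mixing and the 95% cut-off are illustrative assumptions, not the thesis's measurements:

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(1000)

# 50 monitored resource time series driven by only 2 underlying patterns
seasonal = np.sin(2 * np.pi * t / 288)            # periodic pattern
trend = 2.0 * (t / t.size - 0.5)                  # slow linear trend
mix = rng.normal(size=(50, 2))                    # random mixing weights
X = mix @ np.vstack([seasonal, trend]) + rng.normal(0.0, 0.05, (50, t.size))

# PCA via SVD of the centered data matrix: no parameter to tune
Xc = X - X.mean(axis=1, keepdims=True)
s = np.linalg.svd(Xc, compute_uv=False)           # singular values
var_explained = np.cumsum(s**2) / np.sum(s**2)
k = int(np.searchsorted(var_explained, 0.95)) + 1
print(k)                                          # → 2: the intrinsic dimensionality
```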
Although, to the best of our knowledge, the PCA dimensional analysis has
not been applied for the whole system analysis and for the investigation of
the entire set of measurements of an Internet Data Center, it has been em-
ployed in other contexts, such as face recognition [76], brain imaging [123],
meteorology [105] and fluid dynamics [115].
The PCA-based technique proposed in this work has been validated in other contexts, such as network traffic flows [10] and application workload characterization for utility computing [4]. These studies are limited to the characterization and analysis of input time series behavior and are not oriented to collecting representative visions of the systems they deal with.
Realistic Internet-based system vs. Simulated systems
Our work focuses on real Internet-based systems integrated with load monitoring strategies and management tasks, and characterized by heavy-tailed
workloads that are too complex for an analytical representation [53, 86].
Related studies were oriented to simulation models of simplified architectures [1, 26, 34, 99, 121], which represent an interesting research objective [54] but cannot take into account the interesting and complex issues of real systems.
There are many studies on the characterization of resource loads, albeit related to systems subject to workloads quite different from those considered in this study. For example, the authors in [93] evaluate the effects
of different load representations on job load balancing through a simulation
model that assumes a Poisson job inter-arrival process. A similar analysis
concerning Unix systems is carried out in [52]. Dinda et al. [44] investi-
gate the predictability of the CPU load average in a Unix machine subject
to CPU bound jobs. The adaptive disk I/O prefetcher proposed in [122] is
validated through realistic disk I/O inter-arrival patterns referring to scien-
tific applications. The workload features considered in all these pioneering papers differ substantially from the load models characterizing Internet-based servers, which show stochasticity, bursty patterns and heavy-tailed workloads even at different time scales.
Other papers make strong assumptions on the nature of the workload, which simplify many state representation problems. For example, the authors in [85] present a mechanism that works well with mildly oscillating or sta-
tionary workloads; in [124] stochastic models for the FTP transfer times
are presented; the host CPU load average is studied in [44]; some models
on network traffic, assumed as a Gaussian process, are analyzed in [110].
These assumptions are too restrictive for workloads characterizing modern
Internet-based systems.
Runtime decision algorithms vs. Off-line algorithms
Runtime load state interpretation of the internal resources has not received much attention yet, especially if we refer to Internet-based systems. For ex-
ample, Pacifici et al. [97] propose a model for estimating at runtime the CPU
demand of Web applications, but not for positioning the resource state with
respect to the system resource capacities. Other studies that are oriented
to server management do not consider runtime constraints. Some exam-
ples include load balancer policies [9, 12, 29, 52, 99], overload and admis-
sion controller schemes [99, 100], request routing mechanisms and replica
placement algorithms [73,116], distributed resource monitors [106].
Even the most common methods for load representations oriented to run-
time management tasks work off-line [14,35,44,74,83,110].
Hence, adequate models for supporting runtime management decisions in
highly variable systems represent an open issue. In this work, we address it
in the context of Internet Data Centers. The management decision mecha-
nisms exercised in the last phases of the proposed methodology are based on
widely known and used algorithms for time series analysis, such as smooth-
ing and interpolation algorithms [104,127], forecasting models [88], detec-
tion rules [15], etc. In this work, we implement them in an on-line way
and test their performance for runtime Internet Data Center management.
The constraints due to on-line decisions lead us to consider models that are characterized by low computational complexity.
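The low-complexity constraint can be illustrated with exponential smoothing, one of the model families cited above: each new sample updates the forecast in constant time and constant memory, which makes it suitable for on-line use. The class name and the smoothing constant are illustrative choices:

```python
class OnlineEWMA:
    """Exponentially weighted moving average forecaster: O(1) time and
    O(1) memory per sample (the smoothing constant alpha is illustrative)."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha
        self.level = None          # current smoothed level

    def update(self, sample):
        """Ingest one sample; return the forecast for the next sample."""
        if self.level is None:
            self.level = float(sample)
        else:
            self.level = self.alpha * sample + (1 - self.alpha) * self.level
        return self.level

f = OnlineEWMA(alpha=0.5)
for x in [10.0, 10.0, 14.0]:
    pred = f.update(x)
print(pred)                        # → 12.0 (0.5 * 14 + 0.5 * 10)
```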
Considering state change detection, many stochastic models are oriented to off-line schemes. The historical reference of all these studies is [98].
Subsequent investigations are in [64, 65, 78]. Other theoretical optimal re-
sults about the likelihood approach to load change detection are proposed
in [42]. It is impossible to propose a simple application of these schemes to
a runtime environment, although we could extend some previous theoretical
results, such as the Cusum algorithms, to the on-line state change detection
problem.
Detecting outliers or anomalies in data has been studied in the statistics
community as early as the 19th century [49]. Over time, a variety of anomaly
detection techniques have been developed in several research communities.
Many of these techniques have been specifically developed for certain appli-
cation domains, while others are more generic. Bronstein et al. [23] propose
a variant of the Bayesian networks technique for network intrusion detec-
tion. For collective anomaly detection, the techniques have to either model
the sequence data or compute similarity between sequences.A survey of
different techniques used for this problem is presented by Snyder [117]. A
comparative evaluation of anomaly detection for host basedintrusion de-
tection is presented in Forrest et al. [55] and Dasgupta and Nino [40]. In
this work, we implement anomaly detection techniques at runtime, that is
an interesting research field recently opened.
There is a huge amount of prediction models that are oriented to off-line forecasting. We can cite support vector machines [38], machine learning
techniques, and Fuzzy systems [119]. None of them can be applied or
adapted to support runtime predictions in a variable environment such as
a typical Internet-based system [43].
Moreover, in this thesis, we propose and analyze runtime prediction mod-
els that do not make any assumption (e.g., linearity, stability) on the dis-
tribution of the data set, as required by other works on runtime short-term
predictions [44,85,110,124].
Chapter 9
Conclusions
In this thesis, we consider the problem of on-line and off-line management of
large Internet Data Centers. To this purpose, we propose a whole system analy-
sis that starts from the acquisition of resource information generated from system
monitors. We address several issues, but the most original proposal is related to
Principal Component Analysis, which allows us to represent thousands of time series collected from the Internet Data Center using fewer than 15 independent dimensions.
This surprisingly low dimensionality motivated us to understand the behavior of the Internet Data Center on the basis of these few dimensions. By examining eigenresources, which are the common patterns of variation underlying resource measures, we could develop considerable understanding of the structure of Internet Data Center resources. The set of eigenresources shows three features: deterministic trends, spikes and noise. Furthermore, we discovered more restrictive behavioral subclasses that help our eigenresource characterization. By assembling the contributions of the dimensions belonging to the same behavioral class, we extracted three representative time series collecting the main features of the Internet Data Center.
Our last objective was to examine the extent to which the three representative
visions can help Internet Data Center management. We consider five application
contexts: trend extraction, time series forecasting, state change detection, point anomaly detection and collective anomaly detection. The results of the PCA-based technique simplify whole system analysis and support runtime Internet Data Center management.
The whole system analysis proposed in this work can be enriched by other
models and decision support systems, and can be applied to different contexts.
In particular, we are studying how to automate the process to generate periodical reports for the IT manager, so that it would be possible to analyze system behaviors over several periods of time. These reports may be generated at different levels, such as the operational, tactical and strategic levels, and more.
The whole system analysis can be a guideline for the investigation of different application contexts, such as network traffic flows, application workloads and virtualized environments, but also for non-technological fields.
Bibliography
[1] T. Abdelzaher, K. G. Shin, and N. Bhatti. Performance guarantees for Web server end-systems: A control-theoretical approach. IEEE Trans. Parallel and Distributed Systems, 13(1):80-96, Jan. 2002.
[2] B. Abraham and G. E. P. Box. Bayesian analysis of some outlier problems in time series. Biometrika, 66(2):229-236, Aug. 1979.
[3] B. Abraham and A. Chuang. Outlier detection and time series modeling. Technometrics, 31(2):241-248, May 1989.
[4] B. Abrahao and A. Zhang. Characterizing application workloads on CPU utilization in utility computing. Technical Report HPL-2004-157, Hewlett-Packard Labs, 2004.
[5] E. Aleskerov, B. Freisleben, and B. Rao. Cardwatch: A neural network based database mining system for credit card fraud detection. In IEEE Computational Intelligence for Financial Engineering, pages 220-226, 1997.
[6] C. Alexander and M. Sadiku. Fundamentals of Electric Circuits. McGraw-Hill, 2004.
[7] D. L. Alspach and H. W. Sorenson. Nonlinear Bayesian estimation using Gaussian sum approximation. IEEE Trans. Automat. Contr., 20, 1972.
[8] M. Andreolini, S. Casolari, and M. Colajanni. Models and framework for supporting run-time decisions in Web-based systems. ACM Trans. on the Web, 2(3), 2008.
[9] M. Andreolini, M. Colajanni, and M. Nuccio. Scalability of content-aware server switches for cluster-based Web information systems. In Proc. of WWW, Budapest, HU, May 2003.
[10] L. Anukool, K. Papagiannaki, M. Crovella, C. Diot, E. D. Kolaczyk, and N. Taft. Structural analysis of network traffic flows. In Proc. of the Joint International Conference on Measurement and Modeling of Computer Systems, pages 61-72, 2004.
[11] M. Arlitt, D. Krishnamurthy, and J. Rolia. Characterizing the scalability of a large Web-based shopping system. IEEE Trans. Internet Technology, 1(1):44-69, Aug. 2001.
[12] J. Bahi, S. Contassot-Vivier, and R. Couturier. Dynamic load balancing and efficient load estimators for asynchronous iterative algorithms. IEEE Trans. Parallel and Distributed Systems, 16(4):289-299, Apr. 2006.
[13] P. Barford and M. E. Crovella. Generating representative Web workloads for network and server performance evaluation. In Proc. of the Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS 1998), Madison, WI, June 1998.
[14] Y. Baryshnikov, E. Coffman, G. Pierre, D. Rubenstein, M. Squillante, and T. Yimwadsana. Predictability of Web server traffic congestion. In Proc. of the 10th International Workshop on Web Content Caching and Distribution (WCW 2005), Sophia Antipolis, FR, Sept. 2005.
[15] M. Basseville and I. Nikiforov. Detection of Abrupt Changes: Theory and Application. Prentice-Hall, 1993.
[16] A. M. Bianco, M. García Ben, E. J. Martínez, and V. J. Yohai. Outlier detection in regression models with ARIMA errors using robust estimates. Journal of Forecasting, 20(8):565-579, Dec. 2001.
[17] G. Birkhoff and C. R. de Boor. Piecewise polynomial interpolation and approximation. In Proc. General Motors Symposium of 1964 (H. L. Garabedian, ed.), Elsevier, New York and Amsterdam, 1965.
[18] G. Bishop and G. Welch. An introduction to the Kalman filter. SIGGRAPH, Course 8, 2001.
[19] D. Bonett. Approximate confidence interval for standard deviation of nonnormal distributions. Computational Statistics and Data Analysis, 50(3):775-882, Feb. 2006.
[20] G. Box, G. Jenkins, and G. Reinsel. Time Series Analysis: Forecasting and Control. Prentice Hall, 1994.
[21] P. J. Brockwell and R. A. Davis. Time Series: Theory and Methods. Springer-Verlag, 1987.
[22] P. J. Brockwell and R. A. Davis. Introduction to Time Series and Forecasting. Springer, 2001.
[23] A. Bronstein, J. Das, M. Duro, R. Friedrich, G. Kleyner, M. Mueller, S. Singhal, and I. Cohen. Bayesian networks for detecting anomalies in Internet-based services. In International Symposium on Integrated Network Management, 2001.
[24] V. Cardellini, E. Casalicchio, M. Colajanni, and P. Yu. The state of the art in locally distributed Web-server systems. ACM Computing Surveys, pages 263-311, 2002.
[25] V. Cardellini, M. Colajanni, and P. Yu. Request redirection algorithms for distributed Web systems. IEEE Trans. Parallel and Distributed Systems, 14(5):355-368, May 2003.
[26] V. Cardellini, M. Colajanni, and P. S. Yu. Geographic load balancing for scalable distributed Web systems. In Proc. of the 8th International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 2000), San Francisco, CA, USA, Aug./Sept. 2000.
[27] S. Casolari, M. Colajanni, and F. Lo Presti. Runtime state change detector of computer system resources under non stationary conditions. In Proc. of the 17th Int. Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 2009), Sept. 2009.
[28] S. Casolari and M. Colajanni. Short-term prediction models for server management in Internet-based contexts. Elsevier Decision Support Systems, 48, 2009.
[29] M. Castro, M. Dwyer, and M. Rumsewicz. Load balancing and control for distributed World Wide Web servers. In Proc. of the Intl. Conference on Control Applications (CCA 1999), Kohala Coast, HI, Aug. 1999.
[30] J. Challenger, P. Dantzig, A. Iyengar, M. Squillante, and L. Zhang. Efficiently serving dynamic data at highly accessed Web sites. IEEE/ACM Trans. on Networking, 12(2):233-246, Apr. 2004.
[31] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Comput. Surv., 41(3):1-58, 2009.
[32] C. Chatfield. The Analysis of Time Series: An Introduction. Chapman and Hall, 1989.
[33] H. Chen and P. Mohapatra. Overload control in QoS-aware Web servers. Computer Networks, 42(1):119-133, May 2003.
[34] L. Cherkasova and P. Phaal. Session-based admission control: a mechanism for peak load management of commercial Web sites. IEEE Trans. Computers, 51(6):669-685, June 2002.
[35] B. Choi, J. Park, and Z. Zhang. Adaptive random sampling for load change detection. In Proc. of the 16th IEEE International Conference on Communications (ICC 2003), Anchorage, AK, USA, May 2003.
[36] C. K. Chui. An Introduction to Wavelets. Academic Press, 1992.
[37] H.-P. Company. HP OpenView MeasureWare Agent for Windows NT: User's Manual. HP, 1999.
[38] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3), 1995.
[39] M. Dahlin. Interpreting stale load information. IEEE Trans. Parallel and Distributed Systems, 11(10):1033-1047, Oct. 2000.
[40] D. Dasgupta and F. Nino. A comparison of negative and positive selection algorithms in novel pattern detection. In IEEE International Conference on Systems, Man, and Cybernetics, volume 1, pages 125-130, Nashville, TN, 2000.
[41] P. Del Moral. Non linear filtering: Interacting particle solution. Markov Processes and Related Fields, 2(4), 1996.
[42] J. Deshayes and D. Picard. Off-line statistical analysis of change-point models using non parametric and likelihood methods. In Detection of Abrupt Changes in Signals and Dynamical Systems, pages 103-168, 1986.
[43] L. Devroye, L. Gyorfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, 1996.
[44] P. Dinda and D. O'Hallaron. Host load prediction using linear models. Cluster Computing, 3(4):265-280, Dec. 2000.
[45] M. Dobber, R. van der Mei, and G. Koole. A prediction method for job runtimes in shared processors: Survey, statistical analysis and new avenues. Performance Evaluation, 2007.
[46] D. L. Donoho. High-dimensional data analysis: the curses and blessings of dimensionality. American Mathematical Society Conf. Math Challenges of the 21st Century, 2000.
[47] D. L. Donoho, I. Johnstone, G. Kerkyacharian, and D. Picard. Wavelet shrinkage: Asymptopia? Journal of the Royal Statistical Society B, 57(2), 1995.
[48] A. B. Downey and D. G. Feitelson. The elusive goal of workload characterization. Performance Evaluation, 26(4):14-29, 1999.
[49] F. Y. Edgeworth. On discordant observations. Philosophical Magazine, 23(5):364-375, 1887.
[50] R. F. Engle and K. F. Kroner. Multivariate simultaneous generalized ARCH. Econometric Theory, 11:122-150, 1995.
[51] R. L. Eubank and E. Eubank. Nonparametric Regression and Spline Smoothing. Marcel Dekker, 1999.
[52] D. Ferrari and S. Zhou. An empirical investigation of load indices for load balancing applications. In Proc. of the 12th IFIP International Symposium on Computer Performance, Modeling, Measurements and Evaluation (PERFORMANCE 1987), Brussels, BE, Dec. 1987.
[53] G. Fishman and I. Adan. How heavy-tailed distributions affect simulation-generated time averages. ACM Trans. on Modeling and Computer Simulation, 16(2):152-173, Apr. 2006.
[54] S. Floyd and V. Paxson. Difficulties in simulating the Internet. IEEE/ACM Trans. Networking, 9(3):392-403, Aug. 2001.
[55] S. Forrest, P. D'haeseleer, and P. Helman. An immunological approach to change detection: Algorithms, analysis and implications. In Proc. of the 1996 IEEE Symposium on Security and Privacy. IEEE Computer Society, 1996.
[56] G. E. Forsythe, M. A. Malcolm, and C. B. Moler. Computer Methods for Mathematical Computations. Prentice-Hall, 1977.
[57] A. J. Fox. Outliers in time series. Journal of the Royal Statistical Society, 34(3):350-363, 1972.
[58] R. Fujimaki, T. Yairi, and K. Machida. An approach to spacecraft anomaly detection problem using kernel feature space. In Proc. of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 401-410, New York, NY, USA, 2005. ACM Press.
[59] P. Gaffney and M. Powell. Optimal interpolation. Numerical Analysis, 506, 1976.
[60] R. Gnanadesikan and M. B. Wilk. Probability plotting methods for the analysis of data. Biometrika, 55:1-17, 1968.
[61] A. Graps. An introduction to wavelets. IEEE, 1995.
[62] F. E. Harrell. Regression Modeling Strategies: with Applications to Linear Models, Logistic Regression, and Survival Analysis. Springer, 2001.
[63] E. Hartikainen and S. Ekelin. Enhanced network-state estimation using change detection. In Proc. of the 31st IEEE Conf. on Local Computer Networks, Nov. 2006.
[64] D. V. Hinkley. Inference about the change point in a sequence of random variables. Biometrika, 57:1-17, 1970.
[65] D. V. Hinkley. Inference about the change point from cumulative sum-tests. Biometrika, 58:509-523, 1971.
[66] P. Hoogenboom and J. Lepreau. Computer system performance problem detection using time series models. In Proc. of the USENIX Summer 1993 Technical Conference, pages 1-21. USENIX Association, 1993.
[67] H. Hotelling. Analysis of a complex of statistical variables into principal components. J. Educ. Psy., pages 417-441, 1933.
[68] R. Hyndman, A. Koehler, R. Snyder, and S. Grose. A state space framework for automatic forecasting using exponential smoothing methods. International Journal of Forecasting, 18(3), 2002.
[69] J. Jung, B. Krishnamurthy, and M. Rabinovich. Flash crowds and denial of service attacks: characterization and implications for CDNs and Web sites. In Proc. of the 11th International World Wide Web Conference (WWW 2002), Honolulu, HI, May 2002.
[70] H. F. Kaiser. An index of factorial simplicity. Psychometrika, 39:31-36, 1974.
[71] R. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1), 1960.
[72] A. Kamra, V. Misra, and E. M. Nahum. Yaksha: a self-tuning controller for managing the performance of 3-tiered sites. In Proc. of the Twelfth International Workshop on Quality of Service (IWQoS 2004), Montreal, CA, June 2004.
[73] P. Karbhari, M. Rabinovich, Z. Xiao, and F. Douglis. ACDN: a content delivery network for applications. In Proc. of the 21st Int'l ACM SIGMOD Conference, Madison, WI, USA, 2002.
[74] T. Kelly. Detecting performance anomalies in global applications. In Proc. of the 2nd USENIX Workshop on Real, Large Distributed Systems (WORLDS 2005), San Francisco, CA, USA, 2005.
[75] M. Kendall and J. Ord. Time Series. Oxford University Press, 1990.
[76] M. Kirby and L. Sirovich. Application of the Karhunen-Loeve procedure for the characterization of human faces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(1):103-108, Jan. 1990.
[77] G. Kitagawa. Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. Journal of Computational and Graphical Statistics, 5, 1996.
[78] N. Kligiene and L. Telksnys. Methods of detecting instants of change of random process properties. Automation and Remote Control, 44:1241-1283, 1983.
[79] V. Kumar. Parallel and distributed computing for cybersecurity. IEEE Distributed Systems Online, 6(10), 2005.
[80] F. LeGland, C. Musso, and N. Oudjane. An analysis of regularized interacting particle methods in nonlinear filtering. In Proc. of the IEEE European Workshop on Computer-Intensive Methods in Control and Signal Processing, 1998.
[81] D. J. Lilja. Measuring Computer Performance: A Practitioner's Guide. Cambridge University Press, 2000.
[82] S. Ling and W. K. Li. On fractionally integrated autoregressive moving-average time series models with conditional heteroskedasticity. Journal of the American Statistical Association, 92:1184-1194, 1997.
[83] Y. Lingyun, I. Foster, and J. M. Schopf. Homeostatic and tendency-based CPU load predictions. In Proc. of the 17th Parallel and Distributed Processing Symposium (IPDPS 2003), Nice, FR, 2003.
[84] D. Lu, P. Mausel, E. Brondizio, and E. Moran. Change detection techniques. Int. Journal of Remote Sensing, 2004.
[85] Y. Lu, T. Abdelzaher, L. Chenyang, S. Lui, and L. Xue. Feedback control with queueing-theoretic prediction for relative delay guarantees in Web servers. In Proc. of the 9th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS 2003), Charlottesville, VA, May 2003.
[86] S. Luo and G. Marin. Realistic Internet traffic simulation through mixture modeling and a case study. In Proc. of the IEEE Winter Simulation Conference (WSC 2005), Orlando, FL, USA, 2005.
[87] N. A. Macmillan and C. D. Creelman. Detection Theory: A User's Guide. Lawrence Erlbaum Associates, 2005.
[88] S. G. Makridakis, S. C. Wheelwright, and R. J. Hyndman. Forecasting: Methods and Applications, 3rd ed. John Wiley & Sons, 1998.
[89] S. G. Mallat. A theory of multiresolution signal decomposition: The wavelet decomposition. IEEE Trans. on Pattern Analysis and Machine Intelligence, 11(7), 1989.
[90] R. McGill, J. W. Tukey, and W. A. Larsen. Variations of box plots. The American Statistician, 32:12-16, 1978.
[91] D. Menasce and J. Kephart. Autonomic computing. IEEE Internet Computing, 11(1):18-21, Jan. 2007.
[92] D. A. Menasce, V. A. F. Almeida, and L. W. Dowdy. Capacity Planning and Performance Modeling: from Mainframes to Client-Server Systems. Prentice-Hall, Inc., 1994.
[93] M. Mitzenmacher. How useful is old information. IEEE Trans. Parallel and Distributed Systems, 11(1):6-20, Jan. 2000.
[94] D. C. Montgomery. Introduction to Statistical Quality Control. John Wiley and Sons, 2008.
[95] M. N. Nounou and B. Bakshi. On-line multiscale filtering of random and gross errors without process models. American Institute of Chemical Engineers Journal, 45(5), May 1999.
[96] D. L. Olson and D. Delen. Advanced Data Mining Techniques. Springer, 2008.
[97] G. Pacifici, W. Segmuller, M. Spreitzer, and A. Tantawi. Dynamic estimation of CPU demand of Web traffic. In Proc. of the 1st International Conference on Performance Evaluation Methodologies and Tools (VALUETOOLS 2006), Pisa, IT, Oct. 2006.
[98] E. S. Page. Estimating the point of change in a continuous process. Biometrika, 44:248-252, 1957.
[99] V. S. Pai, M. Aron, G. Banga, M. Svendsen, P. Druschel, W. Zwaenepoel, and E. M. Nahum. Locality-aware request distribution in cluster-based network servers. In Proc. of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 1998), San Jose, CA, Oct. 1998.
[100] R. Pandey, J. F. Barnes, and R. Olsson. Supporting quality of service in HTTP servers. In Proc. of the ACM Symposium on Principles of Distributed Computing, Puerto Vallarta, MX, June 1998.
[101] A. Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw-Hill, 1984.
[102] D. B. Percival and A. T. Walden. Wavelet methods for time series analysis. Cambridge University Press, 2000.
[103] V. V. Phoha. The Springer Internet Security Dictionary. Springer-Verlag, 2002.
[104] D. J. Poirier. Piecewise regression using cubic spline. Journal of the American Statistical Association, 68(343):515–524, 1973.
[105] R. W. Preisendorfer. Principal component analysis in meteorology and oceanography. Elsevier, 1988.
[106] M. Rabinovich, S. Triukose, Z. Wen, and L. Wang. DipZoom: the internet measurement marketplace. In Proc. of 9th IEEE Global Internet Symposium, Barcelona, ES, 2006.
[107] P. Ramanathan. Overload management in real-time control applications using (m,k)-firm guarantee. Performance Evaluation Review, 10(6), Jun. 1999.
[108] P. J. Rousseeuw and A. M. Leroy. Robust regression and outlier detection. John Wiley and Sons, 1987.
[109] C. Runge. Über empirische Funktionen und die Interpolation zwischen äquidistanten Ordinaten. Zeitschrift für Mathematik und Physik, 1901.
[110] A. Sang and S. Li. A predictability analysis of network traffic. In Proc. of the 19th Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM2000), Tel Aviv, ISR, Mar. 2000.
[111] M. Satyanarayanan, D. Narayanan, J. Tilton, J. Flinn, and K. Walker. Agile application-aware adaptation for mobility. In Proceedings of the 16th ACM Intl. Symposium on Operating Systems Principles (SOSP 1997), Saint-Malo, France, Oct. 1997.
[112] A. Schuster. On the investigation of hidden periodicities with application to a supposed 26 day period of meteorological phenomena. Terrestrial Magnetism and Atmospheric Electricity, 3:13–41, 1898.
[113] S. S. Shapiro and M. B. Wilk. An analysis of variance test for normality (complete samples). Biometrika, 52:591–611, 1965.
[114] W. A. Shewhart. Economic control of quality of manufactured product. American Society for Quality, 1980.
[115] L. Sirovich, K. S. Ball, and L. R. Keefe. Plane waves and structures in turbulent channel flow. Phys. Fluids, pages 2217–2226, 1990.
[116] S. Sivasubramanian, G. Pierre, and M. Van Steen. Replication for web hosting systems. ACM Computing Surveys, 36(3):291–334, Aug. 2004.
[117] D. Snyder. Online intrusion detection using sequences of system calls. M.S. thesis, Department of Computer Science, Florida State University, 2001.
[118] C. Spence, L. Parra, and P. Sajda. Detection, synthesis and compression in mammographic image analysis with a hierarchical image probability model. In IEEE Workshop on Mathematical Methods in Biomedical Image Analysis, volume 3, Washington, DC, USA, 2001. IEEE Computer Society.
[119] J. T. Spooner, M. Maggiore, R. Ordonez, and K. M. Passino. Stable Adaptive Control and Estimation for Nonlinear Systems: Neural and Fuzzy Approximator Techniques. John Wiley and Sons, 2002.
[120] T. H. Spreen, R. E. Mayer, J. R. Simpson, and J. T. McClave. Forecasting monthly slaughter cow prices with a subset autoregressive model. Southern Journal of Agricultural Economics, 11(1), 1979.
[121] J. A. Stankovic. Simulations of three adaptive, decentralized controlled, job scheduling algorithms. Computer Networks, 8(3):199–217, June 1984.
[122] N. Tran and D. A. Reed. Automatic ARIMA time series modeling for adaptive I/O prefetching. IEEE Trans. Parallel and Distributed Systems, 15(4):362–377, 2004.
[123] D. Y. Ts'o, R. D. Frostig, E. E. Lieke, and A. Grinvald. Functional organization of primate visual cortex revealed by high resolution optical imaging. Science, pages 417–420, 1990.
[124] S. Vazhkudai and J. Schopf. Predicting sporadic grid data transfers. In Proc. of the 11th IEEE Symposium on High Performance Distributed Computing (HPDC2002), Edinburgh, GBR, Jul. 2002.
[125] D. F. Vysochanskij and Y. I. Petunin. Justification of the 3σ rule for unimodal distributions. Theory of Probability and Mathematical Statistics, 21:25–36, 1980.
[126] A. S. Willsky and H. L. Jones. A generalized likelihood ratio approach to the detection and estimation of jumps in linear systems. IEEE Trans. on Automatic Control, 21(1), 1976.
[127] G. Wolberg and I. Alfy. Monotonic cubic spline interpolation. In CGI '99: Proceedings of the International Conference on Computer Graphics, Washington, DC, USA, 1999. IEEE Computer Society.
[128] R. Wolski, N. T. Spring, and J. Hayes. The network weather service: a distributed resource performance forecasting service for metacomputing. Future Generation Computer Systems, 15(5–6):757–768, 1999.
[129] D. H. Zhou and P. M. Frank. Strong tracking filtering of nonlinear time-varying stochastic systems with coloured noise: application to parameter estimation and empirical robustness analysis. International Journal of Control, 65:295–307, 1996.