Università degli Studi di Modena e Reggio Emilia
Facoltà di Scienze Matematiche, Fisiche e Naturali
Corso di Laurea Magistrale in Informatica
TESI DI LAUREA
Stochastic Analyses for Internet Data Centers Management
Advisor: Prof. Michele Colajanni
Co-advisor: Ing. Sara Casolari
Candidate: Dott.ssa Stefania Tosi
Academic Year 2009/2010
“On fait la science avec des faits, comme on fait une maison avec des pierres ; mais une accumulation de faits n’est pas plus une science qu’un tas de pierres n’est une maison.”

“Science is made of data as a house is made of stones. But a mass of data is no more science than a pile of stones is a house.”

Jules-Henri Poincaré
Contents

1 Introduction
2 Whole system analysis
  2.1 Problem definition
  2.2 Multi-phase methodology
3 Stochastic analyses of system resource measures
  3.1 Deterministic component
    3.1.1 Correlogram
    3.1.2 Periodogram
  3.2 Spike component
    3.2.1 Sigma threshold test
  3.3 Noise component
    3.3.1 Color of noise
    3.3.2 Distribution of noise
4 PCA-based technique on collected data
  4.1 Data collection
    4.1.1 Heterogeneous resources of a single server
    4.1.2 Homogeneous resources of different servers
  4.2 Principal Component Analysis
    4.2.1 PCA on heterogeneous resources of a single server
    4.2.2 PCA on homogeneous resources of different servers
  4.3 Analyzing eigenresources
    4.3.1 A taxonomy of eigenresources
    4.3.2 Understanding eigenresources
  4.4 Extraction of representative eigenresources
5 Tracking models
  5.1 Trend extraction
  5.2 Problem definition
    5.2.1 Interpolation techniques
    5.2.2 Smoothing techniques
  5.3 Interpolation estimators
    5.3.1 Simple Regression (SR)
    5.3.2 Cubic Spline (CS)
  5.4 Smoothing estimators
    5.4.1 Simple Moving Average (SMA)
    5.4.2 Exponential Weighted Moving Average (EWMA)
    5.4.3 Auto-Regressive (AR)
    5.4.4 Auto-Regressive Integrated Moving Average (ARIMA)
  5.5 Quantitative performance analysis
    5.5.1 Computational cost
    5.5.2 Estimation quality
6 Forecasting models
  6.1 Time series prediction
  6.2 Prediction models
    6.2.1 Simple Regression (SR)
    6.2.2 Cubic Spline (CS)
    6.2.3 Exponential Weighted Moving Average (EWMA)
    6.2.4 Holt’s Model (Holt’s)
    6.2.5 Auto-Regressive (AR)
    6.2.6 Auto-Regressive Integrated Moving Average (ARIMA)
  6.3 Quantitative analysis
    6.3.1 Computational cost
    6.3.2 Prediction quality
  6.4 Performance analysis
7 Runtime models
  7.1 State change detection
    7.1.1 Problem definition
    7.1.2 Wavelet Cusum state change detection model
    7.1.3 Other state change detection models
    7.1.4 Quantitative analysis
    7.1.5 Performance analysis
    7.1.6 On-line state change detection for IDC management
  7.2 Anomaly detection
    7.2.1 Problem definition
    7.2.2 Point anomaly detection
    7.2.3 Point anomaly detection techniques
    7.2.4 On-line anomaly detection for IDC management
    7.2.5 Collective anomaly detection
    7.2.6 Collective anomaly detection techniques
    7.2.7 On-line collective anomaly detection for IDC management
    7.2.8 Conclusions
8 Related work
9 Conclusions
List of Figures

2.1 The proposed multi-phase framework for whole system analysis.
3.1 Examples of monitored time series.
3.2 Example of a trend time series.
3.3 Example of a seasonal time series.
3.4 Example of a trend and seasonal time series.
3.5 Example of a time series with a “hidden” seasonality and the corresponding correlogram.
3.6 Example of periodogram.
3.7 Example of a spike component.
3.8 Example of 3σ threshold test for spike component.
3.9 Example of 1σ and 5σ threshold tests for spike component.
3.10 Example of a noise component.
3.11 Example of autocovariance function of a white noise time series.
3.12 Example of autocovariance function of a colored noise time series.
3.13 Examples of Q-Q plots.
4.1 First phase of the multi-phase framework.
4.2 Examples of heterogeneous resource measures of a database server.
4.3 Examples of homogeneous resource measures - CPU utilization.
4.4 Second phase of the multi-phase framework.
4.5 Example of 1D projection of 2D points in the original space.
4.6 Example of eigenresource and corresponding principal components.
4.7 Example of a scree plot.
4.8 Scree plots for the resource time series of a database server.
4.9 PCA resulting principal eigenresources on heterogeneous resources of a database server.
4.10 Scree plots for CPU utilization time series.
4.11 PCA resulting principal eigenresources on homogeneous resources (CPU utilizations) of a database server.
4.12 Examples of the three types of eigenresources.
4.13 Classifying eigenresources by using three statistical tests.
4.14 Deterministic eigenresources and corresponding correlograms.
4.15 Example of correlogram showing a multi-seasonal behavior in a two weeks resource sampling.
4.16 Spike eigenresources and corresponding sigma threshold tests.
4.17 Noise eigenresources and corresponding autocovariance functions.
4.18 Representative eigenresources.
4.19 Representative eigenresources with spike and noise subclasses.
5.1 Third phase: modeling the system behavior in the past.
5.2 Trend estimation techniques classification.
5.3 Graphical example of linear interpolation techniques.
5.4 Graphical example of non-linear interpolation techniques.
5.5 Graphical examples of smoothing techniques.
5.6 Example of time series and approximate confidence interval.
5.7 Trend curves with respect to the approximate confidence interval.
6.1 Third phase: forecasting the system behavior in the future.
6.2 Example of time series prediction and corresponding prediction interval (c = 0.95).
6.3 Raw and treated CPU utilization time series.
6.4 Holt’s prediction model, k = 10, on raw data set.
6.5 Holt’s prediction model, k = 10, on trend estimation.
7.1 Third phase: analyzing the system behavior in the present.
7.2 The problem of detecting relevant state changes.
7.3 F-measures - ρe = 0.
7.4 F-measures.
7.5 Qualitative evaluation - Step profile - σe = 0.6 and ρe = 0.3.
7.6 Qualitative evaluation - Multi-step profile - σe = 0.9 and ρe = 0.3.
7.7 Relevant state changes on the representative deterministic eigenresource.
7.8 Performance evaluation of state change detection models on representative deterministic eigenresource.
7.9 Example of point anomalies.
7.10 A box plot for a univariate data set.
7.11 Performance evaluation of 5σ test for point anomaly detection on the representative spike eigenresource.
7.12 Box plot of representative spike eigenresource.
7.13 Performance evaluation of box plot rule for point anomaly detection on the representative spike eigenresource.
7.14 Example of collective anomaly.
7.15 Prediction results on the representative deterministic eigenresource.
7.16 Performance evaluation of collective anomaly detection model on the representative deterministic eigenresource.
List of Tables

4.1 System monitor’s syntax and corresponding resource measure.
4.2 Occurrence of eigenresource types in order of importance.
4.3 Contributions of eigenresource types.
5.1 CPU time (msec) for the computation of a trend value.
6.1 CPU time (msec) of prediction models and policies.
6.2 PE of the prediction policies, k = 10.
6.3 PI of the prediction policies, k = 10.
7.1 Recall - ρe = 0.
7.2 Precision - ρe = 0.
Acknowledgements

My first thanks can only go to Professor Colajanni, for giving me the chance to realize this dream. I am grateful to him for the enormous opportunities he has granted me over these two years, for his sincere advice, and for his disarming ability to convey a lesson in every sentence. I thank him for giving me trust and courage, and for passing on to me a fraction of his boundless passion for research. Thank you.

I thank Sara, for taking me by the hand and accompanying me step by step, for her advice and her precious teachings. I learned more by her side, admiring her work, than in all the years spent on books. I discovered in her a patient teacher, an amusing colleague, but above all a special friend. Thank you.

With particular affection I thank Mirco, Michele, Claudia, Riccardo, Mauro and Luca, who brought a touch of joy and lightheartedness to the many days in the office. I thank them for welcoming me as one of their own into the research group and for making me discover that, after all, engineers are not so bad =). Thank you.

With just as much affection I thank all my university classmates: those I lost, those I found, and those who have always been at my side along this path of study. If, digging through my memory, only beautiful recollections resurface, it is entirely thanks to them. Thank you.

To all my friends who, near or far, have stood by me over these years. To those who listened to me, those who put up with me, and anyone who brightened my day with a gesture or a smile. To those who showed they were proud of me, those who gave me sincere advice, and anyone who gave me the courage to believe in myself and in my dreams. Thank you.

The most heartfelt thanks, however, I reserve for three extraordinary people. First of all, my mum and dad. For always being there, day after day, with their attention, their questions, their advice. For supporting my choices. For sharing and believing in my dreams. For never ceasing to show enthusiasm at every milestone of mine. Seeing the joy in their eyes is the greatest satisfaction of any success of mine. Thank you.

The most precious thanks of all, though, I want to address to Andrea, because he is the person I admire most in the world, and there is no greater happiness than making him proud of his sister. Thank you.
Chapter 1
Introduction
The advent of large Internet Data Centers providing any kind of service through Web-related technologies has changed the traditional processing paradigm. These modern infrastructures must accommodate varying demands for different types of processing within certain time constraints. Overall performance analysis and runtime management in these contexts are becoming extremely complex, because they are a function not only of individual applications, but also of their interactions as they contend for processing and I/O resources, both internal and external.
The majority of critical Internet-based services run on shared application infrastructures that have to satisfy scalability, adaptability and availability requirements, and have to avoid performance degradation and system overload. Managing these systems requires a large set of runtime decision algorithms oriented to load balancing and load sharing [9, 24, 99], overload and admission control [33, 34, 52, 91, 93], and job dispatching and redirection [25]. The recently widespread paradigm of Utility Computing further increases the necessity for runtime management algorithms that take important actions on the basis of present and future load conditions of the system resources.
Existing models, methodologies and frameworks commonly applied to other contexts are often inadequate to efficiently support the runtime management of present and future Internet-based systems, because of two main problems.
• The large majority of the literature related to Internet-based systems management proposes decision systems relying on the modeling of server resource usages. However, the stochasticity and randomness of these processes make it hard to model their behavior and prevent the use of parametric techniques for this purpose. Unlike existing models and schemes, which are oriented to evaluating system performance by extracting the needed parameters from the usage traces, in this work we rely on stochastic analyses making no modeling assumptions.
• Most available algorithms and mechanisms for runtime decisions evaluate the performance of information infrastructures or systems through the periodic sampling of resource measures obtained from the monitoring of each server in isolation, and use these values (or simple combinations of them) as a basis for determining the present and future system condition. However, a wide range of important problems faced by computer researchers today (including computer engineering, system design, anomaly detection, change detection and capacity planning) require modeling and analysis of system behaviors considering all resource measures of all servers simultaneously.
In general, whole resources analysis - that is, modeling the metrics of all the resources of a node simultaneously - and whole servers analysis - that is, modeling the behaviors of all the servers of a system simultaneously - are difficult objectives, made harder by the fact that modeling time series behavior for a single resource of a single server is itself a complex task. Whole system analysis - arising from the synthesis of whole resources and whole servers analyses - therefore remains an important and unmet challenge.
One way to address the problem of whole resources analysis is to recognize that the behavior observed for different resources of a single server is not independent, but is in fact determined by a common external workload and typical resource features. The superimposition of different resources, as determined by their features, gives rise to the overall behavior of a server.
At a higher level, if we consider an infrastructure composed of several servers, its behavior can be evaluated as the join of the single server behaviors, overlapped on the basis of the routing matrix and internal policies of the infrastructure. Thus, instead of studying servers through their single resources and systems through their single servers, a more direct and fundamental focus for whole system study is the analysis of the behavior of the resources set of the servers set of an infrastructure.
However, analyzing the whole resources behavior and its flow inside even a simple infrastructure suffers from several difficulties. The first challenge is that there are several resources that can be monitored on a single server, each one with its typical features and its own impact on the overall behavior of the server. We expect that some resources influence the final performance of the server more than others, and that some could be ignored in the analysis of the whole system capability. For this reason, a first imposing obstacle is the ranking of system resource measures in order of importance and the determination of their weights on the basis of their contribution to the overall behavior of a server node.
Even if this problem is itself a huge challenge, linking different servers together into even a simple infrastructure exponentially increases the complexity of solving the whole system problem. Once the resource measures collected on the different servers have been ranked, the next challenge is the analysis of server node interactions inside the whole infrastructure. Thus, every system analysis requires knowledge of the routing policies of the system and the main characteristics of its constituent servers.
Another central problem one confronts in facing whole system analysis is the so-called “curse of dimensionality” [46], coming from the fact that the servers in an infrastructure can number in the hundreds. This means that resource interactions inside the system form a high dimensional multivariate structure. For example, even a moderate-size infrastructure may be composed of several dozens of servers, each one characterized by tens of resource metrics; the resulting set of time series has thousands of dimensions. The high dimensionality of the system resources matrix is in fact another main source of difficulty in addressing whole system analysis, and it tends to become ever more prominent with the spread of large-size Internet Data Centers.
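To make the scale concrete, the following minimal sketch builds the kind of resource matrix such an infrastructure produces. The sizes are hypothetical (they are not measurements from the thesis testbed), chosen only to show how quickly the dimensionality grows:

```python
import numpy as np

# Hypothetical sizes for a moderate Internet Data Center.
n_servers = 50    # several dozens of servers
n_metrics = 20    # tens of resource metrics per server (CPU, memory, I/O, ...)
n_samples = 1440  # e.g., one day of measurements at 1-minute intervals

# Each monitored resource contributes one time series, i.e., one column
# of the system resources matrix (here filled with placeholder values).
rng = np.random.default_rng(0)
X = rng.random((n_samples, n_servers * n_metrics))

# The multivariate structure has n_servers * n_metrics dimensions:
# already a thousand-dimensional set of time series.
print(X.shape)
```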
The aim of this work is to face whole system analysis in the context of modern Internet Data Centers. We propose a multi-phase methodology aiming to solve all the previously mentioned problems. It analyzes the system resource measures of the Internet Data Center servers and collects their main features into a representative vision of the whole system state. This vision provides an exhaustive representation of the Internet Data Center performance that permits whole system analysis and helps management decisions.
The rest of the work is organized as follows. Chapter 2 discusses the motivations of this research and the high dimensionality and stochasticity of resource measures in a system, providing the necessary foundations of the multi-phase methodology. We give a brief description of the four phases of the proposed approach. In Chapter 3 we carry out a first analysis of the typical behavior of resource measures monitored on the servers of Internet-based systems and the management problems due to their highly variable and stochastic behavior. In Chapter 4 we present the steps taken to construct time series from the monitoring of the system resources of an Internet Data Center used as testbed. We then apply the proposed technique to the collected data in order to evaluate its ability to solve the whole system analysis problem. We elaborate on the notion of representative eigenresources and show how they can be interpreted, understood and used to support some fundamental tasks characterizing runtime management decisions in Internet Data Centers. Chapters 5 to 7 examine some classic problems characterizing Internet Data Centers, concerning their past, present and future behaviors. In this work, we consider trend extraction, state change detection, point and collective anomaly detection and time series forecasting as examples of system state evaluations, and demonstrate how these applications can take advantage of the eigenresource representations for whole system analysis and management at runtime. Chapter 8 compares our contribution with the state of the art. Concluding remarks and our ongoing work are presented in Chapter 9.
Chapter 2
Whole system analysis
This chapter discusses the high complexity of whole system analysis and proposes
a novel methodology to face it.
2.1 Problem definition
Many application contexts, ranging from financial to social applications, or from hydrological phenomena to information and computer systems, base their management decisions on information coming from the monitoring of fundamental processes, such as price behavior, population growth, temperature fluctuations or system performance/utilization. Efficient process management requires suitable algorithms able to make runtime decisions on the basis of the current and past behaviors of the time series coming from monitoring systems.
Over the last twenty years, there has been a significant increase in the number of real problems concerned with questions such as fault detection and diagnosis, state change and anomaly detection, safety and quality control, prediction of future events, and estimation of expected capacity needs. These problems result from the increasing complexity of most novel processes, the availability of sophisticated sensors in both the technological and natural worlds, and the existence of sophisticated information processing systems.
Central to all the above issues is understanding the characteristics of the process, evaluating its behavior and generating a reliable representation of it, in order to define the past, present and future state of the process. The key difficulty is to generate a representative view of process behavior that collects and assembles all (and only) the relevant characteristics of the process.
Let us give a biological example, considering the health of a patient as our context of interest. We can monitor several biological processes of a human being, such as respiration, digestion, response to stimulus, interaction between organs and so on, and compute a health index for each of them. Each index gives an idea of the good or bad condition of the person with respect to a particular process. For example, a low index for the respiration process means that the person has trouble breathing. However, looking at this index on its own does not give an overall idea of the health of the person: a high respiration index does not mean that he feels good! In the same way, a low respiration index cannot give an idea of the effective illness of the patient if it is not associated with some information about heart power, intercostal muscles, mouth and nose pathways, etc. Information about not strictly related processes, such as blood pressure, diabetes or asthma conditions, is also useful to understand the breathing condition of the person. Only through the aggregation of all available information in a suitable way can we get a clear idea about the health of the patient.
Similarly, an information system can be seen as a very simple human being, with its several organs - the server nodes of the system -, its various processes - the resource usages -, and its overall health - typically expressed in terms of system performance. The system performance arises from the contributions of different resource usages on different servers, just as the health of a person is given by all the biological processes of all the organs of the body. Thus, information system performance arises from the superimposition of the behaviors of different resources on different servers.
For this reason, a thorough understanding of component features is essential for modeling system behavior, and for addressing a wide variety of problems including computer engineering, system design, capacity planning [92], forecasting and anomaly detection [66]. All these problems in computer and information systems require a reliable and exhaustive representation of the state of their resources. For this reason, the most important Key Performance Indicators (KPIs) are continuously monitored, and the data are passed to statistical models that decide whether new management decisions have to be taken or not.
All commercial methods and software for measuring the performance of information systems or infrastructures consider different KPIs and use them to estimate partial visions of the principal components of the entire system/infrastructure. Analyzing every KPI by itself allows us to get an idea of the capability of the component the KPI refers to, and makes it possible to understand how it performs. This helps system administrators better understand the behavior of that server and better plan for its usage. In order to get a global vision of the principal system components and to decide whether or not to allocate resources, we should apply a similar method to each server of the infrastructure.
This approach suffers from many drawbacks.
• In modern Internet Data Centers there may be hundreds of servers whose resources are monitored simultaneously. This means modeling thousands of KPIs, making whole system analysis a very expensive task.

• Since they refer to highly variable and stochastic processes, system resource metrics are hard to model through the parametric techniques typically used by state-of-the-art decision mechanisms.

• Each KPI refers only to a single resource usage of one server, giving a measure of the relative performance of a single hardware/software component of the entire system.
Analyzing every KPI on its own, besides being extremely time consuming, gives a reductive and incomplete vision of the whole system, since it does not take into account the interactions of components under a software load. The reasons are obvious: the whole system’s performance is affected not only by the behavior of each single resource, but also by the resulting interaction of resources on several different servers combined together. This means that the overall system performance is given by the superimposition of several contributions that, in some instances, are not independent.
As a consequence, all present solutions for measuring system performance, characterizing system behavior and detecting state changes and anomalous events lack efficiency and risk leading to erroneous management decisions, since they rest upon partial and deficient representations of the entire system state.
2.2 Multi-phase methodology
To address most of the raised issues, we propose a novel multi-phase methodology that gives a representative vision of the whole system behavior, able to characterize the majority of the resource variability of all the servers of the Internet Data Center.
The performance of an Internet-based system depends heavily on the characteristics of its load and on the impact the load has on the different resources of the constituent servers. Starting from a test phase, it is possible to rely on some data samples (time series), collected from the servers’ log files and referring to all the monitored resources. These data can be used to create a model that describes, or approximates, the actual resource behavior of the server. From this model, one can predict the impact of a load change on its resource usages. Much of the work done in this direction focuses on single resource behavior characterization, which considers one server at a time, creating a model from observation (e.g., a stochastic model) and extracting the needed parameters from the resource time series.
In this work, we propose an alternative approach for whole system analysis. Since a single resource time series analysis is itself a complex task, modeling the whole system’s behavior is even more difficult. The reason is that all the information coming from the monitoring of the resource measures of the servers of an Internet-based system is stochastic and forms a high dimensional structure.
However, one can suppose that some system resources share common behaviors as a function of time. For example, several resources could share the same periodic behavior, due to steady peaks of utilization during business hours and low utilizations during the lunch hour and other non-business hours in the evening and on weekends. On the other hand, some resources could present simultaneous short bursts (or spikes) of high demand, called flash crowds, often triggered by a special and usually unexpected event.
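These shared behaviors can be illustrated by synthesizing a resource series as a daily periodic component plus an occasional spike. This is a minimal sketch with made-up parameters, not data from the thesis testbed:

```python
import numpy as np

rng = np.random.default_rng(42)
t = np.arange(1440)  # one day of 1-minute samples (hypothetical rate)

# Shared periodic behavior: utilization peaking during business hours.
seasonal = 50.0 + 30.0 * np.sin(2 * np.pi * t / 1440)

# Flash crowd: a short burst of high demand at an unexpected moment.
spikes = np.zeros_like(seasonal)
spikes[700:710] = 80.0

# Background measurement noise.
noise = rng.normal(0.0, 2.0, size=t.size)

series = seasonal + spikes + noise
```

Several such synthetic series sharing the same `seasonal` component would be strongly correlated, which is exactly what makes a lower-dimensional representation plausible.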
These observations lead us to believe that the high dimensional structure of resource time series, which appears to be complex, could be governed by a small set of features (e.g., correlated periodicity, simultaneous demand spikes, common underlying noise) and, therefore, could be represented approximately by a lower-dimension representation.
When presented with the need to analyze a high dimensional structure, a common and powerful approach is to seek an alternate lower-dimensional approximation of the structure that preserves its important properties. It can often be the case that a structure appearing complex because of its high dimension is governed by a small set of independent variables, and so can be well approximated by a lower dimensional representation. Dimension analysis and dimension reduction techniques attempt to find these simple variables and can therefore be a useful tool to understand the original structure.
The most popular technique to analyze high dimensional structures is the Principal Component Analysis (PCA) [67]. Given a high dimensional object and its associated coordinate space, PCA finds a new coordinate space which is the best one to use for dimension reduction of the given object. Once the object is placed into this new coordinate space, projecting the object onto a subset of axes can be done in a way that minimizes the error. When a high dimensional object can be well approximated in a smaller number of dimensions, we refer to the smaller number of dimensions as the object's intrinsic dimensionality.
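As an illustration of intrinsic dimensionality, the following sketch (a hypothetical example on synthetic data, not the thesis data set, assuming NumPy) builds one hundred "resource measures" that are all driven by just two latent signals, then counts how many principal components are needed to capture 95% of the total variance:

```python
import numpy as np

def pca_intrinsic_dim(X, energy=0.95):
    """Number of principal components capturing a given fraction of variance."""
    Xc = X - X.mean(axis=0)                       # center each resource measure
    # Singular values of the centered data give the principal variances.
    s = np.linalg.svd(Xc, compute_uv=False)
    var = s**2 / np.sum(s**2)
    return int(np.searchsorted(np.cumsum(var), energy) + 1)

rng = np.random.default_rng(5)
t = np.linspace(0, 14 * np.pi, 2016)
# 100 synthetic "resource measures" mixing only 2 latent signals plus small noise.
latent = np.stack([np.sin(t), np.cos(t / 7)])
X = rng.normal(0, 1, (100, 2)) @ latent + rng.normal(0, 0.05, (100, t.size))

k = pca_intrinsic_dim(X.T)  # rows: time samples, columns: resource measures
# Although the data set has 100 dimensions, its intrinsic dimensionality is 2.
```

The apparently 100-dimensional structure is well approximated by two axes of the new coordinate space, which is exactly the situation PCA is meant to expose.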
Finding the intrinsic dimensionality of the whole system analysis is the initial purpose of the approach presented in this work. In this study, we use PCA and apply some analysis steps to an explicative data set extracted from Internet Data Center servers. By determining whether the whole system has low intrinsic dimensionality, it is possible to create a workload model that is described only by a small set of features, such as deterministic, noisy and spiky. We then use the extracted features for several purposes: detecting relevant state changes, signaling punctual anomalies and identifying uncommon patterns in the expected server behavior. Several analyses other than those presented in this work can benefit from this study. They can inform decision systems directed to the runtime management of Internet Data Centers.
We now give a brief overview of the multi-phase methodology proposed in this study. It aims to separate the complex whole system analysis and management into four main phases, as described in Figure 2.1.
Let us outline the different phases that will be described in the following chapters:
Figure 2.1: The proposed multi-phase framework for whole system analysis.
1. Data collection
In this work, we analyze stochastic time series coming from system monitors related to the most relevant internal resource measures of the servers of an Internet Data Center. They form a high dimensional data set of stochastic measurements, that is comprehensive but not representative of the whole system under examination.
2. PCA-based technique
This phase analyzes the collected time series in order to extract a representative vision of the monitored system. We apply the Principal Component Analysis to the collected time series, with the aim to obtain an exhaustive representation of them through a smaller amount of relevant information, that we call eigenresources. The decomposition is followed by an analysis of eigenresource characteristics and an investigation of their typical behavior. We see that, in the context of Internet Data Center resource measures, eigenresource components are easy to map into three behavioral classes: deterministic, noise and spike. In light of what we discovered in our application context, the PCA-based technique ends with the assembling of three representative eigenresources, one for each class. These representations collect the contributions of all the server nodes under examination and therefore compose a representative vision of the entire system. This vision captures the entire intrinsic dimensionality of the whole system analysis problem, by reducing its complexity to only three dimensions.
3. System state evaluation
This phase proposes some mechanisms that use the previous representations as a basis for evaluating important information about the past (e.g., trend extraction), the present (e.g., state change detection, point anomaly and collective anomaly detection) and the future (e.g., time series prediction) behavior of the Internet Data Center state. These evaluations refer to representative views of the system, collecting the contributions of all the resource measures of all the servers of the architecture, and thus carrying reliable information about the previous, current and expected behaviors of the Internet Data Center.
4. System management
We take advantage of the PCA results and the application outcomes to manage the system through some runtime decisions typical of Internet Data Centers. Applying management algorithms to representative eigenresources allows system administrators to make appropriate and reliable operating decisions. In this way, the complex problem of the whole system analysis and management of the considered Internet Data Center is reduced to the investigation of a small set of dimensions, still carrying the main characteristics of the entire Data Center.
In this work, we show the results of applying the multi-phase methodology to a typical Internet Data Center. Our purpose is to transform the huge collected data set of measurements related to its servers into a small set of representative information, which constitutes the basis for runtime decisions/actions.
In the examined context, the multi-phase methodology produces interesting benefits in understanding Internet Data Center characteristics, evaluating its past, present and future performance, and making runtime decisions for the whole Data Center administration. In particular, it:
• collects into three representative visions the relevant information about the behavior of the Internet Data Center, giving an exhaustive characterization that includes all the contributions of the system's server nodes;
• reduces to only three the number of time series to analyze in order to manage the overall behavior of the Internet Data Center, in spite of its complexity, its purposes and the number of servers;
• reveals behavioral trends in server activities, isolating their deterministic components from deviations due to noise or spikes;
• helps anomaly detection, since statistical tests for outlier identification can be applied to the spiky features;
• makes it possible to choose suited detection models, on the basis of the different typologies of change that may occur in the state of the system and the peculiar characteristics of the representative visions coming from the PCA-based technique;
• allows the forecasting of the behavior of the system, using its deterministic components to predict the future activity of the entire set of servers;
• enables the detection of unexpected events in system activity and the signaling of anomalies in typical server behavior.
A detailed application of the multi-phase methodology for Internet Data Center analysis and management is addressed in the following chapters. After an initial overview of stochastic resource measures in the next chapter, we describe the first two phases of the methodology in Chapter 4, and the third phase in Chapters 5, 6 and 7, which analyze the past, future and present behavior of the Internet Data Center, respectively. All the analyses are preparatory for a system management phase, which is only hinted at in this work but which we intend to investigate in depth in future works.
Chapter 3
Stochastical analyses of system resource measures
Before applying the proposed multi-phase methodology, let us introduce some stochastical analyses that turn out to be useful in our ongoing work. In this chapter, we propose a detailed analysis of the statistical behavior of the most important system resource measures of an Internet Data Center and we review stochastic methods that are useful for their analysis.
We consider these measures (or samples) as stochastic data sets that are continuously provided by system monitors. The term stochastic is due to the non-deterministic behavior of the monitored resources and the fact that the Internet Data Center's state is determined both by the resources' predictable actions and by a random element. The term data set refers to an ordered collection of i samples, starting at time t1 and covering events up to the current time ti. We denote the data set by Xi = [x1, x2, . . . , xi−1, xi], where the j-th element xj, with 1 ≤ j ≤ i, denotes the value of the resource measure of interest at time tj.
Since the data set's samples are measured at successive times spaced at uniform time intervals, we refer to the data set Xi as a time series. Figure 3.1 reports the typical behavior of two time series obtained from the monitoring of two internal resources (CPU and Memory utilization) of an Internet-based server. System monitors capture resource measures every 5 minutes during an observation interval lasting a week. Seven days of monitoring build up a time series of i = 2016 values.
Figure 3.1: Examples of monitored time series. (a) CPU utilization; (b) Memory usage.
The properties and the characteristics of the time series related to the system resources of an Internet Data Center require in-depth investigations to achieve a useful interpretation and an adequate positioning of the resource states with respect to the capacity of the system. Due to the stochastic behavior of the monitored time series, we do not deal with only one possible reality of how the data set might evolve over time; rather, in a stochastic time series there is some indeterminacy in its future evolution, described by probability distributions. This means that, even if the initial condition (or starting point) is known, there are many paths the resource measures might follow, with some paths more probable than others.
One way to simplify the analysis of stochastic time series and the understanding of their evolution is to separate the time series into their constituent components. Analyzing and studying one feature at a time helps in reducing the complexity of managing random time series and in applying suitable models to the components' characteristics. As in most time series analyses, the system resource measures consist of:
• a deterministic component - usually a set of systematic trend and periodic patterns;
• a spike component - usually caused by isolated and occasional bursts and dips;
• a noise component - usually making the pattern difficult to identify.
In the following sections, we analyze the three main components and present several statistical analyses useful for their investigation.
3.1 Deterministic component
The deterministic component represents data set patterns appearing to be rela-
tively predictable. Most time series patterns can be described in terms of two
basic classes of components:trendandseasonality.
Trend pattern
Thetrend is one of the dominant feature of many time series. It represents a
general systematic linear or (most often) non-linear component that changes
over time and does not repeat or at least does not repeat within the time
range captured by the data. Such a trend can be upward or downward, it can
be step or not, and it can be exponential or approximately linear.
In Figure 3.2 we give an example of a time series with an upward linear trend. We can observe a manifest increase of sample values in time and random disturbances that do not affect the prevalent linear growth of the time series.
Figure 3.2: Example of a trend time series.
It is worth noting that not all time series show a trend component. However, when the trend component in the time series is present and monotonic (consistently increasing or decreasing), trend estimation is a useful method to interpret the data set, because it complements the seasonality statistics to fully understand the deterministic component characterizing the stochastic data set.
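As a minimal sketch of trend estimation (hypothetical synthetic data, assuming NumPy), a least-squares fit of a degree-1 polynomial recovers a linear trend and yields a detrended residual:

```python
import numpy as np

rng = np.random.default_rng(6)
t = np.arange(2016)  # one week of 5-minute samples
x = 0.005 * t + rng.normal(0, 1.0, t.size)  # upward linear trend plus noise

# Least-squares fit of a degree-1 polynomial estimates the trend.
slope, intercept = np.polyfit(t, x, deg=1)
detrended = x - (slope * t + intercept)  # residual after trend removal
```

The residual series `detrended` is what remains for the seasonality, spike and noise analyses of the following sections.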
Seasonal pattern
When time series are monitored for a sufficiently long time period (e.g., weeks, months or years), it is often the case that such a series displays seasonal patterns. A seasonal pattern is similar in nature to the trend component, but it repeats itself at systematic intervals over time. This is typically the case of Internet-based services, where system measures increase during diurnal activities and decrease during the night or weekend. Figure 3.3 shows a typical example of a time series displaying seasonality.
Figure 3.3: Example of a seasonal time series.
Seasonality may also be reflected in the variance of a time series. For example, the time series variability may be highest on specific days of the week (e.g., Thursday and Friday, rather than Monday to Wednesday), because of the specific characteristics of Internet Data Center services.
Trend and seasonal patterns usually coexist in real-life data sets, and the amplitude of the seasonal changes increases with the overall trend. Together they form what we call the deterministic component of a time series. An example of the two contributions together in the same data set is given in Figure 3.4. In this example, the relative amplitude of seasonal changes is constant over time, thus it is related to the trend.
Figure 3.4: Example of a trend and seasonal time series.
The correlogram and the periodogram are among the most useful tools for determining the deterministic component of a data set.
3.1.1 Correlogram
The correlogram (or autocorrelogram) displays graphically and numerically the autocorrelation function (ACF) of a data set and is useful to reveal trend and seasonal patterns [32].
Given a stochastic data set, the ACF describes the correlation between the data set values at different instants. The presence of autocorrelation between different values of a time series means that a temporal dependence between its samples exists and that a strong deterministic component influences the time series behavior.
Let Xi be the data set collected at time ti. If Xi has mean value µ and variance σ², then the autocorrelation function is defined as:

ACF(Xi, Xi+k) = E[(Xi − µ)(Xi+k − µ)] / σ²    (3.1)

where E is the expected value operator and k stands for the time lag. The autocorrelation at lag k is defined as the correlation between samples separated by k time periods.
This expression is not well defined for all data sets, since the variance σ² may be zero (for a constant data set) or infinite for some heavy-tailed distributions. If the ACF is well defined, its value must lie in the range [−1, 1], with 1 indicating perfect correlation and −1 indicating perfect anticorrelation.
The correlogram displays serial autocorrelation coefficients (and their standard errors) for consecutive lags k in a specified range (e.g., k ∈ {1, . . . , 100}). Correlograms are useful techniques for trend and seasonal pattern identification because, if the time series contains a seasonal fluctuation, the correlogram exhibits an oscillation at the same frequency.
Figure 3.5 gives an example of application, showing a time series and the corresponding correlogram. The autocorrelation function of the data set in Figure 3.5(a) is computed at different lags k, ranging in the interval [1, 100], and the result is represented by the heights of the columns of Figure 3.5(b). At a superficial analysis, the first figure displays a stationary time series, apparently independent and identically distributed. The information deriving from the computation of the autocorrelation function, instead, disproves this belief. The correlogram corresponding to the time series reveals a strong seasonal behavior that was "hidden" in the data set. The correlogram provides relevant extra information about the deterministic component that was not clearly present in the time plot of the data. By searching for the peaks in the autocorrelation function, it is possible to discover the time lag at which the time series repeats itself periodically.
The correlogram also carries useful information for time series prediction. A rapid decrease of the ACF curve means that the observed data set values exhibit low (or null) autocorrelation. Predictions on non-correlated data tend to produce unreliable future estimations. On the other hand, a slow decay of the ACF in the correlogram indicates that the time series shows a dependency among its values, and thus is rationally predictable.
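The ACF of Equation 3.1 can be sketched as follows (a hypothetical example on synthetic data, assuming NumPy; `acf` is an illustrative helper, not from the thesis). The peaks of the resulting correlogram reveal a seasonality of period 50 that is hard to see in the raw series:

```python
import numpy as np

def acf(x, max_lag):
    """Sample autocorrelation function (Eq. 3.1) for lags 1..max_lag."""
    x = np.asarray(x, dtype=float)
    mu, var = x.mean(), x.var()
    n = len(x)
    return np.array([np.mean((x[:n - k] - mu) * (x[k:] - mu)) / var
                     for k in range(1, max_lag + 1)])

# A stationary-looking series with a "hidden" seasonality of period 50.
rng = np.random.default_rng(0)
t = np.arange(2016)
x = np.sin(2 * np.pi * t / 50) + rng.normal(0, 1.0, t.size)

r = acf(x, 100)  # r[k-1] is the autocorrelation at lag k
# The correlogram peaks around lag 50 and dips around lag 25,
# exposing the seasonal fluctuation buried in the noise.
```

Plotting `r` as column heights reproduces a correlogram in the style of Figure 3.5(b).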
Figure 3.5: Example of a time series with a "hidden" seasonality and the corresponding correlogram. (a) Plot of the time series; (b) Correlogram.
3.1.2 Periodogram
A basic idea in mathematics and statistics is to take a complicated object, such as a stochastic time series, break it up into a sum of simple objects that can be studied separately, see which ones can be thrown away as unimportant, and then add what is left back together again to obtain an approximation of the original object. The periodogram [112] of a time series is the result of such a procedure.
The periodogram is a very useful tool for describing a time series and identifying trend and seasonal patterns at unknown periods. It is much more useful than the correlogram, but it does require some training to interpret properly.
The basic idea is that time series of long period are smooth in appearance, whereas those of short period are very wiggly. Thus, if a time series appears to be very smooth (wiggly), then the values of the periodogram for long (short) periods will be large relative to its other values. In this case, we say that the data set has an excess of long (short) periods. For a purely stochastic series, all of the sinusoids should be of equal importance and thus the periodogram will vary randomly around a constant. Instead, if a time series has a strong sinusoidal signal at some period, then there will be a peak in the periodogram at that period. If a large peak is observed, it may well provide a clue to some important source of seasonality in the data. A strong seasonal component at a frequency such as 1/10 of the sampling period will result in a large spike, as shown in Figure 3.6.
Figure 3.6: Example of periodogram.
Periodograms can also show small peaks at multiples of the fundamental period, which reflect the fact that the seasonal oscillation is not perfectly sinusoidal. The contributions at very long periods come from the overall trend in the series.
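A periodogram along these lines can be sketched with the discrete Fourier transform (a hypothetical example on synthetic data, assuming NumPy; `periodogram` is an illustrative helper). The largest peak away from frequency zero recovers the period of a hidden sinusoid:

```python
import numpy as np

def periodogram(x):
    """Periodogram: power of the series at each frequency (cycles/sample)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    power = np.abs(np.fft.rfft(x - x.mean())) ** 2 / n
    freqs = np.fft.rfftfreq(n)
    return freqs, power

rng = np.random.default_rng(1)
t = np.arange(1000)
# Noisy series with a strong sinusoidal component of period 10 samples.
x = 2.0 * np.sin(2 * np.pi * t / 10) + rng.normal(0, 1.0, t.size)

freqs, power = periodogram(x)
peak_freq = freqs[np.argmax(power[1:]) + 1]  # skip the zero-frequency bin
dominant_period = 1.0 / peak_freq            # recovers the period of 10 samples
```

Away from the dominant peak the power varies randomly around a constant, matching the purely stochastic case described above.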
3.2 Spike component
All non-deterministic components of a time series are considered stochastic errors, that is, deviations of the time series from the expected systematic pattern. Random errors of data sets coming from process monitoring typically include a noteworthy spike component. It collects short-lived bursts departing from the time series mean in correspondence of unexpected and uncommon events in the sampled resource measure.
Figure 3.7 shows the typical behavior of spike components. Most data samples oscillate around the time series mean (equal to zero, in this case), assembling what we can consider the in-control state of the resource. Sample values strongly and abruptly diverging from the resource state are named spikes and correspond to out-of-control data samples.
Figure 3.7: Example of a spike component.
Spike samples are due to accidental and unexpected events in system activity that should prompt urgent inspection of their causes. Thus, the spike component plays a relevant role in stochastic time series analysis, since it assembles random errors that, unlike the noise component, cannot be ignored as irrelevant, but need further investigation for an appropriate resource control. For this reason, spike identification is critical for Internet Data Center management.
Since a spike signal should trigger examination activities on the monitored resource, it is extremely important that the technique for spike detection is precise, accurate, and instantaneous. We explore a popular spike detection technique: the sigma threshold test.
3.2.1 Sigma threshold test
A simple spike analysis technique, often used in the process quality control domain [114], is to declare as a spike all data instances that are more than sσ away from the time series mean µ, where σ is the standard deviation of the time series. The s parameter of the sigma threshold test is a positive integer that should be set according to the definition of spike in the different application contexts, the characteristics of the time series, and the performance the spike detection technique is required to achieve.
Figure 3.8 shows the effect of applying a sigma threshold test with s = 3. The black dotted lines above and below in the figure are plotted at three times the standard deviation of the gray time series. Mathematically, the µ ± 3σ region contains 99.7% of the data instances.
Figure 3.8: Example of 3σ threshold test for spike component.
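A minimal sketch of the sigma threshold test on synthetic data (hypothetical example, assuming NumPy; `sigma_threshold_spikes` is an illustrative name):

```python
import numpy as np

def sigma_threshold_spikes(x, s=3):
    """Flag as spikes all samples farther than s standard deviations from the mean."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    return np.abs(x - mu) > s * sigma

rng = np.random.default_rng(2)
x = rng.normal(0, 1.0, 2000)   # in-control oscillation around the mean
x[[100, 900, 1500]] += 12.0    # three injected out-of-control bursts

spikes = sigma_threshold_spikes(x, s=3)
spike_positions = np.flatnonzero(spikes)  # indices of the detected spikes
```

With s = 3, the three injected bursts fall well outside the µ ± 3σ band and are flagged, while almost all in-control samples remain inside it.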
According to control chart theory [114], the two dotted lines indicate the threshold at which the process output is considered statistically "unlikely". On the other hand, the points falling inside the central band indicate that the monitored process is currently under control. Here, the time series is stable, with variation coming only from sources common to the process.
Any observation outside the limits suggests the introduction of a new and likely unanticipated source of variation, causing a spike in the time series. Since such a variation means something unusual and unexpected, a sigma threshold test "signaling" the presence of a spike requires immediate investigation.
Control chart theory sets s = 3, since the choice of three-sigma control limits is supported by statistical theorems [125] and by empirical investigations of sundry probability distributions, revealing that at least 99% of observations occur within three standard deviations of the mean. Setting s < 3 leads to the signaling of a higher number of out-of-control values in the time series. It is appropriate only for very critical characteristics, or for monitored processes in which each detected spike does not cause a cascade of time-consuming investigations. Values of s higher than three detect stronger shifts in the data set's mean. They are used for time series that monitor less critical characteristics, or characteristics whose inspection procedures have a high impact every time samples are out-of-control.
Figure 3.9 gives a graphical example. A technique with s = 1 leads to the identification of a huge number of spikes, as evinced by the many samples with values outside the small central band in Figure 3.9(a). Every time the time series exceeds (in the positive or negative direction) one standard deviation of its values, a spike is signaled and an investigation procedure is activated.
Figure 3.9: Example of 1σ and 5σ threshold tests for spike component. (a) 1σ threshold test; (b) 5σ threshold test.
Opposite results are obtained for s = 5. In Figure 3.9(b), only a few values strongly departing from the time series mean correspond to spikes. This reduces the number of procedures activated to face unexpected behaviors of the monitored resource.
There is no optimal value for the s parameter. It must be tuned considering the monitored resource and the time series characteristics, as well as the context requirements and the impact of investigation activities. Besides, in Internet Data Center management there are always changes in the monitored resources, which make fixed control limits invalid. If the process improves and the monitored resource becomes more stable, fixed limits remain too wide and the sσ limits will not properly signal out-of-control samples. This translates into missed detections of true spikes. On the other hand, if the process worsens and the investigated resource increases its variability, the control limits become too narrow. This results in the spike technique reporting as out-of-control samples that would fall within naturally calculated sσ limits, and thus in the signaling of spikes that should instead be rejected. A runtime tuning of the s parameter is essential to avoid poor performance of the sigma threshold test caused by stochastic process variations.
3.3 Noise component
Beside spikes, stochastic noises are also deviations of the time series from the expected deterministic pattern.
In computing and information contexts, noise is typically considered unwanted data without meaning, that is, data that is not used to transmit a signal, but is simply produced as an unwanted by-product of other activities. Even if unwanted, the noise component carries important information that can be extensively used for a better understanding of the noise itself or of the stochastic time series behavior. Noise can be the key to analyze phenomena that are difficult to explain in a fully deterministic regime, such as the Internet Data Center activity.
Figure 3.10 displays a finite-length, discrete-time monitoring of a noise component generated from a typical resource usage of an Internet-based server.
Figure 3.10: Example of a noise component.
Many characteristics of the noise component deserve deep investigation, ranging from the more salient ones (such as variance and standard deviation) to the more hidden ones. Among them, we investigate the color of noise and the distribution of noise, which will be useful for future investigations and classifications of noise components in this work.
3.3.1 Color of noise
While noise is by definition derived from a stochastic signal, it can have different characteristic statistical properties corresponding to different mappings from a source of randomness to a real noise. Spectral density, that is, the power distribution in the frequency spectrum [6], is a property that can be used to distinguish different types of noise. This classification, carried out through spectral density, gives the so-called color terminology that assigns a different color name to each type of noise.
An important type of stochastic signals are the so-called white noise time series. "White" is used because, in some sense, white noise contains equal amounts of all frequency components, analogously to white light which contains all colors.
White noise has zero mean value:

µwhite = 0    (3.2)
There is no covariance or relation between sample values at different time indexes, and hence the autocovariance is zero for all lags k except for k = 0. This results in a random scattering component that is difficult to treat. The absence of relation among samples makes it impossible to model white noise.
Given a stochastic time series Xi, the autocovariance is a measure of how much the time series varies together with a time-shifted version of itself. Naming E[Xi] = µi the mean of each state of the time series and being Xj the shifted data set, the autocovariance is given by:

COVXX(i, j) = E[(Xi − µi)(Xj − µj)] = E[Xi · Xj] − µi · µj    (3.3)

where E is the expectation operator.
Thus, for a white signal, the autocovariance function follows the characteristic behavior shown in Figure 3.11: its main characteristic is a single relatively large value at lag k = 0.
Figure 3.11: Example of autocovariance function of a white noise time series. (a) White noise time series; (b) Autocovariance function.
White noise is an important signal in estimation theory because the purely random noise, which is always present in stochastic measurements, can be represented by white noise.
In contrast to white noise, colored noise does not vary completely randomly:

µcoloured ≠ 0    (3.4)

In other words, there is a covariance between the sample values at different time indexes. As a consequence, the autocovariance COVXX is non-zero for lags k ≠ 0. COVXX has a maximum value at k = 0, and decreases for increasing k. The autocovariance function of a colored noise time series has a behavior similar to that shown in Figure 3.12.
Figure 3.12: Example of autocovariance function of a colored noise time series. (a) Colored noise time series; (b) Autocovariance function.
Most practical contexts are nonlinear time-varying stochastic systems and, owing to the effect of feedback control, colored noise may arise [129]. Both white and colored noise may affect the analysis of the observed data coming from a stochastic system. Distinguishing between them is a relevant step in time series analysis. While the relation among colored noise samples can be somehow modeled for management purposes, white noise is hard to treat. It represents the unwanted error affecting time series that does not carry any useful information and can therefore be discarded. For these reasons, a deep analysis of noise characteristics and their effect in perturbing resource measures is fundamental to eliminate contaminated samples and retain only the main information useful for an efficient time series management.
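The contrast between white and colored noise can be sketched with the autocovariance of Equation 3.3 (a hypothetical example on synthetic data, assuming NumPy; the colored series is generated here with an AR(1) recursion, one simple way to obtain correlated noise):

```python
import numpy as np

def normalized_autocovariance(x, max_lag):
    """Sample autocovariance (Eq. 3.3) at lags 0..max_lag, scaled so COV(0) = 1."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    cov = np.array([np.dot(x[:n - k], x[k:]) / n for k in range(max_lag + 1)])
    return cov / cov[0]

rng = np.random.default_rng(3)
white = rng.normal(0, 1.0, 4000)  # white noise: unrelated samples

# Colored noise via an AR(1) recursion: each sample retains 90% of the
# previous one, so nearby samples co-vary.
colored = np.empty_like(white)
colored[0] = white[0]
for i in range(1, len(white)):
    colored[i] = 0.9 * colored[i - 1] + white[i]

cov_white = normalized_autocovariance(white, 50)
cov_colored = normalized_autocovariance(colored, 50)
# White: negligible values for all k > 0, as in Figure 3.11(b).
# Colored: maximum at k = 0 and a slow decay, as in Figure 3.12(b).
```

The two autocovariance profiles reproduce the qualitative shapes of Figures 3.11(b) and 3.12(b) and give a simple numerical criterion to tell the two noise types apart.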
3.3.2 Distribution of noise
Most methods for time series analysis assume that the data set is corrupted by Gaussian noise. This hypothesis is not always true, and needs statistical analyses to be confirmed. The Shapiro-Wilk normality test [113] is the most common technique to verify whether the noise distribution is normal.
The test can be done through a Q-Q plot [60], that is, a plot of the quantiles of two distributions against each other, or a plot based on estimates of the quantiles. Quantiles are points taken at regular intervals from the cumulative distribution function (CDF) of a stochastic variable, which describes its probability distribution. The pattern of points in the Q-Q plot is used to compare the two distributions.
Thus, a Q-Q plot is a probability plot, which is a graphical method for comparing two probability distributions by plotting their quantiles against each other. If the two distributions are similar, the points in the Q-Q plot will approximately lie on the line y = x. If the distributions are linearly related, the points in the Q-Q plot approximately lie on a line, but not necessarily on the line y = x. Q-Q plots can also be used as a graphical means of estimating parameters in a location-scale family of distributions, such as the Gaussian distribution.
We use Q-Q plots to compare the data set noise component to the standard normal distribution N(0, 1). This provides a graphical assessment of "goodness of fit". Since Q-Q plots compare distributions, there is no need for the values to be observed as pairs, as in a scatterplot, or even for the numbers of values in the two groups being compared to be equal.
Figure 3.13 shows three examples of Q-Q plots for normality testing. The points
plotted in a Q-Q plot are always non-decreasing when viewed from left to right. If
the noise distribution and the Gaussian one are identical, the Q-Q plot follows the
45° line y = x, as in Figure 3.13(a). If the two distributions agree after linearly
transforming the values in one of them, then the Q-Q plot follows
some line, but not necessarily the line y = x, as in Figure 3.13(b). If the general
trend of the Q-Q plot is flatter than the line y = x, the distribution plotted on the
horizontal axis is more dispersed than the distribution plotted on the vertical axis.
Conversely, if the general trend of the Q-Q plot is steeper than the line y = x,
the distribution plotted on the vertical axis is more dispersed than the distribution
plotted on the horizontal axis. Q-Q plots are often arced, or "S" shaped, as in
Figure 3.13(c), indicating that one of the distributions is more skewed than the
other, or that one of the distributions has heavier tails than the other.
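As a sketch of this comparison (the sample size and the number of quantile levels below are illustrative assumptions, not values taken from the thesis data), the quantile pairs of a Q-Q plot against N(0, 1) can be computed as follows:

```python
from statistics import NormalDist
import numpy as np

def qq_pairs(sample, n=99):
    """Pairs (Gaussian quantile, sample quantile) for a Q-Q plot of the
    sample against the standard normal distribution N(0, 1)."""
    probs = [(i + 0.5) / n for i in range(n)]             # regular CDF levels
    gauss_q = np.array([NormalDist().inv_cdf(p) for p in probs])
    sample_q = np.quantile(sample, probs)                 # empirical quantiles
    return gauss_q, sample_q

# For Gaussian noise, the points lie close to the line y = x
rng = np.random.default_rng(0)
noise = rng.normal(0.0, 1.0, 2016)    # e.g., one week of 5-minute samples
gq, sq = qq_pairs(noise)
```

For a heavier-tailed or skewed noise component, the extreme points drift away from the line y = x, producing the arced or "S" shapes discussed above.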
Figure 3.13: Examples of Q-Q plots (Gaussian quantiles vs. noise quantiles): (a) Q-Q plot of identical distributions; (b) Q-Q plot of linearly related distributions; (c) "S"-shaped Q-Q plot.
Chapter 4
PCA-based technique on collected data
This chapter details the first two phases of the multi-phase methodology proposed
in Chapter 2.
We use Principal Component Analysis to explore the intrinsic dimensionality
and structure of system resource behaviors, using data collected from a typical
Internet Data Center as described in Section 4.1. The data come from the monitoring
of both heterogeneous and homogeneous resource measures on a single server
and on the different servers of the system, as reported in Sections 4.1.1 and 4.1.2,
respectively.
Even though typical Data Centers have thousands of servers, we show in Sec-
tion 4.2 that, on long time scales (days to weeks), the structure of heterogeneous
and homogeneous resource measures can be captured well by remarkably
few dimensions. We find that, using fewer than 15 dimensions, one can accurately
approximate the behavior of a thousand monitored resource measures in the sys-
tem. In order to explore the nature of this low dimensionality, we introduce the
notion of eigenresources. An eigenresource, derived from a PCA of heteroge-
neous (see Section 4.2.1) or homogeneous (see Section 4.2.2) resource measures,
is a time series that captures a particular source of temporal variability (a "fea-
ture") in the resource behavior. Each resource time series can be expressed as
a weighted sum of eigenresources; the weights capture the extent to which each
feature is present in the given resource time series.
The proposed PCA-based technique uses several analysis steps, detailed in Sec-
tion 4.3, to classify eigenresources into a few main classes on the basis of their
main statistical properties. These properties are easy to map onto those of
the stochastic time series investigated in the previous chapter. The analyses
presented in Chapter 3 make it possible to understand eigenresource characteris-
tics and peculiar behaviors in Section 4.3.2. In Section 4.4, the eigenresources of
the same class are summed to create what we call a representative eigenresource,
which contains all the contributions of the server resources of the system to that
behavioral class. The PCA results, in terms of this small set of principal eigenresources,
give an exhaustive and complete representation of the behavioral components of
the entire Internet Data Center, since they collect the contributions of the moni-
tored resource measures of all the servers of the system.
4.1 Data collection
Let us start from the first phase of our methodology, highlighted in Figure 4.1.
The data considered in this work are collected from 50 servers of an Inter-
net Data Center of thousands of servers supporting several critical services. The
50 servers monitored in this study include Web servers, application servers and
database servers, following the typical multi-tier infrastructure for supporting Web-
based services. The servers run MWA (MeasureWare Agent) [37] to collect per-
formance data.
The data considered in this work are collected during a one-week period from
03/08/2010 to 03/14/2010. The decision to consider only one week's worth of
data is supported by several empirical tests: applying the PCA-based technique
described later to samples of two or more weeks gives essentially the same results
as those obtained on seven-day time series. For simplicity, in this work we
decide to report the outcomes of applying the multi-phase methodology to time
series collecting measures for a week.
System monitors aggregate (average) the resource measures of the 50 servers
every five minutes. These measures refer to the most important resources of
a server and compute the most interesting performance metrics for a complete
characterization of the server.
Figure 4.1: First phase of the multi-phase framework.
In Section 4.1.1, we describe the heterogeneous resource measures monitored
on each server node, while a general comparison of the behaviors of homogeneous
resource measures monitored on different servers of the Internet Data Center is
given in Section 4.1.2.
4.1.1 Heterogeneous resources of a single server
MWA system monitors collect several performance metrics of servers, such as:
• CPU utilization, queue lengths, and related processor metrics;
• Memory usage, caching, and other memory-related metrics;
• Network utilization, errors, and other metrics pertaining to network activity;
• CPU, memory and network usage broken down by specific applications;
• End-to-end transaction times.
System monitors average the performance measures collected over five-minute
intervals and write them to a log file, which is subsequently extracted and stored in
a central repository. Among the monitored resource performance metrics, 21 have
been selected as interesting for server state characterization. For each of the 21
metrics, Table 4.1 reports the syntax used by system monitors and the metric it
corresponds to.
From the collected data, we generate a heterogeneous resource measurement
matrix χhetero for each server of the system. It is a t × p matrix, where the number
of rows t is the number of time intervals (t = 2016 five-minute intervals within
the one-week period) and the number of columns p is the number of resource
measures considered on the server (p = 21).
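As a minimal sketch (with placeholder values standing in for the monitored metrics), χhetero can be assembled by stacking the p metric series as columns:

```python
import numpy as np

t, p = 2016, 21                  # one week of 5-minute samples, 21 metrics
rng = np.random.default_rng(1)
series = [rng.random(t) for _ in range(p)]   # placeholder resource time series
chi_hetero = np.column_stack(series)         # t x p measurement matrix
```

Each column of chi_hetero is then one resource time series.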
Every column of the matrix is a resource time series: in Figure 4.2 we report
some examples of time series monitored for a week on a database server of the
Internet Data Center under examination. All these figures share the common trait
that the performance metrics coming from system monitors are extremely variable
and stochastic. Despite that, different resources and measures show different
sample value domains, different characteristic behaviors, and different weights of
the constituent deterministic and error components.
CPU utilization in Figure 4.2(a) is a bounded resource measure strongly driven
by the seasonal component. This is evident in the periodic behavior of the time
series, where increases during diurnal activity are followed by decreases during
the night. The same seasonal pattern guides the network packet rate measure in
Figure 4.2(c), even though the time series values reach different orders of magnitude.
Memory usage measures in Figure 4.2(b) seem primarily driven by an increasing
non-linear trend component, while the time series of system call rates shows
periodic spikes in Figure 4.2(d).
Each time series gives a representation of the behavior of a single monitored
resource with respect to the precise measure it collects. Studying each time series
alone can give interesting information for the management of one resource of a
Syntax           Description
Syscall Rate     The average rate of system calls made during the interval
Uptime Hours     The system up-time of the monitored system
Active CPUs      The number of CPUs on-line on the system
CPU%             The percentage of time the CPU was not idle during the interval
CPU Time         The total time, in seconds, that the CPU was not idle in the interval
Idle CPU%        The percentage of time the CPU is not processing instructions
Idle Time        The time, in seconds, that the CPU was idle during the interval.
                 This is the total idle time, including waiting for I/O
Pk FS Sp%        The percentage of occupied disk space to total disk space
                 for the fullest file system found during the interval
Cache Rd Rt      The amount of physical memory (in MBs unless otherwise
                 specified) used by the buffer cache during the interval
Memory%          The percentage of physical memory in use during the interval.
                 This includes system memory, buffer cache and user memory
User Mem%        The percentage of physical memory allocated
                 to user code and data at the end of the interval
Pg Out Rate      The number of KBs per second of pages paged-out from
                 system memory to disk during the monitoring interval
Page Out         The total number of pages paged-out from system memory
                 to disk per second during the monitoring interval
Sys+Cache%       The percentage of physical memory used by the system
                 during the interval, including buffer cache
SysMem%          The percentage of physical memory used by the system (kernel)
                 during the interval
In Pkt Rate      The number of successful packets received through all
                 network interfaces during the interval
Out Pkt Rate     The number of successful packets sent through all network
                 interfaces during the interval
Network Pkt Rt   The number of successful packets per second (both inbound
                 and outbound) for all network interfaces during the interval
Alive Proc       The sum of the alive-process-time/interval time ratios
                 for every process
Active Proc      The sum of the alive-process-time/interval time ratios of every
                 process that is active (uses any CPU time) during an interval
Run Time         The elapsed time since a process started, in seconds

Table 4.1: System monitor's syntax and corresponding resource measure.
Figure 4.2: Examples of heterogeneous resource measures of a database server (2016 five-minute samples): (a) CPU utilization; (b) memory usage; (c) network packet rate; (d) system call rate.
single server, but it is useless in the context of Internet Data Center management.
As we consider 50 servers, we collect fifty χhetero matrices, each one composed
of 2016 × 21 measurements. Each matrix is itself a high-dimensional structure
residing in a high-dimensional space. Together, the measurements of the different
servers within the Internet Data Center form a high-dimensional multivariate
time series.
Since the heterogeneous resource measures of a single server node are the result
of the same external workload, we suppose that the resource time series share
common characteristics. We should expect the columns of χhetero to be related, so
that the intrinsic dimensionality of χhetero is less than p. Principal Component
Analysis, described in detail in Section 4.2, is a powerful approach to verify this
presumption quantitatively. Via PCA, in Section 4.2.1 we extract the most
relevant features of the χhetero matrix for each of the 50 monitored servers.
4.1.2 Homogeneous resources of different servers
Once a representative view of each server is obtained, the interest moves to the
entire set of servers and their interactions.
Homogeneous resources are monitored on the 50 servers of the Internet Data
Center, and the system monitor entries are collected in a homogeneous resource
measurement matrix χhomo for each resource measure of the system. It is a t × p
matrix, where t is the number of time intervals (as before, t = 2016) and the
number of columns p is the number of monitored servers in the Internet Data
Center (p = 50).
Since the number of monitored resource measures is 21, we obtain twenty-
one χhomo matrices of 2016 × 50 measurements. Each column of a matrix is a
time series representing the same resource measure collected on a different server.
A view of the same resource measure behavior on the different servers of the
system is given in Figure 4.3; the plotted resource measure is CPU utilization.
Figure 4.3: Examples of homogeneous resource measures (CPU utilization): (a) Web server; (b) application server; (c) database server.
Figures 4.3(a), (b) and (c) report, respectively, the CPU utilization monitored
contemporaneously on a Web server, an application server and a database server
of the Internet Data Center. What is evident is the different behavior of the same
metric on different servers of the infrastructure. This is clear evidence that
decisions made on the basis of the information given by a single server cannot be
extended to the management of the other servers of the Internet Data Center.
Because a resource measure on related servers is the result of a common user
activity, we should expect the columns of χhomo to be related. Also
in this case, a very useful method to verify this assumption quantitatively
is dimensionality analysis via PCA.
4.2 Principal Component Analysis
We now consider the second phase of the multi-phase methodology (see Fig-
ure 4.4) and give a complete explanation of the PCA-based technique.
We apply the Principal Component Analysis (PCA) technique to characterize each
resource behavior. We show that the monitored resource measures of the servers
of the Internet Data Center can be characterized by just a few features that are
sufficient to describe the whole system behavior. These features change from server to
server, and from resource to resource: we apply PCA on heterogeneous resources
monitored on a single server node in Section 4.2.1, and on homogeneous resources
monitored on the different nodes of the Data Center in Section 4.2.2.
In order to facilitate the subsequent discussion, we recall some relevant no-
tation introduced in the previous section. For each resource, let p denote the number
of time series referring to the resource, and t denote the number of successive time
intervals of interest. In this work, we study a system having on the order of
fifty servers and tens of monitored performance metrics, over long time scales
(days to weeks) and with sampling intervals of 5 minutes, so that t ≫ p. Let χhetero and
χhomo be the t × p measurement matrices, which denote, respectively, the heteroge-
neous resource time series of a single server node and the homogeneous resource
time series of all the nodes of the system.
Figure 4.4: Second phase of the multi-phase framework.
When considering χhetero, each column i denotes the time series of the i-th moni-
tored resource measure, and each row j represents an instance of all the monitored
resource measures on that server at time j. In the case of homogeneous resources
coming from different servers, each column i of χhomo denotes the resource time
series of the i-th server, and each row j represents an instance of the monitored
resource measure on all the servers at time j.
To simplify the presentation, we use the common name χ when referring
indifferently to both measurement matrices. We refer to individual columns of a
matrix using a single subscript, so the measurements of column i are denoted by χi.
Note that the χ matrices thus defined have rank at most p. Finally, all vectors in this
work are column vectors, unless otherwise noted.
58 PCA-based technique on collected data
PCA is a coordinate transformation method that maps the measured data onto
a new set of axes. These axes are called the principal axes or components. Each
principal component has the property that it points in the direction of maximum
variation or energy (with respect to the Euclidean norm) remaining in the data,
given the energy already accounted for in the preceding components. As such,
the first principal component captures the energy of the original data to the
maximal degree possible on a single axis. The next principal components then
capture the maximum residual energy among the remaining orthogonal directions.
In this sense, the principal axes are ordered by the amount of energy in the data
they capture.
The method of PCA can be motivated by a geometric illustration. An applica-
tion of PCA to a two-dimensional dataset is shown in Figure 4.5.
Figure 4.5: Example of 1D projection of 2D points in the original space.
The first principal axis points in the direction of maximum energy in the data.
To generalize to higher dimensions, as in the case of χ, take the rows of χ as
points in Euclidean space, so that we have a dataset of t points in Rp. Mapping
the data onto the first r principal axes places the data into an r-dimensional hy-
perplane.
Shifting from the geometric interpretation to a linear algebraic formulation,
calculating the principal components is equivalent to solving the symmetric eigen-
value problem for the matrix χTχ. The matrix χTχ is a measure of the covariance
between the server resource time series. Each principal component vi is the i-th
eigenvector computed from the spectral decomposition of χTχ:

χTχ vi = λi vi,   i = 1, . . . , p     (4.1)
where λi is the eigenvalue corresponding to vi. Furthermore, because χTχ is
symmetric positive definite, its eigenvectors are orthogonal and the corresponding
eigenvalues are non-negative reals. By convention, the eigenvectors have unit norm
and the eigenvalues are arranged from large to small, so that λ1 ≥ λ2 ≥ . . . ≥ λp.
Once the data have been mapped into the principal component space, it can be
useful to examine the transformed data one dimension at a time. Considering
the data mapped onto the principal components, we see that the contribution of
principal axis i as a function of time is given by χ vi. This vector can be normalized
to unit length by dividing by σi = √λi. Thus, we have for each principal axis i:

ui = χ vi / σi,   i = 1, . . . , p     (4.2)
The ui are vectors of size t and are orthogonal by construction. The above equa-
tion shows that all the server resource behaviors, when weighted by vi, produce
one dimension of the transformed data. Thus, the vector ui captures the temporal
variation common to all time series along principal axis i. Since the principal axes are
in order of contribution to the overall energy, u1 captures the strongest temporal
trend common to all server resource measures, u2 captures the next strongest,
and so on. Because the set {ui}, i = 1, . . . , p, captures the time-varying trends
common to the resource behaviors, we refer to the ui as the eigenresources of χ.
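Equations (4.1) and (4.2) translate directly into code. The sketch below uses a random matrix as a stand-in for the measurement data and verifies that the resulting eigenresources are orthonormal:

```python
import numpy as np

rng = np.random.default_rng(0)
t, p = 2016, 21
chi = rng.random((t, p))            # stand-in for the measurement matrix

# Spectral decomposition of chi^T chi (Eq. 4.1)
lambdas, V = np.linalg.eigh(chi.T @ chi)
order = np.argsort(lambdas)[::-1]   # arrange eigenvalues from large to small
lambdas, V = lambdas[order], V[:, order]

# Eigenresources (Eq. 4.2): u_i = chi v_i / sigma_i with sigma_i = sqrt(lambda_i)
sigmas = np.sqrt(lambdas)
U = (chi @ V) / sigmas              # columns of U are the eigenresources u_i
```

The orthonormality of the columns of U follows from the decomposition itself: U^T U = diag(1/σ) V^T (χ^T χ) V diag(1/σ) = I.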
In Figure 4.6 we show a typical example of an eigenresource ui and its cor-
responding principal axis vi. The eigenresource captures a pattern of temporal
variation common to the set of time series referring to the CPU utilizations of dif-
ferent servers, and the extent to which this particular temporal pattern is present
in each monitored server's CPU utilization is given by the entries of vi. In
this case, we can see that this eigenresource feature is most strongly present in
server 44 (the strongest peak in vi).
The elements of {σi}, i = 1, . . . , p, are called singular values. Note that each
singular value is the square root of the corresponding eigenvalue, which in turn is
Figure 4.6: Example of an eigenresource and the corresponding principal component: (a) eigenresource 2 over one week (Mon to Sun); (b) the entries of principal component 2 across the 50 servers.
the energy attributable to the respective principal component. Thus, the singular
values are useful for gauging the potential for reduced dimensionality in the data,
often simply through their visual examination in a scree plot.
A scree plot shows, in descending order of magnitude, the singular values
of χ. Such a plot, when read left to right across the abscissa, can often show a
clear "elbow" that separates the "most important" components from the "least
important" ones. That sharp drop in the plot signals that the subsequent components
are negligible. An example of a scree plot is shown in Figure 4.7: it shows a "big
gap" between the ninth and the tenth singular values, so the first nine principal
components are retained and the rest are discarded.
Figure 4.7: Example of a scree plot (singular value magnitudes in descending order).
A commonly used quantitative guideline to choose how many dimensions to retain
is the Kaiser criterion [70], which states that we retain only the factors with singular
values greater than 1. In essence, this is like saying that, unless a singular value
extracts at least as much as the equivalent of one original time series, we drop it.
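A minimal sketch of the criterion as stated here, i.e., counting singular values greater than 1 (note that the classical Kaiser rule is usually phrased in terms of eigenvalues of the correlation matrix):

```python
import numpy as np

def kaiser_r(chi):
    """Number of dimensions to retain: count of the singular values of chi
    that are greater than 1 (the criterion as stated in the text)."""
    s = np.linalg.svd(chi, compute_uv=False)   # singular values, descending
    return int(np.sum(s > 1.0))

# Toy matrix with known singular values 3, 2 and 0.5: retain r = 2
r = kaiser_r(np.diag([3.0, 2.0, 0.5]))
```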
Finding that only r singular values are non-negligible implies that χ resides on
an r-dimensional subspace of Rp. In that case, we can approximate the original χ
as:

χ′ ≈ σ1 u1 v1T + σ2 u2 v2T + . . . + σr ur vrT     (4.3)

where r < p is the effective intrinsic dimension of χ.
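The approximation in Eq. (4.3) can be checked numerically with a truncated singular value decomposition; the dimensions and the noise level in this sketch are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
t, p, r = 200, 20, 4
# A matrix lying close to an r-dimensional subspace of R^p
chi = rng.random((t, r)) @ rng.random((r, p)) + 1e-3 * rng.random((t, p))

U, s, Vt = np.linalg.svd(chi, full_matrices=False)
chi_r = (U[:, :r] * s[:r]) @ Vt[:r]   # sigma_1 u_1 v_1^T + ... + sigma_r u_r v_r^T

rel_err = np.linalg.norm(chi - chi_r) / np.linalg.norm(chi)
```

Since almost all the energy lies in the first r components, the relative reconstruction error is tiny.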
In the next sections we apply the PCA technique on the heterogeneous resources
of a single server node and on the homogeneous resources of different servers, with
the aim of extracting the intrinsic dimension of the resource data set and then
reducing the complexity of the whole-system problem.
4.2.1 PCA on heterogeneous resources of a single server
In this section we apply the PCA technique to the heterogeneous resources monitored
on a single server of the information infrastructure. This allows us to extract the
relevant resource information of a server and then reduce the number of resource
time series to be analyzed for its characterization. As there are 21 monitored
resources on each server, sampled every 5 minutes during a one-week period,
χhetero is the 2016 × 21 measurement matrix whose columns denote the time
series of each monitored resource of the server.
A first important result of applying PCA to the 21 resource time series of the
servers of the infrastructure is that only a small set of eigenresources is necessary
for a reasonably accurate reconstruction of system server behavior. This means that
the resource measures of a server form a multivariate time series of low effective
dimension.
The energy contributed by each eigenresource to the aggregate resource measures
is summarized in Figure 4.8(a), which shows the scree plot obtained
by applying PCA to the time series of the resource set of a database server, chosen as
representative. The unexpected result is that the vast majority of resource measure
variability is contributed by the first few eigenresources. The curve shows a very
sharp knee, revealing that only four eigenresources contribute most of the server
variability. In other terms, this result denotes that the resource measures together
form a structure with an effective dimension of 4, much lower than the number of time
series monitored on a server (21 in this case).
Figure 4.8: Scree plots for the resource time series of a database server: (a) resource time series; (b) normalized time series.
We now ask what the reason is for this low dimensionality in the resource
set data. There are at least two ways in which this low dimensionality may arise.
First, if the magnitude of variation among dimensions in the original time series
differs greatly, then the data may have low effective dimension for that reason.
This occurs when the variation along a small set of dimensions in the original data
is dominant. Second, a multivariate time series may exhibit low dimensionality
if there are common underlying patterns or trends across dimensions, in other
words, if the dimensions show non-negligible correlation.
We can distinguish between these cases in the resource analysis by normalizing the
resource time series before performing PCA. The standard approach is to normalize
each resource measure to zero mean and unit variance. Since normalization is
applied to both the χhetero and χhomo matrices, we have:

χ′i = (χi − µi) / σi,   i = 1, . . . , p     (4.4)

where µi ≡ µ(χi) is the sample mean of χi and σi is the sample standard deviation of χi.
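Equation (4.4) amounts to column-wise standardization; a minimal sketch with placeholder measurements:

```python
import numpy as np

def standardize_columns(chi):
    """Normalize each resource time series (column of chi) to zero mean
    and unit variance, as in Eq. (4.4)."""
    mu = chi.mean(axis=0)        # sample mean of each column
    sigma = chi.std(axis=0)      # sample standard deviation of each column
    return (chi - mu) / sigma

rng = np.random.default_rng(3)
chi = 50.0 + 10.0 * rng.random((2016, 21))   # placeholder measurements
chi_norm = standardize_columns(chi)
```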
If we find that the CPU utilization time series still exhibits low dimensionality
after normalization, we can infer that the remaining effect is due to common
temporal patterns among time series.
The result of applying PCA to the normalized version of the whole dataset is shown
in Figure 4.8(b). The most striking feature of this figure is that the sharp knee
from Figure 4.8(a) remains, but in correspondence with the second eigenresource.
That means that the first eigenresource collects most of the energy of the resource
measures of the server.
If we apply the Kaiser criterion to decide how many dimensions to retain, the
number of eigenresources with singular values higher than 1 is four, thus confirm-
ing the result previously obtained on non-normalized data. Furthermore, these
results on normalized resources ensure that the cause of the low dimensionality lies
entirely in the correlations among resource time series and in common behavioral
patterns. They give the additional information that the principal pattern of the
resources is carried by the first eigenresource, which contributes almost all the
energy of the data set.
A plot of the first eigenresource is reported in Figure 4.9(a): it shows an evi-
dent periodic behavior, following the diurnal activity of the database server pro-
cesses.
The other three principal eigenresources are reported in the remaining panels.
Figure 4.9(b) shows the manifest spike behavior of the second eigenresource: it
collects occasional bursts in the server activity. A decreasing trend is clearly
manifested by the third eigenresource in Figure 4.9(c), while the fourth eigenre-
source in Figure 4.9(d) retains all the random deviations of the resource time series
from the previous components.
These results allow us to consider only four time series to represent the overall
behavior of the database server, thus reducing the complexity of the whole resource
analysis. Now the problem can be solved by examining only the first 4 eigenre-
sources resulting from PCA and applying to them (or to a proper combination of
them) all the statistical models needed for server management.
Equivalent results are obtained on the resource sets of all the other servers
of the considered system. This reinforces the thesis that the complex behavioral
structure of a server can be reduced to a very small number of time series.
Figure 4.9: The principal eigenresources resulting from PCA on heterogeneous resources of a database server: (a) first; (b) second; (c) third; (d) fourth, each plotted over one week (Mon to Sun).
4.2.2 PCA on homogeneous resources of different servers
We now focus on homogeneous resource measures coming from the different
servers of the system. Through PCA, the ensemble of resources is decomposed
into its constituent set of eigenresources. In this case, the number p of time series
is equal to 50, as many as the servers of the considered infrastructure. As in the
case of heterogeneous resources, t is equal to 2016, since we consider time intervals
of 5 minutes over a timescale of a week.

The PCA results concerning eigenresources differ from resource to resource: con-
sidering CPU utilization time series of different servers or memory occupancy
gives different outcomes. We focus on the results of the PCA-based technique applied
to CPU utilization, which is the most representative resource for the application of
the proposed methodology on homogeneous resources of different servers. Need-
less to say, the technique can be applied to time series referring to any monitored
resource.
As obtained for heterogeneous resources, applying PCA on the 50 CPU uti-
lization time series of the servers returns a small set of relevant eigenresources
needed for an accurate reconstruction of the data set. This means that the CPU
measures form a multivariate time series of low effective dimension. Looking at the
resulting scree plot in Figure 4.10(a), we obtain the same effect discovered for het-
erogeneous resource measures: the first few eigenresources contribute the vast
majority of the data set variability. The knee of the curve shows that a handful of
eigenresources, from 4 to 9, contribute most of the CPU utilization variability. In
other terms, this result reveals that the CPU utilization measures together form a
structure with an effective dimension between 4 and 9, much lower than the number
of time series and servers (50 in this case).
As we are interested in underlying patterns or common trends across dimen-
sions, we show the results of applying PCA to the normalized version of the whole
dataset in Figure 4.10(b). Even in this case, we can see that the knee of Figure 4.10(a)
remains, even if less sharp, in nearly the same location. It is also clear that the
relative significance of the first few eigenresources has diminished somewhat.
Figure 4.10: Scree plots for CPU utilization time series: (a) CPU utilization time series; (b) normalized time series.
Taken together, these observations suggest that, while differences in time series
size contribute to the low dimensionality of CPU utilization measures, correla-
tions among time series (common underlying resource patterns) play a significant
role. As the previous discussion points out, these common underlying resource
patterns are in fact the eigenresources.
According to the Kaiser criterion, we evaluate the first 12 dimensions, shown in
Figure 4.11, as the most representative.
Figure 4.11: The twelve principal eigenresources resulting from PCA on homogeneous resources (CPU utilizations), first through twelfth, each plotted over one week (Mon to Sun).
Focusing on these dimensions and rejecting the other, irrelevant information allows
us to simplify the whole system analysis problem. Indeed, this problem can now be
solved on the basis of the information carried by a small number (12, in this case)
of time series, with the certainty of retaining all relevant information about system
behavior.
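As an illustrative sketch of how this dimensionality choice can be automated, the Kaiser criterion can be applied to the eigenvalue spectrum of the standardized measures. This is not the thesis code: the synthetic data set, the 5-minute sampling assumption (288 samples per day) and all numeric values below are hypothetical.

```python
import numpy as np

def kaiser_num_components(X):
    """Number of principal components to retain according to the
    Kaiser criterion: keep the components whose eigenvalue exceeds
    the average eigenvalue (equivalently > 1 on standardized data)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    s = np.linalg.svd(Z, compute_uv=False)   # singular values of the data
    eigvals = s ** 2 / (Z.shape[0] - 1)      # eigenvalues of the correlation matrix
    return int(np.sum(eigvals > eigvals.mean()))

# Hypothetical data set: one week at 5-minute sampling (2016 samples)
# of 100 CPU utilization series sharing a daily pattern.
rng = np.random.default_rng(0)
t = np.arange(2016)
daily = np.sin(2 * np.pi * t / 288)          # 288 samples = one day
X = np.outer(daily, rng.uniform(0.5, 1.5, 100)) + rng.normal(0, 0.1, (2016, 100))
k = kaiser_num_components(X)                 # few dominant dimensions survive
```

With strongly correlated series, a handful of eigenvalues dominate and the criterion retains only those dimensions.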
In the rest of the work, we will focus on CPU utilization as the most representative
resource. Since we are primarily interested in common temporal patterns,
we focus the analysis on the normalized resources. In fact, normalization ensures
that the common patterns captured by the eigenresources are not skewed by
differences in mean CPU utilization rates.
4.3 Analyzing eigenresources
To understand the information carried by the eigenresources, we inspect their
properties in Section 4.3.1, describe the three most common types and deepen
their analysis in Section 4.3.2.
4.3.1 A taxonomy of eigenresources
We analyze the complete set of resource measures of the servers of our infrastructure,
following the technique proposed for network flows in [10]. Although
we focus on CPU utilization, we find results similar to those obtained for network
traffic flow analysis. Across all of the eigenresources, there appear to be only
three distinctly different types. Representative examples of each eigenresource
type from server CPU utilization are shown in Figure 4.12.
Figure 4.12(a) shows an example of an eigenresource that exhibits strong periodicities.
The periodicities clearly reflect diurnal activity, as well as the difference
between weekday and weekend activity. Because this eigenresource appears to be
relatively predictable and shows strong trend and seasonal components, we refer
to it as a deterministic eigenresource.
Figure 4.12(b) shows an example of an eigenresource that exhibits strong, short-lived
spikes. This spike eigenresource shows isolated values that can be many
standard deviations (e.g., 4 or 5) from the eigenresource mean.
Spike eigenresources capture the occasional CPU utilization bursts and dips that are common
features of Web-based system behavior. The majority of eigenresources in the
considered dataset appear to be of this type.
[Figure: three panels over one week (Mon-Sun), (a) Deterministic eigenresource, (b) Spike eigenresource, (c) Noise eigenresource.]
Figure 4.12: Examples of the three types of eigenresources.
Figure 4.12(c) shows an example of an eigenresource that appears roughly stationary
and Gaussian. This noise eigenresource captures the remaining random
variation that arises as the result of multiplexing many individual server sources.
These three categories of eigenresources are only heuristically distinguished.
It is not our intent to suggest that any eigenresource can be unambiguously categorized
in this way. Nonetheless, we observe that these categories are distinct,
and that almost all eigenresources of our data set can be easily placed into one of
these categories.
To demonstrate this, we evaluate each eigenresource according to the following
criteria:
1. Does the autocorrelation function of the eigenresource have a strong periodicity
of 12 or 24 hours?
2. Does the eigenresource contain at least one outlier that exceeds 5 standard
deviations from its mean?
3. Does the eigenresource have a marginal distribution that appears to be nearly
Gaussian?
We judge whether each eigenresource meets one of these criteria by applying
some of the analyses described in Chapter 3.
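As an illustration, the three criteria can be wired into a simple classifier. This is a hedged sketch, not the procedure actually used in the thesis: the ACF strength threshold (0.3), the Q-Q correlation bound (0.99) and the assumption of 288 samples per day are illustrative choices.

```python
import numpy as np
from statistics import NormalDist

def _qq_correlation(z):
    """Correlation between sorted samples and theoretical normal
    quantiles: a numeric stand-in for judging Q-Q plot straightness."""
    n = len(z)
    probs = (np.arange(1, n + 1) - 0.5) / n
    theo = np.array([NormalDist().inv_cdf(p) for p in probs])
    return float(np.corrcoef(np.sort(z), theo)[0, 1])

def classify_eigenresource(u, samples_per_day=288):
    """Heuristic three-criteria classification of an eigenresource."""
    z = (u - u.mean()) / u.std()
    n = len(z)

    def acf(lag):
        # biased sample autocorrelation at the given lag
        return np.dot(z[:-lag], z[lag:]) / n

    # Criterion 1: strong 12 h or 24 h periodicity of the ACF
    periodic = max(abs(acf(samples_per_day // 2)),
                   abs(acf(samples_per_day))) > 0.3
    # Criterion 2: at least one outlier beyond 5 standard deviations
    has_spike = bool(np.any(np.abs(z) > 5))
    # Criterion 3: nearly Gaussian marginal distribution
    gaussian = _qq_correlation(z) > 0.99

    if periodic:
        return "deterministic"
    if has_spike:
        return "spike"
    if gaussian:
        return "noise"
    return "indeterminate"
```

Checking the criteria in this order mirrors the observation that, on our data, one and only one criterion holds for almost every eigenresource.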
The first criterion is evaluated through the correlogram (see Section 3.1) of the
eigenresource. An evident periodicity in the behavior of the autocorrelation function is
proof of the intrinsic periodicity of the corresponding eigenresource. This analysis
also allows us to evaluate the lag of samples after which the time series shows
a temporal repetition. An example of applying this criterion to an eigenresource
of the considered data set is shown in Figure 4.13(a). It is evident that the eigenresource
that we visually identified as deterministic has a distinct
periodic behavior of its autocorrelation function, with a lag of 288 samples (corresponding
to a one day interval). The maximum peak found at sample 288 corresponds
to the fundamental frequency of the time series. We can see that the ACF
repeats with this fundamental period (peaks are found at samples 576, 864,
1152 and 1440).
The second criterion is assessed through the 5σ threshold test (see Section 3.2),
evaluating whether some data samples of the eigenresource time series exceed five times
its standard deviation from the mean value. The choice of a threshold parameter
s = 5 confirms the criterion presented in [10]. Several tests with different sσ
thresholds have been exercised on the eigenresources coming from PCA. In the
application context of Internet Data Centers, the setting s = 5 demonstrated
the best performance in terms of detecting all and only the effective spike
components in the eigenresource set. In Figures 4.13(b.1) and (b.2) we show
two examples of visually identified spike eigenresources that actually have 5σ
excursions from the mean.
The last criterion is evaluated through the Q-Q plot test (see Section 3.3). In
Figure 4.13(c) we show the eigenresource that is visually categorized as a noise
eigenresource, manifesting a marginal distribution that is nearly Gaussian. The almost
straight crossed line indicates a close fit of the eigenresource to the standard
normal distribution N(0, 1).
[Figure: four panels, (a) Deterministic eigenresource (correlogram), (b.1) and (b.2) Spike eigenresources (5σ threshold test), (c) Noise eigenresource (Q-Q plot test against standard normal quantiles).]
Figure 4.13: Classifying eigenresources by using three statistical tests.
We use these tools to classify all the eigenresources of the considered data set.
Eigenresources for which none of the criteria above holds true are categorized as
“indeterminate”. On the CPU utilization time series, only one is indeterminate (contributing
4.77% to the overall energy). For all of the remaining eigenresources, one
and only one criterion holds true. Since in Section 4.2.2 we have demonstrated
that the first 12 eigenresources alone retain most of the variability of the CPU
utilization time series, we focus on this smaller set of dimensions.
By using the criteria above, we see that 3 of those 12 principal eigenresources
show a deterministic behavior and 3 a noise behavior. The remaining eigenresources
all have short-lived spikes. We underline that the different
eigenresource types appear in different regions when the eigenresources are
ordered by overall importance (i.e., by singular value). As presented in Table 4.2,
deterministic and noise eigenresources are found among the first seven eigenresources.
The next five in order are all classified as spike eigenresources.
Order  Eigenresource Type   Order  Eigenresource Type   Order  Eigenresource Type
1      Deterministic        5      Indeterminate        9      Spike
2      Noise                6      Noise                10     Spike
3      Noise                7      Deterministic        11     Spike
4      Deterministic        8      Spike                12     Spike

Table 4.2: Occurrence of eigenresource types in order of importance.
This result reveals that the most important source of variation in CPU utilization
measures is the systematic change due to periodic trends. After these
periodic trends, noise dispersions are next in importance. The least significant
contribution to CPU utilization variability comes from bursts or spikes.
These conclusions are confirmed in a more quantitative way by the data in
Table 4.3, which shows the fraction of the total energy that can be assigned to each
of the three eigenresource types. Deterministic eigenresources provide more than
two times the contribution of the noise class, and almost six times the contribution
of the spike class.
This is the case for the PCA-based technique applied to a resource measure
strongly dependent on periodic activities, such as CPU utilization.

               Deterministic    Spike            Noise            Indeterminate
               Eigenresources   Eigenresources   Eigenresources   Eigenresource
Contribution   58.49%           25.90%           10.84%           4.77%

Table 4.3: Contributions of eigenresource types.

Different resource measures may return different behavioral classes. What we can say is that,
on every one of the 21 resource measures tested, the PCA technique always
extracts three types of classes, never more.
To better understand the three behavioral classes, in the next section we investigate
the characteristics of deterministic, spike and noise eigenresources, in order to
improve the proposed methodology for whole system analysis.
4.3.2 Understanding eigenresources
The analysis of eigenresources has emphasized the central role of the three behavioral
classes in which all eigenresources can be placed. Nevertheless, eigenresources
belonging to the same class can carry different information. We now extend
the basic analysis proposed in [10] in order to better understand the statistical
characteristics of eigenresources and apply these results to improve Internet Data
Center management.
Let us start from several examples. In Figure 4.14 we report the eigenresources
classified as deterministic in Section 4.3.1, with the corresponding correlograms.
The first and the seventh eigenresources in Figures 4.14(a.1) and (c.1) have autocorrelation
functions with periodic peaks repeating every 288 samples. Thus, we
can infer that these two deterministic eigenresources have a seasonal behavior with
a temporal lag of one day. A periodic repetition is shown also by the ACF of the
fourth eigenresource, as can be seen in the correlogram of Figure 4.14(b.2).
However, there is a different time window in which the function repeats itself:
its values iterate with a lag of 576 samples, corresponding to two days of resource
measures. Thus, the fourth eigenresource collects the seasonal system behavior
that repeats every 48 hours of activity. This information is useful in choosing
which eigenresources are worth investigating to make management decisions.
[Figure: six panels showing the first, fourth and seventh eigenresources over one week (a.1, b.1, c.1) and the corresponding ACF tests (a.2, b.2, c.2).]
Figure 4.14: Deterministic eigenresources and corresponding correlograms.
We also collected PCA results on time series referring to longer time scales
(e.g., two weeks). In these contexts we can find dimensions whose autocorrelation
functions show different lags and multiple periodic behaviors. An example of
this multi-seasonal behavior is reported in Figure 4.15.
[Figure: ACF plotted over two weeks (Mon-Sun, twice).]
Figure 4.15: Example of correlogram showing a multi-seasonal behavior in a two-week resource sampling.
In all these cases, the discovered periodicity lags of the deterministic eigenresources
are multiples of one another. Considering them as equivalent within a
unique vision may not have a remarkable impact on the accuracy of the whole system
analysis, but this may not always be the case. If the considered data set is
influenced by different non-multiplicative periodicities concurring together in the
Internet Data Center seasonal activity, collecting them into one representative vision
may give meaningless results.
We assert that an accurate analysis of the deterministic eigenresources is a crucial
step in the proposed multi-phase methodology: it allows us to collect information
in a suitable way, gives meaningful information on the different periodic lags
that may influence the activity of the studied Web-based infrastructure, and guides
the setting of the right parameter values for those management algorithms that take
seasonal properties into account.
Careful investigation should also extend to the other two
classes of eigenresources. Dimensions with overall spike behavior
may likewise differ from one another in a meaningful way. Let us start from an
illustrative example.
Figure 4.16 shows the eigenresources resulting from PCA that satisfy the 5σ threshold
test. There is a clear difference between Figures 4.16(a)-(b) and Figures 4.16(c)-(d)-(e).
The eighth and ninth eigenresources show several consecutive instantaneous
spikes taking place during the entire monitored week. They preserve the characteristic
of immediacy, but lose their singularity and sparseness. The tenth to twelfth
eigenresources, by contrast, exhibit isolated, sporadic and uncommon bursts that
manifestly depart from the mean behavior of the time series.
In our context, the spike category can be split into two subclasses: that of recurrent
spike eigenresources, and that of sporadic spike eigenresources. The
former subclass includes all those spike eigenresources exhibiting frequent short-lived
spikes, repeating in an unpredictable way but consistently present during
the entire sampling period. The latter subclass comprises spike eigenresources
with strong occasional values departing many standard deviations from the eigenresource
mean. The examples in Figures 4.16(a)-(b) are assigned to the recurrent spike
subclass; the examples in Figures 4.16(c)-(d)-(e) to the sporadic subclass.
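One possible way to automate this split is to count the 5σ excursions: frequent excursions suggest the recurrent subclass, rare ones the sporadic subclass. The threshold of 10 excursions per monitored week is our illustrative assumption, not a value taken from the analysis above.

```python
import numpy as np

def spike_subclass(u, n_sigma=5, recurrent_threshold=10):
    """Split a spike eigenresource into 'recurrent' or 'sporadic'
    by counting its n-sigma excursions from the mean."""
    z = (u - u.mean()) / u.std()
    n_excursions = int(np.sum(np.abs(z) > n_sigma))
    return "recurrent" if n_excursions >= recurrent_threshold else "sporadic"
```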
[Figure: five panels, (a) Eighth through (e) Twelfth eigenresources over one week (Mon-Sun).]
Figure 4.16: Spike eigenresources and corresponding sigma threshold tests.
The noise class also needs deeper investigation. As discussed in Section 3.3,
noise signals can be classified as white or colored. White noise presents no covariance
or relation between time series values at different time samples, and
hence its autocovariance function is zero for all lags k except k = 0. Colored
noise does not vary completely randomly and its autocovariance is non-zero
for lags k ≠ 0. It is often incorrectly assumed that Gaussian noise is necessarily
white noise, yet neither property implies the other. Gaussianity refers to the
probability distribution with respect to the value, that is, the probability that the
signal has a certain given value, while the term “white” refers to the way the signal
power is distributed over time or among frequencies. We can therefore find
Gaussian white noise, but also Poisson, Cauchy, etc. white noises, as well as
colored Gaussian noise. Thus, once proved that the noise is normally distributed,
further investigations about the color of the noise are needed in order to obtain an
exhaustive knowledge of the noise component.
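A rough way to check the color of a noise component is to compare its sample autocorrelations against the approximate 95% confidence band ±1.96/√n expected for white noise. This is a common heuristic sketch, not the exact procedure used in the thesis; the 15% out-of-band decision threshold is our assumption.

```python
import numpy as np

def noise_color(x, max_lag=100):
    """Rough white-vs-colored check: count the sample autocorrelations
    at lags 1..max_lag falling outside the approximate 95% white-noise
    confidence band +/- 1.96/sqrt(n)."""
    z = (x - x.mean()) / x.std()
    n = len(z)
    acf = np.array([np.dot(z[:-k], z[k:]) / n for k in range(1, max_lag + 1)])
    outside = int(np.sum(np.abs(acf) > 1.96 / np.sqrt(n)))
    # about 5% of lags are expected outside the band for white noise;
    # the 15% decision threshold is an illustrative choice
    return "colored" if outside > 0.15 * max_lag else "white"
```

White noise keeps almost all lags inside the band, while a time-correlated (colored) series pushes a long run of early lags outside it.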
Figure 4.17 shows the three PCA resulting time series classified as noise eigenresources,
with the corresponding autocovariance test for the determination of the
noise color. As can be seen in Figure 4.17(b.2) and Figure 4.17(c.2), the third
and sixth eigenresources show the typical behavior of white noise, with a sudden
fast degradation of the autocovariance function from its peak at lag 0. We can say that
the third and sixth dimensions vary completely randomly as a function of time. A
different behavior is manifested by the second eigenresource: the autocorrelation
function displayed in Figure 4.17(a.2) is typical of colored noise, with positive
values for a wide window of lags centered at sample 0.
[Figure: six panels showing the second, third and sixth eigenresources over one week (a.1, b.1, c.1) and the corresponding autocovariance tests over lags from -200 to 200 samples (a.2, b.2, c.2).]
Figure 4.17: Noise eigenresources and corresponding autocovariance functions.
Thanks to these results, we can divide the noise eigenresource class into two
subclasses on the basis of the type of noise characterizing the time series: the white
noise eigenresources subclass and the colored noise eigenresources subclass. All the
noise eigenresources showing an ACF that abruptly decreases as soon as it departs
from k = 0 belong to the former subclass. In the considered context, the third
and sixth eigenresources are assigned to this subclass. The colored noise subclass
comprises, instead, the noise eigenresources that do not vary completely randomly in
time, as the second eigenresource does.
This classification is of great interest since it allows us to separate noise contributions
that are somehow time-related and predictable from those that are
completely random and hard to model.
An accurate examination of the characteristics of the three behavioral classes
helps to better understand system characteristics and the contributions of the relevant
dimensions, and to choose suitable methods and algorithms for system management. In
the next section, we show how the three main classes of eigenresources (deterministic,
spike and noise) contribute to the generation of three representative visions
of the entire system behavior, which can be used as a reliable starting point to solve
the whole system analysis problem.
4.4 Extraction of representative eigenresources
In this section, we show how the understanding of the three classes of eigenresources,
in light of the previous results, can yield the generation of three representative
eigenresources to solve the whole system analysis.
This is the main innovation of the proposed PCA-based technique. We collect all
the contributions of the resource measures monitored in a complex Internet Data
Center into an extremely simplified representation that is able, in its simplicity,
to carry all the relevant information of the system. This representation comprises
only three time series, whose investigation replaces the complex and time-consuming
analysis of thousands and thousands of resource time series that, on their
own, do not give any reliable information about the whole system state.
To evaluate the relative impact of the three classes on the overall behavior of
the system, we collect all the contributions of the eigenresources belonging to the
same class into an aggregate vision. For every monitored resource measure, we
create three representative eigenresources, one for each type of behavior. In
particular, for the CPU utilization time series we compute:
1. a representative deterministic eigenresource, Rdeterministic, including all
common trend and seasonal components of the CPU utilization time series;
2. a representative spike eigenresource, Rspike, collecting short-lived spikes
and all the contributions due to occasional bursts and dips in system CPU
utilization;
3. a representative noise eigenresource, Rnoise, capturing the random variations
of CPU utilization of the Internet Data Center servers.
The three representative eigenresources come from the weighted sum of all
the eigenresources in the set u_i, i = 1, ..., 12, showing that type of behavior. Each
eigenresource contribution is weighted on the basis of the corresponding singular
value σ_i, as follows:

Rdeterministic = Σ_{i | u_i ∈ deterministic class} u_i σ_i    (4.5)

Rspike = Σ_{i | u_i ∈ spike class} u_i σ_i    (4.6)

Rnoise = Σ_{i | u_i ∈ noise class} u_i σ_i    (4.7)

where i ∈ [1, 12].
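Equations (4.5)-(4.7) translate directly into code. A minimal sketch, assuming the eigenresources are available as the columns of a matrix U, with their singular values in sigma and class labels from the previous classification step (all names here are hypothetical):

```python
import numpy as np

def representative_eigenresources(U, sigma, labels):
    """Weighted sums of Equations (4.5)-(4.7): for each behavioral
    class, sum the eigenresources u_i (columns of U) weighted by the
    corresponding singular values sigma_i."""
    sigma = np.asarray(sigma, dtype=float)
    reps = {}
    for cls in ("deterministic", "spike", "noise"):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        reps[cls] = U[:, idx] @ sigma[idx] if idx else np.zeros(U.shape[0])
    return reps
```

The singular-value weights make the dimensions that carry more energy dominate each representative time series, exactly as the equations prescribe.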
Through the singular value weights, the three representative eigenresources give
more importance to those dimensions contributing more energy to the overall
resource measure of the system. These representations integrate all the relevant
dimensions in the total energy of the Internet Data Center (12 in this case).
Assembling the three system visions is a simple procedure that leads to an important
result, quite original in Internet-based contexts.
The representative time series still preserve the characteristics of their constitutive
dimensions. Figure 4.18 shows the three representative eigenresources resulting
from the application of the PCA-based technique to the CPU utilization time series.
The representative deterministic eigenresource in Figure 4.18(a) is a comprehensive
representation of the systematic component of the system, since it collects
all relevant information and does not contain contributions due to eigenresources
belonging to the other behavioral classes. It reveals a strong seasonal
component, coming from the periodic contributions of all the deterministic eigenresources
of the system. It consists of all the linear or non-linear elements that
change over time and repeat within the time range captured by the data. Since
trend and seasonal contributions add up in a weighted way, the representative
deterministic vision maintains higher magnitudes than the spike and noise ones,
even though the spike dimensions outnumber the other two classes.
[Figure: three panels over one week (Mon-Sun), (a) Representative deterministic, (b) Representative spike, (c) Representative noise.]
Figure 4.18: Representative eigenresources.
The contributions of the system error components are carried by the representative
eigenresources of the noise and spike classes. The representative spike eigenresource
shows several isolated spikes departing from the time series mean value, collecting
all the occasional bursts detected by the spike eigenresources. As shown in
Figure 4.18(b), adding up all the contributions of the spike eigenresources spoils the
characteristics of occasional bursts and dips: spikes in the representative vision
become more frequent, even if still unrelated.
A different behavior can be appreciated in Figure 4.18(c): the representative
noise eigenresource maintains the roughly stationary behavior of its constituent
noise eigenresources and collects into one exhaustive time series all the random
variations in Internet Data Center CPU utilization.
Some application contexts, such as anomaly detection and time series forecasting,
may need more precise system representations taking into consideration the
behavioral subclasses discovered in Section 4.3.2. Figure 4.19 adds the contributions
given by the split of the spike and noise classes into their constitutive subclasses.
It shows the representative sporadic spike eigenresource in Figure 4.19(b.1) and
the representative recurrent spike eigenresource in Figure 4.19(b.2), as well as
the representative white noise eigenresource and the representative colored noise
eigenresource in Figure 4.19(c.1) and Figure 4.19(c.2), respectively.
Beyond this refinement, the main outcome of the PCA-based technique lies in
isolating, into only a few representative time series, all the deterministic patterns and
all the error components of the resource measures of the servers. In the next chapters
we demonstrate that the occurrence of something strange or unexpected in
Internet Data Center activity properly reflects in one or more of the representations.
This is an important outcome that strengthens the value of the proposed
PCA-based technique for system management. Accidental events reflect in the
system representations only in the case of relevant incidents impacting the whole
system state and functioning. Episodes influencing the state of only one or a few
servers but having no important consequence for Internet Data Center activity are
not manifested by the representative eigenresources. This is a demonstration that
the representative time series are more effective for whole system management
than the single monitored resource time series coming from the single servers,
which also carry information that may not be useful to understand the state of the
[Figure: five panels over one week (Mon-Sun), (a) Representative deterministic, (b.1) Representative sporadic spike, (b.2) Representative recurrent spike, (c.1) Representative white noise, (c.2) Representative colored noise.]
Figure 4.19: Representative eigenresources with spike and noise subclasses.
entire Internet-based system.
Several kinds of applications could benefit from this, for instance all sorts of
algorithms for system management that base their decisions on the evaluation
of the whole system state. Thanks to the proposed PCA-based technique, decisions
could be made working on a small input, that is, the three representative eigenresources,
instead of the high number of system resource time series. Among these
applications, in this work we consider some mechanisms extracting interesting
information about the past, the present and the future behavior of an Internet-based
system. In particular, we model the past system behavior and forecast its
future performance, with the goal of an efficient runtime management of the Internet
Data Center in the present. To this purpose, Chapter 5 and Chapter 6 report
an overview of the modeling and forecasting problems, respectively. We introduce
the commonly used parametric models to address these problems and evaluate
their performance in stochastic contexts, such as that of Internet Data Centers.
Modeling and forecasting applications are then addressed to the on-line analysis
of present system performance in Chapter 7, in order to make runtime decisions
for whole system management.
Chapter 5
Tracking models
In this chapter, we consider application contexts where it is important to model
the past behavior of the Internet Data Center for an efficient management, either
off-line or on-line. We consider trend extraction for the modeling of the whole
system state in a relevant past.
5.1 Trend extraction
Trend extraction is a useful method to characterize time series behavior in a significant
past. It clarifies increasing and decreasing tendencies, seasonal periodicities,
cyclical patterns and other deterministic components in the time series. Trend estimation
provides meaningful information that can be used to understand time
series behavioral patterns, or as input for further management purposes. In this
work, the trend estimates are used as input data for time series forecasting in
Chapter 6 and as a state representation for state change detection in Chapter 7.
Most runtime techniques for Internet-based systems management rely on models
built on predictable trends and periodicities, which are in turn isolated
from noise and spike influences. For these models, one of the main difficulties
is to isolate the underlying meaningful time series patterns from the trivial error
components.
Another problem of most existing decision algorithms is that they work separately
on time series coming from the monitoring of a single resource of a server,
and make decisions on the basis of single representations.

Figure 5.1: Third phase: modeling the system behavior in the past.

This approach requires the investigation of as many time series as there are resource measures on each
server. Hence, the trend algorithm must be applied to thousands and thousands
of time series, each one with its own behavior and its own parameters to set.
The proposed multi-phase methodology solves most of these problems
and reduces the complexity of whole system management at runtime.
First of all, the multi-phase methodology isolates the deterministic components
of the Internet Data Center servers from random errors. This is done without
any assumption about system characteristics or any previous off-line study on the
choice of the best model parameters.
Second, it works on only one Rdeterministic time series that assembles all the relevant
deterministic information of system resources, and diminishes the time spent
in analyzing input time series and in applying decision models to each one of
them.
Moreover, the statistical characteristics of the representative deterministic eigenresource
guide the choice of suitable management algorithms and the correct setting
of their parameters. This makes it possible to adapt runtime decision models
to the specific context of the information system under investigation.
Finally, decisions made on the basis of the representative vision conform
to the state of the whole system, and not to the behavior of a specific server or one
of its peculiar resource measures.
5.2 Problem definition
There are many techniques for tracking the trend of a time series. The trend
represents a general systematic linear or (most often) non-linear component that
changes over time and does not repeat, or at least does not repeat within the time
range captured by the data. Relatively simple techniques, such as simple means or
medians, can provide acceptable results in some contexts. When data are stochastic
or volatile, or when the early identification of turning points is critical, it is necessary
to use more sophisticated mathematical models, which fall into two main categories:
interpolation and smoothing techniques, as shown in Figure 5.2.
5.2.1 Interpolation techniques
An interpolation function reveals the trend T passing through a certain number p
of selected points {x_1, ..., x_p} belonging to the observed data set. It is a specific
case of curve fitting, in which a function f must go exactly through the p data
points:

T = f(x_j),   j = 1, ..., p    (5.1)

with x_j ∈ X_i. On the basis of the main characteristics of the f function, the
interpolation methods can be classified into two main classes: linear interpolation
and non-linear interpolation models.
Linear interpolation
Figure 5.2: Trend estimation techniques classification.
Linear interpolation is a method of curve fitting through an f function computing
linear polynomials. Typical examples of linear interpolation are the
piecewise constant interpolation and the simple regression.
Consider the example in Figure 5.3. Figure 5.3(a) displays the selected
points x_j belonging to the data set. In this simple example, p = 7. The
trend estimation faces the problem of approximating the value for a non-given
point x_k in some space, x_k ∉ {x_1, ..., x_p}, when given the values
of some points around it.
The simplest interpolation method is to locate the data point x_j ∈ {x_1, ..., x_p}
nearest to x_k, and to assign to x_k the same value, f(x_k) = f(x_j), as shown
in Figure 5.3(b). The horizontal black lines passing through the data points
compose the estimation of the trend T resulting from the application of the
piecewise constant interpolation technique. In one dimension, there are seldom
good reasons to choose this simple method over regression, which is
almost as cheap. However, in higher dimensional multivariate interpolation
this can be a favorable choice for its speed and simplicity.
An example of simple regression interpolation is given in Figure 5.3(c). Suppose we want to determine f(2.5). Since 2.5 lies midway between 2 and 3, it is reasonable to take f(2.5) midway between f(2) and f(3). The black line of Figure 5.3(c) estimates the trend T as straight continuous segments linking all the data set points x_j. Linear interpolation is quick and easy, but it is not precise. Another disadvantage is that the trend T is not differentiable at the points x_j.

Figure 5.3: Graphical example of linear interpolation techniques: (a) plot of the data points x_j; (b) piecewise constant interpolation; (c) simple regression interpolation.
All linear interpolation techniques have low computational costs and can
provide acceptable results when the data set is subject to linear trends [62].
On the other hand, when the data set is characterized by a non-stationary and highly variable behavior, linear interpolation is not a reliable technique for trend identification. In this context, non-linear interpolation gives better results.
Non-linear interpolation
Non-linear interpolation is a trend estimation technique able to model highly curved time series through non-linear polynomials. The linear models fit a straight line or a flat plane to the data samples. Usually, the true relationship that we want to model is curved, rather than flat. To fit it, we need non-linear models, such as polynomial and spline interpolations.

Given some points of the data set, polynomial interpolation techniques estimate the trend T through polynomials of degree higher than 1 passing through the points x_j. Referring to the previous example, the sixth-degree polynomial in Figure 5.4(a) goes through all the seven points x_j. Generally, if we have p data points, there is exactly one polynomial of degree at most p − 1 passing through all the data points. The interpolation error is proportional to the distance between the data points to the power p [17]. Furthermore, the interpolant is a polynomial and thus infinitely differentiable. So, we see that polynomial interpolation solves all the problems of simple regression. However, polynomial interpolation also has some disadvantages. Calculating the interpolating polynomial is computationally expensive compared to simple regression. Furthermore, polynomial interpolation may exhibit oscillatory artifacts, especially at the end points.
These disadvantages can be avoided through the spline interpolation model [104, 127], which uses low-degree polynomials in each of the intervals [x_j, x_{j+1}] and chooses the polynomial pieces such that they fit smoothly together. The resulting function is called a spline. Figure 5.4(b) shows the trend T estimated by a cubic spline, where the polynomial pieces are of degree 3. For instance, the cubic spline is piecewise cubic and twice continuously differentiable.
Figure 5.4: Graphical example of non-linear interpolation techniques: (a) polynomial interpolation; (b) spline interpolation.
Like polynomial interpolation, spline interpolation incurs a smaller error
than that of linear interpolation and the interpolant is smoother. Moreover,
the spline interpolant is easier to evaluate than the high-degree polynomials
used in polynomial interpolation. It also does not suffer from Runge’s phe-
nomenon [109]. Despite that, both non-linear techniques have high compu-
tational costs and are often inadequate to work in contexts with short-term
real-time requirements.
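A minimal sketch of polynomial interpolation (with the same invented sample points as before): with p = 7 points, the unique degree-6 polynomial passes exactly through all of them.

```python
import numpy as np

# Seven hypothetical data points, invented for the example
xs = np.arange(1.0, 8.0)
fs = np.array([0.5, -0.3, 0.8, 0.1, -0.6, 0.4, 0.9])

# The unique polynomial of degree p - 1 = 6 through all p points
coeffs = np.polyfit(xs, fs, deg=len(xs) - 1)
poly = np.poly1d(coeffs)

# The interpolant reproduces every data point (up to round-off)
print(np.allclose(poly(xs), fs))

# Between the data points the interpolant may swing well outside the
# data range, a hint of the oscillatory end-point artifacts noted above
print(poly(1.5))
```

Evaluating the fitted polynomial between the outermost points typically shows the oscillations that motivate the spline alternative.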
5.2.2 Smoothing techniques
A smoothing technique is a function that aims to capture important patterns in the
data set, while leaving out noise. Some common smoothing algorithms are the
moving average, the autoregressive models and the filtering theory.
Moving average
Moving average techniques smooth out the observed data set and reduce the
effect of out-of-scale values. They are fairly easy to compute at runtime and
are commonly used as trend indicators [81].
The most used moving average techniques are the Simple Moving Average (SMA) and the Exponential Weighted Moving Average (EWMA), which compute a uniform and a non-uniform weighted mean of the past measures, respectively. These techniques tend to introduce an excessive delay in trend
representation when the number of past measures is large, while they do
not eliminate all noises when working on a small set of past samples. The
problem of choosing the best past data set size can be addressed when the
time series are stable.
Autoregressive
Autoregressive models comprise a group of linear smoothing formulas that attempt to filter a time series on the basis of the previous raw and filtered samples. A model that depends only on the previous filtered samples is called an Auto-Regressive (AR) model, while a model depending only on the raw data samples is called a Moving Average (MA) model. A model based on both raw and filtered samples is an Auto-Regressive Moving Average (ARMA) model. These models are adequate for stationary time series.
When the data set shows evidence of non-stationarity, it is preferable to use the Auto-Regressive Integrated Moving Average (ARIMA) model, which is a generalization of the ARMA model. It provides an initial differencing step, corresponding to the "integrated" part of the model, applied to remove the non-stationarity of the time series. The ARIMA model has the advantage that few terms are needed to describe a wide variety of time series processes, fewer than AR and MA models [120].
ARFIMA and ARCH [50,82] are further accurate autoregressive techniques
useful in modeling time series with long memory or exhibiting time-varying
volatility clustering, that is, periods of swings followed by periods of relative calm.
Filtering theory
Filtering theory is useful to reveal trends in time series. Its purpose is to remove some unwanted component or feature from a signal.

Recursive filters re-use one or more of their outputs as an input. If both the time series and the unwanted error component are Gaussian and uncorrelated, there is an optimal recursive filter, namely the Kalman Filter. It is a
set of mathematical equations that provides an efficient computational (re-
cursive) means to estimate the state of a process, in a way that minimizes
the mean of the squared error. This filter is very powerful in several aspects:
it supports estimations of past, present, and even future states, even when
the nature of the modeled time series is unknown [18].
Discrete Wavelet Transforms (DWT) and Discrete Fourier Transforms (DFT) are two representative techniques based on the filtering theory. These techniques belong to a popular and computationally efficient family of multi-scale basis functions for the decomposition of a signal into levels or scales and for the extraction of a denoised data set representation [102]. In the DWT, the data set is passed through filters with different cut-off frequencies at different levels, while the DFT decomposes the time series into a sum of periodic harmonics. The main difference is that wavelets are localized in both time and frequency, whereas the standard Fourier transform is only localized in frequency. Wavelets often give a better representation of the data set trend and are computationally more efficient than the Discrete Fourier Transform: a DFT of length p takes on the order of p log_2 p operations, as compared to the approximately p operations required by a DWT [61].
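As a small sketch of filtering-based trend extraction, the following low-pass DFT filter keeps only the first few harmonics of a synthetic noisy series (the cut-off of 5 harmonics and the data are arbitrary choices for the example, not values used in the thesis):

```python
import numpy as np

rng = np.random.default_rng(42)
t = np.linspace(0, 1, 256, endpoint=False)
signal = np.sin(2 * np.pi * 2 * t)                  # slow trend component
noisy = signal + 0.3 * rng.standard_normal(t.size)  # stochastic measurements

# DFT low-pass: keep only the first few harmonics, zero the rest
spectrum = np.fft.rfft(noisy)
spectrum[5:] = 0.0                                  # arbitrary cut-off for the sketch
trend = np.fft.irfft(spectrum, n=t.size)

# The reconstruction is much closer to the clean trend than the raw data
print(np.mean((trend - signal) ** 2) < np.mean((noisy - signal) ** 2))
```

Zeroing high-frequency bins discards most of the noise power while preserving the slow harmonics that carry the trend, which is the essence of the denoising step described above.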
Figure 5.5 shows the results of applying smoothing techniques to a stochastic and highly variable time series. An example of trend estimation is given for each category of smoothing techniques. The black line in Figure 5.5(b) represents the trend resulting from the Exponential Weighted Moving Average computed on the gray time series of Figure 5.5(a). In this example, the EWMA model considers 10 past measures. It produces a spiky and reactive representation of the data set, following all the variabilities of the time series, even the smallest ones. Similar results are achieved by the ARIMA(1,1,1) model in Figure 5.5(c). This autoregressive technique tracks the data set and smooths out only the major fringes of variability, thus resulting in a fluctuating representation strongly dependent on the values of the data samples. On the other hand, the DWT technique in Figure 5.5(d) cuts out almost all time series variability, thus resulting in the smoothest representation. This filtering technique represents well the overall trend of the data set and removes almost all the variability of the time series.

Figure 5.5: Graphical examples of smoothing techniques: (a) plot of the time series; (b) EWMA smoothing; (c) ARIMA smoothing; (d) DWT smoothing.
It is unreasonable to define which smoothing technique better estimates a
trend because the performance of each model must be related to the appli-
cation context and the time series characteristics. These requirements guide
the choice of the technique for trend estimation that best fits our interests,
and, equally important, the choice of the parameter values suitable to our
purposes. A careful setting of the number and values of the model parameters is fundamental to time series trend estimation.

In the next sections, we detail some interpolation and smoothing techniques, with a particular emphasis on their implementation and the parameters they depend on. These techniques are used in the following applications of this study as time series representations for state change detection and for time series forecasting.
5.3 Interpolation estimators
We discuss the simple regression, chosen among the linear interpolation techniques, and the cubic spline, as a non-linear interpolation technique.
5.3.1 Simple Regression (SR)
Simple regression fits a straight line through the set of the n past monitored sample values of the data set X_i, that are, X_{i,n} = [x_{i−(n−1)}, . . . , x_{i−1}, x_i]. Thus, the simple regression trend estimation SR(X_{i,n}) is computed as follows:

    SR(X_{i,n}) = α_i x_i + β_i    (5.2)

where the coefficient α_i is equal to the degree of variation between the first and the last sample of the data set X_{i,n}, that is:

    α_i = (x_i − x_{i−(n−1)}) / n    (5.3)

while β_i is set as:

    β_i = x_{i−(n−1)} − α_i n    (5.4)

as suggested in [8].
An SR-based trend estimator evaluates a new SR(X_{i,n}) value for each measure x_i collected during the observation period. The number of considered past samples n is a parameter of the interpolation model, hence hereafter we use the notation SR_n to indicate a simple regression tracker based on n past measures. Since simple regression models linear trends, it risks being inefficient when the data set is characterized by a non-stationary and highly variable behavior. Cubic splines are typically used to overcome this limit.
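The SR_n tracker can be sketched as follows. Note that this is only an illustrative variant: it fits the straight line by ordinary least squares over the window indices rather than with the endpoint-based coefficients of Eqs. (5.2)-(5.4), and the data are invented.

```python
import numpy as np

def sr_tracker(window):
    """Simple-regression trend value for the latest sample of the window.

    Fits a least-squares straight line over the n past samples and
    evaluates it at the most recent index (a least-squares variant of
    the endpoint-based coefficients of Eqs. (5.2)-(5.4)).
    """
    n = len(window)
    idx = np.arange(n)
    alpha, beta = np.polyfit(idx, window, deg=1)  # slope and intercept
    return alpha * (n - 1) + beta                 # line evaluated at the last index

# On perfectly linear data, the tracker reproduces the last value
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(sr_tracker(x))   # ≈ 5.0
```

On a stochastic window the returned value is the denoised, linearly extrapolated level of the most recent sample rather than the raw measure itself.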
5.3.2 Cubic Spline (CS)
An empirical analysis induces us to consider the cubic spline function [104], in the version proposed by Forsythe et al. [56]. This choice is motivated by the observation that lower-order spline curves (that is, with a degree less than 3) do not react quickly enough to time series changes, while spline curves with a degree higher than 3 are unnecessarily complex, introduce undesired ripples and are computationally too expensive to be applied in runtime contexts.
To define cubic spline functions, let us choose some control points (t_j, x_j) in the set of measured data values, where t_j is the measurement time of the sample x_j. A cubic spline function CS_J(t), based on J control points, is a set of J − 1 piecewise third-order polynomials p_j(t), where j ∈ [1, J − 1], that satisfies the following properties.
Property 1. The control points are connected through third-order polynomials:

    CS_J(t_j) = x_j,   j = 1, . . . , J
    CS_J(t) = p_j(t),  t_j < t < t_{j+1},  j = 1, . . . , J − 1
    (5.5)

Property 2. To guarantee a C^2 behavior at each control point, the first- and second-order derivatives of p_j(t) and p_{j+1}(t) are set equal at time t_{j+1}, ∀j ∈ {1, . . . , J − 2}:

    dp_j(t_{j+1})/dt = dp_{j+1}(t_{j+1})/dt
    d^2 p_j(t_{j+1})/dt^2 = d^2 p_{j+1}(t_{j+1})/dt^2
    (5.6)
If we combine Properties 1 and 2, we obtain the following definition for CS_J(t):

    CS_J(t) = [z_{j+1}(t − t_j)^3 + z_j(t_{j+1} − t)^3] / (6 h_j)
            + (x_{j+1}/h_j − (h_j/6) z_{j+1})(t − t_j)
            + (x_j/h_j − (h_j/6) z_j)(t_{j+1} − t)    (5.7)

∀j ∈ {1, . . . , J − 1}, where h_j = t_{j+1} − t_j and the x_j are the measured values. The z_j coefficients are obtained by solving the following system of equations:

    z_1 = 0
    h_{j−1} z_{j−1} + 2(h_{j−1} + h_j) z_j + h_j z_{j+1} = 6[(x_{j+1} − x_j)/h_j − (x_j − x_{j−1})/h_{j−1}],  j = 2, . . . , J − 1
    z_J = 0
    (5.8)
The spline-based trend estimation model CS(X_{i,n}), at time t_i, is defined as the cubic spline function CS_J(t_i), obtained through a subset of J control points belonging to the vector X_{i,n} of n past sample measures. We denote it as CS_n.

Although the cubic spline load tracker has two parameters and is computationally more expensive than linear interpolation techniques, it is commonly used in approximation and trend extraction contexts [51, 104, 127]. The cubic spline has the advantage of being reactive to load changes, and it is independent of the time series characteristics. Its computational complexity is compatible with runtime decision systems, especially if we choose a small number of control points J.
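The construction of Eqs. (5.5)-(5.8) can be sketched as follows: a natural cubic spline on a handful of invented control points, with the tridiagonal system of Eq. (5.8) solved by a dense solver for brevity.

```python
import numpy as np

def natural_cubic_spline(t, x, tq):
    """Evaluate at tq the natural cubic spline through control points (t[j], x[j])."""
    t, x = np.asarray(t, float), np.asarray(x, float)
    J = len(t)
    h = np.diff(t)                                # h_j = t_{j+1} - t_j
    # Second derivatives z_j from the tridiagonal system of Eq. (5.8),
    # with natural boundary conditions (z at both ends set to zero)
    z = np.zeros(J)
    if J > 2:
        A = np.zeros((J - 2, J - 2))
        b = np.zeros(J - 2)
        for j in range(1, J - 1):
            r = j - 1
            A[r, r] = 2.0 * (h[j - 1] + h[j])
            if r > 0:
                A[r, r - 1] = h[j - 1]
            if r < J - 3:
                A[r, r + 1] = h[j]
            b[r] = 6.0 * ((x[j + 1] - x[j]) / h[j] - (x[j] - x[j - 1]) / h[j - 1])
        z[1:-1] = np.linalg.solve(A, b)
    # Locate the interval containing tq and evaluate Eq. (5.7)
    j = int(np.clip(np.searchsorted(t, tq) - 1, 0, J - 2))
    d0, d1 = tq - t[j], t[j + 1] - tq
    return ((z[j + 1] * d0**3 + z[j] * d1**3) / (6.0 * h[j])
            + (x[j + 1] / h[j] - h[j] * z[j + 1] / 6.0) * d0
            + (x[j] / h[j] - h[j] * z[j] / 6.0) * d1)

# Hypothetical control points: the spline interpolates them exactly
t = [0.0, 1.0, 2.0, 3.0]
x = [0.0, 1.0, 0.0, 1.0]
print(natural_cubic_spline(t, x, 1.0))   # ≈ 1.0
```

Between the control points the piecewise cubics join with matching first and second derivatives, which is exactly the C^2 requirement of Property 2.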
5.4 Smoothing estimators
Among smoothing techniques for time series denoising, we first consider the class
of moving averagemodels. Moving averages are commonly used as trend indi-
cators [44, 81, 122], since they smooth out observed data, reduce the effect of
out-of-scale values and are fairly easy to compute at runtime. We consider two
classes of moving average algorithms (Simple Moving Average(SMA) andEx-
ponential Weighted Moving Average(EWMA) ) and some popular linear autore-
gressive models (Auto-Regressive(AR) andAuto-Regressive Integrated Moving
Average(ARIMA) ).
5.4.1 Simple Moving Average (SMA)
Simple Moving Average is the unweighted mean of the n past monitored sample values of the data set X_i, that are, X_{i,n} = [x_{i−(n−1)}, . . . , x_{i−1}, x_i]:

    SMA(X_{i,n}) = ( Σ_{i−(n−1) ≤ j ≤ i} x_j ) / n    (5.9)

An SMA-based trend estimator evaluates a new SMA(X_{i,n}) value for each measure x_i collected during the observation period. The number of considered past samples n is a parameter of the smoothing model, hence hereafter we use the notation SMA_n to indicate a Simple Moving Average tracker based on n past measures. Since the Simple Moving Average assigns an equal weight to each of the considered past data values, this model tends to introduce a significant delay in the time series representation, especially when the size of the subset X_{i,n} increases. Exponential Moving Average models are usually applied with the purpose of limiting this delay effect.
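A minimal SMA_n tracker, sketched on a synthetic series invented for the example:

```python
import numpy as np

def sma(window):
    """SMA(X_{i,n}): unweighted mean of the n past samples, Eq. (5.9)."""
    return float(np.mean(window))

# Sliding evaluation: one SMA value per newly collected measure
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])
n = 3
trend = [sma(x[i - (n - 1): i + 1]) for i in range(n - 1, len(x))]
print(trend)   # [4.0, 6.0, 8.0, 10.0]
```

The output lags the raw series by roughly (n − 1)/2 samples, which is the delay effect discussed above.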
5.4.2 Exponential Weighted Moving Average (EWMA)
Exponential Weighted Moving Average is the weighted mean of the n past monitored sample values, X_{i,n}, where the weights assigned to the samples decrease exponentially. An EWMA-based load tracker EWMA(X_{i,n}), at time t_i, is equal to:

    EWMA(X_{i,n}) = α x_i + (1 − α) EWMA(X_{i−1,n})    (5.10)

where the parameter α = 2/(n + 1) is the smoothing factor.

The initial value EWMA(X_{n,n}) is set to the arithmetic mean of the first n measures:

    EWMA(X_{n,n}) = ( Σ_{1 ≤ j ≤ n} x_j ) / n    (5.11)

Similarly to the SMA model, the number n of considered past data values is a parameter of the EWMA model, hence with EWMA_n we denote an Exponential Weighted Moving Average based on n past measures.
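The EWMA_n recursion of Eqs. (5.10)-(5.11) can be sketched as follows (the series is invented for the example):

```python
import numpy as np

def ewma_track(x, n):
    """EWMA_n over a series x: Eq. (5.10) with alpha = 2/(n + 1),
    initialized to the mean of the first n measures, Eq. (5.11)."""
    alpha = 2.0 / (n + 1)
    est = float(np.mean(x[:n]))            # EWMA(X_{n,n})
    out = [est]
    for xi in x[n:]:
        est = alpha * xi + (1.0 - alpha) * est
        out.append(est)
    return out

x = [1.0, 3.0, 2.0, 4.0, 10.0]
print(ewma_track(x, 3))   # [2.0, 3.0, 6.5]
```

Because recent measures receive exponentially larger weights, the tracker reacts to the final jump to 10.0 faster than an SMA of the same window size would.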
5.4.3 Auto-Regressive (AR)
The Auto-Regressive model is a weighted linear combination of the past p observed data values of the vector X_i, that are, X_{i,p} = [x_{i−(p−1)}, . . . , x_{i−1}, x_i]. An AR-based trend estimation model, at time t_i, can be written as:

    AR(X_{i,p}) = φ_1 x_i + . . . + φ_p x_{i−(p−1)} + e_i    (5.12)

where e_i ∼ WN(0, σ^2) is an independent and identically distributed sequence (called the residuals sequence). x_i, . . . , x_{i−(p−1)} are the data samples weighted by p linear coefficients, φ_1, . . . , φ_p, which are estimated from the first p values of the auto-correlation function computed on the X_i vector. The order p of the AR process is determined by the lag at which the partial autocorrelation function becomes negligible [21, 75]. It is a parameter of the AR model, hence with AR(p) we denote an autoregressive tracker based on p values.
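A sketch of AR coefficient estimation from the autocorrelation structure. Here the φ coefficients are obtained by solving the Yule–Walker equations built from the sample autocovariances, one standard way of relating them to the auto-correlation function (the AR(1) test process is synthetic):

```python
import numpy as np

def yule_walker(x, p):
    """Estimate AR(p) coefficients phi_1..phi_p via the Yule-Walker equations."""
    x = np.asarray(x, float) - np.mean(x)
    n = len(x)
    # Sample autocovariances r_0 .. r_p
    r = np.array([x[: n - k] @ x[k:] / n for k in range(p + 1)])
    # Toeplitz system R phi = r[1:]
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, r[1:])

# Synthetic AR(1) process with phi = 0.8: the estimate should land close to it
rng = np.random.default_rng(0)
x = np.zeros(2000)
for t in range(1, 2000):
    x[t] = 0.8 * x[t - 1] + rng.standard_normal()
print(yule_walker(x, 1))
```

The estimated coefficient converges to the true φ as the window grows, which is why AR trackers need a sufficiently long history to stabilize their parameters.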
5.4.4 Auto-Regressive Integrated Moving Average (ARIMA)
The Auto-Regressive Integrated Moving Average model is obtained by differencing d times a non-stationary sequence and by fitting an ARMA model, which is composed of the auto-regressive model (AR(p)) and the moving average model (MA(q)). The moving average part is a linear combination of the past q residual terms, e_i, . . . , e_{i−q} [21, 75]. An ARIMA model can be written as:

    ARIMA(X_{i,p,d,q}) = φ_1 x_i + . . . + φ_{p+d} x_{i−(p+d−1)} + θ_0 e_i + . . . + θ_q e_{i−q}    (5.13)

where θ_0, . . . , θ_q are linear coefficients.

An ARIMA model is guided by three parameters. Thus, we use the notation ARIMA(p,d,q), where p is the number of considered past values in the data set, q is the number of residual terms, and d is the number of differencing steps. An ARIMA model requires frequent updates of its parameters when the characteristics of the data set change.
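The "integrated" part of ARIMA is simply repeated differencing. The short sketch below (on an invented trending series) shows that one differencing step (d = 1) removes a linear trend, after which an ARMA model could be fitted to the stationary remainder:

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(500, dtype=float)
series = 0.05 * t + rng.standard_normal(500)   # linear trend + noise: non-stationary

diffed = np.diff(series)                       # the d = 1 differencing step

# The slope of the original series is visible; after differencing it is gone
slope_before = np.polyfit(t, series, 1)[0]
slope_after = np.polyfit(t[1:], diffed, 1)[0]
print(slope_before, slope_after)
```

In an actual ARIMA(p,1,q) fit, the AR and MA coefficients would then be estimated on `diffed` and the forecasts re-integrated by cumulative summation.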
5.5 Quantitative performance analysis
Trend estimation models should be evaluated in terms ofcomputational costand
estimation quality. For on-line decision contexts, we consider acceptable only
trend estimators having a computational complexity compatible with runtime requirements. In this section, we compare the computational cost of the described trend estimation models, in order to evaluate their efficacy for the on-line management of Internet Data Centers. We also report some interesting results about the quality of trend estimators applied in several experiments.
5.5.1 Computational cost
We evaluate the CPU time required by each described model to compute a new
value of the trend representation, in order to evaluate the possibility of applying it
to a runtime environment. Collected times do not include the system and commu-
nication times that are necessary to fill the observed data set.
The results evaluated on an average PC machine and reported in Table 5.1 refer
to a realistic system subject to heavy service demand, but they can be considered
representative of any workload. Computational costs are estimated for different
numbers of past samples (n) considered by the models. Behind the choice of
the parameters of the AR and ARIMA models there is an evaluation of the auto-
correlation and partial auto-correlation functions as in [21, 75]. For this analysis,
we choose the AR(32) and ARIMA(1,0,1) models as the best parameter settings for the considered workload. The table demonstrates that the computational cost of all the considered trend estimation models is compatible with runtime constraints, because all the models have a CPU time below 10 msec.
          n = 30   n = 60   n = 90   n = 120   n = 240
SR         0.462    0.448    0.456     0.461     0.494
CS         2.100    3.426    4.242     6.231    12.215
SMA        0.560    1.039    1.461     1.990     3.785
EWMA       0.059    0.059    0.059     0.059     0.059
AR         5.752    5.978    5.998     6.070     6.417
ARIMA      7.233    7.536    7.765     7.228     8.141
Table 5.1: CPU time (msec) for the computation of a trend value.
These results lead us to consider the previously described models adequate to support runtime decision systems in stochastic and highly variable workload scenarios.
5.5.2 Estimation quality
The evaluation of the trend estimation quality requires a representation of the effective time series trend, against which to compare the one estimated by the model. Due to the stochasticity of the time series, the simple mean is not a good indicator of the central tendency of the data set [81], hence we prefer to evaluate the effective time series trend as the approximate confidence interval CI = [T^U, T^L] [19]. It is an indicator of the approximate central tendency of the time series in specific periods of relative stability of the observed data set. T^U and T^L represent the upper bound and the lower bound, respectively, of this central tendency, and thus limit the region inside which the trend estimates should fall.

Since in our experiments we control the load generators, it is possible to compute off-line the periods of relative stability, considered as the time intervals during which we generate the same number of user requests, that is, we have the same number of active emulated browsers. For the estimation quality evaluation, we consider the data set shown in Figure 5.6, where the horizontal lines represent the upper T^U and lower T^L bounds of the approximate confidence interval.
Thanks to this definition, the estimation quality of the models can be computed
in terms of accuracy and responsiveness.
Accuracy
Accuracy evaluates the capacity of having small oscillations around the approximate confidence interval. The higher the accuracy, the better the model tracks the trend of the time series.
The accuracy error of a trend estimation model is the sum of the distances between each estimated value l_i computed at time t_i, i = 1, . . . , n, and the corresponding upper T^U_i or lower T^L_i bound of the approximate confidence interval at time t_i.

Figure 5.6: Example of time series and approximate confidence interval.

It is computed as:

    Σ_{i=1}^{n} d_i    (5.14)

where

    d_i = l_i − T^U_i,  if l_i > T^U_i
    d_i = T^L_i − l_i,  if l_i < T^L_i
    d_i = 0,            otherwise
    (5.15)
The accuracy error corresponds to the sum of the vertical distances between
each estimated value outside the approximate confidence interval and the
approximate confidence interval bounds.
For the sake of comparing different trend estimation models, we prefer to use a normalized value, such as the relative accuracy error. As a normalization factor, we consider the accuracy error of the observed data set. The relative accuracy error for any acceptable trend estimation model lies between 0 and 1; otherwise, the trend model is considered completely inaccurate and discarded.
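The accuracy error and its normalized form can be sketched as follows (the toy trend values and confidence bounds are invented for the example):

```python
import numpy as np

def accuracy_error(l, tu, tl):
    """Sum of the distances d_i of Eq. (5.15) from the interval [T^L_i, T^U_i]."""
    l, tu, tl = map(np.asarray, (l, tu, tl))
    d = np.where(l > tu, l - tu, np.where(l < tl, tl - l, 0.0))
    return float(d.sum())

# Hypothetical bounds and estimated trend values
tu = np.array([1.0, 1.0, 1.0, 1.0])
tl = np.array([0.0, 0.0, 0.0, 0.0])
trend = np.array([0.5, 1.4, -0.2, 0.8])     # one overshoot, one undershoot

err = accuracy_error(trend, tu, tl)
print(err)   # 0.4 + 0.2 = 0.6 (up to round-off)

# Relative accuracy error: normalize by the error of the raw data set
raw = np.array([0.5, 2.0, -1.0, 0.8])
rel = err / accuracy_error(raw, tu, tl)
print(0.0 <= rel <= 1.0)
```

Values inside the interval contribute nothing, so a tracker that oscillates but stays inside the bounds is still considered perfectly accurate by this metric.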
Responsiveness
Responsivenessevaluates the capacity of reaching as soon as possible the
representative load interval. It is a temporal requirementthat aims to repre-
sent the ability of a trend estimation model to quickly adaptitself to signif-
icant load variations.
Let tk, 1 ≤ k ≤ n, denote the time at which the representative trend exhibits
a new stable load condition that is associated to a significant change in the
number of users. (For example, in the data set shown in Figure5.6, we have
five changes andk ∈ C = {200, 340, 500, 700, 820}.) A model is more
responsive when its curve reaches the new approximate confidence interval
as soon as possible. LettK denote the instant in which the estimated trend
reaches for the first time one of the borders of the approximative confidence
interval associated to a new load condition.
The responsiveness errorof a trend estimation model is measured as the
sum of the horizontal differences between the initial instant tk of the new
load condition and the corresponding timetK at which the estimated trend
reaches the new interval. That means:
∑k∈C
|tk − tK | (5.16)
For reasons of comparison, we normalize the sum of the time delays by the
total number of changes, thus obtaining arelative responsiveness error.
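The responsiveness error of Eq. (5.16) can be sketched as follows; the change instants mimic those of Figure 5.6, while the arrival times of the estimated trend are invented:

```python
def responsiveness_error(change_times, arrival_times):
    """Eq. (5.16): sum over the change instants t_k of |t_k - t_K|,
    plus its normalization by the number of changes (relative error)."""
    delays = [abs(tk - tK) for tk, tK in zip(change_times, arrival_times)]
    return sum(delays), sum(delays) / len(delays)

# Hypothetical load changes and the times at which a tracker first
# re-enters the approximate confidence interval after each change
t_k = [200, 340, 500, 700, 820]
t_K = [215, 352, 510, 730, 828]

total, relative = responsiveness_error(t_k, t_K)
print(total, relative)   # 75 15.0
```

A smoother tracker (larger n) would show larger arrival delays and hence a larger relative responsiveness error, which is exactly the trade-off discussed below.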
We run several experiments to compare the accuracy of the trend estimation models and to evaluate their responsiveness for different settings of the model parameters. We carried out a large set of experiments, and in Figure 5.7 we graphically report a subset of the results, aware that the main conclusions of the experiments are representative of the typical behavior of the trend estimation models.
The SR, SMA and EWMA trend estimators are characterized by an interesting trade-off between accuracy and responsiveness. Working on a small (n ≤ 30) or large (n ≥ 200) amount of past samples causes a lower estimation quality than the one achieved by intermediate-size vectors (30 ≤ n ≤ 200). The motivations are different: for small values of n, the poor quality is caused by a high accuracy error, due to excessive oscillations of the estimated trend. For large values of n, instead, the low quality is the effect of a high responsiveness error, due to excessive delays of the estimates in reaching the approximate confidence interval.
For example, the SMA_30 curve in Figure 5.7(c.1) soon touches the representative load interval, but its accuracy is low because of many oscillations. On the other hand, the SMA_240 curve in Figure 5.7(c.2) is highly smoothed, but it follows the real load with too much delay, causing in this case a poor responsiveness. Similar results are achieved by the SR and EWMA models with n = 30 and n = 240. Better results are achieved working on an intermediate number of past samples. The best quality is reached by the SR_90 and EWMA_90 curves in Figures 5.7(a.1) and (d.1), which follow the approximate confidence interval more regularly, guaranteeing the best trade-off between accuracy and responsiveness. The AR and ARIMA models show low accuracy due to their jittery nature, as shown in Figures 5.7(e) and (f). The cubic spline model has a quite interesting behavior, because working on larger sets of n past samples leads to a monotonic improvement of the CS accuracy. Comparing Figures 5.7(b.1) and (b.2), we can appreciate how, for n = 240, the curve follows the approximate confidence interval much better than the cubic spline for n = 30, which scatters much more.
A comparison of all the results collected, and not reported here for reasons of space, shows that the AR and ARIMA models have the lowest accuracy. The best results of the SR, EWMA and SMA models are comparable, and are all achieved working on a set of n = 90 observed past values. Their accuracy is even better than that of the best cubic spline model, that is, CS_240.
The large set of experiments carried out on the models for the evaluation of their trend estimation quality leads to several conclusions that are interesting for Internet Data Center management.
First, there exists a clear relationship between the dispersion (that is, the standard deviation) of the observed data set and the choice of the best model parameters. A high dispersion of the observed data set, such as that of the heavy service demand, requires trend estimators working on a higher number n of observed past samples. On the other hand, the amount of past samples needed to obtain a precise trend estimate decreases when the workload causes a lower dispersion of the observed data set. The proposal of a theoretical methodology to find the "best" parameter for any trend estimation model, any workload and any application is out of the scope of this thesis. However, a large set of experimental results points out the existence of a set of feasible parameter values that guarantee an acceptable performance of the trend extraction models. This range of feasible values depends on the standard deviation of the observed data set.
Second, all the considered models are affected by a trade-off between the capacity of reaching as soon as possible the real behavioral trend of a time series, and that of having small oscillations around it. The two quality properties are in conflict, hence the perfect trend model with optimal accuracy and responsiveness does not exist. This trade-off can be solved only by considering the goals of the applications of the trend models. A runtime decision system that must take immediate actions may prefer a highly reactive trend estimator at the price of some inaccuracy. This is the case of trend estimates used as state representations for state change detection (see Section 7.1). On the other hand, when an action has to be carefully evaluated, a decision system prefers an accurate trend model even if it is less reactive. This is the choice in the case of the detection of collective anomalies (see Section ??).
Figure 5.7: Trend curves with respect to the approximate confidence interval: (a.1) SR_90 and (a.2) SR_240; (b.1) CS_30 and (b.2) CS_240; (c.1) SMA_30 and (c.2) SMA_240; (d.1) EWMA_90 and (d.2) EWMA_240; (e) AR(32); (f) ARIMA(1,0,1).
Chapter 6
Forecasting models
We now consider time series prediction models, which are oriented to forecasting the expected performance of an Internet Data Center. This chapter formalizes the problem and gives an overview of state-of-the-art models suitable for on-line forecasting in Internet-based contexts.
6.1 Time series prediction
On-line time series prediction is a classic problem for the estimation of the future load behavior and for guiding management decisions in complex Internet-based infrastructures.

Prediction models work on an ordered set of historical information. We define the historical information at sample i as an ordered collection of r data, S[r]_i, that starts at time t_{i−(r−1)}, covering measures up to a final time t_i, that is:

    S[r]_i = {s_j},  i − (r − 1) ≤ j ≤ i    (6.1)

where the i-th element is a pair s_i = (f_i, t_i). The first element of the pair, f_i, denotes the time series information, which can correspond to the monitored raw data or to a filtered representation of it. The second element of the pair, t_i, indicates its occurrence time.

A predicted value at step i is the output of a function conditioned on S[r]_i:

    f_{i+k} = g(S[r]_i) + ε_i    (6.2)
Figure 6.1: Third phase: forecasting the system behavior in the future.
in which g() is the function capturing the predictable component of the data set, ε_i models the possible noise, and k denotes the number of future steps to predict, that is, the so-called prediction window.
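A toy instance of Eq. (6.2) may help fix ideas. Here g() is a hypothetical drift extrapolator that projects the mean increment of the window k steps ahead; the function name and data are invented for the sketch:

```python
import numpy as np

def drift_predict(history, k):
    """Toy g(S[r]_i): extrapolate the average step of the window k steps ahead."""
    history = np.asarray(history, float)
    drift = (history[-1] - history[0]) / (len(history) - 1)   # mean increment
    return history[-1] + k * drift

# Window of r = 5 filtered measures, prediction window k = 3
s = [10.0, 11.0, 12.0, 13.0, 14.0]
print(drift_predict(s, 3))   # 17.0
```

Any of the models surveyed below can be read as a more sophisticated choice of g(), trading computational cost for a better treatment of the noise term ε_i.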
Different approaches have been proposed to perform time series forecasting in computer environments, ranging from simple heuristics to sophisticated modelling frameworks. We cite just the most important classes: linear time series models [44], neural networks [119], wavelet analysis [102], Support Vector Machines [38], and Fuzzy systems [119]. The choice of the most appropriate prediction model depends on the nature of the time series, on the amount of available a-priori knowledge, on the required forecasting accuracy, as well as on the requirements of the application context.
Most prediction models are designed for off-line applications. This is the case of genetic algorithms, neural networks, SVMs and Fuzzy systems, which may achieve a valid prediction quality only after long execution times. More complex prediction techniques, such as Kalman filtering [71], rely on parameters whose identification proves to be difficult in practical settings, particularly when no a-priori knowledge on the time series is available. Hence it is difficult or impossible to use them in dynamic runtime environments such as Internet-based systems.
The literature on time series prediction proposes many models to support on-line prediction. Each model has been developed to work best in a specific application context on the basis of the statistical properties of the time series under examination, such as variability, correlation, non-stationary or non-deterministic behavior. On-line prediction models can use different statistical methodologies to estimate their parameters. Choosing an adequate methodology for parameter estimation is crucial for the performance of prediction models, since this choice impacts not only the prediction quality, but also the computational cost of the prediction models. Consequently, the methodology used to estimate model parameters can limit the applicability of a prediction model to the different application contexts.

On the basis of the parameter estimation, we can distinguish static and adaptive prediction models.
Static models
A static prediction is characterized by a static choice of the model parameters. This means that the selection of the number of parameters is not optimized for every time series. Static solutions have a low impact on the computational cost of the prediction models and, for this reason, they are typically used in application contexts having short-term time requirements.
Adaptive models
An adaptive prediction model computes dynamically the number and the value of its parameters, in order to optimize its performance. It is especially useful in non-stationary and variable application contexts, where the prediction model needs to dynamically modify its parameters at every change in time series behavior. Choosing the best parameters improves the prediction quality at the price of a higher computational cost.
The possibility of applying a predictor in an on-line way decreases with the flexibility of the model or, equivalently, with the number of necessary parameters. Therefore, the more flexible the model, the less usable it is in practice. Besides that, as we consider time series evolving in time, all the model parameters must be updated on-line during the time series evolution. There exists a trade-off between the ability of a model to properly fit the signal and the number of parameters to compute at each update. An efficient trade-off is achieved by a wide range of on-line prediction models developed to forecast the behavior of internal resource measures of Internet-based applications.
We describe some forecasting models by distinguishing static and adaptive
estimation of their parameters.
6.2 Prediction models
We consider six main classes of time series forecasting models based on historical information that can be adapted to runtime contexts [20]: Simple Regression (SR) and Cubic Spline (CS) are based on interpolation trend estimation; Exponential Weighted Moving Average (EWMA), Holt's model (Holt's), Auto-Regressive (AR) and Auto-Regressive Integrated Moving Average (ARIMA) are based on smoothing trend estimators.
The models considered in this work lack the learning capabilities of other, more complex prediction algorithms, but in a runtime decision context it is mandatory to achieve good (not necessarily optimal) predictions quickly, rather than looking for the optimal decision in an unpredictable amount of time.
6.2.1 Simple Regression (SR)
A simple regression prediction k steps ahead at time t_i is equal to:

f̂_{i+k} = α_i k + β_i    (6.3)
where the coefficients α and β of the equation are chosen differently in the static and in the dynamic implementation of the model.

In the static-SR, the coefficient α_i is equal to the degree of variation between the first and the last sample of the data set S[r]_i, that is:

α_i = (f_i − f_{i−(r−1)}) / r    (6.4)

while β_i is set as:

β_i = f_{i−(r−1)} − α_i r    (6.5)
as suggested in [8].
This prediction method intercepts two points, (f_i, t_i) and (f_{i−(r−1)}, t_{i−(r−1)}), that are statically chosen in the data set S[r]_i. The simplicity of the model guarantees a very low prediction cost. The static-SR prediction quality is good when the data set is stable or is subject to long-term variations. On the other hand, when the data set is characterized by short-term variations, the SR model tends to overestimate the changes of the data set values, with a consequent low prediction quality.
Among the several adaptive-SR models proposed in the literature, we consider the Baryshnikov et al. model [14]. In this implementation, the coefficients α_i and β_i are dynamically chosen in order to minimize the mean quadratic deviation Σ_{j=i−(r−1)}^{i} [f_j − f̂_j]² between the data set S[r]_i and the predicted data set Ŝ_i. That means:

α_i = Σ_{j=i−(r−1)}^{i} (f_j − E[S[r]_i])(f̂_j − E[Ŝ_i]) / Σ_{j=i−(r−1)}^{i} (f_j − E[S[r]_i])²    (6.6)

β_i = E[Ŝ_i] − α_i E[S[r]_i]    (6.7)

where E[S[r]_i] and E[Ŝ_i] are the means of the time series values and of the predicted time series values, respectively.
The parameter optimization makes it possible to overcome the limits of the static-SR model, providing reliable predictions also when the time series changes its behavior frequently, but at the price of a higher complexity, leading to an increase of the model computation cost.
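As a sketch of the adaptive idea, one adaptive-SR step can be written with an ordinary least-squares fit over the time index of the window. Note this is a simplification we introduce for illustration: the thesis's Eqs. (6.6) and (6.7) regress against the predicted series, while here the slope and intercept come from a standard regression on the window itself:

```python
def adaptive_sr(window, k):
    """One adaptive-SR step (sketch): least-squares line over the r samples
    in the window, extrapolated k steps past the last sample.
    Index j runs 0..r-1 inside the window."""
    r = len(window)
    xs = list(range(r))
    mx = sum(xs) / r
    my = sum(window) / r
    num = sum((x - mx) * (y - my) for x, y in zip(xs, window))
    den = sum((x - mx) ** 2 for x in xs)
    alpha = num / den                 # slope, cf. alpha_i
    beta = my - alpha * mx            # intercept, cf. beta_i
    return alpha * (r - 1 + k) + beta

print(adaptive_sr([1.0, 2.0, 3.0, 4.0], k=2))  # 6.0: a perfect line extrapolates exactly
```

Re-estimating alpha and beta at every sample is what makes the model adaptive, and also what raises its cost with respect to the static variant.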
6.2.2 Cubic Spline (CS)
For the definition of the cubic spline model, let us choose P control points f_p, where p ∈ [1, P − 1], that are equally spaced samples of the data set S[r]_i. A CS model based on P control points is a set of P − 1 piecewise third-order polynomials that models the data set for a k-step ahead prediction as follows:

f̂_{i+k} = [z_{p+1}(i + k − p)³ + z_p(p + 1 − i − k)³] / (6 h_p)
         + (f_{p+1}/h_p − (h_p/6) z_{p+1})(i + k − p)
         + (f_p/h_p − (h_p/6) z_p)(p + 1 − i − k)    (6.8)

∀p ∈ {1, . . . , P − 1}, where h_p is the number of samples in the data set S[r]_i comprised between the control points f_p and f_{p+1}.
The z_p coefficients are obtained by solving the following system of equations:

z_0 = 0
h_{p−1} z_{p−1} + 2(h_{p−1} + h_p) z_p + h_p z_{p+1} = 6((f_{p+1} − f_p)/h_p − (f_p − f_{p−1})/h_{p−1})
z_n = 0    (6.9)
The CS prediction model is obtained through a subset of P control points from a data set S[r]_i of length r, and has the advantage of being reactive to changes in the data set behavior. Its computational complexity is compatible with on-line decision systems, especially if we choose a small number of control points P.

The static-CS is based on a constant number of control points that, in order to guarantee a low computational cost, must be low, as suggested in [8]. Since the quantity of information used by the CS predictor depends on the number of control points and on their position in the data set, the static-CS risks being unreliable, especially in non-stationary application contexts.
The adaptive-CS dynamically estimates the optimal number of control points and their position in the data set. This solution is based on the methodology presented in [59], which provides a control point sequence able to create the best interpolation of the data set S[r]_i. The adaptive-CS is particularly useful in non-stationary contexts characterized by a non-linear trend of the time series behavior.
Spline interpolation is the most suited method in highly variable (both in mean and variance) contexts. However, the computational cost of the adaptive-CS (which increases with the number of control points) risks limiting its applicability in those contexts having short-term time requirements.
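The tridiagonal system of Eq. (6.9) can be solved in linear time with the Thomas algorithm. The following pure-Python sketch (an illustration, not the thesis implementation) computes the z_p curvature coefficients under the natural boundary conditions z_0 = z_{P−1} = 0:

```python
def natural_spline_z(f, h):
    """Solve the tridiagonal system of Eq. (6.9) for the z_p coefficients of a
    natural cubic spline, given control values f[0..P-1] and spacings h[p].
    Pure-Python Thomas algorithm (forward elimination + back substitution)."""
    P = len(f)
    # Right-hand side for the interior points p = 1..P-2
    d = [6.0 * ((f[p + 1] - f[p]) / h[p] - (f[p] - f[p - 1]) / h[p - 1])
         for p in range(1, P - 1)]
    a = [h[p - 1] for p in range(1, P - 1)]                 # sub-diagonal
    b = [2.0 * (h[p - 1] + h[p]) for p in range(1, P - 1)]  # main diagonal
    c = [h[p] for p in range(1, P - 1)]                     # super-diagonal
    n = len(d)
    for i in range(1, n):            # forward elimination
        m = a[i] / b[i - 1]
        b[i] -= m * c[i - 1]
        d[i] -= m * d[i - 1]
    z_int = [0.0] * n
    if n:                            # back substitution
        z_int[-1] = d[-1] / b[-1]
        for i in range(n - 2, -1, -1):
            z_int[i] = (d[i] - c[i] * z_int[i + 1]) / b[i]
    return [0.0] + z_int + [0.0]     # natural boundary: z_0 = z_{P-1} = 0

# Control points lying on a straight line have zero curvature everywhere:
print(natural_spline_z([0.0, 1.0, 2.0, 3.0], [1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0, 0.0]
```

With the z_p in hand, evaluating Eq. (6.8) on the last segment gives the k-step extrapolation; the O(P) cost of this solve is exactly why the adaptive-CS becomes expensive as the number of control points grows.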
6.2.3 Exponential Weighted Moving Average (EWMA)
EWMA models predict the future value k steps ahead as a weighted average of the last available sample f_i and the previously predicted value f̂_i:

f̂_{i+k} = γ f_i + (1 − γ) f̂_i    (6.10)

where γ is called the smoothing factor.
The static-EWMA sets γ as follows:

γ = 2 / (r + 1)    (6.11)

where r is the size of the data set S[r]_i [94].
This choice leads to a simple linear algorithm characterized by a very low prediction cost. Its accuracy depends on the data set characteristics: in stable conditions, it exhibits a good prediction quality; when the data set is unstable, the prediction quality decreases accordingly. Besides the accuracy problem, another main issue is that the static-EWMA model generates a future value with a delay that is proportional to the size of the considered data set S[r]_i. This problem may prevent a valid EWMA application in runtime contexts that require reactive predictions.
In the adaptive-EWMA, the dynamic estimation at sample i of the parameter γ_i is:

γ_i = 2Φ / (σ²_{f̂_i} + Φ)    (6.12)

where Φ is the accepted noise component (estimated in terms of the variance of the modeled data set) and σ²_{f̂_i} is the on-line estimation of the process variance [101].
This dynamic choice of γ_i improves the model accuracy and limits the delay problem affecting the static-EWMA. The adaptive-EWMA is very useful in those application contexts with a variable noise component of the time series that have to guarantee the desired prediction performance under every noise condition.
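A minimal sketch of both EWMA variants follows. The static branch uses γ = 2/(r+1) from Eq. (6.11); the adaptive branch re-estimates γ_i at each step from a plain sample variance over the last r points, a simplification we assume in place of the on-line variance estimator behind Eq. (6.12):

```python
def ewma_predict(series, r, phi=None):
    """EWMA recursion of Eq. (6.10), returning the final smoothed prediction.
    phi=None  -> static gamma = 2/(r+1)           (Eq. 6.11)
    phi=value -> adaptive gamma_i = 2*phi/(var+phi) (simplified Eq. 6.12)"""
    gamma_static = 2.0 / (r + 1)
    pred = series[0]
    for i, f in enumerate(series):
        if phi is None:
            gamma = gamma_static
        else:
            window = series[max(0, i - r + 1): i + 1]
            mu = sum(window) / len(window)
            var = sum((x - mu) ** 2 for x in window) / len(window)
            gamma = 2.0 * phi / (var + phi)   # adaptive smoothing factor
        pred = gamma * f + (1.0 - gamma) * pred
    return pred

print(ewma_predict([10.0] * 20, r=5))  # stays at 10.0 on a constant series
```

The delay problem of the static variant is visible in this form: with γ fixed, a level shift in `series` needs on the order of 1/γ ≈ (r+1)/2 steps before `pred` catches up.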
6.2.4 Holt's Model (Holt's)
Holt's model is an extension of the EWMA one and is often used when the time series exhibits a linear trend. At step i, a prediction of a value k steps ahead is computed as:

f̂_{i+k} = l_i + b_i k    (6.13)
where l_i and b_i are recursively computed as follows:

l_i = ν f_i + (1 − ν)(l_{i−1} + b_{i−1})    b_i = η(l_i − l_{i−1}) + (1 − η) b_{i−1}    (6.14)

In the static-Holt's, starting values for these recursions are often set to:

l_{i−(r−1)} = f_{i−(r−1)}    b_{i−(r−1)} = f_{i−(r−2)} − f_{i−(r−1)}    (6.15)
The parameters ν and η are constants that are statically chosen in the ranges 0 ≤ ν ≤ 1 and 0 ≤ η ≤ 1 [68]. The prediction quality and the limits of the static-Holt's are quite similar to those of the static-EWMA. The static-Holt's suffers from high delays for increasing values of the parameter ν. Moreover, the Holt's prediction quality is particularly conditioned by the noise component of the data set.
The adaptive-Holt's provides a versatile solution: it estimates the model parameters by minimizing the conditional likelihood (see [68] for additional details). This adaptive methodology guarantees a dynamic support to forecast time series that exhibit non-linear behaviors and that are subject to a variable noise component.
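The level/trend recursions of Eqs. (6.13)-(6.15) translate directly into code. The constants ν and η below are illustrative choices, not values prescribed by the thesis:

```python
def holts_predict(series, k, nu=0.5, eta=0.3):
    """Static-Holt's sketch: run the level (l) and trend (b) recursions of
    Eq. (6.14) over the series, then extrapolate k steps via Eq. (6.13)."""
    l = series[0]                 # level init, Eq. (6.15)
    b = series[1] - series[0]     # trend init, Eq. (6.15)
    for f in series[1:]:
        l_prev = l
        l = nu * f + (1 - nu) * (l + b)          # level update
        b = eta * (l - l_prev) + (1 - eta) * b   # trend update
    return l + b * k              # k-step-ahead forecast, Eq. (6.13)

# On an exact linear trend the recursions lock onto slope 1:
print(holts_predict([float(x) for x in range(10)], k=3))  # ~12.0
```

Because the forecast is l_i + b_i k rather than a flat l_i, Holt's model keeps following a ramp k steps out, which plain EWMA cannot do.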
6.2.5 Auto-Regressive (AR)
A k-step ahead prediction through an AR model is a weighted linear combination of p values. These p values are constituted by the k − 1 predicted values f̂_{i+k−1}, . . . , f̂_{i+1} coming from the previous k − 1 steps, and by p − k values of the data set (f_i, . . . , f_{i−(p−k)}). These values are weighted by p linear coefficients ϕ_1, . . . , ϕ_p, which are the first p values of the auto-correlation function evaluated on S[r]_i. The order p of the AR process is defined by a statistical test based on the partial auto-correlation function that is described in [22, 75]. The last element of the AR model is the component ε_i, which is obtained as a function of the residual sequence (see [22] for additional details).

Hence, an AR-based predictor at step i can be written as:

f̂_{i+k} = ϕ_1 f̂_{i+(k−1)} + · · · + ϕ_p f_{i−(p−k)} + ε_i    (6.16)
When the data set is stable, the AR model represents an appreciable solution to the trade-off between prediction cost and prediction quality [44].

The static-AR computes the model parameters ϕ_1, . . . , ϕ_p through an initial training on a subset of the entire experiment data. The static-AR quality risks being low in highly variable scenarios [22] where the characteristics of the time series change over time.

The adaptive-AR, instead, uses a continuous update of the AR parameters at every prediction step. Updating ϕ for every new value f_{i+k} allows the model to capture the non-stationary and highly variable behavior that characterizes many system resources [28].
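For illustration only, the simplest instance p = 1 can be written down directly: the single coefficient ϕ_1 reduces to the lag-1 autocorrelation of the mean-centered window, and the k-step forecast iterates the one-step recursion. This is a simplified sketch of Eq. (6.16) that ignores the residual term ε_i:

```python
def ar1_fit_predict(series, k):
    """Adaptive-AR sketch for p = 1: re-fit phi_1 as the lag-1 autocorrelation
    of the current window, then iterate f <- phi_1 * f for k steps
    (on the mean-centered series), cf. Eq. (6.16) with the residual omitted."""
    n = len(series)
    mu = sum(series) / n
    c = [x - mu for x in series]                       # mean-centered window
    denom = sum(x * x for x in c)
    phi1 = sum(c[t] * c[t - 1] for t in range(1, n)) / denom
    f = c[-1]
    for _ in range(k):                                 # k-step-ahead iteration
        f = phi1 * f
    return f + mu

print(ar1_fit_predict([1.0, -1.0, 1.0, -1.0, 1.0, -1.0], k=1))  # 5/6 = 0.8333...
```

Re-running the fit on every new window is the adaptive-AR behavior described above; the static-AR would freeze phi1 after an initial training window.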
6.2.6 Auto-Regressive Integrated Moving Average (ARIMA)
A k-step ahead prediction through an ARIMA model is obtained by differencing d times the non-stationary sequence of the filtered values in S[r]_i and by fitting an Auto-Regressive Moving Average (ARMA) model, composed of the auto-regressive (AR) model described in Equation 6.16 and a moving average (MA) model.

The moving average part is a linear combination of the past (q − k) noise terms, e_i, . . . , e_{i−(q−k)}, weighted by the linear coefficients ϑ_1, . . . , ϑ_{q−k} [22, 75]. Hence, an ARIMA model is usually denoted as ARIMA(p, d, q), where p is the number of considered time series values, q − k is the number of residual values, and d is the order of differencing.
An ARIMA model for a k-step ahead prediction can be written as:

f̂_{i+k} = ϕ_0 + ϕ_1 f̂_{i+(k−1)} + · · · + ϕ_{p+d} f_{i−p−d+k} + ϑ_1 e_i + · · · + ϑ_{q−k} e_{i−(q−k)}    (6.17)

The ARIMA prediction model requires a careful choice of the model parameters, which is typically based on the evaluation of the auto-correlation and partial auto-correlation functions of the time series [22].
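The "integrated" part of the model is just repeated differencing; a minimal sketch (illustrative, not the thesis code) shows how the order d removes polynomial trend before the ARMA fit:

```python
def difference(series, d=1):
    """The 'I' step of ARIMA: difference the series d times so that a
    polynomial trend of degree d is removed before fitting ARMA."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

print(difference([1.0, 3.0, 6.0, 10.0], d=1))  # [2.0, 3.0, 4.0]
print(difference([1.0, 3.0, 6.0, 10.0], d=2))  # [1.0, 1.0]
```

After forecasting on the differenced series, the inverse operation (a cumulative sum starting from the last observed values) maps the prediction back to the original scale.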
The static-ARIMA evaluates the parameters considering an initial subset of the experiment data, and on this subset computes the statistical analysis of the auto-correlation and partial auto-correlation functions. This static solution has some difficulty predicting accurate values when the data set is extremely variable.

The adaptive-ARIMA improves on the static-ARIMA performance with a continuous re-evaluation of its parameters at every prediction of a new value f_{i+k}. This adaptive solution leads to a better prediction quality, but at the price of a high computational cost that risks limiting the applicability of the adaptive-ARIMA in many application contexts, such as short-term ones, which have time requirements in the order of seconds.
6.3 Quantitative analysis
We evaluate the performance of the two classes of prediction models in the case of static or dynamic parameter estimation. The combination of models and parameter choices yields the following four classes of approaches.
Static prediction on Data set: SD-policy
The raw data observed at different time scales do not receive any data treatment and are directly modeled by a predictor based on a static configuration of its parameters. This policy is often used in application contexts with data sets showing a stable behavior with a low noise component.
Static prediction on Trend estimation: ST-policy
The sampled data first undergoes a trend extraction treatment based on the on-line Discrete Wavelet Transform (DWT), described in [95], since it is considered the optimal solution to remove noise from the signal in stochastic contexts such as Internet-based applications. Subsequently, the new data representation is predicted using models based on a static parameter estimation. The ST-policy is useful for application contexts showing undesirable effects of the noise component and that do not change their statistical properties over time.
Adaptive prediction on Data set: AD-policy
No data treatment is applied to the observed data set, which is directly predicted by adaptive models that dynamically update their parameters on the basis of the statistical properties of the time series. This solution can be adopted in application contexts subject to a low noise component and non-stationary behavior.
Adaptive prediction on Trend estimation: AT-policy
The data treatment based on the on-line DWT is applied to the observed data set to minimize its noise component and to eliminate the presence of outliers. The new data representation is predicted using adaptive prediction models. The AT-policy is particularly indicated for the on-line prediction of non-stationary and stochastic measurements.
The performance evaluation of the prediction models under the different policies enables us to select those models that are effectively applicable to the considered context, characterized by non-deterministic and non-stationary time series, stringent time constraints and short- or medium-term predictions. The performance is evaluated in terms of the computational cost and the prediction quality of the model.
6.3.1 Computational cost
Computational cost evaluates the CPU time required by each prediction model to estimate a new value for one system resource. In the case of the ST- and AT-policies, the computational cost also includes the time spent for trend extraction.

The rows of Table 6.1 show the CPU time spent by the SR, CS, EWMA, Holt's, AR and ARIMA models to predict one value of one system resource measure using the prediction policy indicated in the first column.
The first row of Table 6.1 reports the computational costs of the different static models applied to monitored time series. Thanks to its distinctively low computational cost, the SD-policy is particularly adequate in application contexts with short-term time requirements.

The data treatment phase required by the ST-policy brings a significant increase of the computational costs of all the predictors. Using the on-line DWT, the cost of prediction increases by 23.1 msec, as reported in the second row of Table 6.1.
The application of adaptive models increases the computational complexity of all predictions, as evidenced by the computational costs in the third and fourth rows. This increase depends on the statistical criteria used by each model to dynamically estimate its parameters. The SR, Holt's, AR and ARIMA models are the most affected by the cost of their parameter updates.
Policy   SR      CS      EWMA    Holt's   AR      ARIMA
SD       1.78    2.10    0.05    1.44     5.97    7.53
ST       24.88   25.20   23.15   24.54    29.07   30.63
AD       22.42   2.91    3.48    21.11    45.33   114.56
AT       45.52   26.01   26.58   44.21    68.43   137.66
Table 6.1: CPU time (msec) of prediction models and policies.
If the requirement is to predict just a few values, the computational cost of any considered model is compatible with the time constraints of short-term and medium-term application contexts. However, we must point out that these models are typically applied to complex clusters consisting of hundreds of nodes, where the state of each node may require observations of several internal resources. In these contexts, the application of the AT-policy for short-term prediction is critical or impossible.
Since a trade-off between computational cost and prediction quality exists, we expect that the high CPU time spent by dynamic policies or complex models will be offset by a higher quality of the prediction.
6.3.2 Prediction quality
Prediction quality estimates the ability of the model in time series forecasting by considering two important features: the precision in modeling the future data samples in hypothetical stable conditions, and the adaptability in following time series variations when conditions are unstable.
We measure the following metrics to evaluate the prediction quality:

Prediction Error (PE)
The PE computes the mean distance between an ideal representation of the data set and the predicted values:

PE = Σ_{i=1}^{N} (f̂_i − f*_i)² / N    (6.18)

where N is the total number of predictions, f̂_i is the predicted value and f*_i is the off-line filtered data set standing for the reference representation generated from the observed raw data. As suggested in [95], we consider the DWT-based filter for the off-line representation. PE takes into account the precision in forecasting the future sample values and the ability to follow the possibly variable conditions of the data set.
Prediction Interval (PI)
The PI, widely described in [19], estimates the interval in which future observations will fall with a certain probability c. Through this metric, a predicted value f̂_{i+k} at sample i + k is associated to a prediction interval [l_{i+k}, u_{i+k}] by the following equation:

c = Pr(l_{i+k} ≤ f_{i+k} ≤ u_{i+k})    (6.19)

where l_{i+k} is the lower limit and u_{i+k} is the upper limit of the prediction interval.

Considering a probability c = 0.95, the lower and the upper limits at sample i + k are defined as follows:

l_{i+k} = f̂_{i+k} − 1.96 σ_i/√r    u_{i+k} = f̂_{i+k} + 1.96 σ_i/√r    (6.20)
where σ_i is the standard deviation at sample i of the predicted data set, F̂[r]_i = (f̂_{i−(r−1)}, . . . , f̂_i), and r is the size of the data set.

Comparing prediction intervals obtained under the same probability value c, we can argue that larger prediction intervals are associated to less reliable prediction models.
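Both metrics are direct to compute; the following sketch (the function names are ours, not the thesis's) implements Eq. (6.18) and the c = 0.95 limits of Eq. (6.20):

```python
import math

def prediction_error(pred, ref):
    """PE of Eq. (6.18): mean squared distance between the predicted values
    and the off-line filtered reference representation."""
    n = len(pred)
    return sum((p - r) ** 2 for p, r in zip(pred, ref)) / n

def prediction_interval(f_hat, sigma, r, z=1.96):
    """95% PI of Eq. (6.20): f_hat +/- 1.96 * sigma / sqrt(r)."""
    half = z * sigma / math.sqrt(r)
    return f_hat - half, f_hat + half

print(prediction_error([1.0, 2.0, 3.0], [1.0, 2.0, 4.0]))  # 1/3 = 0.3333...
lo, hi = prediction_interval(10.0, sigma=2.0, r=16)
print(lo, hi)  # 9.02 10.98
```

Note the two metrics pull in opposite directions in the experiments below: a policy can have a narrow PI (reliable-looking band) yet a high PE if its predictions are systematically delayed.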
In Figure 6.2 we give an example of time series prediction and the corresponding prediction interval. The continuous gray line represents the data samples, while the black one represents the predicted values. The black dotted lines delimit the prediction interval.
Figure 6.2: Example of time series prediction and corresponding prediction interval (c = 0.95).
We evaluate the PE of the prediction models for every considered policy by setting a prediction window k = 10. Table 6.2 reports the prediction errors computed on the time series coming from the monitoring of the CPU utilization of a database server of the Internet Data Center considered as testbed in this work. The results confirm that the AT-policy has the lowest prediction error for any prediction model. This result is the consequence of the combined use of data treatment and adaptive prediction, which allows the model both to reduce the noise component of the signal and to adapt its parameters to the data set behavior. The other policies have higher prediction errors because of a static choice of the model parameters (e.g., SD-policy and ST-policy) or because they also consider perturbing information due to outliers and noise in their prediction (e.g., SD-policy and AD-policy).
Policy   SR      CS      EWMA    Holt's   AR      ARIMA
SD       0.135   0.127   0.102   0.124    0.130   0.128
ST       0.033   0.037   0.031   0.035    0.035   0.032
AD       0.054   0.047   0.027   0.046    0.510   0.040
AT       0.028   0.032   0.021   0.027    0.320   0.029
Table 6.2: PE of the prediction policies, k = 10.
To better understand the prediction quality of the models, we combine the previous results with those in Table 6.3, which reports the prediction intervals of the models. The data trend extraction phase leads to lower prediction intervals, as evinced by the PI performance of the ST-policy and AT-policy. Policies that envisage a filtering step before the forecasting phase generate more faithful predictions. On the other hand, noise and outliers deteriorate the prediction reliability of all the considered models under the SD-policy and AD-policy, which work on raw data sets and show PI values around 20%.
Policy   SR       CS       EWMA     Holt's   AR       ARIMA
SD       19.54%   20.20%   22.49%   21.44%   22.32%   23.54%
ST       7.02%    5.35%    7.12%    7.98%    7.52%    7.41%
AD       15.24%   16.32%   18.99%   16.86%   19.29%   17.91%
AT       5.90%    3.04%    5.25%    6.01%    6.34%    6.16%
Table 6.3: PI of the prediction policies, k = 10.
An evident trade-off between prediction quality and computational cost exists. Even though the ST-policy and AT-policy always guarantee a better prediction quality (both in terms of PE and of PI) than the policies without data treatment, they have a higher computational complexity that can limit their usage in application contexts with strong time constraints.
6.4 Performance analysis
We present the behavior of the different prediction policies in a realistic scenario. We test the considered models on the prediction of the time series coming from the monitoring of three days of CPU utilization of a database server of the Internet Data Center under examination.
The time series is displayed in Figure 6.3 (gray line), together with the result of the on-line data treatment and the ideal representation of the time series. The continuous black line is the filtered data representation coming from a trend extraction treatment based on the on-line DWT. It represents the result of the pre-filtering step of the ST-policy and AT-policy; the SD-policy and AD-policy, instead, work directly on the gray un-treated line. All policy prediction results are compared to the black dotted line, that is, the ideal time series representation, for the computation of the prediction error. The closer the predicted values are to that dotted line, the smaller the prediction error and the more reliable the predictor.
Figure 6.3: Raw and treated CPU utilization time series.
The difference between the four approaches can be graphically evinced in Figure 6.4 and Figure 6.5. They report the off-line filtered curve based on the DWT filter (gray line), the predicted curve (continuous black line) and the prediction interval (black dotted lines) for the four prediction policies, SD-policy, ST-policy, AD-policy and AT-policy, all using the Holt's prediction model.
(a) SD-policy    (b) AD-policy

Figure 6.4: Holt's prediction model, k = 10, on raw data set.
(a) ST-policy    (b) AT-policy

Figure 6.5: Holt's prediction model, k = 10, on trend estimation.
The differences between treated and un-treated policies are evident: the ST-policy and the AT-policy in Figure 6.5 (a) and (b) reach more accurate predictions than those in Figure 6.4, thanks to a very small prediction interval. Much higher PIs are obtained by the policies that do not extract the data trend before the prediction step. Since they base the prediction on a stochastic data set affected by great oscillations, the SD-policy and AD-policy in Figures 6.4 (a) and (b) have a low prediction quality due to larger prediction intervals.
Despite that, it is of crucial importance to observe the evident prediction delay introduced by the on-line treatment in Figures 6.5 (a) and (b). The ST-policy and AT-policy guarantee excellent PIs at the cost of relevant prediction errors due to a visible forward shift of the predicted values. The prediction follows the on-line filtered line very well, but it refers to a delayed representation that does not correspond to the actual ideal behavior. This causes a shift between predicted and ideal values that increases the prediction error and thus affects the performance of the ST-policy and AT-policy. On the other hand, bypassing the data treatment phase avoids introducing undesired delays in the prediction. The SD-policy and AD-policy in Figures 6.4 (a) and (b) predict the future behavior in a timely way, guaranteeing that the future time series values will fall into the prediction interval.
Chapter 7
Runtime models
This chapter deals with models for the runtime management of Internet Data Centers. We show how the three representative eigenresources coming from the proposed PCA-based technique can be used as an input for on-line decision algorithms, which benefit from working on a small set of data and on reliable representations of the whole system state.

In this chapter we consider two important on-line problems: the state change detection and anomaly detection problems.
7.1 State change detection
One of the key goals for supporting adaptive and self-adaptive mechanisms is the ability to detect one or several changes in some characteristic properties of the system. Many runtime management decisions related to Internet-based services are activated after a notification that a significant load variation has occurred in some system resource(s). Request redirection, process migration, access control and limitation are some examples of processes that are activated after the detection of a significant and non-transient system state change. We call a state change any modification in the data sets of the system that occurs either instantaneously or rapidly with respect to the sampling period and that lasts for a significant number of consecutive measurements.
State change detectors working at runtime have an unquestioned central role in system management, because the immediate detection of relevant changes in the state of monitored processes is crucial for efficacious decisions.

Figure 7.1: Third phase: analyzing the system behavior in the present.

Structural defect, damage detection and novelty detection techniques detect structural anomalies in the system, e.g., defects due to unforeseen circumstances, unexpected user attacks, and sudden deviations (increases or decreases) of system activity. State change detection techniques try to detect changes in the data collected from a structure. Typical measures and their representations tend to be stable over time and might have spatial correlations. When the data suddenly depart from their stable characteristics or lose their typical correlation, we assume that something anomalous has occurred in the system. We want to detect only relevant state changes, not minor instabilities.
7.1.1 Problem definition
The on-line detection of a relevant state change refers to techniques and methodologies that are able to decide whether such a change occurs. This is usually done by evaluating the statistical characteristics (e.g., mean and variance) of the samples flowing from monitors, possibly after passing through on-line filters extracting the trend of the monitored samples.
On-line state change detection in highly variable contexts is addressed through two steps: state representation and detection rule.
State representation
The huge number of existing state representation models can be classified on the basis of how they treat the time series generated by the monitored process. We distinguish representation techniques in two main classes: with process models or without process models.

The former techniques require a preliminary knowledge of the statistical properties of the time series (e.g., the Kalman Filter [63], the Sequential Monte Carlo Method [77] and Bayesian Bootstrap Filtering [7]) or an empirical evaluation of them (e.g., Particle Filtering [41, 80]). In this work, we cannot adopt representations based on process models because the non-deterministic and stochastic behavior of the considered time series prevents the possibility of knowing or evaluating its statistical characteristics on-line.
The second class of representation algorithms is based on interpolation and smoothing filtering methods, such as those presented in Chapter 5. The most common solutions use linear filtering, such as mean filtering and exponential smoothing [94]. However, we will see that these models are not effective for state change detection in the considered stochastic and highly variable contexts. Non-linear filtering techniques should be more suitable to these contexts because they are able to reduce the time series noise and to guarantee a reliable detection quality.
Detection rule
In the literature there are detection rules strictly related to a prior knowledge about all the states that characterize the monitored process, and models that use a detection rule quite independent of this information. We prefer to follow this latter approach, and choose the Cumulative Sum (Cusum) statistical model as our fundamental detection rule [15, 94].

Other on-line detection rules are based on machine learning algorithms that have to acquire some initial knowledge about the time series, especially about the probability of state changes and their distributions (e.g., Principal Component Analysis, neural networks [84]). Other widely adopted methods use one or more thresholds for detecting relevant state changes [107]. Both these alternative classes of detection rules seem unsuitable to the highly variable context of Internet Data Center resources. Nevertheless, we consider a threshold-based model for comparison purposes.
Let Xi be the time series of monitored samples. The state of a time series is
a reference representation of the monitored process that is dynamically updated
on the basis of the stability of the statistical characteristics of the time series Xi.
A relevant state change corresponds to a significant variation of the statistical
characteristics of the time series Xi [15].
The problem we consider in this application is to detect at runtime the occur-
rence of relevant state changes, with a minimum or null rate of false detections in
non-stationary and highly variable contexts. This is done on the basis of a statis-
tical change detection rule that is typically obtained by comparing, at each sample
i, the test statistic gi with a characteristic statistical threshold H.
The state change may occur in an increasing or decreasing direction on the
basis of the following rule:

    detection rule =  increasing change,  if gi ≥ H
                      decreasing change,  if gi ≤ −H
                      no change,          otherwise        (7.1)
The test statistic gi is an indicator of the detection model: during a stable
state it should be close to zero, and it departs from zero when a change in the time
series occurs. The choice of the characteristic threshold H has been extensively
addressed in the quality control literature [94] and depends on the test statistic gi
and on the required performance [126].
In Figure 7.2, we give an example of the issues that may affect a state change
detection model. The time series is composed of N = 600 samples, shown by the
spiked curve. The data profile, denoted by the continuous line, shows 8 relevant
state changes at samples 50, 150, 200, 250, 315, 440, 475, 540. The vertical line
with a circle at the bottom denotes a false detection, that is, a signaled change that
does not correspond to a real state change. The vertical line with a cross at the
bottom denotes a right detection.
[Plot: sample values vs. samples (0-600); legend: time series, representative state, true detection, false detection.]
Figure 7.2: The problem of detecting relevant state changes.
This figure shows the two main problems when the time series is highly vari-
able:
• several false detections, as evidenced by 10 lines with a circle at the bottom;
• the absence of detections of relevant state changes, as at sample 315.
This preliminary analysis confirms that a highly variable behavior of the time series
limits the detection quality of existing models, which do not achieve good perfor-
mance especially when applied to time series characterized by the stochastic and
highly variable behavior typical of Internet Data Center management. We propose
a new on-line model for state change detection that uses an on-line wavelet-based
filtering for state representation and an adaptive implementation of the Cusum as
its detection rule.
7.1.2 Wavelet Cusum state change detection model
The performance of change detection algorithms is highly dependent on the statis-
tical structure of the measured time series [94]. Because of the inherent variability
and non-stationary behavior of many monitored processes, such as computer sys-
tem resource usage, we have found that the direct use of standard detection tech-
niques, e.g., the Cusum algorithm, on the original time series, as reported by system
monitors, yields extremely poor results due to the high variance of the time series.
For change detection, we find it convenient to consider not the original time
series Xi, but a so called rectified time series Yi = [y1, . . . , yi] [95]. Yi retains
the significant features of Xi but removes (most of) the variability that can be
ascribed to noise and to short-term oscillations of resource usage. We regard Yi as
the state representation of the monitored process and we apply change detection
algorithms to this time series Yi.
Wavelet-based representation
Linear low pass filtering techniques, such as mean filtering and exponential
smoothing, are the most commonly used methods to remove a noisy compo-
nent from a time series, because of their simplicity and the fact that they can
be used on-line. In the case of exponential smoothing, also known as exponentially
weighted moving average (EWMA), we have at sample i:

    yi = α·xi + (1 − α)·yi−1        (7.2)

where α, 0 ≤ α ≤ 1, is a weighting factor (which is related to the cutoff
frequency of the low pass filter).
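As an illustration, the EWMA recursion of Equation (7.2) can be sketched in a few lines of Python. The choice α = 0.3 and the initialization of the filter with the first sample are our own illustrative assumptions, not prescribed in this chapter:

```python
def ewma(samples, alpha=0.3):
    """Exponential smoothing (Eq. 7.2): yi = alpha*xi + (1 - alpha)*yi-1."""
    smoothed = []
    y = samples[0]  # assumed initialization: start from the first sample
    for x in samples:
        y = alpha * x + (1 - alpha) * y
        smoothed.append(y)
    return smoothed
```

A small α yields a smoother representation (a lower cutoff frequency), at the cost of a slower reaction to genuine state changes.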
Simplicity comes at a significant cost: since linear filtering methods basi-
cally remove all frequencies above a cutoff value, in the resulting smoothed
representation Yi we remove not only noise, but also some significant fea-
tures of the time series, e.g., abrupt changes. For change detection purposes
this translates into false detections or significant detection delays.
Following the approach presented in [95], we propose the use of wavelet-
based filtering/rectification. The wavelet transform [89] has emerged as a
powerful tool for statistical time series analysis. The wavelet transform
represents a time series x as the sum of shifted and scaled versions of a
base wavelet function ψ and shifted versions of a low pass scale function φ.
With a proper choice of the wavelet and scale functions, the resulting families
of functions

    ψmk(n) = √(2^−m) ψ(2^−m·n − k)        (7.3)
    φmk(n) = √(2^−m) φ(2^−m·n − k)        (7.4)

where m and k are the dilation and translation parameters, respectively,
form an orthonormal basis. A time series Xi can be conveniently rewritten
as follows:
    xi = Σ(k=1..n·2^−L) aLk φLk(i) + Σ(m=1..L) Σ(k=1..n·2^−m) dmk ψmk(i)        (7.5)

where aLk is the k-th scaling function coefficient at the coarsest scale L,
dmk is the k-th wavelet coefficient at scale m, and n is the time series length.
The coefficients aLk and dmk are computed by inner products of x with the base
functions. Computation of the transform and its inverse can be done in
O(n). As indicated in [95], in our implementation we set the coarsest scale
L equal to 5 if the time series is perturbed by white noise, and L = 4 otherwise.
A key feature of this representation is that the wavelet decomposition cap-
tures significant signal features in a few relatively large coefficients, while
noise results decorrelated. As a result, noise - and noise only - can be effec-
tively removed by setting to zero the wavelet coefficients smaller than a
threshold.
We consider as state representation of the time series Xi the time series Yi
obtained as follows:

1. Compute the wavelet transform of the original time series Xi. We use
the standard Haar function [36] as a base wavelet, which consists of a
simple rectangular impulse function;

2. Set to zero the wavelet coefficients that are lower than a suitable
threshold tm (where m is the dilation parameter). As indicated in [95],
we set the threshold tm = σm √(2 log n), where σm = median{|dmk|} / 0.6745;

3. Compute the inverse wavelet transform to obtain Yi.

This rectification technique has been proved to be superior to many other
approaches [47], but it is restricted to off-line operations. We adopt the on-
line version proposed in [95], which considers a moving window of dyadic
length to compute Yi.
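The three steps above can be sketched with a minimal pure-Python Haar transform. This is the off-line variant on a single dyadic window: the hard thresholding with the universal threshold of step 2 and the simple upper-median estimator are spelled out in the code; the number of decomposition levels and the power-of-two input length are assumptions of this sketch:

```python
import math

def haar_forward(x):
    # one orthonormal Haar analysis step: pairwise averages and details
    a = [(x[2*i] + x[2*i+1]) / math.sqrt(2) for i in range(len(x) // 2)]
    d = [(x[2*i] - x[2*i+1]) / math.sqrt(2) for i in range(len(x) // 2)]
    return a, d

def haar_inverse(a, d):
    # exact inverse of haar_forward
    x = []
    for ai, di in zip(a, d):
        x.append((ai + di) / math.sqrt(2))
        x.append((ai - di) / math.sqrt(2))
    return x

def wavelet_rectify(x, levels=4):
    """Rectify x: Haar transform, hard-threshold the detail coefficients
    scale by scale, then transform back (steps 1-3 above)."""
    n = len(x)  # assumed to be a multiple of 2**levels (dyadic window)
    a, details = list(x), []
    for _ in range(levels):
        a, d = haar_forward(a)
        details.append(d)
    kept = []
    for d in details:
        med = sorted(abs(c) for c in d)[len(d) // 2]
        sigma = med / 0.6745                        # robust noise estimate
        t = sigma * math.sqrt(2 * math.log(n))      # universal threshold
        kept.append([c if abs(c) >= t else 0.0 for c in d])
    for d in reversed(kept):
        a = haar_inverse(a, d)
    return a
```

On a noiseless series the detail coefficients at noise scales are zero, so the reconstruction is exact; on a noisy series the small, decorrelated coefficients are zeroed while the few large ones carrying abrupt changes survive.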
Cusum-based detection rules
The proposed model integrates the wavelet-based representation with a novel
load change detection mechanism that is based on the Cusum change detec-
tion rule. The Cusum detection rule was proposed in [98] and is used in different
contexts [15, 27]. It is considered the best choice for the statistical quality
control of many processes [94].
Given the time series of the state representation, Yi, the one-sided Cusum
for detecting an increase in the mean computes the following test statistic:

    g+0 = 0        (7.6)
    g+i = max{0, g+i−1 + yi − (μ0 + K+)}        (7.7)

which measures positive deviations from a reference value μ0.
The test statistic g+i accumulates deviations of yi from μ0 that are greater
than a pre-defined constant K+, and resets to 0 on becoming negative. The
term K+, which is known as the allowance or slack value, determines the
minimum deviation that the statistic g+i accounts for. A positive change is
signaled when g+i exceeds a design-chosen threshold H+.
The one-sided Cusum test for detecting negative deviations is defined simi-
larly as:

    g−0 = 0        (7.8)
    g−i = max{0, g−i−1 + (μ0 − K−) − yi}        (7.9)

A negative change is signaled when g−i exceeds a design threshold H−.
A two-sided test to detect both increases and decreases is obtained by apply-
ing the two tests simultaneously. For the sake of simplicity we will consider
the symmetric case whereby K+ = K− = K and H+ = H− = H.
When a shift is detected, the Cusum test also provides an estimate of the new
reference value μ1 as follows:

    μ1 = μ0 + K + g+i / N+        if g+i > H
    μ1 = μ0 − K − g−i / N−        if g−i > H        (7.10)

where N+ (N−) denotes the number of steps elapsed since the last time g+i
(g−i) was set to zero, that is N+ = i − inf{j | g+j = 0}, and similarly for
N−.
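The symmetric two-sided test of Equations (7.6)-(7.10) can be sketched as follows. Resetting both statistics, and updating μ0 to the estimate μ1, after each alarm is our own assumption about how the test is restarted; the equations themselves do not fix this:

```python
def cusum_detect(y, mu0, K, H):
    """Two-sided Cusum (Eqs. 7.6-7.10): return (index, direction, mu1)
    for every signaled change."""
    gp = gm = 0.0
    n_pos = n_neg = 0  # steps since g+ / g- were last zero (N+, N-)
    alarms = []
    for i, yi in enumerate(y):
        gp = max(0.0, gp + yi - (mu0 + K))   # Eq. 7.7
        gm = max(0.0, gm + (mu0 - K) - yi)   # Eq. 7.9
        n_pos = n_pos + 1 if gp > 0 else 0
        n_neg = n_neg + 1 if gm > 0 else 0
        if gp > H:
            mu0 = mu0 + K + gp / n_pos       # Eq. 7.10, increasing change
            alarms.append((i, "up", mu0))
            gp = gm = 0.0
            n_pos = n_neg = 0                # assumed restart after an alarm
        elif gm > H:
            mu0 = mu0 - K - gm / n_neg       # Eq. 7.10, decreasing change
            alarms.append((i, "down", mu0))
            gp = gm = 0.0
            n_pos = n_neg = 0
    return alarms
```

For instance, on a unit step with K = ∆/2 = 0.5 and H = 2, the statistic g+i grows by 0.5 per sample after the shift, crosses H a few samples later, and Equation (7.10) recovers the new mean.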
The performance of the Cusum test is expressed in terms of the so called Av-
erage Run Lengths (ARL): ARL0 denotes the average number of samples
between false alarms when no change has occurred; ARL1 denotes the av-
erage number of samples needed to detect a change when it does occur. Both ARL
measures are affected by the design parameters H and K. To achieve a good
detection quality, the suggested value for K is ∆/2, where ∆ is the minimum
shift to be detected. In the considered environment, which is characterized by
a variable variance of the process, it is important to provide a dynamical esti-
mation of the H parameter. We propose to dynamically adjust the threshold
H in order to provide a target ARL0 performance and limit the number of
wrong detections, as presented by the ARL0 Cusum in [27].
These settings guarantee a good detection quality in terms of both recall and
precision (see Section 7.1.4), since we are able to dynamically adjust the
value of the threshold H to reflect the variation of the time series behavior.
7.1.3 Other state change detection models
Most existing change detection techniques are based on the trend extracted by the mod-
els presented in Chapter 5, used as state representations. In this section, we outline
the most popular on-line state change detection algorithms, which are presented
for performance comparison only. For each algorithm, we consider the parameter val-
ues that guarantee the best evaluation metrics (see Section 7.1.4) for Internet Data
Center management.
Threshold-based detector
This model uses a state representation based on the time series filtered with
an exponential moving average with n = 5. The detection rule is based
on the double threshold model described in [87]. The high threshold is equal
to thH = ∆, where ∆ is the smallest shift to detect, and the low threshold
is thL = p∆. We choose the p coefficient on the basis of the traditional
method described in [87], which is based on the ROC curve, in order to adapt
thL to the statistical characteristics of the time series. We denote this detection
model as Th-EMA5.
EWMA-based detectors
The EWMA model is applied in several online contexts (for example in in-
formation and computer systems, and in financial and social applications) [94].
In this thesis, it is used for both state representation [75] and detection
rule [94]. The performance of the EWMA detectors depends on the choice
of the number of past values n used for state representation and on the length of the con-
trol limit M used by the detection rule. In our analysis, we consider M = 3
and two values n = 5 and n = 10; all these values are suggested by Mont-
gomery et al. [94] as the most popular choices able to provide a good and
reliable detection quality. On the basis of the n value, we denote the respec-
tive detectors as EWMA5 and EWMA10. EWMA5 should maximize the
number of correct detections at the cost of some false ones. The EWMA10
model should minimize the number of false detections, likely at the cost of
missing some detections.
Baseline Cusum
It uses an exponential moving average with n = 5 as its state representation,
and the Cusum detection rule [94, 98]. Its detection quality is conditioned
by the choice of the statistical threshold H and by the slack value K, which
represents the minimum deviation that the detection rule of the baseline
Cusum accounts for. The recommended values of the model parameters are
H = 5σx and K = ∆/2.
ARL0 Cusum
The online state representation is an exponential moving average with n =
5. The ARL0 Cusum uses a Cusum detection rule that was proposed by some
of the authors [27]. Its performance depends on the design parameters H
and K. It uses an adaptive estimation of the parameter H that is dynamically
evaluated on the basis of the ARL0 value, chosen in order to ensure a very
small false detection rate. The ARL0 Cusum uses K = ∆/2.
7.1.4 Quantitative analysis
As we are interested in on-line detection in non-stationary and highly variable time
series coming from non-deterministic application contexts, in the following anal-
ysis we evaluate the detection quality of the proposed Wavelet Cusum model on
a wide range of time series based on two typical time series profiles. The perfor-
mance of the considered detection techniques is evaluated in terms of common
evaluation metrics, by considering several intensities of the noise components of
the time series.
Time series profiles
The evaluation of the proposed model is carried out on a wide range of time
series based on two state profiles:

• Step profile describes a sudden increment of the time series values
from a relatively low to a higher state [111]. The lower state keeps
constant for 200 samples, then it is suddenly increased for 200 sam-
ples. The increase is followed by a similar decrease.

• Multi-step profile describes an alternating increase and decrease of
the time series state characterized by different lengths. The time series
is subject to 8 state changes at samples 50, 125, 200, 275, 350, 400,
475, 550.

To facilitate the presentation, the two profiles are normalized and both have
unit increases and decreases.
Noise components
Since the proposed solution aims to improve existing detection models, we
apply the detection models to time series characterized by different levels of
noise components. The noise dispersion, σe, and the correlation index, ρe,
as described in [21, 45], are the most important statistical properties that
characterize the noise error, and they are the ones considered in our evaluation.
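As a reproducible sketch of this setup, the step profile and a correlated noise component can be generated as follows. Interpreting the correlation index ρe as the lag-one coefficient of an AR(1) process is our own assumption; the references [21, 45] define the noise model precisely:

```python
import math
import random

def step_profile(n=600):
    # normalized step profile: low state, unit increase, then back down
    third = n // 3
    return [0.0] * third + [1.0] * third + [0.0] * (n - 2 * third)

def add_noise(profile, sigma_e, rho_e, seed=0):
    """Superimpose a noise component with dispersion sigma_e and lag-one
    correlation rho_e, modeled here (assumption) as an AR(1) process."""
    rng = random.Random(seed)
    # innovation scale chosen so the stationary noise variance is sigma_e^2
    innov = sigma_e * math.sqrt(1.0 - rho_e ** 2)
    noisy, e = [], 0.0
    for v in profile:
        e = rho_e * e + rng.gauss(0.0, innov)
        noisy.append(v + e)
    return noisy
```

With sigma_e = 0 the generator returns the clean profile, which is a convenient sanity check before sweeping the σe and ρe grids used in the experiments.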
Evaluation metrics
The detection quality of the considered models is evaluated in terms of
recall and precision [96].
To formalize these metrics, let us consider a time series subject to state
changes. All detected samples that are signaled correctly by the detection
model are called true positives (TP). If the model does not detect one
or more changes, the related samples are classified as false negative (FN)
detections. When the time series is in a stable state, the detection of a change
is classified as a false positive (FP); otherwise, a non-detection in a stable
state is a true negative (TN). The numbers of true positives, false negatives,
true negatives, and false positives add up to 100% of the time series.
Recall is the fraction of detections that are relevant to the time series and
that are successfully retrieved:

    recall = TP / (TP + FN)        (7.11)

It can be looked at as the probability that a relevant state change is detected
by the model. To achieve a recall value equal to 1, the detection model must
signal all relevant changes.
The value of the recall alone is not enough: it must be supported by
some information related to the number of non-relevant detections, such as
the precision, that is, the fraction of relevant detections:

    precision = TP / (TP + FP)        (7.12)

where TP is the number of true positive detections and (TP + FP) is the
total number of detections. The precision gives information on the ability of
the detection model to limit unnecessary detections of a state change. A pre-
cision equal to 1 means that the model detects only relevant state changes,
while low precision values are caused by a detection model that signals
many non-relevant changes.
A trade-off between recall and precision exists, hence these two met-
rics are usually combined into a single measure, namely the F-measure,
which gives a global estimation of the detection quality. The F-measure is the
weighted harmonic mean of precision and recall, that is:

    F-measure = 2 · (precision · recall) / (precision + recall)        (7.13)

When the detection quality is good, the F-measure value is close to 1, while
it is low for unreliable detection models characterized by false positive and
false negative detections.
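Equations (7.11)-(7.13) translate directly into code; the guards against zero denominators are our own defensive addition:

```python
def detection_metrics(tp, fp, fn):
    """Recall, precision and F-measure as in Eqs. (7.11)-(7.13)."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return recall, precision, f_measure
```

For example, a detector that signals all 8 relevant changes plus 2 spurious ones has recall 1, precision 0.8, and F-measure 8/9.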
In this section, we evaluate how the performance metrics of the detection mod-
els are affected by the noise dispersion of the time series. In the first set of ex-
periments, we test all σe ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1}, set ρe to 0,
and consider the step profile time series.
The recall and precision results for all considered detection models and for
several σe values are reported in Table 7.1 and Table 7.2. Table 7.1 shows that
when the noise dispersion is low (σe ≤ 0.5) all the considered models achieve
recall values close to 1. This means that all methods recognize correctly all rele-
vant state changes, with the exception of the threshold-based model that does not
signal some detections. When the noise dispersion increases (σe > 0.5), the mod-
els using EWMA as detection rule, EWMA5 and EWMA10, worsen significantly
their recall values: they risk being completely unreliable in highly variable con-
texts since they do not signal many relevant state changes. Other models, such as
the threshold- and Cusum-based ones, are able to maintain good recall values (> 0.9) also
in time series with an intense noise dispersion.
σe    Th-EMA5  EWMA5  EWMA10  Baseline Cusum  ARL0 Cusum  Wavelet Cusum
0.1   1        1      1       1               1           1
0.2   0.97     1      1       1               1           1
0.3   0.96     1      1       1               1           1
0.4   0.92     1      1       1               1           1
0.5   0.89     1      0.98    0.99            1           1
0.6   0.83     0.99   0.87    1               1           1
0.7   0.85     0.95   0.31    1               1           1
0.8   0.92     0.88   0.02    1               0.99        1
0.9   0.94     0.71   0.01    0.99            0.99        1
1     1        0.64   0.01    1               1           1

Table 7.1: Recall - ρe = 0.
σe    Th-EMA5  EWMA5  EWMA10  Baseline Cusum  ARL0 Cusum  Wavelet Cusum
0.1   1        0.23   0.31    1               1           1
0.2   0.33     0.43   0.50    0.99            1           1
0.3   0.13     0.50   1       0.95            1           1
0.4   0.08     0.97   1       0.89            1           1
0.5   0.06     0.98   1       0.77            0.99        0.99
0.6   0.05     0.81   1       0.65            0.92        0.96
0.7   0.05     0.7    1       0.50            0.87        0.96
0.8   0.05     0.61   1       0.37            0.73        0.93
0.9   0.04     0.53   1       0.29            0.64        0.87
1     0.04     0.52   1       0.25            0.55        0.84

Table 7.2: Precision - ρe = 0.
However, the recall metric is not sufficient to offer a complete understanding
of the model quality. If we consider the precision results in Table 7.2, we notice
that the threshold-based method is too sensitive to the noise variations of the
time series: even if it is able to detect all relevant state changes, this is done
at the price of a high number of false positive detections, as confirmed by the low
precision values of the Th-EMA5. The Cusum-based methods are all able to achieve
good precision in time series characterized by a noise dispersion σe ≤ 0.5. In
more variable and noisy contexts, σe > 0.5, only the proposed Wavelet Cusum
model preserves a high detection quality, since it maintains a precision close to 0.85
also in highly variable time series with a large noise dispersion.
The combined effect of recall and precision can be appreciated in Figure 7.3,
which shows the behavior of the F-measure as a function of the standard deviation σe
for all the considered detection models. With the exception of the models based on
the EWMA detection rule, all algorithms worsen for increasing values of σe. Nev-
ertheless, the Wavelet Cusum achieves the best F-measure values for every σe. It
is worth observing that it is the only model guaranteeing reliable results even for
very high noise dispersions.
For example, at σe = 1, its F-measure remains consistently higher than 0.9,
against an F-measure around 0.7 of the ARL0 Cusum, which is the best exist-
ing on-line detection model [27]. The threshold-based method is characterized
by an exponential decay of the detection quality for increasing values of σe. This
behavior limits the model applicability especially in non-stationary and non-deter-
ministic contexts. The EWMA-based solutions have the peculiarity of an F-measure
that first improves and then decreases for σe > 0.5. This is due to the so called inertia
limit [94], that is, the inability to react quickly to state changes when the size of
the smallest shift to detect (∆) is significantly higher than σe. This limit and the
F-measure degradation for high σe values show that the performance of the
EWMA-based models is unacceptable, because too sensitive to the statis-
tical characteristics of the time series and to the choice of the model parameters.
Existing Cusum-based methods are characterized by a small decay of the F-measure
for low values of σe. On the other hand, the F-measure decreases slowly when
σe > 0.5. These results show that even existing Cusum-based models do not
work well when a time series is highly variable.
[Plot: F-measure vs. standard deviation σe (0.1-1) for Th-EMA5, EWMA5, EWMA10, Baseline Cusum, ARL0 Cusum, Wavelet Cusum.]

Figure 7.3: F-measures - ρe = 0.
We now examine the effects of the correlation of the noise com-
ponent ρe on the detection quality by considering different correlation indexes,
ρe ∈ {0, 0.1, 0.2, 0.3}, for fixed values of the noise dispersion σe.
In Figure 7.4 we report the F-measure values of all the considered models in two
cases of high noise dispersion: σe = 0.6 and σe = 0.9. The results confirm that
the Wavelet Cusum model improves the performance of all traditional detectors
for every correlation value and any σe. Its performance remains acceptable also
in the most chaotic context of intense variance and strong correlation of the noise
component (σe = 0.9, ρe = 0.3), improving by more than 50% the perfor-
mance of the best existing model, the ARL0 Cusum. A second important result is that
the Wavelet Cusum is less sensitive to the statistical characteristics of the time series
than any other detection model. This robustness is useful in all real contexts
characterized by variable, non-stationary and non-deterministic behaviors.
[Two plots: F-measure vs. noise correlation index (0-0.3) for all detectors; (a) σe = 0.6, (b) σe = 0.9.]

Figure 7.4: F-measures.
7.1.5 Performance analysis
In this section we evaluate the detection quality of the considered models on time
series characterized by different numbers of state changes and different lengths of
the periods of stability.
Figure 7.5 and Figure 7.6 represent the behavior of a selected subset of the
detection models (EWMA5, Baseline Cusum, ARL0 Cusum and Wavelet Cusum)
on the step profile and the multi-step profile, respectively. Both figures display the curve
of the online state representation (continuous gray line), the curve of the time
series profile (continuous black line) and the vertical lines with a circle at the
bottom for false detections and a cross for true detections.
The number of false detections is an important measure of the precision of a
model. On the other hand, the absence of a detection in occurrence of a relevant
state change affects the detection recall. On the step profile, the Wavelet Cusum in
Figure 7.5(a) is the only model that detects timely and correctly all relevant state
changes, without false detections. The improvement of the proposed model with
respect to the Baseline Cusum is impressive. The performance of the Baseline
Cusum in chaotic contexts is very low, as a consequence of the large number of
false detections (Figure 7.5(b)). The ARL0 Cusum and the EWMA5 models (Fig-
ure 7.5(c) and Figure 7.5(d), respectively) show the same detection quality, char-
acterized by two wrong detections during a stable state.
[Four panels, one per detector, each showing data value vs. sample (0-600) with the on-line state representation, the step profile, and true/false positive detections: (a) Wavelet Cusum, (b) Baseline Cusum, (c) ARL0 Cusum, (d) EWMA5.]

Figure 7.5: Qualitative evaluation - Step profile - σe = 0.6 and ρe = 0.3.

For increasing complexity of the time series profile, all existing models reduce
significantly their quality, while the Wavelet Cusum in Figure 7.6(a) shows
a high quality also in time series characterized by multiple state changes. In these
contexts, the ARL0 Cusum and the EWMA5 models diversify their behaviors: the
former model, in Figure 7.6(c), keeps on detecting all relevant state changes, but
its precision decreases due to a higher number of false detections; the second
model, in Figure 7.6(d), misses many relevant state changes, thus strongly affect-
ing its recall quality.
[Four panels, one per detector, each showing data value vs. sample (0-600) with the on-line state representation, the multi-step profile, and true/false positive detections: (a) Wavelet Cusum, (b) Baseline Cusum, (c) ARL0 Cusum, (d) EWMA5.]

Figure 7.6: Qualitative evaluation - Multi-step profile - σe = 0.9 and ρe = 0.3.
We can conclude that the Wavelet Cusum detector provides the best results,
both in terms of recall and precision, regardless of the complexity of the time series
profile, the number and length of the stable states and the characteristics of the
noise components.
7.1.6 On-line state change detection for IDC management
The previous results on the on-line detection of relevant state changes are now applied
to a real Internet Data Center context. The difficulty here is to detect intrinsic
changes that are not directly observed and that are measured together with other
types of perturbations. We expect that working on a system representation col-
lecting only the deterministic patterns of the Data Center servers solves the
problems of existing algorithms. The representative deterministic eigenresource
can be used as state representation for proper detection rules in order to signal
relevant state changes in the system.
In this section, we report the results of several experiments run on the 50
servers of the Internet Data Center, in order to understand the impact of changing
workloads on the resulting PCA representations. We exercise the Data Center
servers through some representative synthetic user scenarios referring to several
workload models having different impacts on system resources. Every experiment
generates multiple data sets referring to the system resource measures of the 50
servers. We use these data sets to evaluate the impact of non-transient changes in
the state of the server resources, and as input resource measures for the proposed
multi-phase methodology. This allows us to evaluate how and whether workload
changes have an impact on the representative eigenresources resulting from the PCA-
based technique.
By applying the proposed methodology to the considered Internet Data Center
we can observe that:

• sudden changes in the number of emulated browsers have influence only on
deterministic eigenresources;

• sudden changes in the impact that the simulated workload models have on
system resources influence only noise eigenresources (this result is not in-
vestigated in this thesis but left for future work);

• sudden changes in the number of emulated browsers under low workload
scenarios influence the resources of the servers, but are not registered by
the representative deterministic eigenresource. They impact deterministic
dimensions having a low contribution to the overall energy of the system
(that is, a low singular value). This reflects in a small addition to the
Rdeterministic vision and has no relevant effect on the Internet Data Center
representative time series;

• sudden changes in the number of emulated browsers under heavy workload
scenarios have influence on the resources of the servers, and are registered
by the representative deterministic eigenresource. They impact deter-
ministic dimensions having a high contribution to the overall energy of the
system, and cause a relevant state change in the Rdeterministic time series.
These results confirm the scientific value of the proposed methodology: rep-
resentative eigenresources register relevant state changes only in the case of highly
impacting events in the service demand. Applying state change detection models to
the representative deterministic representation guarantees that only state changes
having strong repercussions on the Internet Data Center activity are signaled. This ap-
proach prevents the activation of system procedures when there are state changes
that affect the state of few servers but do not have an impact on the work of the
whole system.
In the case of relevant non-transient changes in the Internet Data Center activ-
ity, all the experiments demonstrate that the representative deterministic eigenre-
source reflects the sudden increase or decrease of the load conditions.
Figure 7.7 shows the representative deterministic eigenresource time series
resulting from the PCA-based technique applied to the measurements of one ex-
periment. We choose a step scenario describing a sudden load increment from a
relatively unloaded to a more loaded system, followed by a subsequent decrease
[111]. The population is kept at 120 emulated browsers for the first 66 hours of
the week, then it is suddenly increased to 200 emulated browsers for an equivalent
period of time. Then, a sudden decrease re-establishes the initial conditions.
All experiments verify that the increase and decrease in the number of requests
is registered by the representative deterministic eigenresource.
[Plot: Rdeterministic vs. samples over one week (Mon-Sun).]

Figure 7.7: Relevant state changes on the representative deterministic eigenresource.

Figure 7.8 represents the behavior of the EWMA5, Baseline Cusum, ARL0 Cusum
and Wavelet Cusum detection models on the representative deterministic eigenre-
source. The figures display the representative deterministic time series (continu-
ous gray line) and the vertical lines with a circle at the bottom for false detections
and a cross for true detections.
In this experiment, the Wavelet Cusum in Figure 7.8(a) is the only model that
detects timely and correctly the two relevant state changes without false detec-
tions. The Baseline Cusum and the ARL0 Cusum models (Figure 7.8(b) and Fig-
ure 7.8(c), respectively) are affected by two wrong detections. The Baseline Cusum
detects several consecutive signals in correspondence of every load change, thus
demonstrating its scarce performance during a transition state. The ARL0 Cusum is
affected by false detections when the time series is characterized by stationary
conditions. This reveals its inability to maintain a stable state. In the presented
stochastic contexts, the EWMA5 model misses the two relevant state changes, and this
result affects its recall quality.
The Wavelet Cusum detector applied to the representative deterministic eigen-
resource is a reliable, both in term of recall and precision,novel solution for the
management of the considered Internet Data Center. Relevant state changes in the
state of the whole system are timely detected and no false alarms activate useless
management decisions.
Figure 7.8: Performance evaluation of state change detection models on the representative deterministic eigenresource: (a) Wavelet Cusum, (b) Baseline Cusum, (c) ARL0 Cusum, (d) EWMA5.
7.2 Anomaly detection
Anomaly detection refers to the problem of finding patterns in the time series that do not conform to the expected behavior. These non-conforming patterns are often referred to as anomalies, outliers, discordant observations, exceptions, aberrations, surprises, peculiarities or contaminants in different application domains [31]. Among these definitions, anomalies and outliers are the two terms most commonly used in the context of anomaly detection.
Anomaly detection algorithms are applied to a large variety of applications, such as fraud detection for credit cards, insurance or health care, intrusion detection for cyber-security, military surveillance for enemy activities, and, as our interest, fault detection in critical information systems.
The importance of anomaly detection is due to the fact that anomalies in data translate to significant (and often critical) actionable information in a wide
variety of application domains. For example, an anomalous traffic pattern in a
computer network could mean that a hacked computer is sending out sensitive
data to an unauthorized destination [79]. An anomalous MRI image may indicate
presence of malignant tumors [118]. Anomalies in credit card transaction data
could indicate credit card or identity theft [5] or anomalous readings from a space
craft sensor could signify a fault in some component of the space craft [58].
In this work, we are interested in anomaly detection in time series coming from the monitoring of Internet-based system resources. In Internet-based contexts, anomaly detection is usually directed to intrusion detection, that is, the detection of malicious activity (break-ins, penetrations, and other forms of computer abuse) in a computer-related system [103]. These malicious activities or intrusions are interesting from a computer security perspective. An intrusion is different from the normal behavior of the system, and hence anomaly detection techniques are applicable in the intrusion detection domain. Identifying anomalies rapidly and accurately is crucial to the efficient operation of large computer systems. Identifying, diagnosing and treating anomalies in a timely fashion is a fundamental part of day-to-day computer operations. Without this kind of capability, systems are not able to operate efficiently or reliably.
7.2.1 Problem definition
Accurate identification and diagnosis of anomalies depends first on robust and timely data, and second on established methods for isolating anomalous signals within the data. The key challenge for anomaly detection in the Internet-based systems domain is the huge volume of data: anomaly detection techniques need to be computationally efficient to handle these large-sized inputs. Moreover, the data typically comes in a streaming fashion, thereby requiring on-line analysis. Another issue which arises because of the large-sized input is the false alarm rate. Since the data amounts to millions of data objects, even a few percent of false alarms can make analysis overwhelming for an analyst. Finally, system operators usually use data sets collected from the server resources of the system and analyze them separately to identify resource anomalies. These discordant observations refer to
an isolated view of the system, concerning only one resource and not taking into account the interactions among servers.
Applying anomaly detection techniques to only one resource at a time causes two drawbacks:
• reductive view, since it does not give a reliable detection of non-conforming events in the entire system behavior, but can only reveal that something discordant has happened in the behavior of one resource measure;
• high dimensionality, since it implies the investigation and the application of an anomaly detection model on each resource time series of each server of the system. These time series are numerous and many samples long, and make whole system anomaly detection at runtime an intractable problem.
Applying anomaly detection techniques to the representative eigenresources coming from the PCA-based technique proposed in Chapter 4 allows us to solve the two mentioned problems. First of all, representative eigenresources collect the relevant information about the state of the system, thus embodying an exhaustive view of its behavior. Second, anomaly detection models must be applied to only three time series, thus reducing the dimensionality of the whole system anomaly detection problem.
We focus on on-line anomaly detection, with particular interest in two of the many accepted meanings of the word "anomaly". We consider point anomaly detection in Section 7.2.2 and collective anomaly detection in Section 7.2.5. Both applications aim at signaling unexpected events in Data Center activity. They help Data Center operators to adapt the system to changing environments, to identify malicious activities, and to activate management procedures to avoid undesired consequences.
7.2.2 Point anomaly detection
If an individual data instance can be considered anomalous with respect to the rest of the data, then the instance is termed a point anomaly. This is the simplest type of anomaly and is the focus of the majority of research on anomaly detection.
For example, in Figure 7.9 the data samples marked by a cross depart from the behavior of the normal region, and hence are point anomalies, since they are different from normal data points. In the time series domain, a data sample whose value is very high or very low compared to the normal range of variability for that time series is a point anomaly.
Figure 7.9: Example of point anomalies.
As a real-life example, consider credit card fraud detection. Let the data set correspond to an individual's credit card transactions. For the sake of simplicity, let us assume that the data is defined using only one feature: amount spent. A transaction for which the amount spent is very high compared to the normal range of expenditure for that person is a point anomaly.
7.2.3 Point anomaly detection techniques
Appropriate techniques for point anomaly detection in time series analysis are the statistical ones. Statistical techniques for point anomaly detection are based on the following key assumption:
Assumption. Normal data instances occur in the high probability regions of a stochastic model, while anomalies occur in the low probability regions of the stochastic model.
Statistical techniques fit a statistical model (usually for normal behavior) to the given data and then apply a statistical inference test to determine whether an unseen instance belongs to this model or not. Instances that have a low probability of being generated from the learned model, based on the applied test statistic, are declared as anomalies.
The test used for the identification of spike eigenresources (see Section 3.2) is a simple outlier detection model among the statistical techniques. The sσ threshold test [114] declares anomalous all time series instances that are more than s standard deviations away from the time series mean. Naming µ the time series mean and σ its standard deviation, and setting s = 3, the µ ± 3σ region contains 99.7% of the data instances.
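As an illustration, the sσ threshold test can be sketched in a few lines. This is a minimal sketch, not the thesis implementation; the function name and the injected spike are our own:

```python
import numpy as np

def sigma_threshold_test(series, s=3.0):
    """Flag every sample lying more than s standard deviations from the mean."""
    series = np.asarray(series, dtype=float)
    mu, sigma = series.mean(), series.std()
    return np.abs(series - mu) > s * sigma

# Toy usage: a Gaussian series with one injected (hypothetical) spike.
rng = np.random.default_rng(0)
series = rng.normal(0.0, 1.0, 1000)
series[100] = 8.0  # injected anomalous sample
mask = sigma_threshold_test(series, s=3.0)
```

With s = 3 the µ ± 3σ band covers 99.7% of Gaussian samples, so a handful of the 1000 normal points may still be flagged by chance, while the large injected spike is always caught.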
More sophisticated statistical tests include the box plot rule, often used to detect point anomalies. A box plot depicts the data using summary attributes such as the smallest non-anomaly observation (min), lower quartile (Q1), median, upper quartile (Q3), and largest non-anomaly observation (max). The quantity Q3 − Q1 is called the Inter Quartile Range (IQR). The box plot also indicates the limits beyond which any observation is treated as an anomaly. A data instance that lies more than 1.5 IQR lower than Q1 or 1.5 IQR higher than Q3 is declared an anomaly. The region between Q1 − 1.5 IQR and Q3 + 1.5 IQR contains 99.3% of the observations.
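The box plot rule above can be sketched as follows; this is our own minimal sketch, where the whisker weight k generalizes the 1.5 factor discussed in the text:

```python
import numpy as np

def boxplot_rule(series, k=1.5):
    """Flag samples outside [Q1 - k*IQR, Q3 + k*IQR]."""
    series = np.asarray(series, dtype=float)
    q1, q3 = np.percentile(series, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (series < lower) | (series > upper)

# A value far above the upper whisker is declared an anomaly.
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
flags = boxplot_rule(data)
```

Increasing k lengthens the whiskers and makes the rule less restrictive, which is exactly the knob tuned later in this section.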
A box plot example is given in Figure 7.10: the bottom and top of the box are always the 25-th and 75-th percentiles (the lower and upper quartiles, respectively), and the band near the middle of the box is always the 50-th percentile (the median).
The ends of the whiskers can represent several possible alternative values [90], among them:
• the minimum and maximum of all the data;
• the lowest sample still within 1.5 IQR of the lower quartile, and the highest datum still within 1.5 IQR of the upper quartile;
Figure 7.10: A box plot for a univariate data set.
• one standard deviation above and below the mean of the data;
• the 9-th percentile and the 91-st percentile;
• the 2-nd percentile and the 98-th percentile.
Any data not included between the whiskers is plotted as an outlier with a circle and corresponds to an unexpected value, that is, an anomaly. Thus, the choice of the ends of the whiskers plays the same relevant role as the setting of the s parameter in the sigma threshold test. To establish the whisker limits, context requirements and resource characteristics must be taken into account, in order to obtain the best box plot rule performance.
Box plots have some strengths with respect to the sigma threshold test. Besides graphically displaying a data set's location and spread at a glance, they provide some indication of the data's symmetry and skewness. Moreover, by drawing box plots for many time series side-by-side on the same graph, one can easily compare data sets. This is not easy to obtain through a sigma threshold test which, on the other hand, has the advantage of showing the data sample values and the occurrence in time of the anomalous samples.
7.2.4 On-line anomaly detection for IDC management
We now show an application of the two anomaly detection techniques presented in the previous section. They are applied to the representative spike eigenresource with the aim of detecting point anomalies in the behavior of the Internet Data Center. Applying the anomaly detection techniques to the Rspike time series ensures that each signal corresponds to an effective unexpected event in the life of the system. Each detection needs an accurate investigation of its causes and the activation of management procedures to face uncommon behaviors.
Starting from the sσ threshold test, we test the performance of the anomaly detector on the Rspike time series for different values of the s parameter. Several tests lead us to set s = 5 as the best choice in our application context, guaranteeing an acceptable trade-off between the number of detected true and false point anomalies.
Figure 7.11 shows the results of the 5σ test for point anomaly detection on the representative spike eigenresource. The horizontal dotted lines are plotted at 5σ distance from the zero mean of the time series, in the positive and negative directions. The upper threshold line intersects some eigenresource samples that are therefore detected as anomalous. In correspondence of the time series instances with an upper cross, something peculiar has happened, causing a relevant instantaneous change in the normal behavior of the Internet Data Center.
Once the 5σ threshold values are learned on a significant set of past samples, they can be fixed for on-line anomaly detection. Given the spike eigenresource sample xi at time ti, it is compared to the upper and lower thresholds. If xi exceeds (in the positive or negative direction) one of the two thresholds, a point anomaly is detected at time ti. This helps in the immediate investigation of the causes of the uncommon event and the instantaneous activation of suitable management procedures.
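The on-line procedure just described can be sketched as below; the class name and the synthetic training history are illustrative assumptions, not part of the thesis:

```python
import numpy as np

class OnlineSigmaDetector:
    """Learn mu +/- s*sigma thresholds on past samples, then test new samples one by one."""

    def __init__(self, history, s=5.0):
        history = np.asarray(history, dtype=float)
        mu, sigma = history.mean(), history.std()
        self.lower, self.upper = mu - s * sigma, mu + s * sigma

    def check(self, x_i):
        # Point anomaly if the new sample exceeds either fixed threshold.
        return bool(x_i < self.lower or x_i > self.upper)

# Learn thresholds off-line on a hypothetical history, then apply on-line.
rng = np.random.default_rng(1)
detector = OnlineSigmaDetector(rng.normal(0.0, 1.0, 5000), s=5.0)
```

Each incoming sample then costs only two comparisons, which is what makes the test viable at runtime.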
The threshold values can be kept constant until a relevant state change is detected. Relating the outcomes of the state change detection models described in Section 7.1.6 to the setting of the anomaly detection technique parameters makes it possible to adapt them to the evolution of the notion of normal behavior and to keep the parameter values representative also under new state conditions.
Figure 7.11: Performance evaluation of the 5σ test for point anomaly detection on the representative spike eigenresource.
The other point anomaly detector we test is the box plot rule. Also for this technique, we have examined different settings of its parameters and collected the relative performance. Fixing the ends of the box plot whiskers equal to (Q1 − 1.5 IQR) and (Q3 + 1.5 IQR) implies that 99.3% of the data set observations fall within the whiskers.
Even though considering as anomalous the 0.7% of observations departing from a Gaussian distribution may seem a strict rule, the application context of the box plot rule must also be taken into account. In the context of Internet Data Center management, every anomaly detection causes several operations, generally expensive and time consuming. Reducing the number of detections to only those really relevant for the sake of the system permits limiting the activation of control procedures and avoiding false alarms in the system. In our experiments, we see that a 1.5 weight for the Inter Quartile Range causes an excessive number of point anomaly detections, most of them irrelevant and not linked to a real problem in the system.
For this reason, the proposed application of the box plot rule to the representative spike eigenresource sets a higher weight for the IQR, thus giving a technique that is less restrictive than the 5σ test but still signals unexpected events that can be traced back to something strange happening in the Internet Data Center.
Figure 7.12: Box plot of the representative spike eigenresource.
Figure 7.12 shows the box plot of the representative spike eigenresource time series. In this case, the IQR weight is set equal to 5. We can appreciate the Gaussian distribution with zero mean and very low standard deviation of a small set of samples, as evinced by the short rectangle enclosing the Inter Quartile Range. The choice of a 5 IQR weight motivates the long whiskers, beyond which samples are declared anomalous. All but one of those samples fall in the positive tail of the Gaussian distribution, thus suggesting point anomalies due to unexpected bursts in the system activity, caused by a sudden increase of CPU utilization.
Searching for those entries in the representative spike time series makes it possible to identify the anomalous samples and their occurrence in the period of monitoring. Figure 7.13 graphically reports the results of the box plot rule on the representative spike eigenresource. Each cross at the top or at the bottom of the figure denotes an anomaly found in the upper or lower portion of the box plot in Figure 7.12. Comparing these results to those of the 5σ test in Figure 7.11, we can appreciate an evident increase in the number of detected point anomalies. The higher number of crosses at the top and at the bottom of the last figure evidences how the choice of the weight parameter gives rise to an anomaly detection technique less stringent than the previous one. Despite that, all detected point anomalies can be related to strange events in the system behavior, even though not so critical as to cause a fault or an overload in the system.
Figure 7.13: Performance evaluation of the box plot rule for point anomaly detection on the representative spike eigenresource.
This second anomaly detection technique can be used for the removal of meaningless samples from the deterministic and noise representative eigenresources. An anomalous event in the system behavior detected at time ti by the spike eigenresource may also affect the corresponding samples xi captured by the other two representations, making them unreliable. An accurate investigation of all three representative eigenresources in correspondence of a box plot test signal is crucial for the correct interpretation of the point anomaly and for the cleaning of the deterministic and noise time series from spurious samples.
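One possible way to clean the deterministic and noise series at the flagged instants is linear interpolation over the non-anomalous neighbors; this is a hypothetical sketch of ours, not the cleaning procedure of the thesis:

```python
import numpy as np

def clean_series(series, anomaly_mask):
    """Replace samples flagged as point anomalies with linearly interpolated values."""
    series = np.asarray(series, dtype=float).copy()
    anomaly_mask = np.asarray(anomaly_mask, dtype=bool)
    idx = np.arange(series.size)
    # Interpolate the flagged positions from the surrounding clean samples.
    series[anomaly_mask] = np.interp(idx[anomaly_mask],
                                     idx[~anomaly_mask],
                                     series[~anomaly_mask])
    return series
```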
Besides that, the representative deterministic eigenresource also carries crucial information for the detection of anomalies in the system. The information carried by the deterministic class of eigenresources is of interest for the identification of collective anomalies in the system activity.
7.2.5 Collective anomaly detection
A collective anomaly occurs when a collection of related samples is anomalous with respect to the entire time series. The individual data instances in a collective anomaly may not be anomalies by themselves, but their occurrence together as a collection is anomalous [31]. Let us give a graphical example.
Figure 7.14 shows an example of a deterministic eigenresource time series. The encircled region denotes a collective anomaly because the time series values with low variability persist for an abnormally long time (corresponding to a day of normal expected activity of the Data Center). Note that those values by themselves are not anomalous.
Figure 7.14: Example of collective anomaly.
While point anomalies can occur in any time series, collective anomalies can occur only in time series in which the samples are related. Therefore, the representative spike and noise eigenresources are worthless for searching for uncommon patterns in the Internet Data Center behavior. On the other hand, we can apply collective anomaly detection techniques to the representative deterministic eigenresource resulting from our multi-phase methodology.
The anomaly detection technique used for the search of collective anomalies is based on time series prediction, addressed in Chapter 6. Once the future expected values of the Rdeterministic time series are predicted, a relevant gap between the prediction and the real eigenresource sample is a symptom of an unexpected pattern in the trend and seasonal components of the system. This approach eases the investigation of collective anomalies in the behavior of the Internet Data Center.
7.2.6 Collective anomaly detection techniques
A collective anomaly occurs only in time series in which samples have some kind of deterministic relation, since this type of anomaly is composed of a collection of related samples that is anomalous with respect to the entire time series. Thus, the relationships among samples are fundamental for the occurrence and the identification of collective anomalies.
For this reason, the preliminary analysis for collective anomaly detection techniques is directed to the verification of correlations among the time series values. This property can be verified through the correlograms and periodograms presented in Chapter 3. If these analyses demonstrate the existence of a deterministic component underlying the time series behavior, then collective anomalies may occur and searching for them is worthwhile.
In comparison to the rich literature on point anomaly detection techniques, the research on collective anomaly detection has been limited. Broadly, such techniques can be classified into two categories. The first category of techniques reduces a collective anomaly detection problem to a point anomaly detection problem, while the second category models the structure in the data and uses the model to detect anomalies. We adopt the latter solution.
A generic technique in this category can be described as follows. A model is learned from the data set, through which the expected behavior is predicted with respect to a given context. If the expected behavior is significantly different from the observed behavior, an anomaly is declared. A simple example of this generic technique is regression, in which the contextual attributes can be used to predict the behavioral attribute by fitting a regression line to the data.
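The regression example can be sketched as follows; this illustration, with its arbitrary residual threshold, is ours and not taken from the thesis:

```python
import numpy as np

def regression_anomalies(t, y, threshold):
    """Fit a regression line to (t, y) and flag points whose residual exceeds threshold."""
    slope, intercept = np.polyfit(t, y, 1)
    residuals = y - (slope * t + intercept)
    return np.abs(residuals) > threshold

# Toy usage: one sample breaks the otherwise linear structure.
t = np.arange(10.0)
y = 2.0 * t + 1.0
y[5] += 10.0
flags = regression_anomalies(t, y, threshold=5.0)
```

Here the contextual attribute (time) predicts the behavioral attribute (the measured value), and the anomaly is declared on the residual.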
Like state change detection, on-line collective anomaly detection is typically addressed through two steps: state representation and detection rule.
State representation
In the time series context, several regression-based techniques for data set state representation, such as robust regression [108], autoregressive models [57], ARMA models [2, 3], and ARIMA models [16], have been developed for collective anomaly detection. Thus, among the forecasting models described in Section 6.2, AR and ARIMA models are the most used for collective anomaly detection.
The reason for this is easy to evince graphically from the following comparison among the prediction models. Let us forecast the representative deterministic eigenresource for 2 days in the future. In Figure 7.15 we report the results of the different models in predicting the Rdeterministic time series coming from the application of the PCA-based technique. All predictors are tested under an SD-policy and in an off-line fashion. We set the values of the model parameters on the basis of the static implementations of the prediction models presented in Section 6.2.
All figures display the representative deterministic eigenresource for the last three days of the week and the predicted values for the next two days. The former is represented by the gray line, while the latter are shown by the subsequent black line. The dotted lines delimit the prediction intervals computed for c = 0.80 and c = 0.95. In Figure 7.15 we can appreciate two main behavioral classes: the class of predictors that reproduce in the future only the common trend of the time series, and the one that also recreates its seasonal behavior. The SR, CS, EWMA, and Holt's techniques belong to the former class. Figures 7.15(a), (b), (c) and (d) clearly show prediction results that do not take into account the periodic component of the time series. The predicted values only follow the main tendency of the time series, reproducing it for the next two days in the future. The autoregressive models in Figures 7.15(e) and (f), instead, learn the systematic component from the
Figure 7.15: Prediction results on the representative deterministic eigenresource: (a) SR, (b) CS, (c) EWMA, (d) Holt's, (e) AR, (f) ARIMA.
past samples, and recreate it in their future predictions.
Since we are interested in time series forecasting directed to the identification of collective anomalies, we focus on the AR and ARIMA predictors and their reproduction of the seasonal behavior of the time series. The reason for choosing autoregressive models is that the key factor in differentiating between normal and anomalous behavior is the co-occurrence of events. Thus, time series predictors that lose the periodic correlation among samples are useless for collective anomaly detection. Since we search for unexpected behaviors departing from the common seasonal activity of the system, we need a prediction technique able to model this seasonal component.
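To make the idea concrete, even a minimal seasonal autoregression (regressing each sample on the one a full period earlier) already reproduces a periodic component in its forecasts. This toy model is our own sketch, not the AR/ARIMA implementation of Chapter 6:

```python
import numpy as np

def seasonal_ar_forecast(series, period, steps):
    """Fit x_t ~ a * x_{t-period} + b by least squares and forecast iteratively."""
    series = np.asarray(series, dtype=float)
    a, b = np.polyfit(series[:-period], series[period:], 1)
    history = list(series)
    preds = []
    for _ in range(steps):
        preds.append(a * history[-period] + b)
        history.append(preds[-1])
    return np.array(preds)

# A purely periodic series is continued in the forecast.
x = np.sin(2 * np.pi * np.arange(200) / 24.0)
forecast = seasonal_ar_forecast(x, period=24, steps=48)
```

A trend-only predictor (EWMA, Holt's) cannot reproduce such a continuation, which is exactly why it is useless for detecting departures from seasonal behavior.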
Let us read the results obtained in Section 6.3 in this direction. For anomaly detection purposes, stringent time constraints are not required, since this type of anomaly can be discovered in periodic behaviors repeating every 24 hours. For our scope, all prediction policies are applicable in terms of computational cost, including the pre-filtering ones using complex prediction models, such as adaptive ARIMA under an AT-policy.
What is crucial in collective anomaly detection is prediction quality. We aim at low prediction errors and small prediction intervals, since our intent is to detect all and only the real collective anomalies. This is possible only through reliable time series predictions. The most reliable prediction is guaranteed by the AT-policy, both in terms of prediction error and prediction interval. For this reason, in this section we consider prediction methods providing a trend representation of the time series on which a prediction model is applied that dynamically adapts the number of parameters and their values to the inconstant characteristics of the data set. This choice is particularly suited to highly variable and stochastic contexts, such as that of Internet-based systems.
Having chosen the AT-policy for our purpose, we focus on the autoregressive models. As discovered in Section 6.3, AR and ARIMA models have similar PI performances. Their behavior in terms of prediction error is otherwise very different. We discover a big gap between the AR and ARIMA performance in terms of prediction errors under the AT-policy, with a PE = 0.32 for the AR model and a PE = 0.029 for the ARIMA one. Since an order of magnitude separates the prediction quality metrics of the two autoregressive models, we choose to adopt the ARIMA prediction technique to detect collective anomalies in the considered Internet Data Center.
Detection rule
The collective anomaly detection rule adopted in this work follows the approach of the statistical rule described for state change detection (see Section 7.1.2). The problem we consider now is to detect at runtime the occurrence of a collective anomaly in the state of the system, with a minimum or null rate of false detections. As for state change detection, this is done on the basis of a statistical change detection rule obtained by comparing, at each sample i, the test statistic gi with a characteristic statistical threshold H. A collective anomaly is detected on the basis of the following equation:

detection rule = { collective anomaly, if gi ≥ H
                 { no anomaly,         otherwise.     (7.14)
The test statistic gi is an indicator of the detection model: during normal behavior it should be close to zero, and it departs from zero when an anomalous behavior occurs.
At time ti, we consider the time series S[r]i and the predicted values computed from 1 to k steps ahead, {fi+1, . . . , fi+k}. Every fi+j, with 1 ≤ j ≤ k, goes with the corresponding prediction interval PIj = [li+j, ui+j]. The considered rule for detecting a collective anomaly computes the following test statistics:

g0 = 0                                                  (7.15)

     { |fi+j − li+j|, if fi+j < li+j
gi = max { |fi+j − ui+j|, if fi+j > ui+j                (7.16)
     { 0,              otherwise

which measure the positive deviations of the monitored value fi+j from the prediction interval.
The test statistic gi accumulates the deviations of fi+j from li+j and ui+j, in the case of c = 0.95. A collective anomaly is detected when gi exceeds a statistical threshold H. Through an empirical evaluation, we dynamically set H = 0.05·σi, where σi is the standard deviation of the time series Xi at time ti. This setting of the H parameter value is suited to the characteristics of the representative deterministic eigenresource, which is cleaned of all perturbations due to noise and spikes. All the information carried by the Rdeterministic time series is meaningful for the identification of the periodic behavior, thus even a small H value achieves good detection quality.
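The detection rule of Equations (7.14)-(7.16) can be sketched as below. The array names are our own, and we compare each value fi+j against its prediction interval [li+j, ui+j] as in the text:

```python
import numpy as np

def collective_anomaly_rule(f, lower, upper, H):
    """g_i = max deviation of f from [lower, upper]; anomaly declared if g_i >= H."""
    f, lower, upper = (np.asarray(v, dtype=float) for v in (f, lower, upper))
    # Deviation is |f - l| below the interval, |f - u| above it, 0 inside (Eq. 7.16).
    deviations = np.where(f < lower, lower - f,
                          np.where(f > upper, f - upper, 0.0))
    g_i = deviations.max()
    return g_i >= H, g_i  # Eq. 7.14

# One value exceeds its upper prediction limit by 1.0.
detected, g = collective_anomaly_rule([0.0, 1.0, 3.0],
                                      lower=[-1.0, -1.0, -1.0],
                                      upper=[2.0, 2.0, 2.0],
                                      H=0.5)
```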
7.2.7 On-line collective anomaly detection for IDC management
Figure 7.16 shows the results of the collective anomaly detection technique applied to the representative deterministic eigenresource coming from PCA. The predicted values are computed for three days in the future through an ARIMA forecasting model that adapts its parameters at each prediction step.
The gray line is the representative deterministic eigenresource. The continuous black line represents the predicted values, setting k = 864, enclosed by the prediction intervals computed for values of c equal to 0.80 and 0.95. The limits of the prediction interval, ui+j and li+j, are displayed through the black dotted lines above and below the prediction. A time series sample that falls outside PIj = [li+j, ui+j] with c = 0.95 contributes positively to the test statistic gi and thus to the detection of a collective anomaly. In correspondence of the cross displayed at the bottom of the figure, the test statistic gi overcomes the statistical threshold H and thus an anomaly is detected.
Investigating the time series in correspondence of the detection, we can assess that it is a correct signal. At the detection point, the eigenresource loses its characteristic periodic behavior, with a long-lasting period of stable values that is anomalous with respect to the seasonal past behavior of the system activity.

Figure 7.16: Performance evaluation of the collective anomaly detection model on the representative deterministic eigenresource.

This is probably a symptom of something strange happening in the Internet Data Center and thus deserves in-depth investigation. In Internet-based systems, anomalous subsequences may translate to malicious programs, unauthorized behaviors and policy violations. The detection of each of the unexpected events addressed in this chapter guides investigation procedures towards a better Internet Data Center management.
7.2.8 Conclusions
The proposed multi-phase methodology provides some advantages to the runtime
models for anomaly detection considered in this work.
• Helps in the choice of suitable anomaly detection techniques
Over time, a variety of anomaly detection techniques have been devel-
oped in several research communities. Many of these techniques have been
specifically developed for certain application domains, while others are more
generic, but all need a contextual definition of the normal domain behavior.
Defining a normal region which encompasses every possible normal behavior is very difficult. In addition, the boundary between normal and anomalous behavior is often not precise. Thus, an anomalous observation which lies close to the boundary can actually be normal, and vice versa. The studies carried out in Chapter 4 on the three representative eigenresources coming from the application of the PCA-based method allow the definition of normal and anomalous behaviors suited to the eigenresource characteristics. This is a crucial starting point for the choice of anomaly detection techniques appropriate for whole system analysis and management.
• Permits a specific definition of anomaly
The exact notion of anomaly is different for different application domains.
For example, in CPU utilization a small deviation from normal (e.g., an increase of 20%) might be an anomaly, while a similar deviation in the network packet rate might be considered normal. Thus, applying a technique developed in one domain to another is not straightforward. Working on known time series, with studied characteristics and features, allows a definition of anomaly suited to the domain under examination. All the results obtained in Chapter 4 highlight the main characteristics of the eigenresources, and thus enable a specific definition of anomaly for each behavioral class.
• Extrapolates pattern contributions
Setting apart seasonal and trend components allows the identification of
surprising actions departing from the periodic behavior. This topic is related
to the detection of the so-called collective anomalies: sets of related time series instances that are anomalous with respect to the entire time series. The individual data instances in a collective anomaly may not be anomalies by themselves, but their occurrence together as a collection is
anomalous. By investigating the representative deterministic eigenresource
coming from the application of the PCA-based technique it is possible to
identify collective anomalies in the whole system behavior, otherwise hard
to find.
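A minimal sketch of this idea: compare each period-length window of the series with the average seasonal profile, so that a window whose samples are individually unremarkable but collectively flat is still flagged. The period, the threshold and the synthetic data below are illustrative assumptions:

```python
import numpy as np

def collective_anomalies(series, period, thresh=0.5):
    """Flag period-length windows whose mean absolute deviation from the
    average seasonal profile exceeds `thresh` (in units of the series'
    standard deviation); the threshold is an illustrative choice."""
    x = np.asarray(series, float)
    n = len(x) // period
    cycles = x[: n * period].reshape(n, period)
    profile = cycles.mean(axis=0)              # average seasonal template
    scores = np.abs(cycles - profile).mean(axis=1) / (x.std() or 1.0)
    return [c for c, s in enumerate(scores) if s > thresh]

# Ten sine cycles; cycle 7 is "stuck" at zero. No single sample is extreme,
# but the flat subsequence is anomalous with respect to the whole series.
t = np.arange(500)
x = np.sin(2 * np.pi * t / 50)
x[350:400] = 0.0
print(collective_anomalies(x, period=50))    # → [7]
```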
• Isolates noise components
Distinguishing the noise often present in time series makes it possible to eliminate those components that tend to resemble actual anomalies and hence can affect their detection. Noise can be defined as a phenomenon in the time series which is not of interest to the analyst, but acts as a hindrance to time series analysis. Noise removal is a fundamental step of anomaly detection techniques, since it discards the unwanted samples that may be mistaken for anomalies. Isolating the representative noise eigenresource makes it possible to remove the unwanted objects before anomaly detection is performed on the system representations.
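As a minimal sketch of this step, a centered moving average can stand in for the noise isolation: the raw series is split into a smooth component and a noise residual, and subsequent analysis is run on the smooth part. The filter length and the synthetic series are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(500)
clean = np.sin(2 * np.pi * t / 100)           # deterministic component
noisy = clean + rng.normal(0.0, 0.3, t.size)  # plus additive white noise

kernel = np.ones(11) / 11                     # centered moving average
smooth = np.convolve(noisy, kernel, mode="same")
noise = noisy - smooth                        # isolated noise component

# Detection would run on `smooth`: it tracks the deterministic pattern
# far more closely than the raw series does.
err_raw = np.abs(noisy - clean).mean()
err_smooth = np.abs(smooth - clean).mean()
print(err_smooth < err_raw)                   # → True
```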
Chapter 8
Related work
Most existing runtime techniques for Internet-based system management rely on models built on predictable trends and periodicities, which are in turn isolated from noise and spike influences. One of their main obstacles lies in isolating underlying meaningful patterns from trivial error components. Another problem is that they take decisions on the basis of a representation of single system resources.
The proposed multi-phase methodology overcomes these problems by reducing the computational complexity of runtime system management. It helps a system administrator answer a variety of questions, such as: (1) how to
size the IT infrastructure; (2) which servers are most used and need to be better
investigated; (3) how much and when to add (or remove) physical hardware when
computing demand increases or changes; (4) which could be the best plan for
capacity usage of the entire Data Center, in order to satisfy pre-specified service
level objectives (SLOs).
The four phases of the proposed multi-phase methodology apply known models and algorithms at runtime, but the innovation of this work resides in their application to a new representation of the whole system state. We evidence and discuss the following main contributions:
• Whole system view vs. Single component view;
• Principal Component Analysis vs. Parametric techniques;
• Realistic Internet-based system vs. Simulated systems;
• Runtime decision algorithms vs. Off-line algorithms.
Whole system view vs. Single component view
Modeling resource time series in a single server node has attracted considerable research. Various metrics are collected, analyzed and visualized
for various purposes, such as traffic modeling, capacity planning and re-
source management. On the basis of this information, several researchers
have characterized the Web workload by fitting distributions to data (e.g.,
heavy-tailed distributions [11,13,30], burst arrivals [69] and hot spots [14])
and by proposing performance models driven by such distributions [48].
All the analyses confirm that the external traffic reaching an Internet-based
system shows some periodic behavior [14] that facilitates its interpretation
and management. Hence, existing results are useful for capacity planning
and system dimensioning goals on servers, but they are useless to estimate
at runtime the state of an Internet Data Center and to guide runtime man-
agement methods.
Principal Component Analysis vs. Parametric techniques
Common methods to represent the resource state are based on the periodic
collection of samples through server monitors and on the direct use of these
values. Some low-pass filtering of network throughput samples has been
proposed in [110], but the majority of resource state interpretation algo-
rithms for the runtime management of Internet-based systems are based on
functions that work directly on resource measures [9, 12, 29, 33, 52, 73, 93,
99, 106, 116, 128]. Other studies based on a control theoretical approach to
prevent overload or to provide guaranteed levels of performance in Web sys-
tems [1,72] refer to direct resource measures (e.g., CPU utilization, average
Web object response time) as feedback signals.
Other works [34, 44] have proposed parametric models based on moving averages and on linear regression. Their problem is that modern Internet-based systems are characterized by complex hardware/software architectures and
by stochastic and highly variable workloads that cause instability of system
resource measures. The observed measures of the internal resources are characterized by noise, heteroscedasticity and short-term dependencies, which prevent an initial optimal setting of the parameters and require their continuous update.
The context of Internet-based systems is typically subject to stochastic loads, heavy-tailed distributions [13] and flash crowds [69], extreme variability and a tendency to become obsolete rather quickly [39]. Hence, parametric models based on a static setting of their parameter values are unable to follow the continuous changes of monitored resource measures. On the other hand, techniques providing a dynamic estimation of their parameters are of little help in the face of stochastic and highly variable time series, affected by random errors and strong perturbations. In these contexts, the tuning of their parameters is impossible or extremely time consuming and risks suggesting completely wrong actions.
Principal Component Analysis is a non-parametric technique that helps to distinguish overload conditions from transient peaks, to understand load trends and seasonality, and to isolate undesired noise components. PCA does not make any assumption about the statistical characteristics of resource measures and does not need any parameter setting. It works on the set of monitored samples and extracts their intrinsic dimensionality from the information contained in the time series.
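The parameter-free extraction of intrinsic dimensionality can be sketched as follows: fifty synthetic "resource" series driven by only two hidden patterns are reduced by PCA (via SVD of the centered data matrix), and the number of components needed to explain 95% of the variance recovers the hidden dimensionality. The data, the mixing and the 95% cut-off are illustrative assumptions, not the thesis's measurements:

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(1000)

# 50 monitored resource time series driven by only 2 underlying patterns
seasonal = np.sin(2 * np.pi * t / 288)            # periodic pattern
trend = 2.0 * (t / t.size - 0.5)                  # slow linear trend
mix = rng.normal(size=(50, 2))                    # random mixing weights
X = mix @ np.vstack([seasonal, trend]) + rng.normal(0.0, 0.05, (50, t.size))

# PCA via SVD of the centered data matrix: no parameter to tune
Xc = X - X.mean(axis=1, keepdims=True)
s = np.linalg.svd(Xc, compute_uv=False)           # singular values
var_explained = np.cumsum(s**2) / np.sum(s**2)
k = int(np.searchsorted(var_explained, 0.95)) + 1
print(k)                                          # → 2: the intrinsic dimensionality
```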
Although, to the best of our knowledge, the PCA dimensional analysis has
not been applied for the whole system analysis and for the investigation of
the entire set of measurements of an Internet Data Center, it has been em-
ployed in other contexts, such as face recognition [76], brain imaging [123],
meteorology [105] and fluid dynamics [115].
The PCA-based technique proposed in this work has been validated in other contexts, such as network traffic flows [10] and application workload characterization for utility computing [4]. These studies are limited to the characterization and analysis of input time series behavior and are not oriented to collecting representative visions of the systems they deal with.
Realistic Internet-based system vs. Simulated systems
Our work focuses on real Internet-based systems integrated with load monitoring strategies and management tasks, and characterized by heavy-tailed
workloads that are too complex for an analytical representation [53, 86].
Related studies were oriented to simulation models of simplified architectures [1, 26, 34, 99, 121], which represent an interesting research objective [54] but cannot take into account the interesting and complex issues of real systems.
There are many studies on the characterization of resource loads, albeit related to systems subject to workloads quite different from those considered in this study. For example, the authors in [93] evaluate the effects
of different load representations on job load balancing through a simulation
model that assumes a Poisson job inter-arrival process. A similar analysis
concerning Unix systems is carried out in [52]. Dinda et al. [44] investi-
gate the predictability of the CPU load average in a Unix machine subject
to CPU bound jobs. The adaptive disk I/O prefetcher proposed in [122] is
validated through realistic disk I/O inter-arrival patterns referring to scien-
tific applications. The workload features considered in all these pioneering papers differ substantially from the load models characterizing Internet-based servers, which show stochasticity, bursty patterns and heavy-tailed workloads even at different time scales.
Other papers make strong assumptions on the nature of the workload, which simplify many state representation problems. For example, the authors in [85] present a mechanism that works well with mildly oscillating or sta-
tionary workloads; in [124] stochastic models for the FTP transfer times
are presented; the host CPU load average is studied in [44]; some models
on network traffic, assumed as a Gaussian process, are analyzed in [110].
These assumptions are too restrictive for workloads characterizing modern
Internet-based systems.
Runtime decision algorithms vs. Off-line algorithms
Runtime load state interpretation of the internal resources has not received much attention yet, especially if we refer to Internet-based systems. For ex-
ample, Pacifici et al. [97] propose a model for estimating at runtime the CPU
demand of Web applications, but not for positioning the resource state with
respect to the system resource capacities. Other studies that are oriented
to server management do not consider runtime constraints. Some exam-
ples include load balancer policies [9, 12, 29, 52, 99], overload and admis-
sion controller schemes [99, 100], request routing mechanisms and replica
placement algorithms [73,116], distributed resource monitors [106].
Even the most common methods for load representations oriented to run-
time management tasks work off-line [14,35,44,74,83,110].
Hence, adequate models for supporting runtime management decisions in
highly variable systems represent an open issue. In this work, we address it
in the context of Internet Data Centers. The management decision mecha-
nisms exercised in the last phases of the proposed methodology are based on
widely known and used algorithms for time series analysis, such as smooth-
ing and interpolation algorithms [104,127], forecasting models [88], detec-
tion rules [15], etc. In this work, we implement them in an on-line way
and test their performance for runtime Internet Data Center management.
The constraints due to on-line decisions lead us to consider models that are characterized by low computational complexity.
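The low-complexity constraint can be illustrated with exponential smoothing, one of the model families cited above: each new sample updates the forecast in constant time and constant memory, which makes it suitable for on-line use. The class name and the smoothing constant are illustrative choices:

```python
class OnlineEWMA:
    """Exponentially weighted moving average forecaster: O(1) time and
    O(1) memory per sample (the smoothing constant alpha is illustrative)."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha
        self.level = None          # current smoothed level

    def update(self, sample):
        """Ingest one sample; return the forecast for the next sample."""
        if self.level is None:
            self.level = float(sample)
        else:
            self.level = self.alpha * sample + (1 - self.alpha) * self.level
        return self.level

f = OnlineEWMA(alpha=0.5)
for x in [10.0, 10.0, 14.0]:
    pred = f.update(x)
print(pred)                        # → 12.0 (0.5 * 14 + 0.5 * 10)
```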
Considering state change detection, many stochastic models are oriented to off-line schemes. The historical reference of all these studies is [98].
Subsequent investigations are in [64, 65, 78]. Other theoretical optimal re-
sults about the likelihood approach to load change detection are proposed
in [42]. It is impossible to propose a simple application of these schemes to
a runtime environment, although we could extend some previous theoretical
results, such as the Cusum algorithms, to the on-line state change detection
problem.
Detecting outliers or anomalies in data has been studied in the statistics
community as early as the 19th century [49]. Over time, a variety of anomaly
detection techniques have been developed in several research communities.
Many of these techniques have been specifically developed for certain appli-
cation domains, while others are more generic. Bronstein et al. [23] propose
a variant of the Bayesian networks technique for network intrusion detec-
tion. For collective anomaly detection, the techniques have to either model
the sequence data or compute similarity between sequences.A survey of
different techniques used for this problem is presented by Snyder [117]. A
comparative evaluation of anomaly detection for host basedintrusion de-
tection is presented in Forrest et al. [55] and Dasgupta and Nino [40]. In
this work, we implement anomaly detection techniques at runtime, that is
an interesting research field recently opened.
There is a huge amount of prediction models that are oriented to off-line forecasting. We can cite support vector machines [38], machine learning
techniques, and Fuzzy systems [119]. None of them can be applied or
adapted to support runtime predictions in a variable environment such as
a typical Internet-based system [43].
Moreover, in this thesis, we propose and analyze runtime prediction mod-
els that do not make any assumption (e.g., linearity, stability) on the dis-
tribution of the data set, as required by other works on runtime short-term
predictions [44,85,110,124].
Chapter 9
Conclusions
In this thesis, we consider the problem of on-line and off-line management of
large Internet Data Centers. To this purpose, we propose a whole system analy-
sis that starts from the acquisition of resource information generated from system
monitors. We address several issues, but the most original proposal is related to
Principal Component Analysis, which allows us to represent thousands of time series collected from the Internet Data Center using fewer than 15 independent dimensions.
This surprisingly low dimensionality motivated us to understand the behavior of the Internet Data Center on the basis of these few dimensions. By examining eigenresources, which are the common patterns of variation underlying resource measures, we could develop considerable understanding of the structure of Internet Data Center resources. The set of eigenresources shows three features: deterministic trends, spikes and noise. Furthermore, we discovered more restrictive behavioral subclasses that help our eigenresource characterization. By assembling the contributions of the dimensions belonging to the same behavioral class, we extracted three representative time series collecting the main features of the Internet Data Center.
Our last objective was to examine the extent to which the three representative
visions can help Internet Data Center management. We consider five application
contexts: trend extraction, time series forecasting, state change detection, point anomaly detection and collective anomaly detection. The results of the PCA-based technique simplify whole system analysis and support runtime Internet Data Center management.
The whole system analysis proposed in this work can be enriched by other
models and decision support systems, and can be applied to different contexts.
In particular, we are studying how to automate the process to generate periodical reports for the IT manager, so that it would be possible to analyze system behaviors over several periods of time. These reports may be generated at different levels, such as the operational, tactical and strategic levels, and more.
The whole system analysis can be a guideline for the investigation of different application contexts, such as network traffic flows, application workloads and virtualized environments, but also for non-technological fields.
Bibliography
[1] T. Abdelzaher, K. G. Shin, and N. Bhatti. Performance guarantees for Web server end-systems: A control-theoretical approach. IEEE Trans. Parallel and Distributed Systems, 13(1):80-96, Jan. 2002.
[2] B. Abraham and G. E. P. Box. Bayesian analysis of some outlier problems in time series. Biometrika, 66(2):229-236, Aug. 1979.
[3] B. Abraham and A. Chuang. Outlier detection and time series modeling. Technometrics, 31(2):241-248, May 1989.
[4] B. Abrahao and A. Zhang. Characterizing application workloads on CPU utilization in utility computing. Technical Report HPL-2004-157, Hewlett-Packard Labs, 2004.
[5] E. Aleskerov, B. Freisleben, and B. Rao. Cardwatch: A neural network based database mining system for credit card fraud detection. In IEEE Computational Intelligence for Financial Engineering, pages 220-226, 1997.
[6] C. Alexander and M. Sadiku. Fundamentals of Electric Circuits. McGraw-Hill, 2004.
[7] D. L. Alspach and H. W. Sorenson. Nonlinear Bayesian estimation using Gaussian sum approximation. IEEE Trans. Automat. Contr., 20, 1972.
[8] M. Andreolini, S. Casolari, and M. Colajanni. Models and framework for supporting run-time decisions in Web-based systems. ACM Trans. on the Web, 2(3), 2008.
[9] M. Andreolini, M. Colajanni, and M. Nuccio. Scalability of content-aware server switches for cluster-based Web information systems. In Proc. of WWW, Budapest, HU, May 2003.
[10] L. Anukool, K. Papagiannaki, M. Crovella, C. Diot, E. D. Kolaczyk, and N. Taft. Structural analysis of network traffic flows. In Proc. of the Joint International Conference on Measurement and Modeling of Computer Systems, pages 61-72, 2004.
[11] M. Arlitt, D. Krishnamurthy, and J. Rolia. Characterizing the scalability of a large Web-based shopping system. IEEE Trans. Internet Technology, 1(1):44-69, Aug. 2001.
[12] J. Bahi, S. Contassot-Vivier, and R. Couturier. Dynamic load balancing and efficient load estimators for asynchronous iterative algorithms. IEEE Trans. Parallel and Distributed Systems, 16(4):289-299, Apr. 2006.
[13] P. Barford and M. E. Crovella. Generating representative Web workloads for network and server performance evaluation. In Proc. of the Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS 1998), Madison, WI, June 1998.
[14] Y. Baryshnikov, E. Coffman, G. Pierre, D. Rubenstein, M. Squillante, and T. Yimwadsana. Predictability of Web server traffic congestion. In Proc. of the 10th International Workshop on Web Content Caching and Distribution (WCW 2005), Sophia Antipolis, FR, Sept. 2005.
[15] M. Basseville and I. Nikiforov. Detection of Abrupt Changes: Theory and Application. Prentice-Hall, 1993.
[16] A. M. Bianco, M. García Ben, E. J. Martínez, and V. J. Yohai. Outlier detection in regression models with ARIMA errors using robust estimates. Journal of Forecasting, 20(8):565-579, Dec. 2001.
[17] G. Birkhoff and C. R. de Boor. Piecewise polynomial interpolation and approximation. In Proc. General Motors Symposium of 1964 (H. L. Garabedian, ed.), Elsevier, New York and Amsterdam, 1965.
[18] G. Bishop and G. Welch. An introduction to the Kalman filter. SIGGRAPH, Course 8, 2001.
[19] D. Bonett. Approximate confidence interval for standard deviation of nonnormal distributions. Computational Statistics and Data Analysis, 50(3):775-882, Feb. 2006.
[20] G. Box, G. Jenkins, and G. Reinsel. Time Series Analysis: Forecasting and Control. Prentice Hall, 1994.
[21] P. J. Brockwell and R. A. Davis. Time Series: Theory and Methods. Springer-Verlag, 1987.
[22] P. J. Brockwell and R. A. Davis. Introduction to Time Series and Forecasting. Springer, 2001.
[23] A. Bronstein, J. Das, M. Duro, R. Friedrich, G. Kleyner, M. Mueller, S. Singhal, and I. Cohen. Bayesian networks for detecting anomalies in Internet-based services. In International Symposium on Integrated Network Management, 2001.
[24] V. Cardellini, E. Casalicchio, M. Colajanni, and P. Yu. The state of the art in locally distributed Web-server systems. ACM Computing Surveys, pages 263-311, 2002.
[25] V. Cardellini, M. Colajanni, and P. Yu. Request redirection algorithms for distributed Web systems. IEEE Trans. Parallel and Distributed Systems, 14(5):355-368, May 2003.
[26] V. Cardellini, M. Colajanni, and P. S. Yu. Geographic load balancing for scalable distributed Web systems. In Proc. of the 8th International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 2000), San Francisco, CA, USA, Aug./Sept. 2000.
[27] S. Casolari, M. Colajanni, and F. Lo Presti. Runtime state change detector of computer system resources under non stationary conditions. In Proc. of the 17th Int. Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 2009), Sept. 2009.
[28] S. Casolari and M. Colajanni. Short-term prediction models for server management in Internet-based contexts. Elsevier Decision Support Systems, 48, 2009.
[29] M. Castro, M. Dwyer, and M. Rumsewicz. Load balancing and control for distributed World Wide Web servers. In Proc. of the Intl. Conference on Control Applications (CCA 1999), Kohala Coast, HI, Aug. 1999.
[30] J. Challenger, P. Dantzig, A. Iyengar, M. Squillante, and L. Zhang. Efficiently serving dynamic data at highly accessed Web sites. IEEE/ACM Trans. on Networking, 12(2):233-246, Apr. 2004.
[31] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Comput. Surv., 41(3):1-58, 2009.
[32] C. Chatfield. The Analysis of Time Series: An Introduction. Chapman and Hall, 1989.
[33] H. Chen and P. Mohapatra. Overload control in QoS-aware Web servers. Computer Networks, 42(1):119-133, May 2003.
[34] L. Cherkasova and P. Phaal. Session-based admission control: a mechanism for peak load management of commercial Web sites. IEEE Trans. Computers, 51(6):669-685, June 2002.
[35] B. Choi, J. Park, and Z. Zhang. Adaptive random sampling for load change detection. In Proc. of the 16th IEEE International Conference on Communications (ICC 2003), Anchorage, AK, USA, May 2003.
[36] C. K. Chui. An Introduction to Wavelets. Academic Press, 1992.
[37] H.-P. Company. HP OpenView MeasureWare Agent for Windows NT: User's Manual. HP, 1999.
[38] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3), 1995.
[39] M. Dahlin. Interpreting stale load information. IEEE Trans. Parallel and Distributed Systems, 11(10):1033-1047, Oct. 2000.
[40] D. Dasgupta and F. Nino. A comparison of negative and positive selection algorithms in novel pattern detection. In IEEE International Conference on Systems, Man, and Cybernetics, volume 1, pages 125-130, Nashville, TN, 2000.
[41] P. Del Moral. Non linear filtering: Interacting particle solution. Markov Processes and Related Fields, 2(4), 1996.
[42] J. Deshayes and D. Picard. Off-line statistical analysis of change-point models using non parametric and likelihood methods. In Detection of Abrupt Changes in Signals and Dynamical Systems, pages 103-168, 1986.
[43] L. Devroye, L. Gyorfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, 1996.
[44] P. Dinda and D. O'Hallaron. Host load prediction using linear models. Cluster Computing, 3(4):265-280, Dec. 2000.
[45] M. Dobber, R. van der Mei, and G. Koole. A prediction method for job runtimes in shared processors: Survey, statistical analysis and new avenues. Performance Evaluation, 2007.
[46] D. L. Donoho. High-dimensional data analysis: the curses and blessings of dimensionality. American Mathematical Society Conf. Math Challenges of the 21st Century, 2000.
[47] D. L. Donoho, I. Johnstone, G. Kerkyacharian, and D. Picard. Wavelet shrinkage: Asymptopia? Journal of the Royal Statistical Society B, 57(2), 1995.
[48] A. B. Downey and D. G. Feitelson. The elusive goal of workload characterization. Performance Evaluation, 26(4):14-29, 1999.
[49] F. Y. Edgeworth. On discordant observations. Philosophical Magazine, 23(5):364-375, 1887.
[50] R. F. Engle and K. F. Kroner. Multivariate simultaneous generalized ARCH. Econometric Theory, 11:122-150, 1995.
[51] R. L. Eubank and E. Eubank. Nonparametric Regression and Spline Smoothing. Marcel Dekker, 1999.
[52] D. Ferrari and S. Zhou. An empirical investigation of load indices for load balancing applications. In Proc. of the 12th IFIP International Symposium on Computer Performance, Modeling, Measurements and Evaluation (PERFORMANCE 1987), Brussels, BE, Dec. 1987.
[53] G. Fishman and I. Adan. How heavy-tailed distributions affect simulation-generated time averages. ACM Trans. on Modeling and Computer Simulation, 16(2):152-173, Apr. 2006.
[54] S. Floyd and V. Paxson. Difficulties in simulating the Internet. IEEE/ACM Trans. Networking, 9(3):392-403, Aug. 2001.
[55] S. Forrest, P. D'haeseleer, and P. Helman. An immunological approach to change detection: Algorithms, analysis and implications. In Proc. of the 1996 IEEE Symposium on Security and Privacy. IEEE Computer Society, 1996.
[56] G. E. Forsythe, M. A. Malcolm, and C. B. Moler. Computer Methods for Mathematical Computations. Prentice-Hall, 1977.
[57] A. J. Fox. Outliers in time series. Journal of the Royal Statistical Society, 34(3):350-363, 1972.
[58] R. Fujimaki, T. Yairi, and K. Machida. An approach to spacecraft anomaly detection problem using kernel feature space. In Proc. of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 401-410, New York, NY, USA, 2005. ACM Press.
[59] P. Gaffney and M. Powell. Optimal interpolation. Numerical Analysis, 506, 1976.
[60] R. Gnanadesikan and M. B. Wilk. Probability plotting methods for the analysis of data. Biometrika, 55:1-17, 1968.
[61] A. Graps. An introduction to wavelets. IEEE, 1995.
[62] F. E. Harrell. Regression Modeling Strategies: with Applications to Linear Models, Logistic Regression, and Survival Analysis. Springer, 2001.
[63] E. Hartikainen and S. Ekelin. Enhanced network-state estimation using change detection. In Proc. of the 31st IEEE Conf. on Local Computer Networks, Nov. 2006.
[64] D. V. Hinkley. Inference about the change point in a sequence of random variables. Biometrika, 57:1-17, 1970.
[65] D. V. Hinkley. Inference about the change point from cumulative sum-tests. Biometrika, 58:509-523, 1971.
[66] P. Hoogenboom and J. Lepreau. Computer system performance problem detection using time series models. In Proc. of the USENIX Summer 1993 Technical Conference, pages 1-21. USENIX Association, 1993.
[67] H. Hotelling. Analysis of a complex of statistical variables into principal components. J. Educ. Psy., pages 417-441, 1933.
[68] R. Hyndman, A. Koehler, R. Snyder, and S. Grose. A state space framework for automatic forecasting using exponential smoothing methods. International Journal of Forecasting, 18(3), 2002.
[69] J. Jung, B. Krishnamurthy, and M. Rabinovich. Flash crowds and denial of service attacks: characterization and implications for CDNs and Web sites. In Proc. of the 11th International World Wide Web Conference (WWW 2002), Honolulu, HI, May 2002.
[70] H. F. Kaiser. An index of factorial simplicity. Psychometrika, 39:31-36, 1974.
[71] R. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1), 1960.
[72] A. Kamra, V. Misra, and E. M. Nahum. Yaksha: a self-tuning controller for managing the performance of 3-tiered sites. In Proc. of the Twelfth International Workshop on Quality of Service (IWQoS 2004), Montreal, CA, June 2004.
[73] P. Karbhari, M. Rabinovich, Z. Xiao, and F. Douglis. ACDN: a content delivery network for applications. In Proc. of the 21st Int'l ACM SIGMOD Conference, Madison, WI, USA, 2002.
[74] T. Kelly. Detecting performance anomalies in global applications. In Proc. of the 2nd USENIX Workshop on Real, Large Distributed Systems (WORLDS 2005), San Francisco, CA, USA, 2005.
[75] M. Kendall and J. Ord. Time Series. Oxford University Press, 1990.
[76] M. Kirby and L. Sirovich. Application of the Karhunen-Loeve procedure for the characterization of human faces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(1):103-108, Jan. 1990.
[77] G. Kitagawa. Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. Journal of Computational and Graphical Statistics, 5, 1996.
[78] N. Kligiene and L. Telksnys. Methods of detecting instants of change of random process properties. Automation and Remote Control, 44:1241-1283, 1983.
[79] V. Kumar. Parallel and distributed computing for cybersecurity. IEEE Distributed Systems Online, 6(10), 2005.
[80] F. LeGland, C. Musso, and N. Oudjane. An analysis of regularized interacting particle methods in nonlinear filtering. In Proc. of the IEEE European Workshop on Computer-Intensive Methods in Control and Signal Processing, 1998.
[81] D. J. Lilja. Measuring Computer Performance: A Practitioner's Guide. Cambridge University Press, 2000.
[82] S. Ling and W. K. Li. On fractionally integrated autoregressive moving-average time series models with conditional heteroskedasticity. Journal of the American Statistical Association, 92:1184-1194, 1997.
[83] Y. Lingyun, I. Foster, and J. M. Schopf. Homeostatic and tendency-based CPU load predictions. In Proc. of the 17th Parallel and Distributed Processing Symposium (IPDPS 2003), Nice, FR, 2003.
[84] D. Lu, P. Mausel, E. Brondizio, and E. Moran. Change detection techniques. Int. Journal of Remote Sensing, 2004.
[85] Y. Lu, T. Abdelzaher, L. Chenyang, S. Lui, and L. Xue. Feedback control with queueing-theoretic prediction for relative delay guarantees in Web servers. In Proc. of the 9th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS 2003), Charlottesville, VA, May 2003.
[86] S. Luo and G. Marin. Realistic Internet traffic simulation through mixture modeling and a case study. In Proc. of the IEEE Winter Simulation Conference (WSC 2005), Orlando, FL, USA, 2005.
[87] N. A. Macmillan and C. D. Creelman. Detection Theory: A User's Guide. Lawrence Erlbaum Associates, 2005.
[88] S. G. Makridakis, S. C. Wheelwright, and R. J. Hyndman. Forecasting: Methods and Applications, 3rd ed. John Wiley & Sons, 1998.
[89] S. G. Mallat. A theory of multiresolution signal decomposition: The wavelet decomposition. IEEE Trans. on Pattern Analysis and Machine Intelligence, 11(7), 1989.
[90] R. McGill, J. W. Tukey, and W. A. Larsen. Variations of box plots. The American Statistician, 32:12-16, 1978.
[91] D. Menasce and J. Kephart. Autonomic computing. IEEE Internet Computing, 11(1):18-21, Jan. 2007.
[92] D. A. Menasce, V. A. F. Almeida, and L. W. Dowdy. Capacity Planning and Performance Modeling: from Mainframes to Client-Server Systems. Prentice-Hall, Inc., 1994.
[93] M. Mitzenmacher. How useful is old information. IEEE Trans. Parallel and Distributed Systems, 11(1):6-20, Jan. 2000.
[94] D. C. Montgomery. Introduction to Statistical Quality Control. John Wiley and Sons, 2008.
[95] M. N. Nounou and B. Bakshi. On-line multiscale filtering of random and gross errors without process models. American Institute of Chemical Engineers Journal, 45(5), May 1999.
[96] D. L. Olson and D. Delen. Advanced Data Mining Techniques. Springer, 2008.
[97] G. Pacifici, W. Segmuller, M. Spreitzer, and A. Tantawi. Dynamic estimation of CPU demand of Web traffic. In Proc. of the 1st International Conference on Performance Evaluation Methodologies and Tools (VALUETOOLS 2006), Pisa, IT, Oct. 2006.
[98] E. S. Page. Estimating the point of change in a continuous process. Biometrika, 44:248-252, 1957.
[99] V. S. Pai, M. Aron, G. Banga, M. Svendsen, P. Druschel, W. Zwaenepoel, and E. M. Nahum. Locality-aware request distribution in cluster-based network servers. In Proc. of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 1998), San Jose, CA, Oct. 1998.
[100] R. Pandey, J. F. Barnes, and R. Olsson. Supporting quality of service in HTTP servers. In Proc. of the ACM Symposium on Principles of Distributed Computing, Puerto Vallarta, MX, June 1998.
[101] A. Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw-Hill, 1984.
[102] D. B. Percival and A. T. Walden. Wavelet methods for time series analysis. Cambridge University Press, 2000.
[103] V. V. Phoha. The Springer Internet Security Dictionary. Springer-Verlag, 2002.
[104] D. J. Poirier. Piecewise regression using cubic spline. Journal of the American Statistical Association, 68(343):515–524, 1973.
[105] R. W. Preisendorfer. Principal component analysis in meteorology and oceanography. Elsevier, 1988.
[106] M. Rabinovich, S. Triukose, Z. Wen, and L. Wang. DipZoom: the internet measurement marketplace. In Proc. of 9th IEEE Global Internet Symposium, Barcelona, ES, 2006.
[107] P. Ramanathan. Overload management in real-time control applications using (m,k)-firm guarantee. Performance Evaluation Review, 10(6), Jun. 1999.
[108] P. J. Rousseeuw and A. M. Leroy. Robust regression and outlier detection. John Wiley and Sons, 1987.
[109] C. Runge. Über empirische Funktionen und die Interpolation zwischen äquidistanten Ordinaten. Zeitschrift für Mathematik und Physik, 1901.
[110] A. Sang and S. Li. A predictability analysis of network traffic. In Proc. of the 19th Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM2000), Tel Aviv, ISR, Mar. 2000.
[111] M. Satyanarayanan, D. Narayanan, J. Tilton, J. Flinn, and K. Walker. Agile application-aware adaptation for mobility. In Proceedings of the 16th ACM Intl. Symposium on Operating Systems Principles (SOSP 1997), Saint-Malo, France, Oct. 1997.
[112] A. Schuster. On the investigation of hidden periodicities with application to a supposed 26 day period of meteorological phenomena. Terrestrial Magnetism and Atmospheric Electricity, 3:13–41, 1898.
[113] S. S. Shapiro and M. B. Wilk. An analysis of variance test for normality (complete samples). Biometrika, 52:591–611, 1965.
[114] W. A. Shewhart. Economic control of quality of manufactured product. American Society for Quality, 1980.
[115] L. Sirovich, K. S. Ball, and L. R. Keefe. Plane waves and structures in turbulent channel flow. Phys. Fluids, pages 2217–2226, 1990.
[116] S. Sivasubramanian, G. Pierre, and M. Van Steen. Replication for web hosting systems. ACM Computing Surveys, 36(3):291–334, Aug. 2004.
[117] D. Snyder. Online intrusion detection using sequences of system calls. M.S. thesis, Department of Computer Science, Florida State University, 2001.
[118] C. Spence, L. Parra, and P. Sajda. Detection, synthesis and compression in mammographic image analysis with a hierarchical image probability model. In IEEE Workshop on Mathematical Methods in Biomedical Image Analysis, volume 3, Washington, DC, USA, 2001. IEEE Computer Society.
[119] J. T. Spooner, M. Maggiore, R. Ordonez, and K. M. Passino. Stable Adaptive Control and Estimation for Nonlinear Systems: Neural and Fuzzy Approximator Techniques. John Wiley and Sons, 2002.
[120] T. H. Spreen, R. E. Mayer, J. R. Simpson, and J. T. McClave. Forecasting monthly slaughter cow prices with a subset autoregressive model. Southern Journal of Agricultural Economics, 11(1), 1979.
[121] J. A. Stankovic. Simulations of three adaptive, decentralized controlled, job scheduling algorithms. Computer Networks, 8(3):199–217, June 1984.
[122] N. Tran and D. A. Reed. Automatic ARIMA time series modeling for adaptive I/O prefetching. IEEE Trans. Parallel and Distributed Systems, 15(4):362–377, 2004.
[123] D. Y. Ts'o, R. D. Frostig, E. E. Lieke, and A. Grinvald. Functional organization of primate visual cortex revealed by high resolution optical imaging. Science, pages 417–420, 1990.
[124] S. Vazhkudai and J. Schopf. Predicting sporadic grid data transfers. In Proc. of the 11th IEEE Symposium on High Performance Distributed Computing (HPDC2002), Edinburgh, GBR, Jul. 2002.
[125] D. F. Vysochanskij and Y. I. Petunin. Justification of the 3σ rule for unimodal distributions. Theory of Probability and Mathematical Statistics, 21:25–36, 1980.
[126] A. S. Willsky and H. L. Jones. A generalized likelihood ratio approach to the detection and estimation of jumps in linear systems. IEEE Trans. on Automatic Control, 21(1), 1976.
[127] G. Wolberg and I. Alfy. Monotonic cubic spline interpolation. In CGI '99: Proceedings of the International Conference on Computer Graphics, Washington, DC, USA, 1999. IEEE Computer Society.
[128] R. Wolski, N. T. Spring, and J. Hayes. The network weather service: a distributed resource performance forecasting service for metacomputing. Future Generation Computer Systems, 15(5–6):757–768, 1999.
[129] D. H. Zhou and P. M. Frank. Strong tracking filtering of nonlinear time-varying stochastic systems with coloured noise: application to parameter estimation and empirical robustness analysis. International Journal of Control, 65:295–307, 1996.