demand forecast, resource allocation and pricing for multimedia

DEMAND FORECAST, RESOURCE

ALLOCATION AND PRICING FOR

MULTIMEDIA DELIVERY FROM THE CLOUD

by

Di Niu

A thesis submitted in conformity with the requirements

for the degree of Doctor of Philosophy,

Department of Electrical and Computer Engineering,

at the University of Toronto.

Copyright

c� 2013 by Di Niu.

All Rights Reserved.

Demand Forecast, Resource Allocation and Pricingfor Multimedia Delivery from the Cloud

Doctor of Philosophy ThesisEdward S. Rogers Sr. Dept. of Electrical and Computer Engineering

University of Toronto

by Di NiuMay 6, 2013

Abstract

Video tra�c constitutes a major part of the Internet tra�c nowadays. Yet most video de-

livery services remain best-e↵ort, relying on server bandwidth over-provisioning to guar-

antee Quality of Service (QoS). Cloud computing is changing the way that video services

are o↵ered, enabling elastic and e�cient resource allocation through auto-scaling. In

this thesis, we propose a new framework of cloud workload management for multimedia

delivery services, incorporating demand forecast, predictive resource allocation and qual-

ity assurance, as well as resource pricing as inter-dependent components. Based on the

trace analysis of a production Video-on-Demand (VoD) system, we propose time-series

techniques to predict video bandwidth demand from online monitoring, and determine

bandwidth reservations from multiple data centers and the related load direction policy.

We further study how such quality-guaranteed cloud services should be priced, in both

a game theoretical model and an optimization model.Particularly, when multiple video

providers coexist to use cloud resources, we use pricing to control resource allocation

ii

in order to maximize the aggregate network utility, which is a standard network util-

ity maximization (NUM) problem with coupled objectives. We propose a novel class of

iterative distributed solutions to such problems with a simple economic interpretation

of pricing. The method proves to be more e�cient than the conventional approach of

dual decomposition and gradient methods for large-scale systems, both in theory and in

trace-driven simulations.

iii

Dedicated to my parents

iv

Acknowledgments

First and foremost, I would like to sincerely thank my thesis supervisor, Professor

Baochun Li, for his invaluable guidance throughout my Ph.D. studies. I deeply appre-

ciate his strict training and precious advice in all aspects of my academic development,

which are life-long treasures for myself. I feel very fortunate to have Professor Li as my

supervisor, and cherish all the opportunities to explore my favourite research topics and

exert my own strengths. These would not have been possible without the unwavering

support of Professor Li. The dedication, diligence and enthusiasm he has for an aca-

demic career was contagious and motivational for me, especially during tough times in

my Ph.D. pursuit. I am also thankful for the excellent example Professor Li has provided

as a successful scholar and professor in computer engineering and science.

The members of the iQua research group have contributed immensely to both my

personal and professional time at the University of Toronto. The group has been a

source of good advice and collaboration as well as friendships. I am especially grateful

for the valuable discussions and collaborations with Chen Feng, Hong Xu and Zimu

Liu. Through the projects we worked on and the papers we coauthored, I very much

appreciated their enthusiasm, intelligence, and amazing abilities in mathematical and

experimental studies. I would like to thank all the other colleagues in the iQua research

group, for precious friendships that have supported me throughout the past four years:

Jin Jin, Wei Wang, Yuefei Zhu, Yiwei Pu, Yuan Feng, Junqi Yu, and Elias Kehdi. I

would also like to acknowledge the visiting students in our group: Fangming Liu, Anh

Tuan Nguyen, Zhi Wang, and Boyang Wang for the numerous nights we spent together

in the lab. Other past group members that I have had the pleasure to work alongside

of are graduate students Mea Wang, Chuan Wu, and Hassan Shajonia; visiting scholars

v

Hui Wang and Ruixuan Li; and the numerous summer and rotation students who have

come through the lab.

Lastly, I would like to thank my parents for raising me with a passion for science and

engineering, and for their love and encouragement in all my pursuits. Thank you!

Di Niu

University of Toronto

May 6, 2013

vi

Contents

Abstract iii

iv

Acknowledgments v

List of Tables xii

List of Figures xviii

1 Introduction 1

1.1 Challenges in Multimedia Delivery Systems . . . . . . . . . . . . . . . . . 1

1.2 In the Era of Cloud Computing . . . . . . . . . . . . . . . . . . . . . . . 2

1.3 Thesis Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.4 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2 Related Work 10

2.1 Measurement, Prediction and Learning in VoD Systems . . . . . . . . . . 10

2.2 Cloud Workload Management and Resource Allocation . . . . . . . . . . 13

2.3 Cloud Resource Pricing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

vii

CONTENTS CONTENTS

2.4 Algorithms for Distributed and Parallel Optimization . . . . . . . . . . . 16

3 Forecast in Video-on-Demand Systems 18

3.1 Monitoring UUSee Video-on-Demand System . . . . . . . . . . . . . . . . 20

3.2 Some Prediction Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.3 Modeling and Forecasting the Demand . . . . . . . . . . . . . . . . . . . 24

3.3.1 Population Prediction via Seasonal ARIMA Processes . . . . . . . 27

3.3.2 Forecasting the Aggregate Bandwidth Demand . . . . . . . . . . . 32

3.3.3 Inferring Initial Demand using Mixtures of Gaussians . . . . . . . 34

3.4 Predicting Peer Contribution . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.5 Applications in a P2P System . . . . . . . . . . . . . . . . . . . . . . . . 43

3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

4 Volatility Prediction and Dimension Reduction 48

4.1 Modeling Demand Volatility . . . . . . . . . . . . . . . . . . . . . . . . . 49

4.2 Volatility Estimation and Resource Allocation . . . . . . . . . . . . . . . 52

4.2.1 Comparing Five Volatility Models . . . . . . . . . . . . . . . . . . 53

4.2.2 Utilization as a Volatility Indicator . . . . . . . . . . . . . . . . . 56

4.3 Reducing Volatility through Diversification . . . . . . . . . . . . . . . . . 57

4.4 Dimension Reduction using PCA . . . . . . . . . . . . . . . . . . . . . . 61

4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

5 Cloud Bandwidth Reservation for VoD Applications 70

5.1 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

5.2 Load Direction and Bandwidth Reservation . . . . . . . . . . . . . . . . 75

5.2.1 The Optimal Load Direction . . . . . . . . . . . . . . . . . . . . . 77

viii

CONTENTS CONTENTS

5.2.2 Suboptimal heuristics with Limited Replication . . . . . . . . . . 81

5.3 Experiments based on Real-World Traces . . . . . . . . . . . . . . . . . . 83

5.3.1 A Channel Interleaving Scheme . . . . . . . . . . . . . . . . . . . 84

5.3.2 Algorithms for Comparison . . . . . . . . . . . . . . . . . . . . . 85

5.3.3 Assumption Validation . . . . . . . . . . . . . . . . . . . . . . . . 86

5.3.4 Predictive Auto-Scaling vs. Reactive Provisioning . . . . . . . . . 88

5.3.5 Theoretical Optimal vs. Replication-Limited Heuristics . . . . . . 91

5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

6 The Bandwidth Reservation Market 94

6.1 System Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

6.1.1 The Cloud and Tenants . . . . . . . . . . . . . . . . . . . . . . . 98

6.1.2 The Broker and Load Direction Matrix . . . . . . . . . . . . . . . 99

6.1.3 Bandwidth Pricing in the Presence of a Broker . . . . . . . . . . . 101

6.2 Main Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

6.3 Optimal Bandwidth Saving . . . . . . . . . . . . . . . . . . . . . . . . . 105

6.3.1 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

6.4 Pricing Region in Controlled Markets . . . . . . . . . . . . . . . . . . . . 108

6.4.1 Broker Profit and Price Discounts in the Pricing Region . . . . . 113

6.5 Pricing in Free Markets: A Game Theoretical Analysis . . . . . . . . . . 115

6.5.1 Nash Equilibrium: Existence and Uniqueness . . . . . . . . . . . . 116

6.5.2 Market Properties and Equilibrium Bandwidth Costs . . . . . . . 120

6.6 Trace-Driven Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

6.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124

ix

CONTENTS CONTENTS

7 Algorithmic Cloud Bandwidth Pricing 126

7.1 A New Tenant-Cloud Agreement . . . . . . . . . . . . . . . . . . . . . . 128

7.2 Pricing Towards Social Welfare Maximization . . . . . . . . . . . . . . . 132

7.2.1 An Equivalent Pricing Problem . . . . . . . . . . . . . . . . . . . 134

7.2.2 No Multiplexing vs. Multiplexing All . . . . . . . . . . . . . . . . 138

7.2.3 Economic Implications . . . . . . . . . . . . . . . . . . . . . . . . 140

7.3 Distributed Optimization with Coupled Objectives . . . . . . . . . . . . 141

7.4 A Pricing Framework for Distributed Optimization . . . . . . . . . . . . 143

7.5 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

7.5.1 Dual Decomposition and Gradient Methods . . . . . . . . . . . . 145

7.5.2 New Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

7.5.3 Relationship to the Nonlinear Jacobi Algorithm . . . . . . . . . . 150

7.6 Convergence Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . 151

7.6.1 Contraction Mapping . . . . . . . . . . . . . . . . . . . . . . . . . 152

7.6.2 Convergence of Algorithm 1 . . . . . . . . . . . . . . . . . . . . . 153

7.6.3 Convergence of Algorithm 2 . . . . . . . . . . . . . . . . . . . . . 158

7.6.4 Convergence of Gradient Methods for Dual Problem . . . . . . . . 160

7.6.5 Quadratic Programming: Convergence Speed Comparison . . . . 162

7.6.6 Handling General Coupled Objectives . . . . . . . . . . . . . . . . 165

7.7 Distributed Solutions to Cloud Pricing . . . . . . . . . . . . . . . . . . . 168

7.7.1 Equation Update . . . . . . . . . . . . . . . . . . . . . . . . . . . 169

7.8 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . 173

7.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179

x

CONTENTS CONTENTS

8 Concluding Remarks 180

8.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180

8.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181

Bibliography 185

xi

List of Tables

3.1 Frequently used notations (for a particular channel). . . . . . . . . . . . . 21

3.2 Prediction of population at the initial stage based on mixtures of Gaussians

for flash-crowd video channels. . . . . . . . . . . . . . . . . . . . . . . . 36

5.1 The performance of five schemes averaged over each test period, in terms

of QoS, resource utilization, and replication overhead. . . . . . . . . . . . 89

xii

List of Figures

1.1 Bandwidth auto-scaling with quality assurance, as compared to provision-

ing for the peak demand and pay-as-you-go. . . . . . . . . . . . . . . . . 5

3.1 The evolution of several time series in a typical flash crowd video channel

(with channel ID: A55F) since the channel was released. Each time unit

represents 10 minutes. All rates are average quantities for the channel.

The video bit rate is R = 79.6 KB/s. . . . . . . . . . . . . . . . . . . . . 24

3.2 Population evolution in steady-state channels and various kinds of flash

crowd channels. The samples of steady-state channels start from 2008-

08-08 14:51:58. The samples of flash-crowd channels start from the time

that the corresponding video is released. The video release times are:

channels 4872 and B830, before 2008-08-08 14:51:58; channel F190, 2008-

08-17 00:21:03; channel FA5C, 2008-08-19 01:47:20; channels 7A40 and

E856, 2008-08-12 20:57:34; channel 2E0B, 2008-08-17 23:43:43; channel

D55A, 2008-08-09 00:03:26. . . . . . . . . . . . . . . . . . . . . . . . . . 26

3.3 The power spectrum of online population Nt

for various channels. To

better view periodicity, the x axis is set to be the period measured in

timestamps, which is the inverse of frequency. . . . . . . . . . . . . . . . 27

xiii

LIST OF FIGURES LIST OF FIGURES

3.4 One day-ahead population prediction based on seasonal ARIMA models

for three di↵erent channels. The model is trained based on the data of

first few days. Prediction for Nt

only makes use of the observations {N⌧

:

0 < ⌧ t � 144} up to time t � 144. The two flash-crowd channels have

di↵erent daily decreasing of Nt

. The validation and training sets overlap

for 1 day, since in a seasonal model, the predictor requires the first 144

validation samples to be the initial input. . . . . . . . . . . . . . . . . . 31

3.5 The autocorrelation function (ACF) and partial autocorrelation function

(PACF) of N(t) = r144Nt

for the training data of channel B830. . . . . . 32

3.6 10-minutes-ahead prediction for the aggregate bandwidth demand Dt

in

a popular video channel A55FF released at time period 264, compared

against the trace data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.7 Prediction of initial demand using a mixture of Gaussians. The fitting

function is trained with the first 360 population samples in channel 8331

released on 2008-08-10 19:46:29. Test data 1 is from channel 57F7 released

on 2008-08-18 20:16:35, and test data 2 is from channel EDAF released

on 2008-08-18 20:49:39. R-square1 = 0.5864, R-square2 = 0.1374, RMSE1

=74.44, RMSE2 = 273.55. . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.8 Fitting AR(10) to {ut

} of the first 9 days in channel A55F, and making 2.5

hour (15-step) ahead prediction for the rest of the days. The prediction

for each ut

is conditioned on {u⌧

; 0 < ⌧ t � 15}. . . . . . . . . . . . . . 39

xiv


3.9 A fluid chart illustrating the dependency of ct

on Nt

through the underlying

system dynamics. The white arrows denote the transition rates, while the

black arrows denote the dependencies. N c

t

is the number of cached peers

and uc

(t) is the average upload rate of each cached peer at time t. . . . . 39

3.10 The h-step prediction of average receiving rate ct

and total receiving rate

ct

Nt

from cached peers in channel A55F. Prediction for ct

or ct

Nt

only

makes use of the observations of N⌧

and c⌧

up to time t � h. We choose

d1 = d2 = 0, p = 54, q = 0 in the experiments. The validation and training

sets overlap for 1 day, since in a seasonal model, the predictor requires the

first 144 validation samples to be the initial input. . . . . . . . . . . . . . 42

3.11 Progressive model learning and 2.5 hours ahead (15-step) prediction for

the server bandwidth required by the peers in channel A55F over a 16-day

period, compared with the real server bandwidth requirement calculated

from the traces, and the actual server bandwidth used by the channel in the

traces. The channel is released on 2008-08-10 10:47:39, and the predicted

values start from 2008-08-11 16:47:39 (timestamp 181). The predictor is

retrained with the most recent data on a daily basis. . . . . . . . . . . . 46

3.12 Progressive model learning and the estimation for the average receiving

rate rt

of playing peers in channel A55F over a 12-day period. There is a

2.5-hour (15-step) delay in data collection. A comparison with the real rt

and the server rate st

allocated to each peer is provided. . . . . . . . . . 47

4.1 Online bandwidth demand monitoring and reservation. The reserved band-

width should match the sum of the expected demand and a risk premium

that tolerates volatility. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

xv


4.2 The ACFs of the disturbance Zt

and Z2t

obtained from fitting the seasonal

ARIMA model (3.6) with d = 0 and p = q = 20 to bandwidth demand

{Dt

} in video channel A55FF. . . . . . . . . . . . . . . . . . . . . . . . . 50

4.3 One-step-ahead forecast errors for bandwidth demand Dt

in video channel

A55FF, and predicted conditional standard deviations for the forecast errors. 52

4.4 The empirical CDF of the utilization of booked bandwidth and the ratio

of time periods where the booked bandwidth is insu�cient. Bandwidth

reservation is performed in 169 popular video channels independently, each

with a test period of 2 days. . . . . . . . . . . . . . . . . . . . . . . . . . 55

4.5 Bandwidth reservation with “probabilistic GARCH” in channel A55FF,

tested on a period of 3 days. . . . . . . . . . . . . . . . . . . . . . . . . . 57

4.6 The mean and standard deviation of U when di↵erent numbers of video

channels are randomly combined for bandwidth reservation. The achieved

average e is less than 4% in all cases. . . . . . . . . . . . . . . . . . . . . 59

4.7 Bandwidth reservation with “probabilistic GARCH” for the mixed tra�c

of all the 93 channels in the 6-day period from time 264 to 1127. The first

3 days are the training periods, and the other 3 days (time 697-1127) are

the test periods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.8 Bandwidth consumption time series of 5 representative channels over a

2.5-day period. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

4.9 The first 3 principal components C1—C3 in bandwidth consumption series

of 468 channels during 2 days via principal component analysis. . . . . . 63

4.10 Variance explained in bandwidth consumption series of 468 channels in 2

days. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

xvi


4.11 Data projected onto the first 2 principal components. Each point is one

of the 468 channels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

4.12 Root mean squared errors (RMSEs) of the two prediction methods over

test period 1680-1860 (1.25 days), compared to mean bandwidth consump-

tion in each channel. The three largest channels are channels 295, 241 and

317. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

4.13 10-minute-ahead conditional mean prediction in channel 172 over a test

period of 1.25 days. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

4.14 The departure of actual bandwidth consumption from its conditional mean

forecast and the predicted standard deviation of bandwidth consumption

in channel 172. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

4.15 Q-Q plot of conditional mean forecast errors in channel 172 over the test

period 1680-1860 (1.25 days), in reference to Gaussian quantiles. . . . . 68

5.1 The system decides the bandwidth reservation from each data center and

a matrix W = [wsi

] every �t minutes, where wsi

is the proportion of video

channel i’s requests directed to data center s. DC: data center. . . . . . . 72

5.2 Using demand correlation between channels, we can save the total band-

width reservation, even within each 10-minute period, while still providing

quality assurance to each channel. DC: data center. . . . . . . . . . . . . 73

5.3 The conditional mean demand prediction for virtual new channel 11, with

a test period of 1.5 days from time 1585 to 1800. . . . . . . . . . . . . . . 85

5.4 QQ plot of innovations for t =1562—1640 vs. normal distribution. . . . . 87

xvii


5.5 Predictive vs. reactive bandwidth provisioning for a typical peak period

702–780. There are 35 data centers available, each with capacity 300

Mbps, and 91 channels, including 52 popular channels, 24 small channels,

15 non-zero new channels. K = 2, k = 10. . . . . . . . . . . . . . . . . . 88

5.6 Workload portfolio selection vs. random load direction for a typical peak

usage period from time 1562 to 1640. K = 2, k = 10. . . . . . . . . . . . 92

6.1 A system of two cloud providers and two tenants. Random variables are

labeled with “r.v.”. The data centers are possibly owned by di↵erent cloud

providers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

6.2 The region of Pi

(·) in a good pricing policy {Pi

(·)}. Pi

(·) is between P ⇤i

(·)

and P 0i

(·), and satisfies P 0i

(1) � P ⇤0i

(1). . . . . . . . . . . . . . . . . . . . 113

6.3 The aggregate bandwidthP

s

As

(t) booked by the broker, compared to the

aggregate bandwidthP

i

Bi

(t) needed if each channel books individually

and the real aggregate demandP

i

Di

(t). . . . . . . . . . . . . . . . . . . 123

6.4 The histogram of payment discounts of all channels at all times in equi-

librium, i.e., the histogram of 1 � P ⇤i

(1, t)/P 0i

(1, t) for all i and t. . . . . . 124

7.1 Iterative updates of prices and guaranteed portions. . . . . . . . . . . . 135

7.2 Behavior of Equation Update in a one dimensional case. “o” represents

the starting point (w(0)i

, k(0)i

). . . . . . . . . . . . . . . . . . . . . . . . . 172

7.3 CDF of convergence iterations in all 81 10-minute periods. . . . . . . . . 176

7.4 Final objective values SW(x⇤) (expected social welfare) produced by dif-

ferent algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177

7.5 The evolution of kw(t) � w(t � 1)k1 in the first experiment. . . . . . . . 178

xviii

Chapter 1

Introduction

1.1 Challenges in Multimedia Delivery Systems

Video delivery systems, such as Youtube, Netflix, Hulu, etc., have gained unprecedented

popularity on the Internet nowadays. According to “Cisco Visual Networking Index:

Forecast and Methodology,” globally, Internet video tra�c will be 55% of all consumer

Internet tra�c in 2016, up from 51% in 2011. This does not include video exchanged

through peer-to-peer (P2P) file sharing. Most Internet users watch live and on-demand

videos through a web-browser and http-based video streaming, others may watch videos

through client software downloadable from the video service provider. Other video de-

livery systems include on-line video gaming services such as OnLive [7].

Despite the popularity of Internet video and the ever-increasing demand for improved

video quality, most Internet video services remain best-e↵ort systems. Since video flows

are delay-sensitive, to guarantee the Quality of Experience (QoE) for an end-user, the

video must be delivered from media servers to the end-user at a rate no less than the

video bit rate (at least in the long run). However, QoE is usually not guaranteed in

1

1.2. IN THE ERA OF CLOUD COMPUTING 2

current Internet VoD systems, mainly due to the bounded egress bandwidth from video

servers.1 Most video service providers over-provision the bandwidth capacity of their

streaming servers to provide quality assurance. However, over-provisioning is costly and

even ine↵ective sometimes, since a large amount of server capacity is unused during non-

peak hours, whereas in the event of a flash crowd, when a large number users join the

system, the provisioned capacity may not even be su�cient.

Certain VoD systems adopt a Peer-to-Peer (P2P) architecture (e.g., CoolStreaming

[47], PPLive [8], UUSee [9]), where end-users can help the servers deliver video content

to each other. While leveraging user upload bandwidth can alleviate the burden on

media servers to some extent, the user resources are not dedicated and their contribution

is not reliable. As a result, most P2P video systems are in fact peer-assisted systems,

where the servers still play a major role in streaming, and thus face the same issue of

how much server capacity should be provisioned. Because of the above reasons, in order

to guarantee the QoE, what is missing from today’s video delivery systems is a refined

scheme to accurately predict the online user demand together with a flexible allocation

mechanism that can economically vary the resource provisioning over time.

1.2 In the Era of Cloud Computing

Cloud computing delivers Infrastructure as a Service (IaaS) that integrates computation,

storage and network resources in a virtualized environment. It represents a new business

model where applications as tenants of the cloud can dynamically reserve instances on

1Due to the practice of over-provisioning, Quality-of-Service (QoS) is usually guaranteed at the coreof the Internet, with very small losses and delays.

1.2. IN THE ERA OF CLOUD COMPUTING 3

demand. Cloud computing is changing the way that multimedia content providers op-

erate their businesses, including video-on-demand (VoD) and online gaming companies.

Traditionally, video companies invest in commodity servers as business grows and acquire

monthly bandwidth deals from Internet Service Providers (ISPs). Nowadays, they can be

freed from the complexity of hardware maintenance and network administration, relying

on the cloud for service.

One particular example is Netflix [6], which moved its streaming servers, encoding

software, search engines, and huge data stores to Amazon Web Services (AWS) in 2010

[12, 13]. Moreover, many interactive gaming companies, e.g., OnLive [7], also depend on

the cloud to stream their content to online players.

One of the most important economic appeals of cloud computing is its elasticity

and “auto-scaling” in resource provisioning. As has been pointed out, to accommodate

the peak workload, over-provisioning is traditionally a common practice, leading to low

resource utilization during non-peak hours. In contrast, in the cloud, the number of

computing instances launched can be changed adaptively at a fine granularity with a

lead time of hours or even minutes. This converts the up-front capital infrastructure

investment to operating expenses charged by cloud providers. Since the cloud’s auto-

scaling ability enhances resource utilization by closely matching supply with demand,

overall expenses of the enterprise may be reduced.

However, unlike web servers or scientific computing, VoD is a network-bound service

with stringent bandwidth requirements. A major risk to the video applications using

cloud services is that unlike CPU and memory, bandwidth is not guaranteed in current-

generation cloud platforms (e.g., Amazon EC2), leading to unpredictable network perfor-

mance [17, 63]. A lack of bandwidth guarantee to some degree impedes cloud adoption by

1.3. THESIS SCOPE 4

applications that require such guarantees, such as video-on-demand (VoD) applications

[12] and transaction processing web applications [45]. The utility of tenants running

these applications depends not only on the bandwidth usage, but more importantly on

how many of their end-user requests are served with guaranteed performance.

With an ever-increasing demand for performance predictability, a recent trend in

networking research is to augment cloud computing to explicitly account for network

resources. In fact, data center engineering techniques have been developed to expand the

tenant-cloud interface to allow bandwidth reservation for tra�c flowing from a virtual

machine (VM) in the cloud to the Internet [19, 39]. We envision that in future cloud

platforms, bandwidth reservation will be a value-added feature that attracts tenants,

such as video providers, who seek bandwidth guarantees.

1.3 Thesis Scope

In this thesis, we aim to leverage the auto-scaling ability of the cloud to save the op-

erational cost for multimedia delivery services, while providing quality assurance. Since

bandwidth, as compared to CPU and memory, is a major factor that a↵ects the video

delivery cost and quality, we focus on bandwidth auto-scaling. The benefits of bandwidth

auto-scaling can be intuitively envisioned. As shown in Fig. 1.1(a), traditionally, a VoD

provider acquires a monthly plan from ISPs, in which a fixed bandwidth capacity, e.g., 1

Gbps, is guaranteed to accommodate the anticipated peak demand. As a result, resource

utilization is low during non-peak hours and demand troughs. Alternatively, a usage-

based pay-as-you-go model is adopted by current cloud providers as shown in Fig. 1.1(b),

where a VoD provider pays for the total amount of bytes transferred. However, the band-

width capacity available to the VoD provider is subject to variation due to contention

1.3. THESIS SCOPE 5

Band

wid

th

Demand

Capacity

Days(a) Provisioning for peak demand

1 20 Days

Demand

Capacity

(b) Pay as you go1 20 Days

DemandCapacity

(c) Auto-scaling1 20

Figure 1.1: Bandwidth auto-scaling with quality assurance, as compared to provisioningfor the peak demand and pay-as-you-go.

from other applications, incurring unpredictable performance and QoS issues. Fig. 1.1(c)

illustrates bandwidth auto-scaling and reservation to match demand with su�cient re-

sources, leading to both high resource utilization and quality guarantees. Apparently,

the more frequently the rescaling happens, the more closely resource supply will match

the demand.

The topics that will be covered in this thesis include demand prediction, resource

allocation based on prediction, as well as the economic issues related to cloud resource

pricing for video providers. Our contributions can mainly be divided into the following

four parts:

Demand Forecast. Accurate demand forecast serves as the first step towards a

flexible resource allocation scheme to replace traditional capacity over-provisioning. Tar-

geting multimedia applications, we employ a set of established statistical learning and

time series analysis techniques to automatically forecast demand and infer performance

in large-scale VoD applications. Our work is backed up by 400 GB of operational traces

collected from UUSee2, a major commercial VoD provider based in China. Although

many measurement studies of VoD systems exist in the literature, few practical workload

2Although UUSee adopts a peer-assisted video streaming architecture, our collected data containsthe download bandwidth of all online users sampled every 10 minutes, which reflects user bandwidthdemands regardless of the service architecture adopted.

1.3. THESIS SCOPE 6

forecast mechanisms have been developed for these systems. We utilize the daily peri-

odicity and trends inherent in historical workload data to predict the future population

and bandwidth demand in video services, study the demand volatility, and use a factor

model to reduce the dimension and complexity involved in statistical learning.

Cloud Resource Allocation. Based on such demand forecast, we have developed

resource allocation mechanisms for data centers in the cloud to satisfy video application

requirements with the minimum cost. We study a new type of service where companies

like Netflix can make reservations for (outgoing) bandwidth guarantees from the cloud.

The objective is to reserve as little bandwidth as possible while still guaranteeing the

video delivery quality. We have proposed an auto-scaling system that can adjust resource

allocation for each cloud tenant (a video company or channel) to closely match its short-

term demand prediction, accommodating both the estimated demand expectation and

unexpected demand variation. Borrowing insights from financial risk management, we

have proposed a mean-variance optimization framework to statistically mix negatively

correlated (anti-correlated) workloads to save the bandwidth reservation cost. When

multiple data centers coexist in the cloud, we have derived the optimal direction policy

of tenant demands to di↵erent data centers, proposed practical heuristic solutions and

evaluated our methods through extensive trace-driven simulations.

The Bandwidth Reservation Market. Based on demand forecast and the pro-

posed workload management mechanisms, we further propose new pricing models for

cloud bandwidth resources. The current usage-based cloud pricing or pay-as-you-go is

insu�cient for multimedia applications that require bandwidth guarantees. We study

the pricing of cloud bandwidth reservations for this kind of tenant. Since tenants may

lack expertise in demand estimation, we propose a new business model in which a tenant

1.3. THESIS SCOPE 7

just needs to specify a percentage of its demand, called the guaranteed portion, to be

satisfied with guaranteed service, while the cloud takes the responsibility to fulfill the

guarantee. Leveraging abundant computing power and workload monitoring, the cloud

can predict tenant demands and make corresponding reservations of actual bandwidth

for the tenants.

Under the assumptions that each tenant always chooses a guaranteed portion of 100%

(i.e., each video provider as a cloud tenant always requires the cloud to guarantee all its

end-user requests) and that pricing cannot a↵ect such choices, we have made the following

contributions. First, we characterize the unique Nash equilibrium in the bandwidth

reservation market, where the price that each tenant is willing to pay is a function of its

tra�c burstiness and its demand correlation to the aggregate market demand. Second,

when multiple cloud providers coexist, we propose a profitable cloud brokerage service

similar to “Groupon,” which sells bandwidth guarantees to tenants and jointly reserves

bandwidth from the clouds for the mixed demand while directing requests to di↵erent

cloud providers. We analytically characterize the broker pricing region in which a profit-

driven broker helps enhance cloud resource utilization and lower bandwidth prices. Third,

we develop a proof-of-concept bandwidth trading system driven by real-world demand

traces, incorporating demand estimation, workload management and pricing as the key

components.

Algorithmic Pricing for Cloud Resources. We further consider the case that

a tenant’s choice of guaranteed portion may be a↵ected by the pricing policy and may

not always be 100%. For example, a video provider may require the cloud to guarantee

only 95% of all its end-user requests if such bandwidth guarantees are costly. In this

model, our objective is to find the optimal bandwidth pricing policy to maximize the

1.4. THESIS ORGANIZATION 8

expected social welfare, that is the sum of the profit of all the entities in the system

including the tenants and the cloud provider. The problem is related to network utility

maximization (NUM) with a coupled objective function, which is traditionally solved

through Lagrangian dual decomposition and gradient/subgradient methods. However,

due to the large scale of our algorithmic pricing problem, the gradient method usually

achieves slow convergence because of its over-sensitivity to stepsize choices. In order

to e�ciently solve such a large-scale distributed optimization problem within a small

number of iterations, we propose a novel class of nonlinear distributed and iterative

algorithms, which achieve faster convergence than gradient methods both in theory and

in trace-driven simulations.

Besides bandwidth, bu↵ering and jitter handling are also important in video flow

management. With bu↵ering on the user side, more data blocks in a particular video flow

can be prefetched and stored when bandwidth is abundant, so that the playback can rely

on the bu↵er when bandwidth conditions degrade. For example, the applications (e.g.,

Netflix) that use HTML5, Microsoft Silverlight and HTTP-based streaming all perform

bu↵ering. In this thesis, we do not consider these control mechanisms for individual video

flows. Instead, we focus on satisfying all the end users of a video provider or a video

channel as a whole in the long run instead of individual video flows. Nevertheless, any

individual flow management and bu↵ering scheme can be combined with our cloud-based

video workload management mechanism to achieve good video delivery performance.

1.4 Thesis Organization

The remainder of this thesis is organized as follows. The related literature is reviewed in

Chapter 2. We present e↵ective time-series methods for demand forecast in Chapter 3,

1.4. THESIS ORGANIZATION 9

and study the demand volatility estimation and the dimensionality reduction problem

involved in the statistical learning in Chapter 4. In Chapter 5, we study the problem of

cloud bandwidth allocation and reservation for video-on-demand applications based on

prediction, together with the related load direction problem when multiple data centers

coexist. We then study the bandwidth pricing issues in a broker-assisted cloud band-

width reservation market in Chapter 6, under the assumption that pricing will not a↵ect

tenant demands. In Chapter 7, we further study the algorithmic cloud bandwidth pric-

ing problem when pricing can a↵ect tenant demands, and solve the underlying utility

maximization problem with e�cient new distributed algorithms.

Chapter 2

Related Work

2.1 Measurement, Prediction and Learning in VoD

Systems

Video-on-Demand systems have gained enormous popularity on today’s Internet. Some

examples of production systems include Netflix [6], Hulu [3], Justin.tv [4], Youtube [10],

Metacafe [5], etc. Since the inception of using a mesh-like P2P architecture as a solution

to live media streaming, exemplified in CoolStreaming [75], peer assistance has also

been introduced into video-on-demand services to increase system scalability. Significant

research e↵orts [35], [41], [42], [50], [52] have been devoted to the measurements of video

streaming systems, with a focus on the user behavior, scalability and system performance.

A number of coding and ad hoc optimization techniques [15], [31], [50] have also been

proposed to enhance their performance.

The importance of bandwidth demand estimation to capacity planning in Internet

VoD systems has been recognized recently. It is shown that estimating time-varying

10

2.1. MEASUREMENT, PREDICTION AND LEARNING IN VOD SYSTEMS 11

demands in a large-scale IPTV network can help the system optimally place content on

its geographically distributed servers [16]. Toward this goal, the recent demand history

is used as an estimate of future demand in each video channel [16] and demands for

previously released movies are used as the demand prediction for newly released movies.

Apparently, this simple method does not yield accurate forecasts. Since measurements

show that video workload demonstrates regular diurnal periodicity [16, 59, 70, 72], various

techniques have recently been proposed to forecast large-scale VoD tra�c. An Autore-

gressive Integrated Moving Average (ARIMA) model is introduced in [70] to predict

the population evolution in video streaming systems. To account for diurnal e↵ects,

seasonal ARIMA models have been introduced in our previous work [58, 59] to predict

non-stationary demand evolution at a fine granularity.

However, most existing forecast methods for video demand assume a constant fore-

cast error variance, and thus provision a constant amount of resource cushion for quality

assurance. In fact, our measurements of the UUSee system show that bandwidth de-

mand is subject to rapid changes in some periods, while remaining tranquil and highly

predictable in other periods. We therefore introduce Generalized Autoregressive Condi-

tional Heteroskedastic (GARCH) models [24, 28, 33] from econometrics research to model

the volatility persistence phenomenon — the bandwidth demand at a certain time period

tends to exhibit similar volatility as in recent time periods.

Volatility reduction in the mixed tra�c of multiple channels is similar to the idea of

statistical multiplexing and resource overbooking [67] in shared hosting platforms, where

the resources are booked to satisfy a certain percentile of demand in each application

instead of its worst-case demand, so as to enhance resource utilization. However, the

volatility reduction discussed here is novel in three aspects. First, we are concerned with

2.1. MEASUREMENT, PREDICTION AND LEARNING IN VOD SYSTEMS 12

forward-looking resource allocation and volatility forecasts for future demand, while in

[67] the resource usage of each application is profiled in an o✏ine and fixed manner, ig-

noring the change of demand patterns over time. Second, our study focuses on large-scale

VoD systems, where the concurrent number of users can ramp up by several hundreds

or thousands in tens of minutes. In this scenario, any fixed resource usage profiling for

small video channels (e.g., those with a user population of 20 in [67]) will be insu�cient.

Last but not least, we do not assume independence between the demands of channels.

Instead, we accurately quantify the conditional demand variance in each channel, which

enables the use of financial instruments such as hedging and diversification to achieve

cost-e↵ective server management with service level guarantees.

Principal Component Analysis (PCA) has been proposed in [40] to extract video

demand evolution patterns over longer periods (of weeks or months) and forecast coarse-

grained daily populations. In this thesis, we utilize an approach to find the common

factors that drive the demand evolution of all coexisting video channels using PCA at

a fine granularity. We then make forecasts for individual video channels as regressions

from factor forecasts obtained from the seasonal ARIMA model. Such an approach

combines the strengths of both PCA and the seasonal ARIMA model. Unlike [40], our

approach makes short-term predictions with a lead time of 10 minutes, enabling fine-

grained autoscaling of resource allocation.

Recently, there has been an increasing interest in applying statistical learning tools

to diagnose, predict and improve the reliability of large-scale distributed systems. Bahl

et al. [18] propose Sherlock, which discovers the dependencies among services, software

and physical components in a large enterprise network based on network tra�c, and

leverages such dependencies to detect and localize problems. Chen et al. [30] further

2.2. CLOUD WORKLOAD MANAGEMENT AND RESOURCE ALLOCATION 13

study the performance and limitations of automated dependency discovery from network

tra�c, and propose Orion that discovers dependencies in an enterprise network, using

the delay information between every pair of network flows. Mahimkar et al. [53] focus

on characterizing performance issues in a large-scale IPTV network, and propose Giza,

which applies multi-resolution data analysis to localize regions and physical components

in the IPTV distribution hierarchy that are experiencing performance problems. This

thesis represents one of the first attempts to characterize the dependencies among the

demand statistics and performance metrics in a large-scale video delivery system. We

introduce a systematic time-series analysis approach that learns the user behavior and

system dynamics from online measurements, and utilizes the learned rules to predict

demand and performance for the future.

2.2 Cloud Workload Management and Resource Al-

location

Cloud bandwidth reservation is becoming technically feasible. There have been proposals

on data center tra�c engineering to o↵er elastic bandwidth guarantees for egress tra�c

from virtual machines (VMs) [39]. The idea of virtual networks has also been proposed

to connect the VMs of the same tenant in a virtual network with bandwidth guarantees

[19, 39]. Further, explicit rate control has been proposed to apportion bandwidth accord-

ing to flow deadlines [69]. Such research progress has made the cloud more attractive

to bandwidth-intensive applications such as video-on-demand and MapReduce compu-

tations that rely on the network to transfer large amounts of data at high rates [73].

Netflix, as a major VoD provider, moved its data store, video encoding, and streaming

2.2. CLOUD WORKLOAD MANAGEMENT AND RESOURCE ALLOCATION 14

servers to Amazon AWS [2] in 2010 [12].

Virtualization techniques for supporting cloud-based IPTV services are also being

developed by major U.S. VoD providers such as AT&T [14]. Novel flow control algorithms

and improvements over TCP have been presented in [34] to ensure QoE-aware video

delivery from cloud data centers. Furthermore, video demand forecasting techniques have

been proposed, such as the non-stationary time series models introduced in [58, 59, 60],

and video access pattern extraction via principal component analysis in [40]. All these

recent advances will help the realization of video delivery from data centers, and will

enable e�cient and quality-assured management of video workload in the cloud.

Predictive and dynamic resource provisioning has been proposed mostly for VMs

and web applications with respect to CPU utilization [23, ?, 36, 37, 66] and power con-

sumption [46, 48]. VM consolidation with dynamic bandwidth demand has also been

considered in [68]. Our work exploits the unique characteristics of VoD bandwidth de-

mands and is distinct from the foregoing works in three aspects. First, our bandwidth

workload consolidation is as simple as solving convex optimization for a load direction

matrix. We leverage the fact that unlike VM, demand of a VoD channel can be fraction-

ally split into video requests. Second, our system forecasts not only the expected demand

but also the demand volatility, and thus can control the risk factors more accurately. In

contrast, most previous works [37, 36] assume a constant demand variance. Third, we

exploit the statistical correlation between bandwidth demands of di↵erent video channels

to save resource reservation while previous work, such as [68], considers VM consolidation

with independent random bandwidth demands.

The idea of statistical multiplexing and resource overbooking has been empirically

evaluated for a shared hosting platform in [67]. Our novelty is that we formulate the

2.3. CLOUD RESOURCE PRICING 15

quality-assured resource minimization problem using Value at Risk (VaR), a useful risk

measure in financial asset management [54], with the aid of accurate demand correla-

tion forecasts. We believe that our theoretically grounded approach provides stronger

robustness against the demand volatility in practice.

2.3 Cloud Resource Pricing

Cloud computing, e.g., Amazon EC2, is usually o↵ered with usage-based pricing (pay-

as-you-go) [17, 38]. In addition, cloud resources can also be sold with auction type of

pricing policies in a spot instance market [64, 74], to reflect the underlying demand-

supply relationship in the computing instance market. Di↵erent from pay-as-you-go,

resource reservation involves paying a negotiated cost to have the resource over a time

period, whether or not the resource is used. Although suitable for delay-insensitive

applications, pay-as-you-go is insu�cient as a business model for bandwidth-intensive

and quality-stringent applications like VoD, since no performance guarantees are provided

in general. To support guaranteed cloud services, we need new policies to price not only

the bandwidth usage but also bandwidth reservations.

Cloud brokers, e.g., Zimory [11], have recently emerged as intermediaries connecting

buyers and sellers of computing resources. The engineering aspects of using brokerage to

interconnect clouds into a global cloud market have been discussed in [29]. We propose a

new type of cloud brokerage that multiplexes bandwidth reservations to save cost while

providing quality guarantees to customers.

Amazon Cluster Compute (as of 2011) [1] allows tenants to reserve, at a high cost,

a dedicated 10 Gbps network with no multiplexing. Instead of over-provisioning a fixed

amount of capacity, our proposed broker dynamically books resources in response to

2.4. ALGORITHMS FOR DISTRIBUTED AND PARALLEL OPTIMIZATION 16

demand changes, exempting tenants from demand estimation, for which they have no

expertise. We aim to find the pricing policies under which a selfish broker can also

enhance cloud resource e�ciency and save money for VoD providers. Di↵erent from

usage-based pricing [17], our proposed pricing policy depends on demand statistics such

as burstiness and correlation.

Our bandwidth reservation pricing model is partly inspired by pricing electric power

consumption and capacity reservation under demand uncertainty [62]. However, due

to the computing capability and abundant workload data in the cloud, our bandwidth

reservation pricing theory is essentially a distributed computational problem based on

prediction.

2.4 Algorithms for Distributed and Parallel Opti-

mization

The optimal cloud pricing problem presented in Chapter 7 is related to network utility

maximization (NUM), which has been extensively studied in the past [32, 61, 49]. A num-

ber of primal/dual decomposition methods have been proposed to solve such problems in

a distributed way. Most NUM problems in the literature are concerned with uncoupled

utilities, where the local variables of one network element do not a↵ect the utilities of

other network elements. Unfortunately, this does not apply to systems with competition

or cooperation, where utilities are often coupled, such as in our cloud resource reservation

model in Chapter 7.

Traditional distributed solutions to NUM problems with coupled objectives seek de-

composition from Lagrangian dual problems with auxiliary variables, which are then

2.4. ALGORITHMS FOR DISTRIBUTED AND PARALLEL OPTIMIZATION 17

solved with gradient/subgradient methods. This approach has a simple economic in-

terpretation of consistency pricing [61, 65]. However, gradient methods are sensitive to

the choice of stepsizes, leading to slow and unstable convergence. Furthermore, existing

faster numerical convex problem solvers, e.g., Newton’s method [20, 27], can not be ap-

plied in a distributed way, since they involve a lot of message-passing beyond gradient

information, which is hard to implement and justify physically in reality. In Chapter 7,

we propose a new class of distributed algorithms to solve NUM problems with coupled

objective functions in general. The new methods are e�cient in terms of convergence

performance without complicating the message-passing that is involved.

The nonlinear Jacobi algorithm [21] solves convex optimization problems by opti-

mizing the objective function for each variable in parallel while holding other variables

unchanged, and converges under certain contraction conditions. Yet it cannot be ap-

plied to our pricing problem, since no network element has global information of all the

utilities. However, we will show in Chapter 7 that the nonlinear Jacobi algorithm is a

limiting case of our Algorithm 1 proposed in Sec. 7.5 (executed asynchronously), while

our proposed Algorithm 1 can be understood as a distributed version of the non-linear

Jacobi algorithm.

Chapter 3

Forecast in Video-on-Demand

Systems

Automated demand and performance prediction can help video services instantaneously

estimate their operational cost, avoid performance issues, place servers optimally and

allocate resources dynamically. In this chapter, we investigate the feasibility of automated

forecasting in on-demand video streaming systems, by analyzing the operational traces

that we have collected from UUSee Inc., one of the leading media content providers in

China, during the 2008 Summer Olympic Games.

The first step towards performance prediction is to understand the evolution of de-

mand, which is not only a deciding factor for the resource consumption, but also has

strong links with various system dynamics. From the real-world traces, we discover that

the demand evolution for a video channel is highly predictable — even in the long term

— as users exhibit diurnal behavior and regular habits that can be learned from mea-

surements. To model the diurnal periodicity and trend in the demand evolution, we use

seasonal Autoregressive Integrated Moving Average (ARIMA) models [25], which can

18

CHAPTER 3. FORECAST IN VIDEO-ON-DEMAND SYSTEMS 19

yield excellent prediction results for non-stationary demand evolution.

Surprisingly, not only is the demand of a channel predictable based on the auto-

correlation with its history, videos with similar properties, e.g., popular Olympics videos

released around the same time of day, also demonstrate similar demand evolution pat-

terns. This enables the inference of the initial demand of a newly released channel

based on the statistics of similar channels that were released earlier. We propose a gen-

erative model based on the regression of mixtures of Gaussians to account for such a

phenomenon.

When it comes to the estimation of the demand for server bandwidth and performance

at end users, the peer-assisted architecture adopted by many VoD services, including

UUSee, has posed great challenges to such tasks. Unlike a client-server based system,

a peer-assisted service partly relies on the uploads from end users, which are highly

dynamic, volatile and hard to predict. We adopt a general class of algorithms that apply

non-linear transformations to the performance-related time series, and make predictions

leveraging both the serial dependencies within each series and the interdependencies

among di↵erent series.

We also demonstrate the application of our time-series modeling techniques in the

prediction of server bandwidth requirement and the estimation of peer playback per-

formance at system run-time. In this chapter, we conduct extensive evaluation of the

proposed schemes based on the traces of 40 channels (with simultaneously online peer

population up to 8000) in a 21-day period spanning the entire 2008 Beijing Olympics,

corroborating the feasibility of implementing forecasting functions in real systems.

3.1. MONITORING UUSEE VIDEO-ON-DEMAND SYSTEM 20

3.1 Monitoring UUSee Video-on-Demand System

UUSee is one of the leading commercial multimedia solution providers in China, featur-

ing o�cial online broadcasting rights to the 2008 Beijing Olympics. It simultaneously

broadcasts thousands of on-demand video channels to millions of users distributed across

over 40 countries in the world. Videos in our UUSee dataset are encoded into di↵er-

ent streaming bit rates, ranging from 264 kbps to 1.3 Mbps, corresponding to di↵erent

streaming qualities.

The UUSee on-demand streaming system organizes users that are interested in the

same video into a random mesh network. Once a peer joins a video channel, an initial

set of partner nodes (up to 50) is supplied by one of the tracking servers. The peer then

requests media blocks from its partners based on periodically exchanged block avail-

ability information. In such a mesh network, there are three types of nodes, namely

media servers, playing peers, i.e., peers that are currently watching the video, and cached

peers, i.e., online peers that have previously watched the video and cached it in their

file systems. A number of algorithms and coding schemes are also incorporated to allow

peers to dynamically and optimally alter their partners, and to facilitate e�cient media

stream dissemination [50]. In general, the UUSee network forms a large-scale distributed

application with highly complex and time-varying system dynamics.

To monitor the run-time behavior of the system, we have implemented detailed mea-

surement and reporting capabilities within the UUSee client. Each online peer collects

its vital statistics, and encapsulates them into “heartbeat” reports, which are sent to our

dedicated logging servers every 10 minutes via UDP. Such statistics include the video it

is playing, its instantaneous aggregate download and upload throughput from and to all

its partners, its download and upload bandwidth capacities, as well as a list of all of its

3.1. MONITORING UUSEE VIDEO-ON-DEMAND SYSTEM 21

partners, with the number of blocks sent to or received from each partner within the past

10 minutes, and the type of nodes from which it downloads blocks.

The measurements used for validation in this chapter feature a set of traces collected

from 40 video channels during a 21-day period, from 14:51:58, August 8, 2008 (GMT+8)

to 23:43:56, August 28, 2008 (GMT+8), spanning the entire period of the 2008 Summer

Olympics. Among these channels, 32 videos were published during this period, while

others were released earlier. We believe this trace set best captures the user demand

patterns for di↵erent types of videos and UUSee on-demand streaming performance in a

wide range of scenarios, including large flash crowds gathered to watch popular Olympic

videos as soon as they were published.

Since our analysis focuses on the collective behavior of peers in individual channels,

we process and aggregate the peer heartbeat reports into time series concerning channel

average statistics using MySQL. Notations for frequently used quantities are listed in

Table 3.1. With a sampling interval of 10 minutes, there are 144 samples in each time

series per day.

Table 3.1: Frequently used notations (for a particular channel).R the playback rate, i.e., the bit rate of the videoN

t

the number of playing peers in the channel at time t, i.e., the online populationN c

t

the number of cached peers in the channel at time tD

t

the aggregate bandwidth demand from end users in the channel at time tSt

the aggregate server sending rate in the channel at time tSr

t

the aggregate server bandwidth required by the channel at time t to maintain its playback qualityst

the average server sending rate per playing peer at time tut

the average sending rate of playing peers at time tct

the aggregate sending rate of cached peers at time t divided by Nt

rt

the average receiving rate of playing peers at time t

3.2. SOME PREDICTION PROBLEMS 22

3.2 Some Prediction Problems

We observe that, in a complex peer-assisted on-demand streaming system, there exist

patterns in di↵erent kinds of time series, which contain information regarding human

factors and underlying system dynamics. Such information, if mined and learned from

measurements, will make demand forecast and performance prediction possible. In this

chapter, we consider a number of prediction/estimation problems that have practical

applications in peer-assisted on-demand streaming systems.

Demand Forecast and Inference. Given the statistics of the online population denoted

by {Nt

, Nt�1, . . .} up to time t, the demand forecast problem is to make an h-step predic-

tion of the future population Nt+h

. Furthermore, for a newly released channel with very

limited channel statistics {N0, . . . , Nl

} collected (l > 0 is small), the demand inference

problem is to infer the demand evolution pattern in the first n days, i.e., the evolution

pattern of {Nl+1, . . . , N144n}.

A similar problem is to make an h-step prediction of the future aggregate bandwidth

demand Dt+h

in a video channel, given bandwidth demand series {Dt

, Dt�1, . . .} up to

time t.

Being aware of the presence of peer contribution, we define Sr

t

as the total server

bandwidth required by the channel at time t in order to maintain satisfactory perfor-

mance. Clearly, as a minimum requirement, the average receiving rate of the playing

peers rt

should be at least the video bit rate R. This motivates the following problem:

Predicting Server Bandwidth Requirements. Given observations collected up to time

t, the problem is to make an h-step prediction of the total server bandwidth Sr

t+h

required

by a channel so that the achieved average receiving rate rt

� R. We can first make predic-

tions for the online population Nt+h

, the average receiving rates from playing peers ut+h

3.2. SOME PREDICTION PROBLEMS 23

and from cached peers ct+h

to obtain Nt+h

, ut+h

, and ct+h

. The total server bandwidth

required, Sr

t+h

, can then be forecasted as

Sr

t+h

= Nt+h

(R � ut+h

� ct+h

). (3.1)

In a video channel within the streaming system, the current total server rate St

allocated to the channel and the current online population Nt

can be instantaneously

monitored at the media servers and tracking servers. However, there could be a significant

delay of l timestamps to collect and process the statistics {rt

}, {ct

} and {ut

} from online

peers. This leads to the following problem:

Playback Performance Estimation. Estimate the average receiving rate rt

of playing

peers at the current time t, provided that we have monitored St

, Nt

, and have collected

{rt�l

, rt�l�1, . . .}, {c

t�l

, ct�l�1, . . .} and {u

t�l

, ut�l�1, . . .}. This again can be achieved by

first estimating ct

and ut

given the observations of these series up to time t � l. The

average receiving rate rt

of playing peers can then be estimated as

rt

=St

Nt

+ ut

+ ct

, (3.2)

where rt

, ut

, and ct

are the estimates of rt

, ut

, and ct

, respectively.

Demand and performance forecast can benefit a peer-assisted on-demand streaming

system in performance optimization, risk management, capacity planning, and resource

allocation, especially when thousands of channels coexist in a complex peer-assisted VoD

network. For example, the evolution of di↵erent time series are plotted in Fig. 3.1 for a

typical channel that attracts a flash crowd upon its release. Clearly, there is a performance

issue before t = 300, where the average peer receiving rate rt

is significantly lower than

3.3. MODELING AND FORECASTING THE DEMAND 24

200 400 600 800 1000 1200Time t

0

20

40

60

80

100

120

140 Average Peer Receiving Rate (KB/s)Population (Normalized)

200 400 600 800 1000 1200Time t

0

20

40

60

80

100Receive from Server (KB/s)Receive from Cached Peers (KB/s)Receive from Playing Peers (KB/s)

Figure 3.1: The evolution of several time series in a typical flash crowd video channel(with channel ID: A55F) since the channel was released. Each time unit represents 10minutes. All rates are average quantities for the channel. The video bit rate is R = 79.6KB/s.

the video playback rate 79.6 KB/s1. This is largely due to the insu�cient st

and ct

,

representing the average receiving rates from servers and cached peers, respectively.

A forecasting mechanism can infer the performance issues in a channel by estimating

the current rt

even in the presence of measurement delay, and advise system administra-

tors to take actions accordingly. It can also predict the server requirement in di↵erent

channels at a future time and do the capacity planning in advance to avoid problems.

However, we focus on tackling the unique challenges posed by the prediction, inference

and estimation problems themselves, while the investigation of their applications is not

the focus of this chapter.

3.3 Modeling and Forecasting the Demand

We start with analyzing the real-world traces, to investigate the feasibility of demand

inference and prediction. The demand of each video, which represents the interests of

users, follows a clear pattern and is highly predictable. In Fig. 3.2, we plot the number of

1Throughout this thesis, we use KB/s to represent kbytes per second, and use kbps to represent kbitsper second.


online playing peers, i.e., population, in several typical video channels since these videos

were released. According to their population patterns, channels in our trace set can be

categorized into steady-state channels and flash-crowd channels.

Steady-state channels, as shown in Fig. 3.2(a), usually have been released for a long

time and attract a small number of online users. Although the population in these

channels is roughly steady on a daily basis, it clearly exhibits a diurnal pattern with

a period of 144 (one day). To take a closer look at the periodicity in user demands,

we apply FFT to {Nt

} of the channels in Fig. 3.2(a), and plot their spectral densities

in Fig. 3.3(a), which shows that crests of power density are around period 72 and 144

in both channels. These two cycles correspond to the two demand peaks each day in

Fig. 3.2(a), which are usually around noon and midnight. Note that channel B830 and

4872 reach their peak populations at di↵erent times of day, due to the distinct nature

of the videos, e.g., news attracts more audience at specific times of day, while movies

and entertainment channels are popular during night hours. Fig. 3.3(a) also shows the

existence of high-frequency variations in Nt

.

Flash crowds often immediately follow the release of a popular video, e.g., Olympic

game videos released during our measurement period. Unlike the steady-state channels

in Fig. 3.2(a), flash-crowd channels, as shown in Fig. 3.2(b)(c)(d), attract hundreds or

thousands of interested online peers shortly after their release, with user interest gradually

diminishing over time.

Depending on the popularity, properties and release time of the video, flash-crowd

channels exhibit di↵erent patterns in the initial demand evolution, as well as di↵erent

decreasing speeds in Nt

. For example, a video published after midnight, e.g., F190 and

FA5C, is likely to have its first peak population on the second day around noon, whereas


200 400 600 800 10000

50

100

150

200

Time

Popu

latio

n

Channel_4872Channel_B830

(a) Steady-state channels

250 500 750 10000

500

1000

1500

2000

2500

3000

3500

TimePo

pula

tion

Channel_F190Channel_FA5C

(b) Flash-crowd channels

250 500 750 10000

500

1000

1500

2000

Time

Popu

latio

n

Channel_7A40Channel_E856

(c) Flash-crowd channels

250 500 750 10000

2000

4000

6000

8000

Time

Popu

latio

n

Channel_2E0BChannel_D55A

(d) Flash-crowd channels

Figure 3.2: Population evolution in steady-state channels and various kinds of flash crowdchannels. The samples of steady-state channels start from 2008-08-08 14:51:58. The sam-ples of flash-crowd channels start from the time that the corresponding video is released.The video release times are: channels 4872 and B830, before 2008-08-08 14:51:58; chan-nel F190, 2008-08-17 00:21:03; channel FA5C, 2008-08-19 01:47:20; channels 7A40 andE856, 2008-08-12 20:57:34; channel 2E0B, 2008-08-17 23:43:43; channel D55A, 2008-08-0900:03:26.


0 72 144 288 432Period (timestamps)

102

103

104

105

106

107

108

109

1010|F

FT(N

t)|2

(log

scale)

Channel 4872

Channel B830

(a) Steady-state channels

0 72 144 288 432Period (timestamps)

103

104

105

106

107

108

109

1010

1011

|FFT(N

t)|2

(log

scale)

Channel A55F

Channel F190

(b) Flash-crowd channels

Figure 3.3: The power spectrum of online population Nt

for various channels. To betterview periodicity, the x axis is set to be the period measured in timestamps, which is theinverse of frequency.

a video released before midnight, e.g., 7A40, will have its first peak population around

midnight. The spectral densities of two flash crowd channels A55F and F190 are plotted

in Fig. 3.3(b), from which we also find the periodicity with cycles 72 and 144.

By learning the periodicity and trend that exist in the past demand evolution of a

channel, we can make predictions of its future demand, even for several days. Moreover,

due to the interesting observation that videos with similar properties and similar release

times also share similar demand evolution patterns, we can infer a channel’s population

for the first few days, simply based on the statistics from similar channels that were

released earlier.

3.3.1 Population Prediction via Seasonal ARIMA Processes

If the population time series of a channel has been observed for a few days, we can

make fine-grained prediction of its future demand, leveraging the trend, periodicity and

autocorrelation exhibited in its own history. Due to the seasonality (diurnal pattern) in


all the channels and the decreasing trend in flash-crowd channels, Nt

of any channel is

clearly non-stationary over the long run2. However, we can eliminate both seasonality

and trend in the series via di↵erencing to obtain a stationary yet autocorrelated residual

series, which can then be characterized by models for stationary processes such as ARMA

(autoregressive moving-average) models. Such a model for non-stationary time series is

referred to as the seasonal ARIMA (autoregressive integrated moving-average) model [25]

and is an established time-series analysis technique in statistical learning [25, 28]. In the

following, we briefly describe how such a methodology can be applied to our time-series

forecasting problem.

Given a time series of interest {Yt

}, define the backward shift operator B by BYt

=

Yt�1, and define the lag-1 di↵erence operator r by

rYt

= Yt

� Yt�1 = (1 � B)Y

t

. (3.3)

Powers of B and r are defined in the obvious way, i.e., Bj(Yt

) = Yt�j

and rj(Yt

) =

r(rj�1(Yt

)), j � 1, with r0(Yt

) = Yt

. We further introduce the lag-d di↵erence operator

rd

defined by

rd

Yt

= Yt

� Yt�d

= (1 � Bd)Yt

. (3.4)

For steady-state channels, the population time series {N(t)} has a period of 144 (one

day) and no trend. We therefore remove the seasonality of {N(t)} to obtain a stationary

residual series Nt

= r144Nt

. {Nt

} can then be modeled as an ARMA(p, q) process:

Nt

= �1Nt�1 + . . . + �p

Nt�p

+ Zt

+ ✓1Zt�1 + . . . + ✓q

Zt�q

,

2A process {Yt} is (weakly) stationary if its mean E[Yt] and its covariance function Cov(Yt+h, Yt) ateach lag h are independent of t.


where {Zt

} ⇠ WN(0, �2) denotes the uncorrelated white noise with zero mean, and

�1, . . . , �p

, ✓1, . . . , ✓q are the model parameters. The ARMA process degenerates into an

autoregressive process AR(p) if q = 0, and becomes a moving-average process MA(q) if

p = 0. Equivalently, {Nt

} is modeled by the seasonal ARIMA model,

�(B)r144Nt

= ✓(B)Zt

, (3.5)

where {Zt

} ⇠ WN(0, �2), and �(B) = 1��1B�. . .��p

Bp and ✓(B) = 1+✓1B+. . .+✓q

Bq

are polynomial operators in B of degrees p and q.

For flash-crowd channels, however, both trend and periodicity exist in {Nt

}. The

fluctuation in the data decreases as t increases. We therefore first apply a logarithmic

transformation to {Nt

} to equalize the fluctuation. We then apply the transformation

r144 to {log(Nt

)} to remove seasonality, and di↵erence r144 log(Nt

) for d times to re-

move the trend, obtaining the stationary residual N(t) = rdr144 log(Nt

), which is well

explained by an ARMA(p, q) process. The corresponding seasonal ARIMA model derived

for {Nt

} is

�(B)rdr144 log(Nt

) = ✓(B)Zt

, d 2 {0, 1} (3.6)

where {Zt

} ⇠ WN(0, �2), �(B) = 1��1B � . . .��p

Bp and ✓(B) = 1+ ✓1B + . . .+ ✓q

Bq.

The value of d is chosen from {0, 1}, depending on how fast the daily population decreases

in trend.

Once the parameters of a seasonal ARIMA model are determined based on the training

sequence, the prediction for future Nt

is performed as follows. Let Pt

Nt+h

denote the best

predictor of Nt+h

, h > 0, in terms of the values {N1, . . . , Nt

}. To predict Nt+h

, h > 0,

we first obtain Pt

Nt+h

, the minimum mean square error predictor for the corresponding


residual Nt+h

. If q = 0, i.e., Nt

is modeled by a pure autoregressive process AR(p), we

have (e.g., refer to [25])

Pt

Nt+h

= �1Pt

Nt+h�1 + . . . + �

p

Pt

Nt+h�p

. (3.7)

Pt

Nt+h

can thus be calculated by solving (3.7) recursively for Pt

Nt+1, . . . , Pt

Nt+h

. Pt

Nt+h

can finally be obtained by retransforming Pt

Nt+h

using the inverse of the corresponding

di↵erence operators.

Let us illustrate the modeling and prediction of {Nt

} in a typical steady-state channel

B830. A model of the form (3.5) is fit to the training data from 5 consecutive days in

channel B830 and is used to make demand prediction for the next 9 days, as shown in

Fig. 3.4(a). By checking the autocorrelation function (ACF) of the residual Nt

= r144Nt

of the training data in Fig. 3.5(a), we can see that N(t) is not uncorrelated white noise and

needs to be explained by an ARMA process. Since the partial autocorrelation function

(PACF) of N(t) cuts o↵ at lag 144, it suggests an AR(144) model with p = 144, q = 0

for N(t) [25].

We use the standard Yule-Walker procedure [25] to obtain preliminary estimates of the

parameters �1, . . . , �p

in (3.5) and obtain final estimates through a maximum likelihood

estimator [25]. The trained model is used to make one day-ahead (144-step) demand

prediction, i.e., to predict each Nt+144 based on {N

⌧

; 0 ⌧ t}, as shown in Fig. 3.4(a).

We can see with p = 144, (3.5) is able to capture the slight demand variations from day

to day, which cannot be captured by a smaller p in our experiments.

To predict demand for a flash-crowd channel based on seasonal ARIMA models, we

train the model (3.6) based on the data of the first 2 days since a video was released,

i.e., {Nt

; t1 < t t1 + 288}, where t1 72 is to exclude the initial samples that do not


500 1000 1500 20000

50

100

150

Time

Popu

lation

Validation data144 step!ahead predictionTraining data

(a) Steady state channel B830

200 400 600 800 10000

500

1000

1500

2000

2500

3000

Time

Popu

lation


(b) Flash-crowd channel A55F

200 400 600 800 1000 12000

200

400

600

800

1000

1200

Time

Popu

lation


(c) Flash-crowd channel 7A40

Figure 3.4: One day-ahead population prediction based on seasonal ARIMA modelsfor three di↵erent channels. The model is trained based on the data of first few days.Prediction for N

t

only makes use of the observations {N⌧

: 0 < ⌧ t � 144} up totime t � 144. The two flash-crowd channels have di↵erent daily decreasing of N

t

. Thevalidation and training sets overlap for 1 day, since in a seasonal model, the predictorrequires the first 144 validation samples to be the initial input.

comply with the later evolution pattern (the inference of initial demand pattern based

on statistics of similar channels will be discussed in Sec. 3.3.3).

For channels with a slow decrease of the daily population as illustrated in Fig. 3.4(b),

we set d = 0 and do not di↵erence r144 log(Nt

), while for channels with faster daily

population decrease as illustrated in Fig. 3.4(c), we set d = 1. Similarly, having observed

the PACF of the residual N(t) = rdr144 log(Nt

), which cuts o↵ at lag 36, we choose

p = 36 and q = 0 as the model order, and plot one day-ahead (144-step) prediction


0 50 100 150 200−0.5

0

0.5

1

Lag(a) Sample ACF

0 50 100 150 200−0.5

0

0.5

1

Lag(b) Sample PACF

Figure 3.5: The autocorrelation function (ACF) and partial autocorrelation function(PACF) of N(t) = r144Nt

for the training data of channel B830.

starting from the 3rd day in Fig. 3.4(b) and 3.4(c) for two typical flash-crowd channels

with di↵erent decreasing speeds of daily population.

3.3.2 Forecasting the Aggregate Bandwidth Demand

We can also use the seasonal ARIMA model to predict the aggregate bandwidth demand

in each video channel directly. We use the series {Dt

} to denote the total bandwidth

demand from end users in a particular video channel. We first apply r144 to {Dt

} to

remove the daily periodicity, and then apply di↵erencing to r144Dt

for d times to remove

the trend, obtaining a stationary series D(t) = rdr144Dt

, which is well explained by

an ARMA(p, q) process. The corresponding seasonal ARIMA model [25] for the original

series {Dt

} is thus

�(B)rdr144Dt

= ✓(B)Zt

, d 2 {0, 1}, (3.8)

where {Zt

} ⇠ WN(0, �2) denotes the uncorrelated white noise with zero mean, and

�(B) = 1 � �1B � . . . � �p

Bp and ✓(B) = 1 + ✓1B + . . . + ✓q

Bq are polynomial operators


264 408 552 696 840 9840

1

2

3

4

5

6 x 104

Time (unit: 10 minutes)

KB/s

Trace data1−step predictionTraining data

(a) One-step-ahead prediction of Dt

697 769 841 913 985 1057−1

−0.5

0

0.5

1 x 104


KB/s

(b) Prediction errors

Figure 3.6: 10-minutes-ahead prediction for the aggregate bandwidth demand Dt

in apopular video channel A55FF released at time period 264, compared against the tracedata.

in B of degrees p and q. The di↵erencing order d is chosen from {0, 1}, depending on

whether a trend exists in the daily population variation.

As an example, we make 10-minute-ahead (one-step) prediction of the aggregate band-

width demand {Dt

} in a popular video channel released at time period t0 = 264 (2008-

08-10 10:47:39). The channel has a maximum online population of 2664. The demand

series of the first 3 days is used as the training data, excluding the initial 80 time periods

right after the release of the video, which may not conform to later evolution patterns.

The prediction is tested on the data of 3 days following the training period. We fit model

(3.8) to the training data and obtain parameter estimates through a maximum likelihood

estimator [25]. As shown in Fig. 3.6(a), with d = 0, p = 20, q = 20, model (3.8) can yield


prediction results that are close to the real bandwidth demand in the video channel.

3.3.3 Inferring Initial Demand using Mixtures of Gaussians

We now infer the initial demand evolution pattern in a newly released video channel,

where no past observation is available yet. As shown in the UUSee data, usually, several

irregular surges in the online population evolution are observed before it exhibits clearer

seasonality and trends. Therefore, the inference of the initial demand evolution cannot

make use of seasonal ARIMA models as in Sec. 3.3.1. However, the inference can be done

based on the fact that popular videos released around the same time of day (possibly

on di↵erent dates) demonstrate a similar demand evolution pattern in the first several

days, as we have observed in Fig. 3.2(b)(c)(d). Again, the underlying reason for this

phenomenon is that human users access the VoD system following a tractable diurnal

pattern.

For example, in Fig. 3.2(b), both channels F190 and FA5C are released after midnight,

it is expected that the first peak of demand comes right after their release due to a small

portion of users with the habit of staying up late. The second demand peak comes the

second day around noon, featuring a much larger population of daily users, and the third

demand peak is at midnight of the second day. As another example, Fig. 3.7 shows the

initial demands of 3 di↵erent channels released around 7-9 PM on di↵erent dates. They

also have a similar demand evolution pattern: unlike F190 and FA5C, the first peak

came around midnight after the videos were published, while the second peak came in

the second day around noon.

Assume t = 0 upon the release of a video v. The population {Nt

; 0 t 144n} in

the video channel for the first n days (e.g., n = 2.5) can be modeled as a realization of


the process Nt

= f(t, v) + Zt

, where f(t, v) is the signal underlying the evolution of Nt

in channel v, and Zt

represents a zero-mean random noise. We further model f(t, v) as

a mixture of Gaussians and obtain

Nt

= P (v)kX

j=1

⇡jq

2⇡�2j

exp

✓� (t � µ

j

)2

2�2j

◆+ Z

t

, (3.9)

withP

k

j=1 ⇡j

= 1, where P (v) is a popularity index associated with each video v.

We now group all the videos into a number of classes by their release times, and

assume that

Gt

:=f(t, v)

P (v)=

kX

j=1

⇡jq

2⇡�2j

exp

✓� (t � µ

j

)2

2�2j

◆

is the same pattern that underlies the population evolution of all the videos in the same

class. In other words, all videos in one class share the same parameters ⇡j

, µj

and �j

.

There is a strong probability rationale for the model proposed above, explained by

analyzing how a user chooses the time u(v) at which she starts to watch the video v,

although the success of the model does not depend on such an analysis. Assume each

user only watches the video once and her viewing span is negligible as compared to 144n.

Let µj

, j = 1, 2, . . . , k, represent the j-th peak timestamp (usually at midnight or noon)

after v is published. In reality, a user will watch v around time µ1 < µ2 < . . . with

probabilities ⇡1 > ⇡2 > . . ., since a video is more likely to attract audience when it is

just published. Conditioned on a user choosing to watch v around µj

, u(v) is normally

distributed with density N (µj

, �2j

). Thus, we have

Pr

✓t u(v) < t + 1

◆⇡

kX

j=1

⇡j

· 1q2⇡�2

j

exp

✓� (t � µ

j

)2

2�2j

◆· 1,


Table 3.2: Prediction of population at the initial stage based on mixtures of Gaussiansfor flash-crowd video channels.

Video Release Time 11PM - 1AM 1 - 4AM 7 - 11AM 3 - 5PM 5 - 7PM 7 - 9PM 9 - 11PMNumber of Training Videos 2 2 2 1 3 4 2

Number of Test Videos 2 2 1 1 3 4 2Mean R-Square of Test Videos 0.8029 0.8448 0.6056 0.7873 0.7153 0.7112 0.4193

Max RMSE of Test Videos 1071.30 205.98 227.92 420.86 149.81 259.09 465.99Peak Population in Test Set 8612 2307 1339 4287 1404 1855 3643

Let P (v) be the total number of potential users that are interested in video v. When

P (v) is large, we have Nt

, the number of users that choose to watch the video v at time

t, given by (3.9).

In our experiments, we categorize all 32 flash-crowd channels into 7 classes based

on their release times, as shown in Table 3.2. For each class, we train the mixture of

Gaussians (3.9) with k = 5 using the Expectation-Maximization (EM) algorithm [22]

based on the data {Nt

; 0 < t 144n} from the training channels. For each test channel

in the same class, {Nt

; l < t 144n} are inferred with the trained model, given very

limited initial observations {Nt

; 0 t l} of its own.

{Nt

; 0 t l} is used to avoid estimating P (v) for each video v individually. SinceP

l

t=1 Zt

⇡ 0, we have

P (v)P

l

t=1 Nt

=P (v)

P (v)P

l

t=1 Gt

+P

l

t=1 Zt

=1

Pl

t=1 Gt

, (3.10)

which is the same for all the videos in the same class. Thus, dividing both sides of (3.9)

byP

l

t=1 Nt

and letting ⇡0j

= ⇡j

P (v)/P

l

t=1 Nt

converts the original problem of estimating

P (v), ⇡j

, µj

and �j

into a problem of estimating ⇡0j

, µj

and �j

that are shared by all the

channels in the same class. Such an estimation can be approached by the EM algorithm

for standard mixtures of Gaussians [22].

3.4. PREDICTING PEER CONTRIBUTION 37

0 50 100 150 200 250 300 3500

200

400

600

800

1000

1200

1400

Time

Popu

latio

n

Training dataTest data 1Test data 2Fit to training dataPrediction of test data 1Prediction of test data 2

Figure 3.7: Prediction of initial demand using a mixture of Gaussians. The fittingfunction is trained with the first 360 population samples in channel 8331 released on 2008-08-10 19:46:29. Test data 1 is from channel 57F7 released on 2008-08-18 20:16:35, andtest data 2 is from channel EDAF released on 2008-08-18 20:49:39. R-square1 = 0.5864,R-square2 = 0.1374, RMSE1 =74.44, RMSE2 = 273.55.

We choose n = 2.5 and l = 30. For each test channel, the R-square and root mean

squared error (RMSE) are calculated. R-square is a measure of model fitness ranging in

[0, 1] with 1 being a perfect fit to data. We can see that the proposed method based on

mixtures of Gaussians achieves satisfactory performance as a best-e↵ort coarse-grained

predictor in most classes, with relatively small RMSEs compared to the peak population

sizes in the test set. An example of the prediction for two channels based on the model

trained from only one channel is plotted in Fig. 3.7.

3.4 Predicting Peer Contribution

In contrast to a client-server based solution, a major reason that a↵ects the stability

of peer-assisted video streaming is that peer bandwidth contributions are much more

dynamic and less reliable than with dedicated servers. Such bandwidth contributions


are good to have as they reduce server costs, but are di�cult to predict. Modeling peer

contribution {ct

} and {ut

} plays an important role in understanding the dynamics of

a peer-assisted video service. The prediction of {ct

} and {ut

} is also critical in various

performance prediction problems outlined in Sec. 3.2. Since steady-state channels have

similar daily population evolution, the performance time series in such channels also

exhibit steady and normal behavior. In this section, let us focus on the challenging case

of flash-crowd channels.

As illustrated in the right-hand-side of Fig. 3.1, we find that in almost all channels,

the average sending rate of playing peers {ut

} is a stationary noisy process (“Receive from

Playing Peers (KB/s)” in that figure). By checking its partial autocorrelation function,

we find that it can be modeled as a low order AR(p) process,

�(B)(ut

� µu

) = Zt

, (3.11)

where Zt

is white noise and µu

is the sample mean of {ut

}. For example, Fig. 3.8 shows

the prediction performance of an AR(10) model fitted to {ut

} in a flash-crowd channel.

In contrast, the average receiving rate from cached peers ct

is much harder to predict.

As shown in the right-hand-side of Fig. 3.1, ct

increases from 0 to a relatively stationary

value over time, as more playing peers finish downloading and become cached peers. As

{ct

} has a slow increasing trend, the first intuition is to transform ct

into rct

, which

actually has an ACF similar to white noise. However, such a model (�(B)rct

= Zt

with a small order p for �(B)) is ine↵ective in predicting future values of ct

. The model

yields a linear predictor that forecasts future ct

’s around a slowly increasing line, which

is actually an under-fit of {ct

} and misinterprets a large part of the system dynamics

as noise. Moreover, it fails to capture the slightly concave trend in ct

as t grows — ct


500 1000 1500 2000 25000

10

20

30

40

50

Time

KB/s

Validation dataPredictionTraining data

Figure 3.8: Fitting AR(10) to {ut

} of the first 9 days in channel A55F, and making 2.5hour (15-step) ahead prediction for the rest of the days. The prediction for each u

t

isconditioned on {u

⌧

; 0 < ⌧ t � 15}.

will stop increasing when the channel evolves into its steady state after several days.

Therefore, ct

is hardly predictable only based on its own history.

Cached Peers

Playing PeersNt N c

t

ct =uc(t)N c

t

Nt

�1(t)Abort

�2(t)Leave�1(t)Join

Finish downloading

�2(t)Nt

Figure 3.9: A fluid chart illustrating the dependency of ct

on Nt

through the underlyingsystem dynamics. The white arrows denote the transition rates, while the black arrowsdenote the dependencies. N c

t

is the number of cached peers and uc

(t) is the averageupload rate of each cached peer at time t.

However, we note that {ct

} has a strong dependency on {Nt

}, which we can make

use of in our prediction. Let N c

t

denote the number of cached peers in the channel and

uc

(t) denote the average upload rate of each cached peer at time t. Clearly, we have

ct

= uc

(t)N c

t

/Nt

. The fluid chart in Fig. 3.9 illustrates how {Nt

} is a deciding factor for

{N c

t

} and thus for ct

through the underlying system dynamics. Assume that each playing


peer finishes downloading with rate �2(t), then N c

t

= N c

t�1+�2(t)Nt

��2(t). Since Nt

can

be expressed as an autoregressive function of its past values, N c

t

is able to be predicted

as a function of {N c

t�1, Nc

t�2, . . .} and {Nt

, Nt�1, . . .}, and so is c

t

= uc

(t)N c

t

/Nt

.

Such a dependency of ct

on {ct�1, ct�2, . . .} and {N

t

, Nt�1, . . .} is non-linear in nature.

However, we can transform both {ct

} and {Nt

} into stationary series by a logarithm

transformation and the di↵erencing transformation used in Sec. 3.3.1, so that the trans-

formed series {ct

} and {Nt

} can be linked together through a linear time-invariant (LTI)

system of the form

ct

=pX

i=1

�i

ct�i

+qX

i=0

!i

Nt�i

+ Zt

, (3.12)

or more concisely, using back shift operator B,

�(B)ct

= !(B)Nt

+ Zt

, (3.13)

where Zt

is white noise, �(B) = 1� �1B � . . .� �p

Bp and !(B) = !0 +!1B + . . .+!q

Bq.

Specifically, we take the logarithm of ct

and Nt

to suppress fluctuation, remove sea-

sonality and trends by di↵erencing, and subtract the means to get the mean-corrected

stationary processes ct

, Nt

. We can thus obtain a transfer function model [25] with added

noise as follows, 8>>>>><

>>>>>:

ct

= ��1(B)!(B)Nt

+ �(B)�1Zt

,

Nt

= rd1r144 log(Nt

) � µ1, d1 2 {0, 1},

ct

= rd2r144 log(ct

) � µ2, d2 2 {0, 1},

(3.14)

where {Zt

} ⇠ WN(0, �2), and µ1 and µ2 are the means of rd

Nr144 log(Nt

) and rd

cr144 log(ct

),

respectively. As a result, by appropriately choosing d1, d2 2 {0, 1}, for any flash-crowd

channel, the relationship between {ct

} and {Nt

} can be characterized by the model (3.14).


Given the training data, the maximum likelihood estimates of �’s and !’s in (3.14) can

be obtained by minimizing the conditional sum-of-squares function [25] or by a non-

linear least squares algorithm [51], which are standard estimation algorithms in system

identification theory.

Experiments show that the transfer function model (3.14) yields very accurate pre-

diction for ct

. We illustrate the prediction procedure with a typical flash-crowd channel

A55F. We choose d1 = d2 = 0 and learn the model parameters using the data of the

first few days. For a lead time h, each ct

in the validation data is predicted based on

{ct�h

, ct�h�1, . . .} and {N

t�h

, Nt�h�1, . . .}. We choose d1 = d2 = 0, p = 54, q = 0. From

Fig. 3.10(a), we see that a model trained from the first 6 days of data yields very accurate

150 minute-ahead (15-step) prediction for ct

.

We also evaluate the h-step prediction for the total receiving rate from cached peers

ct

Nt

shown in Fig. 3.10(b) and Fig. 3.10(c). ct

Nt

is predicted as ct

Nt

, where the predic-

tions ct

and Nt

are calculated using the models (3.14) and (3.6) in Sec. 3.3.1, respectively,

based on observations up to time t � h. Fig. 3.10(b) shows that the proposed method

can predict the total receiving rate from cached peers accurately for 2.5 hours into the

future. An interesting phenomenon is that although ct

is increasing over time, the peak

of ct

Nt

is around the 5th day due to a decreasing number of Nt

. The proposed method is

able to predict such a peak in ct

Nt

1 hour ahead of time, simply with a training period

of the first 3 days, as shown in Fig. 3.10(c), although with lower accuracy as compared

to Fig. 3.10(b).


500 1000 1500 2000 25000

20

40

60

80

100

Time

KB/s


(a) ct, with a 6-day training period, h = 15

500 1000 1500 2000 25000

1

2

3

4

5

6x 104

Time

KB/s


(b) ctNt, with a 5-day training period, h = 15

500 1000 1500 2000 25000

1

2

3

4

5

6x 104

Time

KB/s


(c) ctNt, with a 3-day training period, h = 6

Figure 3.10: The h-step prediction of average receiving rate ct

and total receiving ratect

Nt

from cached peers in channel A55F. Prediction for ct

or ct

Nt

only makes use of theobservations of N

⌧

and c⌧

up to time t � h. We choose d1 = d2 = 0, p = 54, q = 0 inthe experiments. The validation and training sets overlap for 1 day, since in a seasonalmodel, the predictor requires the first 144 validation samples to be the initial input.

3.5. APPLICATIONS IN A P2P SYSTEM 43

3.5 Applications in a P2P System

Having elaborated the algorithms for modeling and predicting the online population {Nt

},

the average receiving rates {ut

} and {ct

} from playing and cached peers, we are ready

to solve the problems of server bandwidth requirement prediction and peer playback

performance estimation proposed in Sec. 3.2.

Given observations collected up to time t, we want to make an h-step prediction of the

total server bandwidth Sr

t+h

required by a channel so that the achieved rt

� R. This can

be done by making individual predictions for Nt+h

, ut+h

, and ct+h

, and predicting Sr

t+h

using (3.1). In the presence of data collection delay l, we also aim to estimate the average

receiving rate rt

of playing peers at the current time t, provided that we have monitored St

and Nt

, and collected {rt�l

, rt�l�1, . . .}, {c

t�l

, ct�l�1, . . .} and {u

t�l

, ut�l�1, . . .}. Similarly,

we can first infer ut

and ct

, based on which rt

can be estimated by (3.2).

We propose an iterative procedure for progressive model learning and performance

prediction that can be practically applied in real-world on-demand streaming channels.

Specifically, after the release of a channel, we first train the models (3.6), (3.11) and

(3.14) for {Nt

}, {ut

} and {ct

} using the data of the first 1-2 days. This preliminary

training period should be more than 1 day in order to incorporate the daily variation

of the data into the models. The models are retrained periodically during the system

operation, with a progressively increased order, based on all the measurements collected.

We illustrate such a procedure in a typical flash-crowd channel A55F. The preliminary

model is trained using the data of the first 1.25 days (180 timestamp) assuming an AR(p)

for (3.6), (3.11) and (3.14) with p = 10. p is increased by 10 (except for (3.11)) each

time, as the model is retrained every 24 hours. The prediction/estimation starts at time

181 and is always made with the latest trained model.

3.5. APPLICATIONS IN A P2P SYSTEM 44

We make 2.5 hours ahead (h=15) prediction for the server bandwidth Sr

t+h

required

by the channel and plot the results in Fig. 3.11. We can see that the predicted per-peer

server bandwidth required, R�ut+h

�ct+h

, is quite close to its real value R�ut+h

�ct+h

and

so is the total server requirement Sr

t+h

to its real value Sr

t+h

. We also see that the actual

server bandwidth used by the channel is much less than the required minimum amount

at many points, especially in the first 5 days. This accounts for the performance issues

after the channel is just released, as observed in Fig. 3.1. Incorporating our prediction

mechanism into the real system would forecast the server bandwidth requirement 2.5

hours ahead of time and suggest actions for server reallocation. Fig. 3.11(b) also shows

that the mechanism is able to predict an abnormal surge in the bandwidth requirement

at time 2075 because the seasonal ARIMA model (3.6) accurately predicts a surge in Nt

at this time. Moreover, the entire simulation of the above procedure takes less than 90

seconds on a Macbook Pro laptop with a 2.26 GHz duo-core processor.

Fig. 3.12 shows the estimation for rt

if the measurement collection is delayed by 2.5

hours (l = 15). The first 1.25 days (180 timestamps) are the preliminary training period.

We can see that the concave trend of rt

is successfully forecasted, and the estimation

becomes more accurate as the model is progressively refined with more data available.

Again, the proposed mechanism is able to discover the insu�cient rt

as compared to

the video bit rate (79.6 KB/s) starting from time 181, and would advise the system to

increase the server bandwidth allocated to the channel to compensate for the shortage

before time 500.

3.6. SUMMARY 45

3.6 Summary

In this chapter, we have addressed the issues of demand forecast and performance pre-

diction in on-demand streaming services either with or without peer assistance. Despite

the complexity of such large-scale distributed systems, we find that there exist clear pat-

terns in the evolution of user demands as well as dependencies within the time-series

related to various performance metrics. With the aid of online measurement collection, a

class of non-stationary time-series analysis techniques is used to enable demand predic-

tion and performance forecast. We also shed light on how such a monitoring-prediction

functionality can be incorporated in real-world systems at run time.

3.6. SUMMARY 46

180 468 756 1044 1332 1620 19080

20

40

60

80

Time

KB/s

Server bandwidth required per peerPredicted server bandwidth required per peerActual server bandwidth used per peer

(a) Prediction for the server bandwidth required per peer

180 468 756 1044 1332 1620 1908 2196 24840

2

4

6

8

10 x 104

Time

KB/s

Total server bandwidth requiredPredicted total server bandwidth requiredActual total server bandwidth used

(b) Prediction for the total server bandwidth required by the channel

Figure 3.11: Progressive model learning and 2.5 hours ahead (15-step) prediction for theserver bandwidth required by the peers in channel A55F over a 16-day period, comparedwith the real server bandwidth requirement calculated from the traces, and the actualserver bandwidth used by the channel in the traces. The channel is released on 2008-08-10 10:47:39, and the predicted values start from 2008-08-11 16:47:39 (timestamp 181).The predictor is retrained with the most recent data on a daily basis.

3.6. SUMMARY 47

0 288 576 864 1152 1440 17280

50

100

150

Time

KB/s

Server sending rate per playing peerActual average peer receiving ratePredicted average peer receiving rate

Figure 3.12: Progressive model learning and the estimation for the average receiving ratert

of playing peers in channel A55F over a 12-day period. There is a 2.5-hour (15-step)delay in data collection. A comparison with the real r

t

and the server rate st

allocatedto each peer is provided.

Chapter 4

Volatility Prediction and Dimension

Reduction

Demand forecast based on past observations enables the system to allocate the right

amount of bandwidth to match the demand. An online bandwidth monitoring and reser-

vation framework, as shown in Fig. 4.1, can be implemented and incorporated into ex-

isting (non-P2P) VoD systems. It monitors the bandwidth demand in a particular video

channel periodically, e.g., every 10 minutes, learns models to forecast future demand,

and judiciously decides the amount of server bandwidth to be provisioned in the next

time period.

However, as the predictor in Chapter 3 only forecasts the conditional mean demand,

which is the expected future demand conditioned on the history, the real demand may

vary around the predicted conditional mean. Due to the existence of prediction errors,

we need to provision an additional “risk premium” to tolerate demand fluctuation, which

depends on the variance of prediction errors. Prediction errors of bandwidth demand

48

4.1. MODELING DEMAND VOLATILITY 49

Demand Forecast

Expected Demand

Server UsageMonitoring

Risk Premium

Volatility PredictionModel

Learning

Bandwidth Reservation

Figure 4.1: Online bandwidth demand monitoring and reservation. The reserved band-width should match the sum of the expected demand and a risk premium that toleratesvolatility.

usually do not have a constant variance: there are periods where the prediction is rel-

atively accurate alongside periods with less trustworthy forecasts. In other words, the

bandwidth demand is highly predictable in some periods, but becomes highly variable

and unpredictable in other periods. In this chapter, we will present methods to estimate

the changing variance of the prediction error, an important factor to consider in resource

allocation.

The predictor presented in Chapter 3 has another weakness in that a di↵erent model

needs to be trained for each video channel, which is not scalable to a large number of

video channels in practice. In this chapter, we will also propose methods to reduce the

computational complexity of the model training based on principal component analysis

(PCA). In this chapter, we focus on non-P2P systems.

4.1 Modeling Demand Volatility

The changing demand volatility poses great challenges to e�cient resource provisioning.

As traditional time-series techniques assume a constant variance for the disturbance,


0 5 10 15 20−0.5

0

0.5

1

Lag

ACF

(a) ACF of Zt

0 5 10 15 20−0.5

0

0.5

1

Lag

ACF

(b) ACF of Z2t

Figure 4.2: The ACFs of the disturbance Zt

and Z2t

obtained from fitting the seasonalARIMA model (3.6) with d = 0 and p = q = 20 to bandwidth demand {D

t

} in videochannel A55FF.

the “risk premium”1 provisioned will be a function of the fixed prediction error vari-

ance averaged over the long run. However, such an approach will necessarily result in

over-provisioning when demand is less volatile, but insu�ciency when demand varies

drastically. To achieve e�cient resource allocation, we need a conditional variance fore-

cast mechanism to estimate the time-changing variance of demand around its conditional

mean, and adjust the amount of risk premium provisioned accordingly.

Such a “risk premium” should increase when we are less certain about the accuracy

of the conditional mean demand forecast, and decrease otherwise, when the conditional

variance of future demand is low. In contrast, the unconditional variance, i.e., the long-

run estimate of forecast error variance averaged over time, would not be important if we

care about the instantaneous “risk premium” needed.

Although the seasonal ARIMA model (3.8) is a good conditional mean model that

predicts the expectation of bandwidth demand Dt+h

conditioned on {Dt

, Dt�1, . . .}, it

fails to capture the serial dependency within the disturbance series {Zt

} obtained from

1The additional resources allocated to guard against demand fluctuation.


fitting (3.6) to {Dt

}. To check such dependency, we plot the autocorrelation functions

(ACFs) of Zt

and Z2t

in Fig. 4.2. We can see that although {Zt

} is an uncorrelated

white noise, it is not independent and identically distributed (IID) — the variance term

Z2t

clearly depends on Z2t�1, Z

2t�2, . . .. In fact, we can observe a persistence of volatility

for Zt

from Fig. 3.6(b): Zt

tends to exhibit a similar conditional variance as in recent

periods.

To include past variances in the explanation of future variances, we model Zt

using the

GARCH (generalized autoregressive conditional heteroscedasticity) process [24], which

has been successfully applied to modeling the volatility of stock data for the past decade.

Specifically, we model the disturbance Zt

obtained from fitting model (3.8) to {Dt

} as a

GARCH(P, Q) process:

8><

>:

Zt

=p

ht

et

, {et

} ⇠ IID N (0, 1),

ht

= ↵0 +P

P

i=1 ↵i

Z2t�i

+P

Q

j=1 �j

ht�j

,

(4.1)

where ↵0 > 0 and ↵j

, �j

� 0, j = 1, 2, . . ., and ht

is the conditional variance of Zt

given

its history {Zs

; s < t}. The GARCH model reflects the evolution of the variance in data

by incorporating correlation in the sequence {ht

} of conditional variances.

Taking channel A55FF as an example, we fit a GARCH(1, 1) model to the one-step-

ahead prediction errors for the bandwidth demand {Dt

} shown in Fig. 3.6(b). The model

parameters are obtained using maximum likelihood estimation (pp. 417, [25]) based on

the prediction errors of model (3.8) during the training period. We predict the conditional

standard deviations {p

ht

} of the prediction errors for the test period using the trained

model, and plot the results in Fig. 4.3. We can see that the predicted error standard

deviation is larger when the demand prediction errors are highly variable.

4.2. VOLATILITY ESTIMATION AND RESOURCE ALLOCATION 52

697 769 841 913 985 1057

−5000

0

5000

10000


KB/s

Prediction errorPredicted error standard deviation

Figure 4.3: One-step-ahead forecast errors for bandwidth demand Dt

in video channelA55FF, and predicted conditional standard deviations for the forecast errors.

With the GARCH model, we are able to forecast by how much the real demand will

deviate from the predicted conditional mean produced by model (3.8). It allows us to

quantify our certainty about bandwidth demand forecast so that server provisioning can

leverage this fact to enhance resource utilization. When we are certain about the Dt

in

the next time period, the server bandwidth allocated for the channel should be close to

the predicted conditional mean. On the other hand, during periods where Dt

is subject

to rapid changes and less predictable, we need to provision a higher risk premium to

tolerate the demand volatility in the video channel.

4.2 Volatility Estimation and Resource Allocation

In this section, we present the application of volatility forecasts to resource allocation. To

achieve service level guarantees to users without over-provisioning, the resource allocated

should match the future demand with a conditional mean forecast plus a “risk premium”

that tolerates tra�c volatility.

In general, the e↵ectiveness of a resource allocation scheme can be evaluated by two

performance metrics: 1) insu�ciency ratio e, which is the ratio of time periods where


the booked resource is lower than the actual demand over all the test periods; and 2)

time-averaged utilization U , which is the average utilization of the allocated resource over

all the test periods.

To provide quality assurance to users, it is expected to maintain the insu�ciency

ratio e to a low level. On the other hand, for the cost-e↵ectiveness of servers, we should

keep the average utilization U at a high level by booking resources sparingly. Striking

a balance between the two conflicting objectives, the key to successful resource booking

is to decide the minimum necessary “risk premium” Rt+1 at time period t + 1 given

observations up to time t that achieves a target insu�ciency ratio, e.g., e 2%, under

an appropriate volatility model.

4.2.1 Comparing Five Volatility Models

We propose five proactive server bandwidth reservation schemes for (non-P2P) VoD sys-

tems, each based on a di↵erent volatility model including GARCH and other heuristics.

To reserve bandwidth for a video channel at time t + 1, all these schemes periodically

monitor the bandwidth demand {D1, . . . , Dt

} of this channel by checking server logs (e.g.,

at a 10-minute frequency), and predict the conditional mean demand Pt

Dt+1 using the

method described in Sec. 3.3.2. However, they incorporate di↵erent volatility models to

determine the “risk premium” provisioned. Specifically, assuming e = 2%, these schemes

are described as follows:

Constant variance (baseline method) assumes a constant variance �2 in demand

forecast errors, which can be learned from training data. As demand forecast errors

exhibit Gaussian distribution in the traces, the risk premium is the 98th percentile of

the normal distribution N (0, �2), i.e., the value below which 98% of samples from the


distribution N (0, �2) fall.

Probabilistic GARCH predicts the conditional variance ht+1 for the demand fore-

cast error using the GARCH model. The risk premium is the 98th percentile of the

normal distribution N (0, ht+1).

Deterministic GARCH predicts the conditional variance ht+1 for the forecast error

using the GARCH model. The risk premium is ⌘p

ht+1, where ⌘ is a positive constant

determined as the ⌘ that achieves an insu�ciency ratio e = 2% in the training data.

Recent variance is a heuristic method that calculates the sample variance �2⌧

of

demand forecast errors in the ⌧ most recent time periods. The risk premium is the 98th

percentile of the normal distribution N (0, �2⌧

).

Maximum absolute error is a heuristic method that decides the risk premium as

the maximum absolute error of demand forecasts in recent ⌧ time periods.

We test the performance of the above schemes in terms of resource utilization and in-

su�ciency ratio through simulations driven by the real-world traces of 169 popular video

channels. For each channel, we train the conditional mean model (3.8) and conditional

variance model (4.1) based on the data of 1.5 days from the 50th to the 266th time period

after the channel is released. The first 500 minutes (50 time periods) are excluded from

training as the initial demands may not conform to the later evolution patterns. To test

the generalizability of our proposed methods to any VoD demands, we only assume a

simple seasonal ARIMA model (3.8) with d = 0, p = q = 1 for demand forecast. For the

two GARCH-based methods, we train a GARCH(1, 1) model for demand forecast errors

in the training data. We then use the above 5 methods with the trained parameters to

perform one-step-ahead bandwidth reservation in each channel for a test period of 2 days

following the training period.


0 10 20 30 40 50 60 70 80 900

0.2

0.4

0.6

0.8

1

Resource Utilization (%)

CD

F

Constant varianceProbablistic GARCHDeterministic GARCHRecent varianceMax absolute error

(a) CDF of resource utilization U in 169 channels

0 2 4 6 8 10 120

0.2

0.4

0.6

0.8

1

Insufficiency Ratio (%)

CD

F

Constant varianceProbablistic GARCHDeterministic GARCHRecent varianceMax absolute error

(b) CDF of bandwidth insu�ciency ratio e in 169channels

Figure 4.4: The empirical CDF of the utilization of booked bandwidth and the ratioof time periods where the booked bandwidth is insu�cient. Bandwidth reservation isperformed in 169 popular video channels independently, each with a test period of 2days.

We calculate the average resource utilization U and insu�ciency ratio e in each of the

169 channels over its corresponding test period, and plot the empirical cumulative distri-

bution functions (CDFs) of these U ’s and e’s in Fig. 4.4(a) and Fig. 4.4(b), respectively.

We can see that “constant variance,” as the baseline method, achieves the lowest resource

utilization as well as the lowest e, because it aggressively books a high risk premium as-

suming a large constant variance for demand forecast errors. In contrast, “probabilistic

GARCH” can adjust the risk premium dynamically based on the changing forecast error

variance. From Fig. 4.4(b), we see that it achieves an insu�ciency ratio e 2% in 80%

of the 169 channels, which is close to “constant variance,” and an insu�ciency ratio

e 4% in more than 90% of the channels, which is even better than “constant variance.”

“Probabilistic GARCH” achieves better resource utilization, shown in Fig. 4.4(a), since

it only books the necessary risk premium to tolerate the instantaneous demand volatility

instead of the long-run variance.

Although “deterministic GARCH” further enhances utilization by booking the risk

premium more conservatively, its average e exceeds the target of 2%. Furthermore,


the “recent variance” and “maximum absolute error” heuristics, when compared with

the first three methods, are advantageous, since they are fully adaptive during the test

period, while the first three use fixed model parameters estimated from the training

data for prediction. Even in such an unfair comparison, both heuristics perform worse

than GARCH-based methods, resulting in an e way beyond the target of 2%. Since they

inherently lack a mechanism to quantitatively tune the tradeo↵ between U and e, they are

not appealing for the sake of quality assurance. We conjecture that a fully online version

of ARIMA and GARCH, which adaptively relearns model parameters as the simulation

proceeds in the entire test period, can yield even better utilization while being able to

constrain e within the target range.

4.2.2 Utilization as a Volatility Indicator

From the above comparison, we find that the best bandwidth reservation scheme that

strikes a balance in the U -e tradeo↵ is “probabilistic GARCH,” thanks to its superior

ability to adjust the risk premium provisioned as the volatility changes. As an example,

we apply “probabilistic GARCH” to a popular channel A55FF over a test period of 3 days,

with models learned from the preceding 3 days. The bandwidth demand is forecasted

as a seasonal ARIMA process with d = 0, p = q = 20 driven by GARCH(1, 1) forecast

errors. We plot the bandwidth provisioned by “probabilistic GARCH” in Fig. 4.5(a) and

the achieved utilization in Fig. 4.5(b) under a target insu�ciency ratio of e = 5%. The

achieved e is 3.71% and U = 75.04% on the test data.

Just as tra�c volatility limits the resource utilization e�ciency, the best achievable

utilization U produced by one-step-ahead resource booking, given a target insu�ciency

ratio e (e.g., e = 2%), is essentially a quantitative indicator of tra�c volatility in the

4.3. REDUCING VOLATILITY THROUGH DIVERSIFICATION 57

697 769 841 913 985 10570

1

2

3

4 x 104


KB/s

Test workloadResource provisioned

(a) The bandwidth reserved by “probabilistic GARCH.”

697 769 841 913 985 105725

50

75

100


Perc

enta

ge (%

)

UtilizationMean Utilization

(b) Utilization of the reserved bandwidth.

Figure 4.5: Bandwidth reservation with “probabilistic GARCH” in channel A55FF,tested on a period of 3 days.

long term. A high U means that less “risk premium” is needed and the tra�c is likely

to evolve smoothly with tractable variation. In contrast, a low U implies that the tra�c

inherently has rapid and unpredictable variations, and thus more “risk premium” must

be booked to tolerate demand fluctuation. In other words, U evaluates volatility by the

ratio of the actual bandwidth usage over the sum of expected usage and the volatile

part of usage. Therefore, in the following analysis, we use the time-average utilization U

achieved by resource booking based on “probabilistic GARCH” to evaluate the demand

volatility in the long run.

4.3 Reducing Volatility through Diversification

In this section, we consider the combined or mixed tra�c of multiple video channels. We

observe that the aggregate tra�c volatility decreases as the number of channels to be


combined increases. Furthermore, such volatility reduction is not merely a consequence

of a higher volume of tra�c, but more importantly is due to tra�c mixing between

channels that may exhibit diverse variations. From the previous section, we see that

U evaluates volatility by computing the ratio of the actual bandwidth usage over the

sum of the expected usage and usage uncertainty. Therefore, in this section, we evaluate

the demand volatility of mixed channels by the achievable U in one-step-ahead resource

booking, given a target of e = 2%.

To study how the number of combined channels a↵ects tra�c volatility, we conduct

a series of bandwidth reservation simulations driven by the traces of 173 video channels

in UUSee over 21 days. We divide the entire 21 days into 7 periods, each of 3 days, and

study the tra�c volatility of random combinations of video channels in each 3-day test

period (except the first 3 days, which are only used for model training).

Now we illustrate the test with the second 3-day period (days 4-6). Suppose that

the number of channels to be combined is n. We randomly choose n channels that exist

in this test period and its previous 3-day training period, and consider their aggregate

tra�c. We perform bandwidth reservation based on “probabilistic GARCH” in this test

period (days 4-6), using the models trained from the previous 3 days (days 1-3). We

assume a simple conditional mean model (3.8) with d = 0, p = q = 1 and a conditional

variance model (4.1) of GARCH(1, 1). We calculate U and e in this test period of days

4-6. For this particular value of n, the above experiment is repeated for 60 times to

allow for di↵erent combinations of channels. The mean and standard deviation of the

achieved 60 U ’s and e’s are calculated. The above procedure applies to the other 3-day

test periods, each with its preceding 3 days as the training period.

From Fig. 4.6, we see that as bandwidth is reserved for an increasing number of


30

40

50

60

70

80

90

Number of video channels

Mea

n U

tiliz

atio

n (%

)

21 22 23 24 25 26 27 28

Days 4−6Days 7−9Days 10−12Days 13−15Days 16−18Days 19−21

(a) Mean of utilization U

02468

101214161820

Number of video channels

Util

izat

ion

Stan

dard

Dev

iatio

n (%

)

21 22 23 24 25 26 27 28

Days 4−6Days 7−9Days 10−12Days 13−15Days 16−18Days 19−21

(b) Standard deviation of U

Figure 4.6: The mean and standard deviation of U when di↵erent numbers of videochannels are randomly combined for bandwidth reservation. The achieved average e isless than 4% in all cases.

combined video channels, the mean of the achieved U ’s increases significantly up to close

to 90%, and the standard deviation of U ’s decreases in all of the test days 4-21. As

the resource utilization in one-step-ahead bandwidth reservation essentially evaluates

the degree of volatility in data, the above phenomenon implies that the aggregate tra�c

of multiple channels demonstrates less volatility. Bandwidth provisioned to multiple

channels can thus match the demand more closely. A further check into Fig. 4.6(b) shows

that when the number of channels is small, the U ’s have a high standard deviation.

This means that although all channels follow diurnal evolution in trend, there exist

di↵erent degrees of correlation between the demand forecast errors of di↵erent channels:

the volatility of the combined tra�c is amplified when they are positively correlated and

suppressed when they are negatively correlated.

To verify that volatility reduction is not a consequence of an increased amount of total

tra�c, but because of mixing channels with di↵erent statistical properties, we perform

bandwidth reservation for all the 93 channels in a test period from time 697 to 1127,

4.4. DIMENSION REDUCTION USING PCA 60

697 769 841 913 985 10570

1

2

3

4 x 104

Time (unit: 10 minutes)KB

/s

Test workloadResource provisioned

(a) The mixed tra�c and the provisioned bandwidth.

697 769 841 913 985 1057

−4000

−2000

0

2000

4000


KB/s

Prediction errorPredicted error standard deviation

(b) Conditional standard deviations for forecast errors.

Figure 4.7: Bandwidth reservation with “probabilistic GARCH” for the mixed tra�c ofall the 93 channels in the 6-day period from time 264 to 1127. The first 3 days are thetraining periods, and the other 3 days (time 697-1127) are the test periods.

which is the same test period as in Fig. 3.6, Fig. 4.3, and Fig. 4.5(a), where bandwidth

reservation is performed for a single channel A55FF. However, instead of considering the

aggregate tra�c of all 93 channels (including channel A55FF), we consider a mixture of

them by taking 1/10 of user requests from each channel and adding them up. Comparing

Fig. 4.7(a) with Fig. 4.5(a), we see that the resulting mixed tra�c is roughly the same in

size as that of the single channel A55FF. But the mixed tra�c evolves more smoothly,

and its forecast errors shown in Fig. 4.7(b) not only have a smaller variance than that

of channel A55FF shown in Fig. 4.3, but also exhibits less heteroscedastic property, i.e.,

less time variation in terms of variance.


1500 1550 1600 1650 1700 1750 1800 18500

200

400

600


Band

wid

th (M

bps)

Channel 295Channel 241Channel 317

(a) Three large channels

1500 1550 1600 1650 1700 1750 1800 18500

10

20

30


Band

wid

th (M

bps)

Channel 22Channel 298

(b) Two small channels

Figure 4.8: Bandwidth consumption time series of 5 representative channels over a 2.5-day period.

4.4 Dimension Reduction using PCA

In this section, we propose methods to reduce the model training complexity involved

in demand prediction. We consider one single time period, where N video channels are

present. Suppose that in this period, video channel i’s aggregate bandwidth demand is a

random variable Di

(Mbps) with mean µi

= E[Di

] and variance �2i

= var[Di

]. Since the

random demands D1, . . . , DN

of di↵erent channels may be correlated, we denote by ⇢ij

the

correlation coe�cient of Di

and Dj

, with ⇢ii

⌘ 1. For convenience, let µµµ = [µ1, . . . , µN

]T

and ⌃ = [�ij

] be the N ⇥ N symmetric demand covariance matrix, with �ii

= �2i

and

�ij

= ⇢ij

�i

�j

for i 6= j.

Let us first review some key characteristics of the video workload dataset from UUSee.

Fig. 4.8 shows the aggregate bandwidth demand in each channel for 5 representative

channels over a 2.5-day period. We have four observations. First, the workload dataset

consists of a large number of small unpopular channels, such as those in Fig. 4.8(b),

dominated by a small number of large popular channels, such as those in Fig. 4.8(a).

Second, there is a diurnal periodicity in the access pattern of each video channel. Third,

a channel’s popularity evolution may follow some trends over days. For example, band-

width consumption in channels 241 and 317 exhibits a downward trend over the 2.5 days


in Fig. 4.8(a), with 144 time periods representing one day. Finally, both the diurnal

periodicity and daily trends become vague in small channels, such as in channel 22 in

Fig. 4.8(b).

Chapter 3 has proposed to use seasonal ARIMA processes to predict bandwidth series

in each individual channel [59] at a fine granularity of 10 minutes. The prediction is

based on a regression of historical demand in the most recent time periods, as well as

demand around the same time in previous days. However, this method is not without

shortcomings. First, a separate statistical model needs to be trained for every video

channel. As a result, the method does not scale to a large number of channels. Second,

this approach performs poorly for small channels, e.g., channel 22, or ill-behaved channels,

e.g., channel 317. In both types of video channels, the daily repetition pattern is obscured

by various random factors.

To tackle the problems mentioned above, in this section, we further use a factor

model to account for demand evolutions, i.e., the demand series {Di

(t)} of each channel

i can be viewed as driven by M uncorrelated underlying factors C1(t), . . . , CM

(t) with a

zero-mean random shock e(t):

Di

(t) = ↵i1C1(t) + . . . + ↵

iM

CM

(t) + e(t), 8 i. (4.2)

If the coe�cients ↵i1, . . . , ↵iM

can be learned statistically, we will be able to forecast

{Di

(t)} by predicting factor movements first. As noted from Fig. 4.8, channel demands

exhibit co-movements. This inspires us to mine the factors while learning their coe�cients

from the collective demand history of all the channels.

We use principal component analysis (PCA) to find such underlying factors. Given

N bandwidth demand series {D1(t)}, . . . , {DN

(t)} from N video channels, PCA applies


1500 1550 1600 1650 1700 1750 1787

−0.1

−0.05

0

0.05

0.1

0.15

0.2


Mbp

s

C3 C2 C1

Figure 4.9: The first 3 principal components C1—C3 in bandwidth consumption series of468 channels during 2 days via principal component analysis.

an orthogonal transformation to these N demand series to obtain a small number of

uncorrelated time series {C1(t)}, . . ., {CM

(t)} called the principal components. This

transformation is defined in such a way that data projected onto the first principal

component has as high a variance as possible (that is, accounts for as much of the

variability in data as possible), and each succeeding component in turn has the highest

variance possible.

We perform PCA for all the 468 channels that are online in a 2-day period (time

1500-1787). Fig. 4.9 plots the first 3 principal component series in the data. We can see

the first component C1 explains the diurnal periodicity shared by all the channels. The

second component C2 accounts for the downward daily trend, which is salient in channel

317 as popularity diminishes and less salient in channel 295. A further check of Fig. 4.10

reveals that the first 10 principal component series explain 99% of the data variability,

which are su�cient to model the factors underlying all the demand evolution.

However, di↵erent channels have di↵erent dependencies on each factor. Fig. 4.11

shows the 468 demand series projected onto the first 2 components, i.e., the point


1 2 3 4 5 6 7 8 9 100102030405060708090

100

Principal Component

Varia

nce

Expl

aine

d (%

)

Figure 4.10: Variance explained in band-width consumption series of 468 channelsin 2 days.

0 1000 2000 3000−1500

−1000

−500

0

500

1000

1500

1st Principal Component

2nd

Prin

cipa

l Com

pone

nt

Channel 295

Channel 241

Channel 317

Channel 22

Channel 298

Figure 4.11: Data projected onto the first 2principal components. Each point is one ofthe 468 channels.

(↵i1, ↵i2) for all i. Without surprise, the dependence on the first component, ↵

i1, in-

dicates how large the channel is. In contrast, the dependence on the second component,

↵i2, accounts for how fast end-users may lose interest in channel i. Channel 295 has a

low ↵i2, indicating almost no decrease in popularity over days. Channel 241 has a mod-

erate ↵i2, showing a slightly downward trend. Channel 317 has a large ↵

i2, exhibiting a

dramatic decrease of popularity just on the second day.

To predict the demand means µµµ(t) and the covariance matrix ⌃(t) for {Di

(t) : i =

1, . . . , N} at time t, we first predict the principal components C1(t), . . . , CM

(t) for M = 10

based on the history and obtain forecasts about their means C(t) = [C1(t), . . . , CM

(t)]T

and their covariances ⌃C

(t). We further predict the error series e(t) to obtain its forecast

e(t) and error variance �2e

(t). Note that ⌃C

(t) is a diagonal matrix because the principal

components are uncorrelated. Denote the coe�cient matrix as AN⇥M

= [↵im

], i =


100 101 102

1

2357

101520305070

100160

Channel Index in Log Scale (Sorted by Channel Size)

Mbp

s (L

og S

cale

)

468

Ch 295

Ch 241

Ch 317

Mean Bandwidth ConsumptionRMSE of PCA−based PredictionRMSE of Individual Channel Prediction

Figure 4.12: Root mean squared errors (RMSEs) of the two prediction methods overtest period 1680-1860 (1.25 days), compared to mean bandwidth consumption in eachchannel. The three largest channels are channels 295, 241 and 317.

1, . . . , N , m = 1, . . . , M . We can therefore forecast µµµ(t) and ⌃(t) as

µµµ(t) =AC(t) + e(t) · 1, (4.3)

⌃(t) =A⌃C

(t)AT + �2e

(t) · [1, . . . ,1], (4.4)

where 1 is an all-one column vector of length N and [1, . . . ,1] is an all-one matrix of

size N ⇥ N .

We model each principal component series using a low-order seasonal ARIMA model

[25]. Since {C1(t)} clearly shows daily periodicity, we model {C1(t) � C1(t � 144)} as

an ARMA(1, 1) process, so that the forecast C1(t) is regressed from both the previous

value C1(t � 1), the values one day before C1(t � 144), C1(t � 145), and random noise

terms. All other components {Ci

(t)} for i � 2 do not exhibit periodicity. We thus use

ARMA(1, 1) processes to model these principal components. The conditional variances


1680 1730 1780 1830 18600

50

100


Band

wid

th (M

bps)

Original dataPCA−based Prediction

1680 1730 1780 1830 18600

50

100

150


Band

widt

h (M

bps)

Original dataIndividual Prediction

Figure 4.13: 10-minute-ahead conditional mean prediction in channel 172 over a testperiod of 1.25 days.

of all the component series are forecasted using GARCH(1, 1) models [25, 58]. Since the

components are orthogonal, we do not need to forecast their covariances.

We compare PCA-based prediction with individual channel prediction over a test pe-

riod of 1.25 days. Each prediction is made based on the training data of only the previous

1.25 days, which are a little more than one day to incorporate periodicity. The root mean

squared errors (RMSEs) of both approaches for all the 468 channels are summarized in

Fig. 4.12. The channel indices are sorted in descending order of the channel size. We

can see that the PCA-based approach outperforms individual predictions regardless of

channel sizes. For large channels, the ratio of RMSE over mean bandwidth consumption

is less than 15% in most cases using the PCA-based approach.

To zoom in, we take channel 172 as an example. Fig. 4.13 compares the conditional

mean predictions produced by the PCA-based approach with those produced by individ-

ual prediction. We observe that individual prediction tends to oscillate drastically, while

the PCA-based approach can better identify both periodicity and downward trends. One


1680 1730 1780 1830 1860

−70−50−30−10

103050


Mbp

s

Departure from Mean ForecastPredicted Standard Deviation

(a) PCA-based Prediction

1680 1730 1780 1830 1860

−110−90−70−50−30−10

10305070


Mbp

s

Departure from Mean ForecastPredicted Standard Deviation

(b) Individual Prediction

Figure 4.14: The departure of actual bandwidth consumption from its conditional meanforecast and the predicted standard deviation of bandwidth consumption in channel 172.

reason is that the driving factors found by PCA are weighted averages over all the chan-

nels, with channel-specific erratic noises smoothed out, exhibiting co-movements of all

the channels.

The standard deviation forecast of channel 172 is plotted in Fig. 4.14. Even though

there is a big gap between real demand and its conditional mean forecast around time

1700, the GARCH(1, 1) model is able to forecast a larger demand variance at this point,

which will be leveraged by the video service provider to allocate more capacity to guard

against performance risks. It is worth noting that we do not assume that demand can al-

ways be perfectly forecasted: the entire point of variance or volatility forecast via GARCH

is to estimate the deviation of actual demand from the conditional mean prediction and

enable risk management in a probabilistic sense. Fig. 4.15 shows the Q-Q plot of forecast

errors. We observe that with PCA, the actual demand will oscillate around its condi-

tional mean forecast more like a Gaussian process. This also substantiates the belief that

each Di

behaves like a Gaussian random variable.

4.5. SUMMARY 68

−4 −2 0 2 4

−60

−40

−20

0

20

40

60

Standard Normal Quantiles

Fore

cast

Erro

r Qua

ntile

s

(a) PCA-based Prediction

−4 −2 0 2 4

−60

−40

−20

0

20

40

60


Fore

cast

Erro

r Qua

ntile

s

(b) Individual Prediction

Figure 4.15: Q-Q plot of conditional mean forecast errors in channel 172 over the testperiod 1680-1860 (1.25 days), in reference to Gaussian quantiles.

Last but not least, the PCA-based approach has a lower complexity: it involves

training a seasonal ARIMA model for each of the 10 principal components, together

with finding these components from the 468 channels using PCA. Once the models are

trained, forecasting is simply a linear regression with negligible running time. In contrast,

individual prediction has to train a seasonal ARIMA model for each of the 468 channels

separately, leading to a much higher complexity.

4.5 Summary

In this chapter, we have focused on demand volatility forecasts in large-scale operational

VoD systems, with the objective of dynamically and e�ciently provisioning bandwidth

resources in VoD servers. We introduce GARCH models originated from econometrics to

predict demand volatility based on resource usage monitoring. We propose and compare

five volatility-aware resource provisioning schemes, based on GARCH modeling and other

4.5. SUMMARY 69

volatility heuristics. It is shown that GARCH models yield the best tradeo↵ in terms of

resource utilization and the service level provided to users. We further study the volatility

reduction due to diversification when tra�c of multiple video channels is mixed. As a

result, allocating resources for a well diversified collection of video channels can improve

resource utilization by up to 3 times as opposed to single-channel allocation. We have also

used factor models and principal component analysis (PCA) to reduce the computational

complexity involved in the statistical forecast model training, when a large number of

video channels coexist in a real-world VoD system.

Chapter 5

Cloud Bandwidth Reservation for

VoD Applications

Cloud computing is redefining the way many Internet services are operated and provided,

including Video-on-Demand (VoD). Instead of buying racks of servers and building pri-

vate data centers, it is now feasible for VoD companies to use computing and bandwidth

resources of cloud service providers. As an example, Netflix moved its streaming servers,

encoding software, data stores and other customer-oriented APIs to Amazon Web Ser-

vices (AWS) in 2010 [12]. As has been mentioned in Chapter 1, it is now becoming

feasible to make egress1 bandwidth reservation from a VM in the cloud to the Internet

[19, 39]. Within this context, we analyze the benefits and address open challenges of

cloud bandwidth auto-scaling for VoD applications.

As has been explained in Chapter 1, bandwidth auto-scaling can potentially achieve

high resource utilization and quality assurance at the same time. However, a number of

1In this thesis, we assume the connection bottleneck lies in the outgoing bandwidth of data centers,while the download bandwidth at end users and the paths in the Internet are well provisioned.

70

CHAPTER 5. CLOUD BANDWIDTH RESERVATION FOR VOD APPLICATIONS71

important challenges need to be addressed to realize bandwidth auto-scaling for a VoD

provider. First, since resource rescaling requires a delay of at least several minutes to

update configuration and launch instances, it is best to predict the demand with a lead

time greater than the update interval, and scale the capacity to meet anticipated demand.

Such a proactive, rather than passive, strategy for resource provisioning needs to take

into account demand fluctuations in order to avoid bandwidth insu�ciency. Second, since

statistical multiplexing can smooth tra�c, a VoD provider can reserve less bandwidth

to guard against fluctuations by jointly reserving bandwidth for all its video channels.

However, to serve geographically distributed end users, a VoD provider usually has its

collection of video channels served by multiple data centers, which are geographically

distributed and are possibly owned by di↵erent cloud providers. The question is — how

should the VoD provider optimally split and direct its workload across data centers to

save the overall bandwidth reservation cost?

In this chapter, we propose a bandwidth auto-scaling facility that dynamically reserves

resources from multiple data centers for VoD providers, with several distinct features.

First, it is predictive. The facility tracks the history of bandwidth demand in each

video channel using cloud monitoring services, and periodically estimates the expectation,

volatility and correlations of demands in all video channels for the near future using time-

series techniques proposed in the previous chapters. We propose a channel interleaving

scheme that can even predict demand for new videos that lack historical demand data.

Second, it provides quality assurance by judiciously deciding the minimum bandwidth

reservation needed to satisfy the demand with high probability. Third, it optimally mixes

anti-correlated (negatively correlated) demands to save the aggregate bandwidth capacity

reserved from all data centers, while confining risks of under-provisioning.

5.1. SYSTEM ARCHITECTURE 72

DC 1

DC 2

DC

w11

w21

ws1

Demand Predictor

BandwidthUsageMonitor

Load Optimizer

Demand HistoryChannel 1

Demand HistoryChannel

W

i

s

w1i

w2i

wsi

. . .

. . .

Load Direction

Demand Projections

BandwidthReservation

Figure 5.1: The system decides the bandwidth reservation from each data center anda matrix W = [w

si

] every �t minutes, where wsi

is the proportion of video channel i’srequests directed to data center s. DC: data center.

We formulate the bandwidth reservation minimization problem given the predicted

demand statistics as input, derive the theoretically optimal load direction policy across

data centers, and propose heuristic solutions that balance bandwidth and storage costs.

The proposed facility is evaluated through extensive trace-driven simulations based on a

large data set of 1693 video channels collected from UUSee, a production VoD system,

over a 21-day period.

5.1 System Architecture

Consider a VoD provider with N video channels, relying on S data centers for service,

which are possibly owned by di↵erent cloud service providers. We propose an unobtrusive

auto-scaling system that draws beliefs about future demands of all channels and reserves

minimal resources from multiple data centers to satisfy the demand. Our system archi-

tecture, shown in Fig. 5.1, consists of three key components: bandwidth usage monitor,


Channel 1

Band

wid

th

Reserved Capacity A1 A2

(a) (b)

DC 1 DC 2Asum Asum/2

(d) (e)(c)

Asum/2

Time 10 Minutes0

Channel 2

Reserved Capacity

Time 10 Minutes0 Time 10 Minutes0 Time 10 Minutes0 Time 10 Minutes0

Figure 5.2: Using demand correlation between channels, we can save the total bandwidthreservation, even within each 10-minute period, while still providing quality assurance toeach channel. DC: data center.

demand predictor and load optimizer. Bandwidth rescaling happens proactively every �t

minutes, with the following three steps:

First, before time t, the system collects bandwidth demand history of all channels up

to time t, which can easily be obtained from cloud monitoring services. As an example,

Amazon CloudWatch provides a free resource monitoring service to AWS customers at a

5-minute frequency [2].

Second, the bandwidth demand history of all channels is fed into the demand predictor

to predict bandwidth requirement of each video channel in the next �t minutes, i.e., in

the period [t, t + �t). Our predictor not only forecasts the expected demand, but also

outputs a volatility estimate, which represents the degree that demand will be fluctuating

around its expectation, as well as the demand correlations between di↵erent channels in

this period. Our volatility and correlation estimation is based on multivariate GARCH

models [24], which gained success in stock modeling in the past decade.

Finally, the load optimizer takes predicted statistics as the input, and calculates the

bandwidth capacity to be reserved from each data center. It also outputs a load direction

matrix W = [wsi

], where wsi

represents the portion of video channel i’s workload directed

to data center s. Apparently, we should haveP

s

wsi

= 1 if the aggregate data center

capacity is su�cient. The matrix W also indicates the content placement decision: video


i is replicated to data center s only if wsi

> 0. In practice, the load direction W can

be readily implemented by routing the requests for video channel i to data center s with

probability wsi

.

The system finishes the above three steps before time t, so that a new bandwidth

reservation can be performed at time t for the period [t, t + �t), after which the above

process is repeated for the next period [t + �t, t + 2�t).

Bandwidth Reservation vs. Load Balancing. One may be tempted to think

that periodic bandwidth reservation is unnecessary, since requests can be flexibly directed

to whichever data center has available capacity by a load balancer. However, the latter

will exactly fall in the range of pay-as-you-go model with no quality guarantee to VoD

users, whereas bandwidth reservation ensures that the provisioned resource exceeds the

projected demand with high probability.

Furthermore, a major di�culty of load balancing is that data centers might be owned

by di↵erent cloud providers. As a result, it is complicated if not infeasible to implement

a gateway that can continuously watch resources of each cloud provider (instead of every

5 minutes as o↵ered by free cloud watch services) and redirect requests instantaneously

across cloud providers. Even if requests can be redirected instantaneously to lightly

loaded data centers when the playback quality degrades, significant engineering e↵orts

are required to monitor the video playback quality at end users.

Quality-Assured Bandwidth Multiplexing. The bandwidth demand of each

video channel can fluctuate drastically even at small time scales. To avoid performance

risks, the bandwidth reservation made for each channel in each �t period should accom-

modate such fluctuations, inevitably leading to low utilization at troughs, as illustrated

in Fig. 5.2(a) and (b). Trough filling within a short period such as 10 minutes is hard,

5.2. LOAD DIRECTION AND BANDWIDTH RESERVATION 75

since there are too many random shocks in demand in such a short period.

However, our load optimizer strives to enhance utilization even when �t is as small

as 10 minutes by multiplexing demands based on their correlations. The usefulness of

anti-correlation is illustrated in Fig. 5.2(c): if we jointly book capacity for two negatively

correlated channels, the total reserved capacity is Asum < A1 + A2. Besides aggregation,

we can also take a part of the demand for each channel, mix them and reserve bandwidth

for the mixed demands from multiple data centers. As an example, in Fig. 5.2(d) and

(e), the aggregate demand of two channels is split into two data centers, each serving a

mixture of demands, which still leads to a total bandwidth reservation of Asum. In each

�t period, we leverage the estimated demand correlations to optimally direct workloads

across data centers so that the total bandwidth reservation necessary to guarantee quality

is minimized.

5.2 Load Direction and Bandwidth Reservation

In this section, we focus on the design of the load optimizer, which has been presented

in our published work [60]. Suppose before time t, we have obtained the estimates about

demands in the coming period [t, t + �t). Our objective is to decide load direction W

so as to minimize the total bandwidth reservation while controlling the under-provision

risk in each data center.

We first introduce a few useful notations. Since we are considering an individual time

period, without loss of generality, we drop subscript t in our notations. Recall that the

VoD provider runs N video channels. The bandwidth demand of channel i is a random

variable Di

with mean µi

and variance �2i

. For convenience, let D = [D1, . . . , DN

]T,

µµµ = [µ1, . . . , µN

]T and �� = [�1, . . . , �N

]T.


Note that the random demands D1, . . . , DN

may be highly correlated due to the

correlation between video genres, viewer preferences and video release times. Denote ⇢ij

the correlation coe�cient of Di

and Dj

, with ⇢ii

⌘ 1. Let ⌃ = [�ij

] be the N ⇥ N

symmetric demand covariance matrix, with �ii

= �2i

and �ij

= ⇢ij

�i

�j

for i 6= j.

The VoD provider will book resources from S data centers. Denote Cs

the upper

bound on the bandwidth capacity that can be reserved from data center s, for s =

1, . . . , S. Cs

may be limited by the available instantaneous outgoing bandwidth at data

center s, or may be intentionally set by the VoD provider to geographically spread its

workload and avoid booking resources from a single data center. Let Csum =P

s

Cs

be the

aggregate utilizable bandwidth capacity of all S data centers. Throughout the chapter,

we assume that Csum is su�ciently large to satisfy all the demands in the system.2

We define a load direction decision as a weight matrix W = [wsi

], s = 1, . . . , S,

i = 1, . . . , N , where wsi

represents the portion of video i’s demand directed to and served

by data center s, with 0 wsi

1 andP

s

wsi

= 1. We observe that ws

= [ws1, . . . , wsN

]T

represents the workload portfolio of data center s. Given ws

, the aggregate bandwidth

load imposed on data center s is a random variable

Ls

=P

i

wsi

Di

= wTs

D. (5.1)

We use As

to denote the amount of bandwidth reserved from data center s for this

period. Clearly, we must have As

Cs

. Let A =: [A1, . . . , AS

]T. To control the under-

provision risk, we require the load imposed on data center s to be no more than the

2A rigorous condition for supply exceeding demand is given in Theorem 1.


reserved bandwidth As

with high probability, i.e.,

Pr(Ls

> As

) ✏, 8 s, (5.2)

where ✏ > 0 is a small constant, called the under-provision probability.

5.2.1 The Optimal Load Direction

Given demand expectations µµµ and covariances ⌃, and the available capacities C1, . . . , CS

,

the load optimizer can decide the optimal bandwidth reservation A⇤ and load direction

W⇤ by solving the following optimization problem:

minW,A

Ps

As

(5.3)

s.t. As

Cs

, 8 s, (5.4)

Pr(Ls

> As

) ✏, 8 s, (5.5)

Ps

wsi

= 1, 8 i. (5.6)

Through reasonable aggregation, we believe that Ls

follows a Gaussian distribution.

We will empirically justify this assumption in Sec. 5.3 using real-world traces. When Ls

is Gaussian-distributed, constraint (5.2) is equivalent to

As

� E[Ls

] + ✓pvar[L

s

], with ✓ := F�1(1 � ✏), (5.7)

where F (·) is the CDF of the normal distribution N (0, 1). For example, when ✏ = 2%,


we have ✓ = 2.05. Since

E[Ls

] = µ1ws1 + . . . + µN

wsN

= µµµTws

,

var[Ls

] =P

i,j

⇢ij

�i

�j

wsi

wsj

= wTs

⌃ws

,

it follows that (5.2) is equivalent to

As

� µµµTws

+ ✓p

wTs

⌃ws

. (5.8)

Therefore, the bandwidth minimization problem (5.3) is now converted to

minW

Ps

As

(5.9)

As

= µµµTws

+ ✓pwT

s

⌃ws

, (5.10)

s.t. µµµTws

+ ✓p

wTs

⌃ws

Cs

, 8 s, (5.11)

Ps

ws

= 1, (5.12)

0 � ws

� 1, 8 s, (5.13)

where 1 = [1, . . . , 1]T and 0 = [0, . . . , 0]T are N -dimensional column vectors. We can

derive nearly closed-form solutions to problem (5.9) in the following theorem:

Theorem 1 When Csum � µµµT1 + ✓p1T⌃1, an optimal load direction matrix [w⇤

si

] is

given by

w⇤si

= ↵s

, 8 i, s = 1, . . . , S, (5.14)


where ↵1, . . . , ↵S

can be any solution to

X

s

↵s

= 1, 0 ↵s

min

⇢1,

Cs

µµµT1 + ✓p1T⌃1

�, 8 s. (5.15)

When Csum < µµµT1+✓p1T⌃1, there is no feasible solution that satisfies constraints (5.11)

to (5.13).

Proof: The function f(ws

) =p

wTs

⌃ws

is a cone and a convex function. We have

f

0

B@w1 + w2

2

1

CA f(w1) + f(w2)

2, (5.16)

or equivalently,

p(w1 + w2)T⌃(w1 + w2)

qwT

1⌃w1 +q

wT2⌃w2.

Applying the above inequality iteratively, we can prove

X

s

pwT

s

⌃ws

�s

(X

s

wTs

)⌃(X

s

ws

) =pwT⌃w. (5.17)

If w = 1 is feasible, by (5.11) and (5.17) we have

X

s

Cs

� µµµT1 + ✓pwT⌃w = µµµT1 + ✓

p1T⌃1. (5.18)

IfP

s

Cs

� µµµT1 + ✓p1T⌃1, it is easy to verify (5.15) is feasible. When w⇤

si

= ↵s

given

by (5.14), we find (5.11), (5.12) and (5.13) are all satisfied. Hence, (5.14) is a feasible


solution and w = 1 is feasible. By (5.17), the objective function (5.9) satisfies

X

s

(µµµTws

+ ✓p

wTs

⌃ws

) � µµµTw + ✓pwT⌃w = µµµT1 + ✓

p1T⌃1. (5.19)

We find that [w⇤si

] given by (5.14) achieves the above inequality with equality, and thus

is also an optimal solution to (5.9). ut

Theorem 1 implies that in the optimal solution, each video channel should split and

direct its workload to S data centers following the same weights ↵1, . . . , ↵S

, which can be

found by solving the linear constraints (5.15). Moreover, the optimal workload portfolio

of each data center s has a similar structure of ws

= ↵s

1, where ↵s

depends on its

available capacity Cs

through the constraints (5.15).

Under the optimal load direction, the aggregate bandwidth reservation reaches its

minimum value:

Ps

A⇤s

=P

s

(µµµTw⇤s

+ ✓pw⇤T

s

⌃w⇤s

) = µµµT1 + ✓p1T⌃1,

which does not depend on S, the number of data centers. This means that under the

optimal load direction, having demand served by multiple data centers instead of one big

data center does not increase bandwidth reservation cost as long as we have wsi

= ↵s

,

8 i given by (5.14). In other words, the multiplexing gain stems from combining all

workloads from di↵erent video channels and serving them aggregately, regardless of the

number of data centers used to serve these workloads. Therefore, the load optimizer can

first aggregate all the demands and then split the aggregated demand into di↵erent data

centers subject to their capacities.


5.2.2 Suboptimal heuristics with Limited Replication

Although solution (5.14) is optimal and e�cient, it encounters two major obstacles in

practice. First, as long as ↵s

> 0, w⇤si

= ↵s

> 0 for all i, which means that data center

s has to store all N videos. In other words, a video has to be replicated at all the data

centers with ↵s

> 0. This incurs significant additional storage fees at the VoD provider

charged by data centers. Second, each video channel i splits its workload into S data

centers according to the weights ↵1, . . . , ↵S

. When S is large and Di

is small, such fine-

grained splitting will not be technically feasible. Therefore, in practice, we need to limit

the replication degree of each video, or equivalently, limiting the number of videos stored

in each data center.

To achieve the above goal, we propose suboptimal solutions to problem (5.9) that

addresses replication concerns. First, we need the following heuristic to bridge the optimal

load direction to replication-limited load direction:

Heuristic 1 Per-DC Optimal. The algorithm iteratively outputs w⇤⇤1 , . . . ,w⇤⇤

S

for one

data center after another. Initially, set b = 1. Repeat the following for s = 1, . . . , S:

1) Solve the following problem to obtain w⇤⇤s

:

maxws

µµµTws

(5.20)

s.t. µµµTws

+ ✓p

wTs

⌃ws

As

Cs

, (5.21)

0 ws

b. (5.22)

2) Replace b in (5.22) by b � w⇤⇤s

.

The program terminates if b = 0.


Heuristic 1 packs the random demands into each data center, one after another, by

maximizing the expected demand µµµTws

that each data center s can accommodate subject

to the probabilistic performance guarantee in (5.21). As a result, the total amount of re-

sources needed to guard against demand variability is reduced. Clearly, under Heuristic 1,

the aggregate bandwidth reservation from all data centers is

Ps

A⇤⇤s

=P

S

s=1(µµµTw⇤⇤

s

+ ✓pw⇤⇤T

s

⌃w⇤⇤s

). (5.23)

Note that Heuristic 1 is also computationally e�cient since (5.20) is a standard second-

order cone program.

Now we can introduce the replication-limited load direction, which only requires the

VoD provider to upload k videos to each data center. We modify Heuristic 1 to cope

with this constraint as follows:

Heuristic 2 Per-DC Limited Channels. The algorithm iteratively outputs w01, . . . ,w

0S

for one data center after another. Initially, set b = 1. Repeat the following for s =

1, . . . , S:

1) Solve problem (5.20) to obtain w⇤⇤s

.

2) Choose the top k channels with the largest weights and solve problem (5.20) again

only for these k channels to obtain w0s

.

3) Replace b in (5.22) by b � w0s

.

The program terminates if b = 0.

5.3. EXPERIMENTS BASED ON REAL-WORLD TRACES 83

Under Heuristic 2, the aggregate bandwidth reserved is

Ps

A0s

=P

S

s=1(µµµTw0

s

+ ✓pw0T

s

⌃w0s

). (5.24)

In Sec. 5.3, we will show through trace-driven simulations that Heuristic 2, though sub-

optimal, e↵ectively limits the content replication degree, thus balancing the savings on

storage cost and bandwidth reservation cost for VoD providers.

5.3 Experiments based on Real-World Traces

We conduct a series of simulations to evaluate the performance of our bandwidth auto-

scaling system. The simulations are driven by the replay of the workload traces of UUSee

video-on-demand system over a 21-day period during the 2008 Summer Olympics [50].

The dataset contains performance snapshots taken at a 10-minute frequency of 1693

video channels, including sports events, movies, TV episodes and other genres. The

statistics we use in this chapter are the time-averaged total bandwidth demand in each

video channel in each 10-minute period. There are 144 time periods in a day. We ask

the question—what the performance would have been if UUSee had all its workload in

this period served by cloud services through our bandwidth auto-scaling system?

We assume that the bandwidth demand of channel i at any point in the period

[t, t + �t) can be represented by the same random variable Dit

. This is a reasonable

assumption when �t is small. Similarly, let µµµt

= [µ1t, . . . , µNt

] and ⌃t

= [�ijt

] represent

the demand expectation vector and demand covariance matrix for all N channels in

[t, t + �t). We assume that before time t, the system has already collected all the

demand history from cloud monitoring services with a sampling interval of �t. We


implement statistical learning and demand prediction techniques presented in Chapter 3

and Chapter 4 to forecast the expected demands µµµt

and demand covariance matrix ⌃t

every 10 minutes. The model parameters are retrained daily, with training data being the

bandwidth demand series {Di⌧

} in the recent 1.25 days of each channel i. Once trained,

the models will be used for the next 24 hours.

We conduct performance evaluation for 4 typical time spans which are near the be-

ginning, middle and end of the 21-day duration. Although video users may join or quit a

channel unexpectedly, our prediction mechanism is still e↵ective, since it deals with the

aggregate demand in the channel, which features diurnal evolution patterns. We assume

that there is a pool of data centers from which UUSee can reserve bandwidth. To spread

the load across data centers, we set Cs

= 300 Mbps for each s. The QoS parameter

✓ := F�1(1 � ✏) is set to ✓ = 2.05 to confine the under-provision probability to ✏ = 2%.

5.3.1 A Channel Interleaving Scheme

There are two practical challenges with regard to demand prediction. First, many of

the 1693 video channels are released during the 21 days: a new video does not have

su�cient demand history for statistical model learning. Second, there are many small

channels with only a few users online for which prediction is hard. We propose a channel

interleaving scheme to circumvent these obstacles. It has been shown from the traces

[59] that videos released on di↵erent dates but around the same time of day exhibit

similar initial demand evolution patterns, though with possibly di↵erent popularities.

The main reason is that most users watch VoD channels around several peak times of a

day. Therefore, it is possible to predict demands for new videos based on earlier videos.

We define virtual new channel k as a combination of all video channels with an


433 577 721 865 1009 1153 1297 1441 1585 18000

200

400

600


Aggr

egat

e ba

ndw

idth

(Mbp

s)

Trace data 1−step prediction Training data

Figure 5.3: The conditional mean demand prediction for virtual new channel 11, with atest period of 1.5 days from time 1585 to 1800.

age less than 1.25 days and released in hour k 2 {1, . . . , 24} on any date. Upon release, a

new video joins virtual new channel k based on its release hour k. Similarly, we aggregate

small video channels and set up 24 virtual small channels. When a video reaches the

age of 1.25 days, it quits its virtual new channel. If its demand never exceeded a threshold

(e.g., 40 Mbps) in the first 1.25 days, it will join one of the virtual small channels in a

round robin fashion. Otherwise, it becomes a mature channel.

Each mature or virtual channel is deemed as an entity to which predictions and

optimizations are applied. For example, we make 10-minutes-ahead conditional mean

prediction for virtual new channel 11 and plot results in Fig. 5.3. The bandwidth demand

exhibits repetition of a similar pattern because the videos in this virtual channel are all

released in hour 11 (possibly on di↵erent dates). Although conditional mean prediction

is subject to errors, the GARCH model can forecast the changing error variance, which

contributes to the risk constraint (5.11) in resource minimization (5.9).

5.3.2 Algorithms for Comparison

We compare our optimal load direction (5.14), Heuristic 1 and Heuristic 2, with the

following benchmark algorithms:


Reactive without Prediction: Initially replicate each video to K random data

centers. This limits the initial content replication degree to K. Each client requesting

channel i is randomly directed to a data center that has video i and idle bandwidth

capacity. A request is dropped if there is no such data center. In this case, the algorithm

reacts by replicating video i to an additional data center chosen randomly that has idle

capacity. Replicating content is not instant, and thus we assume that the replication

involves a delay of one period of time.

Random with Prediction: Initially, let s = 1 and b = 1. Second, randomly

generate ws

in (0,b) and rescale it so that the QoS constraint (5.11) is achieved with

equality for s. Update b to b � ws

and update s to s + 1. Go to the second step unless

b = 0 or s = S + 1, in which case the program terminates.

The reactive scheme represents provisioning for peak demand in Fig. 1.1 in some way,

with limited replication. It does not leverage prediction or bandwidth reservation. We

assume in Reactive, the total cloud capacity allocated is always the minimum capacity

needed to meet the peak demand in the system. The random scheme leverages prediction

and makes bandwidth reservation, but randomly directs workloads instead of using anti-

correlation to minimize bandwidth reservation.

5.3.3 Assumption Validation

First, we verify that the demand Di⌧

approximately follows Gaussian distribution in each

10-minute period. As an example, we apply one-day-lagged di↵erences (the lag is 144

when �t = 10 minutes) onto {Di⌧

} to remove daily periodicity to obtain the transformed


−3 −2 −1 0 1 2 3−15

−10

−5

0

5

10

15


Qua

ntile

s of

Inpu

t Sam

ple

(a) Zit of a typical channel

−3 −2 −1 0 1 2 3−1500

−1000

−500

0

500

1000

1500


Qua

ntile

s of

Inpu

t Sam

ple

(b)P

i Zit of all channels combined

Figure 5.4: QQ plot of innovations for t =1562—1640 vs. normal distribution.

series {D0i⌧

:= Di⌧

�Di⌧�144}, which can be modeled as a low-order autoregressive moving-

average (ARMA) process:

8><

>:

D0i⌧

� �i

D0i⌧�1 = N

i⌧

+ �i

Ni⌧�1,

D0i⌧

= Di⌧

� Di⌧�144,

(5.25)

where {Ni⌧

} ⇠ WN(0, �2) denotes the noise.

For each channel i, given conditional mean prediction µit

at time t, the innovation

is Zit

:= Dit

� µit

. Fig. 5.4(a) shows the QQ plot of Zit

for a typical channel i = 121

from time period 1562 to 1640, which indicates {Zit

} sampled at 10-minute intervals is

a Gaussian process. Thus, it is reasonable to assume Dit

follows a Gaussian distribution

within the 10 minutes following t, with mean µit

. Fig. 5.4(b) shows the QQ plot ofP

i

Zit

,

which indicates that the aggregated demandP

i

Dit

tends to Gaussian even if Djt

is not

Gaussian for some channel j. Since the load Ls

of each data center is aggregated from

many videos, it is reasonable to assume Ls

is Gaussian.


702 712 722 732 742 752 762 772 7800

10

20

30

40

50


Uns

atis

fied

chan

nels

Per−DC optimalPer−DC limited channelsReactive without prediction

(a) Channels with unsatisfied requests

702 712 722 732 742 752 762 772 78040

60

80

100

120


Util

izat

ion

(%)


(b) Resource Utilization

702 712 722 732 742 752 762 772 7802

4

6

8

10

12

14


Rep

licat

ion

degr

ee


(c) Replication Degree

Figure 5.5: Predictive vs. reactive bandwidth provisioning for a typical peak period 702–780. There are 35 data centers available, each with capacity 300 Mbps, and 91 channels,including 52 popular channels, 24 small channels, 15 non-zero new channels. K = 2,k = 10.

5.3.4 Predictive Auto-Scaling vs. Reactive Provisioning

We implement all of the five schemes discussed above, and present their performance

comparison in Table 5.1 for each of the four time spans. Note that the channels in the

table include mature channels, virtual new and virtual small channels. The number of

videos in each virtual channel can vary over time. As new videos are introduced, more

channels are present in later test periods. We evaluate the performance with regard to

QoS, bandwidth resource occupied, and replication cost.


Table 5.1: The performance of five schemes averaged over each test period, in terms ofQoS, resource utilization, and replication overhead.

PeriodsTime periods 702—780 (91 mature and virtual channels)

Peak demand 6.56 Gbps, mean demand 5.19 GbpsShort Drop Util Rep Booked Over-prov

Optimal 0.2 Chs 0.66% 92.9% 91.0 6.57 Gbps 108.5%Per-DC Opt 1.0 Chs 0.37% 90.0% 8.5 6.79 Gbps 112.2%Per-DC Lim 0.3 Chs 0.06% 85.7% 2.6 7.13 Gbps 117.8%

Random 5.9 Chs 0.02% 83.3% 3.8 7.33 Gbps 121.2%Reactive 7.9 Chs 0.47% 77.2% 4.3 7.91 Gbps 132.4%













Acronym Meanings:

Optimal : The optimal load direction. Per-DC Opt : Heuristic 1 (Per-DC Optimal). Per-DC Lim: Heuristic 2 (Per-DC Limited Channels). Random: Random with Prediction.Reactive: Reactive without Prediction.

Short : # channels with dropped requests; Drop: the request drop rate; Util : utilization ofallocated resources; Rep: replication degree; Booked : the booked (reserved) bandwidth;Over-prov : over-provisioning ratio.


Table 5.1 shows that Reactive generally has more QoS problems than all four pre-

dictive schemes in terms of both the number of unsatisfied channels and request drop

rate, demonstrating the benefit of utilizing demand prediction. Fig. 5.5 presents a more

detailed comparison for a typical peak period from time 702 to 780. Without surprise,

Reactive has many unfulfilled requests at the beginning. Since the videos are randomly

replicated to K = 2 data centers (shown in Fig. 5.5(c) at t = 702) and requests are

randomly directed, it is likely that a channel does not acquire enough capacity to meet

its demand. As Reactive detects the QoS problem, videos are replicated to more data

centers to acquire more capacity, with a gradually increasing replication degree over time,

as in Fig. 5.5(c). We can see that after 140 minutes, when the replication degree exceeds

4, the QoS of Reactive becomes relatively stable in Fig. 5.5(a). However, around time

763, Reactive su↵ers from QoS problems again, due to a sudden ramp-up of demand.

In contrast, the predictive schemes foresee and prepare for demand changes, resulting in

much better QoS, even in the event of drastic demand increase.

The predictive schemes also achieve higher resource utilization. Utilization of a pre-

dictive scheme is the ratio between the actual used bandwidth and the total booked

bandwidth in all data centers. For Reactive, the utilization is the actual bandwidth de-

mand divided by the peak demand. Although Fig. 5.5(b) shows that Reactive achieves

a high utilization for the peak demand around time 763, its average utilization is merely

77.19% in the test period from 702 to 780. Predictive auto-scaling enhances utilization to

85.67% with Per-DC Limited Channels, to 89.99% with Per-DC Optimal, and to 92.9%

with the theoretical optimal solution. In addition, the prediction and optimization in

predictive methods are computationally e�cient, e.g., prediction and Per-DC Optimal

finish in 2 minutes, well before the deadline of 10 minutes.


5.3.5 Theoretical Optimal vs. Replication-Limited Heuristics

Now we focus on each of the four predictive schemes. Among them, as shown in Table 5.1,

Optimal books the minimum necessary bandwidth and achieves the highest bandwidth

utilization, yet with the highest replication overhead: a video is replicated to every data

center. The VoD provider thus needs to pay a high storage fee to the cloud.

Per-DC Optimal can reduce the replication degree while maintaining other perfor-

mance metrics. By further imposing a channel number constraint on each data center,

Per-DC Limited Channels strikes a balance between replication overhead and bandwidth

utilization. It aggressively reduces the replication degree to a very small value of 2.4-

2.6 copies/video, which is the smallest among all four schemes, with an extremely low

drop rate and an over-provisioning ratio only slightly higher than Optimal and Per-DC

Optimal. Random achieves the lowest utilization, since it is blind to the correlation

information in workload selection and direction.

We further show a detailed comparison between the three predictive heuristics from

time 1562 to 1640 in Fig. 5.6. The e�ciency of predictive bandwidth booking can be

evaluated by the cushion bandwidth (“risk premium”) needed, which is the gap between

the booked bandwidth and actual required bandwidth. Fig. 5.6(a) plots the cushion

bandwidth. While achieving a similar QoS level, Random load direction uses a cushion

bandwidth up to 3 Gbps compared to a mean demand of 5.62 Gbps, representing sig-

nificant over-provisioning. Using Per-DC Optimal, the cushion bandwidth can be saved

by 50% on average, as shown in Fig. 5.6(b). Even Per-DC Limited Channels, with a

replication degree of 2.4 copies/video, can save cushion bandwidth by around 30% as

compared to Random, which has a higher replication degree of 3.3 copies/video.


1562 1582 1602 1622 1640−1000

0

1000

2000

3000

4000


Cus

hion

Ban

dwid

th (M

bps)

Per−DC optimalPer−DC limited channelsRandom

(a) Cushion Bandwidth

1562 1582 1602 1622 16400

20

40

60

80

100

120

140


Impr

ovem

ent (

%)

Per−DC optimal vs. RandomPer−DC limited channels vs. Random

(b) Savings on Cushion Bandwidth

1562 1582 1602 1622 1640

100

120

140

160

180


Ove

r−pr

ovisi

on ra

tio (%

)

Per−DC optimalPer−DC limited channelsRandom

(c) Over-Provisioning Ratio

Figure 5.6: Workload portfolio selection vs. random load direction for a typical peakusage period from time 1562 to 1640. K = 2, k = 10.

QoS problems occur if bandwidth is under-provisioned, leading to a cushion band-

width below 0 and an over-provisioning ratio less than 100%. From Fig. 5.6(a) and

Fig. 5.6(c), we observe that QoS problems occur occasionally for Per-DC Optimal but

seldom for Per-DC Limited Channels from time 1562 to 1640, because the latter scheme

conservatively books more cushion bandwidth. In addition, we note that request drop

rates in Table 5.1 are significantly lower than the frequency of under-provisioning in the

figures, because when under-provisioning happens, most user requests are still served.

Only the demand exceeding the booked capacity is dropped. From the above analysis,

we conclude that Per-DC Limited Channels achieves the best tradeo↵ in the domain of

5.4. SUMMARY 93

utilization, QoS and replication overhead.

5.4 Summary

In this chapter, we have proposed an unobtrusive, predictive and elastic cloud bandwidth

auto-scaling system for VoD providers. Operated at a 10-minute frequency, the system

automatically predicts the expected future demand as well as demand volatility in each

video channel through ARIMA and GARCH time-series forecasting techniques based on

history. Leveraging demand prediction, the system jointly makes load direction to and

bandwidth reservations from multiple data centers to satisfy the projected demands with

high probability. The system can save the resource reservation cost for VoD providers

with regard to both bandwidth and storage.

We exploit the predictable anti-correlation between demands to enhance resource

utilization, and derive the optimal load direction policy that minimizes the aggregate

bandwidth resource reservation while confining under-provisioning risks. Two suboptimal

heuristics have also been proposed to limit the storage cost. From extensive simulations

driven by the demand traces of a large-scale real-world VoD system, we observe that

suboptimal heuristics have practical appeal due to their ability to balance the costs of

bandwidth and storage.

Chapter 6

The Bandwidth Reservation Market

Current cloud providers charge their tenants (including VoD providers like Netflix) a

bandwidth fee in a pay-as-you-go model based on the number of bytes transferred in

each hour [17]. This business model is not suitable for QoS-sensitive applications where

the value of the service critically depends on the guarantee provided. We believe that,

when multiple content providers (such as Netflix or OnLive) use cloud services from

cloud providers (such as Amazon), there will be a market between content providers and

cloud providers, and commodities to be traded in such a market consist of bandwidth

reservations, so that media streaming performance can indeed be guaranteed.

As a buyer in such a market, each tenant in the cloud can periodically make reserva-

tions for bandwidth capacity to satisfy its demand projection. Specifically, each tenant

should reserve capacity that accommodates both its expected demand and unexpected

demand variation in the predictable future. The expectation and variance of its future

demand can be estimated online using historical demand information, which is easily

obtainable from cloud monitoring services (e.g., as has been mentioned in Chapter 5,

Amazon CloudWatch provides a free resource monitoring service to AWS customers at a

94

CHAPTER 6. THE BANDWIDTH RESERVATION MARKET 95

5-minute frequency [2]).

However, since demands at all tenants are unlikely to surge simultaneously, the ag-

gregate reserved cloud bandwidth — that the tenants have paid for — are under-utilized

most of the time. A classical way to improve utilization is to encourage statistical mul-

tiplexing and reserve resources for multiple tenants together to accommodate their ag-

gregate peak load instead of each tenant’s peak load individually. As diversification and

anti-correlation come into e↵ect, less cushion capacity needs to be reserved to guard

against unexpected demand fluctuation. Apparently, improving utilization is a win-win

scenario: better utilization of bandwidth at cloud providers may lead to reduced pric-

ing for bandwidth reservations in the market, eventually benefiting content providers.

Given multiple cloud providers in the market, we ask the question — can we use a simple

mechanism inherent in the market to optimally mix tenant demands, save the total cloud

bandwidth reservation cost for tenants and thus lower the market bandwidth price?

The answer is “yes”. Cloud resource brokerage has recently emerged as a cloud service

intermediator connecting buyers and sellers of computing resources. For example, Zimory

[11], as a spino↵ of Deutsche Telekom, claims to be the first online marketplace for cloud

computing. In this chapter, we propose an entirely new type of brokerage that collects

bandwidth demand statistics from content providers, selling probabilistic performance

guarantees to them, while reserving as few resources as possible from multiple cloud

providers for them. The broker charges each VoD provider a fee for the service guarantee

provided according to a certain pricing policy, while paying cloud providers their fees for

bandwidth reservations. In a healthy market, the broker should be able to: (1) make

a profit, an incentive for all selfish entities; (2) save the total resource reservation by

optimally mixing demands; and (3) lower bandwidth prices billed to content providers

CHAPTER 6. THE BANDWIDTH RESERVATION MARKET 96

to attract business.

We present an in-depth analysis of bandwidth pricing policies in such a broker-assisted

market, and make a number of original contributions as follows:

First, the broker needs to decide a load direction matrix from all tenants to all

cloud providers. Load direction should be such that tenant demands are mixed based

on anti-correlation to save the aggregate capacity reservation from all cloud providers.

We formulate this problem as a convex optimization and find this problem has the same

form as the problem of finding the optimal load direction policy in Chapter 5, with the

only di↵erence that now the broker (instead of a cloud provider) is in charge of directing

the loads of di↵erent video providers to all the data centers (possibly owned by di↵erent

cloud providers).

Second, however, a selfish broker may not have the incentive to mix and direct de-

mands optimally; it may even reject demand if accepting it is not lucrative — the broker

is always driven by profit after all. Our second contribution is therefore to study how to

enforce a good pricing policy (to be adopted by the broker), such that a profit-maximizing

broker behaves in the same way as an altruistic broker that saves the cloud bandwidth

reservation through optimal load direction. We give a necessary and su�cient condition

for all such “good” pricing policies. Such a pricing region characterization also helps

us to understand the maximum price discount that each tenant can enjoy in a healthy

market.

Third, we study a free market where the pricing policy is not enforced authoritatively.

Instead, each tenant can submit a pricing strategy at its own will, while the broker

makes demand admission decisions to each tenant based on all the submitted prices.

Assuming that all the tenants wish to have their demands fully served, we prove that the

6.1. SYSTEM MODEL 97

selfishness of tenants will inevitably drive their chosen pricing strategies into a unique

Nash equilibrium, at which each tenant enjoys the lowest possible price in the “good”

pricing region. As a highlight, we derive the equilibrium payment of each tenant in a

closed-form function of its expected demand, demand burstiness as well as its demand

correlation to the aggregate demand in the market.

Finally, we conduct online bandwidth trading simulations driven by the real-world

workload of more than 200 video channels from an operational VoD system over two 800-

minute periods. We have developed a novel dynamic pricing scheme in a broker-assisted

market that varies the price charged to each tenant periodically, based on prediction

of tenant demand statistics. Such demand-statistics-based pricing is di↵erent from the

traditional usage-based pricing. We observe that with the help of a broker, our theory

can lower the market price for cloud bandwidth reservations by around 50% on average,

and reduce the usage of cloud resources by over 30%. Most content in this chapter has

been published in our work [55].

6.1 System Model

With broker-assisted bandwidth reservation, the system operates periodically on a short-

term basis. At the beginning of a short-term period, each tenant makes a bandwidth

reservation at the broker to satisfy its future bandwidth demand. The broker attempts to

statistically multiplex bandwidth demands from the tenants to save the total bandwidth

reservation cost, while providing each tenant with a certain level of service guarantee.

At the end of the short-term period, the broker charges each tenant a fee for the service

guarantee.


From the perspective of tenants, having the ability to reserve bandwidth on a short-

term basis o↵ers a unique advantage. Many user-facing multimedia workloads, such as

VoD bandwidth demand and online gaming, involve diurnal periodicity and are highly

predictable even at a fine granularity of 10-minute intervals [58, 59]. Tenants running

predictable workloads are able to dynamically reserve (and pay for) bandwidth that just

su�ces to accommodate the predicted short-term demand, leading to a cost-e↵ective

operation mode. From the perspective of cloud providers, o↵ering users the ability to

make bandwidth reservations constitutes an important performance guarantee, which is a

desirable feature that makes the cloud provider more attractive to QoS-sensitive tenants.

6.1.1 The Cloud and Tenants

We now present our system model in more detail. We start from a description of cloud

providers and tenants.

Cloud Providers. We assume there are S cloud providers in the system, each of

which has an outgoing bandwidth capacity Cs

, s = 1, . . . , S. Let Csum be the aggregate

bandwidth capacity of all the cloud providers, i.e., Csum =P

s

Cs

. Throughout this

chapter, we assume that Csum is su�ciently large to satisfy all demands in the system.

This assumption is justified by the “illusion of infinite capacity” [17] and the fact that

costs of high-end routers are dropping more quickly than before.

Tenants. We assume that there are N tenants in the system, each of which reserves

bandwidth on a short-term basis. Without loss of generality, we consider one such short-

term period. Suppose that tenant i’s bandwidth demand in this period is a random

variable Di

with mean µi

and variance �2i

. To ensure that its demand Di

is satisfied with

a high probability 1 � ✏, tenant i needs to reserve bandwidth Bi

= f✏

(Di

), which is 1 � ✏


percentile of the distribution of Di

, i.e., we have

Pr(Di

> Bi

) ✏,

where ✏ is defined as the risk factor. Let D = [D1, . . . , DN

]T, µµµ = [µ1, . . . , µN

]T and

�� = [�1, . . . , �N

]T.

As has been discussed in Chapter 5, the random demands D1, . . . , DN

may be highly

correlated due to the correlation between video genres, viewer preferences and video

release times. Following the notations used in Chapter 5, denote ⇢ij

the correlation

coe�cient of Di

and Dj

, with ⇢ii

⌘ 1. Let ⌃ = [�ij

]N⇥N

be the N ⇥ N symmetric

demand covariance matrix, with �ii

= �2i

, and �ij

= ⇢ij

�i

�j

, for i 6= j.

For convenience, denote �M

the standard deviation of the “market demand”P

i

Di

,

and �iM

the covariance between Di

and the market demandP

i

Di

. Clearly, we have

�iM

=E[(Di

� µi

)(X

j

Dj

�X

j

µj

)] =NX

j=1

�ij

, (6.1)

�M

=

svar[

X

i

Di

] =p1T⌃1 =

sX

i,j

�ij

. (6.2)

We further denote ⇢iM

= �iM

/�i

�M

2 [�1, 1] as the correlation coe�cient between Di

and the market demandP

i

Di

.

6.1.2 The Broker and Load Direction Matrix

As shown in Fig. 6.1, the essence of the broker is a load direction weight matrix W =

[wsi

]S⇥N

, where wsi

represents the proportion of tenant i’s demand Di

served by cloud

provider s, for s = 1, . . . , S and i = 1, . . . , N , with 0 wsi

1. At the cloud provider


side, given [wsi

], the bandwidth load imposed on cloud provider s is a random variable

Ls

=P

i

wsi

Di

.

Similar to Bi

= f✏

(Di

), we define As

= f✏

(Ls

) (As

Cs

) as the amount of bandwidth

reserved from cloud provider s to ensure that the load Ls

is satisfied with a high proba-

bility 1 � ✏. Clearly, the reserved bandwidth As

must not exceed the capacity of cloud

provider s, i.e., As

Cs

, or equivalently,

Pr(Ls

> Cs

) ✏, 8 s. (6.3)

For convenience, let L = [L1, . . . , LS

]T and let ws

= [ws1, . . . , wsN

]T. Note that ws

can

be interpreted as the workload portfolio of cloud provider s. From tenant i’s point of

view, given [wsi

], the total fraction of its demand Di

served by all cloud providers is

wi

=P

s

wsi

1. For convenience, let w = [w1, . . . , wN

]. Clearly, we have w =P

s

ws

.

The broker takes demand estimates µµµ, ⌃ and cloud provider capacities C1, . . . , CS

as inputs, and outputs a load direction weight matrix W⇤ = [w⇤si

] to achieve a certain

goal in favor of cloud providers, tenants, or the broker, or all of them. The output W⇤

corresponds to a way that N inter-correlated random demands D1, . . . , DN

are packed

into S cloud providers of limited capacities. For convenience, let w⇤i

=P

s

w⇤si

and

let w⇤s

= [w⇤s1, . . . , w

⇤sN

]. A “good” broker should satisfy the demand of tenants while

reserving as little bandwidth resource as possible from cloud providers. In other words,

the direction matrix W⇤ should be chosen such that: 1) w⇤i

= 1, for all tenants i =

1, . . . , N ; and 2) the total bandwidth reservationP

s

As

is minimized.


w11

w21 w12

w22

C2

D1 D2

w1 = w11 + w21 w2 = w12 + w22

L1 =P

i Diw1i L2 =P

i Diw2i

L2

A2

L1

C1Capacity

r.v. r.v.

r.v. r.v.

A1Booked Bandwidth

Load

Demands

Service fraction

Load Direction

VoD Providers

Broker

DataCenters

Figure 6.1: A system of two cloud providers and two tenants. Random variables arelabeled with “r.v.”. The data centers are possibly owned by di↵erent cloud providers.

6.1.3 Bandwidth Pricing in the Presence of a Broker

Without loss of generality, we assume that each cloud provider charges a unit bandwidth

fee of $1 for every unit of bandwidth reserved in each short-term period. The broker needs

to pay $1 ·P

s

As

in total to the cloud providers for booking a total amount ofP

s

As

bandwidth. On the other hand, it charges tenant i a fee of Pi

(wi

) for accommodating wi

portion of its bandwidth demand Di

according to some pricing strategy Pi

(·). Under a

given W, the broker profit is thus

R(W) =P

i

Pi

(wi

) �P

s

As

. (6.4)

We make minimum assumptions on the pricing strategy Pi

(·) adopted by the broker as

follows:

Definition 1 A bandwidth pricing strategy Pi

(·) relative to tenant i is a concave func-

tion Pi

(wi

) of the service portion wi

2 [0, 1], with Pi

(0) = 0. The collection of bandwidth

pricing strategies {Pi

(·) : i = 1, . . . , N} forms a pricing policy.

6.2. MAIN OBJECTIVES 102

We observe that these assumptions on Pi

(·) are quite reasonable. Pi

(·) is concave

because in real life wholesale is cheaper than retail, i.e., the more a user buys some

goods, the lower the unit price P (wi

)/wi

she has to pay. Pi

(0) = 0 because a tenant

should pay nothing when receiving no service. In our system model, the pricing policy

{Pi

(·)} can be authoritatively enforced in a controlled market, or negotiated between the

broker and each tenant in a free market. As a special case, the broker may adopt the

cloud pricing policy {P 0i

(·)}:

P 0i

(wi

) = f✏

(Di

wi

) · 1, (6.5)

with P 0i

(1) = f✏

(Di

) = Bi

, 8 i. In other words, the broker charges the same amount of

money as cloud providers do. Note that a “good” pricing policy {Pi

(·)} should help ten-

ants to save money, i.e., Pi

(wi

) P 0i

(wi

), 8 wi

2 [0, 1], 8 i 2 1, . . . , N . Otherwise,

tenants will have little incentive to reserve bandwidth through a broker instead of from

cloud providers directly.

6.2 Main Objectives

In this section, we outline our objectives in this chapter.

Cloud Bandwidth Saving. From the cloud and social wellness point of view, a

basic goal is to enhance resource e�ciency in the cloud. When aggregate capacity supply

is su�cient, this means to serve all given demands D using as little bandwidth resourceP

s

As

as possible, i.e.,

minW

Ps

As

(6.6)


s.t. As

Cs

, 8 s, wi

= 1, 8 i. (6.7)

This leads to the first question we ask in this chapter:

(Q1) How do we e�ciently determine the load direction matrix W⇤ to minimize the

aggregate bandwidth reservation from all cloud providers?

Chapter 5 has already shown that the statistical anti-correlation between tenant demands

can be utilized to optimally consolidate the workload. We provided a simple criterion to

quickly check if all given demands can be served or not, without having to solve (6.6). If

yes, we have given nearly closed-form solutions to (6.6). In Sec. 6.3, we will review these

results, for the consistency of presentation in this Chapter.

Pricing in Controlled Markets. Note that objective (6.6) requires the broker to

be altruistic to the cloud and tenants. However, a selfish broker has no obligation to

accommodate all demands for service, i.e., it may output w⇤i

< 1 for some i. Neither will

it necessarily optimize load direction to maximize cloud bandwidth saving. Instead, the

broker may simply decide W⇤ by maximizing its profit R(W) as follows

maxW

R(W) =P

i

Pi

(wi

) �P

s

As

(6.8)

s.t. As

Cs

, 8 s. (6.9)

We refer to such a broker as a profit-driven broker. We observe that even a profit-driven

broker is willing to participate in the controlled market, since R(W⇤) � R(W)

��w

si

=0, 8 s,i

=

0 for all possible pricing policies. Although there is always some incentive for the broker

to operate regardless of the pricing policy {Pi

(·)}, the key question is

(Q2) Under what pricing policy {Pi

(·)} is a profit-driven broker an altruistic broker that


maximizes cloud bandwidth saving while accommodating all demands for service?

In other words, under what pricing policy {Pi

(·)} are problems (6.8) and (6.6)

equivalent?

An answer to this question is significant, in that it enables the optimal allocation

of cloud resources simply through a money-driven broker, by enforcing a good pricing

policy {Pi

(·)} to the market. In Sec. 6.4, we give a necessary and su�cient condition

for such good pricing policies. As has been mentioned, broker pricing {Pi

(·)} should

be upper-bounded by the cloud pricing {P 0i

(·)}. Since the broker can turn a profit from

bandwidth saving through arbitrage (out of no investment), tenants may expect to obtain

price discounts. In Sec. 6.4, we also find the fair discount each tenant can enjoy, without

hurting the interests of the cloud and other tenants.

Pricing in Free Markets. The good pricing policies found by answering question

(Q2) are enforced in the market by a supervisory agency other than the broker or tenants.

However, in a free market, a selfish tenant has the freedom to negotiate with the broker

on the price it should pay. We assume each tenant i can submit to the broker its own

pricing strategy Pi

(·). In return, based on all the submitted pricing strategies {Pi

(·)},

the broker decides a service fraction w⇤i

for each i by maximizing its own profit. The

submitted prices can a↵ect all tenants’ utilities. We assume each tenant expects to be

fully served (i.e., w⇤i

= 1, 8 i), though striving to lower its payment. Intuitively, if Pi

(·)

is too low, the broker will not serve Di

completely out of profit concerns. Ironically, if

Pi

(·) is su�ciently high, Di

may not get fully served either, depending on the prices other

tenants submit. We ask the question:

(Q3) In a free market where each selfish tenant competes for service by submitting its

own pricing strategy Pi

(·), will {Pi

(·)} reach an equilibrium point?

6.3. OPTIMAL BANDWIDTH SAVING 105

In Sec. 6.5, we use game theory to show that {Pi

(·)} will converge to a unique Nash

equilibrium {P ⇤i

(·)}. It turns out {P ⇤i

(·)} forms the lower border of the good pricing

region, which means that without intervention, the free market itself will lead to the

optimal load direction, where the tenants receive the maximum price discounts. A key

finding of this chapter is the following theorem:

Theorem 2 In equilibrium market, tenant i pays

$ P ⇤i

(w⇤i

) = µi

+ [f✏

(Di

) � µi

]⇢iM

= µi

+ ✓ · �i

⇢iM

(6.10)

for bandwidth reservation, where ⇢iM

2 [�1, 1] is the correlation coe�cient between Di

andP

i

Di

.

In Sec. 6.5, we discuss the connections between demand statistics and bandwidth

pricing.

6.3 Optimal Bandwidth Saving

For consistency, in this section, we briefly review our answer to question (Q1) — what are

the load direction weights W⇤ that achieve the most bandwidth saving? This question has

already been answered in Chapter 5. We assume each random demand Di

is Gaussian-

distributed. This assumption has already been verified by real-world traces in Chapter 5

(please refer to Fig. 5.4).

Recall from Chapter 5 that for a Gaussian random variable X, we have

f✏

(X) = E[X] + ✓pvar[X], ✓ = F�1(1 � ✏), (6.11)


where F (·) is the CDF of normal distribution N (0, 1). If each Di

is Gaussian-distributed,

so is Ls

=P

i

wsi

Di

, and we have

8><

>:

E[Ls

] = µ1ws1 + . . . + µN

wsN

= µµµTws

,

var[Ls

] =P

i,j

⇢ij

�i

�j

wsi

wsj

= wTs

⌃ws

.

It follows that

Bi

= f✏

(Di

) = µi

+ ✓�i

,

As

= f✏

(Ls

) = µµµTws

+ ✓pwT

s

⌃ws

.

Therefore, cloud resource minimization (6.6) has the following form under Gaussian

demands:

minW

Ps

(µµµTws

+ ✓p

wTs

⌃ws

), (6.12)

s.t. µµµTws

+ ✓p

wTs

⌃ws

Cs

, 8 s, (6.13)

w =P

s

ws

= 1, (6.14)

0 � ws

� 1, 8 s, (6.15)

where 1 = [1, . . . , 1]T and 0 = [0, . . . , 0]T are N -dimensional column vectors. Constraint

(6.13) is equivalent to As

Cs

and thus to (6.3).Note that problem (6.12) has the same

form as problem (5.9) in Chapter 5, with the only di↵erence being that now the broker

is responsible for load direction.

Recall that when capacity supply is su�cient, nearly closed-form solutions to prob-

lem (6.12) can be derived, as has been shown in Theorem 1 in Chapter 5. We rewrite

Theorem 1 of Chapter 5 here for ease of presentation in the sections to follow:


Theorem 3 When Csum � µµµT1 + ✓p1T⌃1, an optimal solution [w⇤

si

] is given by

w⇤si

= ↵s

, 8 i, s = 1, . . . , S, (6.16)

where ↵1, . . . , ↵S

can be any solution to

X

s

↵s

= 1, 0 ↵s

min

8><

>:1,

Cs

µµµT1 + ✓p1T⌃1

9>=

>;, 8 s. (6.17)

When Csum < µµµT1+✓p1T⌃1, there is no feasible solution that satisfies constraints (6.13)

to (6.15).

6.3.1 Discussions

Using Theorem 3, the broker can quickly check if all demands can be served, simply by

comparing the total cloud capacity Csum with total bandwidth required for all demands

combined:

f✏

(X

i

Di

) = E[X

i

Di

] + ✓

svar[

X

i

Di

] = µµµT1 + ✓p1T⌃1.

If Csum < µµµT1+ ✓p1T⌃1, it is infeasible to pack all demands into the limited resources

at a risk factor of ✏. Otherwise, all demands can be accommodated.

When supply is su�cient, as has been mentioned in Chapter 5, Theorem 3 implies

that in the optimal solution, each tenant should direct its workload into S cloud providers

following the same weights ↵1, . . . , ↵S

, withP

s

↵s

= 1. An altruistic broker can e�ciently

find a feasible set of ↵1, . . . , ↵S

subject to linear constraints (6.17), and promise each

6.4. PRICING REGION IN CONTROLLED MARKETS 108

tenant i to serve all its demand, under the same agreement that ↵s

of Di

will be directed

to cloud provider s.

The maximum bandwidth saving from joint bandwidth booking as compared to book-

ing bandwidth individually for each tenant is

�B(W⇤) =P

i

Bi

�P

s

As

=P

i

(µi

+ ✓�i

) �P

s

(µµµTw⇤s

+ ✓p

w⇤Ts

⌃w⇤s

)

= ✓(��T1 �p1T⌃1) = ✓(

Pi

�i

� �M

), (6.18)

which is ✓ times the gap between the sum of all demand standard deviations and the

standard deviation of all demands combined. Furthermore, the number of cloud providers

S does not a↵ect bandwidth saving: theoretically, there is no loss of resource e�ciency

if a big cloud provider is substituted by numerous small cloud providers with the same

aggregate capacity.

6.4 Pricing Region in Controlled Markets

In this section, we answer question (Q2)—if the broker is profit-driven, what pricing

policies need be enforced in the market in order to benefit cloud resource e�ciency and

the interests of tenants? Theorem 3 provides a necessary and su�cient condition for

supply to exceed demand. We thus assume ⌃s

Cs

� µµµT1 + ✓p1T⌃1 throughout the

chapter.

We assume a selfish broker determines W⇤ by maximizing its own profit via (6.8).

Similarly, under Gaussian demands, we can translate (6.8) into the following problem:

maxW

Pi

Pi

(wi

) �P

s

(µµµTws

+ ✓p

wTs

⌃ws

), (6.19)


s.t. µµµTws

+ ✓p

wTs

⌃ws

Cs

, 8 s, (6.20)

w =P

s

ws

� 1, (6.21)

0 � ws

� 1, 8 s, (6.22)

A selfish broker does not only have a di↵erent objective (6.19) than (6.12), but also

has a constraint (6.21) di↵erent from (6.14), since a profit-driven broker has no obligation

to accommodate all demands for service, if it cannot gain a higher profit by doing so.

Instead, the broker decides the service fraction w⇤i

1 for each tenant depending on the

pricing policy {Pi

(·)}. We aim to find all pricing policies such that problem (6.19) is

equivalent to problem (6.12).

The following theorem gives a necessary and su�cient condition for a good pricing

policy:

Theorem 4 Broker profit maximization (6.19) and cloud bandwidth resource minimiza-

tion (6.12) have the same optimal solution (6.16), if and only if

P 0i

(1) � µi

+ ✓ · �iM

�M

, 8 i, (6.23)

where �iM

is the covariance between Di

andP

i

Di

given by (6.1) and �M

is the standard

deviation ofP

i

Di

given by (6.2). Furthermore, if P 0i

(1) < µi

+ ✓�iM

/�M

for some i,

then w⇤ 6= 1.

Proof: We first show that if P 0i

(1) � µi

+ ✓�iM

/�M

, 8 i, (6.16) is also an optimal

solution to (6.19). Using the bound (5.17), problem (6.19) can be relaxed to the following


concave problem:

maxw

Ru

(w) :=P

i

Pi

(wi

) � (µµµTw + ✓pwT⌃w), (6.24)

s.t. µµµTw + ✓pwT⌃w

Ps

Cs

, (6.25)

w � 1, (6.26)

where Ru

(w) � R(W) is a concave function, and (6.25) is a relaxation of constraints

(6.20). Since µµµT1 + ✓p1T⌃1

Ps

Cs

, w = 1 is a feasible solution to (6.24).

The derivative of Ru

(w) with respect to wi

is given by

@Ru

(w)

@wi

��w=1

=P 0i

(wi

)

��w

i

=1

� µi

�✓P

N

j=1 �ij

wjp

wT⌃w

��w=1

=P 0i

(1) � µi

�✓Cov[D

i

,P

j

Dj

]p1T⌃1

=P 0i

(1) � µi

� �1M

�M

(6.27)

If (6.23) is satisfied, then @Ru

(w)/@wi

|w=1 � 0, 8 i. According to the gradient ascent

algorithm for concave maximization, w = 1 is the optimal solution to (6.24). If wsi

= ↵s

given by (6.16), we have the following 3 facts:

• All constraints (6.20)-(6.22) are satisfied;

• Ru

(w) = R(W);

• Ru

(w) = Ru

(1) in problem (6.24) reaches optimality.

Therefore, (6.16) is an optimal solution to (6.19).


Now we prove the converse. If there exists an i such that

P 0i

(1) < µi

� �1M

�M

, (6.28)

then the optimal solution w⇤ to the relaxed problem (6.24) must satisfy w⇤i

< 1. Then

the original problem (6.19) has an optimal solution w⇤si

= �s

, 8 i, 8 s, where �s

solves

8><

>:

0 �s

min{1, C

s

µ

µ

µ

Tw⇤+✓

pw⇤T⌃w⇤}, 8 s,

Ps

�s

= 1.

Let w⇤⇤si

= ↵s

, 8 i, 8 s be given by (6.16). Since

R

✓[w⇤⇤

si

]S⇥N

◆= R

u

(1) < Ru

(w⇤) = R

✓[w⇤

si

]S⇥N

◆,

we have proved that w⇤⇤si

given by (6.16) is not an optimal solution to (6.19). ut

We depict the region of good pricing policies in Corollary 5, which relies on the

following lemma.

Lemma 1 Let f(x) be a concave function on [0, 1] with f(0) = 0. Let k be a (constant)

real number. If f 0(1) � k, then for each x 2 (0, 1], f(x) � kx. In this case, if f(x0) = kx0

for some x0 2 (0, 1], then f(x) ⌘ kx, 8 x 2 [0, 1].

Proof: Note that the function f(x) is bounded by

f 0(1)x f(x) f 0(1)(x � 1) + f(1), 8 x 2 (0, 1], (6.29)


where the inequality on the left comes from the fact

f(x) =

Zx

0

f 0(t)dt + f(0) �Z

x

0

f 0(1)dt = f 0(1)x (6.30)

and the inequality on the right comes from the concavity.

Hence, if f 0(1) � k, then we have

f(x) � f 0(1)x � kx, 8 x 2 (0, 1].

In this case, if f(x0) = kx0 for some x0 2 (0, 1], then we have

kx0 � f 0(1)x0 � kx0.

It follows that f 0(1) = k. Further, note that

f(x0) =

Zx

0

0

f 0(x)dx �Z

x

0

0

kdx � kx0,

with equality achieved if and only if f 0(x) = k, 8 x 2 (0, x0]. Thus, we have f 0(x) = k,

for all x 2 (0, 1]. That is, f(x) ⌘ kx, 8 x 2 [0, 1]. ut

Corollary 5 In a good pricing policy {Pi

(·) : i = 1, . . . , N}, each Pi

(·) must satisfy for

all wi

2 [0, 1],

(µi

+ ✓ · �iM

�M

)wi

Pi

(wi

) (µi

+ ✓�i

)wi

. (6.31)

Proof: In a good pricing policy, each Pi

(·) cannot exceed cloud provider pricing

P 0i

(·), otherwise, the VoD providers would have booked bandwidth from cloud providers

directly. Therefore, P 0i

(wi

) = (µi

+✓�i

)wi

is the upper bound for Pi

(·). Consider another


wi

Pi(wi)

O 1

µi + � · �iM

�M

µi + ��i

upper border

lower border

P �i (w�

i )

good pricing

P 0i (1)

Figure 6.2: The region of Pi

(·) in a good pricing policy {Pi

(·)}. Pi

(·) is between P ⇤i

(·)and P 0

i

(·), and satisfies P 0i

(1) � P ⇤0i

(1).

pricing policy {P ⇤i

(·)}, where

P ⇤i

(wi

) = (µi

+ ✓ · �iM

�M

)wi

, 8 i. (6.32)

By Theorem 4 and Lemma 1, in a good pricing policy, for each i, Pi

(wi

) � (µi

+

✓�iM

/�M

)wi

, 8 wi

2 (0, 1], with equality achieved only if Pi

(·) ⌘ P ⇤i

(·). ut

6.4.1 Broker Profit and Price Discounts in the Pricing Region

Fig. 6.2 illustrates the region of Pi

(·) in a good pricing policy. Each good Pi

(·) is a

concave function between P ⇤i

(·) and P 0i

(·) (inclusive) with P 0i

(1) � µi

+ ✓�iM

/�M

. In

the good pricing region, a selfish broker behaves like an altruistic broker that optimizes

cloud bandwidth saving while serving all demands. When pricing policy falls out of this

region, the broker will reject a part of demands to chase a profit, i.e., there exists a j

such that w⇤j

< 1, which is undesirable when supply is su�cient. Note that Pi

(·) a↵ects

not only w⇤i

but also w⇤j

for all j. This phenomenon will be understood in Sec. 6.5.

We further analyze the broker profit R(W⇤) in the pricing region. Under any good

pricing policy, the optimal solution to (6.19) is w⇤si

= ↵s

given by (6.16) and w⇤i

= 1.


Thus, the broker’s payment to cloud providers is

X

s

As

=X

s

(µµµTw⇤s

+ ✓p

w⇤Ts

⌃w⇤s

) = µµµT1 + ✓p1T⌃1 =

X

i

µi

+ ✓�M

, (6.33)

Using Corollary 5, we obtain

R(W⇤) X

i

(µi

+ ✓�i

)w⇤i

�X

s

As

= ✓(X

i

�i

� �M

) = �B(W⇤), (6.34)

with equality achieved when Pi

(·) ⌘ P 0i

(·), and

R(W⇤) �X

i

(µi

+✓�

iM

�M

)w⇤i

�X

s

As

= ✓ ·P

i

�iM

�M

� ✓�M

= 0 (6.35)

with equality achieved when Pi

(·) ⌘ P ⇤i

(·).

The above bounds show that the maximum broker profit is essentially the maximum

achievable bandwidth saving �B(W⇤) from joint bandwidth booking given by (6.18),

which depends on the gap between the sum of all demand standard deviations and the

standard deviation of market demand. Moreover, the maximum profit is achieved when

the broker adopts the same pricing policy {P 0i

(·)} as cloud providers do. On the other

hand, when Pi

(·) ⌘ P ⇤i

(·), the broker profit is zero, which means all the profit made from

cloud bandwidth multiplexing has been rewarded to tenants as price discounts.

In a controlled market, the lowest price that tenant i should be charged for having

all its demand served is

P ⇤i

(1) = µi

+ ✓ · �iM

�M

. (6.36)

If any tenant i is charged less than P ⇤i

(1), the broker will deny a part of all demands to

turn a higher profit.

6.5. PRICING IN FREE MARKETS: A GAME THEORETICAL ANALYSIS 115

6.5 Pricing in Free Markets: A Game Theoretical

Analysis

In Sec. 6.4, good pricing is enforced as a policy by a third party other than the broker

and tenants, which is di�cult to implement in reality. To closely model a free market, we

assume each selfish tenant can freely bargain with the broker to establish a bandwidth

price. Each tenant i simply submits to the broker any pricing strategy Pi

(·) it prefers and

accepts the service fraction w⇤i

returned by the broker. Based on {Pi

(·)} collected from

all tenants, the broker determines the load direction matrix W⇤ and thus the service

fraction w⇤i

for each i by maximizing its own profit. We answer question (Q3) — what

will the prices eventually look like in such a free market of selfish players?

We define a game played by all tenants, each with an independent strategy Pi

(·), i.e.,

the price it submits. Under a set of submitted prices {Pi

(·)}, we define a utility function

associated with each tenant i as

Ui

[P1(·), . . . , PN

(·)] =

8><

>:

�Pi

(w⇤i

), if w⇤i

= 1,

�1, if w⇤i

< 1,(6.37)

where the broker determines w⇤i

by solving (6.19).

The rationale behind (6.37) is that in a system of su�cient capacity, a selfish tenant

always: 1) expects to get fully served1; and 2) tries to reduce its price if condition 1) is

met. In other words, unlike the case of scarce metal (e.g., gold), it is not profitable for

a tenant to deliberately deny user requests to trade for a lower unit bandwidth cost pi

.

In reality, popularity matters more to a tenant than instantaneous profit per user. In a

1This assumption will be relaxed in Chapter 7.


market of su�cient supply, if a tenant is found sacrificing user requests to chase cheaper

bandwidth deals, it will lose reputation and revenue in the long run.

To find the equilibrium market price, we just need to find the Nash equilibrium in

the above game, where no tenant i can get a better utility by unilaterally changing Pi

(·).

Intuitively, tenant i cannot submit too low a Pi

(·), beyond which the broker will be unable

to guarantee w⇤i

= 1 via (6.19), leading to Ui

= �1. From the analysis in Sec. 6.4, one

may conjecture that {P ⇤i

(·)} is a Nash equilibrium. However, is this the unique Nash

equilibrium? Could there be another equilibrium point where each tenant submits a very

low price and gets a utility of �1 without being able to be better o↵ by changing its

own pricing? In other words, can market collapse happen?

6.5.1 Nash Equilibrium: Existence and Uniqueness

Theorem 6 If tenants have utility (6.37) and the broker decides W⇤ by maximizing its

profit through (6.19), then {Pi

(·)} will converge to a unique Nash equilibrium {P ⇤i

(·)},

where

P ⇤i

(wi

) = (µi

+ ✓ · �iM

�M

)wi

, 0 wi

1, (6.38)

where �iM

and �M

are given by (6.1) and (6.2), respectively.

Let P�i

(·) represent the pricing strategies of all tenants except for tenant i. The proof

of Theorem 6 relies on the following two lemmas and Theorem 4 that characterize the

solution structure of (6.19) given {Pi

(·)}.

Lemma 2 If P 0i

(1) < µi

+ ✓�iM

/�M

, then w⇤i

< 1 regardless of P�i

(·).


Proof: It su�ces to show that

@Ru

(w)

@wi

��w

i

=1

< 0 for all w with wi

= 1,

since in this case we have w⇤i

= 1 regardless of {wj

}j 6=i

by the gradient ascent algorithm.

Recall that@R

u

(w)

@wi

= P 0i

(wi

) � µi

�✓P

N

j=1 �ij

wjp

wT⌃w.

If P 0i

(wi

) < µi

+ ✓�iM

/�M

, then

@Ru

(w)

@wi

��w

i

=1

< ✓

�iM

�M

�P

N

j=1 �ij

wjp

wT⌃w

!��w

i

=1

.

Now construct a function g(w1, . . . , wN

) as

g(w1, . . . , wN

) =

PN

j=1 �ij

wjp

wT⌃w.

It is easy to see that g(1, . . . , 1) = �iM

/�M

. Hence, it su�ces to show that

g(w1, . . . , wN

) � g(1, . . . , 1) for all w with wi

= 1.

Without loss of generality, we shall need only to prove the case when N = 2 and i = 1,

that is,�21 + ⇢12�1�2w2p

�21 + �2

2w22 + 2⇢12�1�2w2

� �1(�1 + ⇢12�2)p�21 + �2

2 + 2⇢12�1�2

.

It is easy to check that the above inequality indeed holds. Hence, @Ru

(w)/@wi

|w

i

=1 < 0

for all w with wi

= 1, completing the proof. ut

Lemma 3 If Pi

(wi

) = P 0i

(wi

) = (µi

+ ✓�i

)wi

for wi

2 [0, 1], then w⇤i

= 1, regardless of


P�i

(·).

Proof: It su�ces to show that

@Ru

(w)

@wi

��w

i

=1

� 0 for all w with wi

= 1,

since in this case we have w⇤i

= 1 regardless of {wj

}j 6=i

.

Recall that@R

u

(w)

@wi

= P 0i

(wi

) � µi

�✓P

N

j=1 �ij

wjp

wT⌃w.

If Pi

(wi

) = (µi

+ ✓�i

)wi

, then P 0i

(wi

) = µi

+ ✓�i

. Thus,

@Ru

(w)

@wi

��w

i

=1

= ✓

�i

�P

N

j=1 �ij

wjp

wT⌃w

!��w

i

=1

.

Now construct a random variable X as X =P

j 6=i

wj

Dj

+Di

. By the Cauchy-Schwarz

inequality, we have

E2[(Di

� µi

)(X � E[X])] E[(Di

� µi

)2] · E[(X � E[X])2]. (6.39)

That is, �2i

+X

j 6=i

�ij

wj

!2

�2i

pwT⌃w

��w

i

=1

.

It follows immediately that

�i

�P

N

j=1 �ij

wjp

wT⌃w

��w

i

=1

.

Hence, @Ru

(w)/@wi

|w

i

=1 � 0 for all w with wi

= 1. ut

Proof of Theorem 6: We first show that {P ⇤i

(·)} is indeed a Nash equilibrium. We

need to show that at {P ⇤i

(·)}, any tenant i cannot increase its utility Ui

by unilaterally


changing Pi

(·), i.e.,

Ui

[P ⇤i

(·), P ⇤�i

(·)] � Ui

[Pi

(·), P ⇤�i

(·)] 8 i. (6.40)

We exhaust the possibilities of Pi

(·) by considering the range of P 0i

(1). First, if

P 0i

(1) < µi

+ ✓�iM

/�M

,

by Lemma 2, w⇤i

< 1, and by (6.37), we have

Ui

[Pi

(·), P ⇤�i

(·)] = �1 < �P ⇤i

(1) = Ui

[P ⇤i

(·), P ⇤�i

(·)].

Second, if P 0i

(1) � µi

+ ✓�iM

/�M

, while P�i

(·) ⌘ P ⇤�i

(·), according to Theorem 4, we

have w⇤i

=P

s

w⇤si

=P

s

↵s

= 1. Hence, we have

Ui

[Pi

(·), P ⇤�i

(·)] = �Pi

(1) �P ⇤i

(1) = Ui

[P ⇤i

(·), P ⇤�i

(·)].

The inequality above is due to Lemma 1 and the fact that P ⇤i

(wi

) is a linear function in

wi

that passes through (0, 0). We have thus proved (6.40). Therefore, {P ⇤i

(·)} is indeed

a Nash equilibrium.

Now we show that {P ⇤i

(·)} is the unique Nash equilibrium, i.e., if {Pi

(·)} is a Nash

equilibrium, then Pi

(·) ⌘ P ⇤i

(·). First of all, let us show that the following inequality

holds:

P 0i

(1) � µi

+✓�

iM

�M

, 8 i. (6.41)

We prove (6.41) by contradiction. Assume there exists an i such that P 0i

< µi

+✓�iM

/�M

.


By Lemma 2, w⇤i

< 1 and thus Ui

= �1. If tenant i uses another strategy P 0i

(wi

) =

(µi

+ ✓�i

)wi

, then by Lemma 3, we have w⇤i

= 1, regardless of P�i

(·). Hence,

Ui

[P 0i

(·), P�i

(·)] = �P 0i

(1) > �1 = Ui

[Pi

(·), P�i

(·)],

contradicting the definition of Nash equilibrium that unilaterally changing Pi

(·) cannot

increase Ui

. Therefore, (6.41) must hold.

If (6.41) holds, by Theorem 4, we have w⇤i

=P

s

w⇤si

=P

s

↵s

= 1, 8 i, and thus

Ui

[Pi

(·), P�i

(·)] = �Pi

(1), 8 i. If (6.41) holds, by Lemma 1, Pi

(1) � (µi

+ ✓�

iM

�

M

) · 1 =

P ⇤i

(1), with equality achieved only if Pi

(wi

) ⌘ (µi

+ ✓�

iM

�

M

)wi

⌘ P ⇤i

(wi

) for wi

2 [0, 1]. In

other words, if Pi

(·) is not P ⇤i

(·),

Ui

[Pi

(·), P�i

(·)] = �Pi

(1) < �P ⇤i

(1) = Ui

[P ⇤i

(·), P�i

(·)],

contradicting the fact that {Pi

(·)} is a Nash equilibrium. Therefore, Pi

(·) must be P ⇤i

(·).

In other words, {P ⇤i

(·)} is the unique Nash equilibrium. ut

6.5.2 Market Properties and Equilibrium Bandwidth Costs

Theorem 6 shows that even without a supervisory party, the selfishness of tenants nec-

essarily drives {Pi

(·)} into an equilibrium {P ⇤i

(·)}, which is exactly the lower border of

all good pricing policies. Therefore, in a free market, a profit-driven broker is naturally

a workload consolidator that optimizes the cloud resource e�ciency. In equilibrium mar-

kets, we have Pi

(·) ⌘ P ⇤i

(·) and broker profit R(W⇤) = 0 according to (6.35). This is

reasonable since in the presence of competition, the broker would be unable to exploit

the profit on no investment gained from bandwidth multiplexing. Yet the broker can

6.6. TRACE-DRIVEN SIMULATIONS 121

turn a positive profit through agent fees, membership fees, ads and other means. In

addition, the free market automatically guarantees fairness on bandwidth costs across

di↵erent tenants.

Let us now analyze the equilibrium bandwidth cost for each tenant. At market

equilibrium, w⇤i

= 1 and P ⇤i

(w⇤i

) = P ⇤i

(1) given by (6.36). Rewriting (6.36), we have

$ P ⇤i

(w⇤i

) = µi

+ ✓ · �i

⇢iM

= µi

+ [f✏

(Di

) � µi

] · ⇢iM

,

proving Theorem 2.

The payment consists of two parts: $1 per unit of bandwidth for mean demand µi

, and

$⇢iM

per unit of bandwidth reservation exceeding the expected demand µi

. Apparently,

the lower the ⇢iM

, the more discount tenant i receives. In particular, if ⇢iM

< 0, Di

is

negatively correlated with market demand. Surprisingly, except for a charge on mean

demand, now tenant i earns a bonus of �✓�i

⇢iM

> 0 to be in the system. The reason is

that tenant i serves as a risk neutralizer: whenever market demand has a random increase,

Di

will decrease to release occupied resources to accommodate the market surge. This

helps the broker save the bandwidth reservation and hedge under-provision risks.

6.6 Trace-Driven Simulations

In this section, we simulate a broker-assisted cloud bandwidth trading system, evaluate

its benefit quantitatively through trace-driven simulations, and verify assumptions. Our

workload traces come from UUSee [50], an operational large-scale VoD service based in

China. The dataset contains the bandwidth demand in UUSee video channels sampled

every 10 minutes during the 2008 Summer Olympics. We consider two time spans in the


traces, time periods 702—780 and 1562—1640, containing 91 and 176 concurrent video

channels, respectively. We let each video channel i represent tenant i, with Di

equal to

the channel’s demand, assuming the same workload (or demand) is served by the clouds

in a broker-assisted market.

In our system, bandwidth reservation and trading are carried out online every �t = 10

minutes. (The same methodology can be applied to any other values of �t, e.g., 1

hour.) Before time t, the broker should have obtained estimates µµµt

= [µ1t, . . . , µNt

]

and ⌃t

= [�ijt

] for all tenants in the coming period [t, t + �t). Such statistics can

be predicted accurately based on historical demand data, since VoD demand follows

repeated daily patterns and is highly predictable [40, 58, 59]. Some established time

series forecasting tools in econometrics (e.g., [24, 25]) may be used for prediction. In

particular, we implement an online version of seasonal ARIMA and GARCH models

introduced in Chapter 3 and Chapter 4 to predict µµµt

and ⌃t

. In practice, historical

demand data can be easily obtained from cloud providers through their logging services

or from cloud providers themselves. Once µµµt

and ⌃t

are obtained, the broker maximizes

its profit by solving (6.19), making load direction decision W⇤ and bandwidth reservation

As

from each cloud provider s. The broker serves as a gateway that directs w⇤si

fraction

of tenant i’s incoming video requests to cloud provider s for service during [t, t + �t).

The above process is repeated for the next period [t + �t, t + 2�t).

To evaluate the benefit of such a system, we quantify the equilibrium point in the

market. Fig. 6.3 evaluates the saving on cloud bandwidth resources in equilibrium market.

Due to multiplexing, the aggregate bandwidthP

s

Ast

booked by the broker is less than

the aggregate bandwidthP

i

Bit

needed if each channel books individually. We set ✏ = 5%

and ✏ = 1% for time periods 702—780 and 1562—1640, respectively, resulting in an actual


702 722 742 762 78002468

101214

Time Periods

Gbp

s

Mean Bandwidth Saving =35%

Bandwidth Booked IndividuallyBandwidth Booked via BrokerAggregate Bandwidth Demand

(a) ✏ = 5%, ✏0 = 5.06%

1562 1582 1602 1622 164002468

10121416

Time Periods

Gbp

s

Mean Bandwidth Saving =36%

Bandwidth Booked IndividuallyBandwidth Booked via BrokerAggregate Bandwidth Demand

(b) ✏ = 1%, ✏0 = 1.27%

Figure 6.3: The aggregate bandwidthP

s

As

(t) booked by the broker, compared to theaggregate bandwidth

Pi

Bi

(t) needed if each channel books individually and the realaggregate demand

Pi

Di

(t).

under-provision probability of ✏0 = 5.06% and ✏0 = 1.27% in Fig. 6.3(a) and Fig. 6.3(b),

respectively. AlthoughP

i

Bit

always exceeds the real aggregate demandP

i

Dit

, it does

not mean there is no outage when booking individually: each individual channel is still

subject to an under-provision probability ✏. The broker can save bandwidth reservation

by more than 30% on average with controllable quality risks.

We further evaluate the price discount each tenant enjoys. At time t, tenant i pays

$ P 0it

(1) = µit

+ ✓�it

if reserving bandwidth individually, but pays $ P ⇤it

(1) = µit

+

✓�it

⇢iMt

in the equilibrium market with a broker. The discount it receives at time t is

1 � P ⇤it

(1)/P 0it

(1). Fig. 6.4 plots the histogram of discounts 1 � P ⇤it

(1)/P 0it

(1) for all i and

t, showing that the mean price discount depends on demand statistics and even exceeds

50% during the second time span.

An interesting finding is that discount is over 100% for some rare cases, which means

that the payment of some tenant i, P ⇤it

(1) = µit

+ ✓�it

⇢iMt

< 0, at some time t. We

call such tenants “bonus winners,” since the broker would even not mind paying to have

6.7. SUMMARY 124

0 20 40 60 80 100 1200

200

400

600

800

Discounts (%)

� Mean=44%

� Bonus� Winners

(a) t =702—780, 91 channels

0 20 40 60 80 100 1200

200

400

600

800

Discounts (%)

� Mean=53%

� Bonus� Winners

(b) t =1562—1640, 176 channels

Figure 6.4: The histogram of payment discounts of all channels at all times in equilibrium,i.e., the histogram of 1 � P ⇤

i

(1, t)/P 0i

(1, t) for all i and t.

them in the system. As pointed out in Sec. 6.5, this is because “bonus winners” have

demands negatively correlated to the market and thus are risk neutralizers: whenever

there is an increase in market demand, the demand of “bonus winners” drops to make

resources available for other tenants in the market.

6.7 Summary

In this chapter, we consider a scenario in which multiple tenants running user-facing

multimedia workloads can make reservations for bandwidth guarantees from cloud service

providers in order to support continuous media streaming. Such tenants (video providers)

attract inter-correlated random demands that can be directed to multiple cloud providers,

each with a bandwidth capacity. We introduce a profit-driven broker that statistically

mixes demands based on anti-correlation while controlling the resource shortage risks.

We study the behavior of the broker and how each tenant should be charged in such a

broker-assisted bandwidth reservation market.

6.7. SUMMARY 125

The region of all good pricing policies is characterized, such that making a profit

at the broker is equivalent to maximizing cloud resource utilization through optimal

load direction from tenants to the cloud providers. In a free market where tenants can

negotiate and bargain for bandwidth cost, we prove that the bandwidth prices charged to

the tenants will converge to a unique Nash equilibrium, which forms the lower bound of

the good pricing region. Furthermore, the equilibrium bandwidth price billed to a tenant

critically depends on its demand expectation, burstiness and its demand correlation to the

market. Real-world traces verify that the presence of a broker can lower the market price

for cloud bandwidth reservation by around 50% on average and save cloud bandwidth

resources by over 30%.

Chapter 7

Algorithmic Cloud Bandwidth

Pricing

In Chapter 6, we have shown that the bandwidth price each video company (as a cloud

tenant) has to pay in a cloud-based bandwidth reservation market depends on its demand

statistics. Yet we have assumed that each cloud tenant requires 100% of its demand

serviced no matter how expensive it is. However, in reality, a video company does have

an incentive to choose a “guaranteed portion” of its demand to be less than 100%, while

leaving a small amount of demand served with best e↵ort, if guaranteeing this small

additional demand is too costly.

We consider the following guaranteed service. A tenant can simply specify a percent-

age (e.g., 95%) of its (bandwidth) demand to be served with guaranteed performance,

which we call the guaranteed portion, while the rest of its demand will be served with

best e↵ort. It is then the cloud provider’s responsibility to satisfy the guaranteed portion

of the tenant with a high probability. Since the cloud provider has vast historical work-

load data, it can leverage statistical learning to predict tenant demands and make actual

126

CHAPTER 7. ALGORITHMIC CLOUD BANDWIDTH PRICING 127

bandwidth reservations for the tenants. The cloud can also multiplex tenant demands to

save bandwidth reservation cost.

In this chapter, we continue to study the pricing of such guaranteed services, but

assume the more realistic case that the bandwidth pricing can actually a↵ect tenant

demands, i.e., their choices of the guaranteed portion may vary according to the prices

set. On the other hand, such demand changes will in turn a↵ect the pricing decisions,

leading to the challenge of handling potentially unstable iterations. Moreover, due to

multiplexing, the absolute amount of bandwidth reserved for each tenant is unknown.

It is a challenging question to find out each tenant’s fair share in the aggregate service

cost, especially when pricing and demand depend on each other and both change in a

dynamic way.

To overcome these di�culties, we define the reservation fee of each tenant as a func-

tion of its specified guaranteed portion instead of the absolute amount of bandwidth

reserved. We also express each tenant’s utility as a function of its guaranteed portion,

which essentially measures the Quality of Service (QoS) at the tenant. For example,

a guaranteed portion of 95% for a video provider implies that 95% of all its end user

requests will receive a guaranteed service.

We study a cloud provider whose objective is to maximize the social welfare of the

system, i.e., the total expected tenant utility minus the aggregate service cost. Although

the cloud cannot know the exact form of utility at each tenant, it can a↵ect each tenant’s

choice of guaranteed portion indirectly through pricing, and thus control the social wel-

fare achieved. To expedite the computation of the optimal pricing policy in the presence

of a large number of tenants, we propose novel methods to e�ciently solve the underlying

large-scale distributed optimization problem with a coupled objective function. These

7.1. A NEW TENANT-CLOUD AGREEMENT 128

new methods are step-size-free and proved to be more e�cient than traditional gradi-

ent/subgradient methods based on Lagrangian dual decomposition in a decentralized

setting.

Finally, we evaluate the proposed algorithms on the workload traces of a real-world

VoD system, namely, UUSee [9]. We conduct trace-driven simulations of bandwidth

reservation and algorithmic pricing based on the demand prediction proposed in Chap-

ters 3 and 4. The system is shown to operate e↵ectively at a fine granularity (of as small

as 10 minutes).

Pricing in distributed systems has also been approached with game theory [43, 44].

However, the objective of this chapter is concerned with achieving fast and robust con-

vergence in large-scale systems, while incentive issues are beyond our focus.

7.1 A New Tenant-Cloud Agreement

Our system model is a generalization of the operation mode of the current cloud. Current

cloud providers charge tenants a usage fee based on the number of bytes transferred in

the past hour, and do not provide bandwidth guarantees. We extend this model to allow

tenants to make reservations for (egress) bandwidth guarantees explicitly, assuming that

the network reservation technologies proposed in [19, 39, 71] can be realized in cloud

computing systems. Our system operates on a short-term basis, e.g., based on hours or

tens of minutes. At the beginning of each short period, each tenant specifies a guaranteed

portion to guard against performance risks. The cloud decides the actual bandwidth

reservation for all the tenants based on both demand estimation and the submitted

guaranteed portions, and charges both a usage and reservation fee. We now describe our

system model in detail.


We consider one such short period, where N bandwidth-sensitive tenants are present.

Suppose that in this period, tenant i’s bandwidth demand is a random variable Di

(Mbps)

with mean µi

= E[Di

] and variance �2i

= var[Di

]. We assume the cloud can predict µi

and �i

based on demand history and share them with tenant i before the period starts.

In our proposed market, the key commodity traded is a notion called the guaranteed

portion instead of the absolute amount of bandwidth. Specifically, the tenants and cloud

will comply to the following service agreement Si

(wi

, ✏, Ri

):

• Before the period starts, each tenant i specifies a guaranteed portion wi

2 [0, 1];

• The cloud guarantees wi

fraction of demand Di

with a high probability 1 � ✏;

outage is allowed to happen with a small probability ✏, during which the bandwidth

allocated to tenant i is limited to Ri

.

The parameters ✏ and Ri

are a part of a service level agreement (SLA) advertised by the

cloud provider. We introduce the risk factor ✏, because for random demand, no matter

how much bandwidth is allocated, there exists a small risk of resource shortage.

Let qi

denote the actual bandwidth usage (realized data rate) in this period. Under

a guaranteed portion wi

, service Si

(wi

, ✏, Ri

) is supposed to lead to the following actual

usage of tenant i:

qi

(wi

) =

8><

>:

wi

Di

, w.p. 1 � ✏,

min{wi

Di

, Ri

}, w.p. ✏,

(7.1)

i.e., with probability 1� ✏ the actual usage qi

is a realization of wi

fraction of its demand

Di

, while during outage (which happens with a small probability ✏), the actual usage qi

is rationed by Ri

.

Clearly, tenant i will choose wi

based on both the utility Ui

and the price of guarantee-

ing wi

portion of its demand Di

. Unlike most prior work on network utility maximization


(NUM) [32], which assumes that the utility Ui

depends on a single variable such as rate,

we model utility Ui

(qi

, Di

) of tenant i as a function of both actual usage qi

and the de-

mand Di

. For example, a video content provider (or a VoD company) may have a linear

utility gain (or revenue) ↵i

qi

from usage qi

and a convexly increasing utility loss eAi

(Di

�q

i

)

for the denied requests Di

� qi

, with ↵i

, Ai

being tenant-specific parameters:

Ui

✓qi

(wi

), Di

◆= ↵

i

qi

(wi

) � eAi

(Di

�q

i

(wi

)), (7.2)

where the utility loss term can model the reputation degradation and potential revenue

loss due to unfulfilled demand1. We assume Ui

is concave and monotonically increasing

in qi

.

The price for tenant i to use service Si

(wi

, ✏, Ri

) is divided into two parts: a usage fee

and a reservation fee. As most current cloud providers do, we assume uniform pricing for

usage: each tenant pays $p for every unit of bandwidth consumed. As a key departure

from current clouds, we introduce a reservation fee, which is a function of the guaran-

teed portion instead of the absolute amount of bandwidth reservation: each tenant i is

charged a price of $ki

wi

for having wi

portion of its demand guaranteed. We price the

guaranteed portion wi

rather than the absolute bandwidth, because tenants usually have

no idea about how much bandwidth they need. Instead, they can intuitively know what

percentage of guarantee is desired. This new business model frees each tenant from the

computational burden of demand prediction: it simply submits its desired guaranteed

portion wi

, while the cloud provider computes the actual bandwidth reservation as well

as the reservation fee ki

wi

for each tenant.

1Even though the cloud provider may still be able to fulfill Di � qi in a best-e↵ort fashion, the tenantwill have no knowledge if this is the case, and will not be able to factor it into its expected utility.


We define the surplus of tenant i as its utility minus its price:

Si

(wi

, p, ki

) := Ui

✓qi

(wi

), Di

◆� pq

i

� ki

wi

. (7.3)

Given prices p and ki

, a rational tenant will choose a wi

to maximize its surplus Si

.

Tenant i will not always choose wi

= 1 because when its demand is bursty, the price to

guarantee 100% of Di

may be high. In this case, tenant i will choose a wi

close to 1

instead of being exactly 1, while the rest of its demand (1 � wi

)Di

will be served with

best e↵ort.

Based on tenant-specified guaranteed portions w1, . . . , wN

, the cloud should guarantee

the demands w1D1, . . . , wN

DN

for service. Denote w := [w1, . . . , wN

]T. To realize the

above service guarantees, the cloud provider needs to reserve a total bandwidth capacity of

K(w). Depending on the technology used, the value of K could vary significantly from

one case to another. For example, a simple non-multiplexing technology is to reserve

Ri

capacity for each tenant i individually such that demand wi

Di

is satisfied with high

probability, i.e.,

Pr(wi

Di

> Ri

) < ✏, (7.4)

and correspondingly, reserve capacity K =P

i

Ri

in total. In contrast, a multiplexing

technology will reserve capacity K for the tenants altogether such that the aggregate

demand is satisfied with high probability, i.e.,

Pr(X

i

wi

Di

> K) < ✏, (7.5)

and during outage (whenP

i

wi

Di

> K), the usage qi

of tenant i is rationed to Ri

withP

i

Ri

= K. In both cases, K is an implicit function of w defined by the probabilistic

7.2. PRICING TOWARDS SOCIAL WELFARE MAXIMIZATION 132

constraints (7.4) and (7.5), respectively. To determine K(w), the cloud provider must

estimate the future demand statistics of all the tenants, and convert tenant-specified

guaranteed portions w into the actual total bandwidth reservation K.

Similarly, the cloud provider has two kinds of service costs: usage and reservation

costs. We assume tier-1 ISPs charge the cloud provider $b for every unit of bandwidth

actually used. Furthermore, reserving bandwidth capacity K will incur a reservation cost

of $c(K). Due to multiplexing gain, to guarantee a similar service level, multiplexing will

incur a lower K and thus a lower reservation cost c(K) than without multiplexing.

We define the cloud profit ⇧ as the di↵erence between its total revenue and total cost,

i.e.,

⇧(w) :=X

i

✓pq

i

(wi

) + ki

wi

◆� c

✓K(w)

◆�X

i

bqi

(wi

). (7.6)

7.2 Pricing Towards Social Welfare Maximization

We study a cloud provider as a social planner whose objective is to maximize social

welfare W (w), which is defined as the total tenant utility minus the total service cost:

W (w) :=X

i

Ui

� c

✓K(w)

◆�X

i

bqi

(wi

)

= ⇧(w) +X

i

Si

(wi

, p, ki

) (7.7)

Under random demands, the cloud aims to decide a set of optimal guaranteed portions

w⇤ = [w⇤1, . . . , w

⇤N

]T for the tenants to maximize the expected social welfare by solving

maxw E[W (w)]

s.t. 0 w 1.(7.8)


To solve (7.8), we first derive the expected social welfare in a simple approximated

form. Note that E[Ui

] is bounded as follows:

E[Ui

] � (1 � ✏)E[Ui

(wi

Di

, Di

)] + ✏E[Ui

(0, Di

)],

E[Ui

] (1 � ✏)E[Ui

(wi

Di

, Di

)] + ✏E[Ui

(Ri

, Di

)].

When the risk factor ✏ is small, we have

E[Ui

(qi

, Di

)] ⇡ (1 � ✏)E[Ui

(wi

Di

, Di

)]. (7.9)

To simplify notations, we define

Ui

(wi

) := E[Ui

(wi

Di

, Di

)], (7.10)

which turns out to be monotonically increasing and concave in wi

under very mild tech-

nical conditions. Similarly, the expected usage of tenant i is

E[qi

(wi

)] ⇡ (1 � ✏)E[wi

Di

] = (1 � ✏)wi

µi

. (7.11)

Therefore, the expected surplus of tenant i is

E[Si

(wi

, p, ki

)] =E[Ui

] � pE[qi

] � ki

wi

= (1 � ✏)

✓U

i

(wi

) � pwi

µi

◆� k

i

wi

, (7.12)


and the expected profit of the cloud provider is

E[⇧(w)] = (p � b)(1 � ✏)X

i

wi

µi

+X

i

ki

wi

� c

✓K(w)

◆. (7.13)

Substituting the above into (7.7) gives the expected social welfare as

E[W (w)] = (1 � ✏)X

i

✓U

i

(wi

) � bwi

µi

◆� c

✓K(w)

◆. (7.14)

7.2.1 An Equivalent Pricing Problem

In reality, although the cloud provider has full knowledge about its service cost c(K(w)),

it does not know the utility function of each tenant. In other words, maximizing E[W (w)]

in terms of w requires the cloud to know the utility Ui

of each tenant and is infeasible.

We now convert problem (7.8) into an equivalent pricing problem.

Note that the expected social welfare is also the sum of the expected cloud profit and

the total expected tenant surplus, i.e.,

E[W (w)] = E[⇧(w)] +X

i

E[Si

(wi

, p, ki

)]. (7.15)

Furthermore, when charged with prices p, ki

and facing a random demand Di

, a rational

tenant will choose a guaranteed portion wi

to maximize its expected surplus, i.e.,

wi

= arg maxw

i

E[Si

(wi

, p, ki

)], (7.16)

which defines an implicit function wi

(p, ki

) of the prices. The cloud can a↵ect guaranteed

portion choices w = [w1, . . . , wN

]T via appropriate choices of prices p, k1, . . . , kN

, and


Tenant i

Cloud Provider

Update

wi(p, ki) p, ki

wi(p, ki) = arg maxwi

E�Si(wi, p, ki)

�

Tenant 1 Tenant N... ...

p, ki to increase social welfare

Figure 7.1: Iterative updates of prices and guaranteed portions.

control the corresponding expected social welfare E[W (w)].

Therefore, the social welfare maximization problem (7.8) is converted into an equiv-

alent optimal pricing problem:

maxp,k1,...,k

N

E[W (w)] = E[⇧(w)] +X

i

E[Si

(wi

, p, ki

)], (7.17)

which, by combining (7.14) and (7.30), can be rewritten as

maxp,k1,...,k

N

(1 � ✏)X

i

✓U

i

(wi

) � bwi

µi

◆� c

✓K(w)

◆, (7.18)

where wi

= wi

(p, ki

) is determined in a distributed fashion by each tenant i via surplus

maximization (7.16). Such a distributed optimization is illustrated in Fig. 7.1. Now the

cloud provider does not need to know Ui

: it simply charges tenant i the usage price p

and reservation price ki

, and expects a guaranteed portion w(p, ki

) chosen by tenant i.

We denote the optimal prices that solve problem (7.18) as p⇤, k⇤1, . . ., k⇤

N

. The

optimal pricing problem (7.18) is equivalent to the original problem (7.8), because by

adjusting p and ki

, wi

(p, ki

) can take any value in [0, 1]. In other words, the guaranteed

portion wi

(p⇤, k⇤i

) chosen by tenant i under optimal pricing p⇤, k⇤i

is exactly the guaranteed


portion w⇤i

that maximizes the expected social welfare, i.e., we have

w⇤i

= wi

(p⇤, k⇤i

). (7.19)

Therefore, once a set of optimal prices is obtained, we essentially have found a de-

centralized solution to expected social welfare maximization (7.8), which was originally

impossible to solve in a centralized way.

Nonetheless, the optimal pricing problem (7.18) is not easy to solve either. At first

glance, (7.18) can be understood as a network utility maximization (NUM) problem

[32] that may be solved via decomposition among the tenants. A closer look at (7.18)

suggests that the term c(K(w)) in the objective function may be coupled among all wi

’s,

so that (7.18) cannot be decomposed into a set of subproblems, each solved at a tenant

in a distributed fashion. Coupling happens in the cost term when the cloud multiplexes

tenant demands and books a capacity K(w) for the aggregated demand. As shown in

Fig. 7.1, the key to the solution is that the cloud provider must be able to update p

and ki

iteratively towards the direction that increases E[W (w)]. And an e�cient price

update algorithm should require fewer rounds of message-passing between the cloud and

tenants before reaching optimality.

Before presenting the distributed solutions to (7.18), let us first provide a number of

insights on how to make pricing policies, by checking the KKT conditions [20] that the

optimal prices p⇤, k⇤1, . . . , k

⇤N

must satisfy.

Proposition 7 The optimal prices p⇤, k⇤1, . . . , k

⇤N

must satisfy

(1 � ✏)(p⇤ � b)µi

+ k⇤i

� c0✓

K(w)

◆· @K

@wi

��w=w

= 0, 8 i, (7.20)


where w = [w1, . . . , wN

]T with wi

= wi

(p⇤, k⇤i

) given by (7.16).

Proof: First, applying the first order conditions @E[Si

]/@wi

= 0 to (7.16) and using

the approximation (7.12), we obtain

(1 � ✏)U0i

(wi

) = p(1 � ✏)µi

+ ki

, (7.21)

which defines an implicit function wi

(p, ki

), with partial derivatives given by

@wi

@p=

µi

U00i

(wi

)< 0,

@wi

@ki

=1

(1 � ✏)U00i

(wi

)< 0. (7.22)

The partial derivatives are negative because U00i

(wi

) < 0 by the concavity of Ui

(·). This

means that when the usage and reservation prices are higher, the tenant will choose a

lower guaranteed portion.

Furthermore, applying KKT conditions [20] @E[W ]/@ki

= 0 and @E[W ]/@p = 0 to

(7.18), we obtain

✓(1 � ✏)

✓U

0i

(wi

) � bµi

◆� c0(K)@K

@wi

��w=w

◆@w

i

@ki

= 0, 8 i,

X

i

✓(1 � ✏)

✓U

0i

(wi

) � bµi

◆� c0(K)@K

@wi

��w=w

◆@w

i

@p= 0.

Since @wi

/@p < 0 and @wi

/@ki

< 0, we have

(1 � ✏)

✓U

0i

(wi

) � bµi

◆� c0(K)@K

@wi

��w=w

= 0. (7.23)

Combining with (7.21), we have derived the proposition. ut


An inspection of (7.20) reveals that one set of optimal prices is given by

8>><

>>:

p⇤ = b,

k⇤i

= @c(K(w))@w

i

��w=w

, 8 i.(7.24)

Although (7.24) is not the only set of optimal prices, our finding complies with the

economic intuition that a welfare-maximizing cloud provider should charge marginal cost

for both tra�c usage and guaranteed reservation. In (7.24), we can also observe that k⇤i

depends on w, which in turn depends on p, k⇤1, . . . , k

⇤N

. Due to such coupling, (7.24) is

not yet a closed-form solution for the reservation price k⇤i

.

7.2.2 No Multiplexing vs. Multiplexing All

To draw insights, we take a look at two special service technologies that may be adopted

by the cloud: non-multiplexing and multiplexing across all the tenants. For simplicity,

we assume a linear reservation cost (which will be relaxed later): c(K) = �K.

When multiplexing is not used, we can derive the optimal prices in a closed form

from (7.24). Without multiplexing, recall that the capacity Ri

is reserved for each tenant

i individually, such that Pr(wi

Di

> Ri

) < ✏ and K =P

i

Ri

. When Di

is a Gaussian

random variable, it is easy to check that

Ri

(wi

) =

✓µi

+ ✓(✏)�i

◆w

i

, (7.25)

where ✓(✏) = F�1(1 � ✏) is a constant, with F (·) being the CDF of normal distribution

N (0, 1). Since the cost function is naturally decoupled among tenants, according to


(7.24), the optimal prices are immediately given in a closed form by p⇤ = b and

k⇤i

= �

✓µi

+ ✓(✏)�i

◆, 8 i. (7.26)

When multiplexing is used, however, optimal prices have no explicit solutions. Recall

that with multiplexing, a capacity K is reserved to accommodate all the tenants together,

such that the aggregate (instead of individual) demand is satisfied with high probability:

Pr(P

i

wi

Di

> K) < ✏.

Since the random demands D1, . . . , DN

of di↵erent tenants may be correlated, we

denote ⇢ij

the correlation coe�cient of Di

and Dj

, with ⇢ii

⌘ 1. For convenience, let

µµµ = [µ1, . . . , µN

]T and ⌃ = [�ij

] be the N ⇥ N symmetric demand covariance matrix,

with �ii

= �2i

and �ij

= ⇢ij

�i

�j

for i 6= j.

Under Gaussian demands, K can be written as

K(w) =E[P

i

wi

Di

] + ✓(✏)pvar[

Pi

wi

Di

]

=µµµTw + ✓(✏)pwT⌃⌃⌃w. (7.27)

Substituting the above K(w) into (7.24) gives p⇤ = b and

k⇤i

= �

✓µi

+ ✓(✏) · @pwT⌃⌃⌃w

@wi

��w=w⇤

◆, 8 i, (7.28)

where w⇤i

= wi

(p⇤, k⇤i

). Clearly, with multiplexing, k⇤i

is not given in a closed form yet,

due to the coupled cost function.

We note that whether with or without multiplexing, K(w) is a convex function and

so is c(K(w)). In fact, we can relax the linear cost assumption c(K) = �K, as long as


c(K(w)) is strictly convex and monotonically increasing in terms of each wi

2 [0, 1].

There is an interesting connection between the non-multiplexing and multiplexing

cases: the optimal solution for non-multiplexing can be used to bound the optimal solu-

tion for the multiplexing case. Specifically, the optimal prices {k⇤i

} of non-multiplexing

upper-bound {k⇤i

} of the multiplexing case, whereas the optimal portions {w⇤i

} of non-

multiplexing lower-bound {w⇤i

} of the multiplexing case. This is intuitive because multi-

plexing leads to a reduced cost c(K), stimulating tenants to increase their choices of the

guaranteed portion. We will use this connection in our distributed algorithms.

7.2.3 Economic Implications

Condition (7.20) has several economic implications, which apply to a general cost func-

tion, although we may use the non-multiplexing case for explanation due to its simplicity.

First, merely adopting a usage price p cannot maximize social welfare: when ki

= 0

for all i, there is no p that can satisfy (7.20). In other words, a positive reservation

fee ki

> 0 is necessary to achieve welfare optimality, since in the presence of demand

uncertainty, only ki

can incorporate a risk factor (e.g., �i

in (7.26)) into pricing. This

reveals that current cloud bandwidth pricing schemes are ine�cient in terms of providing

service guarantee against demand fluctuation. On the other hand, a usage price p is not

necessary: even if p = 0, the expected welfare is maximized as long as ki

satisfies (7.20).

In this case, the reservation fee can be raised to compensate for no usage fee.

Furthermore, heterogeneous reservation prices k1, . . . , kN

are necessary to achieve op-

timality, with each ki

depending on the statistical characteristics of tenant i’s demand Di

.

This conforms to the intuition that tenants have di↵erent degrees of demand volatility,

incurring di↵erent costs for service guarantees. For example, without multiplexing, ki

7.3. DISTRIBUTED OPTIMIZATION WITH COUPLED OBJECTIVES 141

depends on �i

in (7.26): the more bursty a tenant’s demand Di

, the more capacity that

must be reserved to guard against fluctuation, and thus the higher the price. In contrast,

in terms of usage pricing, it is e�cient enough to charge a homogeneous price p for every

unit of bandwidth consumed.

7.3 Distributed Optimization with Coupled Objec-

tives

To solve (7.8), we find that the social welfare maximization problem (7.8) can be written

as a basic resource allocation problem of the following form2:

maxx

nX

i=1

Ui

(xi

) � C(x) (7.29)

s.t. xi

2 [ai

, bi

], i = 1, . . . , n,

for some real numbers ai

, bi

, where x = (x1, . . . , xn

) represents resource allocation, each

Ui

is a strictly concave and monotonically increasing utility function, and C(x) is a

strictly convex cost function that is monotonically increasing in each xi

. Assume C and

Ui

are twice continuously di↵erentiable. In Sec. 7.6.6, we will see that based on duality

theory, problems with general coupled objectives can be transformed into (7.29).

Problem (7.29) also arises in many other networking systems, where each user i is

associated with an utility Ui

(xi

) by using xi

units of some resource, while the users

incur a total cost C(x) at the resource provider, which may be coupled among xi

’s in an

arbitrary way depending on the technology used. The goal of the resource provider is to

2The notations used from Sec. 7.3 to Sec. 7.6 will be independent of the notations used elsewhere inthis chapter.

7.3. DISTRIBUTED OPTIMIZATION WITH COUPLED OBJECTIVES 142

make resource allocation decision x wisely, so that the social welfareP

i

Ui

(xi

) � C(x) is

maximized. Apparently, granting the maximum resources to users will maximize the sum

of their utilities, but will also incur a high cost C(x). As a convex optimization problem,

(7.29) has a unique solution x⇤ = (x⇤1, . . . , x

⇤n

). However, it is undesirable to solve (7.29)

in a centralized way using conventional algorithms such as interior point methods [20],

because the utility Ui

is not known to the resource provider or other users, and the cost

function C is unknown to the users.

Before solving the cloud social welfare maximization problem (7.8), we propose new

distributed approaches to the general resource allocation problem (7.29) with coupled

objective functions, with a goal towards fast and robust convergence in large-scale sys-

tems. Let us illustrate the basic idea of the new approach with the simple example of

maximizing the sum of user utilities minus a coupled provider cost. The algorithm pro-

ceeds in the following iterative steps: given all the resource allocation requirements from

users, the provider computes a price vector as a weighted average of the marginal cost

under current resource requirements and the previous price; each user then computes its

resource requirement under the new price by maximizing its surplus, that is, its utility

minus the price.

The new approach distinguishes itself from traditional gradient methods in three as-

pects. First, while the performance of gradient methods is sensitive to stepsize selection,

the parameter in our approach is a weight between 0 and 1, which is easier to set and

achieves robustness. Second, our approach can be understood as a novel distributed

version of the nonlinear Jacobi algorithm [21], and only passes the gradient information

of certain utilities, e.g., the cost function, instead of all the utilities. By exploiting the

7.4. A PRICING FRAMEWORK FOR DISTRIBUTED OPTIMIZATION 143

structural di↵erence between utility functions, our approach can achieve much faster con-

vergence in certain scenarios. Third, in certain engineering applications, unlike gradient

methods, our approach satisfies a contraction assumption [21], and thus converges even if

executed asynchronously among network elements, further increasing convergence speed.

While enjoying these advantages, the new approach only feeds back gradient information,

and has the same economic interpretation as pricing.

We provide asynchronous convergence conditions of the new approach and compare

its convergence speed with that of gradient methods both analytically and through ex-

periments based on real-world trace data. Furthermore, we show how to generalize the

above basic idea, with the aid of duality theory, to solve a general class of problems with

additive coupled objective functions, which is also referred to as consensus optimization

[26] in statistical and machine learning. The proposed approach also turns out to be a

fast decentralized solution to machine learning problems that involve high-dimensional

data and a large number of training samples.

7.4 A Pricing Framework for Distributed Optimiza-

tion

Let us first introduce a pricing framework, based on which we interpret various distributed

algorithms. Distributed solutions to problem (7.29) are feasible via pricing. The basic

idea is that the provider should charge each user i a price pi

xi

for using resource xi

, where

the unit price pi

can be heterogeneous among users. Given a price vector p = (p1, . . . , pn),

7.4. A PRICING FRAMEWORK FOR DISTRIBUTED OPTIMIZATION 144

problem (7.29) is equivalent to

maxx2

Qi

[ai

,b

i

]

nX

i=1

✓Ui

(xi

) � pi

xi

◆+

✓pTx � C(x)

◆, (7.30)

where Ui

(xi

) � pi

xi

is the surplus of user i for using resource xi

, and pTx � C(x) is the

profit of the provider.

We can classify pricing-based iterative solutions to (7.30) into two types, depending

on who updates the price. In a type I algorithm, the provider sets the price p, under

which each user will maximize its surplus and choose to use resource

xi

(pi

) := arg maxx

i

2[ai

,b

i

]Ui

(xi

) � pi

xi

, 8 i. (7.31)

The provider aims to update p iteratively, such that x(p) = (x1(p1), . . . , xn

(pn

)) produced

by the surplus-maximizing users will converge to the optimal resource allocation x⇤.

We assume each network element can only use its local information and the available

feedback. Therefore, when the resource provider sets the price, the price update for each

user i is a function of the current price p and the returned user resource requirements

x1(p1), . . . , xn

(pn

) given by (7.31). The provider has full knowledge of C but no knowledge

of Ui

.

In a type II algorithm, each user submits a price (or bid) pi

to the provider, who

determines resource allocation x(p) by maximizing its own profit:

x(p) := arg maxx2

Qi

[ai

,b

i

]pTx � C(x), (7.32)

with x(p) = (x1(p), . . . , xn

(p)), and feeds xi

(p) back to each user i, who then updates its

7.5. ALGORITHMS 145

bid pi

as a function of its current bid pi

and the resource allocation xi

(p) determined by

the provider through (7.32). The bid updating at user i can make use of Ui

, but not C

or Uj

for j 6= i.

In both cases, the price update can be rewritten as a function

pi

:= Ti

(p), (7.33)

with T = (T1, . . . , Tn

). The above locality assumptions on the structure of T enforce price

updating under the minimum message-passing overhead. Clearly, when message-passing

is limited to only prices p and resource allocations x(p), any algorithm that requires

information beyond gradients, e.g., Newton methods, can hardly be decentralized.

7.5 Algorithms

Our main objective is to propose e↵ective price updating rules T in both types of algo-

rithms, so that x(p(t)) can converge to the optimal resource allocation x⇤ within a small

number of iterations. We propose two classes of iterative algorithms, depending on who

updates the price p. Before presenting the new algorithms, let us review the existing

method based on Lagrangian dual decomposition with gradient algorithms.

7.5.1 Dual Decomposition and Gradient Methods

We now briefly describe Lagrangian dual decomposition and interpret it as a price up-

dating algorithm of type I introduced in Sec. 7.3. Problem (7.29) is equivalent to

maxx,y

nX

i=1

Ui

(xi

) � C(y) (7.34)

7.5. ALGORITHMS 146

s.t. x = y, xi

2 [ai

, bi

], i = 1, . . . , n.

It is easy to check that the Lagrangian dual problem of (7.34) is

minp

q(p) =nX

i=1

maxx

i

2[ai

,b

i

]{U

i

(xi

) � pi

xi

} + maxy

{pTy � C(y)}.

Since problem (7.34) is convex, the duality gap is zero: if p⇤ solves the dual problem,

then x⇤i

= arg maxx

i

2[ai

,b

i

] Ui

(xi

) � p⇤i

xi

solves the original problem.

Thus, we only need to solve the dual problem. It is known [20, 21, 32] that the

gradient of q(p) is

rq(p) = y(p) � x(p), (7.35)

where y(p) = (y1(p), . . . , yn

(p)) is given by

y(p) = arg maxy

pTy � C(y), (7.36)

and x(p) = (x1(p1), . . . , xn

(pn

)) is defined as follows:

xi

(pi

) = arg maxx

i

2[ai

,b

i

]Ui

(xi

) � pi

xi

. (7.37)

Applying a gradient algorithm to the dual problem yields the following price updating

rule:

pi

:= Ti

(p) = pi

� �

✓yi

(p) � xi

(pi

)

◆, (7.38)

where � is a su�ciently small positive stepsize.

Obviously, (7.38) can be interpreted as a type I algorithm: given a p set by the resource

7.5. ALGORITHMS 147

provider, each user returns xi

(pi

) to the provider by maximizing its surplus. The provider

then computes y(p) locally by maximizing pTy �C(y) and updates p according to (7.38).

Apparently, price update here only makes use of the current price p, the returned x(p),

and the local cost function C at the provider. Similarly, (7.38) can also be explained as

a type II algorithm, which we omit here. We will compare the convergence conditions

and speeds of our new algorithms against the above dual decomposition with gradient

methods.

7.5.2 New Algorithms

We propose two types of new algorithms for solving problem (7.29), which satisfy the

locality assumptions on T in Sec. 7.3.

Algorithm 1: Assume the resource provider updates the price p. The price update

rule is, for all i,

pi

:= Ti

(p) = (1 � �)pi

+ �ri

C

✓x1(p1), . . . , xn

(pn

)

◆, (7.39)

where � 2 (0, 1] is a relaxation parameter, and

xi

(pi

) := arg maxx

i

2[ai

,b

i

]Ui

(xi

) � pi

xi

(7.40)

is the resource requirement collected from each user i under the current price p. This

algorithm has been presented in our published work [57], and a variation of this algorithm

has been presented in our work [56].

Algorithm 2: Assume each user updates its bid pi

. The price update rule is, for all

7.5. ALGORITHMS 148

i,

pi

:= Ti

(p) = (1 � �)pi

+ �U 0i

✓xi

(p)

◆, (7.41)

where � 2 (0, 1] is a relaxation parameter, and

(x1(p), . . . , xn

(p)) = x(p) := arg maxx2Q

i

[ai

, bi

]pTx � C(x) (7.42)

is the resource allocation determined by the provider under the current submitted bids

p.

Instead of relying on dual decomposition, Algorithms 1 and 2 actually observe the

optimality conditions of (7.29). For ease of explanation, let us ignore the constraint

xi

2 [ai

, bi

] for now. By first-order conditions, if x⇤ solves (7.29), we have

U 0i

(x⇤i

) = ri

C(x⇤), 8 i. (7.43)

Let us consider Algorithm 1 when � = 1. Assume the iteration (7.39) has a fixed point

p⇤. Then at p⇤, we have

U 0i

✓xi

(p⇤i

)

◆= p⇤

i

= ri

C

✓x1(p

⇤1), . . . , xn

(p⇤n

)

◆, 8 i, (7.44)

where the first equality is due to the first-order condition of (7.40). Thus, (x1(p⇤1), . . . , xn

(p⇤n

))

satisfies the conditions (7.43) and is an optimal solution x⇤, as our problem is convex.

Similarly, Algorithm 2 executes price updates in the reverse direction of Algorithm 1,

also aiming to achieve (7.44).

The following proposition shows that as long as Algorithm 1 or Algorithm 2 converges,

it will solve the optimization problem (7.29).

7.5. ALGORITHMS 149

Proposition 8 Suppose that Algorithm 1 (or Algorithm 2) has a fixed point p⇤ and an

associated x⇤. Then, x⇤ solves (7.29).

Proof: One approach is to show the fixed point x⇤ satisfies the KKT conditions of

(7.29). Here, we present a simpler proof based on the following lemma, the proof of which

can be found in [21] (pp. 210, Proposition 3.1):

Lemma 4 Suppose F is convex on a convex set X. Then, x 2 X minimizes F over X,

if and only if (y � x)TrF (x) � 0 for every y 2 X.

Suppose Algorithm 1 has a fixed point p⇤ and an associated x⇤. Then we have

8><

>:

p⇤i

= ri

C(x⇤), 8 i,

x⇤i

= arg maxx

i

2[ai

,b

i

] Ui

(xi

) � p⇤i

xi

, 8 i.

(7.45)

Since x⇤i

maximizes Ui

(xi

) � p⇤i

xi

on [ai

, bi

], by Lemma 4,

(yi

� x⇤i

)(U 0i

(x⇤i

) � p⇤i

) 0, 8 yi

2 [ai

, bi

], 8 i.

Since p⇤i

= ri

C(x⇤), we have

Pi

(yi

� x⇤i

)(U 0i

(x⇤i

) � ri

C(x⇤)) 0, 8 y 2Q

i

[ai

, bi

], (7.46)

which, by Lemma 4, means x⇤ solves (7.29).

Suppose Algorithm 2 has a fixed point p⇤ and an associated x⇤. Then, we have

8><

>:

p⇤i

= U 0i

(x⇤i

), 8 i,

x⇤ = arg maxx2

Qi

[ai

,b

i

] p⇤Tx � C(x).

(7.47)

7.5. ALGORITHMS 150

Applying Lemma 4 to the second equation above yields

Pi

(yi

� x⇤i

)(p⇤i

� ri

C(x⇤)) 0, 8 y 2Q

i

[ai

, bi

].

Substituting p⇤i

= U 0i

(x⇤i

) into the equation above leads to (7.46), which, by Lemma 4,

means x⇤ solves (7.29). ut

In Algorithms 1 and 2, surplus (profit) maximization is performed at either users or

the provider, whereas in gradient methods (7.38), such maximizations must be performed

at both users and the provider to obtain x(p) and y(p). First, this greatly reduces the

computational complexity when n is large. In particular, Algorithm 1 can scale to a

very large n without having to solve convex optimization at the provider side. More

importantly, we will see that the new algorithms can better exploit the structures of Ui

and C to achieve faster convergence rates. We will also see that it is easier to choose �

in the new algorithms since it does not have to be very small.

7.5.3 Relationship to the Nonlinear Jacobi Algorithm

All the above algorithms have synchronous and asynchronous versions, depending on

whether updates of pi

and xi

(p) are performed for all users i synchronously or asyn-

chronously. For example, in an asynchronous Algorithm 1, pi

and xi

(pi

) can be updated

for an user i for any finite number of iterations, before other xj

(pj

) for j 6= i are updated.

We have mentioned in Chapter 2 that the nonlinear Jacobi algorithm [21] is a limiting

case of the asynchronous version of Algorithm 1, while Algorithms 1 and 2 can be un-

derstood as novel decentralizations of the nonlinear Jacobi algorithm. In the following,

we show the connection of our proposed algorithms to the nonlinear Jacobi algorithm.

The nonlinear Jacobi algorithm consists of iteratively optimizing in a parallel fashion

7.6. CONVERGENCE CONDITIONS 151

with respect to each variable xi

while keeping the rest of the variables xj

(j 6= i) fixed.

Formally, for problem (7.29), it is defined as

xi

:= arg maxx

i

2[ai

,b

i

]Ui

(xi

) � C(x1, . . . , xi

, . . . , xn

) +X

j 6=i

Uj

(xj

). (7.48)

And the above update is to be conducted for all i in parallel. Note that the above non-

linear Jacobi algorithm cannot be applied to our problem, because the resource provider

does not know any user utility Ui

, and the users do not know the cost function C. Neither

can the nonlinear Jacobi algorithm be interpreted as a pricing-based algorithm.

However, the nonlinear Jacobi algorithm is a limiting case of Algorithm 1 executed

asynchronously. If for each user i, we perform (7.39) and (7.40) of Algorithm 1 for an

infinite number of iterations before updating the other variables xj

(for all j 6= i) in

equation (7.39) for this user, then the xi

obtained this way is the same as the one given

by (7.48), and thus Algorithm 1 becomes the nonlinear Jacobi algorithm. In other words,

the Algorithm 1 proposed here, with a convenient interpretation as pricing, is essentially

a novel distributed version of the nonlinear Jacobi algorithm.

7.6 Convergence Conditions

We analyze the convergence of the above algorithms, and in particular, aim to find su�-

cient conditions under which a certain algorithm is a contraction mapping [21]. Although

more general convergence conditions may be found for an algorithm when executed syn-

chronously, contraction mapping can ensure its convergence even if executed completely

asynchronously [21]. Note that an asynchronous algorithm is much better than its syn-

chronous counterpart, since message-passing between network elements usually incurs


heterogeneous delays in reality. Based on the derived conditions, we will extensively

compare the algorithm robustness to the choice of � and their convergence speeds in

Sec. 7.8.

7.6.1 Contraction Mapping

As some preliminaries, let us introduce contraction mapping and its properties. Many

iterative algorithms can be written as

x(t + 1) := T (x(t)), t = 0, 1, . . . , (7.49)

where x(t) 2 X ⇢ <n, and T : X 7! X is a mapping from X into itself. The mapping T

is called a contraction if

kT (x) � T (y)k ↵kx � yk, 8 x, y 2 X, (7.50)

where k ·k is some norm, and the constant ↵ 2 [0, 1) is called the modulus of T . Further-

more, the mapping T is called a pseudo-contraction if T has a fixed point x⇤ 2 X (such

that x⇤ = T (x⇤)) and

kT (x) � x⇤k ↵kx � x⇤k, 8 x 2 X. (7.51)

The following theorem establishes the geometric convergence of both contractions and

pseudo-contractions:

Theorem 9 (Geometric Convergence) Suppose that X ⇢ <n and the mapping T : X 7!

X is a contraction, or a pseudo-contraction with a fixed point x⇤ 2 X. Suppose the


modulus of T is ↵ 2 [0, 1). Then, T has a unique fixed point x⇤ and the sequence {x(t)}

generated by x(t + 1) := T

✓x(t)

◆satisfies

kx(t) � x⇤k ↵tkx(0) � x⇤k, 8 t � 0, (7.52)

for every choice of the initial vector x(0) 2 X. In particular, {x(t)} converges to x⇤

geometrically.

To simplify notations, we may use ci

(x) to stand for ri

C(x) and ui

(xi

) for U 0i

(xi

).

We denote @x

j

x

i

C(x) = rj

ci

(x) as the second order partial derivatives of C.

7.6.2 Convergence of Algorithm 1

We analyze the convergence of Algorithm 1 under di↵erent choices of �. First, suppose

� = 1. We rewrite Algorithm 1 in another form amenable to analysis. Denote [xi

]+i

the

projection of xi

2 < onto the interval [ai

, bi

], i.e.,

[xi

]+i

= arg minz2[a

i

,b

i

]|z � x

i

|, i = 1, . . . , n. (7.53)

It is easy to check that (7.40) is equivalent to

xi

(pi

) := [arg maxx

i

Ui

(xi

) � pi

xi

]+i

= [u�1i

(pi

)]+i

(7.54)

Therefore, letting � = 1 and substituting (7.39) into (7.54) yields

xi

:= Ti

(x) = [u�1i

(ci

(x))]+i

, i = 1, . . . , n, (7.55)


which is an equivalent iteration to Algorithm 1 when � = 1. The following result gives a

su�cient condition for the convergence of Algorithm 1 with � = 1.

Proposition 10 Suppose � = 1. If, for all i, we have

nX

j=1

|@x

j

x

i

C(x)| < minz

i

|U 00i

(zi

)|, 8 x 2Q

i

[ai

, bi

], (7.56)

then T = (T1, . . . , Tn

) given by (7.55) is a contraction and the sequence {x(p(t))} gener-

ated by Algorithm 1 converges geometrically to the optimal solution x⇤ to (7.29), given

any initial price vector p(0).

Proof: For each i, define a function gi

: [0, 1] 7! < by

gi

(t) = u�1i

✓ci

(z(t))

◆= u�1

i

✓ci

(tx + (1 � t)y)

◆. (7.57)

Note that gi

is continuously di↵erentiable. We have

|Ti

(x) � Ti

(y)| = |[u�1i

(ci

(x))]+i

� [u�1i

(ci

(y))]+i

|

|gi

(1) � gi

(0)| =

��Z 1

0

dgi

(t)

dtdt

��

Z 1

0

��dg

i

(t)

dt

��dt maxt2[0,1]

��dg

i

(t)

dt

��,

where the first inequality is because |[xi

]+i

� [yi

]+i

| |xi

� yi

| for all xi

, yi

2 <. Further-

more, the chain rule yields

��dg

i

(t)

dt

��=��P

n

j=1rj

u�1i

� ci

(tx + (1 � t)y) · (xj

� yj

)

��

��(u

�1i

)0(ci

(z(t)))

�� ·P

n

j=1

��rj

ci

(z(t))

�� · |xj

� yj

|,


where the function u�1i

� ci

(x) is equivalent to u�1i

(ci

(x)). If condition (7.56) holds, we

have for all x 2Q

i

[ai

, bi

],

Pn

j=1|rj

ci

(x)| < minz

i

|u0i

(zi

)| ��u

0i

✓u�1i

(ci

(x))

◆��

=1

|(u�1i

)0(ci

(x))|(7.58)

Therefore, there exists a positive ↵ < 1 such that

��dg

i

(t)

dt

�� ↵ maxj

|xj

� yj

| = ↵kx � yk1, 8 t 2 [0, 1], 8 i.

Therefore, we have shown that

kT (x) � T (y)k1 ↵kx � yk1, 8 x, y 2Q

i

[ai

, bi

],

and T is a contraction with modulus ↵ with respect to the maximum norm inQ

i

[ai

, bi

].

By Theorem 9, Algorithm 1 converges geometrically to x⇤. ut

Next, we consider the convergence condition when � < 1, which is much harder to

analyze in general. In many scenarios, the optimal solution x⇤ is in the interior of the boxQ

i

[ai

, bi

], i.e., ai

< x⇤i

< bi

for all i. If the x(p(t)) generated by (7.39) and (7.40) in each

iteration t is always in the interior ofQ

i

[ai

, bi

], we can rewrite (7.40) as xi

(pi

) := u�1i

(pi

),

and Algorithm 1 is equivalent to the following iteration:

xi

:= Ti

(x) = u�1i

✓(1 � �)u

i

(xi

) + �ci

(x)

◆, 8 i. (7.59)

The following result gives a convergence condition of Algorithm 1 when � < 1.


Proposition 11 Suppose ai

< x⇤i

< bi

for all i. Set xi

(pi

(0)) to be ai

for all i, or bi

for

all i. If we have 8><

>:

0 < � ✓

1 +@

x

i

x

i

C(x)

|U 00i

(xi

)|

◆�1

,

Pj 6=i

@x

j

x

i

C(x) 0,

(7.60)

for all x between x(p(0)) and x⇤, for all i, then T = (T1, . . . , T2) given by (7.59) is

a pseudo-contraction between x(p(0)) and x⇤, and the sequence {x(p(t))} generated by

Algorithm 1 converges to x⇤ geometrically.


: [0, 1] 7! < by

gi

(t) = u�1i

✓(1 � �)u

i

(zi

(t)) + �ci

(z(t))

◆, (7.61)

where z(t) = tx + (1 � t)x⇤.

Now suppose x < x⇤. Let us show that xi

< Ti

(x) x⇤i

for all i. We have

Ti

(x) � Ti

(x⇤) = gi

(1) � gi

(0) =

Z 1

0

dgi

(t)

dtdt, (7.62)

where dgi

(t)/dt is given by

dgi

(t)

dt= (u�1

i

)0✓

(1 � �)ui

(zi

(t)) + �ci

(z(t))

◆·✓

(1 � �)

u0i

(zi

(t))(xi

� x⇤i

) + �P

j

rj

ci

(z(t))(xj

� x⇤j

)

◆

=

✓u0i

(Ti

(z(t)))

◆�1

·✓

((1 � �)u0i

(zi

(t)) + �ri

ci

(z(t)))

(xi

� x⇤i

) + �P

j 6=i

rj

ci

(z(t))(xj

� x⇤j

)

◆(7.63)


Because Ui

is strictly concave, u0i

(x) < 0 for all x. Hence, under the condition (7.60), if

x < x⇤, we have Ti

(x) x⇤i

for all i, from which we immediately have

ui

✓Ti

(x)

◆= (1 � �)u

i

(xi

) + �ci

(x) � ui

(x⇤i

) = ci

(x⇤) > ci

(x),

which leads to ui

(xi

) > ci

(x) and thus

ui

(xi

) > (1 � �)ui

(xi

) + �ci

(x), 8 � > 0. (7.64)

Applying u�1i

(·) to both sides yields xi

< Ti

(x) for all i. Therefore, there exists an

↵ 2 [0, 1), which depends on �, ui

, ci

and x⇤, such that

|Ti

(x) � x⇤i

| ↵|xi

� x⇤i

|, 8 xi

2 [ai

, x⇤i

], 8 i, (7.65)

or equivalently, kT (x) � x⇤k1 ↵kx � x⇤k1, for all x between a = (a1, . . . , an

) and x⇤.

In other words, we have shown that T is a pseudo-contraction inQ

i

[ai

, x⇤i

].

Similarly, if x > x⇤, we can show that xi

> Ti

(x) � x⇤i

for all i, and T is a pseudo-

contraction inQ

i

[x⇤i

, bi

]. Thus, Algorithm 1 converges geometrically. ut

In many engineering problems, users may have a priori knowledge of (ai

, bi

) that

contains its optimal x⇤i

, which justifies the assumption that x⇤ resides in the interior ofQ

i

[ai

, bi

]. Furthermore, setting xi

(pi

(0)) to ai

(or bi

) is easy: the resource provider only

needs to set a very high (or very low) pi

(0) to push the user resource requirement xi

(pi

)

to its corresponding boundary value. Conditions (7.60) essentially ensure that x(p(t))

increases (or decreases) from a (or b) to x⇤ monotonically.


7.6.3 Convergence of Algorithm 2

The convergence of Algorithm 2 is much harder to analyze in general when there are

constraints ai

xi

bi

, since x(p) in (7.42) cannot be decoupled as projections onto

intervals [ai

, bi

]. In the following, we give su�cient convergence conditions of Algorithm

2 in the case of unconstrained optimization.

In this case, Algorithm 2 is rewritten as

8><

>:

pi

:= Ti

(p) = (1 � �)pi

+ �ui

✓xi

(p)

◆, 8 i,

x(p) := arg maxx

pTx � C(x)

(7.66)

Proposition 12 Suppose that problem (7.29) has no constraints and that the Hessian

matrix of C(x) is H(x) = [@x

j

x

i

C(x)]n⇥n

. Denote the inverse of H(x) as P (x) =

[H(x)]�1 = [Pij

(x)]n⇥n

.

If � = 1 and the following condition holds:

Pn

j=1|Pij

(x)| < |U 00i

(xi

)|�1, 8 x, 8 i, (7.67)

or if 0 < � (1 + Pii

(x) · |U 00i

(xi

)|)�1, for all x and i, and

Pj 6=i

|Pij

(x)| < Pii

(x) + |U 00i

(xi

)|�1, 8 x, 8 i, (7.68)

then T = (T1, . . . , Tn

) given by (7.66) is a contraction and the sequence {x(p(t))} gener-

ated by Algorithm 2 converges geometrically to the optimal solution to (7.29).


then there exists a constant ↵ 2 (0, 1) such that

��dg

i

(t)

dt

�� ↵ maxj

|pj

� qj

| = ↵kp � qk1, 8 t 2 [0, 1], 8 i,

or in other words, T is a contraction with modulus ↵.

We then derive rj

xi

(r) based on C(x). Applying the first-order conditions to the

definition of x(r) in (7.66), we have ri

� ci

(x(r)) = 0, for all i. Thus, the chain rule

yields

1 =@c

i

(x(r))

@ri

=nX

l=1

rl

ci

(x(r)) · @xl

(r)

@ri

, 8 i,

0 =@c

i

(x(r))

@rj

=nX

l=1

rl

ci

(x(r)) · @xl

(r)

@rj

, 8 i, j : i 6= j

Hence, we have [rj

xi

(r)]n⇥n

= [rj

ci

(x(r))]�1n⇥n

= [H(x(r))]�1, or in other words,

rj

xi

(r) = Pij

(x(r)). Substituting rj

xi

(r) into (7.71) and (7.72) leads to (7.67) and

(7.68), respectively. ut

We can also establish for Algorithm 2 a convergence condition similar to (7.60), which

can handle constrained optimization with x⇤ residing in the interior ofQ

i

[ai

, bi

] by making

x(p(t)) move to x⇤ always in the increasing or decreasing direction. On the other hand,

however, it turns out to be technically di�cult to establish a condition like (7.68) for

Algorithm 1 when � < 1.

7.6.4 Convergence of Gradient Methods for Dual Problem

There are many results on the convergence of gradient algorithms. For example, for

a su�ciently small constant stepsize �, the algorithm is guaranteed to converge to the


optimal value [20]. For comparison, here we derive a su�cient condition similar to (7.68),

under which the gradient algorithm applied to the dual problem, i.e., iteration (7.38), is

a contraction for the unconstrained version of problem (7.29).

Proposition 13 Suppose that problem (7.29) has no constraints and that the Hessian

matrix of C(x) is H(x) = [@x

j

x

i

C(x)]n⇥n

. Denote the inverse of H(x) as P (x) =

[H(x)]�1 = [Pij

(x)]n⇥n

.

If 0 < � (Pii

(x) + |U 00i

(xi

)|�1)�1, for all x and i, and

Pj 6=i

|Pij

(x)| < Pii

(x) + |U 00i

(xi

)|�1, 8 x, 8 i, (7.73)

then T = (T1, . . . , Tn

) given by (7.38) is a contraction and the sequence {x(p(t))} con-

verges geometrically to x⇤.

Proof: Ti

(p) in (7.38) can be written in the form Ti

(p) = pi

� �fi

(p), with fi

being

continuously di↵erentiable. By Proposition 1.11 of [21] (pp. 194), Ti

(p) is a contraction

if 8><

>:

0 < � 1/ri

fi

(p), 8 p, 8 i,

Pj 6=i

|rj

fi

(p)| < ri

fi

(p), 8 p, 8 i.

(7.74)

Similar to Proposition 12, we can derive rj

fi

(p) from the Hessian matrix of C(x) and

U(x). In particular,

ri

fi

(p) = ri

yi

(p) � ri

xi

(p) = Pii

(x(p)) �✓

U 00i

(xi

(p))

◆�1

rj

fi

(p) = rj

yi

(p) � rj

xi

(p) = Pij

(x(p)), 8 j 6= i.

Substituting the above into (7.74) and utilizing the fact U 00i

(xi

) < 0 will prove the

proposition. ut


We observe that (7.73) is exactly the same as the convergence condition (7.68) for

Algorithm 2 when � < 1, except for the choice of �. When constraints ai

xi

bi

are considered, the contraction property is hard to analyze in general, since in this case,

fi

(p) = yi

(p) � [u�1i

(pi

)]+i

involves projection and is not continuously di↵erentiable.

As a final note, when an update rule T is a contraction or pseudo-contraction with

respect to the maximum norm k · k1, asynchronous convergence of the corresponding

algorithm can be established using Proposition 2.1 in [21] (pp. 431). In other words, if

the su�cient conditions in this section hold, the corresponding algorithms also converge

asynchronously.

7.6.5 Quadratic Programming: Convergence Speed Compari-

son

We compare the convergence conditions and speeds of di↵erent algorithms using an ex-

ample of quadratic programming. A quadratic cost C and utilities Ui

simplify the con-

ditions derived in Sec. 7.6, allowing us to draw insights on the algorithm convergence

performance.

Suppose Ui

(xi

) = 12(�i

x2i

+ ⇢i

xi

+ ⌧i

) and C(x) = 12x

TAx, where �i

< 0 for all i, and

A = [aij

]n⇥n

is an n⇥n positive semi-definite symmetric matrix. Then the unconstrained

version of problem (7.29) becomes

maxx

1

2

✓Pn

i=1(�i

x2i

+ ⇢i

xi

+ ⌧i

) � xTAx

◆. (7.75)

It is easy to check that U 00i

(xi

) = �i

and @x

j

x

i

C(x) = aij

, for all x. Furthermore, the

Hessian matrix of C(x) is H(x) = A, and the inverse of H(x) is P (x) = [Pij

(x)]n⇥n

= A�1.


If we denote A�1 as [a0ij

]n⇥n

, then Pij

(x) = a0ij

, for all x.

According to Proposition 10 and Proposition 11, Algorithm 1 will converge to x⇤, if

� = 1 andP

n

j=1|aij

| < |�i

|, 8 i, or if

8><

>:

0 < � (1 + aii

/|�i

|)�1,

Pj 6=i

aij

0,8 i. (7.76)

According to Proposition 12, Algorithm 2 will converge to x⇤, if � = 1 andP

n

j=1 |a0ij

| <

|�i

|�1, 8 i, or if 8><

>:

0 < � (1 + a0ii

|�i

|)�1,

Pj 6=i

|a0ij

| < a0ii

+ |�i

|�1,

8 i. (7.77)

According to Proposition 13, the gradient algorithm applied to the dual decomposition

will converge to x⇤, if 8><

>:

0 < � (a0ii

+ |�i

|�1)�1,

Pj 6=i

|a0ij

| < a0ii

+ |�i

|�1,

8 i. (7.78)

Since the convergence speed of a contraction is governed by its modulus ↵, we now

derive ↵1(�), ↵2(�) and ↵D

(�) for Algorithm 1, Algorithm 2 and dual decomposition,

respectively, under di↵erent choices of parameter �.

If � = 1, from the proof of Proposition 10 and 12, we obtain

↵1(1) = maxi

Pn

j=1|aij

|/|�i

|, ↵2(1) = maxi

Pn

j=1|a0ij

| · |�i

|.

If � < 1, let us consider Algorithm 2 first. Substituting u0i

and rj

xi

into (7.70), it is


not hard to check that

↵2(�) = maxi

|1 � � + ��i

a0ii

| + �|�i

| ·P

j 6=i

|a0ij

|

� maxi

Pj 6=i

|a0ij

|/(a0ii

+ |�i

|�1), (7.79)

where the equality is achieved when a di↵erent �i

is allowed for each user i and �i

=

(1 + a0ii

|�i

|)�1.

Similarly, for dual decomposition, we have

↵D

(�) = maxi

|1 � �(|�i

|�1 + a0ii

)| + �|P

j 6=i

|a0ij

|

� maxi

Pj 6=i

|a0ij

|/(a0ii

+ |�i

|�1), (7.80)

where the equality is achieved when a di↵erent �i

is allowed for each user i and �i

=

(a0ii

+ |�i

|�1)�1.

Let us compare the convergence speeds of di↵erent algorithms. Note that the smaller

the modulus ↵ is, the faster the corresponding contraction converges. Algorithm 1 may

potentially achieve much faster convergence if |�i

| is big relative to |aij

|, which is often

the case in communication systems, where the service cost C is low due to multiplexing

or low unit cost for resources. On the contrary, if |�i

| is small relative to |aij

|, Algorithm

2 and gradient algorithms for the dual problem will converge faster. Therefore, when

gradient methods for dual decomposition fail to be a contraction (and thus a very small

stepsize must be chosen), we may resort to Algorithm 1 to achieve fast and asynchronous

convergence.

Another obvious strength of the new algorithms is their robustness and simplicity

in parameter selection. Under certain cases, choosing � = 1 su�ces to achieve fast


convergence for the new algorithms. Even if � < 1, it is also much easier to set in the

new algorithms. Let us compare Algorithm 2 with dual decomposition, both of which,

when � < 1, share the same su�cient convergence conditions (7.77) and (7.78), i.e.,

diagonal dominance for the inverse of Hessian matrix A�1 or a su�ciently large |�i

|�1.

They also share the same best convergence speed (7.79) and (7.80). However, for dual

decomposition to achieve its best convergence, � should be set around (a0ii

+ |�i

|�1)�1.

This requires one to have good estimates about the absolute values of a0ii

, |�i

|; otherwise,

� usually needs to be set to a small value to guarantee convergence, which, however,

leads to slow convergence.

In contrast, for Algorithm 2 to achieve its best convergence, � should be set to around

(1+a0ii

|�i

|)�1, which only depends on the relative values of a0ii

and |�i

| and is thus easier to

estimate. For example, if a0ii

|�i

|�1 for all i, then (1 + a0ii

|�i

|)�1 � 0.5, 8 i. Therefore,

setting � = 0.5 su�ces to achieve good performance if a0ii

and |�i

|�1 are close. Similarly,

for Algorithm 1, to choose a � < 1 that satisfies condition (7.76) is easy: if aii

|�i

|, for

all i, which is often the case when the cost is low, we can set � to 0.5.

7.6.6 Handling General Coupled Objectives

The proposed algorithms do not only solve (7.29). We now show how to use Algorithm

1 to approach a general form of optimization problems with coupled objective functions,

referred to as consensus optimization [26]:

minx

mX

i=0

Fi

(x) (7.81)

s.t. x 2 Pi

, i = 0, 1, ..., m,


where Fi

: <n 7! < are strictly convex functions, and Pi

are bounded polyhedral subsets

of <n. We assume at least F0 depends on all n coordinates of the vector x and P0 ⇢ Pi

.

Problem (7.81) not only models distributed resource allocation with arbitrary coupling

among utility functions, but also finds wide applications in machine learning problems

that involve a large number of training samples. Note that (7.29) is a special case of

(7.81) in that each Ui

only depends on the ith coordinate of x, while C plays the role of

F0.

We show that based on duality theory, consensus optimization (7.81) can be converted

into an equivalent form of (7.29). Introducing auxiliary variables xi

2 <n, (7.81) is

equivalent to

minx

F0(x) +mX

i=1

Fi

(xi

) (7.82)

s.t. x 2 P0, xi

= x, xi

2 Pi

, i = 1, ..., m.

It is easy to check that the dual problem of (7.82) is

maxp

q(p) = q0

✓mX

i=1

pi

◆+

mX

i=1

qi

(pi

) (7.83)

where pi

2 <n are the Lagrangian multipliers, and

q0

✓mX

i=1

pi

◆= min

x2P0

⇢F0(x) �

✓mX

i=1

pi

◆T

x

�, (7.84)

qi

(pi

) = minx

i

2Pi

{Fi

(xi

) + pTi

xi

}. (7.85)


The dual function q(p) is continuously di↵erentiable and

@q(p)

@pi

=@q0(p)

@pi

+@q

i

(pi

)

@pi

= �x(p) + xi

(pi

), (7.86)

where x(p) and xi

(pi

) are the unique minimizing vectors in (7.84) and (7.85), respectively.

A traditional gradient method would update p in the direction of its gradient rq(p).

Alternatively, we can apply Algorithm 1 to the dual problem (7.83) with p as the

variable by setting Ui

(pi

) = qi

(pi

) and C(p) = �q0(p), leading to the alternating iterations

for all i:

xi

:= (1 � �)xi

+ �ri

C(p) = (1 � �)xi

+ �x(p) (7.87)

pi

:= arg maxp

i

qi

(pi

) � pTi

xi

, (7.88)

where xi

is the price set for Fi

. We can further simplify the update (7.88) for pi

. We

assume P0 ⇢ Pi

for all i. Since x(p) 2 P0, from (7.87), we have xi

2 P0 ⇢ Pi

. Clearly, pi

is the solution to the system of equations @q

i

(pi

)@p

i

� xi

= 0, or equivalently, xi

(pi

)� xi

= 0.

When xi

(pi

) = xi

is in the interior of Pi

, from the first-order conditions of (7.85), we have

pi

= �@Fi

(xi

)

@xi

. (7.89)

Therefore, Algorithm 1, when applied to consensus optimization (7.81), is fully described

by the iterations (7.87) and (7.89), where pi

are variables, xi

are prices, F0 plays the role

of a “resource provider” and each Fi

is a “user”. Each “user” updates its variable pi

in

a distributed fashion based on the price xi

computed and sent by the “provider” until

convergence.

7.7. DISTRIBUTED SOLUTIONS TO CLOUD PRICING 168

Note that P0 ⇢ Pi

for all i is a condition to use Algorithm 1, since otherwise, xi

may

reside outside of Pi

and the gradient of qi

(pi

)� pTi

xi

, which is xi

(pi

)� xi

, cannot be zero.

In this case, (7.88) is not well defined. When P0 ⇢ Pi

, we can apply Algorithm 1 to the

unconstrained dual problem (7.83), which is equivalent to (7.81).

7.7 Distributed Solutions to Cloud Pricing

We now use the proposed methods to solve the large-scale cloud bandwidth resource

reservation and pricing problem (7.8) introduced in Sec. 7.2. We will continue to adopt

the notations used in Sec. 7.2. Recall that the social welfare maximization problem (7.8)

is

maxw

E[W (w)] = (1 � ✏)X

i

✓U

i

(wi

) � bwi

µi

◆� c

✓K(w)

◆(7.90)

s.t. 0 w 1, (7.91)

which is equivalent to the pricing problem (7.18). In this section, we propose a novel step-

size-free algorithm for price updates, based on the distributed optimization techniques

proposed in Sec. 7.5. We call the algorithm Equation Update, which does not rely on

decomposition at all: instead, it resorts to iterative equation updates based on the KKT

conditions (7.20). Our algorithm applies to a general convex cost function c(K(w)) (in

terms of w) under any service technology (e.g., multiplexing tenant demands in groups),

and can be executed asynchronously among all cloud tenants.


7.7.1 Equation Update

If we apply Algorithm 1 of Sec. 7.5 with � = 1 to problem (7.8), we obtain the following

Equation Update algorithm, which is based on alternated phases of price updates via

the cost-price relationship (7.20) and the tenant surplus maximization equation (7.16).

Recall that p is the uniform usage price and ki

is the reservation price for tenant i.

Equation Update. (Algorithm 1 for Problem (7.8).) Denote w(t) := [w(t)1 , . . . , w

(t)N

]T.

Set p ⌘ b. Set k(0)i

:= �(µi

+ ✓(✏)�i

) for all i and w(0) = 1. For t = 0, 1, . . ., repeat

(1) Distributed Surplus Maximization. Pass p and k(t)i

to each tenant i, which

returns:

w(t+1)i

:= arg max0w

i

1E[S

i

(wi

, p, k(t)i

)], 8 i. (7.92)

(2) Price Update. Set

k(t+1)i

:= c0✓

K(w(t+1))

◆· @K(w)

@wi

��w=w(t+1)

, 8 i. (7.93)

(3) If kw(t+1) � w(t)k ⇠, return w⇤ = w(t+1), k⇤i

= k(t+1)i

.

The above algorithm starts by setting k(0)i

to be the k⇤i

in the cost function without

multiplexing. It then updates ki

and wi

alternately by setting the current prices k(t)i

to

be the marginal reservation cost with the current w(t), and by collecting the next w(t+1)

from tenants who maximize their surpluses given the current prices. Applying the KKT

conditions to (7.16), step (1) can also be viewed as solving an equation

(1 � ✏)U0i

(w(t+1)i

) = p(1 � ✏)µi

+ k(t)i

, (7.94)


for w(t+1)i

. Since each tenant maximizes its expected surplus locally, the cloud provider

does not have to know the utility function of each tenant, leading to an iterative dis-

tributed solution.

Compared with Lagrangian dual decomposition based on consistency pricing [32, 65],

Equation Update represents a new way of handling coupled objectives. Since price up-

dates are based on equation (7.93) rather than on updating Lagrangian multipliers, the

algorithm is not concerned with the choice of step sizes that are required by gradi-

ent/subgradient methods.

We observe that the sequence {k(t)i

} produced by equations (7.93) and (7.94) could

demonstrate significantly di↵erent behavior under di↵erent initial values w(0)i

and dif-

ferent forms of utility and cost functions. In other words, our algorithm demonstrates

chaotic behavior, whose eventual outcome is sensitive to initial conditions and the struc-

ture of updating equations.

Our objective going forward is to analyze the behavior of {k(t)i

} and {w(t)} and find

out the conditions under which the algorithm can achieve fast convergence. To simplify

notations, we define

hi

(wi

) ⌘ (1 � ✏)

✓U

0i

(wi

) � bµi

◆, (7.95)

⌫i

(w) ⌘@c

✓K(w)

◆

@wi

. (7.96)

Recall that w⇤ := [w⇤1, . . . , w

⇤N

]T, where w⇤i

= wi

(p⇤, k⇤i

) is the guaranteed portion cho-

sen by tenant i under optimal pricing. By the definition of w⇤, the optimality condition


is

hi

(w⇤i

) = k⇤i

= ⌫i

(w⇤), 8 i. (7.97)

Since we have set p ⌘ b, (7.94) can be written as

hi

(w(t+1)i

) := (1 � ✏)

✓U

0i

(w(t+1)i

) � bµi

◆= k

(t)i

. (7.98)

Thus, the updating rules (7.93) and (7.94) in Equation Update can be rewritten as

hi

(w(t+1)i

) := k(t)i

:= ⌫i

(w(t)), 8 i. (7.99)

Let us illustrate the algorithm behavior using the special case of a single tenant.

There are three scenarios where the algorithm can produce dramatically di↵erent results,

as shown in Fig. 7.2. Since utility ui

(wi

) is strictly concave and monotonically increasing

in [0, 1], and cost c

✓K(w)

◆is strictly convex and monotonically increasing in [0, 1], we

have for all i:

8><

>:

hi

(wi

) > �bµi

, h0i

(wi

) < 0, 8 wi

2 [0, 1]

⌫i

(w) > 0, @⌫

i

(w)@w

i

> 0, 8 w : 0 � w � 1.

(7.100)

All three cases in Fig. 7.2 satisfy (7.100). In Fig. 7.2(a), {w(t)1 } always converges to w⇤

1 for

any initial value w(0)1 2 [0, 1], whereas in Fig. 7.2(b), {w

(t)1 } always diverges regardless of

its initial starting point. In Fig. 7.2(c), however, the behavior of {w(t)1 } critically depends

on its initial value w(0)1 . If w

(0)1 takes “initial value 1”, w

(t)1 will eventually hop between

two values alternately, without being able to approach w⇤1. On the other hand, if w

(0)1

takes “initial value 2”, w(t)1 will converge to w⇤

1.


h(wi) �(wi)

wiw(0)iw(1)

i w(2)iw(3)

i w(4)i

w�i

k(0)i

k(1)i

k(2)i

k(3)i

k�i

(a) Convergence to Global Optimality

h(wi) �(wi)

wiw(0)iw(1)

i w�i

k(0)i

k(1)i

k�i

(b) Divergence

h(wi) �(wi)

wiw�i

k�i

Initial Value 1

Initial Value 2

(c) 1: Periodic; 2: Converging.

Figure 7.2: Behavior of Equation Update in a one dimensional case. “o” represents thestarting point (w(0)

i

, k(0)i

).

7.8. PERFORMANCE EVALUATION 173

Directly applying Proposition 10 derived in Sec. 7.6 to problem (7.8), we obtain a

su�cient condition for the convergence of Equation Update in Theorem 14:

Theorem 14 If for each i = 1, . . . , N , we have

minx

i

(1 � ✏)|U 00i

(xi

)| >

NX

j=1

��

@2c

✓K(w)

◆

@wi

@wj

��, (7.101)

for all w between w(0) = 1 and w(1), then using Equation Update, k(t)i

converges to k⇤i

and w(t)i

converges to w⇤i

.

The economic implication behind Theorem 14 is that the algorithm will converge

when the marginal utility gain decreases faster than the marginal cost increases, as wi

increases. This technical assumption can be easily justified, since the marginal cost

c0(K) = � for adding network capacity (routers and switches) is decreasing at a fast pace

in our economy.

Moreover, we can also apply Algorithm 1 of Sec. 7.5 with � < 1 or Algorithm 2 of

Sec. 7.5 to problem (7.8), and obtain similar but di↵erent iterative distributed solutions

to Problem (7.8).

7.8 Performance Evaluation

In this section, we simulate a computerized cloud bandwidth reservation and trading

environment based on the proposed algorithms. We compare our proposed algorithms

with the gradient method, in terms of the convergence speed and optimization accuracy.

We assume that the reservation of egress network bandwidth from cloud data centers has

already been enabled by the detailed engineering techniques proposed in [19, 39, 71].


Suppose the random demands D1, . . . , DN

of N tenants are Gaussian with predictable

means µµµ = (µ1, . . . , µN

), standard deviations �� = (�1, . . . , �N

), and the covariance ma-

trix ⌃⌃⌃ = [�ij

]N⇥N

. We consider utility functions of the form (7.2). Under a Gaussian

approximation of Di

(which has been verified in Sec. 5.3.3), each tenant will have an

expected utility

E[Ui

(wi

)] = ↵i

wi

µi

� eAi

(1�w

i

)µi

+ 12A

2i

(1�w

i

)2�2i . (7.102)

The first term on the righthand side corresponds to the expected revenue of each tenant

made from serving the demand wi

Di

, while the second term models a reputation loss

which is convex and increasing in terms of the unfulfilled demand. Recall that under

Gaussian demands, the bandwidth reservation K is

K(w) = µµµTw + ✓(✏)pwT⌃⌃⌃w, (7.103)

where ✓(✏) = F�1(1 � ✏) is a constant, F being the CDF of normal distribution N (0, 1).

It is now clear that the social welfare maximization problem (7.8) is a convex problem

and can be solved using Algorithm 1 (or Equation Update), i.e., the cloud can charge

user i a reservation price ki

wi

to control the user choice of the guaranteed portion wi

towards expected social welfare maximization.

Our study is based on a video demand dataset from a real-world commercial VoD

system called UUSee. The data used here contains the bandwidth demands of N = 468

video channels, measured every 10 minutes, over an 810-minute period during the 2008

Olympics. We assume each video channel is a tenant that relies on the cloud for servicing

the video requests from its end-users. We input such demand traces to our pricing

framework and check the algorithm e�ciency in the challenging case that prediction and


optimization are to be carried out every 10 minutes. If the algorithms work for a 10-

minute frequency, they will be competent for lower operating frequencies, such as on an

hourly basis.

Assume the demand statistics µµµ, ⌃⌃⌃ of tenants remain the same within each 10-minute

period. We can estimate µµµ, ⌃⌃⌃ for the upcoming 10 minutes based on historical demands

using the time-series techniques presented in Chapters 3 and 4. Once estimates about

µµµ, ⌃⌃⌃ are obtained, we solve problem (7.8) for an optimal allocation w in this period.

After the 10-minute period has passed, the system proceeds to estimate µµµ, ⌃⌃⌃ for the

next 10 minutes and computes a new w from (7.8). Since the cloud does not know user

utility Ui

and the tenants do not know the multiplexed cost K, (7.8) must be solved in

a distributed fashion with the minimum message-passing overhead. Furthermore, (7.8)

must be solved within at most a few seconds in order not to delay the entire resource

reservation process performed every 10 minutes.

In our simulation, we set ↵i

= 1 and Ai

= 0.5. Since di↵erent tenants have di↵erent µi

and �i

, their utilities are heterogeneous. We set the marginal cost of allocating bandwidth

capacity to be � := c0(K) = 0.5, and assume that the cloud provider has an outage

probability of ✏ = 0.01.

We apply di↵erent algorithms to (7.8) for the 81 consecutive 10-minute periods. To

select algorithms, for the first period, we find that the condition (7.60) is satisfied, and

(7.56) is satisfied for most but not all i. We may thus tentatively believe Algorithm 1

is contracting when � = 0.5 and probably contracting when � = 1 (Equation Update).

However, conditions (7.67), (7.68), and (7.73) are not satisfied, since the Hessian of K(w)

is small due to multiplexing and its inverse is big. Thus, Algorithm 2 and gradient meth-

ods for dual decomposition are not contractions. We therefore choose to use Algorithm


2 10 20 40 60 80 1000

0.2

0.4

0.6

0.8

1

Convergence Iteration of the Last Tenant

Empi

rical

CD

F

Alg 1: � = 1Alg 1: � = 0.5Gradient: � = 0.01Gradient: � = 0.02Gradient Sync: � = 0.01Gradient Sync: � = 0.008

Figure 7.3: CDF of convergence iterations in all 81 10-minute periods.

1 instead of Algorithm 2. As a benchmark algorithm, gradient methods for the dual

problem, although not contracting, can still converge theoretically when stepsize � is

su�ciently small. Since in our particular problem, the cloud provider has superior com-

putation power, the delay is mainly due to the iterative message passing of prices and

guaranteed portions between tenants and the cloud.

Since the objective value of the expected social welfare SW(w) is unknown to the

cloud, we let an algorithm stop at iteration t if either kw(t) � w(t � 1)k1 < 10�2 or

t = 100. Since Algorithm 1 is a contraction, which even converges asynchronously, we

adopt an asynchronous stop criterion: if |wi

(t) � wi

(t � 1)| < 10�2, then tenant i stops

updating wi

. In contrast, since the gradient method is not a contraction, we evaluate

both synchronous and asynchronous stop criteria. We do not assume heterogeneous

message-passing delays, which may further strengthen the benefit of an asynchronous

algorithm.

Fig. 7.3 plots the CDF of convergence iterations needed in all 81 experiments, one

for each 10-minute test period. Algorithm 1 always converges within 10 iterations, when

� = 0.5. If � = 1, the convergence rate can be further speeded up to 5 iterations on


1700 1720 1740 1760 1780800

1000

1200

1400

1600

1800

Time Periods (unit: 10 minutes)

Fina

l Exp

ecte

d So

cial

Wel

fare

Alg 1: � = 1Alg 1: � = 0.5Gradient: � = 0.01Gradient: � = 0.02Gradient Synch: � = 0.01Gradient Synch: � = 0.008

Figure 7.4: Final objective values SW(x⇤) (expected social welfare) produced by di↵erentalgorithms.

average, although it does not converge in one single test period, due to violation of the

contraction assumption. In general, gradient methods need much longer time to converge,

and also face a dilemma in selecting the stepsize �. Since algorithms will stop when w

is changing by less than 10�2, we cannot set too small a �, in which case, the algorithm

will stop when t = 2 due to too small a change in the price vector. Fig. 7.3 shows that

gradient methods with all considered � produce erroneous outputs in many cases, since

they stop at t = 2. Increasing � reduces errors, but will lead to even longer convergence

time. Fig. 7.3 suggests that unlike Algorithm 1, no constant stepsize � for gradient

methods can achieve correct outputs and fast convergence at the same time for all 81

experiments.

A further check into Fig. 7.4 shows that the final objective SW(w⇤) produced by

Algorithm 1 is also higher than those of the gradient methods. Gradient methods with

asynchronous stop rules, although converged, may output wrong answers because they

are not contractions and in theory, should not be applied with asynchronous stop rules.

Gradient methods with synchronous stopping achieve slightly higher objective values if


100 101 102

10�2

10�1

100

Iteration t

Dist

ance

bet

wee

n x(

t) an

d x(

t�1)

Alg 1: � = 1Alg 1: � = 0.75Alg 1: � = 0.5Gradient: � = 0.01Gradient: � = 0.02Gradient Sync: � = 0.01

Figure 7.5: The evolution of kw(t) � w(t � 1)k1 in the first experiment.

not stopping at t = 2. However, since they do not converge within 100 iterations, they

are still inferior to Algorithm 1.

Let us zoom into the first 10-minute test period and see how convergence happens for

this experiment. From Fig. 7.5, we see that Algorithm 1 has an increasing convergence

speed as t increases. In contrast, gradient methods perform well in the first two iterations,

but its convergence slows down as the gradient becomes smaller. It is confirmed that

gradient methods are extremely sensitive to the stepsize �. Although it is possible to

vary the stepsize � dynamically as the optimization iterations proceed, the design of such

dynamic stepsizes is a non-trivial technical challenge.

Finally, the above analysis does not take into account the additional benefit of Algo-

rithm 1 regarding the computational complexity at each node. Both algorithms have the

same complexity at each user, which is a 1-dimensional surplus maximization. However,

in Algorithm 1, the provider only needs to compute a cost gradient rK(w) straight-

forwardly, whereas in gradient methods, the provider needs to solve a 468-dimensional

profit maximization problem.

7.9. SUMMARY 179

7.9 Summary

In this chapter, we propose a guaranteed cloud service model, where each tenant does

not have to estimate the absolute amount of bandwidth it needs to reserve—it simply

specifies a percentage of its demand from end-users that it wishes to serve with guaranteed

performance, which we call the guaranteed portion, while the rest of the demand will

be served with best e↵ort as the current cloud providers do. The cloud provider will

estimate tenant demands through workload analysis and guarantee the performance in a

probabilistic sense. The above process is repeated in small periods such as hours or tens

of minutes.

Our main contribution is to fairly price such guaranteed services in each period. In

contrast to the uniform usage pricing model, we price bandwidth reservations heteroge-

neously for tenants based on their workload statistics such as burstiness and correlation.

It turns out to be computationally challenging to find the optimal prices that maximize

the expected social welfare under demand uncertainty.

To address this challenge, we propose new algorithms for large-scale distributed opti-

mization with coupled objectives, and analyze the asynchronous algorithm convergence

conditions. Trace-driven simulations show that the proposed method can speed up con-

vergence by 5 times over gradient methods in the real-time cloud bandwidth resource

reservation and pricing problem with better robustness to parameter changes. It remains

open to explore the applicability of the new algorithms to problems with coupled con-

straints. Incentive issues (especially of Algorithm 2) with game theoretical modeling can

further open up new avenues of investigation to ensure that tenants follow the designed

protocol out of self-interest.

Chapter 8

Concluding Remarks

8.1 Conclusions

This thesis has focused on improving the Quality-of-Service (QoS) and reducing the

operational cost in large-scale multimedia delivery systems, such as Internet Video-on-

Demand (VoD) systems, especially in the context of emerging cloud-based services. The

thesis spans the areas of cloud computing, multimedia delivery, and economics. Our

contribution is a coherent cloud management framework for multimedia delivery systems

that involves statistical learning, demand forecast, optimized resource allocation and load

direction, as well as algorithmic resource pricing as interdependent components, mainly

targeting large-scale VoD systems. Our study has been backed up by data mining and

the analytics of real-world resource usage traces collected from a production Internet

VoD company named UUSee.

We first identify that automated demand forecast will be the key to intelligent and

e�cient workload management in an age of cloud computing. We propose a series of

statistical learning and time-series analysis techniques to forecast the expected video

180

8.2. FUTURE DIRECTIONS 181

bandwidth tra�c demand in the short-term future, as well as to estimate the demand

volatility, i.e., how confident we can be about such expected demand forecasts. We

also studied the complexity and dimension reduction issues related to such statistical

learning, in the presence of a large number of video channels. Furthermore, based on

automated demand forecast, we propose predictive resource reservation schemes together

with optimized load direction schemes in a multi-data center network, analyzing the

most cost-e↵ective solutions both in theory and in practice. Finally, we investigate the

bandwidth resource pricing issues and related business models for VoD providers to use

the cloud services, with two approaches adopted. In the first approach, we use game

theory to characterize the economic properties of the bandwidth reservation market when

multiple VoD providers rely on multiple cloud providers for services. In the second

approach, we formulate the bandwidth pricing as a distributed optimization problem,

for which we propose new distributed algorithms other than the traditional gradient

algorithm, to ensure fast and robust convergence in very large systems. In addition, the

second approach also turns out to bring new technical contributions to distributed and

parallel optimization techniques in large-scale systems.

8.2 Future Directions

Cloud workload management continues to be an exciting area of research, by con-

sidering even more diverse application requirements and by extending the scope from

request routing among data centers to VM placement within data centers. As has been

noted, virtual machines (VMs) can be placed to optimize application performance, by

spreading workload evenly over the underlying topology, or to save power consumption,

by consolidating workload onto as few servers as possible. As the number of VMs per


server increases, the power e�ciency increases linearly, whereas performance degrades

in a nonlinear fashion due to VM interference. What is the sweet spot in performing

workload consolidation that balances performance and power e�ciency? Given diverse

application requirements on di↵erent resources that may or may not be correlated, we

can mix VMs based on workload statistics balancing performance and cost. Machine

learning and demand prediction based on monitoring will be useful to serve this purpose.

However, unlike user-oriented web applications, the resource and power consumption of

many cloud applications may not follow a regular pattern. More advanced learning tech-

niques such as those based on dynamic neural networks can be developed for workload

analysis.

Tra�c management in the cloud. In today’s multi-tenant data centers, band-

width sharing leads to unpredictable network performance. However, the good news is

that unlike in the Internet, cloud operators can control not only tra�c routing but also

the physical positions to place the source-destination pairs, which are essentially pairs of

VMs in data centers. This leads to a new problem of joint VM placement and inter-VM

tra�c routing, so as to reduce the network burden in the cloud and to optimize application

performance. Furthermore, in a data center, VMs often need to be migrated dynamically

due to application requirements and the need for performance optimization or workload

consolidation. Dynamic VM placement introduces a multi-period network optimization

problem that requires statistical demand predictions as input, with VM repositioning

cost as an optimization constraint. Competitive online algorithms and approximations

need to be developed to solve this type of problem. From a systems perspective, a sta-

tistical analysis and optimization framework can be integrated into current data centers

to diagnose tra�c and guide management decisions.


Pricing in the cloud. Cloud computing has brought about a new business model

of resource leasing. Currently, Amazon EC2 mainly o↵ers three types of computing

instances: on-demand, spot and reserved instances. What is the advantage of having

spot and reserved instances? What is the impact of the cloud service menu with its

pricing on both tenants and the cloud? I hypothesize that more diversified instance

types would suit the flavors of more tenants, bringing a win-win situation to both cloud

providers and tenants. Two related questions need to be answered: (1) facing di↵erent

types of computing instances together with their prices, how each tenant, with her unique

utility and patience characteristics, should choose her instance type, and how much to

bid if she chooses the spot instance; and (2) how the cloud provider should price di↵erent

instances to maximize revenue. By studying these questions, we will be able to analyze

the aggregate social welfare under di↵erent markets and pricing policies. The good news is

that a dataset of AWS spot and on-demand instances pricing history since 2009 is publicly

available. We can use statistical learning to model and advise tenant strategies, infer

resource availability, optimize cloud pricing policies, and study the interaction between

tenant and cloud strategies.

Derivatives in cloud computing. Cloud pricing largely remains in a primitive

stage. A natural expansion of the cloud service menu is to introduce derivatives to the

computing market, so that tenants can hedge against the uncertainty of future prices for

computing resources. In this sense, the reserved instance of AWS cloud is essentially a

derivative product. As a popular derivative in stock markets, a futures contract involves

selling goods for a price agreed today with delivery occurring at a specified future date.

As another example, an option will give a buyer the right, but not the obligation, to

buy goods at a reference price in the future. Viewing computing resources as goods


and tenants as buyers, many derivatives and new pricing structures can be introduced,

modified and adapted to the cloud to benefit both tenants and cloud providers. On the

other hand, automated workload analysis and demand prediction based on statistical

learning need to be developed for companies that rely on cloud services to take the right

investment positions in the computing market.

Bibliography

[1] Amazon Cluster Compute, 2011. http://aws.amazon.com/ec2/hpc- applications/.

[2] Amazon Web Services. http://aws.amazon.com/.

[3] Hulu. http://www.hulu.com/.

[4] Justintv. http://www.justin.tv/.

[5] Metacafe. http://www.metacafe.com/.

[6] Netflix. http://www.netflix.com/.

[7] OnLive. http://www.onlive.com/.

[8] PPLive. http://www.pplive.com.

[9] UUSee Inc. [Online]. Available: http://www.uusee.com.

[10] Youtube. http://www.youtube.com/.

[11] Zimory Cloud Computing. http://www.zimory.com/.

[12] Four Reasons We Choose Amazon’s Cloud as Our Computing Platform. The Netflix

“Tech” Blog, Dec. 14 2010.

185

BIBLIOGRAPHY 186

[13] V. Adhikari, Y. Guo, F. Hao, M. Varvello, V. Hilt, M. Steiner, and Z.-L. Zhang.

Unreeling netflix: Understanding and improving multi-cdn movie delivery. In Proc.

of IEEE INFOCOM, Orlando, Florida, 2012.

[14] V. Aggarwal, X. Chen, V. Gopalakrishnan, R. Jana, K. K. Ramakrishnan, and V. A.

Vaishampayan. Exploiting Virtualization for Delivering Cloud-based IPTV Services.

In Proc. of IEEE INFOCOM Workshop on Cloud Computing, 2011.

[15] S. Annapureddy, S. Guha, C. Gkantsidis, D. Gunawardena, and P. Rodriguez. Ex-

ploring VoD in P2P Swarming Systems. In Proc. INFOCOM’07, pages 2571–2575,

May 2007.

[16] D. Applegate, A. Archer, V. Gopalakrishnan, S. Lee, and K. K. Ramakrishnan. Op-

timal Content Placement for a Large-Scale VoD System. In Proc. ACM International

Conference on Emerging Networking Experiments and Technologies (CoNEXT),

2010.

[17] M. Armbrust, A. Fox, R. Gri�th, A. D. Joseph, R. Katz, A. Konwinski, G. Lee,

D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia. A View of Cloud Computing.

Communications of the ACM, 53(4):50–58, 2010.

[18] P. Bahl, R. Chandra, A. Greenberg, S. Kandula, D. A. Maltz, and M. Zhang. To-

wards Highly Reliable Enterprise Network Services Via Inference of Multi-level De-

pendencies. In Proc. of SIGCOMM’07, Kyoto, Japan, 2007.

[19] H. Ballani, P. Costa, T. Karagiannis, and A. Rowstron. Towards Predictable Data-

center Networks. In Proc. of SIGCOMM’11, Toronto, ON, Canada, 2011.

BIBLIOGRAPHY 187

[20] D. P. Bertsekas, A. Nedic, and A. E. Ozdaglar. Convex Analysis and Optimization.

Athena Scientific, 2003.

[21] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numer-

ical Methods. Prentice Hall, 1989.

[22] C. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[23] N. Bobro↵, A. Kochut, and K. Beaty. Dynamic Placement of Virtual Machines for

Managing SLA Violations. In Proc. 10th IFIP/IEEE International Symposium on

Integrated Network Management, 2007.

[24] T. Bollerslev. Generalized Autoregressive Conditional Heteroskedasticity. Journal

of Econometrics, 31:307–327, 1986.

[25] G. E. P. Box, G. M. Jenkins, and G. C. Reinsel. Time Series Analysis: Forecasting

and Control. Wiley, 2008.

[26] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed Optimiza-

tion and Statistical Learning via the Alternating Direction Method of Multipliers.

Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[27] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press,

2004.

[28] P. J. Brockwell and R. A. Davis. Introduction to Time Series and Forecasting.

Springer, 2002.

BIBLIOGRAPHY 188

[29] R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, and I. Brandic. Cloud Computing

and Emerging IT Platforms: Vision, Hype, and Reality for Delivering Computing

as the 5th Utility. Future Generation Computer Systems, 25(6):599–616, 2009.

[30] X. Chen, M. Zhang, Z. Morley Mao, and P. Bahl. Automating Network Application

Dependency Discovery: Experiences, Limitations, and New Solutions. In Proc. of

OSDI ’08, 2008.

[31] B. Cheng, X. Liu, Z. Zhang, H. Jin, L. Stein, and X. Liao. Evaluation and Optimiza-

tion of a Peer-to-Peer Video-on-Demand System. J. Syst. Archit., 54(7):651–663,

Jul. 2008.

[32] M. Chiang, S. H. Low, A. R. Calderbank, and J. C. Doyle. Layering as Optimization

Decomposition: A Mathematical Theory of Network Architectures. Proceedings of

the IEEE, 95(1):255–312, January 2007.

[33] W. Enders. Applied Econometric Time Series. Wiley, Hoboken, NJ, third edition,

2010.

[34] V. Gabale, P. Dutta, R. Kokku, and S. Kalyanaraman. InSite: QoE-aware Video

Delivery From Cloud Data Centers. In Proc. of International Workshop on Quality

of Service (IWQoS), 2012.

[35] P. Gill, M. Arlitt, Z. Li, and A. Mahanti. YouTube Tra�c Characterization: A View

From the Edge. In Proc. of ACM SIGCOMM Internet Measurement Conference

(IMC), San Diego, USA, October 2007.

BIBLIOGRAPHY 189

[36] D. Gmach, J. Rolia, L. Cherkasova, and A. Kemper. Workload Analysis and De-

mand Prediction of Enterprise Data Center Applications. In Proc. of IEEE Symp.

Workload Characterization, 2007.

[37] Z. Gong, X. Gu, and J. Wilkes. PRESS: PRedictive Elastic ReSource Scaling for

Cloud Systems. In Proc. of IEEE International Conference on Network and Services

Management (CNSM), 2010.

[38] R. L. Grossman. The Case for Cloud Computing. IT Professional, 11(2):23–27,

March-April 2009.

[39] C. Guo, G. Lu, H. J. Wang, S. Yang, C. Kong, P. Sun, W. Wu, and Y. Zhang.

SecondNet: a Data Center Network Virtualization Architecture with Bandwidth

Guarantees. In Proc. of ACM International Conference on Emerging Networking

Experiments and Technologies (CoNEXT), 2010.

[40] G. Gursun, M. Crovella, and I. Matta. Describing and Forecasting Video Access

Patterns. In Proc. of IEEE INFOCOM Mini-Conference, 2011.

[41] C. Huang, J. Li, and K. W. Ross. Can Internet Video-on-Demand be Profitable? In

Proc. of SIGCOMM’07, Kyoto, Japan, August 2007.

[42] Y. Huang, T. Z. J. Fu, D.-M. Chiu, J. C. S. Lui, and C. Huang. Challenges, Design

and Analysis of a Large-scale P2P-VoD System. In Proc. of SIGCOMM’08, Seattle,

Washington, August 2008.

[43] R. Johari and J. N. Tsitsiklis. E�ciency Loss in a Network Resource Allocation

Game. Mathematics of Operations Research, 29(3):407–435, August 2004.

BIBLIOGRAPHY 190

[44] F. P. Kelly. Charging and Rate Control for Elastic Tra�c. European Transactions

on Telecommunications, 8:33–37, 1997.

[45] D. Kossmann, T. Kraska, and S. Loesing. An Evaluation of Alternative Architectures

for Transaction Processing in the Cloud. In Proc. of International Conference on

Management of Data (SIGMOD), 2010.

[46] D. Kusic, J. O. Kephart, J. E. Hanson, N. Kandasamy, and G. Jiang. Power and

Performance Management of Virtualized Computing Environments via Lookahead

Control. Cluster computing, 12(1):1–15, March 2009.

[47] B. Li, S.-S. Xie, G. Y. Keung, J.-C. Liu, I. Stoica, H. Zhang, and X.-Y. Zhang. An

Empirical Study of the Coolstreaming+ System. IEEE Journal on Selected Areas

in Communications, Special Issue on Advances in Peer-to-Peer Streaming System,

25(9):1627–1639, December 2007.

[48] M. Lin, A. Wierman, L. L. H. Andrew, and E. Thereska. Dynamic Right-Sizing for

Power-Proportional Data Centers. In Proc. of IEEE INFOCOM, 2011.

[49] X. Lin, N. B. Shro↵, and R. Srikant. A Tutorial on Cross-Layer Optimization in

Wireless Networks. IEEE Journal on Selected Areas in Communications, 24(8):1452–

1463, August 2006.

[50] Z. Liu, C. Wu, B. Li, and S. Zhao. UUSee: Large-Scale Operational On-Demand

Streaming with Random Network Coding. In Proc. of IEEE INFOCOM, 2010.

[51] L. Ljung. System Identification: Theory for the User. Prentice Hall PTR, 2 edition,

1999.

BIBLIOGRAPHY 191

[52] J.-G. Luo, Q. Zhang, Y. Tang, and S.-Q. Yang. A Trace-Driven Approach to Eval-

uate the Scalability of P2P-Based Video-on-Demand Service. IEEE Trans. Parallel

Distrib. Syst., 20(1):59–70, Jan. 2009.

[53] A. Mahimkar, Z. Ge, A. Shaikh, J. Wang, J. Yates, Y. Zhang, and Q. Zhao. To-

wards Automated Performance Diagnosis in a Large IPTV Network. In Proc. of

SIGCOMM’09, Barcelona, Spain, 2009.

[54] A. McNeil, R. Frey, and P. Embrechts. Quantitative Risk Management: Concepts

Techniques and Tools. Princeton University Press, 2005.

[55] D. Niu, C. Feng, and B. Li. A Theory of Cloud Bandwidth Pricing for Video-on-

Demand Providers. In Proc. of IEEE INFOCOM, 2012.

[56] D. Niu, C. Feng, and B. Li. Pricing Cloud Bandwidth Reservations under Demand

Uncertainty. In Proc. of ACM SIGMETRICS, 2012.

[57] D. Niu and B. Li. An E�cient Distributed Algorithm for Resource Allocation in

Large-Scale Coupled Systems. In Proc. of IEEE INFOCOM, 2013.

[58] D. Niu, B. Li, and S. Zhao. Understanding Demand Volatility in Large VoD Sys-

tems. In Proc. of the 21st International workshop on Network and Operating Systems

Support for Digital Audio and Video (NOSSDAV), 2011.

[59] D. Niu, Z. Liu, B. Li, and S. Zhao. Demand Forecast and Performance Prediction

in Peer-Assisted On-Demand Streaming Systems. In Proc. of IEEE INFOCOM

Mini-Conference, 2011.

BIBLIOGRAPHY 192

[60] D. Niu, H. Xu, B. Li, and S. Zhao. Quality-Assured Cloud Bandwidth Auto-Scaling

for Video-on-Demand Applications. In Proc. of IEEE INFOCOM, Orlando, Florida,

2012.

[61] D. P. Palomar and M. Chiang. A Tutorial on Decomposition Methods for Net-

work Utility Maximization. IEEE Journal on Selected Areas in Communications,

24(8):1439–1451, August 2006.

[62] J. C. Panzar and D. S. Sibley. Public Utility Pricing under Risk: The Case of

Self-Rationing. The American Economic Review, 68(5):888–895, Dec. 1978.

[63] J. Schad, J. Dittrich, and J.-A. Quiane-Ruiz. Runtime Measurements in the Cloud:

Observing, Analyzing, and Reducing Variance. In Proc. of VLDB, 2010.

[64] Y. Song, M. Zafer, and K.-W. Lee. Optimal Bidding in Spot Instance Market. In

Proc. of IEEE INFOCOM, 2012.

[65] C. W. Tan, D. P. Palomar, and M. Chiang. Distributed Optimization of Coupled Sys-

tems With Applications to Network Utility Maximization. In Proc. of IEEE Inter-

national Conference on Acoustics, Speech and Signal Processing, Toulouse, France,

2006.

[66] C. Tang, M. Steinder, M. Spreitzer, and G. Pacifici. A Scalable Application Place-

ment Controller for Enterprise Data Centers. In Proc. ACM WWW, 2007.

[67] B. Urgaonkar, P. Shenoy, and T. Roscoe. Resource Overbooking and Application

Profiling in Shared Hosting Platforms. In Proc. of USENIX Symposium on Operating

Systems Design and Implementation (OSDI), 2002.

BIBLIOGRAPHY 193

[68] M. Wang, X. Meng, and L. Zhang. Consolidating Virtual Machines with Dynamic

Bandwidth Demand in Data Centers. In Proc. of IEEE INFOCOM 2011 Mini-

Conference, 2011.

[69] C. Wilson, H. Ballani, T. Karagiannis, and A. Rowstron. Better Never Than Late:

Meeting Deadlines in Datacenter Networks . In Proc. of SIGCOMM, 2011.

[70] C. Wu, B. Li, and S. Zhao. Multi-Channel Live P2P Streaming: Refocusing on

Servers. In Proc. of IEEE INFOCOM, 2008.

[71] D. Xie, N. Ding, Y. C. Hu, and R. Kompella. The Only Constant is Change:

Incorporating Time-Varying Network Reservations in Data Centers. In Proc. of

ACM SIGCOMM, 2012.

[72] H. Yin, X. Liu, F. Qiu, N. Xia, C. Lin, H. Zhang, V. Sekar, and G. Min. Inside the

Bird’s Nest: Measurements of Large-Scale Live VoD from the 2008 Olympics. In

Proc. ACM Internet Measurement Conference (IMC), 2009.

[73] M. Zaharia, A. Konwinski, A. D. Joseph, Y. Katz, and I. Stoica. Improving MapRe-

duce Performance in Heterogeneous Environments. In Proc. of OSDI, 2008.

[74] H. Zhang, B. Li, H. Jiang, F. Liu, A. V. Vasilakos, and J. Liu. A Framework for

Truthful Online Auctions in Cloud Computing with Heterogeneous User Demands.

In Proc. of IEEE INFOCOM, 2013.

[75] X. Zhang, J. Liu, B. Li, and T.-S. P. Yum. CoolStreaming/DONet: A Data-Driven

Overlay Network for E�cient Live Media Streaming. In Proc. of IEEE INFOCOM,

2005.

demand forecast, resource allocation and pricing for multimedia

Documents