
ORL-SDN: Online Reinforcement Learning for SDN-Enabled HTTP Adaptive Streaming

ABDELHAK BENTALEB, National University of Singapore, Singapore

ALI C. BEGEN, Ozyegin University & Networked Media, Turkey

ROGER ZIMMERMANN, National University of Singapore, Singapore

In designing an HTTP adaptive streaming (HAS) system, the bitrate adaptation scheme in the player is a key component to ensure a good quality of experience (QoE) for viewers. We propose a new online reinforcement learning optimization framework, called ORL-SDN, targeting HAS players running in a software-defined networking (SDN) environment. We leverage SDN to facilitate the orchestration of the adaptation schemes for a set of HAS players. To reach a good level of QoE fairness in a large population of players, we cluster them based on a perceptual quality index. We formulate the adaptation process as a Partially Observable Markov Decision Process and solve the per-cluster optimization problem using an online Q-learning technique that leverages model predictive control and parallelism via aggregation to avoid a per-cluster suboptimal selection and to accelerate the convergence to an optimum. This framework achieves maximum long-term revenue by selecting the optimal representation for each cluster under time-varying network conditions. The results show that ORL-SDN delivers substantial improvements in viewer QoE, presentation quality stability, fairness, and bandwidth utilization over well-known adaptation schemes.

CCS Concepts: • Multimedia information systems → Multimedia streaming;

Additional Key Words and Phrases: HAS, SDN, reinforcement learning, QoE optimization, POMDP, HAS scalability issues, fastMPC

ACM Reference format:

Abdelhak Bentaleb, Ali C. Begen, and Roger Zimmermann. 2018. ORL-SDN: Online Reinforcement Learning for SDN-Enabled HTTP Adaptive Streaming. ACM Trans. Multimedia Comput. Commun. Appl. 14, 3, Article 71 (August 2018), 28 pages. https://doi.org/10.1145/3219752

This research was supported in part by the National Natural Science Foundation of China under Grant No. 61472266 and by the National University of Singapore (Suzhou) Research Institute, 377 Lin Quan Street, Suzhou Industrial Park, Jiang Su, People's Republic of China, 215123.

Authors' addresses: A. Bentaleb, Media lab 1, AS6, NUS School of Computing, Computing 1, 13 Computing Drive, Singapore 117417; email: [email protected]; A. C. Begen, Cekmekoy Campus, Nisantepe District, Orman Street, Istanbul, Turkey 34794; email: [email protected]; R. Zimmermann, NUS School of Computing, Computing 1, 13 Computing Drive, Singapore 117417; email: [email protected].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

© 2018 ACM 1551-6857/2018/08-ART71 $15.00
https://doi.org/10.1145/3219752



1 INTRODUCTION

HTTP adaptive streaming (HAS) has become the most popular technology for video streaming over the Internet, due to its benefits compared with traditional video streaming solutions (e.g., RTP/RTSP [35]) such as flexible dynamic bitrate adaptation to network conditions and HTTP-based chunk delivery that simplifies the traversal through NATs and firewalls [63]. In HAS, each player uses its own bitrate adaptation logic to dynamically adjust to network conditions by selecting an appropriate representation for the next chunk. However, when multiple HAS players exist in a shared environment and compete for the available resources (e.g., in an edge network where multiple homes are connected), purely client-driven bitrate adaptation introduces certain problems [1]. Each player strives to maximize its own quality [21] by trying to grab an unfair share of the available bandwidth. This selfish behavior leads to scalability issues that have been widely reported [1, 27, 34], including presentation quality instability, unfairness, and network resource underutilization [27]. These issues may severely impact viewer QoE, which will result in revenue loss for over-the-top (OTT) video and service providers. Hence, a more elegant network resource allocation and management architecture is needed to address the growing demands of HAS systems [59].

Scalability of HAS is a serious concern when the number of HAS players increases, with a lack of a central coordinator [41] that provides a global view of the network, efficient network resource management, or QoE optimization [17]. The recent Software-Defined Networking (SDN) paradigm [31] simplifies the decoupling of the control from the data plane. In SDN, a central component (SDN controller) is aware of all flows and routes each one via a specific path through the network. Given its software-driven flexibility, SDN provides a central management approach to assist HAS players with their bitrate adaptation schemes and alleviates (or even eliminates) scalability issues. In this setup, reinforcement learning (RL) [58] presents an attractive solution. An RL agent learns about the dynamic environment through highly uncertain trial-and-error interactions. At each state, when taking an action, the agent receives revenue from the environment as feedback. The main aim of an agent is to maximize the discounted cumulative revenue by learning the optimal actions.

We have investigated the aforementioned issues that arise when multiple HAS players compete for the available bandwidth in a last-mile shared network with a single bottleneck link. We propose a Q-learning-based optimization framework for SDN-based HAS systems termed ORL-SDN, which is an extension of our prior architecture SDNDASH [3]. Our framework depends on SDN and Network Function Virtualization (NFV) for intelligent HAS traffic-aware delivery. It optimizes the long-term revenue (i.e., QoE) of HAS players under varying network conditions with significant uncertainty, by suggesting optimal representations to the HAS players. We assume that the network bandwidth variations and congestion levels for the next few steps (several seconds) vary following a Markovian model [57], which is unknown to the SDN-based HAS architecture entities. Our method can be considered a network-assisted solution with both bitrate assistance and bandwidth allocation functionalities. It uses SDN capabilities for (1) assisting the HAS players in their bitrate decisions and dynamically allocating respective bandwidth slices, and (2) collecting various application and network information (e.g., players' statuses, QoS states, congestion levels, network resource states). Furthermore, it is also involved in configuring per-cluster QoS policies in the network. The framework contributions in this article are fourfold:

(1) We employ the notion of a logical cluster-based network topology by grouping players based on the Structural Similarity Index Plus (SSIMplus) perceptual quality [13, 49]. We selected SSIMplus for its superior performance over other perceptual quality indexes [13]. We use SSIMplus to map three distinctive features, namely, device display resolution (DR), content type (CT), and subscription plan type (SPT), into a common space. DR can be considered a surrogate for various device capabilities, enabling the model to encompass heterogeneous devices. A small number of static clusters enables ORL-SDN to manage large-scale deployments while ensuring rapid learning convergence.

(2) We formulate the multiplayer bitrate decision as a Partially Observable Markov Decision Process (POMDP). Our model is flexible to accommodate different state variables and parameters (e.g., bandwidth, congestion, QoE, buffer level, DR, CT, SPT) that are estimated by our SDN-based external application without any explicit communication. Thus, no extra overhead is introduced into the network. Further, while the POMDP model is defined to be highly descriptive, it uses a small number of states.

(3) We propose a Q-learning-based online algorithm to solve the POMDP model. Our algorithm leverages a state aggregation process (SAP) [15], parallel postdecision state (PDS) [37, 47], and fast model predictive control (fastMPC) [62, 64] methods that help ORL-SDN to estimate global state variables for the subsequent steps and avert high computational cost and suboptimal solutions. This speeds up the convergence to the optimal solution in real time.

(4) We present detailed experimental results confirming that our approach outperforms other state-of-the-art optimization- and heuristic-based bitrate adaptation schemes.

The remainder of this article is organized as follows. Section 2 presents the related studies on HAS bitrate adaptation schemes. Section 3 describes the ORL-SDN optimizer for an SDN-enabled HAS architecture. A performance evaluation and comparative analysis are presented in Section 4. Finally, Section 5 concludes the article and outlines future directions.

2 RELATED WORK

We present the most relevant research that has focused on (1) improving QoE and addressing scalability issues and (2) leveraging recent technologies such as SDN and RL. We categorize these bitrate adaptation schemes as either optimization based or heuristics based.

Optimization-based adaptation schemes aim to enhance viewer QoE by incorporating bitrate adaptation within a mathematical model (e.g., Markov decision process (MDP), convex optimization) and/or leveraging a centralized entity such as an SDN controller or a proxy to assist the players in their bitrate selection. Chiariotti et al. [6] proposed an online RL-based controller for Dynamic Adaptive Streaming over HTTP (DASH) [56] players. The authors model the bitrate decisions as an MDP problem and leverage the RL approach to learn the system dynamics. Similarly, Claeys et al. [7] designed the FAQ-learning approach, an online Q-learning-based bitrate adaptation. It aims to learn the optimal actions by considering the estimated available bandwidth and buffer occupancy without any a priori knowledge of network dynamics. FAQ-learning ensures a quick convergence to the optimal solution during the learning phase due to its initial Q-value state-action estimation algorithm. Zhou et al. [65] developed mDASH, which formulates the bitrate adaptation logic as an MDP optimization problem that is solved by a low-complexity greedy algorithm. To support multiple HAS players, FESTIVE [27], PANDA [34], SDNDASH [3], Jiang et al. [26, 28], and the SAND architecture [59] were proposed. FESTIVE consists of three main components including a chunk scheduler, a bitrate selection, and a harmonic mean bandwidth estimator, while PANDA uses a probing mechanism to estimate the available bandwidth and allows players to quickly respond to bandwidth fluctuations. In ORL-SDN, we used PANDA as a bandwidth estimator because of its ability to eliminate bandwidth overestimations [1] under highly variable network conditions. For each downloading step μ_m while fetching a chunk, PANDA uses a periodic probe mechanism that consists of three stages: estimating, smoothing, and quantizing the bandwidth share. Recently, Bentaleb et al. [3] proposed SDNDASH, an SDN-based bitrate decision, dynamic resource allocation, and management architecture for HAS systems. It benefits from SDN centralized capabilities to assist, manage, and allocate bandwidth for each player individually based on its QoE. In a similar manner, other studies [2, 18, 30, 40, 46] incorporated SDN controller capabilities into HAS and DASH systems to improve video delivery. Furthermore, other methods [4, 26, 28, 59] benefited from network traffic analysis and network-centric mechanisms to improve QoE.
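For concreteness, the three-stage probe loop attributed to PANDA above can be sketched as follows. This is a simplified illustration only: the function and parameter names (kappa, w, alpha_s, safety) and the exact update rule are our own placeholders, not the precise formulation from [34].

```python
# Simplified sketch of a PANDA-style probe loop (estimate -> smooth -> quantize).
# The update rule and the parameter names are illustrative placeholders only.

def panda_step(x_target, y_smooth, x_measured, bitrates,
               kappa=0.14, w=0.3, alpha_s=0.2, safety=0.9, dt=1.0):
    """One probing step: returns (new target, new smoothed estimate, chosen bitrate)."""
    # 1) Estimating: additively increase the bandwidth-share target, backing off
    #    when the previous target overshot the measured throughput.
    x_target = x_target + kappa * dt * (w - max(0.0, x_target - x_measured + w))
    # 2) Smoothing: exponentially weighted moving average to damp oscillations.
    y_smooth = alpha_s * x_target + (1.0 - alpha_s) * y_smooth
    # 3) Quantizing: highest available bitrate below the (safety-scaled) estimate.
    feasible = [r for r in sorted(bitrates) if r <= safety * y_smooth]
    chosen = feasible[-1] if feasible else min(bitrates)
    return x_target, y_smooth, chosen

# Toy usage: bandwidth-share estimation over a few chunk downloads (Mbps).
bitrates = [0.5, 1.0, 2.0, 4.0, 8.0]
x_t, y_s = 2.0, 2.0
for measured in [2.5, 2.4, 1.2, 1.1, 3.0]:
    x_t, y_s, level = panda_step(x_t, y_s, measured, bitrates)
    print(f"target={x_t:.2f} smoothed={y_s:.2f} -> bitrate {level}")
```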


Heuristic-based adaptation schemes try to adapt to bandwidth variations by switching to an appropriate bitrate according to one or many heuristics such as available bandwidth, buffer occupancy, or both. The authors of QDASH [39] developed a QoE-based system that acts as a proxy between DASH players and the server. It ensures a gradual transition between bitrate levels by using an intermediate level during the adaptation process instead of abruptly switching up or down. Huang et al. [22] proposed BBA, which uses the buffer occupancy variation as a bitrate adaptation metric without considering network resources. SARA [29] is a chunk-aware bitrate adaptation scheme that uses the current available bandwidth, buffer occupancy, and next-chunk byte-size during the adaptation process.

To the best of our knowledge, ORL-SDN is the first SDN-enabled HAS architecture considering a last-mile network with a single bottleneck that models HAS scalability issues as an online RL-based optimization problem when multiple heterogeneous players compete for available bandwidth in a shared (e.g., edge) network. Earlier HAS RL-based schemes either (1) do not scale well and incur large overhead costs, (2) consider just a one-player scenario, or (3) are based on a fixed, specific network configuration, and thus exhibit low flexibility for dynamic environments. Moreover, we designed our solution to be modular such that it can expand to multiple bottleneck situations where a root SDN controller [42] is required to manage the coordination and control of multiple ORL-SDN optimizers residing at each last-mile network.

3 ORL-SDN OPTIMIZER

We first describe the SDN-enabled HAS architecture with ORL-SDN modules, and then we show the ORL-SDN design overview, which is inspired by an RL approach [58] that uses a set of agents to take actions in the environment. During the exploration phase, each RL agent gathers information (e.g., network conditions and player status) and takes an action (representation selection) at the current state. The goal is to maximize the cumulative long-term revenue, i.e., the viewer QoE. ORL-SDN benefits from RL solving capabilities, allowing it to accurately learn the system dynamics (e.g., bandwidth fluctuations), to optimize extremely large-scale player populations with low complexity cost (e.g., overhead, time), and to quickly converge to the optimal solution, unlike dynamic programming mechanisms. Second, we introduce a specific data structure to represent each HAS player's QoE requirements and its per-cluster generalization, which are referred to as per-player and per-cluster QoE policies, respectively. Thereafter, we present the POMDP multiplayer bandwidth competition model with its revenue optimization function and an online RL Q-learning based solver. We note that ORL-SDN incorporates the SDN optimization language (SOL) library [20]. The insight for using Q-learning is that it is able to decide the best action based on trial and error without any prior knowledge of the system dynamics. Finally, we conclude this section by describing how the HAS players interact with different entities of the architecture shown in Figure 1. The key notations and definitions used throughout the article are listed in Table 1.

3.1 Architecture Description

As depicted in Figure 1, ORL-SDN represents one component in the SDNDASH [3] architecture, which consists of three main layers (application, control, and network layers) and six core entities within those layers including the SDN-based external application, HAS players, the HAS server, the RYU SDN controller [51], the SDN-based internal application, and OpenFlow-enabled forwarding devices. Below, the architecture is explained at a high level; for more details about the functionalities of each entity, refer to [3].

3.1.1 Application Layer. This layer provides a set of functionalities that assist HAS players in their bitrate decisions and enable better network resource management. It is composed of three entities as follows:


Fig. 1. The proposed SDN-enabled HAS architecture with ORL-SDN classes.

Fig. 2. Our dash.js-based player. Modifications are indicated using gray boxes.

• The HAS players with various device capabilities (e.g., display resolution, CPU speed, and memory capacity) such as smartphones, tablets, and connected TVs request different video content types (e.g., animation, documentary, movie, news, or sports) from the HAS server or a nearby cache according to a subscription plan (e.g., platinum, gold, or silver). To fully support our architecture, existing players (e.g., dash.js [10]) require minor modifications via the addition of two classes into their bitrate adaptation logic as shown in Figure 2: (1) the HTTP Header Recommendation Reader to read (i.e., via the libcurl API1) the HTTP header to obtain bitrate decisions provided by the SDN-based external application and incorporate those decision values into the adaptation algorithm, and (2) the Logger (Console.log()) to store at each chunk downloading step the player status consisting of different metrics such as buffer occupancy, selected bitrate, QoE value, number of bitrate switching events, and perceptual quality oscillation value. Moreover, we implemented a number of well-known bitrate adaptation schemes for comparison.

• The HAS server is a plain HTTP server that stores video content in segmented form together with the manifest files. It serves the chunks that are requested by the HAS players. For each corresponding video, we also add quality information in the form of the SSIMplus-based [49] per-chunk perceptual quality values into the existing or separate manifest files.

1 Available at https://curl.haxx.se/libcurl/.


Table 1. List of Key Symbols and Notations

Notation             Definition
DR, CT, SPT          Device resolution, content type, subscription plan type
P                    Set of HAS players (p denotes a single HAS player)
Step                 Chunk downloading boundary (or duration)
Cluster              A subset of P that shares a similar SSIMplus-based MAP model
Cl                   Set of created clusters (Cl_k denotes a specific cluster)
N                    Total number of HAS players
N_{Cl_k}             Total number of HAS players in a specific cluster Cl_k
p_i, p_n             A HAS player in the network p_i ∈ P, and a HAS player in a specific cluster p_n ∈ Cl_k
Cl_SSIMplus          SSIMplus-based clustering criterion
μ_m                  Chunk downloading step (m denotes a chunk/step ID)
K                    Total number of chunks (i.e., total steps)
G_Cl                 Set of learner agents for Cl
S_Cl                 Set of finite/discrete cluster states for Cl
A_Cl                 Set of finite/discrete cluster actions for Cl
T_Cl                 Set of cluster probability transitions for Cl
O_Cl                 Set of finite cluster observations for Cl
R_Cl                 Set of immediate cluster revenues for Cl
z                    Time horizon of the problem
γ                    RL discount factor
α                    RL learning rate
L^{CT,DR}            List of bitrate levels (l_m denotes a specific bitrate level of chunk m)
QT^{CT,DR}           List of SSIMplus-based perceptual qualities
qt_m(l_m)            Bitrate level with its corresponding quality of chunk m
τ, T                 Chunk and video durations in seconds
AvgQT, SD            Average perceptual quality and startup delay
SE, AvgQS            Stall events and perceptual quality oscillations
r_{p_n}              The step revenue of player p_n (= QoE_{p_n})
r_{Cl_k}             The step revenue of cluster Cl_k (= AvgQoE_{Cl_k})
r^{total}_{Cl_k}     The long-term revenue of cluster Cl_k
υ(.)                 The RL value function
Q_{Cl_k}(s, a)       The RL Q-table of cluster Cl_k (Q(s_m, a_m) denotes a specific Q-value)
a*_{Cl_k}            The optimal action (i.e., bitrate decision) for cluster Cl_k

• The SDN-based external application functions as a proxy between the HAS players and the server, and all communications between these entities pass through this application. Three core functions are contained: (1) observing and managing all the HAS players and keeping track of their statuses and mobility (e.g., joining or leaving the network); (2) dividing the HAS players into a set of virtual static (i.e., fixed number) clusters considering DR, CT, and SPT features; and (3) finding the optimal bitrate decision and bandwidth slice allocation for each cluster. Players that belong to the same cluster receive similar decision recommendations. Note that the generated overhead relates to three aspects: (1) The interaction between the external application and HAS players is performed via the HTTP headers of the requests and chunks, and thus it needs few additional bytes; (2) the external application collects small-sized data (e.g., MPD) from the HAS server through DPI, and thus the considered overhead is negligible; and (3) the overhead produced by the sampling and inspection mechanisms2 is slight [24] thanks to suitable parameter values selected based on prior work [5, 23] for the packet inspection and sampling rate, which ensure a high accuracy without affecting performance. Thus, our solution scales well. The external application provides two main components, namely, the module package and the ORL-SDN optimizer. Each component is composed of different classes that are listed in Table 2.

3.1.2 Control Layer. This layer represents the central element of the network that defines the network resource control, monitoring, allocation, and slicing services. It interacts with the application and network layers to simplify the network resource abstraction (i.e., QoE-to-QoS policy mapping) and installs network resource allocation policies (i.e., QoS policies) via a well-defined RESTful [66] (northbound) interface and an OpenFlow v1.3 [38] (southbound) interface, respectively. The control layer is composed of two entities, namely, the SDN RYU Controller [51] and the Internal SDN-based Resource Manager. The former represents the SDN controller that allows optimal network resource management/allocation and smart HAS streaming by allocating a bandwidth slice for each cluster after receiving per-cluster QoS policies3 from the SDN-based external application, while the latter represents the in-house SDN RYU application that implements the core functionalities of network resource monitoring and allocation. The SDN RYU application (1) uses the received per-cluster QoS policies to dynamically allocate and provision a suitable bandwidth slice for each corresponding cluster, which is then equally divided between the players of this cluster respecting the Jain fairness index [25], and (2) generates different OpenFlow messages4 to install, add, modify, and remove the per-cluster allocation rules in the forwarding devices of the network layer.

2 Available at http://blog.sflow.com/2009/05/measurement-overhead.html, http://blog.sflow.com/2009/05/scalability-and-accuracy-of-packet.html, http://blog.sflow.com/2009/06/sampling-rates.html.
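As a concrete illustration of the per-cluster slice division just described, the following sketch splits a cluster's bandwidth slice equally among its players and checks the result with the Jain fairness index. The function names and the example slice value are hypothetical, not code from the RYU application.

```python
# Minimal sketch: divide a cluster's bandwidth slice equally among its players
# and verify the allocation with the Jain fairness index. Names are illustrative.

def divide_slice(slice_mbps, players):
    """Equal share of the cluster slice for every player in the cluster."""
    share = slice_mbps / len(players)
    return {p: share for p in players}

def jain_index(allocations):
    """Jain fairness index: (sum x)^2 / (n * sum x^2); 1.0 means perfectly fair."""
    xs = list(allocations.values())
    return sum(xs) ** 2 / (len(xs) * sum(x * x for x in xs))

cluster_players = ["p1", "p2", "p3", "p4"]
alloc = divide_slice(20.0, cluster_players)   # e.g., a 20 Mbps slice for this cluster
print(alloc, jain_index(alloc))               # equal shares -> Jain index = 1.0
```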


Table 2. SDN-Based External Application Components with Their Classes

(Class Name — Implementation Aspect: Functionality)

Module Package
  Packet Monitoring Engine — Deep packet inspection (DPI) [52]: Inspects packets from HAS players and the server; collects data (e.g., player status, manifest files, chunks); differentiates between HAS and cross-traffic (intertraffic management) via the packet sampling sFlow [24]; uses BlindBox [54] for encrypted traffic
  MANAGER Module — (no external library): Manages, organizes, and stores different interaction events between the components of this application
  Bandwidth Estimator — PANDA [34]: Uses a probe-based mechanism to predict the network available bandwidth accurately and without introducing additional overhead; computes the bandwidth required for HAS and cross-traffic
  MAP Tables — (no external library): Defines the relations between SSIMplus-based perceptual quality, bitrate level, DR, CT, and SPT (see Figure 3)
  QoE Measurement — QMON, ISAAR [14], and Dimopoulos et al. [12]: Measures and estimates the viewer QoE of every HAS player; records the QoE values
  LOG & Network/Server Statistics — data structures/analysis API (pandas), visualization API (Matplotlib) [48]: Periodically records and visualizes network, cluster, and player statuses

ORL-SDN Optimizer
  Logical Network Topology — SSIMplus-based mapping model using SQM [55]: Groups the set of HAS players into a set of virtual clusters; creates a specific data structure (i.e., QoE policy) for each cluster
  POMDP Environment — POMDP: Formulates the per-cluster bitrate decision problem as a POMDP (see Figure 5)
  Q-Learning-Based Solution — RL (Q-learning) [58], SOL [20], SAP [15], PDS [47], and fastMPC [62]: Solves the POMDP problem; outputs the optimal per-cluster bitrate decisions
  HAS Players Recommendations — libcurl: Manages and records the resulting bitrate decisions from the solver; prepares to rewrite the HTTP headers of each chunk request

3.1.3 Network Layer. This layer represents the OpenFlow-enabled forwarding devices that carry the network traffic. It supports meter table installation for network resource allocation and provisioning based on allocation rules that are created by the internal SDN-based application. In our architecture, we used Lagopus5 and CPqD6 vSwitches during the emulation-based experiments.

3 The QoS policy encapsulates the optimal required amount of bandwidth that maximizes the viewer QoE of each cluster player.
4 Available at https://goo.gl/gczPDc.
5 Available at http://www.lagopus.org/.
6 Available at http://cpqd.github.io/ofsoftswitch13/.


Fig. 3. SSIMplus-based mapping.

3.2 Identifying Clusters for the HAS Players

We require an efficient mapping that embeds three distinctive features, namely, DR (e.g., 240p, 360p, 480p, 720p, and 1080p), CT (e.g., animation, documentary, movie, news, and sports), and SPT (normal, bronze, silver, gold, and platinum) into a common space and allows us to identify a set of clusters. The key insights behind the selection of these three features are as follows: (1) SPT embodies a business model that may be employed by content providers to differentiate service among their subscribers. Often, a customer who pays a higher fee receives a premium service; i.e., a platinum plan includes more network resources (〈SPT ↔ bandwidth〉 map; bitrate in Figure 3). Using SPT simplifies the integration of ORL-SDN into real-world systems where such differentiations exist, and it also helps to easily deploy our solution with the DASH SAND architecture [59]. (2) CT represents different content types with various encoding bitrate levels, resolutions, and perceptual qualities. We include CT to allow more realistic test scenarios where HAS players can request various types of content. (3) DR represents the device resolution of an HAS player. DR allows our model to support heterogeneous device capabilities that naturally exist (e.g., smartphones, tablets, connected TVs). As shown in Figure 3, the clustering algorithm creates five clusters that mostly map to the five service levels. The intuition is that at each service level a customer is provided with a certain amount of resources (i.e., bandwidth). Overall these factors enable ORL-SDN to provide QoE fairness. Recent studies [17, 26, 57] have shown that these features are the most critical in providing a high QoE for adaptive video delivery systems. However, our model can easily be extended to other factors, if so desired by service providers.

We used SSIMWave's Video QoE Monitor (SQM) software [55] that implements the SSIMplus Index [13, 49] library with its functions and characteristics (i.e., it provides a close estimation of a video's perceptual quality by mimicking the human visual system and considering content features including resolution, temporal and spatial complexity, and device screen options like size, resolution, and brightness) to perform many per-chunk full-reference perceptual quality measurements7 on five reference source video files with different content types and resolutions including animation (Big Buck Bunny), documentary (Of Forests and Men), movie (Tears of Steel), news, and sports (Red Bull Playstreets). The original YUV and processed video files were obtained from the DASH dataset [33]. We note that for fair comparison, brevity, and simplicity, the encoded bitrate levels, chunk durations (4s), and video durations (600s) are similar for all content types, and thus the total number of chunks is the same as the total number of steps (K = 600/4 = 150). Moreover, in the experiments in Section 4, we used the DASH dataset, where the minimum total duration of videos is 150 steps, allowing us to show consolidated results with the same axis ranges.

7 The per-chunk SSIMplus value is the total average of the per-frame values (e.g., a 4s chunk duration → 100 (25 × 4) per-frame SSIMplus values).

Our SSIMplus-based mapping model is shown in Figure 3 and is derived from the offline perceptual quality measurements that we conducted using the SQM software [55], which takes into consideration all possible values (5^3 = 125 combinations) of DR, SPT, and CT. Then, given three feature values, each player uses Equation (1) to identify the most suitable cluster to join. In Figure 3, the top plot confirms that the correlation between bitrate level and video quality is nonlinear and the perceptual quality is different from one video to another at the same bitrate. The bottom 3D plot depicts the set of nonoverlapping clusters derived from our SSIMplus-based per-chunk perceptual quality measurements considering DR, CT, and SPT features. We used the SSIMplus-based mapping function (MAP_SSIMplus(.)) to select the appropriate SSIMplus value from the 3D plot; hence, the appropriate cluster for the corresponding player can be easily determined. The set of clusters is defined as Cl = {Cl_{L=1}, Cl_{M=2}, Cl_{H=3}, Cl_{VH=4}, Cl_{E=5}}, identified as Low, Moderate, High, Very High, and Extreme, where the total number of clusters is five with k ∈ [1..5] (i.e., ∈ [L..E]) identifying a specific cluster. The set of all possible combinations of the three features is grouped into five clusters based on SSIMplus taking into account DR, CT, and SPT.8 Thus, the SSIMplus-based clustering criterion Cl_SSIMplus is represented as follows:

    Cl_{SSIMplus} = MAP_{SSIMplus}(Cl_p^{DR}, Cl_p^{CT}, Cl_p^{SPT}),    (1)

where Cl_p^{DR}, Cl_p^{CT}, and Cl_p^{SPT} are the three features of device resolution, content type, and subscription plan type, and p identifies an HAS player. We note that at any time, a player can modify the values of each feature and the cluster assignment (Equation (1)) is executed again to find the most appropriate one (Figure 3).
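A minimal sketch of the cluster assignment in Equation (1): a lookup table built from offline SSIMplus measurements maps a player's (DR, CT, SPT) triple to a cluster label. The table entries and thresholds below are invented placeholders, not the measured values behind Figure 3.

```python
# Sketch of MAP_SSIMplus(.) as a table lookup followed by thresholding into the
# five clusters (Low ... Extreme). SSIMplus scores and thresholds are made up;
# the real mapping is derived from the offline SQM measurements.

SSIMPLUS_TABLE = {
    # (DR, CT, SPT) -> offline per-chunk SSIMplus score (placeholder values)
    ("1080p", "sports",      "platinum"): 94.0,
    ("720p",  "movie",       "gold"):     88.0,
    ("480p",  "news",        "silver"):   80.0,
    ("360p",  "documentary", "bronze"):   71.0,
    ("240p",  "animation",   "normal"):   60.0,
}

CLUSTER_THRESHOLDS = [(90, "Extreme"), (85, "Very High"), (75, "High"),
                      (65, "Moderate"), (0, "Low")]

def map_ssimplus(dr, ct, spt):
    """Return the cluster a player with features (DR, CT, SPT) should join."""
    score = SSIMPLUS_TABLE.get((dr, ct, spt), 60.0)   # default to a low score
    for threshold, cluster in CLUSTER_THRESHOLDS:
        if score >= threshold:
            return cluster

print(map_ssimplus("720p", "movie", "gold"))   # -> "Very High"
```

Because a player can change any of the three feature values at runtime, the same lookup is simply executed again to reassign the player, as noted above.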

To summarize, ORL-SDN is based on four principles: (1) It leverages the flexible mathematical model of SSIMplus to map three distinctive features into one common parameter space. (2) It optimizes on a per-cluster instead of a per-player basis. Thus, the cluster members receive a similar bandwidth slice and network resource allocation. (3) A nonlinear relationship between bitrate levels and their corresponding perceptual qualities is confirmed and used. Finally, (4) perceptual quality manifest files for each video type are used by the players during bitrate adaptation. As a result, our method accommodates a large number of HAS players while generating a low number of clusters, which reduces the optimization complexity and runtime. We note that the number of generated clusters is bounded by the set of considered clustering features and their parameters. Further, our architecture allows an implementer to choose other factors to perform the clustering, if desired.

3.3 Per-Cluster QoE and QoS Policies

This module manages a specific data structure (see Figure 4) that describes the current QoE-to-QoS mapping for every step μ_m, where m = 1, 2, ..., K, with K being the total number of chunks. It uses a simple grouping process that aggregates the QoE requirements (i.e., QoE values with their metrics as described in Section 3.4) of the HAS players belonging to the same cluster into a common structure, termed the per-cluster QoE policy. Thus, the aggregation helps to simplify the processing and computation in ORL-SDN. Each per-cluster QoE policy will be used as an input in ORL-SDN and can be described as follows:

    At each step μ_m:  Cl_{k,QoE-policy} = Agg(p_i, QoE_{p_i}),  ∀ p_i ∈ Cl_k, Cl_k ∈ Cl.    (2)
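The aggregation Agg(.) in Equation (2) can be sketched as a simple grouping step; the dictionary layout and field names below are illustrative assumptions, not the framework's actual data structure.

```python
# Sketch of Agg(.) from Equation (2): group the per-player QoE reports of one
# cluster into a single per-cluster QoE policy. Field names are placeholders.

def aggregate_qoe_policy(cluster_id, player_reports):
    """player_reports: {player_id: {"qoe": float, "buffer": float, "bitrate": float}}"""
    n = len(player_reports)
    return {
        "cluster":    cluster_id,
        "avg_qoe":    sum(r["qoe"] for r in player_reports.values()) / n,
        "avg_buffer": sum(r["buffer"] for r in player_reports.values()) / n,
        "bitrates":   sorted({r["bitrate"] for r in player_reports.values()}),
    }

reports = {"p1": {"qoe": 4.1, "buffer": 18.0, "bitrate": 4.0},
           "p2": {"qoe": 3.8, "buffer": 22.0, "bitrate": 4.0}}
print(aggregate_qoe_policy("Cl_H", reports))
```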

8 The number of generated clusters is generally small; it is given by a value known as the Bell number, which grows exponentially with the number of features and their possible values.


Fig. 4. An example mapping of per-cluster QoE policies to per-cluster QoS policies.

As depicted in Figure 4, the per-cluster QoE policy is translated into QoS-based network resource requirements for the corresponding cluster (i.e., the optimal per-cluster slice of bandwidth) using a QoE↔QoS mapping function. This mapping represents an exponential correlation that is derived based on objective and subjective QoE measurements from the QoE-to-QoS quantitative relationship model proposed by Fiedler et al. [16] and is defined for each HAS player p_n as follows:

    QoE_{p_n} ↔ QoS_{p_n} = c_1 \times e^{(-c_2 \times \Omega \times BW)} + c_3,    (3)

where BW is the bandwidth (QoS metric), c_{1,2} are network layer factors that are defined based on service differentiation (i.e., SPT), c_3 is the video codec parameter, and Ω is a constant that is assigned based on the network access type (e.g., WiFi, 3G/4G).
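To make the mapping concrete, the sketch below evaluates Equation (3) and inverts it to obtain the bandwidth needed to reach a target QoE. The constants c_1, c_2, c_3 and Ω are arbitrary example values, not the calibrated parameters used by ORL-SDN; c_1 is chosen negative here simply so that QoE grows with bandwidth.

```python
import math

# Sketch of the exponential QoE<->QoS mapping of Equation (3):
#   QoE = c1 * exp(-c2 * Omega * BW) + c3
# and its inverse, which yields the bandwidth needed to hit a target QoE.
# All constants below are arbitrary illustrative values.

def qoe_from_bandwidth(bw_mbps, c1=-4.0, c2=0.4, c3=5.0, omega=1.0):
    return c1 * math.exp(-c2 * omega * bw_mbps) + c3

def bandwidth_for_qoe(target_qoe, c1=-4.0, c2=0.4, c3=5.0, omega=1.0):
    # Invert Equation (3): BW = -ln((QoE - c3) / c1) / (c2 * Omega)
    return -math.log((target_qoe - c3) / c1) / (c2 * omega)

bw = bandwidth_for_qoe(4.5)           # bandwidth slice needed for QoE ~ 4.5
print(bw, qoe_from_bandwidth(bw))     # round-trip check
```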

3.4 System Model

We formally define the bitrate decision problem when multiple HAS players compete for the available bandwidth at a single bottleneck link as a POMDP. Our reasons for selecting a POMDP are that (1) it fits our multiagent decision problem, unlike MDPs that support only single agents, and (2) its observation and historical information capabilities are not considered in other mathematical models (e.g., MDP). Thus, partial information of the system state can be offered that exactly fits the network dynamics in HAS systems that exhibit high uncertainty (e.g., sudden bandwidth fluctuations). At the cluster level, the POMDP model is defined by the 10-tuple POMDP_{HAS} = {Cl, G_{Cl}, S_{Cl}, A_{Cl}, T_{Cl}, O_{Cl}, R_{Cl}, z, γ, α}, where

• Cl = {Cl_1, Cl_2, ..., Cl_k} is the set of created clusters.
• G_{Cl} = {g_{Cl_1}, g_{Cl_2}, ..., g_{Cl_k}} is the set of cluster agents (also called learner agents); each agent g_{Cl_k} is responsible for finding the optimal cluster action for the corresponding cluster Cl_k.
• S_{Cl} = {S_{Cl_1}, S_{Cl_2}, ..., S_{Cl_k}} is the set of finite and discrete cluster states. For each cluster Cl_k, the set of cluster states is S_{Cl_k} = {s_1, s_2, ..., s_K}. For each cluster Cl_k, at each step μ_m, we define the cluster-aggregated9 state as s_{μ_m} = ⟨s_{p_1}, s_{p_2}, ..., s_{p_n}⟩, ∀ p_n ∈ Cl_k.
• A_{Cl} = {A_{Cl_1}, A_{Cl_2}, ..., A_{Cl_k}} is the finite and discrete set of cluster actions (i.e., bitrate levels with their corresponding qualities for the cluster Cl_k at every step μ_m, qt_{μ_m,Cl_k}(l_{μ_m,Cl_k})). For each cluster Cl_k, the set of cluster actions is A_{Cl_k} = {a_1, a_2, ..., a_K}. At each chunk downloading step μ_m, the agent g_{Cl_k} ∈ G_{Cl} selects a cluster-aggregated action a_{(μ_m, g_{Cl_k})} = ⟨a_{p_1}, a_{p_2}, ..., a_{p_n}⟩ ∈ A_{Cl_k} for the HAS players belonging to the same cluster ∀ p_n ∈ Cl_k.
• T_{Cl} represents the cluster probability transition function P(s'_{Cl_k} | s_{Cl_k}, a_{g_{Cl_k}}) from the cluster-aggregated state s_{Cl_k} to a next state s'_{Cl_k} when the cluster-aggregated action a_{g_{Cl_k}} is taken.
• O_{Cl} = {O_{Cl_1}, O_{Cl_2}, ..., O_{Cl_k}} is the finite set of cluster observations that are captured by the set of agents. For each cluster Cl_k, the set of observations is O_{Cl_k} = {o_1, o_2, ..., o_K} with cluster-aggregated observations at step μ_m, o_{(μ_m, g_{Cl_k})} = ⟨o_{p_1}, o_{p_2}, ..., o_{p_n}⟩ ∈ O_{Cl_k} for players belonging to the same cluster ∀ p_n ∈ Cl_k.10 Such observations can be the estimation of the available bandwidth and the player status.
• R_{Cl} = {R_{Cl_1}, R_{Cl_2}, ..., R_{Cl_k}} is the set of immediate cluster revenues, which depend on the environment-aggregated states and actions taken by the agents. For each cluster Cl_k, the agent g_{Cl_k} captures the immediate revenue R_{Cl_k} = R(s_{Cl_k}, a_{g_{Cl_k}}) at state s_{Cl_k} and takes action a_{g_{Cl_k}}, which is defined as R_{Cl_k} = {r_1, r_2, ..., r_K} with aggregated immediate revenue r_{(μ_m, g_{Cl_k})} = ⟨r_{p_1}, r_{p_2}, ..., r_{p_n}⟩ ∈ R_{Cl_k} for players belonging to the same cluster ∀ p_n ∈ Cl_k.
• z represents the time horizon of the problem that is defined by fastMPC for a given number of future next states; z = 3 in this article.
• γ and α ∈ [0, 1] are the system discount factor and learning rate, respectively.

9 We use the notion of aggregated, which denotes the aggregation of a quantity X over all players belonging to the cluster.
10 We use p_n for the players belonging to a cluster and p_i for all HAS players in the network.

Table 3. The POMDP Environment State Variables

Global variable      Description
BW^all               Total bandwidth at the bottleneck link (fixed)
BW^e                 Estimated available bandwidth (dynamic)
BW^HAS               HAS traffic (dynamic)
BW^bt                Cross-traffic (dynamic)
C                    Network congestion level (dynamic)

Local variable       Description
Aggbuff_{min,Cl_k}   Set of aggregated minimum buffer size thresholds for players in cluster k
Aggbuff_{max,Cl_k}   Set of aggregated maximum buffer size thresholds for players in cluster k
Aggbuff_{Cl_k}       Set of aggregated buffer sizes for players in cluster k
AggQoE^{metrics}_{Cl_k}   Set of aggregated QoE and its metric values for players in cluster k
Cl^CT_k              Set of CT requested by players in cluster k
Cl^DR_k              Set of DR of players in cluster k

To solve this POMDP model, when a particular cluster state is observed, ORL-SDN finds for each cluster the optimal policy π : S → A that maps the states to actions, such that the expected revenue is maximized by considering the immediate revenue and the future-discounted long-term revenues. The following are the element definitions for the POMDP model:

(1) HAS Players: Each player p_i ∈ P, P = {p_1, p_2, ..., p_N} selects a video from the set of video content Cl^CT_{p_i} stored at the HAS server. A video consists of T seconds of media and is segmented into K chunks of fixed length τ = T/K each. A chunk is encoded at different bitrate levels l_m ∈ L with corresponding SSIMplus-based perceptual quality qt_m ∈ QT. At each step μ_m, HAS player p_i ∈ P selects a chunk from the corresponding list of bitrate levels L with their corresponding perceptual qualities QT, which can be written as follows:

    L^{CT,DR}_{μ_m,Cl_k} = {l^1_{μ_m,Cl_k}, ..., l^m_{μ_m,Cl_k}, ..., l^K_{μ_m,Cl_k}},   QT^{CT,DR}_{μ_m,Cl_k} = {qt^1_{μ_m,Cl_k}, ..., qt^m_{μ_m,Cl_k}, ..., qt^K_{μ_m,Cl_k}}.    (4)

We note that qt_{μ_m,Cl_k}(l_{μ_m,Cl_k}) represents the bitrate-level-to-quality mapping function for cluster Cl_k. Furthermore, players may join or leave the network at any time and request a variety of content types as our model is dynamic and flexible in support of different scenarios, including player churn (see Section 4.4).

(2) Environment State Variables: The SDN-based external application observes the state variables when a media chunk is completely received. For each step μ_m, for each cluster, we define a state as

    s_{μ_m,Cl_k} = {BW^all_{μ_m}, BW^e_{μ_m}, BW^HAS_{μ_m}, BW^bt_{μ_m}, C_{μ_m}, Aggbuff_{μ_m,Cl_k}, AggQoE^{metrics}_{μ_m,Cl_k}, Cl^CT_k, Cl^DR_k},

with variables defined in Table 3 and divided into two types, global and local. The global variables indicate the global network status, while the local variables represent the aggregated statuses of the local players belonging to the same cluster.
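As an illustration only, the per-step, per-cluster state built from Table 3 could be held in a small container such as the following; the class and field names are our own naming, not types from the ORL-SDN implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Illustrative container for the per-step, per-cluster state s_{mu_m, Cl_k}
# assembled from the global and local variables of Table 3. Names are placeholders.

@dataclass
class ClusterState:
    bw_all: float                 # total bottleneck capacity (fixed)
    bw_est: float                 # estimated available bandwidth (dynamic)
    bw_has: float                 # bandwidth consumed by HAS traffic
    bw_cross: float               # cross-traffic bandwidth
    congestion: float             # congestion level C in [0, 1]
    agg_buffer: Dict[str, float]  # per-player buffer levels in the cluster
    agg_qoe: Dict[str, float]     # per-player QoE values in the cluster
    content_types: List[str] = field(default_factory=list)
    display_resolutions: List[str] = field(default_factory=list)

state = ClusterState(bw_all=100.0, bw_est=35.0, bw_has=50.0, bw_cross=15.0,
                     congestion=0.65,
                     agg_buffer={"p1": 18.0, "p2": 22.0},
                     agg_qoe={"p1": 4.1, "p2": 3.8},
                     content_types=["sports"], display_resolutions=["720p"])
print(state.congestion)
```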

(3) Actions: The decisions that ORL-SDN takes are referred to as actions. At each step μ_m, for each cluster Cl_k, for each CT and DR, ORL-SDN defines the set of possible actions A as the set of available bitrate levels with their corresponding perceptual qualities (stored at the HAS server) as described in Equation (4). Mapping values (common space model) are represented in Figure 3. Each action (a_{μ_m,g_{Cl_k}} = qt_{μ_m,Cl_k}(l_{μ_m,Cl_k})) that is taken yields different transition probabilities and revenues. ORL-SDN aims to find for each cluster the optimal actions that maximize the long-term revenue and avoid HAS scalability issues. We note that ORL-SDN uses one of the behavior policies ϵ-greedy, Softmax, or VDBE-Softmax [60] to find the optimal policy that avoids suboptimal actions (refer to Section 3.6).

(4) Observations: As shown in Figure 1, our SDN-based external application implements the Bandwidth Estimator, Packet Monitoring Engine, and QoE Measurement modules to collect and estimate the environment state variables presented in Table 3. Periodically, the Bandwidth Estimator uses the PANDA algorithm [34] to estimate BW^e, and the Packet Monitoring Engine employs the sFlow packet inspection and sampling mechanism11 to (a) differentiate between the cross and HAS traffic while estimating BW^bt and BW^HAS and (b) detect the congestion level C ∈ [0, 1] based on the following equation:

    C_{μ_m} = (BW^bt_{μ_m} + \sum_{i=1}^{N} BW^{alloc}_{μ_m,p_i}) / BW^{all},   ∀ μ_m,    (5)

where BW^alloc is the bandwidth allocated for players' HAS traffic, and (BW^bt_{μ_m} + BW^alloc_{μ_m}) is the bandwidth required for both cross- and HAS traffic at step μ_m. In addition, the estimated available bandwidth is BW^e_{μ_m} ≈ BW^all − [BW^alloc_{μ_m} + BW^bt_{μ_m}]. We note that if C_{μ_m} > 1, congestion occurs and the network status is unsatisfactory. Also, Equation (5) ensures the best tradeoff between BW^bt, BW^alloc, and BW^all, which allows ORL-SDN to guarantee that the demands of HAS players (BW^alloc) together with the bandwidth required for the cross-traffic (BW^bt) will not exceed the total capacity (BW^all), and thus ORL-SDN reacts quickly to any increase in HAS or cross-traffic, or both (see Section 4.4). Finally, the QoE Measurement module estimates the local state variables. Hence, the cluster observation can be defined as o_{μ_m,g_{Cl_k}} = (BW^e_{μ_m}, C_{μ_m}, Cl^{Aggstatus}_{μ_m,k}) ∈ O_{Cl_k}, ∀ p_n ∈ Cl_k.
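A small sketch of the congestion computation in Equation (5) and the derived available-bandwidth estimate follows; the function names and the example numbers are illustrative.

```python
# Sketch of Equation (5): congestion level C = (cross-traffic + allocated HAS
# bandwidth) / total capacity, together with the derived estimate
# BW^e ~= BW^all - (allocated HAS bandwidth + cross-traffic).

def congestion_level(bw_cross, per_player_alloc, bw_all):
    return (bw_cross + sum(per_player_alloc.values())) / bw_all

def estimated_available_bw(bw_cross, per_player_alloc, bw_all):
    return bw_all - (sum(per_player_alloc.values()) + bw_cross)

alloc = {"p1": 8.0, "p2": 6.0, "p3": 4.0}        # Mbps allocated to HAS players
c = congestion_level(bw_cross=10.0, per_player_alloc=alloc, bw_all=40.0)
bw_e = estimated_available_bw(10.0, alloc, 40.0)
print(c, bw_e)    # C > 1 would indicate congestion; here C = 0.7 and BW^e = 12 Mbps
```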

(5) Transition Probability Function: The function T : S × A → S provides for each cluster Cl_k a probability distribution P(s'_{Cl_k} | s_{Cl_k}, a_{g_{Cl_k}}) of the new state s'_{Cl_k} given the current state s_{Cl_k} and the action a_{g_{Cl_k}} taken by agent g_{Cl_k}. Hence, at each step μ_m, the cluster transition probability of the POMDP model is defined as

    P(s'_{Cl_k} | s_{Cl_k}, a_{g_{Cl_k}}) = P^{a_{Cl_k}}_{s_{Cl_k} → s'_{Cl_k}} = Pr(s_{μ_m+1,Cl_k} | s_{μ_m,Cl_k}, a_{μ_m,g_{Cl_k}}).    (6)

This probability takes into consideration both of the global Markovian state variables BW^e and C, which are the main causes of HAS scalability issues and mutually independent [43] because both are random and relate directly to the amount of cross- and HAS traffic generated in the network.12 Moreover, the transition probability computation is directly related to the Markovian property where the state variable dynamics in the next state (μ_m + 1) depend only on the current state (μ_m). Thus, the cluster transition probability is

    P^{a_{Cl_k}}_{s_{Cl_k} → s'_{Cl_k}} = Pr(BW^e_{μ_m+1} | BW^e_{μ_m}) \times Pr(C_{μ_m+1} | C_{μ_m}),    (7)

where Pr(C_{μ_m+1} | C_{μ_m}) is calculated based on Equation (5) and

    Pr(BW^e_{μ_m+1} | BW^e_{μ_m}) = { |BW^e_{μ_m} − BW^{all}| / BW^{all}   if BW^e_{μ_m} < BW^e_{μ_m−1};   1   otherwise.    (8)

The players belonging to the same cluster can update their status (local variables), buffer, and QoE based on the per-chunk download time and revenue function.

11 Available at http://www.sflow.org/.
12 We excluded the local state variables for the next state as they depend on the previous states (historical data), the action taken, and both the available bandwidth and the congestion level.
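Under the stated independence and Markovian assumptions, Equations (7) and (8) can be mirrored directly in a short sketch; the inputs below are placeholders.

```python
# Sketch of Equations (7)-(8): the cluster transition probability is the product
# of the bandwidth-transition and congestion-transition probabilities.

def prob_bw_transition(bw_e_now, bw_e_prev, bw_all):
    # Equation (8): penalize the transition when the estimate has just dropped.
    if bw_e_now < bw_e_prev:
        return abs(bw_e_now - bw_all) / bw_all
    return 1.0

def cluster_transition_prob(bw_e_now, bw_e_prev, bw_all, prob_congestion):
    # Equation (7): product of the two (assumed independent) global factors.
    return prob_bw_transition(bw_e_now, bw_e_prev, bw_all) * prob_congestion

# Example: the estimate dropped from 15 to 12 Mbps on a 40 Mbps bottleneck, and
# the congestion-transition probability (derived from Equation (5)) is 0.8.
print(cluster_transition_prob(12.0, 15.0, 40.0, prob_congestion=0.8))
```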

(6) Revenue Function: The cluster revenue function R : S × A → ℝ represents the benefits of an agent taking an action and transiting to the next state. In our context, we define the revenue function as a QoE model that is computed for each player and at each downloading step. This allows ORL-SDN to decide how good the action taken by the agent g_{Cl_k} of the set of players belonging to a cluster Cl_k is. Our revenue model is practical [11, 64] for real-world HAS services and consists of rewards including the SSIMplus-based average quality (AvgQT_{p_n}) and penalties, which are the startup delay (SD_{p_n}), stall events (SE_{p_n}), and quality oscillations (AvgQS_{p_n}).13 At each step μ_m, and for each player p_n ∈ Cl_k, when an action a_{g_{Cl_k}} is taken, the revenue function is defined as a weighted combination of both the rewards and penalties (we adopt a similar QoE model as presented in Bentaleb et al. [3] and in [11, 64]):

    r_{μ_m,p_n} = QoE_{μ_m,p_n} = ω_1 \times AvgQT_{μ_m,p_n} − ω_2 \times AvgQS_{μ_m,p_n} − ω_3 \times SE_{μ_m,p_n} − ω_4 \times SD_{μ_m,p_n},    (9)

where parameters ω_{1,2,3,4} are nonnegative weighting factors with \sum_{i=1}^{4} ω_i = 1. Empirically, many objective tests were performed to tune the weighting factors, based on the objective/subjective recommendations from studies [13, 27, 64]. We selected a value of 0.25 for each because the four QoE metrics are equally important, and they can significantly impact the client satisfaction. The rewards and penalties of our revenue model are calculated as follows:

    AvgQT_{μ_m} = \sum_{m=1}^{K} q_{μ_m}(l_{μ_m}),
    AvgQS_{μ_m} = \sum_{m=1}^{K−1} |q_{μ_m+1}(l_{μ_m+1}) − q_{μ_m}(l_{μ_m})|,
    SE_{μ_m} = (1/K) \sum_{m=1}^{K} SE_{μ_m},
    SD_{μ_m} = StartupDelay(μ_m), where SD_{μ_m} << Aggbuff_{min}.    (10)

Further, we define the cluster revenue AvgQoE_{μ_m,Cl_k} as the average revenue of all players belonging to the same cluster, where

    AvgQoE_{μ_m,Cl_k} = r_{μ_m,Cl_k} = (1/N_{Cl_k}) \sum_{n=1}^{N_{Cl_k}} r_{μ_m,p_n}    (11)

and N_{Cl_k} is the number of players in cluster Cl_k. The common objective of all agents is to maximize the long-term revenue r^{total}_{μ_m,Cl_k} of all HAS players in every cluster during the streaming session:

    r^{total}_{μ_m,Cl_k} = \sum_{m=1}^{K} γ^m r_{μ_m,Cl_k},   γ: discount factor ∈ [0, 1].    (12)

Based on the value function (υ) [58], the cluster long-term revenue can be reformulated as

    r^{total}_{μ_m,Cl_k}(s_{μ_m,Cl_k}) = r_{μ_m,Cl_k}(s_{μ_m,Cl_k}) + γ υ(S_{Cl_k}),    (13)

where υ(.) is computed via value function approximation (e.g., stochastic gradient descent, dynamic programming, linear combinations, neural network).

13 AvgQS considers per-chunk SSIMplus variations (total average of per-frame SSIMplus values). It helps ORL-SDN to maintain consistent quality decisions.
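A minimal sketch of the per-player revenue of Equation (9), using the equal weights of 0.25 mentioned above, and of the per-cluster average of Equation (11); the metric values are placeholders.

```python
# Sketch of the revenue (QoE) model: Equation (9) for one player with the equal
# weights w1 = w2 = w3 = w4 = 0.25 used in the article, and Equation (11)
# averaging the per-player revenues of a cluster. Metric values are placeholders.

WEIGHTS = (0.25, 0.25, 0.25, 0.25)

def player_revenue(avg_qt, avg_qs, stall_events, startup_delay, w=WEIGHTS):
    w1, w2, w3, w4 = w
    return w1 * avg_qt - w2 * avg_qs - w3 * stall_events - w4 * startup_delay

def cluster_revenue(per_player_metrics):
    revenues = [player_revenue(*m) for m in per_player_metrics]
    return sum(revenues) / len(revenues)        # AvgQoE of the cluster

# (AvgQT, AvgQS, SE, SD) per player, on whatever normalized scale is in use.
metrics = [(4.2, 0.3, 0.1, 0.2), (3.9, 0.5, 0.0, 0.1)]
print(cluster_revenue(metrics))
```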

(7) Q-Table: In Q-learning, in a state s an action a is taken based on a specific value Q(s, a), called the Q-value. This value should be the maximum among all tuples (s, a), where a represents all possible actions. The Q-table is the mapping between ⟨state, action⟩ following a behavior policy (π); thus, for each cluster Cl_k, the per-cluster Q-table is defined as

    Q_{Cl_k}(s, a) = \begin{bmatrix} Q(s_1,a_1) & Q(s_1,a_2) & \cdots & Q(s_1,a_{|A|}) \\ Q(s_2,a_1) & Q(s_2,a_2) & \cdots & Q(s_2,a_{|A|}) \\ \vdots & \vdots & \ddots & \vdots \\ Q(s_K,a_1) & Q(s_K,a_2) & \cdots & Q(s_K,a_{|A|}) \end{bmatrix}.    (14)

The rows represent all the cluster states of the streaming session and each column is related to one possible cluster action, which corresponds to the players of cluster ∀ p_n ∈ Cl_k. At each step μ_m and ∀ Cl_k ∈ Cl, the Q-values in the Q-table are calculated and updated as follows:

    Q_{Cl_k}(s_{μ_m,Cl_k}, a_{μ_m,Cl_k}) = Q_{Cl_k}(s_{μ_m,Cl_k}, a_{μ_m,Cl_k}) + α \times [ r_{μ_m,Cl_k} + γ \max_a Q_{Cl_k}(s'_{Cl_k}, a'_{Cl_k}) − Q_{Cl_k}(s_{μ_m,Cl_k}, a_{μ_m,Cl_k}) ],    (15)

where s_{μ_m,Cl_k} is the current cluster state, a_{μ_m,Cl_k} is the selected cluster action, r_{μ_m,Cl_k} represents the immediate cluster revenue, and a'_{Cl_k} is the cluster action that returns the highest Q-value in the next cluster state s' among the other action possibilities. α and γ are the learning rate and discount factor, respectively, with 0 ≤ α, γ ≤ 1. We avoid a large ⟨state, action⟩ space (i.e., a larger Q-table size requires more learning space) and speed up convergence to the optimal decisions by tabulating (i.e., estimating and updating) only the optimal Q-values in parallel using a multilayer perceptron (MLP) neural network [19], in particular a stochastic gradient descent Q-value function approximation [50] and PDS. In practice, the MLP neural network parameters (i.e., ⟨state, action⟩, α, γ) with their corresponding optimal Q-values (optimal Q-table) are trained offline and added to ORL-SDN.
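The tabular update of Equation (15) can be written out directly, using the default α = 0.05 and γ = 0.9 from Table 4. The dictionary-based Q-table and the example numbers are illustrative, and the parallel PDS and MLP-based approximation described above are omitted from this sketch.

```python
from collections import defaultdict

# Sketch of the per-cluster Q-value update in Equation (15):
#   Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
# A plain dict-of-dicts Q-table is used here for clarity.

def q_update(q_table, s, a, reward, s_next, actions, alpha=0.05, gamma=0.9):
    best_next = max(q_table[s_next][a2] for a2 in actions)
    td_target = reward + gamma * best_next
    q_table[s][a] += alpha * (td_target - q_table[s][a])
    return q_table[s][a]

actions = ["l1", "l2", "l3"]                        # candidate bitrate levels
q_table = defaultdict(lambda: {a: 0.0 for a in actions})
print(q_update(q_table, s="s0", a="l2", reward=0.86, s_next="s1", actions=actions))
```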

3.5 Objective Function

At each step μm , ORL-SDN agents learn the per-cluster optimal strategy. The goal is to maximizecluster revenue among the players of a cluster while achieving a stable and fair revenue for allplayers in the network. We formulate the online optimization problem where each agent strivesto find the per-cluster optimal action (a�) among all possible actions, aiming to maintain a highand stable long-term per-player QoE while respecting a set of dynamic constraints that are theavailable bandwidth, aggregated buffer size, content type, and device capabilities, respectively.For each cluster Clk , the objective function with its constraints can be expressed formally as

$$\begin{cases}
\text{Find } a^{\star}_{\mu_m,Cl_k},\ \forall \mu_m,\ \forall Cl_k \in Cl,\ \forall p_n \in Cl_k, \text{ where:} \\
\max_{a}\ r^{total}_{\mu_m,Cl_k} \\
\text{s.t.}\ \sum_{k=1}^{|Cl|} \sum_{n=1}^{N_{Cl_k}} BW^{alloc}_{\mu_m,p_n} < BW^{all}, \\
Aggbuff_{min,Cl_k} \le Aggbuff_{\mu_m,Cl_k} \le Aggbuff_{max,Cl_k}, \\
MAP_{SSIMplus}(a^{\star}_{\mu_m,Cl_k}, Cl^{CT}),\ MAP_{SSIMplus}(a^{\star}_{\mu_m,Cl_k}, Cl^{DR}).
\end{cases} \qquad (16)$$
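A hypothetical sketch of how (16) could be solved for one cluster is given below: enumerate the candidate actions (bitrate levels), discard those that violate the bandwidth budget, the aggregated-buffer bounds, or the SSIMplus mapping for the cluster's content type and device resolution, and keep the feasible action that maximizes the long-term revenue of Equation (13). All helper functions are assumptions used for illustration.

```python
def select_optimal_action(actions, total_revenue, bw_alloc, bw_all,
                          agg_buffer, buf_min, buf_max, feasible_mapping):
    """Brute-force search over candidate cluster actions subject to the constraints of (16)."""
    best_action, best_revenue = None, float("-inf")
    for a in actions:
        if bw_alloc(a) >= bw_all:                      # bandwidth budget violated
            continue
        if not (buf_min <= agg_buffer(a) <= buf_max):  # aggregated buffer out of range
            continue
        if not feasible_mapping(a):                    # SSIMplus mapping for CT and DR
            continue
        r = total_revenue(a)                           # long-term revenue, Equation (13)
        if r > best_revenue:
            best_action, best_revenue = a, r
    return best_action
```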


Table 4. Parameters Used in the Experiments (*: Default)

HAS Player (ORLSDNdash.js):
  Aggbuff_min: 8 seconds
  Aggbuff_max: 36 seconds
  QoE: Normalized QoE (1 to 5) [3]
  DR: 240p, 360p, 480p, 720p, 1080p
  SPT: Normal, bronze, silver, gold, platinum

Manifest Files:
  CT: Five types of videos
  T: 600 seconds
  K (Steps): 150 steps
  τ: 4 seconds
  L (Actions): 20 bitrate levels (H.264)
  QT: SSIMplus-based

PANDA Estimator:
  κ, ω, B: 0.14, 0.3, 0.2, respectively

POMDP Model and Solution:
  Cl: Five clusters
  α (Learning rate): 0.05*, 0.1, 0.3, 0.5, 0.7, 0.9
  γ (Discount rate): 0.05, 0.1, 0.3, 0.5, 0.7, 0.9*
  λ (Trace decay): 0.05, 0.1, 0.3, 0.5, 0.7, 0.9*
  ϵ (ϵ-greedy): 0 ≤ ϵ ≤ 1
  ζ (ϵ-greedy): rand(0..1), a random number
  β (Softmax): 0.002*, 0.1, 0.3, 0.4, 0.5, 1
  δ (VDBE-Softmax): 1, influence factor ∈ [0, 1)
  σ (VDBE-Softmax): 1, inverse sensitivity
  Episodes: 150
  z (fastMPC lookahead): Three steps ahead
  Behavior policy: VDBE-Softmax()

Other HAS Schemes:
  As suggested in the respective papers

Fig. 5. The POMDP model aggregation. Fig. 6. Streaming session interactions.

3.6 Online Solution

To solve the optimization problem above, we develop a new Q-learning-based online algorithm as a module within ORL-SDN. It leverages the RL capabilities, and Algorithm 1 finds the optimal strategy (policy π⋆) using the behavior policy VDBE-Softmax [60] with its parameters (see Table 4), leading to the optimal per-cluster action at each state under uncertainty. Further, it incorporates both the parallel PDS and fastMPC approaches with Q-learning, which improve the optimal action selection and estimate the POMDP model environment state variables for the next states, respectively. The goal is to maximize the long-term revenue of each cluster and speed up convergence to the optimal actions. Moreover, Algorithm 1 solves the problem as one bigger POMDP model (i.e., the aggregation of all clusters with their players). Such aggregation is performed via a state aggregation process (SAP) [15] (Parallel(), line 14); thus, the Q-values of all clusters are processed together, which allows ORL-SDN to reduce complexity and convergence times. We note that during the solving phase, both the parallel PDS and SAP aggregate states (as we have many possible actions ⇔ a large space of ⟨current state, new state, action, reward⟩) via a parallelization process, where the former is used for intraclusters and the latter for interclusters, respectively (see Figure 5). Figure 5 represents our POMDP model aggregation that consists of three levels: players, clusters, and the bigger POMDP model (aggregating all clusters with their players).

3.7 HAS Player Integration

Figure 6 shows the interactions in a streaming session. A player starts the streaming session by requesting the regular and quality manifest files (these could be easily integrated into one file)


and then modifies the HTTP GET request headers by adding a new field regarding its status. Such information is then extracted and stored by the external SDN-based application. Upon reception of this request, the server replies with the appropriate chunk as usual. Before reaching the corresponding player, the external application rewrites the HTTP header using libcurl by adding the representation recommendation suggested by ORL-SDN. Thereafter, when the chunk is completely downloaded, the player extracts this recommendation to use as an upper bound during the bitrate adaptation. These steps are repeated for every chunk until the end of the streaming session, allowing our architecture to avoid explicit communication overhead that may affect the performance. Additionally, the player relies only on its own bitrate adaptation heuristics when the SDN-based external application fails. We selected dash.js [10] as our HAS player and made the necessary changes to this player. One real-world issue is traffic encryption; for this we designed our architecture to be modular and accommodate end-to-end security measures employed by content providers. In our architecture, a content provider could (1) insert an in-path appliance to broker key exchanges between players and the server or (2) use the BlindBox system [54] for the encrypted traffic. Also, our solution is able to identify new video sessions quickly when players join the network and session terminations when players leave. Thus, our solution is efficient and robust with respect to player mobility and churn.
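The following sketch illustrates the header exchange described above from the player's side, using Python's requests library rather than the dash.js/libcurl implementation of the paper. The header names "X-Player-Status" and "X-ORLSDN-Bitrate" are hypothetical placeholders for whatever fields the player and the external application agree on.

```python
import requests

def fetch_chunk(url, buffer_level, current_bitrate):
    """Request a chunk while advertising player status and reading back the recommendation."""
    # Player advertises its status in the outgoing HTTP GET request.
    status = f"buffer={buffer_level};bitrate={current_bitrate}"
    resp = requests.get(url, headers={"X-Player-Status": status}, timeout=10)
    resp.raise_for_status()
    # The SDN-based external application rewrites the response headers with the
    # recommended representation; the player uses it as an upper bound.
    recommendation = resp.headers.get("X-ORLSDN-Bitrate")
    upper_bound = int(recommendation) if recommendation else None
    return resp.content, upper_bound
```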

ALGORITHM 1: Per-cluster online algorithm
 1: procedure Online Q-learning-based algorithm
 2:   Input:
 3:     BW_all, BW_HAS_{μm}, C_{μm}, Aggbuff_{Clk}, AggQoEmetrics_{Clk},
 4:     Cl_CT_k, Cl_DR_k, z, A_{Clk}, Q_{Clk}(s, a), Aggbuff_{max,Clk},
 5:     O_{Clk}, Aggbuff_{min,Clk}, POMDP parameters (Table 4).
 6:   Initialization:                                   ▷ μ1: first step
 7:     Initialize the Q-table Q_{Clk}(s_{μ1,Clk}, a_{μ1,Clk}) using FAQ-learning() [7]
 8:     Select VDBE-Softmax π_{s_{μ1},Clk} as the behavior policy
 9:     z = 3                                           ▷ fastMPC time horizon lookahead
10:   Output:
11:     a⋆_{μm,Clk}, ∀μm, ∀Clk ∈ Cl: the optimal action
12:   Begin:
13:   repeat                                            ▷ for each learning episode (i.e., one streaming session)
14:     for each cluster Clk, k = 1, 2, ..., |Cl| in Parallel() do
15:       for each step μm, m = 1, 2, ..., K do
16:         for each a^j_{μm,Clk} ∈ A_{Clk}, j = 1, 2, ..., |A_{Clk}| do
17:           r^j_{μm,Clk} ← compute the step revenue
18:           look z steps ahead for the estimated state variables
19:           r^{total,j}_{μm,Clk} ← compute the total revenue
20:           store a^j_{μm,Clk} with its r^j_{μm,Clk} and r^{total,j}_{μm,Clk}
21:         end for
22:         a⋆_{μm,Clk} ← find the optimal action by solving (16)
23:         r_{μm,Clk} ← compute the step revenue
24:         r^{total}_{μm,Clk} ← compute the total revenue
25:         s̀_{Clk} ← observe in parallel the next state using PDS
26:         update the Q-values of Q_{Clk}(s_{μm,Clk}, a_{μm,Clk}) using (15)
27:         store the state variables in the fastMPC table
28:         s_{Clk} ← transit to the next state
29:       end for
30:     end for
31:   until (μK ‖ end of emulation)                     ▷ μK: final step
32:   End
33: end procedure
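As a rough companion to Algorithm 1, the sketch below renders its inner loop for a single cluster in sequential Python, omitting the Parallel(), PDS, and fastMPC machinery. All helpers (solve_16, step_revenue, total_revenue, observe_next_state, q_update) are hypothetical and correspond to the steps named in the pseudocode rather than to the paper's Matlab implementation.

```python
def run_episode(cluster, actions, Q, steps, solve_16, step_revenue,
                total_revenue, observe_next_state, q_update):
    """One learning episode (streaming session) for a single cluster."""
    s = cluster.initial_state()
    for _ in range(steps):
        # Evaluate every candidate action with a lookahead-based revenue estimate (lines 16-21).
        candidates = [(a, total_revenue(s, a)) for a in actions]
        # Pick the feasible action maximizing long-term revenue, Equation (16) (line 22).
        a_star = solve_16(s, candidates)
        r = step_revenue(s, a_star)
        s_next = observe_next_state(s, a_star)
        Q = q_update(Q, s, a_star, r, s_next)   # Equation (15) (line 26)
        s = s_next
    return Q
```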

4 PERFORMANCE EVALUATION

To evaluate the performance of ORL-SDN 14 and our modified HAS player (ORLSDNdash.js), we conducted a set of on-demand streaming experiments with 10 experiments for each test scenario. We varied the number of players (e.g., 10, 30, 50, 100, 200, 500, 1,000, 2,000), bandwidth variation profiles (e.g., WiFi, 3G, 4G, and wired), test environment (e.g., mobile player, fixed player, arbitrary player arrival/departure times, heterogeneous environment), content types, POMDP model, and RL parameters (e.g., the learning rate, the discount factor). Due to space limits and the similar results obtained in most test scenarios, we only show one experimental evaluation result over multiple episodes (or runs) and players for the test scenario that consists of 100 players and 170Mbps total network bandwidth. Our network topology is depicted in Figure 7; it consists of a combination of real system components (HAS server, SDN app, and

14Implemented as real-time SDO-based Matlab code within our architecture.


controller) and emulated, virtualized components (Mininet-based [32] HAS players). For performance evaluation we compare ORLSDNdash.js against the existing well-known bitrate adaptation schemes: mDASH [65], onlineLearner [6], BBA [22], QDASH [39], SARA [29], the original dash.js, PANDA [34], FESTIVE [27], and SDNDASH [3]. Furthermore, we compare the per-cluster efficiency of ORLSDNdash.js with pure ϵ-greedy and available-rate-based policies. We implemented the schemes in the stable release (v2.3.0) of dash.js. The results were compared using the four HAS scalability metrics: (1) presentation quality stability, (2) fairness, (3) utilization, and (4) QoE.

4.1 Experimental Setup and Parameters

As shown in Figure 7, our network topology is inspired by real-world last-mile multihoming and multiplayer gaming network scenarios [36]; it consists of 100 heterogeneous dash.js-based HAS players with various device resolutions that are connected to the same edge bottleneck link (one bottleneck link) and created via Mininet. They compete for 170Mbps total network bandwidth. The HAS server is Apache with a content catalog of five videos of different types: Big Buck Bunny, Of Forests and Men, Tears of Steel, News, and Red Bull Playstreets [33], which are encoded at 20 bitrate levels, with L = {50, 100, 150, 200, 250, 300, 400, 500, 600, 700, 900, 1,200, 1,500, 2,000, 2,100, 2,400, 2,900, 3,300, 3,600, 3,900} Kbps. Dummynet is used to throttle the bandwidth, and iperf generates TCP-based dynamic cross-traffic that ranges from 10 to 25Mbps according to realistic throughput variability profiles from both the FCC broadband15 and DASH-IF guidelines [9] datasets. Thus, we emulate a typical realistic multiclient shared bottleneck link scenario. Furthermore, an extended SDN RYU controller runs the SDN-based external and internal applications. Three Lagopus vSwitches with OpenFlow v1.3 support complete the setup [38]. It may be noted here that we selected this scenario where multiple clients run simultaneously for two main reasons: (1) this setup is recommended by the DASH-IF in its benchmark test cases for multiple clients [9], and (2) it represents the most challenging, worst-case scenario to test whether the proposed solution works well (i.e., the execution of multiple simultaneous clients with high/sudden network variability).
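A simplified sketch of such an emulated topology is shown below, using the Mininet Python API: many player hosts behind a single 170 Mbps bottleneck link toward the HAS server, attached to an external controller. The real setup uses three Lagopus vSwitches, an extended RYU controller, Dummynet, and iperf cross-traffic; here one pair of switches and the controller address 127.0.0.1 are illustrative assumptions.

```python
from mininet.net import Mininet
from mininet.topo import Topo
from mininet.link import TCLink
from mininet.node import RemoteController

class BottleneckTopo(Topo):
    """Server -- s1 --(bottleneck)-- s2 -- N player hosts."""
    def build(self, num_players=100, bottleneck_mbps=170):
        server = self.addHost("server")
        s1 = self.addSwitch("s1")
        s2 = self.addSwitch("s2")
        self.addLink(server, s1, bw=1000)
        self.addLink(s1, s2, bw=bottleneck_mbps)   # shared edge bottleneck link
        for i in range(num_players):
            player = self.addHost(f"p{i}")
            self.addLink(player, s2, bw=100)

if __name__ == "__main__":
    net = Mininet(topo=BottleneckTopo(), link=TCLink,
                  controller=lambda name: RemoteController(name, ip="127.0.0.1"))
    net.start()
    net.pingAll()   # basic connectivity check before launching players and cross-traffic
    net.stop()
```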

Due to the complex characteristics of the network, it is challenging to find suitable values for all of the parameters. In the experiments, we tuned these parameters based on our own test scenarios and the prior work of Tokic and Palm [60], as described in Table 4. In addition, in RL the action selection is performed via a behavior policy such as ϵ-greedy, Softmax, or VDBE-Softmax [60] during the exploitation phase [58]. In these experiments, we used VDBE-Softmax [60] as the behavior policy, which represents a combination of both Softmax and ϵ-greedy. The action selection in the VDBE-Softmax policy π(s_{μm}) is defined as follows:

$$\pi(s_{\mu_m}) =
\begin{cases}
\text{Softmax()} = \dfrac{e^{\beta Q_{Cl_k}(s_{\mu_m}, a_{\mu_m})}}{\sum_{\forall b \in A} e^{\beta Q_{Cl_k}(s_{\mu_m}, b_{\mu_m})}} & \text{if } \zeta < \epsilon_{\mu_m}(s_{\mu_m}) \\[6pt]
\arg\max_{a \in A} Q_{Cl_k}(s_{\mu_m}, a_{\mu_m}) & \text{otherwise,}
\end{cases} \qquad (17)$$

where Softmax() selects actions based on a probabilistic method in which a Boltzmann distribution is used to rank the learned Q-values, ζ is a uniform random number ∈ [0,1], and 0 ≤ ϵ ≤ 1 is drawn at each step μm. The parameter values in Equation (17) are set as in prior studies [6, 7, 44, 45, 61]. Moreover, to confirm these values, we performed a suite of tests with various combinations of POMDP model parameters (as shown in "POMDP Model and Solution" of Table 4, with the used values marked by (*)). Our step revenue model (Equation (9)) outputs a value between 0 and 1; the normalized QoE (N-QoE) represents the player's satisfaction and is defined in Table 5.
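A minimal sketch of the selection rule in Equation (17) is given below. With probability ϵ(s) it samples from a Boltzmann (Softmax) distribution over the Q-values; otherwise it acts greedily. The state-dependent ϵ update that gives VDBE-Softmax its name (driven by value differences and the δ, σ parameters of Table 4) is intentionally omitted for brevity, so this is only the action-selection half of the policy.

```python
import numpy as np

def vdbe_softmax_select(q_values, epsilon, beta=0.002, rng=None):
    """Select an action index per Equation (17); beta defaults to the value in Table 4."""
    rng = rng or np.random.default_rng()
    zeta = rng.uniform(0.0, 1.0)
    if zeta < epsilon:
        logits = beta * np.asarray(q_values, dtype=float)
        probs = np.exp(logits - logits.max())   # subtract max for numerical stability
        probs /= probs.sum()
        return int(rng.choice(len(q_values), p=probs))
    return int(np.argmax(q_values))
```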

15Available at https://goo.gl/xM5dUx.


Fig. 7. Network topology used for the evaluation.

Table 5. Normalized QoE

Utility (QoE)     N-Utility (N-QoE)    Degree
0.8 ≤ r ≤ 1       Between 4 and 5      Excellent
0.6 ≤ r < 0.8     Between 3 and 4      Good
0.4 ≤ r < 0.6     Between 2 and 3      Fair
0.2 ≤ r < 0.4     Between 1 and 2      Poor
0 ≤ r < 0.2       Between 0 and 1      Bad
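A mapping consistent with Table 5 is sketched below, assuming a simple linear scaling of the step revenue r ∈ [0, 1] onto the N-QoE range [0, 5]; the paper only specifies the interval boundaries, so the exact scaling inside each interval is an assumption.

```python
def normalized_qoe(r):
    """Map a step revenue r in [0, 1] to an (N-QoE, degree) pair consistent with Table 5."""
    n_qoe = 5.0 * max(0.0, min(1.0, r))   # assumed linear scaling onto [0, 5]
    if r >= 0.8:
        degree = "Excellent"
    elif r >= 0.6:
        degree = "Good"
    elif r >= 0.4:
        degree = "Fair"
    elif r >= 0.2:
        degree = "Poor"
    else:
        degree = "Bad"
    return n_qoe, degree
```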

4.2 Results, Discussion, and Analysis

We evaluate the effectiveness of ORL-SDN with two experiments. First, we compare the average results over multiple episodes (around 150) and over all players that belong to each cluster against the pure ϵ-greedy and available-rate-based schemes in terms of presentation quality stability, QoE, fairness, utilization, and congestion level. Second, we average the results over multiple episodes and over all clusters (all players) and compare them with the well-known bitrate adaptation schemes. All experiments are performed with long-term bandwidth variations.

4.2.1 Video Presentation Quality Stability. Presentation quality stability is evaluated based on the bitrate fluctuations, stalls, and perceptual quality oscillations. Results of per-cluster quality stability for the low, moderate, high, very high, and extreme clusters in ORLSDNdash.js, pure ϵ-greedy, and available rate based are depicted in Figures 8(a) through 8(e), respectively. The left (bitrate) and middle (quality) plots are the bitrate and its corresponding quality decisions recommended by ORL-SDN to the set of ORLSDNdash.js players in each cluster. We observe that ORL-SDN achieves the best results in each cluster (blue line) compared to the other schemes, where it selects a high and stable per-cluster bitrate level that ranges from 1,500 to 2,089Kbps as an average of all clusters while maintaining consistent perceptual quality with an average variance of 0.035. In addition, ORL-SDN ensures very few stalls (an average of 1.4 stalls), infrequent quality oscillations (once), and low startup delay (1.72s). For the second experiment, similar results are observed. Table 7 tabulates the results, where ORL-SDN outperforms mDASH, onlineLearner, BBA, QDASH, SARA, dash.js, PANDA, FESTIVE, and SDNDASH, which experienced several variations in their decisions. Moreover, both the PDS and fastMPC techniques provide accurate information that helps ORL-SDN in the available network resource estimation. Thus, at each step an optimal per-cluster bitrate with its corresponding quality and resource allocation is selected, leading to a maximum per-cluster and total QoE, and eliminating HAS instability issues. It is clearly observable that at the beginning of each streaming session, ORLSDNdash.js starts with a low bitrate selection, since there is a lack of global network information and an empty buffer. Furthermore, the quick startup delay is presented in Figure 9(f) and Table 6. This is the benefit of the FAQ-learning algorithm [8] that is integrated with ORL-SDN (see line 7 of Algorithm 1). This algorithm aims to initialize the Q-table with accurate values (not with random values as in the case of traditional Q-learning algorithms) before starting a streaming session.

Our average per-cluster and average total quality stability results are presented in Tables 6 and 7, respectively. The former is the average result for the players within the same cluster, while the latter is the average result for all the players in the network. Moreover, the table metrics are defined as follows:


Table 6. Per-Cluster Average Presentation Quality Stability and QoE Metrics
(per scheme: AVG quality (SSIMplus), AVG # of oscillations, AVG # of stalls, AVG startup delay)

Cl_L:  ORLSDNdash.js: 0.8 to 0.91, 1, 1, 0.56s | Pure ϵ-greedy: 0.77 to 0.89, 15, 4, 2.6s | Available rate based: 0.77 to 0.91, 50, 15, 10.58s
Cl_M:  ORLSDNdash.js: 0.91 to 0.94, 1, 3, 1.87s | Pure ϵ-greedy: 0.906 to 0.938, 30, 9, 3.3s | Available rate based: 0.77 to 0.94, 45, 12, 5.63s
Cl_H:  ORLSDNdash.js: 0.943 to 0.949, 1, 1, 1.88s | Pure ϵ-greedy: 0.943 to 0.952, 50, 10, 2.3s | Available rate based: 0.77 to 0.962, 43, 10, 4.37s
Cl_VH: ORLSDNdash.js: 0.962 to 0.97, 1, 1, 2s | Pure ϵ-greedy: 0.96 to 0.97, 30, 7, 3.16s | Available rate based: 0.77 to 0.983, 40, 14, 4.15s
Cl_E:  ORLSDNdash.js: 0.965 to 0.99, 1, 1, 2.3s | Pure ϵ-greedy: 0.96 to 0.99, 15, 7, 4s | Available rate based: 0.77 to 0.99, 42, 9, 7.22s
Avg:   ORLSDNdash.js: 0.916 to 0.951, 1, 1.4 (1.08s), 1.72s | Pure ϵ-greedy: 0.907 to 0.948, 28, 7.4 (18.4s), 3.07s | Available rate based: 0.77 to 0.957, 44, 12 (25.2s), 6.39s


Table 7. Average Total Quality Stability, Fairness, Utilization, and QoE


Fig. 8. Per-cluster average bitrate, SSIMplus-based quality, and normalized QoE in a typical run (over 150 episodes of 600 seconds each, and players belonging to the same cluster during a streaming session of 600 seconds) for each cluster with 20 players.

• AVG Quality and Bitrate Level are the ranges of the average selected SSIMplus-based quality and bitrate levels.
• AVG Number of Oscillations, Stalls, and Startup Delay are the average number of quality oscillations, number of stalls, and the startup delay, respectively.
• AVG Instability, Unfairness, and Underutilization are properties that describe the degree of quality stability, QoE fairness, and bandwidth utilization, respectively. The definitions can be found in [34], and we note that the inter- and intracluster fairnesses are computed through the Jain fairness index [25], which takes into account the number of players in each cluster, the number of clusters, and the bandwidth or the QoE of each player (a small computation sketch follows this list).

4.2.2 Fairness and QoE Maximization. ORL-SDN aims to maximize the QoE of each player in a cluster while achieving a higher QoE fairness between the set of clusters. The N-QoE maximization is shown in Figures 8(a) through 8(e) for each cluster, and in Figures 9(b) and 9(c) for all clusters. The per-cluster N-QoE fairness is presented in the heat map of Figure 9(a). As we can see, ORL-SDN


Fig. 9. Normalized QoE, utilization, and startup delay for 100 players and 170Mbps of total system bandwidth.

ensures a fairly distributed N-QoE between the set of clusters (orange heat-map graph) in Figure 9(a) with 4.77 as an average. This finding confirms that ORL-SDN largely eliminates QoE unfairness compared to other schemes that suffer from N-QoE fluctuations. These results are expected, since ORL-SDN at each step avoids suboptimal action decisions and strives to guarantee and maintain a fair QoE between clusters. Furthermore, Figures 8 and 9(c) clearly show that ORLSDNdash.js's N-QoE increases steadily during a streaming session until it reaches the maximum at the end, whereas the other schemes experience many variations due to their selection schemes that cannot deal with either long-term bandwidth fluctuations or cross-traffic dynamics. We note that for each cluster at each step a dynamic, dedicated amount of bandwidth is allocated based on optimal decisions that maximize the QoE in a fair manner across all clusters. The boxplot in Figure 9(b) shows the variations of the N-QoE maximization, where the dashed line (whisker) and its limits (two black lines) represent such variations from a low N-QoE to the maximum value, while the median of all N-QoE values (of a streaming session) is represented by the red line. ORLSDNdash.js outperforms all the other schemes in terms of the total average N-QoE maximization, as it achieves a minimum of 4.6 and a maximum of 4.8 at session start (t = 4s) and session end (t = 600s), respectively.

4.2.3 Utilization and Congestion Level. We analyze the total bandwidth usage by the HAS players, taking into account the cross-traffic dynamics and the congestion level introduced by the players and the cross-traffic during a streaming session. Sandvine in its latest study [53] shows that network resources are more prone to congestion, especially during peak utilization periods. This is due to a lack of resources and their inadequate allocation and management. Figure 9(d) presents the total bandwidth usage by HAS traffic including random/dynamic cross-traffic throughput requirements. ORLSDNdash.js avoids network underutilization as it has the best bandwidth usage, allocation, and management without any violation. Figure 9(e) shows an appropriate congestion level (0.99)16 including random/dynamic cross-traffic throughput requirements that do not exceed the threshold of 1. Moreover, Table 7 illustrates that the average bandwidth usage is around 97%

16 When the total demand is greater than the total available bandwidth, a violation occurs and congestion increases (≈1 is a very good value with full utilization and >1 represents overload).


with ORLSDNdash.js compared to mDASH (130% with violations and a very high congestion level of 1.5), onlineLearner (40% with violations and a low congestion level of 0.49), BBA (95% with violations and a good congestion level of 0.92), QDASH (110% with violations and a high congestion level of 1.2), SARA (87% with violations and a good congestion level of 0.9), dash.js (70% without violations and a low congestion level of 0.69), PANDA (92% without violations and a good congestion level of 0.9), FESTIVE (70% with violations and a low congestion level of 0.5), SDNDASH (93% without violations and a good congestion level of 0.89), pure ϵ-greedy (87% with violations and a high congestion level of 1.1), and available rate based (82% with violations and a very high congestion level of 1.3). These outcomes are because (1) ORL-SDN uses fastMPC, which provides an accurate estimation of global network variables with their dynamics in advance; (2) our SDN-based external application estimates the available bandwidth accurately, detects its dynamics especially in the case of long-term and sudden bandwidth fluctuations, and differentiates between HAS and cross-traffic using a packet inspection mechanism; (3) the per-cluster action selection algorithm avoids suboptimal decisions and maximizes QoE without any bandwidth violations and without affecting other players' QoE; and (4) dynamic network resource allocation, management, and monitoring are provided by our SDN internal application.

4.3 Summary of Results

Below, consecutive numbers represent the results for mDASH, onlineLearner, BBA, QDASH, SARA, dash.js, PANDA, FESTIVE, SDNDASH, pure ϵ-greedy, and available-rate-based, respectively. Our findings are summarized as follows:

(1) ORL-SDN improves the quality stability by 14%, 95%, 25%, 98%, 53%, 17%, 10%, 11%, 8%, 20%, and 30%.

(2) ORL-SDN provides a 17%, 65%, 32%, 61%, 47%, 52%, 12%, 14%, 11%, 55%, and 41% increase in terms of QoE fairness.

(3) ORL-SDN provides efficient bandwidth utilization without any violations and with optimal congestion levels that do not affect HAS players' efficiency and QoE. It improves bandwidth utilization by 37%, 69%, 27%, 20%, 30%, 45%, 13%, 28%, 13%, 25%, and 28%.

(4) ORL-SDN improves the viewer QoE by 10%, 14%, 36%, 56%, 46%, 40%, 8%, 9%, 7%, 24%, and 32%.

(5) ORL-SDN provides 6%, 94%, 24%, 98%, 52%, 15%, 4%, 5%, 4%, 18%, and 29% less perceptual quality oscillations.

(6) ORL-SDN outperforms other schemes in terms of consistent and high perceptual quality by 14%, 89%, 84%, 85%, 83%, 71%, 10%, 12%, 11%, 76%, and 84% among all players.

(7) ORL-SDN can accommodate a large number of HAS players, with Table 8 describing this outcome.

(8) At each step, ORL-SDN achieves a low time complexity (<τ, the chunk duration) to find the optimal bitrate decision, with around 3.05s average and around 23.2s total average convergence time during a streaming session. This is because ORL-SDN solves the decision problem per cluster (only five) instead of per player, leveraging both the PDS and SAP aggregation processes. Thus, its computation cost is not affected by the number of players.

4.4 ORL-SDN Discussion

Two important issues occur in real-world settings, namely, large-scale networks and player mobility, and we discuss how ORL-SDN addresses both. First, with regard to large-scale networks, we performed many experiments with sets of 100, 200, 500, 1,000, and 2,000 heterogeneous HAS players sharing a 170, 480, 1,300, 3,000, and 5,000Mbps bottleneck link, with dynamically generated


Table 8. ORL-SDN Large-Scale Network and Players' Mobility Experiments
(each metric reported as MIN / MAX / MED)

100 players (total bandwidth = 170 Mbps, cross-traffic = rand(10..25) Mbps):
  AVG bitrate level (Kbps): 1,500 / 2,089 / 1,743; AVG quality (SSIMplus): 0.916 / 0.957 / 0.95; AVG N-QoE: 4.6 / 4.8 / 4.77; AVG instability index: 0.001 / 0.006 / 0.002; AVG unfairness index: 0.013 / 0.014 / 0.014; total bandwidth usage: 95% / 97% / 97%

200 players (total bandwidth = 480 Mbps, cross-traffic = rand(25..50) Mbps):
  AVG bitrate level (Kbps): 2,089 / 2,944 / 2,409; AVG quality (SSIMplus): 0.957 / 0.965 / 0.964; AVG N-QoE: 4.7 / 4.9 / 4.79; AVG instability index: 0.002 / 0.006 / 0.0035; AVG unfairness index: 0.025 / 0.037 / 0.03; total bandwidth usage: 90% / 100% / 95%

500 players (total bandwidth = 1,300 Mbps, cross-traffic = rand(50..75) Mbps):
  AVG bitrate level (Kbps): 1,980 / 2,944 / 2,089; AVG quality (SSIMplus): 0.956 / 0.965 / 0.957; AVG N-QoE: 4.1 / 4.6 / 4.53; AVG instability index: 0.0033 / 0.007 / 0.005; AVG unfairness index: 0.028 / 0.048 / 0.041; total bandwidth usage: 80% / 98% / 92%

1,000 players (total bandwidth = 3,000 Mbps, cross-traffic = rand(75..150) Mbps):
  AVG bitrate level (Kbps): 2,944 / 3,639 / 2,409; AVG quality (SSIMplus): 0.965 / 0.977 / 0.964; AVG N-QoE: 3.98 / 4.5 / 4.33; AVG instability index: 0.0039 / 0.009 / 0.0062; AVG unfairness index: 0.037 / 0.063 / 0.049; total bandwidth usage: 98% / 100% / 99%

2,000 players (total bandwidth = 5,000 Mbps, cross-traffic = rand(150..300) Mbps):
  AVG bitrate level (Kbps): 2,409 / 2,944 / 2,089; AVG quality (SSIMplus): 0.964 / 0.965 / 0.957; AVG N-QoE: 4 / 4.22 / 4.1; AVG instability index: 0.004 / 0.01 / 0.0082; AVG unfairness index: 0.045 / 0.067 / 0.05; total bandwidth usage: 97% / 100% / 98%

50 players (total bandwidth = 80 Mbps, cross-traffic = rand(10..25) Mbps, arrival time ΔT = (20..120) s):
  AVG bitrate level (Kbps): 1,009 / 2,089 / 1,500; AVG quality (SSIMplus): 0.909 / 0.957 / 0.933; AVG N-QoE: 4 / 4.75 / 4.38; AVG instability index: 0.005 / 0.009 / 0.007; AVG unfairness index: 0.019 / 0.022 / 0.021; total bandwidth usage: 80% / 100% / 95%


cross-traffic. Table 8 rows 1 to 5 highlight the performance of ORL-SDN in these experiments and show that it achieves high quality stability, fairness, and bandwidth utilization, with very satisfying viewer QoE and perceptual quality. We note that our SDN-based architecture offers real-time network and HAS-player-related information to network operators and content providers, which can be used for analytics and fault isolation purposes. Second, we evaluated ORL-SDN's efficiency with player mobility (i.e., joins and leaves) and performed a test scenario with 50 heterogeneous HAS players sharing an 80Mbps bottleneck link and the same parameter values as in the previous experiments. Groups of five players enter the system every 20 to 120 seconds (ΔT = 20, ..., 120 s). The average results are depicted in the last row of Table 8, showing that ORL-SDN performs well and achieves a higher and more stable bitrate level and video quality (an average improvement of 32%) and an excellent QoE with higher fairness (an average improvement of 30%), and provides more efficient bandwidth utilization (an average improvement of 33%) compared to BBA, SARA, QDASH, and dash.js. Also, this table shows that ORL-SDN reacts well to any increase in cross-traffic and the demands from HAS players.

5 CONCLUSIONS

This article introduces ORL-SDN, an online Q-learning-based optimizer in the context of an SDN-enabled HAS architecture. It eliminates HAS scalability issues while maximizing viewer satisfaction when multiple heterogeneous players compete for the available resources in a shared network. It works with HAS and HAS-like systems such as DASH and with various bitrate adaptation heuristics and schemes. ORL-SDN uses an SSIMplus-based clustering criterion that combines three features (i.e., DR, CT, and SPT) to generate a virtual network topology and group the competing players into a small number of clusters. The bitrate decision problem is then formulated as a POMDP where the objective is to maximize the QoE by finding the optimal actions (i.e., bitrate decisions) and minimizing the number of quality oscillations, stalls, and startup delays. To reduce the computational cost, avoid suboptimal actions, speed up the decisions, and accurately estimate the network and player state variables for the next few steps, ORL-SDN leverages the SAP, PDS, and fastMPC techniques. As future work, we plan to deploy ORL-SDN in a large-scale SDN-enabled network with multiple bottlenecks and different types of cross-traffic dynamics. Furthermore, we aim to extend our cluster mapping model to support a dynamic number of clusters. We also plan to integrate new bitrate decision, bandwidth, and QoE estimator systems [26, 57] that use machine-learning algorithms, and to employ advanced RL-based techniques such as neural networks and deep RL for accurate state variable estimation and optimal decisions.

REFERENCES

[1] Saamer Akhshabi, Lakshmi Anantakrishnan, Ali C. Begen, and Constantine Dovrolis. 2012. What happens when

HTTP adaptive streaming players compete for bandwidth? In Proceedings of the 22Nd International Workshop on

Network and Operating System Support for Digital Audio and Video (NOSSDAV’12). ACM, New York, NY, USA, 9–14.

DOI:http://dx.doi.org/10.1145/2229087.2229092

[2] Ahsan Arefin, Raoul Rivas, Rehana Tabassum, and Klara Nahrstedt. 2013. OpenSession: SDN-based cross-layer

multi-stream management protocol for 3D teleimmersion. In 21st IEEE International Conference on Network Proto-

cols (ICNP’13). 1–10. DOI:http://dx.doi.org/10.1109/ICNP.2013.6733616

[3] Abdelhak Bentaleb, Ali C. Begen, and Roger Zimmermann. 2016. SDNDASH: Improving QoE of HTTP adaptive

streaming using software defined networking. In Proceedings of the 2016 ACM on Multimedia Conference (MM’16).

ACM, New York, NY, USA, 1296–1305. DOI:http://dx.doi.org/10.1145/2964284.2964332

[4] Niels Bouten, Ricardo de O’Schmidt, Jeroen Famaey, Steven Latré, Aiko Pras, and Filip De Turck. 2015. QoE-driven

in-network optimization for adaptive video streaming based on packet sampling measurements. Computer Networks

81, C (2015), 96–115.

[5] Valentín Carela-Español, Pere Barlet-Ros, Albert Cabellos-Aparicio, and Josep Solé-Pareta. 2011. Analysis of the im-

pact of sampling on NetFlow traffic classification. Computer Networks 55, 5 (2011), 1083–1099.


[6] Federico Chiariotti, Stefano D’Aronco, Laura Toni, and Pascal Frossard. 2016. Online learning adaptation strategy for

DASH clients. In Proceedings of the 7th International Conference on Multimedia Systems (MMSys’16). ACM, New York,

NY, USA, Article 8, 12 pages. DOI:http://dx.doi.org/10.1145/2910017.2910603

[7] Maxim Claeys, Steven Latré, Jeroen Famaey, Tingyao Wu, Werner Van Leekwijck, and Filip De Turck. 2013. Design

of a Q-learning-based client quality selection algorithm for HTTP adaptive video streaming. In Proceedings of the

Adaptive and Learning Agents Workshop, part of AAMAS2013. 30–37.

[8] Maxim Claeys, Steven Latré, Jeroen Famaey, Tingyao Wu, Werner Van Leekwijck, and Filip De Turck. 2014. Design

and optimisation of a (FA)Q-learning-based HTTP adaptive streaming client. Connection Science 26, 1 (2014), 25–43.

[9] DASH-IF. 2017. Guidelines for Implementation: DASH-AVC/264 Test cases and Vectors. Retrieved from https://goo.

gl/NhJcui (accessed June 5, 2017).

[10] Dash Industry Forum. 2017. DASH-264 JavaScript Reference Client. Retrieved from https://goo.gl/yd8rrt (accessed

March 30, 2017).

[11] Johan De Vriendt, Danny De Vleeschauwer, and David Robinson. 2013. Model for estimating QoE of video deliv-

ered using HTTP adaptive streaming. In 2013 IFIP/IEEE International Symposium on Integrated Network Management

(IM’13). 1288–1293.

[12] Giorgos Dimopoulos, Ilias Leontiadis, Pere Barlet-Ros, and Konstantina Papagiannaki. 2016. Measuring video QoE

from encrypted traffic. In Proceedings of the 2016 Internet Measurement Conference (IMC’16). ACM, New York, NY,

USA, 513–526. DOI:http://dx.doi.org/10.1145/2987443.2987459

[13] Zhengfang Duanmu, Kai Zeng, Kede Ma, Abdul Rehman, and Zhou Wang. 2017. A quality-of-experience index for

streaming video. IEEE Journal of Selected Topics in Signal Processing 11, 1 (2017), 154–166. DOI:http://dx.doi.org/10.

1109/JSTSP.2016.2608329

[14] Marcus Eckert and Thomas Martin Knoll. 2013. QoE management framework for internet services in SDN enabled

mobile networks. In Meeting of the European Network of Universities and Companies in Information and Communication

Engineering. Springer, 112–123.

[15] Zhengzhu Feng and E. Hansen. 2004. An approach to state aggregation for POMDPs. In AAAI-04 Workshop on Learn-

ing and Planning in Markov Processes–Advances and Challenges. 7–12.

[16] Markus Fiedler, Tobias Hossfeld, and Phuoc Tran-Gia. 2010. A quantitative relationship between quality of experience

and quality of service. IEEE Network 24, 2 (2010), 36–41.

[17] Aditya Ganjam, Faisal Siddiqui, Jibin Zhan, Xi Liu, Ion Stoica, Junchen Jiang, Vyas Sekar, and Hui Zhang. 2015. C3:

Internet-scale control plane for video quality optimization. In Proceedings of the 12th USENIX Conference on Networked

Systems Design and Implementation (NSDI’15). USENIX Association, Berkeley, CA, USA, 131–144. http://dl.acm.org/

citation.cfm?id=2789770.2789780

[18] Panagiotis Georgopoulos, Yehia Elkhatib, Matthew Broadbent, Mu Mu, and Nicholas Race. 2013. Towards network-

wide QoE fairness using openflow-assisted adaptive video streaming. In Proceedings of the 2013 ACM SIGCOMM-

Workshop on Future Human-centric Multimedia Networking (FhMN’13). ACM, New York, NY, USA, 15–20. DOI:http:

//dx.doi.org/10.1145/2491172.2491181

[19] Simon Haykin. 1998. Neural Networks: A Comprehensive Foundation (2nd ed.). Prentice Hall PTR, Upper Saddle River,

NJ, USA.

[20] Victor Heorhiadi, Michael K. Reiter, and Vyas Sekar. 2016. Simplifying software-defined network optimization using

SOL. In Proceedings of the 13th Usenix Conference on Networked Systems Design and Implementation (NSDI’16). USENIX

Association, Berkeley, CA, USA, 223–237. http://dl.acm.org/citation.cfm?id=2930611.2930627

[21] Te-Yuan Huang, Nikhil Handigol, Brandon Heller, Nick McKeown, and Ramesh Johari. 2012. Confused, timid, and

unstable: Picking a video streaming rate is hard. In Proceedings of the 2012 Internet Measurement Conference (IMC’12).

ACM, New York, NY, USA, 225–238. DOI:http://dx.doi.org/10.1145/2398776.2398800

[22] Te-Yuan Huang, Ramesh Johari, Nick McKeown, Matthew Trunnell, and Mark Watson. 2015. A buffer-based approach

to rate adaptation: Evidence from a large video streaming service. In Proceedings of the 2014 ACM Conference on

SIGCOMM (SIGCOMM’14). ACM, New York, NY, USA, 187–198. DOI:http://dx.doi.org/10.1145/2619239.2626296

[23] Milosz Marian Hulboj and Ryszard Erazm Jurga. 2007. Packet sampling and network monitoring. Retrieved from

https://bit.ly/2mTg88B (accessed June 15, 2016).

[24] InMon. 2004. sFlow. Retrieved from http://www.sflow.org/ (accessed December 25, 2016).

[25] Raj Jain, Dah-Ming Chiu, and William R. Hawe. 1984. A Quantitative Measure of Fairness and Discrimination for

Resource Allocation in Shared Computer System. Vol. 38. Eastern Research Laboratory, Digital Equipment Corporation,

Hudson, MA.

[26] Junchen Jiang, Vyas Sekar, Henry Milner, Davis Shepherd, Ion Stoica, and Hui Zhang. 2016. CFA: A practical predic-

tion system for video QoE optimization. In Proceedings of the 13th Usenix Conference on Networked Systems Design

and Implementation (NSDI’16). USENIX Association, Berkeley, CA, USA, 137–150. http://dl.acm.org/citation.cfm?id=

2930611.2930621


[27] Junchen Jiang, Vyas Sekar, and Hui Zhang. 2012. Improving fairness, efficiency, and stability in HTTP-based adaptive

video streaming with FESTIVE. In Proceedings of the 8th International Conference on Emerging Networking Experiments

and Technologies (CoNEXT’12). ACM, New York, NY, USA, 97–108. DOI:http://dx.doi.org/10.1145/2413176.2413189

[28] Junchen Jiang, Shijie Sun, Vyas Sekar, and Hui Zhang. 2017. Pytheas: Enabling data-driven quality of experience

optimization using group-based exploration-exploitation. In Proceedings of the 14th USENIX Conference on Networked

Systems Design and Implementation (NSDI’17). USENIX Association, Berkeley, CA, USA, 393–406. http://dl.acm.org/

citation.cfm?id=3154630.3154662

[29] Parikshit Juluri, Venkatesh Tamarapalli, and Deep Medhi. 2015. SARA: Segment aware rate adaptation algorithm for

dynamic adaptive streaming over HTTP. In IEEE International Conference on Communication Workshop (ICCW’15).

1765–1770. DOI:http://dx.doi.org/10.1109/ICCW.2015.7247436

[30] Jan Willem Kleinrouweler, Sergio Cabrero, and Pablo Cesar. 2016. Delivering stable high-quality video: An SDN

architecture with DASH assisting network elements. In Proceedings of the 7th International Conference on Multimedia

Systems (MMSys’16). ACM, New York, NY, USA, Article 4, 10 pages. DOI:http://dx.doi.org/10.1145/2910017.2910599

[31] Diego Kreutz, Fernando M. V. Ramos, P. Esteves Verissimo, C. Esteve Rothenberg, Siamak Azodolmolky, and Steve

Uhlig. 2015. Software-defined networking: A comprehensive survey. Proceedings of the IEEE 103, 1 (2015), 14–76.

[32] Bob Lantz and Brian O’Connor. 2015. Mininet. Retrieved from http://mininet.org/ (accessed January 20, 2017).

[33] Stefan Lederer, Christopher Müller, and Christian Timmerer. 2012. Dynamic adaptive streaming over HTTP dataset.

In Proceedings of the 3rd Multimedia Systems Conference (MMSys’12). ACM, New York, NY, USA, 89–94. DOI:http:

//dx.doi.org/10.1145/2155555.2155570

[34] Zhi Li, Xiaoqing Zhu, Joshua Gahm, Rong Pan, Hao Hu, Ali C. Begen, and David Oran. 2014. Probe and adapt: Rate

adaptation for HTTP video streaming at scale. IEEE Journal on Selected Areas in Communications 32, 4 (2014), 719–733.

[35] Xi Liu, Florin Dobrian, Henry Milner, Junchen Jiang, Vyas Sekar, Ion Stoica, and Hui Zhang. 2012. A case for a coor-

dinated internet video control plane. In Proceedings of the ACM SIGCOMM 2012 Conference on Applications, Technolo-

gies, Architectures, and Protocols for Computer Communication (SIGCOMM’12). ACM, New York, NY, USA, 359–370.

DOI:http://dx.doi.org/10.1145/2342356.2342431

[36] Xiaomei Liu and Li Xiao. 2007. A survey of multihoming technology in stub networks: Current research and open

issues. Network 21, 3 (2007), 32–40.

[37] Nicholas Mastronarde and Mihaela van der Schaar. 2011. Fast reinforcement learning for energy-efficient wireless

communication. IEEE Transactions on Signal Processing 59, 12 (2011), 6262–6266.

[38] Nick McKeown, Tom Anderson, Hari Balakrishnan, Guru Parulkar, Larry Peterson, Jennifer Rexford, Scott Shenker,

and Jonathan Turner. 2008. OpenFlow: Enabling innovation in campus networks. In SIGCOMM Comput. Commun.

Rev. 38, 2 (2008), 69–74. DOI:http://dx.doi.org/10.1145/1355734.1355746

[39] Ricky K. P. Mok, Xiapu Luo, Edmond W. W. Chan, and Rocky K. C. Chang. 2012. QDASH: a QoE-aware DASH system.

In Proceedings of the 3rd Multimedia Systems Conference (MMSys’12). ACM, New York, NY, USA, 11–22. DOI:http:

//dx.doi.org/10.1145/2155555.2155558

[40] Mu Mu, Matthew Broadbent, Arsham Farshad, Nicholas Hart, David Hutchison, Qiang Ni, and Nicholas Race. 2016. A

scalable user fairness model for adaptive video streaming over SDN-assisted future networks. IEEE Journal on Selected

Areas in Communications 34, 8 (2016), 2168–2184.

[41] Matthew K. Mukerjee, David Naylor, Junchen Jiang, Dongsu Han, Srinivasan Seshan, and Hui Zhang. 2015. Practical,

real-time centralized control for CDN-based live video delivery. In Proceedings of the 2015 ACM Conference on Special

Interest Group on Data Communication (SIGCOMM’15). ACM, New York, NY, USA, 311–324. DOI:http://dx.doi.org/10.

1145/2785956.2787475

[42] Yustus Eko Oktian, SangGon Lee, HoonJae Lee, and JunHuy Lam. 2017. Distributed SDN controller system: A survey

on design choice. Computer Networks 121 (2017), 100–111.

[43] Athanasios Papoulis and S. Unnikrishna Pillai. 2002. Probability, Random Variables, and Stochastic Processes. Tata

McGraw-Hill Education.

[44] Stefano Petrangeli, Maxim Claeys, Steven Latré, Jeroen Famaey, and Filip De Turck. 2014. A multi-agent Q-learning-

based framework for achieving fairness in HTTP adaptive streaming. In IEEE Network Operations and Management

Symposium (NOMS’14). 1–9. DOI:http://dx.doi.org/10.1109/NOMS.2014.6838245

[45] Stefano Petrangeli, Jeroen Famaey, Maxim Claeys, Steven Latré, and Filip De Turck. 2016. QoE-driven rate adapta-

tion heuristic for fair adaptive video streaming. ACM Transactions on Multimedia Computing, Communications, and

Applications 12, 2 (2016), 28.

[46] Stefano Petrangeli, Tim Wauters, Rafael Huysegems, Tom Bostoen, and Filip De Turck. 2016. Software-defined

network-based prioritization to avoid video freezes in HTTP adaptive streaming. Netw. 26, 4 (2016), 248–268.

[47] Warren B. Powell. 2009. What you should know about approximate dynamic programming. NRL 56, 3 (2009), 239–249.

[48] PyPI. 2017. The Python Package Index. Retrieved from https://goo.gl/635J3x (accessed August 25, 2017).

[49] Abdul Rehman, Kai Zeng, and Zhou Wang. 2015. Display device-adapted video quality-of-experience assessment. In

SPIE/IS&T Electronic Imaging. Int. Society for Optics and Photonics, 939406–939406.


[50] Martin Riedmiller. 2005. Neural fitted Q iteration–first experiences with a data efficient neural reinforcement learn-

ing method. In Proceedings of the 16th European Conference on Machine Learning (ECM’05). Springer-Verlag, Berlin,

Heidelberg, 317–328. DOI:http://dx.doi.org/10.1007/1156409632

[51] RYU SDN Community. 2015. RYU SDN Framework. Retrieved from https://osrg.github.io/ryu/ (accessed December

25, 2016).

[52] Sandvine. 2015. Deep Packet Inspection (DPI). Retrieved from https://goo.gl/2Ms8bH (accessed April 10, 2017).

[53] Sandvine. 2016. Video Quality of Experience: Requirements and Considerations for Meaningful Insight. White Paper.

[54] Justine Sherry, Chang Lan, Raluca Ada Popa, and Sylvia Ratnasamy. 2015. Blindbox: Deep Packet Inspection over

Encrypted Traffic. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication (SIG-

COMM’15). ACM, New York, NY, USA, 213–226. DOI:http://dx.doi.org/10.1145/2785956.2787502

[55] SSIMWave. 2015. SSIMWave’s Video QoE Monitor. Retrieved from https://goo.gl/u7pG45 (accessed June 11, 2016).

[56] Thomas Stockhammer. 2011. Dynamic adaptive streaming over HTTP: Standards and design principal. In Proceed-

ings of the Second Annual ACM Conference on Multimedia Systems (MMSys’11). ACM, New York, NY, USA, 133–144.

DOI:http://dx.doi.org/10.1145/1943552.1943572

[57] Yi Sun, Xiaoqi Yin, Junchen Jiang, Vyas Sekar, Fuyuan Lin, Nanshu Wang, Tao Liu, and Bruno Sinopoli. 2016. Cs2p: Im-

proving video bitrate selection and adaptation with data-driven throughput prediction. In Proceedings of the 2016 ACM

SIGCOMM Conference (SIGCOMM’16). ACM, New York, NY, USA, 272–285. DOI:http://dx.doi.org/10.1145/2934872.

2934898

[58] Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement Learning: An Introduction. Vol. 1. MIT Press, Cambridge.

[59] Emmanuel Thomas, M. O. van Deventer, Thomas Stockhammer, Ali C. Begen, and Jeroen Famaey. 2017. Enhanc-

ing MPEG DASH Performance via Server and Network Assistance. SMPTE Motion Imaging Journal 126, 1 (2017),

22–27.

[60] Michel Tokic and Günther Palm. 2011. Value-difference based exploration: Adaptive control between Epsilon-Greedy

and Softmax. In KI. Springer, 335–346.

[61] Jeroen van der Hooft, Stefano Petrangeli, Maxim Claeys, Jeroen Famaey, and Filip De Turck. 2015. A learning-based

algorithm for improved bandwidth-awareness of adaptive streaming clients. In IFIP/IEEE International Symposium on

Integrated Network Management (IM’15). 131–138. DOI:http://dx.doi.org/10.1109/INM.2015.7140285

[62] Yang Wang and Stephen Boyd. 2010. Fast Model Predictive Control using Online Optimization. IEEE Transactions on

Control Systems Technology 18, 2 (2010), 267–278.

[63] Dapeng Wu, Yiwei Thomas Hou, Wenwu Zhu, Ya-Qin Zhang, and Jon M. Peha. 2001. Streaming video over the

Internet: Approaches and directions. IEEE Transactions on Circuits and Systems for Video Technology 11, 3 (2001),

282–300.

[64] Xiaoqi Yin, Abhishek Jindal, Vyas Sekar, and Bruno Sinopoli. 2015. A control-theoretic approach for dynamic adaptive

video streaming over HTTP. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communi-

cation (SIGCOMM’15). ACM, New York, NY, USA, 325–338. DOI:http://dx.doi.org/10.1145/2785956.2787486

[65] Chao Zhou, Chia-Wen Lin, and Zongming Guo. 2016. mDASH: A Markov decision-based rate adaptation approach

for dynamic HTTP streaming. IEEE Transactions on Multimedia 18, 4 (2016), 738–751.

[66] Wei Zhou, Li Li, Min Luo, and Wu Chou. 2014. REST API design patterns for SDN northbound API. In 2014 28th

International Conference on Advanced Information Networking and Applications Workshops. 358–365. DOI:http://dx.

doi.org/10.1109/WAINA.2014.153

Received September 2017; revised February 2018; accepted April 2018
