
Optimized Computation Offloading Performance in Virtual Edge Computing Systems via Deep Reinforcement Learning

Xianfu Chen, Member, IEEE, Honggang Zhang, Senior Member, IEEE, Celimuge Wu, Member, IEEE, Shiwen Mao, Senior Member, IEEE, Yusheng Ji, Senior Member, IEEE, and Mehdi Bennis, Senior Member, IEEE

Abstract—To improve the quality of computation experience for mobile devices, mobile-edge computing (MEC) is a promising paradigm by providing computing capabilities in close proximity within a sliced radio access network (RAN), which supports both traditional communication and MEC services. Nevertheless, the design of computation offloading policies for a virtual MEC system remains challenging. Specifically, whether to execute a computation task at the mobile device or to offload it for MEC server execution should adapt to the time-varying network dynamics. This paper considers MEC for a representative mobile user in an ultradense sliced RAN, where multiple base stations (BSs) are available to be selected for computation offloading. The problem of solving an optimal computation offloading policy is modeled as a Markov decision process, where our objective is to maximize the long-term utility performance whereby an offloading decision is made based on the task queue state, the energy queue state as well as the channel qualities between mobile user and BSs. To break the curse of high dimensionality in state space, we first propose a double deep Q-network (DQN)-based strategic computation offloading algorithm to learn the optimal policy without knowing a priori knowledge of network dynamics. Then, motivated by the additive structure of the utility function, a Q-function decomposition technique is combined with the double DQN, which leads to a novel learning algorithm for the solving of stochastic computation offloading. Numerical experiments show that our proposed learning algorithms achieve a significant improvement in computation offloading performance compared with the baseline policies.

Manuscript received May 14, 2018; revised July 24, 2018; accepted October 6, 2018. Date of publication October 16, 2018; date of current version June 19, 2019. This work was supported in part by the National Key Research and Development Program of China under Grant 2018YFB0803702, in part by the National Natural Science Foundation of China under Grant 61701439 and Grant 61731002, in part by the Zhejiang Key Research and Development Plan under Grant 2018C03056, in part by the National Science Foundation under Grant CNS-1702957, in part by the Wireless Engineering Research and Engineering Center at Auburn University, in part by JSPS KAKENHI under Grant JP16H02817, and in part by the Academy of Finland under Grant 289611. (Corresponding author: Xianfu Chen.)

X. Chen is with the VTT Technical Research Centre of Finland, 90570 Oulu, Finland (e-mail: [email protected]).

H. Zhang is with the College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China (e-mail: [email protected]).

C. Wu is with the Graduate School of Informatics and Engineering, University of Electro-Communications, Tokyo 182-8585, Japan (e-mail: [email protected]).

S. Mao is with the Department of Electrical and Computer Engineering, Auburn University, Auburn, AL 36849 USA (e-mail: [email protected]).

Y. Ji is with the Information Systems Architecture Research Division, National Institute of Informatics, Tokyo 101-8430, Japan (e-mail: [email protected]).

M. Bennis is with the Centre for Wireless Communications, University of Oulu, 90570 Oulu, Finland (e-mail: [email protected]).

Digital Object Identifier 10.1109/JIOT.2018.2876279

Index Terms—Deep reinforcement learning, Markov decision process (MDP), mobile-edge computing (MEC), network slicing, network virtualization, Q-function decomposition, radio access networks (RANs).

I. INTRODUCTION

WITH the proliferation of smart mobile devices, a multitude of mobile applications are emerging and gaining popularity, such as location-based virtual/augmented reality and online gaming [1]. However, mobile devices are in general resource-constrained, for example, the battery capacity and the local CPU computation power are limited. When executed at the mobile devices, the performance and quality-of-experience (QoE) of computation-intensive applications are significantly affected by the devices' limited computation capabilities. The tension between computation-intensive applications and resource-constrained mobile devices creates a bottleneck for having a satisfactory quality-of-service (QoS) and QoE, and is hence driving a revolution in computing infrastructure [2].

In contrast to cloud/fog computing [3], [4], mobile-edge computing (MEC) is envisioned as a promising paradigm, which provides computing capabilities within the radio access networks (RANs) in close proximity to mobile users (MUs) [5], [6]. By offloading computation tasks to the resource-rich MEC servers, not only can the computation QoS and QoE be greatly improved, but the capabilities of mobile devices can also be augmented for running a variety of resource-demanding applications. Recently, many efforts have been devoted to the design of computation offloading policies. Wang et al. [7] developed an alternating direction method of multipliers-based algorithm to solve the problem of revenue maximization by optimizing computation offloading decision, resource allocation, and content caching strategy. Hu et al. [8] proposed a two-phase-based method for joint power and time allocation when considering cooperative computation offloading in a wireless power transfer-assisted MEC system. Wang et al. [9] leveraged a Lagrangian duality method to minimize the total energy consumption in a computation latency constrained wireless powered multiuser MEC system.


For an MEC system, computation offloading requires wireless data transmission; hence, how to allocate wireless radio resources between the traditional communication service and the MEC service over a common RAN raises a series of technical challenges. Network slicing is a key enabler for RAN sharing, with which the traditional single ownership of network infrastructure and spectrum resources can be decoupled from the wireless services [10]. Consequently, the same physical network infrastructure is able to host multiple wireless service providers [11], [12]. In the literature, there exist several efforts investigating joint communication and computation resource management in such virtualized networks, which support both the traditional communication service and the MEC service [13], [14]. In this paper, we focus on designing optimal stochastic computation offloading policies in a sliced RAN, where a centralized network controller (CNC) is responsible for control-plane decisions on wireless radio resource orchestration over the traditional communication and MEC services.

The computation offloading policy designs in previous works [7], [8], [13]–[16] are mostly based on one-shot optimization and fail to characterize long-term computation offloading performance. In a virtual MEC system, the design of computation offloading policies should account for the environmental dynamics, such as the time-varying channel quality and the task arrival and energy status at a mobile device. Liu et al. [17] formulated the problem of delay-optimal computation task offloading under a Markov decision process (MDP) framework and developed an efficient 1-D search algorithm to find the optimal solution. However, the challenge lies in the dependence on statistical information of channel quality variations and computation task arrivals. Mao et al. [18] investigated a dynamic computation offloading policy for an MEC system with wireless energy harvesting-enabled mobile devices using a Lyapunov optimization technique. The same technique was adopted to study the power-delay tradeoff in the scenario of computation task offloading by Liu et al. [19] and Jiang and Mao [20]. The Lyapunov optimization can only construct an approximately optimal solution. Xu et al. [21] developed a reinforcement learning-based algorithm to learn the optimal computation offloading policy, which at the same time does not need a priori knowledge of network statistics.

When the MEC meets an ultradense sliced RAN, multiple base stations (BSs) with different data transmission qualities are available for offloading a computation task. In this context, the explosion in state space makes the conventional reinforcement learning algorithms [21]–[23] infeasible. Moreover, in this paper, wireless charging [24] is integrated into an MEC system, which on one hand achieves sustained computation performance but, on the other hand, makes the design of a stochastic computation offloading policy even more challenging. The main contributions in this paper are fourfold.

1) We formulate the stochastic computation offloading problem in a sliced RAN as an MDP, in which the time-varying communication qualities and computation resources are taken into account.

2) To deal with the curse of state space explosion, we resort to a deep neural network-based function approximator [25] and derive a double deep Q-network (DQN) [26] based reinforcement learning (DARLING) algorithm to learn the optimal computation offloading policy without any a priori knowledge of network dynamics.

3) By further exploring the additive structure of the utility function, we attain a novel online deep state-action-reward-state-action-based reinforcement learning algorithm (Deep-SARL) for the problem of stochastic computation offloading. To the best knowledge of the authors, this is the first work to combine a Q-function decomposition technique with the double DQN.

4) Numerical experiments based on TensorFlow [27] are conducted to verify the theoretical studies in this paper. The results show that both of our proposed online learning algorithms outperform three baseline schemes. In particular, the Deep-SARL algorithm achieves the best computation offloading performance.

Fig. 1. Illustration of MEC in a virtualized RAN, where the devices of MUs are wireless charging enabled, the radio resource is sliced between conventional communication services (the links in black color) and MEC services (the links in blue color), and a CNC is responsible for all control plane decisions over the network.

The rest of this paper is organized as follows. In the next section, we describe the system model and the assumptions made throughout this paper. In Section III, we formulate the problem of designing an optimal stochastic computation offloading policy as an MDP. We detail the derived online learning algorithms for stochastic computation offloading in a virtual MEC system in Section IV. To validate the proposed studies, we provide numerical experiments under various settings in Section V. Finally, we draw the conclusions in Section VI.

II. SYSTEM DESCRIPTIONS AND ASSUMPTIONS

As illustrated in Fig. 1, we shall consider in this paper an ultradense service area covered by a virtualized RAN with a set $\mathcal{B} = \{1, \ldots, B\}$ of BSs. Both traditional communication services and MEC services are supported over the common physical network infrastructure. An MEC server is implemented at the network edge, providing rich computing resources for the MUs. By strategically offloading the generated computation tasks via the BSs to the MEC server for execution, the MUs can expect a significantly improved computation experience. We assume that the wireless radio resources are divided into traditional communication and MEC slices to guarantee interslice isolation. All control plane operations happening in such a hybrid network are managed by the CNC.

The focus of this paper is to optimize computation performance from the perspective of the MUs, while the design of joint traditional communication and MEC resource allocation is left for our next-step investigation. In a dense networking area, our analysis hereinafter concentrates on a representative MU. The time horizon is discretized into decision epochs, each of which is of equal duration $\delta$ (in seconds) and is indexed by an integer $j \in \mathbb{N}_+$. Let $W$ (in Hz) denote the frequency bandwidth allocated to the MEC slice, which is shared among the MUs simultaneously accessing the MEC service.

This paper assumes that the mobile device of the MU is wireless charging enabled and the received energy can be stored in an energy queue. The computation tasks generated by the MU across the time horizon form an independent and identically distributed sequence of Bernoulli random variables with a common parameter $\lambda_{(t)} \in [0, 1]$. We denote $a^j_{(t)} \in \{0, 1\}$ as the task arrival indicator, that is, $a^j_{(t)} = 1$ if a computation task is generated from the MU during a decision epoch $j$ and otherwise $a^j_{(t)} = 0$. Then, $\Pr\{a^j_{(t)} = 1\} = 1 - \Pr\{a^j_{(t)} = 0\} = \lambda_{(t)}$, where $\Pr\{\cdot\}$ denotes the probability of the occurrence of an event. We represent a computation task by $(\mu, \nu)$ with $\mu$ and $\nu$ being, respectively, the input data size (in bits) and the total number of CPU cycles required to accomplish the task. A computation task generated at a current decision epoch can be executed starting from the next epoch. The generated but not yet processed computation tasks can be queued at the mobile device of the MU. Based on a first-in first-out principle, a computation task from the task queue can be scheduled for execution either locally on the mobile device or remotely at the MEC server. More specifically, at the beginning of each decision epoch $j$, the MU makes a joint control action $(c^j, e^j)$, where $c^j \in \{0\} \cup \mathcal{B}$ is the computation offloading decision and $e^j \in \mathbb{N}_+$ is the number of allocated energy units.¹ We have $c^j > 0$ if the MU chooses to offload the scheduled computation task to the MEC server via BS $c^j \in \mathcal{B}$ and $c^j = 0$ if the MU decides to execute the computation task locally on its own mobile device. Note that when $e^j = 0$, the queued tasks will not be executed.

When a computation task is scheduled for processing locally at the mobile device of the MU during a decision epoch $j$, i.e., $c^j = 0$, the allocated CPU-cycle frequency with $e^j > 0$ energy units can be calculated as
$$f^j = \sqrt{\frac{e^j}{\tau \cdot \nu}}$$
where $\tau$ is the effective switched capacitance that depends on the chip architecture of the mobile device [28]. Moreover, the CPU-cycle frequency is constrained by $f^j \leq f^{(\max)}_{(\mathrm{CPU})}$. Then the time needed for local computation task execution is given by
$$d^j_{(\mathrm{mobile})} = \frac{\nu}{f^j} \qquad (1)$$
which decreases as the number of allocated energy units increases.

¹ An energy unit corresponds to an amount of energy, say, $2 \cdot 10^{-3}$ J as in the numerical experiments.
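For illustration, the following is a minimal Python sketch of the local-execution delay in (1). The function name, the conversion from energy units to joules, and the numerical values in the example are assumptions made for exposition, not part of the paper.

```python
import math

ENERGY_UNIT = 2e-3  # joules per energy unit (cf. footnote 1); assumed conversion

def local_execution_delay(energy_units, nu, tau, f_max):
    """Local execution delay (seconds) for a nu-cycle task given an energy budget.

    Implements f^j = sqrt(e^j / (tau * nu)) capped at f_max, then d^j_mobile = nu / f^j, eq. (1).
    """
    if energy_units <= 0:
        return None  # the task is not executed when e^j = 0
    e_joules = energy_units * ENERGY_UNIT
    f = min(math.sqrt(e_joules / (tau * nu)), f_max)  # CPU-cycle frequency constraint
    return nu / f

# Illustrative numbers only: a 10^9-cycle task, tau = 1e-28, f_max = 2 GHz
print(local_execution_delay(energy_units=4, nu=1e9, tau=1e-28, f_max=2e9))
```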

We denote $g^j_b$ as the channel gain state between the MU and a BS $b \in \mathcal{B}$ during each decision epoch $j$, which independently picks a value from a finite state space $\mathcal{G}_b$. The channel state transitions across the time horizon are modeled as a finite-state discrete-time Markov chain. At the beginning of a decision epoch $j$, if the MU lets the MEC server execute the scheduled computation task on behalf of the mobile device, the input data of the task needs to be offloaded to the chosen BS $c^j \in \mathcal{B}$. The MU-BS association has to be first established. If the chosen BS $c^j$ is different from the previously associated one, a handover between the two BSs hence happens [29]. Denote $s^j \in \mathcal{B}$ as the MU-BS association state at a decision epoch $j$,²
$$s^j = b \cdot \mathbf{1}_{\{\{c^{j-1} = b,\, b \in \mathcal{B}\} \vee \{\{c^{j-1} = 0\} \wedge \{s^{j-1} = b\}\}\}}$$
where the symbols $\vee$ and $\wedge$ mean "logic OR" and "logic AND," respectively, and $\mathbf{1}_{\{\cdot\}}$ is the indicator function that equals 1 if the condition in the braces is satisfied and otherwise 0. We assume that the energy consumption during the handover procedure is negligible at the mobile device. In our considered dense networking scenario, the achievable data rate can be written as
$$r^j = W \cdot \log_2\left(1 + I^{-1} \cdot g^j_b \cdot p^j_{(\mathrm{tr})}\right) \qquad (2)$$
where $I$ is the received average power of interference plus additive background Gaussian noise and
$$p^j_{(\mathrm{tr})} = \frac{e^j}{d^j_{(\mathrm{tr})}}$$
is the transmit power with
$$d^j_{(\mathrm{tr})} = \frac{\mu}{r^j} \qquad (3)$$
being the time of transmitting the task input data. The transmit power is constrained by the maximum transmit power of the mobile device $p^{(\max)}_{(\mathrm{tr})}$ [30], i.e.,
$$p^j_{(\mathrm{tr})} \leq p^{(\max)}_{(\mathrm{tr})}. \qquad (4)$$
In (3) above, we assume that the energy is evenly assigned to the input data bits of the computation task. In other words, the transmission rate keeps unchanged during the input data transmission. Lemma 1 ensures that $d^j_{(\mathrm{tr})}$ is the minimum transmission time given the allocated energy units $e^j > 0$.

Lemma 1: Given the computation offloading decision $c^j \in \mathcal{B}$ and the allocated energy units $e^j > 0$ at a decision epoch $j$, the optimal transmission policy achieving the minimum transmission time is a policy with which the rate of transmitting the task input data remains a constant.

Proof: The proof is given in Appendix A.

Lemma 2: Given the computation offloading decision $c^j \in \mathcal{B}$ at a decision epoch $j$, the input data transmission time $d^j_{(\mathrm{tr})}$ is a monotonically decreasing function of the allocated energy units $e^j > 0$.

Proof: The proof is given in Appendix B.

² We assume that if the MU processes a computation task locally or no task is executed at a decision epoch $j - 1$, then the MU-BS association does not change, namely, $s^j = s^{j-1}$. In this case, no handover will be triggered.
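Because the rate in (2) depends on the transmit power $p^j_{(\mathrm{tr})} = e^j / d^j_{(\mathrm{tr})}$ while (3) defines $d^j_{(\mathrm{tr})} = \mu / r^j$, the transmission time is only given implicitly. The sketch below, a reading aid rather than the authors' code, solves the coupled relations (2)-(4) numerically by bisecting on the rate; the function name, the joules-based energy argument, and the example numbers are assumptions.

```python
import math

def transmission_time(e_joules, mu, W, g, I, p_max):
    """Solve r = W*log2(1 + g*(e/d)/I) with d = mu/r for the transmission time d, cf. (2)-(3).

    Returns None if only the trivial rate r = 0 solves the fixed point, or if the implied
    transmit power e/d would exceed p_max (constraint (4)).
    """
    c = g * e_joules / (I * mu)            # substituting p = e*r/mu gives r = W*log2(1 + c*r)
    f = lambda r: W * math.log2(1.0 + c * r) - r

    if W * c / math.log(2.0) <= 1.0:       # slope at r = 0 too small: no positive root
        return None
    r_lo, r_hi = 1e-9, 1.0
    while f(r_hi) > 0.0:                   # grow the bracket until f changes sign
        r_hi *= 2.0
    for _ in range(200):                   # bisection on the nontrivial root
        r_mid = 0.5 * (r_lo + r_hi)
        if f(r_mid) > 0.0:
            r_lo = r_mid
        else:
            r_hi = r_mid
    d_tr = mu / r_lo                       # eq. (3)
    return d_tr if e_joules / d_tr <= p_max else None

# Illustrative numbers only: one energy unit (2e-3 J), 5000-bit input, W = 0.6 MHz
print(transmission_time(e_joules=2e-3, mu=5000, W=6e5, g=1e-6, I=1e-7, p_max=2.0))
```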


In addition, we assume in this paper that the battery capacity at the mobile device of the MU is limited and the received energy units across the time horizon take integer values. Let $q^j_{(e)}$ be the energy queue length of the MU at the beginning of a decision epoch $j$, which evolves according to
$$q^{j+1}_{(e)} = \min\left\{q^j_{(e)} - e^j + a^j_{(e)},\ q^{(\max)}_{(e)}\right\}$$
where $q^{(\max)}_{(e)} \in \mathbb{N}_+$ denotes the battery capacity limit and $a^j_{(e)} \in \mathbb{N}_+$ is the number of energy units received by the end of decision epoch $j$.

III. PROBLEM FORMULATION

In this section, we shall first formulate the problem of stochastic computation offloading within the MDP framework and then discuss the optimal solutions.

A. Stochastic Computation Task Offloading

The experienced delay is the key performance indicator for evaluating the quality of a task computing experience. The delay of a computation task is defined as the period of time from when the task arrives at the computation task queue to when the task is successfully removed from the task queue. Thus, the experienced delay includes the computation task execution delay and the task queuing delay. We assume that there is a delay of $\zeta$ seconds for control signalling during the occurrence of one handover. With a joint control action $(c^j, e^j)$ at a decision epoch $j$, the handover delay can then be given as
$$h^j = \zeta \cdot \mathbf{1}_{\{\{c^j \in \mathcal{B}\} \wedge \{c^j \neq s^j\}\}}. \qquad (5)$$
According to (1), (3), and (5), we obtain the task execution delay as³
$$d^j = \begin{cases} d^j_{(\mathrm{mobile})}, & \text{if } e^j > 0 \text{ and } c^j = 0 \\ h^j + d^j_{(\mathrm{tr})} + d_{(\mathrm{server})}, & \text{if } e^j > 0 \text{ and } c^j \in \mathcal{B} \\ 0, & \text{if } e^j = 0 \end{cases}$$
where $d_{(\mathrm{server})}$ is the time consumed for task execution at the MEC server. Due to the sufficient available computation resource at the MEC server, we assume that $d_{(\mathrm{server})}$ is a sufficiently small constant.

Notice that if: 1) the MU fails to process a computation task at the mobile device within one decision epoch or 2) a computation task is scheduled for MEC server execution but the computation result cannot be sent back via the chosen BS within the decision epoch, the task execution fails and the task will remain in the queue until being successfully executed. The dynamics of the computation task queue at the MU can hence be expressed as
$$q^{j+1}_{(t)} = \min\left\{q^j_{(t)} - \mathbf{1}_{\{0 < d^j \leq \delta\}} + a^j_{(t)},\ q^{(\max)}_{(t)}\right\}$$
where $q^j_{(t)}$ is the number of computation tasks in the queue at the beginning of each decision epoch $j$ and $q^{(\max)}_{(t)} \in \mathbb{N}_+$ limits the maximum number of computation tasks that can be queued at the mobile device. There will be computation task drops once the task queue is full. We let
$$\eta^j = \max\left\{q^j_{(t)} - \mathbf{1}_{\{0 < d^j \leq \delta\}} + a^j_{(t)} - q^{(\max)}_{(t)},\ 0\right\}$$
define a computation task drop.

³ In this paper, we assume that the BSs are connected to the MEC server via fibre links. Hence, the round-trip delay between the BSs and the MEC server is negligible. Further, we neglect the time overhead for the selected BS to send back the computation result due to the fact that the size of a computation outcome is much smaller than the input data of a computation task [31].

If a computation task remains in the queue for a decision epoch, a delay of $\delta$ seconds will be incurred to the task. We treat the queuing delay during a decision epoch $j$ equivalently as the length of the task queue, that is,
$$\rho^j = q^j_{(t)} - \mathbf{1}_{\{d^j > 0\}}.$$
As previously discussed, if $d^j > \delta$, the execution of a computation task fails. In this case, the MU receives a penalty, which is defined by
$$\varphi^j = \mathbf{1}_{\{d^j > \delta\}}.$$
Moreover, a payment is required for the access to the MEC service when the MU decides to offload a computation task for MEC server execution. The payment is assumed to be proportional to the time consumed for transmitting and processing the task input data. That is, the payment can be calculated as
$$\phi^j = \pi \cdot \left(\min\{d^j, \delta\} - h^j\right) \cdot \mathbf{1}_{\{c^j \in \mathcal{B}\}}$$
where $\pi \in \mathbb{R}_+$ is the price paid for the MEC service per unit of time.

The network state of the MU during each decision epoch $j$ can be characterized by $\chi^j = (q^j_{(t)}, q^j_{(e)}, s^j, \mathbf{g}^j) \in \mathcal{X} \stackrel{\mathrm{def}}{=} \{0, 1, \ldots, q^{(\max)}_{(t)}\} \times \{0, 1, \ldots, q^{(\max)}_{(e)}\} \times \mathcal{B} \times \{\times_{b \in \mathcal{B}} \mathcal{G}_b\}$, where $\mathbf{g}^j = (g^j_b : b \in \mathcal{B})$. At the beginning of epoch $j$, the MU decides a joint task offloading and energy allocation decision $(c^j, e^j) \in \mathcal{Y} \stackrel{\mathrm{def}}{=} \{\{0\} \cup \mathcal{B}\} \times \{0, 1, \ldots, q^{(\max)}_{(e)}\}$⁴ according to the stationary control policy defined by Definition 1. In line with the discussions, we define an immediate utility at epoch $j$ to quantify the task computation experience for the MU
$$u(\chi^j, (c^j, e^j)) = \omega_1 \cdot u^{(1)}(\min\{d^j, \delta\}) + \omega_2 \cdot u^{(2)}(\eta^j) + \omega_3 \cdot u^{(3)}(\rho^j) + \omega_4 \cdot u^{(4)}(\varphi^j) + \omega_5 \cdot u^{(5)}(\phi^j) \qquad (6)$$
where the positive monotonically decreasing functions $u^{(1)}(\cdot)$, $u^{(2)}(\cdot)$, $u^{(3)}(\cdot)$, $u^{(4)}(\cdot)$, and $u^{(5)}(\cdot)$ measure the satisfactions of the task execution delay, the computation task drops, the task queuing delay, the penalty of failing to execute a computation task, and the payment of accessing the MEC service, and $\omega_1, \omega_2, \omega_3, \omega_4, \omega_5 \in \mathbb{R}_+$ are the weights that combine different types of functions with different units into a universal utility function. With slight abuse of notation, we rewrite $u^{(1)}(\cdot)$, $u^{(2)}(\cdot)$, $u^{(3)}(\cdot)$, $u^{(4)}(\cdot)$, and $u^{(5)}(\cdot)$ as $u^{(1)}(\chi^j, (c^j, e^j))$, $u^{(2)}(\chi^j, (c^j, e^j))$, $u^{(3)}(\chi^j, (c^j, e^j))$, $u^{(4)}(\chi^j, (c^j, e^j))$, and $u^{(5)}(\chi^j, (c^j, e^j))$.

⁴ To keep what follows uniform, we do not exclude the infeasible joint actions.


Definition 1 (Joint Task Offloading and Energy Allocation Control Policy): A stationary joint task offloading and energy allocation control policy $\Phi$ is defined as a mapping $\Phi: \mathcal{X} \rightarrow \mathcal{Y}$. More specifically, the MU determines a joint control action $\Phi(\chi^j) = (\Phi_{(c)}(\chi^j), \Phi_{(e)}(\chi^j)) = (c^j, e^j) \in \mathcal{Y}$ according to $\Phi$ after observing the network state $\chi^j \in \mathcal{X}$ at the beginning of each decision epoch $j$, where $\Phi = (\Phi_{(c)}, \Phi_{(e)})$ with $\Phi_{(c)}$ and $\Phi_{(e)}$ being, respectively, the stationary task offloading and energy allocation policies.

Given a stationary control policy $\Phi$, $\{\chi^j : j \in \mathbb{N}_+\}$ is a controlled Markov chain with the following state transition probability:
$$\Pr\{\chi^{j+1} | \chi^j, \Phi(\chi^j)\} = \Pr\{q^{j+1}_{(t)} | q^j_{(t)}, \Phi(\chi^j)\} \cdot \Pr\{q^{j+1}_{(e)} | q^j_{(e)}, \Phi(\chi^j)\} \cdot \Pr\{s^{j+1} | s^j, \Phi(\chi^j)\} \cdot \prod_{b \in \mathcal{B}} \Pr\{g^{j+1}_b | g^j_b\}.$$

Taking the expectation with respect to the per-epoch utilities $\{u(\chi^j, \Phi(\chi^j)) : j \in \mathbb{N}_+\}$ over the sequence of network states $\{\chi^j : j \in \mathbb{N}_+\}$, the expected long-term utility of the MU conditioned on an initial network state $\chi^1$ can be expressed as
$$V(\chi, \Phi) = \mathbb{E}_{\Phi}\left[(1 - \gamma) \cdot \sum_{j=1}^{\infty} (\gamma)^{j-1} \cdot u(\chi^j, \Phi(\chi^j)) \,\Big|\, \chi^1 = \chi\right] \qquad (7)$$
where $\chi = (q_{(t)}, q_{(e)}, s, \mathbf{g}) \in \mathcal{X}$, $\mathbf{g} = (g_b : b \in \mathcal{B})$, $\gamma \in [0, 1)$ is the discount factor, and $(\gamma)^{j-1}$ denotes the discount factor to the $(j-1)$th power. $V(\chi, \Phi)$ is also named the state-value function for the MU in the state $\chi$ under policy $\Phi$.

Problem 1: The MU aims to design an optimal stationary control policy $\Phi^* = (\Phi^*_{(c)}, \Phi^*_{(e)})$ that maximizes the expected long-term utility performance $V(\chi, \Phi)$ for any given initial network state $\chi$, which can be formally formulated as follows:
$$\Phi^* = \arg\max_{\Phi} V(\chi, \Phi), \quad \forall \chi \in \mathcal{X}.$$
$V(\chi) = V(\chi, \Phi^*)$ is defined as the optimal state-value function, $\forall \chi \in \mathcal{X}$.

Remark 1: The formulated problem of stochastic computation offloading optimization as in Problem 1 is in general a single-agent infinite-horizon MDP with the discounted utility criterion. Nevertheless, (7) can also be used to approximate the expected infinite-horizon undiscounted utility [32]
$$U(\chi, \Phi) = \mathbb{E}_{\Phi}\left[\lim_{J \rightarrow \infty} \frac{1}{J} \sum_{j=1}^{J} u(\chi^j, \Phi(\chi^j)) \,\Big|\, \chi^1 = \chi\right]$$
when $\gamma$ approaches 1.

B. Learning Optimal Solution to Problem 1

The stationary control policy achieving the optimal state-value function can be obtained by solving the following Bellman's optimality equation [23]: $\forall \chi \in \mathcal{X}$,
$$V(\chi) = \max_{(c, e)}\left\{(1 - \gamma) \cdot u(\chi, (c, e)) + \gamma \cdot \sum_{\chi'} \Pr\{\chi' | \chi, (c, e)\} \cdot V(\chi')\right\} \qquad (8)$$
where $u(\chi, (c, e))$ is the achieved utility when a joint control action $(c, e) \in \mathcal{Y}$ is performed under network state $\chi$ and $\chi' = (q'_{(t)}, q'_{(e)}, s', \mathbf{g}') \in \mathcal{X}$ is the subsequent network state with $\mathbf{g}' = (g'_b : b \in \mathcal{B})$.

Remark 2: The traditional solutions to (8) are based on the value iteration or the policy iteration [23], which need complete knowledge of the computation task arrival, the received energy unit, and the channel state transition statistics.
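For intuition, the sketch below runs the value iteration that Remark 2 refers to on a toy MDP; it presumes the transition statistics $\Pr\{\chi' | \chi, (c, e)\}$ are known, which is exactly the knowledge the learning algorithms in Section IV avoid. The state/action enumeration, utility table, and helper name are illustrative placeholders.

```python
import numpy as np

def value_iteration(P, U, gamma, tol=1e-8):
    """Solve V(x) = max_a {(1-gamma)*U[x,a] + gamma*sum_x' P[x,a,x']*V(x')}, cf. eq. (8).

    P: |X| x |Y| x |X| transition probabilities, U: |X| x |Y| utility table.
    Returns the optimal state values and a greedy policy (action index per state).
    """
    num_states, num_actions = U.shape
    V = np.zeros(num_states)
    while True:
        Q = (1.0 - gamma) * U + gamma * P @ V          # Q[x, a] backup, cf. eq. (9)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

# Toy illustration: 3 states, 2 joint actions, random row-stochastic transitions
rng = np.random.default_rng(0)
P = rng.random((3, 2, 3)); P /= P.sum(axis=2, keepdims=True)
U = rng.random((3, 2))
print(value_iteration(P, U, gamma=0.9))
```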

One attractiveness of the off-policy Q-learning is that it assumes no a priori knowledge of the network state transition statistics [23]. We define the right-hand side of (8) by
$$Q(\chi, (c, e)) = (1 - \gamma) \cdot u(\chi, (c, e)) + \gamma \cdot \sum_{\chi'} \Pr\{\chi' | \chi, (c, e)\} \cdot V(\chi') \qquad (9)$$
$\forall \chi \in \mathcal{X}$. The optimal state-value function $V(\chi)$ can hence be directly obtained from
$$V(\chi) = \max_{(c, e)} Q(\chi, (c, e)). \qquad (10)$$
By substituting (10) into (9), we get
$$Q(\chi, (c, e)) = (1 - \gamma) \cdot u(\chi, (c, e)) + \gamma \cdot \sum_{\chi'} \Pr\{\chi' | \chi, (c, e)\} \cdot \max_{(c', e')} Q(\chi', (c', e'))$$
where we denote $(c', e') \in \mathcal{Y}$ as a joint control action performed under the network state $\chi'$. In practice, the computation task arrival and the number of energy units that can be received by the end of a decision epoch are unavailable beforehand. Using standard Q-learning, the MU tries to learn $Q(\chi, (c, e))$ in a recursive way based on the observation of the network state $\chi = \chi^j$ at a current decision epoch $j$, the performed joint action $(c, e) = (c^j, e^j)$, the achieved utility $u(\chi, (c, e))$, and the resulting network state $\chi' = \chi^{j+1}$ at the next epoch $j + 1$. The updating rule is given by
$$Q^{j+1}(\chi, (c, e)) = Q^j(\chi, (c, e)) + \alpha^j \cdot \left((1 - \gamma) \cdot u(\chi, (c, e)) + \gamma \cdot \max_{(c', e')} Q^j(\chi', (c', e')) - Q^j(\chi, (c, e))\right) \qquad (11)$$
where $\alpha^j \in [0, 1)$ is a time-varying learning rate. It has been proven that if: 1) the network state transition probability under the optimal stationary control policy is stationary; 2) $\sum_{j=1}^{\infty} \alpha^j$ is infinite and $\sum_{j=1}^{\infty} (\alpha^j)^2$ is finite; and 3) all state-action pairs are visited infinitely often, then the Q-learning process converges and eventually finds the optimal control policy [22]. The last condition can be satisfied if the probability of choosing any action in any network state is nonzero (i.e., exploration). Meanwhile, the MU has to exploit the most recent knowledge of the Q-function in order to perform well (i.e., exploitation). A classical way to balance the tradeoff between exploration and exploitation is the $\varepsilon$-greedy strategy [23].
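A minimal NumPy sketch of one tabular update following (11), together with an $\varepsilon$-greedy action choice, is given below; the flat state/action indexing and function signature are assumptions for exposition, and the table-based form is exactly what Remark 3 argues does not scale.

```python
import numpy as np

def q_learning_step(Q, s, a, utility, s_next, alpha, gamma, epsilon, rng):
    """One epsilon-greedy Q-learning update following eq. (11); Q is a |X| x |Y| table."""
    td_target = (1.0 - gamma) * utility + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])             # eq. (11)
    # epsilon-greedy choice of the joint control action for the next epoch
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))             # explore
    return int(Q[s_next].argmax())                       # exploit

# Illustrative usage with arbitrary indices
rng = np.random.default_rng(0)
Q = np.zeros((8, 4))
a_next = q_learning_step(Q, s=3, a=1, utility=0.7, s_next=5,
                         alpha=0.1, gamma=0.9, epsilon=0.1, rng=rng)
```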

Remark 3: From (11), we can find that the standard Q-learning rule suffers from poor scalability. Due to the tabular nature of representing Q-function values, Q-learning is not readily applicable to high-dimensional scenarios with extremely huge network state and/or action spaces, where the learning process is extremely slow [33], [34]. In our considered system model, the sizes of the network state space $\mathcal{X}$ and the action space $\mathcal{Y}$ can be calculated as $X = (1 + q^{(\max)}_{(t)}) \cdot (1 + q^{(\max)}_{(e)}) \cdot B \cdot \prod_{b \in \mathcal{B}} |\mathcal{G}_b|$ and $Y = (1 + B) \cdot (1 + q^{(\max)}_{(e)})$, respectively, where $|\mathcal{G}|$ means the cardinality of the set $\mathcal{G}$. It can be observed that $X$ grows exponentially as the number $B$ of BSs increases. Suppose there is an MEC system with six BSs and, for each BS, the channel gain is quantized into six states (as assumed in our experiment setups). If we set $q^{(\max)}_{(t)} = q^{(\max)}_{(e)} = 4$, the MU has to update in total $X \cdot Y = 2.44944 \cdot 10^8$ Q-function values during the learning process, which makes it impossible for the Q-learning process to converge within a limited number of decision epochs.

The next section thereby focuses on developing practically feasible and computationally efficient algorithms to approach the optimal control policy.

IV. APPROACHING THE OPTIMAL POLICY

In this section, we proceed to approach the optimal control policy by developing practically feasible algorithms based on recent advances in deep reinforcement learning and a linear Q-function decomposition technique.

A. Deep Reinforcement Learning Algorithm

Inspired by the success of modeling an optimal state-action Q-function with a deep neural network [25], we adopt a double DQN to address the massive network state space $\mathcal{X}$ [26]. Specifically, the Q-function expressed as in (9) is approximated by $Q(\chi, (c, e)) \approx Q(\chi, (c, e); \theta)$, where $(\chi, (c, e)) \in \mathcal{X} \times \mathcal{Y}$ and $\theta$ denotes a vector of parameters associated with the DQN. The DARLING scheme for stochastic computation offloading in our considered MEC system is illustrated in Fig. 2; instead of finding the optimal Q-function, the DQN parameters are learned iteratively.

Fig. 2. Double DARLING for stochastic computation offloading in an MEC system.

The mobile device is assumed to be equipped with a replay memory of a finite size $M$ to store the experience $\mathbf{m}^j = (\chi^j, (c^j, e^j), u(\chi^j, (c^j, e^j)), \chi^{j+1})$ that happens at the transition between two consecutive decision epochs $j$ and $j + 1$ during the learning process of DARLING, where $\chi^j, \chi^{j+1} \in \mathcal{X}$ and $(c^j, e^j) \in \mathcal{Y}$. The experience pool can be represented as $\mathcal{M}^j = \{\mathbf{m}^{j-M+1}, \ldots, \mathbf{m}^j\}$. The MU maintains a DQN and a target DQN, namely, $Q(\chi, (c, e); \theta^j)$ and $Q(\chi, (c, e); \theta^j_-)$, with parameters $\theta^j$ at a current decision epoch $j$ and $\theta^j_-$ at some previous epoch before decision epoch $j$, $\forall (\chi, (c, e)) \in \mathcal{X} \times \mathcal{Y}$. According to the experience replay technique [36], the MU then randomly samples a mini-batch $\tilde{\mathcal{M}}^j \subseteq \mathcal{M}^j$ from the pool $\mathcal{M}^j$ of historical experiences at each decision epoch $j$ to train the DQN online. That is, the parameters $\theta^j$ are updated in the direction of minimizing the loss function, which is defined by
$$L_{(\mathrm{DARLING})}(\theta^j) = \mathbb{E}_{(\chi, (c, e), u(\chi, (c, e)), \chi') \in \tilde{\mathcal{M}}^j}\left[\left((1 - \gamma) \cdot u(\chi, (c, e)) + \gamma \cdot Q\Big(\chi', \arg\max_{(c', e')} Q(\chi', (c', e'); \theta^j); \theta^j_-\Big) - Q(\chi, (c, e); \theta^j)\right)^2\right] \qquad (12)$$
where $(c', e') \in \mathcal{Y}$. The loss function $L_{(\mathrm{DARLING})}(\theta^j)$ is a mean-squared measure of the Bellman equation error at a decision epoch $j$ [i.e., the last term of (11)] obtained by replacing $Q^j(\chi, (c, e))$ and its corresponding target $(1 - \gamma) \cdot u(\chi, (c, e)) + \gamma \cdot \max_{(c', e')} Q^j(\chi', (c', e'))$ with $Q(\chi, (c, e); \theta^j)$ and $(1 - \gamma) \cdot u(\chi, (c, e)) + \gamma \cdot Q(\chi', \arg\max_{(c', e')} Q(\chi', (c', e'); \theta^j); \theta^j_-)$ [26], respectively. By differentiating the loss function $L_{(\mathrm{DARLING})}(\theta^j)$ with respect to the DQN parameters $\theta^j$, we obtain the gradient
$$\nabla_{\theta^j} L_{(\mathrm{DARLING})}(\theta^j) = \mathbb{E}_{(\chi, (c, e), u(\chi, (c, e)), \chi') \in \tilde{\mathcal{M}}^j}\left[\left((1 - \gamma) \cdot u(\chi, (c, e)) + \gamma \cdot Q\Big(\chi', \arg\max_{(c', e')} Q(\chi', (c', e'); \theta^j); \theta^j_-\Big) - Q(\chi, (c, e); \theta^j)\right) \cdot \nabla_{\theta^j} Q(\chi, (c, e); \theta^j)\right]. \qquad (13)$$
Algorithm 1 summarizes the implementation of the online DARLING algorithm by the MU for stochastic computation offloading in our considered MEC system.
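To make the double-DQN update in (12) and (13) concrete, here is a minimal TensorFlow sketch of one mini-batch step: the online DQN selects the greedy next action while the target DQN evaluates it. The layer sizes, state encoding (nine inputs: two queue lengths, the association state, and six channel gains), learning rate, and function names are illustrative assumptions, not the authors' released implementation.

```python
import tensorflow as tf

num_actions = 35                       # |Y| = (1 + B) * (1 + q_e_max) for B = 6, q_e_max = 4

def build_dqn(state_dim, hidden=200):
    # single tanh hidden layer, as described in the experiment setup of Section V-A
    return tf.keras.Sequential([
        tf.keras.layers.Dense(hidden, activation="tanh", input_shape=(state_dim,)),
        tf.keras.layers.Dense(num_actions)])

dqn, target_dqn = build_dqn(9), build_dqn(9)   # assumed state vector (q_t, q_e, s, g_1..g_6)
optimizer = tf.keras.optimizers.Adam(1e-3)

@tf.function
def darling_update(states, actions, utilities, next_states, gamma=0.9):
    """One mini-batch step minimizing the double-DQN loss of eq. (12)."""
    # action selection by the online DQN, evaluation by the target DQN
    greedy_next = tf.argmax(dqn(next_states), axis=1)
    q_next = tf.gather(target_dqn(next_states), greedy_next, batch_dims=1)
    targets = (1.0 - gamma) * utilities + gamma * q_next
    with tf.GradientTape() as tape:
        q_taken = tf.gather(dqn(states), actions, batch_dims=1)
        loss = tf.reduce_mean(tf.square(targets - q_taken))      # eq. (12)
    grads = tape.gradient(loss, dqn.trainable_variables)          # eq. (13)
    optimizer.apply_gradients(zip(grads, dqn.trainable_variables))
    return loss
```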

Algorithm 1 Online DARLING Algorithm for Stochastic Computation Task Offloading in an MEC System
1: input A DQN and a target DQN with two sets $\theta^j$ and $\theta^j_-$ of random parameters, for $j = 1$.
2: initialize the replay memory $\mathcal{M}^j$ with a finite size of $M \in \mathbb{N}_+$, and the mini-batch $\tilde{\mathcal{M}}^j$ with a size of $\tilde{M} < M$ for experience replay, for $j = 1$.
3: repeat
4: At the beginning of decision epoch $j$, the MU observes the network state $\chi^j \in \mathcal{X}$, which is taken as an input to the DQN with parameters $\theta^j$, and then selects a joint control action $(c^j, e^j) \in \mathcal{Y}$ randomly with probability $\varepsilon$ or $(c^j, e^j) = \arg\max_{(c, e) \in \mathcal{Y}} Q(\chi^j, (c, e); \theta^j)$ with probability $1 - \varepsilon$.
5: After performing the selected joint control action $(c^j, e^j)$, the MU realizes an immediate utility $u(\chi^j, (c^j, e^j))$ and observes the new network state $\chi^{j+1} \in \mathcal{X}$ at the next decision epoch $j + 1$.
6: The MU updates the replay memory $\mathcal{M}^j$ at the mobile device with the most recent transition $\mathbf{m}^j = (\chi^j, (c^j, e^j), u(\chi^j, (c^j, e^j)), \chi^{j+1})$.
7: With a randomly sampled mini-batch of transitions $\tilde{\mathcal{M}}^j \subseteq \mathcal{M}^j$ from the replay memory, the MU updates the DQN parameters $\theta^j$ with the gradient given by (13).
8: The MU regularly resets the target DQN parameters with $\theta^{j+1}_- = \theta^j$, and otherwise $\theta^{j+1}_- = \theta^j_-$.
9: The decision epoch index is updated by $j \leftarrow j + 1$.
10: until A predefined stopping condition is satisfied.
11: return The parameters $\theta$ associated with the DQN of the MU.

B. Linear Q-Function Decomposition-Based Deep Reinforcement Learning

1) Linear Q-Function Decomposition: It can be found that the utility function in (6) is of an additive structure, which motivates us to linearly decompose the state-action Q-function, namely, $Q(\chi, (c, e))$, $\forall (\chi, (c, e)) \in \mathcal{X} \times \mathcal{Y}$, based on the pattern $\mathcal{K} = \{1, \ldots, K\}$ of classifying the satisfactions regarding the task execution delay, the computation task drops, the task queuing delay, the penalty of failing to process a computation task, and the payment of using the MEC service. For example, we can divide the utility into four satisfaction categories, namely, $u(\chi, (c, e)) = u_1(\chi, (c, e)) + u_2(\chi, (c, e)) + u_3(\chi, (c, e)) + u_4(\chi, (c, e))$ with $u_1(\chi, (c, e)) = \omega_1 \cdot u^{(1)}(\chi, (c, e)) + \omega_3 \cdot u^{(3)}(\chi, (c, e))$, $u_2(\chi, (c, e)) = \omega_2 \cdot u^{(2)}(\chi, (c, e))$, $u_3(\chi, (c, e)) = \omega_4 \cdot u^{(4)}(\chi, (c, e))$, and $u_4(\chi, (c, e)) = \omega_5 \cdot u^{(5)}(\chi, (c, e))$; then $\mathcal{K} = \{1, 2, 3, 4\}$ forms a classification pattern. In our considered stochastic computation offloading scenario, it is easy to see that $K \leq 5$. Mathematically, $Q(\chi, (c, e))$ is decomposed into [35]
$$Q(\chi, (c, e)) = \sum_{k \in \mathcal{K}} Q_k(\chi, (c, e)) \qquad (14)$$

where the MU deploys a "virtual" agent $k \in \mathcal{K}$ to learn the optimal per-agent state-action Q-function $Q_k(\chi, (c, e))$ that satisfies
$$Q_k(\chi, (c, e)) = (1 - \gamma) \cdot u_k(\chi, (c, e)) + \gamma \cdot \sum_{\chi'} \Pr\{\chi' | \chi, (c, e)\} \cdot Q_k\big(\chi', \Phi^*(\chi')\big) \qquad (15)$$
with $u_k(\chi, (c, e))$ being the immediate utility related to a satisfaction category $k$. We emphasize that the optimal joint control action in (15) of an agent $k$ across the time horizon should reflect the optimal control policy implemented by the MU. In other words, the MU makes an optimal joint control action decision $\Phi^*(\chi)$ under a network state $\chi$,
$$\Phi^*(\chi) = \arg\max_{(c, e)} \sum_{k \in \mathcal{K}} Q_k(\chi, (c, e))$$
to maximize the aggregated Q-function values from all the agents. We will see in Theorem 1 that the linear Q-function decomposition technique achieves the optimal solution to Problem 1.

Theorem 1: The linear Q-function decomposition approach as in (14) asserts the optimal computation offloading performance.

Proof: The proof is given in Appendix C.

Remark 4: An apparent advantage of the linear Q-function decomposition technique is that it potentially simplifies the problem solving. Back to the example above, agent 2 learns the optimal expected long-term satisfaction measuring the computation task drops across the time horizon. It is obvious that a computation task drop $\eta^j$ at a decision epoch $j$ depends only on the task queue state $q^j_{(t)}$ (rather than the network state $\chi^j \in \mathcal{X}$) and the performed joint control action $(c^j, e^j)$ by the MU.

Recall that the joint control action of each agent should be in accordance with the optimal control policy of the MU; the Q-learning rule, which involves off-policy updates, is hence not applicable to finding the optimal per-agent state-action Q-functions. The state-action-reward-state-action (SARSA) algorithm [23], [37], which applies on-policy updates, fosters a promising alternative, namely, an SARSA-based reinforcement learning (SARL). Having the observations of the network state $\chi = \chi^j$, the performed joint control action $(c, e) = (c^j, e^j)$ by the MU, the realized per-agent utilities $(u_k(\chi, (c, e)) : k \in \mathcal{K})$ at a current decision epoch $j$, the resulting network state $\chi' = \chi^{j+1}$, and the joint control action $(c', e') = (c^{j+1}, e^{j+1})$ selected by the MU at the next epoch $j + 1$, each agent $k \in \mathcal{K}$ updates its state-action Q-function on the fly according to
$$Q^{j+1}_k(\chi, (c, e)) = Q^j_k(\chi, (c, e)) + \alpha^j \cdot \left((1 - \gamma) \cdot u_k(\chi, (c, e)) + \gamma \cdot Q^j_k(\chi', (c', e')) - Q^j_k(\chi, (c, e))\right) \qquad (16)$$
where, different from off-policy Q-learning, $(c', e')$ is the joint control action actually performed in the subsequent network state, rather than the hypothetical one that maximizes the per-agent state-action Q-function. Theorem 2 ensures that the SARL algorithm converges.

Theorem 2: The sequence $\{(Q^j_k(\chi, (c, e)) : \chi \in \mathcal{X}, (c, e) \in \mathcal{Y}, k \in \mathcal{K}) : j \in \mathbb{N}_+\}$ generated by SARL converges to the optimal per-agent state-action Q-functions $Q_k(\chi, (c, e))$, $\forall \chi \in \mathcal{X}$, $\forall (c, e) \in \mathcal{Y}$, and $\forall k \in \mathcal{K}$ if and only if the state-action pairs $(\chi, (c, e)) \in \mathcal{X} \times \mathcal{Y}$ are visited infinitely often and the learning rate $\alpha^j$ satisfies $\sum_{j=1}^{\infty} \alpha^j = \infty$ and $\sum_{j=1}^{\infty} (\alpha^j)^2 < \infty$.

Proof: The proof is given in Appendix D.
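To make the decomposition concrete, here is a small NumPy sketch of the on-policy per-agent update (16) together with the aggregation rule (14) used for action selection; the class name, table sizes, and $\varepsilon$-greedy wrapper are illustrative assumptions rather than the authors' implementation (which replaces the tables with per-agent DQNs).

```python
import numpy as np

class TabularSARL:
    """Per-agent SARSA tables Q_k; the MU acts on the sum over agents, cf. eq. (14)."""

    def __init__(self, num_states, num_actions, num_agents, gamma, rng):
        self.Q = np.zeros((num_agents, num_states, num_actions))
        self.gamma, self.rng = gamma, rng

    def act(self, state, epsilon):
        if self.rng.random() < epsilon:                         # exploration
            return int(self.rng.integers(self.Q.shape[2]))
        return int(self.Q[:, state, :].sum(axis=0).argmax())    # argmax of aggregated Q

    def update(self, s, a, per_agent_utilities, s_next, a_next, alpha):
        # on-policy target uses the action actually taken at the next epoch, cf. eq. (16)
        for k, u_k in enumerate(per_agent_utilities):
            td = (1.0 - self.gamma) * u_k + self.gamma * self.Q[k, s_next, a_next] \
                 - self.Q[k, s, a]
            self.Q[k, s, a] += alpha * td
```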


Fig. 3. Deep-SARL-based stochastic computation offloading in an MEC system.

Remark 5: Implementing the derived SARL algorithm, the size of the Q-function faced by each agent remains the same as in the standard Q-learning algorithm. From Remark 4, the linear Q-function decomposition technique provides the possibility of simplifying the solving of the stochastic computation offloading problem through introducing multiple virtual agents. Each agent learns the respective expected long-term satisfaction by exploiting the key network state information and is hence enabled to use a simpler DQN to approximate the optimal state-action Q-function.

2) Deep SARSA Reinforcement Learning: Applying the linear Q-function decomposition technique, the DQN $Q(\chi, (c, e); \theta)$, which approximates the optimal state-action Q-function $Q(\chi, (c, e))$, $\forall (\chi, (c, e)) \in \mathcal{X} \times \mathcal{Y}$, can be re-expressed as
$$Q(\chi, (c, e); \theta) = \sum_{k \in \mathcal{K}} Q_k(\chi, (c, e); \theta_k)$$
where $\theta = (\theta_k : k \in \mathcal{K})$ is a collection of parameters associated with the DQNs of all agents and $Q_k(\chi, (c, e); \theta_k)$ (for $k \in \mathcal{K}$) is the per-agent DQN. Accordingly, we derive a novel Deep-SARL-based stochastic computation offloading algorithm, as depicted in Fig. 3, where, different from the DARLING algorithm, the parameters $\theta$ are learned locally at the agents in an online iterative way.

Implementing our proposed Deep-SARL algorithm, at each decision epoch $j$, the MU updates the experience pool $\mathcal{N}^j$ with the most recent $N$ experiences $\{\mathbf{n}^{j-N+1}, \ldots, \mathbf{n}^j\}$, with each experience $\mathbf{n}^j = (\chi^j, (c^j, e^j), (u_k(\chi^j, (c^j, e^j)) : k \in \mathcal{K}), \chi^{j+1}, (c^{j+1}, e^{j+1}))$. To train the DQN parameters, the MU first randomly samples a mini-batch $\tilde{\mathcal{N}}^j \subseteq \mathcal{N}^j$ and then updates $\theta^j = (\theta^j_k : k \in \mathcal{K})$ for all agents at each decision epoch $j$ to minimize the accumulative loss function, which is given by
$$L_{(\mathrm{Deep\text{-}SARL})}(\theta^j) = \mathbb{E}_{(\chi, (c, e), (u_k(\chi, (c, e)) : k \in \mathcal{K}), \chi', (c', e')) \in \tilde{\mathcal{N}}^j}\left[\sum_{k \in \mathcal{K}} \left((1 - \gamma) \cdot u_k(\chi, (c, e)) + \gamma \cdot Q_k(\chi', (c', e'); \theta^j_{k,-}) - Q_k(\chi, (c, e); \theta^j_k)\right)^2\right] \qquad (17)$$
where $\theta^j_- = (\theta^j_{k,-} : k \in \mathcal{K})$ are the parameters of the target DQNs of all agents at some previous decision epoch before epoch $j$. The gradient for each agent $k \in \mathcal{K}$ can be calculated as
$$\nabla_{\theta^j_k} L_{(\mathrm{Deep\text{-}SARL})}(\theta^j) = \mathbb{E}_{(\chi, (c, e), (u_k(\chi, (c, e)) : k \in \mathcal{K}), \chi', (c', e')) \in \tilde{\mathcal{N}}^j}\left[\left((1 - \gamma) \cdot u_k(\chi, (c, e)) + \gamma \cdot Q_k(\chi', (c', e'); \theta^j_{k,-}) - Q_k(\chi, (c, e); \theta^j_k)\right) \cdot \nabla_{\theta^j_k} Q_k(\chi, (c, e); \theta^j_k)\right]. \qquad (18)$$
We summarize the online implementation of our proposed Deep-SARL algorithm for solving the stochastic computation offloading problem in an MEC system as Algorithm 2.
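Below is a minimal TensorFlow sketch of the accumulative loss (17) with one small DQN per agent, following the per-agent 40-neuron layout described later in Section V-A; the on-policy next action $(c^{j+1}, e^{j+1})$ is supplied by the caller. Names, shapes, and the optimizer step are illustrative assumptions rather than the authors' implementation.

```python
import tensorflow as tf

NUM_AGENTS, NUM_ACTIONS, STATE_DIM = 5, 35, 9   # assumed K = 5, |Y| = 35, 9-entry state vector

def per_agent_dqn(hidden=40):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(hidden, activation="tanh", input_shape=(STATE_DIM,)),
        tf.keras.layers.Dense(NUM_ACTIONS)])

agent_dqns = [per_agent_dqn() for _ in range(NUM_AGENTS)]
target_dqns = [per_agent_dqn() for _ in range(NUM_AGENTS)]
optimizer = tf.keras.optimizers.Adam(1e-3)

def deep_sarl_update(states, actions, per_agent_utils, next_states, next_actions, gamma=0.9):
    """One mini-batch step on the summed per-agent losses of eq. (17).

    per_agent_utils: tensor of shape [K, batch] holding u_k(chi, (c, e)).
    next_actions: the joint control actions actually taken at epoch j+1 (on-policy).
    """
    variables = [v for net in agent_dqns for v in net.trainable_variables]
    with tf.GradientTape() as tape:
        loss = 0.0
        for k in range(NUM_AGENTS):
            q_next = tf.gather(target_dqns[k](next_states), next_actions, batch_dims=1)
            target = (1.0 - gamma) * per_agent_utils[k] + gamma * q_next
            q_taken = tf.gather(agent_dqns[k](states), actions, batch_dims=1)
            loss += tf.reduce_mean(tf.square(tf.stop_gradient(target) - q_taken))
    grads = tape.gradient(loss, variables)                       # per-agent gradients, cf. (18)
    optimizer.apply_gradients(zip(grads, variables))
    return loss
```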

V. NUMERICAL EXPERIMENTS

In this section, we proceed to evaluate the stochastic computation offloading performance achieved by our derived online learning algorithms, namely, DARLING and Deep-SARL.

Algorithm 2 Online Deep-SARL Algorithm for Stochastic Computation Task Offloading in an MEC System
1: input The DQNs and the target DQNs with two sets $\theta^j = (\theta^j_k : k \in \mathcal{K})$ and $\theta^j_- = (\theta^j_{k,-} : k \in \mathcal{K})$ of random parameters, for $j = 1$.
2: initialize for $j = 1$, the replay memory $\mathcal{N}^j$ with a finite size of $N \in \mathbb{N}_+$, the mini-batch $\tilde{\mathcal{N}}^j$ with a size of $\tilde{N} < N$ for experience replay, the initial network state $\chi^j \in \mathcal{X}$, and a joint control action $(c^j, e^j) \in \mathcal{Y}$ chosen randomly with probability $\varepsilon$ or $(c^j, e^j) = \arg\max_{(c, e) \in \mathcal{Y}} \sum_{k \in \mathcal{K}} Q_k(\chi^j, (c, e); \theta^j_k)$ with probability $1 - \varepsilon$.
3: repeat
4: After performing the selected joint control action $(c^j, e^j)$, the agents realize immediate utilities $(u_k(\chi^j, (c^j, e^j)) : k \in \mathcal{K})$.
5: The MU observes the new network state $\chi^{j+1} \in \mathcal{X}$ at the next epoch $j + 1$, which is taken as an input to the DQNs of all agents with parameters $\theta^j$, and selects a joint control action $(c^{j+1}, e^{j+1}) \in \mathcal{Y}$ randomly with probability $\varepsilon$ or $(c^{j+1}, e^{j+1}) = \arg\max_{(c, e) \in \mathcal{Y}} \sum_{k \in \mathcal{K}} Q_k(\chi^{j+1}, (c, e); \theta^j_k)$ with probability $1 - \varepsilon$.
6: The replay memory $\mathcal{N}^j$ is updated with the most recent transition $\mathbf{n}^j = (\chi^j, (c^j, e^j), (u_k(\chi^j, (c^j, e^j)) : k \in \mathcal{K}), \chi^{j+1}, (c^{j+1}, e^{j+1}))$.
7: With a randomly sampled mini-batch of transitions $\tilde{\mathcal{N}}^j \subseteq \mathcal{N}^j$, all agents update the DQN parameters $\theta^j$ using the gradient as in (18).
8: The target DQNs are regularly reset by $\theta^{j+1}_- = \theta^j$, and otherwise $\theta^{j+1}_- = \theta^j_-$.
9: The decision epoch index is updated by $j \leftarrow j + 1$.
10: until A predefined stopping condition is satisfied.
11: return The parameters $\theta = (\theta_k : k \in \mathcal{K})$ associated with the DQNs of all agents.

A. General Setup

Throughout the experiments, we suppose there are $B = 6$ BSs in the system connecting the MU with the MEC server. The channel gain states between the MU and the BSs are taken from a common finite set $\{-11.23, -9.37, -7.8, -6.3, -4.68, -2.08\}$ (dB), the transitions of which happen across the discrete decision epochs following respective randomly generated matrices. Each energy unit corresponds to $2 \cdot 10^{-3}$ J, and the number of energy units received from the wireless environment follows a Poisson arrival process with average arrival rate $\lambda_{(e)}$ (in units per epoch). We set $K = 5$ for the Deep-SARL algorithm, while $u^{(1)}(\chi^j, (c^j, e^j))$, $u^{(2)}(\chi^j, (c^j, e^j))$, $u^{(3)}(\chi^j, (c^j, e^j))$, $u^{(4)}(\chi^j, (c^j, e^j))$, and $u^{(5)}(\chi^j, (c^j, e^j))$ in (6) are chosen to be exponential functions. Then,
$$u_1(\chi^j, (c^j, e^j)) = \omega_1 \cdot u^{(1)}(\chi^j, (c^j, e^j)) = \omega_1 \cdot \exp\{-\min\{d^j, \delta\}\}$$
$$u_2(\chi^j, (c^j, e^j)) = \omega_2 \cdot u^{(2)}(\chi^j, (c^j, e^j)) = \omega_2 \cdot \exp\{-\eta^j\}$$
$$u_3(\chi^j, (c^j, e^j)) = \omega_3 \cdot u^{(3)}(\chi^j, (c^j, e^j)) = \omega_3 \cdot \exp\{-\rho^j\}$$
$$u_4(\chi^j, (c^j, e^j)) = \omega_4 \cdot u^{(4)}(\chi^j, (c^j, e^j)) = \omega_4 \cdot \exp\{-\varphi^j\}$$
$$u_5(\chi^j, (c^j, e^j)) = \omega_5 \cdot u^{(5)}(\chi^j, (c^j, e^j)) = \omega_5 \cdot \exp\{-\phi^j\}.$$
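As a reading aid, the per-epoch utility (6) with the exponential satisfaction functions above can be computed as in the following sketch; the helper signature and the example weight values are placeholders rather than the settings of Table I.

```python
import math

def immediate_utility(d, delta, eta, rho, varphi, phi, weights):
    """Weighted sum of exponential satisfactions, cf. eq. (6) and the forms above.

    d: task execution delay, eta: task drops, rho: queuing delay,
    varphi: failure penalty indicator, phi: MEC service payment.
    """
    w1, w2, w3, w4, w5 = weights
    return (w1 * math.exp(-min(d, delta))
            + w2 * math.exp(-eta)
            + w3 * math.exp(-rho)
            + w4 * math.exp(-varphi)
            + w5 * math.exp(-phi))

# Illustrative weights only; the experiments use the values listed in Table I
print(immediate_utility(d=0.4, delta=1.0, eta=0, rho=1, varphi=0, phi=0.2,
                        weights=(1.0, 6.0, 6.0, 1.0, 1.0)))
```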


TABLE I
PARAMETER VALUES IN EXPERIMENTS

Based on the works [39] and [40], we use a single hidden layer consisting of 200 neurons⁵ for the design of the DQN in the DARLING algorithm and select tanh as the activation function [42] and Adam as the optimizer. The same total of 200 neurons is employed to design the single-layer DQNs of all agents in Deep-SARL; hence each agent's DQN has 40 neurons. Both the DARLING and the Deep-SARL algorithms are implemented in TensorFlow [27]. Other parameter values used in the experiments are listed in Table I.

For the purpose of performance comparisons, we simulate three baselines as follows.

1) Mobile Execution: The MU processes all scheduled computation tasks at the mobile device with the maximum possible energy, that is, at each decision epoch $j$, $c^j = 0$ and
$$e^j = \begin{cases} \min\left\{q^j_{(e)}, \left\lfloor \left(f^{(\max)}_{(\mathrm{CPU})}\right)^3 \cdot \tau \right\rfloor\right\}, & \text{if } q^j_{(e)} > 0 \\ 0, & \text{if } q^j_{(e)} = 0 \end{cases}$$
where the allocation of energy units takes into account the maximum CPU-cycle frequency $f^{(\max)}_{(\mathrm{CPU})}$ and $\lfloor \cdot \rfloor$ means the floor function.

⁵ The tradeoff between a time-demanding training process and an improvement in performance with a deeper and/or wider neural network is still an open problem [41].

2) Server Execution: According to Lemma 2, with the maximum possible energy units in the energy queue that satisfy the maximum transmit power constraint, the MU always selects a BS that achieves the minimum task execution delay to offload the input data of a scheduled computation task for MEC server execution.

3) Greedy Execution: At each decision epoch, the MU decides to execute a computation task at its own mobile device or offload it to the MEC server for processing with the aim of minimizing the immediate task execution delay.

B. Experiment Results

We carry out experiments under various settings to validate the proposed studies in this paper.

1) Experiment 1—Convergence Performance: Our goal in this experiment is to validate the convergence property of our proposed algorithms, namely, DARLING and Deep-SARL, for stochastic computation offloading in the considered MEC system. We set the task arrival probability and the average energy arrival rate to be $\lambda_{(t)} = 0.5$ and $\lambda_{(e)} = 0.8$ units per epoch, respectively. Without loss of generality, we plot the simulated variations in $Q(\chi, (c, e); \theta^j)$ [where $\chi = (2, 2, 2, (-6.3, -6.3, -4.68, -7.8, -6.3, -6.3))$, $(c, e) = (2, 4)$, and $j \in \mathbb{N}_+$] for the DARLING algorithm as well as the loss function defined by (17) for the Deep-SARL algorithm versus the decision epochs in Fig. 4, which reveals that the convergence behaviors of both DARLING and Deep-SARL are similar. The two algorithms converge at a reasonable speed, for which around $1.5 \times 10^4$ decision epochs are needed.

Fig. 4. Illustration for the convergence property of DARLING and Deep-SARL.

2) Experiment 2—Performance Under Various $\lambda_{(t)}$: In this experiment, we demonstrate the average computation offloading performance per epoch in terms of the average utility, the average task execution delay, the average task drops, the average task queuing delay, the average task failure penalty, and the average MEC service payment under different computation task arrival probability settings. We choose the average energy unit arrival rate as $\lambda_{(e)} = 1.6$ units per epoch. The results are exhibited in Fig. 5. Fig. 5(a) illustrates the average utility performance when the MU implements DARLING and Deep-SARL. Fig. 5(b)–(f) illustrates the average task execution delay, the average task drops, the average task queuing delay, the average task failure penalty, and the average MEC service payment.

Fig. 5. Average computation offloading performance versus task arrival probabilities. (a) Average utility per epoch. (b) Average execution delay per epoch. (c) Average task drops per epoch. (d) Average task queuing delay per epoch. (e) Average task failure penalty per epoch. (f) Average MEC service payment per epoch.

Each plot compares the performance of DARLING and Deep-SARL with that of the three baseline computation offloading schemes. From Fig. 5, it can be observed that both proposed schemes achieve a significant gain in average utility. Similar observations can be made from the curves in the other plots, though the average MEC service payment per epoch from Deep-SARL is slightly higher than that from DARLING. The reason is that, as the task arrival probability increases under the chosen network settings, more tasks are scheduled for execution at the MEC server by the Deep-SARL algorithm. As the computation task arrival probability increases, the average utility performance decreases due to the increases in average task execution delay, average task drops, average task queuing delay, average task failure penalty, and average MEC service payment. Since, on average, there are not enough energy units in the energy queue during one decision epoch, only a portion of the queued energy units is allocated for processing a scheduled computation task in order to avoid task drops and the task failure penalty, which leads to more queued tasks, i.e., an increased average task queuing delay. Deep-SARL outperforms DARLING in utility performance. This indicates that by combining a deep neural network with the Q-function decomposition technique, the original stochastic computation offloading problem is simplified; hence, a performance improvement can be expected when approximating the state-action Q-functions of all agents with the same number of neurons.

3) Experiment 3—Performance With Changing λ(e): In the third experiment, we simulate the average computation offloading performance per epoch achieved by the derived DARLING and Deep-SARL algorithms and the other three baselines versus the average energy unit arrival rate. The computation task arrival probability in this experiment is set to λ(t) = 0.6. The per-epoch average utility, average task execution delay, average task drops, average task queuing delay, average task failure penalty, and average MEC service payment across the entire learning period are depicted in Fig. 6. We can clearly see from Fig. 6 that as the available energy units increase, the overall computation offloading performance improves. However, as the energy unit arrival rate increases, the average task execution delay, the average task failure penalty, and the average MEC service payment^6 first increase but then decrease.

6 It is easy to see that the mobile execution scheme does not use the MEC service; hence, no MEC service payment needs to be made.



Fig. 6. Average computation offloading performance versus average energy unit arrival rates. (a) Average utility per epoch. (b) Average execution delay per epoch. (c) Average task drops per epoch. (d) Average task queuing delay per epoch. (e) Average task failure penalty per epoch. (f) Average MEC service payment per epoch.

Compared to ω1, ω4, and ω5, we set larger values for ω2 and ω3. The priority of minimizing the average task drops and the average task queuing delay pushes the MU to schedule as many computation tasks for execution as possible. Hence, the increasing number of queued energy units provides more opportunities to execute a computation task during each decision epoch and, at the same time, increases the possibility of failing to execute a task. When the average number of energy units in the energy queue grows sufficiently large, enough energy units can be allocated to each scheduled computation task, so the task execution delay, the possibility of task computation failures, and the MEC service payment all decrease. The role of the weights in this tradeoff is sketched below.
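For intuition only, the toy function below scores one epoch with a plain weighted sum of the five cost terms; the actual utility used in the paper is defined earlier and is not reproduced here. The default weight vector merely mimics the setting described above, where ω2 and ω3 (task drops and task queuing delay) are set larger than ω1, ω4, and ω5.

```python
def epoch_utility(delay, drops, queuing_delay, failure_penalty, payment,
                  omega=(1.0, 3.0, 3.0, 1.0, 1.0)):
    """Toy cost-style utility for one epoch: higher costs give a lower utility."""
    w1, w2, w3, w4, w5 = omega
    return -(w1 * delay + w2 * drops + w3 * queuing_delay
             + w4 * failure_penalty + w5 * payment)
```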

VI. CONCLUSION

In this paper, we have investigated the design of a stochastic computation offloading policy for a representative MU in an ultradense sliced RAN, taking into account the dynamics generated by the time-varying channel qualities between the MU and the BSs, the energy units received from the wireless environment, and the computation task arrivals. The problem of stochastic computation offloading is formulated as an MDP, for which we propose two double DQN-based online strategic computation offloading algorithms, namely, DARLING and Deep-SARL. Both learning algorithms survive the curse of high dimensionality in the state space and need no a priori information about the dynamics statistics. We find from numerical experiments that, compared to the three baselines, our derived algorithms achieve much better long-term utility performance, which indicates an optimal tradeoff among the computation task execution delay, the task drops, the task queuing delay, the task failure penalty, and the MEC service payment. Moreover, the Deep-SARL algorithm outperforms the DARLING algorithm by taking advantage of the additive utility function structure.

APPENDIX A
PROOF OF LEMMA 1

With the decisions of computation offloading c^j ∈ B and energy unit allocation e^j > 0 at an epoch j, the input data transmission is independent of the network dynamics. Suppose there exists a new transmission policy with which the MU changes its transmission rate from r^{j,(1)} to r^{j,(2)} ≠ r^{j,(1)} at a certain point during the task input data transmission. We denote the time durations corresponding to transmission rates r^{j,(1)} and r^{j,(2)} as d^{j,(1)}_{(tr)} and d^{j,(2)}_{(tr)}, respectively. Taking into account the maximum transmit power of the mobile device (4), it is easy to verify that the following two constraints, on the total energy consumption,

\min\{e^{j}, (d^{j,(1)}_{(\mathrm{tr})} + d^{j,(2)}_{(\mathrm{tr})})\cdot p^{(\max)}_{(\mathrm{tr})}\} = \frac{I}{g^{j}_{c^{j}}}\cdot\big(2^{\frac{r^{j,(1)}}{W}} - 1\big)\cdot d^{j,(1)}_{(\mathrm{tr})} + \frac{I}{g^{j}_{c^{j}}}\cdot\big(2^{\frac{r^{j,(2)}}{W}} - 1\big)\cdot d^{j,(2)}_{(\mathrm{tr})}

and on the total number of transmitted data bits,

r^{j,(1)}\cdot d^{j,(1)}_{(\mathrm{tr})} + r^{j,(2)}\cdot d^{j,(2)}_{(\mathrm{tr})} = \mu

can be satisfied. On the other hand, the average transmission rate within a duration of d^{j,(1)}_{(tr)} + d^{j,(2)}_{(tr)} can be written as

\bar{r}^{j} = W\cdot\log_{2}\Bigg(1 + \frac{g^{j}_{c^{j}}}{I}\cdot\frac{\min\{e^{j}, (d^{j,(1)}_{(\mathrm{tr})} + d^{j,(2)}_{(\mathrm{tr})})\cdot p^{(\max)}_{(\mathrm{tr})}\}}{d^{j,(1)}_{(\mathrm{tr})} + d^{j,(2)}_{(\mathrm{tr})}}\Bigg).

Considering the use of \bar{r}^{j} as the only transmission rate during the whole period of task input data transmission, we construct the deduction in (19),

\bar{r}^{j}\cdot(d^{j,(1)}_{(\mathrm{tr})} + d^{j,(2)}_{(\mathrm{tr})}) = W\cdot\log_{2}\Bigg(1 + \frac{\big(2^{\frac{r^{j,(1)}}{W}} - 1\big)\cdot d^{j,(1)}_{(\mathrm{tr})} + \big(2^{\frac{r^{j,(2)}}{W}} - 1\big)\cdot d^{j,(2)}_{(\mathrm{tr})}}{d^{j,(1)}_{(\mathrm{tr})} + d^{j,(2)}_{(\mathrm{tr})}}\Bigg)\cdot(d^{j,(1)}_{(\mathrm{tr})} + d^{j,(2)}_{(\mathrm{tr})})
\geq \Bigg(r^{j,(1)}\cdot\frac{d^{j,(1)}_{(\mathrm{tr})}}{d^{j,(1)}_{(\mathrm{tr})} + d^{j,(2)}_{(\mathrm{tr})}} + r^{j,(2)}\cdot\frac{d^{j,(2)}_{(\mathrm{tr})}}{d^{j,(1)}_{(\mathrm{tr})} + d^{j,(2)}_{(\mathrm{tr})}}\Bigg)\cdot(d^{j,(1)}_{(\mathrm{tr})} + d^{j,(2)}_{(\mathrm{tr})})
= r^{j,(1)}\cdot d^{j,(1)}_{(\mathrm{tr})} + r^{j,(2)}\cdot d^{j,(2)}_{(\mathrm{tr})} = \mu   (19)

where the inequality originates from the concavity of the logarithmic function. From (19), we find that with a constant transmission rate, more data can be transmitted within the same transmission period. Hence, applying a constant rate for the transmission of the μ task input data bits achieves the minimum transmission time, which can be solved according to (2) and (3). This completes the proof.
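The inequality in (19) can be checked numerically. The snippet below compares a two-rate schedule against the corresponding constant-rate schedule with the same total airtime and the same total energy budget; the parameter values are arbitrary illustrations, not the paper's experiment settings.

```python
import math

# Two-rate schedule: r1 for d1 seconds, then r2 for d2 seconds.
W = 1e6            # bandwidth in Hz (illustrative)
r1, r2 = 2e6, 6e6  # bit/s (illustrative)
d1, d2 = 0.3, 0.7  # seconds (illustrative)

bits_two_rates = r1 * d1 + r2 * d2

# Spend the same total energy uniformly over d1 + d2: the effective SNR is the
# time average of the per-rate terms (2^{r/W} - 1), as in the proof.
avg_snr = ((2 ** (r1 / W) - 1) * d1 + (2 ** (r2 / W) - 1) * d2) / (d1 + d2)
r_bar = W * math.log2(1 + avg_snr)
bits_constant_rate = r_bar * (d1 + d2)

# Concavity of log2(1 + x) guarantees the constant-rate schedule delivers at
# least as many bits, as stated by (19).
assert bits_constant_rate >= bits_two_rates
print(bits_two_rates, bits_constant_rate)  # e.g., 4800000.0 vs ~5523561.9
```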

APPENDIX B
PROOF OF LEMMA 2

By replacing r^j in (2) with (3), we get

\log_{2}\Big(1 + I^{-1}\cdot g^{j}_{c^{j}}\cdot e^{j}\cdot\frac{1}{d^{j}_{(\mathrm{tr})}}\Big) = \frac{\mu}{W}\cdot\frac{1}{d^{j}_{(\mathrm{tr})}}.   (20)

Alternatively, we take 1/d^{j}_{(tr)} as the solution of (20), which is the intersection of the two curves Φ1(x) = log2(1 + g^{j}_{c^{j}}·I^{−1}·e^{j}·x) and Φ2(x) = (μ/W)·x for x > 0. From the nonnegativity and the monotonicity of the logarithmic and linear functions, it is easy to see that 1/d^{j}_{(tr)} is a monotonically increasing function of e^{j}. Thus, d^{j}_{(tr)} is a monotonically decreasing function of e^{j}, which completes the proof.
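Equation (20) has no closed-form solution, but the intersection described above is easy to locate numerically. The sketch below brackets the nonzero root of Φ1(x) − Φ2(x) by bisection; all parameter values are illustrative assumptions, chosen so that a positive root exists (i.e., the allocated energy suffices to deliver the μ input bits).

```python
import math

def solve_inverse_delay(g, e, I, mu, W, x_hi=1e9, iters=100):
    """Find x = 1/d_tr > 0 with log2(1 + g*e/I * x) = (mu/W) * x by bisection."""
    f = lambda x: math.log2(1.0 + (g * e / I) * x) - (mu / W) * x
    x_lo = 1e-12  # just above the trivial root at x = 0
    assert f(x_lo) > 0 > f(x_hi), "bracket must contain the nonzero root"
    for _ in range(iters):
        x_mid = 0.5 * (x_lo + x_hi)
        if f(x_mid) > 0:
            x_lo = x_mid
        else:
            x_hi = x_mid
    return 0.5 * (x_lo + x_hi)

# Illustrative parameters only; a larger energy allocation e enlarges the root
# x = 1/d_tr and hence shrinks the transmission delay d_tr, matching Lemma 2.
x = solve_inverse_delay(g=1e-6, e=1e-2, I=1e-9, mu=3e6, W=1e6)
d_tr = 1.0 / x
```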

APPENDIX C
PROOF OF THEOREM 1

For the state-action Q-function of a joint action (c, e) ∈ Y in a network state χ ∈ X as in (9), we have

Q(\chi,(c,e)) = \mathbb{E}_{\pi^{*}}\Big[(1-\gamma)\cdot\sum_{j=1}^{\infty}\gamma^{j-1}\cdot u(\chi^{j},(c^{j},e^{j}))\,\Big|\,\chi^{1}=\chi,\ (c^{1},e^{1})=(c,e)\Big]
= \mathbb{E}_{\pi^{*}}\Big[(1-\gamma)\cdot\sum_{j=1}^{\infty}\gamma^{j-1}\cdot\sum_{k\in\mathcal{K}} u_{k}(\chi^{j},(c^{j},e^{j}))\,\Big|\,\chi^{1}=\chi,\ (c^{1},e^{1})=(c,e)\Big]
= \sum_{k\in\mathcal{K}}\mathbb{E}_{\pi^{*}}\Big[(1-\gamma)\cdot\sum_{j=1}^{\infty}\gamma^{j-1}\cdot u_{k}(\chi^{j},(c^{j},e^{j}))\,\Big|\,\chi^{1}=\chi,\ (c^{1},e^{1})=(c,e)\Big]
= \sum_{k\in\mathcal{K}} Q_{k}(\chi,(c,e))   (21)

which completes the proof.
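The interchange of the summation over agents and the expectation in (21) can be illustrated with a quick numerical check on one sampled trajectory: the discounted total utility equals the sum of the per-agent discounted utilities. The horizon, discount factor, and random utilities below are toy values assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, K, horizon = 0.9, 5, 200
u = rng.random((horizon, K))                 # per-agent utilities u_k along one trajectory
weights = (1 - gamma) * gamma ** np.arange(horizon)

total_return = weights @ u.sum(axis=1)       # uses u = sum_k u_k at every epoch
per_agent_returns = weights @ u              # one discounted return per agent k
assert np.isclose(total_return, per_agent_returns.sum())
```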

APPENDIX D
PROOF OF THEOREM 2

Since the per-agent state-action Q-functions are learned simultaneously, we consider the monolithic updates during the learning process of the SARL algorithm; namely, the updating rule in (16) can be encapsulated as

\sum_{k\in\mathcal{K}} Q^{j+1}_{k}(\chi,(c,e)) = (1-\alpha^{j})\cdot\sum_{k\in\mathcal{K}} Q^{j}_{k}(\chi,(c,e)) + \alpha^{j}\cdot\Big((1-\gamma)\cdot\sum_{k\in\mathcal{K}} u_{k}(\chi,(c,e)) + \gamma\cdot\sum_{k\in\mathcal{K}} Q^{j}_{k}(\chi',(c',e'))\Big)   (22)

where (χ, (c, e)), (χ′, (c′, e′)) ∈ X × Y. We rewrite (22) as

\sum_{k\in\mathcal{K}} Q^{j+1}_{k}(\chi,(c,e)) - \sum_{k\in\mathcal{K}} Q_{k}(\chi,(c,e)) = (1-\alpha^{j})\cdot\Big(\sum_{k\in\mathcal{K}} Q^{j}_{k}(\chi,(c,e)) - \sum_{k\in\mathcal{K}} Q_{k}(\chi,(c,e))\Big) + \alpha^{j}\cdot\Upsilon^{j}(\chi,(c,e))   (23)

where

\Upsilon^{j}(\chi,(c,e)) = (1-\gamma)\cdot\sum_{k\in\mathcal{K}} u_{k}(\chi,(c,e)) + \gamma\cdot\max_{(c'',e'')}\sum_{k\in\mathcal{K}} Q^{j}_{k}(\chi',(c'',e'')) - \sum_{k\in\mathcal{K}} Q_{k}(\chi,(c,e)) + \gamma\cdot\Big(\sum_{k\in\mathcal{K}} Q^{j}_{k}(\chi',(c',e')) - \max_{(c'',e'')}\sum_{k\in\mathcal{K}} Q^{j}_{k}(\chi',(c'',e''))\Big).

Let \mathcal{O}^{j} = \sigma\big(\{(\chi^{z},(c^{z},e^{z}),(u_{k}(\chi^{z},(c^{z},e^{z})):k\in\mathcal{K})): z \le j\},\ (Q^{j}_{k}(\chi,(c,e)): \chi\in\mathcal{X},(c,e)\in\mathcal{Y},k\in\mathcal{K})\big) denote the learning history for the first j decision epochs.


The per-agent state-action Q-functions are \mathcal{O}^{j}-measurable; thus, both \sum_{k\in\mathcal{K}} Q^{j+1}_{k}(\chi,(c,e)) - \sum_{k\in\mathcal{K}} Q_{k}(\chi,(c,e)) and \Upsilon^{j}(\chi,(c,e)) are \mathcal{O}^{j}-measurable. We then arrive at

\big\|\mathbb{E}\big[\Upsilon^{j}(\chi,(c,e))\,|\,\mathcal{O}^{j}\big]\big\|_{\infty}
\le \Big\|\mathbb{E}\Big[(1-\gamma)\cdot\sum_{k\in\mathcal{K}} u_{k}(\chi,(c,e)) + \gamma\cdot\max_{(c'',e'')}\sum_{k\in\mathcal{K}} Q^{j}_{k}(\chi',(c'',e'')) - \sum_{k\in\mathcal{K}} Q_{k}(\chi,(c,e))\,\Big|\,\mathcal{O}^{j}\Big]\Big\|_{\infty} + \Big\|\mathbb{E}\Big[\gamma\cdot\Big(\sum_{k\in\mathcal{K}} Q^{j}_{k}(\chi',(c',e')) - \max_{(c'',e'')}\sum_{k\in\mathcal{K}} Q^{j}_{k}(\chi',(c'',e''))\Big)\Big|\,\mathcal{O}^{j}\Big]\Big\|_{\infty}
\overset{(a)}{\le} \gamma\cdot\Big\|\sum_{k\in\mathcal{K}} Q^{j}_{k}(\chi,(c,e)) - \sum_{k\in\mathcal{K}} Q_{k}(\chi,(c,e))\Big\|_{\infty} + \Big\|\mathbb{E}\Big[\gamma\cdot\Big(\sum_{k\in\mathcal{K}} Q^{j}_{k}(\chi',(c',e')) - \max_{(c'',e'')}\sum_{k\in\mathcal{K}} Q^{j}_{k}(\chi',(c'',e''))\Big)\Big|\,\mathcal{O}^{j}\Big]\Big\|_{\infty}   (24)

where (a) is due to the convergence property of standard Q-learning [22]. We are left with verifying that \|\mathbb{E}[\gamma\cdot(\sum_{k\in\mathcal{K}} Q^{j}_{k}(\chi',(c',e')) - \max_{(c'',e'')}\sum_{k\in\mathcal{K}} Q^{j}_{k}(\chi',(c'',e'')))\,|\,\mathcal{O}^{j}]\|_{\infty} converges to zero with probability 1, which in our considered scenario follows from: 1) an ε-greedy strategy is applied in SARL for choosing joint control actions; 2) the per-agent state-action Q-functions are upper bounded; and 3) both the network state space and the joint control action space are finite. All conditions in [38, Lemma 1] are thus satisfied. Therefore, the convergence of the SARL learning process is ensured.
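A tabular sketch of the monolithic update behind (22) is given below: each per-agent Q-table is updated with the shared learning rate α and the common next state-action pair, so that the sum over agents evolves exactly as in (22). The ε-greedy selection of (c′, e′) from the summed Q-values is omitted; table sizes and indexing are illustrative assumptions.

```python
import numpy as np

def sarl_update(Q, state, action, utilities, next_state, next_action, alpha, gamma):
    """One monolithic SARL step over K per-agent Q-tables, mirroring (22).

    Q: array of shape [K, num_states, num_actions]; utilities: length-K vector
    of u_k(state, action); state/action indices are integers.
    """
    for k in range(Q.shape[0]):
        target = (1.0 - gamma) * utilities[k] + gamma * Q[k, next_state, next_action]
        Q[k, state, action] = (1.0 - alpha) * Q[k, state, action] + alpha * target
    return Q
```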

REFERENCES

[1] “Cisco visual networking index: Global mobile data traffic forecast update, 2016–2021,” Cisco, San Jose, CA, USA, White Paper, Feb. 2017.
[2] M. Satyanarayanan, “The emergence of edge computing,” IEEE Comput., vol. 50, no. 1, pp. 30–39, Jan. 2017.
[3] K. Gai, M. Qiu, H. Zhao, L. Tao, and Z. Zong, “Dynamic energy-aware cloudlet-based mobile cloud computing model for green computing,” J. Netw. Comput. Appl., vol. 59, pp. 46–54, Jun. 2016.
[4] Z. Su et al., “A secure content caching scheme for disaster backup in fog computing enabled mobile social networks,” IEEE Trans. Ind. Informat., vol. 14, no. 10, pp. 4579–4589, Oct. 2018.
[5] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, “A survey on mobile edge computing: The communication perspective,” IEEE Commun. Surveys Tuts., vol. 19, no. 4, pp. 2322–2358, 4th Quart., 2017.
[6] P. Mach and Z. Becvar, “Mobile edge computing: A survey on architecture and computation offloading,” IEEE Commun. Surveys Tuts., vol. 19, no. 3, pp. 1628–1656, 3rd Quart., 2017.
[7] C. Wang, C. Liang, F. R. Yu, Q. Chen, and L. Tang, “Computation offloading and resource allocation in wireless cellular networks with mobile edge computing,” IEEE Trans. Wireless Commun., vol. 16, no. 8, pp. 4924–4938, Aug. 2017.
[8] X. Hu, K.-K. Wong, and K. Yang, “Wireless powered cooperation-assisted mobile edge computing,” IEEE Trans. Wireless Commun., vol. 17, no. 4, pp. 2375–2388, Apr. 2018.
[9] F. Wang, J. Xu, X. Wang, and S. Cui, “Joint offloading and computing optimization in wireless powered mobile-edge computing systems,” IEEE Trans. Wireless Commun., vol. 17, no. 3, pp. 1784–1797, Mar. 2018.
[10] X. Costa-Pérez, J. Swetina, T. Guo, R. Mahindra, and S. Rangarajan, “Radio access network virtualization for future mobile carrier networks,” IEEE Commun. Mag., vol. 51, no. 7, pp. 27–35, Jul. 2013.
[11] R. Kokku, R. Mahindra, H. Zhang, and S. Rangarajan, “NVS: A substrate for virtualizing wireless resources in cellular networks,” IEEE/ACM Trans. Netw., vol. 20, no. 5, pp. 1333–1346, Oct. 2012.
[12] X. Chen et al., “Wireless resource scheduling in virtualized radio access networks using stochastic learning,” IEEE Trans. Mobile Comput., vol. 17, no. 4, pp. 961–974, Apr. 2018.
[13] P. Zhao, H. Tian, S. Fan, and A. Paulraj, “Information prediction and dynamic programming-based RAN slicing for mobile edge computing,” IEEE Wireless Commun. Lett., vol. 7, no. 4, pp. 614–617, Aug. 2018.
[14] Y. Zhou, F. R. Yu, J. Chen, and Y. Kuo, “Resource allocation for information-centric virtualized heterogeneous networks with in-network caching and mobile edge computing,” IEEE Trans. Veh. Technol., vol. 66, no. 12, pp. 11339–11351, Dec. 2017.
[15] C. You, K. Huang, H. Chae, and B.-H. Kim, “Energy-efficient resource allocation for mobile-edge computation offloading,” IEEE Trans. Wireless Commun., vol. 16, no. 3, pp. 1397–1411, Mar. 2017.
[16] X. Lyu et al., “Energy-efficient admission of delay-sensitive tasks for mobile edge computing,” IEEE Trans. Commun., vol. 66, no. 6, pp. 2603–2616, Jun. 2018.
[17] J. Liu, Y. Mao, J. Zhang, and K. B. Letaief, “Delay-optimal computation task scheduling for mobile-edge computing systems,” in Proc. IEEE ISIT, Barcelona, Spain, Jul. 2016, pp. 1451–1455.
[18] Y. Mao, J. Zhang, and K. B. Letaief, “Dynamic computation offloading for mobile-edge computing with energy harvesting devices,” IEEE J. Sel. Areas Commun., vol. 34, no. 12, pp. 3590–3605, Dec. 2016.
[19] C.-F. Liu, M. Bennis, and H. V. Poor, “Latency and reliability-aware task offloading and resource allocation for mobile edge computing,” in Proc. IEEE GLOBECOM Workshops, Singapore, Dec. 2017, pp. 1–7.
[20] Z. Jiang and S. Mao, “Energy delay tradeoff in cloud offloading for multi-core mobile devices,” IEEE Access, vol. 3, pp. 2306–2316, 2015.
[21] J. Xu, L. Chen, and S. Ren, “Online learning for offloading and autoscaling in energy harvesting mobile edge computing,” IEEE Trans. Cogn. Commun. Netw., vol. 3, no. 3, pp. 361–373, Sep. 2017.
[22] C. J. C. H. Watkins and P. Dayan, “Q-learning,” Mach. Learn., vol. 8, nos. 3–4, pp. 279–292, May 1992.
[23] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 1998.
[24] M. A. Alsheikh, D. Niyato, S. Lin, H.-P. Tan, and D. I. Kim, “Fast adaptation of activity sensing policies in mobile devices,” IEEE Trans. Veh. Technol., vol. 66, no. 7, pp. 5995–6008, Jul. 2017.
[25] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.
[26] H. van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Proc. AAAI, Phoenix, AZ, USA, Feb. 2016, pp. 2094–2100.
[27] M. Abadi et al., “TensorFlow: A system for large-scale machine learning,” in Proc. OSDI, Savannah, GA, USA, Nov. 2016, pp. 265–283.
[28] T. D. Burd and R. W. Brodersen, “Processor design for portable systems,” J. VLSI Signal Process. Syst., vol. 13, nos. 2–3, pp. 203–221, Aug. 1996.
[29] X. Chen, P. Liu, H. Liu, C. Wu, and Y. Ji, “Multipath transmission scheduling in millimeter wave cloud radio access networks,” in Proc. IEEE ICC, Kansas City, MO, USA, May 2018, pp. 1–6.
[30] J. Kwak, O. Choi, S. Chong, and P. Mohapatra, “Processor-network speed scaling for energy-delay tradeoff in smartphone applications,” IEEE/ACM Trans. Netw., vol. 24, no. 3, pp. 1647–1660, Jun. 2016.
[31] X. Chen, L. Jiao, W. Li, and X. Fu, “Efficient multi-user computation offloading for mobile-edge cloud computing,” IEEE/ACM Trans. Netw., vol. 24, no. 5, pp. 2795–2808, Oct. 2016.
[32] D. Adelman and A. J. Mersereau, “Relaxations of weakly coupled stochastic dynamic programs,” Oper. Res., vol. 56, no. 3, pp. 712–727, Jan. 2008.
[33] K. Gai and M. Qiu, “Optimal resource allocation using reinforcement learning for IoT content-centric services,” Appl. Soft Comput., vol. 70, pp. 12–21, Sep. 2018.


[34] K. Gai, M. Qiu, M. Liu, and H. Zhao, “Smart resource allocation using reinforcement learning in content-centric cyber-physical systems,” in Proc. SmartCom, Shenzhen, China, Dec. 2017, pp. 39–52.
[35] S. Russell and A. L. Zimdars, “Q-decomposition for reinforcement learning agents,” in Proc. ICML, Washington, DC, USA, Aug. 2003, pp. 656–663.
[36] L.-J. Lin, “Reinforcement learning for robots using neural networks,” Ph.D. dissertation, Carnegie Mellon Univ., Pittsburgh, PA, USA, 1992.
[37] G. A. Rummery and M. Niranjan, “Online Q-learning using connectionist systems,” Eng. Dept., Cambridge Univ., Cambridge, U.K., Rep. CUED/F-INFENG/TR 166, Sep. 1994.
[38] S. Singh, T. Jaakkola, M. L. Littman, and C. Szepesvári, “Convergence results for single-step on-policy reinforcement-learning algorithms,” Mach. Learn., vol. 38, no. 3, pp. 287–308, Mar. 2000.
[39] X. Chen et al., “Performance optimization in mobile-edge computing via deep reinforcement learning,” in Proc. IEEE VTC-Fall, Chicago, IL, USA, Aug. 2018.
[40] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE CVPR, Las Vegas, NV, USA, Jun. 2016, pp. 770–778.
[41] C. L. P. Chen and Z. Liu, “Broad learning system: An effective and efficient incremental learning system without the need for deep architecture,” IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 1, pp. 10–24, Jan. 2018.
[42] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, “What is the best multi-stage architecture for object recognition?” in Proc. IEEE ICCV, Kyoto, Japan, Sep./Oct. 2009, pp. 2146–2153.

Xianfu Chen (M’13) received the Ph.D. degree in signal and information processing from the Department of Information Science and Electronic Engineering, Zhejiang University, Hangzhou, China, in 2012.

Since 2012, he has been with the VTT Technical Research Centre of Finland Ltd., Oulu, Finland, where he is currently a Senior Scientist. His current research interests include wireless communications and networking, with emphasis on human-level and artificial intelligence for resource awareness in next-generation communication networks.

Honggang Zhang (M’01–SM’11) is a Full Professor with the College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou, China. He is an Honorary Visiting Professor with the University of York, York, U.K. He was the International Chair Professor of Excellence for Université Européenne de Bretagne, Rennes, France, and Supélec, Gif-sur-Yvette, France. He co-authored Cognitive Communications: Distributed Artificial Intelligence (DAI), Regulatory Policy and Economics, Implementation (Wiley, 2012) and edited Green Communications: Theoretical Fundamentals, Algorithms and Applications (CRC Press, 2012).

Dr. Zhang served as the Chair of the Technical Committee on Cognitive Networks of the IEEE Communications Society from 2011 to 2012. He is currently active in research on green communications and was the Leading Guest Editor of the IEEE Communications Magazine special issues on “Green Communications.”

Celimuge Wu (M’10) received the Ph.D. degree from the University of Electro-Communications, Chofu, Japan, in 2010.

He is currently an Associate Professor with the University of Electro-Communications. His current research interests include vehicular ad hoc networks, the Internet of Things, 5G, and mobile cloud computing.

Dr. Wu is/has been the TPC Co-Chair of Wireless Days 2019 and ICT-DM 2018, and a Track Co-Chair of many international conferences, including ICCCN 2019, PIMRC 2016, IEEE ISC2 2017, ISNCC 2017, and WICON 2016. He is/has been serving as an Associate Editor for IEICE Transactions on Communications and as a Guest Editor for the IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE, IEEE Computational Intelligence Magazine, ACM/Springer Mobile Networks and Applications, Sensors (MDPI), and Mobile Information Systems (Hindawi).

Shiwen Mao (S’99–M’03–SM’09) received the Ph.D. degree in electrical and computer engineering from Polytechnic University, Brooklyn, NY, USA, in 2004.

He is a Samuel Ginn Distinguished Professor and the Director of the Wireless Engineering Research and Education Center, Auburn University, Auburn, AL, USA. His current research interests include wireless networks and multimedia communications.

Dr. Mao was a recipient of the Auburn University Creative Research and Scholarship Award in 2018, the 2017 IEEE ComSoc ITC Outstanding Service Award, the 2015 IEEE ComSoc TC-CSR Distinguished Service Award, the 2013 IEEE ComSoc MMTC Outstanding Leadership Award, and the NSF CAREER Award in 2010. He was a co-recipient of the 2017 Best Conference Paper Award of IEEE ComSoc MMTC, the IEEE SECON 2017 Best Demo Award, the Best Paper Awards from IEEE GLOBECOM 2016 and 2015, IEEE WCNC 2015, and IEEE ICC 2013, and the 2004 IEEE Communications Society Leonard G. Abraham Prize in the field of communications systems. He is a Distinguished Lecturer of the IEEE Vehicular Technology Society in the Class of 2014. He is on the Editorial Board of the IEEE TRANSACTIONS ON MOBILE COMPUTING, the IEEE TRANSACTIONS ON MULTIMEDIA, the IEEE INTERNET OF THINGS JOURNAL, IEEE MULTIMEDIA, and ACM GetMobile.

Yusheng Ji (M’94–SM’18) received the B.E., M.E., and D.E. degrees in electrical engineering from the University of Tokyo, Tokyo, Japan.

She joined the National Center for Science Information Systems, Tokyo, Japan, in 1990. She is currently a Professor with the National Institute of Informatics, Tokyo, Japan, and SOKENDAI (the Graduate University for Advanced Studies), Hayama, Japan. Her current research interests include network architecture, mobile computing, and network resource management.

She is/has been an Editor of the IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, an Associate Editor of IEICE Transactions and the IPSJ Journal, a Guest Editor-in-Chief, Guest Editor, and Guest Associate Editor of special issues of IEICE Transactions and the IPSJ Journal, the Symposium Co-Chair of IEEE GLOBECOM 2012 and 2014, the Track Chair of IEEE VTC 2016 Fall and 2017 Fall, the General Co-Chair of ICT-DM 2018, and a TPC Member of IEEE INFOCOM, ICC, GLOBECOM, WCNC, and VTC.

Mehdi Bennis (S’07–A’08–SM’15) received the M.Sc. degree in electrical engineering jointly from the École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland, and the Eurecom Institute, Biot, France, in 2002, and the Ph.D. degree in spectrum sharing for future mobile cellular systems from the University of Oulu, Oulu, Finland, in 2009.

He is currently an Associate Professor with the University of Oulu and an Academy of Finland Research Fellow. He has co-authored one book and published over 200 research papers in international conferences, journals, and book chapters. His main research interests are radio resource management, heterogeneous networks, game theory, and machine learning in 5G networks and beyond.

Dr. Bennis was the recipient of the prestigious 2015 Fred W. Ellersick Prize from the IEEE Communications Society, the 2016 Best Tutorial Prize from the IEEE Communications Society, the 2017 EURASIP Best Paper Award for the Journal of Wireless Communications and Networks, and the All-University of Oulu Research Award.
