
Hybrid Imitation Learning for Real-Time Service Restoration in Resilient Distribution Systems

Yichen Zhang, Member, IEEE, Feng Qiu, Senior Member, IEEE, Tianqi Hong, Member, IEEE, Zhaoyu Wang, Senior Member, IEEE, Fangxing (Fran) Li, Fellow, IEEE

Abstract—Self-healing capability is a critical factor for a resilient distribution system, which requires intelligent agents to automatically perform service restoration online, including network reconfiguration and reactive power dispatch. This paper proposes an imitation learning framework for training such an agent, in which the agent interacts with an expert built on a mixed-integer program to learn its optimal policy, and therefore significantly improves training efficiency compared with exploration-dominant reinforcement learning methods. This improved training efficiency makes the training problem under N − k scenarios tractable. A hybrid policy network is proposed to handle tie-line operations and reactive power dispatch simultaneously to further improve the restoration performance. The 33-bus and 119-bus systems with N − k disturbances are employed to conduct the training. The results indicate that the proposed method outperforms traditional reinforcement learning algorithms such as the deep Q-network.

Index Terms—Service restoration, imitation learning, reinforcement learning, mixed-integer program, resilient distribution system.

NOMENCLATURE

Indices and Sets

$t$, $\mathcal{T}$, $T$: index, index set, number of steps
$h$, $\mathcal{V}_P$, $N_P$: index, index set, number of points of common coupling
$i/j$, $\mathcal{V}_B$, $N_B$: index, index set, number of buses
$k$, $\mathcal{V}_{SC}$, $N_{SC}$: index, index set, number of shunt capacitors
$l$, $\mathcal{E}_L$, $N_L$: index, index set, number of lines
$m$, $\mathcal{N}_S$, $N_S$: index, index set, number (if countable) of states
$n$, $\mathcal{N}_A$, $N_A$: index, index set, number (if countable) of actions

Continuous Decision Variables

$P^{PCC}_{h,t}$: active power injection at point of common coupling $h$ during step $t$
$Q^{PCC}_{h,t}$: reactive power injection at point of common coupling $h$ during step $t$
$V_{i,t}$: voltage of bus $i$ during step $t$

This work was supported by the Advanced Grid Modeling program of the U.S. Department of Energy.

Y. Zhang, F. Qiu, and T. Hong are with the Energy Systems Division, Argonne National Laboratory, Lemont, IL 60439 USA (email: [email protected], [email protected], [email protected]).

Z. Wang is with the ECpE Department, Iowa State University, Ames, IA 50011 USA (email: [email protected]).

F. Li is with the Department of EECS, University of Tennessee, Knoxville, TN 37996 USA (email: fl[email protected]).

The source code of the implementation will be available at https://github.com/ANL-CEEESA/IntelliHealer.

$Q^{SC}_{k,t}$: reactive power output of shunt capacitor $k$ during step $t$
$P_{l,t}$, $Q_{l,t}$: active, reactive power flow on line $l$ during step $t$
$\Delta^{SC}_{k,t}$: incremental change of shunt $k$ from $t-1$ to $t$

Discrete Decision Variables

$u^{L}_{l,t}$: status of line $l$ during step $t$: 1 closed and 0 otherwise
$a^{T}_{l,t}$: action decision of tie-line $l$ during step $t$: 1 to be closed and 0 otherwise
$u^{SC}_{k,t}$: status of shunt capacitor $k$ during step $t$: 1 active and 0 otherwise
$u^{D}_{i,t}$: connection status of demand at bus $i$ during step $t$: 1 connected and 0 otherwise
$u^{R}_{i,j,t}$: indication of whether bus $i$ is the parent bus of $j$: 1 true and 0 false

Parameters

$P^{D}_{i}$, $Q^{D}_{i}$: active, reactive power demand at bus $i$
$\underline{P}_l$, $\overline{P}_l$: min, max active power flow of line $l$
$\underline{Q}_l$, $\overline{Q}_l$: min, max reactive power flow of line $l$
$\underline{Q}^{SC}_k$, $\overline{Q}^{SC}_k$: min, max reactive power output of shunt capacitor $k$
$\varepsilon$: allowable voltage deviation from nominal value

I. INTRODUCTION

Natural disasters can cause random line damages in distribution systems. Distribution system restoration (DSR) is one of the most critical factors to ensure power grid resilience. The objective of DSR is to search for alternative paths to re-energize the loads in out-of-service areas through a series of switching operations. Typical distribution systems have normally closed sectionalizing switches and normally open tie switches. When a fault is identified, the restoration plan will use tie switches to reconfigure the network so that the disrupted customers can be connected to available feeders [1].

Nowadays, there is an increasing demand to automate the decision-making process of network reconfiguration and DSR, or so-called self-healing. The self-healing capability is considered one of the most critical factors for a resilient distribution system to reduce the customer minutes of interruption (CMI) as well as other associated indices like the system average interruption duration index (SAIDI), avoiding high interruption costs. For example, the self-healing technology IntelliTeam¹ developed by S&C Electric Company results in zero CMI and zero SAIDI, while manual restoration usually results in a 90000-minute CMI and a 43-minute SAIDI. Under extreme events, outages occur more frequently, and the self-healing strategy becomes even more effective in reducing the interruption cost and the overall decision-making complexity. The thrust of DSR automation is the intelligent agent and its built-in policy mapping from different faulty scenarios to corresponding optimal restorative actions. The methods for building such policies can be categorized into two major types: predefined or reactive.

¹ https://www.sandc.com/en/solutions/self-healing-grids/

The reactive policy requires the agent to solve DSR online once the faulty condition is received. An overview can be found in [2]. Currently, most methods rely on mathematical programming (MP). Graph theory was used to search for an alternative topology after a fault in [3]. Two novel MP formulations were proposed in [4] and [5], respectively, to sectionalize the distribution system into self-sustained microgrids during blackouts. A two-stage heuristic solution was proposed in [6] to optimize the path from a microgrid source to the critical loads. Multiple distributed generators were incorporated and optimized with a similar objective in [7]. Ref. [8] employed particle swarm optimization (PSO) to solve the shipboard reconfiguration optimization problem. Multi-step optimization formulations were used for restoration in [9] and [10]. In [11], network operation and repair crew dispatch were co-optimized. A relaxed AC power flow formulation was proposed for unbalanced system restoration in [12]. A multi-step reconfiguration model with distributed generator (DG) start-up sequences was proposed in [13]. Distributed optimization with a mixed-integer second-order cone programming formulation was presented in [14]. An optimization-based multi-agent framework was employed to form self-sustained islands in [15]. Ref. [16] proposed three different types of agents that solve a multi-objective optimization to facilitate self-healing. Under similar scopes, a convex optimal power flow model and DGs were considered in [17] and [18], respectively. However, these technologies require devices with sophisticated computation architectures. Furthermore, the solution time may not meet the real-time requirement.

The predefined strategy heavily relies on the reinforcement learning (RL) framework to train the policy. The RL framework has been extensively applied to various power system operation problems, including frequency control [19], voltage control [20], energy management [21], economic dispatch [22], and distribution system operation cost reduction via reconfiguration [23]. However, recent work on RL for restoration is limited. Ref. [24] employed the dynamic programming algorithm to compute the exact value function, which is intractable for high-dimensional problems. In Ref. [25], the value function was estimated using the approximate dynamic programming algorithm. Both algorithms, however, require knowledge of the state transition probabilities, which is difficult to obtain in advance. Temporal difference learning methods, such as Q-learning, estimate the empirical state transition probabilities from observations. In Refs. [26] and [27], the Q-learning algorithm with the ε-greedy policy was employed to perform offline training such that the agent can reconfigure the network online. Ref. [28] proposed a mixed online and offline strategy, in which the online restoration plan, either from the agent or from an MP, was adopted based on certain confidence metrics. In offline mode, the agent was also trained using the Q-learning algorithm. Despite these innovations, the aforementioned works have not considered random N − k line outages. This disturbance randomness hampers the application of exploration-dominant algorithms like traditional RL, which is known to converge slowly due to the exploration-exploitation dilemma [29]. In other words, these works rely on random exploration strategies, such as ε-greedy, to locally improve a policy [30]. With additional disturbance randomness, the number of interactions required to learn a policy is enormous, leading to prohibitive cost. Such a capability limitation on handling disturbance randomness significantly impedes deployment in real-world scenarios.

In a nutshell, the major gaps of current research toward DSR automation can be summarized as follows:
• Reactive strategies, such as MP-based methods, need sophisticated computational architectures and carry an overrun risk in real-time execution.
• Predefined strategies, such as RL-based methods, have not considered random N − k line outages, which jeopardizes the self-healing capability. The underlying reason is that traditional RL is not capable of training the policy efficiently under random disturbances.

To overcome this limitation, this paper employs the predefined strategy and proposes an imitation learning (IL) framework for training the restoration agent. The advantage of IL methods is their significantly higher training efficiency, since IL leverages prior knowledge about a problem in terms of expert demonstrations and trains the agent to mimic these demonstrations. With the proposed method, random N − k disturbances become tractable and are considered in this paper to enhance the self-healing capability under various conditions. Its fundamental form consists of training a policy to predict the expert's actions from states in the demonstration data using supervised learning. Here, we leverage well-studied MP-based restoration as the expert. In addition, reconfigured networks may exhibit longer lines and low voltages. Thus, tie-line operations and reactive power dispatch are considered simultaneously to restore more loads. The contributions of this paper are summarized as follows:
• proposing the IL framework to improve training efficiency and reduce the number of agent-environment interactions required to train a policy; we show that the IL algorithms significantly outperform deep Q-learning under random N − k contingencies;
• strategically designing MP-based experts and environments with tailored formulations for the IL algorithms;
• developing a hybrid policy structure and training algorithms to accommodate the mixed discrete and continuous action space.

Concisely, this paper proposes to use the new IL paradigm for training a DSR policy that is capable of handling the training complexity under random disturbances. The proposed solution can successfully handle the central technical requirements of the self-healing capability, that is, being automatic and optimal. It is worth mentioning that the IL framework acts as a bridge between RL-based techniques and MP-based methods, and it offers a way to leverage well-studied MP-based decision-making systems for RL-based automation. The source code of the implementation will be available at https://github.com/ANL-CEEESA/IntelliHealer.

The remainder of this paper is organized as follows. Section II frames the DSR problem as a Markov decision process (MDP). Section III introduces the IL problem and algorithms. Section IV introduces the MP-based experts and environments that the IL algorithms interact with. Section V illustrates the case studies, followed by the conclusion in Section VI.

II. PROBLEM STATEMENT

Let the distribution system be denoted as a graph $\mathcal{G} = (\mathcal{V}_B, \mathcal{E}_L)$, where $\mathcal{V}_B$ denotes all buses (vertices) and $\mathcal{E}_L$ denotes all lines (edges). The bus set is categorized into substation buses $\mathcal{V}_{B,S}$ and non-substation buses $\mathcal{V}_{B,NS}$. The line set is categorized into the non-switchable line set $\mathcal{E}_{L,NS}$ and the tie-line set $\mathcal{E}_{L,T}$. The non-switchable lines cannot be actively controlled unless tripped due to external disturbances. The status of the tie-lines can be controlled through tie-switches to adjust the network configuration.

Assume an $N_{L,NS} - k$ contingency scenario, indicating that $k$ lines from the set $\mathcal{E}_{L,NS}$ are tripped. Without loss of generality, we uniformly sample these $k$ lines from $\mathcal{E}_{L,NS}$ in each scenario (or episode²). Let $\mathcal{E}^{F}_{L,NS}$ be the set of faulty lines and $\mathcal{E}^{NF}_{L,NS}$ be the set of non-faulty lines. The goal of a well-trained agent is to control the tie-lines and shunt capacitors to optimally restore interrupted customers given the post-fault line status.

² The terms scenario and episode are regarded as the same in this paper and will be used interchangeably.

To account for time-dependent processes [13], such as the saturating delays of tie-switches and shunt capacitors, as well as to reduce transients, we consider a multi-step restoration. In each step, only one tie-line is allowed to operate. In addition, closed tie-lines are not allowed to open again. Meanwhile, all shunt capacitors can be dispatched at each step. Based on state-of-the-art industrial self-healing products, such as IntelliTeam from S&C Electric Company and Distribution Feeder Automation (SDFA) from Siemens Industry Inc., the time interval between steps is on the scale of seconds. The specific value can vary between systems due to different topologies, power sources, and devices. Naturally, the step number is set equal to the number of tie-lines $N_{L,T}$ in the system, that is, $\mathcal{T} = [0, 1, \cdots, N_{L,T}]$, where 0 stands for the initial step. A positive integer will be used to label each tie-line action, as shown in Eq. (1). In many scenarios, not all tie-lines are involved. For a step $t \in \mathcal{T}$ where no tie-line action is needed, the action label will be zero. During the restoration process, the network radiality must be maintained, and tie-line operations that violate the radiality constraint will be denied. We formalize the above setting as an episodic finite Markov decision process (EF-MDP) [29]. An EF-MDP $\mathcal{M}$ can be described by a six-tuple $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{D}, p(s'|s,a), r(s,a), T \rangle$, where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ denotes the action space, $\mathcal{D}$ denotes the disturbance space, $p(s'|s,a)$ denotes the state transition probability, $r$ denotes the real-valued reward function, $T$ denotes the number of steps in each episode, and $s', s \in \mathcal{S}$, $a \in \mathcal{A}$. The action space is hybrid, consisting of a discrete action space $\mathcal{A}_T$ for tie-line operations and a continuous action space $\mathcal{A}_C$, where

$$\mathcal{A}_T = [0, 1, \cdots, N_{L,T}] \tag{1}$$

$$\mathcal{A}_C = [\underline{Q}^{C}_{1}, \overline{Q}^{C}_{1}] \cup \cdots \cup [\underline{Q}^{C}_{N_C}, \overline{Q}^{C}_{N_C}] \tag{2}$$

A trajectory can be denoted as

$$\tau = (s_0(d), a_1, s_1, a_2, s_2, \cdots, a_T, s_T) \tag{3}$$

where $s_0(d)$, or $s_0$ for short, is the initial faulty condition due to disturbance $d \in \mathcal{D}$. For actions that violate the radiality constraint, the corresponding transition probability will be zero, and one otherwise.
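For concreteness, the hybrid action space of Eqs. (1) and (2) can be encoded with the standard OpenAI Gym space classes that the environments in this work are built on [34]. The snippet below is a minimal illustrative sketch rather than the actual repository interface; the counts and bounds (five tie-lines and six shunt capacitors with a ±0.2 MVar range, taken from the 33-bus case study in Section V) are assumptions used only for this example.

```python
import numpy as np
from gym import spaces

N_TIE = 5                  # number of tie-lines N_{L,T} (33-bus case study)
N_SHUNT = 6                # number of shunt capacitors
Q_MIN, Q_MAX = -0.2, 0.2   # dispatch range of each shunt capacitor (MVar)

# Discrete action space A_T = [0, 1, ..., N_{L,T}]; label 0 means "no tie-line action".
tie_line_space = spaces.Discrete(N_TIE + 1)

# Continuous action space A_C: one var dispatch set-point per shunt capacitor.
var_dispatch_space = spaces.Box(low=Q_MIN, high=Q_MAX, shape=(N_SHUNT,), dtype=np.float32)

# A hybrid action is a tie-line label plus a vector of shunt set-points.
hybrid_action_space = spaces.Tuple((tie_line_space, var_dispatch_space))

# Example hybrid action: close tie-line 2 and set all shunts to 0.1 MVar.
action = (2, 0.1 * np.ones(N_SHUNT, dtype=np.float32))
assert hybrid_action_space.contains(action)
```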

III. DEEP IMITATION LEARNING

A. Imitation Learning Problem

The IL training process aims to search for a policy $\pi(a|s)$ (a conditional distribution of action $a \in \mathcal{A}$ given state $s \in \mathcal{S}$) from the class of policies $\Pi$ to mimic the expert policy $\pi^*$ [31]. The expert policy is assumed to be deterministic. Without loss of generality, consider a countable state space $\mathcal{S} = [s^1, s^2, \cdots, s^{N_S}]$ with $N_S$ states. Let $\rho_0$ denote the initial distribution of states and $\rho_0(s^m)$ denote the probability of state $s^m$. Let $\rho^{\pi}_t$ denote the distribution of states at time $t$ if the agent executes the policy $\pi$ from step 1 to $t-1$. The law of $\rho^{\pi}_t$ can be computed recursively as follows [31]

$$\rho^{\pi}_{t}(s^m_t) = \sum_{s_{t-1}\in\mathcal{S}} \rho_{t-1}(s_{t-1}) \sum_{a_t\in\mathcal{A}} \pi(a_t|s_{t-1})\, p(s^m_t|s_{t-1}, a_t) \tag{4}$$

Then, the average distribution of states is defined as $\rho^{\pi}(s) = \sum_{t=1}^{T} \rho^{\pi}_{t-1}(s)/T$, which represents the state visitation frequency over $T$ time steps if policy $\pi$ is employed [32].

The 0-1 loss of executing action $a$ in state $s$ with respect to (w.r.t.) the expert policy $\pi^*$ is denoted as follows [33]

to (w.r.t) the expert policy π∗ is denoted as follows [33]

e(s, a) = I(a 6= π∗(s)) (5)

where I(•) is the indicator function. Consider an action at. Ifthis action is different from the action provided by the optimalpolicy a∗ = π∗(s), that is, at 6= a∗, then the loss value equalsone. Otherwise, it equals zero. Intuitively, if an agent is ableto act identically as the optimal policy, the loss will be zero.The expected 0-1 loss of policy π in state s reads as follows

$$e_{\pi}(s) = \mathbb{E}_{a \sim \pi_s}[e(s,a)] \tag{6}$$

The expected $T$-step loss w.r.t. $\pi$ is

$$L(\pi) = \mathbb{E}_{s \sim \rho^{\pi}}[e_{\pi}(s)] \tag{7}$$

The goal is to find a policy $\pi$ that minimizes the expected $T$-step loss $L(\pi)$, that is,

$$\hat{\pi} = \arg\min_{\pi\in\Pi} L(\pi) = \arg\min_{\pi\in\Pi} \mathbb{E}_{s \sim \rho^{\pi}}[e_{\pi}(s)] \tag{8}$$

Note that this objective function is non-convex due to the dependence between the objective parameter $\rho^{\pi}$ and the decision space $\Pi$.
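To make Eqs. (5)-(8) concrete, note that once the states visited by a policy are logged, the expected 0-1 loss reduces to a misclassification rate against the expert's actions. The following is a small conceptual sketch, not code from the implementation; the toy data, including the 37-line status vectors of the 33-bus system, are assumptions used only for illustration.

```python
import numpy as np

def empirical_T_step_loss(policy, states, expert_actions):
    """Estimate L(pi) = E_{s ~ rho^pi}[e_pi(s)] with the 0-1 loss of Eq. (5),
    assuming the visited states are already logged in `states`."""
    losses = [float(policy(s) != a_star) for s, a_star in zip(states, expert_actions)]
    return float(np.mean(losses))

# Toy example: states are line-status vectors, actions are tie-line labels 0..5.
rng = np.random.default_rng(0)
states = [rng.integers(0, 2, size=37) for _ in range(20)]      # 37 lines in the 33-bus system
expert_actions = [int(rng.integers(0, 6)) for _ in range(20)]
constant_policy = lambda s: 0          # a (bad) policy that never operates a tie-line
print(empirical_T_step_loss(constant_policy, states, expert_actions))
```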

Page 4: Hybrid Imitation Learning for Real-Time Service

B. Imitation Learning Algorithm

The most effective form of imitation learning is behavior cloning (BC). In the BC algorithm, summarized in Algorithm 1, trajectories are collected under the expert's policy $\pi^*$, and the IL problem reduces to a supervised learning problem, where the states are the features and the actions are the labels. The objective of BC reads as follows

$$\hat{\pi} = \arg\min_{\pi\in\Pi} \mathbb{E}_{s \sim \rho^{\pi^*}}[e_{\pi}(s)] \tag{9}$$

which disassociates the dependency between the objective parameter and the decision space [33].

The framework paradigm and the basic algorithm flowchart for the IL-based DSR agent are illustrated in Fig. 1 (a) and (b), respectively. Three modules, including the agent, the environment, and the expert, will be built, as shown in Fig. 1 (a). In each training step, the agent will use the expert policy to explore the environment and obtain the optimal trajectory for training. Then, testing will be conducted under a new disturbance with the trained policy network, as illustrated in Fig. 1 (b).

Fig. 1. IL framework and algorithm for training the DSR agent. (a) IL paradigm. (b) IL algorithm flowchart.

The BC algorithm is described in Algorithm 1. Several major functions are explained as follows.
• Expert: Since we are addressing a multi-period scheduling problem, it is difficult to directly obtain an expert mapping $\pi^*$. Therefore, a mixed-integer program (MIP) is employed to obtain the optimal actions. This MIP is specified as an expert solver Expert($s_{t-1}$, $[t, \cdots, T]$), which takes the initial state at $t-1$ and the scheduling interval $[t, \cdots, T]$, and returns the optimal actions $a_t, \cdots, a_T$. The detailed MIP formulation is given in Section IV.
• Act: The DSR environment interacts with the policy through Act. Given a disturbance $d$, the total step number $T$, and the policy (either the learned mapping or the expert solver), Act returns a $T$-step trajectory. More details are described in Algorithm 2.
• Eval: Eval compares the trajectory induced by the learned policy with the optimal one provided by the MP-based method (the expert) and calculates the ratio $r$ between the total restored energy under the learned policy and the optimal restored total energy. This ratio is defined as the performance score of the learned policy in each iteration.

Algorithm 1: Behavior cloning (BC)
Input: expert solver Expert, deep neural network policy $\pi$, neural network training function Train(·,·,·), environment interaction Act(·,·,·), disturbance set $\mathcal{D}$, stochastic sampling function Sample(·), policy evaluation function Eval(·,·)
1: $X \leftarrow \emptyset$  // initialize the input
2: $Y \leftarrow \emptyset$  // initialize the label
3: $P \leftarrow \emptyset$  // initialize the performance
4: $\pi^1 \in \Pi$  // initialize the policy
5: for $i \leftarrow 1$ to $N$ do
6:   $d \leftarrow$ Sample($\mathcal{D}$)
7:   $(s_0, a_1, s_1, \cdots, a_T, s_T) \leftarrow$ Act($d$, $T$, Expert)
8:   $X \leftarrow X \cup (s_0, \cdots, s_{T-1})$
9:   $Y \leftarrow Y \cup (a_1, \cdots, a_T)$
10:  $\pi^{i+1} \leftarrow$ Train($X$, $Y$, $\pi^i$)
11:  $d \leftarrow$ Sample($\mathcal{D}$)
12:  $r \leftarrow$ Eval(Act($d$, $T$, Expert), Act($d$, $T$, $\pi^{i+1}$))
13:  $P \leftarrow P \cup (d, r)$
14: end for
Output: trained deep neural network policy $\pi$, performance scores $P$
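A compact Python rendering of Algorithm 1 is sketched below. The helper callables (act, train, evaluate, sample_disturbance, and the expert object) mirror the pseudocode, but their names and signatures are assumptions for illustration rather than the repository's actual API.

```python
def behavior_cloning(expert, policy, act, train, evaluate, sample_disturbance,
                     n_episodes, T):
    """Behavior cloning (Algorithm 1): collect expert trajectories and fit the
    policy by supervised learning after every episode."""
    X, Y, scores = [], [], []
    for _ in range(n_episodes):
        d = sample_disturbance()                  # random N-k disturbance
        traj = act(d, T, expert)                  # (s0, a1, s1, ..., aT, sT)
        states, actions = traj[0::2], traj[1::2]
        X.extend(states[:-1])                     # features: s0 ... s_{T-1}
        Y.extend(actions)                         # labels:   a1 ... a_T
        policy = train(X, Y, policy)              # supervised update (Line 10)
        d_test = sample_disturbance()
        r = evaluate(act(d_test, T, expert), act(d_test, T, policy))
        scores.append((d_test, r))
    return policy, scores
```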

Algorithm 2 runs either the learned policy or the expert solver on the DSR environment Env to obtain the trajectory. The DSR environment Env is built on the standard OpenAI Gym environment template [34]. There are two major functions: Env.Reset and Env.Step.
• Env.Reset generates a certain number of line outages, computes the initial system status under the line outages using Eq. (23), and updates the system states and actions.
• Env.Step receives a tie-line action, solves the MIP in Eq. (24) to obtain the new system state, and updates the system states and actions. If the action violates the radiality or other technical constraints and results in infeasibility, the action is denied and the state is not updated. The reward is computed based on the restored loads. Note that the rewards are specifically calculated for RL, since IL does not need rewards. A minimal environment skeleton is sketched after this list.
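Under these two functions, the environment can be organized as a standard Gym class. The skeleton below is a hedged sketch of that structure; the class name and the helper routines solve_initial_condition (the MIP of Eq. (23)) and solve_restoration_step (the MIP of Eq. (24)) are hypothetical placeholders, not the repository's actual code.

```python
import gym
from gym import spaces

class TieLineRestorationEnv(gym.Env):
    """Skeleton of a tie-line-only DSR environment (Env.Reset / Env.Step)."""

    def __init__(self, network, n_tie_lines, solve_initial_condition, solve_restoration_step):
        self.network = network
        self.action_space = spaces.Discrete(n_tie_lines + 1)     # Eq. (1); 0 = no action
        self.solve_initial_condition = solve_initial_condition   # MIP of Eq. (23)
        self.solve_restoration_step = solve_restoration_step     # MIP of Eq. (24)
        self.state = None

    def reset(self, faulty_lines):
        # Env.Reset: apply the sampled N-k outage and compute the post-fault state.
        self.state = self.solve_initial_condition(self.network, faulty_lines)
        return self.state

    def step(self, tie_line_action):
        # Env.Step: try to close the requested tie-line by solving Eq. (24).
        new_state, feasible, restored_load = self.solve_restoration_step(
            self.network, self.state, tie_line_action)
        if feasible:                      # infeasible (e.g., non-radial) actions are denied
            self.state = new_state
        reward = restored_load            # used by the RL baselines only; IL ignores it
        return self.state, reward, False, {}
```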

C. Hybrid Policy

The training in Line 10 of Algorithm 1 is a multi-class classification problem, which is not able to handle continuous action spaces. Thus, Algorithm 1 can only be used for automatic tie-line operators. To simultaneously coordinate tie-line operations and reactive power dispatch, we propose a hybrid policy network, as shown in Fig. 2. The action space of the hybrid neural network is mixed continuous and discrete. At the higher level, a single neural network predicts the optimal tie-line action given the measured states.


Algorithm 2: Environment interaction Act
Input: disturbance $d$, time step $T$, policy function or expert solver $f$, DSR environment Env
1: $s_0 \leftarrow$ Env.Reset($d$)
2: if $f ==$ Expert then  // run Env under the expert policy
3:   $(a_1, \cdots, a_T) \leftarrow$ Expert($s_0$, $[1, \cdots, T]$)
4:   for $t \leftarrow 1$ to $T$ do
5:     $s_t \leftarrow$ Env.Step($a_t$)
6:   end for
7: end if
8: if $f == \pi$ then  // run Env under the learned policy
9:   for $t \leftarrow 1$ to $T$ do
10:    $a_t \leftarrow \pi(s_{t-1})$
11:    $s_t \leftarrow$ Env.Step($a_t$)
12:  end for
13: end if
Output: $T$-step trajectory $(s_0, a_1, s_1, \cdots, a_T, s_T)$

Each tie-line action is associated with a neural network for reactive power dispatch. The dispatch ranges associated with individual tie-lines can be a subset of, or the entire, continuous action space. Considering that, under each tie-line operation, the system may admit a different power flow pattern, we attach the entire dispatch space to each tie-line action. It is also worth mentioning that the states used for predicting discrete and continuous actions can be different.

Fig. 2. Discrete-continuous hybrid policy network.
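At inference time, the hybrid policy of Fig. 2 first selects a tie-line action with the discrete policy network and then queries the continuous dispatcher attached to that action. A minimal sketch follows, assuming the networks are callables on NumPy feature vectors; the function and argument names are illustrative assumptions.

```python
import numpy as np

def hybrid_policy(discrete_net, dispatcher_nets, tie_line_features, dispatch_features):
    """Return (tie-line label, shunt var dispatch) from the hybrid policy network.

    discrete_net: maps tie-line features to a probability vector over A_T.
    dispatcher_nets: dict mapping each tie-line label to its dispatch network.
    """
    probs = discrete_net(tie_line_features)            # classifier head (Fig. 2, top)
    a_discrete = int(np.argmax(probs))                  # chosen tie-line action in A_T
    a_continuous = dispatcher_nets[a_discrete](dispatch_features)   # var set-points in A_C
    return a_discrete, a_continuous
```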

The training process for the hybrid policy network is described in Algorithm 3. The additional effort relative to Algorithm 1 is that we also train reactive power dispatchers under each tie-line action. To do this, we first initialize the dispatcher training datasets, as shown in Line 2. In each episode, we group the dispatch commands from the expert hExp based on the tie-line actions, as shown in Lines 11 and 12. The final step in each episode is to train the tie-line operation policy network and the reactive power dispatch policy networks, respectively, as shown in Lines 14 and 15. The hybrid behavior cloning algorithm interacts with an environment that includes both tie-line and reactive power dispatch actions, which is described in Algorithm 4. Algorithm 4 is similar to Algorithm 2 except that the hybrid actions are generated using the hybrid policy, as shown in Lines 10 and 11, and the DSR environment has hybrid actions. The MIP formulation of hEnv will be introduced in Section IV.

Algorithm 3: Hybrid behavior cloning (HBC)
Input: hybrid-action expert solver hExp, tie-line operation policy $\pi$, reactive power dispatch policy under tie-line action $k$ $\pi_k$, tie-line operation policy network training function TrainClf(·,·,·), reactive power dispatch policy network training function TrainReg(·,·,·), hybrid-action environment interaction hAct(·,·,·), disturbance set $\mathcal{D}$, stochastic sampling function Sample(·), policy evaluation function Eval(·,·)
1: $X \leftarrow \emptyset$, $Y \leftarrow \emptyset$
2: $X_k \leftarrow \emptyset$, $Y_k \leftarrow \emptyset$
3: $P \leftarrow \emptyset$
4: $\pi^1 \in \Pi$, $\pi^1_k \in \Pi$
5: for $i \leftarrow 1$ to $N$ do
6:   $d \leftarrow$ Sample($\mathcal{D}$)
7:   $(s_0, a^D_1, a^C_1, s_1, \cdots, a^D_T, a^C_T, s_T) \leftarrow$ hAct($d$, $T$, hExp)
8:   $X \leftarrow X \cup (s_0, \cdots, s_{T-1})$
9:   $Y \leftarrow Y \cup (a^D_1, \cdots, a^D_T)$
10:  for $t \leftarrow 1$ to $T$ do
11:    $X_{a^D_t} \leftarrow X_{a^D_t} \cup s_{t-1}$
12:    $Y_{a^D_t} \leftarrow Y_{a^D_t} \cup a^C_t$
13:  end for
14:  $\pi^{i+1} \leftarrow$ TrainClf($X$, $Y$, $\pi^i$)
15:  $\pi^{i+1}_k \leftarrow$ TrainReg($X_k$, $Y_k$, $\pi^i_k$)
16:  $d \leftarrow$ Sample($\mathcal{D}$)
17:  $r \leftarrow$ Eval(hAct($d$, $T$, hExp), hAct($d$, $T$, $(\pi^{i+1}, \pi^{i+1}_k)$))
18:  $P \leftarrow P \cup (d, r)$
19: end for
Output: trained tie-line operator $\pi$, trained reactive power dispatchers $\pi_k$, performance scores $P$
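The essential extra bookkeeping in HBC (Lines 10-13 of Algorithm 3) is to route each expert var-dispatch command to the training buffer of the dispatcher associated with the executed tie-line action. A short sketch with assumed in-memory buffers:

```python
from collections import defaultdict

def split_hybrid_trajectory(states, discrete_actions, continuous_actions,
                            X, Y, X_k, Y_k):
    """Append one hybrid trajectory to the classifier buffers (X, Y) and to the
    per-tie-line dispatcher buffers (X_k, Y_k), as in Algorithm 3, Lines 8-13."""
    for t, (a_d, a_c) in enumerate(zip(discrete_actions, continuous_actions)):
        s_prev = states[t]            # s_{t-1} in the paper's indexing
        X.append(s_prev)              # classifier features
        Y.append(a_d)                 # classifier label: tie-line action a^D_t
        X_k[a_d].append(s_prev)       # dispatcher-k features
        Y_k[a_d].append(a_c)          # dispatcher-k regression target a^C_t
    return X, Y, X_k, Y_k

# Usage: the per-action buffers can simply be defaultdicts keyed by the tie-line label.
X, Y = [], []
X_k, Y_k = defaultdict(list), defaultdict(list)
```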

Algorithm 4: Hybrid-action environment interaction hAct
Input: disturbance $d$, time step $T$, policy function or expert solver $f$, hybrid-action DSR environment hEnv
1: $s_0 \leftarrow$ hEnv.Reset($d$)
2: if $f ==$ hExp then
3:   $(a^D_1, a^C_1, \cdots, a^D_T, a^C_T) \leftarrow$ hExp($s_0$, $[1, \cdots, T]$)
4:   for $t \leftarrow 1$ to $T$ do
5:     $s_t \leftarrow$ hEnv.Step($(a^D_t, a^C_t)$)
6:   end for
7: end if
8: if $f == (\pi, \pi_k)$ then
9:   for $t \leftarrow 1$ to $T$ do
10:    $a^D_t \leftarrow \pi(s_{t-1})$
11:    $a^C_t \leftarrow \pi_{a^D_t}(s_{t-1})$
12:    $s_t \leftarrow$ hEnv.Step($(a^D_t, a^C_t)$)
13:  end for
14: end if
Output: $T$-step trajectory $(s_0, a^D_1, a^C_1, s_1, \cdots, a^D_T, a^C_T, s_T)$


IV. MATHEMATICAL PROGRAMMING-BASED EXPERT AND ENVIRONMENT

This section describes the MIP formulations for the experts and environments. We first introduce the generic constraints of the DSR problem. Then, Expert, Env.Reset, Env.Step, hExp, hEnv.Reset, and hEnv.Step are introduced in Eqs. (22), (23), (24), (25), (26), and (27), respectively.

Let $L(\cdot, i)$ denote the set of lines for which bus $i$ is the to-bus, and $L(i, \cdot)$ denote the set of lines for which bus $i$ is the from-bus. Let $\mu(l)$ and $\nu(l)$ map from the index of line $l$ to the index of its from-bus and to-bus, respectively. The nature of radiality guarantees that $\mu(l)$ and $\nu(l)$ are one-to-one mappings. Let $\mathcal{P}$ map from the index of bus $i$ to the substation index. Without loss of generality, we consider one active substation and assume Bus 1 is connected to it. Let $\mathcal{C}$ map from the index of bus $i$ to the shunt capacitor. Let $\mathcal{T} = [t_0, t_1, \cdots, T]$ be the step index set and $t \in \mathcal{T}$.

Following the convention in [35] and [11], linearized DistFlow equations are employed to represent the power flows and voltages in the network and are described as follows

$$\sum_{\forall l \in L(\cdot,i)} P_{l,t} + \sum_{\forall h \in \mathcal{P}(i)} P^{PCC}_{h,t} = \sum_{\forall l \in L(i,\cdot)} P_{l,t} + u^{D}_{i,t} P^{D}_{i,t} \quad \forall i, \forall t$$
$$\sum_{\forall l \in L(\cdot,i)} Q_{l,t} + \sum_{\forall h \in \mathcal{P}(i)} Q^{PCC}_{h,t} + \sum_{\forall k \in \mathcal{C}(i)} Q^{SC}_{k,t} = \sum_{\forall l \in L(i,\cdot)} Q_{l,t} + u^{D}_{i,t} Q^{D}_{i,t} \quad \forall i, \forall t \tag{10}$$
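Since the implementation formulates these models in Pyomo [38], a hedged sketch of the active power balance row of Eq. (10) is given below. The set and parameter names (LINES_TO, LINES_FROM, PCC_AT, P_demand, and so on) are illustrative assumptions and not the repository's actual model components.

```python
from pyomo.environ import Constraint

def add_active_power_balance(model):
    """Attach the active power balance (first row of Eq. (10)) to a populated
    Pyomo model; sets and variables such as LINES_TO, LINES_FROM, PCC_AT,
    P, P_pcc, u_D, and P_demand are assumed to already exist on the model."""
    def rule(m, i, t):
        inflow = sum(m.P[l, t] for l in m.LINES_TO[i])        # lines whose to-bus is i
        injection = sum(m.P_pcc[h, t] for h in m.PCC_AT[i])   # PCC (substation) injections at i
        outflow = sum(m.P[l, t] for l in m.LINES_FROM[i])     # lines whose from-bus is i
        demand = m.u_D[i, t] * m.P_demand[i]                  # picked-up demand u^D_{i,t} P^D_i
        return inflow + injection == outflow + demand
    model.active_power_balance = Constraint(model.BUSES, model.STEPS, rule=rule)
```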

The line flows should respect the limits, which are enforced to be zero if the line is open

$$u^{L}_{l,t}\underline{P}_l \le P_{l,t} \le u^{L}_{l,t}\overline{P}_l \quad \forall l, \forall t$$
$$u^{L}_{l,t}\underline{Q}_l \le Q_{l,t} \le u^{L}_{l,t}\overline{Q}_l \quad \forall l, \forall t \tag{11}$$

The shunt capacitor outputs should also respect the limits, which are enforced to be zero if the capacitor is deactivated

$$u^{SC}_{k,t}\underline{Q}^{SC}_{k} \le Q^{SC}_{k,t} \le u^{SC}_{k,t}\overline{Q}^{SC}_{k} \quad \forall k, \forall t \tag{12}$$

The linear relation between voltages and line flows needs to be enforced when line $l$ is closed

$$(u^{L}_{l,t} - 1)M \le V_{\nu(l),t} - V_{\mu(l),t} + \frac{R_l P_{l,t} + X_l Q_{l,t}}{V_1} \quad \forall l, \forall t$$
$$(1 - u^{L}_{l,t})M \ge V_{\nu(l),t} - V_{\mu(l),t} + \frac{R_l P_{l,t} + X_l Q_{l,t}}{V_1} \quad \forall l, \forall t \tag{13}$$

The voltages should be maintained within permissible ranges

$$1 - \varepsilon \le V_{i,t} \le 1 + \varepsilon \quad \forall i, \forall t \tag{14}$$

The radiality constraints are expressed as follows [36]

$$u^{R}_{\mu(l),\nu(l),t} + u^{R}_{\nu(l),\mu(l),t} = u^{L}_{l,t} \quad \forall l, \forall t$$
$$u^{R}_{i,j,t} = 0 \quad \forall i, \forall j \in \mathcal{V}_{B,S}, \forall t$$
$$\sum_{i \in N_B} u^{R}_{i,j,t} \le 1 \quad \forall j, \forall t \tag{15}$$

It is worth mentioning that a radial network with several islands will still satisfy Eq. (15) [37]. This could be a severe flaw during normal operation. However, since $N - k$ scenarios are considered in this paper, islands may appear in certain scenarios, and radiality is then enforced in each island, including the main energized branch. Within all non-switchable lines $\mathcal{E}_{L,NS}$, the status of the faulty lines $\mathcal{E}^{F}_{L,NS}$ is enforced to be zero and the status of the non-faulty lines $\mathcal{E}^{NF}_{L,NS}$ is enforced to be one

$$u^{L}_{l,t} = 0 \quad \forall l \in \mathcal{E}^{F}_{L,NS}, \forall t$$
$$u^{L}_{l,t} = 1 \quad \forall l \in \mathcal{E}^{NF}_{L,NS}, \forall t \tag{16}$$

For a multi-step scenario, the restored loads are not allowed to be disconnected again

$$u^{D}_{i,t} \ge u^{D}_{i,t-1} \quad \forall i, \forall t \setminus \{t_0\} \tag{17}$$

Similarly, closed tie-lines cannot be opened

$$u^{L}_{l,t} \ge u^{L}_{l,t-1} \quad \forall l \in \mathcal{E}_{L,T}, \forall t \setminus \{t_0\} \tag{18}$$

In addition, only one tie-line can be operated in one step

$$\sum_{l \in \mathcal{E}_{L,T}} u^{L}_{l,t} - \sum_{l \in \mathcal{E}_{L,T}} u^{L}_{l,t-1} \le 1 \quad \forall t \setminus \{t_0\} \tag{19}$$

And all tie-line statuses at the initial step equal their initial values

$$u^{L}_{l,t_0} = u^{L}_{l} \quad \forall l \in \mathcal{E}_{L,T} \tag{20}$$
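Constraints (17)-(19), which keep restored loads energized, keep closed tie-lines closed, and allow at most one tie-line operation per step, translate directly into Pyomo index rules, as sketched below under the assumption that the steps are consecutive integers starting at $t_0$ (component names are illustrative).

```python
from pyomo.environ import Constraint

def add_switching_logic(model, t0):
    """Sketch of Eqs. (17)-(19); the initial step t0 is skipped explicitly."""
    def load_monotone(m, i, t):
        return Constraint.Skip if t == t0 else m.u_D[i, t] >= m.u_D[i, t - 1]     # Eq. (17)
    def tie_monotone(m, l, t):
        return Constraint.Skip if t == t0 else m.u_L[l, t] >= m.u_L[l, t - 1]     # Eq. (18)
    def one_action_per_step(m, t):
        if t == t0:
            return Constraint.Skip
        return sum(m.u_L[l, t] - m.u_L[l, t - 1] for l in m.TIE_LINES) <= 1       # Eq. (19)
    model.load_monotone = Constraint(model.BUSES, model.STEPS, rule=load_monotone)
    model.tie_monotone = Constraint(model.TIE_LINES, model.STEPS, rule=tie_monotone)
    model.one_action_per_step = Constraint(model.STEPS, rule=one_action_per_step)
```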

In some instances, there will be multiple shunt capacitor dispatch solutions for an optimal load restoration, and the shunt dispatch results will jump between these solutions within an episode. This jeopardizes a smooth learning process. Therefore, a set of constraints is considered to limit the dispatch frequency

$$-M z_{k,t} \le Q^{SC}_{k,t} - Q^{SC}_{k,t-1} \le M z_{k,t} \tag{21a}$$
$$-M(1 - z_{k,t}) \le \Delta^{SC}_{k,t} - (Q^{SC}_{k,t} - Q^{SC}_{k,t-1}) \tag{21b}$$
$$M(1 - z_{k,t}) \ge \Delta^{SC}_{k,t} - (Q^{SC}_{k,t} - Q^{SC}_{k,t-1}) \tag{21c}$$
$$-M z_{k,t} \le \Delta^{SC}_{k,t} + (Q^{SC}_{k,t} - Q^{SC}_{k,t-1}) \tag{21d}$$
$$M z_{k,t} \ge \Delta^{SC}_{k,t} + (Q^{SC}_{k,t} - Q^{SC}_{k,t-1}) \tag{21e}$$
$$\forall k, \forall t \setminus \{t_0\} \tag{21f}$$

where we introduce two slack variables: $\Delta^{SC}_{k,t}$ is a continuous variable that expresses the incremental change of shunt capacitor $k$ from time $t-1$ to $t$, and $z_{k,t}$ is a binary variable that denotes whether there is an incremental change of shunt capacitor $k$ from time $t-1$ to $t$. Eq. (21a) enforces $z_{k,t}$ to be one if $Q^{SC}_{k,t}$ and $Q^{SC}_{k,t-1}$ are different, where $M$ is a big positive number. Eqs. (21b)-(21e) ensure that $\Delta^{SC}_{k,t}$ equals $Q^{SC}_{k,t} - Q^{SC}_{k,t-1}$ if $z_{k,t}$ is one, and that $\Delta^{SC}_{k,t}$ equals zero when $z_{k,t}$ is zero. With this set of constraints, $\Delta^{SC}_{k,t}$ precisely denotes the incremental change and can be minimized in the objective function.
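A Pyomo sketch of the linking constraints (21b)-(21c) is shown below; constraints (21d)-(21e) follow the same pattern with $z_{k,t}$ in place of $1 - z_{k,t}$. The big-M value and the component names are assumptions used only for illustration.

```python
from pyomo.environ import Constraint

def add_shunt_change_linking(model, t0, big_m=1e3):
    """Sketch of Eqs. (21b)-(21c): when z[k,t] = 1 the slack Delta_sc equals the
    change in shunt output; (21d)-(21e) are analogous with z in place of 1 - z."""
    def lower_rule(m, k, t):
        if t == t0:
            return Constraint.Skip
        diff = m.Q_sc[k, t] - m.Q_sc[k, t - 1]
        return -big_m * (1 - m.z[k, t]) <= m.Delta_sc[k, t] - diff   # Eq. (21b)
    def upper_rule(m, k, t):
        if t == t0:
            return Constraint.Skip
        diff = m.Q_sc[k, t] - m.Q_sc[k, t - 1]
        return big_m * (1 - m.z[k, t]) >= m.Delta_sc[k, t] - diff    # Eq. (21c)
    model.shunt_change_lower = Constraint(model.SHUNTS, model.STEPS, rule=lower_rule)
    model.shunt_change_upper = Constraint(model.SHUNTS, model.STEPS, rule=upper_rule)
```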

The expert solver Expert takes the disturbance $d$ (the set of faulty lines $\mathcal{E}^{F}_{L,NS}$), the initial tie-line statuses $u^{L}_{l}$ for all $l \in \mathcal{E}_{L,T}$, and the step index set $\mathcal{T} = [t_0, t_1, \cdots, T]$ as inputs, and solves the following MIP problem

$$\max \sum_t \sum_i u^{D}_{i,t} P^{D}_{i} \tag{22a}$$
$$\text{subject to } (10)\text{-}(20) \quad \forall t \in \mathcal{T} \tag{22b}$$
$$u^{SC}_{k,t} = 0 \quad \forall k, \forall t \in \mathcal{T} \tag{22c}$$


where (22c) deactivates the shunt capacitors since they are not considered in Expert. The solution provides a series of tie-line statuses $u^{L}_{l,t_0}, u^{L}_{l,t_1}, \cdots, u^{L}_{l,T}$ for $l \in \mathcal{E}_{L,T}$. Then, the optimal tie-line operating actions can be parsed as $a^{L}_{t_1}, \cdots, a^{L}_{T}$.
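Parsing the action labels from the solved statuses amounts to finding, at each step, the tie-line whose status switches from 0 to 1 (at most one per step by Eq. (19)). A small sketch, with an assumed dictionary u_sol keyed by (tie-line label, step):

```python
def parse_tie_line_actions(u_sol, tie_lines, steps):
    """Convert solved tie-line statuses u^L_{l,t} into action labels a_t in A_T.
    u_sol: dict keyed by (tie-line label, step) holding 0/1 status values.
    Returns one label per step t >= 1, where 0 means no tie-line was operated."""
    actions = []
    for prev_t, t in zip(steps[:-1], steps[1:]):
        newly_closed = [l for l in tie_lines
                        if u_sol[l, t] > 0.5 and u_sol[l, prev_t] < 0.5]
        actions.append(newly_closed[0] if newly_closed else 0)  # at most one per step
    return actions

# Example: tie-line 33 closes at the first step and stays closed afterwards.
steps = [0, 1, 2, 3, 4, 5]
tie_lines = [33, 34, 35, 36, 37]
u_sol = {(l, t): 0 for l in tie_lines for t in steps}
for t in steps[1:]:
    u_sol[33, t] = 1
print(parse_tie_line_actions(u_sol, tie_lines, steps))  # [33, 0, 0, 0, 0]
```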

The Env.Reset function computes the system initial condition given a randomly generated faulty line set $\mathcal{E}^{F}_{L,NS}$

$$\max \sum_t \sum_i u^{D}_{i,t} P^{D}_{i} \tag{23a}$$
$$\text{subject to } (10)\text{-}(16) \quad \forall t \in [t_0] \tag{23b}$$
$$u^{L}_{l,t_0} = 0 \quad \forall l \in \mathcal{E}_{L,T} \tag{23c}$$
$$u^{SC}_{k,t_0} = 0 \quad \forall k \tag{23d}$$

where Eq. (23c) ensures no tie-line actions at this initial stage. The Env.Step function aims to restore the maximal load given the disturbance, a tie-line status, and the load status from the previous step by solving the following problem

$$\max \sum_t \sum_i u^{D}_{i,t} P^{D}_{i} \tag{24a}$$
$$\text{subject to } (10)\text{-}(16), (20) \quad \forall t \in [t_\tau] \tag{24b}$$
$$u^{D}_{t_\tau} \ge u^{D}_{t_\tau - 1} \tag{24c}$$
$$u^{SC}_{k,t_\tau} = 0 \quad \forall k \tag{24d}$$

where $u^{D}_{t_\tau - 1}$ is the load status from the previous step, and Eq. (24c) ensures the restored loads will not be disconnected again.

Similarly, the hybrid-action expert solver hExp solves the following MIP

$$\max \sum_t \Big( \sum_i u^{D}_{i,t} P^{D}_{i} + w \sum_k \Delta^{SC}_{k,t} \Big) \tag{25a}$$
$$\text{subject to } (10)\text{-}(21) \quad \forall t \in \mathcal{T} \tag{25b}$$

where $w$ is the weighting factor. The hybrid-action DSR environment hEnv also considers the reactive power dispatch. The hEnv.Reset function computes the system initial condition given a randomly generated faulty line set $\mathcal{E}^{F}_{L,NS}$

$$\max \sum_t \sum_i u^{D}_{i,t} P^{D}_{i} \tag{26a}$$
$$\text{subject to } (10)\text{-}(16) \quad \forall t \in [t_0] \tag{26b}$$
$$u^{L}_{l,t_0} = 0 \quad \forall l \in \mathcal{E}_{L,T} \tag{26c}$$
$$Q^{SC}_{k,t_0} = 0 \quad \forall k \tag{26d}$$

where Eqs. (26c) and (26d) ensure no restorative actions at this initial stage. The hEnv.Step function aims to restore the maximal load given the disturbance, a tie-line status, a var dispatch command, and the load status from the previous step by solving the following problem

$$\max \sum_t \sum_i u^{D}_{i,t} P^{D}_{i} - \sum_t \sum_k |e^{SC}_{k,t_\tau}| \tag{27a}$$
$$\text{subject to } (10)\text{-}(16), (20) \quad \forall t \in [t_\tau] \tag{27b}$$
$$u^{D}_{t_\tau} \ge u^{D}_{t_\tau - 1} \tag{27c}$$
$$e^{SC}_{k,t_\tau} = Q^{SC}_{k,t_\tau} - \hat{Q}^{SC}_{k} \quad \forall k \tag{27d}$$

where $u^{D}_{t_\tau - 1}$ is the load status from the previous step, and $\hat{Q}^{SC}_{k}$ is the var dispatch command. To avoid dispatch infeasibility due to unenergized islands, the absolute error between the commanded signal $\hat{Q}^{SC}_{k}$ and the actual var dispatch $Q^{SC}_{k,t_\tau}$ is minimized. Equivalent formulations that remove the absolute-value operator are implemented.

V. CASE STUDY

In the numerical experiments, three metrics are considered to evaluate the learning performance: (1) restoration ratio: the ratio between the load restored by the agent and the optimal restorable load determined by the expert; (2) restoration value: the total load restored by the agent in each episode; and (3) success rate: the number of times the agent achieves the optimal restorable load in $N$ episodes. The optimization is formulated using Pyomo [38] (National Technology and Engineering Solutions of Sandia, LLC, U.S.) and solved using IBM ILOG CPLEX 12.8. The deep learning models are built using TensorFlow r1.14.
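These metrics can be computed directly from the per-episode restored loads of the agent and of the expert; the helper below is an illustrative sketch rather than the evaluation code used in the experiments.

```python
import numpy as np

def restoration_metrics(agent_restored, expert_restored, tol=1e-6):
    """agent_restored / expert_restored: per-episode total restored load (MW).
    Returns (restoration ratios, restoration values, success rate)."""
    agent = np.asarray(agent_restored, dtype=float)
    expert = np.asarray(expert_restored, dtype=float)
    ratios = agent / expert                                  # restoration ratio per episode
    values = agent                                           # restoration value per episode
    success_rate = float(np.mean(agent >= expert - tol))     # optimal restorable load reached
    return ratios, values, success_rate

# Example with three episodes:
print(restoration_metrics([18.0, 20.5, 16.0], [20.0, 20.5, 16.0]))
```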

It is worth mentioning that the scalability of the proposed method is demonstrated from both the disturbance complexity and the system size perspectives. The disturbance complexity determines the number of randomly tripped lines in each episode. The agent will encounter a larger number of different topologies if more lines are randomly tripped. The system size verifies whether the method can handle considerable input sizes and larger action spaces.

A. 33-Bus System

The 33-bus system in [39] is employed for the first case study. It is a radial 12.66 kV distribution network, shown in Fig. 3. Detailed network data can be found in [39]. In this system, there are five tie-lines, which are assumed to be open in the initial phase. Six shunt capacitors are assumed to be deployed at the gray nodes in Fig. 3. The dispatch ranges of all shunt capacitors are from -0.2 to 0.2 MVar. We assume the substation voltage is 1.05 p.u., and the voltage deviation limit is 0.05 p.u.

Fig. 3. The 33-node system with five tie-lines. The gray nodes are assumed to be equipped with shunt capacitors.

1) Policy Network and Feature Selection: Based on the system structure, the policy networks are shown in Fig. 4. The tie-line operation policy network consists of three hidden layers. We use rectified linear units (ReLU) as the activation functions. For tie-line operation, the connectivity of the system is essential, and thus the feature inputs are the line statuses. The shunt capacitor dispatch policy network has four hidden layers. For this network, the load status and the real-valued power flows are considered as feature inputs to extract the reactive power effects on the local voltages and load pick-up. Two types of activation functions are also compared.


Fig. 4. Deep neural network based policy networks. (a) Tie-line operation policy network. (b) Shunt capacitor dispatch policy network.
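The two networks of Fig. 4 can be written in a few lines of Keras. The sketch below uses the tf.keras API (the paper's models are built with TensorFlow r1.14, so the exact API may differ); the hidden-layer widths are read from Fig. 4 (64-128-64 for the tie-line network and 64-128-128-64 for the shunt network) and the tanh output follows Fig. 4(b), while the softmax classification head and the scaling to the ±0.2 MVar dispatch range are assumptions for illustration.

```python
import tensorflow as tf

def build_tie_line_policy(n_features, n_tie_actions=6):
    """Tie-line operation policy network (Fig. 4(a)): three ReLU hidden layers
    (64-128-64) and a softmax over the N_{L,T} + 1 tie-line action labels."""
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(n_features,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(n_tie_actions, activation="softmax"),
    ])

def build_shunt_dispatch_policy(n_features, n_shunts=6, q_max=0.2):
    """Shunt capacitor dispatch policy network (Fig. 4(b)): four ReLU hidden layers
    (64-128-128-64) and a tanh output scaled to the +/- q_max MVar dispatch range."""
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(n_features,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(n_shunts, activation="tanh"),
        tf.keras.layers.Lambda(lambda x: q_max * x),
    ])

# The classifier is fit with a cross-entropy loss and each dispatcher with a
# regression loss, matching TrainClf and TrainReg in Algorithm 3.
tie_policy = build_tie_line_policy(n_features=37)   # 37 line statuses in the 33-bus system
tie_policy.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```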

Fig. 5. Training performance of imitation learning (BC) and reinforcement learning (DQN and A2C) under the N − 1 scenario. (a) Restoration ratio. (b) Restoration value.

Fig. 6. Training performance of imitation learning (BC) and reinforcement learning (DQN and A2C) under the N − 5 scenario. (a) Restoration ratio. (b) Restoration value.

Fig. 7. Training performance of BC and HBC under the N − 1 scenario. (a) Restoration ratio. (b) Restoration value.

2) IL vs. RL for Network Reconfiguration under N-1 and N-5 Contingencies: In this subsection, we compare the imitation learning Algorithm 1 with two RL baseline algorithms, the deep Q-network (DQN) and advantage actor-critic (A2C). The DQN and A2C are implemented based on [40]. The N − 1 random contingency is considered first. The total number of training episodes is 200. The restoration ratio and value are shown in Fig. 5 (a) and (b), respectively.

Fig. 8. Training performance of BC and HBC under the N − 2 scenario. (a) Restoration ratio. (b) Restoration value.

As shown, the BC algorithm is able to optimally restore the system after 75 episodes of training, while DQN and A2C achieve only an average 45% restoration ratio over the 200 episodes. The problem complexity due to topology switching is intractable for algorithms that rely heavily on exploration, such as traditional RL.

For further verification, the N − 5 random contingency is applied. Ten thousand training episodes are used. The restoration ratio and value are shown in Fig. 6 (a) and (b), respectively. With the increasing complexity of the problem, the advantage of BC over DQN becomes more significant, as BC achieves more than a 90% restoration ratio while DQN stays at 15%. A2C performs better than DQN in the N − 5 scenario but still underperforms IL, with a 35% restoration-ratio deficit.

3) System Status during Restoration: A particular scenario of the N − 1 contingency is illustrated. In this scenario, Line 3 is damaged and tripped. Once the damage information is transmitted, the agent closes Tie-line 33 in the first step and performs no further actions in the following steps to respect the radiality constraint. The corresponding load energization status and voltage profile are shown in Figs. 9 and 10, respectively. Since Line 3 is upstream in the network, its outage causes de-energization of 60% of the load. Fortunately, with the reconfiguration through Tie-line 33, most of the loads are picked up except for Loads 11, 30, and 33. This is because energizing these loads would violate the voltage security constraint, as shown in Fig. 10, which indicates the necessity of voltage compensation.

Fig. 9. System load energization status under Line 3 outage.

Another scenario of the N − 5 contingency is described, where Lines 5, 6, 10, 14, and 18 are tripped. In response, the agent closes Tie-lines 37, 36, 34, 35, and 33 in sequence. As shown in Fig. 11, several loads can be energized after each step. In the end, ten loads cannot be picked up due to the voltage constraint, although all buses are connected through the tie-lines. The voltage profile is depicted in Fig. 12.


Fig. 10. System voltage profile under Line 3 outage.

Fig. 11. System load energization status under Lines 5, 6, 10, 14, and 18 outages.

4) Hybrid Policy for N-1 and N-2 Contingencies: Under the random N − 1 contingency, the hybrid policy network is trained for 200 episodes. In the var dispatch policy network, two features are considered: the load status and the real-valued power flow. The training performance is illustrated in Fig. 7. All metrics are averaged within five steps. The BC algorithm has a lower variation in the restoration ratio since its task only involves discrete actions and is relatively easier. But with the var dispatch capability, the hybrid agent is able to restore approximately 2 MW more load in each episode, as shown in Fig. 7 (b). As for the features, the real-valued power flow and the load status show similar performance.

A more complicated random N − 2 scenario is considered, and both the BC and HBC agents are trained for 2000 episodes. Similarly, BC has a lower variation in the restoration ratio, particularly once all algorithms achieve a high restoration ratio at around 400 episodes, as shown in Fig. 8. Fig. 8 (b) shows that the HBC agent can restore 2 MW more in each episode, indicating that it is critical to have var support in the resilient setting. The reason lies in the fact that the reconfigured network may have longer feeders when there are more line outages. Therefore, the voltage drops along the reconfigured feeders are more significant.

It is worth noting that with an increasing number of line damages, the usage of shunt capacitors can decrease, since the designated shunt capacitors may end up in one of the unenergized islands. The frequencies of reactive power control from N − 1 to N − 5 are calculated as in Fig. 13. With an increasing number of line outages, the usage of var dispatch first increases, as longer feeders may occur, and then decreases due to the unenergized islands.

B. 119-Bus System

The second case study is demonstrated on the 119-bus system. This system is an 11 kV distribution system with 15 tie-lines, which is particularly suitable for studying network reconfiguration and restoration. Detailed data of the system can be found in [41]. We assume the substation voltage is 1.05 p.u., and the voltage deviation limit is 0.05 p.u.

Fig. 12. System voltage profile under Lines 5, 6, 10, 14, and 18 outages.

Fig. 13. Averaged var dispatch frequencies of all shunt capacitor dispatchers (approximately 30.45%, 34.08%, 36.39%, 34.9%, and 26.67% for the N-1 through N-5 scenarios).

The N − 1 random line outage is considered, and the tie-line operation is used for restoration. The restoration agent is trained for 1000 episodes. The results are illustrated in Fig. 14. The agent is able to achieve an 80% restoration ratio after 400 episodes and an 80% success rate after 600 episodes. Additionally, the effect of the number of integer variables in the MP-based expert on the training performance is studied. The results of the 33-bus system are used for comparison. The episode after which the success rate stays above 80% is considered as a metric, denoted as the confident episode. The results are summarized in Table I. The confident episode of the 33-bus system is 75, while that of the 119-bus system is 600. The ratio of the confident episodes between the 33-bus and 119-bus systems, which is eight, is close to the ratio of tie-line integer numbers between the two cases, which is nine. Intuitively, the integer variables that control the tie-lines have a dominant impact on the IL performance.

Fig. 14. Training performance of BC under the N − 1 scenario. (a) Restoration ratio. (b) Success rate.

TABLE I
EFFECT OF INTEGER NUMBERS IN THE MP-BASED EXPERT ON IL PERFORMANCE

Case    | Integer number | Tie-line integer number | Confident episode
33-bus  | 6960           | 25                      | 75
119-bus | 226800         | 225                     | 600


VI. CONCLUSIONS AND FUTURE WORKS

In this paper, we propose the IL framework and the HBC algorithm for training intelligent agents to perform online service restoration. We strategically design the MP-based experts, which are able to provide optimal restoration actions for the agent to imitate, and a series of MP-based environments that the agents can interact with. Agents trained under the proposed framework master the restoration skills faster and better compared with RL methods. The agent can perform optimal tie-line operations to reconfigure the network and simultaneously dispatch the reactive power of shunt capacitors using the trained policy network. The decision-making process has negligible computation costs and can be readily deployed for online applications. Future efforts will be devoted to feature-extraction capability considering the unique power network structure, as well as a multi-agent training paradigm to incorporate distributed energy resources for the build-up restoration strategy.

REFERENCES

[1] A. Zidan et al., "Fault detection, isolation, and service restoration in distribution systems: state-of-the-art and future trends," IEEE Transactions on Smart Grid, vol. 8, no. 5, pp. 2170–2185, 2017.
[2] F. Shen et al., "Review of service restoration methods in distribution networks," in 2018 IEEE PES Innovative Smart Grid Technologies Conference Europe (ISGT-Europe). IEEE, Oct. 2018, pp. 1–6.
[3] J. Li, X. Y. Ma, C. C. Liu, and K. P. Schneider, "Distribution system restoration with microgrids using spanning tree search," IEEE Transactions on Power Systems, vol. 29, no. 6, pp. 3021–3029, 2014.
[4] Z. Wang and J. Wang, "Self-healing resilient distribution systems based on sectionalization into microgrids," IEEE Transactions on Power Systems, vol. 30, no. 6, pp. 3139–3149, 2015.
[5] C. Chen, J. Wang, F. Qiu, and D. Zhao, "Resilient distribution system by microgrids formation after natural disasters," IEEE Transactions on Smart Grid, vol. 7, no. 2, pp. 958–966, 2016.
[6] H. Gao, Y. Chen, Y. Xu, and C. C. Liu, "Resilience-oriented critical load restoration using microgrids in distribution systems," IEEE Transactions on Smart Grid, vol. 7, no. 6, pp. 2837–2848, 2016.
[7] Y. Xu et al., "Microgrids for service restoration to critical load in a resilient distribution system," IEEE Transactions on Smart Grid, vol. 9, no. 1, pp. 426–437, 2018.
[8] F. Shariatzadeh, N. Kumar, and A. K. Srivastava, "Optimal control algorithms for reconfiguration of shipboard microgrid distribution system using intelligent techniques," IEEE Transactions on Industry Applications, vol. 53, no. 1, pp. 474–482, 2017.
[9] F. Wang et al., "A multi-stage restoration method for medium-voltage distribution system with DGs," IEEE Transactions on Smart Grid, vol. 8, no. 6, pp. 2627–2636, 2017.
[10] B. Chen, C. Chen, J. Wang, and K. L. Butler-Purry, "Multi-time step service restoration for advanced distribution systems and microgrids," IEEE Transactions on Smart Grid, vol. 9, no. 6, pp. 6793–6805, 2018.
[11] A. Arif, Z. Wang, J. Wang, and C. Chen, "Power distribution system outage management with co-optimization of repairs, reconfiguration, and DG dispatch," IEEE Transactions on Smart Grid, vol. 9, no. 5, pp. 4109–4118, 2018.
[12] R. Roofegari Nejad and W. Sun, "Distributed load restoration in unbalanced active distribution systems," IEEE Transactions on Smart Grid, vol. 10, no. 5, pp. 5759–5769, Sep. 2019.
[13] H. Sekhavatmanesh and R. Cherkaoui, "A multi-step reconfiguration model for active distribution network restoration integrating the DG start-up sequences," IEEE Transactions on Sustainable Energy, early access, 2020.
[14] F. Shen et al., "Distributed self-healing scheme for unbalanced electrical distribution systems based on alternating direction method of multipliers," IEEE Transactions on Power Systems, vol. 35, no. 3, pp. 2190–2199, 2020.
[15] A. Sharma, D. Srinivasan, and A. Trivedi, "A decentralized multi-agent approach for service restoration in uncertain environment," IEEE Transactions on Smart Grid, vol. 9, no. 4, pp. 3394–3405, 2018.
[16] E. Shirazi and S. Jadid, "Autonomous self-healing in smart distribution grids using agent systems," IEEE Transactions on Industrial Informatics, vol. 15, no. 12, pp. 6291–6301, 2019.
[17] H. Sekhavatmanesh and R. Cherkaoui, "Distribution network restoration in a multiagent framework using a convex OPF model," IEEE Transactions on Smart Grid, vol. 10, no. 3, pp. 2618–2628, 2019.
[18] W. Li et al., "A full decentralized multi-agent service restoration for distribution network with DGs," IEEE Transactions on Smart Grid, vol. 11, no. 2, pp. 1100–1111, 2020.
[19] C. Chen et al., "Model-free emergency frequency control based on reinforcement learning," IEEE Transactions on Industrial Informatics, early access, 2020.
[20] J. Duan et al., "Deep-reinforcement-learning-based autonomous voltage control for power grid operations," IEEE Transactions on Power Systems, vol. 35, no. 1, pp. 814–817, 2020.
[21] Y. Du and F. Li, "Intelligent multi-microgrid energy management based on deep neural network and model-free reinforcement learning," IEEE Transactions on Smart Grid, vol. 11, no. 2, pp. 1066–1076, 2020.
[22] P. Dai, W. Yu, G. Wen, and S. Baldi, "Distributed reinforcement learning algorithm for dynamic economic dispatch with unknown generation cost functions," IEEE Transactions on Industrial Informatics, vol. 16, no. 4, pp. 2258–2267, 2020.
[23] Y. Gao, W. Wang, J. Shi, and N. Yu, "Batch-constrained reinforcement learning for dynamic distribution network reconfiguration," IEEE Transactions on Smart Grid, early access, 2020.
[24] R. Perez-Guerrero et al., "Optimal restoration of distribution systems using dynamic programming," IEEE Transactions on Power Delivery, vol. 23, no. 3, pp. 1589–1596, 2008.
[25] C. Wang et al., "Markov decision process-based resilience enhancement for distribution systems: an approximate dynamic programming approach," IEEE Transactions on Smart Grid, vol. 11, no. 3, pp. 2498–2510, 2020.
[26] D. Ye, M. Zhang, and D. Sutanto, "A hybrid multiagent framework with Q-learning for power grid systems restoration," IEEE Transactions on Power Systems, vol. 26, no. 4, pp. 2434–2441, 2011.
[27] S. Das et al., "Dynamic reconfiguration of shipboard power systems using reinforcement learning," IEEE Transactions on Power Systems, vol. 28, no. 2, pp. 669–676, May 2013.
[28] M. J. Ghorbani, M. A. Choudhry, and A. Feliachi, "A multiagent design for power distribution systems automation," IEEE Transactions on Smart Grid, vol. 7, no. 1, pp. 329–339, 2016.
[29] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 2018.
[30] C. A. Cheng, "Efficient and principled robot learning: theory and algorithms," Ph.D. dissertation, School of Interactive Computing, Georgia Institute of Technology, Atlanta, GA, 2020.
[31] S. Ross and D. Bagnell, "Efficient reductions for imitation learning," in Proceedings of the International Conference on Artificial Intelligence and Statistics, 2010, pp. 661–668.
[32] W. Sun, "Towards generalization and efficiency in reinforcement learning," Ph.D. dissertation, The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, 2019.
[33] S. Ross, G. Gordon, and D. Bagnell, "A reduction of imitation learning and structured prediction to no-regret online learning," in Proceedings of the International Conference on Artificial Intelligence and Statistics, 2011, pp. 627–635.
[34] G. Brockman et al., "OpenAI Gym," 2016.
[35] Z. Wang et al., "Coordinated energy management of networked microgrids in distribution systems," IEEE Transactions on Smart Grid, vol. 6, no. 1, pp. 45–53, 2015.
[36] R. A. Jabr, R. Singh, and B. C. Pal, "Minimum loss network reconfiguration using mixed-integer convex programming," IEEE Transactions on Power Systems, vol. 27, no. 2, pp. 1106–1115, 2012.
[37] H. Ahmadi and J. R. Martí, "Mathematical representation of radiality constraint in distribution system reconfiguration problem," International Journal of Electrical Power and Energy Systems, vol. 64, pp. 293–299, 2015.
[38] W. E. Hart et al., Pyomo: Optimization Modeling in Python, 2nd ed. Springer Science & Business Media, 2017, vol. 67.
[39] M. Baran and F. Wu, "Network reconfiguration in distribution systems for loss reduction and load balancing," IEEE Transactions on Power Delivery, vol. 4, no. 2, pp. 1401–1407, Apr. 1989.
[40] A. Hill et al., "Stable Baselines," https://github.com/hill-a/stable-baselines, 2018.
[41] D. Zhang, Z. Fu, and L. Zhang, "An improved TS algorithm for loss-minimum reconfiguration in large-scale distribution systems," Electric Power Systems Research, vol. 77, no. 5-6, pp. 685–694, 2007.


Yichen Zhang (S'13–M'18) received the B.S. degree in electrical engineering from Northwestern Polytechnical University, Xi'an, China, in 2010, the M.S. degree in electrical engineering from Xi'an Jiaotong University, Xi'an, China, in 2012, and the Ph.D. degree in electrical engineering from the Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN, USA, in 2018. He was also a research assistant with the Oak Ridge National Laboratory from 2016 to 2018. He is currently a post-doctoral appointee with the Energy Systems Division, Argonne National Laboratory. His research interests include power system dynamics, grid-interactive converters, and control and decision-making for cyber-physical power systems.

Feng Qiu (M'14) received his Ph.D. from the School of Industrial and Systems Engineering at the Georgia Institute of Technology in 2013. He is a principal computational scientist with the Energy Systems Division at Argonne National Laboratory, Argonne, IL, USA. His current research interests include optimization in power system operations, electricity markets, and power grid resilience.

Tianqi Hong (S'13–M'16) received the B.Sc. degree in electrical engineering from Hohai University, China, in 2011, the M.Sc. degrees in electrical engineering from Southeast University, China, and the Engineering School of New York University in 2013, and the Ph.D. degree from New York University in 2016. His main research interests are power system analysis, power electronics systems, microgrids, and electromagnetic design.

Currently, he is an Energy System Scientist at Argonne National Laboratory, IL. Prior to this, he was a Postdoc Fellow at the Engineering School of New York University and a Senior Research Scientist at Unique Technical Services, LLC, responsible for heavy-duty vehicle electrification, battery energy storage integration, and medium-capacity microgrids. Dr. Hong is an active reviewer for multiple international journals in the power engineering area, and he serves as an Editorial Board Member of International Transactions on Electrical Energy Systems and IEEE Transactions on Power Delivery.

Zhaoyu Wang (S'13–M'15–SM'20) is the Harpole-Pentair Assistant Professor with Iowa State University. He received the B.S. and M.S. degrees in electrical engineering from Shanghai Jiaotong University, and the M.S. and Ph.D. degrees in electrical and computer engineering from the Georgia Institute of Technology. His research interests include optimization and data analytics in power distribution systems and microgrids. He is the Principal Investigator for a multitude of projects focused on these topics and funded by the National Science Foundation, the Department of Energy, National Laboratories, PSERC, and the Iowa Economic Development Authority. Dr. Wang is the Chair of the IEEE Power and Energy Society (PES) PSOPE Award Subcommittee, Co-Vice Chair of the PES Distribution System Operation and Planning Subcommittee, and Vice Chair of the PES Task Force on Advances in Natural Disaster Mitigation Methods. He is an editor of IEEE Transactions on Power Systems, IEEE Transactions on Smart Grid, IEEE Open Access Journal of Power and Energy, IEEE Power Engineering Letters, and IET Smart Grid. Dr. Wang has been a recipient of the National Science Foundation (NSF) CAREER Award, the IEEE PES Outstanding Young Engineer Award, and the Harpole-Pentair Young Faculty Award Endowment.

Fangxing (Fran) Li (S'98–M'01–SM'05–F'17) received the B.S.E.E. and M.S.E.E. degrees from Southeast University, Nanjing, China, in 1994 and 1997, respectively, and the Ph.D. degree from Virginia Tech, Blacksburg, VA, USA, in 2001. He is currently the James McConnell Professor with the University of Tennessee, Knoxville, TN, USA. His research interests include deep learning in power systems, renewable energy integration, demand response, power markets, and power system computing. He is presently the Chair of the IEEE Power System Operation, Planning and Economics (PSOPE) Committee and the Editor-in-Chief of IEEE Open Access Journal of Power and Energy. He received the R&D 100 Award (as the project lead) in 2020.