
TECHNICAL REPORT R-64

June, 2020

Can Humans Be out of the Loop?

Junzhe Zhang
Columbia University

[email protected]

Elias Bareinboim
Columbia University

[email protected]

Abstract

Recent advances in Reinforcement Learning have allowed automated agents (for short, agents) to achieve a high level of performance across a wide range of tasks, which, when supplemented with human feedback, has led to faster and more robust decision-making. The current literature, in large part, focuses on the human's role during the learning phase: human trainers possess a priori knowledge that could help an agent accelerate its learning when the environment is not fully known. In this paper, we study an interactive reinforcement learning setting where the agent and the human have different sensory capabilities, disagreeing, therefore, on how they perceive the world (observed states) while sharing the same reward and transition functions. We show that agents are bound to learn sub-optimal policies if they do not take human advice into account, perhaps surprisingly, even when the human's decisions are less accurate than their own. We propose the counterfactual agent, which proactively considers the intended actions of the human operator, and prove that this strategy dominates standard approaches in terms of performance. Finally, we formulate a novel reinforcement learning task where the goal is to maximize the performance of an autonomous system subject to a budget constraint on the available amount of human advice.

1 Introduction

Sequential decision-making plays a central role in the design of modern intelligent systems. In a prototypical setting, for example, an agent is deployed in an unknown environment and needs to learn how to act such that a set of non-trivial goals is achieved. A solution in these settings usually comes in the form of a policy, namely, a function mapping from the agent's past observations (e.g., visited states, realized outcomes) to the actions it should execute at each instant. Human experts might be able to specify a policy in advance for simple tasks, but the policy usually needs to be learned from experimentation in most complex, real-world scenarios. Reinforcement learning (RL) [35] has emerged as the de facto framework to solve this problem, allowing agents to learn optimal policies by using data collected from interactions with the environment. The RL framework has been used throughout many applications and areas, and it is currently the source of many of AI's success stories [37, 32, 24, 20, 31]. One of the major challenges when applying RL methods in practice is the substantial amount of time and data required to find a reasonable, hopefully optimal, policy.

Many efforts have been made to accelerate the learning process by allowing agents to learn by interacting with human trainers. One such approach is to learn from humans' past behaviors, which includes learning from demonstration [4, 8, 3, 15, 42] and inverse reinforcement learning [26, 1]. In general, these techniques learn a policy by taking as input observations of a human performing a task. Another approach that has shown promise is known as interactive reinforcement learning, where the agent is provided in real time with a series of critiques from (possibly non-expert) humans [17, 18, 23]. Techniques developed under this rubric include reward shaping [33, 29, 25, 11], policy shaping [13, 22, 21], and student-teacher training [9, 38, 2].

Preprint. Under review.


This body of literature concerned with incorporating the "human in the loop" focuses on the role of human instructors in the learning phase of the system. By and large, (1) the challenge is that an agent has an incomplete model of the environment (e.g., unknown system dynamics), which (2) can be improved by human trainers who possess a more detailed understanding of the environment and the task at hand. The translation of the trainer's knowledge into insights for the agent would then allow a more aggressive pruning of the search space, naturally leading to accelerated learning rates. This approach suggests that after serving as a data source for learning the model, human subjects have no additional value and, therefore, can be disposed of, i.e., moved "out of the loop."

In this paper, we explore a novel aspect of the role of human instructors in RL, specifically after the automated agent learns a model of the environment. We show, by a simple example, that when the agent and the human have different perceptual capabilities (observed states), while sharing the same reward and transition functions, the agent could still be constrained to learn sub-optimal policies if it does not consider the human's instructions (regardless of the number of interactions allowed, number of demonstrations, etc.). Most closely related to our work, [41] considered a planning problem where the human and the agent repeatedly transfer control of the decisions, thus reaching a performance level not achievable by standard autonomous approaches. Our work, however, explicitly accounts for the capabilities of the human and the agent using the language of causality [27, Ch. 7], which allows us to delineate a large number of instances where simpler solutions are available, yet overlooked in previous work. We propose a new type of agent that can explore human-AI interactions at a much finer granularity using counterfactual reasoning. We prove that the counterfactual agent always dominates the standard agents (i.e., without human advice) in terms of cumulative rewards. When examined more closely, these results suggest a trade-off between autonomy and optimality: while a higher degree of autonomy would be preferred, agents leveraging the human's capabilities could potentially achieve higher performance. We formulate this trade-off as a novel RL problem of maximizing the aggregate rewards of an autonomous system, subject to a budget constraint on the available human advice. Specifically, our contributions are as follows. (1) We demonstrate the conditions under which human input can provide valuable information for agents to achieve optimal performance, even when the model of the environment is fully known. (2) We define the counterfactual agent and identify a class of environments where it can be efficiently optimized using standard MDP algorithms (albeit with changes). (3) We formulate a novel RL problem subject to a budget constraint on available human advice. We show that in Markovian environments, this problem can be reduced to a constrained MDP. The optimal policy, in this case, is computable using a polynomial program.

2 Preliminaries

We introduce in this section the basic notation and definitions used throughout the paper. Variables are denoted by capital letters (e.g., X), and their values by lowercase letters (x). Let D(X) denote the domain of a variable X. Let x^([i,j]) stand for a set of values {x^(i), x^(i+1), ..., x^(j)} (empty if j < i), and X^([i,j]) = x^([i,j]) for a sequence of assignments X^(i) = x^(i), X^(i+1) = x^(i+1), ..., X^(j) = x^(j). We will consistently use the abbreviation P(x) for the probabilities P(X = x), ∀x ∈ D(X), and similarly P(y|x) = P(Y = y | X = x). This paper deals extensively with finite (PO)MDP models as the basis for designing RL algorithms.

Definition 1. A finite POMDP [7] is a 6-tuple ⟨S, X, O, T, R, Ω⟩ in which S is a finite set of states; X a finite set of actions; T is a transition distribution defined as T : S × X × S → [0, 1]; R is a reward function defined as R : S × X → ℝ; O is a finite set of observations; and Ω is an observation function defined as Ω : S × O → [0, 1].

Formally, if states are fully observed (i.e., O = S and Ω(s, s) = 1), the 4-tuple ⟨S, X, T, R⟩ forms an MDP. For simplicity, let T^s_x(s') = T(s, x, s'), R^s_x = R(s, x), and Ω_s(o) = Ω(s, o). An RL system can be modeled as an MDP if it is Markovian, i.e., the result of an action does not depend on the previous actions and states (history), but only on the current state. The Markov property may be violated if observations do not uniquely identify the underlying state.
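As a concrete illustration of Def. 1, the sketch below represents a finite POMDP by its T, R, and Ω tables and performs the standard Bayesian belief update an agent would run when states are only partially observed. This is a minimal sketch with placeholder tables and illustrative names, not code from the paper.

```python
import numpy as np

# Placeholder tables: T[s, x, s'], R[s, x], Omega[s, o] (rows of T and Omega sum to 1).
n_states, n_actions, n_obs = 3, 2, 2
rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(size=(n_states, n_actions))
Omega = rng.dirichlet(np.ones(n_obs), size=n_states)

def belief_update(b, x, o):
    """Bayes filter: b'(s') is proportional to Omega(s', o) * sum_s T(s, x, s') * b(s)."""
    predicted = b @ T[:, x, :]            # push the current belief through the transition
    posterior = Omega[:, o] * predicted   # weight by the likelihood of the observation
    return posterior / posterior.sum()

b = np.full(n_states, 1.0 / n_states)     # uniform initial belief
print(belief_update(b, x=0, o=1))
```

When O = S and Ω(s, s) = 1, the belief collapses onto the true state and the update reduces to the fully observed (MDP) case described above.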

2.1 A Motivating Example

We start the discussion with an example involving the sequential treatment of patients. We show that an agent with fully specified reward and transition functions could be constrained to find a sub-optimal policy if it ignores the advice of the human instructor.


[Figure 1: Causal diagrams for different environmental models: (a) MDPUC, where U^(t) affects only {X^(t), Y^(t), S^(t+1)}; (b) a general DSCM (Def. 2); (c) MDPUC+, where U^(t) = {U^(t,1), U^(t,2)} affects {X^(t), Y^(t), S^(t+1)}, and also S^(t).]

[Figure 2: Simulations comparing the performance of experimental (Aexp) and counterfactual (Actf) agents in offline planning. The x-axis represents the total episodes.]

Specifically, a physician treats each patient who visits the hospital regularly to maintain her long-term health condition. This sequential environment can be represented by an MDP M_MDPUC described in Fig. 1(a). Specifically, the physician measures the patient's corticosteroid level at the t-th visit, S^(t) = s^(t), s^(t) ∈ {0, 1}, where 0 stands for a low and 1 for a high level of corticosteroid. She then decides a treatment X^(t) = x'^(t), x'^(t) ∈ {1, 0} (1 for giving the drug, 0 for not giving it), and then measures an overall health score Y^(t) = y^(t), y^(t) ∈ {1, 0} (i.e., "healthy" and "not healthy"). In reality, the patient's health score Y^(t) is also affected by a pair of confounders U^(t) = {M^(t), E^(t)}, where M^(t) = m^(t) stands for the patient's psychological status (0 for a positive mood, 1 for negative) and E^(t) = e^(t) stands for her socioeconomic status (0 for wealthy, 1 for poor). The patient's long-term health condition can be modeled by the discounted cumulative reward with γ = 0.99. The physician decides a treatment X^(t) = x'^(t) following a policy function x'^(t) = π_h(s^(t), m^(t), e^(t)). The full parametrizations of the transition function T, the reward function R, and the policy π_h are described in the Appendix.

To maximize the patient's long-term health, the hospital's administration aims to automate the decision procedure and deploy an autonomous agent Aexp. The agent learns the fully specified model (functions T and R) from the physician, but it does not have access to the psychological mood M^(t) and the socioeconomic status E^(t) of the patient due to privacy concerns. We solve for this system using both the standard MDP and POMDP planning algorithms and label the resulting policies as exp and exp2, respectively. The administration also considers a simple interactive agent (Actf), which observes the corticosteroid level S^(t) and also takes the physician's decision as advice in real time (labeled as ctf). Two baseline policies are included for comparison: (1) the physician's current policy (called human), and (2) a policy picking the treatment at random (random). The cumulative rewards of all policies are shown in Fig. 2. Perhaps surprisingly, the performance of exp and exp2 coincides with random, indicating that Aexp is unable to learn a reasonable policy, i.e., one better than chance. Moreover, the interactive approach (ctf) manages to outperform all the other policies, despite the fact that the physician (human) performs worse than random guessing.

Puzzling questions arise at this point. First, how could the human advice be useful for the performance of the agent even when it has full knowledge of the reward function R and the world dynamics T? Second, how could we formalize the interactive agent in this novel setting and systematically find its optimal policy? Our goal for the remainder of the paper is to answer these fundamental questions in full generality (i.e., for any decision setting).

3 Modeling Autonomous Systems

We will use structural causal models (SCMs) [27] as the basic semantic framework of our analysis. SCMs allow us to explicitly model the difference in capabilities between the human operator and the autonomous agent, thus providing a novel perspective on the human's role in these settings. We represent the environment of a sequential decision-making system using dynamic SCMs, which describe the causal dynamics of how the world's states evolve over time.


Definition 2 (DSCM). A dynamic structural causal model (DSCM) M is a pair of SCMs ⟨M_1, M_→⟩ [27, Ch. 7] with domains over actions X, observed states S, unobserved states U, and rewards Y. M_1 is an SCM over the initial states S^(1), U^(1), the action X^(1), and the reward Y^(1). M_→ is an SCM describing the process: for t ∈ {2, 3, ...}, (1) U^(t) is the unobserved state whose value u^(t) ∈ U is decided by the function u^(t) = f_u(x^(t-1), s^(t-1), u^(t-1), ε_u^(t)), where ε_u^(t) is an independent error term drawn from P(ε_u); (2) S^(t) is the observed state whose value s^(t) ∈ S is decided by the function s^(t) = f_s(x^(t-1), s^(t-1), u^([t-1,t])); (3) Y^(t) is the observed reward whose value y^(t) ∈ Y is decided by the function y^(t) = f_y(x^(t), s^(t), u^(t)); (4) X^(t) is the action of the human instructor, whose value x^(t) ∈ X is decided by the function x^(t) = π_h(s^(t), u^(t)).

The definition of a DSCM is a direct translation of the POMDP formalism to causal language: the pair (S^(t), U^(t)) represents the underlying state, and S^(t) the partial observation (from an agent's perspective). Each DSCM M is associated with a causal diagram G, which is a directed acyclic graph (DAG) where empty nodes correspond to unobserved variables, solid nodes correspond to observed variables, and edges represent functional (cause-and-effect) relationships (see Fig. 1).
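To make the mechanics of Def. 2 concrete, here is a minimal Python sketch of a DSCM sampler. The structural functions f_u, f_s, and f_y below are placeholders, only π_h is instantiated with the XOR policy of the medical example in the Appendix, and the initial-state sampler is an assumption; none of this code is from the paper.

```python
import random

def sample_initial():                      # M_1: initial S^(1) and U^(1) (assumed uniform)
    return random.randint(0, 1), (random.randint(0, 1), random.randint(0, 1))

def f_u(x_prev, s_prev, u_prev):           # unobserved-state mechanism (placeholder)
    return (random.randint(0, 1), random.randint(0, 1))

def f_s(x_prev, s_prev, u_pair):           # observed-state mechanism (placeholder)
    return random.randint(0, 1)

def f_y(x, s, u):                          # reward mechanism (placeholder)
    return random.randint(0, 1)

def pi_h(s, u):                            # the human instructor's policy (Appendix example)
    m, e = u
    return s ^ m ^ e

def rollout(horizon, gamma=0.99, agent=None, seed=0):
    """Sample one trajectory. With agent=None the human acts (observational regime);
    otherwise we intervene, do(X^(t) = agent(s, x_intended)), overriding pi_h."""
    random.seed(seed)
    s, u = sample_initial()
    ret = 0.0
    for t in range(horizon):
        x_intended = pi_h(s, u)                        # X^(t) as the human would choose it
        x = x_intended if agent is None else agent(s, x_intended)
        ret += (gamma ** t) * f_y(x, s, u)             # Y^(t) = f_y(x^(t), s^(t), u^(t))
        u_next = f_u(x, s, u)                          # U^(t+1)
        s = f_s(x, s, (u, u_next))                     # S^(t+1) may depend on U^([t, t+1])
        u = u_next
    return ret

print(rollout(100))                                    # the human remains in control
print(rollout(100, agent=lambda s, xp: 1 - xp))        # an agent that overrides the intention
```

The distinction between letting π_h act and intervening with do(·) is exactly the distinction exploited by the agents defined next.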

We next model the standard autonomous agents as experimental agents, which interact with the environment through a series of interventions, oblivious to the presence of the human instructor.

Definition 3 (Experimental Agent). An experimental agent Aexp is defined as a policy Π_exp = π^([1,t]). π^(t) is a function deciding the value of X^(t) such that x^(t) = π^(t)(h^(t)_exp), where the experimental history h^(t)_exp = {s^([1,t]), x^([1,t-1])} is the observed history of Aexp up to time t.

Using Def. 3, we can model the interaction between an autonomous agent Aexp and the environment M_DSCM as an intervention on the action nodes X^([1,t]), denoted by do(Π_exp), which sets the value of X^(t) to π^(t)(h^(t)_exp) regardless of how it would ordinarily be determined by the human operator π_h [27, Ch. 7]. The Markov property delineates a set of sequential decision-making environments that can be efficiently solved. We next define the Markov property associated with experimental agents.

Definition 4 (Experimental Markov Property). An environment M is said to be Experimentally Markov (exp-Markov, for short) if and only if for t ≥ 1,

P(s^(t+1)_{x^([1,t])} | s^([1,t])_{x^([1,t-1])}) = P(s^(t+1)_{x^(t)} | s^(t)),    E[Y^(t)_{x^([1,t])} | s^([1,t])_{x^([1,t-1])}] = E[Y^(t)_{x^(t)} | s^(t)],    (1)

where s^([1,t])_{x^([1,t-1])} = {s^(t)_{x^([1,t-1])}, s^(t-1)_{x^([1,t-2])}, ..., s^(1)}.

In Def. 4, S^(t)_{x^([1,t-1])} is the potential response of the state S^(t) to the intervention do(x^([1,t-1])), which can be read as "the value that the state S^(t) would take had X^([1,t-1]) been x^([1,t-1])" [14, 27]. The exp-Markov property says that, for Aexp, the information in S^([1,t]) and the causal influence of the intervention do(x^([1,t])) are best summarized by the current state S^(t) and the intervention do(x^(t)). For example, the MDPUC+ model of Fig. 1(c) is exp-Markov even though the unobserved U^(t) = {U^(t,1), U^(t,2)} could affect S^(t-1), S^(t). For an exp-Markov environment M, the resulting autonomous system ⟨M, Aexp⟩ forms an MDP. The optimal policy Π for an MDP is stationary, i.e., π^(t) depends only on the state s^(t) and is invariant to time t, denoted by π.

3.1 Counterfactual Agents

We will account for the human interaction in the environment using the language of SCMs. A counterfactual agent takes the human's advice as its intended action and adjusts this action using counterfactual reasoning if a better decision is available (the intended action is called intuition in [5, 12]). Given the space constraints, the proofs are included in the Appendix.

Definition 5 (Counterfactual Agent). A counterfactual agent Actf is defined as a policy Π_ctf = π^([1,t]). π^(t) is a function deciding the value of X^(t) such that x^(t) = π^(t)(h^(t)_ctf), where the history h^(t)_ctf = {h^(t)_exp, x'^([1,t])}, and x'^([1,t]) is the sequence of the human's decisions up to time t.

For an agent Actf, the evaluation of the policy Π_ctf is indeed a counterfactual reasoning procedure: "Given that I am about to perform x'^(t) (intended, factual) and the past history, what would the world state and reward be had I done the action x^(t) (contrary to the fact, counterfactual)?" For concreteness, consider a counterfactual agent Actf in the MDPUC model with the goal of maximizing the reward at t = 1. Given the observed state s^(1) and its intended action x'^(1), the agent Actf decides on a new action x^(1) to maximize the expected reward Y^(1), i.e., the counterfactual statement E[Y^(1)_{x^(1)} | s^(1), x'^(1)], which can be read as "the value that the reward Y^(1) would attain had I done x^(1), given that I am about to do x'^(1) and the current state is s^(1)." Note that the values of x'^(1) (factual) and x^(1) (counterfactual) could differ. The comparison between the performance of the experimental and counterfactual agents follows immediately from Def. 5. Next, we establish the superiority of the counterfactual approach in terms of the discounted expected cumulative rewards.

Theorem 1. Let F_exp and F_ctf denote, respectively, the policy domains of Aexp and Actf. Let the optimal policy for Aexp be Π*_exp and for Actf be Π*_ctf. For any DSCM model, V^{Π*_exp} ≤ V^{Π*_ctf}.
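The counterfactual statement above can be evaluated concretely for the medical example parametrized in the Appendix, and the numbers also show why the inequality in Thm. 1 can be strict. Observing s^(1) = 0 together with the physician's intention x'^(1) = 0 reveals, through π_h(s, m, e) = s ⊕ m ⊕ e, that m^(1) ⊕ e^(1) = 0, so (m^(1), e^(1)) is uniform over {(0,0), (1,1)}; reading the corresponding entries of Table 1 gives the following worked computation (our illustration, derived directly from the tables):

\[
E\big[Y^{(1)}_{x^{(1)}=1} \mid s^{(1)}{=}0,\ x'^{(1)}{=}0\big] = \tfrac{1}{2}(0.9 + 0.8) = 0.85,
\qquad
E\big[Y^{(1)}_{x^{(1)}=0} \mid s^{(1)}{=}0,\ x'^{(1)}{=}0\big] = \tfrac{1}{2}(0.2 + 0.3) = 0.25,
\]
\[
\text{whereas}\quad
E\big[Y^{(1)}_{x^{(1)}} \mid s^{(1)}{=}0\big] = \tfrac{1}{4}(0.2 + 0.9 + 0.8 + 0.3) = 0.55 \ \text{ for either value of } x^{(1)}.
\]

An agent that conditions on the intention and flips it therefore expects 0.85, while any agent that ignores the intention is capped at 0.55 in this state, which is exactly the gap observed in Sec. 2.1.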

Theorem 1 confirms the intuition that, since Π*_exp ∈ F_ctf (the counterfactual history always covers its experimental counterpart, i.e., h^(t)_exp ⊆ h^(t)_ctf), counterfactual agents are never worse than experimental ones. The following result provides a graphical condition for when the performance of Actf and Aexp coincides, i.e., when the human advice can be discarded without loss of information.

Theorem 2. For any DSCM model where the human decision is affected only by observable variables (the arrow U^(t) → X^(t) is not present), denoted by DSCM−, V^{Π*_exp} = V^{Π*_ctf}.

Thms. 1 and 2 challenge the somewhat popular belief that the role of human instructors in RL systems is to serve as data sources for agents with partially specified models, and that they are dispensable once the learning procedure is complete. These results imply that (1) it is still better to utilize an interactive approach (Actf) even if the human decision has an adverse effect on its performance (Thm. 1); and (2) the human advice is indispensable if it possesses some critical information about the task (the arrow U^(t) → X^(t) is present), even when the agent has obtained the fully specified world model.

Solving for Counterfactual Agents   We will next discuss methods to systematically compute the optimal policy of counterfactual agents. We first characterize the Markov property associated with the counterfactual agent, as it outlines sufficient statistics for Actf.

Definition 6 (Counterfactual Markov Property). An environment M is said to be Counterfactually Markov (ctf-Markov, for short) if and only if for t ≥ 1,

P(s^(t+1)_{x^([1,t])}, x'^(t+1)_{x^([1,t])} | s^([1,t])_{x^([1,t-1])}, x'^([1,t])_{x^([1,t-1])}) = P(s^(t+1)_{x^(t)}, x'^(t+1)_{x^(t)} | s^(t), x'^(t)),    (2)

E[Y^(t)_{x^([1,t])} | s^([1,t])_{x^([1,t-1])}, x'^([1,t])_{x^([1,t-1])}] = E[Y^(t)_{x^(t)} | s^(t), x'^(t)],    (3)

where x'^([1,t])_{x^([1,t-1])} = {x'^(t)_{x^([1,t-1])}, x'^(t-1)_{x^([1,t-2])}, ..., x'^(1)} and s^([1,t])_{x^([1,t-1])} follows Def. 4.

Compared to the exp-Markov property (Def. 4), Eqs. 2 and 3 condition not only on the observed states s^([1,t])_{x^([1,t-1])} but also on all the past human decisions x'^([1,t]) under the interventions do(x^([1,t-1])). This additional term makes the independence conditions for the ctf-Markov property somewhat more involved: models that are exp-Markov are not necessarily ctf-Markov, as shown in the next example.

Proposition 1. M_MDPUC+ (Fig. 1(c)) is not ctf-Markov.

Prop. 1 confirms that, in general, the ctf-Markov property does not necessarily hold in an exp-Markov environment, i.e., exp-Markov ⇏ ctf-Markov. The ctf-Markov property is satisfiable in some settings, as characterized by the graphical condition illustrated in the next result.

Proposition 2. M_MDPUC (Fig. 1(a)) is ctf-Markov.

For a ctf-Markov environment, the system ⟨M_MDPUC, Actf⟩ forms an MDP where the human decision is included as an additional state variable. Formally, the MDP is constructed as follows:

Proposition 3. ⟨M_MDPUC, Actf⟩ forms an MDP ⟨S_ctf, X_ctf, T, R⟩, where S_ctf = S × X, X_ctf = X, and T^⟨s,x'⟩_x(s', x'') and R^⟨s,x'⟩_x are equal to, respectively,

P(S^(t+1)_{X^(t)=x} = s', X^(t+1)_{X^(t)=x} = x'' | S^(t) = s, X^(t) = x'),    E[Y^(t)_{X^(t)=x} | S^(t) = s, X^(t) = x'].


Prop. 3 follows immediately from the ctf-Markov property (Thm. 2). The transition T and the reward function R are both counterfactual quantities, which can be computed by the procedure provided in [27, Thm. 7.1.7] whenever the SCM is known. Alternatively, they can be empirically estimated using the new counterfactual randomization procedure introduced in [5].
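Since the SCM of the medical example is fully specified in the Appendix, the counterfactual T and R of Prop. 3 can be computed exactly by abduction over the confounders and the augmented MDP then solved by plain value iteration. The following is a minimal sketch with our own variable names, tied to the Appendix's parametrization:

```python
import numpy as np

# P_Y1[(s, m, e, x)] = P(Y=1 | s, m, e, x) from Table 1; P_S0[(s, x)] = P(S'=0 | s, x) from Table 2.
P_Y1 = {(0,0,0,0): 0.2, (0,0,1,0): 0.9, (0,1,0,0): 0.8, (0,1,1,0): 0.3,
        (0,0,0,1): 0.9, (0,0,1,1): 0.2, (0,1,0,1): 0.3, (0,1,1,1): 0.8,
        (1,0,0,0): 0.7, (1,0,1,0): 0.2, (1,1,0,0): 0.1, (1,1,1,0): 0.8,
        (1,0,0,1): 0.2, (1,0,1,1): 0.7, (1,1,0,1): 0.8, (1,1,1,1): 0.1}
P_S0 = {(0,0): 0.9, (1,0): 0.3, (0,1): 0.7, (1,1): 0.8}

states = [(s, xp) for s in (0, 1) for xp in (0, 1)]   # augmented states (s, x')
actions, gamma = (0, 1), 0.99

def consistent_u(s, xp):
    """Abduction: confounders (m, e) compatible with the human intending x' = s ^ m ^ e."""
    return [(m, e) for m in (0, 1) for e in (0, 1) if (s ^ m ^ e) == xp]

R, T = {}, {}
for (s, xp) in states:
    post = consistent_u(s, xp)                        # uniform posterior over compatible (m, e)
    for x in actions:
        R[(s, xp), x] = sum(P_Y1[(s, m, e, x)] for m, e in post) / len(post)
        for (s2, xp2) in states:                      # next intention x'' is uniform given s' here
            p_s2 = P_S0[(s, x)] if s2 == 0 else 1.0 - P_S0[(s, x)]
            T[(s, xp), x, (s2, xp2)] = 0.5 * p_s2

# Standard value iteration on the augmented MDP <S x X, X, T, R>.
V = {st: 0.0 for st in states}
for _ in range(2000):
    V = {st: max(R[st, x] + gamma * sum(T[st, x, st2] * V[st2] for st2 in states)
                 for x in actions) for st in states}
policy = {st: max(actions, key=lambda x: R[st, x] + gamma *
                  sum(T[st, x, st2] * V[st2] for st2 in states)) for st in states}
print(policy)   # every augmented state (s, x') maps to 1 - x', i.e., flip the intention
```

In this example the posterior over (m, e) given (s, x') is uniform over the two compatible pairs because the priors are uniform and π_h is deterministic; for richer SCMs the abduction step would weight the compatible values by their posterior probabilities.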

In this paper, we will evaluate the performance of an agent with a policy Π by the discounted expected cumulative reward V^Π = E[∑_{t=0}^∞ γ^t Y^(t)], where γ ∈ [0, 1). There are several standard ways of solving an MDP under the discounted expected cumulative reward criterion [28]. Some use dynamic programming (value or policy iteration); others reduce MDPs to linear programs (LPs). A discounted MDP formed by an interactive RL system ⟨M_MDPUC, Actf⟩ can be formulated as the following LP [10, 16] (this maximization LP is the dual of the more commonly used minimization LP in the value-function coordinates):

max_λ ∑_{s,x,x'} R^⟨s,x'⟩_x λ_x(s, x')   s.t.   λ_x(s, x') ≥ 0,   ∑_x λ_x(s, x') − γ ∑_{i,x} λ_x(i) T^i_x(s, x') = α_⟨s,x'⟩,    (4)

where α_⟨s,x'⟩ = P(S^(1) = s, X^(1) = x') specifies the probability distribution over the initial state. The parameters λ_x(s, x') are called the occupation measure of a policy, where λ_x(s, x') is the total discounted number of times action x is executed in state (s, x'). An optimal policy for Actf is stationary and can be computed from a solution to the above LP as π*_ctf(x | s, x') = λ_x(s, x') / ∑_x λ_x(s, x').
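For reference, the LP of Eq. 4 can be set up and solved with an off-the-shelf solver in a few lines. The sketch below is illustrative: T, R, and the initial distribution α are placeholders (the counterfactual quantities of Prop. 3 would be plugged in for a concrete environment), and the augmented states (s, x') are simply enumerated as indices.

```python
import numpy as np
from scipy.optimize import linprog

# Placeholder sizes and quantities; substitute the counterfactual T and R of Prop. 3.
n_states, n_actions, gamma = 4, 2, 0.99       # n_states enumerates the pairs (s, x')
rng = np.random.default_rng(0)
R = rng.uniform(size=(n_states, n_actions))                        # R[i, x]
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # T[i, x, j]
alpha = np.full(n_states, 1.0 / n_states)     # assumed uniform initial distribution

# Decision variables: occupation measures lambda_x(i), flattened as lam[i * n_actions + x].
c = -R.reshape(-1)                            # linprog minimizes, so negate the rewards
A_eq = np.zeros((n_states, n_states * n_actions))
for j in range(n_states):                     # one flow constraint per augmented state j
    for i in range(n_states):
        for x in range(n_actions):
            A_eq[j, i * n_actions + x] = (1.0 if i == j else 0.0) - gamma * T[i, x, j]
res = linprog(c, A_eq=A_eq, b_eq=alpha, bounds=[(0, None)] * (n_states * n_actions))

lam = res.x.reshape(n_states, n_actions)
policy = lam / lam.sum(axis=1, keepdims=True)  # pi*(x | s, x') = lam_x(s,x') / sum_x lam_x(s,x')
print(policy)
```

The returned occupation measure is normalized per augmented state to recover the stationary policy π*_ctf(x | s, x').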

There are also new learning-from-demonstration opportunities for this novel interactive RL system. With a simple application of the counterfactual axiom known as composition [27, pp. 229] to Eqs. 2-3, the following equality constraints between distributions are entailed.

Corollary 1. In ⟨M_MDPUC, Actf⟩, the following two equations hold for t ≥ 1:

T^⟨s,x⟩_x(s', x'') = P(S^(t+1) = s', X^(t+1) = x'' | S^(t) = s, X^(t) = x),    R^⟨s,x⟩_x = E[Y^(t) | S^(t) = s, X^(t) = x].

The quantities on the r.h.s. of the above equations are observational distributions (they do not contain subscripts), which can be obtained by simply observing the human demonstrating the task. Corol. 1 thus implies a method for partially estimating T and R without direct experimentation, where abundant (passive) observations of past human behavior suffice.
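Concretely, the observational quantities in Corol. 1 reduce to frequency counts over logged human behavior. The sketch below assumes, for illustration, that demonstrations are recorded as trajectories of (s, x, y) triples while the human is acting; the names are ours.

```python
from collections import defaultdict

def estimate_from_demonstrations(trajectories):
    """Estimate the observational quantities of Corol. 1 from logged human behavior.
    Each trajectory is a list of (s, x, y) tuples recorded while the human acts; this
    recovers the entries of T and R where the executed action equals the intention."""
    visits = defaultdict(int)
    next_counts = defaultdict(lambda: defaultdict(int))
    reward_sum = defaultdict(float)
    for traj in trajectories:
        for (s, x, y), (s_next, x_next, _) in zip(traj, traj[1:]):
            visits[(s, x)] += 1
            next_counts[(s, x)][(s_next, x_next)] += 1
            reward_sum[(s, x)] += y
    T_hat = {k: {sx2: n / visits[k] for sx2, n in d.items()} for k, d in next_counts.items()}
    R_hat = {k: reward_sum[k] / visits[k] for k in visits}
    return T_hat, R_hat
```

Only the entries where the executed action coincides with the human's intention (x = x') are identified this way; the remaining entries still require experimentation or the counterfactual randomization procedure of [5].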

4 Autonomy vs Optimality

Env. Model   Optimality (Aexp)   Optimality (Actf)   Autonomy
MDPUC−       ✓                   ✓                   ✓
MDPUC        ✗                   ✓                   ✗
MDPUC+       ✗                   ✓                   ✗
DSCM−        ✓                   ✓                   ✓
DSCM         ✗                   ✓                   ✗

Figure 3: Canonical models of the environment. Aexp and Actf stand for the experimental and counterfactual agents. "Optimality" indicates whether the corresponding agent can achieve optimal performance (using Actf as the baseline). "Autonomy" indicates whether the system can be fully autonomous. We use the superscript − to denote models where the human is not affected by any unobserved variable (i.e., X^(t) = π_h(s^(t))).

We start this section by summarizing in Fig. 3 the results discussed so far. The "optimality" columns indicate whether the corresponding agent can obtain the optimal policy (using Actf as the baseline); a check (cross) mark under "optimality" for Aexp indicates that the experimental agent is (is not) able to achieve optimal performance in the environment of the corresponding row, compared to its counterfactual counterpart. In an arbitrary environment, full autonomy (represented by the "autonomy" column) can only be achieved when it is not at the cost of optimality, i.e., when the performance of Aexp and Actf coincides. These results convey an inherent trade-off between optimality and autonomy when designing RL systems: while full autonomy is preferable, the agent could potentially achieve better performance by leveraging the human's capabilities. We model this trade-off as a constrained transfer of control (TOC) problem, in which a system tries to maximize its cumulative rewards while repeatedly transferring control between an experimental and a counterfactual agent, subject to the constraint that the total time spent as the counterfactual agent is no more than a ratio β of the total running time. For instance, a self-driving vehicle could ask for human input when necessary, but for no more than 10% of the total driving time. To solve this problem, we first define hybrid agents, which combine both the Aexp and Actf modes of reasoning.

Definition 7 (Hybrid Agent). A hybrid agent Ahyb follows a policy Π_hyb = (π ∘ f_a)^([1,t]) and maintains an internal state A^(t) ∈ {a_0, a_1} (a_0 for Aexp and a_1 for Actf). (π ∘ f_a)^(t) is a composite function of the form π^(t)(h^(t)_hyb, x'^(t), f^(t)_a(h^(t)_hyb)), where h^(t)_hyb is the history of the hybrid agent (it includes s^([1,t]), x^([1,t]), and partially x'^([1,t])), and the function a^(t) = f^(t)_a(h^(t)_hyb) decides the value of A^(t). π^(t) is defined as: π^(t)(h^(t)_hyb, x'^(t), a^(t)) = π^(t)_0(h^(t)_hyb) if a^(t) = a_0; otherwise, π^(t)(h^(t)_hyb, x'^(t), a^(t)) = π^(t)_1(h^(t)_hyb, x'^(t)).

A hybrid RL system is defined as the pair ⟨M_DSCM, Ahyb⟩, which proactively decides whether to consider the human decision. The internal state A^(t) is a switch variable indicating the running mode the system is currently in (a_0 for experimental, a_1 for counterfactual). The human decision x'^(t) is included as evidence when deciding the agent's action x^(t) if A^(t) = a_1; otherwise, it is ignored. We now consider algorithms to solve such hybrid systems when the environment is ctf-Markov and full autonomy cannot be achieved. The following result establishes the relation between hybrid systems and MDPs.

Proposition 4. A hybrid system ⟨M_MDPUC, Ahyb⟩ forms an MDP ⟨S_hyb, X_hyb, T, R⟩ where S_hyb = S × X, X_hyb = X × A, and T^⟨s,x'⟩_⟨x,a⟩(s', x'') and R^⟨s,x'⟩_⟨x,a⟩ are equal to, respectively,

P(S^(t+1)_{X^(t)=x} = s', X^(t+1)_{X^(t)=x} = x'' | S^(t) = s, X^(t) = x'),    E[Y^(t)_{X^(t)=x} | S^(t) = s, X^(t) = x'].

This result follows immediately from Prop. 3, as A^(t) is an auxiliary action which does not directly affect the environment. Prop. 4 allows us to translate the constrained TOC problem into a constrained MDP problem. The optimal policy of Ahyb can be computed as a solution to the LP specified in Eq. 4 with the following additional constraints:

λ_⟨x,a_0⟩(s, x') / ∑_x λ_⟨x,a_0⟩(s, x') = ∑_{x'} λ_⟨x,a_0⟩(s, x') / ∑_{x,x'} λ_⟨x,a_0⟩(s, x'),    (5)

∑_x λ_⟨x,a⟩(s, x') / ∑_{x,a} λ_⟨x,a⟩(s, x') = ∑_{x,x'} λ_⟨x,a⟩(s, x') / ∑_{x,a,x'} λ_⟨x,a⟩(s, x'),    (6)

and ∑_{i,x} λ_⟨x,a_1⟩(i) ≤ β ∑_{i,x,a} λ_⟨x,a⟩(i), which ensures that, for the system ⟨M_MDPUC, Ahyb⟩, the total time running in counterfactual mode (a_1) is no more than a ratio β of the total running time (discounted, so that future visits count less than present ones). Eq. 5 ensures that for A^(t) = a_0, the policy π does not take the human decision x'^(t) as an input. Eq. 6 reflects the functional constraint that f_a is a Markovian policy depending only on the current state s^(t). Eqs. 5-6 combined guarantee that in the experimental mode (a_0), the agent's decision is independent of the human's decision (i.e., the agent is autonomous). This mathematical program forms a polynomial optimization problem [39], which is neither linear nor convex (due to Eqs. 5-6). Despite its difficulty, several efficient methods of polynomial optimization can be used in this case, for example the RLT (Reformulation-Linearization Technique) method [30] and an SDP relaxation method [19].
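For orientation only, here is a small-scale sketch of the resulting program: the objective and flow constraints of Eq. 4, the budget constraint, and division-free (cross-multiplied) versions of Eqs. 5-6, handed to a generic local NLP routine. The T, R, and α below are placeholders, and SLSQP only returns a local solution; the RLT and SDP-relaxation methods cited above are the appropriate tools for solving this polynomial program properly.

```python
import numpy as np
from scipy.optimize import minimize

# Placeholder instance: state s, intention x', action x, mode a in {a0, a1}.
nS, nXp, nX, nA = 2, 2, 2, 2
gamma, beta = 0.99, 0.3                       # beta: budget on time spent in mode a1
rng = np.random.default_rng(0)
R = rng.uniform(size=(nS, nXp, nX))           # R^{<s,x'>}_x (mode-independent)
T = rng.dirichlet(np.ones(nS * nXp), size=(nS, nXp, nX)).reshape(nS, nXp, nX, nS, nXp)
alpha = np.full((nS, nXp), 1.0 / (nS * nXp))  # assumed initial distribution over (s, x')

def lam(v):                                   # occupation measures lambda_{<x,a>}(s, x')
    return v.reshape(nS, nXp, nX, nA)

def neg_value(v):                             # objective of Eq. 4, negated for minimization
    return -np.sum(R[..., None] * lam(v))

def flow(v):                                  # flow constraints of Eq. 4, one per (s, x')
    L = lam(v)
    return (L.sum(axis=(2, 3)) - gamma * np.einsum('ijxa,ijxpq->pq', L, T) - alpha).ravel()

def budget(v):                                # sum of a1-mass <= beta * total (as fun >= 0)
    L = lam(v)
    return beta * L.sum() - L[..., 1].sum()

def eq5(v):                                   # Eq. 5: in mode a0 the action ignores x'
    L0 = lam(v)[..., 0]
    return (L0 * L0.sum(axis=(1, 2), keepdims=True)
            - L0.sum(axis=2, keepdims=True) * L0.sum(axis=1, keepdims=True)).ravel()

def eq6(v):                                   # Eq. 6: the mode choice depends on s only
    L = lam(v)
    A1, B = L.sum(axis=2), L.sum(axis=(1, 2, 3))
    C, D = L.sum(axis=(2, 3)), L.sum(axis=(1, 2))
    return (A1 * B[:, None, None] - C[:, :, None] * D[:, None, :]).ravel()

n = nS * nXp * nX * nA
x0 = np.full(n, 1.0 / ((1.0 - gamma) * n))    # total occupation mass 1/(1-gamma), spread evenly
cons = [{'type': 'eq', 'fun': flow}, {'type': 'eq', 'fun': eq5},
        {'type': 'eq', 'fun': eq6}, {'type': 'ineq', 'fun': budget}]
res = minimize(neg_value, x0, method='SLSQP', bounds=[(0, None)] * n,
               constraints=cons, options={'maxiter': 1000})
L = lam(res.x)
policy = L / (L.sum(axis=(2, 3), keepdims=True) + 1e-12)   # pi(x, a | s, x')
print(res.success)
```

The cross-multiplied form avoids divisions by possibly zero occupation mass while expressing the same ratio constraints as Eqs. 5-6.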

5 Experiments

In this section, we apply the analysis of the various Markov properties (exp-Markov and ctf-Markov) in the context of the medical treatment problem described in Sec. 2.1, and operationalize the learning algorithms for the experimental (Aexp), counterfactual (Actf), and hybrid (Ahyb) agents. We use value iteration for MDP planning and incremental pruning [6] for POMDP planning. For online settings, we use Mormax [36], and the SDP relaxation method for polynomial optimization. SDPs are constructed using SparsePOP [40] and solved with SeDuMi [34]. We perform 25000 trials per experiment, partitioned into 250 episodes. We assess each agent's performance based on its cumulative rewards (CR) and average cumulative rewards (AR).

Offline Planning   Since M_MDPUC is both exp-Markov and ctf-Markov (Thm. 2), the best possible policies of Aexp and Actf can be computed using standard MDP algorithms following the construction given in Prop. 3, labeled as exp and ctf. The performance of agents Aexp and Actf is compared with the baseline policies described in Sec. 2.1 (human and random). An experimental policy computed through POMDP algorithms (called exp2) is also included. Simulations, shown in Fig. 2, support the efficiency of the counterfactual approach, revealing that (1) both experimental policies obtained by the standard MDP (exp, CR = 1.2162 × 10^4) and POMDP (exp2, CR = 1.2164 × 10^4) algorithms coincide with the random policy (CR = 1.2147 × 10^4); and (2) the counterfactual agent Actf (ctf, CR = 1.965 × 10^4), leveraging the human's capabilities, consistently outperforms the experimental agent Aexp, even though the human performs worse than random guessing (human, CR = 4.6492 × 10^3).


[Figure 4: (a) Simulations comparing the performance of experimental (Aexp) and counterfactual (Actf) agents in online learning settings; (b-c) simulations comparing the performance (b) and composition (c) of the exp and ctf modes of hybrid agents with the constraint ratio β set to {0.1, 0.3, 0.5, 0.7, 0.9}, labeled as {hyb1, hyb3, hyb5, hyb7, hyb9}. The x-axis in (a-b) represents the total episodes.]

Online Learning   We next show that the challenge considered here is not bounded to the post-learning phase, but appears compounded in the online setting, when the agent has only an incomplete model of the environment. We use standard MDP learning algorithms for the experimental and counterfactual agents, since M_MDPUC is both exp-Markov and ctf-Markov. To empirically estimate the transition T and the reward function R of Actf (Prop. 3), we follow the counterfactual randomization procedure from [5]. In addition, we collect 5000 samples of the distributions P(s^(t+1) | s^(t), x^(t)) and E[Y^(t) | s^(t), x^(t)] through the physician's treatment histories. We collect trajectories for both the experimental and counterfactual agents, labeled exp and ctf, respectively. We also deploy a counterfactual agent Actf incorporating observational data using Corol. 1, labeled as ctf+. The two baseline policies previously described, human and random, are also included. The simulations (Fig. 4(a)) support the counterfactual approach. In fact, there is a significant difference in the rewards experienced by the counterfactual agents (AR = 79.2438 for ctf+, AR = 79.2837 for ctf) when compared to the experimental agent exp (AR = 50.1900), despite the fact that the human (AR = 19.1913) performs poorly, even worse than random guessing (random, AR = 49.2212). In addition, ctf+ demonstrates a better convergence rate (after 2 episodes) compared to ctf.

Hybrid Planning   We apply the methods discussed in Sec. 4 to obtain the optimal policy for hybrid agents Ahyb in the constrained TOC setting. Recall that β is a constraint on the ratio between the total time running the counterfactual approach (using human input) and the total running time of the system. We compute policies for three hybrid agents with the ratio constraint β set to 0.1, 0.5, and 0.9, labeled as hyb1, hyb5, and hyb9, respectively. The two baseline policies exp and ctf are also included for comparison. The simulations (Fig. 4(b)) reveal that the performance of hybrid agents converges to that of the counterfactual agent as β → 1. In particular, hyb1 (β = 0.1) shows a limited performance improvement (CR = 1.3023 × 10^4) over the autonomous approach (exp, CR = 1.2162 × 10^4), while hyb9 (β = 0.9) experiences higher rewards (CR = 1.8736 × 10^4), comparable to the counterfactual agent ctf (CR = 1.965 × 10^4). Predictably, the performance of hyb5 (β = 0.5, CR = 1.558 × 10^4) lies in between hyb1 and hyb9. We plot the composition of the experimental and counterfactual modes in Fig. 4(c). We also include two hybrid agents with β set to 0.3 and 0.7 for comparison. The simulations support the polynomial optimization reduction: hyb1 (P(a_0) = P(A^(t) = a_0) = 0.88) and hyb3 (P(a_0) = 0.6629) tend to stay in the autonomous mode, while hyb7 (P(a_0) = 0.3366) and hyb9 (P(a_0) = 0.1239) ask for human involvement more frequently; hyb5 remains somewhat neutral.

6 Conclusions

We investigated a novel interactive reinforcement learning setting where the human instructor and the RL agent do not perceive the world in the same fashion, but share the reward and transition functions. We showed that the agent could be constrained to find a sub-optimal policy if it does not bring the human into the loop. We proposed a new type of interactive agent, which accounts for the human in real time using counterfactual reasoning. Our analysis revealed that the counterfactual approach dominates standard methods even if the human instructor performs poorly, possibly worse than random guessing. To resolve the tension between the autonomy and the optimality of the system, we proposed a novel constrained RL task subject to a budget constraint.


References

[1] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, page 1. ACM, 2004.
[2] O. Amir, E. Kamar, A. Kolobov, and B. J. Grosz. Interactive teaching strategies for agent training. In International Joint Conference on Artificial Intelligence, 2016.
[3] B. D. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.
[4] C. G. Atkeson and S. Schaal. Robot learning from demonstration. In ICML, volume 97, pages 12–20, 1997.
[5] E. Bareinboim, A. Forney, and J. Pearl. Bandits with unobserved confounders: A causal approach. In Advances in Neural Information Processing Systems, pages 1342–1350, 2015.
[6] A. Cassandra, M. L. Littman, and N. L. Zhang. Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes. In Proceedings of the 13th Conference on Uncertainty in Artificial Intelligence (UAI), pages 54–61. Morgan Kaufmann, 1997.
[7] A. R. Cassandra, L. P. Kaelbling, and M. L. Littman. Acting optimally in partially observable stochastic domains. In AAAI, volume 94, pages 1023–1028, 1994.
[8] S. Chernova and M. Veloso. Confidence-based policy learning from demonstration using Gaussian mixture models. In Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems, page 233. ACM, 2007.
[9] J. A. Clouse. On Integrating Apprentice Learning and Reinforcement Learning. University of Massachusetts Amherst, 1996.
[10] F. d'Epenoux. A probabilistic production and inventory problem. Management Science, 10(1):98–108, 1963.
[11] S. Devlin and D. Kudenko. Dynamic potential-based reward shaping. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 1, pages 433–440. International Foundation for Autonomous Agents and Multiagent Systems, 2012.
[12] A. Forney, J. Pearl, and E. Bareinboim. Counterfactual data-fusion for online reinforcement learners. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1156–1164, Sydney, Australia, 2017. PMLR.
[13] S. Griffith, K. Subramanian, J. Scholz, C. L. Isbell, and A. L. Thomaz. Policy shaping: Integrating human feedback with reinforcement learning. In Advances in Neural Information Processing Systems, pages 2625–2633, 2013.
[14] J. Halpern. Axiomatizing causal reasoning. In G. Cooper and S. Moral, editors, Uncertainty in Artificial Intelligence, pages 202–210. Morgan Kaufmann, San Francisco, CA, 1998. Also, Journal of Artificial Intelligence Research 12:3, 17–37, 2000.
[15] A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne. Imitation learning: A survey of learning methods. ACM Computing Surveys (CSUR), 50(2):21, 2017.
[16] L. C. M. Kallenberg. Linear programming and finite Markovian control problems. MC Tracts, 1983.
[17] W. B. Knox and P. Stone. Interactively shaping agents via human reinforcement: The TAMER framework. In Proceedings of the Fifth International Conference on Knowledge Capture, pages 9–16. ACM, 2009.


[18] W. B. Knox and P. Stone. Reinforcement learning from simultaneous human and MDP reward. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 1, pages 475–482. International Foundation for Autonomous Agents and Multiagent Systems, 2012.
[19] J. B. Lasserre. Global optimization with polynomials and the problem of moments. SIAM Journal on Optimization, 11(3):796–817, 2001.
[20] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670. ACM, 2010.
[21] R. Loftin, B. Peng, J. MacGlashan, M. L. Littman, M. E. Taylor, J. Huang, and D. L. Roberts. Learning behaviors via human-delivered discrete feedback: modeling implicit feedback strategies to speed up learning. Autonomous Agents and Multi-Agent Systems, 30(1):30–59, 2016.
[22] R. T. Loftin, J. MacGlashan, B. Peng, M. E. Taylor, M. L. Littman, J. Huang, and D. L. Roberts. A strategy-aware technique for learning behaviors from discrete human feedback. In AAAI, pages 937–943, 2014.
[23] T. Mandel, Y.-E. Liu, E. Brunskill, and Z. Popovic. Where to add actions in human-in-the-loop reinforcement learning. In AAAI, pages 2322–2328, 2017.
[24] A. Y. Ng, A. Coates, M. Diel, V. Ganapathi, J. Schulte, B. Tse, E. Berger, and E. Liang. Autonomous inverted helicopter flight via reinforcement learning. In Experimental Robotics IX, pages 363–372. Springer, 2006.
[25] A. Y. Ng, D. Harada, and S. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In ICML, volume 99, pages 278–287, 1999.
[26] A. Y. Ng and S. Russell. Algorithms for inverse reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning, 2000.
[27] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, 2000.
[28] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley & Sons, 2014.
[29] J. Randløv and P. Alstrøm. Learning to drive a bicycle using reinforcement learning and shaping. In ICML, volume 98, pages 463–471, 1998.
[30] H. D. Sherali and W. P. Adams. A Reformulation-Linearization Technique for Solving Discrete and Continuous Nonconvex Problems, volume 31. Springer Science & Business Media, 2013.
[31] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
[32] S. Singh, D. Litman, M. Kearns, and M. Walker. Optimizing dialogue management with reinforcement learning: Experiments with the NJFun system. Journal of Artificial Intelligence Research, 16:105–133, 2002.
[33] B. F. Skinner. The Behavior of Organisms: An Experimental Analysis. BF Skinner Foundation, 1990.
[34] J. F. Sturm. Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, 11(1-4):625–653, 1999.
[35] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[36] I. Szita and C. Szepesvári. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1031–1038, 2010.


[37] G. Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58–68, 1995.
[38] L. Torrey and M. Taylor. Teaching on a budget: Agents advising agents in reinforcement learning. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, pages 1053–1060. International Foundation for Autonomous Agents and Multiagent Systems, 2013.
[39] H. Tuy. Convex Analysis and Global Optimization. Springer, 1998.
[40] H. Waki, S. Kim, M. Kojima, M. Muramatsu, and H. Sugimoto. Algorithm 883: SparsePOP, a sparse semidefinite programming relaxation of polynomial optimization problems. ACM Transactions on Mathematical Software (TOMS), 35(2):15, 2008.
[41] K. H. Wray, L. Pineda, and S. Zilberstein. Hierarchical approach to transfer of control in semi-autonomous systems. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 517–523. AAAI Press, 2016.
[42] J. Zhang and E. Bareinboim. Transfer learning in multi-armed bandits: a causal approach. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 1340–1346. AAAI Press, 2017.


“Can Humans Be out of the Loop?” Supplemental Materials

1 Parametrizations

In this section, we provide the full parametrizations of the medical treatment example used in the paper.

Let U^(t) = {M^(t), E^(t)}, where M^(t) and E^(t) are binary variables drawn from the distributions P(m^(t) = 0) = 1/2 and P(e^(t) = 0) = 1/2. The values of X^(t) are decided by X^(t) = f_x(s^(t), m^(t), e^(t)) = s^(t) ⊕ m^(t) ⊕ e^(t), where ⊕ represents the exclusive OR operator. The reward probability function P(y^(t) | s^(t), m^(t), e^(t), x^(t)) and the transition probability function P(s^(t+1) | s^(t), x^(t)) are provided in Tables 1 and 2. The entries encode the probabilities for Y^(t) = 1. The human's decisions (i.e., following f_x) are indicated by asterisks. The optimal policy for the counterfactual agent Actf is to always take the opposite of the physician's decision, i.e., X^(t) = π_ctf(s^(t), x'^(t)) = ¬x'^(t).

S^(t) = 0:
             M^(t) = 0               M^(t) = 1
           E^(t) = 0   E^(t) = 1   E^(t) = 0   E^(t) = 1
X^(t) = 0   *0.2        0.9         0.8         *0.3
X^(t) = 1    0.9       *0.2        *0.3          0.8

S^(t) = 1:
             M^(t) = 0               M^(t) = 1
           E^(t) = 0   E^(t) = 1   E^(t) = 0   E^(t) = 1
X^(t) = 0    0.7       *0.2        *0.1          0.8
X^(t) = 1   *0.2        0.7         0.8         *0.1

Table 1: Reward probability table for Y^(t) = 1, i.e., P(Y^(t) = 1 | s^(t), m^(t), e^(t), x^(t)). The human's decisions under S^(t), M^(t), E^(t) are indicated by asterisks.

           S^(t) = 0   S^(t) = 1
X^(t) = 0    0.9         0.3
X^(t) = 1    0.7         0.8

Table 2: The transition probability table P(S^(t+1) = 0 | s^(t), x^(t)).

2 Proofs

We start by introducing the notation and theorems used throughout the proofs. Recall that we use X^([i,j]) to represent the sequence (X^(i), X^(i+1), ..., X^(j)); X^([i,j]) is an empty sequence if i > j.

Axiom 1 (The Axioms of Counterfactuals, [1]). In all causal models, the composition, effectiveness, and reversibility properties hold.


Composition: For any three sets of endogenous variables X, Y, and W in a causal model, we have: W_x = w ⇒ Y_{xw} = Y_x.

Effectiveness: For any sets of variables X and W, X_{xw} = x.

Reversibility: For any two variables Y and W and any set of variables X, (Y_{xw} = y) ∧ (W_{xy} = w) ⇒ Y_x = y.

Composition, effectiveness, and reversibility have been proved to be sound and complete in all causal models.

To translate the assumptions embodied in the graphical model into the language of counterfactuals, we also introduce two rules: exclusion restrictions and independence restrictions.

Lemma 1 (Exclusion restrictions, [1]). For every variable Y having parents PA_Y, and for every set of variables Z ⊂ V disjoint from PA_Y, we have:

Y_{pa_Y} = Y_{pa_Y, z}.

Lemma 2 (Independence restrictions, [1]). If Z_1, ..., Z_k is any set of nodes in V not connected to Y via paths containing only U variables, we have:

(Y_{pa_Y} ⊥⊥ Z_{1, pa_{Z_1}}, ..., Z_{k, pa_{Z_k}}).

We also introduce the concept of a C-component ([2]), which will be useful in the later proofs.

Definition 1 (C-component, [2]). Let G be a causal diagram such that a subset of its bidirected arcs forms a spanning tree over all vertices in G. Then G is a C-component.

Lemma 3. Given an SCM M = ⟨U, V, F, P(u)⟩ with N endogenous variables and K C-components, let PA_X be the parents of a variable X, and let V_pa = {V^(1)_{pa_{V^(1)}}, ..., V^(N)_{pa_{V^(N)}}}. The joint distribution P(v_pa) factorizes according to the product:

P(v_pa) = ∏_{k=1}^{K} P(c^(k)_pa),    (1)

where C^(k)_pa denotes the endogenous variables of C-component k with their parents' values fixed, C^(k)_pa = {V^(1)_{pa_{V^(1)}}, ..., V^(N^(k))_{pa_{V^(N^(k))}}}.

Proof. Let U^(k) and C^(k) be, respectively, the exogenous and endogenous variables of C-component k. A variable V^(i)_{pa_{V^(i)}} in C^(k)_pa is decided by the function V^(i) = f(pa_{V^(i)}, u^(k)) with its parents' values fixed at PA_{V^(i)} = pa_{V^(i)}. We can thus write P(c^(k)_pa) as:

P(c^(k)_pa) = ∑_{u^(k)} ∏_{i=1}^{N^(k)} I{v^(i) = f(pa_{V^(i)}, u^(k))} P(u^(k)).    (2)

Since U is partitioned into mutually independent subsets {U^(1), U^(2), ..., U^(K)}, P(v_pa) can be written as

P(v_pa) = ∑_u ∏_{j=1}^{K} ∏_{n=1}^{N^(j)} I{v^(n) = f(pa_{V^(n)}, u)} P(u^(j)).

By moving U^(k) and C^(k) out, this becomes:

P(v_pa) = [ ∑_{u^(k)} ∏_{i=1}^{N^(k)} I{v^(i) = f(pa_{V^(i)}, u^(k))} P(u^(k)) ]  (Part 1)
          ⋅ [ ∑_{u ∖ u^(k)} ∏_{j=1, j≠k}^{K} ∏_{n=1}^{N^(j)} I{v^(n) = f(pa_{V^(n)}, u^(j))} P(u^(j)) ].  (Part 2)

Part 1 is exactly P(c^(k)_pa) as defined in Eq. 2. By applying the same procedure to the remaining K − 1 C-components, we obtain Eq. 1.

Proof of Theorem 1. Since (s^([1,t]), x^([1,t-1])) ⊆ (s^([1,t]), x'^([1,t]), x^([1,t-1])), we have Π*_exp ∈ F_ctf. This implies V^{Π*_exp} ≤ V^{Π*_ctf}.

Proof of Theorem 2. Since Π*_ctf always dominates Π*_exp, it suffices to show the other direction, V^{Π*_ctf}(s^(1)) ≤ V^{Π*_exp}(s^(1)). Suppose Π*_ctf = π*^([1,t])_ctf and Π*_exp = π*^([1,t])_exp. We write V^{Π*_exp}(s^(1)) as:

∑_{s^([1,t])} ∑_{u^([1,t])} ∑_{x^([1,t])} ∑_{y^(t)} y^(t) P(y^(t)_{s^(t),u^(t),x^(t)}) P(s^(1), u^(1)) ∏_{i=2}^{t} P(s^(i)_{x^(i-1),u^(i),u^(i-1),s^(i-1)}) P(u^(i)_{u^(i-1),s^(i-1),x^(i-1)}) ∏_{j=1}^{t} π*^(j)_exp(x^(j) | h^(j)_exp).

Similarly, we can write V^{Π*_ctf}(s^(1)) as:

∑_{x'^([1,t])} ∑_{s^([1,t])} ∑_{u^([1,t])} ∑_{x^([1,t])} ∑_{y^(t)} y^(t) P(y^(t)_{s^(t),u^(t),x^(t)}) P(s^(1), u^(1)) ∏_{i=2}^{t} P(s^(i)_{x^(i-1),u^(i),u^(i-1),s^(i-1)}) P(u^(i)_{u^(i-1),s^(i-1),x^(i-1)}) ∏_{j=1}^{t} P(x'^(j)_{s^(j)}) π*^(j)_ctf(x^(j) | h^(j)_ctf).

Since h^(t)_ctf = h^(t)_exp ∪ {x'^(1), ..., x'^(t)_{x^([1,t-1])}}, this becomes

∑_{s^([1,t])} ∑_{u^([1,t])} ∑_{x^([1,t])} ∑_{y^(t)} y^(t) P(y^(t)_{s^(t),u^(t),x^(t)}) P(s^(1), u^(1)) ∏_{i=2}^{t} P(s^(i)_{x^(i-1),u^(i),u^(i-1),s^(i-1)}) P(u^(i)_{u^(i-1),s^(i-1),x^(i-1)}) [ ∑_{x'^([1,t])} ∏_{j=1}^{t} P(x'^(j)_{s^(j)}) π*^(j)_ctf(x^(j) | h^(j)_ctf) ],

where the bracketed factor is Term 1. Term 1 defines a policy of the experimental agent after summing out x'^([1,t]) (here the assumption of DSCM−, namely that each x'^(j) depends only on s^(j), is what allows the human decisions to be absorbed into a policy over h^(j)_exp), which we denote by Π_exp. Since Π*_exp is the optimal policy for the experimental agent, it follows that

V^{Π*_ctf}(s^(1)) = V^{Π_exp}(s^(1)) ≤ V^{Π*_exp}(s^(1)).

Proof of Proposition 1. We prove this statement by constructing a model M_MDPUC+ where the ctf-Markov property does not hold. Suppose X, Y, S, U are all binary domains {0, 1}, and U^(t,1), U^(t,2) follow the distributions P(U^(t,1) = 0) = P(U^(t,2) = 0) = 0.6. In this model, the values of X^(t) are decided by the function x^(t) = u^(t,1) ⊕ u^(t,2) ⊕ s^(t). The values of S^(t+1) are decided by the function s^(t+1) = x^(t) ⊕ u^(t,1) ⊕ u^(t+1,2) ⊕ s^(t). The initial state S^(1) follows the function s^(1) = u^(1,2). Without any intervention, the values of S^(1), X^(1), S^(2), X^(2), S^(3) thus are

s^(1) = u^(1,2),
x^(1) = u^(1,1) ⊕ u^(1,2) ⊕ s^(1) = u^(1,1),
s^(2) = x^(1) ⊕ u^(1,1) ⊕ u^(2,2) ⊕ s^(1) = u^(1,2) ⊕ u^(2,2),
x^(2) = u^(2,1) ⊕ u^(2,2) ⊕ s^(2) = u^(1,2) ⊕ u^(2,1),
s^(3) = x^(2) ⊕ u^(2,1) ⊕ u^(3,2) ⊕ s^(2) = u^(2,2) ⊕ u^(3,2).


After the intervention do(X^(1) = x1), the potential responses S^(1)_{x1}, S^(2)_{x1}, X^(2)_{x1} are

s^(1) = u^(1,2),
s^(2)_{x1} = x1 ⊕ u^(1,1) ⊕ u^(2,2) ⊕ s^(1) = x1 ⊕ u^(1,1) ⊕ u^(1,2) ⊕ u^(2,2),
x^(2)_{x1} = u^(2,1) ⊕ u^(2,2) ⊕ s^(2)_{x1} = x1 ⊕ u^(1,1) ⊕ u^(1,2) ⊕ u^(2,1).

After the interventions do(X^(1) = x1) and do(X^(2) = x2), S^(3)_{x1,x2} is

s^(3)_{x1,x2} = x2 ⊕ u^(2,1) ⊕ u^(3,2) ⊕ s^(2)_{x1} = x1 ⊕ x2 ⊕ u^(1,1) ⊕ u^(1,2) ⊕ u^(2,1) ⊕ u^(2,2) ⊕ u^(3,2).

Note that S^(3)_{x1,x2} contains the values u^(1,1) and u^(1,2), about which the observations x^(1) = u^(1,1) and s^(1) = u^(1,2) provide perfect information. Moreover, this information is not included in S^(2)_{x1} and X^(2)_{x1}, since the values of u^(1,1), u^(1,2) are completely masked by the latent variables U^(2,2), U^(2,1); one can compute their distributions and find P(S^(2)_{x1} = 0) = P(X^(2)_{x1} = 0) ≈ 0.5. Therefore, we have

S^(3)_{x^([1,2])} ⊥̸⊥ S^(1), X^(1) | S^(2)_{x^(1)}, X^(2)_{x^(1)},

which implies that M_MDPUC+ is not ctf-Markov.
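The dependence claimed above can also be checked numerically, since every variable in the counterexample is a deterministic XOR of the exogenous U's. The sketch below (our illustration; array names are ours) samples the U's and compares the conditional frequency of S^(3)_{x1,x2} = 1 with and without additionally conditioning on (S^(1), X^(1)):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x1, x2 = 0, 0                                    # the interventions do(X1 = x1, X2 = x2)
u = {k: (rng.random(n) > 0.6).astype(int)        # P(U = 0) = 0.6, so U = 1 with prob. 0.4
     for k in ['11', '12', '21', '22', '32']}

s1 = u['12']                                     # observed initial state S^(1)
x1_obs = u['11'] ^ u['12'] ^ s1                  # the human's natural first action X^(1)
s2 = x1 ^ u['11'] ^ u['12'] ^ u['22']            # S^(2)_{x1}
x2_h = x1 ^ u['11'] ^ u['12'] ^ u['21']          # X^(2)_{x1}
s3 = x1 ^ x2 ^ u['11'] ^ u['12'] ^ u['21'] ^ u['22'] ^ u['32']   # S^(3)_{x1,x2}

# Fix a value of (S^(2)_{x1}, X^(2)_{x1}) and compare P(S^(3) = 1) with and
# without additionally conditioning on (S^(1), X^(1)).
base = (s2 == 0) & (x2_h == 0)
extra = base & (s1 == 0) & (x1_obs == 0)
print(s3[base].mean(), s3[extra].mean())         # the two estimates differ
```

With the parameters above the two conditional estimates settle near 0.46 and 0.40, respectively, confirming that the extra conditioning on (S^(1), X^(1)) changes the distribution of S^(3)_{x^([1,2])}.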

Proof of Proposition 2. We will focus on time t = 3; the proof for the general case follows naturally. Consider Fig. 1(a), which describes the MDPUC at time t = 3. We want to show

P(s^(3)_{x^([1,2])}, x'^(3)_{x^([1,2])} | s^(2)_{x^(1)}, x'^(2)_{x^(1)}, s^(1), x'^(1)) = P(s^(3)_{x^(2)}, x'^(3)_{x^(2)} | s^(2), x'^(2)),    (3)

P(y^(2)_{x^([1,2])} | s^(2)_{x^(1)}, x'^(2)_{x^(1)}, s^(1), x'^(1)) = P(y^(2)_{x^(2)} | s^(2), x'^(2)).    (4)

Since Fig. 1(a) consists of the C-components {S^(1)}, {X^(1), Y^(1), S^(2)}, {X^(2), Y^(2), S^(3)}, and {X^(3), Y^(3), S^(4)}, Lemma 3 implies

(s^(3)_{x^(2),s^(2)}, x'^(3)_{s^(3)}, x'^(2)_{s^(2)} ⊥⊥ s^(1), x'^(1)_{s^(1)}, s^(2)_{x^(1),s^(1)}),    (5)

where s^([1,3]), y^([1,2]), x^([1,2]), x'^([1,2]) are arbitrary values. By the composition and weak union axioms, we have

(s^(3)_{x^(2),s^(2)}, x'^(3)_{s^(3)} ⊥⊥ s^(1), x'^(1) | s^(2)_{x^(1)}, x'^(2)_{x^(1)}).    (6)

Let x'^(1) = x^(1) and apply the composition and weak union axioms again:

(s^(3)_{x^(2),s^(2)}, x'^(3)_{s^(3)} ⊥⊥ s^(1), x^(1) | s^(2), x'^(2)).    (7)

Eq. 7 implies

P(s^(3)_{x^(2),s^(2)}, x'^(3)_{s^(3)} | s^(2), x'^(2)) = P(s^(3)_{x^(2),s^(2)}, x'^(3)_{s^(3)} | s^(2), x'^(2), s^(1), x^(1)).

By the composition axiom, we move x^(1) to the subscripts:

P(s^(3)_{x^(2),s^(2)}, x'^(3)_{s^(3)} | s^(2), x'^(2), s^(1), x^(1))
= P(s^(3)_{x^(2),s^(2)}, x'^(3)_{s^(3)} | s^(2)_{x^(1)}, x'^(2)_{x^(1)}, s^(1), x^(1))
= P(s^(3)_{x^(2),s^(2)}, x'^(3)_{s^(3)} | s^(2)_{x^(1)}, x'^(2)_{x^(1)}).

The last step holds by the independence implied by Eq. 6. Since x^(1) in Eq. 6 can be any value, let x^(1) = x'^(1) and apply Eq. 6 again:

P(s^(3)_{x^(2),s^(2)}, x'^(3)_{s^(3)} | s^(2)_{x^(1)}, x'^(2)_{x^(1)})
= P(s^(3)_{x^(2),s^(2)}, x'^(3)_{s^(3)} | s^(2)_{x^(1)}, x'^(2)_{x^(1)}, s^(1), x'^(1))
= P(s^(3)_{x^([1,2]),s^(2)}, x'^(3)_{x^([1,2]),s^(3)} | s^(2)_{x^(1)}, x'^(2)_{x^(1)}, s^(1), x'^(1)).

The last step holds since s^(3)_{x^(2),s^(2)} = s^(3)_{x^([1,2]),s^(2)} and x'^(3)_{s^(3)} = x'^(3)_{x^([1,2]),s^(3)} (exclusion restrictions rule). Together, we have

P(s^(3)_{x^(2),s^(2)}, x'^(3)_{s^(3)} | s^(2), x'^(2)) = P(s^(3)_{x^([1,2]),s^(2)}, x'^(3)_{x^([1,2]),s^(3)} | s^(2)_{x^(1)}, x'^(2)_{x^(1)}, s^(1), x'^(1)).    (8)

We can rewrite P(s^(3)_{x^(2),s^(2)}, x'^(3)_{s^(3)} | s^(2), x'^(2)) as

∑_{x'^(1) ∈ X, s'^(1) ∈ S} P(s^(3)_{x^(2),s^(2)}, x'^(3)_{s^(3)} | s^(2), x'^(2), s'^(1), x'^(1)) ⋅ P(s'^(1), x'^(1) | s^(2), x'^(2)),    (9)

where the first factor is Term 1. Some algebra through the composition axiom and the exclusion restrictions rule turns Term 1 into:

P(s^(3)_{x^(2),s^(2)}, x'^(3)_{s^(3)} | s^(2), x'^(2), s'^(1), x'^(1)) = P(s^(3)_{x^(2)}, x'^(3)_{x^(2)} | s^(2), x'^(2), s'^(1), x'^(1)).    (10)

Replacing Term 1 with Eq. 10, Eq. 9 becomes:

P(s^(3)_{x^(2),s^(2)}, x'^(3)_{s^(3)} | s^(2), x'^(2)) = ∑_{x'^(1) ∈ X, s'^(1) ∈ S} P(s^(3)_{x^(2)}, x'^(3)_{x^(2)}, s'^(1), x'^(1) | s^(2), x'^(2)) = P(s^(3)_{x^(2)}, x'^(3)_{x^(2)} | s^(2), x'^(2)).    (11)

Some algebra through the composition axiom and the exclusion restrictions rule turns P(s^(3)_{x^([1,2]),s^(2)}, x'^(3)_{x^([1,2]),s^(3)} | s^(2)_{x^(1)}, x'^(2)_{x^(1)}, s^(1), x'^(1)) into:

P(s^(3)_{x^([1,2]),s^(2)}, x'^(3)_{x^([1,2]),s^(3)} | s^(2)_{x^(1)}, x'^(2)_{x^(1)}, s^(1), x'^(1)) = P(s^(3)_{x^([1,2])}, x'^(3)_{x^([1,2])} | s^(2)_{x^(1)}, x'^(2)_{x^(1)}, s^(1), x'^(1)).    (12)

Together, Eqs. 8, 11, and 12 prove Eq. 3. Eq. 4 can be proved with the same steps, replacing s^(3) with y^(2).

References

[1] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, 2000.
[2] J. Tian and J. Pearl. On the identification of causal effects. Technical Report R-290-L, Department of Computer Science, University of California, Los Angeles, CA, 2003. http://www.cs.iastate.edu/∼jtian/r290-L.pdf.
