
Fakultät für Ingenieurwissenschaften, Informatik und Psychologie
Institut für Künstliche Intelligenz

Hierarchical Planning Under Uncertainty

Dissertation submitted in partial fulfillment of the requirements for the doctoral degree Dr. rer. nat. of the Fakultät für Ingenieurwissenschaften, Informatik und Psychologie of Universität Ulm

Submitted by Felix Milo Richter, born Müller, from Balingen

2017


Version of December 17, 2017

Acting Dean: Prof. Dr. Frank Kargl
Referee: Prof. Dr. Susanne Biundo-Stephan
Referee: PD Dr. Friedhelm Schwenker
Date of the doctoral graduation: July 13, 2017

© 2017 Felix Richter

This work is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0). To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/legalcode. Typesetting: PDF-LaTeX 2ε


Abstract

The recent years have seen significant progress in the fields of computer science and the engineering sciences, leading to a plethora of systems and services aimed at simplifying the organization of everyday life. The great potential of these utilities is, however, hindered by their complexity as well as the complexity of their interplay. Automated assistance systems can help users overcome this challenge.

At the core of these systems lies planning functionality, needed for automatically generating courses of action, or policies, that represent, e.g., step-by-step instructions. Often, planning requires accounting for uncertainty inherent in a given application domain, making the process of generating such instructions computationally difficult. Fortunately, many assistance tasks exhibit hierarchical structure that allows understanding planning tasks as a hierarchy of subtasks, each of which can be solved using a limited number of solution recipes. The Hierarchical Task Network planning approach can already exploit such structures in deterministic planning domains by representing subtasks and recipes using abstract actions and methods, respectively, and generating plans by iteratively refining an initial abstract plan.

The main goal of this thesis is to create a similar planning approach suited for planning domains that exhibit uncertainty, modeled as Partially Observable Markov Decision Processes. Based on a newly introduced policy representation formalism called logical finite state controllers, the concepts of abstract actions and methods are reintroduced to create the Partially Observable Hierarchical Task Network planning approach. Next, Monte-Carlo tree search in the space of partially abstract controllers is identified as a suitable means for planning. The approach is then empirically evaluated on four domains by comparing it to search in the space of histories, a state-of-the-art non-hierarchical planning approach also based on Monte-Carlo tree search. This reveals that, with comparable computational effort, the proposed approach leads to policies of superior quality, and that it scales well with problem size.

Two further techniques are then proposed to enhance the presented approach: one that further reduces the required controller construction effort during search by refining controllers only selectively where required, and one that combines hierarchical and non-hierarchical search in order to exploit the advantages of both.


Contents

1. Introduction
   1.1. Planning under Uncertainty
   1.2. Hierarchical Planning
   1.3. Related Work
        1.3.1. Approaches Based on HTN Planning
        1.3.2. Approaches Based on Decision-Theoretic Planning
        1.3.3. Other Approaches
   1.4. HTN Planning in POMDPs

2. Preliminaries
   2.1. Probability Theory
   2.2. First Order Logic
        2.2.1. Syntax and Semantics
        2.2.2. Updating Interpretations
        2.2.3. Logical Partitions
        2.2.4. Many-Sorted Logics
   2.3. Partially Observable Markov Decision Processes
        2.3.1. Partially Observable Environments
        2.3.2. Histories and Policies
        2.3.3. Goals
        2.3.4. Finite State Controllers
        2.3.5. Factorization
        2.3.6. The Home Theater Setup POMDP
   2.4. Monte-Carlo Tree Search
        2.4.1. Generative Indefinite Horizon Markov Decision Processes
        2.4.2. The MCTS Scheme
        2.4.3. Concrete MCTS Algorithms
        2.4.4. MCTS for POMDPs
   2.5. Hierarchical Task Networks

3. Finite State Controllers for Partially Observable HTN Planning

4. Partially Observable Hierarchical Task Networks
   4.1. Transferring HTN Concepts to POMDP Planning
        4.1.1. Actions
        4.1.2. Methods and Abstract Observations
        4.1.3. Method Application
        4.1.4. Problems and Solutions
   4.2. POHTN Plan Generation
        4.2.1. Binding Variables and Choosing Nodes to Decompose
        4.2.2. Decomposability
        4.2.3. Pseudo-Primitive Controllers
        4.2.4. The Decomposition MDP
   4.3. Reducing Deterministic HTN Planning to POHTN Planning
        4.3.1. Representing HTN Domain Dynamics
        4.3.2. Representing HTN Methods
        4.3.3. Matching Solution Criteria
   4.4. Comparison to Other Non-Deterministic Hierarchical Approaches

5. Evaluation
   5.1. Experimental Setting
   5.2. Home Theater Setup
        5.2.1. Problem Generation
        5.2.2. POHTN Hierarchy
        5.2.3. Results
   5.3. Skill Teaching
        5.3.1. Problem Generation
        5.3.2. POHTN Hierarchy
        5.3.3. Results
   5.4. Navigation
        5.4.1. Problem Generation
        5.4.2. POHTN Hierarchy
        5.4.3. Results
   5.5. Elevators
        5.5.1. Problem Generation
        5.5.2. POHTN Hierarchy
        5.5.3. Results
   5.6. Summary and Further Analysis

6. Interleaving Decomposition and Policy Simulation
   6.1. The Interleaved Decomposition MDP
   6.2. Evaluation
   6.3. Discussion and Further Improvements
        6.3.1. More Elaborate Interleaved Decomposition MDP
        6.3.2. Online Planning

7. Combining Hierarchical and History-based Planning
   7.1. Evaluation
   7.2. Discussion

8. Conclusion and Future Work
   8.1. Conclusion
   8.2. Future Work
        8.2.1. Further Improvements to the LFSC Policy Representation
        8.2.2. Policy Explanation
        8.2.3. Monotonicity

A. CD Contents

B. RDDL Code for the Home Theater Setup Domain
   B.1. Domain
   B.2. Example Instance
   B.3. Hierarchy
   B.4. Example Initial Controller


Chapter 1.

Introduction

The progress in the fields of computer science and the engineering sciences in recent years has provided, and continues to provide, society with a multitude of electronic services and systems that equip humans with increasingly powerful functionality. These systems and services, and in particular their interplay, have the potential to grant an enormous benefit to humans in organizing their everyday life. Fully benefiting from these capabilities, however, burdens and possibly overburdens users with mastering this complex functionality.

Automated assistance systems can provide a solution to this issue and enable humans to make use of the full potential of this development. For this, such assistance systems require both cognitive abilities for reasoning about recommendations, instructions, and explanations, and sophisticated capabilities for user interaction and for recognition of their user and environment. Also, they should tailor their functionality individually to a user's capabilities, preferences, and emotional state [Biundo and Wendemuth, 2015]. The recent years have seen increased interest in this topic, leading to a growing number of related research projects in academia, such as the Transregional Collaborative Research Centre "Companion-Technology for Cognitive Technical Systems" [Biundo and Wendemuth, 2015], CoTeSys [Buss and Beetz, 2010], CITEC [Ritter, 2010], and many others [Biundo et al., 2016], as well as industrial activities by information technology corporations such as, for example, Google, Apple, Microsoft, Amazon, or Bosch.

An important part of such assistance systems, required in particular for providing instructions, is a means for automatically generating courses of action for tasks whose completion requires several steps, i.e., a planning component [Biundo et al., 2011]. Many such tasks inherently involve uncertainty, either because the assistance system receives its inputs through noisy sensors such as speech recognition or cameras, or because it is hard to predict the effects of its actions, e.g., whether to expect a traffic jam when the system considers proposing to take the car to go to work. In other words, such tasks exhibit partial observability as well as uncertainty in action execution.

This thesis is concerned with the realization of planning functionality that respects these circumstances specifically in the context of assistance systems, as will be elaborated upon in the following: first, the mathematical framework of Partially Observable Markov Decision Processes (POMDPs) is introduced, which is suitable for representing the aforementioned planning tasks that involve uncertainty, together with an example task that will serve as a running example throughout this thesis. Motivated by a major challenge in applying the POMDP framework, its computational difficulty, and a common property of assistance tasks, their hierarchical structure, the approach of Hierarchical Task Network (HTN) planning is then presented, which exhibits attractive properties for solving assistance problems in a deterministic setting. Since HTN planning aims at deterministic planning problems, this then gives rise to the question whether the advantages of HTN planning can be transferred into a probabilistic setting. After reviewing existing approaches combining hierarchical planning and uncertainty, the chapter concludes by proposing a new planning approach that directly combines HTN planning and POMDPs.

1.1. Planning under Uncertainty

Partially Observable Markov Decision Processes offer a widely used mathematical model for formalizing the aforementioned partially observable sequential decision problems. POMDPs can model problems where an agent has to deal with uncertainty both in action execution and in sensing its surroundings. Both types of uncertainty occur in the context of assistance systems: a system giving recommendations as to what step the user should take next needs to deal with the former case, i.e., the possibility that the user might not follow this recommendation and choose to do something else instead. The latter occurs whenever information about users needs to be gathered through sensors prone to uncertainty, i.e., through observations, such as speech recognition or localization [Pineau et al., 2003c]. POMDPs have been used for modeling many interesting user assistance tasks such as managing human-computer trust in dialogs [Nothdurft et al., 2014a; Nothdurft et al., 2014b], dialog management in general [Williams et al., 2005], or home theater setup assistance [Richter and Biundo, 2017]. There is also a large body of work that employs POMDPs in assisting elderly persons with cognitive impairments [Hoey et al., 2010; Pineau et al., 2003c; Pollack, 2002; Kautz et al., 2002].

In the interest of having a common illustrative assistance task throughout this thesis, the Home Theater Setup assistance task is introduced below. In this scenario, a human user needs to be assisted in setting up their home theater system. It models the task of, given an assortment of home theater devices and cables (Figure 1.1a), connecting the devices in such a way that they represent a working home theater system (Figure 1.1b). A multimodally interacting assistance system supports the user in this task, as shown in Figure 1.2 [Honold et al., 2014]. In its most common incarnation, the scenario is represented as a deterministic planning task that essentially encodes the problem of finding a satisfactory configuration of cables [Bercher et al., 2016; Bercher et al., 2015; Bercher et al., 2014]. The version used in this thesis additionally exhibits uncertainty in the execution of actions and partial observability, as described by Richter and Biundo [2017]: it assumes that the user will now and then plug in cables sloppily, such that the cable is plugged in, but not tightly enough to actually transport signals as intended. The effect of this is that, in contrast to the original version of the scenario, it is in general no longer sufficient to find a satisfactory configuration of cables: it also needs to be ensured that cables are correctly plugged in. Moreover, the model reflects reality in that one does not simply know whether a given signal has reached a device: one needs to actively check for signals, and this is only possible for output devices, e.g., the TV can be checked for video signals while the A/V receiver cannot.

The flip side of the expressive power offered by the POMDP framework is the incurred computational complexity: solving POMDPs is PSPACE-complete even when the number of time steps considered is limited to a finite constant and without considering factored, i.e., compact logic-based, representations for world states, actions, or observations [Papadimitriou and Tsitsiklis, 1987].


[Figure 1.1 consists of two schematic panels showing the satellite receiver (Sat), Blu-ray player, A/V receiver, and TV. (a) The task: an assortment of unconnected devices and cables. (b) The goal: the devices are properly connected.]

Figure 1.1.: Schematic representation of the Home Theater Setup domain. The A/V receiver, TV, satellite receiver, and Blu-ray player each have various female ports, where dedicated symbols in the figure denote HDMI, SCART, cinch video, and cinch audio ports, respectively. Black ports on cables denote male ports. There are two HDMI cables and one SCART-to-cinch-AV cable. The devices should be connected in such a way that the video signals of the Blu-ray player and the satellite receiver reach the TV. The respective audio signals should be transported to the A/V receiver, which is connected to speakers. (Source: [Richter and Biundo, 2017, Figure 6.1], reprint and modification by courtesy of Springer International Publishing AG)

Figure 1.2.: A user interacting with an assistance system helping him to set up a home theater system. (Source: SFB/Transregio 62, Video "The Assembly Assistant", minute 18:52 / 20:27, https://companion.informatik.uni-ulm.de/DS-Filme/SFB-TRR-62_Demonstrationsszenario_1_1080.mp4, courtesy of Prof. Dr. Susanne Biundo-Stephan)


Two aspects make planning in POMDPs difficult: the "curse of dimensionality" [Kaelbling et al., 1998], which refers to the difficulty of reasoning over the (n − 1)-dimensional continuous space of probability distributions over world states in POMDPs with n states, and the "curse of history" [Pineau et al., 2003b], which refers to the fact that the number of possible so-called histories, i.e., sequences of executed actions and received observations, grows exponentially with the number of time steps considered. Both aspects become even more problematic when compact representations are used for describing world states, actions, and observations [Boutilier et al., 1999]. This means one cannot hope to find an efficient exact solution algorithm for arbitrary POMDP problems of relevant size.
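To make the curse of history concrete, the following Python sketch counts the observable histories of a small POMDP; the problem sizes are hypothetical and chosen purely for illustration.

import math  # only for pretty-printing the order of magnitude

# The number of observable histories (a_1, o_1, ..., a_t, o_t)
# is (|A| * |O|)**t, i.e., exponential in the number of time steps t.
num_actions, num_observations = 5, 10   # hypothetical problem sizes

for t in (1, 5, 10, 20):
    n = (num_actions * num_observations) ** t
    print(f"t = {t:2d}: {n} histories (~1e{int(math.log10(n))})")
# Already for t = 20 there are 50**20, roughly 9.5e33, histories --
# far too many to enumerate explicitly.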

1.2. Hierarchical Planning

Fortunately though, the tasks encountered in the context of user assistance often naturally exhibit hierarchical structure and can be described as compositions of subtasks. In the example of Figure 1.1, connecting the Blu-ray player to the TV or connecting the satellite receiver to the TV constitute such subtasks. This hierarchical structure has also been noted by, e.g., Toussaint et al. [2008], who say that "prompting systems that assist older adults with activities of daily living [...] can be naturally decomposed into subtasks for each step of an activity". Fox [1997] argues that this hierarchical structure can and should be exploited for plan generation by stating that "one of the strongest motivations for using some form of abstraction in planning is the observation that people use it to great effect in their problem-solving". Indeed, knowledgeable humans are quite successful at solving such problems. They approach planning tasks by finding solutions in a top-down manner, decomposing problems into subproblems and considering only a limited number of possibilities for how to solve a given subtask, i.e., they apply known problem solving "recipes" to subproblems [Byrne, 1977].

A successful planning approach that mimics this behavior is HTN planning [Erol et al., 1994; Biundo and Schattenberg, 2001; Geier and Bercher, 2011; Alford et al., 2015]. HTN planning also models planning problems in terms of subproblems, so-called abstract tasks, and a number of solution recipes, so-called methods, for each abstract task. It is attractive for several reasons: first, HTN planning makes it easy for domain experts to capture their knowledge in a domain model, since, as noted above, this knowledge is often hierarchical in nature. HTN planning has therefore been successfully applied to many real-world problems: it can be employed for, e.g., smartphone operation [Biundo et al., 2011] or web service composition [Lin et al., 2008]. Also, Nau et al. [2005] give a survey of applications of the SHOP (Simple Hierarchical Ordered Planner) and SHOP2 planning systems, which include evacuation planning, fighting forest fires, and controlling unmanned aerial vehicles. Second, HTN planning leads to efficient plan generation: an HTN planning system only needs to consider plans that are consistent with the hierarchy, which leads to smaller search spaces compared to planning without hierarchical knowledge. Third, using a technique that imitates human problem solving behavior seems appropriate for facilitating non-standard planning functionality relevant in automated assistance systems. Explanation functionality is a prime example of this, as it has been shown to be important in building and maintaining a user's trust in an assistance system [Lim et al., 2009; Nothdurft et al., 2014b]. The fact that HTN planners employ procedures similar to those humans use for generating courses of action simplifies explaining their decisions to humans [Seegebarth et al., 2012; Bercher et al., 2014; Bercher et al., 2016]. Fourth, planning in a similar manner as humans do facilitates including humans in the process of creating plans itself, which leads to an approach called mixed-initiative planning [Behnke et al., 2016].

However, HTN planning uses a deterministic, fully observable planning problem description and hence cannot be directly applied in the setting described above. The question therefore arises whether the advantages that HTN planning exhibits in deterministic settings can be combined with the advantages of modeling user assistance planning tasks as POMDPs.

1.3. Related Work

Ideas in this direction have been pursued in the past, both from the area of POMDP planning and from hierarchical task network planning. On the one side, there are approaches for leveraging various kinds of hierarchies for MDP and POMDP planning. Conversely, there also exist HTN-based approaches that include some form of uncertainty handling. The following subsections elaborate on both.

1.3.1. Approaches Based on HTN Planning

Kuter and Nau [2004] extend the deterministic HTN planner SHOP2 [Nau et al., 2003] to planning in fully observable non-deterministic domains. In contrast to MDPs, however, uncertainty in action execution is qualitative, i.e., action outcomes are not associated with probabilities, and goals are not specified in terms of rewards but in terms of states to be reached. The approach presents a technique for adapting deterministic forward-chaining planners to non-deterministic planning problems. Applied to SHOP2, this results in the planner ND-SHOP2, which successively plans for every state that can be reached by applying actions prescribed by a current task network. A solution is a so-called strong-cyclic policy, which is guaranteed to eventually reach a goal state under a certain fairness assumption [Cimatti et al., 2003]. It is worth noting that the methods themselves are not modified compared to SHOP2, i.e., it is not possible to directly make a method specify how to react to non-determinism. A later version of the approach is implemented as a wrapper around the unmodified SHOP2 planner using an all-outcomes determinization of the non-deterministic problem [Kuter et al., 2008]. The approach was also extended to employ a compact state set representation based on decision diagrams [Kuter et al., 2009], and to allow a limited form of partial observability [Kuter et al., 2007] where there exist dedicated sensing actions for some state variables that deterministically determine their values.

A similar approach that additionally allows for conditional branching in methods is proposed by Bouguerra and Karlsson [2004] in their algorithm C-SHOP, which is based on SHOP [Nau et al., 1999]. They informally propose a planning problem definition that incorporates partial observability similar to POMDPs and present a planner that returns a conditional plan together with the probability that the plan can be executed to completion. The authors do not present results concerning correctness or completeness of the presented algorithm.

A conceptually simple approach to uncertainty handling in HTN planning is to ignore uncertainty during plan generation altogether and then, during plan execution, repair the plan in the case of a deviation between the expected and the actual world state [Warfield et al., 2007; Bidot et al., 2008]. For example, this approach was applied in the context of the original, deterministically modeled Home Theater Setup assistance task to compensate for failures during plan execution [Bercher et al., 2014]. Plan repair, however, relies on having access to the true world state during plan execution, i.e., it requires full observability, and is thus not applicable in a POMDP setting such as the version of the Home Theater Setup assistance task described above [Richter and Biundo, 2017].

1.3.2. Approaches Based on Decision-Theoretic Planning

Decision-theoretic planning research has produced three main approaches to hierarchical planning for fully observable MDPs, which form the basis of existing extensions to POMDPs: the options framework [Sutton et al., 1999], the Hierarchical Abstract Machines (HAM) approach [Parr and Russell, 1998; Parr, 1998], and the MAXQ approach [Dietterich, 1998; Dietterich, 2000].

The options framework uses Semi-Markov Decision Processes (SMDPs) [Howard, 2013] as a basis for formalizing temporally extended actions. An option, i.e., a temporally extended or abstract action, is then defined by specifying an implementing policy, an initiation set that determines the states in which the option can be taken, and a function that specifies the probability that the option terminates in a given state. Note that there is a one-to-one correspondence between options and implementing policies, contrary to how abstract actions are understood in HTN planning. Options can be hierarchical in the sense that the actions executed by their associated policy can in turn be options. Planning is performed by adding a set of options to the MDP action set and adapting existing MDP solution algorithms such as Value Iteration to SMDPs, which can be done in a relatively straightforward manner.
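The three defining elements of an option can be summarized in a small data structure. The following Python sketch is illustrative only and not taken from the cited work; it merely mirrors the (initiation set, policy, termination function) triple of Sutton et al. [1999], and all identifier names are hypothetical.

from dataclasses import dataclass
from typing import Callable, FrozenSet, Hashable

State = Hashable   # placeholder type aliases for this sketch
Action = Hashable

@dataclass(frozen=True)
class Option:
    """A temporally extended action in the sense of Sutton et al. [1999]:
    an initiation set, a (single) implementing policy, and a function
    giving the probability of terminating in a given state."""
    initiation_set: FrozenSet[State]            # states where the option may start
    policy: Callable[[State], Action]           # the implementing policy
    termination_prob: Callable[[State], float]  # termination probability per state

    def can_start(self, s: State) -> bool:
        return s in self.initiation_set

The one-to-one correspondence between options and implementing policies noted above is visible here: the policy is a fixed field of the option, not something a planner chooses.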

In the HAM approach, policies are specified as a finite-level hierarchy of stochastic finite state automata that constrain the set of possible policies. These automata have four kinds of states: action, call, stop, and choice states. Action states execute an associated primitive MDP action and stochastically determine the successor machine state conditional on the new world state. A call state is associated with a (single) child automaton. When entered, execution of the parent automaton is suspended and control is passed to the child automaton. The child automaton is executed until a stop state is reached, at which point control is passed back to the calling parent. To enable a hierarchy to represent a set of policies rather than a single policy, choice states are used, which represent decision points where a planner needs to choose a successor machine state for each MDP world state. The combination of a HAM hierarchy and an MDP results in an SMDP in which actions correspond to choosing successor machine states at choice states. A solution to this SMDP can then be interpreted as a policy for the original MDP.

The MAXQ approach works by specifying a task hierarchy with finitely many layers, called the task graph. Each node in the graph represents a task, and edges specify which subtasks can be used for constructing a policy for an abstract task. More specifically, each task consists of three elements: first, the subtasks it is allowed to use in its policy. Tasks at the lowest level of the hierarchy only use primitive actions, while higher-level tasks can use lower-level abstract tasks for their policies. Second, each task is associated with a termination predicate that partitions the set of world states into active states, where the task can be executed, and termination states, in which the task is terminated and control is given back to the parent task. Third, a pseudo-reward function can optionally be defined for the terminal states of a task to represent the subgoal it should achieve. Note that tasks that use other subtasks in their action set are again SMDPs. Planning in MAXQ works by computing policies bottom-up, starting at the lowest-level tasks, so that a MAXQ policy is represented as a collection of subtask policies. Because the value of a given state under a MAXQ policy can be understood as a sum of values representing the value-to-completion of each subtask in the current call hierarchy, the MAXQ approach is also called MAXQ value function decomposition.
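Analogously, a MAXQ task node can be sketched as a small data structure. Again, this is only an illustrative reading of the three elements above (subtasks, termination predicate, optional pseudo-reward), not code from the cited work; modeling primitive actions as tasks without subtasks is a simplifying assumption of the sketch.

from dataclasses import dataclass
from typing import Callable, Hashable, Optional, Tuple

State = Hashable

@dataclass
class MaxQTask:
    """A node of a MAXQ task graph: the subtasks its policy may use, a
    termination predicate splitting world states into active and terminal
    ones, and an optional pseudo-reward over terminal states."""
    name: str
    subtasks: Tuple["MaxQTask", ...] = ()              # empty for primitive actions
    is_terminal: Callable[[State], bool] = lambda s: False
    pseudo_reward: Optional[Callable[[State], float]] = None  # subgoal shaping

    @property
    def is_primitive(self) -> bool:
        return not self.subtasks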

Existing extensions to the partially observable case are all based on the MAXQ and options approaches. To the best of the author's knowledge, there are no attempts to extend the HAM approach to partially observable domains.

One example of a MAXQ extension to POMDPs is the MAXQ-hierarchical policy iteration algorithm proposed by Hansen and Zhou [2003]. The paper identifies finite state controllers as a good representation for POMDP policies and uses policy iteration for optimizing subtask controllers bottom-up.

Another approach based on MAXQ was developed by Pineau et al. [2003a; 2004]. They first introduce the hierarchical MDP algorithm Policy Contingent Abstraction (PolCA), an extension of MAXQ that eases subtask policy generation by applying a fully automatic state abstraction algorithm due to Dean and Givan [1997] to each subtask before it is solved. A deviation from the original MAXQ approach lies in its policy execution algorithm: in MAXQ, each subtask policy is executed until termination. In PolCA, the next action to execute is always determined by traversing the task hierarchy starting at the root task. This algorithm is then generalized to a POMDP version called PolCA+ by additionally respecting observations during the abstraction process and by approximating world state transition probabilities for abstract tasks by assuming their corresponding policy will be queried at the true world state instead of the agent's belief state. Vien and Toussaint [2015] follow a very similar approach and employ Monte-Carlo tree search to solve the resulting planning problem.

Theocharous et al. [2001; 2002] present the HPOMDP formalism.¹ Its foundations lie in hierarchical Hidden Markov Models (HHMMs) [Fine et al., 1998], which extend normal HMMs used for describing Markov processes with partially observable states by allowing states to in turn be entire Markov processes. HPOMDPs extend HHMMs in the same manner as POMDPs extend HMMs, i.e., by adding actions and rewards. The HHMM foundation means that the HPOMDP model includes state abstraction. This is complemented with temporal abstraction by introducing macro-actions based on the options framework. Planning is done approximately by solving the fully observable part of the model. The resulting policy is executed by combining exact belief state tracking with either the most likely state heuristic or the QMDP heuristic, which choose an action for the current belief state based on the computed MDP policy.

Theocharous and Kaelbling [2004] also present another approach rooted in the options framework: they use macro-actions, represented as, e.g., finite state automata, and combine them with a belief space approximation technique called dynamic grid approximation. This yields an SMDP whose states are belief grid points in the original POMDP. Policies are then computed using real-time dynamic programming (RTDP).

¹ Note that earlier work on the POHTN approach described in this thesis used the term HPOMDP to denote the combination of a POHTN hierarchy and its underlying POMDP [Müller and Biundo, 2011; Müller et al., 2012]. To avoid confusion, this thesis does not use the term HPOMDP in the context of POHTNs.

The approach of He et al. [2010] employs open-loop macro-actions, i.e., action sequences, enabling it to quickly determine good policies using forward search in the space of beliefs. The macro-actions are generated automatically to either reach states with high immediate reward or high information gain, using a heuristic estimate. The authors also present a similar approach for POMDPs with continuous states [He et al., 2011], albeit without automatic construction of macro-actions. The simplifying assumption that belief states are always Gaussian distributions allows them to analytically compute the effects of action sequences on belief states.

There also exist approaches that aim at automatically finding hierarchical policies akin to the ones found by the approaches based on MAXQ. For the fully observable case, the HEXQ approach [Hengst, 2002] exploits a factored state representation and builds a hierarchy by successively adding a new level to the hierarchy for each state variable. The order in which state variables are processed depends on their frequency of change. For POMDPs, Charlin et al. [2007] build on the MAXQ-hierarchical policy iteration approach [Hansen and Zhou, 2003] and represent the hierarchy discovery problem as a non-convex optimization problem. Following this idea, Toussaint et al. [2008] propose a variation that employs the expectation-maximization (EM) algorithm. As noted by Toussaint et al. [2008] for their approach, however, these methods likely do not find human-readable hierarchies. Since this precludes, e.g., explaining policies to users, the hierarchies found by these approaches might not be adequate in the context of user assistance.

1.3.3. Other Approaches

Further approaches exist that make use of POMDPs and hierarchy in various specific domains. Sridharan et al. [2010] construct a hierarchical POMDP specifically for the purpose of visual information processing in robotics. There also exist approaches that employ domain-specific POMDP-based hierarchical methods for autonomous navigation [Foka and Trahanias, 2007; Theocharous and Mahadevan, 2002; Theocharous et al., 2001].

A more in-depth comparison of the aforementioned approaches and the approach presented in this thesis can be found in Section 4.4.

1.4. HTN Planning in POMDPs

None of the above approaches combines the adequacy of POMDPs in terms of modeling user assistance tasks with the benefits offered by the HTN planning approach. This thesis therefore explores the question whether the two can be combined in a single planning approach that exhibits the advantages of both. For this, the author proposes and evaluates the partially observable HTN (POHTN) approach. Its foundation is given by the finite state controller (FSC) policy representation, which is chosen due to its similarity with plans in deterministic planning. The FSC data structure is first generalized to be more compact for logic-based POMDPs, yielding the Logical FSC (LFSC) formalism [Müller and Biundo, 2011]. The principles of HTN planning are then reintroduced based on the LFSC data structure, augmented with observation abstraction to account for partial observability [Müller et al., 2012]. This thesis investigates theoretical properties of POHTNs and proposes a planning algorithm based on Monte Carlo Tree Search (MCTS).

The remainder of this thesis is structured as follows. Chapter 2 reviews the relevant background on POMDPs, HTN planning, and planning based on MCTS. Chapter 3 introduces the LFSC representation, based upon which the POHTN approach is defined in Chapter 4. The latter also discusses properties of POHTNs as well as MCTS-based planning with POHTNs. An empirical evaluation of the approach is given in Chapter 5.

Chapters 6 and 7 present extensions of the basic approach: Chapter 6 introduces a technique that mitigates the required effort for policy construction during planning by interleaving abstract task decomposition and policy evaluation. Chapter 7 presents an approach for combining hierarchical and non-hierarchical planning to create an algorithm that exploits hierarchical knowledge while at the same time converging to an optimal policy. Chapter 8 concludes the thesis with some final remarks.


Chapter 2.

Preliminaries

This chapter introduces the concepts underlying the approach described in this thesis. It is divided into five parts. The first two parts describe probability theory and first order logic, which form the basis for the remaining three parts. In the third part, the POMDP planning formalism is introduced, along with related concepts relevant to this thesis, including factored POMDPs. The fourth part describes Monte-Carlo Tree Search as an algorithmic framework for solving sequential decision problems involving uncertainty. The last part introduces Hierarchical Task Network planning concepts.

2.1. Probability Theory

Let Ω be a non-empty set, and Σ ⊆ 2^Ω a σ-algebra over Ω, i.e., a set of subsets of Ω that contains Ω itself and is closed under complementation and countable unions. A probability distribution over Ω is a function P : Σ → [0, 1] for which P(A) ≥ 0 for all A ∈ Σ, P(Ω) = 1, and, for every sequence A_1, A_2, A_3, . . . of pairwise disjoint sets from Σ, it holds that

$$P\Bigl(\bigcup_i A_i\Bigr) = \sum_i P(A_i).$$

The members of Ω are called outcomes and the members of Σ are called events. An example of a probability distribution is the discrete uniform distribution, which assigns the same probability to all single-outcome events of a finite set Ω and is denoted P_U(Ω).

The set of possible probability distributions over Ω is denoted ∆(Ω). Drawing a sample from a given distribution, i.e., probabilistically choosing an outcome ω from Ω according to P, is denoted ω ∼ P. The conditional probability P(A | B) of an event A, given that event B is known to be true, is defined as

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$

Often, P(A, B) is used as a shorthand notation for P(A ∩ B). A random variable X : Ω → cdm(X) is a function mapping outcomes to values of some set, the co-domain cdm(X) of X, called realizations. A realization x ∈ cdm(X) can itself be seen as an event by defining P(X = x) = P({ω ∈ Ω | X(ω) = x}). When X is clear from the context, P(X = x) is abbreviated to P(x). A discrete random variable is a random variable X where |cdm(X)| < ∞. For such discrete random variables, a probability distribution can be defined in terms of a probability mass function f_X : cdm(X) → ℝ defined as f_X(x) = P(X = x). The author uses the familiar notation P(X) for the probability mass function of X. The above-mentioned notations can also be combined, so that, e.g., the expression P(X | y, Z) is used for denoting a function f(x, z) = P(x | y, z) = P(X = x | Y = y, Z = z).


The expected value of a real-valued random variable X with respect to a probability distribution P, denoted E_{x∼P} X, is defined as

$$\mathop{\mathbb{E}}_{x \sim P} X = \sum_{x \in \mathrm{cdm}(X)} x \, P(x).$$

An estimator for a quantity θ related to a random variable X, e.g., its expected value, is a function that maps a set x = {x_1, . . . , x_n} of observed realizations of X to an estimate θ̂(x) of θ. An example of an estimator for the expected value E X of a real-valued random variable X is the arithmetic mean (ÊX)(x) = (1/n) ∑_i x_i. If, as is the case for this example, the expected value of the estimator is θ itself, i.e., E θ̂ = θ, the estimator is called unbiased.
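As an illustration, the following Python sketch implements this arithmetic-mean estimator by sampling; the fair-die example is hypothetical and serves only to show the estimator converging to the true expected value.

import random

def mc_mean(sample, n=10_000):
    """Arithmetic-mean estimator (1/n) * sum_i x_i for the expected value
    E X, where sample() draws one realization x ~ P."""
    return sum(sample() for _ in range(n)) / n

# Hypothetical example: X is the value of a fair six-sided die, so E X = 3.5.
die = lambda: random.randint(1, 6)
print(mc_mean(die))   # prints a value close to 3.5, up to sampling noise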

2.2. First Order Logic

The approach presented in this thesis employs first-order logic with equality as a basis for describing the structured planning problems introduced in Section 2.3.5.

2.2.1. Syntax and Semantics

The alphabet of a first-order language is given by the logical symbols and a signature, where the logical symbols are, as usual, given by the union of the set {∀, ∃, ¬, ∧, ∨, ⇒, ⇔, =}, the set of symbols for parentheses, and a recursively enumerable set of variable symbols V. A (single-sorted) signature P = P ∪ F ∪ C defines disjoint finite sets of predicate symbols P, function symbols F, constant symbols C, and the symbol false. Predicate and function symbols are associated with a natural number, their arity. Terms are constructed recursively by defining that

• every variable v ∈ V is a term,

• every constant c ∈ C is a term, and

• for an n-ary function symbol f and terms t1, . . . , tn, f(t1, . . . , tn) is a term.

Formulas are also defined recursively:

• The symbol false is a formula.

• For an n-ary predicate symbol p and terms t1, . . . , tn, p(t1, . . . , tn) is a formula.

• For terms t1 and t2, t1 = t2 is a formula.

• For a variable symbol x and formulas ϕ and ψ, all of ¬ϕ, ϕ ∧ ψ, ϕ ∨ ψ, ϕ ⇒ ψ, ϕ ⇔ ψ, ∃x ϕ, and ∀x ϕ are also formulas.

When a formula contains a variable without a corresponding quantification over that variable, the variable is said to be free. Formulas without free variables are called closed. The author also uses true as an abbreviation for ¬false.

Meaning is assigned to formulas via interpretations. An interpretation over a signature P is a tuple (D, d), where D is a non-empty set called the domain. The function d assigns


• a function d(f) : Dⁿ → D to every n-ary function symbol f ∈ F,

• a function d(p) : Dⁿ → B to every n-ary predicate symbol p ∈ P, where B = {true, false} denotes the set of Boolean values, and

• an element d(c) ∈ D to every constant c ∈ C.

A valuation is a function β : V → D that assigns domain elements to variables. Let β[x ↦ d] denote a valuation that agrees with β on all variables except x, which is instead mapped to d. A combination of an interpretation I = (D, d) and a valuation β can be used for assigning domain elements to terms by defining

• (I, β)(x) = β(x) for a variable symbol x ∈ V ,

• (I, β)(c) = d(c) for a constant symbol c ∈ C, and

• (I, β)(f(t1, . . . , tn)) = d(f)((I, β)(t1), . . . , (I, β)(tn)) for terms t1, . . . , tn and an n-ary function symbol f ∈ F.

A formula ϕ is said to be satisfied under interpretation I and valuation β, denoted (I, β) |= ϕ, or not satisfied, denoted (I, β) ⊭ ϕ, according to the following rules:

• (I, β) ⊭ false

• (I, β) |= p(t1, . . . , tn) iff. p ∈ P and d(p)((I, β)(t1), . . . , (I, β)(tn)) = true

• (I, β) |= t1 = t2 iff. (I, β)(t1) = (I, β)(t2)

• (I, β) |= ∀x ϕ iff. for all d ∈ D, it holds that (I, β[x ↦ d]) |= ϕ

• (I, β) |= ∃x ϕ iff. there exists a d ∈ D such that (I, β[x ↦ d]) |= ϕ

• (I, β) |= ¬ϕ iff. (I, β) ⊭ ϕ

• (I, β) |= ϕ ∧ ψ iff. (I, β) |= ϕ and (I, β) |= ψ

• (I, β) |= ϕ ∨ ψ iff. (I, β) |= ¬(¬ϕ ∧ ¬ψ)

• (I, β) |= ϕ ⇒ ψ iff. (I, β) |= ¬ϕ ∨ ψ

• (I, β) |= ϕ ⇔ ψ iff. (I, β) |= ϕ ⇒ ψ and (I, β) |= ψ ⇒ ϕ

For a closed formula ϕ, whether (I, β) |= ϕ holds is independent of the valuation β, i.e., it depends only on I. One can thus use the abbreviation I |= ϕ and call I a model for ϕ.
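For illustration, the following Python sketch evaluates the satisfaction relation for a small fragment of this logic (negation, conjunction, universal quantification, and atomic formulas) over a finite domain. The tuple encoding of formulas and all names are hypothetical choices made for this sketch, not part of the formalism.

def holds(phi, interp, beta):
    """Return True iff (I, beta) |= phi for a small fragment of the logic.
    Formulas are nested tuples, e.g. ("forall", "x", ("p", "x")); interp
    maps "domain" to the finite domain D, predicate symbols to Boolean
    functions, and constant symbols to domain elements."""
    op = phi[0]
    if op == "not":
        return not holds(phi[1], interp, beta)
    if op == "and":
        return holds(phi[1], interp, beta) and holds(phi[2], interp, beta)
    if op == "forall":                      # (I, beta) |= forall x phi iff.
        _, x, body = phi                    # (I, beta[x -> d]) |= phi for all d in D
        return all(holds(body, interp, {**beta, x: d})
                   for d in interp["domain"])
    pred, *terms = phi                      # atomic formula p(t1, ..., tn)
    values = [beta[t] if t in beta else interp[t] for t in terms]
    return interp[pred](*values)

I = {"domain": {1, 2, 3}, "positive": lambda d: d > 0}
print(holds(("forall", "x", ("positive", "x")), I, {}))   # True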


2.2.2. Updating Interpretations

This thesis will require a means for specifying interpretations as modifications of other interpretations [Schattenberg, 2009; Stephan and Biundo, 1993]. This first requires defining updates:

Definition 2.1. An update takes one of the following forms:

• p(t1, . . . , tn), for an n-ary predicate symbol p and terms t1, . . . , tn,

• ¬p(t1, . . . , tn), for an n-ary predicate symbol p and terms t1, . . . , tn,

• f(t1, . . . , tn) ← t, for an n-ary function symbol f and terms t1, . . . , tn, t, or

• c ← t, for a constant symbol c and a term t.

The set of all updates with respect to a given signature P is denoted U(P). In conjunction with a valuation, an update induces an updated interpretation as follows:

Definition 2.2. Given an interpretation I = (D, d), a valuation β, and an update operation u, I[u, β] defines an interpretation (D, d′), i.e., an interpretation over the same domain, where d′ depends on the update u:

• If u is of the form p(t1, . . . , tn), then

$$d'(p)(d_1, \dots, d_n) = \begin{cases} \mathit{true} & \text{if } (I, \beta)(t_i) = d_i \text{ for all } i = 1, \dots, n, \\ d(p)(d_1, \dots, d_n) & \text{otherwise,} \end{cases}$$

d′(q) = d(q) for all q ∈ P \ {p}, d′(f) = d(f) for all f ∈ F, and d′(c) = d(c) for all c ∈ C.

• If u is of the form ¬p(t1, . . . , tn), then

$$d'(p)(d_1, \dots, d_n) = \begin{cases} \mathit{false} & \text{if } (I, \beta)(t_i) = d_i \text{ for all } i = 1, \dots, n, \\ d(p)(d_1, \dots, d_n) & \text{otherwise,} \end{cases}$$

d′(q) = d(q) for all q ∈ P \ {p}, d′(f) = d(f) for all f ∈ F, and d′(c) = d(c) for all c ∈ C.

• If u is of the form f(t1, . . . , tn) ← t, then

$$d'(f)(d_1, \dots, d_n) = \begin{cases} (I, \beta)(t) & \text{if } (I, \beta)(t_i) = d_i \text{ for all } i = 1, \dots, n, \\ d(f)(d_1, \dots, d_n) & \text{otherwise,} \end{cases}$$

d′(p) = d(p) for all p ∈ P, d′(g) = d(g) for all g ∈ F \ {f}, and d′(c) = d(c) for all c ∈ C.

• If u is of the form c ← t, then d′(c) = (I, β)(t), d′(p) = d(p) for all p ∈ P, d′(f) = d(f) for all f ∈ F, and d′(e) = d(e) for all e ∈ C \ {c}.


In other words, an update produces an interpretation that differs from the original interpretation in at most one constant symbol, or in at most one assignment of one function or predicate symbol. Obviously, β is required to contain an assignment for all variables in u. When the terms in u do not contain variables, I[u, β] can be abbreviated to I[u].

Updates can be chained by defining the result of a sequence of updates U recursively:

$$I[U, \beta] = \begin{cases} I & \text{if } U \text{ is empty,} \\ (I[u_1, \beta])[U', \beta] & \text{if } U = u_1\, U'. \end{cases}$$
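The following Python sketch mirrors this update semantics for variable-free updates, using a hypothetical dictionary encoding of interpretations and updates; it is meant as an illustration of Definitions 2.1 and 2.2, not as part of the formalism.

from copy import deepcopy

def apply_update(interp, update):
    """Compute I[u] for a variable-free update u. Interpretations are
    encoded as nested dicts; updates as tagged tuples, e.g.
    ("pred", "plugged_in", ("hdmi1", "tv"), True) for p(t1, t2) and its
    negation, ("func", f, args, value) for f(...) <- t, and
    ("const", c, value) for c <- t. All names are hypothetical."""
    new = deepcopy(interp)             # differs in at most one entry
    kind = update[0]
    if kind == "pred":
        _, p, args, value = update
        new["preds"][p][args] = value
    elif kind == "func":
        _, f, args, value = update
        new["funcs"][f][args] = value
    else:
        _, c, value = update
        new["consts"][c] = value
    return new

def apply_updates(interp, updates):
    """Chained update I[U]: apply the updates of U from left to right."""
    for u in updates:
        interp = apply_update(interp, u)
    return interp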

2.2.3. Logical Partitions

Another required tool is the notion of a logical partition:

Definition 2.3. A multiset of formulas {ϕ0, . . . , ϕn−1} over a common signature is called a logical partition if for all valuations β and all interpretations I it holds that

• (I, β) |= ϕi implies (I, β) ⊭ ϕj for all 0 ≤ i, j < n with i ≠ j, and

• there exists an i such that (I, β) |= ϕi.

In other words, the formulas are mutually exclusive and together form a tautology. Note that the definition uses multisets, i.e., the collection of formulas under consideration is in principle allowed to contain multiple occurrences of the same formula. Of course, false is the only formula that can be contained multiple times while still allowing such a multiset to be a partition; for example, {false, false, p(c), ¬p(c)} is a logical partition. The definition is based on multisets because such collections with repeated formulas will frequently arise in this thesis.

2.2.4. Many-Sorted Logics

To keep the above description of first-order logic concise, it is limited to single-sorted predicate logic. For convenience, however, this thesis will frequently make use of many-sorted predicate logic. In many-sorted predicate logic, signatures are tuples (S, P, type), where S is a set of sort symbols and P is the set of predicate, function, and constant symbols as above. The type function associates the symbols in P with sorts according to their arity and, in the case of function and constant symbols, additionally with one sort for the values they can assume. Similarly, variable symbols are also annotated with a sort. The sorts restrict how formulas can be formed by only allowing terms of appropriate sorts to be used for creating terms from n-ary function symbols and formulas from n-ary predicate symbols.

When S is finite, as is always the case for the purposes of this thesis, many-sorted logic can be reduced to single-sorted logic by introducing a unary predicate for each sort, introducing axioms that require these sort predicates to form a partition of the domain of discourse, and adding sort predicates in quantified formulas to limit the range of quantifications: a quantification ∀x:Device ϕ, for example, becomes ∀x (Device(x) ⇒ ϕ).

2.3. Partially Observable Markov Decision Processes

This thesis uses POMDPs as a means for formalizing planning problems in the presence of partial observability [Smallwood and Sondik, 1973; Sondik, 1971]. This section introduces, in this order, the domain dynamics of POMDPs, how a policy controls an agent, how the agent's goals are represented, how POMDPs can be compactly represented, and an existing but limited compact policy representation.

[Figure 2.1 shows the agent-world interaction loop: the agent sends an action a to the world, the world state transitions from s to s′ ∼ T(s, a), and the agent receives an observation o ∼ Z(a, s′).]

Figure 2.1.: A visualization of a POMDP agent interacting with its environment. (Source: [Richter and Biundo, 2017, Figure 6.2], reprint and modification by courtesy of Springer International Publishing AG)

2.3.1. Partially Observable Environments

A POMDP agent interacts with its environment by, in alternation, executing an action and receiving an observation, as depicted in Figure 2.1. Executing an action changes the state of the agent's environment. Due to partial observability, the agent cannot see the new state. Instead, it receives an observation, whose generation is dependent on the new state of the environment, but which in general contains only noisy and incomplete information about that state.

More specifically, the possible states of the environment are given as a finite set of world states S. The actions the agent can execute are given as the finite set A. For each state s and action a, the transition function T : S × A → ∆(S) specifies a probability distribution over possible successor world states, so that the result s′ of executing a in s is sampled from T(s, a) according to s′ ∼ T(s, a). The finite set O determines the possible observations the agent can make. Which observation the agent sees is governed by the observation function Z : A × S → ∆(O), i.e., o is sampled as o ∼ Z(a, s′) when executing a resulted in state s′.
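The interaction loop of Figure 2.1 can be summarized in a few lines of Python. The sketch below assumes a hypothetical encoding where T(s, a) and Z(a, s′) return probability mass functions as dictionaries and the agent is an object offering next_action() and observe(o) methods; none of these names are prescribed by the formalism.

import random

def draw(pmf):
    """Sample an outcome from a probability mass function given as an
    {outcome: probability} dict."""
    return random.choices(list(pmf), weights=list(pmf.values()))[0]

def run(T, Z, agent, s0, steps):
    """One run of the agent-world loop of Figure 2.1, starting in s0."""
    s, history = s0, []
    for _ in range(steps):
        a = agent.next_action()      # the agent commits to an action
        s = draw(T(s, a))            # the world transitions: s' ~ T(s, a)
        o = draw(Z(a, s))            # the agent senses:      o  ~ Z(a, s')
        agent.observe(o)             # only o, never s', reaches the agent
        history += [a, o]
    return history                   # the sequence (a_1, o_1, ..., a_t, o_t)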

2.3.2. Histories and Policies

The agent chooses its actions based on the information it has access to. In each repetition of the cycle, the agent receives new information, namely the action a it executes and the observation o it sees. After t repetitions of the cycle, the agent has accrued the sequence (a1, o1, . . . , at, ot), which is called an observable history of length t. The behavior of the agent for a maximum of t time steps can therefore be described as a function π : H_t → A that chooses the appropriate action for each possible observable history of maximum length t, i.e., for all h ∈ H_t. Further making the assumption that, at a given time step t, the agent has followed π for the first t − 1 steps, only histories for which the executed actions coincide with the actions recommended by π need to be considered. To see this, consider a policy π that recommends executing π(ε) = a for the empty history ε. Then, assuming the agent does not deviate from its policy and actually executes a first, π does not need to be defined for any history (a1, o1, . . . , at, ot) where a1 ≠ a. This argument can be extended to all histories via induction. An agent following its policy thus only needs to look at the observations it receives.


[Figure 2.2 shows a policy tree of depth t: each node is labeled with an action and each edge with an observation o_0, . . . , o_{n−1}. The root node (marked "start", with t steps left) carries the action α(π), and the edge for observation o_0 leads to the subtree ω(π, o_0) with t − 1 steps left; the leaves have one step left.]

Figure 2.2.: A t-step policy π represented as a tree. Nodes are labeled with the action to execute, edges are labeled with observations. The root node is labeled with the start marker.

Such a policy maps observation histories (o1, . . . , ot) to actions, which can be conveniently represented as a tree [Kaelbling et al., 1998] where nodes are labeled with actions and edges are labeled with observations, as depicted in Figure 2.2. A t-step policy tree π is executed by first executing the action associated with the root node and following the edge labeled with the received observation o. This leads to a (t − 1)-step policy which can in turn be executed. Denoting the set of all possible policies as Π, policy execution can thus be defined in terms of the two functions α : Π → A and ω : Π × O → Π. The value α(π) determines the action to execute first, i.e., the action associated with the root node. The subpolicy to execute when o is observed after executing α(π) is identified by ω(π, o).
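A minimal sketch of this tree representation and of α and ω, assuming a hypothetical PolicyTree class whose subtrees field maps each observation to the corresponding (t − 1)-step subpolicy:

    from dataclasses import dataclass, field

    @dataclass
    class PolicyTree:
        action: object                                 # action at the root node
        subtrees: dict = field(default_factory=dict)   # observation -> PolicyTree

    def alpha(pi):
        """Action to execute first: the label of the root node."""
        return pi.action

    def omega(pi, o):
        """Subpolicy to follow after observing o."""
        return pi.subtrees[o]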

2.3.3. Goals

The next question is how the agent should choose the policy to execute. Choosing a policy necessitates a measure for comparing different policies with respect to their quality. For this, the goals of the agent are given in terms of a reward the agent receives every time it executes an action. This (immediate) reward is represented as a function R : S × A × S → ℝ, where R(s, a, s′) specifies how attractive it is for the agent if the result of executing a in s is s′. While R clearly determines policy quality, it cannot be directly employed as a quality measure for policies, since the agent is interested in the total reward it can expect to accumulate when acting over a number of time steps.

Let V^t_π : S → ℝ denote the expected accumulated reward of executing policy π for t time steps, starting in world state s. When π is to be executed for one time step, this means that its quality is given by the expected single-step reward of executing the recommended action. For more than one time step, V^t_π(s) additionally accounts for the expected accumulated reward V^{t−1}_π(s′) of executing π in a successor state s′, according to the Bellman principle of optimality [Kaelbling et al., 1998]:

$$
V^t_\pi(s) =
\begin{cases}
\mathbb{E}_{s' \sim T(s,\alpha(\pi))}\left[R(s,\alpha(\pi),s')\right] & \text{if } t = 1\\
\mathbb{E}_{s' \sim T(s,\alpha(\pi))}\left[R(s,\alpha(\pi),s') + \mathbb{E}_{o \sim Z(\alpha(\pi),s')}\left[V^{t-1}_{\omega(\pi,o)}(s')\right]\right] & \text{if } t > 1
\end{cases}
\tag{2.4}
$$

While V^t_π defines a utility value for each world state, it is often unreasonable to assume that a POMDP agent exactly knows the world state when it starts executing actions. Often, only a probability distribution over possible initial world states is given, called the initial belief state and denoted bI. In this case, the expectation over bI is taken, and the quality of π in bI over t time steps is given as

$$V^t_\pi(b_I) = \mathbb{E}_{s \sim b_I}\left[V^t_\pi(s)\right].$$

Finally, the number of time steps the performance of the agent is measured over is called the horizon H < ∞. Policy quality is therefore given by V^H_π(bI). This concludes the description of the constituent elements of POMDPs:
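Since evaluating V^H_π(bI) exactly requires summing over all state and observation sequences, a simple alternative is a Monte-Carlo estimate. The following sketch reuses the hypothetical sample, step, alpha, and omega helpers introduced above; the rollout count and all names are illustrative assumptions.

    def rollout_value(T, Z, R, b_I, pi, horizon):
        """Accumulated reward of one simulated execution of policy tree pi."""
        s = sample(b_I)                      # s ~ b_I
        total = 0.0
        for _ in range(horizon):
            a = alpha(pi)
            s_next, o = step(T, Z, s, a)
            total += R(s, a, s_next)
            pi, s = omega(pi, o), s_next
        return total

    def estimate_value(T, Z, R, b_I, pi, horizon, n=10000):
        """Monte-Carlo estimate of V^H_pi(b_I) as an average over n roll-outs."""
        return sum(rollout_value(T, Z, R, b_I, pi, horizon) for _ in range(n)) / n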

Definition 2.5. A tuple (S, A, O, T, Z, R, bI, H) of a partially observable environment together with a reward function, an initial belief state, and a horizon as described above, is called a POMDP.

POMDP solutions are optimal policies:

Definition 2.6. A solution of a POMDP P = (S, A, O, T, Z, R, bI, H) is a policy π∗ of maximal value among all its policies Π(P), i.e., $V^H_{\pi^*}(b_I) = \max_{\pi \in \Pi(P)} V^H_\pi(b_I)$.

Unfortunately, computing POMDP solutions is PSPACE-complete¹, i.e., intractable [Papadimitriou and Tsitsiklis, 1987]. One can therefore not hope to find optimal policies in general, especially for POMDPs of realistic size.

2.3.4. Finite State Controllers

Sometimes, policies can be described more compactly than by policy trees: when the policy recommends the same behavior for several histories, e.g., after an intermediate goal is achieved, the corresponding subtrees can be shared. It might also be useful to introduce cycles into the policy tree to represent repeated behavior, e.g., repeated attempts of executing an action that can fail. Figure 2.3 illustrates the possible savings that can be gained from such a representation. In the case of infinite horizon POMDPs, introducing cycles is even necessary to enable a finite policy representation. Compact policy representations are also especially important in the context of this thesis: since the author's endeavor involves specifying courses of action by hand, care needs to be taken that specifying these courses of action is as simple as possible.

Policies containing shared behavior and cycles as shown in Figure 2.3c are called finite state controllers [Kaelbling et al., 1998]:

Definition 2.7. An FSC is a tuple (N, λ, δ, nI), where N is the finite set of controller nodes, each associated with an action via the action association function λ : N → A. Transitions between nodes are given by the transition function δ : N × N → 2^O, i.e., each pair of nodes is associated with a subset of the set of observations. The initial node of the FSC is nI ∈ N.

¹More precisely, deciding whether there exists a policy π with V^H_π(bI) ≥ x for a given x ∈ ℝ is PSPACE-complete.


[Figure: three policy diagrams over actions a0–a6 and observations o0, o1, o2, shown as (a) a tree, (b) a tree with shared subpolicies, and (c) a finite state controller with cycles.]

Figure 2.3.: An illustration of the savings possible by sharing subpolicies and allowing for cycles in policies in a POMDP with three observations, highlighted in gray: (a) a policy that exhibits redundancies; (b) sharing subpolicies allows for a more compact policy representation; (c) allowing cycles in the policy yields a finite state controller. The shorthand notation o∗ stands for {o0, o1, o2}.


Executing a controller fsc = (N, λ, δ, nI) works by maintaining a current node, initially nI. In each time step, the action associated with the current node is executed. Then, the received observation o is used for updating the current node n to the node n′ for which o ∈ δ(n, n′). Formally, this can be defined in terms of the functions α and ω by representing the update of the current node through the construction of a modified controller with a different initial node:

$$
\alpha(fsc) = \lambda(n_I), \quad \text{and} \quad \omega(fsc, o) = (N, \lambda, \delta, n^*) \text{ where } n^* \text{ is the node for which } o \in \delta(n_I, n^*)
\tag{2.8}
$$

Obviously, such a node n∗ should both exist and be unique, i.e., the outgoing transitions of a given node form a partition of O:

Definition 2.9. A controller (N, λ, δ, nI) is called well-defined, if for all n ∈ N and o ∈ O it holds that

• o ∈ δ(n, n′) implies o ∉ δ(n, n′′) for all n′, n′′ ∈ N, n′ ≠ n′′, and

• there exists a node n′ ∈ N such that o ∈ δ(n, n′).

Note that an alternative definition for transitions as a function δ′ : N × O → N would ensure well-defined transitions without additional restrictions. However, the author uses the first definition because it generalizes more easily to the definitions later in this thesis.

Since executing a well-defined controller is defined in terms of α and ω, the expected accumulated reward V^t_{fsc}(s) of executing a well-defined controller in a state s can be determined in the same way as shown for policy trees in Equation 2.4. Consequently, FSCs can be POMDP solutions in the sense of Definition 2.6.
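A sketch of this execution loop, encoding the controller as a hypothetical node-to-action mapping act and a transition table delta that maps node pairs to observation sets as in Definition 2.7; the step helper from the earlier sketch provides the environment, and all names are illustrative assumptions.

    def execute_fsc(act, delta, n_I, T, Z, R, s, t):
        """act: node -> action; delta: (n_from, n_to) -> set of observations."""
        n, total = n_I, 0.0
        for _ in range(t):
            a = act[n]                              # alpha(fsc): action of current node
            s_next, o = step(T, Z, s, a)
            total += R(s, a, s_next)
            # omega(fsc, o): well-definedness guarantees a unique successor
            # node n' with o in delta(n, n').
            n = next(n_to for (n_from, n_to), obs in delta.items()
                     if n_from == n and o in obs)
            s = s_next
        return total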

2.3.5. Factorization

The plain POMDP definition does not assume any internal structure in the sets of states, actions, or observations. Particularly for planning problems of realistic size, however, it is often cumbersome to explicitly enumerate all possible world states, transition probabilities, and so on. The countermeasure for this is to choose a factored representation, which means representing the sets of states, actions, and observations in terms of features that can assume different values [Boutilier et al., 1999; Poupart, 2005]. This thesis will usually consider POMDPs represented using RDDL, the Relational Dynamic Influence Diagram Language [Sanner, 2010]. This section adds a formal description of a useful RDDL fragment to the syntactic description of RDDL by Sanner. The presented fragment is general enough to represent arbitrary POMDPs as defined in Definition 2.5, but omits features like integer and continuous variable domains. For an example of a factored POMDP modeled in RDDL, see Section 2.3.6.

In RDDL, POMDPs are specified in a factored representation based on typed first-order fluents. Let S denote the finite set of sort symbols and let F denote the finite set of fluent symbols. Each fluent has a typed parameter list, represented by the function dom : F → S∗, and a type that specifies what values it can assume, represented by cdm : F → S ∪ {𝔹}. In the many-sorted first-order logic sense, Boolean-valued fluents are predicates, all other fluents functions or constants, and dom and cdm together realize the type function. Each fluent belongs to a certain class, as identified by the function class : F → {state, act, obs, inter}. The classes state, act, and obs denote state, action, and observation fluents, respectively. Fluents belonging to class inter are intermediate fluents, akin to derived predicates in the Planning Domain Definition Language PDDL used for deterministic planning [Edelkamp and Hoffmann, 2004]. Their primary role is expressing correlated action effects.

States, actions, and observations are interpretations of the fluents of the respective class, i.e., denoting the subset of fluents of class c with Fc, a state is an interpretation over the signature (S, Fstate, type). Let I(X, Y) denote the set of interpretations over fluent symbols of class X using domain objects Y. For a given non-empty finite set of domain objects D, a POMDP defined in RDDL has the state space S = I(state, D), the action space A = I(act, D), and the observation space O = I(obs, D). Note how RDDL naturally supports concurrent actions, since an action in the RDDL sense is an interpretation over all action fluents.

Transition and observation probabilities are defined by specifying, for each non-action fluent, conditional probability functions (CPFs), which are parametrized probability mass functions. The purpose of the CPF of a given fluent F is to define a probability distribution over the functions d(F) that interpret F. In consequence, the CPFs of, e.g., all state fluents together define a probability distribution over complete interpretations of all state fluents, i.e., states.

In RDDL, CPFs are constructed recursively in a first-order like syntax from logical expressions, basic probability distributions, and if-then-else constructs. Formally, let val : S → 2^D denote the domain objects of a given sort and let val(𝔹) = 𝔹. A CPF for an intermediate fluent FI is a function cpf_FI : val(dom(FI)) × S × A → ∆(val(cdm(FI))), where, by slight abuse of notation, val(dom(F)) is shorthand for the application of val to the tuple entries of dom(F). In other words, cpf_FI defines, for each assignment g ∈ val(dom(FI)), state s ∈ S, and action a ∈ A, the probability P(d(FI)(g) = v | s, a) that the interpretation function of fluent FI, d(FI), assumes a given value v ∈ val(cdm(FI)).

State transitions may depend on the value of intermediate fluents, i.e., a CPF for state fluent FS is a function cpf_FS : val(dom(FS)) × S × A × I → ∆(val(cdm(FS))) representing P(d(FS)(g) | S, A, I). Observation CPFs cpf_FO : val(dom(FO)) × A × S → ∆(val(cdm(FO))) may, as for plain POMDPs, depend on the executed action and the successor state and represent P(d(FO)(g) | A, S′).

The observation function Z is then defined as

$$Z(a, s') = \prod_{F_O \in \mathcal{F}_{obs}}\; \prod_{g \in val(dom(F_O))} P(d(F_O)(g) \mid a, s').$$

This defines a joint probability distribution over the possible values of all assignments of all observation fluents. This means it is a probability distribution over all interpretations of observation fluents and hence a probability distribution over observations.

For an example of a simple observation CPF, consider an excerpt of a factored POMDP where an autonomous vehicle on Mars can analyze a rock sample it has obtained, which might or might not contain life. The POMDP has a Boolean 0-ary action fluent analyze, a 0-ary Boolean state fluent sampleContainsLife and a 0-ary Boolean observation fluent lifeDetected. The vehicle's sensor is always on (i.e., it produces a constant stream of sensor values), but only gives useful results when analyze is executed, and uniformly random noise otherwise. The CPF of lifeDetected can be represented as follows:

$$
Z(a, s') =
\begin{cases}
P(d(lifeDetected) = true) = 1 & a \models analyze \text{ and } s' \models sampleContainsLife\\
P(d(lifeDetected) = true) = 0 & a \models analyze \text{ and } s' \not\models sampleContainsLife\\
P(d(lifeDetected) = true) = 0.5 & a \not\models analyze
\end{cases}
$$

In RDDL syntax, as explained in detail by Sanner [2010], this is stated as

    lifeDetected = if (analyze) then
                       KronDelta(sampleContainsLife)
                   else
                       Bernoulli(0.5);

Hence, if the vehicle executes analyze, lifeDetected assumes the same value as sampleContainsLife using the Kronecker Delta, a deterministic probability distribution that places a probability of 1 on its argument, and is uniformly distributed over {true, false} otherwise.
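Under the illustrative assumption that states and actions are dictionaries from fluent names to values, this CPF could be sampled as follows; the sketch mirrors the semantics of the RDDL snippet, not an actual RDDL implementation.

    import random

    def sample_life_detected(a, s_next):
        """Sample the lifeDetected observation fluent given action a and
        successor state s_next, both encoded as {fluent_name: value} dicts."""
        if a["analyze"]:
            return s_next["sampleContainsLife"]   # KronDelta: deterministic copy
        return random.random() < 0.5              # Bernoulli(0.5): pure noise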

State transition probabilities are defined similarly, except that one needs to account for the intermediate fluents: since they are not part of the resulting state, they are marginalized out, i.e.,

$$
T(s, a) = \sum_{\substack{F_I \in \mathcal{F}_{inter},\; g_I \in val(dom(F_I)),\\ i \in val(cdm(F_I))}} P(d(F_I)(g_I) = i \mid s, a) \prod_{\substack{F_S \in \mathcal{F}_{state},\\ g_S \in val(dom(F_S))}} P(d(F_S)(g_S) \mid s, a, i).
$$

The reward is a function R : S × A × S → ℝ specified in a similar syntax as CPFs. RDDL also allows for the definition of an initial state² sI and a horizon H. The complete factored POMDP is thus defined as follows:

Definition 2.10. A factored POMDP is a tuple (S, F, dom, cdm, class, D, val, CPF_F, R, sI, H), where CPF_F = {cpf_F | F ∈ F} denotes the set of CPFs and the other tuple entries are defined as above.

RDDL divides the definition of a POMDP into two parts, the RDDL domain and the RDDL instance. The domain includes the definition of sorts, fluents, CPFs, the reward function, and specifies the available objects for so-called enum sorts. The instance specifies the domain objects for the remaining sorts, the initial state, and the horizon. This facilitates the definition of multiple RDDL instances of, e.g., different size for a given RDDL domain.

The last RDDL feature to be presented in this brief description is that of a default value that has to be defined for state and action fluents. In the case of state fluents, this enables a compact description of an initial state as a sequence of (variable-free) updates UI to an interpretation sdef that interprets all state fluents to their default values, i.e., sI = sdef[UI] as defined in Definition 2.2. While similar in spirit to other planning languages such as PDDL [Edelkamp and Hoffmann, 2004], where everything that is omitted in the initial state description is assumed to be false, this method can also be used for non-Boolean fluents.

²RDDL does not allow the definition of a proper initial belief state, only a single initial state sI ∈ S can be specified. While this can make the definition of POMDPs more complicated, it does not incur a loss of expressivity.


Similarly, default values for action fluents induce a default action interpretation adef, which can be understood as a "noop" action that exists for all POMDPs specified in RDDL. This default value mechanism is also used for controlling action parallelism by restricting the set of legal actions to those that can be represented by a limited number of updates to adef. Unless stated otherwise, this thesis will assume that at most one update is allowed to adef. The executed action is thus either noop, i.e., adef, or adef[u], where u specifies a non-default value for one action fluent assignment.

2.3.6. The Home Theater Setup POMDP

As an example for a POMDP modeled using RDDL, consider a formalization of the Home Theater Setup problem introduced in Chapter 1. The complete RDDL code can be found in Appendix B. The set of sorts is given by

S = {Device, SignalType, SignalSource, Port, count, tightness}.

Device represents signal-producing and signal-displaying devices such as satellite receivers or TVs, but also cables or adapters. SignalType and SignalSource together determine the kind of signals to be transported and represent their type, e.g., audio or video, and source, e.g., the satellite receiver. The Port sort is used for distinguishing different kinds of ports such as HDMI, cinch video, or cinch audio. Lastly, count and tightness are enum sorts that have {@zero, @one, @two, @three} and {@none, @loose, @tight} as objects, respectively.

Dynamic state properties are described using four fluents. The first two, freeFemalePorts(Device,Port) and freeMalePorts(Device,Port), are both count-valued. They specify the number of currently free ports on a device. Cables, for example, usually have two male ports and no female ports, which means that the state contains freeFemalePorts(HDMI_Cable,HDMI) = @two and freeMalePorts(HDMI_Cable,HDMI) = @zero for an unconnected HDMI cable HDMI_Cable. The number of free ports changes as soon as a port of a device is connected to an opposite sex port of another device, which is then represented with the fluent connected(Device,Device,Port). As described, the scenario makes a distinction between loose and tight connections, therefore the fluent is tightness-valued. The Boolean-valued fluent hasSignal(Device,SignalType,SignalSource) denotes whether a given device has a signal of a certain type and source.

Further, there are some unchangeable state properties. These can be modeled in RDDL using so-called non-fluents. These are left out of the RDDL description for brevity, but can be considered state fluents whose values can be observed and whose CPFs deterministically mandate that their values remain unchanged. Non-fluents are useful for modeling: in the Home Theater Setup POMDP, there are three Boolean-valued non-fluents, where

• PORT_TRANSPORTS_SIGNAL_TYPE(Port,SignalType) describes whether a port can propagate a given signal type,

• DEVICE_IS_CHECKABLE(Device,SignalType) specifies whether a device can be checked for a given signal type, and

• DEVICE_NEEDS_SIGNAL(Device,SignalType,SignalSource) determines which devices need which signals.

The model inherits the action fluent connect(Device,Device,Port) from the deterministic version of the model for connecting two devices via a specified port. In addition, there is a Boolean action fluent tighten(Device,Device,Port) for tightening such connections and a Boolean action fluent checkSignals(Device) to check a device for signals. The maximum number of updates to the default action adef is one, i.e., as stated in Section 2.3.5, an action is either noop or an assignment of one of the action fluents described above.

Obviously, the intended result of executing connect(?d1,?d2,?p) for some port ?p is that ?d1 and ?d2 are either connected tightly and any signals are propagated, or that ?d1 and ?d2 are connected loosely and signals are not propagated. This means that the effects of executing connect(?d1,?d2,?p) on hasSignal and connected are correlated. Such correlated effects require using intermediate fluents. In this case, the tightness-valued intermediate fluent connectSucceeded(Device,Device,Port) is introduced.

With the necessary fluents for state transitions and their intended meanings defined, they can now be used for constructing suitable CPFs. First, the CPF for connectSucceeded(?d1,?d2,?p) needs to be defined:

    connectSucceeded(?d1,?d2,?p) =
        if (connect(?d1,?d2,?p) ^
            (freeFemalePorts(?d1,?p) ~= @zero ^ freeMalePorts(?d2,?p) ~= @zero |
             freeMalePorts(?d1,?p) ~= @zero ^ freeFemalePorts(?d2,?p) ~= @zero)) then
            Discrete(tightness, @none:0, @loose:0.2, @tight:0.8)
        else if (tighten(?d1,?d2,?p) ^ connected(?d1,?d2,?p) == @loose) then
            KronDelta(@tight)
        else
            KronDelta(@none);

Here, ^ denotes conjunction, | denotes disjunction, and ~= denotes ≠. Discrete(tightness, @none:0, @loose:0.2, @tight:0.8) denotes a probability distribution over the type tightness where the value @none has probability 0, @loose has probability 0.2, and @tight has probability 0.8 [Sanner, 2010]. In words, this means that executing connect is successful when suitable ports are free on both devices. However, the connections will be @loose with probability 0.2. Such a loose connection can be tightened by executing tighten(?d1,?d2,?p).

The CPFs for the four state fluents can then be deterministically defined based on connectSucceeded(Device,Device,Port), where

• the value of connected(?d1,?d2,?p) persists, unless connectSucceeded(?d1,?d2,?p) or connectSucceeded(?d2,?d1,?p) indicates a new connection (loose or tight) was established, or a loose connection was tightened,

• the values of freeFemalePorts(?d1,?p) and freeMalePorts(?d2,?p) are decreased when connectSucceeded(?d1,?d2,?p) or connectSucceeded(?d2,?d1,?p) indicates a new connection (loose or tight) was established, and

• hasSignal(?d,?t,?s) becomes true if there is a tight connection to a device ?d2 via a port ?p that transports signals of type ?t, and ?d2 has a signal originating from ?s.


Note that a consequence of the definition of the CPF for hasSignal(?d,?t,?s) is that it can take several time steps to propagate a signal over a chain of devices with tight connections. Instantaneous propagation would require computing the transitive closure over tight connections in each time step, something which is not supported in RDDL.

The Home Theater Setup domain defines a single Boolean observation fluent hasSignalObs(Device,SignalType,SignalSource). Its CPF is deterministic and specifies hasSignalObs(?d,?t,?s) to be true when checkSignals(?d) is executed, ?d can be checked for signals of type ?t, and ?d has such a signal from source ?s. The reward function awards the agent, in every time step, with +1 for each occurrence where a device has a signal it needs and with −1 every time a useless action is executed, such as using checkSignals(?d) on uncheckable devices. See Richter and Biundo [2017] for other possible variants for defining a reward function in this domain.

Consider a simple RDDL instance for the domain defined above, where there are three devices, a satellite receiver Sat, a TV, and an HDMI_Cable, and where the audio and video signals of the Sat should be transported to the TV. Given that the default value of all fluents is false, the initial state of this instance is given as follows:

    PORT_TRANSPORTS_SIGNAL_TYPE(HDMI,audio);
    PORT_TRANSPORTS_SIGNAL_TYPE(HDMI,video);
    DEVICE_IS_CHECKABLE(TV,audio);
    DEVICE_IS_CHECKABLE(TV,video);
    DEVICE_NEEDS_SIGNAL(TV,audio,Sat);
    DEVICE_NEEDS_SIGNAL(TV,video,Sat);

    freeFemalePorts(Sat,HDMI) = @one;
    freeFemalePorts(TV,HDMI) = @one;
    freeMalePorts(HDMI_Cable,HDMI) = @two;
    hasSignal(Sat,audio,Sat);
    hasSignal(Sat,video,Sat);

It is easy to see how the agent should act in this simple instance: first, connect the TV to the HDMI_Cable and the HDMI_Cable to the Sat. Second, check the TV for signals. Third, if the TV does not have any signals, tighten the connections. Once this is done, all devices are connected as desired and no further action is required, i.e., the noop action is executed until the horizon is reached. Figure 2.4 shows how this is represented as a finite state controller.

[Figure: a chain of FSC nodes labeled adef[connect(Sat,HDMI_Cable,HDMI)] (start), adef[connect(HDMI_Cable,TV,HDMI)], adef[checkSignals(TV)], adef[tighten(HDMI_Cable,TV,HDMI)], adef[tighten(Sat,HDMI_Cable,HDMI)], and adef.]

Figure 2.4.: A finite state controller for a simple instance of the Home Theater Setup POMDP. Each node is labeled with the default action adef, i.e., noop, or adef[u], where u is an update. The dot arrays that label transition edges represent the sets of observations for which a transition is taken. The instance contains three devices, two signal types, and one signal source. Hence, hasSignalObs(Device,SignalType,SignalSource) induces a total of 2^{3·2·1} = 64 observations. After checking signals, the task is completed if either hasSignalObs(TV,audio,Sat) or hasSignalObs(TV,video,Sat) are true, which is the case for 48 observations. Obviously, only a handful of these 64 observations can actually occur, in particular because only TV can be checked for signals and only Sat can produce signals. It would, in principle, suffice to only mention observations that occur with a non-zero probability. See Chapter 3 for a discussion of this topic.

2.4. Monte-Carlo Tree Search

Monte-Carlo Tree Search is a framework for round-based anytime algorithms applicable to a wide range of probabilistic sequential decision problems [Browne et al., 2012]. MCTS appears particularly appropriate for the purposes of this thesis, for two reasons. First, it is applicable to both non-hierarchical planning for POMDPs, as shown below, and planning in the partially observable HTN planning approach introduced in this thesis. MCTS thus forms a good basis for comparing the performance of both, as shown in the evaluation in Chapter 5. Second, at the time of writing of this thesis, non-hierarchical planning for POMDPs as embodied in the POMCP algorithm (Partially Observable Monte-Carlo Planning) [Silver and Veness, 2010] is one of the most competitive approaches to POMDP planning [Sanner, 2011; Sanner, 2014]. Performance improvements over POMCP thus constitute improvements of the state of the art in POMDP planning.

The following section will first present a formalization of MCTS for (fully observable) generative indefinite horizon Markov Decision Processes (MDPs), before describing MCTS on this type of MDPs and finally how POMDPs can be cast as MDPs.

2.4.1. Generative Indefinite Horizon Markov Decision Processes

A generative indefinite horizon MDP (called MDP for short in this thesis) is an MDP defined using a simulator for state transitions paired with the guarantee that a terminal state is always reached after finitely many state transition steps:

Definition 2.11. A generative indefinite horizon MDP is a tuple (S, A, G, sI, ST). The finite set of states is given by S, and ST ⊆ S is the set of terminal states. Actions are given separately for each state, i.e., as a function that maps each state s ∈ S \ ST to a finite set of available actions A(s). The simulator G is a random variable that, given a non-terminal state s ∈ S \ ST and applicable action a ∈ A(s), can provide a sample (s′, r) ∼ G(s, a) of a successor state s′ ∈ S and incurred reward r ∈ ℝ. The initial state is given by sI ∈ S. To make the horizon finite, G is such that repeatedly applying G starting at sI always leads to a terminal state after finitely many steps. I.e., letting

$$
h(s) =
\begin{cases}
0 & s \in S_T\\
1 + \max_{a \in A(s)} \max\{h(s') \mid P(G(s,a) = s') > 0\} & s \in S \setminus S_T,
\end{cases}
$$

it holds that h(sI) < ∞.

For convenience, $A = \bigcup_{s \in S \setminus S_T} A(s)$ additionally denotes the union of all actions applicable in some state. The given MDP definition is convenient in cases where the number of steps required to reach a terminal state varies and constitutes a special case of stochastic shortest path problems [Bertsekas, 2000]. Still, it is equivalent to other finite horizon MDP formalizations: an MDP with a fixed time horizon can be converted to the given definition by making the number of remaining time steps part of the state and declaring states with zero remaining time steps as terminal. In the other direction, define the horizon to be h(sI), i.e., the maximum number of time steps required to reach a terminal state, and define transitions and rewards such that terminal states cannot be left and have a reward of zero.

An MDP policy π : S → A maps each non-terminal state s ∈ S \ ST to an applicable action π(s) ∈ A(s). The value of an MDP policy is given by:

$$
V_\pi(s) =
\begin{cases}
0 & \text{if } s \text{ is terminal}\\
\mathbb{E}_{(s',r) \sim G(s,\pi(s))}\left[r + V_\pi(s')\right] & \text{else}
\end{cases}
$$

The goal is to find a policy π∗ of maximal value for the initial state, i.e., $V_{\pi^*}(s_I) = \max_\pi V_\pi(s_I)$.
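The following sketch renders Definition 2.11 as a hypothetical programming interface together with a uniformly random roll-out; the class name GenerativeMDP and its method names are illustrative assumptions. Termination of the roll-out loop is guaranteed by h(sI) < ∞.

    import random

    class GenerativeMDP:
        """Hypothetical interface for a generative indefinite horizon MDP."""
        def initial_state(self): ...
        def is_terminal(self, s): ...
        def actions(self, s): ...          # A(s) for non-terminal s
        def simulate(self, s, a): ...      # returns a sample (s', r) ~ G(s, a)

    def random_rollout(mdp, s):
        """Accumulated reward of acting uniformly at random until termination."""
        total = 0.0
        while not mdp.is_terminal(s):
            a = random.choice(list(mdp.actions(s)))
            s, r = mdp.simulate(s, a)
            total += r
        return total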


2.4.2. The MCTS Scheme

Next, a formalization of the MCTS scheme closely related to that of Keller and Helmert [2013] is presented. MCTS works in a round-based manner, building an explicit representation of the search space in memory. To be consistent with existing literature, this thesis will refer to this representation as the search tree, though it can, in general, be a directed acyclic graph.

The search tree consists of alternating layers of decision nodes and chance nodes. A decision node nd is a tuple (s, S, C, V), where s ∈ S is a state, S is the set of successor chance nodes of nd, C ∈ ℕ0 represents the number of visits to the node, and V ∈ ℝ is the value estimate for s. A chance node nc is a tuple (s, a, S, C, R, Q), where s ∈ S is a state, a ∈ A(s) is an action, S is the set of successor decision nodes, C again represents the number of visits to the node, and R ∈ ℝ and Q ∈ ℝ are estimates of the immediate reward and the accumulated reward of executing a in s, respectively. To simplify notation, this thesis uses function notation for accessing the elements of nodes, e.g., S(nd) denotes the set of successor nodes of nd. Further, the shorthand notations new_d(s) = (s, ∅, 0, 0) and new_c(s, a) = (s, a, ∅, 0, 0, 0) will be employed for initializing new decision and chance nodes, respectively.

Letting Nd and Nc denote the set of possible decision and chance nodes, respectively, MCTS consists of the functions mcts_d : Nd → Nd, mcts_c : Nc → Nc, backup : 2^{Nc} → ℝ, and selectAction : Nd → A. The tree is represented by its root node, which recursively contains all other nodes through the successor sets S, and initially consists only of the root decision node T := new_d(sI). In each iteration of MCTS, the tree is updated to T := mcts_d(T), where mcts_d(nd) = n′d is defined by

$$
\begin{aligned}
s(n'_d) &= s(n_d),\\
\mathcal{S}(n'_d) &=
\begin{cases}
\emptyset & \text{if } s(n_d) \in S_T,\\
(\mathcal{S}(n_d) \setminus \{n^*_c\}) \cup \{mcts_c(n^*_c)\} & \text{if } \exists\, n^*_c \in \mathcal{S}(n_d) \text{ s.t. } a(n^*_c) = a^*,\\
\mathcal{S}(n_d) \cup \{mcts_c(new_c(s(n_d), a^*))\} & \text{else,}
\end{cases}\\
C(n'_d) &= C(n_d) + 1,\\
V(n'_d) &=
\begin{cases}
0 & \text{if } s(n_d) \in S_T,\\
backup(\mathcal{S}(n'_d)) & \text{else,}
\end{cases}
\end{aligned}
$$

where a∗ = selectAction(nd).

Conversely, mcts_c(nc) = n′c is defined by

$$
\begin{aligned}
s(n'_c) &= s(n_c),\\
a(n'_c) &= a(n_c),\\
\mathcal{S}(n'_c) &=
\begin{cases}
(\mathcal{S}(n_c) \setminus \{n^*_d\}) \cup \{mcts_d(n^*_d)\} & \text{if } \exists\, n^*_d \in \mathcal{S}(n_c) \text{ s.t. } s(n^*_d) = s^*,\\
\mathcal{S}(n_c) \cup \{mcts_d(new_d(s^*))\} & \text{else,}
\end{cases}\\
C(n'_c) &= C(n_c) + 1,\\
R(n'_c) &= \frac{r^* + R(n_c)\,C(n_c)}{C(n'_c)},\\
Q(n'_c) &= R(n'_c) + \sum_{n'_d \in \mathcal{S}(n'_c)} \frac{C(n'_d)}{C(n'_c)}\, V(n'_d),
\end{aligned}
$$

where (s∗, r∗) ∼ G(s(nc), a(nc)).

In words, this means that one iteration (or roll-out) of MCTS consists of traversing the search tree starting at the root, applying actions until a terminal state is reached, and then propagating the information about gathered rewards back up the tree. Note that since the number of executed actions needed to reach a terminal state is finite, the recursion always terminates. Figure 2.5 illustrates how the search tree is built.

[Figure: three search tree snapshots, (a) after iteration 1, (b) after iteration 2, (c) after iteration 3, rooted at s1 with successor states s2, s3, s4 and annotated V and Q values.]

Figure 2.5.: An illustration of MCTS search tree construction in an MDP with two actions. Rectangular nodes represent decision nodes and gray round nodes represent chance nodes. After iteration 1, the tree is the result of a single roll-out. In the second and third iteration, action a2 is chosen in s1, each time yielding a different result. The way actions are chosen and values are backed up corresponds to the MaxUCT algorithm as defined by Equations 2.13 and 2.14. Visit counts C and immediate reward estimates R are omitted for readability.

The computation may be stopped after an arbitrary number of iterations. The resulting tree represents a policy for the MDP by looking, for each state s, at the successor nodes of its corresponding decision node (s, C, S, V), and associating the state with the action whose chance node has a maximal estimated value, i.e.,

$$
\pi(s) =
\begin{cases}
\operatorname{argmax}_{a \in A(s)} \{Q(n_c) \mid \exists\, n_c \in \mathcal{S} \text{ s.t. } a(n_c) = a\} & \mathcal{S} \neq \emptyset\\
\text{an arbitrary action } a \in A(s) & \text{else}
\end{cases}
$$

The case where S is empty happens when a state is reached during execution that was never visited during search. In practice, actions are often chosen uniformly at random when this happens.

A commonly employed modification of the MCTS scheme is motivated by practical considerations: in its basic form introduced above, each iteration of MCTS adds an entire execution trace to the tree, beginning at the initial state and ending at a terminal state. Depending on the MDP at hand, it might be unlikely to visit the same sequence of states again, which makes the memorized information less useful. It might be more sensible to save memory and only keep the information about the accumulated reward value of the trace. In this situation, the scheme can be modified such that the tree only grows by one decision node and one chance node per iteration. In fact, MCTS is often described in this way directly [Browne et al., 2012]. In the above scheme, this can be integrated by modifying mcts_d(nd) = n′d for S(n′d) to (difference highlighted in bold)

$$
\mathcal{S}(n'_d) =
\begin{cases}
\emptyset & \text{if } s(n_d) \in S_T \textbf{ or } \boldsymbol{C(n_d) = 0},\\
(\mathcal{S}(n_d) \setminus \{n^*_c\}) \cup \{mcts_c(n^*_c)\} & \text{if } \exists\, n^*_c \in \mathcal{S}(n_d) \text{ s.t. } a(n^*_c) = a^*,\\
\mathcal{S}(n_d) \cup \{mcts_c(new_c(s(n_d), a^*))\} & \text{else}
\end{cases}
$$

2.4.3. Concrete MCTS Algorithms

To obtain a concrete algorithm from the MCTS scheme, one needs to choose implementations for backup and selectAction. Algorithmic properties of MCTS are heavily influenced by this choice. A popular choice is

$$backup(\mathcal{S}) = \frac{\sum_{n_c \in \mathcal{S}} C(n_c)\,Q(n_c)}{\sum_{n_c \in \mathcal{S}} C(n_c)}, \quad \text{and} \tag{2.12}$$

$$selectAction(n_d) = \operatorname{argmax}_{a^* \in A(s(n_d))}
\begin{cases}
Q(n_c) + B\sqrt{\dfrac{2 \log C(n_d)}{C(n_c)}} & \text{if } \exists\, n_c \in \mathcal{S}(n_d) \text{ s.t. } a(n_c) = a^*,\\
\infty & \text{else,}
\end{cases}
\tag{2.13}$$

which yields the UCT algorithm (Upper Confidence bound applied to Trees) and hence makes MCTS anytime optimal, i.e., converge to an optimal policy in the limit [Kocsis and Szepesvári, 2006]. The bias value B can be increased to make UCT more explorative or lowered to make UCT greedier. Other anytime optimal algorithms include the MaxUCT algorithm [Keller and Helmert, 2013], which combines the UCT action selection of Equation 2.13 with Max-Monte-Carlo backups

$$backup(\mathcal{S}) = \max_{n_c \in \mathcal{S}} Q(n_c), \tag{2.14}$$

and MaxBRUE [Feldman and Domshlak, 2014b], which combines the Max-Monte-Carlo backups of Equation 2.14 with uniformly random action selection:

$$selectAction(n_d) = a \sim P_U(A(s(n_d))) \tag{2.15}$$

Max-Monte-Carlo backups have the convenient property that they yield an anytime optimal algorithm when combined with any action selection function that chooses every action infinitely often [Keller and Helmert, 2013].
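To illustrate how these choices combine, the following is a compact, simplified UCT sketch over the hypothetical GenerativeMDP interface from the earlier sketch. It uses the UCB rule of Equation 2.13 and, instead of the scheme's explicit R/Q decomposition, the common incremental running mean of r + V(s′), which approximates the Monte-Carlo backup of Equation 2.12; all class and function names are illustrative assumptions, not the thesis's implementation.

    import math, random

    class ChanceNode:
        def __init__(self):
            self.C, self.Q, self.children = 0, 0.0, {}   # children: s' -> DecisionNode

    class DecisionNode:
        def __init__(self):
            self.C, self.V, self.children = 0, 0.0, {}   # children: a -> ChanceNode

    def select_action(mdp, s, nd, B):
        # UCB rule of Equation 2.13; untried actions get an infinite bonus.
        best, best_val = None, -math.inf
        for a in mdp.actions(s):
            nc = nd.children.get(a)
            if nc is None or nc.C == 0:
                return a
            val = nc.Q + B * math.sqrt(2 * math.log(nd.C) / nc.C)
            if val > best_val:
                best, best_val = a, val
        return best

    def mcts_d(mdp, s, nd, B):
        nd.C += 1
        if mdp.is_terminal(s):
            nd.V = 0.0
            return nd.V
        a = select_action(mdp, s, nd, B)
        nc = nd.children.setdefault(a, ChanceNode())
        mcts_c(mdp, s, a, nc, B)
        total = sum(c.C for c in nd.children.values())
        nd.V = sum(c.C * c.Q for c in nd.children.values()) / total   # Eq. 2.12
        return nd.V

    def mcts_c(mdp, s, a, nc, B):
        s2, r = mdp.simulate(s, a)         # (s', r) ~ G(s, a)
        nc.C += 1
        child = nc.children.setdefault(s2, DecisionNode())
        v = mcts_d(mdp, s2, child, B)
        nc.Q += (r + v - nc.Q) / nc.C      # incremental mean of r + V(s')

After a number of iterations starting from the root decision node, the action whose chance node has maximal Q would be executed, mirroring the policy extraction described above.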

[Figure: a root history h with action edges a1, a2 to chance nodes and observation edges o1, o2 to successor histories h ∘ (a1, o1) and h ∘ (a2, o2).]

Figure 2.6.: An illustration of the shape of the MCTS search tree for POMDP planning. Rectangular nodes represent decision nodes and correspond to histories. Grey round nodes represent chance nodes. Edges from decision nodes to chance nodes correspond to actions, and edges from chance nodes to decision nodes correspond to observations.

Another UCT variant, named SoftMaxUCT [Richter and Biundo, 2017], is a combination of UCT and MaxUCT defined as follows: first, the action selection function of UCT defined in Equation 2.13 is used for action selection. Second, the backup function is chosen to resemble the Monte-Carlo backups of UCT for rarely visited nodes and the Max-Monte-Carlo backups of MaxUCT for nodes visited more often. This is motivated by findings of Feldman and Domshlak [2014a] that Monte-Carlo backups seem to have an advantage over Max-Monte-Carlo backups in domains with large state branching factors, even though Max-Monte-Carlo backups seem to do the "right thing": for rarely visited nodes, the reduced variance achieved by averaging over all successor node values (as in Monte-Carlo backups) seems to yield more reliable estimates than using the value of the currently best successor node only (as in Max-Monte-Carlo backups). For nodes visited more often, however, suboptimal successor nodes should likely be ignored. To achieve this, the author uses a custom combination of both based on the weighted power mean, which is called Soft-Max-Monte-Carlo backups: for a decision node nd = (s, C, S, V), let p = C/|S| and define

$$backup(\mathcal{S}) = \left(\frac{\sum_{n_c \in \mathcal{S}} C(n_c)\,(V(n_c))^p}{C}\right)^{1/p} \tag{2.16}$$

Until all actions applicable to s have been tried once, a new action will be tried on each visit to nd and p will be equal to 1. Hence, Equation 2.16 will simply average, i.e., be like Monte-Carlo backups. As C approaches infinity, so will p, and Equation 2.16 will converge to maximization, i.e., become like Max-Monte-Carlo backups. Since Soft-Max-Monte-Carlo backups converge to Max-Monte-Carlo backups, they also yield an anytime optimal algorithm when combined with an action selection function that chooses every action infinitely often.
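As a small illustration, Equation 2.16 can be written as a single function; here Q plays the role of the chance nodes' value estimates (written V(n_c) in the equation), matching the UCT sketch above, and the estimates are assumed to be positive, as required by the weighted power mean. All names are illustrative assumptions.

    def soft_max_backup(C, successors):
        # Soft-Max-Monte-Carlo backup (Equation 2.16): behaves like a weighted
        # average while p = C/|S| is small and approaches maximization as the
        # visit count C of the decision node grows.
        p = C / len(successors)
        return (sum(nc.C * nc.Q ** p for nc in successors) / C) ** (1.0 / p)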

2.4.4. MCTS for POMDPs

MCTS can be applied to POMDPs by constructing, for a POMDP P = (S, A, O, T, Z, R, bI, H), an equivalent MDP given by (H_H, A, G, sI, H_H \ H_{H−1}) [Silver and Veness, 2010, Lemma 1]. The state space of the MDP is therefore the space of observable histories of the POMDP, and receiving observation o after executing an action a in history h leads to the history h ∘ (a, o), where ∘ denotes concatenation. To honor the fact that the POMDP has a fixed time horizon H, histories of length H are terminal states. All actions are always applicable in POMDPs, and therefore A(h) = A for all h ∈ H_H. The initial state sI is the empty history. Figure 2.6 gives an illustration of the search tree that results from applying MCTS to POMDPs. This procedure directly computes a POMDP policy that maps histories to actions, as defined in Section 2.3.

The simulator G needs to be able to sample successor histories from a given history and action, (h ∘ (a, o), r) ∼ G(h, a). The basis for this is a POMDP simulator G_P which simulates applying a from a given world state s to yield a successor world state, an observation, and an immediate reward, i.e., (s′, o, r) ∼ G_P(s, a). Given a POMDP as defined above, this POMDP simulator is implemented by first sampling a successor world state s′ ∼ T(s, a), letting r = R(s, a, s′), and sampling an observation o ∼ Z(a, s′).

Next, the history-based simulator G needs to be constructed from the state-based POMDP simulator G_P. Sampling from the empty history ε is simple. To compute ((a, o), r) ∼ G(ε, a) for a given action a, one only needs to sample a state from the initial belief state s ∼ bI and apply the POMDP simulator (s′, o, r) ∼ G_P(s, a). The case is not so obvious for the successor history (a, o), since computing the new probability distribution over world states P(S | a, o, bI) requires an expensive belief update that requires quadratic effort in the number of world states [Kaelbling et al., 1998], i.e., exponential effort for factored POMDPs. When (a, o) is constructed as above however, the sampled successor world state s′ constitutes an unbiased sample from P(S | a, o, bI), i.e., s′ ∼ P(S | a, o, bI). Thus, for a next action a′, a new successor history can be sampled by sampling from G_P(s′, a′) and P(S | a, o, bI) does not need to be computed explicitly. This means that complete roll-outs can be generated by always propagating the successor world state along each time a successor history is sampled, which suffices for the purposes of MCTS [Silver and Veness, 2010, Lemma 2].

MCTS in the space of observable histories fully solves the curse of dimensionality mentioned in Section 1.1, since its performance does not rely on the number of world states. It also addresses the curse of history, because the space of observable histories only needs to be sparsely sampled to determine a high-quality action to execute first. On the other hand, policy quality quickly declines when the computed policy is executed over many time steps. This issue becomes even more problematic when considering factored POMDPs as described in Section 2.3.5, since MCTS introduces chance nodes individually for each action, and there are exponentially many actions.

It should be noted that a huge variety of other (non-hierarchical) POMDP solution algorithms exist beside MCTS in the space of observable histories. This includes exact optimal algorithms such as Value Iteration (VI) [Smallwood and Sondik, 1973], which is a dynamic programming technique that iteratively improves a POMDP policy represented implicitly as a value function consisting of linear pieces called α-vectors, and Policy Iteration [Hansen, 1998], which applies dynamic programming directly on a policy, represented as an FSC. Approximate algorithms include offline point-based approaches such as PBVI [Pineau et al., 2003b], Perseus [Spaan and Vlassis, 2005] and Symbolic Perseus [Poupart, 2005], SARSOP [Kurniawati et al., 2008] or PLEASE [Zhang et al., 2015], as well as online approaches such as HSVI [Smith and Simmons, 2004], Symbolic HSVI [Sim et al., 2008], or DESPOT [Somani et al., 2013]. The POMCP algorithm [Silver and Veness, 2010], which is based on MCTS in the space of histories, also falls into this category.

2.5. Hierarchical Task Networks

HTN planning is a planning approach that is aimed at real-world planning problems [Erol et al., 1994; Biundo and Schattenberg, 2001; Geier and Bercher, 2011; Alford et al., 2015] and has been successfully applied in many scenarios [Nau et al., 2005; Lin et al., 2008; Biundo et al., 2011]. The key principle in HTN planning is refinement of abstract plan specifications: the domain modeler not only defines the dynamics of the planning domain but also abstract actions that represent higher-level activities, and ways in which these activities can be performed, the methods. This approach enables encoding useful standard solutions to subproblems that need to be solved during plan construction. This way, knowledge of domain experts can be leveraged to relieve the planner of some of the burden of plan construction. Plan construction itself then works similarly to rule application in formal grammars.

In contrast to the POMDP formalism introduced above, HTN planning assumes deterministic domain dynamics and full observability. This means that no uncertainty arises during plan execution and that action sequences, rather than policy trees, suffice for representing plans. Still, one often considers more complex plan representations in HTN planning, which feature partially ordered action sequences [Currie and Tate, 1991] and explicit representation of causal relationships between primitive [Erol et al., 1994] and even abstract plan steps [Biundo and Schattenberg, 2001]. This offers various advantages, such as, e.g., flexibility during plan execution [Höller et al., 2014] or improved plan explanation capabilities [Seegebarth et al., 2012]. For the purposes of this thesis, however, an HTN approach that considers totally ordered sequences only is used, similarly as in the HTN planning system SHOP [Nau et al., 1999] and as more recently considered by Alford et al. [2015], as it both covers the essential features of HTN planning and can be extended to the partially observable HTN planning approach introduced in this thesis. The following section thus describes a totally ordered HTN planning formalism based on first-order logic that can be considered a reformulation of the formalism described by Alford et al. [2015] in terms of the definitions in Section 2.3.5.

Just as in factored POMDPs, the set of world states S = I(state, D) is defined as the set of interpretations of state fluents Fstate for a given set of domain objects D. Since Alford et al. [2015] require a function-free predicate logic, only Boolean fluents and constants are allowed, however. Also, the consequences of actions are specified in an action-centric instead of a state fluent-centric manner. For this, the directly executable primitive actions are defined in terms of parametrized action schemata, each of which is associated with a parametrized effect. Additionally, each action has a precondition, which is a formula over state fluents that determines whether the action can be executed in a state:

Definition 2.17. A primitive action schema as = a(v) consists of an action symbol a and a sequence of variables v = v1, . . . , vn, its parameters. Each schema is associated with a formula prec(as) over Fstate, its precondition, and a sequence of state updates eff(as), its effect. All free variables in both prec(as) and eff(as) must be contained in v.

To allow for executing a given action schema multiple times with different parameters, an action is defined to be a copy of an action schema where the parameters are substituted with new variables. To execute an action a, a valuation β that assigns a value to all its parameters is required. Then, executing a in a state s is possible if and only if (s, β) |= prec(a) and results in a new state s′ = s[eff(a), β]. The set of all primitive action schemata is denoted with A.

In contrast, abstract action schemata c(v1, . . . , vn) are not associated with a precondition or an effect and are not directly executable. The set of all abstract action schemata is denoted with C. The behavior of abstract actions is specified using methods that associate an abstract action schema with a possible subplan that it can be refined into. A plan in HTN planning is a tuple (ts, β), where ts is a sequence of actions and β is a valuation. Let P_X denote the set of all plans whose sequence of actions is non-empty and finite and whose schemata are taken from the set of action schemata X. Formally, methods are tuples (c, P) ∈ C × P_{A∪C} and are applied as follows:

Definition 2.18. Let $P_1 = (ts_1, \beta_1) \in P_{A \cup C}$ be a plan with $ts_1 = (a_0, \ldots, a_{k-1}, c(v'_1, \ldots, v'_n), a_{k+1}, \ldots, a_{n-1})$ and let $c(v_1, \ldots, v_n) \in C$ be an abstract action schema. Let further $m = (c(v_1, \ldots, v_n), (ts_m, \beta_m))$ be a method for $c(v_1, \ldots, v_n)$. Finally, let $ts'_m = (a^m_0, \ldots, a^m_{l-1})$ and $\beta'_m$ be copies of $ts_m$ and $\beta_m$ where the variables $v_1, \ldots, v_n$ are substituted with $v'_1, \ldots, v'_n$ and all other variables are substituted with new variables. Applying $m$ to $c(v'_1, \ldots, v'_n)$ results in a new plan $P_2 = (ts_2, \beta_2)$, where $ts_2 = (a_0, \ldots, a_{k-1}, a^m_0, \ldots, a^m_{l-1}, a_{k+1}, \ldots, a_{n-1})$ and $\beta_2 = \beta_1 \cup \beta'_m$.

In other words, the method subplan is instantiated to match the variables occurring in P1 and replaces the abstract action. The act of applying m to c(v′1, . . . , v′n) in this way is also frequently called decomposition of c(v′1, . . . , v′n). A complete planning problem is defined as follows:

Definition 2.19. An HTN planning problem is a tuple (S, Fstate, dom, cdm, D, val, A, C, M, PI, sI), where S, Fstate, dom, cdm, D, and val define the world states S as in Definition 2.10. The sets A and C are the finite disjoint sets of primitive and abstract action schemata, respectively. Methods are given as a set M. The initial state and initial plan are given by sI ∈ S and PI ∈ P_{A∪C}, respectively.

The goal in HTN planning is to apply methods to the initial plan until an executable primitive plan is found. More formally, a primitive plan ((a0, . . . , an−1), β) ∈ P_A is called executable in a state s ∈ S, if and only if β assigns a value to every variable occurring in (a0, . . . , an−1) and there exists a state sequence (s0, . . . , sn) such that s0 = s, and (si, β) |= prec(ai) and si+1 = si[eff(ai), β] for all 0 ≤ i ≤ n − 1. Further, when P2 = (ts2, β2) can be obtained by applying a method m to a plan P1 = (ts1, β1), this can be denoted as P1 →m P2. Generalizing to arbitrarily many, including zero, method applications and adding so-called variable codesignations, P1 →∗M P2 is true if

1. P1 = P2, or

2. there exists a method m ∈ M such that P1 →m P2 (while β1 = β2), or

3. a variable is bound, i.e., β2 = β1[v ↦ d] for a d ∈ D (while ts1 = ts2), or

4. an intermediate plan P3 can be obtained through cases 2. or 3. and P3 →∗M P2.

With this, HTN solutions can be defined as follows:

Definition 2.20. A plan PS is called a solution to an HTN problem PH if and only if it is executable in sI and it can be created from PI, i.e., PI →∗M PS. The set of all solutions of PH is called its language L(PH).

Note that, apart from the deterministic domain dynamics and their action-centric representation, HTN planning differs from POMDP planning because the length of a solution plan is arbitrary. Even more importantly, no reward function is given; the goal is simply to find an executable plan. In other words, HTN planning is satisficing in nature, whereas POMDP planning is optimizing in nature.

Solutions can be generated for HTN problems by searching in the space of partial plans, i.e., starting with PI and considering all plans that can be generated using method decompositions and variable codesignations. Applicable search algorithms include breadth-first search or heuristic algorithms such as A∗.
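For illustration, a breadth-first sketch of this plan-space search for a propositional, totally ordered fragment: variables and valuations are omitted, methods is a hypothetical mapping from abstract actions to lists of subplans, and always decomposing the leftmost abstract action is a simplification that does not change the set of reachable primitive plans in the totally ordered setting.

    from collections import deque

    def htn_search(initial_plan, methods, primitive, executable):
        """methods: abstract action -> list of subplans (tuples of actions);
        primitive: predicate on actions; executable: predicate on plans."""
        queue = deque([tuple(initial_plan)])
        while queue:
            plan = queue.popleft()
            abstract = [i for i, a in enumerate(plan) if not primitive(a)]
            if not abstract:
                if executable(plan):
                    return plan         # a solution in the sense of Def. 2.20
                continue
            i = abstract[0]             # decompose the leftmost abstract action
            for subplan in methods[plan[i]]:
                queue.append(plan[:i] + tuple(subplan) + plan[i + 1:])
        return None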


Chapter 3.

Finite State Controllers for Partially Observable HTN Planning

As described in Section 2.3.4, FSCs such as the one shown in Figure 2.4 can describe POMDP policies more compactly than policy trees. When considering policies for factored POMDPs though, a drawback of FSCs becomes apparent: to fulfill the well-definedness property of Definition 2.9, the outgoing edges of an FSC node together need to enumerate all observations, which can become rather verbose. Figure 2.4 defines an FSC for a Home Theater Setup instance that contains three devices, two signal types, and one signal source. Hence, hasSignalObs(Device,SignalType,SignalSource) induces 2^{3·2·1} = 64 observations, even in this small example. As noted in the caption of Figure 2.4, it would in principle suffice to only enumerate all observations of non-zero probability. This is, however, an unsatisfactory solution for two reasons. First, while it is obvious for the human observer that only a handful of observations have non-zero probability in this example, this is not the case in general. In a factored POMDP, determining the set of non-zero probability observations is a difficult task in itself, as it would involve determining the probability of every observation for every state, and the number of both is exponential in the number of observation/state fluents and domain objects. Indeed, a more realistic-sized instance considered in the evaluation of this thesis contains 16 devices, 2 signal types, and 2 signal sources, leading to 2^{16·2·2} ≈ 1.8 · 10^{19} observations. Second, it can be expected that the number of non-zero probability observations is also exponential in the number of domain objects, which means that FSCs are still very verbose for large instances.

Clearly, a domain expert who wants to formalize domain knowledge needs a more sophisticated means for compactly representing node transitions, e.g., one that uses an implicit rather than explicit representation of the observation sets. The need for dealing with observations in bulk was also noted by other researchers [Hoey and Poupart, 2005; Sanner and Kersting, 2010]. Fortunately, the logic-based POMDP representation in the sense of Definition 2.10 allows this: consider that a set of observations is a set of interpretations and that a closed formula ϕ can be said to represent the set of all interpretations under which the formula is true, i.e., the set {i | i |= ϕ}. Therefore, labeling FSC edges with closed formulas allows for an implicit observation set representation [Müller and Biundo, 2011; Müller et al., 2012].

When choosing a suitable policy representation for an HTN approach to be used with RDDL, another challenge needs to be addressed. In RDDL, actions are interpretations of action fluents, while in HTN, choosing an action means selecting a single action schema and a valuation for the parameters. The solution to integrate both views here is to label action nodes in FSCs with updates instead of complete interpretations: on the one hand, an update u gives rise to an RDDL action by applying it to adef, resulting in the action adef[u]. On the other hand, an HTN action syntactically is an update according to Definition 2.1.


[Figure: the FSC of Figure 2.4 with nodes connect(Sat,HDMI_Cable,HDMI) (start), connect(HDMI_Cable,TV,HDMI), checkSignals(TV), tighten(HDMI_Cable,TV,HDMI), tighten(Sat,HDMI_Cable,HDMI), and noop, with edges labeled ϕ and ¬ϕ.]

Figure 3.1.: The policy shown in Figure 2.4 as a logical FSC. Edges are labeled with formulas instead of sets of observations. The ϕ symbol is short for exists_{?type,?source} hasSignalObs(TV,?type,?source). Unlabeled edges represent an edge label of true, and edges labeled with false are omitted. (Source: [Richter and Biundo, 2017, Figure 6.4], reprint and modification by courtesy of Springer International Publishing AG)

Both considerations lead to a new form of FSC that modifies Definition 2.7:

Definition 3.1. Let $P = (\mathcal{S}, \mathcal{F}, dom, cdm, class, D, val, CPF_{\mathcal{F}}, R, s_I, H)$ be a factored POMDP and let $\Phi_c$ denote the set of formulas over the signature $(\mathcal{S}, \mathcal{F}_c, type)$. A Logical FSC for $P$ is a tuple $(N, \lambda, \delta, n_I, \beta)$, where the node transition function $\delta : N \times N \to \Phi_{obs}$ maps directed edges between nodes to formulas over the observation fluents. Free variables in node transition formulas need to be contained in the valuation $\beta$. The action association function $\lambda : N \to U(\mathcal{F}_{act}) \cup \{noop\}$ maps controller nodes to updates of action fluent interpretations. The value $noop$ is used for indicating that $a_{def}$ should be executed. Free variables in updates are again required to be contained in $\beta$. The node set $N$ and initial node $n_I$ are defined in the same way as stated in Definition 2.7 for ordinary FSCs.
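For concreteness, the tuple of Definition 3.1 could be encoded as follows; this is a hypothetical Python representation (names chosen for illustration, not taken from the thesis), with formulas kept opaque as predicates over an observation and the valuation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple

# A formula is kept opaque: a predicate over an observation (an
# interpretation of the observation fluents) and the valuation beta.
Formula = Callable[[dict, dict], bool]

@dataclass
class LFSC:
    nodes: set                              # N: controller node names
    action: Dict[str, Optional[dict]]       # lambda: node -> update; None encodes noop
    delta: Dict[Tuple[str, str], Formula]   # (n, n') -> transition formula
    initial: str                            # n_I
    beta: Dict[str, object] = field(default_factory=dict)  # valuation
```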

An example of such an LFSC representing the same policy as Figure 2.4 can be seen in Figure 3.1. The possible space savings from representing transitions as formulas manifest in several ways. Consider the transition between the initial node and the single successor node: in Figure 2.4, this edge is labeled with the set of all observations, whereas in Figure 3.1, the edge can be compactly labeled with the formula $true$. It can also be seen that the LFSC in Figure 3.1 is independent of the number of domain objects, since its formulas are specified without reference to them. The LFSC therefore retains its compactness also for large instances. Additionally, the meaning of a formula is often easier to grasp than that of a plain set.


The well-definedness property for FSCs in Definition 2.9 can be carried over to LFSCs:

Definition 3.2. An LFSC $(N, \lambda, \delta, n_I, \beta)$ is well-defined if and only if for all $n \in N$, the multiset of outgoing transition formulas $\{\delta(n, n')\}_{n' \in N}$ forms a logical partition over the observation fluents $\mathcal{F}_{obs}$ according to Definition 2.3.
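Under that hypothetical encoding, well-definedness in the sense of Definition 3.2 can be checked extensionally for tiny instances; as argued above, a realistic implementation would have to reason symbolically instead, since the observation set is exponentially large.

```python
def is_well_defined(fsc, all_observations):
    """Definition 3.2, checked by brute force: for every node and every
    observation, exactly one outgoing transition formula may hold."""
    for n in fsc.nodes:
        for obs in all_observations:
            satisfied = sum(1 for n2 in fsc.nodes
                            if (n, n2) in fsc.delta
                            and fsc.delta[(n, n2)](obs, fsc.beta))
            if satisfied != 1:
                return False
    return True
```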

In summary, this leads to the following theorem:

Theorem 3.3. Let $lfsc = (N, \lambda, \delta, n_I, \beta)$ be a well-defined LFSC in a factored POMDP $P = (\mathcal{S}, \mathcal{F}, dom, cdm, class, D, val, CPF_{\mathcal{F}}, R, s_I, H)$. Then, by letting

$$\alpha(lfsc) = \begin{cases} a_{def} & \text{if } \lambda(n_I) = noop \\ a_{def}[\lambda(n_I), \beta] & \text{else} \end{cases}$$

$$\omega(lfsc, o) = (N, \lambda, \delta, n^*, \beta) \text{ where } n^* \text{ is the node for which } (o, \beta) \models \delta(n_I, n^*),$$

$lfsc$ is a policy in $P$ that can be executed and for which its expected accumulated reward $V^{t-1}_{lfsc}(s')$ can be determined according to Equation 2.4.

Proof. It needs to be ensured that $\alpha$ is a function into the set of actions $A$ and that $\omega$ yields a successor policy for every possible observation. For $\alpha$, consider that $a_{def}$ is an action and that $a_{def}[\lambda(n_I), \beta]$ is an action because $\lambda(n_I) \in U(\mathcal{F}_{act})$ is an appropriate update and $\beta$ binds all variables in $\lambda(n_I)$. For $\omega$, well-definedness of $lfsc$ in conjunction with the fact that $\beta$ binds all variables guarantees that a node $n^*$ exists for every observation $o$, making $(N, \lambda, \delta, n^*, \beta)$ an appropriate successor LFSC.
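A sketch of the resulting execution semantics, assuming the LFSC encoding above: alpha selects the action at the current initial node and omega shifts the initial node according to the received observation. bind() is a hypothetical helper that substitutes the valuation into an update.

```python
def bind(update, beta):
    # Hypothetical helper: substitute variables in the update using beta.
    return {k: beta.get(v, v) for k, v in update.items()}

def alpha(fsc, a_def):
    """Action selection per Theorem 3.3: apply the update at the initial
    node to the default action a_def, or return a_def unchanged for noop."""
    update = fsc.action[fsc.initial]
    return a_def if update is None else {**a_def, **bind(update, fsc.beta)}

def omega(fsc, obs):
    """Successor policy per Theorem 3.3: move to the unique n* with
    (obs, beta) |= delta(n_I, n*); well-definedness guarantees existence."""
    for n2 in fsc.nodes:
        edge = fsc.delta.get((fsc.initial, n2))
        if edge is not None and edge(obs, fsc.beta):
            return LFSC(fsc.nodes, fsc.action, fsc.delta, n2, fsc.beta)
    raise ValueError("controller is not well-defined")
```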

The next step is using LFSCs as the basis for representing expert knowledge in POMDP environments.


Chapter 4.

Partially Observable Hierarchical Task Networks

The central question in this thesis is whether HTN planning can be generalized to partially observable and stochastic environments. A planning approach that answers this question affirmatively should fulfill three main requirements. First, it should be an HTN-style formalism, meaning that it features the concepts of abstract actions, methods, and method application. Second, it should account for the differences between deterministic and partially observable probabilistic environments. This necessitates using a suitable representation for the concretizations of abstract actions, i.e., policies instead of action sequences. Third, the approach should be well-suited for employment in real-world scenarios, where domain experts specify their knowledge using HTNs. A compact representation for policies is therefore required, both for keeping the specification effort low as well as for enabling maintainability and readability.

The POHTN approach introduced in this chapter addresses the compactness requirement in two ways. First, a first-order formalization is used, similarly as for HTN planning in Section 2.5. Second, LFSCs as presented in Chapter 3 are used as the means for policy representation.

This chapter is divided into several sections. The first step is to generalize the concepts of abstract action, method, and method application to POMDP environments. In the second step, formal definitions for POHTN planning problems and their corresponding solutions are given. Afterwards, a way of generating solutions for POHTN problems based on MCTS is described.

4.1. Transferring HTN Concepts to POMDP Planning

4.1.1. Actions

The first step in developing the POHTN approach is choosing a way of representing actions. Following the thoughts of Chapter 3, the HTN view of actions is compatible with the RDDL view of actions. Hence, the POHTN approach will also define abstract actions through a set of action schemata $C$. As argued, the representation of primitive actions as updates can be kept, since it constitutes a generalization of action schemata.

In the Home Theater Setup scenario, an example for an abstract action schema is given by connect_abstract(?device1,?device2). The intended meaning of the action is that it is an abstraction of bringing the signals of the first device to the second device via any number of intermediate devices, while also making sure that all intermediate connections are tight by checking signals on the target device and tightening connections if necessary.

[Figure: two controllers side by side; in the left controller, node $a_0$ (start) leads via $\varphi$ and $\neg\varphi$ to $a_4$ and the shaded abstract node $c$; in the right controller, $c$ has been replaced by nodes $a_2$ and $a_5$.]

Figure 4.1.: A schematic depiction of method application in POHTN planning. A method is applied to the shaded node labeled with abstract action $c$, resulting in a controller without abstract actions.


4.1.2. Methods and Abstract Observations

On a rather superficial level, the concept of methods can also be transferred easily to POHTN planning: in normal HTN planning as defined in Section 2.5, methods associate an abstract action schema with a subplan, which consists of an action sequence and a valuation. As motivated in Chapter 3, LFSCs are an appropriate choice for compactly representing policies in POMDPs. Subplans of methods should therefore be LFSCs in POHTN planning, too. Method application then means embedding a method LFSC into an abstract LFSC by replacing an abstract action labeled node, as illustrated in Figure 4.1.

An issue not addressed by Figure 4.1 is what happens “at the end” of the execution of a subpolicy, i.e., when the subtask is completed and the execution of the next subtask should begin. In normal HTN planning, this is straightforward: obviously, the end of a sequence of actions is reached once its last action is executed. In an LFSC, however, every node specifies a successor node for all possible observations, and thus has no “last action”. An alternative means for determining when a subtask is finished is therefore required.

Another challenge is that observations could occur during the execution of a subtask that become important for the solution of a higher-level task after the subtask is completed. To see this, imagine a domain expert who tries to model a method that connects a satellite receiver Sat to the TV and wants to divide this task into two parts by first connecting Sat to a cable, which is then in some way connected to the TV. This can be implemented by executing connect(Sat,?someCable,?somePort) followed by connect_abstract(?someCable,TV). Note that the connection between Sat and ?someCable might not be tight after the former action is executed. This, however, cannot be determined at this point, because a cable cannot be directly checked for signals (in contrast to the TV). Nor is it feasible to execute connect_abstract(?someCable,TV) and then afterwards check for signals and tighten the connection between Sat and ?someCable, since the domain expert cannot know at modeling time which other connections made as part of connect_abstract(?someCable,TV) would also need to be tightened in this case. Unless the modeler is willing to unconditionally tighten each connection, which requires two steps for every connection and additionally incurs a reward penalty of $-1$ each time tighten is superfluously executed on an already tight connection, checking for a signal thus needs to be part of connect_abstract(?someCable,TV), and the information whether the TV has a signal needs to be made available to appropriately determine whether tighten(Sat,?someCable,?somePort) needs to be executed after connect_abstract(?someCable,TV) is completed.

At the same time, many of the observations made during the execution of a subtask are likely to be relevant only to the subtask at hand, and the information they convey can be safely discarded once it is completed. What is therefore required is a mechanism that allows for remembering and acting contingent on an abstraction of the gathered information.

To address the second challenge first, consider that ordinary POMDPs employ observations, i.e., interpretations over the fluents $\mathcal{F}_{obs}$, for the purpose of information gathering. The POHTN approach introduces additional abstract observations, which are interpretations over a new class of fluents $\mathcal{F}_{abstract\text{-}obs}$ and represent the information that can be gathered by executing an abstract action, analogously to primitive observations made by executing primitive actions. The idea of abstract observations was already considered for, but not fully integrated into, the MAXQ-hierarchical policy iteration algorithm [Hansen and Zhou, 2003], since, according to the authors, a full integration would have severe consequences on the performance of their algorithm. In the above example, an appropriate abstract observation fluent is the Boolean fluent loose_connection_found, which when true signifies that no signal was detected at ?device2 after connect_abstract(?device1,?device2) was completed.

To integrate abstract actions and observations into policies, the LFSC Definition 3.1 needs to be extended so that it can represent partially abstract policies that correspond to partially abstract plans in HTN planning:

Definition 4.1. Let $P$ be a POMDP as in Definition 2.10, $C$ be a set of abstract action schemata with resulting set of abstract actions $C$, and $\mathcal{F}_{abstract\text{-}obs}$ be a set of abstract observation fluents. An abstract LFSC is a tuple $(N, \lambda, \delta, n_I, \beta)$ where the action association function $\lambda : N \to U(\mathcal{F}_{act}) \cup \{noop\} \cup C$ additionally allows abstract actions and where the node transition function $\delta : N \times N \to \Phi_{obs} \cup \Phi_{abstract\text{-}obs}$ additionally allows formulas over abstract observation fluents. Also, the valuation $\beta$ is allowed to be incomplete, i.e., not all variables that occur in an action or observation formula need to also occur in $\beta$. The other tuple entries are defined as for LFSCs.

Note that every LFSC as in Definition 3.1 is also an abstract LFSC as in Definition 4.1. Well-definedness for abstract LFSCs generalizes well-definedness for LFSCs to ensure that abstract observations are only used for edges starting at abstract nodes:

Definition 4.2. An abstract LFSC $lfsc = (N, \lambda, \delta, n_I, \beta)$ is well-defined if and only if

• for all primitive action nodes $n \in N$, $\lambda(n) \in U(\mathcal{F}_{act})$, the multiset of outgoing transition formulas $\{\delta(n, n')\}_{n' \in N}$ forms a logical partition over the observation fluents $\mathcal{F}_{obs}$, and

• for all abstract action nodes $n \in N$, $\lambda(n) \in C$, the multiset of outgoing transition formulas $\{\delta(n, n')\}_{n' \in N}$ forms a logical partition over the abstract observation fluents $\mathcal{F}_{abstract\text{-}obs}$.

Well-definedness for abstract LFSCs only extends the definition of well-definedness to abstract nodes, so a well-defined abstract LFSC that is primitive as in Definition 3.1 is also a well-defined LFSC as in Definition 3.2.


[Figure: an abstract LFSC with nodes connect(Sat,HDMI_Cable,HDMI) (start), the grey abstract node connect_abstract(HDMI_Cable,TV), tighten(Sat,HDMI_Cable,HDMI), and noop, with edges labeled loose_connection_found and $\sim$loose_connection_found.]

Figure 4.2.: An abstract LFSC. Grey nodes denote abstract actions. When loose_connection_found is true after connect_abstract(HDMI_Cable,TV) is executed, no signal could be found at the TV. Since connect_abstract(HDMI_Cable,TV) checks and tightens connections, the problem seems to be with the connection between the satellite receiver and the HDMI cable, so this connection is tightened. (Source: [Richter and Biundo, 2017, Figure 6.6], reprint and modification by courtesy of Springer International Publishing AG)

Figure 4.2 shows an example abstract LFSC for the Home Theater Setup domain that uses the previously introduced abstract action schema connect_abstract(?device1,?device2) and the abstract observation fluent loose_connection_found.

Returning to the issue of how to determine when subtasks are completed, each method specifies a termination function for its implementing controller. It determines the conditions under which the execution of this controller is completed and what the according abstract observation is:

Definition 4.3. A POHTN method is a tuple $m = (c, mfsc, O_m, \tau)$, where $c \in C$ is an abstract action schema, $mfsc = (N, \lambda, \delta, n_I, \beta)$ is an abstract LFSC, and $O_m \subseteq I(\mathcal{F}_{abstract\text{-}obs}, D)$ is the subset of abstract observations that can be obtained using $m$. The termination function $\tau : N \times O_m \to \Phi_{obs} \cup \Phi_{abstract\text{-}obs}$ is such that $\tau(n, o)$ represents the conditions under which, after executing the action associated with $n$, $mfsc$ terminates with abstract observation $o$.

The termination function can also be understood as “transitions” to terminal nodes labeled with abstract observations, as visualized in Figure 4.3.

The implementing controller $mfsc$ will in general not be well-defined in the sense of Definition 4.2, because there are conditions under which it terminates instead of transitioning to another controller node. Therefore, the method as a whole needs to be well-defined:

Definition 4.4. Let $m = (c, mfsc, O_m, \tau)$ be a method where $mfsc = (N, \lambda, \delta, n_I, \beta)$ and let $\hat\delta : N \times (N \cup O_m) \to \Phi_{obs} \cup \Phi_{abstract\text{-}obs}$ be the generalized transition function with

$$\hat\delta(n, n') = \begin{cases} \delta(n, n') & \text{if } n' \in N \\ \tau(n, n') & \text{if } n' \in O_m. \end{cases}$$


[Figure: a method controller with nodes connect(?d1,?d2,?p) (start), checkSignals(?d2), and tighten(?d1,?d2,?p), edges labeled $\varphi$ and $\neg\varphi$, and double-bordered terminal nodes for the abstract observations $d$(loose_connection_found) $= false$ and $d$(loose_connection_found) $= true$.]

Figure 4.3.: The method controller of a method for connect_abstract(?d1,?d2), where the obtainable abstract observations are represented by double-bordered nodes and $\tau$ is given by the transitions to these nodes. The $\varphi$ symbol is short for exists_?type,?source hasSignalObs(?d2,?type,?source). The method implements connect_abstract(?d1,?d2) by connecting the devices directly, checking the connection between them, and creating an abstract observation with the result. (Source: [Richter and Biundo, 2017, Figure 6.5], reprint and modification by courtesy of Springer International Publishing AG)

Then $m$ is well-defined if and only if

• for all primitive action nodes $n \in N$, $\lambda(n) \in U(\mathcal{F}_{act})$, the multiset of outgoing transition formulas $\{\hat\delta(n, n')\}_{n' \in N \cup O_m}$ forms a logical partition over the observation fluents $\mathcal{F}_{obs}$, and

• for all abstract action nodes $n \in N$, $\lambda(n) \in C$, the multiset of outgoing transition formulas $\{\hat\delta(n, n')\}_{n' \in N \cup O_m}$ forms a logical partition over the abstract observation fluents $\mathcal{F}_{abstract\text{-}obs}$.

4.1.3. Method Application

Just as in ordinary HTN planning, applying a method also works by instantiating a method controller to match the variables occurring in a given abstract LFSC $lfsc$ and using it to replace the abstract action in $lfsc$:

Definition 4.5. Let $fsc_1 = (N^1, \lambda^1, \delta^1, n^1_I, \beta^1)$ be an abstract LFSC, $c(v_1, \ldots, v_n) \in C$ an abstract action schema, $n^1_C \in N^1$ with $\lambda^1(n^1_C) = c(v'_1, \ldots, v'_n)$ the node to decompose, and let $\beta^1$ contain an assignment for every variable occurring in $fsc_1$. Let further $m = (c(v_1, \ldots, v_n), mfsc, O_m, \tau)$ be a decomposition method and let $fsc_2 = (N^2, \lambda^2, \delta^2, n^2_I, \beta^2)$ and $\tau^2$ be isomorphic copies of $mfsc$ and $\tau$, for which $N^1 \cap N^2 = \emptyset$ and where the variables $v_1, \ldots, v_n$ are substituted with $v'_1, \ldots, v'_n$ and all other variables are substituted with new variables. Applying $m$ to $fsc_1$ results in a new abstract LFSC $fsc_3 = (N^3, \lambda^3, \delta^3, n^3_I, \beta^3)$, denoted $apply(fsc_1, n^1_C, m) = fsc_3$, which is defined as follows:


• The resulting node set $N^3 = (N^1 \cup N^2) \setminus \{n^1_C\}$ contains all nodes from $N^1$ and $N^2$ except the decomposed node $n^1_C$.

• Action labels are kept from the original controllers, i.e.,

$$\lambda^3(n) = \begin{cases} \lambda^1(n) & \text{if } n \in N^1 \\ \lambda^2(n) & \text{if } n \in N^2. \end{cases}$$

• Unless the initial node of the original controller was decomposed, the initial node remains unchanged:

$$n^3_I = \begin{cases} n^1_I & \text{if } n^1_I \neq n^1_C \\ n^2_I & \text{if } n^1_I = n^1_C. \end{cases}$$

• For the node transitions, inner transitions of $fsc_1$ and $fsc_2$ are kept. Transitions to the replaced node $n^1_C$ are converted to transitions to the initial node of the method controller $n^2_I$. The termination function $\tau^2$ creates transitions to the successor nodes of $n^1_C$ by checking whether the abstract observations fulfill the outgoing transition conditions of $n^1_C$, which are formulas over abstract observation fluents. For this, let for each node $n \in N^2$ and $n' \in N^1$

$$\delta^T(n, n') = \bigvee_{\substack{o_m \in O_m \\ (o_m, \beta^1) \models \delta^1(n^1_C, n')}} \tau^2(n, o_m)$$

denote the disjunction over all termination conditions from a node in $N^2$ that fulfill the abstract transition condition from the decomposed node $n^1_C$ to another node in $N^1$. Letting further $pred(n^1_C) = \{n \in N^1 \mid \delta^1(n, n^1_C) \neq false\}$ be the set of predecessor nodes of $n^1_C$, i.e., the set of nodes from which $n^1_C$ can be reached in one step, $\delta^3(n, n')$ can be defined in terms of where $n, n' \in N^3$ originate from:

| $\delta^3(n, n')$ | $n' = n^2_I$ | $n' \in N^2 \setminus \{n^2_I\}$ | $n' \in N^1$ |
|---|---|---|---|
| $n \in pred(n^1_C)$ | $\delta^1(n, n^1_C)$ | $false$ | $\delta^1(n, n')$ |
| $n \in N^1 \setminus pred(n^1_C)$ | $false$ | $false$ | $\delta^1(n, n')$ |
| $n \in N^2$ | $\delta^2(n, n') \vee \delta^T(n, n^1_C)$ | $\delta^2(n, n')$ | $\delta^T(n, n')$ |

Note that in the case where $n \in N^2$ and $n' = n^2_I$, $\delta^T(n, n^1_C)$ represents the possibility of a self loop at $n^1_C$.

• The new valuation $\beta^3 = \beta^1 \cup \beta^2$ contains the variable assignments of both $fsc_1$ and $fsc_2$.
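The following Python sketch mirrors the structural part of Definition 4.5. It assumes the LFSC encoding introduced in Chapter 3, that the method controller has already been copied with fresh node names and substituted variables, and that delta_T has been precomputed from the termination function and the valuation as in the definition (None encodes a false transition formula); all names are illustrative.

```python
def apply_method(fsc1, n_c, method_fsc, delta_T):
    """Replace the abstract node n_c of fsc1 by the method controller
    method_fsc; delta_T(n, n2) returns the termination disjunction
    delta^T of Definition 4.5, or None when it is false."""
    n2_init = method_fsc.initial
    nodes = (fsc1.nodes | method_fsc.nodes) - {n_c}
    action = {**fsc1.action, **method_fsc.action}
    action.pop(n_c)
    initial = n2_init if fsc1.initial == n_c else fsc1.initial
    delta = {}
    for (n, n2), phi in fsc1.delta.items():
        if n == n_c:
            continue                                    # drop outgoing edges of n_c
        delta[(n, n2_init if n2 == n_c else n2)] = phi  # redirect edges into n_c
    delta.update(method_fsc.delta)                      # inner transitions of the method
    for n in method_fsc.nodes:                          # termination transitions
        for n2 in fsc1.nodes - {n_c}:
            t = delta_T(n, n2)
            if t is not None:
                delta[(n, n2)] = t
        loop = delta_T(n, n_c)          # self loop at n_c: disjoin onto n2_init
        if loop is not None:
            old = delta.get((n, n2_init))
            delta[(n, n2_init)] = (loop if old is None else
                                   lambda o, b, p=old, q=loop: p(o, b) or q(o, b))
    return LFSC(nodes, action, delta, initial, {**fsc1.beta, **method_fsc.beta})
```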

For example, the result of applying the method defined by Figure 4.3 to the abstract LFSC shown in Figure 4.2 is the LFSC shown in Figure 3.1. Note that the reason why $\beta^1$ needs to contain an assignment for every variable occurring in $fsc_1$ is that it is checked during decomposition whether the outgoing abstract transition conditions of $n^1_C$ are fulfilled. This could in principle be relaxed to only require $\beta^1$ to bind every variable occurring in $\lambda^1(n^1_C)$ and all variables occurring in transition conditions of outgoing edges of $n^1_C$.

The goal of this definition is of course that decomposition produces well-defined LFSCs, which is indeed the case [Müller et al., 2012]:

Theorem 4.6. Let $fsc_1$, $m$, $fsc_2$, and $fsc_3$ be as in Definition 4.5 and let $fsc_1$ and $m$ be well-defined. Then $fsc_3$ is a well-defined abstract LFSC.

Proof. It needs to be shown that the outgoing transitions of every node in $fsc_3$ form a partition of the respective possible observations, abstract or primitive, as stated in Definition 4.2. According to the definition of $\delta^3(n, n')$, $n \in N^3$ falls into one of three categories:

• For the case where $n \in pred(n^1_C)$, the transition to the terminal node is converted to a transition to $n^2_I$, so it holds that

$$\{\delta^3(n, n')\}_{n' \in N^3} = \{\delta^3(n, n')\}_{n' = n^2_I} \cup \{\delta^3(n, n')\}_{n' \in N^2 \setminus \{n^2_I\}} \cup \{\delta^3(n, n')\}_{n' \in N^1 \cap N^3}$$
$$= \{\delta^1(n, n^1_C)\} \cup \{false\}_{n' \in N^2 \setminus \{n^2_I\}} \cup \{\delta^1(n, n')\}_{n' \in N^1 \setminus \{n^1_C\}}$$
$$= \{\delta^1(n, n')\}_{n' \in N^1} \cup \{false\}_{n' \in N^2 \setminus \{n^2_I\}}.$$

Since $fsc_1$ is well-defined, $\{\delta^1(n, n')\}_{n' \in N^1}$ is a logical partition. Adding an arbitrary number of instances of $false$ does not change this, so $\{\delta^3(n, n')\}_{n' \in N^3}$ is a logical partition, too. Moreover, $\{\delta^3(n, n')\}_{n' \in N^3}$ is a logical partition over abstract observation fluents if and only if $\{\delta^1(n, n')\}_{n' \in N^1}$ is a logical partition over abstract observation fluents.

• For the case where $n \in N^1 \setminus pred(n^1_C)$, it holds that

$$\{\delta^3(n, n')\}_{n' \in N^3} = \{\delta^3(n, n')\}_{n' = n^2_I} \cup \{\delta^3(n, n')\}_{n' \in N^2 \setminus \{n^2_I\}} \cup \{\delta^3(n, n')\}_{n' \in N^1 \cap N^3}$$
$$= \{false\}_{n' = n^2_I} \cup \{false\}_{n' \in N^2 \setminus \{n^2_I\}} \cup \{\delta^1(n, n')\}_{n' \in N^1 \setminus \{n^1_C\}}$$
$$= \{false\}_{n' \in N^2} \cup \left( \{\delta^1(n, n')\}_{n' \in N^1} \setminus \{false\}_{n' = n^1_C} \right),$$

where the last equality is because $n$ is not a predecessor of $n^1_C$. Again, this is a logical partition because $\{\delta^1(n, n')\}_{n' \in N^1}$ is a logical partition, and again, $\{\delta^3(n, n')\}_{n' \in N^3}$ is a logical partition over abstract observation fluents if and only if $\{\delta^1(n, n')\}_{n' \in N^1}$ is a logical partition over abstract observation fluents.

• When $n \in N^2$, it holds that

$$\{\delta^3(n, n')\}_{n' \in N^3} = \{\delta^3(n, n')\}_{n' = n^2_I} \cup \{\delta^3(n, n')\}_{n' \in N^2 \setminus \{n^2_I\}} \cup \{\delta^3(n, n')\}_{n' \in N^1 \cap N^3}$$
$$= \{\delta^2(n, n') \vee \delta^T(n, n^1_C)\}_{n' = n^2_I} \cup \{\delta^2(n, n')\}_{n' \in N^2 \setminus \{n^2_I\}} \cup \{\delta^T(n, n')\}_{n' \in N^1 \setminus \{n^1_C\}}.$$

Because $m$ is well-defined and

$$\delta^T(n, n^1_C) = \bigvee_{\substack{o_m \in O_m \\ (o_m, \beta^1) \models \delta^1(n^1_C, n^1_C)}} \tau^2(n, o_m),$$

$\delta^2(n, n^2_I)$ and $\delta^T(n, n^1_C)$ exclude each other, so the last line is a logical partition if and only if

$$\{\delta^2(n, n')\}_{n' = n^2_I} \cup \{\delta^T(n, n^1_C)\}_{n' = n^2_I} \cup \{\delta^2(n, n')\}_{n' \in N^2 \setminus \{n^2_I\}} \cup \{\delta^T(n, n')\}_{n' \in N^1 \setminus \{n^1_C\}}$$
$$= \{\delta^2(n, n')\}_{n' \in N^2} \cup \{\delta^T(n, n^1_C)\} \cup \{\delta^T(n, n')\}_{n' \in N^1 \setminus \{n^1_C\}}$$
$$= \{\delta^2(n, n')\}_{n' \in N^2} \cup \{\delta^T(n, n')\}_{n' \in N^1}$$
$$= \{\delta^2(n, n')\}_{n' \in N^2} \cup \Big\{ \bigvee_{\substack{o_m \in O_m \\ (o_m, \beta^1) \models \delta^1(n^1_C, n')}} \tau^2(n, o_m) \Big\}_{n' \in N^1}$$

is a logical partition. Since $fsc_1$ is well-defined there is, for every abstract observation $o_m \in O_m$, exactly one $n' \in N^1$ such that $(o_m, \beta^1) \models \delta^1(n^1_C, n')$. Every $\tau^2(n, o_m)$ thus appears exactly once for each $o_m \in O_m$ in the second multiset of formulas. Since the $\tau^2$ terms are also mutually exclusive, this is again a logical partition if and only if

$$\{\delta^2(n, n')\}_{n' \in N^2} \cup \{\tau^2(n, o_m)\}_{o_m \in O_m} = \{\hat\delta^2(n, n')\}_{n' \in N^2 \cup O_m}$$

is a logical partition, which it is because $m$ is well-defined. Just as in the two simpler cases, $\{\delta^3(n, n')\}_{n' \in N^3}$ is a logical partition over abstract observation fluents if and only if $\{\hat\delta^2(n, n')\}_{n' \in N^2 \cup O_m}$ is a logical partition over abstract observation fluents.

The outgoing transitions $\{\delta^3(n, n')\}_{n' \in N^3}$ thus form a logical partition over the appropriate type of observation fluents for every node $n \in N^3$, concluding the proof.

The same notation as in deterministic HTN planning can be used for denoting method application, i.e., $fsc \to_m fsc'$ means that $fsc' = (N', \lambda', \delta', n'_I, \beta')$ can be obtained by applying $m$ to an abstract action node $n$ in $fsc = (N, \lambda, \delta, n_I, \beta)$, i.e., $fsc' = apply(fsc, n, m)$. For arbitrarily many, including zero, method applications and variable codesignations, $fsc \to^*_M fsc'$ holds if and only if

1. $fsc = fsc'$, or

2. there exists a method $m \in M$ such that $fsc \to_m fsc'$, or

3. a variable is bound, i.e., $\beta' = \beta[v \mapsto d]$ for a $d \in D$, also denoted $apply(fsc, [v \mapsto d])$, or

4. an intermediate plan $fsc''$ can be obtained through cases 2. or 3. and $fsc'' \to^*_M fsc'$.

4.1.4. Problems and Solutions

The given definitions of abstract plans, methods, and method application now enable the definition of a POHTN planning problem in analogy to the HTN problem definition:


Definition 4.7. A POHTN planning problem is a tuple $POHTN = (P, C, \mathcal{F}_{abstract\text{-}obs}, M, fsc_I)$ consisting of a factored POMDP $P$, a set of abstract action schemata $C$, a set of abstract observation fluents $\mathcal{F}_{abstract\text{-}obs}$, a set $M$ of well-defined methods, at least one for each abstract action schema in $C$, and a well-defined initial abstract LFSC $fsc_I$.

POHTN problems can be divided into two parts in accordance with the distinction between planning domain and planning instance in factored POMDPs. The sets $C$, $\mathcal{F}_{abstract\text{-}obs}$, and $M$ are independent of the instance at hand and can be used for the entire domain. Only the initial controller needs to be specified separately for each instance.

A POHTN problem induces a set of primitive LFSCs that can be constructed by repeatedly applying methods to the initial controller:

Definition 4.8. The language of a problem $PH = (P, C, \mathcal{F}_{abstract\text{-}obs}, M, fsc_I)$ is given by $L(PH) = \{fsc \mid fsc_I \to^*_M fsc,\ fsc \text{ is an LFSC in the sense of Definition 3.1}\}$.

A simple property of the language of a POHTN problem is the following:

Theorem 4.9. Let $PH$ be a POHTN problem. Then every controller in $L(PH)$ is an executable policy.

Proof. From Theorem 4.6 and Definition 4.7, it immediately follows that every controller in $L(PH)$ is well-defined. By Definition 4.8, every controller in $L(PH)$ is an LFSC as in Definition 3.1. Applying Theorem 3.3, every controller in $L(PH)$ is an executable policy.

A departure from the HTN view of plans concerns the definition of what constitutes a solution: in HTN planning, a solution is an executable plan. Defining POHTN solutions in the same way would be problematic in two respects. First, according to Theorem 4.9, every primitive controller that can be generated from a POHTN problem is executable. This is in line with non-hierarchical POMDP planning, where every policy can be executed. Second, the underlying POMDP of a POHTN planning problem as in Definition 4.7 already specifies a means for assessing the quality of a policy, namely the reward function. Using the same measure of policy quality appears appropriate, since it allows for a direct comparison of policies generated via POHTN planning and policies generated with non-hierarchical planning approaches. Thus, a POHTN solution is defined to be a policy of maximal expected accumulated reward among all policies in its language:

Definition 4.10. A solution of a POHTN planning problem $PH$ is a controller $fsc^* \in L(PH)$ with

$$V^H_{fsc^*}(s_I) = \max_{fsc \in L(PH)} V^H_{fsc}(s_I).$$

When $L(PH) = \emptyset$, $PH$ is called unsolvable.

Note that the use of the max operator is justified. While $L(PH)$ can contain infinitely many controllers when $M$ contains recursive methods, there can be only finitely many different values for $V^H_{fsc}(s_I)$ since controllers are evaluated over finitely many time steps $H$. Another view on this is that the controllers in $L(PH)$ can be grouped into equivalence classes such that two controllers are in the same class when the subgraphs defined by taking all nodes reachable from the initial node in $H$ time steps are isomorphic. All controllers in the same class will then have the same expected accumulated reward, and there are only finitely many of these classes.

In general, a POHTN solution will not be an optimal policy in its underlying POMDP in the sense of Definition 2.6. This is because the methods and initial controller might not be able to generate an optimal policy, i.e., $\pi^* \notin L(PH)$. On the other hand, the number of controllers in $L(PH)$ can potentially be much smaller than the total number of POMDP policy trees. Even when $L(PH)$ is infinite, a planning algorithm for POHTN only needs to decide on an equivalence class of controllers, of which there are potentially far fewer than, though at most as many as, policy trees. Generating POMDP policies using the POHTN approach thus trades optimality for a smaller search space.

An important detail in this regard is that using POHTN planning does not inherently incur suboptimal policies. This can be seen by giving a generic construction that produces, for a given POMDP, a POHTN problem whose language contains all policies of the POMDP, which hence also includes the optimal policy:

Theorem 4.11. For any factored POMDP $P = (\mathcal{S}, \mathcal{F}, dom, cdm, class, D, val, CPF_{\mathcal{F}}, R, s_I, H)$, there exists a POHTN hierarchy $PH^*$ such that $L(PH^*)$ contains all policy trees of $P$.

Proof. Consider the following POHTN hierarchy $PH^* = (P, C^*, \mathcal{F}^*_{abstract\text{-}obs}, M^*, fsc^*_I)$:

• There is only one abstract action schema without parameters $c$, so $C^* = \{c\}$.

• There are no abstract observation fluents, i.e., $\mathcal{F}^*_{abstract\text{-}obs} = \emptyset$.

• For every primitive action $a \in A$ of the POMDP, i.e., $a_{def}$ or $a_{def}[u]$ for a single update $u$ to the default action, there is a method $m_a \in M^*$ such as the one shown in Figure 4.4: the initial node $n_I$ of the method controller is labeled with $a$. For each possible observation $o \in O$, there is a successor node $n_o$ of $n_I$. Each such node is labeled with $c$, and $\delta(n_I, n_o)$ is such that only $o \models \delta(n_I, n_o)$ and $o' \not\models \delta(n_I, n_o)$ for $o \neq o'$. There are no transitions out of an $n_o$, i.e., $\delta(n_o, n_o) = true$. Additionally, there is one method $m_{noop} \in M^*$ whose controller only consists of a single node $n_I$ labeled with $noop$ and where $\delta(n_I, n_I) = true$.

• The initial controller $fsc^*_I$ contains a single node $n_I$ labeled with $c$ and where $\delta(n_I, n_I) = true$.

Iteratively applying the methods $m_a$ to $fsc^*_I$ creates a tree-shaped controller that branches on every observation history. Since there is a method $m_a$ for every action $a$ of the POMDP, arbitrary policy trees can be constructed.

4.2. POHTN Plan Generation

The next step in applying the POHTN approach is constructing an algorithm for finding POHTN solutions. Just as for HTNs, search in the space of partial policies lends itself to this purpose, i.e., search states correspond to abstract LFSCs and actions correspond to method application or variable codesignation. A hurdle in this respect is, however, that an ordinary optimal search algorithm such as breadth-first search or A* at the very least needs to calculate the value of terminal search states exactly, i.e., $V^H_{fsc}(s_I)$ for a primitive controller $fsc$. This alone involves significant effort, however, since computing $V^H_{fsc}(s_I)$ takes $O(H |S|^2 |O|)$ time according to Equation 2.4, and both $|S|$ and $|O|$ are exponentially large in the size of a factored POMDP. Experimentally, this translates to rather poor performance of an A*-based approach, even when considering compact representations for intermediate results [Müller et al., 2012]. Simulating the execution of a controller in the POMDP multiple times and averaging over the seen values is a cheaper way to assess its value and constitutes an unbiased estimator. It cannot be used in conjunction with A* search, however, since A* requires an exact value.


[Figure: a method controller whose initial node is labeled $a$ (start), with one successor node labeled $c$ per observation, reached via edges $o_0, o_1, \ldots, o_{n-1}$.]

Figure 4.4.: The method controller for a method $m_a$ of the abstract action $c$, whose initial node is labeled with $a$ and which leads to a different course of action for every possible observation.


A remedy to this is employing MCTS, since it can easily deal with uncertainty. To do this, one needs to define an appropriate MDP representation of a POHTN problem, which is called its decomposition MDP. The principal idea is illustrated in Figure 4.5 and captured by the following, unfortunately flawed definition:

Flawed Definition. For a given POHTN problem $PH = (P, C, \mathcal{F}_{abstract\text{-}obs}, M, fsc_I)$, the decomposition MDP is given by the tuple $(S^{PH}, A^{PH}, G^{PH}, s^{PH}_I, S^{PH}_T)$, where

• The set of search states is defined as $S^{PH} = \{fsc \mid fsc_I \to^*_M fsc\}$, i.e., the set of all abstract LFSCs that can be generated from $PH$.

• For a given search state, i.e., controller $fsc = (N, \lambda, \delta, n_I, \beta) \in S^{PH}$, and letting $unbound(fsc)$ denote the unbound variables in $fsc$ and $abstract(fsc) = \{n \in N \mid \lambda(n) \in C\}$ its abstract nodes, the set of applicable actions is defined as

$$A^{PH}(fsc) = \{(n, m) \mid n \in abstract(fsc),\ m \in M \text{ such that } m \text{ is a method for the schema of } \lambda(n)\} \cup \{[v \mapsto d] \mid v \in unbound(fsc),\ d \in D\},$$

i.e., the actions for $fsc$ correspond to each possibility of decomposing each abstract node together with all possibilities for binding a variable in $fsc$.


[Figure: a search tree fragment with root $fsc$ and children $apply(fsc, n, m)$ and $apply(fsc, n, m')$, reached via edges $(n, m)$ and $(n, m')$.]

Figure 4.5.: An illustration of the shape of the MCTS search tree for POHTN planning. Rectangular nodes represent decision nodes and correspond to abstract LFSCs. Edges between decision nodes correspond to method applications or variable bindings. Since these are deterministic, no chance nodes are needed.

• The simulator $G^{PH}$ is defined as $(fsc', r) \sim G^{PH}(fsc, a)$ where

$$fsc' = \begin{cases} apply(fsc, n, m) & \text{if } a = (n, m) \\ apply(fsc, [x \mapsto d]) & \text{if } a = [x \mapsto d] \end{cases}$$

and

$$r = \begin{cases} \hat V^H_{fsc'}(s_I) & \text{if } fsc' \in L(PH) \\ 0 & \text{else.} \end{cases}$$

Here, $\hat V^H_{fsc'}(s_I)$ denotes an unbiased estimate for $V^H_{fsc'}(s_I)$ generated by simulating the execution of $fsc'$ in $P$: executing $fsc'$ yields an execution trace $(a^1, o^1, r^1, a^2, o^2, r^2, \ldots, a^H, o^H, r^H)$, i.e., an observable history annotated with the incurred immediate rewards, and the estimate is defined as $\hat V^H_{fsc'}(s_I) = \sum_{i=1}^{H} r^i$. In other words, $G^{PH}$ applies a method or binds a variable and, when the result is a policy, simulates its execution in $P$.

• The initial search state $s^{PH}_I = fsc_I$ is the initial controller.

• The terminal states $S^{PH}_T = L(PH)$ are the policies that can be generated from $PH$.

Since $G^{PH}$ generates successor search states deterministically, there is only a single terminal search state $s_T(\pi)$ for a given policy $\pi$ in the decomposition MDP. By the above definition, $s_T(\pi)$ is an LFSC. Thus, a policy in the decomposition MDP uniquely specifies a controller in $L(PH)$. The expected value of $\pi$ is equal to the expectation of $\hat V^H_{fsc}(s_I)$ of the terminal search state $fsc$ it reaches, since $r$ is 0 for non-terminal states. By construction $\mathbb{E}\,\hat V^H_{fsc}(s_I) = V^H_{fsc}(s_I)$, so this definition already has the property that an optimal policy in the decomposition MDP yields an optimal controller of the POHTN problem. It is, however, flawed in several respects.

4.2.1. Binding Variables and Choosing Nodes to Decompose

First, it does not ensure that all variables of a controller are bound before a method is applied, as required by Definition 4.5 for ensuring that abstract transition conditions can be evaluated. This is easy to fix: one needs to redefine $A^{PH}(fsc)$ such that for controllers with unbound variables, the only available actions are to bind a variable, i.e., redefining

$$A^{PH}(fsc) = \begin{cases} \{[v \mapsto d] \mid v \in unbound(fsc),\ d \in D\} & \text{if } unbound(fsc) \neq \emptyset \\ \{(n, m) \mid n \in abstract(fsc),\ m \in M \text{ such that } m \text{ is a method for the schema of } \lambda(n)\} & \text{else} \end{cases}$$

solves this issue. Related to this, although more of a performance issue, is the fact that the number of actions in $A^{PH}(fsc)$ is unnecessarily high. For two variables $v, v' \in unbound(fsc)$, $v \neq v'$, the order in which they are bound is arbitrary, and it suffices to restrict the available actions to the possibilities for binding only $v$. The same holds when there are two abstract nodes in a controller. Restricting $A^{PH}(fsc)$ in this manner is common practice in HTN planning [Ghallab et al., 2004, p. 249]. Given an arbitrary but fixed function $choose(S)$ for selecting a single element from a set $S$, this leads to the following definition:

$$A^{PH}(fsc) = \begin{cases} \{[choose(unbound(fsc)) \mapsto d] \mid d \in D\} & \text{if } unbound(fsc) \neq \emptyset \\ \{(choose(abstract(fsc)), m) \mid m \in M \text{ such that } m \text{ is a method for the schema of } \lambda(choose(abstract(fsc)))\} & \text{else} \end{cases}$$
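As a sketch, the restricted action set can be generated as follows; unbound() and abstract_nodes() are assumed helper functions, abstract node labels are assumed to carry a schema attribute, and choose() may be any arbitrary but fixed selection function, e.g., min over a canonical ordering.

```python
def applicable_actions(fsc, methods, domain, choose=min):
    """Restricted A^PH(fsc): bind the single chosen unbound variable
    first; only once all variables are bound, decompose the single
    chosen abstract node with each matching method."""
    if unbound(fsc):
        v = choose(unbound(fsc))
        return [("bind", v, d) for d in domain]
    n = choose(abstract_nodes(fsc))
    schema = fsc.action[n].schema
    return [("decompose", n, m) for m in methods if m.schema == schema]
```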

4.2.2. Decomposability

The second problem is that some abstract action schema might exist that can never be refined into a primitive controller, e.g., when decomposing a node labeled with $c$ always adds a new node labeled with $c$. When such an abstract action is contained in a controller under consideration, this violates the requirement that $h(s_I) < \infty$ in the generative indefinite horizon MDP Definition 2.11, and using MCTS inevitably leads to an infinitely long chain of decompositions. To prevent this, the notion of decomposability is introduced:

Definition 4.12. A POHTN problem $PH = (P, C, \mathcal{F}_{abstract\text{-}obs}, M, fsc_I)$ is called decomposable if and only if for every controller $fsc$ such that $fsc_I \to^*_M fsc$, there exists a controller $fsc' \in L(PH)$ such that $fsc \to^*_M fsc'$.

Obviously, decomposability prevents inescapable cycles that MCTS can become trapped in. A decomposable POHTN problem is also solvable, because decomposability implies the existence of an LFSC $fsc' \in L(PH)$ such that $fsc_I \to^*_M fsc'$. It turns out that decomposability can easily be ensured:

Theorem 4.13. For every POHTN problem $PH = (P, C, \mathcal{F}_{abstract\text{-}obs}, M, fsc_I)$, it holds that either

• $PH$ is unsolvable, or

• there exist subsets of abstract action schemata $C' \subseteq C$ and methods $M' \subseteq M$ such that $PH' = (P, C', \mathcal{F}_{abstract\text{-}obs}, M', fsc_I)$ is a decomposable POHTN problem where $L(PH) = L(PH')$.

Deciding which category $PH$ falls into and generating $PH'$ can be done in polynomial time.


Proof. The first step is deciding whether $PH$ is solvable. As a prerequisite, note from Definition 4.5 that the only thing that can prevent a method from being applied to a controller is when the controller has unbound variables. Since variables can be bound arbitrarily without affecting whether a given abstract LFSC can be refined into an LFSC, they can always be bound directly as soon as they are introduced. One can thus assume that methods are always applicable. Next, determine whether $L(PH)$ is empty through a bottom-up procedure that successively marks abstract action schemata as generating. The procedure is similar to the one used for deciding whether the language of a context-free grammar is empty [Gurari, 1989, Theorem 5.6.1]:

1. Traverse the set of methods $M$ to find a method $m = (c, mfsc, O_m, \tau)$ with $mfsc = (N, \lambda, \delta, n_I, \beta)$ where $c$ is unmarked and the schema of $\lambda(n)$ is marked for all abstract nodes $n \in N$, $\lambda(n) \in C$. Note that this is trivially true when all $n \in N$ are labeled with primitive actions. If such an $m$ exists, mark $c$ as generating and repeat this step.

2. If a node in $fsc_I$ exists where $\lambda(n) \in C$ and the schema of $\lambda(n)$ is not generating, then output $L(PH) = \emptyset$, else output $L(PH) \neq \emptyset$.

Step 1 is executed at most $|C|$ times. Each iteration takes at most $|M| \cdot |N_{max}|$ time, where $N_{max}$ is the largest node set in any method controller. Hence, deciding whether $L(PH) = \emptyset$ takes cubic time in the size of $PH$.

For determining $PH'$, one can now assume that $L(PH) \neq \emptyset$. The non-generating abstract action schemata cannot contribute to any controller in $L(PH)$. They can thus be discarded, along with any methods for them and any methods in which they are used, yielding $C'$ and $M'$. One does not need to worry that the schema of an abstract action occurring in $fsc_I$ is removed, since $L(PH) \neq \emptyset$. The POHTN problem formed using the remaining abstract action schemata and methods is decomposable. To see this, consider an arbitrary controller $fsc$ such that $fsc_I \to^*_{M'} fsc$. The schemata of all abstract actions occurring in it are generating and can therefore be directly or indirectly decomposed primitively using appropriate methods from $M'$. Applying these methods to $fsc$ will yield a primitive controller $fsc' \in L(PH)$.

Theorem 4.13 means that one can take any POHTN problem and either sort it out as unsolvable or continue working with an equivalent solvable, decomposable POHTN problem.
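The marking procedure from the proof translates into a straightforward fixed-point loop; in this sketch, each method is abstracted (as an assumption) to a pair of its schema and the set of abstract schemata appearing in its controller.

```python
def generating_schemata(methods):
    """Bottom-up marking from the proof of Theorem 4.13: a schema becomes
    generating once some method for it uses only primitive actions or
    already-generating schemata. methods: list of (schema, abstract_used)."""
    generating = set()
    changed = True
    while changed:
        changed = False
        for schema, used in methods:
            if schema not in generating and used <= generating:
                generating.add(schema)
                changed = True
    return generating

# Example: the direct method for "connect_abstract" uses no abstract
# actions, which makes the schema generating and thereby also licenses
# the recursive method.
methods = [("connect_abstract", set()),
           ("connect_abstract", {"connect_abstract"})]
assert generating_schemata(methods) == {"connect_abstract"}
```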

4.2.3. Pseudo-Primitive Controllers

Decomposability resolves the problem of arbitrarily long decomposition chains in MCTS only partially. While decomposability enables generating an LFSC from every obtainable abstract LFSC, it does not provide an upper bound on the number of decompositions required to reach an LFSC. This means that a decomposition MDP as defined above still does not conform to Definition 2.11. The Home Theater Setup hierarchy is an example for this, as can be seen from the method for connect_abstract(?d1,?d2) shown in Figure 4.6. Applying this method to the abstract LFSC shown in Figure 4.2 introduces a new node labeled with connect_abstract(?d1,?d2). The method can be applied arbitrarily often. Yet, at any time, the method shown in Figure 4.3 can be applied to terminate the recursion. Hence, these two methods allow for a decomposable POHTN problem where the maximum number of decompositions is unbounded.


[Figure: a method controller with initial node connect(?d1,?d3,?p) (start), abstract node connect_abstract(?d3,?d2), node tighten(?d1,?d3,?p), edges labeled loose_connection_found and $\sim$loose_connection_found, and double-bordered terminal nodes for $d$(loose_connection_found) $= false$ and $d$(loose_connection_found) $= true$.]

Figure 4.6.: The method controller of a second method for connect_abstract(?d1,?d2). The method implements connect_abstract(?d1,?d2) recursively by connecting the devices via a third device ?d3, tightening the connection between ?d1 and ?d3 if a loose connection was detected after connecting ?d3 and ?d2, and creating an abstract observation with the result.

The example also hints at a solution to this issue, however. Suppose one tries to execute the abstract LFSC of Figure 4.2 without first refining it into an LFSC. This would only work for the first time step, since its initial node is primitive and the only possible successor node is abstract. The abstract LFSC obtained from applying the method of Figure 4.6 to the abstract LFSC of Figure 4.2 can be executed for two time steps before an abstract node is reached. Iterated application of the method of Figure 4.6 will, after $H - 1$ method applications, yield an abstract LFSC where the first abstract node cannot be reached within the specified time horizon $H$ during execution. Such controllers are called pseudo-primitive, a notion that will be formalized in the following.

Definition 4.14. Let $fsc = (N, \lambda, \delta, n_I, \beta)$ be an abstract LFSC. The distance of a node $n \in N$, denoted $dist(n)$, is given by

$$dist(n) = \begin{cases} 0 & \text{if } n = n_I, \\ 1 + \min_{n' \in N,\ \delta(n', n) \neq false} dist(n') & \text{else.} \end{cases}$$

The distance of a node gives the minimum number of steps required to reach it from the initial node of a controller (more precisely, a lower bound on this number, since the definition does not account for unsatisfiable formulas that are syntactically different from $false$). This can be employed to define pseudo-primitivity:

Definition 4.15. Let $PH = (P, C, \mathcal{F}_{abstract\text{-}obs}, M, fsc_I)$ be a POHTN problem and $P = (\mathcal{S}, \mathcal{F}, dom, cdm, class, D, val, CPF_{\mathcal{F}}, R, s_I, H)$ its underlying POMDP. An abstract LFSC $fsc = (N, \lambda, \delta, n_I, \beta)$ with $fsc_I \to^*_M fsc$ is called pseudo-primitive if and only if

• for all abstract nodes $n \in N$, $\lambda(n) \in C$, it holds that $dist(n) \geq H$, and

• $\beta$ binds all variables occurring in $fsc$.

Whether a given abstract LFSC is pseudo-primitive can be checked using a shortest-path algorithm such as Dijkstra's algorithm.
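Since all edges count one step, breadth-first search already computes the distances of Definition 4.14; the following is a sketch under the LFSC encoding used before, where edges whose formula is syntactically false are assumed absent from delta, and abstract_nodes() and unbound() are assumed helpers.

```python
from collections import deque

def distances(fsc):
    """BFS from the initial node; unit edge weights make BFS equivalent
    to Dijkstra here. Unreachable nodes receive no entry (distance oo)."""
    dist = {fsc.initial: 0}
    queue = deque([fsc.initial])
    while queue:
        n = queue.popleft()
        for (src, dst) in fsc.delta:
            if src == n and dst not in dist:
                dist[dst] = dist[n] + 1
                queue.append(dst)
    return dist

def is_pseudo_primitive(fsc, horizon):
    """Definition 4.15: no abstract node reachable within the horizon,
    and no unbound variables."""
    dist = distances(fsc)
    return (not unbound(fsc) and
            all(dist.get(n, horizon) >= horizon for n in abstract_nodes(fsc)))
```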

In a decomposable POHTN problem $PH$, one can be sure a pseudo-primitive controller can be refined into at least one, but potentially many, controllers in $L(PH)$. All of these controllers have the same expected accumulated reward:

Theorem 4.16. Let $PH = (P, C, \mathcal{F}_{abstract\text{-}obs}, M, fsc_I)$ be a POHTN problem and $P = (\mathcal{S}, \mathcal{F}, dom, cdm, class, D, val, CPF_{\mathcal{F}}, R, s_I, H)$ its underlying POMDP. Let $fsc = (N, \lambda, \delta, n_I, \beta)$ be a pseudo-primitive abstract LFSC with $fsc_I \to^*_M fsc$. For all $fsc', fsc'' \in L(PH)$ such that $fsc \to^*_M fsc'$ and $fsc \to^*_M fsc''$, it holds that $V^H_{fsc'}(s_I) = V^H_{fsc''}(s_I)$.

Proof. By Definition 4.15, all nodes reachable in $H$ time steps from $n_I$ in $fsc$ are primitive. Equation 2.4 only considers parts of the policy reachable in $H$ time steps, so any differences between $fsc'$ and $fsc''$ do not influence their expected accumulated reward.

This means that one can assign a unique expected accumulated reward $V^H_{fsc}(s_I)$ to a given pseudo-primitive controller $fsc$ as the expected accumulated reward of one of its refinements in $L(PH)$. Also, the value of $V^H_{fsc}(s_I)$ can be determined without ever computing such a refinement. This justifies using pseudo-primitive controllers instead of LFSCs as terminal search states in the decomposition MDP.

It still needs to be ensured that a pseudo-primitive controller is reached after a finite number of decompositions. For this, monotone POHTN problems are required:

Definition 4.17. Let $PH = (P, C, \mathcal{F}_{abstract\text{-}obs}, M, fsc_I)$ be a POHTN problem. Consider the directed graph $G = (C, E)$ whose nodes are abstract action schemata and where $E = \{(c, c') \mid \exists m \in M,\ m = (c, mfsc, O_m, \tau),\ mfsc = (N, \lambda, \delta, n_I, \beta),\ c' \text{ is the schema of } \lambda(n_I)\}$ contains an edge from $c$ to $c'$ whenever there is a method for $c$ where the initial node of its method controller is an abstract action whose schema is $c'$. Then $PH$ is called monotone if and only if $G$ is acyclic.

Monotonicity ensures that a method with a primitive initial node is applied eventually. Given that the nodes to decompose are chosen in an appropriate manner, i.e., with the right implementation of $choose()$, this can be used for deriving an upper bound on the number of decompositions that need to be applied. Some definitions are required as prerequisites, so let

$$amd(fsc) = \{n \in abstract(fsc) \mid dist(n) = \min_{n' \in abstract(fsc)} dist(n')\}$$

denote the set of abstract nodes of minimal distance for a given abstract LFSC. Let further, for an abstract LFSC $fsc = (N, \lambda, \delta, n_I, \beta)$,

$$isucc(fsc) = \left| \{n' \in N \mid \delta(n_I, n') \neq false\} \right|$$

denote the number of successors of its initial node and let

$$B = \max\left( \{isucc(mfsc) \mid (c, mfsc, O_m, \tau) \in M\} \cup \{isucc(fsc_I)\} \right)$$

denote the maximum number of successors of the initial node of any method controller or initial controller. This allows for the following statement:


Theorem 4.18. Let $PH = (P, C, \mathcal{F}_{abstract\text{-}obs}, M, fsc_I)$ be a monotone POHTN problem. When the next node to decompose is always chosen from $amd(fsc)$, i.e., is one of minimal distance, then every sequence of $|C| \frac{B^H - 1}{B - 1}$ or more decompositions applied to $fsc_I$ will yield a pseudo-primitive abstract LFSC.

Proof. Let $fsc_I = (N, \lambda, \delta, n_I, \beta)$ be the initial controller of $PH$. In the worst case, $n_I$ is already abstract. Since $PH$ is monotone, the maximum number of decompositions that need to be applied before $n_I$ is primitive is given by $l + 1$, where $l$ is the length of the longest path in the graph $G$ of Definition 4.17. This number is bounded by the number of abstract action schemata, i.e., $l + 1 \leq |C|$. After applying these decompositions, $n_I$ is primitive, so the minimum distance of the remaining abstract nodes is one. The maximum number of successor nodes of $n_I$ is $B$. In the worst case, each of these $B$ successor nodes is again abstract and requires $|C|$ decompositions to become primitive. However, once all these decompositions have been applied, the minimum distance of all remaining abstract nodes is two. This process thus has to be iterated only until the minimum distance of the remaining abstract nodes is $H$. This means that the number of decompositions can be bounded by

$$|C| + B|C| + B^2|C| + \cdots + B^{H-1}|C| = |C| \sum_{i=0}^{H-1} B^i = |C| \frac{B^H - 1}{B - 1},$$

concluding the proof.
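A quick numeric check of the bound, with illustrative values not taken from the evaluation:

```python
# Worked instance of the bound |C| * (B^H - 1) / (B - 1): with |C| = 3
# abstract schemata, branching B = 2, and horizon H = 4, at most
# 3 * (2**4 - 1) / (2 - 1) = 45 decompositions yield a pseudo-primitive LFSC.
C, B, H = 3, 2, 4
bound = C * (B**H - 1) // (B - 1)
assert bound == 45 == sum(C * B**i for i in range(H))
```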

Note that $B$ is bounded by the number of observations $|O|$: consider the POHTN hierarchy used in the proof of Theorem 4.11. The method controllers of the methods $m_a$ defined in this hierarchy have a different successor node for every possible observation, i.e., $|O|$ many. This is the maximum number of successor nodes any node can have; adding a transition to a further node with a satisfiable transition condition will inevitably destroy the well-definedness of the method. This puts the result of Theorem 4.18 in line with the fact that the size of a policy tree in a POMDP with observation set $O$ is $\frac{|O|^H - 1}{|O| - 1}$. In a typical POHTN, though, $B$ can be expected to be much lower than $|O|$.

An interesting special case of monotonicity is given when all initial nodes of all methods are primitive, as is the case for the Home Theater Setup example. This special case is closely related to the Greibach normal form for context-free grammars [Greibach, 1965], where the right-hand side of every production rule must begin with a terminal symbol, followed by an arbitrary number of non-terminal symbols. There also exists a generalization of the Greibach normal form to some classes of context-free graph grammars [Engelfriet, 1992]. The existence of a generalized Greibach normal form for graph grammars suggests that a similar normal form might exist for POHTN problems. This would mean that monotonicity does, in principle, not impose a restriction on the class of POHTN problems that can be solved using MCTS. A more thorough analysis of this matter is left as future work, though.

Another thing worth noting is the relationship between monotonicity and decomposability. Monotonicity does not imply decomposability: a POHTN hierarchy whose only method is the one presented in Figure 4.6 is monotone, because the initial node of the method controller is primitive. It is, however, not decomposable without the method shown in Figure 4.3. Nor does decomposability imply monotonicity: consider a POHTN problem that represents the context-free grammar with the production rules $S \to a$ and $S \to Sa$, where $a$ is a primitive action and $S$ is a parameterless abstract action schema. Obviously, every occurring $S$ can be decomposed into an $a$, so this hierarchy is decomposable, but since the controller of the method corresponding to $S \to Sa$ starts with an abstract action, it is not monotone.

4.2.4. The Decomposition MDP

Summing up the above modifications of the initial decomposition MDP definition, a revised definition is the following:

Definition 4.19. For a given POHTN problem $PH = (P, C, \mathcal{F}_{abstract\text{-}obs}, M, fsc_I)$, the decomposition MDP is given by the tuple $(S^{PH}, A^{PH}, G^{PH}, s^{PH}_I, S^{PH}_T)$, where

• The set of search states is defined as $S^{PH} = \{fsc \mid fsc_I \to^*_M fsc\}$, i.e., the set of all abstract LFSCs that can be generated from $PH$.

• For a given search state $fsc = (N, \lambda, \delta, n_I, \beta) \in S^{PH}$, the set of applicable actions is defined as

$$A^{PH}(fsc) = \begin{cases} \{[choose(unbound(fsc)) \mapsto d] \mid d \in D\} & \text{if } unbound(fsc) \neq \emptyset \\ \{(choose(amd(fsc)), m) \mid m \text{ is a method for the schema of } \lambda(choose(amd(fsc)))\} & \text{else.} \end{cases}$$

• The simulator $G^{PH}$ is defined as $(fsc', r) \sim G^{PH}(fsc, a)$ where

$$fsc' = \begin{cases} apply(fsc, n, m) & \text{if } a = (n, m) \\ apply(fsc, [x \mapsto d]) & \text{if } a = [x \mapsto d] \end{cases}$$

and

$$r = \begin{cases} \hat V^H_{fsc'}(s_I) & \text{if } fsc' \in S^{PH}_T \\ 0 & \text{else.} \end{cases}$$

• The initial search state $s^{PH}_I = fsc_I$ is the initial controller.

• The terminal states $S^{PH}_T = \{fsc \mid fsc_I \to^*_M fsc,\ fsc \text{ is pseudo-primitive}\}$ are the pseudo-primitive controllers that can be generated from $PH$.

The new definition differs from the old definition in two respects. First, $A^{PH}(fsc)$ is such that free variables are bound before decompositions are applied, and decompositions are only applied to nodes closest in distance to the initial node. Second, terminal states are pseudo-primitive controllers. Again, a policy $\pi$ for a decomposition MDP deterministically determines a terminal state, i.e., a pseudo-primitive controller $s_T(\pi)$. By requiring decomposable and monotone POHTN problems, Definition 4.19 allows for reducing POHTN planning to planning in an MDP:


Theorem 4.20. For a decomposable, monotone POHTN problem $PH = (P, C, \mathcal{F}_{abstract\text{-}obs}, M, fsc_I)$, its decomposition MDP $MDP^{PH} = (S^{PH}, A^{PH}, G^{PH}, s^{PH}_I, S^{PH}_T)$ as in Definition 4.19 defines a generative indefinite horizon MDP according to Definition 2.11, where for every controller $fsc$ it holds that $fsc \in L(PH)$ if and only if there exists a policy $\pi$ for $MDP^{PH}$ such that

• $s_T(\pi) \to^*_M fsc$, i.e., $s_T(\pi)$ can be refined into $fsc$, and

• $V_\pi(s^{PH}_I) = V^H_{fsc}(s_I)$, i.e., $\pi$ has the same value as $fsc$.

Proof. First, one needs to prove that $MDP^{PH}$ conforms to Definition 2.11, i.e., that $h(s^{PH}_I) < \infty$. The set of terminal states $S^{PH}_T$ is equal to the set of pseudo-primitive controllers, so $h(fsc) = 0$ for all pseudo-primitive controllers $fsc$. Since $PH$ is monotone, $s^{PH}_I = fsc_I$, and nodes to decompose are chosen from $amd(fsc)$, Theorem 4.18 implies that applying methods at most $|C| \frac{B^H - 1}{B - 1}$ times ensures reaching a pseudo-primitive controller from $fsc_I$. The number of new variables introduced in each method application is finite, so let $|V|_{max}$ denote the maximum number of variables that are introduced in a given step. Letting additionally $|V|_{init}$ denote the number of variables present in $fsc_I$, it holds that

$$h(s^{PH}_I) \leq |V|_{init} + (|V|_{max} + 1) |C| \frac{B^H - 1}{B - 1} < \infty.$$

Second, one needs to prove the relationship between $L(PH)$ and the policies of $MDP^{PH}$. Let $fsc \in L(PH)$. Since this means $fsc_I \to^*_M fsc$, there exists a sequence $(a_0, \ldots, a_{n-1})$ of method applications and variable bindings that induces a sequence of abstract LFSCs $(fsc_0, \ldots, fsc_n)$ where $fsc_0 = fsc_I$, $fsc_n = fsc$, and $fsc_{i+1} = apply(fsc_i, a_i)$. The order in which abstract nodes are decomposed is arbitrary, so one can assume without loss of generality that nodes with minimal distance are decomposed first in $(a_0, \ldots, a_{n-1})$. Also, as required by Definition 4.5, any unbound variables introduced by decomposition are bound before the next decomposition is applied. Let $T$ be the smallest index such that $fsc_T$ is pseudo-primitive, which must exist because $fsc_n$ is an LFSC, i.e., fully primitive. A policy $\pi$ for $MDP^{PH}$ can be constructed from this by letting $\pi(fsc_i) = a_i$ for $0 \leq i < T$. It holds that $s_T(\pi) = fsc_T$ and, by construction, $fsc_T \to^*_M fsc$. Following Theorem 4.16, $fsc_T$ can be assigned a unique expected accumulated reward $V^H_{fsc_T}(s_I)$ equal to $V^H_{fsc}(s_I)$. The value $V_\pi(s^{PH}_I)$ of $\pi$ is equal to the expectation of $\hat V^H_{fsc_T}(s_I)$, since $r$ is 0 for non-terminal states. Thus, $V_\pi(s^{PH}_I) = \mathbb{E}\,\hat V^H_{fsc_T}(s_I) = V^H_{fsc_T}(s_I) = V^H_{fsc}(s_I)$.

Conversely, consider an arbitrary policy $\pi$ for $MDP^{PH}$. By Definition 4.19, $s_T(\pi)$ is a pseudo-primitive controller. Because $PH$ is decomposable, at least one LFSC $fsc \in L(PH)$ such that $s_T(\pi) \to^*_M fsc$ is guaranteed to exist. The same reasoning as above can be used for proving that $V_\pi(s^{PH}_I) = V^H_{fsc}(s_I)$.

Theorem 4.20 implies that there is a policy π* that is optimal for MDP_{PH} and that has the same value as an optimal controller fsc* for PH. Also, such an optimal controller can be extracted from π*, which makes MCTS planning on MDP_{PH} a viable procedure for solving POHTN problems.
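To illustrate how a planner can operate on the decomposition MDP, the following simplified sketch performs flat Monte-Carlo search in the space of partially abstract controllers. It is not the MCTS implementation used in Chapter 5, and all names (Controller, MethodApplication, DecompositionMdp, FlatMonteCarloPlanner) are hypothetical placeholders for the concepts defined above.

import scala.collection.mutable
import scala.util.Random

// States of the decomposition MDP are (partially abstract) controllers; actions
// are variable bindings and method applications; transitions are deterministic.
trait Controller { def isPseudoPrimitive: Boolean }
final case class MethodApplication(id: Int)

trait DecompositionMdp {
  def initial: Controller
  def applicable(c: Controller): Seq[MethodApplication] // nonempty for non-terminal states
  def applyTo(c: Controller, a: MethodApplication): Controller
  def simulate(terminal: Controller): Double // one sampled H-step execution of the controller
}

final class FlatMonteCarloPlanner(mdp: DecompositionMdp, rng: Random) {
  // Accumulated return and visit count per (controller, action) edge.
  private val stats = mutable.Map.empty[(Controller, MethodApplication), (Double, Int)]

  // One planning round: refine the initial controller until it is pseudo-primitive,
  // sample its execution value, and back the value up along the refinement trace.
  def round(): Unit = {
    var c = mdp.initial
    val trace = mutable.Buffer.empty[(Controller, MethodApplication)]
    while (!c.isPseudoPrimitive) {
      val actions = mdp.applicable(c)
      val a = actions(rng.nextInt(actions.size)) // uniform choice; UCT would select here
      trace += ((c, a))
      c = mdp.applyTo(c, a)
    }
    val value = mdp.simulate(c) // intermediate rewards are 0, so only the terminal value counts
    for (edge <- trace) {
      val (sum, n) = stats.getOrElse(edge, (0.0, 0))
      stats(edge) = (sum + value, n + 1)
    }
  }
}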


It is worth noting that the efficiency of this procedure compares favorably to MCTS in the space of observable histories (see Section 2.4.4) for constructing a policy for a factored POMDP: it solves the curse of dimensionality in the same manner by avoiding reasoning over probability distributions over world states. Furthermore, the branching factor of the generated search tree does not depend on the number of action or observation fluents, but on the number of methods. In contrast, MCTS in the space of histories has a branching factor that is exponential in the number of action fluents. A seeming disadvantage of MCTS on the decomposition MDP is the maximum length of a trace, which is exponential in the time horizon H according to Theorem 4.18, as opposed to linear for MCTS in the space of histories. This is addressed in Chapter 6, however, which gives a more detailed complexity analysis and shows how the dependence on H can be reduced from exponential to linear.

4.3. Reducing Deterministic HTN Planning to POHTN Planning

Given the goal of this thesis, the question arises naturally whether POHTN planning constitutes a proper generalization of HTN planning, in the sense that POHTN planning can be employed to solve totally ordered HTN planning problems as defined in Section 2.5. Answering this question in the affirmative requires constructing a POHTN problem for any given HTN problem and using a solution for the former to create a solution for the latter. The following section will show that this is indeed possible. The construction consists of two parts, a factored POMDP that represents the HTN domain dynamics and a POHTN hierarchy that represents the HTN hierarchy.

4.3.1. Representing HTN Domain Dynamics

For the most part, the construction of the POMDP is rather straight-forward. For a given HTN problem DH = (S, F_{state}, dom, cdm, D, val, A, C, M, P_I, s_I), the sorts S, fluent domain and co-domain functions dom and cdm, domain elements D together with their corresponding sort given by val, and initial state s_I can be carried over directly to the HTN domain dynamics POMDP P_{DH} = (S, F_{DH}, dom, cdm, class_{DH}, D, val, CPF^{DH}_F, R_{DH}, s_I, H_{DH}).

The first construction step is to introduce a Boolean action fluent with the appropriate arity and a default value of false for each primitive HTN action schema, i.e., F_{DH} = F_{state} ∪ A. This is also reflected in class_{DH}, where

\[ class_{DH}(F) = \begin{cases} act & F \in A \\ state & F \in F_{state} \end{cases} \]

No observation fluents are required, since the initial state is known and the domain dynamics are deterministic. Then, conditional probability functions that implement the effects of primitive HTN actions are required. This can be done with a technique similar to the one employed by Eyerich et al. [2006], which translates a fragment of PDDL, i.e., a generalization of the language used for describing HTN domain dynamics in this thesis, into Golog, i.e., a planning problem description language based on the situation calculus, which in turn describes planning problems in a fluent-centric manner similar to RDDL. The idea is to construct the CPF cpf_F of a given (Boolean) state fluent symbol F ∈ F_{state} such that d(F)(d_1, ..., d_n) = true in the successor state if and only if either an action was executed that set its value to true, or its value was true in the current state and no action was executed that set its value to false. This can simply be represented by a formula over state and action fluents with free variables v_1, ..., v_n. Executing POMDP action a in state s will lead to a state in which, for every valuation β that binds v_1, ..., v_n, it holds that d(F)(β(v_1), ..., β(v_n)) = true if and only if

\[ ((s, a), \beta) \models \bigvee_{ha \in +F(v_1,\dots,v_n)} \big(\mathit{prec}(ha) \wedge ha\big) \;\vee\; \Big( F(v_1,\dots,v_n) \wedge \neg \bigvee_{ha \in -F(v_1,\dots,v_n)} \big(\mathit{prec}(ha) \wedge ha\big) \Big), \]

where +F(v_1, ..., v_n) and −F(v_1, ..., v_n) are the sets of HTN actions whose effects contain F(v_1, ..., v_n) or ¬F(v_1, ..., v_n), respectively. Note that there is no parallelism between actions in HTN planning, i.e., a must correspond to one HTN action ha (or none of them). Hence, a must be an action that can be obtained from the default action a_{def} with at most one update. This means that when a = a_{def}[ha] for a given HTN action ha, the above formula simplifies to either prec(ha) ∧ ha if ha ∈ +F(v_1, ..., v_n), i.e., ha makes F(v_1, ..., v_n) true, or F(v_1, ..., v_n) ∧ ¬(prec(ha) ∧ ha) if ha ∈ −F(v_1, ..., v_n), i.e., ha makes F(v_1, ..., v_n) false. Executing a_{def} does not correspond to executing an HTN action, but takes the role of signifying plan completion, as will become apparent below.
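As a concrete instance of this construction, consider a hypothetical fluent and two hypothetical actions (neither is taken from the evaluation domains): a state fluent holding(?x) whose value is added by an HTN action pickup(?x) and deleted by an HTN action putdown(?x). The resulting CPF formula then reads

\[ \big(\mathit{prec}(pickup(v_1)) \wedge pickup(v_1)\big) \;\vee\; \Big( holding(v_1) \wedge \neg\big(\mathit{prec}(putdown(v_1)) \wedge putdown(v_1)\big) \Big), \]

i.e., holding(v_1) becomes true when an applicable pickup is executed, and stays true unless an applicable putdown is executed.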

Also note that the CPF construction is such that if an action is executed whose precondition in the original HTN is not satisfied, the state does not change. Specifying what happens when a precondition is violated is unavoidable, since in a POMDP all actions are executable in any state. To rectify this, the reward function is defined such that it gives a penalty whenever an action is executed whose precondition is violated, i.e.:

\[ R_{DH}(s, a, s') = \begin{cases} -H_{DH} & \text{if } a \models ha \text{ and } s \not\models \mathit{prec}(ha) \\ 1 & \text{if } a = a_{def} \\ 0 & \text{else.} \end{cases} \]

The definition of a suitable horizon H_{DH} is left open for the moment.

4.3.2. Representing HTN Methods

For the POHTN hierarchy PH_{DH} = (P_{DH}, C_{DH}, F^{DH}_{abstract-obs}, M_{DH}, fsc_I^{DH}), the HTN abstract action schemata can be carried over, i.e., C_{DH} = C. The set of abstract observation fluents F^{DH}_{abstract-obs} is empty for the same reason as above: none are required.

Encoding an HTN method m = (c, P) ∈ M is rather simple and only requires transforming its subplan P = ((a_0, ..., a_{k−1}), β). The result is a POHTN method m_{DH} = (c, mfsc, O_m, τ) ∈ M_{DH}, where mfsc = (N, λ, δ, n_I, β) is an abstract LFSC with |N| = k nodes, and it holds for all 0 ≤ i, j < k that

\[ \delta(n_i, n_j) = \begin{cases} true & \text{if } j = i + 1 \\ false & \text{else.} \end{cases} \]


Action association is such that λ(n_i) = a_i for all 0 ≤ i < k, and the initial node is n_I = n_0. The termination function of the method controller is such that termination is sure for n_{k−1} and impossible otherwise, i.e.,

\[ \tau(n, o) = \begin{cases} true & n = n_{k-1} \\ false & \text{else,} \end{cases} \]

where o is the empty interpretation, since there are no abstract observation fluents. In other words, the constructed controller corresponds to a sequence of nodes labeled with the actions of P in the same order. The initial plan P_I is transformed in a similar manner into fsc_I^{DH}, except that an additional node is added after n_{k−1}. This node n_{last} has λ(n_{last}) = a_{def} and has a self loop with δ(n_{last}, n_{last}) = true.
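A minimal sketch of this construction, using hypothetical types and names that do not stem from the thesis implementation, might look as follows. It builds the chain of nodes for a method subplan and, for the initial plan, appends the self-looping a_def node:

final case class Node(index: Int, action: String)

final case class SequenceController(
  nodes: Vector[Node],
  initial: Node,
  delta: (Node, Node) => Boolean, // node-to-node transition condition (unconditional here)
  tau: Node => Boolean            // termination; observations are ignored since none exist
)

// For a method subplan (a_0, ..., a_{k-1}): a simple chain terminating at n_{k-1}.
// With closeWithDefault = true (the initial plan), a node labeled with the default
// action a_def is appended that loops on itself and never terminates.
def encodeSubplan(actions: Vector[String], closeWithDefault: Boolean): SequenceController = {
  require(actions.nonEmpty) // assumes k >= 1
  val chain = actions.zipWithIndex.map { case (a, i) => Node(i, a) }
  val nodes = if (closeWithDefault) chain :+ Node(chain.size, "a_def") else chain
  val last = nodes.size - 1
  SequenceController(
    nodes = nodes,
    initial = nodes.head,
    delta = (ni, nj) =>
      nj.index == ni.index + 1 || (closeWithDefault && ni.index == last && nj.index == last),
    tau = n => !closeWithDefault && n.index == actions.size - 1
  )
}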

With these definitions of M_{DH} and fsc_I^{DH}, there exists a bijection σ between the sets of HTN plans {P | P_I →*_M P} and POHTN controllers {fsc | fsc_I^{DH} →*_{M_{DH}} fsc} where

• σ(P_I) = fsc_I^{DH},

• whenever σ(P) = fsc, it follows that σ(apply(P, a_i, m)) = apply(fsc, n_i, m_{DH}) for all HTN methods m ∈ M and corresponding POHTN methods m_{DH} ∈ M_{DH}, and

• σ(P) = fsc implies that σ(apply(P, [x ↦ d])) = apply(fsc, [x ↦ d]) for all variables x and domain elements d.

4.3.3. Matching Solution Criteria

To complete the translation, one needs to overcome the difference between the optimizing nature of POHTN planning and the precondition-based exclusion of plans in HTN planning. Given a value for the horizon H_{DH}, one can make the following statement in this regard:

Theorem 4.21. Let DH be an HTN problem and PH_{DH} its POHTN encoding as above, where H_{DH} > 0. It holds that DH has a solution of length k < H_{DH} if and only if there exists an LFSC fsc ∈ L(PH_{DH}) with expected accumulated reward V^{H_{DH}}_{fsc}(s_I^{DH}) > 0.

Proof. Let fsc ∈ L(PH_{DH}) be a controller with V^{H_{DH}}_{fsc}(s_I^{DH}) > 0. The only opportunity to gain a positive reward is executing the default action. Due to the construction of M_{DH} and fsc_I^{DH}, it is sure that fsc executes all actions that correspond to HTN actions and then executes the default action at least once, all within the time horizon H_{DH}. Moreover, no precondition was violated: assume this were the case and that the i-th action, i < H_{DH}, executed by fsc was executed in a state in which its precondition was violated. As per the definition of R_{DH}, the penalty for violating a precondition is −H_{DH}, so even assuming the maximum immediate reward of +1 is received for all other time steps, this means that

\[ V^{H_{DH}}_{fsc}(s_I^{DH}) \;\leq\; -H_{DH} + \sum_{\substack{j \in \{0,\dots,H_{DH}-1\} \\ j \neq i}} 1 \;=\; -H_{DH} + (H_{DH} - 1) \;=\; -1 \;<\; 0, \]

which is a contradiction. Hence, σ^{-1}(fsc) is executable in s_I. Since fsc_I^{DH} →*_{M_{DH}} fsc and because σ is a bijection between plans and controllers as described above, it holds that P_I →*_M σ^{-1}(fsc). It follows that σ^{-1}(fsc) is a solution of DH that has at most H_{DH} − 1 steps.


In the other direction, let DH have a solution P of length k < H_{DH}. Since P is a solution, executing it does not violate any preconditions. Hence it holds that V^{H_{DH}}_{σ(P)}(s_I^{DH}) ≥ 0. Since P has at most H_{DH} − 1 steps, σ(P) executes the default action at least once within the horizon, and therefore V^{H_{DH}}_{σ(P)}(s_I^{DH}) > 0. Also, it holds that fsc_I^{DH} →*_{M_{DH}} σ(P), and since P is fully primitive and binds all variables, σ(P) ∈ L(PH_{DH}).

A possible way of determining a suitable value for H_{DH} is the following: given an upper bound on a parameter called the non-tail height (or progression bound) of a total-order HTN problem, Alford et al. [2009] describe a linear-time translation of total-order HTN problems into equivalent PDDL planning problems. The maximum solution length for a PDDL problem is the size of its state space [Ghallab et al., 2004, p. 57]. The non-tail height of a total-order HTN problem, in turn, can be upper-bounded using a polynomial-time algorithm [Alford et al., 2016].

HTN problems can thus indeed be represented as equivalent POHTN problems. The translation mechanism needed to overcome some obstacles that stem from differences in domain dynamics representation and solution criteria. Still, translating abstract actions, methods, and the initial plan is straight-forward and direct, so that it is justifiable to regard (total-order) HTN planning as a special case of POHTN planning. Of course, it is still more advisable to solve deterministic HTN problems with a dedicated HTN planner, not least because it is possible that the illustrated method for determining H_{DH} yields impractically large values.

4.4. Comparison to Other Non-Deterministic Hierarchical Approaches

As noted in Section 1.3, the ND-SHOP2 approach can be seen as a wrapper around the standard SHOP2 algorithm that handles non-determinism. This implies that the method task networks of ND-SHOP2 are deterministic and cannot specify how to react to probabilistic action outcomes. The wrapper approach is also the reason that ND-SHOP2 can only be extended to a limited form of partial observability. The POHTN approach circumvents this by using the LFSC policy representation, which is a natural representation for POMDP policies.

The C-SHOP approach allows for branching in methods, but does not allow for cycles as the LFSCs used in the POHTN approach do. Also, the POHTN approach offers a foundation in the well-known framework of POMDPs as well as a formal description of its properties.

Beyond the obvious difference that the HAM approach does not apply to POMDPs, the POHTN and HAM approaches are similar in that both can be characterized as approaches that generate policies by concretizing partially specified policies represented in the form of hierarchically linked finite state automata. In both approaches, control is passed to subcontrollers via call states (resp. abstract nodes), and passed back to calling controllers via stop states (resp. termination functions). Both also have a top-level (resp. initial) controller that does not have a stop state.

The key difference between HAM and POHTN is that in HAM, dedicated choice nodes represent the degrees of freedom in policy construction, whereas the degrees of freedom in the POHTN approach are given by the choice of the implementing subcontroller at call states, i.e., in HTN terms, the choice of methods for abstract actions, without the need for dedicated choice nodes. In that respect, the POHTN controller representation is simpler. More importantly though, for HAM, a decision needs to be made for each world state at each choice node. World states are replaced by belief states or histories in a partially observable setting, of which there are considerably more. This represents a major obstacle in extending HAMs to POMDPs and might explain the apparent absence of corresponding approaches. In contrast, the number of choices for POHTNs is independent of the number of world states, actions, or observations.

Approaches based on MAXQ compute policies bottom-up, i.e., they solve a series of smaller POMDPs to build a policy for the original POMDP. An advantage of this approach is that it allows for very high-quality policies for subtasks, since any kind of algorithm can be employed to create them, including optimal ones. The downside is that the MAXQ approach incorporates little guidance on how to compute these subtask policies. This means that when the original POMDP cannot be broken down into small enough pieces, even computing a policy for one subtask can incur prohibitive computational cost. The HTN family of approaches, including POHTN, avoids this by specifying a limited number of alternatives for performing a subtask.

The PolCA+ and HPOMDP approaches both represent subtask policies as mappings from belief states to actions. This necessitates maintaining an updated belief state at policy execution time, which, as argued above, means exponential effort in the number of state fluents at every time step. MAXQ-hierarchical policy iteration and POHTN avoid this pitfall, since representing policies as finite state controllers alleviates the need for belief updates. On the other hand, belief updates guarantee that information gathered during execution will be correctly memorized, while this is not as straight-forward when (L)FSCs are used. POHTN uses abstract observations of variable granularity as a flexible mechanism for passing on relevant information about the subtask execution history. The idea of abstract observations was also proposed for MAXQ-hierarchical policy iteration. It was, however, not thoroughly integrated into the approach because of its severe effect on the algorithm's scaling behavior.


Chapter 5.

Evaluation

To paraphrase the underlying question of this thesis stated in Section 1.4, the goal is to transfer the advantages of HTN planning to a POMDP setting. This chapter aims to empirically evaluate whether the approach proposed in Chapter 4 succeeds in achieving this goal in one of its most important aspects, namely with respect to possible efficiency gains achievable by exploiting expert knowledge. It starts with a description of the chosen experimental setting. After that, data on four evaluation domains is shown, each time including a description of the domain, problem generation, and the POHTN hierarchy. The chapter concludes with a summary of the results.

5.1. Experimental Setting

As shown in Chapter 4, in particular in Definition 4.19 and Theorem 4.20, POHTN planning problems can be represented as MDPs, which in turn can be solved using Monte-Carlo Tree Search. This approach has been implemented in Scala using a generic implementation of the MCTS algorithm shown in Section 2.4, a component that parses POHTN hierarchies and represents them as MDPs, and a simulator for RDDL POMDP domains.

A baseline to compare this approach to is given by non-hierarchical MCTS in the space of histories as described in Section 2.4.4, implemented using the same generic MCTS implementation and simulator for RDDL POMDP domains. Both approaches share an important commonality that enables a direct comparison. To see this, remember that, in each round of search, the history-based approach chooses and simulates H POMDP actions until the horizon is reached, while the POHTN approach constructs a policy and executes this policy for H time steps. I.e., they each generate one policy execution trace per round of search, using H calls to the POMDP simulator. This means that the relative performance of both approaches can be assessed by comparing policy quality for a fixed number of calls to the POMDP simulator, or, equivalently, a fixed number of planning iterations. This simple and well-defined criterion is related to the concept of sample complexity in POMDPs, which aims to quantify the required number of calls to a POMDP simulator for finding a near-best policy from a given set of policies [Kearns et al., 1999].
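The comparison protocol can be summarized in a few lines of code. The following sketch uses hypothetical interfaces (Planner, Policy, PomdpSimulator and their members are illustrative only, not the interfaces of the actual implementation):

trait PomdpSimulator
trait Policy { def execute(sim: PomdpSimulator): Double } // reward of one H-step trace
trait Planner { def plan(iterations: Int): Policy }       // each iteration costs H simulator calls

// Grant both planners the same number of planning iterations (and thus the same
// number of simulator calls), then estimate each policy's quality by the mean
// accumulated reward over evalRuns executions.
def compare(pohtn: Planner, historyBased: Planner, sim: PomdpSimulator,
            iterations: Int = 10000, evalRuns: Int = 40): (Double, Double) = {
  def score(p: Planner): Double = {
    val policy = p.plan(iterations)
    Vector.fill(evalRuns)(policy.execute(sim)).sum / evalRuns
  }
  (score(pohtn), score(historyBased))
}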

Using the implementation mentioned above, data on policy quality as a function of the employed planning approach was generated according to the following procedure: given an RDDL POMDP domain, a problem size, and a random seed, a non-hierarchical factored POMDP planning problem as described in Section 2.3.5 was generated using a domain-specific problem generator. Using the SoftMaxUCT MCTS algorithm described in Section 2.4, a policy was then generated for this problem with both the history-based and POHTN approaches, with a domain- but not problem-specific POHTN hierarchy for the latter. Unless stated otherwise, both planning approaches were run for 10,000 iterations. The quality of the resulting policies was approximated by executing them 40 times and taking the arithmetic mean over the resulting accumulated rewards. The whole procedure was repeated 100 times using different seeds for the random number generator.

The Wilcoxon signed rank test was used for statistical analysis of the differences in policy quality between the POHTN and history-based approaches [Wilcoxon, 1945]. It is adequate in this situation for several reasons:

• It is applicable to paired data, i.e., when each data point is a pair of values, as is the case in this setting.

• It does not require the data to be normally distributed, in contrast to, e.g., the paired Student's t-test. Looking at the data shown below, it is immediately apparent that this is indeed not the case.

• It can give estimates on the magnitude of the difference, in contrast to, e.g., the simpler sign test.

Based on the theoretical analyses in Chapter 4, the following expectations on experimental results can be formulated for the proposed comparison:

• With few planning iterations, planning with the POHTN approach will have an advantage over history-based planning: the constraints that encode the expert knowledge in the POHTN hierarchy result in a smaller search space than in the case of history-based planning. When the POHTN hierarchy is chosen well, this smaller search space will result in fast convergence to a good policy.

• With enough planning iterations, history-based planning will eventually produce better policies than planning using the POHTN approach: unless the constraints imposed by the POHTN hierarchy allow for representing an optimal non-hierarchical policy, the anytime optimality of MCTS means that the history-based approach will eventually produce a better policy than the POHTN approach. Depending on the structure of the planning problem at hand, however, it might take a very long time until this point is reached.

Experiments were conducted in four evaluation domains: the first domain is the Home Theater Setup domain used as a running example throughout this thesis. The other three domains, Skill Teaching, Navigation, and Elevators, are taken from the 2011 International Probabilistic Planning Competition (IPPC). The rest of this chapter is divided into two parts. The first part contains a section for each of the domains, describing the domain itself, its respective POHTN hierarchy, the generation of planning problems of various sizes, and experimental results. The second part draws a résumé of the results and presents some other conducted analyses. Additional information, including where to find the source code of the implementation, code for generating plots, and RDDL code of the evaluation domains and POHTN hierarchies, can be found in Appendix A.


5.2. Home Theater Setup

The first evaluation domain is the Home Theater Setup domain introduced in Section 2.3.6. The RDDL description of the domain and an example instance can be found in Appendix B.

5.2.1. Problem Generation

For this evaluation, each problem uses the same signal-producing and signal-displaying devices, namely a TV, a satellite receiver, an amplifier, and a Bluray player. The task is always to bring the video signals of both the satellite receiver and the Bluray player to the TV and their respective audio signals to the amplifier. Available cables are taken from a fixed pool of predefined cables C: to generate a problem instance of size n, a subset C′ ⊆ C of available cables is chosen uniformly from the set {C′ ∈ 2^C | |C′| = n}. The number of actions required for making all necessary connections is 6, provided that the right cables are available. To further allow for checking and tightening some of the connections, the planning horizon was set to 10.
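A uniform draw from {C′ ∈ 2^C | |C′| = n} can be obtained by shuffling the pool and taking a prefix, as in this small sketch (the Cable type parameter and the function name are placeholders, not names from the implementation):

import scala.util.Random

def sampleCables[Cable](pool: Vector[Cable], n: Int, rng: Random): Set[Cable] =
  rng.shuffle(pool).take(n).toSet // every size-n subset is equally likely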

5.2.2. POHTN Hierarchy

The POHTN hierarchy, as already outlined in Section 4.1, features a single abstract action schema connect_abstract(?device1,?device2) and an abstract observation fluent loose_connection_found. For the schema connect_abstract(?device1,?device2), methods define how to connect the devices directly, how to connect them indirectly via other devices, how to connect the same two devices twice, in case separate cables are necessary for transporting video and audio signals, and also how to reuse an existing connection established while connecting other devices.

The initial controller for the generated problems is always the controller shown in Figure 5.1, which abstractly specifies that both the Bluray player and the satellite receiver should be connected to both the amplifier and the TV, and the order in which they should be connected.

5.2.3. Results

Figure 5.2 compares POHTN and history-based policy quality after 10,000 iterations for problem sizes ranging from 1 to 12. It can be seen that the POHTN approach outperforms the history-based approach on every single generated problem instance. Indeed, the history-based approach never finds policies with an expected accumulated reward above 0. Apparently, the action branching factor in the Home Theater Setup domain is too large for the history-based approach to yield useful policies: letting P and D be the number of ports and devices, respectively, the connect(?device,?device,?port), tighten(?device,?device,?port), and checkSignals(?device) action fluents together yield 2D²P + D actions. The generated problems define P = 7 different ports and D = 4 + n devices, hence an instance of size n has 14n² + 113n + 228 actions. Furthermore, most of these actions will yield a negative immediate reward when executed, e.g., checking uncheckable devices, attempting to tighten a non-existing connection, or trying to connect the TV and an HDMI cable via a DVI port. The POHTN hierarchy is able to rule out many of these useless actions, as can be seen by the fact that the generated POHTN policies rarely have an accumulated reward below 0.
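For reference, substituting the problem-generation values P = 7 and D = 4 + n into the action count confirms the quadratic figure stated above:

\[ 2D^2P + D = 14(n+4)^2 + (n+4) = 14n^2 + 112n + 224 + n + 4 = 14n^2 + 113n + 228. \]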


[Figure 5.1: a chain of controller nodes connect_abstract(Sat,Amplifier) (start) → connect_abstract(Bluray,Amplifier) → connect_abstract(Sat,TV) → connect_abstract(Bluray,TV) → noop.]

Figure 5.1.: The initial controller used for all generated Home Theater Setup problems.

Looking more closely at the results, it can be seen that the quality of POHTN policies for instance sizes 1 to 3 is often also near zero. This happens when none of the available cables can be connected to transport any signal to a target device. For sizes 4 and above, the available cables always suffice for at least partially setting up the home theater system. POHTN policy quality is also quite stable with a growing number of cables, i.e., it scales up to larger instance sizes. For further results on POHTN planning in this domain, see Richter and Biundo [2017].

5.3. Skill Teaching

The Skill Teaching domain from the 2011 IPPC models a tutoring system whose task is to teach a set of interdependent skills to a student using hints and multiple choice questions [Green et al., 2011]. Proficiency for each skill is either low, medium, or high. The system's goal is to bring each skill to high proficiency level. Skills cannot be taught in an arbitrary order, however, since acquiring introductory skills might be necessary for understanding advanced lessons. There are two action fluents. The first is giveHint(?skill), which deterministically increases the proficiency level of a skill to medium when all prerequisite skills are on high proficiency level. The other action fluent is askProb(?skill), which asks a multiple choice question for one skill. The student's chances of answering such a question correctly depend on the current proficiency level of the skill in question, and on the current proficiency levels of prerequisite skills. A correct answer has a high chance of increasing the proficiency level, while a wrong answer can also decrease the proficiency level.¹ There is also a chance for a false positive answer, where the student's skill level does not increase despite a correct answer.

In every other time step, the agent can either give a hint or ask a multiple choice question for one skill.²

¹ It seems debatable whether the latter property models reality well. However, changing it would have required major modifications to the IPPC domain.


[Figure 5.2: scatter plots; panel headers: n = 1 (m = 7.99, p = 3.94e−18), n = 2 (m = 10, p = 3.95e−18), n = 3 (m = 13.2, p = 3.96e−18), n = 4 (m = 14.9, p = 3.96e−18), n = 5 (m = 14.8, p = 3.95e−18), n = 6 (m = 15.2, p = 3.95e−18), n = 7 (m = 15.7, p = 3.95e−18), n = 9 (m = 15.7, p = 3.95e−18), n = 12 (m = 15, p = 3.95e−18); axes: policy quality with history-based approach (x) vs. policy quality with POHTN approach (y).]

Figure 5.2.: Scatter plots comparing POHTN and history-based policy quality on the Home Theater Setup domain for various problem sizes n. Each point represents a single problem instance whose position is determined by the estimated quality of the generated POHTN and history-based policies after 10,000 planning iterations. When a point is above the line, it means that the POHTN policy outperformed the history-based policy. The blue concentric curves represent a density estimate to aid seeing higher point density areas. For each problem size, the values m and p respectively give the median difference and the significance level concerning whether m is different from 0, as determined using a Wilcoxon signed rank test. One can see that the POHTN approach very consistently yields better policies.


The only available observations are whether multiple choice questions are answered correctly. Each skill s_i has a weight w_i, so some skills are more important than others. For each skill s_i, the agent is awarded −w_i, 0, or +w_i in each time step, depending on whether the proficiency level is low, medium, or high, respectively.

5.3.1. Problem Generation

The size parameter n determines the number of skills to teach. Skill dependencies are generated as follows: for each pair of skills (s_i, s_j) where 0 ≤ i < j < n, flip a fair coin to determine whether s_i is a prerequisite of s_j. Then, rename the skills using a uniformly randomly chosen permutation σ, such that whenever s_i is a prerequisite of s_j before renaming, afterwards s_σ(i) is a prerequisite of s_σ(j). This uniformly samples a directed, acyclic graph of skill dependencies. Given that all prerequisite skills of s are on high proficiency level, the chance for a correct answer is p_l = 0.5 or p_m = 0.7, when the current proficiency for s is low or medium, respectively. Probabilities for answering questions for s correctly when not all prerequisites are on high proficiency level were set to kp/(m + 1), where m is the total number of prerequisites of s, k is the number of prerequisites currently on high proficiency level, and p is p_l or p_m, depending on whether the current proficiency of s is low or medium. Thus, not knowing all prerequisites always leads to a lower probability of answering correctly than knowing all prerequisites, and learning the last prerequisite skill gives a proportionally bigger boost.
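The dependency sampling can be sketched as follows (sampleSkillDag is a hypothetical name; this is not the thesis problem generator):

import scala.util.Random

// Flip a fair coin for each pair (s_i, s_j), i < j, then relabel the skills with
// a uniformly random permutation. The result is a uniformly sampled DAG, since
// all edges point from lower to higher index before relabeling.
def sampleSkillDag(n: Int, rng: Random): Set[(Int, Int)] = {
  val perm = rng.shuffle((0 until n).toVector) // the permutation sigma
  val edges = for {
    i <- 0 until n
    j <- i + 1 until n
    if rng.nextBoolean() // fair coin: is s_i a prerequisite of s_j?
  } yield (perm(i), perm(j)) // after renaming, sigma(i) is a prerequisite of sigma(j)
  edges.toSet
}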

For the skill weights, a vector (x_1, ..., x_n) is first sampled uniformly from the open (n − 1)-simplex, i.e., an n-dimensional vector where Σ_i x_i = 1 and x_i ∈ (0, 1) for all i. This is done by sampling from a Dirichlet distribution of order K = n with parameters (α_1, ..., α_n) = (1, ..., 1). Then, given a total skill weight w, which was chosen to be 5 for this evaluation, each w_i is defined as w_i = w x_i. Hence, the immediate reward for any instance is always in [−w, w]. It is −w when all skills are on low proficiency level and w when all skills are learned. Together with the planning horizon of 40 time steps, the consequence is that there exist convenient common lower and upper bounds of −200 and 200 on the quality of all policies for every instance, simplifying the creation of plots that relate instance size with planning performance.
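Since Dirichlet(1, ..., 1) is the uniform distribution on the simplex, such a vector can be sampled by normalizing independent Exponential(1) draws. The sketch below (the name sampleSkillWeights is hypothetical) follows this standard construction:

import scala.util.Random

def sampleSkillWeights(n: Int, totalWeight: Double, rng: Random): Vector[Double] = {
  // -ln(U) with U ~ Uniform(0,1] is Exponential(1); normalizing the vector
  // yields a uniform sample (x_1, ..., x_n) from the open (n-1)-simplex.
  val exps = Vector.fill(n)(-math.log(1.0 - rng.nextDouble()))
  val sum = exps.sum
  exps.map(x => totalWeight * x / sum) // w_i = w * x_i
}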

5.3.2. POHTN Hierarchy

The top level task of the POHTN hierarchy chosen for the Skill Teaching domain is teachAll. It has two implementations, one for teaching a single skill with the abstract action schema teach(?skill) and then recursing with teachAll, and the other for stopping the recursion, as shown in Figure 5.3. Teaching a single skill tries to bring the student's proficiency level to high, while finding the correct order, i.e., making sure that prerequisites are all learned in order, is left to the planner.

The first implementation of teach(?skill) is meant for a skill s with currently low proficiency level and uses the following procedure: first, use giveHint(?s). It is the most efficient action for changing proficiency to medium, as it is guaranteed to work.

² An actual teaching action can only be executed every other time step due to the way the domain is encoded in RDDL. It is an unfortunate side effect of the restricted RDDL language fragment used in the 2011 IPPC, where intermediate fluents were not allowed. Using intermediate fluents could alleviate this.


[Figure 5.3: two method controllers; one consists of a single noop node (start), the other of a node teach(?s) (start) followed by a node teachAll.]

Figure 5.3.: The method controllers for the two methods of teachAll.

[Figure 5.4: a controller starting at a node giveHint(?s), passing through noop nodes to a node askProb(?s), with transitions labeled answeredRight(?s) and ¬answeredRight(?s).]

Figure 5.4.: The method controller of one method of teach(?s). The noop nodes are an artifact of the particular encoding of the Skill Teaching domain used in the 2011 IPPC, where a useful action can only be executed every other time step.

Then, switch to using multiple choice questions with askProb(?s). If the student answers correctly, the proficiency level can be assumed to be high, and the method controller terminates. If not, the controller returns to its initial node, i.e., it gives another hint, since the wrong answer might have decreased the proficiency level to low. This controller is shown in Figure 5.4.

The second method uses the same method controller, except that the initial node is now the one labeled with askProb(?s). This method exists to account for previous false positive answers when using the first method, where a multiple choice question was answered correctly, but the student's skill level did not actually increase.

5.3.3. Results

Figure 5.5 compares POHTN and history-based policy quality after 10,000 iterations. In this domain, the POHTN approach also gives an advantage over the history-based approach. Four interesting observations can be made, however.

First, there is a clear correlation between POHTN and history-based policy quality, as can be seen from Table 5.1. This suggests that there is an inherent difficulty to the generated problems that affects both approaches, but that the POHTN approach handles better.

Second, it does not seem clear whether the POHTN approach or the history-based approach scales better with problem size. While the difference values grow for instance sizes n = 1 through n = 6, suggesting that the POHTN approach scales better, they shrink for the larger problem sizes.


[Figure 5.5: scatter plots; panel headers: n = 1 (m = 9.62, p = 2.35e−10), n = 2 (m = 13, p = 4.87e−12), n = 3 (m = 18.9, p = 1.7e−15), n = 4 (m = 29.4, p = 1.67e−17), n = 5 (m = 37.4, p = 5.68e−18), n = 6 (m = 38.3, p = 5.85e−18), n = 7 (m = 40.3, p = 4.6e−18), n = 9 (m = 35.9, p = 4.46e−18), n = 12 (m = 33.3, p = 3.96e−18); axes: policy quality with history-based approach (x) vs. policy quality with POHTN approach (y).]

Figure 5.5.: Scatter plots comparing POHTN and history-based policy quality on the Skill Teaching domain for various problem sizes n. Each point represents a single problem instance whose position is determined by the estimated quality of the generated POHTN and history-based policies after 10,000 planning iterations. When a point is above the line, it means that the POHTN policy outperformed the history-based policy. For each problem size, the values m and p respectively give the median difference and the significance level concerning whether m is different from 0, as determined using a Wilcoxon signed rank test. One can see that the POHTN approach yields better policies.


Table 5.1.: Correlation coefficients for the data shown in Figure 5.5, relating policy quality values for the POHTN and history-based approaches in the Skill Teaching domain. For all problem sizes except 1, the values are strongly correlated.

Problem Size n    Correlation coefficient
 1                −0.03901895
 2                 0.83791105
 3                 0.88408615
 4                 0.90973802
 5                 0.92482662
 6                 0.92406329
 7                 0.90443477
 9                 0.92238341
12                 0.86974113

However, this can be explained with another effect outweighing the scaling properties: since each skill needs to be taught at least twice to bring its proficiency level from low to high, and since a real teaching action can only be executed every other time step, the number of skills that can be fully taught within the chosen horizon of 40 time steps is at most 10. In other words, the proportion of skills that can be taught diminishes with growing problem size, and since the total skill weight is normalized, the difference in policy quality approaches 0 as the number of skills approaches infinity. As shown in Figure 5.6, this can be resolved by omitting the normalization, i.e., by multiplying the policy quality values with their respective problem size. While obscuring differences for smaller problems, this reveals that the POHTN approach indeed scales better.

Third, as can be seen in Figure 5.7, the advantage of the POHTN approach is especially pronounced with fewer planning iterations and becomes smaller as the number of iterations increases. It seems that the history-based approach could overtake the POHTN approach with some more iterations. This confirms the expectation formulated in Section 5.1 that, given enough time, the history-based approach can produce better policies. Note that the initial "dip" in POHTN policy quality in Figure 5.7 can be explained by looking at the different seeds for the random number generator individually, as shown in Figure 5.8.

The fourth interesting aspect is revealed by taking a closer look at when rewards are gained during policy execution. Looking at the data shown in Table 5.2, it can be seen that history-based policies have a small edge over POHTN policies when only looking at the first 10 time steps, at least for some problem sizes. Apparently, history-based policies are optimized quite well for the first few time steps, but do not generate much reward after that. With POHTN policies, on the other hand, the user learns additional skills until the very end of policy execution. This is likely because history-based search optimizes its actions one by one in order of their execution, and, after 10,000 planning iterations, has thus gathered enough information to determine very good behavior for the first few time steps. For later time steps, only little information is available, and behavior is more random. A POHTN policy, on the other hand, exhibits useful behavior even for later time steps, due to the information encoded within the hierarchy.


[Figure 5.6: scatter plots; panel headers: n = 1 (m = 9.62, p = 2.35e−10), n = 2 (m = 26, p = 4.87e−12), n = 3 (m = 56.8, p = 1.7e−15), n = 4 (m = 117, p = 1.67e−17), n = 5 (m = 187, p = 5.68e−18), n = 6 (m = 230, p = 5.85e−18), n = 7 (m = 282, p = 4.6e−18), n = 9 (m = 288, p = 4.46e−18), n = 12 (m = 300, p = 3.96e−18); axes: policy quality with history-based approach (x) vs. policy quality with POHTN approach (y).]

Figure 5.6.: Scatter plots comparing POHTN and history-based policy quality on the Skill Teaching domain for various problem sizes n. Each point represents a single problem instance whose position is determined by the estimated quality of the generated POHTN and history-based policies after 10,000 planning iterations. When a point is above the line, it means that the POHTN policy outperformed the history-based policy. For each problem size, the values m and p respectively give the median difference and the significance level concerning whether m is different from 0, as determined using a Wilcoxon signed rank test. The plot shows the same data as Figure 5.5, except that policy quality values are multiplied with the problem size. One can see that the difference values grow monotonically with problem size.


[Figure 5.7: line plots for n = 1, 2, 3, 4, 5, 6, 7, 9, 12; x-axis: iterations (10^1 to 10^3, logarithmic); y-axis: policy quality; two curves per panel: POHTN and history-based.]

Figure 5.7.: Plots showing policy quality in the Skill Teaching domain for various problem sizes n dependent on the number of planning iterations. The ribbon represents the 95% confidence interval for the mean policy quality. Note that the scale on the x-axis is logarithmic. One can see that, after an initial information gathering phase of about 10 iterations, the POHTN approach has an advantage over the history-based approach. This advantage grows smaller with more iterations.


[Figure 5.8: 100 policy-quality curves, one per random seed (seeds 00 to 99); x-axis: iterations (logarithmic); y-axis: policy quality.]

Figure 5.8.: A plot showing policy quality dependent on the number of planning iterations for each of 100 runs with different random seeds on Skill Teaching problems of size 1. Note that the scale on the x-axis is logarithmic. One can see that the dip in policy quality in the first 10 iterations is because policy quality drops to −200 for some seeds. This happens when the planner initially does not find a primitive controller that is better than directly decomposing the initial controller to a controller that executes noop in a loop. POHTN performance after 0 iterations, i.e., without any search, has a value larger than −200 because decompositions are chosen at random in this case, as described in Section 2.4. In all cases, this anomaly disappears after 10 iterations.


Table 5.2.: A comparison of POHTN and history-based policy quality on the Skill Teaching domain for the first 10 of a total of 40 time steps of execution for various problem sizes n after 10,000 iterations. The values m and p respectively give the median difference and the significance level for whether m is different from 0, as determined using a Wilcoxon signed rank test. Negative values for m mean that the history-based approach generated better policies, i.e., the data reveals that the history-based policies are slightly better for the first few time steps, at least for some problem sizes.

Problem Size n    Median Difference m    Significance p
 1                 1.81                  0.00104
 2                −1.52                  0.00377
 3                −1                     0.0079
 4                −1.61                  1.85 × 10^−7
 5                −1.6                   1.63 × 10^−8
 6                −0.906                 5.44 × 10^−6
 7                −1.06                  6.94 × 10^−6
 9                −0.578                 3.56 × 10^−5
12                −0.3                   3 × 10^−4

5.4. Navigation

In the Navigation domain taken from the 2011 IPPC, the agent's goal is to safely navigate through a grid world to reach a specified goal cell, as visualized in Figure 5.9. Starting in one cell in the top row, the agent can choose to go left, right, up, or down. In each cell, apart from the cells in the top and bottom rows, there is a chance that the agent dies. This probability grows larger the farther to the right the cell is. The goal of the agent is to reach the bottom right cell as quickly as possible without dying. For this, immediate rewards are −1 for each time step until the goal is reached, at which point the immediate reward is 0. On its way, it can only observe whether it currently is in a corner, and if so, which corner it is.

5.4.1. Problem Generation

For the evaluation in this thesis, square instances were used. The size parameter n determines the number of rows where the agent can die. Since the top and bottom rows are safe, this gives (n + 2) × (n + 2) grids. Letting m = n + 2, death probabilities are assigned as follows: the top and bottom rows, i.e., y = 0 or y = m − 1, are safe. For all other cells (x, y) ∈ {0, ..., m − 1} × {1, ..., m − 2} on the grid, the death probability P(x, y) is 0.01 + 0.9x/(m − 1) + ξ, where ξ is drawn uniformly from the interval [0, 0.05]. This makes columns increasingly dangerous with higher values of x, ranging from a minimum probability of 0.01 for the leftmost column to a maximum probability of 0.96 for the rightmost column. This procedure of assigning death probabilities is the same as the one used in the 2011 International Probabilistic Planning Competition. Figure 5.9 shows such an instance of size 4. The horizon was set to 40 time steps.
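A small sketch of this assignment (the function name is hypothetical; this is not the thesis generator) makes the column-wise danger gradient explicit:

import scala.util.Random

def deathProbabilities(n: Int, rng: Random): Vector[Vector[Double]] = {
  val m = n + 2
  Vector.tabulate(m, m) { (y, x) =>
    if (y == 0 || y == m - 1) 0.0 // top and bottom rows are safe
    else 0.01 + 0.9 * x / (m - 1) + 0.05 * rng.nextDouble() // xi ~ U[0, 0.05]
  }
}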


[Figure 5.9: a grid with the agent's star in the top row and a goal cell marked G at the bottom right; cell shading darkens toward the right.]

Figure 5.9.: A visualization of the Navigation domain. The agent, symbolized by the star, starts at a location in the top row and needs to reach the goal, the cell marked with the letter G. The color of a cell indicates its level of danger for the agent, where darker means more dangerous.

5.4.2. POHTN Hierarchy

The “hierarchy” for the Navigation domain consists solely of the primitive initial controller depicted in Figure 5.10, implementing a strategy where the agent moves all the way to the left, then all the way down, and finally all the way to the right. The intuition is that traversing the dangerous cells using the leftmost column is always the safest way.

5.4.3. Results

Planning results are shown in Figure 5.11. It can be seen that the controller shown in Figure 5.10 seems to be a good policy. The history-based approach, on the other hand, has trouble finding useful policies for all but the smallest instances.

[Figure 5.10: a controller with nodes move-west (start), move-south, and move-east; move-west loops while ~nw-corner and transitions to move-south on nw-corner, which loops while ~sw-corner and transitions to move-east on sw-corner.]

Figure 5.10.: The primitive initial controller of the Navigation hierarchy.


The reason for this lies in the reward function: the immediate reward is −1 everywhere except in the goal cell. This means that until the goal cell is reached for the first time, the planner explores the search space in a completely uninformed manner. The seemingly degrading performance of the POHTN policies is also due to this reward structure: a larger instance size implies a larger grid, so the number of steps required to reach the goal grows.

The apparent variation in POHTN policy quality is not due to different policies, since the POHTN hierarchy specifies one fixed policy, but due to variance in the policy quality estimate: for each point, the estimate is computed by simulating policy execution 40 times. Since there is a small chance for the agent to die even in the leftmost grid column, the agent is not guaranteed to reach the goal in all simulations.

Of course, the structure of the Navigation domain heavily favors the POHTN approach, since good behavior is immediately apparent to humans. On the other hand, it serves to demonstrate that sometimes the POHTN approach can describe good behavior very compactly.

5.5. Elevators

The last evaluation domain is the Elevators domain, also originating from the 2011 IPPC. The agent's task is to control one or more elevators that bring people waiting at different floors to their destination, which can be either the top or bottom floor. Available action fluents are move-current-dir(?elevator), for going to the next floor in the current direction of movement, open-door-going-up(?elevator) and open-door-going-down(?elevator), for simultaneously opening the elevator door and setting the direction of movement, and close-door(?elevator), for closing the elevator door. For each elevator, only one action fluent can be true in each time step, but the elevators can be controlled independently. Hence, only instances with one elevator were considered for this evaluation so as to avoid parallel actions.

Opening the door at an intermediate floor will make any passenger waiting to go in the current direction of movement enter the elevator. On the top and bottom floors, passengers that have reached their target leave the elevator. All actions are deterministic. Uncertainty is introduced by people waiting to go upwards or downwards arriving at intermediate floors. For each direction, the probability of someone arriving at floor f_i is p_i. Available observation fluents are person-waiting-obs(?floor), for people waiting at floors (their intended direction of travel remains hidden), person-in-elevator-going-up-obs(?elevator) and person-in-elevator-going-down-obs(?elevator), for people inside the elevator, and elevator-at-floor-obs(?elevator, ?floor), for the current elevator position.³ In each time step, the reward function punishes the agent with −1 for people waiting at floors, and −0.75 or −3 for people inside the elevator, depending on whether people are taken in the direction of their destination or not.

³ The observation fluent elevator-at-floor-obs(?elevator, ?floor) is not present in the original domain as used in the 2011 IPPC. Since elevator movement is deterministic and the initial floor is known, adding this observation gives the agent no additional information about the world state, but it simplifies defining the POHTN hierarchy introduced below. See Section 8.2 for ideas on how the POHTN approach can be extended to track elevator movement without changing the domain.


[Figure 5.11: scatter plots; panel headers: n = 1 (m = 13.9, p = 7.03e−18), n = 2 (m = 21.5, p = 3.95e−18), n = 3 (m = 25.2, p = 3.94e−18), n = 4 (m = 22, p = 3.94e−18), n = 5 (m = 19.5, p = 3.93e−18), n = 6 (m = 16.6, p = 3.94e−18), n = 7 (m = 14, p = 3.92e−18), n = 9 (m = 9.42, p = 3.94e−18), n = 12 (m = 3.71, p = 3.4e−18); axes: policy quality with history-based approach (x) vs. policy quality with POHTN approach (y).]

Figure 5.11.: Scatter plots comparing POHTN and history-based policy quality on the Navigation domain for various problem sizes n. Each point represents a single problem instance whose position is determined by the estimated quality of the generated POHTN and history-based policies after 10,000 planning iterations. When a point is above the line, it means that the POHTN policy outperformed the history-based policy. For each problem size, the values m and p respectively give the median difference and the significance level concerning whether m is different from 0, as determined using a Wilcoxon signed rank test. For instances of size 3 and above, history-based policies rarely reach the goal cell at all, while the POHTN policy yields useful behavior independent of problem size.


5.5.1. Problem Generation

In this domain, the size parameter n determines the number of intermediate floors, i.e., the total number of floors including the top and bottom floors f_{n+1} resp. f_0 is n + 2. The arrival probabilities p_i for intermediate floors f_i, 1 ≤ i ≤ n, are constructed as follows: first, uniformly sample a vector (x_1, ..., x_n) from the open (n − 1)-simplex, i.e., an n-dimensional vector where Σ_i x_i = 1 and it holds that x_i ∈ (0, 1) for all i. This is done by sampling from a Dirichlet distribution of order K = n with parameters (α_1, ..., α_n) = (1, ..., 1). Then, given an arrival probability parameter p, each p_i is defined as p_i = p x_i. The value p can be interpreted as the expected number of floors where there is someone waiting to go upwards in the next time step, given that no-one is currently waiting. The same holds for people waiting to go downwards. Consequently, since no-one is waiting in the initial state, the expected number of floors that have people waiting after the first time step is 2p. By construction of the p_i, this holds regardless of the actual number of floors. The value of p was chosen to be 0.3 for this evaluation.
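This is the same simplex-sampling technique sketched for the Skill Teaching weights; a self-contained sketch with a hypothetical name follows:

import scala.util.Random

// p_i = p * x_i with (x_1, ..., x_n) ~ Dirichlet(1, ..., 1), obtained by
// normalizing independent Exponential(1) draws.
def arrivalProbabilities(n: Int, p: Double, rng: Random): Vector[Double] = {
  val exps = Vector.fill(n)(-math.log(1.0 - rng.nextDouble()))
  val sum = exps.sum
  exps.map(x => p * x / sum)
}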

The benefit of generating instances in this way is that (1) instances are drawn uniformly from the set of instances with constant n and p, and (2) policy values across different instance sizes remain close together, which simplifies plotting of results.

5.5.2. POHTN Hierarchy

The Elevators domain uses a hierarchy that essentially models two different strategies. The initial controller repeatedly executes the abstract action executeElevatorStrategy(?elevator), which has two implementations. Using the abstract actions simpleGoUp and simpleGoDown, the elevator either repeatedly moves from the bottom to the top floor and back until it arrives at a floor where someone waits, or it waits at the bottom floor until someone arrives at a floor, at which point it starts moving. There are three alternatives each for how simpleGoUp(?elevator) and simpleGoDown(?elevator) are decomposed. The first is that the elevator opens its door on any floor where a passenger is waiting and then continues in its current direction, i.e., it will not pick up passengers that want to go in the opposite direction. The second is that the elevator opens its door twice, once in each direction, to make the waiting passenger enter with certainty, and then goes in the direction of the passenger's destination. The third is a variant of the second that does not stop in between to pick up other waiting passengers.

5.5.3. Results

Figure 5.12 compares POHTN and history-based policy quality. It can be seen that the POHTN policies consistently outperform history-based policies on almost all generated Elevators instances. Just as in the case of the Skill Teaching domain, there is also a correlation between POHTN and history-based policy quality, as can be seen from Table 5.3. Again, this suggests that there is an inherent difficulty to the generated problems that affects both approaches, but that the POHTN approach handles better.

Apparently, the hierarchy captures useful knowledge for elevator control. The fact that it models a limited number of strategies is both its advantage and its limitation: on the one hand, the small POHTN search space enables the search to quickly converge to a useful policy.


[Figure 5.12: scatter plot panels for n = 1 (m = 12, p = 3.95e−18), n = 2 (m = 25.7, p = 3.96e−18), n = 3 (m = 36.3, p = 3.96e−18), n = 4 (m = 39.5, p = 3.96e−18), n = 5 (m = 41.5, p = 3.96e−18), n = 6 (m = 42.5, p = 3.96e−18), n = 7 (m = 40.5, p = 3.96e−18), n = 9 (m = 42, p = 3.96e−18), n = 12 (m = 39, p = 4.74e−18); x-axis: policy quality with history-based approach, y-axis: policy quality with POHTN approach.]

Figure 5.12.: Scatter plots comparing POHTN and history-based policy quality on the Elevators domain for various problem sizes n. Each point represents a single problem instance whose position is determined by the estimated quality of the generated POHTN and history-based policies after 10,000 planning iterations. When a point is above the line, it means that the POHTN policy outperformed the history-based policy. For each problem size, the values m and p respectively give the median difference and the significance level concerning whether m is different from 0, as determined using a Wilcoxon signed rank test. The POHTN policies consistently outperform history-based policies on almost all generated instances, with the advantage being less pronounced on small instances.


Table 5.3.: Correlation coefficients for the data shown in Figure 5.12, relating policy quality values for the POHTN and history-based approaches in the Elevators domain. For all problem sizes except 1, the values are correlated.

Problem size n    Correlation coefficient
1                 0.1271165
2                 0.9342914
3                 0.8930032
4                 0.8874011
5                 0.8179034
6                 0.8109238
7                 0.7920936
9                 0.7948306
12                0.6153779
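The m and p values reported in the figure captions of this chapter are median paired differences and Wilcoxon signed rank significance levels. A minimal sketch of how such a paired comparison can be computed follows; the use of scipy and the function name are assumptions for illustration, not a statement about the thesis' actual tooling.

import numpy as np
from scipy.stats import wilcoxon

def paired_comparison(pohtn_values, history_values):
    # Per-instance paired differences between the two approaches.
    diffs = np.asarray(pohtn_values) - np.asarray(history_values)
    m = np.median(diffs)      # median difference, as reported in the captions
    _, p = wilcoxon(diffs)    # H0: the median difference is 0
    return m, p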

On the other hand, this also means that search does not yield any substantial improvements, as can be seen from Figure 5.13. It is plausible that there is room for improvement that, e.g., cannot be conveniently captured within a POHTN hierarchy. Such improvements can in principle be found using the history-based approach, since it converges to optimal policies in the limit. However, as can be seen from Figure 5.14, the history-based approach even has trouble finding policies that perform better than choosing actions uniformly at random. This does not improve substantially with more iterations. Indeed, even for problems of size 1 and after 1,000,000 iterations, the POHTN policy is still superior to the history-based policy, as can be seen from Figure 5.15.

5.6. Summary and Further Analysis

The main conclusion that can be drawn from the experiments is that the POHTN approach indeed succeeds in creating good policies. For every evaluation domain, a compactly described hierarchy could be defined that applied to all generated problem instances of several problem sizes. The hierarchies could then be used for generating policies that consistently improved upon history-based policies generated using comparable computational effort. Since, as stated in Section 2.4, MCTS in the space of histories is one of the most competitive algorithms for non-hierarchical POMDP planning [Sanner, 2011; Sanner, 2014], the POHTN approach constitutes a meaningful contribution to the state of the art in POMDP planning. Some further aspects are addressed in the following.

Choice of MCTS Variant

As mentioned in the beginning of this chapter, the data shown in this evaluation was generated using the SoftMaxUCT algorithm. The question arises whether using other MCTS variants, such as UCT or MaxUCT (see Section 2.4), leads to different results. As can be seen from Figure 5.16 for Elevators problems of size 6 though, the choice of algorithm does not seem to influence the result of the evaluation much.


[Figure 5.13: line plot panels for n = 1, 2, 3, 4, 5, 6, 7, 9, 12; x-axis: iterations (log scale), y-axis: policy quality; legend: POHTN, history-based.]

Figure 5.13.: Plots showing policy quality in the Elevators domain for various problem sizes n dependent on the number of planning iterations. Note that the scale on the x-axis is logarithmic. The ribbon represents the 95% confidence interval for the mean policy quality. One can see that the POHTN approach quickly converges to a policy. The history-based approach slowly improves its policy over time, although it achieves very little improvement for larger instances. The dips in POHTN policy quality can be explained in a similar manner as in Figure 5.8.


[Figure 5.14: scatter plot panels for n = 1 (m = 15, p = 3.96e−18), n = 2 (m = 14.6, p = 3.96e−18), n = 3 (m = 11.2, p = 8.95e−16), n = 4 (m = 8.98, p = 6.1e−13), n = 5 (m = 8.67, p = 8.4e−11), n = 6 (m = 9.23, p = 8.68e−12), n = 7 (m = 8.8, p = 1.51e−09), n = 9 (m = 6.04, p = 2.88e−05), n = 12 (m = 5.47, p = 0.000113); x-axis: policy quality after iteration 1, y-axis: policy quality after iteration 10,000.]

Figure 5.14.: Scatter plots comparing history-based policy quality after 10,000 planning iterations with choosing actions uniformly at random on the Elevators domain. Each point represents a single problem instance whose position is determined by the estimated quality of the random policy and the history-based policy. When a point is above the line, it means that the history-based policy outperformed the random policy. For each problem size, the values m and p respectively give the median difference and the significance level concerning whether m is different from 0, as determined using a Wilcoxon signed rank test. One can see that the history-based policies improve upon random behavior, but not by much. Also, the difference is smaller for larger instances.


[Figure 5.15: single scatter plot panel for n = 1 (m = 6, p = 3.96e−18); x-axis: policy quality with history-based approach, y-axis: policy quality with POHTN approach.]

Figure 5.15.: A scatter plot comparing POHTN and history-based policy quality on the Elevators domain for problem size 1 after 1,000,000 iterations. Each point represents a single problem instance whose position is determined by the estimated quality of the POHTN and history-based policies. When a point is above the line, it means that the POHTN policy outperformed the history-based policy. The values m and p respectively give the median difference and the significance level concerning whether m is different from 0, as determined using a Wilcoxon signed rank test. It can be seen that history-based planning does not perform as well as POHTN planning even on small Elevators problems with a generous planning time budget.


[Figure 5.16: scatter plot panels for MaxUCT (m = 42.5, p = 3.96e−18), SoftMaxUCT (m = 42.5, p = 3.96e−18), and UCT (m = 46.6, p = 3.96e−18); x-axis: policy quality with history-based approach, y-axis: policy quality with POHTN approach.]

Figure 5.16.: Scatter plots comparing POHTN and history-based policy quality of the algorithms MaxUCT, SoftMaxUCT, and UCT on the Elevators domain for problems of size 6. Each point represents a single problem instance whose position is determined by the estimated quality of the generated POHTN and history-based policies after 10,000 planning iterations. When a point is above the line, it means that the POHTN policy outperformed the history-based policy. For each algorithm, the values m and p respectively give the median difference and the significance level concerning whether m is different from 0, as determined using a Wilcoxon signed rank test. The choice of algorithm does not seem to influence the difference between POHTN and history-based policy quality much.

Results are very similar for the other evaluation domains and are therefore omitted.

Overhead for Controller Construction in the POHTN Approach

While the number of samples generated from the POMDP simulator can be compared directly between the POHTN and history-based approaches, the POHTN approach incurs an additional overhead due to the required construction of controllers. To mitigate this, the implementation of the decomposition MDP employs a caching mechanism to remember the result of often-computed method applications. Figures 5.17 and 5.18 compare the required CPU time per iteration on the Home Theater Setup and Skill Teaching domains, respectively. It can be seen that the overhead for the POHTN approach is small, at least after an initial period where the cache is still empty. No data is shown for the Elevators and Navigation domains: the overhead in these domains is negligible, since the hierarchies induce only few primitive controllers.
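A minimal sketch of such a caching mechanism follows, assuming method-application arguments are hashable; the function names and interface are illustrative, not the thesis' actual code.

def make_cached_apply(apply_method):
    # Wrap an (uncached) method-application routine with a memoization table
    # keyed by (controller, node, method), mirroring the caching mechanism
    # described above; repeated decompositions are then computed only once.
    cache = {}
    def apply_cached(fsc, node, method):
        key = (fsc, node, method)
        if key not in cache:
            cache[key] = apply_method(fsc, node, method)
        return cache[key]
    return apply_cached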

However, the overhead for controller construction strongly depends on the chosen hierarchy: while monotone POHTN hierarchies guarantee that the required number of decompositions is finite, the required effort can in the worst case be exponential in the planning horizon, as shown by Theorem 4.18. This is addressed in Chapter 6, where a technique for limiting controller construction effort to linear in the planning horizon is presented.


[Figure 5.17: line plot panels for n = 1, 2, 3, 4, 5, 6, 7, 9, 12; x-axis: iterations (log scale), y-axis: CPU time in seconds; legend: POHTN, history-based.]

Figure 5.17.: Plots comparing CPU time per iteration for the POHTN and history-based approaches on the Home Theater Setup domain for different problem sizes n over 10,000 planning iterations. The overhead required for the POHTN approach seems small once the method application cache is filled.


[Figure 5.18: line plot panels for n = 1, 2, 3, 4, 5, 6, 7, 9, 12; x-axis: iterations (log scale), y-axis: CPU time in seconds; legend: POHTN, history-based.]

Figure 5.18.: Plots comparing CPU time per iteration for the POHTN and history-based approaches on the Skill Teaching domain for different problem sizes n over 10,000 planning iterations. The overhead required for the POHTN approach seems small once the method application cache is filled.


Combining the POHTN and History-Based Approaches

Despite the lower performance of the history-based approach, it still has the advantage that it produces optimal POMDP policies in the limit. A possible explanation for its lower performance is that it initially explores the search space rather blindly, whereas the POHTN approach quickly focuses its search on good policies. This explanation would be consistent with the performance behavior observed in the Skill Teaching and Elevators domains, shown in Figures 5.7 and 5.13, respectively. Ideally, one would want a planning approach that both quickly finds good policies and also converges to an optimal policy in the limit. This suggests that it might be beneficial to combine the history-based and POHTN approaches in a single approach. Such a combination is investigated in Chapter 7.

Other Hierarchical Approaches

As argued in the beginning of this chapter, MCTS in the space of histories lends itself as a baseline for assessing POHTN planning performance, since both approaches share MCTS as the underlying search algorithm. However, it would be interesting to compare the POHTN approach directly to other hierarchical POMDP approaches mentioned in Section 1.3. Unfortunately, such a comparison would need to address a series of open questions, since these approaches are based on very different principles: first, one would need to address the differences between the various representation formalisms for expert knowledge, e.g., by specifying expert knowledge for a given domain in an approach-independent general form and providing standardized procedures for transforming this knowledge into hierarchies for the various approaches. Second, an objective means for determining the parameters of the respective algorithms, which differ considerably, would be required. With these issues solved, one could determine the best approach by comparing the quality of the computed policies. Since this is difficult to implement, however, this thesis takes the approach of analyzing the advantage of MCTS in the space of partial POHTN policies over non-hierarchical MCTS in the space of histories.


Chapter 6.

Interleaving Decomposition and Policy Simulation

The evaluation in Chapter 5 compared performance of the history-based and POHTN approaches on the basis of the number of required POMDP simulator calls. An aspect that is ignored by this analysis is the effort the POHTN approach requires to construct a primitive controller in each round. Indeed, an unfavorable POHTN hierarchy might require exponentially many decomposition steps in the time horizon H, as can be seen from the proof of Theorem 4.18. A simple example where this is the case is the hierarchy considered in the proof of Theorem 4.11, i.e., a hierarchy that creates general tree-shaped controllers that can represent every possible policy. In this particular example, the number of decomposition steps is $\frac{|O|^H - 1}{|O| - 1}$ for a POMDP with observation set O, which would translate into significant additional effort in each round of search. While this does not happen for the hierarchies considered in the evaluation, the possibility of unwittingly creating a POHTN hierarchy that incurs a high controller construction overhead nevertheless represents a potential pitfall for domain experts modeling their knowledge as a POHTN hierarchy.

To rectify this, one needs to take a closer look at how one round of MCTS with the POHTN approach works: first, a fully primitive controller is constructed, potentially incurring the described effort. Then, the execution of this controller is simulated once, i.e., H actions are executed. In other words, the controller can be exponentially big in H, but a maximum of H controller nodes actually contribute to its quality assessment. This fact raises the question whether the construction process can be modified to only explicitly construct the parts of the controller that are visited during simulation. Of course, the nodes actually traversed are only known during simulation. Therefore, this leads to an approach where controller construction and simulation need to be interleaved.

6.1. The Interleaved Decomposition MDP

For this purpose, a modified decomposition MDP is constructed. The idea is to interleave decomposition and policy execution: the current partially abstract controller is executed until (1) a node labeled with an abstract action is reached, (2) an unbound variable is found, or (3) the execution horizon H is reached. The construction of this MDP is successively described in the following, based on a POHTN problem $PH = (P, C, F_{abstract\text{-}obs}, M, fsc_I)$ and its (non-interleaved) decomposition MDP $(\mathcal{S}^{PH}, \mathcal{A}^{PH}, \mathcal{G}^{PH}, s_I^{PH}, \mathcal{S}_T^{PH})$ as defined in Definition 4.19.


The first aspect to cover is the appropriate choice of search states. Obviously, the search state needs to contain a representation of the current partially executed partially abstract controller. Given a controller $fsc = (N, \lambda, \delta, n_I, \beta) \in \mathcal{S}^{PH}$, a partially executed variant of fsc is a controller that also remembers its "current node", which can be represented by simply changing the initial node of the controller. The set of all such variants of fsc is denoted as $pe(fsc) = \{(N, \lambda, \delta, n_I', \beta) \mid n_I' \in N\}$. Due to loops, however, the current node of a controller alone cannot capture an important piece of information: the number of remaining time steps. This needs to be represented by a number between 0 and the time horizon H.

Actions in the interleaved decomposition MDP correspond to either executing the action associated with the current controller node, if it is primitive and the occurring variables are bound, or else binding variables and decomposing the current node as necessary. Collecting these considerations, this leads to the following definition:

Definition 6.1. For a given POHTN problem $PH = (P, C, F_{abstract\text{-}obs}, M, fsc_I)$, the interleaved decomposition MDP is given by the tuple $(\mathcal{S}^{IPH}, \mathcal{A}^{IPH}, \mathcal{G}^{IPH}, s_I^{IPH}, \mathcal{S}_T^{IPH})$, where

• The set of search states is defined as

$$\mathcal{S}^{IPH} = \Bigl(\bigcup_{fsc \,:\, fsc_I \rightarrow^*_M fsc} pe(fsc)\Bigr) \times \{0, \ldots, H\},$$

i.e., the cross product of (1) the set of all partially executed partially abstract LFSCs that can be generated from the initial controller $fsc_I$ and (2) all possible counts of remaining time steps, between 0 and the entire horizon H.

• For a given search state $(fsc, h) \in \mathcal{S}^{IPH}$, where $fsc = (N, \lambda, \delta, n_I, \beta)$, the set of applicable actions is defined as

$$\mathcal{A}^{IPH}(fsc) = \begin{cases} \{[\mathit{choose}(\mathit{unbound}(fsc)) \mapsto d] \mid d \in D\} & \text{if } \mathit{unbound}(fsc) \neq \emptyset \\ \{\lambda(n_I)\} & \text{if } \lambda(n_I) \text{ is a primitive action} \\ \{m \mid m \in M \text{ such that } m \text{ is a method for the schema of } \lambda(n_I)\} & \text{else.} \end{cases}$$

In words, this means that the applicable actions for a given search state are given as follows: (1) if there are unbound variables, bind a variable, or else, (2) if all variables are bound and the action associated with the initial node of the controller is primitive, simulate the execution of this action in the POMDP P, or else, (3) decompose $n_I$ with a suitable method.

• The simulator $\mathcal{G}^{IPH}$ is defined via $((fsc', h'), r) \sim \mathcal{G}^{IPH}((fsc, h), a)$ as follows.

– If $a = [x \mapsto d]$, then $fsc' = \mathit{apply}(fsc, [x \mapsto d])$, $h' = h$, and $r = 0$.

– Else, if $a = \lambda(n_I)$ and $fsc = (N, \lambda, \delta, n_I, \beta)$, simulate executing $\lambda(n_I)$ in the POMDP P using the techniques of Silver and Veness [2010] as described for history-based POMDP simulation in Section 2.4.4: assuming s is the current world state, apply the POMDP simulator $\mathcal{G}^P$ to determine $(s', o, r) \sim \mathcal{G}^P(s, \lambda(n_I))$. Then, define $fsc' = (N, \lambda, \delta, n_I', \beta)$, where $n_I'$ is the successor node of $n_I$ for which $o \models \delta(n_I, n_I')$, and set $h' = h - 1$. For the first time step, s is sampled from the initial belief state of the POMDP, i.e., $s \sim b_I$. For all later time steps, s is defined to be the successor world state $s'$ of the previous primitive action execution.

– Else, if $a = m$, then $fsc' = \mathit{apply}(fsc, n_I, m)$, $h' = h$, and $r = 0$.

• The initial search state $s_I^{IPH} = (fsc_I, H)$ is the combination of the POHTN initial controller and the full time horizon H.

• The terminal states $\mathcal{S}_T^{IPH} = \{(fsc, h) \in \mathcal{S}^{IPH} \mid h = 0\}$ are the states for which the remaining time horizon is zero.
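A minimal sketch of one transition of this MDP follows. The tagged action encoding, the `ops` bundle of controller operations (ops.bind, ops.decompose, ops.advance), and the function interface are assumptions standing in for the formal operations of Chapter 4, not the thesis' actual implementation.

def interleaved_step(state, action, ops, pomdp_sim, world_state):
    # One transition of the interleaved decomposition MDP (Definition 6.1).
    # state is a pair (fsc, h); action is ("bind", [x -> d]),
    # ("decompose", m), or ("execute", a) for a primitive POMDP action a.
    fsc, h = state
    kind, payload = action
    if kind == "bind":        # variable binding: no time passes, r = 0
        return (ops.bind(fsc, payload), h), 0.0, world_state
    if kind == "decompose":   # method application: no time passes, r = 0
        return (ops.decompose(fsc, payload), h), 0.0, world_state
    # primitive execution: sample (s', o, r) ~ G_P(s, a) and follow the
    # successor edge whose condition the received observation satisfies.
    s_next, obs, reward = pomdp_sim(world_state, payload)
    return (ops.advance(fsc, obs), h - 1), reward, s_next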

The first observation is the following:

Theorem 6.2. Let PH be a decomposable, monotone POHTN problem and $(\mathcal{S}^{IPH}, \mathcal{A}^{IPH}, \mathcal{G}^{IPH}, s_I^{IPH}, \mathcal{S}_T^{IPH})$ its interleaved decomposition MDP. Then the interleaved decomposition MDP is an MDP according to Definition 2.11, and the maximum number of required steps until a terminal node is linear in H.

Proof. One needs to prove that $h(s_I^{IPH}) < \infty$. The variables present in $fsc_I$ are bound first, requiring $|V|_{\mathit{init}}$ many steps. Next, it is checked whether the initial node $n_I$ of the controller is primitive. If this is not the case, it needs to be decomposed. Since PH is monotone, the maximum number of decompositions that need to be applied before $n_I$ is primitive is given by $l + 1$, where l is the length of the longest path in the graph G of Definition 4.17. This number is bounded by the number of abstract action schemata, i.e., $l + 1 \leq |C|$. After each decomposition, a finite number of new variables need to be bound. Let $|V|_{\mathit{max}}$ denote the maximum such number. Afterwards, the action associated with $n_I$ is primitive and all variables are bound. It is then simulated, and the number of remaining time steps decreases to $H - 1$. After H repetitions of this procedure, the number of remaining time steps is 0 and a terminal state is reached. The total number of steps to reach a terminal state can thus be bounded by

$$h(s_I^{IPH}) \leq |V|_{\mathit{init}} + (|V|_{\mathit{max}} + 1)\,|C|\,H < \infty.$$

Obviously, it holds that $h(s_I^{IPH}) \in O(H)$.

The regular decomposition MDP and the interleaved decomposition MDP differ in how policies computed for them represent POMDP policies. In the regular decomposition MDP, terminal search states correspond to primitive controllers that represent complete POMDP policies, i.e., to compute a POMDP policy, one computes a policy for the decomposition MDP, sequentially applies the recommended decompositions and variable bindings to the initial controller until a primitive controller is reached, and uses this controller as the POMDP policy.

In contrast, terminal search states in the interleaved decomposition MDP do not represent complete policies: they are only primitive for the parts traversed in the simulation steps, as shown in Figure 6.1, and as such do not define complete POMDP policies for other possible observation histories.


[Figure 6.1: tree-shaped controller with start node a2; of its successors under o0, o1, ..., on−1, only the o1-successor (labeled a5) is concretized, and of the latter's successors only the o0-successor (labeled a3); all other successors remain abstract nodes labeled c.]

Figure 6.1.: What a controller of the interleaved decomposition MDP of the POHTN hierarchy used in the proof of Theorem 4.11 looks like, where the hierarchy can represent arbitrary policy trees. In the first simulation step, o1 was observed, o0 in the second, and only the corresponding parts are concretized.

Therefore, the policy computed for the interleaved decomposition MDP represents a POMDP policy only in its entirety, and a "current state" of the interleaved decomposition MDP policy needs to be tracked during POMDP execution: let $\pi^{IPH}$ be a policy computed for the interleaved decomposition MDP, $s^{IPH} = (fsc, h)$ be a state of the interleaved decomposition MDP, and $n_I$ be the initial node of fsc. Let furthermore $a^* = \pi^{IPH}(s^{IPH})$ denote the action recommended for $s^{IPH}$ and let

$$\mathit{autoApply}(\pi^{IPH}, s^{IPH}) = \begin{cases} s^{IPH} & \text{if } a^* = \lambda(n_I) \\ \mathit{autoApply}(\pi^{IPH}, (\mathit{apply}(fsc, n_I, m), h)) & \text{if } a^* = m \\ \mathit{autoApply}(\pi^{IPH}, (\mathit{apply}(fsc, [x \mapsto d]), h)) & \text{if } a^* = [x \mapsto d]. \end{cases}$$

The function autoApply executes all variable codesignations and method applications until a state $(fsc', h)$ is reached whose controller $fsc'$ has a primitive initial node and no unbound variables, i.e., the POMDP action associated with the initial node of $fsc'$ can be executed. A POMDP policy can then be constructed from a policy $\pi^{IPH}$ for the interleaved decomposition MDP relative to a current state $s^{IPH}$ of the interleaved decomposition MDP by defining the functions α and ω as follows:

$$\alpha((\pi^{IPH}, s^{IPH})) = \lambda(n_I), \text{ where } \mathit{autoApply}(\pi^{IPH}, s^{IPH}) = ((N, \lambda, \delta, n_I, \beta), h)$$

$$\omega((\pi^{IPH}, s^{IPH}), o) = (\pi^{IPH}, ((N, \lambda, \delta, n^*, \beta), h - 1)), \text{ where } \mathit{autoApply}(\pi^{IPH}, s^{IPH}) = ((N, \lambda, \delta, n_I, \beta), h) \text{ and } n^* \text{ is the node for which } o \models \delta(n_I, n^*).$$

At the start of POMDP policy execution, the current state of the interleaved decomposition MDP is its initial state $s_I^{IPH}$.
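The following sketch mirrors autoApply and the functions α and ω in code, reusing the assumed `ops` bundle from the previous sketch (here extended by ops.initial_action for λ(n_I)); it is an illustration of the definitions above, not the thesis' implementation.

def auto_apply(policy, state, ops):
    # Follow the policy's recommended bindings and decompositions until the
    # controller's initial node carries a fully bound primitive action.
    fsc, h = state
    kind, payload = policy(state)
    if kind == "execute":
        return state
    if kind == "decompose":
        return auto_apply(policy, (ops.decompose(fsc, payload), h), ops)
    return auto_apply(policy, (ops.bind(fsc, payload), h), ops)

def alpha(policy, state, ops):
    # POMDP action recommended in the current interleaved-MDP state.
    fsc, _ = auto_apply(policy, state, ops)
    return ops.initial_action(fsc)   # lambda(n_I)

def omega(policy, state, obs, ops):
    # Successor interleaved-MDP state after receiving observation obs.
    fsc, h = auto_apply(policy, state, ops)
    return ops.advance(fsc, obs), h - 1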

It is important to note that the interleaved decomposition MDP and the regular decomposition MDP do not always compute the same set of POMDP policies: in the regular decomposition MDP, when the planner needs to choose the best decomposition method for a given abstract node, this choice is made for a partially abstract controller, since search states are partially abstract controllers. In contrast, a search state in the interleaved decomposition MDP corresponds to a partially abstract controller, plus an updated initial node and the number of remaining time steps. The planner can thus take additional information into account when choosing decompositions. For example, in the Skill Teaching domain, the planner could choose different decomposition methods for the same teach(?skill)-labeled node of the same controller, depending on whether the number of remaining time steps is 2 or 4. Of course, this also means that the effort for determining the best decomposition method is incurred twice, for both 2 and 4 remaining time steps.

Whether planning in the interleaved decomposition MDP computes a different set of policies than the regular decomposition MDP depends on the hierarchy, though: for example, for hierarchies that generate only tree-shaped policies, such as the one used in the proof of Theorem 4.11, memorizing the number of remaining time steps does not represent additional information.

Summing up the above considerations, planning in the interleaved decomposition MDP potentially has two advantages and one disadvantage: on the one hand, it can generate better POMDP policies than planning in the regular decomposition MDP for some hierarchies, and the number of decomposition steps per planning iteration is guaranteed to be linear. On the other hand, convergence to a good policy might also need more iterations than planning in the regular decomposition MDP.

6.2. Evaluation

These considerations raise the question of how the interleaved POHTN approach empirically compares to the regular POHTN approach. With regard to the aforementioned potential advantages and disadvantages, the questions of interest are the following:

1. Is there a noticeable difference in the rate of convergence to a good policy?

2. Does planning in the interleaved decomposition MDP in the limit produce better POMDP policies than planning in the regular decomposition MDP?

3. Does planning in the interleaved decomposition MDP empirically decrease the controller construction overhead?

To answer these questions, data for the Home Theater Setup and Skill Teaching domains was generated according to the same setting as described in Section 5.1 on page 65. No data was generated for the Elevators and Navigation domains since, as argued in Section 5.6 (page 83), constructing primitive controllers takes very little time in these domains, so applying the interleaved POHTN approach to these hierarchies does not seem very promising.

Concerning the first question, Figures 6.2 and 6.3 show policy quality dependent on the number of planning iterations for the history-based, regular POHTN, and interleaved POHTN approaches. They show that the rate of convergence is very similar between the regular and the interleaved POHTN approach. There is a small difference in the Skill Teaching domain in Figure 6.3, where the interleaved POHTN approach takes a little longer to find a policy of similar value. The difference vanishes after about 1,000 iterations, however.


[Figure 6.2: line plot panels for n = 1, 2, 3, 4, 5, 6, 7, 9, 12; x-axis: iterations (log scale), y-axis: policy quality; legend: POHTN, history-based, interleaved POHTN.]

Figure 6.2.: Plots showing policy quality in the Home Theater Setup domain for various problem sizes n dependent on the number of planning iterations. The ribbon represents the 95% confidence interval for the mean policy quality. Note that the scale on the x-axis is logarithmic. There does not seem to be a noticeable difference between the convergence rates of the regular and interleaved POHTN approaches.


[Figure 6.3: line plot panels for n = 1, 2, 3, 4, 5, 6, 7, 9, 12; x-axis: iterations (log scale), y-axis: policy quality; legend: POHTN, history-based, interleaved POHTN.]

Figure 6.3.: Plots showing policy quality in the Skill Teaching domain for various problem sizes n dependent on the number of planning iterations. The ribbon represents the 95% confidence interval for the mean policy quality. Note that the scale on the x-axis is logarithmic. The interleaved POHTN approach takes a little longer to find a policy of similar value as the regular POHTN approach, though this difference vanishes after about 1,000 iterations.


The ends of the interleaved POHTN curves for the Skill Teaching domain even suggest that the interleaved POHTN approach might start to outperform the regular POHTN approach after 10,000 iterations. To investigate this more closely, Figure 6.4 compares regular and interleaved POHTN policy quality after 100,000 iterations on Skill Teaching problems of size 6. Indeed, the interleaved POHTN approach has a small advantage, answering the second question affirmatively.

To answer the last question, Figures 6.5 and 6.6 compare the required CPU time per iteration of planning for the Home Theater Setup and Skill Teaching domains, respectively. The chosen hierarchies for these domains do not induce a large controller construction overhead even for the regular POHTN approach. Therefore, the interleaved POHTN approach cannot improve much over the regular approach. Still, it seems to have a small advantage especially for larger problem sizes, in addition to its superior worst-case behavior. It should be noted, however, that CPU time measurements are prone to many sources of noise. A statistical analysis of the difference is therefore intentionally omitted.

6.3. Discussion and Further Improvements

The interleaved POHTN approach offers better worst-case runtime guarantees compared to the regular POHTN approach. This results in greater freedom for modeling POHTN hierarchies, since it enables the domain expert to use a much broader range of hierarchies that would normally result in exponential planning effort. Taken to the extreme, one could use the hierarchy used in the proof of Theorem 4.11, which, apart from some additional constant controller construction overhead, exactly corresponds to the history-based planning approach. By choosing a hierarchy in between this most general hierarchy and a compact hierarchy, one can very flexibly trade off search space size and expressivity.

Additionally, the interleaved POHTN approach is, for some hierarchies and with enough iterations, capable of producing better policies than the regular POHTN approach. It can thus be considered an improvement over the regular POHTN approach, unless its slightly slower convergence is an issue for the domain at hand. There also exist some further improvements that can be applied to the interleaved POHTN approach.

6.3.1. More Elaborate Interleaved Decomposition MDP

A more elaborate version of the interleaved decomposition MDP could improve upon Definition 6.1 by binding variables only when necessary, i.e., only when the initial node of the controller currently under consideration contains an unbound variable. This reduces the number of necessary variable bindings. However, since the number of introduced variables is always bounded linearly in H according to Theorem 6.2, the impact of this reduction is only moderate.

6.3.2. Online Planning

The interleaved POHTN approach is well suited for use in an online planning setting. In this setting, the agent alternates between planning for a fixed amount of time and executing a "real" action in its environment. There are two main advantages to online planning:


[Figure 6.4: single scatter plot panel for n = 6 (m = 5.67, p = 0.00153); x-axis: policy quality with POHTN approach, y-axis: policy quality with interleaved POHTN approach.]

Figure 6.4.: A scatter plot comparing regular and interleaved POHTN policy quality on Skill Teaching problems of size 6. Each point represents a single problem instance whose position is determined by the estimated quality of the generated regular and interleaved POHTN policies after 100,000 planning iterations. When a point is above the line, it means that the interleaved POHTN policy outperformed the regular POHTN policy. The values m and p respectively give the median difference and the significance level concerning whether m is different from 0, as determined using a Wilcoxon signed rank test. One can see that the interleaved POHTN approach yields slightly better policies, since m > 0.


[Figure 6.5: line plot panels for n = 1, 2, 3, 4, 5, 6, 7, 9, 12; x-axis: iterations (log scale), y-axis: CPU time in seconds; legend: POHTN, history-based, interleaved POHTN.]

Figure 6.5.: Plots comparing CPU time per iteration for the history-based, regular POHTN, and interleaved POHTN approaches on the Home Theater Setup domain for different problem sizes n over 10,000 planning iterations.


[Figure 6.6: line plot panels for n = 1, 2, 3, 4, 5, 6, 7, 9, 12; x-axis: iterations (log scale), y-axis: CPU time in seconds; legend: POHTN, history-based, interleaved POHTN.]

Figure 6.6.: Plots comparing CPU time per iteration for the history-based, regular POHTN, and interleaved POHTN approaches on the Skill Teaching domain for different problem sizes n over 10,000 planning iterations.


• Once the agent has executed a real action, it also receives a real observation it cannot undo. This means that the agent no longer needs to care about other possible observations it could have received, and can thus better focus its planning effort.

• In the context of an assistance system, it might be unreasonable to assume that the user is patient enough to wait until the system has computed a complete policy offline. In an online setting, on the other hand, the system can always present an action recommendation after a fixed limited amount of time.

For online planning to be effective, however, it is important that planning does not need to restart from scratch after an action has been executed, i.e., the search tree should be reused, at least in part. For history-based planning, achieving this is straightforward: after action a is executed and observation o is received, the new search tree corresponds to the subtree rooted at history (a, o). It is unclear how this could be realized for the regular POHTN approach, since it is hard to see how the search tree should be updated to reflect that an action has been executed. On the other hand, since planning in the interleaved POHTN approach already works by alternating between simulating action execution and making planning decisions, it is better suited for the online setting: executing a real action can be reflected in the search tree by traversing the tree until a search state is reached where the next step is to simulate an action, and choosing the new initial search state of the search tree to be the successor search state that matches the received real observation.

More formally, let $\pi^{IPH}$ be an interleaved decomposition MDP policy and let $s^{IPH} = (fsc, H) = ((N, \lambda, \delta, n_I, \beta), H) = \mathit{autoApply}(\pi^{IPH}, s_I^{IPH})$ be the search state resulting from applying all recommended decompositions and variable codesignations to the initial controller before the first action is simulated. Then, the action that the agent executes in its environment is the action associated with the initial node of fsc, i.e., $\lambda(n_I)$. Given that the observation received is o, the new initial search state is given by $((N, \lambda, \delta, n_I', \beta), H - 1)$, where $o \models \delta(n_I, n_I')$. After this update, the search can continue by reusing the subtree below the new initial search state as the new search tree.
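A minimal sketch of this tree-reuse step follows; the node attributes (pending_kind, pending_action, best_child, children_by_obs) are assumptions about the search tree's structure, not the thesis' actual data types.

def reuse_subtree(root, executed_action, received_obs):
    # Descend from the root through the binding/decomposition decisions the
    # policy made, until the node whose pending step is an action simulation;
    # keep only the child matching the real observation as the new root.
    node = root
    while node.pending_kind in ("bind", "decompose"):
        node = node.best_child()
    assert node.pending_action == executed_action
    return node.children_by_obs[received_obs]  # rest of the tree is dropped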

One more challenge that needs to be overcome is that an augmented simulator $\mathcal{G}^{IPH}$ is required: for offline interleaved POHTN planning as presented above, simulating action execution was done by sampling an initial world state from the initial belief state, and then continually passing the sampled successor world states along, just as for history-based POMDP planning presented in Section 2.4.4 [Silver and Veness, 2010, Lemma 2].

As also noted by Silver and Veness for history-based POMDP planning, this can no longer be done in an online planning setting once the first real action is executed: after the agent executes action a in its "real" environment, receives observation o, and updates the initial search state of its search tree to history (a, o), initial world states need to be sampled from $P(S \mid a, o, b_I)$ instead of $b_I$. As argued in Section 2.4.4, however, computing $P(S \mid a, o, b_I)$ exactly is computationally expensive for factored POMDPs. Silver and Veness solve this by using a particle filter approach that remembers the world states $s'$ encountered as samples for history (a, o) during planning and uses the relative frequency of occurrence of each $s'$ to represent an approximation of $P(S = s' \mid a, o, b_I)$. This approach can be transferred to interleaved POHTN planning by remembering the encountered world states for each search state $s^{IPH}$. When $s^{IPH}$ then becomes the new initial search state, the relative frequency of the world states encountered is used as a new initial belief state.
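A minimal sketch of such a particle-based belief approximation follows, assuming hashable world states; class and method names are illustrative, not the thesis' actual code.

import random

class ParticleBelief:
    # Particle-filter belief approximation in the style of Silver and Veness:
    # world states encountered at a search state during planning are kept as
    # particles; sampling uniformly from the particle list reproduces their
    # relative frequencies as an approximate belief.
    def __init__(self):
        self.particles = []

    def remember(self, world_state):
        self.particles.append(world_state)

    def sample(self):
        return random.choice(self.particles)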


Chapter 7.

Combining Hierarchical and History-based Planning

As noted in Section 5.6, the history-based and POHTN planning approaches have different strengths: while the POHTN approach quickly finds good policies, the history-based approach produces optimal policies in the limit. It is thus desirable to have an approach that combines the advantages of both in a single combination approach.

The arguably simplest combination approach builds two search trees, a history-based and a POHTN search tree, such that search effort is evenly divided between these trees by alternatingly performing an iteration of history-based and POHTN search in a round-robin fashion. When search is stopped, the qualities of the resulting history-based and POHTN policies are compared, and the better policy is executed. This approach bears some resemblance to root parallelization developed in the context of parallel MCTS [Chaslot et al., 2008], as well as Ensemble MCTS [Fern and Lewis, 2011] techniques.

The round-robin approach is obviously anytime optimal, since the history-based tree converges to an optimal policy as usual. Overall convergence will not be as fast, however, as both trees are updated every other iteration only. This means that the POHTN tree in the round-robin approach will not find good policies as quickly as dedicating all search effort to POHTN search only. Still, as can be seen from, e.g., Figure 5.7 for the Skill Teaching domain, the advantage of the POHTN approach can be big enough that the round-robin approach has an advantage over pure history-based search. The drawback of the round-robin approach is its wastefulness: once search is stopped, only one of the search trees is chosen, and the other is simply thrown away. It seems particularly unsatisfactory that the history-based search does not profit from the quality of policies already known in the POHTN tree, since the biggest issue that prevents history-based search from achieving better performance early during search is that it initially explores the search space blindly.
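A minimal sketch of the round-robin scheme follows; the four callables are assumed wrappers around the two planners and their root-value estimates, introduced here for illustration only.

def round_robin_search(pohtn_iteration, history_iteration,
                       pohtn_root_value, history_root_value, n_iterations):
    # Alternate single MCTS iterations between the two trees, then report
    # which tree's root policy is estimated to be better when search stops.
    for i in range(n_iterations):
        if i % 2 == 0:
            pohtn_iteration()
        else:
            history_iteration()
    if pohtn_root_value() >= history_root_value():
        return "POHTN"
    return "history-based"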

The literature on MCTS describes several applicable techniques that address the problem of initially blind search [Browne et al., 2012]. The underlying idea is to incorporate heuristic knowledge into the search procedure. This is realized by techniques such as

• first play urgency, which replaces the usually uniformly random order in which unexplored actions are tried with a heuristically chosen order,

• progressive bias, where a heuristic bias term is initially added to the action scores used for selecting actions (e.g., the UCT score in Equation 2.13), which shrinks as the number of iterations grows,

105

Page 114: Hierarchical Planning Under Uncertainty - Ulm · existing approaches combining hierarchical planning and uncertainty, the chapter concludes by proposing a new planning approach that

Chapter 7. Combining Hierarchical and History-based Planning

• or search seeding, where the value and visit count statistics V and C of nodes are initialized with heuristic values.

All three techniques preserve anytime optimality, because their influence vanishes as the number of iterations grows.
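As an illustration of the second technique, the following sketch adds a decaying heuristic term to a plain UCT score; the constants and the exact decay schedule are assumptions, and Equation 2.13 of the thesis defines only the unbiased UCT part.

import math

def progressive_bias_score(mean_value, visits, parent_visits,
                           heuristic_value, c=1.0, bias_weight=1.0):
    # Plain UCT term (cf. Equation 2.13) plus a heuristic bias that decays
    # with the visit count, so its influence vanishes in the limit and
    # anytime optimality is preserved. Requires visits >= 1.
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    bias = bias_weight * heuristic_value / (1.0 + visits)
    return mean_value + exploration + bias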

There are, however, two challenges in using these techniques to let the history-based search profit from the knowledge encoded in a POHTN hierarchy. First, the POHTN hierarchy alone does not represent useful heuristic knowledge, or at least only implicitly: search in the space of partially abstract LFSCs is required to determine good policies. Second, even assuming POHTN search has already been performed, the aforementioned techniques require heuristic knowledge to be specified for each decision node in the history-based search tree, i.e., per history. Decision nodes in the POHTN search tree are logical FSCs, though. Some means for making the knowledge encoded in the POHTN search tree available on a per-history basis is thus required.

The latter can be tackled by remembering a property already exploited to make search effort between the history-based and POHTN approaches comparable, namely that both approaches produce an execution trace in each round of search. The POHTN approach produces such a trace once a primitive LFSC fsc is reached and its execution is simulated. The idea is now to additionally use the produced traces to create and update a history-based tree, called the POHTN trace tree $T_{trace}$. For this, let $fsc_i$ denote the primitive LFSC reached in the i-th round of POHTN search. The simulated execution of $fsc_i$, which yields the value estimate $V^H_{fsc_i}{}'(s_I)$, is now implemented by running one iteration of history-based MCTS on $T_{trace}$, where the normal action selection function selectAction is replaced by selecting actions as recommended by $fsc_i$. This both creates an execution trace from which $V^H_{fsc_i}{}'(s_I)$ can be calculated for updating the POHTN search tree, as well as an updated POHTN trace tree $T'_{trace}$.

Under the right circumstances, the POHTN trace tree captures the best POHTN policy for each history reachable with an LFSC:

Theorem 7.1. In the above approach, let PH be a POHTN problem and let the MCTS algorithm used for the POHTN search employ an action selection function that selects every action infinitely often, such as UCT action selection in Equation 2.13 or uniform action selection in Equation 2.15. Let further Max-Monte-Carlo backups (Equation 2.14) or Soft-Max-Monte-Carlo backups (Equation 2.16) be the backup function used for updating the POHTN trace tree $T_{trace}$. Then, $T_{trace}$ converges to a history-based tree $T^*_{trace}$ where, for each decision node $n_d = (h, \mathcal{S}, C, V)$ with history h, the value estimate $V(n_d)$ represents the best possible expected value for h that can be gleaned with a controller $fsc \in L(PH)$.

Proof. Due to the choice of the action selection function, the POHTN search selects every possible decomposition and variable binding for every controller infinitely often. Therefore, every possible pseudo-primitive controller is visited infinitely often, and each possible execution trace is in expectation produced an unbounded number of times. Thus, every history h that can be reached by some controller in L(PH) and every action recommended for h by some controller in L(PH) are visited infinitely often. As stated in Section 2.4, these circumstances make decision node values converge to the value of the best successor chance node for both Max-Monte-Carlo backups and Soft-Max-Monte-Carlo backups.


In particular, this implies the following: let $T^*_{trace} = (\varepsilon, \mathcal{S}, C, V)$ be the root node of $T^*_{trace}$ (recall that the search tree is represented by its root node), where ε is the empty history, and let $fsc^*$ be a POHTN solution according to Definition 4.10. Then $V = V^H_{fsc^*}(s_I)$, i.e., the value at the root of $T^*_{trace}$ is the POHTN solution value.

Assuming a POHTN trace tree $T_{trace}$ (possibly $T^*_{trace}$) is known, all three aforementioned techniques for incorporating heuristic knowledge into the history-based approach are applicable: for first play urgency, unexplored actions are chosen in descending order of their value in $T_{trace}$; for progressive bias, the bias term for an action is its value in $T_{trace}$; and search seeding is performed by starting history-based search from $T_{trace}$ instead of an empty tree.

However, the first challenge mentioned above is still unsolved, namely that $T_{trace}$ needs to be computed before history-based search starts. This is inconvenient, because it is unclear when the POHTN search should be stopped. When it is stopped too early, $T_{trace}$ might not contain enough useful information to improve history-based search. When it is stopped too late, it could happen that POHTN search is still performed after $T_{trace}$ has already converged, leading to wasted search effort. Unfortunately, it is in general not easy to determine when POHTN search has converged, due to the nature of MCTS [Keller and Helmert, 2013].

To counter this issue, this thesis proposes a modified version of the search seeding technique, named incremental search seeding. Incremental search seeding combines search seeding with the round-robin approach mentioned in the beginning of this chapter. Instead of initially creating a POHTN trace tree $T_{trace}$ and then performing history-based search starting at $T_{trace}$, incremental search seeding creates a history-based tree $T_{combined}$ by alternating between seeding rounds and search rounds: a seeding round amounts to performing a round of POHTN search that updates both the usual POHTN search tree $T_{POHTN}$ and the history-based tree $T_{combined}$, as described for creating a POHTN trace tree, while search rounds are rounds of conventional history-based search on $T_{combined}$. Thus, heuristic knowledge for history-based search is generated in parallel with actual history-based search. The computed policy is defined on $T_{combined}$, i.e., it is a history-based policy just as for pure history-based search.
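A minimal sketch of the alternation follows; the two callables are assumed wrappers around the planners, and the seeding round is assumed to route its rollout through $T_{combined}$ as described above.

def incremental_search_seeding(seeding_round, search_round, n_iterations):
    # Alternate seeding rounds (one POHTN iteration whose simulated execution
    # trace also updates T_combined) with conventional history-based rounds
    # on T_combined, so heuristic knowledge is generated in parallel with
    # the actual history-based search.
    for i in range(n_iterations):
        if i % 2 == 0:
            seeding_round()    # updates T_POHTN and T_combined
        else:
            search_round()     # plain MCTS iteration on T_combined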

A simple way for ensuring convergence to an optimal policy is to stop POHTN search after a certain point, i.e., by making the search eventually become a pure history-based search. This can, for example, be done heuristically by monitoring the values of the root nodes of $T_{combined} = (\varepsilon, \mathcal{S}_{combined}, C_{combined}, V_{combined})$ and $T_{POHTN} = (fsc_I, \mathcal{S}_{POHTN}, C_{POHTN}, V_{POHTN})$: once one is sufficiently certain that $V_{combined} \geq V_{POHTN}$, it can be assumed that history-based search has surpassed POHTN search in quality, and POHTN search can be stopped.

7.1. Evaluation

Empirically, the most important question concerning the proposed approach is the following: for comparable search effort, does a combination approach outperform standard history-based search? For this, incremental search seeding was implemented as described above, except that a check to determine when the POHTN search can be stopped was not integrated for the sake of simplicity. Then, experiments were conducted on all four domains used in Chapter 5, in the same setting as described in Section 5.1 on page 65.



Results for the Navigation domain are shown in Figure 7.1. As can be seen, the combination approach greatly profits from the additional traces of the POHTN search: after a few iterations of search, the policy represented by $T_{combined}$ has the same value as the POHTN policy. Though hard to prove, it seems likely that the combination approach does not improve upon POHTN policy quality because the controller that represents the POHTN "hierarchy" in this domain, shown in Figure 5.10 on page 78, already is an optimal policy.

The combination approach also outperforms the history-based approach for the Home Theater Setup domain, as shown in Figure 7.2, though it cannot fully replicate POHTN policy quality within 10,000 iterations. Analysis of the distribution of immediate rewards over time steps during policy execution reveals that the advantage of the POHTN policies is fully transferred for the first few time steps, though, as shown in Figure 7.3. The number of time steps for which POHTN knowledge can be transferred diminishes with growing problem size, however. This behavior can be explained with an observation made in Section 5.3.3 on page 71, namely that history-based search tends to optimize actions for earlier time steps before it optimizes later time steps. Hence, information from the POHTN traces about good actions at early time steps is transferred before information about later time steps.

Figure 7.4 shows a similar picture for the Skill Teaching domain: again, the combination approach outperforms the pure history-based approach, and again, it is mostly the advantage in the first time steps that is transferred from the POHTN policies into $T_{combined}$, as shown in Figure 7.5.

Unfortunately, the same cannot be said for the Elevators domain, as can be seen from Figure 7.6: the combination approach outperforms the pure history-based approach for instance size 1, but this is not true for instance sizes 2 and above. Looking at the curves of the combination approach, it seems that a pattern common to all instance sizes exists, however: apparently, the combination approach outperforms the pure history-based approach after an intermediate period where it shows worse performance. This is confirmed by looking at Figure 7.7, which shows performance of all three approaches on Elevators problems of size 2 after 10 times the number of planning iterations, i.e., 100,000 iterations.

It can be conjectured that the combination approach shows this performance curve because, as argued above, the advantage of the POHTN policies is transferred time step by time step. To understand why this leads to the observed curves, consider the courses of action proposed by the pure history-based approach and by the policies that result from the POHTN hierarchy described in Section 5.5.2 on page 81 in the following world state: the elevator is on the bottom floor and one passenger waits at some floor. As can be seen from Figure 7.6, the policy of the pure history-based approach can be expected not to improve very much by planning, i.e., it closely resembles executing actions uniformly at random. With the POHTN policy, on the other hand, the elevator is moved to the floor where the passenger waits; the passenger is then picked up and transported to the top or the bottom floor, depending on where the passenger wants to go. The performance of the combination approach now depends on the number of time steps for which hierarchical knowledge has been transferred:

• When the number of planning iterations is low, the policy generated by the combination approach closely resembles random behavior. This means that it is unlikely to move the elevator to the floor where the passenger waits. This gives a constant immediate reward


[Figure 7.1: line plot panels for n = 1, 2, 3, 4, 5, 6, 7, 9, 12; x-axis: iterations (log scale), y-axis: policy quality; legend: POHTN, history-based, combination.]

Figure 7.1.: Plots showing policy quality in the Navigation domain for various problem sizes n dependent on the number of planning iterations. The ribbon represents the 95% confidence interval for the mean policy quality. Note that the scale on the x-axis is logarithmic. For all problem sizes, the combination approach yields the same policy quality as the POHTN approach after at most about 100 iterations.


[Figure 7.2: line plot panels for n = 1, 2, 3, 4, 5, 6, 7, 9, 12; x-axis: iterations (log scale), y-axis: policy quality; legend: POHTN, history-based, combination.]

Figure 7.2.: Plots showing policy quality in the Home Theater Setup domain for various problem sizes n dependent on the number of planning iterations. The ribbon represents the 95% confidence interval for the mean policy quality. Note that the scale on the x-axis is logarithmic. For all problem sizes, the combination approach significantly improves upon the quality of the pure history-based approach.


[Figure 7.3: line plot panels for n = 1, 2, 3, 4, 5, 6, 7, 9, 12; x-axis: time step, y-axis: immediate reward; legend: POHTN, history-based, combination.]

Figure 7.3.: Plots showing average immediate reward for each time step during policy execution for instances of various sizes n from the Home Theater Setup domain. The ribbon represents the 95% confidence interval for the immediate reward. It can be seen that the POHTN policies considerably outperform history-based policies. The combination approach greatly benefits from the POHTN execution traces and can replicate POHTN policy quality for the first few time steps.


[Figure: nine panels, one per problem size n ∈ {1, ..., 7, 9, 12}; y-axis: policy quality (−200 to 100); x-axis: planning iterations (10^1 to 10^3, logarithmic); curves for the POHTN, history-based, and combination approaches.]

Figure 7.4.: Plots showing policy quality in the Skill Teaching domain for various problem sizes n, dependent on the number of planning iterations. The ribbon represents the 95% confidence interval for the mean policy quality. Note that the scale on the x-axis is logarithmic. For all but the smallest problem sizes, the combination approach significantly improves upon the quality of the pure history-based approach.


[Figure: nine panels, one per problem size n ∈ {1, ..., 7, 9, 12}; y-axis: immediate reward (−5.0 to 5.0); x-axis: time step (0 to 30); curves for the POHTN, history-based, and combination approaches.]

Figure 7.5.: Plots showing average immediate reward for each time step during policy execution for instances of various sizes n from the Skill Teaching domain. The ribbon represents the 95% confidence interval for the immediate reward. It can be seen that quality for the first few time steps is very similar for all three approaches, but that policies found by the history-based approach are not optimized as well as POHTN policies for later time steps. To a certain extent, the combination approach can replicate the advantage of the POHTN policies: for problem size 6, for example, the history-based policies are roughly as good as POHTN policies for about the first 15 time steps, while the policies generated by the combination approach are as good as POHTN policies for about 25 time steps.


[Figure: nine panels, one per problem size n ∈ {1, ..., 7, 9, 12}; y-axis: policy quality (−300 to −100); x-axis: planning iterations (10^1 to 10^3, logarithmic); curves for the POHTN, history-based, and combination approaches.]

Figure 7.6.: Plots showing policy quality in the Elevators domain for various problem sizes n, dependent on the number of planning iterations. The ribbon represents the 95% confidence interval for the mean policy quality. Note that the scale on the x-axis is logarithmic. For problem size 1, the combination approach outperforms the history-based approach after an intermediate period where it shows worse performance. It seems that the combination approach shows a similar performance curve for larger instance sizes, though the length of the intermediate period grows with growing instance size.


[Figure: a single panel for problem size n = 2; y-axis: policy quality (−140 to −100); x-axis: planning iterations (10^1 to 10^5, logarithmic); curves for the POHTN, history-based, and combination approaches.]

Figure 7.7.: A plot showing policy quality in Elevators problems of size 2, dependent on the number of planning iterations. The total number of planning iterations is increased to 100,000. The ribbon represents the 95% confidence interval for the mean policy quality. Note that the scale on the x-axis is logarithmic. It can be seen that the combination approach indeed outperforms the pure history-based approach when given additional planning time.



• After an intermediate number of iterations, the combination approach has reached a policy that is able to pick passengers up, but behaves randomly after that. Reward in this situation is either −0.75 or −3 with roughly equal probability, depending on whether the elevator is heading in the correct direction. This gives an expected immediate reward of −1 per time step before the passenger is picked up, and about −1.875 afterwards (see the calculation after this list). Hence, the policy is worse than the random policy despite being closer to a good policy.

• Once knowledge about the correct direction of travel is transferred, policy quality can reach and eventually improve upon POHTN policy quality.
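The size of the intermediate dip follows directly from the two post-pickup outcomes stated above: with both occurring with roughly equal probability, the expected immediate reward after pickup is

\[ \mathbb{E}[r] = \tfrac{1}{2}\cdot(-0.75) + \tfrac{1}{2}\cdot(-3) = -1.875, \]

which lies below the constant −1 per time step that purely random behavior yields before pickup.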

To summarize, the random policy represents a local maximum with respect to policy quality. The information from the POHTN policy execution traces helps the non-hierarchical search escape this local maximum in favor of a considerably more useful policy.

The question remains why transferring hierarchical knowledge takes more iterations in the Elevators domain than in the other evaluation domains. The reason could lie in the fact that the history-based policy representation is inherently ill-suited for this kind of domain: a good policy always needs to react in the same manner once a passenger arrives: pick the passenger up and move to the appropriate floor, either top or bottom. This is independent of whether the current time step is 1, 5, or 10, and also independent of how many passengers have already been fetched and delivered. For the history-based approach, however, all these situations correspond to different histories, and hence it re-computes good behavior for each such situation separately. In other words, it does not generalize over similar situations, which would explain why the performance of the pure history-based approach improves only very slowly for larger problem sizes. In consequence, the combination approach can benefit from POHTN traces only in a limited manner, since a POHTN trace corresponds to information of the kind "if a passenger who wants to go to the top floor arrives at floor 2 in time step 3 and no-one has arrived anywhere else, it is rewarding to go pick the passenger up and deliver them to the top floor".
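To make the lack of generalization concrete, consider the following minimal Scala sketch (all names are hypothetical and not taken from the actual implementation): value estimates keyed by full action-observation histories are updated independently, so experience gathered for one history never informs another history that demands the same reaction.

// Minimal illustration (hypothetical names): estimates keyed by the full
// history share nothing between histories that call for identical behavior.
object HistoryValueSketch {
  type Action = String
  type Observation = String
  type History = List[(Action, Observation)]

  val estimates = scala.collection.mutable.Map.empty[(History, Action), Double]

  def update(h: History, a: Action, ret: Double): Unit = {
    val old = estimates.getOrElse((h, a), 0.0)
    estimates((h, a)) = old + 0.1 * (ret - old) // incremental running estimate
  }

  def main(args: Array[String]): Unit = {
    // a passenger has just appeared at floor 2 in both histories
    val h1: History = List(("noop", "arrived-floor-2"))
    val h2: History = List(("move-up", "none"), ("noop", "arrived-floor-2"))
    update(h1, "move-to-floor-2", 10.0)
    println(estimates.get((h2, "move-to-floor-2"))) // None: nothing transfers
  }
}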

7.2. Discussion

Overall, it can be said that the combination approach succeeds in combining the advantages of the POHTN and history-based planning approaches: the combination approach is both anytime optimal and empirically superior to the pure history-based approach, although the latter holds only partially in the Elevators domain. It would be interesting to see whether better performance of both the history-based and the combination approach could be achieved in the Elevators domain by applying dimensionality reduction techniques [Cox and Cox, 2000], i.e., by letting search states be collections of features of histories instead of full histories. Dimensionality reduction has, e.g., been successfully applied to belief state-based POMDP planning methods in an approach called belief compression [Roy et al., 2005].
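A sketch of what such a feature-based reduction could look like, under the assumption of hand-picked features for an Elevators-like domain (the feature choice and all names are hypothetical; this is not an implementation of belief compression): histories are mapped to a small feature vector, and search statistics are shared between all histories with the same features.

// Sketch of history dimensionality reduction (feature choice hypothetical).
object HistoryFeatureSketch {
  type Action = String
  type Observation = String
  type History = List[(Action, Observation)]

  // example features: is a passenger waiting, and on which floor;
  // the time step and the number of delivered passengers are dropped
  final case class Features(waitingFloor: Option[Int])

  def features(h: History): Features =
    Features(h.collectFirst {
      case (_, obs) if obs.startsWith("arrived-floor-") =>
        obs.stripPrefix("arrived-floor-").toInt
    })

  // search statistics shared by all histories with equal features
  val visits = scala.collection.mutable.Map.empty[Features, Int].withDefaultValue(0)

  def main(args: Array[String]): Unit = {
    val h1: History = List(("noop", "arrived-floor-2"))
    val h2: History = List(("move-up", "none"), ("noop", "arrived-floor-2"))
    visits(features(h1)) += 1
    visits(features(h2)) += 1
    println(visits(features(h2))) // 2: both histories map to one search state
  }
}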

It should be noted that there are many other conceivable ways to combine the history-based and POHTN approaches. For example, the proposed round-robin alternation between seeding rounds and search rounds could be replaced by more elaborate mechanisms such as UCB1 [Auer et al., 2002]. It would also be possible to explore approaches akin to planning with macro actions [He et al., 2010; He et al., 2011; Theocharous and Kaelbling, 2004] by converting methods into macro actions. Finally, the history-based approach can also be combined with the interleaved POHTN approach instead of the standard POHTN approach in the same manner as described above.
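For illustration, UCB1 could treat seeding rounds and search rounds as two bandit arms and select, before each round, the arm maximizing the mean observed benefit plus the exploration bonus sqrt(2 ln n / n_i). The following Scala sketch shows only the selection rule; the arm set and the reward signal are hypothetical, not part of the thesis' mechanism.

// UCB1 sketch for alternating between seeding and search rounds.
object Ucb1RoundSelection {
  final case class Arm(name: String, var pulls: Int = 0, var mean: Double = 0.0)

  // UCB1 value: empirical mean plus exploration bonus sqrt(2 ln n / n_i)
  def ucb1(arm: Arm, n: Int): Double =
    if (arm.pulls == 0) Double.PositiveInfinity
    else arm.mean + math.sqrt(2.0 * math.log(n) / arm.pulls)

  // stand-in for running one planning round and measuring its benefit,
  // e.g. the improvement in estimated policy quality (hypothetical signal)
  def runRound(kind: String): Double =
    if (kind == "seeding") 0.6 else 0.5

  def main(args: Array[String]): Unit = {
    val arms = Seq(Arm("seeding"), Arm("search"))
    for (n <- 1 to 1000) {
      val arm = arms.maxBy(ucb1(_, n))
      val reward = runRound(arm.name)
      arm.pulls += 1
      arm.mean += (reward - arm.mean) / arm.pulls // incremental mean update
    }
    arms.foreach(a => println(s"${a.name}: ${a.pulls} rounds"))
  }
}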


Chapter 8.

Conclusion and Future Work

8.1. Conclusion

The main goal of this thesis was to create an HTN-like planning approach suited for factored POMDP planning domains. As a prerequisite, the LFSC data structure was introduced as a suitable policy representation that allows domain experts to compactly specify useful behavior using finite state automata whose node transitions are governed by conditions over observations, formulated in first-order logic. The concepts of abstract actions and methods were then reintroduced based on LFSCs to create the POHTN approach. As in HTN planning, a POHTN problem is thus given by an initial plan, here a partially abstract LFSC, and a hierarchy that consists of abstract actions and methods, which specify ways of refining an abstract action into method LFSCs. Important differences to HTN planning are, first, the feature of abstract observations, which represents the key ingredient for passing on important information about observations made during the execution of a method LFSC after an abstract action has ended. Second, since POMDPs are optimization problems, POHTN problems are optimization problems as well, i.e., the goal is to find the best possible LFSC that can be created with the action hierarchy. Still, it was shown that the POHTN approach can be interpreted as a generalization of the deterministic totally ordered HTN approach.

Next, Monte-Carlo tree search was identified as a suitable choice for planning in the POHTN approach, since it avoids the prohibitively expensive task of computing the quality of LFSCs exactly. For this, a given POHTN problem is cast as a special kind of MDP that represents planning as search in the space of partially abstract LFSCs. This thesis discussed the necessary steps and prerequisites for this technique to generate POHTN solutions. These are (1) specifying a suitable order in which abstract nodes are decomposed, namely in ascending order of distance to the initial node, (2) ensuring decomposability, i.e., that all abstract actions can be refined into primitive controllers, and (3) ensuring monotonicity, i.e., the guarantee that iteratively refining partially abstract LFSCs leads to a fully primitive LFSC after a finite number of steps.

The approach was then implemented and evaluated on generated problems of different sizes from four POMDP domains: the Home Theater Setup domain, and the domains Skill Teaching, Navigation, and Elevators from the 2011 International Probabilistic Planning Competition. For all four domains, a compact POHTN hierarchy could be specified that exploits hierarchical structure present in these domains. The empirical performance was compared to the history-based approach, a state-of-the-art non-hierarchical POMDP planning approach. The basis for this comparison was chosen to be the number of planning iterations, since both approaches simulate the execution of the same number of actions per iteration. It could then be shown that, with comparable computational effort, the POHTN approach led to policies of superior quality on all four evaluation domains, and that it scales well with problem size. The advantage of the POHTN approach is particularly pronounced when the number of planning iterations is small. This property is useful in the context of assistance systems, where a large number of planning iterations corresponds to the time the user needs to wait before the system can give recommendations.

The POHTN approach was then further enhanced with a technique that interleaves controller construction and simulated action execution during planning. Empirically, this interleaved approach shows performance comparable to the POHTN approach. However, it has the theoretical advantage of limiting the LFSC construction overhead to linear in the POMDP horizon, where the normal POHTN approach can only guarantee an exponential limit. A further advantage is that it is suited for online planning, where the agent alternates between planning and executing actions in the real world. In contrast, the normal POHTN approach uses an offline planning setting, where all planning happens before policy execution starts. The online setting seems better suited to the context of assistance systems, since it allows the assistance system to use the time taken by the user to execute recommendations for further deliberation on its next recommendations.
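Schematically, such an online setting interleaves the two phases in a loop of the following shape (a sketch only; the planner and environment interfaces are hypothetical):

// Schematic online planning loop: plan briefly, execute one recommendation,
// and let the observation condition the subsequent search.
trait OnlinePlanner {
  def improve(iterations: Int): Unit             // run more planning iterations
  def bestAction(): String                       // current recommendation
  def observe(action: String, obs: String): Unit // condition search on outcome
}

trait Environment {
  def execute(action: String): String            // returns an observation
}

object OnlinePlanningLoop {
  def run(planner: OnlinePlanner, env: Environment, horizon: Int): Unit =
    for (_ <- 1 to horizon) {
      planner.improve(iterations = 100) // deliberate while the user acts
      val action = planner.bestAction()
      val obs = env.execute(action)
      planner.observe(action, obs)
    }
}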

Finally, it was investigated how hierarchical and non-hierarchical planning can be combined in a single approach that can both quickly generate policies of good quality and find optimal POMDP policies that might not be representable in a given POHTN hierarchy. For this, iterations of hierarchical and non-hierarchical search were performed in alternation, and the information on the simulated execution of POHTN policies was used for guiding the non-hierarchical search. It could be shown that this combination approach is empirically superior to pure history-based planning.

8.2. Future Work

There are several interesting directions for future research.

8.2.1. Further Improvements to the LFSC Policy Representation

The logical FSC representation presented in Chapter 3 could be further improved upon by choosing a factored controller state representation that applies the principles employed for representing world states and observations in factored POMDPs (cf. Section 2.3.5) to the set of controller nodes N. In such a controller, a controller state would be an interpretation of controller state fluents F_controller-state, i.e., N = I(F_controller-state, D), where each fluent can be understood as a variable in the programming language sense. Transitions between controller states would be defined similarly to state transitions in factored POMDPs, i.e., in terms of (deterministic) CPFs for controller state fluents that depend on the current controller state and the received observation. Each time an observation is received, a new controller state is determined by computing the values of all controller state fluents using the CPFs. The action association function would be a deterministic CPF that depends on the current controller state.

Generalizing the POHTN approach to this controller representation would allow domain experts to state their domain knowledge even more comfortably and succinctly. For example, in the Elevators domain, a controller state fluent could be defined that tracks the current floor of the elevator. The result would be an approach where a POHTN problem can be seen as a partial specification of a program whose abstract controller nodes correspond to placeholders for concrete function calls, and where solving the POHTN problem produces an executable program. Ideas in a similar direction have been investigated in the context of the HAM approach, which resulted in the ALisp language [Andre and Russell, 2002; Andre, 2003]. A more powerful POHTN modeling language could also support parallel actions, which would be useful for, e.g., controlling multiple elevators at once in the Elevators domain.
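In Scala terms, such a factored controller could be sketched as follows (the fluent choice and all names are hypothetical illustrations of the proposed representation, not part of the implementation):

// Sketch of a factored controller state: the controller state is an
// assignment to controller state fluents rather than an opaque node, and
// both the transition and the action association are deterministic
// functions of that assignment.
object FactoredControllerSketch {
  final case class ControllerState(currentFloor: Int) // one fluent: the floor

  sealed trait Obs
  case object MovedUp extends Obs
  case object MovedDown extends Obs
  case object NoChange extends Obs

  // deterministic CPF: next controller state from current state + observation
  def transition(s: ControllerState, o: Obs): ControllerState = o match {
    case MovedUp   => s.copy(currentFloor = s.currentFloor + 1)
    case MovedDown => s.copy(currentFloor = s.currentFloor - 1)
    case NoChange  => s
  }

  // deterministic action association based on the tracked fluent
  def action(s: ControllerState): String =
    if (s.currentFloor > 0) "move-down" else "open-door"
}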

8.2.2. Policy Explanation

The ability to justify decisions is crucial for assistance systems in building and maintaining their user's trust [Lim et al., 2009; Nothdurft et al., 2014b]. A user might ask why they should perform an action recommended by an assistance system, e.g., they might want to know how plugging an HDMI cable into the amplifier helps in connecting the Bluray player and the TV. In HTN planning, the hierarchical structure of HTN plans offers a suitable basis for generating such explanations in a simple and effective way [Seegebarth et al., 2012; Bercher et al., 2014]. By construction, this technique can be directly transferred to the POHTN setting, enabling the generation of explanations of the form "Plugging the HDMI cable into the amplifier is part of connecting the amplifier to the TV, which in turn is part of connecting the Bluray player to the TV" from the hierarchy shown in Sections 5.2 and 4.1.
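Concretely, such an explanation can be produced by following parent links in the decomposition tree from a recommended action up to the task it ultimately serves. A minimal Scala sketch (the tree encoding is hypothetical):

// Sketch of hierarchy-based explanation generation: walking up the
// decomposition tree yields an "is part of" chain for a recommended action.
object ExplanationSketch {
  // maps each task to the more abstract task it refines
  val parent: Map[String, String] = Map(
    "plugging the HDMI cable into the amplifier" -> "connecting the amplifier to the TV",
    "connecting the amplifier to the TV"         -> "connecting the Bluray player to the TV"
  )

  def explain(task: String): String = {
    val chain = Iterator.iterate(Option(task))(_.flatMap(parent.get))
      .takeWhile(_.isDefined).flatten.toList
    chain match {
      case first :: second :: rest =>
        s"$first is part of $second" +
          rest.map(t => s", which in turn is part of $t").mkString
      case _ => s"$task serves no higher task"
    }
  }

  def main(args: Array[String]): Unit =
    println(explain("plugging the HDMI cable into the amplifier"))
}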

Few other approaches exist that try to generate such explanations for sequential decision making problems in the presence of uncertainty. A technique for generating similar explanations exists for fully observable MDPs [Khan et al., 2009]. It assumes knowledge of an optimal policy, however, which is a computationally difficult prerequisite to fulfill.

8.2.3. Monotonicity

As noted in Section 4.2.3, it is currently unclear whether a normal form for POHTN hierarchies exists where the monotonicity property of Definition 4.17 is fulfilled. Having such a guarantee would mean that every POHTN problem could be transformed into an equivalent monotone POHTN problem, effectively removing the monotonicity requirement from Theorem 4.20 and making MCTS applicable to all decomposable POHTN problems.


Appendix A.

CD Contents

Apart from this thesis as a PDF file (thesis.pdf), the enclosed CD contains the following additional material in the folder material:

• the RDDL files for all evaluation

– domains (<domain>_pomdp.rddl) and

– hierarchies (<domain>_hierarchical_pomdp.rddl), including

– example instances (<domain>_inst_pomdp__1.rddl) and corresponding

– initial controllers (<domain>_hierarchical_inst_pomdp__1.rddl).

• an executable JAR file (evaluation-diss-1.0.jar) for generating the data used in the evaluation in Chapter 5. For example, generating data for Elevators problems of size 3 for both the history-based and POHTN approaches, using 100 random seeds with 10,000 iterations each, and approximating policy quality by averaging over 40 simulated executions can be done using the following command on a Linux system:

nohup nice java -Xss16m -Xmx300G -Dorg.slf4j.simpleLogger.defaultLogLevel=DEBUG \
  -jar evaluation-diss-1.0.jar --iterations 10000 --precision 40 \
  --domain elevators --seeds 0:99 --approach history,hierarchy \
  --representation optimized --desired-parallelism 20 --problem.size 3 \
  > elevators3.csv 2> elevators3.log &

• the R code for generating the plots shown in Chapter 5 (eval.Rmd). Using the RStudio software¹, Figure 5.12, for example, can be generated and saved using the following commands, given that the file elevators.csv contains data generated as above for problem sizes 1–7, 9 and 12:

elevators <- read_file("elevators.csv")
elevators %>% scatter.approaches
save("scatter_plot_elevators.pdf")

• the complete Scala source code of the implementation as a collection of several SBT projects in the folder code.

¹ https://www.rstudio.com/


Furthermore, the folder publications contains a list of all publications the author contributed to as a PDF file (publications.pdf), as well as PDF files of these publications in the subfolder pdfs.


Appendix B.

RDDL Code for the Home Theater Setup Domain

The complete RDDL code for the Home Theater Setup domain is given in the following sections.

B.1. Domain

The RDDL code for the Home Theater Setup domain definition is given in the following listing:

domain home_theater_assembly_pomdp {

  requirements = {
    multivalued,
    intermediate-nodes,
    partially-observed
  };

  types {
    Device : object;       // TV, Bluray player, long cables, short adapters, everything. Port 'genders' define possible combinations
    SignalType : object;   // audio, video, ...
    SignalSource : object; // where a signal ultimately comes from, e.g. Bluray player
    Port : object;         // a port such as HDMI, cinch, ...
    count : {@zero, @one, @two, @three};  // numbers for counting the number of free ports
    tightness : {@none, @loose, @tight};  // how tight a connection is
  };

  pvariables {

    // how probable it is that connecting results in a loose connection
    LOOSENESS_FACTOR : { non-fluent, real, default = 0.2 };

    // whether a port transports a signal type
    PORT_TRANSPORTS_SIGNAL_TYPE(Port, SignalType) : { non-fluent, bool, default = false };

    // whether a device needs a signal
    DEVICE_NEEDS_SIGNAL(Device, SignalType, SignalSource) : { non-fluent, bool, default = false };

    // whether the available signals of a given type on a device can be checked
    DEVICE_IS_CHECKABLE(Device, SignalType) : { non-fluent, bool, default = false };

    // if true, the agent will only receive reward once all signals are brought to the right device
    WHOLE_REWARD_AT_ONCE : { non-fluent, bool, default = false };

    // whether executing useless actions instead of noop is punished
    PUNISH_USELESS_ACTIONS : { non-fluent, bool, default = true };

    // the number of free female ports of a given type on a given device
    freeFemalePorts(Device, Port) : { state-fluent, count, default = @zero };

    // the number of free male ports of a given type on a given device
    freeMalePorts(Device, Port) : { state-fluent, count, default = @zero };

    // whether a device has a signal of a certain type originating from a given device
    hasSignal(Device, SignalType, SignalSource) : { state-fluent, bool, default = false };

    // whether two devices are connected via a given port
    connected(Device, Device, Port) : { state-fluent, tightness, default = @none };

    // true, if a connect was executed and it worked
    connectSucceeded(Device, Device, Port) : { interm-fluent, tightness, level = 1 };

    // observation whether a device has a given signal
    hasSignalObs(Device, SignalType, SignalSource) : { observ-fluent, bool };

    // connects two devices at the given ports
    connect(Device, Device, Port) : { action-fluent, bool, default = false };

    // checks which signals a device has, given that it can be observed on the device
    checkSignals(Device) : { action-fluent, bool, default = false };

    // tightens a loose connection
    tighten(Device, Device, Port) : { action-fluent, bool, default = false };
  };

  cpfs {

    hasSignal'(?d, ?t, ?s) =
      if (hasSignal(?d, ?t, ?s)) then KronDelta(true)
      else if (exists_{?p : Port, ?d2 : Device} [(connectSucceeded(?d, ?d2, ?p) == @tight | connectSucceeded(?d2, ?d, ?p) == @tight | connected(?d, ?d2, ?p) == @tight | connected(?d2, ?d, ?p) == @tight) ^ PORT_TRANSPORTS_SIGNAL_TYPE(?p, ?t) ^ hasSignal(?d2, ?t, ?s)]) then KronDelta(true)
      else KronDelta(false);

    connectSucceeded(?d1, ?d2, ?p) =
      if (connect(?d1, ?d2, ?p)
          ^ (freeFemalePorts(?d1, ?p) ~= @zero ^ freeMalePorts(?d2, ?p) ~= @zero | freeMalePorts(?d1, ?p) ~= @zero ^ freeFemalePorts(?d2, ?p) ~= @zero)) then
        Discrete(tightness, @none : 0, @loose : LOOSENESS_FACTOR, @tight : 1.0 - LOOSENESS_FACTOR)
      else if (tighten(?d1, ?d2, ?p) ^ connected(?d1, ?d2, ?p) == @loose) then KronDelta(@tight)
      else KronDelta(@none);

    connected'(?d1, ?d2, ?p) =
      if (connected(?d1, ?d2, ?p) == @tight | connectSucceeded(?d1, ?d2, ?p) == @tight | connectSucceeded(?d2, ?d1, ?p) == @tight) then KronDelta(@tight)
      else if (connected(?d1, ?d2, ?p) == @loose | connectSucceeded(?d1, ?d2, ?p) == @loose | connectSucceeded(?d2, ?d1, ?p) == @loose) then KronDelta(@loose)
      else KronDelta(@none);

    freeFemalePorts'(?d, ?p) =
      // assumption: there are never both a female and a male version of the same port on both devices.
      // If this assumption fails, both the male and the female versions will be considered as connected
      if (exists_{?d2 : Device} [connectSucceeded(?d, ?d2, ?p) ~= @none | connectSucceeded(?d2, ?d, ?p) ~= @none]) then
        [ if (freeFemalePorts(?d, ?p) == @zero) then KronDelta(@zero)
          else if (freeFemalePorts(?d, ?p) == @one) then KronDelta(@zero)
          else if (freeFemalePorts(?d, ?p) == @two) then KronDelta(@one)
          else KronDelta(@two) ]
      else KronDelta(freeFemalePorts(?d, ?p));

    freeMalePorts'(?d, ?p) =
      // assumption: there are never both a female and a male version of the same port on both devices.
      // If this assumption fails, both the male and the female versions will be considered as connected
      if (exists_{?d2 : Device} [connectSucceeded(?d, ?d2, ?p) ~= @none | connectSucceeded(?d2, ?d, ?p) ~= @none]) then
        [ if (freeMalePorts(?d, ?p) == @zero) then KronDelta(@zero)
          else if (freeMalePorts(?d, ?p) == @one) then KronDelta(@zero)
          else if (freeMalePorts(?d, ?p) == @two) then KronDelta(@one)
          else KronDelta(@two) ]
      else KronDelta(freeMalePorts(?d, ?p));

    hasSignalObs(?d, ?t, ?s) =
      if (checkSignals(?d) ^ DEVICE_IS_CHECKABLE(?d, ?t)) then KronDelta(hasSignal(?d, ?t, ?s))
      else KronDelta(false);
  };

  reward =
    // bring signals to destinations
    [ if (WHOLE_REWARD_AT_ONCE) then
        if (forall_{?d : Device, ?t : SignalType, ?s : SignalSource} [DEVICE_NEEDS_SIGNAL(?d, ?t, ?s) => hasSignal'(?d, ?t, ?s)]) then
          [ sum_{?d : Device, ?t : SignalType, ?s : SignalSource} [DEVICE_NEEDS_SIGNAL(?d, ?t, ?s)] ]
        else 0
      else
        [ sum_{?d : Device, ?t : SignalType, ?s : SignalSource} [DEVICE_NEEDS_SIGNAL(?d, ?t, ?s) * hasSignal'(?d, ?t, ?s)] ] ]
    -
    [ if (PUNISH_USELESS_ACTIONS) then
        // avoid failing connects
        [ sum_{?d1 : Device, ?d2 : Device, ?p : Port} [connect(?d1, ?d2, ?p) * (connectSucceeded(?d1, ?d2, ?p) == @none)] ]
        // avoid tighten without prior connect or when already tight
        + [ sum_{?d1 : Device, ?d2 : Device, ?p : Port} [tighten(?d1, ?d2, ?p) * (connected(?d1, ?d2, ?p) ~= @loose)] ]
        // avoid checking uncheckable devices
        + [ sum_{?d : Device} [checkSignals(?d) * ~exists_{?t : SignalType} DEVICE_IS_CHECKABLE(?d, ?t)] ]
        // punish doing things once goal is achieved
        + [ if ([forall_{?d : Device, ?t : SignalType, ?s : SignalSource} [DEVICE_NEEDS_SIGNAL(?d, ?t, ?s) => hasSignal'(?d, ?t, ?s)]] ^ exists_{?d1 : Device, ?d2 : Device, ?p : Port} [tighten(?d1, ?d2, ?p) | connect(?d1, ?d2, ?p) | checkSignals(?d1)]) then 1 else 0 ]
      else 0 ];
}

B.2. Example Instance

The following listing gives the RDDL code for a simple Home Theater Setup problem instance with a satellite receiver, a TV and one HDMI cable:

non-fluents nf_home_theater_assembly_inst__1 {
  domain = home_theater_assembly_pomdp;
  objects {
    Device : {e_SatReceiver, c_HDMI_1, e_TV};
    SignalType : {audio, video};
    SignalSource : {sat};
    Port : {p_HDMI};
  };
  non-fluents {
    PORT_TRANSPORTS_SIGNAL_TYPE(p_HDMI, audio);
    PORT_TRANSPORTS_SIGNAL_TYPE(p_HDMI, video);

    DEVICE_NEEDS_SIGNAL(e_TV, audio, sat);
    DEVICE_NEEDS_SIGNAL(e_TV, video, sat);

    DEVICE_IS_CHECKABLE(e_TV, audio);
    DEVICE_IS_CHECKABLE(e_TV, video);
  };
}

instance home_theater_assembly_inst_pomdp__1 {
  domain = home_theater_assembly_pomdp;
  non-fluents = nf_home_theater_assembly_inst__1;
  init-state {
    freeFemalePorts(e_SatReceiver, p_HDMI) = @one;
    freeFemalePorts(e_TV, p_HDMI) = @one;
    freeMalePorts(c_HDMI_1, p_HDMI) = @two;

    hasSignal(e_SatReceiver, audio, sat);
    hasSignal(e_SatReceiver, video, sat);
  };
  max-nondef-actions = 1;
  horizon = 10;
  discount = 1.0;
}

B.3. Hierarchy

The following listing gives the RDDL definition of the POHTN hierarchy chosen for the Home Theater Setup domain shown in Section B.1:

hierarchy home_theater_assembly_pomdp_hierarchy {

  domain = home_theater_assembly_pomdp;

  abstracts {
    connect_abstract(Device, Device) : abstract-action;
    loose_connection_found : abstract-observation;
  };

  methods {

    method connect_abstract_recursive(?d1 : Device, ?d2 : Device, ?d3 : Device, ?p : Port) {
      abstract-action = connect_abstract(?d1, ?d2);

      controller-node s0 : connect(?d1, ?d3, ?p);
      controller-node s1 : connect_abstract(?d3, ?d2);
      controller-node s2 : tighten(?d1, ?d3, ?p);
      terminal-node good : {
        loose_connection_found = KronDelta(false);
      };
      terminal-node bad : {
        loose_connection_found = KronDelta(true);
      };

      controller-edge s0 -> s1 : true;
      controller-edge s1 -> good : ~loose_connection_found;
      controller-edge s1 -> s2 : loose_connection_found;
      controller-edge s2 -> bad : true;

      initial-node = s0;
    };

    method connect_abstract_double(?d1 : Device, ?d2 : Device) {
      abstract-action = connect_abstract(?d1, ?d2);

      controller-node s0 : connect_abstract(?d1, ?d2);
      controller-node s1 : connect_abstract(?d1, ?d2);
      controller-node s2 : connect_abstract(?d1, ?d2);
      terminal-node good : {
        loose_connection_found = KronDelta(false);
      };
      terminal-node bad : {
        loose_connection_found = KronDelta(true);
      };

      controller-edge s0 -> s1 : loose_connection_found;
      controller-edge s0 -> s2 : ~loose_connection_found;
      controller-edge s1 -> bad : true;
      controller-edge s2 -> bad : loose_connection_found;
      controller-edge s2 -> good : ~loose_connection_found;

      initial-node = s0;
    };

    method connect_abstract_stop(?d1 : Device, ?d2 : Device, ?p : Port) {
      abstract-action = connect_abstract(?d1, ?d2);

      controller-node s0 : connect(?d1, ?d2, ?p);
      controller-node s1 : checkSignals(?d2);
      controller-node s2 : tighten(?d1, ?d2, ?p);
      terminal-node good : {
        loose_connection_found = KronDelta(false);
      };
      terminal-node bad : {
        loose_connection_found = KronDelta(true);
      };

      controller-edge s0 -> s1 : true;
      controller-edge s1 -> good : exists_{?t : SignalType, ?s : SignalSource} [hasSignalObs(?d2, ?t, ?s)];
      controller-edge s1 -> s2 : ~exists_{?t : SignalType, ?s : SignalSource} [hasSignalObs(?d2, ?t, ?s)];
      controller-edge s2 -> bad : true;

      initial-node = s0;
    };

    method connect_abstract_noop(?d1 : Device, ?d2 : Device) {
      abstract-action = connect_abstract(?d1, ?d2);

      controller-node s0 : checkSignals(?d2);
      terminal-node good : {
        loose_connection_found = KronDelta(false);
      };
      terminal-node bad : {
        loose_connection_found = KronDelta(true);
      };

      controller-edge s0 -> good : exists_{?t : SignalType, ?s : SignalSource} [hasSignalObs(?d1, ?t, ?s)];
      controller-edge s0 -> bad : ~exists_{?t : SignalType, ?s : SignalSource} [hasSignalObs(?d1, ?t, ?s)];

      initial-node = s0;
    };
  };
}

B.4. Example Initial Controller

The following listing gives the RDDL code for the initial controller of the example instance shown in Section B.2 using the hierarchy of Section B.3:

hierarchy-instance home_theater_assembly_hierarchical_inst_pomdp__1 {
  instance = home_theater_assembly_inst_pomdp__1;
  hierarchy = home_theater_assembly_pomdp_hierarchy;
  initial-controller {
    controller-node s0 : connect_abstract(e_SatReceiver, e_TV);
    controller-node s1 : noop;

    controller-edge s0 -> s1 : true;
    controller-edge s1 -> s1 : true;

    initial-node = s0;
  };
}


Bibliography[Alford et al., 2009] Ron Alford, Ugur Kuter, and Dana Nau. Translating HTNs to PDDL:

A small amount of domain knowledge can go a long way. In Proceedings of the 21st In-ternational Jont Conference on Artifical Intelligence (IJCAI 2009), pages 1629–1634, SanFrancisco, CA, USA, 2009. AAAI Press.

[Alford et al., 2015] Ron Alford, Pascal Bercher, and David Aha. Tight bounds for HTN plan-ning. In Proceedings of the 25th International Conference on Automated Planning andScheduling (ICAPS 2015), pages 7–15. AAAI Press, 2015.

[Alford et al., 2016] Ron Alford, Gregor Behnke, Daniel Höller, Pascal Bercher, Susanne Bi-undo, and David Aha. Bound to plan: Exploiting classical heuristics via automatic transla-tions of tail-recursive htn problems. In Proceedings of the 26th International Conference onAutomated Planning and Scheduling (ICAPS 2016), pages 20–28. AAAI Press, 2016.

[Andre and Russell, 2002] David Andre and Stuart J. Russell. State abstraction for pro-grammable reinforcement learning agents. In AAAI/IAAI, pages 119–125, 2002.

[Andre, 2003] David Andre. Programmable reinforcement learning agents. PhD thesis, Uni-versity of California at Berkeley, 2003.

[Auer et al., 2002] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis ofthe multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.

[Behnke et al., 2016] Gregor Behnke, Daniel Höller, Pascal Bercher, and Susanne Biundo.Change the plan - how hard can that be? In Proceedings of the 26th International Con-ference on Automated Planning and Scheduling (ICAPS 2016), pages 38–46. AAAI Press,2016.

[Bercher et al., 2014] Pascal Bercher, Susanne Biundo, Thomas Geier, Thilo Hoernle, FlorianNothdurft, Felix Richter, and Bernd Schattenberg. Plan, repair, execute, explain - how plan-ning helps to assemble your home theater. In Proceedings of the 24th International Confer-ence on Automated Planning and Scheduling (ICAPS 2014), pages 386–394. AAAI Press,2014.

[Bercher et al., 2015] Pascal Bercher, Felix Richter, Thilo Hörnle, Thomas Geier, DanielHöller, Gregor Behnke, Florian Nothdurft, Frank Honold, Wolfgang Minker, Michael We-ber, and Susanne Biundo. A planning-based assistance system for setting up a home theater.In Proceedings of the 29th National Conference on Artificial Intelligence (AAAI 2015), pages4264–4265. AAAI Press, 2015.

133

Page 142: Hierarchical Planning Under Uncertainty - Ulm · existing approaches combining hierarchical planning and uncertainty, the chapter concludes by proposing a new planning approach that

Bibliography

[Bercher et al., 2016] Pascal Bercher, Felix Richter, Thilo Hörnle, Thomas Geier, DanielHöller, Gregor Behnke, Florian Nothdurft, Frank Honold, Felix Schüssel, Stephan Reuter,Wolfgang Minker, Michael Weber, Klaus Dietmayer, and Susanne Biundo. Advanced userassistance for setting up a home theater. In Companion Technology – A Paradigm Shift inHuman-Technology Interaction. Springer, 2016. to appear.

[Bertsekas, 2000] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control. AthenaScientific, 2nd edition, 2000.

[Bidot et al., 2008] Julien Bidot, Susanne Biundo, and Bernd Schattenberg. Plan repair in hy-brid planning. In Andreas Dengel, Karsten Berns, Thomas Breuel, Frank Bomarius, andThomas R. Roth-Berghofer, editors, Proceedings of the 31st Annual German Conferenceon Artificial Intelligence (KI 2008), volume 5243 of Lecture Notes in Artificial Intelligence,pages 169–176. Springer Science + Business Media, 2008.

[Biundo and Schattenberg, 2001] Susanne Biundo and Bernd Schattenberg. From abstract crisisto concrete relief (a preliminary report on combining state abstraction and HTN planning).In Proceedings of the 6th European Conference on Planning (ECP 2001), pages 157–168.AAAI Press, 2001.

[Biundo and Wendemuth, 2015] Susanne Biundo and Andreas Wendemuth. Companion-technology for cognitive technical systems. Künstl Intell, 30(1):71–75, nov 2015. SpecialIssue on Companion Technologies.

[Biundo et al., 2011] Susanne Biundo, Pascal Bercher, Thomas Geier, Felix Müller, and BerndSchattenberg. Advanced user assistance based on AI planning. Cognitive Systems Research,Special Issue on Complex Cognition, 12(3-4):219–236, sep 2011.

[Biundo et al., 2016] Susanne Biundo, Daniel Höller, Bernd Schattenberg, and Pascal Bercher.Companion-technology: An overview. Künstl Intell, 30(1):11–20, jan 2016. Special Issue onCompanion Technologies.

[Bouguerra and Karlsson, 2004] Abdelbaki Bouguerra and Lars Karlsson. Hierarchical taskplanning under uncertainty. In In 3rd Italian Workshop on Planning and Scheduling (AI*IA,2004.

[Boutilier et al., 1999] Craig Boutilier, Thomas Dean, and Steve Hanks. Decision-theoreticplanning: Structural assumptions and computational leverage. Journal of Artificial Intelli-gence Research, 1:1–93, 1999.

[Browne et al., 2012] Cameron B. Browne, Edward Powley, Daniel Whitehouse, Simon M. Lu-cas, Peter I. Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samoth-rakis, and Simon Colton. A survey of Monte Carlo tree search methods. IEEE Trans. Comput.Intell. AI Games, 4(1):1–43, mar 2012.

[Buss and Beetz, 2010] Martin Buss and Michael Beetz. CoTeSys—cognition for technical sys-tems. Künstl Intell, 24(4):323–327, aug 2010.

134

Page 143: Hierarchical Planning Under Uncertainty - Ulm · existing approaches combining hierarchical planning and uncertainty, the chapter concludes by proposing a new planning approach that

Bibliography

[Byrne, 1977] Richard Byrne. Planning meals: Problem-solving on a real data-base. Cognition,5(4):287–332, dec 1977.

[Charlin et al., 2007] Laurent Charlin, Pascal Poupart, and Romy Shioda. Automated hierarchydiscovery for planning in partially observable environments. Advances in Neural InformationProcessing Systems, 19:225–232, 2007.

[Chaslot et al., 2008] Guillaume M. J. B. Chaslot, Mark H. M. Winands, and H. Jaap van denHerik. Parallel Monte-Carlo tree search. In Proceedings of the 6th International Conferenceon Computer and Games, pages 60–71. Springer, Springer Science + Business Media, 2008.

[Cimatti et al., 2003] A. Cimatti, M. Pistore, M. Roveri, and P. Traverso. Weak, strong, andstrong cyclic planning via symbolic model checking. Artificial Intelligence, 147(1-2):35–84,jul 2003.

[Cox and Cox, 2000] Trevor F Cox and Michael AA Cox. Multidimensional scaling. CRCpress, 2000.

[Currie and Tate, 1991] Ken Currie and Austin Tate. O-plan: The open planning architecture.Artificial Intelligence, 52(1):49–86, nov 1991.

[Dean and Givan, 1997] Thomas Dean and Robert Givan. Model minimization in Markov de-cision processes. In Proceedings of the 11th National Conference on Artificial Intelligence(AAAI 1997), pages 106–111. AAAI Press, 1997.

[Dietterich, 1998] Thomas G. Dietterich. The MAXQ method for hierarchical reinforcementlearning. In Proceedings of the Fifteenth International Conference on Machine Learning,pages 118–126. Morgan Kaufmann Publishers Inc., 1998.

[Dietterich, 2000] Thomas G. Dietterich. Hierarchical reinforcement learning with the MAXQvalue function decomposition. Journal of Artificial Intelligence Research (JAIR), 13:227–303, 2000.

[Edelkamp and Hoffmann, 2004] Stefan Edelkamp and Jörg Hoffmann. PDDL2.2: The lan-guage for the classical part of IPC-4. Technical report, Albert-Ludwigs-Universität Freiburg,Institut für Informatik, January 2004.

[Engelfriet, 1992] Joost Engelfriet. A greibach normal form for context-free graph grammars.In Automata, Languages and Programming, pages 138–149. Springer Science + BusinessMedia, 1992.

[Erol et al., 1994] Kutluhan Erol, James Hendler, and Dana S. Nau. UMCP: A sound and com-plete procedure for hierarchical task-network planning. In Proceedings of the Second Interna-tional Conference on Artificial Intelligence Planning Systems (AIPS 1994), pages 249–254.AAAI Press, 1994.

135

Page 144: Hierarchical Planning Under Uncertainty - Ulm · existing approaches combining hierarchical planning and uncertainty, the chapter concludes by proposing a new planning approach that

Bibliography

[Eyerich et al., 2006] Patrick Eyerich, Bernhard Nebel, Gerhard Lakemeyer, and Jens Claßen.GOLOG and PDDL. In Proceedings of the 2006 international symposium on Practical cogni-tive agents and robots (PCAR 2006), pages 93–104, New York, NY, USA, 2006. Associationfor Computing Machinery (ACM).

[Feldman and Domshlak, 2014a] Zohar Feldman and Carmel Domshlak. Monte-carlo treesearch: To MC or to DP? In Proceedings of the Twenty-First European Conference on Artifi-cial Intelligence (ECAI 2014), pages 321–326. IOS Press, 2014.

[Feldman and Domshlak, 2014b] Zohar Feldman and Carmel Domshlak. On MABs and separa-tion of concerns in monte-carlo planning for MDPs. In Proceedings of the Twenty-Fourth In-ternational Conference on Automated Planning and Scheduling (ICAPS 2014). AAAI Press,2014.

[Fern and Lewis, 2011] Alan Fern and Paul Lewis. Ensemble monte-carlo planning: An empir-ical study. In Proceedings of the 21st International Conference on Automated Planning andScheduling (ICAPS 2011), pages 58–65, 2011.

[Fine et al., 1998] Shai Fine, Yoram Singer, and Naftali Tishby. The hierarchical hiddenMarkov model: Analysis and applications. Machine learning, 32(1):41–62, 1998.

[Foka and Trahanias, 2007] Amalia Foka and Panos Trahanias. Real-time hierarchicalPOMDPs for autonomous robot navigation. Robotics and Autonomous Systems, 55(7):561–571, jul 2007.

[Fox, 1997] Maria Fox. Natural hierarchical planning using operator decomposition. In RecentAdvances in AI Planning, pages 195–207. Springer Science + Business Media, 1997.

[Geier and Bercher, 2011] Thomas Geier and Pascal Bercher. On the decidability of HTN plan-ning with task insertion. In Proceedings of the 22nd International Joint Conference on Arti-ficial Intelligence (IJCAI 2011), pages 1955–1961, 2011.

[Ghallab et al., 2004] Malik Ghallab, Dana Nau, and Paolo Traverso. Automated planning:theory & practice. Elsevier, 2004.

[Green et al., 2011] Derek T. Green, Thomas J. Walsh, Paul R. Cohen, and Yu-Han Chang.Learning a skill-teaching curriculum with dynamic bayes nets. In Proceedings of the Twenty-Third Conference on Innovative Applications of Artificial Intelligence (IAAI 2011), pages1648–1654. AAAI Press, 2011.

[Greibach, 1965] Sheila A. Greibach. A new normal-form theorem for context-free phrasestructure grammars. Journal of the ACM, 12(1):42–52, jan 1965.

[Gurari, 1989] Eitan Gurari. An introduction to the theory of computation. Computer SciencePress, 1989.

[Hansen and Zhou, 2003] Eric A. Hansen and Rong Zhou. Synthesis of hierarchical finite-statecontrollers for POMDPs. In Proceedings of the 13th International Conference on AutomatedPlanning and Scheduling (ICAPS 2003), pages 113–122. AAAI Press, 2003.

136

Page 145: Hierarchical Planning Under Uncertainty - Ulm · existing approaches combining hierarchical planning and uncertainty, the chapter concludes by proposing a new planning approach that

Bibliography

[Hansen, 1998] Eric A. Hansen. Solving POMDPs by searching in policy space. In Proceedingsof the 14th conference on Uncertainty in artificial intelligence (UAI 1998), pages 211–219.Morgan Kaufmann Publishers Inc., 1998.

[He et al., 2010] Ruijie He, Emma Brunskill, and Nicholas Roy. PUMA: planning under uncer-tainty with macro-actions. In Proceedings of the Twenty-Fourth AAAI Conference on ArtificialIntelligence (AAAI 2010), pages 1089–1095. AAAI Press, 2010.

[He et al., 2011] Ruijie He, Emma Brunskill, and Nicholas Roy. Efficient planning under un-certainty with macro-actions. J. Artif. Intell. Res. (JAIR), 40:523–570, 2011.

[Hengst, 2002] Bernhard Hengst. Discovering hierarchy in reinforcement learning with HEXQ.In Proceedings of the Nineteenth International Conference on Machine Learning, pages 243–250. Morgan Kaufmann Publishers Inc., 2002.

[Hoey and Poupart, 2005] Jesse Hoey and Pascal Poupart. Solving POMDPs with continuousor large discrete observation spaces. In Proceedings of the 17th International Jont Conferenceon Artifical Intelligence (IJCAI 2005), pages 1332–1338. Morgan Kaufmann Publishers Inc.,2005.

[Hoey et al., 2010] Jesse Hoey, Pascal Poupart, Axel von Bertoldi, Tammy Craig, CraigBoutilier, and Alex Mihailidis. Automated handwashing assistance for persons with dementiausing video and a partially observable Markov decision process. Computer Vision and ImageUnderstanding, 114(5):503–519, may 2010.

[Höller et al., 2014] Daniel Höller, Pascal Bercher, Felix Richter, Marvin Schiller, ThomasGeier, and Susanne Biundo. Finding user-friendly linearizations of partially ordered plans. In28th PuK Workshop ”Planen, Scheduling und Konfigurieren, Entwerfen” (PuK 2014), 2014.

[Honold et al., 2014] Frank Honold, Pascal Bercher, Felix Richter, Florian Nothdurft, ThomasGeier, Roland Barth, Thilo Hornle, Felix Schussel, Stephan Reuter, Matthias Rau, GregorBertrand, Bastian Seegebarth, Peter Kurzok, Bernd Schattenberg, Wolfgang Minker, MichaelWeber, and Susanne Biundo. Companion-technology: Towards user- and situation-adaptivefunctionality of technical systems. In 2014 International Conference on Intelligent Environ-ments, pages 378–381. Institute of Electrical & Electronics Engineers (IEEE), jun 2014.

[Howard, 2013] Ronald A. Howard. Dynamic Probabilistic Systems, Volume II: Semi-Markovand Decision Processes, volume 2. Courier Corporation, 2013.

[Kaelbling et al., 1998] Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra.Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99–134, may 1998.

[Kautz et al., 2002] Henry Kautz, Larry Arnstein, Gaetano Borriello, Oren Etzioni, and DieterFox. An overview of the assisted cognition project. In AAAI-2002 Workshop on Automationas Caregiver: The Role of Intelligent Technology in Elder Care, pages 60–65, 2002.

137

Page 146: Hierarchical Planning Under Uncertainty - Ulm · existing approaches combining hierarchical planning and uncertainty, the chapter concludes by proposing a new planning approach that

Bibliography

[Kearns et al., 1999] Michael J. Kearns, Yishay Mansour, and Andrew Y. Ng. Approximateplanning in large POMDPs via reusable trajectories. In Advances in Neural InformationProcessing Systems 12, pages 1001–1007. The MIT Press, 1999.

[Keller and Helmert, 2013] Thomas Keller and Malte Helmert. Trial-based heuristic tree searchfor finite horizon MDPs. In Proceedings of the Twenty-Third International Conference onAutomated Planning and Scheduling (ICAPS 2013), pages 135–143. AAAI Press, 2013.

[Khan et al., 2009] Omar Zia Khan, Pascal Poupart, and James P. Black. Minimal sufficientexplanations for factored markov decision processes. In Proceedings of the 19th InternationalConference on Automated Planning and Scheduling (ICAPS 2009), pages 194–200. AAAIPress, 2009.

[Kocsis and Szepesvári, 2006] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In Proceedings of the 17th European Conference on Machine LearningMachine Learning (ECML 2006), pages 282–293. Springer Science + Business Media, 2006.

[Kurniawati et al., 2008] Hanna Kurniawati, David Hsu, and Wee Sun Lee. SARSOP: Effi-cient point-based POMDP planning by approximating optimally reachable belief spaces. InRobotics: Science and Systems IV. Robotics: Science and Systems Foundation, jun 2008.

[Kuter and Nau, 2004] Ugur Kuter and Dana Nau. Forward-chaining planning in nondetermin-istic domains. In Proceedings of the 18th National Conference on Artificial Intelligence(AAAI 2004), pages 513–518. AAAI Press, 2004.

[Kuter et al., 2007] Ugur Kuter, Dana Nau, Elnatan Reisner, and Robert Goldman. Conditional-ization: Adapting forward-chaining planners to partially observable environments. In ICAPS2007 workshop on planning and execution for real-world systems, 2007.

[Kuter et al., 2008] Ugur Kuter, Dana Nau, Elnatan Reisner, and Robert Goldman. Using clas-sical planners to solve nondeterministic planning problems. In Proceedings of the EighteenthInternational Conference on Automated Planning and Scheduling (ICAPS 2008), pages 190–197. AAAI Press, 2008.

[Kuter et al., 2009] Ugur Kuter, Dana Nau, Marco Pistore, and Paolo Traverso. Task decompo-sition on abstract states, for planning under nondeterminism. Artificial Intelligence, 173(5-6):669–695, apr 2009.

[Lim et al., 2009] Brian Y. Lim, Anind K. Dey, and Daniel Avrahami. Why and why not ex-planations improve the intelligibility of context-aware intelligent systems. In Proceedings ofthe 27th international conference on Human factors in computing systems - CHI 09, pages2119–2128. Association for Computing Machinery (ACM), 2009.

[Lin et al., 2008] Naiwen Lin, Ugur Kuter, and Evren Sirin. Web service composition withuser preferences. In Sean Bechhofer, Manfred Hauswirth, Jörg Hoffmann, and ManolisKoubarakis, editors, Proceedings of the 5th European Semantic Web Conference (ESWC2008), pages 629–643, Berlin, Heidelberg, 2008. Springer Science + Business Media.

138

Page 147: Hierarchical Planning Under Uncertainty - Ulm · existing approaches combining hierarchical planning and uncertainty, the chapter concludes by proposing a new planning approach that

Bibliography

[Müller and Biundo, 2011] Felix Müller and Susanne Biundo. HTN-style planning in relationalPOMDPs using first-order FSCs. In Proceedings of the 34th Annual German Conference onArtificial Intelligence (KI 2008), pages 216–227. Springer Science + Business Media, 2011.

[Müller et al., 2012] Felix Müller, Christian Späth, Thomas Geier, and Susanne Biundo. Ex-ploiting expert knowledge in factored POMDPs. In Proceedings of the 20th European Con-ference on Artificial Intelligence (ECAI 2012), pages 606–611. IOS Press, 2012.

[Nau et al., 1999] Dana S. Nau, Yue Cao, Amnon Lotem, and Héctor Muñoz-Avila. SHOP: Simple hierarchical ordered planner. In Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI 1999), pages 968–975. Morgan Kaufmann Publishers Inc., 1999.

[Nau et al., 2003] Dana Nau, Okhtay Ilghami, Ugur Kuter, J. William Murdock, Dan Wu, and Fusun Yaman. SHOP2: An HTN planning system. Journal of Artificial Intelligence Research, 20:379–404, 2003.

[Nau et al., 2005] Dana Nau, Tsz-Chiu Au, Okhtay Ilghami, Ugur Kuter, Héctor Muñoz-Avila, J. William Murdock, Dan Wu, and Fusun Yaman. Applications of SHOP and SHOP2. IEEE Intelligent Systems, 20(2):34–41, March 2005.

[Nothdurft et al., 2014a] Florian Nothdurft, Felix Richter, and Wolfgang Minker. Probabilistic explanation dialog augmentation. In Proceedings of the 2014 International Conference on Intelligent Environments, pages 392–395. IEEE, June 2014.

[Nothdurft et al., 2014b] Florian Nothdurft, Felix Richter, and Wolfgang Minker. Probabilistic human-computer trust handling. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL 2014), pages 51–59. ACL, 2014.

[Papadimitriou and Tsitsiklis, 1987] Christos Papadimitriou and John Tsitsiklis. The complexity of Markov decision processes. Mathematics of Operations Research, 12(3):441–450, August 1987.

[Parr and Russell, 1998] Ronald Parr and Stuart Russell. Reinforcement learning with hierarchies of machines. In Advances in Neural Information Processing Systems 10, pages 1043–1049. The MIT Press, 1998.

[Parr, 1998] Ronald Edward Parr. Hierarchical control and learning for Markov decision processes. PhD thesis, University of California, Berkeley, 1998.

[Pineau et al., 2003a] Joelle Pineau, Geoff Gordon, and Sebastian Thrun. Policy-contingent abstraction for robust robot control. In Proceedings of the 19th Conference on Uncertainty in Artificial Intelligence (UAI 2003), pages 477–484. Morgan Kaufmann Publishers Inc., 2003.

[Pineau et al., 2003b] Joelle Pineau, Geoff Gordon, and Sebastian Thrun. Point-based value iteration: An anytime algorithm for POMDPs. In Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI 2003), pages 1025–1032. Morgan Kaufmann Publishers Inc., 2003.

[Pineau et al., 2003c] Joelle Pineau, Michael Montemerlo, Martha Pollack, Nicholas Roy, and Sebastian Thrun. Towards robotic assistants in nursing homes: Challenges and results. Robotics and Autonomous Systems, 42(3-4):271–281, March 2003.

[Pineau, 2004] Joelle Pineau. Tractable planning under uncertainty: Exploiting structure. PhD thesis, Carnegie Mellon University, 2004.

[Pollack, 2002] Martha E. Pollack. Planning technology for intelligent cognitive orthotics. In Proceedings of the 6th International Conference on Artificial Intelligence Planning Systems (AIPS 2002), pages 322–332. AAAI Press, 2002.

[Poupart, 2005] Pascal Poupart. Exploiting structure to efficiently solve large scale partially observable Markov decision processes. PhD thesis, University of Toronto, 2005.

[Richter and Biundo, 2017] Felix Richter and Susanne Biundo. Addressing uncertainty in hierarchical user-centered planning. In Companion Technology – A Paradigm Shift in Human-Technology Interaction, chapter 6. Springer, 2017.

[Ritter, 2010] Helge Ritter. Cognitive interaction technology. KI – Künstliche Intelligenz, 24(4):319–322, August 2010.

[Roy et al., 2005] Nicholas Roy, Geoffrey J. Gordon, and Sebastian Thrun. Finding approximate POMDP solutions through belief compression. Journal of Artificial Intelligence Research, 23:1–40, 2005.

[Sanner and Kersting, 2010] Scott Sanner and Kristian Kersting. Symbolic dynamic programming for first-order POMDPs. In Proceedings of the 24th National Conference on Artificial Intelligence (AAAI 2010), pages 1140–1146. AAAI Press, 2010.

[Sanner, 2010] Scott Sanner. Relational dynamic influence diagram language (RDDL): Language description. http://users.cecs.anu.edu.au/~ssanner/IPPC_2011/RDDL.pdf, 2010.

[Sanner, 2011] Scott Sanner. International probabilistic planning competition – discrete track results. Presentation at ICAPS 2011, 2011. http://users.cecs.anu.edu.au/~ssanner/IPPC_2011/IPPC_2011_Presentation.pdf.

[Sanner, 2014] Scott Sanner. International probabilistic planning competition – discrete track results. Presentation at ICAPS 2014, 2014. https://cs.uwaterloo.ca/~mgrzes/IPPC_2014/IPPC_2014_Presentation_Discrete.pdf.

[Schattenberg, 2009] Bernd Schattenberg. Hybrid Planning and Scheduling. PhD thesis, Ulm University, Institute of Artificial Intelligence, 2009. URN: urn:nbn:de:bsz:289-vts-68953.

[Seegebarth et al., 2012] Bastian Seegebarth, Felix Müller, Bernd Schattenberg, and Susanne Biundo. Making hybrid plans more clear to human users – a formal approach for generating sound explanations. In Proceedings of the 22nd International Conference on Automated Planning and Scheduling (ICAPS 2012), pages 225–233. AAAI Press, 2012.

[Silver and Veness, 2010] David Silver and Joel Veness. Monte-Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems 23, pages 2164–2172. The MIT Press, 2010.

[Sim et al., 2008] Hyeong Seop Sim, Kee-Eung Kim, Jin Hyung Kim, Du-Seong Chang, and Myoung-Wan Koo. Symbolic heuristic search value iteration for factored POMDPs. In Proceedings of the 23rd National Conference on Artificial Intelligence (AAAI 2008), pages 1088–1093. AAAI Press, 2008.

[Smallwood and Sondik, 1973] Richard D. Smallwood and Edward J. Sondik. The optimal control of partially observable Markov processes over a finite horizon. Operations Research, 21(5):1071–1088, October 1973.

[Smith and Simmons, 2004] Trey Smith and Reid Simmons. Heuristic search value iteration for POMDPs. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (UAI 2004), pages 520–527. AUAI Press, 2004.

[Somani et al., 2013] Adhiraj Somani, Nan Ye, David Hsu, and Wee Sun Lee. DESPOT: Online POMDP planning with regularization. In Advances in Neural Information Processing Systems 26, pages 1772–1780. Curran Associates, Inc., 2013.

[Sondik, 1971] Edward J. Sondik. The optimal control of partially observable Markov decision processes. PhD thesis, Stanford University, 1971.

[Spaan and Vlassis, 2005] Matthijs T. J. Spaan and Nikos Vlassis. Perseus: Randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research, 24:195–220, 2005.

[Sridharan et al., 2010] Mohan Sridharan, Jeremy Wyatt, and Richard Dearden. Planning to see: A hierarchical approach to planning visual actions on a robot using POMDPs. Artificial Intelligence, 174(11):704–725, July 2010.

[Stephan and Biundo, 1993] Werner Stephan and Susanne Biundo. A new logical framework for deductive planning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence (IJCAI 1993), pages 32–38. Morgan Kaufmann Publishers Inc., 1993.

[Sutton et al., 1999] Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, August 1999.

[Theocharous and Kaelbling, 2004] Georgios Theocharous and Leslie P. Kaelbling. Approximate planning in POMDPs with macro-actions. In Advances in Neural Information Processing Systems 16, pages 775–782. The MIT Press, 2004.

[Theocharous and Mahadevan, 2002] Georgios Theocharous and Sridhar Mahadevan. Approximate planning with hierarchical partially observable Markov decision process models for robot navigation. In Proceedings of the 2002 IEEE International Conference on Robotics and Automation (ICRA 2002), pages 1347–1352. IEEE, 2002.

[Theocharous et al., 2001] Georgios Theocharous, Khashayar Rohanimanesh, and Sridhar Mahadevan. Learning hierarchical partially observable Markov decision process models for robot navigation. In Proceedings of the 2001 IEEE International Conference on Robotics and Automation (ICRA 2001), pages 511–516. IEEE, 2001.

[Theocharous, 2002] Georgios Theocharous. Hierarchical learning and planning in partially observable Markov decision processes. PhD thesis, Michigan State University, 2002.

[Toussaint et al., 2008] Marc Toussaint, Laurent Charlin, and Pascal Poupart. Hierarchical POMDP controller optimization by likelihood maximization. In Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence (UAI 2008), pages 562–570. AUAI Press, 2008.

[Vien and Toussaint, 2015] Ngo Anh Vien and Marc Toussaint. Hierarchical Monte-Carlo planning. In Proceedings of the 29th AAAI Conference on Artificial Intelligence (AAAI 2015), pages 3613–3619. AAAI Press, 2015.

[Warfield et al., 2007] Ian Warfield, Chad Hogg, Stephen Lee-Urban, and Héctor Muñoz-Avila. Adaptation of hierarchical task network plans. In Proceedings of the 20th International Florida Artificial Intelligence Research Society Conference (FLAIRS 2007), pages 429–434. AAAI Press, 2007.

[Wilcoxon, 1945] Frank Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83, December 1945.

[Williams et al., 2005] Jason D. Williams, Pascal Poupart, and Steve Young. Factored partially observable Markov decision processes for dialogue management. In 4th Workshop on Knowledge and Reasoning in Practical Dialog Systems, International Joint Conference on Artificial Intelligence (IJCAI), pages 76–82, 2005.

[Zhang et al., 2015] Zongzhang Zhang, David Hsu, Wee Sun Lee, Zhan Wei Lim, and Aijun Bai. PLEASE: Palm leaf search for POMDPs with large observation spaces. In Proceedings of the 25th International Conference on Automated Planning and Scheduling (ICAPS 2015), pages 238–240. AAAI Press, 2015.

Copyright Notice

Parts of this dissertation have already been published in the following publication:

[Richter and Biundo, 2017] Felix Richter and Susanne Biundo. Addressing uncertainty in hierarchical user-centered planning. In Companion Technology – A Paradigm Shift in Human-Technology Interaction, chapter 6. Springer, 2017.

Acknowledgments

I am very thankful for all the support I enjoyed, which made this thesis possible. First and foremost, I want to thank my supervisor Susanne Biundo for the opportunity to write this thesis, as well as for her support, especially the freedom I enjoyed in choosing its topic. Many thanks go to my colleague Thomas Geier for being a competent and willing discussion partner whenever I was stuck, as well as for the exciting discussions on numerous topics in our office. I want to thank Gregor Behnke, Pascal Bercher, and Daniel Höller, who, together with Thomas, provided valuable feedback on different parts of this thesis. I also want to acknowledge Christian Späth for the initial implementation of my core approach, which eventually led to the current version and the data presented in this thesis.

Thanks as well to Sylvia Simonsen for her continuous effort in keeping up our spirits with sweets and cakes, and for generally being nice. I also fondly remember the lunch breaks with Frank Honold, Felix Schüssel, Miriam Schmidt-Wack, Sascha Meudt, and Thilo Hörnle.

Special thanks go to my brothers Julian, Christian, and Lukas Müller, my parents Hermann Müller and Heidrun Reichart-Müller, and my grandmother Margarete Müller for providing moral support whenever I needed it. Finally, I am deeply grateful to my wife Verena Richter, not only for proofreading this thesis, but especially for her patience and her support in all respects concerning this thesis and beyond.