Strategic Interactions against Non-Stationary Agents
por
Pablo Francisco Hernandez Leal
Tesis sometida como requisito parcial para obtener el grado de Doctor
en Ciencias en el Área de Ciencias Computacionales en el Instituto
Nacional de Astrofísica, Óptica y Electrónica
Supervisada por:
Dr. Jose Enrique Munoz de Cote Flores Luna,
INAOE
©INAOE 2015
El autor otorga al INAOE el permiso de reproducir y distribuir copias en su totalidad o en partes de esta tesis
Strategic Interactions against Non-Stationary Agents
By:
Pablo Francisco Hernandez Leal
Advisor:
Dr. Jose Enrique Munoz de Cote Flores Luna
Ph.D. Dissertation
Coordinación de Ciencias Computacionales
Instituto Nacional de Astrofísica, Óptica y Electrónica
December 2015
Sta. María Tonantzintla, Puebla, México
Agradecimientos
A mis asesores, por su guía, revisiones y apoyo para llevar a cabo esta investigación doctoral.
A Enrique Munoz de Cote, quien me introdujo al área de teoría de juegos y me aceptó como su primer alumno doctoral en su llegada al INAOE. En particular, agradezco las largas pláticas de brainstorming (que a veces me dejaban con más dudas que al inicio), las revisiones tan detalladas de los artículos y el apoyo para realizar una estancia de investigación fuera del país. De él he aprendido y valorado la idea de hacer investigación de alta calidad, de no conformarse y de buscar siempre los más altos estándares. Aprecio su dirección y guía en estos 4 años, la cual combinaba consejos y el presionarme para buscar mi mayor potencial. A Enrique Sucar, quien desde la maestría me aceptó como su estudiante y continuó durante mi investigación doctoral aun cuando el tema era diferente. Aprecio su personalidad, la cual es muy similar a su forma de hacer ciencia, al mismo tiempo agradable y responsable. De ellos he aprendido a ser un mejor investigador y espero que en el futuro pueda ser un asesor tan dedicado como ellos.
A diversas personas del INAOE que a través de seminarios, cursos o pláticas me ayudaron a culminar esta tesis y me hacen ser un mejor investigador. Al grupo de robótica y de sistemas inteligentes, quienes escucharon mis presentaciones. A Felipe Orihuela, por su guía en las cuestiones estadísticas y sus duras críticas y sugerencias. A Alma Rios, por su apoyo en las cuestiones administrativas.
A personas que conocí en mi estancia en CREATE-NET, Oscar Mayora, Venet Osmani y Alban Maxhuni (quien se convirtió en un gran amigo). A Matthew E. Taylor y Yusen Zhan de Washington State University, con quienes conviví por 7 meses y con quienes espero seguir colaborando. A Benjamin Rosman, con quien he iniciado una colaboración, aun cuando él se encuentra al otro lado del mundo.
A mis sinodales, Eduardo Morales, Francisco Martinez, Angelica Munoz, Aurelio Lopez y Prashant Doshi, por sus observaciones y comentarios que enriquecieron y mejoraron esta tesis.
A mi familia y en particular a mis padres, quienes me han apoyado en esta vida de hacer investigación. Terminando mi licenciatura, ellos me apoyaron cuando les dije que quería estudiar una maestría. Dos años después también quise estudiar un doctorado y su respuesta fue positiva. En estos cuatro años he tenido la fortuna de realizar estancias y de asistir a diversas conferencias; mis padres siempre me han apoyado aun cuando en el fondo sepan que me alejo físicamente de ellos. Siempre han sido y serán mi guía, inspiración y soporte durante toda mi vida, los admiro y son un modelo de personas a seguir. Agradezco a mi mamá, Isabel Leal, por educarme, por quererme, por heredarme su inteligencia y responsabilidad en la vida y el trabajo. Por enseñarme a superarme todos los días. Por enseñarme a ser puntual, respetuoso y muchos otros valores que espero me hagan, además de un buen investigador, una buena persona. Gracias por enseñarme a hacer bien mi trabajo, a ser responsable y cumplir como es debido. A mi papá, Francisco Hernandez, por brindarme siempre su cariño y sus consejos, por enseñarme lo que es amar tu trabajo. Gracias por enseñarme a disfrutar de la vida, por compartir tus alegrías conmigo, por apoyarme y celebrar siempre mis pequeños logros. Gracias por todo tu afecto y tus abrazos. Por alentarme y enseñarme a no dudar de mis capacidades. Gracias por enseñarme tantas cosas en la vida que no se pueden aprender mediante libros y ecuaciones. Gracias a mis padres por la educación que me brindaron, por su formación y los valores que me han enseñado, por todos sus sacrificios y principalmente por su apoyo incondicional creyendo siempre en mí; sin ellos no podría ser lo que soy ahora. A Isabel Chavarría, por enseñarme a superar retos y hacerme una mejor persona. Por estar junto a mí en momentos difíciles, gracias por darme paz y certidumbre. Gracias por confiar en mí, en esta idea de hacer ciencia para toda la vida. Gracias por este tiempo juntos y por todo lo que aún nos falta, porque contigo soy feliz.
Al Consejo Nacional de Ciencia y Tecnología (CONACyT) por el apoyo económico otorgado a través de la beca No. 234507 para estudios de doctorado. Al Instituto Nacional de Astrofísica, Óptica y Electrónica y a la Coordinación de Ciencias Computacionales por la formación académica y todas las facilidades otorgadas.
Pablo Fco. Hernandez Leal — Sta. María Tonantzintla, Pue. 2015
Resumen
Lograr diseñar un agente capaz de aprender a interactuar con otro agente es un problema abierto. Una interacción ocurre cuando dos o más agentes realizan una acción en un ambiente determinado y obtienen una utilidad dependiendo de la acción conjunta. Las técnicas actuales de aprendizaje multiagente usualmente no obtienen buenos resultados con agentes que cambian de comportamiento en una interacción repetida. Esto es porque generalmente no modelan el comportamiento de los demás agentes y, en su lugar, realizan suposiciones que son muy restrictivas para escenarios reales. Más aún, considerando que muchas aplicaciones requieren la interacción de diferentes tipos de agentes, este problema es importante de resolver. No importa si el dominio es cooperativo (donde los agentes tienen un objetivo común) o competitivo (donde los objetivos son diferentes), hay un aspecto en común: los agentes deben aprender cómo los demás están actuando y reaccionar rápidamente a cambios de comportamiento. Esta tesis está enfocada en cómo actuar ante agentes que usan distintas estrategias durante la interacción (lo cual los convierte en no estacionarios); en particular, se enfoca en agentes que se enfrentan a otros agentes (llamados oponentes en esta tesis) los cuales pueden usar diferentes estrategias y cambiar entre ellas en el tiempo. Lidiar con oponentes no estacionarios involucra tres aspectos diferentes: (i) aprender un modelo del oponente, (ii) establecer una política (plan) contra el oponente (ya que el objetivo es maximizar la utilidad durante la interacción), y (iii) detectar cambios en el comportamiento del oponente. Las contribuciones principales de esta tesis son:
• Se propuso un framework de aprendizaje y planificación contra estrategias no estacionarias en juegos repetidos. Un algoritmo que usa este framework es MDP4.5, el cual usa árboles de decisión como modelos del oponente. Un segundo algoritmo es MDP-CL, el cual aprende un proceso de decisión de Markov (MDP) para representar la estrategia del oponente. MDP4.5 posteriormente transforma el árbol de decisión en un MDP. Resolver el MDP resulta en una política óptima contra el oponente. Para determinar cambios en las estrategias, diferentes modelos son aprendidos durante el juego con diferente información.
• MDP-CL y MDP4.5 aprenden modelos con base en interacción sin información previa. Además, cuando estos algoritmos detectan un cambio, el modelo actual se descarta y se aprende uno nuevo. Para solventar estas limitaciones se propusieron a priori MDP-CL e incremental MDP-CL. A priori MDP-CL inicia con un conjunto de modelos antes de la interacción y debe detectar cuál de ellos es el usado por el oponente. Incremental MDP-CL no descarta el modelo aprendido cuando detecta un cambio. De esta forma, si el oponente regresa a una estrategia previamente usada, el modelo ya es conocido y el proceso de detección es más rápido.
• Se propuso un nuevo tipo de exploración llamada drift, la cual está diseñada para detectar cambios en el oponente que en otro caso pasarían desapercibidos. En este contexto se propuso un nuevo algoritmo, R-max#, el cual visita parte del espacio de estados que no ha sido actualizado recientemente, lo cual implícitamente revisa cambios de comportamiento del oponente. Más aún, se proveen garantías de optimalidad contra ciertos tipos de oponentes no estacionarios.
• Por último, se propuso DriftER, el cual es un mecanismo de detección de cambios de comportamiento basado en monitorizar la tasa de error sobre el modelo del oponente. DriftER provee garantías teóricas de detección de cambios bajo ciertas suposiciones.
Los algoritmos propuestos se evaluaron en distintos dominios, unos que son utilizados normalmente como referencia y otros nuevos dominios más apegados a situaciones reales: el dilema del prisionero repetido, una tarea de negociación, una aplicación del mundo real en mercados energéticos y juegos aleatorios del área de teoría de juegos. Se realizaron comparaciones contra algoritmos del estado del arte de aprendizaje por refuerzo y teoría de juegos, comprobando que nuestras propuestas son capaces de detectar cambios de comportamiento y obtener mejores resultados en cuanto a la calidad del modelo aprendido y recompensas promedio.
Abstract
Designing an agent capable of learning to interact with another agent is an open problem. An
interaction happens when two or more agents perform an action in an environment and each obtains
a utility that depends on the joint action. Current multiagent learning techniques do not fare
well against agents that change their behavior during a repeated interaction. This happens because
they usually do not model the other agents' behavior and instead make assumptions that are too
restrictive for real scenarios. Furthermore, considering that many applications demand different
types of agents to work together, this is an important problem to solve. Whether the domain is
cooperative (where agents have a common goal) or competitive (where objectives differ), there is one
common aspect: agents must learn how their counterpart is acting and react quickly to changes in
behavior. Our research focuses on how to act against agents that use different strategies during an
interaction (which makes them non-stationary); in particular, we focus on agents that face opponents
which may use different strategies and switch among them over time. Dealing with non-stationary
opponents involves three different aspects: (i) learning a model of the opponent, (ii) computing a
policy (a plan to act) against the opponent (since the objective is to maximize the utility throughout
the interaction), and (iii) detecting switches in the opponent's strategy. The main contributions of this
thesis are:
• We propose a framework for learning and planning against non-stationary strategies in repeated games. One algorithm within this framework is MDP4.5, which uses decision trees as opponent models. The second algorithm is MDP-CL, which learns a Markov decision process (MDP) to represent the opponent strategy. MDP4.5 then transforms the decision tree into an MDP. Solving the MDP yields an optimal policy against that opponent. In order to detect changes of strategy, different models are learned throughout the game with different information.
• MDP-CL and MDP4.5 learn models from interaction without any prior information. Moreover, when they detect a switch the learned model is discarded and a new one is learned from scratch. We propose two extensions of MDP-CL that overcome these limitations: a priori MDP-CL and incremental MDP-CL. A priori MDP-CL knows the set of models used by the opponent, and the problem is to detect which strategy from that set the opponent is using. Incremental MDP-CL does not discard the learned model once it detects a switch. In this way, if the opponent returns to a previously used strategy, the model is already known and detection is faster than relearning it.
• We propose a new type of exploration, called drift exploration, which is designed to detect switches in the opponent that would otherwise pass unnoticed. In this regard we propose a new algorithm, R-max#, which revisits parts of the state space that have not been updated recently, thus implicitly checking for opponent switches. Moreover, we provide theoretical guarantees under which R-max# behaves optimally against some types of non-stationary opponents.
• Finally, we propose DriftER, a switch detection mechanism that keeps track of the error rate of the opponent model. We also provide a theoretical guarantee of switch detection under certain assumptions.
Our proposals were evaluated on diverse domains, some commonly used as benchmarks and others closer to real-world settings: the iterated prisoner's dilemma, a negotiation task, a real-world domain in energy markets, and random games from game theory. Comparisons were made against state-of-the-art algorithms in reinforcement learning and game theory, showing that our approaches are capable of detecting switches and obtain better scores in terms of model quality and average rewards.
Contents
Agradecimientos i
Resumen iii
Abstract v
List of Figures xii
List of Tables xv
List of Algorithms xvii
List of Symbols xviii
1 Introduction 1
1.1 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.2 Research questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.1 Main objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3.2 Specific objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.5 Thesis summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Preliminaries 9
2.1 Decision theoretic planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Markov decision process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.2 Partially observable Markov decision process . . . . . . . . . . . . . . . . . . . 11
2.2 Machine learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.1 Supervised learning: classification . . . . . . . . . . . . . . . . . . . . . . . . . 12
Decision trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.2 Reinforcement learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Hidden mode - MDPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.3 Exploration vs. exploitation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Sample complexity of exploration . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 Game theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.1 Nash equilibrium . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.2 Repeated and stochastic games . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.3 Behavioral game theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Summary of the chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3 Related Work 23
3.1 Decision theoretic planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1.1 Multiagent approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Probabilistic Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3 Machine learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.1 Concept drift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.2 Reinforcement learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.3 Exploration vs. exploitation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4 Game theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4.1 Implicit negotiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4.2 Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4.3 Behavioral game theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.5 Opponent and teammate modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.5.1 Memory bounded learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.6 Hybrid approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.7 Summary of the chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4 Acting against Non-Stationary Opponents 37
4.1 MDP-CL and MDP4.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.1.1 Modeling opponents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.1.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1.3 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.1.4 Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.1.5 Overview of the framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.1.6 Learning: opponent strategy assessment . . . . . . . . . . . . . . . . . . . . . . 42
MDP4.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
MDP-CL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1.7 Planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Decision trees to MDPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.1.8 Detecting opponent switches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Decision trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
MDPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Running example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.1.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 MDP-CL with knowledge reuse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2.2 Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2.3 A priori MDP-CL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2.4 Incremental MDP-CL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3 Drift exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3.1 General drift exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3.2 Efficient drift exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
R-MAX# . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.3 Running example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.4 Practical considerations of R-max# . . . . . . . . . . . . . . . . . . . . . . . . 56
4.4 Sample complexity of exploration for R-max# . . . . . . . . . . . . . . . . . . . . . . 56
4.4.1 Efficient drift exploration with switch detection . . . . . . . . . . . . . . . . 60
4.4.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.5 DriftER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.5.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.5.2 Model learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.5.3 Switch detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.5.4 Theoretical guarantee for switch detection . . . . . . . . . . . . . . . . . . . . . 64
4.5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.6 Summary of the chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5 Experiments 67
5.1 Experimental domains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.1.1 Iterated prisoner’s dilemma (iPD) . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.1.2 Multiagent iterated prisoner’s dilemma . . . . . . . . . . . . . . . . . . . . . . . 69
5.1.3 Alternate-offers bargaining . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.1.4 Double auctions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.1.5 General-sum games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Battle of the sexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.2 MDP4.5 and MDP-CL against deterministic switching opponents . . . . . . . . . . . . 71
5.2.1 Setting and objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.2.2 HM-MDPs performance experiments . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2.3 HM-MDPs vs MDP4.5 vs MDP-CL . . . . . . . . . . . . . . . . . . . . . . . . 74
5.2.4 Preliminary drift exploration for MDP4.5 and MDP-CL . . . . . . . . . . . . . 76
5.2.5 Learning speed of the opponent strategy . . . . . . . . . . . . . . . . . . . . . . 78
5.2.6 Increasing the number of opponents . . . . . . . . . . . . . . . . . . . . . . . . 79
5.2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.3 A priori and incremental MDP-CL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.3.1 Setting and objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.3.2 Model selection in a priori MDP-CL . . . . . . . . . . . . . . . . . . . . . . . . 82
5.3.3 Rewards and quality in a priori MDP-CL . . . . . . . . . . . . . . . . . . . . . 82
5.3.4 Incremental models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.3.5 A priori noisy models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.4 Drift exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.4.1 Settings and objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.4.2 Drift and non-drift exploration approaches . . . . . . . . . . . . . . . . . . . . 87
5.4.3 Further analysis of MDP-CL(DE) . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.4.4 Further analysis of R-max# . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.4.5 Efficient exploration + switch detection: R-max#CL . . . . . . . . . . . . . 96
5.4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.5 DriftER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.5.1 Setting and objectives (repeated games) . . . . . . . . . . . . . . . . . . . . . . 99
5.5.2 Switch detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
DriftER parameter behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.5.3 Setting and objectives (double auctions) . . . . . . . . . . . . . . . . . . . . . . 101
5.5.4 Fixed non-stationary opponents . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.5.5 Detecting switches in the opponent . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.5.6 Noisy non-stationary opponents . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.6 Non-stationary game theory strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.6.1 Setting and objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.6.2 Battle of the sexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.6.3 General-sum games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.7 Summary of the chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6 Conclusions and Future Research 111
6.1 Summary of the proposed algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
6.4 Open questions and future research ideas . . . . . . . . . . . . . . . . . . . . . . . . . 114
6.5 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
References 117
A PowerTAC 127
A.1 Energy markets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
A.2 PowerTAC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
A.3 Periodic double auctions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
A.4 TacTex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
B General-sum Games 131
C Extra Experiments 133
C.1 HM-MDPs training and performance experiments . . . . . . . . . . . . . . . . . . . . . 133
C.2 R-max exploration against pure and mixed strategies . . . . . . . . . . . . . . . . . . . 136
List of Figures
2.1 Interaction of an agent in an environment. . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 A Markov decision process (MDP) with four states and two actions. . . . . . . . . . . 10
2.3 A decision tree that models an opponent strategy. . . . . . . . . . . . . . . . . . . . . 13
2.4 An example of an HM-MDP with 3 modes and 4 states. . . . . . . . . . . . . . . . . . 15
2.5 The automata that describe TFT and Pavlov strategies. . . . . . . . . . . . . . . . . . 21
3.1 Related work to this thesis divided in areas. . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 An example of an Influence Diagram that represents the decision whether to take an
umbrella. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.1 Different sections in this thesis and how they relate to each other inside this chapter. . 38
4.2 The three main parts of the framework. . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.3 A decision tree and a learned MDP using the game matrix of the prisoner’s dilemma. 43
4.4 MDP obtained from decision tree in Figure 4.3 (a). . . . . . . . . . . . . . . . . . . . . 45
4.5 Example of highly dissimilar and similar decision trees. . . . . . . . . . . . . . . . . . . 47
4.6 Example of how the framework works against a TFT-Pavlov opponent. . . . . . . . . . 48
4.7 An example of a learning agent against a Bully-TFT switching opponent. . . . . . . . 53
4.8 An example of the learned models of R-max# against a Bully-TFT switching opponent. 55
4.9 An illustration for the running behavior of R-max#. . . . . . . . . . . . . . . . . . . . 59
4.10 Possible switch points in R-max#. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.1 Experimental domains used in this thesis and where they are used in this chapter. . . 68
5.2 The evaluation approach for the proposed framework and for HM-MDPs. . . . . . . . 72
5.3 Comparison of rewards of MDP-CL, MDP4.5 and HM-MDPs against different switching
opponents. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.4 Comparison between MDP4.5 and MDP-CL in terms of average rewards. . . . . . . . 78
5.5 MDP representation and rewards obtained in the multiagent version of the prisoner’s
dilemma with two opponents. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.6 Total variation distance of the current learned model compared with each strategy given
as prior information using a priori MDP-CL. . . . . . . . . . . . . . . . . . . . . . . . 82
5.7 Comparison of MDP-CL and a priori MDP-CL in terms of immediate and cumulative
rewards. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.8 Model quality of MDP-CL and a priori MDP-CL against the opponent TFT-Bully. . . 84
5.9 Difference of cumulative rewards between incremental MDP-CL and MDP-CL. Total
variation distance of the learned model and the noisy representations. . . . . . . . . . 85
5.10 Cumulative rewards of MDP-CL with and without drift exploration, the opponent is
Bully-TFT switching at round. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.11 Cumulative rewards against the Bully-TFT opponent in the iPD using R-max# and R-max. 92
5.12 Immediate and cumulative rewards of R-max# and R-max in the alternating offers
domain. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.13 Immediate rewards of MDP-CL, R-max#, WOLF-PHC and R-max#CL in the iPD
domain. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.14 Error probabilities of a learning algorithm with no switch detection and DriftER against
an opponent that changes between two strategies. . . . . . . . . . . . . . . . . . . . . . 99
5.15 Switch detection with different parameters of DriftER against a non-stationary opponent. 100
5.16 Error rate of TacTex-WM and MDP-CL while comparing with DriftER. . . . . . . . . 102
5.17 Profits of TacTex-WM, MDP-CL and DriftER against the non-stationary opponent in
a PowerTAC competition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.18 Rewards obtained by DriftER, MDP-CL, R-max# and WOLF-PHC in the BoS game
against a non-stationary opponent that uses pure and mixed Nash. . . . . . . . . . . . 106
A.1 Partial representation of the MDP broker in PowerTAC, ovals represent states (timeslots
for future delivery). Arrows represent transition probability and rewards. . . . . . . . 128
C.1 Fraction of updates when learning an opponent model using R-max exploration against
a pure strategy and a mixed strategy in the BoS game. . . . . . . . . . . . . . . . . . . 136
List of Tables
2.1 The bimatrix for the prisoners’ dilemma game. . . . . . . . . . . . . . . . . . . . . . . 18
3.1 A comparison of different algorithms of the state of the art. . . . . . . . . . . . . . . . 34
4.1 A description of the main parts of the approach using two different representations:
MDP4.5 and MDP-CL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.1 The bimatrix game known as the prisoners’ dilemma. . . . . . . . . . . . . . . . . . . . 68
5.2 A bimatrix game representing the battle of the sexes game. . . . . . . . . . . . . . . . 71
5.3 Average rewards for the HM-MDPs agent with std. deviation using different training
sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.4 Average rewards for the HM-MDPs agent and for the opponent with standard deviations. 73
5.5 Average rewards of MDP-CL, MDP4.5 and HM-MDPs against non-stationary opponents. 75
5.6 Comparison without exploration, an ε-exploration and a softmax exploration for MDP4.5
and MDP-CL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.7 Comparison in terms of average rewards of MDP-CL and a priori MDP-CL against a non-
stationary opponent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.8 Average rewards of the proposed algorithms against an opponent with a probability η
of changing to a di↵erent strategy at any round in the iPD domain. . . . . . . . . . . 88
5.9 Average rewards and percentage of successful negotiations in the negotiation domain of
the proposed algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.10 Comparison of MDP-CL and MDP-CL(DE) while varying the parameter ε (using ε-greedy as drift exploration). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.11 Comparison of R-max and R-max# with different τ values in terms of average rewards. 95
5.12 Average rewards of R-max#CL and R-max# with different τ values. . . . . . . . . . 96
5.13 Average timeslots for switch detection, accuracy, and traded energy of the learning
agents against a non-stationary opponent. . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.14 Average profit of the learning agents against non-stationary opponents with and without
noise. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.15 Rewards of our proposed approaches and WOLF-PHC against non-stationary opponents
in four random repeated games. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
6.1 A comparison of our proposals in terms of advantages and limitations. . . . . . . . . . 112
B.1 Games used in the experiments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
B.2 Pure and mixed Nash strategies for the selected games. . . . . . . . . . . . . . . . . . 132
C.1 Average rewards for the HM-MDPs agent and for the opponent with standard deviations.135
C.2 Performance measures when solving the HM-MDP as a POMDP, average of different
opponents. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
C.3 Performance measures when solving the HM-MDP as a POMDP against different non-
stationary opponents. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
C.4 Average rewards of R-max learning against pure and mixed strategies in the battle of
the sexes game . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
List of Algorithms
2.1 R-max algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1 WOLF-PHC algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1 Proposed framework algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 Incremental MDP-CL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3 R-max# algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4 R-max#CL algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.5 DriftER algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
List of Symbols
Symbol   Description
Pr   Probability
E   Expectation
Γ   A normal form game
T   Number of rounds
I   Number of agents
A   Learning agent
Oi   Opponents
O   Set of stationary strategies used by an opponent
A   Set of attributes used to describe an opponent strategy
B   Set of opponent actions
M   A Markov decision process (MDP)
M   A set of MDPs
S   Set of states
A   Set of actions
R   Reward function
T   Transition function
γ   Discount factor value
Z   Set of observations
Z   Observation function
π   Policy
rpd, spd, tpd, ppd   Values in the prisoner's dilemma matrix
D   A decision tree
w   Parameter that represents window size
TVD   Total variation distance
Threshold value in MDP4.5 and MDP-CL
ρ   Threshold value in Incremental MDP-CL
K   Set of known states in R-max# and R-max#CL
m   Parameter of R-max, R-max# and R-max#CL
τ   Parameter of R-max# and R-max#CL
Parameter of DriftER
ninit   Parameter of DriftER
Chapter 1
Introduction
Multiagent systems (MAS) are systems that include multiple autonomous entities (agents) capable
of independent action. Even though there is no universal definition of agent [Russell et al., 1995],
most researchers in the community accept that an agent is an entity that: (i) has sensors that help
it perceive the world and (ii) can take actions to affect the world. An agent can be a simulated software
agent, a physical robot, or even a person. An agent can figure out for itself what it needs to do in
order to satisfy its design objectives. A multiagent system is one that consists of a number of agents
which interact with one another [Wooldridge, 2009].
The MAS community has developed several algorithms for different needs (coordination, communication, learning, etc.) and a number of real applications have begun to appear, for example, a network for distributing electricity [Pipattanasomporn et al., 2009], a coordination mechanism for charging electric cars [Valogianni et al., 2015], and patrolling the Los Angeles airport [Pita et al., 2009]. There are also several applications from economics, such as energy markets [Ketter et al., 2014], auctions [Hu and Wellman, 1998] and negotiation [Jennings et al., 2001].
Despite these advances, most previous work has focused on homogeneous interactions, that is, all the agents have the same internal structure, including goals, domain knowledge and possible actions [Stone and Veloso, 2000]. However, this assumption is not realistic. Thus, new algorithms must be designed for heterogeneous systems, which range from agents having different goals to having different models and actions (this includes the case of human-agent interactions). Moreover, it would be desirable for computer agents to take into account who they are interacting with. This is related to the concept of strategic interaction: a situation in which agents share an environment, each one with its own objective, trying to obtain the best results, even though each agent's outcome may depend on the behavior of the other agents [Risse, 2000].
However, one limitation of current technologies is that many of the underlying algorithms do not take into account whom they are interacting with: they do not model the other agent. This is especially troublesome when there are different types of agents in the environment, because it means these algorithms act the same regardless of the other agent, which is not optimal. Considering that many applications demand different agents to collaborate closely together, this is an important problem to solve.
For example, in the medical area, rehabilitation systems can be used directly by the patient (without the therapist) [Sucar et al., 2010]. Even though these systems can produce important benefits for the patient's health, most of them lack the capacity to adapt to and learn the patient's needs. Nowadays, humans have to adapt to the system, and therefore their motivation, and consequently the benefits, are usually curtailed. Humans tend to have changes in mood and motivation; we adapt and learn continuously, and if a system were capable of learning these changes of behavior and adapting itself to the situation, the benefits for the subject using the system would increase dramatically.
A similar problem, called knowledge tracing [Corbett and Anderson, 1994], occurs in intelligent tutoring systems, where the objective is to monitor students' change in knowledge state during the teaching process. In this case, if the system could learn this knowledge state and keep track of its possible changes, the system could optimize its behavior according to the student.
A recent commercial application is to monitor the user's music preferences to propose new tracks that the user may like. An agent, by continuously monitoring the listened tracks, can learn a model of the user by means of reinforcement learning [Liebman et al., 2015]. Even though the agent learns a model of the user in an online fashion, it does not take into account changes in the user's mood, which reduces the quality of its predictions and its usefulness for everyday use.
The previous examples showed domains that involve a human and an agent working together cooperatively. However, this problem also appears in competitive domains such as security patrolling. In this scenario (and assuming repeated interactions) an intruder would change its strategy constantly to prevent a patroller from learning its behavior from past interactions. In conclusion, in both cooperative and competitive scenarios, agents must learn how their counterpart is acting and react quickly to changes in its behavior.
1.1 Related work
Regardless of the task’s nature and whether the agent is interacting with another agent, or is isolated
in its environment, it is reasonable to expect that some conditions will change in time. These changes
turn the environment into a non-stationary one and render many techniques futile.
Strategic interactions (where the best outcome for an agent depends on the joint actions of all the agents in the environment) are one of the fundamental problems in MAS. One way to tackle them is to assume that the agent is the only one in the environment, treating the rest of the agents as part of the environment and using single-agent techniques. Decision theoretic planning is one area that has developed algorithms for a single agent to find the optimal decisions in a sequential decision problem [Boutilier et al., 1999]. Recently, some works have tackled the problem of having more than one agent in the environment [Gmytrasiewicz and Doshi, 2005; Seuken and Zilberstein, 2008]. However, these multiagent extensions have limited use due to computational constraints; in some cases the complexity of the algorithms involved is NEXP-complete [Seuken and Zilberstein, 2008].1
Assuming there is only one agent in the environment is incorrect in situations in which other agents do not have a fixed strategy [Shoham et al., 2007], since the environment turns non-stationary and single-agent techniques fail. Thus, another solution is to explicitly represent the other agents and their possible actions. This is what opponent modeling algorithms do [Gmytrasiewicz and Durfee, 2000; Stone, 2007]. Recent approaches in this area have shown the importance of adapting quickly to the opponent's behavior [Sykulski et al., 2010]. They have also provided ideas about how to model adversarial environments [Banerjee and Peng, 2005; Cote et al., 2010] where classical game theory solutions will not succeed.
Game theory studies the decision-making process among intelligent rational decision makers [Myerson, 1991]. It has proposed models that can prescribe optimal strategies (Nash equilibria) in specific situations. However, it can do so only under the strong assumption that all agents are fully rational, i.e., a worst-case opponent or a perfectly rational teammate (one that always takes the best possible action). The reality is that most situations involve agents that cannot be assumed to be perfectly rational, and where such prescribed strategies will fare less than optimally. The reasons for agents not being rational are diverse: humans make choices depending on different factors that violate the axioms of the theory [Kahneman and Tversky, 1979]; in other cases it is impossible to know all the information necessary to make the best decision; and computer agents or robots have limited capabilities (sensors, memory and processing) which limit their reasoning.
Most game theory models are designed for a single decision (one-shot games). However, extensions for repeated games and stochastic games overcome this limitation. Some approaches [Abdallah and Lesser, 2008; Conitzer and Sandholm, 2006] are designed to converge to a Nash equilibrium; others switch between different game-theoretic strategies2 depending on the context [Crandall and Goodrich, 2011; Powers and Shoham, 2005]. Learning approaches have been developed for repeated games [Brown, 1951]. Their limitation is that they assume the counterpart will use a stationary strategy during the entire interaction, which means that the rule for choosing an action is the same in every stage3 [Shoham and Leyton-Brown, 2008].
1 NEXP is the set of decision problems that can be solved by a non-deterministic Turing machine using time O(2^{p(n)}) for some polynomial p(n), and unlimited space.
2 Nash equilibrium and minimax strategy are examples of these strategies, which are described in Section 2.3.
3 Note that this does not imply that the action chosen in each stage will be the same.
In a different area, the machine learning community has developed a subfield for learning with changes over time: concept drift [Gama et al., 2014; Widmer and Kubat, 1996]. However, algorithms in this area cannot be used directly in multiagent scenarios. Approaches from reinforcement learning [Sutton and Barto, 1998] range from the basic Q-learning algorithm to its various multiagent extensions [Littman, 1994; Tesauro, 2003]. Another group of algorithms has been proposed with the objective of converging in self-play [Bowling, 2004; Bowling and Veloso, 2002]. In particular, WOLF-PHC is a variant of Q-learning designed to learn against a non-stationary opponent that slowly changes its behavior. The authors propose using a variable learning rate: learn quickly when losing and cautiously when winning. However, this approach had not been tested against switching opponents before this thesis. As we will show in Sections 5.4.2 and 5.4.4, WOLF-PHC is not capable of adapting quickly to all opponent switches.
Other approaches, like R-max [Brafman and Tennenholtz, 2003], address the problem of exploring efficiently and provide theoretical guarantees. However, their main limitation is that they do not handle switching opponents. A different approach, which is in fact designed for non-stationary opponents, is presented in [Choi et al., 1999]. The authors proposed an extension of MDPs for non-stationary environments called hidden-mode MDPs. This model needs an offline training phase and requires solving a partially observable MDP, which is intractable in general (PSPACE-complete) [Littman, 1996].
Recent approaches have proposed algorithms for "fast learning" in repeated and stochastic games [Elidrisi et al., 2012, 2014]. However, experiments were performed on small-size problems, since the algorithms show an exponential increase in the number of hypotheses (in the size of the observation history), which may limit their use in larger domains. Moreover, these approaches do not include exploration mechanisms for detecting opponent switches. As a result, they obtain longer detection times and suboptimal rewards.
In summary, there are different approaches that can be used only for single decisions (one-shot) [Camerer et al., 2004a; Costa Gomes et al., 2001; Koller and Milch, 2001]. Game theory works focus on finding Nash equilibria and on convergence in self-play (which does not guarantee optimal rewards) [Abdallah and Lesser, 2008; Bowling, 2004; Bowling and Veloso, 2002; Conitzer and Sandholm, 2006]. Planning algorithms are computationally intractable for larger multiagent domains [Gmytrasiewicz and Doshi, 2005; Seuken and Zilberstein, 2008]. Most learning approaches assume stationarity of the opponent [Brown, 1951], and those that deal with non-stationarity are not designed for multiagent domains or need an offline training phase [Choi et al., 1999]. Recent approaches do not use exploration mechanisms for detecting switches.
1.2 Challenges
The proposed research aims to tackle most of the aforementioned problems; in particular, it focuses on learning against non-stationary opponents that use different stationary strategies and switch among them during the interaction. Dealing with non-stationary opponents involves three different aspects:
• Learning a model of the opponent. This model provides information on how the opponent behaves under different circumstances. Note that we do not have any models prior to the interaction with the opponent.
• Computing a policy (a plan to act) against the opponent. The objective of the learning agent is to maximize its rewards throughout the interaction.
• Detecting switches in the opponent's strategy. The opponent has a set of strategies and can switch among them during the interaction. An optimal policy against one strategy will be suboptimal if the opponent changes to a different strategy. Thus, the opponent model and the policy against it should be adapted.
Throughout this thesis we will refer to the other agents in the environment as opponents, independently of the domain. However, our proposed methods can also be applied in collaborative systems, since the objective of our agent is to maximize its own rewards; thus, our agent is a self-interested agent. Moreover, our methods do not need to know the reward function of the other agents (which is often complicated to obtain).
1.2.1 Problem statement
The problem setting is the following: one learning agent A and one or more opponents Oi share the environment. Agents each take one action (simultaneously) in a sequence of rounds/timesteps/timeslots.4 They all obtain a reward5 r at each round that depends on the actions of all agents. The objective of agent A is to maximize its cumulative rewards over the entire interaction. Agent A observes its own reward at the end of each round, but not those of its opponents. Agent A does not have any initial policies or models of how the Oi act in the environment. Each Oi has a set Oi of possible stationary strategies to choose from and can switch from one to another in any round of the interaction. A strategy defines a probability distribution for taking an action given a history of interactions.
4 These terms will be used interchangeably throughout this document.
5 The interpretation of the reward depends on the domain; it could represent, for example, money. However, it is generally used as a value without units.
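To make this setting concrete, the following minimal sketch illustrates the round-by-round protocol described above. The learning agent, the opponents and the payoff function are hypothetical placeholder objects introduced only for this example; they are not part of the thesis's algorithms.

    # Minimal sketch of the repeated interaction described above (illustrative only).
    # `learning_agent`, `opponents` and `payoff` are hypothetical objects standing in
    # for agent A, the opponents Oi and the joint reward function.

    def run_interaction(learning_agent, opponents, payoff, num_rounds):
        cumulative_reward = 0.0
        history = []                                   # joint actions observed so far
        for t in range(num_rounds):
            a = learning_agent.act(history)            # A picks an action
            b = [o.act(history) for o in opponents]    # opponents act simultaneously;
                                                       # each may switch strategies at any round
            r = payoff(a, b)                           # A only observes its own reward
            learning_agent.observe(history, a, b, r)   # update the opponent model / policy
            history.append((a, tuple(b)))
            cumulative_reward += r                     # objective: maximize this sum
        return cumulative_reward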
1.2.2 Research questions
We formulate the following questions:
1. How should we model a non-stationary agent so that a model can be learned in few interactions?
2. How to efficiently detect strategy switches in other agents?
3. How to learn a model of an opponent based on few interactions to derive an optimal policy for
interacting with other agents?
1.3 Objectives
This thesis has the following main objective.
1.3.1 Main objective
Develop algorithms for learning agent models accurately and in few interactions against non-stationary
opponents, along with planning algorithms that use the learned model to compute a strategy so that
interaction with the other agents is as profitable as possible.
1.3.2 Specific objectives
1. Define a model of an opponent (agent).
2. Define a suitable measure of distance between models of agents.
3. Develop an algorithm for learning models of non-stationary agents.
4. Identify a suitable planning algorithm to be used in conjunction with the learning algorithm.
5. Integrate the learning algorithm and the planning algorithm.
6. Test the proposed algorithm in a real-world application.
1.4 Contributions
The contributions of this thesis in the area of multiagent learning are listed below:
• A framework [Hernandez-Leal et al., 2014a] with two instantiations: MDP-CL and MDP4.5
[Hernandez-Leal et al., 2013a,b,c] which are designed to learn fast against non-stationary oppo-
nents in repeated games.
• Two extensions of MDP-CL [Hernandez-Leal et al., 2014b]. A priori MDP-CL assumes that the set of possible strategies used by the opponent is known and detects which one the opponent is using, while still checking for switches. Incremental MDP-CL keeps a history of learned models; once it detects a switch, it checks whether the new behavior matches any of the previous models or is a new one.
• A new type of exploration, called drift exploration, designed to detect switches by non-stationary opponents that would otherwise pass unnoticed.
• The R-max# algorithm [Hernandez-Leal et al., 2014c], which performs an efficient drift exploration. The algorithm is inspired by R-max [Brafman and Tennenholtz, 2003], but it keeps learning a model continuously by relearning state-action pairs that have not been updated recently. R-max# provides theoretical guarantees for obtaining optimal rewards under certain assumptions.
• The DriftER algorithm [Hernandez-Leal et al., 2015a], which detects switches by tracking the error rate of the learned model. It provides theoretical guarantees of switch detection with high probability.
1.5 Thesis summary
In Chapter 2, the models and concepts from planning, machine learning and game theory which are
related to this thesis are described.
In Chapter 3, the state of the art in multiagent decision theoretic planning, non-stationary rein-
forcement learning, repeated games in game theory and recent approaches in opponent modeling are
analyzed.
In Chapter 4, we present the main contributions of this thesis: different proposals for acting against non-stationary opponents. The thesis first proposes a framework for fast learning against switching opponents in repeated games. The framework learns opponent models from a series of interactions. Two different implementations of the framework were evaluated: one uses decision trees (MDP4.5) and the other uses MDPs (MDP-CL). MDP4.5's limitation is that decision trees are not a good model for handling stochasticity. MDP-CL's limitation is that it fails to detect some types of switches.
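As a rough illustration of the learn/plan/detect cycle this framework follows, the sketch below compares models learned over successive windows and replans when they diverge. It is a minimal sketch only: learn_opponent_mdp, solve_mdp, model_distance and the environment interface are hypothetical helpers, and the actual algorithms and their parameters are specified in Chapter 4.

    # Rough sketch of the learn/plan/detect cycle of an MDP-CL-style agent
    # (illustrative; learn_opponent_mdp, solve_mdp and model_distance are
    # hypothetical helpers, and the real algorithm is detailed in Chapter 4).

    def mdp_cl_style_loop(env, window_size, threshold, num_rounds):
        history = []
        model, policy = None, None
        for t in range(num_rounds):
            action = policy.act(env.state) if policy else env.random_action()
            obs = env.step(action)                 # joint action outcome and reward
            history.append(obs)
            if t > 0 and t % window_size == 0:
                new_model = learn_opponent_mdp(history[-window_size:])
                if model is None:
                    model, policy = new_model, solve_mdp(new_model)
                elif model_distance(model, new_model) > threshold:
                    # models disagree: assume the opponent switched strategies,
                    # discard the old model and start learning from scratch
                    history, model, policy = [], None, None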
Next, we consider what happens when prior information can be used when facing non-stationary opponents. We propose a priori MDP-CL, which assumes that the set of models used by the opponent is known from the start. The second extension is incremental MDP-CL, which learns new models from the history of interactions but does not discard them once it detects a switch. In this way, it keeps a record in case the opponent reuses a previous strategy. Their limitation is that they do not have theoretical guarantees of switch detection.
Then we argue that classic exploration strategies (e.g., ε-greedy, softmax), which tend to decrease their exploration rate over time, are not sufficient against non-stationary opponents [Cote et al., 2010]. We take recent algorithms that perform efficient exploration (in terms of the number of suboptimal decisions made during learning [Brafman and Tennenholtz, 2003]) as a stepping stone to derive a new exploration strategy against non-stationary opponent strategies. We propose a new adversarial drift exploration, which efficiently explores the state space while being able to detect regions of the environment that have changed. We present, first, drift exploration as a strategy for switch detection and, second, a new algorithm called R-max# for learning and planning against non-stationary opponents. R-max# makes efficient use of exploration experiences, which results in rapid adaptation and efficient drift exploration, in order to deal with the non-stationary nature of the opponent's behavior.
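As a rough illustration of the staleness-driven idea behind this kind of drift exploration (not the exact R-max# update rules, which Chapter 4 specifies), state-action pairs that have not been visited for τ steps can be treated as unknown again, so an optimistic planner is driven back to revisit them. The bookkeeping below is an assumption made for the example.

    # Illustrative sketch of staleness-driven drift exploration (not the exact
    # R-max# rules): a state-action pair not visited for `tau` steps is marked
    # unknown again, so an optimistic planner will re-explore it.

    def mark_stale_pairs_unknown(known, last_visit, current_step, tau):
        """known: set of (state, action) pairs considered learned.
        last_visit: dict mapping (state, action) -> last step it was experienced."""
        for pair in list(known):
            if current_step - last_visit.get(pair, 0) > tau:
                known.discard(pair)   # forget it: its outcome may have drifted
        return known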
Lastly, we present DriftER, an algorithm for detecting switches in opponent strategies by tracking the error rate of the learned model. Moreover, DriftER provides a theoretical guarantee that a switch will be detected with high probability.
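A minimal sketch of this error-rate idea is shown below; the smoothing window and the rising-trend rule used here are illustrative assumptions, while DriftER's actual decision rule and its guarantee are presented in Chapter 4.

    # Minimal sketch of switch detection by tracking the opponent model's error
    # rate (illustrative; DriftER's actual decision rule is given in Chapter 4).

    from collections import deque

    class ErrorRateSwitchDetector:
        def __init__(self, window=20, patience=3):
            self.errors = deque(maxlen=window)   # 1 = model mispredicted, 0 = correct
            self.rates = deque(maxlen=patience)  # recent smoothed error rates
            self.patience = patience

        def update(self, model_prediction, observed_action):
            self.errors.append(0 if model_prediction == observed_action else 1)
            self.rates.append(sum(self.errors) / len(self.errors))
            # signal a switch when the error rate has been strictly rising for
            # `patience` consecutive checks (the model keeps getting worse)
            return len(self.rates) == self.patience and all(
                a < b for a, b in zip(self.rates, list(self.rates)[1:]))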
In Chapter 5 we present experiments in five domains: the iterated prisoner's dilemma, the multiagent prisoner's dilemma, the alternate-offers bargaining protocol, double auctions in energy markets, and a general setting involving game theory games and strategies. Experiments and results are discussed for each of our proposals in terms of the quality of the learned models and the average rewards obtained. Comparisons were performed against different state-of-the-art approaches.
Finally, in Chapter 6 we present the conclusions of this thesis with a summary of the contributions.
We enumerate some open questions that are left as future research and we conclude with the list of
derived publications.
Chapter 2
Preliminaries
This thesis lies at the intersection of different areas: decision theoretic planning, game theory and machine learning. In this chapter, we present the models, concepts and algorithms of each area that are relevant to this work.
2.1 Decision theoretic planning
Decision theoretic planning is the study of sequential decisions under uncertainty. Many real problems
depend on several decisions over time; for example, in a negotiation against an opponent, obtaining the best
reward requires a sequence of actions. These problems can be solved with models that handle sequential
decisions. Two important models are Markov decision processes (MDPs) and partially observable
Markov decision processes (POMDPs).
2.1.1 Markov decision process
A Markov decision process [Puterman, 1994] is a model that can obtain optimal decisions in an
environment with a single agent, assuming it has perfect sensors. An MDP can be seen as a model of
an agent interacting with the world. This process is depicted in Figure 2.1: the agent takes as input a
state s of the world and generates as output an action a that affects the world. There is a transition
function T that describes how an action affects the environment in a given state. The component Z in
the figure represents the agent's perception function, which transforms the state s into a perception z.
In an MDP it is assumed that there is no uncertainty about where the agent is. This implies that the agent has
full and perfect perception capabilities and knows the true state of the environment (what it perceives
is the real state, z = s); the next section presents a model where z ≠ s. The component R is the reward
function; rewards help the agent to know which actions and states are good and which are bad
[Littman, 1996].
Figure 2.1: Interaction of an agent with an environment. The agent performs an action a which affects the state of the
environment according to a function T, producing the state s. The agent perceives an observation z about the
environment (given by a function Z) and obtains a reward r (given by a function R).
Figure 2.2: A Markov decision process (MDP) with four states S0, S1, S2, S3 and two actions a1, a2. The
arrows denote the tuple: action, transition probability and reward.
We first define a Markovian process and then an MDP.
Definition 2.1 (Markovian process). A stochastic process in which the transition probabilities depend
only on the current state.
Definition 2.2 (Markov decision process). An MDP is defined by the tuple ⟨S, A, R, T⟩ where S
represents the world divided up into a finite set of possible states. A represents a finite set of available
actions. T : S × A → Δ(S), called the transition function, is a function that for each state and action
associates a probability distribution over the possible successor states (Δ(S) denotes the set of all
probability distributions over S). Thus, for each s, s′ ∈ S and a ∈ A, the function T determines the
probability of a transition from state s to state s′ after executing action a. R : S × A → ℝ is the
reward function that defines the immediate reward that an agent receives for being in state s and
executing action a.
A common assumption about MDPs is that they are stationary, i.e., that the transition probabilities
do not change over time. An example of an MDP with 4 states and 2 actions is depicted in Figure 2.2.
Ovals represent states of the environment. Each arrow carries a triplet a, p, r representing the action,
the transition probability and the reward, respectively.
Solving an MDP yields a policy π : S → A, which is a mapping from states to actions. An
optimal policy π* is one that guarantees the maximum expected reward. There are different techniques
for solving MDPs; one of the most common is the value iteration algorithm [Bellman, 1957]. The
complexity of solving an MDP depends on the method, but several methods have been shown to be in
P [Littman et al., 1995].
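To make the planning step concrete, the following is a minimal value-iteration sketch for an MDP as in Definition 2.2. It is illustrative only: the dictionary-based encoding of T and R, the discount factor gamma and the convergence threshold theta are assumptions, not part of the thesis.

# Minimal value-iteration sketch for an MDP (Definition 2.2).
# T[s][a] is a dict {s2: probability}; R[s][a] is the immediate reward.
# gamma (discount factor) and theta (convergence threshold) are illustrative parameters.
def value_iteration(states, actions, T, R, gamma=0.95, theta=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())
                       for a in actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # Extract a greedy policy pi : S -> A from the converged value function
    pi = {s: max(actions,
                 key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items()))
          for s in states}
    return V, pi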
The assumption of perfect perception capabilities made by MDPs can be excessive in some domains.
For example, the sensors of a robot are not perfect, which limits its capabilities; in competitive domains,
agents may not be able to observe the complete scenario. Thus, MDPs may not be the best choice
for modeling those domains.
2.1.2 Partially observable Markov decision process
A POMDP [Kaelbling et al., 1998; Monahan, 1982] is a partially observable MDP; this means that the
state of the agent is not known with certainty, and there is a probability distribution over the possible
states. The model is similar to an MDP, and solving a POMDP also yields a policy, with the difference
that this policy is now a mapping from probability distributions over states to actions.
A POMDP extends an MDP by adding:
• Observations Z - a finite set of observations of the state (which can be seen as responses,
diagnoses, perceptions or views). In MDPs, the agent has full knowledge of the system state;
therefore, Z = S. In partially observable environments, observations are only probabilistically
dependent on the underlying environment state. Determining which state the agent is in becomes
problematic, because the same observation can be obtained in different states.
• Observation Function Z - captures the relationship between the state and the observations (and
can be action dependent). This is the probability that observation z′ will be recorded at time
t+1 after the agent performs action a (at time t) and lands in state s′ (at time t+1):
$$\Pr(Z_{t+1} = z' \mid S_{t+1} = s', A_t = a) \qquad (2.1)$$
The observation function is also assumed to be Markovian and stationary.
Definition 2.3 (Partially observable Markov decision process). A POMDP is defined by the tuple
⟨S, A, Z, Z, T, R⟩ (where the first Z denotes the observation set and the second the observation function), where:
S is the set of states, A the set of actions and Z the set of observations. T : S × A → Δ(S), with
T(s, a, s′) = P(s′ | s, a), is the state-transition function, giving for each world state and agent action a
probability distribution over world states. Z : S × A → Δ(Z), with Z(s′, a, z′) = P(z′ | s′, a), is the
observation function: the probability of making observation z′ given that the agent took action a and landed in state
s′. R : S × A → ℝ is the reward function, giving the expected immediate reward gained by the agent
for taking each action in each state.
Solving a POMDP over a finite horizon (number of steps) is PSPACE-complete, and some variations have been
proved NP-complete [Papadimitriou and Tsitsiklis, 1987], making POMDPs computationally more complex
than MDPs. Thus, a number of approximate techniques for solving POMDPs have been developed
[Monahan, 1982; Pineau et al., 2006].
2.2 Machine learning
The field of machine learning is concerned with the question of how to construct computer programs
that automatically improve with experience. Two subareas of machine learning are relevant to
this thesis: supervised learning and reinforcement learning.
2.2.1 Supervised learning: classification
The objective of supervised learning is to infer a function from a set of labeled data [Mohri et al., 2012].
One important task in supervised learning is classification, where the data is usually known before the
learning task starts, which is called offline learning. The data consist of a set of examples, each containing a
feature vector x_i and a label (class) y_i. A supervised learning algorithm produces a function g : X → Y,
with X and Y the input and output spaces, respectively. One widely used technique for classification is
learning based on decision trees.
Decision trees
Decision tree learning is a method for approximating discrete-valued target functions, in which the
learned function is represented by a decision tree. Learned trees can also be re-represented as sets of
if-then rules to improve human readability [Mitchell, 1997].
The objective of a decision tree is to specify a model that predicts the value of a certain variable,
called class, given that some input information is provided.
Definition 2.4 (Decision tree). A decision tree D is composed of nodes which represent tests to be
carried out on variables known as attributes. Each test has different outcomes, which are branches of
the node. An outcome can be of two types: a leaf, which provides a value for the class (predicted variable)
and represents a final node of the tree, or another test.
One of the best-known algorithms for learning decision trees from a batch of information is C4.5
[Quinlan, 1993].
Trees are useful for representing behavioral strategies. For example, Figure 2.3 depicts a decision tree
with one decision node and two leaves. This tree specifies the behavior of an agent: when the
opponent's last action is a1, it responds with action b1 100% of the time.
Figure 2.3: A decision tree that models an opponent strategy. It contains one decision node (LearnAgent's
last action) and two leaves that correspond to the actions of the opponent, b1 and b2; each leaf has a
number of correctly classified/misclassified instances.
When the opponent's last action is a2, it responds with b2 with 100% accuracy (these values will later be
used to compute transition probabilities).
2.2.2 Reinforcement learning
Reinforcement learning (RL) [Sutton and Barto, 1998] addresses the question of how an autonomous
agent can learn to choose optimal actions to achieve its goals. Section 2.1 presented MDPs and how
to solve them given a complete set of states, actions, rewards and transitions. However, such a description may be
difficult to obtain in several domains; for this reason, reinforcement learning algorithms learn
optimal policies from experience, without a complete description of the MDP.
A reinforcement learning agent interacts with its environment in discrete time steps. At each
time step, the agent chooses an action from the set of available actions, which is subsequently sent to the
environment. The environment moves to a new state and the reward associated with the transition is
determined (see Figure 2.1). The goal of a reinforcement learning agent is to collect as much reward
as possible. In this type of learning the learner is not told which actions to take, but instead must
discover which actions yield the best reward by trying them.
Q-learning [Watkins, 1989] is a well-known RL algorithm. It is generally used in stationary,
single-agent, fully observable environments. In its general form, a Q-learning agent can be in any state
s ∈ S and can choose an action a ∈ A. It keeps a data structure Q(s, a) that represents its estimate
of the expected payoff for starting in state s and taking action a. Each entry Q(s, a) is an estimate of the
corresponding optimal Q* function. Each time the agent makes a transition from a state s to a state
s′ via action a and receives payoff r, the Q table is updated according to:
$$Q(s, a) = \alpha \bigl( r + \gamma \max_{b} Q(s', b) \bigr) + (1 - \alpha)\, Q(s, a) \qquad (2.2)$$
with α (called the learning rate) and γ ∈ [0, 1] (called the discount factor). Q-learning will converge
toward the true Q function if each state-action pair is visited infinitely often; that is, Q(s, a) converges
to the true value given sufficient visits.
By using Q-learning it is possible to learn an optimal policy without knowing T or R beforehand,
and even without learning these functions [Littman, 1996]. For this reason, this type of learning is
known as model-free RL. In contrast, in model-based RL the agent attempts to learn a model of
its environment. Having such a model allows the agent to predict the consequences of actions before
they are taken, as in Dyna-Q [Sutton and Barto, 1998].
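A minimal model-free sketch of the update in Equation 2.2 is shown below. The environment interface (env.reset, env.step, env.actions), the ε-greedy rule and all parameter values are assumptions made for illustration, not part of the thesis.

# Minimal Q-learning sketch implementing the update in Equation 2.2.
# env.reset(), env.step(a) -> (next_state, reward, done) and env.actions are assumed interfaces.
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(float)                              # Q[(s, a)] initialised to 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection (exploration is discussed in Section 2.2.3)
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda b: Q[(s, b)])
            s2, r, done = env.step(a)
            # Equation 2.2: Q(s,a) = alpha*(r + gamma*max_b Q(s',b)) + (1-alpha)*Q(s,a)
            best_next = max(Q[(s2, b)] for b in env.actions)
            Q[(s, a)] = alpha * (r + gamma * best_next) + (1 - alpha) * Q[(s, a)]
            s = s2
    return Q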
Most RL algorithms consider the environment to be stationary; an exception is the hidden-mode
MDP, which will be used as a comparison in the experiments in Section 5.2.
Hidden-mode MDPs
Hidden-mode Markov decision processes (HM-MDPs) [Choi et al., 1999] are a reinforcement learning
approach designed for non-stationary environments. They assume the environment can be represented
by a small number of modes. Each mode is a stationary environment, which has different dynamics
and needs a different policy. It is assumed that at each time step there is only one active mode. The
modes are hidden, which means they cannot be directly observed; they can only be estimated from past
observations. Moreover, transitions between modes are stochastic events. Each mode is modeled as
an MDP; the different MDPs, along with the mode transition probabilities, form the HM-MDP. HM-MDPs are a
special case of POMDPs; therefore, it is always possible to reformulate the former in the form of the
latter [Choi et al., 2001]. This is useful since a number of methods for solving POMDPs are available
[Cassandra, 1998].
Definition 2.5 (Hidden-mode Markov decision process). An HM-MDP is an 8-tuple ⟨Q, S, A, X, Y, R, Π, Ψ⟩,
where Q, S and A represent the sets of modes, states and actions respectively; the mode transition
function X maps mode m to n with a fixed probability x_mn; the state transition function Y defines the
transition probability y_m(s, a, s′) from state s to s′ given mode m and action a; the stochastic reward
function R returns rewards with mean value r_m(s, a); Π and Ψ represent the prior probabilities of the
modes and the states, respectively.
Figure 2.4 depicts an example of an HM-MDP with 3 modes and 4 states. Each of the three
large circles represents a mode; shaded circles inside the modes represent states. Thick arrows indicate
stochastic transitions between modes; for example, the arrow labeled x_mn represents the probability of
transitioning from mode m to mode n. Thinner arrows represent state-action-next state probabilities.
For example, the arrow labeled y_m(s, a, s′) represents the probability of transitioning to state s′ when
taking action a in state s under mode m.
Figure 2.4: An example of an HM-MDP with 3 modes (large circles) and 4 states (smaller shaded
circles). The value x_mn represents a transition probability between modes m and n, and y_m(s, a, s′)
represents a state transition probability in mode m.
2.2.3 Exploration vs. exploitation
One major difference between reinforcement learning and supervised learning is that an RL agent must
explicitly explore its environment.
The simplest possible reinforcement-learning problem is known as the k-armed bandit problem
[Robbins, 1985]: the agent is in a room with k gambling machines (each called a "one-armed bandit"). At
each time step the agent pulls the arm of one of the machines and receives a reward. The agent is
permitted a fixed number of pulls, and its purpose is to maximize its total reward over the sequence
of trials. Since each arm is assumed to have a different distribution of rewards, the goal is to find the
arm with the best expected return as early as possible, and then to keep gambling using that arm.
This problem illustrates the fundamental tradeoff between exploration and exploitation. The agent
might believe that a particular arm has a fairly high payoff probability; the questions are: 1) should
it choose that arm all the time (exploit)? or 2) should it choose another arm, about which it has less
information (explore)?
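The trade-off can be illustrated with an ε-greedy rule on a k-armed bandit. The sketch below is purely illustrative: the Gaussian arm rewards and the value of ε are arbitrary choices, not something prescribed by the thesis.

# Illustrative epsilon-greedy agent for the k-armed bandit problem.
# Arm reward distributions (Gaussian) and epsilon are arbitrary choices for the example.
import random

def epsilon_greedy_bandit(arm_means, pulls=1000, epsilon=0.1):
    k = len(arm_means)
    counts = [0] * k
    estimates = [0.0] * k                  # running estimate of each arm's mean reward
    total = 0.0
    for _ in range(pulls):
        if random.random() < epsilon:
            arm = random.randrange(k)                               # explore
        else:
            arm = max(range(k), key=lambda i: estimates[i])         # exploit
        reward = random.gauss(arm_means[arm], 1.0)
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]   # incremental mean
        total += reward
    return total, estimates

print(epsilon_greedy_bandit([0.2, 0.5, 0.8]))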
Sample complexity of exploration
In order to answer those questions, we need to define some related terms, such as the sample complexity.
Loosely speaking, the sample complexity is the number of examples needed for the estimate of a target
function to be within a given error rate. Kakade [2003] studies the sample complexity as a function of
the sampling model.1 In particular, the sample complexity is considered to be the number of calls to
1For example, the environment itself is the sampling model when an agent must follow one unbroken chain of experience
for some number of decision epochs (timesteps) in which a state is observed and an action is taken, so the number of
decision epochs is equivalent to the amount of observed experience.
the sampling model required to satisfy a specified performance criterion, and we are interested in how
this scales with the relevant problem-dependent parameters. In the reinforcement learning setting, these
parameters are the size of the state space, the size of the action space, the number of decision steps
and the variance of the reward function.
R-max [Brafman and Tennenholtz, 2003] is a well-known model-based RL algorithm (presented
in Algorithm 2.1) which has an efficient built-in mechanism for resolving the exploration-exploitation
dilemma. It uses an MDP to model the environment, which is initialized optimistically by assuming all
actions return the maximum possible reward, rmax. With each experience of the form (s, a, s′, r),
R-max updates its model. Since there is only a polynomial number of parameters to learn, as long as
learning is done efficiently we can ensure that the agent spends a polynomial number of steps exploring,
and the rest of the time exploiting (Theorem 2.2.1). Thus, the policy efficiently leads the
agent to less known state-action pairs or exploits known ones with high utility. R-max promotes an
efficient sample complexity of exploration. However, R-max alone will not work when the environment
is non-stationary (e.g., when there are strategy switches during the interaction).
Algorithm 2.1: R-max algorithm
Input: State set S, fictitious state s0, action set A, threshold parameter m, rmax value, number of rounds T
Function: SolveMDP(), receives a tuple which corresponds to an MDP and obtains a policy

  S = S ∪ {s0}
  ∀(s, a, s′): r(s, a) = n(s, a) = n(s, a, s′) = 0
  ∀(s, a): T(s, a, s0) = 1
  ∀(s, a): R(s, a) = rmax
  for t = 1, ..., T do
      Observe state s
      Execute action a using the policy
      Observe reward r and next state s′
      if n(s, a) < m then
          Increment counters n(s, a) and n(s, a, s′)
          Update reward r(s, a)
          if n(s, a) == m then
              R(s, a) = r(s, a)/m
              for s′ ∈ S do
                  T(s, a, s′) = n(s, a, s′)/m
              end
              SolveMDP(S, A, T, R)
          end
      end
  end
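The following is a compact sketch of the same idea in code: unknown state-action pairs optimistically lead to a fictitious rmax-rewarding state until they have been observed m times, at which point the empirical estimates replace the optimism and the MDP is re-solved. The class layout and the solve_mdp callback (any MDP solver returning a policy) are assumptions for illustration, not the thesis' implementation.

# Sketch of the R-max mechanism in Algorithm 2.1 (illustrative, not the thesis' implementation).
# solve_mdp(states, actions, T, R) is assumed to return a policy mapping state -> action.
from collections import defaultdict

class RMaxSketch:
    def __init__(self, states, actions, m, rmax, solve_mdp):
        self.states, self.actions, self.m, self.rmax = states, actions, m, rmax
        self.solve_mdp = solve_mdp
        self.n = defaultdict(int)          # visit counts n(s, a)
        self.n_sas = defaultdict(int)      # transition counts n(s, a, s')
        self.r_sum = defaultdict(float)    # accumulated reward per (s, a)
        all_states = states + ["s0"]       # s0 is the fictitious rmax state
        # Optimistic initialisation: every pair returns rmax and leads to s0
        self.T = {s: {a: {"s0": 1.0} for a in actions} for s in all_states}
        self.R = {s: {a: rmax for a in actions} for s in all_states}
        self.policy = solve_mdp(all_states, actions, self.T, self.R)

    def update(self, s, a, r, s2):
        if self.n[(s, a)] >= self.m:       # pair already "known": keep the learned model
            return
        self.n[(s, a)] += 1
        self.n_sas[(s, a, s2)] += 1
        self.r_sum[(s, a)] += r
        if self.n[(s, a)] == self.m:       # pair becomes known: replace optimism by estimates
            self.R[s][a] = self.r_sum[(s, a)] / self.m
            self.T[s][a] = {t: self.n_sas[(s, a, t)] / self.m
                            for t in self.states if self.n_sas[(s, a, t)] > 0}
            self.policy = self.solve_mdp(self.states + ["s0"], self.actions, self.T, self.R)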
Below we present the main theorem for R-max, which guarantees near-optimal expected rewards.
Definition 2.6 (Approximation Condition). If R-max uses the set of states K and an MDP M_K,
then for the optimal policy π for M_K it is assumed that, for all states s and times t < T,
$$U_{\pi,t,M_K}(s) > U^{*}_{t,M_K}(s) - \epsilon \qquad (2.3)$$
The assumption states that the policy π that R-max derives from M_K is near-optimal in M_K.
Theorem 2.2.1 (Kakade, 2003). Let M = ⟨S, A, T, R⟩ be an L-epoch MDP. If c is an L-path sampled
from Pr(· | R-MAX, M, s_0) and the approximation condition holds, then, with probability greater than
1 − δ, the R-MAX algorithm guarantees an expected return of U*(c_t) − 2ε within
O((m|S||A|T/ε) log(|S||A|/δ)) timesteps t ≤ L.
The high-level idea of the proof is as follows. By the approximation condition we know that the
learned MDP M_K is a good approximation of the real M_K. The policy used by the algorithm is 2ε near-optimal
in M. By the pigeonhole principle, successful exploration can only occur m|S||A| times. Hence,
as long as the escape probability (the probability of reaching a state that is not yet known) is "large",
exploration must "quickly" cease and exploitation must occur [Kakade, 2003].
Kakade's results will be used as a base to provide a theoretical guarantee of near-optimal rewards
for our R-max# approach in Chapter 4.
2.3 Game theory
Game theory [Fudenberg and Tirole, 1991] is the area that studies decision problems in which several
agents interact. The terminology in this area is different: agents are called players, and a single
interaction between players is represented as a game.
The most common way of presenting a game is by using a matrix that denotes the utility obtained
by each agent; this is the normal form.
Definition 2.7 (Normal-form game). A (finite, I-person) normal-form game Γ is a tuple ⟨N, A, u⟩, where:
N is a finite set of I players, indexed by i;
A = A_1 × ··· × A_I, where A_i is a finite set of actions available to player i. Each vector a = (a_1, ..., a_I) ∈ A is called an action profile;
u = (u_1, ..., u_I), where u_i : A → ℝ is a real-valued utility or payoff function for player i.
For example, the game presented in Table 2.1 is represented by a two-dimensional table, called a bimatrix.
In general, each row corresponds to a possible action for player 1, each column corresponds to a
possible action for player 2, and each cell corresponds to one possible outcome. Each player's utility
for an outcome is written in the cell corresponding to that outcome, with player 1's utility listed first.
Table 2.1: The bimatrix for the prisoner's dilemma game. Each cell gives the utilities of the
agents (the first for agent A and the second for agent O); r_pd, t_pd, s_pd, p_pd are numerical values for which the
conditions t_pd > r_pd > p_pd > s_pd and 2r_pd > t_pd + s_pd must hold.

                          Agent O
                    cooperate       defect
Agent A  cooperate  r_pd, r_pd      s_pd, t_pd
         defect     t_pd, s_pd      p_pd, p_pd
In the example, each player has two actions, {cooperate, defect}. This is a well-known game,
the prisoner's dilemma (PD), where the conditions t_pd > r_pd > p_pd > s_pd
and 2r_pd > t_pd + s_pd (to prevent alternating cooperation and defection from giving a higher payoff than full
cooperation) must hold. When both players cooperate they both obtain the reward r_pd. If both defect, they
get a punishment reward p_pd. A player who cooperates with a defector receives the
sucker's payoff s_pd, whereas the defecting player gains the temptation payoff t_pd.
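As a quick numerical check, the classic payoff values T = 5, R = 3, P = 1, S = 0 (illustrative values, not taken from the thesis) satisfy both conditions:

# Classic illustrative prisoner's dilemma payoffs; the values are not from the thesis.
t_pd, r_pd, p_pd, s_pd = 5, 3, 1, 0
assert t_pd > r_pd > p_pd > s_pd        # temptation > reward > punishment > sucker
assert 2 * r_pd > t_pd + s_pd           # mutual cooperation beats alternating C/D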
A strategy specifies a method for choosing an action. One kind of strategy is to select a single
action and play it; this is a pure strategy.
Definition 2.8 (Mixed strategy). Let ⟨N, A, u⟩ be a normal-form game and, for any set X, let Δ(X) be
the set of all probability distributions over X. Then the set of mixed strategies for player i is S_i = Δ(A_i).
In general, a mixed strategy specifies a probability distribution over actions.
Definition 2.9 (Best response). Player i's best response to the strategy profile s_{−i} is a mixed strategy
s*_i ∈ S_i such that u_i(s*_i, s_{−i}) ≥ u_i(s_i, s_{−i}) for all strategies s_i ∈ S_i,
where s_{−i} = (s_1, ..., s_{i−1}, s_{i+1}, ..., s_n) represents the strategies of all players except i.
Thus, a best response for an agent is the strategy (or strategies) that produces the most favorable
outcome for that player, taking the other players' strategies as given. Another common strategy is the
minimax strategy.
Definition 2.10 (Minimax strategy). A strategy that maximizes the player's payoff under the assumption
that the opponent will make this maximum as small as possible.
Definition 2.11 (Security level). The security level is the expected payoff a player can guarantee itself
using a minimax strategy.
In single-agent decision theory, the optimal strategy is the one that maximizes the agent's
expected payoff for a given environment. In multiagent settings the situation is more complex, and the
notion of an optimal strategy for a given agent is not meaningful, since the best strategy depends on
the choices of the others. To address this problem, game theory has identified certain subsets of outcomes,
called solution concepts [Shoham and Leyton-Brown, 2008]; one of them is the Nash equilibrium, which
is explained next.
2.3.1 Nash equilibrium
Suppose that all players follow a fixed strategy profile in a given game. If no player can increase its
utility by unilaterally changing its strategy, then the strategy profile is a Nash equilibrium. Formally:
Definition 2.12 (Nash equilibrium [Nash, 1950]). A set of strategies s = (s_1, ..., s_n) is a Nash
equilibrium if, for all agents i, s_i is a best response to s_{−i}.
Even though it has been proved that every game has a Nash equilibrium, there are several limitations.
One problem is that a game may have multiple equilibria, and selecting among them is not an easy task
[Harsanyi and Selten, 1988].
2.3.2 Repeated and stochastic games
All the concepts presented in the previous section were defined for one-shot games (a single interaction);
however, it is often the case that more than one decision has to be made, for example when
repeating the same game or when facing a set of possible games.
Definition 2.13 (Stochastic game). A stochastic game (also known as a Markov game) is a tuple
⟨Q, N, A, P, R⟩, where: Q is a finite set of games; N is a finite set of I players; A = A_1 × ··· × A_I,
where A_i is a finite set of actions available to player i; P : Q × A × Q → ℝ is the transition probability
function, where P(q, a, q′) is the probability of transitioning from state q to state q′ after action profile a; and
R = (r_1, ..., r_I), where r_i : Q × A → ℝ is a real-valued payoff function for player i.
In a stochastic game, the agents repeatedly play games from a collection. The particular game
played at any given iteration depends probabilistically on the previously played game and on the actions
taken by all agents in that game [Shoham and Leyton-Brown, 2008].
Definition 2.14 (Repeated game). A repeated game is a stochastic game in which there is only one
game (called the stage game).
Before presenting examples of strategies for repeated and stochastic games, we formally define some
concepts.
Definition 2.15 (History). Let h_t = (q^0, a^0, q^1, a^1, ..., a^{t−1}, q^t) denote a history of t stages of a
stochastic game, and let H_t be the set of all possible histories of this length.
The set of deterministic strategies is the Cartesian product ∏_{t,H_t} A_i, which requires a choice for each
possible history at each point in time. An agent's strategy can consist of any mixture over deterministic
strategies. However, there are restricted classes of strategies; for example, requiring that the
mixing take place at each history independently gives behavioral strategies.
Definition 2.16 (Behavioral strategy). A behavioral strategy s_i(h_t, a_{i_j}) returns the probability of
playing action a_{i_j} for history h_t.
A Markov strategy restricts a strategy so that, for a given time t, the distribution over actions
depends only on the current state.
Definition 2.17 (Markov strategy). A Markov strategy s_i is a behavioral strategy in which
s_i(h_t, a_{i_j}) = s_i(h′_t, a_{i_j}) if q_t = q′_t, where q_t and q′_t are the final states of h_t and h′_t, respectively.
If we also remove the dependency on the time t we get:
Definition 2.18 (Stationary strategy). A stationary strategy s_i is a Markov strategy in which
s_i(h_{t_1}, a_{i_j}) = s_i(h′_{t_2}, a_{i_j}) if q_{t_1} = q′_{t_2}, where q_{t_1} and q′_{t_2} are the final states of h_{t_1} and h′_{t_2}, respectively.
To exemplify a repeated game, recall the prisoner's dilemma presented previously. If we repeat
the same game we get the iterated prisoner's dilemma (iPD), which has been the subject of many
experiments and for which there are several well-known strategies. A successful strategy, which won
Axelrod's tournament2, is Tit-for-Tat (TFT) [Axelrod and Hamilton, 1981]; it starts by cooperating,
and then does whatever the opponent did in the previous round: it cooperates if the opponent
cooperated and defects if the opponent defected. Another important strategy is Pavlov,
which cooperates if both players took the same action in the previous round and defects whenever they took
different actions. Another strategy is Bully (described in detail in Section 3.4.1), which in the iPD behaves
as a player who always defects. The finite state machines describing TFT and Pavlov are
depicted in Figure 2.5. Note that these strategies do not depend on the time index;
they are stationary strategies.
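A minimal sketch of these strategies is given below; the 'C'/'D' encoding, the first-move convention for Pavlov and the short interaction loop are illustrative assumptions rather than definitions from the thesis.

# Illustrative sketches of the stationary iPD strategies described above.
def tit_for_tat(my_last, opp_last):
    """Cooperate first, then repeat the opponent's previous action."""
    return "C" if opp_last is None else opp_last

def pavlov(my_last, opp_last):
    """Cooperate on the first round (an assumption here) and whenever both agents
    played the same action in the previous round; otherwise defect."""
    if opp_last is None:
        return "C"
    return "C" if my_last == opp_last else "D"

def bully(my_last, opp_last):
    """In the iPD, Bully simply always defects."""
    return "D"

# A short interaction: TFT against Bully
a_last = b_last = None
for _ in range(5):
    a, b = tit_for_tat(a_last, b_last), bully(b_last, a_last)
    print(a, b)
    a_last, b_last = a, b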
2.3.3 Behavioral game theory
In recent years, some authors have claimed that Nash equilibrium is a solution concept that has
limitations. The reason is that in many experiments people would not follow the actions prescribed by
2Robert Axelrod held a tournament of various strategies for the iterated prisoner’s dilemma. Strategies were run by
computers. In the tournament, programs played games against each other and themselves repeatedly.
Figure 2.5: (a) The automaton that describes the TFT strategy; depending on the opponent's action (c or d) it
transitions between the two states C and D. (b) The automaton describing the Pavlov strategy; it consists of four
states formed by the last actions of both agents (CC, CD, DC, DD).
the theory [Goeree and Holt, 2001; Risse, 2000; Simon, 1955]. Another complaint is that there exist
many games in which the Nash equilibrium does not guarantee the maximum utility for all players.
Further limitations are the assumption of rational agents and the possibility of multiple equilibria.
Some experiments [Kahneman and Tversky, 1979] have shown that humans do not always act
rationally (in terms of following the prescribed Nash equilibrium). These conclusions gave
birth to a branch known as behavioral game theory [Camerer, 1997]. The objective of this area is to
obtain more accurate predictions than classic game theory. To this end, a number of
social factors such as altruism, selfishness, reciprocity [Bolton and Ockenfels, 2000] and heuristics [Tversky
and Kahneman, 1974], as well as insights from cognitive science, have proven useful for modeling people's
play in games [Camerer, 2003].
2.4 Summary of the chapter
In this chapter we reviewed some of the most important concepts and models of decision theoretic
planning, machine learning and game theory that are relevant for the approaches described in
Chapter 4. In the next chapter we review recent work related to this thesis.
Chapter 3
Related Work
This research lies at the intersection of several areas. Figure 3.1 depicts a diagram with the state-of-the-art
models of game theory, decision theoretic planning, opponent modeling and machine learning;
this figure provides an overview of the chapter. Next we review the relevant related work
in detail.
3.1 Decision theoretic planning
In Section 2.1 the MDP and POMDP models were described. Now we present recent approaches which
try to overcome the limitation of having a single agent in the environment.
3.1.1 Multiagent approaches
An obvious approach is to extend single-agent solutions to multiagent settings while trying to obtain
the actions that yield the best utility. To this end, models such as decentralized
MDPs (DEC-MDPs) and decentralized POMDPs (DEC-POMDPs) [Seuken and Zilberstein, 2008] have
been proposed. These models are a generalization of MDPs and POMDPs to multiple cooperative
agents. One limitation is that they are only useful when the agents share a common objective, i.e.,
when they share the utility function; furthermore, the solution needs to be computed centrally and then distributed
to each agent. Another important limitation is their complexity, which is nondeterministic exponential time
(NEXP-complete); it is believed that solving these problems requires double exponential time in
the worst case [Seuken and Zilberstein, 2008].
Hidden-mode Markov decision processes (HM-MDPs) [Choi et al., 1999], presented in Section
2.2.2, are designed for non-stationary environments. An HM-MDP is a special case of a POMDP.
HM-MDPs assume the environment can be represented by a small number of modes. Each mode is
a stationary environment (modeled as an MDP), which has different dynamics and needs a different
Figure 3.1: Related work to this thesis divided into areas (white boxes) and the most representative models and
algorithms for each one (grey boxes).
policy. Experiments contrasting our approaches against HM-MDPs are presented in Section 5.2.
A more general model is the interactive POMDP (I-POMDP) [Gmytrasiewicz and Doshi, 2005].
This model does not assume that the agents are cooperative. These models are called interactive because they
consider what an agent knows and believes about what another agent knows and believes [Aumann,
1999]. This means that an agent has a model of how it believes another agent reasons. I-POMDPs
extend POMDPs by incorporating models of other agents into the regular state space, thus
building an interactive state space. A problem that occurs in I-POMDPs is infinite recursive
reasoning, that is, an agent A models another agent B that is in turn modeling A. To solve
this problem a reasoning threshold ℓ is defined, at which the base model cannot be a recursive
model. The main limitation of these models is their inherent complexity, since solving one I-POMDP
with M models considered at each level, with ℓ maximum reasoning levels, is equivalent
to solving O(M^ℓ) POMDPs [Seuken and Zilberstein, 2008]. Some works have tried these models in
real-world applications, such as analyzing money laundering [Ng et al., 2010] and playing a simplified
version of chess [Del Giudice et al., 2009]. Recently, Ng et al. [2012] proposed an approach that can
learn I-POMDPs online, called Bayes-Adaptive I-POMDPs; these models have also been used to model
populations (more than a thousand) of agents [Sonu et al., 2015].
Figure 3.2: An example of an influence diagram that represents the decision of whether to take an
umbrella.
3.2 Probabilistic Graphical Models
MDPs and POMDPs are enumerative models, which have their graphical counterparts in the area of
probabilistic graphical models (PGMs). The objective of using probabilistic graphical models [Koller
and Friedman, 2009] is to exploit their structure in order to find solutions faster than with their enumerative
versions [Doshi and Gmytrasiewicz, 2009].
One basic type of PGM is the influence diagram (ID) [Howard and Matheson, 2005; Shachter,
1986], which is a compact graphical representation of a single decision. An ID is a directed acyclic
graph with chance nodes, decision nodes, and a utility node. Arcs coming into decision nodes represent
the information that will be available when the decision is made. Arcs coming into chance nodes
represent probabilistic dependence. Arcs coming into the utility node represent what the utility
depends on. Figure 3.2 presents an example of an ID: it corresponds to a decision (rectangle)
of whether to take an umbrella or not, with two probabilistic nodes (ovals), Weather and Forecast, and
one utility node (diamond).
Another model is the dynamic influence diagram (DID), which can be seen as an ID that is
repeated over time; in each step a single decision for a single agent has to be made. This is the
graphical form of a POMDP.
Another relevant model for multiagent systems is the interactive dynamic influence
diagram (I-DID) [Doshi et al., 2008]. I-DIDs are a generalization of DIDs to multiple agents and
are the graphical counterpart of I-POMDPs. I-DIDs suffer from the curse of dimensionality, where
the dimensionality of the planning problem is directly related to the number of states [Kaelbling et al.,
1998], and the curse of history, where the number of belief-contingent plans increases exponentially
with the planning horizon [Pineau et al., 2006]. Because the state space is interactive, I-DIDs include
the models of other agents and, often, the number of candidate models grows exponentially [Doshi
et al., 2008].
3.3 Machine learning
We now review work in two different areas of learning. The first is called concept drift and is
related to supervised learning with changing concepts. The second area is reinforcement learning;
Section 2.2.2 presented the basic Q-learning algorithm, and here we review extensions for multiagent
systems.
3.3.1 Concept drift
The machine learning community has developed an area related to non-stationary environments and
online learning called concept drift [Widmer and Kubat, 1996]. The setting is similar to a
supervised learning scenario, except that the relation between the input data and the target variable changes
over time [Gama et al., 2014].
In particular, the work in [Gama et al., 2004] studies the problem of learning when the class-probability
distribution that generates the examples changes over time. A central idea is the concept
of context: a set of contiguous examples over which the distribution is stationary. The idea behind their
concept drift detection method is to monitor the online error rate of the algorithm. When a new
training instance is available, it is classified using the current model. Statistical theory guarantees that,
while the distribution is stationary, the error will decrease. When the distribution changes, the error
will increase. Therefore, if the error grows beyond a defined threshold, it means that the concept
has changed and needs to be relearned. The method was tested on both artificial and real-world
datasets. Even though concept drift ideas are related to our problem, they need to be adapted to the
multiagent setting. Moreover, the approach does not provide any formal guarantees of context (switch) detection.
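A minimal sketch of this error-rate monitoring idea is shown below. It follows the spirit of [Gama et al., 2004] but is not their exact method: the 3-sigma rule, the 30-example warm-up and the synthetic error stream are assumptions made for illustration.

# Sketch of error-rate based drift detection in the spirit of Gama et al. [2004].
# The 3-sigma threshold, the 30-example warm-up and the data stream are illustrative choices.
import math

class DriftDetector:
    def __init__(self):
        self.n = 0
        self.errors = 0
        self.p_min, self.s_min = float("inf"), float("inf")

    def add(self, misclassified):
        """Feed one outcome (True if the current model misclassified the example)."""
        self.n += 1
        self.errors += int(misclassified)
        p = self.errors / self.n                        # running error rate
        s = math.sqrt(p * (1 - p) / self.n)             # its standard deviation
        if p + s < self.p_min + self.s_min:             # remember the best level seen so far
            self.p_min, self.s_min = p, s
        if self.n < 30:                                 # warm-up before signalling anything
            return False
        return p + s > self.p_min + 3 * self.s_min      # drift: error rose well above the best level

detector = DriftDetector()
stream = [1, 0, 0, 0, 0] * 10 + [1, 1, 1, 0] * 10       # error rate jumps after 50 examples
for outcome in stream:
    if detector.add(bool(outcome)):
        print("switch/drift detected after", detector.n, "examples")
        break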
3.3.2 Reinforcement learning
Reinforcement learning, and in particular the Q-learning algorithm, has been widely studied and
several multiagent extensions have been proposed. We present the most important algorithms in this
area; a more extensive survey is presented in [Busoniu et al., 2008].
Hyper-Q [Tesauro, 2003] is an extension of Q-learning designed specifically for multiagent systems.
The main difference is that the Q function depends on three parameters: the state, the estimated joint
mixed strategy of all other agents, and the current mixed strategy of the agent. The problem with
this approach is that, in order to approximate the mixed strategies, a discretization has
to be performed. Thus, the Q-table grows exponentially in the number of discretization points, which
also results in longer learning times.
In [Da Silva et al., 2006] the reinforcement learning with context detection (RL-CD) approach is
described. The idea is to learn several partial models and decide at each time step which one to use
depending on the context of the environment. The approach needs extensive parameter tuning (six
parameters) for each domain, no formal guarantees are provided, and the approach does not use any
form of exploration for detecting changes in the environment.
Minimax-Q [Littman, 1994] extends Q-learning to zero-sum games1. In this case the value function is
formed over the joint actions of the two players and, instead of maximizing, the algorithm computes the minimax
operator in order to play its part of the Nash equilibrium strategy. This algorithm is guaranteed to
converge in self-play. However, it is not guaranteed to obtain a best response, which means it may
obtain suboptimal rewards.
Algorithm 3.1: WoLF-PHC algorithm
Input: learning rate parameters δ_l > δ_w ∈ (0, 1] and α ∈ (0, 1], set of actions A

  ∀(s, a): Q(s, a) = 0                                        // Initialize Q-table, policy and counter
  ∀(s, a): π(s, a) = 1/|A|
  ∀(s): C(s) = 0
  foreach round do
      Observe state s and select action a according to π(s) with suitable exploration
      Obtain reward r and next state s′
      Q(s, a) = (1 − α) Q(s, a) + α (r + γ max_{a′} Q(s′, a′))
      C(s) = C(s) + 1
      ∀a′ ∈ A_i: π̄(s, a′) = π̄(s, a′) + (1/C(s)) (π(s, a′) − π̄(s, a′))   // Update average policy π̄
      π(s, a) = π(s, a) + δ              if a = argmax_{a′} Q(s, a′)
                π(s, a) − δ/(|A| − 1)    otherwise
      where                                                    // Determine learning rate
          δ = δ_w   if Σ_{a′} π(s, a′) Q(s, a′) > Σ_{a′} π̄(s, a′) Q(s, a′)
              δ_l   otherwise
  end
The WoLF (win or learn fast) principle was introduced in the algorithm WoLF-IGA [Bowling
and Veloso, 2002]. The algorithm was designed to fulfill two properties: (i) rationality, in the form of
obtaining a best response against stationary policies, and (ii) convergence to a stationary policy. The
algorithm uses gradient ascent, so at each round the player updates its strategy to increase its expected
payoffs. The key of the approach is to use a variable learning rate for the gradient ascent;
the intuition is to learn quickly when losing and cautiously when winning. WoLF-PHC (see
Algorithm 3.1) is another algorithm extending Q-learning. It uses two learning rates,
one for winning and one for losing; to decide which to use it compares the expected value of the current
policy against that of the average policy. The algorithm has been empirically successful in self-play. Building on the WoLF
1In a zero-sum game each participant’s gain (or loss) of utility is exactly balanced by the losses (or gains) of the utility
of the other participant(s).
principle, the authors proposed another algorithm that also exhibits no regret2 in the limit [Bowling,
2004]. Even though WoLF algorithms are designed to converge in self-play to the Nash equilibrium,
this is not always optimal in terms of rewards. COLF (Change or Learn Fast) [Cote et al., 2006] is
an algorithm inspired by the WoLF principle, but with the objective of promoting cooperation among
self-interested agents to achieve a Pareto-efficient solution.
A more recent algorithm, M-Qubed (Max or Minimax Q) [Crandall and Goodrich, 2011], is another
reinforcement learning algorithm which balances cautious and best-response attitudes. M-Qubed
typically selects actions based on its Q-values in the current state (best response), but switches to its
minimax strategy when its total loss exceeds a pre-determined threshold (cautious). However, it is not
designed for switching opponents.
3.3.3 Exploration vs. exploitation
Exploration in multiagent systems has not received as much attention as in the single-agent setting.
Some works have analyzed different exploration schemes in specific domains such as economic systems [Rejeb
et al., 2005] or foraging [Mohan and Ponnambalam, 2011]. Carmel and Markovitch [1999] propose
an exploration strategy for model-based learning. In their case, the opponent is modeled through a
mixed strategy in a way that reflects uncertainty about the opponent, and experiments were performed in
the iterated prisoner's dilemma. The main limitation of these previous approaches is that they do not
handle exploration in non-stationary environments (such as strategy switches).
In multiagent systems, one common assumption is to not explicitly model other agents, but instead
treat them as part of the environment. However, this assumption presents problems; for example, when
many learning agents interact in the environment and explore at the same time, they may
create noise for the rest, which is called exploratory action noise [HolmesParker et al., 2014]. The
authors propose coordinated learning exploratory action noise (CLEAN) rewards to cancel such
noise. CLEAN assumes that agents have access to an accurate model of the reward function, which is
used jointly with reward shaping [Ng et al., 1999] to promote coordination and scalability. Exploratory
action noise does not appear in our setting since we focus on cooperative and competitive settings. In
competitive settings this approach will not work since it needs to know the reward function of all
other agents.
Bayesian Policy Reuse (BPR) [Rosman et al., 2015] has been proposed as a framework for quickly
determining the best policy to use when facing an unknown task. The agent is presented with
an unknown task which must be solved within a limited and small number of trials. BPR has a set
of policies Π and faces different tasks T. A BPR agent knows performance models P(U | T, Π) of how
each policy behaves over each task. Then, for each episode of the interaction, BPR selects a policy to
2Regret is a measure of how much worse an algorithm performs compared to the best static strategy.
act and receives an observation signal which is used to update its belief. BPR's main limitation is
that it needs to know, before the interaction starts, the set of policies to act and how those policies behave
under different tasks.
3.4 Game theory
We start by reviewing related work on implicit negotiation. Next, we review methods for learning
in repeated and stochastic games. Finally, we review approaches from the area of behavioral game
theory.
3.4.1 Implicit negotiation
In [Littman and Stone, 2001] the authors consider autonomous agents engaging in implicit negotiation
via their tacit interactions. The authors propose two general "leader" strategies for playing repeated
games. These strategies are called leaders since they assume that their opponents will adapt to their
strategy by playing a best response. The first strategy is a deterministic, state-free policy called "Bully".
This strategy chooses the action that maximizes its reward given that the opponent is playing a best
response. The second strategy is called "Godfather" since it makes its opponent an offer it cannot refuse.
The authors call a pair of deterministic policies a targetable pair if playing them results in each player receiving
more than its security level. Godfather chooses a targetable pair (if there is one) and plays its half in the
first stage. From then on, if the opponent plays its half of the targetable pair in one stage, Godfather
plays its half in the next stage; otherwise, it plays the policy that forces its opponent down to its
security level (the expected payoff a player can guarantee itself using a minimax strategy). These tactics
are forms of implicit negotiation in that they aim to achieve a mutually beneficial outcome without
using explicit communication outside of the game, and they can be used as general strategies in repeated
games.
3.4.2 Learning
A well-known algorithm for learning in repeated games is fictitious play [Brown, 1951]. The model
simply maintains counts of the opponent's past plays. The opponent is assumed to be
playing a stationary strategy, and the observed frequencies are taken to represent the opponent's mixed
strategy.
Some works have considered how to play against specific classes of opponents. For example, Manipulator
[Powers and Shoham, 2005] is designed for adaptive opponents with bounded memory in normal-form
games. In particular, the opponent plays a conditional strategy where its actions can only depend
on the most recent k periods of past history; to exploit this, Manipulator alternates among the strategies
fictitious play, minimax and a modified Bully strategy [Littman and Stone, 2001].
AWESOME (Adapt When Everybody is Stationary Otherwise Move to Equilibrium) [Conitzer
and Sandholm, 2006] is an algorithm that is guaranteed to converge to a Nash equilibrium in self-play.
It also learns to play optimally against stationary opponents in games with an arbitrary number of
players and actions.
Weighted Policy Learner (WPL) [Abdallah and Lesser, 2008] is similarly designed to converge to a
Nash equilibrium. However, it can do so with limited knowledge (it assumes that an agent neither knows
the underlying game nor observes the other agents).
Recent approaches have tried to reduce the learning time, thus proposing "fast" learning. One
such approach is the fast adaptive learner (FAL) [Elidrisi et al., 2012]. This algorithm focuses on fast
learning in two-person repeated games. To predict the next action of the opponent it maintains a set
of hypotheses over the history of observations using the method in [Jensen et al., 2005]. To
obtain a strategy against the opponent it uses a modified version of the Godfather strategy [Littman
and Stone, 2001]. However, the Godfather strategy is not a general strategy that can be used against
any opponent in any game. Also, FAL shows an exponential increase in the number of hypotheses
(in the size of the observation history), which may limit its use in larger domains. Recently, a similar
version of FAL designed for stochastic games, FAL-SG [Elidrisi et al., 2014], has been presented. The
idea is to map the stochastic game into the setting used by FAL (a matrix normal-form game). For
this, the algorithm generates a meta-game matrix by clustering the opponent's actions. After
obtaining this matrix the algorithm proceeds in the same way as FAL. Both FAL and FAL-SG use a
modified Godfather strategy to act, which fails to promote exploration for detecting switches, and
this results in longer times to detect opponent switches.
3.4.3 Behavioral game theory
An important group of models that try to capture human behavior are those that use iterative strategic
reasoning. The cognitive hierarchy [Camerer et al., 2004a] and level-k [Costa Gomes et al., 2001] are
the two most important models of this group. In both models, in order to take a decision an agent
performs a series of reasoning steps over different levels. Level zero is formed by simple strategies
(such as a random choice over the possible actions), and each subsequent level is constructed as a best
response against the lower levels. Even though these models tend to obtain good results in single-decision
games with human populations [Wright and Leyton-Brown, 2010], there is no analysis of how these
models might perform in repeated games or sequential decision problems.
Self-tuning experience-weighted attraction (EWA) [Camerer et al., 2004b] is an algorithm that
generalizes reinforcement learning and fictitious play and has shown good predictive power in short
repeated games (fewer than 8 iterations). However, this model does not allow history-dependent strategies
or other types of more general opponents.
3.5 Opponent and teammate modeling
Opponent modeling is the capacity to predict other agents' behavior in environments populated
with adversaries [Stone, 2007]. In this area, some important works have been devoted to
specific applications, for example playing poker [Bard et al., 2013] and Scrabble [Richards and Amir,
2006]. A similar situation can appear when the environment is populated with teammates that cannot
communicate; in that case the same approach of learning a model (of a teammate rather than an opponent) can be applied.
A general, domain-independent algorithm is the recursive modeling method (RMM) [Gmytrasiewicz
and Durfee, 2000]. This algorithm takes into account the reasoning of other agents to obtain the best
coordinated action. The model proposes that each agent creates a model of the other agents, which
in turn may have a model of the first agent (recursively). In order to terminate the recursion there is one basic level
in which the other agents are ignored. The model of an agent is a utility matrix; the main limitation
of this approach is how to obtain that matrix when agents are not cooperative.
In [Barrett et al., 2012] the problem is one where an agent is forced to work in a team with three
other unknown agents in order to complete a specific task. Therefore, the agent needs to build a model
of its teammates to plan its future behavior. To learn the models, decision trees [Quinlan, 1993] were
used, and for planning a Monte Carlo tree search [Kocsis and Szepesvari, 2006] was proposed. One
limitation is that the learning of models is performed in an offline fashion and only the belief update
is online.
The Lemonade Stand Game [Zinkevich et al., 2011] is a repeated, symmetric, 3-player, constant-sum,
finite-horizon game in which a player chooses a location for their lemonade stand on an island with
the aim of being as far as possible from its opponents. Different tournaments were played each year,
and research groups from all over the world submitted their agents, which led to interesting ideas
related to opponent modeling. In particular, the EA2 algorithm [Sykulski et al., 2010] won
the first tournament. The algorithm attempts to find a suitable partner with which to coordinate
and exploit the third player. To do this, it classifies the behavior of its opponents using the history
of joint interactions. Note that EA2 models the behaviors of its opponents, rather than situations
of the game. In [Cote et al., 2010] the authors presented the TeamUP agent, which won
the second tournament. They propose a special representation for adversarial (constant-sum) games.
Given the repeated nature of the interaction, they frame the action selection problem as a planning
problem, where the unknown behavior of the opponents is learned by repeatedly interacting with them.
This representation idea will be used throughout this thesis to model the opponent's strategy.
3.5.1 Memory bounded learning
A special type of learning can happen by using a particular type of information. Bounded-memory
opponents are agents that use the opponent's D past actions to decide how to behave.
For these agents the opponent's history of play defines the state of the learning agent. In [Banerjee
and Peng, 2005] the authors propose the adversary induced MDP (AIM) model, which uses a vector
φ_D that is a function of the past D actions of the learning agent, i.e., of (a_t, ..., a_{t−D}) ∈ A^D. Note
that the agent, by just keeping track of φ_D (its own past moves), can infer the policy of the bounded-memory
opponent.
Definition 3.1 (AIMs). An adversary induced Markov decision process (AIM) is a tuple ⟨A, Φ, T, U⟩ where:
• A is the action space of the agent.
• Φ = {φ_D : Σ_{a∈A} φ_D(a) = 1, φ_D(a) ∈ [0, 1] ∀a ∈ A} is the state space.
• T : A × Φ → Δ(Φ) is the state-transition function that maps actions and states to probability
measures on future states.
• U : A × Φ → ℝ is the function
$$U(\phi_D, a) = \sum_{b \in B} \rho(\phi_D, b)\, R_A(a, b) \qquad (3.1)$$
that maps a state φ_D ∈ Δ(A) and an action a ∈ A to a real number that represents the agent's
expected reward. R_A is the reward function of the learning agent. Here, ρ(·, b) ∈ [0, 1], subject
to the constraint Σ_{b′∈B} ρ(·, b′) = 1 (B is the action space of the opponent).
The learning agent, by knowing the MDP that the opponent induces, can compute an optimal policy
π*. These types of players can be thought of as finite automata that take the D most recent actions
of the opponent and use this history to compute their policy [Cote and Jennings, 2010].
Since memory-bounded opponents are a special case of opponents, different algorithms have been specifically
developed to be used against these agents. For example, the agent LoE-AIM [Chakraborty
and Stone, 2008] is designed to play against a memory-bounded player (without knowing the exact
memory length). Moreover, the algorithm presented in [Cote and Jennings, 2010] is designed to play
against an unbounded-memory player.
3.6 Hybrid approaches
Recently, some works have adopted a hybrid approach using models and ideas from different areas;
most of them use a behavioral game theory approach to model human behavior.
In [Wunder et al., 2009, 2011] an I-POMDP [Gmytrasiewicz and Doshi, 2005] model was used as
a building block, in combination with the cognitive hierarchy model [Camerer et al., 2004a], to form a
Parametrized I-POMDP. This model presents different reasoning levels; one key aspect is that there
is a distribution of strategies within each level. To construct the strategies, each level contains
a population. Level zero corresponds to simple behaviors that do not exhibit any strategic
reasoning. The upper levels (with strategic reasoning) are constructed by obtaining a policy that
maximizes the score against either a distribution over lower levels or a selection of agents from those
levels, by solving the POMDP formed by them. One limitation of the approach is its complexity, so
it might not scale to larger problems; another main difference with our work is that they learn against
populations of agents, without explicitly modeling the opponents.
Another hybrid model is the multiagent influence diagram (MAID) [Koller and Milch, 2001]. This
model provides a graphical representation of a game, and the objective is to exploit its graphical form
to compute Nash equilibria. Moreover, the network of influence diagrams (NID) [Gal and Pfeffer,
2008] extends the MAID model to include uncertainty over the agent models. One limitation of MAIDs
and NIDs is that they are designed for one-shot decision games. Also, they use the Nash equilibrium as a
solution concept, which is not always the best solution for different types of agents or scenarios.
3.7 Summary of the chapter
In this chapter we reviewed recent work related to this thesis. In Table 3.1 we present
the most important related works and compare them by their single- or multiagent nature, the type
of learning they use, the theoretical guarantees provided and their complexity. A summary of the
limitations found in the state of the art is the following:
• Approaches that can be used only for single decisions (one-shot) [Camerer et al., 2004a; Costa
Gomes et al., 2001; Koller and Milch, 2001].
• Approaches that are computationally intractable for large scale problems [Choi et al., 1999;
Gmytrasiewicz and Doshi, 2005; Seuken and Zilberstein, 2008; Tesauro, 2003].
• Approaches that assume stationarity of the opponent [Brown, 1951; Gmytrasiewicz and Durfee,
2000; Koller and Milch, 2001].
• Approaches that remove the stationarity assumption but need an offline training phase [Choi et al., 1999].
³ Contains all decision problems that can be solved by a deterministic Turing machine using a polynomial amount of computation time.
⁴ A decision problem is PSPACE-complete if it can be solved using a polynomial amount of space and if every other problem that can be solved in polynomial space can be transformed to it in polynomial time.
Table 3.1: A comparison of different algorithms in terms of the type of opponents they handle, type of learning, complexity, and whether they provide guarantees for switch detection. The * indicates algorithms that will be used as comparison in the experimental chapter of this thesis. The ? symbol indicates there is no information about it. Bold typeface indicates an algorithm proposed in this thesis.
| Algorithm | Multiagent approach | Type of learning | Theoretical guarantee for switch detection | Complexity | Exploration for switch detection |
|---|---|---|---|---|---|
| Hyper-Q, Minimax-Q, COLF, Manipulator, M-Qubed, AWESOME, WPL, EWA, LoE-AIM | Yes | Online | No | ? | No |
| MAID, NID, RMM | Yes | No | No | ? | No |
| RL-CD | No | Online | No | ? | No |
| BPR | No | Offline | No | P³ | No |
| Fictitious play | Yes | Online | No | P | No |
| I-POMDPs | Yes | Offline & Online | No | ? | No |
| FAL* | Yes | Online | No | ? | No |
| WOLF-PHC* | Yes | Online | No | ? | Partial |
| HM-MDPs* | No | Offline | No | PSPACE-complete⁴ | Partial |
| R-max* | Yes | Online | No | P | No |
| **MDP-CL** | Yes | Online | No | P | No |
| **MDP4.5** | Yes | Online | No | P | No |
| **MDP-CL(DE)** | Yes | Online | No | P | Yes |
| **R-max#** | Yes | Online | Yes | P | Yes |
| **R-max#CL** | Yes | Online | No | P | Yes |
| **DriftER** | Yes | Online | Yes | P | Yes |
• Approaches that work in non-stationary environments do not use exploration mechanisms for
detecting switches [Elidrisi et al., 2014].
In the next chapter, we present our contributions in the area of non-stationary opponents. We
start by presenting a framework (MDP-CL and MDP4.5) for learning and planning against switching
opponents in repeated games. Then, we address different initial limitations, such as adding an exploration mechanism for switch detection, using prior information, and not forgetting previously learned models. We present another algorithm (R-max#) which provides efficient exploration and gives optimality guarantees against switching opponents. We conclude with another algorithm (DriftER) for switch detection based on concept drift, which also provides theoretical guarantees for switch detection.
Chapter 4
Acting against Non-Stationary Opponents
In this chapter, we present our main proposals for dealing with non-stationary opponents (a graphical
roadmap of this chapter is depicted in Figure 4.1). We start by describing a general framework for
learning and planning against non-stationary opponents in repeated games. Two implementations of the framework are presented, MDP4.5 and MDP-CL; their difference lies in the model they use for learning the opponent strategy: the former uses decision trees and the latter uses MDPs.
Then we present two extensions for MDP-CL. In some domains it may be possible to know the set
of strategies used by the opponent before starting the interaction. We adapt MDP-CL for those cases
and name it a priori MDP-CL. In MDP-CL, once a switch has been detected the model is discarded and a new model is learned from scratch. However, the previous model can still be useful. Thus, we adapt MDP-CL to keep a record of previous models; when a switch is detected it checks whether the new model is similar to those previously seen. This is incremental MDP-CL.
The framework and its implementations were able to detect opponent switches; however, they did not apply explicit exploration to detect them. To address this problem, we propose drift exploration for non-stationary opponents. First we propose to add drift exploration to MDP-CL. Then we propose R-max#, an algorithm for efficiently exploring the state space which is able to work against non-stationary opponents. This approach is based on the original R-max algorithm (Section 2.2.3), which is theoretically grounded to provide efficient exploration. Using a switch detection method jointly with R-max# exploration gives R-max#CL.
Our last proposal is DriftER, an algorithm for detecting switches inspired by concept drift tech-
niques. The main idea is that once the agent has learned a model of the opponent, it can track how well that model is performing with a measure of predictive error. That information is used as an indicator of a switch when the error starts increasing consistently.
Figure 4.1: The proposed algorithms (Chapter 4) and how they relate to each other: the framework against non-stationary opponents (Section 4.1) with its two implementations MDP4.5 and MDP-CL; the MDP-CL extensions a priori MDP-CL (Section 4.2.3) and incremental MDP-CL (Section 4.2.4); drift exploration (Section 4.3) with MDP-CL(DE) (Section 4.3.1) and R-max# (Section 4.3.2); R-max#CL (Section 4.4.1); and DriftER (Section 4.5).
4.1 MDP-CL and MDP4.5
Before presenting our framework, we first describe how to model opponents in repeated games.
4.1.1 Modeling opponents
Deciding how an agent should act in repeated adversarial games when the adversaries are unknown is a difficult task. Specifically, in these games the capability of an agent to compute a good sequence of
actions relies on its capacity to forecast the behavior of its opponents [Cote et al., 2010].
The repeated nature of the interaction allows the agent to learn the unknown behavior of the
opponents by interacting with them. For example, the interaction of a learning agent with a stationary
opponent can be modeled as an MDP. This occurs since the interaction between agents generates
Markovian observations which can be used to learn the opponent’s strategy. In this case, the learning
agent perceives a stationary environment whose dynamics can be learned. However, the learning task
may require a large number of repeated interactions to be effective. Therefore, an abstraction of the complete interaction history is needed to reduce the number of samples. For this reason we need a set of attributes A that can describe the opponent strategy; these will be used to construct the states.
Similarly we can build a model of the opponent strategy with another representation, for example
using a decision tree. In this case the decision nodes are the attributes that describe the opponent
strategy. The leaves’ values are the possible opponent actions. Thus a decision tree will provide a
model of how the opponent is behaving.
The set of attributes depends on the domain. It is common to use the history of past interactions (Section 2.3.2) to model the opponent strategy [Banerjee and Peng, 2005]. Note that this representation is capable of learning a variety of strategies [Chakraborty and Stone, 2008]. However, in order to avoid using the complete interaction history it is common to use only the last step [Crandall and Goodrich, 2011] as attributes. This representation allows the agent to learn strategies such as TFT, Pavlov and Bully in the iterated prisoner's dilemma (Section 2.3.2). However, in more elaborate domains (for example, the PowerTAC domain; Appendix A) agents may use other attributes of the environment (which may have no relation to the interaction history) to select their actions.
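As an illustration of this last-step representation, the following is a minimal sketch (illustrative names only, assuming the iterated prisoner's dilemma with actions C and D) of how a state can be encoded from the previous joint action and how TFT and Pavlov can be expressed over such states:

```python
# Minimal sketch: states built from the last joint action in the iPD.
ACTIONS = ("C", "D")  # cooperate, defect

def make_state(my_last, opp_last):
    """A state is simply the last joint action (one attribute per agent)."""
    return (my_last, opp_last)

def tft(state):
    """Tit-for-Tat: the opponent repeats the learning agent's last action."""
    my_last, _ = state
    return my_last

def pavlov(state):
    """Pavlov: cooperate iff both agents chose the same action last round."""
    my_last, opp_last = state
    return "C" if my_last == opp_last else "D"

print(tft(make_state("D", "C")))     # -> 'D'
print(pavlov(make_state("D", "C")))  # -> 'D'
```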
4.1.2 Introduction
Now, we start by presenting a framework for learning and planning against non-stationary opponents
in repeated games. The approach is based on the comparison of learned models in order to detect
strategy switches. It consists of three main parts:
• Learning phase. A model of the opponent is learned.
• Planning phase. Uses the learned model along with information from the environment to compute
an optimal plan against the modeled opponent.
• Change detection process, which embeds the learning and planning phases to identify switches in the opponent strategy. Here, different models are learned and comparisons among them reveal when the opponent model has changed.
A model is learned with information obtained from a window of interactions.
Definition 4.1 (Interaction window). A window of interaction of size k represents a sequence of interactions among the agents in the environment starting at round t_i and ending at round t_{i+k}.
This model reflects the opponent's behavior, and a policy to act against it can then be computed. Different models of the opponent are learned using different windows of interaction (of different sizes). If the opponent has not changed strategy, information from those windows of interaction will yield the same learned model (given enough samples). On the other hand, if the opponent has changed its strategy, it will yield a different model. When the opponent changes strategy, the model and the respective policy are reset and the process restarts from scratch.
4.1.3 Assumptions
The proposed framework makes the following assumptions:
• The opponent will not change strategy during a number of interactions (learning phase).¹
¹ This number of interactions can be passed as a parameter to the learning algorithm.
Figure 4.2: The three main parts of the framework: (1) learning, using an exploration (random) strategy; (2) planning, using the computed policy; and (3) the switch detection process.
• Our agent knows a set of attributes with which the dynamics of the opponent can be learned.
The first assumption is important for the agent to learn an accurate model of the opponent strategy.
We performed experiments where the second assumption does not hold and the approach still obtains
good results (Sections 5.5.4 and 5.5.6).
4.1.4 Setting
The problem's setting is the following: our learning agent A and one opponent O² repeatedly play a repeated normal-form game Γ (see Section 2.3). Both agents take one action (simultaneously) in a sequence of rounds. A and B are the sets of possible actions for our learning agent and the opponent, respectively. They both obtain a reward r that depends on the actions of both agents (defined by Γ). The objective of our learning agent is to maximize its cumulative rewards over the entire interaction. Agent O has a set of possible stationary strategies (see Section 2.3 for a definition) to choose from and can switch from one to another in any round of the interaction, excluding periods called learning phases. A strategy defines a probability distribution for taking an action given a history of interactions.
4.1.5 Overview of the framework
In Figure 4.2 a graphical depiction of the three parts of the framework is presented. Learning is the
initial phase and uses a random strategy in order to learn the opponent’s model. The next phase is
planning, where an optimal policy π* is computed to play against the opponent. The computed policy is used and the switch detection process starts; if the opponent switches strategies then the learning phase is restarted; otherwise, the agent continues using the same policy.
In Algorithm 4.1 a pseudocode of the proposed framework is presented. It uses as parameters:
the size of the window of interaction w needed to learn a model, the comparison threshold for deciding whether two models are different, and the number of rounds T in the repeated game.
² In most experiments in this thesis we assume only one opponent; however, in Section 5.2.6 we present experiments using MDP-CL with more opponents.
Algorithm 4.1: Proposed framework algorithm
Input: Size of the window of interactions w, comparison threshold, number of rounds in the repeated game T.
Function: compareModels(), compare two opponent models
Function: planWithModel(), obtain a policy to act using the opponent model
Function: playWithPolicy(), play using the computed policy
1  j = 2                                       // initialize counter
2  model = π* = ∅                              // initialize opponent model and policy
3  for t = 1 to T do
4      if t == i · w, (i ≥ 1) and model == ∅ then        // learn exploration model
5          Learn an exploration model with the past w interactions
6          π* = planWithModel(model)                     // compute a policy from the learned model
7      end
8      if t == j · w, where (j ≥ 2) then                 // learn comparison model
9          Learn another model' with past interactions
10         d = compareModels(model, model')              // compare models
11         if d > threshold then                         // opponent strategy has changed?
12             π* = model = ∅, j = 2                     // reset models and restart exploration
13         else
14             j = j + 1
15         end
16     end
17     if π* == ∅ then
18         Play with random actions                      // no model, explore randomly
19     else
20         playWithPolicy(π*)                            // use learned policy
21     end
22 end
Table 4.1: A description of the main parts of the approach using two different representations: MDP4.5 and MDP-CL.

| Representation | Learning | Planning | Switching detection |
|---|---|---|---|
| MDP4.5 | Decision trees | DT to MDPs | Compare trees by means of structural and predictive similarity. |
| MDP-CL | MDP | MDP | Compare MDPs by using the total variation distance of the transition functions. |
The proposed framework is not tied to a specific representation of the opponents' models. To exemplify our approach, two different representations for learning an opponent model were considered: decision trees and MDPs. These two representations are quite different; a motivation to use MDPs is that they are a common representation in sequential decision problems, uncertainty and RL. In fact, MDPs have been proposed to model opponents which use only the past history [Banerjee and Peng, 2005] (see Section 3.5.1). On the other hand, decision trees are a technique used in machine learning, mostly
in classification problems. However, there are several algorithms for learning decision trees from a
batch of information. The main parts of these two representations, called MDP4.5 and MDP-CL, are
presented in Table 4.1 and described in the following sections.
Now we discuss each of the three main parts of the framework in more detail: learning, planning
and switch detection.
4.1.6 Learning: opponent strategy assessment
Consider the case where the opponent uses a stationary strategy for a period of interactions. Our
learning agent starts with no prior model of the opponent and starts an exploration phase playing a
random strategy. After a certain number of interactions, w, the learning agent uses the information
from the past interactions to generate a model of the opponent which we call exploration learned
model.
Definition 4.2 (Exploration learned model). The model of the opponent strategy obtained when there
is no previous model or when a switch is detected using the information from the last w rounds.
Since the proposed framework is general, different techniques to learn a model of the opponent can be used; to exemplify our approach we present two implementations: MDP4.5, which uses decision trees to model the opponent, and MDP-CL, which uses MDPs.
MDP4.5
Given a window of interaction w between the agents, the C4.5 algorithm [Quinlan, 1993] returns a decision tree D which corresponds to the opponent strategy. The set of attributes, A, is assumed to be given by an expert; the class values are the opponent's actions B. Each path p_l of D to a leaf l is a unique decision rule. Thus, there is a one-to-one correspondence between a path p_l and the leaf l. Additionally, each
leaf has two associated values, classified(l) and misclassified(l) which represent the accuracy of each
leaf. In Figure 4.3 (a) a decision tree with just one decision node and two leaves is depicted. The
attributes in the decision tree are previous plays from both agents. The leaves of the tree represent
the opponent’s next action and edges represent the decision rules (later used to plan a strategy).
Figure 4.3: (a) A decision tree that corresponds to a model of an agent. It contains one decision node (LearnAgent last action) and two leaves that correspond to the opponent's actions C_opp and D_opp; each leaf has a number of correctly classified/misclassified instances. (b) A learned MDP using the game matrix of the prisoner's dilemma. It is composed of four states (ovals). Each state is formed by the last action (C or D) of the learning agent (learn) and the opponent (opp). The arrows represent the triplet: action, transition probability and immediate reward.
MDP-CL
The second approach is to learn an MDP model of the opponent. This MDP describes the dynamics
of the opponent. The set of attributes A used to construct the states is assumed to be given by an
expert. For example, the attribute A_i could take as values the last play of agent i.
Formally, the MDP is composed of:
• The set of states S := ×_i A_i, i.e., each state is formed by the Cartesian product of the attributes.
• The set of actions A, which are the learning agent's actions in Γ.
• The transition function T : S × A → S, learned using counts:

      T(s, a, s') = n(s, a, s') / m(s, a)        (4.1)

  with n(s, a, s') the number of times the agent was in state s, used action a and arrived at state s', and m(s, a) the number of times the agent was in state s and used action a.
• The reward function R, learned in a similar way:

      R(s, a) = Σ r(s, a) / m(s, a)        (4.2)

  with r(s, a) the reward obtained by the agent when being in state s and performing action a.
In Figure 4.3 (b) a learned MDP with four states is depicted. This MDP represents the opponent strategy TFT in the iPD. Each state is formed by the last plays (C or D) of the learning agent (learn) and the opponent (opp); the arrows indicate the action, transition probability and reward.
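As a minimal illustration of Equations 4.1 and 4.2, assuming a dictionary-based representation of the model (the names below are illustrative, not part of MDP-CL itself), the transition and reward functions can be estimated from a window of experience tuples as follows:

```python
from collections import defaultdict

def learn_mdp(window):
    """Estimate T and R by counting over (s, a, r, s') tuples (Eqs. 4.1-4.2)."""
    n = defaultdict(int)        # n[(s, a, s')]: times (s, a) led to s'
    m = defaultdict(int)        # m[(s, a)]: times action a was taken in s
    r_sum = defaultdict(float)  # accumulated reward for (s, a)

    for s, a, r, s_next in window:
        n[(s, a, s_next)] += 1
        m[(s, a)] += 1
        r_sum[(s, a)] += r

    T = {(s, a, s_next): cnt / m[(s, a)] for (s, a, s_next), cnt in n.items()}
    R = {(s, a): total / m[(s, a)] for (s, a), total in r_sum.items()}
    return T, R

# Example: a few rounds of the iPD (payoffs t=4, r=3, p=1, s=0) against a
# TFT-like opponent; states are (my last action, opponent's last action).
window = [(("C", "C"), "D", 4, ("D", "C")),
          (("D", "C"), "D", 1, ("D", "D")),
          (("D", "D"), "C", 0, ("C", "D"))]
T, R = learn_mdp(window)
```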
4.1.7 Planning
Once a model of the opponent is learned, a plan that provides the optimal way to play against such
opponent is needed. Using the opponent’s model to compute the best action can be seen as a sequential
decision problem, assuming the opponent will remain fixed. Thus, the obvious approach is to plan
with MDPs, since we assume the state is fully observable. However, if the state were not fully observable, POMDPs could be used. Solving an MDP does not involve high complexity: computing the optimal policy is P-complete [Papadimitriou and Tsitsiklis, 1987], and a number of methods, from linear programming to dynamic programming, can be used.
Decision trees to MDPs
When learning decision trees, a transformation is needed (since the DT does not prescribe how to act) to obtain an MDP; solving it will provide an optimal policy to act. The MDP induced by a decision tree D and a bimatrix game Γ is composed of:
• A, the set of available actions for the learning agent.
• R, the rewards, obtained from the game matrix Γ.
• The set of states S := P × B, i.e., each state is formed by a path and an action of the opponent.
• The transition function T : S × A → S, generated as follows, for each s ∈ S and a ∈ A:

      T(s, a, s') = classified(l)                   if s' ∈ P,
      T(s, a, s') = misclassified(l) / (|B| − 1)    otherwise.        (4.3)
As an example, the tree of Figure 4.3 (a) represents the strategy of a learning agent (Learn) against an opponent with two actions B = {b1, b2}. Converting this tree and using the values t = 4, r = 3, p = 1, s = 0 from the matrix of the prisoner's dilemma will yield the MDP depicted in Figure 4.4 (a). This decision tree can be augmented to include new leaves (corresponding to misclassified instances) as depicted in Figure 4.4 (b). These new leaves share the path from the original leaf but replace the class with the other possible values. In the example, the (original) left leaf has the value b1 and this leaf classified correctly 90% of the time. The other 10% corresponds to a different value, in this case b2; therefore this new leaf is added to the tree and is depicted with a dotted arrow. Notice that the original leaves correspond to the set P and the new leaves correspond to the set P'. Converting this augmented tree will yield the MDP depicted in Figure 4.4 (c). Solving this MDP provides the optimal policy against the opponent.
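A minimal sketch of the per-leaf transition probabilities behind Equation 4.3; note that the counts classified(l) and misclassified(l) are normalized here to probabilities (an assumption), which matches the 0.9/0.1 values in Figure 4.4 (c). Names are illustrative:

```python
def leaf_transition_probs(classified, misclassified, opp_actions, predicted):
    """Probability of each opponent action at a leaf, following Eq. 4.3:
    the predicted action receives the leaf's accuracy mass, the remaining
    actions share the misclassification mass uniformly."""
    total = classified + misclassified
    probs = {}
    for b in opp_actions:
        if b == predicted:
            probs[b] = classified / total
        else:
            probs[b] = (misclassified / total) / (len(opp_actions) - 1)
    return probs

# The left leaf of Figure 4.4 (b): 90/10 split, predicted action b1.
print(leaf_transition_probs(90, 10, ["b1", "b2"], "b1"))  # {'b1': 0.9, 'b2': 0.1}
```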
Figure 4.4: (a) The MDP obtained from the decision tree in Figure 4.3 (a), composed of two states (ovals); the arrows represent transition probabilities using actions a_x, with rewards in parentheses. (b) The augmented decision tree of Figure 4.3 (a), which contains in dotted lines two added leaves representing classification errors. (c) The MDP obtained from (b) using the prisoner's dilemma game matrix. It is composed of four states. The dotted arrows and ovals correspond to the added actions.
As shown in the previous example, when the opponent shows stochastic behavior, converting a DT into an MDP will increase the number of states. This is a limitation of DTs, since they are generally not well suited to handle this type of behavior. A better fit is to use MDPs directly.
In cases where the learned model is already an MDP this transformation is omitted. Whether the MDP was learned directly or transformed from a decision tree, any off-the-shelf dynamic programming algorithm such as value iteration [Bellman, 1957] can be used to solve it. If the model correctly represents the opponent's strategy, the solution will produce an optimal policy against that opponent that maximizes the accumulated rewards.
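For completeness, a minimal, generic value iteration sketch over the dictionary representation used in the sketch above (a textbook routine assuming a discounted criterion for simplicity, not a specific implementation from this thesis):

```python
def value_iteration(states, actions, T, R, gamma=0.95, eps=1e-6):
    """Generic value iteration for an MDP given as dictionaries.

    T[(s, a, s_next)]: transition probability (missing keys mean 0).
    R[(s, a)]: expected immediate reward (missing keys mean 0).
    Returns a greedy policy mapping each state to an action.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            q = [R.get((s, a), 0.0) +
                 gamma * sum(T.get((s, a, s2), 0.0) * V[s2] for s2 in states)
                 for a in actions]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < eps:
            break
    policy = {}
    for s in states:
        policy[s] = max(actions,
                        key=lambda a: R.get((s, a), 0.0) +
                        gamma * sum(T.get((s, a, s2), 0.0) * V[s2] for s2 in states))
    return policy
```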
4.1.8 Detecting opponent switches
The first two phases of the framework have been described: (i) assessing the strategy used by the opponent and (ii) generating a model of the opponent. Note that in the second phase, the learning agent switches to a strategy that optimizes against the newly exploration learned model. Notwithstanding, the switch could trigger a response from the opponent, and the agent needs to be able to detect such changes. With this in mind, another model is learned concurrently and compared with the exploration learned model periodically in order to detect possible changes in the opponent strategy.
Definition 4.3 (Short history model). A learned model of the opponent strategy used to perform comparisons with the exploration learned model, using the interactions of the past w rounds.
45
CHAPTER 4. ACTING AGAINST NON-STATIONARY OPPONENTS
Short history models are used when comparing decision trees; they use the last window of interaction of size w to learn a model.
Definition 4.4 (Long history model). A learned model of the opponent strategy used to perform
comparisons with the exploration learned model using the interactions from the last switch detected to
the current round.
Long history models are used when comparing MDPs. The exploration model and the history model (short or long) are compared every jw steps, where w is a parameter of the algorithm (the size of the window of interaction) and j = 2, 3, . . . , to evaluate their similarity. If the distance between the models is greater than a given threshold (whose value may be different for each representation), the opponent has changed strategy and the modeling agent must restart the learning phase, resetting both models and starting from scratch with a random (exploration) strategy. Otherwise, the opponent has not switched strategies, so the same strategy is continued and j is incremented.
Decision trees
When using decision trees, a sensible measure of how similar the exploration learned model and the short history one are is needed. For this purpose, the dissimilarity measure presented in [Miglio and Soffritti, 2004] is used. This measure has the advantage that it can combine the structural (the attributes of the nodes) and predictive (the predicted classes) similarities in a single value.
Let D_i and D_j be two different trees learned as presented in Section 4.1.6. We define 1, . . . , H as the leaves of D_i and 1, . . . , K as the leaves of D_j; the value m_hk is the number of instances which belong both to the h-th leaf of D_i and to the k-th leaf of D_j.
The dissimilarity measure is defined as:

    d(D_i, D_j) = Σ_{h=1}^{H} α_h (1 − s_h) m_{h·}/n + Σ_{k=1}^{K} α_k (1 − s_k) m_{·k}/n        (4.4)

where the m values measure the predictive similarity and the α and s values measure the structural similarity. The coefficient s_h is a similarity coefficient whose value synthesizes the similarities s_hk between the h-th leaf of D_i and the K leaves of D_j. The value s_hk measures the similarity of two leaves taking into account their classes and the objects they classify. The coefficient α_x is a dissimilarity measure of the paths associated to two leaves. When these paths are not discrepant, the value is set equal to 0. If, on the contrary, the paths are discrepant, the value is > 0, depending on the path and the level at which the two paths differ from each other. The maximum value of d(D_i, D_j) is reached when the difference between the structures of D_i and D_j is maximum and the similarity between their predictive powers is zero.
Figure 4.5: Example of highly dissimilar decision trees (a) and (b) according to the measure in Equation 4.4 (since their paths and predictions differ); in contrast, (c) and (d) depict highly similar trees, since the attributes in the nodes are the same and the predictions are similar.
This measure can be normalized to be in the range [0, 1], where 0 represents that the trees are very similar³ and 1 that they are totally dissimilar.
To exemplify the distance presented in Equation 4.4, take the decision trees⁴ depicted in Figures 4.5 (a) and (b), which have a high dissimilarity value (d = 0.38). The reason is that their paths are discrepant (low structural similarity) and their predictive classifications differ. In contrast, Figures 4.5 (c) and (d) depict highly similar trees (d = 0.0); note that the attributes in the nodes are the same (even when the split values are different they are considered the same).
³ Nodes with numeric attributes where the same variables occur but with different splitting values are seen as totally similar.
⁴ Example adapted from [Miglio and Soffritti, 2004].
MDPs
When using MDPs as models, the comparison is performed between the long history model and the exploration learned model. In particular, for comparing MDPs we use the total variation distance between transition functions. The total variation distance (TVD) compares probability distributions
Figure 4.6: Example of how the framework works against a TFT-Pavlov opponent that changes between strategies at time h. Every w rounds a tree is learned. In the upper part the exploration learned trees are presented, which represent the TFT and Pavlov strategies. In the lower part, from left to right, the first tree represents the TFT opponent strategy; the second one is learned when the opponent changed strategies, and because it is different from the exploration learned model a switch has been detected (d > threshold). The third one represents the Pavlov strategy, learned after the opponent switch.
and for categorical distributions is defined as:

    TVD(μ, ν) = (1/2) Σ_x |μ(x) − ν(x)|        (4.5)

between the transition functions μ, ν of the two MDPs (it compares each element of the transition function μ with the corresponding one in ν).
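A minimal sketch of this comparison, assuming the dictionary representation of transition functions used in earlier sketches (illustrative names only):

```python
def tvd(T1, T2):
    """Total variation distance between two learned transition functions
    (Eq. 4.5): half the summed absolute difference over all entries.
    Missing entries are treated as probability 0."""
    keys = set(T1) | set(T2)  # keys are (s, a, s') triples
    return 0.5 * sum(abs(T1.get(k, 0.0) - T2.get(k, 0.0)) for k in keys)

# Switch detection in MDP-CL: compare the exploration learned model with
# the history model and flag a switch when the distance exceeds the threshold.
def switch_detected(T_exploration, T_history, threshold):
    return tvd(T_exploration, T_history) > threshold
```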
Running example
The framework is exemplified in Figure 4.6 using decision trees as models. Models are shown in two groups to better exemplify the framework: in the upper part the exploration learned models are shown and in the lower part the short history ones are depicted; comparisons between models of these two groups reveal switches in the opponent.
Suppose an infinitely repeated iPD game played against an opponent that starts playing the TFT
strategy (see Section 2.3.2) and after h steps changes its strategy to Pavlov. The first exploration
learned model (upper left) is the initial learned tree and represents the TFT strategy, learned after
interacting with the opponent for w steps. After this initial interaction, the agent can compute a policy
against the learned model (TFT) and it will start using that policy. At 2w the first short history model
is learned (lower left) and a comparison between trees is performed every w steps, after which, the
short tree is reset. In this case the comparison reveals they are the same model and the agent keeps
playing the same policy. At step h, with jw < h < (j + 1)w, the opponent switches from TFT to
Pavlov. The short history model learned with information from jw to (j + 1)w (during the switch),
is the second one depicted in the lower part of the figure. This tree is different from the exploration learned model that represents TFT. Since the distance between these trees, d, is greater than the specified threshold, the opponent has changed strategies. The current learned
models and policies are reset and the exploration phase restarts. The second exploration learned tree
(upper right) is the learned model after the switch and represents the Pavlov strategy.
4.1.9 Summary
We have presented our first contribution, a framework for quickly learning non-stationary strategies in repeated games. The framework uses windows of interactions to learn a model of the opponent. The learned model is used to compute an optimal policy against that opponent. Different models are learned throughout the repeated game and comparisons between models indicate a switch in the opponent. Two different implementations of the framework are evaluated (experiments are presented in Section 5.2): the first one, called MDP4.5, uses decision trees to model the opponent; the second, called MDP-CL, uses only MDPs. This framework has a limitation: it discards the learned model once a switch is detected, even though that model may be useful for future interactions. Therefore, in the following section we propose how to overcome this limitation. The idea is to keep the learned models in memory and reuse them if they reappear in the interaction. A second extension is designed to take advantage of cases where prior knowledge (the set of possible strategies the opponent will use) can be obtained. We address both of these extensions to MDP-CL in the next section.
4.2 MDP-CL with knowledge reuse
In this section, we present two algorithms that extend MDP-CL (i.e., the framework presented in the previous section, modeling the problem as an MDP). The first one (a priori MDP-CL) uses
prior information (set of strategies used by the opponent) to quickly detect the opponent model while
still checking for opponent switches. The second approach (incremental MDP-CL) learns new models
but, in contrast to MDP-CL, it will not discard them once it detects a switch. In this way it keeps a
record of models in case the opponent reuses a previous strategy.
4.2.1 Assumptions
In this section, we make the following assumptions:
• The opponent will not change strategy during a number of interactions (learning phase).⁵
⁵ This number of interactions can be passed as a parameter to the learning algorithm.
• Our agent uses a state space representation that can describe the opponent strategy.
With a priori MDP-CL we also assume:
• A priori MDP-CL knows the set of strategies used by the opponent before the interaction.
We performed experiments where this last assumption is removed (Section 5.3.5) and the algorithm
starts with a set of noisy models.
4.2.2 Setting
The problem's setting is the same as in the previous section. Our learning agent and one opponent O repeatedly play a bimatrix game Γ.
4.2.3 A priori MDP-CL
MDP-CL learns models of the opponents by exploring the entire state space, where the state space is as described in Section 4.1.6. However, in some settings an agent could have information about the set of strategies used by the opponents. A priori MDP-CL is designed to be used in those cases.
A priori MDP-CL assumes prior information in the form of a set M of MDPs that represent possible strategies used by the opponent. However, there is still the problem of quickly detecting which of these strategies is the one used by the opponent. In MDP-CL this was not a problem, since there were no prior models to compare against. Now, the problem we face is one of model selection.
At each round of the repeated game the learning agent experiences a tuple (s, a, r, s'). In a similar way to MDP-CL, the a priori algorithm learns as if there were no prior information, using an exploration phase. Recall that MDP-CL needed to finish that phase to learn a model; in contrast, a priori MDP-CL learns a new model at every round of the repeated game. This learned model is compared (at each round) with each M ∈ M using the TVD (Equation 4.5). Since we assume that the strategy used by the opponent belongs to the set of models, we can guarantee that, with enough experience tuples, the correct model will have a perfect similarity (TVD = 0.0) with at least one of the models in M. When this happens we stop the exploration phase and change to the planning phase, setting the opponent model and computing a policy. Since this is only an asymptotic guarantee, in some domains a perfect similarity will not happen in finite time. For this reason a priori MDP-CL has as parameter
a threshold, ρ, that defines how close a model should be in order to set that model as the current one. This parameter can be set to handle noisy opponents, like the ones presented in Section 5.3.5. The rest of the algorithm behaves as MDP-CL.
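A minimal sketch of this model-selection step, reusing the tvd() helper sketched above (illustrative names; in the actual algorithm the usual MDP-CL switch detection keeps running as well):

```python
def select_prior_model(T_learned, prior_models, rho):
    """Return the name of the first prior model within TVD rho of the model
    learned so far, or None if no prior model is close enough yet."""
    for name, T_prior in prior_models.items():
        if tvd(T_learned, T_prior) <= rho:
            return name
    return None
```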
A priori MDP-CL takes advantage of knowing the set of strategies, which results in fewer interactions to detect the model rather than learn it, as MDP-CL does (for experimental results please refer to Section 5.3). The limitation is the assumption of knowing the set of opponent strategies.
4.2.4 Incremental MDP-CL
A priori MDP-CL makes use of an initial set of models, but with incremental MDP-CL we relax the
assumption of having the complete set of models M from the beginning. We assume there is a finite
set of strategies used by the opponent and that these strategies will be used repeatedly during the
interaction. Incremental MDP-CL includes both learning new models, if the opponent uses a new and unknown strategy, and maintaining a history of learned strategies, in case the opponent switches back to a previous one.
Algorithm 4.2: Incremental MDP-CL
Input: Size of the window of interactions w, comparison threshold, threshold ρ
Function: TVD(), compare two opponent models using the total variation distance
Function: planWithModel(), obtain a policy to act using the opponent model
Function: playWithPolicy(), play using the computed policy
1  M = ∅, currentModel = ∅          // initialize set of learned models and current model
2  for each round of the repeated game do
3      if currentModel == ∅ then
4          Learn a model with past interactions
5          if less than i · w interactions (i ≥ 1) then
6              for each m ∈ M do
7                  if TVD(model, m) ≤ ρ then
8                      currentModel = model
9                      π* = planWithModel(currentModel)
10                 end
11             end
12         else
13             M = M ∪ {model}
14             currentModel = model
15             π* = planWithModel(currentModel)
16         end
17         Play with random actions
18     else
19         playWithPolicy(π*) and use switch detection as in MDP-CL
20     end
21 end
A high level view of incremental MDP-CL is described in Algorithm 4.2. It starts by initializing the
set of learned models, M = ∅, and setting the current model variable to null. Then, for every round of the repeated game it learns an opponent model, currentModel, and compares it with those in M. If the TVD is lower than a threshold ρ, it means the model has been previously used, and a policy to act is computed. Otherwise, w interactions are needed to learn a new model and add it to the set M. Switch detection is performed as in MDP-CL.
4.2.5 Summary
In this section, we presented two extensions for MDP-CL: (i) a priori MDP-CL, which assumes knowledge of the set of strategies used by the opponent (so the problem becomes detecting which one the opponent is using), and (ii) incremental MDP-CL, which keeps a record of previously learned models in case the opponent returns to one of them, in which case it is not necessary to learn it again from scratch.
Our proposed approaches are capable of detecting most of the changes in the opponent strategy. However, in some cases a shadowing behavior [Fulda and Ventura, 2006] appeared (experimental support is shown in Section 5.2.3), yielding suboptimal results. The next section explains this problem in detail and proposes a new type of exploration for detecting switches to overcome that limitation.
4.3 Drift exploration
Exploration in non-stationary environments has a special characteristic that is not present in station-
ary domains. If the opponent plays a stationary strategy, the learning agent perceives a stationary
environment (an MDP) whose dynamics can be learned. However, if the opponent plays a stationary strategy (strat_1) that induces an MDP (MDP_1) and then switches to strat_2, inducing MDP_2, and if strat_1 ≠ strat_2, then MDP_1 ≠ MDP_2 and the learned policy is probably no longer optimal.
In order to motivate drift exploration, take the example depicted in Figure 4.7, where the learning agent faces a switching opponent in the iterated prisoner's dilemma. Here, at time t_1 the opponent starts with a strategy that defects all the time, i.e., Bully (Section 2.3.2). The learning agent can recreate the underlying MDP that represents the opponent's strategy using counts (learned Bully model) by trying out all actions in all states (exploring). At some time (t_2 in the figure), it can solve for the optimal policy against this opponent (because it has learned a correct model), which is to defect in the iPD; this will produce a sequence of visits to the state (D_opp, D_learn). Now, at some time t_3 the opponent switches from its selfish Bully strategy to a fair TFT strategy. But because the transition T((D_opp, D_learn), D) = (D_opp, D_learn) in both MDPs, the switch in strategy (Bully → TFT) will not be perceived by the learning agent, thus resulting in not having the optimal strategy
Figure 4.7: An example of the learned models against a Bully-TFT switching opponent. The models represent two MDPs: the opponent starts with Bully (at time t_1); after some rounds (t_2) a model of the opponent is completely learned and the agent can optimize against it. The opponent switches to TFT (at time t_3) and the learning agent cannot detect the switch since it is performing action D (thick arrow) and not exploring the rest of the state space.
against the opponent. This effect is known as shadowing⁶ [Fulda and Ventura, 2006] and can only be avoided by continuously re-checking states that have not been visited recently. Drift exploration deals with such shadowing explicitly; in what follows we present drift exploration for switch detection, and then we propose a new algorithm called R-max# (since it is sharp to changes) for learning and planning against non-stationary opponents.
4.3.1 General drift exploration
The problem with non-stationary environments is that opponent strategies may share similarities in their induced MDPs (specifically between transition functions). If the agent's optimal policy produces an ergodic set of states (e.g., the resulting ergodic set for defecting against Bully is the sole state (D_opp, D_learn)) and this part of the MDP is shared between opponent strategies, the agent will not perceive the strategy change, which results in a suboptimal policy and performance. The solution to this is to explore even when an optimal policy has been learned. Exploration schemes like ε-greedy or softmax (e.g., a Boltzmann distribution) can be used for this purpose, and they will work as drift exploration, with the added cost of not exploring the state space efficiently. Against this background, we propose another approach for drift exploration that efficiently explores the state space, and demonstrate it with a new algorithm called R-max#.
⁶ Other authors have noted a related behavior, called observationally equivalent models [Doshi and Gmytrasiewicz, 2006].
4.3.2 Efficient drift exploration
Efficient exploration strategies should take into account which parts of the environment remain uncertain; R-max is an example of such a strategy (see Section 2.2.3). In this section we present R-max#, an algorithm inspired by R-max but designed for strategic interactions against non-stationary switching opponents. To handle such opponents, R-max# reasons and acts in terms of two objectives: 1) to maximize utilities in the short term while learning, and 2) to eventually detect opponent behavioral changes.
R-MAX#
The basic idea of R-max# is to forget state-action pairs that have not been visited for a long time. These pairs are those that 1) are considered known and 2) have not been updated in τ rounds; at that point the algorithm resets their reward value to r_max in order to promote exploration of those pairs, which implicitly rechecks whether the opponent model has changed.
R-max# receives as parameters (m, r_max, τ), where m and r_max are used in the same way as in R-max, and τ is a threshold that defines when to reset a state-action pair. R-max# starts by initializing the counters n(s, a) = n(s, a, s') = r(s, a) = 0, the rewards to r_max, the transitions to a fictitious state s_0 (like R-max), and the set of pairs considered known K = ∅. Then, at every round the algorithm checks, for each state-action pair (s, a) that is considered known (∈ K), how many rounds have passed since its last update. If this number is greater than the threshold τ then the reward for that pair is set to r_max; the counters n(s, a), n(s, a, s') and the transition function T(s, a, s') are reset, and a new policy is computed. Then, the algorithm behaves as R-max. The pseudocode of R-max# is presented in Algorithm 4.3.
4.3.3 Running example
Now we will use the example in Figure 4.8 to show how R-max# interacts against an unknown switching opponent. The opponent starts with a Bully strategy (t_1). After learning the model, R-max# knows that the best response against such a strategy is to defect (t_2), and the interaction will be a cycle of defections. At time t_3 the opponent changes from Bully to TFT, and because some state-action pairs have not been updated for several rounds (more than the threshold τ), R-max# resets the rewards and transitions for reaching such states, at which point a new policy is recomputed. This policy encourages re-visiting states that have not been visited recently. Now, R-max# will update its model as shown in the transition model in Figure 4.8 (note the thick transitions which are different from the Bully model). After a certain number of rounds (t_4), the complete TFT model will be learned and an optimal policy against it is computed. Note that R-max# will re-learn a model even when no change has occurred in
Algorithm 4.3: R-max# algorithm
Input: States S, actions A, threshold value m, r_max value, threshold τ
Function: SolveMDP(), receives a tuple which corresponds to an MDP and obtains a policy
1  ∀(s, a, s')  r(s, a) = n(s, a) = n(s, a, s') = 0
2  ∀(s, a)  T(s, a, s_0) = 1                   // transitions initialized to a fictitious state s_0 (as in R-max)
3  ∀(s, a)  R(s, a) = r_max
4  K = ∅
5  ∀(s, a)  lastUpdate(s, a) = ∅
6  π = SolveMDP(S, A, T, R)                    // initial policy
7  for t = 1, . . . , T do
8      Observe state s, execute action a from policy π(s)
9      for each (s, a) do
10         if (s, a) ∈ K and t − lastUpdate(s, a) > τ then
11             R(s, a) = r_max                 // reset reward to r_max
12             n(s, a) = 0
13             ∀s'  n(s, a, s') = 0            // reset counters
14             Reset T(s, a, ·) to the fictitious state s_0, as in line 2   // reset transitions
15             π = SolveMDP(S, A, T, R)        // solve the MDP and get a new policy
16         end
17     end
18     if n(s, a) < m then
19         Increment counters n(s, a) and n(s, a, s')
20         Update reward r(s, a)
21         if n(s, a) == m then                // pair is considered known
22             K = K ∪ {(s, a)}
23             lastUpdate(s, a) = t
24             R(s, a) = r(s, a)/m
25             for s'' ∈ S do
26                 T(s, a, s'') = n(s, a, s'')/m
27             end
28             π = SolveMDP(S, A, T, R)
29         end
30     end
31 end
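A minimal sketch of the reset step that distinguishes R-max# from R-max in Algorithm 4.3, assuming the dictionary-based model of earlier sketches (illustrative names only):

```python
def reset_stale_pairs(t, known, last_update, R, T, n, m_counts, tau, r_max):
    """Forget known state-action pairs not updated in the last tau rounds,
    restoring their optimistic reward so the agent re-explores them."""
    stale = [sa for sa in known if t - last_update[sa] > tau]
    for (s, a) in stale:
        R[(s, a)] = r_max                      # optimistic reward again
        m_counts[(s, a)] = 0                   # reset visit count
        for key in [k for k in n if k[:2] == (s, a)]:
            del n[key]                         # reset transition counts
        for key in [k for k in T if k[:2] == (s, a)]:
            del T[key]                         # reset learned transitions
        known.discard((s, a))
    return len(stale) > 0                      # caller replans if True
```

If the function returns True, the agent re-solves the MDP (for instance with the value iteration sketch above) before acting, mirroring line 15 of Algorithm 4.3.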
Figure 4.8: An example of the learned models of R-max# against a Bully-TFT switching opponent. The models represent three learned MDPs: at the extremes, the opponent starts with Bully (at time t_1) and switches to TFT (t_4); in the middle, a transition model learned after the switch (t_3) from Bully to TFT.
the opponent strategy.
4.3.4 Practical considerations of R-max#
In contrast to stationary opponents, when acting against non-stationary opponents we need to perform drift exploration constantly. However, knowing when to explore is a more difficult question, especially if we know nothing about the possible switching behavior (switching periodicity). Even in this case we can still provide guidelines for setting τ:
1) τ should be large enough to learn a sufficiently good opponent model (possibly a partial model when the state space is large). In this situation the algorithm learns a partial model and optimizes against that model of the opponent.
2) τ should be small enough to enable exploration of the state space. An extremely large value for τ will decrease the exploration for longer periods of time, and the agent will take longer to detect opponent switches.
3) If expected switching times can be inferred or learned, then τ can be set to a value related to those timings. This is explained by the fact that after the opponent switches strategies, the optimal action is to re-explore at that time.
For experimental results that support these guidelines please refer to Section 5.4.4. Next we provide theoretical guarantees for R-max# that prove it is capable of detecting opponent switches and that its rewards are optimal with high probability, given certain assumptions.
4.4 Sample complexity of exploration for R-max#
In this section, we study the sample complexity of exploration for the R-max# algorithm. Before presenting our analysis, we first state our assumptions.
1. Complete (self) information: the agent knows its states, actions and the rewards received.
2. Approximation condition (from Kakade, 2003): the policy derived by R-max is near-optimal in the MDP (Definition 4.6, see below).
3. The opponent will use a stationary strategy for some number of steps.
4. All state-action pairs will be marked known within some number of steps.
Given the aforementioned assumptions, we show that R-max# will eventually relearn a new model for the MDP after the opponent switches and will compute a near-optimal policy.
We first need some definitions and notations, which are from Kakade [2003], to formalize the proofs. Firstly, an L-round MDP M = ⟨S, A, T, R⟩ is an MDP with a set of decision rounds {0, 1, 2, . . . , L − 1}, where L is either finite or infinite. In each round both agents choose actions concurrently. A deterministic
T-step policy π is a sequence of T decision rules of the form {π(s_0), π(s_1), . . . , π(s_{T−1})}, where s_i ∈ S. To measure the performance of a T-step policy in the L-round MDP, the t-value is used.
Definition 4.5 (t-value, Kakade, 2003). Let M be an L-round MDP and π be a T-step policy for M. For a time t < T, the t-value U_{π,t,M}(s) of π at state s is defined as

    U_{π,t,M}(s) = (1/T) E_{(s_t, a_t, ..., s_{T−1}, a_{T−1}) ∼ Pr(·|π, M, s_t = s)} [ Σ_{i=t}^{T−1} R(s_i, a_i) ],        (4.6)

where the T-path (s_t, a_t, . . . , s_{T−1}, a_{T−1}) runs from time t up until time T, starting at s and following the sequence {π(s_t), π(s_{t+1}), . . . , π(s_{T−1})}; E denotes expectation and Pr probability.
The optimal t-value at state s is

    U*_{t,M}(s) = sup_{π∈Π} U_{π,t,M}(s),        (4.7)

where Π is the set of all T-step policies for the MDP M. Finally, we define the Approximation Condition (assumption 2).
Definition 4.6 (Approximation Condition, Kakade, 2003). Let K be a set of known states and let M̂_K be an estimate of the true MDP M_K with set of known states K. Then, the policy π that R-max derives from M̂_K is such that for all states s and times t ≤ T,

    U_{π,t,M_K}(s) ≥ U*_{t,M_K}(s) − ε.        (4.8)

This assumption states that the policy π derived by R-max from M̂_K is near-optimal in the true MDP M_K. For R-max#, we have the following main theorem:
Theorem 4.4.1. Let τ = (2m|S||A|T/ε) log(|S||A|/δ) and let M' be the new L-round MDP after the opponent switches its strategy. The R-max# algorithm guarantees an expected return of U*_{M'}(c_t) − 2ε within O((m|S|²|A|T³/ε³) log²(|S||A|/δ)) timesteps with probability greater than 1 − δ, given timesteps t ≤ L.
The proof of Theorem 4.4.1 will be provided after we introduce some lemmas. Here we just give a sketch of the proof. R-max# is more general than R-max since it is capable of resetting the reward estimations of state-action pairs. However, the basic result of R-max# is derived from R-max. The proof relies on applying the R-max sample complexity theorem to R-max# as a basic solver. With proper assumptions, R-max# can be viewed as R-max with periods, that is, the running timesteps of R-max# are separated into periods. In each period, R-max# behaves as the classic R-max, so that R-max# can learn the new state-action pairs by the R-max algorithm after the opponent switches its policy.
Theorem 4.4.2 (Kakade, 2003). Let M = ⟨S, A, T, R⟩ be an L-round MDP. If c is an L-path sampled from Pr(·|R-max, M, s_0) and assumptions 1 and 2 hold, then, with probability greater than 1 − δ, the R-max algorithm guarantees an expected return of U*(c_t) − 2ε within O((m|S||A|T/ε) log(|S||A|/δ)) timesteps t ≤ L.
The proof of Theorem 4.4.2 is given in Lemma 8.5.2 in Kakade [2003]. To simplify the notation, let C = O((m|S||A|T/ε) log(|S||A|/δ)).
Lemma 4.4.3. After C steps, each state-action pair (s, a) is visited m times with probability greater than 1 − δ/2.
Proof. This is an alternative interpretation of Theorem 4.4.2 due to Hoeffding's bound,⁷ using δ/2 in place of δ. C ignores all constants, so, within C steps, all state-action pairs are visited m times with probability greater than 1 − δ/2.
Lemma 4.4.4. With a properly chosen τ, the R-max# algorithm resets and visits each state-action pair m times with probability greater than 1 − δ.
Proof. Suppose the R-max# algorithm has already learned a model. Lemma 4.4.3 states that within C steps, each state-action pair (s, a) is visited m times with probability greater than 1 − δ/2; that is, we learn a model and all state-action pairs are marked as known. Remember that τ measures the difference between the current time step and the time step at which each state-action pair was last visited m times. To make sure R-max# does not reset a state-action pair before all state-action pairs have been visited with probability greater than 1 − δ/2, τ must be at least C. Hoeffding's bound does not predict the order of the m-th visits of the state-action pairs. The worst situation is that all state-action pairs are marked known near t = C. According to Lemma 4.4.3, we need an extra interval C to make sure that all state-action pairs are visited m times after the reset with probability greater than 1 − δ/2. In all, we need to set τ = 2C (one C for the resetting stage and another C for the learning stage). Note that, according to assumption 4, all state-action pairs will be learned between t = nC and t = (n + 1)C after resetting. Then, R-max# restarts the learning process.
To simplify the proof, we introduce the concept of a cycle to help us analyze the algorithm.
Definition 4.7. A cycle occurs when all reward estimations for each state-action pair (s, a) are reset and then marked as known.
Intuitively, a cycle is the process in which R-max# forgets the old MDP and learns the new one. According to Lemma 4.4.4, 2C steps are sufficient to reset and visit each state-action pair (s, a)
⁷ Hoeffding [1963] provides an upper bound on the probability that the sum of random variables deviates from its expected value.
Figure 4.9: An illustration of the running behavior of R-max#. Circles represent state-action pairs; a cycle consists of a reset and a learning phase. The length of the reset window is τ = 2C in R-max#. In the learning stage, all state-action pairs may be marked known before t = C with high probability. After resetting, between [2C, 3C], we assume that all state-action pairs will be marked known in [3C, 4C].
m times, with probability at least 1 − δ. Thus, we set the length of one cycle to 2C. A cycle is a τ window, so that we leave enough timesteps for R-max# to reset and learn each state-action pair.
Lemma 4.4.5. Let τ = 2C and let M' be the new L-round MDP after the opponent switches its strategy. The R-max# algorithm guarantees an expected return of U*_{M'}(c_t) − 2ε within C timesteps with probability greater than 1 − 3δ.
Proof. Lemma 4.4.4 states that if τ = 2C, each state-action pair (s, a) is reset within 2C timesteps with probability greater than 1 − δ, since (1 − δ/2)(1 − δ/2) = 1 − δ + δ²/4 > 1 − δ. If an opponent switches its strategy at any timestep in cycle i (see Fig. 4.10 for details), there are three cases: 1) R-max# does not reset the corresponding state-action pairs (s, a), since they are not considered known (∉ K); 2) R-max# resets the reward estimations of the corresponding state-action pairs (s, a) but does not learn new ones; 3) R-max# has already reset and learned the state-action pairs (s, a).
Case 1 is safe for R-max# since it learns the new state-action pairs with probability greater than 1 − δ. For cases 2 and 3, the worst is case 3, since R-max# is not able to learn new state-action pairs within cycle i, whereas R-max# may have a chance to learn new state-action pairs in case 2 (in the learning phase of the same cycle i). Assumption 3 states that the opponent adopts a stationary strategy for at least 4C steps, which is exactly 2 cycles between two switch points. Although R-max# cannot learn new state-action pairs within cycle i when case 3 happens, it can learn them in cycle i + 1 by Lemma 4.4.4.
In all, R-max# will eventually learn the new state-action pairs in either cycle i or cycle i + 1 with probability greater than 1 − 2δ, since (1 − δ)(1 − δ) = 1 − 2δ + δ² > 1 − 2δ. That is, R-max# requires 2 cycles, or 4C timesteps, to learn a new model that fits the new opponent policy. Applying the chain rule of probability theory and Theorem 4.4.2, the R-max# algorithm guarantees an expected return of
Figure 4.10: Possible switch points in R-max#. Suppose an opponent switches in cycle i; there are two possible switch points. R-max# will learn the new state-action pairs within two cycles.
U⇤(ct)�2✏ within C timesteps with probability greater than (1�2�)(1��) = 1�3�+2�2 > 1�3�.
Note that we have not yet proposed a value for m. How should m be bounded? Kakade shows
that m = O( (|S|T²/ε²) log(|S||A|/δ) ) is sufficient, given error ε and confidence δ (Lemma 8.5.6 in Kakade,
2003). With this result, we have the following proof for Theorem 4.4.1:

Proof of Theorem 4.4.1. Recall that C = O( (m|S||A|T/ε) log(|S||A|/δ) ). Combining Lemma 4.4.5 and
m = O( (|S|T²/ε²) log(|S||A|/δ) ), Theorem 4.4.1 follows by setting δ ← δ/4. □
The proofs rely heavily on the assumptions stated at the beginning of this section, which may be
too strong to capture the practical performance of R-max#. In particular, Assumption 4 may not
hold in some domains; nonetheless, it provides a theoretical way to understand R-max#. We are also able to
relate R-max# to R-max, since R-max is the basic solver in the proof. Finally, the
theoretical result gives bounds on the parameters, e.g., τ = 2C and C = O( (m|S||A|T/ε) log(|S||A|/δ) ).
4.4.1 Efficient drift exploration with switch detection
As a final approach, we propose to use the efficient drift exploration of R-max# together with the
MDP-CL framework for detecting switches. The idea is that these two approaches tackle the same
problem in different ways and should therefore complement each other, at the expense of some
extra exploration. We call this combination R-max#CL (Algorithm 4.4); it combines the synchronous
updates of MDP-CL with the asynchronous rapid adaptation of R-max# (note that it uses the parameters of both
approaches).
R-max#CL behaves as R-max#, efficiently revisiting only those states that were visited long ago. Concurrently, the switch detection process of MDP-CL is performed, and if a switch is detected the current
model and policy are reset and R-max# restarts. Naturally, this combination of exploration schemes will
be profitable in some settings, while in others it is better to use only one of them. Experiments
in Section 5.4.5 provide some insights about this.
Algorithm 4.4: R-max#CL algorithm
Input: window size w, switch-detection threshold, m value, rmax value, reset window τ
Function: TVD(), compares two opponent models using the total variation distance
Function: R-max#(), calls the R-max# algorithm
1  Initialize as R-max#
2  model = ∅
3  for t = 1, . . . , T do
4      R-max#(m, rmax, τ)
5      if t == i · w, (i ≥ 1) and model == ∅ then
6          Learn a model with the past w interactions
7      end
8      if t == j · w, (j ≥ 2) then
9          Learn comparison model′ with past interactions
10         d = TVD(model, model′)
11         if d > threshold then      // Strategy switch?
12             Reset models, j = 2
13         else
14             j = j + 1
15         end
16     end
17 end
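To make the control flow of Algorithm 4.4 concrete, the following sketch shows one possible implementation of the loop. The callables are hypothetical stand-ins for the thesis components (the R-max# agent, the model-learning routine and the TVD comparison); this is an illustrative sketch, not the implementation used in the thesis.

```python
def rmax_sharp_cl(act_with_rmax_sharp, reset_rmax_sharp, learn_model, tvd,
                  T, w, threshold):
    """Control loop of R-max#CL (Algorithm 4.4): R-max# drift exploration
    combined with MDP-CL style switch detection.

    act_with_rmax_sharp(t) plays one round with R-max# and returns the
    observed interaction, learn_model(history) builds an opponent model from
    a window of interactions, tvd(m1, m2) is the total variation distance.
    """
    model, history = None, []
    last_reset = 0                                 # round of the last model reset
    for t in range(1, T + 1):
        history.append(act_with_rmax_sharp(t))     # R-max# acts at every round
        if (t - last_reset) % w != 0:
            continue
        if model is None:                          # first full window after a reset
            model = learn_model(history[-w:])
        else:                                      # later windows: compare a fresh model
            new_model = learn_model(history[-w:])
            if tvd(model, new_model) > threshold:  # strategy switch detected
                model, last_reset = None, t
                reset_rmax_sharp()                 # restart R-max# (fresh model and policy)
    return history
```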
4.4.2 Summary
In Section 4.3 we presented drift exploration, which is designed to overcome the shadowing problem
that occurs when facing non-stationary opponents. We proposed R-max#, an algorithm that
builds upon the theoretical results of R-max to provide switch detection guarantees under certain
assumptions (Section 4.4). Finally, we proposed to combine the efficient exploration of R-max#
with MDP-CL in R-max#CL. The next section presents DriftER, an algorithm for switch detection that
uses the estimated error of the learned opponent model as an indicator of a possible change of strategy.
4.5 DriftER
Most machine learning work assumes that examples are generated according to some stationary
probability distribution. Concept drift approaches (Section 3.3.1) have studied the problem of learning
when the class-probability distribution that generates the examples changes over time. However, this
approach is not directly applicable to multiagent settings, since we need to interact with another agent
and we need to plan an optimal policy. For that reason, we use the modeling idea proposed by MDP-CL (Section 4.1) to generate a model of the opponent that can be used both to play
against the opponent and to estimate a possible strategy switch. The idea is to compute
a predictive error for the opponent's model; when the error increases consistently, a new
model is needed.
DriftER leverages insights from concept drift and MDP-CL to identify switches in an opponent's
strategy. When facing non-stationary opponents whose model has been learned, an agent must balance exploitation (to perform optimally against that strategy) and exploration (to attempt to detect
switches in the opponent). DriftER treats the opponent as part of a stationary (Markovian) environment but tracks the quality of the learned model as an indicator of a possible change in the opponent's
strategy. When a switch in the opponent's strategy is detected, DriftER resets its learned model and
restarts the learning. An additional virtue of DriftER is that it can check for switches at every timestep
(as opposed to MDP-CL). In contrast to R-max#, it detects switches explicitly and, in addition,
provides a theoretical guarantee of switch detection with high probability.
The DriftER pseudocode is presented in Algorithm 4.5. It starts by learning an opponent model in the
form of an MDP (lines 3-4). Once a model has been learned, a switch detection process starts by
predicting the opponent's behavior (line 5). An error probability is computed, and tracking this error
determines when a switch has happened (lines 6-13). When a switch is detected, the learning phase
is restarted (lines 15-16).
4.5.1 Assumptions
The proposed framework makes the following assumptions.
• The opponent will not change strategy during a number of interactions (learning phase).
If this assumption does not hold, then our agent will not learn the correct dynamics of the opponent
and its predictions will not be accurate. Moreover, the theoretical guarantee (Section 4.5.4) will not hold.
4.5.2 Model learning
DriftER learns a model of the opponent in order to compute a policy to act against it. In all settings,
after interacting with the environment/opponent for w rounds, the environment is learned using
R-max exploration [Brafman and Tennenholtz, 2003], and we can use techniques such as value iteration
[Bellman, 1957] to solve the MDP.
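For concreteness, a minimal sketch of solving the learned opponent MDP with value iteration; the array shapes and names are illustrative, not the thesis implementation.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """Value iteration [Bellman, 1957] on a learned MDP.

    P : (|A|, |S|, |S|) transition probabilities estimated from the opponent model.
    R : (|S|, |A|) expected rewards. Returns the greedy policy and state values.
    """
    num_states = P.shape[1]
    V = np.zeros(num_states)
    while True:
        # Q(s, a) = R(s, a) + gamma * sum_s' P(s' | s, a) V(s')
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1), V_new
        V = V_new
```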
4.5.3 Switch detection
After learning a model of the opponent, DriftER must decide at each timestep whether to learn a new
model or keep the existing one. Using the existing model, DriftER can predict the next state of the
MDP (which depends on the opponent's strategy) and then compare it with the experienced true state.
This comparison can be binarized with correct/incorrect values, producing a Bernoulli process S_1, S_2, . . . , S_T, assumed to be a sequence of independent, identically distributed events, where S_i ∈ {0, 1} and T is the last timestep. Let p_i be the estimated error probability (i.e., the probability of observing an incorrect prediction) estimated from S_1 to S_i, for i = 1, . . . , T.
Algorithm 4.5: DriftER algorithm
Input: learning window size w, initial error counter n_init, adjust value δ, window m
Function: predictAction(), predicts the next action of the opponent
Function: computeError(), computes the error of the prediction
Function: computeConfInterval(), computes confidence intervals
Function: adjustN(), adjusts n according to the error and δ
1  model = ∅, countError = 0
2  for t = 1, . . . , T do
3      if t == i · w, (i ≥ 1) and model == ∅ then
4          Learn a model with the past w interactions
5      end
6      if model ≠ ∅ then
7          â = predictAction(model)
8          Observe real action a
9          p = computeError(â, a)
10         [f_upper, f_lower] = computeConfInterval(p)
11         for i = t − 1, . . . , t − m do
12             Δ_i = f_upper(p_i) − f_upper(p_{i−1})
13             if Δ_i > 0 then
14                 countError = countError + 1
15             end
16         end
17         n = adjustN(n_init, p, δ)
18         if countError ≥ n then      // Switch detected
19             model = ∅
20         end
21         countError = 0
22     end
23 end
Then, the 95% confidence interval [f_lower(p_i), f_upper(p_i)] over S_1, S_2, . . . , S_i is calculated at each
timestep i using the Wilson score [Wilson, 1927], so that the confidence interval improves as the
amount of data grows; f_lower(p_i) and f_upper(p_i) denote the lower and upper bounds of the confidence
interval, respectively.
The estimated error probability, and its associated confidence interval, can increase for two reasons:
i) the opponent is exploring or makes mistakes, or ii) the opponent changes its strategy. To detect
the latter case, DriftER tracks the finite difference of the confidence interval using the upper bound
f_upper(p_i) at each timestep i. The finite difference is defined by

Δ_i = f_upper(p_i) − f_upper(p_{i−1}),   i = 1, . . . , T.    (4.9)

If Δ_i > 0, Δ_{i−1} > 0, . . . , Δ_{i−n+1} > 0, where n is a parameter that should be set according to the
domain, then DriftER decides to restart the learning phase. That is, if DriftER detects that the
confidence interval has been increasing over the last n steps, it restarts the learning phase.
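For concreteness, a minimal sketch of this test, assuming the agent keeps cumulative counts of incorrect predictions; the Wilson score interval is standard, but the class name and interface below are illustrative rather than taken from the thesis.

```python
import math

def wilson_interval(errors, trials, z=1.96):
    """95% Wilson score confidence interval for the error probability."""
    if trials == 0:
        return 0.0, 1.0
    p = errors / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials ** 2))
    return max(0.0, center - half), min(1.0, center + half)

class SwitchTracker:
    """Counts consecutive increases of f_upper(p_i), as in Eq. (4.9)."""
    def __init__(self, n):
        self.n = n                  # consecutive increases needed to signal a switch
        self.prev_upper = None
        self.streak = 0

    def update(self, errors, trials):
        _, upper = wilson_interval(errors, trials)
        if self.prev_upper is not None and upper > self.prev_upper:   # Delta_i > 0
            self.streak += 1
        else:
            self.streak = 0
        self.prev_upper = upper
        return self.streak >= self.n    # True -> restart the learning phase
```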
Once DriftER has a model of the opponent, it can start computing an error rate and its confidence
intervals. However, information from the learning phase is used to obtain an initial estimate of both
terms (the error probability and its confidence interval), which avoids starting with no information
and reduces peaks in the estimation.
Using a fixed n for all types of opponents may not be the best option. For example, against
stochastic opponents there will be a non-zero probability of incorrectly predicting the opponent's next
move. Since we still need to check when the error increases, in what follows we propose to adjust n
according to the error probability p.
We set n = n_init assuming a perfect model can be learned; against stochastic opponents, n is adjusted to

n = n_init + log(δ) / log(p)    (4.10)

where n ≤ n_init + C, C is a constant value, and δ > 0 (described in the next section).
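A small sketch of the adjustment in Equation 4.10; the cap argument stands for the bound n ≤ n_init + C, and the names are illustrative.

```python
import math

def adjust_n(n_init, p, delta, cap):
    """Adjust the switch-detection parameter n (Eq. 4.10).

    p     : current estimated error probability (0 <= p < 1)
    delta : small constant > 0 from the theoretical guarantee
    cap   : upper bound n_init + C on the adjusted value
    """
    if p <= 0.0:                 # a perfect model so far: keep the initial value
        return n_init
    return min(n_init + math.log(delta) / math.log(p), cap)
```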
4.5.4 Theoretical guarantee for switch detection
Now we provide a theoretical result to justify that this method is capable of detecting opponent
switches with high probability. In doing so we make the following assumptions: (i) the opponent does
not switch strategies while DriftER is in the learning phase; (ii) the probability of exploration or
mistakes by the opponent is at most ε at each timestep.

Theorem 4.5.1. Let ε > 0 and δ > 0 be small constants. If Δ_i > 0, Δ_{i−1} > 0, . . . , Δ_{i−n+1} > 0 and
we set n = O(log δ / log ε), then DriftER detects the opponent switch with probability 1 − δ.
Proof. If Δ_i > 0, Δ_{i−1} > 0, . . . , Δ_{i−n+1} > 0, then DriftER decides to learn a new model. However, we
point out that Δ_i > 0 may also be caused by the opponent's exploration or mistakes. The worst case happens
when DriftER incorrectly detects a switch while the opponent only made mistakes or explored, that is,
Δ_{i−j} > 0 for all j = 0, . . . , n − 1 due to the opponent's exploration/mistakes. Let A denote the above
event. Given Δ_i > 0, Δ_{i−1} > 0, . . . , Δ_{i−n+1} > 0, the probability of event A is

P[A | Δ_i > 0, Δ_{i−1} > 0, . . . , Δ_{i−n+1} > 0] ≤ ε^n,    (4.11)

since we assume that the probability of exploration or mistake is at most ε at each timestep; by
the chain rule (multiplying the probability of each event), the bound follows. Then, setting ε^n = δ, where
δ is the probability of incorrectly detecting a switch (so 1 − δ is the probability of detecting the switch
correctly), we obtain n = log δ / log ε and the result follows. □
This result shows that if DriftER decides to restart the learning phase, it does so because it detected
the opponent switch with high probability (1 − δ), which makes the method robust.
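As a concrete illustration (example values, not taken from the thesis): with ε = 0.1 and δ = 0.001,

n = log δ / log ε = log(0.001) / log(0.1) = 3,

so observing three consecutive increases of the upper confidence bound already signals a switch that is correct with probability at least 1 − δ = 0.999.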
4.5.5 Summary
In this section we introduced DriftER, an algorithm that learns a model of the opponent in the form
of an MDP and keeps track of its error rate. When the error increases significantly, the opponent is
assumed to have changed its strategy and DriftER must learn a new model. Theoretical results provide a guarantee
of detecting switches with high probability. Section 5.5 will present results in repeated games and in
the PowerTAC simulator against state-of-the-art approaches.
4.6 Summary of the chapter
This chapter presented three main approaches for dealing with non-stationary opponents. The first one
is a framework for learning and planning in repeated games against non-stationary opponents, with
two implementations: MDP-CL and MDP4.5. The framework detects switches by learning different
models throughout the interaction; comparisons between models reveal switches in the opponent.
Then, two extensions of MDP-CL were presented: a priori MDP-CL assumes the set of
opponent models is known and adapts rapidly once it detects a switch, while incremental MDP-CL learns models
and does not discard them once a switch has occurred, which avoids relearning previously seen models.
Since MDP-CL is not capable of detecting some types of switches, drift exploration was proposed to
overcome this limitation. We proposed the R-max# algorithm, which provides efficient exploration
against non-stationary opponents; R-max# provides theoretical guarantees for switch detection and
optimal expected rewards under certain assumptions. Our last proposal is DriftER, a switch detection
mechanism which learns a model of the opponent and uses it to keep track of the prediction error. DriftER
provides a theoretical guarantee for switch detection with high probability. These approaches have
different characteristics, and in the next chapter we present the results of each of our proposals in
different domains.
Chapter 5
Experiments
In this chapter, we present experiments performed in five experimental domains (see Figure 5.1).
These are: the iterated prisoner's dilemma (iPD), the multiagent prisoner's dilemma, the alternate-offers bargaining protocol, double auctions in the PowerTAC simulator, and general-sum repeated
games. The iPD is a simple and well-known domain with different strategies available
(Section 2.3.2). Its multiagent version will be used to show that MDP-CL can be used in domains with
more than one opponent. The alternating-offers protocol is a more complex domain with richer state
and action spaces. Double auctions in PowerTAC are a more realistic setting where there is added
uncertainty in the environment. Finally, by using general-sum games we show that our approaches
generalize to game-theoretic strategies (Section 2.3).
Experiments follow the same order in which the approaches were presented:
• Section 5.2 compares the MDP4.5 and MDP-CL approaches against hidden-mode Markov decision processes (see Section 2.2.2), since these are designed to handle non-stationary
environments.
• In Section 5.2.6, MDP-CL is evaluated in the multiagent version of the PD to test whether the approach
generalizes to more than one opponent.
• In Section 5.3, a priori and incremental MDP-CL are compared against the original MDP-CL.
• In Section 5.4, drift exploration is evaluated in MDP-CL(DE), R-max# and R-max#CL against
the state-of-the-art approaches R-max [Brafman and Tennenholtz, 2003] as a baseline, and FAL
[Elidrisi et al., 2012] and WOLF-PHC [Bowling and Veloso, 2002], since they are algorithms that
can learn in non-stationary environments.
• DriftER is evaluated in Section 5.5 in the PowerTAC domain against MDP-CL and TacTex
[Urieli and Stone, 2014] (Appendix A.4), which is the champion of the inaugural PowerTAC competition.
Figure 5.1: Experimental domains used in this thesis and where they are used in this chapter.
• Finally, in Section 5.6, MDP-CL, R-max# and DriftER are compared in the Battle of the Sexes
game (to see their different behaviors in the same setting) and then on general-sum games
against switching opponents that use strategies from the game theory literature (to show our approaches can be used against different strategies). Comparisons are performed with WOLF-PHC.
5.1 Experimental domains
We used five domains for performing experiments. In all of them, the setting consists of one learning
agent and one (or more) opponent(s) that interact for several timesteps/timeslots/rounds. We start
by presenting each domain in detail.
5.1.1 Iterated prisoner’s dilemma (iPD)
Table 5.1: The bimatrix game known as the prisoner's dilemma. Each cell represents the utilities given to the agents (the first for agent A and the second for agent B).

                           Agent B
                           cooperate    defect
Agent A    cooperate       3,3          0,4
           defect          4,0          1,1
As an initial setting we used the iPD, since it is a well-known domain where we can easily use different
strategies for the opponent. In Table 5.1 we present the values used for the iPD game (they
fulfill the requirements presented in Section 2.3). We used the three most successful and best-known hand-crafted strategies proposed in the literature, TFT, Pavlov and Bully, as opponent strategies
(Section 2.3).
These three strategies have different behaviors in the iPD, and the optimal policy differs across
them. For example, with the values in Table 5.1 and a discount factor¹ γ = 0.9, the optimal policy
against a TFT opponent is always to cooperate, in contrast to the optimal policy against Bully, which
is always to defect. The optimal policy against Pavlov is to play the Pavlov strategy.
¹ In game theory, this value is commonly used to represent the time value of money; people usually prefer money (rewards) immediately rather than at some later date.
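As a quick check of the claim about TFT (an illustrative calculation using the payoffs in Table 5.1, not spelled out in the thesis): always cooperating against TFT earns 3 every round, for a discounted return of

3 + 3γ + 3γ² + . . . = 3 / (1 − 0.9) = 30,

whereas always defecting earns 4 once and 1 thereafter,

4 + Σ_{t≥1} γ^t · 1 = 4 + 0.9 / (1 − 0.9) = 13,

so cooperation is indeed the better policy against TFT.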
5.1.2 Multiagent iterated prisoner’s dilemma
Most experiments were performed with one opponent in the environment. However, it is also important
to test the performance of the proposed algorithms in settings with more than two agents. The natural
extension is a multiagent version of the prisoner's dilemma.
In [Stimpson and Goodrich, 2003] an extended version of the prisoner's dilemma was presented;
it consists of I players and |A| actions and preserves the same structure (one Nash equilibrium and
a dominated cooperative strategy). In the game, I agents hold |A| resource units each. At each
iteration, the i-th agent must choose how many of its |A| units will be allocated to a group goal G,
while the remaining units will be used for a self-interested goal S_i. Let a_i be the amount contributed by
agent i towards goal G, and a = [a_1, ..., a_I] the joint action. The utility of agent i given the joint
action is:
U_i(a) = ( (1/I) Σ_{j=1}^{I} a_j − k a_i ) / ( |A| (1 − k) )    (5.1)
where k ∈ (1/I, 1) is a constant that indicates how much each agent values its contribution
towards the selfish goal. The payoff function is such that when all the agents put their |A| units in the
group goal, each agent is rewarded with 1. On the other hand, if nobody puts units in the group goal,
a payoff of 0 is produced. If each agent adopts a random strategy, the expected average payoff is 0.5.
Here the state space is formed by the last action of both opponents and our learning agent; the action
space remains with two possible actions.
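As an illustration, a short sketch of the payoff in Equation 5.1 as reconstructed above, together with sanity checks of the properties just stated; the function name and checks are illustrative, not thesis code.

```python
def masd_utility(a, i, num_units, k):
    """Utility of agent i in the multiagent social dilemma (Eq. 5.1).

    a         : list with the units each agent contributes to the group goal
    num_units : units available per agent (|A| in the thesis notation)
    k         : selfishness constant, with 1/len(a) < k < 1
    """
    group_share = sum(a) / len(a)                       # (1/I) * sum_j a_j
    return (group_share - k * a[i]) / (num_units * (1 - k))

# Sanity checks for I = 3 agents, |A| = 1 unit, k = 0.5:
print(masd_utility([1, 1, 1], 0, 1, 0.5))   # 1.0 -> everyone contributes
print(masd_utility([0, 0, 0], 0, 1, 0.5))   # 0.0 -> nobody contributes
# under uniform random play the expected payoff is 0.5, as stated in the text
```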
5.1.3 Alternate-offers bargaining
We also performed experiments on an alternative domain with different characteristics (richer
state and action spaces), i.e., alternate-offers bargaining. This domain consists of two players,
a buyer and a seller. They alternate offers, trying to agree on a price. Their possible
actions are offer(x) with x ∈ ℝ, exit and accept. If either player accepts, the game finishes with
rewards for both players. If one of them plays exit, the bargaining stops and the outcome is 0 for both
of them. Each utility function U_i depends on three parameters of agent i: the reservation price RP_i (the
maximum/minimum amount a player is willing to accept), the discount factor δ_i, and the deadline T_i (agents
prefer an earlier agreement, and after the deadline they exit).
For this domain the state space is composed of the last action performed. The parameters used
were T_i = 4, δ_i = 0.99, and offers in the range [0, 10]² (integer values); therefore |S| = 13, |A| = 13
(the iPD had |S| = 4, |A| = 2). The buyer values the item at 10. One complete interaction
consisted of repeated negotiations. In the experiments, our learning agent was the buyer and the
non-stationary opponent was the seller.
² These values emulate a scenario where the buyer wants to buy sooner rather than later and, after a number of rounds, will leave the negotiation.
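One common instantiation of the buyer's utility in this protocol is value minus price, discounted per round; the exact functional form is not spelled out here, so the following sketch should be read as an assumption. It does, however, reproduce the rewards quoted later in Section 5.4.1 (a reward of 2 for agreeing on a price of 8 in the first round, and 4δ² for agreeing on 6 in the third round).

```python
def buyer_utility(price, round_index, rp_buyer=10, delta=0.99, deadline=4):
    """Discounted buyer utility for an agreement reached at `price`.

    round_index : 0 for the first round, 1 for the second, and so on.
    Assumed form: (reservation price - price) * delta**round_index, and 0
    once the deadline has passed (the buyer exits).
    """
    if round_index > deadline:
        return 0.0
    return (rp_buyer - price) * (delta ** round_index)

print(buyer_utility(8, 0))   # 2.0           -> accept P_f = {8} immediately
print(buyer_utility(6, 2))   # 4 * 0.99**2   -> wait for the flexible seller P_l = {8 -> 6}
```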
5.1.4 Double auctions
We use the PowerTAC simulator (see Appendix A.2 for a detailed description) as a practical and real-
world setting. The wholesale market operates as a periodic double auction (Appendix A.3) in which
brokers are allowed to buy and sell quantities of energy for future delivery, typically between 1 and
24 hours in the future. At each timestep traders can place limit orders in the form of bids (buy
orders) and asks (sell orders). Orders are maintained in an orderbook. In a periodic double auction,
the clearing price is determined by the intersection of the inferred supply and demand functions.
Demand and supply curves are constructed from the bids and asks of each orderbook (one for each
enabled timeslot), and the clearing price is set at their intersection, which is the price that
maximizes turnover [Ketter et al., 2014].
Although we define a fixed limit price and there is only a single opponent (another buying broker),
PowerTAC includes seven wholesale energy providers as well as one wholesale buyer to ensure liquidity of the market [Ketter et al., 2013], introducing additional uncertainty and randomness into the
simulation.
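As an illustration of how a uniform clearing price is obtained from the two curves, a simplified sketch follows; it is not the PowerTAC market code, and the midpoint price rule at the end is only one common choice.

```python
def clear_double_auction(bids, asks):
    """Clear one orderbook of a periodic double auction (simplified sketch).

    bids, asks : lists of (limit_price, quantity) with positive quantities.
    Returns (clearing_price, traded_quantity); trades are matched while the
    best remaining bid is at least the best remaining ask, which maximizes
    the traded quantity (turnover).
    """
    bids = sorted(bids, key=lambda o: -o[0])      # demand: highest prices first
    asks = sorted(asks, key=lambda o: o[0])       # supply: lowest prices first
    traded, i, j, bid_fill, ask_fill = 0.0, 0, 0, 0.0, 0.0
    last_bid = last_ask = None
    while i < len(bids) and j < len(asks) and bids[i][0] >= asks[j][0]:
        q = min(bids[i][1] - bid_fill, asks[j][1] - ask_fill)
        traded += q
        bid_fill += q
        ask_fill += q
        last_bid, last_ask = bids[i][0], asks[j][0]
        if bid_fill >= bids[i][1]:
            i, bid_fill = i + 1, 0.0
        if ask_fill >= asks[j][1]:
            j, ask_fill = j + 1, 0.0
    if traded == 0.0:
        return None, 0.0                          # supply and demand do not cross
    return (last_bid + last_ask) / 2.0, traded    # price at the intersection (midpoint rule)
```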
5.1.5 General-sum games
Previously we used the prisoner’s dilemma with di↵erent strategies. In order to present a more general
setting to test our proposed algorithms we will present experiments in general-sum games where the
opponents may use game theoretic strategies. The most relevant strategies derived from game theoretic
stability concepts that we found relevant to test are: pure Nash equilibria (when available), mixed
Nash equilibria, minimax strategy and fictitious play [Brown, 1951] (see Section 2.3). Furthermore,
2These values emulate an scenario where the buyer wants to buy sooner rather than later, and after a number of
rounds it will leave the negotiation.
Table 5.2: A bimatrix game representing the Battle of the Sexes game. Two agents choose between two actions: going to the opera (O) or going to a football match (F). v1 and v2 represent numerical values.

            O             F
O        v1, v2         0, 0
F         0, 0         v2, v1
Battle of the sexes
Battle of the Sexes (BoS) is a two-player coordination game. Suppose two persons want to meet at
a specific place: the opera (O) or a football match (F). One prefers the opera and the other prefers
the football match, and no communication is possible. The game is presented in Table 5.2, where
v1 > 0, v2 > 0 and v1 ≠ v2; different choices of these values yield different instantiations of the BoS game. This game has two
pure Nash equilibria, (O,O) and (F,F). Both pure equilibria are unfair, since one player obtains a better
score than the other. There is also one mixed Nash equilibrium, where players go more often to their
preferred event.
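For completeness (a standard derivation, not spelled out in the thesis), the mixed equilibrium of Table 5.2 follows from the indifference conditions: if the row player goes to O with probability p, the column player is indifferent when p·v2 = (1 − p)·v1, and symmetrically for the other player, giving

p_row(O) = v1 / (v1 + v2),    p_col(F) = v1 / (v1 + v2).

In both cases each player attends its preferred event with probability max(v1, v2)/(v1 + v2) > 1/2, as stated above.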
Having presented the domains used in the following sections, we now start with experiments using
MDP-CL and MDP4.5 in the iPD.
5.2 MDP4.5 and MDP-CL against deterministic switching opponents
In Section 4.1, we presented a framework which is general enough to accept several learning techniques
to generate opponent models. This section presents the experimental results and comparisons with
a reinforcement learning technique (Hidden mode-MDPs) in terms of average rewards obtained in
repeated games.
The evaluation of the proposed approach was performed on the iPD with values t = 4, r = 3, p =
1, s = 0. We compared the proposed framework with MDP4.5 and MDP-CL against HM-MDPs. We
have chosen HM-MDPs since they are a technique designed for non-stationary environments
(see Section 2.2.2). Solving an HM-MDP yields a policy that can be compared with the policies
generated by our two implementations in terms of average rewards over a repeated game.
Figure 5.2: Graphical depiction of the experiments. In (a), the evaluation approach for the proposed framework (MDP4.5 and MDP-CL); in (b), the evaluation for HM-MDPs; note the extra learning period at the beginning.
5.2.1 Setting and objectives
In Figure 5.2 we depict the scheme of the experiments for the three compared approaches. In contrast
to our proposed framework, HM-MDPs need an extensive off-line training phase. For this reason,
there is an extra learning phase for HM-MDPs in which the agent behaves randomly and learns
an HM-MDP (see Fig. 5.2 (b)). The opponent uses two strategies and switches from one to another at
a certain point of the game (in both training and evaluation). The learned HM-MDP consists of 4 states
and 2 modes; this information about the opponent is enough to allow the HM-MDP to fully learn
the opponent model. To solve the learned HM-MDP, we transform it into a POMDP and use the
incremental pruning algorithm [Cassandra et al., 1997]. The POMDP is considered solved when the
error between stages is less than 10⁻⁹ or when it exceeds 500 iterations. Solving the POMDP yields
a policy that can be used to play against an opponent.
There are three possible strategies to be used by the opponents. An opponent was constructed
by taking two out of these three strategies (we tested all possible combinations). The opponent
starts playing with one of these strategies and, at a certain point in the game (not known by the
learning agent), switches to the other (continuing that way for the rest of the game).
The experiments are divided into four parts. The first one is devoted to analyzing the performance of
HM-MDPs with different training options. For an HM-MDP the number of modes (strategies) is fixed
in advance (it cannot learn new modes online), so we tested different training schemes, varying the
training data. The second part compares the three approaches MDP4.5, MDP-CL and HM-MDPs.
Our implementations learn in an online fashion, in contrast to HM-MDPs, which need an offline training
phase; this difference makes it hard to evaluate the approaches under the same conditions. We compare
the average rewards in a series of repeated games; for HM-MDPs, the rewards are obtained only
over the evaluation phase. Even if this is unfair to MDP4.5 and MDP-CL, since their learning is online, the
complete interaction is used to compute their average rewards. The third part proposes an extension
of our approach in terms of adding a drift exploration scheme. The last part provides results for
MDP-CL in a multiagent version of the prisoner's dilemma.
Table 5.3: Average rewards for the HM-MDPs agent with std. deviation using different training sizes in
the iterated prisoner's dilemma. The opponent switched between strategies in the middle of the interaction. The
evaluation phase consisted of 500 steps.
Opponent/Training size 100 500 2000 Average
TFT-Pavlov 2.81 2.90 2.92 2.88
TFT-Bully 1.77 1.55 1.56 1.63
Pavlov-TFT 2.90 2.85 2.74 2.83
Pavlov-Bully 1.97 1.94 1.86 1.92
Bully-TFT 1.58 1.89 1.50 1.66
Bully-Pavlov 1.88 1.92 1.87 1.89
Average 2.17 2.18 2.06 2.13
Table 5.4: Average rewards for the HM-MDPs agent (AvgR(A)) and for the opponent (AvgR(opp)) with
standard deviations in the iterated prisoner’s dilemma. Opponent switched between strategies in the middle of
the interaction. The evaluation phase consisted of 500 steps. Column Perfect shows the maximum value that a
learning agent can obtain.
                      Same model                  Different model             Perfect
Opponent         AvgR(A)      AvgR(Opp)       AvgR(A)      AvgR(Opp)       AvgR(A)
TFT-Pavlov 2.87 ± 0.09 2.97 ± 0.10 2.05 ± 0.27 1.59 ± 0.99 3.0
TFT-Bully 1.63 ± 0.39 1.67 ± 0.45 1.56 ± 0.15 2.96 ± 0.43 2.0
Pavlov-TFT 2.84 ± 0.13 2.92 ± 0.24 2.29 ± 0.30 2.12 ± 0.68 3.0
Pavlov-Bully 1.92 ± 0.06 1.76 ± 0.41 1.55 ± 0.11 2.82 ± 0.62 2.0
Bully-TFT 1.65 ± 0.27 2.02 ± 0.48 1.17 ± 0.27 1.43 ± 0.88 2.0
Bully-Pavlov 1.89 ± 0.12 1.87 ± 0.53 1.72 ± 0.05 0.97 ± 0.14 2.0
Average 2.13 ± 0.18 2.20 ± 0.37 1.72 ± 0.19 1.98 ± 0.62 2.3
5.2.2 HM-MDPs performance experiments
In order to evaluate the robustness of HM-MDPs, we evaluated the learned policy under different
switching times. We used round 250 as the point where the opponent switches from strategy1 to strategy2;
the duration of the games in the evaluation phase was 500 rounds. We evaluated different training sizes
tsize = {100, 500, 2000} games. In Table 5.3 we present the average rewards of the HM-MDP agent
with different training sizes against switching opponents. Results show that a training size of 500
obtained the best scores; a smaller size could reduce processing times but may not be
sufficient to learn the best policy, while a larger training size takes longer and can overfit the model,
yielding lower scores.
We present the average rewards with standard deviations (averaged over all tsize values) when the
opponent switched between strategies in the middle of the interaction in Table 5.4, under the Same model
column³. AvgR(A) presents the average rewards for the learning agent, and AvgR(Opp) presents the
average rewards for the switching opponent.
As we mentioned earlier, HM-MDPs need the number of modes to be fixed when learning. In the
previous experiment, opponents always used two strategies; since there are three different strategies
available, we modified the experiment so as to have different strategies in the training and evaluation
phases. The motivation is that the opponent may not use all its strategies
during the training phase, in which case the learned HM-MDP is incomplete; if a new
strategy (unknown to the HM-MDP) is used during evaluation, this will affect the results. So, in the
next experiment the training opponent consists of strategy1-strategy2 and the evaluation opponent
consists of strategy1-strategy3. The results of this experiment are presented in Table 5.4 under the
Different model column.
From the results it is easy to note that HM-MDPs consistently decrease their average reward when
using different models for evaluation and training (difference between the AvgR(A) columns of Same
model and Different model in Table 5.4). On average the decrease is 0.56 ± 0.27. In conclusion, when
HM-MDPs can explore all models in the training phase, they obtain good results. However, if they do not
learn the complete set of opponent strategies they cannot compute an optimal policy against the opponent and thus
receive lower rewards.
Having analyzed how HM-MDPs behave under different training conditions, we now present the
comparison with our proposed implementations.
5.2.3 HM-MDPs vs MDP4.5 vs MDP-CL
Since HM-MDPs need an offline learning phase, the comparison with MDP-CL and MDP4.5 is not
entirely direct. HM-MDPs have an off-line training phase, in which a policy is computed, and this
policy is evaluated against the switching opponent. In contrast, MDP4.5 and MDP-CL learn and
compute a policy continuously throughout the interaction; there are no clear training and evaluation
phases. Comparing the average rewards of the evaluation phase for HM-MDPs with those of the complete
interaction for MDP4.5 and MDP-CL is not totally fair, but it is a reasonable comparison in which HM-MDPs have an advantage. However, HM-MDPs may learn an incomplete model of the opponent, and
during evaluation a new strategy not seen during training could occur; this was evaluated in
the previous section. So, for HM-MDPs we take the average of the Same model and Different model
columns of Table 5.4. For MDP4.5 and MDP-CL the average rewards are obtained over the complete
game, i.e., during both training and evaluation phases.
In Figure 5.3 we depict the average rewards of HM-MDPs, MDP4.5 and MDP-CL against switching
opponents. We varied the switching time over 150, 250 and 350; we present only one result (250)
³A more detailed description of the experiments performed for HM-MDPs is presented in Appendix C.1.
Table 5.5: Average rewards with standard deviation of MDP-CL, MDP4.5 and HM-MDPs against non-stationary opponents. HM-MDPs have two columns, depending on whether the models in the evaluation phase were the same as or different from the models in the training phase.

Opponent        MDP-CL         MDP4.5         HM-MDP (Same)        HM-MDP (Different)
TFT-Pavlov 2.87 ± 0.19 2.70 ± 0.08 2.87 ± 0.09 2.05 ± 0.27
TFT-Bully 1.79 ± 0.13 1.87 ± 0.02 1.63 ± 0.39 1.56 ± 0.15
Pavlov-TFT 2.88 ± 0.07 2.72 ± 0.08 2.84 ± 0.13 2.29 ± 0.30
Pavlov-Bully 1.87 ± 0.10 1.79 ± 0.07 1.92 ± 0.06 1.55 ± 0.11
Bully-TFT 0.96 ± 0.01 1.02 ± 0.02 1.65 ± 0.27 1.17 ± 0.27
Bully-Pavlov 1.83 ± 0.05 1.80 ± 0.06 1.89 ± 0.12 1.72 ± 0.05
Average 2.04 ± 0.09 1.98 ± 0.05 2.13 ± 0.18 1.72 ± 0.19
since they are consistent with the other values. In Table 5.5 we present the comparison among the
same algorithms but with the two versions of HM-MDPs (same and different models in learning and
evaluation). The conclusions are the following:
• MDP-CL obtained the best results on average, followed by MDP4.5.
• The main difference between MDP4.5, MDP-CL and HM-MDPs appears with the opponent
Bully-TFT. This happens because the policy of the HM-MDP maintains a good exploration
process. In more detail, the HM-MDP's policy is to defect against Bully, but every 5 or 6 steps
it explores the cooperation action in order to detect when the opponent changes to TFT. When this
happens, HM-MDPs are capable of noticing this change of behavior and adapting to it, entering
a cooperate-cooperate cycle, which results in an increase in rewards. In contrast, MDP4.5 and
MDP-CL are not capable of detecting the switch in strategy from Bully to TFT and keep
defecting throughout the game. The behavior exhibited by HM-MDPs is a good example of
exploration for detecting switches, which was the motivation for drift exploration (Section 4.3).
These experiments allow us to make the following remarks. MDP-CL obtained the best scores on
average, and HM-MDPs obtained the worst against all switching opponents except Bully-TFT. HM-MDPs seem to have a good exploration scheme because of their off-line learning process: they obtain
good results when they can learn the opponent strategies beforehand, but when facing an unknown
strategy their score decreases. In summary, HM-MDPs have three main limitations: i) the need for a
training phase, ii) the time to solve the resulting POMDP, and iii) the need to determine the number
of training stages. On the other hand, MDP4.5 and MDP-CL are online learning approaches that do
not need to know the number of strategies beforehand. Also, they compute their policies faster,
since solving an MDP is computationally much simpler than solving a POMDP. A limitation of MDP4.5 and
MDP-CL is that their exploration is limited to certain periods of the game; for this reason we
proposed a drift exploration scheme.
Figure 5.3: Average rewards of MDP-CL, MDP4.5 and HM-MDPs (100 trials) against different switching opponents that switch in the middle of the interaction in a game of 500 rounds.
5.2.4 Preliminary drift exploration for MDP4.5 and MDP-CL
One of the main conclusions drawn from the results presented in the previous section was that MDP4.5
and MDP-CL do not notice the opponent's switch in some cases, which results in suboptimal policies
(low rewards). This is observed against the Bully-TFT switching opponent: since the optimal
policy against Bully is to defect, when the opponent switches from Bully to TFT the agent keeps
defecting and TFT behaves as Bully.
To solve this problem we promote exploration even when the agent already has a policy to act. This
section presents an initial version of drift exploration in the form of ε-greedy exploration (with probability ε
the agent acts randomly, not following the prescribed policy). We evaluated different values
ε = {0.05, 0.1, 0.15, 0.2, 0.25}. In Table 5.6, we present the comparison between the no-exploration
approach and this ε-exploration (ε = 0.1) approach for MDP4.5 and MDP-CL.
We note that in all cases the performance drops, except against Bully-TFT, which was expected.
However, for MDP4.5 the results are worse on average with this naive exploration. This happens
because the approach is too simple and explores with a fixed probability in all cases.
A way to improve this situation is to explore more when the results are bad and explore less when
the results are good (in terms of rewards). We propose a more intelligent approach that takes into
account the average rewards of the last T interactions.
Table 5.6: Comparison of no exploration, ε-exploration and softmax exploration for MDP4.5 and
MDP-CL. The Perfect column shows the maximum reward an agent can obtain.

                No exploration            ε exploration            Softmax exploration        Perfect
MDP4.5
Opponent AvgR(A) AvgR(Opp) AvgR(A) AvgR(Opp) AvgR(A) AvgR(Opp) AvgR(A)
TFT-Pavlov 2.70 ± 0.08 2.84 ± 0.08 2.51 ± 0.07 2.49 ± 0.07 2.67 ± 0.09 2.78 ± 0.08 3.0
TFT-Bully 1.87 ± 0.02 2.15 ± 0.02 1.71 ± 0.04 2.12 ± 0.07 1.83 ± 0.04 2.17 ± 0.05 2.0
Pavlov-TFT 2.72 ± 0.08 2.70 ± 0.19 2.53 ± 0.07 2.45 ± 0.08 2.68 ± 0.08 2.68 ± 0.15 3.0
Pavlov-Bully 1.79 ± 0.07 1.90 ± 0.23 1.70± 0.06 1.90 ± 0.07 1.76 ± 0.06 1.96 ± 0.16 2.0
Bully-TFT 1.02 ± 0.02 1.08 ± 0.03 1.63 ± 0.08 1.91 ± 0.08 1.51 ± 0.26 1.69 ± 0.28 2.0
Bully-Pavlov 1.80 ± 0.06 1.69 ± 0.12 1.68 ± 0.05 1.74 ± 0.08 1.77± 0.06 1.73 ± 0.13 2.0
Average 1.98 ± 0.05 2.06 ± 0.11 1.96 ± 0.06 2.10 ± 0.08 2.04 ± 0.10 2.17 ± 0.14 2.3
MDP-CL
Opponent AvgR(A) AvgR(Opp) AvgR(A) AvgR(Opp) AvgR(A) AvgR(Opp) AvgR(A)
TFT-Pavlov 2.87± 0.19 2.93 ± 0.06 2.72 ± 0.07 2.76 ±0.09 2.86 ± 0.15 2.89± 0.13 3.0
TFT-Bully 1.79 ±0.13 2.15 ± 0.03 1.79 ± 0.05 2.15± 0.06 1.83 ± 0.08 2.18 ±0.07 2.0
Pavlov-TFT 2.88 ±0.07 2.83 ±0.24 2.75 ± 0.06 2.64± 0.11 2.87 ± 0.11 2.80± 0.27 3.0
Pavlov-Bully 1.87 ±0.10 2.07 ±0.15 1.82 ± 0.02 2.04± 0.04 1.87 ± 0.02 2.13 ±0.04 2.0
Bully-TFT 0.96 ± 0.01 1.09 ±0.02 1.82± 0.07 2.04±0.08 1.74± 0.15 1.93 ±0.14 2.0
Bully-Pavlov 1.83 ± 0.05 1.94 ±0.21 1.79± 0.04 1.93 ±0.12 1.82 ± 0.05 1.97 ±0.19 2.0
Average 2.03 ± 0.09 2.17 ±0.12 2.12 ± 0.05 2.26 ± 0.08 2.17 ± 0.09 2.31 ± 0.14 2.3
Figure 5.4: Average rewards of MDP4.5, MDP-CL and the perfect agent (which knows how to best respond immediately) in the iPD against different stationary opponents. The window size w for learning the opponent's model was varied from 5 to 30.
Moreover, we give more importance to recent rewards than to old ones, so the rewards are weighted
accordingly. The exploration probability is now given by:
P(exploration) = (β e^{−β r_avg}) · x    (5.2)

where

r_avg = Σ_{t=1}^{T} (λ^t / Λ) r_t,    Λ = Σ_{t=1}^{T} λ^t,    and x ∈ (0, 1].    (5.3)
We evaluated different combinations of the parameters (λ = {0.7, 0.8, 0.9, 0.95, 0.99}, β = {1, 1.25, . . . , 2}, x =
{0.1, 0.15, . . . , 0.5}), and the best results were obtained with λ = 0.95, β = 1.5, x = 0.25 for MDP4.5,
and λ = 0.99, β = 1.5, x = 0.2 for MDP-CL. These results are presented in Table 5.6 in the
Softmax exploration column. We can observe that this exploration yields better results on average
than using no exploration or the ε-exploration approach; this occurs because it does not explore
blindly: when rewards are low it explores more than when rewards are high. Its limitation is that it
needs three parameters to be tuned to the domain.
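A short sketch of this exploration rule (Equations 5.2 and 5.3), using the parameter names as reconstructed above; it is an illustration, not the thesis code.

```python
import math
import random

def exploration_probability(recent_rewards, beta=1.5, lam=0.95, x=0.25):
    """Reward-sensitive exploration probability (Eqs. 5.2 and 5.3).

    recent_rewards : rewards of the last T rounds, most recent last.
    lam, beta, x   : decay, rate and scale parameters (example values from
                     the MDP4.5 tuning reported in the text).
    """
    T = len(recent_rewards)
    if T == 0:
        return x
    weights = [lam ** (T - t) for t in range(T)]    # newest reward gets weight lam**1
    r_avg = sum(w * r for w, r in zip(weights, recent_rewards)) / sum(weights)
    return min(1.0, beta * math.exp(-beta * r_avg) * x)

# act randomly with this probability instead of following the current policy
if random.random() < exploration_probability([3, 3, 1, 1, 1]):
    pass  # take a random (exploratory) action
```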
5.2.5 Learning speed of the opponent strategy
The previous experiment evaluated the approaches against switching opponents. However, it is also
important to evaluate how the size of the window of interactions affects the rewards against stationary
opponents. The motivation is that the learning times could be reduced, and we would like to see how
our approaches behave under those conditions. For that reason, we evaluated MDP4.5 and MDP-CL with
different values of the parameter w = 5, 10, . . . , 25, 30 in terms of average rewards against stationary
opponents (average of TFT, Pavlov and Bully).
In Figure 5.4 we depict the comparison in terms of average rewards for different sizes of the window
of interaction. The results show that MDP-CL outperforms MDP4.5 for all values of w. For small
sizes, w = 5 and 10, the difference between the two approaches is more noticeable than with larger
sizes of w. In summary, when the window of interaction is small, MDP-CL obtains better results than
MDP4.5; however, with an appropriate amount of interaction (more than 20 steps in this domain),
the two approaches obtain similar results. This is explained by the fact that decision trees are not
the best model to use with limited data.
Now we move towards more general domains; in particular, the next section presents experiments with
more than one opponent.
5.2.6 Increasing the number of opponents
In previous experiments we used only one opponent in the environment. However, there are several
domains where there is more than one opponent, and we present experiments showing that our
approach (MDP-CL) still works in those cases. The domain chosen to perform these experiments is
the multiagent version of the prisoner's dilemma. The environment consisted of three agents: one
learning agent and two opponents who used generalized strategies of Bully, TFT, and Pavlov. The
agents were given 1 resource unit and k = 0.5 (this means that the best score is 1.0 and the worst
is -1.0). The Bully strategy never gives its unit to the group. TFT gives its unit when at least one other
player contributed a unit in the previous round; otherwise it keeps its unit. Pavlov contributes its unit
whenever the two remaining players performed the same action in the previous round; otherwise it keeps its
unit.
The setting was designed to evaluate all possible combinations of different strategies given that
two opponents are available. The game lasted 7200 rounds, and every 400 rounds there was a change
of strategy in one of the two opponents. We decided to test MDP-CL(DE) (with softmax drift
exploration), since it shows a faster learning time and better scores on average than MDP4.5. We
compare against two algorithms: R-max (Section 2.2.3), used as a baseline, and the perfect agent which
best responds immediately (used as an upper bound).
Note that in this case the state space representation for MDP-CL is increased in order to take
into account that there are two opponents in the environment. In more detail, the state space is now
formed by the last action of both opponents and our learning agent (see an example in Fig. 5.5 (a)).
The action space remains with two actions: giving the resource unit to the group (similar to
cooperate) or keeping the unit (similar to defect).
In Figure 5.5 (b) and (c) we depict the immediate and cumulative rewards for the learning agents
(average of 100 iterations). In Figure 5.5 (b) we note that MDP-CL(DE) is capable of converging
to the best score after detecting and learning the opponents' models. In contrast, R-max obtains
suboptimal scores for different combinations of opponents, since it is not capable of adapting.
Figure 5.5: (a) MDP representation in the multiagent version of the prisoner's dilemma with two opponents; the state space increases according to the number of opponents. (b) Immediate and (c) cumulative rewards of MDP-CL(DE), R-max and the perfect agent (which best responds immediately to switches) in the multiagent prisoner's dilemma with 3 agents. At least one of the opponents changes its strategy every 400 rounds.
In Figure 5.5 (c) we can observe that with every switch MDP-CL needs a detection and learning phase,
which results in not obtaining the optimal score for a certain period. However, after this period it
obtains the optimal policy.
These results indicate that MDP-CL(DE) is useful in domains with more than one opponent in
the environment. However, one limitation is that it may not scale properly. For example, the state
space will grow exponentially with the number of agents; in particular |S| = |A|^I, where A is the
action space and I is the number of agents in the environment. Some ideas about how to address this
limitation are presented in Section 6.4 and are left as future work.
5.2.7 Summary
This section presented diverse experiments on MDP4.5 and MDP-CL against a technique for non-stationary environments. We also presented an initial version of drift exploration and promising results
with MDP-CL in domains with more than one opponent. The conclusions are that MDP-CL and MDP4.5
are online approaches that do not need to know the number of opponent strategies beforehand and
compute their policies faster than HM-MDPs. Adding drift exploration to MDP-CL provides
better scores, since it helps to detect some switches that would not be perceived without it.
5.3 A priori and incremental MDP-CL
We have presented results comparing MDP-CL and MDP4.5 against state-of-the-art algorithms for non-stationary reinforcement learning. Now we present experiments for the extensions of MDP-CL: a
priori MDP-CL and incremental MDP-CL.
5.3.1 Setting and objectives
We compare the a priori and incremental versions against the original MDP-CL in terms of performance
(average utility over the repeated game) and quality of the learned models (the prediction of the opponent's
next action at each round compared with the real value, averaged over a number of repetitions).
Experiments were performed on the iPD (one opponent).
First, we present how the TVD (Equation 4.5) behaves under non-stationary opponents, showing
that it can be used to efficiently compare models. Second, we present empirical results showing
that prior information increases the cumulative rewards for the learning agent and provides a better
prediction of the opponent model. Third, we show the advantages of incremental MDP-CL in case
the opponent reuses a previous strategy. Finally, we relax the assumption of having the complete set
of models used by the opponent and instead assume a set of noisy models that are an approximation
of the real ones. We first provide qualitative results based on a simple example and then quantitative
results with comparative data.
Figure 5.6: Total variation distance of the different prior models compared with the currently learned one (0 means they are the same model; totally dissimilar models give a value of 1) using a priori MDP-CL. The opponent is TFT-Bully, switching in the middle of the game.
5.3.2 Model selection in a priori MDP-CL
Here, we present how the TVD behaves against a switching opponent (TFT-Bully) that switches from
one strategy (TFT) to another (Bully) in the middle of the game. The game consisted of 300 rounds,
and our agent is given as prior information the set of opponent strategies {Bully, TFT, Pavlov}. In Figure 5.6 the TVD of each strategy compared with the currently learned model is depicted for
each round of the repeated game. From the figure we can observe that from round 5 the TFT model is
the one with the lowest distance (a zero value means they are the same), which is in fact the one used
by the opponent. At round 150 the opponent changes its strategy to Bully and two things happen: the
TVD with respect to Bully decreases and the TVD with respect to TFT increases. Before round 200
the learned model reaches a perfect similarity (with the correct model). This figure shows how the TVD is
able to efficiently provide a score useful to identify which model is the one used by the opponent. The
next section shows the improvement of using a priori models in quantitative terms against different
switching opponents.
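For concreteness, a sketch of such a model comparison; the exact aggregation over state-action pairs used by Equation 4.5 is an assumption here, and the representation of the models is illustrative.

```python
def total_variation_distance(model_p, model_q):
    """Average total variation distance between two learned opponent models.

    Each model maps a state-action pair to a dictionary of next-state
    probabilities. Returns 0.0 for identical models and values near 1.0 for
    totally dissimilar ones, matching the scale shown in Figure 5.6.
    """
    pairs = set(model_p) | set(model_q)
    if not pairs:
        return 0.0
    total = 0.0
    for sa in pairs:
        p, q = model_p.get(sa, {}), model_q.get(sa, {})
        states = set(p) | set(q)
        total += 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in states)
    return total / len(pairs)
```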
5.3.3 Rewards and quality in a priori MDP-CL
In this section MDP-CL and a priori MDP-CL are compared in terms of cumulative rewards and
quality of the current model. We illustrate the results with the same opponent as previously.
Figure 5.7: (a) Immediate rewards and (b) cumulative rewards of MDP-CL and a priori MDP-CL against the opponent TFT-Bully that switches in the middle of the iPD.
In Figure 5.7 we depict the (a) immediate and (b) cumulative rewards (average of 10 iterations)
for MDP-CL and a priori MDP-CL against the TFT-Bully opponent. In (a) we note that in the first
15 rounds of the interaction the difference in rewards is not noticeable, since both approaches are
exploring (learning/detecting the opponent model). However, from round 15 to 40 a priori MDP-CL
increases its rewards, since it already knows which model is the correct one and can exploit it. In
contrast, MDP-CL needs a longer period of exploration to determine the opponent model correctly.
This pattern is repeated when a switch is performed by the opponent (round 150). In (b) we can see
how the cumulative rewards increase each time there is a switch by the opponent, because of the faster
detection of a priori MDP-CL.
With respect to the quality of the model, in Figure 5.8 we depict the quality of the predictions
made by MDP-CL and a priori MDP-CL (average of 10 iterations) against the TFT-Bully opponent.
Here it is easy to note that since MDP-CL needs to complete an exploration phase of a certain size (in
this case 40 rounds), it does not have a correct model until that round (having a quality of 0), in contrast to
a priori MDP-CL, which always achieves better quality in fewer interactions.
The previous example illustrates the benefits of a priori MDP-CL; now we compare MDP-CL and
a priori MDP-CL against different switching opponents (that switch in the middle of a repeated game
of 300 rounds). Results are shown in Table 5.7, where AvgR(A) represents the rewards of the learning
agent and AvgR(Opp) those of the opponent. Each row averages 50 repetitions. From the table, we can
observe that for all opponents, a priori MDP-CL obtained statistically significantly better results (Wilcoxon signed-rank test, 5% significance level) than MDP-CL, which means a faster detection and
an earlier exploitation of the opponent model.
Figure 5.8: Model quality (a perfect quality, value of 1.0, means the model predicts the next state perfectly) of MDP-CL and a priori MDP-CL (50 trials) against the opponent TFT-Bully that switches between strategies in the middle of the interaction.
Table 5.7: Average rewards of the learning agents, AvgR(A) (MDP-CL and a priori MDP-CL), and the non-stationary opponent, AvgR(Opp). The symbol * indicates statistical significance using the Wilcoxon signed-rank test.

                    MDP-CL                       A priori MDP-CL
Opponent        AvgR(A)     AvgR(Opp)        AvgR(A)     AvgR(Opp)
Bully-Pavlov 1.74 2.03 1.89* 1.88
Bully-TFT 0.93 1.20 0.99* 1.03
Pavlov-Bully 1.79 2.12 1.89* 2.11
Pavlov-TFT 2.88 2.86 2.96* 2.95
TFT-Bully 1.76 2.17 1.86* 2.23
TFT-Pavlov 2.87 2.87 2.94* 2.94
Average 2.00 2.21 2.09 2.19
Figure 5.9: (a) Difference in cumulative rewards between incremental MDP-CL and MDP-CL against the opponent TFT-Bully-Pavlov-Bully; rewards increase against the second Bully. (b) Total variation distance of the learned model and the noisy representations (values close to zero represent more similarity with the real model) of TFT and Bully while using a priori MDP-CL.
5.3.4 Incremental models
Now we relax the assumption of starting the interaction with a set of known models. Thus, the
algorithm needs to learn these models through interaction, which is the objective of incremental MDP-CL. Moreover, it should detect a switch to a previously known strategy faster than learning it from scratch.
In Figure 5.9 (a), we depict the difference between the cumulative rewards of incremental MDP-CL and
MDP-CL against the opponent TFT-Bully-Pavlov-Bully (which changes from one strategy to another
every 150 rounds) in a game of 600 rounds. We selected this opponent since it uses the Bully strategy
on two occasions during the interaction. From the figure, we can observe that from round 0 to 450 the
score moves around 0; this means there is no difference in rewards between the approaches, since both
are learning the models TFT, Bully and Pavlov. Starting from round 450, the opponent returns to the
Bully strategy, which has been used previously (rounds 0 to 150); therefore incremental MDP-CL has
this model in its memory, and it is faster to detect it (approximately 20 rounds) than to relearn it (as
the original MDP-CL does). This is the reason why incremental MDP-CL increases its rewards
after round 470. This example shows how keeping a record of models increases the rewards when the
opponent reuses one of those previous models.
5.3.5 A priori noisy models
As a final experiment, we now assume a set of strategies that are only approximately similar to the real
ones. The objective is to analyze what happens when the given models are not perfect, which may
happen due to error (noise) or because the opponent is human or uses a hybrid (mixed) strategy. In order
to include noise in the models, we changed two transitions of the MDPs (that represent each strategy)
to random values. Now, a priori MDP-CL will start with a set of noisy models of {TFT, Pavlov, and Bully}.
In Figure 5.9 (b), we depict the TVD of a priori MDP-CL against the TFT-Bully opponent that
switches in the middle of the game; the TVD of both the non-noisy and the noisy models is shown. We can observe that the TVD is still capable of detecting which model is the correct one (values
closer to zero), even in the presence of noisy models. The difference is that in this case the TVD will
not reach the perfect score, since our models are not exact; here the best score is close to 0.2
(instead of 0.0).
In order to use the a priori and incremental algorithms when starting with a set of noisy models,
we only need to adjust the ρ parameter in Algorithm 4.2 to the desired value.
5.3.6 Summary
This section presented results for the case when the set of opponent strategies is available before the interaction.
Results show that a priori MDP-CL can reduce the learning time and increase the total rewards. We also
showed the advantages of incremental MDP-CL when the opponent reuses a strategy. Finally, we
showed that these algorithms still work when the given models are noisy (i.e., not exactly the same as
the real strategies). The next section presents experiments on drift exploration and the
R-max# algorithm.
5.4 Drift exploration
In Section 5.2.4 we presented initial experiments on drift exploration in MDP-CL. In this section we
include new experiments with our R-max# and R-max#-CL proposals. Comparisons are performed
against FAL, WOLF-PHC and R-max.
5.4.1 Settings and objectives
In this section, we compare our proposals that use drift exploration, MDP-CL(DE) (Section 4.3.1), R-max# (Section 4.3.2) and R-max#CL (Section 4.4.1), against MDP4.5, MDP-CL, R-max [Brafman
and Tennenholtz, 2003], FAL [Elidrisi et al., 2012], WOLF-PHC⁴ [Bowling and Veloso, 2002] and the
omniscient (perfect) agent that best responds immediately to switches. Results are compared in terms
of average utility over the repeated game. Experiments were performed on two different scenarios.
• Iterated prisoner’s dilemma (with values presented in Table 5.1). We used TFT, Pavlov and
Bully (Section 2.3.2) as opponent strategies. We emulate two di↵erent scenarios. In the first, the
opponent switches strategies deterministically every 100 rounds. The second scenario proposes
a more realistic opponent with non-deterministic switching times. In particular, we model a
probabilistic switching opponent that can switch strategies at any round. Here, each repeated
game consists of 1000 rounds. At every round the opponent either continues using the current
strategy with probability 1�⌘, or with probability ⌘ switches to a new strategy (drawn randomly
from the set of strategies mentioned before). We used switching probabilities ⌘ = 0.02, 0.015, 0.01
which translates to 10 to 20 strategy switches in expectation for one repeated game, this values
are enough to represent di↵erent behaviors in an interaction and are not excessive for the learning
algorithms to not able to learn correctly the opponent models.
• A negotiation task (Section 5.1.3). The opponent uses two different strategies: i) a fixed price strategy where the seller uses a fixed price Pf for the complete negotiation, and ii) a flexible price strategy where the seller initially values the object at Pf , but after round 2 it becomes more interested in selling the object, so it will accept an offer Pl < Pf . We represent that strategy by Pl = {x → y}. For example, the optimal policy against Pf = {8} is to offer 8 in the first round (recall the game is discounted by a factor δi in every round, so it is better to buy/sell sooner rather than later, and the buyer values the object at 10), receiving a reward of 2. However, against Pl = {8 → 6} the optimal policy is to wait until the third round to offer 6, receiving a reward of 4δi^2.
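As a concrete picture of the second iPD scenario above, the following sketch (a simplified illustration; the agent interface, strategy names and policy functions are hypothetical and not the thesis implementation) simulates the probabilistic switching opponent: at every round it keeps its current strategy with probability 1 − η, or redraws a strategy uniformly at random with probability η.

    import random

    STRATEGIES = ["TFT", "Pavlov", "Bully"]

    def play_repeated_game(agent, opponent_policies, eta=0.01, rounds=1000):
        # At every round the opponent keeps its strategy with probability
        # 1 - eta, or switches to a strategy drawn uniformly at random
        # with probability eta.
        current = random.choice(STRATEGIES)
        history = []
        for _ in range(rounds):
            if random.random() < eta:
                current = random.choice(STRATEGIES)
            opponent_action = opponent_policies[current](history)
            agent_action = agent.act(history)
            history.append((agent_action, opponent_action))
        return history

With η between 0.01 and 0.02 and 1000 rounds, this produces on the order of 10 to 20 switches per repeated game, as stated above.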
In all cases the opponents were non-stationary in the sense that they used different strategies for acting in a single repeated interaction. We present experiments against deterministic switching opponents (Sections 5.4.2, 5.4.3 and 5.4.4) and probabilistic switching opponents (Sections 5.4.2 and 5.4.5). We compare how drift exploration in MDP-CL(DE) improves the results over MDP-CL (Section 5.4.3), and how the parameters affect R-max# (Section 5.4.4) and R-max#CL (Section 5.4.5). Results in bold denote the best scores in the tables. Statistical significance is denoted with * and † symbols.
5.4.2 Drift and non-drift exploration approaches
In this section, we present a summary of the results for the two domains in terms of average rewards. Results show that approaches with drift exploration obtained better results than those without it.
4To simulate drift exploration there was a constant ε-greedy exploration and no decay in the learning rate.
Table 5.8: Average rewards of the proposed algorithms against an opponent with a probability η of changing to a different strategy at any round in the iPD domain. * and † represent statistical significance of R-max#CL with respect to MDP-CL and R-max#, respectively. The perfect agent best responds immediately after a switch and gives an upper bound on the maximum value that can be obtained.
Algorithm / η    0.02    0.015    0.01    Drift Exp.
Perfect 2.323 2.319 2.331 -
R-max#CL 2.051*† 2.079*† 2.086† Yes
MDP-CL(DE) 1.944 1.988 2.046 Yes
R-max# 1.691 1.709 1.725 Yes
WOLF-PHC 1.628 1.627 1.629 Yes
MDP-CL 1.696 1.790 1.841 No
MDP4.5 1.681 1.782 1.839 No
FAL 1.625 1.658 1.725 No
In the iPD we compared our proposals which use drift exploration: R-max# (τ = 55), MDP-CL(DE) (w = 30, Boltzmann exploration), and R-max#CL (τ = 90, w = 50), against the state-of-the-art approaches MDP-CL (w = 30), MDP4.5 (w = 30), FAL, WOLF-PHC (δw = 0.3, δl = 2δw, α = 0.8, ε-greedy exploration) and the perfect agent that knows exactly when the opponent switches and best responds immediately. In Table 5.8, we summarize the results showing the average rewards obtained by each agent against the probabilistic switching opponent for different values of η (switch probability). All the scores were obtained using the best parameters for each algorithm and the results shown are based on the average of 100 iterations. In all the cases, R-max#CL obtained better scores than the rest. An * indicates statistical significance against MDP-CL(DE) and a † against R-max#. MDP-CL(DE) obtains good results since it exploits the model and uses drift exploration. However, note that we fed it with the perfect window w so it could remain competitive. Using only R-max# is not as good, since it explores continuously but does not properly exploit the learned model; recall that R-max# will re-learn a model even when the opponent does not change its strategy (which may result in suboptimal rewards). WOLF-PHC shows almost the same performance against different switch probabilities; however, its results are far from the best. MDP-CL and MDP4.5 obtained better results than FAL, but since none of them use drift exploration they are not as good as our proposed approaches with drift exploration.
We performed a similar analysis for the negotiation domain. Table 5.9 shows the average rewards and percentage of successful negotiations obtained by each learning agent against a switching opponent. In this case each interaction consists of 500 negotiations. The opponent uses 4 strategies (Pf = {8}, Pf = {9}, Pl = {8 → 6}, Pl = {9 → 6}); the switching round and strategy were drawn from a uniform
Table 5.9: Average rewards and percentage of successful negotiations of the proposed algorithms against a probabilistic non-stationary opponent in the negotiation domain. The perfect agent best responds immediately after a switch and gives an upper bound on the maximum value that can be obtained.
Algorithm AvgR(A) SuccessRate Drift Exp.
Perfect 2.70 100.0 -
R-max#CL 1.95 74.9 Yes
R-max# 1.91 70.5 Yes
MDP-CL(DE) 1.73 82.0 Yes
WOLF-PHC 1.71 88.5 Yes
MDP-CL 1.70 85.5 No
R-max 1.67 90.6 No
distribution. R-max#CL obtained the best scores in terms of reward and R-max# obtained the second best rewards. In this domain MDP-CL(DE) and WOLF-PHC take more time to detect the switch and adapt accordingly, obtaining lower rewards. However, they have a higher percentage of successful negotiations than the R-max# approaches. In this domain, we note that not using drift exploration (MDP-CL and R-max) leads to failing to adapt to non-stationary opponents, which results in suboptimal rewards.
These results show the importance of performing drift exploration in different domains. In the next sections, we present a detailed analysis of MDP-CL(DE), R-max# and R-max#CL.
5.4.3 Further analysis of MDP-CL(DE)
The MDP-CL framework does not use drift exploration, which results in failing to detect some types of switches. We present two examples where MDP-CL(DE) is capable of detecting those switches. In Figure 5.10 (a), the cumulative rewards against a Bully opponent that switches to TFT at round 100 (deterministically) are depicted. In the first 100 rounds, the figure shows a slight cost associated with the drift exploration of MDP-CL(DE). After round 100, MDP-CL(DE) increases its rewards considerably since the agent has learned the new opponent strategy (TFT) and has updated its policy. In the negotiation domain a similar behavior happens when the opponent starts with a fixed price strategy (Pf = {8}) and switches at round 100 (deterministically) to a flexible price strategy (Pl = {8 → 6}). In Figure 5.10 (b), the immediate rewards of MDP-CL with and without drift exploration are depicted; we also depict the rewards of a perfect agent which best responds at every round. The figure shows that MDP-CL is not capable of detecting the strategy switch: from rounds 50 to 400 it uses the same action and therefore obtains the same reward. In contrast, MDP-CL(DE) explores with different actions (therefore it seems unstable) and due to the drift exploration it is capable of detecting
Figure 5.10: On top of each figure we depict how the opponent changes between strategies during the interaction. Cumulative rewards of (a) MDP-CL (w = 25) with and without drift exploration; the opponent is Bully-TFT switching at round 100. (b) Immediate rewards of MDP-CL, MDP-CL(DE) and a perfect agent (that best responds immediately after the switch) against a non-stationary opponent in the alternating-offers bargaining domain.
the strategy switch. However, it needs several rounds to relearn the optimal policy, after which it starts increasing its rewards at approximately round 175. After this round, the rewards keep increasing and eventually converge around the value of 3.0. This occurs because even when MDP-CL(DE) is using the optimal policy it keeps exploring.
In Table 5.10 we present results in terms of average rewards (AvgR) and percentage of successful negotiations (SuccessRate) while varying the ε parameter (of ε-greedy used as drift exploration) from 0.1 to 0.9 in MDP-CL(DE) in the negotiation domain. We used w = 35 since it obtained the best scores (we evaluated w = {20, 25, . . . , 50}) and a * represents statistical significance (using the Wilcoxon rank-sum test, 5% significance level) with respect to MDP-CL. These results indicate that using a moderate drift exploration (0.1 ≤ ε ≤ 0.5) increases the average rewards. A higher value of ε causes too much exploration; thus, the agent cannot exploit the optimal policy and results are worse than not using drift exploration. On the one hand, increasing ε improves the average rewards only for the successful negotiations; this happens because drift exploration leads to detecting the switch, which updates the immediate reward to 4.0 (after the switch). On the other hand, the number of successful negotiations is reduced with high values of ε. This is the common exploration-exploitation trade-off, which causes moderate ε values, in the range 0.3–0.5, to obtain the best average rewards.
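As a concrete illustration of this kind of undirected drift exploration, the sketch below (a minimal example assuming a tabular policy; the names are hypothetical and this is not the thesis implementation) layers ε-greedy action selection on top of the policy computed from the current opponent model: with probability ε the agent deviates from its best response, which is what eventually reveals switches that would otherwise remain hidden.

    import random

    def drift_exploration_action(state, policy, actions, epsilon=0.3):
        # epsilon-greedy drift exploration: mostly exploit the policy computed
        # from the current opponent model, but keep deviating with probability
        # epsilon so switches in unvisited parts of the state space get noticed.
        if random.random() < epsilon:
            return random.choice(actions)  # drift exploration step
        return policy[state]               # exploit the current best response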
These results show that adding a general drift exploration (for example with ε-greedy exploration) helps to detect switches in the opponent that otherwise would have passed unnoticed. However, as
Table 5.10: Comparison of MDP-CL and MDP-CL(DE) while varying the parameter ε (using ε-greedy as drift exploration) in terms of average rewards (AvgR), percentage of successful negotiations (SuccessRate). * indicates statistical significance of MDP-CL(DE) over MDP-CL using Wilcoxon rank-sum test.
ε AvgR(A) SuccessRate
0.1 1.796* 92.1
0.2 1.782* 91.0
0.3 1.776* 88.9
0.4 1.801* 86.1
0.5 1.753 83.5
0.6 1.694 80.7
0.7 1.619 78.2
0.8 1.506 73.7
0.9 1.385 68.1
Average 1.679 82.4
MDP-CL 1.726 88.9
we said before, parameters such as window size, threshold (of MDP-CL), and ε (for drift exploration) should be tuned in order to efficiently detect switches. In the next section, we analyze R-max#.
5.4.4 Further analysis of R-max#
We proposed another way of performing drift exploration using an implicit approach. R-max# has two main parameters: m, which sets how many visits a state-action pair needs before it is assumed to be known (as in R-max), and τ, which controls how many rounds must pass without an update before a state-action pair is considered forgotten. First we analyze the effect of τ. In Figure 5.11 (a) we present the cumulative rewards of R-max (dotted straight line) and R-max# with τ = 5 (thick line) and τ = 35 (solid line) against a Bully-TFT opponent; we used m = 2 for all the experiments since it obtained the best scores. For R-max#, τ = 5 makes the agent explore continuously, causing a decrease in rewards from rounds 20 to 100. However, from round 100 rewards immediately increase since the agent detects the strategy switch. Increasing the τ value (τ = 35) reduces the drift exploration and also reduces the cost in rewards before the switch (at round 100). However, it also impacts the total cumulative rewards, since it takes more time to detect the switch (and learn the new model). Here we see an important trade-off when choosing τ: a small value causes continuous exploration, which quickly detects switches but has a cost before the switch occurs; on the contrary, a large τ reduces the cost of exploration and therefore the switch takes more time to be noticed. It is important to note that R-max# is capable of detecting the switch in strategies, as opposed to R-max, which shows a linear result since it keeps
Figure 5.11: On top of each figure we depict how the opponent changes between strategies during the interaction. (a) Cumulative rewards against the Bully-TFT opponent in the iPD using R-max# and R-max. Immediate rewards of R-max# with (b) τ = 100, (c) τ = 60 and (d) τ = 140, and a perfect agent which best responds to the opponent in the negotiation domain.
acting against a Bully opponent when in fact the opponent is TFT.
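To illustrate the implicit drift exploration of R-max#, the sketch below (a simplified reading of the mechanism described above, not the thesis implementation; the data structures and names are hypothetical) resets the visit counts of state-action pairs that have not been updated for τ rounds, so they become "unknown" again, receive the optimistic R-max value, and are re-explored the next time the policy is recomputed.

    # Minimal sketch of the R-max# forgetting rule (hypothetical structures).
    # counts[sa]: number of observed visits for state-action pair sa.
    # last_update[sa]: round at which sa was last updated.
    def rmax_sharp_forget(counts, last_update, current_round, tau, m):
        # Forget every state-action pair not updated in the last tau rounds;
        # pairs below m visits are treated optimistically (as in R-max),
        # which forces re-exploration: the implicit drift exploration.
        for sa in list(counts):
            if current_round - last_update[sa] >= tau:
                counts[sa] = 0  # pair becomes "unknown" again
        known = {sa for sa, c in counts.items() if c >= m}
        return known

The trade-off discussed above appears directly in τ: a small τ forgets (and re-explores) often, a large τ keeps the model stable but delays noticing a switch.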
In Figure 5.11, we depict the immediate rewards of R-max# with (b) τ = 100, (c) τ = 60 and (d) τ = 140 against the opponent that changes from a fixed price to a flexible price strategy in the negotiation domain. In these figures we note that our agent starts with an exploratory phase which finishes at approximately round 25; then it uses the optimal policy to obtain the maximum possible reward. The opponent switches to a different strategy at round 100. A τ = 100 (Figure 5.11 (b)) means that those state-action pairs which after 100 rounds have not been updated will be reset. The agent will compute a new policy which will re-explore; this phase occurs from rounds 105 to 145. Now, the optimal policy is different (since the opponent has changed) and R-max# obtains a better reward. The previous case shows how R-max# with τ = 100, deliberately chosen to match the switching time of the opponent, can adapt to switching opponents and learn the optimal policy against each different strategy. In Figure 5.11 (c) and (d) we depict what happens when τ does not match the switching time. Using τ = 60 means that the agent will re-explore the state space more frequently, which explains the decrease in rewards from rounds 65 to 95 and 125 to 160 (since both are exploratory phases). The first exploratory phase (starting at round 65) shows what happens when no change has occurred: the agent returns to the same optimal policy. However, the second exploration phase shows a different result; starting from round 160 it has updated its optimal policy and can exploit it. Using τ = 140, Figure 5.11 (d), means less exploration, which can be seen in the stability from rounds 25 to 145; from that point to round 190 the agent re-explores and updates its optimal policy at round 190.
In Table 5.11, we depict different scores while using R-max# with different τ values against the switching opponent in the negotiation domain; an * indicates statistical significance with respect to R-max (using the Wilcoxon rank-sum test). A small τ value is not enough to learn an optimal policy, which results in a low number of successful negotiations. In contrast, a high τ produces too little exploration and takes too much time to detect the switch, which results in lower rewards. In this case a τ between 80 and 120 yields the best scores, because it is enough to learn an appropriate opponent model while providing enough exploration to detect switches. R-max obtained the best score in successful negotiations because it learns an optimal policy and uses it for the complete interaction, even when it is a suboptimal policy in terms of rewards for half of the interactions.
Comparison with R-max and WOLF-PHC. Previous experiments used only two strategies; now we increase the number to four different strategies in the negotiation domain to compare R-max#, R-max and WOLF-PHC. The strategies are (i) Pf = {8}, (ii) Pl = {8 → 6}, (iii) Pf = {9} and (iv) Pl = {9 → 6}. The opponent switches every 100 rounds to the next strategy. In Fig. 5.12 we depict (a) the immediate rewards and (b) cumulative rewards of R-max#, R-max and the perfect agent over 400 negotiations. Each opponent strategy is represented by a numbered zone (I-IV). Each curve is the
Figure 5.12: On top of each figure we depict how the opponent changes among strategies during the interaction. (a) Immediate and (b) cumulative rewards of R-max# (τ = 90) and R-max in the alternating-offers domain. (c) Immediate and (d) cumulative rewards of R-max# and WOLF-PHC.
Table 5.11: Comparison of R-max and R-max# with different τ values in terms of average rewards (AvgR), percentage of successful negotiations (SuccessRate). * indicates statistical significance with R-max.
τ AvgR(A) SuccessRate
20 1.101 61.2
40 1.375 61.4
60 1.643 71.6
80 2.034* 77.1
90 2.163* 80.4
100 2.164* 80.9
110 2.043* 81.0
120 1.942* 81.0
140 1.746 80.8
160 1.657 83.5
180 1.736 88.9
Average 1.782 77.0
R-max 1.786 90.6
average of 50 iterations.
In zone I (Figure 5.12 (a) and (b)), from rounds 0 to 25, R-max# and R-max explore and obtain low rewards; after this exploration phase they have an optimal policy and use it to obtain the maximum possible reward at that time (2.0). In zone II (at round 100) the opponent switches its strategy, and R-max does not detect this switch and keeps using the same policy. Since R-max# re-explores the state space (rounds 95 to 135), it obtains a new optimal policy, which in this case yields the reward of 4.0. We can see that during the second exploration phase of R-max# the cumulative rewards are lower than those of R-max. However, after updating its policy, cumulative rewards start increasing. From round 170 the rewards are greater than those of R-max. At the start of zone III (round 200) the opponent switches its strategy and in this case R-max is capable of updating its policy. This happens because there is a new part of the state space that is explored, but again at zone IV (round 300) the opponent switches to a flexible strategy, which R-max fails to detect and exploit. In contrast, R-max# is capable of detecting all switches in the opponent strategy and reaching the maximum reward. This can be easily noted in the difference in cumulative rewards at the final round (534.9 for R-max and 789.2 for R-max#).
In Figure 5.12 (c) and (d), we depict the immediate and cumulative rewards of R-max# and WOLF-PHC. Even though WOLF-PHC is capable of adapting to non-stationary changes, in this case the action space is too large for it to adapt quickly to all changes. In terms of cumulative rewards WOLF-PHC obtained 647.3, which is better than R-max but still far from R-max#.
Table 5.12: Average rewards (AvgR) of R-max#CL and R-max# with different τ values, and percentage of successful negotiations (SuccessRate).
R-max# R-max#CL w = 50
τ AvgR(A) SuccessRate AvgR(A) SuccessRate
60 1.791 70.1 1.697 70.1
80 1.833 69.5 1.919 70.7
90 1.716 65.9 1.898 70.3
100 1.910 70.5 1.948 71.9
110 1.875 67.3 1.947 71.1
120 1.767 66.9 1.943 72.8
140 1.847 68.3 2.110 75.1
Average 1.820 68.4 1.923 71.7
5.4.5 Efficient exploration + switch detection: R-max#CL
Previously we analyzed the effect of adding a general drift exploration to MDP-CL; now we experiment with R-max#CL, which combines an efficient drift exploration with the MDP-CL framework. First, we test the approach against the opponent that switches strategies deterministically every 100 rounds in the iPD in the following way: TFT-Bully-Pavlov-Bully-TFT-Pavlov-TFT, which covers all possible ordered pairs of the three strategies used. The duration of the repeated game is 700 rounds. The immediate rewards (average of 10 iterations) are depicted in Figure 5.13. As a comparison, the rewards of a perfect agent that best responds immediately to the switches are depicted as a dotted line. In Figure 5.13 (a) MDP-CL shows a stable behavior since it learns the opponent model and immediately exploits it, but since it lacks drift exploration it fails to detect some strategy switches (the change from Bully to TFT at round 400 in Fig. 5.13 (a)). In contrast, R-max# shows peaks throughout the complete interaction since it is continuously exploring. Its advantage is that it detects all strategy switches, such as Bully to TFT. In Figure 5.13 (c) we depict the results for WOLF-PHC, which in this domain is capable of adapting to the changes, albeit slowly. In Figure 5.13 (d), we depict the immediate rewards of R-max#CL with w = 50, τ = 90 (since these parameters obtained the best scores). This approach is capable of learning and exploiting the opponent model while keeping a drift exploration which enables the agent to detect switches. This experiment clearly shows the strengths of our approach: an efficient drift exploration and a rapid adaptation, which result in higher utilities.
Lastly, we show how R-max#CL fares against a randomly switching opponent in the negotiation domain. The interaction consists of 500 negotiations and the opponent uses 4 strategies (Pf = {8}, Pf = {9}, Pl = {8 → 6}, Pl = {9 → 6}); the switching round and strategy were drawn from a uniform
Figure 5.13: On top of each figure we depict the opponent TFT-Bully-Pavlov-Bully-TFT-Pavlov-TFT that
switches every 100 rounds in the iPD. Immediate rewards of (a) MDP-CL, (b) R-max#, (c) WOLF-PHC and
(d) R-max#CL.
distribution. We compared the R-max# and R-max#CL approaches while varying the τ parameter; for R-max#CL we used w = 50 since it obtained the best results. A summary of the experiments is presented in Table 5.12, where each value is the average of 50 iterations.
As a baseline we used R-max in this setting (not shown in the table). We varied the parameter m from 2 to 20 and selected the best score (m = 4); nevertheless, its results were the worst, with average rewards of 1.675, far from the average rewards of 1.820 and 1.923 obtained by R-max# and R-max#CL, respectively. Moreover, almost all configurations obtained statistically significantly better results than R-max. From Table 5.12, we can see that R-max#CL with almost any value of τ provides better results than R-max# (although the differences are not statistically significant, using the Wilcoxon rank-sum test at a 5% significance level) against a randomly switching opponent. This can be explained by the fact that R-max#CL combines an efficient drift exploration with a switch detection mechanism.
5.4.6 Summary
We tested two domains in which drift exploration is necessary to obtain an optimal policy, due to the non-stationary nature of the opponent's strategy. We presented scenarios where the use of switch detection mechanisms, such as MDP-CL, FAL or MDP4.5, was not enough to deal with switching opponents (Sections 5.4.2 and 5.4.3). When keeping a non-decaying learning rate and constant exploration, WOLF-PHC is capable of adapting in some scenarios; however, it does so slowly (Section 5.4.4). The general approach of drift exploration, by means of ε-greedy or some softmax type of exploration, solves the problem since this exploration revisits parts of the state space, which eventually leads to detecting switches in the opponent strategy (Section 5.4.3). However, the main limitation is that these explorations need to be tuned for each specific domain and are not very efficient, since they explore in an undirected way, without considering which parts of the state space need to be revisited. Our approach, R-max#, which implicitly handles drift exploration, is generally better equipped to handle non-stationary opponents of different sorts. Its pitfall lies in its parameterization (parameter τ), which generally should be large enough so as to learn a correct opponent model, yet small enough to react promptly to strategy switches (Section 5.4.4). In realistic scenarios where we do not know the switching time of a non-stationary opponent, it is useful to combine both approaches, switch detection and implicit drift exploration, as can be seen in R-max#CL (Section 5.4.5).
5.5 DriftER
MDP-CL(DE) is capable of detecting switches; however, it needs to wait for a window of w interactions. R-max# performs an implicit switch detection and similarly needs to wait for switches depending on its τ parameter. The joint approach of R-max#CL provides good empirical results; however, its
Figure 5.14: Upper value of confidence interval over error probabilities (fupper) of a learning algorithm with no switch detection (blue line) and DriftER (black line) against an opponent that changes between two strategies in the middle of the interaction (vertical bar); small arrows represent the DriftER learning phase after detecting the switch.
parameter tuning is time consuming. This section presents experiments performed with DriftER, an approach based on keeping track of the error of the opponent model. Additionally, it can check for switches at every step of the interaction. We present experiments in two domains: first in repeated games, such as the battle of the sexes, to analyze how DriftER behaves against non-stationary opponents. Then DriftER is evaluated in the PowerTAC domain, particularly in the wholesale market, where it is compared against a previous champion of the competition and the MDP-CL approach.
5.5.1 Setting and objectives (repeated games)
A well known game from GT is the battle of the sexes (BoS). This is a two-player coordination game, presented in Table 5.2 and Section 5.1.5. The opponent has different strategies to use that come from the GT literature: pure Nash equilibria, mixed Nash equilibria and the minimax strategy (see Section 2.3). It can change from one to another during the interaction, and DriftER's goal is to adapt as fast as possible to these switches.
5.5.2 Switch detection
We start by comparing the behavior of a learning agent that does not include a switch detection mechanism and DriftER against a non-stationary opponent that switches in the middle of the interaction, in a repeated BoS game of 1500 rounds. The opponent has two possible actions (O, F) and starts with a mixed strategy of [0.9, 0.1], then changes to a pure strategy [0.0, 1.0] (which is the pure Nash equilibrium that is most beneficial to the opponent).
Figure 5.14 depicts the upper value of the confidence over the error (fupper) against the switching opponent for a learning algorithm without a switch detection mechanism (thin line) and DriftER (thick
Figure 5.15: Fraction of times a switch was detected with different parameters of DriftER against a non-stationary opponent that changes from a maximin strategy to a mixed Nash strategy in round 750.
line). Assume the learning agents obtain an opponent model in the first 200 rounds; from that point on they compute the error and confidence intervals. Since the opponent uses a stochastic policy before the switch, there is a prediction error during the interaction: the agent predicts the opponent will use one action (O), but with probability 0.1 the opponent chooses F. At round 750 (marked with a vertical line) the opponent changes to a pure Nash equilibrium action, which is to use F in every round. This consistently increases the prediction error for the agent without switch detection; in contrast, DriftER detects the switch and starts a learning phase (marked with small vertical arrows), after which its error decreases consistently.
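The sketch below is a simplified illustration of this mechanism (the Hoeffding-style confidence bound, the function names and the consecutive-increase rule are assumptions for illustration, not the exact DriftER procedure): it tracks the upper confidence value over the model's error rate (fupper) and flags a switch after n consecutive increases.

    import math

    def upper_error_bound(errors, total, delta=0.05):
        # Empirical error rate plus a Hoeffding-style confidence term (fupper).
        if total == 0:
            return 1.0
        return errors / total + math.sqrt(math.log(1.0 / delta) / (2.0 * total))

    def switch_detected(f_upper_history, n=5):
        # Signal a switch after n consecutive increases of fupper.
        if len(f_upper_history) <= n:
            return False
        recent = f_upper_history[-(n + 1):]
        return all(recent[i] < recent[i + 1] for i in range(n))

Since the bound is recomputed every round, this kind of check runs at every step of the interaction, which is what distinguishes DriftER from window-based detection.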
DriftER parameter behavior
We now present how the parameter n (which sets the number of prediction errors allowed) affects DriftER against switching opponents. The opponent starts with a mixed Nash equilibrium strategy [0.2, 0.8] and changes to a minimax strategy ([0.8, 0.2]) in the middle of the BoS game with values v1 = 100, v2 = 25. We present experiments varying the parameter n with values {2, 4, 8, 20} in a game of 1500 rounds. For each n value we keep track of the round when a switch was detected (100 iterations). In Figure 5.15 we depict a histogram showing the fraction of times a switch was detected in each interval of the game. From the figure we note that
choosing a small value (2 in this case) may cause switches to be detected erroneously (small red bars). A higher value reduces the errors and still obtains a fast switch detection. If we increase the value to 8, the errors are almost reduced to zero; however, it may take more rounds to detect the switch. A large value (20 in this case) increases the number of rounds required to detect a switch in the opponent's policy.
In Section 5.6.3, we present a set of experiments using general-sum games and comparisons with
the rest of our approaches. However, we wanted to test our approach in a more realistic domain where
the environment is more complex and there is uncertainty. Thus, we first present DriftER in the
context of double auctions in PowerTAC.
5.5.3 Setting and objectives (double auctions)
First we present experiments showing how the policies need to adapt to non-stationary opponents in this domain. Then, experiments are presented against a non-stationary opponent that uses different fixed prices, and then against noisy non-stationary opponents whose limit prices are drawn from a probability distribution over an interval. The opponent we designed is non-stationary in the sense that it uses two stationary strategies: it starts with a fixed limit price Pl and then, in the middle of the interaction, changes to a different (higher) fixed limit price Ph. The timestep at which the opponent switches is unknown to the other broker agent.
The MDP for all learning agents models the opponent with the following parameters: the number of states was set to |s| = 6, and the actions represent limit prices with the values in {15, 20, 25, 30, 35}. The opponent started with Pl = 20 and then changed to Ph = 34. In the first case, the learning agent needs to bid using a price > 20 (25, 30, 35) to win bids. Later, when the opponent uses a limit price of 34, the only bid that will be accepted by the producer is 35. Both the learning agent and the opponent have a fixed demand which is greater than the average energy needed to supply all buyers.
We compare the performance of DriftER (with parameter n = 7) against TacTex-WM5, the champion of the 2013 competition, and MDP-CL, which is not specific to PowerTAC but is designed for non-stationary opponents.
We present results in terms of average accuracy, confidence over error rate and profit. The learned
MDP contains a transition function for each (s, a) pair, comparing the predicted next state with the
real (experienced) state gives an accuracy value. At each timestep the agent submits nbids bids and
its learned model predicts if those bids will be cleared or not. When the timestep finishes it receives
feedback from the server and compares the predicted with the real transactions. An average of those
nbids predictions is the average accuracy of each timestep. A value of 1.0 equates to perfect prediction.
A similar measure is confidence over error-rate, as described in Section 4.5.3. Finally, profit is defined
5TacTex-WM is the part of TacTex applied only to the wholesale market.
Figure 5.16: Upper value of confidence interval over error probabilities (fupper) of (a) TacTex-WM and (b) MDP-CL compared with DriftER. The timeslot where the opponent switches strategies is denoted with a vertical line. DriftER is capable of detecting the switch and adapting to the new opponent strategy.
in PowerTAC as the income minus the costs (balancing, wholesale, and tariff markets). We used default parameters for all other settings in PowerTAC.
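As a small illustration of the accuracy measure described above (a sketch with hypothetical names, not the broker implementation), each timeslot's average accuracy is simply the fraction of the submitted bids whose predicted clearing outcome matched the outcome reported by the server.

    def timeslot_accuracy(predicted_cleared, actual_cleared):
        # Fraction of the nbids submitted bids whose predicted clearing
        # outcome (True/False) matches the server feedback; 1.0 is perfect.
        assert len(predicted_cleared) == len(actual_cleared) > 0
        matches = sum(p == a for p, a in zip(predicted_cleared, actual_cleared))
        return matches / len(predicted_cleared)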
The experiments we designed focused only on the wholesale market of PowerTAC. However, PowerTAC also includes another type of market that we cannot disregard (the tariff market). We therefore fix a strategy for this market so that a single flat tariff is published and stays constant throughout the experiments.
5.5.4 Fixed non-stationary opponents
Now we compare the learning algorithms in terms of confidence over the error rate against the switching opponent. In Figure 5.16 (a) the upper value of the confidence over the error rate for a single interaction of TacTex-WM and DriftER is shown. After round 100, when the opponent changes its strategy, the error rate of TacTex-WM increases because it does not adapt to the opponent. In contrast, DriftER stops using its learned policy at timeslot 110 and restarts the learning phase, which ends at timeslot 135. At timeslot 135, DriftER has high confidence over the error rate (since it is a new model) and the error rate shows a peak. At this point, DriftER has learned a new MDP and a new policy which reduces the error rate. The upper values of confidence over the error rate of MDP-CL and DriftER are shown in Figure 5.16 (b). Both algorithms can detect the opponent's strategy change, but MDP-CL performs comparisons to detect switches every w steps (w = 25 in this case) and it must wait n · w (n = 1, 2, . . . ) timeslots, in contrast to the timeslot-by-timeslot switch detection of DriftER.
Additional experiments were performed to tune the w parameter. However, optimizing these parameters is time consuming since w ∈ N and threshold ∈ R; w = 25 was selected as the best value
Figure 5.17: Profits (€), where higher is better, of (a) TacTex-WM, (b) MDP-CL and (c) DriftER against the non-stationary opponent in a PowerTAC competition of 250 timesteps. Neither TacTex-WM nor MDP-CL is capable of increasing its profits after the opponent switch (vertical line), since they do not adapt to switches as fast as DriftER does.
(based on accuracy) for setting threshold = 0.05. In the next section we directly compare MDP-CL and DriftER against switching opponents.
5.5.5 Detecting switches in the opponent
Now we compare MDP-CL and DriftER since both approaches handle non-stationary opponents. We
measure the average number of timeslots needed to detect the switch, the average accuracy and the
traded energy as a measure of indirect cost provided by PowerTAC (the more time it takes to detect
the switch, the less energy the agent successfully buys). Table 5.13 reports the results for MDP-CL
(using w = 25) and DriftER. The competition lasted 250 timesteps and the opponent switched at
Table 5.13: Average timeslots for switch detection (Avg. S.D. Time), accuracy, and traded energy of the
learning agents against a non-stationary opponent.
Avg. S.D. Time Accuracy Traded E.
MDP-CL 85.0 ± 55.0 57.55 ± 28.56 2.9 ± 1.3
DriftER 33.2 ± 13.6 67.60 ± 21.21 4.4 ± 0.5
Table 5.14: Average profit of the learning agents against non-stationary opponents with and without noise.
TacTex-WM MDP-CL DriftER
Agent Opp Agent Opp Agent Opp
Fixed NS 219.0 ± 7.5 228.7 ± 31.7 261.3 ± 65.8 270.1 ± 75.5 263.0 ± 38.9 228.7 ± 64.2
Noisy NS 198.0 ± 41.3 197.6 ± 24.78 260.1 ± 75.0 305.6 ± 41.18 265.9 ± 39.9 229.0 ± 38.2
timestep 100. Results are averaged over 10 independent trials. Results show that DriftER needs less time to detect switches, obtaining better accuracy (explained by the fast switch detection), and as a result is capable of trading more energy.
To further evaluate the algorithms and perform a fair comparison in terms of profit, we implemented the same strategy in the tariff market for all learning algorithms. Figure 5.17 shows the cumulative profit of (a) TacTex-WM, (b) MDP-CL, and (c) DriftER against the same non-stationary opponent. The timeslot where the opponent switches strategies is indicated with a vertical line. From these figures we note that TacTex-WM's profits increase before the opponent switch and then decrease after it; at the end of the experiment, TacTex-WM and the opponent obtain a similar profit. MDP-CL was capable of detecting switches but took more time and obtained less profit than TacTex-WM. In contrast, DriftER's profits increase again after the switch. In terms of cumulative profits DriftER obtained 80k € more than the opponent.
5.5.6 Noisy non-stationary opponents
In the previous experiments the opponent switched between two fixed strategies. In this section we present a better approximation to real-world strategies. The opponent has a limit price Pl = 20.0 with a noise of ±2.5 (bids are in the range [18.5, 22.5]). Then, it switches to Ph = 34.0, with bids in the range [31.5, 36.5]. The rest of the experiment remains the same as in the previous section.
Table 5.14 shows the total profits of the learning agents against the non-stationary opponents with and without noise, averaged over 10 independent trials. When the opponent uses a range of values
to bid, TacTex-WM’s profits are reduced while its standard deviation is increased. MDP-CL shows
competitive profit scores with fixed opponents, but against a noisy opponent MDP-CL obtained lower
scores than the opponent. DriftER shows the best score in profit against fixed opponents and is the
only algorithm of the three able to score better than this noisy, non-stationary opponent.
5.5.7 Summary
This section presented experiments with DriftER in repeated games and in the PowerTAC simulator. In PowerTAC the opponent is non-stationary in the sense that it changes its limit price during the interaction. DriftER was tested against the champion of the inaugural competition and against MDP-CL, obtaining better results in terms of profit against switching opponents. In the next section we present
the last set of experiments with comparisons of our three main proposals and WOLF-PHC in the
context of randomly generated repeated games.
5.6 Non-stationary game theory strategies
This section presents experiments using our proposals: MDP-CL, R-max# and DriftER on GT games.
5.6.1 Setting and objectives
Common strategies to be used in GT include playing pure and mixed Nash equilibria, maximin strate-
gies (Section 2.3) and fictitious play [Brown, 1951]. Thus, the proposed opponents will use those
strategies and will switch among them during the interaction.
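As an illustration of the last of these opponent strategies, the sketch below (a standard textbook formulation of fictitious play with hypothetical names; the payoff access and tie-breaking are assumptions) keeps empirical counts of the other player's past actions and best responds to the resulting empirical mixed strategy.

    # Sketch of a fictitious-play opponent (hypothetical payoff access).
    from collections import Counter

    def fictitious_play_action(observed_actions, my_actions, payoff):
        # Best respond to the empirical distribution of the other player's
        # actions; payoff(my_action, their_action) returns this player's utility.
        counts = Counter(observed_actions)
        total = sum(counts.values())
        if total == 0:
            return my_actions[0]  # arbitrary first move
        freqs = {a: c / total for a, c in counts.items()}
        expected = {m: sum(freqs[a] * payoff(m, a) for a in freqs)
                    for m in my_actions}
        return max(expected, key=expected.get)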
The experiments are divided into two parts: first we analyze how MDP-CL, R-max# and DriftER behave in the BoS game against opponents that switch between pure and mixed strategies. Then, we use our approaches in general-sum games with more than two actions. As a comparison we present results with the WOLF-PHC algorithm (Section 3.3.2). We performed experiments (presented in Appendix C.2) to select the parameter m = 10 for the rest of the section.
5.6.2 Battle of the sexes
We analyze our approaches (DriftER, MDP-CL and R-max#) and compare them with WOLF-PHC in the BoS game with values v1 = 100, v2 = 25. The game consisted of 3000 rounds; the opponent switched strategies at rounds 1000 and 2000. The opponent starts with a pure Nash equilibrium [1.0, 0.0], then switches to a minimax strategy [0.8, 0.2] and finally changes to a different pure Nash equilibrium [0.0, 1.0]. Results are the average of 100 iterations.
First we present experiments with DriftER (n = 5) and WOLF-PHC. In Figure 5.18 (a) we depict the immediate rewards of both approaches against the non-stationary opponent in the BoS game. The opponent starts with a pure Nash strategy, which DriftER quickly learns (in less than 50 rounds). In contrast, WOLF-PHC needs more time to converge to the best action (approximately 120 rounds). Moreover, since it needs to keep exploring it never reaches the best possible score (25). At round 1000 the opponent changes to a mixed strategy and both algorithms adapt correctly to the opponent,
Figure 5.18: Rewards obtained by (a) DriftER, (b) MDP-CL, (c) R-max# and WOLF-PHC in the BoS game
against a non-stationary opponent that uses pure Nash and mixed Nash in a game of 3000 rounds. Switches
happen every 1000 rounds.
even though DriftER does not play a mixed policy, it is capable of obtaining rewards similar to WOLF-PHC. At round 2000 the opponent changes to a different pure strategy. DriftER is capable of quickly adapting its model and its policy, obtaining the best possible score. WOLF-PHC takes more rounds to adapt and does not completely exploit the opponent.
In Figure 5.18 (b) we depict the rewards for MDP-CL (w = 160) and WOLF-PHC. In contrast to DriftER, which uses R-max exploration, MDP-CL uses a fixed window of exploration and in this case takes more time to exploit the model. When the opponent switches at round 1000 it is capable of adapting. At round 2000 the opponent changes again and we can observe that MDP-CL adjusts in steps until finally converging to the new opponent model. These steps happen every w rounds, when the model comparison is performed to detect switches.
In Figure 5.18 (c) we depict the rewards for R-max# (τ = 1000). The approach is capable of quickly learning the opponent model (as DriftER does) and can also learn a new opponent model quickly. However, since the approach uses an implicit switch detection mechanism, in some cases it takes more time to adjust (as with the second switch of the opponent).
5.6.3 General-sum games
We tested our proposals and WOLF-PHC in repeated games with the following conditions: at least one pure and one mixed Nash equilibrium, different from each other, and ≥ 2 actions. The values of the games are in the range [−100, 100] and they are shown in Appendix B.
The setting is one opponent that starts playing a pure Nash equilibrium, then changes to a mixed Nash strategy and finally uses a fictitious play strategy. Switches happen every 1000 or 2000 rounds; the game consists of 3000 or 6000 rounds, respectively. An * indicates statistical significance of the algorithm with respect to WOLF-PHC (using the Wilcoxon rank-sum test, 5% significance level).
Average rewards of the algorithms against non-stationary opponents with two different switching frequencies are depicted in Table 5.15. Results show that DriftER (n = 5, w = 200) obtained on average better results than WOLF-PHC. When the switching frequency was 1000 rounds, only one result is statistically significant. In contrast, when the switching frequency increases to 2000 rounds, DriftER can exploit the model for more rounds and the difference with WOLF-PHC increases.
We selected the best parameters for MDP-CL: w = 120, threshold = 0.15 when the switch frequency is 1000, and w = 150 when the switch frequency is 2000. When the switch frequency is 1000 rounds MDP-CL is comparable with WOLF-PHC in games 1-3. In game 4 MDP-CL results drastically decrease because the threshold parameter is sensitive to the number of actions. The reason is that MDP-CL computes a distance that averages over all actions, so the difference between MDPs becomes smaller (in game 4 there are 5 actions). When using a suitable value (threshold = 0.01) we get results (marked with †) comparable to WOLF-PHC. When the switch frequency increases to 2000 rounds, for games 1 to 3 MDP-CL obtained
Table 5.15: Average rewards of our proposed approaches and WOLF-PHC against non-stationary opponents in four random repeated games. * indicates statistical significance with respect to WOLF-PHC (using the Wilcoxon rank-sum test). † indicates that the value was obtained with a different parameter for MDP-CL.
Game Id DriftER MDP-CL R-max# WOLF-PHC Switch freq.
1 35.69 ± 1.29 35.24 ± 1.51 38.00 ± 0.65* 35.14 ± 1.11 1000
2 58.03 ± 1.54* 57.30 ± 0.67* 47.27 ± 0.80 56.21 ± 1.71 1000
3 71.76 ± 2.05 75.34 ± 1.70* 73.85 ± 1.77* 71.68 ± 1.46 1000
4 68.22 ± 2.19 67.53 ± 5.74† 71.78 ± 0.98 68.03 ± 5.06 1000
Avg 58.42 ± 1.77 58.85 ± 2.40 57.72 ± 1.05 57.77 ± 2.33 1000
1 37.72 ± 2.19 37.76 ± 0.41* 40.74 ± 2.83* 35.60 ± 0.68 2000
2 60.32 ± 0.85* 59.78 ± 0.33* 45.46 ± 0.67 57.58 ± 1.06 2000
3 75.61 ± 1.95* 74.63 ± 1.95* 73.80 ± 1.77 72.87 ± 0.92 2000
4 74.19 ± 0.74* 70.05 ± 7.05† 69.89 ± 1.06 70.94 ± 2.58 2000
Avg 61.96 ± 0.74* 60.55 ± 2.43 57.47 ± 1.58 59.25 ± 1.31 2000
better scores (statistically significant) than WOLF-PHC; this happens because MDP-CL is capable of detecting switches quickly and can exploit the model for more rounds. In game 4, however, the results decrease considerably.
R-max# was also compared in this setting using τ = 1000. When the switch frequency was 1000 steps, it was capable of adapting to the switches and obtained on average results comparable to WOLF-PHC. When the switch frequency was increased to 2000 steps, R-max# results were not as good, since it performs exploration more frequently than the opponent changes, which results in lower rewards.
5.7 Summary of the chapter
This chapter presented experiments in five different domains: prisoner's dilemma, multiagent prisoner's dilemma, alternating-offers bargaining, double auctions in PowerTAC and randomly generated repeated games. First, the MDP4.5 and MDP-CL approaches were compared to a reinforcement learning algorithm for non-stationary environments. Results show that our proposals are capable of learning in repeated games with comparable results, with the advantage of faster computation and an online learning approach. Extensions that handle a priori information, as well as not forgetting previously learned models, were evaluated, showing the advantage of both extensions to MDP-CL. Later we tested two different domains in which drift exploration is necessary to obtain an optimal policy: the iPD problem and the negotiation task. In both scenarios, the use of switch detection mechanisms, such as MDP-CL, FAL or MDP4.5, was not enough to deal with switching opponents. The general approach of drift exploration, by means of ε-greedy or some softmax type of exploration, solves the
problem, since this exploration revisits parts of the state space, which eventually leads to detecting switches in the opponent strategy. Our approach, R-max#, which implicitly handles drift exploration, is generally better equipped to handle non-stationary opponents of different sorts. Its pitfall lies in its parameterization (parameter τ), which generally should be large enough so as to learn a correct opponent model, yet small enough to react promptly to strategy switches. In realistic scenarios where we do not know the switching time of a non-stationary opponent, it is useful to combine both approaches, switch detection and implicit drift exploration, as can be seen in R-max#CL. Next, we presented experiments showing that DriftER can be used in a realistic domain (double auctions inside the PowerTAC simulator). DriftER obtained better scores than MDP-CL and showed robust results against noisy opponents. We conclude with a set of experiments in repeated games comparing our different approaches with WOLF-PHC, a state-of-the-art algorithm for non-stationary strategies in repeated games. To summarize, our approaches were capable of exploiting the opponent model and reacting quickly to strategy changes. The next chapter concludes with the contributions of this research and ideas for future work.
Chapter 6
Conclusions and Future Research
In this chapter we present a summary of the proposed algorithms, conclusions, the contributions of
this thesis and outline open questions and ideas for future research. We conclude with the list of the
publications derived from this research.
6.1 Summary of the proposed algorithms
In Table 6.1 we present a summary of our proposals in terms of theoretical guarantees, how the switch detection is performed against non-stationary opponents, and their advantages and limitations. MDP4.5 uses decision trees to model the opponent. Its main advantage is that the learned model can be analyzed and interpreted easily. However, it may not be the best option against stochastic opponents. MDP-CL learns models using MDPs; its advantages are that it can use a priori models and that drift exploration can be easily added. However, its threshold parameter is sensitive to the number of actions. R-max# is an algorithm with theoretical guarantees for switch detection and near-optimal rewards under certain conditions. However, it does not detect switches explicitly. R-max#CL is the approach which obtained the best scores empirically in two domains; however, it requires setting five parameters and its guarantees are not proved. Finally, DriftER uses a single learned model as a method for switch detection, which is simpler than the approaches that learn different models. DriftER also provides guarantees for switch detection with high probability. However, a current limitation is that it cannot use a priori models nor keep models in case the opponent returns to a previous strategy. It is left as open work to extend DriftER with an approach similar to the one used for a priori MDP-CL.
6.2 Conclusions
Now, we provide some final remarks about this research.
Table 6.1: A comparison of our proposals in terms of guarantees, switch detection, advantages and limitations.
MDP4.5. Theoretical guarantee of switch detection: No. Switch detection: periodical comparison between decision trees. Advantages: the learned model can be translated into rules. Limitations: the transformation from DTs to MDPs increases the state space; not the best model against stochastic opponents.
MDP-CL. Theoretical guarantee of switch detection: No. Switch detection: periodical comparison between MDPs. Advantages: more information to detect a switch; it is possible to add drift exploration and to use a priori models. Limitations: parameter sensitive to the number of actions; no theoretical guarantees.
R-max#. Theoretical guarantee of switch detection: Yes. Switch detection: implicit switch detection. Advantages: few parameters, algorithm easy to understand; performs efficient drift exploration. Limitations: does not detect switches explicitly.
R-max#CL. Theoretical guarantee of switch detection: No. Switch detection: implicitly by relearning models and explicitly by periodical comparison between MDPs. Advantages: obtained the best scores empirically in two domains. Limitations: parameter tuning.
DriftER. Theoretical guarantee of switch detection: Yes. Switch detection: tracking the predictive error of the learned model. Advantages: simple detection method, independent of the domain. Limitations: cannot use a priori models or keep a history of models.
• Some assumptions about the nature of the non-stationarity of the opponents must be made; otherwise, it would be impossible to propose a general algorithm for all cases. Our assumption is that the opponent will switch among several stationary strategies during the interaction. This makes it possible to learn how to act optimally against it.
• We proposed methods that are based on computing a model of the opponent and tracking that model for possible changes. Thus, an effective representation of the opponent model (the attributes used to describe it) is of great importance to achieve: 1) a policy against the opponent and 2) an efficient switch detection method.
• Knowing beforehand the set of strategies that the opponent may use is an assumption that may not hold in several domains. However, this research has shown that knowing these models accelerates the process of opponent strategy detection. Also, keeping models after a switch can be helpful later, for example to guide exploration. In particular, we are further exploring this issue using a Bayesian approach for tracking and identifying opponent switches [Hernandez-Leal et al., 2016].
• When an opponent switches between strategies, some switches will pass unnoticed (shadowing effect) unless exploration is applied. This type of exploration (coined drift exploration) needs to explore with actions that differ from the computed optimal policy. Thus, a trade-off appears: exploring may reduce the immediate rewards, but detecting a switch to another opponent strategy may increase the rewards in the long term. A more extensive analysis of this situation providing theoretical guarantees is still an open problem.
• Theoretical results for opponent switch detection are important to build robust algorithms. We provided two results in this context, for R-max# and DriftER. However, their main limitation is that the assumptions made about the opponent's behavior may not hold in some domains.
6.3 Contributions
We contribute to the state of the art with different algorithms for learning against non-stationary opponents, providing empirical results in five domains and theoretical guarantees for two algorithms. In detail, the contributions are:
• A framework for learning against non-stationary opponents in repeated games. This framework uses windows of interactions to learn a model of the opponent. The learned model is used to compute an optimal policy against that opponent. Different models are learned throughout the repeated game and comparisons between models detect a switch in the opponent. The approaches
were evaluated empirically with two different implementations, MDP4.5 and MDP-CL, against a reinforcement learning technique for non-stationary environments.
• Two extensions of MDP-CL were presented. A priori MDP-CL assumes that the set of strategies the opponent may use is known beforehand and detects which one the opponent is using. Incremental MDP-CL keeps a record of the learned models and will not discard them if the opponent returns to a previously used strategy. The approaches were evaluated empirically in the iterated prisoner's dilemma.
• Drift exploration is proposed as an exploration mechanism for detecting opponent switches that would otherwise pass unnoticed. We evaluated the approach experimentally by using drift exploration in MDP-CL on two domains: the iPD and a negotiation task.
• In the context of drift exploration, R-max# was proposed. Its roots come from R-max, but it differs in that it keeps learning a model continuously by forgetting state-action pairs that have not been updated recently. We provide theoretical results showing that R-max# will perform optimally under certain assumptions. Moreover, using R-max# with MDP-CL results in R-max#CL, which obtained the best results on two domains since it combines an efficient exploration with a switch detection mechanism.
• Finally, we proposed DriftER, an algorithm that learns a model of the opponent and keeps track of its error rate. When the error rate increases for several timesteps, the opponent is assumed to have changed strategy and a new model must be learned. We provide a theoretical result ensuring that DriftER will detect opponent switches with high probability when its parameters are set correctly. Results on repeated games and the PowerTAC simulator show that DriftER is capable of detecting switches in the opponent faster than state-of-the-art algorithms.
During this thesis we made some assumptions about our settings; therefore, there are still several open questions, which are discussed in the next section.
6.4 Open questions and future research ideas
We propose five different ideas for future research that are worth pursuing.
• Learning opponents. Throughout this thesis we assumed that opponents used a set of strategies and switched among them during the interaction. However, we did not consider opponents that use learning strategies [Bowling and Veloso, 2002]. In this case care must be taken: if both agents are learning at the same time, they may learn noise [HolmesParker et al., 2014].
• Not knowing the representation of the opponent. In order to learn the opponent model we assumed a representation which in most cases can correctly describe the opponent. In order to relax this assumption we must learn the model and, at the same time, the correct representation. A recent area whose aim is to learn without putting effort into designing the correct representation is deep learning [Deng and Yu, 2013]. Some of these approaches could be used in a preprocessing phase, although they have the limitation of needing long learning times.
• Stochastic games. The prisoner's dilemma and double auctions can be seen as repeated games. The alternating-offers bargaining problem can be solved as an extensive-form game (a tree representation of a game). However, there are domains where the environment cannot be represented as a single matrix. Thus, a stochastic game could be the best fit; approaches such as [Elidrisi et al., 2014] could be a good starting point to model non-stationary opponents in those cases.
• Increasing the number of opponents. In Section 5.2.6, we increased the number of opponents and showed that MDP-CL performed successfully. However, the main limitation is that increasing the number of opponents increases the size of the state space exponentially, which may limit its use. A possible solution for handling a large number of opponents is to treat them as a population and best respond to classes of opponents [Bard et al., 2015; Wunder et al., 2011].
• Adapting related approaches for opponent modeling. Recent models such as Bayesian policy reuse [Rosman et al., 2015] have been proposed for fast learning in sequential decision tasks; the problem could be cast into an adversarial setting. Another approach worth analyzing is I-POMDPs, since there are new techniques for learning and solving them faster [Qu and Doshi, 2015].
6.5 Publications
Several parts of this thesis have been published. Below we provide the list of papers derived from this research. One journal paper:
• “A framework for learning and planning against switching strategies in repeated games” [Hernandez-
Leal et al., 2014a].
the following conference papers:
• “Bidding in Non-Stationary energy markets” (AAMAS 2015) [Hernandez-Leal et al., 2015a]
• “Opponent modeling against non-stationary strategies (Doctoral Consortium)” (AAMAS 2015)
[Hernandez-Leal et al., 2015c]
• “Using a priori information for fast learning against non-stationary opponents” (IBERAMIA 2014) [Hernandez-Leal et al., 2014c].
• “Modeling non-stationary opponents” (AAMAS 2013) [Hernandez-Leal et al., 2013b]
• “Strategic Interactions Among Agents with Bounded Rationality” (IJCAI 2013) [Hernandez-Leal
et al., 2013d]
and the following workshop papers:
• “Learning against non-stationary opponents in double auctions” (Workshop ALA 2015) [Hernandez-
Leal et al., 2015b]
• “Exploration strategies to detect strategy switches” (Workshop ALA 2014) [Hernandez-Leal et al., 2014b]
• “Learning against non-stationary opponents” (Workshop ALA 2013) [Hernandez-Leal et al., 2013a]
• “Opponent modeling and planning against non-stationary strategies” (Workshop MSDM 2013) [Hernandez-Leal et al., 2013c]
References
Abdallah, Sherief and Victor Lesser (2008). “A multiagent reinforcement learning algorithm with
non-linear dynamics.” In: Journal of Artificial Intelligence Research 33.1, pp. 521–549.
Aumann, Robert J. (1999). “Interactive epistemology I: knowledge.” In: International Journal of Game
Theory 28.3, pp. 263–300.
Axelrod, Robert and William D. Hamilton (1981). “The evolution of cooperation.” In: Science 211.27,
pp. 1390–1396.
Banerjee, Bikramjit and Jing Peng (2005). “Efficient learning of multi-step best response.” In: Proceedings of the 4th International Conference on Autonomous Agents and Multiagent Systems. Utrecht, The Netherlands: ACM, pp. 60–66.
Bard, Nolan, Michael Johanson, Neil Burch, and Michael Bowling (2013). “Online implicit agent mod-
elling.” In: Proceedings of the 12th International Conference on Autonomous Agents and Multiagent
Systems. International Foundation for Autonomous Agents and Multiagent Systems, pp. 255–262.
Bard, Nolan, Deon Nicholas, Csaba Szepesvari, and Michael Bowling (2015). “Decision-theoretic Clus-
tering of Strategies.” In: Proceedings of the 14th International Conference on Autonomous Agents
and Multiagent Systems. Istanbul, Turkey, pp. 17–25.
Barrett, Samuel, Peter Stone, Sarit Kraus, and Avi Rosenfeld (2012). “Learning Teammate Models
for Ad Hoc Teamwork.” In: AAMAS Adaptive Learning Agents (ALA) Workshop.
Bellman, Richard (1957). “A Markovian decision process.” In: Journal of Mathematics and Mechanics
6.5, pp. 679–684.
Bolton, Gary E. and Axel Ockenfels (2000). “ERC: A theory of equity, reciprocity, and competition.”
In: American Economic Review, pp. 166–193.
Boutilier, Craig, Thomas L. Dean, and Steve Hanks (1999). “Decision-Theoretic Planning: Structural
Assumptions and Computational Leverage.” In: Journal of Artificial Intelligence Research, pp. 1–
94.
Bowling, Michael (2004). “Convergence and no-regret in multiagent learning.” In: Advances in Neural
Information Processing Systems. Vancouver, Canada, pp. 209–216.
Bowling, Michael and Manuela Veloso (2002). “Multiagent learning using a variable learning rate.”
In: Artificial Intelligence 136.2, pp. 215–250.
Brafman, Ronen I. and Moshe Tennenholtz (2003). “R-MAX a general polynomial time algorithm for
near-optimal reinforcement learning.” In: The Journal of Machine Learning Research 3, pp. 213–
231.
Brown, George W. (1951). “Iterative solution of games by fictitious play.” In: Activity analysis of
production and allocation 13.1, pp. 374–376.
Busoniu, Lucian, Robert Babuska, and Bart De Schutter (2008). “A Comprehensive Survey of Mul-
tiagent Reinforcement Learning.” In: IEEE Transactions on Systems, Man and Cybernetics, Part
C (Applications and Reviews) 38.2, pp. 156–172.
Camerer, Colin F. (1997). “Progress in behavioral game theory.” In: The Journal of Economic Per-
spectives 11.4, pp. 167–188.
— (2003). Behavioral Game Theory: Experiments in Strategic Interaction (Roundtable Series in Be-
havioral Economics). Princeton University Press.
Camerer, Colin F., Teck-Hua Ho, and Juin-Kuan Chong (2004a). “A cognitive hierarchy model of
games.” In: The Quarterly Journal of Economics 119.3, p. 861.
— (2004b). “Behavioural Game Theory: Thinking, Learning and Teaching.” In: Advances in Under-
standing Strategic Behavior. New York, pp. 120–180.
Carmel, David and Shaul Markovitch (1999). “Exploration strategies for model-based learning in
multi-agent systems.” In: Autonomous Agents and Multi-Agent Systems 2.2, pp. 141–172.
Cassandra, Anthony R. (1998). “Exact and approximate algorithms for partially observable Markov
decision processes.” PhD thesis. Computer Science Department, Brown University.
Cassandra, Anthony R., Michael L. Littman, and Nevin L. Zhang (1997). “Incremental pruning: a
simple, fast, exact method for partially observable Markov decision processes.” In: Proceedings
of the 13th Conference on Uncertainty in Artificial Intelligence. Providence, Rhode Island, USA:
Morgan Kaufmann Publishers Inc, pp. 54–61.
Chakraborty, Doran and Peter Stone (2008). “Online multiagent learning against memory bounded
adversaries.” In: Machine Learning and Knowledge Discovery in Databases, pp. 211–226.
Choi, Samuel P. M., Dit-Yan Yeung, and Nevin L. Zhang (1999). “An Environment Model for Non-
stationary Reinforcement Learning.” In: NIPS, pp. 987–993.
— (2001). “Hidden-mode markov decision processes for nonstationary sequential decision making.”
In: Sequence Learning, pp. 264–287.
Conitzer, Vincent and Tuomas Sandholm (2006). “AWESOME: A general multiagent learning algo-
rithm that converges in self-play and learns a best response against stationary opponents.” In:
Machine Learning 67.1-2, pp. 23–43.
Corbett, Albert T. and John R. Anderson (1994). “Knowledge tracing: Modeling the acquisition of
procedural knowledge.” In: User Modeling and User-Adapted Interaction 4.4, pp. 253–278.
Costa Gomes, Miguel, Vincent P. Crawford, and B. Broseta (2001). “Cognition and Behavior in
Normal–Form Games: An Experimental Study.” In: Econometrica 69.5, pp. 1193–1235.
Cote, Enrique Munoz de and Nicholas R. Jennings (2010). “Planning against fictitious players in
repeated normal form games.” In: Proceedings of the 9th International Conference on Autonomous
Agents and Multiagent Systems. International Foundation for Autonomous Agents and Multiagent
Systems, pp. 1073–1080.
Cote, Enrique Munoz de, Alessandro Lazaric, and Marcello Restelli (2006). “Learning to cooperate
in multi-agent social dilemmas.” In: Proceedings of the Fifth International Joint Conference on
Autonomous Agents and Multiagent Systems. ACM, pp. 783–785.
Cote, Enrique Munoz de, Archie C. Chapman, Adam M. Sykulski, and Nicholas R. Jennings (2010).
“Automated Planning in Repeated Adversarial Games.” In: Uncertainty in Artificial Intelligence,
pp. 376–383.
Crandall, Jacob W. and Michael A. Goodrich (2011). “Learning to compete, coordinate, and cooperate
in repeated games using reinforcement learning.” In: Machine Learning 82.3, pp. 281–314.
Da Silva, Bruno C, Eduardo W. Basso, Ana L.C. Bazzan, and Paulo M. Engel (2006). “Dealing with
non-stationary environments using context detection.” In: Proceedings of the 23rd International
Conference on Machine Learning. Pittsburgh, Pennsylvania, pp. 217–224.
Del Giudice, A., Piotr J. Gmytrasiewicz, and J. Bryan (2009). “Towards strategic kriegspiel play
with opponent modeling.” In: Proceedings of the 8th International Conference on Autonomous
Agents and Multiagent Systems. International Foundation for Autonomous Agents and Multiagent
Systems, pp. 1265–1266.
Deng, Li and Dong Yu (2013). “Deep Learning Methods and Applications.” In: Foundations and
Trends in Signal Processing 7.3-4, pp. 197–387.
On the Difficulty of Achieving Equilibrium in Interactive POMDPs (2006). Boston, MA, USA.
Doshi, Prashant and Piotr J. Gmytrasiewicz (2009). “Monte Carlo sampling methods for approximat-
ing interactive POMDPs.” In: Journal of Artificial Intelligence Research 34.1, p. 297.
Doshi, Prashant, Yifeng Zeng, and Qiongyu Chen (2008). “Graphical models for interactive POMDPs:
representations and solutions.” In: Autonomous Agents and Multi-Agent Systems 18.3, pp. 376–
416.
Elidrisi, Mohamed, Nicholas Johnson, and Maria Gini (2012). “Fast Learning against Adaptive Ad-
versarial Opponents.” In: Adaptive Learning Agents Workshop at AAMAS. Valencia, Spain.
Elidrisi, Mohamed, Nicholas Johnson, Maria Gini, and Jacob W. Crandall (2014). “Fast adaptive learn-
ing in repeated stochastic games by game abstraction.” In: Proceedings of the 13th International
Joint Conference on Autonomous Agents and Multiagent Systems. Paris, France, pp. 1141–1148.
Fudenberg, Drew and Jean Tirole (1991). Game Theory. The MIT Press.
Fulda, Nancy and Dan Ventura (2006). “Predicting and Preventing Coordination Problems in Co-
operative Q-learning Systems.” In: IJCAI-07: Proceedings of the Twentieth International Joint
Conference on Artificial Intelligence, pp. 780–785.
Gal, Ya’akov and Avi Pfeffer (2008). “Networks of influence diagrams: A formalism for representing
agents’ beliefs and decision making processes.” In: Journal of Artificial Intelligence Research 33.1,
pp. 109–147.
Gama, Joao, Pedro Medas, Gladys Castillo, and Pedro Rodrigues (2004). “Learning with Drift Detec-
tion.” In: Advances in Artificial Intelligence–SBIA. Brazil, pp. 286–295.
Gama, Joao, Indre Zliobaite, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia (2014).
“A survey on concept drift adaptation.” In: ACM Computing Surveys (CSUR) 46.4.
Gmytrasiewicz, Piotr J. and Prashant Doshi (2005). “A framework for sequential planning in multia-
gent settings.” In: Journal of Artificial Intelligence Research 24.1, pp. 49–79.
Gmytrasiewicz, Piotr J. and Edmund H. Durfee (2000). “Rational Coordination in Multi-Agent Envi-
ronments.” In: Autonomous Agents and Multi-Agent Systems 3.4, pp. 319–350.
Goeree, Jacob K. and C.A. Holt (2001). “Ten little treasures of game theory and ten intuitive contra-
dictions.” In: American Economic Review, pp. 1402–1422.
Harsanyi, John C. and Reinhard Selten (1988). A general theory of equilibrium selection in games.
MIT Press.
Hernandez-Leal, Pablo, Enrique Munoz de Cote, and L. Enrique Sucar (2013a). “Learning against non-
stationary opponents.” In: Workshop on Adaptive Learning Agents (ALA). Saint Paul, Minnesota,
pp. 76–83.
— (2013b). “Modeling Non-Stationary Opponents.” In: Proceedings of the 12th International Con-
ference on Autonomous Agents and Multiagent Systems. Saint Paul, Minnesota, USA, pp. 1135–
1136.
— (2013c). “Opponent modeling and planning against non-stationary strategies.” In: The 8th Work-
shop on Multiagent Sequential Decision Making Under Uncertainty (MSDM) 2013. Saint Paul,
Minnesota, pp. 47–54.
— (2013d). “Strategic Interactions Among Agents with Bounded Rationality.” In: Proceedings of the
Twenty-Third International Joint Conference on Artificial Intelligence. Beijing, China, pp. 3219–
3220.
— (2014a). “A framework for learning and planning against switching strategies in repeated games.”
In: Connection Science 26.2, pp. 103–122.
— (2014b). “Exploration strategies to detect strategy switches.” In: AAMAS Workshop on Adaptive
Learning Agents. Paris, France.
— (2014c). “Using a priori information for fast learning against non-stationary opponents.” In: Ad-
vances in Artificial Intelligence – IBERAMIA 2014. Santiago de Chile, pp. 536–547.
Hernandez-Leal, Pablo, Matthew E. Taylor, L. Enrique Sucar, and Enrique Munoz de Cote (2015a).
“Bidding in Non-Stationary Energy Markets.” In: Proceedings of the 14th International Conference
on Autonomous Agents and Multiagent Systems. Istanbul, Turkey, pp. 1709–1710.
Hernandez-Leal, Pablo, Matthew E. Taylor, Enrique Munoz de Cote, and L. Enrique Sucar (2015b).
“Learning Against Non-Stationary Opponents in Double Auctions.” In: Workshop Adaptive Learn-
ing Agents ALA 2015. Istanbul, Turkey.
Hernandez-Leal, Pablo, Enrique Munoz de Cote, and L. Enrique Sucar (2015c). “Opponent Model-
ing against Non-stationary Strategies.” In: Proceedings of the 14th International Conference on
Autonomous Agents and Multiagent Systems. Istanbul, Turkey, pp. 1989–1990.
Hernandez-Leal, Pablo, Matthew E. Taylor, Benjamin Rosman, L. Enrique Sucar, and Enrique Munoz
de Cote (2016). “Identifying and Tracking Switching, Non-stationary Opponents: a Bayesian Ap-
proach.” In: Third Workshop on Multiagent Interaction without prior Coordination. Phoenix, AZ,
USA.
Hoeffding, Wassily (1963). “Probability inequalities for sums of bounded random variables.” In: Journal
of the American Statistical Association 58.301, pp. 13–30.
HolmesParker, Chris, Matthew E. Taylor, Adrian Agogino, and Kagan Tumer (2014). “CLEANing
the reward: counterfactual actions to remove exploratory action noise in multiagent learning.”
In: Proceedings of the 13th International Joint Conference on Autonomous Agents and Multiagent
Systems. Paris, France: International Foundation for Autonomous Agents and Multiagent Systems,
pp. 1353–1354.
Howard, Ronald A. and James E. Matheson (2005). “Influence Diagrams.” In: Decision Analysis 2.3,
pp. 127–143.
Hu, J. and M.P. Wellman (1998). “Online learning about other agents in a dynamic multiagent system.”
In: Proceedings of the Second International Conference on Autonomous Agents. ACM, pp. 239–246.
Jennings, Nicholas R. et al. (2001). “Automated negotiation: prospects, methods and challenges.” In:
Group Decision and Negotiation 10.2, pp. 199–215.
Jensen, Steven, Daniel Boley, Maria Gini, and Paul Schrater (2005). “Rapid on-line temporal se-
quence prediction by an adaptive agent.” In: Proceedings of the 4th International Conference on
Autonomous Agents and Multiagent Systems. Utrecht, The Netherlands: ACM, pp. 67–73.
Kaelbling, Leslie P., Michael L. Littman, and Anthony R. Cassandra (1998). “Planning and acting in
partially observable stochastic domains.” In: Artificial Intelligence 101.1-2, pp. 99–134.
Kahneman, Daniel and Amos Tversky (1979). “Prospect theory: An analysis of decision under risk.”
In: Econometrica, pp. 263–291.
Kakade, Sham Machandranath (2003). “On the sample complexity of reinforcement learning.” PhD
thesis. Gatsby Computational Neuroscience Unit, University College London.
Ketter, Wolfgang, John Collins, and Prashant P. Reddy (2013). “Power TAC: A competitive economic
simulation of the smart grid.” In: Energy Economics 39, pp. 262–270.
Ketter, Wolfgang, John Collins, Prashant P. Reddy, and Mathijs De Weerdt (2014). The 2014 Power
Trading Agent Competition. Rotterdam, The Netherlands: Department of Decision and Information
Sciences, Erasmus University.
Kocsis, Levente and Csaba Szepesvari (2006). “Bandit based monte-carlo planning.” In: Proceedings
of the 17th European Conference on Machine Learning. Berlin, Germany: Springer, pp. 282–293.
Koller, D. and N. Friedman (2009). Probabilistic graphical models: principles and techniques. The MIT
Press.
Koller, Daphne and Brian Milch (2001). “Multi-agent influence diagrams for representing and solv-
ing games.” In: IJCAI’01: Proceedings of the 17th International Joint Conference on Artificial
Intelligence. Seattle, Washington: Morgan Kaufmann Publishers Inc, pp. 1027–1036.
Liebman, Elad, Maytal Saar-Tsechansky, and Peter Stone (2015). “DJ-MC: A Reinforcement-Learning
Agent for Music Playlist Recommendation.” In: Proceedings of the 14th International Conference
on Autonomous Agents and Multiagent Systems. Istanbul, Turkey, pp. 591–599.
Littman, Michael L. (1994). “Markov games as a framework for multi-agent reinforcement learning.”
In: Proceedings of the 11th International Conference on Machine Learning. New Brunswick, NJ,
USA, pp. 157–163.
— (1996). “Algorithms for sequential decision making.” PhD thesis. Department of Computer Science,
Brown University.
Littman, Michael L. and Peter Stone (2001). “Implicit Negotiation in Repeated Games.” In: ATAL
’01: Revised Papers from the 8th International Workshop on Intelligent Agents VIII.
Littman, Michael L., Thomas L. Dean, and Leslie P. Kaelbling (1995). “On the complexity of solving
Markov decision problems.” In: Proceedings of the 11th Conference on Uncertainty in Artificial
Intelligence. Montreal, Canada: Morgan Kaufmann Publishers Inc, pp. 394–402.
McKelvey, Richard D., Andrew M. McLennan, and Theodore L. Turocy (2014). Gambit: Software
Tools for Game Theory. url: http://www.gambit-project.org.
Miglio, Rossella and Gabriele Soffritti (2004). “The comparison between classification trees through
proximity measures.” In: Computational Statistics and Data Analysis 45.3, pp. 577–593.
Mitchell, Thomas M. (1997). Machine Learning. 1st. McGraw-Hill Higher Education.
Mohan, Yogeswaran and S G Ponnambalam (2011). “Exploration strategies for learning in multi-agent
foraging.” In: Swarm, Evolutionary, and Memetic Computing 2011. Springer, pp. 17–26.
Mohri, Mehryar, Afshin Rostamizadeh, and Ameet Talwalkar (2012). Foundations of Machine Learn-
ing. The MIT Press.
Monahan, George E. (1982). “A survey of partially observable Markov decision processes: Theory,
models, and algorithms.” In: Management Science 28, pp. 1–16.
Myerson, Roger B. (1991). Game theory: analysis of conflict. Harvard University Press.
Nash, John F. (1950). “Equilibrium points in n-person games.” In: Proceedings of the National Academy
of Sciences 36.1, pp. 48–49.
Ng, Andrew Y, Daishi Harada, and Stuart J. Russell (1999). “Policy invariance under reward transfor-
mations: Theory and application to reward shaping.” In: Proceedings of the Sixteenth International
Conference on Machine Learning. Bled, Slovenia, pp. 278–287.
Ng, Brenda, Carol Meyers, Kofi Boakye, and J. Nitao (2010). “Towards Applying Interactive POMDPs
to Real-World Adversary Modeling.” In: Twenty-Second IAAI Conference. Atlanta, Georgia, pp. 1814–
1820.
Ng, Brenda, Kofi Boakye, Carol Meyers, and Andrew Wang (2012). “Bayes-Adaptive Interactive
POMDPs.” In: Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence. Toronto,
Canada, pp. 1408–1414.
Nudelman, Eugene, Jennifer Wortman, Yoav Shoham, and Kevin Leyton-Brown (2004). “Run the
GAMUT: a comprehensive approach to evaluating game-theoretic algorithms.” In: Proceedings of
the Third International Joint Conference on Autonomous Agents and Multiagent System, pp. 880–
887.
Papadimitriou, Christos H. and John N. Tsitsiklis (1987). “The complexity of Markov decision pro-
cesses.” In: Mathematics of Operations Research 12.3, pp. 441–450.
Parsons, Simon, Marek Marcinkiewicz, Jinzhong Niu, and Steve Phelps (2006). Everything you wanted
to know about double auctions, but were afraid to (bid or) ask. New York, USA: Department of
Computer & Information Science, University of New York.
Pineau, Joelle, Geoffrey Gordon, and Sebastian Thrun (2006). “Anytime point-based approximations
for large POMDPs.” In: Journal of Artificial Intelligence Research 27.1, pp. 335–380.
Pipattanasomporn, M., H. Feroze, and Saifur Rahman (2009). “Multi-agent systems in a distributed
smart grid: Design and implementation.” In: Power Systems Conference and Exposition, 2009.
PSCE’09. IEEE/PES. IEEE, pp. 1–8.
Pita, James et al. (2009). “Using game theory for Los Angeles airport security.” In: AI Magazine 30.1,
pp. 43–57.
Powers, Rob and Yoav Shoham (2005). “Learning against opponents with bounded memory.” In: IJ-
CAI’05: Proceedings of the 19th International Joint Conference on Artificial Intelligence. Edinburg,
Scotland, UK: Morgan Kaufmann Publishers Inc, pp. 817–822.
Puterman, Martin L. (1994). Markov decision processes: Discrete stochastic dynamic programming.
John Wiley & Sons, Inc.
Qu, Xia and Prashant Doshi (2015). “Improved Planning for Infinite-Horizon Interactive POMDPs
Using Probabilistic Inference.” In: Proceedings of the 14th International Conference on Autonomous
Agents and Multiagent Systems. Istanbul, Turkey, pp. 1839–1840.
Quinlan, J. Ross (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
Rejeb, Lilia, Zahia Guessoum, and Rym M’Hallah (2005). “An adaptive approach for the exploration-
exploitation dilemma for learning agents.” In: Proceedings of the 4th international Central and
Eastern European conference on Multi-Agent Systems and Applications. Springer, pp. 316–325.
Richards, Mark and Eyal Amir (2006). “Opponent Modeling in Scrabble.” In: IJCAI-07: Proceed-
ings of the Twentieth International Joint Conference on Artificial Intelligence. Hyderabad, India,
pp. 1482–1487.
Risse, Mathias (2000). “What is rational about Nash equilibria?” In: Synthese 124.3, pp. 361–384.
Robbins, Herbert (1985). “Some aspects of the sequential design of experiments.” In: Herbert Robbins
Selected Papers. Springer, pp. 527–535.
Rosman, Benjamin, Majd Hawasly, and Subramanian Ramamoorthy (2015). “Bayesian Policy Reuse.”
In: arXiv.org. arXiv: 1505.00284v1 [cs.AI].
Russell, Stuart J., Peter Norvig, J.F. Canny, J.M. Malik, and D.D. Edwards (1995). Artificial intelli-
gence: a modern approach. Vol. 2. Englewood Cliffs, NJ: Prentice Hall.
Seuken, Sven and Shlomo Zilberstein (2008). “Formal models and algorithms for decentralized decision
making under uncertainty.” In: Autonomous Agents and Multi-Agent Systems 17.2, pp. 190–250.
Shachter, Ross D. (1986). “Evaluating influence diagrams.” In: Operations Research 34.6.
Shoham, Yoav and Kevin Leyton-Brown (2008). Multiagent Systems: Algorithmic, Game-Theoretic,
and Logical Foundations. Cambridge University Press.
Shoham, Yoav, Rob Powers, and T. Grenager (2007). “If multi-agent learning is the answer, what is
the question?” In: Artificial Intelligence 171.7, pp. 365–377.
Simon, Herbert A. (1955). “A behavioral model of rational choice.” In: The Quarterly Journal of
Economics 69.1, p. 99.
Sonu, Ekhlas, Yingke Chen, and Prashant Doshi (2015). “Individual Planning in Agent Populations:
Exploiting Anonymity and Frame-Action Hypergraphs.” In: ICAPS, pp. 202–210.
Stimpson, Jeffrey L. and Michael A. Goodrich (2003). “Learning To Cooperate in a Social Dilemma:
A Satisficing Approach to Bargaining.” In: Proceedings of the Twentieth International Conference
on Machine Learning, pp. 728–735.
Stone, Peter (2007). “Learning and multiagent reasoning for autonomous agents.” In: The 20th Inter-
national Joint Conference on Artificial Intelligence. Hyderabad, India, pp. 13–30.
Stone, Peter and Manuela Veloso (2000). “Multiagent Systems: A Survey from a Machine Learning
Perspective.” In: Autonomous Robots 8.3.
Sucar, L. Enrique, Roger Luis, Ron Leder, Jorge Hernandez, and Israel Sanchez (2010). “Gesture
therapy: a vision-based system for upper extremity stroke rehabilitation.” In: Annual International
Conference of the IEEE Engineering in Medicine and Biology Society. Buenos Aires, Argentina:
IEEE, pp. 3690–3693.
Sutton, Richard S. and Andrew G. Barto (1998). Reinforcement Learning An Introduction. Cambridge,
MA: MIT Press.
Sykulski, Adam M., Archie C. Chapman, Enrique Munoz de Cote, and Nicholas R. Jennings (2010).
“EA2: The Winning Strategy for the Inaugural Lemonade Stand Game Tournament.” In: Proceedings of the 19th European Conference on Artificial Intelligence. IOS Press, pp. 209–214.
Tesauro, Gerald (2003). “Extending Q-learning to general adaptive multi-agent systems.” In: Advances
in Neural Information Processing Systems 16, pp. 871–878.
Tesauro, Gerald and Jonathan L. Bredin (2002). “Strategic sequential bidding in auctions using dy-
namic programming.” In: Proceedings of the 1st International Joint Conference on Autonomous
Agents and Multiagent Systems. ACM Request Permissions.
Tversky, Amos and Daniel Kahneman (1974). “Judgment under uncertainty: Heuristics and biases.”
In: Science 185.4157, pp. 1124–1131.
Urieli, Daniel and Peter Stone (2014). “TacTex’13: A Champion Adaptive Power Trading Agent.” In:
Proceedings of the Twenty-Eighth Conference on Artificial Intelligence. Quebec, Canada, pp. 465–
471.
Valogianni, Konstantina, Wolfgang Ketter, and John Collins (2015). “A Multiagent Approach to
Variable-Rate Electric Vehicle Charging Coordination.” In: Proceedings of the 14th International
Conference on Autonomous Agents and Multiagent Systems. Istanbul, Turkey, pp. 1131–1139.
Watkins, Christopher J. C. H. (1989). “Learning from delayed rewards.” PhD thesis. Cambridge, UK: King’s College.
Widmer, Gerhard and Miroslav Kubat (1996). “Learning in the presence of concept drift and hidden
contexts.” In: Machine Learning 23.1, pp. 69–101.
Wilson, Edwin B. (1927). “Probable Inference, the Law of Succession, and Statistical Inference.” In:
Journal of the American Statistical Association 22.158, pp. 209–212.
Wooldridge, Michael (2009). An Introduction to MultiAgent Systems. 2nd. Wiley Publishing.
Wright, James Robert and Kevin Leyton-Brown (2010). “Beyond equilibrium: Predicting human be-
havior in normal-form games.” In: Twenty-Fourth Conference on Artificial Intelligence (AAAI-10).
Atlanta, Georgia, pp. 901–907.
Wunder, Michael, Michael L. Littman, and Matthew Stone (2009). “Communication, Credibility and
Negotiation Using a Cognitive Hierarchy Model.” In: AAMAS Workshop #19: Multi-agent Sequen-
tial Decision Making 2009. Budapest, Hungary, pp. 73–80.
Wunder, Michael, Michael Kaisers, John Robert Yaros, and Michael L. Littman (2011). “Using iterated
reasoning to predict opponent strategies.” In: Proceedings of 10th Int. Conference on Autonomous
Agents and Multiagent Systems. Taipei, Taiwan, pp. 593–600.
Wurman, Peter R, William E. Walsh, and M.P. Wellman (1998). “Flexible double auctions for elec-
tronic commerce: theory and implementation.” In: Decision Support Systems 24.1, pp. 17–27.
Zinkevich, Martin A., Michael Bowling, and Michael Wunder (2011). “The lemonade stand game
competition: solving unsolvable games.” In: SIGecom Exchanges 10.1, pp. 35–38.
Appendix A
PowerTAC
In this appendix we review the PowerTAC competition, which has been used to perform research in multiagent systems and which we therefore used as a testbed for the experiments in our research.
A.1 Energy markets
New trends in energy generation and distribution are being implemented around the world. This has led to the deregulation of energy supply and demand, allowing producers to sell energy to consumers through a broker acting as an intermediary, effectively creating a market. Such markets have led to the development of diverse energy trading strategies, most of which remain difficult to optimize due to the inherent complexity of the markets (rich state spaces, high dimensionality, and partial observability [Urieli and Stone, 2014]), which means that straightforward game-theoretic, machine learning, and artificial intelligence techniques fall short.
A.2 PowerTAC
The PowerTAC simulator models a retail electrical energy market, where competing brokers (agents) are challenged to maximize their profits. Brokers take actions in different markets at each timestep, which simulates one hour of real time: (i) the tariff market, where brokers buy and sell energy by offering tariff contracts that specify a price and other characteristics such as an early withdrawal fee, a subscription bonus, and an expiration time; (ii) the wholesale market, where brokers buy and sell quantities of energy for future delivery; and (iii) the balancing market, which is responsible for the real-time balance of supply and demand on the distribution grid.
Figure A.1: Partial representation of the MDP broker in PowerTAC; ovals represent states (timeslots for future delivery) and arrows represent transition probabilities and rewards.
A.3 Periodic double auctions
In this thesis we focus on the wholesale market, which operates as a periodic double auction (PDA) [Wurman et al., 1998] and is similar to real-world markets (e.g., Nord Pool in Scandinavia or FERC in North America) [Ketter et al., 2014]. The wholesale market allows brokers to buy and sell quantities of energy for future delivery, typically between 1 and 24 hours in the future. A PDA is a mechanism to match buyers and sellers of a particular good, and to determine the prices at which trades are executed. At any point in time, traders can place limit orders in the form of bids (buy orders) and asks (sell orders) [Parsons et al., 2006]. Orders are maintained in an orderbook, one for each enabled timeslot. In a PDA, demand and supply curves are constructed from the bids and asks, and the clearing price of each orderbook is determined by the intersection of the two curves, which is the price that maximizes turnover [Ketter et al., 2014].
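As a rough illustration of this clearing rule, the following sketch (simplified, with assumed order formats; it is not the PowerTAC clearing implementation) picks, among the quoted prices, the one that maximizes the volume that can be traded:

```python
# Simplified sketch of periodic double auction clearing: bids and asks are limit orders
# (price, quantity); the clearing price is chosen where inferred demand and supply cross,
# i.e. the price that maximizes the traded volume.

def cleared_volume(price, bids, asks):
    demand = sum(q for p, q in bids if p >= price)   # buyers willing to pay at least `price`
    supply = sum(q for p, q in asks if p <= price)   # sellers willing to accept at most `price`
    return min(demand, supply)

def clearing_price(bids, asks):
    candidates = sorted({p for p, _ in bids} | {p for p, _ in asks})
    return max(candidates, key=lambda p: cleared_volume(p, bids, asks), default=None)

# Example: two buy orders and two sell orders (price in $/MWh, quantity in MWh).
bids = [(32.0, 10.0), (28.0, 5.0)]
asks = [(25.0, 8.0), (30.0, 6.0)]
print(clearing_price(bids, asks))  # a price around 30 clears the most energy here
```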
A.4 TacTex
The champion agent of the inaugural PowerTAC competition in 2013 was TacTex [Urieli and Stone, 2014], which uses an approach based on reinforcement learning for the wholesale market and prediction methods for the tariff market. TacTex uses a modified version of Tesauro's representation of the wholesale market [Tesauro and Bredin, 2002], where states represent the agent's holdings and transition probabilities are estimated from the market event history. TacTex models the bidding process as an MDP, starting each game with no data and learning to improve its estimates and bids online. At each timeslot, it uses dynamic programming to solve the MDP with all the data collected so far, providing a limit price to bid for future timeslots. Even though TacTex learns to bid quickly, it is not capable of adapting to non-stationary opponents. This is a significant drawback, as many real-life agents are non-stationary and change strategies over time; note that agents may change slowly or switch drastically from one strategy to another (either to confuse the opponent or simply as a best-response measure).
In PowerTAC, a wholesale broker can place a bid for buying or selling energy by issuing a tuple $\langle t, e, p \rangle$: in timeslot $t$ the broker makes a bid/ask for an amount of energy $e$ (expressed in megawatt-hours, MWh) at a limit price $p$ for buying/selling. At each timeslot, PowerTAC provides the market clearing prices and the cleared volume as public information, and the broker's own successful bids and asks as private information. A bid/ask can be partially or fully cleared. When a bid is fully cleared, the total amount of energy will be delivered at the requested timeslot; if a bid is partially cleared, the offer was accepted but there was not enough energy, so only a fraction of the requested energy will be delivered. In order to maintain a clear view of the problem and its solution, and similar to TacTex, we restrict our setting to brokers that only place bids for buying energy.
TacTex models the problem as an MDP, depicted in Figure A.1; in this work we adopt the same model for DriftER in order to fairly compare the approaches in the experimental section. States $s \in \{0, 1, \ldots, n_{bids}, success\}$ represent the timeslots for future delivery of the bids in the market, with initial state $s_0 = n_{bids}$ and terminal states $s^* \in \{0, success\}$. Actions represent different limit prices for the buying offers in the wholesale market. Any state $s_t \in \{1, \ldots, n_{bids}\}$ transitions to one of two states: $success$ if a bid is partially or fully cleared, or $s_{t+1} = s_t - 1$ otherwise. Note that transitions depend not only on the action chosen by DriftER but also on the actions (bids) chosen by the opponents, and that these transitions are initially unknown to our agent. The reward is 0 in any state $s \in \{1, \ldots, n_{bids}\}$, the limit price of the successful bid in state $success$, and a balancing penalty in state $s = 0$ (i.e., when time has run out to secure the required energy).
The solution (a policy) of this MDP defines the best limit-price order for each of the $n_{bids}$ states. TacTex solves the MDP once per timeslot, submitting $n_{bids}$ limit prices, one to each of the $n_{bids}$ auctions. DriftER leverages the TacTex MDP formulation but is designed to handle non-stationary opponents by explicitly accounting for drift. Sketches of the empirical transition estimates and of a backward-induction solution of this MDP are given after Equation A.2 below.
During this learning phase, let $CT_s$ be the set of past successful cleared transactions observed at state $s$ (i.e., timeslot for future delivery). Each element $a' \in CT_s$ contains information about the cleared energy and the clearing price ($a'_e$ and $a'_p$ respectively). To compute the probability of reaching state $success$ from state $s$ with action $a$ (whose limit price is $a_p$) we use:

$$P^{success}_{s,a} := \frac{\sum_{a' \in CT_s,\; a'_p < a_p} a'_e}{\sum_{a' \in CT_s} a'_e} \qquad \text{(A.1)}$$

where (A.1) captures the ratio between the energy of all successful past transactions whose clearing price was smaller than the limit price $a_p$ and the energy of all successful past transactions. Using (A.1) we compute the empirical transition function as:
$$T(s, a, s') = \begin{cases} P^{success}_{s,a} & \text{if } s' = success \\ 1 - P^{success}_{s,a} & \text{otherwise} \end{cases} \qquad \text{(A.2)}$$
The value $P^{success}_{s,a}$ gives the probability of a cleared transaction. If the transaction is not cleared, we transition to state $s - 1$ with probability $1 - P^{success}_{s,a}$. Rewards are not stochastic, so no statistics need to be collected to learn the reward function.
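The following minimal sketch implements the empirical estimates of Equations A.1 and A.2; the function and variable names are illustrative assumptions rather than the thesis code. $CT_s$ is represented as a list of past cleared transactions for state $s$, each carrying a clearing price and a cleared amount of energy:

```python
# Sketch of the empirical estimates in Equations A.1 and A.2 (illustrative names).

def p_success(cleared_transactions, limit_price):
    """Eq. A.1: fraction of past cleared energy whose clearing price was below our limit price."""
    total = sum(e for price, e in cleared_transactions)
    if total == 0.0:
        return 0.0                      # no data yet for this state
    below = sum(e for price, e in cleared_transactions if price < limit_price)
    return below / total

def transition(cleared_transactions, limit_price, next_state_is_success):
    """Eq. A.2: probability of reaching `success`, else of moving down to state s - 1."""
    p = p_success(cleared_transactions, limit_price)
    return p if next_state_is_success else 1.0 - p

# Example: three cleared transactions (clearing price in $/MWh, cleared energy in MWh) for some state s.
ct_s = [(22.0, 5.0), (27.0, 10.0), (31.0, 3.0)]
print(p_success(ct_s, 30.0))  # (5 + 10) / (5 + 10 + 3) ≈ 0.83
```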
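Once these transition estimates are available, the small chain MDP can be solved by backward induction. The sketch below is a rough illustration under assumed names and a generic success-reward function; it is not TacTex's or DriftER's actual implementation:

```python
# Backward induction over the bidding MDP: states count remaining auction opportunities,
# actions are candidate limit prices, and bidding trades off clearing now versus retrying later.
# `p_success[s][a]` is the estimated clearing probability of limit price a in state s (Eqs. A.1-A.2),
# e.g. a dict of dicts; `reward_success(a)` is the reward collected when a bid at price a clears;
# `balancing_penalty` is the (positive) cost assumed to be paid when no auction is left.

def solve_bidding_mdp(n_bids, limit_prices, p_success, reward_success, balancing_penalty):
    value = [0.0] * (n_bids + 1)
    policy = [None] * (n_bids + 1)
    value[0] = -balancing_penalty                 # state 0: time ran out, pay the balancing penalty
    for s in range(1, n_bids + 1):                # sweep states with fewer remaining auctions first
        best_v, best_a = float("-inf"), None
        for a in limit_prices:
            q = p_success[s][a] * reward_success(a) + (1.0 - p_success[s][a]) * value[s - 1]
            if q > best_v:
                best_v, best_a = q, a
        value[s], policy[s] = best_v, best_a
    return policy, value                          # policy[s]: limit price to submit in state s
```

Because failure in state $s$ simply leads to state $s - 1$, a single sweep from $s = 1$ to $n_{bids}$ suffices, which is what makes re-solving the MDP at every timeslot cheap.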
Appendix B
General-sum Games
We tested our proposals in randomly generated general-sum games that have at least one pure and one mixed Nash equilibrium. The games were generated with the Gamut library [Nudelman et al., 2004] and the Nash strategies were computed with Gambit [McKelvey et al., 2014]. The values and characteristics of the games are shown in Tables B.1 and B.2.
Table B.1: Games used in the experiments. They have at least one pure and one mixed Nash equilibrium.
Learning agents will play rows and opponents will play columns.
(a) Game 1
         B1         B2         B3
A1   -29, -41    93, -56    56, -4
A2   -17, -87   -70, -79   -44, -82
A3    50, 49    -75, 76     27, -56
(b) Game 2
         B1         B2         B3
A1    37, 35     45, 76     67, 43
A2    33, 94     38, 74    -94, -72
A3    83, -61   -95, -5     99, 32
(c) Game 3
         B1         B2         B3         B4
A1   -50, 20     73, -7     69, -45    83, 22
A2   -51, 89     88, 96    -55, 40    -26, -92
A3   -58, 58    -41, 14     66, -46     0, -80
A4   -62, 52    -94, -52   -40, -46   -94, -84
(d) Game 4
         B1         B2         B3         B4         B5
A1     5, 7      32, 78      1, 7     -55, -79    -1, 0
A2    89, 96     81, -45   -26, 61     73, 78    -45, -68
A3    29, 92     90, -53   -53, -46    45, -83    11, 20
A4   -89, 14     94, -99   -26, -10    89, 22     67, -19
A5    35, 84     67, 34     75, 35     -6, 33    -16, -62
Table B.2: Pure and mixed Nash strategies for column player of the games used.
Game Id # actions Pure Nash Mixed Nash
Game 1 3 [0,0,1] [0.680, 0.319,0]
Game 2 3 [0,1,0] [0.0, 0.186, 0.813]
Game 3 4 [0,1,0,0] [0.0, 0.879, 0.0, 0.120]
Game 4 5 [1,0,0,0,0] [0.082, 0.0, 0.0, 0.917, 0.0]
Appendix C
Extra Experiments
C.1 HM-MDPs training and performance experiments
In order to evaluate the robustness of HM-MDPs, we evaluated the learned policy under different switching times. We used 150, 250 and 350 as the rounds at which the opponent switches from strategy$_1$ to strategy$_2$, and the duration of the games in the evaluation phase was 500 stages. The training phase directly affects the quality of the model [Choi et al., 1999], therefore we evaluated to what extent the variation in training size could affect the total rewards. For this reason, the training size was varied with $t_{size} = \{100, 500, 2000\}$ games.
We present the average results, with standard deviations, over all values of $t_{size}$ for each switching time in Table C.1, under the "same model" column. AvgR(A) presents the average rewards for the learning agent, and AvgR(Opp) presents the average rewards for the switching opponent.
Some conclusions that can be drawn are:
• The round at which the opponent switches between strategies does not impact the results. There is only a variation of 0.04 on average between the best and worst results.
• The HM-MDP agent shows a high standard deviation against TFT-Bully. This happens because two types of behaviour appeared in the experiments: in some cases a cooperate-cooperate cycle appeared against TFT and a defect-defect cycle against Bully. However, in other cases the agent defected against TFT and started a defect-defect cycle for the rest of the repeated game. This suboptimal behaviour appeared when the training phase was small (100 steps).
• In some cases a suboptimal cycle also appeared against TFT-Pavlov. Some policies get stuck in a C-D, D-C cycle against TFT, and this cycle reduces the accumulated rewards, since C-C is the optimal policy against TFT. It was only when the opponent changed to Pavlov that the C-C cycle appeared.
In general, when the training size is small, suboptimal behaviour can appear, which affects the total rewards.
As we mentioned earlier, HM-MDPs need to fix the number of modes before learning. In the previous experiments the opponents always use two strategies, but there are three different strategies available, so we modified the first experiment in order to have different strategies in the training and evaluation phases. The motivation is that during the training phase the opponent may not use all of its strategies, in which case the learned HM-MDP is incomplete; if a new strategy (unknown to the HM-MDP) is then used during evaluation, this will affect the results. Therefore, in the next experiment the training opponent consists of strategy$_1$-strategy$_2$ and the evaluation opponent consists of strategy$_1$-strategy$_3$. The results of this experiment are presented in Table C.1 under the "different model" column.
From the results it is easy to see that HM-MDPs consistently decrease their average reward when the evaluation model differs from the training model. On average the decrease is 0.56 ± 0.27. In conclusion, when HM-MDPs can explore all modes during the training phase, they obtain good results. However, if they do not learn the complete set of opponent strategies, they cannot compute an optimal policy against the opponent and thus receive lower rewards.
To solve the learned HM-MDP, a transformation to a POMDP was performed. In Tables C.2 and C.3 we present the time and convergence statistics when solving the POMDPs.¹ In particular, we present the horizon at which the policy converged (final horizon), the number of α vectors generated, and the time (in seconds) needed to solve the POMDP. Some conclusions from these experiments are:
• As the number of learning steps increased, the number of iterations needed to solve the POMDP also increased. In some cases (TFT-Pavlov, Pavlov-TFT, Bully-TFT and Bully-Pavlov) 500 iterations were not enough to converge.
• The policies that contain a high number of vectors (Bully-TFT or TFT-Bully) take longer to solve. Conversely, a low number of vectors (Pavlov-Bully) yields lower solving times.
• The number of vectors is mainly determined by the type of opponent: TFT-Bully and Bully-TFT have a high number of vectors in all cases. In contrast, Pavlov-Bully and Bully-Pavlov have a low number of vectors for all training sizes. TFT-Pavlov and Pavlov-TFT started with a low number of vectors, but the number of vectors grew as the training phase size increased.
• Too much interaction increases the time needed to solve the POMDP. However, as we said in the previous section, too little information may not be enough to learn an optimal policy against the opponent. Thus a trade-off appears: determining the amount of interaction needed to learn off-line models without dramatically increasing the solving time is an open problem for future work.
¹ All the experiments were performed on a MacBook Pro with an Intel Core 2 Duo 2.16 GHz and 8 GB of RAM.
Table C.1: Average rewards for the HM-MDP agent (AvgR(A)) and for the opponent (AvgR(Opp)), with standard deviations. The "Change at" column presents the round at which the opponent switches strategies. The evaluation phase consisted of 500 steps.
HM-MDP
Same model Different model
Opponent Change at AvgR(A) AvgR(Opp) AvgR(A) AvgR(Opp)
TFT-Pavlov 150 2.86 ± 0.10 2.97 ± 0.10 2.09 ± 0.13 1.46 ± 1.01
TFT-Bully 150 1.39 ± 0.22 1.44 ± 0.28 1.08 ± 0.10 3.01 ± 0.46
Pavlov-TFT 150 2.88 ± 0.06 2.94 ± 0.13 2.01 ± 0.37 1.89 ± 0.58
Pavlov-Bully 150 1.56 ± 0.05 1.50 ± 0.29 1.04 ± 0.14 3.02 ± 0.58
Bully-TFT 150 1.99 ± 0.35 2.23 ± 0.49 1.28 ± 0.50 1.45 ± 0.88
Bully-Pavlov 150 2.24 ± 0.09 2.06 ± 0.58 2.01 ± 0.04 0.83 ± 0.15
Average - 2.15 ± 0.14 2.19 ± 0.31 1.59 ± 0.21 1.94 ± 0.61
TFT-Pavlov 250 2.87 ± 0.09 2.97 ± 0.10 2.05 ± 0.27 1.59 ± 0.99
TFT-Bully 250 1.63 ± 0.39 1.67 ± 0.45 1.56 ± 0.15 2.96 ± 0.43
Pavlov-TFT 250 2.84 ± 0.13 2.92 ± 0.24 2.29 ± 0.30 2.12 ± 0.68
Pavlov-Bully 250 1.92 ± 0.06 1.76 ± 0.41 1.55 ± 0.11 2.82 ± 0.62
Bully-TFT 250 1.65 ± 0.27 2.02 ± 0.48 1.17 ± 0.27 1.43 ± 0.88
Bully-Pavlov 250 1.89 ± 0.12 1.87 ± 0.53 1.72 ± 0.05 0.97 ± 0.14
Average - 2.13 ± 0.18 2.20 ± 0.37 1.72 ± 0.19 1.98 ± 0.62
TFT-Pavlov 350 2.89 ± 0.09 2.97 ± 0.09 1.99 ± 0.52 1.71 ± 0.98
TFT-Bully 350 2.04 ± 0.37 1.89 ± 0.62 2.04 ± 0.24 2.90 ± 0.41
Pavlov-TFT 350 2.80 ± 0.18 2.89 ± 0.33 2.56 ± 0.26 2.33 ± 0.80
Pavlov-Bully 350 2.28 ± 0.08 2.03 ± 0.52 2.05 ± 0.09 2.65 ± 0.68
Bully-TFT 350 1.33 ± 0.19 1.80 ± 0.48 1.06 ± 0.05 1.43 ± 0.88
Bully-Pavlov 350 1.5 ± 0.10 1.62 ± 0.47 1.39 ± 0.05 1.11 ± 0.18
Average - 2.11 ± 0.17 2.20 ± 0.42 1.85 ± 0.20 2.02 ± 0.66
Table C.2: Performance measures when solving the HM-MDP as a POMDP. Final horizon presents the iteration at which the policy converged; the number of α vectors and the time in seconds are also presented. We show the results while varying the training phase size.
Final Horizon # α Vectors Time (s) Learning steps
232.17 ± 21.17 71.17 ± 108.30 95.98 ± 195.10 100
320.47 ± 110.66 260.53 ± 189.76 434.74 ± 574.83 500
457.08 ± 60.69 717.83 ± 426.39 2297.27 ± 2193.54 2000
Table C.3: Performance measures when solving the HM-MDP as a POMDP against different non-stationary opponents. Final horizon presents the iteration at which the policy converged; the number of α vectors and the time in seconds are also presented.
Opponent Final Horizon # α Vectors Time (s)
TFT-Pavlov 297.24 ± 57.04 176.50 ± 34.36 464.61 ± 116.21
TFT-Bully 318.06 ± 118.34 500.74 ± 707.46 972.31 ± 1650.27
Pavlov-TFT 267.63 ± 74.74 180.06 ± 101.78 685.61 ± 749.39
Pavlov-Bully 338.29 ± 92.55 71.37 ± 54.64 24.81 ± 35.70
Bully-TFT 352.09 ± 78.66 1022.27 ± 607.91 2123.02 ± 2352.17
Bully-Pavlov 309.86 ± 59.62 72.64 ± 54.38 297.23 ± 413.67
Average 313.86 ± 80.16 337.26 ± 260.09 761.26 ± 886.23
Figure C.1: Fraction of updates when learning an opponent model using R-max exploration against (a) a pure strategy and (b) a mixed strategy in the BoS game. Results are the average of 10 interactions.
C.2 R-max exploration against pure and mixed strategies
Our approaches need a phase in which to learn the transition function that describes the opponent's dynamics. MDP-CL uses random exploration within a fixed window; in contrast, R-max# and DriftER are based on R-max, which guarantees an efficient exploration. When using R-max we need to set one extra parameter, m, which is the number of visits needed for a state to be considered known.
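To see why the choice of m matters more against stochastic opponents, the following toy experiment (an assumed setup for illustration, not taken from the thesis) estimates the mixed strategy [0.65, 0.35] used in the experiments below from only m observations and reports how widely those estimates vary; a pure strategy would be identified exactly after a single visit per action:

```python
# Toy illustration of how the R-max parameter m affects the reliability of the learned
# opponent model: with few visits, the empirical action distribution of a mixed strategy
# is very noisy, whereas a deterministic strategy needs almost no samples.
import random

def estimate_action_distribution(true_probs, m, rng):
    counts = [0] * len(true_probs)
    for _ in range(m):
        action = rng.choices(range(len(true_probs)), weights=true_probs)[0]
        counts[action] += 1
    return [c / m for c in counts]

rng = random.Random(0)
for m in (2, 10, 15):
    estimates = [estimate_action_distribution([0.65, 0.35], m, rng)[0] for _ in range(1000)]
    spread = max(estimates) - min(estimates)
    print(f"m={m:2d}: empirical P(action 0) varies over a range of {spread:.2f}")
```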
We analyze the behavior of a learning agent using R-max exploration against opponents that use pure (deterministic) and mixed (stochastic) strategies. We measure how often the model is updated during the repeated game, while keeping track of the rewards.
Table C.4: R-max learning against (a) pure and (b) mixed strategies in the battle of the sexes game. Results
are the average of 100 iterations. Each game consists of 1500 rounds.
(a)
m Average Rewards
2 53.843 ± 0.017
5 53.646 ± 0.021
8 53.442 ± 0.024
10 53.318 ± 0.025
12 53.173 ± 0.028
15 52.970 ± 0.023
21 52.461 ± 0.015
(b)
m Average Rewards
2 57.712 ± 11.077
5 60.529 ± 7.077
8 61.832 ± 4.827
10 62.145 ± 4.139
12 61.859 ± 4.068
15 62.145 ± 2.354
21 60.998 ± 2.121
The BoS game was selected with $v_1 = 100$, $v_2 = 54$. In this case the opponents are stationary and the game consists of 1500 rounds. Results are the average of 10 iterations. In Figure C.1 we depict the percentage of model updates while playing against (a) a pure Nash strategy [1.0, 0.0] and (b) a mixed Nash strategy [0.65, 0.35], using different values of the parameter m. In Table C.4 we present the corresponding average rewards for those values of m. From these results we note that learning a pure strategy is much faster: using m = 2 yields the best scores, requiring about 20 rounds to learn a complete model. In contrast, when learning against a mixed strategy, the best scores were obtained with m = 10 and m = 15, which means that it takes more than 200 rounds to learn a model that achieves the maximum reward.
Note that the best value of m is different for pure and mixed strategies, and this must be taken into account when facing non-stationary opponents that use both kinds of strategies. Also note that there is a tradeoff: a large value of m ensures that a correct model is learned (thus obtaining a value closer to the maximum reward), but it takes more time before we can be certain that the model has been learned correctly and can start exploiting it. For example, using m = 15 provides the maximum reward with a lower standard deviation but needs approximately 300 rounds to learn a model completely, whereas m = 10 yields on average the same reward and needs only 200 rounds to learn the model.