Intelligent Agents: Technology and Applications
Multi-agent Learning
IST 597B
Spring 2003
John Yen
Learning Objectives
How to identify goals for agent projects? How to design agents? How to identify risks/obstacles early on?
Multi-Agent Learning
The learned behavior can be used as a basis for more complex interactive behavior
Enables agent to participate in higher level collaborative or adversarial learning situations
Learning would not be possible if the agent was isolated
Examples
Examples of single-agent learning in a multi-agent environment:
1. Reinforcement Learning agent which incorporates information gathered by another agent (Tan, 93)
2. Agent learning negotiating techniques of another using Bayesian Learning (Zeng & Sycara, 96)
Class of multi-agent learning in which an agent attempts to model another agent
Examples
Training scenario in which a novice agent learns from a knowledgeable agent (Clouse, 96)
What all the examples have in common is that the learning agent interacts with other agents
Predator/Prey (Pursuit) Domain
Introduced by Benda et al. (86)
Four predators and one prey
Goal: to capture (or surround) the prey
Not a complex real-world domain, but a toy domain that helps concretize concepts
Taxonomy of MAS
Taxonomy organized along
– the degree of heterogeneity, and
– the degree of communication
1. Homogenous, Non-Communicating Agents
2. Heterogeneous, Non-Communicating Agents
3. Homogenous, Communicating Agents
4. Heterogeneous, Communicating Agents
1. Homogenous, Non-Communicating Agents
All agents have the same internal structure
• Goals
• Domain knowledge
• Actions
The only difference is their sensory input and the actions that they take
• They are situated differently in the world
Korf (1992) introduces a policy for each predator based on an attractive force to the prey and a repulsive force from other predators
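A minimal sketch of a force-style greedy rule in the spirit of the policy above. It is not Korf's exact evaluation function: the Manhattan distance, the weight `k`, the move set, and the names `score` and `greedy_move` are all assumptions made for illustration.

```python
# Hypothetical grid version: positions are (x, y) tuples.
MOVES = [(0, 0), (0, 1), (0, -1), (1, 0), (-1, 0)]  # stay, up, down, right, left


def manhattan(p, q):
    """Grid distance between two cells."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def score(pos, prey, others, k=10.0):
    """Higher is better: far from other predators (repulsion),
    close to the prey (attraction, weighted by the assumed factor k)."""
    repulsion = sum(manhattan(pos, o) for o in others)
    attraction = k * manhattan(pos, prey)
    return repulsion - attraction


def greedy_move(pos, prey, others, k=10.0):
    """Pick the neighbouring free cell with the best score."""
    candidates = [(pos[0] + dx, pos[1] + dy) for dx, dy in MOVES]
    candidates = [c for c in candidates if c != prey and c not in others]
    return max(candidates, key=lambda c: score(c, prey, others, k))
```

Each predator can run `greedy_move` on its own sensory input, with no communication. For example, `greedy_move((0, 0), (3, 0), [(0, 1)])` steps toward the prey while moving away from the neighbouring predator.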
1. Homogenous, Non-Communicating Agents
Korf concludes that explicit cooperation is not necessary
Haynes & Sen show that Korf’s heuristic does not work for certain instantiations of the domain
1. Homogenous, Non-Communicating Agents
Issues:
1. Reactive vs. deliberative agents
2. Local vs. global perspective
3. Modeling of other agents
4. How to affect others
5. Further learning opportunities
1: Reactive vs. Deliberative Agents
Reactive agents do not maintain an internal state and simply retrieve pre-set behaviors
Deliberative agents maintain an internal state and behave by searching through a space of behaviors, predicting the action of other agents and the effect of actions
2: Local vs. Global Perspective
How much sensory input should be available to agents? (observability)
Having a global view might lead to sub-optimal results
Better performance by agents with less knowledge: “Ignorance is Bliss”
3: Modeling of Other Agents
Since agents are identical, they can predict each other's actions given the sensory input
Recursive Modeling Method: to model the internal state of another agent in order to predict its actions
Each predator bases its move on the predicted move of other predators and vice versa
Since reasoning can recurse indefinitely, it should be limited in terms of time or recursion
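A toy sketch of depth-limited recursive reasoning, to show why the recursion has to be cut off. This is not the RMM formalism itself: the one-dimensional world, the rule "avoid the target the other agent is predicted to take", and the name `predict_move` are assumptions for illustration.

```python
def predict_move(me, other, targets, depth):
    """Return the index of the target that the agent at position `me` heads for,
    modeling the agent at position `other` to `depth` levels of recursion."""
    nearest = min(range(len(targets)), key=lambda i: abs(targets[i] - me))
    if depth == 0:
        # No modeling at all: just go for the nearest target.
        return nearest
    # Predict the other agent's choice with one level less of recursion,
    # then pick the nearest target it is not predicted to take.
    taken = predict_move(other, me, targets, depth - 1)
    free = [i for i in range(len(targets)) if i != taken]
    return min(free, key=lambda i: abs(targets[i] - me))


targets = [2, 10]
choices = [predict_move(0, 1, targets, d) for d in range(4)]
```

In this toy setup the predicted choice flips with every extra level of recursion, so deeper reasoning never settles on its own. The bound on `depth` is what makes the agent commit to a move.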
3: Modeling of Other Agents
If agents know too much, RMM could recurse indefinitely
For coordination to be possible, some potential knowledge must be ignored
Schmidhuber (1996) shows that agents can cooperate without modeling each other
They consider each other as part of the environment
4: How to Affect Others
Without communication, agents cannot affect each other directly
Can affect each other indirectly in several ways
1. They can be sensed by other agents
2. Change the state of another agent (e.g. by pushing it)
3. Affect each other by stigmergy (Beckers et al., 94)
4: How to Affect Others
Active stigmergy:
– an agent alters the environment so as to affect the sensory input of another agent. E.g. an agent might leave a marker for other agents to observe
Passive stigmergy:
– an agent alters the environment so that the effect of another agent's actions changes. E.g. if an agent turns off the main water valve of a building, the effect of another agent turning on the faucet is altered
4: How to Affect Others
Example: A number of robots in an area with many pucks scattered around. Robots reactively move straight (turning at walls) until they are pushing 3 or more pucks. Then they back up and turn away
Although robots do not communicate, they can collect the pucks in a single pile over time
When a robot approaches an existing pile, it adds the pucks and turns away
A robot approaching an existing pile obliquely might take a puck away, but over time the desired result is accomplished
5: Further Learning Opportunities
An agent might try to learn to take actions that will not help it directly in the current situation, but may allow other agents to be more effective in the future.
In traditional RL, if an action leads to a reward by another agent, the acting agent may have no way of reinforcing that action
2. Heterogeneous, Non-Communicating Agents
Can be heterogeneous in any of the following:
• Goals
• Actions
• Domain knowledge
In the pursuit domain, the prey can be modeled as an agent
Haynes et al. have used GA and case-based reasoning to make predators learn to cooperate in the absence of communication
2. Heterogeneous, Non-Communicating Agents
They also explore the possibility of evolving both predators and the prey
• Predators use Korf’s greedy heuristic
Though one might think this will result in repeated improvement of predator and prey with no convergence, a prey behavior emerges that always succeeds
• Prey simply moves in a constant straight line
Haynes et al. conclude that Korf’s greedy algorithm relies on random prey movement
2. Heterogeneous, Non-Communicating Agents
Issues:
1. Benevolence vs. competitiveness
2. Fixed vs. learning agents
3. Modeling of other agents
4. Resource management
5. Social conventions
1: Benevolence vs. Competitiveness
Agents can be benevolent even if they have different goals (if they are willing to help each other)
Selfish agents: more effective and biologically plausible
Agents cooperate because it is in their own best interest
1: Benevolence vs. Competitiveness
Prisoner's dilemma: two burglars are captured. Each has to choose whether or not to confess and implicate the other. If neither confesses, they will both serve 1 year. If both confess, they will both serve 10 years. If one confesses and the other does not, the one who confessed will go free and the other will serve 20 years
1: Benevolence vs. Competitiveness
Each agent will decide to confess to maximize its own interest
If both confess, they will get 10 years each
If they had acted “irrationally” and kept quiet, they would each get 1 year
Mor et al. (1995) show that in the repeated prisoner’s dilemma cooperative behavior can emerge
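The reasoning above can be checked mechanically. The sketch below encodes the sentences from the slide (1, 10, 0 and 20 years) and computes each burglar's best response. The names `YEARS` and `best_response` are chosen here for illustration.

```python
# YEARS[(my_move, other_move)] -> (my_years, other_years), taken from the slide.
YEARS = {
    ("quiet", "quiet"): (1, 1),
    ("quiet", "confess"): (20, 0),
    ("confess", "quiet"): (0, 20),
    ("confess", "confess"): (10, 10),
}


def best_response(other_move):
    """Move that minimizes my own sentence, given the other burglar's move."""
    return min(("quiet", "confess"), key=lambda m: YEARS[(m, other_move)][0])


# Confessing is the best response to either move (a dominant strategy) ...
dominant = {other: best_response(other) for other in ("quiet", "confess")}
# ... yet mutual confession is worse for both than mutual silence.
both_confess = YEARS[("confess", "confess")]
both_quiet = YEARS[("quiet", "quiet")]
```

Whatever the other burglar does, confessing shortens one's own sentence. Two self-interested agents therefore end up with 10 years each instead of 1.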
1: Benevolence vs. Competitiveness
In zero-sum games cooperation is not sensible
If a third dimension were to be added to the taxonomy, besides the degree of heterogeneity and communication, it would be benevolence vs. competitiveness
2: Fixed vs. Learning Agents
Learning agents desirable in dynamic environments
Competitive vs. cooperative learning
Possibility of “arms race” in competitive learning. Competing agents continually adapt to each other in more and more specialized ways, never stabilizing at a good behavior
2: Fixed vs. Learning Agents
Credit-assignment problem: when the performance of an agent improves, it is not clear whether the improvement is due to an improvement in the agent’s behavior or to a deterioration in the opponent’s behavior. The same problem arises if the performance of an agent gets worse.
One solution is to fix one agent while allowing the other to learn, and then to switch. This encourages an arms race more than ever!
3: Modeling of other agents
Goals, actions and domain knowledge of other agents may be unknown and need modeling
Without communication, modeling is done strictly through observation
RMM is good for modeling the states of homogenous agents
Tambe (1995) takes it one step further, studying how agents can learn models of teams of agents
4: Resource Management
Examples:
– Network traffic problem: several agents send information through the same network (GA)
– Load balancing: several users have a limited amount of computing power to share among them (RL)
Braess’ Paradox (Glance et al., 1995): adding more resources to a network can result in worse performance
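A worked instance of the paradox, as a sketch. The numbers are a common textbook-style road network and are not taken from Glance et al.: 4000 drivers, two routes, each with one congestible link costing n/100 minutes and one wide link costing a fixed 45 minutes. All names and constants below are assumptions for illustration.

```python
DRIVERS = 4000
FIXED = 45.0  # minutes on a wide link, independent of load


def congested(n):
    """Minutes on a narrow link carrying n drivers."""
    return n / 100.0


def times_without_shortcut(n_top):
    """(top, bottom) travel times when n_top drivers take Start-A-End
    and the rest take Start-B-End."""
    n_bottom = DRIVERS - n_top
    top = congested(n_top) + FIXED        # Start->A narrow, A->End wide
    bottom = FIXED + congested(n_bottom)  # Start->B wide, B->End narrow
    return top, bottom


def time_with_shortcut_all():
    """Everyone drives Start->A (narrow), a free A->B shortcut, B->End (narrow)."""
    return congested(DRIVERS) + 0.0 + congested(DRIVERS)


def time_if_one_driver_avoids_shortcut():
    """A lone deviator takes Start->A (still fully loaded) and then the wide A->End link."""
    return congested(DRIVERS) + FIXED


before = times_without_shortcut(DRIVERS // 2)  # even split: neither route is faster
after = time_with_shortcut_all()
```

With the even split every driver needs 65 minutes. After the free shortcut is added, each driver individually prefers it, everybody ends up at 80 minutes, and a single driver who avoids it would need 85, so nobody switches back.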
5: Social Conventions
Imagine you are to meet a friend in Paris. You both arrive on the same day but were unable to get in touch to set a time and place. Where will you go and when?
75% of the audience at the AAAI-95 Symposium on Active Learning answered (without prior communication) that they would go to the Eiffel Tower at noon.
Even without communication agents are able to coordinate actions
3. Homogenous, Communicating Agents
Communication can be either broadcast or point-to-point
Issues:
1. Distributed sensing
– Distributed vision project (Matsuyama, 1997)
– Trafficopter system (Moukas et al., 1997)
2. Communication content
– What should they communicate: states or goals?
3. Further learning opportunities
– When to communicate?
4. Heterogeneous, Communicating Agents
Tradeoff between cost and freedom
Osawa suggests predators should go through 4 phases:
– Autonomy, communication, negotiation, and control
– When they stop making progress using one strategy, they should move to the next most expensive strategy
The phases are listed in increasing order of cost (decreasing order of freedom)
4. Heterogeneous, Communicating Agents
Important issues:
1. Understanding each other
2. Planning communication acts
3. Negotiation
4. Commitment/decommitment
5. Further learning opportunities
1: Understanding Each Other
Need some set protocol for communication
Aspects of the protocol:
1. Information content: KIF (Genesereth, 92)
2. Message Format: KQML (Finin, 94)
3. Coordination: COOL (Barbuceanu, 95)
2: Planning Communication Acts
The theory of communication as action is called speech acts
Communication acts have preconditions and effects
Effects might be to alter an agent’s belief about the state of another agent or agents
3: Negotiation
Design negotiating MAS based on law of supply and demand
1. Contract nets (Smith, 1980):
• Agents have their own goals, are self-interested, and have limited reasoning resources. They bid to accept tasks from other agents and can then either perform the task or subcontract it to another agent. Agents must pay to contract out their tasks.
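A minimal sketch of one round of contract-net style bidding. It only illustrates announce, bid, award and pay. The `Agent` class, the naive "bid your own cost" rule and the lowest-bid award are assumptions, not part of the protocol as defined by its author.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    name: str
    cost: float           # hypothetical cost for this agent to perform the task
    balance: float = 0.0  # money the agent currently holds

    def bid(self, task):
        # Deliberately naive rule: ask exactly what the task costs this agent.
        return self.cost


def contract_out(manager, contractors, task):
    """Announce a task, collect bids, award it to the lowest bidder and pay."""
    bids = {c.name: c.bid(task) for c in contractors}
    winner = min(contractors, key=lambda c: bids[c.name])
    price = bids[winner.name]
    manager.balance -= price  # the manager pays to contract out the task
    winner.balance += price
    return winner.name, price


manager = Agent("manager", cost=9.0, balance=10.0)
workers = [Agent("a", cost=4.0), Agent("b", cost=3.0)]
result = contract_out(manager, workers, task="deliver-report")
```

A winning contractor could call `contract_out` itself to subcontract the task, which is how chains of contracts arise.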
3: Negotiation
MAS controlling air temperature in different rooms of a building:
• An agent can set the thermostat to any temperature. Depending on the actual air temperature, the agent can ‘buy’ hot or cold air from another room that has an excess. At the same time the agent can sell the excess air at the current temperature to other rooms. Modeling the loss of heat in transfer from one room to another, the agents try to buy and sell at the best possible prices.
4: Commitment/Decommitment
Agent agrees to pursue a given goal regardless of how much it serves its own interest
Commitments can make the systems run more smoothly by making agents trust each other
It is unclear how to get self-interested agents to commit to others
Belief/desire/intention (BDI) is a popular technique for modeling other agents
– Used in OASIS: air traffic control
5: Further Learning Opportunities
Instead of predefining a protocol, allow the agents to learn for themselves what to communicate and how to interpret it
Possible result would be more efficient communication
Q Learning
Assess state action pairs (s, a) using a Q value
Learn the Q value using rewards/feedback
A reward received at time t is discounted back to previous state-action pairs (using a discount factor)
Goal of learning is to find an optimal policy for selecting actions.
The Q value
$$Q^*(x, a) = R(x, a) + \gamma \sum_{y} P_{xy}(a)\, V^*(y)$$
R: Reward
P_xy(a): The probability of reaching state y from x by taking action a.
Gamma: Discount factor (between 0 and 1).
V*(y): The expected total discounted return starting in y and following the optimal policy π*.
Policy: a rule for selecting an action in each state.
$$V^*(x) = \max_{a} Q^*(x, a)$$
The Expected Total Discounted Return V* for a state is the maximal Q value among all actions that can be taken at the state (following the optimal policy thereafter).
Learning Rule for Q value
$$Q^*(x, a) \leftarrow (1 - \alpha)\, Q^*(x, a) + \alpha \left( r + \gamma V^*(y) \right)$$
Alpha: learning rate
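A minimal tabular sketch of the learning rule, assuming a learning rate alpha, a discount factor gamma, and V(y) taken as the largest Q value in the new state y. The nested-dictionary table and the name `q_update` are choices made here for illustration.

```python
def q_update(Q, x, a, r, y, alpha=0.5, gamma=0.9):
    """Apply Q(x,a) <- (1 - alpha) * Q(x,a) + alpha * (r + gamma * V(y)),
    where V(y) is the best Q value available in the new state y."""
    v_y = max(Q[y].values())
    Q[x][a] = (1 - alpha) * Q[x][a] + alpha * (r + gamma * v_y)
    return Q[x][a]


# Hypothetical two-state table: Q[state][action]
Q = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 1.0, "right": 2.0},
}
new_value = q_update(Q, "s0", "right", r=1.0, y="s1")
```

Here the agent took "right" in `s0`, received a reward of 1 and landed in `s1`, so the old estimate is blended half-and-half with 1 + 0.9 * 2.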
1. Initialize $Q(x, a) = 0$ and $Tr(x, a) = 0$ for all x and a
2. Do Forever:
(a) $x_t \leftarrow$ the current state
(b) Choose an action $a_t$ that maximizes $Q(x_t, a)$ over all a
(c) Carry out action $a_t$ in the world. Let the short term reward be $r_t$, and the new state be $x_{t+1}$
(d) $e'_t = r_t + \gamma V_t(x_{t+1}) - Q_t(x_t, a_t)$
(e) $e_t = r_t + \gamma V_t(x_{t+1}) - V_t(x_t)$
(f) For each state-action pair (x, a) do
    $Tr(x, a) \leftarrow \gamma \lambda\, Tr(x, a)$
    $Q_{t+1}(x, a) \leftarrow Q_t(x, a) + \alpha\, Tr(x, a)\, e_t$
(g) $Q_{t+1}(x_t, a_t) \leftarrow Q_{t+1}(x_t, a_t) + \alpha\, e'_t$
(h) $Tr(x_t, a_t) \leftarrow Tr(x_t, a_t) + 1$
Tr: eligibility trace of a state-action pair
Lambda: trace decay parameter (between 0 and 1)
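Steps (d) to (h) of the loop above can be sketched as one function. The Greek symbols did not survive in the slide text, so this sketch assumes the usual reading: discount factor gamma, trace decay lambda (`lam`), learning rate alpha, and V(x) as the largest Q value in state x. The list-of-lists tables and the name `q_lambda_step` are assumptions as well.

```python
def q_lambda_step(Q, Tr, x, a, r, x_next, alpha=0.1, gamma=0.9, lam=0.8):
    """One pass through steps (d)-(h) for the transition (x, a, r, x_next).
    Q and Tr are lists indexed as [state][action] and are updated in place."""
    v_next = max(Q[x_next])                 # V_t(x_{t+1})
    v_cur = max(Q[x])                       # V_t(x_t)
    e_prime = r + gamma * v_next - Q[x][a]  # (d)
    e = r + gamma * v_next - v_cur          # (e)
    for s in range(len(Q)):                 # (f) decay traces, propagate e
        for b in range(len(Q[s])):
            Tr[s][b] *= gamma * lam
            Q[s][b] += alpha * Tr[s][b] * e
    Q[x][a] += alpha * e_prime              # (g)
    Tr[x][a] += 1.0                         # (h)


# Two states, two actions, everything initialized to zero (step 1).
Q = [[0.0, 0.0], [0.0, 0.0]]
Tr = [[0.0, 0.0], [0.0, 0.0]]
q_lambda_step(Q, Tr, x=0, a=1, r=1.0, x_next=1)
q_lambda_step(Q, Tr, x=1, a=0, r=0.0, x_next=0)
```

After the second call, the pair visited first (state 0, action 1) is updated again through its decayed trace, which is how a later error reaches earlier state-action pairs.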
$$p(a_i \mid x) = \frac{e^{Q(x, a_i)/T}}{\sum_{k \in \text{actions}} e^{Q(x, a_k)/T}}$$
Probability for the agent to select action ai based on Q values
T: “temperature” parameter to determine the randomness of decisions.
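A small sketch of this selection rule. Subtracting the largest Q value before exponentiating is a numerical-stability step added here and does not change the probabilities. The function names are hypothetical.

```python
import math
import random


def boltzmann_probabilities(q_values, temperature):
    """p(a_i | x) proportional to exp(Q(x, a_i) / T)."""
    m = max(q_values)  # shift for numerical stability
    weights = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def select_action(q_values, temperature, rng=random):
    """Sample an action index according to the Boltzmann probabilities."""
    probs = boltzmann_probabilities(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


q = [1.0, 2.0, 0.5]
hot = boltzmann_probabilities(q, temperature=100.0)  # close to uniform
cold = boltzmann_probabilities(q, temperature=0.01)  # close to greedy
action = select_action(q, temperature=1.0)
```

A high temperature makes the choice nearly random (exploration). A low temperature makes the agent almost always take the action with the highest Q value (exploitation).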
Towards Collaborative and Adversarial Learning: A Case Study in Robotic Soccer
Peter Stone & Manuela Veloso
Introduction
Layered learning, to develop complex multi-agent behaviors from simple ones
Simple multi-agent behavior in Robotic Soccer, to shoot a moving ball
• Passer
• Shooter
Behavior to be learnt: When the shooter should begin to move (shooting policy)
Simple Behavior
Parameters
1. Ball speed (fixed vs. variable)
2. Ball trajectory (fixed vs. variable)
3. Goal location (fixed vs. variable)
4. Action quadrant (fixed vs. variable)
Fixed Ball Motion
Simple shooting policy: begin accelerating when the ball's distance to its projected point of intersection with the agent’s path reaches 110 units
• 100% success rate if shooter position fixed
• 61% success rate if shooter position variable
Use a neural network (NN)
Inputs to NN (coordinate independent):
• Ball distance
• Agent distance
• Heading offset
Output: 1 or 0 (shot successful or not)
Use random shooting policy for training
Neural Network
Results
Varying Ball Speed
Add a fourth input to NN, Ball Speed
Varying Ball’s Trajectory
Use the same shooting policy
Use another NN to determine the direction the shooter should steer (shooter’s aiming policy)
Moving the Goal
Can think of it as aiming for different parts of the goal
Change nothing but the shooter’s knowledge of the goal location
Cooperative Learning
Passing a moving ball
• Passer: where to aim the pass
• Shooter: where to position itself
Adversarial Learning
References
Peter Stone, Manuela Veloso, 2000, “Multi-Agent Systems: A Survey from a Machine Learning Perspective”
Ming Tan, 1993, “Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents”
Peter Stone, Manuela Veloso, 1998, “Towards Collaborative and Adversarial Learning: A Case Study in Robotic Soccer”