intelligent agents: technology and applications

Intelligent Agents: Technology and Applications

Multi-Agent Learning

IST 597B

Spring 2003

John Yen

Learning Objectives

How to identify goals for agent projects? How to design agents? How to identify risks/obstacles early on?

Multi-Agent Learning


The learned behavior can be used as a basis for more complex interactive behavior

Enables agent to participate in higher level collaborative or adversarial learning situations

Learning would not be possible if the agent was isolated

Examples

Examples of single-agent learning in a multi-agent environment:

1. Reinforcement Learning agent which incorporates information gathered by another agent (Tan, 93)

2. Agent learning negotiating techniques of another using Bayesian Learning (Zeng & Sycara, 96)

Class of multi-agent learning in which an agent attempts to model another agent

Examples

Training scenario in which a novice agent learns from a knowledgeable agent (Clouse, 96)

What all these examples have in common is that the learning agent interacts with other agents

Predator/Prey (Pursuit) Domain

Introduced by Benda et al. (86)

Four predators and one prey

Goal: to capture (or surround) the prey

Not a complex real-world domain, but a toy domain that helps concretize concepts
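The capture condition can be made concrete with a small sketch. It assumes a non-wrapping square grid and reads "surround" as predators occupying all four orthogonal neighbors of the prey; the name `is_captured` and the (x, y) tuple representation are illustrative choices, not part of the original domain definition.

```python
def is_captured(prey, predators):
    """Return True if every orthogonal neighbor of the prey holds a predator.

    prey: (x, y) grid cell; predators: iterable of (x, y) grid cells.
    Assumption: capture means being surrounded on all four sides.
    """
    x, y = prey
    neighbors = {(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)}
    return neighbors <= set(predators)


if __name__ == "__main__":
    print(is_captured((2, 2), [(1, 2), (3, 2), (2, 1), (2, 3)]))
    print(is_captured((2, 2), [(1, 2), (3, 2), (2, 1), (0, 0)]))
```

Other variants of the domain (diagonal moves, wrap-around worlds) would need a different neighbor set.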

Taxonomy of MAS

Taxonomy organized along:

– the degree of heterogeneity, and

– the degree of communication

1. Homogeneous, Non-Communicating Agents

2. Heterogeneous, Non-Communicating Agents

3. Homogeneous, Communicating Agents

4. Heterogeneous, Communicating Agents



1. Homogeneous, Non-Communicating Agents

All agents have the same internal structure:

• Goals

• Domain knowledge

• Actions

The only difference is their sensory input and the actions that they take

• They are situated differently in the world

Korf (1992) introduces a policy for each predator based on an attractive force to the prey and a repulsive force from other predators

Korf concludes that explicit cooperation is not necessary

Haynes & Sen show that Korf’s heuristic does not work for certain instantiations of the domain
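A force-based greedy policy of this kind might be sketched as follows. This is only loosely in the spirit of the description above, not Korf's exact formulation: the weight `k`, the five-move action set, and the use of Euclidean distance are all assumptions made for illustration.

```python
from math import hypot

# Assumed action set: stay, or step one cell in an orthogonal direction.
MOVES = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]


def score(pos, prey, others, k=0.1):
    """Higher is better: near the prey, far from the nearest other predator."""
    attraction = -hypot(pos[0] - prey[0], pos[1] - prey[1])
    if others:
        repulsion = min(hypot(pos[0] - o[0], pos[1] - o[1]) for o in others)
    else:
        repulsion = 0.0
    return attraction + k * repulsion  # k is a hypothetical weighting factor


def greedy_move(me, prey, others, k=0.1):
    """Pick the neighboring cell with the best score; no lookahead, no communication."""
    candidates = [(me[0] + dx, me[1] + dy) for dx, dy in MOVES]
    return max(candidates, key=lambda p: score(p, prey, others, k))


if __name__ == "__main__":
    print(greedy_move((0, 0), (3, 0), [(0, 5)]))
```

Each predator runs the same function on its own sensory input, which is what makes the agents homogeneous and the cooperation implicit.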


Issues:

1. Reactive vs. deliberative agents

2. Local vs. global perspective

3. Modeling of other agents

4. How to affect others

5. Further learning opportunities

1: Reactive vs. Deliberative Agents

Reactive agents do not maintain an internal state and simply retrieve pre-set behaviors

Deliberative agents maintain an internal state and behave by searching through a space of behaviors, predicting the action of other agents and the effect of actions

2: Local vs. Global Perspective

How much sensory input should be available to agents? (observability)

Having a global view might lead to sub-optimal results

Better performance by agents with less knowledge: “Ignorance is Bliss”

3: Modeling of Other Agents

Since agents are identical, they can predict each other’s actions given the sensory input

Recursive Modeling Method: to model the internal state of another agent in order to predict its actions

Each predator bases its move on the predicted move of other predators and vice versa

Since reasoning can recurse indefinitely, it should be limited in terms of time or recursion

3: Modeling of Other Agents

If agents know too much, RMM could recurse indefinitely

For coordination to be possible, some potential knowledge must be ignored

Schmidhuber (1996) shows that agents can cooperate without modeling each other

They consider each other as part of the environment

4: How to Affect Others

Without communication, agents cannot affect each other directly

Can affect each other indirectly in several ways

1. They can be sensed by other agents

2. Change the state of another agent (e.g. by pushing it)

3. Affect each other by stigmergy (Beckers et al., 94)

4: How to Affect Others

Active stigmergy:

– an agent alters the environment so as to affect the sensory input of another agent. E.g. an agent might leave a marker for other agents to observe

Passive stigmergy:

– altering the environment so that the effect of another agent’s actions changes. If an agent turns off the main water valve of a building, the effect of another agent turning on the faucet is altered

4: How to Affect Others

Example: A number of robots in an area with many pucks scattered around. Robots reactively move straight (turning at walls) until they are pushing 3 or more pucks. Then they back up and turn away

Although robots do not communicate, they can collect the pucks in a single pile over time

When a robot approaches an existing pile, it adds the pucks and turns away

A robot approaching an existing pile obliquely might take a puck away, but over time the desired result is accomplished

5: Further Learning Opportunities

An agent might try to learn to take actions that will not help it directly in the current situation, but may allow other agents to be more effective in the future.

In traditional RL, if an action leads to a reward for another agent, the acting agent may have no way of reinforcing that action


2. Heterogeneous, Non-Communicating Agents

Agents can be heterogeneous in any of the following:

• Goals

• Actions

• Domain knowledge

In the pursuit domain, the prey can be modeled as an agent

Haynes et al. have used GA and case-based reasoning to make predators learn to cooperate in the absence of communication


They also explore the possibility of evolving both predators and the prey

• Predators use Korf’s greedy heuristic

Though one might think this will result in repeated improvement of predator and prey with no convergence, a prey behavior emerges that always succeeds

• Prey simply moves in a constant straight line

Haynes et al. conclude that Korf’s greedy algorithm relies on random prey movement


Issues:

1. Benevolence vs. competitiveness

2. Fixed vs. learning agents

3. Modeling of other agents

4. Resource management

5. Social conventions

1: Benevolence vs. Competitiveness

Agents can be benevolent even if they have different goals (if they are willing to help each other)

Selfish agents: more effective and biologically plausible

Agents cooperate because it is in their own best interest

1: Benevolence vs. Competitiveness

Prisoner’s dilemma: two burglars are captured. Each has to choose whether or not to confess and implicate the other. If neither confesses, they will both serve 1 year. If both confess, they will both serve 10 years. If one confesses and the other does not, the one who confessed will go free and the other will serve 20 years


1: Benevolence vs. Competitiveness

Each agent will decide to confess to maximize its own interest

If both confess, they will get 10 years each

If they had acted “irrationally” and kept quiet, they would each get 1 year

Mor et al. (1995) show that in the repeated prisoner’s dilemma cooperative behavior can emerge
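Why confessing is the self-interested choice can be checked mechanically. The sketch below encodes the sentences from the burglar story (1, 10, 0 and 20 years) and computes each agent's best response to either choice by the other; the names `YEARS`, `CHOICES`, and `best_response` are illustrative.

```python
# Years served by "me" for each (my_choice, other_choice); lower is better.
YEARS = {
    ("quiet", "quiet"): 1,
    ("quiet", "confess"): 20,
    ("confess", "quiet"): 0,
    ("confess", "confess"): 10,
}
CHOICES = ("quiet", "confess")


def best_response(other_choice):
    """Return the choice that minimizes my own sentence, given the other's choice."""
    return min(CHOICES, key=lambda mine: YEARS[(mine, other_choice)])


if __name__ == "__main__":
    for other in CHOICES:
        print(f"other plays {other!r} -> my best response is {best_response(other)!r}")
```

Confessing is the best response in both cases, so two self-interested agents end up with 10 years each even though mutual silence would have cost only 1 year each.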

1: Benevolence vs. Competitiveness

In zero-sum games cooperation is not sensible

If a third dimension were to be added to the taxonomy, besides the degree of heterogeneity and communication, it would be benevolence vs. competitiveness

2: Fixed vs. Learning Agents

Learning agents desirable in dynamic environments

Competitive vs. cooperative learning

Possibility of an “arms race” in competitive learning. Competing agents continually adapt to each other in more and more specialized ways, never stabilizing at a good behavior

2: Fixed vs. Learning Agents

Credit-assignment problem: when the performance of an agent improves, it is not clear whether the improvement is due to better behavior by the agent or worse behavior by the opponent. The same problem arises if the performance of an agent gets worse.

One solution is to fix one agent while allowing the other to learn, and then to switch. This encourages the arms race more than ever!

3: Modeling of other agents

Goals, actions and domain knowledge of other agents may be unknown and need modeling

Without communication, modeling is done strictly through observation

RMM is good for modeling the states of homogeneous agents

Tambe (1995) takes it one step further, studying how agents can learn models of teams of agents

4: Resource Management

Examples:

– Network traffic problem: several agents send information through the same network (GA)

– Load balancing: several users have a limited amount of computing power to share among them (RL)

Braess’ Paradox (Glance et al., 1995): adding more resources to a network can result in worse performance
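Braess' paradox can be illustrated with the standard textbook road network, which is an assumed example and not taken from Glance et al.: 4000 drivers travel from Start to End via A or via B. Start→A and B→End take n/100 minutes when n drivers use them; A→End and Start→B always take 45 minutes. All names and numbers in the sketch are part of that assumed example.

```python
N_DRIVERS = 4000        # assumed number of selfish drivers
FIXED_MINUTES = 45.0    # travel time on the uncongested links


def congested(n):
    """Travel time in minutes on a congestible link used by n drivers."""
    return n / 100.0


def times_without_shortcut(n_via_a):
    """Route times when n_via_a drivers go via A and the rest go via B."""
    n_via_b = N_DRIVERS - n_via_a
    via_a = congested(n_via_a) + FIXED_MINUTES
    via_b = FIXED_MINUTES + congested(n_via_b)
    return via_a, via_b


def times_with_shortcut():
    """Add a zero-cost link A->B and let every driver use Start->A->B->End."""
    shortcut = congested(N_DRIVERS) + 0.0 + congested(N_DRIVERS)
    # A single driver switching back to an old route still shares one congested link.
    deviate_via_a = congested(N_DRIVERS) + FIXED_MINUTES
    deviate_via_b = FIXED_MINUTES + congested(N_DRIVERS)
    return shortcut, deviate_via_a, deviate_via_b


if __name__ == "__main__":
    print(times_without_shortcut(2000))
    print(times_with_shortcut())
```

With an even split and no shortcut, both routes take 65 minutes. Once the free link exists, everyone taking it yields 80 minutes, and no single driver can do better by switching back (85 minutes), so the extra resource makes every driver worse off.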

5: Social Conventions

Imagine you are to meet a friend in Paris. You both arrive on the same day but were unable to get in touch to set a time and place. Where will you go and when?

75% of the audience at the AAAI-95 Symposium on Active Learning answered (without prior communication) that they would go to the Eiffel Tower at noon.

Even without communication agents are able to coordinate actions

3. Homogeneous, Communicating Agents

Communication can be either broadcast or point-to-point

Issues:

1. Distributed sensing

– Distributed vision project (Matsuyama, 1997)

– Trafficopter system (Moukas et al., 1997)

2. Communication content

– What should they communicate: states or goals?

3. Further learning opportunities

– When to communicate?

4. Heterogeneous, Communicating Agents

Tradeoff between cost and freedom

Osawa suggests predators should go through 4 phases:

– Autonomy, communication, negotiation, and control

– When they stop making progress using one strategy, they should move to the next, more expensive strategy

The phases are listed in increasing order of cost (decreasing order of freedom)

4. Heterogeneous, Communicating Agents

Important issues:

1. Understanding each other

2. Planning communication acts

3. Negotiation

4. Commitment/decommitment

5. Further learning opportunities

1: Understanding Each Other

Need some set protocol for communication

Aspects of the protocol:

1. Information content: KIF (Genesereth, 92)

2. Message Format: KQML (Finin, 94)

3. Coordination: COOL (Barbuceanu, 95)

2: Planning Communication Acts

The theory of communication as action is called speech acts

Communication acts have preconditions and effects

Effects might be to alter an agent’s belief about the state of another agent or agents

3: Negotiation

Design negotiating MAS based on law of supply and demand

1. Contract nets (Smith, 1980):

• Agents have their own goals, are self-interested, and have limited reasoning resources. They bid to accept tasks from other agents and can then either perform the task or subcontract it to another agent. Agents must pay to contract out their tasks.
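The bid-and-award cycle of a contract net might be sketched as below. This is a simplified illustration, not Smith's protocol in full: the `Bid` class, the `collect_bids` and `award` functions, and the lowest-price award rule are assumptions, and subcontracting is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bid:
    contractor: str
    price: float


def collect_bids(task, cost_estimates):
    """Gather one bid from every contractor that is able to do the task.

    cost_estimates maps contractor name -> {task name: asking price};
    a contractor with no entry for the task simply does not bid.
    """
    return [
        Bid(name, costs[task])
        for name, costs in cost_estimates.items()
        if task in costs
    ]


def award(bids):
    """Assumed award rule: the manager pays the cheapest bidder (None if no bids)."""
    return min(bids, key=lambda b: b.price) if bids else None


if __name__ == "__main__":
    estimates = {
        "agent_a": {"deliver": 7.0},
        "agent_b": {"deliver": 4.5, "scan": 2.0},
        "agent_c": {"scan": 1.0},
    }
    print(award(collect_bids("deliver", estimates)))
```

A winning contractor could in turn act as a manager and announce part of the task to others, which is how subcontracting would fit into this sketch.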

3: Negotiation

MAS controlling air temperature in different rooms of a building:

• An agent can set the thermostat to any temperature. Depending on the actual air temperature, the agent can ‘buy’ hot or cold air from another room that has an excess. At the same time the agent can sell the excess air at the current temperature to other rooms. Modeling the loss of heat in transfer from one room to another, the agents try to buy and sell at the best possible prices.

4: Commitment/Decommitment

Agent agrees to pursue a given goal regardless of how much it serves its own interest

Commitments can make the systems run more smoothly by making agents trust each other

Unclear how to get self-interested agents to commit to others

Belief/desire/intention (BDI) is a popular technique for modeling other agents

– Used in OASIS: air traffic control

5: Further Learning Opportunities

Instead of predefining a protocol, allow the agents to learn for themselves what to communicate and how to interpret it

Possible result would be more efficient communication

Q Learning

Assess state-action pairs (s, a) using a Q value

Learn the Q value using rewards/feedback

A reward received at time t is discounted back to previous state-action pairs (using a discount factor)

Goal of learning is to find an optimal policy for selecting actions.

The Q value:

$$Q^*(x, a) = R(x, a) + \gamma \sum_y P_{xy}(a) \, V^*(y)$$

R(x, a): Reward for taking action a in state x.

P_xy(a): The probability of reaching state y from x by taking action a.

Gamma (γ): Discount factor (between 0 and 1).

V*(y): The expected total discounted return starting in y and following the optimal policy π*.

Policy: a mapping from states to actions; following it yields a sequence of actions.

$$V^*(x) = \max_a Q^*(x, a)$$

The expected total discounted return V* for a state is the maximal Q value among all actions that can be taken at the state (following the rest of the policy).

Learning rule for the Q value, where r is the observed reward and y the resulting state:

$$Q^*(x, a) \leftarrow (1 - \alpha) \, Q^*(x, a) + \alpha \, \bigl(r + \gamma V^*(y)\bigr)$$

Alpha (α): learning rate
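The learning rule, Q(x, a) ← (1 − α) Q(x, a) + α (r + γ V(y)) with V(y) = max over actions of Q(y, ·), can be sketched as a tabular update. The dictionary-based table, the function name `q_update`, and the default values of `alpha` and `gamma` are assumptions for illustration.

```python
from collections import defaultdict


def q_update(Q, x, a, r, y, actions, alpha=0.5, gamma=0.9):
    """Apply one Q-learning update for the transition (x, a) -> y with reward r.

    Q maps (state, action) pairs to values; unseen pairs default to 0.
    """
    v_y = max(Q[(y, b)] for b in actions)  # V(y) = max_b Q(y, b)
    Q[(x, a)] = (1 - alpha) * Q[(x, a)] + alpha * (r + gamma * v_y)
    return Q[(x, a)]


if __name__ == "__main__":
    Q = defaultdict(float)
    actions = ["left", "right"]
    print(q_update(Q, "s0", "right", 1.0, "s1", actions))
```

Because the update only needs the observed reward and next state, the agent does not have to know R or the transition probabilities in advance.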

1. Q(x, a) = 0 and Tr(x, a) = 0 for all x and a

2. Do Forever:

(a) x_t ← the current state

(b) Choose an action a_t that maximizes Q(x_t, a) over all a

(c) Carry out action a_t in the world. Let the short-term reward be r_t, and the new state be x_{t+1}

(d) $e'_t = r_t + \gamma V_t(x_{t+1}) - Q_t(x_t, a_t)$

(e) $e_t = r_t + \gamma V_t(x_{t+1}) - V_t(x_t)$

(f) For each state-action pair (x, a) do

$Tr(x, a) = \gamma \lambda \, Tr(x, a)$

$Q_{t+1}(x, a) = Q_t(x, a) + \alpha \, Tr(x, a) \, e_t$

(g) $Q_{t+1}(x_t, a_t) = Q_{t+1}(x_t, a_t) + \alpha \, e'_t$

(h) $Tr(x_t, a_t) = Tr(x_t, a_t) + 1$
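One pass through the body of this loop (the error terms, the trace decay, and the two updates) might look like the sketch below. It assumes the traces decay by a factor γλ per step, as in standard Q(λ) formulations; the parameter name `lam`, the default values, and the dictionary-based tables are illustrative choices.

```python
from collections import defaultdict


def q_lambda_step(Q, Tr, x, a, r, y, actions, alpha=0.1, gamma=0.9, lam=0.8):
    """Update Q and the traces Tr for one observed transition (x, a, r, y)."""
    v_x = max(Q[(x, b)] for b in actions)
    v_y = max(Q[(y, b)] for b in actions)
    e_prime = r + gamma * v_y - Q[(x, a)]  # error for the pair just visited
    e = r + gamma * v_y - v_x              # error propagated along the traces

    # Decay every stored trace and update the corresponding Q value.
    # Pairs that were never visited have a zero trace and are unaffected.
    for key in list(Tr):
        Tr[key] *= gamma * lam             # assumed decay factor gamma * lambda
        Q[key] += alpha * Tr[key] * e

    Q[(x, a)] += alpha * e_prime
    Tr[(x, a)] += 1.0


if __name__ == "__main__":
    Q, Tr = defaultdict(float), defaultdict(float)
    acts = ["go", "stay"]
    q_lambda_step(Q, Tr, "s0", "go", 1.0, "s1", acts)
    q_lambda_step(Q, Tr, "s1", "go", 1.0, "s2", acts)
    print(dict(Q))
```

The second call also raises the value of the earlier pair ("s0", "go"), which is how a reward is passed back to previous state-action pairs.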

$$p(a_i \mid x) = \frac{e^{Q(x, a_i)/T}}{\sum_{k \in \text{actions}} e^{Q(x, a_k)/T}}$$

Probability for the agent to select action ai based on Q values

T: “temperature” parameter to determine the randomness of decisions.
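Temperature-based action selection can be sketched as a softmax over the Q values of the current state: each action's probability is proportional to exp(Q / T). The function names are illustrative, and subtracting the maximum Q value before exponentiating is an added numerical-stability step that leaves the probabilities unchanged.

```python
import math
import random


def boltzmann_probs(q_values, T=1.0):
    """Return selection probabilities proportional to exp(Q / T)."""
    m = max(q_values)  # shift by the max to avoid overflow; cancels in the ratio
    exps = [math.exp((q - m) / T) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(actions, q_values, T=1.0, rng=random):
    """Sample one action according to the Boltzmann probabilities."""
    return rng.choices(actions, weights=boltzmann_probs(q_values, T), k=1)[0]


if __name__ == "__main__":
    print(boltzmann_probs([1.0, 2.0], T=1.0))
    print(boltzmann_probs([1.0, 2.0], T=0.01))
    print(select_action(["left", "right"], [1.0, 2.0], T=1.0))
```

A high T makes the choice nearly uniform (more exploration), while a low T makes the agent pick the highest-valued action almost every time.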

Towards Collaborative and Adversarial Learning: A Case Study in Robotic Soccer

Peter Stone & Manuela Veloso

Introduction

Layered learning, to develop complex multi-agent behaviors from simple ones

Simple multi-agent behavior in Robotic Soccer, to shoot a moving ball

• Passer

• Shooter

Behavior to be learnt: When the shooter should begin to move (shooting policy)

Simple Behavior

Parameters

1. Ball speed (fixed vs. variable)

2. Ball trajectory (fixed vs. variable)

3. Goal location (fixed vs. variable)

4. Action quadrant (fixed vs. variable)


Fixed Ball Motion

Simple shooting policy: begin accelerating when the ball’s distance to its projected point of intersection with the agent’s path reaches 110 units

• 100% success rate if shooter position fixed

• 61% success rate if shooter position variable

Use a neural network. Inputs to the NN (coordinate independent):

• Ball distance

• Agent distance

• Heading offset

Output: 1 or 0 (shot successful or not)

Use a random shooting policy for training

Neural Network

Results

Varying Ball Speed

Add a fourth input to NN, Ball Speed

Varying Ball’s Trajectory

Use the same shooting policy

Use another NN to determine the direction the shooter should steer (shooter’s aiming policy)

Moving the Goal

Can think of it as aiming for different parts of the goal

Change nothing but the shooter’s knowledge of the goal location

Cooperative Learning

Passing a moving ball

• Passer: where to aim the pass

• Shooter: where to position itself


Adversarial Learning

References

Peter Stone, Manuela Veloso, 2000, “Multi-Agent Systems: A Survey from a Machine Learning Perspective”

Ming Tan, 1993, “Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents”

Peter Stone, Manuela Veloso, 1998, “Towards Collaborative and Adversarial Learning: A Case Study in Robotic Soccer”
