probabilistic plan recognition kathryn blackmond laskey department of systems engineering and...

44
Probabilistic Plan Recognition Kathryn Blackmond Laskey Department of Systems Engineering and Operations Research George Mason University Dagstuhl Seminar April 2011

Upload: georgiana-snow

Post on 17-Dec-2015

213 views

Category:

Documents


0 download

TRANSCRIPT

Probabilistic Plan Recognition

Kathryn Blackmond Laskey

Department of Systems Engineering and Operations Research

George Mason University

Dagstuhl Seminar April 2011

2

The problem of plan recognition is to take as input a sequence of actions performed

by an actor and to infer the goal pursued by the actor and also to organize the action

sequence in terms of a plan structureSchmidt, Sridharan and Goodson, 1978

…the problem of plan recognition is largely a problem of inference under

conditions of uncertainty.Charniak and Goldman, 1993

3

PPR in a Nutshell

Represent • set of possible plans• anticipated evidence for each plan

Specify• prior probabilities for plans • likelihood for evidence given plans

Infer plans using Bayes Rule

Bayes, Thomas. An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370- 418, 1763.

Thomas Bayes (1702-1762)

P(plan |obs)

=P(obs | plan)P(plan)

P(obs)

…or just directly specify P(plan|obs)

4

Why Probability?• Theoretically well-founded representation for relative

plausibility of competing explanations• Unified approach to inference and learning• Combine engineered and learned knowledge• Many general-purpose exact and approximate algorithms with

strong theoretical justification and practical success• Good results on many interesting problems• But…

– Inference and learning (exact and approximate) are NP hard

– Balancing tractability and expressiveness is a major research and engineering challenge

5

Representing Plans and Observations

• Plan recognition requires a computational representation of possible plans and observable evidence– Goals– Actions

• When executed in combination, actions are expected (with high probability) to achieve the goal

– Preconditions / postconditions of actions – Constraints

• Most notably, temporal ordering

– Observables• Actions may or may not be directly observable• Sometimes we observe effects of actions

– Hierarchical decomposition of the above

• For probabilistic plan recognition, we need to assign probabilities to these elements– Balance expressivity against tractability of inference & learning

6

Some Representations for PPR

• Bayesian networks• Hidden Markov Models / Dynamic Bayesian Networks• Plan Recognition Bayesian Networks / Probabilistic

Relational Models / Multi-Entity Bayesian Networks • Bayesian Abductive Logic Programs• Stochastic Grammars• Conditional Random Fields• Markov Logic Networks

Each of these formalisms can be thought of as a way of representing a set of “possible worlds” and defining a probability measure

on an algebra of subsets

7

Graphical Probability Models• Factorize joint distribution into factors involving only a few

variables each– Graph represents conditional independence assumptions– Local distributions specify probability information for small groups

of related variables– Factors are combined into joint distribution

• Drastically simplifies specification,inference and learning

• 20 possible goals, 100 possible actions– Fully general model 2.5x1031 probabilities– “Naïve Bayes model” 19x20x100=38,000 probabilities– If each goal has only 10 associated actions then “naïve Bayes

model” 19x10 = 190 probabilities– Naïve Bayes inference scales as #variables x #states/variable

G

A3A2

A1A4

A5

Naïve Bayes Model

8

Bayesian Network (BN)• Directed graph represents dependencies

• Joint distribution factors as

• Factored representation makes specification, inference and learning tractable for interesting classes of problems

• Directed graph naturally represents causality– Effects of intervention via “do” operator– Explaining away

127 probabilities 14 probabilities

Pr(X1,K ,Xn ) = Pr(X i | X pa( i))i=1

n

Nuclear Weapons (W)

StealMaterials (S)

BuyReactors (B)

RegionalDomination (R)

Exercise (E)

TroopsMoving (T)

Invasion (I)

Pr(R,E,I,W,T,B,S) = Pr(R)Pr(E)Pr(I|R)Pr(W|R)Pr(T|E,I)Pr(B|W)Pr(S|W)

9

Possible and Probable Worlds• “Traditional or deductive logic admits only three attitudes to any

proposition: definite proof, disproof, or blank ignorance.” (Jeffreys)• Semantics of classical logic is based on possible worlds

– Set of possible worlds defined by language, domain, and axioms– In propositional logic, possible worlds assign truth values to atoms

(e.g., R T; W T; E F)

• Probability theory– Set of possible worlds is called the sample space– Probability measure maps subsets to real numbers– Probability axioms are a natural extension of classical propositional logic to

likelihood

• BN combines propositional logic with probability

R E I W T B S PrT T T T T T T p1

T T T T T T F p2

T T T T T F T p3

Nuclear Weapons (W)

StealMaterials (S)

BuyReactors (B)

RegionalDomination (R)

Exercise (E)

TroopsMoving (T)

Invasion (I)

10

Other Factored Representations• Markov network: factorization specified by undirected

graph– More natural for domains without natural causal direction– Joint distribution factorizes as:

• Chain graph: factorization specified by graph with both directed and undirected edges

• Representations to exploit context-specific independence– Probability trees– Tree-structured parameterization for local distributions in a BN

P(X1,K ,Xn ) =1

ZφC (X1C ,K XkCC )

C

∏C indexes cliques in the graphxiC is ith variable in clique CkC is size of clique CZ is a normalization constant

11

Conditional Random Fields• Bayesian networks are generative models

– Represent joint probability over plans and observations– Realistic dependence models often yield intractable

inference

• Conditional (or discriminative) model directly represents probability of plans given observations– Can allow some dependencies to be relaxed

• CRFs are discriminative– Undirected graph represents local dependencies– Potential function represents strength of dependence

• A CRF is a family of MRFs (a mapping from observations to potentials)

12

Inference in Graphical Models• Exact inference

– E.g., Belief propagation, junction tree, bucket elimination, symbolic probabilistic inference, cutset conditioning

– Exploit graph structure / factorization to simplify computation– Infeasible for complex problems

• Approximate (deterministic)– E.g., Loopy BP, variational Bayes

• Approximate (stochastic)– E.g., Gibbs sampling, Metropolis-Hastings sampling, likelihood

weighting

• Combinations– E.g., Bidyuk and Dechter (2007) – cutset sampling

13

Belief Propagation for Singly Connected BNs

• Goal: compute probability distribution of random variable B given evidence (assume B itself is not known)

• Key idea: impact of belief in B from evidence "above" B and evidence "below" B can be processed separately

• Justification: B d-separates “above” random variables from “below” random variables

= evidence random variable

A1

A2 A3 A6

D5 B

D6 D1D2

A4 A5

D7

D3D4

?

Random variables “above” B

Random variables “below” B

p

p

l

• This picture depicts the updating process for one node. • The algorithm simultaneously updates beliefs for all the nodes.• Loopy BP applies BP to network with loops; often results in

good approximation

14

Likelihood Weighting (for BNs)1. Proceed through non-evidence

variables in order consistent with partial ordering induced by graph– Sample variable according to its local

probability distribution– Calculate weight proportional to

Pr(evidence | sampled values)

2. Repeat Step 1 until done

3. Estimate Pr(Variable=value) by weighted sample frequency

H

B

J

D

C

EG

F

K

A

L

15

Junction Tree Algorithm1. Compile BN into junction tree

– Tree of clusters of nodes– Has JT property: variable

belonging to 2 clusters must belong to all clusters along path connecting them

– Becomes part of the knowledge representation

– Changes only if the graph changes

2. Use local local message-passing algorithm to propagate beliefs in the junction tree

3. Query on any node or any set of nodes in same cluster can be computed from cluster joint distribution

H

B

J

D

C

EG

F

K

A

L

FGJK

JKL

ABC

DEGHJ

BCDEH

CDGEH

DFGHJ

16

Gibbs Sampling1. Initialize

– Evidence variables assigned to observed values

– Arbitrary value for other variables

2. Sample non-evidence nodes one at a time:– Sample with probability

Pr(variable | Markov blanket)– Replace with newly sampled value

3. Repeat Step 2 until done

4. Estimate Pr(Variable=value) by sample frequency

H

B

J

D

C

EG

F

K

A

L

H

B

J

D

C

EG

F

K

A

L

H

B

J

D

C

EG

F

K

A

L• Markov blanket

- In BN: parents, children, co-parents- In MN: neighbors

• Variable is conditionally independent of rest of network given its Markov blanket

17

Cutset Sampling (for BNs)

H

B

J

D

C

EG

F

K

A

L

H

B

J=j

D=d

C

EG

F

K

A

L

• Find a loop cutset• Initialize cutset variables• Do until done

– Propagate beliefs on non-cutset variables– Do Gibbs iteration on cutset

• Estimate P(Variable=value) by averaging probability over samples

• This is a kind of “Rao-Blackwellization”– Reduce variance of Monte Carlo

estimator by replacing a sampling step with an exact computation with same expected value

18

Variational Inference• Method for approximating posterior distribution of

unobserved variables given observed variables• Approximation finds distribution in family with

simpler functional form (e.g., remove some arcs in graph) by minimizing a measure of distance from true posterior

• Estimation via “variational EM”– Alternate between “expectation” and “maximization”

steps– Converges to local minimum of distance function– Yields lower bound for marginal likelihood

• Often faster but less accurate than MC

19

Extending Expressive Power of BNs

Charniak and Goldman (1993)

• Propositional logic + probability is insufficiently expressive for requirements of plan recognition– Repeated structure – Multiple interrelated entities (e.g., plans, actors, actions)– Type hierarchy and inheritance– Unbounded number of potentially relevant variables

• Some formalisms with greater expressive power:– PBN (Plan recognition BN)– PRM (Probabilistic

Relational Models)– OOBN (Object-Oriented

Bayesian Networks)– MEBN (Multi-Entity BN)– Plates– BALP (Bayesian Abductive

Logic Programs)

20

Example: Maritime Domain Awareness

Entities, attributes and relations

21

MDA Probabilistic Ontology

Built in UnBBayes-MEBN

22

MDA SSBN

Screenshot of situation-specific BN in UnBBayes-MEBN(open-source tool for building & reasoning with PR-OWL ontologies)

23

Protégé Plugin for UnBBayes

24

Drag-and-Drop Mapping

drag-and-drop

25

Markov Logic Networks• First-order knowledge base with weight

attached to formulas and clauses• KB + individual constants ground

Markov network containing variable for each grounding of a formula in the KB

• Compact language for specifying large Markov networks

26

MLN Example(Richardson and Domingos, 2006)

27

CRFs for Chat Recognition(Hsu, Lian and Jih, 2011)

• Subscript indexes pairs of individuals

• Yit represents chatting activity of pair

• Xit represents observed acoustic features

• Dependence structure:– Within-pair temporal

dependence– Between-pair concurrent

dependence

• Can be represented as MLN

28

Possible and Probable FO Worlds

• In first-order logic, a possible world (aka “structure”) assigns:– Each constant symbol to a domain element (e.g., go3 obj23)

– Each n-ary function symbol to a function on n-tuples of domain elements (e.g., (go-stp pln1) obj23

– Each n-ary relation symbol to a set of n-tuples of domain elements (e.g., inst {(obj23, go-), (obj78, liquor-store), (obj78, store) … }

• A first-order probabilistic logic assigns a probability measure to first-order structures– This is called “measure model” semantics (Gaifman,1964)

29

FOL + Probability: Issues• Probability zero ≠ unsatisfiable

– E.g., every possible value of a continuous distribution has probability zero

• FOL is undecidable; FOL + probability is not even semi-decidable– Example: IID sequence of coin tosses, 0 < P(H) < 1

• Given any finite sequence of prior tosses, both H and T are possible

• We cannot disprove any non-extreme probability distribution from a finite sequence of tosses

• Wrong solution: “We will prevent you from expressing this query because we cannot tractably compute the answer.”

• Better solution: “Represent the problem you really want to solve, and then figure out a way to approximate the answer.”– Think carefully about what the real problem is!

30

Knowledge Based Model Construction

• KBMC system contains :– Base representation that represents goals, plans, actions,

actors, observables, constraints, etc.– Model construction procedure that maps a context and/or query

into a target model

• At problem solving time– Construct a problem-specific Bayesian network– Process queries on constructed model using general-purpose

BN algorithm

• Advantages of expressive representation – Understandability– Maintainability– Knowledge reuse– Exploit repeated structure (representation, inference, learning)– Construct only as much of model as needed for query

31

Hypothesis Management

• Constructed BN rapidly becomes intractable, especially in presence of existence and association uncertainty

• What do we really need to represent?• Heuristics help to avoid constructing (or prune)

very unlikely hypotheses (or variables with very weak impact on conclusions)– E.g., from only “John went to the airport” do not

nominate hypothesis that John intends to set off a bomb

– But a security system needs to be on the alert for prospective bombers!

32

Lifted Inference

• Constructed BN (propositionalized theory) typically contains repeated structure

• Applying standard BN inference often results in many repetitions of the identical computation

• Lifted inference algorithms detect such repetitions– “Lift” problem from ground to first-order level– Perform computation only once

• Very active area of research

(Braz, et al., 2005)

33

Learning = Inference… in theory, at least

(= (store-of xi) yj)

(inst xi liquor-shopping)

(inst yj liquor-store)

i=1,…,Nj=1,…,M

Plate model for parameter learning of store-of local distribution

34

Representing Temporal Evolution• Plans evolve in time• HMM / DBN / PDBN replicate variables describing

temporally evolving situation

Position(k)Position(k-1)

Report(k-1) Report(k)

PosReport(k)

Position(k-1)

PosReport(k-1)

Position(k)

ApparentShape(k)ApparentShape(k-1)

ShpReport(k-1) ShpReport(k-1)

Hidden Markov Model (HMM) unobservable evolving state + observable indicator

Dynamic Bayesian Network (DBN)factored representation of state / observable

Partially Dynamic Bayesian Network (PDBN)some variables not time-dependent

PosReport(k)

Position(k-1)

PosReport(k-1)

Position(k)

ApparentShape(k)ApparentShape(k-1)

ShpReport(k-1) ShpReport(k-1)

DestinationShape

Mission

35

DBN Inference• Any BN inference algorithm can be applied to a finite-

horizon DBN• Special-case inference algorithms exploit DBN structure

– “Rollup” algorithm marginalizes out past hidden states given past observations to explicitly represent only a sliding window

– Viterbi algorithm finds most probable values of hidden states given observations

– Forward-backward algorithm estimates marginal distributions for hidden states given observations

• Exact inference is generally intractable– Factored frontier algorithm approximates marginalization of past

hidden state for intractable DBNs– Particle filter is a temporal variant of likelihood weighting with

resampling

• Beware of static nodes!

36

Resampling

Particle Filter

initialization

Likelihood weighting

Resampling

Likelihood weighting

Evolution

• Maintains sample of weighted particles• Each particle is a single realization of all non-evidence nodes• Particle is weighted by likelihood of observation given particle• Particles are resampled with probability proportional to weight

From van der Merwe et al. (undated)

37

Particle Impoverishment• Particles with large weights are sampled more

often, leading to low particle diversity• This effect is counteracted by “spreading” effects

of process noise• Impoverishment is very serious when:

– Observations are extremely unlikely– Low “process noise” leads to long dwell times in

widely separated basins of attraction• “In fact, for the case of very small process noise, all particles

will collapse to a single point within a few iterations.”• “If the process noise is zero, then using a particle filter is not

entirely appropriate.” (Arulampulam et al., 2002)

38

Particle Filter with Static Nodes• PF cannot recover from impoverishment of static node• Some approaches:

– Estimate separate PF for each combination of static node • Only if static node has small state space

– Regularized PF - artificial evolution of static node• Ad hoc; no justification for amount of perturbation;

information loss over time

– Shrinkage (Liu & West)• Combines ideas from artificial evolution & kernel smoothing• Perturbation “shrinks” static node for each particle

toward weighted sample mean– Perturbation holds variance of set of particles constant– Correlation in disturbances compensates for information loss

– Resample-Move (Gilks & Berzuini)• Metropolis-Hastings step corrects for particle impoverishment• MH sampling of static node involves entire trajectory but is performed

less frequently as runs become longer

X

39

Stochastic Grammars• Motivation: find representation that is sufficiently

expressive for plan recognition but more tractable than general DBN inference

• A stochastic grammar is a set of stochastic production rules for generating sequences of actions (terminal symbols in the grammar)

• Modularity of production rules yields factored joint distribution

40

Stochastic Grammar - Example• Taken from Geib and

Goldman (2009)• Plans are represented as

and/or tree with temporal constraints

41

Stochastic Grammar - Inference• Parsing algorithms can be applied to compute

restricted class of queries– If plans can be represented in a given formalism then that

formalism’s inference algorithms can be applied to process queries

– We are often interested in a broader class of queries than traditional parsing algorithms can handle (e.g., we usually have not observed all actions)

• Parse tree can be converted to DBN – Enables answering a broader class of queries– Can exploit structure of grammar to improve tractability of

inference

• Special-purpose algorithms exploit grammar structure

42

Where Do We Stand?• Contributions of probabilistic methods

– Useful way of thinking about problems

– Unified approach to reasoning, parameter learning, structure learning

– Principled combination of KE with learning

– Can learn from small, moderate and large samples

– Many general-purpose exact and approximate algorithms with strong theoretical justification and practical success

– Good results (better than previous state of the art) on many interesting problems

• Many challenging problems remain– Exact learning and inference are intractable

– High-dimensional multi-modal distributions are just plain ugly• All inference algorithms break down on the toughest cases

– Asymptotics doesn’t mean much when the long run is millions of years!

– With good engineering backed by solid theory, we will continue to make progress

43

Bibliography (1 of 2)

Arulampalam, M. Maskell, S., Gordon, N. and Clapp, T. A Tutorial on Particle Filters for Online Nonlinear/Non-Gaussian Bayesian Tracking, IEEE Transactions on Signal Processing, 50 , pp. 174–188, 2002.

Bidyuk, B. and Dechter, R. "Cutset Sampling for Bayesian Networks", Journal of Artificial Intelligence Research 28, pages 1-48, 2007.

Braz, R., Amir, E. and Roth, D. Lifted First-Order Probabilistic Inference. Proceedings of the International Joint Conference on Artificial Intelligence, 2005.

Bui, H., Venkatesh, S. and West, G. "Policy Recognition in the Abstract Hidden Markov Model", Artificial Intelligence, Journal of Artificial Intelligence Research, Volume 17, pages 451-499, 2002.

Charniak, E. and Goldman, R. A Bayesian Model of Plan Recognition. Artificial Intelligence, 64: 53-79, 1993.

Charniak, E. and Goldman, R. A Probabilistic Model of Plan Recognition. Proceedings of the Ninth Conference on Artificial Intelligence 1991.

Darwiche, A. Modeling and Reasoning with Bayesian Networks. Cambridge University Press. 2009.

Gaifman, H. Concerning measures in First-Order calculi. Israel Journal of Mathematics, 2, 1–18, 1964.

Geib, C.W. and Goldman, R.P. A Probabilistic Plan Recognition Algorithm Based on Plan Tree Grammars. Artificial Intelligence 173, pp. 1101–1132, 2009.

Gilks, W.R. and Berzuini, C. Following a Moving Target—Monte Carlo Inference for Dynamic Bayesian Models,” Journal of the Royal Statistical Society B, 63, pp. 127–146, 2001.

Hsu, J., Lian, C., and Jih, W. Probabilistic Models for Concurrent Chatting Activity Recognition. ACM Transactions on Intelligent Systems and Technology, Vol. 2, No. 1, 2011.

Jensen, F., Bayesian Networks and Decision Graphs (2nd edition). Springer, 2007.

Korb, K. and Nicholson, A. Bayesian Artificial Intelligence. Chapman and Hall, 2003.

Koller, D., Friedman, N. Probabilistic Graphical Models. MIT Press, 2009.

Lafferty, J., McCallum, A., and Pereira, F. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning, 2001.

Laskey, K.B., MEBN: A Language for First-Order Bayesian Knowledge Bases, Artificial Intelligence, 172(2-3): 140-178, 2008

44

Bibliography (2 of 2)

Liao, L. Patterson, D. J. Fox, D. and Kautz, H. Learning and Inferring Transportation Routines. Artificial Intelligence, 2007.

Liao, L., Fox, D., AND Kautz, H. Hierarchical Conditional Random Fields for GPS-based Activity Recognition. In Springer Tracts in Advanced Robotics. Springer, 2007.

Liu, J. and West, M., Combined Parameter and State Estimation in Simulation-Based Filtering,” in Sequential Monte Carlo Methods in Practice, A. Doucet, J. F. G. de Freitas, and N. J. Gordon, Eds. New York: Springer-Verlag, 2001.

Musso, C. Oudjane, N and LeGland, F. Improving Regularised Particle Filters, in Sequential Monte Carlo Methods in Practice, A. Doucet, J. F. G. de Freitas, and N. J. Gordon, Eds. New York: Springer-Verlag, 2001.

Neapolitan, R. Learning Bayesian Networks. Prentice Hall, 2003.

Pearl, J. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.

Pynadath, D.V. and Wellman, M.P. Probabilistic State-Dependent Grammars for Plan Recognition. Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, 2000.

Pynadath, D. V. and Wellman, M. P. Generalized queries on probabilistic context-free grammars. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):65–77, 1998.

Richardson, M. and Domingos, P., Markov Logic Networks. Machine Learning, 62, 107-136, 2006.

Schmidt, C., Sridharan, N., and Goodson, J., The plan recognition problem: An Intersection of psychology and Artificial Intelligence, Artificial Intelligence 11 pp. 45–83, 1978.

van der Merwe, R., Doucet, A., de Freitas, N. and Wan, E. The Unscented Particle Filter, Adv. Neural Inform. Process. Syst. 2000.

van der Merwe, R., Doucet, A., de Freitas, N. and Wan, E. (undated) “The unscented particle filter,” http://cslu.cse.ogi.edu/publications/ps/UPF_CSLU_talk.pdf

Wellman, M.P., J.S. Breese, and R.P. Goldman (1992) From knowledge bases to decision models. The Knowledge Engineering Review, 7(1):35-53.