cvpr2011: human activity recognition - part 4: syntactic

Frontiers of Human Activity Analysis J. K. Aggarwal Michael S. Ryoo Kris M. Kitani

Upload: zukun

Post on 03-Jul-2015


Page 1: cvpr2011: human activity recognition - part 4: syntactic

Frontiers of Human Activity Analysis

J. K. Aggarwal Michael S. Ryoo

Kris M. Kitani

Page 2: cvpr2011: human activity recognition - part 4: syntactic

Overview

Human Activity Recognition

  Single-layer approaches
    Space-time
    Sequential

  Hierarchical approaches
    Statistical
    Syntactic
    Descriptive

Page 3: cvpr2011: human activity recognition - part 4: syntactic

Motivation

3

How do we interpret a sequence of actions?

Page 4: cvpr2011: human activity recognition - part 4: syntactic

Hierarchy

4

Hierarchy implies decomposition into sub-parts

Page 5: cvpr2011: human activity recognition - part 4: syntactic

Now we’ll cover…

Human Activity Recognition

  Single-layer approaches
    Space-time
    Sequential

  Hierarchical approaches
    Statistical
    Syntactic
    Descriptive

Page 6: cvpr2011: human activity recognition - part 4: syntactic

6

Syntactic Approaches

Page 7: cvpr2011: human activity recognition - part 4: syntactic

Syntactic Models

7

Activities as strings of symbols.

What is the underlying structure?

Page 8: cvpr2011: human activity recognition - part 4: syntactic

Early applications to Vision

8

Tsai and Fu 1980. Attributed Grammar-A Tool for Combining Syntactic and Statistical Approaches to Pattern Recognition.

Page 9: cvpr2011: human activity recognition - part 4: syntactic

Hierarchical syntactic approach

  Useful for activities with:
    Deep hierarchical structure
    Repetitive (cyclic) structure

  Not for:
    Systems with a lot of errors and uncertainty
    Activities with shallow structure

Page 10: cvpr2011: human activity recognition - part 4: syntactic

Basics

Context-Free Grammar

Generic Language                    Natural Languages
Start Symbol (S)                    Sentences
Set of Terminal Symbols (T)         Words
Set of Non-Terminal Symbols (N)     Parts of Speech
Set of Production Rules (P)         Syntax Rules

Page 11: cvpr2011: human activity recognition - part 4: syntactic

Parsing with a grammar

Sentence: ants like flies swat

Rule                 Prob.     Rule              Prob.
S  → NP VP           0.8       PP   → PREP NP    1.0
S  → VP              0.2       PREP → like       1.0
NP → NOUN            0.4       VERB → swat       0.2
NP → NOUN PP         0.4       VERB → flies      0.4
NP → NOUN NP         0.2       VERB → like       0.4
VP → VERB            0.3       NOUN → swat       0.05
VP → VERB NP         0.3       NOUN → flies      0.45
VP → VERB PP         0.2       NOUN → ants       0.5
VP → VERB NP PP      0.2

Page 12: cvpr2011: human activity recognition - part 4: syntactic

Parsing with a grammar

[Figure: parse tree for the sentence "ants like flies swat", rooted at S, with NP, VP, VERB and NOUN nodes. Rule probabilities annotated on the tree: 0.8, 0.2, 0.05, 0.4, 0.45, 0.3, 0.4, 0.4, 0.5. The grammar from the previous slide is shown alongside.]
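The parsing slides can be made concrete with a small probabilistic parser. The sketch below is an illustration only, not the parser used in any of the cited papers: it encodes the slide's grammar and computes the probability of the single most probable derivation (a Viterbi-style search over spans). The function and variable names are hypothetical, and the most probable derivation it finds is not necessarily the tree drawn on the slide.

```python
from functools import lru_cache

# Grammar rules from the slide: non-terminal -> [(right-hand side, probability)]
RULES = {
    "S": [(("NP", "VP"), 0.8), (("VP",), 0.2)],
    "NP": [(("NOUN",), 0.4), (("NOUN", "PP"), 0.4), (("NOUN", "NP"), 0.2)],
    "VP": [(("VERB",), 0.3), (("VERB", "NP"), 0.3),
           (("VERB", "PP"), 0.2), (("VERB", "NP", "PP"), 0.2)],
    "PP": [(("PREP", "NP"), 1.0)],
}
# Lexical rules from the slide: part of speech -> {word: probability}
LEXICON = {
    "PREP": {"like": 1.0},
    "VERB": {"swat": 0.2, "flies": 0.4, "like": 0.4},
    "NOUN": {"swat": 0.05, "flies": 0.45, "ants": 0.5},
}


def best_parse_prob(words, start="S"):
    """Probability of the most probable derivation of `words` from `start`."""
    words = tuple(words)

    @lru_cache(maxsize=None)
    def best(symbol, i, j):
        # Best probability that `symbol` derives words[i:j].
        if symbol in LEXICON:
            return LEXICON[symbol].get(words[i], 0.0) if j - i == 1 else 0.0
        return max((p * seq(rhs, i, j) for rhs, p in RULES[symbol]), default=0.0)

    @lru_cache(maxsize=None)
    def seq(rhs, i, j):
        # Best probability that the symbol sequence `rhs` derives words[i:j].
        if len(rhs) == 1:
            return best(rhs[0], i, j)
        # The first symbol covers words[i:k]; every remaining symbol needs
        # at least one word, which bounds k from above.
        return max((best(rhs[0], i, k) * seq(rhs[1:], k, j)
                    for k in range(i + 1, j - len(rhs) + 2)), default=0.0)

    return best(start, 0, len(words))


if __name__ == "__main__":
    print(best_parse_prob("ants like flies swat".split()))
```

A string that the grammar cannot derive gets probability 0, which is the failure mode the later slides on error handling address.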

Page 13: cvpr2011: human activity recognition - part 4: syntactic

Video analysis with CFGs

13

The “Inverse Hollywood problem”: From video to scripts and storyboards via causal analysis. Brand 1997

Action Recognition using Probabilistic Parsing. Bobick and Ivanov 1998

Recognizing Multitasked Activities from Video using Stochastic Context-Free Grammar. Moore and Essa 2001

Page 14: cvpr2011: human activity recognition - part 4: syntactic

CFG for human activities

14

enter detach leave enter detach attach touch touch detach attach leave

M. Brand. The "Inverse Hollywood Problem": From video to scripts and storyboards via causal analysis. AAAI 1997.

Page 15: cvpr2011: human activity recognition - part 4: syntactic

Parse tree

[Figure: parse tree over the primitive string enter detach leave enter detach attach touch touch detach attach leave. Node labels: IN, OUT, ADD, REMOVE, MOVE, MOTION, ACTION (Open PC), ACTION (unscrew). Root: SCENE (Open up a PC).]

•  Deterministic low-level primitive detection
•  Deterministic parsing

M. Brand. The "Inverse Hollywood Problem": From video to scripts and storyboards via causal analysis. AAAI 1997.

Page 16: cvpr2011: human activity recognition - part 4: syntactic

Stochastic CFGs

16

Action Recognition using Probabilistic Parsing. Bobick and Ivanov 1998

Page 17: cvpr2011: human activity recognition - part 4: syntactic

Gesture analysis with CFGs

Action Recognition using Probabilistic Parsing. Bobick and Ivanov 1998

Primitive recognition with HMMs

Page 18: cvpr2011: human activity recognition - part 4: syntactic

18

left-right

Action Recognition using Probabilistic Parsing. Bobick and Ivanov 1998

Page 19: cvpr2011: human activity recognition - part 4: syntactic

19

up-down

Action Recognition using Probabilistic Parsing. Bobick and Ivanov 1998

Page 20: cvpr2011: human activity recognition - part 4: syntactic

20

right-left

Action Recognition using Probabilistic Parsing. Bobick and Ivanov 1998

Page 21: cvpr2011: human activity recognition - part 4: syntactic

21

down-up

Action Recognition using Probabilistic Parsing. Bobick and Ivanov 1998

Page 22: cvpr2011: human activity recognition - part 4: syntactic

Parse Tree

[Figure: parse tree over the primitive sequence left-right, up-down, right-left, down-up. Node labels: LR, UD, RL, DU, TOP, BOT, RH. Root: S.]

Page 23: cvpr2011: human activity recognition - part 4: syntactic

Errors

[Figure: outputs of HMM a and HMM b]

The HMMs output a likelihood value over time (not discrete symbols).

Errors are inevitable, but the grammar acts as a top-down constraint.

Action Recognition using Probabilistic Parsing. Bobick and Ivanov 1998

Page 24: cvpr2011: human activity recognition - part 4: syntactic

Dealing with uncertainty & errors

  Stolcke-Earley (probabilistic) parser
  SKIP rules to deal with insertion errors

[Figure: outputs of HMM a, HMM b and HMM c]

Action Recognition using Probabilistic Parsing. Bobick and Ivanov 1998

Page 25: cvpr2011: human activity recognition - part 4: syntactic

SCFG for Blackjack

25

Recognizing Multitasked Activities from Video using Stochastic Context-Free Grammar. Moore and Essa 2001

•  Deals with more complex activities
•  Deals with more error types

Page 26: cvpr2011: human activity recognition - part 4: syntactic

extracting primitive actions

26

Page 27: cvpr2011: human activity recognition - part 4: syntactic

Game grammar

Recognizing Multitasked Activities from Video using Stochastic Context-Free Grammar. Moore and Essa 2001

Page 28: cvpr2011: human activity recognition - part 4: syntactic

Dealing with errors

  Ungrammatical strings cause the parser to fail

  Account for errors with multiple hypotheses
    Insertion, deletion, substitution

  Issues
    How many errors should we tolerate?
    Potentially exponential hypothesis space
    Ungrammatical strings: vision problem or illegal activity?
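To see why the hypothesis space grows so quickly, the sketch below enumerates every string reachable from an observed symbol string by a bounded number of insertions, deletions and substitutions. It is a generic illustration and not the error-handling scheme of Moore and Essa; the function names and the three-symbol alphabet are hypothetical.

```python
def single_edit_variants(symbols, alphabet):
    """All strings one insertion, deletion or substitution away from `symbols`."""
    symbols = tuple(symbols)
    variants = set()
    for i in range(len(symbols) + 1):
        for a in alphabet:
            # Insertion of symbol `a` at position i.
            variants.add(symbols[:i] + (a,) + symbols[i:])
    for i in range(len(symbols)):
        # Deletion of the symbol at position i.
        variants.add(symbols[:i] + symbols[i + 1:])
        for a in alphabet:
            if a != symbols[i]:
                # Substitution of the symbol at position i by `a`.
                variants.add(symbols[:i] + (a,) + symbols[i + 1:])
    return variants


def hypotheses_within(symbols, alphabet, max_errors):
    """All strings reachable from `symbols` with at most `max_errors` edits."""
    frontier = {tuple(symbols)}
    seen = set(frontier)
    for _ in range(max_errors):
        frontier = {v for s in frontier
                    for v in single_edit_variants(s, alphabet)} - seen
        seen |= frontier
    return seen


if __name__ == "__main__":
    alphabet = ("a", "b", "c")  # hypothetical primitive alphabet
    observed = ("a", "b", "c", "a")
    for k in range(4):
        print(k, len(hypotheses_within(observed, alphabet, k)))
```

Each additional tolerated error multiplies the number of candidate strings the parser would have to consider, which is why the tolerance has to be bounded in practice.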

28

Page 29: cvpr2011: human activity recognition - part 4: syntactic

Observations

  CFGs good for structured activities
  Can incorporate uncertainty in observations
  Natural contextual prior for recognizing errors

  Not clear how to deal with errors
  Assumes ‘good’ action classifiers
  Need to define grammar manually

Can we learn the grammar from data?

29

Page 30: cvpr2011: human activity recognition - part 4: syntactic

Heuristic Grammatical Induction

30

1.  Lexicon learning
    •  Learn HMMs
    •  Cluster HMMs

2.  Convert video to string

3.  Learn Grammar

Unsupervised Analysis of Human Gestures. Wang et al 2001

Page 31: cvpr2011: human activity recognition - part 4: syntactic

COMPRESSIVE

Example string: a b c d a b c d b c d a b a b

[Figure: compression gain of replacing a repeated substring with a new symbol. Annotations: deletion of substring, insertion of new rule. Terms: length, occurrence, new rule, new symbol.]

On-Line and Off-Line Heuristics for Inferring Hierarchies of Repetitions in Sequences. Nevill-Manning 2000.

Page 32: cvpr2011: human activity recognition - part 4: syntactic

example

32

S → a b c d a b c d b c d a b a b    (DL=16)

S → a A a A A a b a b
A → b c d                            (DL=14)

Repeat until compression becomes 0.
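The example can be reproduced with a short greedy routine. This is a minimal sketch under stated assumptions, not the COMPRESSIVE algorithm as published: the description length is taken to be the total number of symbols in the grammar (one per rule head plus the rule body), which matches the DL=16 and DL=14 values on the slide, and only the body of S is searched for repeats. All function names are hypothetical.

```python
def description_length(grammar):
    """Total symbol count: one for each rule head plus the length of its body."""
    return sum(1 + len(body) for body in grammar.values())


def count_nonoverlapping(seq, sub):
    """Number of non-overlapping occurrences of `sub` in `seq`, scanning left to right."""
    count, i, n = 0, 0, len(sub)
    while i <= len(seq) - n:
        if seq[i:i + n] == sub:
            count += 1
            i += n
        else:
            i += 1
    return count


def replace_all(seq, sub, symbol):
    """Replace each non-overlapping occurrence of `sub` in `seq` by `symbol`."""
    out, i, n = [], 0, len(sub)
    while i < len(seq):
        if seq[i:i + n] == sub:
            out.append(symbol)
            i += n
        else:
            out.append(seq[i])
            i += 1
    return out


def compress(string):
    """Greedily add rules for repeated substrings until no rule shortens the grammar."""
    grammar = {"S": list(string)}
    new_names = iter("ABCDEFGHIJKLMNOPQRTUVWXYZ")  # fresh non-terminals; "S" is taken
    while True:
        s = grammar["S"]
        best_gain, best_sub = 0, None
        for n in range(2, len(s) // 2 + 1):
            for i in range(len(s) - n + 1):
                sub = s[i:i + n]
                occurrences = count_nonoverlapping(s, sub)
                # Each occurrence shrinks by n - 1 symbols; the new rule costs n + 1.
                gain = occurrences * (n - 1) - (n + 1)
                if gain > best_gain:
                    best_gain, best_sub = gain, sub
        if best_sub is None:
            return grammar
        name = next(new_names)
        grammar["S"] = replace_all(s, best_sub, name)
        grammar[name] = best_sub


if __name__ == "__main__":
    g = compress("abcdabcdbcdabab")
    print(g, description_length(g))
```

Ties between equally good substrings are broken in favour of the shorter one found first, which is a choice made for this sketch rather than something stated on the slide.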

Page 33: cvpr2011: human activity recognition - part 4: syntactic

Critical assumption

  No uncertainty
  No errors
    insertions
    deletions
    substitutions

Can we learn grammars despite errors?

33

Page 34: cvpr2011: human activity recognition - part 4: syntactic

Learning with noise

34

Recovering the basic structure of human activities from noisy video-based symbol strings. Kitani et al 2008.

Can we learn the basic structure of a transaction?

Page 35: cvpr2011: human activity recognition - part 4: syntactic

extracting primitives

Recovering the basic structure of human activities from noisy video-based symbol strings. Kitani et al 2008.

Page 36: cvpr2011: human activity recognition - part 4: syntactic

Underlying structure?

36

D → a x b y c a b x c y a b c x

Recovering the basic structure of human activities from noisy video-based symbol strings. Kitani et al 2008.

Page 37: cvpr2011: human activity recognition - part 4: syntactic

Underlying structure?

37

D → a x b y c a b x c y a b c x

D → a b c a b c a b c

Recovering the basic structure of human activities from noisy video-based symbol strings. Kitani et al 2008.

Page 39: cvpr2011: human activity recognition - part 4: syntactic

Underlying structure?

39

D → a x b y c a b x c y a b c x

D → a b c a b c a b c

A → a b c
D → A A A

Simple grammar, efficient compression

Recovering the basic structure of human activities from noisy video-based symbol strings. Kitani et al 2008.
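The three strings on this slide can be compared by counting symbols. The sketch below assumes, as the slide suggests, that x and y are the noise symbols, and it uses a simple symbol count (one per rule head plus the rule body) as a stand-in for description length. It is not the probabilistic measure used by Kitani et al.; the names are hypothetical.

```python
def description_length(grammar):
    """Total symbol count: one for each rule head plus the length of its body."""
    return sum(1 + len(body) for body in grammar.values())


def expand(grammar, symbol="D"):
    """Expand `symbol` into the terminal string it derives."""
    out = []
    for s in grammar[symbol]:
        out.extend(expand(grammar, s) if s in grammar else [s])
    return out


NOISE = {"x", "y"}  # assumption: x and y are spurious detections

noisy = {"D": "a x b y c a b x c y a b c x".split()}
clean = {"D": [s for s in noisy["D"] if s not in NOISE]}
compact = {"D": ["A", "A", "A"], "A": ["a", "b", "c"]}

if __name__ == "__main__":
    for name, g in (("noisy", noisy), ("clean", clean), ("compact", compact)):
        print(name, description_length(g))
    # The compact grammar derives exactly the cleaned string.
    print(expand(compact) == clean["D"])
```

Removing the noise symbols exposes the repetition, and only then does the single rule A → a b c pay for itself.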

Page 40: cvpr2011: human activity recognition - part 4: syntactic

Information Theory Problem (MDL)

$$\hat{G} = \arg\min_{G} \left\{ \mathrm{DL}(G) + \mathrm{DL}(D \mid G) \right\}$$

DL(G) is the model complexity and DL(D|G) is the data compression.

Recovering the basic structure of human activities from noisy video-based symbol strings. Kitani et al 2008.

Page 41: cvpr2011: human activity recognition - part 4: syntactic

Information Theory Problem (MDL)

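$$\hat{G} = \arg\min_{G} \left\{ \mathrm{DL}(G) + \mathrm{DL}(D \mid G) \right\}$$

DL(G) is the model complexity and DL(D|G) is the data compression.

Model complexity:

$$
\begin{aligned}
\mathrm{DL}(G) &= -\log p(G) \\
               &= -\log p(\theta_S, G_S) \\
               &= -\log p(\theta_S \mid G_S) - \log p(G_S) \\
               &= \mathrm{DL}(\theta_S \mid G_S) + \mathrm{DL}(G_S)
\end{aligned}
$$

DL(θ_S | G_S) covers the grammar parameters and DL(G_S) the grammar structure.

Recovering the basic structure of human activities from noisy video-based symbol strings. Kitani et al 2008.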

Page 42: cvpr2011: human activity recognition - part 4: syntactic

Information Theory Problem (MDL)

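$$\hat{G} = \arg\min_{G} \left\{ \mathrm{DL}(G) + \mathrm{DL}(D \mid G) \right\}$$

DL(G) is the model complexity and DL(D|G) is the data compression.

Model complexity:

$$
\begin{aligned}
\mathrm{DL}(G) &= -\log p(G) \\
               &= -\log p(\theta_S, G_S) \\
               &= -\log p(\theta_S \mid G_S) - \log p(G_S) \\
               &= \mathrm{DL}(\theta_S \mid G_S) + \mathrm{DL}(G_S)
\end{aligned}
$$

DL(θ_S | G_S) covers the grammar parameters and DL(G_S) the grammar structure.

Data compression:

$$\mathrm{DL}(D \mid G) = -\log p(D \mid G)$$

p(D | G) is the likelihood, computed from inside probabilities.

Recovering the basic structure of human activities from noisy video-based symbol strings. Kitani et al 2008.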

Page 43: cvpr2011: human activity recognition - part 4: syntactic

Minimum Description Length

Recovering the basic structure of human activities from noisy video-based symbol strings. Kitani et al 2008.

Page 44: cvpr2011: human activity recognition - part 4: syntactic

Minimum Description Length

Recovering the basic structure of human activities from noisy video-based symbol strings. Kitani et al 2008.

Page 45: cvpr2011: human activity recognition - part 4: syntactic

Recovering the basic structure of human activities from noisy video-based symbol strings. Kitani et al 2008.

Page 46: cvpr2011: human activity recognition - part 4: syntactic

Recovering the basic structure of human activities from noisy video-based symbol strings. Kitani et al 2008.

Page 47: cvpr2011: human activity recognition - part 4: syntactic

Conclusions

  Possible to learn basic structure
  Robust to errors (insertion, deletion, substitution)

  Need a lot of training data
  Computational complexity

47

Page 48: cvpr2011: human activity recognition - part 4: syntactic

Bayesian Approaches

48

Infinite Hierarchical Hidden Markov Models. Heller et al 2009.

The Infinite PCFG using Hierarchical Dirichlet Processes. Liang et al 2007.

Page 49: cvpr2011: human activity recognition - part 4: syntactic

Take home message: Hierarchical Syntactic Models

  Useful for activities with:
    Deep hierarchical structure
    Repetitive (cyclic) structure

  Not for:
    Systems with a lot of errors and uncertainty
    Activities with weak structure

49

Page 50: cvpr2011: human activity recognition - part 4: syntactic

50

Statistical Approaches

Page 51: cvpr2011: human activity recognition - part 4: syntactic

Using a hierarchical statistical approach

51

  Use when:
    Low-level action detectors are noisy
    Structure of activity is sequential
    Integrating dynamics

  Not for:
    Activities with deep hierarchical structure
    Activities with complex temporal structure

Page 52: cvpr2011: human activity recognition - part 4: syntactic

Statistical (State-based) Model

52

Activities as a stochastic path.

What are the underlying dynamics?

Page 53: cvpr2011: human activity recognition - part 4: syntactic

Characteristics

  Strong Markov assumption
  Strong dynamics prior
  Robust to uncertainty

  Modifications to account for:
    Hierarchical structure
    Concurrent structure
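The Markov assumption and the robustness to uncertainty listed above can be illustrated with the forward algorithm of a plain hidden Markov model, which scores an observation sequence by summing over every hidden state path. The sketch below is a generic textbook construction; the two states, the two observation symbols and all probabilities are hypothetical values chosen for illustration and do not come from the tutorial.

```python
# Hypothetical two-state model; every name and number here is illustrative only.
STATES = ("reach", "grasp")
START = {"reach": 0.7, "grasp": 0.3}
TRANS = {
    "reach": {"reach": 0.6, "grasp": 0.4},
    "grasp": {"reach": 0.2, "grasp": 0.8},
}
EMIT = {
    "reach": {"moving": 0.9, "still": 0.1},
    "grasp": {"moving": 0.3, "still": 0.7},
}


def forward_likelihood(observations):
    """p(observations), summed over all hidden state paths (forward algorithm)."""
    # alpha[s] = p(observations so far, current hidden state = s)
    alpha = {s: START[s] * EMIT[s][observations[0]] for s in STATES}
    for obs in observations[1:]:
        # Markov assumption: the next state depends only on the current state.
        alpha = {
            s: EMIT[s][obs] * sum(alpha[r] * TRANS[r][s] for r in STATES)
            for s in STATES
        }
    return sum(alpha.values())


if __name__ == "__main__":
    print(forward_likelihood(["moving", "moving", "still"]))
```

Because every path contributes to the score, a single unlikely observation lowers the likelihood instead of causing outright failure, in contrast to a deterministic parser on an ungrammatical string.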

53

Page 54: cvpr2011: human activity recognition - part 4: syntactic

Hierarchical activities

54

Problem: How do we model hierarchical activities?

Solution: “stack” actions for hierarchical activities

Combinatorial state space!

Page 55: cvpr2011: human activity recognition - part 4: syntactic

Hierarchical hidden Markov model

55

Learning and Detecting Activities from Movement Trajectories Using the Hierarchical Hidden Markov Models. Nguyen et al 2005

Page 56: cvpr2011: human activity recognition - part 4: syntactic

Context-free activity grammar

Learning and Detecting Activities from Movement Trajectories Using the Hierarchical Hidden Markov Models. Nguyen et al 2005

Page 57: cvpr2011: human activity recognition - part 4: syntactic

Context-free activity grammar

Learning and Detecting Activities from Movement Trajectories Using the Hierarchical Hidden Markov Models. Nguyen et al 2005

Page 58: cvpr2011: human activity recognition - part 4: syntactic

Observations

  Tree structures useful for hierarchies
  Tight integration of trajectories with abstract semantic states

  Activities are not always a single sequence (i.e., they sometimes happen in parallel)

58

Page 59: cvpr2011: human activity recognition - part 4: syntactic

Concurrent activities

59

Problem: How do we model concurrent activities?

Solution: “stand-up” model for concurrent activities

Combinatorial state space!

Page 60: cvpr2011: human activity recognition - part 4: syntactic

Propagation network

60

Propagation Networks for Recognition of Partially Ordered Sequential Action. Shi et al 2004

Page 61: cvpr2011: human activity recognition - part 4: syntactic

Propagation Networks for Recognition of Partially Ordered Sequential Action. Shi et al 2004

Page 62: cvpr2011: human activity recognition - part 4: syntactic

temporal inference

Inference by standing the state transition model on its side

Page 63: cvpr2011: human activity recognition - part 4: syntactic

Inferring structure (storylines)

63

Understanding Videos, Constructing Plots – Learning a Visually Grounded Storyline Model from Annotated Videos

Gupta, Srinivasan, Shi and Davis CVPR 2009

Learn AND-OR graphs from weakly labeled data

Page 64: cvpr2011: human activity recognition - part 4: syntactic

Scripts from structure

Understanding Videos, Constructing Plots - Learning a Visually Grounded Storyline Model from Annotated Videos. Gupta, Srinivasan, Shi and Davis CVPR 2009

Page 65: cvpr2011: human activity recognition - part 4: syntactic

Take home message: Hierarchical statistical model

  Use when:
    Low-level action detectors are noisy
    Structure of activity is sequential
    Integrating dynamics

  Not for:
    Activities with deep hierarchical structure
    Activities with complex temporal structure

65

Page 66: cvpr2011: human activity recognition - part 4: syntactic

Contrasting hierarchical approaches

66

Approach      Actions as:             Activities as:   Model      Characteristic
Statistical   probabilistic states    paths            DBN        Robust to uncertainty
Syntactic     discrete symbols        strings          CFG        Describes deep hierarchy
Descriptive   logical relationships   sets             CFG, MLN   Encodes complex logic

Page 67: cvpr2011: human activity recognition - part 4: syntactic

References (not included in ACM survey paper)

  W. Tsai and K.S. Fu. Attributed Grammar-A Tool for Combining Syntactic and Statistical Approaches to Pattern Recognition. SMC 1980.

  M. Brand. The "Inverse Hollywood Problem": From video to scripts and storyboards via causal analysis. AAAI 1997.

  T. Wang, H. Shum, Y. Xu, N. Zheng. Unsupervised Analysis of Human Gestures. PRCM 2001.

  C.G. Nevill-Manning, I.H. Witten. On-Line and Off-Line Heuristics for Inferring Hierarchies of Repetitions in Sequences. IEEE 2000.

  K. Heller, Y.W. Teh and D. Gorur. Infinite Hierarchical Hidden Markov Models. AISTATS 2009.

  P. Liang, S. Petrov, M. Jordan, D. Klein. The Infinite PCFG using Hierarchical Dirichlet Processes. EMNLP 2007.

  A. Gupta, N. Srinivasan, J. Shi and L. Davis. Understanding Videos, Constructing Plots - Learning a Visually Grounded Storyline Model from Annotated Videos. CVPR 2009.