Context-Aware Activity Modeling using Hierarchical Conditional Random Fields

Yingying Zhu, Nandita M. Nayak, and Amit K. Roy-Chowdhury

Abstract—In this paper, rather than modeling activities in videos individually, we jointly model and recognize related activities in a scene using both motion and context features. This is motivated by the observation that activities related in space and time rarely occur independently and can serve as the context for each other. We propose a two-layer conditional random field model that represents the action segments and activities in a hierarchical manner. The model allows the integration of both motion and various context features at different levels and automatically learns the statistics that capture the patterns of the features. With weakly labeled training data, the learning problem is formulated as a max-margin problem and solved by an iterative algorithm. Rather than generating activity labels for individual activities, our model simultaneously predicts an optimum structural label for the related activities in the scene. We show promising results on the UCLA Office Dataset and the VIRAT Ground Dataset that demonstrate the benefit of hierarchical modeling of related activities using both motion and context features.

Index Terms—Activity localization and recognition, Context-aware activity model, Hierarchical Conditional Random Field.


1 INTRODUCTION

It has been demonstrated in [28] that context is significant in human visual systems. Since there is no formal definition of context in computer vision, we consider all the detected objects and motion regions as providing contextual information about each other. Activities in natural scenes rarely happen independently. The spatial layout of activities and their sequential patterns provide useful cues for their understanding. Consider the activities that happen in the same spatio-temporal region in Fig. 1: the existence of the nearby car gives information about what the person (bounded by the red circle) is doing, and the relative position of the person of interest and the car indicates that activities (b) and (c) are very different from activity (a). Moreover, focusing on the person alone, it may be hard to tell whether the person in (b) and (c) is "opening the vehicle trunk" or "closing the vehicle trunk". If we knew that these activities occurred around the same vehicle over time, it would be immediately clear that in (b) the person is opening the vehicle trunk and in (c) the person is closing the vehicle trunk. This example shows the importance of spatial and temporal relationships for activity recognition.

1.1 Overview of the Framework

Many existing works on activity recognition assume that the temporal locations of the activities are known [1], [27].

• ∗This work was partially supported under ONR grant N00014-12-1-1026 and NSF grant IIS-1316934.

• Y. Zhu is with the Department of Electrical and Computer Engineering, University of California, Riverside. E-mail: [email protected].

• N. M. Nayak is with the Department of Computer Science and Engineering, University of California, Riverside. E-mail: [email protected].

• A. K. Roy-Chowdhury is with the Department of Electrical and Computer Engineering, University of California, Riverside. E-mail: [email protected].

[Fig. 1 panels, annotated with spatial and temporal relationships: (a) t=5 sec, (b) t=13 sec, (c) t=18 sec (top row); (a) t=16 sec, (b) t=19 sec, (c) t=25 sec (bottom row).]

Fig. 1. An example that demonstrates the importance of context in activity recognition. The motion region surrounding the person of interest is marked by a red circle; the interacting vehicle is marked by a blue bounding box.

In practice, activity-based analysis of videos should involve reasoning about motion regions, objects involved in these motion regions, and spatio-temporal relationships between the motion regions. We focus on the problem of detecting activities of interest in continuous videos without prior information about the locations of the activities. The main challenge is to develop a representation of the continuous video that respects the spatio-temporal relationships of the activities. To achieve this goal, we build upon existing well-known feature descriptors and spatio-temporal context representations that, when combined together, provide a powerful framework to model activities in continuous videos.

An activity can be considered as a union of action segments, or actions, that are close neighbors in space and time. We provide an integrated framework that conducts multiple stages of video analysis, starting with motion localization. The detected motion regions are divided into action segments, which are considered the elements of activities, using a motion segmentation algorithm based on the nonlinear dynamic model (NDM) in [5]. The goal then is to generate smoothed activity labels for the action segments, optimal in a global sense, and thus obtain semantically meaningful activity regions and corresponding activity labels.

Towards this goal, we perform an initial labeling to group adjacent action segments into semantically meaningful activities using a baseline activity detector. Any existing activity detection method, such as sliding-window bag-of-words (BOW) with a support vector machine (SVM) [25], can be used in this step. We call the labeled groups of action segments candidate activities. Candidate activities that are related to each other in space and time are grouped together into activity sets. For each set, the underlying activities are jointly modeled and recognized with the proposed two-layer conditional random field model, which models the hierarchical relationship between the action segments and activities. We refer to this proposed two-layer Hierarchical-CRF simply as Hierarchical-CRF for brevity. First, the action layer is modeled as a linear-chain CRF, with the action segments as the random variables taking activity labels. Latent activity variables, which represent the detected activities, are then introduced in the hidden activity layer. In this way, action-activity consistency and intra-activity potentials, as higher-order smoothness potentials, can be introduced into the model to smooth the preliminary activity labels in the action layer. Finally, the activity-layer variables whose underlying activities are within the neighborhoods of each other in space and time are connected to utilize the spatial and temporal relationships between activities. The resulting model is the action-based two-layer Hierarchical-CRF model.

Potentials in and between the action and activity layers are developed to represent the motion and context patterns of individual variables and groups of them at both the action and activity levels, as well as the action-activity consistency patterns between variables in the two layers. Action-activity potentials over sets of action nodes and their corresponding activity nodes are introduced between the action and activity layers. Such potentials, as smoothness potentials, are used to enforce label consistency of action segments within activity regions while allowing label inconsistency in certain circumstances. This allows the rectification of the preliminary activity labels of action segments during the inference of the Hierarchical-CRF model according to the motion and context patterns in and between actions and activities.

Fig. 2 shows the framework of our approach. Given a video, we detect the motion regions using background subtraction. Then, the segmentation algorithm divides a continuous motion region into action segments whose motion pattern is consistent and different from that of adjacent segments. These action segments, as the nodes in the action layer, are modeled as a linear-chain CRF, and the proposed Hierarchical-CRF model is built accordingly, as described above.

The model parameters are learned automatically from weakly-labeled training data with the locations and labels of activities of interest. Image-level features are detected and organized to form the context for activities. Common-sense domain knowledge about the activities of interest is used to guide the formulation of these context features within activities from the weakly-labeled training data. We utilize a structural model in a max-margin framework, iteratively inferring the hidden activity variables and learning the parameters of the different layers. For testing, the action segments, which are merged together and assigned activity labels by the preliminary activity detection method, are relabeled through inference on the learned Hierarchical-CRF model.

Fig. 2. The left graph shows the video representation of an activity set with n motion segments and m candidate activities. The right graph shows the graphical representation of our Hierarchical-CRF model. The white nodes are the action variables and the gray nodes are the hidden activity variables. Note that observations associated with the model variables are not shown, for clarity of representation.

1.2 Main Contributions

The main contribution of this work is three-fold.
(i) We combine low-level motion segmentation with high-level activity modeling in one framework. With the detected individual action segments as the elements of activities, we design a Hierarchical-CRF model that jointly models the related activities in the scene.
(ii) We propose a weakly supervised approach that utilizes context within and between actions and activities to provide helpful cues for activity recognition. The proposed model integrates motion and various context features within and between actions and activities into a unified model. It can localize and label activities in continuous videos simultaneously, in the presence of multiple actors in the scene interacting with each other or acting independently.
(iii) With a task-oriented discriminative approach, the model learning problem is formulated as a max-margin problem and is solved by an Expectation-Maximization approach.

2 RELATED WORK

Many existing works exploring context focus on interactions among features, objects and actions [1], [3], [14], [34], [39], environmental conditions such as the spatial locations of certain activities in the scene [23], and temporal relationships between activities [24], [35]. Spatio-temporal constraints across activities in a wide-area scene are rarely considered.


Motion segmentation and action recognition are done simultaneously in [26]. The proposed algorithm models the temporal order of actions while ignoring the spatial relationships between actions. The work in [35] models a complex activity by a variable-duration hidden Markov model on equal-length temporal segments. It decomposes a complex activity into sequential actions, which are the context of each other. However, it considers only the temporal relationships, ignoring the spatial relationships between actions. The AND-OR graph [2], [13], [32] is a powerful tool for activity representation. It has been used for multi-scale analysis of human activities in [2], where α, β, γ procedures were defined for a bottom-up, cost-sensitive inference of low-level action detection. However, the learning and inference processes of AND-OR graphs become more complex as the graph grows large and more and more activities are learned. In [20], [21], a structural model is proposed to learn both feature-level and action-level interactions of group members. This method labels each image with a group activity label; how to smooth the labeling results along time is not addressed in the paper. Also, these methods aim to recognize group activities and are not suitable in our scenario, where activities cannot be considered as parts of larger activities.

In [4], complex activities are represented as spatiotemporal graphs representing multi-scale video segments and their hierarchical relationships. Existing higher-order models [16], [17], [19], [42] propose the use of higher-order potentials that encourage the smoothness of variables within cliques of the graph. Higher-order graphical models have been frequently used in image segmentation, object recognition, etc.; however, few works exist in the field of activity recognition. We propose a novel method that explicitly models the action- and activity-level motion and context patterns with a Hierarchical-CRF model and uses them in the inference stage for recognition.

The problem of simultaneous tracking and activity recognition was addressed in [6], [15]. In these works, tracking and action/activity recognition are expected to benefit each other through an iterative process that maximizes a decomposable potential function consisting of tracking potentials and action/activity potentials. However, only collective activities are considered in [6], [15], in which the individual persons of interest have a common goal in terms of activity. This work addresses the general problem of activity recognition, where individual persons in the scene may conduct heterogeneous activities.

The inference method on a structural model proposed in [20], [21] searches through the graphical structure in order to find the one that maximizes the potential function. Though this inference method is computationally less intensive than exhaustive search, it is still time consuming. As an alternative, greedy search has been used for inference in object recognition [8].

This paper has major differences from our previous work in [45]. In [45], we proposed a structural SVM to explicitly model the durations, motion, intra-activity context and the spatio-temporal relationships between the activities. In this work, we develop a hierarchical model which represents the related activities in a hidden activity layer that interacts with a lower-level action layer. Representing activities as hidden activity variables simplifies the inference problem by associating each hidden activity with a small set of neighboring action segments, and enables efficient iterative learning and inference algorithms. Furthermore, the modeling of more aspects of the activities of interest adds additional feature functions that measure both action and activity variables. Since more information about the activities to be recognized is modeled, the recognition accuracy is improved, as demonstrated by the experiments.

3 MODEL FORMULATION FOR CONTEXT-AWARE ACTIVITY REPRESENTATION

In this section, we describe how the higher-order conditional random field (CRF) model of activities, which integrates activity durations, motion features and various context features within and across activities, is built upon automatically detected action segments to jointly model related activities in space and time.

3.1 Video Preprocessing

Assume there are M+1 classes of activities in the scene, including a background class with label 0 and M classes of interest with labels 1, ..., M (the background class can be omitted if all the activity classes in the scene are known). Our goal is to locate and label each activity of interest in videos. Given a continuous video, background subtraction [46] is used to locate the moving objects. Moving persons are identified, and local trajectories of moving persons are generated (any existing tracking method, such as [33], can be used). Spatio-Temporal Interest Point (STIP) features [22] are generated only for these motion regions. Thus, STIPs generated by noise, such as slight tree shaking, camera jitter and motion of shadows, are avoided. Each motion region is segmented into action segments using the motion segmentation based on the method in [5] with STIP histograms as the model observation. The detailed motion segmentation algorithm is described in Section 5.1.

3.2 Hierarchical-CRF Models for Activities

The problem of activity recognition in continuous videos requires two main tasks: detecting motion regions and labeling these detected motion regions. The detection and labeling problems can be solved simultaneously, as proposed in [26], or separately, as proposed in [44], [45]. For the latter, candidate action or activity regions are usually detected before the labeling task. The problem of activity recognition is then converted to a problem of labeling, that is, assigning each candidate region an optimum activity label.

A CRF is a discriminative model often used for labeling problems over images and image objects. Essentially, a CRF can be considered a special version of a Markov Random Field (MRF) where the variable potentials are conditioned on the observed data. Let x be the model observations and y the label variables. The posterior distribution p(y|x,ω) of the label variables over the CRF is a Gibbs distribution and is usually represented as

    p(y|x,ω) = (1/Z(x,ω)) ∏_{c∈C} exp(ω_c^T ϕ_c(x, y_c)),   (1)

where ω_c is a model weight vector, which needs to be learned from training data, Z(x,ω) is a normalizing constant called the partition function, and ϕ_c(x, y_c) is a feature vector derived from the observation x and the label vector y_c in clique c.

The potential function of the CRF model given the observations x and the model weight vector ω is defined as

    ψ(y|x,ω) = ∑_{c∈C} ω_c^T ϕ_c(x, y_c).   (2)
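For concreteness, the unnormalized log-score in (2) can be evaluated as below. This is a minimal Python sketch under assumed data structures (cliques as index tuples, per-clique weight vectors and feature functions); it is illustrative, not the paper's implementation:

```python
import numpy as np

def crf_potential(cliques, x, y, weights, features):
    """Evaluate the CRF potential psi(y|x, omega) of Eq. (2).

    cliques  -- list of cliques, each a tuple of variable indices (assumed)
    x        -- observations, in whatever form the feature functions expect
    y        -- label assignment; y[i] is the label of variable i
    weights  -- dict: clique -> weight vector omega_c
    features -- dict: clique -> feature function phi_c(x, y_c)
    """
    score = 0.0
    for c in cliques:
        y_c = tuple(y[i] for i in c)              # labels restricted to clique c
        phi = features[c](x, y_c)                 # feature vector phi_c(x, y_c)
        score += float(np.dot(weights[c], phi))   # omega_c^T phi_c
    return score
```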

For the development of the Hierarchical-CRF model, the action layer is first modeled as a linear-chain CRF. Activity-layer variables, which are associated with detected activities, are then introduced for the smoothing of the action-layer variables. Finally, activity-layer variables are connected to represent the spatial and temporal relationships between activities. The evolution of the proposed two-layer Hierarchical-CRF model from the one-layer CRF model is shown in Fig. 3. Details on the development of these models are described in the following sub-sections. The various feature vectors used for the calculation of the potentials are described in Section 3.3.

3.2.1 Action-based Linear-chain CRF
We first describe the linear-chain CRF model in Fig. 3(a). We define the following items: the intra-action potential ψ_ν(y_i^a|x,ω), which measures the compatibility of the observed feature of action segment i and its label y_i^a; and the inter-action potential ψ_ε(y_i^a, y_j^a|x,ω), which measures the consistency between two connected action segments i and j. Let V^a be the set of vertices, each representing an action segment as an element of the action layer, and let E^a denote the set of connected action pairs. The potential function of the action-layer linear-chain CRF is

    ψ(y^a|x,ω) = ∑_{i∈V^a} ψ_ν(y_i^a|x,ω) + ∑_{ij∈E^a} ψ_ε(y_i^a, y_j^a|x,ω)
               = ∑_{i∈V^a} (ω^a_{ν,y_i^a})^T ϕ_ν(x_i^a, y_i^a) + ∑_{ij∈E^a} (ω^a_{ε,y_i^a,y_j^a})^T ϕ_ε(x_i^a, x_j^a, y_i^a, y_j^a),   (3)

where ϕ_ν(x_i^a, y_i^a) is the intra-action feature vector that describes action segment i, and ω^a_{ν,y_i^a} is the weight vector of the intra-action features for class y_i^a. ϕ_ε(x_i^a, x_j^a, y_i^a, y_j^a) is the inter-action feature, which is derived from the labels y_i^a, y_j^a and the intra-action feature vectors x_i^a and x_j^a. ω^a_{ε,y_i^a,y_j^a} is the weight vector of the inter-action features for the class pair (y_i^a, y_j^a).

3.2.2 Incorporating Higher-Order Potentials
According to experimental observations, action segments in a candidate activity region generated by activity detection methods [44] tend to have the same activity labels. However, consistent labeling is not guaranteed, due to inaccurate detections. Let an action clique c^a denote the union of action segments in a candidate activity c. The linear-chain CRF can be converted to a higher-order CRF by adding a latent activity variable y_c^h, representing the label of c, for each action clique c^a. All action variables associated with the same activity variable are connected. Then, the associated higher-order potential ψ_c(y_c^a|x,ω) is introduced to encourage action segments in the clique c^a to take the same label, while still allowing some of them to have different labels without additional penalty. The resulting CRF model is shown in Fig. 3(b). The potential function ψ for the higher-order CRF model is represented as

    ψ(y^a, y_c^h|x,ω) = ∑_{i∈V^a} (ω^a_{ν,y_i^a})^T ϕ_ν(x_i^a, y_i^a) + ∑_{ij∈E'^a} (ω^a_{ε,y_i^a,y_j^a})^T ϕ_ε(x_i^a, x_j^a, y_i^a, y_j^a) + ∑_{c∈C^{ah}} ψ_c(y_c^a|x,ω),   (4)

where E'^a denotes the set of connected action pairs in the new model. C^{ah} is the set of action-activity cliques; each action-activity clique c in C^{ah} corresponds to an action clique c^a in the action layer and its associated activity c in the activity layer. Let L = {0, 1, ..., M} be the activity label set in the action layer, from which the action variables may take values. The activity variable y_c^h takes values from an extended label set L^h = L ∪ {l_f}, where L is the set of variable values in the action layer. When an activity variable takes the value l_f, it allows its child variables to take different labels in L without additional penalty upon label inconsistency.

We define ϕ_{c,l}(y_c^a, y_c^h) as the action-activity consistency feature of activity c, and ω^{ah}_{c,l,y_c^h} as the weight vector of the action-activity consistency feature for class y_c^h. Define ϕ_{c,f}(x_c^a, y_c^h) as the intra-activity feature for activity c, and ω^{ah}_{c,f,y_c^h} as the weight vector of the intra-activity feature for class y_c^h. The corresponding action-activity higher-order potential can be defined as

    ψ(y_c^a|x,ω) = max_{y_c^h} (ω^{ah}_{c,y_c^h})^T ϕ_c(x_c^a, x_c^h, y_c^a, y_c^h)
                 = max_{y_c^h} [(ω^{ah}_{c,l,y_c^h})^T ϕ_{c,l}(y_c^a, y_c^h) + (ω^{ah}_{c,f,y_c^h})^T ϕ_{c,f}(x_c^a, y_c^h)],   (5)

where (ω^{ah}_{c,l,y_c^h})^T ϕ_{c,l}(y_c^a, y_c^h) measures the labeling consistency within the activity c. Intuitively, the higher-order potentials are constructed such that a latent variable tends to take a label from L if the majority of its child nodes take the same value, and takes the label l_f if its child nodes take diversified values. (ω^{ah}_{c,f,y_c^h})^T ϕ_{c,f}(x_c^a, y_c^h) is the intra-activity potential that measures the compatibility between the activity label of clique c and its activity features.
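To illustrate how the free label l_f works in (5), here is a minimal sketch. It assumes the action-activity consistency feature is the fraction of child labels agreeing with the candidate activity label (as defined in Section 3.3) and assumes a fixed learned weight for l_f; the names are ours, not the paper's:

```python
import numpy as np

W_FREE = 1.0  # assumed fixed score for the free label l_f (its feature is 1)

def higher_order_potential(child_labels, w_consist, w_intra, intra_scores):
    """psi(y_c^a|x, omega) of Eq. (5): maximize over the latent label y_c^h.

    child_labels -- action-layer labels y_i^a of the segments in clique c
    w_consist    -- per-class weights of the action-activity consistency feature
    w_intra      -- per-class weights of the intra-activity feature
    intra_scores -- per-class intra-activity feature scores
    """
    child_labels = np.asarray(child_labels)
    n_classes = len(intra_scores)
    # consistency feature: fraction of child nodes agreeing with label l
    frac = np.array([(child_labels == l).mean() for l in range(n_classes)])
    class_scores = (np.asarray(w_consist) * frac
                    + np.asarray(w_intra) * np.asarray(intra_scores))
    # the free label l_f leaves child labels unpenalized at a constant score
    return max(float(class_scores.max()), W_FREE)
```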

3.2.3 Incorporating Inter-Activity Potentials
As stated before, it would be helpful to model the spatial and temporal relationships between activities. For this reason, we connect activity nodes in the higher-order CRF model. The resulting CRF is shown in Fig. 3(c).


[Fig. 3 graphics: three graphs (a)-(c) built over action segments along the time axis, a symbol table (d), and the graph of [45] (e).]

Symbols used in this figure:
y_i^a — label variable for action segment i, y_i^a ∈ L, where L = {0, 1, ..., M}; a denotes the action layer and i the index of the action segment.
y_c^h — label variable for activity c, y_c^h ∈ L^h, where L^h = {0, 1, ..., M} ∪ {l_f}; h denotes the hidden activity layer and c the index of the hidden activity.

Fig. 3. Illustration of CRF models for activity recognition. (a): action-based linear-chain CRF; (b): action-based higher-order CRF model (with latent activity variables); (c): action-based two-layer Hierarchical-CRF (note that all the observations for the random variables are omitted for compactness); (d): symbols in sub-figures (a, b, c); (e): graph representation of the model in [45] for comparison. One action segment denotes a random variable in the action layer, whose value is the activity label for the action segment. A colored circle denotes a random variable in the activity layer, whose value is the label for its connected clique. As shown in (a), in the action layer, action segments that belong to the same trajectory are modeled as a linear-chain CRF. Then, hidden activity-level variables with action-activity edges (in light blue) are added for each action clique to form the higher-order CRF shown in (b). An activity and its associated action nodes have the same color. Finally, pairwise activity edges (in red) are added to form the proposed two-layer Hierarchical-CRF model.

We define ϕ_sc(x_s^h, x_d^h, y_s^h, y_d^h) as the inter-activity spatial feature that encodes the spatial relationship between activities s and d, and ω^h_{sc,y_s^h,y_d^h} as the weight vector of the inter-activity spatial feature for the class pair (y_s^h, y_d^h). Define ϕ_tc(x_s^h, x_d^h, y_s^h, y_d^h) as the inter-activity temporal feature that encodes the temporal relationship between activities s and d, and ω^h_{tc,y_s^h,y_d^h} as the weight vector of the inter-activity temporal feature for the class pair (y_s^h, y_d^h).

The pairwise activity potential between cliques s and d is defined as

    ψ(y^h|x,ω) = ∑_{sd∈E^h} [(ω^h_{sc,y_s^h,y_d^h})^T ϕ_sc(x_s^h, x_d^h, y_s^h, y_d^h) + (ω^h_{tc,y_s^h,y_d^h})^T ϕ_tc(x_s^h, x_d^h, y_s^h, y_d^h)],   (6)

where (ω^h_{sc,y_s^h,y_d^h})^T ϕ_sc(x_s^h, x_d^h, y_s^h, y_d^h) is the pairwise spatial potential between activities s and d that measures the compatibility between the candidate labels of s and d and their spatial relationship, and (ω^h_{tc,y_s^h,y_d^h})^T ϕ_tc(x_s^h, x_d^h, y_s^h, y_d^h) is the pairwise temporal potential between activities s and d that measures the compatibility between the candidate labels of s and d and their temporal relationship.

3.3 Feature Descriptors

We now define the concepts we use for the feature development. An activity is a 3D region consisting of one or multiple consecutive action segments. An agent is the underlying moving person(s) or a trajectory. The motion region at frame n is the region surrounding the moving objects of interest in the nth frame of the activity. The activity region is the smallest rectangular region that encapsulates the motion regions over all frames of the activity. In general, the same type of feature for different classes or class pairs can be different. There are mainly three kinds of features in our model: action-layer features, action-activity features and activity-layer features, which can be further divided into five types of features. We now describe how to encode motion and context information into feature descriptors.

Intra-action Feature: ϕ_ν(x_i^a, y_i^a) encodes the motion information of action segment i that is extracted from low-level motion features such as STIP features. Since in the action layer we obtain action segments by utilizing their discriminative motion patterns, we use only motion features for the development of action-layer features. STIP histograms are generated for each action segment using the bag-of-words method [25]. We train a kernel multi-class SVM upon action segments to generate the normalized confidence scores s_{i,j} of classifying action segment i as activity class j, where j ∈ {0, 1, ..., M}, such that ∑_{j=0}^M s_{i,j} = 1. In general, any kind of classifier and low-level motion features can be used here. Given an action segment i, ϕ_ν(x_i^a, y_i^a) = [s_{i,0} ... s_{i,M}]^T is developed as the intra-action feature descriptor of action segment i.

Inter-action Feature: ϕ_ε(x_i^a, x_j^a, y_i^a, y_j^a) encodes the probabilities of coexistence of action segments i and j according to their features and activity labels. ϕ_ε(x_i^a, x_j^a, y_i^a, y_j^a) = I(y_i^a) I(y_j^a), where I(y_k^a) is the Dirac measure that equals 1 if the true label of segment k is y_k^a and 0 otherwise, for k = i, j.
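The intra-action feature above is essentially a vector of normalized classifier confidences. A hedged sketch using scikit-learn (the RBF kernel and API usage here are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np
from sklearn.svm import SVC

def train_action_classifier(X_train, y_train):
    """X_train: STIP bag-of-words histograms, one row per action segment;
    y_train: activity class labels in {0, 1, ..., M}."""
    clf = SVC(kernel="rbf", probability=True)   # kernel multi-class SVM
    return clf.fit(X_train, y_train)

def intra_action_feature(clf, x_segment):
    """phi_nu(x_i^a, y_i^a) = [s_{i,0}, ..., s_{i,M}]: scores summing to 1."""
    scores = clf.predict_proba(np.asarray(x_segment).reshape(1, -1))[0]
    return scores / scores.sum()
```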

Action-Activity Consistency Feature: ϕ_{c,l}(y_c^a, y_c^h) encodes the labeling information within clique c as

    ϕ_{c,l}(y_c^a, y_c^h) = 1,                                 if y_c^h = l_f;
    ϕ_{c,l}(y_c^a, y_c^h) = (∑_{i∈c} I(y_i^a = y_c^h)) / N_c,  if y_c^h ∈ L,

where I(·) is the Dirac measure and N_c is the number of action segments in clique c.

Intra-activity Feature: ϕ_{c,f}(x_c^a, x_c^h, y_c^a, y_c^h) encodes the intra-activity motion and context information of activity c. To capture the motion pattern of an activity, we use the intra-action features of the action segments which belong to the activity. Given an activity, [max_{i∈ℵ} s_{i,0}, ..., max_{i∈ℵ} s_{i,M}] is developed as the intra-activity motion feature descriptor, where ℵ is the list of action segments in activity c.

The intra-activity context feature captures the context information about the agents and the relationships between the agents, as well as the interacting objects (e.g., the object classes, interactions between agents and their surroundings). We define a set, G, of attributes that describes such context for activities of interest, using common-sense knowledge about the activities of interest (how to identify such attributes automatically is another research topic that we do not address in this paper). For a given activity, whether the defined attributes are true or not is determined from image-level detection results. The resulting feature descriptor is a normalized feature histogram. The attributes used and the development of intra-activity context features are different for different tasks (please refer to Sections 5.2.1 and 5.3.1 for the details).

Finally, the weighted motion and context features are used as the input to a multi-class SVM, and the output confidence scores are used to develop the intra-activity feature as ϕ_{c,f}(x_c^a, y_c^h) = [s_{c,0}, ..., s_{c,M}]^T.

Inter-activity Spatial and Temporal Features: ϕ_sc(x_s^h, x_d^h, y_s^h, y_d^h) and ϕ_tc(x_s^h, x_d^h, y_s^h, y_d^h) capture the spatial and temporal relationships between activities s and d. Define the scaled distance between activities s and d at the nth frame of s as

    r_s(s(n), d) = D(O_s(n), O_d) / (R_s(n) + R_d),   (7)

where O_s(n) and R_s(n) denote the center and radius of the motion region of activity s at its nth frame, O_d and R_d denote the center and radius of the activity region of activity d, and D(·) denotes the Euclidean distance. Then, the spatial relationship of s and d at the nth frame is modeled by sc_sd(n) = bin(r_s(s(n), d)), as in Fig. 4(a). The normalized histogram sc_sd = (1/N_f) ∑_{n=1}^{N_f} sc_sd(n) is the inter-activity spatial feature of activities s and d.

Let TC be defined by the following temporal relationships: the nth frame of s is before d, the nth frame of s is during d, and the nth frame of s is after d. tc_sd(n) is the temporal relationship of s and d at the nth frame of s, as shown in Fig. 4(b). The normalized histogram tc_sd = (1/N_f) ∑_{n=1}^{N_f} tc_sd(n) is the inter-activity temporal context feature of activity s with respect to activity d.
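A minimal sketch of these two context histograms, assuming per-frame motion-region centers and radii are available and using the bin thresholds from the example in Fig. 4:

```python
import numpy as np

def spatial_context_feature(centers_s, radii_s, center_d, radius_d):
    """Inter-activity spatial feature: normalized histogram of binned
    scaled distances r_s of Eq. (7), with the three bins of Fig. 4(a)."""
    hist = np.zeros(3)
    for O_sn, R_sn in zip(centers_s, radii_s):
        r = np.linalg.norm(np.asarray(O_sn) - np.asarray(center_d)) \
            / (R_sn + radius_d)
        hist[0 if r <= 0.5 else (1 if r < 1.5 else 2)] += 1  # same / near / far
    return hist / len(centers_s)

def temporal_context_feature(frames_s, start_d, end_d):
    """Inter-activity temporal feature: histogram over {before, during, after}."""
    hist = np.zeros(3)
    for n in frames_s:
        hist[0 if n < start_d else (1 if n <= end_d else 2)] += 1
    return hist / len(frames_s)
```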

Fig. 4. (a) An example of an inter-activity spatial relationship. The red circle indicates the motion region of s at this frame, while the purple rectangle indicates the activity region of d. Assume SC is defined by quantizing and grouping r_s(n) into three bins: r_s(n) ≤ 0.5 (s and d are at the same spatial position at the nth frame of s), 0.5 < r_s(n) < 1.5 (s is near d at the nth frame of s) and r_s(n) ≥ 1.5 (s is far away from d at the nth frame of s). In the image, r_s(n) > 1.5, so sc_sd(n) = [0 0 1]. (b) An example of an inter-activity temporal relationship. The nth frame of s occurs before d, so tc_sd(n) = [1 0 0].

4 MODEL LEARNING AND INFERENCE

The parameters of the overall potential function ψ(y|x,ω) for the two-layer Hierarchical-CRF include ω^a_ν, ω^a_ε, ω^{ah}_{c,l}, ω^{ah}_{c,f}, ω^h_sc and ω^h_tc. We define the weight vector as the concatenation of these parameters:

    ω = [ω^a_ν, ω^a_ε, ω^{ah}_{c,l}, ω^{ah}_{c,f}, ω^h_sc, ω^h_tc].   (8)

Thus, the potential function ψ(y|x,ω) can be converted into a linear function with a single parameter ω as

    ψ(y^a) = max_{y^h} ω^T Γ(x, y^a, y^h),   (9)

where Γ(x, y^a, y^h), called the joint feature of activity set x, can be easily obtained by concatenating the various feature vectors in (4), (5) and (6).
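Conceptually, Γ is a block concatenation of the feature vectors, aligned with the blocks of ω in (8). A small sketch (the block names are ours):

```python
import numpy as np

def joint_feature(phi_intra, phi_inter, phi_consist, phi_intra_act,
                  phi_spatial, phi_temporal):
    """Gamma(x, y^a, y^h): concatenate the feature blocks of Eqs. (4)-(6)
    in the same order as the weight blocks of omega in Eq. (8)."""
    return np.concatenate([phi_intra, phi_inter, phi_consist,
                           phi_intra_act, phi_spatial, phi_temporal])

# The score of a candidate labeling (y^a, y^h) is omega @ joint_feature(...);
# psi(y^a) in Eq. (9) maximizes this score over the hidden labels y^h.
```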

4.1 Learning Model Parameters

Suppose we have P activity sets for learning. Let the training set be (X, Y^a, Y^h) = {(x^1, y^{1,a}, y^{1,h}), ..., (x^P, y^{P,a}, y^{P,h})}, where x^i denotes the ith activity set as well as the observed features of the set, y^{i,a} is the label vector in the action layer, and y^{i,h} is the label vector in the hidden activity layer. While there are various ways of learning the model parameters, we choose a task-oriented discriminative approach. We would like to train the model in such a way that it increases the average precision scores on the training data and thus tends to produce the correct activity labels for each action segment.

A natural way to learn the model parameter ω is to adopt the latent structural SVM. The loss Δ(x^i, ŷ^{i,a}) of labeling x^i with ŷ^{i,a} in the action layer equals the number of action segments that are associated with incorrect activity labels (an action segment is mislabeled if over half of the segment is mislabeled). From the construction of the higher-order potentials in Section 3.2.2, it is observed that, in order to achieve the best labeling of the action segments, the optimum latent activity label of an action clique must be the dominant ground-truth label l_c of its child nodes in the action layer, or the free label l_f if no dominant label exists for the action clique. Thus the loss Δ(x^i, ŷ^{i,h}) of labeling the activity layer of x^i with ŷ^{i,h} is

    Δ(x^i, ŷ^{i,h}) = ∑_{c∈V^h} I(ŷ_c^{i,h} ∉ {l_c^i, l_f}),   (10)

where I(·) is the indicator function, which equals 1 if the inside condition is satisfied and 0 otherwise. (10) counts the number of activity labels in ŷ^{i,h} that are neither a free label nor the dominant label of the corresponding child nodes. Finally, the loss of assigning x^i the labels (ŷ^{i,a}, ŷ^{i,h}) is defined as the summation of the two, that is,

    Δ(x^i, ŷ^{i,a}, ŷ^{i,h}) = Δ(x^i, ŷ^{i,a}) + Δ(x^i, ŷ^{i,h}).   (11)
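A small sketch of this loss, assuming integer label arrays and a majority-vote helper for the dominant clique label (both our own constructions):

```python
FREE_LABEL = -1  # stands in for the free label l_f

def dominant_label(child_labels):
    """Majority ground-truth label of a clique, or None if no label dominates."""
    best = max(set(child_labels), key=child_labels.count)
    return best if child_labels.count(best) * 2 > len(child_labels) else None

def total_loss(y_a_pred, y_a_true, y_h_pred, cliques):
    """Delta of Eq. (11): mislabeled action segments (action-layer loss)
    plus activity labels that are neither dominant nor free (Eq. 10)."""
    loss = sum(p != t for p, t in zip(y_a_pred, y_a_true))
    for c, y_h in zip(cliques, y_h_pred):
        l_c = dominant_label([y_a_true[i] for i in c])
        if y_h != FREE_LABEL and y_h != l_c:
            loss += 1
    return loss
```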

Next, we define a convex function F(ω) and a concave function J(ω) as

    F(ω) = (1/2) ω^T ω + C ∑_{i=1}^P max_{(ŷ^{i,a}, ŷ^{i,h})} [ω^T Γ(x^i, ŷ^{i,a}, ŷ^{i,h}) + Δ(x^i, ŷ^{i,a}, ŷ^{i,h})],   (12)

    J(ω) = −C ∑_{i=1}^P max_{y^{i,h}} ω^T Γ(x^i, y^{i,a}, y^{i,h}).

The model learning problem is given as

    ω* = argmin_ω [F(ω) + J(ω)].   (13)

Although the objective function to be minimized in (13) is not convex, it is a combination of a convex function and a concave function [29]. Such problems can be solved using the Concave-Convex Procedure (CCCP) [40], [41]. We describe an algorithm similar to the CCCP in [40] that iteratively infers the latent variables y^{i,h} for i = 1, ..., P and optimizes the weight vector ω. The inference and optimization procedures continue until convergence or until a predefined maximum number of iterations is reached.

The limitation of all learning algorithms that involve gradient optimization is that they are susceptible to local extrema and saddle points [18]. Thus, the performance of the proposed latent structural model is sensitive to initialization. There have been many works dealing with the problem of learning the parameters of hierarchical models [10], [36]. We use a coarse-to-fine scheme that separately initializes the model parameters using piecewise training, and then refines the model parameters jointly in a globally optimum manner. Specifically, the separately learned model parameters are used as the initialization values for the proposed learning algorithm. Given the weakly labeled training data with activity labels for each action segment, the dominant label l_c for each action clique can be determined. We initialize the latent activity variable of c with the dominant label l_c of its action clique c^a, and with l_f if there is no dominant label for c^a.

In the "E step", we infer the latent variables using the previously learned weight vector ω_t (or the initially assigned weight vector for the first iteration), leading to

    y_{t+1}^{i,h*} = argmax_{y^{i,h}} ω_t^T Γ(x^i, y^{i,a}, y^{i,h}).   (14)

Then, in the "M step", with the inferred latent variables y_{t+1}^{i,h*}, we solve a fully visible structural SVM (SSVM). Let us define the risk function at iteration t+1, Λ_{t+1}(ω), as

    Λ_{t+1}(ω) = C ∑_{i=1}^P max_{(ŷ^{i,a}, ŷ^{i,h})} {Δ(x^i, ŷ^{i,a}, ŷ^{i,h}) + ω^T [Γ(x^i, ŷ^{i,a}, ŷ^{i,h}) − Γ(x^i, y^{i,a}, y_{t+1}^{i,h*})]}.   (15)

Thus, the optimization problem in (13) is converted to a fully visible SSVM as

    ω*_{t+1} = argmin_ω {(1/2) ω^T ω + Λ_{t+1}(ω)}.   (16)

The problem in (16) can be converted to an unconstrained convex optimization problem [44] and solved by the modified bundle method in [38]. The algorithm iteratively searches for increasingly tight quadratic upper and lower cutting planes of the objective function until the gap between the two bounds reaches a predefined threshold. The algorithm is effective because of its very high convergence rate [37]. The visible SSVM learning algorithm specified for our problem is summarized in Algorithm 1.

Algorithm 1 Learning the model parameter in (16) through the bundle method

Input: S = ((x^1, y^1), ..., (x^P, y^P)), ω*_t, y_{t+1}^{i,h*}, C, ε
Output: Optimum model parameter ω*_{t+1}

1) Initialize ω_{t+1}^0 with ω*_t, and the cutting-plane set G_{t+1} ← Ø.
2) for k = 0 to ∞ do
3)   for i = 1, ..., P do
       find the most violated label vector for each training instance, if any, using ω_{t+1}^k (the value of ω_{t+1} at the kth iteration);
4)   end for
5)   find the cutting plane g_{ω_{t+1}^k} of Λ(ω) at ω_{t+1}^k:
       g_{ω_{t+1}^k}(ω) = ω^T ∂_ω Λ_{t+1}(ω_{t+1}^k) + b_{ω_{t+1}^k},
     where b_{ω_{t+1}^k} = Λ_{t+1}(ω_{t+1}^k) − (ω_{t+1}^k)^T ∂_ω Λ(ω_{t+1}^k);
6)   G_{t+1} ← G_{t+1} ∪ g_{ω_{t+1}^k}(ω);
7)   update ω_{t+1}: ω_{t+1}^{k+1} = argmin_ω F_{ω_{t+1}^k}(ω),
     where F_{ω_{t+1}^k}(ω) = (1/2) ω^T ω + max(0, max_{j=1,...,k} g_{ω_{t+1}^j}(ω));
8)   gap_{k+1} = min_{k'≤k} F_{ω_{t+1}^{k'+1}}(ω_{t+1}^{k'+1}) − F_{ω_{t+1}^k}(ω_{t+1}^{k+1});
9)   if gap_{k+1} ≤ ε, then return ω*_{t+1} = ω_{t+1}^{k+1};
10) end for
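The outer CCCP loop alternates the E and M steps above. A compact sketch, where `infer_latent` and `solve_ssvm` are caller-supplied hypothetical helpers standing in for Eq. (14) and Algorithm 1:

```python
import numpy as np

def learn_cccp(train_set, omega_init, infer_latent, solve_ssvm,
               max_iters=20, tol=1e-4):
    """Alternate the E step (Eq. 14) and the M step (Eq. 16).

    infer_latent(x, y_a, omega) -- hypothetical helper implementing Eq. (14)
    solve_ssvm(train_set, latents, omega) -- hypothetical helper implementing
        the fully visible SSVM of Eq. (16), e.g. via Algorithm 1
    """
    omega = np.asarray(omega_init, dtype=float)
    for _ in range(max_iters):
        # E step: best hidden activity labels under the current weights
        latents = [infer_latent(x, y_a, omega) for x, y_a in train_set]
        # M step: re-estimate the weights with the latent labels fixed
        omega_new = solve_ssvm(train_set, latents, omega)
        if np.linalg.norm(omega_new - omega) < tol:   # convergence test
            return omega_new
        omega = omega_new
    return omega
```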

4.2 Inference

Suppose the model parameter vector ω is given. We now describe how to identify the optimum label vector y^a for a test instance x that maximizes (9). The inference problem is generally NP-hard for multi-class problems; thus, MAP inference algorithms, such as loopy belief propagation [29], are slow to converge. We propose an approximation method that alternately optimizes the hidden variable y^h and the label vector y^a. Such an algorithm is guaranteed to increase the objective at every iteration [29]. Let us define the activity-layer potential function as

    ψ^h(y^h) = ∑_{c∈C^a} ψ(y_c^a|x,ω) + ψ(y^h|x,ω).   (17)

For each iteration, with the current predicted label vector y^a fixed, the inference sub-problem is to find the y^h that maximizes ψ^h(y^h). An efficient greedy search method is used to find the optimum y^h, as described in Algorithm 2. In order to simplify the inference, we force the edge weights between non-adjacent actions to be zero. With the inferred hidden variable y^h, the model is reduced to a one-layer discriminative CRF. The inference sub-problem of finding the optimum y^a can now be solved by computing the exact mixed integer solution. We initialize the process by holding the hidden variable fixed using the values obtained from automatic activity detection. The process continues until convergence or until a predefined maximum number of iterations is reached.

Algorithm 2 Greedy search algorithm for the sub-problem of finding the optimum hidden variable y^h

Input: Testing instance with action-layer labels y^a
Output: Hidden variable labels y^h

1) Initialize (V^h, y^h) ← (Ø, Ø) and ψ^h = 0.
2) repeat
     for each activity c ∉ V^h and each candidate label y_c^h: Δψ^h(y_c^h) = ψ^h(y^h ∪ y_c^h) − ψ^h(y^h);
     y_c^{h,opt} = argmax_{c∉V^h} Δψ^h(y_c^h);
     (V^h, y^h) ← (V^h, y^h) ∪ (c, y_c^{h,opt});
3) until all activities are labeled.
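A hedged Python sketch of Algorithm 2, where `score_gain` (hypothetical) evaluates the potential gain Δψ^h of assigning label y to activity c given the current partial assignment:

```python
def greedy_hidden_labels(activities, labels, score_gain):
    """Algorithm 2: greedily assign hidden activity labels by largest gain.

    score_gain(c, y, assigned) -- hypothetical helper returning the increase
        in psi^h from adding label y for activity c to the partial assignment
    """
    assigned = {}                     # activity -> chosen hidden label
    remaining = set(activities)
    while remaining:
        # best (activity, label) pair among all unassigned activities
        c, y = max(((c, y) for c in remaining for y in labels),
                   key=lambda cy: score_gain(cy[0], cy[1], assigned))
        assigned[c] = y
        remaining.remove(c)
    return assigned
```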

4.2.1 Analysis of Computational Complexity
We now discuss the computational complexity of inference for a particular activity set consisting of n action segments and m activities. Assume there are M activity classes in the problem. For the graphical model in [45], the time complexity of the inference, as discussed in that paper, is O(d_max n² M), where d_max is the maximum number of action segments one activity may have. The inference on both the higher-order CRF and the Hierarchical-CRF is carried out layer by layer, so the overall time complexity is linear in the number of layers used. Specifically, we use two-layer CRFs with an action layer and an activity layer. For the higher-order CRF model, inference on the activity layer takes O(mM) computation to obtain the activity labels for each candidate activity. With the inferred activity labels, inference on the action layer takes O(nM²), since the model is reduced to a chain-CRF. For the Hierarchical-CRF, the increase in computational complexity over the higher-order CRF lies in the inference on the activity layer, because the activities are connected with each other in this model. Using the proposed greedy search algorithm, the time complexity of inference on the activity layer is O(m²M). Thus, the overall complexity of inference is O(T·(mM + nM²)) for the higher-order CRF and O(T·(m²M + nM²)) for the Hierarchical-CRF, where T is the number of iterations. Furthermore, the number of action segments n is usually several times the number of activities, that is, n = αm, where α is a small positive value larger than one; d_max and T are also small positive values larger than one. Assuming n, m and M are of the same order, which is a reasonable assumption for our case, the asymptotic computational complexity of the model in [45] and of the compared higher-order CRF and Hierarchical-CRF models is of the same order.

5 EXPERIMENTAL RESULTS

The goal of our framework is to locate and recognize activities of interest in continuous videos using both motion and context information about the activities; therefore, datasets with segmented video clips or independent activities like Weizmann [11], KTH [31], the UT-Interaction Dataset [30] and the Collective Activity Dataset [7] do not fit our evaluation goal. To assess the effectiveness of our framework in activity modeling and recognition, we perform experiments on two challenging datasets containing long-duration videos: the UCLA Office Dataset [32] and the VIRAT Ground Dataset [9].

5.1 Motion Segmentation and Activity Localization

We first develop an automatic motion segmentation algorithm that detects boundaries where the statistics of the motion features change dramatically, and thus obtain the action segments. Let two NDMs be denoted as M1 and M2, and let d_s be the dimension of the hidden states. The distance between the models can be measured by the normalized geodesic distance dist(M1, M2) = (4/(d_s π²)) ∑_{i=1}^{d_s} θ_i², where θ_i is the ith principal subspace angle (please refer to [5] for details on the distance computation).

A sliding window of size T_s, where T_s is the number of temporal bins in the window, is applied to each detected motion region along time. An NDM M(t) is built for the time window centered at the tth temporal bin. Since an action can be modeled as one dynamic model, the model distances between subsequences from the same action should be small compared to those of subsequences from a different action. Suppose an activity starts from temporal bin k; the average model distance between temporal bins j > k and k is defined as the weighted average distance between model j and the neighboring models of k as

    DE_k(j) = ∑_{i=0}^{T_d−1} γ_i · dist(M(k+i), M(j)),   (18)

where T_d is the number of neighboring bins used, and γ_i is the smoothing weight for model k+i, which decreases along time. When the average model distance grows above a predefined threshold d_th, an action boundary is detected. Action segments along tracks are thus obtained.
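A minimal sketch of this boundary detector, assuming a `model_distance(a, b)` helper that returns dist(M(a), M(b)) between the NDMs of two temporal bins; T_d, d_th and the decaying weights are illustrative choices:

```python
import numpy as np

def detect_boundaries(num_bins, model_distance, T_d=3, d_th=0.5):
    """Scan temporal bins and emit an action boundary whenever the
    weighted average model distance DE_k(j) of Eq. (18) exceeds d_th."""
    gamma = np.array([2.0 ** -i for i in range(T_d)])
    gamma /= gamma.sum()                  # smoothing weights, decreasing in time
    boundaries = []
    k = 0                                 # start bin of the current action
    for j in range(1, num_bins):
        DE = sum(g * model_distance(k + i, j)
                 for i, g in enumerate(gamma) if k + i < j)
        if DE > d_th:
            boundaries.append(j)
            k = j                         # a new action starts at bin j
    return boundaries
```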

A multi-class SVM is trained upon the intra-activity features (as described in Section 3.3) of activities of different classes. After obtaining the action segments, we use the sliding-window method with the trained multi-class SVM to group adjacent action segments into candidate activities. To speed up, we only work on candidate activities with confidence scores larger than a predefined threshold, indicating that they are likely to belong to activity classes of interest.

5.2 UCLA Dataset

The UCLA Office Dataset [32] consists of indoor and outdoor videos of single activities and person-person interactions. Here, we perform experiments on the videos of the office scene, containing about 35 minutes of activities in an office room captured with a single fixed camera. We identify 10 frequent activities as the activities of interest: 1 - enter room; 2 - exit room; 3 - sit down; 4 - stand up; 5 - work on laptop; 6 - work on paper; 7 - throw trash; 8 - pour drink; 9 - pick phone; 10 - place phone down. Each activity occurs 9 to 26 times in the dataset. Since the dataset contains only single-person activities, it is natural to model the activities in one sequence together. The dataset is divided into 8 sets; each set contains 2 sequences of activities, and each sequence contains 2 to 19 activities of interest, as well as a varying number of background activities. We use leave-one-set-out cross validation for the evaluation: 7 sets for training and 1 set for testing.

5.2.1 Preprocessing

The intra-activity context feature is based on interactions between the agent and the surroundings. In the office dataset, there are 7 classes of objects that are frequently involved in the activities of interest: laptop, garbage can, papers, phone, coffee maker and cup. Fig. 5 shows the detected objects of interest in the office room.

Fig. 5. Detected objects of interest in the UCLA office scene.

Since the UCLA Dataset consists of single-person activities, the intra-activity attributes considered include agent-object interactions and their relative locations. We identify N_G = 10 subsets of attributes for the development of intra-activity context features in the experiment, as shown in Fig. 6. For a given activity, the above attributes are determined from image-level detection results. The locations of objects are automatically tracked.

Attribute Subset — Associated Attributes
G1, G2, G3 — the agent is touching / not touching the laptop¹, paper², phone³.
G4, G5 — the agent is occluding / not occluding the garbage can⁴, coffee maker⁵.
G6, G7, G8 — the agent is near / far away from the garbage can⁶, coffee maker⁷, door⁸.
G9 — the agent disappears / does not disappear at the door.
G10 — the agent appears / does not appear at the door.

Fig. 6. Subsets of context attributes used for the development of intra-activity context features for the UCLA Dataset (the superscripts indicate the correspondence between the subsets and the objects).

 

 

 

[Fig. 7 images: touch laptop, touch paper, occlude garbage can, touch phone.]

Fig. 7. Examples of agent-object interactions detected from images.

Similar to [32], if enough skin color is detected within the areas of the laptop, paper and phone, the corresponding attributes are considered true. Fig. 7 shows examples of detected agent-object interactions.

Whether the agent is near or far away from an object is determined by the distance between the two, based on normal distributions of the distances under the two scenarios. Probabilities indicating how likely the agent is to be near or far away from an object are thus obtained. For frame n of an activity, we obtain g_i(n) = I(G_i(n)), where I(·) is the indicator function. g_i(n) is then normalized so that its elements sum to 1.

Related candidate activities are connected. Whether two activities are related can be naturally determined by their temporal distance. One way to decide if the relationship between two candidate activities should be modeled is to check whether they are in the α-neighborhood of each other in time. Two activities are said to be in the α-neighborhood of each other if fewer than α other activities occur between the two.
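A small sketch of this connectivity rule, assuming the candidate activities are given in temporal order of their start times:

```python
def alpha_neighborhood_edges(num_activities, alpha=2):
    """Connect pairs of candidate activities (given in temporal order)
    that have fewer than alpha other activities between them."""
    edges = []
    for i in range(num_activities):
        for j in range(i + 1, num_activities):
            if j - i - 1 < alpha:      # activities strictly between i and j
                edges.append((i, j))
    return edges
```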

5.2.2 Experimental Results
Although the UCLA Dataset has been used in [32], recognition accuracy for the office dataset is not provided in that paper. We compare the performance of the popular BOW+SVM classifier and our model. The experimental results, in precision and recall, are shown in Fig. 8. In order to show the effects of incorporating different kinds of motion and context features, we also show results of the action-based linear-chain CRF approach and the action-based higher-order CRF approach (Fig. 3(a) and 3(b)). It can be seen that the use of intra-activity context increases the recognition accuracy of activities with obvious context patterns. For example, "enter room" is characterized by the context that the agent appears at the door. The increased recognition accuracy of "enter room" from using intra-activity context features indicates that our model successfully captures this characteristic. From the performance of the higher-order CRF approach and the Hierarchical-CRF approach, we can see that for activities with strong spatio-temporal patterns, such as "pick phone" and "place phone down", modeling the inter-activity spatio-temporal relationships increases the recognition accuracy significantly.

[Fig. 8 charts: precision and recall bars for the ten activity classes under BOW+SVM, linear-chain CRF, higher-order CRF and HCRF (α = 2).]

Fig. 8. Precision (a) and recall (b) for the ten activities in the UCLA Office Dataset. The activities are defined in Section 5.2. HCRF is short for Hierarchical-CRF.

Next, we change the value of α to see how it influences the recognition accuracy of the Hierarchical-CRF approach. Fig. 9 compares the overall accuracy of the different methods and of the Hierarchical-CRF approach with different α values.

Method                    Overall   Average per-class
BOW+SVM                   82.2      80.7
Linear-chain CRF          73.6      72.5
Higher-order CRF          83.9      84.3
HCRF (α = 1)              87.9      87.1
HCRF (α = 2)              89.9      90.8
HCRF (fully connected)    73.5      74.2

Fig. 9. Overall and average per-class accuracy for different methods on the UCLA Office Dataset. The BOW+SVM method is tested on video clips, while the other results are in the framework of our proposed action-based CRF models upon automatically detected action segments. HCRF is short for Hierarchical-CRF.

From the results, we can see that the Hierarchical-CRF approach with α = 2 outperforms the other models. This is expected: when α is too small, the spatio-temporal relationships of related activities are not fully utilized, while the Hierarchical-CRF with a fully connected activity layer models the spatio-temporal relationships of unrelated activities. For instance, in the UCLA Office Dataset, one typical temporal pattern of activities is that a person sits down to work on the laptop, then the same person stands up to do other things, and then sits down to work on the laptop again. All these activities are conducted sequentially. Thus, the Hierarchical-CRF model with a fully connected activity layer captures the false temporal pattern of "stand up" followed by "work on the laptop". The optimum value of α can be obtained using cross validation on the training data.

5.3 VIRAT Ground Dataset

The VIRAT Ground Dataset is a state-of-the-art activity dataset with many challenging characteristics, such as wide variation in the activities and clutter in the scene. The dataset consists of surveillance videos of realistic scenes at different scales and resolutions, each lasting 2 to 15 minutes and containing up to 30 events. The activities defined in Release 1 include: 1 - person loading an object into a vehicle; 2 - person unloading an object from a vehicle; 3 - person opening a vehicle trunk; 4 - person closing a vehicle trunk; 5 - person getting into a vehicle; 6 - person getting out of a vehicle. We work on all the scenes in Release 1 except scene 0002 and use half of the data for training and the rest for testing. Five more activities are defined in VIRAT Release 2: 7 - person gesturing; 8 - person carrying an object; 9 - person running; 10 - person entering a facility; 11 - person exiting a facility. We work on all the scenes in Release 2 except scenes 0002 and 0102, and use two-thirds of the data for training and the rest for testing.

5.3.1 Preprocessing

Motion regions that do not involve people are excluded from the experiments, since we are only interested in person activities and person-vehicle interactions. For the development of STIP histograms, the nearest-neighbor soft-weighting scheme [25] is used.
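A minimal sketch of nearest-neighbor soft-weighting in the spirit of [25]: each STIP descriptor votes for its k nearest codewords with exponentially decaying weights, rather than a single hard assignment. The codebook representation and the exact decay here are assumptions for illustration.

```python
import numpy as np

def soft_weighted_histogram(descriptors, codebook, k=4):
    # descriptors: (n, d) array of STIP descriptors for one segment
    # codebook:    (K, d) array of visual words
    hist = np.zeros(len(codebook))
    for d in descriptors:
        dists = np.linalg.norm(codebook - d, axis=1)
        for i, idx in enumerate(np.argsort(dists)[:k]):
            hist[idx] += 0.5 ** i   # i-th nearest neighbor gets weight 1/2^i
    total = hist.sum()
    return hist / total if total > 0 else hist  # length-invariant histogram
```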

Since we work on the VIRAT Dataset with individual person activities and person-object interactions, we use the following N_G = 7 subsets of attributes, shown in Fig. 10, for the development of intra-activity context features in the experiments.

Persons and vehicles are detected with the part-based object detection method in [9]. Opening/closing entrance/exit doors of facilities, boxes, and bags are detected using the method in [6] with a binary linear SVM as the classifier. Using these high-level image features, we follow the description in Section 5.2.1 to develop the feature descriptors for each activity set. The first three subsets of attributes in Fig. 10 are used for the experiments on Release 1, and all of them are used for the experiments on Release 2. Fig. 11 shows examples of g_i(n), defined as in Section 5.2.1, for different activities in VIRAT. Since activities in VIRAT are naturally related to each other, the activity-layer nodes are fully connected to utilize the spatio-temporal relationships of activities occurring in the same local space-time volume.

5.4 Recognition Results on VIRAT Release 1

Fig. 12 compares the precision and recall for the six activities defined in VIRAT Release 1 using the BOW+SVM method and our approach with different kinds of features. The results show, as expected, that the recognition accuracy increases as the various context features are encoded.


Subset   Associated attributes
G1       moving object is a person; moving object is a vehicle trunk; moving object is of other kind.
G2       the agent is at the body of the interacting vehicle; the agent is at the rear/head of the interacting vehicle; the agent is far away from the vehicles.
G3       the agent disappears at the body of the interacting vehicle; the agent appears at the body of the interacting vehicle; none of the two.
G4       the agent disappears at the entrance of a facility; the agent appears at the exit of a facility; none of the two.
G5       velocity of the agent (in pixels) is larger than a predefined threshold; velocity of the object of interest is smaller than a predefined threshold.
G6       the activity occurs at parking areas; the activity occurs at other areas.
G7       an object (e.g. bag/box) is detected on the agent; no object is detected on the agent.

Fig. 10. Subsets of context attributes used for the development of intra-activity context features.

Activity   person loading   person unloading   opening trunk   closing trunk
(example images omitted)
g1(n)      [1/2 0 1/2]      [1 0 0]            [1/2 1/2 0]     [1/2 1/2 0]
g2(n)      [0 1 0]          [0 1 0]            [0 1 0]         [0 1 0]
g5(n)      [0 1]            [0 1]              [0 1]           [0 1]
g6(n)      [1 0]            [1 0]              [1 0]           [1 0]
g7(n)      [1 0]            [0 1]              [0 1]           [1 0]

Activity   getting into vehicle   getting out of vehicle   gesturing   carrying object
(example images omitted)
g1(n)      [1 0 0]                [1 0 0]                  [1 0 0]     [1/2 0 1/2]
g2(n)      [1 0 0]                [1 0 0]                  [0 0 1]     [0 0 1]
g5(n)      [0 1]                  [0 1]                    [0 1]       [0 1]
g6(n)      [1 0]                  [1 0]                    [1 0]       [1 0]
g7(n)      [0 1]                  [0 1]                    [0 1]       [1 0]

Fig. 11. Examples of detected intra-activity context features. The example images are shown with detected high-level image features: an object in a red bounding box is a moving person; an object in a blue bounding box is a static vehicle; an object in an orange bounding box is a moving object of other kind; an object in a black bounding box is a bag/box on the agent.
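The fractional entries in Fig. 11 (e.g., [1/2 1/2 0]) suggest that each g_i(n) is a normalized count of which attribute in subset G_i fires across the detections within a segment. The sketch below follows that reading; `detect_attribute` is a hypothetical per-frame attribute classifier, and the normalization scheme is an assumption.

```python
import numpy as np

def context_feature(frames, subset_size, detect_attribute):
    # Count which attribute of the subset fires in each frame, then
    # normalize, yielding fractional vectors like those in Fig. 11.
    counts = np.zeros(subset_size)
    for frame in frames:
        counts[detect_attribute(frame)] += 1  # attribute index for this frame
    total = counts.sum()
    return counts / total if total > 0 else counts

# e.g., for G2 (agent at body / at rear / far from vehicle), an activity
# whose agent stays at the vehicle rear in every frame gives [0, 1, 0].
```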

For instance, the higher-order CRF approach encodes the intra-activity context patterns of the activities of interest. Thus, activities with strong intra-activity context patterns, such as "person getting into vehicle", are better recognized by the higher-order CRF approach than by the linear-chain CRF approach, which does not model intra-activity context. The Hierarchical-CRF approach further encodes the inter-activity context patterns of the activities. Thus, activities with strong spatio-temporal relationships to each other are better recognized by the Hierarchical-CRF approach. For instance, the higher-order CRF approach often confuses "open a vehicle trunk" and "close a vehicle trunk" with each other. However, if the two activities happen close together in time at the same place, the first activity in time is probably "open a vehicle trunk". This kind of contextual information within and across activity classes is captured by the Hierarchical-CRF approach and used to improve the recognition performance. Fig. 13 shows examples that demonstrate the significance of context in activity recognition.
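To make the open/close-trunk example concrete, the sketch below shows one way such an inter-activity cue could enter a pairwise score between two activity labels: it rewards label pairs whose temporal order around the same vehicle matches the expected pattern. The names, data structure, and weights are illustrative stand-ins, not the paper's exact potentials.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start_time: float
    end_time: float
    vehicle_id: int  # id of the vehicle the activity occurs around

def trunk_order_score(label_a, label_b, seg_a, seg_b, w=1.0):
    # Only score pairs around the same vehicle with a clear temporal order.
    if seg_a.vehicle_id != seg_b.vehicle_id or seg_a.end_time > seg_b.start_time:
        return 0.0
    if (label_a, label_b) == ("open_trunk", "close_trunk"):
        return +w   # expected order: open before close
    if (label_a, label_b) == ("close_trunk", "open_trunk"):
        return -w   # inverted order: penalize this joint labeling
    return 0.0
```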

[Fig. 12: two bar charts, (a) precision and (b) recall, over activities 1-6; legend: BOW+SVM, Linear-chain CRF, Higher-order CRF, HCRF.]

Fig. 12. Precision (a) and recall (b) for the six activities defined in VIRAT Release 1.

[Fig. 13 shows nine example frames in three rows. Top row: getting out of vehicle, opening trunk, getting into vehicle. Middle row: loading an object, unloading an object, getting into vehicle. Bottom row: getting out of vehicle, closing trunk, getting into vehicle.]

Fig. 13. Example activities (defined in VIRAT Release 1) correctly recognized by the action-based linear-chain CRF (top); incorrectly recognized by the linear-chain CRF but corrected by the higher-order CRF with intra-activity context (middle); and incorrectly recognized by the higher-order CRF but rectified by the action-based hierarchical CRF with inter-activity context (bottom).

We also show the results on VIRAT Release 1 for the different methods using overall and average accuracy in Fig. 14. We compare our results with the popular BOW+SVM approach, the more recently proposed String-of-Feature-Graphs approach [12], [43], and the structural model in [45].

Method                  Average accuracy
BOW+SVM [25]            45.8
SFG [44]                57.6
Structural Model [45]   62.9
Linear-chain CRF        42.6
Higher-order CRF        60.4
Hierarchical-CRF        66.2

Fig. 14. Average accuracy for the six activities defined in VIRAT Release 1. Note that BOW+SVM works on video clips, while the other methods work on continuous videos.

The Hierarchical-CRF approach outperforms the other methods. These results are expected, since the intra-activity and inter-activity context within and between actions and activities gives the model additional information about the activities of interest, beyond the motion information encoded in the low-level features. The SFG approach models the spatial and temporal relationships between low-level features and thus takes into account the local structure of the scene; however, it does not consider the relationships between the various activities, and thus our method outperforms the SFGs. The structural model in [45] models the intra- and inter-activity context within and between activities; however, it does not model the action layer or the interactions between actions and activities.

5.5 Recognition Results on VIRAT Release 2

VIRAT Release 2 defines additional activities of interest. We work on VIRAT Release 2 to further evaluate the effectiveness of the proposed approach. We follow the method described above to obtain the recognition results on this dataset. Fig. 15 compares the precision and recall for the eleven activities defined in VIRAT Release 2 for the BOW+SVM method, the structural model in [45], and our method. We see that, by modeling the relationships between activities, those with strong context patterns, such as "person closing a vehicle trunk" (4) and "person running" (9), achieve larger performance gains than activities with weak context patterns, such as "person gesturing" (7). Fig. 16 shows example results on activities in Release 2.

Fig. 17 compares the recognition accuracy, measured by recall, for the different methods. We can see that the performance of our Hierarchical-CRF approach is comparable to that of the recently proposed method in [1]. In [1], an SPN on BOW is learned to explore the context among motion features. However, [1] works on video clips, each containing an activity of interest with an additional 10 seconds placed randomly before or after the target activity instance, while we work on continuous videos.

6 CONCLUSION

In this paper, we have designed a framework for the modeling and detection of activities in continuous videos.

[Fig. 15: two bar charts, (a) precision and (b) recall, over activities 1-11; legend: BOW+SVM, Linear-chain CRF, Higher-order CRF, HCRF.]

Fig. 15. Precision (a) and recall (b) for the eleven activities defined in VIRAT Release 2.

[Fig. 16 shows example frames for: opening trunk, getting out of vehicle, entering a facility, exiting a facility, person running, carrying an object.]

Fig. 16. Examples of recognition results (from VIRAT Release 2). For each pair of rows, the examples in the bottom row show the effect of context features in correctly recognizing activities that were incorrectly recognized by the linear-chain CRF approach, while other examples of the same activities correctly recognized by the linear-chain CRF are shown in the top row.

The proposed framework jointly models a variable number of activities in continuous videos, with action segments as the basic motion elements. The model explicitly learns the activity durations and motion patterns for each activity class, as well as the context patterns within and across actions and activities of different classes, from the training activity sets. It has been demonstrated that joint modeling of activities, by encapsulating object interactions and the spatial and temporal relationships of activity classes, can significantly improve the recognition accuracy.


Method                  Average accuracy
BOW+SVM [25]            55.4
SPN [1]                 70.0
Structural Model [45]   73.5
Linear-chain CRF        52.5
Higher-order CRF        69.4
Hierarchical-CRF        75.1

Fig. 17. Average accuracy (in recall) for different methods on VIRAT Release 2.


It is worth noting that more complex activities can be modeled by adding further layers to the hierarchical model. However, additional layers increase the learning and inference complexity by increasing the treewidth. A balance must therefore be struck between the representational power of the hierarchical model and its computational complexity.

REFERENCES

[1] M. R. Amer and S. Todorovic. Sum-product networks for modeling activities with stochastic structure. In CVPR, 2012.

[2] M. R. Amer, D. Xie, M. Zhao, S. Todorovic, and S.-C. Zhu. Cost-sensitive top-down/bottom-up inference for multiscale activity recognition. In ECCV, 2012.

[3] B. Yao and L. Fei-Fei. Modeling mutual context of object and human pose in human-object interaction activities. In CVPR, 2010.

[4] W. Brendel and S. Todorovic. Learning spatiotemporal graphs of human activities. In ICCV, 2011.

[5] R. Chaudhry, A. Ravichandran, G. Hager, and R. Vidal. Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions. In CVPR, 2009.

[6] W. Choi and S. Savarese. A unified framework for multi-target tracking and collective activity recognition. In ECCV, 2012.

[7] W. Choi, K. Shahid, and S. Savarese. Learning context for collective activity recognition. In CVPR, 2011.

[8] C. Desai, D. Ramanan, and C. C. Fowlkes. Discriminative models for multi-class object layout. International Journal of Computer Vision, 2011.

[9] S. Oh et al. A large-scale benchmark dataset for event recognition in surveillance video. In CVPR, 2011.

[10] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2010.

[11] L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri. Actions as space-time shapes. IEEE Trans. on Pattern Analysis and Machine Intelligence, Dec. 2007.

[12] U. Gaur, Y. Zhu, B. Song, and A. K. Roy-Chowdhury. A "string of feature graphs" model for recognition of complex activities in natural videos. In ICCV, 2011.

[13] A. Gupta, P. Srinivasan, J. Shi, and L. S. Davis. Understanding videos, constructing plots: Learning a visually grounded storyline model from annotated videos. In CVPR, 2009.

[14] D. Han, L. Bo, and C. Sminchisescu. Selection and context for action recognition. In ICCV, 2009.

[15] S. Khamis, V. I. Morariu, and L. S. Davis. Combining per-frame and per-track cues for multi-person action recognition. In ECCV, 2012.

[16] P. Kohli, L. Ladicky, and P. H. S. Torr. Robust higher order potentials for enforcing label consistency. IJCV, 2010.

[17] N. Komodakis. Learning to cluster using high order graphical models with latent variables. In ICCV, 2011.

[18] L. Ladicky, C. Russell, P. Kohli, and P. H. S. Torr. Associative hierarchical CRFs for object class image segmentation. In ICCV, 2009.

[19] L. Ladicky, P. Sturgess, K. Alahari, C. Russell, and P. H. S. Torr. What, where & how many? Combining object detectors and CRFs. In ECCV, 2010.

[20] T. Lan, Y. Wang, S. N. Robinovitch, and G. Mori. Discriminative latent models for recognizing contextual group activities. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2012.

[21] T. Lan, Y. Wang, W. Yang, and G. Mori. Beyond actions: Discriminative models for contextual group activities. In NIPS, 2010.

[22] I. Laptev. On space-time interest points. International Journal of Computer Vision, 2005.

[23] M. Marszalek, I. Laptev, and C. Schmid. Actions in context. In CVPR, 2009.

[24] V. I. Morariu and L. S. Davis. Multi-agent event recognition in structured scenarios. In CVPR, 2011.

[25] Y.-G. Jiang, C.-W. Ngo, and J. Yang. Towards optimal bag-of-features for object categorization and semantic video retrieval. In ACM CIVR, 2007.

[26] M. Nguyen, Z. Lan, and F. De la Torre. Joint segmentation and classification of human actions in video. In CVPR, 2011.

[27] J. C. Niebles, H. Wang, and L. Fei-Fei. Modeling temporal structure of decomposable motion segments for activity classification. In ECCV, 2010.

[28] A. Oliva and A. Torralba. The role of context in object recognition. Trends in Cognitive Sciences, 2007.

[29] K. P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.

[30] M. S. Ryoo and J. K. Aggarwal. UT-Interaction Dataset, ICPR contest on Semantic Description of Human Activities (SDHA). http://cvrc.ece.utexas.edu/SDHA2010/Human_Interaction.html, 2010.

[31] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: A local SVM approach. In ICPR, 2004.

[32] Z. Si, M. Pei, B. Yao, and S. Zhu. Unsupervised learning of event and-or grammar and semantics from video. In ICCV, 2011.

[33] B. Song, T. Jeng, E. Staudt, and A. Roy-Chowdhury. A stochastic graph evolution framework for robust multi-target tracking. In ECCV, 2010.

[34] J. Sun, X. Wu, S. Yan, L.-F. Cheong, T.-S. Chua, and J. Li. Hierarchical spatio-temporal context modeling for action recognition. In CVPR, 2009.

[35] K. Tang, L. Fei-Fei, and D. Koller. Learning latent temporal structure for complex event detection. In CVPR, 2012.

[36] B. Taskar, V. Chatalbashev, and D. Koller. Learning associative Markov networks. In ICML, 2004.

[37] C. H. Teo, Q. Le, A. Smola, and S. V. N. Vishwanathan. A scalable modular convex solver for regularized risk minimization. In SIGKDD, pages 727-736, 2007.

[38] C. H. Teo, S. V. N. Vishwanathan, A. Smola, and Q. V. Le. Bundle methods for regularized risk minimization. Journal of Machine Learning Research, 2010.

[39] J. Wang, Z. Chen, and Y. Wu. Action recognition with multiscale spatio-temporal contexts. In CVPR, 2011.

[40] C.-N. J. Yu and T. Joachims. Learning structural SVMs with latent variables. In ICML, 2009.

[41] A. L. Yuille and A. Rangarajan. The concave-convex procedure (CCCP). Neural Computation, vol. 15, April 2003.

[42] E. Zheleva, L. Getoor, and S. Sarawagi. Higher-order graphical models for classification in social and affiliation networks. In NIPS, 2010.

[43] Y. Zhu, N. Nayak, U. Gaur, B. Song, and A. Roy-Chowdhury. Modeling multi-object interactions using "string of feature graphs". Computer Vision and Image Understanding, 2013.

[44] Y. Zhu, N. Nayak, and A. Roy-Chowdhury. Context-aware activity recognition and anomaly detection in video. IEEE Journal of Selected Topics in Signal Processing, 2013.

[45] Y. Zhu, N. M. Nayak, and A. K. Roy-Chowdhury. Context-aware modeling and recognition of activities in video. In CVPR, 2013.

[46] Z. Zivkovic. Improved adaptive Gaussian mixture model for background subtraction. In ICPR, 2004.


Yingying Zhu received her Master's degrees in Engineering in 2007 and 2010 from Shanghai Jiao Tong University and Washington State University, respectively. She is currently pursuing the Ph.D. degree, which she is expected to receive soon, in Intelligent Systems in the Department of Electrical and Computer Engineering at the University of California, Riverside. Her main research interests include pattern recognition and machine learning, computer vision, image/video processing, and communication.

Nandita M. Nayak received her Bachelor's degree in Electronics and Communication Engineering from M. S. Ramaiah Institute of Technology, Bangalore, India, in 2006 and her Master's degree in Computational Science from the Indian Institute of Science, Bangalore, India. She received her Ph.D. degree from the Department of Computer Science and Engineering at the University of California, Riverside. Her main research interests include image processing and analysis, computer vision, and artificial intelligence.

Amit K. Roy-Chowdhury received the Bachelor's degree in Electrical Engineering from Jadavpur University, Calcutta, India, the Master's degree in systems science and automation from the Indian Institute of Science, Bangalore, India, and the Ph.D. degree in Electrical Engineering from the University of Maryland, College Park. He is a Professor of Electrical and Computer Engineering at the University of California, Riverside. His research interests include image processing and analysis, computer vision, pattern recognition, and statistical signal processing. His current research projects include vision networks, distributed visual analysis, wide-area scene understanding, visual recognition and search, video-based biometrics (gait), and biological video analysis. He has authored the monograph Camera Networks: The Acquisition and Analysis of Videos over Wide Areas, and published about 150 papers and book chapters. He has been on the organizing and program committees of multiple conferences and serves on the editorial boards of a number of journals.