

Int J Comput Vis · DOI 10.1007/s11263-011-0487-2

Learning Behavioural Context

Jian Li · Shaogang Gong · Tao Xiang

Received: 3 December 2009 / Accepted: 28 July 2011. © Springer Science+Business Media, LLC 2011

Abstract We propose a novel framework for the automatic discovery and learning of behavioural context for video-based complex behaviour recognition and anomaly detection. Our work differs from most previous efforts on learning visual context in that our model learns multi-scale spatio-temporal rather than static context. Specifically, three types of behavioural context are investigated: behaviour spatial context, behaviour correlation context, and behaviour temporal context. To that end, the proposed framework consists of an activity-based semantic scene segmentation model for learning behaviour spatial context, and a cascaded probabilistic topic model for learning both behaviour correlation context and behaviour temporal context at multiple scales. These behaviour context models are deployed for recognising non-exaggerated multi-object interactive and co-existence behaviours in public spaces. In particular, we develop a method for detecting subtle behavioural anomalies against the learned context. The effectiveness of the proposed approach is validated by extensive experiments carried out using data captured from complex and crowded outdoor scenes.

Keywords Visual context · Behavioural context · Video-based behaviour recognition · Activity-based scene segmentation · Cascaded topic models · Anomaly detection

J. Li (✉) · S. Gong · T. Xiang
School of Electronic Engineering and Computer Science, Queen Mary University of London, London E1 4NS, UK
e-mail: [email protected]

S. Gong
e-mail: [email protected]

T. Xiang
e-mail: [email protected]

1 Introduction

Visual context is the environment, background, and settings within which objects and associated events are observed visually. Humans employ visual context extensively for both object recognition in a static setting and behaviour recognition in a dynamic environment. For instance, for object recognition we can differentiate and recognise whether a hand-held object is a mobile phone or a calculator by its relative position to other body parts (e.g. closeness to the ears), even though the two are visually similar and partially occluded by the hand. Similarly, for behaviour recognition, the arrival of a bus can be detected/inferred just by looking at the passengers' behaviour at a bus stop. Indeed, extensive cognitive, physiological and psychophysical studies have shown that visual context plays a critical role in human visual perception (Palmer 1975; Biederman et al. 1982; Bar and Ullman 1993; Bar and Aminof 2003; Bar 2004). Motivated by these studies, there is an increasing interest in exploiting contextual information for computer vision tasks such as object detection (Heitz and Koller 2008; Murphy et al. 2003; Kumar and Hebert 2005; Carbonetto et al. 2004; Wolf and Bileschi 2006; Rabinovich et al. 2007; Gupta and Davis 2008; Galleguillos et al. 2008; Zheng et al. 2009), action recognition (Marszalek et al. 2009) and tracking (Yang et al. 2008; Ali and Shah 2008).

Previous studies on visual context are predominantly focused on static visual context, particularly regarding the scene background, scene category, and other co-existing objects in a scene. However, for understanding object behaviour in a crowded space, the most relevant visual context is no longer static, due to the non-stationary background and non-rigid relationships among co-existing objects in a public space. In particular, a meaningful interpretation of object behaviour depends largely on knowledge of spatial and temporal context defining where and when it occurs, and of correlation context specifying the expectation inferred from the correlated behaviours of other objects co-existing in the same scene. In this work, we propose a novel framework for unsupervised discovery and learning of object behavioural context for context-aware interactive and group behaviour modelling that can facilitate the detection of global anomalies in a crowded space.

Fig. 1 The behaviour of an object in a traffic junction needs to be understood by taking into account its spatial context (i.e. where it occurs), temporal context (i.e. when it takes place), and correlation context (i.e. how other correlated objects behave in the same shared space). In (b), a fire engine moved horizontally and broke the vertical traffic flow. This is an anomaly because it is incoherent with both the temporal context (it happens during the vertical traffic phase) and the correlation context (horizontal and vertical traffic are not expected to occur simultaneously)

Let us first define what constitutes behavioural context. In this work, we consider three types of behavioural context:

1. Behaviour Spatial Context provides situational awareness about where a behaviour is likely to take place. A public space serving any public function, such as a road junction or a train platform, can often be segmented into a number of distinctive zones within which behaviours of certain characteristics are expected in one zone but differ from those observed in other zones. We call these behaviour sensitive zones semantic regions. For instance, in a train station the behaviours of passengers on the train platform and in front of a ticket machine can be very different. Another example can be seen in Fig. 1, where different traffic zones/lanes play an important role in defining how objects are expected to behave.

2. Behaviour Correlation Context specifies how the meaning of a behaviour can be affected by those of other objects, either nearby in the same semantic region or further away in other regions. Object behaviours in a complex scene are often correlated and need to be interpreted together rather than in isolation, e.g. behaviours are contextually correlated when they occur concurrently in a scene. Figure 1 shows examples of moving vertical traffic flow typically co-occurring with standby horizontal flow, with the exception of emergency vehicles running a red light (see Fig. 1(b)). It is important to note that in a complex dynamic scene composed of multiple semantic regions, there is behaviour correlation context at two scales: local correlation context within each region, and global context that represents behaviour correlations across regions.

3. Behaviour Temporal Context provides information regarding when different behaviours are expected to happen, both inside each semantic region and across regions. This is again illustrated by the examples shown in Fig. 1, where behaviour temporal context is determined by traffic light phases. More specifically, a meaningful interpretation of vehicle behaviour needs to take into account the temporal phasing of the traffic light, e.g. a vehicle is expected to be moving if the traffic light governing its lane is green and stopped if red. Even if the traffic light is not directly visible in the scene, this translates directly to observing how other vehicles move in the scene, i.e. the expected traffic flow. Similar to behaviour correlation context, the temporal context also has two scales corresponding to within-region and across-region context. Successful identification of such temporal context, both in local regions and globally in a scene, is crucial for establishing behaviour constraints at different scales, which play a key role in identifying abnormal behaviours.

To model and infer these three types of behavioural context, a novel context learning framework is proposed, consisting of two key components:

(a) A behaviour-based semantic scene segmentation model for automatically learning behaviour spatial context. Given a crowded public scene, we decompose the scene into a number of disjoint local regions according to how different behaviour patterns are observed spatially over time. To that end, the problem of semantic scene segmentation is treated as an image segmentation problem and solved by employing a spectral clustering algorithm with the number of clusters determined automatically. However, different from conventional image segmentation methods (Shi and Malik 2000; Malik et al. 2001), where each pixel location is represented using static visual appearance features of colour and texture, we represent each pixel location using a behaviour-footprint. To obtain this behaviour-footprint, object behaviours are represented as classes of scene events. Each event class corresponds to the behaviour of a group of objects with a certain size and specific motion directions. The occurrences of different classes of object behaviours accumulated over time are then used to compute the behaviour-footprint. The similarity of footprints between two different pixel locations determines whether they should be grouped into the same semantic region.

(b) A cascaded probabilistic topic model for learning both behaviour correlation context and behaviour temporal context. Probabilistic topic models (PTM) (Blei et al. 2003; Teh et al. 2006) such as Latent Dirichlet Allocation (LDA) (Blei et al. 2003) and Probabilistic Latent Semantic Analysis (pLSA) (Hofmann 1999a, 1999b) are Bag of Words (BoW) models traditionally used for capturing co-occurrence of text words in document analysis. In this work, we explore topic models for learning behaviour correlation context, with 'words' and 'documents' corresponding to visual events and video clips respectively. We choose topic models because visual features extracted from a crowded dynamic scene are inevitably noisy, and a BoW model such as a topic model is intrinsically more robust against input noise. However, standard topic models are contextually unscalable, i.e. they are unable to capture and differentiate between local (within region) and global (across region) behavioural context. To address this problem, we propose a Cascaded Latent Dirichlet Allocation (Cas-LDA) model employing two stages of topic modelling in a cascade. The first stage model captures behaviour correlation context within each semantic region via inferred hidden (local) topics. The inferred local context is then used as input to a second stage model which aims to capture global correlation context. In addition to correlation context modelling, the proposed cascaded topic model is also exploited for inferring temporal behavioural context. Specifically, topic profiles inferred from both stages of our cascaded topic model at any given time are used to represent the temporal characteristics of behaviours and to infer temporal behavioural context both locally (within each region) and globally (across regions).

A practical aim of learning behavioural context is to be able to detect subtle and non-exaggerated abnormal behaviours in public space. A behavioural anomaly captured in video from a crowded public space is only likely to be meaningful when detected and interpreted in a context. We argue that such context is necessarily intricate at different levels for a crowded space and not easily specified, if at all possible, by hard-wired top-down rules, either exhaustively or partially. In particular, an identical behaviour in a public space can be deemed either normal or abnormal depending on when, where, and how other objects behave. For instance, a person running on a platform with a train approaching and all other people also running is normal, whilst the same person running on an empty platform with no train in sight is more likely to be abnormal. In other words, a behavioural anomaly can be defined and measured as contextual incoherence, that is, behaviour that can neither be predicted nor explained away using the learned or inferred behavioural context. With the learned behavioural context as an intricate part of our model, the model is more adept at detecting subtle and unpredictable (unknown) behavioural anomalies that are otherwise undetectable. The effectiveness of the proposed approach is validated through extensive experiments carried out using complex and crowded outdoor scenes.

The rest of the paper is organised as follows: Sect. 2 reviews related work to highlight the contributions of this work. Section 3 addresses the problem of behaviour representation. A behaviour-based semantic scene segmentation model for learning behaviour spatial context is described in Sect. 4; to that end, we also formulate a spectral clustering algorithm. Section 5 centres on a novel cascaded topic model used for learning both behaviour correlation and temporal context. A context-aware video behaviour anomaly detection method is described in Sect. 6. In Sect. 7, the effectiveness and robustness of our approach are evaluated extensively in a series of experiments using data from three different outdoor public scenes. We draw conclusions in Sect. 8.

2 Related Work

Existing studies on visual context modelling are dominated by the modelling of static scene or object appearance context for object detection in a static image. Objects in a scene captured in an image can be divided into two categories (Heitz and Koller 2008): monolithic objects, or "things" (e.g. cars and people), and regions with homogeneous or repetitive patterns, or "stuff" (e.g. roads and sky). Consequently, there are Scene-Thing (Murphy et al. 2003), Stuff-Stuff (Singhal et al. 2003), Thing-Thing (Rabinovich et al. 2007), and Thing-Stuff (Heitz and Koller 2008) context, depending on what the target objects are and where the context comes from. In this work, the focus is on dynamic behavioural context in video. Therefore different terminologies and definitions are necessary. Nevertheless, it is useful to draw an analogy to static context for easier understanding. For instance, behaviour correlation context can be seen as an extension of the Thing-Thing context into the space and time domain. The spatial context shares similarity with the Scene-Thing and Thing-Stuff context. The temporal context, on the other hand, is unique to video.

More recently, Marszalek et al. (2009) proposed to exploit static context for dynamic scene understanding. Specifically, scene context is represented as scene categories and used for action recognition. However, compared to the behaviour spatial context studied in this work, the scene category information used in Marszalek et al. (2009) carries little information for understanding how objects are expected to behave in different regions of a scene. Moreover, action recognition does not address the problem of behaviour understanding: a person walking in different scenes can be interpreted as very different behaviours. Similarly, our previous work (Xiang and Gong 2006a) also considered categories of facial expressions and scene events as visual context. However, as in Marszalek et al. (2009), categories of events and facial actions only convey object-centred and isolated contextual information, with little if any measure of the spatial, correlation and temporal context that defines the changing environment (scene context) within which an object behaves. Alternatively, Yang et al. (2008) proposed a method for visual tracking by employing context extracted from so-called auxiliary objects. This context is dynamic in nature as it is associated with the tracked object and of similar motion characteristics. However, the problem of visual tracking differs significantly from that of behaviour interpretation, and behavioural context is not limited to objects of similar motion directions (e.g. in the example in Fig. 1, stopped vehicles can provide useful contextual information for vehicles that are moving). Another work on tracking with context modelling is presented in Ali and Shah (2008), where floor fields are computed to assist in tracking objects in a very crowded scene. Apart from modelling context for different objectives (tracking vs. behaviour understanding and anomaly detection), our approach differs significantly in that we aim to learn both spatial and temporal dynamic behavioural context at different scales, rather than focusing only on correlations of objects as in Ali and Shah (2008). To the best of our knowledge, our work is the first attempt to (a) systematically model behavioural context by learning from data of complex scenes without exhaustive top-down hard-wired rules, (b) provide an effective solution for learning different aspects of behavioural context in a principled manner, and (c) demonstrate the usefulness and importance of context learning for behavioural anomaly detection.

There have been some efforts on automatic discovery of the layout of a dynamic scene, which is closely related to one aspect of the problem studied here, namely learning behaviour spatial context. To that end, there are techniques focusing on learning two specific types of scene layout: entry and exit points/zones (Breitenstein et al. 2008; Wang et al. 2006, 2008; Makris et al. 2004), and static occlusion zones (Greenhill et al. 2008). There are also techniques attempting to capture a broader range of scene layout such as routes, junctions and paths (Makris and Ellis 2005; Wang et al. 2010). Most existing techniques rely on object trajectory based scene representations, and the motivation for learning scene layout is to achieve more robust tracking between camera views or under occlusion. However, visual tracking is intrinsically limited, especially in crowded scenes where object visual appearance is no longer continuous or undergoing smooth change, an underlying assumption for establishing visual tracking. Moreover, due to scene complexity, realistic abnormal behaviours are often not well defined by object trajectories alone. In particular, the context from which behaviour can be interpreted meaningfully is not only object-centred but also spatial, correlational and temporal among objects in a shared space. A trajectory based scene model is thus often insufficient for behaviour anomaly detection. To address this problem, the proposed behavioural context learning model neither relies on object tracking nor establishes continuous trajectories, and can therefore be applied to crowded scenes with severe occlusions.

There exist a number of approaches to behaviour correlation modelling which could potentially be used for correlation context learning and inference. Xiang and Gong (2006b) employ Dynamic Bayesian Networks (DBNs) to learn temporal relationships among scene events. The temporal dynamics of each event type are represented by a Hidden Markov Model (HMM), and the learned topology of a DBN reveals the temporal/causal relationships among different events. However, learning a DBN with multiple temporal processes is computationally very expensive. In particular, learning the topology of a DBN from data rapidly becomes intractable as the number of temporal processes representing event classes increases. To overcome this problem, Wang et al. (2009) adopted hierarchical probabilistic topic models (PTMs), including a Hierarchical Dirichlet Processes (HDP) mixture model and a Dual Hierarchical Dirichlet Processes (Dual-HDP) model, to categorise global visual behaviours using three levels of abstraction, from low-level motion patterns and atomic activities to high-level behaviour interactions. Compared to DBNs, topic models are less demanding computationally and also less sensitive to input noise due to their Bag of Words nature, although they sacrifice sensitivity to scale variations and discard temporal order information. Compared to the hierarchical PTM of Wang et al., our cascaded topic models have two desirable features: (1) We decompose a complex dynamic scene into semantic regions by learning behaviour spatial context. Consequently, behaviour correlation context is modelled at both the local (within region) and global (across region) scales given the learned spatial context. This leads to better behaviour understanding and anomaly detection, as demonstrated in our experiments (see Sect. 7.6). (2) Using a cascade of simple topic models is computationally more efficient than using a single, necessarily complex, hierarchical topic model.

A PTM is essentially a Bag of Words model that ignores temporal order information in exchange for robustness against noise and input errors. Recently, efforts have been made to introduce dynamics modelling into topic models in order to make them sensitive to the dynamics of behaviour whilst keeping the robustness of a topic model. The result is a hybrid of PTM and DBN, or a dynamic topic model, with a hierarchical model structure. Hospedales et al. (2009) proposed a Markov Clustering Topic Model (MCTM) for modelling video behaviours. Using a MCTM, the temporal order of documents is modelled explicitly. This model was further extended by Kuettel et al. (2010), who developed a Dependent Dirichlet Process Hidden Markov Model (DDP-HMM) to model temporal order information at both the topic and document levels. With temporal information modelled explicitly, in theory a dynamic topic model is well suited for detecting behavioural anomalies caused by abnormal temporal orders between behaviours. However, in practice it is difficult to strike the right balance between model sensitivity and robustness. Importantly, both models are hierarchical models; they therefore suffer from the same problem as the HDP based PTM models in Wang et al. (2009) when applied to abnormal behaviour detection. That is, in a hierarchical model the numbers of inputs in different layers from bottom to top are extremely imbalanced. For example, for both DDP-HMM and MCTM, the bottom layer models video words, which are in the order of thousands per clip; the number of actions/topics in the layer above is in the order of dozens, whilst the number of temporal phases in the top layer is only a handful. This imbalanced modelling structure may cause problems in detecting anomalies, because anomalies occurring at the upper layers can easily be overwhelmed by those in the bottom layer and become undetectable.

The original ideas of semantic scene segmentation (Li et al. 2008a) and anomaly detection using a cascaded topic model (Li et al. 2008b) were introduced in our earlier work. In this paper, we provide a more coherent and complete treatment of learning behavioural context for anomaly detection in a single framework, with extended comparative evaluations against alternative models, including the recently proposed hierarchical topic models of Wang et al. (2009) and the dynamic topic model of Hospedales et al. (2009). We also analyse the pros and cons of different topic models for different behaviour recognition tasks, and provide additional implementation details.

3 Behaviour Representation

An event-based behaviour representation is adopted. We consider visual events as significant scene changes occurring over a short temporal window, characterised by the location, shape and motion information associated with each change. These visual scene events are object-independent and location specific, and their detection does not rely on object segmentation and tracking. We further consider visual behaviours as different collections of categorised and labelled scene events. But let us first consider in more detail how we compute scene events.

Suppose a long continuous video of a scene is split into non-overlapping clips. We consider this long video as the training data for learning a model. Within each clip, scene events are first detected and represented in each image frame in isolation. Specifically, this is computed as follows: (1) Foreground pixels are identified by employing a background subtraction method (Russell and Gong 2006). (2) These foreground pixels are then grouped into blobs using connected components, and each blob corresponds to a scene event and is assigned a rectangular bounding box. (3) The scene events detected in each frame, or frame-wise events, are represented as a 10-D feature vector:

[x, y, w, h, r_s, r_p, u, v, r_u, r_v],   (1)

where (x, y) and (w, h) are the centroid position and the width and height of the bounding box respectively, r_s = w/h is the ratio between width and height, r_p is the percentage of foreground pixels in the bounding box, (u, v) is the median optic flow vector over all foreground pixels in a blob, computed using Lucas and Kanade (1981), and r_u = u/w and r_v = v/h are scaling features relating motion information to blob shape. Note that different features have very different value ranges; before clustering the events to discover groupings, all features are normalised to the range [0, 1].
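As a concrete illustration, the following is a minimal sketch of the frame-wise event feature in (1), assuming a binary foreground mask, dense optic flow fields and an integer bounding box are already available; all function and variable names are illustrative rather than taken from the original implementation.

import numpy as np

def frame_event_feature(fg_mask, flow_u, flow_v, bbox):
    """10-D frame-wise event feature of (1) for one blob.
    fg_mask: binary foreground mask; flow_u/flow_v: dense optic flow
    fields; bbox: integer (x0, y0, w, h) of the blob's bounding box."""
    x0, y0, w, h = bbox
    x, y = x0 + w / 2.0, y0 + h / 2.0          # centroid position
    box = np.s_[y0:y0 + h, x0:x0 + w]
    fg = fg_mask[box] > 0                      # assumed non-empty
    r_s = w / float(h)                         # width/height ratio
    r_p = fg.mean()                            # foreground-pixel percentage
    u = float(np.median(flow_u[box][fg]))      # median optic flow of the blob
    v = float(np.median(flow_v[box][fg]))
    r_u, r_v = u / w, v / h                    # motion/shape scaling features
    return np.array([x, y, w, h, r_s, r_p, u, v, r_u, r_v])

Each feature would then be min-max normalised to [0, 1] over the training data before any clustering, as required above.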

However, scene events detected in each frame in isolation are inevitably noisy due to image noise, occlusion between objects and non-stationary background clutter. To minimise error in scene event representation, the detected frame-wise events from each clip are clustered into groups to form clip-wise events. The mean and variance of the 10 event features from each group are employed to represent the corresponding clip-wise event.


Algorithm 1: Behaviour representation using events.
Input: A continuous video V, divided into T non-overlapping video clips V = {v_1, ..., v_t, ..., v_T}; the t-th clip v_t consists of F image frames v_t = [I_{t1}, ..., I_{tf}, ..., I_{tF}].
Output: A set of clip-wise events computed from each clip.

1   for t = 1 to T do
2       for f = 1 to F do
3           In the f-th frame of the t-th clip, extract frame-wise events by grouping foreground pixels into blobs;
4           Represent each frame-wise event using (1);
5       end
6       Cluster all frame-wise events from the t-th clip into clip-wise events;
7       Represent each clip-wise event using (2);
8   end
9   During training, cluster all clip-wise events from V using a GMM;
10  During testing, assign each clip-wise event a class label using the learned GMM;

More specifically, the clustering within a clip is by K-means, with the number of clusters automatically determined as the average number of scene events detected per frame over the clip. Given the clustering of all detected frame-wise events in each clip, each clip-wise event, denoted e, is represented by a 20-D feature vector:

e = [e_m, e_v],   (2)

where e_m and e_v correspond to the mean and variance of the 10 features (see (1)) over all the frame-wise events in each group within each clip. With this representation, scene events are no longer solely dependent on local features computed from individual frames. Instead, they are represented by group averages and variances over a short video clip. Consequently, the representation is much more robust against noise without losing frame locality within each clip.

Clip-wise scene events from all the clips of a set of training videos are further clustered by a Gaussian Mixture Model (GMM), with the number of clusters (V) automatically determined using the BIC model selection score (Schwarz 1978). This aims to categorise all the scene events into a finite set of groups, similar to the idea of determining typical words for a given document. Each scene event is then assigned the class label of a particular cluster. Pseudo code for computing events from video clips is provided in Algorithm 1.
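For illustration, here is a minimal sketch of this BIC-based model selection step using scikit-learn's GaussianMixture; the search range max_k and the diagonal covariance are assumptions for the sketch, not choices reported in the paper.

import numpy as np
from sklearn.mixture import GaussianMixture

def cluster_events_bic(events, max_k=30):
    """Fit GMMs with 1..max_k components to the 20-D clip-wise events
    and keep the one with the lowest BIC; its component count is V."""
    best, best_bic = None, np.inf
    for k in range(1, max_k + 1):
        gmm = GaussianMixture(n_components=k, covariance_type='diag',
                              random_state=0).fit(events)
        bic = gmm.bic(events)
        if bic < best_bic:
            best, best_bic = gmm, bic
    return best                 # event class labels: best.predict(events)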

The motivations for taking a two-stage clustering approach to event computation are two-fold. First, we aim to remove outliers in the events computed at the frame level so that our representation is more robust against noise. Second, after grouping frame-wise events into clip-wise events, each clip-wise event is represented by both the mean of the corresponding group of frame-wise events and their variance (see (2)). The variance captures information about how a group of frame-wise events evolves over the duration of a video clip. This implicit temporal information is useful for describing behavioural characteristics.

This representation of behaviour by globally categorised scene events, independent of object types, has a number of advantages over existing methods. First, compared with object-centred trajectory based methods (Breitenstein et al. 2008; Wang et al. 2008; Makris et al. 2004), it does not require object segmentation and tracking. This makes it more suitable for crowded scenes where tracking is severely limited intrinsically. Second, compared to other scene event detection methods (Xiang and Gong 2006b; Wang et al. 2009), our scene events are constructed from richer and more reliable features, and smoothed over a temporal window (a non-overlapping clip) for more robustness against local noise.

4 Learning Behaviour Spatial Context

We consider that behaviour spatial context is given by disjoint semantic regions where behaviour patterns observed within each region are similar to each other whilst being dissimilar to those occurring in other regions. Moreover, we wish to discover such semantic regions automatically and without supervision from data. We address this problem using a two-step approach as follows: (1) Each pixel location in the scene is labelled by a behaviour-footprint measuring how different behaviour patterns occur over time. This pixel-wise behaviour-footprint is necessarily a distribution measurement, e.g. a histogram of scene event classes occurring at each location over time (Fig. 2). (2) Given the behaviour-footprints computed at all pixel locations of a scene, a spectral clustering algorithm is employed to segment all pixel locations by their behaviour-footprints into different non-overlapping regions, with the optimal number of regions determined automatically. We describe these steps in more detail below.

Behaviour-Footprint First, a behaviour-footprint is computed for each pixel location in a scene. As described in Sect. 3, behaviours are represented by object-independent scene events categorised in space and over time. Suppose there are V classes of scene events; to measure how different behaviours have taken place at a pixel location in a video, its behaviour-footprint is computed as a histogram of V bins:

p = [p_1, ..., p_v, ..., p_V]   (3)


Fig. 2 (Color online) Examples of behaviour-footprints. Three pixel locations and their corresponding behaviour-footprints are shown colour-coded. It is evident that the red and green locations have similar behaviour-footprints that differ from the yellow location

where p_v counts the number of occurrences of the v-th event class at this pixel location. Figure 2 shows examples of computed behaviour-footprints at different locations of a scene.
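To make the footprint computation concrete, here is a minimal sketch; spreading each event's count over its whole bounding box is an assumption about what "occurrence at a pixel location" means in practice, and all names are illustrative.

import numpy as np

def behaviour_footprints(height, width, clip_events, V):
    """Per-pixel V-bin histogram of event-class occurrences, as in (3).
    clip_events: list of ((x0, y0, w, h), class_label) clip-wise events."""
    fp = np.zeros((height, width, V))
    for (x0, y0, w, h), label in clip_events:
        fp[y0:y0 + h, x0:x0 + w, label] += 1   # count the event at every
    return fp                                  # location its box covers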

Scene Segmentation Second, learning behaviour spatial context is treated as a segmentation problem where the image space is segmented by pixel-wise feature vectors given by the V dimensional behaviour-footprints (see (3)). For segmentation, we employ the spectral clustering model of Zelnik-Manor and Perona (2004). Given a scene with N pixel locations, an N × N affinity matrix A is constructed, and the similarity between the behaviour-footprints at the i-th and j-th locations is computed as:

A(i, j) = \begin{cases} \exp\!\left(-\frac{(d(p_i, p_j))^2}{\sigma_i \sigma_j}\right) \exp\!\left(-\frac{(d(x_i, x_j))^2}{\sigma_x^2}\right), & \text{if } \|x_i - x_j\| \le r, \\ 0, & \text{otherwise} \end{cases}   (4)

where p_i and p_j are the behaviour-footprints at the i-th and j-th pixel loci, d denotes Euclidean distance, σ_i and σ_j are the scaling factors for the feature vectors at the i-th and j-th positions, x_i and x_j are the pixel coordinates, and σ_x is the spatial scaling factor. r is the radius of the circle within which similarity is computed.¹ Using this model, two pixel locations will have strong similarity, and thus be grouped into one semantic region, if they have similar behaviour-footprints and are also close to each other spatially, e.g. the red and green dots in Fig. 2.

¹ This hard cut-off constraint is introduced to further prevent pixels that are far apart from being grouped together. It also reduces the computational cost of computing the affinity matrix A: without the constraint, the affinity between each pair of behaviour-footprints must be evaluated exhaustively, which can be very expensive. For example, with a moderate-sized behaviour-footprint image of 200 by 200, the number of affinity values to compute without the constraint is 1,600,000,000. This number is reduced to around 12,000,000 with the constraint when r is set to 10.

For spectral clustering, choosing correct scaling factors is critical for obtaining a meaningful segmentation. The Zelnik-Perona model computes σ_i using a pre-defined constant distance between the behaviour-footprint at the i-th pixel location and its neighbours, which can be rather arbitrary and sensitive to outliers. This results mostly in under-fitting for our problem (see Sect. 7.2). To address this problem, we revise the model as follows. We compute σ_i as the standard deviation of behaviour-footprint distances between the i-th pixel location and all locations within a given radius r. Similarly, the spatial scaling factor σ_x is computed as the mean of the spatial distances between all locations within radius r and the centre.² The affinity matrix is then normalised according to:

\bar{A} = L^{-1/2} A L^{-1/2}   (5)

where L is a diagonal matrix with:

L(i, i) = \sum_{j=1}^{N} A(i, j).   (6)

The normalised matrix Ā is then used as the input to the Zelnik-Perona algorithm, which automatically determines the number of clusters and performs segmentation. This procedure groups the pixel locations of a given scene into Q optimal regions by their behaviour-footprints. Note that because we perform scene segmentation by behaviours, locations where no or few foreground objects are detected (via background subtraction) are excluded from the segmentation process, i.e. they are not among the N pixel locations to be clustered using the above algorithm.

² Note that both σ_i and σ_x aim to measure the distribution of distances within a neighbourhood. σ_i measures how similar the behaviour-footprints are within the neighbourhood. Since the similarity/distance values can vary greatly, using the standard deviation is more robust against outliers than using the mean. σ_x, on the other hand, measures the distribution of image coordinate distances of pixel locations within the neighbourhood. This distribution is constant across neighbourhoods, and using the mean is sufficient.
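The following is a small-scale sketch of (4)-(6) with the revised local scaling factors. It builds the dense N × N matrices directly, so it is only practical for subsampled footprint maps, and treating σ_x per location (rather than as a single per-neighbourhood scalar) is a simplifying assumption of the sketch.

import numpy as np

def normalised_affinity(fp, coords, r=10.0):
    """Affinity of (4) with the revised scaling factors, followed by the
    normalisation of (5)-(6). fp: (N, V) footprints; coords: (N, 2)."""
    d_fp = np.linalg.norm(fp[:, None] - fp[None, :], axis=2)
    d_xy = np.linalg.norm(coords[:, None] - coords[None, :], axis=2)
    near = d_xy <= r                           # hard cut-off of (4)
    N = len(coords)
    sig = np.array([d_fp[i, near[i]].std() + 1e-9 for i in range(N)])
    sig_x = np.array([d_xy[i, near[i]].mean() + 1e-9 for i in range(N)])
    A = np.exp(-d_fp ** 2 / np.outer(sig, sig)) * \
        np.exp(-d_xy ** 2 / (sig_x[:, None] * sig_x[None, :]))
    A[~near] = 0.0
    L = A.sum(axis=1)                          # diagonal of L in (6)
    return A / (np.sqrt(np.outer(L, L)) + 1e-12)   # normalisation of (5)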


Fig. 3 Structure of the cascaded topic modelling

5 Learning Behaviour Correlation and Temporal Context

We aim to learn behaviour correlation and temporal context, which constrains how the behaviour of an object is correlated to, and thus affected by, those of other objects both nearby in the same semantic region and further away in other regions of a scene. Given the learned spatial behavioural context (Sect. 4), we consider both behaviour correlation and temporal context at two scales: regional context and global context. In order to discover the context at both scales, a two-stage cascaded topic model is formulated in this section. Figure 3 shows the structure of the proposed model. The inputs to the first stage of the model are regional behaviours represented as labels of regional events (Sect. 5.1). The topics inferred in the first stage correspond to regional behaviour correlation context and are utilised for computing regional (local) temporal context, corresponding to the temporal phases of behaviours in different regions. The learned regional temporal context is further used as input to the second stage of topic modelling. We define global behaviour correlation context as the topics inferred by this second-stage model, which are also used to compute global temporal context. More details follow.

5.1 Regional Behaviour Representation

Let us first describe how the input to the proposed model, the regional behaviours, is represented. Recall that, due to the lack of any prior information at the initial behavioural grouping stage for scene segmentation, all 10 features together with their corresponding variances were used to represent scene events (see (2)). These settings are not necessarily optimal for accurately describing behaviours once the scene has been decomposed semantically into regions. In particular, since most behaviour patterns are likely to be similar within each region but differ more across regions, it is sensible to select different features for event representation in different regions. We call these regional events (as compared to scene events). This needs to be determined automatically and to be scalable. To this end, we follow the same procedure as described in Sect. 3 but perform an additional refinement step on event grouping in each region. Specifically, given a decomposed scene, we determine the most representative features in each region by computing entropy values for the 10 features (see (1)) in that region and selecting the top 5 features (i.e. half) with the highest entropy values. This results in a smaller and more selective set of features representing events tuned to each region. After feature selection, these regional events are clustered using a GMM within each region, yielding V^q regional event classes in each region q, where 1 ≤ q ≤ Q. Note that each region may have a different number of event classes; those numbers are determined by automatic model order selection using BIC, as in Sect. 3. The regional event class labels are the inputs to the cascaded topic model described below.

5.2 Multi-Scale Context Learning

For learning and modelling multi-scale context, we explore a topic model based on the concept of Latent Dirichlet Allocation (Blei et al. 2003). LDA has been widely used for text document analysis, aiming to discover semantic topics from text documents according to the co-occurrence correlation of words. In LDA, a document w is a collection of N_w words, w = {w_1, ..., w_n, ..., w_{N_w}}, and can be modelled as a mixture of K topics z = {z_1, ..., z_K}. Each topic is modelled as a multinomial distribution over a vocabulary consisting of V words, from which all words in w are sampled. Given the vocabulary, a document is represented as a V-dimensional feature vector, each element of which counts how many times a specific word occurs in the document. LDA is essentially a Bag of Words model that clusters co-occurring words into topics. It provides a more concise (and arguably more semantic) representation of a document than using all the words directly. The number of topics K is in general much smaller than the size of the codebook/vocabulary V. In order to learn behavioural context both locally within each semantic region and globally across different regions, we formulate a Cascaded Latent Dirichlet Allocation (Cas-LDA) model with two stages.


Stage I—Learning Regional Context For modelling regional behaviour correlation and temporal context, we consider that regional events correspond to words, all events detected in a single region over a clip form a document, correlations of regional events (regional correlation context) correspond to topics, and the inferred topic profile of each document is used for categorising that document into temporal phases (regional temporal context).

More specifically, a training video (or a set of videos) of a scene is segmented temporally into equal-length, non-overlapping short video clips. These short video clips are treated as documents for training a cascaded topic model, given the Q semantic regions of a spatially segmented scene (Sect. 4). In the first stage of our cascaded topic model learning, each region is modelled using an LDA. Regional events detected in the q-th region are visual words that form a document denoted d^q_t; the documents corresponding to all T clips in the q-th region form the regional corpus D^q = {d^q_t}, where t = 1, ..., T is the clip index (subsequently omitted for conciseness). Assuming that there are V^q classes of regional events in the q-th region, the size of the codebook/vocabulary is thus V^q. Each document is modelled as a mixture of K^q topics, which correspond to K^q different types of regional behaviour correlations and represent our regional correlation context. Note that, different from the conventional LDA formulation where a document is represented as the counts of different visual words, our document is represented as a V^q dimensional binary feature vector, with each element indicating whether a certain regional event class is present in that clip. This is because (1) we are interested in how different behaviours correlate by co-occurrence in each document/clip, rather than how often they occur, and (2) this binary vector representation is more robust to noise/error in event detection. Our extensive experiments demonstrate that this modified document representation is more advantageous than the standard representation (see Sect. 7.6).

An LDA model for the q-th region has the following parameters:

1. α^q: a K^q dimensional vector governing the Dirichlet distributions of topics in the corpus, i.e. all clips for the q-th region in the training video;
2. β^q: a K^q × V^q matrix representing the multinomial distributions of words in the vocabulary for all learned topics, where β^q_{k,n} = P(w^q_n | z^q_k) and \sum_{n=1}^{V^q} β^q_{k,n} = 1.

Given these model parameters, visual words in a local document can be repeatedly sampled to generate a document of N^q_w words, d^q = {w^q_n}, as follows:

1. Sample a K^q dimensional vector θ^q from the Dirichlet distribution governed by parameter α^q: θ^q ∼ Dir(α^q). The vector θ^q contains the information about how the K^q topics are to be mixed in the document.

2. Sample words from topics:
(a) Choose a topic for w^q_n: z^q_n ∼ Multinomial(θ^q).
(b) Choose a word w^q_n from the vocabulary of V^q words according to P(w^q_n | z^q_n, β^q).

Following the conditional dependencies of the components in the generative process, we can compute the log-likelihood of a document d^q given the model parameters as:

\log p(d^q \mid \alpha^q, \beta^q) = \log \int p(\theta^q \mid \alpha^q) \left( \prod_{n=1}^{N_w^q} \sum_{z_n^q} p(z_n^q \mid \theta^q)\, p(w_n^q \mid z_n^q, \beta^q) \right) d\theta^q,   (7)

and the log-likelihood of the whole corpus D^q = {d^q_t} of T clips for the q-th region as:

\log p(D^q) = \sum_{t=1}^{T} \log p(d_t^q \mid \alpha^q, \beta^q),   (8)

where t is the clip index and T is the total number of documents in the corpus (i.e. all the clips in the training video).

The model parameters α^q and β^q are estimated by maximising the log-likelihood function log p(D^q) in (8). However, there is no closed-form analytical solution to this problem; a variational EM algorithm is employed instead (Blei et al. 2003).

In the E-step of variational inference, the posterior distribution of the hidden variables p(θ^q, {z^q_n} | d^q, α^q, β^q) in a specific document d^q is approximated by a variational distribution q(θ^q, {z^q_n} | γ^q, φ^q), where γ^q and φ^q are document-specific variational parameters (note that the clip index t is omitted here). As shown by Blei et al. (2003), maximising the log-likelihood log p(d^q | α^q, β^q) corresponds to minimising the Kullback-Leibler (KL) divergence between q(θ^q, {z^q_n} | γ^q, φ^q) and p(θ^q, {z^q_n} | d^q, α^q, β^q), resulting in log p(d^q | α^q, β^q) being approximated by its maximised lower bound L(γ^q, φ^q; α^q, β^q). By setting α^q and β^q as constants, the variational parameters for d^q_t are estimated according to the following pair of updating equations:

\phi_{n,k}^q \propto \beta_{k,v}^q \exp\!\left( \Psi(\gamma_k^q) - \Psi\!\left( \sum_{k=1}^{K^q} \gamma_k^q \right) \right),   (9)

\gamma_k^q = \alpha_k^q + \sum_{n=1}^{N_w^q} \phi_{n,k}^q,   (10)

where n = 1, ..., N^q_w indexes the n-th word in d^q; k = 1, ..., K^q indexes the k-th regional topic; v = 1, ..., V^q indexes the v-th word in the regional vocabulary; and Ψ is the first-order derivative of the log Γ function.
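To make the update pair concrete, here is a minimal sketch of the per-document E-step, iterating (9) and (10); the fixed iteration count and the initialisation are assumptions of the sketch, and a production implementation would instead test convergence of the lower bound.

import numpy as np
from scipy.special import digamma   # the Psi function in (9)

def e_step(doc_words, alpha, beta, n_iter=50):
    """Variational E-step for one document: iterate (9) and (10).
    doc_words: list of word indices; alpha: (K,); beta: (K, V)."""
    K = len(alpha)
    gamma = alpha + len(doc_words) / float(K)      # standard initialisation
    for _ in range(n_iter):
        expo = np.exp(digamma(gamma) - digamma(gamma.sum()))
        phi = beta[:, doc_words].T * expo          # (9), up to normalisation
        phi /= phi.sum(axis=1, keepdims=True)
        gamma = alpha + phi.sum(axis=0)            # (10)
    return gamma, phi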


In the M-step, the learned variational parameters {γ^q} and {φ^q} are set as constant and the model parameters α^q and β^q are learned by maximising the lower bound of the log-likelihood of the whole corpus:

L(\{\gamma_t^q\}, \{\phi_t^q\}; \alpha^q, \beta^q) = \sum_{t=1}^{T} L(\gamma_t^q, \phi_t^q; \alpha^q, \beta^q).   (11)

The learned model parameter β^q specifies the probability of each word/regional event given each topic. It thus captures how different types of events are correlated within each region and represents the regional behaviour correlation context.

To compute regional temporal context, which corresponds to the temporal phase of behaviours occurring in each region, we perform document clustering (i.e. clip clustering) on the training video as follows. The Dirichlet parameter γ^q_t represents a document in the topic simplex and can thus be viewed as the topic profile of the q-th region in the t-th clip, i.e. how likely different topics are to be combined. Given the set of topic profiles {γ^q_t} for a regional corpus D^q consisting of T documents/clips in the q-th region, documents can be grouped into C^q categories h^q = {h^q_c}, where c = 1, ..., C^q. Such clustering can be readily performed by K-means. Consequently, regional behaviours occurring within each document (clip) are uniquely assigned a temporal phase using the class label of that document.

Stage II—Learning Global Context A single second-stage LDA, termed the global context LDA, is employed for learning global behaviour correlation and temporal context. For this LDA, a document is a collection of words corresponding to the temporal phases of the different semantic regions in the scene. Given a total number of C = \sum_{q=1}^{Q} C^q regional temporal phases classified by the first-stage LDAs in all Q regions of a scene, the vocabulary of words for generating documents in the second-stage LDA can be represented as:

H = [h_1^1, \ldots, h_{C^1}^1, \ldots, h_1^q, \ldots, h_{C^q}^q, \ldots, h_1^Q, \ldots, h_{C^Q}^Q],   (12)

and each element of H corresponds to the index of a regional temporal phase. A document d = {w_n}, where n = 1, ..., N_w, represents the occurrences of temporal phases in different scene regions at the same time. Thus any word w_n is sampled from H in (12).

The learning and inference processes of this global context LDA are identical to those of the regional context LDAs. Suppose K types of global correlation context are discovered from the corpus D = {d_t}, where t = 1, ..., T; the learned parameter β is then a K × C matrix representing the probabilities of occurrence of each of the C categories of regional temporal phases in the K topics of global correlations, corresponding to global behavioural context. Meanwhile, the second-stage LDA also infers the topic profiles {γ_t}, i.e. how global topics are mixed in each of the documents d. The topic profiles can then be employed to classify documents (clips) into a number of temporal phases corresponding to global behaviour temporal context.
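The mapping from regional phases to second-stage words is simple index arithmetic; the sketch below shows one way to do it, where offsets[q] is the cumulative count of phase labels of the preceding regions (an illustrative helper, not part of the paper).

def global_document(phase_of_region, offsets):
    """Second-stage document for one clip: the phase label of each of the
    Q regions, shifted into the shared vocabulary H of (12).
    offsets[q] = C^1 + ... + C^{q-1} (cumulative phase counts)."""
    return [offsets[q] + phase_of_region[q]
            for q in range(len(phase_of_region))]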

After training the cascaded topic model, the model can be applied to interpret behaviours captured in unseen videos. In particular, the model can be employed to infer a topic profile using either the regional LDAs or the global LDA. The former reveals what types of regional behaviour correlation exist; the latter indicates the existence of any global behaviour correlations. Moreover, the inferred regional and global temporal phases assigned to clips can be used for temporal segmentation of unseen videos by topics.

It is worth pointing out that although LDA is used for topic modelling in our formulation of a cascaded topic model, any alternative topic model could also be used. For instance, a widely adopted alternative is Probabilistic Latent Semantic Analysis (pLSA) (Hofmann 1999a, 1999b). In our experiments, the effectiveness of LDA and pLSA for context learning is compared (Sect. 7). It should also be noted that in our implementation of LDA, although the model input is a binary vector rather than a vector of word counts, we do not change the way words are sampled from topics. In other words, it is not enforced that only binary documents can be sampled from the learned model, even though the model is learned with only binary documents. We found empirically that changing the model input alone is sufficient to improve the model's robustness against input noise (see Sect. 7.6). Nevertheless, one could enforce a stricter binary word sampling process by modifying how words are sampled from the model, that is, sampling each word depending not only on the topic-word distribution but also on the previously sampled words in the same document.

6 Context-Aware Abnormal Behaviour Detection

Given the learned behavioural context, behavioural anomalies are detected as contextually incoherent behaviours that cannot be explained away or predicted using the learned behaviour context. Importantly, we also formulate in this section a novel method for not only detecting an anomaly in a video document (clip), but also locating which semantic region and which regional events caused the anomaly. In a wide-area scene featuring multiple objects appearing simultaneously and constantly (e.g. a public road junction or a train platform), it is critical to know both when and where an anomaly occurs. Existing topic models by nature sacrifice the specificity of location information about topics (and words) within each document, and are therefore incapable of performing such a task.

First, abnormal clips are detected. Given a test video consisting of non-overlapping clips as documents, we use the global context LDA to examine whether each clip (document) contains abnormal behaviours. This is achieved by computing the normalised log-likelihood of a clip given the trained LDA model. Specifically, the log-likelihood of observing an unseen clip d* is approximated by its maximised lower bound:

\log p(d^*) \approx L(\gamma^*, \phi^*; \alpha, \beta)   (13)

in which γ* is the Dirichlet parameter representing d* in the topic simplex and φ* gives the posterior probabilities of topics for the words occurring in d*. α and β are the model parameters learned in the second stage of LDA modelling. L(γ*, φ*; α, β) is computed by setting α and β as constants and updating γ* and φ* using (9) and (10) until L(γ*, φ*; α, β) is maximised. An anomaly score for the clip is then computed as:

S(d^*) = \frac{L(\gamma^*, \phi^*; \alpha, \beta)}{N_w^*},   (14)

where N*_w is the number of words in d*, which in our case corresponds to the number of regional temporal phases identified in the clip d*. Any clip with an anomaly score lower than a threshold TH_c is then flagged as containing abnormal behaviours. The value of TH_c is set according to application requirements on the acceptable balance between detection rate and false alarm rate.
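A minimal sketch of this scoring step. Here e_step is the E-step sketch from Sect. 5.2, while lower_bound (evaluating L(γ, φ; α, β) as in Blei et al. 2003), test_docs and the threshold TH_c are assumed to be provided; this is an outline rather than a complete implementation.

def anomaly_score(doc_words, alpha, beta):
    """Per-clip anomaly score of (14): the maximised lower bound of the
    clip log-likelihood under the global LDA, divided by its word count."""
    gamma, phi = e_step(doc_words, alpha, beta)
    return lower_bound(gamma, phi, doc_words, alpha, beta) / len(doc_words)

# Flag clips whose score drops below the application-chosen threshold.
abnormal_clips = [t for t, d in enumerate(test_docs)
                  if anomaly_score(d, alpha, beta) < TH_c]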

Second, once a clip d* is detected as abnormal, the regions, and subsequently the regional events, that triggered the anomaly are identified. This is achieved in a top-down manner through the cascade structure of our model. More specifically, the anomalous region is first detected using the global context LDA. The regional events within that region are then examined using the corresponding regional LDA to locate the contributing events. More details follow.

Recall that a clip/document d* = {w*_n}, n = 1, ..., N*_w, is represented using the inferred regional temporal phases in H (see (12)). To localise regions containing abnormal behaviours, regional temporal phases are evaluated against the learned global behaviour correlation context. This is carried out by computing log p(w*_n | d*, α, β). In this work, we approximate log p(w*_n | d*, α, β) by computing log p(w*_n | d*_{-n}, α, β), where d*_{-n} represents a document in which the word w*_n of d* has been removed. log p(w*_n | d*_{-n}, α, β) is the log-likelihood of w*_n co-occurring with all the other distinct words in d*, and can be obtained by computing the difference between the log-likelihoods of the original document d* and the document d*_{-n} as:

logp(w∗n|d∗−n,α,β)

= logp(w∗n, d

∗−n|α,β) − logp(d∗−n|α,β)

= logp(d∗|α,β) − logp(d∗−n|α,β). (15)

Because d∗−n is derived from d∗ by removing a single word, it is reasonable to assume that they have the same topic profile γ∗. This implies that for computing the log-likelihood for d∗−n, we can use the γ∗ learned from d∗. To compute the log-likelihoods of the original document d∗ and the document d∗−n (i.e. log p(d∗ | α, β) and log p(d∗−n | α, β) in (15)), the following two steps are taken:

• Step 1: Given α and β, compute L(γ∗, φ∗; α, β) to maximise the lower bound of log p(d∗ | α, β) through iteratively updating (9) and (10), and store the value of γ∗;

• Step 2: Given the document d∗−n, set γ∗ in (10) as constant and only update φ in (9), resulting in the maximised lower bound of log p(d∗−n | α, β) being computed as L(γ∗, φ∗−n; α, β), where each element in φ∗−n represents how likely each global topic is to be assigned to a word in d∗−n.

log p(w∗n | d∗−n, α, β) can finally be computed as:

log p(w∗n | d∗−n, α, β) ≈ L(γ∗, φ∗; α, β) − L(γ∗, φ∗−n; α, β). (16)

The regions with the lowest values of log p(w∗n | d∗−n, α, β) are considered to contain abnormal behaviours. In those regions, the above procedure can be further deployed to localise any abnormal regional events by computing:

log p(wq∗n | dq∗−n, αq, βq) ≈ L(γq∗, φq∗; αq, βq) − L(γq∗, φq∗−n; αq, βq) (17)

for all regional events wq∗n using the regional LDAs. The abnormal regional events are then detected as those giving the lowest log-likelihoods of co-occurring with all other events occurring in the same region of the same clip.
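The localisation procedure in (15)–(17) can likewise be sketched by continuing the scikit-learn stand-in above (reusing `lda` and `test_docs`): the contribution of each word is approximated by the drop in the document-level bound when that word is removed. Note one simplification relative to (15)–(16): scikit-learn re-infers the topic profile for d∗−n rather than reusing the γ∗ learned from d∗.

import numpy as np

def word_log_liks(lda, doc):
    # Approximate log p(w*_n | d*_-n, alpha, beta) for each word present
    # in `doc` as a difference of two document-level bounds, cf. (16).
    base = lda.score(doc[None, :])
    out = {}
    for n in np.flatnonzero(doc):
        d_minus = doc.copy()
        d_minus[n] = 0                     # remove word w*_n from the clip
        out[n] = base - lda.score(d_minus[None, :])
    return out

word_scores = word_log_liks(lda, test_docs[0])
# Words (regional temporal phases) with the lowest values point to the
# anomalous regions; repeating the same procedure with the corresponding
# regional LDA, as in (17), then localises the contributing regional events.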

7 Experiments

7.1 Datasets

Three datasets were employed in our experiments to evaluate the effectiveness of the proposed framework for learning behavioural context.3 All datasets were collected from real-world surveillance scenes featuring large numbers of objects exhibiting complex behaviours.

3The Junction and Roundabout datasets can be downloaded at http://www.eecs.qmul.ac.uk/~jianli/Dataset_List.html, where the anomaly annotation and evaluation protocol can also be found.


Fig. 4 Example frames from the Junction dataset

Fig. 5 Example frames from the Roundabout dataset

Junction Dataset This dataset contains videos of a busy urban road junction recorded at 25 Hz with a frame size of 360×288 pixels. The dataset consists of 34000 frames, among which 22000 frames were used for training and the remaining 12000 frames were used for testing. Example frames are shown in Fig. 4. This scene contains different types of objects, including vehicles and pedestrians, moving in different regions of the scene. The behaviour of an object is governed by both the traffic lights and the behaviours of other objects co-existing in the scene. Specifically, as illustrated in Fig. 4, there are two traffic phases: the vertical traffic phase and the horizontal traffic phase. During the horizontal phase, there are also two sub-phases corresponding to the leftward and rightward horizontal traffic respectively. To make things even more complicated, the vehicles waiting in the central waiting zone to turn horizontally (see Fig. 4(a)) can do so whenever there is a gap in the vertical flow.

Roundabout Dataset This dataset contains videos of a traffic roundabout recorded at 25 Hz with a frame size of 360×288 pixels. The training and test sets contain 45000 frames and 18000 frames respectively. Example frames are shown in Fig. 5. Vehicles in this scene usually enter the scene from the entrances near the left boundary and the bottom-right corner. They move towards the exits located at the top, at the bottom-left corner and near the right boundary. Similar to the junction scene, the roundabout traffic is controlled by multiple sets of traffic lights and can be roughly divided into two phases temporally. However, compared to a junction, the behaviours of vehicles in a roundabout are less regulated by traffic lights. Consequently, the two traffic phases are less distinctive visually (e.g. vehicles can leave the scene using the exits at the top during both traffic phases).

Fig. 6 Example frames from the MIT Traffic dataset


MIT Traffic Dataset (Wang et al. 2009) This dataset contains 1.5 hours of video with an image size of 720×480 pixels and a frame rate of 30 Hz. Figure 6 shows example frames. The video was cut into 540 non-overlapping clips, each 10 seconds long. This dataset was used to compare the performance of our model to that of Wang et al. (2009) (see Sect. 7.6).

7.2 Learning Behaviour Spatial Context

Scene Event Detection In the Junction Dataset, 121583 clip-wise scene events were detected. After being smoothed within each clip (see Sect. 3), they were automatically grouped into 13 clusters, each of which represents one scene event class. In the Roundabout Dataset, 440607 scene events were detected, which led to 19 scene event classes. The detected scene event classes in both datasets are illustrated in Fig. 7. It can be seen that different scene events correspond semantically to objects in different regions of the scene performing different behaviours (e.g. moving towards certain directions at certain speeds).

Semantic Scene Segmentation Following the procedure described in Sect. 4, each scene was decomposed into semantic regions for learning behaviour spatial context. Specifically, the behaviour-footprint images for all three scenes were resized to a similar size.4 The radius in (4) for computing the scaling factors of the affinity matrices was set to r = 10. Our spectral clustering algorithm automatically decomposed the Junction Scene into 6 regions and the Roundabout Scene into 9 regions. Figures 8(a) and 9(a) show that semantically meaningful regions are obtained using our method. In particular, the decomposed regions correspond to various traffic lanes and waiting zones where object behaviours tend to follow a similar pattern. Our proposed spectral clustering algorithm was compared with the original Zelnik-Perona (ZP) method. The results in Figs. 8 and 9 indicate clearly that for both scenes, the original ZP method suffers severely from under-fitting and was not able to decompose either scene correctly according to the expected traffic behaviour patterns. These results suggest that setting the scaling factors (see (4)) properly is crucial for meaningful scene segmentation using spectral clustering.

4After resizing, the sizes of the Junction and Roundabout scenes are 180×144 pixels and the size of the MIT scene is 180×120 pixels.


Fig. 7 (Color online) The detected scene event classes. Each class is represented using an ellipse. The centre of each ellipse corresponds to the mean position of all scene events belonging to that class. The orientation of an ellipse is determined by the angle between the major and minor axes of the spatial distribution of events. The mean speed and orientation of the events in each class are depicted using a red arrow originating from the centre of the ellipse

Fig. 8 Semantic scene segmentation for the Junction Dataset

Fig. 9 Semantic scene segmentation for the Roundabout Dataset


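For illustration, the following is a hedged sketch of this segmentation step under stated assumptions: behaviour-footprints (random toy vectors here) are compared with a Gaussian affinity whose scaling factors are set locally per point, in the spirit of (4), and the resulting affinity matrix is fed to off-the-shelf spectral clustering. The k-nearest-neighbour proxy for the radius r and the fixed number of clusters are simplifications, since our method selects the number of regions automatically.

import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
footprints = rng.random((500, 13))         # toy per-pixel event histograms

# Pairwise distances between behaviour-footprints.
D = np.linalg.norm(footprints[:, None, :] - footprints[None, :, :], axis=2)

# Local scaling factor per point: mean distance to its nearest neighbours
# (a k-NN proxy for the radius-based scaling with r = 10 used in the paper).
k = 10
sigma = np.sort(D, axis=1)[:, 1:k + 1].mean(axis=1) + 1e-8
A = np.exp(-D**2 / (sigma[:, None] * sigma[None, :]))

labels = SpectralClustering(n_clusters=6, affinity='precomputed',
                            random_state=0).fit_predict(A)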

7.3 Learning Behaviour Correlation Context

Regional Event Detection With the learned behaviour spatial context, behaviours occurring within each semantic region are represented using regional events (Sect. 5.1). The automatically selected features for each region of both scenes are shown in Tables 1 and 2. It is evident from the tables that different features were selected for different regions. For example, Table 2 shows that motion features were selected as informative features for Region 7 but not for Region 4. This reflects the fact that in the Roundabout Dataset (see Fig. 9), Region 4 contains predominantly static or slow moving objects, whilst Region 7 contains mostly objects moving rapidly in different directions. Using the selected features, 30 classes of regional events were detected automatically for the Junction Dataset and 52 for the Roundabout Dataset. These regional event classes for the two datasets are shown in Figs. 10 and 11 respectively. Compared with the detected scene events (see Fig. 7), these regional events detected using the learned behaviour spatial context are more fine-grained and also more meaningful. For example, in Region 1 of the Junction scene, two classes of regional events were detected at the bottom of this region (see the blue and cyan ellipses in Fig. 10(a)). They correspond to (1) pedestrians waiting on the crossroad island and (2) pedestrians crossing the road respectively. These two classes of events are important for understanding the behaviours of objects in the region (e.g. when pedestrians are crossing, vehicles are not supposed to be present in the region) and of those in the whole scene (e.g. pedestrians waiting on the crossroad island indicate the vertical traffic flow phase). Both events are missing from the detected scene event classes (see Fig. 7(a)). This is because behaviour spatial context is not utilised in detecting scene events.

Table 1 Representative features selected for the Junction Dataset. [Table: for each of Regions 1–6, five of the candidate features x, y, w, h, rs, rp, u, v, ru and rv are marked as selected.]

Table 2 Representative features selected for the Roundabout Dataset. [Table: for each region, five of the candidate features x, y, w, h, rs, rp, u, v, ru and rv are marked as selected.]

Fig. 10 The detected 30 regional event classes in each region of the Junction Scene


Fig. 11 The detected 52 regional event classes in each region of the Roundabout Scene

Multi-scale Behaviour Correlation Learning A cascaded LDA model (denoted as Cas-LDA) was trained for each dataset for learning multi-scale behaviour correlation context. Specifically, the detected regional events from each region were used for training a first-stage LDA for regional behaviour correlation modelling, with the learned hidden topics representing the local behaviour correlation context. The regional LDAs then provide inputs to a second-stage LDA for modelling global behaviour correlation. The topics learned using this LDA then correspond to the global behaviour correlation context.
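As a rough illustration only (not our implementation), the two-stage cascade can be sketched as follows, with scikit-learn's LDA standing in for both stages: one LDA per region is fitted on binary regional-event documents, the dominant regional topic per clip is taken as that region's temporal phase, and a second LDA is fitted on the resulting (region, phase) indicator documents. Taking the argmax of the regional topic profile is a simplification of the phase inference in (12).

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def fit_cascade(regional_docs, n_regional_topics=4, n_global_topics=2):
    # regional_docs[q]: clips x regional-event-classes binary matrix.
    n_clips = regional_docs[0].shape[0]
    regional_ldas, blocks = [], []
    for X_q in regional_docs:
        lda_q = LatentDirichletAllocation(n_components=n_regional_topics,
                                          random_state=0).fit(X_q)
        regional_ldas.append(lda_q)
        # Dominant regional topic per clip = that clip's regional temporal
        # phase, used as a word of the global (second-stage) LDA.
        phase = lda_q.transform(X_q).argmax(axis=1)
        cols = np.zeros((n_clips, n_regional_topics), int)
        cols[np.arange(n_clips), phase] = 1
        blocks.append(cols)
    G = np.hstack(blocks)                  # global binary documents
    global_lda = LatentDirichletAllocation(n_components=n_global_topics,
                                           random_state=0).fit(G)
    return regional_ldas, global_lda, G

rng = np.random.default_rng(0)             # toy data: 6 regions, 73 clips
regional_ldas, global_lda, G = fit_cascade(
    [(rng.random((73, 5)) < 0.4).astype(int) for _ in range(6)])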

Let us first look at the results obtained from the Junction Dataset. The training data was first temporally segmented into non-overlapping clips of equal length (300 frames), resulting in 73 clips/documents for the first-stage LDA training. It was found that varying the numbers of topics for the local LDAs has little effect on the global correlation learning in the second stage. In this experiment, we set the numbers of topics for different regions to be equal and searched for the optimal number that gives the best global temporal phase inference performance using cross-validation (see Sects. 5.2 and 7.4). Subsequently the number of topics was automatically set to 4 for all 6 regions in the scene. The learned topics for each region, which correspond to different types of regional behaviour correlations, are illustrated in Fig. 12. As can be seen in Fig. 12, each learned topic, or regional behaviour correlation context, captures one type of commonly observed concurrent object behaviours under one specific traffic phase. For instance, in Region 1, both Topics 1 and 4 correspond to the vertical traffic flow in that region, whilst Topic 2 represents the leftward horizontal flow with moving pedestrians and Topic 3 captures the rightward horizontal flow with pedestrians waiting to cross. For the second-stage LDA, the number of topics was set to 2, corresponding to the two global traffic phases. The learned topics are shown in Fig. 13. It is evident that Topic 1 (Fig. 13(a)) represents the concurrent behaviours of various objects in different regions during the vertical traffic phase, whilst Topic 2 (Fig. 13(b)) corresponds to those during the horizontal phase.

For the Roundabout Dataset, there were 146 clips obtained after segmenting the training data into non-overlapping clips. The number of topics for the first-stage LDAs was determined as 2, and the topics learned using the first-stage and second-stage LDAs are illustrated in Figs. 14 and 15 respectively. It can be seen in Fig. 14 that the topics learned using the first-stage LDAs reveal clearly the typical concurrent object behaviours in each region. In particular, it is noted that in regions where the traffic phase has a significant influence on behaviour, the two topics correspond to the two traffic phases. For instance, in Region 6, the two topics capture vehicles waiting during the horizontal traffic phase and vehicles moving upwards during the vertical phase respectively. In Region 7, the regional LDA reveals how vehicles typically move in different directions during the two traffic phases. In contrast, in Regions 2 and 8, the two topics are alike since these regions are less affected by the traffic phases. Similar to the Junction Dataset, the two topics learned using the second-stage LDA discover correlations of object behaviours across different regions during the two traffic phases (see Fig. 15).


Fig. 12 The learned regional topics for each region in the Junction Scene using the Cas-LDA model. Each topic is illustrated using the top two dominant/likely words/regional event classes

Cas-LDA vs. LDA To evaluate the effectiveness of the proposed two-stage cascaded model structure, our Cas-LDA model was compared with a standard LDA using detected scene events as input. The latter has a single learning stage and does not rely on scene segmentation for computing the model input. The topics inferred using the single-stage LDA for the two datasets are shown in Figs. 13 and 15 respectively. It is evident that the topics learned using a standard single-stage LDA capture less meaningful correlations of object behaviours compared to our Cas-LDA. It can also be observed from the comparative results that without exploiting behaviour spatial context, the visual words (scene events) alone fail to capture the subtle differences between object behaviours in different regions. Consequently, the correlations learned using the standard LDA are less meaningful and more difficult to interpret, indicating a weakened capability for modelling complex behaviour correlations.

Cas-LDA vs. Markov Clustering Topic Model (MCTM) We further compare Cas-LDA with the Markov Clustering Topic Model (MCTM) by Hospedales et al. (2009), which has a single hierarchical model structure as opposed to our cascaded structure, and is a hybrid of PTM and DBN. In Hospedales et al. (2009), the model input to the MCTM is local motion events/words computed from optical flow vectors at each regular grid location. The computational cost of learning an MCTM is extremely high given the complex model structure and the sheer number of words computed from each clip. Here, for fair comparison, our implementation of MCTM directly uses detected scene events as input, i.e. the same input as our Cas-LDA. An MCTM groups events/words into actions/topics, and documents into behaviours, which correspond to the temporal phases learned by our temporal context learning model. We set the number of topics to eight and learned two behaviours, each of which corresponds to a traffic phase. The behaviours learned using MCTM for the Junction scene and Roundabout scene are shown in Figs. 13 and 15, respectively. The results show that with a hierarchical model structure, the MCTM is not able to learn a set of interpretable behaviour correlations for different temporal phases, even though it models the temporal ordering of behaviour temporal phases explicitly.


Fig. 13 Learning global behaviour correlation context in the Junction Scene. (a)–(b): The learned topics using the second-stage LDA of the cascade, which correspond to the object behaviour correlations during the two traffic phases. Each topic is illustrated using the top two most dominant/likely words/regional event classes in each region. (c)–(d): The topics learned using a single-stage LDA without the cascade structure. The top 6 most dominant/likely scene event classes are shown for each topic. (e)–(f): The behaviour temporal phases learned using a Markov Clustering Topic Model (MCTM)



Fig. 14 The learned regional topics for each region in the Roundabout Scene using the Cas-LDA model. Each topic is illustrated using the top three dominant (most likely) words/regional event classes

7.4 Learning Behaviour Temporal Context

Temporal Context Learning for Video Segmentation As described in Sect. 5, given a learned Cas-LDA model, each video document (clip) was represented by the inferred topic profile indicating how likely each topic is to be present in that clip. The topic profiles were then used for classifying documents (clips) into different types of temporal context. Learning temporal context also provides a mechanism for segmenting an unseen video into different temporal phases. In particular, for both the Junction and Roundabout datasets, there are two types of temporal visual context corresponding to the vertical and horizontal traffic flow phases respectively. To evaluate the temporal context learning, we generated ground truth by exhaustively labelling all the clips in the testing video frames from both datasets into two traffic phases manually. Table 3 shows that accuracies of 87.2% and 74.6% were obtained on the two datasets respectively from applying our model for inferring and predicting the traffic phase. These results show that our Cas-LDA model is able to learn the temporal context in a meaningful way. It is noted that lower segmentation accuracy was obtained for the Roundabout Scene due to the more complex and less regulated object behaviours controlled by the traffic lights.


Fig. 15 Learning global behaviour correlation context in the Roundabout Scene. (a)–(b): The learned topics using the second-stage LDA of the cascade, which correspond to the object behaviour correlations during the two traffic phases. Each topic is illustrated using the top three most dominant/likely words/regional event classes in each region. (c)–(d): The topics learned using a single-stage LDA without the cascade structure. The top 10 most dominant/likely scene event classes are shown for each topic. (e)–(f): The behaviour temporal phases learned using a Markov Clustering Topic Model (MCTM)

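Continuing the hypothetical cascade sketch above (reusing `global_lda` and `G`), this segmentation step amounts to inferring each clip's global topic profile and clustering the profiles into the two phases; the k-means step is an illustrative stand-in for the clustering used in our framework.

from sklearn.cluster import KMeans

profiles = global_lda.transform(G)         # topic profile per clip (Sect. 5)
phase_labels = KMeans(n_clusters=2, n_init=10,
                      random_state=0).fit_predict(profiles)
# Each cluster is then matched to a ground-truth traffic phase to measure
# the temporal segmentation accuracy reported in Table 3.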


Fig. 16 Examples of abnormal behaviours

Table 3 Temporal video segmentation accuracy using different models

        MOHMM    MCTM     pLSA     Cas-pLSA   LDA      Cas-LDA
Jun.    56.4%    51.79%   56.4%    89.7%      61.5%    87.2%
Rou.    66.1%    68.36%   52.5%    76.2%      55.9%    74.6%

Comparing with Alternative Models The proposed Cas-LDA model was compared with a host of alternative topic models for learning behavioural context, including a single-stage LDA (Blei et al. 2003), a single-stage pLSA (Hofmann 1999a, 1999b), Cas-pLSA (Li et al. 2008b), a Dynamic Bayesian Network (DBN) in the form of a Multi-Observation Hidden Markov Model (MOHMM) (Xiang and Gong 2008) and a Markov Clustering Topic Model (MCTM) (Hospedales et al. 2009). The single-stage pLSA and Cas-pLSA use the same model input and similar learning and inference algorithms, but differ from the single-stage LDA and Cas-LDA in topic model formulation. For the MOHMM, detected regional events were used as observations and the inferred hidden states were used for video segmentation. This means that the inputs to the MOHMM are obtained based on the learned spatial context information. The results in Table 3 show that the cascaded model structure makes a huge difference for both LDA and pLSA, with a relative increase of around 50% in performance compared with the corresponding single-stage topic models. Table 3 also shows that the MOHMM model generated poor results despite having already benefited from the learned spatial context by using regional events as input. The MCTM is capable of grouping documents/clips into temporal phases and modelling the temporal ordering of different phases explicitly. It is thus in theory well-suited for learning temporal context. However, our results suggest that the temporal phases learned using an MCTM are less meaningful compared with those of Cas-LDA (see Figs. 13 and 15). Consequently, less accurate temporal segmentation of video is achieved. Furthermore, we observed that using the two different topic models in our cascaded structure seems to make little difference, with Cas-pLSA achieving slightly better performance than Cas-LDA.

7.5 Context-Aware Anomaly Detection

Anomalies as Contextually Incoherent Behaviours To evaluate the proposed approach for abnormal behaviour detection (Sect. 6), every clip in the testing datasets was labelled as either normal or abnormal by careful manual examination. This led to 8 out of 39 clips and 6 out of 59 clips being identified as abnormal for the Junction and Roundabout Datasets respectively. The anomalies in the testing part of the Junction Dataset fall into two categories: emergency vehicles such as fire engines and ambulances interrupting the normal traffic flow, or dangerous leftward or rightward turning during the vertical traffic flow phase. For the Roundabout Dataset, all abnormal clips contain vehicles jumping a red light. It is important to observe that in all anomalies, the behaviours of the individual objects alone exhibit little, if any, visual evidence of being abnormal. They all look normal in isolation. The abnormalities were caused by behaviours taking place in the wrong place at the wrong time, resulting in abnormal correlations with other objects in the scene, i.e. being contextually incoherent. Some examples of anomalies are shown in Fig. 16.


Fig. 17 (Color online) Comparing anomaly detection performance of different models using ROC curves

Table 4 Comparing different models on anomaly detection using Area under ROC (AUROC) values

        MOHMM    MCTM     pLSA     Cas-pLSA   LDA      Cas-LDA
Jun.    0.6351   0.3911   0.4355   0.8024     0.3871   0.8589
Rou.    0.6730   0.5579   0.6431   0.7154     0.6761   0.7374

Comparing with Alternative Models We compared the performance of our Cas-LDA model with the single-stage LDA and pLSA, Cas-pLSA, MOHMM and MCTM. Specifically, we varied the threshold THc (see Sect. 6) for detection and obtained Receiver Operating Characteristic (ROC) curves, from which the Area Under ROC (AUROC) values were computed. The results are shown in Fig. 17 and Table 4. Similar to the traffic phase prediction and segmentation results reported earlier, both cascaded topic models achieved significantly better performance compared to their single-stage counterparts. This again highlights the importance of utilising behavioural context and a cascaded model structure. Our cascaded topic models also outperform the MOHMM based DBN model and the MCTM. Again, the performances of Cas-pLSA and Cas-LDA are similar, but this time with Cas-LDA yielding the better detection result.
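For reference, such ROC curves can be produced from the clip scores of (14) in a few lines; a sketch reusing `anomaly_scores` from the clip-level example in Sect. 6, with hypothetical binary ground-truth labels `y_true` (1 = abnormal). Since lower scores indicate anomalies, the scores are negated to match scikit-learn's convention.

import numpy as np
from sklearn.metrics import roc_curve, auc

y_true = np.zeros(39, dtype=int)           # toy labels for 39 test clips
y_true[[3, 8, 15, 20, 22, 30, 33, 37]] = 1

fpr, tpr, thresholds = roc_curve(y_true, -anomaly_scores)
auroc = auc(fpr, tpr)                      # area under the ROC curve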

Locating Contributing Events in an Anomaly Using the method formulated in Sect. 6 (see (16) and (17)), we were able to locate the regional events that caused those clips to be detected as abnormal. Figure 18 shows examples where the proposed method identified and located accurately, among dozens of objects present in the scene, those specific object behaviours that were most abnormal. Specifically, the concurrent correlations of these objects are at odds with the expected behaviour correlation and temporal context. For example, this could mean that the objects are moving on a collision course with other objects (see Figs. 18(a) and (d)). This is a very useful feature with which an automated system can pinpoint an abnormality both in space and over time in a complex scene involving multiple objects concurrently, aiding rapid human response.

7.6 Further Comparisons with Alternative Approaches

Comparing with Tracking-Based Approach Our approach is based on a discrete event-based representation for behaviour modelling. This is different from most existing approaches, which rely on tracking. To highlight the inadequacy of a tracking-based representation for behaviour modelling in crowded scenes, in Fig. 19(a) we show the trajectories extracted from a two-minute video clip from the Junction Dataset. In Fig. 19(b), we plot a histogram of the durations of all the tracked object trajectories (red), 331 in total, and compare it to that of the ground truth (blue), which was exhaustively labelled manually for all the objects appearing in the clip (114 objects in total). It is evident that the inevitable and significant fragmentation of object trajectories makes a purely trajectory-based representation unsuitable for accurate behaviour representation and subsequent analysis in this type of scene. Moreover, it is equally important to point out that monitoring objects in isolation, even over a prolonged period of time through tracking, does not necessarily facilitate temporal segmentation such as traffic phase inference and prediction, or anomaly detection in the context of regulated traffic flow.

Comparing with Hierarchical Topic Models For correlation modelling in a complex scene, Wang et al. (2009) proposed to use Dual Hierarchical Dirichlet Processes (Dual-HDP). In the bottom layer of a Dual-HDP model, concurrent quantised motion information (visual words) is grouped into atomic activities, whilst in the top layer, these activities are grouped into interactions if they co-occur. Compared to our cascaded topic model, the interactions discovered using the Dual-HDP correspond to the global correlation and temporal context discovered using the second-stage LDA.


Fig. 18 (Color online) The contributing events are identified and highlighted using red bounding boxes. (a)–(b): events identified for the fire engine anomaly in the Junction Dataset. (c)–(d): events identified for the red-light jumping anomaly in the Roundabout Dataset

Fig. 19 (Color online) Failure of tracking in the Junction scenario

We carried out experiments to compare the Dual-HDP model with our Cas-LDA model for learning the behavioural context of complex scenes. Since the implementation of the Dual-HDP model is non-trivial and unavailable to us, we were not able to evaluate it on the Junction and Roundabout datasets. Instead, we applied our Cas-LDA to the MIT Traffic Dataset used by Wang et al. (2009) (see Fig. 6) and compared results with those reported in their paper. To ensure that any difference in performance is caused only by the model used rather than the behaviour representation, the same low-level motion feature based representation as adopted by Wang et al. (2009) was used.

Figure 20 shows the 9 semantic regions discovered by our spatial context learning. The five types of interactions, i.e. temporal phases, learned using the second-stage LDA are illustrated using concurrent traffic flows in Figs. 21(a) and (b). For direct comparison, the global correlations discovered using Dual-HDP are shown in Fig. 21(c). These results were reproduced from Fig. 11(b) in Wang et al. (2009).

It is evident from Fig. 21 that more meaningful behaviour correlations are discovered using our Cas-LDA model. In particular, it can be seen from Figs. 21(b) and (c) that Phases 2 and 4 of Cas-LDA are almost identical to Phases 4 and 5 of Dual-HDP respectively. But the other 3 phases are very different. Specifically, Phase 1 of Dual-HDP suggests that flows a and e occur simultaneously. In reality, since the flows are on a collision course, they can only co-exist in a same clip if there are gaps in either a or e, which is rare in the video used in the experiment. What was taking place much more frequently in the video is the co-occurrence of flows a and b, and e and b, which have been captured by Phases 1 and 5 respectively using our Cas-LDA. Phase 2 of Dual-HDP indicates that typically flow b co-occurs with pedestrians crossing horizontally (flows j and m) but not with other traffic. Again this happens very rarely in the video because when flow b is taking place, pedestrians cannot cross at the bottom of the image (flow m). Phase 3 of Dual-HDP corresponds to pedestrians crossing horizontally (flows k and l) with not much vehicle traffic. However, it is perfectly normal to have horizontal vehicle traffic co-existing with pedestrian crossing, as indicated by Phases 4 and 5 of Dual-HDP. Compared with Dual-HDP, our Cas-LDA discovered an important correlation missed by Dual-HDP (Phase 3 in Fig. 21(b)), that is, horizontal right-to-left traffic performing a left turn (flow c) without the co-existence of left-to-right moving traffic (flow d).

Fig. 20 Semantic scene segmentation for the traffic scene used by Wang et al. (2009)



Fig. 21 (Color online) Comparing Cas-LDA with Dual-HDP for global correlation modelling. In (a), different colours represent different motion directions: red →, green ↑, blue ←, yellow ↓. In (b) and (c) each flow is labelled and has a different colour, with the arrow indicating the flow direction

Table 5 Accuracy of temporal segmentation using Cas-LDA

Ground truth   Cas-LDA labels
               Phase 1   Phase 2   Phase 3   Phase 4   Phase 5
Phase 1        78        0         1         15        7
Phase 2        1         89        3         2         1
Phase 3        0         0         61        4         0
Phase 4        0         8         2         109       0
Phase 5        12        0         3         11        133


The 540 clips in the MIT Traffic dataset were manually and exhaustively labelled into 5 temporal phases according to the discovered 5 correlations. The confusion matrix between the segmentation result and the ground truth is shown in Table 5. The average segmentation accuracy is 87.04%. This is similar to the 85.74% result obtained from Dual-HDP reported by Wang et al. (2009), although as explained above the discovered phases have different meanings. On computational cost, Dual-HDP model training takes 12 hours to process the 1.5 hours of video data, as reported by Wang et al. (2009). In comparison, our Cas-LDA is computationally much more efficient, requiring only 25 minutes on a 2.5 GHz PC platform.

We computed the value of the anomaly score using (14) and the top 5 most abnormal clips are shown in Fig. 22. Among the top 5 video clips, we detected one illegal vehicle U-turn (Fig. 22(c)), one pedestrian crossing against the traffic light (Fig. 22(b)), two vehicles turning left with pedestrians crossing in close proximity (Figs. 22(a) and (e)), which are legal but dangerous, and one pedestrian crossing outside the crosswalk (Fig. 22(d)). Compared to the top 5 detected abnormal clips (Fig. 15 from Wang et al. 2009), our model seems to be more sensitive to abnormal behaviour caused by abnormal correlations of multiple objects rather than the abnormal motion patterns of each single object. For instance, our model picked up a number of instances of vehicles and pedestrians moving on a collision course, whilst the Dual-HDP detected a few clips where bicycles or pedestrians cross outside the lanes/crosswalks, for which correlation modelling is not essential. In other words, these experiments demonstrate that our Cas-LDA model is more sensitive to those anomalies where each individual object behaves normally but collectively an object's behaviour is deemed abnormal, due to our model's ability to detect contextually incoherent behaviours.


Fig. 22 (Color online) The top 5 abnormal clips detected using our Cas-LDA model. The contributing words (object motions) are highlighted in red

Table 6 Comparing our LDA formulation with the standard LDA formulation for video segmentation

        Cas-LDA with binary rep.   Cas-LDA with count-based rep.
Jun.    87.2%                      87.2%
Rou.    74.6%                      71.2%

Table 7 Area under ROC (AUROC) values for anomaly detection using our LDA formulation and the standard LDA formulation

        Cas-LDA with binary rep.   Cas-LDA with count-based rep.
Jun.    0.8589                     0.4919
Rou.    0.7374                     0.4796


Comparing with the Standard Topic Model Formulation As described in Sect. 5.2, the LDA models used in both stages of our Cas-LDA have an important difference from the standard topic model formulation (Hofmann 1999a; Blei et al. 2003). That is, each video document (clip) is represented by whether each event class occurs in that clip (a binary value), rather than by the counts of its occurrences. Tables 6 and 7 compare the effectiveness of these two different representations for video temporal segmentation and anomaly detection respectively. It is evident from these results that our binary document (clip) representation is superior to the standard count-based representation, especially for the task of anomaly detection.
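The difference between the two representations amounts to a one-line change in how documents are built; a toy sketch, with hypothetical per-clip event occurrence counts `event_counts`:

import numpy as np

event_counts = np.array([[3, 0, 1, 7],     # occurrences of each event class
                         [0, 2, 0, 1]])    # per clip (toy values)
count_docs = event_counts                  # standard count-based documents
binary_docs = (event_counts > 0).astype(int)  # our binary documents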

7.7 Discussions

The Importance of Utilising Context for Behaviour Understanding Our extensive experimental results have demonstrated compellingly that learning behavioural context is hugely beneficial and can be crucial for understanding behaviours in complex dynamic scenes. This is because the proposed behavioural context learning framework provides an effective means of decomposing a complex multi-object dynamic scene according to the spatial, correlation and temporal distributions of object behaviours. Rather than building a single model based on a single representation, the discovered spatial, correlation and temporal context enables us to adapt different representations for different behaviours, and facilitates not only grouping, at multiple scales, different correlated behaviours for analysis, but also embedding temporal constraints for interpreting behaviours. As a result, a better understanding of multi-object behaviour can be achieved, leading to high sensitivity, particularly to subtle anomalies, with improved robustness to noise at the same time.

Probabilistic Topic Models (PTM) vs. Dynamic Bayesian Networks (DBN) Our results suggest that PTMs outperform DBNs for both video segmentation and anomaly detection. One would have thought that a DBN is more suitable for dynamic scene understanding as temporal order information is utilised, which should be particularly useful for temporal video segmentation. However, one clear drawback of DBNs compared with Bag-of-Words based PTMs is that they are more sensitive to noise, which explains the inferior performance of DBNs in our experiments.

Cascaded PTM vs. Hierarchical PTM The results in Sect. 7.6 show that a cascaded PTM outperforms a hierarchical PTM such as the Dual-HDP model and the MCTM for behaviour context modelling. Theoretically a hierarchical model may be more attractive as it allows for the clustering of words or atomic activities simultaneously in order to help the discovery of correlations among activities (Wang et al. 2009).


However, in practice, such a model is not always better and can suffer from a number of drawbacks: (1) A single hierarchical model needs more parameters to describe, and thus has poorer scalability and tractability and higher computational cost. As a result it is often less applicable to larger and more complex problems (e.g. the multi-camera behaviour modelling problem). This drawback is evident from our experiments reported in Sect. 7.6, where our model is compared with the Dual-HDP model of Wang et al. (2009). (2) The model inputs computed from real-world video data inevitably contain noise and errors. These errors can be propagated from the bottom to different layers of a hierarchical model with unpredictable effects. (3) In a hierarchical model the numbers of inputs in different layers from bottom to top are extremely imbalanced. For example, for both the DDP-HMM in Kuettel et al. (2010) and the MCTM in Hospedales et al. (2009), the bottom layer models video words, which are local motion events in the order of thousands per clip. The number of actions/topics in the layer above is in the order of dozens, whilst the number of temporal phases in the top layer is only a handful. This imbalanced modelling structure causes problems in detecting abnormalities because those occurring at upper layers can easily be overwhelmed by those in the bottom layer, and become undetectable. This is reflected in the results reported in Hospedales et al. (2009). For fair comparison, in our experiments we used the same inputs for the MCTM as for our cascaded LDA model, i.e. scene events. Even with this much reduced number of words, the results obtained using the MCTM are inferior to those of the cascaded LDA (see Sects. 7.4 and 7.5). (4) A cascaded model enables us to easily incorporate behaviour spatial context information, which has been shown to be critical for understanding complex behaviour. On the contrary, integrating spatial context into an already complex hierarchical PTM would be very difficult, both for designing inference and learning algorithms, and for maintaining tractability.

Cascaded LDA vs. Cascaded pLSA As pointed out by Blei et al. (2003), the advantage of LDA over pLSA is that LDA models the prior distribution of topics over documents, and would thus fare better in generalising to unseen documents and also overcome the over-fitting problem of pLSA. Our results indicate that LDA models indeed have better generative power, which is beneficial for anomaly detection, but not for temporal context learning and video temporal segmentation. Similar observations were made for action recognition by Niebles et al. (2008). This suggests that when selecting topic models for the proposed cascaded model structure, one needs to balance the generative and discriminative power of a model according to the nature of the visual task. For instance, for robust anomaly detection, the model must be able to generalise well, as normal behaviours can be executed in many variations which should not be confused with anomalies.

Binary Document Representation vs. Accumulative Count-Based Representation Our results show that a binary document representation is advantageous over a count-based representation for topic model based visual behaviour understanding. This is due to two reasons: (1) for visual behaviour understanding, whether one type of behaviour has happened is semantically more important than how many times it has taken place. (2) More importantly, differing from text analysis, where document representation is mostly free from noise and errors, 'clean' visual data for representation is mostly implausible due to image noise and visual ambiguities. Word counts in a document are far more susceptible to noise than a binary representation. This explains why the performance of anomaly detection is improved drastically when the binary representation is adopted (see Table 7). Note that, more recently, the Indian Buffet Process (IBP) (Griffiths and Ghahramani 2005) has been introduced to model binary representations of documents. Extending our existing LDA modelling using the IBP can be considered.

Relationships Between Different Types of Behaviour Context The three types of behaviour context studied in this work are closely related. Specifically, the learning of spatial context enables the efficient and effective learning of correlation and temporal context at different scales. Behaviour correlation context and temporal context are related in the sense that with different temporal context, i.e. in different temporal phases of a global behaviour, local behaviours are correlated in different ways, leading to different correlation context. These two types of context also have important differences. More specifically, correlation context specifies how different local behaviours in a visual scene are correlated with each other. This is discovered in our framework by learning the co-occurrence of scene events. Temporal context, on the other hand, corresponds to different temporal phases of a global behaviour. The inferred correlation context can thus be used as input to learn temporal context. This is precisely what we propose to do in our framework, that is, the topic profile inferred using an LDA is used as input to a clustering algorithm to discover temporal context.

Temporal Ordering Information Modelling Although the proposed cascaded topic model does not model the temporal order of words, topics, or documents explicitly, the correlation context learned using the model can be utilised to detect abnormal behaviours caused by abnormal temporal order. In particular, we have demonstrated that the model can detect anomalies such as an emergency vehicle interrupting normal traffic flow (see Sect. 7.5).


These anomalies are caused by an abnormal temporal order of events (e.g. horizontal traffic should follow vertical traffic). However, this abnormal temporal order also results in abnormal co-occurrences of events (the co-occurrence of horizontal and vertical traffic). Therefore, these anomalies can also be detected using our model. One can consider that temporal ordering information is captured implicitly by our model. Importantly, this implicit modelling of temporal ordering information makes the model more robust against noise in the model inputs. This is validated by our experiments comparing our model with models that explicitly model temporal ordering information, including the MOHMM and the MCTM (see Sect. 7.5).

Computational Cost The computational cost of a PTM has two components, one in the learning stage for parameter estimation and the other in the testing stage for inference of latent variables. Compared to a more complex PTM such as the Dual-HDP or HDP mixture model of Wang et al. (2009) and the MCTM of Hospedales et al. (2009), the learning cost of the model proposed here is much lower. Note that this computational cost also depends on the learning algorithm, so it is difficult to give an analytic form. In particular, there are two categories of learning algorithms: variational Bayesian, which is used in our work, and Gibbs sampling. The former is in general considered to be more efficient than the latter, although this also depends on the settings (e.g. how many samples and sweeps are used in Gibbs sampling). The testing stage is a different matter. As shown in Hospedales et al. (2009), one can develop an efficient Gibbs sampling method based on an approximation to online Bayesian inference. This can make a more complicated model such as the MCTM run in real time during testing.

The Flexibility of Our Approach It is worth pointing out that the proposed approach is flexible in many ways: (1) Behaviour representation—although a discrete event-based representation is adopted in this work, any bag-of-words representation can be used (e.g. for the MIT Traffic Dataset, a low-level motion feature based representation was used in Sect. 7.6). (2) Behaviour-footprint—in the current work, a behaviour-footprint is computed as an event histogram for each pixel location (a short sketch follows this paragraph). Different ways of computing the behaviour-footprint can be considered. Note that this histogram-based representation ignores the temporal order of events; it is thus limited in describing behaviour spatial context. However, it was found in Sect. 7.2 that this representation is sufficient for segmenting the scenes experimented with in this paper. This is because the different semantic regions in those scenes feature different events, or events of different frequencies of occurrence. In a more complicated dynamic scene there could be regions that differ from each other only in the temporal order of event occurrence. One could then compute the behaviour-footprint as a time series of event occurrences, at an extra computational cost, and measure the similarity between footprints based on temporal order. (3) Topic model selection—two topic models, LDA and pLSA, are considered in this paper. However, any topic model can be adopted. One could also employ different topic models at different stages of the cascade. (4) The number of stages in a cascade—in this work, a two-stage cascaded topic model is formulated. However, one could readily include more stages. For instance, for learning correlation context globally across multiple camera views decomposed into semantic regions, a third stage can be added for capturing camera-view-level correlations. Moreover, with more complex behaviour patterns involving multiple objects from multiple views requiring more stages, the advantage of a cascaded model over a hierarchical model in terms of computational cost and scalability becomes more apparent. (5) Finally, although all the datasets used in our experiments feature traffic scenes, our approach is equally suitable for other public scenes of crowded spaces. For example, we have recently shown that the learning of behaviour spatial context facilitates the understanding of a busy underground station scene (Loy et al. 2009).
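As an aside on item (2), a behaviour-footprint of the kind used in this work can be sketched in a few lines, assuming a list of detected events with hypothetical (y, x, class) fields and toy dimensions:

import numpy as np

H, W, n_event_classes = 144, 180, 13
footprint = np.zeros((H, W, n_event_classes))
events = [(40, 60, 2), (41, 60, 2), (90, 120, 7)]   # toy detected events
for y, x, c in events:
    footprint[y, x, c] += 1                # per-pixel event-class histogram
# Pixels are then grouped by spectral clustering on footprint similarity
# (Sect. 4); a time-series footprint would additionally preserve event order.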

8 Conclusions

This paper has comprehensively defined behavioural context and presented a novel framework for learning three different types of behavioural context: behaviour spatial context, correlation context, and temporal context. For learning spatial context a semantic scene segmentation method is formulated. The learned spatial context is then exploited to compute a cascaded topic model for learning correlation and temporal context at multiple scales. The learned cascaded model is employed to address the problems of video temporal segmentation and context-aware anomaly detection. Extensive experiments were carried out using three different busy public space scenes to demonstrate the effectiveness of the proposed behavioural context learning framework and its usefulness for understanding complex multi-object behaviours in a crowded space.

The presented work has a number of limitations which need to be addressed in future work.

– Although dynamic behavioural context is discovered from video data, once learned, the current framework does not accommodate changes of context, which are common when video data of long duration needs to be analysed. For instance, the semantic layout of the scene and the correlations between objects can differ at different times of the day and on different days of the year. One solution to this problem is incremental learning. For example, the incremental spectral clustering algorithm by Chi et al. (2007) can be adopted to update the scene segmentation on-the-fly. The incremental topic model learning method by Canini et al. (2009) can also be employed for incremental learning of the topic models used in this work.



– The model complexity, corresponding to the number of topics in each cascade stage, is currently determined either a priori by domain knowledge or through cross-validation. One could use a Bayesian model selection approach to determine the model complexity automatically (Wallach et al. 2009).

– The current method is only applicable to unseen data from the same scene and the same viewpoint. It would be desirable if a model learned from one scene could be used to facilitate the understanding of behaviour in a different scene (or the same scene captured from a different viewpoint). Solving that problem is particularly useful for abnormal behaviour detection because abnormal behaviours are typically rare; there is therefore a big incentive to transfer knowledge from one scene to another. One solution to the problem is transfer learning. In the past 5 years, there have been extensive efforts on applying transfer learning techniques to object recognition and action recognition (Fei-Fei et al. 2006; Duan et al. 2009). However, it is not straightforward to apply those techniques to our problem here because: (1) Most transfer learning techniques are designed for discriminative models, with few exceptions such as the one-shot learning work by Fei-Fei et al. (2006). It is not clear which part of our topic models can be transferred. (2) Transferring a target object/action/behaviour has been attempted before, but we are not aware of any previous work on transferring context, which poses additional challenges because behavioural context is defined in conjunction with behaviour and one cannot simply transfer context alone.

– The correlations modelled in this work are limited to co-occurrence. This is mainly due to the use of topic models, which are based on a Bag-of-Words representation and unable to capture temporal order information. In other words, model robustness to noise is gained at the price of being unable to model more complex correlations. As a simple extension of the current framework, instead of clustering the topic profile of the second-stage LDA for global temporal context modelling, one could adopt an HMM to model the temporal order of different phases explicitly. Recently there have been a number of attempts at re-introducing temporal ordering information to topic models (Griffiths et al. 2007; Wallach 2006), which may provide a solution to this problem. However, it remains an open problem how to strike the right balance between model sensitivity and robustness when dynamics are modelled in a topic model.

– A spectral clustering based segmentation method is adopted for semantic scene segmentation in this work. However, a large variety of other image segmentation methods can be considered here. In particular, topic models have recently been employed for simultaneous image segmentation and object categorisation (Cao and Fei-Fei 2007; Blei and Lafferty 2007). When a video is split into multiple sub-sequences, a topic model based segmentation method, such as those introduced in Cao and Fei-Fei (2007) and Blei and Lafferty (2007), can be applied for scene segmentation. Furthermore, since both scene segmentation and behaviour modelling can be done using topic models, it is possible to combine them into a unified topic model that performs both tasks simultaneously. We note that in a recent effort (Haines and Xiang 2009), a regional LDA model was formulated which attempts to encode spatial awareness into an LDA model for behaviour modelling. However, this model tends to over-segment and has a much higher computational cost compared to a standard LDA model, even when behaviour is modelled only at a single scale. Therefore further investigation is necessary to develop a unified topic model which is tractable and capable of performing multi-scale behaviour context modelling.

Acknowledgements We thank Xiaogang Wang for providing us with the MIT Traffic Dataset for our comparative experiments and evaluation. This work was partially supported by the UK Engineering and Physical Sciences Research Council (grant number EP/E028594/1).

References

Ali, S., & Shah, M. (2008). Floor fields for tracking in high density crowd scenes. In European conference on computer vision (Vol. II), Marseille, France, October 2008.

Bar, M. (2004). Visual objects in context. Nature Reviews Neuroscience, 5, 617–629.

Bar, M., & Aminof, E. (2003). Cortical analysis of visual context. Neuron, 38, 347–358.

Bar, M., & Ullman, S. (1993). Spatial context in recognition. Perception, 25, 343–352.

Biederman, I., Mezzanotte, R. J., & Rabinowitz, J. C. (1982). Scene perception: detecting and judging objects undergoing relational violations. Cognitive Psychology, 14, 143–177.

Blei, D. M., & Lafferty, J. D. (2007). Spatial latent Dirichlet allocation. In Advances in neural information processing systems.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

Breitenstein, M. D., Sommerlade, E., Leibe, B., Van Gool, L., & Reid, I. (2008). Probabilistic parameter selection for learning scene structure from video. In British machine vision conference, Leeds, UK, September 2008.

Canini, K. R., Shi, L., & Griffiths, T. L. (2009). Online inference of topics with latent Dirichlet allocation. In International conference on artificial intelligence and statistics.

Cao, L., & Fei-Fei, L. (2007). Spatially coherent latent topic model for concurrent object segmentation and classification. In International conference on computer vision.

Carbonetto, P., de Freitas, N., & Barnard, K. (2004). A statistical model for general contextual object recognition. In European conference on computer vision, Prague, Czech Republic, May 2004.


Chi, Y., Song, X., Zhou, D., Hino, K., & Tseng, B. (2007). Evolution-ary spectral clustering by incorporating temporal smoothness. InACM SIGKDD international conference on Knowledge discoveryand data mining.

Duan, L., Tsang, I. W., Xu, D., & Maybank, S. J. (2009). Domain trans-fer svm for video concept detection. In IEEE conference on com-puter vision and pattern recognition.

Fei-Fei, L., Fergus, R., & Perona, P. (2006). One-shot learning of objectcategories. IEEE Transactions on Pattern Analysis and MachineIntelligence, 28(4), 594–611.

Galleguillos, C., Rabinovich, A., & Belongie, S. (2008). Object cate-gorization using co-occurrence, location and appearance. In IEEEconference on computer vision and pattern recognition, Alaska,USA, June 2008.

Greenhill, D., Renno, J., Orwell, J., & Jones, G. A. (2008). Occlusionanalysis: learning and utilising depth maps in object tracking. Im-age and Vision Computing, 26(3), 430–441.

Griffiths, T., & Ghahramani, Z. (2005). Infinite latent feature modelsand the Indian buffet process (Technical report). University Col-lege London.

Griffiths, T., Steyvers, M., Blei, D., & Tenenbaum, J. (2007). Integrating topics and syntax. In Neural information processing systems.

Gupta, A., & Davis, L. S. (2008). Beyond nouns: exploiting prepositions and comparative adjectives for learning visual classifiers. In European conference on computer vision, Marseille, France, October 2008.

Haines, T., & Xiang, T. (2009). Video topic modelling with behavioural segmentation. In ACM workshop on multimodal pervasive video analysis.

Heitz, G., & Koller, D. (2008). Learning spatial context: using stuff to find things. In European conference on computer vision, Marseille, France, October 2008.

Hofmann, T. (1999a). Probabilistic latent semantic analysis. In Uncertainty in artificial intelligence.

Hofmann, T. (1999b). Probabilistic latent semantic indexing. In SIGIR.

Hospedales, T. M., Gong, S., & Xiang, T. (2009). A Markov clustering topic model for mining behaviours in video. In International conference on computer vision.

Kuettel, D., Breitenstein, M. D., Van Gool, L., & Ferrari, V. (2010). What's going on? Discovering spatio-temporal dependencies in dynamic scenes. In IEEE conference on computer vision and pattern recognition, San Francisco, USA, June 2010 (pp. 1951–1958).

Kumar, S., & Hebert, M. (2005). A hierarchical field framework for unified context-based classification. In International conference on computer vision, Beijing, China, October 2005.

Li, J., Gong, S., & Xiang, T. (2008a). Scene segmentation for behaviour correlation. In European conference on computer vision, Marseille, France, October 2008 (Vol. IV, pp. 383–395).

Li, J., Gong, S., & Xiang, T. (2008b). Global behaviour inference using probabilistic latent semantic analysis. In British machine vision conference, Leeds, UK, September 2008 (pp. 193–202).

Loy, C. C., Xiang, T., & Gong, S. (2009). Multi-camera activity correlation analysis. In IEEE conference on computer vision and pattern recognition, Miami, USA, June 2009 (pp. 1988–1995).

Lucas, B. D., & Kanade, T. (1981). An iterative image registration technique with an application to stereo vision. In Proceedings of imaging understanding workshop (pp. 121–130).

Makris, D., & Ellis, T. (2005). Learning semantic scene models from observing activity in visual surveillance. IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics, 35(3), 397–408.

Makris, D., Ellis, T., & Black, J. (2004). Bridging the gaps between cameras. In IEEE conference on computer vision and pattern recognition, Washington (pp. 205–210).

Malik, J., Belongie, S., Leung, T., & Shi, J. (2001). Contour and texture analysis for image segmentation. International Journal of Computer Vision, 43(1), 7–27.

Marszalek, M., Laptev, I., & Schmid, C. (2009). Actions in context. In IEEE conference on computer vision and pattern recognition, Miami, USA, June 2009.

Murphy, K., Torralba, A., & Freeman, W. (2003). Using the forest to see the trees: a graphical model relating features, objects and scenes. In Neural information processing systems.

Niebles, J., Wang, H., & Fei-Fei, L. (2008). Unsupervised learning of human action categories using spatial-temporal words. International Journal of Computer Vision, 79(3), 299–318.

Palmer, S. (1975). The effects of contextual scenes on the identification of objects. Memory and Cognition, 3, 519–526.

Rabinovich, A., Vedaldi, A., Galleguillos, C., Wiewiora, E., & Belongie, S. (2007). Objects in context. In International conference on computer vision, Rio de Janeiro, Brazil, October 2007.

Russell, D., & Gong, S. (2006). Minimum cuts of a time-varying background. In British machine vision conference, Edinburgh, UK, September 2006.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461–464.

Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888–905.

Singhal, A., Luo, J., & Zhu, W. (2003). Probabilistic spatial context models for scene content understanding. In IEEE international conference on computer vision and pattern recognition, Madison, USA, June 2003.

Teh, Y. W., Jordan, M. I., Beal, M. J., & Blei, D. M. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476), 1566–1581.

Wallach, H. (2006). Topic modeling: beyond bag-of-words. In International conference on machine learning, Pittsburgh, USA, June 2006.

Wallach, H., Murray, I., Salakhutdinov, R., & Mimno, D. (2009). Evaluation methods for topic models. In International conference on machine learning, Montreal, Canada, June 2009.

Wang, X., Tieu, K., & Grimson, E. (2006). Learning semantic scene models by trajectory analysis. In European conference on computer vision (Vol. 3, pp. 110–123).

Wang, X., Ma, K. T., Ng, G., & Grimson, E. (2008). Trajectory analysis and semantic region modeling using a nonparametric Bayesian model. In IEEE conference on computer vision and pattern recognition, Alaska, USA, June 2008.

Wang, X., Ma, X., & Grimson, W. E. L. (2009). Unsupervised activity perception in crowded and complicated scenes using hierarchical Bayesian models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(3), 539–555.

Wang, X., Tieu, K., & Grimson, E. (2010). Correspondence-free activity analysis and scene modeling in multiple camera views. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(1), 56–71.

Wolf, L., & Bileschi, S. (2006). A critical view of context. International Journal of Computer Vision, 69(2), 251–261.

Xiang, T., & Gong, S. (2006a). Model selection for unsupervised learning of visual context. International Journal of Computer Vision, 69(2), 181–201.

Xiang, T., & Gong, S. (2006b). Beyond tracking: modelling activity and understanding behaviour. International Journal of Computer Vision, 67(1), 21–51.

Xiang, T., & Gong, S. (2008). Video behavior profiling for anomaly detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(5), 893–908.

Yang, M., Wu, Y., & Hua, G. (2008). Context-aware visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(7), 1195–1209.

Zelnik-Manor, L., & Perona, P. (2004). Self-tuning spectral clustering. In Neural information processing systems.

Zheng, W., Gong, S., & Xiang, T. (2009). Quantifying contextual information for object detection. In International conference on computer vision, Kyoto, Japan, September 2009.