
Behavior Recognition Architecture for Surveillance

Applications

Charles J. Cohen

Katherine A. Scott

Marcus J. Huber

Steven C. Rowe

Cybernet Systems Corporation

727 Airport Blvd.

Ann Arbor, MI 48108

(734) 668-2567

ccohen, kscott, mhuber, srowe @ cybernet.com

Frank Morelli

U.S. Army Research Laboratory

Aberdeen Proving Ground, MD 21005

(410) 278-8824

frank.morelli @ us.army.mil

Abstract— Differentiating between normal human activity

and aberrant behavior via closed circuit television cameras

is a difficult and fatiguing task. The vigilance required of

human observers when engaged in such tasks must remain

constant, yet attention falls off dramatically over time. In

this paper we propose an architecture for capturing data

and creating a test and evaluation system to monitor video

sensors and tag aberrant human activities for immediate

review by human monitors. A psychological perspective

provides the inspiration of depicting isolated human motion

by point-light walker (PLW) displays, as they have been

shown to be salient for recognition of action. Low level

intent detection features are used to provide an initial

evaluation of actionable behaviors. This relies on strong

tracking algorithms that can function in an unstructured

environment under a variety of environmental conditions.

Critical to this is creating a description of “suspicious

behavior” that can be used by the automated system. The

resulting confidence value assessments are useful for

monitoring human activities and could potentially provide

early warning of IED placement activities.

INTRODUCTION

Closed circuit television (CCTV) cameras have become a

pervasive component in modern society’s arsenal of crime

deterrent devices. When employed as preventative devices,

CCTVs are only as effective as the human monitor who

views the data and orchestrates a response. The human

operator in a CCTV system is subject to boredom and

fatigue, and usually has to shift his or her attention between

a large number of CCTV camera inputs. The combination of

fatigue and distraction makes it difficult for human CCTV

monitors to detect criminal and terrorist behavior.

Additionally, the cues leading up to these behaviors may be

subtle, easily escaping human detection during cursory

monitoring.

Here we detail a software system that assists the CCTV

monitor by “flagging” potentially hostile behavior for

further scrutiny, and alerting monitoring personnel, in order

to increase operational efficiency, increase vigilance, and

potentially save lives. This is illustrated in Figure 1, where

the surveillance operator is alerted, by the red flashing

border and associated sound, to the center CCTV monitor

that shows a person burying an object.

A system that detects and highlights hostile behavior in

CCTV camera data will improve monitoring only if it is

able to reliably detect hostile behavior without distracting

the human monitor with false-positive results (non-hostile

behavior tagged as hostile; see Table 1). For a multiple camera system, false positives are acceptable so long as they do not distract the user from true hostile behavior; in other words, they are not hazardous false positives. Benign false positives could potentially be

distracting, but an optimal intent detection system should

closely mirror the behavior of a skilled surveillance

operator, and therefore not present too many distractions.

Figure 1: Surveillance operator alerted to suspicious

activity by the red flashing border (and associated

sound) in the center CCTV.


Table 1: Detection Criteria

True Positive: The system detects hostile actions.
True Negative: The system detects the lack of hostile actions.
False Positive: The system detects hostile actions where none exist.
False Negative: The system fails to detect hostile actions where they do exist.
Hazardous False Positive: The system simultaneously detects a false positive and a false negative on different data sources.
Benign False Positive: The system simultaneously detects false positives and true negatives.

When considering the analysis of 3-dimensional scenes that

feature human activity, biological motion is a critical

component for the visual deduction of complex semantic

information. Johansson’s [2] seminal work showed that

simple isolated joint motion, even without the influence of

contextual factors such as clothing or background

information, provided ample visual data for the accurate

perception of the human form in motion. Johansson’s

original research provided the foundation for modern

inquiry into the perception of human biological motion

using point-light displays, which have revealed that

classifications such as gender [4], [9], [11], action

categorization [5], and even emotional state [6] may be

reliably derived from motion cues alone.

What is critical is a system that can track these feature points, allow the surveillance operator to

define actionable behavior through a usable user interface,

apply a method for intent detection, and test them under a

variety of environmental and operational conditions. In this

paper, we detail the architecture shown in Figure 3.

A vertical slice demonstration of this system has been

developed, using the basic elements of each architecture

module. Each module will be explained, although some

modules are basic in description and are just pure interface

software. Other modules that deal with machine vision and

assessment are given expanded space. The final section

details our conclusions, the applications of our research

system, and our future work.

SECTION I: HUMAN PERCEPTION OF INTENT

In an attempt to capture what the Surveillance Operator is

observing, we examine the human perception of intent from

a psychological perspective. Joint movement is a useful

variable for determining often subtle motion cues, where

even partial data due to occlusion may be enough to provide

an accurate classification of intent. The U.S. Army Research

Laboratory (ARL) is currently examining the human

perceptual ability to detect specified classes of human

biological motion, looking at the role of mere exposure as a

possible modulator for proficiency in rapidly categorizing

intent based on human biological motion. The focus for this

research is the anticipatory inference of threatening actions

based solely upon incomplete human motion cues, with a

specific focus on human motion characteristics that lead up

to a violent or threatening action. As the temporal window

that precedes a hostile act progressively closes without

awareness, the likelihood of an undesirable outcome for

forces on the receiving end of an assault increases

dramatically. The earlier one detects the behavioral

tendencies that may be precursors for a hostile act, the

greater the advantage to an operator for executing the rapid,

proactive decisions necessary to thwart or neutralize

impending hostility.

Video Analysis, Markerless Motion Capture, and

Conversion of Footage to Experimental Stimuli

To create a stimulus set that isolates human motion from the

complex visual information inherent to 3-dimensional scenes, video footage of hostile and non-

threatening action sequences is being translated into point-

light walker (PLW) displays. To accomplish this goal, a

video library has been compiled from open source,

unclassified footage of hostile activity, produced and

released to the public over the internet by groups seeking to

maximize the viewing audience for their exploits. These

often lengthy videos have undergone careful review and

have been coded and parsed into short clips of footage that feature action sequences with the potential for use as experimental stimuli. A number of problems have been identified and addressed in the undertaking of this effort, with the most prominent among them noted below.

Figure 2: a) Image frame taken from video footage of an insurgent in the act of throwing an explosive device; b) Points are superimposed on major joints using AFRL Point-Light Video Developer (PLVD) software [12]; c) Still frame of rendered point-light figure shown in isolation.

Figure 3: Surveillance Data Collection & Testing Architecture

One main issue of significant note in this process, and one

that will predictably remain so for future efforts, has been

video format variability. The videos have been encoded

using different codecs (i.e., compression/decompression

algorithms), container formats (e.g., .mpeg, .avi, .rm, .wma,

among others), and frame rate standards (e.g., NTSC, PAL,

SECAM), limiting initial visibility and subsequent usability.

This necessitated a conversion to a singular format, for both

viewing and eventual digitization into point-light walker

displays. Universal video format conversion was

accomplished using multiple applications, including

Macintosh Final Cut Pro, Thompson Grass Valley’s EDIUS

nonlinear postproduction editing software, as well as open

source web-based applications (e.g., http://media-

convert.com). This approach led to successful conversion

for the bulk of the videos, but not without additional

complications. Frame rate conversion has not always been

reliable, leading to difficulty in video segment localization

for processing. Additionally, the conversion process has at

times led to extremely poor video quality, causing

uncertainty when engaged in the point-light digitization

process. For example, extreme pixilation has often resulted

in joint assignment difficulty. When combined with

resolution levels that are initially quite poor, and when

considering natural obstructions inherent to real world

visual scenes that lead to figure occlusion and distortion, the

process of creating point-light displays using a markerless

methodology is often arduous. Despite these concerns, as

well as the inherent tedium involved in the process, this

method for digitization has been quite successful and

minimizes concerns within the field of perceptual

psychology regarding content validity. Given that the

“subjects” being digitized are engaged in behavior without coercion from an experimental “director” (i.e., they are not

actors engaged in movement on command for experimental

purposes), behavioral data such as detection accuracy and

response time relative to the presentation of motion displays

can be confidently reported to be accurate responses to

genuine behaviors.

To digitize the isolated movements of individuals within the

original video footage into point-light displays, customized

software, conceptually based on the open-source system

generated by the Temple University Vision Laboratory [10],

was written by the U.S. Air Force Research Laboratory

(AFRL) [12] for use by ARL (see Figure 2). Individuals are

isolated from video sequences and rendered into point-light

displays, animations devoid of background detail present in

the original footage. Given that the nature of the depicted

actions is a subjective categorization determined by the

experimenters, an initial study will validate experimenter

action content categorization by asking subjects to passively

observe each display and rate the actions they depict with

respect to perceived threat content. Participant stimulus

categorizations that are concurrent with those assignments

made by experimenters will result in the retention of those

stimuli for further experimental use. Non-concurrent stimuli

will be omitted from further use.

Follow-on experimentation will employ the retained stimuli

for analysis of human response timing and accuracy relative

to incomplete point-light display exposure. Given that

accurate identification of threat within an operational scenario

is less useful subsequent to a hostile action than it could be

prior to execution, where preventative measures might still be

viable, the emphasis for this experiment will be the

examination of response characteristics that fall within the

temporal window preceding the terminal event in an action

sequence. To ensure rapid decision-making on the part of the

observer that falls within that temporal period, the PLW

displays that will be presented to subjects will not feature the

final frames that animate a particular goal action, so observers

will not be inclined to wait for the completion of an executed

motion before rendering their classification of “threatening” or “innocuous”. The executed motion they are presented with will never be fully enacted, thus a choice will be “forced”

based on uncertain, incomplete visual input, as is often the

case during real-world perception and action downrange.

SECTION II: SUPPORT MODULES

In this section, we briefly discuss some of the modules that

provide supporting functionality within the Vigilance

application. These support modules include the Input

Database, the Video to Database process, Analysis Database,

and the Display. Later in the paper, we discuss the core

Vigilance components in more detail.

Video to Database. Data from each video source is stored as

individual frames within the SQL database. Each frame can

then be tagged with an arbitrary amount of metadata using

relational methods. The power of this approach comes from

the ability of multiple, independent data processing

applications browsing the data that has been captured and

then further adding to the metadata. This approach also

provides the capability to view or process only the data that

meet specified criteria. In the future, database functions will

also enable the search of observed anomalous behavior in

other stored video. For efficiency purposes, our

implementation collocates a few cameras with a computer

running database server software. These “sensor nodes” are

interconnected via a network and can be treated as a

distributed database by external processing and viewing

applications.
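To make the relational-tagging idea concrete, the following Python sketch uses an illustrative SQLite schema; the table and column names are our own and not those of the deployed system. It shows how independent applications might add per-frame metadata and how a viewer can pull only the frames that meet specified criteria.

import sqlite3

# Minimal sketch of the frame/metadata storage idea described above.
# Table and column names are illustrative, not the system's actual schema.
conn = sqlite3.connect("sensor_node.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS frames (
    frame_id   INTEGER PRIMARY KEY,
    camera_id  TEXT NOT NULL,
    timestamp  REAL NOT NULL,
    image_blob BLOB NOT NULL
);
CREATE TABLE IF NOT EXISTS frame_metadata (
    frame_id   INTEGER REFERENCES frames(frame_id),
    tag_source TEXT,   -- which processing application added the tag
    tag_key    TEXT,   -- e.g. 'track_id', 'object_class', 'alert'
    tag_value  TEXT
);
""")

# Any number of independent applications can add tags to the same frame...
conn.execute("INSERT INTO frame_metadata VALUES (?, ?, ?, ?)",
             (42, "tracker", "object_class", "vehicle"))

# ...and viewers can then retrieve only the frames that meet given criteria.
rows = conn.execute("""
    SELECT f.frame_id, f.timestamp
    FROM frames f JOIN frame_metadata m ON f.frame_id = m.frame_id
    WHERE m.tag_key = 'object_class' AND m.tag_value = 'vehicle'
""").fetchall()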

Input Database. This is the storage area for all the raw video

created by the Video to DB module. It can also contain still

imagery and information that the surveillance operator has

specified using the front end as important for detecting

anomalous behavior (such as prohibited zones or loitering behavior). This information is used for analysis, along with the operational “badguyology”¹ parameters (see Section IV below).

Analysis Database. The Vigilance Modules detect, track,

identify, and classify objects within the video clips and, from these results, identify benign and potentially hostile events and behaviors. The Analysis Database is used to maintain the

association between the inputs, badguyology parameters, and

the results computed by the Automated Vigilance Module

and specified by a subject matter expert using the Manual

Vigilance Module.

Display. The Display module shows the results of tracking

and anomaly detection. It also lets the operator view

automated results and select appropriate analysis for truthing

data. For example, the operator can decide that the tracking

algorithms are sufficiently accurate for a specified video set.

That data set would then be available for regression testing if

the tracking algorithms are changed or replaced, resulting in

system performance measurements. This display also causes a red border to appear around the associated video whenever an alert is triggered.

Comparator. This module compares the results of the

Automated Vigilance module either to the truth data specified via the Manual Vigilance Module (see below) or, for regression testing, to the output of a prior Automated Vigilance Module run on the same data. The Comparator Module provides us the

capability to automatically validate approaches and

algorithms against ground truth and prior implementations

and immediately identify whether the modifications improved

the system’s performance.
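As an illustration only, the Python sketch below scores automated alerts against truth data using the detection categories of Table 1, including the hazardous/benign false-positive distinction; the field names and the frame-level granularity are assumptions, not the Comparator's actual interface.

from dataclasses import dataclass

# Illustrative sketch of the Comparator scoring step using the
# detection categories of Table 1. Field names are assumptions.

@dataclass
class FrameResult:
    camera_id: str
    truth_hostile: bool      # from the Manual Vigilance (truth) data
    detected_hostile: bool   # from the Automated Vigilance module

def score(frame_results):
    """Tally Table 1 categories for one synchronized set of camera frames."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for r in frame_results:
        if r.detected_hostile and r.truth_hostile:
            counts["TP"] += 1
        elif not r.detected_hostile and not r.truth_hostile:
            counts["TN"] += 1
        elif r.detected_hostile and not r.truth_hostile:
            counts["FP"] += 1
        else:
            counts["FN"] += 1
    # A false positive is "hazardous" if it co-occurs with a missed
    # hostile event on another data source, and "benign" otherwise.
    counts["hazardous_FP"] = counts["FP"] if counts["FN"] > 0 else 0
    counts["benign_FP"] = counts["FP"] - counts["hazardous_FP"]
    return counts

print(score([FrameResult("cam1", False, True),
             FrameResult("cam2", True, False)]))   # one hazardous FP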

SECTION III: THE FRONT END

The Front End Module consists of the interface that performs

a number of functions:

- Determine which video (one or more) to examine.
- Tag objects for tracking (while this is not required since automatic tracking is part of the system, it is useful for regression testing).
- Set badguyology parameters for surveillance.
- Create regression testing experiments.
- Run real time operations.

Note that the surveillance video is sent both to a database for

future retrieval and to the front end for real time evaluation.

This allows the operator to tag people or objects to track and set up the badguyology parameters (such as marking zones of interest; see Section IV).

SECTION IV: BADGUYOLOGY PARAMETERS

Badguyology Parameters are specifications of the actions,

behaviors, and suspicious activities that the Automated

Vigilance and Manual Vigilance modules use for analysis.

¹ Term originated by Dr. Barbara O’Kane, RDECOM, Principal Scientist for Human Performance, U.S. Army Night Vision Lab.

The specific parameters and values are based on the operator’s specification within the Front End Module.

The types of parameters are, for all intents and purposes,

limitless, based on possible terrorist activities and the subject

matter expertise of surveillance operators. However, we have attempted to create various anomaly categories that can be used for the first instantiation of this system and provide immediate feedback for surveillance operators.

These include, but are not limited to:

- Area violations, such as beyond boundary, inside region, and outside region.
- Abnormal/unusual physical behaviors, including unusual location (in a location not frequented), unusual motions (such as moving against normal traffic flow, moving fast, moving slow, and stopping at too many places), and unusual numbers (too many and too few).
- Abnormal/unusual temporal behaviors, including dropping/abandoning objects (not picking up something dropped, or someone else picking it up), picking up objects, loitering, the same person/vehicle appearing too often in video scenes, and odd gaits or motions (carrying a concealed object, for example).
- Camera view being altered (being moved, obscured, or blinded).

There are a number of vision-system characteristic descriptors that are required to support the above broad spectrum of alert conditions. So far, these include Object Location, Object

Tracking, Object Size, Object Type/Classification, Motion

Detection, Object Speed, Object Color, Object Counts, Object

Abandonment, Object Removal, Temporal/spatial

distributions, Camera Move/Blind, and Compound Queries.
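For illustration, badguyology parameters might be represented as simple structured data that the Front End writes and the Vigilance modules read; the Python sketch below uses hypothetical field names and values, not the actual parameter set of the deployed system.

# Illustrative sketch of how badguyology parameters set in the Front End
# might be represented; the structure, names, and values are assumptions.
badguyology_params = {
    "zones": [
        {"name": "restricted_roadside",
         "polygon": [(120, 40), (480, 40), (480, 200), (120, 200)],
         "rule": "alert_on_entry",
         "object_types": ["person", "vehicle"]},
    ],
    "linger": {"max_stationary_seconds": 90,
               "object_types": ["person", "vehicle"]},
    "abandonment": {"max_unattended_seconds": 60},
    "traffic": {"max_speed_px_per_s": 250,
                "flow_direction_deg": 90,
                "against_flow_tolerance_deg": 45},
    "camera_health": {"alert_on_move": True, "alert_on_blind": True},
}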

SECTION V: MANUAL VIGILANCE

This module is essentially another user interface with functionality that allows a surveillance subject matter expert, given an input video sequence of images and specified badguyology parameters, to manually mark objects of interest to be tracked, identify and classify them, indicate observed anomalies, etc.

The results of this module become truthing data, and are

compared with the equivalent output from the automatic

system in the Comparator Module.

Manual analysis is extremely tedious and time-consuming, and should only be used as an analysis debugging tool or for critical data that must be captured in a very specific manner. For full regression

testing, we take the results of the Automated Vigilance

module, and evaluate the results manually in the Display for

approval or rejection.


SECTION VI: AUTOMATED VIGILANCE

This is the core module of our intent detection architecture,

and is composed of two components: camera data extraction

(tracking) and anomalous behavior detection.

CCTV Data Extraction

The collection of CCTV camera data for our intent detection

system is performed at both the system level (i.e. the

collection of CCTV monitoring at a facility) and at the

camera level. The rationale for using a multiple camera

system is that we believe behaviors exhibited across multiple cameras may yield a secondary source of intent

data. For example, multiple cameras make it possible to

track an individual moving through a facility or along a road

and observe surveillance behaviors that may be indicative of

IED emplacement.

Single Camera Data Extraction

At the camera level, human motion data for our gesture

recognition system is extracted from CCTV cameras using a

series of image processing operations that yield a set of

features for processing by our gesture recognition system.

For a single CCTV camera the chain of image processing

operations is as follows:

1. Capture raw image data.

2. Pre-process the data: resize and flip as necessary.

3. Segment foreground and background data using the

codebook method [3] (see Figure 4).

4. Threshold the foreground imagery into binary

connected-component “blobs” or features.

5. Dilate or merge the features and ignore any feature

that meets our criteria for noise (e.g. too small, or

unrealistic aspect ratio).

6. Perform a rough classification of the binary image

components using parameters like size, aspect

ratio, position, and color profile to yield a

classification-confidence pair (e.g., 0.6 confidence that a feature is a torso).

7. Attempt to associate current frame image features

with ―similar‖ features from previous frames. Re-

evaluate the feature classification based on

previous frame data.

8. Attempt to decompose large features into smaller sub-features using convex hull detection (e.g., breaking torsos that include legs and arms into constituent parts).

9. Feed refined image feature position data and

classification into the gesture recognition system

mentioned previously.
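A minimal Python/OpenCV sketch of steps 1 through 6 follows. Because OpenCV's Python API does not include a codebook model, MOG2 background subtraction stands in for the codebook step [3]; the file name, thresholds, and the crude classification rule are illustrative assumptions rather than the system's actual settings.

import cv2
import numpy as np

# Sketch of steps 1-6 of the single-camera chain, assuming OpenCV.
# MOG2 background subtraction substitutes for the codebook model [3].
cap = cv2.VideoCapture("cctv_feed.avi")          # hypothetical input file
bg_model = cv2.createBackgroundSubtractorMOG2(history=150)

while True:
    ok, frame = cap.read()                       # 1. capture raw image data
    if not ok:
        break
    frame = cv2.resize(frame, (640, 480))        # 2. pre-process (resize/flip)
    fg_mask = bg_model.apply(frame)              # 3. foreground/background
    _, binary = cv2.threshold(fg_mask, 127, 255, cv2.THRESH_BINARY)  # 4. blobs
    binary = cv2.dilate(binary, np.ones((5, 5), np.uint8))           # 5. dilate
    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
    for i in range(1, n):                        # 5-6. drop noise, classify
        x, y, w, h, area = stats[i]
        if area < 200 or h == 0 or w / h > 5:    # crude noise criteria
            continue
        aspect = h / w
        confidence = 0.6 if 1.5 < aspect < 4.0 else 0.3   # e.g. "torso-like"
        # ...feed (x, y, w, h, confidence) to tracking / gesture recognition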

The codebook algorithm works by building a per-pixel

background model of color and luminance viewed by the

camera over a short period of time (30 to 150 frames, or 1 to

5 seconds). As reported by Kim [3] the results were quite

good, and the method has greatly improved our system’s

ability to track moving objects. The codebook method

significantly reduces the overall system noise and the

number of motion artifacts that were present in our previous

background subtraction method. Figure 4 gives an example

of the system’s overall segmentation performance. The

codebook method is not without its drawbacks; these

include a number of tuning parameters, the requirement of a

stationary camera, and high computational cost particularly

during initialization.
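To illustrate the idea behind the codebook model, the following simplified and unoptimized Python sketch learns per-pixel brightness codewords from grayscale training frames and marks unmatched pixels as foreground; the full method in [3] also models color distortion and prunes stale codewords, which this sketch omits.

import numpy as np

# Highly simplified per-pixel codebook model on grayscale frames,
# illustrating the concept behind [3]; not an optimized implementation.
class CodebookBG:
    def __init__(self, shape, tol=10):
        self.tol = tol
        self.codebooks = [[[] for _ in range(shape[1])] for _ in range(shape[0])]

    def train(self, frame):
        # Grow per-pixel codewords (brightness ranges) over ~30-150 frames.
        for y in range(frame.shape[0]):
            for x in range(frame.shape[1]):
                v = int(frame[y, x])
                for cw in self.codebooks[y][x]:
                    if cw[0] - self.tol <= v <= cw[1] + self.tol:
                        cw[0], cw[1] = min(cw[0], v), max(cw[1], v)
                        break
                else:
                    self.codebooks[y][x].append([v, v])

    def segment(self, frame):
        # A pixel is foreground if it matches no learned codeword.
        fg = np.zeros(frame.shape, dtype=np.uint8)
        for y in range(frame.shape[0]):
            for x in range(frame.shape[1]):
                v = int(frame[y, x])
                if not any(cw[0] - self.tol <= v <= cw[1] + self.tol
                           for cw in self.codebooks[y][x]):
                    fg[y, x] = 255
        return fg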

Multiple Camera Data Extraction

For multiple camera environments, the system consists of

multiple camera sensor nodes, software applications, and a

computer network. This modular approach is highly

scalable; new sensor nodes and applications can be added as

needed. Intelligent pre-processing of data at each sensor

node minimizes required network bandwidth. Each sensor

node contains a standard functionality suite that includes

data capture, data recording, data processing, configuration,

and networking.

A suite of specific programs, rather than a single monolithic

application, controls the configuration of each node. Each of

these programs has a set of operational parameters that are

stored such that they can be updated in real-time via some

API, and are accessible to the database tagging applications.

The programs can be stopped and started remotely.

Intent Recognition

This part of the automated detection uses a variety of

methods to determine intent based on the results of tracking.

Figure 4: Codebook Foreground-Background

Segmentation. Left image: Input image with

tracking. Right image: Segmented image, black

represents background while white represents

foreground imagery.


Zone Detection

We have created a simple “zone detection” algorithm that

allows the user to define a specific zone within a video

stream and receive an alert whenever a person or vehicle

enters the zone. Potential applications of this technology

include watching for a person entering or leaving a sidewalk,

a vehicle entering or leaving a road lane, and detecting

transitions in other critical areas. The Automated Vigilance

Module will use the zone detection feature later on to create

a security “grammar” for analysis. For example, a user

could define the boundaries for roads, sidewalks, curbs,

parking areas, etc, and the software could determine if an

observer is entering a dangerous area, or if a vehicle is

travelling through a restricted location.
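A minimal Python sketch of the zone test follows, assuming each tracked object is reduced to a single reference point and zones are user-defined polygons; function and variable names are illustrative.

# Minimal sketch of zone detection: alert when a tracked object's
# reference point enters a user-defined polygonal zone.
def point_in_polygon(pt, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def zone_alerts(tracked_objects, zones):
    """tracked_objects: {track_id: (x, y)}; zones: {name: polygon}."""
    return [(track_id, name)
            for track_id, pos in tracked_objects.items()
            for name, poly in zones.items()
            if point_in_polygon(pos, poly)]

print(zone_alerts({7: (150, 100)},
                  {"restricted_roadside": [(120, 40), (480, 40),
                                           (480, 200), (120, 200)]}))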

Linger Detection

We have created a software module for detecting lingering

behavior. This software module monitors all tracked objects

and issues an alert whenever an object remains stationary

for a pre-determined amount of time. Therefore we can detect

when a previously moving car has parked or when an

observed person has stopped at a street corner.
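The following Python sketch illustrates one way such a linger check could work, assuming each tracked object reports a position every frame; the radius and time thresholds are hypothetical values, not the system's tuned parameters.

import time

# Sketch of linger detection: flag any track that stays within a small
# radius of its anchor point for longer than a preset threshold.
class LingerDetector:
    def __init__(self, max_stationary_s=90.0, radius_px=15.0):
        self.max_stationary_s = max_stationary_s
        self.radius_px = radius_px
        self.anchors = {}   # track_id -> (x, y, first_seen_time)

    def update(self, track_id, x, y, now=None):
        now = time.time() if now is None else now
        ax, ay, t0 = self.anchors.get(track_id, (x, y, now))
        if (x - ax) ** 2 + (y - ay) ** 2 > self.radius_px ** 2:
            # The object moved: restart the stationarity clock here.
            self.anchors[track_id] = (x, y, now)
            return False
        self.anchors[track_id] = (ax, ay, t0)
        return (now - t0) > self.max_stationary_s   # True => raise alert

detector = LingerDetector(max_stationary_s=90)
print(detector.update(3, 100, 200, now=0.0))    # False: just appeared
print(detector.update(3, 102, 201, now=120.0))  # True: stationary > 90 s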

Path Recording

Path recording allows us to monitor and record the paths

taken by tracked objects (as shown in the center monitor of

Figure 1). This method allows us to make more informed

decisions about the behaviors of tracked objects. For

example, this method will allow us to collect statistical data

about where tracked objects go and when a particular

object/path/time combination occurs. Furthermore this

feature will be used to expand our image processing

grammar by providing us with path and speed information.
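As a sketch of the bookkeeping involved (names and units are illustrative assumptions), a path recorder might keep a timestamped trail per track and derive speed and path statistics from it:

from collections import defaultdict

# Sketch of path recording: keep a timestamped trail per track so that
# path, speed, and object/path/time statistics can be derived later.
class PathRecorder:
    def __init__(self):
        self.paths = defaultdict(list)   # track_id -> [(t, x, y), ...]

    def record(self, track_id, t, x, y):
        self.paths[track_id].append((t, x, y))

    def speed(self, track_id):
        """Average speed (pixels/second) over the recorded path."""
        p = self.paths[track_id]
        if len(p) < 2:
            return 0.0
        dist = sum(((x2 - x1) ** 2 + (y2 - y1) ** 2) ** 0.5
                   for (_, x1, y1), (_, x2, y2) in zip(p, p[1:]))
        return dist / (p[-1][0] - p[0][0])

rec = PathRecorder()
for t, x, y in [(0, 0, 0), (1, 3, 4), (2, 6, 8)]:
    rec.record("car_12", t, x, y)
print(rec.speed("car_12"))   # 5.0 px/s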

Behavior Recognition

We have developed gesture recognition algorithms that use

position and velocity information to identify behaviors and

intents. Gestures, and combinations of gestures, are

identified through the use of a bank of predictor bins (see

Figure 5), each containing a dynamic system model with

parameters preset to a specific gesture. We assume that the

motions of human circular gestures are decoupled in x and

y, therefore there are separate predictor bins for the x and y

axes (as well as one for z axis when dealing with three

dimensional motions). A predictor bin is required for each

gesture type for each dimension – the position and velocity

information from the sensor module is fed directly into each

bin.

The idea for seeding each bin with different parameters was

inspired by Narendra and Balakrishnan's work on improving

the transient response of adaptive control systems. In this

work, they created a bank of indirect controllers that were

tuned online, but whose identification models had different

initial estimates of the plant parameters. When the plant

was identified, the bin that best matched that identification

supplied a required control strategy for the system [8].

Each bin's model, which has parameters that tune it to a

specific gesture, is used to predict the future position and

velocity of the motion – the prediction of a particular gesture

is made by feeding the current state of the motion into the

gesture model. This prediction is compared to the next

position and velocity, a residual error is computed, and the

bin for each axis with the least residual error is the best

gesture match. If the best gesture match is not below a

predefined threshold (which is a measure of how much

variation from a specific gesture is allowed), then the result is

ignored and no gesture is identified. Otherwise, geometric

information is used to constrain the gesture further – a single

gesture identification number, which represents the

combination of the best x bin, the best y bin, and the

geometric information, is output to the transformation

module. This number (or NULL if no gesture is identified) is

output immediately upon the initiation of the gesture, and is

continually updated. The parameters used to initially seed

each predictor bin can be calculated by feeding the data of

each axis from previously categorized motions into the

recursive linear least squares identifier.
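The following single-axis Python sketch illustrates the predictor-bin idea: each bin predicts the next state from a model tuned to one gesture, and the bin with the smallest residual, if below threshold, names the gesture. The linear model matrices, threshold, and gesture names here are invented for illustration and are not the parameters identified for the actual system.

import numpy as np

# Illustrative predictor-bin sketch for one axis: each bin holds a linear
# dynamic model (parameters would be tuned offline, e.g. by recursive
# least squares) that predicts the next [position, velocity] state.
# The model matrices below are made up for illustration.
DT = 1.0 / 30.0
PREDICTOR_BINS = {
    "slow_circle": np.array([[1.0, DT], [-0.5 * DT, 1.0]]),
    "fast_circle": np.array([[1.0, DT], [-8.0 * DT, 1.0]]),
    "linear_sweep": np.array([[1.0, DT], [0.0, 1.0]]),
}

def classify_step(state, next_state, threshold=0.05):
    """state, next_state: np.array([position, velocity]) for one axis."""
    residuals = {name: float(np.linalg.norm(A @ state - next_state))
                 for name, A in PREDICTOR_BINS.items()}
    best = min(residuals, key=residuals.get)
    return best if residuals[best] < threshold else None   # None == NULL

# Feed successive (state, next_state) pairs from the tracker:
print(classify_step(np.array([0.2, 1.0]), np.array([0.2 + DT, 0.999])))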

The identification module contains the majority of the

required processing. Compared to most of the systems

developed for gesture recognition (for example, see [1], and

[7]), the identification module requires relatively little

processing time and memory to identify each individual

gesture feature.

Just as the gesture recognition module is built on a bank of

predictor bins, the behavior recognition system is composed

of a bank of gesture recognition modules. Each module

focuses on a specific point of the body (such as features

acquired from a PLW system). As that point moves through

space, a “gesture” is generated and identified. The

combination of gestures from those points is what we define

as a motion behavior, which can be categorized and

identified. The simplicity of the behavior recognition

system is possible because of the strength and utility of the

gesture recognition modules.

Figure 5: Simplified Diagram of the Predictor Module

which Determines Gesture Characteristics.


Currently we are developing a Dempster-Shafer based

analysis, which uses belief functions to combine evidence to

determine a probability of an event. The results of this

analysis will be the subject of a future paper.
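As a preview of the mechanics only (not of our results), the Python sketch below applies Dempster's rule of combination to two hypothetical behavior cues over the frame of discernment {hostile, benign}; the mass values are invented for illustration.

from itertools import product

# Minimal Dempster's rule of combination over {hostile, benign};
# the cue masses are hypothetical examples, not system outputs.
def combine(m1, m2):
    """m1, m2: dicts mapping frozenset hypotheses to mass; returns m1 (+) m2."""
    combined, conflict = {}, 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb
    # Normalize by the non-conflicting mass (Dempster's rule).
    return {h: w / (1.0 - conflict) for h, w in combined.items()}

H, B = frozenset({"hostile"}), frozenset({"benign"})
EITHER = H | B
zone_cue = {H: 0.6, EITHER: 0.4}          # e.g. a zone violation
linger_cue = {H: 0.5, B: 0.2, EITHER: 0.3}  # e.g. lingering behavior
print(combine(zone_cue, linger_cue))       # fused belief masses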

SECTION VII: EXPERIMENTS, FUTURE

DIRECTIONS, AND CONCLUSIONS

We have developed an end-to-end vertical slice

demonstration of the system. All the modules have been

created with varying levels of functionality. We also have a

large data set of video for regression testing, and have

performed the intent detection described above on over

sixteen three-minute videos. The goal was to ensure end-to-

end functionality before expanding capabilities.

We performed experiments to test the behavior recognition

system. First, a variety of behaviors were performed and

served as the baseline for identification. Then these actions

were repeated and the data sent through the system.

The results of the end-to-end system are shown in Figure 6.

Here sixteen CCTV video feeds were individually processed

in real time according to the methods described above

(features were acquired and tracked, gestures recognized,

and behaviors identified). The outputs, with associated

tracking rectangles and red flashing alerts, were merged into

a four-by-four display as shown. Note that while this

merging was not done in real-time (and is part of our future

effort) the recognition and identification was performed in

real time with excellent results.

We are also exploring the identification of group behaviors,

based on combinations of behaviors from one or more

sensors in a CCTV environment. Such group behaviors

would include the recognition of IED placement activities

that require more than one person to bury the device and

plant the transmitter or lay and bury the triggering wire.

Any future developments must involve verification of the

behavior recognition system. Verification involves both the

testing of the algorithms and the use of modules in a

completely integrated system. To verify the system, our

plan is to:

- Test the system with a wide variety of users.
- Test the system using a large number of intents in our lexicon.

We strongly believe that these steps will verify the system,

thus facilitating its development into a full system useful in

a wide variety of applications.

Checkpoint duty and security tasks represent high-risk

missions for soldiers. The increase of asymmetric threats to

military and civilian assets throughout the world has

generated a critical need for enhanced perimeter security

systems, such as automated surveillance systems. Current

U.S. Military operations, such as those in Iraq and

Afghanistan, have greatly increased the number of U.S. assets

in harm’s way – which in turn has put a strain on our ability

to adequately protect all of our facilities. Appropriate

perimeter and facility security requires continuous indoor and

outdoor monitoring under a wide variety of environmental

conditions – typically these locations are complex in that they

include open perimeters and high traffic flow that would

strain even human observers.

To maximize the effectiveness of military and security

personnel, automated real-time methods for tracking,

identifying, and predicting the activities of individuals and

groups are crucial. The human resources available to such

tasks are constantly being reduced, while the prevention of

negative actions must be increased. Small numbers of

individuals are able to cause a large amount of damage and

terror by going after non-military secondary targets that still

have an adverse effect on the military’s ability to fulfill its

missions – the system presented here can be used to identify

behaviors that would be flagged for observation and

evaluation by human monitors based on their observed

threat, thereby reducing the tedium and making it possible

for one operator to accurately and attentively observe large

numbers of CCTV sensors.

REFERENCES

[1] Darrell, Trevor J. and Pentland, Alex P. (1993). “Space-Time Gestures.” In IEEE Conference on Computer Vision and Pattern Recognition, New York, NY.

[2] Johansson, G. (1973). Visual perception of biological

motion and a model for its analysis. Perception &

Psychophysics, 14 (2), 201-211.

[3] Kim, K. (2005). Real-time foreground-background segmentation using codebook model. Real-Time Imaging, 11(3), 167-256.

[4] Kozlowski, L.T. & Cutting, J.E. (1977). Recognizing the

sex of a walker from a dynamic point-light display.

Perception & Psychophysics, 21 (6), 575-580.

[5] Loula, F., Prasad, S., Harber, K. & Shiffrar, M. (2005).

Recognizing people from their movement. Journal of Experimental Psychology: Human Perception and Performance, 31(1), 210–220.

[6] Montepare, J.M., Goldstein, S.B. & Clausen, A. (1987).

The identification of emotions from gait information.

Journal of Nonverbal Behavior, 11(1), 33-42.

[7] Murakami, Kouichi and Taguchi, Hitomi (1991). “Gesture Recognition Using Recurrent Neural Networks.” Journal of the ACM, 1(1), 237-242.

[8] Narendra, Kumpati S. and Balakrishnan, Jeyendran (1994). “Improving Transient Response of Adaptive Control Systems using Multiple Models and Switching.” IEEE Transactions on Automatic Control, 39, 1861-1866.

[9] Pollick, F.E., Kay, J.W., Heim, K. & Stringer, R. (2005).

Gender recognition from point-light walkers. Journal of

Experimental Psychology: Human Perception and

Performance, 31(6), 1247–1265.

[10] Shipley, T.F. & Brumberg, J.S. (2003). Markerless

motion-capture for point-light displays. Technical Report,

Temple University Vision Laboratory, Philadelphia, PA.

[11] Troje, N.F. (2002). Decomposing biological motion: A

framework for analysis and synthesis of human gait

patterns. Journal of Vision, 2, 371-387.

[12] Fullenkamp, A. M. (in press). Point-light visualization

developer user guide. Technical Report, U.S. Air Force

Research Laboratory, Wright-Patterson Air Force Base, OH.