
PETS Metrics: On-Line Performance Evaluation Service

David P. Young and James M. Ferryman
Computational Vision Group, Department of Computer Science
The University of Reading, Reading, RG6 6AY, UK
[d.young][j.m.ferryman]@reading.ac.uk

Abstract

This paper presents the PETS Metrics On-line Evaluation Service for computational visual surveillance algorithms. The service allows researchers to submit their algorithm results for evaluation against a set of applicable metrics. The results of the evaluation processes are publicly displayed, allowing researchers to instantly view how their algorithm performs against previously submitted algorithms. The approach has been validated using seven motion segmentation algorithms.

1. Introduction

Performance evaluation is necessary to state whether the research community is making quantifiable progress in algorithm development. Previously there was a tendency to test and report algorithm results based on in-house datasets. This could lead to a bias of exaggerated performance due to datasets that are not representative of a particular problem. To combat this problem, standard datasets have been created: PETS, CAVIAR, and in the near future ETISEO and iLIDS. Even with standard datasets it can be difficult to confirm the performance of algorithms due to in-house metric evaluations and selective reporting of results. The next progressive step is to supply the research community with a central location where algorithms can be tested against standard datasets and a common set of metrics. The PETS Metrics web site aims to provide this service for the research community.

1.1. History of PETS & RelatedThe first Performance Evaluation of Tracking and Surveil-lance (PETS) workshop [1] was held in Grenoble on March31st 2000, in conjunction with the IEEE Face and GestureRecognition conference. It was realised that the growth inthe development of the visual surveillance field had not beenmet with complementary systematic performance evalua-tion of developed techniques. It has been especially difficultto make comparisons between published algorithms in theliterature if they have been tested on different datasets underwidely varying conditions. The PETS workshop was insti-gated to address this issue. The workshop was unique in

that all participants tested algorithms and presented resultsbased on the same published dataset. Since 2000, a fur-ther six PETS (and VS-PETS) workshops have been held incollaboration with major conferences, examining a range ofsurveillance data.

Recently, there has been a growing number of additional activities which aim to further address performance evaluation in a surveillance context. These include project ETISEO, which commenced in 2005 (funded by the French Ministry of Science) and “seeks to work out a new structure contributing to an increase in the evaluation of video scene understanding”^1, and satellite workshops including the Real-Time Event Detection Solutions (CREDS) for Enhanced Security and Safety in Public Transportation. This latter activity was held in conjunction with the 2005 IEEE International Conference on Advanced Video and Signal based Surveillance. Table 1 summarises representative current efforts in publicly available surveillance datasets and performance evaluation.

1.2. Performance Evaluation and Surveillance

In the context of visual surveillance, Cavallaro et al. [2] propose an “automatic” approach to the objective evaluation of segmentation results. The method is used to compare and rank change detection results for three different algorithms based on a metric which weights judgement of spatial accuracy and temporal coherence against a reference segmentation. Beleznai et al. [3] present a quantitative assessment of two motion segmentation algorithms performed using error metrics based on ground truth data, both with and without the use of spatial context. Erdem et al. [4] evaluate the performance of metrics for object segmentation when ground truth segmentation maps are not available.

Recent work by Brown et al. [5] developed a new method for evaluating the performance of background subtraction and tracking, including a track evaluation based on matching ground truth tracks to system tracks in a two-way matching process. Ellis [6] examined the main requirements for effective performance analysis of surveillance systems and proposed a number of new metrics. Black et al.

^1 ETISEO web site located at: www.silogic.fr/etiseo


Project | Description | Timescale
PETS (The University of Reading, UK) | Series of workshops on performance evaluation. Each workshop focuses on specific dataset(s) | 2000 - current
iLIDS (HOSDB, UK) | Collection of real-world CCTV test imagery (4 surveillance scenarios) | 2004 - current
VERAAE (ARDA, US) | Comparative study of Video Event Recognition (VER) algorithms. Focus is on the surveillance domain | 2005 - current
ARDA VACE (NIST, US) | Develop revolutionary advances in automatic video extraction, understanding, and multimodal fusion; surveillance is one of the evaluation domains | Three 2-year phases; currently in phase 2
CAVIAR EC funded project (IST 2001 37540) | Set of ground truthed datasets. Domain is public space and shopping surveillance | October 2002 - September 2005
ETISEO (France) | French Government funded project to evaluate vision techniques for video surveillance applications | 2005 - 2006

Table 1: Summary of representative current efforts in Performance Evaluation of Surveillance

[7] presented a novel framework for performance evaluation employing pseudo-synthetic video. The most relevant work to PETS Metrics is the recent work undertaken by Collins et al. [8] on an open source tracking test bed and evaluation web site. The web site developed by Collins et al. [8] provides an on-line algorithm evaluation service. However, the evaluation is limited to the specific task of tracking ground vehicles from airborne video.

1.3. PETS Metrics

PETS Metrics has been developed to be both an ancillary and complementary mechanism to the traditional PETS workshop event. The overall aim is to provide an automatic mechanism to compare, in a quantitative manner, a selection of algorithms operating on the same data. PETS Metrics automates the performance evaluation process and provides to the community an evolving online repository of datasets, metrics and results. The approach is different from related activities such as ETISEO, in which the overall coordinator performs the evaluation of submitted results. In PETS Metrics, results are uploaded directly to the web site, automatically evaluated and presented ranked alongside other algorithms.

A principal motivation behind PETS Metrics was the evaluation results presented during PETS 2001. At this second PETS workshop, a requirement was set that submitted papers were accompanied by algorithmic results output in XML format. The coordinator then reconstructed the object detection and tracking results from the XML files, which provided for a qualitative comparison of a number of algorithms operating on the same PETS video sequences. A significant outcome of this process was that it allowed specific sub-problems within the surveillance task, for example ensuring the maintenance of the identity of tracked objects through partial occlusion, to be studied; specifically, which algorithms succeeded and which failed at this task for a given sequence. This led to recommendations on how to

combine the advantages of different algorithms to produce a single, more robust tracking algorithm. A significant aim of PETS Metrics is to extend the evaluation methodology to:

1. Provide an online, evolving repository of datasets, metrics and results
2. Allow for automatic evaluation of submitted results
3. Provide quantitative results which may be viewed ranked by metric

2. PETS Metrics Web Site

2.1. Web Site Overview

The PETS Metrics web site, shown in Figure 1, is the interface for researchers to submit their algorithm results and view the algorithm performance against a set of applicable metrics. For each metric the web site publicly shows a ranking table of how the submitted algorithms have performed against each other.

The site is scalable to accommodate any number of metrics for any particular research area or activity. The site currently includes, but is not limited to, metrics for motion segmentation as described in [9]. In the future the metrics will be extended to cover further visual surveillance research tasks, for example tracking and categorisation.

The site is also scalable to any number of video datasets. Due to the time-intensive nature of ground truthing, PETS Metrics currently measures algorithms against Dataset 1 Camera 1 of the PETS 2001 dataset. This will be expanded to cover the whole PETS 2001 dataset and further datasets with available ground truth.

2.2. Algorithm Results’ Files

Collectively, researchers develop their algorithms on differing computing platforms, use a variety of programming languages and typically store their algorithm results in their own data structures. For PETS Metrics to evaluate differing


Figure 1: Home page of the PETS Metrics web site

researchers’ algorithms, there must therefore exist a standard file format for the submission of results. To resolve this issue, PETS Metrics requires that an algorithm’s result file is submitted as an XML file formatted to the PETS Metrics XML Schema. An XML Schema is a definition of how to legally construct an XML file. Figure 2 diagrammatically shows the XML elements used in the PETS Metrics XML Schema.^2

XML is suited to PETS Metrics’ results submission as it is a non-proprietary, internationally recognised, platform and programming language independent data format. Furthermore, XML can be easily generated by a programming language’s normal file writing functions, e.g. fwrite() in C or ofstream in C++.^3
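As a rough illustration of this point, the fragment below writes a results file with ofstream. The element and attribute names used here (resultset, frame, object, box) are placeholders only, not the official PETS Metrics schema, which should be taken from the schema definition published on the web site.

  #include <fstream>

  // Minimal sketch: emit an algorithm's per-frame results as XML via ofstream.
  // NOTE: the element names below are illustrative placeholders; consult the
  // PETS Metrics XML Schema on the project web site for the real element set.
  int main() {
      std::ofstream out("results.xml");
      out << "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
          << "<resultset>\n"
          << "  <frame number=\"0\">\n"
          << "    <object id=\"1\">\n"
          << "      <box x=\"120\" y=\"85\" w=\"40\" h=\"90\"/>\n"
          << "    </object>\n"
          << "  </frame>\n"
          << "</resultset>\n";
      return 0;
  }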

2.3. Submission Process

Submission of algorithm results to the PETS Metrics web site consists of five stages. These are summarised below, with a generalised flow diagram shown in Figure 3.

1. Entering your contact details
2. Entering details on the algorithm
3. Uploading the XML results file
4. Final check of inputted data
5. Confirmation of submission

The submitter will first complete two web form questionnaires in order to submit an algorithm’s results to the site. The first questionnaire captures the submitter’s contact details to allow future correspondence^4, whilst the second questionnaire captures details regarding the algorithm; these include: a short name, the frames per second (FPS) achieved by the algorithm, the speed of the executing computer, a description of the algorithm, and the task the algorithm was developed for (one selection from a list). It is required to know for what task the algorithm was developed in order to apply the correct metrics to the results file. The FPS and the speed of the computer are required to analyse the approximate processing load of the algorithm.

Figure 2: PETS Metrics XML Schema

^2 Details regarding the schema can be found on the PETS Metrics web site located at http://petsmetrics.net
^3 C++ code for generating XML files that follow the PETS Metrics schema is available for download from the PETS Metrics site
^4 This information will not be publicly displayed on the web site

To upload a file to PETS Metrics the user selects a file from their computer. Upon upload, the PETS Metrics site will validate the XML file. The PETS Metrics XML parser will attempt to extract all valid information. For flexibility, the parser is tolerant of certain errors in the formatting of the XML file. However, submitters should not assume the parser is tolerant of all forms of errors. Illegally formatted XML elements are ignored, which may affect the algorithm’s rankings by the site’s metrics.

Prior to the confirmation of a successful submission to PETS Metrics, the submitter has a final opportunity to correct any questionnaire entries. On confirmation by the submitter that the questionnaire information is correct, the PETS Metrics web site displays a successful submission web page and starts processing the results file against the appropriate metrics.

Figure 3: Generalised flow diagram of the submission process to PETS Metrics

2.4. Algorithm Ranking Display

By selecting the Ranking Table web page of the PETS Metrics web site, a user can view a table of the current metrics for the web site with their associated descriptions. By selecting a metric, a web page will appear showing the algorithm ranking table for that metric, as shown in Figure 4. A particular metric may have many constituent parts, e.g. a value for true positives, false positives, etc. These constituent parts are displayed as column headers in the ranking table for that metric. The algorithms in the ranking table for that metric can be sorted into different orders depending on the constituent part selected.

To supplement the information in the ranking table, a second table of information regarding the algorithms is also shown. This second table provides the algorithm details captured in the second questionnaire (see Section 2.3). This allows users of PETS Metrics to identify which computational vision methods are performing better.

3. Example Results of the Service

PETS Metrics currently evaluates motion segmentation algorithms using the metrics described in [9]. Currently, 240 foreground objects from 100 frames of the test sequence of Dataset 1 Camera 1 of the PETS 2001 Dataset have been ground truthed.

Figure 4: Screenshot of a Metric’s Algorithm Evaluation Table

3.1. Ground Truthing

Metrics evaluate algorithms against a ground truth that is assumed to be correct. Generating ground truth data is a largely manual, repetitive process where semi-automation should only be introduced where it will not bias the ground truth data.

In practice, ground truth data collection is subject to the systematic subjective error of the individuals collecting the data. For example, in ground truthing object boundaries, the quantisation of the real world to pixels and JPEG noise can make it hard to exactly define a boundary. In this case, systematic noise is added to the ground truth: when comparing individuals, some may systematically tend to define the object boundary closer to or further away from the object.

Ideally, a given sequence should be ground truthed a number of times by the same person (or by different people) and the results averaged. This, however, is usually unrealistic due to the economics and time required. It is therefore practically impossible to state that the collected ground truth data will actually be 100% correct. This will always lead to a level of error in the reported results of metrics against the results of an algorithm.

Ground Truthing Assumptions. Identifying and informing researchers of assumptions in the ground truth data is of vital importance to ensure all algorithms are evaluated fairly. For example, in annotating scene event data on recognising unattended luggage, one would have to state the minimum distance and time at which left luggage is defined as unattended in the ground truth data. Information regarding the parameters for a particular surveillance task, used when collecting the ground truth data, is available from the PETS Metrics web site.

Ground Truthing Annotation Tools (GTAT). GTATs are required for users to input ground truth data to the computer. The first set of metrics implemented for PETS Metrics were for motion segmentation, so a tool for obtaining motion segmentation ground truth was selected. Two popular GTATs for bounded box segmentation of objects, which are freely available, are Viper^5 and the CAVIAR^6 project tool. Although these are sophisticated tools, for PETS Metrics the Reading University GTAT from the AVITRACK^7 project was adapted. This tool, shown in Figure 5, is a multiple video stream bounded box object segmentation and description annotation tool with linear object motion prediction. The tool outputs ground truth annotation in XML. It was adapted to allow the input of motion segmentation data.

Ground truthing with the Reading University GTAT was performed in four steps:

1. First, ensure each unique physical mobile object is given a unique ID and classification (e.g. person, car, etc.) that references it in each frame.

2. Next, for each unique physical mobile object in each frame, ensure it is encapsulated by an overly large bounded box. The automatic linear motion prediction aided rapid addition of bounding boxes to selected preceding or successive frames.

3. Next, segment each object in its bounded box from the background by identifying foreground boundary pixels around the object and using the segmentation foreground “flood fill” feature to fill the pixels inside the boundary.

4. Next, use the automatic tighten bounding box feature to ensure the bounded box only extends to the minimum and maximum x-y dimension values of the object foreground data.

3.2. Experimental Results

At present, PETS Metrics has four metrics implemented for the evaluation of Motion Segmentation, as described in [9]:

• Negative Rate Metric (NR)
• Misclassification Penalty Metric (MP)
• Rate of Misclassifications Metric (RM)
• Weighted Quality Measure Metric (WQM)

For all the Motion Segmentation metrics, the lower the score, the better the algorithm is at correctly segmenting foreground that matches the ground truth foreground segmentation. All of the metrics are the sum of two parts: a false positive score and a false negative score. A low false positive score describes good object boundary identification. A low false negative score describes good identification of foreground internal to the object.

^5 http://viper-toolkit.sourceforge.net
^6 http://homepages.inf.ed.ac.uk/rbf/CAVIAR/
^7 http://www.avitrack.net/

Figure 5: Ground Truth Annotation Tool developed by The University of Reading. Images courtesy of the AVITRACK project

The first metric, the Negative Rate Metric (NR), as shown in Equations 1 to 3, evaluates a false negative rate ($NR_{fn}$) and a false positive rate ($NR_{fp}$). This metric is based on pixel-wise mismatches between the ground truth and observations in a frame [6].

$$NR = \frac{1}{2}(NR_{fn} + NR_{fp}) \qquad (1)$$

where

$$NR_{fn} = \frac{N_{fn}}{N_{tp} + N_{fn}} \qquad (2)$$

$$NR_{fp} = \frac{N_{fp}}{N_{fp} + N_{tn}} \qquad (3)$$

where $N_{fn}$ and $N_{fp}$ denote the number of false negative and false positive pixels respectively, and $N_{tn}$ and $N_{tp}$ are the number of true negatives and true positives. It should be noted that the Negative Rate Metric can only be used to give a general indication of specific object segmentation.
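As a minimal sketch of how Equations 1 to 3 can be computed (assuming the per-frame pixel counts have already been obtained by comparing an algorithm's segmentation mask with the ground truth mask; the struct and function names here are illustrative only):

  #include <cstdio>

  // Illustrative per-frame pixel counts (hypothetical values).
  struct PixelCounts {
      double tp, tn, fp, fn;  // true/false positive/negative pixel totals
  };

  // Negative Rate Metric (Equations 1-3): mean of the false negative rate
  // and the false positive rate, derived from pixel-wise mismatches.
  double negativeRate(const PixelCounts& c) {
      double nrFn = c.fn / (c.tp + c.fn);  // Equation 2
      double nrFp = c.fp / (c.fp + c.tn);  // Equation 3
      return 0.5 * (nrFn + nrFp);          // Equation 1
  }

  int main() {
      PixelCounts c{12000.0, 280000.0, 600.0, 5000.0};  // hypothetical counts
      std::printf("NR = %f\n", negativeRate(c));
      return 0;
  }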

The second metric, the Misclassification Penalty Metric (MP), as shown in Equations 4 to 6, evaluates an algorithm’s object segmentation against the ground truth on an object-by-object basis. Misclassified pixels are penalised by their distances from the ground truth reference object’s border.

$$MP = \frac{1}{2}(MP_{fn} + MP_{fp}) \qquad (4)$$

where

$$MP_{fn} = \frac{\sum_{j=1}^{N_{fn}} d_j^{fn}}{D} \qquad (5)$$

$$MP_{fp} = \frac{\sum_{k=1}^{N_{fp}} d_k^{fp}}{D} \qquad (6)$$

where $d_j^{fn}$ and $d_k^{fp}$ are the distances of the $j$th false negative and $k$th false positive pixel from the contour of the reference segmentation. The normalising factor $D$ is the sum over all the pixel-to-contour distances of objects in a frame. This metric describes how well an algorithm can extract a specific physical object by penalising misclassified pixels based on their distance from an object’s boundary. If an algorithm has a low $MP$ score it is good at identifying an object’s boundary and segmenting a physical object from the scene.

The third metric, the Rate of Misclassifications Metric (RM), as shown in Equations 7 to 9, evaluates an algorithm’s average erroneously segmented pixel distance to an object’s border, in units of pixels.

$$RM = \frac{1}{2}(RM_{fn} + RM_{fp}) \qquad (7)$$

where

$$RM_{fn} = \frac{1}{N_{fn}} \sum_{j=1}^{N_{fn}} \frac{d_j^{fn}}{D_{diag}} \qquad (8)$$

$$RM_{fp} = \frac{1}{N_{fp}} \sum_{k=1}^{N_{fp}} \frac{d_k^{fp}}{D_{diag}} \qquad (9)$$

$N_{fn}$ and $N_{fp}$ denote the number of false negative and false positive pixels respectively, and $D_{diag}$ is the diagonal distance of the frame. This metric is similar to the MP metric but uses the number of false negative or false positive pixels as the normalising factor for $RM_{fn}$ and $RM_{fp}$ respectively, as opposed to $D$ in Equations 5 and 6. This metric therefore evaluates the average degree of error when errors occur, rather than the average quantity of error that occurs.
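Under the same assumptions as the MP sketch above (precomputed pixel-to-contour distances; illustrative names), Equations 7 to 9 change only the normalisation:

  #include <cstdio>
  #include <numeric>
  #include <vector>

  // Rate of Misclassifications (Equations 7-9): average misclassified-pixel
  // distance to the object border, normalised by the frame diagonal dDiag.
  // The sketch assumes both distance vectors are non-empty.
  double rateOfMisclassifications(const std::vector<double>& dFn,
                                  const std::vector<double>& dFp,
                                  double dDiag) {
      double rmFn = std::accumulate(dFn.begin(), dFn.end(), 0.0) / (dFn.size() * dDiag);  // Eq. 8
      double rmFp = std::accumulate(dFp.begin(), dFp.end(), 0.0) / (dFp.size() * dDiag);  // Eq. 9
      return 0.5 * (rmFn + rmFp);                                                         // Eq. 7
  }

  int main() {
      std::vector<double> dFn{1.0, 2.0, 1.5};  // hypothetical distances
      std::vector<double> dFp{3.0, 0.5};
      std::printf("RM = %f\n", rateOfMisclassifications(dFn, dFp, 800.0));
      return 0;
  }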

The fourth metric, the Weighted Quality Measure Metric (WQM), as shown in Equations 10 to 14, quantifies the spatial discrepancy between the estimated motion segmentation and the ground truthed reference object motion segmentation. This is measured as the sum of the weighted effects of false positive and false negative pixels.

$$WQM = \ln\left(\frac{1}{2}(WQM_{fn} + WQM_{fp})\right) \qquad (10)$$

where

$$WQM_{fn} = \frac{1}{N_{fn}} \sum_{j=1}^{N_{fn}} w_{fn}(d_j^{fn})\, d_j^{fn} \qquad (11)$$

$$WQM_{fp} = \frac{1}{N_{fp}} \sum_{k=1}^{N_{fp}} w_{fp}(d_k^{fp})\, d_k^{fp} \qquad (12)$$

where $N$ is the area of the reference object in pixels. Following the argument in the work of Aguilera et al. [9] that the visual importance of false positives and false negatives is not the same, and thus that they should be treated differently, the weighting functions $w_{fp}$ and $w_{fn}$ are used, where:

$$w_{fp}(d_{fp}) = B_1 + \frac{B_2}{d_{fp} + B_3} \qquad (13)$$

$$w_{fn}(d_{fn}) = C \cdot d_{fn} \qquad (14)$$

PETS Metrics uses the same constant values for $B_1$, $B_2$, $B_3$ and $C$ as [9], namely $B_1 = 19$, $B_2 = -178.125$, $B_3 = 9.375$ and $C = 2$. The weighting with these parameter values favours algorithms which provide larger foreground estimations over conservative ones.
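A sketch of Equations 10 to 14 using the constant values above; as with the previous sketches, the per-pixel distances are assumed to be precomputed and the function names are illustrative:

  #include <cmath>
  #include <cstdio>
  #include <vector>

  // Weighting functions of Equations 13 and 14 with the constants used by
  // PETS Metrics (from [9]): B1 = 19, B2 = -178.125, B3 = 9.375, C = 2.
  double wFp(double d) { return 19.0 + (-178.125) / (d + 9.375); }  // Equation 13
  double wFn(double d) { return 2.0 * d; }                          // Equation 14

  // Weighted Quality Measure (Equations 10-12): weighted averages of false
  // negative and false positive distances, combined on a log scale.
  // The sketch assumes both distance vectors are non-empty.
  double weightedQualityMeasure(const std::vector<double>& dFn,
                                const std::vector<double>& dFp) {
      double sumFn = 0.0, sumFp = 0.0;
      for (double d : dFn) sumFn += wFn(d) * d;   // Equation 11 (numerator)
      for (double d : dFp) sumFp += wFp(d) * d;   // Equation 12 (numerator)
      double wqmFn = sumFn / dFn.size();
      double wqmFp = sumFp / dFp.size();
      return std::log(0.5 * (wqmFn + wqmFp));     // Equation 10
  }

  int main() {
      std::vector<double> dFn{1.0, 2.0, 1.5};  // hypothetical distances
      std::vector<double> dFp{3.0, 0.5};
      std::printf("WQM = %f\n", weightedQualityMeasure(dFn, dFp));
      return 0;
  }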

3.2.1 Metric Evaluation of Submitted Algorithms

Seven motion segmentation algorithms, listed below, have been published on the PETS Metrics web site. It should be noted that each algorithm uses the same colour tracking algorithm implementation, as described in [10], which utilises the object’s colour histogram as the colour model to represent objects. The HSV colour space is used for the colour model: a 2D hue-saturation histogram plus a 1D value histogram for representing the “colourless” pixels. The tracking algorithm is only used to feed back to the motion segmentation algorithm when an object becomes stationary or recommences motion.

1. Brightness and Chromaticity (BC) [11]
2. Five Frame Differencing (DIF)
3. Edge Fusion (EDG) [12]
4. Gaussian Mixture Model (GMM) [13]
5. Kernel Density Estimation (KDE) [14]
6. Colour Mean and Variance (VAR) [15]
7. Wallflower Linear Prediction Filter (WFL) [16]

The PETS Metrics evaluation service was used to evaluate the algorithms against the currently implemented metrics. The results are shown in Tables 2 to 5.

Algorithm | False Negative Rate | False Positive Rate | Negative Rate (NR)
BC | 0.3535 | 0.001381 | 0.1773
DIF | 0.3500 | 0.002776 | 0.1764
EDG | 0.3293 | 0.001843 | 0.1656
GMM | 0.3017 | 0.002203 | 0.1520
KDE | 0.3026 | 0.001942 | 0.1523
VAR | 0.3308 | 0.001800 | 0.1663
WFL | 0.2435 | 0.002336 | 0.1229

Table 2: NR Metric evaluation of algorithms

Table 2, the Negative Rate metric (NR), broadly shows a symmetry whereby algorithms that erroneously over-segment objects (high fp) are better at segmenting all of an object’s foreground pixels (low fn), and vice versa. This is to be expected, since correctly segmenting nearly all of an object’s pixels indicates a resilience to noise, yet this resilience generally creates errors in identifying the object’s boundary; thus over-segmentation occurs. The exception is the frame differencing method, relative to the other algorithms, where poor segmentation is shown as high values for both the false positive and false negative rates.

Algorithm | False Negatives MP | False Positives MP | Misclassification Penalty (MP)
BC | 0.939 | 0.097 | 0.518
DIF | 0.459 | 9.909 | 5.183
EDG | 0.320 | 7.928 | 4.123
GMM | 0.294 | 7.827 | 4.060
KDE | 0.321 | 5.968 | 3.144
VAR | 0.322 | 7.736 | 4.029
WFL | 0.242 | 7.521 | 3.882

Table 3: MP Metric evaluation of algorithms

Algorithm | False Negatives RM | False Positives RM | Rate of Misclassifications (RM)
BC | 0.342 | 0.361 | 0.352
DIF | 0.351 | 0.171 | 0.261
EDG | 0.281 | 0.084 | 0.186
GMM | 0.276 | 0.076 | 0.176
KDE | 0.264 | 0.076 | 0.170
VAR | 0.280 | 0.084 | 0.182
WFL | 0.278 | 0.083 | 0.181

Table 4: RM Metric evaluation of algorithms

Table 3, the Misclassification Penalty metric (MP), shows the symmetry between false positive and false negative segmentation to a lesser extent. However, the MP metric can provide interesting comparisons between the algorithms due to its method of measuring the boundary distances of erroneous pixels, as opposed to the NR metric’s method of counting erroneous pixels. According to the NR metric, the Wallflower algorithm responded best in correctly identifying an object’s foreground pixels up to its boundary (lowest fn score) yet was nearly the worst at over-segmenting the object (high fp score). However, the MP metric shows the Wallflower algorithm again has the lowest fn score yet, in contrast to the NR metric, its fp score under the MP metric is lower than those of most of the submitted algorithms. Relative to the other algorithms, the NR metric shows that the Wallflower algorithm constantly over-segments, yet the MP shows the over-segmentation is quite close to the ground truth segmentation data.

Table 4, the Rate of Misclassifications metric (RM), shows the average distance of misclassified pixels to an object’s boundary. This metric shows the five best algorithms (KDE, WFL, GMM, VAR and EDG) have approximately the same average distance of erroneously segmented pixels to an object’s border. Therefore, according to the RM metric, when errors in these algorithms’ segmentation occur, in whatever number, the errors are approximately of the same degree. Interestingly, the Differencing algorithm does not perform the worst in this metric. This shows that, although we know from the other metrics that it has the most errors, the errors on average are not as serious as those of the Brightness and Chromaticity algorithm.

Algorithm | Log_e of WQM_fn | Log_e of WQM_fp | Weighted Quality Measure (WQM)
BC | 5.575 | 1.236 | 5.571
DIF | 5.854 | 1.415 | 5.837
EDG | 5.701 | 1.271 | 5.723
GMM | 5.720 | 1.217 | 5.746
KDE | 5.650 | 1.195 | 5.666
VAR | 5.698 | 1.270 | 5.721
WFL | 5.743 | 1.056 | 5.769

Table 5: WQM Metric evaluation of algorithms

Table 5, the Weighted Quality Measure (WQM) metric, shows an evaluation based upon the visual appearance of the segmented object, where false negative pixels are deemed a worse error than false positive pixels. With the current parameter values, Table 5 reports that the Brightness and Chromaticity motion segmentation algorithm scores well, showing the algorithm tends to over-, rather than under-, segment objects. However, the WQM scores of the algorithms are all in a similar range, which makes a meaningful comparison of the evaluation difficult.

Algorithm Ranking. By ranking the algorithms in Tables 2 to 5 and averaging the ranked positions, the algorithms’ overall ranked positions are obtained, as shown in Table 6 (a sketch of this averaging step follows the table):

Algorithm | Average | Ranking Position
BC | 5.75 | 6
DIF | 6.25 | 7
EDG | 4.5 | 4
GMM | 2.5 | 2
KDE | 2.75 | 3
VAR | 4.75 | 5
WFL | 1.5 | 1

Table 6: Algorithm Ranking
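As an illustration of this averaging step only (not the site's actual implementation, and with made-up rank positions), the combined ranking could be computed as follows:

  #include <cstdio>
  #include <map>
  #include <string>
  #include <vector>

  // Average each algorithm's per-metric rank positions; a lower average
  // means a better overall ranking. The rank values below are illustrative.
  int main() {
      std::map<std::string, std::vector<int>> ranks = {
          {"ALG-A", {1, 3, 3, 6}},  // hypothetical ranks under NR, MP, RM, WQM
          {"ALG-B", {2, 5, 2, 5}},
      };
      for (const auto& [name, r] : ranks) {
          double sum = 0.0;
          for (int pos : r) sum += pos;
          std::printf("%s average rank = %.2f\n", name.c_str(), sum / r.size());
      }
      return 0;
  }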

For the PETS 2001 Dataset 1 Camera 1 sequence, the WFL algorithm has been shown to be the best-performing algorithm for motion segmentation. Although the WFL performed well, for other video sequences created in different conditions^8 the algorithm may not perform to the same accuracy. It is an aim of PETS Metrics to increase the number of ground truthed video sequences.

Furthermore, the practicality of algorithms must be considered, as the WFL and KDE algorithms run at 3 and 5 frames per second, respectively, and both require approximately 1 GB of memory. In contrast, when run on the same computer, the GMM algorithm, ranked third, ran in real time and required 60 MB of memory.

^8 E.g. inconstant lighting, an indoor scenario, grayscale video, etc.


4. Evaluation of the Evaluation

Parameters of Metrics. Evaluation metrics can contain parameters that control how an algorithm’s result data is evaluated against the ground truth data. These parameters must be precisely and openly communicated to users of the metric. An example of parameters in a metric is the object-by-object evaluation of the MP metric. Users should be informed of how the metric matches objects found by the algorithm to the objects in the ground truth, e.g. matching by Object IDs or by matching bounding boxes. A poor awareness of the metric’s assumptions could lead to a good algorithm being evaluated as poor.

An example of this is a researcher who is interested in motion segmentation and not tracking. This researcher may set all Object IDs to the same value. If object matching were performed by Object ID, the researcher would receive an unfair evaluation of their algorithm. Furthermore, users of the metric must be informed of what occurs when a metric fails to match ground truth to algorithm result data.

Selection of Metrics. To alleviate problems with assumptions in metrics, one could implement various specialised versions of a metric which handle differing assumptions. However, the purpose of a metric is not only to evaluate but also to use the evaluation in a meaningful comparison of algorithms. By specialising a metric it may become applicable to only a minority of algorithms, thus rendering it difficult to perform a meaningful performance comparison against the majority of algorithms.

An ideal evaluation process would be a single metric that could evaluate any algorithm for a particular task. A single score by which to rank algorithms would make comparison of algorithms easy. However, as metrics become generalised, to cover an increasing set of possible algorithms, they can become insensitive to the subtleties of an algorithm’s performance. This leads to problems in creating a meaningful comparison between algorithms, as many will score practically the same values.

Metrics Discussion Summary. Metrics must be chosen that are neither too specialised to one application nor too generalised to a large set of applications. A number of metrics will therefore have to exist to evaluate a particular problem. For a proper comparison of algorithms, the results from a set of metrics will have to be cross-referenced against each other. Care must be taken not to include a large number of metrics, as a meaningful comparison can be hampered by too much information to be cross-referenced.

5. Summary and Conclusions

This paper has presented a new web-based, fully automatic service to evaluate the performance of computational visual surveillance algorithms.

The paper discussed issues in the ground truthing process and issues concerning the use of metrics.

The paper described the metrics which are currently implemented in the PETS Metrics web site, and showed and discussed the results of PETS Metrics’ evaluations of seven submitted motion segmentation algorithms.

Future work for the PETS Metrics project includes implementing metrics for further tasks including object categorisation, tracking and event detection.

References

[1] J. M. Ferryman. PETS. IEEE Int. Workshops on Performance Evaluation of Tracking and Surveillance (PETS), http://www.visualsurveillance.org, 2000-2005.

[2] A. Cavallaro et al. Objective evaluation of segmentation quality using spatio-temporal context. IEEE International Conference on Image Processing, 2002.

[3] C. Beleznai, T. Schlögl, H. Ramoser, M. Winter, H. Bischof and W. Kropatsch. Quantitative evaluation of motion detection algorithms for surveillance applications. 27th Workshop of the Austrian Association for Pattern Recognition, pages 205-212, 2003.

[4] C-E. Erdem and B. Sankur. Performance evaluation metrics for object-based video segmentation. 10th European Signal Processing Conference (EUSIPCO), 2000.

[5] L. M. Brown, A. W. Senior, Y-L. Tian, J. Connell, A. Hampapur, C-F. Shu, H. Merkl and M. Lu. Performance evaluation of surveillance systems under varying conditions. IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (WAMOP-PETS), pages 1-8, 2005.

[6] T. Ellis. Performance metrics and methods for tracking in surveillance. 3rd IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS), pages 26-31, 2002.

[7] J. Black, T. Ellis and P. Rosin. A novel method for video tracking performance evaluation. Joint IEEE Int. Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (VS-PETS), pages 125-132, 2003.

[8] R. Collins, X. Zhou and S-K. Teh. An open source tracking testbed and evaluation web site. IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (WAMOP-PETS), pages 17-24, 2005.

[9] J. Aguilera, H. Wildenauer, M. Kampel, M. Borg, D. Thirde and J. Ferryman. Evaluation of motion segmentation quality for aircraft activity surveillance. Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (VS-PETS), in this volume, 2005.

[10] G. Bradski. Computer vision face tracking for use in a perceptual user interface. Intel Technology Journal, Q2, 1998.

[11] T. Horprasert, D. Harwood and L. S. Davis. A statistical approach for real-time robust background subtraction and shadow detection. IEEE ICCV'99 FRAME-RATE Workshop, 1999.

[12] S. Jabri, Z. Duric, H. Wechsler and A. Rosenfeld. Detection and location of people in video images using adaptive fusion of color and edge information. Proc. IAPR International Conference on Pattern Recognition, pages 4627-4631, 2000.

[13] C. Stauffer and W. E. L. Grimson. Adaptive background mixture models for real-time tracking. Proc. International Conference on Pattern Recognition, pages 246-252, 1999.

[14] A. Elgammal, D. Harwood and L. Davis. Non-parametric model for background subtraction. 6th European Conference on Computer Vision, Dublin, Ireland, 2:751-767, 2000.

[15] C. R. Wren, A. Azarbayejani, T. Darrell and A. Pentland. Pfinder: Real-time tracking of the human body. IEEE Transactions on PAMI, 19(7):780-785, 1997.

[16] K. Toyama, J. Krumm, B. Brumitt and B. Meyers. Wallflower: Principles and practice of background maintenance. IEEE International Conference on Computer Vision, pages 255-261, 1999.
