abstract - arxiv · 2016-11-28 · discriminative correlation filter with channel and spatial...

Discriminative Correlation Filter with Channel and Spatial Reliability

Alan Lukezic1, Tomas Vojır2, Luka Cehovin1, Jirı Matas2, Matej Kristan1

1 Faculty of Computer and Information Science, University of Ljubljana, Slovenia2 Faculty of Electrical Engineering, Czech Technical University in Prague, Czech Republic

{alan.lukezic, luka.cehovin, matej.kristan}@fri.uni-lj.si{vojirtom, matas}@cmp.felk.cvut.cz

Abstract

Short-term tracking is an open and challenging prob-lem for which discriminative correlation filters (DCF) haveshown excellent performance.We introduce the channel andspatial reliability concepts to DCF tracking and provide anovel learning algorithm for its efficient and seamless in-tegration in the filter update and the tracking process. Thespatial reliability map adjusts the filter support to the part ofthe object suitable for tracking. This both allows to enlargethe search region and improves tracking of non-rectangularobjects. Reliability scores reflect channel-wise quality ofthe learned filters and are used as feature weighting coeffi-cients in localization. Experimentally, with only two simplestandard features, HoGs and Colornames, the novel CSR-DCF method – DCF with Channel and Spatial Reliability– achieves state-of-the-art results on VOT 2016, VOT 2015and OTB100. The CSR-DCF runs in real-time on a CPU.

1. IntroductionShort-term visual object tracking is the problem of con-

tinuously localizing a target in a video-sequence given asingle example of its appearance. It has received signifi-cant attention of the computer vision community which isreflected in the number of papers published on the topic andthe existence of multiple performance evaluation bench-marks [43, 29, 30, 26, 27, 33, 38, 36]. Diverse factors –occlusion, illumination change, fast object or camera mo-tion, appearance changes due to rigid or non-rigid deforma-tions and similarity to the background – make short-termtracking challenging.

Recent short-term tracking evaluations [43, 29, 30, 26]consistently confirm the advantages of semi-supervised dis-criminative tracking approaches [17, 1, 18, 4]. In partic-ular, trackers based on the discriminative correlation filtermethod (DCF) [4, 8, 20, 31, 10] have shown state-of-the-art performance in all standard benchmarks. Discrimina-tive correlation methods learn a filter with a pre-defined re-

Spatialy constrained

filters

Training patch

Final response

Localization stage:

Learning - Update stage:Spatial map

Test patch

Filter responses

Channelweights

=

Channelweights

Figure 1: Overview of the proposed CSR-DCF approach.An automatically estimated spatial reliability map restrictsthe correlation filter to the parts suitable for tracking (top)improving the search range and performance for irregularlyshaped objects. Channel reliability weights calculated inthe constrained optimization step of the correlation filterlearning reduce the noise of the weight-averaged filter re-sponse (bottom).

sponse on the training image.

The standard formulation of DCF uses circular correla-tion which allows to implement learning efficiently by FastFourier transform (FFT). However, the FFT requires the fil-ter and the patch size to be equal which limits the detectionrange. Due to the circularity, the filter is trained on many ex-amples that contain unrealistic, wrapped-around circularly-shifted versions of the target. These windowing problemswere recently addressed by state-of-the-art approaches ofGaloogahi et al. [24] who propose zero-padding the filterduring learning and by Danneljan et al. [10] who introduce

1

arX

iv:1

611.

0846

1v1

[cs

.CV

] 2

5 N

ov 2

016

spatial regularization to penalize filter values outside the ob-ject boundaries. Both approaches train from image patcheslarger than the object and thus increase the detection range.

Another limitation of the published DCF methods is theassumption that the target shape is well approximated byan axis-aligned rectangle. For irregularly shaped objects orthose with a hollow center, the filter eventually learns thebackground, which may lead to drift and failure. The sameproblem appears for approximately rectangular objects inthe case of occlusion. The Galoogahi et al. [24] and Dan-neljan et al. [10] methods both suffer from this problem.

In this paper we introduce the CSR-DCF, the Discrim-inative Correlation Filter with Channel and Spatial Relia-bility. The spatial reliability map adapts the filter supportto the part of the object suitable for tracking which over-comes both the problems of circular shift enabling an arbi-trary search range and the limitations related to the rectan-gular shape assumption. The spatial reliability map is esti-mated using the output of a graph labeling problem solvedefficiently in each frame. An efficient novel optimizationprocedure is derived for learning a correlation filter with thesupport constrained by the spatial reliability map since thestandard closed-form solution cannot be generalized to thiscase. Experiments show that the novel filter optimizationprocedure outperforms related approaches for constrainedlearning in DCFs.

Channel reliability is the second novelty the CSR-DCFtracker introduces. The reliability is estimated from theproperties of the constrained least-squares solution to filterdesign. The channel reliability scores are used for weight-ing the per-channel filter responses in localization (Fig-ure 1). The CSR-DCF shows state-of-the-art performanceon standard benchmarks – OTB100 [44], VOT2015 [26]and VOT2016 [26] while running in real-time on a singleCPU. The spatial and channel reliability formulation is gen-eral and can be used in most modern correlation filters, e.g.those using deep features.

2. Related workThe discriminative correlation filters for object detec-

tion date back to the 80’s with seminal work of Hesterand Casasent [21]. They have been popularized only re-cently in the tracking community, starting with the Bolmeet al. [4] MOSSE tracker published in 2010. Using a gray-scale template, MOSSE achieved a state-of-the-art perfor-mance on a tracking benchmark [43] at a remarkable pro-cessing speed. Significant improvements have been madesince and in 2014 the top-performing trackers on a recentbenchmark [30] were all from this class of trackers. Im-provements of DCFs fall into two categories, applicationof improved features and conceptual improvements in filterlearning.

In the first group, Henriques et al. [20] replaced the

grayscale templates by HoG [7], Danelljan et al. [12]proposed multi-dimensional color attributes and Li andZhu [32] applied feature combination. Recently, the convo-lutional network features learned for object detection havebeen applied [35, 11, 13], leading to a performance boost,but at a cost of significant speed reduction.

Conceptually, the first successful theoretical extension ofthe standard DCF was the kernelized formulation by Hen-riques et al. [20]. Later, a correlation-filter-based scaleadaptation was proposed by Danelljan et al. [8]. Zhang etal. [46] introduced spatio-temporal context learning in theDCFs. Recently, Galoogahi et al. [16] addressed the prob-lems resulting from learning with circular correlation fromsmall patches. They proposed a learning framework thatartificially increases the filter size by implicit zero-paddingto the right and down. The non-symmetric padding onlypartially reduces the boundary artefacts in filter learning.Danneljan et al. [10] reformulate the learning cost functionto penalize non-zero filter values outside the object bound-ing box. Performance better than [16] is reported, but thelearned filter is still a tradeoff between the correlation re-sponse and regularization, and it does not guarantee that fil-ter values are zero outside of object bounding box.

3. Spatially constrained correlation filtersGiven a set of Nd channel features f = {fd}d=1:Nd

andcorresponding target templates (filters) h = {hd}d=1:Nd

,where fd ∈ Rdw×dh , hd ∈ Rdw×dh , the object position xis estimated by maximizing the probability

p(x|h) =∑Nd

d=1p(x|fd)p(fd). (1)

The density p(x|fd) = [fd ∗ hd](x) is a convolution of afeature map with a learned template evaluated at x and p(fd)is a prior reflecting the channel reliability.

Most correlation filters, e.g., [31, 16, 46], assume inde-pendent feature channels. Optimal filters are obtained atlearning stage by minimizing the sum of squared differ-ences between the channel-wise correlation outputs and thedesired output g ∈ Rdw×dh ,

arg minh

Nd∑d=1

‖fd ∗ hd − g‖2 + λ

Nd∑d=1

‖hd‖2

= arg minh

Nd∑d=1

(‖hHd diag(fd)− gd‖2 + λ‖hd‖2). (2)

The equivalence in (2) follows from the Parsevaal’s theo-rem, the operator a = vec(F [a]) is a Fourier transformof a reshaped into a column vector, i.e., a ∈ RD×1, withD = dw ·dh, diag(a) forms aD×D diagonal matrix from aand (·)H is a Hermitian transpose. Minimization of (2) hasa closed-form solution by equating the complex gradient of

(2) w.r.t. each channel to zero [4]. Albeit its simplicity, thissolution suffers from boundary defects due to input circu-larity assumption and from assuming all pixels are equallyreliable for filter learning. In the following we address thisissue by proposing an efficient spatial reliability map con-struction for correlation filters and propose a new spatiallyconstrained correlation filter learning framework.

3.1. Constructing spatial reliability map

Spatial reliability map m ∈ [0, 1]dw×dh , with elementsm ∈ {0, 1}, indicates the learning reliability of each pixel.Probability of pixel x being reliable conditioned on appear-ance y is specified as

p(m = 1|y,x) ∝ p(y|m = 1,x)p(x|m = 1)p(m = 1).(3)

The appearance likelihood p(y|m = 1,x) is computed byBayes rule from the object foreground/background colormodels, which are maintained during tracking as color his-tograms c = {cf , cb}. The prior p(m = 1) is defined by theratio between the region sizes for foreground/backgroundhistogram extraction.

Central pixels in axis-aligned approximations of elon-gated rotating, articulated or deformable object will likelycontain the object regardless of deformation. On the otherhand, pixels away from the center will equally likely con-tain object or background. This deformation invariance ofcentral elements reliability is enforced in our approach bydefining a weak spatial prior

p(x|m = 1) = k(x;σ), (4)

where k(x;σ) is a modified Epanechnikov kernel,k(r;σ) = 1 − (r/σ)2, with size parameter σ equal to theminor bounding box axis and clipped to interval [0.5, 0.9]such that the object prior probability at center is 0.9 andchanges to a uniform prior away from the center (Figure 2).

Spatial consistency of labeling m is enforced by using(3) as unary terms in a Markov random field. An efficientsolver [28] is applied to compute maximum a posteriori so-lution of m. To avoid potentially poor classifications atsignificant appearance changes, the interior of the objectbounding box is set to a uniform distribution if less than avery small percentage of pixels αmin are classified as fore-ground. The reliability map is morphologically dilated by asmall kernel to prevent unwanted masking of object bound-aries. Figure 2 shows the likelihood and spatial prior in theunary terms and the final binary reliability map.

3.2. Constrained correlation filter learning

In interest of notation clarity, we assume only a singlechannel in the following derivation, i.e., Nd = 1, and dropthe channel index (·)d, since the filter learning is indepen-dent across the channels.

The spatial reliability map m identifies pixels in the fil-ter that should be ignored in learning, i.e., introduces a con-straint h ≡ m � h, which prohibits a closed-form solutionof (2). In the following we summarize our solution of thisconstrained optimization and report the full derivation in thesupplementary material.

We introduce a dual variable hc and the constrainthc −m� h ≡ 0, which leads to the following augmentedLagrangian [5]

L(hc,h, l|m) = ‖hHc diag(f)− g‖2 +

λ

2‖hm‖2 + (5)

[lH(hc − hm) + lH(hc − hm)] + µ‖hc − hm‖2,

where l is a complex Lagrange multiplier, µ > 0, andwe use the following definition hm = (m � h) for com-pact notation. The augmented Lagrangian (5) can be it-eratively minimized by the alternating direction method ofmultipliers [5], which sequentially solves the following sub-problems at each iteration:

hi+1c = arg min

hc

L(hc,hi, li|m), (6)

hi+1 = arg minh

L(hi+1c ,h, li|m), (7)

and the Lagrange multiplier is updated as

li+1 = li + µ(hi+1c − hi+1). (8)

The minimizations in (6) have a closed-form solution:

hi+1c = (f � g + (µhi

m − li))�−1 (f � f + µi), (9)

hi+1 = m�F−1 [li + µihi+1c ]/(

λ

2D+ µi). (10)

A standard scheme for updating the constraint penalty µvalues [5] is applied, i.e., µi+1 = βµi.

Computations of (9,8) are fully carried out in frequencydomain, the solution for (10) requires a single inverse FFTand another FFT to compute the hi+1. A single optimiza-tion iteration thus requires only two calls of the Fouriertransform, resulting in a very fast optimization. The com-putational complexity is that of the Fourier transform, i.e.,O(D logD). Filter learning is implemented in less than fivelines of Matlab code and is summarized in the Algorithm 1.

3.3. Channel reliability estimation

The channel reliability at target localization stage is com-puted as the product of a learning channel reliability mea-sure and a detection reliability measure. These are de-scribed next.

Minimization of (5) solves a least squares problem aver-aged over all displacements of the filter on a feature chan-nel. A discriminative feature channel fd produces a filter hd

Spatial prior Backprojection Posterior Overlayed patchTraining patch

Figure 2: Spatial reliability map construction. From left to right: a training patch with the tracked object bounding box, thespatial prior used as a unary term in the Markov random field optimization, the object log-likelihood according to foreground-background color models, the posterior object probability after Markov random field regularization. The training patchmasked with the final binary reliability map.

Algorithm 1 : Constrained filter optimization.Require:

Image patch features f , ideal correlation response g, bi-nary mask m.

Ensure:Optimized filter h.

Procedure:1: Initialize filter h0 by ht−1.2: Initialize Lagrangian coefficients: l0 ← zeros.3: while stop condition do4: Calculate hi+1

c from hi and li using (9).5: Calculate hi+1 from hi+1

c and li using (10).6: Update the Lagrangian li+1 from hi+1

c and hi+1 (8).7: end while

whose output fd ∗ hd nearly exactly fits the ideal responseg. On the other hand, the response is highly noisy on thechannel with low discriminative power and a global errorreduction in the least squares significantly reduces the max-imal response. Thus a straight-forward measure of channellearning reliability p(fd) in (1) is the maximum response ofa learned channel filter, i.e., wd = ζ max(fd ∗ hd), wherethe normalization scalar ζ ensures that

∑d wd = 1.

At detection stage, the per-channel detection reliabilityis reflected in the expressiveness of the major mode in theresponse of each channel. Note that Bolme et al. [4] pro-posed a similar approach to detect target loss. Our mea-sure is based on the ratio between the second and first ma-jor mode in the response map, i.e., ρmax2/ρmax1. Notethat this ratio penalizes cases when multiple similar ob-jects appear in the target vicinity since these result in multi-ple equally expressed modes, even though the major modeaccurately depict the target position. To prevent such pe-nalizations, the ratio is clamped by 0.5. Therefore, theper-channel detection reliability is estimated as w(det)

d =1−min(ρmax2/ρmax1,

12 ).

3.4. Tracking with channel and spatial reliability

The localization and update steps of the proposed chan-nel and spatial reliability correlation filter trackers (CSR-DCF) proceed as follows. Localize. Per-channel responsesand the corresponding detection reliability values are com-puted (Section 3.3) and multiplied with the learning relia-bility measures from previous time-step wt−1 into channelreliability scores. The object is localized by summing theresponses of the learned correlation filters ht−1 weighted bythe estimated channel reliability scores. Scale is estimatedby a single scale-space correlation filter from Danelljan etal. [8]. Update. Foreground/background histograms c areextracted at the estimated location and updated by an auto-regressive scheme with learning rate ηc. The foregroundhistogram is extracted by a standard Epanechnikov kernelwithin the estimated object bounding box and the back-ground is extracted from the neighborhood twice the objectsize. The spatial reliability map (Section 3.1) is constructed,the optimal filters h are computed by optimizing (5) and theper-channel learning reliability w = [w1, . . . , wNd

]T is es-timated from their responses (Section 3.3). For temporal ro-bustness, the filters and channel learning reliability weightsare updated by an autoregressive model with learning rate η.A single tracking iteration is summarized in Algorithm 2.

4. Experimental analysisThis section overviews a comprehensive experimental

evaluation of the proposed CSR-DCF tracker. Implemen-tation details are discussed in Section 4.1, Section 4.2 re-ports comparison of the proposed constrained learning tothe related state-of-the-art, ablation study is provided inSection 4.3, performance on three recent benchmarks is re-ported in Section 4.4, Section 4.5 and Section 4.6. Sec-tion 4.7 evaluates the tracking speed.

4.1. Implementation details and parameters

Standard HOG [15] and Colornames [39] featuresare used in the correlation filter and HSV fore-

Algorithm 2 : The CSR-DCF tracking algorithm.Require:

Image It, object position on previous frame pt−1, scalest−1, filter ht−1, color histograms ct−1, channel relia-bility wt−1.

Ensure:Position pt, scale st and updated models.

Localization and scale estimation:1: New target location pt: position of the maximum in

correlation between ht−1 and image patch features fextracted on position pt−1 and weighted by the channelreliability scores (Section 3.3).

2: Using location pt, estimate new scale st.Update:

3: Extract foreground and background histograms cf , cb.4: Update foreground and background histograms

cft = (1− ηc)cft−1 + ηccf , cbt = (1− ηc)cbt−1 + ηcc

b.5: Estimate reliability map m (Section 3.1).6: Estimate a new filter h using m (Algorithm 1).7: Estimate channel reliability w from h (Section 3.3).8: Update filter ht = (1− η)ht−1 + ηh.9: Update channel reliability wt = (1− η)wt−1 + ηw.

ground/background color histograms with 16 bins per colorchannel are used in reliability map estimation with param-eter αmin = 0.05. All the parameters are set to valuescommonly used in literature [10, 16]. Histogram adaptationrate is set to ηc = 0.04, correlation filter adaptation rate isset to η = 0.02, and the regularization parameter is set toλ = 0.01. The augmented Lagrangian optimization param-eters are set to µ0 = 5 and β = 3. All parameters havestraight-forward interpretation, do not require fine-tuning,and were kept constant throughout all experiments. OurMatlab implementation1 runs at 13 frames per second onan Intel Core i7 3.4GHz standard desktop.

4.2. Impact of boundary constraint formulation

This section compares our proposed boundary con-straints formulation (Section 3) with recent state-of-the-artapproaches [10, 16]. In the first experiment, three variantsof the standard single-scale HOG-based correlation filterwere implemented to emphasize the difference in bound-ary constraints: the first one uses our channel reliabilityboundary constraint formulation from Section 3 (TCR) thesecond one applies the spatial regularization constraint [10](TSR) and the third one applies the limited boundaries con-straint [16] (TLB).

The three variants were compared on the challengingVOT2015 dataset [26] by applying a standard no-reset one-pass evaluation (OTB [43]) and computing the AUC on thesuccess plot. The tracker with our constraint formulation

1The CSR-DCF Matlab source code will be made publicly available.

0 50 100 150 200 250 300 350 400 450 5000

10

20

30

40

50

60

CSR-DCFSRDCFLBCF

Trajectory lengths Overlap thresholds

Number of trajectories

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

CSR-DCF [0.242]SRDCF [0.115]LBCF [0.038]

Figure 3: The number of trajectories with tracking success-ful up to frame Θfrm (upper left), the success plots (upperright) and initialization examples of non-axis-aligned tar-gets (bottom).

TCR achieved 0.32 AUC, while the alternatives achieved0.28 (TSR) and 0.16 (TLB). The only difference betweenthese tackers is in the constraint formulation, which indi-cates superiority of the proposed channel-reliability-basedconstraints formulation over the recent alternatives [16, 10].

4.2.1 Non-axis-aligned target initialization robustness

The proposed CSR-DCF tracker from Section 3 wascompared to the original recent state-of-the-art trackersSRDCF [10] and CFLB [24] that apply alternative bound-ary constraints. The source code was obtained from the au-thors and only HoG features were used in all three track-ers for a fair comparison. An experiment was designed toevaluate initialization and tracking of non axis-aligned tar-gets, which is the case for most realistic deforming and non-circular objects. Trackers were initialized on frames withnon-axis aligned targets and left to track until the sequenceend, resulting in a large number of tracking trajectories andsummarized by various performance measures.

The VOT2015 dataset [26] contains non-axis-aligned an-notations, which allows automatic identification of trackerinitialization frames, i.e., frames in which the ground truthbounding box significantly deviates from an axis-alignedapproximation. Frames with overlap between ground truthand axis-aligned approximation lower than 0.5 were iden-tified and filtered to obtain a set of initialization frames at

Table 1: Comparison of three most related trackers onnon-axis-aligned initialization experiment: weighted aver-age tracking length in frames Γfrm and proportions Γprp,and weighted average overlaps using the original and axis-aligned ground truth, Φrot and Φaa, respectively.

Tracker Γprp Γfrm Φaa Φrot

CSR-DCF 0.58 221 0.31 0.24SRDCF (ICCV2015) 0.31 95 0.16 0.12LBCF (CVPR2015) 0.12 37 0.06 0.04

Figure 4: Qualitative results for trackers CSR-DCF (red)tracker, SRDCF (blue) and LBCF (green).

least hundred frames apart – this constraint fits half the typ-ical short-term sequence length [26] and reduces the poten-tial correlation across the initializations (see Figure 3 forexamples).

Initialization robustness is estimated by counting thenumber of trajectories in which the tracker was still tracking(overlap with ground truth greater than 0) Θfrm frames afterinitialization. Figure 3 shows these values with increasingthe threshold Θfrm. The CSR-DCF graph is consistentlyabove the SRDCF and LBCF for all thresholds. The perfor-mance is summarized by the average tracking length (num-ber of frames before the overlap drops to zero) weighted bytrajectory lengths. The weighted average tracking lengthsin frames, Γfrm, and proportions of full trajectory lengths,Γprp, are shown in Table 1. The CSR-DCF by far outper-forms SRDCF and LBCF in all measures indicating a signif-icant robustness at initialization of challenging targets thatdeviate from axis-aligned templates. This improvement isfurther confirmed by Figure 3 which shows the OTB successplots [43] calculated on these trajectories and summarizedby the AUC values, which are equal to the average over-laps [6]. Table 1 shows the average overlaps computed onthe original ground truth on VOT2015 (Φrot) and on groundtruth approximated by the axis-aligned bounding box (Φaa).Again, the CSR-DCF by far outperforms the competing al-ternatives SRDCF and LBCF. Tracking examples for the

Table 2: Ablation study of CSR-DCF.

Tracker EAO Rav Aav

CSR-DCF 0.338 0.85 0.51CuSR-DCF 0.297 1.08 0.51CSuR-DCF 0.264 1.18 0.49CuSuR-DCF 0.256 1.33 0.51DCF 0.152 2.85 0.47

three trackers are shown in Figure 4.

4.3. Spatial and channel reliability ablation study

An ablation study on VOT2016 (see Section 4.6 for de-tails of the evaluation protocol) was conducted to evalu-ate the contribution of spatial and channel reliability mea-sures in our CSR-DCF. Results of the VOT primary mea-sure EAO and two supplementary measures (A,R) are sum-marized in Table 2. Setting the adaptive channel reliabil-ity weights to uniform values (CuSR-DCF) results in 12%performance drop in EAO compared to CSR-DCF. Replac-ing the adaptive spatial reliability map in CSR-CDF by aconstant map with uniform values within the bounding boxand zeros elsewhere (CSuR-DCF), results in a 21% dropin EAO. Making both replacements in CSR-DCF (CuSuR-DCF) results in 24% drop. Removing the channel and spa-tial reliability map reduces our tracker to a standard DCFwith a large receptive field – the performance drops by over50%.

4.4. The OTB100 benchmark [44]

The OTB100 [44] benchmark contains results of 29trackers evaluated on 100 sequences by a no-reset evalua-tion protocol. Tracking quality is measured by precisionand success plots. Success plot shows portion of frameswith the overlap between predicted and ground truth bound-ing box greater than a threshold with respect to all thresholdvales. The precision plot shows similar statistics on the cen-ter error. The results are summarized by areas under theseplots. To reduce clutter in the graphs, we show here only theresults for top-performing recent baselines, i.e., Struck [18],TLD [23], CXT [14], ASLA [45], SCM [42], LSK [34],CSK [19] and results for recent top-performing state-of-the-art trackers SRDCF [10] and MUSTER [22].

The CSR-DCF is ranked top on the benchmark (Fig-ure 5). It significantly outperforms the best performersreported in [44] and outperforms the current state-of-the-art (SRDCF [10] and MUSTER [22]). The average CSR-DCF performance on success plot is slightly lower thanSRDCF [9] due to poorer scale estimation, but yields betterperformance in the average precision (center error). Both,precision and success plot, show that the CSR-DCF trackson average longer than competing methods.

0 10 20 30 40 50

Location error threshold

0

0.2

0.4

0.6

0.8

1Precision plots

0 0.6 0.8 1

Overlap threshold

0

0.2

0.4

0.6

0.8

1

Su

cce

ss

rate

Success plots

0.2 0.4

Pre

cisi

on

CSR-DCF [0.733]

SRDCF [0.725]

MUSTER [0.709]

Struck [0.599]

TLD [0.550]

SCM [0.540]

CXT [0.521]

CSK [0.496]

ASLA [0.491]

LSK [0.478]

SRDCF [0.598]

CSR-DCF [0.587]

MUSTER [0.572]

Struck [0.463]

SCM [0.446]

TLD [0.427]

CXT [0.414]

ASLA [0.410]

LSK [0.386]

CSK [0.386]

Figure 5: OTB100 [44] benchmark comparison. The preci-sion plot (left) and the success plot (right).

4.5. The VOT2015 benchmark [26]

The VOT2015 [26] benchmark contains results of 63state-of-the-art trackers evaluated on 60 challenging se-quences. In contrast to related benchmarks, the VOT2015dataset was constructed from 300 sequences by an advancedsequence selection methodology that favors objects difficultto track and maximizes a visual attribute diversity cost func-tion [26]. This makes it arguably the most challenging se-quence set available. The VOT methodology [27] resets atracker upon failure to fully use the dataset. The basic VOTmeasures are the number of failures during tracking (robust-ness) and average overlap during the periods of successfultracking (accuracy), while the primary VOT2015 measureis the expected average overlap (EAO) on short-term se-quences. The latter can be thought of as the expected no-reset average overlap (AUC in OTB methodology), but withreduced bias and the variance as explained in [26].

Figure 7 shows the VOT EAO plots with the CSR-DCFand the VOT2015 state-of-the-art approaches consideringthe VOT2016 rules that do not consider trackers learnedon video sequences related to VOT to prevent over-fitting.The CSR-DCF outperforms all trackers and achieves a toprank. The CSR-DCF significantly outperforms the relatedcorrelation filter trackers like SRDCF [9] as well as track-ers that apply computationally-intesive state-of-the-art deepfeatures e.g., deepSRDCF [11] and SO-DLT [41].

4.6. The VOT2016 benchmark [25]

Finally, we compare our tracker on the most recent vi-sual tracking benchmark, VOT2016 [25]. The dataset con-tains 60 sequences from VOT2015 [26] with improved an-notations. The benchmark evaluated a set of 70 track-ers which includes the recently published and yet unpub-lished state-of-the-art trackers. The set is indeed diverse,the top-performing trackers come from various classese.g., correlation filter methods (CCOT [13], Staple [2],DDC [25]), deep ConvNets (TCNN [25], SSAT[25, 37],MLDF [25, 40], FastSiamnet [3]) and different detection-

Overall(0.27, 0.34)

Camera motion(0.28, 0.37)

Occlusion(0.16, 0.29)

Size change(0.25, 0.38)

Unassigned(0.10, 0.21)

(0.20, 0.46)

Illuminationchange

(0.20, 0.46)

Motionchange

CSR-DCF CCOT TCNN SSAT MLDF Staple

DDC EBT SRBTStaple+ DNT

Figure 6: Expected averaged overlap performance on dif-ferent visual attributes on the VOT2016 [25] benchmark.The proposed CSR-DCF and the top 10 performing track-ers from VOT2016 are shown. The scales of visual attributeaxes are displayed below the attribute labels.

based approaches (EBT [47], SRBT [25]).Figure 7 shows the EAO performance on the VOT2016.

Our proposed CSR-DCF outperforms all 70 trackers at theEAO score 0.338. The CSR-DCF significantly outperformscorrelation filter approaches that do not apply deep Con-vNets. Even though the CSR-DCF applies only simple fea-tures, it outperforms all trackers that apply computationallyintensive deep features.

The VOT2016 [25] dataset is per-frame annotated withvisual attributes and allows detailed analysis of per-attributetracking performance. Figure 6 shows per-attribute plot forten top-performing trackers on VOT2016 in EAO. The pro-posed CSR-DCF is consistently ranked among top threetrackers on five out of six attributes. In four attributes (sizechange, occlusion, camera motion, unassigned) the trackeris ranked number one.

4.7. Tracking speed analysis

Tracking speed is an important factor of many real-worldtracking problems. Table 3 thus compares several relatedand well-known trackers (including the best-performingtracker on the VOT2016 challenge) in terms of speed andVOT performance measures. Speed measurements on a sin-gle CPU are computed on Intel Core i7 3.4GHz standarddesktop.

The proposed CSR-DCF performs on par with theVOT2016 best-performing CCOT [13], which applies deepConvNets, with respect to VOT measures, while being 20

1102030405060Rank

0.0

0.1

0.2

0.3

0.4

Expe

cted

Ave

rage

Ove

rlap

ASMS2)CSR-DCFqDATgDeepSRDCFwEBTe

HMMTxD;LDPtMCTlMEEMhnsamfi

OACFkrajsscaRobStruckjs3trackersscebtu

SODLTfsPSTysrdcfrstruckosumshiftd

qwer

t

y

uio

a

s

d

f

gh

jk

l

;

2)

110203040506070Rank

0.0

0.1

0.2

0.3

0.4

Expe

cted

Ave

rage

Ove

rlap

CCOTwCSR-DCFqDDCuDeepSRDCFgDNTs

EBTiFCFkGGTv22)MDNetNjMLDFt

RFDCF2;SHCThSiamRNfSRBToSRDCFl

SSATrSSKCFdStapleySTAPLEpaTCNNe

qw

e

rt

yui

oa

sd

fgh

jk

l

;

2)

Figure 7: Expected average overlap plot for VOT2015 [26] (left) and VOT2016 [25] (right) benchmarks with the proposedCSR-DCF tracker. Legends are shown only for top performing trackers. The graphs with full legends are provided insupplementary materials.

Table 3: Speed in frames per second (fps) of correlationtrackers and Struck – a baseline. The EAO, average accu-racy (Aav) and average failures (Rav) are shown for refer-ence.

Tracker Published at EAO Aav Rav fpsCSR-DCF This work. 0.338 0.51 0.85 13.0CCOT ECCV2016 0.331 0.52 0.85 0.55CCOT* ECCV2016 0.274 0.52 1.18 1.0SRDCF ICCV2015 0.247 0.52 1.50 7.3KCF PAMI2015 0.192 0.48 2.03 115.7DSST PAMI2016 0.181 0.48 2.52 18.6Struck ICCV2011 0.142 0.42 3.37 8.5

times faster than the CCOT. The CCOT was modified byreplacing the computationally intensive deep features withthe same simple features used in CSR-DCF. The result-ing tracker, indicated by CCOT*, is still ten times slowerthan CSR-DCF, while the performance drops by over 15%.The proposed CSR-DCF performs twice as fast as the re-lated SRDCF [9], while achieving approximately 25% bet-ter tracking results. The speed of baseline real-time trackerslike DSST [8] and Struck [18] is comparable to CSR-DCF,but their tracking performance is significantly poorer. Thefastest compared tracker, KCF [20] runs much faster thanreal-time, but delivers a significantly poorer performancethan CSR-DCF.

The experiments show that the CSR-DCF performs atcomparably to the state-of-the-art trackers which applycomputationally demanding high-dimensional features, butruns considerably faster and delivers top tracking perfor-mance among the real-time trackers.

5. Conclusion

The Discriminative Correlation Filter with Channel andSpatial Reliability (CSR-DCF) was introduced. The spatial

reliability map adapts the filter support to the part of theobject suitable for tracking which overcomes both the prob-lems of circular shift enabling an arbitrary search range andthe limitations related to the rectangular shape assumption.A novel efficient spatial map estimation method was pro-posed and an efficient optimization procedure derived forlearning a correlation filter with the support constrained bythe spatial reliability map. The second novelty of CSR-DCFis the channel reliability. The reliability is estimated fromthe properties of the constrained least-squares solution. Thechannel reliability scores were used for weighting the per-channel filter responses in localization.

Experimental comparison with recent related state-of-the-art boundary-constraints formulations showed signifi-cant benefits of using our formulation. The CSR-DCFhas state-of-the-art performance on standard benchmarks –OTB100 [44], VOT2015 [26] and VOT2016 [26] while run-ning in real-time on a single CPU. Despite using simple fea-tures like HoG and Colornames, the CSR-DCF performs onpar with trackers that apply computationally complex deepConvNet, but is significantly faster.

To the best of our knowledge, the proposed approach isthe first of its kind to introduce constrained filter learningwith arbitrary spatial reliability map and the use of channelreliabilities. The spatial and channel reliability formulationis general and can be used in most modern correlation fil-ters, e.g. those using deep feature.

References[1] B. Babenko, M.-H. Yang, and S. Belongie. Robust object

tracking with online multiple instance learning. IEEE Trans.Pattern Anal. Mach. Intell., 33(8):1619–1632, Aug. 2011. 1

[2] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, andP. H. S. Torr. Staple: Complementary learners for real-timetracking. In Comp. Vis. Patt. Recognition, pages 1401–1409,June 2016. 7

[3] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, andP. H. Torr. Fully-convolutional siamese networks for object

tracking. arXiv preprint arXiv:1606.09549, 2016. 7[4] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui.

Visual object tracking using adaptive correlation filters. InComp. Vis. Patt. Recognition, pages 2544–2550. IEEE, 2010.1, 2, 3, 4

[5] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Dis-tributed optimization and statistical learning via the alternat-ing direction method of multipliers. Foundations and Trendsin Machine Learning, 3(1):1–122, 2011. 3

[6] L. Cehovin, A. Leonardis, and M. Kristan. Visual objecttracking performance measures revisited. IEEE Trans. ImageProc., 25(3):1261–1274, 2016. 6

[7] N. Dalal and B. Triggs. Histograms of oriented gradients forhuman detection. In Comp. Vis. Patt. Recognition, volume 1,pages 886–893, June 2005. 2

[8] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg. Ac-curate scale estimation for robust visual tracking. In Proc.British Machine Vision Conference, pages 1–11, 2014. 1, 2,4, 8

[9] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg. Learn-ing spatially regularized correlation filters for visual track-ing. In 2015 IEEE International Conference on ComputerVision, ICCV 2015, Santiago, Chile, December 7-13, 2015,pages 4310–4318, 2015. 6, 7, 8

[10] M. Danelljan, G. Hager, F. Shahbaz Khan, and M. Felsberg.Learning spatially regularized correlation filters for visualtracking. In Int. Conf. Computer Vision, pages 4310–4318,2015. 1, 2, 5, 6

[11] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg. Con-volutional features for correlation filter based visual track-ing. In IEEE International Conference on Computer VisionWorkshop (ICCVW), pages 621–629, Dec 2015. 2, 7

[12] M. Danelljan, F. S. Khan, M. Felsberg, and J. van de Wei-jer. Adaptive color attributes for real-time visual tracking.In 2014 IEEE Conference on Computer Vision and PatternRecognition, CVPR 2014, Columbus, OH, USA, June 23-28,2014, pages 1090–1097, 2014. 2

[13] M. Danelljan, A. Robinson, F. S. Khan, and M. Felsberg.Beyond correlation filters: learning continuous convolutionoperators for visual tracking. In Proc. European Conf. Com-puter Vision, pages 472–488. Springer, 2016. 2, 7

[14] T. B. Dinh, N. Vo, and G. Medioni. Context tracker: Ex-ploring supporters and distracters in unconstrained environ-ments. In Comp. Vis. Patt. Recognition, pages 1177–1184,2011. 6

[15] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ra-manan. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell.,32(9):1627–1645, Sept 2010. 4

[16] H. K. Galoogahi, T. Sim, and S. Lucey. Multi-channel cor-relation filters. In Int. Conf. Computer Vision, pages 3072–3079, 2013. 2, 5

[17] H. Grabner, M. Grabner, and H. Bischof. Real-time trackingvia on-line boosting. In Proc. British Machine Vision Con-ference, volume 1, pages 47–56, 2006. 1

[18] S. Hare, A. Saffari, and P. H. S. Torr. Struck: Structuredoutput tracking with kernels. In Int. Conf. Computer Vision,

pages 263–270, Washington, DC, USA, 2011. IEEE Com-puter Society. 1, 6, 8

[19] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. Ex-ploiting the circulant structure of tracking-by-detection withkernels. In A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato,and C. Schmid, editors, Proc. European Conf. ComputerVision, pages 702–715, Berlin, Heidelberg, 2012. SpringerBerlin Heidelberg. 6

[20] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. IEEETrans. Pattern Anal. Mach. Intell., 37(3):583–596, 2015. 1,2, 8

[21] C. F. Hester and D. Casasent. Multivariant technique formulticlass pattern recognition. Applied Optics, 19(11):1758–1761, 1980. 2

[22] Z. Hong, Z. Chen, C. Wang, X. Mei, D. Prokhorov, andD. Tao. Multi-store tracker (muster): A cognitive psychol-ogy inspired approach to object tracking. In Comp. Vis. Patt.Recognition, pages 749–758, June 2015. 6

[23] Z. Kalal, K. Mikolajczyk, and J. Matas. Tracking-learning-detection. IEEE Trans. Pattern Anal. Mach. Intell.,34(7):1409–1422, July 2012. 6

[24] H. Kiani Galoogahi, T. Sim, and S. Lucey. Correlation filterswith limited boundaries. In Comp. Vis. Patt. Recognition,pages 4630–4638, 2015. 1, 2, 5

[25] M. Kristan, A. Leonardis, J. Matas, M. Felsberg,R. Pflugfelder, L. Cehovin, T. Vojir, G. Hager, A. Lukezic,and G. et al. Fernandez. The visual object tracking vot2016challenge results. In Proc. European Conf. Computer Vision,2016. 7, 8

[26] M. Kristan, J. Matas, A. Leonardis, M. Felsberg, L. Cehovin,G. Fernandez, T. Vojir, G. Hager, G. Nebehay, and R. et al.Pflugfelder. The visual object tracking vot2015 challengeresults. In Int. Conf. Computer Vision, 2015. 1, 2, 5, 6, 7, 8

[27] M. Kristan, J. Matas, A. Leonardis, T. Vojir, R. Pflugfelder,G. Fernandez, G. Nebehay, F. Porikli, and L. Cehovin. Anovel performance evaluation methodology for single-targettrackers. IEEE Trans. Pattern Anal. Mach. Intell., 2016. 1, 7

[28] M. Kristan, J. Pers, V. Sulic, and S. Kovacic. A graphi-cal model for rapid obstacle image-map estimation from un-manned surface vehicles. In Proc. Asian Conf. ComputerVision, pages 391–406, 2014. 3

[29] M. Kristan, R. Pflugfelder, A. Leonardis, J. Matas, F. Porikli,L. Cehovin, G. Nebehay, G. Fernandez, and T. e. a. Vojir. Thevisual object tracking vot2013 challenge results. In Vis. Obj.Track. Challenge VOT2013, In conjunction with ICCV2013,pages 98–111, Dec 2013. 1

[30] M. Kristan, R. Pflugfelder, A. Leonardis, J. Matas,L. Cehovin, G. Nebehay, T. Vojir, and G. et al. Fernan-dez. The visual object tracking vot2014 challenge results.In Proc. European Conf. Computer Vision, pages 191–217,2014. 1, 2

[31] Y. Li and J. Zhu. A scale adaptive kernel correlation filtertracker with feature integration. In Proc. European Conf.Computer Vision, pages 254–265, 2014. 1, 2

[32] Y. Li and J. Zhu. A scale adaptive kernel correlation filtertracker with feature integration. In Proc. European Conf.Computer Vision, pages 254–265, 2014. 2

[33] P. Liang, E. Blasch, and H. Ling. Encoding color informationfor visual tracking: Algorithms and benchmark. IEEE Trans.Image Proc., 24(12):5630–5644, Dec 2015. 1

[34] B. Liu, J. Huang, L. Yang, and C. Kulikowsk. Robust track-ing using local sparse appearance model and k-selection. InComp. Vis. Patt. Recognition, pages 1313–1320, June 2011.6

[35] C. Ma, J. B. Huang, X. Yang, and M. H. Yang. Hierarchi-cal convolutional features for visual tracking. In Int. Conf.Computer Vision, pages 3074–3082, Dec 2015. 2

[36] M. Mueller, N. Smith, and B. Ghanem. A benchmark andsimulator for uav tracking. In Proc. European Conf. Com-puter Vision, 2016. 1

[37] H. Nam and B. Han. Learning multi-domain convolutionalneural networks for visual tracking. In Comp. Vis. Patt.Recognition, pages 4293–4302, June 2016. 7

[38] A. Smeulders, D. Chu, R. Cucchiara, S. Calderara, A. De-hghan, and M. Shah. Visual tracking: An experimental sur-vey. IEEE Trans. Pattern Anal. Mach. Intell., 36(7):1442–1468, July 2014. 1

[39] J. van de Weijer, C. Schmid, J. Verbeek, and D. Larlus.Learning color names for real-world applications. IEEETrans. Image Proc., 18(7):1512–1523, July 2009. 4

[40] L. Wang, W. Ouyang, X. Wang, and H. Lu. Visual trackingwith fully convolutional networks. In Int. Conf. ComputerVision, pages 3119–3127, Dec 2015. 7

[41] N. Wang, S. Li, A. Gupta, and D. Yeung. Transferringrich feature hierarchies for robust visual tracking. CoRR,abs/1501.04587, 2015. 7

[42] M.-H. Y. Wei Zhong, Huchuan Lu. Robust object trackingvia sparsity-based collaborative model. In Comp. Vis. Patt.Recognition, pages 1838–1845, 2012. 6

[43] Y. Wu, J. Lim, and M.-H. Yang. Online object tracking: Abenchmark. In Comp. Vis. Patt. Recognition, pages 2411–2418, 2013. 1, 2, 5, 6

[44] Y. Wu, J. Lim, and M. H. Yang. Object tracking benchmark.IEEE Trans. Pattern Anal. Mach. Intell., 37(9):1834–1848,Sept 2015. 2, 6, 7, 8

[45] M.-H. Y. Xu Jia, Huchuan Lu. Visual tracking via adaptivestructural local sparse appearance model. In Comp. Vis. Patt.Recognition, pages 1822–1829, 2012. 6

[46] K. Zhang, L. Zhang, Q. Liu, D. Zhang, and M.-H. Yang. Fastvisual tracking via dense spatio-temporal context learning.In Proc. European Conf. Computer Vision, pages 127–141.Springer International Publishing, 2014. 2

[47] G. Zhu, F. Porikli, and H. Li. Beyond local search: Trackingobjects everywhere with instance-specific proposals. In TheIEEE Conference on Computer Vision and Pattern Recogni-tion (CVPR), pages 943–951, June 2016. 7

abstract - arxiv · 2016-11-28 · discriminative correlation filter with channel and spatial...

Documents