Lecture 15 - Advanced Correlation Filters
16623.courses.cs.cmu.edu/slides/Lecture_15.pdf


Page 1: Lecture 15 - Advanced Correlation Filters

Correlation Filters - Advanced Methods

Instructor - Simon Lucey

16-623 - Designing Computer Vision Apps

Page 2: Lecture 15 - Advanced Correlation Filters

Today

• Review - The Correlation Filter

• Multi-Channel Correlation Filters

• Kernel Correlation Filters

Page 3: Lecture 15 - Advanced Correlation Filters


MULTI-CHANNEL CORRELATION FILTERS
Hamed Kiani¹, Terence Sim¹ and Simon Lucey²
¹School of Computing, NUS, Singapore — {hkiani,tsim}@comp.nus.edu.sg
²CSIRO ICT Center, Australia — [email protected]

ABSTRACT

From a signal processing perspective, pattern detection using modern descriptors like HOG can be efficiently posed as a correlation between a multi-channel image and a multi-channel detector/filter, which results in a single-channel response indicating where the pattern (e.g. object) has occurred. Here, we propose a novel framework for learning multi-channel filters efficiently in the frequency domain, both in terms of complexity and memory usage.

CONTRIBUTIONS

• Extending canonical correlation filter theory to efficiently handle multi-channel signals

• A multi-channel detector whose training memory is independent of the number of training samples

• Superior performance to current state of the art correlation filters, and superior computational and memory efficiency in comparison to spatial detectors (e.g. linear SVM) with comparable detection performance

MULTI-CHANNEL CFS

(i) Spatial domain:

(ii) Fourier domain: Complexity O(D³K³), Memory O(D²K²)

(iii) Fourier domain with variable re-ordering: Complexity O(DK³), Memory O(DK²)

NOTATION: ∗ : convolution operation, |y| = D, K: # of channels, and V(a(j)) = [a⁽¹⁾(j), ..., a⁽ᴷ⁾(j)]
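The poster's equations are embedded as images and do not survive the transcript. As a hedged illustration of where the O(DK³) complexity in (iii) comes from, here is a minimal MATLAB sketch of a per-frequency K × K ridge solve for a single multi-channel training pair; the variable names, the single-image simplification, and the conjugation convention are assumptions for illustration, not the poster's algorithm.

% Sketch: after re-ordering variables by frequency, training reduces to an
% independent K x K ridge solve per frequency bin (D bins, K channels).
[M, N, K] = size(x);              % x: K-channel image; y: single-channel response
xf = fft2(x);                     % per-channel 2D FFT (fft2 acts on each page)
yf = fft2(y);
lambda = 0.01;                    % regularizer (value assumed)
hf = zeros(M, N, K);
for u = 1:M
  for v = 1:N
    q = squeeze(xf(u, v, :));     % K x 1 vector across channels, i.e. V(x(j))
    hf(u, v, :) = (q*q' + lambda*eye(K)) \ (q * conj(yf(u, v)));
  end
end
h = real(ifft2(hf));              % multi-channel filter, back in the spatial domain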

COMPARISON WITH LINEAR SVM

Memory usage (MB) as a function of the number of training images:

Training images:   250     500     1000    2000    4000    8000     16000    24000
MCCF:              0.02    0.02    0.02    0.02    0.02    0.02     0.02     0.02
SVM:               6.17    12.35   24.68   49.36   98.87   197.44   395.88   592.32

[Plots (a)-(c): detection rate vs. false positive rate; detection rate at FPR = 0.1 vs. number of training images; training time (s) vs. number of training images — "Our method" vs. "SVM + HOG"]

Figure 1. Comparing MCCF with SVM + HOG on the problem of pedestrian detection using the Daimler dataset. Top: memory usage (MB) of MCCF compared to SVM as a function of the number of training images. Bottom: detection rate as a function of (a) FPR, (b) number of training images at FPR = 0.10, and (c) training time versus training size.

FACIAL LANDMARK DETECTION

[Plot: localization rate vs. threshold (fraction of interocular distance) — Human, Our method, Valstar et al., Everingham et al.]

Figure 2. Facial landmark detection on the LFW dataset.

CAR DETECTION

[Plot: detection rate vs. threshold (pixels) — Our method, MOSSE, ASEF]

Figure 3. Car detection on the MIT Street Dataset.

REFERENCES

[1] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui. Visual object tracking using adaptive correlation filters. In CVPR '10.

[Diagram: x ("known signal") ⊗ g ("unknown filter") → y ("known response")]

Page 4: Lecture 15 - Advanced Correlation Filters


[Diagram: x ("known signal") ⊗ g ("unknown filter") → y ("known response"); ⊗: "correlation operator"]

Page 5: Lecture 15 - Advanced Correlation Filters

[Diagram: circularly shifted signal x[τ]; desired response y_τ: 1, 0, 0, ...]

E(\mathbf{g}) = \frac{1}{2} \sum_{\tau \in C} \| y_\tau - \mathbf{x}[\tau]^\top \mathbf{g} \|_2^2


Page 7: Lecture 15 - Advanced Correlation Filters

[Diagram: circularly shifted signal x[τ]; desired response y_τ: 1, 0, 0, ...]

\tau = [\tau_x, \tau_y]

C: "set of all circular shifts"

E(\mathbf{g}) = \frac{1}{2} \sum_{\tau \in C} \| y_\tau - \mathbf{x}[\tau]^\top \mathbf{g} \|_2^2

Page 8: Lecture 15 - Advanced Correlation Filters

Linear Least Squares Discriminant

• One can view a correlation filter in the spatial domain as a linear least squares discriminant.

• Made popular by Bolme et al., referred to in the literature as a Minimum Output Sum of Squared Error (MOSSE) filter.

\arg\min_{\mathbf{g}} \; \frac{1}{2} \sum_{\tau \in C} \| y_\tau - \mathbf{x}[\tau]^\top \mathbf{g} \|_2^2
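A minimal MATLAB sketch of this spatial-domain discriminant for a 1D signal; x and y as length-D column vectors and the ridge term lambda (which appears on the coming slides) are assumptions, not code from the deck.

% Build X whose columns are all circular shifts of x, then solve the
% regularized least squares discriminant directly.
D = numel(x);
X = zeros(D, D);
for j = 1:D
  X(:, j) = circshift(x, j-1);         % column j holds the (j-1)-step shift
end
g = (X*X' + lambda*eye(D)) \ (X*y);    % normal equations: O(D^3)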


Page 10: Lecture 15 - Advanced Correlation Filters

[Diagram: circularly shifted signal x[τ]; desired response y_τ: 1, 0, 0, ...]

E(\mathbf{g}) = \frac{1}{2}\|\mathbf{y} - \mathbf{X}\mathbf{g}\|_2^2 + \frac{\lambda}{2}\|\mathbf{g}\|_2^2

\mathbf{X} = [\mathbf{x}[\tau_1], \ldots, \mathbf{x}[\tau_D]], \quad \mathbf{y} = [y_{\tau_1}, \ldots, y_{\tau_D}]^\top


Page 12: Lecture 15 - Advanced Correlation Filters

[Diagram: circularly shifted signal x[τ]; desired response y_τ: 1, 0, 0, ...]

E(\mathbf{g}) = \frac{1}{2}\|\mathbf{y} - \mathbf{X}\mathbf{g}\|_2^2 + \frac{\lambda}{2}\|\mathbf{g}\|_2^2

\mathbf{X} = [\mathbf{x}[\tau_1], \ldots, \mathbf{x}[\tau_D]], \quad \mathbf{y} = [y_{\tau_1}, \ldots, y_{\tau_D}]^\top

Computing (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} costs O(D^3), where D = number of samples in x.


Page 14: Lecture 15 - Advanced Correlation Filters

Circulant Toeplitz Matrices

\mathbf{X} = [\mathbf{x}[\tau_1], \ldots, \mathbf{x}[\tau_D]]

\mathbf{S} = \mathbf{X}\mathbf{X}^\top \quad ("circulant Toeplitz")

Page 15: Lecture 15 - Advanced Correlation Filters

Circulant Toeplitz Matrices

\mathbf{X} = [\mathbf{x}[\tau_1], \ldots, \mathbf{x}[\tau_D]]

\mathbf{S} = \mathbf{X}\mathbf{X}^\top \quad ("circulant Toeplitz")

[Diagram: \mathbf{F}\,\mathbf{S}\,\mathbf{F}^\top, with F the Fourier transform — the off-diagonal entries are always zero]
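A quick MATLAB check of this structure (a sketch: F below is the DFT matrix fft(eye(D)), whose inverse is F'/D).

% The DFT diagonalizes the circulant matrix S = X*X'.
D = 8; x = randn(D, 1);
X = zeros(D, D);
for j = 1:D
  X(:, j) = circshift(x, j-1);
end
S = X * X';
F = fft(eye(D));                 % D x D DFT matrix
Sd = F * S * F' / D;             % F S F^{-1}
norm(Sd - diag(diag(Sd)))        % ~0: off-diagonal entries are always zero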

Page 16: Lecture 15 - Advanced Correlation Filters

Correlation vs. Convolution

• Convolution is preferred mathematically over correlation as it is,

g ∗ (h ∗ x) = (g ∗ h) ∗ x   (associative)
h ∗ x = x ∗ h   (commutative)

Page 17: Lecture 15 - Advanced Correlation Filters

Correlation vs. Convolution

• Convolution is preferred mathematically over correlation as it is,

g ∗ (h ∗ x) = (g ∗ h) ∗ x   (associative)
h ∗ x = x ∗ h   (commutative)

• Correlation is neither!!!

Page 18: Lecture 15 - Advanced Correlation Filters

Correlation vs. Convolution

• Convolution is preferred mathematically over correlation as it is,

g ∗ (h ∗ x) = (g ∗ h) ∗ x   (associative)
h ∗ x = x ∗ h   (commutative)

• Correlation is neither!!!

g ⊗ (h ⊗ x) ≠ (g ⊗ h) ⊗ x
h ⊗ x ≠ x ⊗ h

Page 19: Lecture 15 - Advanced Correlation Filters

Correlation vs. Convolution

• Convolution is preferred mathematically over correlation as it is,

g ∗ (h ∗ x) = (g ∗ h) ∗ x   (associative)
h ∗ x = x ∗ h   (commutative)

• Correlation is neither!!!

g ⊗ (h ⊗ x) ≠ (g ⊗ h) ⊗ x
h ⊗ x ≠ x ⊗ h

• Correlation, however, is preferred for signal matching/detection.
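A small MATLAB check of these claims, treating ⊗ as circular cross-correlation computed through the FFT; the two anonymous functions are assumptions for illustration.

>> corr_c = @(h, x) ifft(conj(fft(h)) .* fft(x));  % circular correlation
>> conv_c = @(h, x) ifft(fft(h) .* fft(x));        % circular convolution
>> h = randn(8,1); x = randn(8,1);
>> norm(corr_c(h, x) - corr_c(x, h))               % nonzero: correlation does not commute
>> norm(conv_c(h, x) - conv_c(x, h))               % ~0: convolution does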

Page 20: Lecture 15 - Advanced Correlation Filters


[Diagram: x ("known signal"), unknown filter g, y ("known response")]

Page 21: Lecture 15 - Advanced Correlation Filters


[Diagram: x ("known signal"), unknown filter h, y ("known response")]

>> g = flipud(fliplr(h));
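A hedged MATLAB check of this flip relation (assumes the Image Processing Toolbox's imfilter; zero-padded linear filtering).

>> x = randn(16); h = randn(5);
>> g = flipud(fliplr(h));            % the slide's flip, i.e. rot90(h, 2)
>> r1 = imfilter(x, h, 'corr');      % correlate with h
>> r2 = imfilter(x, g, 'conv');      % convolve with the flipped filter
>> norm(r1(:) - r2(:))               % ~0: correlation with h = convolution with g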

Page 22: Lecture 15 - Advanced Correlation Filters

[Diagram: circularly shifted signal x[τ]; desired response y_τ: 1, 0, 0, ...]

E(\mathbf{g}) = \frac{1}{2} \sum_{\tau \in C} \| y_\tau - \mathbf{x}[\tau]^\top \mathbf{g} \|_2^2

Page 23: Lecture 15 - Advanced Correlation Filters

[Diagram: circularly shifted signal x[τ]; desired response y_τ: 1, 0, 0, ...]

E(\mathbf{h}) = \frac{1}{2}\|\mathbf{y} - \mathbf{x} * \mathbf{h}\|_2^2 + \frac{\lambda}{2}\|\mathbf{h}\|_2^2

>> g = flipud(fliplr(h));


Page 25: Lecture 15 - Advanced Correlation Filters

[Diagram: the circular convolution a ∗ x written as a dense matrix ("not always zero") times x; after the Fourier transform it becomes the diagonal matrix diag{â} ("always zero" off the diagonal) times x̂]

\hat{\mathbf{a}} \circ \hat{\mathbf{x}} = \mathcal{F}(\mathbf{a} * \mathbf{x})

(∘: "Hadamard product")



Page 28: Lecture 15 - Advanced Correlation Filters

[Diagram: as before — the dense convolution matrix becomes diag{â} after the Fourier transform]

\text{diag}\{\hat{\mathbf{a}}\}\,\hat{\mathbf{x}} = \mathcal{F}(\mathbf{a} * \mathbf{x})
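A self-contained MATLAB check of this identity for a 1D signal of length 8 (a sketch; the loop simply spells out the circular convolution definition).

% Hadamard product in the Fourier domain = circular convolution in space.
a = randn(8, 1); x = randn(8, 1);
c = zeros(8, 1);
for n = 0:7
  for m = 0:7
    c(n+1) = c(n+1) + a(m+1) * x(mod(n-m, 8) + 1);   % (a * x)[n], circularly
  end
end
norm(fft(a) .* fft(x) - fft(c))       % ~0: diag{a_hat} x_hat = F(a * x)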

Page 29: Lecture 15 - Advanced Correlation Filters

\hat{\mathbf{x}} = \mathbf{F}\mathbf{x}

[Diagram: the dense DFT matrix F ("not always zero") factors into a product of sparse matrices ("always zero" off the butterfly pattern), so applying it costs O(D log D)]

Carl Friedrich Gauss

>> xf = fft(x);

Page 30: Lecture 15 - Advanced Correlation Filters

This ignorance is well founded; Structure from Motion (SfM) [17] dictates that a 3D object/scene can be reconstructed up to an ambiguity in scale. The vision world, however, is changing. Smart devices (phones, tablets, etc.) are low cost, ubiquitous, and packaged with more than just a monocular camera for sensing the world.

The idea of combining measurements of an inertial measurement unit (IMU) and a monocular camera to make metric sense of the world has been well explored by the robotics community [18, 19, 21, 23, 31, 34]. Traditionally, however, the community has focused on odometry and navigation, which require accurate and, as a consequence, expensive IMUs, while using vision in a largely peripheral manner. IMUs on modern smart devices, in contrast, are used primarily to obtain coarse measurements of the forces being applied to the device for the purposes of enhancing user interaction. As a consequence, costs can be reduced by selecting noisy, less accurate sensors. In isolation they are largely unsuitable for making metric sense of the world. In this proposal we shall explore a vision-centric strategy for obtaining metric reconstructions of dense 3D faces using the noisy IMUs commonly found in smart devices.

3 Innovations

3.1 Compositionally Sparse Regressors

Inspiration from the FFT: In this proposal we want to draw inspiration from the evolution of the Fast Fourier Transform (FFT) [9] on modern computational architecture. FFTs and SDMs share the same central computational cost - a matrix transform. For example, the FFT \hat{\mathbf{z}} = \mathcal{F}\{\mathbf{z}\} of the vectorized signal \mathbf{z} \in \mathbb{R}^D can simply be expressed as

\hat{\mathbf{z}} = \mathbf{F}_D \mathbf{z}    (4)

where \mathbf{F}_D is the D \times D discrete one-dimensional Fourier transform matrix. Naively the cost of this operation should be O(D^2), since \mathbf{F}_D is a dense matrix. However, it has been well understood for decades, since the seminal work of Cooley & Tukey [9], that \mathbf{F}_D has intrinsic redundancies that make it particularly computationally efficient - specifically, that it is compositionally sparse (see Fig. 4). We define a compositionally sparse matrix \mathbf{R} = \prod_{l=1}^{L} \mathbf{S}_l as a matrix composed of a set of matrices \{\mathbf{S}_l\}_{l=1}^{L}, each of which is sparse or group-sparse, even though the resultant matrix \mathbf{R} itself is dense. Such a decomposition is computationally useful if \sum_{l=1}^{L} \|\mathbf{S}_l\|_0 < \|\mathbf{R}\|_0. In this case the computational advantage is clear when one executes the matrix multiplication in a compositional manner (i.e. \prod_{l=1}^{L} \mathbf{S}_l \mathbf{x}, as opposed to the more costly (\prod_{l=1}^{L} \mathbf{S}_l)\mathbf{x}). One can see an example of this in Fig. 4 for a 16-dimensional FFT, where the dense \mathbf{F}_{16} matrix can be decomposed into L = 4 sparse and group-sparse matrices. In general, due to this sparse compositional property, an FFT can be applied classically in O(D \log D) operations (although many extensions on this theme now exist [14]) instead of the naive O(D^2). A critical thing to note about the sparse decomposition of the 16-dimensional FFT in Fig. 4 is the use of the Kronecker product operator \otimes and identity …

Figure 4: Example (adapted from [14]) of the sparse compositional properties of a 16-dimensional FFT matrix \mathbf{F}_{16}, which can be decomposed as \mathbf{F}_{16} = (\mathbf{F}_4 \otimes \mathbf{I}_4)\,\mathbf{T}^{16}_4\,(\mathbf{I}_4 \otimes \mathbf{F}_4)\,\mathbf{L}^{16}_4, where \mathbf{L}^{16}_4 is a permutation matrix and \mathbf{T}^{16}_4 is a diagonal matrix. Remember \mathbf{F}_4 is the 4-dimensional FFT matrix, \mathbf{I}_4 is the 4 \times 4 identity matrix, and \otimes is the Kronecker product operator.

\mathbf{F}_{16} → 16-dimensional FFT;  \mathbf{F}_4 → 4-dimensional FFT
\mathbf{L}^{16}_4 → permutation matrix;  \mathbf{T}^{16}_4 → diagonal matrix
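A minimal MATLAB sketch of one stage of this compositional sparsity — the classic radix-2 Cooley-Tukey split, where the even/odd interleave plays the role of the permutation L and the twiddle factors the diagonal T.

% One sparse compositional stage of a D = 16 FFT.
D = 16; x = randn(D, 1);
xe = fft(x(1:2:end));                  % FFT of even-indexed samples (length D/2)
xo = fft(x(2:2:end));                  % FFT of odd-indexed samples
w = exp(-2i*pi*(0:D/2-1)'/D);          % twiddle factors (the diagonal matrix)
xf = [xe + w.*xo; xe - w.*xo];         % butterfly combine
norm(xf - fft(x))                      % ~0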

Page 31: Lecture 15 - Advanced Correlation Filters

Examples to try in MATLAB

• Using fftmtx.m (see link),

>> F = fftmtx(5);
>> x = randn(5,1); h = randn(5,1);
>> xf_a = F*x;
>> xf_b = fft(x);

• This applies the FFT both the naive way (dense matrix multiply) and the fast way.

• You should see that xf_a and xf_b are the same.


Question: what does F’*F look like?

Page 33: Lecture 15 - Advanced Correlation Filters

Parseval's Theorem

[Diagram: signals x₁, x₂]

\|\hat{\mathbf{x}}\|_2^2 \propto \|\mathbf{x}\|_2^2

"Antoine Parseval"


Page 35: Lecture 15 - Advanced Correlation Filters

Parseval's Theorem

[Diagram: signals x₁, x₂]

\|\hat{\mathbf{x}}\|_2^2 \propto \|\mathbf{x}\|_2^2

where \hat{\mathbf{x}} = \mathcal{F}\{\mathbf{x}\}

"Antoine Parseval"

Page 36: Lecture 15 - Advanced Correlation Filters

Parseval's Theorem

[Diagram: signals x₁, x₂]

\|\hat{\mathbf{x}}\|_2^2 = D \cdot \|\mathbf{x}\|_2^2

where \hat{\mathbf{x}} = \mathcal{F}\{\mathbf{x}\}

"Antoine Parseval"
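A quick numeric check of the slide's statement in MATLAB (here D = 8).

>> x = randn(8, 1);
>> norm(fft(x))^2      % squared norm in the Fourier domain
>> 8 * norm(x)^2       % D times the squared norm in the spatial domain — equal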

Page 37: Lecture 15 - Advanced Correlation Filters

[Diagram: circularly shifted signal x[τ]; desired response y_τ: 1, 0, 0, ...]

E(\mathbf{h}) = \frac{1}{2}\|\mathbf{y} - \mathbf{x} * \mathbf{h}\|_2^2 + \frac{\lambda}{2}\|\mathbf{h}\|_2^2

Page 38: Lecture 15 - Advanced Correlation Filters

[Diagram: circularly shifted signal x[τ]; desired response y_τ: 1, 0, 0, ...]

E(\hat{\mathbf{h}}) = \frac{1}{2}\|\hat{\mathbf{y}} - \text{diag}\{\hat{\mathbf{x}}\}\hat{\mathbf{h}}\|_2^2 + \frac{\lambda}{2}\|\hat{\mathbf{h}}\|_2^2

Page 39: Lecture 15 - Advanced Correlation Filters

[Diagram: circularly shifted signal x[τ]; desired response y_τ: 1, 0, 0, ...]

E(\hat{\mathbf{h}}) = \frac{1}{2}\|\hat{\mathbf{y}} - \text{diag}\{\hat{\mathbf{x}}\}\hat{\mathbf{h}}\|_2^2 + \frac{\lambda}{2}\|\hat{\mathbf{h}}\|_2^2

(\text{diag}(\hat{\mathbf{x}})^\top \text{diag}(\hat{\mathbf{x}}) + \lambda \mathbf{I})^{-1}

D = number of samples in x

O(D \log D)
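Because the system above is diagonal, the solve itself is element-wise. A minimal 1D MATLAB sketch for a single training pair (x, y) and regularizer lambda — conjugation conventions assumed as in the paper excerpt later in the deck.

>> xf = fft(x); yf = fft(y);
>> hf = (conj(xf) .* yf) ./ (conj(xf) .* xf + lambda);  % element-wise division
>> h = real(ifft(hf));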

Page 40: Lecture 15 - Advanced Correlation Filters

Hermitian Transpose

\text{diag}(\hat{\mathbf{x}})^{\mathsf{H}} = \text{diag}(\text{conj}(\hat{\mathbf{x}}))

>> x = randn(3,1);
>> xf = fft(x);
>> diag(xf)'

ans =

  -1.6409 + 0.0000i   0.0000 + 0.0000i   0.0000 + 0.0000i
   0.0000 + 0.0000i  -2.7579 + 1.2555i   0.0000 + 0.0000i
   0.0000 + 0.0000i   0.0000 + 0.0000i  -2.7579 - 1.2555i

>> conj(xf)

ans =

  -1.6409 + 0.0000i
  -2.7579 + 1.2555i
  -2.7579 - 1.2555i


Page 42: Lecture 15 - Advanced Correlation Filters

CF in a couple of lines of Matlab


…selectivity to spatial frequency, orientation and scale (e.g. oriented edge filters, Gabor filters, etc.).

Prior Art in Correlation Filters: Bolme et al. [2] recently proposed an extension to traditional correlation filters referred to as Minimum Output Sum of Squared Error (MOSSE) filters. This approach has proven invaluable for many object tracking tasks, outperforming current state of the art methods such as [1, 15]. A strongly related method to MOSSE was also proposed by Bolme et al. [3] for object detection/localization, referred to as Average of Synthetic Exact Filters (ASEF), which also reported superior performance to the state of the art. A full discussion of other variants of correlation filters, such as Optimal Tradeoff Filters (OTF) [14] and Unconstrained MACE (UMACE) [16] filters, is outside the scope of this paper. Readers are encouraged to inspect [11] for a full treatment of the topic.

3. Correlation Filters

Due to the efficiency of correlation in the frequency domain, correlation filters have canonically been posed in the frequency domain. There is nothing, however, stopping one (other than computational expense) from expressing a correlation filter in the spatial domain. In fact, we argue that viewing a correlation filter in the spatial domain can give: (i) important links to existing spatial methods for learning templates/detectors, and (ii) crucial insights into fundamental problems in current correlation filter methods.

Bolme et al.'s [2] MOSSE correlation filter can be expressed in the spatial domain as solving the following ridge regression problem,

E(\mathbf{h}) = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{D} \| y_i(j) - \mathbf{h}^\top \mathbf{x}_i[\Delta\tau_j] \|_2^2 + \frac{\lambda}{2}\|\mathbf{h}\|_2^2    (1)

where y_i \in \mathbb{R}^D is the desired response for the i-th observation \mathbf{x}_i \in \mathbb{R}^D and \lambda is a regularization term. C = [\Delta\tau_1, \ldots, \Delta\tau_D] represents the set of all circular shifts for a signal of length D. Bolme et al. advocated the use of a 2D Gaussian of small variance (2-3 pixels) for y_i, centered at the location of the object (typically the centre of the image patch). The solution to this objective becomes,

\mathbf{h} = \mathbf{H}^{-1} \sum_{i=1}^{N} \sum_{j=1}^{D} y_i(j)\,\mathbf{x}_i[\Delta\tau_j]    (2)

where,

\mathbf{H} = \lambda\mathbf{I} + \sum_{i=1}^{N} \sum_{j=1}^{D} \mathbf{x}_i[\Delta\tau_j]\,\mathbf{x}_i[\Delta\tau_j]^\top .    (3)

Solving a correlation filter in the spatial domain quickly becomes intractable as a function of the signal length D, as the cost of solving Equation 2 becomes O(D^3 + ND^2).

Efficiency in the Frequency Domain: It is well understood in the signal processing community that circular convolution in the spatial domain can be expressed as a Hadamard product in the frequency domain. This allows one to express the objective in Equation 1 more succinctly and equivalently as,

E(\hat{\mathbf{h}}) = \frac{1}{2} \sum_{i=1}^{N} \|\hat{\mathbf{y}}_i - \hat{\mathbf{x}}_i \circ \mathrm{conj}(\hat{\mathbf{h}})\|_2^2 + \frac{\lambda}{2}\|\hat{\mathbf{h}}\|_2^2    (4)
                    = \frac{1}{2} \sum_{i=1}^{N} \|\hat{\mathbf{y}}_i - \mathrm{diag}(\hat{\mathbf{x}}_i)^\top \hat{\mathbf{h}}\|_2^2 + \frac{\lambda}{2}\|\hat{\mathbf{h}}\|_2^2 ,

where \hat{\mathbf{h}}, \hat{\mathbf{x}}, \hat{\mathbf{y}} are the Fourier transforms of \mathbf{h}, \mathbf{x}, \mathbf{y}. The complex conjugate of \hat{\mathbf{h}} is employed to ensure the operation is correlation, not convolution. The equivalence between Equations 1 and 4 also borrows heavily upon another well-known property from signal processing, namely Parseval's theorem, which states that

\mathbf{x}_i^\top \mathbf{x}_j = D^{-1}\,\hat{\mathbf{x}}_i^\top \hat{\mathbf{x}}_j \quad \forall i, j, \text{ where } \mathbf{x} \in \mathbb{R}^D .    (5)

The solution to Equation 4 becomes

\hat{\mathbf{h}} = [\mathrm{diag}(\hat{\mathbf{s}}_{xx}) + \lambda\mathbf{I}]^{-1} \sum_{i=1}^{N} \mathrm{diag}(\hat{\mathbf{x}}_i)\,\hat{\mathbf{y}}_i = \hat{\mathbf{s}}_{xy} \circ^{-1} (\hat{\mathbf{s}}_{xx} + \lambda\mathbf{1})    (6)

where \circ^{-1} denotes element-wise division, and

\hat{\mathbf{s}}_{xx} = \sum_{i=1}^{N} \hat{\mathbf{x}}_i \circ \mathrm{conj}(\hat{\mathbf{x}}_i) \quad \& \quad \hat{\mathbf{s}}_{xy} = \sum_{i=1}^{N} \hat{\mathbf{y}}_i \circ \mathrm{conj}(\hat{\mathbf{x}}_i)    (7)

are the average auto-spectral and cross-spectral energies, respectively, of the training observations. The solutions for \hat{\mathbf{h}} in Equations 1 and 4 are identical (other than that one is posed in the spatial domain and the other in the frequency domain). The power of this method lies in its computational efficiency. In the frequency domain a solution to \hat{\mathbf{h}} can be found with a cost of O(ND \log D). The primary cost is associated with the DFT on the ensemble of training signals \{\mathbf{x}_i\}_{i=1}^N and desired responses \{\mathbf{y}_i\}_{i=1}^N.

Memory Efficiency: Inspecting Equation 7 one can see an additional advantage of correlation filters when posed in the frequency domain: memory efficiency. One does not need to store the training examples in memory before learning. As Equation 7 suggests, one simply needs to store a summation of the auto-spectral \hat{\mathbf{s}}_{xx} and cross-spectral \hat{\mathbf{s}}_{xy} energies. This is a powerful result not often discussed in the correlation filter literature: unlike other spatial strategies for learning detectors (e.g. linear SVM), whose memory usage grows as a function of the number of training examples, O(ND), correlation filters have a fixed memory overhead, O(D), irrespective of the number of training examples.

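A minimal MATLAB sketch of the O(D)-memory training loop that Equations 6 and 7 imply; the stacked arrays X and Y (training images and desired responses) and lambda are assumptions for illustration.

% Stream over training pairs, keeping only the running spectra (Eq. 7).
sxx = 0; sxy = 0;
for i = 1:N
  xf = fft2(X(:, :, i));             % i-th training image
  yf = fft2(Y(:, :, i));             % i-th desired response
  sxx = sxx + xf .* conj(xf);        % auto-spectral energy
  sxy = sxy + yf .* conj(xf);        % cross-spectral energy
end
hf = sxy ./ (sxx + lambda);          % Eq. 6: element-wise division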

Page 43: Lecture 15 - Advanced Correlation Filters

CF in a couple of lines of Matlab


>> xf = fft2(x);

Page 44: Lecture 15 - Advanced Correlation Filters

CF in a couple of lines of Matlab


>> xf = fft2(x); >> yf = fft2(y);

Page 45: Lecture 15 - Advanced Correlation Filters16623.courses.cs.cmu.edu/slides/Lecture_15.pdf · Localization rate Threshold (fraction of interocular distance) Human Our method Valstart

CF in a couple of lines of Matlab




>> xf = fft2(x);            % training image to the Fourier domain
>> yf = fft2(y);            % desired (Gaussian) response
>> sxx = xf.*conj(xf);      % auto-spectral energy (single-image case of Eq. 7)
>> sxy = xf.*conj(yf);      % cross-spectral energy (differs from Eq. 7 by an overall conj)
>> hf = sxy./(sxx + 1e-3);  % filter via Eq. 6, with lambda = 1e-3


methods for creating filters, such as cropping a template from an image, produce strong peaks for the target but also falsely respond to background. As a result they are not particularly robust to variations in target appearance and fail on challenging tracking problems. Average of Synthetic Exact Filters (ASEF), Unconstrained Minimum Average Correlation Energy (UMACE), and Minimum Output Sum of Squared Error (MOSSE) (introduced in this paper) produce filters that are more robust to appearance changes and are better at discriminating between targets and background. As shown in Figure 2, the result is a much stronger peak, which translates into less drift and fewer dropped tracks. Traditionally, ASEF and UMACE filters have been trained offline and used for object detection or target identification. In this research, we have modified these techniques to be trained online and in an adaptive manner for visual tracking. The result is tracking with state-of-the-art performance that retains much of the speed and simplicity of the underlying correlation-based approach.

Despite the simplicity of the approach, tracking based on modified ASEF, UMACE, or MOSSE filters performs well under changes in rotation, scale, lighting, and partial occlusion (see Figure 1). The Peak-to-Sidelobe Ratio (PSR), which measures the strength of a correlation peak, can be used to detect occlusions or tracking failure, to stop the online update, and to reacquire the track if the object reappears with a similar appearance. More generally, these advanced correlation filters achieve performance consistent with the more complex trackers mentioned earlier; however, the filter-based approach is over 20 times faster and can process 669 frames per second (see Table 1).

Table 1: This table compares the frame rates of the MOSSE tracker to published results for other tracking systems.

Algorithm       | Frame Rate | CPU
FragTrack [1]   | realtime   | Unknown
GBDL [19]       | realtime   | 3.4 GHz Pentium 4
IVT [17]        | 7.5 fps    | 2.8 GHz CPU
MILTrack [2]    | 25 fps     | Core 2 Quad
MOSSE Filters   | 669 fps    | 2.4 GHz Core 2 Duo
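As a concrete sketch of the PSR computation described above (our code, not the paper's; the 11 x 11 exclusion window around the peak follows the paper's convention):

function psr = peak_to_sidelobe(g)
% g: correlation response map. The sidelobe is everything outside an
% 11 x 11 window centred on the peak; PSR = (peak - mean) / std of sidelobe.
[peak, idx] = max(g(:));
[r, c] = ind2sub(size(g), idx);
mask = true(size(g));
mask(max(1,r-5):min(end,r+5), max(1,c-5):min(end,c+5)) = false;
side = g(mask);
psr = (peak - mean(side)) / std(side);
end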

The rest of this paper is organized as follows. Section 2 reviews related correlation filter techniques. Section 3 introduces the MOSSE filter and how it can be used to create a robust filter-based tracker. Section 4 presents experimental results on seven video sequences from [17]. Finally, Section 5 will revisit the major findings of this paper.

Figure 2: This figure shows the input, filters, and correlation output (columns: Naive, UMACE, ASEF, MOSSE) for Frame 25 of the fish test sequence. The three correlation filters produce peaks that are much more compact than the one produced by the Naive filter.

2 Background

In the 1980s and 1990s, many variants of correlation filters were introduced, including Synthetic Discriminant Functions (SDF) [7, 6], Minimum Variance Synthetic Discriminant Functions (MVSDF) [9], Minimum Average Correlation Energy (MACE) [11], Optimal Tradeoff Filters (OTF) [16], and Minimum Squared Error Synthetic Discriminant Functions (MSESDF) [10]. These filters are trained on examples of target objects with varying appearance, with hard constraints enforced so that the filters always produce peaks of the same height. Most relevant is MACE, which produces sharp peaks and high PSRs.

In [12], it was found that the hard constraints of SDF-based filters like MACE caused issues with distortion tolerance. The solution was to eliminate the hard constraints and instead require the filter to produce a high average correlation response. This new type of "unconstrained" correlation filter, called Maximum Average Correlation Height (MACH), led to a variant of MACE called UMACE.

A newer type of correlation filter called ASEF [3] introduced a method of tuning filters for particular tasks. Where earlier methods just specify a single peak value, ASEF specifies the entire correlation output for each training image. ASEF has performed well at both eye localization [3] and pedestrian detection [4]. Unfortunately ...


Today

• Review - The Correlation Filter

• Multi-Channel Correlation Filters

• Kernel Correlation Filters


Robust Representations

1. Compute image gradients
2. Pool into local histograms
3. Concatenate histograms
4. Normalize histograms

$\mathbb{R}^N \rightarrow \mathbb{R}^{N \times K}$: an $N$-pixel image is mapped to an $N \times K$ multi-channel representation.
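A toy Matlab sketch of this pipeline (ours, purely illustrative; real HOG implementations add interpolation and block normalization, and the function and parameter names here are our own):

function F = toy_hog(I, nbins, cell)
% Toy HOG-like descriptor: grayscale image -> cells x cells x nbins channels.
[gx, gy] = gradient(double(I));                 % 1. compute image gradients
mag = hypot(gx, gy);
ang = mod(atan2(gy, gx), pi);                   % unsigned orientation
bin = min(floor(ang/(pi/nbins)) + 1, nbins);
F = zeros(floor(size(I,1)/cell), floor(size(I,2)/cell), nbins);
for r = 1:size(F,1)                             % 2. pool into local histograms
    for c = 1:size(F,2)
        rows = (r-1)*cell+1 : r*cell;  cols = (c-1)*cell+1 : c*cell;
        m = mag(rows, cols);  b = bin(rows, cols);
        for k = 1:nbins
            F(r,c,k) = sum(m(b == k));          % 3. concatenate as channels
        end
    end
end
F = F ./ (sqrt(sum(F.^2, 3)) + 1e-6);           % 4. normalize histograms (R2016b+)
end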


Multi-Channel Correlation Filters


Abstract

Modern descriptors like HOG and SIFT are now commonly used in vision for pattern detection within image and video. From a signal processing perspective this detection process can be efficiently posed as a correlation/convolution between a multi-channel image and a multi-channel detector/filter which results in a single-channel response map indicating where the pattern (e.g. object) has occurred. In this paper we propose a novel framework for learning a multi-channel detector/filter efficiently in the frequency domain (both in terms of training time and memory footprint) which we refer to as a multi-channel correlation filter. To demonstrate the effectiveness of our strategy, we evaluate it across a number of visual detection/localization tasks where we: (i) exhibit superior performance to current state-of-the-art correlation filters, and (ii) superior computational and memory efficiencies compared to state-of-the-art spatial detectors.

1. Introduction

In computer vision it is now rare for tasks like convolution/correlation to be performed on single-channel image signals (e.g. a 2D array of intensity values). With the advent of advanced descriptors like HOG [4] and SIFT [12], convolution/correlation across multi-channel signals has become the norm rather than the exception in most visual detection tasks. Most of these image descriptors can be viewed as multi-channel images/signals with multiple measurements (such as the oriented edge energies) associated with each pixel location. We shall herein refer to all image descriptors as multi-channel images. An example of multi-channel correlation can be seen in Figure 1, where a multi-channel image is convolved/correlated with a multi-channel filter/detector in order to obtain a single-channel response. The peak of the response (in white) indicates where the pattern of interest is located.

Like single channel signals, correlation between two multi-channel signals is rarely performed naively in the spatial domain. Instead, the fast Fourier transform (FFT) affords the efficient application of correlating a desired template/filter with a signal. Contrastingly, however, most techniques for estimating a detector for such a purpose (i.e. detection/tracking through convolution) are performed in the spatial domain [4]. It is this dilemma that is at the heart of our paper.

Figure 1. An example of multi-channel correlation/convolution where one has a multi-channel image x correlated/convolved with a multi-channel filter h to give a single-channel response y. By posing this objective in the frequency domain, our multi-channel correlation filter approach attempts to give a computationally and memory efficient strategy for estimating h given x and y.

This has not always been the case. Correlation filters, developed initially in the seminal work of Hester and Casasent [7], are a method for learning a template/filter in the frequency domain that rose to some prominence in the 80s and 90s. Although many variants have been proposed [7, 10, 11], the approach's central tenet is to learn a filter that, when correlated with a set of training signals, gives a desired response (typically a peak at the origin of the object, with all other regions of the correlation response map being suppressed). Like correlation itself, one of the central advantages of the approach is that it attempts to learn the filter in the frequency domain due to the efficiency of correlation/convolution in that domain. Hitherto, correlation filter theory, to our knowledge, has been restricted to single-channel signals/filters. In this paper we present an efficient strategy for handling multi-channel signals/filters that has numerous applications throughout vision and learning.

Contributions: In this paper we make the following contributions:

• We extend canonical correlation filter theory to efficiently handle multi-channel signals.
• We propose a multi-channel detector whose training memory is independent of the number of training samples.
• We demonstrate superior performance to current state-of-the-art correlation filters, and superior computational and memory efficiency in comparison to spatial detectors (e.g. linear SVM) with comparable detection performance.


CAR DETECTION

Figure 3. Car detection on the MIT Street Dataset (detection rate versus threshold in pixels, comparing Our method with MOSSE and ASEF).


x: "known signal", y: "known response", h: "unknown filter"

Kiani, Sim and Lucey ICCV 2013.

$$E(\mathbf{h}) = \left\| \mathbf{y} - \sum_{k=1}^{K} \mathbf{x}^{(k)} * \mathbf{h}^{(k)} \right\|_2^2 + \frac{\lambda}{2} \sum_{k=1}^{K} \| \mathbf{h}^{(k)} \|_2^2$$

where $*$ denotes convolution and $K$ = no. of channels.

Kiani, Sim and Lucey ICCV 2013.

Stacking the channels, the same objective can be written as

$$E(\mathbf{h}) = \frac{1}{2} \| \mathbf{y} - \mathbf{X}\mathbf{h} \|_2^2 + \frac{\lambda}{2} \| \mathbf{h} \|_2^2$$

Kiani, Sim and Lucey ICCV 2013.

Solving in the spatial domain requires inverting $(\mathbf{X}\mathbf{X}^T + \lambda\mathbf{I})$, which costs $O(K^3 D^3)$, where $D$ = number of samples in $\mathbf{x}$ and $K$ = no. of channels.

Kiani, Sim and Lucey ICCV 2013.

Posing the same objective in the Fourier domain,

$$E(\hat{\mathbf{h}}) = \frac{1}{2} \| \hat{\mathbf{y}} - \hat{\mathbf{X}}\hat{\mathbf{h}} \|_2^2 + \frac{\lambda}{2} \| \hat{\mathbf{h}} \|_2^2$$

a naive inversion of $\hat{\mathbf{X}}\hat{\mathbf{X}}^T + \lambda\mathbf{I}$ is still $O(K^3 D^3)$.

Kiani, Sim and Lucey ICCV 2013.

In the Fourier domain, however, $\hat{\mathbf{X}} = [\mathrm{diag}(\hat{\mathbf{x}}^{(1)}), \ldots, \mathrm{diag}(\hat{\mathbf{x}}^{(K)})]$, so after re-ordering variables $(\hat{\mathbf{X}}\hat{\mathbf{X}}^T + \lambda\mathbf{I})^{-1}$ breaks into $D$ independent $K \times K$ systems, and the cost drops to $O(K^3 D)$ ($D$ = number of samples in $\mathbf{x}$, $K$ = no. of channels).

Kiani, Sim and Lucey ICCV 2013.
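To make the per-frequency structure explicit, here is a minimal sketch (our own illustration, not the authors' code; single training image, and the conjugation convention may differ from their release). xf holds the K per-channel DFTs of one patch as a D x K matrix, yf the DFT of the desired response:

% Minimal MCCF sketch under the assumptions above: each frequency bin j
% contributes an independent K x K ridge-regression system.
hf = zeros(D, K);
for j = 1:D
    v = xf(j, :).';               % K x 1: all channel values at frequency j
    A = lambda*eye(K) + v*v';     % K x K auto-spectral block
    b = v*conj(yf(j));            % cross-spectral right-hand side
    hf(j, :) = (A \ b).';         % solve the j-th K x K system: O(K^3) each
end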


MOSSE = Single-Channel CF

MCCF = Multi-Channel CF

Kiani, Sim and Lucey ICCV 2013.


Today

• Review - The Correlation Filter

• Multi-Channel Correlation Filters

• Kernel Correlation Filters

$$E(\mathbf{g}) = \frac{1}{2} \sum_{\boldsymbol{\tau} \in \mathcal{C}} \| y_{\boldsymbol{\tau}} - \mathbf{x}[\boldsymbol{\tau}]^T \mathbf{g} \|_2^2$$

where $\mathbf{x}[\boldsymbol{\tau}]$ denotes the patch circularly shifted by $\boldsymbol{\tau}$, and $y_{\boldsymbol{\tau}}$ is the desired response for that shift ($y = [1, 0, 0, \ldots]$: one at the target shift, zero elsewhere).


x[⌧ ]

y⌧ 1 0 0

. . .

. . .

x[⌧ ] ⌧x

⌧y

⌧ = [⌧x

, ⌧y

]

“set of all circular shifts”

E(g) =1

2

X

⌧2C||y⌧ � x[⌧ ]Tg||22

Re-writing the filter in terms of the shifted signals themselves gives the dual objective

$$E(\boldsymbol{\alpha}) = \frac{1}{2} \sum_{\boldsymbol{\tau} \in \mathcal{C}} \left\| y_{\boldsymbol{\tau}} - \mathbf{x}[\boldsymbol{\tau}]^T \sum_{\boldsymbol{\nu} \in \mathcal{C}} \mathbf{x}[\boldsymbol{\nu}] \, \alpha_{\boldsymbol{\nu}} \right\|_2^2$$

Henriques, Caseiro, Martins, and Batista PAMI 2014.

where the filter is a linear combination of the shifted signals,

$$\mathbf{g} = \sum_{\boldsymbol{\tau} \in \mathcal{C}} \alpha_{\boldsymbol{\tau}} \, \mathbf{x}[\boldsymbol{\tau}]$$

Henriques, Caseiro, Martins, and Batista PAMI 2014.
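In the linear case the primal filter can be recovered from the dual variables with a single FFT, since the sum over circular shifts is a circular convolution. A one-line sketch (our variable names; the shift-direction convention may swap convolution for correlation):

% alpha: array of dual coefficients (one per shift), x: the base patch.
g = real(ifft2(fft2(alpha) .* fft2(x)));   % g = sum over tau of alpha_tau * x[tau]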

In matrix form,

$$E(\boldsymbol{\alpha}) = \frac{1}{2} \| \mathbf{y} - \mathbf{K}\boldsymbol{\alpha} \|_2^2 + \frac{\lambda}{2} \boldsymbol{\alpha}^T \mathbf{K} \boldsymbol{\alpha}$$

Henriques, Caseiro, Martins, and Batista PAMI 2014.

where the kernel matrix has entries

$$\mathbf{K}(\boldsymbol{\tau}, \boldsymbol{\nu}) = \mathbf{x}[\boldsymbol{\tau}]^T \mathbf{x}[\boldsymbol{\nu}]$$

Henriques, Caseiro, Martins, and Batista PAMI 2014.


x[⌧ ]

y⌧ 1 0 0

. . .

. . .

Henriques, Caseiro, Martins, and Batista PAMI 2014.

E(↵) =1

2||y �K↵||22 +

2↵TK↵

↵ = (K+ �I)�1y

Reminder: Circulant Toeplitz Matrices

With $\mathbf{X} = [\mathbf{x}[\boldsymbol{\tau}_1], \ldots, \mathbf{x}[\boldsymbol{\tau}_D]]$ collecting all circular shifts, $\mathbf{S} = \mathbf{X}\mathbf{X}^T$ is "circulant Toeplitz".

Henriques, Caseiro, Martins, and Batista PAMI 2014.

A circulant Toeplitz matrix is diagonalized by the Fourier transform: $\mathbf{F}\mathbf{S}\mathbf{F}^H$ is diagonal, so its off-diagonal entries are always zero ($\mathbf{F}$ being the DFT matrix).

Henriques, Caseiro, Martins, and Batista PAMI 2014.
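A quick numeric check of this diagonalization claim (a self-contained sketch, ours):

% Build the matrix of all circular shifts of a random signal and verify
% that the DFT diagonalizes S = X*X' (off-diagonal energy ~ round-off).
D = 8; x = randn(D, 1);
X = zeros(D);
for t = 0:D-1
    X(:, t+1) = circshift(x, t);   % column t+1 is x shifted by t
end
F = fft(eye(D)) / sqrt(D);         % unitary DFT matrix
A = F * (X*X') * F';               % should be diagonal
off_diag = norm(A - diag(diag(A)), 'fro')   % prints a value near 1e-14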

The kernel matrix $\mathbf{K} = \mathbf{X}^T\mathbf{X}$ is likewise circulant Toeplitz, and is diagonalized by the same Fourier basis.

Henriques, Caseiro, Martins, and Batista PAMI 2014.

Page 73: Lecture 15 - Advanced Correlation Filters

Circulant Toeplitz Matrices

$$\mathbf{X} = [\mathbf{x}[\tau_1], \ldots, \mathbf{x}[\tau_D]], \qquad \mathbf{K} = \mathbf{X}'\mathbf{X}$$

"circulant Toeplitz"

What does this imply about the SVD of X?

Henriques, Caseiro, Martins, and Batista, PAMI 2014.
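A quick numerical check of the diagonalization claim (illustrative, not from the slides; dftmtx requires the Signal Processing Toolbox), reusing X and D from the earlier sketch:

% Verify that the DFT diagonalizes S = X*X' built from circular shifts.
F = dftmtx(D) / sqrt(D);             % unitary DFT matrix
S = X * X';
Sf = F * S * F';                     % F' is the conjugate transpose in Matlab
max(max(abs(Sf - diag(diag(Sf)))))   % off-diagonal magnitude: ~1e-15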

Page 74: Lecture 15 - Advanced Correlation Filters

Linear KCF in a couple of lines of Matlab

$$\hat{\boldsymbol{\alpha}} = \hat{\mathbf{y}} \oslash (\hat{\mathbf{k}} + \lambda \mathbf{1})$$

(Hats denote the 2D Fourier transform; ⊘ is element-wise division.)

Henriques, Caseiro, Martins, and Batista, PAMI 2014.

Page 75: Lecture 15 - Advanced Correlation Filters

Linear KCF in a couple of lines of Matlab

>> xf = fft2(x);

$$\hat{\boldsymbol{\alpha}} = \hat{\mathbf{y}} \oslash (\hat{\mathbf{k}} + \lambda \mathbf{1})$$

Henriques, Caseiro, Martins, and Batista, PAMI 2014.

Page 76: Lecture 15 - Advanced Correlation Filters

Linear KCF in a couple of lines of Matlab

>> xf = fft2(x);
>> yf = fft2(y);

$$\hat{\boldsymbol{\alpha}} = \hat{\mathbf{y}} \oslash (\hat{\mathbf{k}} + \lambda \mathbf{1})$$

Henriques, Caseiro, Martins, and Batista, PAMI 2014.

Page 77: Lecture 15 - Advanced Correlation Filters

Linear KCF in a couple of lines of Matlab

>> xf = fft2(x);
>> yf = fft2(y);
>> kf = xf.*conj(xf);

$$\hat{\boldsymbol{\alpha}} = \hat{\mathbf{y}} \oslash (\hat{\mathbf{k}} + \lambda \mathbf{1})$$

Henriques, Caseiro, Martins, and Batista, PAMI 2014.

Page 78: Lecture 15 - Advanced Correlation Filters

Linear KCF in a couple of lines of Matlab

>> xf = fft2(x);
>> yf = fft2(y);
>> kf = xf.*conj(xf);
>> af = yf./(kf + 1e-3);

$$\hat{\boldsymbol{\alpha}} = \hat{\mathbf{y}} \oslash (\hat{\mathbf{k}} + \lambda \mathbf{1})$$

Henriques, Caseiro, Martins, and Batista, PAMI 2014.
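As a quick sanity check (not from the slides), applying the learned alpha back to the training image should approximately reproduce the desired response, since kf .* af ≈ yf when the regularizer is small:

% Sanity check: response of the learned filter on its own training image.
resp = real(ifft2(kf .* af));
max(abs(resp(:) - y(:)))             % close to zero for small lambda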

Page 79: Lecture 15 - Advanced Correlation Filters

Kernel Trick

• We can take advantage of a trick made popular in the SVM literature:

$$\phi(\mathbf{x}_i)^{T}\,\phi(\mathbf{x}_j) = k(\mathbf{x}_i, \mathbf{x}_j)$$

• This is useful if we cannot form $\mathbf{S} = \mathbf{X}\mathbf{X}'$ explicitly (or if it is too costly to do so),

• since we can always form $\mathbf{K} = \mathbf{X}'\mathbf{X}$, where $\mathbf{X} = [\phi(\mathbf{x}[\tau_1]), \ldots, \phi(\mathbf{x}[\tau_D])]$.

Henriques, Caseiro, Martins, and Batista, PAMI 2014.

Page 80: Lecture 15 - Advanced Correlation Filters

Not all Kernels Allowed

• Not all kernels will guarantee that K remains circulant Toeplitz.

• Kernels of the form

$$k(\mathbf{x}_i, \mathbf{x}_j) = \eta(\mathbf{x}_i^{T}\mathbf{x}_j)$$

are guaranteed to form circulant Toeplitz matrices.

• Fortunately, many useful kernels take this form:
  • RBF
  • Polynomial
  • Multi-channel linear
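The RBF case works because circular shifts preserve the signal norm, so the kernel reduces to a function of a cross-correlation. A minimal sketch (assumed, not from the slides) of a Gaussian kernel evaluated against all circular shifts of x in one FFT pass, following the form popularized by Henriques et al.:

function k = gaussian_kernel_corr(x, xp, sigma)
% Gaussian kernel between xp and every circular shift of x at once:
% k(tau) = exp(-(||x||^2 + ||xp||^2 - 2*corr(x,xp)(tau)) / sigma^2).
c = real(ifft2(fft2(x) .* conj(fft2(xp))));   % cross-correlation map over all shifts
k = exp(-(sum(x(:).^2) + sum(xp(:).^2) - 2*c) / sigma^2);
end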

Page 81: Lecture 15 - Advanced Correlation Filters

KCF in a couple of lines of Matlab

$$\hat{\boldsymbol{\alpha}} = \hat{\mathbf{y}} \oslash (\hat{\mathbf{k}} + \lambda \mathbf{1})$$

Page 82: Lecture 15 - Advanced Correlation Filters

KCF in a couple of lines of Matlab

>> xf = fft2(x);

$$\hat{\boldsymbol{\alpha}} = \hat{\mathbf{y}} \oslash (\hat{\mathbf{k}} + \lambda \mathbf{1})$$

Page 83: Lecture 15 - Advanced Correlation Filters

KCF in a couple of lines of Matlab

>> xf = fft2(x);
>> yf = fft2(y);

$$\hat{\boldsymbol{\alpha}} = \hat{\mathbf{y}} \oslash (\hat{\mathbf{k}} + \lambda \mathbf{1})$$

Page 84: Lecture 15 - Advanced Correlation Filters

KCF in a couple of lines of Matlab

>> xf = fft2(x);
>> yf = fft2(y);
>> sf = xf.*conj(xf);

$$\hat{\boldsymbol{\alpha}} = \hat{\mathbf{y}} \oslash (\hat{\mathbf{k}} + \lambda \mathbf{1})$$

Page 85: Lecture 15 - Advanced Correlation Filters

KCF in a couple of lines of Matlab

>> xf = fft2(x);
>> yf = fft2(y);
>> sf = xf.*conj(xf);
>> af = yf./(kf + 1e-3);

$$\hat{\boldsymbol{\alpha}} = \hat{\mathbf{y}} \oslash (\hat{\mathbf{k}} + \lambda \mathbf{1})$$

Page 86: Lecture 15 - Advanced Correlation Filters

KCF in a couple of lines of Matlab

>> xf = fft2(x);
>> yf = fft2(y);
>> sf = xf.*conj(xf);
>> kf = fft2(kernel(ifft2(sf)));
>> af = yf./(kf + 1e-3);

$$\hat{\boldsymbol{\alpha}} = \hat{\mathbf{y}} \oslash (\hat{\mathbf{k}} + \lambda \mathbf{1})$$
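Here kernel is the slide's placeholder for applying η element-wise to the correlation map; a hypothetical instantiation (assumed, not from the slides) using a polynomial kernel:

function k = kernel(s)
% Hypothetical helper: polynomial kernel eta(s) = (s/N + a)^b applied
% element-wise to the correlation map s (here a = 1, b = 2).
k = (s / numel(s) + 1).^2;
end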

Page 87: Lecture 15 - Advanced Correlation Filters

Evaluate KCF on new image z

$$\hat{\mathbf{y}} = \hat{\mathbf{k}}^{\mathbf{z},\mathbf{x}} \odot \hat{\boldsymbol{\alpha}}$$

(⊙ denotes element-wise multiplication.)

Page 88: Lecture 15 - Advanced Correlation Filters

Evaluate KCF on new image z

>> xf = fft2(x);

$$\hat{\mathbf{y}} = \hat{\mathbf{k}}^{\mathbf{z},\mathbf{x}} \odot \hat{\boldsymbol{\alpha}}$$

Page 89: Lecture 15 - Advanced Correlation Filters

Evaluate KCF on new image z

>> xf = fft2(x);
>> zf = fft2(z);

$$\hat{\mathbf{y}} = \hat{\mathbf{k}}^{\mathbf{z},\mathbf{x}} \odot \hat{\boldsymbol{\alpha}}$$

Page 90: Lecture 15 - Advanced Correlation Filters

Evaluate KCF on new image z

>> xf = fft2(x);
>> zf = fft2(z);
>> sf_zx = zf.*conj(xf);

$$\hat{\mathbf{y}} = \hat{\mathbf{k}}^{\mathbf{z},\mathbf{x}} \odot \hat{\boldsymbol{\alpha}}$$

Page 91: Lecture 15 - Advanced Correlation Filters

Evaluate KCF on new image z

>> xf = fft2(x);
>> zf = fft2(z);
>> sf_zx = zf.*conj(xf);
>> yf = kf_zx.*af;

$$\hat{\mathbf{y}} = \hat{\mathbf{k}}^{\mathbf{z},\mathbf{x}} \odot \hat{\boldsymbol{\alpha}}$$

Page 92: Lecture 15 - Advanced Correlation Filters

Evaluate KCF on new image z

>> xf = fft2(x);
>> zf = fft2(z);
>> sf_zx = zf.*conj(xf);
>> kf_zx = fft2(kernel(ifft2(sf_zx)));
>> yf = kf_zx.*af;

$$\hat{\mathbf{y}} = \hat{\mathbf{k}}^{\mathbf{z},\mathbf{x}} \odot \hat{\boldsymbol{\alpha}}$$
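A short usage note (assumed, not from the slides): the predicted target location is the peak of the spatial response.

% Back to the spatial domain; the target sits at the response peak.
resp = real(ifft2(yf));
[r, c] = find(resp == max(resp(:)), 1);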

Page 93: Lecture 15 - Advanced Correlation Filters

More to read…

• Vijaya Kumar, Mahalanobis, and Juday, "Correlation Pattern Recognition", 2010.

• Bolme, Beveridge, Draper, and Lui, "Visual Object Tracking using Adaptive Correlation Filters", CVPR 2010.

• Galoogahi, Sim, and Lucey, "Multi-Channel Correlation Filters", ICCV 2013.

• Henriques, Caseiro, Martins, and Batista, "High-Speed Tracking with Kernelized Correlation Filters", PAMI 2014.

Multi-Channel Correlation Filters

Hamed Kiani Galoogahi, National University of Singapore ([email protected])
Terence Sim, National University of Singapore ([email protected])
Simon Lucey, CSIRO ([email protected])

Abstract

Modern descriptors like HOG and SIFT are now commonly used in vision for pattern detection within image and video. From a signal processing perspective, this detection process can be efficiently posed as a correlation/convolution between a multi-channel image and a multi-channel detector/filter which results in a single-channel response map indicating where the pattern (e.g. object) has occurred. In this paper, we propose a novel framework for learning a multi-channel detector/filter efficiently in the frequency domain, both in terms of training time and memory footprint, which we refer to as a multi-channel correlation filter. To demonstrate the effectiveness of our strategy, we evaluate it across a number of visual detection/localization tasks where we: (i) exhibit superior performance to current state of the art correlation filters, and (ii) exhibit superior computational and memory efficiency compared to state of the art spatial detectors.

1. Introduction

In computer vision it is now rare for tasks like convolution/correlation to be performed on single channel image signals (e.g. a 2D array of intensity values). With the advent of advanced descriptors like HOG [5] and SIFT [13], convolution/correlation across multi-channel signals has become the norm rather than the exception in most visual detection tasks. Most of these image descriptors can be viewed as multi-channel images/signals with multiple measurements (such as the oriented edge energies) associated with each pixel location. We shall herein refer to all image descriptors as multi-channel images. An example of multi-channel correlation can be seen in Figure 1, where a multi-channel image is convolved/correlated with a multi-channel filter/detector in order to obtain a single-channel response. The peak of the response (in white) indicates where the pattern of interest is located.

Like single channel signals, correlation between two multi-channel signals is rarely performed naively in the spatial domain. Instead, the fast Fourier transform (FFT) affords the efficient application of correlating a desired template/filter with a signal. Contrastingly, however, most techniques for estimating a detector for such a purpose (i.e. detection/tracking through convolution) are performed in the spatial domain [5]. It is this dilemma that is at the heart of our paper.

Figure 1. An example of multi-channel correlation/convolution where one has a multi-channel image x correlated/convolved with a multi-channel filter h to give a single-channel response y. By posing this objective in the frequency domain, our multi-channel correlation filter approach attempts to give a computationally and memory efficient strategy for estimating h given x and y.
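A minimal Matlab sketch (not from the paper) of the correlation in Figure 1: each channel is correlated in the Fourier domain and the results are summed into a single-channel response; x and h are assumed to be H x W x K arrays:

% Multi-channel correlation: per-channel correlation in the Fourier
% domain, summed over channels into a single-channel response map.
yf = sum(fft2(x) .* conj(fft2(h)), 3);
response = real(ifft2(yf));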

This has not always been the case. Correlation filters, developed initially in the seminal work of Hester and Casasent [8], are a method for learning a template/filter in the frequency domain that rose to some prominence in the 80s and 90s. Although many variants have been proposed [8, 11, 12], the approach's central tenet is to learn a filter that, when correlated with a set of training signals, gives a desired response (typically a peak at the origin of the object, with all other regions of the correlation response map being suppressed). Like correlation itself, one of the central advantages of the single channel approach is that it attempts to learn the filter in the frequency domain due to the efficiency of correlation/convolution in that domain. Learning multi-channel filters in the frequency domain, however, comes at a high cost of computation and memory usage. In this paper we present an efficient strategy for learning multi-channel signals/filters that has numerous applications throughout vision and learning.


Visual Object Tracking using Adaptive Correlation Filters

David S. Bolme, J. Ross Beveridge, Bruce A. Draper, Yui Man Lui
Computer Science Department, Colorado State University, Fort Collins, CO 80521, USA
[email protected]

Abstract

Although not commonly used, correlation filters can track complex objects through rotations, occlusions and other distractions at over 20 times the rate of current state-of-the-art techniques. The oldest and simplest correlation filters use simple templates and generally fail when applied to tracking. More modern approaches such as ASEF and UMACE perform better, but their training needs are poorly suited to tracking. Visual tracking requires robust filters to be trained from a single frame and dynamically adapted as the appearance of the target object changes.

This paper presents a new type of correlation filter, a Minimum Output Sum of Squared Error (MOSSE) filter, which produces stable correlation filters when initialized using a single frame. A tracker based upon MOSSE filters is robust to variations in lighting, scale, pose, and non-rigid deformations while operating at 669 frames per second. Occlusion is detected based upon the peak-to-sidelobe ratio, which enables the tracker to pause and resume where it left off when the object reappears.

Note: This paper contains additional figures and content that was excluded from CVPR 2010 to meet length requirements.
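The MOSSE filter described above has a simple closed form in the Fourier domain. A minimal sketch paraphrasing the paper's formulation (the cell arrays F and G, holding the FFTs of training patches and of their desired Gaussian responses, and the regularizer value are assumptions):

% MOSSE: H* = sum_i(Gi .* conj(Fi)) ./ (sum_i(Fi .* conj(Fi)) + eps_reg)
eps_reg = 1e-5;                       % small regularizer (assumed value)
num = zeros(size(F{1}));  den = zeros(size(F{1}));
for i = 1:numel(F)
    num = num + G{i} .* conj(F{i});
    den = den + F{i} .* conj(F{i});
end
Hconj = num ./ (den + eps_reg);       % conjugate FFT of the MOSSE filter
% Tracking step: resp = real(ifft2(fft2(patch) .* Hconj));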

1 Introduction

Visual tracking has many practical applications in video processing. When a target is located in one frame of a video, it is often useful to track that object in subsequent frames. Every frame in which the target is successfully tracked provides more information about the identity and the activity of the target. Because tracking is easier than detection, tracking algorithms can use fewer computational resources than running an object detector on every frame.

Visual tracking has received much attention in recent years. A number of robust tracking strategies have been proposed that tolerate changes in target appearance and track targets through complex motions. Recent examples include: Incremental Visual Tracking (IVT) [17], Robust Fragments-based Tracking (FragTrack) [1], Graph Based Discriminative Learning (GBDL) [19], and Multiple Instance Learning (MILTrack) [2]. Although effective, these techniques are not simple; they often include complex appearance models and/or optimization algorithms, and as a result struggle to keep up with the 25 to 30 frames per second produced by many modern cameras (see Table 1).

Figure 1: This figure shows the results of the MOSSE filter based tracker on a challenging video sequence. This tracker has the ability to quickly adapt to scale and rotation changes. It is also capable of detecting tracking failure and recovering from occlusion.

In this paper we investigate a simpler tracking strategy. The target's appearance is modeled by adaptive correlation filters, and tracking is performed via convolution. Naive


High-Speed Tracking with Kernelized Correlation Filters

João F. Henriques, Rui Caseiro, Pedro Martins, and Jorge Batista

Abstract—The core component of most modern trackers is a discriminative classifier, tasked with distinguishing between the target and the surrounding environment. To cope with natural image changes, this classifier is typically trained with translated and scaled sample patches. Such sets of samples are riddled with redundancies – any overlapping pixels are constrained to be the same. Based on this simple observation, we propose an analytic model for datasets of thousands of translated patches. By showing that the resulting data matrix is circulant, we can diagonalize it with the Discrete Fourier Transform, reducing both storage and computation by several orders of magnitude. Interestingly, for linear regression our formulation is equivalent to a correlation filter, used by some of the fastest competitive trackers. For kernel regression, however, we derive a new Kernelized Correlation Filter (KCF), that unlike other kernel algorithms has the exact same complexity as its linear counterpart. Building on it, we also propose a fast multi-channel extension of linear correlation filters, via a linear kernel, which we call Dual Correlation Filter (DCF). Both KCF and DCF outperform top-ranking trackers such as Struck or TLD on a 50 videos benchmark, despite running at hundreds of frames-per-second, and being implemented in a few lines of code (Algorithm 1). To encourage further developments, our tracking framework was made open-source.

Index Terms—Visual tracking, circulant matrices, discrete Fourier transform, kernel methods, ridge regression, correlation filters.


1 INTRODUCTION

Arguably one of the biggest breakthroughs in recent visual tracking research was the widespread adoption of discriminative learning methods. The task of tracking, a crucial component of many computer vision systems, can be naturally specified as an online learning problem [1], [2]. Given an initial image patch containing the target, the goal is to learn a classifier to discriminate between its appearance and that of the environment. This classifier can be evaluated exhaustively at many locations, in order to detect it in subsequent frames. Of course, each new detection provides a new image patch that can be used to update the model.

It is tempting to focus on characterizing the object of interest – the positive samples for the classifier. However, a core tenet of discriminative methods is to give as much importance, or more, to the relevant environment – the negative samples. The most commonly used negative samples are image patches from different locations and scales, reflecting the prior knowledge that the classifier will be evaluated under those conditions.

An extremely challenging factor is the virtually unlimited amount of negative samples that can be obtained from an image. Due to the time-sensitive nature of tracking, modern trackers walk a fine line between incorporating as many samples as possible and keeping computational demand low. It is common practice to randomly choose only a few samples each frame [3], [4], [5], [6], [7].

Although the reasons for doing so are understandable, we argue that undersampling negatives is the main factor inhibiting performance in tracking. In this paper, we develop tools to analytically incorporate thousands of samples at different relative translations, without iterating over them explicitly. This is made possible by the discovery that, in the Fourier domain, some learning algorithms actually become easier as we add more samples, if we use a specific model for translations.

• The authors are with the Institute of Systems and Robotics, University of Coimbra. E-mail: {henriques,ruicaseiro,pedromartins,batista}@isr.uc.pt

These analytical tools, namely circulant matrices, provide a useful bridge between popular learning algorithms and classical signal processing. The implication is that we are able to propose a tracker based on Kernel Ridge Regression [8] that does not suffer from the "curse of kernelization", which is its larger asymptotic complexity, and even exhibits lower complexity than unstructured linear regression. Instead, it can be seen as a kernelized version of a linear correlation filter, which forms the basis for the fastest trackers available [9], [10]. We leverage the powerful kernel trick at the same computational complexity as linear correlation filters. Our framework easily incorporates multiple feature channels, and by using a linear kernel we show a fast extension of linear correlation filters to the multi-channel case.
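A minimal sketch (not from the paper's listing) of that multi-channel extension: with H x W x C feature arrays, the per-channel Fourier products are summed over channels, after which training proceeds exactly as in the single-channel case:

% Multi-channel linear kernel (DCF-style): sum the per-channel spectra
% into a single-channel kernel map, then solve the ridge regression as before.
xf = fft2(x);                        % x is H x W x C
kf = sum(xf .* conj(xf), 3);         % training kernel map, k^{xx}
af = fft2(y) ./ (kf + 1e-3);         % ridge regression solution in the Fourier domain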

2 RELATED WORK

2.1 On tracking-by-detection

A comprehensive review of tracking-by-detection is outside the scope of this article, but we refer the interested reader to two excellent and very recent surveys [1], [2]. The most popular approach is to use a discriminative appearance model [3], [4], [5], [6]. It consists of training a classifier online, inspired by statistical machine learning methods, to predict the presence or absence of the target in an image patch. This classifier is then tested on many candidate patches to find the most likely location. Alternatively, the position can also be predicted directly [7]. Regression with

arXiv:1404.7584v3 [cs.CV] 5 Nov 2014