


Mixing space-time derivatives for video compressive sensing

Yi Yang∗, Hayden Schaeffer∗, Wotao Yin†, Stanley Osher‡

∗Department of Mathematics, University of California, Los Angeles, CA 90095 USA
†Department of Computational and Applied Mathematics, Rice University, Houston, TX 77005 USA

‡Level Set Systems, Pacific Palisades, CA 90272 USA

Abstract—With the increasing use of compressive sensing techniques for better data acquisition and storage, the demand for efficient, accurate, and robust reconstruction algorithms continues to grow. In this work we present a fast total variation based method for reconstructing video compressive sensing data. Video compressive sensing systems store video sequences by taking a linear combination of consecutive spatially compressed frames. In order to recover the original data, our method regularizes both the spatial and temporal components using a total variation semi-norm that mixes information between dimensions. This mixing provides a more consistent approximation of the connection between neighboring frames with little to no increase in complexity. The algorithm is easy to implement, since each iteration contains two shrinkage steps and a few iterations of conjugate gradient. Numerical simulations on real data show large improvements in both the PSNR and visual quality of the reconstructed frame sequences using our method.

I. INTRODUCTION

Compressive sensing (CS) techniques have a large range of applications, from hardware development using 1-bit compressive sensing [1], [2] or coded apertures [3], [4] to methodologies using CS sampling methods [5]. Their popularity stems from the efficient way they handle large data sets, a growing issue in all fields and applications. For example, in information science, large collections of images and videos are obtained at extremely high rates, so efficient transmission and storage, accompanied by robust and accurate reconstruction, are necessary. One recent device, the coded aperture compressive temporal imaging (CACTI) system [6], shows promise for encoding optical data entering at rates approaching 1 exapixel/second. In fact, an emerging application of the CS paradigm, called video compressive sensing (VCS), has enabled high quality reconstruction of videos from very few observed frames.

In terms of the mechanics, VCS methods use physical techniques to code incoming optical data in order to compress either spatial or spectral information. Spatial compression is typically accomplished by using a mechanical grating (the coded aperture) to block the entering data stream, thus subsampling the incoming signal. There are several ways to compress the temporal data [7], [8]; for this work we consider the methods employed in the compressive sensing framework. Here, we assume that the final stored measurement is defined as a linear combination of several spatially compressed frames, thereby removing both the spatial and temporal redundancy [9], [10].

The underlying forward model we consider is as follows:

Fig. 1. Illustration of the compression/encoding process of the video data.

Let X be the compressed data set and F = [F_1, ..., F_T] be the vector formed by all the original frames, where T is the number of frames that are compressed into one. Define A as the CS matrix containing the random frame-by-frame masks as well as the temporal compression. In other words, A can be written as [A_1, ..., A_T] with each A_i being a random binary matrix representing which pixels can store information. X is acquired by taking the sum of A_i F_i over all i, with the product understood as the Hadamard product. For simplicity, here we denote this compression process as X = AF. The CS mask can be generated in two ways, depending on the application. The first is to independently generate A_i for each frame, a common technique in compressive sensing. The second is to generate an initial mask A_1 for the first frame and shift it by one pixel (in a known direction) for each subsequent frame. In hardware, this amounts to translating the coded aperture between frames, which is easy to implement, requires less storage, and usually yields results similar to the first approach [6]. Fig. 1 illustrates the compression process for our problem.
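As a concrete illustration, the encoding step can be simulated as follows. This is a minimal sketch, assuming frames stored as a (T, H, W) NumPy array; the 50% sampling ratio, the horizontal shift direction, and the function names are illustrative choices, not taken from the paper or the CACTI hardware.

```python
import numpy as np

def make_shifted_masks(shape, T, ratio=0.5, seed=0):
    """Generate A_1 at random, then shift it by one pixel per subsequent frame."""
    rng = np.random.default_rng(seed)
    H, W = shape
    A1 = (rng.random((H, W)) < ratio).astype(np.float64)
    # np.roll mimics translating the coded aperture horizontally between frames.
    return np.stack([np.roll(A1, shift=i, axis=1) for i in range(T)])

def compress(F, A):
    """X = sum_i A_i * F_i, with * the Hadamard (elementwise) product."""
    return np.sum(A * F, axis=0)

# Example: compress T = 4 synthetic frames into a single measurement X.
T, H, W = 4, 64, 64
F = np.random.rand(T, H, W)        # stand-in for the original video frames
A = make_shifted_masks((H, W), T)  # shifted binary CS masks
X = compress(F, A)                 # stored measurement, shape (H, W)
```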

Similar to other compressive sensing problems, the linear system X = AF is underdetermined, since there are far fewer measurements than unknown variables. In this way, we can greatly reduce the amount of information stored while still being able to recover the original signal with high accuracy. Our goal is to reconstruct F given A and X by leveraging certain desired sparsity properties of the original video. For image decompression from CS measurements, recent research supports the use of total variation (TV) regularization because of its ability to preserve edge information. With the introduction of the Bregman iteration [11] and the split Bregman algorithm (related to the alternating direction method of multipliers (ADMM) algorithm) [12], TV-based optimization problems can be solved in an extremely efficient way with sharper contrast in the recovered image. This is of particular importance


to video reconstruction, since the temporal compression causes motion blur, adding to the overall difficulty of solving the inverse problem.

Although the application of TV to imaging problems is clear, the extra dimension in videos makes extensions more ambiguous. The most direct way is to incorporate the temporal dimension as a third coordinate, which leads to the following optimization problem (TV3):

$$\min_F \sum_{i=1}^{T} |F_i|_{TV_3}, \quad \text{s.t.} \quad \sum_{i=1}^{T} A_i F_i = X \tag{1}$$

where the 3D TV semi-norm $|\cdot|_{TV_3}$ is defined as

$$|Y|_{TV_3} := |D_3 Y|_1 = \sum_{i,j,k} |D_3 Y_{(i,j,k)}| \tag{2}$$

with the difference operator $D_3 := [D_x, D_y, D_t]$ and the vector norm $|D_3 Y_{(i,j,k)}| := \sqrt{(D_x Y_{(i,j,k)})^2 + (D_y Y_{(i,j,k)})^2 + (D_t Y_{(i,j,k)})^2}$.

The TV3 model is used by TwIST (two-step iterative shrinkage/thresholding) [13] and TVAL3 (TV minimization by augmented Lagrangian and alternating direction algorithm) [14]. The above regularizer places equal importance on all directions and treats the pixel values as piecewise constant in time. The underlying assumption is that the movement between successive frames is small (sparse). Though this model achieves reasonable performance in applications (with respect to PSNR, i.e. Peak Signal-to-Noise Ratio), we can often greatly improve both the PSNR and visual quality of the reconstruction by carefully exploring the temporal component. More examination of this model can be found in Section III.
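For reference, a minimal sketch of evaluating the 3D TV semi-norm in (2), assuming forward differences with replicated boundaries; the paper does not fix a discretization, so this is one common choice.

```python
import numpy as np

def tv3(Y):
    """Isotropic 3D total variation |Y|_TV3 of a video Y with shape (T, H, W)."""
    Dt = np.diff(Y, axis=0, append=Y[-1:, :, :])  # temporal difference  D_t Y
    Dy = np.diff(Y, axis=1, append=Y[:, -1:, :])  # vertical difference  D_y Y
    Dx = np.diff(Y, axis=2, append=Y[:, :, -1:])  # horizontal difference D_x Y
    return np.sum(np.sqrt(Dx**2 + Dy**2 + Dt**2))
```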

Rather than treating the difference between successive frames as sparse, we assume that the difference is piecewise constant. This assumption is consistent with the spatial regularizer: since the typical hypothesis is that each frame in the video can be approximated by a piecewise constant image, the difference image between two successive frames should also belong to that category. As a matter of fact, this is consistent with the dynamics captured in images: a piecewise constant object travels with a certain speed across the field of view. Therefore, instead of considering the L1 norm of the difference image, here we propose the TV norm on the frame difference. As seen in the experimental results presented in Section III, our model provides better quality results regardless of the amount of motion present in the video.

This paper is organized as follows. Section II gives a detailed description of our model and the associated algorithm. In Section III, numerical simulations on real data are provided, which demonstrate the improvement in both numerical and visual quality of the reconstructed video under various circumstances. We also compare our model with several total variation based methods. Lastly, we conclude with a short discussion in Section IV.

II. MODEL AND ALGORITHM DESCRIPTION

The mathematical formulation of our model is as follows:

$$\min_F \sum_{i=1}^{T} |F_i|_{TV} + \lambda \sum_{i=1}^{T-1} |F_{i+1} - F_i|_{TV}, \quad \text{s.t.} \quad \sum_{i=1}^{T} A_i F_i = X \tag{3}$$

where $|F_i|_{TV} = |D F_i|_1$ with $D = [D_x, D_y]$. The model essentially recovers the original data via deblurring the temporal component, reconstructing the compressed spatial component, and interpolating the missing frames.

The first term is the 2D TV semi-norm and the second term is the total variation of the difference between frames. The parameter λ balances the importance of the two regularizers. In practice λ has a direct relationship to T: when T is large (small), a larger (smaller) λ yields better results. This is related to the compression methodology, since a large T is used when successive frames are similar, while a small T is used for highly dynamic frames.
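As a sanity check when tuning λ, the objective in (3) can be evaluated directly. The sketch below reuses the forward-difference convention assumed earlier; the function names are ours.

```python
import numpy as np

def tv2(img):
    """Isotropic 2D total variation |D img|_1 of a single frame."""
    Dy = np.diff(img, axis=0, append=img[-1:, :])
    Dx = np.diff(img, axis=1, append=img[:, -1:])
    return np.sum(np.sqrt(Dx**2 + Dy**2))

def mixed_objective(F, lam):
    """sum_i |F_i|_TV + lam * sum_i |F_{i+1} - F_i|_TV for F of shape (T, H, W)."""
    spatial = sum(tv2(F[i]) for i in range(F.shape[0]))
    temporal = sum(tv2(F[i + 1] - F[i]) for i in range(F.shape[0] - 1))
    return spatial + lam * temporal
```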

Since the model contains L1-type regularizers, we can solve the minimization efficiently by applying the split Bregman technique. The convergence of this algorithm to the minimizer of (3) is guaranteed according to [15]. Introducing auxiliary variables $P_i$ for $i = 1, \ldots, T$ and $Q_i$ for $i = 1, \ldots, T-1$ to substitute for the spatial derivative and the mixed gradient, respectively, the above problem is equivalent to

$$\min_{F,P,Q} \sum_{i=1}^{T} |P_i|_1 + \lambda \sum_{i=1}^{T-1} |Q_i|_1, \quad \text{s.t.} \quad P_i = D F_i,\;\; Q_i = D F_{i+1} - D F_i,\;\; \sum_{i=1}^{T} A_i F_i = X \tag{4}$$

Let us define $\hat{D} F_i = D F_{i+1} - D F_i$, which will stand for the gradient of the time differences. Next, adding back the constraints yields:

$$\begin{aligned}
(F^k, P^k, Q^k) = \operatorname*{argmin}_{F,P,Q}\; & \sum_{i=1}^{T} |P_i|_1 + \lambda \sum_{i=1}^{T-1} |Q_i|_1 && (5)\\
& + \frac{\mu_1}{2}\Big\| \sum_{i=1}^{T} A_i F_i - X + X^{k-1} \Big\|^2 && (6)\\
& + \frac{\mu_2}{2} \sum_{i=1}^{T} \big\| P_i - D F_i + B_i^{k-1} \big\|^2 && (7)\\
& + \frac{\mu_3}{2} \sum_{i=1}^{T-1} \big\| Q_i - \hat{D} F_i + b_i^{k-1} \big\|^2 && (8)
\end{aligned}$$

$$X^k = \sum_{i=1}^{T} A_i F_i^k - X + X^{k-1} \tag{9}$$

$$B_i^k = P_i^k - D F_i^k + B_i^{k-1} \tag{10}$$

$$b_i^k = Q_i^k - \hat{D} F_i^k + b_i^{k-1} \tag{11}$$

where the variables $X^k$, $B^k$, and $b^k$ are the Bregman variables used to enforce the equality constraints. Although the choice of the $\mu_j$ does not affect the final solution, their values relate to the convergence rate. Also, in practice we can simply set $\mu_2 = \mu_3$ since they both control the gradient terms.


The updates for the variables P, Q, and F are done in an alternating fashion by the equations below:

$$P_i^k = \operatorname{shrink}\!\big(D F_i^{k-1} - B_i^{k-1},\; 1/\mu_2\big) \tag{12}$$

$$Q_i^k = \operatorname{shrink}\!\big(\hat{D} F_i^{k-1} - b_i^{k-1},\; \lambda/\mu_3\big) \tag{13}$$

$$\begin{aligned}
F^k = \operatorname*{argmin}_{F}\; & \frac{\mu_1}{2}\Big\| \sum_{i=1}^{T} A_i F_i - X + X^{k-1} \Big\|^2 && (14)\\
& + \frac{\mu_2}{2} \sum_{i=1}^{T} \big\| P_i^k - D F_i + B_i^{k-1} \big\|^2 && (15)\\
& + \frac{\mu_3}{2} \sum_{i=1}^{T-1} \big\| Q_i^k - \hat{D} F_i + b_i^{k-1} \big\|^2 && (16)
\end{aligned}$$

where the shrink function above is defined pointwise for a 2D vector $Y$ as $\operatorname{shrink}(Y, \tau) := \max(\|Y\| - \tau, 0)\, \frac{Y}{\|Y\|}$. For the F variable update, since the associated subproblem is differentiable, taking the first derivative and rearranging terms leads to

$$\big(\mu_1 A^* A + \mu_2 D^* D + \mu_3 \hat{D}^* \hat{D}\big) F = \mathrm{RHS} \tag{17}$$

where $(\cdot)^*$ stands for the adjoint of the operator and $\mathrm{RHS} = \mu_1 A^*(X - X^{k-1}) + \mu_2 D^*(P^k + B^{k-1}) + \mu_3 \hat{D}^*(Q^k + b^{k-1})$. In each iteration of the split Bregman algorithm, (17) only needs to be approximately solved with a few steps of conjugate gradient. Faster convergence can be obtained by using a preconditioner.
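The following is a minimal sketch of approximately solving (17) with SciPy's conjugate gradient, using explicit sparse difference operators. The forward-difference discretization, the function names, and the problem sizes are our assumptions; rhs must be assembled from the current P, Q, and Bregman variables as described above.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

def diff_matrix(n):
    """1D forward-difference matrix with a zero last row (replicated boundary)."""
    D = sp.diags([-np.ones(n), np.ones(n - 1)], [0, 1], format="lil")
    D[-1, :] = 0
    return D.tocsr()

def build_operators(A):
    """Sparse versions of A, the per-frame gradient D, and the mixed operator D-hat."""
    T, H, W = A.shape
    A_mat = sp.hstack([sp.diags(A[i].ravel()) for i in range(T)], format="csr")
    Dx = sp.kron(sp.identity(H), diff_matrix(W))   # horizontal differences
    Dy = sp.kron(diff_matrix(H), sp.identity(W))   # vertical differences
    D = sp.vstack([Dx, Dy], format="csr")          # 2D gradient of one frame
    Ds = sp.kron(sp.identity(T), D, format="csr")  # D applied to every frame
    Dt = diff_matrix(T)[:-1, :]                    # (T-1) x T temporal difference
    Dhat = sp.kron(Dt, D, format="csr")            # gradient of frame differences
    return A_mat, Ds, Dhat

def solve_F_subproblem(A_mat, Ds, Dhat, rhs, mu, F0, maxiter=30):
    """Approximately solve (mu1 A*A + mu2 D*D + mu3 Dhat*Dhat) F = rhs by CG."""
    mu1, mu2, mu3 = mu
    M = mu1 * (A_mat.T @ A_mat) + mu2 * (Ds.T @ Ds) + mu3 * (Dhat.T @ Dhat)
    F, _ = cg(M, rhs, x0=F0, maxiter=maxiter)
    return F  # flattened video of length T*H*W
```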

Altogether, the method reduces to two explicit shrinkage steps and solving a linear system in each iteration, making the algorithm fast and easy to implement.
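For completeness, a sketch of the pointwise shrinkage used in the P and Q updates (12)-(13), together with a schematic outer loop; the small eps guard against division by zero is our addition.

```python
import numpy as np

def shrink(G, tau, eps=1e-12):
    """shrink(Y, tau) = max(||Y|| - tau, 0) * Y / ||Y||, applied pointwise.

    G holds the two gradient channels along axis 0, e.g. shape (2, H*W).
    """
    mag = np.sqrt(np.sum(G**2, axis=0, keepdims=True))
    return np.maximum(mag - tau, 0.0) * G / (mag + eps)

# One outer split Bregman iteration then reads, schematically:
#   P_i = shrink(D F_i - B_i, 1 / mu2)         # eq. (12)
#   Q_i = shrink(Dhat F_i - b_i, lam / mu3)    # eq. (13)
#   F   = approximate CG solve of (17)         # as sketched above
#   update the Bregman variables X, B_i, b_i   # eqs. (9)-(11)
```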

III. NUMERICAL EXPERIMENTS AND DISCUSSION

In this section we use various numerical experiments to illustrate how our model and the associated algorithm can improve the reconstruction results compared with TV3. In order to show the importance of temporal consistency in video reconstruction, we also include the following model (TV2) in our comparison:

$$\min_F \sum_{i=1}^{T} |F_i|_{TV}, \quad \text{s.t.} \quad \sum_{i=1}^{T} A_i F_i = X \tag{18}$$

where the regularizer is decoupled. The TV2 model is able to reconstruct a reasonable result since the CS mask is random. Numerically, (18) is solved via the split Bregman method, similar to the algorithm described in Section II. Our method converges in under a minute for all the experiments described below.

As discussed in [6], shifting the CS masks usually leads to reconstruction results that are comparable with generating a random mask for each frame. Hence we simulate this type of mask generation in our experiments. This is done by first generating a random binary mask with a certain amount of zero elements, then translating the mask in a fixed direction after each frame. We also provide an experiment using random masks, which are generated for each frame independently. In all cases, the parameters are chosen to maximize the PSNR values.
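The reported numbers below are PSNR values; a minimal version of the metric is sketched here, assuming 8-bit frames with a peak value of 255 (the dynamic range is not stated in the paper).

```python
import numpy as np

def psnr(recon, truth, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB between a reconstruction and the ground truth."""
    mse = np.mean((recon.astype(np.float64) - truth.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak**2 / mse)
```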

In the first experiment (Setting 1), 4 frames are compressed into one, each with 50% spatial downsampling. According to

the first row of results displayed in Table I, we can see that our model achieves a much higher PSNR value compared with the results obtained from TV2 and TV3. The model without temporal regularizers provides the worst results among all the candidates with respect to the PSNR values. This justifies the importance of leveraging the relationship between neighboring frames. Note that all the methods yield a relatively good PSNR. Since the region of motion is relatively small compared to the overall size of the image, one could poorly recover the dynamic object and still have a good PSNR.

Next, in Setting 2 we keep the same conditions as in Setting 1 but input a more dynamic video (see Fig. 2). From Table I, we see that our model yields higher PSNRs than the other models. Although all models give a relatively high PSNR, as seen in Fig. 2 there are clear qualitative differences. Our reconstruction has fewer point structures commonly associated with poor CS reconstruction. It is also able to recover the details of the background and capture the moving structures. The TV2 model provides a piecewise constant reconstruction of each frame, but it fails to recover the fine scale details found in the background. The TV3 model is able to reconstruct the background well and some of the textures, but is unable to resolve the moving object. From this example, it is clear that the temporal regularizer helps promote texture reconstruction.

In Setting 3, we increase the T value to 8 while keeping all other conditions the same. The PSNR gap between our model and the other two grows, showing that our model is more robust to larger compression rates. Fig. 3 provides a visual comparison between our model, TV2, and TV3. Fig. 3 (a) shows an original frame from the video sequence and (b) shows the compressed frame X. Comparing the reconstructions (c-e), the zoomed-in images (f-h), and the difference images (i-k), we can easily see that our model gives a smoother reconstruction with sharper contrast. The loss of contrast in (d) is primarily due to the lack of temporal consistency. The point structure effect in (e) is caused by the inaccuracy of L1 regularization for large motion. In the difference images (i-k), it is clear that our model produces the least amount of error in the background component while recovering a piecewise smooth dynamic foreground. The PSNR gain of our model for this particular frame is more than 3 dB.

To examine the effect of both the random mask simulation and the sampling ratio, in the fourth setting we generate an independent random mask for each frame and lower the sampling ratio to 45%. The temporal compression rate T is set at 8 for this case. We can see that all the PSNRs have decreased as expected, while the values from our model are much higher than the others. The PSNRs for TV2 are robust in the sense that they do not change dramatically between the settings; however, they are consistently lower than the ones from the other algorithms. Lastly, when T = 12 and the spatial compression ratio returns to 50% (Setting 5), our model again provides the best PSNR values among all the models.

In general, adding two regularizers for the same variable does not necessarily increase the quality of the reconstruction. In our model, since the regularizers act on different coordinates, both are necessary. In Fig. 4, we compare our


TABLE I. MEAN AND MAXIMUM PSNR COMPARISON BETWEEN DIFFERENT VIDEO RECONSTRUCTION MODELS

              Our model             TV2                   TV3
              mean      max         mean      max         mean      max
  Setting 1   38.6716   39.3424     32.3084   32.5162     36.9144   37.6050
  Setting 2   34.4394   34.9082     30.8161   31.1694     31.7932   32.3711
  Setting 3   36.6830   37.4991     30.0898   30.4783     34.1394   35.3124
  Setting 4   33.4960   34.1152     28.8421   29.1447     31.0489   31.6782
  Setting 5   33.1679   33.6097     27.8260   28.0761     31.0260   31.6416

Fig. 2. Illustration for Setting 2: (a) original, (b) compressed frame, (c) our reconstruction, (d) TV2 reconstruction, (e) TV3 reconstruction, (f) our difference, (g) TV2 difference, (h) TV3 difference. The PSNR of this particular reconstruction with our model is 34.9082, while TV2 gives 31.1694 and TV3 gives 32.3697.

model to the temporal-only version of our model:

$$\min_F |F_1|_{TV} + \sum_{i=1}^{T-1} |F_{i+1} - F_i|_{TV} + |F_T|_{TV}, \quad \text{s.t.} \quad \sum_{i=1}^{T} A_i F_i = X \tag{19}$$

Note that the first and last terms are also time differences with zero boundary conditions. In this form, the recovered video is (in some sense) a piecewise constant interpolation between the two recovered endpoints. In Fig. 4, a motion blur artifact remains in the temporal-only model (g), while our model better localizes the temporal information (d). This also appears in the difference images as a region of error around the cars. In practice, we see that the spatial regularizer stabilizes each frame.

Fig. 3. Illustration for Setting 3: (a) original, (b) compressed frame, (c) our reconstruction, (d) TV2 reconstruction, (e) TV3 reconstruction, (f) our zoomed-in view, (g) TV2 zoomed-in view, (h) TV3 zoomed-in view, (i) our difference, (j) TV2 difference, (k) TV3 difference. The PSNR of this particular reconstruction with our model is 35.4635, compared with 29.7760 for TV2 and 32.4518 for TV3.

IV. CONCLUSION

Our model shows dramatic improvement using a more consistent regularizer. The main function of our temporal regularizer is to better capture both fast moving objects and temporal relationships. Numerical simulations show large improvements in both the PSNR and visual quality of the reconstructions using our method, regardless of the type of dynamics present in the video. In this work, the problem we consider derives from VCS; however, the method presented here can easily be applied to other compression types.


Fig. 4. Comparison between our model and the model with only the temporal regularizer: (a) original, (b) compressed frame, (c) our reconstruction, (d) our zoomed-in view, (e) our difference, (f) temporal-only reconstruction, (g) temporal-only zoomed-in view, (h) temporal-only difference. The PSNR of this particular reconstruction for our model is 34.3988, while the temporal-only regularized version of our model gives 32.2018.

ACKNOWLEDGMENT

H. Schaeffer is supported by the Department of Defense (DoD) through the National Defense Science and Engineering Graduate Fellowship (NDSEG). Y. Yang is supported by NSF DMS 0835863, NSF DMS 0914561, and ONR N00014-11-0749. W. Yin is supported by NSF grants DMS-0748839 and ECCS-1028790, and ARO/ARL MURI grant FA9550-10-1-0567. S. Osher is supported by DARPA KeCom.

REFERENCES

[1] P. T. Boufounos and R. G. Baraniuk, "1-bit compressive sensing," in Information Sciences and Systems, 2008. CISS 2008. 42nd Annual Conference on. IEEE, 2008, pp. 16–21.

[2] M. Yan, Y. Yang, and S. Osher, "Robust 1-bit compressive sensing using adaptive outlier pursuit," Signal Processing, IEEE Transactions on, vol. 60, no. 7, pp. 3868–3875, 2012.

[3] A. Wagadarikar, R. John, R. Willett, and D. Brady, "Single disperser design for coded aperture snapshot spectral imaging," Applied Optics, vol. 47, no. 10, pp. B44–B51, 2008.

[4] A. A. Wagadarikar, N. P. Pitsianis, X. Sun, D. J. Brady et al., "Video rate spectral imaging using a coded aperture snapshot spectral imager," Opt. Express, vol. 17, no. 8, pp. 6368–6388, 2009.

[5] M. Lustig, D. Donoho, and J. M. Pauly, "Sparse MRI: The application of compressed sensing for rapid MR imaging," Magnetic Resonance in Medicine, vol. 58, no. 6, pp. 1182–1195, 2007.

[6] P. Llull, X. Liao, X. Yuan, J. Yang, D. Kittle, L. Carin, G. Sapiro, and D. J. Brady, "Coded aperture compressive temporal imaging," unpublished.

[7] D. Le Gall, "MPEG: A video compression standard for multimedia applications," Communications of the ACM, vol. 34, no. 4, pp. 46–58, 1991.

[8] Y. Hitomi, J. Gu, M. Gupta, T. Mitsunaga, and S. K. Nayar, "Video from a single coded exposure photograph using a learned over-complete dictionary," in Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011, pp. 287–294.

[9] X. Yuan, J. Yang, P. Llull, X. Liao, G. Sapiro, D. J. Brady, and L. Carin, "Adaptive temporal compressive sensing for video," 2013, unpublished.

[10] H. Schaeffer, Y. Yang, and S. Osher, "Real-time adaptive video compressive sensing," 2013, unpublished.

[11] S. Osher, M. Burger, D. Goldfarb, J. Xu, and W. Yin, "An iterative regularization method for total variation-based image restoration," Multiscale Modeling & Simulation, vol. 4, no. 2, pp. 460–489, 2005.

[12] T. Goldstein and S. Osher, "The split Bregman method for L1-regularized problems," SIAM Journal on Imaging Sciences, vol. 2, no. 2, pp. 323–343, 2009.

[13] J. M. Bioucas-Dias and M. A. Figueiredo, "A new TwIST: Two-step iterative shrinkage/thresholding algorithms for image restoration," Image Processing, IEEE Transactions on, vol. 16, no. 12, pp. 2992–3004, 2007.

[14] C. Li, W. Yin, H. Jiang, and Y. Zhang, "An efficient augmented Lagrangian method with applications to total variation minimization," 2012, unpublished.

[15] W. Deng and W. Yin, "On the global and linear convergence of the generalized alternating direction method of multipliers," 2012, unpublished.