
Neurocomputing 82 (2012) 21–28

Contents lists available at SciVerse ScienceDirect

Neurocomputing

journal homepage: www.elsevier.com/locate/neucom

doi:10.1016/j.neucom.2011.09.027

Resolution-invariant coding for continuous image super-resolution

Jinjun Wang a,*, Shenghuo Zhu b

a Epson Research and Development, Inc., 2580 Orchard Parkway, San Jose, CA 95131, United States
b NEC Laboratories America, Inc., 10080 N. Wolfe Road, Cupertino, CA 95014, United States

Article info

Article history:

Received 21 June 2011

Received in revised form 30 August 2011

Accepted 21 September 2011

Communicated by Qingshan Liu

Available online 24 December 2011

Keywords:

Image representation

Sparse-coding

Image super-resolution

* Corresponding author.

E-mail address: [email protected] (J. Wang).

Abstract

The paper presents the resolution-invariant image representation (RIIR) framework. It applies sparse coding with a multi-resolution codebook to learn resolution-invariant sparse representations of local patches. An input image can be reconstructed to higher resolution not only at discrete integer scales, as in many existing super-resolution works, but also at continuous scales, which functions similarly to 2-D image interpolation. The RIIR framework includes the methods of building a multi-resolution bases set from training images, learning the optimal sparse resolution-invariant representation of an image, and reconstructing the missing high-frequency information at continuous resolution levels. Both theoretical and experimental validations of the resolution invariance property are presented in the paper. Objective comparison and subjective evaluation show that the RIIR framework based image resolution enhancement method outperforms existing methods in various aspects.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

Most digital imaging devices produce a rectangular grid of pixels to represent the photographic visual data. This is called the raster image. The human perceptual clarity of a raster image is decided by its spatial resolution, which measures how closely the grid can be resolved. Raster images with higher pixel density are desirable in many applications, such as high resolution (HR) medical images for cancer diagnosis, high quality video conferencing, HD television, Blu-ray movies, etc. There is an increasing demand to acquire HR raster images from low resolution (LR) inputs such as images taken by cell phone cameras, or to convert existing standard definition footage into high definition image/video materials. However, raster images are resolution dependent, and thus cannot scale to arbitrary resolution without loss of apparent quality.

Another generally used image representation is the vector image. It represents the visual data using geometrical primitives such as points, lines, curves, and shapes or polygons. The vector image is totally scalable, which largely contrasts with the deficiency of the raster representation. Hence the idea of vectorizing raster images for resolution enhancement has long been studied. Recently, Ramanarayanan et al. [1] added vectorized region boundaries to the original raster images to improve sharpness in scaled results; Dai et al. [2] represented local image patches using background/foreground descriptors and reconstructed the sharp discontinuity between the two; to allow efficient vector representation of multi-colored regions with smooth transitions, the gradient mesh technique has also been attempted [3]. In addition, commercial software such as [4] is already available. However, vector-based techniques are limited in visual complexity and robustness. For real photographic images with fine texture or smooth shading, these approaches tend to produce over-segmented vector representations using a large number of irregular regions with flat colors. To illustrate, Fig. 1(a) and (b) is vectorized and enlarged to ×3 scale using the methods in [2,4]. The discontinuity artifacts at region boundaries can be easily observed, and the over-smoothed texture regions give the scaled image a watercolor-like appearance.

Alternatively, researchers have proposed to vectorize raster images with the aid of a bases set to achieve higher modeling capacity than simple geometrical primitives. For example, in the image/video compression domain, pre-fixed bases, such as the DCT/DWT bases adopted in the JPEG/JPEG-2000 standards, and anisotropic bases such as contourlets [5], have been explicitly proposed to capture different 2-D edge/texture patterns, because they lead to sparse representations, which are very preferable for compression [6]. In addition to pre-fixed bases, adaptive mixture model representations have also been reported. For example, the Bandelets model [7] partitions an image into squared regions according to local geometric flows, and represents each region by warped wavelet bases; the primal sketch model [8] detects the high entropy regions in the image through a sketching pursuit process, and encodes them with multiple Markov random fields. These adaptive representations capture the stochastic image generating process, therefore they are suited for image parsing, recognition and synthesis.

In the large body of example-based image resolution enhancement literature, also called ‘‘Single Frame Super-Resolution (SR in short)’’, researchers utilize the co-occurrence prior between LR and HR representations in an over-complete bases set to ‘‘infer’’ the HR image. For example, Freeman et al. [9] represented each local region in the LR image using one example LR patch, and applied the co-occurrence prior and global smoothness dependence through a parametric Markov Network to estimate the HR image representation. Wang et al. [10] adopted a Conditional Random Field to infer both the missing HR patches and the point-spread-function parameters. Chang et al. [11] utilized locally linear embedding (LLE) to learn the optimal combination weights of multiple LR base elements to estimate the optimal HR representations. In our previous work [12] and in Yang et al.'s work [13], the sparse-coding model is applied to obtain the optimal reconstruction weights using the whole bases set. In addition to example patches, representing images in transformed domains, such as edge profiles [14], wavelet coefficients [15], image contourlets [16], etc., has also been examined.

Fig. 1. Image SR quality by our RIIR framework. Top: comparison to image vectorization. Bottom: comparison to different example-based methods.

However, although the example-based SR methods significantly improve image quality over 2-D image interpolation, the bases used by existing approaches have only single-scale capacity. E.g., the base used for ×2 up-sizing cannot be used for ×3 up-sizing. Hence these existing methods are not capable of multi-scale image SR. To cope with these limitations, this paper presents a novel method that uses an example bases set yet is capable of multi-scale and even continuous-scale image resolution enhancement. The contribution includes:

• The paper introduces a novel resolution-invariant image representation (RIIR) framework that models the inter-dependency between example base sets of different scales. The paper shows that an image can be encoded into a resolution-invariant representation, such that by applying different bases sets, the LR input can be enhanced to multiple HRs. This capability is obviously important in many novel resolution enhancement applications that existing SR methods cannot handle well.

• The key components of the RIIR framework include constructing an RIIR bases set and coding the image into RIIR. In addition to our previous work [12,17], this paper introduces several coding schemes that all possess the resolution-invariant property, as illustrated in Fig. 1(f)–(h). A comprehensive evaluation was conducted to assess the advantages of the different coding schemes over different aspects.

• The paper further extends the proposed RIIR framework to support continuous scale image SR. A new base for any arbitrary resolution level can be synthesized using the existing RIIR set on the fly. In this way the input image can be enhanced to continuous scales using only matrix–vector multiplication, which can be implemented very efficiently by modern computers.

The rest of the paper is organized as follows: Section 2 introduces the image decomposition model and generalizes the invariant property between different image frequency layers. Section 3 introduces our key RIIR framework based on the invariant property between base sets of different scales. Section 4 applies the RIIR framework for continuous image SR. Section 5 lists the experimental results, and Section 6 summarizes the proposed methods and discusses future works.

2. Resolution invariant property between frequency layers

2.1. Image model

Example-based SR approaches assume that [9] an HR image $I = I^h + I^m + I^l$ consists of a high frequency layer (denoted as $I^h$), a middle frequency layer ($I^m$), and a low frequency layer ($I^l$). The down-graded LR image $\bar{I} = I^m + I^l$ results from discarding the high frequency components from the original HR version. Hence the image super-resolving process strives to estimate the missing high frequency layer $I^h$ by maximizing $\Pr(I^h \mid I^m, I^l)$ for any LR input. In addition, since the high frequency layer $I^h$ is independent of $I^l$ [9], it is only required to maximize $\Pr(I^h \mid I^m)$, which greatly reduces the variability to be stored in the example set.

A typical example-based SR process works as follows: Given an HR image $I$ and the corresponding LR image $I'$, $I'$ is interpolated to the same size as $I$ and denoted as $\bar{I}$. The missing high frequency layer $I^h$ can be obtained by $I^h = I - \bar{I}$. A Gaussian filter $G^l$ is properly defined to obtain the middle frequency layer $I^m$ by $I^m = \bar{I} - \bar{I} * G^l$. Now from $I^h$ and $I^m$, a patch pair set $S = \{S^m, S^h\}$ can be extracted as the example bases set. $S^m = \{p^m_i\}_{i=1}^N$ and $S^h = \{p^h_i\}_{i=1}^N$ represent the middle frequency and the high frequency bases respectively. Each element pair $\{p^m_i, p^h_i\}$ is the column expansion of a square image patch from the middle frequency layer $I^m$ and the corresponding high frequency layer $I^h$. The dimensions of $p^m_i$ and $p^h_i$ are $D^m \times 1$ and $D^h \times 1$ respectively, and often $D^m \neq D^h$. Now from a given LR input, the middle frequency patches can be extracted accordingly and denoted as $\{y^m_j\}$. The missing high frequency components $\{y^h_j\}$ are estimated based on the co-occurrence patterns stored in $S$. The following subsections review three different models for the estimation process.
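To make this construction concrete, below is a minimal NumPy/SciPy sketch of the extraction step. It is a sketch under simplifying assumptions, not the paper's implementation: grayscale floating-point images, scipy.ndimage.zoom as the resampler, a Gaussian as $G^l$, equal patch sizes (so $D^m = D^h$, although the paper allows $D^m \neq D^h$), and illustrative values for the patch size, filter width and sampling stride.

import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def build_patch_pairs(I, u=3, patch=5, sigma=1.0, stride=2):
    # LR version of I: downsample to 1/u scale, then interpolate back (see Section 3.1).
    small = zoom(I, 1.0 / u)
    I_bar = zoom(small, (I.shape[0] / small.shape[0], I.shape[1] / small.shape[1]))
    I_h = I - I_bar                              # missing high frequency layer I^h
    I_m = I_bar - gaussian_filter(I_bar, sigma)  # middle frequency layer I^m
    Pm, Ph = [], []
    for r in range(0, I.shape[0] - patch + 1, stride):
        for c in range(0, I.shape[1] - patch + 1, stride):
            Pm.append(I_m[r:r + patch, c:c + patch].ravel())  # column expansion p^m_i
            Ph.append(I_h[r:r + patch, c:c + patch].ravel())  # paired p^h_i
    return np.asarray(Pm).T, np.asarray(Ph).T    # D x N matrices P^m and P^h

Each column of the returned matrices is the column expansion of one patch, so column $i$ of $P^m$ and column $i$ of $P^h$ form the pair $\{p^m_i, p^h_i\}$.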

2.2. Nearest neighbor

Assuming that image patches follow a Gaussian distribution, i.e., $\Pr(y^m) \sim \mathcal{N}(\mu^m, \Sigma^2)$ and $\Pr(y^h \mid y^m) \sim \mathcal{N}(\mu^h, \Sigma^2)$, it can be easily verified that, for any observed patch $y^m_j$ from the LR input, the maximum likelihood (ML) estimation of $\mu^m_j$ minimizes the following objective function:

$$\{\mu^{m*}_j\} = \arg\min_{\{\mu^m_j\} \subset \{p^m_i\}_{i=1}^N} \|y^m_j - \mu^m_j\|^2, \qquad (1)$$

which yields a 1-nearest neighbor (1-NN) solution. With the co-occurrence prior, the corresponding $\mu^{h*}_j$ is the ML estimation of $\mu^h_j$, which is then used as the missing $y^h_j$ for reconstruction.
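As an illustration, here is a minimal sketch of the 1-NN estimation, assuming the $D \times N$ base matrices Pm and Ph produced by the extraction sketch in Section 2.1:

import numpy as np

def nn_code(ym, Pm, Ph):
    # Eq. (1): the closest middle frequency base is the ML estimate of mu^m_j.
    i = np.argmin(np.sum((Pm - ym[:, None]) ** 2, axis=0))
    # Co-occurrence prior: its paired high frequency patch serves as y^h_j.
    return Ph[:, i]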

The 1-NN estimation only considers the local observation, hence the performance of the 1-NN method heavily depends on the example images. Freeman et al. proposed a parametric Markov Network (MN) to incorporate a neighboring smoothness constraint [9]. Their method strives to find $\mu^m_j \in \{p^m_i\}_{i=1}^N$ such that $\mu^m_j$ is similar to $y^m_j$, and the corresponding $\mu^h_j$ follows a certain smoothness constraint within the 4-connection neighborhood. This equals minimizing the following objective function for the whole network:

$$\{\mu^{m*}_j\} = \arg\min_{\{\mu^m_j\} \subset \{p^m_i\}_{i=1}^N} \sum_j \left( \|y^m_j - \mu^m_j\|^2 + \lambda \|\mu^h_j - O(\mu^h_j)\|^2 \right), \qquad (2)$$

where $O(\mu^h_j)$ represents the region of $\mu^h_j$ overlapped by neighboring patches. Basically, the second term in Eq. (2) penalizes the pixel difference at overlapped regions. Compared to Eq. (1), Eq. (2) obtains a maximum a posteriori (MAP) estimation of $\mu^m$, hence the result is more stable and robust. However, due to the cyclic dependencies of the Markov Network, exact inference of Eq. (2) is a #P-complete problem, and thus computationally intractable. One feasible solution is to break the inference process into two individual steps [9,10,18]: First, for each input patch $y^m_j$, the $K$-NN $\{p^m_k\}_{k=1}^K$ are selected from the training data. This minimizes $\|y^m_j - \mu^m_j\|^2$ in Eq. (2); second, the $K$ corresponding high frequency patches $\{p^h_k\}_{k=1}^K$ are used as candidates to search for a winner that minimizes $\|\mu^h_j - O(\mu^h_j)\|^2$, using approximation techniques such as Bayesian belief propagation [9], Gibbs sampling [10], graph-cut [18], etc. The winner is the estimated $\mu^{h*}_j$, which is then used as the final $y^h_j$ for reconstruction.

2.3. Local linear embedding

The two-step strategy for solving Eq. (2) is computationally expensive. Besides, the improvement in SR image quality over 1-NN is limited. Chang et al. [11] introduced an alternative approach where the problem of reconstructing the optimal $y^h_j$ is regarded as discovering the local linear embedding (LLE) in the original $\mathbb{R}^{D^m}$ space and reconstructing in another $\mathbb{R}^{D^h}$ space.

The LLE method works in the following manner. First, for each input patch $y^m_j$, the $K$ nearest neighboring patches $\{p^m_k\}_{k=1}^K$ are selected as in Section 2.2. Next, the optimal embedding weights $x^*_j$ are obtained by

$$\{x^*_j\} = \arg\min_{\{x_j\}} \|y^m_j - P^m_j x_j\|^2 + \lambda \|x_j\|^2, \qquad (3)$$

where $P^m_j$ is the $D^m \times K$ matrix representation of $\{p^m_k\}_{k=1}^K$ and $x_j = [x_1, \ldots, x_K]^\top$. The regularization term $\lambda \|x_j\|^2$ is added to improve the condition of the least squares fitting problem.

Then $x^*_j$ is used to estimate $\mu^{h*}_j$ by

$$\{\mu^{h*}_j\} = \{P^h_j x^*_j\}, \qquad (4)$$

where $P^h_j$ is the $D^h \times K$ matrix representation of $\{p^h_k\}_{k=1}^K$ that corresponds to $\{p^m_k\}_{k=1}^K$. $\mu^{h*}_j$ is used as the computed $y^h_j$ for the final reconstruction, and pixels in the neighboring overlapped regions simply take their average values.
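A minimal sketch of this LLE coding follows, with the closed-form ridge solution standing in for Eq. (3); the base matrices are those of the Section 2.1 sketch, and K and lam are illustrative values:

import numpy as np

def lle_code(ym, Pm, Ph, K=5, lam=1e-3):
    # Step 1: select the K nearest middle frequency bases (as in Section 2.2).
    idx = np.argsort(np.sum((Pm - ym[:, None]) ** 2, axis=0))[:K]
    Pmj, Phj = Pm[:, idx], Ph[:, idx]
    # Step 2: Eq. (3) is ridge regression, solved here in closed form.
    x = np.linalg.solve(Pmj.T @ Pmj + lam * np.eye(K), Pmj.T @ ym)
    # Eq. (4): transfer the embedding weights to the high frequency bases.
    return Phj @ x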

2.4. Sparse coding

The performance of the LLE method is limited by the quality of the $K$ candidates $\{p^m_k\}_{k=1}^K$, hence the solution by LLE is sub-optimal. In fact, the searching and embedding steps in the LLE method can be addressed simultaneously, i.e., searching for a set of base elements whose combination is a good estimation of the input. This equals learning the optimal $\{x^*_j\}$ that minimizes the following objective function:

$$\{x^*_j\} = \arg\min_{\{x_j\}} \|y^m_j - P^m x_j\|^2 + \gamma \phi(x_j), \qquad (5)$$

where $P^m$ is a $D^m \times N$ matrix representing the middle frequency patch set $S^m = \{p^m_i\}_{i=1}^N$ in the training data.

It can be easily found that Eq. (5) is very similar to Eq. (3), except that $P^m$ is used instead of $P^m_j$. Since $S^m$ is usually over-complete, the regularization term $\phi(x_j)$ is very important. In our previous work [12], an $L_1$ regularization is suggested, and the optimization problem becomes learning the sparse coding (SC) [19] for each $y^m_j$ individually. More details can be found in [12]. Similar to Sections 2.3 and 2.2, the obtained $\{x_j\}^*$ can be applied to estimate the high frequency layer, i.e., to estimate $\{\mu^{h*}_j\}$ using Eq. (4), and then $\{y^h_j\}$ accordingly.
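A minimal sketch of the SC coding step; scikit-learn's Lasso is used here as a stand-in for the sparse-coding solver of [19], and gamma is an illustrative value. The alpha rescaling accounts for Lasso's 1/(2n) normalization of the data term:

import numpy as np
from sklearn.linear_model import Lasso

def sc_code(ym, Pm, gamma=0.05):
    # min_x ||y^m - P^m x||^2 + gamma |x|_1; Lasso minimizes
    # (1/2n)||y - Xw||^2 + alpha |w|_1, hence alpha = gamma / (2n).
    solver = Lasso(alpha=gamma / (2 * len(ym)), fit_intercept=False, max_iter=10000)
    solver.fit(Pm, ym)
    return solver.coef_  # sparse N-dimensional representation x*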

2.5. Invariant property between different frequency layers

In the above subsections, each image patch $y^m_j$ is converted into a local representation $x^*_j$, using either the NN, LLE or SC model. Each representation $x^*_j$ is a sparse $N \times 1$ vector with only one non-zero element (in the NN model), $K$ non-zero elements (in the LLE model), or a small number of non-zero elements (in the SC model). For simplicity, we call such a process the coding process. When $\{x^*_j\}$ is obtained, the reconstruction process calculates Eq. (4) for all three models. The difference among the three models is the objective function used during the coding process:

• In the NN model, $x^*_j$ is obtained by writing Eq. (1) as

$$x^*_j = \arg\min_{x_j} \|y^m_j - P^m x_j\|^2 \quad \text{s.t. } x_j \in \{0,1\}^N \text{ and } \textstyle\sum x_j = 1, \qquad (6)$$

where $P^m$ is the same as that used in Eq. (5).

• In the LLE model, according to Eq. (3), $x^*_j$ is obtained by

$$x^*_j = \arg\min_{x_j} \|y^m_j - (P^m \circ A_j) x_j\|^2 + \lambda \|x_j\|^2, \qquad (7)$$

where the term $P^m \circ A_j$ denotes the neighboring relation using matrix manipulation. $A_j$ is a $D^m \times N$ matrix and can be factorized as $A_j = I^m a_j$, where $I^m$ is a $D^m \times 1$ unit vector, $a_j \in \{0,1\}^N$, and each element $a_k$ of $a_j$ is given by

$$a_k = \begin{cases} 1 & \text{if } \|y^m_j - p^m_k\|^2 \le d_{thr},\ p^m_k \in S^m, \\ 0 & \text{otherwise}, \end{cases} \qquad (8)$$

where $d_{thr}$ is a pre-defined threshold that controls the number of NNs to be selected. It can be regarded as a constant.

• In the SC model, $x^*_j$ is obtained by Eq. (5) with $\phi(x_j) = |x_j|_1$.

Our intention in discussing these different coding models is that, although $x^*_j$ is learned from the middle frequency layer by Eqs. (1), (3) and (5), it can be directly applied to compute the missing components in the high frequency layer by Eq. (4). Such an invariant property is generalized in Theorem 2.1 below:

Theorem 2.1. The optimal representation $x^*$ is invariant across different frequency layers, given respective bases of the corresponding frequency components.

Theorem 2.1 is a direct result of the image co-occurrence prior, and has been validated by numerous example-based SR works. However, such an invariant property depends on the example patch pair set $S$, where the optimal representation $x^*$ is invariant across the middle and the high frequency layers only under a defined up-scaling factor, such as ×2, ×3, etc. In this paper, the correlations between base sets of different scales are of interest. The following section introduces another invariant property between different base sets. We call it the resolution-invariant image representation (RIIR).

3. Resolution-invariant image representation

3.1. Generating multi-resolution base set

To examine the relation among different resolution versions of the same image, a multi-resolution image patch pair set $S$ is generated: First, each image $I$ in a given dataset is processed to obtain its LR version $I_u$ by first downsampling $I$ to $1/u$ scale, then upsampling it back to the original size. As explained in Section 2.1, in this way $N$ image patch pairs can be extracted from the $I^m$ and $I^h$ layers respectively, and we denote the obtained set as $S_u = \{S^m_u, S^h_u\} = \{p^m_{i,u}, p^h_{i,u}\}_{i=1}^N$.

Next, multiple scales $u = 1, \ldots, U$ are applied to obtain a multi-resolution bases set. In particular, the order of the elements in each set is specially arranged such that the $i$th pair in $\{S^m_u, S^h_u\}$ and the $i$th pair in $\{S^m_U, S^h_U\}$ are from patches at the same location, as highlighted in Fig. 2.

With the obtained $S = \{S_u\}$, $u = 1, \ldots, U$, the next subsection examines the relation among these multiple base sets to reveal another invariance property.
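A minimal sketch of this construction, reusing build_patch_pairs from the Section 2.1 sketch; because every level is extracted on the same sampling grid, the $i$th pair is aligned across levels as required:

def build_riir_set(I, U=5, patch=5, sigma=1.0, stride=2):
    # S[u] = (P^m_u, P^h_u): one aligned patch pair set per resolution level u.
    return {u: build_patch_pairs(I, u=u, patch=patch, sigma=sigma, stride=stride)
            for u in range(1, U + 1)}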

3.2. Invariant property between different base sets

Fig. 2. Sampling the base set (the original HR image $I$ at resolution level 1; interpolated images $I_u, \ldots, I_U$ with their LR sources $I'_u, \ldots, I'_U$ at resolution levels $u, \ldots, U$).

Ideally, obtaining $I_u$ requires first a downsampling process and then an upsampling process. The downsampling process consists of applying an anti-aliasing filter to reduce the bandwidth of the signal and then a decimator to reduce the sample rate; the upsampling process, on the other side, increases the sample rate and then applies an interpolation filter to remove the aliasing effects [20]. Both the interpolation and the anti-aliasing filters are low-pass filters, and they can be combined into a single filter. In practice, the filter with the smallest bandwidth is more restrictive, and thus can be used in place of both filters.

Now assuming both the HR image $I$ and several downgraded LR images $I_u$ are available for training (the notations $I$ and $I_1$ are interchangeable hereafter), each $I_u$ can be modeled by

$$I_u = ((I_1 * G_{1/u}) \downarrow_{1/u} \uparrow_{u/1}) * G_{u/1} = I_1 * G^m_u, \qquad (9)$$

where $G_{1/u}$ is the anti-aliasing filter, $G_{u/1}$ is the interpolation filter, and $\downarrow_{1/u}$ / $\uparrow_{u/1}$ is the downsampler/upsampler. The combined filter is the one with the smallest bandwidth between $G_{1/u}$ and $G_{u/1}$. For simplicity, we denote the true combined filter as $G^m_u$ for later discussion.

The downsampling/upsampling steps are generally not reversible. The difference between the obtained $I_u$ and the original $I_1$ is the missing high frequency layer $I^h_u$ that needs to be estimated (Section 2.1). Similarly, the middle frequency layer $I^m_u$ can be obtained by

$$I^m_u = I_u - I_u * G^l_u = I_1 * G^m_u - I_1 * G^m_u * G^l_u = I_1 * G_u, \qquad (10)$$

where $G_u = G^m_u - G^m_u * G^l_u$, and $G^l_u$ denotes the combined filter to further discard the middle frequency layer from $I_u$.

Let $P^m_u$ be a $D^m_u \times N$ matrix representing all the elements in $S^m_u$, where $D^m_u$ is the dimension of patch $p^m_u$; let $y^m_u$ be the middle frequency component of an input patch $y_u$, and let $g_u$ be the column expansion of $G_u$. With Eq. (10), we have

$$P^m_u = P^m_1 * g_u, \qquad (11)$$

where the convolution applies to each row of $P$, and

$$y^m_u = y^m_1 * g_u. \qquad (12)$$

To see whether the representation learned by SC is independent of $u$, substituting Eqs. (11) and (12) into Eq. (5), the optimal representation under resolution $u$ is obtained by

$$x^*_u = \arg\min_{x_u} \|y^m_u - P^m_u x_u\|^2 + \gamma |x_u|_1 = \arg\min_{x_u} \|y^m_1 * g_u - (P^m_1 * g_u) x_u\|^2 + \gamma |x_u|_1 = \arg\min_{x_u} \|C_u (y^m_1 - P^m_1 x_u)\|^2 + \gamma |x_u|_1, \qquad (13)$$

where $C_u$ is the convolution matrix formed by $g_u$. The solution is independent of $u$, i.e., $x^*_u = x^*_1$, when $C_u^\top C_u$ is a unitary matrix, which requires $g_u$ to be the Dirac delta function. The proofs for the NN and LLE models are given in Appendices A and B respectively.

3.3. Validation for realistic imaging system

In a realistic imaging process, $g_u$ is a low-pass filter with sufficiently small bandwidth at small scale factors, such that $g_u$ usually approximates the Dirac delta function well. Although the exact parameters of $g_u$ are unknown, the resulting invariance property can be examined by directly measuring the similarity of the $x$ learned from different resolution versions. To depict, from a training HR image, we first generated a multi-resolution patch pair set $S$ with around 8000 patch pairs at each resolution level. Next, we extracted around 2000 patches from each of the five resolution versions of the same testing image. Then we solved Eqs. (2), (3) and (5) to get the optimal representations $x = \{x_{j,u}\}_{j=1}^{2000}$, $u = 1, \ldots, 5$ at each resolution level for the NN, LLE and SC models respectively. If Theorem 3.1 holds, $x_{j,u}$ should be very similar to $x_{j,v}$, $\forall u \neq v$. Hence we computed the overall similarity between every two elements in $x_j = \{x_{j,1}, x_{j,2}, x_{j,3}, x_{j,4}, x_{j,5}\}$ by

$$\mathrm{sim}(x_j) = \frac{1}{C_5^2} \sum_{u=1}^{4} \sum_{v=u+1}^{5} \mathrm{cor}^j_{u,v},$$

where $\mathrm{cor}^j_{u,v}$ is the correlation between $x_{j,u}$ and $x_{j,v}$. Finally, the overall similarity is averaged over the 2000 patches to get a score.

To make the experiment more comprehensive, we tested different redundancy levels in the base set by either randomly removing elements or using K-Means clustering to reduce the base cardinality from 8000 down to 50. The experiments were repeated 5 times with different training/testing images, and the results are shown in Fig. 3. As can be seen from Fig. 3, the lower bound of the similarity score is greater than 0.44, and the maximal score reaches almost 0.8. The results validate the high similarity between $x_u$ and $x_v$ from different resolutions. The reason why the scores decrease as the cardinality increases is the over-complete nature of the base, where the coding process may not select exactly the same basis for reconstruction. When such redundancy is removed, the similarity between representations becomes significantly higher.
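For reference, a minimal sketch of this similarity score for one patch $j$, where xs holds the five codes $x_{j,1}, \ldots, x_{j,5}$:

import numpy as np
from itertools import combinations

def similarity(xs):
    # Average pairwise correlation over the C(5,2) = 10 resolution pairs.
    cors = [np.corrcoef(xs[u], xs[v])[0, 1]
            for u, v in combinations(range(len(xs)), 2)]
    return float(np.mean(cors))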

Based on both the theoretical proof in Eq. (13) and the experimental validation in Fig. 3, we can generalize the second invariant property for a multi-resolution base set:

Fig. 3. Correlation between different resolution versions (x-axis: base cardinality; y-axis: correlation; curves: NN, LLE and SC, each with K-Means or random cardinality reduction).

Theorem 3.1. The optimal representation $x^*$ is invariant across different resolution versions, given respective bases of the corresponding resolutions.

Theorem 3.1 reveals that, if the different resolution versions of the same image are related by Eq. (9), then the optimal representation learned from one resolution can also be applied to another. For simplicity, we call Theorem 3.1 the resolution-invariant image representation (RIIR) property, and the multi-resolution set $S$ an RIIR set. With the RIIR property, the computationally expensive coding process can be skipped in multi-scale resolution enhancement tasks, as discussed in the next section.

4. Applying RIIR for continuous image SR

There are many scenarios where users need different resolution versions of the same image/video input, which requires a multi-scale image SR capacity. An RIIR set $S = \{S_u\}$, $u = 1, \ldots, U$ is born with the multi-scale reconstruction ability at discrete scales, because each $S_u$ can be used for ×u image SR by an existing example-based SR method [9]. One advantage of the RIIR framework is that, instead of solving the optimal representation $\{x^*\}$ under each scale independently, $\{x^*\}$ only needs to be learned once. By applying a different $S_u$, the same $\{x^*\}$ can be used to reconstruct the image at multiple scales. Finer scale factors are achievable by simply increasing the number of resolution levels in the RIIR set. During reconstruction, only the matrix–vector multiplication of Eq. (4) is required, which can be implemented very efficiently. In addition, the RIIR set can be stored locally, while the computed RIIR can be transmitted together with the image/video document.
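A minimal sketch of this reuse for a single input patch, building on the sc_code and build_riir_set sketches above: the coding is performed once at a reference level, and each additional scale then costs only one matrix–vector product:

def multiscale_sr_patch(ym, S, gamma=0.05, code_level=3):
    Pm_ref, _ = S[code_level]
    x = sc_code(ym, Pm_ref, gamma)  # coding: done once, at one level
    # Eq. (4) at every scale: the same x is paired with each level's P^h_u.
    return {u: Ph @ x for u, (Pm, Ph) in S.items()}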

To further extend RIIR to support continuous-scale SR, a new base can be synthesized at the required scale on the fly. To elaborate, let $v$ be the target scale factor, which lies between $u$ and $u+1$; the $i$th element in $S_v$ can be synthesized by

$$p_{i,v} = w_{u,v}\, \tilde p_{i,u} + (1 - w_{u,v})\, \tilde p_{i,u+1}, \qquad (14)$$

where $\tilde p_{i,u}$ is the patch interpolated from scale $u$, and $\tilde p_{i,u+1}$ is interpolated from scale $u+1$. The weight is $w_{u,v} = (1 + \exp((v - u - 0.5) \cdot t))^{-1}$, where in our implementation $t = 10$ empirically.
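A minimal sketch of Eq. (14), where resize_patches is a hypothetical helper that interpolates every patch (column) of a base matrix to the patch size required at scale v:

import numpy as np

def synthesize_base(P_u, P_u1, u, v, resize_patches, t=10.0):
    # Logistic blending weight w_{u,v}, with t = 10 as in the paper.
    w = 1.0 / (1.0 + np.exp((v - u - 0.5) * t))
    # Blend the patches interpolated from the two bracketing levels u and u+1.
    return w * resize_patches(P_u, v) + (1.0 - w) * resize_patches(P_u1, v)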

5. Experimental results

5.1. Multi-scale image SR

This subsection compares the quality of super-resolved images by the RIIR framework with existing SR methods. Since most of these benchmark methods do not support continuous SR, we compared the image quality under multiple discrete scales. To begin with, an RIIR set $S$ was trained. Around 20 000 patch pair examples were extracted from some widely used images such as the ‘‘peppers’’ image.

First, 25 testing images were processed to compare with existing example-based SR methods that use the same coding model but without the RIIR technique, including ‘‘KNN’’ [9], ‘‘LLE’’ [11] and ‘‘SC’’ [13]. The RIIR was learned at the ×3 scale, and multiple up-scale factors from ×2 to ×6 were specified for reconstruction. The processing time was logged on a DELL PRECISION 490 PC (3.2 GHz CPU, 2 GB RAM), and the results are listed in Table 1. As can be seen, while the same amount of computation is required to calculate the coding at the ×3 scale, for the remaining scales the computation becomes negligible.

Table 1
Comparison of average SR processing time (seconds) with/without RIIR.

Scale   NN      RIIR (NN)   LLE     RIIR (LLE)   SC      RIIR (SC)
×2      3.89    0.11        7.03    0.13         19.45   0.16
×3      11.15   11.15       13.97   13.97        54.22   54.22
×4      14.39   0.19        22.73   0.23         98.59   0.28
×5      14.69   0.28        26.86   0.34         159.92  0.45
×6      15.11   0.42        28.94   0.52         249.68  0.66


Fig. 5. Illustration of continuous image scaling (top-left: the original image; first row: BiCubic interpolation; second row: RIIR with NN model; third row: RIIR with LLE model; last row: RIIR with SC model).

Fig. 4. Average SR quality under different scales (x-axis: scale, ×2 to ×6; y-axis: PSNR score over BiCubic interpolation; curves: RIIR(NN), RIIR(LLE), RIIR(SC), NN, LLE, SC, Enhance, Soft Edge).


Next, the quality of the generated SR images was evaluated. In addition to the benchmark methods used in Table 1, two functional interpolation SR methods, ‘‘Enhance’’ [21] and ‘‘Soft edge’’ [2], were also implemented for comparison. The PSNR score over BiCubic interpolation is presented in Fig. 4. As can be seen, at most scales the best image quality is achieved by the SC method, while the proposed RIIR method using the SC model achieves the second best image quality, losing only by a very small margin. This promising result shows that the RIIR method saves a considerable amount of computation (Table 1) while sacrificing only a negligible amount of image quality. In fact, comparing the three coding models with/without the RIIR framework, the achieved image qualities are always comparable. In addition, with the NN coding model, the achieved image PSNR score is even higher than that without the RIIR framework. This is reasonable because, as explained in Section 2.2, the NN method tends to over-fit to the example patch pair set, and hence the representation computed by Eq. (1) may not generalize well to the high-frequency layer. On the other side, solving Eq. (1) under the RIIR framework incorporates stronger regularization, such that the learned representation is more reliable and robust.

Fig. 6. Illustration of continuous image scaling at scales ×2.65, ×3.15, ×3.85 and ×4.35 (top-left: the original image; first row: BiCubic interpolation; second row: RIIR with NN model; third row: RIIR with LLE model; last row: RIIR with SC model).

5.2. Continuous image SR

The second experiment demonstrated continuous image SR using the RIIR framework (Section 4). We first generated the RIIR base set $S$ from ×1 to ×5 scales with step size 0.5, i.e., a base is trained at every $u$ and $u+0.5$ scale, $u = 1, \ldots, 5$. This takes up 15 MB of storage space if the cardinality is selected to be 2000. For each testing image, the RIIR is learned at scale ×3. Next we conducted continuous scaling between ×1 and ×5 with step size 0.05. A DELL PRECISION 490 PC (3.2 GHz CPU, 2 GB RAM) was used to conduct a subjective user study where 12 people were asked to compare the image quality with BiCubic interpolation. All of them rated significantly higher scores for our results, and most of them were not aware of the processing delay in generating the SR images. These results validate the good performance of continuous SR using the RIIR method as well as the low computational cost of generating the up-scaled images. Some example images can be found in Figs. 5 and 6. A video demonstration of the reconstruction process has been attached, where readers can examine the processing speed and quality of image reconstruction by our RIIR framework.

5.3. Parameter tuning

To get more insight into the RIIR framework, we also evaluated different parameter settings. Of the three different coding schemes, NN has no parameter during the coding step, while in LLE the number of local neighbors $K$ in Eq. (3), and in SC the weight of the $L_1$ regularization $\gamma$ in Eq. (5), need to be specified. In addition, in all three schemes the codebook size $N$ is required.

We first tested the effect of different codebook sizes. We used the method in Section 3.3 to build several RIIR base sets with increasing cardinality. Then we evaluated the image quality under multiple scales as in Section 5.1. It is observed that the average SR image quality saturates after the number of basis vectors reaches a sufficient number, around 2000.

The second experiment tested the effect of $K$ for LLE and of $\gamma$ for SC. According to Eqs. (3) and (5), both control the number of activated basis vectors for reconstruction, hence we put the results on the same scale as the ratio of non-zero elements to the size of the codebook. Similar to what was reported in [12], the best SR image quality is achieved when on average 5–10 basis vectors are used to code each input patch, i.e., 0.25–0.5% of all basis vectors are selected, while further reducing or increasing the sparsity decreases the performance.

6. Conclusion and future work

The paper presents the resolution-invariant image representation (RIIR) framework, motivated by the idea that the same image should have identical representations at various resolution levels. In the framework, a multi-scale RIIR bases set is constructed, and three coding models, the NN, LLE and SC models, are all validated to present the resolution-invariant property. In this way the computational cost of multi-scale image SR can be significantly reduced. In addition, the method to extend RIIR to support continuous image scaling is also discussed. With such capacity, the RIIR framework can support additional applications that existing image SR methods cannot handle well. For instance, in [22] the RIIR framework is applied to content-based zooming for mobile users. Experimental results show that our RIIR based method outperforms existing methods in various aspects.

The future work of this research includes the following issues: first, in addition to image magnification, the possibility of applying the RIIR framework to improve image shrinking quality will be investigated; second, additional optimization strategies to improve the coding speed will be examined; third, the implementation of the coding and reconstruction processes will be parallelized with modern CPU and/or GPU support; and fourth, other application domains in image compression, streaming, personalization, etc., will be explored.

Appendix A. Resolution invariance in the NN model

According to Eq. (6), at level $u$, for each input patch $y_u$ (the subscript $j$ is omitted), the optimal representation $x^*_u$ can be obtained by

$$x^*_u = \arg\min_{x_u} \|y^m_u - P^m_u x_u\|^2 \quad \text{s.t. } x_u \in \{0,1\}^N \text{ and } \textstyle\sum x_u = 1, \qquad (15)$$

where $P^m_u$ is a $D^m_u \times N$ matrix representing all the elements in $S^m_u$. Substituting Eq. (10) into Eq. (15),

$$x^*_u = \arg\min_{x_u} \|(y^m_1 * g_u) - (P^m_1 * g_u) x_u\|^2 = \arg\min_{x_u} \|C_u (y^m_1 - P^m_1 x_u)\|^2 \quad \text{s.t. } x_u \in \{0,1\}^N \text{ and } \textstyle\sum x_u = 1,$$

which becomes identical to Eq. (13) except for the constraint. Hence when $C_u^\top C_u$ is the unitary matrix, the solution $x^*_u$ becomes independent of $u$, and the resolution-invariant property holds.

Appendix B. Resolution invariance in the LLE model

According to Eq. (7), for each input patch $y_{j,u}$, the optimal representation weight $x^*_{j,u}$ minimizes

$$x^*_u = \arg\min_{x_u} \|y^m_u - (P^m_u \circ A_u) x_u\|^2 + \lambda \|x_u\|^2. \qquad (16)$$

Substituting Eqs. (7) and (10) into Eq. (16), we have

$$x^*_u = \arg\min_{x_u} \|y^m_1 * g_u - ((P^m_1 * g_u) \circ (A_1 * g_u)) x_u\|^2 + \lambda \|x_u\|^2 = \arg\min_{x_u} \|C_u (y^m_1 - (P^m_1 \circ A_1) x_u)\|^2 + \lambda \|x_u\|^2,$$

which has the same form as Eq. (13). Hence when $C_u^\top C_u$ is unitary, the solution $x^*_u$ becomes independent of $u$, and the resolution-invariant property holds.

References

[1] G. Ramanarayanan, K. Bala, B. Walter, Feature-based textures, in: Proceedings of Eurographics Symposium on Rendering '04, 2004, pp. 186–196.
[2] S. Dai, M. Han, W. Xu, Y. Wu, Y. Gong, Soft edge smoothness prior for alpha channel super resolution, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.
[3] J. Sun, H. Tao, H. Shum, Image hallucination with primal sketch priors, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2003, pp. 729–736.
[4] http://www.vectormagic.com
[5] M.N. Do, M. Vetterli, The contourlet transform: an efficient directional multi-resolution image representation, IEEE Trans. Image Process. (2005) 2091–2106.
[6] W. Hong, J. Wright, K. Huang, Y. Ma, Multiscale hybrid linear models for lossy image representation, IEEE Trans. Image Process. (2006) 3655–3671.
[7] E. Le Pennec, S. Mallat, Sparse geometric image representations with bandelets, IEEE Trans. Image Process. (2005) 423–438.
[8] C. Guo, S. Zhu, Y. Wu, Towards a mathematical theory of primal sketch and sketchability, in: Proceedings of International Conference on Computer Vision, 2003, pp. 1228–1235.
[9] W. Freeman, E. Pasztor, O. Carmichael, Learning low-level vision, Int. J. Comput. Vision 40 (1) (2000) 25–47.
[10] Q. Wang, X. Tang, H. Shum, Patch based blind image super resolution, in: Proceedings of International Conference on Computer Vision, 2005, pp. 709–716.
[11] H. Chang, D. Yeung, Y. Xiong, Super-resolution through neighbor embedding, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2004, pp. 275–282.
[12] J. Wang, S. Zhu, Y. Gong, Resolution enhancement based on learning the sparse association of image patches, Pattern Recognition Letters 31 (1) (2010) 1–10.
[13] J. Yang, J. Wright, T. Huang, Y. Ma, Image super-resolution as sparse representation of raw image patches, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[14] J. Sun, Z. Xu, H. Shum, Image super-resolution using gradient profile prior, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[15] C. Jiji, M. Joshi, S. Chaudhuri, Single-frame image super-resolution using learned wavelet coefficients, Int. J. Imaging Syst. Technol. 14 (3) (2004) 105–112.
[16] C.V. Jiji, S. Chaudhuri, Single-frame image super-resolution through contourlet learning, EURASIP J. Appl. Signal Process. (2006), Article ID 73767, 11 pages.
[17] J. Wang, S. Zhu, Y. Gong, Resolution-invariant image representation and its applications, in: Proceedings of CVPR '09, 2009, pp. 2512–2519.
[18] U. Mudenagudi, R. Singla, P.K. Kalra, S. Banerjee, Super resolution using graph-cut, in: Proceedings of Asian Conference on Computer Vision, 2006, pp. 385–394.
[19] H. Lee, A. Battle, R. Raina, A. Ng, Efficient sparse coding algorithms, in: Advances in Neural Information Processing Systems, MIT Press, 2007, pp. 801–808.
[20] A.V. Oppenheim, R.W. Schafer, J.R. Buck, Discrete-time Signal Processing, 2nd ed., Prentice Hall, 1999.
[21] J. Wang, Y. Gong, Fast image super-resolution using connected component enhancement, in: Proceedings of International Conference on Multimedia and Expo, 2008.
[22] J. Wang, S. Zhu, Y. Gong, Resolution-invariant image representation for content-based zooming, in: Proceedings of International Conference on Multimedia and Expo, 2000.

Jinjun Wang received the B.E. and M.E. degrees from Huazhong University of Science and Technology, China, in 2000 and 2003. He received the Ph.D. degree from Nanyang Technological University, Singapore, in 2006. From 2006 to 2009, Dr. Wang was with NEC Laboratories America, Inc. as a postdoctoral research scientist, and in 2010 he joined Epson Research and Development, Inc. as a senior research scientist. His research interests include pattern classification, image/video enhancement and editing, content-based image/video annotation and retrieval, semantic event detection, etc.

Shenghuo Zhu received the Ph.D. degree in computer science from the University of Rochester, Rochester, NY, in 2003. He is a Research Staff Member with NEC Laboratories America, Inc., Cupertino, CA. His primary research interests include information retrieval, machine learning, and data mining. In addition, he is interested in customer behavior research, game theory, robotics, machine translation, natural language processing, computer vision, pattern recognition, bioinformatics, etc.