
QUALITY ASSESSMENT OF THUMBNAIL AND BILLBOARD IMAGES ON MOBILE DEVICES
live.ece.utexas.edu/publications/2018/zeina2018pcs.pdf

Zeina Sinno1, Anush Moorthy2, Jan De Cock2, Zhi Li2, Alan Bovik3

ABSTRACT

Objective image quality assessment (IQA) research entails developing algorithms that predict human judgments of picture quality. Validating performance entails evaluating algorithms under conditions similar to those where they are deployed. Hence, creating image quality databases representative of target use cases is an important endeavor. Here we present a database that relates to the quality assessment of billboard images commonly displayed on mobile devices. Billboard images are a subset of thumbnail images that extend across a display screen, representing things like album covers, banners, or frames of artwork. We conducted a subjective study of the quality of billboard images distorted by processes like compression, scaling, and chroma subsampling, and compared high-performance quality prediction models on the images and subjective data.

Index Terms— Subjective Study, Image Quality Assessment, User Interface, Mobile Devices.

I. INTRODUCTION

Over the past few decades, many researchers have worked on the design of algorithms that automatically assess the quality of images in agreement with human quality judgments. These algorithms are typically classified based on the amount of information that the algorithm has access to. Full-reference image quality assessment (FR IQA) algorithms compare a distorted version of a picture to its pristine reference, while no-reference algorithms evaluate the quality of the distorted image without the need for such a comparison. Reduced-reference algorithms use additional, but incomplete, side-channel information regarding the source picture [1].

Objective algorithms in each of the above categories are validated by comparing the computed quality predictions against ground-truth human scores, which are obtained by conducting subjective studies. Notable subjective databases include the LIVE IQA database [2], [3], the TID 2013 database [4], the CSIQ database [5], and the recent LIVE

1: The author is at the Laboratory for Image and Video Engineering (LIVE) at the University of Texas at Austin and was working for Netflix when this work was performed (email: [email protected]).
2: The author is at Netflix.
3: The author is at the Laboratory for Image and Video Engineering (LIVE) at the University of Texas at Austin.

Challenge database [6]. The distortions studied in many databases are mainly JPEG and JPEG 2000 compression, linear blur, simulated packet loss, noise of several variants (white noise, impulse noise, quantization noise, etc.), as well as visual aberrations [6], such as contrast changes. The LIVE Challenge database is much larger and less restrictive, but is limited to no-reference studies. While these databases are wide-ranging in content and distortion, they fail to cover a use case that is of importance: thumbnail-sized images viewed on mobile devices. Here we attempt to fill this gap for the case of billboard images.

We consider the use case of a visual interface for an audio/video streaming service displayed on small-screen devices such as a mobile phone. In this use case, the user is presented with many content options, typically represented by thumbnail images depicting the art associated with the selection in question. For instance, on a music streaming site, such art could correspond to images of album covers. Such visual representations are appealing and are a standard form of representation across multiple platforms. Typically, such images are stored on servers and are transmitted to the client when the application is accessed.

While these kinds of billboard images can be used to create very appealing visual interfaces, it is often the case that multiple images must be transmitted to the end user to populate the screen as quickly as possible. Further, billboard images often contain stylized text, which needs to be rendered in a manner that is true to the artistic intent and allows for easy reading even on the smallest of screens. These two goals imply conflicting requirements: the need to compress the image as much as possible to enable rapid transmission and a dense population, while also delivering high-quality, high-resolution versions of the images.

Fig. 1(a) shows a representative interface of such a service [7], where a dense population is seen. Such interfaces may be present elsewhere as well, as depicted in Fig. 1(b), which shows the display page of a video streaming service provider [8], which includes images in thumbnail and billboard formats on the interface. These images have unique characteristics, such as text and graphics overlays, as well as the presence of gradients, which are rarely encountered in existing picture quality databases that are used to train and validate image quality assessment algorithms. Images in such traditional IQA databases consist of mostly natural


images, as seen in Fig. 2, which differ in important ways from the content seen in Figs. 1(a) and (b).

Fig. 1. Screenshots of mobile audio and video streaming services: (a) Spotify and (b) Netflix.

To address the use case depicted in the interfaces in Figs. 1(a) and (b), we conducted an extensive subjective study to capture the perceptual quality of such content. The study detailed here targets billboard images, which are thumbnail images arranged in a landscape orientation, spanning the width of the screen, that may be overlaid with text, gradients, and other graphic components. The test was designed to replicate the viewing conditions of users interacting with an audio/video streaming service. The source images that we used are all billboard artwork at Netflix [8]. The distortions considered represent the most common kinds of distortions that billboard images are subjected to when presented by a streaming service.

Specifically, we describe the components of the new database and how the study was conducted, and we examine how several leading IQA models perform.

Fig. 2. Sample images from the LIVE IQA database [2], [3], which consists only of natural images.

II. DETAILS OF THE EXPERIMENT

In this section, we detail the subjective study that we conducted on billboard images viewed on mobile devices.

We first describe the content selection, followed by a description of the distortions studied, and finally detail the test methodology.

II-A. Source Image Content

The source content of the database was selected to represent typical billboard images viewed on audio and video streaming services. Twenty-one high-resolution contents were selected to span a wide range of visual categories, such as animated content, contents with (blended) graphic overlays, and faces, and to include semi-transparent gradient overlays. A few contents were selected containing the Netflix logo, since it has some saturated reds, which exhibit artifacts when subjected to chroma subsampling. Fig. 3 shows some of the content used in the study.

Fig. 3. Several samples of content in the new database.

The images, which were all originally of resolution 2048×1152 or 2560×1440, were downscaled to 1080×608, so that their width matched the pixel density of typical mobile displays (when held in portrait mode). These 1080×608 images form the reference set.

II-B. Distortion Types

The reference images were then subjected to combinations of distortions, applied using [9], that are typically encountered in streaming app scenarios:

• Chroma subsampling: 4:4:4 and 4:2:0.
• Spatial subsampling: by factors of 1 (no subsampling), √2, 2, and 4.
• JPEG compression: using quality factors 100, 78, 56, and 34 in the compressor.

All images were displayed at 1080×608 resolution, hence all of the downsampled images were upsampled back to the reference resolution, as they would be displayed on the device. We applied the following imagemagick [9] commands to obtain the resulting images:

• magick SrcImg.jpg -resize 1080x608 RefImg.jpg
• magick RefImg.jpg -resize wxh -compress JPEG -quality fq -sampling-factor fc DistortedImg.jpg
• magick DistortedImg.jpg -resize 1080x608 DistortedImg.jpg


where SrcImg.jpg, RefImg.jpg, and DistortedImg.jpg are the source, reference, and distorted images respectively, and h and w are the dimensions to which the images were downsampled, defined as w = 1080/fs and h = 608/fs, where fs is the imagemagick spatial subsampling factor, fc is the chroma subsampling factor, and fq is the quality factor.

The above combinations resulted in 32 types of multiply distorted images (2 chroma subsamplings × 4 spatial downsamplings × 4 quality levels), yielding a total of 672 distorted images (21 contents × 32 distortion combinations). The distortion parameters were chosen so that distortion severities were perceptually separable, varying from nearly imperceptible to severe.
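For concreteness, the full grid of 32 parameter combinations can be enumerated programmatically. The sketch below is ours, not the authors' tooling, and the filenames are placeholders; it builds the ImageMagick command line for each combination without executing anything:

```python
import math

# Distortion parameters from Section II-B.
CHROMA = ["4:4:4", "4:2:0"]          # fc
SPATIAL = [1, math.sqrt(2), 2, 4]    # fs
QUALITY = [100, 78, 56, 34]          # fq

def magick_command(fc, fs, fq):
    # Downsampled dimensions w = 1080/fs, h = 608/fs, rounded to integers.
    w, h = round(1080 / fs), round(608 / fs)
    return (f"magick RefImg.jpg -resize {w}x{h} -compress JPEG "
            f"-quality {fq} -sampling-factor {fc} DistortedImg.jpg")

commands = [magick_command(fc, fs, fq)
            for fc in CHROMA for fs in SPATIAL for fq in QUALITY]
print(len(commands))  # 2 chroma x 4 spatial x 4 quality = 32
```

In the actual pipeline, each such command would be followed by the upsampling step back to 1080×608 before display, as described above.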

II-C. Test Methodology

A double-stimulus study was conducted to gauge human opinions on the distorted image samples. Our goal was that even very subtle artifacts be judged, to ensure that picture quality algorithms could be trained and evaluated with regard to their ability to distinguish even minor distortions. After all, each billboard image might be viewed multiple times by many millions of viewers. Further, we attempted to mimic the interface of a generic streaming application, whereby the billboard would span the width of the display when held horizontally (landscape mode). Hence, we presented the double-stimulus images in the way depicted in Fig. 4. In each presentation, one of the images is the reference, while the other is a distorted version of the same content. The subjects were not told which image was the reference. Instead, they were asked to rate which image was better than the other, and by how much (on a continuous rating bar), as depicted in Fig. 4. Further, presentation order was randomized so that (top/bottom) reference-distorted and distorted-reference pairs were equally likely. Finally, the distorted content order was randomized, but so that the same content was never presented consecutively (to reduce memory effects), and the content ordering was randomized across users.
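The ordering constraints above (no content shown twice in a row; reference position a fair coin flip) can be realized, for example, with a greedy interleave. This is our illustrative sketch, not the authors' experiment code; the content names and level numbering are hypothetical:

```python
import random
from collections import defaultdict

def order_trials(trials, rng):
    """Order (content, level) trials so no content appears twice in a row.

    Greedy interleave: always draw from the content with the most trials
    remaining, excluding the content shown last.
    """
    pools = defaultdict(list)
    for content, level in trials:
        pools[content].append((content, level))
    for pool in pools.values():
        rng.shuffle(pool)
    ordered, prev = [], None
    while any(pools.values()):
        candidates = [c for c in pools if pools[c] and c != prev]
        if not candidates:  # cannot happen when no content holds a majority of trials
            raise RuntimeError("ordering infeasible")
        chosen = max(candidates, key=lambda c: (len(pools[c]), rng.random()))
        content, level = pools[chosen].pop()
        # Fair coin flip: is the reference on top or on the bottom?
        ordered.append((content, level, rng.choice(["ref-top", "ref-bottom"])))
        prev = chosen
    return ordered

rng = random.Random(7)
# 8 contents x (13 distortion levels + 1 reference-reference pair, level 0).
trials = [(c, lvl) for c in "ABCDEFGH" for lvl in range(14)]
session = order_trials(trials, rng)
```

The greedy rule cannot get stuck as long as no single content holds more than half of the remaining trials, which holds here (14 of 112 trials per content).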

Equipment and Display

The subjective study was conducted on two LG Nexus 5 mobile devices, so that two subjects could concurrently participate in the study. Auto-brightness was disabled and the two devices were calibrated to have the same mid-level of brightness, to ensure that all the subjects had the same viewing conditions. LG Nexus 5 devices are Android-based and have a display resolution of 1920×1080, which is ubiquitous on mobile devices. They were mounted in portrait orientation on a stand [10], and the users were positioned at a distance of three screen widths, as in [11]. Although the users were told to try to maintain their viewing distance, they were allowed to adjust their position if they felt the need to do so.
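As a quick check of the resulting viewing geometry (our back-of-the-envelope sketch, not a figure from the paper): at three screen widths, the 1080-pixel-wide portrait display subtends roughly 19 degrees, i.e., about 57 pixels per degree of visual angle:

```python
import math

# The display subtends 2*atan((W/2) / (3W)) degrees horizontally at a
# viewing distance of three screen widths W.
PIXELS_ACROSS = 1080       # Nexus 5 horizontal resolution in portrait mode
DISTANCE_IN_WIDTHS = 3.0

visual_angle_deg = 2.0 * math.degrees(math.atan(0.5 / DISTANCE_IN_WIDTHS))
pixels_per_degree = PIXELS_ACROSS / visual_angle_deg
print(round(visual_angle_deg, 1), round(pixels_per_degree))  # 18.9 57
```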

Fig. 4. Screenshot of the mobile interface used in the study.

Human Subjects, Training and Testing

The recruited participants were mostly Netflix employees, spanning different departments: engineering, marketing, languages, statistics, etc. A total of 122 participants took part in the study. The majority did not have knowledge of image processing or of the image quality assessment problem. We did not test the subjects for vision problems, but a verbal confirmation of (corrected-to-) normal vision was obtained.

Each subject participated in one session, lasting about thirty minutes. The subjects were provided with written instructions explaining the task. Next, each subject was given a demonstration of the experimental procedure. During a session, the subject viewed a random subset of eight different contents (from amongst the 21). Thirteen different distortion levels were randomly selected for each session, and each session also included a reference-reference control pair, so that a total of fourteen pairs per content were displayed for the subject to rate. A short training session preceded the actual study, where six reference-distorted pairs of images were presented to each subject, with distorted images approximately spanning the entire range of picture qualities. The images shown in the training sessions were different from those used in the actual experiment.

Thus, each subject viewed a total of 118 pairs [8 contents × (13 distortion levels + 1 reference pair) + 6 training pairs] in each session. A black screen was displayed for three seconds between successive pair presentations. Finally, at the halfway point, a screen was shown to the subjects indicating that they had completed half of the test, and that they could take a break if they chose to.

II-D. Processing of the Results

The position of the slider after the subject rated the image was converted to a scalar value between 0 and 100, where 0 represents the worst quality and 100 represents the highest possible quality. The reference images were


anchors, having an assumed score of 100. We found that the subject scores on the reference-reference pairs followed a binary distribution with p ≈ 0.5, indicating that no bias could be attributed to the location (top/bottom) of the images as they were being evaluated.

We used the guidelines of BT.500-13 (Annex 2, Section 2.3) [11] for subject rejection. Two subjects were outliers, hence their scores were removed from all further analysis. Finally, we computed the mean opinion scores (MOS) of the images.
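For reference, the BT.500-13 (Annex 2, Section 2.3) screening rule can be condensed as below. This is our simplified numpy sketch of the standard's procedure (per-image kurtosis chooses between 2σ and √20·σ bounds; a subject is rejected when too many scores fall outside the bounds and the excursions are not balanced), not the exact code used in the study:

```python
import numpy as np

def screen_subjects(scores, frac_threshold=0.05, balance_threshold=0.3):
    """BT.500-13 Annex 2 Sec. 2.3 subject screening (condensed sketch).

    scores: (num_subjects, num_images) array of raw opinion scores.
    Returns a boolean mask of subjects to keep.
    """
    mean = scores.mean(axis=0)
    std = scores.std(axis=0, ddof=1)
    # Per-image kurtosis decides whether scores look normally distributed.
    m2 = ((scores - mean) ** 2).mean(axis=0)
    m4 = ((scores - mean) ** 4).mean(axis=0)
    beta2 = m4 / np.maximum(m2 ** 2, 1e-12)
    k = np.where((beta2 >= 2) & (beta2 <= 4), 2.0, np.sqrt(20.0))
    upper, lower = mean + k * std, mean - k * std
    P = (scores > upper).sum(axis=1)   # excursions above, per subject
    Q = (scores < lower).sum(axis=1)   # excursions below, per subject
    n = scores.shape[1]
    frac = (P + Q) / n
    balance = np.abs(P - Q) / np.maximum(P + Q, 1)
    return ~((frac > frac_threshold) & (balance < balance_threshold))

def mean_opinion_scores(scores):
    # MOS over the retained subjects only.
    return scores[screen_subjects(scores)].mean(axis=0)
```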

III. ANALYSIS OF THE RESULTS

First, we study the individual effects of the three considered distortion types (JPEG compression, spatial subsampling, and chroma subsampling) on reported quality.

III-A. Impact of the Individual Distortions

Figure 5(a) plots the distribution of the MOS while varying the chroma subsampling between 4:4:4 and 4:2:0, at a fixed JPEG quality factor fq = 56 and a fixed spatial subsampling factor fs = 1. Chroma subsampling is very difficult to perceive in the quality range 80-95, but easier to perceive in the quality range 60-75. Confusion often occurs when there are conspicuous red objects, e.g., in Figs. 3(c) and (d), owing to the heavy subsampling on the red channel. Figure 5(b) plots the distribution of MOS against the JPEG quality factor fq for fixed chroma subsampling (4:4:4) and a fixed spatial subsampling factor fs = 2. The plots show that these values of fq produce clear perceptual separation, except for fq ∈ {78, 100} and fq ∈ {34, 56}, which is expected at the quality extremes. Figs. 3(b) and (f) show examples of such content.

Finally, Fig. 5(c) plots MOS against the spatial subsampling factor fs, for fixed chroma subsampling (4:4:4) and a fixed quality factor fq = 78. The downsampling artifacts are clearly perceptually separable.

III-B. Objective Algorithm Performance

We also evaluated the performance of several leading objective picture quality predictors on the collected database. We computed Pearson's linear correlation coefficient (LCC) and Spearman's rank-order correlation coefficient (SROCC) to assess the following IQA model performances on the new dataset: Peak Signal-to-Noise Ratio (PSNR), PSNR-HVS [12], Structural Similarity Index (SSIM) [3], Multi-Scale Structural Similarity Index (MS-SSIM) [13], Visual Information Fidelity (VIF) [14], Additive Distortion Metric (ADM) [15], and the Video Multi-method Assessment Fusion (VMAF) version 1.2.0 [16]. LCC was computed after applying a non-linear map as prescribed in [17]. We also computed the root mean squared error (RMSE) between the mapped objective scores and the subjective scores.

All of the models were computed on the luma channel only, at the source resolution of the reference images. The results are tabulated in Table I. As may be seen, VMAF and ADM were the top performers.
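To illustrate the evaluation protocol (our sketch with synthetic scores, not the study data): SROCC is Pearson's correlation computed on ranks, and the non-linear map of [17] is commonly a logistic function fitted from objective scores to MOS. Here we apply a logistic map with known parameters to show why PLCC is computed only after the mapping:

```python
import numpy as np

def pearson(a, b):
    a, b = a - a.mean(), b - b.mean()
    return float((a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum()))

def spearman(a, b):
    # Spearman's rho is Pearson's correlation on the ranks (ties ignored here).
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return pearson(rank(a), rank(b))

def logistic4(x, b1, b2, b3, b4):
    # A common four-parameter logistic for the objective-to-MOS map.
    return (b1 - b2) / (1.0 + np.exp(-(x - b3) / b4)) + b2

# Synthetic example: MOS depends on the objective score through a logistic.
obj = np.linspace(0.0, 100.0, 60)
mos = logistic4(obj, 80.0, 20.0, 50.0, 12.0)
mapped = logistic4(obj, 80.0, 20.0, 50.0, 12.0)  # the fitted map (known here)

plcc_raw = pearson(obj, mos)        # penalized by the non-linearity
plcc_mapped = pearson(mapped, mos)  # linearized by the map
srocc = spearman(obj, mos)          # insensitive to any monotone map
rmse = float(np.sqrt(np.mean((mapped - mos) ** 2)))
```

In practice the logistic parameters are fitted to each model's scores (e.g., by nonlinear least squares) before computing PLCC and RMSE, as prescribed in [17].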

Given its good performance, we decided to analyze VMAF, towards better understanding when it fails or where it could be improved. Figure 6 plots the VMAF scores against the MOS. While the scatter plot is nicely linear, as also reflected by the correlation numbers, of interest are the outlying failures that deviated the most from the general linear trend. Based on our analysis of these, we can make the following observations:

1) VMAF tends to overestimate quality when chroma subsampling distortions are present. This is not unexpected, since VMAF is computed only on the luminance channel. This suggests that using features computed on the chroma signal when training and applying VMAF may be advisable.

2) VMAF tends to underestimate quality on certain contents containing gradient overlays when the dominant distortion is spatial subsampling. This may be caused by banding (false contouring) effects on the gradients.

3) Likewise, on contents with both gradient overlays and heavy compression artifacts, VMAF poorly predicts the picture quality, perhaps for similar reasons.

IV. CONCLUSION

We have taken a first step towards understanding the quality assessment of thumbnail images on mobile devices, like those supplied by audio/video streaming services. We conducted a subjective study where viewers rated the subjective quality of billboard-style thumbnail images that were impaired by chroma subsampling, compression, and spatial subsampling artifacts. We observed that spatial subsampling, followed by compression, were the main factors affecting perceptual quality. There are other types of interesting signals, such as boxshot images, which are displayed in a portrait mode at lower resolutions, usually with a dense text component. These characteristics can render both compression and objective quality prediction more challenging. Broader studies encompassing these and other kinds of thumbnail images could prove fruitful for advancing this new subfield of picture quality research.

V. ACKNOWLEDGMENT

The authors would like to thank Dr. Ioannis Katsavounidis and other members of the Video Algorithms team at Netflix for their valuable feedback in designing the study.

VI. REFERENCES

[1] L. He, F. Gao, W. Hou, and L. Hao, “Objective image quality assessment: a survey,” Intern. J. of Comp. Math., vol. 91, no. 11, pp. 2374–2388, 2014.


Fig. 5. Distribution of the MOS as a function of different distortions: (a) histogram of MOS plotted against chroma subsampling; (b) histogram of MOS plotted against JPEG quality factor fq; (c) histogram of MOS plotted against spatial subsampling factor fs.

Table I. PLCC, SROCC, and RMSE results computed between the subjective scores and predicted objective scores.

        PSNR   PSNR-HVS [12]  SSIM [3]  MS-SSIM [13]  VIF [14]  ADM [15]  VMAF [16]
PLCC    0.72   0.86           0.82      0.83          0.88      0.94      0.95
SROCC   0.71   0.85           0.81      0.83          0.86      0.93      0.94
RMSE    20.79  15.42          17.31     16.51         14.31     10.09     9.19

Fig. 6. Scatter plot of predicted VMAF scores against MOS.

[2] H. R. Sheikh, Z. Wang, L. Cormack, and A. C. Bovik, “LIVE image quality assessment database release 2,” http://live.ece.utexas.edu/research/quality/subjective.htm, accessed on Jan. 8, 2018.

[3] Z. Wang, A. C. Bovik, H. Sheikh, and E. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Trans. on Imag. Proc., vol. 13, no. 4, pp. 600–612, 2004.

[4] N. Ponomarenko, O. Ieremeiev, V. Lukin, K. Egiazarian, L. Jin, J. Astola, B. Vozel, K. Chehdi, M. Carli, F. Battisti et al., “Color image database TID2013: Peculiarities and preliminary results,” in Visual Information Processing (EUVIP), 2013 4th European Workshop on. IEEE, 2013, pp. 106–111.

[5] E. Larson and D. Chandler, “Categorical image quality CSIQ database 2009.” [Online]. Available: http://vision.okstate.edu/csiq

[6] D. Ghadiyaram and A. C. Bovik, “Massive online crowdsourced study of subjective and objective picture quality,” IEEE Trans. Image Process., vol. 25, no. 1, pp. 372–387, 2016.

[7] “Spotify,” https://www.spotify.com/, accessed on Jan. 8, 2018.

[8] “Netflix,” https://www.netflix.com, accessed on Jan. 8, 2018.

[9] “ImageMagick,” https://www.imagemagick.org/, accessed on Jan. 8, 2018.

[10] “Cell phone stand, Lamicall S1 dock: cradle, holder, stand for Switch, all Android smartphones, iPhone 6 6s 7 8 X Plus 5 5s 5c charging, accessories desk - black,” http://a.co/bDukbam, accessed on Jan. 8, 2018.

[11] “Methodology for the subjective assessment of the quality of television pictures,” ITU-R Rec. BT.500-13, 2012.

[12] K. Egiazarian, J. Astola, N. Ponomarenko, V. Lukin, F. Battisti, and M. Carli, “New full-reference quality metrics based on HVS,” in Proc. of the Sec. Int. Worksh. Vid. Process. Qual. Met., vol. 4, 2006.

[13] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in Asilomar Conf. Sign., Sys. Comp., vol. 2. IEEE, 2003, pp. 1398–1402.

[14] H. R. Sheikh, A. C. Bovik, and G. De Veciana, “An information fidelity criterion for image quality assessment using natural scene statistics,” IEEE Trans. on Image Processing, vol. 14, no. 12, pp. 2117–2128, 2005.

[15] S. Li, F. Zhang, L. Ma, and K. N. Ngan, “Image quality assessment by separately evaluating detail losses and additive impairments,” IEEE Trans. Mult., vol. 13, no. 5, pp. 935–949, 2011.

[16] Z. Li, A. Aaron, I. Katsavounidis, A. Moorthy, and M. Manohara, “Toward a practical perceptual video quality metric,” Jun. 2016, accessed on Jan. 8, 2018. [Online]. Available: https://medium.com/netflix-techblog/toward-a-practical-perceptual-video-quality-metric-653f208b9652

[17] “Final report from the video quality experts group on the validation of objective models of video quality assessment,” Video Quality Experts Group (VQEG), 2000.