
Temporal Trajectory Aware Video Quality Measure

Marcus Barkowsky, Jens Bialkowski, Björn Eskofier, Roland Bitto, and André Kaup, Senior Member, IEEE

Abstract—The measurement of video quality for lossy and low-bitrate network transmissions is a challenging topic. Especially, the temporal artifacts which are introduced by video transmission systems and their effects on the viewer's satisfaction have to be addressed. This paper focuses on a framework that adds a temporal distortion awareness to typical video quality measurement algorithms. A motion estimation is used to track image areas over time. Based on the motion vectors and the motion prediction error, the appearance of new image areas and the display time of objects is evaluated. Additionally, degradations which stick to moving objects can be judged more exactly. An implementation of this framework for multimedia sequences, e.g., QCIF, CIF, or VGA resolution, is presented in detail. It shows that the processing steps and the signal representations that are generated by the algorithm follow the reasoning of a human observer in a subjective experiment. The improvements that can be achieved with the newly proposed algorithm are demonstrated using the results of the Multimedia Phase I database of the Video Quality Experts Group.

Index Terms—Cognitive science, multimedia communication, objective video quality measurement, temporal effects, visual system.

I. INTRODUCTION

In the last decade, the transmission of video sequences over IP networks evolved. Those networks often have a limited bandwidth, and in some cases the transmission is also lossy. The viewer's expectation of the visual quality is not as high as for television broadcast scenarios. A lower resolution or a reduced frame rate are acceptable. When mobile transmission channels or lossy IP networks are used, the video may even pause for some time and then continue after some content was lost.

For network providers, the correct estimation of video quality is of paramount importance. The viewer's satisfaction mainly depends on the perceived visual quality. An optimization which relates the network bandwidth and the encoder quality to the perceived video quality helps in improving the network service.

The best way to estimate the perceived video quality is to perform a subjective experiment. According to the Recommendations of the International Telecommunication Union (ITU) [1], [2], 20–30 viewers will be asked to rate a set of video sequences and the results of all viewers are averaged as a mean opinion score (MOS). However, this approach is very time consuming and it can only be performed on a very limited set of sample sequences.

Manuscript received April 30, 2008; revised December 04, 2008. Current version published March 11, 2009. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Sheila Hemami.

M. Barkowsky is with the Chair of Multimedia Communications and Signal Processing, University of Erlangen-Nuremberg, 91058 Erlangen, Germany, and also with OPTICOM GmbH, 91052 Erlangen, Germany (e-mail: [email protected]).

J. Bialkowski, B. Eskofier, and A. Kaup are with the Chair of Multimedia Communications and Signal Processing, University of Erlangen-Nuremberg, 91058 Erlangen, Germany.

R. Bitto is with the OPTICOM GmbH, 91052 Erlangen, Germany.

Digital Object Identifier 10.1109/JSTSP.2009.2015375

Several automated video quality measurement algorithms have been developed to predict the results of the subjective test. They use different approaches which differ in the amount of information present for the evaluation. The algorithms which perform the most accurate prediction compare the degraded video to the reference video. They are called "Full Reference" (FR) algorithms. A typical scenario for FR algorithms is depicted in Fig. 1. A reference video sequence which is called source reference channel or circuit (SRC) is encoded, transmitted, decoded and postprocessed. The resulting processed video sequence (PVS) is evaluated together with the SRC by a measurement algorithm.

In order to evaluate the performance of the algorithm, a subjective test is performed and the MOS values are compared to the prediction of the measurement algorithm.

Several examples of FR algorithms will be briefly reviewed in Section II. They show a common structure which will be explained. An extension of this structure is proposed in this paper which makes it possible to predict the visual quality more accurately by exploiting the temporal dimension. The building blocks of the new framework are explained in Section III. The extension can be combined with most of the reviewed algorithms.

In order to show the performance of the framework, an implementation will be presented which is called the Temporal Trajectory Aware Video Quality Measure (TetraVQM) algorithm. This implementation is meant for small image sizes, ranging from 176 × 144 pixels (QCIF) to 640 × 480 pixels (VGA). Details of the algorithm are described in Section IV. In the course of the development many subjective experiments have been conducted. One of them was published in [3]. These experiments were used to train the algorithm. In order to show the performance of the algorithm, a validation using the data generated by the Video Quality Experts Group (VQEG) in the Multimedia Phase I [4], [5] has been performed. The results are presented in Section V. Finally, the paper is summarized and concluded in Section VI.

II. STATE OF THE ART AND MOTIVATION

The design of most FR algorithms follows the block diagram depicted in Fig. 2. At the beginning an alignment step is performed which tries to find the correspondence between the reference and the degraded sequence. The next steps are performed for each image separately. In the step "Spatial Preprocessing" a representation of the images is generated that simplifies the comparison. While in the peak signal-to-noise ratio (PSNR) [6] metric little effort is necessary, e.g., a color space transformation, other algorithms model the first stages of the Human Visual System (HVS) in this step [7]–[9]. After the difference of the two images has been calculated, an additional processing step follows which shall emphasize the regions in which the distortion is dominant. In PSNR, the value of each pixel is squared which results in a larger penalty for large degradations. Other algorithms perform a spatial windowing function which emphasizes degradations near the center of the image [10]. Most of the effort is spent on either of the spatial processing steps. In the spatio-temporal pooling step, which involves the spatial and the temporal summation, often a simple average or a weighted average is used. Several instances of the gray blocks called "Spatial processing" and "Spatio–Temporal Pooling" may exist in parallel, each giving a spatial indicator for one aspect of video quality, e.g., blurring, blocking, edginess, and color artifacts.

Fig. 1. Measurement scenario for full reference VQMs.

Fig. 2. Typical block diagram of a traditional full reference VQM.

The information gathered in the temporal alignment step is optionally used to model degradations in the temporal fluidity of the video sequence, especially a reduced frame rate, pauses during playback, and sudden skips of several frames. The spatial indicators and the temporal indicators are finally combined to result in an estimation of the mean opinion score obtained from a subjective experiment.

In 2000 and 2003, the VQEG performed independent performance tests for several FR algorithms in the context of television broadcast [11], [12]. Two large subjective experiments were set up to compare the algorithms. Four candidates were then standardized by the International Telecommunication Union—Telecommunication Standardization Sector (ITU-T) in J.144 [13].

After the successful evaluation and standardization of these algorithms for television broadcasts, the next step in video quality estimation was towards multimedia scenarios. The requirements for multimedia quality assessment are different because the resolution is smaller, typically ranging from Quarter Common Intermediate Format (QCIF) with 176 × 144 pixels to VGA resolution with 640 × 480 pixels. The images are in progressive format as opposed to the television formats PAL or NTSC which are interlaced. Typically, the distortions are much higher because the available bandwidth is smaller and the quantization used in the video coding algorithms is coarser. In order to save even more bandwidth, the video sequence is often temporally subsampled, leading to a lower frame rate. In addition, when lossy networks are considered, the video playback may pause and skip some content or a rebuffering event may occur.

In 2008, the VQEG finished a large number of subjective experiments for the evaluation of several video quality measurement algorithms. Four algorithms were standardized by the ITU-T in J.247 [14]. Although they widely differ in the choice of indicators and in the implementation of the temporal alignment, in general they all follow the basic framework depicted in Fig. 2. All algorithms model the influence of the frame rate and the effect of pauses and skips in the video sequence. When temporal aspects are considered, the most often modeled effect is the masking of the distortions due to changes in the content. Two of the abovementioned algorithms use an indicator which considers several frames at once: The NTT Full Reference Model uses a temporal variability indicator which calculates the standard deviation of all pixels within ten seconds and OPTICOM's Perceptual Evaluation of Video Quality (PEVQ) determines how many edges are lost or introduced between two adjacent video frames. In both cases, it is assumed that the more temporal activity is detected in the sequence, the less influence a spatial degradation will have. A very flexible algorithm which uses a combined analysis of the spatial and temporal behavior of a video sequence in spatio-temporal regions was proposed by NTIA/ITS in [15]. An algorithm which models the first stages of the HVS and includes an inter-frame difference indicator can be found in [16]. A drawback of the inter-frame evaluation without motion estimation is that it is impossible to distinguish between a high motion sequence, e.g., a sports sequence, and other sources of luminance changes, e.g., a press conference with a flurry of camera flashes.

Fig. 3. Detailed data flow diagram of the TetraVQM algorithm.

Another class of algorithms uses the spatio-temporal contrast sensitivity function (CSF) which indicates the smallest amount of brightness change at a given spatial frequency that is necessary for the detection of a change. Although this is a visibility threshold rather than an explanation of the effect of spatial masking by motion, it has been successfully applied in many algorithms even when evaluating severe distortions. Typically, either the video is split into several spatio–temporal pathways [17], [18] or the temporal part is used as a prefiltering step [19], [20]. It was also shown that the quality of motion estimation algorithms in video codecs can be tested by modeling the spatio–temporal CSF [21]. This work demonstrates the close connection between motion in a video sequence and the modeling by the spatio–temporal CSF.

On the other hand, object motion may draw the attention of the viewer to a distorted object and the quality of the moving object may become important. An algorithm which benefits from this effect is presented in [22].

Recently, the Video Structural Similarity Measure (VSSIM) algorithm was proposed which extends the Structural Similarity Measure (SSIM) by estimating an optical flow field and calculating similarities along the trajectories [23].

Our proposal is based on the following assumptions which were collected from the participants in several subjective tests.

1) The viewers do not rate the quality of individual images presented one after the other. According to the temporal CSF, the human visual system is limited to frequencies below approximately 60 Hz. In order to recognize a set of objects correctly, the viewer needs more time [24]. Thus, the instantaneous impression of video quality is the result of several frames.

2) The subjects follow the motion of objects. When an object is distorted and it moves, the subjects track the distortion instead of seeing the distortion disappear at one spatial location and reappear at another location.

3) When an object appears and it is distorted only for a short time, the subjects tend to be unsure about the distortion. However, when a single object was visible for some time and it suddenly becomes distorted, it is easily noticed.

4) Similarly, when the video accidentally pauses for some time, the viewer is able to analyze the degradation in much more detail than in cases where the display changes continuously.

5) Because the participants in a subjective test are told to rate the degradations, they focus on the maximum distortion visible to them. This may be different in real-world applications, but the performance of an algorithm is usually evaluated by comparison to an ITU-conforming subjective experiment.

The proposed TetraVQM framework was developed based on these assumptions. It introduces a new dimension to the video quality estimation by estimating the temporal visibility of image areas. It becomes possible to weight the degradations based on the duration a viewer can analyze them. As a side effect, it is possible to correctly estimate the instantaneous impression of the degradations that a subject will perceive. The effect is similar to that demonstrated for the VSSIM, but instead of using an optical flow estimation, the output of a block-based motion estimation is used. The spatio–temporal CSF is not yet included in the TetraVQM algorithm but the combination is subject to further studies, e.g., by using one of the approaches mentioned above.

III. TETRAVQM FRAMEWORK

The structure of the proposed framework is depicted in Fig. 3. The undistorted reference sequence SRC and the PVS are inputs to the framework. The goal of the video quality measure is to accurately predict the results of a subjective test which contains many of these PVSs.

In this section, an overview of the algorithm will be presented. The details are explained in the next section.

A typical multimedia scenario is assumed. At the sender, the SRC is encoded by a video encoder which may reduce the frame rate. During transmission to the decoder, packets may be lost. As a result the video may suddenly pause and resume playback at the same or a later video frame depending on the error concealment behavior of the network and the decoder. Because multimedia content is often shown on a low-cost device, artifacts like color or brightness changes, e.g., gamma corrections or color saturation enhancements, may be introduced to enhance the display quality. These artifacts need to be addressed in the Temporal and Color Alignment step. The processing locates the corresponding reference frame for each of the distorted frames. The correspondence information is further analyzed in the step called Estimation of Temporal Degradations. In this step, the frame rate reduction is estimated and the influence on the final objective score is modeled by one single indicator. Another indicator reflects the influence of pauses and skips in the video sequence.

The registered signals of SRC and PVS are fed into the Spatial Processing of the Traditional VQM. This block contains the spatial processing steps that are usually part of VQM algorithms with a focus on spatial degradations. Often, some sort of contrast detection and discrimination modeling is included before the two images are compared. The result may be further processed to emphasize the dominant distortions and finally, a spatial distortion map is created. This map shall contain an estimate of the visible distortion at each position in the image. In the case of PSNR this is simply the squared difference image.

In the TetraVQM framework, the modeling of the HVS is taken a step further. In parallel to the spatial processing, the motion in the video sequence is estimated by applying a block-based motion estimation. Because the PVS may be severely degraded, the motion is estimated on the reference sequence. In order to transfer the motion information to the PVS it may be necessary to accumulate the motion of several frames, e.g., in the case of pauses or a reduced frame rate. The output of the block Estimation of Distorted Object Motion shall be the motion vectors and the prediction error for the degraded sequence. Instead of the block-based motion estimation, other algorithms may be applied as well, e.g., content independent object segmentation and tracking algorithms.

It is assumed that this information mimics the behavior of a participant in a subjective test: He tracks the objects and follows their movements. The viewer is also aware of objects that appear and disappear. He judges the video quality based on the degradations visible for a longer period of time at observed objects, even when the objects move on the screen. One of the new aspects of the TetraVQM algorithm is to store the output of the spatial processing steps. For example, the distortion maps of several frames are stored. A motion compensation is applied to them. Thus, the degradation of each object in a previous frame is moved to the position that the object occupies in the current frame. The motion compensation and the storage are combined in the Distortion Map Buffer. The viewer has a notion of an instantaneous distortion which is the effect of the limitation of the HVS and the cognitive processes. This is modeled in the block Estimation of Spatio–Temporal Distortion Map by accumulating the previous degradations.

The visibility of artifacts at objects which just appeared is limited. This is also modeled in the new TetraVQM framework. The block-based motion estimation minimizes the prediction error between blocks of the current video frame and the previous image. This prediction error is large when image areas were not present in the previous frame, and it is assumed that a new object appeared at that position. By storing the prediction error of previous frames in the Tracking Information Buffer and using a motion compensation in the same way as for the distortion maps, the duration that an image area has been visible in the past can be estimated. This is performed in the block Temporal Visibility and Reliability Estimation and will be explained in detail later. The step returns two matrices of the same size as the image: The first one indicates the maximum duration for which the corresponding pixel has been visible, and the second matrix indicates how easy it was to track the motion during this maximum duration.

At scene cuts, the block-based motion estimation results in a scattered motion vector field and large prediction errors. If this condition is detected, the information in the buffers is no longer valid and the buffers are reset; thus, it is assumed that the complete image contains new content. This models the reduced visibility of artifacts after the scene cut. However, it is known that the HVS also exhibits some masking before the scene cut [25]. This effect is not yet modeled within the TetraVQM algorithm but could be implemented in the framework as well.

In the Spatio–Temporal Pooling step, three visibility maps are summarized. The first map is the spatial degradation map which now also includes a view on the past distortions. The second and third map contain the information about the time each pixel could be analyzed by the viewer and the reliability of this information.

The representation of these three information sources allows a multiplicative combination. Instead of simply averaging the combined information, another aspect learned from a subjective experiment is modeled. The viewer is usually asked to rate the degradations in the video sequence. Thus, he focuses on the point where he is able to see the maximum distortion. By using the cone distribution in the human retina, a filter is designed that is moved across the image. The position at which the output of the filtered image reaches its maximum is considered as the focus point of the assessor and this degradation will be used. In the temporal direction a simple averaging is applied. This leads to the objective mean opinion score (OMOS) which is a prediction of the subjective MOS value obtained from a subjective experiment on an absolute category rating scale, e.g., ranging from 5 (best) to 1 (worst).

IV. TETRAVQM ALGORITHM DESCRIPTION

In the following section, the implementation details of the algorithm will be described. The data paths as shown in Fig. 3 are used.

It is assumed that the two video sequences of the SRC and the PVS are available in the Y, Cb, Cr component color space. The signal representation shall include no color subsampling, e.g., all color planes have the same size, and the range of the values shall be linearly scaled to the range of zero to one. However, many multimedia signals contain color subsampling and thus upsampling of the color components may be necessary by using a bicubic filter. The input sequences shall have a fixed frame rate of 25 or 30 fps. In case of a frame rate reduction, frames in the PVS shall be repeated.


A. Alignment Estimation and Correction

The TetraVQM algorithm is clocked by unique frames of the distorted signal. At first, each distorted frame is assigned the duration that results from the frame rate of the distorted sequence; the duration on the screen of the current distorted frame is thus the reciprocal of this frame rate. The unique frames are determined by comparing each frame of the distorted sequence to its successor. When an exact match occurs, the corresponding succeeding frame is removed and the duration of the current frame is extended accordingly.
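For illustration, a minimal Python sketch of this unique-frame detection is given below; the list-based frame representation and the function name are illustrative and not part of the original implementation.

```python
import numpy as np

def unique_frames_and_durations(pvs_frames, frame_rate):
    """Collapse exact frame repetitions of the PVS into unique frames.

    pvs_frames: list of numpy arrays (one per displayed frame)
    frame_rate: nominal frame rate of the PVS in frames per second
    Returns a list of (frame, duration_in_seconds) tuples.
    """
    base_duration = 1.0 / frame_rate
    result = []
    for frame in pvs_frames:
        # An exact match with the previous unique frame extends its duration.
        if result and np.array_equal(frame, result[-1][0]):
            prev_frame, prev_duration = result[-1]
            result[-1] = (prev_frame, prev_duration + base_duration)
        else:
            result.append((frame, base_duration))
    return result
```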

For each unique distorted frame, the corresponding reference frame is searched for. A two-dimensional phase correlation with an additional sum of absolute differences (SAD) calculation on the highest correlation peaks is used as described in [26]. This algorithm is capable of returning several candidate positions, which are then chosen using a maximum-likelihood sequence estimation with temporal constraints as published by the authors in [27]. An alignment of the brightness and the color components using cumulative histograms [10] completes the alignment process. The necessary correction that was estimated is performed on the reference signal; the corresponding reference frame is therefore changed according to the brightness and color component estimation.

B. Motion Estimation

There are many algorithms which can be used for the motion estimation. They range from simple displacement measurements, e.g., block-based algorithms [28], to very elaborate algorithms which estimate object boundaries or take the HVS into account [29], [30]. Several algorithms have been tested but it was decided that a simple block-based algorithm is sufficient for the task.

The TetraVQM implementation uses a fast implementation with the Hybrid Unsymmetrical-cross Multi-Hexagon-grid Search (UMHexagonS) search pattern as published in [31]. To evaluate even fewer positions, additionally the Successive Elimination algorithm [32]–[34] is applied. This makes it possible to use a large search range at the chosen block size with reasonable computation time.

Two additional constraints were added in our implementation that smooth the resulting motion vector field. The first constraint is to avoid large motion vectors which are unlikely for the reference sequence at a reasonable frame rate. A Lagrangian term is used to penalize the length of the resulting motion vector in terms of the prediction error measure SAD. The corresponding update rule is shown in (1); it relates the prediction error of the best match, the new candidate position, and the prediction error of the new candidate position.

(1)

For the TetraVQM algorithm, a fixed value of the Lagrangian weight was chosen. The second constraint is calculated after the motion estimation is finished. Each block is tested as to whether its SAD value exceeds a certain threshold, indicating that the match is not very reliable. For such a block, additional motion vectors are tested which will worsen the SAD value but smoothen the motion vector field. The zero motion vector and the motion vectors of the neighboring blocks are evaluated. If the smallest of the new SAD values does not worsen the old SAD beyond a factor of 1.5, then it is used instead of the old one. The numerical values for the parameters of the smoothing algorithm were determined by visual inspection of the resulting motion vector field of several video sequences. The resulting block motion vectors for the current reference frame are mapped to each pixel and stored in matrices containing the motion vector in x- and y-direction and the SAD values, respectively.
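The following sketch illustrates the second smoothing constraint. The SAD threshold and the block-wise SAD callback are placeholders, as the numerical threshold of the original implementation is not reproduced here; only the candidate set (zero vector and neighboring vectors) and the factor of 1.5 follow the description above.

```python
import numpy as np

def smooth_motion_field(mv, sad, sad_threshold, compute_sad):
    """Second smoothing constraint: re-test unreliable blocks.

    mv:  array of shape (rows, cols, 2) with block motion vectors
    sad: array of shape (rows, cols) with the SAD of the best match
    sad_threshold: reliability threshold (free parameter in this sketch)
    compute_sad(row, col, vector) -> SAD of the block at (row, col)
                   when displaced by `vector` (supplied by the caller)
    """
    rows, cols = sad.shape
    smoothed = mv.copy()
    for r in range(rows):
        for c in range(cols):
            if sad[r, c] <= sad_threshold:
                continue  # match is considered reliable, keep it
            # Candidates: zero vector and the vectors of the 4-neighbors.
            candidates = [np.zeros(2)]
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                if 0 <= r + dr < rows and 0 <= c + dc < cols:
                    candidates.append(mv[r + dr, c + dc])
            cand_sads = [compute_sad(r, c, v) for v in candidates]
            best = int(np.argmin(cand_sads))
            # Accept the smoother vector if it does not worsen the SAD
            # beyond a factor of 1.5.
            if cand_sads[best] <= 1.5 * sad[r, c]:
                smoothed[r, c] = candidates[best]
    return smoothed
```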

When the scene changes, it can be assumed that the complete content is replaced and thus previously recorded presentation times and distortions are no longer valid. Because an error in the detection only affects a small part of the sequence, the reliability of the shot detection is not of outstanding importance and an easy and straightforward implementation is sufficient. It can be noted that at scene cuts either the motion vector field becomes chaotic or the prediction error is very high. The spatial activity in the motion vector field is measured by a two-dimensional convolution with a Laplacian kernel as shown in (2).

(2)

This activity is used as a discontinuity measure in (3) to detect a scene cut between the previous and the current frame. The resulting vector contains, for each frame of the reference sequence, the information whether a scene cut was detected.

(3)

Based on some examples of our training data, we chose the detection threshold for scene cuts; it is scaled by the width and the height of the image.
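A possible realization of this shot detection is sketched below. The particular Laplacian kernel, the thresholds, and the additional check of the mean prediction error are assumptions of this sketch; equation (3) itself thresholds the activity of the motion vector field.

```python
import numpy as np
from scipy.ndimage import convolve

# A standard 3x3 Laplacian kernel; the kernel actually used in the paper
# is not reproduced here, so this particular choice is an assumption.
LAPLACIAN = np.array([[0,  1, 0],
                      [1, -4, 1],
                      [0,  1, 0]], dtype=float)

def scene_cut(mv_x, mv_y, sad, activity_threshold, sad_threshold):
    """Flag a scene cut when the motion vector field is chaotic or the
    prediction error is very high (both thresholds are free parameters)."""
    activity = (np.abs(convolve(mv_x, LAPLACIAN)).sum()
                + np.abs(convolve(mv_y, LAPLACIAN)).sum())
    # Normalize by the number of pixels so the threshold is resolution
    # independent (the paper relates the threshold to width and height).
    activity /= mv_x.size
    return activity > activity_threshold or sad.mean() > sad_threshold
```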

The motion estimation and the shot detection are carried out on the reference sequence; thus, they can be done in parallel to the temporal registration.

C. Motion Trajectory

The information of the object motion has to be transferred from the reference sequence to the degraded sequence in order to correctly predict the motion of the visible distortions. When two adjacent degraded frames have adjacent corresponding reference frames, a simple copy process of the corresponding reference motion is sufficient. This only happens when there is no frame rate reduction or temporal irregularity involved. In all other cases, the motion information for several reference frames has to be combined, which may lead to a different motion vector for each pixel.

The problem shall be illustrated with two examples: In the simple case, let us assume that the distorted frame 55 matches the reference frame 60 and the next distorted frame 56 matches frame 61. Because there is only a temporal delay of five images between the reference sequence and the degraded sequence, the motion information of frame 56 of the degraded sequence is identical to the motion information of frame 61 of the reference sequence.

Fig. 4. Example for motion trajectory tracking.

The second example is more complicated: Again, we assume that the distorted frame 55 matches the reference frame 60, but the next distorted frame 56 matches frame 62, which would be a common case when the frame rate is halved. Then, the motion from the distorted frame 55 to the distorted frame 56 has to be deduced from the reference sequence's motion from 60 to 61 and from 61 to 62 as shown graphically in Fig. 4. In the example, the match for the second image block in frame 62 corresponds to a slight shift to the left when compared to frame 61. However, the matching block in frame 61 is not located on the block grid for the motion estimation from frame 60 to 61. In this example, there are two different motions in this matching block. Depending on the position of the matching block in frame 61, one, two or four motion regions may be included in one block. As a consequence, the motion vector field in the degraded sequence from frame 55 to frame 56 does not exhibit a block structure anymore. Each pixel may have a different motion vector which results from the addition of the motion vectors of the reference sequence from frame 60 to 61 and from frame 61 to 62.

Unlike the motion vectors, the prediction error of the reference motion estimation should not be simply accumulated. A simple accumulation leads to a strong contribution of the image noise to the prediction error, especially when several reference frames have to be combined. Therefore, the maximum value along the temporal trajectory is stored instead. The block diagram in Fig. 5 shows the complete implementation of the algorithm. It also takes into account that the block-based motion may find a match outside the image boundaries. This happens for example in a camera pan: New content enters the display area but the already existing content is matched by the block-based motion estimation. The new content's motion vectors then point outside the image and the prediction error is consequently set to infinity.

Fig. 5. Algorithm for motion trajectory tracking.
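The following sketch illustrates the trajectory tracking of Fig. 5 in Python: motion vectors are added along the trajectory, the prediction error is combined by the maximum, and trajectories leaving the image are marked as untrackable. The dense per-pixel representation, the vector sign convention, and the nearest-neighbor sampling are simplifications of this sketch.

```python
import numpy as np

def accumulate_trajectory(mv_x_list, mv_y_list, sad_list):
    """Chain per-pixel motion over consecutive reference frames.

    Each list entry holds dense (per-pixel) fields for one reference
    frame step, e.g. from frame 60->61 and from frame 61->62.  Vectors
    are added along the trajectory; the prediction error is not summed
    but the maximum along the trajectory is kept.  Positions whose
    trajectory leaves the image get an infinite prediction error.
    """
    height, width = mv_x_list[0].shape
    xs, ys = np.meshgrid(np.arange(width), np.arange(height))
    acc_x = np.zeros((height, width))
    acc_y = np.zeros((height, width))
    acc_sad = np.zeros((height, width))
    cur_x, cur_y = xs.astype(float), ys.astype(float)
    for mv_x, mv_y, sad in zip(mv_x_list, mv_y_list, sad_list):
        xi = np.clip(np.round(cur_x).astype(int), 0, width - 1)
        yi = np.clip(np.round(cur_y).astype(int), 0, height - 1)
        step_x, step_y = mv_x[yi, xi], mv_y[yi, xi]
        acc_x += step_x
        acc_y += step_y
        acc_sad = np.maximum(acc_sad, sad[yi, xi])
        cur_x, cur_y = cur_x + step_x, cur_y + step_y
        # Trajectories that leave the image are marked as untrackable.
        outside = (cur_x < 0) | (cur_x >= width) | (cur_y < 0) | (cur_y >= height)
        acc_sad[outside] = np.inf
    return acc_x, acc_y, acc_sad
```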

Similar to the transfer of the motion information, the shot detection information has to be calculated for the degraded sequence. There are two conditions that need to be checked. The first condition is whether a scene cut occurred in the corresponding reference frames that lie in between two degraded frames. The second condition checks whether there is a rewinding, e.g., the corresponding reference frame of the second distorted frame temporally precedes the corresponding reference frame of the first frame. Both conditions are combined in the calculation of the shot detection indicator for the degraded sequence as shown in (4).

(4)


D. Tracking Information Buffer

The estimation of the visibility of objects in the degraded video sequence along the temporal dimension is based on the prediction error of the motion estimation. A small prediction error in the current frame can be interpreted as a high probability that a pixel belongs to an object that was present in the previous frame as well. In order to decide how long an object is visible, several frames have to be considered. Therefore, the prediction error is stored in the tracking information buffer. When an object moves to a different location it is still visible for the viewer, which makes it necessary to motion compensate the previously stored prediction errors. Thus, all previous prediction error maps in the buffer have to be updated as shown in (5).

(5)

After the motion compensation, the current prediction error matrix is added to the buffer. The shot detection is evaluated for each distorted image to decide whether the buffers for the past images shall be cleared.
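A sketch of the buffer update is given below; the warping direction and the nearest-neighbor sampling are assumptions, and pixels without a valid source position are filled with an infinite prediction error. The same motion compensation is also applied to the distortion map buffer described in Section IV-G.

```python
import numpy as np

def motion_compensate(prev_map, mv_x, mv_y, fill_value=np.inf):
    """Warp a stored per-pixel map to the object positions of the
    current frame using the current frame's per-pixel motion."""
    height, width = prev_map.shape
    xs, ys = np.meshgrid(np.arange(width), np.arange(height))
    # Source position in the previous frame (nearest-neighbor sampling).
    src_x = np.round(xs - mv_x).astype(int)
    src_y = np.round(ys - mv_y).astype(int)
    inside = (0 <= src_x) & (src_x < width) & (0 <= src_y) & (src_y < height)
    warped = np.full_like(prev_map, fill_value, dtype=float)
    warped[inside] = prev_map[src_y[inside], src_x[inside]]
    return warped

class TrackingInformationBuffer:
    """Holds motion-compensated prediction error maps of past frames."""
    def __init__(self):
        self.maps = []

    def update(self, sad_map, mv_x, mv_y, scene_cut):
        if scene_cut:
            self.maps.clear()  # past content is no longer valid
        # Move every stored map to the object positions of the current frame.
        self.maps = [motion_compensate(m, mv_x, mv_y) for m in self.maps]
        self.maps.append(sad_map.astype(float))
```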

E. Temporal Visibility and Reliability

After building and updating the tracking information buffer, the period of time that an object is visible can be estimated. For each pixel, the maximum prediction error along the temporal trajectory is calculated as shown in (6).

(6)

The values in this matrix increase monotonically when going backwards in time. When a certain threshold is exceeded, the corresponding pixel cannot be tracked any further backwards in time. The value of the threshold was manually chosen. The index at which this event occurs is stored, and the corresponding presentation time is calculated using the frame durations. This is shown in (7) and (8).

(7)

(8)

The visibility of a degradation is a function of the time it is visible to the viewer in a subjective test. This has been extensively analyzed by the authors in a subjective test with coded still images that were only presented for a short time [35]. The usage of still images instead of video sequences avoids the influence of motion masking effects. In our previous publication, we fitted a sigmoid function to the subjective data, which is called the Presentation Time Model. This model can now be applied to the matrix which contains the estimated visibility duration as shown in (9). The resulting matrix contains the influence of the temporal visibility of each pixel.

(9)

The function is also plotted in Fig. 6 for the relevant duration. After a presentation of two seconds, it can be assumed that the viewer has seen all coding artifacts and that his rating becomes stable.

Fig. 6. Weighting for different presentation times.

The analysis is based on the maximum of the cumulative prediction error so far. Additional information can be extracted from a closer view on the cumulative prediction error: The presentation time estimation is more likely to be correct when the prediction error stays small until it suddenly jumps above the threshold. When the prediction error rises early and remains slightly below the threshold, it is more likely to be an overestimation of the presentation time which needs to be reduced. This is modeled by the average of the cumulative SAD values as shown in Fig. 7. A simple linear relationship is assumed in the current algorithm. This may be improved when an analysis concerning the cognitive influence of objects that are hard to track becomes available. The resulting function needs to be squared in order to convert the SAD values to the energy scale as shown in (10). The resulting matrix contains the reliability of the presentation time estimation for each pixel. Both the temporal visibility and the reliability will be used multiplicatively on the degradation estimation in the estimation of attention and focus point.

(10)

Fig. 7. Reliability weight for different averaged SAD values.
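The following sketch combines the visibility-duration estimate with the two weights. The tracking threshold, the sigmoid parameters of the Presentation Time Model, and the exact form of the linear reliability function are placeholders; only the saturation after roughly two seconds and the halving of the influence at an averaged SAD of 0.05 are taken from the description in the text.

```python
import numpy as np

def visibility_duration(sad_buffer, durations, track_threshold):
    """Estimate, per pixel, how long the content has been visible.

    sad_buffer: list of motion-compensated prediction error maps,
                oldest first, newest last (see Section IV-D)
    durations:  display duration of each buffered frame in seconds
    track_threshold: SAD value above which tracking is considered lost
                (free parameter in this sketch)
    """
    shape = sad_buffer[-1].shape
    t_vis = np.zeros(shape)
    running_max = np.zeros(shape)
    # Walk backwards in time and accumulate display time while the
    # maximum prediction error along the trajectory stays below the
    # threshold.
    for sad_map, dur in zip(reversed(sad_buffer), reversed(durations)):
        running_max = np.maximum(running_max, sad_map)
        t_vis += np.where(running_max <= track_threshold, dur, 0.0)
    return t_vis

def presentation_time_weight(t_vis, t_half=0.5, steepness=8.0):
    """Sigmoid Presentation Time Model (illustrative parameters only):
    short visibility -> low weight, saturation after roughly 2 s."""
    return 1.0 / (1.0 + np.exp(-steepness * (t_vis - t_half)))

def reliability_weight(mean_cum_sad, half_at=0.05):
    """Linear reliability weight, squared to stay on the energy scale;
    an averaged SAD of `half_at` halves the distortion's influence."""
    w = np.clip(1.0 - 0.5 * mean_cum_sad / half_at, 0.0, 1.0)
    return w ** 2
```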

F. Spatial Part of Traditional VQM

The inputs to the spatial distortion analysis are the current degraded frame and the corresponding reference frame which has been spatially aligned. The colors of the corresponding reference frame are matched to the degraded frame as well.

This part of the TetraVQM algorithm is intentionally kept simple. As explained in Section II, most of today's video quality measures contain a very sophisticated spatial processing. They can be implemented at this position.

In order to keep the processing as simple and as clearly arranged as possible, it was decided to use the simple squared error at this stage. The squared error is also the basic building block of the PSNR measure, which allows a comparison of the performance of the TetraVQM algorithm to the well-known PSNR later. The idea of the energy measure of the PSNR will be used throughout the remaining steps and the conversion to logarithmic scale will be performed in the temporal summation step.

Instead of only using the luminance component, the color components are also considered, but only with a small weight. The spatial distortion map is generated in a way that a difference of five luminance steps in an eight-bit quantized luminance signal is normalized to one. As most of the steps do not change the range of the values and none relies on the input range, this is not a necessary condition. However, it simplifies the interpretation of the results in the different steps because a value of one, two or three corresponds to a PSNR value of 34, 28, and 25 dB. In (11), the calculation of the visible distortion map is given based on the three color components of the temporally aligned reference sequence and of the degraded sequence.

(11)
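A sketch of this distortion map is given below for signals scaled to the range zero to one; the numerical color weight is an assumption, since the text only states that the color weight is small.

```python
import numpy as np

# Signals are assumed to be in [0, 1] as described above; a luminance
# difference of 5/255 is normalized to a map value of one.
NORM = 255.0 / 5.0

def spatial_distortion_map(y_ref, cb_ref, cr_ref, y_deg, cb_deg, cr_deg,
                           color_weight=0.1):
    """Squared-error distortion map; `color_weight` is an illustrative
    value, the paper only states that the color weight is small."""
    d_y = (NORM * (y_ref - y_deg)) ** 2
    d_c = (NORM * (cb_ref - cb_deg)) ** 2 + (NORM * (cr_ref - cr_deg)) ** 2
    return d_y + color_weight * d_c
```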

G. Distortion Map Buffer

The viewer in a subjective test does not rate one isolated frame after another. Instead, he integrates over a certain time period and weights the current distortions more strongly than previous distortions. When an object moves, he will integrate over the quality of the object, not over the spatial position in the image. Therefore, the spatial distortion maps have to be compensated prior to a weighted accumulation of the distortions. The compensation is performed for each previous distortion map in the buffer with the motion vectors of the current frame as shown in (12). When a scene cut is detected for the current frame, the complete distortion map buffer is emptied.

(12)

After the motion compensation, the current distortion map is stored in the buffer.

Fig. 8. Temporal weighting of distortion map buffer.

H. Estimation of Spatio–Temporal Distortion Map

This step combines the motion-compensated spatial distortion maps that are stored in the distortion map buffer. Three different effects are modeled. The first effect deals with the memory of the human viewer. As stated in our first assumption in Section II, the instantaneous perception of visual quality is not only based on the current image but it is influenced by previously perceived distortions as well. It is assumed that an integration time of one third of a second is necessary, because this is known to be the time a human needs to recognize digits, e.g., on a measurement device before the display changes. A simple exponentially falling curve with this time constant is used to weight the degradation as shown in (13) and plotted in Fig. 8. Because the frame display time may differ for each frame in the history buffer, the integral of the temporal filter function is used to determine the weight for each frame as shown in (14).

(13)

(14)

The second and the third effect deal with the presentation time and the reliability of the estimation for each individual pixel.


When a pixel is marked as not being trackable, the influence of the past distortion map at that pixel is zero. Similarly, the cumulative SAD value is used as an estimate for the reliability that this pixel adds to the distortion seen by the viewer. An SAD value of 0.05 causes the influence of the distortion to be halved, thus the energy to be only a quarter of the stored value. The final calculation for the spatial map of visible distortions is given in (15).

(15)
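The accumulation of the buffered distortion maps can be sketched as follows; the weights for presentation time and reliability are taken from Section IV-E, and the normalization by the sum of the temporal weights is an assumption of this sketch.

```python
import numpy as np

T_MEM = 1.0 / 3.0   # memory time constant: one third of a second

def frame_weights(durations):
    """Integral of exp(-t / T_MEM) over each frame's display interval,
    newest frame first (t = 0 at the current frame)."""
    weights, t = [], 0.0
    for dur in durations:          # durations ordered newest -> oldest
        w = T_MEM * (np.exp(-t / T_MEM) - np.exp(-(t + dur) / T_MEM))
        weights.append(w)
        t += dur
    return np.array(weights)

def spatio_temporal_distortion(dist_buffer, durations, trackable, w_rel):
    """Weighted accumulation of motion-compensated distortion maps.

    dist_buffer: list of distortion maps, newest first
    durations:   display durations matching dist_buffer
    trackable:   boolean map, False where past maps must not contribute
    w_rel:       per-pixel reliability weight (Section IV-E)
    Normalizing by the sum of the weights is an assumption of this sketch.
    """
    w = frame_weights(durations)
    current = dist_buffer[0] * w[0]
    past = sum(wi * d for wi, d in zip(w[1:], dist_buffer[1:]))
    combined = current + np.where(trackable, w_rel * past, 0.0)
    return combined / w.sum()
```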

I. Estimation of Attention and Focus Point

This is the first step in the spatio-temporal pooling stage of the TetraVQM algorithm. Besides the instantaneously visible spatial distortion matrix, the weighting matrices for the presentation time and its reliability are inputs to this step. They can be multiplicatively combined to yield an estimate of the final distortion.

The purpose of this step is to find the spatial position on which the viewer has to focus in order to detect the maximum degradation. This is modeled by using the cone distribution of the retina. The process was already described by the authors in [36]. The two-dimensional filter for the cone distribution given in (16) depends on the distance to the screen. In Fig. 9, an example filter for an image in common intermediate format (CIF) resolution at a viewing distance of 6 H is shown.

(16)

This filter needs to be applied to every pixel in the image, which can easily be performed by a two-dimensional convolution. The maximum of the convolution in (17) represents the above-mentioned condition that the viewer has found the position he has to focus on in order to perceive the maximum degradation.

(17)

Fig. 9. Mesh plot of fovea filter for CIF resolution viewed at a distance of 6 H.
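The following sketch illustrates the focus-point estimation; an isotropic Gaussian window is used as a stand-in for the cone-density filter of (16), whose exact form and viewing-distance dependency are not reproduced here.

```python
import numpy as np
from scipy.signal import convolve2d

def fovea_filter(radius_px, sigma_px):
    """Isotropic, normalized 2-D weighting window as a stand-in for the
    cone-density filter of (16); its width would be derived from the
    viewing distance."""
    ax = np.arange(-radius_px, radius_px + 1)
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma_px ** 2))
    return kernel / kernel.sum()

def focus_point_distortion(distortion_map, kernel):
    """Convolve the distortion map with the foveal window and return the
    maximum response, i.e. the degradation at the assumed focus point."""
    response = convolve2d(distortion_map, kernel, mode='same', boundary='symm')
    return response.max()
```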

J. Temporal Summation

The temporal summation is performed with a simple averaging over the total number of unique images in the distorted sequence. This results in the final spatial distortion indicator as shown in (18). In analogy to the PSNR algorithm, which was used as the spatial analysis part of TetraVQM, the logarithm of the resulting average is taken.

(18)

K. Modeling Frame Rate, Pauses, and Skips

This algorithm uses the duration of each distorted frame and the temporal correspondence between the reference and the distorted frames. This correspondence was estimated in the Temporal and Color Alignment. The number of repeated frames and skipped frames are calculated for each distorted frame. Two indicators stem from the analysis of a reduced frame rate and an irregularity in the temporal flow, respectively.

The distinction between a continuous repetition and skipping due to a frame rate reduction and an anomalous one-time event is performed using a histogram approach. In (19), the calculation for the histogram matrix is shown. Because the viewer in a subjective test is distracted at the start and the end of a video sequence by the change to a gray background, he will not be able to notice any pausing. So the first and the last occurrence are skipped.

(19)

In the next step, the portion of the histogram that represents the reduction of the frame rate is selected. It is assumed that a frequently occurring event of pausing and skipping indicates a reduced frame rate. A stable configuration which was empirically found is to require that a frequently occurring event appears in at least 15% of the sequence and at least three times. The frame rate histogram contains only those conditions that meet these requirements. It is generated as shown in (20) and (21).

(20)

(21)
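A sketch of this histogram-based classification is given below; the exclusion of the first and the last occurrence is omitted for brevity, and the event representation is illustrative.

```python
from collections import Counter

def split_repeat_skip_events(events, num_frames):
    """Split (n_repeated, n_skipped) events, one tuple per distorted
    frame, into those explained by a regular frame rate reduction and
    anomalous pause/skip events.  Note: the paper additionally skips
    the first and the last occurrence (viewers are distracted at the
    sequence boundaries); that detail is omitted in this sketch."""
    hist = Counter(events)
    frame_rate_hist, anomalous_hist = {}, {}
    for condition, count in hist.items():
        # Frequent events (>= 15% of the sequence and >= 3 occurrences)
        # are attributed to a reduced frame rate ...
        if count >= 3 and count >= 0.15 * num_frames:
            frame_rate_hist[condition] = count
        else:
            # ... the rest is treated as anomalous pausing and skipping.
            anomalous_hist[condition] = count
    return frame_rate_hist, anomalous_hist
```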

For the annoyance of the frame rate reduction, a function has been manually fitted to the results of the subjective experiment presented in [3]. The degradation from an optimal MOS score was estimated to yield a difference mean opinion score (DMOS) of 0.5 for 12.5 fps, 1 for 8 fps, 1.5 for 5 fps, and 3 for 2.5 fps. It is assumed that the degradation is largest at 2.5 fps because the sequence is most jerky. A further reduction leads to a slide-show effect which does not look like a video sequence but gets higher subjective ratings. This should be further analyzed in a separate subjective experiment. A fitting has been performed which uses the logarithm of the frame rate as in [37]. However, in our fitting we use two sigmoid functions. The sigmoidal slope guarantees an upper and lower bound when the frame rate approaches zero or infinity. The resulting function is given in (22) and it is also depicted in Fig. 10.

(22)

Fig. 10. DMOS degradation for different frame rates.

The reduction of the frame rate is less severe in still or low-motion video sequences or sections of the sequence. In order to take this into account as well, an average motion vector length in pixels is calculated for each distorted frame as shown in (23).

(23)

Combining the above parameters, a single value for the frame rate degradation is determined. A lower frame rate in a small part of the sequence will have a high influence on the total score. Thus, a fourth-order Lebesgue norm is preferred to simple averaging. Because the reference frame rate may also be reduced, the perceived difference to the reference is used. The resulting calculation for the frame rate degradation is shown in (24).

(24)
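As a small illustration, the fourth-order norm pooling can be written as follows; the per-frame degradation values themselves would be computed according to (24), which is not reproduced here.

```python
import numpy as np

def pool_frame_rate_degradation(per_frame_degradation):
    """Fourth-order (Lebesgue L4) norm over per-frame degradation values:
    a low frame rate in a small part of the sequence still dominates."""
    d = np.asarray(per_frame_degradation, dtype=float)
    return (np.mean(d ** 4)) ** 0.25
```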

So far, only the frequently occurring events in the frame rate histogram are assessed. The remaining events of frame repetitions and frame skips are considered to be anomalous temporal artifacts. Thus, those parts of the histogram are used to model the influence of pauses and skips. The pausing and skipping algorithm is also part of the Perceptual Evaluation of Video Quality (PEVQ) model and it is described in Annex B of ITU-T J.247 [14] in the section titled "Analysis of Frame Repeats and Frame Skips." The indicator is termed "FrameRepeatIndicator" in the PEVQ algorithm.

L. Combination of Indicators

The TetraVQM algorithm generates three different indicators: the result of the spatial processing, the degradation due to a frame rate reduction, and the influence of pauses and skips in the video sequence. For simplicity, these indicators are stored in a result vector. In order to obtain the final output value of TetraVQM, which is denoted as OMOS, a sigmoid function is applied to each indicator, followed by a linear combination of all three indicators as shown in (25). The coefficients are given in Table I. The three indicators contribute differently to the final value. When no corresponding degradation is present, the contribution is zero. The maximum value which can be reached by each individual indicator is approximately 2.4, 2.2, and 0.6 DMOS, respectively.

TABLE I
COEFFICIENTS FOR SIGMOIDAL COMBINATION OF INDICATORS

(25)

with:

(26)
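The structure of this combination can be sketched as follows; all coefficients are placeholders for the values of Table I, and anchoring the OMOS at the best rating minus the summed contributions is an assumption of this sketch.

```python
import numpy as np

def sigmoid(x, a, b):
    """Generic sigmoid used to map an indicator to a bounded DMOS
    contribution (coefficients a, b are placeholders for Table I)."""
    return 1.0 / (1.0 + np.exp(-a * (x - b)))

def combine_indicators(r, coeffs):
    """r: indicator values (spatial, frame rate, pauses/skips).
    coeffs: per-indicator (weight, a, b) triples standing in for the
    coefficients of Table I.  Returns an estimated OMOS on the 1..5
    ACR scale; anchoring at 5 minus the summed contributions is an
    assumption of this sketch."""
    contribution = sum(w * sigmoid(x, a, b) for x, (w, a, b) in zip(r, coeffs))
    return float(np.clip(5.0 - contribution, 1.0, 5.0))
```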

The values for the sigmoidal mapping were fitted by optimizing the correlation and the root mean squared error (RMSE) value as suggested by [5]. The training database consisted of one subjective test in QCIF format with 310 sequences, two subjective tests in CIF format with 240 and 350 sequences, and one subjective test in VGA resolution with 310 sequences. All subjective experiments were generated according to ITU Recommendations with 30 viewers. A broad range of content and codec settings was used. Conditions with transmission distortions were included.

V. RESULTS

The performance of the algorithm was tested on the huge database of subjective experiments generated by VQEG in Multimedia Phase I [4] according to the multimedia testplan [5]. This database is completely distinct from the training database; thus, a fair performance evaluation is possible.

The new algorithm is compared to PSNR. Applying the PSNR algorithm to multimedia sequences without any temporal alignment does not lead to reasonable results. Therefore, two PSNR measures are presented.

The first one, named "PSNR A," uses the same temporal alignment that was described in the alignment estimation and correction step. Because a sigmoidal fitting has been applied on a training data set for the TetraVQM algorithm, the same fitting on the same training data set has been applied to this PSNR measure as well.

The second PSNR algorithm was calculated during the VQEG analysis by the National Telecommunications and Information Administration (NTIA) and is called "PSNR B." It is the PSNR metric used by VQEG for comparison in the final report [4]. It does not align the images frame by frame but uses only a global offset, thus temporally and spatially shifting the complete sequence by one offset in each dimension. All temporal offsets and all spatial positions that are allowed in the testplan are tested in an exhaustive search and the highest PSNR value is reported.

Fig. 11. Pearson correlation results of TetraVQM compared to various PSNR versions for VQEG Multimedia databases.

Each of these algorithms was applied to all tests that were performed by VQEG in Multimedia Phase I. There were 14 subjective tests in QCIF resolution, another 14 subjective experiments in CIF resolution, and 13 subjective experiments in VGA resolution. Each of these 41 experiments consisted of 166 video sequences which were assessed by 24 viewers. In total, 5320 sequences were evaluated. These sequences were generated from 346 reference video sequences. Further details can be found in [4]. Since a full reference algorithm can only predict degradations, each MOS score for each PVS obtained from the subjective Absolute Category Rating with Hidden Reference Removal (ACR-HRR) [1] experiment is normalized to the corresponding MOS score obtained for its SRC. This results in a DMOS value.

The linear correlation coefficient [38], which is also referred to as Pearson Correlation, is a measure for the linear relationship between the model output values OMOS and the DMOS values. It was calculated after a third-order monotonic fit [39] as specified in the VQEG Multimedia Testplan [5]. An optimal model would reach a Pearson Correlation with a value of one after the third-order fitting while zero would be the worst value in this case. The results are depicted in Fig. 11 for the three resolutions and all 41 experiments.
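This evaluation procedure can be sketched as follows; for brevity, a plain least-squares third-order fit is used instead of the monotonicity-constrained fit required by the test plan.

```python
import numpy as np
from scipy.stats import pearsonr

def pearson_after_cubic_fit(omos, dmos):
    """Third-order polynomial fit of the model output to the subjective
    DMOS values, followed by the Pearson correlation coefficient.
    (The VQEG test plan requires a *monotonic* third-order fit; plain
    least squares is used here for brevity.)"""
    coeffs = np.polyfit(omos, dmos, deg=3)
    fitted = np.polyval(coeffs, omos)
    r, _ = pearsonr(fitted, dmos)
    return r
```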

First, the gain of the new framework shall be analyzed without the frame rate and pausing and skipping indicators. The results of the "TetraVQM spatial" indicator show that the TetraVQM algorithm clearly outperforms both PSNR algorithms in most cases. Please note that this difference is purely due to the advanced framework which takes the temporal trajectory into account. No additional information is used.

TABLE II
PEARSON CORRELATION RESULTS

The difference between TetraVQM's spatial indicator and VQEG's reference PSNR "PSNR B" is statistically significant on a 95% confidence level in seven cases for QCIF and in five cases for CIF and VGA resolution. In some cases, "PSNR B" performs better. The reason is that TetraVQM's spatial indicator does not take the frame rate reduction into account. It penalizes spatial distortions that are visible for a longer period of time, but the algorithm is unable to estimate the annoyance for the viewer that results from the jerkiness of the reduced frame rate. In the experiments q03 and v03 the frame rate for some PVSs drops to about 3 fps.

The additional two indicators eliminate this issue. The "TetraVQM" curve shows the performance gain which is achieved. This algorithm performs better than the "PSNR B" algorithm in a statistically significant way in 11 cases for QCIF, four cases for CIF, and nine cases for VGA.

So far, the PSNR measure was used. However, the framework can be used to enhance other measures as well. For example, the PSNR algorithm can be substituted by the SSIM algorithm [40]. A MATLAB software for the algorithm is available and was used in our implementation. The results of the TetraVQM framework when the SSIM is used to generate the spatial distortion map are shown in the figures as "TetraVQM/SSIM." So far, this algorithm performs best.
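For illustration only, the following sketch obtains a per-pixel SSIM map that could serve as a spatial distortion map in such a framework. It uses scikit-image's structural_similarity rather than the MATLAB software of [40], so the interface shown here is an assumption, not the implementation used in the paper.

```python
# Illustrative sketch: a per-pixel SSIM map as a spatial distortion map.
# Frames are assumed to be grayscale (luminance) arrays in [0, 255].
import numpy as np
from skimage.metrics import structural_similarity

def spatial_distortion_map(ref_frame: np.ndarray, deg_frame: np.ndarray) -> np.ndarray:
    # structural_similarity returns the mean SSIM and, with full=True,
    # the per-pixel similarity map in [-1, 1].
    _, ssim_map = structural_similarity(ref_frame, deg_frame,
                                        data_range=255, full=True)
    # Convert similarity to distortion: 0 = identical, larger = worse.
    return 1.0 - ssim_map
```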

In Table II, the average correlation results are summarized. The VQEG Multimedia Phase I validation tests compared four FR models. Compared to the best performing model that was selected for each of the 41 experiments separately, "TetraVQM/SSIM" is statistically equivalent in six cases for QCIF, five cases for CIF, and nine cases for VGA. Thus, the overall performance of "TetraVQM/SSIM" is comparable to some models validated within VQEG's Multimedia Phase I. It should be noted that the TetraVQM algorithm was neither trained on that validation data nor was the algorithm significantly changed after VQEG's results became available.

VI. CONCLUSION

In this paper, a framework was proposed that can be universally applied to many video quality measures. It models the human assessor participating in a subjective experiment. The instantaneous distortion in the video sequence seen by the viewer and the time that the distortions remain visible are the key aspects of the framework. Based on this information, many new aspects and temporal effects of the HVS can be implemented in today's video quality measures.

An implementation of the framework, the TetraVQM algorithm, was explained in detail. Because the spatial processing part was reduced to PSNR, a comparison with this simple measure could be presented in order to show the performance gain of the new algorithm. It can be stated that the newly proposed framework leads to a significantly higher performance in most experiments. It should be noted that the training of the algorithm was done on a distinct database and that the same training was performed for the PSNR metric. It was also shown, using the SSIM algorithm, that substituting the spatial processing part with another algorithm is advantageous.

Further advances are expected from improving the spatial part of the TetraVQM algorithm with more sophisticated processing that also takes the Human Visual System into account. Additionally, several of the parameters in the algorithm were chosen manually or by visual inspection of a few sequences. Further psychophysical experiments are planned to calibrate the processing steps in many parts of the algorithm.

ACKNOWLEDGMENT

The authors would like to thank the Video Quality Experts Group for the huge effort spent on the preparation and implementation of the subjective tests. This publication is partly based on the subjective scores collected by VQEG. The performance results presented in this paper are not to be compared to the results presented in the VQEG Final Report of Multimedia Phase I [4] because the models in the report were validated using this data. Thus, the data was not available to the models that were submitted to the VQEG evaluation.

REFERENCES

[1] Subjective Video Quality Assessment Methods for Multimedia Applications, Rec. ITU-T P.910, ITU-T Study Group 12, 1997.

[2] Methodology for the Subjective Assessment of the Quality of Television Pictures, Rec. ITU-R BT.500-10, Question ITU-R 211/11, 1974.

[3] M. Barkowsky, J. Bialkowski, and A. Kaup, "Subjective Video Quality Assessment for Low Bitrate Multimedia Applications (in German)," in ITG Fachbericht 188: Elektronische Medien 2005. Berlin, Germany: VDE-Verlag, 2005, pp. 169–175.

[4] Final Report of VQEG's Multimedia Phase I Validation Test, TD 923, ITU Study Group 9, 2008.

[5] Multimedia Group Test Plan Draft Version 1.19, D. Hands and K. Brunnstrom, Eds. Boulder, CO: Video Quality Experts Group (VQEG), 2007.

[6] A3: Objective Video Quality Measurement Using a Peak-Signal-to-Noise-Ratio (PSNR) Full Reference Technique, NTIA/ITS, T1.TR.PP.74-2001, 2001.

[7] J. Lubin and D. Fibush, "Sarnoff JND vision model (ANSI submission)," IEEE G-2.1.6 Compression and Processing Subcommittee, 1997.

[8] A. B. Watson and J. Ahumada, "A standard model for foveal detection of spatial contrast," J. Vision, vol. 5, pp. 717–740, 2006.

[9] S. Winkler, Digital Video Quality: Vision Models and Metrics. New York: Wiley, 2005.

[10] A. Hekstra, J. Beerends, D. Ledermann, F. de Caluwe, S. Kohler, R. Koenen, S. Rihs, M. Ehrsam, and D. Schlauss, "PVQM—A perceptual video quality measure," Signal Process.: Image Commun., vol. 17, pp. 781–798, 2002.

[11] Final Report From the Video Quality Experts Group on the Validation of Objective Models of Video Quality Assessment, Contribution 80, ITU Study Group 9, 2000.


[12] Final Report From the Video Quality Experts Group on the Validation of Objective Models of Video Quality Assessment, Phase II (FR-TV2), Contribution 60, ITU Study Group 9, 2003.

[13] Objective Perceptual Video Quality Measurement Techniques for Digital Cable Television in the Presence of a Full Reference, Rec. ITU-T J.144, ITU-T Study Group 9, 2004.

[14] Objective Perceptual Multimedia Video Quality Measurement in the Presence of a Full Reference, Rec. ITU-T J.247, ITU Study Group 9, 2008.

[15] S. Wolf and M. Pinson, Video Quality Measurement Techniques. Boulder, CO: Inst. for Telecomm. Sci., 2002.

[16] E. P. Ong, X. Yang, W. Lin, Z. Lu, S. Yao, X. Lin, S. Rahardja, and B. C. Seng, "Perceptual quality and objective quality measurements of compressed videos," J. Vis. Commun. Image Representation, vol. 17, pp. 717–737, 2006.

[17] A. B. Watson, J. Hu, and J. F. McGowan III, "DVQ: A digital video quality metric based on human vision," J. Electron. Imag., vol. 10, no. 1, pp. 20–29, 2001.

[18] S. J. P. Westen, R. L. Lagendijk, and J. Biemond, "Spatio-temporal model of human vision for digital video compression," Proc. SPIE Human Vis. Electron. Imag. II, vol. 3016, pp. 260–268, 1997.

[19] P. Lindh and C. van den Branden Lambrecht, "Efficient spatio-temporal decomposition for perceptual processing of video sequences," in Proc. Int. Conf. Image Process. (ICIP), 1996, vol. 3, pp. 331–334.

[20] M. A. Masry and S. S. Hemami, "A metric for continuous quality evaluation of compressed video with severe distortions," Signal Process.: Image Commun., vol. 19, pp. 133–146, 2004.

[21] C. van den Branden Lambrecht, D. Costantini, G. Sicuranza, and M. Kunt, "Quality assessment of motion rendition in video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 9, no. 5, pp. 766–782, Aug. 1999.

[22] Z. Wang and Q. Li, "Video quality assessment using a statistical model of human visual speed perception," J. Opt. Soc. Amer. A, vol. 24, pp. B61–B69, 2007.

[23] K. Seshadrinathan and A. C. Bovik, "A structural similarity metric for video based on motion models," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2007, vol. 1, pp. 869–872.

[24] J. M. Wolfe, A. Oliva, S. J. Butcher, and H. C. Arsenio, "An unbinding problem?: The disintegration of visible, previously attended objects does not attract attention," J. Vision, vol. 2, pp. 256–271, 2002.

[25] R. R. Pastrana-Vidal, J. C. Gicquel, C. Colomes, and C. Hocine, "Temporal masking effect on dropped frames at video scene cuts," in Proc. SPIE: Human Vis. Electron. Imag. IX, 2004, vol. 5292, pp. 192–201.

[26] M. Barkowsky, R. Bitto, J. Bialkowski, and A. Kaup, "Comparison of matching strategies for temporal frame registration in the perceptual evaluation of video quality," in Proc. 2nd Int. Workshop Video Process. Quality Metrics for Consumer Electron., 2006.

[27] M. Barkowsky, J. Bialkowski, R. Bitto, and A. Kaup, "Temporal registration using 3D phase correlation and a maximum likelihood approach in the perceptual evaluation of video quality," in Proc. IEEE Int. Workshop Multimedia Signal Process., 2007, pp. 195–198.

[28] J. R. Jain and A. K. Jain, "Displacement measurement and its application in interframe image coding," IEEE Trans. Commun., vol. COM-29, no. 12, pp. 1799–1808, Dec. 1981.

[29] K. Andersson, "Motion estimation for perceptual image sequence coding," Ph.D. dissertation, Linköping Univ., Linköping, Sweden, 2003.

[30] V. Mezaris, I. Kompatsiaris, and M. G. Strintzis, "Video object segmentation using Bayes-based temporal tracking and trajectory-based region merging," IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 6, pp. 782–795, Jun. 2004.

[31] Z. Chen, P. Zhou, and Y. He, "Fast integer PEL and fractional PEL motion estimation for JVT," JVT-F017, 2002.

[32] W. Li and E. Salari, "Successive elimination algorithm for motion estimation," IEEE Trans. Image Process., vol. 4, no. 1, pp. 105–107, Jan. 1995.

[33] M. Brünig and W. Niehsen, "Fast full-search block matching," IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 5, pp. 241–247, May 2001.

[34] J. Lu and Z. Lin, "Deliberation with fast full-search block matching," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 1, pp. 97–99, Jan. 2003.

[35] M. Barkowsky, B. Eskofier, J. Bialkowski, and A. Kaup, "Influence of the presentation time on subjective votings of coded still images," in Proc. Int. Conf. Image Process., 2006, pp. 429–432.

[36] M. Barkowsky, B. Eskofier, R. Bitto, J. Bialkowski, and A. Kaup, "Perceptually motivated spatial and temporal integration of pixel based video quality measures," in Proc. Mobile Content Quality Exper., 4th Int. Conf. Heterog. Netw. Qual., Reliab., Sec., Robust., 2007.

[37] Q. Huynh-Thu and M. Ghanbari, "Impact of jitter and jerkiness on perceived video quality," in Proc. Workshop Video Process. Quality Metrics, 2006.

[38] M. R. Spiegel and L. J. Stephens, Schaum's Outline of Theory and Problems of Statistics. New York: McGraw-Hill, 1998.

[39] C. Fenimore, J. Libert, and M. H. Brill, "Algebraic constraints implying monotonicity for cubics; Monotonic cubic regression using standard software for constrained optimization," NIST, Gaithersburg, MD, 1999.

[40] Z. Wang and A. C. Bovik, "A universal image quality index," IEEE Signal Process. Lett., vol. 9, no. 3, pp. 81–84, Mar. 2002.

Marcus Barkowsky received the Dipl.-Ing. degree in electrical engineering from the University of Erlangen-Nuremberg, Erlangen, Germany, in 1999. He is currently pursuing the Dr.-Ing. degree at the University of Erlangen-Nuremberg. Starting from a deep knowledge of video coding algorithms, his Ph.D. thesis focuses on a reliable video quality measure for low bitrate scenarios with special emphasis on mobile transmission.

From 2000 to 2002, he was with the Fraunhofer Institute for Integrated Circuits (IIS-A), Erlangen, as a Research Scientist. In 2001, he gave a lecture at the Technical University of Ilmenau, Germany. In 2002, he joined the Chair of Multimedia Communications and Signal Processing at the University of Erlangen-Nuremberg. His research was performed in an industry cooperation with OPTICOM GmbH, Erlangen. Since November 2008, he has been researching the influence of 3-D television on the Human Visual System at the University of Nantes, Nantes, France.

Jens Bialkowski received the Dipl.-Ing. degree in electrical engineering from the University of Ulm, Ulm, Germany, in 2001. He is currently pursuing the Dr.-Ing. degree at the University of Erlangen-Nuremberg, Erlangen, Germany.

In January 2002, he joined the Chair of Multimedia Communications and Signal Processing, University of Erlangen-Nuremberg, Erlangen, Germany. There, he is currently working as a Research Assistant in cooperation with Siemens AG (Corporate Technology), Munich, Germany, while pursuing the Dr.-Ing. degree. His research is focused on digital video processing, where he is concentrating on the investigation of video transcoding.

Björn Eskofier received the Dipl.-Ing. degree in electrical engineering from the University of Erlangen-Nuremberg, Erlangen, Germany, in 2006. He is currently pursuing the Dr.-Ing. degree at the University of Erlangen-Nuremberg.

After graduating, he joined the Chair for Pattern Recognition, University of Erlangen-Nuremberg. He is currently working as an Assistant Researcher in cooperation with adidas AG (adidas innovation team), Portland, OR, while pursuing the Dr.-Ing. degree. His research focuses on the application of pattern recognition algorithms in digital sports, his main interests lying in signal processing and sensor data fusion.


Roland Bitto received the Dipl.-Ing. degree in electrical engineering from the University of Erlangen-Nuremberg, Erlangen, Germany, in 1992.

After three years in industry, he was with the Fraunhofer Institute for Integrated Circuits from 1995 to 2000 as a Research Scientist. His work was dedicated to the research of psychoacoustics and the development of perceptual measurement and audio codecs. In 2000, he joined OPTICOM GmbH, Erlangen, where he is responsible for the development of voice, audio, and video quality measurement algorithms. His current work focuses on the development of the Perceptual Evaluation of Video Quality, which is discussed for standardization by the ITU.

Mr. Bitto received the Publications Award for the best paper of the years 2000/2001 from the AES. He was one of the main contributors to the Recommendations ITU-R BS.1387 PEAQ and ITU-T P.563/3SQM.

André Kaup (M'96–SM'99) received the Dipl.-Ing. and Dr.-Ing. degrees in electrical engineering from RWTH Aachen University, Aachen, Germany, in 1989 and 1995, respectively.

From 1989 to 1995, he was with the Institute for Communication Engineering, Aachen University of Technology, where he was responsible for industrial as well as academic research projects in the area of high-resolution printed image compression, object-based image analysis and coding, and models for human perception. In 1995, he joined the Networks and Multimedia Communications Department at Siemens Corporate Technology, Munich, where he chaired work packages in several European research projects in the area of very low bitrate video coding, image quality enhancement, and mobile multimedia communications. In 1999, he was appointed head of the mobile applications and services group in the same department, with research focusing on multimedia adaptation for heterogeneous communication networks. Since 2001, he has been a Full Professor and Head of the Chair of Multimedia Communications and Signal Processing at the Friedrich-Alexander-University of Erlangen-Nuremberg. From 1998 to 2001, he served as an Adjunct Professor at the Technical University of Munich and the University of Erlangen-Nuremberg, teaching courses on image and video communication. From 2005 to 2007, he was vice-speaker of the DFG Collaborative Research Center 603 "Modeling and Analysis of Complex Scenes and Sensor Data."

Prof. Kaup is a member of the German ITG. He was elected Siemens Inventor of the Year 1998 and is a recipient of the 1999 ITG Award.