Universidad Politécnica de Madrid
Escuela Técnica Superior de Ingenieros de
Telecomunicación
ANALYSIS, MONITORING, AND MANAGEMENT OF
QUALITY OF EXPERIENCE IN VIDEO DELIVERY
SERVICES OVER IP
Tesis Doctoral
Pablo Pérez García
Ingeniero de Telecomunicación
2013
Universidad Politécnica de Madrid
Departamento de Señales, Sistemas y
Radiocomunicaciones
Escuela Técnica Superior de Ingenieros de
Telecomunicación
Tesis Doctoral
ANALYSIS, MONITORING, AND MANAGEMENT OF QUALITY OF
EXPERIENCE IN VIDEO DELIVERY SERVICES OVER IP
Autor:
Pablo Pérez García
Ingeniero de Telecomunicación
Director:
Narciso García Santos
Doctor Ingeniero de Telecomunicación
2013
Tesis Doctoral
ANALYSIS, MONITORING, AND MANAGEMENT OF QUALITY OF
EXPERIENCE IN VIDEO DELIVERY SERVICES OVER IP
Autor: Pablo Pérez García
Director: Narciso García Santos
Tribunal nombrado por el Magfco. y Excmo. Sr. Rector de la Universidad Politécnica
de Madrid, el día . . . . . . de . . . . . . . . . . . . . . . . . . . . . . . . de 2013.
Presidente: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Vocal: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Vocal: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Vocal: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Secretario: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Realizado el acto de defensa y lectura de la Tesis el día . . . . . . de . . . . . . . . . . . . . . . . . . . . . de
2013 en . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Calificación: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
EL PRESIDENTE LOS VOCALES
EL SECRETARIO
“If you make listening and observation your occupation you will gain much more than
you can by talk.”
Robert Baden-Powell
UNIVERSIDAD POLITÉCNICA DE MADRID
Abstract
TESIS DOCTORAL
ANALYSIS, MONITORING, AND MANAGEMENT OF QUALITY OF
EXPERIENCE IN VIDEO DELIVERY SERVICES OVER IP
by Pablo Pérez García
This thesis proposes a comprehensive approach to the monitoring and management of
Quality of Experience (QoE) in multimedia delivery services over IP. It addresses the
problem of preventing, detecting, measuring, and reacting to QoE degradations, under
the constraints of a service provider: the solution must scale for a wide IP network
delivering individual media streams to thousands of users.
The solution proposed for the monitoring is called QuEM (Qualitative Experience Monitoring). It is based on the detection of degradations in the network Quality of Service (packet losses, bandwidth drops...) and the mapping of each degradation event to a qualitative description of its effect on the perceived Quality of Experience (audio mutes, video artifacts...). This mapping is based on the analysis of the transport and Network Abstraction Layer information of the coded stream, and allows a good characterization of the most relevant defects that exist in this kind of service: screen freezing, macroblocking, audio mutes, video quality drops, delay issues, and service outages. The results have been validated by subjective quality assessment tests. The methodology used for those tests has also been designed to mimic as much as possible the conditions of a real user of those services: the impairments to evaluate are introduced randomly in the middle of a continuous video stream.
Based on the monitoring solution, several applications have been proposed as well: an
unequal error protection system which provides higher protection to the parts of the
stream which are more critical for the QoE, a solution which applies the same principles
to minimize the impact of incomplete segment downloads in HTTP Adaptive Streaming,
and a selective scrambling algorithm which ciphers only the most sensitive parts of the
media stream. A fast channel change application is also presented, as well as a discussion
about how to apply the previous results and concepts in a 3D video scenario.
UNIVERSIDAD POLITÉCNICA DE MADRID
Resumen
TESIS DOCTORAL
ANALYSIS, MONITORING, AND MANAGEMENT OF QUALITY OF
EXPERIENCE IN VIDEO DELIVERY SERVICES OVER IP
por Pablo Pérez García

Esta tesis estudia la monitorización y gestión de la Calidad de Experiencia (QoE) en los servicios de distribución de vídeo sobre IP. Aborda el problema de cómo prevenir, detectar, medir y reaccionar a las degradaciones de la QoE desde la perspectiva de un proveedor de servicios: la solución debe ser escalable para una red IP extensa que entregue flujos individuales a miles de usuarios simultáneamente.

La solución de monitorización propuesta se ha denominado QuEM (Qualitative Experience Monitoring, o Monitorización Cualitativa de la Experiencia). Se basa en la detección de las degradaciones de la calidad de servicio de red (pérdidas de paquetes, disminuciones abruptas del ancho de banda...) e inferir de cada una una descripción cualitativa de su efecto en la Calidad de Experiencia percibida (silencios, defectos en el vídeo...). Este análisis se apoya en la información de transporte y de la capa de abstracción de red de los flujos codificados, y permite caracterizar los defectos más relevantes que se observan en este tipo de servicios: congelaciones, efecto de "cuadros", silencios, pérdida de calidad del vídeo, retardos e interrupciones en el servicio. Los resultados se han validado mediante pruebas de calidad subjetiva. La metodología usada en esas pruebas se ha desarrollado a su vez para imitar lo más posible las condiciones de visualización de un usuario de este tipo de servicios: los defectos que se evalúan se introducen de forma aleatoria en medio de una secuencia de vídeo continua.

Se han propuesto también algunas aplicaciones basadas en la solución de monitorización: un sistema de protección desigual frente a errores que ofrece más protección a las partes del vídeo más sensibles a pérdidas, una solución para minimizar el impacto de la interrupción de la descarga de segmentos de Streaming Adaptativo sobre HTTP, y un sistema de cifrado selectivo que encripta únicamente las partes del vídeo más sensibles. También se ha presentado una solución de cambio rápido de canal, así como el análisis de la aplicabilidad de los resultados anteriores a un escenario de vídeo en 3D.
Acknowledgements
This thesis would not have been possible without the help of all the people with whom I
have been so lucky to share my way in these more than eight years. Let me express my
gratitude to all of them in my mother tongue.
La vida es un conjunto de relaciones; y enumerar todas las que se pueden forjar en los ocho años que ha durado este trabajo ocuparía más espacio del que, probablemente, sea razonable dedicar en una tesis doctoral. De modo que es probable que esté siendo injusto con algunas personas que, por descuido, olvido, o falta de espacio, no aparecerán aquí citadas. Vaya de antemano mi disculpa (y agradecimiento) también para ellas.

Gracias ante todo a Narciso García, que sigue logrando sacar huecos en su cada vez más complicada agenda para acompañarme en esta aventura. Es un privilegio contar con él como director de tesis.

Gracias también, muy especialmente, a Jaime Ruiz, que ha sido mucho más que un manager en estos ocho años. No exagero si digo que, si no fuera por él, difícilmente podría yo haber terminado este trabajo.

Gracias al excepcional equipo humano y profesional con el que he tenido la suerte de trabajar a lo largo de estos años en Telefónica I+D y Alcatel-Lucent. A Jesús Macías, que me enseñó a mirar el vídeo de otra manera. A Álvaro Villegas, en cuyo trabajo se apoya buena parte del mío. A Silvia Varela, por ayudarme a encontrar el enfoque de este espinoso asunto de la calidad. A Enrique Estalayo y José M. Cubero, con los que he compartido tanto en tantos proyectos. A Ernesto Puerta, por las conversaciones sobre cuantificación y otros asuntos arcanos. A Javier López Poncela, por guiarme por los entresijos de los descodificadores.

Gracias también a la gente del Grupo de Tratamiento de Imágenes, que me ha seguido acogiendo como en casa durante todos estos años. Muy en particular a Jesús Gutiérrez, por todo el trabajo de las pruebas de calidad subjetiva: sin él, acabar esta tesis habría resultado mucho más difícil. Gracias también a Julián Cabrera y Fernando Jaureguizar, siempre dispuestos a echar una mano en lo que hiciera falta.

Mi sincero agradecimiento a todas aquellas personas que, a lo largo de estos años, han puesto también su granito de arena en esta tesis. A Juan Casal, por compartir su experiencia sobre codificación de vídeo. A Rocío Bravo, por la ayuda con las audiencias de televisión. A todos los socios del CENIT VISION, donde se gestó buena parte de la investigación que ahora presento.
Finalmente, muchas gracias a mi familia y amigos. A mis hermanos Lucas y David, que marcaron el camino a seguir. A mi hermano Jesús, de quien he aprendido lo poco que sé de audio digital (y algún que otro truco de televisión). A mi madre Teresa, que tanto ha puesto de su parte para empujarme a terminar la tesis. A mi padre Juan, a quien seguro que le habría gustado verla acabada, y con quien también he discutido alguna de las ecuaciones que en ella aparecen. Y a Graciela, por todo lo que hemos compartido, y lo que queda por venir; tanto, que no se puede resumir en una frase.

Gracias, en definitiva, a todos los que han hecho posible que esta tesis se haya escrito. Aun de aquellos que, por la falta de espacio, no he tenido ocasión de mencionar en estas líneas, guardo un buen recuerdo en el corazón. Gracias a ti, que te estás tomando el trabajo de leer estas páginas. Y gracias a Dios por habernos puesto en contacto.
Contents
Abstract ix
Resumen xi
Acknowledgements xiii
List of Figures xix
List of Tables xxi
Abbreviations xxiii
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Understanding Quality of Experience 7
2.1 Quality of Experience and its relatives . . . . . . . . . . . . . . . . . . . . 7
2.2 A word about multimedia services . . . . . . . . . . . . . . . . . . . . . . 8
2.2.1 Players . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.2 Coding standards and transport protocols . . . . . . . . . . . . . . 11
2.2.3 Artifacts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Who is who in the QoE metrics . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.1 Subjective quality assessment . . . . . . . . . . . . . . . . . . . . . 18
2.3.2 Full-Reference quality metrics . . . . . . . . . . . . . . . . . . . . . 20
2.3.3 Reduced-Reference quality metrics . . . . . . . . . . . . . . . . . . 22
2.3.4 No-Reference quality metrics . . . . . . . . . . . . . . . . . . . . . 23
2.4 Other topics related to QoE in IPTV services . . . . . . . . . . . . . . . . 26
2.4.1 Media formats in IPTV deployments . . . . . . . . . . . . . . . . . 29
2.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3 Designing QoE-Aware Multimedia Delivery Services 33
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 Delivering multimedia over IP . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.1 Architecture of a multimedia service delivery platform . . . . . . . 36
3.2.2 Impairing the Quality of Experience . . . . . . . . . . . . . . . . . 41
3.3 QuEM: a qualitative approach to QoE monitoring . . . . . . . . . . . . . 44
3.3.1 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.3.2 System design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3.3 Qualitative Impairment Detectors . . . . . . . . . . . . . . . . . . 47
3.3.4 Severity Transfer Function . . . . . . . . . . . . . . . . . . . . . . . 48
3.4 A Subjective Assessment methodology to calibrate Quality Impairment Detectors . . . . . . . 48
3.4.1 Design principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.4.2 Test methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.4.3 Selection of impairments . . . . . . . . . . . . . . . . . . . . . . . . 52
3.5 QoE enablers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.5.1 Headend metadata architecture . . . . . . . . . . . . . . . . . . . . 53
3.5.2 Intelligent Packet Rewrapper . . . . . . . . . . . . . . . . . . . . . 55
3.5.3 Edge Servers for IPTV and OTT . . . . . . . . . . . . . . . . . . . 57
3.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4 Quality Impairment Detectors 59
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2 Video Packet Loss Effect Prediction (PLEP) model . . . . . . . . . . . . . 60
4.2.1 Description of the model . . . . . . . . . . . . . . . . . . . . . . . . 62
4.2.2 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.2.3 Subjective analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.3 Audio packet loss effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3.1 Objective analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3.2 Subjective analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.4 Coding quality and rate forced drops . . . . . . . . . . . . . . . . . . . . . 79
4.4.1 Analysis of feature-based RR/NR metrics as estimators of video coding quality . . . . . . . 80
4.4.2 Managing coding quality drops . . . . . . . . . . . . . . . . . . . . 84
4.5 Outages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.5.1 Detection of outages . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.5.2 Subjective impact of outages . . . . . . . . . . . . . . . . . . . . . 88
4.6 Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.6.1 Lag . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.6.2 Channel Change time . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.6.3 Latency trade-offs . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.7 Mapping to Severity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5 Applications 99
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.2 Unequal Error Protection . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.2.1 Priority Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.2.2 Experimentation and results . . . . . . . . . . . . . . . . . . . . . 105
5.2.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.3 Fine-grain segmenting for HTTP adaptive streaming . . . . . . . . . . . . 112
5.3.1 Description of the solution . . . . . . . . . . . . . . . . . . . . . . . 113
5.4 Selective Scrambling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.4.1 Problem statement and requirements . . . . . . . . . . . . . . . . . 117
5.4.2 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.5 Fast Channel Change . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.6 Application to 3D Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6 Conclusions 123
A Experimental setup 127
A.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
A.2 Subjective Assessment based on QuEM approach . . . . . . . . . . . . . . 127
A.2.1 Selection and preparation of content . . . . . . . . . . . . . . . . . 127
A.2.2 Selection of impairments . . . . . . . . . . . . . . . . . . . . . . . . 128
A.2.3 Test sessions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
A.3 Subjective quality assessment of H.264 video encoders . . . . . . . . . . . 134
A.4 Test sequences from IPTV deployments . . . . . . . . . . . . . . . . . . . 135
Bibliography 137
List of Figures
2.1 Layer and domain model for multimedia services . . . . . . . . . . . . . . 10
2.2 Protocol stack for multimedia services over IP . . . . . . . . . . . . . . . . 13
2.3 Models for objective quality assessment: FR/RR/NR . . . . . . . . . . . . 17
2.4 Hierarchical GOP structure . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1 Network architecture for IPTV and OTT services . . . . . . . . . . . . . . 37
3.2 Delivery chain of a multimedia service . . . . . . . . . . . . . . . . . . . . 45
3.3 QuEM architecture design . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4 Test sequences in ACR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.5 Test sequences in our proposed method . . . . . . . . . . . . . . . . . . . 51
3.6 Questionnaire for subjective assessment tests . . . . . . . . . . . . . . . . 51
3.7 Structure of the content streams in the subjective assessment test session 53
3.8 Schematic representation of a modular headend . . . . . . . . . . . . . . . 54
3.9 RTP header and extension introduced by the rewrapper processing . . . . 56
4.1 Video sequence used for qualitative analysis . . . . . . . . . . . . . . . . . 67
4.2 MSE and PLEP for all sequences under study, varying the loss position . 69
4.3 Detail of MSE and PLEP for all sequences under study . . . . . . . . . . 69
4.4 MSE vs PLEP (log scale) and linear fit . . . . . . . . . . . . . . . . . . . . 70
4.5 % of different macroblocks vs PLEP and linear fit . . . . . . . . . . . . . 70
4.6 % of different macroblocks and PLEP for all sequences under study, varying the loss position . . . . . . . 71
4.7 Results of the subjective assessment for Video Loss impairments . . . . . 73
4.8 Detailed results for each of the individual segments for Video Loss . . . . 74
4.9 Waveform of a lossy audio file . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.10 Effect of audio losses: measured vs. expected . . . . . . . . . . . . . . . . 76
4.11 Short-length audio losses . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.12 Results of the subjective assessment for Audio Loss impairments . . . . . 78
4.13 Detailed results for each of the individual segments for Audio Loss . . . . 79
4.14 Results of TI and Contrast NR metrics . . . . . . . . . . . . . . . . . . . . 83
4.15 Results of the subjective assessment for Rate Drop impairments . . . . . . 86
4.16 Detailed results for each of the individual segments for Rate Drop . . . . 86
4.17 Results of the subjective assessment for Outage impairments . . . . . . . 89
4.18 Detailed results for each of the individual segments for Outage . . . . . . 89
4.19 Simplified transmission chain for real-time video . . . . . . . . . . . . . . 90
4.20 Decoding delay for video and audio components of an MPEG-2 Transport Stream . . . . . . . 93
4.21 Results for all the QuIDs mentioned in the chapter . . . . . . . . . . . . . 96
5.1 Example of the packet priority model . . . . . . . . . . . . . . . . . . . . . 103
5.2 Implementation of the prioritization model . . . . . . . . . . . . . . . . . 104
5.3 Effect of the window size in packet prioritization results . . . . . . . . . . 107
5.4 Values of MSE comparing random vs. priority-based packet loss . . . . . 107
5.5 Effect of varying the loss burst size . . . . . . . . . . . . . . . . . . . . . . 108
5.6 Contribution of each term to the prioritization equation . . . . . . . . . . 109
5.7 Effects of a limited bit budget to encode the priority . . . . . . . . . . . . 110
5.8 Priority-based HTTP Adaptive Streaming segment structure . . . . . . . 115
A.1 Structure of the content streams in the subjective assessment test session 132
A.2 Summary of the subjective quality assessment test results . . . . . . . . . 133
A.3 Subjective MOS for a football sequence . . . . . . . . . . . . . . . . . . . 135
List of Tables
2.1 ACR and DCR evaluation scales . . . . . . . . . . . . . . . . . . . . . . . 19
3.1 Priority values used in the RTP header extension . . . . . . . . . . . . . . 56
4.1 Coefficient of determination (R2) of MSE vs PLEP fit for several video sequences . . . . . . . 71
4.2 PLEP impairments analyzed in the subjective assessment tests . . . . . . 72
4.3 Audio losses analyzed in the subjective assessment tests. . . . . . . . . . . 78
4.4 Comparison of NR/RR results with subjective tests . . . . . . . . . . . . 82
4.5 Quality drops analyzed in the subjective assessment tests. . . . . . . . . . 85
4.6 Outage events analyzed in the subjective assessment tests . . . . . . . . . 88
4.7 Example Channel Change time ranges and their mapping to QoE . . . . . 94
5.1 Priority value for each slice type . . . . . . . . . . . . . . . . . . . . . . . 102
5.2 Values of the Aggregated Gain Ratio . . . . . . . . . . . . . . . . . . . . . 106
5.3 Bit budget assignation to encode priority . . . . . . . . . . . . . . . . . . 111
5.4 Minimum scrambling rate required to completely lose the video signal . . 119
A.1 Video test sequences: bitrate and resolution . . . . . . . . . . . . . . . . . 128
A.2 Bitrate drops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
A.3 Frame rate drops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
A.4 Audio losses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
A.5 Macroblocking errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
A.6 Video freezing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
A.7 Impairment sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
A.8 Example of a sequence of impairments . . . . . . . . . . . . . . . . . . . . 132
A.9 Test sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
Abbreviations
3G Third generation of mobile communication technology
ACR Absolute Category Rating
AL-FEC Application Layer Forward Error Correction
ARQ Automatic Repeat reQuest
AVC Advanced Video Coding (also H.264 or MPEG-4 part 10)
CA Conditional Access
CABAC Context-Adaptive Binary Arithmetic Coding
CBR Constant Bit Rate
CDN Content Delivery Network
CoD Content on Demand
DCR Degradation Category Rating
DRM Digital Rights Management
DSL Digital Subscriber Line
DTS Decoding Time Stamp
DVB Digital Video Broadcasting
FCC Fast Channel Change
FEC Forward Error Correction
FR Full Reference
GOP Group Of Pictures
GPON Gigabit-capable Passive Optical Network
HAS HTTP Adaptive Streaming
HDS HTTP Dynamic Streaming
HLS HTTP Live Streaming
HNED Home Network End Device
HTTP Hypertext Transfer Protocol
IDR Instantaneous Decoding Refresh
IP Internet Protocol
IPTV Television over Internet Protocol
ITU International Telecommunication Union
LMB Live Media Broadcast
LTE Long Term Evolution
MDI Media Delivery Index
MOS Mean Opinion Score
MPEG Moving Picture Experts Group
MSE Mean Square Error
MVC Multi-view Video Coding
NAL Network Abstraction Layer
NR No Reference
OTT Over The Top multimedia delivery services
PCR Program Clock Reference
PLEP Packet Loss Effect Prediction metric
PLP Packet Loss Pattern
PLR Packet Loss Rate
PSNR Peak Signal to Noise Ratio
PTS Presentation Time Stamp
QoE Quality of Experience
QoS Quality of Service
QuEM Qualitative Experience Monitoring
QuID Quality Impairment Detector
RAP Random Access Point
RET RETransmission (synonym of ARQ)
RGW Residential Gateway
RR Reduced Reference
RTP Real-Time Transport Protocol
SS Smooth Streaming
STF Severity Transfer Function
TCP Transmission Control Protocol
UDP User Datagram Protocol
VBR Variable Bit Rate
VQEG Video Quality Experts Group
To the loving memory of Juan
To Teresa
Chapter 1
Introduction
1.1 Motivation
There is little doubt about the social relevance of audiovisual delivery services since the first television broadcasts. During the second half of the 20th century, broadcast television channels controlled the audiovisual market and were the main communication path for information, culture, and entertainment. But in the last decades, though the traditional broadcasters are still quite relevant players in the content marketplace, their offer has been complemented by a plethora of new services: IP television, video on demand, web video portals, user-generated content... The way in which content is consumed is rapidly changing, and there are two technological drivers which have made this possible: digital video and IP networks.
With the standardization of MPEG video in the 1990s, it became possible to consume
video products at home with high quality and at an affordable cost. The popularization
of the internet, at about the same time, brought the possibility to easily interconnect
any two points in the world. The combination of both events made it possible for video content to be managed, stored, and distributed homogeneously with the rest of the information. Somehow, the distribution of video to households had just become a problem of digital data communication and storage. And the main problem to solve was, consequently, finding enough bandwidth to fit the transmission requirements of video assets.
The first decade of the 21st century witnessed a quantitative change which resulted in a qualitative jump: improvements in video codec technologies and in the capacity of xDSL access networks made it possible to distribute real-time video over IP networks with a quality that could compete with that of television and DVDs. This gave birth to television over IP (IPTV), which introduced real interactivity and personalization into the audiovisual ecosystem. And in a few years' time, with subsequent generations of technological improvements, it has been possible to obtain a competitive video distribution service even over the standard best-effort internet, in what has been called over-the-top video delivery (OTT). This has significantly lowered the barriers to entry into the multimedia business. And, as this happens, new services are appearing beyond the classic television channels, covering from huge video clubs over the internet to the distribution of personalized, or even user-generated, video content.
Together with the evolution of the services comes the problem of how to provide them with enough quality for the end users. The transmission of high-quality video can be demanding for the capabilities of IP networks, especially in the access segment. Errors happen, and service providers struggle to keep them under control. The monitoring of Quality of Service (QoS) parameters, such as bit rate, packet loss rate, or delay, is not straightforward when the service is distributed over a complex IP network topology. And even when a suitable QoS monitoring system has been set up in the delivery service network, it proves insufficient. The interesting concept to monitor is not strictly the QoS, but the QoE: the Quality of Experience perceived by the final customer.
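To make the QoS parameters mentioned above concrete, a minimal sketch of one of them follows: estimating the packet loss rate from the sequence numbers of received RTP packets, in the spirit of the receiver statistics of RFC 3550. This is an illustrative helper, not a metric defined in this thesis, and the function name is an assumption.

```python
def packet_loss_rate(seqs):
    """Estimate the packet loss rate from received RTP sequence numbers.

    RTP sequence numbers are 16-bit counters, so consecutive numbers may
    wrap around 65535; the modulo handles that case.
    """
    expected = 1  # count the first received packet
    prev = None
    for s in seqs:
        if prev is not None:
            expected += (s - prev) % 65536  # gap since previous packet
        prev = s
    return 1 - len(seqs) / expected

# Five packets received where eight were sent -> 3/8 of the stream lost
plr = packet_loss_rate([100, 101, 103, 104, 107])
```

A real monitoring probe would track this per stream and per time window, but the same counter logic is the core of the computation.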
There has been an important effort in the last decade to characterize the perceived
quality of an audiovisual content, as well as to find algorithms able to model it. A first
method is using subjective quality assessment tests, where a panel of viewers evaluate
the perceived quality of the video clips under study. This can provide quite accurate
information about video quality and user preferences, but at the high cost of having a
group of users involved in the assessment.
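The output of such a subjective test is typically a Mean Opinion Score (MOS) per clip, reported with a confidence interval over the panel's ratings. A minimal sketch of that computation (the function name and the example ratings are illustrative assumptions):

```python
import math

def mos_with_ci(scores, z=1.96):
    """Return (mean opinion score, half-width of the 95% confidence interval)
    for a list of subjective ratings, e.g. on an ACR 1-5 scale."""
    n = len(scores)
    mean = sum(scores) / n
    # Sample variance of the panel's ratings
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    ci = z * math.sqrt(var / n)
    return mean, ci

# One clip rated by a hypothetical panel of eight viewers
mos, ci = mos_with_ci([4, 5, 3, 4, 4, 5, 3, 4])
```

The confidence interval is what makes the "high cost" of subjective testing visible: narrowing it requires recruiting more viewers.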
The complementary approach is developing objective quality metrics: algorithms which try to emulate the responses of those viewers by computer analysis of the video sequences. It has been a very active field of research, especially during the last decade. Dozens of algorithms have been developed, from simple measures of mean square error between images, up to complex metrics which include information about Human Visual System (HVS) perception and about the visual structure of the impairments introduced in the video by the coding and transmission chain.
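The simplest of those objective measures can be sketched directly: the mean square error (MSE) between a reference and a degraded frame, and the PSNR derived from it for 8-bit samples. Frames are represented here as flat lists of pixel values purely for illustration; real implementations operate on full image arrays.

```python
import math

def mse(ref, deg):
    """Mean square error between two equal-length pixel sequences."""
    assert len(ref) == len(deg)
    return sum((r - d) ** 2 for r, d in zip(ref, deg)) / len(ref)

def psnr(ref, deg, peak=255):
    """Peak Signal to Noise Ratio in dB for samples with the given peak value."""
    e = mse(ref, deg)
    return float("inf") if e == 0 else 10 * math.log10(peak ** 2 / e)

ref = [10, 20, 30, 40]   # reference pixels
deg = [12, 20, 28, 40]   # degraded pixels
error_db = psnr(ref, deg)
```

Full-reference metrics like these need the original sequence at the measurement point, which is precisely what makes them hard to deploy in a distribution network.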
However, few of those methods have had a relevant impact on the market. There are commercially available quality probes which implement this kind of algorithm, but they are typically used just to measure the quality of the video compression process, and not always in real time. For the monitoring of the quality in the distribution and access network, only network-based measures are used: packet losses, router failures... Moreover, in recent years, the manufacturers of measurement equipment seem to have reduced their efforts to introduce these complex metrics in their equipment.
There are good reasons for that. Video QoE metrics are complex to develop and expensive to deploy in the field. They also cover a very specialized field of interest, frequently critical in the video headend and video production departments, but much rarer in the service definition and in the network operation. In many cases the teams operating the network already have an overwhelming amount of QoS data which is hardly possible to manage, so there is little use in increasing the complexity of this information. Besides, monitoring algorithms need to be implemented in heavily-loaded routers or low-processing user terminals, and thus need to be extremely lightweight in processing power, which may disqualify a large number of the metrics available in the literature. Finally, some metrics are even impossible to apply due to the unavailability of the information at the monitoring point, as is the case, for instance, when parts of the video stream are encrypted by digital rights management (DRM) or conditional access (CA) systems.
In summary, service providers are still mainly using QoS metrics to monitor their networks, but this happens because QoS metrics are the ones applicable under the budgetary, computing, and information availability restrictions that providers have to cope with. There is still room for improvement. This thesis aims to be a step in that direction, trying to reduce the gap between QoE expertise and multimedia delivery service providers. The focus of the work is precisely analyzing how to model, monitor, and manage the Quality of Experience under the mentioned restrictions.
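The core idea behind the qualitative monitoring approach summarized in the Abstract — mapping each network-level degradation event to a qualitative description of its perceptual effect — can be sketched as follows. The event names and mapping rules below are illustrative assumptions, not the actual detector logic developed in later chapters.

```python
def classify_event(event):
    """Map a QoS degradation event (a dict with hypothetical fields) to a
    qualitative QoE impairment label, in the spirit of QuEM."""
    kind = event["kind"]
    if kind == "packet_loss":
        # Losses manifest differently in the audio and video streams
        if event.get("stream") == "audio":
            return "audio mute"
        return "video macroblocking"
    if kind == "bandwidth_drop":
        return "video quality drop"
    if kind == "stream_interruption":
        return "service outage"
    return "unknown impairment"

label = classify_event({"kind": "packet_loss", "stream": "video"})
```

The point of the sketch is the shape of the output: a handful of operator-meaningful labels rather than a numeric quality score, which is what makes the scheme cheap enough to deploy at scale.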
The research of the thesis has been carried out over the last 8 years in the framework of the Grupo de Tratamiento de Imágenes research group at Universidad Politécnica de Madrid, in parallel to a professional career in the multimedia competence center of Alcatel-Lucent in Madrid. In this time, services, products, research areas, and standardization efforts have evolved significantly. During the first years of the research, the line that we are proposing in this thesis was almost nonexistent in the most relevant journals, save for a couple of remarkable exceptions. In recent years, however, there has been an increasing interest in the research and standardization of monitoring strategies which are easier to apply in real operation environments.
1.2 Overview
The aim of this thesis is to provide an architecture, models, and results which make it
possible for multimedia service providers to control the Quality of Experience offered
by their service in a way that is relevant to their interests, practical, and better than
QoS-only monitoring schemes. It intends to answer the most frequent questions that
a service provider can raise about the QoE it is offering: which elements determine
the quality of the multimedia stream, which are the most relevant impairments in the
perceived quality, what causes them, and how they can be monitored, prevented, and
minimized. The thesis proposes a comprehensive strategy to address this problem as a
whole, as well as detailed solutions for most of its elements.
Part of the input used to create the approach presented in this thesis comes from
the day-to-day experience of assessing IPTV service providers, designing solutions for
them, and developing products for the content delivery market. All the assumptions
made in the development of the thesis will be supported either by the work itself or by
previous works published in the scientific literature. However, broader decisions, such
as the relevance of the problem under study or the general approach to it, are influenced
by the experience of listening to customers, capturing their requirements, and un-
derstanding the advantages and disadvantages of different measurement schemes from
a service provider's point of view. This fact has no effect on the scientific quality of this
work, but it may help to understand its underlying motivation.
As a consequence, the work is probably biased towards this application-oriented ap-
proach in two different ways. On the one hand, there is a stronger focus on ideas
and concepts than on the training of mathematical models or the extensive analysis of
experimental results. As it is virtually impossible to simulate the working conditions
of every possible service provider in the world, the research has aimed at building
models which depend as little as possible on the context where they are applied,
or which can be easily adapted to any specific deployment. In short: clean and generic
models have been preferred to trained and optimized ones. On the other hand, there
has been an explicit effort to ensure that any architecture or algorithm proposed in this
thesis can be directly applied to real multimedia delivery services. In fact, some of
them have already been included in products which are currently deployed in the field.
The study starts by analyzing several aspects of the state of the art (Chapter 2). It
defines what a multimedia delivery service is, which technologies it involves, and which
are the most relevant threats to its quality. Although the market scope of multimedia
services is quite wide, the underlying technological problem is much more restricted.
The existing techniques to model, analyze, and monitor multimedia quality are covered,
with special focus on their applicability to content delivery services, and including the
published studies which support or formalize the knowledge obtained through work
experience.
Chapter 3 contains guidelines to design a multimedia delivery service which takes the
Quality of Experience into account. It describes a reference architecture model for the ser-
vice with some QoE-specific elements. It also proposes a specific design for a monitoring
system, which explicitly includes the most relevant requirements that any commercially
deployable system should fulfill. The design is complemented with a methodology of
subjective assessment tests that can be used to select, validate and calibrate its quality
monitoring metrics.
Chapter 4 dives into the quality metrics themselves. It presents a novel approach to
predict the effect of packet losses on video quality, as well as some complementary
metrics for audio losses, coding quality drops, and outages. The effect of latency on
quality is analyzed as well. All the metrics include the results of their respective
subjective assessment tests.
Chapter 5 shows some applications which derive from the previous work and go beyond
the pure monitoring of quality. Knowledge of the effect of packet losses can be used
as input to a packet prioritization model, applicable to error protection in IPTV channels
or to improving the error resiliency of HTTP Adaptive Streaming schemes. Other proposed
applications are a method to increase the effectiveness of selective scrambling and a
system to reduce zapping time in IPTV and hybrid environments.
Finally, Chapter 6 presents the conclusions of the thesis, also summarizing which parts
of it contain work that has been published in national and international scientific
journals and conferences.
There is also an appendix with some ancillary work: Appendix A describes the details
of some subjective and objective quality assessment tests used for several results
in Chapters 3 and 4.
Chapter 2
Understanding Quality of Experience
2.1 Quality of Experience and its relatives
Quality of Experience is defined as the overall acceptability of an application or service,
as perceived subjectively by the end-user. It includes the complete end-to-end system
effects (client, terminal, network, services infrastructure, etc.) and may be influenced
by user expectations and context [43].
Some identifiable factors which impact the QoE are the following [120, 121]:
• The individual interest of the viewer in the content.
• The audiovisual quality of the content.
• The viewing conditions (screen resolution and type...).
• The interaction with the service or display device (e.g. zap time, remote control,
EPG...).
• The individual experience and expectations of the user (previous experiences...).
The concept of Quality of Experience is therefore quite wide, including aspects from the
subjective preferences of each user to the objective technical conditions under which the
service is provided. Roughly speaking, there are elements related to the content itself
(the movie, TV show...) and others related to the service (how the content is delivered
and presented to the end user). Most analyses of the Quality of Experience are
restricted to the service-related factors, which can be effectively monitored and managed
from an engineering point of view: media compression and synchronization, network
transmission performance, channel zapping time. . . [43]
One step down the abstraction scale we find the audiovisual quality or multimedia
quality (MMQ), which refers to the quality of the video and audio signals (either
separately or together). Within the framework of multimedia services, the multimedia
quality is by far the most relevant element of the QoE, to the point that both terms
are frequently used interchangeably. Likewise, the analysis of MMQ typically focuses
on video quality, which is the most critical component in most multimedia services.
An additional concept is the multimedia Quality of Service (M-QoS, or just QoS). By
QoS we understand the complete and uninterrupted delivery of the multimedia stream
through the network, from one end of the communication to the other. It is the
quality offered by the transmission chain (from the output of the multiplexer to the input
of the demultiplexer) [32], without taking into account the contribution of the encoder,
decoder, capture, and display devices to the final quality.
These three quality concepts are tightly related. The QoS describes the capabilities
of the communication network (bandwidth, delay...) and its possible degradations
(bit errors, packet losses, jitter...). It therefore limits the achievable MMQ
in two senses: on the one hand, limitations in bandwidth result in limitations
in the coding quality of the sequence; on the other, QoS degradations can cause impair-
ments in the transmitted multimedia signal and, hence, in its MMQ. The final QoE
depends on the final MMQ, as well as on other factors influenced by
the QoS: interactivity, end-to-end latency, zap time...
2.2 A word about multimedia services
The concept of multimedia service used in this work is, basically, the pos-
sibility of watching audiovisual content at home, usually assuming as well that the
content is delivered to the household at the time when it is viewed.
Multimedia services, thus, have been universally present in homes for the last half
century, first in the form of television broadcasting and, later, with the possibility of
watching recorded content on video recording systems. In recent years, however, this
scenario has evolved rapidly, with the arrival of at least three significant tech-
nology changes, which have led to the three most relevant families of existing multimedia
delivery services.
The first one was the switch from analog to digital video, which increased the number
of television channels available to households, fostering the growth of channels
for specific target audiences (documentaries, sports, children's channels...) and having
a strong impact on business models in the television marketplace. As a side (but relevant)
effect, the experience of watching television changed, with increased received quality
(including high-definition video), the appearance of new video defects, the rise of zap-
ping times, the presence of Electronic Program Guides... This technology supports the
existing television broadcast services: terrestrial, cable, and satellite.
As a second step, some of those broadcast television services started their evolution
towards all-IP delivery networks [1]. IP delivery networks offer easy integration with
triple-play offers (voice, internet access, and television), as well as inherent interactivity,
which makes it possible to deliver personalized services and, especially, Video on Demand
(VoD): remote access to stored video content (i.e. the experience of "renting a film from a
video store" integrated with the television service). A response to this evolution is the
standardization of IPTV architectures, such as DVB-IPTV [19], focusing on the
delivery of continuous high-quality video services and covering the natural evolution of
television services (High Definition, stereoscopic video...). And, in parallel, the
deployment of IPTV platforms all over the world.
The third technology change has been the arrival in the marketplace of the latest gen-
eration of smartphones and tablets, which have given rise to new video delivery services
based on the streaming of multimedia content over unmanaged networks [23]. These
services, which do not require a specialized end-to-end network, are experiencing
very fast growth. As an example, the website of the BBC served 106
million requests for online video during the London 2012 Olympic Games [73].
The result is that, in the near future, multimedia services will have to handle a complex
scenario ranging from 3.5-inch smartphone screens to 100-inch wall-mounted plasma
screens, covering services coming both from the "television" and the "internet"
worlds [72][107]. Consequently, content sources will span a wide range of formats
and qualities, from user-generated content in social TV to high-budget 3D
movies produced by Hollywood studios.
Nevertheless, the core of all multimedia delivery services is the same
—television broadcasting, IPTV, or internet video—: taking a multimedia content and
delivering it to an end user, providing the best possible Quality of Experience within
the limitations imposed by the available network Quality of Service. In the rest of this
section we will explore the common properties of all those multimedia services: the
players or entities which take part in the service chain, the standards and protocols used
to compress and transport the media stream, and the most relevant quality degradations
or limitations present in those services. The focus will be on multimedia
services over IP networks, but most of the concepts are applicable to other transmission
media as well.
2.2.1 Players
The first step in the analysis of multimedia services is characterizing the players and
their roles. We will use the model proposed by the DVB-IPTV standard [19], depicted
in Figure 2.1. This model is applicable to most service scenarios and has the
advantage of showing the relationships between the different players (or "domains") and
how they map onto the OSI layer model.
Figure 2.1: Layer and domain model for multimedia services
The Content Provider is "the entity that owns or is licensed to sell content or content
assets". The Content Provider may have a direct relationship with the end user for the
management of usage rights to the content, or it can even be the entity which has the
commercial agreement with the end user (the end user then being a direct customer
of the Content Provider). However, regarding the content flow, the Content Provider
delivers content assets only to the Service Provider. The content offered by the Content
Provider is already "finished", in the sense that it is a content asset which is deliverable
to an end user (a TV channel, a live event, a movie...). All the complexity of
content generation is outside this model and out of the scope of our work.
The Service Provider is "the entity providing a service to the end-user". This is the
entity which has a direct logical connection with the end user for the purpose of delivering
video content. The Service Provider is also responsible for controlling the Quality of
Experience offered to the end user, and is therefore the subject of the quality monitoring
services covered in our work.
The Delivery Network is "the entity connecting clients and service providers". According
to DVB-IPTV, "the delivery network is transparent to the IP traffic, although there
may be timing and packet loss issues relevant for A/V content streamed on IP". In
practice, however, the Service Provider will need to impose specific requirements on the
delivery network, which leads to two different delivery scenarios:
• "Managed IPTV" (or simply "IPTV"). The Service Provider controls (and typi-
cally owns) the end-to-end IP distribution to the Home domain. The most relevant
implication here is that it is possible to distribute UDP traffic over IP multicast
with sufficient Quality of Service. This scenario has been the most important one
(sometimes the only one) in recent years, and therefore it has also been the
main focus of our research and of this work.
• "Over The Top" content (or simply "OTT"). Video delivery is done "over the top"
of the internet, i.e., using a delivery network which is neither owned nor controlled
by the Service Provider. As such, some of the IPTV-related delivery network
features (multicast support, controlled QoS) are not available. In this context,
however, Service Providers normally make use of (or even own) Content Delivery
Networks (CDNs): distributed networks which deliver the video content efficiently
to points of presence closer to the end users, thus shortening the part of the
delivery chain which really goes "over the top".
Home is "the domain where the A/V services are consumed". The Home domain is
the property of the content consumer (the end customer or subscriber), and includes the
User Terminal —or Home Network End Device (HNED), in DVB-IPTV terminology.
Because IPTV is traditionally delivered to a TV screen, the Home domain
is normally depicted as the end user's own home. However, the User Terminal may
also be a mobile device with a direct connection to the Delivery Network. The Home
domain may, but does not need to, include a home local area network.
2.2.2 Coding standards and transport protocols
The multimedia codec and transport technologies used in IPTV and OTT services derive
from the ones used in digital television. There are several families of digital television
standards around the world: Digital Video Broadcasting (DVB), adopted in Europe,
Africa, Australia, and parts of Asia; Advanced Television System Committee (ATSC),
used mainly in North America; Integrated Services Digital Broadcasting (ISDB), used
in Japan and most of Central and South America; and Digital Terrestrial Multimedia
Broadcast (DTMB), adopted in China. All of them are quite similar at their core:
transport of audiovisual services, multiplexed in an MPEG-2 Transport Stream, over dif-
ferent physical media and using different modulation techniques. When needed, we will
take DVB as a reference, considering that the differences with other standards are
almost insignificant for the purposes of our work.
DVB (and the others) standardize the transport of audiovisual services multiplexed in
MPEG-2 Transport Stream [36]. Video elementary streams are coded in MPEG-2 video
[37] or MPEG-4 AVC/H.264 [38], while audio is coded in MPEG-1, MPEG-2, Dolby
AC3, or MPEG-4 AAC [18]. Both video codecs use similar compression concepts:
motion prediction (to exploit temporal redundancy), block transforms (to
exploit local spatial redundancy), quantization of the transform coefficients, entropy
coding of the resulting data, and packaging of the data into a bitstream which adds
some metadata headers (such as the delimitation and characterization of the different
video frames). Audio codecs are likewise quite similar to one another in their basic
concepts (encoding of different frequency sub-bands of a block of audio samples). As a
result, the key elements which affect multimedia quality will be very similar across all
the different digital television scenarios, regardless of the underlying transport.
Both IPTV and OTT platforms may offer several different services around the distribu-
tion of multimedia content. However, we will focus here on the pure delivery of content
assets to the Home domain. In both cases, there are two basic service types:
• Live content (Live Media Broadcast, or LMB, in DVB-IPTV terminology). The
most typical examples are live broadcast TV channels, which are still the main
contributor in IPTV deployments and one of the most popular audiovisual services
in any deployment. Its most important property is the real-time constraint: the
end-to-end latency must remain constant for the whole playout of the stream
to avoid discontinuities in the received multimedia session. Live content must be
ingested, processed, and delivered by the Service Provider in real time.
• On-demand content (Content on Demand, or CoD, in DVB-IPTV terminology).
This content is pre-loaded by the Content Provider into the Service Provider do-
main. It may take some time for the Service Provider to process it before it is
ready for its delivery to the end user.
Figure 2.2: Protocol stack for multimedia services over IP
These audiovisual services are delivered over IP. Figure 2.2 shows the protocol stack
used for this purpose, with a clear differentiation between the IPTV and OTT
protocol families:
• MPEG-2 TS / RTP / UDP / IP. This is the standard scenario for an IPTV
deployment over a managed network, as considered in [19], [76], and [55]. It follows
a push paradigm: the server controls the bit rate of the delivery.
• HTTP Adaptive Streaming (HAS) / TCP / IP. This is the upcoming scenario for
OTT environments. It follows a pull paradigm: the client decides which video
segments it downloads and when.
HTTP Adaptive Streaming (HAS) is a solution to deliver multimedia content to
users in which the bitrate is adapted to the network. Although the distribution of video
over the internet can be done in dozens of different ways, adaptive streaming
is becoming the most popular one, especially in the context of OTT services offered by
IPTV service providers [75]. It is also natively supported by most smartphones, tablets,
and set-top boxes.
HAS works as follows: the content is encoded, at each target bitrate, as a concatenation
of small segments, each containing a few seconds of the stream, with the property that
at segment boundaries the terminal can switch from one variant (at a particular
bitrate) to another (at a different bitrate) without any visible effect on the screen or
the audio. Each of these segments is accessible as an independent asset with its own
URL, so once it is present in an HTTP server it can be retrieved by a standard web
client using pure HTTP mechanisms.
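The pull paradigm can be sketched as a simple rate-selection loop in the client: after each downloaded segment, the measured throughput determines the variant requested next. The variant bitrates and the 80% safety margin below are illustrative assumptions, not values from any HAS specification:

```python
# Minimal sketch of pull-based HAS client logic: after each downloaded
# segment the client measures its throughput and selects, for the next
# segment, the highest variant whose bitrate fits. The variant ladder
# and the safety margin are illustrative assumptions.

VARIANTS_BPS = [400_000, 1_000_000, 2_500_000, 5_000_000]  # advertised bitrates
SAFETY = 0.8  # only count on 80% of the measured throughput

def choose_variant(measured_bps):
    """Highest variant not exceeding the discounted throughput."""
    affordable = [b for b in VARIANTS_BPS if b <= measured_bps * SAFETY]
    return max(affordable) if affordable else min(VARIANTS_BPS)

# Simulated throughput trace (bits/s) for successive segments:
for throughput in (6_000_000, 2_000_000, 900_000, 300_000):
    print(throughput, "->", choose_variant(throughput))
```

Real players add buffer-level awareness and switch-damping heuristics on top of this basic rule, but the server-side simplicity is the same: it only serves plain HTTP requests.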
There are several different HAS implementations. The most widely deployed in the
market come from the initiatives of individual companies: Apple HTTP Live Stream-
ing (HLS), Microsoft Smooth Streaming (SS), and Adobe HTTP Dynamic Streaming
(HDS). All of them are based on the same principles and use similar codecs. Their
main differences are in the signaling of the segments and the multiplexing layer: HLS uses
MPEG-2 Transport Stream, while SS and HDS use extensions of the ISO base media file
format. MPEG has also recently standardized a proposal for HTTP adaptive streaming
called MPEG-DASH (Dynamic Adaptive Streaming over HTTP) [39]. MPEG-DASH
supports both MPEG-2 TS and ISO file format profiles.
2.2.3 Artifacts
The best possible media quality for a multimedia service is the quality of the audio-
visual content just after the production process has finished. This reference "production
quality" shows the product exactly as its creators intended it to be. Of course, there
might be defects in the capture, recording, and production process, but, in a professional
product, it is reasonable to assume that they will be very rare and will have a small
impact on the perceived quality.
Producers must then deliver their products to the service provider. This is usually done
by encoding the content with very lightweight compression, to avoid a perceptible loss of
quality, resulting in a product with "contribution quality". It can be assumed that
a product with contribution quality has the highest possible multimedia quality, with
no perceptible visual or sound artifact or impairment.
However, due to the impairments produced in the delivery chain, the final multimedia
quality received by the end users may be far from the contribution quality. We will con-
sider three main types of impairments, according to the place where they are generated:
compression artifacts, transmission errors, and display errors [113]. Other terminologies
and classifications are also possible [2, 7].
Compression artifacts are defects introduced when compressing the video from contri-
bution to distribution quality, which must fit into the bitrate budget that the service
provider has reserved for that specific media stream. In this compression process, several
impairments can be introduced [105]:
• The blocking effect appears as a pattern of square-shaped blocks in the compressed
image. It is caused by the independent quantization of adjacent groups of pixels,
processed in 4x4, 8x8, or 16x16 blocks, which leads to discontinuities at the block
boundaries. This effect is easy to notice due to the regularity of the generated
pattern, and it is typically the most salient defect in MPEG-2 video. In AVC video
it is partially mitigated by the use of smaller blocks and the effect of the deblocking
filter.
• Blurring is the loss of spatial detail and edge sharpness in the image. It is generated
by strong quantization of the high-frequency components, and it is emphasized by
the application of deblocking filters, thus being typically the most relevant artifact
in AVC video.
• Flickering is a defect introduced in highly textured regions which are compressed
with different quantization factors over time (normally with higher quality in key
frames than in predicted frames). As a result, the coding quality of those regions
fluctuates periodically over time, and so does the perceived level of detail.
• Ringing (also known as the Gibbs effect) produces ring-like periodic intensity varia-
tions around image edges in areas which should not have a perceptible texture. It
is caused by strong quantization of high-frequency coefficients in regions with sharp
edges.
• Chromatic dispersion is produced by the suppression of high-frequency components
in the chrominance signal, resulting in cross-talk and loss of color definition in areas
with strong color variation.
• Motion jerkiness is caused by the use of a lower frame rate than the one needed
to display the image motion smoothly.
Transmission errors are produced by the loss, corruption, or excessive delay of packets
in the transmission chain, which results in stream discontinuities or buffer underrun
events in the receiver. They typically manifest as stronger versions of the compression
defects:
• Macroblocking: a highly visible blocking effect produced by the loss of video in-
formation, which forces the receiver to build the picture using wrong references
(normally repeating a correctly received frame instead of the lost one). The re-
sult is a strong blocking pattern, sometimes also causing other perceptual artifacts
(parts of the image, typically blocks or horizontal stripes, with a different color or
texture than they should have).
• Freezing or continual jerky motion, caused by the unrecoverable loss of video
frames.
• Mute or audio glitches, caused by the loss of packets carrying audio information.
• Outages, or temporary loss of service due to network problems.
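How damaging a loss is depends strongly on where it falls in the prediction structure, since a corrupted reference frame propagates its error to every frame predicted from it until the next intra refresh. A rough sketch, assuming a simple closed GOP of one I frame followed by predicted frames:

```python
# Rough sketch of temporal error propagation: with a closed GOP of N frames
# (an I frame followed by P frames, each predicting from the previous one),
# losing frame k corrupts frames k..N-1 of that GOP. The GOP size is an
# illustrative assumption, not a value from any standard.

GOP_SIZE = 25  # frames per GOP (e.g. one I frame per second at 25 fps)

def frames_affected(lost_frame_index):
    """Number of displayed frames degraded by losing frame k of a GOP."""
    return GOP_SIZE - lost_frame_index

# A loss early in the GOP is far more visible than one just before refresh:
for k in (0, 12, 24):
    print(f"loss at frame {k}: {frames_affected(k)} frames degraded")
```

This asymmetry, where identical network-level losses produce very different perceived impairments, is the intuition behind the packet-loss metrics of Chapter 4.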
Finally, there is a heterogeneous set of errors that can be caused in the user terminal and
display, such as an incorrect aspect ratio [113] or a malfunction in the terminal
itself.
Transmission errors are normally the most damaging for the perceived QoE. In a study
done on a real IPTV deployment [7], it was shown that about 82% of the multimedia
quality impairments reported by customers were directly related to them: "Breaking
Up Into Blocks" (macroblocking, 29%), "Screen Freezes" (20%), "Choppy Screen Tran-
sitions" (or jerky motion, 18%), and "Distorted Audio" (mute or glitches, 15%). As the
customers were asked to report perceived errors, it is possible that a fraction of them
were caused in the encoding process. However, the descriptions of the errors as given by
customers suggest that most of them refer to the "stronger" (and more visible) versions
of the artifacts, i.e. the ones resulting from transmission errors. The remaining 18%
of the errors is divided into "Edges Shimmer" (11%), visible artifacts around edges in
the image (caused by coding artifacts, as edges are one of the places where they are
most visible), and "Error Stoppage" (7%), or problems with the end terminal (which
"has to be reset").
2.3 Who is who in the QoE metrics
In contrast with the relatively fast standardization of audio [41] and speech [51, 52]
quality, the efforts to standardize video quality metrics have produced slower results
[15]. The Video Quality Experts Group (VQEG) has been the most relevant contributor
to this standardization process [111, 112], producing an extensive evaluation of quality
metrics which has led to several standardization initiatives [45, 46, 47, 48, 49, 50].
The study of multimedia quality, and more specifically of video quality, has been of
great interest for the last 15 years, and it is therefore relatively easy to find
good surveys, reviews, and classifications of the different existing metrics and approaches
[15, 33, 78, 92, 121]. This section presents the most commonly used classification of video
quality assessment strategies, as well as some example methods which are relevant to our
work. More detailed surveys can be found in the given references.
The first division among quality assessment approaches is between subjective and objec-
tive methods. Subjective quality assessment implies having a panel of users watch the
target content and evaluate its quality by giving a score to each fragment of content
under study. The result is normally presented in terms of the Mean Opinion Score (MOS),
which is the average of the results from the different users, possibly after some statis-
tical processing such as the removal of outliers. Objective quality assessment is done
automatically by computing processes which analyze the multimedia stream to produce
quality values. In most cases, the aim of objective metrics is to provide MOS values
which correlate well with those provided by subjective assessments, which are used as
the benchmark.
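The MOS computation with outlier removal can be sketched as follows; the 2-standard-deviation trimming rule is an illustrative choice, not the subject-screening procedure defined in the standards:

```python
# MOS computation with a simple outlier-removal pass. The trimming rule
# (discard scores more than 2 standard deviations from the mean) is an
# illustrative choice; ITU-R BT.500 defines its own, more elaborate
# subject-screening procedure.
from statistics import mean, stdev

def mos(scores, z_limit=2.0):
    """Mean opinion score after discarding extreme ratings."""
    if len(scores) < 2:
        return mean(scores)
    m, s = mean(scores), stdev(scores)
    kept = [x for x in scores if s == 0 or abs(x - m) <= z_limit * s]
    return mean(kept)

ratings = [4, 5, 4, 3, 4, 5, 1, 4]  # one clear outlier (the single 1)
print(f"raw mean = {mean(ratings):.2f}, screened MOS = {mos(ratings):.2f}")
```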
Figure 2.3: Models for objective quality assessment: Full-Reference method (top), Reduced-Reference method (middle), No-Reference method (bottom)
Objective quality assessment methods can be classified into three types, depending on
how much information they use from the original signal (see Figure 2.3):
• Full-Reference (FR). The impaired signal is compared with the original one to
obtain a quality value. This is the most appropriate method in cases where
it is possible to access the original and impaired signals simultaneously
(for instance, to analyze the compression defects introduced by a video encoder).
• Reduced-Reference (RR). A reduced description of each of the original and impaired
signals is generated, and the two descriptions are compared to produce a quality
value. This model is useful when the original signal is not available at the
measurement point (for instance, when the two signals are at different points in
the network), but it is possible to receive ancillary data through a lower-bitrate
channel.
• No-Reference (NR). The quality measure is generated only by analyzing the im-
paired signal, without any information about the original. This is the most
generic model, because it can be introduced in a non-intrusive way at any point
of the transmission chain.
A second classification criterion for objective metrics refers to the type of data they
use:
• Picture metrics, which operate in the baseband domain, analyzing the pixel values
of the original and/or decoded frames to produce their results.
• Bitstream metrics, which operate in the coded domain, analyzing the video stream
without fully decoding it or, in some cases, analyzing just the quality of service
information (losses, delays...). Bitstream metrics are usually No-Reference as
well.
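A classic example of a full-reference picture metric is the PSNR, which compares the impaired frame pixel by pixel against the original. Frames are modeled here as flat lists of 8-bit luma samples for simplicity:

```python
# PSNR as the canonical full-reference picture metric: the impaired frame
# is compared sample by sample against the original in the baseband domain.
# Frames are modeled as flat lists of 8-bit luma samples for simplicity.
import math

def psnr(original, impaired, peak=255):
    """Peak signal-to-noise ratio in dB between two equal-sized frames."""
    mse = sum((a - b) ** 2 for a, b in zip(original, impaired)) / len(original)
    if mse == 0:
        return float("inf")  # identical frames
    return 10 * math.log10(peak ** 2 / mse)

ref = [16, 32, 64, 128, 200, 235] * 100  # toy "frame" of luma samples
deg = [x + 2 for x in ref]               # uniform +2 error on every sample
print(f"PSNR = {psnr(ref, deg):.2f} dB")
```

PSNR correlates only loosely with subjective MOS, which is precisely why the more elaborate FR, RR, and NR metrics surveyed in this section exist.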
2.3.1 Subjective quality assessment
The aim of quality assessment is knowing, for a specific set of content assets and
impairments, what the opinion of an average user would be. As such, the best way
to find out is, in fact, to ask the users. Subjective quality assessment methods provide
guidelines on how to ask users about multimedia quality in the most effective way.
Several standards provide these methods of subjective assessment, mainly
ITU-R BT.500 [42], ITU-T P.910 [53], and ITU-T P.911 [54]. All of them are
quite similar in the way they propose to structure, perform, and evaluate tests. Most of
the subjective assessment tests reported in the literature are based on these standards,
the VQEG validation tests being the most relevant example [119].
In the test sessions, a number of "subjects" are asked to watch a set of audiovisual clips
and rate their quality. The total number of viewers for a test must be between 4 and
40 (they can be distributed across different viewing sessions). In general, at least
15 observers should participate in the experiment. They should not be professionally
involved in multimedia quality evaluation, and they should have normal or corrected-
to-normal visual acuity and color vision.
The location and the displays where the tests are conducted must comply with a set of
requirements regarding lighting, screen brightness and contrast, distance and angle from
viewers to screen. . . Guidelines are provided to work either with professional monitors
or with domestic TV sets [42].
Sessions should not last more than half an hour. At the beginning of the session, viewers
are presented with a set of example clips where they can see the type of defects that they
are supposed to judge. The content samples to be evaluated may be preceded by about
five “dummy presentations”, whose results are not taken into account, to stabilize the
observers’ opinion. Besides, the video clips under study should be distributed randomly
throughout the session.
Table 2.1: ACR and DCR evaluation scales
    Score   ACR         DCR
      5     Excellent   Imperceptible
      4     Good        Perceptible but not annoying
      3     Fair        Slightly Annoying
      2     Poor        Annoying
      1     Bad         Very Annoying
Different evaluation strategies are used. Although there are some variations in the details
from one standard to another, they are basically the following [54]:
• Absolute Category Rating (ACR), or Single Stimulus method (SS). The test se-
quences are presented one at a time and are rated independently on a category
scale. After each presentation, the subjects are asked to evaluate the quality of
the sequence presented using an absolute scale, normally with five levels (see Ta-
ble 2.1). Nine-level and eleven-level rating scales are also suggested to increase
resolution, but they do not seem to produce significantly different results [35].
• Degradation Category Rating (DCR), or Double Stimulus Impairment Scale method
(DSIS). In this case, each presentation consists of two different video clips: the
reference content (without impairments) and the processed or impaired version
of the same content. Both videos are watched consecutively, and the subject is
asked to rate the impairment of the second stimulus in relation to the reference.
Five-level scales are also used (see Table 2.1).
• Pair Comparison method (PC). Test sequences are presented in pairs as in the case
of DCR, but now the sequences are two different processed versions of the same
original one (i.e. with two different levels or types of impairments). After each
pair is presented, the subject has to select which one is preferred in the context of
the test scenario.
• Single Stimulus Continuous Quality Evaluation (SSCQE). This method considers
long-duration sequences (3 to 30 min). While the sequence is being played, sub-
jects are asked to continuously evaluate the quality of the sequence, normally by
controlling a slider.
The proposed duration of the sequences is about 10 seconds, plus another 10-second
period (showing a grey screen) to vote on each sequence. When sequence pairs are
used (DCR and PC), both sequences within a pair should be separated by a short (about
2-second) grey screen.
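The ratings collected for each test condition are typically reduced to a Mean Opinion Score (MOS) with a 95% confidence interval. A minimal sketch (our illustration, using the usual normal approximation with the 1.96 factor, not code from any of the cited standards):

```python
import math

def mos_with_ci(ratings):
    """MOS and 95% confidence interval for one test condition."""
    n = len(ratings)
    mos = sum(ratings) / n
    # Sample variance (n - 1 denominator), normal approximation for the CI.
    var = sum((r - mos) ** 2 for r in ratings) / (n - 1)
    ci = 1.96 * math.sqrt(var / n)
    return mos, ci

# Example: ratings from 15 observers on a five-level ACR scale.
scores = [4, 5, 3, 4, 4, 5, 4, 3, 4, 4, 5, 4, 4, 3, 4]
mos, ci = mos_with_ci(scores)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")  # MOS = 4.00 +/- 0.33
```

The confidence interval shrinks with the number of observers, which is one reason the standards ask for at least 15 participants.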
2.3.2 Full-Reference quality metrics
Full Reference metrics compare the original and impaired versions of the sequence, thus
having access to more information than RR or NR metrics. For this reason, FR metrics
have been the first ones to be developed and they also are the ones which produce more
accurate results.
Video engineers have used for years simple FR objective metrics such as the Peak Signal
to Noise Ratio (PSNR) or the Mean Square Error (MSE) of the impaired video with
respect to the reference. They are computed as follows:
MSE = (1 / MN) Σ_{i=0}^{M−1} Σ_{j=0}^{N−1} [I(i, j) − K(i, j)]²    (2.1)

PSNR = 10 log10 ( (max I)² / MSE )    (2.2)
where I(i, j) and K(i, j) are the two compared images, whose size is M ×N pixels, and
max I is the maximum possible intensity value for any pixel in the image (for instance,
255 for 8-bit pixel values).
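As an illustration, eqs. (2.1) and (2.2) can be computed directly; the following sketch assumes greyscale images given as lists of rows of 8-bit pixel values:

```python
import math

def mse(img_i, img_k):
    """Mean Square Error between two images of identical M x N size."""
    m, n = len(img_i), len(img_i[0])
    total = sum((img_i[i][j] - img_k[i][j]) ** 2
                for i in range(m) for j in range(n))
    return total / (m * n)

def psnr(img_i, img_k, max_i=255):
    """Peak Signal to Noise Ratio in dB (infinite for identical images)."""
    e = mse(img_i, img_k)
    return float("inf") if e == 0 else 10 * math.log10(max_i ** 2 / e)

# Example: a 2x2 reference and an impaired copy with one pixel off by 10.
ref = [[100, 100], [100, 100]]
imp = [[100, 100], [100, 110]]
print(mse(ref, imp))   # 25.0
print(psnr(ref, imp))  # ~34.15 dB
```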
These metrics compare the pictures on a pixel-by-pixel basis, ignoring the image struc-
ture, and their capability to predict the perceived MOS is quite limited. However, they
are still used for some applications, and especially as benchmark for other FR quality
metrics: the acceptability criterion for any FR quality metric is having a correlation
with subjective MOS which is significantly better (statistically speaking) than that ob-
tained by PSNR [111].
The first attempts to improve on PSNR and MSE applied psychophysical models of the
Human Visual System (HVS) to the measurements, an approach known to produce good
results in audio quality estimation (and in the development of audio codecs) [78, 120].
A second family of FR algorithms appeared with a different approach: detecting im-
pairments related to the known processing applied to the image and the specific artifacts
it is expected to introduce. Some metrics following this “engineering approach” [121] were able to outperform
the PSNR in the second round of the VQEG tests for television signals [111]. They are
the ones included in the ITU-T Recommendation J.144 [45], the first standard for FR
video quality metrics:
• BTFR (BT Full Reference). It makes a weighted linear composition of several
individual measures, such as: percent of correctly estimated blocks, PSNR of
matching blocks, segmental PSNR (error in the matching vectors), energy of edge
differences, texture degradation and pyramidal PSNR.
• EPSNR (Edge PSNR). It measures the PSNR between both images, considering
only the regions where there are edges. The result is afterwards scaled non-linearly
to generate a MOS value.
• CPqD-IES. The image is segmented into three regions: flat, edge, and textured. The
Absolute Sobel Difference (ASD) is computed for each region: the result of applying
a Sobel filter to both images and computing the MSE between them. The result is
fed into a trained model to obtain the final MOS value.
• VQM. This metric also computes seven different parameters of the image, which
are then combined linearly with experimentally obtained weights. Measured
features are: loss of spatial information, loss of horizontal and vertical edges, gain
of horizontal and vertical edges, chroma spread, spatial information gain at edges,
errors in high-contrast areas, and extreme chrominance errors. An implementation
of VQM is publicly available on the internet [89].
Subsequent test projects of the VQEG have resulted in additional ITU-T Recommenda-
tions for slightly different scenarios. For instance, ITU-T J.341 [49] introduces VQual-
HD, another FR metric specialized for HDTV contents, which combines picture similar-
ity, spatial degradation, and temporal degradation to obtain a quality metric. ITU-T
J.247 [47] proposes metrics for multimedia environments, more focused on “internet”
frame resolutions and bit rates (lower than in digital television, as a general rule). ITU-
T J.147 [46] proposes embedding hidden data in the original signal and measuring their
degradation in the received one.
In addition to these, it is relevant to mention the Structural Similarity Index (SSIM)
[116]. SSIM considers image degradation as a perceived change in structural information,
i.e. the strong inter-dependencies between pixels, especially when they are spatially
close. The metric is computed over several windows in the
image, and its value between two windows x and y (assumed to be in the same position
of two different images) is:
SSIM(x, y) = [(2µ_x µ_y + c1)(2σ_xy + c2)] / [(µ_x² + µ_y² + c1)(σ_x² + σ_y² + c2)]    (2.3)
where µ represents the average, σ² the variance, and σ_xy the covariance of the signals,
and c1 and c2 are constants used to stabilize the division when the denominator is small.
Although the metric has some limitations [13], SSIM has become increasingly popular
in recent years, since it seems to offer better results than PSNR while being a
simple metric to implement (the source code is available on the internet as well).
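Eq. (2.3) can be sketched per window as follows; the constants c1 and c2 are derived here from the conventional K1 = 0.01, K2 = 0.03 and 8-bit dynamic range L = 255 (an assumption of this sketch, following the original SSIM proposal):

```python
def ssim_window(x, y, L=255, k1=0.01, k2=0.03):
    """SSIM between two co-located windows, given as flat pixel lists."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((v - mu_x) ** 2 for v in x) / n
    var_y = sum((v - mu_y) ** 2 for v in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Stabilizing constants for small denominators, as in eq. (2.3).
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

# Identical windows score ~1; degraded windows score lower.
win = [52, 55, 61, 59, 79, 61, 76, 61]
print(ssim_window(win, win))                    # ~1.0
print(ssim_window(win, [v + 20 for v in win]))  # < 1.0
```

A full-image SSIM averages this value over all windows (in practice with Gaussian weighting, which this sketch omits).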
In any case, most of the FR metrics (and especially the ones included in ITU-T recom-
mendations) have been specifically designed to be able to provide good MOS estimations
for relatively subtle impairments, such as those generated by video encoders. However,
when the errors are generated by packet losses or other network problems, and are
therefore perceptually more aggressive, PSNR, SSIM, and VQM all show reasonably
good correlation with MOS [40]. In such cases, it can be more useful to use the simpler
metrics (PSNR or SSIM) rather than the complex schemes proposed by the
standards.
2.3.3 Reduced-Reference quality metrics
The basic strategy used to design Reduced-Reference metrics is to extract a set of
statistical parameters that characterize the video and compare them between the original
and the impaired sequences (see [15] for a short survey). We can distinguish between two
types of features:
• Features which describe image properties: temporal and spatial information [63,
98, 117], structural similarity [106], image statistics [114]. . .
• Known impairments on the image, normally obtained by applying No-Reference
quality estimators to both pictures (original and impaired) and comparing the results
[16].
Simple RR measures can be combined to generate a more complex metric, in a similar
way as FR metrics are built from individual measures. This is the case of the RR
metrics selected by the RR-NR project of VQEG [112], which are now included in the
ITU-T Recommendations J.249 (for Standard Definition TV) [48] and J.342 (for High
Definition) [50]:
• Yonsei University metric. It is a Reduced Reference version of the EPSNR included
in ITU-T J.144 [45]. The algorithm selects some pixels in the edge region of the
original image and computes their PSNR with the same pixels in the impaired image.
Temporal, spatial, and gain registrations are performed to enhance pixel mapping.
Besides, the EPSNR of each picture is post-processed to take some defects or features
into account: EPSNR is reduced if there are strong blurring, blocking, or freezing
effects, and enhanced for high-motion or high-complexity pictures.
• NEC metric. A reduced version of the image is transmitted, containing the activity
values of 16x16 pixel blocks of the original luminance image. Activity of a block
(ACT) is computed as the average of the absolute differences between the pixel
intensities and the block's average intensity, as in eq. (2.4). The MSE of the activity
images is obtained and then post-processed (weighted) if the impaired image exceeds
thresholds on different features: psychophysical features (spatial frequency, color),
scene changes, blocking effect, or local impairments.
ACT = (1/256) Σ_{i=0}^{15} Σ_{j=0}^{15} |X_{i,j} − X̄|    (2.4)
• NTIA metric. It is a Reduced Reference version of the VQM included in ITU-
T J.144 [45], called “fast low bandwidth VQM”. It extracts color, spatial and
temporal features, which are transmitted and compared to the same features of the
processed (impaired) image. Several comparison functions combine the original and
processed features into parameters similar to those available in the FR metric,
measuring modifications in horizontal and vertical edges, in spatial information, in
color information, and in absolute temporal information. The resulting parameters are
linearly combined (with fixed weights obtained by training) to generate the final VQM
value.
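The block activity of eq. (2.4), used by the NEC metric above, can be sketched as follows (the function name is ours):

```python
def block_activity(block):
    """Activity of a 16x16 luminance block, as in eq. (2.4):
    mean absolute deviation of the pixels from the block average."""
    pixels = [p for row in block for p in row]
    mean = sum(pixels) / len(pixels)
    return sum(abs(p - mean) for p in pixels) / len(pixels)

# A flat block has zero activity; a checkerboard has maximal activity.
flat = [[128] * 16 for _ in range(16)]
checker = [[255 if (i + j) % 2 else 0 for j in range(16)] for i in range(16)]
print(block_activity(flat))     # 0.0
print(block_activity(checker))  # 127.5
```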
The Yonsei University EPSNR metric is the only one included in both the SDTV (ITU-T
J.249) and HDTV (ITU-T J.342) standards, while the other two were only included in J.249. It
is also relevant to note that, even though the models described in the recommendations
matched (and, at some points, outperformed) a Full-Reference PSNR, none of them
“reached the accuracy of the normative subjective testing” [112], i.e. they are not good
enough to replace subjective assessment tests.
2.3.4 No-Reference quality metrics
There are two basic families of No-Reference video quality metrics: pixel-based (also
baseband or picture-based) and bitstream-based. The former operate in a similar way
as the described FR and RR metrics: they analyze some features of the images and sequences (but
without any information about the original image). They typically focus on detecting
one specific impairment, normally the ones expected to be introduced in the coding
phase (see section 2.2.3). The latter analyze the bitstream of the coded video sequence,
trying to obtain a quality metric from the syntax and semantics of the coded video.
They are normally used to handle packet losses and other network impairments, but
some of the bitstream metrics are also applied for coding defects. It is also possible
to find hybrid schemes which combine both approaches. Several surveys can be found
which describe all these metrics in detail (for instance, [33] or [15]). We will describe
some of the most relevant ones.
Yang et al. [123] propose a metric for temporal consistency. They compute the MSE
between two consecutive pictures on motion-compensated areas with high spatial com-
plexity and homogeneous movement. Kuszpet et al. [62] propose a metric to detect
flickering based on the error of motion-compensated areas with smooth (homogeneous)
textures.
Several authors propose blocking metrics, trying to detect the patterns produced by
block coding. For instance, Wu and Yuen propose GBIM (Generalized Block-edge Im-
pairment Metric) [122], based on the energy of the difference between pixels at both sides
of a block boundary. Vlachos [110] estimates the block effect by comparing the cross
correlation between pixels within the same block with that of pixels between adjacent
blocks. Wang et al. [115] search for peaks in the transform domain (FFT) at multiples
of the block spatial frequency.
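As a rough illustration of the idea behind these blocking metrics (this sketch is ours, not an implementation of GBIM or of the other cited algorithms), blockiness can be estimated by comparing pixel differences across 8-pixel block boundaries with differences inside blocks:

```python
def row_blockiness(row, block=8):
    """Crude blockiness score for one image row: ratio of the mean
    absolute luminance step across block boundaries to the mean step
    inside blocks. Values well above 1 hint at visible block edges."""
    across, inside = [], []
    for j in range(len(row) - 1):
        d = abs(row[j + 1] - row[j])
        # Every 'block'-th transition crosses a block boundary.
        (across if (j + 1) % block == 0 else inside).append(d)
    mean_in = sum(inside) / len(inside)
    return (sum(across) / len(across)) / max(mean_in, 1e-9)

# A row with luminance jumps every 8 pixels scores high.
blocky = [10 * (j // 8) + (j % 8) for j in range(32)]
print(row_blockiness(blocky))  # 3.0
```

Real metrics refine this idea with perceptual weighting (GBIM) or work in the transform domain, but the boundary-versus-interior comparison is the common core.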
Blurring is normally measured by studying the width of edges in the image. An edge
detector (usually Sobel or Canny) is applied to the image and then some statistics are
computed to provide a value for the edge width (see, for instance, [21, 68]).
Other metrics measure less common artifacts, such as additive white Gaussian
noise (AWGN), edge continuity, motion continuity. . . , and combinations of them
[21, 74].
However, pixel-based NR metrics do not provide good enough performance when
evaluated against subjective quality assessments [64]. In fact, VQEG has not been able
to recommend any NR metric for standardization; only RR and FR ones [119]. For this
reason, pixel-based NR metrics are normally not applied directly in the measurement of
video quality. However, they are sometimes used as building blocks for more complex
FR and RR metrics.
The second family of no-reference metrics are the bitstream-based ones. They have
become increasingly popular in recent years for two reasons. On the one hand, the lack of
success of pixel-based NR metrics fosters the search for different ways of measuring
quality. On the other, there is a need for measurement schemes that are easy to apply to
large platforms of multimedia services (such as IPTV), where using the decoded video
could have an excessive cost that would prevent a scalable deployment.
The benchmark bitstream-based metric for video delivery over UDP/IP is the Media
Delivery Index (MDI), described in IETF RFC 4445 [118]. MDI is a combination of
two different values, the Delay Factor (DF) and the Media Loss Rate (MLR), which are
usually shown separated by a colon:

MDI = DF : MLR    (2.5)
DF shows how many milliseconds of data must be buffered in the receiver to completely
remove the effect of jitter: it is the additional delay that must be available in the system
to prevent jitter from causing packet losses. In other words, when DF grows beyond the
de-jitter buffer size of a video receiver, some packets will be lost, adding their effect to
the losses accounted for by the MLR part of the MDI. Let ∆ be the instantaneous
variation of the fill level (in bits) of the de-jitter buffer; the Delay Factor over
a period of time (typically one second) is computed as:

DF = (max(∆) − min(∆)) / bitrate    (2.6)
The Media Loss Rate is computed simply as the number of packets lost per time interval:

MLR = (packets expected − packets received) / interval    (2.7)
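As an illustration of eqs. (2.6) and (2.7) (this sketch is ours, not part of RFC 4445; it assumes the virtual buffer fill level is sampled once per packet arrival):

```python
def delay_factor(fill_levels_bits, bitrate_bps):
    """DF over one measurement interval, in seconds, from samples of
    the virtual de-jitter buffer fill level (eq. 2.6)."""
    return (max(fill_levels_bits) - min(fill_levels_bits)) / bitrate_bps

def media_loss_rate(expected, received, interval_s=1.0):
    """MLR: packets lost per time interval (eq. 2.7)."""
    return (expected - received) / interval_s

# Example: fill level swings by 200 kbit on a 4 Mbps stream -> DF = 50 ms.
levels = [0, 120_000, 200_000, 80_000]
print(round(delay_factor(levels, 4_000_000) * 1000, 6))  # 50.0 (ms)
print(media_loss_rate(expected=366, received=364))       # 2.0 packets/s
```

A monitoring probe would recompute both values every second and report them as the DF:MLR pair.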
MDI is in fact a pure QoS metric, with no knowledge of the effects produced by packet
losses or jitter. However, due to its simplicity, it has become a de-facto standard in
commercial IPTV deployments (see, for instance, [67]). Besides, for randomly distributed
errors, it is possible to find a linear correlation between the packet loss rate and the
mean square error [102].
However, these results can vary when losses are not randomly distributed along time.
Different authors have proposed enhancing these metrics by analyzing how errors are
distributed along time and how they propagate between protocols. Liang et al. analyze
the effect of different packet loss patterns for low-bitrate applications [65]. Pattara-
Atikom et al. analyze the propagation of errors, either coming from packet losses or
from excessive delay, from the IP layer to the video layer, also considering different
structural factors related to the loss pattern and how the protocol stack is built [80].
Reibman et al. developed a model which can compute the MSE from the received
bitstream without decoding it [95]. The algorithm, designed for MPEG-2 video, esti-
mates the error at macroblock level and tracks its propagation through the subsequent video
frames. The same research group has evolved these results to predict the visibility of
packet losses for MPEG-2 and AVC video, based on some parametrization of the packet
loss and using Generalized Linear Models to combine the parameters [58, 66, 93].
In a different approach, the Picture Appraisal Rating (PAR) is a metric which estimates
the PSNR of the stream from the values of the quantization parameters in MPEG-2
coded video [59].
These schemes are evolving towards hybrid metrics, which combine several bitstream
measures, sometimes with additional picture measures, to obtain better quality
estimates. The V-Factor proposed by Winkler et al. uses several measures, such as
quantization parameters, bitrate, packet losses, video stream structure. . . , to produce a
single quality value [121]. Erman and Matthews analyze Key Quality Indicators (KQIs)
such as blockiness or jerkiness, and predict their value from measurements of network
quality of service (bitrate, packet loss rate, buffering) using a trained model for that
mapping [17].
This approach is also followed by the upcoming multimedia quality standards being
developed in ITU-T Study Group 12: P.1201 (ex P.NAMS) and P.1202
(ex P.NBAMS) [8]. They are intended to be used both for network planning and for QoE
monitoring. P.1201 uses only transport information, while P.1202 adds video bitstream
information. Only video headers are used; neither of them requires decoding the video.
Most of the work developed in this PhD Thesis also falls within the framework of
bitstream-based and hybrid quality estimation. We propose a simple but effective
method to predict the effect of packet losses on video quality [86]. It can be used as the
basis of a full quality monitoring scheme which provides a meaningful mapping between
quality values and the qualitative effect observed by the user [85]. This model can
also be applied to different scenarios, such as unequal error protection [82] or selective
scrambling [83], among others.
2.4 Other topics related to QoE in IPTV services
When managing multimedia Quality of Experience, there is some “implicit knowledge”
which is not always easy to find in the surveys of metrics, such as a proper definition
of QoE, the relevant fact that transmission errors are much worse than coding errors,
or the difficulty of generating a good no-reference metric [91]. This section compiles some
miscellaneous results extracted from the literature, which can be used to support design
decisions.
Cermak et al. [6] study “the relationship among video quality, screen resolution, and
bitrate” to show that, as expected, the perceived quality increases with the bitrate for a
given screen resolution. Besides, they conclude that “it would be reasonable to choose
a bit rate, given a screen resolution; it would not be reasonable to choose a screen
resolution given a bit rate.”
Jumisko et al. [56] study the effect of content selection on the subjective assessment
of video quality on mobile devices, finding that the content selection may have a strong
effect on the results of subjective assessments. Specifically, for audiovisual content, it
seems that errors are perceived as more severe in contents which are recognized by the
users than in unrecognized contents.
There are also several studies which characterize the levels and patterns of packet losses
(and other network issues) in IPTV services, so that they can provide valuable inputs
to the metrics that monitor the effect of those losses. Hohlfeld et al. [34] provide a
model to simulate packet loss patterns, based on Markov chains whose parameters are
computed from session capture logs. Ellis and Perkins [14] characterize the packet losses
in residential access networks (cable and ADSL), performing an intensive study of packet
loss rates in 4 cable and 1 ADSL links, at several bitrates (1-4 Mbps). Most sequences
had an error rate lower than 1%. Typical error bursts were short: 1 to 5 packets.
However, this can change if the DSL service activates Forward Error Correction and
interleaving, which reduces the error rate at the cost of spreading the errors. With a
typical interleaving of about 10 ms [5], any error which is not corrected by the ADSL
FEC will result in a potential loss of 10 ms worth of video.
Mahimkar et al. perform an extensive analysis of a large commercial IPTV network
[67]. They collect a large amount of field data and develop a method to find the root
cause of a problem through statistical analysis and correlations. Beyond that, they provide
interesting insight into what happens inside a real IPTV deployment:
• Video traffic is monitored using MDI. Other monitoring data used are the logs
of the network elements and routers (collected in a centralized syslog), logs of
Set-Top-Box reboots, and reports from customer care centers.
• There is a high correlation between MDI events and network events (syslog), as
expected. However, there is low correlation between MDI and call center events
(bursty video losses rarely generate a call). On the other hand, most customer
complaints (46%) are related to video (supposedly sustained problems).
• About 5% of STBs had at least one reboot event in a 3-month period.
Another field to consider is the combination of audio and video qualities to generate
an audiovisual quality model of the content. It is widely accepted that the multimedia
quality m can be modeled parametrically from the audio and video qualities (a and v):

m = αa + βv + γ(a × v) + C    (2.8)

and that those parameters depend on the specific application [31].
A recent analysis of 12 different subjective assessment tests [90] has shown that “au-
dio quality and video quality are equally important in the overall audiovisual quality.
The application drives the range of audio quality and video quality examined and thus
produces the appearance that one factor has greater influence than the other. The
underlying perceptual model is invariant to application. The most important overall
conclusion is that only the cross term (a×v) is needed to predict the overall audiovisual
quality”. These results are in line with others showing that instantaneous errors
were similarly unacceptable, whether they occurred in audio or in video [57].
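As a toy illustration of eq. (2.8) under a cross-term-only parametrization (α = β = 0, in line with the conclusion of [90]); note that the coefficient values below are purely hypothetical placeholders, not fitted to any subjective data:

```python
def multimedia_quality(a, v, alpha=0.0, beta=0.0, gamma=0.18, c=1.0):
    """Audiovisual quality estimate from audio quality a and video
    quality v, per eq. (2.8). Coefficients here are ILLUSTRATIVE ONLY:
    real values must be fitted per application to subjective data."""
    return alpha * a + beta * v + gamma * (a * v) + c

# With a cross-term-only model, poor audio caps the overall quality
# even when the video is excellent, and vice versa.
print(multimedia_quality(a=5, v=5))  # best case (clipped to 5 in practice)
print(multimedia_quality(a=2, v=5))  # poor audio drags the result down
```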
Audio quality metrics as such are not usually included in quality assessments for
multimedia applications. This might be caused by a bias in the studies of multimedia
quality, since most of them evolved from video-only quality analysis. Even if that is
partially true, there is a good reason for not being too concerned about audio
coding quality: while audio and video are equally important for multimedia quality,
audio requires at least an order of magnitude less bitrate to reach a similar quality
level [43]. Therefore, audio coding quality should not be a problem in a well-dimensioned
multimedia service.
The measurement of audio coding quality has been standardized in the recommendation
ITU-R BS.1387-1 [41], which defines a Full-Reference audio quality metric called
Perceptual Evaluation of Audio Quality (PEAQ). The model divides the audio signal into
segments (called frames). Each frame is divided into different sub-band components (us-
ing an FFT or a filter bank) on the Bark scale, modeling the frequency response of
the peripheral ear and the time and frequency masking of the human hearing
system. There are a total of 54 sub-bands between 80 Hz and 18 kHz. Afterwards,
both signals are adjusted and equalized, and their relative error (per sub-band) is computed.
These results are used to compute several Model Output Variables (MOVs), which
characterize the level and structure of the error signal (bandwidth, modulation, noise-
to-mask ratio. . . ). Those MOVs feed a neural network which provides the final quality
value.
Audio packet losses are relatively simple to analyze, since they basically produce a mute
in the audio output. The effect of mute length on audio quality has been evaluated
by Pastrana et al. [79], from which rough results for audio losses can be extracted:
• Mutes below 500 ms produce a low to moderate impact on quality.
• Mutes from 500 to 1000 ms produce a strong impact on quality.
• Mutes longer than 1 second produce a very strong impact on quality.
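These rough thresholds can be expressed as a simple classifier (the function name and its labels are our paraphrase of the findings in [79], not part of that study):

```python
def mute_impact(mute_ms):
    """Qualitative impact of an audio mute of the given duration (ms),
    per the rough thresholds extracted from Pastrana et al."""
    if mute_ms < 500:
        return "low to moderate"
    if mute_ms <= 1000:
        return "strong"
    return "very strong"

print(mute_impact(200))   # low to moderate
print(mute_impact(800))   # strong
print(mute_impact(1500))  # very strong
```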
The end-to-end delay is not frequently considered a critical design factor for multi-
media services (IPTV and similar), as it is for conversational services. However, it
could become a relevant factor in some specific situations. For instance, experiments
show that for specific contents, such as important sport matches, an end-to-end
delay which is 2-4 seconds higher than that of other services (SDTV vs HDTV, for instance)
can be a reason for a user to switch services [70]. And sport matches are in fact the most
relevant content on current digital television platforms: for instance, on the Spanish pay-TV
channel Canal+, more than 40% of the aggregated audience comes from live football
matches 1.
2.4.1 Media formats in IPTV deployments
Video coding and transport standards provide a reasonable degree of freedom in their
implementation. However, when designing, implementing, or testing QoE measurement
strategies, some assumptions must be made about the specific parameters used to encode
the content.
To support these assumptions, we have analyzed video streams from existing IPTV
deployments (or field trials of IPTV service providers) in several countries, such as
Spain, USA, UK, Brazil, Chile, Argentina, Austria, Cyprus, Dubai, Czech Republic,
Slovenia, France, Italy, Japan, Taiwan, India, Turkey, South Korea and Australia, from
2007 to 2011.
From this survey, the following conclusions were obtained:
• Video format is MPEG-4 AVC (H.264) in all the scenarios, plus some MPEG-2
part 2 video in some of them (in all cases for legacy support and with the intention to
migrate to MPEG-4 AVC). VC-1 has limited use, mainly in North America. Other
formats, such as MPEG-4 part 2 (visual) Simple Profile, which were relatively
1Audience data from January to September 2012. Source: Kantar Media.
popular in internet video in recent years, have virtually no presence in the
IPTV world. Main profile is used for SDTV, and main or high profile for HDTV.
• Video resolutions are the typical ones for television distribution: 720x576 (25 fps) and
720x480 (30 fps) for SD, as well as 1920x1080 (25/30 fps) and 1280x720 (50/60
fps) for HD. Fractions of the full horizontal resolution (e.g. 1440x1080 or 544x576) are
also frequent, especially at the lowest bitrates, to reduce the amount of data to
transmit.
• Video bit rates lie between 1.5 and 3 Mbps for SD. HD bitrates are more variable:
from 6 to 20 Mbps. Constant bit rate is used in most deployments, although
not necessarily with strict CBR constraints (some local variations of the bitrate
are acceptable).
• GOP lengths are between 12 and 100 frames (0.5 to 4 seconds, approximately). The most
typical values are between 24 and 48 frames. GOP structures are IBBP or IBBBP,
the latter being more frequent with longer GOPs. Besides, IBBBP GOP structures
are normally hierarchical, with the reference structure represented in figure 2.4.
Dynamic GOP sizes, i.e. changing the GOP size depending on the structure of the
video (normally to insert I frames at scene changes), are frequently used. However,
the number of consecutive B frames (between I and P frames) does not change.
• Video start-up delay (PTS-PCR difference for I frames) is imposed by the encoder
end-to-end delay. In last-generation low-delay encoders, coding delay is typically
around 1 second and video start-up delay is between 700 and 900 ms. Medium-
delay configurations have values around 2 seconds (normally with the benefit of
better coding quality). The first generation of H.264 encoders had delays in the
range of 4 seconds.
• Audio formats are MPEG-1 layer 2 (typically associated with MPEG-2 video),
MPEG-4 AAC (both low-complexity and high-efficiency profiles), and Dolby AC-
3. Audio bitrates range from 96 to 512 kbps. There may be several audio streams
(with different languages).
• Other data streams are usually present, subtitles and teletext being the
most relevant ones. Their presence, relevance, and format vary significantly
from one country to another.
Figure 2.4: Hierarchical GOP structure
2.5 Conclusions
Even though the market and the technology are in constant evolution, it is possible to
define a subset of common elements which covers a large fraction of the multimedia
service playground: the delivery of MPEG digital video and audio content over a packet
network. The main task of service providers is indeed to offer this delivery with enough
Quality of Experience for the end user. To achieve it, they must control three elements:
coding quality, network quality of service, and overall service availability.
Multimedia coding quality will depend mainly on the available bit rate, which limits
video quality more strongly than audio quality. Service providers use the codec (from
those widely available in the market) which offers the best quality for a given bit
rate budget, which is currently H.264. The coding process can (and should) be screened
to assess its quality, and this should be done by the best available means: subjective
assessment tests or Full-Reference quality metrics. Regarding FR metrics, the ones that
have passed the VQEG tests are available in ITU-T recommendations J.144 and J.341. If
they are not available, simpler metrics such as SSIM (or even PSNR) can be used, as long
as one is aware of their limitations.
The most relevant risk to Quality of Experience in field deployments, however, is a
drop in the QoS offered by the network, resulting in the loss of information. These
losses can produce both spatial and temporal artifacts (macroblocking, jerky motion,
freezing), with a strong impact on QoE. This impact could be monitored using No-
Reference or, even better, Reduced-Reference metrics, such as the ones proposed in
ITU-T J.249. However, practical reasons mean that only QoS monitoring schemes, such
as the Media Delivery Index, are widely used in service deployments. Bitstream-based
NR QoE metrics can overcome those practical limitations and enhance pure QoS models,
as intended by the recently approved ITU-T recommendations P.1201 and P.1202.
Regarding the contribution of the different elements to quality, audio and video can be
considered equally important. There are also reasons to consider end-to-end lag a
relevant factor.
Overall service availability, understood as the possibility to receive the multimedia ser-
vice, must also be considered in any practical scenario. Customers suffer from user
terminal software issues or other outage events, at a rate that can be roughly estimated
at 1–5%, according to the studies presented in the literature. However, unlike coding
and network quality, problems with service availability are specific to each service
deployment.
Chapter 3
Designing QoE-Aware
Multimedia Delivery Services
3.1 Introduction
Monitoring multimedia quality of experience (QoE) in a multimedia service is a complex
task. Quality monitoring implies generating quality data in real time at all the relevant
points of the network, raising alarms whenever a critical event happens, and being able
to retrieve and process all the data to obtain significant statistical information about the
network performance in terms of QoE. Moreover, a quality monitoring framework typically
assumes that the original signal is rarely available at the monitoring point, and therefore
reduced-reference (RR) or no-reference (NR) video metrics need to be used.
There are dozens of video, audio, and multimedia RR/NR quality metrics that could be
applicable to the monitoring of multimedia QoE (see, for instance, [33]). Although their
performance is not as good as that of Full-Reference metrics [111, 112], they can provide
relevant results about the video quality of the measured signal. In fact, metrics of this
kind have also been introduced in some commercial monitoring probes during the last
decade. The cost of those probes makes them usable for deployment at several points
within the communication network (such as the video head-end or the local points of
presence), but not in the end user's home network.
However, in communication networks, errors typically occur in the last mile, where
the computing power of the equipment (network routers, access gateways, or set-top
boxes) is rarely dedicated to the implementation of complex processing algorithms for
QoE monitoring. In practical terms, the monitoring information available in field de-
ployments is obtained only at transport level: packet loss rate (PLR) and packet loss
pattern (PLP), as well as jitter [3, 67], frequently expressed using the Media Delivery
Index (MDI) [118]. Some derived QoE monitoring metrics, such as [4, 17], are built
assuming that packet loss and jitter are the only available inputs regarding network impair-
ment issues. Since the effect of jitter is to create a packet loss at the receiver (because
the packet arrived too late to be used), this is equivalent to saying that the only available
network quality information is the packet loss pattern at the end device.
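As an illustration of what such transport-level monitoring provides, the MDI of RFC 4445 combines a Delay Factor (DF, derived from a virtual buffer model) with a Media Loss Rate (MLR). The following Python sketch shows the idea for one measurement interval; the virtual-buffer formulation is simplified for clarity:

```python
def media_delivery_index(arrival_times, packet_bits, media_rate_bps,
                         lost_packets, interval_s=1.0):
    """Media Delivery Index (DF:MLR) over one measurement interval.

    DF (Delay Factor, ms): buffer depth, expressed as play-out time,
    needed to absorb arrival jitter of a nominally constant-rate stream.
    MLR (Media Loss Rate): media packets lost per second.
    """
    # Virtual buffer: drains at the nominal media rate, fills on arrival.
    vb = 0.0
    vb_min = vb_max = 0.0
    prev_t = arrival_times[0]
    for t, bits in zip(arrival_times, packet_bits):
        vb -= media_rate_bps * (t - prev_t)   # drained while waiting
        vb_min = min(vb_min, vb)
        vb += bits                            # filled by this packet
        vb_max = max(vb_max, vb)
        prev_t = t
    df_ms = 1000.0 * (vb_max - vb_min) / media_rate_bps
    mlr = lost_packets / interval_s
    return df_ms, mlr
```

For a perfectly paced 4 Mbps stream of 7-TS-packet datagrams (10528 bits each), DF collapses to the play time of a single packet, about 2.6 ms; bursty arrivals inflate it.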
Despite its limitations, this approach makes sense because, for random packet losses, the
effects (errors in the decoded video) are well correlated with the (effective) packet loss rate
[95] and pattern [65]. PLR/PLP monitoring systems also have many other advantages:
they make no assumptions about the content, can be widely deployed in a non-intrusive
way, and provide data that are easy to understand, aggregate, and analyze. Besides,
they are repeatable: if we can assume that a specific packet loss pattern creates an
aggregated effect (say, an impairment of x% in global perceived quality, within
some error margin), we can recreate the same effect by replicating the causes, that is, by
generating the same error pattern.
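This repeatability can be made concrete: the PLR and the loss pattern extracted from one received stream can be replayed against another. A minimal Python sketch follows (RTP sequence-number wrap-around is ignored for brevity):

```python
def loss_pattern(seq_numbers):
    """Packet loss rate and loss pattern from received sequence numbers.

    seq_numbers: in-order list of RTP sequence numbers actually received
    (16-bit wrap-around is ignored in this sketch).
    Returns (plr, bursts), where bursts lists the consecutive-loss
    lengths; this burst list is the 'pattern' that can be replayed to
    recreate the same aggregate effect on another stream.
    """
    expected = seq_numbers[-1] - seq_numbers[0] + 1
    bursts = []
    for prev, cur in zip(seq_numbers, seq_numbers[1:]):
        gap = cur - prev - 1          # packets missing between neighbors
        if gap > 0:
            bursts.append(gap)
    lost = sum(bursts)
    return lost / expected, bursts

# Example: sequence numbers 4-5 and 8-9 were never received
plr, bursts = loss_pattern([1, 2, 3, 6, 7, 10])
```

In this example the PLR is 0.4 and the pattern is two bursts of two packets each; an error injector reproducing those bursts recreates the measured condition.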
However, using PLR/PLP as the only description of network losses implies treating
all the media packets as homogeneous data, which is certainly sub-optimal. The impact
of an isolated packet loss may vary strongly depending on whether the lost data belong
to audio or video, and also depending on the part of the audio or video stream that
has been lost. Besides, as discussed in section 5.2, the impact of packet losses can
be dramatically mitigated by a simple re-arrangement of the transport stream packets
in the RTP and an appropriate prioritization model [82]. Hence, with an appropriate
packet priority model applied in the service deployment to reduce the randomness of
the packet loss events, the significance of pure PLR/PLP would also be reduced.
Fortunately, there is additional information, at transport level or at the network abstrac-
tion layer of the media level, that is available to the network elements with very little
additional effort (in other words, without needing to decode, even partially, the video
or the audio): the elementary stream type (video or audio), the type of coded video frame
(I, P, B...), the position of the frame boundaries within the bitstream, and so on. This
information, which we could call rich transport data, can be used to better predict the
effect of packet losses [86]. The key point here is that the rich transport data are obtained
directly from the bitstream in a deterministic way (it is syntactic information that is always
available in the media stream). This way, these data share the main properties that made
PLR/PLP so useful: content-agnosticism, non-intrusiveness, simple processing and, most
relevantly, repeatability (in the same terms as PLR/PLP).
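Part of this rich transport data is literally in the packet headers. As a sketch, the fixed 4-byte MPEG-2 TS packet header can be parsed in a few lines of Python; the example packet bytes are hypothetical:

```python
def parse_ts_header(packet):
    """Parse the 4-byte MPEG-2 Transport Stream packet header.

    This is rich transport data available without decoding the media:
    the PID tells audio from video, the payload_unit_start_indicator
    flags PES (frame) boundaries, and the continuity counter exposes
    losses within one elementary stream.
    """
    if len(packet) < 4 or packet[0] != 0x47:
        raise ValueError("not a TS packet (missing 0x47 sync byte)")
    return {
        "payload_unit_start": bool(packet[1] & 0x40),
        "pid": ((packet[1] & 0x1F) << 8) | packet[2],
        "scrambling": (packet[3] >> 6) & 0x03,
        "continuity_counter": packet[3] & 0x0F,
    }

# Hypothetical packet: PUSI set, PID 0x100, in the clear, CC = 5
hdr = parse_ts_header(bytes([0x47, 0x41, 0x00, 0x15]))
```

Note that a gap in the continuity counter of a given PID already localizes a loss to audio or video, which pure PLR/PLP cannot do.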
The aim of this chapter is to build a framework for QoE monitoring based on the
information available at the rich transport data level. This framework will be built following
a pure bottom-up approach: the target is to make the best possible use of the available
information within this rich transport data level, and to find out whether the
obtained results could be sufficient in a network monitoring scenario, as well as whether
they would provide more information than that obtained by PLR/PLP analysis alone.
Before going on with the analysis, it is important to consider that the final target of
network monitoring is to anticipate, or at least explain, the degradations in user
quality of experience that could cause complaints from end customers. In this context,
when users complain about errors in the field, they do not speak of packet losses, but
of video artifacts [7], such as blockiness, screen freezes, choppy transitions, or distorted
audio.
The aim of our monitoring framework is to identify the root causes of these kinds of
impairments, so that they can be detected when they happen. On the one hand, for
impairments caused by network errors, we will use the information in the rich
transport data to obtain the most accurate description and characterization of the
effect. On the other hand, for impairments related to the coding process itself, there are
also elements in the transport data that can be used as proxies to monitor them. It is
also important to consider that, in typical multimedia service deployments (with a few
hundred video streams for hundreds of thousands of users), the quality of the encoded
content should be high enough in normal operating conditions, and the monitoring
of coding quality can also be done with more complex (and expensive) tools.
To validate the characterization of the different impairments based on rich transport
data, we have also designed a set of subjective quality assessment tests, where the
impairments to analyze are based on that characterization. They compare the effect
of the same type of degradation for several contents and different users. The results of
the tests can be used both to validate the characterization of the error (i.e., to determine
to what extent it makes sense) and to calibrate its subjective impact.
Quality monitoring tools are aimed at estimating the quality perceived by the end users.
Therefore, to obtain meaningful conclusions from subjective tests that can be applied in the
development of the monitoring architecture, these assessment tests should be designed
to reproduce, as far as possible, real home viewing conditions. Thus, a novel subjective
methodology, based on well-known standard procedures, was used in the tests covered
in the present work to obtain results representative of what end users perceive in their
households when typical transmission errors degrade the received video.
The main target of our work is to sketch the steps required to build a consistent moni-
toring framework. This way, it is possible to identify the main impairment sources, find
out which information can be obtained about them (under reasonable assumptions), pro-
pose a framework to sort and classify this information, and design and implement a set
of subjective assessment tests based on this framework. We believe that this approach
can be easily enriched with other algorithms and models, and it could also provide useful
tools for other monitoring schemes under development (for example, recent
standardization efforts such as ITU-T P.NAMS and P.NBAMS [8]).
The chapter is structured as follows. In section 3.2 we will describe the architecture of
a multimedia delivery service, as well as the main quality impairment events that are
present in a field deployment. Once they are identified, in section 3.3 we will propose
the architecture for the monitoring process, aimed at detecting and characterizing each
of those events. Section 3.4 describes the design of the subjective assessment tests used
to validate and parameterize the proposed solution. Finally, section 3.5 describes some
QoE enablers: network elements focused on enhancing the QoE offered by the service.
3.2 Delivering multimedia over IP
The monitoring system has to be designed according to the architecture of the monitored
service. For this reason, a fine characterization of what a multimedia service is and
how it works is quite relevant for the purposes of our work. This section proposes an
architecture for multimedia services, based on the principles described in Section 2.2.
It also describes a set of impairments that appear in those deployments, and how they
could be detected based on the monitoring of rich transport data.
3.2.1 Architecture of a multimedia service delivery platform
Figure 3.1 shows a schematic architecture for an IPTV and OTT service. Although
it is a simplification, it shows the main elements that are present in most commercial
deployments [11, 55, 67]. The architecture also shows the most relevant quality moni-
toring points according to the Recommendation ITU-T G.1081 [44]. They are labeled
as PT1-PT5 in the figure, following the terminology proposed in the Recommendation.
The main building blocks for a multimedia service delivery architecture are thus the
following:
• The video contribution, coming from the Content Provider. The ingestion of the
video contribution is the monitoring point PT1.
Figure 3.1: Schematic representation of the network architecture for IPTV and OTT services, including reference monitoring points (PT1-PT5)
• A Central Headend, where the content preparation occurs. This is normally owned
or controlled by the Service Provider. There may also be local headends, which
are smaller versions of the central headend used for local content. PT2 is located at
the output of the headend.
• The core network, with different configurations depending on the type of service
distributed. It is assumed to be a high-quality network, with negligible error rate.
• The Point of Presence. This is the last point in the network chain where the
Service Provider has control. PT3 is located here.
• The access network, which is an IP link between the PoP and the Home Domain.
• The Home Domain, which includes the Residential Gateway (RGW, the entry
point of the home, where PT4 is placed) and the HNED or user terminal (whose
output is PT5).
The video contribution is received, by definition, in “contribution quality”, which is
the maximum multimedia quality available to the service provider. The contribution is
ingested into the video headend and processed once in a centralized way (or “locally”
in local headends, with video streams that may have regional or local distribution only).
The key principle of the headend is that any processing is done once for each content
asset or stream (or, in other words, each processed asset will be common for all the users
of the service). For this reason, the processing done in the headend is usually performed
by dedicated equipment. Processing or storage capacity in the headend is not a strong
limitation in the deployment.
Typical head-end functionality includes [87]:
• Coding (or transcoding) of the contribution. The contribution source is encoded
using a format, resolution, and bitrate that fits the dimensioning of the service and
the capabilities of the network and user terminal. After this, the encoding can be
assumed to be left untouched, and the multimedia quality of the content at this
point (delivery quality) is the expected quality to be perceived by the end users.
• Encryption of the content using a Digital Rights Management (DRM) system [109].
The coded media stream, or a fraction of it, is scrambled using cryptographic algo-
rithms. The scrambled data can only be deciphered by authorized user terminals.
• Other video processing: multiplexing, remultiplexing, labeling, signaling of entry
and exit points for local content splicing, metadata insertion...
• Ingestion into the core network, for Live/OD and for IPTV/OTT, using the appro-
priate multiplex and transport protocol stack, as described in section
2.2.2. In the case of RTP streams (IPTV), it may also imply adding Forward
Error Correction (FEC) redundancy packets, as defined in DVB-IPTV AL-FEC.
Quality monitoring in the headend aims to guarantee a sufficient degree of delivery
quality. It normally requires intensive monitoring, as any quality impairment at this
point affects all the users in the deployment. It allows FR measurement of the delivery
quality with respect to the contribution quality, between points PT1 and PT2.
The core network is different for each of the service types. In the case of IPTV, the
live video is distributed using a multicast-enabled IP network. IPTV Video on De-
mand is ingested into a centralized master VoD server, which may distribute it to video
pumps located closer to the end users. The core network for OTT is a Content Delivery
Network (CDN). CDNs ingest the master copy of the content into a centralized server
(usually called “origin server”), which stores it permanently (for on-demand content) or
for some time window (for live content). The video is then distributed towards the edge
throughout a hierarchy of caches.
The point of presence (PoP) is, by definition, the last point where the service provider
may have control of the delivered video. Although it has been displayed as a common
point for all the networks, it does not need to be this way: it is not infrequent that
CDN PoPs, for instance, cover a wider area (and more users) than IPTV PoPs. The
key point of the PoP is that all the processing done here is performed on a per-user
basis. It is also the last common point for unicast services: any communication between
the PoP and the end user will be different for each user (except for the case of live IPTV
over multicast). As a consequence, PoP processing must scale on a per-user basis
(contrary to the per-asset scalability of the video headend), and therefore the cost of
processing and storage in the PoPs is very relevant for the overall performance of the
service.
The PoP is also PT3: the last monitoring point in the service provider domain, where
two different elements are monitored:
1. Errors in the core network. RR or NR metrics can be used here, depending on the
capability of the headend to generate RR information. Each error detected here
affects all the users in that PoP; therefore intensive monitoring is recommended.
2. Errors in the delivery and home networks. This is the real monitoring of the
quality delivered to the end user, which must be done between the PoP servers
and the user terminal. In the cases where the service provider does not control any
home network element (which is typical in OTT), the monitoring must be done in
the PoP (with the feedback data provided by the client in the communication).
As performance is critical, in almost all cases only bitstream NR measures will be
available.
The access network is the IP link between the PoP and the home domain. Strictly
speaking, the term “access network” is normally used only for the “last mile”, i.e.,
the part of the network covering the data link to the home domain (DSL, GPON, 3G,
LTE...). However, we will use the term in a broader sense, so that it may also cover
the “second mile” metropolitan network or, in general, any required IP access between
the home domain and the PoP. IPTV access networks must support UDP traffic and,
more specifically, UDP over IP multicast. OTT traffic is less demanding, typically only
involving HTTP connections (TCP over port 80).
Finally, the home domain comprises all the equipment located in the end user premises.
Depending on the type of service, the Service Provider may have some kind of control
over what happens in the home domain. For instance, in IPTV services it is frequent
that the Service Provider owns the residential gateway and/or the user terminal, which
are provided as part of the service itself. In OTT services it is more frequent that the
user terminal is owned by the end user, although it might include Service Provider
specific application software.
In the former case, it is possible to take NR bitstream-based measures at PT4. In
the latter, monitoring in the user terminal is not possible. In any case, PT5, which is
the final quality displayed to the end user, cannot be effectively monitored in real time
service-wide. Only selected users, either with objective monitoring probes or as subjects
of subjective assessment tests, will be able to provide quality information.
Two additional considerations are relevant. The first one is that the possibility to take
some measures (as well as to perform error correction actions when possible) may depend
on specific QoE capabilities of the deployment, such as the ones that will be described
in section 3.5. The second one is that, if a DRM system is in place, it is
virtually impossible to apply pixel-based metrics beyond the headend, as the
content will be scrambled and will not be decodable by the monitoring probes. It may,
as a general rule, be possible to apply bitstream-based metrics, since the scrambler can
normally be configured to leave in the clear all the relevant rich transport data of the stream.
3.2.2 Impairing the Quality of Experience
The first step in designing a monitoring system is to understand the elements that can
affect the quality of experience perceived by the end user. The concept of QoE is
used on purpose in this work, since the framework established here is applicable to any
element of the QoE that can be monitored. However, the work will have
a special focus on the aspects of QoE that are directly related to multimedia quality.
Having said this, this section provides a first classification of the possible causes of degrada-
tion of the quality of experience, as well as their possible consequences. The classification
is mainly based on the causes, because causes are what monitoring systems measure. In
the description of the different causes, we will also identify the mapping to the quality
impairments reported by the end users, as described by Cermak in [7].
3.2.2.1 Coding quality
Video coding quality is one of the most relevant elements of the QoE, and establishes
an upper bound for the global perceived quality. The artifacts that appear in video
coding, as well as several ways of measuring them, have been widely discussed in the
literature [15]. Among them, the “edges shimmer” reported in [7] is one of the effects
of problems in video coding. Low coding quality can also cause a blocking effect in the
pictures (although it is less visible in AVC video, due to the deblocking filter), but
it is less aggressive than in the case of video packet losses.
Audio coding quality is normally a less relevant issue in video delivery services, because
its bitrate is typically one order of magnitude smaller than that of the video, while its
impact on the quality is similar [31].
Estimating video quality from rich transport data is not obvious. Without any better
proxy to measure coding quality, the bitrate normally makes a good one, especially when
comparing quality from the same encoder and the same content [6]. Under stable con-
ditions (same encoder implementation and bit rate), the quantization parameters may
also provide an estimation of video quality [59].
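A possible shape of such a proxy is sketched below in Python; the linear mapping and its anchor values are purely illustrative placeholders, not calibrated figures from [6] or [59]:

```python
def coding_quality_proxy(avg_qp, qp_good=26, qp_bad=44):
    """Hypothetical proxy: map the average H.264 quantization parameter
    to a 1-5 quality score by linear interpolation between two anchor
    QPs. The anchors would have to be calibrated per encoder
    implementation and content type; the values here are placeholders.
    """
    if avg_qp <= qp_good:
        return 5.0
    if avg_qp >= qp_bad:
        return 1.0
    return 5.0 - 4.0 * (avg_qp - qp_good) / (qp_bad - qp_good)
```

Such a proxy is only meaningful for comparisons under the stable conditions mentioned above (same encoder, same bit rate), not as an absolute quality measure.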
3.2.2.2 Packet losses
The most relevant impairments in video transmission services (for instance, those re-
ported in [6]) come from errors in the network: either packet losses or jitter. Packet
losses can be corrected using either FEC or ARQ techniques (see, for instance, the
proposal for IPTV in [19]), while jitter can be corrected by using a reception buffer.
However, when the error (loss or jitter) exceeds the capabilities of the correction strat-
egy, the effect is always a loss of data in the decoded stream. These effective packet
losses are the main target of quality monitoring.
The effect of packet losses depends on the error recovery strategy. In most cases, the
decoder tries to conceal the error by inferring an appropriate replacement for the af-
fected data sections (video or audio frames), such as repeating previous data or inserting
silence or noise. Losses in the video stream produce a “blockiness” effect or freezes in the
video play-out. The former causes the appearance of incoherent blocks in regions of
the frame, while the latter can be perceived as “screen freeze” or “choppy transition”,
depending on the length of the effect [86, 95]. Losses in the audio stream produce an
audio degradation (“distorted audio”) with a duration of the same order of magnitude
as the length of the data loss [79, 84].
However, the behavior for on-demand content can be different, as an alternative error
recovery strategy is possible: stopping the play-out and waiting until all the necessary data
have arrived. Nevertheless, this only makes sense when data retransmission is possible
(e.g., a TCP transport layer, where the integrity of the received data segment is guaran-
teed). Besides, it generates the “buffering events” typical of internet video.
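For monitoring purposes, the occurrence of such buffering events can be detected with a simple play-out model. The Python sketch below assumes a constant frame rate and immediate start on the first frame, which is a simplification of real players:

```python
def buffering_events(arrival_times, frame_duration):
    """Count re-buffering events for a push-to-play stream.

    Play-out starts when the first frame arrives and consumes one frame
    per frame_duration; whenever a frame has not arrived by its
    scheduled play time, the play-out stalls until it does (one
    buffering event) and the schedule shifts accordingly.
    """
    events = 0
    play_time = arrival_times[0]      # first frame plays on arrival
    for t in arrival_times:
        if t > play_time:             # frame is late: stall until it arrives
            events += 1
            play_time = t
        play_time += frame_duration
    return events
```

A real player would also pre-buffer several frames before (re)starting; adding that threshold changes the counts but not the principle.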
3.2.2.3 Latency
With bidirectional real-time communication (such as videoconferencing), end-to-end la-
tency is the most critical parameter to consider. In unidirectional content delivery,
however, latency is typically much less important. Coding quality management and
packet loss correction are normally done at the cost of latency. Latency only matters
in live events (especially sports events). However, to the best of our knowledge, the effect of
global latency on user QoE has not been widely studied in the literature.
Except for the previously mentioned buffering events, end-to-end latency is constant: it
is established at the beginning of the multimedia session and remains unchanged from
then on. In fact, except for the pure transport latency (which is only significant for
satellite broadcast), latency is decided at the design phase.
Another latency-related QoE element is the initial wait time, which is the time that the
user has to wait to start viewing (and hearing) a multimedia service. In the context
of linear TV, it is called “channel change” or “zapping” time, and it has been
modeled as a component of the QoE [60]. Since digital TV typically has long zapping
times, there have been significant efforts in recent years to develop systems that can
reduce them (see, for instance, [11]).
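For reference, IPTV zapping time is often described as a sum of components: network switch, de-jitter buffer fill, wait for a random access point, and key acquisition. The additive decomposition below is a first-order sketch under our own simplifying assumptions, not the model of [60]:

```python
def channel_change_time(igmp_s, buffer_s, gop_s, decrypt_s=0.0):
    """Rough additive decomposition of IPTV zapping time (seconds):
    network switch (IGMP leave/join), de-jitter buffer fill, wait for
    the next random access point (half a GOP on average, under a
    uniform-tuning assumption), and key acquisition for scrambled
    services. A first-order sketch, not a validated QoE model.
    """
    return igmp_s + buffer_s + gop_s / 2.0 + decrypt_s
```

With illustrative values (0.1 s IGMP, 0.5 s buffer, 2 s GOP, 0.2 s key acquisition) the model yields 1.8 s, which matches the order of magnitude commonly reported for digital TV zapping.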
3.2.2.4 Outages
Service outages are interruptions of the whole service for a significant period of time.
Although such errors may have quite different sources, they must be taken
into consideration in any global QoE monitoring system: on the one hand, because it
does not make sense to monitor the less relevant errors if the most critical ones are not
controlled; on the other hand, because they effectively happen and are reported by the
end users (they are labeled as “error stop” in [7]).
Besides possible failures in the service equipment (either in the customer premises
or in the network), an outage can also be produced by an abrupt loss of the video and audio
signal, which can be monitored with measures such as the ones defined in [94]. If the
origin of the outage is in the contribution media source, it should be monitored in the
service head-end (before or after the video coders). Outages caused by the network are
equivalent to long packet losses, and are easily monitored as well.
3.2.2.5 Quality degradations in new multimedia scenarios
Multimedia services are nowadays starting to popularize two features that affect QoE
management: scalability and stereoscopy.
The concept of scalable video, where the video is coded using several quality layers,
each one refining the quality provided by the previous one, has been included in the
coding standards in recent years. However, it has not had wide acceptance and is
not significantly present in current multimedia services. Nevertheless, the concept of scal-
ability has recently been introduced in the marketplace with the emergence of HTTP
adaptive streaming [104]. In these systems, the media stream is coded in parallel
at different bitrates, and the streaming can switch among them at pre-defined switch-
ing points. The advantage is that the streams are fully compatible with current AVC
decoders, thus simplifying their implementation and deployment. As there are different
bitrates, there are different coding qualities for the same video and audio stream (and all
the considerations made for coding quality apply). Besides parallel coding, another way
to create a codec-compatible lower-bitrate version of a video stream is to drop some
non-reference frames. This technique, called denting, has already been used in different
IPTV applications [85]. Regardless of the method used to create the different bitrate ver-
sions, they will have different qualities (and therefore a different impact on the perceived
QoE). Any monitoring system has to be aware of the version that is being received by
the client at any moment.
Stereoscopic video is also being introduced in broadcast and streaming services, as 3D
productions are increasingly popular in the entertainment market. There are basically
two coding options for stereoscopic services: either coding the right and left views as parts
of a single coded frame (typically side-by-side), using a common 2D video encoder; or
coding them separately, normally using MVC. In either case, the coding schemes are
basically the same as in 2D video (based on blocks and prediction), and therefore the
effects on the decoded picture are equivalent to the ones produced in 2D video. However,
those artifacts can have a different impact on the final stereoscopic reconstruction done
by the human visual system, and therefore they have to be studied specifically [24].
3.3 QuEM: a qualitative approach to QoE monitoring
The aim of this section is to propose an architectural design aimed at monitoring the
quality of experience in a multimedia service delivery network. First we will provide a
definition of the problem, trying to make explicit all the assumptions taken into account
in the design. Afterwards we will propose the architecture structure, as well as possible
implementations for its main building blocks.
3.3.1 Problem statement
The problem addressed by this architecture is the monitoring of multimedia QoE in
an IPTV or OTT network. Figure 3.2 shows the delivery chain of multimedia services
based on the network architecture described in section 3.2: source media, coding, trans-
port, decoding, and presentation. The most typical realization of this delivery chain is
an IPTV deployment of MPEG-2 Transport Stream video over RTP/UDP over an IP
network [19]. However, the main ideas and elements described later will also be easily
applicable to HTTP adaptive streaming scenarios.
The main assumption is that the monitoring is applied to the network of a service
provider offering some kind of video distribution service to a large number of end users.
This assumption imposes two conditions.
On the one hand, scalability is a must. As such, any monitoring metric should require
little processing power, be applicable in real time, be a no-reference metric, and assume
no prior knowledge of the source content.
On the other hand, it is expected that the service provider has established a target
quality that is considered sufficient, and which is the one offered by the service in
normal conditions. Therefore, the aim of the monitoring system will be to detect the
moments when this quality gets impaired and to establish some measure or description of
such impairment. In other words, the monitoring system should provide a relative value
of the quality, with respect to the target quality that would be delivered in the absence of
impairments.
Figure 3.2: Delivery chain of a multimedia service
3.3.2 System design
To detect and measure those impairment events, we take a simple approach based on a
typical quality estimator architecture: measure, pool, and map to quality [33]. Figure
3.3 shows the block architecture of the design, which we have called QuEM (Qualitative
Experience Monitoring) [85].
The basic building block of the solution is the Qualitative Impairment Detector (QuID).
It performs the measurement step by identifying each of the sources of content degra-
dation. Its output is the (approximate) perceived degradation in the user experience.
The key property of this block is that its output must be, as much as possible, a systematic
description of the effect of the error that has occurred (e.g., “half of the picture is blurred
for one second”), and not only a single quality value (e.g., “Mean Opinion Score equal to 2”).
This significance of the QuID output is what makes the approach qualitative
(in the sense that there is not only a quantitative value of the degradation, but also a
qualitative description).
Figure 3.3: QuEM architecture design
At this point of the chain, the repeatability is also very important. Therefore it should
be possible, as a general rule, to force the introduction of an error of each type, as it is
possible in the pure-PLR-based methods.
The next step is the Severity Transfer Function (STF). The idea here is to map the
error to quality values which, in the case of packet monitoring, would express the severity
of the error. The STF is applied within a pooling window. Synchronizing the pooling
window across all the different errors (and across different clients) is important, because
it makes it possible to track whether an error has affected different users at the same
time. The length of the pooling window is another configurable parameter of the model.
It should be in the range of the duration of what could be considered a single impairment
event. To cover, for instance, macroblocking error propagation along the video Group Of
Pictures, adaptive streaming segments, or short outages [94], pooling windows from 5 to
20 seconds can be considered appropriate.
The scale used for the STF may be anything which is significant for the user of the
monitoring system, including a Mean Opinion Score (MOS) scale. However, unlike
in typical MOS-based quality metrics, the STF is known by the user, thus making it
possible to trace the MOS value to the qualitative description of the impairment that
generated it.
The last step is the aggregation of errors for their use in statistics and in alarm systems.
As with the STF, the aggregation function can also be modified by the service provider.
Due to the complexity of taking into consideration all the possible interaction between
different QuIDs, and provided that the severity of each of the QuIDs has already been
established, our proposal for this block is simply taking always the maximum severity
of the ones in play [57].
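As an illustration of this measure, pool, and map chain, the following sketch wires hypothetical QuID event records through illustrative STFs and applies the proposed aggregation rule (taking the maximum severity within the pooling window). The event schema and the STF curves are assumptions made for the example; real STFs would be calibrated with subjective tests such as those of section 3.4.

```python
from dataclasses import dataclass

@dataclass
class QuidEvent:
    """Qualitative description of one detected impairment (hypothetical schema)."""
    quid: str         # e.g. "video_freeze", "macroblocking"
    timestamp: float  # seconds, on a time base shared across clients
    params: dict      # qualitative parameters, e.g. {"duration_s": 0.5}

# Illustrative Severity Transfer Functions (assumed shapes, not calibrated ones).
STFS = {
    "video_freeze": lambda p: min(5.0, p["duration_s"]),
    "macroblocking": lambda p: 5.0 * p["area"] * p["duration_s"],
}

def pool_severity(events, window_start, window_len=10.0):
    """Map each event through its STF and aggregate one pooling window.

    Aggregation follows the thesis proposal: take the maximum severity
    of the QuIDs in play within the window.
    """
    in_window = [e for e in events
                 if window_start <= e.timestamp < window_start + window_len]
    severities = [STFS[e.quid](e.params) for e in in_window]
    return max(severities, default=0.0)

events = [
    QuidEvent("video_freeze", 2.0, {"duration_s": 0.5}),
    QuidEvent("macroblocking", 7.5, {"area": 0.5, "duration_s": 1.0}),
]
print(pool_severity(events, 0.0))  # max(0.5, 2.5) = 2.5
```

Note that, unlike a single MOS value, each QuidEvent keeps its qualitative description, so the severity reported for a window can always be traced back to the impairment that generated it.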
3.3.3 Qualitative Impairment Detectors
The key to the usability of this architecture is the definition of QuIDs that are
significant and repeatable. As a relevant example, we are going to build a monitoring
system which can operate in an extensive multimedia network, based on the following
QuIDs, which will be further discussed in chapter 4:
• Packet Loss Effect Prediction (PLEP) [86], described in section 4.2, models the
effect of video packet losses, which depends mainly on the video coding structure
and the position of the packet loss within the stream. The PLEP metric provides a good
estimation of the effect of the loss: macroblocking (with a reasonable estimation
of the area affected and the duration of the artifact) and video freeze.
• Audio packet losses, described in section 4.3. Their effect can be measured by
monitoring loss patterns, since there is a high correlation between the length of
the loss burst and the duration of the resulting distortion (normally silence or
noise) [84].
• Drops of coding quality, discussed in section 4.4. We will assume that the coding
quality that enters the core network is the desired quality (or, alternatively, that it
can be monitored in the encoder side with more suitable mechanisms). However,
this quality can decrease in cases of network congestion or bandwidth drops, when
HTTP adaptive streaming or packet prioritization mechanisms are used [82]. Besides
the switch to a lower-bitrate coded stream, we will consider the drop of non-reference
frames (denting). Both can be measured by monitoring bit and frame rates.
• Service Outages, or interruptions in the continuity of the delivered content, which
are described in section 4.5. They are basically severe versions of the video and
audio packet loss effects, and they can be measured with the same techniques.
These measures cover the most relevant defects which appear in IPTV deployments [13]
and can be easily measured in the bitstream, without needing to decode the video or
audio (only NAL Unit headers and Slice headers beyond the transport layer). All the
measures fulfill the repeatability and significance requirements needed to qualify as
QuIDs, with the possible exception of the bitrate, whose significance is more
questionable. However, for the sake of this analysis, we will consider it
enough to provide quality information to the service provider (which should be able to
easily observe the subjective quality of each of the different bitrates produced by its video
encoders).
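Obtaining the frame type and reference status from the bitstream is indeed inexpensive: for H.264, reading the one-byte NAL unit header is enough, with no slice decoding involved. The sketch below follows the standard NAL header layout (forbidden_zero_bit, nal_ref_idc, nal_unit_type, per ITU-T H.264); the returned labels are illustrative, not part of the standard.

```python
def classify_nal(header_byte):
    """Classify an H.264 NAL unit from its one-byte header.

    Layout (ITU-T H.264): forbidden_zero_bit (1 bit) | nal_ref_idc (2 bits)
    | nal_unit_type (5 bits). No slice-payload decoding is needed.
    """
    nal_ref_idc = (header_byte >> 5) & 0x3
    nal_unit_type = header_byte & 0x1F
    if nal_unit_type == 5:   # coded slice of an IDR picture
        return "IDR frame"
    if nal_unit_type == 1:   # coded slice of a non-IDR picture
        return "reference frame" if nal_ref_idc > 0 else "non-reference frame"
    return "non-VCL"

# 0x65: ref_idc=3, type=5 (IDR); 0x41: ref_idc=2, type=1; 0x01: ref_idc=0, type=1
print([classify_nal(b) for b in (0x65, 0x41, 0x01)])
```

A measure of this kind runs at a cost comparable to simple packet counting, which is what makes the bottom-up approach viable at network scale.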
Before continuing with the discussion, it is important to point out that we are using a
bottom-up approach to build the QuID measures. We start from the information that
can be obtained by tracking audio and video headers in the bitstream, as well as the
main properties of the stream itself (bitrate and frame rate). Then we provide some
simple measures that can offer information about impairments at a computing cost
similar to that of the PLR/PLP metrics.
The key is finding out whether these QuID measures can provide relevant information
about the QoE of the received stream. To validate this point, we have designed a
methodology for subjective quality assessment tests, which will be discussed in section
3.4.
3.3.4 Severity Transfer Function
Our proposed way to build the Severity Transfer Function is to use subjective quality
assessments which evaluate the effect of the different QuIDs under consideration. In any
case, due to the significance property of the QuIDs, STFs can be established by the
service provider (or network operator) according to its own severity criteria. This way,
the relative severity of “screen freeze” events versus “blockiness” events, for instance,
can be modified by the service provider by tuning the STF blocks, and without needing
to modify the QuIDs. The subjective quality assessment tests proposed in section 3.4
can also cover this point, as they provide a way to design and calibrate STFs.
3.4 A Subjective Assessment methodology to calibrate Quality Impairment Detectors
We have included some subjective tests to assess the validity of the approach and to
calibrate the results (and design a first level of STFs). A new test methodology,
described in the following subsections, has been designed to adapt the tests to their
purpose. The methodology has also been put into practice to assess the impact of
the defects that are being monitored by our QuEM proposal. A description of those
tests can be found in the Appendix A.2.
3.4.1 Design principles
The objective pursued with these subjective assessment tests is twofold. On the one
hand, the tests should validate the selected QuID measures. Each of the impairments
under consideration has a characterization in the monitoring architecture in a way that
can be measured with precision and repeatability. The aim of the tests is to validate that
those characterizations are good enough to provide information with sufficient indepen-
dence from the context or, at least, to know to what extent these characterizations are
usable without knowing the context. This way, if a QuID provides, for instance, an esti-
mation of “screen freezes” and their duration, different realizations of the event detected as
“screen freeze for 500 ms” should have similar evaluation results among them, and be
differentiable from events detected as “screen freeze for 5 seconds”. On the other hand,
the tests can also be used to establish some severity transfer function from the QuID
outputs to a severity scale.
With this in mind, the tests should evaluate the effect of the same impairments detected
by the QuIDs, using evaluation periods similar to the pooling windows of the QuEM
architecture. Moreover, since the aim of the QuEM system is precisely to estimate
the effect of network impairments on real users of the system (and in real time), it
is desirable that the tests respect real domestic viewing conditions as far as possible.
This will allow mimicking the audiovisual experience of an end user watching multimedia
services, and evaluating the QuID elements under conditions matching as much as
possible their final operating conditions, to obtain meaningful results.
This fact is especially relevant in the current work compared to other subjective
assessment scenarios, and makes the most common approaches for subjective quality
evaluation unsuitable. The main reason is that methodologies should be designed
around the specific aspects of the pursued study. Therefore, in the present case, to
respect real viewing conditions, many aspects of the test should be adapted, such as:
• The test material should be similar to that usually watched by people in their
homes, e.g. movies, sports, news, etc. In addition, the sequences should be long
enough to hold the attention of the observers. This way, as happens at home,
the viewers will be interested in the content and not only focused on detecting the
impairments.
• The equipment used in the tests should be similar to that used in domestic envi-
ronments; in particular, the TV sets should be consumer products.
• The sequences should be shown to the observers following a single stimulus proce-
dure, which means that no unimpaired reference is presented to them to compare
with the test video. This makes the test similar to home environment scenarios,
where there is no explicit reference.
• The evaluation should be carried out in a nearly continuous way, since the effects
of transmission errors are highly dependent on the instant when they occur, and
they are not stationary.
These aspects make most of the international standard methodologies inappropriate
(e.g., the procedures proposed in ITU-R BT.500 [42] or ITU-T P.910 [53]), since they
were designed to evaluate the performance of video coding algorithms and, in many
cases, some of their conditions distance the observers from real viewing situations.
Nevertheless, these standard recommendations have been considered in the design of
the novel methodology that is proposed to evaluate the impact of typical transmission
artifacts, so that the results are more easily comparable with those coming from other
sources.
3.4.2 Test methodology
Our main objective is to mimic home viewing conditions. Therefore, the proposed
methodology is based on standard single stimulus methods, such as those recommended
by the ITU [42] and the Absolute Category Rating (ACR) [112]. These methods do not
have an explicit reference to compare with the content to be evaluated. This situation
is similar to home environments where people watch video sequences.
However, these assessment methodologies limit the maximum duration of the video
sequences, usually to 10 seconds, to allow silence periods for voting, during which a
fixed grey background is displayed. Figure 3.4 shows the structure of a test sequence
according to the standard evaluation methodologies.
To allow for a QoE assessment closer to a real-life situation, we have considered a new
evaluation scenario where subjects view long test video sequences, so they are immersed
in the watching experience. As we are interested in the evaluation of different types of
impairments within this continuous stream, we have divided the whole sequence into
segments. Then, the impairments under study can only be inserted in the first half of
each segment, while the second half remains undistorted. Therefore, while this second
half is being displayed, observers can carry out the evaluation of the distortion introduced
in the first half.
To indicate to the observers when and which segment they have to evaluate, the second
half of each segment displays a number in the right-bottom corner of the screen. During
Figure 3.4: Diagram of the structure of the test sequences in ACR
Figure 3.5: Diagram of the structure of the test sequences in our proposed method
SCALE                          1   2   ...
Imperceptible
Perceptible but not annoying   X
Slightly annoying
Annoying                           X
Very annoying

Figure 3.6: Questionnaire for subjective assessment tests
these periods, the observers can avert their eyes if needed to look at the questionnaires,
without affecting the result of the evaluation. In addition, a first segment is used to
indicate to the observers the beginning of the test and provide a coding quality reference,
thus it is also left unimpaired and marked with a zero. Therefore, the structure of the test
sequence is as depicted in Figure 3.5. This methodology better simulates real viewing
situations, and therefore allows, in contrast to ACR, a nearly continuous evaluation of
the quality of the sequence without losing the continuity of the video.
For simplicity, the observers provide their ratings using a questionnaire; however, other
methods could be investigated. The evaluations are done according to the five-grade
impairment scale proposed in [42]. Thus, the questionnaire contains boxes where the
subjects have to write a cross in the one corresponding to the evaluated segment and its
score, as depicted in Figure 3.6.
3.4.3 Selection of impairments
The impairments introduced in the video sequence are selected among the effects mea-
sured by the QuIDs under study. With the aim of controlling and reducing the possible
limitations of using continuous long sequences, such as content dependency and context
effect, N versions of the same original sequences are created by introducing different
impairments in the same time segments. These versions are called variants, and in each
of them, for the same value of i, the impairments introduced in the segment Ti will be
all corresponding to the same QuID.
This kind of distribution allows the parallel evaluation of controlled combinations of
degradations; those concerning the same segment are defined as an “impairment set”.
Each “impairment set” is then made of N different “intensities” of the same QuID to be
evaluated in parallel. For instance, for a QuID detecting “video screen freezing for x
seconds”, the “impairment set” would be made of N different values of x. The “impairment
set” may, but need not, also include hidden references.
This way, the structure of the content streams in the test session has the aspect depicted
in Figure 3.7. Each row (A, B, C, D) represents a different variant of the same original
sequence, each one divided into aligned segments (T1, T2, ...). The colored sections in the
segments are the halves where the impairments are present, while the white-background
sections are the evaluation halves (when the segment number is shown on the screen).
The segments in the same position (e.g. the T1 segments in all the variants) contain
different impairments from the same “impairment set”. Once the number of segments
and impairment sets to be tested has been selected, the position of each impairment
set in the sequence is selected randomly. For each segment position Ti, each of the
impairments in the “impairment set” is also randomly assigned to one of the variants.
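The random placement of impairment sets into segment positions, and of intensities into variants, can be sketched as follows. The data structures are a hypothetical illustration (not the actual test scripts); the generator is seeded so that a given plan is reproducible.

```python
import random

def build_test_plan(impairment_sets, n_variants, seed=0):
    """Randomly place each impairment set in a segment position, then randomly
    assign its N intensities to the N variants (one intensity per variant).

    impairment_sets: list of lists, each inner list holding the N intensities
    of one QuID (e.g. freeze durations). Returns plan[variant][segment].
    """
    rng = random.Random(seed)
    positions = list(range(len(impairment_sets)))
    rng.shuffle(positions)  # segment position assigned to each impairment set
    plan = [[None] * len(impairment_sets) for _ in range(n_variants)]
    for iset, pos in zip(impairment_sets, positions):
        intensities = list(iset)
        rng.shuffle(intensities)  # which variant gets which intensity
        for variant in range(n_variants):
            plan[variant][pos] = intensities[variant]
    return plan

sets = [["freeze 0.5s", "freeze 1s", "freeze 2s", "freeze 5s"],
        ["blur 10%", "blur 25%", "blur 50%", "blur 100%"]]
plan = build_test_plan(sets, n_variants=4)
# Each column (segment position) now holds the 4 intensities of one set,
# one per variant, so they are evaluated in parallel across sessions.
```

With this construction, every observer sees each content segment only once, while each intensity of each impairment set is still rated by some session.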
In the experiments that we have performed using this methodology, N = 4 variants
were selected. Each variant was assigned to a different viewing session. This way, the
evaluators only view each content asset once, which is in line with the intention of
simulating as much as possible home viewing conditions. Each “impairment set” was
introduced at least three times in each of the sequences under evaluation, to be able to
have a relevant number of measures, as well as to take the context and content effects into
account. A detailed description of the assessment tests can be found in the Appendix
A.2.
Figure 3.7: Structure of the content streams in the subjective assessment test session
3.5 QoE enablers
The previous architecture can be enhanced by adding QoE enablers: specific features
that can help simplify the management of QoE in the service. In this section we describe
three of them: a headend architecture proposal to integrate synchronized metadata, an
intelligent way to build RTP packets, and a network element to manage QoE between
the PoP and the user terminal.
Although they are described only briefly in this section, all of them have evolved
to the point of becoming parts of commercial products which are currently available, or
are on the roadmap to become available in the upcoming months.
In the rest of this work, it will be assumed that those elements exist or, at least, could
be added to the deployment when required. This is not a very restrictive assumption:
since the quality monitoring strategies described are targeted to Service Providers that
want to improve the QoE offered by their service, it is reasonable to suppose that they
may decide to include QoE-enhancing elements such as the ones described.
3.5.1 Headend metadata architecture
Introducing metadata synchronized with the media stream can be necessary for a number
of purposes [9]. The most obvious one is the possibility to use Reduced Reference quality
Figure 3.8: Schematic representation of a modular headend
algorithms. But, in general, any preprocessing that can be done in the headend will be
more efficient there than in any other place further. A first QoE enabler is having a
headend architecture which allows the introduction and synchronization of metadata,
enhancing the interoperability of different headend elements, as we propose in [87].
The proposed architecture is modular and based upon a combination of components
fulfilling different functions. To avoid duplicating the same functions several times, it
is necessary that the results of each of the processing steps can be reused by the following
components. All the information generated in each step, which can be considered as
metadata (data about the data), is propagated along the chain, so that it can be used
in further processing components [99], as shown in figure 3.8.
The key point here is that each of the components is homogeneous in terms of interfacing,
so that both the management of the headend and the integration of new elements get
simplified. All the processing components share a common time reference and exchange
a set of metadata describing the content. This architecture resembles that of software
multimedia frameworks, such as GStreamer or DirectShow, but applied to a distributed
scenario. All the meta-information available at each point of the processing chain is kept
untouched at the output, and the additional information generated in that processing
step is added as well. This way, all the stream analysis done in the different processing
components can be reused by the others by just not breaking the metadata chain. That
would allow, for example, having access to the Access Unit structure of the stream even
after it has been scrambled (if the scrambling module does not filter out AU metadata).
Synchronization is possible by keeping a reference to the clock of the original stream:
all the components shall keep the same time base, so that parallel processing can be
resynchronized afterwards. Each block of video data shall include a Transport Time
Stamp (TTS) as part of its metadata, representing the time stamp (using original clock
reference) where the block starts. Metadata shall always have a TTS reference, and can
be sent in-band or out-of-band. In this context, in-band means that, together with the
multimedia stream, they form a valid MPEG2 TS or ISO file. In such case, however,
they shall be correctly signaled as private data within the resulting stream, so that they
do not disturb the multimedia decoding. In both cases they will have the same interface
strategy (e.g. push or pull) as the multimedia stream.
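A minimal sketch of this metadata chaining, with a hypothetical MediaBlock record standing in for a block of stream data (real components exchange MPEG-2 TS or ISO file data, not Python objects), may clarify the idea:

```python
from dataclasses import dataclass, field

@dataclass
class MediaBlock:
    """A block of media data flowing through the headend chain (hypothetical).

    tts is the Transport Time Stamp, expressed on the original stream clock,
    so parallel processing branches can be resynchronized afterwards.
    """
    tts: int
    payload: bytes
    metadata: list = field(default_factory=list)  # accumulated, never dropped

def analyze_au(block):
    """Example analysis step: annotate the block and pass metadata through."""
    block.metadata.append({"tts": block.tts, "type": "access_unit",
                           "size": len(block.payload)})
    return block

def scramble(block):
    """A later step (e.g. scrambling) keeps upstream metadata untouched."""
    block.payload = bytes(b ^ 0xFF for b in block.payload)  # toy scrambling
    block.metadata.append({"tts": block.tts, "type": "scrambled"})
    return block

out = scramble(analyze_au(MediaBlock(tts=900000, payload=b"\x00\x01\x02")))
# The Access Unit structure is still visible after scrambling, because the
# scrambling step did not break the metadata chain.
```

The essential property is visible at the end: a downstream component can still read the Access Unit annotation, with its TTS, even though the payload itself has been scrambled.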
A video headend which implements this architecture offers several advantages to the
whole network: the possibility to pre-process the content to help the video analysis in
the edge servers (see section 3.5.3), a global synchronization of all the elements with
respect to the video internal clock signal —which would help synchronize the different
QuIDs throughout the network—, or a support for a metadata stream that can be used
to implement Reduced-Reference metrics, among others.
3.5.2 Intelligent Packet Rewrapper
When MPEG-2 Transport Stream is used as the multiplexing layer, as is the case in IPTV,
video and audio data are separated at TS packet level, but mixed again when several
TS packets are encapsulated in RTP. However, the behavior of the network elements
with respect to QoE could get improved if each RTP packet contained homogeneous
information —allowing, for instance, simple prioritization schemes. This can be achieved
by a specialized headend element: the intelligent packet rewrapper (or, in short, the
rewrapper) [96].
The rewrapper reorders MPEG2-Transport Stream packets and encapsulates them in
RTP packets in such a way that TS packets of the same type (e.g. video elementary
stream packets) are grouped together in the same RTP packet. Besides, RTP packets
are split at frame boundaries, so that an RTP packet never contains information
from two different frames. The elementary streams are further analyzed (deep packet
inspection) in order to include, in an RTP header extension, some information useful
for different applications running further down in the network.
The RTP header generated by the rewrapper, shown in Figure 3.9, follows the syntax
according to RFC 5285 [101] defining, for ID=1, an extension element with the following
semantics:
• B. Frame Begin (set to 1 if a video frame starts in the payload of this RTP packet).
• E. Frame End (set to 1 if a video frame finishes in the payload of this RTP packet).
• ST. Stream Type (0=video, 1=audio, 2=data, 3=reserved).
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X|  CC   |M|     PT      |        sequence number        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           timestamp                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           synchronization source (SSRC) identifier            |
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
|        Profile=0xbede         |           length=1            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| ID=1  | len=2 |B|E|ST |S|r|PRI|FPRI |r|   a   |   b   |   c   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
Figure 3.9: RTP header and extension introduced by the rewrapper processing
• S, r. Reserved
• PRI, FPRI. Priority (coarse and fine). Values used for H.264 video are described
in Table 3.1.
• a: time from the current packet to the transmission time of the RTP packet con-
taining the last piece of the current frame (in 5-millisecond units).
• b: time from the end of current frame to the end of the next frame of the same
priority within the current GOP (in 20-millisecond units).
• c: time from the end of next frame with the same priority in this GOP to the end
of the GOP (in 20-millisecond units).
Table 3.1: Coarse (PRI) and fine (FPRI) priorities used in the RTP header extension
when the main video stream is H.264

PRI   FPRI   Decimal   Meaning
3     7      31        Video IDR frame
3     0      24        Audio
2     0      16        Reference frame
1     7      15        Non-reference frame
0     4      4         Rest of cases (data, secondary videos, etc.)
0     1      1         Padding packets
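As an illustration, the extension element could be packed and parsed as follows. The figure does not specify the bit widths of a, b and c, so the 4-bit widths used below (filling the three data bytes implied by len=2 under the RFC 5285 one-byte-header scheme) are an assumption of this sketch.

```python
def pack_ext(B, E, ST, PRI, FPRI, a, b, c):
    """Pack the rewrapper RTP extension element (RFC 5285 one-byte header).

    Assumed layout: byte0 = ID|len; byte1 = B|E|ST(2)|S|r|PRI(2);
    byte2 = FPRI(3)|r|a(4); byte3 = b(4)|c(4). S and reserved bits left as 0.
    """
    byte0 = (1 << 4) | 2                           # ID=1, len=2 (3 data bytes)
    byte1 = (B << 7) | (E << 6) | (ST << 4) | PRI
    byte2 = (FPRI << 5) | (a & 0xF)
    byte3 = ((b & 0xF) << 4) | (c & 0xF)
    return bytes((byte0, byte1, byte2, byte3))

def unpack_ext(data):
    """Recover the fields; a network element needs only these four bytes."""
    _, b1, b2, b3 = data
    return {"B": b1 >> 7, "E": (b1 >> 6) & 1, "ST": (b1 >> 4) & 0x3,
            "PRI": b1 & 0x3, "FPRI": b2 >> 5,
            "a": b2 & 0xF, "b": b3 >> 4, "c": b3 & 0xF}

# Start of an IDR frame: B=1, E=0, ST=0 (video), PRI=3, FPRI=7 (cf. Table 3.1)
ext = pack_ext(1, 0, 0, 3, 7, a=2, b=5, c=9)
```

A prioritization element in the network can thus decide, per IP packet, whether it carries an IDR frame or padding, without reassembling the TS payload.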
The use of the rewrapper enables the development of value-added services over an IPTV
deployment. The rewrapper allows the different types of elements in the coded
stream to be easily identified and isolated at RTP level. Since there is one RTP packet
per UDP packet, and one UDP packet per IP packet, this means that each
IP packet contains only one class of information, which can be retrieved just by
reading its RTP header. Therefore it is possible to implement video-related functionality
which does not require rebuilding the IP packets or, in other words, which operates with a
level of performance and scalability similar to that achieved by IP routers. In fact, some of the
applications proposed in this thesis use a rewrapper as part of their implementation,
such as the Unequal Error Protection algorithm or the Fast Channel Change solution
(see sections 5.2 and 5.5 respectively).
3.5.3 Edge Servers for IPTV and OTT
Even though enhancing the capabilities of the video headend can help improve the final
Quality of Experience, the QoE is, in the end, something experienced by individual
users. Besides, the parts of the distribution chain with the highest error probability are the
access network and the home domain. For those reasons, to be able to enhance the
perceived QoE it would be necessary to add new systems at PoP level, together with
possible modifications in the user terminal side.
For the case of IPTV streams, a video management server (generically called “video
services appliance”, VSA) can be located in the PoP, in parallel with the main video
traffic flow and receiving it as well. User terminals can establish individual sessions with
the VSA, in order to request some QoE related services such as:
• Retransmission of lost RTP packets, as defined by RFC 4585 [77].
• Unicast delivery of personalized streams (for instance, to accelerate the channel
change time, as proposed in RFC 6285 [103]).
• Collection of quality measures, as the ones proposed in RFC 3611 [22].
These kinds of services have been standardized as part of DVB-IPTV (the first two
as DVB-RET and DVB-FCC, respectively), and the most relevant IPTV technology
providers offer different solutions around this concept.
A similar concept can be applied to CDNs. The proposed idea is to modify dynamically
the properties of the stream at the edge, in specialized servers (“Tailoring Servers”)
placed at the same level as the CDN delivery servers, or even closer to the end
terminals [108]. These Tailoring Servers will have access to the CDN, will retrieve
from it all available media (segments and manifests), and will be able to process them and
offer as a result the same media, with some added-value functionality, to the end devices,
using the same HAS API. From the end user perspective, the network will be providing a
much better quality of service, and the service provider (the one operating the Tailoring
Servers) will achieve it without any modifications in the headend or additional load in
the CDN core network. In this case, it is essential that the Tailoring Server operates in a
fully transparent way, since in OTT environments the service provider frequently does
not have any control over the user terminal.
Both concepts will be referred to generically as Edge Servers from this point onwards.
3.6 Conclusions
In this chapter we have proposed a reference architecture for multimedia delivery services
over IP. This reference architecture provides a homogeneous view of the most relevant
scenarios: IPTV and OTT, both for live and on-demand contents, and including quality
monitoring points as well.
We have also introduced the QuEM quality monitoring framework, which is applicable
to almost the same scenarios as PLR/PLP systems, but offers a more detailed
analysis. Specifically, the basis of this approach has been set up with the objective of
developing a system that is able to characterize what is happening in the network, and
is easy enough to implement, integrate, and deploy in real video delivery systems.
Moreover, the proposed approach and the metrics that compose the monitoring architec-
ture have been validated by means of subjective assessment tests, analyzing the effects of
several transmission impairments on the QoE of the observers, and the relations among
those degradations. Those studies are also useful to calibrate the measurement elements
of the architecture to obtain reliable estimations of the impact of the distortions on the
perceived quality.
Finally, we have described some enablers: network elements that facilitate the imple-
mentation of QoE functionality in the delivery network.
In the next chapters we will fill this framework with information. In chapter 4 we will
describe metrics to monitor the most relevant impairments using rich transport data.
Those metrics will comply with the requirements established in the QuEM framework,
and will be validated using the proposed subjective assessment methodology. In chapter
5 we will use the knowledge obtained in the generation of metrics to propose new
value-added applications in the context of multimedia QoE. The implementation of these
applications will also rely on the presence of some of the QoE enablers that we have
described in this chapter.
Chapter 4
Quality Impairment Detectors
4.1 Introduction
This chapter describes the different metrics which are proposed for the monitoring of the
Quality of Experience in multimedia delivery services. Using the terminology defined in
the previous chapter, they are the Quality Impairment Detector (QuID) blocks needed
to build a Qualitative Experience Monitoring (QuEM) system. Each section is devoted to
a different QuID.
The general approach to study each of the QuIDs has been similar. First, the impairment
to be detected is defined and characterized. This implies identifying the
cause of the impairment —and therefore proposing a technique to monitor it—, as well as
understanding its impact on the perceived quality. Afterwards this analysis is completed
with specific subjective quality assessment tests, which use the methodology described
in section 3.4. A common set of subjective tests has been used for this purpose; they
are described in Appendix A.2. In specific sections of this chapter, additional subjective
and objective experiments have been used. They are described in different sections of
Appendix A, and referenced in the appropriate sections of the text when needed.
The metrics described in this chapter are the ones proposed in section 3.3.3. They cover
the most relevant defects described by users [7], and each of them fulfills the requirements
imposed by the QuEM architecture in sections 3.3.1 and 3.3.2 —scalability, significance,
and repeatability.
Section 4.2 describes a video Packet Loss Effect Prediction metric (PLEP). It predicts
how the loss of a video packet can lead to freezing or macroblocking effects, by analyz-
ing the propagation of the error within the video frame, as well as to adjacent frames
throughout the inter-frame prediction reference chain. The results of this metric are
analyzed objectively and subjectively using the test sequences described in Appendix
A.4 and the test set described in Appendix A.2, respectively.
Section 4.3 follows the same structure as 4.2, but analyzes the effect of the loss of audio
packets.
Section 4.4 analyzes the media coding quality, with two differentiated subsections. First,
in 4.4.1, the video artifacts produced by compression are analyzed with a specific set
of subjective quality assessment tests, described in Appendix A.3. The results of these
tests is used to explore the possibilities to use RR or NR metrics to monitor video coding
artifacts in the context of a QuEM framework. Afterwards, in 4.4.2, a different approach
is presented, to analyze the effect of quality drops produced by strong variations in the
channel effective bandwidth —a typical OTT scenario with HTTP Adaptive Streaming.
In this case, two main alternatives are compared: switching to a version with different
bitrate and dropping frames. Their effects are analyzed with the subjective assessment
tests of Appendix A.2.
Section 4.5 describes outage events, understood as the total loss of video, audio, or both
signals for a period of time. Techniques to measure outage are described, as well as its
subjective effect according to the tests described in Appendix A.2.
Section 4.6 analyzes latency-related issues: lag and channel change time. This type of
analysis is sometimes excluded in the discussion of QoE, but it has been included in
this chapter for two reasons. On the one hand, lag and channel change are relevant
only in some specific scenarios; but these scenarios may have a great impact on the overall
perceived quality of the multimedia delivery service —live delivery of sports events is the
most typical case. On the other hand, there is a design trade-off between latency and
other quality factors, such as video coding quality or packet loss probability. Acknowl-
edging this relationship is relevant when considering the whole QoE of our services.
Section 4.7 describes the relationship, in terms of perceived quality, between the different
impairments that have been studied.
Finally, section 4.8 summarizes the main conclusions obtained in the whole chapter.
4.2 Video Packet Loss Effect Prediction (PLEP) model
Packet losses are the main cause of errors in multimedia services and, more specifically,
in IPTV. The loss of video packets can cause macroblocking and image freezing, which
are about half of the QoE impairments reported by customers in a field deployment [7].
For this reason, packet losses are a relevant QoS issue to monitor in IPTV networks. In
existing deployments, it is typical to use pure QoS metrics, such as the Media Delivery
Index (MDI), to monitor them [67]. On the one hand, MDI is a useful metric to estimate
QoE because, in the long term and for random losses, the packet loss rate correlates
reasonably well with the Mean Square Error which, in this scenario, can be a reasonably
good predictor of the perceived quality [40, 95]. On the other hand, in most cases there
is simply no other metric which can be applicable in the context of real-time service
monitoring, either because they need information that is not available at the monitoring
point, or because they are too costly to be applied.
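For reference, the loss component of MDI (the Media Loss Rate) can be derived from RTP sequence numbers alone. The following is a minimal sketch, not the MDI reference implementation; the function name is ours and the wrap-around handling is deliberately simplistic:

```python
def media_loss_rate(seq_numbers, interval_s):
    """MLR component of the Media Delivery Index: lost packets per
    second, derived only from the 16-bit RTP sequence numbers observed
    during a measurement interval (a single wrap-around is supported)."""
    expected = (seq_numbers[-1] - seq_numbers[0] + 1) % 65536
    lost = max(expected - len(seq_numbers), 0)
    return lost / interval_s

# A 10 s interval in which packets #100, #101 and #500 were dropped:
seqs = [n for n in range(1000) if n not in (100, 101, 500)]
print(media_loss_rate(seqs, 10.0))  # 0.3
```

Note that this value, like MDI itself, carries no information about *where* in the stream the losses fell, which is precisely the limitation addressed below.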
However, other approaches are possible. If we have access to rich transport data, such
as the information provided by the rewrapper described in section 3.5.2, we can take
into account the structure of the video stream to improve the prediction of the effect of
losing some packets, instead of applying the sort of flat rate used by MDI.
Another important fact to consider is that the network QoS provided for IPTV should
be good enough to make it difficult to assume that “MSE correlates with PLR”. Besides,
QoS-management decisions are taken in the short term (some dozens of packets or so;
otherwise delay is too high). Therefore we need to analyze the short-term effect of
isolated packet losses in order to improve quality management in IPTV.
We will focus in this section on the analysis of packet-loss effect in the short term.
We will build a model to predict the effect of packet losses in video, based on the
information available at transport level in a real deployment. In particular, we will
analyze the transport information (RTP and MPEG-2 Transport Stream), as well as
the network abstraction layer of H.264: NAL Unit Headers and Slice Headers. We will
not analyze deeper than Slice Header in any case: firstly because, when any scrambling
is applied (even partial), some parts of the slice are always unavailable; and secondly
because it would require decoding the CABAC entropy coding, which would increase
the computation cost of the monitoring tool excessively for practical applications, thus
violating the scalability requirements imposed on QuIDs —see section 3.3.1.
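As an illustration of the (shallow) depth of parsing involved, the one-byte H.264 NAL unit header can be decoded with simple bit operations; the field layout follows the H.264 specification, while the function name is ours:

```python
def parse_nal_header(byte0):
    """Parse the one-byte H.264 NAL unit header: nal_ref_idc signals
    whether the NAL unit is used as a reference, and nal_unit_type
    distinguishes IDR slices (5) from non-IDR slices (1), parameter
    sets, SEI messages, etc."""
    forbidden_zero_bit = (byte0 >> 7) & 0x1
    nal_ref_idc = (byte0 >> 5) & 0x3
    nal_unit_type = byte0 & 0x1F
    return forbidden_zero_bit, nal_ref_idc, nal_unit_type

# 0x65 = 0110 0101: a reference NAL unit carrying an IDR slice:
print(parse_nal_header(0x65))  # (0, 3, 5)
```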
The analysis has been performed in the context of an IPTV service, where the transport
unit (the minimum block that can get lost) is the RTP packet. It has also been assumed
that, to simplify the network processing, the MPEG-2 TS has been packaged into RTP
using a rewrapper. However, the model can be easily extended to other multimedia
delivery scenarios, just by adjusting the size and nature of the packets that can get lost.
4.2.1 Description of the model
We need a packet loss effect prediction (PLEP) model which is based on the analysis
of rich transport data, provides meaningful information to the operator using it, and is
as general as possible. To comply with these requirements, we propose a metric which
estimates the fraction of each of the frames which is affected by artifacts coming from
packet losses. Therefore a frame with a degradation value of, e.g., 50 percent, will have
half of its surface affected by artifacts.
The main advantage of this approach is that it focuses on the structure of the error in
the image, i.e. on the most direct impact of the packet loss, which is the absence of
correct information in parts of the image for some time. This metric does not depend
on the statistics of the image itself, and it is therefore usable in environments where
the picture intensity values are not available. Besides, it provides an easy qualitative
description of the impairment, which makes it suitable for our QuEM architecture.
Our solution encompasses two steps which are applied iteratively: we first compute
the degradation value in one frame, and then estimate the error propagation to the
neighboring pictures. The model only makes use of information which is available in the
slice header of H.264 slices: the slice type and reference picture buffer indexes. No data
is obtained from either the original (unimpaired) stream or from the decoded video.
4.2.1.1 Degradation Value
The first component of the impairment is the error generated in the frame where the
packet loss occurs. In an IPTV environment, video frames will typically be transported
over several transport packets (RTP in our case). For that reason, a loss in one of the
packets does not necessarily mean the loss of the whole frame. In fact, the effect of the
loss of a single packet within the frame can be estimated by considering two well-known
properties of the H.264 coding:
• The information of macroblocks within a picture is transported in scan order (un-
less flexible macroblock ordering is used, which is not the case in Main and High
profiles).
• When there is an error in a NAL Unit, decoders usually cannot resynchronize video
decoding until the beginning of the next NAL Unit.
We measure the degradation value on a scale of 0 to 100, where 0 represents that an
image has been received without errors, and 100 indicates that it is completely impaired.
The metric will estimate the percentage of image which is affected by the error:
E0 = 100% · (1/N) Σ_{S=0..N−1} [1 − f(L(S)/Lavg)]     (4.1)
where S represents each slice, N is the number of slices per frame, and L(S) represents
the length in bytes of the fragment of the slice which is not lost. It is assumed that
the rest of the slice is lost the moment an error is produced. Similarly, as macroblock
information is sequentially introduced in a slice (i.e., one macroblock after another), it
is reasonable to assume that the larger the portion of the slice is affected, the larger the
region of image is impaired. Lavg is an estimation of the length of the slice if there had
been no losses. Depending on the size of the loss and the video transport layer, it may
be estimated with higher or lower accuracy. In any case, it is always possible to assume
that the slice byte size will be similar to a sliding average of the sizes of the K previous
slices of the same type (I, P, B) and their position in the image. f is a function which
must be monotonically increasing. We will select the identity function saturated to the
value “1”, so that no slice can contribute to more than 100 percent of its size.
The equation assumes that all slices in the image have the same size (in pixels). Other-
wise, values should be weighted by their relative surface in the whole image.
4.2.1.2 Error Propagation
Most of the pictures in an H.264 video sequence use other pictures as references in their
decoding process. This technique, needed to encode the stream with a reasonably low
bit rate, causes errors in one frame to propagate to all frames which make reference to it.
If those frames, in turn, serve as references for others, the impairment would propagate
even more along the reference chain. Therefore a picture with no losses can also have
artifacts which have been propagated from its reference frames.
We compute this propagated error Ep from the value E of each of the frames which are
used as a reference by the picture under study. Given a picture x, depending on a set
of references {yk}, propagated error will be:
Ep = γ Σ_k ωk E(yk)     (4.2)
where E(yk) is the error level in the frame yk. This error can be the result of a packet loss
in that frame (E0) or a propagated error itself (Ep), and the values of ωk and γ
model how to estimate the fraction of affected pixels in the predicted picture.
The constant γ represents the attenuation of the error effect along the reference chain.
In a typical coding scenario in H.264, instantaneous decoding refresh (IDR) pictures are
introduced periodically (every few seconds, at most). Therefore, regardless of the value of
γ, the error will only propagate until the next IDR frame in the worst case (which is with
γ = 1). However, this assumption does not hold for long IDR repeat periods, or for cases
where I frames are not IDRs and there can be references beyond GOP boundaries1. For
such reason γ < 1 is recommended (for instance, γ = 0.9).
Factors ωk represent the weight of the different pictures which contribute as reference to
the picture under study. We use a model where higher error levels have a higher weight,
as they propagate in a more perceptible way:
ωk = E(yk) / Σ_k E(yk)     (4.3)
This allows us to write:
Ep = γ [Σ_k E²(yk)] / [Σ_k E(yk)]     (4.4)
4.2.1.3 Error Composition
Finally, it is possible that one picture suffers from a packet loss and also that its reference
pictures have errors as well. In this situation, both error contributions must be combined.
In the best scenario, both contributions will overlap and the total error level will be the
maximum:
Ebc = max {E0, Ep} (4.5)
In the worst case, contributions will be independent and the error will be the sum:
Ewc = min {(E0 + Ep), 100%} (4.6)
Therefore we assume that the error will be somewhere in between:
E = αEbc + (1− α)Ewc with 0 ≤ α ≤ 1 (4.7)
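The composition of Equations 4.5 to 4.7 can be sketched as (function name is ours):

```python
def combined_error(e0, ep, alpha=0.5):
    """E of Eq. 4.7: blend between the best case, where the direct
    and propagated errors fully overlap (Eq. 4.5), and the worst
    case, where they affect disjoint regions (Eq. 4.6)."""
    e_bc = max(e0, ep)             # Eq. 4.5: overlapping contributions
    e_wc = min(e0 + ep, 100.0)     # Eq. 4.6: independent contributions
    return alpha * e_bc + (1 - alpha) * e_wc

print(combined_error(40.0, 30.0))  # 55.0
```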
1In H.264, it is possible to define an I frame which is not an IDR. As I frame, it can be decoded
without needing other frames for prediction. However, unlike an IDR, it allows that subsequent frames
in decoding order use previous frames as references. This can slightly improve the obtained video quality
for a given bitrate constraint, and it is frequently used by IPTV video encoders.
4.2.2 Experiment
To test the PLEP model proposed, it is necessary to design an experiment which focuses
on the effect of where packet losses occur. Instead of generating random error patterns,
we have designed an experiment where packet losses are set deterministically and where
it is possible to observe the effect of changing the loss position in the stream.
The sequences are pre-processed with the rewrapper described in section 3.5.2. This
way, each video frame is transported in an integer number of RTP packets, and so is
each GOP. With the aim of analyzing the effect of different packet losses within the
stream structure, one single GOP is selected to generate packet losses on it.
We apply the following steps, with K taking values from 0 to the number of RTP packets
in the selected GOP:
1. In the selected GOP, the RTP packet in position K is dropped.
2. The PLEP metric is obtained for the resulting sequence.
3. The video sequence is then decoded using the open-source decoder FFmpeg2 (with
default error concealment) and stored on disk without compression.
4. The obtained sequence is compared with the original one (without errors) using
MSE.
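The steps above can be sketched as a loop over the loss position. The decode and comparison stages are represented by callables, since in the actual experiment they are performed by external tools (FFmpeg and a pixel-domain MSE computation) rather than reimplemented; the PLEP computation of step 2 would slot in right after the drop:

```python
def loss_position_sweep(packets, gop, decode, mse_vs_reference):
    """For each position K inside the selected GOP, drop exactly one
    packet, decode the impaired stream and measure its aggregated MSE
    against the error-free decode. `decode` and `mse_vs_reference`
    stand in for the external tooling."""
    results = []
    for k in range(gop.start, gop.stop):
        impaired = packets[:k] + packets[k + 1:]       # step 1: drop packet K
        frames = decode(impaired)                      # step 3: decode (FFmpeg)
        results.append((k, mse_vs_reference(frames)))  # step 4: compare
    return results

# Dummy stand-ins, just to show the data flow:
out = loss_position_sweep(list(range(10)), slice(2, 4),
                          decode=lambda p: p,
                          mse_vs_reference=lambda frames: len(frames))
print(out)  # [(2, 9), (3, 9)]
```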
This experiment was conducted with the sequences A, B, C, and D described in the
Appendix A.4. The following discussion considers sequence A, as it is the one with the
longest GOP (100 frames), and therefore the one producing more test cases.
However, the same process was repeated with sequences B, C, and D, with similar results
—a comparison will be provided later.
Sequence A is encoded in H.264 over MPEG-2 TS at 2.8 Mb/s (with the video stream
at 2.3 Mb/s). Each frame has only one slice, which is the most typical situation for
commercially available video encoders for IPTV. The GOP structure is a hierarchical
“. . . IBBBP. . . ”, such as the one discussed in section 2.4.1 and depicted in Figure 2.4 in
page 31. All I frames are IDR pictures.
The sequence is encapsulated in RTP using the rewrapper. Each GOP occupies about
1000 RTP packets and, in particular, the GOP under study had exactly 958 packets.
Therefore 958 different impaired sequences (each one with the error in a different position
within the GOP) were generated, decoded, and processed.
2http://www.ffmpeg.org
It is worth noting that, due to the rewrapping process, all the losses affected only one
video frame, although the visual impairment will affect more than one frame due to
error propagation in the prediction process.
4.2.2.1 Qualitative Analysis
Before analyzing the results of the measurements, it is interesting to examine the video
itself, to better understand what happens when one packet is lost. We mainly consider
the results in sequence A, since, having a longer GOP, it produces more data in the
one-GOP analysis. Figure 4.1 is used as an example for this analysis, although the ideas
described in this section are applicable to the majority of sequences generated for the
study, including both the other sequences generated from sequence A and those from sequences
B, C and D. Figures 4.1(a), (c), and (e) show an IDR frame where RTP packets #11,
#28, and #29 have been lost, respectively. Figures 4.1(b), (d), and (f) show the next P
frame in display order for the same sequences. Figures 4.1(g) and (h) show the original
unimpaired IDR and P frames, respectively.
In all the measurements, the frame with the highest MSE is the one where the loss
occurred. However, this is not the frame where artifacts are most visible. This is
illustrated in Figure 4.1(a): in the frame where the packet is lost, the MSE is high but
the visibility of the error is low. However, four frames later, in Figure 4.1(b), once the
error has been propagated by inter-frame prediction, it has higher visibility even with a lower
MSE than before. This effect is also produced from Figure 4.1(c) to Figure 4.1(d), and
from Figure 4.1(e) to Figure 4.1(f).
This fact is due to error concealment: when part of the frame is lost, it is simply replaced
by the most recent reference frame available. The visual effect of this replacement is
a frame with a spatial discontinuity (part of the frame is the correct one, part comes from
the previous frame), which is not very disturbing visually. However, when the frame is used for
prediction, the predicted macroblocks will have errors, and the macroblocking effect will
appear.
It is also important to consider that in real situations, error concealment techniques may
not be as predictable as desired. For example, Figure 4.1(c) and Figure 4.1(e) show the
same frame for two different sequences —Figure 4.1(c) with the loss of packet #28, and
Figure 4.1(e) with the loss of packet #29, with both packets affecting the same frame.
In the first instance, the FFmpeg concealment attempts to reuse the last reference frame
to replace the missing portion of the frame, and as a result the error has low
visibility. In the second instance, the lost packet, #29, is directly adjacent to the one
lost in the previous case, #28; here the FFmpeg concealment fails, and the
error has high visibility.

Figure 4.1: Video sequence used for qualitative analysis. The left column shows an IDR frame where one RTP packet is lost, while the right column shows the following P frame. The red line in each frame indicates the position in the image of the first macroblock which got lost. The RTP packets lost are #11 (a,b), #28 (c,d), and #29 (e,f). (g,h) show the original unimpaired IDR and P frames.

These kinds of concealment failures can occur in real decoders,
either software or consumer set-top boxes. Therefore one must be careful when making
a priori assumptions about how impaired frames appear on the user screen.
We also found that the sooner an error is produced within an encoded frame, the higher
the fraction of the decoded frame that is affected. The lines in Figure 4.1 show the position
of the error within the frame. Frames in Figure 4.1(a) and Figure 4.1(b), where the
error was produced in packet #11, have more visible and extensive artifacts than those
between frames Figure 4.1(c) and Figure 4.1(d), where the error was produced in packet
#28. The underlying idea is that once a fragment of the H.264 slice is lost, the rest of
the slice becomes useless to the decoder, which throws it out completely since it is not
trivial to resynchronize CABAC decoding. As there is only one slice per frame, when
an error occurs within a video frame, the rest of the frame is lost.
Finally, we should mention a specific case of interest: when the first video packet in
the GOP is lost, then the whole I frame gets lost as well, including any GOP-level
header (such as Sequence Parameter Set, Picture Parameter Set or SEI messages). As a
result, and with the decoder implementation that we have used, the whole GOP becomes
impossible to decode and the image freezes until the next I frame arrives.
4.2.2.2 Quantitative Results
We have computed the Packet Loss Effect Prediction (PLEP) values for each one of the
sequences under study. As IDRs are used at GOP boundaries, sensitivity to γ is not so
critical. We have taken the default value of γ = 0.9. Since there is only one packet loss,
there is no error composition situation, and therefore the value of α is not relevant.
We selected MSE (aggregated along all the impaired frames) as our method of choice to
measure the impact of the error in the sequence. Although there are other methods which
correlate better to subjective MOS, such as structural similarity index (SSIM) [116],
MSE has been shown to perform better when predicting packet loss visibility [93].
Figure 4.2 shows the MSE for all the sequences (varying the loss position) generated
from sequence A. The grey line shows the aggregated MSE of the whole sequence while
the green line shows the MSE only of the frame where the loss was produced. The red
line shows the MSE obtained by just substituting the frame where the error occurs with
the previous available reference frame (i.e., the concealment error at frame level). And
the blue line shows the result of the PLEP metric. Figure 4.3 shows the same values for
a reduced number of the sequences.
Figure 4.2: Mean Square Error and Packet Loss Effect Prediction metric for all sequences under study, varying the loss position: aggregated MSE (grey), MSE at the frame where the loss occurs (green), concealment error (red), and PLEP (blue).
Figure 4.3: Detail of Mean Square Error and Packet Loss Effect Prediction metric for all sequences under study.
It can be seen that the error has a higher impact at higher levels of the reference hierarchy:
when the error occurs in an I frame or P frame, it generates higher MSE than when it
occurs at a (reference) B frame, which in turn is higher than error generated by losses
in (no-reference) b frames. This is mainly due to the fact that errors in reference frames
propagate, and therefore affect more frames. Error concealment also produces more
visible results in I frames and P frames since the previous reference frame available is
further back in time (four frames distant) than in the case of B frames (two frames
away), or b frames (one frame away).
The analysis also indicates that the error decreases with the position of the loss within the frame. This is due to the
fact that losing a single packet on a slice means losing the rest of the slice completely,
since the decoder is unable to resynchronize the CABAC decoding. Of course this
decrease is not completely monotonic, as the reconstruction of the damaged frame is
not always perfect. Sometimes concealment techniques fail or are just less effective than
expected.
Figure 4.4: Mean Square Error versus Packet Loss Effect Prediction metric (log scale) and linear fit between them (R2 = 0.67).
Figure 4.5: Percentage of macroblocks which are different between both images versus Packet Loss Effect Prediction metric, both in log scale, as well as linear fit (R2 = 0.85).
There is also some tendency for the error to decrease along the GOP, because the earlier the
error occurs in the GOP, the greater the number of frames it affects. However, due to the
fact that there are some scene changes within the GOP, this effect is not very strong.
Figure 4.2 shows that the PLEP model follows the shape of the error and in Figure
4.4 both magnitudes are directly compared. There is reasonably good correlation (R2
= 0.67) between both values, which suggests that the PLEP model is robust enough
to predict packet loss effects. It is worth noting that in this scenario, unlike in other
experiments reported in the literature, there is no correlation between the MSE (which is
variable) and the PLR (which is constant and equal to 1/958 for all the sequences). This
Figure 4.6: Percentage of macroblocks which are different between both images (blue) and Packet Loss Effect Prediction metric (red) for all sequences under study, varying the loss position.
means that our PLEP model is able to explain the effect of packet losses reasonably well,
even in situations where the packet loss ratio does not provide any valuable information.
Results obtained from the other sequences are quite similar qualitatively. Table 4.1
shows the R2 between PLEP and MSE for all video sequences.
Table 4.1: Coefficient of determination (R2) of MSE vs PLEP fit for several video sequences.

Sequence   A     B     C     D
GOP size   100   24    24    12
R2         0.67  0.63  0.74  0.91
With this in mind, it is also important to consider that the PLEP method is more
robust to failures in error concealment than MSE estimation methods. Indeed, error
concealment is quite unpredictable in a real case, and not easy to fit into a predefined
model, as we illustrated previously in Figure 4.3, where the MSE in the frame where
the loss occurred is shown in green, while the MSE in dashed black depicts an instance
when an error occurred and the damaged frame was replaced by the previous frame
available. This suggests that even knowing the MSE produced by replacing one frame
by its predecessor, there is no specific pattern which can easily model MSE in a specific
frame when the loss occurs in the middle of a GOP. However, predicting the “part of the
frame affected” is much more stable, since it does not depend on the error concealment
techniques used. Thus, a metric defined as the ratio (in percent) of macroblocks different
on a pixel-to-pixel basis between both images provides a better approximation than MSE
does for the concept of “part of the frame affected.”
Figures 4.5 and 4.6 show that the PLEP model is indeed a good predictor of the ratio of
macroblocks which differ between the original and the impaired images. Correlation with
the PLEP model increases so that, for the sequence under study, R2 = 0.85.
4.2.3 Subjective analysis
The next step in the analysis is discovering whether the prediction of the fraction of
the image affected by errors can be effectively used to model impairments in the per-
ceived Quality of Experience. With this target, the subjective assessment test session
described in Appendix A.2 included some impairments based on the PLEP model. The
impairments were caused in the same conditions as in the previously discussed objective
experiment: the video is sent by a rewrapper process and only one RTP packet is lost,
and the loss includes data of only one frame. The position of the RTP loss within the
GOP structure is varied to produce different effects.
The different impairment conditions are described in Table 4.2. We will consider the
simplified version of γ = 1, so that we assume that the error is propagated until the
end of the GOP. Impairment N is the hidden reference (no packet loss). Impairment E1
loses the first packet of the first no-reference B frame in the GOP; thus the error does
not propagate to other frames. Impairments E2, E3 and E4 lose one packet in the first
reference P frame of the GOP, so that the error gets propagated along the GOP. To
vary the resulting effect, the packet is lost at the beginning (E4), in the middle (E3) or
at the end (E2) of the frame, which varies the packet loss effect according to what has
been discussed previously. Finally impairment V1 has a special effect, which is losing
the very first packet of the GOP (in the I frame). In this case, as the most relevant
headers for the GOP get lost, the resulting effect is not macroblocking, but the freeze
of the image for the duration of the GOP (until another I frame is received).
Table 4.2: PLEP impairments analyzed in the subjective assessment tests
Code  Frame    % frame affected  Description
N     n/a      n/a               Hidden reference
E1    B (nr)   100               Loss of one frame
E2    P (ref)  25                25% of frame affected during one GOP
E3    P (ref)  50                50% of frame affected during one GOP
E4    P (ref)  95                95% of frame affected during one GOP
V1    I (ref)  100               Video freeze during one GOP
The results obtained from the tests are shown in Figure 4.7, differentiating the three
content sources under study: an action movie (Avatar, in blue), a football match (yellow)
and a documentary (red). The global average value is also displayed, together with its
confidence intervals. The description of the sources, as well as more details about the
tests, can be found in Appendix A.2.
Figure 4.7: Results of the subjective assessment for Video Loss impairments.
As a first conclusion, the results suggest that the PLEP metric is applicable to the
characterization of video packet losses, as they confirm that the position of the error
within the GOP structure significantly affects the quality perceived by the end user.
This conclusion has to be taken with some degree of caution, because there is variability
in the results, especially from one content source to another. However, it is clear that
the PLEP model outperforms the simple packet loss rate metrics. More specifically,
losing one single frame (without propagation) or a small part of the frame (even with
propagation along the GOP) is, in general, either not perceived or perceived as not
annoying, and statistically indistinguishable from the hidden reference. Beyond that,
the bigger the fraction of the frame affected, the higher the severity. Finally, freezing
the video for the whole GOP has a more severe impact on quality than the macroblocking
effect.
The errors E2, E3, E4 and V1 belong to the same “impairment set”, as defined in section
3.4.3. That means that they are evaluated in parallel over the same segments. Figure
4.8 shows the detailed results for each of the segments of this “impairment set” for
the three sequences under study. Most of the segments follow the same pattern as the
general results, and it is also possible to see that the “inter-segment” variability for the
same error event is lower than the “intra-segment” variability for the different errors
applied to each segment. The segments labeled as “Doc-10” and “Avatar-20” —from
the documentary and the movie sequences, respectively— may be considered outliers,
and they share the property of having a low MOS for the less perceptible error (E2).
This suggests that in both cases the “delivery quality” of the unimpaired version of those
segments might be lower than expected, and maybe a characterization of the properties
of the video in the headend could lead to an RR metric that improved the performance
of PLEP.
Figure 4.8: Detailed results for each of the individual segments for Video Loss.
4.3 Audio packet loss effect
When packets containing audio information get lost, there is also an impairment in the
perceived quality: either a temporary interruption in the reproduced sound or a distortion
(glitch or noisy sound). Audio distortions are less frequent than video artifacts or, at
least, less frequently perceived by end users [7]. However, they are still common enough
for any monitoring system to consider them, especially if we take into account that they
are as unacceptable as video artifacts [57]. It is also relevant to consider that, as audio
streams normally have a very stable bitrate, they normally require a relatively small
buffer in the receiver (around 50 ms, compared with the 500-2000 ms typical for video
streams). As a consequence, audio packets are much more sensitive to delay variation
than video packets; and high values of jitter will easily increase the losses in the audio
stream.
In this section we will study the effects of those packet losses, both objectively and
subjectively. We will take as baseline scenario an IPTV channel over MPEG-2 Transport
Stream. To simplify the analysis, we will assume that the stream has been encapsulated
into RTP packets by a rewrapper. This way, a packet loss at RTP level will impair either
audio or video signals, but not both simultaneously.
4.3.1 Objective analysis
Audio coding formats used in multimedia systems normally use block coding: they take
a time window of the audio waveform, divide it into spectrum sub-bands and code each
sub-band according to spectral masking criteria (obtained from a psychophysical model
of the human hearing system), aimed at maximizing the perceived quality for a target
bit rate. There is some overlap between adjacent windows, but no long-term prediction
or complex prediction structures. All the audio codecs considered in our IPTV and OTT
scenarios (MPEG-1 layer 2, MPEG-4 AAC, and Dolby AC3) have this kind of design.

Figure 4.9: Waveform of a lossy audio file
With this, the impairment produced by the loss of one audio RTP packet will affect
only the time window to which this packet belongs. Therefore we can make the
hypothesis that the impairment will be a silence whose length is proportional to the
length of the packet loss burst. This, which is exact for uncompressed audio (PCM),
will be a sufficiently good approximation for compressed audio as well.
Figure 4.9 shows the waveform obtained after decoding an audio file with losses. It is
the audio stream of the sequence A described in Appendix A.4, encoded in MPEG-1
layer 2 at 192 kbps. 70 TS-packet losses (around 550 ms) were introduced every 1000 TS
packets (7.8 s). Silence intervals are clearly visible in the waveform, and their duration
is effectively around 0.5 seconds each.
In some cases, signal peaks can be observed next to the silence intervals. They are
perceived as glitches or audio discontinuities, and they may also appear in the event of
packet losses. In principle, and for the sake of the analysis of the losses, we will consider
only the silences as the base impairment, since they cannot be distinguished from the
glitches just by the analysis of the lost packets.
Another 2 minute cut of the aforementioned sequence A (with MPEG-1 layer 2 audio
at 192 kbps) has been taken to introduce audio packet losses, varying the number of
Figure 4.10: Effect of audio losses: measured vs. expected (R2 = 0.98)
consecutive packets lost (the loss burst). The expected duration of each TS packet loss
would be:

(188 × 8) / 192000 = 7.8 × 10^-3 s    (4.8)
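As a sanity check, the relation in equation (4.8) generalizes directly to a burst of N lost packets. A minimal sketch (function and parameter names are ours, for illustration only):

```python
def expected_silence_s(n_lost_ts_packets, audio_bitrate_bps=192_000,
                       ts_packet_bytes=188):
    """Expected silence caused by a burst of lost audio TS packets.

    Each 188-byte TS packet carries 188 * 8 bits of the audio elementary
    stream, so at 192 kbps one packet holds about 7.8 ms of audio.
    """
    return n_lost_ts_packets * ts_packet_bytes * 8 / audio_bitrate_bps

# One packet is ~7.8 ms of audio; the 70-packet bursts used above are ~0.55 s.
single = expected_silence_s(1)
burst = expected_silence_s(70)
```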
Afterwards, the resulting stream has been decoded by a software decoder and the length
of the silences has been determined. The result is shown in Figure 4.10. Blue points
show the length of the silence events (Y axis) as a function of the number of packet losses
(expressed in seconds, X axis). Most of the silence events have a length which is similar
to the expected one (although there is a small fraction of outliers, which represent the
short silence periods just after or before a glitch effect). Once the outliers have been
removed, the data fitting to a regression line (in red) allows us to determine the validity
of the approach. The line has a slope of 1.05 and an ordinate at the origin of 0.18, with
a determination coefficient R2 = 0.98.
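The regression described above can be reproduced with a standard least-squares fit; the data points below are illustrative placeholders, not the actual measurements:

```python
import numpy as np

# Hypothetical (expected loss duration, measured silence) pairs in seconds,
# shaped to mimic the measured behaviour: slope ~1, offset ~0.18 s.
expected = np.array([0.05, 0.10, 0.20, 0.40, 0.55])
measured = np.array([0.24, 0.29, 0.38, 0.60, 0.76])

slope, intercept = np.polyfit(expected, measured, 1)

# Determination coefficient R^2 of the linear fit.
pred = slope * expected + intercept
ss_res = np.sum((measured - pred) ** 2)
ss_tot = np.sum((measured - measured.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
```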
With this data, the following conclusions can be obtained:
• The model is sufficiently good to be used as QuID.
• The slope is approximately 1, so that we can say that the perceptible duration of
the loss is quite similar to the length of the packet loss.
• Each packet loss, even the shortest ones, generates a silence of at least 180 ms.
This last figure of 180 ms should be taken with appropriate caution. Firstly,
because the offline software decoder is not very robust under packet loss events (and, in
fact, extracting the silence length has required a careful analysis of the recovered data).
Figure 4.11: Short-length audio losses
And secondly because the number of samples used in the model is not high enough to
be sure about the quantitative significance of this result.
However, from a qualitative point of view, it seems to be clear that there is a minimum
silence length that happens in most of the cases. In Figure 4.11, which shows the values
of figure 4.10 for its smallest loss durations, it can be seen that the four columns of blue
points in the left side (which refer to losses of 1, 3, 5 and 7 TS packets) generate errors
between 150 and 300 ms indistinctly. Without considering the quantitative significance
of those figures, it is possible to say that, qualitatively, the effect of losing one single
TS packet is similar to the effect of a short burst of packet losses. A side-effect of this
conclusion is that the fact that 7 audio TS packets are encapsulated in a single RTP
audio packet by the rewrapper does not significantly increase the effect of the minimum
audio loss, which would be 1 TS packet (plus probably some video packets as well)
for non-rewrapped streams, and is 7 TS packets (without additional loss of video) for
rewrapped streams.
4.3.2 Subjective analysis
The subjective assessment test session described in Appendix A.2 also included impairments
produced by the loss of audio packets. As the transmitted packets have been
processed by the rewrapper, every 7 audio MPEG-2 TS packets are grouped into one
audio RTP packet. As described before, the coded audio bitstream does not have complex
prediction structures (as video has), and the effect of the packet loss is basically related
to its duration. Therefore the different types of audio losses differ only in the number
Figure 4.12: Results of the subjective assessment for Audio Loss impairments
of packets that have been lost (it is similar to a packet loss rate / packet loss pattern
metric, but with the important distinction that we know that the lost packets are audio
packets). The RTP audio packet loss patterns used in the subjective assessment tests
are described in Table 4.3.
Table 4.3: Audio losses analyzed in the subjective assessment tests.
Code  Duration of the burst
N     0 (hidden reference)
A1    1 packet
A2    500 ms
A3    2 s
A4    6 s
The results obtained from the tests are shown in Figure 4.12, differentiating the three
content sources under study: the action movie in blue, the football match in yellow,
and the documentary in red. The global average value is also displayed, together with
its confidence intervals. The results are stable and coincident with other research in
the topic [79]: the longer the loss, the higher the severity. Isolated one-packet audio
losses seem to be admissible under real viewing conditions. The acceptability of short
bursts (up to 500 ms) depends strongly on the selected content: it is acceptable in the
soundtrack of a movie, but not in the narration of a sports match. Long bursts (2
seconds or higher) are unacceptable by all means.
Since A1, A2, A3 and A4 belong to the same “impairment set”, it is possible to compare
their results segment by segment. This is shown in Figure 4.13, which confirms the conclu-
sions mentioned before. In this case, since the audio structure is simpler and the audio
original quality is, as in a real deployment, high enough for the purpose, the probability
of having clear outliers is low.
Figure 4.13: Detailed results for each of the individual segments for Audio Loss
4.4 Coding quality and rate forced drops
Another relevant element for the Quality of Experience is the multimedia quality ob-
tained at the end of the encoding process: the coding quality. The coding quality is
important for the overall QoE, but it is not so critical for a monitoring system for two
main reasons. On the one hand, its impairments are less frequently reported by the users
than the ones produced by packet losses [7]. On the other, the target coding quality is
something that must be controlled in the design phase of the service, when selecting the
encoder which is going to be used and the conditions, especially bitrate, under which it
is going to work. But at runtime, there should be fewer unexpected events in the
encoder than in the access network, for instance.
When considering coding quality, we will focus only on the video stream, not on the
audio. The reason is that, while both of them contribute similarly to the final
multimedia quality [90], video requires much more bandwidth than audio [6] and, as a
result, video encoders will be working under more stressful conditions.
In this section we will study the coding quality from two different perspectives. First
we will explore the options to control or estimate the coding quality using simple RR
or NR metrics (with a chance to be applicable in the QuEM framework). Then we will
analyze different scenarios of strong quality drops, such as the ones produced when the
stream jumps from one bitrate to a much lower (or higher one). This scenario is typical
of OTT services using HTTP adaptive streaming.
4.4.1 Analysis of feature-based RR/NR metrics as estimators of video
coding quality
The first step done in the analysis of video quality has been trying to find out whether it is
possible to estimate the perceived coding quality (or, at least, some salient impairments)
from elementary Reduced-Reference or No-Reference metrics performed in the pixel
domain. The main reason for that is trying to build a quality estimator that can be of
use in scenarios similar to the ones proposed in our QuEM architecture.
The approach taken to this problem has been analyzing several NR and RR metrics from
the literature. Those metrics have been applied to video at contribution quality (high-
quality recordings from television content, obtained directly from the television studios
in uncompressed D1 format), and to the result of encoding them with commercial H.264
video encoders at different bit rates. The obtained values have been compared to the
outputs of subjective assessment tests done for the same video segments.
The work described in this subsection 4.4.1 was done during the first steps of the re-
search activity of this thesis [81], before the development of the QuEM strategy and its
associated subjective assessment test methodology, described in chapter 3. Therefore,
the subjective tests referred in this subsection, described in Appendix A.3, are different
from the QuEM-based subjective tests used in the rest of this chapter, and described in
Appendix A.2. The experiments, main results, and conclusions are described now.
4.4.1.1 Metrics under study
The aim of the experiment is determining whether it is possible to detect degradations
in the video quality by using lightweight Reduced Reference (RR) and No Reference
(NR) metrics. Most RR metrics are based on comparing some image features before
and after the impairment process. These features usually model amount of movement
and spatial detail. NR metrics are normally based on the detection of known artifacts
produced in the coding process, such as blocking, or blurring [121].
To compare different possible strategies homogeneously, we will extract several features
from the original and processed sequences, and measure their relative degradation,
averaged along time:

M = mean_t ( |X[F_orig(t)] − X[F_proc(t)]| / X[F_orig(t)] )    (4.9)
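Equation (4.9) can be computed directly from two per-frame feature series; a minimal sketch (the function name is ours):

```python
import numpy as np

def relative_degradation(feat_orig, feat_proc):
    """Equation (4.9): mean relative degradation of a per-frame feature X,
    averaged along time.

    feat_orig, feat_proc: one feature value per frame, for the original
    and the processed (encoded) sequence respectively.
    """
    feat_orig = np.asarray(feat_orig, dtype=float)
    feat_proc = np.asarray(feat_proc, dtype=float)
    return float(np.mean(np.abs(feat_orig - feat_proc) / feat_orig))
```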
Four groups of features have been compared: spatial information (obtained from sev-
eral RR metrics), temporal information (from RR metrics as well), blocking (from NR
metrics), and blurring (from NR as well).
Different feature extractors have been considered for spatial information (or texture):
• Le Callet et al. [63] propose a pair of complementary measures based on intensity
and direction of borders, which they call GHV and GHVP. They compute GHV as
the average magnitude of intensity gradient for all the pixels in which this gradient
is horizontal or vertical, and GHVP as the average magnitude of intensity gradient
for all the pixels in which this gradient is neither horizontal nor vertical.
• The BTFR metric in ITU-T J.144 [45] includes a texture measure computed as the
zero-crossing rate of the horizontal gradient.
• Saha and Vemuri [98] propose using the average value of absolute vertical and
horizontal differences, which they call IAM4.
• Webster et al. [117] propose a Spatial Information feature (SI), defined as the
standard deviation of the Sobel-filtered frame.
When characterizing temporal variations, there is less diversity of metrics in the litera-
ture. We will consider Le Callet’s Temporal Information (TI), defined as the energy of
the difference image along time [63].
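As an illustration, the SI and TI features just described can be sketched as follows; the exact filter normalization and the use of the gradient magnitude are our implementation assumptions:

```python
import numpy as np
from scipy.ndimage import sobel

def spatial_information(frame):
    """Webster's SI: standard deviation of the Sobel-filtered luma frame."""
    gx = sobel(frame.astype(float), axis=1)  # horizontal gradient
    gy = sobel(frame.astype(float), axis=0)  # vertical gradient
    return float(np.hypot(gx, gy).std())

def temporal_information(prev_frame, frame):
    """TI as used here: energy of the difference image between frames."""
    diff = frame.astype(float) - prev_frame.astype(float)
    return float(np.mean(diff ** 2))
```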
Regarding the blocking effect, we have studied three of the most frequently cited metrics:
• GBIM (Generalized Block-edge Impairment Metric) [122]. It measures the differ-
ences between both sides of the block (which must present a regular and well-known
pattern).
• Vlachos metric [110], which uses a method based on the spectral analysis of the
pixels in block boundaries.
• Wang metric [115]. It analyzes the Fourier transform of the image to detect energy
peaks in the multiples of the inverse of the block period.
The other relevant artifact to study is blurring. Most blurring metrics are based on the
measurement of the average width of borders in the image [21]. We have selected the
implementation proposed by Marziliano et al. [68].
Finally, we have also included two basic measures: global brightness (mean value of
intensity) and global contrast (standard deviation of intensity).
4.4.1.2 Evaluation
Reference data to benchmark these video quality metrics were obtained from the results
of a study of subjective quality for real-time H.264 encoders, described in Appendix
A.3. The same sequences used for the subjective tests were provided as input for all the
feature extractors described in the previous subsection.
Reduced-Reference metrics were obtained for all the features by applying equation (4.9).
Besides, the blocking and blurring metrics were also considered as individual No-Reference
metrics, just by computing their average along each test sequence.
The output of all the metrics, both RR and NR, was compared with the MOS
obtained from the subjective tests, to check whether any of the features under study
could be a reasonable predictor for MOS variations. Pearson correlation and Spearman
rank correlation (with p-test) were computed. Results are shown in Table 4.4.
Table 4.4: Comparison of NR/RR results with subjective tests
Metric           Pearson  Spearman  p-test
Brightness       0.41     0.44      OK
Contrast         0.61     0.64      OK
BTFR Texture     0.16     0.22      NO
SI               0.54     0.59      OK
GHV              0.43     0.43      OK
GHVP             0.42     0.44      OK
IAM4             0.51     0.52      OK
TI               0.70     0.68      OK
GBIM (RR)        0.16     0.17      NO
Vlachos (RR)     0.21     0.24      NO
Marziliano (RR)  0.29     0.26      OK
GBIM (NR)        0.17     0.19      NO
Vlachos (NR)     0.31     0.24      NO
Wang (NR)        -        -         -
Marziliano (NR)  0.32     0.33      OK
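The correlation figures in the table can be obtained with standard statistical tools; a sketch with purely illustrative data (the arrays below are not the thesis measurements):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-sequence values: metric output vs. subjective MOS.
metric = np.array([0.05, 0.10, 0.18, 0.25, 0.30, 0.42])
mos = np.array([4.5, 4.2, 3.6, 3.7, 2.8, 2.1])

r_pearson, _ = pearsonr(metric, mos)
rho, p_value = spearmanr(metric, mos)
significant = p_value < 0.05   # the "p-test" column of Table 4.4
```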
As a rule, correlations are quite low (below 0.7). The best results are obtained from the
TI metric and from contrast difference. No-Reference metrics obtained quite poor results:
no blocking metric provides a statistically meaningful result, and Marziliano's blur
metric achieves only a very slight correlation. The Wang method did not even provide any
stable result. Furthermore, even when using them as the basis for a Reduced-Reference
metric, results were not any better.
Figure 4.14 shows the value pairs which have been used to obtain the results mentioned,
i.e., the output of the metrics versus the subjective MOS, for the two metrics which
(a) TI loss vs MOS
(b) Contrast loss vs MOS
Figure 4.14: Results for (a) TI and (b) Contrast
provided better results: TI and contrast degradation. Each shape represents one of the
sequences (cross and triangle for the football match; square and circle for the music
show). Each color represents one of the encoders. The regression line is also shown.
Two considerations can be made:
• Low values of the metrics are closer to low MOS than high values of the metrics
are to high MOS. That means that a bad result in one metric would probably
imply bad quality, but a high result will not imply high MOS.
• Results may vary significantly depending on the content. This is especially clear
for contrast loss and one of the music sequences (the square-shaped markers in the
figure). For both metrics, music sequences show better correlation than football
ones.
We can conclude that simple feature-based RR/NR metrics are hardly usable in the
context of continuous video quality monitoring, under the conditions that we have
established for the design of monitoring systems. Some of the results that have been reported
by other authors regarding the performance of those methods in, for instance, JPEG
encoded images, need to be taken cautiously when applied to H.264 video.
Although the use of more complex metrics could improve the results, their performance
could hardly reach the capabilities of the FR metrics [112]. For those reasons, we
will not consider the use of NR/RR pixel-based metrics to be included in the QuEM
architecture. The source video quality should, as a general rule, be sufficiently high by
network design. And the monitoring of the variations of that reference quality along
time would be better performed at the video headend, where FR metrics can be applied
in dedicated equipment.
4.4.2 Managing coding quality drops
The previous discussion suggested that direct monitoring of the video quality in the
access link (between the PoP and the user terminal) is difficult to achieve with cost
effective RR/NR metrics. However, it is also true that, in the typical monitoring sce-
nario, the video coding quality is selected by the Service Provider in the network design,
and any quality drop in the original quality can be monitored in the headend in better
conditions.
There is room anyway in the access network for drops in the coding quality, if variations
in the video quality can be introduced in the Edge Server. The most typical example for
that is the HTTP Adaptive Streaming, where the user terminal may select to download
different versions of the same video segment (at different bitrates), depending on the
instant quality of service provided by the access network. The same principle might
be also applied for IPTV: once the video is encoded in parallel at several bitrates, an
IPTV Edge Server could force the downgrade of the video quality to overcome a network
congestion event.
In some production or delivery environments it may be necessary to obtain a lower-bitrate
version of a media stream when there is no possibility of performing a full transcoding
process, either because of lack of resources or timing issues. In these cases the "denting"
concept may be helpful: the idea is to dynamically remove frames from the video
elementary stream while keeping the original audio (or audios). The result is a stream with
a lower frame rate but also a lower output bitrate which keeps the rest of the variant
characteristics (codec, resolution, etc.) unaltered.
The denting component performs exactly this process: based on a configuration parameter
(target bitrate, target frame rate, "remove all B", etc.) it sends to its output the
same media received at the input, except for some video frames which are carefully
selected to meet the desired requirements. Due to the encoding properties of most codecs,
video frames cannot usually be removed arbitrarily, because the absence of a frame
may prevent other frames which remain in the stream from being properly decoded. For this
reason the denting component requires deep information about the video frames, not
only about their boundaries but also about their decoding hierarchy. Padding packets
can also be removed by the denting component, but non-audio/video streams (application
data, teletext, subtitles, etc.) should only be removed if explicitly allowed by
configuration parameters.
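A toy sketch of the frame-selection logic is shown below; the frame metadata model and the single "remove all B" policy are our simplifications of the component described above:

```python
from dataclasses import dataclass

@dataclass
class Frame:
    pts: int            # presentation timestamp
    ftype: str          # 'I', 'P' or 'B'
    is_reference: bool  # True if other frames are predicted from it

def dent_remove_b(frames):
    """'Remove all B' denting policy: drop only non-reference B frames,
    so every frame that remains in the stream can still be decoded."""
    return [f for f in frames if not (f.ftype == 'B' and not f.is_reference)]

gop = [Frame(0, 'I', True), Frame(3, 'B', False), Frame(2, 'B', False),
       Frame(1, 'P', True), Frame(4, 'P', True)]
kept = dent_remove_b(gop)  # I and P frames survive; frame rate and bitrate drop
```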
Denting can be used in the Edge Server to dynamically generate lower-bitrate versions
of the main stream, either to create or enhance HAS structures or to reduce the bitrate
of a unicast transmission between the Edge Server and the user terminal. In particular,
denting has been successfully used in Fast Channel Change solutions to increase the
apparent bitrate of the unicast session without effectively allocating a higher bitrate for
it.
These quality drops (reducing the bitrate and denting) have also been included in the
subjective assessment tests described in Appendix A.2. Table 4.5 shows the different
values considered. R1 and R2 are reductions of 50% and 75% of the bit rate. F1 and
F2 are reductions of 50% and 75% of the frame rate. The effective bitrate reduction
of F1 and F2 depends on how the video was encoded. However, typical values for the
content assets under study are about 25-30% of bitrate reduction for F1, and 35-50%
for F2.
Table 4.5: Quality drops analyzed in the subjective assessment tests.
Code  Type     Description
N     n/a      Hidden reference
R1    Bitrate  Bitrate reduced to 1/2
R2    Bitrate  Bitrate reduced to 1/4
F1    Denting  1/2 of all frames dropped
F2    Denting  3/4 of all frames dropped
The results of the subjective assessment tests for these impairments are shown in
Figure 4.15. The following conclusions can be obtained:
• The results of the hidden reference are high. This means that coding defects
introduced at the reference quality are perceived as much less severe than other
Figure 4.15: Results of the subjective assessment for Rate Drop impairments
Figure 4.16: Detailed results for each of the individual segments for Rate Drop
defects (forced quality drops in this case, but also other defects considered in other
sections).
• The impact of this kind of impairments depends on the source content, at least up
to some point.
• In general, the quality variations between bitrates are relevant (and between frame
rates as well). However, their specific impact differs from one asset to another, and
from one segment to another. This can be better seen in the comparison within
the “impairment set” formed by R1, R2, F1 and F2, in Figure 4.16.
• Denting has higher impact in the perceived quality than the drop of coding qual-
ity, which was expected, as in the latter case the quality-rate trade-off has been
optimized by the encoder, while in the former case it has not.
4.5 Outages
All the issues considered so far are caused by isolated errors. Now we will analyze a
different case: the outage, a loss of service for a period of time. The relevance of this case
is that users sometimes report errors which are described as a complete stop in the
video play-out, sometimes only recoverable after a reboot of the user terminal
[7]. Any system that monitors the global QoE must be aware of this kind of error since,
although less frequent than the ones caused by isolated packet losses, such errors have a
higher impact on the final quality.
Outages can be roughly classified into two categories: "short" and "long". By "long"
outages we understand those caused by service unavailability for several minutes or
hours. The most typical example is a software problem in the user terminal, but there
can be more severe situations (such as a critical failure in the delivery equipment, for
instance). "Short" outages are those caused by a brief stop (a few seconds) in the video
service delivery, typically caused by discontinuities in the service, such as an issue in the
delivery equipment followed by a recovery of the service from a redundant one.
"Long" outages should always be monitored and managed by the Service Provider and are,
in fact, outside the scope of our work. The impact of having no service at all is not easy
to measure on the same scale that we are considering. We will focus exclusively on the
detection and impact measurement of the "short" outages.
4.5.1 Detection of outages
The outage can happen in the contribution (detectable in the headend), in the core
network (detectable in the PoP), or in the access network (detectable in the HNED,
maybe with the help of the Edge Server).
If it happens in the contribution, it should be monitored by continuity monitors in
the headend. An effective way to do it is using the VODA algorithm proposed by
Reibman and Wilkins [94]. This algorithm detects an outage when there is a sudden
and simultaneous drop of three different factors: average brightness (i.e. the picture
changes abruptly to black), spatial information, and audio signal power. The three factors
must also remain low for some seconds for the outage to be detected.
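A simplified sketch of this detection logic follows; the thresholds and window length are illustrative values, not those of the original algorithm:

```python
def detect_outage(brightness, spatial_info, audio_power,
                  thresholds=(20.0, 5.0, 1e-4), min_frames=50):
    """VODA-style outage detector (simplified).

    Flags an outage when average brightness, spatial information and audio
    signal power all stay below their thresholds for at least min_frames
    consecutive frames.
    """
    run = 0
    for b, s, a in zip(brightness, spatial_info, audio_power):
        if b < thresholds[0] and s < thresholds[1] and a < thresholds[2]:
            run += 1
            if run >= min_frames:
                return True
        else:
            run = 0
    return False
```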
If the outage happens in the network, it will be an extreme case of packet loss with
high impact (loss of several seconds worth of video and/or audio), which can be de-
tected normally with packet loss effect estimators (and probably with simpler packet
loss detectors).
Additionally, short outages in the contribution can be detected in the coded stream
(with less accuracy, but it can be enough for our purposes) by monitoring the global
video and audio signal level:
• For video, with the analysis of the frame size and structure (coded long freezes
have almost zero-byte P and B frames).
• For audio, either from the analysis of energy values for each sub-band (exact) or
with the analysis of the dynamic range compression parameters, when available.
4.5.2 Subjective impact of outages
Some outage events have also been included in the subjective assessment tests described
in Appendix A.2. Table 4.6 shows the different values considered: stops of 2 and 6
seconds for audio and video (or both). The results are shown in Figure 4.17, with
the comparison of the impairment set A4, V3, AV in Figure 4.18.
In general, and for the same sequence, the longer the outage, the worse the perceived
quality. However, the specific impact and the relative importance of video and audio is
quite dependent on the specific content.
Table 4.6: Outage events analyzed in the subjective assessment tests
Code  Outage Duration  Elementary Stream Affected
A3    2 s              Audio
A4    6 s              Audio
V2    2 s              Video
V3    6 s              Video
AV    6 s              Both
4.6 Latency
A final QoE factor to consider is latency. Latency issues are usually disregarded in many
QoE analyses, because they are only perceived in very specific scenarios. However, the
study of latency is relevant because of two different, but related, causes. On the one
hand, as discussed in section 2.4, the scenarios where latency is relevant (mainly live
sport events) are important enough to make latency a meaningful QoE element. On
the other hand, there is a trade-off between latency and other QoE components that
makes it difficult to have low-latency video delivery services without compromising the
Figure 4.17: Results of the subjective assessment for Outage impairments
Figure 4.18: Detailed results for each of the individual segments for Outage
perceived quality. These trade-offs will be summarized at the end of this section, in
subsection 4.6.3.
Latency will be studied from two different perspectives. First we will analyze the
end-to-end latency or lag. Afterwards we will analyze the channel change time, which is
also a latency-related scenario with a significant contribution to the overall QoE.
4.6.1 Lag
End-to-end latency or lag refers to the delay observed in the displayed video by the user
with respect to the moment when the event is being recorded. With such definition,
lag only makes sense for live content streams: those which are being watched while
they are being captured. Although it is possible to provide an equivalent definition for
on-demand content, the reality is that lag is only a QoE factor in live events. And even
Figure 4.19: Simplified transmission chain for real-time video
for live television channels, there are very few cases where the lag is really an issue, and
where receiving the video with some additional seconds of delay makes any difference.
However, the few cases where lag is important are also important for service providers
and users, the most typical ones being sport matches. For those reasons, keeping the
lag under control is very relevant for IPTV service providers [70].
Lag must be constant end-to-end, to avoid losing video continuity. As such, any protocol
layer that imposes timing constraints must also have a constant end-to-end delay, because
it should not assume that the delay variation will be absorbed by the upper layers.
Figure 4.19 illustrates this. Points A and Z represent the decoded video stream. In absence
of errors, the video reproduced in A and Z should be identical, and therefore the delay
between those points TAZ must be constant.
A first component of this delay is introduced by the encoding process, and it is due to
two main causes. On the one hand, the coding of video using frame prediction normally
implies that the frames are encoded and transmitted in a different order than they are
displayed, to allow the use of bidirectional prediction. On the other hand, this kind of
compression also makes the size, in bytes, of the different frames differ strongly from
one frame to another and along time. This generates local peaks of bitrate that normally
need to be smoothed before transmission, introducing additional delay, to comply with
bandwidth restrictions. Those two sub-components of the video delay are introduced by
the encoder and depend only on coding decisions (and therefore can be known in point
B).
MPEG-2 Transport Stream allows the encoder to manage the coding delay end-to-end.
The transport stream includes a clock signal called PCR (program clock reference),
which indicates the rate at which the coded stream is produced at point B and, therefore,
the rate at which it is expected to be delivered at point Y. The stream also includes,
for each video, audio, or data access unit, its presentation time stamp (PTS) in the same
clock base. The total encoder-decoder delay TAB + TYZ is constant. This way, if the
network is able to keep constant delay TBY , the end-to-end delay TAZ will be constant
as expected.
However, the real delay in the transmission network TCX , which is an IP network, cannot
be guaranteed to be constant. Therefore network elements are introduced to control the
network ingestion and the reception in the user terminal to flatten network jitter and
also to manage error correction protocols.
The delay introduced by server-side elements and by the decoder (TAB + TBC + TYZ)
is established by the network design and known a priori by the service provider. The
network buffer TXY depends on the implementation of the user terminal, and it is
normally set individually for each video session. Once it is established, however, the
end-to-end network delay TBY will remain constant for the whole video session, and
therefore each video packet whose jitter exceeds this buffer will arrive too late to be sent
to the decoder, and will be considered as a network loss. Therefore, when establishing
the length of the network buffer, there is a tradeoff between end-to-end delay and packet
loss probability.
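This tradeoff can be made explicit: given a jitter distribution, the fraction of packets arriving later than the buffer can absorb is the effective loss probability. A sketch with synthetic jitter (the exponential model is an assumption for illustration):

```python
import numpy as np

def late_packet_fraction(jitter_samples_ms, buffer_ms):
    """Fraction of packets whose jitter exceeds the de-jitter buffer;
    those packets arrive too late and count as network losses."""
    jitter = np.asarray(jitter_samples_ms, dtype=float)
    return float(np.mean(jitter > buffer_ms))

# Synthetic jitter trace: a larger buffer trades lag for fewer late packets.
rng = np.random.default_rng(0)
jitter = rng.exponential(scale=20.0, size=10_000)  # milliseconds
fractions = {buf: late_packet_fraction(jitter, buf) for buf in (50, 100, 200)}
```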
Additionally, if the video multiplexing protocol is the ISO File Format, it does not
include transport timing information equivalent to the PCR. In such a case, the user
terminal must set up the value of TYZ arbitrarily for the first decoded video frame,
and assume that it will be enough to present it on time from then onwards. As a result,
buffer sizes are normally overdimensioned, to avoid buffer emptying events, at the cost
of suffering a higher lag. This overdimensioning is also generally applied to the network
buffer TXY , especially in the case of Over The Top services (where network capacity
variations can be very strong).
4.6.2 Channel Change time
We will define channel change time (or zapping time) as the time between the moment
when the end user presses a "channel change" key on their user terminal and the instant
when the new channel (video and audio) starts playing on their screen. This time can
be divided into the following components:
TCC = Tterm + Tnet + Tbuf + Tvid (4.10)
Where
• Tterm is the delay between the user key stroke and the moment when the user
terminal effectively requests the new video stream from the network (by issuing an
IGMP join, an HTTP request, or whatever is suitable for each scenario).
• Tnet is the delay between the moment the new video is requested and the moment
the first byte of the new stream arrives back at the user terminal.
• Tbuf is the time needed to fill the network buffer in the user terminal.
• Tvid is the time needed to present the first video frame in the decoder output.
From the analysis done in the previous subsection, it follows immediately that Tbuf
is equal to TXY as depicted in Figure 4.19. Tvid abstracts all the delay introduced
on the decoding side by the video stream itself. It can be inferred solely by analyzing
the video stream and depends only on the encoding process. It can be modeled as:
Tvid = TRAP + Tdec (4.11)
TRAP is the time that the decoder has to wait to reach a Random Access Point (RAP).
A RAP is a specific point in the video stream where it is possible to start decoding,
which corresponds approximately to the beginning of an intra-coded frame. Therefore
TRAP can be modeled as a uniformly distributed random variable between 0
and the intra frame period TI , whose mean value is TI/2.
Tdec is the interval between the RAP and the moment when the frame can be presented
to the user. It is equal to the stationary delay of the video decoder, i.e., TYZ in Figure
4.19. It represents the decoding part of the end-to-end coding delay for each of the
media components (audio, video, and data) and, in MPEG-2 Transport Stream, it is:
Tdec = PTS(first access unit)− PCR(first packet) (4.12)
It is worth noting that the value of Tdec will, in general, be different for each of
the elementary streams. Even though the end-to-end delay (TAB + TYZ) is constant
and equal for all of them, it is usual that the part of the delay left to the decoder
(Tdec = TYZ) varies strongly from one component to another. A typical example taken
from a commercial encoder is shown in Figure 4.20: audio Tdec is constant and below
100 ms, while video Tdec varies over time between approximately 800 and 1400 ms.
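Equation (4.12) can be evaluated directly from the transport timestamps. A minimal sketch (assuming the PTS is given in 90 kHz clock units and the PCR as its 90 kHz base plus 27 MHz extension, as defined in MPEG-2 TS; function names are illustrative):

```python
def pcr_to_seconds(pcr_base, pcr_ext):
    """PCR in seconds: base is in 90 kHz units, extension in 27 MHz units,
    so the full PCR is base * 300 + ext, counted in 27 MHz ticks."""
    return (pcr_base * 300 + pcr_ext) / 27_000_000.0

def pts_to_seconds(pts):
    """PTS is expressed in 90 kHz clock units."""
    return pts / 90_000.0

def decoding_delay(first_pts, first_pcr_base, first_pcr_ext):
    """Tdec = PTS(first access unit) - PCR(first packet), in seconds (eq. 4.12)."""
    return pts_to_seconds(first_pts) - pcr_to_seconds(first_pcr_base, first_pcr_ext)
```

For example, a first PTS of 189 000 against a first PCR base of 90 000 (extension 0) yields Tdec = 1.1 s, within the video range shown in Figure 4.20.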
With these elements, it is possible to build a QuID which monitors the channel change
time in the network in the following way:
Figure 4.20: Decoding delay (PTS − PCR) in milliseconds for video (blue) and audio (red) components of an MPEG-2 Transport Stream, and its variation along time (in seconds)
• Tterm and Tbuf depend on the user terminal implementation, which is the only
point where they are available. However, they are normally quite stable, so they
can be known a priori and introduced into the model as parameters.
• Tnet, TRAP and Tdec can be easily monitored in the network.
It is worth noting that most of the components of the channel change time are frequently
sacrificed in the process of enhancing the available end-to-end quality of experience. In
particular, Tbuf , as mentioned in the previous subsection, represents the buffering
required to absorb network jitter and to correct packet losses. TRAP and Tdec also
provide a higher degree of freedom for the encoder to distribute its bit budget flexibly,
according to the coding complexity of the images, therefore optimizing the coding
quality. Reducing any of those parameters, which would reduce the channel change time
by the same amount, could also have undesired side effects on the global quality.
Unlike the case of the global lag, channel change time is a QoE element which is relevant
for many IPTV deployments, and for all the video channels. However, the mapping of
the channel change events into a global scale of severities (or qualities) is very dependent
on the expectations of the service provider, and there is no standard way to do it. Table
4.7 shows an example that could be used as reference, based on informal laboratory
experimentation.
Table 4.7: Example Channel Change time ranges and their mapping to QoE
Time (s)    QoE description
< 0.4       Very Fast
0.4 - 1     Fast
1 - 2.5     Normal
2.5 - 5     Slow
> 5         Very Slow
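The decomposition of equation (4.10), with TRAP drawn uniformly from [0, TI] as discussed above, can be combined with the ranges of Table 4.7 into a simple simulation. A sketch with illustrative parameter values (the terminal, network, and buffer delays below are assumptions, not measurements):

```python
import random

# Illustrative values in seconds; real values depend on the deployment
T_TERM = 0.05  # key stroke to stream request
T_NET  = 0.10  # request to first byte of the new stream
T_BUF  = 0.50  # network buffer filling time (T_XY)
T_DEC  = 1.00  # stationary decoder delay (T_YZ)
T_I    = 2.00  # intra frame period

def channel_change_time(rng=random):
    """T_CC = T_term + T_net + T_buf + T_RAP + T_dec, with T_RAP ~ U(0, T_I)."""
    t_rap = rng.uniform(0.0, T_I)
    return T_TERM + T_NET + T_BUF + t_rap + T_DEC

def qoe_label(t_cc):
    """Map a channel change time to the QoE scale of Table 4.7."""
    if t_cc < 0.4:
        return "Very Fast"
    if t_cc < 1.0:
        return "Fast"
    if t_cc < 2.5:
        return "Normal"
    if t_cc < 5.0:
        return "Slow"
    return "Very Slow"
```

With these (hypothetical) values the mean channel change time is 2.65 s, i.e. "Slow" on average, which illustrates why the components discussed in this subsection matter.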
4.6.3 Latency trade-offs
Since lag and channel change can be considered relevant elements for the global QoE, we
may ask whether it is possible to improve them by reducing some of their components.
The answer is that it is possible, but with some cost: degrading other QoE factors. We
will show here why.
Regarding end-to-end lag, encoding latency TAB + TY Z is used to provide a buffer for
rate-control operations in the video encoder. Reducing this buffer will impair the video
quality that the encoder is able to produce at its output. Network processing delay
TBC + TXY provides a buffer to protect the decoder against network jitter. This buffer
can be reduced, but only at the cost of increasing the packet loss probability.
Channel change components Tbuf and Tdec are TXY and TYZ respectively, so the same
considerations apply. TRAP is also a design parameter of the encoder: if it is
reduced, the frequency of I frames will increase, which will degrade the video quality
(assuming, as we do, that the bitrate is kept constant).
The rest of the delay components are limited by the technology itself, and are normally
outside the control of the service provider:
• TCX and Tnet depend on the performance of the communication network.
• Tterm depends on the performance of the user terminal software.
As a conclusion: there is a strong relationship between the latency components and the
video quality components of the QoE. Therefore latency should always be controlled in
any multimedia delivery service. Even in cases where lag or channel change time are not
important by themselves, managing latency parameters is always a good strategy. Service
providers should be aware that reducing those latency elements in the future will always
come at the cost of putting the video quality at risk.
4.7 Mapping to Severity
One of the most complex problems to solve when managing a QoE monitoring system
in a large multimedia service deployment is the comparison and aggregation of a large
quantity of data. In our QuEM model, this problem is addressed by referring all the
measures to a common severity scale and synchronizing the measurement windows, so
that one single severity value is produced for each monitoring period at each monitoring
point (section 3.3.2). These values should then be processed statistically according to the
needs of the monitoring service, with the particularity that, even though the aggregated
value only has meaning in terms of average severity, each of the individual impairment
events is easily traceable to a qualitative description of what happened.
Each QuEM system should be calibrated according to the specific needs of the service
provider, and should also be modified during the operation phase with the feedback
retrieved from the field. The best way to calibrate the different QuID elements to
produce severity values is by performing subjective quality assessment tests such as the
ones described in section 3.4. This way, each service provider could feed the tests with
the type of content and impairments that fit better in their deployment, having the
Severity Transfer Functions completely under their control.
The results of the subjective assessments described in Appendix A.2 can provide an initial
approach to the problem, which should be used as a starting point for real deployments
of a QuEM infrastructure.
Figure 4.21 shows a summary of the different results that have been discussed along this
chapter. The most relevant conclusions for each type of error have already been
discussed, but we can summarize them as follows:
• Video packet losses can have very different effects depending on the part of the
stream which is lost. We have proposed a simple but effective metric (PLEP) to
model this variability.
• Audio packet losses depend mostly on the packet loss rate and pattern. We have
also modeled this in our proposal for audio loss QuID.
• Bitrate is a reasonably good proxy to monitor video coding quality in the context
of a QuEM system. The comparative effect of bitrate change and denting has been
studied. The former technique has less impact than the latter in the final QoE,
but it requires generating and transporting the different versions of the content
stream from the headend to the network edge.
Figure 4.21: Results for all the QuIDs mentioned in the chapter
• Outages can be monitored as more severe versions of the rest of the impairments,
but they must be considered separately because of their high impact on the perceived
quality.
• Latency effects (end-to-end lag and channel change) have to be taken into account,
both for their impact on the final QoE and for their relationship to other quality
issues.
Besides, the cross-analysis of different QuIDs can also provide some additional ideas:
• In case of network congestion or any other error situation, the decision of which
packet or packets to discard is critical for the final impact on the Quality of
Experience. Losing all the no-reference frames for six seconds (F1) has an impact
which is similar to losing all the audio for only half a second (A2) or having significant
macroblocking (90% of the picture) for half a second (E4), and is even better than
any of the video screen freezes (V1-V3). All those impairments are produced by
the loss of fewer packets than F1.
• Video freezing is probably the worst artifact (relative to the minimum loss burst
needed to produce it). For this reason, it should be avoided by all means. This
is especially relevant in scenarios where the network buffer is small because low
latency is required. In such cases, countermeasures such as bitrate drop or frame
rate drop are preferable to an empty buffer resulting in the loss of video and audio
signal.
4.8 Conclusions
This chapter has presented strategies to monitor all the relevant sources of quality im-
pairments in multimedia delivery services. We have proposed metrics to analyze the
effect of packet losses in video and audio, which are currently the most frequent errors
in multimedia services; and in particular in IPTV. We have also covered the analysis and
monitoring of media coding quality, with a special focus on the strong bitrate variations
which are typical of OTT scenarios. We have also analyzed the causes and effects of
service outage, as well as the effects of latency on the final QoE.
All the metrics proposed in this chapter can be integrated as Quality Impairment De-
tectors in the QuEM architecture described in chapter 3. Besides, we have analyzed a
set of subjective quality assessment test results which support the selection of QuIDs
and provide a way to compare their relative severities. The results of this analysis have
provided relevant information about the relative severity of the errors under study.
The ideas discussed in this chapter suggest that, with the right knowledge of the effect
of network events on QoE, it is possible to design network systems whose policies are
optimized towards the final perceived quality. The next chapter will present and discuss
some of these applications.
Chapter 5
Applications
5.1 Introduction
This chapter describes applications which, by making use of the knowledge obtained
in previous chapters about the Quality of Experience, can enhance the functionality of
existing multimedia delivery services. In fact, some of the applications described here
have been applied to products and services which are currently deployed in the field.
Section 5.2 describes a variation of the Packet Loss Effect Prediction model which can
be used to establish packet priorities in a video communication network. This can be
used to support Unequal Error Protection schemes which make best use of the error
correction capabilities of the network.
A similar idea is applied in section 5.3 to an HTTP Adaptive Streaming scenario. By
composing HAS segments in priority order (instead of in the traditional decoding order),
it is possible to react better to dynamic variations in the network effective bandwidth
without needing to increase the buffering delay excessively.
Section 5.4 describes a selective scrambling algorithm which can be used to efficiently
protect video content in scenarios where the processing power of the deciphering elements
is small. By encrypting only the most relevant packets (with respect to their
impact on the QoE) it is possible to get very effective protection with a low packet
scrambling rate.
Section 5.5 proposes a solution to overcome the channel change limitations described in
section 4.6.
Finally section 5.6 discusses the application of the results to stereoscopic video.
5.2 Unequal Error Protection
Not all packet losses have the same impact on the QoE. For instance, the effect
of isolated packet losses on perceived video quality depends on several factors, such as
the coding structure (the type of prediction in the frame, or the part of the frame which gets
lost), camera motion, or the presence of scene changes, among others [86, 93]. When the
number of errors grows, the effects of those factors tend to compensate each other, so
that the impact of random errors depends mainly on the packet loss rate [95] and the loss
burst structure [124]. Audio packet losses have a strong impact on the perceived quality,
depending mainly on the frequency and length of the bursts of lost packets, with no
significant differences between individual packets [79, 84]. When they are studied jointly,
video errors seem to be more acceptable than audio errors, except for high error rates
[57].
Most of the studies mentioned so far analyze the effect of packet losses for relatively high
loss rates. In practical situations, however, real-time video services provide a quality of
experience resulting in less than one visible error per hour, with users showing sensitivity
to higher impairment rates [7]. In terms of network quality of service, it means that
only a few packet loss bursts per hour are allowed, at most.
Home networks typically have error rates which are some orders of magnitude above
these figures, especially in the case of Wi-Fi (802.11) [97]. If the media stream is to be
delivered through the home network, the residential gateway must provide some kind
of error correction mechanism (FEC or ARQ) in order to keep the required level of
service. This protection is performed at the cost of introducing end-to-end delay in the
transmission chain [61], as well as increasing the required bandwidth.
The understanding of how packet loss affects video and audio quality has been used
to propose several unequal error protection (UEP) schemes, where packets with a higher
impact on quality are better protected [29, 66]. This allows keeping a good QoE without
an excessive increase in the required protection and, consequently, in the additional delay
introduced. However, they usually require an in-depth video analysis which is difficult
to integrate into cost-effective consumer electronic devices. Lightweight UEP designs also
exist, but they usually focus on the characteristics of the loss patterns and use limited
approaches to characterize the priority of the packets [12, 71].
We have shown in the previous chapter that, even with its limitations, the PLEP model
we describe is a promising approximation for blind packet loss effect estimation. However,
it is based on reading and building a reference frame list for each frame. Simple
as it is, this could be too expensive for some applications, such as packet
QoS policies applied in routers, and it may require the use of information which is not
available in real service deployments, perhaps because the elementary video stream is
completely scrambled.
Here we will show how it is possible to strongly reduce the effect of packet losses by
applying a simplified version of the PLEP metric to label video packet priorities (even
using a low number of bits to encode them). This technique can be applied to
congestion control in home gateways or to buffer management in dynamic HTTP adaptive
streaming. In addition, it can improve other lightweight UEP schemes by enriching
their characterization of the video sequence. This approach requires low processing
capabilities while clearly outperforming a random packet drop.
The solution specifically addresses short-term protection decisions, where the error cor-
rection system has to decide which packets to protect (or which ones to drop) within
a short window of time. Thus it is especially suitable for real-time multimedia trans-
missions. This solution is applicable not only to error correction, but also to congestion
control.
5.2.1 Priority Model
5.2.1.1 Effects of packet losses
The proposed priority model is based on the fact that not all video packets contain
the same kind of information and, therefore, the loss of different kinds of packets will
produce different effects on the perceived video quality. In fact, even the loss of a single
video packet can produce a wide range of different effects, depending on the kind of
packet which is lost.
There are several factors which influence the effect of a single packet loss. They can
be roughly classified into two sets: content-based (camera motion, scene changes. . . ) and
coding-based (type of video frame, position of the packet within the frame. . . ). Only the
latter are considered in this approach, since they are the ones which can be easily
identified in the analysis of the coded media stream. It will be shown later that they
suffice to provide a good performance of UEP algorithms.
The factors considered are based on the following previous knowledge:
1. The effect of a loss is higher when it is produced in a reference frame (a frame
used by the encoding system to predict the following ones), because the error will
propagate to the frames which have it as reference [95].
Table 5.1: Priority value for each slice type
NALU Type            PS
IDR (I)              1
Reference (R)        0.5
No-Reference (N)     0
2. If a packet in the middle of a video slice is lost, then the rest of the slice gets
lost too, as the decoder cannot easily re-synchronize in the middle of a slice. This
is especially relevant in H.264 video, where most commercial encoders use a low
number of slices per frame (typically one). In such cases, the sooner the error is
produced within a frame, the higher its impact is [86].
3. If packets are lost in two different frames, their contribution to the final error (in
terms of mean square error, MSE) can be considered to be the sum, as errors are
typically uncorrelated [95].
4. Audio packet loss effects are basically related to the length and structure of the
loss burst, with no meaningful differences between individual audio packets [79, 84].
5.2.1.2 Packet Priority
A packet priority model is proposed in order to assign higher priority to packets whose
loss will produce a stronger effect on QoE. The model is based on the type of video
slice carried by the packet and the position of the packet within the slice (assuming that
a video slice is typically carried in several transport packets). As mentioned
before, losses have a higher effect on reference slices than on no-reference ones, and at the
beginning of the slice and of the GOP, where error propagation effects are higher [66, 86].
The priority model is defined as follows:
P = αPS + βH + γTS + δTG (5.1)
where PS is the priority of the slice type as described in Table 5.1, H is a flag indicating
whether the packet contains a NALU (Network Abstraction Layer Unit) header, TS
indicates the number of packets until the next slice in the stream, and TG is the number
of packets until the next I frame. All the parameters are normalized between 0 and 1.
According to their relevance, the following coefficients are selected: α = 10^3, β = 10^2,
γ = 10, δ = 1.
Figure 5.1: Example of the packet priority model applied to one GOP of a coded video sequence
Figure 5.1 shows an example of the application of the model to a sequence of video
packets in transmission order. Each box represents an RTP packet, while different colors
represent different frames. The figure shows all the elements of the prioritization model.
PS depends on the NALU type (IDR, Reference slice or No-reference slice), indicated
as I, R or N within the boxes. H = 1 (presence of NALU headers) is represented as a
black bold frame. Finally, TS and TG are shown for the packet marked by the red circle.
Audio packets can easily be introduced into this model just by assigning them a fixed
priority value P = PA. In line with the idea that audio losses are more relevant than
video ones, except in case of high video degradations [57], PA is set to 900. This way,
audio packets have a lower priority than IDR packets (for α = 10^3, PA = 0.9α), but
a higher one than any other video packet. Different values could be considered depending on
the specific application.
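The complete priority assignment, including the audio extension, can be sketched as a direct transcription of equation (5.1) and Table 5.1 (normalization of TS and TG to [0, 1] is assumed to be done by the caller):

```python
ALPHA, BETA, GAMMA, DELTA = 1e3, 1e2, 10.0, 1.0
P_AUDIO = 900.0  # 0.9 * ALPHA: below IDR packets, above any other video packet

SLICE_PRIORITY = {"I": 1.0, "R": 0.5, "N": 0.0}  # Table 5.1 (IDR / Ref / No-ref)

def packet_priority(slice_type, has_nalu_header, t_s, t_g, is_audio=False):
    """P = alpha*PS + beta*H + gamma*TS + delta*TG (eq. 5.1).
    t_s and t_g are the distances to the next slice and to the next I frame,
    already normalized between 0 and 1."""
    if is_audio:
        return P_AUDIO
    h = 1.0 if has_nalu_header else 0.0
    return (ALPHA * SLICE_PRIORITY[slice_type] + BETA * h
            + GAMMA * t_s + DELTA * t_g)
```

Sorting a window of packets by this value gives the priority order; a UEP scheme would then drop (or leave unprotected) the lowest-valued packets first.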
It is important to remark that this is not a scale of priorities, but only an ordering. The
intention of the model is to provide a way to sort a group of packets in priority order, so
that the higher the priority, the higher the impact of the packet's loss. However, there is no
information about the relative magnitudes of the losses.
Another relevant property of the model is that, once the priority of each packet is
known, no further analysis is required. This allows the unequal error protection schemes
to be stateless in the following sense: the decision of whether one packet is protected
or not has no effect on the priority value applied to other packets. This significantly
simplifies the work of the UEP mechanisms.
Figure 5.2: Implementation of the prioritization model
5.2.1.3 Implementing the model
Figure 5.2 shows the basic implementation modules to apply the described prioritization
model to a video source. As mentioned before, the priority labeling is applied indepen-
dently from the unequal error protection mechanism itself, and before it. To each packet
x in the sequence, a priority P (x) is assigned and signaled to the UEP module.
In the specific case of an IPTV scenario, each packet x is an RTP packet containing
H.264 or MPEG-2 video, or MPEG audio (MPEG-1, AAC or similar), over MPEG-
2 Transport Stream. To assign the priorities correctly to the transport packets, it is
necessary that audio and video are carried in different packets. It is also advisable that
no packet carries data from more than one slice; which, for the typical H.264 stream
with one slice per frame, means that no packet should carry data from two or more
different frames. All these conditions are satisfied if the packing of MPEG-2 TS into
RTP is done by the rewrapper described in section 3.5.2.
Priorities assigned to packets can be signaled in the RTP header extension, so that
the network processing elements can read them and use them to apply unequal error
protection techniques. This has the advantage that the extension is transparent to other
RTP receivers, so that the application of priority labels is backwards compatible with
any RTP-aware system. This compatibility has been successfully tested with several
commercial set-top-boxes, and this use of signaling in RTP header extensions is
currently in the field in some commercial IPTV deployments.
Other implementation options are possible. For example, priorities can be signaled using
different protocols, such as the DSCP bits of the IP header. In such cases, the number
of bits available to encode the priorities can be reduced. The next section will show that
even a few bits can be enough to encode the priority in an efficient way.
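As an illustration of the RTP option, a 16-bit priority value can be carried in a one-byte-header extension block as defined in RFC 5285. This is a sketch, not the exact wire format of the deployments mentioned above; the extension element ID (here 1) is an assumption, negotiated per session in practice:

```python
import struct

def priority_extension(priority, ext_id=1):
    """Build an RFC 5285 one-byte-header RTP extension block carrying a
    16-bit priority value. The block follows the fixed RTP header when
    the X bit is set."""
    data = struct.pack("!H", priority)                  # 2 bytes of payload
    # element byte: ID in the top 4 bits, (length - 1) in the bottom 4
    element = bytes([(ext_id << 4) | (len(data) - 1)]) + data
    padded = element + b"\x00" * (-len(element) % 4)    # pad to 32-bit words
    # 16-bit profile identifier 0xBEDE + extension length in 32-bit words
    return struct.pack("!HH", 0xBEDE, len(padded) // 4) + padded
```

Receivers that do not understand the extension skip it, which is what makes the labeling backwards compatible with any RTP-aware system.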
One of the main advantages of this model is its simplicity. This makes lightweight
implementations possible: to assign a priority to a packet, only the video NALU header
has to be read and analyzed. This way, the prioritization algorithm can be implemented
in devices with limited processing capabilities, such as home network gateways. In such
cases, the priority labeling and the unequal error protection modules would both reside
in the same hardware device.
5.2.2 Experimentation and results
5.2.2.1 Description of the experiment
To test the performance of the model, three different short video sequences (4-12 sec-
onds), encoded by commercial IPTV encoders, have been selected. They are sequences
A, B and C from Appendix A.4. All of them are encoded in H.264 over MPEG-2 TS and
packed in RTP in the way described before, with each RTP packet containing information
about part of at most one video frame. Audio is not considered in the experiment.
Within each possible window of W consecutive RTP packets in the sequence, the K
packets with lowest priority are discarded. Then the resulting sequence is decoded,
using the repetition of the last reference frame as error concealment strategy. The Mean
Square Error of the resulting impaired sequence is computed, MSEPRIO.
For the same W -packet window, the MSE resulting from randomly dropping K packets
is also computed, MSERAND. The calculation of the random loss is performed by
randomly selecting 1000 of all the possible combinations of K lost packets within the
window. If there are fewer than 1000 combinations, then all of them are selected. MSERAND is
obtained as the average of the MSEs of each of the (up to) 1000 combinations.
For each window, the MSE gain is computed as
MSEgain(dB) = 10 log10 (MSERAND / MSEPRIO)     (5.2)
Based on this, an Aggregated Gain Ratio (AGR) can be defined to measure the performance
of the model. For each sequence and each pair (W, K), AGRW,K(G) is defined
as the proportion of windows whose MSE gain is equal to or greater than G, expressed
as a percentage on a 0-100 scale.
Table 5.2 shows the values of AGR for some relevant values of MSE gain, W and K, for
the three sequences under study (A, B and C), summarizing the results of the experiment.
They will be discussed and analyzed in the following subsections.
5.2.2.2 Single-packet loss
The first test considered is the case where K = 1, for several values of W . For each
original sequence it is necessary to individually discard each one of the RTP packets and
then decode and process the result of that individual discard. This way, more than 1500
impaired sequences have been obtained and used for the analysis.
Table 5.2: Values of the Aggregated Gain Ratio for some relevant values of MSE gain, W and K

MSE gain (dB)   W    K    AGR% A   AGR% B   AGR% C
10              15   1    73.7     50.9     64.9
20              15   1    50.2     17.4     47.4
10              20   1    87.4     65.8     72.9
20              20   1    62.3     21.3     56.0
10              15   3    61.8     49.1     59.6
20              15   3    47.9     30.9     49.7
10              20   3    78.9     60.0     71.7
20              20   3    59.5     33.8     58.4
10              15   6    44.1     42.6     35.7
20              15   6    34.7     41.7     22.2
10              20   6    66.7     55.5     48.8
20              20   6    48.2     55.4     30.1
10              15   10   18.5     30.0     15.8
20              15   10   14.0     26.9     14.0
10              20   10   36.8     47.1     22.3
20              20   10   26.0     44.0     18.7
The results for sequence A, K = 1 and several values of W are shown in Figure 5.3.
Each of the curves refers to a different value of W and represents, for several values of
MSE gain, the proportion of the sequences which obtained at least that gain value. The
range of values of W is selected to cover typical loss burst lengths in a wireless home
network [97].
Gains of 20 dB in MSE can be reached for between 20% of the packets (W = 5) and 85%
(W = 30), using window sizes which are reasonable for a home network device. The
figure also shows that the longer the window, the better the results, since it is
easier to find a low-priority packet within the window.
Figure 5.4 shows some values of MSE for sequence A, K = 1, W = 15. As can be seen,
the MSE varies greatly between different windows along the sequence, independently of
the protection method used. However, focusing on any specific window (any
value on the horizontal axis), using the prioritization method results in a lower MSE in
almost all cases; and in most of them this reduction is very strong. This means that
the specific error will depend heavily on the specific window which is selected but, once
the window is there (i.e., once the error is bound to happen), a good UEP decision can
mitigate the error effect dramatically.
Figure 5.3: Effect of the window size: Aggregated Gain Ratio for K = 1 and several values of W
Figure 5.4: Values of MSE for some possible windows within sequence A, comparing random packet loss (grey line) with priority-based packet loss (red line) for K = 1 and W = 15
5.2.2.3 Multiple-packet loss
The second test fixes the value of W and analyzes the effect of the burst size by
changing the value of K. To simplify the implementation of
the test bed, the results of the different (W, K) combinations have been derived from the
(W, 1) case of the previous section, according to the considerations described in section
5.2.1.1. This way, only the first error within a slice is considered (as the rest of the slice
is lost anyway) and errors in two different frames are assumed to be uncorrelated.
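The derivation of a (W, K) result from the (W, 1) measurements thus relies on two simplifications: within a slice only the earliest lost packet matters, and MSE contributions of different slices add up. A sketch (the data structures are hypothetical, not the actual test bed):

```python
def combined_mse(lost_packets, single_loss_mse, slice_of):
    """Estimate the MSE of a multi-packet loss from single-loss measurements.
    lost_packets:    indices (in transmission order) of the lost packets
    single_loss_mse: packet index -> MSE measured when only that packet is lost
    slice_of:        packet index -> identifier of the slice it belongs to"""
    first_in_slice = {}
    for p in sorted(lost_packets):
        s = slice_of[p]
        if s not in first_in_slice:   # the rest of the slice is lost anyway
            first_in_slice[s] = p
    # errors in different slices/frames are assumed uncorrelated: MSEs add
    return sum(single_loss_mse[p] for p in first_in_slice.values())
```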
Figure 5.5: Effect of varying the loss burst size (K) for a window of W = 15 packets
Figure 5.5 shows the results for sequence A and W = 15. This value has been selected
as representative from the range that was considered in Figure 5.3. Qualitatively, curves
for other values of W within that range show similar behaviors. Results from the other
sequences are summarized in Table 5.2.
When the values of K are high, the effectiveness of the model drops,
as there is very little margin to select low-priority packets. It is also interesting
that the slopes of the curves gradually flatten. For example, Figure 5.5 shows
that, for K = 8, only 10% of the sequences have an MSE gain between 10 and 30 dB,
while 20% reach gains over 30 dB.
This behavior is due to the fact that the prioritization method concentrates errors firstly
in no-reference frames (versus reference ones) and secondly at the end of the frame (versus
the beginning). When the window lies entirely within one frame, the gains over
the random loss are limited. However, when the window covers parts of two different
frames, the priority strategy concentrates the error in the less-impacting part of
the window, thus reaching high MSE gains. As a consequence, even for severe error
patterns, the prioritization method allows the error effect to be negligible in a
representative proportion of the cases.
Figure 5.6: Contribution of each term to the prioritization equation: only PS (red), PS + H (green), PS + H + TS (cyan), and all of them (blue). Computed for W = 15 and K = 1
5.2.2.4 Contribution of each priority factor
An additional analysis of the performance of the model is represented in Figure 5.6. It
shows the contribution of each of the terms in equation (5.1) to the aggregated MSE gain
of the method. The red line represents the use of only PS as prioritization parameter.
The green line then adds the effect of H to PS . Afterwards, the effects of TS and TG
are added.
Several aspects of the graph are notable. First of all, the very simple prioritization
method of just considering the frame type of the packets (PS) can be good enough
for some applications. Secondly, the most relevant contribution after that is TS ,
which allows dramatic improvements in performance. Therefore, in addition to PS
and H, the parameter TS should always be considered.
As the scope of the study is focused on the short term, and window sizes
are therefore relatively small, each packet window typically contains at most a small
number of frames. This is the main reason why the contribution of TG is so limited
in the current scenario. Nevertheless, additional tests show that when the window size is
enlarged, the relative weight of TG increases, supporting the choice of a model with four
parameters.
Figure 5.7: Effects of a limited bit budget to encode the priority
5.2.2.5 Limiting the bit budget
These results are useful in a scenario where the priority can be established with as high
a resolution as possible, meaning that there is a large bit budget to encode priority
values. In the aforementioned case of an RTP header extension, for example, this budget
could typically be around 16 bits per RTP packet. However, in other signaling
implementations, such as DSCP, the number of bits available to encode the priority may
be lower.
Figure 5.7 shows the effect of using a reduced number of bits to encode priority. The
lines for 2 and 3 bits are equivalent to the ones representing the use of only PS and
PS+H. The rest of the lines are built according to the results shown in the previous
section, i.e., devoting more bits to the encoding of TS than of TG. The proposed
assignment of bits to each of the components is shown in Table 5.3.
According to the results of this experiment, very few bits are needed to encode packet
priority. In particular, using only 3 bits to encode TS, plus 3 more for PS and H, the
deviation from the reference curve is already quite small.
Table 5.3: Bit budget assignment to encode priority

Total   PS   H   TS   TG
  2      2   0    0    0
  3      2   1    0    0
  4      2   1    1    0
  5      2   1    2    0
  6      2   1    3    0
  8      2   1    3    2
 12      2   1    5    4
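The bit-budget split of Table 5.3 lends itself to straightforward bit packing. The sketch below packs and unpacks the four priority components using the 8-bit row (2+1+3+2) as the default; the field order and function names are illustrative assumptions, not part of the thesis.

```python
def pack_priority(ps, h, ts, tg, widths=(2, 1, 3, 2)):
    """Pack the four priority components (PS, H, TS, TG) into one integer,
    using the 8-bit row of Table 5.3 by default."""
    code = 0
    for value, width in zip((ps, h, ts, tg), widths):
        assert 0 <= value < (1 << width), "value exceeds its bit budget"
        code = (code << width) | value  # append the field, most significant first
    return code


def unpack_priority(code, widths=(2, 1, 3, 2)):
    """Inverse of pack_priority: recover (PS, H, TS, TG) from the code."""
    values = []
    for width in reversed(widths):  # peel fields off the least significant end
        values.append(code & ((1 << width) - 1))
        code >>= width
    return tuple(reversed(values))
```

Because the fields are packed most-significant first, comparing two codes numerically compares PS first, then H, then TS, then TG, consistent with the relative importance of the terms discussed above.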
5.2.3 Applications
This packet prioritization model can be applied to several scenarios related to unequal
error protection. Some of them will be covered in this section: weighted random early de-
tection (WRED), automatic repeat request (ARQ) and forward error correction (FEC).
5.2.3.1 Weighted Random Early Detection (WRED)
Random early detection (RED), sometimes also called random early drop, is a technique
used in routing devices to handle congestion: when packet queues reach a certain fill
level, some packets are dropped randomly in order to avoid buffer overflow. Weighted
RED (WRED) is an enhancement of RED that allows assigning a different priority to
each packet, so that its probability of being dropped depends on that priority.
Using the prioritization algorithm in WRED is straightforward: when a packet has to
be discarded, it should be the one with the lowest priority within the buffer.
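A minimal sketch of this priority-aware drop policy follows; the packet and queue representations are illustrative assumptions.

```python
def enqueue(queue, packet, capacity):
    """Priority-aware WRED-style drop: admit the packet and, if the queue now
    exceeds its capacity, discard the lowest-priority packet in the buffer
    (which may be the one that just arrived). Returns the dropped packet."""
    queue.append(packet)
    if len(queue) <= capacity:
        return None
    victim = min(queue, key=lambda p: p["priority"])
    queue.remove(victim)
    return victim
```

Note that real RED/WRED drops probabilistically as a function of average queue depth; the sketch keeps only the priority-selection aspect relevant here.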
5.2.3.2 Unequal Forward Error Correction (FEC)
Unequal Forward Error Correction can also make use of this prioritization method. In
many cases, FEC cannot protect all the packets within a specific sequence, and thus only
recovers part of them. In this case, where the protection happens before the actual
error is produced, the FEC server has to decide whether to protect all the packets equally
or to use this simplified approach to find out which packets have the lowest priority.
Typical FEC structures used in IPTV are based on M-by-N matrices of packets, where
XOR redundancy is applied packet-wise, either vertically, horizontally or both [19].
When applying unequal FEC, there is a limited bitrate budget to transmit these FEC
packets, so that only part of them are generated and sent. By introducing prioritization,
it is possible to reduce the required overhead introduced in the sequence, while keeping
a good protection for the high-priority packets.
The total number of packets within a matrix is below, but typically close to, 100. In this
case, the window size is usually of the same order of magnitude. In that application, as
stated before, the effect of the term TG will be higher than in the scenarios discussed so
far [12]. However, the principles presented here are fully applicable.
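The row-wise case can be sketched as follows: with a budget of only a few parity packets for the matrix, XOR parity is generated for the rows whose packets carry the highest aggregate priority. The matrix, payload, and priority representations are illustrative assumptions.

```python
def xor_payloads(payloads):
    """XOR equal-length payloads together: the parity packet for one row."""
    parity = bytearray(len(payloads[0]))
    for payload in payloads:
        for i, byte in enumerate(payload):
            parity[i] ^= byte
    return bytes(parity)


def unequal_fec(matrix, row_priorities, budget):
    """matrix: M rows of N equal-length payloads; row_priorities: one score
    per row. Returns {row_index: parity_packet} for the `budget` rows whose
    packets are most important; the other rows stay unprotected."""
    ranked = sorted(range(len(matrix)), key=lambda r: row_priorities[r], reverse=True)
    return {r: xor_payloads(matrix[r]) for r in ranked[:budget]}
```

A single lost packet in a protected row is then recovered by XOR-ing the row's parity packet with its surviving packets.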
5.2.3.3 Unequal Automatic Repeat reQuest (ARQ)
The priority model is particularly suitable for unequal ARQ. When transmitting mul-
timedia over a lossy network, such as 802.11g, it is common for loss bursts to be longer
than what can be retransmitted within the available bitrate budget. In such cases, the
decisions of whether to retransmit, as well as which packets to retransmit, are based on
the priority of each specific packet [71].
The problem can be modeled as follows: when the receiver detects a loss of r packets, it
requests a retransmission of the whole burst. However, due to bitrate constraints, the
server can only guarantee that the first n will arrive on time, i.e. before they have to
be consumed from the reception buffer. The strategy of the server is then to retransmit
only the n most important packets, in priority order [30].
In such a case, the decision in the server is which packets to drop: from a window of W = r
possible packets, only n will be retransmitted, meaning that K = r−n will be lost. The
improvement obtained by introducing this kind of prioritization in the recovery process,
instead of just randomly dropping some of the packets, is the one discussed and
characterized in subsection 5.2.2.
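The server-side selection can be sketched in a few lines; the (sequence, priority) tuple representation is an illustrative assumption.

```python
def select_retransmissions(burst, n):
    """From a lost burst of r = len(burst) packets, pick the n that fit the
    bitrate budget and return them in transmission order: highest priority
    first, ties broken by sequence number. The remaining K = r - n packets
    are given up as lost."""
    ranked = sorted(burst, key=lambda packet: (-packet[1], packet[0]))
    return ranked[:n]
```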
5.3 Fine-grain segmenting for HTTP adaptive streaming
An HTTP Adaptive Streaming (HAS) client requests segments at one bitrate or another
depending on the bandwidth of the TCP connection. Basically, when the buffer in the
client is emptying, it requests lower bitrates; when it stays full enough, it can use high
bitrates. By keeping a sufficient amount of buffering, the playout of the content can
remain seamless during this process. If for any reason (e.g. an abrupt drop in network
quality) the buffer gets empty, the playout stops (freezes) while the buffer fills again.
In cases where network capabilities vary strongly, this implementation requires a signif-
icant amount of buffering. When using HAS to transmit live events, however, increasing
the buffer means increasing the latency, which is undesirable in such scenarios. For
non-live content the effect is a slower startup time, which is also undesirable. On the
other hand, if the buffer is small, underflows may be frequent, which is undesirable in
all kinds of transmissions (since freezing the video for a while and then resuming from
the same point is, as a general rule, not an option).
HTTP adaptive streaming works as follows. The source video is encoded at several bi-
trates (and therefore at different qualities) and then chunked into segments (typically
between 2 and 10 seconds long; depending on the specific application, the lengths of the
segments may be similar, or even equal, but this is not a strict requirement). The
segment size is a compromise between flexibility and efficiency: shorter segments allow
changing bitrate more frequently, and therefore reduce the required buffering time;
however, they increase the server complexity (as it has to handle more requests) and
the protocol overhead. Besides, segment boundaries usually need to be video random
access points (typically I frames), so the minimum segment length is the minimum
intra-picture period which, due to coding efficiency, is rarely under 0.5 seconds.
Therefore, in HAS scenarios, all decisions are taken with a time resolution of at least
0.5 seconds, and typically much longer than that. Beyond this point, there is no solution
to handle low-buffering systems.
This section proposes a solution to improve the behavior of HAS under low-delay con-
straints.
5.3.1 Description of the solution
The idea proposed here is to arrange the information contained in each segment (the
audio and video frames) in priority order, instead of in the original temporal order.
Companion metadata included in the segment carries all the information required to
restore the original order. Depending on the delivery architecture, this metadata may be
part of the transport level or a separate information package added to the segment.
Once the information has been arranged this way, if there is a drop in network quality
and, for example, only 80% of the segment can be received by the client on time, instead
of dropping the last 20% of the segment duration, the client will drop the 20% which is
least important. In the case of a 10-second segment, for example, the difference would be
between completely dropping at least the last 2 seconds (without our solution) and
dropping, for instance, some frames along the whole segment (which also affects the
Quality of Experience, but in a much less aggressive way).
The priority assigned to a fragment of the segment depends on the effect that the loss
of said fragment has on the Quality of Experience. That is, the more important the
fragment is for the Quality of Experience of the segment when played to a user (i.e.
the higher the loss of quality perceived by the user if the fragment is lost), the higher
its priority.
The solution can be implemented following these steps:
First, on the multimedia content server side, each multimedia content segment (original
segment) of the HAS stream is divided into fragments (the length of each fragment is a
design parameter), with the following requirements:
• Each fragment contains homogeneous data: that is, each fragment does not contain
data from more than one elementary stream (video, audio or data) and/or from
more than one access unit (frame or field) of a video elementary stream. Fragments
may be smaller than a video frame (i.e. each frame may be divided into several
fragments), as this allows better performance of the solution (more granularity),
although it is not mandatory.
• Each fragment has associated information (metadata) which allows restoring its
original (temporal) position in the segment; for instance, a sequence number.
• Each fragment is assigned a priority. The packet prioritization model described in
section 5.2 will be used here.
• If all the fragments are concatenated according to the associated information (e.g.
in sequence number order), then the result is the original segment, or a valid segment
of the multimedia stream equivalent to the original one from the decoding point of
view (i.e. it can be reproduced by a normal multimedia decoder, with the same
result as reproducing the original segment), possibly with the frames in a different
order. This segment is called the recovered segment.
Secondly, a prioritized segment is created by concatenating the fragments in priority
order. This segment includes the metadata for each fragment: priority and sequence
number. Fragments with the same priority value may be ordered, for example, by
sequence number. This prioritized segment is sent from the multimedia server to the
end device; in other words, the fragments are sent to the end device by the multimedia
server in priority order.
Finally, the prioritized segment is retrieved by the end device and stored in a memory
buffer. When the client starts receiving the prioritized segment via HTTP, it starts
extracting the fragments into the buffer. Each fragment is put at its right position
using the associated information (sequence number), thereby rebuilding the recovered
segment. The buffer may be consumed at the normal pace by the client (no special
buffering policy is needed). If the segment is consumed before it has completely arrived,
there will be gaps in the buffer; but they will occur in the least important positions
(the ones with the lowest priority and, therefore, the lowest impact on QoE). Late
arrivals are discarded.
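The server- and client-side steps above can be sketched as follows; the fragment representation, with explicit `seq` and `priority` fields, is an illustrative assumption.

```python
def prioritize(fragments):
    """Server side: order the fragments of a segment by decreasing priority
    (ties broken by sequence number) to build the prioritized segment."""
    return sorted(fragments, key=lambda f: (-f["priority"], f["seq"]))


def recover(received, total_fragments):
    """Client side: place each received fragment back at its original
    position; positions never received stay None (the playback gaps, which
    by construction fall on the least important fragments)."""
    segment = [None] * total_fragments
    for fragment in received:
        segment[fragment["seq"]] = fragment["data"]
    return segment
```

If the download of a three-fragment segment is cut after two fragments, the gap lands on the least important position rather than at the end of the segment.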
Figure 5.8 shows a schematic diagram of the solution in an exemplary scenario. The top
line (both left and right) describes a typical segment transmission and presentation. The
bottom line describes a segment transmission and presentation using our solution. The
segment has video frames (I, P, B) and audio (a) frames or, more generally, access units
(AUs). For this explanation, in order to simplify the figure, it is assumed that one
fragment contains exactly one AU, although each AU can be divided into smaller
fragments if needed.
The left part shows the structure of the segment for transmission. In our solution
(prioritized segment), the fragments have been re-ordered by priority; but both
segments (top line and bottom line) represent the same content. Note that the prioritized
segment contains the same AUs as the regular one, but in a different order.
Now the segment is transmitted but, for some reason, the download (streaming) has to
stop in the middle (i.e. all the data under the highlighted square are lost, because they
have not been received by the end device) and it has to be sent to playout (this is the
presentation part, on the right-hand side of the image).
In the regular segment case (top), the outcome is simple. The client plays out the first
half of the segment, and then stops (black or frozen video, and no audio either). In the
prioritized segment case (bottom), it is different: as the packets carry a sequence number,
they are re-ordered by the client (end device) and displayed in their right position, and
only the least important packets have been lost. To simplify:
we have all the I and P frames, plus all the audio. The result is that the segment is
played out completely, although at a lower frame rate (33%), and with all the audio.
Of course, dropping the frame rate while keeping the audio is much better than losing
several seconds completely. According to the subjective assessment tests described in
Appendix A.2 (see also [27]), there could be a difference of 1 to 3 points on a MOS scale
(1 to 5) between both approaches.
It is important to note that the prioritized segment can be created without knowledge
of the network status between the content server and the end device. In other words,
the prioritized segment is generated once in the server, and all the end devices download
and play it. If there is no network congestion, the experience will be the same as with
the original segment: it will be correctly and completely displayed. However, if there is
a sudden network QoS drop, the end device will have its prioritized segment available
without anything special having to be done on the server side.
The solution thus has the following advantages:
• It allows recovering from buffer underruns in HAS in an optimal way. That is,
smaller HAS buffers can be used, thereby reducing the latency of the whole HAS
solution.
• It works passively, in the sense that neither the server nor the client has to change
its default behavior when facing network congestion.
• Besides, it provides a mechanism to mitigate the effect of high network rate vari-
ations.
• More generically, it makes it possible to use in HAS all the QoE enhancement
technology that has been developed for real-time RTP delivery, such as video
preparation for Fast Channel Change, unequal loss protection, or selective scram-
bling; that is, this solution allows using QoE enhancement techniques in a different
environment (HTTP delivery).
5.4 Selective Scrambling
The concept of selective scrambling means that, when cryptographically protecting a
multimedia asset or stream, only a (typically small) fraction of the data is scrambled,
whilst the rest is distributed in the clear. The reasons for such an approach are twofold:
on the one side, by leaving some specific information unscrambled, intermediate video-
processing systems can access the part of the data which is required for them to work
correctly (the rich transport data); on the other, keeping a reduced bit rate of scrambled
packets can be the only feasible solution for descrambling devices with limited computing
power, such as user terminals. Addressing the former problem is relatively simple, as
the specific data headers required by the network processors are typically well known.
The latter is more interesting, as it is necessary to find a good balance between
scrambling rate and protection effectiveness.
5.4.1 Problem statement and requirements
A user who is watching a partially scrambled content asset without being entitled to it
(and who therefore does not have the appropriate keys to descramble it) will experience
the same effect as a user who loses (for example, due to network errors) exactly the same
packets that are scrambled in the stream. From this point of view, selective scrambling
can be seen as a reverse rate-distortion optimization (RDO) problem. Unlike in the
typical RDO problem, however, the aim here is to maximize the final distortion for a
given rate of scrambled packets. In an ideal case, the resulting distortion should be so
high that no useful data can be extracted from the content. However, for many practical
applications, it may be enough that the resulting video quality is bad enough to
discourage the potential user from watching it. The underlying idea here is that, in
order to design a good selective scrambling algorithm, techniques for Quality of
Experience analysis can be used.
Notwithstanding, the design of selective scrambling schemes must be aware of the reasons
why such an algorithm is required: the scrambled video must be processable in the
network, and the descrambler must require only low computing power. Besides, using a
lightweight scheme on the scrambler side as well would broaden the applicability of the
scheme. Hence the requirements for the selective scrambling algorithm are to:
1. Be transparent to video servers (by leaving the “rich transport data” in the clear),
2. Scramble only a (low) percentage of the video packets,
3. Be implementable at low computational cost, and
4. Maximize the distortion introduced by the encrypted packets (i.e., do not allow
the video sequence to be recovered from the unscrambled packets without heavy
impairment).
5.4.2 Algorithms
Most existing commercial CAS/DRM solutions fulfill requirement 1. However, they
typically rely on the encryption of the full stream. There are several solutions in the
literature that address the partial encryption of the video stream. A description of the
state of the art can be found in the work of Massoudi et al. [69], who describe a set
of encryption techniques that achieve good visual degradation of the encrypted video
while scrambling only part of the packets. However, all of them either require deep
analysis of the video stream (thus not satisfying requirement 3) or scramble the video
headers to make the video impossible to decode (thus not meeting requirement 1).
Fan et al. propose encoding the most important data with higher security and the
less important data with lower security (and complexity) [20]. Shi et al. divide H.264
video elements into different classes, which are given different levels of protection [100].
In the work of Zou et al., different encryption levels can be reached by analyzing the
entropy coding of the H.264 stream [125]. These methods satisfy requirement 2, but
all of them require analyzing H.264 down to, at least, the macroblock level, which may
be computationally expensive (especially when CABAC entropy coding is used, as in
most IPTV streams).
The approach we propose exploits the error resilience characteristics of video coding
standards such as, but not limited to, H.264, where video frames are divided into slices.
It has been shown that, when a fragment of a video slice gets lost, the rest of the slice
becomes almost impossible to decode [86]. Therefore, by scrambling a small set of data
in each slice, it is possible to achieve very high video degradation.
This solution is especially suitable for multimedia deployments because:
• Commercial encoders use a low number of slices per frame (typically one in SDTV,
4-8 in HDTV, see section 2.4.1). Thus the fraction of video packets to encrypt
(the scrambling rate) is kept low.
• The information required to process video in a video server (i.e., stream- and
picture-level information) is contained in other H.264 syntax elements (NALUs,
Network Abstraction Layer Units) which are not slices, and in the headers of the
slices.
• The only analysis of the video stream required for this solution is detecting the
type of the NAL units, detecting slices and slice headers, and reading the coding
type of each frame. This can be performed at the H.264 Network Abstraction
Layer, i.e., it does not require analyzing anything beyond slice header level. This
makes processing much simpler than in any other selective scrambling algorithm.
Table 5.4: Minimum scrambling rate required to completely lose the video signal, as
subjectively assessed by expert viewers in the laboratory, for several content assets.

Content       Resolution   Bitrate    %Scrambled (selective)   %Scrambled (uniform)
Advertising   576i         2.7 Mbps   1%                       15%
News          576i         2.7 Mbps   1.5%                     15%
Movie         1080i        8.5 Mbps   0.8%                     10%
Movie         1080i        15 Mbps    0.3%                     5%
With these premises, we propose a scrambling scheme based on two layers [83]:
1. In each slice, scramble a small set of data just after the slice header. This
protects the video from real-time decoding with a very low scrambling rate.
2. After that, randomly scramble some sets of data from the rest of the VCL units,
as well as from other streams (e.g. audio). This second layer meets two aims:
first, audio streams are also scrambled, so that they are heavily impaired for
non-descrambling receivers; secondly, eliminating redundancy in the video stream
makes it impossible to decode, even for sophisticated offline error concealment
methods.
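A minimal sketch of the two-layer scheme, assuming slice payloads have already been extracted and using XOR as a stand-in for the real cipher; slice parsing, the burst length, and the 5% layer-2 rate are illustrative assumptions.

```python
import random


def scramble_bytes(data, start, length, key=0x5A):
    """XOR data[start:start+length] with a key byte (stand-in for a cipher;
    applying it twice restores the original bytes)."""
    out = bytearray(data)
    for i in range(start, min(start + length, len(out))):
        out[i] ^= key
    return bytes(out)


def selective_scramble(slices, header_len, burst=8, rate=0.05, rng=None):
    """Layer 1: scramble `burst` bytes right after each slice header.
    Layer 2: with probability `rate`, scramble a further burst per slice."""
    rng = rng or random.Random(0)
    out = []
    for payload in slices:
        payload = scramble_bytes(payload, header_len, burst)  # layer 1
        if rng.random() < rate and len(payload) > header_len + 2 * burst:
            payload = scramble_bytes(payload, header_len + burst, burst)  # layer 2
        out.append(payload)
    return out
```

Note how the slice header itself stays in the clear, so requirement 1 is preserved, while the bytes immediately after it are unusable without the key.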
5.4.3 Results
This algorithm has been tested with different scrambling rates on several contents
encoded in H.264. The processed video was then played without correctly descrambling
the packets, so that the scrambled packets behave as packet losses. The video was then
watched by expert viewers in the laboratory in order to assess the minimum scrambling
rate at which it was impossible to extract any information from it (i.e. the image was
completely impaired). This value has been compared with the minimum scrambling rate
required to obtain an equivalent result by randomly encrypting video packets. Detailed
results are shown in Table 5.4. Even with the limitations of the experiment, it can be
seen that, by encrypting no more than 2% of the transport packets, it is possible to
impair the video quality so badly that the resulting video is useless.
All the video samples under study used only one slice per picture. For video sequences
with N slices per picture, these values would have to be multiplied by a factor K ≤ N .
Even in that case, for most typical scenarios, the required scrambling rate would be
relatively small.
5.5 Fast Channel Change
As it has been shown in section 4.6.2, the channel change time can be modeled as
TCC = Tterm + Tnet + Tbuf + TRAP + Tdec (5.3)
where Tterm is the response time of the user terminal software, Tnet is the network
response time, Tbuf is the dejitter buffering time in the terminal, TRAP is the time needed
to reach a Random Access Point, and Tdec is the decoding start-up time, according to
the buffering model imposed by the encoder.
These factors are normally neither optimized nor easily optimizable in real deployments.
For this reason, specific solutions have been proposed to address this issue. The most
common one is the so-called “Rapid Acquisition of Multicast Stream” (RAMS), also de-
scribed as “unicast-based Fast Channel Change” in DVB-IPTV [19]. This solution is
based on Fast Channel Change (FCC) servers deployed as edge servers in the network,
which provide the following functionality:
1. When the user requests a channel change, the user terminal, instead of joining
the new multicast stream, requests a unicast stream from the FCC server. This
changes, and usually reduces, Tnet.
2. The FCC server then sends a unicast stream to the user terminal. This stream
starts from a Random Access Point in the past, reducing TRAP to virtually zero.
3. The stream is sent at a higher bitrate than that of the multicast stream, so that
at some point it catches up with the multicast. This point is signaled by the FCC
server so that the user terminal can switch to the multicast stream seamlessly.
The application of the standard solution, however, only solves part of the problem (Tnet
and TRAP). The user terminal can also set Tbuf to a minimum value (Tbuf-fcc ≈ 0) to
further reduce the channel change time. Since the unicast is received at a rate higher
than the nominal one, but only consumed at the nominal one, the excess bitrate can be
used to fill Tbuf up to its desired value after the video has started to be decoded.
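The refill arithmetic is simple: if the unicast burst runs at `ratio` times the nominal bitrate while playback consumes media at the nominal rate, the buffer grows by (ratio − 1) seconds of media per second of wall clock. A small sketch, with illustrative variable names:

```python
def refill_time(target_buf_s, ratio):
    """Wall-clock seconds needed to grow the dejitter buffer by
    target_buf_s seconds of media, given a unicast-to-nominal rate ratio."""
    if ratio <= 1.0:
        raise ValueError("the unicast rate must exceed the nominal rate")
    return target_buf_s / (ratio - 1.0)
```

For instance, with a unicast burst 30% above the nominal rate, rebuilding 0.6 s of buffer takes about 2 s of wall clock.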
There is still a relevant component of the channel change time, Tdec, which has not been
addressed so far. It poses an additional difficulty for two reasons: it is imposed by the
video encoder, and it is different for each of the streams (video, audio and subtitles).
To reduce it to its minimum value, we propose the following solution:
• At the beginning of the unicast session, re-multiplex the different elementary
streams in the FCC server so that they have similar Tdec values at the begin-
ning of the unicast stream. The easiest way is to increase the Tdec of the audio
and subtitling streams. Since all the elementary streams have been separated into
different RTP packets by the rewrapper processing (see section 3.5.2), this can be
done simply by reordering the RTP packets in the stream.
• Besides, reduce the Tdec of all the elementary streams together by re-stamping
the value of the PCR in the Transport Stream.
• Finally, during the unicast session, recover the original structure of the stream
by gradually undoing the re-multiplexing so that, when the session is switched to
multicast, the unicast and multicast streams are equal and the switchover can be
done seamlessly.
With this procedure, Tdec can be reduced down to about 100 ms. Since Tbuf, Tnet
and TRAP have also been reduced, almost instantaneous channel change times can be
obtained (in the range of 200 to 300 ms), provided that Tterm, which depends basically
on the software design of the user terminal application, can also be optimized.
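As a worked instance of equation (5.3), the arithmetic below compares an unoptimized channel change against the FCC-optimized one. All the millisecond figures are illustrative assumptions, except the roughly 100 ms Tdec and the 200-300 ms total mentioned above.

```python
def t_cc(t_term, t_net, t_buf, t_rap, t_dec):
    """Channel change time, eq. (5.3): the sum of its five components (ms)."""
    return t_term + t_net + t_buf + t_rap + t_dec


# Without FCC: reaching the next RAP and filling the dejitter buffer dominate.
baseline = t_cc(t_term=50, t_net=100, t_buf=400, t_rap=1000, t_dec=500)

# With FCC: TRAP is virtually zero, Tbuf is minimized and refilled later from
# the burst, and Tdec is re-stamped down to about 100 ms.
optimized = t_cc(t_term=50, t_net=50, t_buf=50, t_rap=0, t_dec=100)
```

Under these assumed figures, the channel change drops from about two seconds to 250 ms, inside the 200-300 ms range stated above.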
5.6 Application to 3D Video
In recent years the popularity of 3D video has increased strongly, mainly due to the
availability of last-generation stereoscopic displays both in cinemas and in consumer
television sets, as well as the production of several successful films using this technology.
As a result, today it is possible to buy a 3D television set and watch Blu-ray 3D content
at home at affordable prices. The next challenge is delivering 3D content via the different
types of multimedia delivery services, from traditional television broadcasting to over-
the-top content distribution.
Virtually all the 3D multimedia content available for these kinds of services is encoded
as stereoscopic pairs of images: two different video frames, one for each eye of the
viewer. The resulting content is therefore composed of two different video streams
(called “views”), which represent the same scene from two slightly separated points of
view. These two views can be encoded and transmitted in several ways [10]:
• As two different video streams (simulcast).
• Multiplexed into a single video stream. The most typical way to do this is side-by-side
(each half of the image, left and right, contains one view, and the player is able to
separate them).
• Using specific standards which exploit the redundancy between views, such as
H.264 MVC.
In all cases the video is encoded either in AVC or in MVC (which is itself an extension
of AVC), and therefore the approach presented in this thesis remains completely valid.
Since the basics of the coding structure are the same as in 2D video, the most relevant
errors in a 3D video delivery service will again be video macroblocking, audio losses,
quality drops, outages, and so on.
These artifacts, however, may have a different impact on the Quality of Experience, as
the viewing experience is completely different. To test this, the subjective quality
assessment methodology described in section 3.4 has been applied to assess stereoscopic
video subject to network errors [24, 25, 28]. The results show that these errors have an
impact in 3D video similar to the one they have in standard 2D video. The macroblocking
effect seems to be more annoying in 3D video due to visual rivalry (mismatch between
the left and right views). The other artifacts, however, seem to be slightly more tolerable
in 3D than in 2D video, perhaps because they are somehow masked by the added value
provided by the stereoscopic experience.
The other relevant difference is that, in the coding schemes where each view is encoded
in a different frame (i.e. simulcast and MVC), there is a new dimension in terms of
scalability. In other words, one of the views can be signaled with a lower priority than
the other so that, in case of an error-prone channel or network congestion, all the errors
are concentrated in a single view. In these situations, dropping one view and switching
to 2D video is an option that can complement the drops in bit or frame rate discussed
in section 4.4 [27].
Chapter 6
Conclusions
The market for multimedia content distribution is undergoing rapid and continuous evo-
lution, which started with the standardization of digital television broadcasting in the
1990s, continued with the deployment of triple-play offers, with the special relevance of
interactive IPTV, in the 2000s, and is moving towards global multi-screen OTT services
in the 2010s. The service offer is increasingly rich and tends to be more personalized
and focused on the expectations of individual users. New players and business models
appear in the marketplace while video traffic over communication networks increases
its relative weight in the total amount of transported data.
Within this complexity, the underlying technology has a common definition: the delivery
of digitally-encoded multimedia streams (from short advertisement clips to unbounded
television channels) over a packet network. A very restricted set of technologies (MPEG
codecs and multiplexers, and IP transport) covers a good fraction of the present and
upcoming services. Therefore, as seen in section 3.2, it is possible to define a generic
architecture which can model the most relevant service scenarios.
The same effort to establish a general architecture can be made for the problem of
monitoring multimedia Quality of Experience (QoE) in such multimedia services.
In section 3.3 we propose QuEM (Qualitative Experience Monitoring): a monitoring
framework aimed at obtaining meaningful descriptions of the impairments present in
the service [84, 85], which can be used as a replacement for purely Packet Loss Rate
(PLR) based monitoring systems. Each measure introduced in the framework must
work under real conditions (lightweight processing and bitstream-based), and must be
repeatable, in the sense that it must be possible to artificially generate the error
conditions measured. The output of the measurement block (called a QuID, Quality
Impairment Detector) can afterwards be mapped to a severity value through a
user-defined Severity Transfer Function (STF). The measures proposed in the QuEM
system cover the most relevant artifacts present in existing multimedia delivery services.
The QuEM approach has also inspired a novel methodology for subjective assessment
tests, described in section 3.4. It aims at reproducing as closely as possible the viewing
conditions of the final user of the services. Therefore the content shown is intended to
be meaningful for the viewer and, more relevantly, the content is displayed in a nearly
continuous way. This methodology can be used for the validation and calibration of
QuIDs [25, 28, 85].
Besides, section 3.5 introduces new features that simplify the management of multi-
media content in the transport network by using “rich transport data”, which make the
network aware of part of the video information. Among them we can cite the homog-
enization of interfaces in video processing elements [87], the introduction of metadata
synchronized with the video stream [9], the intelligent re-wrapping of video into trans-
port packets, and the processing of video at the edge of the network [108].
The most relevant source of impairments in a real deployment is the loss of video
packets. Section 4.2 describes a proposed metric to predict the effect of video packet
losses (PLEP, Packet Loss Effect Prediction) [86]. By monitoring the Network
Abstraction Layer (NAL) of H.264 video and following the chain of references, it
provides a reasonably reliable description of the extent and duration of the artifacts
associated with the loss of video packets. Experiments show that it clearly outperforms
simple PLR monitoring, while still being applicable to the monitoring of real multimedia
services.
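The core idea can be sketched as follows. This is an illustrative simplification, not the published PLEP metric; the function and parameter names are ours.

```python
# Illustrative simplification of the idea behind PLEP: from the position of
# a lost packet within the GOP, estimate how far the macroblocking artifact
# propagates through the chain of references.

def predict_loss_effect(lost_frame_index, gop_length, is_reference,
                        frac_of_frame_lost):
    """Return (spatial extent, duration in frames) of the visible artifact."""
    if not is_reference:
        # A loss in a non-reference frame (e.g. a discardable B frame)
        # is visible in that frame only.
        return frac_of_frame_lost, 1
    # A loss in a reference frame propagates until the next intra refresh.
    return frac_of_frame_lost, gop_length - lost_frame_index

# Half a frame lost early in a 24-frame GOP: the artifact lasts 22 frames.
extent, duration = predict_loss_effect(2, 24, True, 0.5)
```

The contrast between the two branches captures why the position of a loss matters far more than the raw loss rate: the same lost packet can cost one frame or nearly a whole GOP.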
The PLEP metric is complemented with other measures, such as the monitoring of audio
packet losses (section 4.3), video coding quality (section 4.4) and outages (section 4.5),
to cover the remaining relevant impairments. In all cases, subjective assessments suggest
that the monitoring of basic parameters is enough to contribute to the QuEM system
in a significant way. In the specific case of video coding quality, bit and frame rates are
taken as proxy metrics. Finer monitoring of quality within the same bitrate cannot be
reliably addressed with No-Reference quality metrics [81].
For completeness, section 4.6 studies the latency-related measures that can affect QoE:
end-to-end lag and channel change time. Both can be obtained analytically from timing
measurements in the network, from the timing and buffering information present in the
multiplexing headers, and from the (constant) additional buffering introduced by the
encoder and decoder elements.
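As a back-of-the-envelope illustration of this analytic approach, channel change time can be decomposed into a sum of measurable delay components. The figures below are placeholders, not values measured in the thesis.

```python
# Illustrative decomposition of channel change (zapping) time into delay
# components. All numbers are placeholder assumptions, not measurements.

delay_ms = {
    "network_signaling": 50,   # e.g. multicast join latency
    "wait_for_rap":     250,   # average wait for a random access point
    "decoder_buffering": 150,  # buffer level mandated by the mux headers
    "decode_and_render":  40,  # constant decoder/display pipeline delay
}
channel_change_ms = sum(delay_ms.values())  # 490 ms in this example
```

Since each term is either measurable in the network or constant for a given encoder/decoder pair, the total can be predicted without instrumenting the receiver.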
The comparison of the different impairments (section 4.7) shows the importance of
controlling the effect of packet losses: the impact of losing packets in video no-reference
frames is much less aggressive than the loss of audio packets, for instance. Therefore,
with the right knowledge of the effect of network events in QoE, it should be possible to
design network systems whose policies are optimized towards the final perceived quality.
Following this idea, section 5.2 proposes applying a simplification of the PLEP model
to perform packet prioritization for Unequal Error Protection: error correction and congestion
control [82]. The solution addresses specifically short-term protection decisions, where
the error correction system has to decide which packets to protect (or which ones to
drop) within a short window of time, based on the potential impact of their loss. Thus
it is especially suitable for real-time multimedia transmissions.
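The short-window decision described above can be sketched as follows. This is an illustration in the spirit of section 5.2, not the actual algorithm; the impact scores are assumed to come from a simplified loss-effect model and are example numbers here.

```python
# Illustrative short-window packet prioritization for Unequal Error
# Protection: within a window, protect (or avoid dropping) the packets
# whose loss is predicted to hurt quality most.

def packets_to_protect(window, budget):
    """window: list of (packet_id, predicted_loss_impact) pairs."""
    ranked = sorted(window, key=lambda p: p[1], reverse=True)
    return [pid for pid, _ in ranked[:budget]]

window = [("p1", 0.9),   # e.g. slice of a reference frame early in the GOP
          ("p2", 0.1),   # e.g. a non-reference frame: low impact if lost
          ("p3", 0.6),
          ("p4", 0.3)]
protected = packets_to_protect(window, budget=2)  # → ["p1", "p3"]
```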
This principle can also be applied to HTTP Adaptive Streaming (section 5.3), by sorting
the packets within a segment in priority order. This way, if the segment download has
to be interrupted for any reason, the impact on the final QoE will be minimized [88].
The same concept can be reversed by searching, under certain conditions, for the
packets whose loss has the strongest impact on quality. In other words, it is possible to
use the PLEP model to maximize the effect on QoE for a given loss rate. Section 5.4
describes how to apply this idea to a selective scrambling environment. By encrypting
a small number of packets in the video stream, it is possible to render the signal virtually
impossible to decode from the remaining packets [83, 109].
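The reversal can be sketched as follows. This is illustrative, not the actual algorithm of section 5.4; the 5% fraction and the impact scores are assumptions.

```python
# Illustrative selective scrambling: rank packets by the damage their
# absence would cause and encrypt only the top few, leaving the rest of
# the stream useless without them. Fraction and scores are assumptions.

def packets_to_scramble(packets, fraction=0.05):
    """packets: list of (packet_id, predicted_loss_impact) pairs."""
    ranked = sorted(packets, key=lambda p: p[1], reverse=True)
    count = max(1, int(len(ranked) * fraction))
    return {pid for pid, _ in ranked[:count]}

# 40 packets; every 10th carries key reference data (high impact if lost).
stream = [("p%d" % i, 1.0 if i % 10 == 0 else 0.1) for i in range(40)]
scrambled = packets_to_scramble(stream)   # only 2 of 40 packets encrypted
```

The design choice mirrors the UEP case with the objective inverted: instead of protecting the highest-impact packets, they become the only ones worth encrypting.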
In section 5.5 we propose a solution to reduce the zapping time in IPTV channels.
The analysis of the decoding and buffering processes makes it possible to accelerate the
start-up of the stream decoding after a channel change, reaching zapping times below
500 ms.
Finally, the metrics and applications described can also be applied to stereoscopic
multimedia delivery (and, more specifically, 3DTV) [10]. Section 5.6 discusses this, with
special reference to the application of the subjective quality assessment methodology to
3DTV environments [28], both in IPTV [24, 25, 26] and OTT/HAS [27].
In summary, this thesis proposes a comprehensive approach to the monitoring and
management of Quality of Experience in IP multimedia delivery services. With the
appropriate framework and a deep knowledge of the “rich transport data” it is possible
to enhance quality monitoring, minimize the effect of losses, maximize the power of
encryption systems, or improve the zapping time of the service without significantly
increasing its complexity or cost. The proposed approach is also transparent in the
information it provides and the processing it performs: any fine tuning can be done by
the user of the system, and there is no a priori dependence on empirical parameters or
training data. These properties make it perfectly suitable for the needs of multimedia
service providers and, in fact, some of the proposals of this thesis are already present
in several IPTV deployments around the world. And we believe that, even for scenarios
where
our proposals are not of direct application (for technical or commercial reasons), the
information contained in this work can be helpful to any person who has to address
the complex problem of modeling, monitoring and managing the Quality of Experience
provided by a multimedia delivery service over IP.
Future work is envisioned in three complementary directions. Firstly, enhancing the
QuEM model with additional experiments that can provide better calibration for the
impairment detectors, as well as outlining models to evaluate the effect of multiple
artifacts (either simultaneously or sequentially) over a short period of time. Providing
simple and robust mechanisms to aggregate quality data is still an open challenge in
the context of multimedia delivery services. Secondly, continuing the application of the
ideas discussed here to the field of 3DTV and stereoscopic video. And finally, expanding
the applicability of the model to HTTP Adaptive Streaming
environments. With the popularization of this technology as the basis for OTT delivery
services, there is an opportunity to develop new applications that can make an optimal
use of the network resources and manage the Quality of Experience of the end-to-end
service.
Appendix A
Experimental setup
A.1 Introduction
This Appendix describes the most relevant experiments used in this thesis. Section A.2
details the subjective quality assessment tests performed with the methodology described
in section 3.4, and focused on the evaluation and calibration of the QuIDs. Their results
have been used in several subsections of chapter 4. Section A.3 describes a previous set
of subjective quality assessment tests, aimed at evaluating the video quality produced
by video encoders provided by different manufacturers. These tests have been used in
section 4.4. Finally section A.4 describes the set of contents used for several objective
quality experiments. Those contents have been taken from IPTV field deployments (or
laboratory trials) and have therefore been selected as target contents for the developed
algorithms to work with.
A.2 Subjective Assessment based on QuEM approach
This section describes the details of the specific set of tests done to calibrate the Quality
Impairment Detectors under study, and whose most relevant results have been presented
in chapter 4, together with the description of the QuIDs. The tests were performed
following the methodology that has been described in section 3.4.
A.2.1 Selection and preparation of content
According to the test methodology, selected content must be representative of what a
multimedia service usually offers, as well as significant to the viewers. An important
target of the test methodology is reproducing the experience of a user watching video at
home. To achieve that, it is important that the user perceives the video stream as
meaningful, and not just as a simple evaluation sequence whose contents are irrelevant.
Three content sources were selected for the tests, each one with a duration of 5 minutes
and 30 seconds:
• A movie sequence: a cut from Avatar. It is a film with detailed image information
(making it suitable for subjective tests), and it was reasonably popular at the time
the tests were performed. As an added value, it was released in 3D and could be
used to compare 2D vs 3D impairments easily.
• A sports sequence. In particular, a cut was selected from the extra time of the
final match of the 2010 FIFA World Cup, including the goal which resulted
in the victory for the Spanish team. It was probably the most relevant sports
content available.
• A documentary sequence. A high-quality video was selected. Documentaries are
relevant in those tests because their audiovisual features (length of scenes, type of
camera movement. . . ) are quite different from the ones of sports and cinema. The
documentary was also available in 3D.
The sources were compressed using a professional H.264 video encoder. Selected reso-
lutions and bitrates are described in Table A.1. Lower bitrate versions of the sequences
were also generated to simulate bitrate drops.
Table A.1: Video test sequences: bitrate and resolution
Source        Format             Bitrate
Movie         1920x1080p 24fps   8 Mbps
Sports        720x576p 25fps     4 Mbps
Documentary   720x576p 25fps     4 Mbps
The resulting streams were then chunked into 12-second segments for the tests and
processed by a rewrapper. Impairments were introduced in the first half of each of the
segments.
A.2.2 Selection of impairments
The selection of impairments was done to cover a sufficient range of error cases related to
the metrics that were going to be evaluated and calibrated (the ones defined in chapter
4).
A.2.2.1 Bitrate drops
To simulate the effect of a bandwidth drop, the first half of the segment was re-encoded
using a different bitrate, which was a fraction of the original one. Two different impair-
ments were defined (called R1 and R2) as detailed in Table A.2.
Table A.2: Bitrate drops
Test                       R1    R2
Bitrate (% of reference)   50%   25%
A.2.2.2 Frame rate drops
In these test cases, the first half of the segment is transmitted using a lower frame rate,
which is a fraction of the original one. The frame rate reduction is achieved by
discarding some B frames from the original stream. Two different impairments were
defined, as detailed in Table A.3.
Table A.3: Frame rate drops
Test                          F1    F2
Frame Rate (% of reference)   50%   25%
A.2.2.3 Audio losses
These impairments are implemented by discarding audio packets in the middle of the
first half of the segment. The shortest loss length, achieved by dropping a single audio
packet, produced a silence of about 200 ms. Longer lengths were achieved by dropping
consecutive packets. Test cases A5 and A6 introduced a sequence of several short losses
separated by approximately 1 second. Impairments are detailed in Table A.4. The ‘total
duration’ represents the time from the beginning of the first audio mute to the end of
the last one.
A.2.2.4 Video losses: macroblocking
The macroblocking effect caused by a transmission loss can be roughly characterized
using three parameters:
Table A.4: Audio losses
Test                 A1    A2    A3   A4   A5    A6
Loss length (s)      0.2   0.5   2    6    0.2   0.2
Loss events          1     1     1    1    3     7
Total duration (s)   0.2   0.5   2    6    2     6
• The fraction of the picture affected (position of the loss within the frame).
• The duration of the artifact due to error propagation (position of the loss within
the GOP).
• The loss pattern (i.e. the effect of losing several packets in several frames).
To simplify the test cases, the following restrictions were imposed:
• There would be at most one packet loss in each GOP.
• Loss patterns would be established by introducing the same type of packet loss in
several consecutive GOPs.
Impairments are detailed in Table A.5. ‘MIN’ means that the impairment occurred in a
no-reference frame, and therefore its effect did not propagate through the GOP.
Table A.5: Macroblocking errors
Test             E1    E2   E3   E4    E5   E6   E7   E8
% of Frame       100   25   50   100   50   50   50   50
% of GOP         MIN   90   90   90    90   90   25   25
Number of GOPs   1     1    1    1     3    5    3    5
The rationale for this selection of impairments is the following:
• E1 — Verify that the loss of isolated no-reference frames has no effect on the
perceived quality.
• E2–E4 — Analyze the effect of single packet losses.
• E5–E8 — Analyze the effect of multiple packet losses.
A.2.2.5 Video freezing
Video freezing was achieved by the loss of a single I frame (or its header), so that the
whole picture remains still until the beginning of the next GOP. The lengths of the freezes
were selected as multiples of the GOP length (half a second), as shown in Table A.6.
Table A.6: Video freezing
Test                  V1    V2   V3
Freeze duration (s)   0.5   2    6
A.2.2.6 Impairment sets
The selected impairments were structured in impairment sets: groups of related
impairments, as described in Table A.7. ‘N’ represents a hidden reference (no
impairment). ‘AV’ is the combination of A4+V3 (a 6-second audio mute plus a video
freeze, i.e., a 6-second full outage).
Table A.7: Impairment sets
Impairment Set    Freq.   Impairments    Description
Rate Drop         3       R1 R2 F1 F2    Reaction to bandwidth changes
Audio Loss 1      3       A1 A2 A3 A4    Audio mute length
Audio Loss 2      3       A3 A4 A5 A6    Continuous vs. periodic mutes
Macroblocking 1   3       E1 E1 N N      Detectability of no-reference loss
Macroblocking 2   3       E3 E4 E5 E6    Impairment duration
Macroblocking 3   3       E5 E6 E7 E8    Effect of % of GOP affected
Single Loss       5       V1 E2 E3 E4    Effect of a single video packet loss
Outage 1          1       V2 V3 A3 A4    Audio vs video outages
Outage 2          1       V3 A4 AV AV    Audio vs video vs both
The ‘Freq.’ (frequency) label indicates the number of times that each impairment set
appears in each test sequence. The sum of all the frequencies is 25, which means that
25 different impairments were introduced in each test sequence: one impairment every
12 seconds.
For each of the three video test sequences (movie, sports and documentary), the following
steps were followed:
1. Each segmented sequence was replicated 4 times, to create 4 different variants.
Figure A.1: Structure of the content streams in the subjective assessment test session
2. The 25 occurrences of the impairment sets were randomized, as well as the 4 dif-
ferent impairments within each set. This way, 4 different sequences of impairments
were generated, each one having 25 impairments.
3. Each sequence of impairments was applied to each of the variants, i.e., impairments
were introduced in the first halves of the segments accordingly.
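The randomization in steps 1 to 3 can be sketched as follows, using hypothetical data structures (this is not the actual test-preparation script): the 25 impairment-set occurrences are shuffled, the impairments inside each set are shuffled as well, and each variant then takes a different impairment from every set occurrence.

```python
# Sketch of the randomization procedure (hypothetical data structures).
import random

def build_variants(set_occurrences, n_variants=4, seed=1):
    """set_occurrences: list of impairment sets (4 impairments each)."""
    rng = random.Random(seed)
    order = list(set_occurrences)
    rng.shuffle(order)                  # randomize the order of the sets
    variants = [[] for _ in range(n_variants)]
    for impairment_set in order:
        choices = list(impairment_set)
        rng.shuffle(choices)            # randomize within each set
        for v in range(n_variants):
            # each variant receives a different impairment from the set
            variants[v].append(choices[v % len(choices)])
    return variants

# Toy example with 6 set occurrences (the real tests used 25).
sets = [["R1", "R2", "F1", "F2"]] * 3 + [["A1", "A2", "A3", "A4"]] * 3
variants = build_variants(sets)         # 4 variants, 6 impairments each
```

In each evaluation period the four variants thus show impairments drawn from the same set, which is exactly the structure visible in Table A.8.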
The resulting sequences have the structure shown in Figure A.1, where the impairments
introduced in each of the evaluation periods Ti belong to the same impairment set. Table
A.8 shows an example of some of them —they are the first 13 impairments introduced
in each of the variants of the sports sequence in the final tests.
Table A.8: Example of a sequence of impairments
Variant   T1   T2   T3   T4   T5   T6   T7   T8   T9   T10  T11  T12  T13  ...
A         A4   V4   E8   E4   E1   A5   F2   A4   E5   A3   V1   A6   F1   ...
B         A6   AV   E7   E3   E1   A6   F1   V2   E6   A1   E3   A5   R1   ...
C         A5   V3   E5   E6   N    A4   R2   A3   E7   A4   E2   A3   F2   ...
D         A3   AV   E6   E5   N    A3   R1   V3   E8   A2   E4   A4   R2   ...
Figure A.2: Summary of the subjective quality assessment test results
A.2.3 Test sessions
Tests were carried out in the laboratories of the Universidad Politecnica de Madrid.
The viewing room was set up with correct lighting conditions, according to international
standard recommendations for home environment tests. Specifically, a 42-inch full-HD
Panasonic television was used, and the observers were placed at a viewing distance of
3 times the height of the TV set.
A total number of 42 observers, 35 male and 7 female, took part in the experiment. All
of them had normal visual acuity and color vision. The ages of the subjects ranged
between 20 and 48 years, with an average age of 27. A maximum of 4 people took
part in each of the assessment sessions. In each session, the viewers were shown one
variant of each of the three test sequences (movie, sports and documentary). This way,
each variant was assessed by at least 10 different viewers.
Figure A.2 shows a summary of the results for each impairment and content stream.
A.3 Subjective quality assessment of H.264 video encoders
This section describes a set of subjective quality assessment tests, performed as part of
a comparative study of coding quality of several IPTV H.264 encoders. The results of
these tests have been used as benchmark to evaluate NR and RR objective assessment
metrics in section 4.4.1.
In this test set, 7 different encoder implementations —from 7 different manufacturers—
were analyzed, using several bitrates and several source video sequences. All the source
sequences are cuts of television programs in contribution quality. This way, the only
impairment introduced in the tests is the one generated by the compression process of
the encoder.
Tests were Single Stimulus (SS), and they followed the recommendation ITU-R BT.500
[42]. The viewer was presented with a 10-second sequence, whose quality had to be
judged using a MOS scale (from 1 —“bad”— to 5 —“excellent”—).
Four 10-second sequences (two from a football match, two from a live music show) were
encoded with seven different implementations of H.264 encoders (from different vendors),
each one at five different bitrates: 1.4, 1.7, 2.0, 2.3 and 2.6 Mbps. They were SDTV
sequences encoded at Main Profile, level 3. A hidden reference was included as well.
The selected content assets came from real contributions of an IPTV network and present
demanding coding requirements (movement, textures, capture in interlaced format...).
The target bitrates represented the range of real IPTV deployments, and the encoder
configuration was provided by each vendor. Thus the environment was as close as
possible to a real commercial service.
For the tests 20 non-expert observers, balanced in age and gender, were selected and
divided into 4 sessions of 5 participants each. They were presented the sequences, in
random order, and asked to evaluate their quality with a MOS scale (1 to 5), according
to the specifications in [42]. Additionally, 6 stabilizing cuts were added at the beginning
of each viewing session, whose votes were not taken into account for the final results.
Figure A.3 shows the results of one of the sequences for all the H.264 encoders. It is
worth noting that the different codec implementations obtain quite different marks. This
should prevent us from generalizing the behavior of the H.264 standard when only one
implementation is used. In other words, there is no common “AVC quality” for a given
content and bitrate; it will depend on the specific encoder implementation.
Figure A.3: Subjective MOS for a football video test sequence. Each color represents
a different encoder. The original sequence was ranked with MOS=4.2.
A.4 Test sequences from IPTV deployments
This section describes the set of sequences used to test some of the algorithms and
applications that have been presented in this thesis. The main target of those tests is
developing techniques which have to be applicable in real multimedia delivery services.
For that reason, test sequences have been selected from streams used to validate services
in the field: all of them are captures either from a real field deployment or from a
validation laboratory of an IPTV service. There is therefore more interest in the way
the sequences are encoded than in the specific content that was shown at that
moment (which is something that can rarely be selected when doing the capture).
The properties of the different sequences are described in Table A.9, and their source
content is the following:
1. Sequence A is a scene from an action movie (Die Hard 4).
2. Sequence B is a documentary.
3. Sequences C and D are advertisements (the same source sequence with different
encoding settings).
The following clarifications can be made about the table:
Table A.9: Test sequences
Sequence               A        B        C        D
TS Bitrate (Mb/s)      2.8      2.5      2.7      2.7
Video                  H.264    H.264    H.264    H.264
Video Bitrate (Mb/s)   2.3      2.0      2.0      2.0
Video Profile          Main     Main     Main     Main
Video Level            3.0      3.0      3.0      3.0
Video Resolution       720x576  544x576  480x576  480x576
Aspect Ratio           16/9     4/3      4/3      4/3
Picture Rate           50i      50i      50i      50i
IDRs                   Yes      No       Yes      Yes
Slices per picture     1        1        1        1
GOP length             100      24       24       12
P frame period         4        4        3        3
Hierarchical GOP       Yes      Yes      No       No
No. of audio streams   2        2        1        1
Audio Format           MP1L2    MP1L2    MP1L2    MP1L2
Audio Bitrate (kb/s)   192      192      192      192
• A P frame period of 4 means that there are 3 B frames between consecutive
P or I frames (i.e., the structure is IBBBP). Similarly, a P period of 3 represents
an IBBP structure.
• A hierarchical GOP structure (“...IBBBP...”) is like the one discussed in section
2.4.1 and depicted in Figure 2.4 on page 31.
• As it was mentioned in section 4.2, in IPTV scenarios it is frequent that some
streams use I frames which are not IDRs. This is the case of sequence B.
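As a small illustration of the frame structures described above, the following hypothetical helper produces the display-order frame-type pattern of a GOP from its length and P frame period (it ignores hierarchical structures and non-IDR I frames):

```python
# Illustrative helper: display-order frame-type pattern of a GOP, given
# the GOP length and the P frame period (simplified; no hierarchy).

def gop_pattern(gop_length, p_period):
    frames = []
    for i in range(gop_length):
        if i == 0:
            frames.append("I")          # GOP starts with an intra frame
        elif i % p_period == 0:
            frames.append("P")          # anchor frame every p_period frames
        else:
            frames.append("B")          # bidirectional frames in between
    return "".join(frames)

gop_pattern(12, 3)   # → "IBBPBBPBBPBB" (sequences C and D in Table A.9)
gop_pattern(12, 4)   # → "IBBBPBBBPBBB"
```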
Bibliography
[1] K. Ahmad and A. C. Begen. IPTV and video networks in the 2015 timeframe: The
evolution to medianets. IEEE Communications Magazine, 47(12):68–74, December
2009.
[2] ANSI T1.801.02-1996. American National Standard for Telecommunications -
digital transport of video teleconferencing/video telephony signals - performance
terms, definitions, and examples, 1996.
[3] A. C. Begen, C. Perkins, and J. Ott. On the use of RTP for monitoring and fault
isolation in IPTV. IEEE Network, 24(2):14–19, March-April 2009.
[4] Brix Networks. Video quality measurement algorithms: Scaling IP video services
for the real world, 2006.
[5] Broadband Forum. TR 176. ADSL2Plus configuration guidelines for IPTV – v3.0,
September 2008.
[6] G. Cermak, M. Pinson, and S. Wolf. The relationship among video quality, screen
resolution, and bitrate. IEEE Transactions on Broadcasting, 57(2):258–262, June
2011.
[7] G. W. Cermak. Consumer opinions about frequency of artifacts in digital video.
IEEE Journal of Selected Topics in Signal Processing, 3(2):336–343, April 2009.
[8] P. Coverdale, S. Moller, A. Raake, and A. Takahashi. Multimedia quality assess-
ment standards in ITU-T SG12. IEEE Signal Processing Magazine, 28(6):91–97,
November 2011.
[9] J. M. Cubero, A. M. Sanz, E. Estalayo, P. Perez, F. Jaureguizar, J. Cabrera, and
J. J. Ruiz. Gestion y aplicacion de metadatos asociados al trafico multimedia en
videoconferencia 3D. In XX Jornadas Telecom I+D, September 2010. Valladolid,
Spain.
[10] J. M. Cubero, J. Gutierrez, P. Perez, E. Estalayo, J. Cabrera, F. Jaureguizar, and
N. Garcia. Providing 3D video services: The challenge from 2D to 3DTV quality
of experience. Bell Labs Technical Journal, 16(4):115–134, March 2012.
[11] N. Degrande, K. Laevens, D. Vleeschauwer, and R. Sharpe. Increasing the user
perceived quality for IPTV services. IEEE Communications Magazine, 46(2):94–
100, February 2008.
[12] C. Diaz, J. Cabrera, F. Jaureguizar, and N. Garcia. A video-aware FEC-based
unequal loss protection scheme for RTP video streaming. In IEEE Int. Conf. on
Consumer Electronics, ICCE 2011, Jan 2011. Las Vegas (NV), United States.
[13] R. Dosselmann and X. Yang. A comprehensive assessment of the structural simi-
larity index. Signal, Image and Video Processing, 5:81–91, March 2011.
[14] M. Ellis and C. Perkins. Packet loss characteristics of IPTV-like traffic on resi-
dential links. In IEEE Consumer Communications and Networking Conference,
CCNC 2010, January 2010. Las Vegas (NV), United States.
[15] U. Engelke and H. J. Zepernik. Perceptual-based quality metrics for image and
video services: a survey. In Conf. on Next Generation Internet Networks, May
2007. Trondheim, Norway.
[16] U. Engelke, T. M. Kusuma, and H.-J. Zepernick. Perceptual quality assessment of
wireless video applications. In Int. Symposium on Turbo Codes & Related Topics,
April 2006. Munich, Germany.
[17] B. Erman and E. P. Matthews. Analysis and realization of IPTV service quality.
Bell Labs Technical Journal, 12(4):195–212, February 2008.
[18] ETSI TS 101 154 v1.10.1. Digital Video Broadcasting DVB; specification for the
use of video and audio coding in broadcasting applications based on the MPEG-2
transport stream, 2011.
[19] ETSI TS 102 034 v1.4.1. Digital Video Broadcasting DVB; transport of MPEG-2
based DVB services over IP based networks, 2009.
[20] Y. Fan, J. Wang, T. Ikenaga, Y. Tsunoo, and S. Goto. An unequal secure en-
cryption scheme for H.264/AVC video compression standard. IEICE Transactions
on Fundamentals of Electronics, Communications and Computer Sciences, 91(1):
12–21, January 2008.
[21] M. C. Q. Farias and S. K. Mitra. No-reference video quality metric based on artifact
measurement. In IEEE Int. Conf. on Image Processing, ICIP 2005, September
2005. Genoa, Italy.
[22] T. Friedman, R. Caceres, and A. Clark. RTP Control Protocol Extended Reports
(RTCP XR). RFC 3611 (Proposed Standard), November 2003.
[23] F. Gabin, M. Kampmann, T. Lohmar, and C. Priddle. 3GPP mobile multimedia
streaming standards [standards in a nutshell]. IEEE Signal Processing Magazine,
27(6):134–138, November 2010.
[24] J. Gutierrez, P. Perez, F. Jaureguizar, J. Cabrera, and N. Garcia. Subjective
assessment of the impact of transmission errors in 3DTV compared to HDTV. In
IEEE 3DTV Conference, May 2011. Antalya, Turkey.
[25] J. Gutierrez, P. Perez, F. Jaureguizar, J. Cabrera, and N. Garcia. Subjective evalu-
ation of transmission errors in IPTV and 3DTV. In IEEE Visual Communications
and Image Processing, VCIP 2011, November 2011. Tainan, Taiwan.
[26] J. Gutierrez, P. Perez, F. Jaureguizar, J. Cabrera, and N. Garcia. Monitoring
packet loss impact in IPTV and 3DTV receivers. In IEEE Int. Conf. on Consumer
Electronics, ICCE 2012, January 2012. Las Vegas (NV), United States.
[27] J. Gutierrez, P. Perez, F. Jaureguizar, J. Cabrera, and N. Garcia. Subjective
study of adaptive streaming strategies for 3DTV. In IEEE Int. Conf. on Image
Processing, ICIP 2012, October 2012. Orlando (FL), United States.
[28] J. Gutierrez, P. Perez, F. Jaureguizar, J. Cabrera, and N. Garcia. Validation of a
novel approach to subjective quality evaluation of conventional and 3D broadcasted
video services. In Int. Workshop on Quality of Multimedia Experience, QoMEX
2012, July 2012. Yarra Valley, Australia.
[29] H. Ha, J. Park, S. Lee, and A. C. Bovik. Perceptually unequal packet loss protec-
tion by weighting saliency and error propagation. IEEE Transactions on Circuits
and Systems for Video Technology, 20(9):1187–1199, September 2010.
[30] R. Haimi-Cohen. Prioritized retransmission of internet protocol television (IPTV)
packets, December 2008. US Patent Application US 2010/0138885 A1.
[31] D. S. Hands. A basic multimedia quality model. IEEE Transactions on Multimedia,
6(6):808–816, December 2004.
[32] S. Hawley and G. Schultz. IPTV Video Quality: QoS & QoE. Quarterly Technology
and Content Report. Multimedia Research Group, Inc., February 2007.
[33] S. S. Hemami and A. R. Reibman. No-reference image and video quality esti-
mation: Applications and human-motivated design. Signal Processing: Image
Communication, 25(7):469–481, August 2010.
[34] O. Hohlfeld, R. Geib, and G. Hasslinger. Packet loss in real-time services: Marko-
vian models generating QoE impairments. In IEEE Int. Workshop on Quality of
Service, IEEE IWQoS 2008, June 2008. Enschede, the Netherlands.
[35] Q. Huynh-Thu, M.-N. Garcia, F. Speranza, P. Corriveau, and A. Raake. Study
of rating scales for subjective quality assessment of high-definition video. IEEE
Transactions on Broadcasting, 57(1):1–14, March 2011.
[36] ISO/IEC 13818-1:2007. Information technology – generic coding of moving pictures
and associated audio information: Systems, 2007.
[37] ISO/IEC 13818-2:2000. Information technology – generic coding of moving pictures
and associated audio information: Video, 2000.
[38] ISO/IEC 14496-10:2012. Information technology – coding of audio-visual objects
– Part 10: Advanced Video Coding, 2012.
[39] ISO/IEC 23009-1:2012. Information technology – dynamic adaptive streaming over
HTTP (DASH) – Part 1: Media presentation description and segment formats,
2012.
[40] O. Issa, W. Li, H. Liu, F. Speranza, and R. Renaud. Quality assessment of
high definition TV distribution over IP networks. In IEEE Int. Symposium on
Broadband Multimedia Systems and Broadcasting, BMSB 2009, May 2009. Bilbao,
Spain.
[41] ITU-R Tech. Rec. BS.1387. Method for objective measurements of perceived audio
quality, 2001.
[42] ITU-R Tech. Rec. BT.500-11. Methodology for the subjective assessment of the
quality of television pictures, 2002.
[43] ITU-T Tech. Rec. G.1080. Quality of experience requirements for IPTV services,
2008.
[44] ITU-T Tech. Rec. G.1081. Performance monitoring points for IPTV, 2008.
[45] ITU-T Tech. Rec. J.144. Objective perceptual video quality measurement tech-
niques for digital cable television in the presence of a full reference, 2004.
[46] ITU-T Tech. Rec. J.147. Objective picture quality measurement method by use
of in-service test signals, 2002.
[47] ITU-T Tech. Rec. J.247. Objective perceptual multimedia video quality measure-
ment in the presence of a full reference, 2008.
[48] ITU-T Tech. Rec. J.249. Perceptual video quality measurement techniques for
digital cable television in the presence of a reduced reference, 2010.
[49] ITU-T Tech. Rec. J.341. Objective perceptual multimedia video quality measure-
ment of HDTV for digital cable television in the presence of a full reference, 2011.
[50] ITU-T Tech. Rec. J.342. Objective multimedia video quality measurement of
HDTV for digital cable television in the presence of a reduced reference signal,
2011.
[51] ITU-T Tech. Rec. P.863. Perceptual objective listening quality assessment, 2011.
[52] ITU-T Tech. Rec. P.862. Perceptual evaluation of speech quality (PESQ): An ob-
jective method for end-to-end speech quality assessment of narrow-band telephone
networks and speech codecs, 2001.
[53] ITU-T Tech. Rec. P.910. Subjective video quality assessment methods for multi-
media applications, 2008.
[54] ITU-T Tech. Rec. P.911. Subjective audiovisual quality assessment methods for
multimedia applications, 1998.
[55] ITU-T Tech. Rec. Y.1910. IPTV functional architecture, 2008.
[56] S. H. Jumisko, V. P. Ilvonen, and K. A. Vaananen-Vainio-Mattila. Effect of TV
content in subjective assessment of video quality on mobile devices. In Proc. SPIE,
Multimedia on Mobile Devices, volume 5684, pages 243–254, March 2005.
[57] S. Jumisko-Pyykko and J. Korhonen. Unacceptability of instantaneous errors in
mobile television: from annoying audio to video. In 8th Conf. on Human-computer
interaction with mobile devices and services, September 2006. Espoo, Finland.
[58] S. Kanumuri, S. Subramanian, P. Cosman, and A. Reibman. Predicting H.264
packet loss visibility using a generalized linear model. In IEEE Int. Conf. on
Image Processing, ICIP 2006, September 2006. Atlanta (GA), United States.
[59] M. Knee. The picture appraisal rating (PAR) - a single-ended picture quality mea-
sure for MPEG-2. In Int. Broadcasting Convention, September 2000. Amsterdam,
the Netherlands.
[60] R. Kooij, K. Ahmed, and K. Brunnstrom. Perceived quality of channel zapping.
In IASTED Int. Conf. Commun. Sys. and Networks, August 2006. Palma de
Mallorca, Spain.
142 BIBLIOGRAPHY
[61] K. Kunert, E. Uhlemann, and M. Jonsson. Enhancing reliability in IEEE 802.11
based real-time networks through transport layer retransmissions. In Int. Sympo-
sium on Industrial Embedded Systems, July 2010. Trento, Italy.
[62] Y. Kuszpet, D. Kletsel, Y. Moshe, and A. Levy. Post-processing for flicker reduc-
tion in H.264/AVC. In Picture Coding Symposium, PCS 2007, November 2007.
Lisbon, Portugal.
[63] P. Le Callet, C. Viard-Gaudin, and D. Barba. A convolutional neural network
approach for objective video quality assessment. IEEE Transactions on Neural
Networks, 17(5):1316–1327, September 2006.
[64] A. Leontaris and A. R. Reibman. Comparison of blocking and blurring metrics
for video compression. In IEEE Int. Conf. on Acoustics, Speech, and Signal Pro-
cessing, ICASSP 2005, March 2005. Philadelphia (PA), United States.
[65] Y. Liang, J. Apostolopoulos, and B. Girod. Analysis of packet loss for compressed
video: does burst-length matter? In IEEE Int. Conf. on Acoustics, Speech and
Signal Processing, ICASSP 2003, April 2003. Hong Kong, China.
[66] T. L. Lin, S. Kanumuri, Y. Zhi, D. Poole, P. C. Cosman, and A. R. Reibman. A
versatile model for packet loss visibility and its application to packet prioritization.
IEEE Transactions on Image Processing, 19(3):722–735, March 2010.
[67] A. A. Mahimkar, Z. Ge, A. Shaikh, J. Wang, Y. Zhang, and Q. Zhao. Towards
automated performance diagnosis in a large IPTV network. ACM SIGCOMM
Computer Communication Review, 39:231–242, August 2009.
[68] P. Marziliano, F. Dufaux, S. Winkler, and T. Ebrahimi. A no-reference perceptual
blur metric. In IEEE Int. Conf. on Image Processing, ICIP 2002, September 2002.
Rochester (NY), United States.
[69] A. Massoudi, F. Lefebvre, C. De Vleeschouwer, B. Macq, and J. Quisquater.
Overview on selective encryption of image and video: challenges and perspectives.
EURASIP Journal on Information Security, 2008, December 2008.
[70] R. Mekuria, P. Cesar, and D. Bulterman. Digital TV: the effect of delay when
watching football. In 10th European Conf. on Interactive TV and video, July 2012.
Berlin, Germany.
[71] V. Miguel, J. Cabrera, F. Jaureguizar, and N. Garcia. High-definition video dis-
tribution in 802.11g home wireless networks. In IEEE Int. Conf. on Consumer
Electronics, ICCE 2011, pages 213–214, Las Vegas (NV), United States, January
2011.
[72] M.-J. Montpetit, T. Mirlacher, and M. Ketcham. IPTV: An end to end perspective
(invited paper). Journal of Communications, 5(5):358–373, August 2010.
[73] BBC News. Olympics bring 55 million visits to BBC Sport online, August 2012.
http://www.bbc.com/news/technology-19242083.
[74] T. Oelbaum, C. Keimel, and K. Diepold. Rule-based no-reference video quality
evaluation using additionally coded videos. IEEE Journal of Selected Topics in
Signal Processing, 3(2):294–303, April 2009.
[75] Open IPTV Forum. Release 2 specification – volume 2a HTTP adaptive streaming
– v2.1, 2011.
[76] Open IPTV Forum. Release 2 specification – volume 2 media formats – v2.1, 2011.
[77] J. Ott, S. Wenger, N. Sato, C. Burmeister, and J. Rey. Extended RTP Profile
for Real-time Transport Control Protocol (RTCP)-Based Feedback (RTP/AVPF).
RFC 4585 (Proposed Standard), July 2006.
[78] T. N. Pappas and R. J. Safranek. Perceptual criteria for image quality evaluation.
In Handbook of Image and Video Processing, pages 669–684. Academic Press, 2000.
[79] R. R. Pastrana-Vidal and C. Colomes. Perceived quality of an audio signal im-
paired by signal loss: psychoacoustic tests and prediction model. In IEEE Int. Conf.
on Acoustics, Speech and Signal Processing, ICASSP 2007, April 2007. Honolulu
(HI), United States.
[80] W. Pattara-Atikom, S. Banerjee, and P. Krishnamurthy. Predicting the quality
of video transmission over best effort network service. In Int. Conf. on Computer
Communications and Networks, ICCCN 2003, October 2003.
[81] P. Perez. Calidad de experiencia en IPTV. Master’s thesis, Universidad Politecnica
de Madrid, September 2008. Trabajo de Investigacion en Tecnologias y Sistemas
de Comunicaciones.
[82] P. Perez and N. Garcia. Lightweight multimedia packet prioritization model for
unequal error protection. IEEE Transactions on Consumer Electronics, 57(1):
132–138, February 2011.
[83] P. Perez and J. J. Ruiz. Encryption procedure and device for an audiovisual data
stream, April 2011. European Patent Application EP 2,309,745 (Published).
[84] P. Perez, J. J. Ruiz, and N. Garcia. Calidad de experiencia en servicios multimedia
sobre IP. In XX Jornadas Telecom I+D, September 2010. Valladolid, Spain.
[85] P. Perez, J. Gutierrez, J. J. Ruiz, and N. Garcia. Qualitative monitoring of video
over a packet network. In IEEE Int. Symposium on Multimedia, December 2011.
Dana Point (CA), United States.
[86] P. Perez, J. Macias, J. J. Ruiz, and N. Garcia. Effect of packet loss in video quality
of experience. Bell Labs Technical Journal, 16(1):91–104, June 2011.
[87] P. Perez, J. J. Ruiz, A. Villegas, K. V. Damme, C. V. Boven, J. Dupont, and P. A.
Molina-Salmeron. Multi-vendor video headend convergence solution. Bell Labs
Technical Journal, 17(1):185–200, June 2012.
[88] P. Perez, A. Villegas, and J. J. Ruiz. Method, system and devices for improved
adaptive streaming of media content, January 2012. European Patent Application
No. 12382006.0 (Filed).
[89] M. H. Pinson and S. Wolf. A new standardized method for objectively measuring
video quality. IEEE Transactions on Broadcasting, 50(3):312–322, September
2004.
[90] M. H. Pinson, W. Ingram, and A. Webster. Audiovisual quality components.
IEEE Signal Processing Magazine, 28(6):60–67, November 2011.
[91] F. Porikli, A. Bovik, C. Plack, G. AlRegib, J. Farrell, P. Le Callet, Q. Huynh-Thu,
S. Moller, and S. Winkler. Multimedia quality assessment [DSP Forum]. IEEE
Signal Processing Magazine, 28(6):164–177, November 2011.
[92] A. Raake, J. Gustafsson, S. Argyropoulos, M. Garcia, D. Lindegren, G. Heikkila,
M. Pettersson, P. List, and B. Feiten. IP-based mobile and fixed network audio-
visual media services. IEEE Signal Processing Magazine, 28(6):68–79, November
2011.
[93] A. R. Reibman and D. Poole. Characterizing packet-loss impairments in com-
pressed video. In IEEE Int. Conf. on Image Processing, ICIP 2007, September
2007. San Antonio (TX), United States.
[94] A. R. Reibman and A. R. Wilkins. Video outage detection: Algorithm and evalu-
ation. In Picture Coding Symposium, PCS 2009, May 2009. Chicago (IL), United
States.
[95] A. R. Reibman, V. A. Vaishampayan, and Y. Sermadevi. Quality monitoring of
video over a packet network. IEEE Transactions on Multimedia, 6(2):327–334,
April 2004.
[96] D. C. Robinson and A. Villegas. Intelligent wrapping of video content to lighten
downstream processing of video streams, June 2009. European Patent Application
2,071,850 (Published).
[97] S. H. Russ and S. Haghani. 802.11g packet-loss behavior at high sustained bit
rates in the home. IEEE Transactions on Consumer Electronics, 55(2):788–791,
May 2009.
[98] S. Saha and R. Vemuri. An analysis on the effect of image features on lossy coding
performance. IEEE Signal Processing Letters, 7(5):104–107, May 2000.
[99] W. Bailer and P. Schallauer. Metadata in the audiovisual media production
process. In Multimedia Semantics – The Role of Metadata, Studies in Computational
Intelligence, pages 65–84. Springer Berlin / Heidelberg, 2008.
[100] T. Shi, B. King, and P. Salama. Selective encryption for H.264/AVC video coding.
In Proc. SPIE, Electronic Imaging, volume 6072, page 607217, 2006.
[101] D. Singer and H. Desineni. A General Mechanism for RTP Header Extensions.
RFC 5285 (Proposed Standard), July 2008.
[102] C. W. Snyder, U. K. Sarkar, and D. Sarkar. Effects of cell loss on MPEG video:
analytical modeling and empirical validation. In IEEE Int. Conf. on Multimedia
and Expo, ICME 2002, volume 2, pages 457–460. IEEE, 2002.
[103] B. Ver Steeg, A. Begen, T. Van Caenegem, and Z. Vax. Unicast-Based Rapid
Acquisition of Multicast RTP Sessions. RFC 6285 (Proposed Standard), June 2011.
[104] T. Stockhammer. Dynamic adaptive streaming over HTTP: standards and design
principles. In ACM Conf. on Multimedia Systems, February 2011. San Jose (CA),
United States.
[105] S. Susstrunk and S. Winkler. Color image quality on the internet. In Proc. SPIE,
IS&T Internet Imaging, pages 118–131, January 2004.
[106] M. Tagliasacchi, G. Valenzise, M. Naccari, and S. Tubaro. A reduced-reference
structural similarity approximation for videos corrupted by channel errors. Multi-
media Tools and Applications, 48:471–492, 2010.
[107] M. Verhoeyen, D. De Vleeschauwer, and D. Robinson. Content storage architec-
tures for boosted IPTV service. Bell Labs Technical Journal, 13(3):29–43, Septem-
ber 2008.
[108] A. Villegas, K. Chow, C. V. Boven, and P. Perez. Content delivery method, June
2011. European Patent Application EP 2,538,629 (Published).
[109] A. Villegas, P. Perez, J. M. Cubero, E. Estalayo, and N. Garcia. Network as-
sisted content protection architectures for a connected world. Bell Labs Technical
Journal, 16(4):85–96, March 2012.
[110] T. Vlachos. Detection of blocking artifacts in compressed video. Electronics Letters,
36(13):1106–1108, June 2000.
[111] VQEG. Final report from the video quality experts group on the validation of
objective models of video quality assessment, Phase II, 2003.
[112] VQEG. Validation of reduced-reference and no-reference objective models for
standard definition television, Phase I. Technical report, 2009.
[113] VQEG. Monitoring of audiovisual quality by key indicators, 2012. Draft available
online at http://www.its.bldrdoc.gov/vqeg/.
[114] Z. Wang and E. P. Simoncelli. Reduced-reference image quality assessment using
a wavelet-domain natural image statistic model. In Proc. SPIE, Human Vision
and Electronic Imaging, volume 5666, pages 149–159, 2005.
[115] Z. Wang, A. C. Bovik, and B. L. Evans. Blind measurement of blocking artifacts in
images. In IEEE Int. Conf. on Image Processing, ICIP 2000, September 2000.
Vancouver, Canada.
[116] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assess-
ment: From error visibility to structural similarity. IEEE Transactions on Image
Processing, 13(4):600–612, April 2004.
[117] A. A. Webster, C. T. Jones, M. H. Pinson, S. D. Voran, and S. Wolf. An objective
video quality assessment system based on human perception. In Proc. SPIE,
Human Vision, Visual Processing, and Digital Display IV, pages 15–26, 1993.
[118] J. Welch and J. Clark. A Proposed Media Delivery Index (MDI). RFC 4445
(Informational), April 2006.
[119] S. Winkler. Video quality measurement standards - current status and trends. In
Int. Conf. on Information, Communications and Signal Processing, ICICS 2009,
December 2009. Macau, China.
[120] S. Winkler. Digital Video Quality – Vision Models and Metrics. John Wiley &
Sons, January 2005.
[121] S. Winkler and P. Mohandas. The evolution of video quality measurement: From
PSNR to hybrid metrics. IEEE Transactions on Broadcasting, 54(3):660–668,
September 2008.
[122] H. R. Wu and M. Yuen. A generalized block-edge impairment metric for video
coding. IEEE Signal Processing Letters, 4(11):317–320, November 1997.
[123] F. Yang, S. Wan, Y. Chang, and H. R. Wu. A novel objective no-reference metric
for digital video quality assessment. IEEE Signal Processing Letters, 12(10):685–
688, October 2005.
[124] F. You, W. Zhang, and J. Xiao. Packet loss pattern and parametric video quality
model for IPTV. In IEEE/ACIS Int. Conf. on Computer and Information Science,
June 2009. Shanghai, China.
[125] Y. Zou, T. Huang, W. Gao, and L. Huo. H.264 video encryption scheme adaptive to
DRM. IEEE Transactions on Consumer Electronics, 52(4):1289–1297, November
2006.