Paper [99] proposes a corresponding pair of efficient streaming-schedule and
pipeline-decoding architectures to deal with the aforementioned problems. The proposed
method can be applied to the case of streaming stored FGS videos and can benefit
FGS-related applications.
Texture coding based on the discrete wavelet transform (DWT) is playing a lead-
ing role owing to its higher performance in terms of signal analysis, multiresolution
features, and improved compression compared to existing methods such as
DCT-based compression schemes adopted in the old JPEG standard. This success
is attested by the fact that the wavelet transform has now been adopted by
MPEG-4 for still-texture coding and by JPEG2000. Indeed, superior performance
at low bit rates and transmission of data according to client display resolution are
particularly interesting for mobile applications. The wavelet transform shows bet-
ter results because it is intrinsically well suited to nonstationary signal analysis,
such as images. Although it is a rather simple transform, DWT implementations
may lead to critical requirements in terms of memory size and bandwidth, poss-
ibly yielding costly implementations. Thus, efficient implementations must be
investigated to fit different system scenarios. In other words, the goal is to find
different architectures, each specifically optimized for a particular set of system
requirements in terms of complexity and memory bandwidth.
To facilitate MPEG-1 and MPEG-2 video compression, many graphics coproces-
sors provide accelerators for the key function blocks, such as inverse DCT and
motion compensation, in compression algorithms for real-time video decoding.
The MPEG-4 multimedia coding standard supports object-based coding and
manipulation of natural video and synthetic graphics objects. Therefore, it is desir-
able to use the graphics coprocessors to accelerate decoding of arbitrary-shaped
MPEG-4 video objects as well [100]. It is found that boundary macroblock padding,
which is an essential processing step in decoding arbitrarily shaped video objects,
could not be efficiently accelerated on the graphics coprocessors due to its complex-
ity. Although such a padding can be implemented by the host processor, the frame
data processed on the graphics coprocessor need to be transferred to the host pro-
cessor for padding. In addition, the padded data on the host processor need to be
sent back to the graphics coprocessor to be used as a reference for subsequent
frames. To avoid this overhead, there are two approaches of boundary macroblock
padding. In the first approach, the boundary macroblock padding is partitioned into
two tasks, one that the host processor can perform without the overhead of data
transfers, and the second approach, in which two new instructions are specified
and an algorithm is proposed for the next-generation graphics coprocessors or
media-processors, which gives a performance improvement of up to a factor of
nine compared to that with the PentiumIII [100].
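The padding operation itself is straightforward to express. The sketch below is a simplified illustration, not the algorithm of [100]: transparent pixels of a boundary block are filled by horizontal repetitive padding (copying, or averaging, the nearest object pixels in each row), and fully transparent rows are then filled by a vertical pass. Function and variable names are hypothetical.

```python
import numpy as np

def repetitive_pad(block, mask):
    """Simplified MPEG-4-style boundary macroblock padding (illustration only).

    block: 2-D array of pixel values; mask: same shape, True where the pixel
    belongs to the video object. Transparent pixels are filled first by a
    horizontal repetitive-padding pass, then empty rows by a vertical pass.
    """
    out = block.astype(float).copy()
    filled = mask.copy()
    # Horizontal pass: pad every row that contains at least one object pixel.
    for r in range(out.shape[0]):
        idx = np.flatnonzero(filled[r])
        if idx.size == 0:
            continue
        for c in range(out.shape[1]):
            if not filled[r, c]:
                left = idx[idx < c]
                right = idx[idx > c]
                if left.size and right.size:
                    # Between two object pixels: average the nearest ones.
                    out[r, c] = (out[r, left[-1]] + out[r, right[0]]) / 2
                elif left.size:
                    out[r, c] = out[r, left[-1]]
                else:
                    out[r, c] = out[r, right[0]]
        filled[r, :] = True
    # Vertical pass: copy the nearest padded row into fully transparent rows.
    rows = np.flatnonzero(filled.all(axis=1))
    for r in range(out.shape[0]):
        if not filled[r].any():
            nearest = rows[np.argmin(np.abs(rows - r))]
            out[r] = out[nearest]
    return out
```

Because the padded block depends only on the shape mask and the object pixels, both halves of the computation can be assigned to whichever processor holds the data, which is the point of the task partitioning described above.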
5.4 MEDIA STREAMING
Advances in computers, networking, and communications have created new distri-
bution channel and business opportunities for the dissemination of multimedia
content. Streaming audio and video over networks such as the Internet, local area
wireless networks, home networks, and commercial cellular phone systems has
become a reality and it is likely that streaming media will become a mainstream
means of communication. Despite some initial commercial success, streaming
media still faces challenging technical issues, including quality of service (QoS)
and cost-effectiveness. For example, deployments of multimedia services over
2.5G and 3G wireless networks have presented significant problems for real-time
servers and clients in terms of high variability of network throughput and packet
loss due to network buffer overflows and noisy channels. New streaming architec-
tures such as peer-to-peer (P2P) networks and wireless ad hoc networks have also
raised many interesting research challenges. This section is intended to address
some of the principal technical challenges for streaming media by presenting a col-
lection of the most recent advances in research and development.
5.4.1 MPEG-4 Delivery Framework
The framework is a model that hides from its upper layers the details of the technology
being used to access the multimedia content. It supports virtually all known com-
munication scenarios (e.g., stored files, remotely retrieved files, interactive retrieval
from a real-time streaming server, multicast, broadcast, and interpersonal communi-
cation). The delivery framework provides, in ISO/OSI terms, a session layer ser-
vice. This is further referred to as the delivery multimedia integration framework
(DMIF) layer, and the modules making use of it are referred to as DMIF users.
The DMIF layer manages sessions (associated to overall MPEG-4 presentations)
and channels (associated to individual MPEG-4 elementary streams) and allows
for the transmission of both user data and commands. The data transmission part
is often referred to in the open literature as the user plane, while the management
side is referred to as the control plane. The term DMIF, for instance, is used to indi-
cate an implementation of the delivery layer for a specific delivery technology [101].
In the DMIF context, the different protocol stack options are generally named
transport multiplexer (TransMux). Specific instances of a TransMux, such as a
user datagram protocol (UDP), are named TransMux channels [102]. Within a
TransMux channel, several streams can be further multiplexed, and MPEG-4 speci-
fies a suitable multiplexing tool, the FlexMux. The need for an additional multiplex-
ing stage (the FlexMux) derives from the wide variety of potential MPEG-4
applications, in which even huge numbers of MPEG-4 elementary streams (ESs)
can be used at once. This is somewhat a new requirement specific to MPEG-4; in
the IP world, for example, the real-time transport protocol (RTP) that is often
used for streaming applications normally carries one stream per socket. However,
in order to more effectively support the whole spectrum of potential MPEG-4 appli-
cations, the usage of the FlexMux in combination with RTP is being considered
jointly between IETF and MPEG [103].
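As a rough illustration of this extra multiplexing stage, the sketch below packs SL packets into FlexMux-style packets using the simple-mode layout (a one-byte channel index followed by a one-byte length). The field sizes and function names here are illustrative assumptions, not a normative ISO/IEC 14496-1 implementation.

```python
import struct

def flexmux_pack(channel_index, payload):
    """Pack one SL packet into a FlexMux 'simple mode' packet.

    Sketch of the simple-mode layout: one-byte index identifying the
    FlexMux channel, one-byte payload length, then the payload bytes.
    """
    if len(payload) > 255:
        raise ValueError("simple mode carries at most 255 bytes per packet")
    return struct.pack("BB", channel_index, len(payload)) + payload

def flexmux_unpack(stream):
    """Demultiplex a byte stream of concatenated FlexMux packets.

    Returns a list of (channel_index, payload) tuples, so many
    elementary streams can share one TransMux channel (e.g., one socket).
    """
    packets = []
    pos = 0
    while pos < len(stream):
        index, length = struct.unpack_from("BB", stream, pos)
        pos += 2
        packets.append((index, stream[pos:pos + length]))
        pos += length
    return packets
```

The length field makes the stream self-delimiting, which is what allows many small SL packets from different elementary streams to be interleaved on a single UDP or RTP flow.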
Figure 5.49 shows some of the possible stacks that can be used within the delivery
framework to provide access to MPEG-4 content. Reading the figure as it applies for
the transmitter side: ESs are first packetized and packets are equipped with
information necessary for synchronization (timing) – SL packets. Within the con-
text of MPEG-4 Systems, the Sync Layer (SL) syntax is used for this purpose.
Then, the packets are passed through the delivery application interface (DAI).
They possibly get multiplexed by the MPEG-4 FlexMux tool, and finally they
enter one of the various possible TransMuxes.
In order to control the flow of the ESs, commands such as PLAY, PAUSE, and
RESUME, and their related parameters, need to be conveyed as well. Such commands
are considered by DMIF as user commands, associated to channels. Such commands
are opaquely managed (i.e., not interpreted by DMIF and just evaluated at the peer
entity). This allows the stream control protocol(s) to evolve independently from
DMIF. When the Real-Time Streaming Protocol (RTSP) is used as the actual control pro-
tocol, the separation between user commands and signaling messages vanishes, as
RTSP deals with both channel setup and stream control. This separation is also
void, for example, when directly accessing a file.
The delivery framework is also prepared for QoS management. Each request for
creating a new channel may have certain QoS parameters associated with it, and a simple
but generic model for monitoring QoS performance has been introduced as well. The
infrastructure for QoS handling does not include, however, generic support for QoS
negotiation or modification.
Of course, not all the features modeled in the delivery framework are meaningful
for all scenarios. For example, it makes little sense to consider QoS when reading
content from local files. Still, an application making use of the DMIF service as a
whole need not be further concerned with the details of the actually involved
scenario.
The approach of making the application running on top of DMIF totally unaware
of the delivery stack details works well with MPEG-4. Multimedia presentations can
be repurposed with minimal intervention. Repurposing means here that a certain
multimedia content can be generated in different forms to suit specific scenarios,
for example, a set of files to be locally consumed, or broadcast/multicast, or even
Figure 5.49 User plane in an MPEG-4 terminal [104]. (©2000 ISO/IEC.) [The figure shows elementary streams packetized by the sync layer into SL-packetized streams, crossing the DMIF-Application interface, optionally multiplexed by the FlexMux tool into FlexMux streams, and carried in the delivery layer as TransMux streams such as TCP/IP, UDP/IP, MPEG-2 TS (PES), AAL5/ATM, H223/GSTN, and DAB mux. The upper part is configured by ISO/IEC 14496-1 (MPEG-4 Systems), the lower part by ISO/IEC 14496-6 (MPEG-4 DMIF).]
interactively served from a remote real-time streaming application. Combinations of
these scenarios are also enabled within a single presentation.
The delivery application interface (DAI) represents the boundary between the ses-
sion layer service offered by the delivery framework and the application making use
of it, thus defining the functions offered by DMIF. In ISO/OSI terms, it corresponds
to a Session Service Access Point.
The entity that uses the service provided by DMIF is termed the DMIF user and is
hidden from the details of the technology used to access the multimedia content.
The DAI comprises a simple set of primitives that are defined in the standard in
their semantic only. Actual implementation of the DAI needs to assign a precise syn-
tax to each function and related parameters, as well as to extend the set of primitives
to include initialization, reset, statistics monitoring, and any other housekeeping
function.
DAI primitives can be categorized into five families, as follows:
. service primitives (create or destroy a service)
. channel primitives (create or destroy channels)
. QoS monitoring primitives (set up and control QoS monitoring functions)
. user command primitives (carry user commands)
. data primitives (carry the actual media content).
In general, all the primitives being presented have two different (although similar)
signatures (i.e., variations with different sets of parameters): one for communication
from DMIF user to the DMIF layer and another for the communication in the reverse
direction. The second is distinguished by the Callback suffix and, in a retrieval appli-
cation, applies only to the remote peer. Moreover, each primitive presents both IN
and OUT parameters, meaning that IN parameters are provided when the primitive is
called, whereas OUT parameters are made available when the primitive returns. Of
course, a specific implementation may choose to use nonblocking calls and to return
the OUT parameters through an asynchronous callback. This is the case for the
implementation provided in the MPEG-4 reference software.
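As a concrete, purely hypothetical rendering of these ideas, the sketch below models one DAI primitive with explicit IN and OUT parameter structures. Since the standard fixes only the semantics of the primitives, the names, types, and synchronous call style here are this sketch's own choices.

```python
from dataclasses import dataclass

@dataclass
class ServiceAttachIn:
    """IN parameters: supplied when the primitive is called."""
    url: str                 # identifies the service/content to attach to
    uu_data: bytes = b""     # opaque user-to-user data, not interpreted by DMIF

@dataclass
class ServiceAttachOut:
    """OUT parameters: made available when the primitive returns."""
    service_session_id: int = 0
    response: int = 0        # 0 = success in this sketch

class DmifLayer:
    """Hypothetical implementation of a service primitive at the DAI.

    A real DMIF instance would inspect the URL to select a delivery
    technology (local file, broadcast, remote server) before opening
    the session; this sketch only assigns session identifiers.
    """
    def __init__(self):
        self._next_id = 1

    def service_attach(self, params: ServiceAttachIn) -> ServiceAttachOut:
        out = ServiceAttachOut(service_session_id=self._next_id, response=0)
        self._next_id += 1
        return out
```

A nonblocking variant would return immediately and deliver the ServiceAttachOut structure later through an asynchronous callback, as the MPEG-4 reference software does.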
The MPEG-4 delivery framework is intended to support a variety of communi-
cation scenarios while presenting a single, uniform interface to the DMIF user. It
is then up to the specific DMIF instance to map the DAI primitives into appropriate
actions to access the requested content. In general, each DMIF instance will deal
with very specific protocols and technologies, such as the broadcast of MPEG-2
transport streams, or communication with a remote peer. In the latter case, however,
a significant number of options in the selection of control plane protocol exists. This
variety justifies the attempt to define a further level of commonality among the var-
ious options, making the final mapping to the actual bits on the wire a little more
focused. The delivery application interface (DAI) and the DMIF network interface
(DNI) in the DMIF framework architecture are shown in
Figure 5.50 [104].
The DNI captures a few generic concepts that are potentially common to peer-to-
peer control protocols, for example, the usage of a reduced number of network
resources (such as sockets) into which several channels would be multiplexed,
and the ability to discriminate between a peer-to-peer relation (network session)
and different services possibly activated within that single session (services).
DNI follows a model that helps determine the correct information to be delivered
between peers but by no means defines the bits on the wire. If the concepts of sharing
a TransMux channel among several streams or of the separation between network
session and services are meaningless in some context, that is fine, and does not con-
tradict the DMIF model as a whole.
The mapping between DAI and DNI primitives has been specified in reference
[104]. As a consequence, the actual mapping between the DAI and a concrete pro-
tocol can be determined as the concatenation of the mappings between the DAI and
DNI and between DNI and the selected protocol. The first mapping determines how
to split the service creation process into two elementary steps, and how multiple
channels managed at the DAI level can be multiplexed into one TransMux channel
(by means of the FlexMux tool). The second protocol-specific mapping is usually
straightforward and consists of placing the semantic information exposed at the
DNI in concrete bits in the messages being sent on the wire.
In general, the DNI captures the information elements that need to be exchanged
between peers, regardless of the actual control protocol being used.
DNI primitives can be categorized into five families, as follows:
. session primitives (create or destroy a network session)
. service primitives (create or destroy a service)
Figure 5.50 DAI and DNI in the DMIF architecture [104]. (©2000 ISO/IEC.) [The figure shows the originating application reaching, through the DAI, a DMIF filter that dispatches requests to originating DMIF instances for broadcast, local files, or a remote service. The first two instances access a broadcast source or local storage directly, while the remote-service instance maps signaling through the DNI across the network to a target DMIF and target application. Flows between independent systems are normative; flows internal to a single system are either informative or out of DMIF scope.]
. Transmux primitives (create or destroy a TransMux channel carrying one or
more streams)
. channel primitives (create or destroy a FlexMux channel carrying a single
stream)
. user command primitives (carry user commands).
In general, all the primitives being presented have two different but similar signa-
tures, one for each communication direction. The Callback suffix indicates primi-
tives that are issued by the lower layer. Different from the DAI, for the DNI the
signatures of both the normal and the associated Callback primitives are identical.
As for the DAI, each primitive presents both IN and OUT parameters, meaning
that IN parameters are provided when the primitive is called, whereas OUT par-
ameters are made available when the primitive returns. As for the DAI, the actual
implementation may choose to use nonblocking calls, and to return the OUT par-
ameters through an asynchronous callback. Also, some primitives use a loop( ) con-
struct within the parameter list. This indicates that multiple tuples of those
parameters can be exposed at once (e.g., in an array).
5.4.2 Streaming Video Over the Internet
Recent advances in computing technology, compression technology, high-band-
width storage devices, and high-speed networks have made it feasible to provide
real-time multimedia services over the Internet. Real-time multimedia, as the
name implies, has timing constraints. For example, audio and video data must be
played out continuously. If the data do not arrive in time, the playout process will
pause, which is annoying to human ears and eyes.
Real-time transport of live video or stored video is the predominant part of real-
time multimedia. Here, we are concerned with video streaming, which refers to real-
time transmission of stored video. There are two modes for transmission of stored
video over the Internet, namely the download mode and the streaming mode (i.e.,
video streaming). In the download mode, a user downloads the entire video file
and then plays back the video file. However, full file transfer in the download
mode usually suffers long and perhaps unacceptable transfer times. In contrast, in
the streaming mode, the video content is played out while it is being received and
decoded. Owing to
its real-time nature, video streaming typically has bandwidth, delay, and loss
requirements. However, the current best-effort Internet does not offer any quality
of service (QoS) guarantees to streaming video over the Internet. In addition, for
multicast, it is difficult to efficiently support multicast video while providing service
flexibility to meet a wide range of QoS requirements from the users. Thus, designing
mechanisms and protocols for Internet streaming video poses many challenges
[105]. To address these challenges, extensive research has been conducted. To intro-
duce the necessary background and give the reader a complete picture of this field,
we cover some key areas of streaming video, such as video compression, application
layer QoS control, continuous media distribution services, streaming servers, media
synchronization mechanisms, and protocols for streaming media [106]. The
relations among the basic building blocks are illustrated in Figure 5.51. Raw
video and audio data are precompressed, by video and audio compression algor-
ithms, and then saved in storage devices. It can be seen that the areas are closely
related and they are coherent constituents of the video streaming architecture. We
will briefly describe the areas. Before that it must be pointed out that upon the cli-
ent’s requests, a streaming server retrieves compressed video/audio data from sto-
rage devices, and the application-layer QoS control module adapts the video/audio bit streams according to the network status and QoS requirements. After the adap-
tation, the transport protocols packetize the compressed bit streams and send the
video/audio packets to the Internet. Packets may be dropped or experience exces-
sive delay inside the Internet due to congestion. To improve the quality of video/audio transmission, continuous media distribution service (e.g., caching) is deployed
in the Internet. Packets that are successfully delivered to the receiver first
pass through the transport layer and are then processed by the application layer
before being decoded at the video/audio decoder. To achieve synchronization
between video and audio presentations, media synchronization mechanisms are
required.
Video Compression
Since raw video consumes a large amount of bandwidth, compression is usually
employed to achieve transmission efficiency. In this section, we discuss various
compression approaches and requirements imposed by streaming applications on
the video encoder and decoder.
In essence, video compression schemes can be classified into two approaches:
scalable and nonscalable video coding. We will show the encoder and decoder in
intramode and only use DCT. Intramode coding refers to coding video macroblocks
without any reference to previously coded data. Since scalable video is capable of
Figure 5.51 Basic building blocks in the architecture for video streaming [105]. (©2001 IEEE.) [On the server side, raw video and audio are compressed and saved on a storage device; the streaming server feeds the compressed streams through application-layer QoS control and transport protocols into the Internet, which provides continuous media distribution services. At the client/receiver, the streams traverse the transport protocols and application-layer QoS control before reaching the video and audio decoders, with media synchronization between the two.]
gracefully coping with the bandwidth fluctuations in the Internet [107], we are pri-
marily concerned with scalable video coding techniques.
A nonscalable video encoder, shown in Figure 5.52a, generates one compressed
bit stream, while the corresponding nonscalable video decoder is presented in Figure 5.52b.
In contrast, a scalable video encoder compresses a raw video sequence into multiple substreams,
as represented in Figure 5.53a. One of the compressed substreams is the base sub-
stream, which can be independently decoded and provides coarse visual quality.
Other compressed substreams are enhanced substreams, which can only be decoded
together with the base substream and can provide better visual quality. The complete
bit stream (i.e., the combination of all substreams) provides the highest quality. An
SNR scalable encoder as well as scalable decoder are shown in Figure 5.53.
The scalabilities of quality, image size, and frame rate are called SNR, spatial, and
temporal scalability, respectively. These three basic mechanisms can also be combined
into hybrid scalabilities, such as spatiotemporal scalability [108]. To provide more flexibility in meet-
ing different demands of streaming (e.g., different access link bandwidths and
different latency requirements), the fine granularity scalability (FGS) coding mech-
anism is proposed in MPEG-4 [109, 110]. An FGS encoder and FGS decoder are
shown in Figure 5.54.
The FGS encoder compresses a raw video sequence into two substreams, that is, a
base layer bit stream and an enhancement layer bit stream. Different from an SNR-
scalable encoder, an FGS encoder uses bit-plane coding to represent the enhance-
ment stream. Bit-plane coding uses embedded representations. Bit planes of
enhancement DCT coefficients are shown in Figure 5.55. With bit-plane coding,
an FGS encoder is capable of achieving continuous rate control for the enhancement
stream. This is because the enhancement bit stream can be truncated anywhere to
achieve the target bit rate.
Figure 5.52 (a) Nonscalable video encoder, (b) nonscalable video decoder.
Example 5.3. A DCT coefficient can be represented by 7 bits (i.e., its value ranges from 0 to
127). There are 64 DCT coefficients. Each DCT coefficient has a most significant bit (MSB).
All the MSB from the 64 DCT coefficients form Bitplane 0 (Figure 5.55). Similarly, all the
second most significant bits form Bitplane 1.
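The embedded bit-plane representation of Example 5.3 can be sketched directly. Sign handling and the actual entropy coding are omitted, and the function names below are ours:

```python
def bitplanes(coeffs, nbits=7):
    """Split DCT coefficient magnitudes into bit planes (MSB first).

    Bitplane 0 holds the most significant bit of every coefficient,
    Bitplane 1 the next bit, and so on, as in Example 5.3.
    """
    planes = []
    for b in range(nbits - 1, -1, -1):      # walk from MSB down to LSB
        planes.append([(c >> b) & 1 for c in coeffs])
    return planes

def reconstruct(planes, nbits=7):
    """Rebuild coefficients from however many planes were received.

    Receiving only the first k planes yields a coarser approximation,
    which is why the FGS enhancement stream can be truncated anywhere
    to meet a target bit rate.
    """
    n = len(planes[0])
    vals = [0] * n
    for i, plane in enumerate(planes):
        shift = nbits - 1 - i
        for j, bit in enumerate(plane):
            vals[j] |= bit << shift
    return vals
```

Truncating the plane list after any prefix still yields a decodable, lower-quality set of coefficients, which is the continuous rate-control property described above.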
A variation of FGS is progressive fine granularity scalability (PFGS) [111]. PFGS
shares the good features of FGS, such as fine granularity bit rate scalability and error
resilience. Unlike FGS, which only has two layers, PFGS could have more than two
layers. The essential difference between FGS and PFGS is that FGS only uses the
base layer as a reference, whereas PFGS also uses enhancement layers as references
to reduce the prediction error, resulting in higher coding efficiency.
Various Requirements Imposed by Streaming Applications
In what follows we will describe various requirements imposed by streaming appli-
cations on the video encoder and decoder. Also, we will briefly discuss some tech-
niques that address these requirements.
Bandwidth. To achieve acceptable perceptual quality, a streaming application typi-
cally has minimum bandwidth requirement. However, the current Internet does not
provide bandwidth reservation to support this requirement. In addition, it is desirable
for video streaming applications to employ congestion control to avoid congestion,
which happens when the network is heavily loaded. For video streaming, congestion
Figure 5.53 (a) SNR-scalable video encoder, (b) SNR-scalable video decoder. [In the encoder, raw video passes through DCT, quantization (Q), and VLC to form the base layer compressed bit stream; the residual between the original coefficients and their inverse-quantized (IQ) values is quantized and VLC-coded into the enhancement layer compressed bit stream. The decoder applies VLD, IQ, and IDCT to the base layer, and adds the VLD/IQ/IDCT-decoded enhancement layer to obtain the enhancement-layer decoded video.]
control takes the form of rate control, that is, adapting the sending rate to the avail-
able bandwidth in the network. Compared with nonscalable video, scalable video is
more adaptable to the varying available bandwidth in the network.
Delay. Streaming video requires bounded end-to-end delay so that packets can
arrive at the receiver in time to be decoded and displayed. If a video packet does
Figure 5.54 (a) Fine granularity scalability (FGS) encoder, (b) FGS decoder [105]. (©2001 IEEE.) [The encoder produces the base layer compressed bit stream via DCT, Q, and VLC; the residual between the DCT coefficients and their inverse-quantized (IQ) values passes through a find-maximum step, bitplane shift, and bitplane VLC to form the enhancement compressed bit stream. The decoder recovers the base layer decoded video via VLD, IQ, and IDCT, and the enhancement layer decoded video via bitplane VLD, bitplane shift, and IDCT.]
Figure 5.55 Bitplanes of enhancement DCT coefficients [105]. (©2001 IEEE.) [Each enhancement DCT coefficient (Coefficient 0, 1, 2, ...) is shown from its most significant bit to its least significant bit; Bitplane 0 collects the MSBs of all coefficients, down to Bitplane k for the least significant bits.]
not arrive in time, the playout process will pause, which is annoying to human eyes.
A video packet that arrives beyond its delay bound (e.g., its playout time) is useless
and can be regarded as lost. Since the Internet introduces time-varying delay, to pro-
vide continuous playout, a buffer at the receiver is usually introduced before decod-
ing [112].
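A fixed-delay playout model makes the role of the receiver buffer concrete: start playout a fixed startup delay after the first packet, and treat any packet arriving after its scheduled playout instant as lost. This is a simplified sketch of the idea, not the adaptive playout schemes of [112]; the function name and parameters are ours.

```python
def playout_losses(arrival_times, interval, startup_delay):
    """Count packets that miss their playout deadline.

    arrival_times: arrival time of packet i (seconds, first send at t=0).
    Packet i is scheduled for playout at startup_delay + i * interval,
    so a larger startup_delay (deeper buffer) absorbs more jitter at the
    cost of higher end-to-end latency.
    """
    late = 0
    for i, arrival in enumerate(arrival_times):
        deadline = startup_delay + i * interval
        if arrival > deadline:
            late += 1          # arrived too late: treated as lost
    return late
```

Increasing startup_delay trades interactivity for continuity, which is exactly the dimensioning decision a streaming client must make.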
Loss. Packet loss is inevitable in the Internet and can damage pictures, which is dis-
pleasing to human eyes. Thus, it is desirable that a video stream be robust to packet
loss. Multiple description coding is such a compression technique to deal with
packet loss [113].
Video Cassette Recorder (VCR) like Functions. Some streaming applications
require VCR-like functions such as stop, pause/resume, fast forward, fast backward,
and random access. A dual bit stream least-cost scheme to efficiently provide VCR-
like functionality for MPEG video streaming is proposed in reference [105].
Decoding Complexity. Some devices such as cellular phones and personal digital
assistants (PDAs) require low power consumption. Therefore, streaming video
applications running on these devices must be simple. In particular, low decoding
complexity is desirable.
We here present the application-layer QoS control mechanisms, which adapt the
video bit streams according to the network status and QoS requirements.
Application-Layer QoS Control
The objective of application-layer QoS control is to avoid congestion and maximize
video quality in the presence of packet loss. To cope with varying network con-
ditions, and different presentation quality requested by the users, various appli-
cation-layer QoS control techniques have been proposed [114, 115]. The
application-layer QoS control techniques include congestion control and error con-
trol. These techniques are employed by the end systems and do not require any QoS
support from the network.
Congestion Control. Bursty loss and excessive delay have a devastating effect on
video presentation quality, and they are usually caused by network congestion.
Thus, congestion control mechanisms at end systems are necessary to help reduce
packet loss and delay. Typically, for streaming video, congestion control takes the
form of rate control. This is a technique used to determine the sending rate of
video traffic based on the estimated available bandwidth in the network. Rate control
attempts to minimize the possibility of network congestion by matching the rate of
the video stream to the available network bandwidth. Existing rate control schemes
can be classified into three categories: source-based, receiver-based, and hybrid rate
control.
Under source-based rate control, the sender is responsible for adapting the video
transmission rate. Feedback is employed by source-based rate control mechanisms.
Based upon the feedback information about the network, the sender can regulate the
rate of the video stream. Source-based rate control can be applied to both uni-
cast [116] and multicast [117]. Figure 5.56 represents unicast and multicast video
distribution.
For unicast video, existing source-based rate control mechanisms follow two
approaches: probe-based and model-based. The probe-based approach is based on
probing experiments. In particular, the source probes for the available network band-
width by adjusting the sending rate in a way that could maintain the packet loss ratio
p below a certain threshold Pth. There are two ways to adjust the sending rate: (1)
additive increase and multiplicative decrease, and (2) multiplicative increase and
multiplicative decrease [118].
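One probing step might look like the following, where the step sizes and the threshold default are illustrative assumptions rather than values from [118]:

```python
def adjust_rate(rate, loss_ratio, p_th=0.05,
                increase=50_000, decrease=0.875):
    """One step of probe-based rate control (AIMD flavor).

    If the measured packet loss ratio stays at or below the threshold
    p_th, the sender additively increases its rate (probing for spare
    bandwidth); otherwise it multiplicatively backs off.
    """
    if loss_ratio <= p_th:
        return rate + increase          # additive increase
    return rate * decrease              # multiplicative decrease
```

Replacing the additive increase with a multiplicative one gives the second variant mentioned in the text (multiplicative increase and multiplicative decrease).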
The model-based approach is based on a throughput model of a transmission con-
trol protocol (TCP) connection. Specifically, the throughput of a TCP connection
can be characterized by the following formula:
λ = (1.22 × MTU) / (RTT × √p)    (5.1)

where λ is the throughput of a TCP connection, MTU (maximum transmission unit) is the
packet size used by the connection, RTT is the round-trip time for the connection,
and p is the packet loss ratio experienced by the connection [119].
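Equation (5.1) translates directly into a rate calculator; units are left to the caller (the result is in bytes per second if the MTU is given in bytes and the RTT in seconds):

```python
from math import sqrt

def tcp_friendly_rate(mtu_bytes, rtt_s, loss_ratio):
    """Model-based sending rate from equation (5.1).

    Returns lambda = 1.22 * MTU / (RTT * sqrt(p)), the throughput a
    TCP connection would obtain under the same loss ratio p; sending
    at this rate lets the video flow compete fairly with TCP.
    """
    return 1.22 * mtu_bytes / (rtt_s * sqrt(loss_ratio))
```

Note how sensitive the rate is to RTT and loss: halving the RTT doubles the permitted rate, while quadrupling the loss ratio halves it.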
Under the model-based rate control, equation (5.1) is used to determine the send-
ing rate of the video stream. Thus, the video connection can avoid congestion in a
Figure 5.56 (a) Unicast video distribution using multiple point-to-point connections, (b) multicast video distribution using point-to-multipoint transmission. [Both panels show a sender reaching multiple receivers over Link 1; in (a) separate unicast connections carry duplicate streams, while in (b) a single multicast tree branches toward the receivers.]
way similar to that of TCP and it can compete fairly with TCP flows. For this reason,
the model-based rate control is also called TCP-friendly rate control.
For multicast, under the source-based rate control, the sender uses a single chan-
nel to transport video to the receivers. Such multicast is called single-channel
multicast. For single-channel multicast, only the probe-based rate control can be
employed.
Single-channel multicast is efficient since all the receivers share one channel.
However, single-channel multicast is unable to provide flexible services to meet
the different demands from receivers with various access link bandwidths. In con-
trast, if multicast video were to be delivered through individual unicast streams,
the bandwidth efficiency is low, but the service could be differentiated since each
receiver can negotiate the parameters of the services with the source.
Under the receiver-based rate control, the receiver regulates the receiving rate of
video streams by adding/dropping channels, while the sender does not participate
in rate control. Typically, receiver-based rate control is used in multicast scalable
video, where there are several layers in the scalable video and each layer corre-
sponds to one channel in the multicast tree [105].
Similar to the source-based rate control, the existing receiver-based rate-control
mechanisms follow two approaches: probe-based and model-based. The basic
probe-based rate control consists of two parts [105]:
. When no congestion is detected, a receiver probes for the available bandwidth
by joining a layer/channel, resulting in an increase of its receiving rate. If no
congestion is detected after the joining, the joining experiment is successful.
Otherwise, the receiver drops the newly added layer.
. When congestion is detected, a receiver drops a layer (i.e., leaves a channel),
resulting in a reduction of its receiving rate.
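The two rules above amount to a simple control step per measurement interval, in the spirit of receiver-driven layered multicast. The function below is a sketch with hypothetical names; it omits the join-timer and shared-learning machinery a real scheme needs.

```python
def receiver_control(subscribed_layers, max_layers, congested):
    """One step of receiver-based probe rate control.

    On congestion the receiver drops its top layer (leaves a channel);
    otherwise it performs a join experiment on the next layer, if any.
    Whether the join 'sticks' is decided at the next step, when
    congestion is measured again.
    """
    if congested:
        return max(subscribed_layers - 1, 1)   # always keep the base layer
    return min(subscribed_layers + 1, max_layers)
```

Each layer maps to one multicast group, so adding or dropping a layer is just an IGMP join or leave; the network prunes the tree accordingly.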
Unlike the probe-based approach, which implicitly estimates the available network
bandwidth through probing experiments, the model-based approach uses explicit
estimation for the available network bandwidth.
Under the hybrid rate control the receiver regulates the receiving rate of video
streams by adding/dropping channels, while the sender also adjusts the transmission
rate of each channel based on feedback from the receivers. Examples of hybrid rate
control include the destination set grouping and layered multicast scheme [120].
The architecture for source-based rate control is shown in Figure 5.57. A technique
associated with rate control is rate shaping. The objective of rate shaping is to match
the rate of a precompressed video bit stream to the target rate constraints. A rate shaper
(or filter), which performs rate shaping, is required for source-based rate control.
This is because the stored video may be precompressed at a certain rate, which
may not match the available bandwidth in the network. Many types of filters can
be used, such as codec filter, frame-dropping filter, layer-dropping filter, frequency
filter, and requantization filter [121].
5.4 MEDIA STREAMING 443
In some applications, the purpose of congestion control is to avoid congestion.
On the other hand, packet loss is inevitable in the Internet and may have significant
impact on perceptual quality. This prompts the need to design mechanisms to maxi-
mize video presentation quality in the presence of packet loss. Error control is such a
mechanism, and will be presented next.
Error Control. In the Internet, packets may be dropped due to congestion at routers,
they may be misrouted, or they may reach the destination with such a long delay as
to be considered useless or lost. Packet loss may severely degrade the visual presen-
tation quality. To enhance the video quality in presence of packet loss, error-control
mechanisms have been proposed.
For certain types of data (such as text), packet loss is intolerable while delay is
acceptable. When a packet is lost, there are two ways to recover the packet: the cor-
rupted data must be corrected by traditional forward error correction (FEC), that is,
channel coding, or the packet must be retransmitted. On the other hand, for real-time
video, some visual quality degradation is often acceptable while delay must be
bounded. This feature of real-time video introduces many new error-control mech-
anisms, which are applicable to video applications but not applicable to traditional
data such as text. In essence, the error-control mechanisms for video applications
can be classified into four types, namely, FEC, retransmission, error resilience,
and error concealment. FEC, retransmission, and error resilience are performed at
Figure 5.57 Architecture for source-based rate control [105]. (©2001 IEEE.)
444 MIDDLEWARE LAYER
both the source and the receiver side, while error concealment is carried out only at
the receiver side.
The principle of FEC is to add redundant information so that the original message
can be reconstructed in the presence of packet loss. Based on the kind of redundant
information to be added, we classify existing FEC schemes into three categories:
channel coding, source coding-based FEC, and joint source/channel coding [106].
For Internet applications, channel coding is typically used in terms of block
codes. Specifically, a video stream is first chopped into segments, each of which
is packetized into k packets; then, for each segment, a block code is applied to
the k packets to generate an n-packet block, where n > k. To perfectly recover a segment, a user only needs to receive any k packets of the n-packet block [122].
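The (n, k) block-code idea can be illustrated with its simplest special case, a single XOR parity packet per segment, i.e., (n, k) = (k + 1, k); real systems use stronger codes (e.g., Reed-Solomon) when n > k + 1. The helper names below are illustrative:

```python
# Minimal (k+1, k) block code: one XOR parity packet per k-packet segment.
# Any single lost packet in the block can be reconstructed from the rest.

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode_segment(packets):
    """Append one parity packet (the XOR of all k data packets)."""
    parity = packets[0]
    for p in packets[1:]:
        parity = xor_bytes(parity, p)
    return packets + [parity]

def recover(block, lost_index):
    """Rebuild the single missing packet from the surviving n-1 packets."""
    survivors = [p for i, p in enumerate(block) if i != lost_index]
    rebuilt = survivors[0]
    for p in survivors[1:]:
        rebuilt = xor_bytes(rebuilt, p)
    return rebuilt
```

Because XOR is its own inverse, XOR-ing the n − 1 surviving packets yields the missing one, whether it was a data packet or the parity packet.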
Source coding-based FEC (SFEC) is a variant of FEC for Internet video [123].
Like channel coding, SFEC also adds redundant information to recover from loss.
For example, the nth packet contains the nth group of blocks (GOB) and redundant
information about the (n − 1)th GOB, which is a compressed version of the (n − 1)th GOB with a larger quantizer.
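The SFEC packetization just described can be sketched as follows (the `requantize` stand-in and function names are hypothetical; a real encoder would recompress the GOB with a larger quantization step):

```python
# Sketch of SFEC: each packet carries its own GOB plus a coarser copy of
# the previous GOB. Here "requantize" crudely keeps every other
# coefficient as a stand-in for heavier quantization.

def requantize(gob):
    return gob[::2]  # toy stand-in for a larger quantization step

def sfec_packetize(gobs):
    packets = []
    prev = None
    for gob in gobs:
        redundant = requantize(prev) if prev is not None else None
        packets.append({"primary": gob, "redundant_prev": redundant})
        prev = gob
    return packets

def sfec_recover(packets, lost):
    """If packet `lost` is dropped, the next packet's redundant copy
    gives back a degraded version of the lost GOB."""
    if lost + 1 < len(packets):
        return packets[lost + 1]["redundant_prev"]
    return None
```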
Joint source/channel coding is an approach to optimal rate allocation between
source coding and channel coding [106].
Delay-Constrained Retransmission. Retransmission is usually dismissed as a
method to recover lost packets in real-time video since a retransmitted packet
may miss its playout time. However, if the one-way trip time is short with respect
to the maximum allowable delay, a retransmission-based approach (called delay-
constrained retransmission) is a viable option for error control.
For unicast, the receiver can perform the following delay-constrained retransmission scheme. When the receiver detects the loss of packet N, it sends a retransmission request for packet N to the sender only if Tc + RTT + Da < Td(N), where Tc is the current time, RTT is the estimated round-trip time, Da is a slack term, and Td(N) is the time when packet N
is scheduled for display.
The slack time Da may include tolerance of error in estimating RTT, the sender’s
response time, and the receiver’s decoding delay. The timing diagram for receiver-
based control is shown in Figure 5.58, where Da is only the receiver’s decoding
Figure 5.58 Timing diagram for receiver-based control [105]. (©2001 IEEE.)
delay. It is clear that the objective of the delay-constrained retransmission is to sup-
press requests of retransmission that will not arrive in time for display.
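The retransmission-request test is a one-line predicate; the sketch below follows the variable names in the text (all times in seconds):

```python
# Delay-constrained retransmission: request packet N again only if the
# retransmitted copy can arrive before its display time Td(N).

def should_request_retransmission(Tc, RTT, Da, Td_N):
    """Tc: current time; RTT: estimated round-trip time;
    Da: slack term (RTT estimation error, sender response time,
    receiver decoding delay); Td_N: scheduled display time of packet N."""
    return Tc + RTT + Da < Td_N
```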
Error-Resilient Encoding. The objective of error-resilient encoding is to enhance
robustness of compressed video to packet loss. The standardized error-resilient
encoding schemes include resynchronization marking, data partitioning, and data
recovery [124]. However, resynchronization marking, data partitioning, and data
recovery are targeted at error-prone environments like wireless channels and may
not be applicable to the Internet environment. For video transmission over the Inter-
net, the boundary of a packet already provides a synchronization point in the vari-
able-length coded bit stream at the receiver side. On the other hand, since a
packet loss may cause the loss of all the motion data and its associated shape/texture data, mechanisms such as resynchronization marking, data partitioning, and data
recovery may not be useful for Internet video applications. Therefore, we will not
present the standardized error-resilient tools. Instead, we present multiple descrip-
tion coding (MDC), which is promising for robust Internet video transmission [125].
With MDC, a raw video sequence is compressed into multiple streams (referred
to as descriptions) as follows: each description alone provides acceptable visual
quality, and combining more descriptions provides better visual quality. The advantages of
MDC are:
. Robustness to loss – even if a receiver gets only one description (other descrip-
tions being lost), it can still reconstruct video with acceptable quality.
. Enhanced quality – if a receiver gets multiple descriptions, it can combine
them together to produce a better reconstruction than that produced from any
one of them.
However, the advantages come with a cost. To make each description provide
acceptable visual quality, each description must carry sufficient information about
the original video. This reduces the compression efficiency compared with conventional single-description coding (SDC). In addition, although combining more descriptions provides better visual quality, a certain degree of correlation between the multiple descriptions has to be embedded in each description, resulting in a further reduction of compression efficiency. Further investigation is needed to find a
good trade-off between compression efficiency and reconstruction quality from
any one description.
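A toy illustration of the MDC principle is temporal splitting into even and odd frames: either description alone plays at half the frame rate, and both together restore the full sequence. (This is only a sketch of the concept; practical MDC operates on transform coefficients rather than whole frames.)

```python
# Toy two-description coder: split the frame sequence into even-indexed
# and odd-indexed frames. One description alone gives half the frame
# rate; merging both restores the original order.

def mdc_split(frames):
    return frames[0::2], frames[1::2]

def mdc_merge(desc_even, desc_odd):
    merged = []
    for i in range(max(len(desc_even), len(desc_odd))):
        if i < len(desc_even):
            merged.append(desc_even[i])
        if i < len(desc_odd):
            merged.append(desc_odd[i])
    return merged
```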
Error Concealment. Error-resilient encoding is executed by the source to enhance
robustness of compressed video before packet loss actually happens (this is called
preventive approach). On the other hand, error concealment is performed by the
receiver when packet loss has already occurred (this is called reactive approach).
Specifically, error concealment is employed by the receiver to conceal the lost
data and make the presentation less displeasing to human eyes.
There are two basic approaches to error concealment: spatial and temporal inter-
polations. In spatial interpolation, missing pixel values are reconstructed using
neighboring spatial information. In temporal interpolation, the lost data are recon-
structed from data in the previous frames. Typically, spatial interpolation is used
to reconstruct the missing data in intracoded frames, while temporal interpolation
is used to reconstruct the missing data in intercoded frames [126]. If the network
is able to support QoS for video streaming, the performance can be further enhanced.
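The two interpolation approaches can be sketched on a tiny grayscale block (function names are illustrative):

```python
# Spatial interpolation: estimate a missing pixel as the average of its
# available 4-neighbors in the same frame (used for intracoded frames).
# Temporal interpolation: copy the co-located pixel from the previous
# frame (used for intercoded frames).

def conceal_spatial(frame, row, col):
    neighbors = []
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + dr, col + dc
        if 0 <= r < len(frame) and 0 <= c < len(frame[0]):
            neighbors.append(frame[r][c])
    return sum(neighbors) // len(neighbors)

def conceal_temporal(prev_frame, row, col):
    return prev_frame[row][col]
```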
Continuous Media Distribution Services
In order to provide quality multimedia presentations, adequate support from the net-
work is critical. This is because network support can reduce transport delay and
packet loss ratio. Streaming video and audio are classified as continuous media
because they consist of a sequence of media quanta (such as audio samples or
video frames), which convey meaningful information only when presented in
time. Built on top of the Internet (IP protocol), continuous media distribution ser-
vices are designed with the aim of providing QoS and achieving efficiency for
streaming video/audio over the best-effort Internet. Continuous media distribution
services include network filtering, application-level multicast, and content
replication.
Network Filtering. As a congestion control technique, network filtering aims to
maximize video quality during network congestion. The filter at the video server
can adapt the rate of video streams according to the network congestion status.
Figure 5.59 illustrates an example of placing filters in the network. The nodes
labeled R denote routers that have no knowledge of the format of the media streams
and may randomly discard packets. The filter nodes receive the client’s requests and
adapt the stream sent by the server accordingly. This solution allows the service pro-
vider to place filters on the nodes that connect to network bottlenecks. Furthermore,
multiple filters can be placed along the path from a server to a client.
To illustrate the operations of filters, a system model is depicted in Figure 5.60.
The model consists of the server, the client, at least one filter, and two virtual
Figure 5.59 Filters placed inside the network.
channels between them. Of the virtual channels, one is for control and the other is for
data. The same channels exist between any pair of filters. The control channel is
bidirectional, which can be realized by TCP connections. The model shown allows
the client to communicate with only one host (the last filter), which will either for-
ward the requests or act upon them. The operations of a filter on the data plane
include: (1) receiving the video stream from the server or the previous filter, and (2) sending the video to the client or the next filter at the target rate. The operations of a filter on the control
plane include: (1) receiving requests from client or next filter, (2) acting upon
requests, and (3) forwarding the requests to its previous filter.
Typically, frame-dropping filters are used as network filters. The receiver can
change the bandwidth of the media stream by sending requests to the filter to
increase or decrease the frame dropping rate. To facilitate decisions on whether
the filter should increase or decrease the bandwidth, the receiver continuously
measures the packet loss ratio p. Based on the packet loss ratio, a rate-control mech-
anism can be designed as follows. If the packet loss ratio is higher than a threshold a,
the client will ask the filter to increase the frame-dropping rate. If the packet loss
ratio is less than another threshold b (b < a), the receiver will ask the filter to
reduce the frame-dropping rate [105].
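The receiver-side control rule reduces to a two-threshold comparison (the threshold values below are illustrative, not from the text):

```python
# Receiver-side control for a frame-dropping network filter: compare the
# measured packet loss ratio p against thresholds a and b (b < a).

def filter_request(p, a=0.05, b=0.01):
    """Return the request the receiver sends to the filter."""
    if p > a:
        return "increase_drop_rate"   # congestion: drop more frames
    if p < b:
        return "decrease_drop_rate"   # headroom: drop fewer frames
    return "no_change"
```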
The advantages of using frame-dropping filters inside the network include the
following:
. Improved video quality. For example, when a video stream flows from an
upstream link with larger available bandwidth to a downstream link with smal-
ler available bandwidth, use of a frame-dropping filter at the connection point
between the upstream link and the downstream link can help improve the video
quality. This is because the filter understands the format of the media stream
and can drop packets in a way that gracefully degrades the stream’s quality
instead of corrupting the flow outright.
. Bandwidth efficiency. This is because the filtering can help to save network
resources by discarding those frames that are late.
Application-Level Multicast. The Internet’s original design, while well suited for
point-to-point applications like e-mail, file transfer, and Web browsing, fails to
effectively support large-scale content delivery like streaming-media multicast. In
an attempt to address this shortcoming, a technology called IP multicast was pro-
posed. As an extension to the IP layer, IP multicast is capable of providing efficient mul-
tipoint packet delivery. To be specific, the efficiency is achieved by having one and
Figure 5.60 Systems model of network filtering.
only one copy of the original IP packet (sent by the multicast source) be transported
along any physical path in the IP multicast tree. However, with a decade of research
and development, there are still many barriers in deploying IP multicast. These pro-
blems include scalability, network management, deployment, and support for higher-layer functionality (e.g., error, flow, and congestion control) [127].
Application-level multicast is aimed at building a multicast service on top of the
Internet. It enables independent content delivery service providers (CSPs), Internet
service providers (ISPs), or enterprises to build their own Internet multicast net-
works and interconnect them into larger, worldwide media multicast networks.
That is, the media multicast network can support peering relationships at the appli-
cation level or streaming-media/content layer, where content backbones intercon-
nect service providers. Hence, much as the Internet is built from an
interconnection of networks enabled through IP-level peering relationships among
ISPs, the media multicast networks can be built from an interconnection of con-
tent-distribution networks enabled through application-level peering relationships
among various sorts of service providers, namely, traditional ISPs, CSPs, and appli-
cations service providers (ASPs).
The advantage of the application-level multicast is that it breaks barriers such as
scalability, network management, and support for congestion control, which have
prevented Internet service providers from establishing IP multicast peering
arrangements.
Content Replication. An important technique for improving scalability of the
media delivery system is content replication. Content replication takes
two forms: mirroring and caching, which are deployed by the content delivery ser-
vice provider (CSP) and Internet service provider (ISP). Both mirroring and caching
seek to place content closer to the clients and both share the following advantages:
. reduced bandwidth consumption on network links
. reduced load on streaming servers
. reduced latency for clients
. increased availability.
Mirroring is to place copies of the original multimedia files on the other machines
scattered around the Internet. That is, the original multimedia files are stored on
the main server while copies of the original multimedia files are placed on duplicate
servers. In this way, clients can retrieve multimedia data from the nearest duplicate
server, which gives the clients the best performance (e.g., lowest latency). Mirroring
has some disadvantages. Currently, establishing a dedicated mirror on an existing
server, while cheaper, is still an ad hoc and administratively complex
process. Finally, there is no standard way to make scripts and server setup easily
transferable from one server to another.
Caching, which is based on the belief that different clients will load many of the
same contents, makes local copies of contents that the clients retrieve. Typically,
clients in a single organization retrieve all contents from a single local machine,
called a cache. The cache retrieves a video file from the origin server, storing a
copy locally and then passing it on to the client who requests it. If a client asks
for a video file that the cache has already stored, the cache will return the local
copy rather than going all the way to the origin server where the video file resides.
In addition, cache sharing and cache hierarchies allow each cache to access files
stored at other caches so that the load on the origin server can be reduced and net-
work bottlenecks can be alleviated [128].
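The cache lookup just described can be sketched as follows (a minimal sketch; `origin` is any mapping standing in for the remote origin server, and the class name is hypothetical):

```python
# Sketch of a streaming cache: serve the local copy on a hit; otherwise
# fetch from the origin server, store a copy, and pass it to the client.

class VideoCache:
    def __init__(self, origin):
        self.origin = origin   # stand-in for the origin server
        self.store = {}        # local copies
        self.hits = 0
        self.misses = 0

    def get(self, url):
        if url in self.store:
            self.hits += 1
            return self.store[url]       # local copy: no origin traffic
        self.misses += 1
        data = self.origin[url]          # fetch from the origin server
        self.store[url] = data           # keep a local copy
        return data
```

Every hit saves a round trip to the origin server, which is exactly where the bandwidth and latency advantages listed above come from.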
Streaming Servers
Streaming servers play a key role in providing streaming services. To offer quality
streaming services, streaming servers are required to process multimedia data under
timing constraints in order to prevent artifacts (e.g., jerkiness in video motion and
pops in audio) during playback at the clients. In addition, streaming servers also
need to support video cassette recorder (VCR)-like control operations, such as
stop, pause/resume, fast forward, and fast backward. Streaming servers have to
retrieve media components in a synchronous fashion. For example, retrieving a lec-
ture presentation requires synchronizing video and audio with lecture slides. A
streaming server consists of the following three subsystems: communicator (e.g.,
transport protocol), operating system, and storage system.
. The communicator involves the application layer and transport protocols
implemented on the server. Through a communicator the clients can communi-
cate with a server and retrieve multimedia contents in a continuous and syn-
chronous manner.
. The operating system, different from traditional operating systems, needs to
satisfy real-time requirements for streaming applications.
. The storage system for streaming services has to support continuous media sto-
rage and retrieval.
Media Synchronization
Media synchronization is a major feature that distinguishes multimedia applications
from other traditional data applications. With media synchronization mechanisms,
the application at the receiver side can present various media streams in the same
way as they were originally captured. An example of media synchronization is
that the movements of a speaker’s lips match the played-out audio.
A major feature that distinguishes multimedia applications from other traditional
data applications is the integration of various media streams that must be presented
in a synchronized fashion. For example, in distance learning, the presentation of
slides should be synchronized with the commenting audio stream. Otherwise, the
current slide being displayed on the screen may not correspond to the lecturer’s
explanation heard by the students, which is annoying. With media synchronization,
the application at the receiver side can present the media in the same way as they
were originally captured. Synchronization between the slides and the commenting
audio stream is shown in Figure 5.61.
Media synchronization refers to maintaining the temporal relationships within
one data stream and among various media streams. There are three levels of syn-
chronization, namely, intrastream, interstream, and interobject synchronization.
The three levels of synchronization correspond to three semantic layers of multime-
dia data as follows [129]:
Intrastream Synchronization. The lowest layer of continuous media or time-
dependent data (such as video and audio) is the media layer. The unit of the
media layer is a logical data unit such as a video/audio frame, which adheres to strict
temporal constraints to ensure acceptable user perception at playback. Synchroniza-
tion at this layer is referred to as intrastream synchronization, which maintains the
continuity of logical data units. Without intrastream synchronization, the presen-
tation of the stream may be interrupted by pauses or gaps.
Interstream Synchronization. The second layer of time-dependent data is the
stream layer. The unit of the stream layer is a whole stream. Synchronization at
this layer is referred to as interstream synchronization, which maintains temporal
relationships among different continuous media. Without interstream synchroniza-
tion, skew between the streams may become intolerable. For example, users could
be annoyed if they notice that the movements of the lips of a speaker do not corre-
spond to the presented audio.
Interobject Synchronization. The highest layer of the multimedia document is the
object layer, which integrates streams and time-independent data such as text and
still images. Synchronization at this layer is referred to as interobject synchroniza-
tion. The objective of interobject synchronization is to start and stop the presentation
of the time-independent data within a tolerable time interval, if some previously
defined points of the presentation of a time-dependent media object are reached.
Without interobject synchronization, for example, the audience of a slide show
could be annoyed if the audio is commenting on one slide while another slide
is being presented.
The essential part of any media synchronization mechanism is the specifications of
the temporal relations within a medium and between the media. The temporal
relations can be specified either automatically or manually. In the case of audio/video recording and playback, the relations are specified automatically by the
Figure 5.61 Synchronization between the slides and the commenting audio stream.
recording device. In the case of presentations that are composed of independently cap-
tured or otherwise created media, the temporal relations have to be specified manually
(with human support). The manual specification can be illustrated by the design of a
slide show: the designer selects the appropriate slides, creates an audio object, and
defines the units of the audio stream where the slides have to be presented.
The methods that are used to specify the temporal relations include interval-
based, axes-based, control flow-based, and event-based specifications. A widely
used specification method for continuous media is axes-based specifications
or time-stamping: at the source, a stream is time-stamped to keep temporal
information within the stream and with respect to other streams; at the destina-
tion, the application presents the streams according to their temporal
relation [130].
Besides specifying the temporal relations, it is desirable that synchronization be
supported by each component on the transport path. For example, the servers store
large amounts of data in such a way that retrieval is quick and efficient, to reduce
delay; the network provides sufficient bandwidth, and delay and jitter introduced
by the network are tolerable to the multimedia applications; the operating systems
and the applications provide real-time data processing (e.g., retrieval, resynchroni-
zation, and display). However, real-time support from the network is not available in
the current Internet. Hence, most synchronization mechanisms are implemented at
the end systems. The synchronization mechanisms can be either preventive or cor-
rective [131].
Preventive mechanisms are designed to minimize synchronization errors as data
is transported from the server to the user. In other words, preventive mechanisms
attempt to minimize latencies and jitters. These mechanisms involve disk-reading
scheduling algorithms, network transport protocols, operating systems, and synchro-
nization schedulers. Disk-reading scheduling is the process of organizing and coor-
dinating the retrieval of data from the storage devices. Network transport protocols
provide means for maintaining synchronization during data transmission over the
Internet.
Corrective mechanisms are designed to recover synchronization in the presence
of synchronization errors. Synchronization errors are unavoidable, since the Inter-
net introduces random delay, which destroys the continuity of the media stream by
incurring gaps and jitters during data transmission. Therefore, certain compen-
sations (i.e., corrective mechanisms) at the receiver are necessary when synchroni-
zation errors occur. An example of corrective mechanisms is the stream
synchronization protocol (SSP). In SSP, the concept of an intentional delay is
used by the various streams in order to adjust their presentation time to recover
from network delay variations. The operations of SSP are described as follows.
At the client side, units that control and monitor the client end of the data connec-
tions compare the real arrival times of data with the ones predicted by the presen-
tation schedule and notify the scheduler of any discrepancies. These discrepancies
are then compensated by the scheduler, which delays the display of data that are
ahead of other data, allowing the late data to catch up. To conclude, media
synchronization is one of the key issues in the design of media streaming services.
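The SSP-style corrective step can be sketched as a small scheduling computation (function and variable names are hypothetical): the most-delayed stream sets the pace, and every stream that is ahead receives an intentional delay so the late data can catch up.

```python
# Sketch of SSP-style corrective synchronization: compare each stream's
# actual arrival time with its scheduled time for the same data unit and
# intentionally delay the streams that are ahead.

def intentional_delays(scheduled, actual):
    """scheduled/actual: dicts mapping stream name -> time (seconds)
    for the same data unit. Returns the extra display delay per stream."""
    lags = {s: actual[s] - scheduled[s] for s in scheduled}
    worst = max(lags.values())  # the most-delayed stream sets the pace
    return {s: worst - lag for s, lag in lags.items()}
```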
Protocols for Streaming Media
Protocols are designed and standardized for communication between clients and
streaming servers. Protocols for streaming media provide such services as network
addressing, transport, and session control. According to their functionalities, the pro-
tocols can be classified into three categories: network-layer protocol such as Internet
protocol (IP), transport protocol such as user datagram protocol (UDP), and session
control protocol such as real-time streaming protocol (RTSP).
Network-layer protocol provides basic network service support such as network
addressing. The IP serves as the network-layer protocol for Internet video streaming.
Transport protocol provides end-to-end network transport functions for stream-
ing applications. Transport protocols include UDP, TCP, real-time transport proto-
col (RTP), and real-time control protocol (RTCP). UDP and TCP are lower-layer
transport protocols while RTP and RTCP [132] are upper-layer transport protocols,
which are implemented on top of UDP/TCP. Protocol stacks for media streaming
are shown in Figure 5.62.
Session control protocol defines the messages and procedures to control the
delivery of the multimedia data during an established session. The RTSP and the
session initiation protocol (SIP) are such session control protocols [133, 134].
To illustrate the relationship among the three types of protocols, we depict the
protocol stacks for media streaming. For the data plane, at the sending side, the com-
pressed video/audio data is retrieved and packetized at the RTP layer. The RTP-
packetized streams provide timing and synchronization information, as well as
sequence numbers. The RTP-packetized streams are then passed to the UDP/TCP layer and the IP layer. The resulting IP packets are transported over the Internet.
At the receiver side, the media streams are processed in the reverse manner before
Figure 5.62 Protocol stacks for media streaming [105]. (©2001 IEEE.)
their presentations. For the control plane, RTCP packets and RTSP packets are mul-
tiplexed at the UDP/TCP layer and move to the IP layer for transmission over the
Internet.
In what follows, we will discuss transport protocols for streaming media. Then,
we will describe control protocols, that is, real-time streaming protocol (RTSP) and
session initiation protocol (SIP).
Transport Protocols. The transport protocol family for media streaming includes
UDP, TCP, RTP, and RTCP protocols [135]. UDP and TCP provide basic transport
functions, while RTP and RTCP run on top of UDP/TCP. UDP and TCP protocols
support such functions as multiplexing, error control, congestion control, and flow
control. These functions can be briefly described as follows. First, UDP and TCP
can multiplex data streams for different applications running on the same machine
with the same IP address. Secondly, for the purpose of error control, TCP and most
UDP implementations employ the checksum to detect bit errors. If single or multiple
bit errors are detected in the incoming packet, the TCP/UDP layer discards the
packet so that the upper layer (e.g., RTP) will not receive the corrupted packet.
On the other hand, different from UDP, TCP uses retransmission to recover lost
packets. Therefore, TCP provides reliable transmission while UDP does not.
Thirdly, TCP employs congestion control to avoid sending too much traffic,
which may cause network congestion. This is another feature that distinguishes
TCP from UDP. Lastly, TCP employs flow control to prevent the receiver buffer
from overflowing while UDP does not have any flow control mechanisms.
Since TCP retransmission introduces delays that are not acceptable for streaming
applications with stringent delay requirements, UDP is typically employed as the
transport protocol for video streams. In addition, since UDP does not guarantee
packet delivery, the receiver needs to rely on the upper layer (i.e., RTP) to detect
packet loss.
RTP is an Internet standard protocol designed to provide end-to-end transport
functions for supporting real-time applications. RTCP is a companion protocol
with RTP and is designed to provide QoS feedback to the participants of an RTP
session. In other words, RTP is a data transfer protocol while RTCP is a control
protocol.
RTP does not guarantee QoS or reliable delivery, but rather provides the follow-
ing functions in support of media streaming:
. Time-stamping. RTP provides time-stamping to synchronize different media
streams. Note that RTP itself is not responsible for the synchronization,
which is left to the applications.
. Sequence numbering. Since packets arriving at the receiver may be out of
sequence (UDP does not deliver packets in sequence), RTP employs sequence
numbering to place the incoming RTP packets in the correct order. The
sequence number is also used for packet loss detection.
. Payload type identification. The type of the payload contained in an RTP packet
is indicated by an RTP-header field called payload type identifier. The receiver
interprets the contents of the packet based on the payload type identifier. Cer-
tain common payload types such as MPEG-1/2 audio and video have been
assigned payload type numbers. For other payloads, this assignment can be
done with session control protocols [136].
. Source identification. The source of each RTP packet is identified by an RTP-
header field called Synchronization SouRCe identifier (SSRC), which provides
a means for the receiver to distinguish different sources.
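The four header fields listed above (time-stamp, sequence number, payload type, SSRC) occupy fixed positions in the 12-byte RTP header defined by RFC 3550; the sketch below packs and unpacks them with the standard `struct` module (no CSRC list, padding, or extension):

```python
import struct

# Fixed RTP header (RFC 3550): V=2, P, X, CC | M, PT | sequence number |
# timestamp | SSRC. Network byte order throughout.

def pack_rtp_header(seq, timestamp, ssrc, payload_type, marker=0):
    byte0 = 2 << 6                          # version 2, no P/X/CC
    byte1 = (marker << 7) | (payload_type & 0x7F)
    return struct.pack("!BBHII", byte0, byte1, seq, timestamp, ssrc)

def unpack_rtp_header(data):
    byte0, byte1, seq, ts, ssrc = struct.unpack("!BBHII", data[:12])
    return {
        "version": byte0 >> 6,
        "payload_type": byte1 & 0x7F,
        "seq": seq,
        "timestamp": ts,
        "ssrc": ssrc,
    }
```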
RTCP is the control protocol designed to work in conjunction with RTP. In an
RTP session, participants periodically send RTCP packets to convey feedback on
quality of data delivery and information of membership. RTCP essentially provides
the following services.
. QoS feedback. This is the primary function of RTCP. RTCP provides feedback
to an application regarding the quality of data distribution. The feedback is
in the form of sender reports (sent by the source) and receiver reports (sent
by the receiver). The reports can contain information on the quality of reception
such as: (1) fraction of the lost RTP packets, since the last report; (2) cumulat-
ive number of lost packets, since the beginning of reception; (3) packet inter-
arrival jitter; and (4) delay since receiving the last sender’s report. The
control information is useful to the senders, the receivers, and third-party moni-
tors. Based on the feedback, the sender can adjust its transmission rate, the
receivers can determine whether congestion is local, regional, or global, and
the network manager can evaluate the network performance for multicast
distribution.
. Participant identification. A source can be identified by the SSRC field in the
RTP header. Unfortunately, the SSRC identifier is not convenient for the human
user. To remedy this problem, the RTCP provides human-friendly mechanisms
for source identification. Specifically, the RTCP SDES (source description)
packet contains textual information called canonical names as globally unique
identifiers of the session participants. It may include a user’s name, telephone
number, e-mail address, and other information.
. Control packet scaling. To scale RTCP control-packet transmission with the
number of participants, a control mechanism is designed as follows.
The control mechanism keeps the total control packets to 5 percent of the
total session bandwidth. Among the control packets, 25 percent are allocated
to the sender reports and 75 percent to the receiver reports. To prevent control
packet starvation, at least one control packet is sent within 5 s at the sender or
receiver.
. Intermedia synchronization. The RTCP sender reports contain an indication of
real time and the corresponding RTP time-stamp. This can be used for intermedia
synchronization, such as lip synchronization between audio and video.
. Minimal session control information. This optional functionality can be used
for transporting session information such as names of the participants.
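Two of these services, the interarrival-jitter estimate carried in receiver reports and the bandwidth-scaling rule (5 percent of session bandwidth, split 25/75 between sender and receiver reports, with a 5 s minimum), can be sketched in code. The constants follow the text; the simplified interval formula below is an assumption modeled on RFC 3550, not the exact standard algorithm:

```python
RTCP_BW_FRACTION = 0.05   # control traffic capped at 5% of session bandwidth
SENDER_SHARE = 0.25       # 25% for sender reports, 75% for receiver reports
MIN_INTERVAL_S = 5.0      # at least one control packet within 5 s

def update_jitter(jitter: float, prev_transit: float, transit: float) -> float:
    """One RFC 3550 jitter update: J += (|D| - J)/16, where transit is the
    difference between a packet's arrival time and its RTP timestamp
    (both expressed in timestamp units)."""
    d = abs(transit - prev_transit)
    return jitter + (d - jitter) / 16.0

def rtcp_interval(session_bw: float, avg_pkt_size: float,
                  members: int, senders: int, we_sent: bool) -> float:
    """Simplified RTCP report interval (seconds) for one participant."""
    rtcp_bw = RTCP_BW_FRACTION * session_bw          # bytes/s for all RTCP
    if we_sent:
        share, n = SENDER_SHARE * rtcp_bw, senders
    else:
        share, n = (1.0 - SENDER_SHARE) * rtcp_bw, members - senders
    return max(n * avg_pkt_size / share, MIN_INTERVAL_S)
```

For example, with a 10 kB/s session, 100-byte reports, 20 receivers, and 1 sender, each receiver shares 375 bytes/s of report bandwidth and sends a report roughly every 5.3 s, while the lone sender is held to the 5 s floor.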
5.4 MEDIA STREAMING 455
Session Control Protocols (RTSP and SIP). The RTSP is a session control pro-
tocol for streaming media over the Internet. One of the main functions of RTSP is to
support VCR-like control operations such as stop, pause/resume, fast forward, and
fast backward. In addition, RTSP also provides means for choosing delivery chan-
nels (e.g., UDP, multicast UDP, or TCP), and delivery mechanisms based upon RTP.
RTSP works for multicast as well as unicast.
Another main function of RTSP is to establish and control streams of continuous
audio and video media between the media server and the clients. Specifically, RTSP
provides the following operations.
. Media retrieval. The client can request a presentation description, and ask the
server to set up a session to send the requested media data.
. Adding media to an existing session. The server or the client can notify each
other about any additional media becoming available to the established session.
In RTSP, each presentation and media stream is identified by an RTSP uniform
resource locator (URL). The overall presentation and the properties of the media
are defined in a presentation description file, which may include the encoding,
language, RTSP URLs, destination address, port, and other parameters. The
presentation description file can be obtained by the client using HTTP, e-mail, or
other means.
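A typical RTSP exchange, DESCRIBE to fetch the presentation description, SETUP to choose the delivery channel, and PLAY to start streaming, can be sketched as below. The server URL, ports, and session identifier are hypothetical; only the RTSP/1.0 message syntax is taken from the protocol:

```python
def rtsp_request(method: str, url: str, cseq: int, headers: dict = None) -> str:
    """Build a minimal RTSP/1.0 request: request line, CSeq, extra headers."""
    lines = [f"{method} {url} RTSP/1.0", f"CSeq: {cseq}"]
    for k, v in (headers or {}).items():
        lines.append(f"{k}: {v}")
    return "\r\n".join(lines) + "\r\n\r\n"

# Hypothetical session against a hypothetical server URL:
url = "rtsp://media.example.com/lecture"
msgs = [
    # fetch the presentation description (SDP)
    rtsp_request("DESCRIBE", url, 1, {"Accept": "application/sdp"}),
    # choose the delivery channel: unicast RTP over UDP on client ports 4588-4589
    rtsp_request("SETUP", url + "/video", 2,
                 {"Transport": "RTP/AVP;unicast;client_port=4588-4589"}),
    # start playback from the beginning
    rtsp_request("PLAY", url, 3, {"Session": "12345", "Range": "npt=0-"}),
]
```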
SIP is another session control protocol. Similar to RTSP, SIP can also create and
terminate sessions with one or more participants. Unlike RTSP, SIP supports user
mobility by proxying and redirecting requests to the user’s current location.
To summarize, RTSP and SIP are designed to initiate and direct delivery of
streaming media data from media servers. RTP is a transport protocol for streaming
media data while RTCP is a protocol for monitoring delivery of RTP packets. UDP
and TCP are lower-layer transport protocols for RTP/RTCP/RTSP/SIP packets and
IP provides a common platform for delivering UDP/TCP packets over the Internet.
The combination of these protocols provides a complete streaming service over the
Internet.
Video streaming is an important component of many Internet multimedia appli-
cations, such as distance learning, digital libraries, home shopping, and video-on-
demand. The best-effort nature of the current Internet poses many challenges to
the design of streaming video systems. Our objective is to give the reader a perspec-
tive on the range of options available and the associated trade-offs among perform-
ance, functionality, and complexity.
5.4.3 Challenges for Transporting Real-Time Video Over the Internet
Transporting video over the Internet is an important component of many multimedia
applications. Lack of QoS support in the current Internet, and the heterogeneity of
the networks and end systems pose many challenging problems for designing
video delivery systems. Four problems for video delivery systems can be identified:
bandwidth, delay, loss, and heterogeneity. Two general approaches address these
problems: the network-centric approach and the end system-based approach
[106]. Over the past several years extensive research based on the end system-
based approach has been conducted and various solutions have been proposed. A
holistic approach was taken from both transport and compression perspectives. A
framework for transporting real-time Internet video includes two components,
namely, congestion control and error control. It is well known that congestion con-
trol consists of rate control, rate-adaptive coding, and rate shaping. Error control
consists of forward error correction (FEC), retransmission, error resilience and
error concealment. As shown in Table 5.12, the approaches in the design space
can be classified along two dimensions: the transport perspective and the com-
pression perspective.
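The FEC component of error control can be illustrated with the simplest possible case: a single XOR parity packet computed over a group of equal-length media packets lets the receiver rebuild any one lost packet in the group without retransmission. This is a minimal sketch, not a production FEC scheme:

```python
def xor_parity(packets: list) -> bytes:
    """Compute one XOR parity packet over equal-length packets; any single
    lost packet in the group can be rebuilt from the survivors plus parity."""
    parity = bytearray(len(packets[0]))
    for p in packets:
        for i, byte in enumerate(p):
            parity[i] ^= byte
    return bytes(parity)

# Hypothetical two-byte media packets
group = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_parity(group)

# Suppose group[1] is lost: XOR the parity with the surviving packets
recovered = xor_parity([group[0], group[2], parity])
```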
There are three mechanisms for congestion control: rate control, rate-adaptive
video encoding, and rate shaping. Rate control schemes can be classified into three
categories: source-based, receiver-based, and hybrid. As shown in Table 5.13, these
schemes can follow either the model-based approach or the probe-based approach.
Source-based rate control, for example, is primarily targeted at unicast and can
follow either approach.
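A classic example of the model-based approach is to set the source rate from a TCP throughput model so that the video flow remains TCP-friendly. A minimal sketch using the simple throughput formula T = 1.22 x MTU / (RTT x sqrt(p)); the numeric inputs are illustrative assumptions:

```python
import math

def tcp_friendly_rate(mtu_bytes: float, rtt_s: float, loss_ratio: float) -> float:
    """Model-based target send rate in bytes/s from the simple TCP
    throughput formula T = 1.22 * MTU / (RTT * sqrt(p))."""
    return 1.22 * mtu_bytes / (rtt_s * math.sqrt(loss_ratio))

# e.g. a 1500-byte MTU, 100 ms round-trip time, 1% packet loss
rate = tcp_friendly_rate(1500, 0.1, 0.01)   # about 183 kB/s
```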
There have been extensive efforts on the combined transport approach and com-
pression approach [137]. The synergy of transport and compression can provide bet-
ter solutions in the design of video delivery systems.
Table 5.12 Taxonomy of the design space

Transport                                         Compression
Congestion control   Rate control                 Source-based
                                                  Receiver-based
                                                  Hybrid
                     Rate-adaptive encoding       Altering quantizer
                                                  Altering frame rate
                     Rate shaping                 Selective frame discard
                                                  Dynamic rate shaping
Error control        FEC                          Channel coding
                                                  SFEC
                                                  Joint channel/source coding
                     Delay-constrained            Sender-based control
                     retransmission               Receiver-based control
                                                  Hybrid control
                     Error resilience             Optimal mode selection
                                                  Multiple description coding
                     Error concealment            EC-1, EC-2, EC-3
Under the end-to-end approach, three factors are identified to have impact on the
video presentation quality at the receiver:
. the source behavior (e.g., quantization and packetization)
. the path characteristics
. the receiver behavior (e.g., error concealment (EC)).
Figure 5.63 represents factors that have impact on the video presentation quality,
that is, source behavior, path characteristics, and receiver behavior. By taking into
consideration the network congestion status and receiver behavior, the end-to-end
approach is capable of offering superior performance over the classical approach
for Internet video applications. A promising future research direction is to combine
the end system-based control techniques with QoS support from the network.
Different from the case in circuit-switched networks, in packet-switched net-
works, flows are statistically multiplexed onto physical links and no flow is isolated.
To achieve high statistical multiplexing gain or high resource utilization in the net-
work, occasional violations of hard QoS guarantees (called statistical QoS) are
allowed. For example, the delay of 95 percent of the packets is within the delay
bound, while 5 percent of the packets are not guaranteed to have bounded delays.
The percentage (e.g., 95 percent) is in an average sense. In other words, a certain
flow may have only 10 percent of its packets arriving within the delay bound
while the average for all flows is
Table 5.13 Rate control

                               Model-based     Probe-based
Rate control   Source-based    Unicast         Unicast/Multicast
               Receiver-based  Multicast       Multicast
               Hybrid                          Multicast
[Figure 5.63 depicts the pipeline: at the sender, raw video passes through the video
encoder, packetizer, and transport protocol; across the network path; and at the
receiver, through the transport protocol, depacketizer, and video decoder.]
Figure 5.63 Factors that have impact on video presentation quality: source behavior, path
characteristics, and receiver behavior [137]. (#2000 IEEE.)
95 percent. The statistical QoS service only guarantees the average performance,
rather than the performance for each flow. In this case, if the end system-based con-
trol is employed for each video stream, higher presentation quality can be achieved
since the end system-based control is capable of adapting to short-term violations.
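The averaging effect described above is easy to see numerically. In this hypothetical twenty-flow example, one badly served flow is masked by a healthy average:

```python
# Fraction of packets meeting the delay bound, per flow (hypothetical numbers):
# eighteen flows fully within the bound, one at 90%, one at only 10%
per_flow = [1.0] * 18 + [0.9, 0.1]

average = sum(per_flow) / len(per_flow)   # about 0.95: the statistical
worst = min(per_flow)                     # guarantee holds on average, yet one
                                          # flow is far below the delay bound
```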
As a final note, we would like to point out that each scheme has a trade-off
between cost/complexity and performance. Designers can choose a scheme in the
design space that meets the specific cost/performance objectives.
5.4.4 End-to-End Architecture for Transporting
MPEG-4 Video Over the Internet
With the success of the Internet and flexibility of MPEG-4, transporting MPEG-4
video over the Internet is expected to be an important component of many multime-
dia applications in the near future. Video applications typically have delay and loss
requirements, which cannot be adequately supported by the current Internet. Thus, it
is a challenging problem to design an efficient MPEG-4 video delivery system that
can maximize perceptual quality while achieving high resource utilization.
MPEG-4 builds on elements from several successful technologies, such as digital
video, computer graphics, and the World Wide Web, with the aim of providing
powerful tools in the production, distribution, and display of multimedia contents.
With the flexibility and efficiency provided by coding a new form of visual data
called visual object (VO), it is foreseen that MPEG-4 will be capable of addressing
interactive content-based video services, as well as conventional storage and
transmission video. Furthermore, the traffic load over the Internet varies drastically
with time, which is detrimental to video transmission.
Figure 5.64 shows an end-to-end architecture for transporting MPEG-4 video
over the Internet. The architecture is applicable to both precompressed video and
live video.
If the source is precompressed video, the bit rate of the stream can be matched to
the rate enforced by a feedback control protocol through dynamic rate shaping [138]
or selective frame discarding [136]. If the source is live video, the feedback rate is
used by the MPEG-4 rate adaptation algorithm to control the output rate of the
encoder.
On the sender side, the raw bit stream of live video is encoded by an adaptive
MPEG-4 encoder. After this stage, the compressed video bit stream is first
packetized at the sync layer (SL) and then passed through the RTP/UDP/IP layers
before entering the Internet.
Packets may be dropped at a router/switch (due to congestion) or at the destination
(due to excess delay). Packets that are successfully delivered to the destination first
pass through the RTP/UDP/IP layers in reverse order before being decoded by the
MPEG-4 decoder.
Under the architecture, a QoS monitor is kept at the receiver side to infer network
congestion status based on the behavior of the arriving packets, for example, packet
loss and delay. Such information is carried by the feedback control protocol back
to the source. Based on such feedback information, the sender estimates the
available network bandwidth and controls the output rate of the MPEG-4 encoder.
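One simple realization of this feedback loop is an AIMD-style, probe-based control step: increase the target encoder rate while the reported loss stays low, and back off multiplicatively when it does not. The thresholds, step sizes, and rate bounds below are illustrative assumptions, not values from the text:

```python
def adapt_rate(rate: float, loss_fraction: float, *,
               loss_threshold: float = 0.05,
               increase_step: float = 5_000,      # additive increase, bytes/s
               decrease_factor: float = 0.875,    # multiplicative decrease
               min_rate: float = 32_000,
               max_rate: float = 1_000_000) -> float:
    """One probe-based rate-control step, driven by the loss fraction
    reported in receiver feedback."""
    if loss_fraction > loss_threshold:
        rate *= decrease_factor       # congestion signalled: back off
    else:
        rate += increase_step         # low loss: probe for more bandwidth
    return max(min_rate, min(max_rate, rate))

rate = adapt_rate(256_000, 0.01)      # low loss: rate rises to 261 kB/s
rate = adapt_rate(rate, 0.12)         # heavy loss: rate cut back
```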
Figure 5.65 shows the protocol stack for transporting MPEG-4 video. The right
half shows the processing stages at an end system. At the sending side, the com-
pression layer compresses the visual information and generates elementary streams
(ESs), which contain the coded representation of the VOs. The ESs are packetized as
SL-packetized streams at the SL. The SL-packetized streams are multiplexed into a
FlexMux stream at the FlexMux layer, which is then passed to the transport protocol
stack composed of RTP, UDP, and IP. The resulting IP packets are transported over
the Internet. At the receiver side, the video stream is processed in the reverse
manner before its presentation. The left half shows the data format of
each layer.
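The layering order (SL packets framed into a FlexMux stream, which then becomes the RTP payload) can be illustrated with a toy framing routine. The one-byte channel/length header below is a deliberate simplification of the real MPEG-4 Systems FlexMux syntax, and the payload bytes are hypothetical:

```python
import struct

def flexmux_packetize(sl_payloads: list, channel: int = 0) -> bytes:
    """Toy FlexMux-style framing: prefix each SL packet with a one-byte
    channel index and a one-byte length. The real MPEG-4 FlexMux syntax is
    richer; this only illustrates the order SL -> FlexMux -> RTP payload."""
    out = bytearray()
    for sl in sl_payloads:
        out += struct.pack("!BB", channel, len(sl)) + sl
    return bytes(out)

# Two hypothetical 8-byte SL packets carrying VO data
sl_packets = [b"\x01VOdata1", b"\x02VOdata2"]
flexmux_stream = flexmux_packetize(sl_packets)
# flexmux_stream would then be handed to the RTP/UDP/IP module as payload
```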
Figure 5.66 shows the structure of the MPEG-4 video encoder. The raw video
stream is first segmented into video objects, each encoded by an individual VO
encoder. The encoded VO bit streams are packetized before being multiplexed by
the stream multiplex interface. The resulting FlexMux stream is passed to the
RTP/UDP/IP module.
The structure of an MPEG-4 video decoder is shown in Figure 5.67. Packets from
RTP/UDP/IP are transferred to a stream demultiplex interface and FlexMux buffer.
The packets are demultiplexed and put into corresponding decoding buffers. The
error concealment component will duplicate the previous VOP when packet loss
is detected. The VO decoders decode the data in the decoding buffer and produce
composition units (CUs), which are then put into composition memories to be con-
sumed by the compositor.
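The concealment step described for Figure 5.67, duplicating the previous VOP when a loss is detected, can be sketched as follows. This is a minimal illustration; a real decoder conceals per video object and may use more elaborate recovery:

```python
def conceal(decoded_vops: list, lost: list) -> list:
    """Error concealment by temporal duplication: when a VOP's packets are
    lost, repeat the most recently decoded VOP in its place."""
    output, last = [], None
    for vop, is_lost in zip(decoded_vops, lost):
        if is_lost and last is not None:
            output.append(last)       # duplicate the previous VOP
        else:
            output.append(vop)
            last = vop
    return output

# Hypothetical sequence where the third VOP is lost in transit
frames = conceal(["VOP0", "VOP1", None, "VOP3"],
                 [False, False, True, False])
```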
To conclude, the MPEG-4 video standard has the potential of offering interactive
content-based video services by using VO-based coding. Transporting MPEG-4
[Figure 5.64 depicts the architecture: at the sender, raw video feeds a rate-adaptive
MPEG-4 encoder whose output passes through an RTP/UDP/IP module into the
Internet; at the receiver, an RTP/UDP/IP module feeds a QoS monitor, and the
feedback control protocol carries congestion information back to the sender.]
Figure 5.64 An end-to-end architecture for transporting MPEG-4 video [116]. (#2000 IEEE.)
video is foreseen to be an important component of many multimedia applications.
On the other hand, since the current Internet lacks QoS support, there remain
many challenging problems in transporting MPEG-4 video with satisfactory video
quality. For example, one issue is packet loss control and recovery associated
with transporting MPEG-4 video. Another issue that needs to be addressed is the
support of multicast for Internet video.
5.4.5 Broadband Access
The demand for broadband access has grown steadily as users experience the con-
venience of high-speed response combined with always on connectivity. There are a
[Figure 5.65 depicts the encapsulation at each layer: the MPEG-4 compression
layer produces elementary streams; the sync layer prepends an SL header to form
SL-packetized streams; the FlexMux layer prepends a FlexMux header to form the
FlexMux stream handed to the TransMux layer; then the RTP, UDP, and IP layers
each prepend their own headers before the packets enter the Internet.]
Figure 5.65 Data format at each processing layer at an end system [116]. (#2000 IEEE.)