Philippe Gournay, Kyle D. Anderson

VoiceAge Corporation 750, Chemin Lucerne

Montréal (Québec) Canada H3R 2H6

[email protected]

Performance Analysis of a Decoder-Based Time-Scaling Algorithm for Variable Jitter Buffering of Speech over Packet Networks

Voice over Packet Networks

• Voice communication over packet networks (VoIP) is characterized by a variable transmission time (jitter)

• VoIP receivers generally use a jitter buffer to control the effect of the jitter

• The jitter buffer works by introducing an additional playout delay

• The playout delay is chosen to minimize the number of late packets while keeping the total end-to-end delay within acceptable limits

• Packets that arrive before their playout time are temporarily stored in a reception buffer

Voice over Packet Networks (2)

Fixed transmission time (no jitter): a fixed playout delay is enough to produce a sustained flow of speech to the listener

[Figure: timing diagram of packets 0, 1, 2, …, n, n+1 at the Sender, Receiver, and Playout, with the transmission delay and the playout delay marked.]

Voice over Packet Networks (3)

Variable transmission delay (some jitter), fixed playout delay: some packets (n, n+1) arrive too late to be decoded

[Figure: timing diagram of packets 0, 1, 2, …, n-1, n, n+1 at the Sender, Receiver, and Playout, with the transmission delay and the playout delay marked.]
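The effect of a fixed playout delay on late packets can be illustrated with a minimal Python sketch. It assumes, for illustration only, that frames are sent every 20 ms and that playout of packet 0 starts one playout delay after packet 0 arrives; the function name `late_packets` and the delay values are hypothetical.

```python
def late_packets(transmission_delays_ms, playout_delay_ms, frame_ms=20.0):
    """Return the indices of packets that arrive after their playout time.

    Assumptions (illustrative only): packet n is sent at n * frame_ms, and
    playout of packet 0 starts playout_delay_ms after packet 0 arrives.
    """
    if not transmission_delays_ms:
        return []
    first_playout = transmission_delays_ms[0] + playout_delay_ms
    late = []
    for n, delay in enumerate(transmission_delays_ms):
        arrival = n * frame_ms + delay           # when packet n reaches the receiver
        playout = first_playout + n * frame_ms   # when packet n must be played
        if arrival > playout:
            late.append(n)
    return late


# Hypothetical delays: the last two packets are hit by jitter.
delays = [30.0, 30.0, 30.0, 70.0, 65.0]
print(late_packets(delays, playout_delay_ms=30.0))  # packets 3 and 4 are late
print(late_packets(delays, playout_delay_ms=50.0))  # no packet is late
```

A larger playout delay removes the late packets in this example, at the cost of a longer end-to-end delay.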

Jitter Buffering Strategies

• Fixed jitter buffer
– Playout delay chosen at the beginning of the conversation

• Variable jitter buffer
– Talk-spurt based: playout delay changed at the beginning of each silence period
– For quickly varying networks: better results are obtained when the playout delay is also adapted during active speech

Playout Delay Adaptation

1. Using past jitter values, estimate the “ideal” playout time $\hat{P}_{i+1}$ of frame number i+1.

2. Send frame number i to the decoder, requesting it to generate an output frame of length $\hat{T}_i = \hat{P}_{i+1} - P_i$.

3. The actual playout time of packet i+1 is $P_{i+1} = P_i + T_i$, where $T_i$ is the actual length of frame i. Iterate from step 1.

The playout delay for packet i is the difference between $P_i$ and the reference clock at the normal frame rate
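The three-step loop can be sketched in Python as follows. This is an illustration under assumptions, not the authors' implementation: the jitter-based estimator of step 1 is not modeled (the ideal playout times are passed in as a list), the decoder is replaced by a hypothetical `clamping_decoder` that simply limits the requested length to between 0 and 40 ms, and all function names are invented for the example.

```python
def run_playout_adaptation(ideal_playout_ms, decode_frame, first_playout_ms=0.0):
    """Iterate the loop: request a frame length, then use the actual length.

    ideal_playout_ms[i] is the ideal playout time estimated for frame i+1.
    decode_frame(i, requested_ms) returns the actual length of frame i.
    Returns the actual playout times P_0, P_1, ...
    """
    playout = [first_playout_ms]
    for i, ideal_next in enumerate(ideal_playout_ms):
        requested = ideal_next - playout[i]   # step 2: requested length of frame i
        actual = decode_frame(i, requested)   # the decoder may not honor it exactly
        playout.append(playout[i] + actual)   # step 3: actual playout time of i+1
    return playout


def clamping_decoder(frame_index, requested_ms, min_ms=0.0, max_ms=40.0):
    """Hypothetical stand-in for the decoder: clamp the requested length."""
    return min(max(requested_ms, min_ms), max_ms)


def playout_delays(playout_ms, frame_ms=20.0):
    """Playout delay of packet i: P_i minus the reference clock i * frame_ms."""
    return [p - i * frame_ms for i, p in enumerate(playout_ms)]


times = run_playout_adaptation([20.0, 40.0, 110.0, 120.0], clamping_decoder)
print(times)                  # [0.0, 20.0, 40.0, 80.0, 120.0]
print(playout_delays(times))  # [0.0, 0.0, 0.0, 20.0, 40.0]
```

In this run the jump of the ideal playout time to 110 ms cannot be absorbed by a single frame, so the playout delay grows over two frames (by 20 ms each).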

Time Scaling Inside the Decoder

… presents several advantages over “standalone” time scaling (SOLA, TDHC, …):

• Uses the decoder’s internal parameters:
– Pitch period
– In VMR-WB: voicing classification

• Regulates the processor load (esp. for shorter frames)
– Number of operations per second tends to increase as the frame length decreases
– Some complexity is saved during the synthesis operation

• Improves quality
– Smoothing performed by the synthesis filters

General Principle

• Time scaling is performed in the excitation domain

• The adaptive codebook is updated before time scaling to keep the encoder and decoder synchronized

• Frames are modified depending on their voicing classification
– In VMR-WB, voicing classification is a part of the bitstream
– Not all frames are modified
– Voiced frames can only be modified by a whole number of pitch periods

• Concealed frames can also be time-scaled

Inactive Frames

• The pseudo-random number generator used to build the excitation signal of CNG frames is simply run for the requested number of samples.

• Output frame duration limited to between 0 and 40ms (twice the standard duration)
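A minimal Python sketch of this idea follows. It is an illustration only: Python's `random.Random` stands in for the codec's own pseudo-random generator, the 16 kHz sample rate is assumed from the "640 samples = 40 ms" figure quoted in these slides (the codec's internal rate may differ), and the function name is hypothetical.

```python
import random


def cng_excitation(requested_ms, rng, sample_rate_hz=16000, max_ms=40.0):
    """Generate a noise excitation of the requested duration.

    The duration is limited to between 0 and max_ms, then the generator is
    simply run once per output sample.
    """
    duration_ms = min(max(requested_ms, 0.0), max_ms)
    n_samples = round(duration_ms * sample_rate_hz / 1000)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


rng = random.Random(0)
print(len(cng_excitation(30.0, rng)))  # 480 samples for 30 ms
print(len(cng_excitation(50.0, rng)))  # 640 samples: limited to 40 ms
```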

Unvoiced Frames

• Plosive frames and frames that are too voiced are not modified

• To lengthen frames:
– Insert zeroes between the original excitation samples
– Adjust gain to preserve average energy per sample

• To shorten frames:
– Remove samples from the excitation signal

• Frame duration limited to between 10 and 40 ms
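The two operations on unvoiced excitation can be sketched in Python as below. The slides do not say where zeroes are inserted or which samples are removed; spreading them evenly over the frame is an assumption of this sketch, and the function names are hypothetical. The gain follows from the energy constraint: if N samples become M samples, scaling by sqrt(M/N) keeps the average energy per sample unchanged.

```python
import math


def lengthen_unvoiced(excitation, target_len):
    """Insert zeroes between samples and rescale to keep energy per sample."""
    n = len(excitation)
    if n == 0 or target_len < n:
        raise ValueError("target_len must be at least the original length")
    gain = math.sqrt(target_len / n)      # preserves average energy per sample
    out = [0.0] * target_len
    for k, sample in enumerate(excitation):
        out[k * target_len // n] = gain * sample   # evenly spread (assumption)
    return out


def shorten_unvoiced(excitation, target_len):
    """Remove samples, keeping target_len evenly spaced ones (assumption)."""
    n = len(excitation)
    if target_len < 0 or target_len > n:
        raise ValueError("target_len must be between 0 and the original length")
    return [excitation[k * n // target_len] for k in range(target_len)]


frame = [1.0, -1.0, 1.0, -1.0]
longer = lengthen_unvoiced(frame, 8)
print(sum(s * s for s in frame) / len(frame))    # energy per sample before
print(sum(s * s for s in longer) / len(longer))  # approximately the same after
print(shorten_unvoiced(list(range(8)), 4))
```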

Voiced Frames

• Onset frames and frames that are not voiced enough are not modified

• To lengthen frames
– Use the long-term predictor to duplicate some pitch cycles

• To shorten frames
– Remove selected pitch cycles

• Frame duration limited to between 0 and 40 ms

To Lengthen Voiced Frames

Voiced frames are lengthened by repeating selected pitch cycles

[Figure: past and current excitation of frame (i), divided into subframes 1 to 4, with the pitch period T0 marked.]
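Repeating or removing whole pitch cycles can be sketched in Python as follows. This is a simplified illustration: the slides do not specify which cycles are selected, so the sketch works on the last cycle(s) of the frame, and it uses only the current frame rather than the decoder's past excitation buffer. The function names are hypothetical. The repetition mimics a long-term predictor with unit gain: each new sample copies the sample one pitch period earlier.

```python
def lengthen_voiced(excitation, pitch_period, n_cycles=1):
    """Append n_cycles pitch cycles by copying samples one pitch period back."""
    if pitch_period <= 0 or pitch_period > len(excitation):
        raise ValueError("pitch_period must fit inside the excitation")
    out = list(excitation)
    for _ in range(n_cycles * pitch_period):
        out.append(out[-pitch_period])   # long-term prediction with unit gain
    return out


def shorten_voiced(excitation, pitch_period, n_cycles=1):
    """Remove the last n_cycles pitch cycles (choice of cycles is an assumption)."""
    n_remove = n_cycles * pitch_period
    if pitch_period <= 0 or n_remove > len(excitation):
        raise ValueError("cannot remove more samples than the frame contains")
    return list(excitation[:len(excitation) - n_remove])


frame = [0, 1, 2, 3, 4, 5, 6, 7]           # two cycles of a 4-sample pitch period
print(lengthen_voiced(frame, 4))           # [0, 1, 2, 3, 4, 5, 6, 7, 4, 5, 6, 7]
print(shorten_voiced(frame, 4))            # [0, 1, 2, 3]
```

Because the length changes only in steps of one pitch period, the resulting frame length generally differs from the requested one.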

Experimental Results

• Experiments conducted on clean speech using mode 2 of the VMR-WB codec (Average Data Rate of 4.96 kbits/s with 60% active speech)

• Subjective quality:

– Excellent for “fast playback” (up to twice the normal speed)

– Small degradation for “slow playback” (down to half the normal speed)

– Very efficient at adding or removing a few ms (e.g. 20ms) to the playout delay from time to time

– Much better than losing a few frames…

Distribution of Unmodified Frames

Required frame length (40 ms) is twice the standard frame length

Total number of frames: 22803

Number of unmodified frames: 4085 (18%)

Distribution of unmodified frames:

1. Voiced frames
– Onsets: 814 (20% of unmodified)
– Not voiced enough: 935 (23%)

2. Unvoiced frames
– Plosive: 1850 (45%)
– Too voiced: 486 (12%)
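The percentages can be re-derived from the frame counts with a few lines of Python. The counts are taken from the slide; the variable names are only illustrative.

```python
total_frames = 22803
unmodified = {
    "voiced: onsets": 814,
    "voiced: not voiced enough": 935,
    "unvoiced: plosive": 1850,
    "unvoiced: too voiced": 486,
}

n_unmodified = sum(unmodified.values())
share_of_total = round(100 * n_unmodified / total_frames)
shares = {name: round(100 * count / n_unmodified)
          for name, count in unmodified.items()}

print(n_unmodified, share_of_total)  # 4085 18
print(shares)
```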

Frame Length Distribution for Modified Voiced Frames

Desired frame length was 640 samples = 40 ms

[Figure: histogram. Horizontal axis: modified frame length (samples), 480 to 640. Vertical axis: percentage of frames, 0 to 12.]

Time Required for a 50-ms Increase of the Playout Delay

[Figure: histogram. Horizontal axis: adaptation time (ms), in bins labeled from 60-70 to …-290. Vertical axis: percentage of cases, 0 to 50.]

Experiment done for 8000 different active speech frames

Maximum Complexity and Corresponding Frame Length

Requested frame length       Standard (20 ms)   Longer (40 ms)   Shorter (10 ms)   Shorter* (10 ms)
Maximum complexity           6.68 WMOPS         6.69 WMOPS       33.90 WMOPS       9.57 WMOPS
Corresponding frame length   20 ms              20 ms            1.875 ms          10 ms

*: Modified voiced frames not allowed to be less than 10 ms
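One way to read these figures: WMOPS is a rate, so the same work per frame costs more operations per second when the frame it produces is shorter. The Python sketch below computes what the rate would be if the per-frame cost stayed exactly at its standard value. That constant-cost assumption is a simplification made for this example, not a claim from the slides; the measured values are lower, which is consistent with some work being saved on shortened frames.

```python
import math


def wmops_if_cost_per_frame_constant(standard_wmops, standard_ms, output_ms):
    """Rate obtained when the per-frame work is unchanged but the output
    frame lasts output_ms instead of standard_ms (simplifying assumption)."""
    return standard_wmops * standard_ms / output_ms


# Values from the table: 6.68 WMOPS for a standard 20 ms frame.
at_10_ms = wmops_if_cost_per_frame_constant(6.68, 20.0, 10.0)
at_1_875_ms = wmops_if_cost_per_frame_constant(6.68, 20.0, 1.875)

print(round(at_10_ms, 2))     # 13.36, versus 9.57 WMOPS measured
print(round(at_1_875_ms, 2))  # 71.25, versus 33.90 WMOPS measured
```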

Optimize the Complexity

• Lengthening does not increase complexity

• Shortening CNG frames does not increase complexity

• Shortening active speech frames increases complexity

• Good compromise:
– Increase the playout delay as soon as it is necessary
– Decrease the playout delay during inactive periods

Playout delay adaptation requires no additional complexity
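The compromise can be expressed as a small Python policy function. This is a sketch under assumptions: the function name, the `is_active` flag, and the choice to fall back to the standard 20 ms length when shortening is deferred are illustrative and not taken from the slides.

```python
def allowed_frame_length(requested_ms, is_active, standard_ms=20.0, max_ms=40.0):
    """Apply the compromise: lengthen at any time, shorten only when inactive."""
    if requested_ms >= standard_ms:
        # Lengthening does not increase complexity: allow it immediately.
        return min(requested_ms, max_ms)
    if is_active:
        # Shortening active speech is costly: defer it (assumed fallback).
        return standard_ms
    # Shortening inactive (CNG) frames does not increase complexity.
    return max(requested_ms, 0.0)


print(allowed_frame_length(30.0, is_active=True))   # 30.0
print(allowed_frame_length(10.0, is_active=True))   # 20.0
print(allowed_frame_length(10.0, is_active=False))  # 10.0
```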

Audio Demonstration

Male voice Female voice

Original

VMR-WB mode 2

Fast playback (200% of normal speed)

Slow playback (66% of normal speed)

Adaptation: -20ms every 10 frames

Adaptation: +20ms every 10 frames

Summary

• Adaptive jitter buffering requires a means of time scaling of speech

• Time scaling can be done in the decoder’s “excitation domain”

• This approach is very efficient in terms of both quality and reactivity

• It adds almost no complexity, provided that suitable limits are imposed on the amount of time scaling
