philippe gournay, kyle. d. anderson voiceage corporation 750, chemin lucerne
DESCRIPTION
Speech over Packet Networks Variable Jitter Buffering Decoder-Based Time-Scaling Performance Analysis. Performance Analysis of a Decoder-Based Time-Scaling Algorithm for Variable Jitter Buffering of Speech over Packet Networks. Philippe Gournay, Kyle. D. Anderson VoiceAge Corporation - PowerPoint PPT PresentationTRANSCRIPT
Philippe Gournay, Kyle. D. Anderson
VoiceAge Corporation 750, Chemin Lucerne
Montréal (Québec) Canada H3R 2H6
Performance Analysis of aDecoder-Based Time-Scaling Algorithm
for Variable Jitter Buffering ofSpeech over Packet Networks
Speech over Packet NetworksVariable Jitter Buffering
Decoder-Based Time-ScalingPerformance Analysis
Voice over Packet Networks
• Voice communications over packet networks (VoIP) is characterized by a variable transmission time (jitter)
• VoIP receivers generally use a jitter buffer to control the effect of the jitter
• The jitter buffer works by introducing an additional playout delay
• The playout delay is chosen to minimize the number of late packets while keeping the total end-to-end delay within acceptable limits
• Packets that arrive before their playout time are temporarily stored in a reception buffer
Voice over Packet Networks (2)Fixed transmission time (no jitter)
a fixed playout delay is enough to produce asustained flow of speech to the listener
Transmission delay
0 1 2
0 1 2 n n+1
0 1 2 n n+1Sender
Receiver
Playout
Playoutdelay
Voice over Packet Networks (3)Variable transmission delay (some jitter), fixed playout delay
some packets (n, n+1) arrive too late to be decoded
Transmission delay
0 1 2
0 1 2
Playoutdelay
0 1 2 n n+1
n-1 n n+1
Sender
Receiver
Playout
Jitter Buffering Strategies
• Fixed jitter buffer– Playout delay chosen at the beginning of the
conversation
• Variable Jitter buffer– Talk-spurt based: playout delay changed at
the beginning of each silence period– For quickly varying networks: better results
are obtained when the playout delay is also adapted during active speech
Playout Delay Adaptation
1. Using past jitter values, estimate the “ideal” playout time
Pi+1 of frame number i+1.
2. Send frame number i to the decoder, requesting it to
generate an output frame of length Ti=Pi+1-Pi.
3. The actual playout time of packet i+1 is Pi+1=Pi+Ti, where
Ti is the actual length of frame i. Iterate from step 1.
The playout delay for packet i is the difference between Pi and the reference clock at the normal frame rate
^
^^
Time Scaling Inside the Decoder
• Uses the decoder’s internal parameters:– Pitch period– In VMR-WB: voicing classification
• Regulates the processor load (esp. for shorter frames)– Number of operations per second tends to increase
as the frame length decreases– Some complexity is saved during the synthesis
operation
• Improves quality– smoothing performed by the synthesis filters
… presents several advantages over “standalone” time scaling (SOLA, TDHC, …) :
General Principle
• Time scaling is performed in the excitation domain• The adaptive codebook is updated before time scaling to
keep the encoder and decoder synchronized• Frames are modified depending on their voicing
classification– In VMR-WB, voicing classification is a part of the
bitstream– Not all frames are modified– Voiced frames can only be modified by a multiple
number of pitch periods• Concealed frames can also be time-scaled
Inactive Frames
• The pseudo-random number generator used to build the excitation signal of CNG frames is simply run for the requested number of samples.
• Output frame duration limited to between 0 and 40ms (twice the standard duration)
Unvoiced Frames
• Plosive frames and frames that are too voiced are not modified
• To lengthen frames:– Insert zeroes between the original excitation samples– Adjust gain to preserve average energy per sample
• To shorten frames:– Remove samples from the excitation signal
• Frame duration limited to between 10 and 40ms
Voiced Frames
• Onset frames and frames that are not voiced enough are not modified
• To lengthen frames– Use the long-term predictor to duplicate some pitch
cycles
• To shorten frames– Remove selected pitch cycles
• Frame duration limited to between 0 and 40ms
To Lengthen Voiced FramesVoiced frames are lengthened by repeating selected pitch cycles
Past Excitation Current Excitation
T0
Subframes
(i)
1 2 3 4
1 2 3 4
Experimental Results
• Experiments conducted on clean speech using mode 2 of the VMR-WB codec (Average Data Rate of 4.96 kbits/s with 60% active speech)
• Subjective quality:
– Excellent for “fast playback” (up to twice the normal speed)
– Small degradation for “slow playback” (down to half the normal speed)
– Very efficient at adding or removing a few ms (e.g. 20ms) to the playout delay from time to time
– Much better than losing a few frames…
Distribution of Unmodified Frames
Required frame length (40ms) is twice the standard frame length
Total number of frames: 22803Number of unmodified frames: 4085 (18%)
Distribution of unmodified frames:
1. Voiced framesOnsets: 814 (20% of unmodified)Not voiced enough: 935 (23%)
2. Unvoiced framesPlosive: 1850 (45%)Too voiced: 486 (12%)
Frame Length Distribution for Modified Voiced Frames
Desired frame length was 640 samples = 40 ms
0
2
4
6
8
10
12
480
500
520
540
560
580
600
620
640
Modified frame length (samples)
Per
cent
age
of f
ram
es
Time Required for a 50-ms Increase of the Playout Delay
0
10
20
30
40
5060-70
80-90
…-110
…-130
…-150
…-170
…-190
…-210
…-230
…-250
…-270
…-290
Adaptation Time (ms)
Per
cent
age
of c
ases
Experiment done for 8000 different active speech frames
Maximum Complexity and Corresponding Frame Length
Requested
Frame LengthStandard (20ms)
Longer (40ms)
Shorter (10ms)
Shorter* (10ms)
Maximum
Complexity6.68
WMOPS6.69
WMOPS33.90
WMOPS9.57
WMOPS
Corresponding
Frame Length20ms 20ms 1.875ms 10ms
*: Modified voiced frames not allowed to be less than 10 ms
Optimize the Complexity
• Lengthening does not increase complexity• Shortening CNG frames does not increase
complexity• Shortening active speech frames increases
complexity• Good compromise:
– Increase the playout delay as soon as it is necessary– Decrease the playout delay during inactive periods
Playout delay adaptation requires no additional complexity
Audio Demonstration
Male voice Female voice
Original
VMR-WB mode 2
Fast playback (200% of normal speed)
Slow playback (66% of normal speed)
Adaptation: -20ms every 10 frames
Adaptation: +20ms every 10 frames
Summary
• Adaptive jitter buffering requires a means of time scaling of speech
• Time scaling can be done in the decoder’s “excitation domain”
• This approach is very efficient in terms of both quality and reactivity
• It requires almost no complexity providing some clever limitations are imposed on the amount of time scaling