ECE6255 Digital Processing of Speech Signals: Lectures 30-31

Center of Signal and Image Processing Georgia Institute of Technology ECE6255 Digital Processing of Speech Signals Chin-Hui Lee School of Electrical and Computer Engineering Georgia Institute of Technology Atlanta, GA 30332, USA [email protected] Lectures 30-31: Speech Coding (Part I & II, Characterization & Quantization)


ECE6255 Spring 2010
• Production (phoneme) level
– 32-64 phonemes per language => 6 bits/phoneme
– Information rate = 60-90 bps at the source
• Waveform level
– Speech bandwidth from 4-10 kHz => sampling rate from 8-20 kHz
– Need 12-16 bit quantization for high-quality digital coding
– Information rate = 96-320 kbps => more than 3 orders of magnitude difference in information rates between the production and waveform levels
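The three-orders-of-magnitude gap follows from simple arithmetic; a small sketch (the speaking rate of 10-15 phonemes per second is my assumption, chosen to match the 60-90 bps figure, and the function names are mine):

```python
# Back-of-the-envelope information rates at the two levels; the
# phoneme rate of 10-15 per second is an assumed speaking rate.

def production_rate_bps(phonemes_per_sec, bits_per_phoneme=6):
    # 32-64 phonemes per language => 6 bits suffice per phoneme
    return phonemes_per_sec * bits_per_phoneme

def waveform_rate_bps(fs_hz, bits_per_sample):
    # bit rate of the quantized sampled waveform
    return fs_hz * bits_per_sample

print(production_rate_bps(10), production_rate_bps(15))           # 60 90
print(waveform_rate_bps(8000, 12), waveform_rate_bps(20000, 16))  # 96000 320000
```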
Principles of Audio Compression
• Human Source (Speaking)
[Block diagram: source → encoder → digital channel or storage → decoder]
Signal Compression Applications
[Chart: applications arranged by bit rate, from secure voice at a few kbits/sec to image/video at Mbits/sec]
Telephony Bandwidth & Bit Rates
– 8 kHz sampling rate, 8-12 bit resolution
• Bit Rates
• Standards
– Military standards (US DoD)
[Chart: speech quality (poor/fair/good/excellent) vs. bit rate (2-128 kb/s), from A/μ-law PCM (1972) to ADPCM and later coders]
• Subjective evaluation:
– A-B comparison tests
– Intelligibility tests (DRT)
• Objective evaluation:
– Many possibilities
Speech Coder Classes
[Chart: speech quality vs. bit rate (1-64 kb/s) for the different speech coder classes]
Introduction to Waveform Coding
[Diagram: sampling the analog waveform xa(t) every T seconds]
Introduction to Waveform Coding
• To perfectly recover xa(t) (or equivalently a lowpass filtered version of it) from the set of digital samples (as yet unquantized), we require that Fs = 1/T > twice the highest frequency in the input signal
• This implies that xa(t) must first be lowpass filtered, since speech is not inherently lowpass
– for telephone bandwidth, the frequency range of interest is 200-3200 Hz (filtering range) => Fs = 6400 Hz, 8000 Hz
– for wideband speech, the frequency range of interest is 100-7000 Hz (filtering range) => Fs = 16000 Hz
• T = 1/Fs
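The need for the lowpass (anti-aliasing) filter before sampling can be demonstrated numerically: a tone above Fs/2 is indistinguishable, after sampling, from a tone folded back below Fs/2. A small sketch (illustrative, not from the lecture):

```python
import math

# Aliasing demo: a 5 kHz tone sampled at 8 kHz (Fs/2 = 4 kHz) produces
# exactly the same samples as a 3 kHz tone, since
# cos(2*pi*5000*k/8000) = cos(2*pi*k*(1 - 3/8)) = cos(2*pi*k*3/8).

Fs = 8000.0
n = range(16)
tone_5k = [math.cos(2 * math.pi * 5000 * k / Fs) for k in n]
tone_3k = [math.cos(2 * math.pi * 3000 * k / Fs) for k in n]

assert all(abs(a - b) < 1e-9 for a, b in zip(tone_5k, tone_3k))
```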
• Notice the high-frequency components of vowels and fricatives (up to 10 kHz) => need Fs > 20 kHz without lowpass filtering
• Need only about 4 kHz to estimate formant frequencies
• Need only about 3.2 kHz for telephone speech coding
Speech Spectra
Telephone Channel Response
It is known that 4 kHz bandwidth is sufficient for most
applications using telephone speech because of inherent
channel band limitations from the transmission path
Statistical Model of Speech
random process
• Then x(n) (derived from xa(t) by sampling) is also a
sample sequence of a discrete-time random process
• xa(t) has first order probability density, p(x), with
autocorrelation and power spectrum of the form
( ) ( ) ( )
( ) ( )
e d
Speech Probability Density Function
• The probability density function for x(n) is the same as for xa(t), since x(n) = xa(nT) => the mean and variance are the same for both x(n) and xa(t)
• A good fit to densities measured from speech waveforms is the gamma density:

$p(x) = \left(\frac{\sqrt{3}}{8\pi\sigma_x |x|}\right)^{1/2} e^{-\sqrt{3}\,|x|/(2\sigma_x)}$

– A simpler approximation is the Laplacian density, of the form:

$p(x) = \frac{1}{\sqrt{2}\,\sigma_x}\, e^{-\sqrt{2}\,|x|/\sigma_x}$
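The Laplacian form can be sanity-checked numerically: it should integrate to 1 and have variance σx². A small sketch (my own check, not lecture code):

```python
import math

# Numerical check of the Laplacian approximation
# p(x) = (1 / (sqrt(2) * sigma)) * exp(-sqrt(2) * |x| / sigma):
# total probability should be 1 and the variance should equal sigma^2.

def laplacian_pdf(x, sigma=1.0):
    return math.exp(-math.sqrt(2.0) * abs(x) / sigma) / (math.sqrt(2.0) * sigma)

dx = 0.001
xs = [-20.0 + i * dx for i in range(int(40.0 / dx) + 1)]
area = sum(laplacian_pdf(x) * dx for x in xs)       # ~1.0
var = sum(x * x * laplacian_pdf(x) * dx for x in xs)  # ~1.0 for sigma = 1

assert abs(area - 1.0) < 1e-3
assert abs(var - 1.0) < 1e-2
```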
[Figure: measured speech density compared with gamma and Laplacian fits (zero mean, σx = 1), as used in analytical studies]
Statistical Model of Speech
• x(n) has autocorrelation $\phi(m) = E\{x(n)\,x(n+m)\} = \phi_a(mT)$
• Since $\phi(m)$ is a sampled version of $\phi_a(\tau)$, its transform is an infinite sum of copies of the power spectrum, shifted by $2\pi/T$, i.e., an aliased version of the power spectrum of the original analog signal:

$\Phi(e^{j\omega}) = \frac{1}{T} \sum_{k=-\infty}^{\infty} \Phi_a\!\left(\frac{\omega + 2\pi k}{T}\right)$


$\phi_a(\tau) = E\{x_a(t)\,x_a(t+\tau)\}, \qquad \Phi_a(\Omega) = \int_{-\infty}^{\infty}\phi_a(\tau)\,e^{-j\Omega\tau}\,d\tau$

$\phi(m) = E\{x(n)\,x(n+m)\} = E\{x_a(nT)\,x_a(nT+mT)\} = \phi_a(mT)$

$\Phi(e^{j\omega}) = \sum_{m=-\infty}^{\infty} \phi(m)\,e^{-j\omega m}$

• In practice, the power spectrum must be estimated from finite windows:

$\hat{\Phi}(e^{j\omega}) = \sum_{m=-M}^{M} \hat{\phi}(m)\,w(m)\,e^{-j\omega m}$
• The periodogram provides an estimate of the power spectrum of a random signal:

$V(e^{j\omega}) = \sum_{n=0}^{L-1} x(n)\,w(n)\,e^{-j\omega n} = \mathrm{DTFT}\{x(n)\,w(n)\}$

$I(e^{j\omega}) = \frac{1}{LU}\left|V(e^{j\omega})\right|^2, \qquad U = \frac{1}{L}\sum_{n=0}^{L-1} w^2(n)$
90,000-Point FFT – Male Speaker
• Averaging periodograms of overlapping windowed segments (Welch's method):

$x_r(n) = x(rR+n)\,w(n), \qquad 0 \le n \le L-1$

$\hat{\Phi}(e^{j\omega}) = \frac{1}{K}\sum_{r=0}^{K-1} I_r(e^{j\omega})$

with $R = L/2$, i.e., 50% window overlap
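The averaged-periodogram estimate above can be sketched in a few lines of plain Python (a single-frequency DTFT rather than an FFT, for clarity; all names here are mine):

```python
import cmath
import math

# Welch-style averaged periodogram with 50% overlap (R = L/2),
# following the equations above. Illustrative sketch only.

def hann(L):
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (L - 1)) for n in range(L)]

def periodogram(seg, w, omega):
    # I(e^{jw}) = |V(e^{jw})|^2 / (L U),  U = (1/L) sum w^2(n)
    L = len(seg)
    U = sum(c * c for c in w) / L
    V = sum(seg[n] * w[n] * cmath.exp(-1j * omega * n) for n in range(L))
    return abs(V) ** 2 / (L * U)

def welch_estimate(x, L, omega):
    w, R = hann(L), L // 2   # 50% window overlap
    segs = [x[r:r + L] for r in range(0, len(x) - L + 1, R)]
    return sum(periodogram(s, w, omega) for s in segs) / len(segs)

# a tone at omega = 0.5 rad/sample should dominate the estimate there
x = [math.cos(0.5 * n) for n in range(512)]
assert welch_estimate(x, 64, 0.5) > 10 * welch_estimate(x, 64, 2.0)
```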
Long Time Average Spectrum
• Sampling: represent the bandlimited signal by samples taken at a rate at or above the Nyquist rate
• Quantization: represent each sample by a finite number of levels in amplitude
Coding is a two-stage process:
1. quantization process: $x(n) \rightarrow \hat{x}(n)$
2. encoding process: $\hat{x}(n) \rightarrow c(n)$
Decoding is a single-stage process:
1. decoding: $c(n) \rightarrow \hat{x}(n)$
Since $\hat{x}(n) \ne x(n)$ in general, coding and quantization lose information
• B bits per sample => $2^B$ quantization levels
• Information rate of the coder: $I = B \cdot F_s$ = total bit rate in bits/second
• The goal of waveform coding is to get the highest quality at a fixed value of I (kbps), or equivalently to get the lowest value of I for a fixed quality
• Since Fs is fixed, we need the most efficient quantization methods to minimize I
• For speech amplitudes (approximately Laplacian), about 0.35% of the samples fall outside the range $-4\sigma_x \le x(n) \le 4\sigma_x$ => large quantization errors occur for only 0.35% of the samples
Quantization Process
Quantization => dividing the amplitude range into a finite set of ranges, and assigning the same reconstruction level to all samples falling in a given range:

$x_0 < x(n) \le x_1 \;\Rightarrow\; \hat{x}(n) = \hat{x}_1$
$x_1 < x(n) \le x_2 \;\Rightarrow\; \hat{x}(n) = \hat{x}_2$
$x_2 < x(n) \le x_3 \;\Rightarrow\; \hat{x}(n) = \hat{x}_3$

• There are many choices that can be made (and bad choices)
• 3-bit quantizer => 8 levels
• The quantized signal can easily be processed digitally
• With decision levels $x_i$ and reconstruction levels $\hat{x}_i$:

$x_{i-1} < x(n) \le x_i \;\Rightarrow\; \hat{x}(n) = \hat{x}_i$
Mid-Riser and Mid-Tread Quantizers
• Mid-riser
– origin (x=0) in the middle of a rising part of the staircase
– same number of positive and negative levels
– symmetrical around the origin
• Mid-tread
– origin in the middle of a flat part (tread) of the staircase
– one more negative level than positive
– one quantization level of 0 (where a lot of activity occurs)
• Code words have direct numerical significance (sign-magnitude representation for mid-riser, two's complement for mid-tread)
– for the mid-riser quantizer, the sign of $\hat{x}(n)$ is determined by the first bit of c(n) (negative if the first bit of c(n) is 1)
– for the mid-tread quantizer, code words are a 3-bit two's complement representation, giving levels from $-4\Delta$ to $+3\Delta$
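A 3-bit mid-tread quantizer can be sketched directly from these properties (the function and variable names are mine; the −4Δ to +3Δ range follows from 3-bit two's complement):

```python
# Sketch of a B-bit mid-tread uniform quantizer: step Δ = 2*Xmax / 2^B,
# levels are integer multiples of Δ from -2^(B-1)Δ to (2^(B-1)-1)Δ,
# i.e., one more negative level than positive, with a level at 0.

def midtread_quantize(x, B=3, xmax=1.0):
    delta = 2.0 * xmax / (2 ** B)
    code = round(x / delta)                                   # nearest tread
    code = max(-(2 ** (B - 1)), min(2 ** (B - 1) - 1, code))  # clip codes
    return code * delta, code       # (reconstruction level, signed code)

xhat, code = midtread_quantize(0.0)
assert code == 0 and xhat == 0.0                 # zero maps to a 0 level
assert midtread_quantize(1.0)[1] == 3            # overload clips to +3Δ
assert midtread_quantize(-1.0)[1] == -4          # ... and down to -4Δ
```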
A-to-D and D-to-A Conversion
[Block diagram: x[n] → quantizer → x̂B[n]; quantization error e[n] = x̂B[n] − x[n]]
[Figure: unquantized sinewave; 3-bit quantized waveform; 3-bit quantization error; 8-bit quantization error]
• Quantization step size Δ
• If $|x(n)| \le X_{max}$ and x(n) has a symmetric density, then $\Delta \cdot 2^B = 2X_{max}$, i.e., $\Delta = 2X_{max}/2^B$
• With x(n) the unquantized speech sample, and e(n) the quantization error (noise), then

$-\frac{\Delta}{2} < e(n) \le \frac{\Delta}{2}$

(except for samples which can exceed $X_{max}$, for which the error can exceed Δ/2)
Assumptions about the quantization noise:
• e(n) is a stationary white noise process
• e(n) is uncorrelated with the signal x(n)
• e(n) is uniformly distributed over each quantization interval:

$p_e(e) = 1/\Delta, \quad -\Delta/2 < e \le \Delta/2; \qquad p_e(e) = 0$ otherwise
3-Bit Speech Quantization Error
5-Bit Speech Quantization Error
Histograms of Quantization Noise
Spectra of Quantization Noise
Correlation of Quantization Error
Assumptions look good at 8-bit quantization, not as good at the 3-bit level (however, only about 6 dB variation in power spectrum level)
Quantized to 10 bits/sample
Quantized to 5 bits/sample

• Under these assumptions, the noise variance and SNR of a B-bit uniform quantizer are:

$\sigma_e^2 = \frac{\Delta^2}{12} = \frac{X_{max}^2}{3 \cdot 2^{2B}}, \qquad \mathrm{SNR} = \frac{\sigma_x^2}{\sigma_e^2} = \frac{3 \cdot 2^{2B}\,\sigma_x^2}{X_{max}^2}$

• In dB: $\mathrm{SNR(dB)} = 6B + 4.77 - 20\log_{10}\!\left(\frac{X_{max}}{\sigma_x}\right)$
• if we choose $X_{max} = 4\sigma_x$, then $\mathrm{SNR(dB)} = 6B - 7.2$
• each halving of the signal level relative to $X_{max}$ costs 6 dB, so only signals whose amplitude is matched to the quantizer range are accurately represented
[Figure: SNR vs. input signal amplitude for 5- and 6-bit uniform quantizers]
• Error is uniformly distributed over the interval: $(-\Delta/2) < e[n] \le (\Delta/2)$
• Error is stationary white noise: $\Phi_{ee}(e^{j\omega}) = \sigma_e^2 = \frac{\Delta^2}{12}, \quad |\omega| \le \pi$
• Error is uncorrelated with the signal: $E\{x[n]\,e[n]\} = E\{x[n]\}\,E\{e[n]\} = 0$
• Signal variance: $E\{(x[n])^2\} = \sigma_x^2$
Review of Quantization Assumptions
• Input signal fluctuates in a complicated manner so a statistical model is valid
• Quantization step size is small enough to remove any signal correlated patterns in quantization error
• Range of quantizer matches peak-to-peak range of signal, utilizing full quantizer range with essentially no clipping
• For a uniform quantizer with a peak-to-peak range of ±4σx, the resulting SNR(dB) is 6B − 7.2
Uniform Quantizer SNR Issues
• To get an SNR of at least 30 dB, we need B ≥ 6 bits (assuming $X_{max} = 4\sigma_x$, so that SNR = 6B − 7.2)
– This assumption is weak across speakers and transmission environments, since σx varies so much (on the order of 40 dB) across sounds, speakers, and input conditions
– SNR(dB) predictions can be off by significant amounts if the full quantizer range is not used, e.g., for unvoiced segments => real communication systems need more than 6 bits, more like 11-12 bits
– Need a quantizing system where the SNR is independent of signal level => constant percentage error rather than constant variance error => need non-uniform quantization
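The 6B − 7.2 dB rule can be verified empirically. A small simulation sketch (my own test harness, using Gaussian samples so that overload beyond 4σx is negligible):

```python
import math
import random

# Empirical check of SNR(dB) ~ 6B - 7.2 for a uniform mid-tread
# quantizer with Xmax = 4*sigma_x. Illustrative sketch only.

random.seed(0)
B = 8
sigma_x, xmax = 1.0, 4.0
delta = 2.0 * xmax / (2 ** B)

x = [random.gauss(0.0, sigma_x) for _ in range(200_000)]
# quantize to the nearest level, clipping to the available level range
xhat = [max(-xmax, min(xmax - delta, delta * round(v / delta))) for v in x]
e = [q - v for q, v in zip(xhat, x)]

snr_db = 10.0 * math.log10(sum(v * v for v in x) / sum(q * q for q in e))
print(round(snr_db, 1), "vs theory", 6 * B - 7.2)
```

For B = 8 the measured SNR comes out within about 1 dB of the 40.8 dB the formula predicts; rerunning with smaller σx (fixed Δ) shows the 6 dB loss per halving of the signal level.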
• For constant percentage error (rather than constant variance error), we need logarithmically spaced quantization levels
– Quantize the logarithm of the input signal rather than the input signal itself
• Quantize the log magnitude: $y(n) = \log|x(n)|, \qquad \hat{y}(n) = Q[\,y(n)\,]$
• Restore the sign on reconstruction: $\hat{x}(n) = \exp(\hat{y}(n)) \cdot \mathrm{sign}[x(n)]$
$\hat{x}(n) = \exp(\hat{y}(n))\,\mathrm{sign}[x(n)], \qquad \hat{y}(n) = y(n) + \epsilon(n)$

$\hat{x}(n) = x(n)\,e^{\epsilon(n)} \approx x(n)\,[1 + \epsilon(n)] = x(n) + f(n), \qquad f(n) = \epsilon(n)\,x(n)$

• the quantization error $f(n)$ is proportional to the signal, so $\mathrm{SNR} = \sigma_x^2/(\sigma_\epsilon^2\,\sigma_x^2) = 1/\sigma_\epsilon^2$, independent of the signal level
• Unfortunately true logarithmic compression is not practical, since the
dynamic range (ratio between the largest and smallest values) is infinite
=> need an infinite number of quantization levels
• Need an approximation to logarithmic compression => μ-law/A-law
compression
• μ-law compression characteristic:

$y(n) = F[x(n)] = X_{max}\,\frac{\log\!\left[1 + \mu\,\dfrac{|x(n)|}{X_{max}}\right]}{\log(1+\mu)}\,\mathrm{sign}[x(n)]$
Histogram for μ-Law Companding
Georgia Institute of Technology
60 ECE6255 Spring 2010
μ-Law Approximation to log
[Figure: approximation error $|y(n)| - \log|x(n)|$ for various values of μ]
SNR for μ-law Quantizer
$\mathrm{SNR(dB)} = 6B + 4.77 - 20\log_{10}[\ln(1+\mu)] - 10\log_{10}\!\left[1 + \left(\frac{X_{max}}{\mu\,\sigma_x}\right)^2 + \sqrt{2}\,\frac{X_{max}}{\mu\,\sigma_x}\right]$

• still improves by 6 dB per bit (B), but with a much weaker dependence on $X_{max}/\sigma_x$
• the μ-law quantizer has been used in wireline telephony for more than 40 years
Comparison of Linear and μ-Law Quantizers
• One can see in these plots that $X_{max}$ characterizes the quantizer (it specifies the 'overload' amplitude), and $\sigma_x$ characterizes the signal (it specifies the signal amplitude), with the ratio $X_{max}/\sigma_x$ showing how well the signal is matched to the quantizer
[Figure: SNR vs. $X_{max}/\sigma_x$ for linear and μ-law quantizers, B = 5, 6, 7]
Analysis of μ-Law Performance
• Curves show that μ-law quantization can maintain roughly the same SNR over a wide range of $X_{max}/\sigma_x$ for reasonably large values of μ
– for μ=100, SNR stays within 2 dB for 8 < Xmax/σx < 30
– for μ=500, SNR stays within 2 dB for 8 < Xmax/σx < 150
– the loss in SNR in going from μ=100 to μ=500 is about 2.6 dB => a rather small sacrifice for much greater dynamic range
– B=7 gives SNR=34 dB for μ=100 => this is 7-bit μ-law PCM, the standard for toll-quality speech coding in the PSTN; about 11 bits would be needed to achieve this dynamic range and SNR with a linear quantizer
CCITT G.711 Standard
• μ-law characteristic is approximated by 15 linear segments, with uniform quantization within a segment
– uses a mid-tread quantizer; +0 and −0 are defined
– decision and reconstruction levels are defined to be integers
• A-law characteristic is approximated by 13 linear segments
– uses a mid-riser quantizer
• Quantization of sample values is unavoidable in DSP applications such as digital transmission and storage of speech
• Quantization error can be analyzed using a random noise model
• The more bits in the number representation, the lower the noise. The 'fundamental theorem' of uniform quantization is that the signal-to-noise ratio increases by 6 dB with each added bit in the quantizer; however, if the signal level decreases while the quantizer step size stays the same, it is equivalent to throwing away one bit for each halving of the input signal amplitude
• μ-law compression can maintain a constant SNR over a wide dynamic range, thereby reducing the dependency on the signal level remaining constant
• The goal is to match the quantizer to the actual signal density to achieve optimum SNR
– μ-law tries to achieve constant SNR over a wide range of signal variances => some sacrifice in SNR relative to a quantizer whose step size is matched to the signal variance
– if σx is known, you can choose the quantizer levels to minimize the quantization error variance and maximize the SNR
Quantization for MMSE
• If we know the size of the signal (i.e., we know the signal variance $\sigma_x^2$), then we can design a quantizer to minimize the mean-squared quantization error
– the basic idea is to quantize the most probable samples with low error and the least probable with higher error
– this maximizes the SNR
– a general quantizer is defined by a set of M reconstruction levels and a set of M decision levels defining the quantization slots
– this problem was first studied by J. Max, "Quantizing for Minimum Distortion," IRE Trans. Info. Theory, March 1960


• the quantizer $\hat{x}(n) = Q[x(n)]$ has M levels; associating quantization level $\hat{x}_j$ with the interval $(x_{j-1}, x_j]$, the variance of the quantization noise is:

$\sigma_e^2 = \sum_{j=1}^{M} \int_{x_{j-1}}^{x_j} (x - \hat{x}_j)^2\, p(x)\, dx$

• for symmetric, zero-mean distributions, where large amplitudes can occur (with low probability), it makes sense to define the outermost decision boundaries as $\pm\infty$
• by definition, $e(n) = \hat{x}(n) - x(n)$; using the symmetry of the density, $p(x) = p(-x)$, with $\hat{x}_{-i} = -\hat{x}_i$ and $x_{-i} = -x_i$, we can make a change of variables to give:

$\sigma_e^2 = 2\sum_{i=1}^{M/2} \int_{x_{i-1}}^{x_i} (x - \hat{x}_i)^2\, p(x)\, dx$
Solution for Optimum Levels
• to solve for the optimum values of $\{x_i\}$ and $\{\hat{x}_i\}$, we differentiate $\sigma_e^2$ with respect to the parameters, set the derivatives to 0, and solve numerically:

$\int_{x_{i-1}}^{x_i} (x - \hat{x}_i)\, p(x)\, dx = 0, \quad i = 1, 2, \ldots, M/2 \qquad (1)$

$x_i = \frac{1}{2}(\hat{x}_i + \hat{x}_{i+1}), \quad i = 1, 2, \ldots, M/2 - 1 \qquad (2)$

with boundary conditions $x_0 = 0, \; x_{M/2} = \infty \qquad (3)$

– can also constrain the quantizer to be uniform and solve for the value of Δ that maximizes SNR
– the optimum decision boundary lies halfway between the two adjacent quantizer levels
– the optimum location of quantization level $\hat{x}_i$ is at the centroid of the probability density over the interval $x_{i-1}$ to $x_i$
– solve the above set of equations (1, 2, 3) iteratively
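The iterative solution of (1)-(3) (often called the Lloyd-Max algorithm) can be sketched for a zero-mean, unit-variance Gaussian density, for which the centroid in (1) has a closed form. This is an illustrative sketch under that assumed density, not code from the lecture:

```python
import math

# Lloyd-Max iteration for a standard Gaussian density, M = 8 levels.

def pdf(x):                  # standard normal density
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def tail(x):                 # P(X > x) for a standard normal
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def centroid(a, b):
    # E[x | a < x <= b] = (pdf(a) - pdf(b)) / (tail(a) - tail(b))
    return (pdf(a) - pdf(b)) / (tail(a) - tail(b))

def lloyd_max(M=8, iters=200):
    half = M // 2
    bounds = [4.0 * i / half for i in range(half + 1)]  # x_0 .. x_{M/2}
    bounds[0], bounds[-1] = 0.0, math.inf               # conditions (3)
    levels = []
    for _ in range(iters):
        # (1): each level moves to the centroid of p(x) over its interval
        levels = [centroid(bounds[i], bounds[i + 1]) for i in range(half)]
        # (2): each interior boundary moves halfway between adjacent levels
        for i in range(1, half):
            bounds[i] = 0.5 * (levels[i - 1] + levels[i])
    return bounds, levels

bounds, levels = lloyd_max()
print([round(v, 3) for v in levels])   # positive levels; negate for the rest
```

For M = 8 the positive levels converge to roughly 0.245, 0.756, 1.344, 2.152, matching the Gaussian entries tabulated by Max (1960).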
M. D. Paez and T. H. Glisson, Minimum Mean-Squared Error
Quantization In Speech, IEEE Trans. Comm., April 1972.
• step size decreases exponentially with increasing number of bits
• quantization levels get further apart as the probability density decreases
[Figure: optimum quantizer levels for the speech density; uniform case shown for comparison]
Performance of Optimum Quantizers
Summary on Speech Coding
• Examined a statistical model of speech showing probability densities, autocorrelations, and power spectra
• Studied the instantaneous uniform quantization model and derived SNR as a function of the number of bits in the quantizer and the ratio of signal peak to signal standard deviation (Xmax/σx)
• Studied a companded model of speech that approximates logarithmic compression, and showed that the resulting SNR is a much weaker function of the ratio of signal peak to signal standard deviation
• Examined a model for deriving quantization levels that were optimum in the sense of matching the quantizer to the actual signal density, thereby achieving optimum SNR for a given number of bits in the quantizer
• Reading assignments