1
Sound Media Engineering part II
Speech Information Processing
Akinori Ito
Graduate School of Engineering, Tohoku Univ.
2
Overview of the lecture
● #1: Production and coding of speech (1)
– Speech production, feature of speech sound
– Basic codecs: PCM, DPCM, ADPCM
● #2: Coding of speech (2)
– Linear prediction of speech: linear prediction coefficients, PARCOR coefficients and LSP
– CELP coding
– Audio coding
● #3: Speech enhancement
– Spectral subtraction
– Microphone array
3
Production of speech
● Organs that produce speech
– vocal cords
– larynx
– pharynx
– tongue
– gums
– teeth
– lips
– nasal cavity
[Figure label: vocal tract]
4
Acoustic tube model
● Human speech production is similar to wind instruments
[Figure: acoustic tube model. Labels: vocal cords, vocal tract, larynx, lips, nasal cavity. Annotations: pitch of voice; linguistic content, personality]
5
Linguistic and speaker feature
[Figure: acoustic tube model. Labels: vocal cords, vocal tract, larynx, lips, nasal cavity]
A speaker can control the shape of this part
6
Linguistic and speaker feature
[Figure: acoustic tube model. Labels: vocal cords, vocal tract, larynx, lips, nasal cavity]
A speaker cannot control the shape of this part or the total length of the vocal tract
7
Speech waveform
● Complex enough
[Figure: waveforms of the vowels /a/, /i/, /u/, /e/, /o/]
8
Speech waveform
● It is complex, but almost periodic
[Figure: speech waveform with one fundamental period marked]
Fundamental period T [s]
Fundamental frequency F0 [Hz] = 1/T
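The relation F0 = 1/T can be tried on a synthetic signal. The sketch below is only an illustration, not the method used in the lecture: it assumes that T is taken as the lag with the largest autocorrelation, and the function name `estimate_f0` and the search range 50 to 500 Hz are hypothetical choices.

```python
import math

def estimate_f0(signal, fs, f_min=50.0, f_max=500.0):
    """Estimate F0 [Hz] as fs / T, with T the autocorrelation peak lag in samples."""
    lag_min = int(fs / f_max)
    lag_max = int(fs / f_min)
    best_lag, best_value = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Autocorrelation of the block at this lag
        value = sum(signal[n] * signal[n + lag] for n in range(len(signal) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return fs / best_lag

fs = 8000
# Synthetic periodic signal: 125 Hz fundamental plus two harmonics
x = [math.sin(2 * math.pi * 125 * n / fs)
     + 0.5 * math.sin(2 * math.pi * 250 * n / fs)
     + 0.3 * math.sin(2 * math.pi * 375 * n / fs)
     for n in range(800)]
f0 = estimate_f0(x, fs)  # expected to be close to 125 Hz
```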
9
Various "a"
● Two /a/'s with different fundamental frequencies
– Same phone = same vocal tract shape
– Completely different waveforms
– What is the same between these waveforms?
10
Spectrum of speech
● Spectrum of two /a/'s
– Spectral shapes are similar → Shape of vocal tract
– "Jaggies" of the spectrum differ → Fundamental frequency
11
Spectrum and formant frequencies
● F0: fundamental frequency
● F1, F2, ...: formant frequencies
[Figure: speech spectrum with the fundamental frequency F0 and the formant frequencies F1, F2, F3, F4 marked]
12
Speech coding
● Sound (analog) → Convert to digital data
– Handle with computer
– Transmission over digital line
● How do we digitize sound?
– Goals
● Good quality when converting back to analog sound
● Lower bit-rate
– Methodology
● Exploit various features of speech
13
Basics of speech coding
● Sampling
– Observe the temporally continuous signal at discrete times
– Rate of the discrete observations: sampling frequency fs (the interval between observations is the sampling period 1/fs)
– The original signal can be restored from the sampled data when it contains only frequency components below fs/2 (Sampling Theorem)
14
Basics of speech coding
● Quantization
– Round off the magnitude of the signal to discrete levels
● The magnitude of the signal can then be represented by integers
– Spacing between the discrete levels: quantization step
– Difference between the original signal and the quantized signal: quantization error
15
Sampling and quantization: how are they determined?
● Sampling frequency is determined by the highest frequency in the sound
– Telephone: 8kHz (up to 4kHz sound)
– High-quality speech: 16kHz (up to 8kHz sound)
– CD: 44.1kHz (up to 22.05kHz sound)
● Quantization is determined by the dynamic range of the sound
– To code speech is to quantize speech
16
PCM coding
● PCM (Pulse Code Modulation)
– Represent the quantized values as binary numbers
● What has to be determined in PCM
– How many bits are used for one sample
– How to determine the levels of quantization
● Equal steps: linear quantization
● Unequal steps: nonlinear quantization
● Examples of PCM coding
– CD: 16bit linear quantization
– VoIP (G.711): 8bit nonlinear quantization
17
PCM linear quantization
● There is nothing difficult
[Figure: waveform on a linear scale from -10 to 10, with the quantized integer value written under each sample]
CD: quantize in 16bit(-32768~+32767)
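As a minimal sketch of linear quantization, the code below maps a sample in the range -1.0 to 1.0 onto the 16bit integer range quoted above. The function names and the normalization of the input to [-1, 1) are assumptions made for this example.

```python
def quantize_linear(x, bits=16):
    """Map x in [-1.0, 1.0) to a signed integer code with equal step size."""
    levels = 2 ** (bits - 1)                     # 32768 for 16 bit
    code = round(x * levels)
    return max(-levels, min(levels - 1, code))   # clip to -32768..+32767

def dequantize_linear(code, bits=16):
    """Convert the integer code back to a value in [-1.0, 1.0)."""
    return code / 2 ** (bits - 1)

code = quantize_linear(0.3)
error = abs(dequantize_linear(code) - 0.3)       # quantization error, at most half a step
```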
18
Nonlinear quantization
● Most samples are nearly zero → Total error can be reduced by finely quantizing values around zero
[Figure: two quantization scales from -10 to 10]
19
Example of nonlinear quantization: G.711
● Speech coding for 64kbit/s digital phone line
– 8kHz sampling, 8bit nonlinear quantization
– μ-Law (Japan, US), A-Law (Europe)
– μ-Law: 14bit linear quant. → 8bit nonlinear quant.
$Y = 128\,\operatorname{sign}(X)\,\dfrac{\log\left(1 + 255\,\frac{|X|}{8192}\right)}{\log 256}$
[Figure: μ-law curve. Horizontal axis: 14bit linear value (-8000 to 8000); vertical axis: 8bit μ-law value (-150 to 150)]
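The μ-law formula on this slide can be evaluated directly. The sketch below implements only the continuous formula and its inverse; the real G.711 codec uses a segmented integer approximation that is not reproduced here, and the function names are hypothetical.

```python
import math

def mu_law_compress(x):
    """Continuous mu-law curve: 14bit linear x (-8192..8192) to about -128..128."""
    sign = (x > 0) - (x < 0)
    return 128 * sign * math.log(1 + 255 * abs(x) / 8192) / math.log(256)

def mu_law_expand(y):
    """Inverse of mu_law_compress."""
    sign = (y > 0) - (y < 0)
    return sign * 8192 * (256 ** (abs(y) / 128) - 1) / 255

small = mu_law_compress(100)     # small inputs use a large share of the output range
full = mu_law_compress(8192)     # full scale maps to 128
back = mu_law_expand(mu_law_compress(1000))
```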
20
Differential PCM (DPCM)
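● In ordinary speech signal, values of two contiguous samples do not differ very much → Reduce bit-rate by transmitting the differences of samples
[Figure: DPCM block diagram: the previous sample (delay z^-1) is subtracted from the input and the difference is quantized (Q)]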
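A minimal DPCM sketch follows. It assumes a uniform quantizer with a fixed step and lets the encoder track the decoder's reconstruction, so that quantization errors do not accumulate; the step size and function names are illustrative choices, not taken from the slides.

```python
def dpcm_encode(samples, step=4):
    """Quantize the difference between each sample and the previous reconstruction."""
    codes, prev = [], 0
    for x in samples:
        code = round((x - prev) / step)   # quantized difference
        codes.append(code)
        prev += code * step               # same reconstruction the decoder will make
    return codes

def dpcm_decode(codes, step=4):
    """Accumulate the de-quantized differences."""
    out, prev = [], 0
    for code in codes:
        prev += code * step
        out.append(prev)
    return out

samples = [0, 10, 22, 30, 33, 31, 20]
codes = dpcm_encode(samples)
decoded = dpcm_decode(codes)
```

The codes are much smaller in magnitude than the samples themselves, which is where the bit-rate saving comes from.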
21
Differential PCM(DPCM)
● Original waveform
● Differential waveform
[Figure: original and differential waveforms, each plotted over samples 0 to 4500 on an amplitude axis from -25000 to 15000]
22
Adaptive Differential PCM(ADPCM)
● To enhance efficiency of DPCM
– Use more sophisticated prediction rather than simple difference
– Adaptively change quantization steps
● When the difference between two samples is large, the difference to the next sample is likely to be large too
● When the difference between two samples is small, the difference to the next sample is likely to be small too
23
Block diagram of ADPCM
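[Figure: ADPCM block diagram. The prediction signal x_e(k) is subtracted from the PCM input x(k) to give the differential signal d(k); the adaptive quantizer turns d(k) into the ADPCM output I(k); the adaptive de-quantizer gives the quantized differential signal d_q(k); adding x_e(k) gives the reconstructed signal x_r(k), which feeds the signal predictor]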
24
Calculation algorithm of ADPCM
1. Compute prediction signal: $x_e(k)$
2. Compute difference: $d(k) = x(k) - x_e(k)$
3. Quantize (ADPCM output): $I(k) = Q[d(k)]$
4. De-quantize: $d_q(k) = Q^{-1}[I(k)]$
5. Reconstruct signal: $x_r(k) = x_e(k) + d_q(k)$
6. Compute next prediction: $x_e(k+1) = \mathrm{pred}(x_r(k), d_q(k), \ldots)$
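The six steps can be sketched as a loop. This is an illustration under simplifying assumptions, not G.726: the quantizer is uniform with a fixed step, the predictor is simply the previous reconstructed sample, and the difference is taken as input minus prediction so that step 5 recovers the input. All names are hypothetical.

```python
def adpcm_encode(samples, step=8, levels=8):
    """Return (codes, reconstructed signal); codes lie in -levels..levels-1."""
    codes, recon = [], []
    x_e = 0                                   # 1. prediction signal
    for x in samples:
        d = x - x_e                           # 2. difference
        i = max(-levels, min(levels - 1, round(d / step)))  # 3. quantize
        d_q = i * step                        # 4. de-quantize
        x_r = x_e + d_q                       # 5. reconstruct
        x_e = x_r                             # 6. next prediction (previous sample)
        codes.append(i)
        recon.append(x_r)
    return codes, recon

def adpcm_decode(codes, step=8):
    """The decoder repeats steps 4 to 6 using only the transmitted codes."""
    out, x_e = [], 0
    for i in codes:
        x_r = x_e + i * step
        out.append(x_r)
        x_e = x_r
    return out

codes, recon = adpcm_encode([0, 5, 12, 30, 60, 58, 40, 10, -20])
```

Because the encoder predicts from the reconstructed signal rather than from the input, the decoder output equals the encoder's reconstruction exactly.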
25
Prediction of speech signal
● ADPCM quantizes the difference between the input signal and the predicted signal
● How to predict the signal
– DPCM: $x_e(k) = x_r(k-1)$
– A little better way: $x_e(k) = 2x_r(k-1) - x_r(k-2)$
– G.726: $x_e(k) = \sum_{i=1}^{2} a_i x_r(k-i) + \sum_{i=1}^{6} b_i d_q(k-i)$
26
Determine quantization step adaptively (example)
[Figure: two quantization scales, each with levels from -8 to 7, drawn with different step sizes]
● Observe the difference from the previous sample using the scale
● If the difference is "blue", halve the size of the next scale
● If the difference is "red", double the size of the next scale
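The halve/double rule can be written as a small function. Which levels are "blue" and which are "red" is not recoverable from the transcript, so the thresholds below are assumptions: codes near zero are treated as "blue" and codes near the ends of the -8..7 scale as "red". The limits on the step size are also illustrative.

```python
def adapt_step(step, code, small=1, large=6, min_step=1, max_step=1024):
    """Choose the quantization step for the next sample from the current code."""
    if abs(code) <= small:                 # assumed "blue": difference was small
        return max(min_step, step // 2)    # use a finer scale next time
    if abs(code) >= large:                 # assumed "red": difference was large
        return min(max_step, step * 2)     # use a coarser scale next time
    return step                            # otherwise keep the scale

step = 8
for code in [7, 7, 3, 0, 0]:
    step = adapt_step(step, code)          # 16, 32, 32, 16, 8
```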
27
For high-efficiency speech coding
● PCM, DPCM and ADPCM encode general sound signals
– DPCM and ADPCM partly exploit properties of the input signal
● Human speech is a small part of sound signals → We can enhance efficiency of coding by considering the properties of human speech
● What are the properties of human speech?
28
High-level speech coding
[Figure: levels at which speech can be coded, on the sending and receiving sides: speech, digital data, speech feature, phones, words/sentences, semantics. Coder labels: PCM coder (public phone), CELP coder (mobile phone), under research, summarizing telephone? Technology labels: AD/DA, vocoder, speech synthesis, Text-to-Speech]
29
Speech production model
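[Figure: vocal organs labeled vocal cords, larynx, vocal tract, nasal cavity, lips. The source S (vocal cords), the vocal tract T and the radiation R act in cascade to produce the speech X]
$X(\omega) = S(\omega)\,T(\omega)\,R(\omega)$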
30
Speech production model
S
31
Speech production model
S
T R
32
Modeling speech using parameters
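● Modeling speech using linear prediction (LPC)
– Spectral shape: parameters of linear prediction filter
– Vocal cord vibration: residue
$x(k) = -\sum_{i=1}^{p} a_i\, x(k-i) + e(k)$
Estimate the coefficients so as to minimize the residue
– In spectral domain
$X(\omega) = \dfrac{E(\omega)}{1 + \sum_{n=1}^{p} a_n e^{-i\omega n}} = E(\omega)\,H(\omega)$
Here $E(\omega)$ corresponds to $S(\omega)$ and $H(\omega)$ to $T(\omega)R(\omega)$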
33
Analysis and transmission of speech by LPC
● Information to be transmitted
– LP coefficients $a_i$ and residue $e(k)$
● How to transmit them?
– Estimate $a_i$ for a fixed number of samples (a block)
– Calculate $e(k)$ using the estimated $a_i$
– Transmit $a_i$ and $e(k)$ as parameters of the block
● How to restore the signal?
– Using the LPC formula
$x(k) = -\sum_{i=1}^{p} a_i\, x(k-i) + e(k)$
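The analysis and restoration steps can be checked on a short block. In this sketch the coefficients are given rather than estimated, samples before the start of the block are treated as zero, and the function names are hypothetical.

```python
def lpc_residue(x, a):
    """Analysis: e(k) = x(k) + sum_i a_i x(k-i)."""
    return [x[k] + sum(a[i - 1] * x[k - i] for i in range(1, len(a) + 1) if k - i >= 0)
            for k in range(len(x))]

def lpc_synthesize(e, a):
    """Restoration: x(k) = -sum_i a_i x(k-i) + e(k)."""
    x = []
    for k in range(len(e)):
        past = sum(a[i - 1] * x[k - i] for i in range(1, len(a) + 1) if k - i >= 0)
        x.append(e[k] - past)
    return x

a = [-0.9, 0.2]                       # example coefficients a_1, a_2
block = [1.0, 0.5, -0.3, 0.8, 0.0, -1.2]
residue = lpc_residue(block, a)
restored = lpc_synthesize(residue, a)  # equals block up to rounding
```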
34
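Estimation of LP coefficients
● How to estimate LPC from x(1)...x(k)
– Solve a simultaneous equation (Yule-Walker equation) → LPC are calculated as the least-error solution
– Faster algorithm (Levinson-Durbin algorithm)
● LPC equation
$-\begin{pmatrix} x(k-1) & x(k-2) & \cdots & x(k-p) \\ x(k-2) & x(k-3) & \cdots & x(k-p-1) \\ \vdots & \vdots & \ddots & \vdots \\ x(p) & x(p-1) & \cdots & x(1) \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{pmatrix} = \begin{pmatrix} x(k) \\ x(k-1) \\ \vdots \\ x(p+1) \end{pmatrix} - \begin{pmatrix} e(k) \\ e(k-1) \\ \vdots \\ e(p+1) \end{pmatrix}$
$-FA = V - E$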
35
Estimation of LP coefficients
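● Least square solution: minimize $|E|^2$ → minimize $|FA+V|^2$
● Equation to be solved: $F^T F A = -F^T V$
With $F^T F = (\phi_{ij})$ and $F^T V = (\phi_{0j})$:
$\begin{pmatrix} \phi_{11} & \phi_{12} & \cdots & \phi_{1p} \\ \phi_{21} & \phi_{22} & \cdots & \phi_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \phi_{p1} & \phi_{p2} & \cdots & \phi_{pp} \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{pmatrix} = -\begin{pmatrix} \phi_{01} \\ \phi_{02} \\ \vdots \\ \phi_{0p} \end{pmatrix}$ (Yule-Walker equation)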
36
Estimation of LP coefficients
● Elements in the Yule-Walker equation
$\phi_{ij} = \sum_{n=p}^{N-1} y(n-i)\,y(n-j)$
● Solve the equation directly
● Fast algorithm (autocorrelation method), N >> p
$\phi_{ij} = r(|i-j|) = \sum_{n=0}^{N-|i-j|-1} y(n)\,y(n+|i-j|)$
– Matrix is in a special form (symmetric Toeplitz matrix)
– Quick solution algorithm (Levinson-Durbin algorithm)
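A compact sketch of the autocorrelation method with the Levinson-Durbin recursion is given below. It follows the sign convention x(k) = -Σ a_i x(k-i) + e(k) and takes the highest coefficient of each order as -k_i; the function names and the tiny example autocorrelation are assumptions made for the illustration.

```python
def autocorrelation(y, p):
    """r(m) = sum_n y(n) y(n+m) for m = 0..p."""
    return [sum(y[n] * y[n + m] for n in range(len(y) - m)) for m in range(p + 1)]

def levinson_durbin(r, p):
    """Solve the Toeplitz Yule-Walker equation; returns (a_1..a_p, k_1..k_p)."""
    a = []                 # coefficients of the current order
    err = r[0]             # prediction error power
    ks = []
    for i in range(1, p + 1):
        acc = r[i] + sum(a[j - 1] * r[i - j] for j in range(1, i))
        k = acc / err                                   # PARCOR coefficient k_i
        a = [a[j - 1] - k * a[i - j - 1] for j in range(1, i)] + [-k]
        err *= 1 - k * k
        ks.append(k)
    return a, ks

r = [4.0, 2.0, 1.5]
a, ks = levinson_durbin(r, 2)
# Check against the Yule-Walker equation: sum_j r(|i-j|) a_j = -r(i)
residuals = [sum(r[abs(i - j)] * a[j - 1] for j in (1, 2)) + r[i] for i in (1, 2)]
```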
37
Analysis and transmission of speech by LPC
● Problem
– Re-synthesis by the LPC formula could be unstable → The output signal may oscillate when the $a_i$ have quantization errors
● Solution
– Transmit parameters that are equivalent to LPC and stable against quantization error
● PARCOR coefficients
● LSP coefficients
38
LPC and PARCOR coefficients
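● PARCOR (partial correlation) coefficients
$k_i = \dfrac{\sum_{n=-\infty}^{\infty} \varepsilon_f^{(i-1)}(n)\,\varepsilon_b^{(i-1)}(n)}{\sqrt{\sum_{n=-\infty}^{\infty} \left(\varepsilon_f^{(i-1)}(n)\right)^2 \sum_{n=-\infty}^{\infty} \left(\varepsilon_b^{(i-1)}(n)\right)^2}}$
Forward prediction error: $\varepsilon_f^{(i-1)}(n) = x(n) + \sum_{j=1}^{i-1} a_j^{(i-1)} x(n-j)$
Backward prediction error: $\varepsilon_b^{(i-1)}(n) = x(n-i) + \sum_{j=1}^{i-1} b_j^{(i-1)} x(n-j)$
The PARCOR coefficient is equivalent to the correlation of the forward prediction errors and the backward prediction errors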
39
PARCOR coefficients
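[Figure: samples x(n-i-1), x(n-i), x(n-i+1), ..., x(n-1), x(n), x(n+1). x(n) is predicted from the intermediate samples with coefficients a_j and x(n-i) with coefficients b_j; each prediction is subtracted from the true value, and k_i is the correlation between the two prediction errors]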
40
PARCOR and LPC
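$k_1 = \dfrac{\sum_{n=-\infty}^{\infty} \varepsilon_f^{(0)}(n)\,\varepsilon_b^{(0)}(n)}{\sqrt{\sum_{n=-\infty}^{\infty} \left(\varepsilon_f^{(0)}(n)\right)^2 \sum_{n=-\infty}^{\infty} \left(\varepsilon_b^{(0)}(n)\right)^2}}$
$\varepsilon_f^{(0)}(n) = x(n), \quad \varepsilon_b^{(0)}(n) = x(n-1)$
k1 is a correlation coefficient between x(n-1) and x(n)
As x(n-1) and x(n) have the same variance and zero mean,
$\hat{x}^{(1)}(n) = k_1 x(n-1), \quad a_1^{(1)} = -k_1$
$\hat{x}^{(1)}(n-1) = k_1 x(n), \quad b_1^{(1)} = -k_1$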
41
PARCOR and LPC
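$k_2 = \dfrac{\sum_{n=-\infty}^{\infty} \varepsilon_f^{(1)}(n)\,\varepsilon_b^{(1)}(n)}{\sqrt{\sum_{n=-\infty}^{\infty} \left(\varepsilon_f^{(1)}(n)\right)^2 \sum_{n=-\infty}^{\infty} \left(\varepsilon_b^{(1)}(n)\right)^2}}$
$\varepsilon_f^{(1)}(n) = x(n) - k_1 x(n-1)$
$\varepsilon_b^{(1)}(n) = x(n-2) - k_1 x(n-1)$
$\hat{\varepsilon}_f^{(1)}(n) = k_2\,\varepsilon_b^{(1)}(n) = -k_1 k_2\, x(n-1) + k_2\, x(n-2)$
Here, as $\varepsilon_f^{(1)}(n) = x(n) - k_1 x(n-1)$,
$\hat{x}^{(2)}(n) = k_1(1-k_2)\, x(n-1) + k_2\, x(n-2)$
$a_1^{(2)} = -k_1(1-k_2) = a_1^{(1)} - k_2 b_1^{(1)}, \quad a_2^{(2)} = -k_2$
Similarly, $b_2^{(2)} = b_1^{(1)} - k_2 a_1^{(1)}, \quad b_1^{(2)} = -k_2$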
42
PARCOR and LPC
In general,
$a_j^{(i)} = a_j^{(i-1)} - k_i\, b_j^{(i-1)}, \quad a_0^{(i-1)} = 0$
$b_j^{(i)} = b_{j-1}^{(i-1)} - k_i\, a_{j-1}^{(i-1)}, \quad b_0^{(i-1)} = 0$
We can calculate LP coefficients using this recurrence relation
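The recurrence can be sketched in a few lines. The version below assumes the autocorrelation method, where the backward coefficients are the forward ones in reverse order, b_j^(i-1) = a_(i-j)^(i-1), and where the newest coefficient of order i is -k_i as in the order-1 and order-2 cases. The function name is hypothetical.

```python
def parcor_to_lpc(ks):
    """Convert PARCOR coefficients k_1..k_p into LP coefficients a_1..a_p."""
    a = []
    for k in ks:
        i = len(a) + 1
        # a_j(i) = a_j(i-1) - k_i * a_(i-j)(i-1) for j = 1..i-1, then a_i(i) = -k_i
        a = [a[j] - k * a[i - 2 - j] for j in range(i - 1)] + [-k]
    return a

k1, k2 = 0.5, 0.25
a = parcor_to_lpc([k1, k2])   # a_1 = -k1 (1 - k2), a_2 = -k2
```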
43
LPC and LSP coefficients
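● What is LSP (Line Spectrum Pair)?
– Representation of LPC equation in z-domain:
$x(k) + \sum_{i=1}^{p} a_i\, x(k-i) = e(k) \;\Rightarrow\; X(z)\left(1 + \sum_{i=1}^{p} a_i z^{-i}\right) = E(z), \quad A_p(z) = 1 + \sum_{i=1}^{p} a_i z^{-i}$
– Decompose A(z) into P(z) and Q(z)
$P(z) = A_p(z) - z^{-(p+1)} A_p(z^{-1})$
$Q(z) = A_p(z) + z^{-(p+1)} A_p(z^{-1})$
$A_p(z) = \dfrac{P(z) + Q(z)}{2}$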
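The decomposition is easy to verify on coefficient lists. In this sketch a polynomial is stored as its coefficients of z^0, z^-1, ..., z^-(p+1); the function name is hypothetical, and finding the roots (the LSP frequencies themselves) is not attempted here.

```python
def lsp_polynomials(a):
    """Return coefficient lists of P(z) and Q(z) for A_p(z) = 1 + sum a_i z^-i."""
    A = [1.0] + list(a) + [0.0]       # A_p(z) padded to degree p+1
    mirrored = A[::-1]                # coefficients of z^-(p+1) A_p(z^-1)
    P = [x - y for x, y in zip(A, mirrored)]
    Q = [x + y for x, y in zip(A, mirrored)]
    return P, Q

a = [-0.375, -0.25]                   # example with p = 2
P, Q = lsp_polynomials(a)
recombined = [(x + y) / 2 for x, y in zip(P, Q)]   # equals A_p(z) again
p_at_1 = sum(P)                                     # P(1) = 0: root at z = 1
q_at_minus_1 = sum(c * (-1) ** n for n, c in enumerate(Q))  # Q(-1) = 0 for even p
```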
44
LSP coefficients
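● Roots of P(z)=0 and Q(z)=0 are on the frequency axis (on the unit circle in z-domain)
● P(z) and Q(z) can be written as
$P(z) = (1 - z^{-1}) \prod_{i=2,4,\ldots,p} \left(1 - 2 z^{-1}\cos\omega_i + z^{-2}\right)$
$Q(z) = (1 + z^{-1}) \prod_{i=1,3,\ldots,p-1} \left(1 - 2 z^{-1}\cos\omega_i + z^{-2}\right)$
● Roots of P(z)=0 and Q(z)=0: $z = \cos\omega_i \pm i\sin\omega_i$
● LSP: $\omega_1, \omega_2, \ldots, \omega_p$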
45
Properties of LSP
● Frequency parameter
● Calculated using numerical analysis
● Easy to determine stability: $0 < \omega_1 < \omega_2 < \cdots < \omega_p < \pi$
● More robust against quantization than PARCOR
● More computation needed than PARCOR
● Widely used in current speech codecs
46
CELP coder
● CELP (Code-Excited Linear Prediction)
– Basic coding scheme for mobile phone
– Analysis and synthesis based on LPC
– Transmit LSP coefficients and residue
47
Overview of CELP
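[Figure: CELP encoder. LPC analysis → quantization of LSP coefficients. Code vectors from the residue codebook, scaled by the gain codebook, drive LPC synthesis; the synthesized signal is subtracted from the input and an auditory weighted distance is computed; the code vector with minimum distance is selected; the selected parameters go to bitstream generation → output]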
48
Speech codings based on CELP
– LD-CELP (Low-Delay CELP)
● G.728 16kbit/s
– CS-ACELP (Conjugate Structure Algebraic CELP)
● G.729 8kbit/s
– RPE-LTP (Regular Pulse Excitation with Long Term Prediction)
● GSM standard 13kbit/s
– VSELP (Vector Sum Excitation LP)
● PDC standard 6.7kbit/s
– PSI-CELP (Pitch Synchronous Innovation CELP)
● PDC half-rate standard 3.45kbit/s
– ACELP (Algebraic CELP)
● GSM revised standard 7.4kbit/s
49
Audio coding
● Coding of general sound and music
– We cannot make assumptions (as we do for speech) about the input signal
● Higher frequency components have smaller power
– Model-based coding (as used for speech) cannot be used
● Coding based on multi-band analysis
– Split the input signal into bands from low frequency to high frequency
● Frequency analysis using filter bank / MDCT
– Change the quantization step frequency by frequency
● Coarse quantization for high frequency
● Coarse quantization if that sound is not salient
50
Basic framework of audio coding
[Figure: encoder: frequency analysis (QMF, MDCT, wavelet) → quantization per band, taking auditory properties into account (psychoacoustic analysis) → generate bitstream (entropy coding: Huffman, arithmetic). Decoder: restore bitstream → dequantization per band → convert into time domain]
51
SB-ADPCM
● 16kHz middle-quality speech coder (G.722)
– Sub-Band ADPCM
– Split the input signal into high and low bands and encode them individually using ADPCM
● Frequency splitting using quadrature mirror filter (QMF)
● ADPCM coding: 2bit for high band, 4 to 6 bit for low band (48~64kbit/s)
[Figure: encoder: QMF → two ADPCM encoders → generate bitstream. Decoder: restore bitstream → two ADPCM decoders → QMF]
52
Quadrature Mirror Filter (QMF)
● Split the input signal into high-frequency and low-frequency signals while keeping the total amount of data identical
● The original signal can be perfectly restored by combining the low- and high-frequency signals
● Example of simple QMF (Haar Wavelet)
$y_i = \dfrac{x_{2i} + x_{2i+1}}{2}, \quad z_i = \dfrac{x_{2i} - x_{2i+1}}{2}$
$x_{2i} = y_i + z_i, \quad x_{2i+1} = y_i - z_i$
[Figure: x_i → QMF split → y_i (low band) and z_i (high band) → QMF synthesis → x_i]
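The Haar example can be run directly. The sketch below assumes an input of even length; the function names are hypothetical.

```python
def qmf_split(x):
    """Split x into a low band y_i and a high band z_i, each half as long."""
    low = [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
    high = [(x[2 * i] - x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
    return low, high

def qmf_synthesis(low, high):
    """Restore x from the two bands: x_2i = y_i + z_i, x_2i+1 = y_i - z_i."""
    x = []
    for y, z in zip(low, high):
        x += [y + z, y - z]
    return x

x = [4, 2, 5, 9, -3, 1]
low, high = qmf_split(x)
restored = qmf_synthesis(low, high)
```

The two bands together hold as many values as the input, and the synthesis step returns the input exactly.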
53
MPEG1 audio
● Audio part of MPEG, standard of video encoding
– Layer 1 (MP1), layer 2 (MP2), layer 3 (MP3)
– Frequency analysis, psychoacoustic model
[Figure: encoder: polyphase filter bank / MDCT → quantizers (Q) controlled by an FFT-based psychoacoustic model → bitstream generation. Decoder: bitstream restoration → de-quantizers (de-Q) → conversion to time domain]
54
MP1
● MPEG1 audio layer 1
– Frequency analysis by polyphase filter bank
– Normalization and scalar quantization in every 12 samples
[Figure: polyphase filter bank → 32 frequency bands; each band is normalized and passed to a nonlinear scalar quantizer, and the block average power of each band is scalar-quantized as well]
55
MP3
● Frequency analysis by polyphase filter bank and MDCT
● Variable frame length (18 points standard)
● Entropy coding
[Figure: polyphase filter bank → 32 bands → MDCT per band → nonlinear scalar quantization → Huffman coding → bitstream]
56
Modified Discrete Cosine Transform (MDCT)
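● Convert n points of time-domain signal into n/2 points of frequency-domain signal
● Exploits n/2-point overlap window
$X(m) = \sum_{k=0}^{n-1} f(k)\,x(k)\cos\left\{\dfrac{\pi}{2n}\left(2k + 1 + \dfrac{n}{2}\right)(2m+1)\right\}$
$x(k) = \dfrac{4 f(k)}{n} \sum_{m=0}^{n/2-1} X(m)\cos\left\{\dfrac{\pi}{2n}\left(2k + 1 + \dfrac{n}{2}\right)(2m+1)\right\}$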
57
Conversion to time-domain by Overlap-Add
● The original signal can be restored by adding the temporally overlapping data
[Figure: overlapping blocks of the signal → MDCT → IMDCT → Overlap-Add]
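Restoration by overlap-add can be checked numerically with the MDCT and IMDCT formulas of the previous slide. The sketch below assumes a sine window for f(k), chosen because it satisfies f(k)^2 + f(k+n/2)^2 = 1; the slides do not say which window is used. The direct O(n^2) sums are for illustration only, and the function names are hypothetical.

```python
import math

def sine_window(n):
    """Window with f(k)^2 + f(k + n/2)^2 = 1 (an assumed choice)."""
    return [math.sin(math.pi * (k + 0.5) / n) for k in range(n)]

def _cos(n, k, m):
    return math.cos(math.pi / (2 * n) * (2 * k + 1 + n / 2) * (2 * m + 1))

def mdct(block, f):
    """n windowed time samples -> n/2 coefficients."""
    n = len(block)
    return [sum(f[k] * block[k] * _cos(n, k, m) for k in range(n)) for m in range(n // 2)]

def imdct(X, f):
    """n/2 coefficients -> n windowed, time-aliased samples."""
    n = 2 * len(X)
    return [4 * f[k] / n * sum(X[m] * _cos(n, k, m) for m in range(n // 2)) for k in range(n)]

def analyze_synthesize(x, n):
    """MDCT and IMDCT on blocks overlapping by n/2, followed by overlap-add."""
    f, hop = sine_window(n), n // 2
    out = [0.0] * len(x)
    for start in range(0, len(x) - n + 1, hop):
        y = imdct(mdct(x[start:start + n], f), f)
        for k in range(n):
            out[start + k] += y[k]
    return out

x = [math.sin(0.3 * k) + 0.1 * k for k in range(32)]
y = analyze_synthesize(x, 8)
# Samples covered by two blocks (indices 4..27) are restored; the edges are not.
max_error = max(abs(x[i] - y[i]) for i in range(4, 28))
```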
58
Speech Enhancement
● What is speech enhancement?
– Extract specific speech from an input signal that contains the target speech and other noise
– It is generally difficult: some kind of assumption is needed
● Basic methods
– Single-channel case
● Linear method: Wiener filter
● Nonlinear method: Spectral subtraction
– Multiple-channel case
● Linear method: microphone array
● Nonlinear method: multichannel spectral subtraction
59
Single-channel case
● Speech signal x, noise signal n, observed signal y
● Aim: Estimate x from y (both x and n are unknown)
● Assumption needed
– Spectrum of n is known
$y(t) = x(t) + n(t)$
$Y(\omega) = X(\omega) + N(\omega)$
60
Linear method: the Wiener filter
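● Consider a filter W that minimizes errors
$\hat{X}(\omega) = W(\omega)\,Y(\omega) = W(\omega)\left(X(\omega) + N(\omega)\right)$
$\int_0^{\pi} \left|X(\omega) - \hat{X}(\omega)\right|^2 d\omega \to \min$
● Consideration in time-frequency domain
$\sum_t \sum_{i=0}^{N-1} \left|X_i(t) - \hat{X}_i(t)\right|^2 = \sum_t \sum_{i=0}^{N-1} \left|X_i(t) - W_i\left(X_i(t) + N_i(t)\right)\right|^2 \to \min$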
61
Linear method: the Wiener filter
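● Differentiate the previous formula
$\dfrac{\partial}{\partial W_i} \sum_t \sum_{i=0}^{N-1} \left|X_i(t) - W_i\left(X_i(t) + N_i(t)\right)\right|^2 = 0$
$\sum_t \left\{-2 X_i(t)\left(X_i(t) + N_i(t)\right) + 2 W_i \left|X_i(t) + N_i(t)\right|^2\right\} = 0$
If X(t) and N(t) have no correlation, $\sum_t X_i(t) N_i(t) \approx 0$, so
$W_i = \dfrac{\sum_t |X_i(t)|^2}{\sum_t |X_i(t)|^2 + \sum_t |N_i(t)|^2}$
The Wiener filter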
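The Wiener gain is a per-band ratio of average powers. The sketch below assumes the summed speech and noise powers of each band are already available as lists; the function name and the handling of empty bands are assumptions.

```python
def wiener_gain(speech_power, noise_power):
    """W_i = S_i / (S_i + N_i), with S_i and N_i the summed powers of band i."""
    return [s / (s + n) if s + n > 0 else 0.0
            for s, n in zip(speech_power, noise_power)]

# Band 0: speech dominates; band 1: equal powers; band 2: noise only
gains = wiener_gain([9.0, 1.0, 0.0], [1.0, 1.0, 4.0])
```

Bands where the noise power is large relative to the speech power receive a gain close to zero.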
62
The Wiener filter
● Condition to apply
– Average spectra of signal and noise are known
● Actually, it hardly holds for the signal
– The signal and noise have no correlation
● Meaning of the Wiener filter
– Suppress frequency bands with large noise power
● The powers of the frequency bands become identical to $E[X_i^2]$ on average
$W_i\left(X_i(t) + N_i(t)\right) = \dfrac{\sum_t |X_i(t)|^2}{\sum_t |X_i(t)|^2 + \sum_t |N_i(t)|^2} \cdot \left(X_i(t) + N_i(t)\right)$
63
Example of the Wiener filter
[Figure: upper panel: spectra of speech and noise, horizontal axis 0 to 9000, vertical axis 0 to 300000. Lower panel: spectrum of the filter, horizontal axis 0 to 9000, vertical axis 0 to 1.2]
64
Spectral subtraction
● Suppress noise by nonlinear processing
● Noise is assumed to be stationary
● Subtract the spectrum of the noise from the observed spectrum
● Definitions
– Signal: $X_i(t)$
– Noise: $N_i(t)$
– Observed signal: $Y_i(t) = X_i(t) + N_i(t)$
65
Spectral subtraction
● Principle of SS
– Power spectrum of the observed signal
$|Y_i(t)|^2 = |X_i(t) + N_i(t)|^2 \le |X_i(t)|^2 + 2|X_i(t) N_i(t)| + |N_i(t)|^2$
– The signal is assumed to have no correlation to the noise
$|X_i(t) N_i(t)| \ll |X_i(t)|^2 + |N_i(t)|^2$
$|X_i(t)|^2 \approx |Y_i(t)|^2 - |N_i(t)|^2$
– Noise signal is assumed to be stationary
$|N_i(t)|^2 = N_i^2$
$|X_i(t)|^2 \approx |Y_i(t)|^2 - N_i^2$
66
Spectral subtraction
● Estimation of noise spectrum
– Noise spectrum must be prepared beforehand
– Estimated from the silent part before the voice
● To estimate magnitude spectrum
$\hat{X}_i(t) \approx \sqrt{\dfrac{|Y_i(t)|^2 - |N_i(t)|^2}{|Y_i(t)|^2}}\; Y_i(t)$
67
Practical problem and its solution
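● The estimated power spectrum can become negative
– Solution by flooring
$|X_i(t)|^2 \approx \begin{cases} |Y_i(t)|^2 - N_i^2 & \text{if } |Y_i(t)|^2 > N_i^2 \\ \beta\,|Y_i(t)|^2 & \text{otherwise} \end{cases} \qquad 0 < \beta \ll 1$
● Overestimation makes the quality better
$|X_i(t)|^2 \approx \begin{cases} |Y_i(t)|^2 - \alpha N_i^2 & \text{if } |Y_i(t)|^2 > \alpha N_i^2 \\ \beta\,|Y_i(t)|^2 & \text{otherwise} \end{cases} \qquad \alpha > 1$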
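Flooring and overestimation fit in one function. In this sketch `alpha` is the overestimation factor (greater than 1) and `beta` the flooring factor (much smaller than 1); the default values and the function name are illustrative assumptions, and the inputs are power spectra of one frame.

```python
def spectral_subtract(y_power, noise_power, alpha=2.0, beta=0.01):
    """Estimate |X_i|^2 from |Y_i|^2 and the noise power N_i^2 for each band i."""
    out = []
    for y, n in zip(y_power, noise_power):
        if y > alpha * n:
            out.append(y - alpha * n)   # subtract the overestimated noise power
        else:
            out.append(beta * y)        # flooring keeps the estimate positive
    return out

estimate = spectral_subtract([10.0, 3.0], [2.0, 2.0])
# Band 0: 10 - 2*2 = 6; band 1: 3 is not above 4, so it is floored to 0.01 * 3
```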
68
Examples (waveform)
Speech
Noise
Speech with noise
Enhanced speech
69
Examples (spectrogram)
Speech
Noise
Speech with noise
Enhanced speech
70
Speech enhancement using multiple microphones
● Exploit speech signals recorded by more than one microphone
– Spatial information can be used
● Record speech and noise individually
● Beam forming
● Various methods
– Linear processing
● Delayed-sum array (superdirective microphone)
● Adaptive array
– Nonlinear processing
● Multi-channel spectral subtraction
71
Delayed-sum array
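● Record the sound signal arriving from a specific angle using multiple microphones
– The sound is assumed to be a plane wave
[Figure: microphones spaced d apart; a plane wave arriving at angle θ travels an extra distance d sin θ between adjacent microphones]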
72
Delayed-sum array
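[Figure: microphones spaced d apart, plane wave at angle θ, path difference d sin θ. Signals observed at the four microphones:]
$\sin\omega t, \quad \sin\omega\left(t - \dfrac{d\sin\theta}{c}\right), \quad \sin\omega\left(t - \dfrac{2d\sin\theta}{c}\right), \quad \sin\omega\left(t - \dfrac{3d\sin\theta}{c}\right)$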
73
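Delayed-sum array
[Figure: delays of 3d sin θ / c, 2d sin θ / c, d sin θ / c and 0 are applied to the four microphone signals, so that every channel becomes the same signal:]
$\sin\omega\left(t - \dfrac{3d\sin\theta}{c}\right)$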
74
Delayed-sum array
[Figure: the four delayed signals are added; the output is]
$4\sin\omega\left(t - \dfrac{3d\sin\theta}{c}\right)$
75
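Delayed-sum array
[Figure: the same delays applied to a plane wave arriving from a different angle φ; the output is]
$\sum_{n=0}^{3} \sin\omega\left(t - \dfrac{n\,d\sin\phi + (3-n)\,d\sin\theta}{c}\right)$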
76
Example
[Figure: directivity of the array versus angle of incidence (rad) for n=4, d=1. Curves: fun1: ω=5, fun2: ω=10, fun3: ω=50]
77
Property of the delayed-sum array
● Simple processing
– Fast calculation
– Easy hardware realization
● Directivity
– Main lobe width becomes narrower when
● More microphones are used
● The frequency becomes higher
● Microphone-to-microphone distance becomes wider
– Spatial aliasing
● Condition of no aliasing: d < c/2f
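The directivity can be computed from the phase differences between microphones. The sketch below assumes a plane wave, equally spaced microphones and a complex-exponential signal model; the default values (4 microphones, d = 0.05 m, f = 1000 Hz, c = 340 m/s) and the function name are illustrative assumptions.

```python
import cmath
import math

def array_gain(theta, steer, n_mics=4, d=0.05, f=1000.0, c=340.0):
    """Normalized output amplitude for a wave from angle theta [rad]
    when the delays are set for the angle steer [rad]."""
    # Phase left over between adjacent microphones after the steering delays
    phase = 2 * math.pi * f * d * (math.sin(theta) - math.sin(steer)) / c
    return abs(sum(cmath.exp(1j * n * phase) for n in range(n_mics))) / n_mics

on_target = array_gain(0.3, 0.3)               # 1.0 in the steering direction
off_target_4 = array_gain(1.0, 0.0)            # below 1 away from it
off_target_8 = array_gain(1.0, 0.0, n_mics=8)  # more microphones: narrower main lobe
```

With these defaults d = 0.05 m is below c/2f = 0.17 m, so the no-aliasing condition holds.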
78
Adaptive noise suppression
● When the noise can be recorded by a microphone
● Linear filtering
[Figure: two inputs, speech + noise and noise alone, enter the signal processing block, which outputs the speech with reduced noise]
79
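Adaptive noise suppression
● Processing by an adaptive filter
– n(k) is not identical to the noise component mixed into x(k), so we can't subtract n(k) from x(k) directly → Use the filter W(z)
[Figure: the noise reaches the speech microphone through a path G(z), giving x(k) (speech + noise); the recorded noise n(k) is filtered by W(z) to give y(k), which is subtracted from x(k) to give the output e(k). Update W(z) so that the power of the output signal becomes minimum]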
80
Adaptive filter
● Realize W(z) as an FIR filter
$y(k) = \sum_{i=1}^{p} w_i(k)\, n(k-i+1)$
– $w_i(k)$: i-th filter coefficient at time k
● Updating coefficients using LMS algorithm
$w_i(k+1) = w_i(k) + 2\mu\, e(k)\, n(k-i+1)$
(Many other algorithms have been developed)
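The LMS update can be sketched as follows. The example assumes that the filter taps are n(k), n(k-1), ..., n(k-p+1), that e(k) = x(k) - y(k), and that the step size is a fixed constant `mu`; the test signal, in which the noise reaches the main microphone through a two-tap path and no speech is present, is a made-up illustration.

```python
import random

def lms_cancel(x, n, p=4, mu=0.05):
    """Adaptive noise canceller: returns the output e(k) and the final coefficients."""
    w = [0.0] * p
    out = []
    for k in range(len(x)):
        taps = [n[k - i] if k - i >= 0 else 0.0 for i in range(p)]  # n(k)..n(k-p+1)
        y = sum(wi * ti for wi, ti in zip(w, taps))   # filtered reference noise
        e = x[k] - y                                  # output signal
        w = [wi + 2 * mu * e * ti for wi, ti in zip(w, taps)]  # LMS update
        out.append(e)
    return out, w

rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(3000)]
# Noise as picked up by the main microphone: a two-tap path, no speech present
x = [0.5 * noise[k] + (0.25 * noise[k - 1] if k > 0 else 0.0) for k in range(len(noise))]
e, w = lms_cancel(x, noise)   # w approaches [0.5, 0.25, 0, 0], e approaches 0
```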
81
Subtractive array
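● Eliminate noise using a microphone array (when the direction of the noise is known)
[Figure: two microphones receive speech and noise; the noise arrives from angle θ_N. One channel is delayed by d sin θ_N / c and subtracted from the other]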
82
Adaptive subtractive array
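● When the direction of the noise is not known
– Noise suppression using adaptive filters
[Figure: two microphones receive speech and noise; each channel passes through an adaptive filter, and one filtered channel is subtracted from the other]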
83
Adaptive subtractive array
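● Some constraints are needed for the filters
– Without any constraints, the output becomes zero
[Figure: two microphones receive noise and speech; the channels pass through the filters H1 and H2, and one is subtracted from the other]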
84
Adaptive subtractive array
● Adaptive filter: $H_i(\omega)$
● Transfer function from speech source to microphone: $G_i(\omega)$
● Constraint of the filters
$F(\omega) = \sum_i G_i(\omega)\,H_i(\omega) = 1$
● Example
– Griffiths-Jim array
85
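Griffiths-Jim array
[Figure: the delayed microphone signals are summed to give the output of the delayed-sum array; differences between delayed channels contain noise only and pass through the adaptive filters H1 and H2, whose outputs are subtracted from the delayed-sum output]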
86
2ch spectral subtraction
● Spectral subtraction using microphone array
– Estimate the noise spectrum by suppressing the target signal
– Subtract the noise spectrum from the observed spectrum
● Features
– Nonlinear processing
– No need to prepare the noise spectrum beforehand
– Effective for non-stationary noise
87
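2ch spectral subtraction
[Figure: the two delayed microphone signals are added (output of delayed-sum array) and subtracted (noise only); |DFT|^2 is computed for each, and the noise power spectrum is subtracted from the delayed-sum power spectrum with nonlinear processing (overestimation, flooring)]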