
Page 1

Speech and Audio Processing and Coding (cont.)

Dr Wenwu Wang

Centre for Vision Speech and Signal Processing

Department of Electronic Engineering

w.wang@surrey.ac.uk

http://personal.ee.surrey.ac.uk/Personal/W.Wang/teaching.html

Page 2

Frequency response of H(z)

The frequency response of the vocal tract filter is best obtained by performing the DFT of the impulse response. It can also be calculated directly from the pole-zero plot, due to the relation between the DFT and the Z-transform. Essentially, the frequency response of the vocal tract can be obtained by sampling its Z-transform along the unit circle, $z = e^{j\omega}$:

$$H(e^{j\omega}) = b_0 \frac{\prod_{i=1}^{q} \left(1 - \beta_i e^{-j\omega}\right)}{\prod_{i=1}^{p} \left(1 - \alpha_i e^{-j\omega}\right)}$$

The magnitude response can be obtained as:

$$\left|H(e^{j\omega})\right| = |b_0| \, \frac{\prod_{i=1}^{q} \left|1 - \beta_i e^{-j\omega}\right|}{\prod_{i=1}^{p} \left|1 - \alpha_i e^{-j\omega}\right|}$$

Exercise: derive the expression for the phase response of H(z).
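As a quick illustration of sampling H(z) on the unit circle (not from the slides; the filter coefficients below are arbitrary), scipy.signal.freqz evaluates a rational transfer function at $z = e^{j\omega}$:

```python
import numpy as np
from scipy.signal import freqz

# Illustrative (arbitrary) vocal-tract-like all-pole filter H(z) = 1/A(z)
b = [1.0]                        # numerator coefficients (no zeros)
a = [1.0, -1.3, 0.9]             # denominator coefficients (one pole pair)

# Sample H(z) along the unit circle z = e^{j*omega} at 512 frequencies
w, H = freqz(b, a, worN=512)     # w: rad/sample, H: complex response

magnitude_db = 20 * np.log10(np.abs(H))   # magnitude response in dB
phase = np.unwrap(np.angle(H))            # phase response (unwrapped)
```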

Page 3

How to Separate Spectral Envelope and Spectral Details from the Spectrum?

[Figure: a speech spectrum (magnitude in dB) shown together with its spectral envelope, which carries the formant structure, and its spectral details]

Page 4

Two Techniques Can Be Used

Linear Prediction

Cepstrum Analysis

Page 5

Recall the “resonance” effect in speech production

Source: (Ellis 2013)

The vocal tract acts as a variable resonator. In simple terms, resonance = “formants”.

Page 6

Resonance

Source: (Ellis 2013)

Page 7

Simulating resonance with a single pole-pair filter

Source: (Ellis 2013)
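A minimal sketch of this idea (values assumed, not from the slides): a pair of complex-conjugate poles at radius r and angle ±θ yields a resonance near f0 = θ·fs/(2π), and the filter's impulse response is a damped sinusoid:

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000                      # sampling rate in Hz (assumed)
f0, r = 1000.0, 0.97           # resonance frequency and pole radius (assumed)
theta = 2 * np.pi * f0 / fs    # pole angle in radians

# Poles at r*e^{+/- j*theta}: A(z) = 1 - 2r*cos(theta) z^{-1} + r^2 z^{-2}
a = [1.0, -2 * r * np.cos(theta), r ** 2]

impulse = np.zeros(512)
impulse[0] = 1.0
h = lfilter([1.0], a, impulse)  # damped sinusoid oscillating near f0
```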

Page 8

Source filter model

Source: (Ellis 2013)

Source: excitation signal defining the fine structure of the speech.

Filter: subsequent shaping by physical resonances.

Page 9

The “resonance” in spectrum

Source: (Ellis 2013)

The resonances change slowly (~10-20 ms)

Page 10

Linear prediction

In the time domain, the filtering of the excitation signal through the vocal tract filter described in the previous slides can be written equivalently as:

$$s[n] = \sum_{k=1}^{p} a_k\, s[n-k] + \sum_{j=0}^{q} b_j\, u[n-j]$$

The above equation is known as an autoregressive moving average (ARMA) process. That is, a sampled value at the output of a linear filter is a weighted average of its past output samples, its past input samples and the current input sample. In general, both poles and zeros exist, and it is also referred to as a pole-zero filtering process, which can be further split into an AR process and an MA process, written respectively as:

AR (all-pole filter):

$$s[n] = \sum_{k=1}^{p} a_k\, s[n-k] + b_0\, u[n]$$

MA (all-zero filter):

$$s[n] = \sum_{j=0}^{q} b_j\, u[n-j]$$
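As a minimal sketch (not from the slides), scipy.signal.lfilter realises these difference equations directly; note that SciPy's denominator convention means the a_k above enter with a sign flip. The coefficient values are arbitrary:

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)
u = rng.standard_normal(1000)        # excitation u[n] (white noise)

ak = np.array([1.2, -0.5])           # AR coefficients a_k (arbitrary)
bj = np.array([1.0, 0.4])            # MA coefficients b_j (arbitrary)

# SciPy computes a[0]*s[n] = sum_j b[j]*u[n-j] - sum_k a[k]*s[n-k],
# so s[n] = sum_k a_k s[n-k] + sum_j b_j u[n-j] needs denominator [1, -a_k].
s_arma = lfilter(bj, np.concatenate(([1.0], -ak)), u)    # ARMA (pole-zero)
s_ar = lfilter([1.0], np.concatenate(([1.0], -ak)), u)   # AR (all-pole)
s_ma = lfilter(bj, [1.0], u)                             # MA (all-zero)
```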

Page 11

Linear prediction (cont.)

Linear prediction analysis, or linear predictive coding (LPC), is a powerful method used in speech processing and coding for modelling the short-term correlations between speech samples.

It is a mathematical operation where future values of a discrete-time signal are estimated as a weighted summation of previous samples, i.e. an AR process (all-pole filter) with gain factor G:

$$H(z) = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}}$$

This effectively exploits the redundancy between consecutive samples, encoding and transmitting only the non-redundant information in the signal.

Although nasals and fricatives can cause anti-resonances in the transfer function, suggesting the use of a pole-zero filter, a reasonable approximation to the vocal tract filter can be obtained by the above all-pole filter with a sufficiently high order p.

Page 12

Linear prediction (cont.)

Linear prediction provides efficient methods for calculating the coefficients of an all-pole filter from real speech, which leads to a compact representation of the formant structure.

In speech coding, the LPC coefficients are quantised, encoded and then transmitted to a speech decoder, which reconstructs an approximation to the original speech.

The output of a linear predictor with coefficients $a_k$ is defined as:

$$\hat{s}[n] = \sum_{k=1}^{p} a_k\, s[n-k]$$

The prediction error between the actual signal and its predicted value is given by:

$$e[n] = s[n] - \hat{s}[n] = s[n] - \sum_{k=1}^{p} a_k\, s[n-k]$$

The coefficients $a_k$ are calculated by minimising the short-term mean-squared prediction error:

$$E\{e^2[n]\} = E\left\{\left(s[n] - \sum_{k=1}^{p} a_k\, s[n-k]\right)^2\right\}$$

Page 13

LPC coefficients estimation

Autocorrelation method: the LPC coefficients can be solved from the set of linear equations known as the Yule-Walker equations:

$$\begin{bmatrix} R_n(0) & R_n(1) & \cdots & R_n(p-1) \\ R_n(1) & R_n(0) & \cdots & R_n(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ R_n(p-1) & R_n(p-2) & \cdots & R_n(0) \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix} = \begin{bmatrix} R_n(1) \\ R_n(2) \\ \vdots \\ R_n(p) \end{bmatrix}$$

where $R_n(j)$ is the short-time autocorrelation function:

$$R_n(j) = \frac{1}{N} \sum_{m=0}^{N-1-j} s_n[m]\, s_n[m+j], \qquad j = 0, \ldots, p$$

The above matrix is a Toeplitz matrix, and therefore the equation is usually solved by a recursive algorithm known as Durbin's algorithm.
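A sketch of the autocorrelation method with Durbin's recursion, using the predictor convention above ($\hat{s}[n] = \sum_k a_k s[n-k]$); the function and variable names are mine:

```python
import numpy as np

def lpc_autocorrelation(s, p):
    """LPC coefficients a_1..a_p of one (windowed) frame s via the
    autocorrelation method and Durbin's recursive algorithm."""
    N = len(s)
    R = np.array([np.dot(s[:N - j], s[j:]) for j in range(p + 1)]) / N
    a = np.zeros(p)                  # predictor coefficients so far
    E = R[0]                         # prediction error energy
    for i in range(1, p + 1):
        # Reflection coefficient for order i
        k = (R[i] - np.dot(a[:i - 1], R[i - 1:0:-1])) / E
        a_rev = a[:i - 1][::-1]      # a_{i-1}, ..., a_1 (empty when i == 1)
        a[:i - 1] = a[:i - 1] - k * a_rev
        a[i - 1] = k
        E *= (1.0 - k * k)
    return a, E

# Example on a synthetic Hamming-windowed frame, order p = 10
frame = np.hamming(320) * np.random.default_rng(1).standard_normal(320)
a, E = lpc_autocorrelation(frame, p=10)
```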

Page 14

LPC coefficients estimation

Covariance method: the LPC coefficients can be solved from the following set of linear equations:

$$\begin{bmatrix} \phi_n(1,1) & \phi_n(1,2) & \cdots & \phi_n(1,p) \\ \phi_n(2,1) & \phi_n(2,2) & \cdots & \phi_n(2,p) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_n(p,1) & \phi_n(p,2) & \cdots & \phi_n(p,p) \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix} = \begin{bmatrix} \phi_n(1,0) \\ \phi_n(2,0) \\ \vdots \\ \phi_n(p,0) \end{bmatrix}$$

where $\phi_n(i,j)$ measures the similarity, over a fixed number of samples, between two sequences which are slightly delayed with respect to one another:

$$\phi_n(i,j) = \frac{1}{N} \sum_{m=0}^{N-1} s_n[m-i]\, s_n[m-j], \qquad 1 \le i \le p, \; 0 \le j \le p$$

The above matrix is not a Toeplitz matrix, and therefore the equation is usually solved by matrix inversion.

Page 15

Inverse filtering

Once the estimates of the linear prediction coefficients have been obtained, the excitation signal can be obtained from the speech signal using inverse filtering:

$$S(z) = H(z)\, U(z)$$

$$U(z) = H^{-1}(z)\, S(z) = A(z)\, S(z)$$

where $A(z)$ is the inverse filter of the vocal tract transfer function $H(z)$.

Inverse filtering results in a residual with a flatter spectrum than the original speech. However, the residual still contains useful information about the speech signal, e.g. whether it is voiced/unvoiced, and the periodicity of the speech. For voiced sounds, it has a pulse-like nature, so it is common to apply periodicity estimation to the residual rather than the original speech.
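A minimal sketch of inverse filtering with the coefficient convention above: $A(z) = 1 - \sum_k a_k z^{-k}$ is an FIR filter, so the residual is a single lfilter call (the function name is mine):

```python
import numpy as np
from scipy.signal import lfilter

def lpc_residual(s, a):
    """Inverse-filter the speech frame s with A(z) = 1 - sum_k a_k z^{-k},
    returning an estimate of the excitation (the LPC residual)."""
    A = np.concatenate(([1.0], -np.asarray(a)))  # FIR inverse filter
    return lfilter(A, [1.0], s)

# e.g. with the lpc_autocorrelation() sketch from a previous page:
# residual = lpc_residual(frame, a)
```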

Page 16

Inverse filtering (cont.)

The LPC residual of a voiced segment obtained by inverse filtering

Page 17

Order selection of the LPC model

Roughly speaking, the number of peaks in the magnitude response is equal to the number of poles.

As p increases, the error between the AR fit and the speech signal decreases, and the estimated coefficients result in a closer fit to the actual spectral shape, and vice versa.

In practice, LPC modelling is usually used to model the formant structure, so the spectral detail obtained from higher orders is unnecessary; LPC orders of around 20-30 are sufficient to model the first few formants in speech.

Generally, the LPC spectrum provides a better match to the spectral peaks than to the spectral valleys, due to the use of an all-pole filter to model the spectrum, without any zeros to model the spectral valleys.

Page 18

Order selection of the LPC model

Page 19

Estimate formant frequency

Estimate from the LPC spectrum
Estimate from the Z-transform of the vocal tract transfer function
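Both routes can be sketched in a few lines; below is the transfer-function route, where the roots of A(z) are found and each complex pole's angle is converted to a frequency (the sampling rate and plausibility thresholds are assumptions of mine):

```python
import numpy as np

def formants_from_lpc(a, fs):
    """Formant frequency estimates from LPC coefficients a_1..a_p:
    find the roots of A(z) = 1 - sum_k a_k z^{-k} and convert the
    angle of each pole in the upper half-plane to a frequency in Hz."""
    A = np.concatenate(([1.0], -np.asarray(a)))
    poles = np.roots(A)
    poles = poles[np.imag(poles) > 0]            # one per conjugate pair
    freqs = np.angle(poles) * fs / (2 * np.pi)   # pole angle -> Hz
    bws = -fs / np.pi * np.log(np.abs(poles))    # 3 dB bandwidth estimate
    keep = (freqs > 90) & (bws < 400)            # drop implausible candidates
    return np.sort(freqs[keep])
```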

Page 20

LPC analysis

[Figure: LPC analysis of a 20 ms speech frame]

Source: (Ellis 2013)

Page 21

Prediction gain

The prediction error has a smaller variance compared with the original signal. Therefore, if we quantise and encode the prediction error instead of the original signal, the error in the quantisation/encoding scheme can be reduced for a given bit rate. Let d[n] be the quantisation error; the signal-to-quantising-noise ratio of the system is:

$$\mathrm{SNR} = \frac{E\{s^2[n]\}}{E\{d^2[n]\}} = \frac{\sigma_s^2}{\sigma_e^2} \cdot \frac{\sigma_e^2}{\sigma_d^2} = G_p \cdot \mathrm{SNR}_Q$$

where the prediction gain and the SNR of the quantiser are, respectively,

$$G_p = \frac{\sigma_s^2}{\sigma_e^2}, \qquad \mathrm{SNR}_Q = \frac{\sigma_e^2}{\sigma_d^2}$$

Page 22

Speech synthesis from the LPC model

After obtaining the LPC coefficients, it is a simple matter to re-synthesise the original speech:

Generate periodic impulses or white noise as the excitation signal.
Pass the generated excitation signal through the all-pole LPC filter (i.e. convolve it with the filter's impulse response).

Source: (Ellis 2013)
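A minimal synthesis sketch under the source-filter assumptions above (the pitch, gain and coefficient values are placeholders):

```python
import numpy as np
from scipy.signal import lfilter

fs, f0, G = 8000, 100.0, 1.0              # sample rate, pitch, gain (assumed)

# Voiced excitation: periodic impulse train at the pitch period
excitation = np.zeros(2048)
excitation[::int(round(fs / f0))] = 1.0
# For unvoiced speech, use white noise instead:
# excitation = np.random.default_rng(0).standard_normal(2048)

a = np.array([1.2, -0.5])                 # LPC coefficients (placeholders)
synth = lfilter([G], np.concatenate(([1.0], -a)), excitation)
```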

Page 23

Cepstral analysis

LPC modelling has been widely used in speech processing due to its simplicity and computational efficiency. However:

The LPC spectral envelope provides a poor fit to the spectral valleys.
An assumption is made that the excitation is Gaussian noise or a single impulse, while multiple glottal pulses could exist within the window used for estimating the prediction coefficients, especially for female or high-pitched speech.

A more accurate but less efficient approach to separate the excitation signal from the vocal tract transfer function is cepstral deconvolution. Essentially, this is a process whereby fine spectral detail can be separated from the smooth spectral shape (formant structure).

Page 24

Definition of cepstrum

Complex cepstrum: the complex cepstrum is defined as the inverse Fourier transform of the logarithm of the phase-unwrapped spectrum:

$$c_c[m] = \mathrm{DFT}^{-1}\{\log S[k]\} = \mathrm{DFT}^{-1}\{\log|S[k]| + j\,\theta_s[k]\}$$

where $\theta_s[k]$ is the unwrapped phase of $S[k]$.

Real cepstrum: the real cepstrum is simply obtained by discarding the phase information:

$$c_r[m] = \mathrm{DFT}^{-1}\{\log|S[k]|\}$$

Both complex and real cepstrum are real valued, as s[n] is real, and S[k] is complex conjugate symmetric.

The main drawback with the real cepstrum is that the phase information is discarded and it is not possible to reconstruct the original spectrum, while it is possible to do this from the complex cepstrum by using an inverse DFT.
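The real cepstrum maps directly onto a few NumPy calls; a minimal sketch (the small epsilon is my addition, to keep the log finite on zero-magnitude bins):

```python
import numpy as np

def real_cepstrum(x):
    """c_r[m] = DFT^{-1}{ log|DFT{x}| }; the result is real-valued
    because |X[k]| is conjugate-symmetric for real x."""
    X = np.fft.fft(x)
    return np.fft.ifft(np.log(np.abs(X) + 1e-12)).real
```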

Page 25

Cepstrum is a special case of homomorphic filtering

Homomorphic system:

A homomorphic system can be split into a cascade of three homomorphic systems:

Source: (Rabiner & Schafer, 1978)

Page 26

Cepstrum is a special case of homomorphic filtering

Source: (Rabiner & Schafer, 1978)

[Figure panels: Cepstrum; Liftering]

Page 27

Cepstrum

Figure source: http://cnx.org/content/m12469/latest/

Procedure for computing cepstrum

Page 28

Cepstrum: an example

Source: (Taylor, 2009). In the figure, the transform shown is the DFT.

Page 29

Phase unwrapping

By the Convolution Theorem, the time-domain model $s[n] = u[n] * h[n]$ becomes, after the DFT, $S[k] = U[k]\, H[k]$. Writing each spectrum in polar form,

$$S[k] = |S[k]|\, e^{j\theta_s[k]}, \qquad H[k] = |H[k]|\, e^{j\theta_h[k]}, \qquad U[k] = |U[k]|\, e^{j\theta_u[k]}$$

the phases add:

$$\theta_s[k] = \theta_u[k] + \theta_h[k]$$

$\theta_s[k]$ is the phase of the speech spectrum in the k-th frequency bin, and is measured by taking the angle of $S[k]$, which has a range of $(-\pi, \pi]$.

Any substitution $\theta_s[k] \rightarrow \theta_s[k] + 2\pi m$ produces the same complex value of $S[k]$.

To produce a smooth phase spectrum from bin to bin, the phase needs to be unwrapped, i.e. integer multiples of $2\pi$ are added to the initial phase estimates.
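np.unwrap does exactly this: wherever the bin-to-bin phase jump exceeds π, it adds the appropriate multiple of 2π. A minimal sketch:

```python
import numpy as np

x = np.random.default_rng(0).standard_normal(256)   # any real signal
X = np.fft.fft(x)

wrapped = np.angle(X)             # principal phase values in (-pi, pi]
unwrapped = np.unwrap(wrapped)    # adds 2*pi*m to smooth bin-to-bin jumps

# The complex cepstrum then uses log|X[k]| + j * unwrapped phase;
# a small imaginary residue can remain after unwrapping, so take the
# real part of the inverse transform.
c_complex = np.fft.ifft(np.log(np.abs(X) + 1e-12) + 1j * unwrapped).real
```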

Page 30

Phase unwrapping


Page 31

Cepstrum

Even though the cepstrum involves a transform from the time domain to the frequency domain, and then an inverse transformation, it is not accurate to say that the result is back in the time domain, due to the nonlinear logarithmic function used in the operation.

The axis of the cepstrum is therefore given the name quefrency, in the same spirit as the coining of “cepstrum” from “spectrum”.

Any regular periodicity in the spectrum produces a peak in the cepstrum, in the same way that periodicity in the time domain produces a peak in the frequency domain.

Harmonics are equally spaced in frequency, so they produce a peak along the quefrency axis at a position (and at integer multiples of this position) proportional to the pitch period.

The low end of the quefrency axis corresponds to the smooth spectral shape or formant structure, and the middle and high ends correspond to harmonics and spectral detail. In other words, the contributions of the excitation signal and the vocal tract transfer function occupy different regions of the quefrency axis.

Page 32

Terminology

Figure source: http://cnx.org/content/m12469/latest/

Terms for Spectrum    Derived Terms for Cepstrum
Spectrum              Cepstrum
Frequency             Quefrency
Harmonics             Rahmonics
Magnitude             Gamnitude
Phase                 Saphe
Filter                Lifter
Low-pass filter       Short-pass lifter
High-pass filter      Long-pass lifter

The terminology was invented by Bogert et al. (1963).

Page 33

Cepstrum

Real cepstrum of a short vowel segment

Page 34

Cepstrum

Real cepstrum of other short vowel segments (only half of the quefrency range is shown). The horizontal axis is in ms, and the vertical axis shows the magnitude.

Figure source: http://cnx.org/content/m12469/latest/

Page 35

Spectrum vs Cepstrum

Waveform of a vowel segment (512 samples)
Spectrum of the vowel
Cepstrum of the vowel segment
The first 20 cepstral coefficients

Examples from: http://mi.eng.cam.ac.uk/~ajr/SA95/node33.html

Page 36

Liftering

Liftering in the cepstrum is similar to “filtering” in the spectrum.

All cepstrum coefficients are set to zero, except the low-end coefficients (such as the first 20 coefficients, together with their reflected counterparts at negative quefrencies). This process is called low-pass (short-pass) liftering.

The spectral envelope which contains the formant structural information can be estimated by liftering.

The spectral envelope obtained from cepstrum is a good fit in both peaks and valleys, in contrast to LPC, where a better fit was obtained to the peaks than to the valleys.
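A sketch of this short-pass liftering for envelope estimation, following the slide's choice of 20 coefficients plus their mirrored counterparts (the epsilon and function name are mine):

```python
import numpy as np

def spectral_envelope(x, L=20):
    """Log-magnitude spectral envelope by low-pass (short-pass) liftering:
    keep cepstrum coefficients 0..L-1 and their mirrored counterparts,
    zero the rest, and transform back to the log-spectral domain."""
    X = np.fft.fft(x)
    c = np.fft.ifft(np.log(np.abs(X) + 1e-12)).real  # real cepstrum
    lifter = np.zeros(len(c))
    lifter[:L] = 1.0
    lifter[-(L - 1):] = 1.0        # mirrored ("negative quefrency") part
    return np.fft.fft(c * lifter).real  # smoothed log-magnitude spectrum
```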

Page 37

Liftering: demonstration

Plot of $\mathrm{DFT}\{c[m]\}$ computed from $c[m]$, $m = 1, \ldots, 40$

Source: Rabiner & Schafer, 2007

Page 38

Liftering

Spectrum of a vowel segment and its spectral envelope obtained by liftering the first 20 cepstrum coefficients

Page 39

Liftering

Liftered log-magnitude spectrum obtained by using only the first K cepstrum coefficients. It can be seen that with more coefficients, more spectral details can be obtained (similar effect to the use of different orders in LPC modelling to get different spectral details). Source: (Taylor 2009).

Page 40

An application example of cepstrum: pitch frequency estimation

[Figure: cepstra of successive analysis frames; the vertical axis is the window (frame) number]

Source: Rabiner & Schafer, 2007

Page 41

An application example of cepstrum: pitch frequency estimation
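A single-frame sketch of the method: the strongest cepstral peak within a plausible pitch-period range gives the pitch estimate (the frame length and search range below are assumptions of mine):

```python
import numpy as np

def cepstral_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Pitch estimate (Hz) from the strongest real-cepstrum peak in the
    quefrency range corresponding to fmin..fmax."""
    X = np.fft.fft(frame * np.hamming(len(frame)))
    c = np.fft.ifft(np.log(np.abs(X) + 1e-12)).real
    qmin, qmax = int(fs / fmax), int(fs / fmin)  # quefrency bounds (samples)
    peak = qmin + int(np.argmax(c[qmin:qmax]))
    return fs / peak

# e.g. on a voiced 32 ms frame sampled at 8 kHz: cepstral_pitch(frame, 8000)
```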

Page 42

An application example of cepstrum liftering: echo/reverberation removal

Page 43

An application example of cepstrum liftering: echo/reverberation removal

Figure due to Guido Gebl

Page 44

Mel-frequency cepstral coefficients (MFCCs)

The computation of MFCCs is similar to that of the real cepstrum; the difference is that instead of using the DFT, where frequency bins are linearly spaced, the signal is passed through a filter bank in which the frequency bands are non-linearly spaced but equidistant on the mel scale.

The mel scale is roughly linear below 500 Hz and roughly logarithmic above this, and is based on a perceptual model of pitch sensitivity.

MFCC features are frequently used in speech recognition; an advantage of MFCCs is that the amplitude spectrum can be summarised by a relatively small number (commonly around 13) of perceptually relevant features.

Page 45

Filter bank for computing MFCCs

The speech signal is analysed using the STFT.

The DFT values in each critical band are grouped together.

They are then weighted according to the triangular weighting function shown in the next slide. A sketch of the full chain follows below.
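The sketch below implements the whole chain under common assumptions: the HTK-style mel formula, triangular filters spaced uniformly on the mel scale, and a DCT of the log filter-bank energies as the final cepstral transform (the usual stand-in for the inverse DFT used earlier). Parameter values are typical defaults, not taken from the slides:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, fs, n_filters=26, n_ceps=13, n_fft=512):
    """MFCCs of one frame: power spectrum -> triangular mel filter bank
    -> log -> DCT-II, keeping the first n_ceps coefficients."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2

    # Filter centre frequencies equally spaced on the mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / fs).astype(int)

    # Triangular filters between adjacent centre frequencies
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge

    log_energy = np.log(fbank @ spectrum + 1e-12)

    # DCT-II of the log filter-bank energies, truncated to n_ceps
    m = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), m + 0.5) / n_filters)
    return dct @ log_energy
```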

Page 46

Filter bank for computing MFCCs

Page 47

Filter bank for computing MFCCs (cont.)

[Figure: block diagram of MFCC computation; each windowed frame is passed through an FFT to give the spectrum, followed by the mel filters and cepstrum analysis]

Page 48

Comparison of smoothing techniques based on LPC, cepstrum and mel cepstrum

Source: Rabiner & Schafer, 2007

Page 49

Summary

Source filter model + resonance

LPC model

LPC analysis & synthesis

Cepstrum analysis

Liftering

MFCCs