
1

Sound Media Engineering part II

Speech Information Processing

Akinori Ito
Graduate School of Engineering, Tohoku Univ.

[email protected]

2

Overview of the lecture

● #1: Production and coding of speech (1)
  – Speech production, features of speech sound
  – Basic codecs: PCM, DPCM, ADPCM
● #2: Coding of speech (2)
  – Linear prediction of speech: linear prediction coefficients, PARCOR coefficients and LSP
  – CELP coding
  – Audio coding
● #3: Speech enhancement
  – Spectral subtraction
  – Microphone array

3

Production of speech

● Organs that produce speech
  – vocal cords
  – larynx
  – pharynx
  – tongue
  – gums
  – teeth
  – lips
  – nasal cavity

[Figure label: vocal tract]

4

Acoustic tube model

● Human speech production is similar to wind instruments

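[Figure: acoustic tube model of the speech organs, labeled vocal cords, vocal tract, larynx, lips and nasal cavity. Annotations: pitch of voice; linguistic content; personality.]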

5

Linguistic and speaker feature

[Figure: acoustic tube model, labeled vocal cords, vocal tract, larynx, lips and nasal cavity]

A speaker can control the shape of this part

6

Linguistic and speaker feature

[Figure: acoustic tube model, labeled vocal cords, vocal tract, larynx, lips and nasal cavity]

A speaker cannot control the shape of this part or the total length of the vocal tract

7

Speech waveform

● Complex enough

[Figure: waveforms of the vowels /a/, /i/, /u/, /o/ and /e/]

8

Speech waveform

● It is complex, but almost periodic

[Figure: speech waveform with one fundamental period marked]

Fundamental period T [s]
Fundamental frequency F0 [Hz] = 1/T

9

Various "a"

● Two /a/'s with different fundamental frequencies
  – Same phone = same vocal tract shape
  – Completely different waveforms
  – What is the same between these waveforms?

10

Spectrum of speech

● Spectrum of two /a/'s
  – Spectral shapes are similar → shape of vocal tract
  – "Jaggies" of the spectrum differ → fundamental frequency

11

Spectrum and formant frequencies

● F0: fundamental frequency
● F1, F2, ...: formant frequencies

[Figure: speech spectrum with the fundamental frequency F0 and the formant frequencies F1, F2, F3 and F4 marked]

12

Speech coding

● Sound (analog) → convert to digital data
  – Handle with a computer
  – Transmission over a digital line
● How do we digitize sound?
  – Goals
    ● Good quality when converting back to analog sound
    ● Lower bit rate
  – Methodology
    ● Exploit various features of speech

13

Basics of speech coding

● Sampling
  – Observe the temporally continuous signal at discrete times
  – Rate of the discrete observation: sampling frequency fs (the sampling period is 1/fs)
  – The original signal can be restored from the sampled data when it only contains frequency components below fs/2 (sampling theorem)

14

Basics of speech coding

● Quantization
  – Round off the magnitude of the signal to discrete levels
    ● The magnitude of the signal can then be represented by integers
  – Spacing between the discrete levels: quantization step
  – Difference between the original signal and the quantized signal: quantization error
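Sampling and quantization as described above can be sketched in a few lines. This is only an illustration: the test tone, the step size and the helper name `quantize` are assumptions chosen for the example, not values from the lecture.

```python
import math

def quantize(value, step):
    """Round a sample to the nearest multiple of the quantization step."""
    return step * round(value / step)

fs = 8000        # sampling frequency [Hz] (telephone-quality value)
f_tone = 440.0   # test tone [Hz]; arbitrary, but below fs/2
step = 0.25      # quantization step; arbitrary choice for illustration

# Sampling: observe the continuous signal sin(2*pi*f*t) at t = k/fs
samples = [math.sin(2 * math.pi * f_tone * k / fs) for k in range(16)]

# Quantization: round each sample to a discrete level
quantized = [quantize(s, step) for s in samples]

# Quantization error: difference between original and quantized signal
errors = [s - q for s, q in zip(samples, quantized)]
max_error = max(abs(e) for e in errors)
```

With rounding to the nearest level, the quantization error never exceeds half the quantization step.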

15

Sampling and quantization: how are they determined?

● Sampling frequency is determined by the highest frequency in the sound
  – Telephone: 8 kHz (up to 4 kHz sound)
  – High-quality speech: 16 kHz (up to 8 kHz sound)
  – CD: 44.1 kHz (up to 22.05 kHz sound)
● Quantization is determined by the dynamic range of the sound
  – To code speech is to quantize speech

16

PCM coding

● PCM (Pulse Code Modulation)
  – Represent the quantized values as binary numbers
● What must be determined in PCM
  – How many bits to use for one sample
  – How to determine the levels of quantization
    ● Equal steps: linear quantization
    ● Unequal steps: nonlinear quantization
● Examples of PCM coding
  – CD: 16-bit linear quantization
  – VoIP (G.711): 8-bit nonlinear quantization

17

PCM linear quantization

● There is nothing difficult here

[Figure: waveform on a linear scale from -10 to 10, with each sample rounded to an integer level]

CD: 16-bit quantization (-32768 to +32767)

18

Nonlinear quantization

● Most samples are nearly zero
  → Total error can be reduced by finely quantizing values around zero

[Figure: two quantization scales, each running from -10 to 10]

19

Example of nonlinear quantization: G.711

● Speech coding for a 64 kbit/s digital phone line
  – 8 kHz sampling, 8-bit nonlinear quantization
  – μ-law (Japan, US), A-law (Europe)
  – μ-law: 14-bit linear quantization → 8-bit nonlinear quantization

$$Y = 128\,\operatorname{sign}(X)\,\frac{\log\left(1 + 255\,\frac{|X|}{8192}\right)}{\log 256}$$

[Figure: μ-law characteristic. Horizontal axis: 14-bit linear value (-8000 to 8000); vertical axis: 8-bit μ-law value (-150 to 150)]
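The μ-law formula above can be tried out directly. The sketch below evaluates the continuous formula and its algebraic inverse; the function names are hypothetical, and the real G.711 codec uses a piecewise-linear approximation with integer code words, which is not reproduced here.

```python
import math

def mu_law_compress(x):
    """Map a 14-bit linear sample (about -8192..8192) to about -128..128."""
    sign = (x > 0) - (x < 0)
    return 128 * sign * math.log(1 + 255 * abs(x) / 8192) / math.log(256)

def mu_law_expand(y):
    """Algebraic inverse of mu_law_compress."""
    sign = (y > 0) - (y < 0)
    return sign * 8192 * (256 ** (abs(y) / 128) - 1) / 255
```

Small inputs are spread over many output levels (an input of 100 already maps to an output above 30), which is the "fine quantization around zero" of the previous slide.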

20

Differential PCM (DPCM)

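● In an ordinary speech signal, the values of two contiguous samples do not differ very much
  → Reduce the bit rate by transmitting the differences between samples

[Block diagram: the one-sample-delayed signal (z^-1) is subtracted from the input, and the difference is quantized by Q]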
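A minimal DPCM sketch follows. It assumes a closed-loop arrangement in which the encoder takes the difference from the previous reconstructed sample, so that encoder and decoder stay in step; the slide's diagram may differ in detail, and the function names and the uniform step are illustrative.

```python
def dpcm_encode(samples, step):
    """Quantize the difference between each sample and the previous reconstruction."""
    codes = []
    prev = 0  # previous reconstructed sample, known to both encoder and decoder
    for x in samples:
        code = round((x - prev) / step)
        codes.append(code)
        prev = prev + code * step  # track what the decoder will reconstruct
    return codes

def dpcm_decode(codes, step):
    """Accumulate the de-quantized differences."""
    out = []
    prev = 0
    for code in codes:
        prev = prev + code * step
        out.append(prev)
    return out
```

For a slowly varying input such as `[0, 3, 7, 12, 14, 13, 9, 4]` with `step = 2`, the transmitted codes stay much smaller than the sample values, while each decoded sample is within half a step of the original.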

21

Differential PCM(DPCM)

● Original waveform
● Differential waveform

[Figure: original waveform (top) and differential waveform (bottom). Horizontal axis: 0 to 4500; vertical axis: -25000 to 15000 in both panels]

22

Adaptive Differential PCM(ADPCM)

● To enhance the efficiency of DPCM
  – Use more sophisticated prediction rather than a simple difference
  – Adaptively change the quantization step
    ● When the difference between two samples is large, the difference to the next sample is likely to be large too
    ● When the difference between two samples is small, the difference to the next sample is likely to be small too

23

Block diagram of ADPCM

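[Block diagram of the ADPCM encoder. The prediction signal $x_e(k)$ is subtracted from the PCM input $x(k)$ to give the differential signal $d(k)$. The adaptive quantizer converts $d(k)$ into the ADPCM output $I(k)$. The adaptive de-quantizer converts $I(k)$ into the quantized differential signal $d_q(k)$, which is added to $x_e(k)$ to give the reconstructed signal $x_r(k)$. The signal predictor computes the next prediction signal.]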

24

Calculation algorithm of ADPCM

1. Compute prediction signal: $x_e(k)$
2. Compute difference: $d(k) = x(k) - x_e(k)$
3. Quantize (ADPCM output): $I(k) = Q[d(k)]$
4. De-quantize: $d_q(k) = Q^{-1}[I(k)]$
5. Reconstruct signal: $x_r(k) = x_e(k) + d_q(k)$
6. Compute next prediction: $x_e(k+1) = \mathrm{pred}(x_r(k), d_q(k), \ldots)$

25

Prediction of speech signal

● ADPCM quantizes the difference between the input signal and the predicted signal
● How to predict the signal
  – DPCM

$$x_e(k) = x_r(k-1)$$

  – A little better way

$$x_e(k) = 2\,x_r(k-1) - x_r(k-2)$$

  – G.726

$$x_e(k) = \sum_{i=1}^{2} a_i\, x_r(k-i) + \sum_{i=1}^{6} b_i\, d_q(k-i)$$

26

Determine quantization step adaptively (example)

[Figure: two quantization scales with levels -8 to 7 at different step sizes; some levels are drawn in blue and others in red]

● Observe the difference from the previous sample using the scale
● If the difference is "blue", halve the size of the next scale
● If the difference is "red", double the size of the next scale
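The ADPCM steps and the step-size rule can be combined into a small sketch. Several things here are assumptions made for illustration: 4-bit codes in the range -8 to 7, the simple DPCM predictor, the initial step, and above all the thresholds that stand in for the "blue" (small code) and "red" (large code) regions, whose exact extent cannot be read from the extracted figure. This is not the G.726 algorithm.

```python
def adapt_step(step, code, min_step=1.0, max_step=4096.0):
    """Halve the step after a small code, double it after a large one."""
    if abs(code) <= 1:        # assumed "blue" region: difference was small
        step /= 2
    elif abs(code) >= 6:      # assumed "red" region: difference was large
        step *= 2
    return min(max(step, min_step), max_step)

def adpcm_encode(samples, step=16.0):
    codes = []
    xe = 0.0                                      # 1. prediction signal
    for x in samples:
        d = x - xe                                # 2. difference
        code = max(-8, min(7, round(d / step)))   # 3. quantize (4-bit code)
        codes.append(code)
        dq = code * step                          # 4. de-quantize
        xr = xe + dq                              # 5. reconstruct
        xe = xr                                   # 6. next prediction (DPCM predictor)
        step = adapt_step(step, code)
    return codes

def adpcm_decode(codes, step=16.0):
    out = []
    xe = 0.0
    for code in codes:
        xr = xe + code * step
        out.append(xr)
        xe = xr
        step = adapt_step(step, code)   # same rule as the encoder, driven by the codes
    return out
```

Because the step size is updated only from the transmitted codes, the decoder can follow the encoder's step size without any side information.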

27

For high-efficiency speech coding

● PCM, DPCM and ADPCM encode general sound signals
  – DPCM and ADPCM partly exploit properties of the input signal
● Human speech is a small part of all sound signals
  → We can enhance the efficiency of coding by considering the properties of human speech
● What are the properties of human speech?

28

High-level speech coding

[Figure: levels of speech representation on the encoding and decoding sides: speech, digital data, speech feature, phones, words/sentences, semantics. Labels in the figure: AD/DA and PCM coder (public phone); vocoder and CELP coder (mobile phone); speech synthesis; Text-to-Speech; "under research"; "summarizing telephone?"]

29

Speech production model

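[Figure: speech organs (vocal cords, larynx, vocal tract, nasal cavity, lips) mapped to a chain of S, T and R (radiation) that produces the speech spectrum X]

$$X(\omega) = S(\omega)\,T(\omega)\,R(\omega)$$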

30


S

31


S

T R

32

Modeling speech using parameters

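● Modeling speech using linear prediction (LPC)
  – Spectral shape: parameters of the linear prediction filter
  – Vocal cord vibration: residue
  – The coefficients are estimated so as to minimize the residue

$$x(k) = -\sum_{i=1}^{p} a_i\, x(k-i) + e(k)$$

  – In the spectral domain

$$X(\omega) = \frac{E(\omega)}{1 + \sum_{n=1}^{p} a_n e^{-i n \omega}} = E(\omega)\,H(\omega)$$

[Figure annotation: compared with $X = S\,T\,R$, $E$ plays the role of $S$ and $H$ that of $T R$]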

33

Analysis and transmission of speech by LPC

● Information to be transmitted
  – LP coefficients $a_i$ and residue $e(k)$
● How to transmit them?
  – Estimate $a_i$ for a fixed number of samples (a block)
  – Calculate $e(k)$ using the estimated $a_i$
  – Transmit $a_i$ and $e(k)$ as parameters of the block
● How to restore the signal?
  – Using the LPC formula

$$x(k) = -\sum_{i=1}^{p} a_i\, x(k-i) + e(k)$$
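The analysis and restoration steps above can be sketched as a pair of filters. The function names are hypothetical, and samples before the start of the block are assumed to be zero, a simplification a real coder would not make.

```python
def lpc_residual(x, a):
    """Analysis: e(k) = x(k) + sum_i a_i x(k-i), with x(k) = 0 for k < 0."""
    p = len(a)
    e = []
    for k in range(len(x)):
        acc = x[k]
        for i in range(1, p + 1):
            if k - i >= 0:
                acc += a[i - 1] * x[k - i]
        e.append(acc)
    return e

def lpc_synthesize(e, a):
    """Synthesis: x(k) = -sum_i a_i x(k-i) + e(k), the LPC formula."""
    p = len(a)
    x = []
    for k in range(len(e)):
        acc = e[k]
        for i in range(1, p + 1):
            if k - i >= 0:
                acc -= a[i - 1] * x[k - i]
        x.append(acc)
    return x
```

With unquantized coefficients and residue, `lpc_synthesize(lpc_residual(x, a), a)` returns the original block up to rounding error.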

34

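Estimation of LP coefficients

● How to estimate the LP coefficients from the samples x(0) ... x(k)
  – Solve a set of simultaneous equations (Yule-Walker equation)
    → The LP coefficients are calculated as the least-error solution
  – Faster algorithm (Levinson-Durbin algorithm)
● LPC equation

$$-\begin{pmatrix} x(k-1) & x(k-2) & \cdots & x(k-p) \\ x(k-2) & x(k-3) & \cdots & x(k-p-1) \\ \vdots & \vdots & \ddots & \vdots \\ x(p-1) & x(p-2) & \cdots & x(0) \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{pmatrix} = \begin{pmatrix} x(k) \\ x(k-1) \\ \vdots \\ x(p) \end{pmatrix} - \begin{pmatrix} e(k) \\ e(k-1) \\ \vdots \\ e(p) \end{pmatrix}$$

$$-FA = V - E$$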

35

Estimation of LP coefficients

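● Least-squares solution: minimize $|E|^2$
  → minimize $|FA + V|^2$
● Equation to be solved: $F^T F A = -F^T V$, where $F^T F = (\phi_{ij})$ and $F^T V = (\phi_{0j})$

$$\begin{pmatrix} \phi_{11} & \phi_{12} & \cdots & \phi_{1p} \\ \phi_{21} & \phi_{22} & \cdots & \phi_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \phi_{p1} & \phi_{p2} & \cdots & \phi_{pp} \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{pmatrix} = -\begin{pmatrix} \phi_{01} \\ \phi_{02} \\ \vdots \\ \phi_{0p} \end{pmatrix} \quad \text{(Yule-Walker equation)}$$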

36

Estimation of LP coefficients

● Elements in the Yule-Walker equation

$$\phi_{ij} = \sum_{n=p}^{N-1} y(n-i)\, y(n-j)$$

● Solve the equation directly
● Fast algorithm (autocorrelation method), for N ≫ p

$$\phi_{ij} = r(|i-j|) = \sum_{n=0}^{N-|i-j|-1} y(n)\, y(n+|i-j|)$$

  – The matrix has a special form (symmetric Toeplitz matrix)
  – Quick solution algorithm (Levinson-Durbin algorithm)
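The autocorrelation method with the Levinson-Durbin recursion can be sketched as below. The sign conventions follow the lecture's LPC formula (so the result solves $\sum_j r(|i-j|)\,a_j = -r(i)$) and the PARCOR convention in which the last coefficient of each order is $-k_i$; the function names and the returned tuple are choices made for this example.

```python
def autocorrelation(y, max_lag):
    """r(lag) = sum_n y(n) y(n + lag) over the block."""
    n = len(y)
    return [sum(y[k] * y[k + lag] for k in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, p):
    """Solve the Toeplitz Yule-Walker equation in O(p^2) operations.

    Returns (a, parcor, err): LP coefficients a_1..a_p, PARCOR
    coefficients k_1..k_p, and the final prediction error power.
    """
    a = []
    parcor = []
    err = r[0]
    for i in range(1, p + 1):
        acc = r[i] + sum(a[j - 1] * r[i - j] for j in range(1, i))
        k = acc / err
        # order update: a_j <- a_j - k * a_(i-j), and new last coefficient -k
        a = [a[j - 1] - k * a[i - j - 1] for j in range(1, i)] + [-k]
        parcor.append(k)
        err *= (1 - k * k)
    return a, parcor, err
```

For example, `levinson_durbin([2.0, 1.0, 0.2], 2)` gives coefficients close to `[-0.6, 0.2]`, which can be checked by substituting them back into the two Yule-Walker equations.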

37

Analysis and transmission of speech by LPC

● Problem
  – Re-synthesis by the LPC formula can be unstable
    → The output signal may oscillate when the $a_i$ contain quantization errors
● Solution
  – Transmit parameters that are equivalent to the LP coefficients and stable against quantization error
    ● PARCOR coefficients
    ● LSP coefficients

38

LPC and PARCOR coefficients

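● PARCOR (partial correlation) coefficients

$$k_i = \frac{\sum_{n=-\infty}^{\infty} \varepsilon_f^{(i-1)}(n)\, \varepsilon_b^{(i-1)}(n)}{\sqrt{\sum_{n=-\infty}^{\infty} \left(\varepsilon_f^{(i-1)}(n)\right)^2 \sum_{n=-\infty}^{\infty} \left(\varepsilon_b^{(i-1)}(n)\right)^2}}$$

Forward prediction error:

$$\varepsilon_f^{(i-1)}(n) = x(n) + \sum_{j=1}^{i-1} a_j^{(i-1)}\, x(n-j)$$

Backward prediction error:

$$\varepsilon_b^{(i-1)}(n) = x(n-i) + \sum_{j=1}^{i-1} b_j^{(i-1)}\, x(n-j)$$

The PARCOR coefficient is the correlation between the forward prediction errors and the backward prediction errors.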

39

PARCOR coefficients

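[Figure: samples x(n-i-1), x(n-i), x(n-i+1), x(n-i+2), ..., x(n-2), x(n-1), x(n), x(n+1) on a time axis. x(n) is predicted with the coefficients $a_j$ and x(n-i) with the coefficients $b_j$; the predictions are subtracted from x(n) and x(n-i), and $k_i$ is the correlation between the two resulting prediction errors]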

40

PARCOR and LPC

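With forward and backward prediction errors $\varepsilon_f$ and $\varepsilon_b$:

$$k_1 = \frac{\sum_{n=-\infty}^{\infty} \varepsilon_f^{(0)}(n)\, \varepsilon_b^{(0)}(n)}{\sqrt{\sum_{n=-\infty}^{\infty} \left(\varepsilon_f^{(0)}(n)\right)^2 \sum_{n=-\infty}^{\infty} \left(\varepsilon_b^{(0)}(n)\right)^2}}, \qquad \varepsilon_f^{(0)}(n) = x(n), \quad \varepsilon_b^{(0)}(n) = x(n-1)$$

$k_1$ is the correlation coefficient between x(n-1) and x(n).

As x(n-1) and x(n) have the same variance and zero mean,

$$\hat{x}^{(1)}(n) = k_1\, x(n-1), \qquad a_1^{(1)} = -k_1$$

$$\hat{x}^{(1)}(n-1) = k_1\, x(n), \qquad b_1^{(1)} = -k_1$$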

41

PARCOR and LPC

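With forward and backward prediction errors $\varepsilon_f$ and $\varepsilon_b$:

$$k_2 = \frac{\sum_{n=-\infty}^{\infty} \varepsilon_f^{(1)}(n)\, \varepsilon_b^{(1)}(n)}{\sqrt{\sum_{n=-\infty}^{\infty} \left(\varepsilon_f^{(1)}(n)\right)^2 \sum_{n=-\infty}^{\infty} \left(\varepsilon_b^{(1)}(n)\right)^2}}$$

$$\varepsilon_f^{(1)}(n) = x(n) - k_1\, x(n-1)$$

$$\varepsilon_b^{(1)}(n) = x(n-2) - k_1\, x(n-1)$$

$$\hat{\varepsilon}_f^{(1)}(n) = k_2\, \varepsilon_b^{(1)}(n) = -k_1 k_2\, x(n-1) + k_2\, x(n-2)$$

Here, as $\varepsilon_f^{(1)}(n) = x(n) - k_1\, x(n-1)$,

$$\hat{x}^{(2)}(n) = k_1 (1 - k_2)\, x(n-1) + k_2\, x(n-2)$$

$$a_1^{(2)} = -k_1 (1 - k_2) = a_1^{(1)} - k_2\, b_1^{(1)}, \qquad a_2^{(2)} = -k_2$$

Similarly,

$$b_2^{(2)} = b_1^{(1)} - k_2\, a_1^{(1)}, \qquad b_1^{(2)} = -k_2$$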

42

PARCOR and LPC

In general,

$$a_j^{(i)} = a_j^{(i-1)} - k_i\, b_j^{(i-1)}, \qquad a_0^{(i-1)} = 0$$

$$b_j^{(i)} = b_{j-1}^{(i-1)} - k_i\, a_{j-1}^{(i-1)}, \qquad b_0^{(i-1)} = 0$$

We can calculate the LP coefficients using this recurrence relation.

43

LPC and LSP coefficients

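● What is LSP (Line Spectrum Pair)?
  – Representation of the LPC equation in the z-domain:

$$x(k) + \sum_{i=1}^{p} a_i\, x(k-i) = e(k) \quad\Longrightarrow\quad X(z)\,\underbrace{\left(1 + \sum_{i=1}^{p} a_i z^{-i}\right)}_{A_p(z)} = E(z)$$

  – Decompose $A_p(z)$ into $P(z)$ and $Q(z)$

$$P(z) = A_p(z) - z^{-(p+1)} A_p(z^{-1})$$

$$Q(z) = A_p(z) + z^{-(p+1)} A_p(z^{-1})$$

$$A_p(z) = \frac{P(z) + Q(z)}{2}$$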

44

LSP coefficients

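● Roots of P(z)=0 and Q(z)=0 are on the frequency axis (on the unit circle in the z-domain)
● P(z) and Q(z) can be written as

$$P(z) = (1 - z^{-1}) \prod_{i=2,4,\ldots,p} \left(1 - 2 z^{-1} \cos\omega_i + z^{-2}\right)$$

$$Q(z) = (1 + z^{-1}) \prod_{i=1,3,\ldots,p-1} \left(1 - 2 z^{-1} \cos\omega_i + z^{-2}\right)$$

● Roots of P(z)=0 and Q(z)=0

$$z = \cos\omega_i \pm i \sin\omega_i$$

● LSP

$$\omega_1, \omega_2, \ldots, \omega_p$$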
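The decomposition of $A_p(z)$ can be checked numerically on coefficient lists. In this sketch a polynomial in $z^{-1}$ is stored as a list of coefficients of $z^0, z^{-1}, \ldots, z^{-(p+1)}$; the function name and this representation are assumptions for the example, and finding the root angles $\omega_i$ themselves is left out.

```python
def lsp_polynomials(a):
    """Return the coefficient lists of P(z) and Q(z) for LP coefficients a_1..a_p.

    A_p(z) = 1 + a_1 z^-1 + ... + a_p z^-p, padded to degree p+1.
    z^-(p+1) A_p(z^-1) has the same coefficients in reversed order.
    """
    ext = [1.0] + list(a) + [0.0]   # A_p(z), padded with a zero for z^-(p+1)
    rev = ext[::-1]                 # z^-(p+1) A_p(z^-1)
    P = [u - v for u, v in zip(ext, rev)]
    Q = [u + v for u, v in zip(ext, rev)]
    return P, Q
```

For `a = [-0.6, 0.2]` this gives `P` close to `[1, -0.8, 0.8, -1]` and `Q` close to `[1, -0.4, -0.4, 1]`: P is antisymmetric and vanishes at z = 1, Q is symmetric and vanishes at z = -1, matching the factors $(1 - z^{-1})$ and $(1 + z^{-1})$.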

45

Properties of LSP

● Frequency parameter
● Calculated using numerical analysis
● Easy to determine stability

$$0 < \omega_1 < \omega_2 < \cdots < \omega_p < \pi$$

● More robust against quantization than PARCOR
● More computation needed than PARCOR
● Widely used in current speech coding

46

CELP coder

● CELP (Code-Excited Linear Prediction)
  – Basic coding scheme for mobile phones
  – Analysis and synthesis based on LPC
  – Transmit LSP coefficients and residue

47

Overview of CELP

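[Block diagram of the CELP encoder. LPC analysis of the input yields LSP coefficients, which are quantized. Code vectors from the residue codebook, scaled using the gain codebook, form the residue that drives LPC synthesis. The synthesized signal is subtracted from the input, and the code vector with the minimum auditory-weighted distance is selected. The results are packed into the output bitstream.]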

48

Speech coding schemes based on CELP

  – LD-CELP (Low-Delay CELP)
    ● G.728, 16 kbit/s
  – CS-ACELP (Conjugate Structure Algebraic CELP)
    ● G.729, 8 kbit/s
  – RPE-LTP (Regular Pulse Excitation with Long Term Prediction)
    ● GSM standard, 13 kbit/s
  – VSELP (Vector Sum Excited Linear Prediction)
    ● PDC standard, 6.7 kbit/s
  – PSI-CELP (Pitch Synchronous Innovation CELP)
    ● PDC half-rate standard, 3.45 kbit/s
  – ACELP (Algebraic CELP)
    ● GSM revised standard, 7.4 kbit/s

49

Audio coding

● Coding of general sound and music
  – We cannot make assumptions (as for speech) about the input signal
    ● Higher frequency components have smaller power
  – Model-based coding (as for speech) cannot be used
● Coding based on multi-band analysis
  – Split the input signal into bands from low frequency to high frequency
    ● Frequency analysis using a filter bank / MDCT
  – Change the quantization step frequency by frequency
    ● Coarse quantization for high frequencies
    ● Coarse quantization if that sound is not salient

50

Basic framework of audio coding

[Block diagram. Encoder: frequency analysis (QMF, MDCT, wavelet) → quantization in each band, taking auditory properties into account (psychoacoustic analysis) → bitstream generation with entropy coding (Huffman, arithmetic). Decoder: bitstream restoration → de-quantization in each band → conversion into the time domain]

51

SB-ADPCM

● 16 kHz middle-quality speech coder (G.722)
  – Sub-Band ADPCM
  – Split the input signal into high-band and low-band signals and encode them individually using ADPCM
    ● Frequency splitting using a quadrature mirror filter (QMF)
    ● ADPCM coding: 2 bits for the high band, 4 to 6 bits for the low band (48 to 64 kbit/s)

[Block diagram. Encoder: QMF → two ADPCM encoders → bitstream generation. Decoder: bitstream restoration → two ADPCM decoders → QMF]

52

Quadrature Mirror Filter (QMF)

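● Split the input signal into high-frequency and low-frequency signals while keeping the total amount of data identical
● The original signal can be perfectly restored by combining the low-frequency and high-frequency signals
● Example of a simple QMF (Haar wavelet)

$$y_i = \frac{x_{2i} + x_{2i+1}}{2}, \qquad z_i = \frac{x_{2i} - x_{2i+1}}{2}$$

$$x_{2i} = y_i + z_i, \qquad x_{2i+1} = y_i - z_i$$

[Block diagram: QMF split takes $x_i$ and outputs $y_i$ (low band) and $z_i$ (high band); QMF synthesis restores $x_i$ from them]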
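The Haar example above translates directly into code. The function names are hypothetical, and the input is assumed to have an even number of samples.

```python
def haar_split(x):
    """Split x into a low band (pairwise averages) and a high band (pairwise half-differences)."""
    low = [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
    high = [(x[2 * i] - x[2 * i + 1]) / 2 for i in range(len(x) // 2)]
    return low, high

def haar_merge(low, high):
    """Restore the original samples: x(2i) = y(i) + z(i), x(2i+1) = y(i) - z(i)."""
    x = []
    for y, z in zip(low, high):
        x.append(y + z)
        x.append(y - z)
    return x
```

Splitting `[4.0, 2.0, 5.0, 5.0, 1.0, 7.0]` gives the low band `[3.0, 5.0, 4.0]` and the high band `[1.0, 0.0, -3.0]`: six samples in, six samples out, and merging them restores the input exactly.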

53

MPEG1 audio

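● Audio part of MPEG, a standard for video encoding
  – Layer 1 (MP1), layer 2 (MP2), layer 3 (MP3)
  – Frequency analysis, psychoacoustic model

[Block diagram. Encoder: polyphase filter bank / MDCT → quantization (Q) in each band, controlled by an FFT-based psychoacoustic model → bitstream generation. Decoder: bitstream restoration → de-quantization (de-Q) in each band → conversion to the time domain]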

54

MP1

● MPEG1 audio layer 1
  – Frequency analysis by polyphase filter bank
  – Normalization and scalar quantization in every 12 samples

[Block diagram: polyphase filter bank → 32 frequency bands. In each band, the block average power is scalar-quantized and used to normalize the samples, which then go through nonlinear scalar quantization]

55

MP3

● Frequency analysis by polyphase filter bank and MDCT
● Variable frame length (18 points standard)
● Entropy coding

[Block diagram: polyphase filter bank → 32 bands → MDCT in each band → Huffman coding → bitstream]

56

Modified Discrete Cosine Transform (MDCT)

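● Convert n points of time-domain signal into n/2 points of frequency-domain signal
● Exploits an n/2-point overlapping window $f(k)$

$$X(m) = \sum_{k=0}^{n-1} f(k)\, x(k) \cos\left\{ \frac{\pi}{2n} \left(2k + 1 + \frac{n}{2}\right)(2m + 1) \right\}$$

$$x(k) = \frac{4 f(k)}{n} \sum_{m=0}^{n/2-1} X(m) \cos\left\{ \frac{\pi}{2n} \left(2k + 1 + \frac{n}{2}\right)(2m + 1) \right\}$$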

57

Conversion to time-domain by Overlap-Add

● The original signal can be restored by adding the temporally overlapping data

[Figure: overlapping frames are transformed by the MDCT, transformed back by the IMDCT, and combined by overlap-add]
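The MDCT, its inverse and the overlap-add step can be sketched together. The sine window used here is an assumption (the slides do not name a window); it is a common choice that satisfies $f(k)^2 + f(k + n/2)^2 = 1$, which the $4/n$ scaling of the inverse relies on. The helper names and the zero padding at both ends are also choices made for this example.

```python
import math

def mdct(frame, window):
    """n windowed time samples -> n/2 frequency coefficients."""
    n = len(frame)
    return [sum(window[k] * frame[k]
                * math.cos(math.pi / (2 * n) * (2 * k + 1 + n / 2) * (2 * m + 1))
                for k in range(n))
            for m in range(n // 2)]

def imdct(coeffs, window):
    """n/2 coefficients -> n windowed samples that still contain time aliasing."""
    n = 2 * len(coeffs)
    return [4 * window[k] / n
            * sum(coeffs[m]
                  * math.cos(math.pi / (2 * n) * (2 * k + 1 + n / 2) * (2 * m + 1))
                  for m in range(n // 2))
            for k in range(n)]

def analysis_synthesis(signal, n):
    """MDCT and IMDCT on frames shifted by n/2, followed by overlap-add.

    len(signal) is assumed to be a multiple of n/2.
    """
    hop = n // 2
    window = [math.sin(math.pi * (k + 0.5) / n) for k in range(n)]
    padded = [0.0] * hop + list(signal) + [0.0] * hop
    out = [0.0] * len(padded)
    for start in range(0, len(padded) - n + 1, hop):
        y = imdct(mdct(padded[start:start + n], window), window)
        for k in range(n):
            out[start + k] += y[k]   # aliasing cancels between neighbouring frames
    return out[hop:hop + len(signal)]
```

A single inverse transform does not return its input frame; only the sum of two neighbouring frames does, which is why each frame can be stored with n/2 coefficients.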

58

Speech Enhancement

● What is speech enhancement?
  – Extract specific speech from an input signal that contains the target speech and other noise
  – It is generally difficult: some kind of assumption is needed
● Basic methods
  – Single-channel case
    ● Linear method: Wiener filter
    ● Nonlinear method: spectral subtraction
  – Multiple-channel case
    ● Linear method: microphone array
    ● Nonlinear method: multichannel spectral subtraction

59

Single-channel case

● Speech signal x, noise signal n, observed signal y

$$y(t) = x(t) + n(t)$$

$$Y(\omega) = X(\omega) + N(\omega)$$

● Aim: estimate x from y (both x and n are unknown)
● Assumption needed
  – The spectrum of n is known

60

Linear method: the Wiener filter

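● Consider a filter W that minimizes the error

$$\hat{X}(\omega) = W(\omega)\, Y(\omega) = W(\omega)\,(X(\omega) + N(\omega))$$

$$\int_0^{\infty} \left| \hat{X}(\omega) - X(\omega) \right|^2 d\omega \to \min$$

● Consideration in the time-frequency domain

$$\sum_t \sum_{i=0}^{N-1} \left| X_i(t) - \hat{X}_i(t) \right|^2 = \sum_t \sum_{i=0}^{N-1} \left| X_i(t) - W_i\,(X_i(t) + N_i(t)) \right|^2 \to \min$$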

61

Linear method: the Wiener filter

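● Differentiate the previous formula

$$\frac{\partial}{\partial W_i} \sum_t \sum_{i=0}^{N-1} \left| X_i(t) - W_i\,(X_i(t) + N_i(t)) \right|^2 = 0$$

$$\sum_t \left\{ -2\, X_i(t)\,(X_i(t) + N_i(t)) + 2\, W_i \left| X_i(t) + N_i(t) \right|^2 \right\} = 0$$

If X(t) and N(t) have no correlation,

$$\sum_t \left| X_i(t)\, N_i(t) \right| \approx 0$$

$$W_i = \frac{\sum_t |X_i(t)|^2}{\sum_t |X_i(t)|^2 + \sum_t |N_i(t)|^2} \qquad \text{(the Wiener filter)}$$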
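The resulting gain is simple to evaluate once the two power spectra are available. The sketch below assumes the summed powers per frequency band are given as two lists; the function name and the handling of an empty band are choices made for this example.

```python
def wiener_gain(speech_power, noise_power):
    """Per-band gain W_i = Px_i / (Px_i + Pn_i), from summed powers per band."""
    gains = []
    for px, pn in zip(speech_power, noise_power):
        total = px + pn
        gains.append(px / total if total > 0 else 0.0)
    return gains
```

A band with speech power 9 and noise power 1 keeps a gain of 0.9, a band with equal powers gets 0.5, and a band that contains only noise is removed completely.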

62

The Wiener filter

● Conditions for applying the filter
  – The average spectra of the signal and the noise are known
    ● In practice, this hardly holds for the signal
  – The signal and the noise have no correlation
● Meaning of the Wiener filter
  – Suppress frequency bands with large noise power
    ● On average, the powers of the frequency bands become identical to $E[X_i^2]$

$$W_i\,(X_i(t) + N_i(t)) = \frac{\sum_t |X_i(t)|^2}{\sum_t |X_i(t)|^2 + \sum_t |N_i(t)|^2} \cdot (X_i(t) + N_i(t))$$

63

Example of the Wiener filter

[Figure: top panel, spectra of speech and noise (horizontal axis 0 to 9000, vertical axis 0 to 300000); bottom panel, spectrum of the filter (horizontal axis 0 to 9000, vertical axis 0 to 1.2)]

64

Spectral subtraction

● Suppress noise by nonlinear processing
● The noise is assumed to be stationary
● Subtract the spectrum of the noise from the observed spectrum
● Definitions
  – Signal: $X_i(t)$
  – Noise: $N_i(t)$
  – Observed signal: $Y_i(t) = X_i(t) + N_i(t)$

65


● Principle of SS
  – Power spectrum of the observed signal

$$|Y_i(t)|^2 = |X_i(t) + N_i(t)|^2 \le |X_i(t)|^2 + 2\,|X_i(t)\, N_i(t)| + |N_i(t)|^2$$

  – The signal is assumed to have no correlation with the noise

$$|X_i(t)\, N_i(t)| \ll |X_i(t)|^2 + |N_i(t)|^2$$

$$|X_i(t)|^2 \approx |Y_i(t)|^2 - |N_i(t)|^2$$

  – The noise signal is assumed to be stationary

$$|N_i(t)|^2 = N_i^2$$

$$|X_i(t)|^2 \approx |Y_i(t)|^2 - N_i^2$$

66


● Estimation of the noise spectrum
  – The noise spectrum must be prepared beforehand
  – Estimated from the silent part before the voice
● To estimate the magnitude spectrum

$$\hat{X}_i(t) \approx \sqrt{\frac{|Y_i(t)|^2 - |N_i(t)|^2}{|Y_i(t)|^2}}\; Y_i(t)$$

67

Practical problem and its solution

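● The estimated power spectrum can become negative
  – Solution by flooring

$$|X_i(t)|^2 \approx \begin{cases} |Y_i(t)|^2 - N_i^2 & \text{if } |Y_i(t)|^2 > N_i^2 \\ \beta\,|Y_i(t)|^2 & \text{otherwise} \end{cases} \qquad 0 < \beta \ll 1$$

● Overestimation of the noise makes the quality better

$$|X_i(t)|^2 \approx \begin{cases} |Y_i(t)|^2 - \alpha N_i^2 & \text{if } |Y_i(t)|^2 > \alpha N_i^2 \\ \beta\,|Y_i(t)|^2 & \text{otherwise} \end{cases} \qquad \alpha > 1$$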
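Spectral subtraction with overestimation and flooring can be sketched for one frame of power spectra. The symbol names `alpha` (overestimation factor) and `beta` (flooring factor) and their default values are assumptions for this example; the second helper turns the estimated power into the amplitude gain that is applied to the observed spectrum.

```python
import math

def spectral_subtraction(observed_power, noise_power, alpha=2.0, beta=0.01):
    """Estimate the speech power per band: subtract alpha * noise, floor at beta * observed."""
    estimate = []
    for py, pn in zip(observed_power, noise_power):
        if py > alpha * pn:
            estimate.append(py - alpha * pn)
        else:
            estimate.append(beta * py)   # flooring keeps the power positive
    return estimate

def amplitude_gain(observed_power, estimated_power):
    """Gain sqrt(|X|^2 / |Y|^2) to multiply onto the observed spectrum Y."""
    return [math.sqrt(px / py) if py > 0 else 0.0
            for px, py in zip(estimated_power, observed_power)]
```

With `alpha = 2` and `beta = 0.01`, a band with observed power 10 and noise power 2 is estimated as 6, while a band with observed power 3 and noise power 2 falls below the threshold and is floored to 0.03.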

68

Examples (waveform)

[Figure: waveforms of speech, noise, speech with noise, and enhanced speech]

69

Examples (spectrogram)

[Figure: spectrograms of speech, noise, speech with noise, and enhanced speech]

70

Speech enhancement using multiple microphones

● Exploit speech signals recorded by more than one microphone
  – Spatial information can be used
    ● Record speech and noise individually
    ● Beam forming
● Various methods
  – Linear processing
    ● Delayed-sum array (superdirective microphone)
    ● Adaptive array
  – Nonlinear processing
    ● Multi-channel spectral subtraction

71

Delayed-sum array

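● Record the sound signal arriving from a specific angle using multiple microphones
  – The sound is assumed to be a plane wave

[Figure: plane wave arriving at angle $\theta$ at microphones spaced $d$ apart; the path difference between adjacent microphones is $d \sin\theta$]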

72

Delayed-sum array

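[Figure: plane wave arriving at angle $\theta$ at four microphones spaced $d$ apart, with path difference $d \sin\theta$ between adjacent microphones. The microphone signals are]

$$\sin\omega t, \quad \sin\omega\left(t - \frac{d\sin\theta}{c}\right), \quad \sin\omega\left(t - \frac{2d\sin\theta}{c}\right), \quad \sin\omega\left(t - \frac{3d\sin\theta}{c}\right)$$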

73

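Delayed-sum array

[Figure: the microphone signals pass through delays of $\frac{3d\sin\theta}{c}$, $\frac{2d\sin\theta}{c}$ and $\frac{d\sin\theta}{c}$, so that every channel carries the same signal]

$$\sin\omega\left(t - \frac{3d\sin\theta}{c}\right)$$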

74

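Delayed-sum array

[Figure: the delayed channels (delays $\frac{3d\sin\theta}{c}$, $\frac{2d\sin\theta}{c}$, $\frac{d\sin\theta}{c}$) are added, giving]

$$4 \sin\omega\left(t - \frac{3d\sin\theta}{c}\right)$$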

75

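Delayed-sum array

[Figure: a wave arriving from a different angle $\phi$ passes through the same delays (set for the angle $\theta$) and the same summation, giving]

$$\sum_{n=0}^{3} \sin\omega\left(t - \frac{n\,d\sin\phi + (3-n)\,d\sin\theta}{c}\right)$$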

76

Example

[Figure: response of the array as a function of the incident angle (rad), for n = 4 and d = 1; three curves for ω = 5, ω = 10 and ω = 50]
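The directivity of the delayed-sum array can be computed by adding the delayed channels as complex phasors instead of sines. This sketch uses the per-channel delay from the summation formula of the previous slide; the function name, the normalization by the number of microphones and the default `c = 1.0` are assumptions (the example slide does not state the sound speed it used).

```python
import cmath
import math

def array_response(phi, theta, omega, n_mics=4, d=1.0, c=1.0):
    """Normalized output amplitude for a wave from angle phi when the delays are set for theta."""
    total = 0j
    for n in range(n_mics):
        # propagation delay to microphone n plus the delay inserted for steering
        delay = (n * d * math.sin(phi) + (n_mics - 1 - n) * d * math.sin(theta)) / c
        total += cmath.exp(-1j * omega * delay)
    return abs(total) / n_mics
```

The response is 1 when `phi` equals `theta`, because all channels then add in phase, and it drops for other angles; evaluating it over a range of `phi` for several values of `omega` gives curves of the kind shown in the example figure.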

77

Property of the delayed-sum array

● Simple processing
  – Fast calculation
  – Easy hardware realization
● Directivity
  – The main lobe becomes narrower when
    ● More microphones are used
    ● The frequency becomes higher
    ● The microphone-to-microphone distance becomes wider
  – Spatial aliasing
    ● Condition for no aliasing: d < c/(2f)

78

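Adaptive noise suppression

● When the noise can be recorded by a microphone
● Linear filtering

[Figure: one microphone records speech + noise and another records the noise; signal processing combines the two and outputs the speech with the noise suppressed]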

79

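Adaptive noise suppression

● Processing by an adaptive filter
  – n(k) is not identical to the noise mixed into x(k), so we cannot subtract n(k) from x(k) directly
    → Use the filter W(z)
  – Update W(z) so that the power of the output signal becomes minimum

[Block diagram: the speech + noise input gives $x(k)$, in which the noise is shaped by G(z); the noise input gives $n(k)$; the filter W(z) produces $y(k)$ from $n(k)$; $y(k)$ is subtracted from $x(k)$ to give the output $e(k)$]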

80

Adaptive filter

● Realize W(z) as an FIR filter

$$y(k) = \sum_{i=1}^{p} w_i(k)\, n(k-i+1)$$

  – $w_i(k)$: i-th filter coefficient at time k

● Updating the coefficients using the LMS algorithm ($\mu$: step size)

$$w_i(k+1) = w_i(k) + 2\mu\, e(k)\, n(k-i+1)$$

(Many other algorithms have been developed)
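The LMS update can be tried on synthetic data. In this sketch the "speech" is a sine wave, the noise is uniform random, and the path from the noise source to the primary microphone is a two-tap filter; all of these, together with the step size `mu`, the filter length `p` and the function name, are assumptions made for the demonstration.

```python
import math
import random

def lms_noise_canceller(primary, reference, p=4, mu=0.002):
    """Subtract an adaptively filtered reference n(k) from the primary input x(k).

    Returns the output e(k) and the final filter coefficients w_1..w_p.
    """
    w = [0.0] * p
    errors = []
    for k in range(len(primary)):
        # taps n(k), n(k-1), ..., n(k-p+1); samples before the start are zero
        taps = [reference[k - i] if k - i >= 0 else 0.0 for i in range(p)]
        y = sum(wi * ni for wi, ni in zip(w, taps))
        e = primary[k] - y
        errors.append(e)
        w = [wi + 2 * mu * e * ni for wi, ni in zip(w, taps)]   # LMS update
    return errors, w

rng = random.Random(0)
length = 20000
noise = [rng.uniform(-1, 1) for _ in range(length)]
speech = [0.3 * math.sin(0.05 * k) for k in range(length)]
# assumed noise path: 0.8 n(k) - 0.3 n(k-1)
primary = [speech[k] + 0.8 * noise[k] - 0.3 * (noise[k - 1] if k > 0 else 0.0)
           for k in range(length)]
errors, w = lms_noise_canceller(primary, noise)
```

After adaptation the first two coefficients come close to the assumed path (0.8 and -0.3), and the output `errors` approaches the speech signal, since minimizing the output power removes only the part that is correlated with the reference noise.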

81

Subtractive array

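● Eliminate noise using a microphone array (when the direction of the noise is known)

[Figure: two microphones receive speech and noise, the noise arriving at angle $\theta_N$. One channel is delayed by $\frac{d\sin\theta_N}{c}$ and subtracted from the other]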

82

Adaptive subtractive array

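● When the direction of the noise is not known
  – Noise suppression using adaptive filters

[Figure: two microphones receive speech and noise; each channel passes through an adaptive filter, and one channel is subtracted from the other]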

83


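● Some constraints are needed for the filters
  – Without any constraints, the output becomes zero

[Figure: two microphones receive noise and speech; the channels pass through the filters $H_1$ and $H_2$, and one is subtracted from the other]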

84


● Adaptive filter: $H_i(\omega)$
● Transfer function from the speech source to the microphone: $G_i(\omega)$
● Constraint on the filters

$$F(\omega) = \sum_i G_i(\omega)\, H_i(\omega) = 1$$

● Example
  – Griffiths-Jim array

85

Griffiths-Jim array

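[Block diagram: the microphone signals are time-aligned by delays. Adding the channels gives the output of the delayed-sum array; subtracting channels from each other gives noise-only signals, which pass through the filters $H_1$ and $H_2$ and are subtracted from the delayed-sum output]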

86

2ch spectral subtraction

● Spectral subtraction using a microphone array
  – Estimate the noise spectrum by suppressing the target signal
  – Subtract the noise spectrum from the observed spectrum
● Features
  – Nonlinear processing
  – No need to prepare the noise spectrum beforehand
  – Effective for non-stationary noise

87

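2ch spectral subtraction

[Block diagram: the two microphone signals are time-aligned by delays. Their sum (output of the delayed-sum array) and their difference (noise only) are each converted to a power spectrum by |DFT|^2, and the noise power spectrum is subtracted from that of the sum with nonlinear processing (overestimation, flooring)]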
