
EE E6820: Speech & Audio Processing & Recognition

Lecture 5: Speech modeling

Dan Ellis <[email protected]>
Michael Mandel <[email protected]>

Columbia University Dept. of Electrical Engineering
http://www.ee.columbia.edu/~dpwe/e6820

February 19, 2009

1 Modeling speech signals
2 Spectral and cepstral models
3 Linear predictive models (LPC)
4 Other signal models
5 Speech synthesis



The speech signal

[Figure: spectrogram of an example utterance ('... watch thin as a dime ...'), annotated with phone labels]

Elements of the speech signal

spectral resonances (formants, moving)

periodic excitation (voicing, pitched) + pitch contour

noise excitation

transients (stop-release bursts)

amplitude modulation (nasals, approximants)

timing!


The source-filter model

Notional separation of

source: excitation, fine time-frequency structure

filter: resonance, broad spectral structure

[Block diagram: source (glottal pulse train with pitch control, frication noise, voiced/unvoiced switch) → filter (vocal tract resonances / formants) → radiation characteristic → speech]

More a modeling approach than a single model


Signal modeling

Signal models are a kind of representation
- to make some aspect explicit
- for efficiency
- for flexibility

Nature of model depends on goal
- classification: remove irrelevant details
- coding/transmission: remove perceptual irrelevance
- modification: isolate control parameters

But commonalities emerge
- perceptually irrelevant detail (coding) will also be irrelevant for classification
- modification domain will usually reflect 'independent' perceptual attributes
- getting at the abstract information in the signal


Different influences for signal models

Receiver
- see how the signal is treated by listeners
  → cochlea-style filterbank models ...

Transmitter (source)
- the physical vocal apparatus can generate only a limited range of signals ...
  → LPC models of vocal tract resonances

Making explicit particular aspects
- compact, separable correlates of resonances → cepstrum
- modeling prominent features of the NB spectrogram → sinusoid models
- addressing unnaturalness in synthesis → harmonic+noise model


Application of (speech) signal models

Classification / matching. Goal: highlight important information
- speech recognition (lexical content)
- speaker recognition (identity or class)
- other signal classification
- content-based retrieval

Coding / transmission / storage. Goal: represent just enough information
- real-time transmission, e.g. mobile phones
- archive storage, e.g. voicemail

Modification / synthesis. Goal: change certain parts independently
- speech synthesis / text-to-speech (change the words)
- speech transformation / disguise (change the speaker)



Spectral and cepstral models

Spectrogram seems like a good representation
- long history
- satisfying in use
- experts can 'read' the speech

What is the information?
- intensity in time-frequency cells
- typically 5 ms × 200 Hz × 50 dB

→ Discarded detail:
- phase
- fine-scale timing

The starting point for other representations


Short-time Fourier transform (STFT) as filterbank

View spectrogram rows as coming from separate bandpass filters

[Diagram: sound → bank of bandpass filters → one spectrogram row per filter]

Mathematically:

    X[k, n0] = Σ_n x[n] w[n − n0] exp(−j 2πk(n − n0)/N) = Σ_n x[n] h_k[n0 − n]

where h_k[n] = w[−n] exp(j 2πkn/N)

[Figure: impulse response h_k[n] is the time-reversed window w[−n] modulated to band k; frequency response H_k(e^jω) = W(e^{j(ω − 2πk/N)}), the window spectrum shifted to center frequency 2πk/N]
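As a concrete illustration of the formula above, a minimal numpy sketch that computes X[k, n0] frame by frame; the window length, hop, and FFT size in the usage line are assumed values, not parameters from the slides:

    import numpy as np

    def stft(x, win, hop, nfft):
        # each row k is the output of the band-k 'filter':
        # X[k, n0] = sum_n x[n] w[n - n0] exp(-j 2 pi k (n - n0) / N)
        frames = [np.fft.rfft(x[n0:n0 + len(win)] * win, nfft)
                  for n0 in range(0, len(x) - len(win) + 1, hop)]
        return np.array(frames).T          # rows = frequency bands, columns = time

    # e.g. S = stft(x, np.hanning(400), 160, 512)   # 25 ms window, 10 ms hop at 16 kHz
    #      spectrogram_db = 20 * np.log10(np.abs(S) + 1e-12)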


Spectral models: which bandpass filters?

Constant bandwidth? (analog / FFT)

But: cochlea physiology & critical bandwidths

→ implement ear models with bandpass filters & choose bandwidths by e.g. CB estimates

Auditory frequency scales
- constant 'Q' (center freq / bandwidth), mel, Bark, ...
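For instance, one common analytic form of the mel scale (the exact constants are an assumption here, not given on the slide) is mel = 2595 log10(1 + f/700); spacing filter centers evenly in mel gives roughly constant-Q behavior at high frequencies:

    import numpy as np

    def hz_to_mel(f):
        # a common mel mapping; constants 2595 and 700 are one standard choice
        return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

    # 26 band centers evenly spaced on the mel scale up to 8 kHz
    centers = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 26))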


Gammatone filterbank

Given bandwidths, which filter shapes?
- match inferred temporal integration window
- match inferred spectral shape (sharp high-frequency slope)
- keep it simple (since it's only approximate)

→ Gammatone filters
- 2N poles, 2 zeros, low complexity
- reasonable linear match to cochlea

    h[n] = n^(N−1) e^(−bn) cos(ω_i n)

[Figure: gammatone impulse response in time; magnitude response (dB) over 50-5000 Hz on a log frequency axis; z-plane pole-zero plot]
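A sketch of the impulse response above in numpy; tying the decay b to the ERB of the center frequency (Glasberg & Moore style constants) is an assumption, one common way to choose the bandwidth:

    import numpy as np

    def gammatone_ir(fc, fs, dur=0.025, order=4):
        # h[n] = n^(N-1) e^(-bn) cos(w_i n), sampled over 'dur' seconds
        t = np.arange(int(dur * fs)) / fs
        erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)   # assumed ERB formula for the bandwidth
        b = 2 * np.pi * 1.019 * erb               # decay rate tied to the ERB
        h = t ** (order - 1) * np.exp(-b * t) * np.cos(2 * np.pi * fc * t)
        return h / np.max(np.abs(h))              # rough peak normalization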


Constant-BW vs. cochlea model

Frequency responses and spectrograms:

[Figure, left: gain (dB) of the effective FFT filterbank vs. a gammatone filterbank, 0-8000 Hz. Right: FFT-based WB spectrogram (N=128) vs. a Q=4, 4-pole 2-zero cochlea model downsampled @ 64, magnitude smoothed over a 5-20 ms time window]


Limitations of spectral models

Not much data thrown away
- just fine phase / time structure (smoothing)
- little actual 'modeling'
- still a large representation

Little separation of features
- e.g. formants and pitch

Highly correlated features
- modifications affect multiple parameters

But, quite easy to reconstruct
- iterative reconstruction of lost phase


The cepstrum

Original motivation: assume a source-filter model:

[Diagram: excitation source g[n] (pulse train in time) driving resonance filter H(e^jω)]

Define 'homomorphic deconvolution':
- source-filter convolution: g[n] ∗ h[n]
- FT → product: G(e^jω) H(e^jω)
- log → sum: log G(e^jω) + log H(e^jω)
- IFT → separable fine + broad structure: c_g[n] + c_h[n] = deconvolution

Definition: real cepstrum c_n = idft(log |dft(x[n])|)
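The definition translates almost literally into numpy; this minimal sketch also lifters the cepstrum to recover the smooth spectral envelope (the cutoff n_keep = 30 is an arbitrary assumed value):

    import numpy as np

    def real_cepstrum(x):
        # c_n = idft(log |dft(x[n])|); small epsilon guards log(0)
        return np.fft.irfft(np.log(np.abs(np.fft.rfft(x)) + 1e-12))

    def smoothed_log_spectrum(x, n_keep=30):
        # lifter: keep only low quefrencies (broad resonance structure),
        # zeroing the middle symmetrically to preserve the even symmetry
        c = real_cepstrum(x)
        c[n_keep:len(c) - n_keep + 1] = 0.0
        return np.fft.rfft(c).real     # smoothed log-magnitude envelope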


Stages in cepstral deconvolution

Original waveform has excitation fine structure convolved with resonances

DFT shows harmonics modulated by resonances

Log DFT is the sum of a harmonic 'comb' and resonant bumps

IDFT separates out resonant bumps (low quefrency) and regular fine structure (the 'pitch pulse')

Selecting the low-n cepstrum separates resonance information (deconvolution / 'liftering')

[Figure: waveform and minimum-phase IR; abs(dft) and liftered version; log(abs(dft)) and liftered version (dB); real cepstrum and lifter, with the pitch pulse at high quefrency]


Properties of the cepstrum

Separate source (fine) from filter (broad structure)
- smooth the log magnitude spectrum to get resonances

Smoothing the spectrum is filtering along frequency, i.e. convolution applied in the Fourier domain → multiplication in the IFT domain ('liftering')

Periodicity in time → harmonics in spectrum → 'pitch pulse' in the high-n cepstrum

Low-n cepstral coefficients are the DCT of the broad filter / resonance shape:

    c_n = (1/2π) ∫ log|X(e^jω)| (cos nω + j sin nω) dω

where the j sin nω term integrates to zero because log|X(e^jω)| is even

[Figure: cepstral coefficients 0..5 and the corresponding 5th-order cepstral reconstruction of the spectral envelope]


Aside: correlation of elements

Cepstrum is popular in speech recognition
- feature vector elements are decorrelated

[Figure: auditory spectrum vs. cepstral coefficient features over ~150 frames; their covariance matrices (nearly diagonal for cepstra); example joint distribution of coefficients (10, 15)]

- c0 'normalizes out' average log energy

Decorrelated pdfs fit diagonal Gaussians
- simple correlation is a waste of parameters

DCT is close to PCA for (mel) spectra?
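This is the standard MFCC move; a minimal sketch, where log_mel is a hypothetical stand-in for a real log mel-filterbank frame:

    import numpy as np
    from scipy.fftpack import dct

    log_mel = np.random.randn(26)                  # stand-in for a log mel spectrum frame
    cepstra = dct(log_mel, type=2, norm='ortho')   # DCT-II approximately decorrelates
    mfcc = cepstra[:13]                            # keep low-n coefficients; c0 ~ log energy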



Linear predictive modeling (LPC)

LPC is a very successful speech model
- it is mathematically efficient (IIR filters)
- it is remarkably accurate for voice (fits the source-filter distinction)
- it has a satisfying physical interpretation (resonances)

Basic math: model output as a linear function of prior outputs:

    s[n] = ( Σ_{k=1}^{p} a_k s[n−k] ) + e[n]

... hence 'linear prediction' (pth order)
- e[n] is the excitation (input), AKA prediction error

    ⇒ S(z)/E(z) = 1 / (1 − Σ_{k=1}^{p} a_k z^−k) = 1/A(z)

... all-pole modeling, 'autoregressive' (AR) model


Vocal tract motivation for LPC

Direct expression of the source-filter model:

    s[n] = ( Σ_{k=1}^{p} a_k s[n−k] ) + e[n]

[Diagram: pulse/noise excitation e[n] → vocal tract filter H(z) = 1/A(z) → s[n]; z-plane pole plot and resonant magnitude response |H(e^jω)|]

Acoustic tube models suggest all-pole model for vocal tract

Relatively slowly changing
- update A(z) every 10-20 ms

Not perfect: nasals introduce zeros


Estimating LPC parameters

Minimize the short-time squared prediction error

    E = Σ_{n=1}^{m} e²[n] = Σ_n ( s[n] − Σ_{k=1}^{p} a_k s[n−k] )²

Differentiate w.r.t. a_k to get one equation for each k:

    0 = Σ_n 2 ( s[n] − Σ_{j=1}^{p} a_j s[n−j] ) (−s[n−k])

    Σ_n s[n] s[n−k] = Σ_j a_j Σ_n s[n−j] s[n−k]

    φ(0, k) = Σ_j a_j φ(j, k)

where φ(j, k) = Σ_{n=1}^{m} s[n−j] s[n−k] are correlation coefficients
- p linear equations to solve for all the a_j s ...


Evaluating parameters

Linear equations: φ(0, k) = Σ_{j=1}^{p} a_j φ(j, k)

If s[n] is assumed to be zero outside of some window,

    φ(j, k) = Σ_n s[n−j] s[n−k] = r_ss(|j − k|)

- r_ss(τ) is the autocorrelation

Hence the equations become:

    [ r(1) ]   [ r(0)    r(1)    ...  r(p−1) ] [ a_1 ]
    [ r(2) ] = [ r(1)    r(0)    ...  r(p−2) ] [ a_2 ]
    [ ...  ]   [ ...     ...          ...    ] [ ... ]
    [ r(p) ]   [ r(p−1)  r(p−2)  ...  r(0)   ] [ a_p ]

Toeplitz matrix (equal antidiagonals)
→ can use the Durbin recursion to solve
(solve the full φ(j, k) case via Cholesky)
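A minimal sketch of the autocorrelation method with the Levinson-Durbin recursion, in the slide's convention s[n] ≈ Σ a_k s[n−k]; it assumes a pre-windowed frame and a non-silent signal (r(0) > 0):

    import numpy as np

    def lpc_durbin(s, p):
        # autocorrelations r(0)..r(p) of the (windowed) frame
        r = np.array([np.dot(s[:len(s) - k], s[k:]) for k in range(p + 1)])
        a = np.zeros(p + 1)       # a[1..p] are the predictor coefficients
        E = r[0]                  # prediction-error energy
        for i in range(1, p + 1):
            k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / E   # reflection coefficient
            prev = a.copy()
            a[i] = k
            a[1:i] = prev[1:i] - k * prev[i - 1:0:-1]
            E *= 1.0 - k * k      # error energy shrinks at each order
        return a[1:], E

    # e.g. a, err = lpc_durbin(np.hanning(400) * frame, p=12)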


LPC illustration

[Figure: windowed original waveform and LPC residual in time (0-400 samples); original spectrum, LPC spectrum, and residual spectrum in dB over 0-7000 Hz; actual poles on the z-plane]


Interpreting LPC

Picking out resonances
- if the signal really was a source + all-pole resonances, LPC should find the resonances

Least-squares fit to spectrum
- minimizing e²[n] in the time domain is the same as minimizing |E(e^jω)|² by Parseval
→ close fit to spectral peaks; valleys don't matter

Removing smooth variation in spectrum
- 1/A(z) is a low-order approximation to S(z)
- S(z)/E(z) = 1/A(z), hence the residual E(z) = A(z) S(z) is a 'flat' version of S

Signal whitening
- white noise (independent x[n]s) has a flat spectrum
→ whitening removes temporal correlation


Alternative LPC representations

Many alternate p-dimensional representations
- coefficients {a_j}
- roots {λ_j}: Π_j (1 − λ_j z^−1) = 1 − Σ_j a_j z^−j
- line spectrum frequencies ...
- reflection coefficients {k_j} from the lattice form
- tube model log area ratios g_j = log((1 − k_j) / (1 + k_j))

Choice depends on:
- mathematical convenience / complexity
- quantization sensitivity
- ease of guaranteeing stability
- what is made explicit
- distributions as statistics
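For example, the root representation falls out of numpy directly; pole angles and radii give candidate resonance frequencies and bandwidths (a and fs are assumed inputs):

    import numpy as np

    poles = np.roots(np.concatenate(([1.0], -a)))      # roots of 1 - sum_k a_k z^{-k}
    poles = poles[poles.imag > 0]                      # keep one of each conjugate pair
    freqs = np.angle(poles) * fs / (2 * np.pi)         # resonance frequencies / Hz
    bandwidths = -np.log(np.abs(poles)) * fs / np.pi   # ~3 dB bandwidths / Hz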


LPC applications

Analysis-synthesis (coding, transmission)
- S(z) = E(z)/A(z), hence can reconstruct by filtering e[n] with the {a_j}s
- whitened, decorrelated, minimized e[n]s are easy to quantize
  ... or can model e[n] e.g. as a simple pulse train

Recognition / classification
- LPC fit responds to spectral peaks (formants)
- can use for recognition (convert to cepstra?)

Modification
- separating source and filter supports cross-synthesis
- pole / resonance model supports 'warping', e.g. male → female


Aside: Formant tracking

Formants carry (most?) linguistic information

Why not classify them → speech recognition?
- e.g. local maxima in the cepstrally-liftered spectrum, or pole frequencies in the LPC fit

But: recognition needs to work in all circumstances
- formants can be obscured or undefined

[Figure: spectrograms (0-4000 Hz) of the original utterance (mpgr1_sx419) and of a noise-excited LPC resynthesis, with pole frequencies overlaid]

→ need more graceful, robust parameters



Sinusoid modeling

Early signal models required low complexity

e.g. LPC

Advances in hardware open new possibilities ...

NB spectrogram suggests harmonics model

[Narrowband spectrogram, 0-4000 Hz over ~1.5 s, showing harmonic tracks]

- 'important' info in the 2D surface is a set of tracks?
- harmonic tracks have ~smooth properties
- straightforward resynthesis


Sine wave models

Model sound as a sum of AM/FM sinusoids

    s[n] = Σ_{k=1}^{N[n]} A_k[n] cos(n ω_k[n] + φ_k[n])

- A_k, ω_k, φ_k piecewise linear or constant
- can enforce harmonicity: ω_k = k ω_0

Extract parameters directly from STFT frames:

[Figure: sinusoid peaks tracked across STFT frames in time-frequency-magnitude]

- find local maxima of |S[k, n]| along frequency
- track birth/death and correspondence of peaks


Finding sinusoid peaks

Look for local maxima along a DFT frame, i.e. |S[k−1, n]| < |S[k, n]| > |S[k+1, n]|

Want the exact frequency of the implied sinusoid
- the DFT is normally quantized quite coarsely, e.g. 4000 Hz / 256 bands = 15.6 Hz/band

[Figure: quadratic fit to 3 spectral samples around a local maximum yields interpolated frequency and magnitude]

- may also need interpolated, unwrapped phase

Or, use the differential of phase along time (phase vocoder):

    ω = (a ḃ − b ȧ) / (a² + b²), where S[k, n] = a + jb
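A sketch of the quadratic (parabolic) refinement around a local maximum, fitting the three log-magnitude samples at bins k−1, k, k+1:

    import numpy as np

    def refine_peak(mag_db, k, fs, nfft):
        # parabola through (k-1, a), (k, b), (k+1, c); its vertex gives the peak
        a, b, c = mag_db[k - 1], mag_db[k], mag_db[k + 1]
        delta = 0.5 * (a - c) / (a - 2.0 * b + c)   # vertex offset in bins, |delta| <= 0.5
        freq = (k + delta) * fs / nfft              # interpolated frequency / Hz
        amp = b - 0.25 * (a - c) * delta            # interpolated magnitude / dB
        return freq, amp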


Sinewave modeling applications

Modification (interpolation) and synthesis
- connecting arbitrary ω and φ at frame boundaries requires cubic phase interpolation (because ω = φ̇, the derivative of phase)
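A sketch of the McAulay-Quatieri style construction: a cubic phase track matching phase and frequency at both frame ends, with the 2πM ambiguity resolved for maximal smoothness (frequencies in rad/s, frame length T in seconds):

    import numpy as np

    def cubic_phase(phi0, w0, phi1, w1, T):
        # choose the unwrapping integer M giving the smoothest interpolation
        M = np.round(((phi0 + w0 * T - phi1) + (w1 - w0) * T / 2.0) / (2.0 * np.pi))
        d = phi1 + 2.0 * np.pi * M - phi0 - w0 * T
        alpha = 3.0 * d / T**2 - (w1 - w0) / T
        beta = -2.0 * d / T**3 + (w1 - w0) / T**2
        # phase track with phi(0)=phi0, phi'(0)=w0, phi(T)=phi1 (mod 2 pi), phi'(T)=w1
        return lambda t: phi0 + w0 * t + alpha * t**2 + beta * t**3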

Types of modification
- time and frequency scale modification, with or without changing the formant envelope
- concatenation / smoothing of boundaries
- phase realignment (for crest reduction)

Non-harmonic signals? OK-ish

[Spectrogram, 0-4000 Hz, of a non-harmonic example]


Harmonics + noise model

Motivation to improve the sinusoid model
- problems with analysis of real (noisy) signals
- problems with synthesis quality (esp. noise)
- perceptual suspicions

Model:

    s[n] = Σ_{k=1}^{N[n]} A_k[n] cos(n k ω_0[n])  [harmonics]  +  e[n] (h_n[n] ∗ b[n])  [noise]

- sinusoids are forced to be harmonic
- the remainder is filtered and time-shaped noise

'Break frequency' F_m[n] between harmonics and noise

[Figure: spectrum (dB) over 0-3000 Hz with harmonics below the harmonicity limit F_m[n] and noise above]
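A toy sketch of one synthesis frame under this model; every parameter name here (f0, amps, the noise-shaping impulse response, the temporal envelope) is illustrative, not from the slides:

    import numpy as np

    def hnm_frame(f0, amps, fm, noise_ir, env, fs, n):
        # harmonics of f0 up to the break frequency fm, plus shaped noise
        t = np.arange(n) / fs
        kmax = min(int(fm // f0), len(amps))
        harm = sum(amps[k] * np.cos(2.0 * np.pi * (k + 1) * f0 * t)
                   for k in range(kmax))
        noise = env * np.convolve(np.random.randn(n), noise_ir, mode='same')
        return harm + noise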


HNM analysis and synthesis

Dynamically adjust F_m[n] based on a 'harmonic test':

[Spectrogram, 0-4000 Hz, with the time-varying break frequency F_m[n] overlaid]

Noise has envelopes in time (e[n]) and frequency (H_n)

[Figure: noise spectral envelope H_n[k] (dB vs. 0-3000 Hz) and temporal envelope e[n] over a ~30 ms window]

reconstruct bursts / synchronize to pitch pulses



Speech synthesis

One thing you can do with models

Synthesis easier than recognition?
- listeners do the work
- ... but listeners are very critical

Overview of synthesis

text → Text normalization → Phoneme generation → Prosody generation → Synthesis algorithm → speech

- normalization disambiguates text (abbreviations)
- phonetic realization from a pronunciation dictionary
- prosodic synthesis by rule (timing, pitch contour)

... all control waveform generation


Source-filter synthesis

Flexibility of source-filter model is ideal for speech synthesis

[Block diagram: glottal pulse source (pitch info) and noise source, gated voiced/unvoiced, drive a vocal tract filter (phoneme info) to produce speech, e.g. for the phone string 'th ax k ae t']

Excitation source issues

voiced / unvoiced / mixture ([th] etc.)

pitch cycles of voiced segments

glottal pulse shape → voice quality?


Vocal tract modeling

Simplest idea: store a single VT model for each phoneme

[Diagram: one spectral template per phone (th ax k ae t) abutted in time]

but discontinuities are very unnatural

Improve by smoothing between templates

[Diagram: the same templates smoothed into each other across phone boundaries]

the trick is finding the right domain


Cepstrum-based synthesis

Low-n cepstrum is compact model of target spectrum

Can invert to get actual VT impulse response waveforms:

    c_n = idft(log |dft(x[n])|)  ⇒  h[n] = idft(exp(dft(c_n)))

All-zero (FIR) VT response

→ can pre-convolve with glottal pulses
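A sketch of the inversion; folding the (even) real cepstrum onto positive quefrencies before exponentiating is one standard way to realize h[n] = idft(exp(dft(c_n))) as a causal, minimum-phase impulse response:

    import numpy as np

    def minphase_ir(c_low, nfft=512):
        # fold the even real cepstrum into a one-sided ('complex') cepstrum
        c = np.zeros(nfft)
        c[0] = c_low[0]
        c[1:len(c_low)] = 2.0 * c_low[1:]
        H = np.exp(np.fft.fft(c))        # envelope spectrum with minimum-phase phase
        return np.fft.ifft(H).real       # causal impulse response h[n]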

[Diagram: glottal pulse inventory; templates for ee / ae / ah pre-convolved with glottal pulses and placed at pitch pulse times (from the pitch contour)]

- cross-fading between templates is OK


LPC-based synthesis

Very compact representation of target spectra
- 3 or 4 pole pairs per template

Low-order IIR filter → very efficient synthesis

How to interpolate?
- cannot just interpolate a_j in a running filter
- but the lattice filter has better-behaved interpolation

[Diagram: direct-form IIR synthesis filter with coefficients a_1, a_2, a_3, ... vs. all-pole lattice filter with reflection coefficients k_0 ... k_{p−1}, each mapping excitation e[n] to speech s[n]]

What to use for excitation?
- residual from the original analysis
- reconstructed periodic pulse train
- parametrized residual resynthesis


Diphone synthesis

Problems in phone-concatenation synthesis
- phonemes are context-dependent
- coarticulation is complex
- transitions are critical to perception

→ store transitions instead of just phonemes

[Figure: phone string with diphone segments spanning each phone-to-phone transition]

- ∼40 phones ⇒ ∼800 diphones
- or even more context with a larger database

How to splice diphones together?
- TD-PSOLA: align pitch pulses and cross-fade
- MBROLA: normalized multiband


HNM synthesis

High-quality resynthesis of real diphone units + a parametric representation for modification
- pitch, timing modifications
- removal of discontinuities at boundaries

Synthesis procedure
- linguistic processing gives phones, pitch, timing
- database search gives best-matching units
- use HNM to fine-tune pitch and timing
- cross-fade A_k and ω_0 parameters at boundaries


Careful preparation of the database is key
- sine models allow phase alignment of all units
- a larger database improves unit match


Generating prosody

The real factor limiting speech synthesis?

Waveform synthesizers have inputs for
- intensity (stress)
- duration (phrasing)
- fundamental frequency (pitch)

Curves produced by superposition of (many) inferred linguistic rules
- phrase-final lengthening, unstressed shortening, ...

Or learn rules from transcribed elements


Summary

Range of models
- spectral, cepstral
- LPC, sinusoid, HNM

Range of applications
- general spectral shape (filterbank) → ASR
- precise description (LPC + residual) → coding
- pitch, time modification (HNM) → synthesis

Issues
- performance vs. computational complexity
- generality vs. accuracy
- representation size vs. quality

Parting thought

not all parameters are created equal ...


References

Alan V. Oppenheim. Speech analysis-synthesis system based on homomorphic filtering. The Journal of the Acoustical Society of America, 45(1):309-309, 1969.

J. Makhoul. Linear prediction: A tutorial review. Proceedings of the IEEE, 63(4):561-580, 1975.

Bishnu S. Atal and Suzanne L. Hanauer. Speech analysis and synthesis by linear prediction of the speech wave. The Journal of the Acoustical Society of America, 50(2B):637-655, 1971.

J. E. Markel and A. H. Gray. Linear Prediction of Speech. Springer-Verlag, Secaucus, NJ, USA, 1982.

R. McAulay and T. Quatieri. Speech analysis/synthesis based on a sinusoidal representation. IEEE Transactions on Acoustics, Speech, and Signal Processing, 34(4):744-754, 1986.

Wael Hamza, Ellen Eide, Raimo Bakis, Michael Picheny, and John Pitrelli. The IBM expressive speech synthesis system. In Proc. INTERSPEECH, pages 2577-2580, October 2004.
