signal modeling for robust speech recognition with frequency warping and convex optimization

Signal Modeling for Robust Speech Recognition With Frequency Warping and Convex Optimization

Yoon Kim

March 8, 2000

Outline

• Introduction and Motivation

• Speech Analysis with Frequency Warping

• Speaker Normalization with Convex Optimization

• Experimental Results

• Conclusions

Problem Definition

• Devise effective and robust features for speech recognition that are insensitive to mismatches in individual speaker acoustics and environment

• How can we process the signal so that the acoustic mismatch is minimized?

Robust Signal Modeling

• Feature Extraction – Derives a compact, yet effective representation

• Feature Normalization – Compensates for the acoustic mismatch between the training and testing conditions

Part I: Feature Extraction for Speech Recognition

Cepstral Analysis of Speech

• Most popular choice for speech recognition

• Cepstrum is defined as the inverse Fourier transform of the log spectrum

• Truncated to length L (smoothes log spectrum)

$$c(n) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \log S(e^{j\omega})\, e^{j\omega n}\, d\omega, \qquad n = 0, 1, \ldots, L-1$$
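The definition above can be approximated numerically with an FFT. The following is a minimal sketch, assuming NumPy, a magnitude spectrum inside the logarithm, and a hypothetical FFT size `n_fft`; none of these choices come from the slides.

```python
import numpy as np

def real_cepstrum(frame, L, n_fft=512):
    """Truncated real cepstrum: inverse FFT of the log magnitude spectrum."""
    spectrum = np.fft.fft(frame, n_fft)
    # Assumption: a small floor keeps log() finite for spectral zeros.
    log_mag = np.log(np.abs(spectrum) + 1e-12)
    cepstrum = np.fft.ifft(log_mag).real
    return cepstrum[:L]  # keep the first L coefficients (smooths the log spectrum)
```

For example, `real_cepstrum(0.5 ** np.arange(64), 4)` approximates the cepstrum of a one-pole filter with its pole at 0.5.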

FFT-Based Feature Extraction

• Perceptually motivated FFT filterbank is used to emulate the auditory system

• Analysis is directly affected by fine harmonics

• Examples

– Mel Frequency Cepstral Analysis

– Perceptual Linear Prediction (PLP)

LP-Based Feature Extraction

• Linear prediction provides a smooth spectrum mostly containing vocal-tract information

• Frequency warping is not straightforward

• Examples

– Frequency-Warped Linear Prediction

– Time-domain Warped Linear Prediction

Part I: Non-uniform Linear Predictive Analysis of Speech

Basic Ideas of the NLP Analysis

• Frequency warping of the vocal-tract spectrum using non-uniform DFT (NDFT)

• Bark-frequency scale is used for warping

• Pre- and post-warp linear prediction smoothing

Bark Bilinear Transform

• Bark Bilinear Transform

$$A_\rho(z) = \frac{z - \rho}{1 - \rho z}$$

• For an appropriately chosen ρ, the mapping closely resembles a Bark mapping

Figure: Bark-Frequency Warping

[Plot for fs = 31 kHz, rho = 0.70777. Curves: optimal fit using BBT, and Bark frequency warping. x-axis: linear frequency (rad/pi); y-axis: warped frequency (rad/pi).]
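The warping curve in the figure can be reproduced as the phase response of a first-order all-pass map. This is a minimal sketch that assumes the convention A(z) = (z - ρ)/(1 - ρz), mapping linear to warped frequency. The sign convention is an assumption, since it does not survive in the extracted slides; only the value of ρ is taken from the figure.

```python
import math

RHO = 0.70777  # value quoted on the figure for fs = 31 kHz

def warp_frequency(omega, rho=RHO):
    """Warped frequency (rad) for a linear frequency omega in [0, pi].

    Assumed convention: phase of (z - rho) / (1 - rho z) at z = exp(j omega).
    """
    return omega + 2.0 * math.atan2(rho * math.sin(omega),
                                    1.0 - rho * math.cos(omega))
```

With a positive ρ, low frequencies are stretched and high frequencies compressed, which is the qualitative shape of a Bark scale.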

Pre-Warp Linear Prediction

• Vocal-tract transfer function H(z) can be represented by an all-pole model

$$H(z) = \frac{G}{A(z)} = \frac{G}{1 + \sum_{k=1}^{p} a_k z^{-k}}$$

NDFT Frequency Warping

• NDFT of the vocal-tract impulse response

• ωk : Frequency grid of Bark bilinear transform

$$\tilde{A}(k) = \sum_{n=0}^{p} a(n)\, e^{-j\omega_k n}, \qquad k = 0, 1, \ldots, M-1$$

$$a(n) = [1, a_1, a_2, \ldots, a_p]$$
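A non-uniform DFT of the LP coefficient sequence can be sketched as below. The grid construction is an assumption: M points uniformly spaced on the warped axis are mapped back to linear frequencies with the inverse of an all-pass map with parameter `rho`. The helper names `bark_grid` and `ndft` are hypothetical.

```python
import numpy as np

def bark_grid(M, rho):
    """Linear-frequency grid omega_k whose warped image is uniform on [0, 2*pi)."""
    theta = 2.0 * np.pi * np.arange(M) / M
    return theta - 2.0 * np.arctan(rho * np.sin(theta) / (1.0 + rho * np.cos(theta)))

def ndft(a, omega):
    """Evaluate sum_n a(n) exp(-j * omega_k * n) at each grid frequency."""
    n = np.arange(len(a))
    return np.exp(-1j * np.outer(omega, n)) @ np.asarray(a, dtype=float)
```

With `rho = 0` the grid is uniform and `ndft` reduces to an ordinary DFT, which is a convenient sanity check.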

Post-Warp Linear Prediction

• Take the IDFT of the power spectrum to get the warped autocorrelation coefficients

• Durbin recursion to get new LP coefficients

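$$\tilde{P}(k) = |\tilde{H}(k)|^2 = \frac{G^2}{|\tilde{A}(k)|^2}, \qquad k = 0, 1, \ldots, M-1$$

$$\tilde{r}(n) = \frac{1}{M}\sum_{k=0}^{M-1} \tilde{P}(k)\, e^{j 2\pi k n / M}, \qquad n = 0, 1, \ldots, q$$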
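The two post-warp steps can be sketched as an inverse FFT followed by a textbook Levinson-Durbin recursion. This is an illustrative sketch assuming NumPy and the sign convention A(z) = 1 + Σ a_k z^{-k}; the function names are hypothetical.

```python
import numpy as np

def warped_autocorrelation(power, q):
    """r(n), n = 0..q, as the inverse DFT of M power-spectrum samples."""
    return np.fft.ifft(power).real[: q + 1]

def levinson_durbin(r, order):
    """Solve the LP normal equations; returns ([1, a_1..a_order], residual energy)."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for m in range(1, order + 1):
        # Correlation of the current predictor with the next lag.
        acc = r[m] + np.dot(a[1:m], r[m - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        prev = a.copy()
        a[1:m] = prev[1:m] + k * prev[m - 1:0:-1]
        a[m] = k
        err *= 1.0 - k * k
    return a, err
```

Feeding in the power spectrum of a known all-pole filter should return that filter's coefficients, which makes the routine easy to check.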

Conversion to Cepstrum

• Convert warped LP parameters to a set of L cepstral parameters via recursion

$$c(0) = \ln G^2$$

$$c(n) = -\tilde{a}(n) - \frac{1}{n}\sum_{k=1}^{n-1} k\, c(k)\, \tilde{a}(n-k), \qquad n = 1, \ldots, L-1$$
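The LP-to-cepstrum recursion can be sketched in a few lines of plain Python. The sketch assumes the convention A(z) = 1 + Σ a_k z^{-k} (hence the minus signs) and treats coefficients beyond the LP order as zero; `lp_to_cepstrum` is a hypothetical name.

```python
import math

def lp_to_cepstrum(a, gain, L):
    """Cepstrum c(0..L-1) from a = [1, a_1, ..., a_p] and the LP gain G."""
    p = len(a) - 1
    c = [0.0] * L
    c[0] = math.log(gain ** 2)
    for n in range(1, L):
        a_n = a[n] if n <= p else 0.0       # a(n) = 0 beyond the LP order
        acc = sum(k * c[k] * a[n - k] for k in range(1, n) if n - k <= p)
        c[n] = -a_n - acc / n
    return c
```

For a single pole at 0.5, `lp_to_cepstrum([1.0, -0.5], 1.0, 4)` gives coefficients 0.5^n / n for n ≥ 1, which matches the power series of -ln(1 - 0.5 z^{-1}).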

NDFT Warping: Vowel /u/

[Figure: original LP power spectrum and NLP power spectrum. x-axis: normalized frequency (rad/pi); y-axis: log magnitude.]

Clustering Measures

• Derive meaningful measures to assess how well the feature clusters of each class (vowel) can be separated and discriminated

• Three measures were considered

– Determinant measure

– Trace measure

– Inverse trace measure

Scatter Matrices

• SW: Within-class scatter matrix

• SB : Between-class scatter matrix

• ST : Total scatter matrix

$$S_W = \sum_{i=1}^{c} \sum_{x \in \text{class } i} (x - m_i)(x - m_i)^T$$

$$S_B = \sum_{i=1}^{c} n_i\, (m_i - m)(m_i - m)^T$$

$$S_T = S_W + S_B$$

Determinant Measure

$$J_{\det} = \frac{\det(W^T S_B W)}{\det(W^T S_W W)}$$

• Ratio of the between-class and within-class scattering volumes

• The larger the value, the better the clustering

Trace Measure

$$J_{\mathrm{tr}} = \mathrm{Tr}(S_W^{-1} S_B) = \sum_{i=1}^{c-1} \lambda_i(S_W^{-1} S_B)$$

• Ratio of the sum of scattering radii of between-class and within-class scatter

• The larger the value, the better the clustering

Inverse Trace Measure

$$J_{\mathrm{inv}} = \mathrm{Tr}(S_T^{-1} S_W) = \sum_{i=1}^{c-1} \frac{1}{1 + \lambda_i}$$

• Sum of within-class scattering radii normalized by the total scatter

• The smaller the value, the better the clustering
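The scatter matrices and the two trace-based measures can be computed as in the sketch below. It assumes NumPy, features stored as rows of `X`, and integer class labels; the function names are hypothetical. The determinant measure is omitted because it needs the projection matrix W, which the slides do not specify.

```python
import numpy as np

def scatter_matrices(X, labels):
    """Return (S_W, S_B, S_T) for row-vector features X and class labels."""
    m = X.mean(axis=0)
    d = X.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for cls in np.unique(labels):
        X_i = X[labels == cls]
        m_i = X_i.mean(axis=0)
        dev = X_i - m_i
        S_W += dev.T @ dev                              # within-class scatter
        S_B += len(X_i) * np.outer(m_i - m, m_i - m)    # between-class scatter
    return S_W, S_B, S_W + S_B

def trace_measures(S_W, S_B, n_classes):
    """J_tr and J_inv from the c-1 largest eigenvalues of S_W^{-1} S_B."""
    lam = np.linalg.eigvals(np.linalg.solve(S_W, S_B)).real
    lam = np.sort(lam)[::-1][: n_classes - 1]
    return lam.sum(), (1.0 / (1.0 + lam)).sum()
```

Tighter, better-separated clusters should raise J_tr and lower J_inv, matching the "larger is better" and "smaller is better" readings above.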

Vowel Clustering Performance

• We compared the values of the scattering measures discussed to assess the clustering performance of the NLP cepstrum

• Mel, PLP and LP techniques were also tested for comparison

Steady-State Vowel Database

• Eleven steady-state English vowels from 23 speakers (12 male, 9 female, 2 children)

• Sampling rate: 10 kHz

• Each speaker provided 6 frames of steady-state vowel segments

Results: Vowel Clustering

Method   Jdet      Jtr     Jinv
LP       7.01e-9   12.04   6.89
PLP      7.08e-7   12.36   6.73
Mel      2.37e-6   13.58   6.75
NLP      4.30e-5   14.72   6.49

2-D Vowel Clusters: /a/ /i/ /o/

[Figure: 2-D cluster plots of /a/, /i/, /o/ for each feature type. LPC: Jtr = 20.29; PLP: Jtr = 22.72; Mel: Jtr = 25.06; NLP: Jtr = 28.82.]

2-D Vowel Clusters: /a/ /e/ /i/

[Figure: 2-D cluster plots of /a/, /e/, /i/ for each feature type. LPC: Jtr = 8.72; PLP: Jtr = 9.77; Mel: Jtr = 10.31; NLP: Jtr = 12.45.]

Part II: Feature Normalization for Speaker Acoustics Matching

Speech Recognition Problem

• Given a sequence of acoustic feature vectors X extracted from speech, find the most likely word string that could have been uttered

$$\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(W)\, P(X \mid W)$$

HMM Acoustic Model

• Hidden Markov Models (HMMs): Each phone unit is modeled as a sequence of hidden states

• Speech dynamics modeled as transitions from one state to another

• Each state has a feature probability distribution

• Goal: Guess the underlying state sequence (phone string) from the observable features
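Recovering the most likely state sequence is commonly done with the Viterbi algorithm. The slides do not name the decoder, so the following is only an illustrative sketch in plain Python, working in the log domain with hypothetical argument names.

```python
def viterbi(log_init, log_trans, log_obs):
    """Most likely state path; log_obs[t][s] = log P(feature_t | state s)."""
    n_states = len(log_init)
    score = [log_init[s] + log_obs[0][s] for s in range(n_states)]
    back = []
    for t in range(1, len(log_obs)):
        new_score, pointers = [], []
        for s in range(n_states):
            # Best predecessor of state s at time t.
            best = max(range(n_states), key=lambda r: score[r] + log_trans[r][s])
            pointers.append(best)
            new_score.append(score[best] + log_trans[best][s] + log_obs[t][s])
        score = new_score
        back.append(pointers)
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):     # trace the back-pointers to time 0
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

A left-to-right model, such as a digit model, is expressed by setting the log-probability of backward transitions to `float("-inf")`.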

Example: HMM Word Model

Digit: “one”

pause /w/ /Λ/ /n/ pause

1 2 3 4 5

Why Speaker Normalization?

• Most speech recognition systems use statistical models trained using a large database with the hope that the testing conditions will be similar

• Acoustic mismatches between the speakers used in training and testing result in unacceptable degradation of recognition performance

Prior Work in Speaker Normalization

• Normalization usually refers to modification of the features to fit a statistical model

• Vocal-tract length normalization (VTLN)

– Attempts to alter the resonant frequencies of the vocal tract by warping the frequency axis

– Linear warping

– All-pass warping (bilinear transform)

Prior Work: Speaker Adaptation

• Adaptation usually refers to modification of the model parameters to fit the data

• Maximum Likelihood Bias: $\tilde{\mu} = \mu + v$

• ML Linear Regression (MLLR): $\tilde{\mu} = A\mu$, $A \in \mathbb{R}^{L \times L}$

($\tilde{\mu}$: transformed mean, $\mu$: original mean)

Part II: Speaker Normalization with Maximum-Likelihood Affine Cepstral Filtering

Linear Cepstral Filtering (LCF)

• We propose the following linear, Toeplitz transformation of the cepstral feature vectors

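$$\tilde{c} = Hc, \qquad H = \begin{bmatrix} h_0 & 0 & \cdots & 0 \\ h_1 & h_0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ h_{L-1} & h_{L-2} & \cdots & h_0 \end{bmatrix} \in \mathbb{R}^{L \times L}$$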

Linear Cepstral Filtering (LCF)

• H represents the linear cepstral transformation for normalizing speaker acoustics.

• The matrix operation corresponds to

– Convolution in the cepstral domain

– Log spectral filtering in the frequency domain

$$\tilde{c}(n) = \sum_{k=0}^{L-1} c(k)\, h(n-k)$$

$$\tilde{S}(\omega) = H(\omega)\, S(\omega)$$

Maximum-Likelihood Estimation

• Find the optimal normalization H such that the transformed features yield maximum likelihood with respect to a given model Λ

• Only L parameters to estimate (instead of L²)

$$\hat{H} = \arg\max_{H} P(\tilde{c} \mid \Lambda) = \arg\max_{h} P(\tilde{c} \mid \Lambda)$$

$$\tilde{c} = Hc, \qquad h = [h_0, h_1, \ldots, h_{L-1}]^T$$

Commutative Property of LCF

• Due to the commutative property of the convolution, the transformed cepstrum can also be expressed as a linear function of the filter h

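$$\tilde{c} = Hc = Ch, \qquad C = \begin{bmatrix} c_0 & 0 & \cdots & 0 \\ c_1 & c_0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ c_{L-1} & c_{L-2} & \cdots & c_0 \end{bmatrix}$$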
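The commutative property is easy to confirm numerically. The sketch below assumes NumPy; `lower_toeplitz` is a hypothetical helper that builds the lower-triangular Toeplitz matrix from its first column.

```python
import numpy as np

def lower_toeplitz(v):
    """L x L lower-triangular Toeplitz matrix whose first column is v."""
    L = len(v)
    T = np.zeros((L, L))
    for i in range(L):
        T[i, : i + 1] = v[i::-1]    # row i holds v[i], v[i-1], ..., v[0]
    return T

# Example: filtering c with h equals "filtering" h with c.
c = np.array([1.0, 0.5, -0.2])
h = np.array([0.9, 0.1, 0.0])
same = np.allclose(lower_toeplitz(h) @ c, lower_toeplitz(c) @ h)
```

Both products equal the first L samples of the linear convolution of `c` and `h`, which is why the likelihood can be written as a function of the L filter taps alone.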

Solution: Single Gaussian Case

• Let $c^{(i)}$ be the i-th feature vector of the data (i = 0, …, N-1)

• Let the distribution corresponding to $c^{(i)}$ be Gaussian with mean $\mu_i$ and covariance $\Sigma_i$

• Total log-likelihood of the transformed feature data set is a concave, quadratic function of the filter h

$$\sum_{i=0}^{N-1} \log P(\tilde{c}^{(i)} \mid \Lambda) = -\frac{1}{2}\sum_{i=0}^{N-1} (C^{(i)}h - \mu_i)^T\, \Sigma_i^{-1}\, (C^{(i)}h - \mu_i) + \text{const}$$

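Solution: Single Gaussian Case

• Since the negative of the log-likelihood is convex in h, there exists a unique ML solution h*

$$h^* = \arg\min_{h} \|Ah - b\|^2 = (A^T A)^{-1} A^T b$$

$$A = \begin{bmatrix} \Sigma_0^{-1/2} C^{(0)} \\ \Sigma_1^{-1/2} C^{(1)} \\ \vdots \\ \Sigma_{N-1}^{-1/2} C^{(N-1)} \end{bmatrix}, \qquad b = \begin{bmatrix} \Sigma_0^{-1/2} \mu_0 \\ \Sigma_1^{-1/2} \mu_1 \\ \vdots \\ \Sigma_{N-1}^{-1/2} \mu_{N-1} \end{bmatrix}$$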
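The closed-form solution amounts to a whitened least-squares problem. The following is a minimal sketch, assuming NumPy and assuming that each frame has already been paired with the Gaussian (mean, covariance) it is scored against. All function names are hypothetical, and `np.linalg.lstsq` replaces the explicit normal-equation inverse for numerical stability.

```python
import numpy as np

def lower_toeplitz(v):
    """L x L lower-triangular Toeplitz matrix whose first column is v."""
    L = len(v)
    T = np.zeros((L, L))
    for i in range(L):
        T[i, : i + 1] = v[i::-1]
    return T

def inv_sqrt(cov):
    """Symmetric inverse square root of a positive-definite covariance."""
    vals, vecs = np.linalg.eigh(cov)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def ml_cepstral_filter(frames, means, covs):
    """Stack the whitened blocks into A and b, then solve min ||A h - b||^2."""
    rows, rhs = [], []
    for c, mu, cov in zip(frames, means, covs):
        W = inv_sqrt(cov)
        rows.append(W @ lower_toeplitz(c))
        rhs.append(W @ mu)
    h, *_ = np.linalg.lstsq(np.vstack(rows), np.concatenate(rhs), rcond=None)
    return h
```

If the target means were generated by filtering the frames with some known filter, the estimate should recover that filter exactly.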

Case: Gaussian Mixture

$$P(x; \Lambda) = \sum_{i=1}^{M} w_i\, N(x; \mu_i, \Sigma_i), \qquad w_i \ge 0, \quad \sum_{i=1}^{M} w_i = 1$$

• Log-likelihood is no longer a concave function of h

• Approximation: We use the single Gaussian density for ML filter estimation

• Past studies support the validity of this approximation

Case: Log-Concave PDFs

• For any distribution that is log-concave, ML estimation can be posed as a convex problem

• Examples

– Laplace: p(x) = (1/(2a)) exp(-|x|/a)

– Uniform: p(x) = 1/(2a) on [-a, a]

– Rayleigh: p(x) = (2x/b) exp(-x²/b), x > 0

Affine Cepstral Filtering (ACF)

• We can extend the linear transformation to an affine form by adding a cepstral bias term v

• Bias models channel and other additive effects

• Joint optimization of filter and bias leads to a more flexible transformation of the cepstral space

$$\tilde{c} = Hc + v \quad\Longleftrightarrow\quad \tilde{S}(\omega) = H(\omega)\, S(\omega) + V(\omega)$$

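Solution: Affine Transformation

• By combining the filter h and bias v into an augmented design vector x, the joint ML solution can be easily attained by extending the linear case

$$x^* = \arg\min_{x} \|Ax - b\|^2, \qquad x = \begin{bmatrix} h \\ v \end{bmatrix}$$

$$A = \begin{bmatrix} \Sigma_0^{-1/2} C^{(0)} & \Sigma_0^{-1/2} \\ \Sigma_1^{-1/2} C^{(1)} & \Sigma_1^{-1/2} \\ \vdots & \vdots \\ \Sigma_{N-1}^{-1/2} C^{(N-1)} & \Sigma_{N-1}^{-1/2} \end{bmatrix}, \qquad b = \begin{bmatrix} \Sigma_0^{-1/2} \mu_0 \\ \Sigma_1^{-1/2} \mu_1 \\ \vdots \\ \Sigma_{N-1}^{-1/2} \mu_{N-1} \end{bmatrix}$$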
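The affine case only widens each block of A with the whitening matrix itself. This is a self-contained sketch under the same assumptions as the linear case (NumPy, one Gaussian per frame, hypothetical function names).

```python
import numpy as np

def lower_toeplitz(v):
    """L x L lower-triangular Toeplitz matrix whose first column is v."""
    L = len(v)
    T = np.zeros((L, L))
    for i in range(L):
        T[i, : i + 1] = v[i::-1]
    return T

def inv_sqrt(cov):
    """Symmetric inverse square root of a positive-definite covariance."""
    vals, vecs = np.linalg.eigh(cov)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def ml_affine_cepstral_filter(frames, means, covs):
    """Jointly estimate filter h and bias v from the augmented vector x = [h; v]."""
    L = len(frames[0])
    rows, rhs = [], []
    for c, mu, cov in zip(frames, means, covs):
        W = inv_sqrt(cov)
        rows.append(np.hstack([W @ lower_toeplitz(c), W]))  # block [W C, W]
        rhs.append(W @ mu)
    x, *_ = np.linalg.lstsq(np.vstack(rows), np.concatenate(rhs), rcond=None)
    return x[:L], x[L:]
```

The first L entries of the solution are the filter taps and the remaining L entries are the cepstral bias.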

Example: Vowel /ah/, No Warping, No Normalization

[Figure: log spectra of the reference speaker and the test speaker versus normalized frequency.]

Vowel /ah/: With NLP Warping, No Normalization

[Figure: log spectrum versus normalized frequency.]


Vowel /ah/: With NLP Warping and LCF Normalization

[Figure: log spectrum versus normalized frequency.]


Example: Vowel /oh/, No Warping, No Normalization

[Figure: log spectrum versus normalized frequency.]


Vowel /oh/: With NLP Warping, No Normalization

[Figure: log spectrum versus normalized frequency.]


Vowel /oh/: With NLP Warping and LCF Normalization

[Figure: log spectrum versus normalized frequency.]


Normalization in Training

• For each speaker in the training database, ML filter and bias vectors are estimated using the unnormalized model Λ

• ML transformation is applied to the feature vectors for each speaker

• A normalized Gaussian-mixture model $\tilde{\Lambda}$ is trained using the normalized features

Normalization in Recognition

• Given a set of enrollment data, normalization parameters are estimated for each speaker

• We apply the speaker-dependent mapping to subsequent data from the speaker

• Transformation can be regarded as a statistical “spectral equalizer” applied to each speaker to optimally fit the normalized model

Frame-Based Vowel Recognition

• Same vowel database used for clustering (23 speakers, 11 steady-state vowels)

• 4 speakers in the test set provided a total of 18 frames; 6 frames were used for estimation

• LP, PLP, Mel, and NLP features considered

• Recognition performance: Error rate (%)

Results: Vowel Recognition

Method   Baseline (No Norm)   Diag MLLR   ML Bias   ML LCF
LP       35.7                 45.5        30.0      26.4
PLP      32.0                 32.7        30.6      27.4
Mel      32.8                 30.1        24.6      20.5
NLP      25.9                 26.3        21.7      19.2
Avg.     31.6                 33.6        26.7      23.4

Summary: Vowel Recognition

[Bar chart: average error rate (%). Baseline 31.6, Diag MLLR 33.6, ML Bias 26.7, ML LCF 23.4.]

HMM Digit Recognition

• TIDIGITS corpus: 326 speakers providing 77 digit sequences in a quiet environment

• Digits: 1-9, “zero” and “oh”

• 8-state HMM for each digit

• The number of Gaussians per state was varied from 1 to 15, and the best result was selected

Case: Adult Data on Adult Model

• HMM for each digit was trained with data from 112 adult speakers (55 male, 57 female)

• Another set of 113 adult speakers was used for testing (56 male, 57 female)

• One utterance per digit was used for estimating the normalization parameters for each speaker

Digit Results: Adult on Adult

Method   Baseline   Linear   Affine   GD model
Mel      2.0        1.7      1.5      1.9
NLP      1.9        1.6      1.3      1.9
Avg      2.0        1.6      1.4      1.9

Case: Child Data on Adult Model

• Case of severe mismatch between the training and testing speaker acoustics

• Model: 112 adults

• Test set: 100 children (50 boys, 50 girls)

Digit Results: Child on Adult

Method   Baseline Model   Linear Norm.   Affine Norm.   Child Model
Mel      23.1             18.7           15.5           4.5
NLP      19.5             18.2           15.6           2.3
Avg.     21.3             18.5           15.6           3.4

Summary: Digit Recognition

[Bar charts: error rate (%). Adult on Adult: Baseline 2.0, Linear 1.6, Affine 1.4. Child on Adult: Baseline 21.3, Linear 18.5, Affine 15.6.]

Conclusions

• Speaker normalization was achieved using NLP frequency warping and ML affine cepstral filtering

• A unified framework for optimizing the matrix and bias parameters was presented using simple convex programming

• Proposed signal modeling techniques gave considerable boost in recognition performance, even for severely mismatched conditions

Future Research

• Compensation of noise and channel mismatches

• Joint optimization of frequency warping and affine transform parameters

• Investigation of other optimality criteria for stochastic matching
