
Hynek Bořil


Attributes and Recognition of Lombard Speech

Center for Robust Speech Systems, Erik Jonsson School of Engineering and Computer Science

University of Texas at Dallas

© Hynek Bořil, [email protected] Workshop on Speech in Adverse Conditions, S2S, September 8-12, 2008

Contents

Introduction
  What is Lombard Effect?
  Why is Lombard Effect Interesting?
  Goals and Motivation of the Study

Data Acquisition

Neutral/LE Speech Analysis

Equalization of LE in ASR
  Acoustic Model Adaptation
  Voice Conversion
  Data-Driven Design of Robust Features
  Frequency Warping
  Two-Stage Recognition System

Summary



Objective



What is Lombard Effect?

When exposed to a noisy adverse environment, speakers modify the way they speak in an effort to maintain intelligible communication (Lombard Effect, LE)

Why is Lombard Effect Interesting?

Better understanding of the mechanisms of human speech communication (Can we intentionally change particular parameters of speech production to improve intelligibility, or is LE an automatic process learned through the public loop? How do the type of noise and the communication scenario affect LE?)

Mathematical modeling of LE → classification of LE level, speech synthesis in noisy environments, increasing robustness of automatic speech recognition and speaker identification systems

Study Objective and Motivation – LE Analysis


Ambiguity in Past LE Investigations

LE has been studied since 1911; however, many investigations disagree on the observed impacts of LE on speech production

Analyses were typically conducted on very limited data – a couple of utterances from a few subjects (1–10)

Lack of communication factor – many studies ignore the importance of communication for evoking LE (an effort to convey a message over noise) → the occurrence and level of LE in speech recordings are 'random' → contradicting analysis results

LE was studied only for several world languages (English, Spanish, French, Japanese, Korean, Mandarin Chinese); no comprehensive study for any of the Slavic languages

1st Goal

Design of a Czech Lombard Speech Database addressing the need for a communication factor and well-defined simulated noisy conditions

Systematic analysis of LE in spoken Czech


Study Objective and Motivation – ASR under LE



ASR under LE

Mismatch between LE speech accompanied by noise and acoustic models trained on clean neutral speech

The strong impact of noise on ASR is well known, and a vast number of noise suppression/speech emphasis algorithms have been proposed in the last decades (yet no ultimate solution has been reached)

The negative impact of LE on ASR often exceeds that of noise; recent state-of-the-art ASR systems mostly ignore this issue

LE-Equalization Methods

LE-equalization algorithms typically operate in the following domains: robust features, LE transformation towards neutral, model adjustments, improved training of acoustic models

The algorithms display various degrees of efficiency and are often bound by strong assumptions that prevent their real-world application (applying fixed transformations to phonetic groups, known level of LE, etc.)

2nd Goal

Proposal of novel LE-equalization techniques with a focus on both the level of LE suppression and the extent of bounding assumptions

Data Acquisition - Motivation


LE Corpora – Issues (1)

Tradeoff between realism and control of the phenomena of interest: (Murray and Arnott, 1993) for elicited emotional speech: “A trade-off exists between realism and measurement accuracy of emotions generated by speakers in the laboratory (questionable realism, but verbal content and recording conditions controllable) and field recordings (real emotions, but content and recording conditions less controllable).”

Databases recorded in real adverse conditions (e.g., car environment):

Limited or no control over the level and characteristics of background noise

Low SNRs of the recordings → difficult to perform reliable speech analysis

Importance of the communication factor often completely ignored – e.g., SPEECON (Iskra et al., 2002)




LE Corpora – Issues (2)

Special LE databases – simulated noisy conditions

Successfully address the control of noise and SNR

In studies on speech production, authors sometimes employ a communication factor: (Korn, 1954), (Webster and Klumpp, 1962) – repeating words; (Patel and Schell, 2008) – interactive game

Studies on ASR/Speaker ID under LE – the importance of the communication factor largely ignored – (Junqua, 1993), SUSAS (Hansen and Ghazale, 1997); exception (Junqua et al., 1998) – communication with a dialing machine

Limited number of subjects and utterances – ranging typically from ten speakers, (Webster and Klumpp, 1962), (Lane et al., 1970), (Junqua, 1993), to one or two speakers, (Summers et al., 1988), (Pisoni et al., 1985), (Bond et al., 1989), (Tian et al., 2003), (Garnier et al., 2006)




Available Czech Corpora

Czech SPEECON – speech recordings from various environments including office and car

CZKCC – car recordings – include parked car with engine off and moving car scenarios

Both databases contain speech produced in quiet and in noise → candidates for the study of LE; however, not good ones, as shown later

Design/acquisition of an LE-oriented database – Czech Lombard Speech Database '05 (CLSD'05)

Goals:

- Communication in simulated noisy background → high SNR

- Phonetically rich data/extensive small vocabulary material

- Parallel utterances in neutral and LE conditions


Introducing Communication in Recording


Data Acquisition

Simulated Noisy Conditions

Noise samples mixed with speech feedback and reproduced to the speaker and operator via headphones

Operator qualifies intelligibility of speech in noise – if the utterance is not intelligible, the operator asks the subject to repeat it → speakers are required to convey the message over noise → communication LE

Noises: mostly car noises from the Car2E database, normalized to 90 dB SPL

Speaker Sessions

14 male/12 female speakers

Each subject recorded both in neutral and simulated noisy conditions


[Figure: recording setup – SPEAKER with close-talk and middle-talk microphones, receiving noise + speech feedback; H&T RECORDER; OPERATOR with noise + speech monitor, responding "OK – next" / "BAD – again"]

Speaker Session Contents


Data Acquisition

Corpus contents                                                      Corpus/item id.      Number
Phonetically rich sentences                                          S01 – 30             30
Phonetically rich words                                              W01 – 05             5
Isolated digits                                                      CI1 – I4, 30 – 69    44
Isolated digit sequences (8 digits)                                  CB1 – B2, 00 – 29    32
Connected digit sequences (5 digits)                                 CC1 – 4, C70 – 99    34
Natural numbers                                                      CN1 – N3             3
Money amount                                                         CM1                  1
Time phrases; T1 – analogue, T2 – digital                            CT1 – T2             2
Dates: D1 – analogue, D2 – relative and general date, D3 – digital   CD1 – D3             3
Proper name                                                          CP1                  1
City or street names                                                 CO1 – O2             2
Questions                                                            CQ1 – Q2             2
Special keyboard characters                                          CK1 – K2             2
Core word synonyms                                                   Y01 – 95
Basic IVR commands                                                   101 – 85
Directory navigation                                                 201 – 40
Editing                                                              301 – 22
Output control                                                       401 – 57
Messaging & Internet browsing                                        501 – 70
Organizer functions                                                  601 – 33
Routing                                                              701 – 39
Automotive                                                           801 – 12
Audio & Video                                                        901 – 95
                                                                                          89

IVR – interactive voice response

Sound Attenuation by Headphones


Environmental Sound Attenuation by Headphones

Attenuation characteristics measured on a dummy head

Source of wide-band noise; measurement of sound transfer to the dummy head’s auditory canals with and without headphones

Attenuation characteristics – difference of the two transfers

Environmental Sound Attenuation by Headphones

Directional attenuation – reflectionless sound booth

Real attenuation in recording room

[Figure: Attenuation by headphones. Panels: attenuation (dB) vs. frequency (Hz) and angle (°); attenuation (dB) vs. frequency (Hz), 100–10000 Hz, for 0°, 90°, 180° and the recording room; attenuation (dB) vs. angle (°) at 1 kHz, 2 kHz, 4 kHz and 8 kHz]

Closed vs. Open-Air Headphones


Open-Air Headphones

+ Easier to reach a flat frequency response than in closed headphones

+ Lower attenuation of sound coming from outside the headset

- High level of cross-talk from headphones to the close-talk microphone → contamination of recorded speech by noise reproduced in the headphones

[Figure: CLSD'05 – SNR distributions; number of utterances vs. SNR (dB), -10 to 60 dB, for Close-talk N, Hands-free N, Close-talk LE, Hands-free LE]

Parameters of Neutral and LE Speech

Speech Features affected by LE

[Figure: discrete-time model of speech production – impulse train generator I(z) (pitch period), glottal pulse model G(z), random noise generator N(z), gains A_V and A_N, voiced/unvoiced switch, vocal tract model V(z) (vocal tract parameters), radiation model R(z); signals u_G(n), p_L(n)]

$$V(z) = \frac{G}{1 - \sum_{k=1}^{N} \alpha_k z^{-k}}$$

Vocal tract excitation: glottal pulse shape changes, fundamental frequency rises

Vocal tract transfer function: center frequencies of low formants increase, formant bandwidths reduce

Vocal effort (intensity) increase

Other: voiced phonemes prolonged, energy ratio in voiced/unvoiced increases,…

Analysis: Fundamental Frequency

[Figure: distributions of fundamental frequency; number of samples vs. fundamental frequency (Hz), 70–570 Hz. Czech SPEECON: Office F, Car F, Office M, Car M. CZKCC: Eng off F, Eng on F, Eng off M, Eng on M. CLSD'05: Neutral F, LE F, Neutral M, LE M]

Analysis: Formant Center Frequencies

[Figure: formants, F2 (Hz) vs. F1 (Hz), vowels /a/, /e/, /i/, /o/, /u/ and their primed counterparts /a'/, /e'/, /i'/, /o'/, /u'/. Panels: CZKCC female digits (Female_N, Female_LE), CZKCC male digits (Male_N, Male_LE), CLSD'05 female digits (Female_N, Female_LE), CLSD'05 male digits (Male_N, Male_LE)]



Analysis: Formant Bandwidths

CZKCC

Vowel   B1M (Hz)   σ1M (Hz)   B1M (Hz)   σ1M (Hz)   B1F (Hz)   σ1F (Hz)   B1F (Hz)   σ1F (Hz)
/a/     207*       74         210*       84         275        97         299        78
/e/     125*       70         130*       78         156        68         186        79
/i/     124*       49         127*       44         105        44         136        53
/o/     275        87         222        67         263*       85         269*       73
/u/     187        100        170        89         174*       96         187*       101

CLSD'05

Vowel   B1M (Hz)   σ1M (Hz)   B1M (Hz)   σ1M (Hz)   B1F (Hz)   σ1F (Hz)   B1F (Hz)   σ1F (Hz)
/a/     269        88         152        59         232        85         171        68
/e/     168        94         99         44         169        73         130        49
/i/     125        53         108        52         132*       52         133*       58
/o/     239        88         157        81         246        91         158        62
/u/     134*       67         142*       81         209        95         148        66

SPEECON, CZKCC: no consistent BW changes

CLSD'05: significant BW reduction in many voiced phonemes


Analysis: Phoneme Durations



CZKCC

Word     Phoneme   # OFF   TOFF (s)   σTOFF (s)   # ON   TON (s)   σTON (s)   Δ (%)
Nula     /a/       349     0.147      0.079       326    0.259     0.289      48.50
Jedna    /a/       269     0.173      0.076       251    0.241     0.238      39.36
Dva      /a/       245     0.228      0.075       255    0.314     0.311      38.04
Štiri    /r/       16      0.045      0.027       68     0.080     0.014      78.72
Sedm     /e/       78      0.099      0.038       66     0.172     0.142      72.58

CLSD'05

Word     Phoneme   # N     TN (s)     σTN (s)     # LE   TLE (s)   σTLE (s)   Δ (%)
Jedna    /e/       583     0.031      0.014       939    0.082     0.086      161.35
Dvje     /e/       586     0.087      0.055       976    0.196     0.120      126.98
Čtiri    /r/       35      0.041      0.020       241    0.089     0.079      115.92
Pjet     /e/       555     0.056      0.033       909    0.154     0.089      173.71
Sedm     /e/       358     0.080      0.038       583    0.179     0.136      122.46
Osm      /o/       310     0.086      0.027       305    0.203     0.159      135.25
Devjet   /e/       609     0.043      0.022       932    0.120     0.088      177.20

Significant increase in duration in some phonemes, especially voiced phonemes

Some unvoiced consonants – duration reduction

Duration changes in CLSD’05 considerably exceed the ones in SPEECON and CZKCC

Word Durations



CZKCC

Word    # OFF   TOFF (s)   σTOFF (s)   # ON   TON (s)   σTON (s)   Δ (%)
Nula    349     0.475      0.117       326    0.560     0.345      17.82
Jedna   269     0.559      0.136       251    0.607     0.263      8.58
Dva     245     0.426      0.106       255    0.483     0.325      13.57

CLSD'05

Word    # N     TN (s)     σTN (s)     # LE   TLE (s)   σTLE (s)   Δ (%)
Nula    497     0.397      0.109       802    0.476     0.157      19.87
Jedna   583     0.441      0.128       939    0.527     0.165      19.56
Dvje    586     0.365      0.114       976    0.423     0.138      15.87

Word duration variations typically did not exceed 20 %


HMM-Based Automatic Speech Recognition (ASR)


Typical HMM Recognizer

[Figure: speech signal → feature extraction (MFCC/PLP) → acoustic model, sub-word likelihoods (GMM/MLP) → decoder (Viterbi), with lexicon (HMM) and language model (bigrams) → estimated word sequence]

Feature extraction – transformation of the time-domain acoustic signal into a representation more convenient for the ASR engine: data dimensionality reduction, suppression of irrelevant (disturbing) signal components (speaker/environment/recording chain-dependent characteristics), preserving phonetic content

Sub-word models – Gaussian Mixture Models (GMMs) – mixtures of Gaussians used to model the distribution of feature vector parameters; Multi-Layer Perceptrons (MLPs) – artificial neural networks

Mel Frequency Cepstral Coefficients

Davis & Mermelstein, IEEE Trans. Acoustics, Speech, and Signal Processing, 1980

Mermelstein was born in Czechoslovakia

MFCC is the first choice in current commercial ASR

When used in HMM ASR, MFCC may incorporate several redundant stages – historical reasons (in the past, distance-based measures were used in speech decoding, with different requirements on cepstral coefficients than in HMM systems)

[Figure: MFCC – s(n) → preemphasis → window (Hamming) → |FFT|² → filter bank (mel) → log(·) → IDCT → c(n)]

Perceptual Linear Predictive Coefficients

Hermansky, Journal of Acoustical Society of America, 1990

Hermansky was born in Czechoslovakia

Linear prediction – smoothing of the spectral envelope (may improve robustness)

PLP is a frequent choice in research labs – IDIAP, ICSI Berkeley, LIMSI…

[Figure: PLP – s(n) → window (Hamming) → |FFT|² → filter bank (Bark) → equal-loudness preemphasis → intensity-to-loudness conversion (cube root) → linear prediction → recursion to cepstrum → c(n)]

Initial ASR Experiment


Equalization of LE in ASR

ASR Evaluation – WER (Word Error Rate)

$$\mathrm{WER} = \frac{D + S + I}{N} \cdot 100\ \%$$

S – word substitutions

I – word insertions

D – word deletions

N – number of words in the reference transcription

Digit Recognizer

Monophone HMM models

13 MFCC + ∆ + ∆∆

32 Gaussian mixtures per model state

Corpus          Set        # Spkrs   # Digits   WER (%)
Czech SPEECON   Office F   22        880        5.5 (4.0–7.0)
Czech SPEECON   Office M   31        1219       4.3 (3.1–5.4)
Czech SPEECON   Car F      28        1101       4.6 (3.4–5.9)
Czech SPEECON   Car M      42        1657       10.5 (9.0–12.0)
CZKCC           OFF F      30        1480       3.0 (2.1–3.8)
CZKCC           OFF M      30        1323       2.3 (1.5–3.1)
CZKCC           ON F       18        1439       13.5 (11.7–15.2)
CZKCC           ON M       21        1450       10.4 (8.8–12.0)
CLSD'05         N F        12        4930       7.3 (6.6–8.0)
CLSD'05         N M        14        1423       3.8 (2.8–4.8)
CLSD'05         LE F       12        5360       42.8 (41.5–44.1)
CLSD'05         LE M       14        6303       16.3 (15.4–17.2)

Noisy car recordings (SPEECON/CZKCC – 10.7/12.6 dB SNR)

Clean recordings (LE – 40.9 dB SNR)
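The WER measure used throughout the ASR experiments counts substitutions, insertions and deletions against the number of reference words. A minimal sketch of how it could be computed from a minimum edit distance alignment follows; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration, not the scoring tool used in the study.

```python
def word_error_rate(reference, hypothesis):
    """Return (WER in %, substitutions, deletions, insertions).

    Hypothetical helper: aligns two whitespace-tokenized strings with a
    minimum edit distance and reports 100 * (D + S + I) / N, where N is
    the number of reference words. When several alignments are equally
    cheap, the S/D/I split reported is just one of them.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # Each cell holds (total errors, S, D, I) for ref[:i] vs. hyp[:j].
    table = [[(j, 0, 0, j) for j in range(len(hyp) + 1)]]
    for i in range(1, len(ref) + 1):
        row = [(i, 0, i, 0)]
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                row.append(table[i - 1][j - 1])  # match, no new error
                continue
            e, s, d, n = table[i - 1][j - 1]
            substitution = (e + 1, s + 1, d, n)
            e, s, d, n = table[i - 1][j]
            deletion = (e + 1, s, d + 1, n)
            e, s, d, n = row[j - 1]
            insertion = (e + 1, s, d, n + 1)
            row.append(min(substitution, deletion, insertion))
        table.append(row)
    errors, s, d, n = table[-1][-1]
    return 100.0 * errors / len(ref), s, d, n
```

For a four-digit reference recognized with one substitution and one deletion, the sketch reports a WER of 50 %.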




Acoustic Model Adaptation


Model Adaptation

Often effective when only limited data from given conditions are available

Maximum Likelihood Linear Regression (MLLR) – if the amount of data per class is limited, acoustically close classes are grouped and transformed together

$$\boldsymbol{\mu}'_{MLLR} = \mathbf{A}\boldsymbol{\mu} + \mathbf{b}$$

Maximum a posteriori approach (MAP) – initial models are used as informative priors for the adaptation

$$\boldsymbol{\mu}'_{MAP} = \frac{N}{N + \tau}\,\bar{\boldsymbol{\mu}} + \frac{\tau}{N + \tau}\,\boldsymbol{\mu}$$

Adaptation Procedure

Framework provided by Technical University of Liberec

First, neutral speaker-independent (SI) models are transformed by MLLR, employing clustering (binary regression tree)

Second, MAP adaptation – only for nodes with a sufficient amount of adaptation data
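The two mean updates can be illustrated with a small numerical sketch. It assumes a single Gaussian with hard frame assignment; the weight `tau`, the function names and the toy values are hypothetical and do not describe the Liberec framework.

```python
import numpy as np

def mllr_transform_mean(A, b, mean):
    """MLLR-style affine update of a Gaussian mean: A @ mean + b."""
    return np.asarray(A, dtype=float) @ np.asarray(mean, dtype=float) + np.asarray(b, dtype=float)

def map_adapt_mean(prior_mean, frames, tau=10.0):
    """MAP update of a Gaussian mean from adaptation frames.

    Assumption: all frames belong to this Gaussian (hard assignment).
    tau is a hypothetical prior weight; a larger tau keeps the result
    closer to the prior mean when little adaptation data is available.
    """
    frames = np.atleast_2d(np.asarray(frames, dtype=float))
    n = frames.shape[0]
    sample_mean = frames.mean(axis=0)
    prior_mean = np.asarray(prior_mean, dtype=float)
    return (n * sample_mean + tau * prior_mean) / (n + tau)
```

With ten adaptation frames and `tau=10.0`, the adapted mean lies halfway between the prior mean and the sample mean, which shows why MAP is only useful for nodes with enough adaptation data.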


Adaptation Schemes

Speaker-independent adaptation (SI) – group dependent/independent

Speaker-dependent adaptation (SD) – to neutral/LE

[Figure: Model adaptation to conditions and speakers; WER (%) for baseline digits LE, adapted digits LE, baseline sentences LE, adapted sentences LE; schemes: SI adapt to LE (same spkrs), SI adapt to LE (disjunct spkrs), SD adapt to neutral, SD adapt to LE]

Voice Conversion


Technique transforming speech from a source speaker towards a target speaker

Voice conversion typically transforms both excitation and vocal tract parameters → promising for LE → neutral transformation

Idea: x_n – speech samples from the source speaker, y_n – target speaker; the goal is to find a conversion function F minimizing the mean square error

$$\varepsilon_{mse} = E\left[\left\| \mathbf{y}_n - F(\mathbf{x}_n) \right\|^2\right]$$

Voice conversion framework provided by Siemens AG

GMM-based text-dependent voice conversion

Parallel utterances required for transformation model training

Fundamental frequency transform:

$$G(F_{0x}) = \mu_{F_{0y}} + \frac{\sigma_{F_{0y}}}{\sigma_{F_{0x}}}\left(F_{0x} - \mu_{F_{0x}}\right)$$

Vocal tract transfer function transform:

$$F_V(\mathbf{x}) = \sum_{m=1}^{M} p_m(\mathbf{x}) \left[ \boldsymbol{\mu}_m^{y} + \boldsymbol{\Sigma}_m^{yx} \left(\boldsymbol{\Sigma}_m^{xx}\right)^{-1} \left(\mathbf{x} - \boldsymbol{\mu}_m^{x}\right) \right]$$
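As an illustration of the fundamental frequency transform, the sketch below matches the mean and standard deviation of a source F0 track to target statistics. Treating frames with F0 = 0 as unvoiced and leaving them untouched is an assumption of this sketch, and the function name `convert_f0` is hypothetical; it is not the Siemens implementation.

```python
import numpy as np

def convert_f0(f0_source, src_mean, src_std, tgt_mean, tgt_std):
    """Linear F0 conversion matching source statistics to target ones.

    Assumption: unvoiced frames are marked by F0 = 0 and are passed
    through unchanged; only voiced frames are shifted and scaled.
    """
    f0 = np.asarray(f0_source, dtype=float)
    voiced = f0 > 0
    converted = np.zeros_like(f0)
    converted[voiced] = tgt_mean + (tgt_std / src_std) * (f0[voiced] - src_mean)
    return converted
```

For example, with LE statistics (mean 250 Hz, standard deviation 50 Hz) and neutral statistics (mean 200 Hz, standard deviation 25 Hz), all values chosen purely for illustration, an LE frame at 300 Hz is mapped to 225 Hz.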


Transformation of Fundamental Frequency

CLE – conversion of both excitation and vocal tract parameters

CLEF0 – only excitation converted, vocal tract parameters preserved

[Figure: distributions of fundamental frequency; number of samples vs. fundamental frequency (Hz), 0–600 Hz. Panels: CLSD'05 female sentences (Female N, Female LE, Female CLE, Female CLEF0) and CLSD'05 male sentences (Male N, Male LE, Male CLE, Male CLEF0)]


Transformation of Formants

[Figure: formants, F2 (Hz) vs. F1 (Hz), vowels /a/, /e/, /i/, /o/, /u/ with primed and double-primed counterparts. Panels: CLSD'05 female digits (Female N, Female LE, Female CLE, Female CLEF0) and CLSD'05 male digits (Male N, Male LE, Male CLE, Male CLEF0)]


Voice Conversion in ASR Front-End

[Figure: WER (%) for Neutral, LE, CLE and CLEF0 speech. Panels: Sentences – LVCSR (Males – no LM, Males – LM, Females – no LM, Females – LM) and Digits (Males, Females)]

Effectiveness of Voice Conversion in ASR Task

Partially successful in the digits task

Fails in the LVCSR task – classes to be recognized are too close in acoustic space; any inaccuracy of VC results in ASR deterioration

Listening tests – converted speech samples contain strong artifacts; at times the speech becomes unintelligible for human listeners




Filter Bank Approach

Analysis of the importance of frequency components for ASR

Repartitioning the filter bank (FB) to emphasize components carrying phonetic information and suppress disturbing components

Initial FB uniformly distributed on a linear scale – equal attention to all components

Consecutively, a single FB band is omitted → impact on WER?

Omitting bands carrying more information will result in a considerable WER increase

Implementation

MFCC front-end, mel scale replaced by a linear scale, triangular filters replaced by rectangular filters without overlap
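The band-omission experiment can be sketched as follows: build a uniform bank of non-overlapping rectangular filters on a linear scale and drop one band before the cepstral analysis. The number of FFT bins, the 4 kHz upper limit and the function name are assumptions for illustration; band indices here are 0-based.

```python
import numpy as np

def rectangular_filter_bank(n_bands=20, n_fft_bins=129, f_max=4000.0, omit=None):
    """Uniform, non-overlapping rectangular filter bank on a linear scale.

    Returns a (bands x FFT bins) matrix of 0/1 weights. Every FFT bin is
    assigned to exactly one band; `omit` (0-based index, hypothetical
    parameter) removes one band to probe its importance for recognition.
    """
    freqs = np.linspace(0.0, f_max, n_fft_bins)
    edges = np.linspace(0.0, f_max, n_bands + 1)
    band_of_bin = np.searchsorted(edges, freqs, side="right") - 1
    band_of_bin = np.minimum(band_of_bin, n_bands - 1)  # top edge goes to last band
    bank = np.zeros((n_bands, n_fft_bins))
    bank[band_of_bin, np.arange(n_fft_bins)] = 1.0
    if omit is not None:
        bank = np.delete(bank, omit, axis=0)
    return bank
```

Band energies for one frame are then `bank @ power_spectrum`; repeating recognition with `omit=0, 1, …` would give one WER value per omitted band.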


Optimizing Filter Banks – Baseline Performance

[Figure: uniform filter bank of 20 rectangular bands (1–20), amplitude vs. filter bank cut-off frequencies (Hz) up to 4 kHz; features c0, c1,…, c12; ∆c0, ∆c1,…, ∆c12; ∆∆c0, ∆∆c1,…, ∆∆c12. Panels: WER (%) vs. omitted band (0–20) for neutral speech and LE speech]

Optimizing Filter Banks – Omitting Bands

Area of 1st and 2nd formant occurrence – highest portion of phonetic information; F1 more important for neutral speech, F1–F2 for LE speech recognition

Omitting the 1st band considerably improves LE ASR while reducing performance on neutral speech → tradeoff

Next step – how much of the low frequency content should be omitted for LE ASR?



Optimizing Filter Banks – Omitting Low Frequencies

[Figure: filter bank of 19 bands (1–19) up to 4 kHz; features c0, c1,…, c12; ∆c0, ∆c1,…, ∆c12; ∆∆c0, ∆∆c1,…, ∆∆c12. Panels: WER (%) vs. bandwidth (Hz), 0–1200 Hz, for neutral speech and LE speech]

Reduced Filter Bank vs. Standard Features



WER (%), devel set   Neutral         LE
LFCC, full band      4.8 (4.1–5.5)   29.0 (27.5–30.5)
LFCC, 625 Hz         6.6 (5.8–7.4)   15.6 (14.4–16.8)

Effect of Omitting Low Spectral Components

Increasing the FB low cut-off results in an almost linear increase of WER on neutral speech while considerably enhancing ASR performance on LE speech

Optimal low cut-off found at 625 Hz


Increasing Filter Bank Resolution

[Figure: WER (%) vs. omitted band (0–13) for LE speech, filter bank starting at 625 Hz]

Increasing Frequency Resolution

Idea – emphasize the high-information portion of the spectrum by increasing FB resolution

Experiment – FB decimation from 19 to 12 bands (decreasing computational costs)

Increasing the number of filters at the peak of the information distribution curve → deterioration of LE ASR (17.2 % → 26.9 %)

Slight F1–F2 shifts due to LE affect cepstral features

No simple recipe on how to derive an efficient FB from the information distribution curves

Filter Bank Repartitioning

[Figure: WER (%) on LE speech vs. critical frequency (Hz), 500–4000 Hz, for Band 1 to Band 6]

Consecutive Filter Bank Repartitioning

Consecutively, from lowest to highest, each FB high cut-off is varied, while the upper rest of the FB is redistributed uniformly across the remaining frequency band

The cut-off resulting in a local WER minimum is fixed and the procedure is repeated for the adjacent higher cut-off → WER reduction by 2.3 % for LE, by 1 % on neutral (Example – 6 bands FB)


Evaluation - Standard vs. Novel Features

State-of-the-Art LE Front-End – Expolog (Ghazale & Hansen, 2000)

FB redistributed to improve stressed speech recognition (including loud and Lombard speech)

Increased resolution in the area of F2 occurrence

$$\mathrm{Expolog}(f) = \begin{cases} 700\left(10^{f/3988} - 1\right), & 0 \le f \le 2000\ \mathrm{Hz} \\ 2595 \log_{10}\left(1 + \dfrac{f}{700}\right), & 2000 < f \le 4000\ \mathrm{Hz} \end{cases}$$

[Figure: Expolog frequency (Hz) vs. linear frequency (Hz)]
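The Expolog scale is a piecewise mapping of linear frequency: exponential up to 2000 Hz and mel-like logarithmic from 2000 to 4000 Hz. A small sketch of the mapping, with the function name `expolog` chosen here for illustration, makes it easy to check that the two branches meet at 2000 Hz.

```python
import math

def expolog(f_hz):
    """Map a linear frequency in Hz (0 to 4000) to the Expolog scale."""
    if not 0.0 <= f_hz <= 4000.0:
        raise ValueError("Expolog is defined here for 0 to 4000 Hz only")
    if f_hz <= 2000.0:
        # Exponential branch for the lower half of the band
        return 700.0 * (10.0 ** (f_hz / 3988.0) - 1.0)
    # Logarithmic (mel-like) branch for the upper half of the band
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)
```

Filter bank cut-offs would then be placed uniformly on the warped axis and mapped back to linear frequency, in the same way mel filter banks are usually built.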





Evaluation in ASR Task

Standard MFCC, PLP; variations MFCC-LPC, PLP-DCT – altered cepstrum extraction schemes

Expolog – Expolog FB replacing the trapezoid FB in PLP

20Bands-LPC – uniform rectangular FB employed in the PLP front-end

Big1-LPC – derived from 20Bands-LPC, first 3 bands merged – decreased resolution at frequencies disturbing for LE ASR

RFCC-DCT – repartitioned FB, 19 bands, starting at 625 Hz, employed in MFCC

RFCC-LPC – RFCC employed in PLP

[Figure: Features – performance on female digits; WER (%) for MFCC, MFCC-LPC, PLP, PLP-DCT, Expolog, 20Bands-LPC, Big1-LPC, RFCC-DCT, RFCC-LPC on Neutral, LE, CLE and CLEF0 sets]


Performance as a Function of Utterances’ Mean F0

Mean fundamental frequency correlates with LE level → evaluation of selected front-ends on data subsets with changing mean fundamental frequency

RFCC-LPC outperforms the other approaches with increasing fundamental frequency

[Figure: distribution of fundamental frequency in neutral + LE female digits (number of samples vs. F0, 0–600 Hz) with subset center frequency Fc; WER (%) vs. center frequency (Hz) for MFCC, RFCC-LPC, PLP, Expolog, 20Bands-LPC and the corresponding BL curves]

Performance in Noise



Set             Neutral                           LE                                       NSeff (dB)
Airport         MFCC, 20Bands–LPC, PLP            Big1–LPC, RFCC–LPC, Expolog              None
Babble          MFCC, MFCC–LPC, PLP–DCT           RFCC–LPC, Expolog (Big1–LPC), RFCC–DCT   10
Car2E           Expolog, 20Bands–LPC, Big1–LPC    RFCC–LPC, Big1–LPC, Expolog              -5
Restaurant      MFCC, 20Bands–LPC, MFCC–LPC       RFCC–LPC, Big1–LPC, RFCC–DCT             -5
Street          20Bands–LPC, MFCC, Expolog        RFCC–LPC, Big1–LPC, 20Bands–LPC          0
Train station   20Bands–LPC, MFCC, Expolog        RFCC–LPC, Big1–LPC, 20Bands–LPC          -5

Front-End Comparison on Noisy Speech

A set of noises at levels of -5, 0, 5, …, 20 dB SNR added to clean speech recordings

Simple full-wave rectification noise subtraction applied

Neutral noisy speech: both DCT and LPC-based features perform best, depending on the noise type

LE noisy speech: LPC-based features perform best for all noise types – spectral smoothing introduced by LP modeling is more robust to glottal variations due to LE
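Adding noise to clean recordings at a prescribed SNR amounts to scaling the noise so that the speech-to-noise power ratio hits the target. The sketch below assumes the SNR is measured over the whole utterance and that the noise sample is at least as long as the speech; the function name `mix_at_snr` is hypothetical and the study's exact mixing procedure is not specified here.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the global SNR equals snr_db.

    Assumptions: both signals share a sampling rate, the noise is at
    least as long as the speech, and power is averaged over the whole
    utterance (no speech activity detection).
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Gain that brings the noise power to speech_power / 10^(SNR/10)
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```

Looping `snr_db` over -5, 0, 5, …, 20 would produce the family of noisy test sets described above.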


Frequency Warping


Maximum Likelihood (ML) Approach

Vocal tract length normalization (VTLN): mean formant locations are inversely proportional to vocal tract length (VTL) → compensation for inter-speaker VTL variations by frequency transformation (warping):

$$F_W = \alpha F$$

Warping factor searched to maximize the likelihood of the observations given the acoustic models:

$$\hat{\alpha} = \arg\max_{\alpha} \Pr\left(\mathbf{O}^{\alpha} \mid W, \Theta\right)$$

Factor typically searched in the interval 0.8–1.2 (corresponds to the ratio of VTL differences in males and females)

Formant-Driven (FD) Approach

Warping factor determined from estimated mean formant locations


Vocal Tract Length Normalization (ML Approach)

[Figure: VTLN – Principle. Formant frequencies (Hz) F1–F4 of the normalized speaker NORM (VTL_NORM) vs. formant frequencies (Hz) of Speaker1 (VTL1), 0–4500 Hz, for the cases VTL1 = VTL_NORM, VTL1 > VTL_NORM and VTL1 < VTL_NORM]

VTLN vs. Lombard Effect

[Figure: formant frequencies (Hz) F1–F4 of the normalized speaker NORM (VTL_NORM), 0–4500 Hz. What to choose? Good approximation of low formants? Good approximation of higher formants?]

Generalized frequency transform

[Figure: Generalized Transform through F1–F4; case: VTL1 ? VTL_NORM]


Frequency Warping (Formant Driven Approach)

[Figure: frequency warping functions; neutral domain – frequency (Hz) vs. LE domain – frequency (Hz), 0–4500 Hz]

Females: y = 1.0045x - 76.745, R² = 0.9979

Males: y = 1.0217x - 50.311, R² = 0.9941
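The formant-driven warping functions are straight lines mapping LE-domain frequencies to neutral-domain frequencies. A minimal sketch of how such a line could be fitted from paired mean formant locations and then applied to a set of frequencies is given below; the function names are hypothetical, and in practice the formant pairs would come from a formant tracker.

```python
import numpy as np

def fit_linear_warp(le_formants_hz, neutral_formants_hz):
    """Least-squares line neutral = slope * le + intercept."""
    slope, intercept = np.polyfit(le_formants_hz, neutral_formants_hz, deg=1)
    return float(slope), float(intercept)

def warp_frequencies(freqs_hz, slope, intercept):
    """Map LE-domain frequencies to the neutral domain with the fitted line."""
    return slope * np.asarray(freqs_hz, dtype=float) + intercept
```

With the female coefficients quoted above, a component at 1000 Hz in LE speech would be mapped to about 928 Hz in the neutral domain. The non-zero intercept is what separates this generalized transform from a single-factor VTLN warp.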

Evaluation – ML VTLN vs. FD Generalized Transform

ML VTLN, WER (%)

Set                        Females Neutral   Females LE         Males Neutral   Males LE
# Digits                   2560              2560               1423            6303
Baseline                   4.3 (3.5–5.0)     33.6 (31.8–35.5)   2.2 (1.4–2.9)   22.9 (21.8–23.9)
Utterance-dependent VTLN   3.6 (2.9–4.3)     28.2 (26.4–29.9)   1.8 (1.1–2.4)   16.6 (15.7–17.6)
Speaker-dependent VTLN     4.0 (3.2–4.7)     27.7 (26.0–29.5)   1.8 (1.1–2.4)   17.4 (16.5–18.3)

FD generalized transform, WER (%)

Set                        Females Neutral   Females LE         Males Neutral   Males LE
# Digits                   2560              2560               1423            6303
Baseline bank              4.2 (3.4–5.0)     35.1 (33.3–37.0)   2.2 (1.4–2.9)   23.2 (22.1–24.2)
Warped bank                4.4 (3.6–5.2)     23.4 (21.8–25.0)   1.8 (1.1–2.4)   15.7 (14.8–16.6)

Generalized transform better addresses LE-induced formant shifts

Formant-driven approach is less computationally demanding (no need for multiple alignment passes as in VTLN), but requires reliable formant tracking → problem at low SNRs → ML approach more stable



LE recognizer

Speech Signal

Estimated Word

Sequence

Neutral/LE Classifier

Neutral Recognizer

Tandem Neutral/LE Classifier – Neutral/LE Dedicated Recognizers

Improving ASR features for LE often results in a performance tradeoff on neutral speech

Idea: combine separate systems ‘tuned’ for neutral and LE speech, directed by a neutral/LE classifier
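The routing idea can be sketched in a few lines. This is an illustration only: the classifier and both recognizers are passed in as plain callables, and `two_stage_recognize`, `toy_classifier`, and the threshold inside it are hypothetical stand-ins rather than the components used in the study.

```python
from typing import Callable, List, Sequence, Tuple

FeatureVector = Sequence[float]
Classifier = Callable[[FeatureVector], Tuple[float, float]]  # -> (Pr(N), Pr(LE))
Recognizer = Callable[[object], List[str]]


def two_stage_recognize(
    utterance: object,
    features: FeatureVector,
    classify: Classifier,
    neutral_recognizer: Recognizer,
    le_recognizer: Recognizer,
) -> List[str]:
    """Stage 1: neutral/LE decision. Stage 2: run the matching recognizer."""
    pr_n, pr_le = classify(features)
    recognizer = le_recognizer if pr_le > pr_n else neutral_recognizer
    return recognizer(utterance)


def toy_classifier(features: FeatureVector) -> Tuple[float, float]:
    # Made-up rule for illustration: a flatter spectral slope
    # (first feature, in dB/oct) is treated as evidence of LE.
    return (0.2, 0.8) if features[0] > -5.0 else (0.9, 0.1)


if __name__ == "__main__":
    neutral = lambda utt: ["<neutral recognizer output>"]
    lombard = lambda utt: ["<LE recognizer output>"]
    print(two_stage_recognize("utt1", [-7.4], toy_classifier, neutral, lombard))
    print(two_stage_recognize("utt2", [-3.9], toy_classifier, neutral, lombard))
```

Keeping the classifier and recognizers as interchangeable callables mirrors the point of the slide: each recognizer can be tuned separately, and only the routing decision couples them.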


Features for Neutral/LE Classification – Spectral Slope



[Figure: Spectral slope – female vowel /a/ (two panels); x: log frequency (Hz), 10^0 to 10^4; y: magnitude (dB), -80 to 60]

Proposal of Neutral/LE Classifier

Search for a set of features providing good discriminability between neutral and LE speech

Requirements – speaker/gender/phonetic content independent classification

Extension of the set of analyzed features by the slope of short-term spectra
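One common way to obtain a spectral slope in dB/oct is to fit a straight line to the magnitude spectrum (in dB) over log2 frequency within a chosen band. The sketch below assumes that approach; the estimation procedure actually used in the study is not given on the slide, and the function name and default band limits are illustrative.

```python
import math


def spectral_slope_db_per_oct(freqs_hz, mags_db, f_lo=60.0, f_hi=1000.0):
    """Least-squares slope of magnitude (dB) versus log2(frequency).

    Only bins with f_lo <= f <= f_hi are used; the result is in dB/octave.
    """
    points = [
        (math.log2(f), m)
        for f, m in zip(freqs_hz, mags_db)
        if f > 0.0 and f_lo <= f <= f_hi
    ]
    if len(points) < 2:
        raise ValueError("need at least two spectral bins inside the band")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0.0:
        raise ValueError("all bins share the same frequency")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


if __name__ == "__main__":
    # Synthetic spectrum falling by exactly 6 dB per octave.
    freqs = [62.5 * k for k in range(1, 129)]
    mags = [-6.0 * math.log2(f / 62.5) for f in freqs]
    print(round(spectral_slope_db_per_oct(freqs, mags), 3))
```

The band limits matter: the overlap figures reported for different bands differ considerably, so `f_lo` and `f_hi` are exposed as parameters here.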


Features for Neutral/LE Classification – Spectral Slope



Mean Spectral Slopes in Voiced Male/Female Speech (0–8000 Hz)

| Set | # N | T (s) | Neutral slope (dB/oct) | Neutral σ (dB/oct) | # LE | T (s) | LE slope (dB/oct) | LE σ (dB/oct) |
|---|---|---|---|---|---|---|---|---|
| M | 2587 | 618 | -7.42 (-7.48; -7.36) | 1.53 | 3532 | 1114 | -5.32 (-5.37; -5.27) | 1.55 |
| F | 5558 | 1544 | -6.15 (-6.18; -6.12) | 1.30 | 5030 | 1926 | -3.91 (-3.96; -3.86) | 1.77 |

Overlap of Neutral/LE Spectral Slope Distributions (%)

| Set | 0–8000 Hz | 60–8000 Hz | 60–5000 Hz | 1k–5k Hz | 0–1000 Hz | 60–1000 Hz |
|---|---|---|---|---|---|---|
| M | 26.00 | 28.13 | 29.47 | 100.00 | 27.81 | 27.96 |
| F | 26.20 | 28.95 | 16.76 | 100.00 | 25.75 | 22.18 |
| M+F | 28.06 | 30.48 | 29.49 | 100.00 | 27.54 | 26.00 |


Classification Feature Set

A feature set providing superior classification performance on the development data set was found:

SNR, spectral slope (60–1000 Hz), F0, σF0

Training GMM and multi-layer perceptron (MLP) classifiers


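Classifier Implementation

Two-Stage Recognition System

Binary Classification Task

Hypotheses $H_0$ and $H_1$; the likelihoods $P(o_i \mid H_0)$ and $P(o_i \mid H_1)$ of the acoustic observation $o_i$ are compared

GMM Classifier

[Diagram: Acoustic Observation (Classification Feature Vector) → GMM_N, GMM_LE → Pr(N), Pr(LE)]

$$P(o_i) = \frac{1}{\sqrt{(2\pi)^n \, |\Sigma|}} \; e^{-\frac{1}{2} (o_i - \mu)^T \Sigma^{-1} (o_i - \mu)}$$

MLP Classifier

[Diagram: Classification Feature Vector → MLP with (Sigmoid) and (Softmax) layers → Pr(N), Pr(LE)]

$$f_1(q_j) = \frac{1}{1 + e^{-q_j}} \qquad f_2(q_j) = \frac{e^{q_j}}{\sum_{i=1}^{M} e^{q_i}}$$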
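The Gaussian branch of the binary classification task can be illustrated with a deliberately simplified model: a single diagonal-covariance Gaussian per class instead of a full mixture, with equal class priors. All parameter values below are made up for illustration and do not come from the study; the names `MODEL_N`, `MODEL_LE`, and `classify_neutral_le` are hypothetical.

```python
import math


def diag_gaussian_logpdf(o, mean, var):
    """Log of the multivariate Gaussian PDF with a diagonal covariance."""
    n = len(o)
    log_det = sum(math.log(v) for v in var)
    maha = sum((x - m) ** 2 / v for x, m, v in zip(o, mean, var))
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + maha)


def classify_neutral_le(o, model_n, model_le):
    """Return (Pr(N), Pr(LE)) for observation o, assuming equal priors."""
    ll_n = diag_gaussian_logpdf(o, *model_n)
    ll_le = diag_gaussian_logpdf(o, *model_le)
    # Normalizing the two likelihoods is a two-class softmax over the
    # log-likelihoods; subtracting the maximum avoids underflow.
    top = max(ll_n, ll_le)
    w_n, w_le = math.exp(ll_n - top), math.exp(ll_le - top)
    pr_n = w_n / (w_n + w_le)
    return pr_n, 1.0 - pr_n


# Made-up (mean, variance) pairs for a toy feature vector
# [spectral slope (dB/oct), F0 (Hz)].
MODEL_N = ([-7.0, 120.0], [2.0, 400.0])
MODEL_LE = ([-5.0, 180.0], [2.0, 900.0])

if __name__ == "__main__":
    for obs in ([-7.2, 115.0], [-4.8, 200.0]):
        pr_n, pr_le = classify_neutral_le(obs, MODEL_N, MODEL_LE)
        print(obs, "N" if pr_n > pr_le else "LE")
```

Working in the log domain keeps the comparison numerically stable when the feature vector is far from both class means.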


Classification Feature Set




[Figure, four panels: GMM PDFs (PDF_N, PDF_LE) and ANN posteriors (Pr(N), Pr(LE)) plotted over normalized histograms of the development sets Dev_N_M+F and Dev_LE_M+F, for SNR (dB) and for spectral slope (dB/oct)]



Classification Feature Set

Two-Stage Recognition System

[Figure, four panels: GMM PDFs (PDF_N, PDF_LE) and ANN posteriors (Pr(N), Pr(LE)) plotted over normalized histograms of the development sets Dev_N_M+F and Dev_LE_M+F, for two F0-based features (x-axes labelled F0 (Hz), with ranges 0–120 Hz and 0–500 Hz)]


Classifier Evaluation

Two-Stage Recognition System

Classification Data Sets

| Set | # Utterances | Mean T_Utter (s) | σ T_Utter (s) |
|---|---|---|---|
| Devel | 2472 | 4.10 | 1.60 |
| Open | 1371 | 4.01 | 1.50 |

Classification Performance

UER – Utterance Error Rate – ratio of incorrectly classified utterances to all utterances

GMM

| Set | Devel FM | Open FM | Devel DM | Open DM |
|---|---|---|---|---|
| # Utterances | 2472 | 1371 | 2472 | 1371 |
| UER (%) | 6.6 (5.6–7.6) | 2.5 (1.7–3.3) | 8.1 (7.0–9.2) | 2.8 (1.9–3.6) |

MLP

| Set | Train | CV | Open |
|---|---|---|---|
| # Utterances | 2202 | 270 | 1371 |
| UER (%) | 9.9 (8.7–11.1) | 5.6 (2.8–8.3) | 1.6 (0.9–2.3) |
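The UER definition translates directly into code. The sketch below also attaches a normal-approximation binomial interval at the 95% level; that the slide's parenthesized ranges were computed this way is an assumption, and the error count used in the example is hypothetical.

```python
import math


def uer_with_interval(n_errors, n_utterances, z=1.96):
    """Return (UER, lower, upper), all in percent.

    UER is the ratio of incorrectly classified utterances to all
    utterances. The interval is a normal-approximation binomial
    interval; z = 1.96 corresponds to a 95% level (an assumption).
    """
    if n_utterances <= 0:
        raise ValueError("n_utterances must be positive")
    p = n_errors / n_utterances
    half_width = z * math.sqrt(p * (1.0 - p) / n_utterances)
    lower = max(0.0, p - half_width)
    upper = min(1.0, p + half_width)
    return 100.0 * p, 100.0 * lower, 100.0 * upper


if __name__ == "__main__":
    # Hypothetical count: 15 misclassified utterances out of 270.
    uer, lo, hi = uer_with_interval(15, 270)
    print(f"UER = {uer:.1f}% ({lo:.1f}-{hi:.1f})")
```

With these hypothetical counts the result rounds to 5.6 (2.8–8.3), the same as the entry reported for the CV set of 270 utterances. This is consistent with the intervals being 95% intervals of this kind, though the slide does not say so.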



Two-Stage Recognizer (TSR)

Two-Stage Recognition System

WER (%)

| Set | Real – neutral | Real – LE |
|---|---|---|
| # Female digits | 1439 | 1837 |
| PLP | 4.3 (3.3–5.4) | 48.1 (45.8–50.4) |
| RFCC–LPC | 6.5 (5.2–7.7) | 28.3 (26.2–30.4) |
| MLP TSR | 4.2 (3.2–5.3) | 28.4 (26.4–30.5) |
| FM–GMLC TSR | 4.4 (3.3–5.4) | 28.4 (26.4–30.5) |
| DM–GMLC TSR | 4.4 (3.3–5.4) | 28.4 (26.3–30.4) |

Discrete Recognizers

Either good on neutral or LE speech

[Diagram: Speech Signal → Neutral Recognizer or LE Recognizer → Estimated Word Sequence]





TSR

When exposed to a mixture of neutral/LE speech, provides the best of both discrete recognizers

Only neutral speech data required for training acoustic models
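The "best of both" claim can be made concrete by pooling the reported WERs over the neutral and LE female digit sets, weighted by the number of digits in each. This is a rough illustration, not a result from the study: it treats WER × digit count as an error count and uses the rounded percentages from the table, so the pooled figures are approximate. The helper `pooled_wer` is hypothetical.

```python
def pooled_wer(wer_neutral, n_neutral, wer_le, n_le):
    """Digit-weighted WER (%) over a neutral set and an LE set."""
    errors = wer_neutral / 100.0 * n_neutral + wer_le / 100.0 * n_le
    return 100.0 * errors / (n_neutral + n_le)


# Digit counts and WERs (%) for female speech, as reported in the table.
N_NEUTRAL, N_LE = 1439, 1837
SYSTEMS = {
    "PLP": (4.3, 48.1),
    "RFCC-LPC": (6.5, 28.3),
    "MLP TSR": (4.2, 28.4),
}

if __name__ == "__main__":
    for name, (wer_n, wer_le) in SYSTEMS.items():
        pooled = pooled_wer(wer_n, N_NEUTRAL, wer_le, N_LE)
        print(f"{name}: {pooled:.1f}% pooled WER")
```

Under these assumptions the TSR comes out slightly ahead of the RFCC–LPC recognizer and well ahead of PLP on the pooled set, because it keeps the PLP-level error rate on neutral speech while matching RFCC–LPC on LE speech.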

Proposed Equalization Techniques


Conclusions

Acoustic Model Adaptation

Adaptation of neutral acoustic models to LE: proposal of speaker/group-dependent adaptation approaches

Assumptions: LE-level dependent; adaptation data for a given speaker/LE level are available, together with their transcriptions

Voice Conversion

Excitation and vocal tract components of LE speech are transformed towards neutral in the ASR front-end

Assumptions: LE-level dependent; parallel training data available for each speaker; a speaker identification system choosing from the codebook of speaker-dependent transforms is available; increased conversion accuracy required

Data-Driven Design of Robust Features

Contribution of frequency sub-bands to speech recognition performance is studied; novel filter banks for MFCC- and PLP-based front-ends are designed

Assumptions: LE-level dependent; gender classification required to pick gender-dependent features


Proposed Equalization Techniques


Conclusions

Frequency Warping

Modified vocal tract length normalization (VTLN) and generalized formant-driven frequency warping are proposed

Assumptions: LE-level independent; transform parameters adapt on-the-fly!

Two-Stage Recognition System

Neutral/LE classifier is proposed and used to direct incoming speech to matching neutral/LE dedicated recognizers

Assumptions: LE-level independent; extending the codebook with LE-level dependent recognizers would further improve performance under changing LE levels


Proposed Techniques – Performance Comparison


Conclusions

[Figure: Comparison of proposed techniques for LE-robust ASR. Bar chart of WER (%), axis 0–60, with series Baseline Neutral, Baseline LE, and LE Suppression for: Model Adapt to LE - SI; Model Adapt to LE - SD; Voice Conversion - CLE; Modified FB - RFCC-LPC; VTLN Recognition - Utt. Dep. Warp; Formant Warping; MLP TSR]


Thank you


Conclusions

Thank You for Your Attention!


References


Conclusions


PhD Thesis

Hynek Bořil – Robust Speech Recognition: Analysis and Equalization of Lombard Effect in Czech Corpora. Czech Technical University in Prague, 2008.

http://www.utdallas.edu/~hxb076000
