Hynek Bořil
Introduction Data Acquisition Neutral/LE Speech Analysis Equalization of LE in ASR Conclusions
Attributes and Recognition of Lombard Speech
Center for Robust Speech Systems, Erik Jonsson School of Engineering and Computer Science
University of Texas at Dallas
© Hynek Bořil, [email protected] Workshop on Speech in Adverse Conditions, S2S, September 8-12, 2008
Contents
Introduction
    What is Lombard Effect?
    Why is Lombard Effect Interesting?
    Goals and Motivation of the Study
Data Acquisition
Neutral/LE Speech Analysis
Equalization of LE in ASR
    Acoustic Model Adaptation
    Voice Conversion
    Data-Driven Design of Robust Features
    Frequency Warping
    Two-Stage Recognition System
Summary
Objective
What is Lombard Effect?
When exposed to a noisy adverse environment, speakers modify the
way they speak in an effort to maintain intelligible communication
(Lombard Effect, LE)
Why is Lombard Effect Interesting?
Better understanding of the mechanisms of human speech
communication (Can we intentionally change particular parameters
of speech production to improve intelligibility, or is LE an automatic
process learned through the public loop? How do the type of noise and
the communication scenario affect LE?)
Mathematical modeling of LE: classification of LE level, speech
synthesis in noisy environments, increasing robustness of automatic
speech recognition and speaker identification systems
Study Objective and Motivation – LE Analysis
Ambiguity in Past LE Investigations
LE has been studied since 1911; however, many investigations disagree on the
observed impacts of LE on speech production
Analyses were typically conducted on very limited data – a couple of utterances from a few
subjects (1–10)
Lack of communication factor – many studies ignore the importance of
communication for evoking LE (an effort to convey a message over noise), so the occurrence
and level of LE in speech recordings is 'random', which leads to contradicting analysis results
LE was studied only for several world languages (English, Spanish, French,
Japanese, Korean, Mandarin Chinese); there is no comprehensive study for any of the Slavic
languages
1st Goal
Design of a Czech Lombard Speech Database addressing the need for a communication
factor and well-defined simulated noisy conditions
Systematic analysis of LE in spoken Czech
Study Objective and Motivation – ASR under LE
ASR under LE
Mismatch between LE speech accompanied by noise and acoustic models trained on clean neutral
speech
Strong impact of noise on ASR is well known, and a vast number of noise suppression/speech
emphasis algorithms have been proposed in the last decades (yet no ultimate solution has been reached)
Negative impact of LE on ASR often exceeds that of noise; recent state-of-the-art ASR
systems mostly ignore this issue
LE-Equalization Methods
LE-equalization algorithms typically operate in the following domains: robust features,
LE transformation towards neutral, model adjustments, improved training of acoustic models
The algorithms display various degrees of efficiency and are often bound by strong
assumptions that prevent their real-world application (applying fixed transformations
to phonetic groups, known level of LE, etc.)
2nd Goal
Proposal of novel LE-equalization techniques with a focus on both the level of LE suppression
and the extent of bounding assumptions
Data Acquisition - Motivation
LE Corpora – Issues (1)
Tradeoff between realism and control of the phenomena of interest: (Murray
and Arnott, 1993) for elicited emotional speech: “A trade-off exists between realism
and measurement accuracy of emotions generated by speakers in the laboratory
(questionable realism, but verbal content and recording conditions controllable) and
field recordings (real emotions, but content and recording conditions less
controllable).”
Databases recorded in real adverse conditions (e.g., car environment):
Limited or no control over level and characteristics of background noise
Low SNRs of the recordings make it difficult to perform reliable speech analysis
Importance of communication factor often completely ignored - e.g., SPEECON
(Iskra et al., 2002)
Data Acquisition - Motivation
LE Corpora – Issues (2)
Special LE databases – simulated noisy conditions
Successfully address the control of noise and SNR
In studies on speech production, authors sometimes employ a
communication factor: (Korn, 1954), (Webster and Klumpp, 1962) – repeating
words, (Patel and Schell, 2008) – interactive game
Studies on ASR/Speaker ID under LE - the importance of the communication
factor is largely ignored – (Junqua, 1993), SUSAS (Hansen and Ghazale, 1997);
exception (Junqua et al., 1998) – communication with a dialing machine
Limited number of subjects and utterances - ranging typically from ten
(Webster and Klumpp, 1962), (Lane et al., 1970), (Junqua, 1993), to one or two
speakers, (Summers et al., 1988), (Pisoni et al., 1985), (Bond et al., 1989), (Tian
et al., 2003), (Garnier et al., 2006)
Data Acquisition - Motivation
Available Czech Corpora
Czech SPEECON – speech recordings from various environments including office
and car
CZKCC – car recordings – include parked car with engine off and moving car
scenarios
Both databases contain speech produced in quiet and in noise, which makes them candidates
for the study of LE; however, they are not good ones, as shown later
Design/acquisition of LE-oriented database – Czech
Lombard Speech Database '05 (CLSD'05)
Goals:
-Communication in simulated noisy background, high SNR
-Phonetically rich data/extensive small vocabulary material
-Parallel utterances in neutral and LE conditions
Introducing Communication in Recording
Simulated Noisy Conditions
Noise samples mixed with speech feedback and reproduced to the speaker and
operator by headphones
Operator qualifies intelligibility of speech in noise – if the utterance is not intelligible,
operator asks the subject to repeat it → speakers are required to convey the message
over noise → communication → LE
Noises: mostly car noises from Car2E database, normalized to 90 dB SPL
Speaker Sessions
14 male/12 female speakers
Each subject recorded both in neutral and simulated noisy conditions
[Figure: recording setup. Speaker with close-talk and middle-talk microphones and headphones (noise + speech feedback); operator with headphones (noise + speech monitor) deciding "OK – next" or "BAD – again"; H&T recorder.]
Speaker Session Contents
IVR – interactive voice response

Corpus contents                                                      Corpus/item id.       Number
Phonetically rich sentences                                          S01 – 30              30
Phonetically rich words                                              W01 – 05              5
Isolated digits                                                      CI1 – I4, 30 – 69     44
Isolated digit sequences (8 digits)                                  CB1 – B2, 00 – 29     32
Connected digit sequences (5 digits)                                 CC1 – 4, C70 – 99     34
Natural numbers                                                      CN1 – N3              3
Money amount                                                         CM1                   1
Time phrases; T1 – analogue, T2 – digital                            CT1 – T2              2
Dates: D1 – analogue, D2 – relative and general date, D3 – digital   CD1 – D3              3
Proper name                                                          CP1                   1
City or street names                                                 CO1 – O2              2
Questions                                                            CQ1 – Q2              2
Special keyboard characters                                          CK1 – K2              2
Core word synonyms                                                   Y01 – 95
Basic IVR commands                                                   101 – 85
Directory navigation                                                 201 – 40
Editing                                                              301 – 22
Output control                                                       401 – 57
Messaging & Internet browsing                                        501 – 70
Organizer functions                                                  601 – 33
Routing                                                              701 – 39
Automotive                                                           801 – 12
Audio & Video                                                        901 – 95
                                                                                           89
Sound Attenuation by Headphones
Environmental Sound Attenuation by Headphones
Attenuation characteristics measured on dummy head
Source of wide-band noise, measurement of sound transfer to dummy head’s
auditory canals when not wearing/wearing headphones
Attenuation characteristics – subtraction of the transfers
Environmental Sound Attenuation by Headphones
Directional attenuation – reflectionless sound booth
Real attenuation in recording room
[Figure: Attenuation by headphones. Panels: attenuation (dB) over frequency (Hz) and angle (°); attenuation (dB) over frequency (Hz) for 0°, 90°, 180°, and the recording room; polar plots of attenuation (dB) over angle (°) at 1 kHz, 2 kHz, 4 kHz, and 8 kHz.]
Closed vs. Open-Air Headphones
Open-Air Headphones
+ Easier to reach flat frequency response than in closed headphones
+ Lower attenuation of sound coming from outside the headset
- High level of cross-talk from headphones to close-talk microphone → contamination
of recorded speech by noise reproduced to headphones
[Figure: CLSD'05 - SNR distributions. Number of utterances vs. SNR (dB) for close-talk N, hands-free N, close-talk LE, and hands-free LE.]
Parameters of Neutral and LE Speech
Speech Features affected by LE
Vocal tract excitation: glottal pulse shape changes, fundamental frequency rises
Vocal tract transfer function: center frequencies of low formants increase,
formant bandwidths reduce
Vocal effort (intensity) increases
Other: voiced phonemes prolonged, energy ratio in voiced/unvoiced increases, …
Vocal tract model:
$$V(z) = \frac{G}{1 - \sum_{k=1}^{N} \alpha_k z^{-k}}$$
[Figure: speech production model. Blocks: impulse train generator I(z) (controlled by pitch period), glottal pulse model G(z), gain A_V, random noise generator N(z), gain A_N, voiced/unvoiced switch, vocal tract model V(z) (controlled by vocal tract parameters), radiation model R(z); signals u_G(n) and p_L(n).]
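A minimal sketch of the voiced branch of such a production model, assuming an all-pole vocal tract V(z) = G / (1 - sum_k alpha_k z^-k) driven by an impulse train. The function names, the single coefficient 0.5, and the pitch period of 4 samples are illustrative assumptions only; glottal pulse shaping and radiation are left out.

```python
def impulse_train(n_samples, pitch_period):
    """Unit impulses spaced pitch_period samples apart (voiced excitation)."""
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]


def all_pole_filter(excitation, alphas, gain=1.0):
    """Apply V(z) = G / (1 - sum_k alpha_k z^-k) as a difference equation:
    y[n] = G * x[n] + sum_k alpha_k * y[n - k]."""
    out = []
    for n, x in enumerate(excitation):
        y = gain * x
        for k, alpha in enumerate(alphas, start=1):
            if n - k >= 0:
                y += alpha * out[n - k]
        out.append(y)
    return out


# Hypothetical one-pole "vocal tract" excited with a pitch period of 4 samples.
excitation = impulse_train(8, pitch_period=4)
response = all_pole_filter(excitation, alphas=[0.5], gain=1.0)
```

In this picture a shorter pitch period corresponds to a higher fundamental frequency, and different coefficients alpha_k correspond to different formant positions and bandwidths.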
Analysis: Fundamental Frequency
[Figure: Distribution of fundamental frequency, number of samples vs. fundamental frequency (Hz). Panels: Czech SPEECON (Office F, Car F, Office M, Car M; samples x 10,000); CZKCC (Eng off F, Eng on F, Eng off M, Eng on M; samples x 1000); CLSD'05 (Neutral F, LE F, Neutral M, LE M; samples x 10,000).]
Analysis: Formant Center Frequencies
[Figure: Formants, F2 (Hz) vs. F1 (Hz), for vowels /i/, /e/, /a/, /o/, /u/ and their primed counterparts (/i'/, /e'/, /a'/, /o'/, /u'/). Panels: CZKCC female digits (Female_N, Female_LE); CZKCC male digits (Male_N, Male_LE); CLSD'05 female digits (Female_N, Female_LE); CLSD'05 male digits (Male_N, Male_LE).]
Analysis: Formant Bandwidths

CZKCC
         OFF                  ON                   OFF                  ON
Vowel    B1M (Hz)  σ1M (Hz)   B1M (Hz)  σ1M (Hz)   B1F (Hz)  σ1F (Hz)   B1F (Hz)  σ1F (Hz)
/a/      207*      74         210*      84         275       97         299       78
/e/      125*      70         130*      78         156       68         186       79
/i/      124*      49         127*      44         105       44         136       53
/o/      275       87         222       67         263*      85         269*      73
/u/      187       100        170       89         174*      96         187*      101

CLSD'05
         N                    LE                   N                    LE
Vowel    B1M (Hz)  σ1M (Hz)   B1M (Hz)  σ1M (Hz)   B1F (Hz)  σ1F (Hz)   B1F (Hz)  σ1F (Hz)
/a/      269       88         152       59         232       85         171       68
/e/      168       94         99        44         169       73         130       49
/i/      125       53         108       52         132*      52         133*      58
/o/      239       88         157       81         246       91         158       62
/u/      134*      67         142*      81         209       95         148       66

SPEECON, CZKCC: no consistent BW changes
CLSD'05: significant BW reduction in many voiced phonemes
Analysis: Phoneme Durations
CZKCC
Word     Phoneme   # OFF   T_OFF (s)   σ_TOFF (s)   # ON   T_ON (s)   σ_TON (s)   Δ (%)
Nula     /a/       349     0.147       0.079        326    0.259      0.289       48.50
Jedna    /a/       269     0.173       0.076        251    0.241      0.238       39.36
Dva      /a/       245     0.228       0.075        255    0.314      0.311       38.04
Štiri    /r/       16      0.045       0.027        68     0.080      0.014       78.72
Sedm     /e/       78      0.099       0.038        66     0.172      0.142       72.58

CLSD'05
Word     Phoneme   # N     T_N (s)     σ_TN (s)     # LE   T_LE (s)   σ_TLE (s)   Δ (%)
Jedna    /e/       583     0.031       0.014        939    0.082      0.086       161.35
Dvje     /e/       586     0.087       0.055        976    0.196      0.120       126.98
Čtiri    /r/       35      0.041       0.020        241    0.089      0.079       115.92
Pjet     /e/       555     0.056       0.033        909    0.154      0.089       173.71
Sedm     /e/       358     0.080       0.038        583    0.179      0.136       122.46
Osm      /o/       310     0.086       0.027        305    0.203      0.159       135.25
Devjet   /e/       609     0.043       0.022        932    0.120      0.088       177.20

Significant increase in duration in some phonemes, especially voiced phonemes
Some unvoiced consonants – duration reduction
Duration changes in CLSD'05 considerably exceed those in SPEECON and CZKCC
Word Durations
CZKCC
Word     # OFF   T_OFF (s)   σ_TOFF (s)   # ON   T_ON (s)   σ_TON (s)   Δ (%)
Nula     349     0.475       0.117        326    0.560      0.345       17.82
Jedna    269     0.559       0.136        251    0.607      0.263       8.58
Dva      245     0.426       0.106        255    0.483      0.325       13.57

CLSD'05
Word     # N     T_N (s)     σ_TN (s)     # LE   T_LE (s)   σ_TLE (s)   Δ (%)
Nula     497     0.397       0.109        802    0.476      0.157       19.87
Jedna    583     0.441       0.128        939    0.527      0.165       19.56
Dvje     586     0.365       0.114        976    0.423      0.138       15.87

Word duration variations typically did not exceed 20 %
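The percentage column of the duration tables appears to be the relative change of the mean duration between the two conditions. A small sketch under that assumption (the function name is hypothetical); small deviations from the tabulated percentages are expected, because the tabulated means are rounded to three decimals.

```python
def relative_change_percent(neutral_mean, le_mean):
    """Relative change of a mean duration, in percent of the neutral value."""
    if neutral_mean <= 0:
        raise ValueError("neutral_mean must be positive")
    return 100.0 * (le_mean - neutral_mean) / neutral_mean


# Mean word durations (s) of "Nula" in CLSD'05: neutral 0.397, LE 0.476.
nula_change = relative_change_percent(0.397, 0.476)
```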
HMM-Based Automatic Speech Recognition (ASR)
Typical HMM Recognizer
[Figure: block diagram. Speech signal → feature extraction (MFCC/PLP) → acoustic model producing sub-word likelihoods (GMM/MLP) → decoder (Viterbi), which also uses the lexicon (HMM) and the language model (bigrams) → estimated word sequence.]
Feature extraction – transformation of the time-domain acoustic signal into a
representation more convenient for the ASR engine: data dimensionality reduction,
suppression of irrelevant (disturbing) signal components
(speaker/environment/recording chain-dependent characteristics), preserving
phonetic content
Sub-word models – Gaussian Mixture Models (GMMs) – mixture of Gaussians used
to model the distribution of feature vector parameters; Multi-Layer Perceptrons (MLPs) –
artificial neural networks
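How a GMM scores one feature vector can be sketched as follows, assuming diagonal covariance matrices (a common choice, not stated on the slide). All names and the toy parameters are illustrative.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, weights, means, variances):
    """Log of sum_m w_m * N(x; mean_m, var_m), via log-sum-exp for stability."""
    logs = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(logs)
    return peak + math.log(sum(math.exp(value - peak) for value in logs))


# Toy two-component model of a 2-dimensional feature vector.
score = gmm_log_likelihood(
    [0.2, -0.1],
    weights=[0.6, 0.4],
    means=[[0.0, 0.0], [1.0, 1.0]],
    variances=[[1.0, 1.0], [0.5, 0.5]],
)
```

In an HMM recognizer, one such score per model state and frame would be passed to the decoder.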
HMM-Based Automatic Speech Recognition (ASR)
Mel Frequency Cepstral Coefficients
Davis & Mermelstein, IEEE Trans. Acoustics, Speech, and Signal Processing, 1980
Mermelstein was born in Czechoslovakia
MFCC is the first choice in current commercial ASR
When used in HMM ASR, MFCC may incorporate several redundant stages for
historical reasons (in the past, distance-based measures were used in speech
decoding, which placed different requirements on cepstral coefficients than HMM systems do)
MFCC: s(n) → preemphasis → window (Hamming) → |FFT|^2 → filter bank (mel) → Log(.) → IDCT → c(n)
Perceptual Linear Predictive Coefficients
Hermansky, Journal of Acoustical Society of America, 1990
Hermansky was born in Czechoslovakia
Linear prediction – smoothing of the spectral envelope (may improve robustness)
PLP is a frequent choice in research labs – IDIAP, ICSI Berkeley, LIMSI…
PLP: s(n) → window (Hamming) → |FFT|^2 → filter bank (Bark) → equal loudness preemphasis → intensity → loudness (cube root) → linear prediction → recursion cepstrum → c(n)
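The mel filter bank stage of MFCC can be illustrated by where its filters are centered. The sketch below assumes the commonly used mapping mel = 2595 log10(1 + f/700); the slides do not say which mel formula, how many filters, or which band edges were used, so the values 20, 0 Hz and 4000 Hz are placeholders.

```python
import math


def hz_to_mel(f_hz):
    """One widely used mel-scale approximation (an assumption here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, f_low, f_high):
    """Center frequencies (Hz) of filters spaced uniformly on the mel scale."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]


# Placeholder configuration: 20 filters between 0 Hz and 4000 Hz.
centers = mel_center_frequencies(20, 0.0, 4000.0)
```

The centers are dense at low frequencies and sparse at high frequencies, which is the property that distinguishes a mel (or Bark) filter bank from a linear one.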
Initial ASR Experiment
ASR Evaluation – WER (Word Error Rate)
$$\mathrm{WER} = \frac{D + S + I}{N} \cdot 100\,\%$$
S – word substitutions
I – word insertions
D – word deletions
N – number of words in the reference transcription
Digit Recognizer
Monophone HMM models
13 MFCC + ∆ + ∆∆
32 Gaussian mixtures per model state

Corpus          Set        # Spkrs   # Digits   WER (%)
Czech SPEECON   Office F   22        880        5.5 (4.0–7.0)
Czech SPEECON   Office M   31        1219       4.3 (3.1–5.4)
Czech SPEECON   Car F      28        1101       4.6 (3.4–5.9)
Czech SPEECON   Car M      42        1657       10.5 (9.0–12.0)
CZKCC           OFF F      30        1480       3.0 (2.1–3.8)
CZKCC           OFF M      30        1323       2.3 (1.5–3.1)
CZKCC           ON F       18        1439       13.5 (11.7–15.2)
CZKCC           ON M       21        1450       10.4 (8.8–12.0)
CLSD'05         N F        12        4930       7.3 (6.6–8.0)
CLSD'05         N M        14        1423       3.8 (2.8–4.8)
CLSD'05         LE F       12        5360       42.8 (41.5–44.1)
CLSD'05         LE M       14        6303       16.3 (15.4–17.2)

Noisy car recordings (SPEECON/CZKCC – 10.7/12.6 dB SNR)
Clean recordings (LE - 40.9 dB SNR)
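WER counts deletions, substitutions and insertions relative to the number of reference words, WER = (D + S + I) / N x 100 %. A minimal sketch that obtains the total error count from a word-level edit distance; this is an illustration, not the scoring tool used in the experiments, and the digit strings are made up.

```python
def wer(reference, hypothesis):
    """Word error rate in percent: (D + S + I) / N * 100, N = reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return 100.0 * dist[-1][-1] / len(ref)


# One deleted word out of three reference words.
example_wer = wer("nula jedna dva", "nula dva")
```

Because insertions are counted, WER can exceed 100 % when the hypothesis is much longer than the reference.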
Acoustic Model Adaptation
Model Adaptation
Often effective when only limited data from given conditions are available
Maximum Likelihood Linear Regression (MLLR) – if limited amount of data per
class, acoustically close classes are grouped and transformed together
$$\mu'_{\mathrm{MLLR}} = A\mu + b$$
Maximum a posteriori approach (MAP) – initial models are used as informative
priors for the adaptation
$$\mu'_{\mathrm{MAP}} = \frac{N}{N+\tau}\,\bar{\mu} + \frac{\tau}{N+\tau}\,\mu$$
where μ is the original model mean, μ̄ the mean of the adaptation data, N the amount of
adaptation data, and τ the weight of the prior
Adaptation Procedure
Framework provided by Technical University of Liberec
First, neutral speaker-independent (SI) models are transformed by MLLR, employing
clustering (binary regression tree)
Second, MAP adaptation – only for nodes with sufficient amount of adaptation data
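The two mean updates can be sketched for a single Gaussian mean, assuming the usual forms: MLLR applies an affine transform A*mu + b, and MAP interpolates between the prior mean and the mean of the adaptation data with weight N / (N + tau). The function names and all numbers are hypothetical; the framework used in the study is not shown in the slides.

```python
def mllr_mean(mean, A, b):
    """MLLR mean update: A * mean + b (A is a list of rows)."""
    return [
        sum(a_ij * m_j for a_ij, m_j in zip(row, mean)) + b_i
        for row, b_i in zip(A, b)
    ]


def map_mean(prior_mean, data_mean, n_frames, tau):
    """MAP mean update: N/(N+tau) * data_mean + tau/(N+tau) * prior_mean."""
    w = n_frames / (n_frames + tau)
    return [w * d + (1.0 - w) * p for d, p in zip(data_mean, prior_mean)]


# Two-stage use, mirroring "first MLLR, then MAP" with made-up values.
neutral_mean = [1.0, 2.0]
after_mllr = mllr_mean(neutral_mean, A=[[1.0, 0.0], [0.0, 1.0]], b=[0.5, -0.5])
after_map = map_mean(after_mllr, data_mean=[2.5, 0.5], n_frames=10, tau=10)
```

With little adaptation data (small N) the MAP estimate stays close to its prior, which is consistent with applying MAP only where enough data are available.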
Acoustic Model Adaptation
Adaptation Schemes
Speaker-independent adaptation (SI) – group dependent/independent
Speaker-dependent adaptation (SD) – to neutral/LE
[Figure: Model adaptation to conditions and speakers. WER (%) for baseline digits LE, adapted digits LE, baseline sentences LE, and adapted sentences LE; series: SI adapt to LE (same spkrs), SI adapt to LE (disjunct spkrs), SD adapt to neutral, SD adapt to LE.]
Voice Conversion
Technique transforming speech from a source speaker towards a target speaker
Voice conversion typically transforms both excitation and vocal tract parameters → promising for LE → neutral transformation
Idea: $x_n$ – speech samples from the source speaker, $y_n$ – from the target speaker; the goal is to find a conversion function $F$ minimizing the mean square error
$\varepsilon_{\mathrm{mse}} = E\left[\lVert y_n - F(x_n)\rVert^2\right]$
Voice conversion framework provided by Siemens AG
GMM-based text-dependent voice conversion
Parallel utterances required for transformation model training
Fundamental frequency transform:
$G_{F_0}(x_{F_0}) = \mu^{y}_{F_0} + \frac{\sigma^{y}_{F_0}}{\sigma^{x}_{F_0}}\left(x_{F_0} - \mu^{x}_{F_0}\right)$
Vocal tract transfer function transform:
$F_V(x) = \sum_{m=1}^{M} p_m(x)\left[\mu^{y}_{m} + \Sigma^{yx}_{m}\left(\Sigma^{xx}_{m}\right)^{-1}\left(x - \mu^{x}_{m}\right)\right]$
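A fundamental frequency transform of this kind can be illustrated with a short sketch. It assumes the usual linear mapping that matches the mean and standard deviation of the source F0 to those of the target; whether the Siemens AG framework uses exactly this form is an assumption, and the F0 values below are hypothetical.

```python
import math


def mean_std(values):
    """Mean and (population) standard deviation of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, math.sqrt(var)


def convert_f0(f0_source, src_stats, tgt_stats):
    """Map source F0 values so their mean and spread match the target's."""
    mu_x, sigma_x = src_stats
    mu_y, sigma_y = tgt_stats
    return [mu_y + (sigma_y / sigma_x) * (f - mu_x) for f in f0_source]


le_f0 = [260.0, 300.0, 340.0]       # hypothetical LE (source) F0 values, Hz
neutral_f0 = [180.0, 200.0, 220.0]  # hypothetical neutral (target) F0 values, Hz
converted = convert_f0(le_f0, mean_std(le_f0), mean_std(neutral_f0))
```

In practice the statistics would be estimated from the parallel training utterances rather than from three values.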
Transformation of Fundamental Frequency
[Figure: Fundamental frequency distributions, CLSD'05 female sentences and CLSD'05 male sentences. x-axis: fundamental frequency (Hz); y-axis: number of samples; curves: N, LE, CLE, CLEF0.]
CLE – conversion of both excitation and vocal tract parameters
CLEF0 – only excitation converted, vocal tract parameters preserved
Transformation of Formants
[Figure: Formants – CLSD'05 female digits and male digits. F2 (Hz) vs. F1 (Hz) positions of the vowels /a/, /e/, /i/, /o/, /u/ for N, LE, CLE, and CLEF0 speech.]
Voice Conversion in ASR Front-End
[Figure: WER (%) for Neutral, LE, CLE, and CLEF0 speech. Sentences (LVCSR): males and females, with and without language model (LM). Digits: males and females.]
Effectiveness of Voice Conversion in ASR Task
Partially successful in digits task
Fails in LVCSR task – classes to be recognized too close in acoustic space, any inaccuracy of VC
results in ASR deterioration
Listening tests – converted speech samples contain strong artifacts, at times the speech becomes
unintelligible for human listeners
Data-Driven Design of Robust Features
Filter Bank Approach
Analysis of the importance of frequency components for ASR
Repartitioning the filter bank (FB) to emphasize components carrying phonetic information and suppress disturbing components
Initial FB uniformly distributed on a linear scale – equal attention to all components
Consecutively, a single FB band is omitted → impact on WER?
Omitting bands carrying more information will result in a considerable WER increase
Implementation
MFCC front-end, mel scale replaced by a linear scale, triangular filters replaced by rectangular filters without overlap
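The band-omission setup can be sketched as below. The sketch assumes a 20-band bank up to 4 kHz and a toy flat power spectrum; the function names and the bin spacing are hypothetical. It only builds the band energies, while training and scoring a recognizer for each omitted band is outside its scope.

```python
def rectangular_bank(n_bands, f_low, f_high):
    """Non-overlapping rectangular bands spread uniformly on a linear scale."""
    width = (f_high - f_low) / n_bands
    return [(f_low + i * width, f_low + (i + 1) * width)
            for i in range(n_bands)]


def band_energies(power_spectrum, bin_hz, bank, omit=None):
    """Sum power-spectrum bins per band; `omit` is the 1-based band to drop."""
    energies = []
    for idx, (lo, hi) in enumerate(bank, start=1):
        if idx == omit:
            continue  # the omitted band contributes no feature at all
        energies.append(sum(p for k, p in enumerate(power_spectrum)
                            if lo <= k * bin_hz < hi))
    return energies


bank = rectangular_bank(20, 0.0, 4000.0)  # 20 bands up to 4 kHz
spectrum = [1.0] * 80                      # toy flat spectrum, 50 Hz per bin
full = band_energies(spectrum, 50.0, bank)
without_first = band_energies(spectrum, 50.0, bank, omit=1)
```

Log compression and the DCT would follow as in a standard MFCC front-end; repeating the experiment for `omit=1…20` gives one WER value per omitted band.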
Optimizing Filter Banks – Baseline Performance
[Figure: Filter bank of 20 bands (1–20) up to 4 kHz (filterbank cut-off frequencies (Hz) vs. amplitude); features c0, c1,…, c12; ∆c0, ∆c1,…, ∆c12; ∆∆c0, ∆∆c1,…, ∆∆c12. Panels: WER (%) vs. omitted band for neutral speech and for LE speech.]
Optimizing Filter Banks – Omitting Bands
[Figure: WER (%) as a function of the omitted band (1–20) for neutral speech and for LE speech, same filter bank and feature set.]
Area of 1st and 2nd formant occurrence – highest portion of phonetic information; F1 is more important for neutral speech, F1–F2 for LE speech recognition
Omitting the 1st band considerably improves LE ASR while reducing performance on neutral speech → tradeoff
Next step – how much of the low-frequency content should be omitted for LE ASR?
Optimizing Filter Banks – Omitting Low Frequencies
[Figure: Filter bank of 19 bands (1–19) up to 4 kHz (filterbank cut-off frequencies (Hz) vs. amplitude); features c0, c1,…, c12; ∆c0, ∆c1,…, ∆c12; ∆∆c0, ∆∆c1,…, ∆∆c12. Panels: WER (%) vs. bandwidth (Hz, 0–1200) for neutral speech and for LE speech.]
Reduced Filter Bank vs. Standard Features
WER (%) on the devel set
Set             | Neutral       | LE
LFCC, full band | 4.8 (4.1–5.5) | 29.0 (27.5–30.5)
LFCC, 625 Hz    | 6.6 (5.8–7.4) | 15.6 (14.4–16.8)
Effect of Omitting Low Spectral Components
Increasing the FB low cut-off results in an almost linear increase of WER on neutral speech while considerably enhancing ASR performance on LE speech
Optimal low cut-off found at 625 Hz
Increasing Filter Bank Resolution
[Figure: WER (%) vs. omitted band (0–13) for LE speech, filter bank starting at 625 Hz.]
Increasing Frequency Resolution
Idea – emphasize the high-information portion of the spectrum by increasing FB resolution
Experiment – FB decimation from 19 → 12 bands (decreasing computational costs)
Increasing the number of filters at the peak of the information distribution curve → deterioration of LE ASR (17.2 % → 26.9 %)
Slight F1–F2 shifts due to LE affect cepstral features
No simple recipe for deriving an efficient FB from the information distribution curves
Filter Bank Repartitioning
[Figure: WER (%) on LE speech vs. critical frequency (Hz, 500–4000) for Band 1 to Band 6.]
Consecutive Filter Bank Repartitioning
Consecutively, from lowest to highest, each FB high cut-off is varied, while the upper rest of the FB is redistributed uniformly across the remaining frequency band
The cut-off resulting in a local WER minimum is fixed and the procedure is repeated for the adjacent higher cut-off → WER reduction by 2.3 % for LE, by 1 % on neutral (example – 6-band FB)
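The repartitioning loop can be sketched as a greedy search. This is a simplified illustration: `wer_fn` is a hypothetical callback that would train and evaluate a recognizer for a given set of band edges (here replaced by a toy stand-in), the candidate cut-offs are made up, and the sketch picks the best candidate on a fixed grid rather than tracking a local minimum along a swept curve.

```python
def uniform_edges(f_start, f_end, n_bands):
    """Band edges of n_bands uniform bands between f_start and f_end."""
    step = (f_end - f_start) / n_bands
    return [f_start + i * step for i in range(n_bands + 1)]


def repartition(f_low, f_high, n_bands, candidates, wer_fn):
    """Fix each band's high cut-off in turn, lowest band first."""
    fixed = [f_low]
    for band in range(n_bands - 1):
        remaining = n_bands - band - 1
        best_cut, best_wer = None, float("inf")
        for cut in candidates:
            if not fixed[-1] < cut < f_high:
                continue
            # Bands above the trial cut-off are redistributed uniformly.
            edges = fixed + uniform_edges(cut, f_high, remaining)
            wer = wer_fn(edges)
            if wer < best_wer:
                best_cut, best_wer = cut, wer
        if best_cut is None:
            raise ValueError("no admissible candidate cut-off left")
        fixed.append(best_cut)
    return fixed + [f_high]


# Toy stand-in for a real train-and-evaluate run (hypothetical target edges).
target = [625.0, 1000.0, 1500.0, 4000.0]


def toy_wer(edges):
    return sum(abs(e - t) for e, t in zip(edges, target))


candidates = [750.0, 1000.0, 1250.0, 1500.0, 2000.0, 3000.0]
edges = repartition(625.0, 4000.0, 3, candidates, toy_wer)
```

Each call to `wer_fn` corresponds to a full recognition experiment, so the number of candidates per cut-off directly sets the cost of the search.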
Evaluation - Standard vs. Novel Features
State-of-the-Art LE Front-End – Expolog (Bou-Ghazale & Hansen, 2000)
FB redistributed to improve stressed speech recognition (including loud and Lombard speech)
Increased resolution in the area of F2 occurrence
$\mathrm{Expolog}(f) = \begin{cases} 700\left(10^{f/3988} - 1\right), & 0 \le f \le 2000\ \mathrm{Hz} \\ 2595\,\log_{10}\left(1 + \frac{f}{700}\right), & 2000 < f \le 4000\ \mathrm{Hz} \end{cases}$
[Figure: Expolog frequency (Hz) vs. linear frequency (Hz).]
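The Expolog scale can be evaluated with a few lines of code. The sketch below assumes the piecewise form usually quoted for Expolog (an exponential branch up to 2000 Hz and a mel-like logarithmic branch from 2000 to 4000 Hz); the function name and the range check are choices made for this example.

```python
import math


def expolog(f_hz):
    """Expolog-warped frequency for a linear frequency in 0-4000 Hz."""
    if not 0.0 <= f_hz <= 4000.0:
        raise ValueError("Expolog is defined here for 0-4000 Hz only")
    if f_hz <= 2000.0:
        # Exponential branch for the lower half of the band.
        return 700.0 * (10.0 ** (f_hz / 3988.0) - 1.0)
    # Logarithmic branch for the upper half of the band.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


warped = [expolog(f) for f in (0.0, 1000.0, 2000.0, 3000.0, 4000.0)]
```

The two branches almost meet at 2000 Hz, so the warping is close to continuous; filter edges spaced uniformly on this scale become denser towards 2000 Hz on the linear axis.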
Evaluation in ASR Task
Standard MFCC, PLP; variations MFCC-LPC, PLP-DCT – altered cepstrum extraction schemes
Expolog – Expolog FB replacing the trapezoid FB in PLP
20Bands-LPC – uniform rectangular FB employed in the PLP front-end
Big1-LPC – derived from 20Bands-LPC, first 3 bands merged – decreased resolution at frequencies disturbing for LE ASR
RFCC-DCT – repartitioned FB, 19 bands, starting at 625 Hz, employed in MFCC
RFCC-LPC – RFCC employed in PLP
[Figure: Features – performance on female digits. WER (%) for Neutral, LE, CLE, and CLEF0 speech with the front-ends MFCC, MFCC-LPC, PLP, PLP-DCT, Expolog, 20Bands-LPC, Big1-LPC, RFCC-DCT, RFCC-LPC.]
Performance as a Function of Utterances’ Mean F0
Mean fundamental frequency correlates with LE level → evaluation of selected front-ends on data subsets with changing mean fundamental frequency
[Figure: Fundamental frequency distribution, neutral + LE female digits. Number of samples vs. fundamental frequency (Hz), with the marker Fc.]
[Figure: WER (%) vs. center frequency (Hz) for MFCC, RFCC-LPC, PLP, Expolog, and 20Bands-LPC, together with the corresponding BL curves.]
RFCC-LPC outperforms the other approaches as the fundamental frequency increases
Performance in Noise
Set           | Neutral                         | LE                                      | NSeff (dB)
Airport       | MFCC, 20Bands–LPC, PLP          | Big1–LPC, RFCC–LPC, Expolog             | None
Babble        | MFCC, MFCC–LPC, PLP–DCT         | RFCC–LPC, Expolog (Big1–LPC), RFCC–DCT  | 10
Car2E         | Expolog, 20Bands–LPC, Big1–LPC  | RFCC–LPC, Big1–LPC, Expolog             | -5
Restaurant    | MFCC, 20Bands–LPC, MFCC–LPC     | RFCC–LPC, Big1–LPC, RFCC–DCT            | -5
Street        | 20Bands–LPC, MFCC, Expolog      | RFCC–LPC, Big1–LPC, 20Bands–LPC         | 0
Train station | 20Bands–LPC, MFCC, Expolog      | RFCC–LPC, Big1–LPC, 20Bands–LPC         | -5
Front-End Comparison on Noisy Speech
A set of noises at levels of -5, 0, 5, …, 20 dB SNR added to clean speech recordings
Simple full-wave rectification noise subtraction applied
Neutral noisy speech: either DCT- or LPC-based features perform best, depending on the noise type
LE noisy speech: LPC-based features perform best for all noise types – spectral smoothing introduced by LP modeling is more robust to glottal variations due to LE
Frequency Warping
Maximum Likelihood (ML) Approach
Vocal tract length normalization (VTLN): mean formant locations are inversely proportional to vocal tract length (VTL) → compensation for inter-speaker VTL variations by frequency transformation (warping):
$F_W = \alpha F$
Warping factor $\alpha$ searched to maximize the likelihood of the observations given the acoustic models:
$\hat{\alpha} = \arg\max_{\alpha} \Pr\left(O^{\alpha} \mid W, \Theta\right)$
Factor typically searched in the interval 0.8–1.2 (corresponds to the ratio of VTL differences in males and females)
Formant-Driven (FD) Approach
Warping factor determined from estimated mean formant locations
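The maximum likelihood search for the warping factor amounts to a grid search over the interval 0.8–1.2. The sketch below is a minimal illustration: `score_fn` is a hypothetical callback that would warp the features with a given factor and return the log-likelihood from an alignment against the acoustic models; here a toy function stands in for it, and the step of 0.02 is an assumed setting.

```python
def best_warp_factor(score_fn, lo=0.8, hi=1.2, step=0.02):
    """Return the warping factor on the grid that maximizes score_fn."""
    n_steps = int(round((hi - lo) / step))
    best_alpha, best_score = None, float("-inf")
    for i in range(n_steps + 1):
        alpha = round(lo + i * step, 10)  # avoid accumulated float error
        score = score_fn(alpha)
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha


def toy_log_likelihood(alpha, true_alpha=0.9):
    """Stand-in for an alignment log-likelihood, peaked at true_alpha."""
    return -(alpha - true_alpha) ** 2


alpha_hat = best_warp_factor(toy_log_likelihood)
```

Every grid point requires its own alignment pass in a real system, which is the main computational cost of the ML approach.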
Vocal Tract Length Normalization (ML Approach)
[Figure: VTLN principle. Formant frequencies (Hz) of the normalized speaker NORM (VTLNORM) plotted against formant frequencies (Hz) of Speaker1 (VTL1), with F1–F4 marked, for the cases VTL1 = VTLNORM, VTL1 > VTLNORM, and VTL1 < VTLNORM.]
VTLN vs. Lombard Effect
[Figure: VTLN principle plot (formant frequencies of the normalized speaker vs. Speaker1, F1–F4) for LE speech. What to choose: a good approximation of the low formants, or a good approximation of the higher formants?]
Generalized frequency transform
[Figure: The same plot with a generalized transform; case: VTL1 ? VTLNORM.]
Frequency Warping (Formant Driven Approach)
[Figure: Frequency warping function, females. Neutral domain frequency (Hz) vs. LE domain frequency (Hz); linear fit y = 1.0045x - 76.745, R² = 0.9979.]
[Figure: Frequency warping function, males. Neutral domain frequency (Hz) vs. LE domain frequency (Hz); linear fit y = 1.0217x - 50.311, R² = 0.9941.]
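A formant-driven warping function of this kind can be obtained with an ordinary least-squares line fit from LE-domain to neutral-domain frequencies. The sketch below is an illustration only: the formant pairs are synthetic (generated from the female fit y = 1.0045x - 76.745 quoted in the slides), and how the formant locations were actually paired and averaged is not specified here.

```python
def fit_line(x, y):
    """Least-squares fit y = slope * x + intercept (pure Python)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def warp(f_le_hz, slope, intercept):
    """Map an LE-domain frequency to the neutral domain."""
    return slope * f_le_hz + intercept


# Synthetic mean formant locations (Hz): LE domain and matching neutral domain.
le_formants = [500.0, 1500.0, 2500.0, 3500.0]
neutral_formants = [1.0045 * f - 76.745 for f in le_formants]

slope, intercept = fit_line(le_formants, neutral_formants)
warped_1k = warp(1000.0, slope, intercept)
```

Unlike a single VTLN scaling factor, the intercept lets the transform shift low formants by a different relative amount than high formants.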
Evaluation – ML VTLN vs. FD Generalized Transform
ML VTLN, WER (%)
Set                      | Females, Neutral | Females, LE      | Males, Neutral | Males, LE
# Digits                 | 2560             | 2560             | 1423           | 6303
Baseline                 | 4.3 (3.5–5.0)    | 33.6 (31.8–35.5) | 2.2 (1.4–2.9)  | 22.9 (21.8–23.9)
Utterance-dependent VTLN | 3.6 (2.9–4.3)    | 28.2 (26.4–29.9) | 1.8 (1.1–2.4)  | 16.6 (15.7–17.6)
Speaker-dependent VTLN   | 4.0 (3.2–4.7)    | 27.7 (26.0–29.5) | 1.8 (1.1–2.4)  | 17.4 (16.5–18.3)
FD generalized transform, WER (%)
Set           | Females, Neutral | Females, LE      | Males, Neutral | Males, LE
# Digits      | 2560             | 2560             | 1423           | 6303
Baseline bank | 4.2 (3.4–5.0)    | 35.1 (33.3–37.0) | 2.2 (1.4–2.9)  | 23.2 (22.1–24.2)
Warped bank   | 4.4 (3.6–5.2)    | 23.4 (21.8–25.0) | 1.8 (1.1–2.4)  | 15.7 (14.8–16.6)
Generalized transform better addresses LE-induced formant shifts
Formant-driven approach is less computationally demanding (no need for multiple alignment passes as in VTLN), but requires reliable formant tracking → a problem at low SNRs, where the ML approach is more stable
Two-Stage Recognition System
[Diagram: speech signal → neutral/LE classifier → neutral recognizer or LE recognizer → estimated word sequence.]
Tandem Neutral/LE Classifier – Neutral/LE Dedicated Recognizers
Improving ASR features for LE often results in a performance tradeoff on neutral speech
Idea – combining separate systems ‘tuned’ for neutral and LE speech, directed by a neutral/LE classifier
Features for Neutral/LE Classification – Spectral Slope
[Figure: Spectral slope – female vowel /a/ (two panels). Magnitude (dB) vs. log frequency (Hz), 10^0–10^4 Hz.]
Proposal of Neutral/LE Classifier
Search for a set of features providing good discriminability between neutral and LE speech
Requirements – speaker/gender/phonetic content independent classification
Extension of the set of analyzed features with the slope of short-term spectra
Mean Spectral Slopes in Voiced Male/Female Speech (0–8000 Hz)
Set | # N  | T (s) | Slope (dB/oct), neutral | σ (dB/oct) | # LE | T (s) | Slope (dB/oct), LE   | σ (dB/oct)
M   | 2587 | 618   | -7.42 (-7.48; -7.36)    | 1.53       | 3532 | 1114  | -5.32 (-5.37; -5.27) | 1.55
F   | 5558 | 1544  | -6.15 (-6.18; -6.12)    | 1.30       | 5030 | 1926  | -3.91 (-3.96; -3.86) | 1.77
Overlap of Neutral/LE Spectral Slope Distributions (%)
Set | 0–8000 Hz | 60–8000 Hz | 60–5000 Hz | 1k–5k Hz | 0–1000 Hz | 60–1000 Hz
M   | 26.00     | 28.13      | 29.47      | 100.00   | 27.81     | 27.96
F   | 26.20     | 28.95      | 16.76      | 100.00   | 25.75     | 22.18
M+F | 28.06     | 30.48      | 29.49      | 100.00   | 27.54     | 26.00
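A spectral slope in dB/oct can be estimated by fitting a straight line to the magnitude spectrum over a log2 frequency axis. The sketch below is one plausible way to do this and is an assumption, since the slides do not state the estimation method; the band limits default to 60–1000 Hz, and the toy spectrum is synthetic.

```python
import math


def spectral_slope_db_per_oct(freqs_hz, mags_db, f_min=60.0, f_max=1000.0):
    """Slope of a regression line through (log2 f, magnitude in dB)."""
    pts = [(math.log2(f), m) for f, m in zip(freqs_hz, mags_db)
           if f_min <= f <= f_max]
    if len(pts) < 2:
        raise ValueError("need at least two spectral points in the band")
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return sxy / sxx  # dB per doubling of frequency


# Toy spectrum falling by 6 dB per octave; the 2000 Hz point lies outside
# the analysis band and is ignored.
freqs = [62.5, 125.0, 250.0, 500.0, 1000.0, 2000.0]
mags = [0.0, -6.0, -12.0, -18.0, -24.0, -100.0]
slope = spectral_slope_db_per_oct(freqs, mags)
```

Restricting the band changes which part of the spectrum drives the estimate, which is why the overlap of the neutral and LE slope distributions differs between the frequency ranges in the table.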
Classification Feature Set
A feature set providing superior classification performance on the development data set was found: SNR, spectral slope (60–1000 Hz), F0, σF0
Training GMM and multi-layer perceptron (MLP) classifiers
Classifier Implementation
Binary Classification Task
Decision based on the likelihood ratio of the two hypotheses for the acoustic observation (classification feature vector) $\mathbf{o}_i$: $\frac{P(\mathbf{o}_i \mid H_1)}{P(\mathbf{o}_i \mid H_0)}$
GMM Classifier
Acoustic observation (classification feature vector) → GMM_N, GMM_LE → Pr(N), Pr(LE)
$P(\mathbf{o}_i) = \frac{1}{\sqrt{(2\pi)^n \lvert\Sigma\rvert}}\, e^{-\frac{1}{2}(\mathbf{o}_i - \mu)^T \Sigma^{-1} (\mathbf{o}_i - \mu)}$
MLP Classifier
Classification feature vector → MLP → Pr(N), Pr(LE)
(Sigmoid) $f_1(q_j) = \frac{1}{1 + e^{-q_j}}$
(Softmax) $f_2(q_i) = \frac{e^{q_i}}{\sum_{j=1}^{M} e^{q_j}}$
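A GMM-based neutral/LE decision can be sketched as a comparison of utterance log-likelihoods under the two class models. The example below is a minimal illustration with diagonal covariances; the two-dimensional feature vectors (SNR in dB, spectral slope in dB/oct) and all model parameters are made-up values, not the trained models from the study.

```python
import math


def diag_gauss_loglik(o, mean, var):
    """Log-density of a Gaussian with diagonal covariance."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (x - m) ** 2 / v
                      for x, m, v in zip(o, mean, var))


def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def gmm_loglik(o, gmm):
    """gmm: list of (weight, mean, var) components."""
    return log_sum_exp([math.log(w) + diag_gauss_loglik(o, m, v)
                        for w, m, v in gmm])


def classify(utterance, gmm_neutral, gmm_le):
    """Label an utterance (list of feature vectors) as 'neutral' or 'LE'."""
    ll_n = sum(gmm_loglik(o, gmm_neutral) for o in utterance)
    ll_le = sum(gmm_loglik(o, gmm_le) for o in utterance)
    return "LE" if ll_le > ll_n else "neutral"


# Hypothetical single-component models: [SNR (dB), spectral slope (dB/oct)].
gmm_neutral = [(1.0, [30.0, -6.5], [25.0, 2.0])]
gmm_le = [(1.0, [45.0, -4.5], [25.0, 2.5])]

label_a = classify([[44.0, -4.0], [46.0, -5.0]], gmm_neutral, gmm_le)
label_b = classify([[29.0, -7.0]], gmm_neutral, gmm_le)
```

Comparing the two summed log-likelihoods is equivalent to thresholding the log-likelihood ratio at zero; a different threshold would trade neutral errors against LE errors.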
Classification Feature Set
[Figures: GMM PDFs (PDF_N, PDF_LE) and ANN posteriors (Pr(N), Pr(LE)) plotted over normalized histograms (number of samples) of the development data (Dev_N_M+F, Dev_LE_M+F) for SNR (dB), spectral slope (dB/oct), and two F0-based features (F0 (Hz) axes spanning 0–500 Hz and 0–120 Hz).]
Classifier Evaluation
Classification Data Sets
Set   | # Utterances | mean T_Utter (s) | σ T_Utter (s)
Devel | 2472         | 4.10             | 1.60
Open  | 1371         | 4.01             | 1.50
Classification Performance
UER – Utterance Error Rate – ratio of incorrectly classified utterances to all utterances
GMM
Set          | Devel FM      | Open FM       | Devel DM      | Open DM
# Utterances | 2472          | 1371          | 2472          | 1371
UER (%)      | 6.6 (5.6–7.6) | 2.5 (1.7–3.3) | 8.1 (7.0–9.2) | 2.8 (1.9–3.6)
MLP
Set          | Train          | CV            | Open
# Utterances | 2202           | 270           | 1371
UER (%)      | 9.9 (8.7–11.1) | 5.6 (2.8–8.3) | 1.6 (0.9–2.3)
Two-Stage Recognition System: Two-Stage Recognizer (TSR)

WER (%)
Set               Real – neutral   Real – LE
# Female digits   1439             1837
PLP               4.3 (3.3–5.4)    48.1 (45.8–50.4)
RFCC–LPC          6.5 (5.2–7.7)    28.3 (26.2–30.4)
MLP TSR           4.2 (3.2–5.3)    28.4 (26.4–30.5)
FM–GMLC TSR       4.4 (3.3–5.4)    28.4 (26.4–30.5)
DM–GMLC TSR       4.4 (3.3–5.4)    28.4 (26.3–30.4)

Discrete recognizers: each is good on either neutral or LE speech.

System structure: Speech Signal → Neutral/LE Classifier → Neutral Recognizer or LE Recognizer → Estimated Word Sequence
TSR: when exposed to a mixture of neutral and LE speech, it provides the best of both discrete recognizers.
Only neutral speech data are required for training the acoustic models.
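The routing idea behind the TSR can be sketched as follows. This is an illustration under stated assumptions, not the system from the slides: the classifier here is a pair of single one-dimensional Gaussians over frame-level F0 (the slides use GMM and MLP classifiers and show spectral slope as well as F0), and the model parameters, the names `Gaussian1D`, `classify_style`, `two_stage_recognize`, and the recognizer callables are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class Gaussian1D:
    """Single Gaussian over one feature (here: F0 in Hz). Illustrative only."""
    mean: float
    std: float

    def log_pdf(self, x: float) -> float:
        z = (x - self.mean) / self.std
        return -0.5 * z * z - math.log(self.std * math.sqrt(2.0 * math.pi))


# A recognizer is any callable mapping an utterance's features to a word string.
Recognizer = Callable[[Sequence[float]], str]


def classify_style(f0_track: Sequence[float],
                   neutral_model: Gaussian1D,
                   le_model: Gaussian1D) -> str:
    """Stage 1: label the utterance 'neutral' or 'LE' by total log-likelihood."""
    ll_neutral = sum(neutral_model.log_pdf(f) for f in f0_track)
    ll_le = sum(le_model.log_pdf(f) for f in f0_track)
    return "LE" if ll_le > ll_neutral else "neutral"


def two_stage_recognize(f0_track: Sequence[float],
                        neutral_model: Gaussian1D,
                        le_model: Gaussian1D,
                        neutral_recognizer: Recognizer,
                        le_recognizer: Recognizer) -> Tuple[str, str]:
    """Stage 2: send the utterance to the recognizer matching its style."""
    style = classify_style(f0_track, neutral_model, le_model)
    recognizer = le_recognizer if style == "LE" else neutral_recognizer
    return style, recognizer(f0_track)
```

In a real system the two recognizers would be full ASR decoders with different front-ends, and the classifier would see richer features than the recognizers' input. The sketch passes the same F0 track to both only to stay self-contained.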
Conclusions: Proposed Equalization Techniques
Acoustic Model Adaptation
Neutral acoustic models are adapted to LE; speaker-dependent and group-dependent adaptation approaches are proposed.
Assumptions: LE-level dependent; adaptation data for a given speaker/LE level are available, together with their transcriptions.
Voice Conversion
Excitation and vocal tract components of LE speech are transformed towards neutral in the ASR front-end.
Assumptions: LE-level dependent; parallel training data are available for each speaker; a speaker identification system that chooses from the codebook of speaker-dependent transforms is available; increased conversion accuracy is required.
Data-Driven Design of Robust Features
The contribution of frequency sub-bands to speech recognition performance is studied; novel filter banks for MFCC- and PLP-based front-ends are designed.
Assumptions: LE-level dependent; gender classification is required to pick gender-dependent features.
Frequency Warping
Modified vocal tract length normalization (VTLN) and generalized formant-driven frequency warping are proposed.
Assumptions: LE-level independent; transform parameters adapt on the fly.
Two-Stage Recognition System
A neutral/LE classifier is proposed and used to direct incoming speech to the matching neutral- or LE-dedicated recognizer.
Assumptions: LE-level independent; extending the codebook with LE-level-dependent recognizers would further improve performance under changing LE levels.
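For readers unfamiliar with frequency warping, the sketch below shows the common piecewise-linear VTLN warping function. It is the textbook form, not the modified VTLN or the formant-driven warping proposed in the talk, whose details are not given on these slides. The function name `vtln_warp` and the parameters `alpha`, `f_cut`, and `f_max` are assumptions made for illustration.

```python
def vtln_warp(f, alpha, f_max, f_cut):
    """Piecewise-linear VTLN frequency warping (textbook form).

    Below f_cut the frequency axis is scaled by alpha. Above f_cut a second
    linear segment joins (f_cut, alpha * f_cut) to (f_max, f_max), so the
    warped axis still ends at f_max and the mapping stays continuous.
    """
    if not 0.0 < f_cut < f_max:
        raise ValueError("f_cut must lie strictly between 0 and f_max")
    if alpha <= 0.0 or alpha * f_cut >= f_max:
        raise ValueError("alpha must be positive with alpha * f_cut < f_max")
    if not 0.0 <= f <= f_max:
        raise ValueError("f must lie in [0, f_max]")

    if f <= f_cut:
        return alpha * f

    # Slope of the upper segment, chosen so that f_max maps to f_max.
    upper_slope = (f_max - alpha * f_cut) / (f_max - f_cut)
    return alpha * f_cut + upper_slope * (f - f_cut)
```

In a typical VTLN setup, the warp is applied to the filter-bank centre frequencies, and `alpha` is picked per speaker or per utterance from a small grid by maximizing the acoustic-model likelihood. An adaptive choice of this kind is one way to read "transform parameters adapt on the fly", though the slides do not spell out the procedure.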
Conclusions: Proposed Techniques – Performance Comparison
[Figure: Comparison of proposed techniques for LE-robust ASR. Bar chart of WER (%), y-axis 0–60, with series Baseline Neutral, Baseline LE, and LE Suppression for each technique: Model Adapt to LE - SI; Model Adapt to LE - SD; Voice Conversion - CLE; Modified FB - RFCC-LPC; VTLN Recognition - Utt. Dep. Warp; Formant Warping; MLP TSR]
Thank You for Your Attention!
References
PhD Thesis
Hynek Bořil. Robust Speech Recognition: Analysis and Equalization of Lombard Effect in Czech Corpora. Czech Technical University in Prague, 2008.
http://www.utdallas.edu/~hxb076000