Reverberant Speech Processing for Human Communication
and Automatic Speech Recognition
Tomohiro Nakatani, Armin Sehr, Walter Kellermann
NTT Communication Science Laboratories
LMS, University of Erlangen-Nuremberg
March 26, 2012
Generic Scenario: Natural Interactive Human/Machine Interface
Mobile users, distant microphones/loudspeakers
[Figure: block labeled "Digital Signal Processing"]
Tasks:
• Rendering - Reproduce desired signals at distant ears
• Acquisition - Localize sources and capture clean signals from distance
Challenges:
• Feedback of loudspeaker signals
• Noise and interferers
• Reverberation
Applications
Hands-free equipment for telecommunication and natural human/machine interaction
• for mobile phones / smart phones, mobile computing devices, PDAs
• in car interiors (’command & control’, telecommunication, in-car communication, ...)
• for desktop computers, info-/edutainment terminals, interactive TV, game stations, simulators
• for telepresence systems (offices, ..., classrooms, ..., auditoria)
• for ambient communication (smart meeting rooms, smart homes, information kiosks, museums and exhibitions, ...)
• for voice-driven navigation systems in cars, operating rooms, ...
Applications (cont’d)
Professional audio equipment
• for stages and recording studios
• virtual acoustic environments (virtual concert halls, telepresence studios, ...)
Safety and surveillance
• acoustic displays in control centers, cockpits
• monitoring in health care environments (advanced ’babyphones’)
• acoustic scene analysis (train stations, ...)
Another Scenario: ’Listening devices’
[Figure: block labeled "DSP"]
Tasks:
• Rendering - Reproduce undistorted signals with binaural cues
• Acquisition - Localize desired source(s) and enhance desired signal(s)
Challenges:
• Loudspeaker feedback (howling)
• Noise and interferers
• Reverberation
Applications
• Hearing aids, of course
• Headsets, e.g., for mobile phones, mobile computing devices, personal digital assistants
• Hearing protection in noisy environments (construction work, mining, ...)
• Active noise cancellation systems
• ...
Example 1: DICIT - an Interactive TV system
Voice-controlled home entertainment system (EU project DICIT, 2005-2009; see e.g., Marquardt et al., 2009; YouTube)
featuring
• Multichannel AEC (GFDAF, Buchner/Benesty et al., 2003ff)
• Multibeamforming (Mabande et al., 2009; Kellermann, 1997)
• Source localization (GCF; Brutti et al., 2007)
• Speech/non-speech classification (Omologo, 2009)
• Noise-robust automatic speech recognition (ViaVoice, IBM 2009)
Challenge: Reverberation for large source distances in more reverberant rooms
Example 2: Meeting recognition system
Who Spoke When, What, To Whom, and How?
• Real-time Meeting Browser
• Recognize speech and other audio events
Example 3: Audio postproduction system
[Movies/TV creation]
Step 1: Sound & video recording (on location)
[Figure: microphone(s) and actor/actress]
Step 2: Audio post-production (de-noising, de-reverb, sound effects)
Overview
Part I: Introduction
• Fundamentals
• Approaches
Part II: Multichannel blind inverse filtering
• Example applications
  ⊲ Professional audio post production
  ⊲ Meeting speech recognition with microphone arrays
• Fundamentals: Dereverberation with inverse filtering
  ⊲ What is ’inverse’ filtering?
  ⊲ Robust ’approximate’ inverse filtering
• Blind inverse filtering
  ⊲ Overview of basic approaches
  ⊲ Closer look: multichannel linear prediction with time-varying source model
• Integration with blind source separation
Part III: Robust ASR in reverberant environments
• Feature-based approaches
  ⊲ Cepstral mean normalization
  ⊲ Model-based feature enhancement
• Model-based approaches
  ⊲ Matched training
  ⊲ Multi-style training
  ⊲ Adaptive training
  ⊲ MAP and MLLR adaptation
  ⊲ Parametric adaptation tailored to reverberation
  ⊲ Frame-wise adaptation
• Decoder-based approaches
  ⊲ Missing feature techniques
  ⊲ Uncertainty decoding
• A generic approach: REMOS
Part IV: Summary, Conclusions, and Outlook
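Fundamental Signal Processing Problems - Formulation
Linear MIMO system $W$ (’multiple input / multiple output’):
$$\begin{pmatrix} v \\ y \end{pmatrix} = W * \begin{pmatrix} u \\ x \end{pmatrix} = \begin{pmatrix} W_{vu} & W_{vx} \\ W_{yu} & W_{yx} \end{pmatrix} * \begin{pmatrix} u \\ x \end{pmatrix}$$
Listeners’ signals:
$$z = H_{zv} * v + n_z$$
Microphone signals:
$$x = H_{xs} * s + H_{xv} * v + n_x$$
[Figure: block diagram of the digital signal processing unit $W$ with inputs $u$, $x$ and outputs $v$, $y$ (channel counts K, L, N, P); users $S_1, \dots, S_M$ with source signals $s_1, \dots, s_M$ and ear signals $z_1, \dots, z_{2M}$; noise $n$; acoustic paths $H_{xs}$, $H_{xv}$, $H_{zv}$]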
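The microphone model above can be simulated numerically. The following is a minimal sketch, assuming every path is an FIR filter of the same length and that signals are stored as NumPy arrays with one row per channel; the function names `mimo_convolve` and `microphone_signals` and the array layout are illustrative choices, not part of the slides.

```python
import numpy as np

def mimo_convolve(H, s):
    """Apply a MIMO FIR system: out[p] = sum over q of H[p, q] * s[q].

    H: array of shape (n_out, n_in, n_taps), one impulse response per path.
    s: array of shape (n_in, n_samples).
    Returns an array of shape (n_out, n_samples + n_taps - 1).
    """
    n_out, n_in, n_taps = H.shape
    out = np.zeros((n_out, s.shape[1] + n_taps - 1))
    for p in range(n_out):
        for q in range(n_in):
            # Linear (full) convolution of one path with one input channel.
            out[p] += np.convolve(H[p, q], s[q])
    return out

def microphone_signals(H_xs, s, H_xv, v, n_x):
    """x = H_xs * s + H_xv * v + n_x (assumes all three terms have equal shape)."""
    return mimo_convolve(H_xs, s) + mimo_convolve(H_xv, v) + n_x
```

Here `s` and `v` are assumed to have the same number of samples and `n_x` the shape of the convolved output, so that the three terms can simply be added.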
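Fundamental Problems for Signal Acquisition
[Figure: block diagram with sources $S_1, \dots, S_M$ (signals $s_1, \dots, s_M$), noise $n$, acoustic paths $H_{xs}$ and $H_{xv}$, and filters $W_{vu}$, $W_{yx}$, $W_{yu}$ linking the signals $u$, $v$, $x$, $y$ (channel counts K, L, N, P)]
Goal: Undistorted source signals
$$y = W_{yu} * u + W_{yx} * x \overset{!}{=} s * \delta(k - k_0)$$
where $x = H_{xs} * s + H_{xv} * v + n_x$
3 Subproblems:
• Echo cancellation: $(W_{yu} + W_{yx} * H_{xv} * W_{vu}) * u = 0$
• Source separation and dereverberation: $W_{yx} * H_{xs} * s = s * \delta(k - k_0)$
• Noise and interference suppression: $W_{yx} * n_x = 0$
Components of $x$, i.e., $H_{xs} * s$, $H_{xv} * v$, $n_x$, must be separated by $W$!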
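The three subproblems follow from substituting the microphone model into the output equation. A short derivation, assuming the loudspeaker signals are generated from $u$ alone, i.e. $v = W_{vu} * u$ with $W_{vx} = 0$ (an assumption that the echo-cancellation condition implies but the slide does not state explicitly):

```latex
\begin{align*}
y &= W_{yu} * u + W_{yx} * x \\
  &= W_{yu} * u + W_{yx} * \left( H_{xs} * s + H_{xv} * W_{vu} * u + n_x \right) \\
  &= \underbrace{\left( W_{yu} + W_{yx} * H_{xv} * W_{vu} \right) * u}_{\text{echo, should be } 0}
   + \underbrace{W_{yx} * H_{xs} * s}_{\text{should be } s * \delta(k - k_0)}
   + \underbrace{W_{yx} * n_x}_{\text{noise, should be } 0}
\end{align*}
```

Each underbraced term corresponds to one subproblem, so meeting all three conditions at once yields $y = s * \delta(k - k_0)$.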
Fundamentals - Room Impulse Response (RIR) properties
Elements of $H_{zv}$, $H_{xv}$, $H_{xs}$ are room impulse responses (RIRs).
Typical structure of RIRs:
• Direct sound
• Early reflections
• Late reverberation
[Figure: RIR h over time t, showing the three segments]
Main characteristic parameters:
• T60: Time for exponential decay of envelope by 60 dB
• DRR: Direct-to-Reverberant (Energy) Ratio
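As an illustration of the DRR, the following sketch computes it from a sampled RIR. It assumes that the direct sound is the largest-magnitude tap and that everything outside a short window around it counts as reverberant energy; the window length `direct_window_ms` and the function name are hypothetical choices, not taken from the slides.

```python
import numpy as np

def direct_to_reverberant_ratio(h, fs, direct_window_ms=2.5):
    """DRR in dB: energy around the direct-path peak over the remaining energy.

    Assumption: the direct path is the largest-magnitude tap of h, and the
    direct sound lies within +/- direct_window_ms of that tap.
    """
    h = np.asarray(h, dtype=float)
    peak = int(np.argmax(np.abs(h)))
    half = int(round(direct_window_ms * 1e-3 * fs))
    lo, hi = max(peak - half, 0), min(peak + half + 1, len(h))
    direct = np.sum(h[lo:hi] ** 2)
    reverberant = np.sum(h ** 2) - direct
    return 10.0 * np.log10(direct / reverberant)
```

An RIR with no energy outside the window makes the denominator zero, so a practical version would need to guard that case.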
Fundamentals - Room Impulse Response (RIR) properties
• Reverberation time T60
  ⊲ car ≈ 50 ms
  ⊲ concert halls ≈ 1 ... 2 s
• FIR models
  ⊲ typically $L_H \approx T_{60} \cdot f_s / 3$ coefficients
  ⊲ nonminimum-phase
  ⊲ many zeros close to unit circle
Example: Office 5.5 m × 3 m × 2.8 m, T60 ≈ 300 ms, fs = 12 kHz.
[Figure, left: impulse response over taps (0 to 5000). Figure, right: complex-plane plot with axes "Real part" and "Imaginary part"]
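The FIR rule of thumb can be checked against the office example: 0.3 s · 12000 Hz / 3 gives about 1200 coefficients. The sketch below does that arithmetic and adds a simple minimum-phase test on toy filters; the helper names are illustrative, and the toy coefficient vectors are not RIRs from the slides.

```python
import numpy as np

def fir_length(t60_s, fs_hz):
    """Rule of thumb from the slide: L_H is roughly T60 * fs / 3 coefficients."""
    return int(round(t60_s * fs_hz / 3.0))

def is_minimum_phase(h):
    """True if all zeros of the FIR transfer function lie strictly inside the unit circle."""
    zeros = np.roots(h)
    return bool(np.all(np.abs(zeros) < 1.0))

print(fir_length(0.3, 12_000))             # office example -> 1200
print(is_minimum_phase([1.0, -0.5]))       # single zero at 0.5 -> True
print(is_minimum_phase([1.0, -2.5, 1.0]))  # zeros at 0.5 and 2.0 -> False
```

A nonminimum-phase response such as the last one has no stable causal inverse, which is why the later parts of the tutorial turn to approximate and multichannel inverse filtering.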
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 14
![Page 54: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/54.jpg)
![Page 55: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/55.jpg)
![Page 56: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/56.jpg)
![Page 57: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/57.jpg)
Fundamentals - RIR properties (cont’d)
RIRs for varying source-mic distance (d1 = 1 m vs. d2 = 4 m, T60 ≈ 900 ms)
[Figures: RIRs h(n) over t in s (0 to 1 s) for distance 4 m and distance 1 m]
Energy decay curves (EDC [Schröder 1965])
[Figure: energy decay curve in dB (0 to −40 dB) over t in s (0 to 0.5 s), distance 4 m vs. distance 1 m]
DRR (Direct-to-Reverberant Energy Ratio): 4.9 dB / −4.0 dB
RIR, DRR ⇔ Reverberation time T60
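How an EDC and a DRR are obtained from an RIR can be sketched as follows. This is an illustration under stated assumptions, not the computation behind the slide: the RIR is a synthetic noise tail with exponential decay, the direct path is a single tap scaled by 1/distance, the direct-path window of ±2.5 ms and the fit range of −5 to −25 dB are choices made here, and all function names are hypothetical.

```python
import math
import random

def schroeder_edc_db(h):
    """Energy decay curve by backward integration of h^2, in dB re total energy."""
    tail = 0.0
    edc = [0.0] * len(h)
    for n in range(len(h) - 1, -1, -1):
        tail += h[n] * h[n]
        edc[n] = tail
    total = edc[0]
    return [10.0 * math.log10(max(e / total, 1e-30)) for e in edc]

def t60_from_edc(edc_db, fs, hi_db=-5.0, lo_db=-25.0):
    """Least-squares line through the EDC between hi_db and lo_db, extrapolated to -60 dB."""
    pts = [(n / fs, v) for n, v in enumerate(edc_db) if lo_db <= v <= hi_db]
    mt = sum(t for t, _ in pts) / len(pts)
    mv = sum(v for _, v in pts) / len(pts)
    slope = sum((t - mt) * (v - mv) for t, v in pts) / sum((t - mt) ** 2 for t, _ in pts)
    return -60.0 / slope

def drr_db(h, fs, direct_ms=2.5):
    """Energy within +/- direct_ms of the strongest tap vs. all remaining energy."""
    peak = max(range(len(h)), key=lambda n: abs(h[n]))
    half = int(direct_ms * 1e-3 * fs)
    lo, hi = max(0, peak - half), min(len(h), peak + half + 1)
    direct = sum(v * v for v in h[lo:hi])
    reverb = sum(v * v for v in h) - direct
    return 10.0 * math.log10(direct / reverb)

fs = 8000                      # assumed sampling rate
t60_true = 0.5                 # assumed reverberation time of the toy room
rng = random.Random(0)
decay = 3.0 * math.log(10.0) / (t60_true * fs)   # amplitude reaches -60 dB at t60_true
tail = [rng.gauss(0.0, 1.0) * math.exp(-decay * n) for n in range(int(2 * t60_true * fs))]

def toy_rir(distance_m, direct_gain=40.0):
    """Direct tap scaled by 1/distance, followed by a distance-independent tail (assumption)."""
    return [direct_gain / distance_m] + tail

rir_near, rir_far = toy_rir(1.0), toy_rir(4.0)
t60_est = t60_from_edc(schroeder_edc_db(rir_far), fs)
drr_near, drr_far = drr_db(rir_near, fs), drr_db(rir_far, fs)
print(round(t60_est, 2), round(drr_near, 1), round(drr_far, 1))
```

In this toy model the decay rate, and hence the estimated T60, is the same for both distances, while the DRR drops as the source moves away.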
![Page 58: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/58.jpg)
Fundamentals - RIR properties (cont’d)
Variability with displacements:
Mic displacement 4.2 cm (source distance d = 4 m):
[Figures: RIR 1, RIR 2, and the difference between RIR 1 and RIR 2; h(n) over t in s]
System error norm: 0.23 dB
Shift of RIR by 1 sample:
[Figures: RIR 1 and the difference between RIR 1 and RIR 1 shifted by 1 sample; h(n) over t in s]
System error norm: 2.56 dB
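The sensitivity to a one-sample shift can be reproduced with a toy RIR. The slide does not define the system error norm; the sketch below assumes the common normalized form 10·log10(‖h1 − h2‖² / ‖h1‖²). The RIR is a synthetic decaying noise sequence, and the function name is hypothetical.

```python
import math
import random

def system_error_norm_db(h_ref, h_other):
    """Assumed definition: 10*log10(||h_ref - h_other||^2 / ||h_ref||^2)."""
    err = sum((a - b) ** 2 for a, b in zip(h_ref, h_other))
    ref = sum(a * a for a in h_ref)
    return 10.0 * math.log10(err / ref)

rng = random.Random(0)
# Hypothetical stand-in for a measured RIR: white Gaussian noise with a slow decay
rir = [rng.gauss(0.0, 1.0) * math.exp(-n / 2000.0) for n in range(4000)]

shifted = [0.0] + rir[:-1]        # same RIR delayed by one sample
scaled = [0.9 * v for v in rir]   # same shape with a 10 % gain error

norm_shifted = system_error_norm_db(rir, shifted)
norm_scaled = system_error_norm_db(rir, scaled)
print(round(norm_shifted, 2), round(norm_scaled, 2))
```

For a noise-like RIR, neighbouring taps are nearly uncorrelated, so the difference to the shifted copy carries about twice the energy of the RIR itself (roughly 3 dB under the assumed definition), whereas a pure gain error of 10 % stays at −20 dB.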
![Page 59: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/59.jpg)
Fundamentals - Reverberation in signal representations
Clean vs. reverberated with T60 ≈ 900 ms, d1 = 4 m and T60 ≈ 3.1 s, d2 = 5 m
Time-domain:
[Figures: clean signal s_t and the two reverberated signals x_t over t in s]
Pauses filled!
STFT domain:
[Figures: three spectrograms, f in Hz (0 to 6000) over t in s, color scale −80 to 0]
Pauses filled!
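The "pauses filled" effect follows directly from convolving speech with a long RIR. A minimal sketch with a toy signal: two tone bursts separated by silence stand in for speech, the RIR is a synthetic decaying noise tail, and the sampling rate is kept low only to make the pure-Python convolution fast. All names and values other than T60 ≈ 900 ms are assumptions.

```python
import math
import random

def convolve(x, h):
    """Direct-form linear convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        if xi == 0.0:
            continue  # silent samples contribute nothing
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def energy_db(seg, ref):
    """Segment energy in dB relative to ref; -inf for an all-zero segment."""
    e = sum(v * v for v in seg)
    return 10.0 * math.log10(e / ref) if e > 0.0 else float("-inf")

fs = 2000                      # assumed low sampling rate for a fast toy example
t60 = 0.9                      # reverberation time in seconds, as on the slide
burst = [math.sin(2.0 * math.pi * 200.0 * n / fs) for n in range(200)]  # 100 ms tone
pause = [0.0] * 800            # 400 ms of silence
clean = burst + pause + burst

# Toy RIR: unit direct path plus a noise tail whose amplitude reaches -60 dB at t60
rng = random.Random(1)
decay = 3.0 * math.log(10.0) / (t60 * fs)
rir = [1.0] + [0.1 * rng.gauss(0.0, 1.0) * math.exp(-decay * n)
               for n in range(1, int(t60 * fs))]

reverberant = convolve(clean, rir)[:len(clean)]
burst_energy = sum(v * v for v in burst)
pause_clean_db = energy_db(clean[200:1000], burst_energy)
pause_reverb_db = energy_db(reverberant[200:1000], burst_energy)
print(pause_clean_db, round(pause_reverb_db, 1))
```

The pause is exactly silent in the clean signal but carries the decaying tail of the first burst after convolution, which is what the time-domain and STFT plots show.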
![Page 60: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/60.jpg)
Fundamentals - Reverberation in ASR features
Clean vs. reverberated with T60 ≈ 900 ms, d1 = 4 m and T60 ≈ 3.1 s, d2 = 5 m
Logmelspec domain:
[Figures: three log-mel spectrograms, mel channel (5 to 20) over t in s, color scale −16 to 4]
Pauses filled!
MFCC domain:
[Figures: three MFCC plots, cepstral coefficient (0 to 10) over t in s]
Pauses of c0 filled!
![Page 61: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/61.jpg)
![Page 62: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/62.jpg)
![Page 63: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/63.jpg)
![Page 64: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/64.jpg)
Dereverberation for Speech Enhancement
Basic Idea: Separate speech production from RIR, equalize the latter
[Diagram: glottal excitation → vocal tract (to be preserved) → room (to be equalized)]
'Blind' problem! (no reference signal for RIR input)
Distinction:
Partial Deconvolution (removes reverberation by RIR inversion, ideally without speech distortion)
⇕
Reverberation Suppression (compromise between dereverberation and signal distortion necessary)
![Page 65: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/65.jpg)
![Page 66: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/66.jpg)
![Page 67: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/67.jpg)
![Page 68: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/68.jpg)
Dereverberation for Signal Enhancement (cont’d)
Dealing with 'Blindness' by exploiting
Prior Knowledge on
A, Speech production models (e.g., source-filter model, HMM) and signal properties (nonwhiteness, nonstationarity, nongaussianity)
B, Room acoustics parameter (e.g., T60)
C, Location and radiation characteristics of speech source
and some Useful Assumptions
D, Joint moments (e.g., correlation) of signal samples: small lags characterize speech ⇔ large lags characterize reverberation
E, Speech signal statistics change faster than RIRs
F, Multichannel recordings: speech component is the same ⇔ RIRs are different
![Page 69: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/69.jpg)
Dereverberation for Signal Enhancement - Approaches
Signal Dereverberation
• Partial Deconvolution: single-channel, multichannel
• Reverberation Suppression: single-channel, multichannel
![Page 70: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/70.jpg)
![Page 71: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/71.jpg)
![Page 72: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/72.jpg)
Dereverberation - Single-Channel Partial Deconvolution
Single-channel partial deconvolution
• Can exploit speech models and properties (A) and correlation and stationarity assumptions (D, E) for identifying RIR estimate
• Inversion of a single RIR involves [Neely 1979]
  ⊲ removing the allpass component of the nonminimum-phase RIR → approximated by delay
  ⊲ inverting zeros close to, or on unit circle → approximation by 'channel shortening'
  ⊲ for realization problems see, e.g., [Mourjopoulos 1994], [Naylor 2010]
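Why a nonminimum-phase RIR resists single-channel inversion can be seen with two-tap toy filters. The sketch below computes the impulse response of the causal recursive inverse 1/H(z); the filters and the function name are illustrative assumptions, not RIRs from the slides.

```python
def causal_inverse_ir(h, n_taps):
    """First n_taps of the causal inverse g of FIR h, i.e. (h * g)[n] = delta[n]; needs h[0] != 0."""
    g = []
    for n in range(n_taps):
        acc = 1.0 if n == 0 else 0.0
        for k in range(1, min(n, len(h) - 1) + 1):
            acc -= h[k] * g[n - k]
        g.append(acc / h[0])
    return g

min_phase = [1.0, 0.5]       # zero at z = -0.5, inside the unit circle
non_min_phase = [1.0, 2.0]   # zero at z = -2, outside the unit circle

g_min = causal_inverse_ir(min_phase, 20)
g_non = causal_inverse_ir(non_min_phase, 20)
print(abs(g_min[-1]), abs(g_non[-1]))
```

The inverse of the minimum-phase filter decays, while the causal inverse of the nonminimum-phase filter grows without bound. A stable inverse of the latter exists only as an anticausal sequence, so a practical causal realization has to accept a delay and truncation.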
![Page 73: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/73.jpg)
![Page 74: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/74.jpg)
![Page 75: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/75.jpg)
![Page 76: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/76.jpg)
![Page 77: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/77.jpg)
Dereverberation - Multichannel Partial Deconvolution
Multichannel partial deconvolution
• Can additionally exploit spatial diversity (incl. assumption F) and prior knowledge of source location and radiation characteristic (C) for identifying RIR
• Spatial diversity facilitates RIR identification
• Perfect inversion with FIR filters is possible (MINT [Miyoshi 1988])
  ⊲ exact knowledge of RIR lengths required
  ⊲ no common zeros of RIRs allowed
• Indirect approaches often invert in subbands for robustness (e.g. [Naylor 2005])
• Direct approaches to identify a robust inverse exist (e.g. [Buchner 2004], [Buchner 2010], and below!)
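The MINT condition can be illustrated with a two-channel toy case. The sketch assumes known, equal-length RIRs without common zeros, chooses inverse filters one tap shorter than the RIRs so that the system is square, and solves it with a small Gaussian elimination. The RIR values and function names are hypothetical; this shows the principle only, not a practical implementation.

```python
def convolve(a, b):
    """Direct-form linear convolution."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small square system."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[piv][col]) < 1e-12:
            raise ValueError("singular system (e.g. RIRs with a common zero)")
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def mint_two_channel(h1, h2):
    """Solve h1*w1 + h2*w2 = delta for two equal-length RIRs, len(w) = len(h) - 1."""
    L = len(h1)
    Lw = L - 1
    n_eq = L + Lw - 1              # length of each convolution, equals 2 * Lw
    A = [[0.0] * (2 * Lw) for _ in range(n_eq)]
    for k in range(Lw):            # column k: h1 delayed by k; column Lw + k: h2 delayed by k
        for i in range(L):
            A[i + k][k] = h1[i]
            A[i + k][Lw + k] = h2[i]
    target = [1.0] + [0.0] * (n_eq - 1)
    w = solve(A, target)
    return w[:Lw], w[Lw:]

h1 = [1.0, -2.5, 1.0]    # zeros at 2 and 0.5: nonminimum-phase
h2 = [1.0, 0.2, -0.48]   # zeros at 0.6 and -0.8: no zero shared with h1
w1, w2 = mint_two_channel(h1, h2)
equalized = [a + b for a, b in zip(convolve(h1, w1), convolve(h2, w2))]
print([round(v, 6) for v in equalized])
```

Although h1 alone has no stable causal inverse, the pair is equalized exactly with two-tap FIR filters. If the RIRs shared a zero, the matrix would be singular, which is the "no common zeros" condition on the slide.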
![Page 78: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/78.jpg)
![Page 79: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/79.jpg)
![Page 80: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/80.jpg)
Dereverberation - Single-Channel Reverberation Suppression
Single-channel Reverberation Suppression
• can exploit speech models and properties (A) and correlation and stationarity assumptions (D, E), e.g., for
  ⊲ equalizing the vocal tract IR and suppressing reverberation in the LPC residual (e.g., [Yegnanarayana 2000], [Gaubitch 2006])
• can exploit prior knowledge on room acoustics (e.g., T60) to estimate PSD of reverberation and use spectral subtraction methods as common for additive noise (e.g., [Lebart 2001])
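The T60-based spectral subtraction idea can be sketched for a single frequency bin. The sketch assumes an exponential decay model in which the late reverberant PSD at a frame is the observed PSD from a fixed delay earlier, attenuated by exp(−2ΔT_d) with Δ = 3 ln 10 / T60. It is written in the spirit of the cited approach, not as a reproduction of it; the delay, frame shift, gain floor, and function names are assumptions.

```python
import math

def late_reverb_psd(power, t60_s, frame_shift_s, delay_frames):
    """Late-reverberation PSD estimate: delayed, attenuated copy of the observed PSD."""
    delta = 3.0 * math.log(10.0) / t60_s
    atten = math.exp(-2.0 * delta * delay_frames * frame_shift_s)
    return [atten * power[l - delay_frames] if l >= delay_frames else 0.0
            for l in range(len(power))]

def suppression_gains(power, reverb_psd, floor=0.1):
    """Power spectral subtraction gain per frame, limited from below by a floor."""
    gains = []
    for p, r in zip(power, reverb_psd):
        g = 1.0 - r / p if p > 0.0 else floor
        gains.append(max(g, floor))
    return gains

t60 = 0.5            # assumed known reverberation time in seconds
frame_shift = 0.01   # assumed 10 ms frame shift
delay_frames = 5     # assumed 50 ms boundary between early and late reverberation

# Toy bin: 5 frames of constant "speech" power, then a purely reverberant decay
delta = 3.0 * math.log(10.0) / t60
power = [1.0] * 5 + [math.exp(-2.0 * delta * frame_shift * k) for k in range(1, 31)]

reverb = late_reverb_psd(power, t60, frame_shift, delay_frames)
gains = suppression_gains(power, reverb)
print([round(g, 2) for g in gains[:12]])
```

During the speech onset the estimated reverberation is zero and the gain is 1. Once the bin contains only the decaying tail, the estimate matches the observed power and the gain falls to the floor, which reflects the compromise between dereverberation and signal distortion.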
![Page 81: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/81.jpg)
![Page 82: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/82.jpg)
![Page 83: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/83.jpg)
![Page 84: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/84.jpg)
![Page 85: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/85.jpg)
Dereverberation - Multichannel Reverberation Suppression
Multichannel Reverberation Suppression
• can additionally exploit spatial diversity (incl. assumption F) and prior knowledge on source location and radiation characteristic (C), e.g.,
  ⊲ beamforming using only prior knowledge of source location and radiation characteristic (C) (e.g., [Griebel 2001])
  ⊲ spatial diversity for multichannel spectral subtraction (e.g., [Allen 1977]), or subspace methods (e.g., [Gannot 2003])
  ⊲ spatial diversity complemented by prior knowledge on room acoustics parameter (e.g., [Habets 2005])
![Page 86: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/86.jpg)
![Page 87: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/87.jpg)
![Page 88: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/88.jpg)
![Page 89: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/89.jpg)
![Page 90: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/90.jpg)
Handling Reverberation for Automatic Speech Recognition
Block diagram of ASR system
[Block diagram with blocks: pre-processing, feature extraction, recognition, training, acoustic model, language model; signals: speech signal, transcription; markers: A, B, C, D, REMOS]
Strategies
A) signal-based approaches
B) feature-based approaches
C) model-based approaches
D) decoder-based approaches
![Page 91: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/91.jpg)
Part II. Multichannel blind inverse filtering
![Page 92: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/92.jpg)
Two approaches for signal dereverberation
Signal dereverberation
• Partial deconvolution: "Robust" blind inverse filtering is the main topic of Part II
• Reverberation suppression: [Lebart 2001], [Habets 2005], [Löllmann 2009], [Erkelens 2010], [Kameoka 2009], [Jeub 2010] and others
![Page 93: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/93.jpg)
Multichannel inverse filtering
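[Diagram: clean speech $s_t$ → RIRs $h_t^{(1)}, h_t^{(2)}, \ldots, h_t^{(M)}$ → reverberant speech $x_t^{(1)}, x_t^{(2)}, \ldots, x_t^{(M)}$ → inverse filters $w_t^{(1)}, w_t^{(2)}, \ldots, w_t^{(M)}$ → + → dereverberated signal $y_t$]
Linear filtering:
$$y_t = \sum_{m=1}^{M} \sum_{\tau=0}^{K} w_\tau^{(m)} \, x_{t-\tau}^{(m)}$$
Goal: estimate $\{w_t^{(m)}\}$ s.t. $y_t = s_t$
m: mic. index, t: time index, $\{\cdot\}$: a set of variables for all t and m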
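The filter-and-sum structure above can be written out directly. In this minimal sketch the RIRs are two-tap toy filters chosen so that the inverse filters can be found by inspection (one tap per channel); the signal values and function names are hypothetical.

```python
def convolve(a, b):
    """Direct-form linear convolution."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def multichannel_filter(x, w):
    """y[t] = sum over mics m and lags tau of w[m][tau] * x[m][t - tau], with x[m][t] = 0 for t < 0."""
    T = len(x[0])
    y = [0.0] * T
    for xm, wm in zip(x, w):
        for t in range(T):
            for tau, coeff in enumerate(wm):
                if t - tau >= 0:
                    y[t] += coeff * xm[t - tau]
    return y

s = [1.0, -2.0, 3.0, 0.5, 0.0, -1.0]          # toy "clean speech"
h = [[1.0, 0.5], [1.0, -0.5]]                 # toy RIRs for M = 2 microphones
x = [convolve(s, hm)[:len(s)] for hm in h]    # reverberant microphone signals

# Inverse filters by inspection: 0.5*h[0] + 0.5*h[1] = [1, 0], so one tap per channel suffices
w = [[0.5], [0.5]]
y = multichannel_filter(x, w)
print(y)
```

Here the reflections in the two channels have opposite sign and cancel in the sum. In practice the RIRs are unknown and long, so the filters have to be estimated blindly, which is the topic of Part II.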
![Page 94: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/94.jpg)
Part II. Multichannel blind inverse filtering
- Example applications
  - Professional audio post production
  - Meeting recognition with microphone arrays
- Fundamentals: dereverberation with inverse filtering
  - What is inverse filter
  - Robust 'approximate' inverse filter
- Blind inverse filtering
  - Overview of basic approaches
  - Closer look: multichannel linear prediction with time-varying source model
- Integration with blind source separation
![Page 95: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/95.jpg)
Application to audio post-production
[Movies/TV creation]
Step 1: Sound & video recording (on location), with actor/actress and microphone(s)
Step 2: Audio post-production (de-noising, de-reverb, sound effects)
![Page 96: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/96.jpg)
Dereverberation plug-in for Pro Tools: NML RevCon-RR (sold by TAC System, Inc.)
Dereverberation system for audio post production [Kinoshita 2008]
![Page 97: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/97.jpg)
Who Spoke When, What, To-whom and How?
Show&Tell ST-3.2: Thursday, March 29, 10:30-12:30, Real-time Meeting Browser
Recognize speech and other audio events
Online meeting recognition [Hori 2012]
[Diagram labels: reverberation, simultaneous speech, background noise]
![Page 98: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/98.jpg)
Online/offline processing flow of meeting recognition
[Flow diagram: mic signals → dereverberation (preprocessing for all following signal processing units) → dereverberated microphone signals → voice activity detection and speech separation → separated signals → noise suppression → cleaned signals → ASR → word sequence]
![Page 99: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/99.jpg)
ASR performance w/ and w/o dereverberation
Word error rate (%)
Test data: Meeting by 4 speakers (15 min x 8 sessions)
Recording: 8 mics. (T60: about 350 ms, speaker-mic distance: 100 cm)
Acoustic model: Trained on CSJ (Corpus of Spontaneous Japanese): headset recording
Language model: Vocabulary size: 156K (LVCSR)
Compared conditions:
• Baseline: Distant microphone (w/o enhancement)
• w/o derev: BSS + denoise
• w/ derev: derev. + BSS + denoise
• Headset: Close microphone (w/o enhancement)

| Processing | Baseline | w/o derev | w/ derev | Headset |
|---|---|---|---|---|
| Online processing (latency = 1 s for preprocessing, w/o speaker adaptation) | 86.5 % | 72.1 % | 56.6 % | 30.6 % |
| Offline processing (w/ unsupervised speaker adaptation) | 78.9 % | 38.0 % | 35.9 % | 27.4 % |
![Page 100: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/100.jpg)
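Questions to be answered
• What is inverse filtering?
• Is the inverse filter robust against interferences?
• Can we estimate the inverse filter with blind processing?
Signal flow: $s_t$ → RIRs $h_t^{(1)}, h_t^{(2)}, \dots, h_t^{(M)}$ → microphone signals $x_t^{(1)}, x_t^{(2)}, \dots, x_t^{(M)}$ → inverse filters $w_t^{(1)}, w_t^{(2)}, \dots, w_t^{(M)}$ → sum (+) → $y_t$ (dereverberated signal)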
![Page 101: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/101.jpg)
Answers at a glance
• What is inverse filtering? Inversion of room impulse responses (RIRs)
• Is the inverse filter robust against interferences? Unfortunately no, but there is a robust ‘approximate’ inverse filter
• Can we estimate the inverse filter with blind processing? Yes, we can, by using cues for distinguishing speech from RIRs
![Page 102: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/102.jpg)
Part II. Multichannel blind inverse filtering
- Example applications
  - Professional audio post production
  - Meeting recognition with microphone arrays
- Fundamentals: dereverberation with inverse filtering (assume non-blind processing for analysis purpose)
  - What is inverse filter
  - Robust ‘approximate’ inverse filter
- Blind inverse filtering
  - Overview of basic approaches
  - Closer look: multichannel linear prediction with time-varying source model
- Integration with blind source separation
![Page 103: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/103.jpg)
Inversion of RIRs = Inversion of matrix transformation
Clean speech $s_t$ → RIRs (viewed as matrix transformation) → reverberant speech $x_t^{(1)}, \dots, x_t^{(M)}$ → inverse filtering (inversion, viewed as matrix inversion) → dereverberated speech $y_t$
![Page 104: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/104.jpg)
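Matrix/vector representations of RIR convolution/filtering

Single channel RIR convolution:
$$\mathbf{x}_t^{(m)} = \mathbf{H}^{(m)}\mathbf{s}_t$$
$$\begin{bmatrix} x_t^{(m)} \\ x_{t-1}^{(m)} \\ \vdots \\ x_{t-K}^{(m)} \end{bmatrix} = \begin{bmatrix} h_0^{(m)} & h_1^{(m)} & \cdots & h_{L_h}^{(m)} & 0 & \cdots & 0 \\ 0 & h_0^{(m)} & h_1^{(m)} & \cdots & h_{L_h}^{(m)} & \ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & & \ddots & 0 \\ 0 & \cdots & 0 & h_0^{(m)} & h_1^{(m)} & \cdots & h_{L_h}^{(m)} \end{bmatrix} \begin{bmatrix} s_t \\ s_{t-1} \\ \vdots \\ s_{t-K_0} \end{bmatrix}, \qquad K_0 = K + L_h$$

Multichannel RIR convolution:
$$\mathbf{x}_t = \begin{bmatrix} \mathbf{x}_t^{(1)} \\ \vdots \\ \mathbf{x}_t^{(M)} \end{bmatrix} = \begin{bmatrix} \mathbf{H}^{(1)} \\ \vdots \\ \mathbf{H}^{(M)} \end{bmatrix}\mathbf{s}_t = \mathbf{H}\mathbf{s}_t$$

Single channel filtering:
$$\sum_{k=0}^{K} w_k^{(m)} x_{t-k}^{(m)} = [w_0^{(m)}, \dots, w_K^{(m)}] \begin{bmatrix} x_t^{(m)} \\ \vdots \\ x_{t-K}^{(m)} \end{bmatrix} = \mathbf{w}^{(m)T}\mathbf{x}_t^{(m)}$$

Multichannel filtering:
$$y_t = \sum_{m=1}^{M} \mathbf{w}^{(m)T}\mathbf{x}_t^{(m)} = [\mathbf{w}^{(1)T}, \dots, \mathbf{w}^{(M)T}] \begin{bmatrix} \mathbf{x}_t^{(1)} \\ \vdots \\ \mathbf{x}_t^{(M)} \end{bmatrix} = \mathbf{w}^T\mathbf{x}_t$$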
![Page 105: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/105.jpg)
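Existence of inverse filter
Signal flow: $s_t$ → $\mathbf{H}$ → $\mathbf{x}_t = \mathbf{H}\mathbf{s}_t$ → $\mathbf{w}$ → $y_t = \mathbf{w}^T\mathbf{x}_t$
• A column vector $\mathbf{w}$ is an inverse filter when it satisfies $y_t = s_t$:
$$y_t = \mathbf{w}^T\mathbf{x}_t = \mathbf{w}^T\mathbf{H}\mathbf{s}_t = s_t \quad\Longleftrightarrow\quad \mathbf{w}^T\mathbf{H} = \mathbf{e}^T$$
where $\mathbf{e} = [1, 0, \dots, 0]^T$ and $\mathbf{s}_t = [s_t, s_{t-1}, \dots, s_{t-K_0}]^T$
• An inverse filter exists when $\mathbf{H}$ is invertible, i.e., it is full column rank, and is obtained as
$$\mathbf{w}^T = \mathbf{e}^T\mathbf{H}^+ \quad\text{where}\quad \mathbf{H}^+ = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T$$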
![Page 106: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/106.jpg)
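M (#mics) > 1 is required for single source case
• $\mathbf{H}$ is invertible, or full column rank, if and only if (#rows of H) >= (#columns of H) and all columns are linearly independent
• In the case of single source, (#rows of H) >= (#columns of H) is satisfied if and only if M (#mics) > 1
  – $\mathbf{H}$ stacks $\mathbf{H}^{(1)}, \mathbf{H}^{(2)}, \dots, \mathbf{H}^{(M)}$ vertically. With filter order $K$ and RIR order $L_h$ it has $M(K+1)$ rows and $K + L_h + 1$ columns.
  – For $M = 1$ there are always fewer rows than columns when $L_h > 0$. For $M \geq 2$ the condition holds once $K + 1 \geq L_h / (M - 1)$.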
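The existence condition can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: the helper `convolution_matrix`, the random 2-channel RIRs, and the sizes `M`, `Lh`, `K` are all assumptions chosen for illustration. It builds the stacked matrix H, obtains the inverse filter from the pseudo-inverse, and verifies that filtering the reverberant signals returns the source. It also checks that a single-channel H cannot be full column rank.

```python
import numpy as np

def convolution_matrix(h, K):
    """(K+1) x (K+Lh+1) matrix H^(m): row i holds the RIR taps shifted by i."""
    Lh = len(h) - 1
    H = np.zeros((K + 1, K + Lh + 1))
    for i in range(K + 1):
        H[i, i:i + Lh + 1] = h
    return H

rng = np.random.default_rng(0)
M, Lh, K = 2, 6, 7                      # illustrative sizes (assumed)
h = rng.standard_normal((M, Lh + 1))    # synthetic RIRs, one row per microphone

# Stack the per-channel matrices: M(K+1) rows, K+Lh+1 columns.
H = np.vstack([convolution_matrix(h[m], K) for m in range(M)])
e = np.zeros(K + Lh + 1)
e[0] = 1.0

# w^T = e^T H^+  (Moore-Penrose pseudo-inverse)
w = np.linalg.pinv(H).T @ e

# Apply the inverse filter to simulated reverberant signals.
T = 200
s = rng.standard_normal(T)
x = [np.convolve(s, h[m])[:T] for m in range(M)]
w_per_mic = w.reshape(M, K + 1)
y = sum(np.convolve(x[m], w_per_mic[m])[:T] for m in range(M))

# A single channel has fewer rows than columns, so it is never full column rank.
H_single = convolution_matrix(h[0], K)
single_rank_deficient = np.linalg.matrix_rank(H_single) < H_single.shape[1]

print("max |y - s| =", np.max(np.abs(y - s)))
print("single-channel H rank deficient:", single_rank_deficient)
```

With known, noise-free RIRs the two-channel filter recovers the source up to numerical precision, which is the non-blind setting assumed in this part of the tutorial.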
![Page 107: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/107.jpg)
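Generalization to N sources × M microphones case
Signal flow: sources $s_t^{(1)}, s_t^{(2)}, \dots, s_t^{(N)}$ → $\mathbf{H}$ → microphone signals $x_t^{(1)}, x_t^{(2)}, \dots, x_t^{(M)}$ → $\mathbf{W}$ → outputs $y_t^{(1)} = s_t^{(1)},\ y_t^{(2)} = s_t^{(2)},\ \dots,\ y_t^{(N)} = s_t^{(N)}$
– Inverse filter exists when $\mathbf{H}$ is full column rank
Equivalent conditions:
• M (#mics) > N (#sources)
• H(z) does not contain common zeros
Multiple-input/output inverse theorem (MINT) [Miyoshi 1988]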
![Page 108: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/108.jpg)
![Page 109: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/109.jpg)
Assumptions for inverse filtering
• Invertible RIRs
• No additive noise
• Time-invariant RIRs
Not realistic!

Problem of inverse filtering
Inverse filter is too sensitive to modeling errors (noise or RIR change)
![Page 110: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/110.jpg)
Inverse filter greatly amplifies noise

Noise-free reverberant case
• Clean speech
• Reverberant speech
  – Synthesized using a fixed RIR (RT60 = 0.5 s)
• Dereverberated speech using an inverse filter for known RIRs (2-channel)

Noisy reverberant case
• Noisy reverberant speech (SNR = 30 dB)
• Speech processed using the same inverse filter (2-channel)
![Page 111: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/111.jpg)
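Why inverse filter is so sensitive to additive noise
Signal flow: $s_t$ → $\mathbf{H}$ → (+ noise $\mathbf{n}_t$) → $\mathbf{x}_t$ → $\mathbf{w}^T = \mathbf{e}^T\mathbf{H}^+$ → $s_t + \tilde{n}_t$, where $\tilde{n}_t = \mathbf{e}^T\mathbf{H}^+\mathbf{n}_t$
• $\sigma_{\min}$: minimum singular value of $\mathbf{H}$, often extremely small (compared to maximum singular value)
• $\sigma^{\mathrm{inv}}_{\max}$: maximum singular value of $\mathbf{H}^+$, often extremely large (for full column rank $\mathbf{H}$, $\sigma^{\mathrm{inv}}_{\max} = 1/\sigma_{\min}$)
⇒ The inverse filter extremely amplifies noise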
![Page 112: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/112.jpg)
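Standard numerical approach for robustness [Engl 1996]
• Regularization
  – A general technique for robust matrix inversion
  – Add a very small positive constant $\delta$ to the diagonal of $\mathbf{H}^T\mathbf{H}$ for calculating the pseudo-inversion of $\mathbf{H}$:
$$\mathbf{H}^+ = (\mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T \quad\longrightarrow\quad \tilde{\mathbf{H}}^+ = (\mathbf{H}^T\mathbf{H} + \delta\mathbf{I})^{-1}\mathbf{H}^T \qquad (\mathbf{I}: \text{identity matrix})$$
  – It can reduce the maximum singular value $\sigma^{\mathrm{inv}}_{\max}$ of $\tilde{\mathbf{H}}^+$
Audio demo: Noisy rev. → Processed
Noise amplification is greatly mitigated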
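The effect of the regularization constant can be illustrated with a small NumPy sketch. Everything specific here is an assumption for illustration: the synthetic exponentially decaying RIRs, the sizes, the helper names `convolution_matrix` and `noise_gain`, and the value `delta = 1e-2`. For unit-variance white sensor noise, the output noise power of a filter w is its squared norm, so the sketch compares that quantity for the exact and the regularized inverse filter.

```python
import numpy as np

def convolution_matrix(h, K):
    """(K+1) x (K+Lh+1) convolution matrix of one RIR."""
    Lh = len(h) - 1
    H = np.zeros((K + 1, K + Lh + 1))
    for i in range(K + 1):
        H[i, i:i + Lh + 1] = h
    return H

rng = np.random.default_rng(1)
M, Lh, K = 2, 30, 40                                   # illustrative sizes (assumed)
decay = np.exp(-0.15 * np.arange(Lh + 1))
h = rng.standard_normal((M, Lh + 1)) * decay           # synthetic decaying RIRs
H = np.vstack([convolution_matrix(h[m], K) for m in range(M)])
e = np.zeros(H.shape[1])
e[0] = 1.0

# Exact inverse filter: w^T = e^T H^+
w_exact = np.linalg.pinv(H).T @ e

# Regularized inverse filter: w^T = e^T (H^T H + delta I)^{-1} H^T
delta = 1e-2                                           # assumed value
G = H.T @ H + delta * np.eye(H.shape[1])
w_reg = H @ np.linalg.solve(G, e)

def noise_gain(w):
    """Output noise power for unit-variance white noise at every microphone."""
    return float(w @ w)

print("noise gain, exact inverse      :", noise_gain(w_exact))
print("noise gain, regularized inverse:", noise_gain(w_reg))
print("residual |H^T w - e|, regularized:", np.linalg.norm(H.T @ w_reg - e))
```

The regularized filter no longer inverts the RIRs exactly (the residual is nonzero), which is the price paid for the lower noise gain.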
![Page 113: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/113.jpg)
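Room acoustics motivated approach for robustness
• Channel shortening
  – Set “direct signal + early reflections” as target signal, and reduce only late reverberation

Illustration of an RIR (over time $t$): direct path ($\mathbf{e}$), early reflections ($\mathbf{h}_e$, about 30 ms), late reverberation ($\mathbf{h}_l$)

| Approach | Filter | Target | TRR |
|---|---|---|---|
| Inverse filtering | $\mathbf{w}^T = \mathbf{e}^T\tilde{\mathbf{H}}^+$ | Direct path | e.g., -3 dB |
| Channel shortening | $\mathbf{w}^T = (\mathbf{e}^T + \mathbf{h}_e^T)\tilde{\mathbf{H}}^+$ | Direct path + early reflections | e.g., 8 dB |

Target to reverberation ratio (TRR) w/ channel shortening is much higher than TRR w/ inverse filtering
Audio demo: Noisy rev. → Processed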
![Page 114: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/114.jpg)
Intermediate summary II-1
• Dereverberation: inversion of RIRs
  – Assuming RIRs to be a time-invariant linear system
• Inverse filter exists
  – When we have more microphones than sources
  – But it may be very sensitive to additive noise
• ‘Approximate’ inverse filter is robust against noise
  – Based on regularization and channel shortening
![Page 115: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/115.jpg)
![Page 116: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/116.jpg)
Blind inverse filtering based dereverberation
Signal flow: unknown speech production system → clean speech $s_t$ → unknown RIRs → reverberant speech $x_t^{(m)}$ (mics.) → inverse filtering $\mathbf{w}$ → dereverberated speech $y_t$
The inverse filter $\mathbf{w}$ is estimated from the reverberant speech $x_t^{(m)}$
Two approaches
• RIR estimation + RIR inversion
• Direct estimation of inverse filter
![Page 117: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/117.jpg)
Conventional decorrelation approaches for stationary white signal
Signal flow: stationary white signal $s_t$ ($E\{s_t s_{t'}\} = 0$ for $t \neq t'$) → unknown RIRs (increase correlation: $E\{x_t^{(m)} x_{t'}^{(m)}\} \neq 0$) → $x_t^{(m)}$ → inverse filtering $\mathbf{w}$ → $y_t$ with $E\{y_t y_{t'}\} = 0$
Approach: Estimate $\mathbf{w}$ that decorrelates $x_t^{(m)}$
• SOS approach assumes $s_t$ to be stationary white Gaussian
  – Multichannel linear prediction (MCLP) [Slock 1994], [Abed-Meraim 1997]
• HOS approach assumes $s_t$ to be an i.i.d. sequence
  – Higher order decorrelation [Sato 1975], [Bellini 1994]
![Page 118: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/118.jpg)
Multichannel linear prediction (MCLP)
Predict reverberation in the current observation $x_t^{(1)}$ (Mic. 1) from the past observations $x_{t-k}^{(m)}$ of Mic. 1, …, Mic. M:
$$x_t^{(1)} = s_t + r_t^{(1)}, \qquad r_t^{(1)} = \sum_{m=1}^{M}\sum_{k=1}^{L} c_k^{(m)} x_{t-k}^{(m)}$$
$r_t^{(1)}$: reverberation
![Page 119: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/119.jpg)
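MCLP based decorrelation [Slock 1994], [Abed-Meraim 1997]
• $x_t^{(1)}$ is modeled by
$$x_t^{(1)} = \sum_{m=1}^{M}\sum_{k=1}^{K} c_k^{(m)} x_{t-k}^{(m)} + s_t = \underbrace{\mathbf{x}_{t-1}^T\mathbf{c}}_{\text{predicted signal (= reverberation)}} + \underbrace{s_t}_{\text{prediction error (= direct signal)}}$$
where $\mathbf{c} = [c_1^{(1)}, \dots, c_K^{(1)}, \dots, c_1^{(M)}, \dots, c_K^{(M)}]^T$ is prediction coeffs.
  – $\mathbf{c}$ is equivalent to inverse filter $\mathbf{w}$: $s_t = x_t^{(1)} - \mathbf{x}_{t-1}^T\mathbf{c}$
• $\mathbf{c}$ can be estimated by minimizing prediction error when sources are stationary and uncorrelated in time:
$$\hat{\mathbf{c}} = \arg\min_{\mathbf{c}} \sum_t \left| x_t^{(1)} - \mathbf{x}_{t-1}^T\mathbf{c} \right|^2$$
  – Quadratic form: optimized using a closed form solution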
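The closed-form estimate can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the cited authors' code: the helper names `past_matrix` and `mclp`, the white Gaussian source, the synthetic 2-channel RIRs with the first tap of channel 1 set to 1, and the sizes are all hypothetical. The prediction coefficients are obtained by ordinary least squares, and the prediction error is compared with the true source.

```python
import numpy as np

def past_matrix(x, K, D=1):
    """Row t holds [x^(m)_{t-D}, ..., x^(m)_{t-D-K+1}] for every channel m."""
    M, T = x.shape
    X = np.zeros((T, M * K))
    for m in range(M):
        for k in range(K):
            d = D + k
            X[d:, m * K + k] = x[m, :T - d]
    return X

def mclp(x, K):
    """Least-squares MCLP: returns the prediction error of channel 1 and c."""
    X = past_matrix(x, K, D=1)
    c, *_ = np.linalg.lstsq(X, x[0], rcond=None)
    return x[0] - X @ c, c

rng = np.random.default_rng(0)
T, M, Lh, K = 20000, 2, 8, 12                 # illustrative sizes (assumed)
s = rng.standard_normal(T)                    # stationary white source
h = rng.standard_normal((M, Lh + 1)) * np.exp(-0.4 * np.arange(Lh + 1))
h[0, 0] = 1.0                                 # first tap of channel 1 fixed to 1
x = np.stack([np.convolve(s, h[m])[:T] for m in range(M)])

s_hat, c = mclp(x, K)
corr = np.corrcoef(s_hat, s)[0, 1]
print("correlation between prediction error and source:", corr)
```

For a white source the prediction error is close to the source itself. For speech the same estimator would also whiten the source, which is the limitation addressed on the later slides.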
![Page 120: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/120.jpg)
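Why dereverberation can be achieved by MCLP
Assume the source is uncorrelated in time: $\sum_t s_t s_{t-t'} = 0$ for $t' \neq 0$.
With $h_0^{(1)} = 1$ (usually assumed for MCLP without loss of generality) and $\mathbf{h}^{(1)T}\mathbf{s}_{t-1}$ denoting the true reverberation in $x_t^{(1)}$:
$$\sum_t \left| x_t^{(1)} - \mathbf{x}_{t-1}^T\mathbf{c} \right|^2 = \sum_t \left| s_t + \mathbf{h}^{(1)T}\mathbf{s}_{t-1} - \mathbf{x}_{t-1}^T\mathbf{c} \right|^2 = \sum_t |s_t|^2 + \sum_t \left| \mathbf{h}^{(1)T}\mathbf{s}_{t-1} - \mathbf{x}_{t-1}^T\mathbf{c} \right|^2$$
The cross term vanishes because $s_t$ is uncorrelated with all past samples, on which both $\mathbf{s}_{t-1}$ and $\mathbf{x}_{t-1}$ depend.
Minimization (with minimum value $\sum_t |s_t|^2$) is achieved only when
$$\underbrace{\mathbf{h}^{(1)T}\mathbf{s}_{t-1}}_{\text{true reverberation}} = \underbrace{\mathbf{x}_{t-1}^T\mathbf{c}}_{\text{predicted reverberation}}$$
(and thus $\sum_t s_t\,\mathbf{x}_{t-1}^T = \mathbf{0}$)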
![Page 121: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/121.jpg)
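Robustness of MCLP against noise
Let $z_t^{(m)} = x_t^{(m)} + n_t^{(m)}$ be noisy reverberant observation, where $n_t^{(m)}$ is additive noise (or can be viewed as modeling error).
Assume $x_t^{(m)}$ and $n_t^{(m)}$ to be uncorrelated, then, the cost function becomes
$$\sum_t \left| z_t^{(1)} - \mathbf{z}_{t-1}^T\mathbf{c} \right|^2 = \underbrace{\sum_t \left| x_t^{(1)} - \mathbf{x}_{t-1}^T\mathbf{c} \right|^2}_{\text{cost function for dereverberation}} + \underbrace{\sum_t \left| n_t^{(1)} - \mathbf{n}_{t-1}^T\mathbf{c} \right|^2}_{\text{cost function for noise amplification}}$$
Regularization is inherently included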
![Page 122: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/122.jpg)
Problem of decorrelation approach for speech dereverberation
Signal flow: unknown speech production system → $s_t$ → unknown RIRs → $x_t^{(m)}$ (mics.) → inverse filtering $\mathbf{w}$ → $y_t$
Approach: Estimate $\mathbf{w}$ that decorrelates $x_t^{(m)}$
Problem: Not only dereverberate $x_t^{(m)}$ but also decorrelate $s_t$ (both are decorrelated)
Key to the solution: Use cues to separate speech and RIRs
![Page 123: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/123.jpg)
Cues for separating speech and RIRs

| Cues | Speech | RIRs |
|---|---|---|
| Inter-channel difference | Common to all the microphone signals | Different for each microphone |
| Auto-correlation duration | Correlated only within short time interval of the order of 30 ms | Correlated within long time interval over 100 ms |
| Nonstationarity | Stationary only within short time period of the order of 30 ms | Stationary over long time period of the order of 1000 ms or larger |
![Page 124: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/124.jpg)
Approaches to blind inverse filtering

| Approach | References | Cues |
|---|---|---|
| Subspace method (RIR estimation + inversion) | [Furuya 1997], [Gannot 2003], [Gaubitch 2006] | Inter-channel difference |
| Pre-whitening + decorrelation | Second-order statistics (SOS): [Gaubitch 2003], [Furuya 2007], [Triki 2007]; higher-order statistics (HOS): [Gillespie 2001] | Auto-correlation duration |
| Channel shortening | [Gillespie 2003], [Kinoshita 2009] | Auto-correlation duration |
| Joint speech and reverberation modeling | [Hopgood 2003], [Buchner (TRINICON) 2010], [Yoshioka 2007], [Nakatani 2008] | Auto-correlation duration and nonstationarity |
![Page 125: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/125.jpg)
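Pre-whitening + decorrelation
Processing flow:
1. Pre-whitening: reduce correlation within short time interval, mapping reverberant speech $x_t^{(m)} = \mathbf{H}^{(m)}\mathbf{s}_t$ to $\tilde{x}_t^{(m)} = \mathbf{H}^{(m)}\tilde{\mathbf{s}}_t$
2. Estimate inverse filter: estimate $\mathbf{w}$ that decorrelates $\tilde{x}_t^{(m)}$
3. Inverse filtering: apply $\mathbf{w}$ to $x_t^{(m)}$ to obtain $y_t$
• A typical method for pre-whitening
  – Low-dimensional (e.g., 12-dim) single channel linear prediction is often used
Assumption: pre-whitening can decorrelate only $\mathbf{s}_t$ in $\mathbf{x}_t = \mathbf{H}\mathbf{s}_t$, and we can obtain $\tilde{\mathbf{x}}_t = \mathbf{H}\tilde{\mathbf{s}}_t$ where $\tilde{\mathbf{s}}_t$ is an unknown decorrelated speech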
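The pre-whitening step can be sketched as a low-order, single-channel linear prediction whose residual replaces the signal. The sketch below is only an illustration under assumptions: the helper names `lp_prewhiten` and `lag1_corr` are hypothetical, and a synthetic first-order autoregressive signal stands in for the short-term correlation of speech (no real speech or reverberation is used).

```python
import numpy as np

def lp_prewhiten(x, order=12):
    """Fit a single-channel linear predictor by least squares, return the residual."""
    T = len(x)
    P = np.zeros((T, order))
    for k in range(1, order + 1):
        P[k:, k - 1] = x[:T - k]          # column k-1 holds x delayed by k samples
    a, *_ = np.linalg.lstsq(P, x, rcond=None)
    return x - P @ a

def lag1_corr(x):
    """Normalized autocorrelation at lag 1."""
    return float(np.dot(x[1:], x[:-1]) / np.dot(x, x))

rng = np.random.default_rng(0)
e = rng.standard_normal(20000)
x = np.zeros_like(e)
for t in range(len(e)):                   # stand-in signal with short-term correlation
    x[t] = e[t] + (0.9 * x[t - 1] if t > 0 else 0.0)

r = lp_prewhiten(x, order=12)
print("lag-1 correlation before:", lag1_corr(x))
print("lag-1 correlation after :", lag1_corr(r))
```

A low order such as 12 only reaches a few samples back, so it targets correlation within a short interval, in line with the assumption stated on the slide that the long-term correlation introduced by the RIRs is left for the subsequent decorrelation stage.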
![Page 126: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/126.jpg)
Channel shortening
• Introduce constraints so that dereverberation reduces only late reverberation
  – RIR: direct path + early reflections + late reverberation; channel shortening removes only the late reverberation
• Techniques:
  – Correlation shaping [Gillespie 2003]
  – Multistep MCLP [Kinoshita 2009]
Make derev. robust and do not decorrelate speech
![Page 127: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/127.jpg)
Multistep MCLP [Gesbert 1997], [Kinoshita 2009]
Predict late reverberation in the current observation $x_t^{(1)}$ (Mic. 1) from the past observations of Mic. 1, …, Mic. M, skipping the most recent samples by a delay $D$ (= 30-50 ms):
$$x_t^{(1)} = s_t + r_t^{(1)}, \qquad r_t^{(1)} = \sum_{m=1}^{M}\sum_{k=D}^{K} c_k^{(m)} x_{t-k}^{(m)}$$
$s_t$: direct signal + early reflections
$r_t^{(1)}$: late reverberation
![Page 128: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/128.jpg)
Approaches to blind inverse filtering

| Approach | References | Cues |
|---|---|---|
| Subspace method (RIR estimation + inversion) | [Furuya 1997], [Gannot 2001], [Gaubitch 2006] | Spatial diversity |
| Pre-whitening + decorrelation | Second-order statistics (SOS): [Gaubitch 2003], [Furuya 2006], [Triki 2006]; higher-order statistics (HOS): [Gillespie 2001] | Duration of auto-correlation |
| Channel shortening | [Gillespie 2003], [Kinoshita 2006] | Duration of auto-correlation |
| Joint speech and reverberation modeling | [Hopgood 2003], [Buchner (TRINICON) 2010], [Yoshioka 2007], [Nakatani 2008] | Duration of auto-correlation and nonstationarity |
![Page 129: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/129.jpg)
Joint speech and reverberation modeling for derev.
Unknown true generative system: source process → reverberation process → reverberant observation $x_t$
Model of generative system: $x_t \sim p(x_t; \theta_s, \theta_h)$
• Source model: parametric model with parameters $\theta_s$
• Reverberation model: parametric model with parameters $\theta_h$
• Are the two models distinguishable?
Parameter estimation by
• Likelihood maximization [Hopgood 2003], [Yoshioka 2007], [Nakatani 2008]
• Kullback-Leibler divergence minimization [Buchner (TRINICON) 2010]
![Page 130: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/130.jpg)
Models for source process and reverberation process
• Source model (SOS or HOS): time-varying & correlated only within short interval
• Reverberation model: stationary & correlated over long interval
⇒ Distinguishable
![Page 131: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/131.jpg)
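Multichannel blind partial deconvolution (MCBPD) by TRINICON
Cost function for SOS-TRINICON [Buchner 2010]:
$$J_{\mathrm{SOS}} = \sum_t \left( \log\det\hat{\mathbf{R}}_{s,t} - \log\det\hat{\mathbf{R}}_{y,t} \right)$$
• $\hat{\mathbf{R}}_{y,t}$: autocorrelation matrix of $y_t$
• $\hat{\mathbf{R}}_{s,t}$: goal
[Figure: autocorrelation matrix of observed signal, compared with the goal for decorrelation (deconvolution) and the goal for MCBPD by TRINICON]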
![Page 132: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/132.jpg)
![Page 133: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/133.jpg)
MCLP with time-varying source model for dereverberation
[Yoshioka 2007], [Nakatani 2008, 2011]
• Source process (SOS): time-varying short-time Gaussian
• Reverberation process: MCLP
⇒ Distinguishable by likelihood maximization
![Page 134: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/134.jpg)
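Reformulation of MCLP based on likelihood maximization
$$L(\mathbf{c}) = \log p(x; \mathbf{c})$$
$$= \sum_t \log p\!\left(x_t^{(1)} \,\middle|\, \{\mathbf{x}_{t'}\}_{t'=1:t-1}; \mathbf{c}\right) + \text{const.} \qquad \text{(conditional probability rule)}$$
$$= \sum_t \log p_s(s_t) + \text{const.} \qquad \text{(source model)}$$
where $s_t = x_t^{(1)} - \mathbf{x}_{t-1}^T\mathbf{c}$, i.e., $x_t^{(1)} = \mathbf{x}_{t-1}^T\mathbf{c} + s_t$.
Assume $p_s(s_t) = \mathcal{N}(s_t; 0, 1)$ (stationary white Gaussian), then
$$L(\mathbf{c}) = -(1/2)\sum_t \left| x_t^{(1)} - \mathbf{x}_{t-1}^T\mathbf{c} \right|^2 + \text{const.}$$
Maximize likelihood, $\max_{\mathbf{c}} L(\mathbf{c})$ ⇔ Minimize prediction error, $\min_{\mathbf{c}} \sum_t | x_t^{(1)} - \mathbf{x}_{t-1}^T\mathbf{c} |^2$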
![Page 135: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/135.jpg)
Time-varying Gaussian source model (TVGSM)
1. Each short time segment $\mathbf{s}_t = [s_t, s_{t+1}, \dots, s_{t+N-1}]^T$ (of the order of 30 ms) is stationary multivariate Gaussian, which can be characterized by
$$p(\mathbf{s}_t; \mathbf{R}_t) = \mathcal{N}(\mathbf{s}_t; \mathbf{0}, \mathbf{R}_t) \quad\text{where}\quad \mathbf{R}_t = E\{\mathbf{s}_t\mathbf{s}_t^T\} \text{ is an autocorrelation matrix}$$
2. $\mathbf{R}_t$ varies over different time segments
$\mathbf{R}_t$: parameters to be estimated
![Page 136: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/136.jpg)
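MCLP with multivariate source model
• Prediction error $\mathbf{s}_t$ is assumed to follow TVGSM
$$x_t^{(1)} = \mathbf{x}_{t-1}^T\mathbf{c} + s_t$$
or, stacked over a short segment of the order of 30 ms,
$$\mathbf{x}_t^{(1)} = \mathbf{X}_{t-1}\mathbf{c} + \mathbf{s}_t$$
$$\begin{bmatrix} x_t^{(1)} \\ x_{t+1}^{(1)} \\ \vdots \\ x_{t+N-1}^{(1)} \end{bmatrix} = \begin{bmatrix} \mathbf{x}_{t-1}^T \\ \mathbf{x}_t^T \\ \vdots \\ \mathbf{x}_{t+N-2}^T \end{bmatrix}\mathbf{c} + \begin{bmatrix} s_t \\ s_{t+1} \\ \vdots \\ s_{t+N-1} \end{bmatrix}$$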
![Page 137: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/137.jpg)
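Likelihood function of MCLP with TVGSM
$$L(\mathbf{c}, \mathbf{R}) = \sum_t \log p_s(\mathbf{s}_t; \mathbf{c}, \mathbf{R}_t)$$
where $p_s(\mathbf{s}_t; \mathbf{R}_t) = \mathcal{N}(\mathbf{s}_t; \mathbf{0}, \mathbf{R}_t)$ and $\mathbf{s}_t = \mathbf{x}_t^{(1)} - \mathbf{X}_{t-1}\mathbf{c}$.
Up to a constant scale and offset,
$$L(\mathbf{c}, \mathbf{R}) = -\sum_t \underbrace{\left\| \mathbf{x}_t^{(1)} - \mathbf{X}_{t-1}\mathbf{c} \right\|_{\mathbf{R}_t^{-1}}}_{\text{prediction error weighted by } \mathbf{R}_t^{-1}} - \sum_t \underbrace{\log|\mathbf{R}_t|}_{\text{normalization term}}$$
where $\|\mathbf{s}\|_{\mathbf{R}^{-1}} = \mathbf{s}^T\mathbf{R}^{-1}\mathbf{s}$ (quadratic form)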
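As a worked special case (an illustrative simplification, not a statement from the slides), take segments of length $N = 1$, so that each autocorrelation matrix reduces to a scalar source variance, written here with the assumed symbol $\lambda_t$. The likelihood above then becomes, up to a constant scale and offset,

```latex
\begin{align*}
L(\mathbf{c}, \{\lambda_t\})
  &= -\sum_t \left( \frac{\bigl| x_t^{(1)} - \mathbf{x}_{t-1}^T \mathbf{c} \bigr|^2}{\lambda_t}
     + \log \lambda_t \right), \\[4pt]
\text{fixed } \{\lambda_t\}: \quad
\hat{\mathbf{c}}
  &= \Bigl( \sum_t \frac{\mathbf{x}_{t-1}\mathbf{x}_{t-1}^T}{\lambda_t} \Bigr)^{-1}
     \sum_t \frac{\mathbf{x}_{t-1}\, x_t^{(1)}}{\lambda_t}
     \qquad \text{(weighted least squares)}, \\[4pt]
\text{fixed } \mathbf{c}: \quad
\hat{\lambda}_t
  &= \bigl| x_t^{(1)} - \mathbf{x}_{t-1}^T \hat{\mathbf{c}} \bigr|^2
     \qquad \Bigl(\text{from } \tfrac{\partial L}{\partial \lambda_t} = 0\Bigr).
\end{align*}
```

Each of the two sub-problems has a closed-form solution, which is what makes an alternating update of the prediction coefficients and the source model practical. Samples with a small estimated source variance receive a large weight, so the predictor is fitted mainly where reverberation dominates the observation.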
![Page 138: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/138.jpg)
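Iterative optimization procedure
Initialize: $\hat{\mathbf{R}}_t = E\{\mathbf{x}_t^{(1)}\mathbf{x}_t^{(1)T}\}$
Repeat:
• Update prediction coeffs. (closed form):
$$\hat{\mathbf{c}} = \arg\min_{\mathbf{c}} \sum_t \left\| \mathbf{x}_t^{(1)} - \mathbf{X}_{t-1}\mathbf{c} \right\|_{\hat{\mathbf{R}}_t^{-1}}$$
• Update source model:
  1. Dereverberate: $\hat{\mathbf{s}}_t = \mathbf{x}_t^{(1)} - \mathbf{X}_{t-1}\hat{\mathbf{c}}$
  2. Calculate autocorrelation matrix of $\hat{\mathbf{s}}_t$: $\hat{\mathbf{R}}_t = E\{\hat{\mathbf{s}}_t\hat{\mathbf{s}}_t^T\}$
A few iterations are sufficient for convergence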
![Page 139: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/139.jpg)
Importance of time-varying source model
[Figure: spectrograms (frequency in kHz) of Source signal, Observation, Processed A, and Processed B]
(A) MCLP with stationary white Gaussian source model
(B) MCLP with TVGSM
Conditions: T60: 0.5 s, recording: 2.5 s, source-mic distance: 1.5 m, # mics: 2
A few seconds of observation are sufficient for dereverberation
![Page 140: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/140.jpg)
Blind inverse filtering works in noisy environments
Noisy reverberant speech → Dereverberation (w/ multistep MCLP w/ TVGSM): reduce reverberation → Processed signal

| Noisy reverberant speech: TRR* | Noisy reverberant speech: SNR | Processed signal: TRR |
|---|---|---|
| 0.1 dB (T60 = 0.65 s) | 10 dB | 10.3 dB |
| 0.1 dB (T60 = 0.65 s) | 15 dB | 11.4 dB |
| 5.8 dB (T60 = 0.39 s) | 10 dB | 13.8 dB |
| 5.8 dB (T60 = 0.39 s) | 15 dB | 15.2 dB |

*TRR: Target-to-reverberation ratio (target = direct signal + early reflections)
Noise: additive white noise (reproduced and recorded by 8 mics)
# mics: 8, source-mics distance: 2 m
Noise may slightly increase, but not significantly
![Page 141: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/141.jpg)
Computationally efficient implementation
• Subband decomposition approach [Nakatani 2010], [Yoshioka 2009b]
  – $x_t^{(m)}$ → subband analysis → $x_{n,1}^{(m)}, x_{n,2}^{(m)}, \dots, x_{n,F}^{(m)}$ → MCLP with TVGSM in each subband → $y_{n,1}, y_{n,2}, \dots, y_{n,F}$ → subband synthesis → $y_t$
• Computational efficiency largely improves

Real-time factor (RTF) using MATLAB (RT60: 0.5 s, # mics: 2):

| Time-domain | Subband |
|---|---|
| 170 | 0.8 |
![Page 142: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/142.jpg)
Processing flow with subband decomposition [Nakatani 2010]
1. Set analysis parameters:
   – $D$: prediction delay ($D$ should be # of subband samples corresponding to 30 ms, or larger)
   – $L$: length of prediction filter
   – $M$: # of mics
   – $m_0$: index of target channel to be dereverberated
   – $\kappa$: a coeff. for flooring constant (e.g., $10^{-4}$)
2. Decompose a multichannel observed signal $x_t^{(m)}$ into a set of subband signals $x_{n,f}^{(m)}$
   – $x_{n,f}^{(m)}$: subband signal (e.g., [Weiss 2000], or STFT can also be used)
   – $m$: channel index, $n$: sample index, $f$: subband index
   – E.g., # of subbands is 512 (including negative frequencies) for 16 kHz sampling
3. In each subband $f$, set initial estimates of source variance $\lambda_{n,f}$ as
$$\lambda_{n,f} = \max\left\{ |x_{n,f}^{(m_0)}|^2,\ \epsilon_f \right\}$$
   where $\epsilon_f = \kappa \max_n |x_{n,f}^{(m_0)}|^2$ is a flooring constant for $\lambda_{n,f}$
4. Obtain vector representation of $x_{n,f}^{(m)}$ in all channels as
$$\mathbf{x}_{n,f} = [\mathbf{x}_{n,f}^{(1)T}, \mathbf{x}_{n,f}^{(2)T}, \dots, \mathbf{x}_{n,f}^{(M)T}]^T$$
   where $T$ is non-conjugate transposition, and
$$\mathbf{x}_{n,f}^{(m)} = [x_{n,f}^{(m)}, x_{n-1,f}^{(m)}, \dots, x_{n-L+1,f}^{(m)}]^T$$
5. In each subband $f$, iterate the following until convergence is achieved
   i. Obtain prediction filter $\mathbf{c}_f$ as
$$\mathbf{c}_f = \left( \sum_n \frac{\mathbf{x}_{n-D,f}\,\mathbf{x}_{n-D,f}^{*T}}{\lambda_{n,f}} \right)^{+} \sum_n \frac{\mathbf{x}_{n-D,f}\, x_{n,f}^{(m_0)*}}{\lambda_{n,f}}$$
      where $+$ and $*$ are Moore-Penrose pseudo-inverse and complex conjugate operations (see [Yoshioka, 2009b] for efficient calculation)
   ii. Obtain dereverberated subband signal $y_{n,f}$ as
$$y_{n,f} = x_{n,f}^{(m_0)} - \mathbf{c}_f^{*T}\mathbf{x}_{n-D,f}$$
   iii. Update source variance estimates as
$$\lambda_{n,f} = \max\left\{ |y_{n,f}|^2,\ \epsilon_f \right\}$$
6. Compose a dereverberated signal from a set of dereverberated subband signals $y_{n,f}$
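Steps 3 to 5 of this flow can be sketched for a single subband in NumPy. This is a hedged sketch, not the reference implementation of [Nakatani 2010]: the function names `delayed_taps` and `dereverb_subband`, the variable names `lam` (source variance) and `eps` (flooring constant), the fixed number of iterations, and the synthetic complex-valued test data are all assumptions. Subband analysis and synthesis (steps 2 and 6) are left out; the input `X` is assumed to already hold one subband of every channel.

```python
import numpy as np

def delayed_taps(X, D, L):
    """Row n holds x_{n-D,f}^T: L samples per channel, starting D samples back."""
    M, N = X.shape
    U = np.zeros((N, M * L), dtype=complex)
    for m in range(M):
        for l in range(L):
            d = D + l
            U[d:, m * L + l] = X[m, :N - d]
    return U

def dereverb_subband(X, D=2, L=10, m0=0, kappa=1e-4, n_iter=3):
    """Iterate variance-weighted delayed linear prediction in one subband."""
    x = X[m0]
    U = delayed_taps(X, D, L)
    eps = kappa * np.max(np.abs(x) ** 2)        # flooring constant
    lam = np.maximum(np.abs(x) ** 2, eps)       # step 3: initial source variance
    for _ in range(n_iter):
        w = 1.0 / lam
        R = (U.T * w) @ U.conj()                # sum_n x_{n-D} x_{n-D}^H / lam_n
        p = (U.T * w) @ x.conj()                # sum_n x_{n-D} conj(x_n) / lam_n
        c = np.linalg.pinv(R) @ p               # step 5-i
        y = x - U @ c.conj()                    # step 5-ii: x_n - c^H x_{n-D}
        lam = np.maximum(np.abs(y) ** 2, eps)   # step 5-iii
    return y, c

# Synthetic single-subband test data (assumed, for illustration only).
rng = np.random.default_rng(0)
N, M, taps, D = 4000, 2, 11, 2
env = 0.1 + (np.sin(2 * np.pi * np.arange(N) / 200) > 0)   # time-varying level
s = env * (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
h = (rng.standard_normal((M, taps)) + 1j * rng.standard_normal((M, taps)))
h = h * 0.8 ** np.arange(taps)
h[0, 0] = 1.0
X = np.stack([np.convolve(s, h[m])[:N] for m in range(M)])
target = np.convolve(s, h[0, :D])[:N]           # direct signal + early taps

y, c = dereverb_subband(X, D=D, L=10, m0=0)
err_in = np.sum(np.abs(X[0] - target) ** 2)
err_out = np.sum(np.abs(y - target) ** 2)
print("late-reverberation energy before:", err_in)
print("late-reverberation energy after :", err_out)
```

In a full system the same routine would be run independently in every subband, which is where the reported computational saving over time-domain processing comes from.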
![Page 143: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/143.jpg)
![Page 144: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/144.jpg)
BSS + dereverberation
Mixed reverberant speech (each source: direct signal + reverberation) → Integrated BSS + dereverberation → clean speech estimates
Approaches:
• MCLP based approach [Yoshioka 2009b, 2011]
• TRINICON [Buchner 2010]
![Page 145: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/145.jpg)
Generative model for reverberant sound mixture
• Source process 1 (time-varying Gaussian) → $s_t^{(1)}$
• Source process 2 (time-varying Gaussian) → $s_t^{(2)}$
• …
• Mixture process (instantaneous mixing process): sources → non-reverberant mixture $\bar{x}_t^{(1)}, \bar{x}_t^{(2)}, \dots$
• Reverb. process (multi-input multi-output MCLP): non-reverberant mixture → reverberant mixture $x_t^{(1)}, x_t^{(2)}, \dots$
$x_t^{(m)}$: reverberant mixture
$\bar{x}_t^{(m)}$: non-reverberant mixture
Jointly optimized by maximum likelihood estimation approach [Yoshioka 2009b, 2011]
![Page 146: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/146.jpg)
Optimization procedure (subband-based implementation)

[Flow chart:]
1. Initialization
2. Compute source estimates
3. Update source model parameters
4. Update de-mixing matrices
5. Converged? If not, update prediction coefficients and return to step 2

Closed-form optimization
![Page 147: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/147.jpg)
Improvement in signal-to-interference ratio (SIR)

[Figure: bar chart of SIR in dB (axis 0 to 16) for unprocessed, BSS, and dereverberation + BSS at T60 = 0.3 s and T60 = 0.5 s.]

• Number of sources: 2
• Number of microphones: 4
• Source-microphone distance: 1.5 m
• Recording length: 1 to 8 s (average: 3.5 s)
• BSS: [Sawada 2007]

Results averaged over 672 pairs of utterances (TIMIT test set)
![Page 148: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/148.jpg)
Live demo
![Page 149: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/149.jpg)
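TRINICON: general framework for blind MIMO signal processing

[Figure: sources $s^{(1)}, \ldots, s^{(P)}$ pass through an unknown mixing system to give the microphone signals $x^{(1)}, \ldots, x^{(M)}$; the unmixing system $\mathbf{W}_b$ produces the outputs $y^{(1)}, \ldots, y^{(P)}$.]

Cost function [Buchner 2010]:

$$\mathcal{J}(b, \mathbf{W}) = -\sum_{i=0}^{\infty} \beta(i,b)\, \frac{1}{N} \sum_{j=0}^{N-1} \left\{ \log \hat{p}_{s,PD}(\mathbf{y}(i,j)) - \log \hat{p}_{y,PD}(\mathbf{y}(i,j)) \right\}$$

with $PD$-variate pdfs ($P$: number of sources, $D$: filter length):
• $\hat{p}_{s,PD}(\mathbf{y}(i,j))$ for the source (assumed or estimated): source models
• $\hat{p}_{y,PD}(\mathbf{y}(i,j))$ for the output

$b$: index of signal blocks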
![Page 150: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/150.jpg)
Comparison of SOS and HOS by TRINICON [Buchner 2010]

[Figure: two plots over the number of iterations. SIR improvement (dB) for SOS, SOS+HOS, and BSS (w/o derev.); signal-to-reverberation ratio (SRR) improvement (dB) for SOS and SOS+HOS.]

Number of microphones: 4, number of sources: 2, T60: 700 ms, source-microphone distance: 1.65 m, recording length: 30 s
![Page 151: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/151.jpg)
Summary II-2

• Robust blind inverse filtering is possible
  – Using joint speech and reverberation modeling
    • Based only on a few seconds of observation (e.g., 2.5 s)
    • With a relatively small computational cost (e.g., RTF < 1)
    • In an online processing manner (e.g., latency = 1 s)
  – Under low SNR conditions (e.g., 10 dB SNR)
• Future challenges
  – Real-time adaptation of the inverse filter [Yoshioka 2009a], [Evers 2011]
  – Single-channel inverse filtering [Gillespie 2001]
  – Processing under more adverse noise conditions such as nonstationary diffuse noise
  – Optimal integration of inverse filtering and spectral-enhancement-based dereverberation
![Page 152: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/152.jpg)
Part III: Robust Automatic Speech Recognition (ASR) in Reverberant Environments
![Page 154: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/154.jpg)
Part III: Robust ASR in Reverberant Environments
Introduction
Feature-based Approaches
Model-based Approaches
Decoder-based Approaches
REMOS
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 89
![Page 155: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/155.jpg)
ASR System

[Figure: block diagram with the blocks pre-processing, feature extraction, recognition, training, acoustic model, and language model; input: speech signal; output: transcription. Markers A, B, C, D and REMOS label parts of the diagram.]
![Page 160: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/160.jpg)
Feature Extraction: Calculation of MFCCs

[Figure: processing chain. $s_t$ → Hamming window → DFT → $|\cdot|^2$ → mel filtering → melspec coefficients $\mathbf{s}^{\mathrm{MEL}}_n$ → log → logmelspec coefficients $\mathbf{s}_n$ → DCT → MFCCs $\mathbf{s}^{\mathrm{MFCC}}_n$.]

Goal: dimensionality reduction

MFCCs: Mel Frequency Cepstral Coefficients
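As a rough illustration of the chain above, the following NumPy sketch computes melspec, logmelspec, and MFCC features. The sampling rate, frame length and shift, FFT size, number of mel filters, number of cepstral coefficients, the triangular filter shape, and the unnormalized DCT-II are assumptions made for this example and are not taken from the slides; the function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mel, n_fft, fs):
    """Triangular mel filters, shape (n_mel, n_fft // 2 + 1) (assumed design)."""
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_mel + 2))
    freqs = np.arange(n_fft // 2 + 1) * fs / n_fft
    fb = np.zeros((n_mel, freqs.size))
    for l in range(n_mel):
        lo, mid, hi = edges_hz[l], edges_hz[l + 1], edges_hz[l + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[l] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def mfcc(s, fs=16000, frame_len=400, frame_shift=160,
         n_fft=512, n_mel=24, n_ceps=13):
    """Return (melspec, logmelspec, MFCCs), each with one row per frame n."""
    n_frames = 1 + (len(s) - frame_len) // frame_shift
    idx = (np.arange(frame_len)[None, :]
           + frame_shift * np.arange(n_frames)[:, None])
    frames = s[idx] * np.hamming(frame_len)               # Hamming window
    power = np.abs(np.fft.rfft(frames, n_fft, axis=1)) ** 2  # DFT, |.|^2
    melspec = power @ mel_filterbank(n_mel, n_fft, fs).T   # mel filtering
    logmelspec = np.log(melspec + 1e-10)                   # log (with floor)
    # Unnormalized DCT-II, truncated to the first n_ceps coefficients
    l = np.arange(n_mel)
    dct = np.cos(np.pi * np.arange(n_ceps)[:, None] * (l[None, :] + 0.5) / n_mel)
    return melspec, logmelspec, logmelspec @ dct.T

# Usage on one second of synthetic noise (stand-in for a speech signal)
rng = np.random.default_rng(0)
melspec, logmelspec, ceps = mfcc(rng.standard_normal(16000))
```

With these assumed settings, each frame is reduced from 257 power-spectrum bins to 24 mel channels and then to 13 cepstral coefficients, which is the dimensionality reduction named on the slide.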
![Page 162: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/162.jpg)
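Acoustic Modeling

Hidden Markov Model (HMM) λ

[Figure: left-to-right HMM with states 1 to 6, self-transitions $a_{22}, a_{33}, a_{44}, a_{55}$, forward transitions $a_{12}, a_{23}, a_{34}, a_{45}, a_{56}$, and state-dependent output densities such as $p(\mathbf{s}_n \mid q_n = 2)$ and $p(\mathbf{s}_n \mid q_n = 5)$.]

Powerful model for:
• temporal variation
• spectral variation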
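To make the roles of the transition probabilities $a_{ij}$ and the output densities $p(\mathbf{s}_n \mid q_n = j)$ concrete, here is a minimal sketch of the forward algorithm in the log domain. It assumes that every state emits and that an initial state distribution `pi` is given; the non-emitting entry and exit states suggested by the figure are not modeled, and the function name is hypothetical.

```python
import numpy as np

def log_forward(log_b, A, pi):
    """Log-likelihood log p(s_1, ..., s_N | lambda) of a feature sequence.

    log_b: (N, J) array with log p(s_n | q_n = j)
    A:     (J, J) transition matrix, A[i, j] = a_ij
    pi:    (J,)   initial state probabilities (assumption of this sketch)
    """
    with np.errstate(divide="ignore"):      # log(0) = -inf is intended
        log_A, log_pi = np.log(A), np.log(pi)
    alpha = log_pi + log_b[0]
    for n in range(1, log_b.shape[0]):
        # Sum over predecessor states i, then apply the emission of frame n
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_b[n]
    return np.logaddexp.reduce(alpha)

# Left-to-right example with two emitting states
A = np.array([[0.6, 0.4],
              [0.0, 1.0]])
pi = np.array([1.0, 0.0])
```

Note that the emission term for frame $n$ depends only on the current state, which is the conditional independence assumption that reverberation violates.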
![Page 165: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/165.jpg)
Dispersive Effect of Reverberation

[Figure: logmelspec features (dB scale, −60 to 20) of the clean and the reverberant utterance "four, two, seven"; axes: frame n (0 to 120) and mel channel l (5 to 20).]

Dispersive effect of reverberation: features are smeared along the time axis
• Time-frequency pattern is changed
• Inter-frame correlation is increased
• Different statistical properties to be captured by the acoustic model
• Contradiction to the conditional independence assumption of HMMs
![Page 167: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/167.jpg)
Explanation of Dispersive Effect

[Figure: time-domain (TD) representation of the initial RIR segment $h_t$ over time in ms (10 to 100), with analysis frames 1, 2, 3 and the frame shift marked; feature-domain (FD) representation of the initial RIR segment over frame τ and mel channel l.]

Time-domain (TD) description of reverberant speech $x_t$:

$$x_t = h_t * s_t$$

The RIR is typically much longer than the analysis window.

Feature-domain (FD) description of $\mathbf{x}^{\mathrm{MEL}}_n$: melspec convolution

$$\mathbf{x}^{\mathrm{MEL}}_n = \sum_{\tau=0}^{T_H-1} \mathbf{h}^{\mathrm{MEL}}_\tau \odot \mathbf{s}^{\mathrm{MEL}}_{n-\tau}$$

• $\mathbf{s}^{\mathrm{MEL}}_n$: clean-speech feature vector
• $\mathbf{x}^{\mathrm{MEL}}_n$: reverberant feature vector
• $\mathbf{h}^{\mathrm{MEL}}_n$: melspec RIR representation
• $\odot$: element-wise multiplication
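The melspec convolution amounts to an independent convolution along the frame axis in every mel channel. A minimal NumPy sketch follows; the function name and the array layout (frames in rows, mel channels in columns) are assumptions of this example, and clean-speech frames before the start of the utterance are treated as zero.

```python
import numpy as np

def melspec_convolution(h_mel, s_mel):
    """x_n = sum_{tau=0}^{T_H-1} h_tau (element-wise *) s_{n-tau}.

    h_mel: (T_H, L) melspec RIR representation
    s_mel: (N, L)   clean-speech melspec features
    Returns the (N, L) reverberant melspec features.
    """
    T_H, L = h_mel.shape
    N = s_mel.shape[0]
    x_mel = np.zeros((N, L))
    for tau in range(min(T_H, N)):
        # Frames n >= tau receive the clean frame n - tau, weighted per channel
        x_mel[tau:] += h_mel[tau] * s_mel[:N - tau]
    return x_mel

# Small example: T_H = 2 RIR frames, N = 3 speech frames, L = 2 channels
h = np.array([[1.0, 1.0], [0.5, 0.25]])
s = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
x = melspec_convolution(h, s)
```

Each output frame mixes in energy from the preceding clean frames, which is the smearing along the time axis described above.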
![Page 170: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/170.jpg)
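Illustration of Melspec Convolution

$$\mathbf{x}^{\mathrm{MEL}}_n = \mathbf{h}^{\mathrm{MEL}}_n * \mathbf{s}^{\mathrm{MEL}}_n = \mathbf{h}^{\mathrm{MEL}}_0 \odot \mathbf{s}^{\mathrm{MEL}}_n + \mathbf{h}^{\mathrm{MEL}}_1 \odot \mathbf{s}^{\mathrm{MEL}}_{n-1} + \mathbf{h}^{\mathrm{MEL}}_2 \odot \mathbf{s}^{\mathrm{MEL}}_{n-2} + \ldots$$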
![Page 171: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/171.jpg)
Accuracy of Melspec Convolution

[Figure: four panels over frame n (0 to 120) and channel l (5 to 20), color scale −60 to 20:
a) Clean utterance
b) Reverberant utterance
c) Melspec convolution
d) Simple multiplication]
![Page 172: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/172.jpg)
Statistical Properties of Reverberant Speech Features

Example: digit "seven"

[Figure: four panels over mel channel l: logmelspec of the clean utterance (frame n), logmelspec of the reverberant utterance (frame n), means of the clean logmelspec HMM (state j), and logmelspec RIR representation (frame delay τ).]
![Page 174: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/174.jpg)
Statistical Properties of Reverberant Speech Features

Histograms
[Figure: estimated pdfs (histograms) of clean and reverberant features, for state j = 1, channel l = 3 and for state j = 5, channel l = 21.]

Auto-covariances (ACVs)
[Figure: ACVs of clean speech and of reverberant speech for state j = 9, over frame τ (0 to 30) and mel channel l.]
![Page 175: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/175.jpg)
Recognition Results

Word Accuracy as Function of Reverberation Time

[Figure: word accuracy in % (0 to 100) versus reverberation time T60 in ms (0 to 800).]

• Task: read sentences from the Wall Street Journal (WSJ 5K task)
• Features: MFCCs + ∆ + ∆∆ coefficients
• Recognizer: cross-word triphones, 3 states per triphone, 16 Gaussians per state
![Page 176: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/176.jpg)
Which Part of Reverberation is Harmful for ASR?

Word Accuracy as Function of Dereverberation Start Time

[Figure: word accuracy in % (65 to 100) versus T_DEREV in ms (0 to 700), with curves labeled 0 dB, 5 dB, 10 dB, 15 dB, 20 dB, 30 dB, and ∞ dB [Sehr 2010a].]

• Task: connected digits (TI digits)
• Features: MFCCs + ∆ coefficients
• Recognizer: word-level HMMs, 16 states per digit, 3 Gaussians per state
![Page 182: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/182.jpg)
Strategies for Reverberation-Robust ASR

[Figure: block diagram of ASR system with the blocks pre-processing, feature extraction, recognition, training, acoustic model, and language model; input: speech signal; output: transcription. Markers A, B, C, D and REMOS indicate where the strategies act.]

Strategies
A) signal-based approaches
B) feature-based approaches
C) model-based approaches
D) decoder-based approaches
![Page 183: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/183.jpg)
Part III: Robust ASR in Reverberant Environments
Introduction
Feature-based Approaches
Model-based Approaches
Decoder-based Approaches
REMOS
Nakatani, Sehr, Kellermann: Reverberant Speech Processing 102
![Page 186: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/186.jpg)
Key Ideas of Feature-based Approaches

Three Different Approaches
• Feature compensation
  ⇒ Example: Cepstral mean normalization (CMN)
• Features insensitive to reverberation
  ⇒ Example: RASTA features
• Features facilitating the capture of statistical properties
  ⇒ Example: Dynamic features
![Page 193: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/193.jpg)
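Cepstral Mean Normalization [Atal 1974]

If the impulse response is much shorter than the STFT analysis window:

$$x_t = h_t * s_t$$

$$|X^{\mathrm{STFT}}_{n,k}|^2 \approx |H^{\mathrm{STFT}}_{k}|^2 \, |S^{\mathrm{STFT}}_{n,k}|^2$$

$$x^{\mathrm{MFCC}}_{n,c} \approx h^{\mathrm{MFCC}}_{c} + s^{\mathrm{MFCC}}_{n,c}$$

Cepstral Mean Normalization

$$x^{\mathrm{CMN}}_{n,c} = x^{\mathrm{MFCC}}_{n,c} - \bar{x}^{\mathrm{MFCC}}_{c}$$

$$\bar{x}^{\mathrm{MFCC}}_{c} = \frac{1}{N} \sum_{n=1}^{N} x^{\mathrm{MFCC}}_{n,c} \approx h^{\mathrm{MFCC}}_{c} + \bar{s}^{\mathrm{MFCC}}_{c}$$

$$x^{\mathrm{CMN}}_{n,c} \approx h^{\mathrm{MFCC}}_{c} + s^{\mathrm{MFCC}}_{n,c} - (h^{\mathrm{MFCC}}_{c} + \bar{s}^{\mathrm{MFCC}}_{c}) = s^{\mathrm{MFCC}}_{n,c} - \bar{s}^{\mathrm{MFCC}}_{c} = s^{\mathrm{CMN}}_{n,c}$$

⇒ convolution compensated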
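The derivation can be checked numerically with a few lines of NumPy. In this sketch the clean MFCCs and the channel offset are random stand-ins rather than real speech data, the additive model $x^{\mathrm{MFCC}}_{n,c} = h^{\mathrm{MFCC}}_c + s^{\mathrm{MFCC}}_{n,c}$ is assumed to hold exactly, and the function name `cmn` is hypothetical.

```python
import numpy as np

def cmn(x_mfcc):
    """Cepstral mean normalization: subtract the per-coefficient mean over
    all N frames. x_mfcc has shape (N, C)."""
    return x_mfcc - x_mfcc.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
s_mfcc = rng.standard_normal((200, 13))   # stand-in for clean MFCCs s_{n,c}
h_mfcc = rng.standard_normal(13)          # frame-independent offset h_c
x_mfcc = s_mfcc + h_mfcc                  # short-impulse-response model

# The constant offset cancels: CMN of x equals CMN of s
same = np.allclose(cmn(x_mfcc), cmn(s_mfcc))
```

The cancellation relies on the offset being identical in every frame, which holds only when the impulse response is much shorter than the analysis window.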
![Page 195: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/195.jpg)
CMN - Illustration 1st-order Highpass Filter

Clean vs. Highpass Filtered Logmel Features

[Figure: logmel features over time t in s (0.2 to 1.6) and mel channel (5 to 20), two panels each without CMN (color scale −16 to 4) and with CMN (color scale −12 to 8).]
![Page 197: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/197.jpg)
CMN - Illustration Reverberation
Clean vs. Reverberant (T60 = 900 ms) Logmel Features
[Figure: two panels, "No CMN" and "With CMN", each showing a pair of logmel feature plots (mel channel over t in s)]
![Page 201: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/201.jpg)
CMN - Discussion
Approach
Apply CMN to both training and test data
⇒ Short impulse responses can be compensated
+ Good for compensating different microphone characteristics or
different telephone channels
+ Good for compensating coloration due to early reflections
− But: not suitable for compensating late reverberation
Further considerations
Reliable only if utterance is long enough (>4 s [Droppo 2008])
Extensions necessary for different speech activity rates of
training and test data [Droppo 2008]
![Page 205: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/205.jpg)
RASTA (RelAtive SpecTrA) Features [Hermansky 1994]
Background
Speed of spectral changes of speech:
⇒ limited by movements of articulators in vocal tract
Many non-speech effects:
⇒ characterized by short time-invariant impulse responses
Examples: microphone characteristics, telephone channels
Analysis artifacts:
⇒ very fast spectral changes
Idea
Remove very slow and fast spectral changes from features:
⇒ bandpass filtering in each channel
+ Insensitivity to slow and fast spectral changes
![Page 206: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/206.jpg)
RASTA Features: Block Diagram
Calculation of RASTA Features
[Block diagram: x_t → filter bank H_0(e^{jΩ}), H_1(e^{jΩ}), …, H_L(e^{jΩ}) → in each channel: log() → bandpass filter → exp() → form vectors → x_n^{RASTA}]
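The per-channel chain log → bandpass filter → exp can be sketched in plain Python as follows. The slides do not give filter coefficients; the values below are the ones commonly quoted for the RASTA filter, used here in causal form, and should be treated as an assumption. The function names are hypothetical.

```python
import math

# Causal form of the commonly quoted RASTA filter (assumed coefficients):
# H(z) = 0.1 * (2 + z^-1 - z^-3 - 2 z^-4) / (1 - 0.98 z^-1)
NUMERATOR = [0.2, 0.1, 0.0, -0.1, -0.2]
POLE = 0.98


def rasta_filter(trajectory):
    """Bandpass-filter the time trajectory of one log-spectral channel."""
    out = []
    prev = 0.0
    for n in range(len(trajectory)):
        # FIR part: weighted sum of the current and four previous samples.
        fir = sum(b * trajectory[n - k]
                  for k, b in enumerate(NUMERATOR) if n - k >= 0)
        # Recursive part: single real pole.
        prev = fir + POLE * prev
        out.append(prev)
    return out


def rasta_features(power_spectra):
    """power_spectra[n][l]: positive filter-bank output, frame n, channel l."""
    num_channels = len(power_spectra[0])
    filtered = [rasta_filter([math.log(frame[l]) for frame in power_spectra])
                for l in range(num_channels)]
    # exp() maps back from the log domain; transpose to one vector per frame.
    return [[math.exp(filtered[l][n]) for l in range(num_channels)]
            for n in range(len(power_spectra))]


# A constant log-spectral offset (e.g. a fixed channel gain) is suppressed,
# because the numerator coefficients sum to zero (no gain at DC).
steady = rasta_filter([1.0] * 400)
print(abs(steady[-1]) < 1e-2)
```

Because the numerator has zero DC gain, a time-invariant offset in the log domain decays away after the initial transient, which is the insensitivity to very slow spectral changes described above.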
![Page 208: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/208.jpg)
RASTA Features: Discussion
RASTA Features
Effective for short time-invariant impulse responses (like CMN)
+ Good for compensating different microphone characteristics or
different telephone channels
+ Good for compensating coloration due to early reflections
Reverberation described by long RIRs
− Therefore: not suitable for compensating late reverberation
![Page 212: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/212.jpg)
Dynamic Features [Furui 1986]
Idea
Temporal changes of short-time spectra:
⇒ important for discriminating phonemes
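First and second derivatives of static features (∆ and ∆∆ features):
⇒ capture these changes
∆ Feature Calculation
$\Delta s_n = s_{n+\kappa} - s_{n-\kappa}$
$\Delta s_n = \dfrac{\sum_{\kappa=1}^{N_\Delta} \kappa \cdot \left( s_{n+\kappa} - s_{n-\kappa} \right)}{2 \cdot \sum_{\kappa=1}^{N_\Delta} \kappa^2}$
typical: $\kappa \in \{1, 2\}$ or $N_\Delta \in \{2, 3, 4\}$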
∆∆ features: calculated in a similar way from ∆ features
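The regression form of the ∆ calculation can be sketched in plain Python as below. The function name `delta_features` is hypothetical, and the edge handling (repeating the first and last frame) is an assumed convention that the slide does not specify.

```python
def delta_features(static, n_delta=2):
    """Regression-based delta features.

    static: list of frames (lists of floats); n_delta: N_Delta in the formula.
    Frames beyond the utterance edges are replaced by the nearest edge frame
    (assumed convention, not stated on the slide).
    """
    num_frames = len(static)
    dim = len(static[0])
    denom = 2.0 * sum(k * k for k in range(1, n_delta + 1))

    def frame(n):
        # Clamp the frame index to the valid range.
        return static[min(max(n, 0), num_frames - 1)]

    deltas = []
    for n in range(num_frames):
        row = []
        for c in range(dim):
            num = sum(k * (frame(n + k)[c] - frame(n - k)[c])
                      for k in range(1, n_delta + 1))
            row.append(num / denom)
        deltas.append(row)
    return deltas


# Toy static features: a linear ramp with slope 0.5 per frame.
ramp = [[0.5 * n] for n in range(10)]
d = delta_features(ramp)   # delta features
dd = delta_features(d)     # delta-delta features, computed from the deltas
print(d[4][0], dd[4][0])
```

Away from the edges, the ∆ value of a linear ramp equals its slope and the ∆∆ value is zero, which is a quick sanity check for the normalization in the denominator.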
![Page 216: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/216.jpg)
Why are Dynamic Features interesting for Reverberant ASR?
Reverberant Speech
Long-term relations between feature vectors
Cannot be captured by HMMs
⇒ Mitigation by feature vectors with long temporal reach
Temporal reach of features
Static features: typically 10 ms – 40 ms
∆ features: typically 20 ms – 120 ms
∆∆ features: typically 30 ms – 200 ms
Dynamic features can partly capture long-term relations
![Page 217: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/217.jpg)
Model-based Feature Enhancement [Krueger 2010]
[Block diagram: Reverberant speech x_t → Feature extraction → reverberant logmelspec coefficients x_n → Inference p(s_n|x_{1:n}) → enhanced logmelspec coefficients ŝ_n → DCT → enhanced MFCCs ŝ_n^{MFCC} → ASR → estimated transcription. RIR parameter estimation provides T60 to the observation model p(x_n|s_{n−T_H:n}); the a priori model p(s_n|s_{n−1}) and the observation model both feed the inference block.]
![Page 219: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/219.jpg)
Model-based Feature Enhancement [Krueger 2010]
A Priori Model: Clean Speech Model
Linear dynamic model
[Graph: chain of feature vectors s_{n−3}, s_{n−2}, s_{n−1}, s_n]
$s_n = A s_{n-1} + b + u_n$
![Page 222: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/222.jpg)
Model-based Feature Enhancement [Krueger 2010]
A Priori Model: Clean Speech Model
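Switching linear dynamic model
[Graph: hidden states q_{n−3}, q_{n−2}, q_{n−1}, q_n above the chain of feature vectors s_{n−3}, s_{n−2}, s_{n−1}, s_n]
$s_n = A(q_n) s_{n-1} + b(q_n) + u_n$
$p(s_n \mid s_{n-1}, q_n) = \mathcal{N}\left(s_n; A(q_n) s_{n-1} + b(q_n), \Sigma_u(q_n)\right)$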
Model for non-stationary feature vector sequences of clean speech
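To make the switching model concrete, the following minimal sketch samples a scalar feature sequence from it. It is a simplification under stated assumptions: the features are scalars rather than vectors, there are only two hidden states, and the hidden state follows a first-order Markov chain with a hypothetical transition table, which the slide does not specify. All names and parameter values are illustrative.

```python
import random


def sldm_step(s_prev, q, A, b, sigma_u, rng):
    """One step s_n = A(q_n) * s_{n-1} + b(q_n) + u_n for scalar features."""
    u = rng.gauss(0.0, sigma_u[q])  # u_n ~ N(0, sigma_u(q_n)^2)
    return A[q] * s_prev + b[q] + u


def simulate_sldm(num_frames, A, b, sigma_u, transition, seed=0):
    """Sample hidden states q_n and features s_n for a two-state model."""
    rng = random.Random(seed)
    q, s = 0, 0.0
    states, features = [], []
    for _ in range(num_frames):
        # Assumed first-order Markov chain for the hidden state:
        # transition[q][0] is the probability of moving to state 0.
        q = 0 if rng.random() < transition[q][0] else 1
        s = sldm_step(s, q, A, b, sigma_u, rng)
        states.append(q)
        features.append(s)
    return states, features


# Hypothetical parameters: state 0 drifts towards 2.0, state 1 towards -2.0.
states, features = simulate_sldm(
    num_frames=200,
    A=[0.5, 0.9],
    b=[1.0, -0.2],
    sigma_u=[0.1, 0.1],
    transition=[[0.95, 0.05], [0.05, 0.95]],
)
print(states[:10], [round(v, 2) for v in features[:5]])
```

Each hidden state selects its own linear dynamics, so the sampled sequence changes its local mean and dynamics whenever the state switches, which is how the model represents non-stationary feature sequences.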
![Page 227: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/227.jpg)
Model-based Feature Enhancement [Krueger 2010]
Observation Model: Reverberation Model
based on melspec convolution
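$x_n = \log\left( \sum_{\tau=0}^{T_H} \exp(h_\tau + s_{n-\tau}) \right) + v_n$
$\phantom{x_n} = f(s_{n-T_H:n}, h_{0:T_H}) + v_n$
$p(v_n) = \mathcal{N}(v_n; \mu_v, \Sigma_v)$
$p(x_n \mid s_{n-T_H:n}) = \mathcal{N}\left(x_n; f(s_{n-T_H:n}, h_{0:T_H}) + \mu_v, \Sigma_v\right)$
$v_n$: captures approximation error
$h_{0:T_H}$: based on strictly exponentially decaying RIR model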
⇒ Only T60 needs to be estimated
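The deterministic part f of the observation model can be sketched for a single mel channel as follows. The parametrization of h from T60 is an assumption made for illustration (energy dropping by 60 dB over T60 seconds, with h_0 = 0); the exact parametrization used in [Krueger 2010] is not given on the slide. The error term v_n is set to zero, and the function names are hypothetical.

```python
import math


def log_decay_weights(t60, frame_shift, num_taps):
    """h_0 .. h_{T_H} for an assumed exponential energy decay.

    The energy drops by 60 dB (a factor of 10^-6) over t60 seconds, so the
    log-energy falls linearly by 6*ln(10) per t60. num_taps equals T_H + 1.
    """
    slope = -6.0 * math.log(10.0) * frame_shift / t60
    return [slope * tau for tau in range(num_taps)]


def reverberant_logmel(s, h):
    """x_n = log(sum_tau exp(h_tau + s_{n-tau})) for one channel, v_n = 0."""
    x = []
    for n in range(len(s)):
        terms = [h[tau] + s[n - tau] for tau in range(len(h)) if n - tau >= 0]
        peak = max(terms)  # subtract the maximum for numerical stability
        x.append(peak + math.log(sum(math.exp(t - peak) for t in terms)))
    return x


# T60 = 0.9 s and a 10 ms frame shift (the frame shift is an assumed value).
h = log_decay_weights(t60=0.9, frame_shift=0.01, num_taps=3)
# Toy clean log-energies: one loud frame followed by near silence.
s = [0.0, -50.0, -50.0, -50.0]
x = reverberant_logmel(s, h)
print([round(v, 3) for v in x])
```

In the toy input, the frames after the loud one are dominated by its decaying tail rather than by their own clean value, which is the long-term dependency the observation model is meant to capture.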
![Page 232: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/232.jpg)
Model-based Feature Enhancement [Krueger 2010]
Bayesian Inference
MMSE estimate
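$\hat{s}_n = E\{ s_n \mid x_{1:n} \}$
$p(s_n \mid x_{1:n}) = \dfrac{p(x_n \mid s_n, x_{1:n-1}) \, p(s_n \mid x_{1:n-1})}{\int p(x_n \mid s_n, x_{1:n-1}) \, p(s_n \mid x_{1:n-1}) \, ds_n}$
$\approx \dfrac{p(x_n \mid s_{n-T_H:n}) \sum_{i=1}^{M} p(s_n \mid s_{n-1}, q_n = i) \, p(q_n = i)}{\int p(x_n \mid s_n, x_{1:n-1}) \sum_{i=1}^{M} p(s_n \mid s_{n-1}, q_n = i) \, p(q_n = i) \, ds_n}$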
⇒ Inference performed by bank of iterated extended Kalman filters
![Page 235: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/235.jpg)
Model-based Feature Enhancement [Krueger 2010]
Discussion
Approach tailored to reverberant feature vector sequences
long-term relations explicitly captured by observation model
+ Promising results reported on AURORA 5 task (connected digits)
+ Moderate computational complexity
+ Latency of only a few frames
Suitable for online recognition in reverberant environments
![Page 236: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/236.jpg)
Further Feature-based Approaches
[Petrick 2008] Harmonicity-based feature analysis
[Thomas 2008] Frequency-domain linear prediction
[Wolfel 2009] Particle filter-based feature enhancement
[Kumar 2010] Cepstral inverse filtering
![Page 237: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/237.jpg)
Part III: Robust ASR in Reverberant Environments
Introduction
Feature-based Approaches
Model-based Approaches
Decoder-based Approaches
REMOS
![Page 238: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/238.jpg)
Key Idea of Model-based Approaches
Mismatch between clean HMM and reverberant data
[Diagram: "reverberant test data" and "clean HMM" shown at different positions in a plane whose axes are both labeled "statistical properties"]
![Page 239: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/239.jpg)
Key Idea of Model-based Approaches
Feature-based: “dereverberate” data
[Diagram: "dereverberated test data" and "clean HMM" in a plane whose axes are both labeled "statistical properties"]
![Page 242: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/242.jpg)
Key Idea of Model-based Approaches
Model-based: “reverberate” acoustic model
Adjust acoustic model to statistical properties of reverberant data
[Diagram: "reverberant test data", "reverberant HMM" and "clean HMM" in a plane whose axes are both labeled "statistical properties"]
![Page 248: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/248.jpg)
Training with Reverberant Data
Matched Training
Record training data in target environment
+ Training data perfectly capture statistical properties
− Extremely high effort
Generate training data by convolution with RIR [Giuliani 1999, Stahl 2001, Matassoni 2002]
+ Significantly reduced effort
+ Only slight degradation in recognition performance [Stahl 2001]
Multi-Style Training
Use training data from many different rooms
+ Robust HMMs
+ Very flexible
− Discrimination capability reduced compared to matched training
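The two ways of generating training data listed above (convolution with one RIR for matched training, with RIRs from several rooms for multi-style training) can be sketched as follows. This is a minimal sketch assuming NumPy; the function names `synthetic_rir` and `reverberate`, the noise-based toy RIR, and the T60 values are illustrative assumptions and are not taken from the cited work. In practice, measured RIRs and real clean recordings would replace the placeholders.

```python
import numpy as np

def synthetic_rir(t60, fs, rng):
    """Toy RIR (assumption): white noise under an exponential decay of 60 dB per t60 seconds."""
    n = int(t60 * fs)
    t = np.arange(n) / fs
    # Energy drops by 60 dB over t60, so the amplitude envelope uses half that exponent.
    envelope = np.exp(-3.0 * np.log(10.0) * t / t60)
    rir = rng.standard_normal(n) * envelope
    rir[0] = 1.0  # direct path
    return rir / np.max(np.abs(rir))

def reverberate(clean, rir):
    """Convolve a clean utterance with an RIR and trim to the original length."""
    return np.convolve(clean, rir)[: len(clean)]

rng = np.random.default_rng(0)
fs = 8000
clean = rng.standard_normal(fs)  # placeholder for one second of clean speech

# Matched training: a single RIR describing the target room.
matched = reverberate(clean, synthetic_rir(0.5, fs, rng))

# Multi-style training: the same utterance rendered through several rooms.
multi_style = [reverberate(clean, synthetic_rir(t60, fs, rng)) for t60 in (0.3, 0.6, 0.9)]
```

The reverberated signals would then pass through the usual feature extraction and HMM training; those steps are not shown here.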
![Page 249: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/249.jpg)
Matched Training
[Figure: diagram with both axes labeled "statistical properties", showing the clean HMM, the reverberant test data, and the matched HMM]
![Page 251: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/251.jpg)
Matched Training: Modeling Accuracy
Histograms
[Figure: two plots of estimated pdf over x, each showing the histogram of the reverberant data, the output pdf of the clean HMM, and the output pdf of the reverberant HMM; one plot for state j = 1, channel l = 3, and one for state j = 5, channel l = 21]
Auto-CoVariances (ACVs)
[Figure: two plots over frame τ and mel channel l; ACVs of reverberant speech, j = 9, and ACVs captured by HMM, j = 9]
![Page 252: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/252.jpg)
Multi-Style Training
[Figure: diagram with both axes labeled "statistical properties", showing the clean HMM, the reverberant training data, the multi-style HMM, and the reverberant test data]
![Page 263: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/263.jpg)
Reverberation-Adaptive Training
Idea
Capture only linguistic variabilities by acoustic model
Remove acoustic variabilities by appropriate transforms
Approach
Multi-style training with dereverberated data
Similar to noise-adaptive training [Deng 2000] or model-independent adaptive training [Gales 2001]
+ long-term relations partly removed by dereverberation
+ room dependency reduced
⇒ discrimination capability increased compared to multi-style training
Successfully applied, e.g., in [Kinoshita 2006]
![Page 264: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/264.jpg)
Reverberation-Adaptive Training
[Figure: diagram with both axes labeled "statistical properties", showing the clean HMM, the reverberant test data, the dereverberated training data, the adaptive HMM, and the dereverberated test data]
![Page 272: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/272.jpg)
Data-driven Adaptation
Approaches
Maximum A Posteriori adaptation (MAP) [Gauvain 1994]
Maximum Likelihood Linear Regression (MLLR) [Leggetter 1995, Gales 1998]
Successfully used for speaker and noise adaptation
Can also be used for reducing mismatch due to reverberation
![Page 277: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/277.jpg)
MLLR
Adaptation of the HMM mean vectors and covariance matrices
$\mu_X = D\,\mu_S + d$
$\Sigma_{XX} = E\,\Sigma_{SS}\,E^T$
Transformation parameters D, d, E estimated by EM algorithm
Supervised MLLR: known transcription
Unsupervised MLLR: during recognition
CMLLR (Constrained MLLR)
Same transformation matrix for mean vector and covariance matrix
$\mu_X = D\,\mu_S + d$
$\Sigma_{XX} = D\,\Sigma_{SS}\,D^T$
+ fewer adaptation parameters
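To make the two transforms concrete, the sketch below applies them to a single Gaussian, assuming NumPy. The function names `mllr_adapt` and `cmllr_adapt` and all numeric values are hypothetical; the EM estimation of D, d, and E is not shown, so the parameters are simply given.

```python
import numpy as np

def mllr_adapt(mu_s, sigma_ss, D, d, E):
    """Apply mu_X = D mu_S + d and Sigma_XX = E Sigma_SS E^T to one Gaussian."""
    return D @ mu_s + d, E @ sigma_ss @ E.T

def cmllr_adapt(mu_s, sigma_ss, D, d):
    """Constrained variant: the same matrix D transforms mean and covariance."""
    return mllr_adapt(mu_s, sigma_ss, D, d, D)

# Hypothetical 2-dimensional example values (not from the slides).
mu_s = np.array([1.0, 2.0])
sigma_ss = np.diag([0.5, 2.0])
D = np.array([[1.0, 0.5], [0.0, 1.0]])
d = np.array([0.1, -0.2])
E = np.diag([2.0, 1.0])

mu_x, sigma_xx = mllr_adapt(mu_s, sigma_ss, D, d, E)   # separate E for the covariance
mu_c, sigma_c = cmllr_adapt(mu_s, sigma_ss, D, d)      # D reused for the covariance
```

With these values the adapted mean is [2.1, 1.8] in both cases, while the covariance differs: diag(2, 2) for MLLR and [[1, 1], [1, 2]] for CMLLR.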
![Page 278: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/278.jpg)
Data-driven Approaches
Illustration: Example Matched Training on Reverberated Data
[Figure: block diagram with the elements "clean-speech training data", "description of the acoustic environment (e.g., set of RIRs)", "reverberated training data", and "reverberantly-trained HMM"]
![Page 282: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/282.jpg)
Data-driven Approaches
Discussion
Very accurate description of statistical properties by reverberant training/adaptation data
Loss of accuracy: only when turning data into model
Reverberant training: requires large amount of reverberant data
Data-driven adaptation: moderate amount of reverberant data (but more than model-based adaptation)
Main Limitation
Conventional HMMs cannot accurately capture long-term relations
![Page 283: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/283.jpg)
Parametric Model-Based Approaches
Illustration
[Figure: block diagram with the elements "clean-speech training data", "clean-speech HMM", "description of the acoustic environment (e.g., set of RIRs)", "reverberation representation", and "adapted HMM"]
![Page 288: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/288.jpg)
Parametric Model-based Adaptation
[Figure: block diagram with the elements "clean-speech HMMs", "reverberation model", "adaptation algorithm", and "adapted HMMs"]
Discussion
proposed in [Raut 2006, Hirsch 2008, Sehr 2009]
based on melspec convolution
+ long-term relations considered for HMM parameter estimation
+ no adaptation utterances necessary
− reduced accuracy due to approximation errors
− additional loss of accuracy when mapping combination to HMM
Main Limitation
Conventional HMMs cannot accurately capture long-term relations
![Page 290: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/290.jpg)
Parametric Model-based Adaptation
Mean Adaptation Approach [Raut 2006, Hirsch 2008, Sehr 2009]
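[Figure: processing chain with the steps "calculate cepstral averages", "transform to melspec domain", "perform adaptation", and "transform to cepstral domain"; labeled quantities: $\mu_{S_{\mathrm{MFCC}}}$, $\mu_{S_{\mathrm{MEL}}}$, $\beta$, $\mu_{X_{\mathrm{MEL}}}$, $\mu_{X_{\mathrm{MFCC}}}$]
Adaptation Equation
$\mu_{X_{\mathrm{MEL}}}(l, j) = \sum_p \beta(l, j, j - p)\, \mu_{S_{\mathrm{MEL}}}(l, j - p)$
$\beta(l, j, i)$ state-level reverberation representation:
describes energy dispersion from state i to j in channel l
i, j state indices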
l mel channel index
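The adaptation equation can be evaluated directly once β is known. The sketch below assumes NumPy, zero-based state indices, linear (non-logarithmic) melspec means, and a left-to-right HMM so that only the current and preceding states (p = 0, ..., j) contribute; these are assumptions for illustration, as are the function name `adapt_mel_means` and the numeric values. The transforms between the cepstral and melspec domains are omitted.

```python
import numpy as np

def adapt_mel_means(mu_s_mel, beta):
    """Evaluate mu_X(l, j) = sum_p beta(l, j, j - p) * mu_S(l, j - p).

    mu_s_mel: (L, J) clean melspec means per channel l and state j.
    beta:     (L, J, J), beta[l, j, i] weights the contribution of state i to state j.
    """
    L, J = mu_s_mel.shape
    mu_x_mel = np.zeros((L, J))
    for l in range(L):
        for j in range(J):
            for p in range(j + 1):  # assumption: only states i = j - p <= j contribute
                i = j - p
                mu_x_mel[l, j] += beta[l, j, i] * mu_s_mel[l, i]
    return mu_x_mel

# Hypothetical example: one mel channel, three states.
mu_s_mel = np.array([[4.0, 2.0, 1.0]])
beta = np.array([[[0.80, 0.00, 0.00],
                  [0.15, 0.80, 0.00],
                  [0.05, 0.15, 0.80]]])
mu_x_mel = adapt_mel_means(mu_s_mel, beta)
```

For these values the adapted means are 3.2, 2.2, and 1.3: each state keeps most of its own energy and receives a smaller share dispersed from the states before it.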
![Page 293: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/293.jpg)
Estimation of Reverberation Representation [Hirsch 2008]
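[Figure: decay curve $h_t^2$ over time t, divided into segments for state 1, state 2, and state 3, with $t_{\mathrm{start}}(2,1)$ and $t_{\mathrm{end}}(2,1)$ marked]
$h_t^2 = \frac{6 \log(10)}{T_{60M}} \cdot \exp\left(-\frac{6 \log(10)}{T_{60M}} \cdot t\right)$, for $t \ge 0$
$\beta(j, i) = \int_{t_{\mathrm{start}}(j,i)}^{t_{\mathrm{end}}(j,i)} h_t^2 \, dt$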
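Because the decay model is a plain exponential, the integral for β(j, i) has a closed form: with a = 6 log(10) / T, the integral from t_start to t_end equals exp(-a t_start) - exp(-a t_end). The sketch below evaluates it. It assumes that log is the natural logarithm (so that the energy has dropped by 60 dB at t = T), that the quantity in the denominator is a single decay time T in the same unit as t, and that the segment boundaries are given; the function name `beta_from_decay` and the boundaries used are hypothetical.

```python
import math

def beta_from_decay(t_start, t_end, decay_time):
    """Integral of a * exp(-a * t) over [t_start, t_end], with a = 6 * ln(10) / decay_time."""
    a = 6.0 * math.log(10.0) / decay_time
    return math.exp(-a * t_start) - math.exp(-a * t_end)

# Hypothetical example: decay time 0.6 s, three consecutive segments of 0.1 s each.
decay_time = 0.6
betas = [beta_from_decay(k * 0.1, (k + 1) * 0.1, decay_time) for k in range(3)]
total = beta_from_decay(0.0, math.inf, decay_time)
```

With a decay time of 0.6 s the energy falls by 10 dB every 0.1 s, so the three segments receive 0.9, 0.09, and 0.009 of the total energy, and the integral over all t ≥ 0 is 1.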
![Page 296: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/296.jpg)
Limitation of Conventional HMMs
Emission pdf of Conventional HMMs
$p(x_n \mid j)$
⇒ conditional independence assumption
Conditional Emission pdf
capturing long-term relationships by
$p(x_n \mid j, x_{1:n-1})$
Approximation by Context-aware Methods:
Frame-wise HMM adaptation
REMOS
![Page 299: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/299.jpg)
Conventional Adaptation versus Context-Aware Methods
[Figure: three flowcharts with the following boxes]
(a) Conventional HMM adaptation: HMM adaptation, Viterbi initialization, Viterbi score calculation, Finished?
(b) Frame-wise adaptation: Viterbi initialization, HMM adaptation, Viterbi score calculation, Finished?
(c) REMOS: Viterbi initialization, Inner optimization, Viterbi score calculation, Finished?
![Page 302: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/302.jpg)
Frame-wise Adaptation
$x_n \approx \log(\exp(h_0 + s_n) + \exp(r_n))$
$\mu_{x_n}(j) = \log(\exp(h_0 + \mu_s(j)) + \exp(r_n))$
$r_n$: late reverberation
$j$: state index
Autoregressive Modeling [Takiguchi 2006]
$r_n = a + x_{n-1}$, with $a$ a prediction coefficient
Moving-Average Modeling [Sehr 2011]
$r_n = \log\left(\sum_{\tau=1}^{T_H} \exp(\mu_{h_\tau} + s_{n-\tau})\right)$
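The frame-wise mean adaptation above can be sketched in Python for a single logmelspec channel (scalar features). This is a minimal illustration that only evaluates the formulas as written on the slide; the function names and argument layout are assumptions made here, not part of the cited methods.

```python
import math

def log_add(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))

def late_reverb_ar(x_prev, a):
    """Autoregressive estimate: r_n = a + x_{n-1}."""
    return a + x_prev

def late_reverb_ma(mu_h, s_prev):
    """Moving-average estimate: r_n = log(sum_tau exp(mu_h[tau] + s_{n-tau})).

    mu_h[0] pairs with s_{n-1}, mu_h[1] with s_{n-2}, and so on
    (this list layout is an assumption of the sketch).
    """
    r = -math.inf
    for mu, s in zip(mu_h, s_prev):
        r = log_add(r, mu + s)
    return r

def adapted_mean(mu_s_j, h0, r_n):
    """Adapted mean of state j: log(exp(h0 + mu_s(j)) + exp(r_n))."""
    return log_add(h0 + mu_s_j, r_n)
```

Because the mean is recomputed for every frame from `r_n`, the adapted mean never falls below the larger of `h0 + mu_s_j` and `r_n`; strong late reverberation therefore raises the mean of every state in that frame.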
![Page 306: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/306.jpg)
Frame-wise Adaptation
Discussion
+ Overcomes conditional independence assumption
+ Accurate modeling of long-term relations
− Increased computational complexity
− Increased effort for integration into ASR systems
Full potential not yet demonstrated
Promising direction for future research
![Page 307: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/307.jpg)
Further Model-based Approaches
[Couvreur 2001] Reverberant training of several HMMs + model selection
[Sehr 2010b] Training of reverberant HMMs on stereo data
[Gales 2011] Extension of MLLR and VTS to reverberant environments
![Page 308: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/308.jpg)
Part III: Robust ASR in Reverberant Environments
Introduction
Feature-based Approaches
Model-based Approaches
Decoder-based Approaches
REMOS
![Page 312: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/312.jpg)
Overview of Decoder-based Approaches
Key Idea
Modify the decoding algorithm to increase reverberation robustness
Two Approaches
Missing feature techniques
⇒ Distinguish between reliable and unreliable observations
⇒ Estimate or discard the unreliable parts
Uncertainty decoding
⇒ Combined with signal or feature enhancement techniques
⇒ Exploit reliability information about enhanced data
Decoder-based approaches bridge the gap between
feature-based and model-based approaches
![Page 318: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/318.jpg)
Missing Feature Techniques
For overviews see [Cooke 2001, Raj 2005, Kolossa 2011]
Key Ideas
• Partition the observations into reliable and missing components
• Use only the reliable components for recognition
Main Steps
• Mask estimation: Mark observations as either reliable or missing
• Handle missing data appropriately
How to handle missing data?
• Marginalization: Eliminate unreliable data by integration over corresponding dimensions
• Bounded marginalization: Exploit known bounds of the missing data for integration
• Data imputation: Determine state-dependent estimates for the unreliable data, given the reliable data
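The first two ways of handling missing data can be sketched for one diagonal-covariance Gaussian. This is a hedged illustration: the function names, the boolean mask layout, and the choice of the observed value as the upper integration bound (plausible for log-spectral features, where interference only adds energy) are assumptions of this sketch, not taken from the cited overviews.

```python
import math

def log_gauss(x, mu, var):
    """Log density of a scalar Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)

def gauss_cdf(x, mu, var):
    """Cumulative distribution function of a scalar Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mu) / math.sqrt(2.0 * var)))

def log_lik_marginal(x, mask, mu, var):
    """Marginalization: unreliable dimensions integrate to 1 and drop out."""
    return sum(log_gauss(xi, m, v)
               for xi, reliable, m, v in zip(x, mask, mu, var) if reliable)

def log_lik_bounded(x, mask, mu, var, lower):
    """Bounded marginalization: an unreliable dimension contributes the
    probability mass between a lower bound and the observed value."""
    total = 0.0
    for xi, reliable, m, v, lo in zip(x, mask, mu, var, lower):
        if reliable:
            total += log_gauss(xi, m, v)
        else:
            mass = gauss_cdf(xi, m, v) - gauss_cdf(lo, m, v)
            total += math.log(max(mass, 1e-300))  # guard against log(0)
    return total
```

With `mask = [True, False]`, only the first dimension is scored as a density; the second either vanishes (marginalization) or contributes a probability mass of at most one (bounded marginalization), so the bounded score cannot exceed the marginal one.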
![Page 320: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/320.jpg)
Missing Feature Techniques for Reverberation Robustness
[Palomaki 2004]
Modulation filtering for the mask estimation
Bounded marginalization for handling missing data
[Gemmeke 2011]
Oracle masks based on clean and reverberant features
Semi-Oracle masks based on clean features and estimated RIRs
Gaussian-dependent bounded imputation
![Page 323: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/323.jpg)
Uncertainty Decoding
Conventional Feature Enhancement Methods
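[Figure: block diagram. Observed features $x_n$ → feature enhancement algorithm → enhanced features $\hat{s}_n$ → decoding (using the acoustic model) → transcription.]
Use only point estimate $\hat{s}_n$ of clean features
Contribution of each Gaussian component $m$:
$p(\hat{s}_n \mid m) = \mathcal{N}(\hat{s}_n;\, \mu_s^{(m)}, \Sigma_s^{(m)})$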
![Page 326: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/326.jpg)
Uncertainty Decoding
[Droppo 2002, Deng 2005, Liao 2008, Haeb-Umbach 2011]
Feature Enhancement Combined with Uncertainty Decoding
[Figure: block diagram. Observed features $x_n$ → feature enhancement algorithm → $p(\hat{s}_n \mid s_n)$ → decoding (using the acoustic model) → transcription.]
Signal/feature enhancement inevitably introduces distortions
Use reliability information in addition to point estimate
⇒ Use $p(\hat{s}_n \mid s_n)$ instead of $\hat{s}_n$
![Page 331: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/331.jpg)
Uncertainty Decoding
[Droppo 2002, Deng 2005, Liao 2008, Haeb-Umbach 2011]
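Mismatch Model
$\hat{s}_n = s_n + b_n$
$p(b_n) = \mathcal{N}(b_n;\, 0, \Sigma_{b_n})$
$p(\hat{s}_n \mid s_n, m) \approx p(\hat{s}_n \mid s_n) = p(b_n)$
Contribution of Gaussian Component $m$
$p(\hat{s}_n \mid m) = \int p(\hat{s}_n, s_n \mid m)\, ds_n = \int p(\hat{s}_n \mid s_n, m)\, p(s_n \mid m)\, ds_n \approx \mathcal{N}(\hat{s}_n;\, \mu_s^{(m)}, \Sigma_s^{(m)} + \Sigma_{b_n})$
Unreliable features ⇒ large $\Sigma_{b_n}$ ⇒ little effect on Viterbi score
Main challenge: Estimation of time-variant feature covariance $\Sigma_{b_n}$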
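The effect of the inflated covariance on the score can be illustrated with a short Python sketch for diagonal covariances. The function names and the numbers in the usage note are illustrative assumptions; the sketch only evaluates the variance-compensated Gaussian stated above.

```python
import math

def log_gauss_diag(x, mu, var):
    """Log density of a Gaussian with diagonal covariance (lists of floats)."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mu, var))

def ud_log_lik(s_hat, mu_s, var_s, var_b):
    """Uncertainty decoding score of one Gaussian component:
    the mismatch variance var_b is added to the model variance var_s."""
    compensated = [vs + vb for vs, vb in zip(var_s, var_b)]
    return log_gauss_diag(s_hat, mu_s, compensated)

def score_gap(s_hat, var_b):
    """Log-likelihood gap between two hypothetical unit-variance components
    with means 0 and 3, evaluated at the enhanced feature s_hat."""
    return (ud_log_lik(s_hat, [0.0], [1.0], var_b)
            - ud_log_lik(s_hat, [3.0], [1.0], var_b))
```

For `s_hat = [0.0]`, `score_gap(s_hat, [0.0])` is large, so a reliable feature discriminates strongly between the two components, whereas `score_gap(s_hat, [100.0])` is close to zero: an unreliable feature barely changes the ranking of hypotheses.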
![Page 335: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/335.jpg)
Uncertainty Decoding for Reverberation-Robust ASR
[Delcroix 2009, Delcroix 2011a]
Key Idea
Strong reverberation
⇒ Large effect of speech enhancement
⇒ Large mismatch between clean and enhanced features
Effect of speech enhancement captured by $b_n = x_n - \hat{s}_n$
⇒ Mismatch covariance assumed proportional to the squared difference between observed and enhanced features
Model elements of time-variant diagonal mismatch covariance matrix $\Sigma_{b_n}$ as
$(\Sigma_{b_n})_{ii} = \alpha_i\, b_{n,i}^2$
$\alpha$ is estimated by EM algorithm using adaptation data
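A minimal sketch of the dynamic variance model above, assuming the weights $\alpha_i$ are already available (the slides obtain them by EM on adaptation data, which is not reproduced here). Function and variable names are hypothetical.

```python
def dynamic_mismatch_variance(x_n, s_hat_n, alpha):
    """Diagonal of Sigma_{b_n}: alpha_i * (x_{n,i} - s_hat_{n,i})**2."""
    return [a * (x - s) ** 2 for x, s, a in zip(x_n, s_hat_n, alpha)]

def compensated_variance(var_s, var_b):
    """Diagonal of the compensated covariance Sigma_s^(m) + Sigma_{b_n}."""
    return [vs + vb for vs, vb in zip(var_s, var_b)]
```

Where enhancement leaves a feature unchanged, the difference is zero and the model variance is used as is; where enhancement changes a feature strongly, the variance of that dimension grows and the feature counts less in decoding.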
![Page 336: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/336.jpg)
Uncertainty Decoding for Reverberation-Robust ASR
[Delcroix 2009, Delcroix 2011a]
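[Figure: block diagram of recognition with dynamic variance compensation. Reverberant speech $x_t$ is dereverberated to give dereverberated speech $\hat{s}_t$; feature extraction yields $x_n$ and $\hat{s}_n$; the variance block estimates the feature covariance $\Sigma_{b_n}$; variance compensation adds it to the Gaussian covariance $\Sigma_s^{(m)}$ of the acoustic model, giving the compensated covariance sequence $\Sigma_s^{(m)} + \Sigma_{b_n}$; recognition uses the Gaussian mean $\mu_s^{(m)}$ and the compensated covariance to output the word $w$.]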
![Page 339: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/339.jpg)
Uncertainty Decoding for Reverberation-Robust ASR
[Delcroix 2009, Delcroix 2011a]
Discussion
− Accounting for time-variant covariance matrix $\Sigma_{b_n}$ increases computational complexity
+ Can be combined with static variance compensation and mean adaptation by MLLR
+ Independent of enhancement algorithm ⇒ Highly flexible
+ Has been used successfully also for non-stationary interferences [Delcroix 2011b]
Promising approach for interconnection of signal/feature-based methods and ASR systems
![Page 340: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/340.jpg)
Part III: Robust ASR in Reverberant Environments
Introduction
Feature-based Approaches
Model-based Approaches
Decoder-based Approaches
REMOS
![Page 342: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/342.jpg)
REMOS: REverberation MOdeling for Speech Recognition
Online Model Combination
[Figure: the clean-speech model (CSM) and the reverberation model (RVM) are joined by a combination operator; the RVM takes the previous observations into account to explain the current observation.]
CSM: clean-speech model ⇒ HMM network
RVM: reverberation model
Combination of CSM and RVM ⇒ context-aware acoustic model
Advantages
• CSM and RVM are trained independently
• Changing environment: adjust only RVM
• Changing task: adjust only CSM
• High degree of flexibility
[Sehr 2010c]
![Page 344: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/344.jpg)
REMOS Decoding [Sehr 2010c]
[Figure: block diagram. Reverberant speech $x_t$ → feature extraction → $x_n$ → extended Viterbi algorithm (using CSM and RVM) → transcription.]
Extended Viterbi Algorithm: finds most likely path through CSM
Inner Optimization: accounts for RVM and previous observations; determines most likely contributions of CSM and RVM to current observation
![Page 347: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/347.jpg)
REMOS [Sehr 2010c]
Online combination of model outputs from clean-speech HMM and reverberation model, capturing long-term relations:
Combination Operator
$x_n = f(s_n, s_{n-T_H:n-1}, h_n, a_n) = \log(\exp(h_n + s_n) + \exp(r_n + a_n))$
Late Reverberation Estimate
$r_n = \log\left(\sum_{\tau=1}^{T_H} \exp(\mu_{H_\tau} + s_{n-\tau})\right)$
$r_n$: logmelspec late reverberation estimate
$a_n$: captures approximation error of $r_n$
$h_n$: logmelspec representation of direct sound component of RIR
$\mu_{H_{1:T_H}}$: mean vectors of logmelspec representation for late reverberation
![Page 348: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/348.jpg)
REMOS [Sehr 2010c]
Illustration of Generative Model
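[Figure: generative model. The clean-speech model emits $s_n$ according to $p(s_n \mid j)$; the reverberation model supplies $h_n$ according to $p(h_n)$ and $a_n$ according to $p(a_n)$; the combination operator $f$ maps $s_n, s_{n-1}, \ldots, s_{n-T_H}$, $h_n$ and $a_n$ to the observation $x_n$.]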
![Page 350: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/350.jpg)
REMOS [Sehr 2010c]
Conditional emission pdf is decomposed into reverberation model and clean HMM:
$p(x_n \mid j, x_{1:n-1}) = \int p(x_n \mid s_n, x_{1:n-1})\, p(s_n \mid j)\, ds_n$
Reverberation Model:
$p(x_n \mid s_n, x_{1:n-1}) = \iint p(h_n)\, p(a_n)\, \delta(x_n - f(s_n, s_{n-T_H:n-1}, h_n, a_n))\, dh_n\, da_n$
![Page 352: Reverberant Speech Processing for Human …...Reverberant Speech Processing for Human Communication and Automatic Speech Recognition Tomohiro Nakatani, Armin Sehr, Walter Kellermann](https://reader030.vdocuments.net/reader030/viewer/2022040309/5f200197242b20550f00e11d/html5/thumbnails/352.jpg)
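REMOS [Sehr 2010c]
Approximation of Conditional Emission pdf: by maximum values of integrand
$$p(x_n \mid j, x_{1:n-1}) \approx p(\hat{h}_n)\, p(\hat{a}_n)\, p(\hat{s}_n \mid j)$$
maximum values $\hat{h}_n, \hat{a}_n, \hat{s}_n$ determined by inner optimization
$$(\hat{h}_n, \hat{a}_n, \hat{s}_n) = \underset{(h_n, a_n, s_n)}{\arg\max}\; p(h_n)\, p(a_n)\, p(s_n \mid j) \quad \text{subject to } x_n = f(s_n, s_{n-T_H:n-1}, h_n, a_n)$$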
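To make the constrained maximization concrete, here is a minimal scalar sketch in Python. It is not the REMOS implementation from [Sehr 2010c]. The combination function `f(s, h, a) = log(exp(s + h) + exp(a))`, the scalar Gaussian priors, the omission of the previous frames, and the grid search are all assumptions chosen for illustration, and the names `log_gauss` and `inner_optimization` are hypothetical.

```python
import numpy as np


def log_gauss(v, mean, std):
    """Log of a scalar Gaussian density, evaluated elementwise."""
    return -0.5 * np.log(2.0 * np.pi * std**2) - 0.5 * ((v - mean) / std) ** 2


def inner_optimization(x, prior_s, prior_h, prior_a, grid=201):
    """Grid-search sketch of the constrained maximization.

    Assumed toy combination function (NOT taken from the slides):
        x = f(s, h, a) = log(exp(s + h) + exp(a))
    Each prior is a (mean, std) pair of a scalar Gaussian.
    Returns (h, a, s, log_score) of the best feasible grid point.
    """
    mu_h, sd_h = prior_h
    mu_a, sd_a = prior_a
    h = np.linspace(mu_h - 4 * sd_h, mu_h + 4 * sd_h, grid)
    # a must stay strictly below x, otherwise no s can satisfy the constraint.
    a_hi = min(mu_a + 4 * sd_a, x - 1e-3)
    a_lo = min(mu_a - 4 * sd_a, a_hi - 1.0)
    a = np.linspace(a_lo, a_hi, grid)
    H, A = np.meshgrid(h, a, indexing="ij")
    # Eliminate s via the constraint: exp(s + h) = exp(x) - exp(a).
    S = x + np.log1p(-np.exp(A - x)) - H
    # Log of the integrand p(h) p(a) p(s | j) on the feasible set.
    score = log_gauss(H, *prior_h) + log_gauss(A, *prior_a) + log_gauss(S, *prior_s)
    i, j = np.unravel_index(np.argmax(score), score.shape)
    return float(H[i, j]), float(A[i, j]), float(S[i, j]), float(score[i, j])


h_hat, a_hat, s_hat, log_p = inner_optimization(
    x=2.0, prior_s=(1.5, 1.0), prior_h=(0.0, 0.5), prior_a=(0.0, 1.0)
)
print(h_hat, a_hat, s_hat, log_p)
```

Solving the constraint for `s` turns the constrained problem over three variables into an unconstrained search over two, which is why only `h` and `a` are gridded. The returned `log_p` plays the role of the approximated log emission score for one state and one frame.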
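Detailed Illustration of REMOS Decoding
[Figure: REMOS decoding over frame $n$, state $j$, and mel channel $l$. Blocks: network of clean-speech HMMs; reverberation model (RVM) with $p(h_n)\, p(a_n)$; clean-speech model (CSM) with $p(s_n \mid j)$; find previous clean-speech vectors $s_{n-T_H}, \ldots, s_{n-1}$; inner optimization given $x_n$, yielding $h_n$, $s_n$ and $p(x_n \mid j, x_{1:n-1})$; calculate Viterbi score. Data structures: Viterbi score matrix with $\gamma_{n-1}(i)$, $\gamma_n(j)$ and $\alpha_{ij}$; backtracking matrix $\psi_n(j)$; matrix of clean-speech vectors $s_n(j)$ (3D tensor).]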
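The outer loop of the decoding figure is a Viterbi recursion whose emission score comes from the inner optimization. The Python sketch below shows only that outer structure, under stated assumptions: $\alpha_{ij}$ is read as the transition probability from state $i$ to state $j$ (the slide does not define it), everything is kept in the log domain, and the emission score is a user-supplied callable `log_emission(n, j)`. In REMOS the emission also depends on the previous clean-speech vectors found by backtracking, which this simplified sketch omits. The function name `viterbi` and its arguments are hypothetical.

```python
import numpy as np


def viterbi(log_trans, log_init, log_emission, num_frames):
    """Log-domain Viterbi with a pluggable emission score.

    log_trans[i, j]    : log transition score from state i to state j
    log_init[j]        : log initial score of state j
    log_emission(n, j) : callable returning log p(x_n | j, x_{1:n-1});
                         a REMOS-style decoder would run its inner
                         optimization here (assumption of this sketch).
    Returns (best state path, best final log score).
    """
    num_states = log_trans.shape[0]
    gamma = np.full((num_frames, num_states), -np.inf)   # Viterbi score matrix
    psi = np.zeros((num_frames, num_states), dtype=int)  # backtracking matrix
    for j in range(num_states):
        gamma[0, j] = log_init[j] + log_emission(0, j)
    for n in range(1, num_frames):
        for j in range(num_states):
            # Best predecessor i for state j at frame n.
            cand = gamma[n - 1] + log_trans[:, j]
            psi[n, j] = int(np.argmax(cand))
            gamma[n, j] = cand[psi[n, j]] + log_emission(n, j)
    # Backtrack from the best final state.
    path = [int(np.argmax(gamma[-1]))]
    for n in range(num_frames - 1, 0, -1):
        path.append(int(psi[n, path[-1]]))
    return path[::-1], float(np.max(gamma[-1]))
```

Passing the emission score as a callable keeps the recursion unchanged while the per-frame, per-state cost of the inner optimization stays visible: it is evaluated once for every `(n, j)` pair.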
Modeling Accuracy of REMOS
Example: digit “seven”
[Figure: four panels, each plotted over mel channel l: logmelspec of the clean utterance (vs. frame n), logmelspec of the reverberant utterance (vs. frame n), means of the clean logmelspec HMM (vs. state j), and logmelspec RIR representation (vs. frame delay τ).]
Modeling Accuracy of REMOS
Histograms
[Figure: two panels of estimated pdfs over x, one for state j = 1, channel l = 3 and one for state j = 5, channel l = 21. Each panel compares: histogram rev., output pdf clean HMM, prior hist. REMOS, posterior hist. REMOS.]
Auto-CoVariances (ACVs)
[Figure: two panels over frame τ and mel channel l: ACVs of reverberant speech, j = 9, and ACVs of posterior REMOS output, j = 9.]
Recognition Results [Sehr 2010c]
[Figure: bar chart of word accuracy in % (axis from 30 to 100) for room A, room B, and room C. Legend: clean HMM; clean HMM + MLLR; adaptation [Sehr 2009]; multi-style HMM; multi-style HMM + MLLR; matched HMM; REMOS.]
Setup
• Task: Connected digits (TI digits)
• Features: Logmelspec coefficients
• Recognizer: Word-level HMMs, 16 states/digit, 1 Gaussian/state
• Rooms:

| Room | T60 | DRR |
|------|--------|---------|
| A | 300 ms | 4.0 dB |
| B | 700 ms | −4.0 dB |
| C | 900 ms | −4.0 dB |
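The rooms above are characterized by T60 and by the direct-to-reverberant ratio (DRR). As a reminder of what the DRR measures, the Python sketch below computes it from a room impulse response as the energy ratio between the direct path and everything else. The windowing convention (a symmetric window of `direct_ms` around the strongest tap) and the function name `drr_db` are assumptions of this sketch; the slides do not state how the DRR values were obtained.

```python
import numpy as np


def drr_db(rir, fs, direct_ms=2.5):
    """Direct-to-reverberant ratio of a room impulse response, in dB.

    The direct path is taken as a window of +/- direct_ms around the
    strongest tap (an assumed convention); all remaining energy counts
    as reverberation. Assumes the reverberant energy is nonzero.
    """
    rir = np.asarray(rir, dtype=float)
    peak = int(np.argmax(np.abs(rir)))
    half = int(round(direct_ms * 1e-3 * fs))
    lo, hi = max(0, peak - half), min(len(rir), peak + half + 1)
    direct = np.sum(rir[lo:hi] ** 2)
    reverberant = np.sum(rir**2) - direct
    return 10.0 * np.log10(direct / reverberant)


# Toy impulse response: unit direct path followed by two weaker reflections.
print(drr_db([1.0, 0.0, 0.5, 0.5], fs=1000, direct_ms=0.0))
```

A positive DRR, as in room A, means the direct path carries more energy than the reverberation; a negative DRR, as in rooms B and C, means the reverberant energy dominates.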
REMOS [Sehr 2010c]
Discussion
+ Approach tailored to reverberant feature vector sequences
+ Long-term relations explicitly captured by reverberation model
+ Reverberation exploited for discrimination
+ Very promising results in logmelspec domain
− Inner optimization increases decoding complexity
− Implementation requires changes in decoding routines
Promising direction for future research
IV. Summary, Conclusions, and Outlook
Dereverberation for Signal Enhancement
State-of-the-art
• Close to 12 dB DRR gain with T60 ≈ 0.7 s (offline) with
  - 4 mics, d = 1.65 m, no noise (TRINICON, 2 sources)
  - 8 mics, d = 2 m, SNR = 10 dB (MCLP)
Challenges
• Larger distances, more reverberant rooms
• Robustness to speech-like interferers, nonstationary/diffuse noise, transient echo cancellation residuals
• Robust tracking of time-varying acoustics
• Low-latency (≪ 1 s) and efficient real-time implementations
• Joint optimization with spectral subtraction techniques
IV. Summary, Conclusions, and Outlook (cont’d)
Dereverberation as preprocessing for ASR
Example: 20k WSJ convolved with RIRs (T60 = 0.78 s, d = 2 m), NTT ASR

| WER [%] | Preproc. | Acoustic model |
|---------|------------|------------------------------------------------|
| 85.5 | none | clean speech |
| 43.4 | none | multi-condition training |
| 26.1 | 1-ch derev | multi-condition training |
| 14.2 | 2-ch derev | clean w/ unsuperv. speaker adaptation by MLLR |

Challenges for approaching close-talk performance
• Transition from reverberated signals to real recordings
• Self-adaptation to changing acoustics and front-ends, including
  - variable number and changing, unconstrained positions of talkers
  - different nodes in distributed microphone arrays
• Joint optimization with ASR methods to handle reverberation and noise
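To put the WER table above in perspective, the short Python sketch below computes the relative WER reduction between successive rows and from the first row to the last. The WER values are copied from the table; the helper name `relative_reduction` is hypothetical, and relative reduction is just one common way to compare such figures.

```python
def relative_reduction(baseline, improved):
    """Relative WER reduction in percent when going from baseline to improved."""
    return 100.0 * (baseline - improved) / baseline


# (WER in %, configuration) rows copied from the table.
rows = [
    (85.5, "no preprocessing, clean-speech model"),
    (43.4, "no preprocessing, multi-condition training"),
    (26.1, "1-ch derev, multi-condition training"),
    (14.2, "2-ch derev, clean model with MLLR adaptation"),
]

for (prev_wer, _), (wer, label) in zip(rows, rows[1:]):
    print(f"{label}: {relative_reduction(prev_wer, wer):.1f}% relative reduction")

print(f"overall: {relative_reduction(rows[0][0], rows[-1][0]):.1f}% relative reduction")
```

Each step removes roughly 40 to 50 percent of the remaining errors, and the full chain removes about 83 percent of the errors of the clean-speech baseline.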
Summary, Conclusions, and Outlook (cont’d)
Reverberation-specific ASR Techniques
State-of-the-art
• Feature-based techniques
  - account for the inter-frame relations caused by dispersion
  - efficiently exploit predictability of reverberation
• Model-based techniques could not yet show their full potential, as framewise adaptation and optimization are computationally complex
• Decoder-based techniques compromise between the above regarding complexity
Outlook
• Integration into state-of-the-art ASR systems
  - expected soon for signal enhancement- and feature-based methods
  - model-based methods must become more efficient for widespread use
Concluding remarks
Dereverberation, the ’Holy Grail’ of Acoustic Signal Processing?
Blind deconvolution of the acoustic paths seems to come closer
Less ambitious algorithms are also effective and their progress follows the typical DSP objectives
increase algorithmic performance and robustness
reduce computational load
integrate with other functionalities
As a follow-up to the CHIME Challenge 2011
⇒ Next Challenge for Reverberation-robust Speech Processing is underway!
Acknowledgements
We are especially grateful to
Dr. Keisuke Kinoshita, Dr. Marc Delcroix, Dr. Shoko Araki, Dr. Mehrez Souden, and Dr. Takaaki Hori (NTT)
Dr. Herbert Buchner, Edwin Mabande and Lutz Marquardt (formerly LMS)
Roland Maas and Christian Hofmann (LMS)
for their contributions to the course material and wish to acknowledge the support of parts of the LMS by
Deutsche Forschungsgemeinschaft (DFG) under contract number KE890/4-1
Thank you for your kind attention.