analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · motivation:...

95
Analysis-by-synthesis for source separation and speech recognition Michael I Mandel [email protected] Brooklyn College (CUNY) Joint work with Young Suk Cho and Arun Narayanan (Ohio State) Columbia Neural Network Seminar Series September 8, 2015 Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 1 / 73

Upload: others

Post on 25-Jun-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Analysis-by-synthesisfor source separation and speech recognition

Michael I [email protected]

Brooklyn College (CUNY)

Joint work with Young Suk Cho and Arun Narayanan (Ohio State)

Columbia Neural Network Seminar SeriesSeptember 8, 2015

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 1 / 73

Page 2: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Outline

1 Motivation: need for noise robustness

2 Non-parametric synthesis for speech enhancement

3 Parametric synthesis for speech recognition

4 Summary

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 2 / 73

Page 3: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Motivation: need for noise robustness

Outline

1 Motivation: need for noise robustnessNeed for better mobile voice qualityNeed for noise robust automatic speech recognition (ASR)Main challenge

2 Non-parametric synthesis for speech enhancement

3 Parametric synthesis for speech recognition

4 Summary

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 3 / 73

Page 4: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Motivation: need for noise robustness Need for better mobile voice quality

Outline

1 Motivation: need for noise robustnessNeed for better mobile voice qualityNeed for noise robust automatic speech recognition (ASR)Main challenge

2 Non-parametric synthesis for speech enhancement

3 Parametric synthesis for speech recognition

4 Summary

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 4 / 73

Page 5: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Motivation: need for noise robustness Need for better mobile voice quality

Need for better mobile voice quality

There are now more mobile devices than humans on earth1

But recording conditions for these devices leave much to be desired

Can we recover high quality speech from noisy & degraded recordings?

1http://www.independent.co.uk/life-style/gadgets-and-tech/news/

there-are-officially-more-mobile-devices-than-people-in-the-world-9780518.html

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 5 / 73

Page 6: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Motivation: need for noise robustness Need for better mobile voice quality

Why mobile voice quality stinks2

2Je↵ Hecht. Why mobile voice quality still stinks—and how to fix it. IEEE Spectrum, September 2014

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 6 / 73

Page 7: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Motivation: need for noise robustness Need for better mobile voice quality

Why mobile voice quality stinks2

2Je↵ Hecht. Why mobile voice quality still stinks—and how to fix it. IEEE Spectrum, September 2014

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 6 / 73

Page 8: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Motivation: need for noise robustness Need for noise robust automatic speech recognition (ASR)

Outline

1 Motivation: need for noise robustnessNeed for better mobile voice qualityNeed for noise robust automatic speech recognition (ASR)Main challenge

2 Non-parametric synthesis for speech enhancement

3 Parametric synthesis for speech recognition

4 Summary

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 7 / 73

Page 9: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Motivation: need for noise robustness Need for noise robust automatic speech recognition (ASR)

Conversational mobile software agents

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 8 / 73

Source: Tom Vanleenhove

Page 10: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Motivation: need for noise robustness Need for noise robust automatic speech recognition (ASR)

Conversational mobile software agents need to work in

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 9 / 73

Source: Flickr user rickihuang

Page 11: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Motivation: need for noise robustness Need for noise robust automatic speech recognition (ASR)

Conversational mobile software agents need to work in

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 9 / 73

Source: Flickr user retorta net

Page 12: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Motivation: need for noise robustness Need for noise robust automatic speech recognition (ASR)

Conversational mobile software agents need to work in

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 9 / 73

Source: Flickr user Brian Indrelunas

Page 13: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Motivation: need for noise robustness Need for noise robust automatic speech recognition (ASR)

But automatic speech recognition doesn’t work there3

3Amit Juneja. A comparison of automatic and human speech recognition in null grammar. The Journal of the Acoustical

Society of America, 131(3):EL256–EL261, February 2012

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 10 / 73

Page 14: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Motivation: need for noise robustness Main challenge

Outline

1 Motivation: need for noise robustnessNeed for better mobile voice qualityNeed for noise robust automatic speech recognition (ASR)Main challenge

2 Non-parametric synthesis for speech enhancement

3 Parametric synthesis for speech recognition

4 Summary

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 11 / 73

Page 15: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Motivation: need for noise robustness Main challenge

Main challenge

Speech is a rich signal, it requires rich models

Synthesis models are rich enough to represent almost all speech

Non-parametric synthesis models for high qualityDNN as non-linear distance function

Parametric synthesis models for e�cient representatione�cient gradient-based optimization of input (not model)

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 12 / 73

Page 16: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Motivation: need for noise robustness Main challenge

Main challenge

Speech is a rich signal, it requires rich models

Synthesis models are rich enough to represent almost all speech

Non-parametric synthesis models for high qualityDNN as non-linear distance function

Parametric synthesis models for e�cient representatione�cient gradient-based optimization of input (not model)

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 12 / 73

Page 17: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement

Outline

1 Motivation: need for noise robustness

2 Non-parametric synthesis for speech enhancementOverviewDeep neural network as nonlinear distance functionUsing this DNN for speech enhancementNoise suppression experimentsAudio super-resolution experimentsSummary

3 Parametric synthesis for speech recognition

4 Summary

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 13 / 73

Page 18: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Overview

Outline

1 Motivation: need for noise robustness

2 Non-parametric synthesis for speech enhancementOverviewDeep neural network as nonlinear distance functionUsing this DNN for speech enhancementNoise suppression experimentsAudio super-resolution experimentsSummary

3 Parametric synthesis for speech recognition

4 Summary

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 14 / 73

Page 19: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Overview

Concatenative resynthesis for speech enhancement4,5

Standard approaches try to modify noisy recordings

We instead resynthesize a clean version of the same speech

Should produce infinite suppression and high speech quality

4Michael I Mandel, Young-Suk Cho, and Yuxuan Wang. Learning a concatenative resynthesis system for noise suppression.

In Proc. IEEE GlobalSIP, 20145Michael I Mandel and Young Suk Cho. Audio super-resolution using concatenative resynthesis. In Proc. IEEE WASPAA,

2015. To appear

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 15 / 73

Page 20: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Overview

Motivating example

Your phone records your voice in quiet, close-talk conditions

Uses those recordings to replace your voice in noisy, far-talk conditions

Resynthesizes your speech from previous high-quality recordings

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 16 / 73

Page 21: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Overview

Concatenative resynthesis

Use a large dictionary of ⇠200 ms “chunks” of audio

Learn DNN-based a�nity between dictionary & mixture chunks

Perform concatenative synthesis of signal from dictionary

General robust supervised nonlinear signal mapping framework

Task Map from To

Noise suppression Noisy CleanAudio super-resolution Reverberated, compressed Clean

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 17 / 73

Page 22: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Deep neural network as nonlinear distance function

Outline

1 Motivation: need for noise robustness

2 Non-parametric synthesis for speech enhancementOverviewDeep neural network as nonlinear distance functionUsing this DNN for speech enhancementNoise suppression experimentsAudio super-resolution experimentsSummary

3 Parametric synthesis for speech recognition

4 Summary

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 18 / 73

Page 23: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Deep neural network as nonlinear distance function

Deep neural network as nonlinear distance function6

Generative Discriminative Dictionary-based

Data-intensive training Moderate training data Data-e�cient trainingHard to adapt Hard to adapt Very adaptable

6Michael I Mandel, Young-Suk Cho, and Yuxuan Wang. Learning a concatenative resynthesis system for noise suppression.

In Proc. IEEE GlobalSIP, 2014

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 19 / 73

Page 24: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Deep neural network as nonlinear distance function

Train DNN on correctly and incorrectly paired chunks

Noise suppression

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 20 / 73

Page 25: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Deep neural network as nonlinear distance function

Train DNN on correctly and incorrectly paired chunks

Audio super-resolution

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 21 / 73

Page 26: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Using this DNN for speech enhancement

Outline

1 Motivation: need for noise robustness

2 Non-parametric synthesis for speech enhancementOverviewDeep neural network as nonlinear distance functionUsing this DNN for speech enhancementNoise suppression experimentsAudio super-resolution experimentsSummary

3 Parametric synthesis for speech recognition

4 Summary

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 22 / 73

Page 27: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Using this DNN for speech enhancement

Find optimal sequence of clean chunks

x = {xt}Tt=0 input sequence of noisy chunks

z = {zt}Tt=0 best sequence of corresponding dictionary chunks

A�nity between clean and noisy chunks

Transition a�nity between clean chunks

z = argmaxz

Y

t

p(zt = j | xt) p(zt = j | zt�1 = i)

= argmaxz

Y

i

g(zj , xi ) Tij

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 23 / 73

Page 28: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Using this DNN for speech enhancement

Find optimal sequence of clean chunks

x = {xt}Tt=0 input sequence of noisy chunks

z = {zt}Tt=0 best sequence of corresponding dictionary chunks

A�nity between clean and noisy chunks

Transition a�nity between clean chunks

z = argmaxz

Y

t

p(zt = j | xt) p(zt = j | zt�1 = i)

= argmaxz

Y

i

g(zj , xi ) Tij

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 23 / 73

Page 29: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Using this DNN for speech enhancement

Find optimal sequence of clean chunks

x = {xt}Tt=0 input sequence of noisy chunks

z = {zt}Tt=0 best sequence of corresponding dictionary chunks

A�nity between clean and noisy chunks

Transition a�nity between clean chunks

z = argmaxz

Y

t

p(zt = j | xt) p(zt = j | zt�1 = i)

= argmaxz

Y

i

g(zj , xi ) Tij

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 23 / 73

Page 30: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Using this DNN for speech enhancement

Compare all pairs of noisy and clean chunks

D1D2D3...

M1M1M1...

DNN

Observed mixture

Clean dictionary

Similarity

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 24 / 73

Page 31: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Using this DNN for speech enhancement

Compare all pairs of noisy and clean chunks

D1D2D3...

M2M2M2...

DNN

Observed mixture

Clean dictionary

Similarity

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 24 / 73

Page 32: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Using this DNN for speech enhancement

Compare all pairs of noisy and clean chunks

D1D2D3...

M3M3M3...

DNN

Observed mixture

Clean dictionary

Similarity

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 24 / 73

Page 33: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Using this DNN for speech enhancement

Compare all pairs of noisy and clean chunks

D1D2D3...

M4M4M4...

DNN

Observed mixture

Clean dictionary

Similarity

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 24 / 73

Page 34: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Using this DNN for speech enhancement

Compare all pairs of noisy and clean chunks

D1D2D3...

M5M5M5...

DNN

Observed mixture

Clean dictionary

Similarity

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 24 / 73

Page 35: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Using this DNN for speech enhancement

Compare all pairs of noisy and clean chunks

D1D2D3...

MNMNMN...

DNN

Observed mixture

Clean dictionary

Similarity

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 24 / 73

Page 36: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Using this DNN for speech enhancement

Standard Viterbi algorithm for to find optimal sequence

D1D2D3...

MNMNMN...

DNN

Observed mixture

Clean dictionary

Similarity

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 25 / 73

Page 37: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Using this DNN for speech enhancement

Standard Viterbi algorithm for to find optimal sequence

D1D2D3...

MNMNMN...

DNN

Observed mixture

Clean dictionary

Similarity

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 25 / 73

Page 38: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Noise suppression experiments

Outline

1 Motivation: need for noise robustness

2 Non-parametric synthesis for speech enhancementOverviewDeep neural network as nonlinear distance functionUsing this DNN for speech enhancementNoise suppression experimentsAudio super-resolution experimentsSummary

3 Parametric synthesis for speech recognition

4 Summary

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 26 / 73

Page 39: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Noise suppression experiments

Original “clean” speech

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 27 / 73

Page 40: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Noise suppression experiments

Noisy speech

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 28 / 73

Page 41: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Noise suppression experiments

Traditional mask-based separation

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 29 / 73

Page 42: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Noise suppression experiments

Concatenative resynthesis output

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 30 / 73

Page 43: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Noise suppression experiments

Original “clean” speech

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 31 / 73

Page 44: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Noise suppression experiments

Subjective quality is high

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 32 / 73

Page 45: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Noise suppression experiments

Subjective quality is high

20 40 60 80 100

Noisy

IRM NN

Concat

Clean

Quality (higher better)

SpeechNoise SupOverall

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 32 / 73

Page 46: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Noise suppression experiments

Subjective intelligibility is ok

60 70 80 90 100

Noisy

IRM NN

Concat

Clean

Words correctly identified (%)

Keywords

All words

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 33 / 73

Page 47: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Audio super-resolution experiments

Outline

1 Motivation: need for noise robustness

2 Non-parametric synthesis for speech enhancementOverviewDeep neural network as nonlinear distance functionUsing this DNN for speech enhancementNoise suppression experimentsAudio super-resolution experimentsSummary

3 Parametric synthesis for speech recognition

4 Summary

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 34 / 73

Page 48: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Audio super-resolution experiments

Original clean speech

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 35 / 73

Page 49: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Audio super-resolution experiments

Reverberated, compressed, 20% packet loss

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 36 / 73

Page 50: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Audio super-resolution experiments

NMF-based bandwidth expansion output

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 37 / 73

Page 51: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Audio super-resolution experiments

Concatenative resynthesis output

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 38 / 73

Page 52: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Audio super-resolution experiments

Original clean speech

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 39 / 73

Page 53: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Audio super-resolution experiments

Subjective quality is high

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 40 / 73

Page 54: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Audio super-resolution experiments

Subjective quality is high

0

10

20

30

40

50

60

70

80

90

100

MU

SH

RA

Sco

re

CleanClean (hid)RevRev 8kHzRevOpusL20RevOpusL20 (hid)

CleanAmr RevAmr RevOpusL20

InputNMFConcat

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 40 / 73

Page 55: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Audio super-resolution experiments

Subjective intelligibility is good

70

75

80

85

90

95

100

Corr

ect

word

s (%

)

CleanRevRev 8kHzRevOpusL20

CleanAmr RevAmr RevOpusL20

InputNMFConcat

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 41 / 73

Page 56: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Summary

Outline

1 Motivation: need for noise robustness

2 Non-parametric synthesis for speech enhancementOverviewDeep neural network as nonlinear distance functionUsing this DNN for speech enhancementNoise suppression experimentsAudio super-resolution experimentsSummary

3 Parametric synthesis for speech recognition

4 Summary

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 42 / 73

Page 57: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Summary

Summary

Concatenative synthesizer, DNN as noise-robust selection function

Instead of modifying noisy speech, replace itcompletely eliminates noise, except for synthesis errorsproduces high quality, natural-sounding speech

General robust supervised nonlinear signal mapping framework

Data-e�cient to train and adaptable to new talkers

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 43 / 73

Page 58: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Non-parametric synthesis for speech enhancement Summary

Future applications

Generalize to audio-visual speech recognition

Label dictionary elements ahead of time to enablenoise-robust non-parametric speech recognitionnoise-robust pitch trackingnoise-robust speaker identification

Incorporate language model into transition cost

Develop e�cient search mechanisms for large-vocabulary dictionaries

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 44 / 73

Page 59: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Parametric synthesis for speech recognition

Outline

1 Motivation: need for noise robustness

2 Non-parametric synthesis for speech enhancement

3 Parametric synthesis for speech recognitionOverviewAlgorithmResultsSummary

4 Summary

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 45 / 73

Page 60: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Parametric synthesis for speech recognition Overview

Outline

1 Motivation: need for noise robustness

2 Non-parametric synthesis for speech enhancement

3 Parametric synthesis for speech recognitionOverviewAlgorithmResultsSummary

4 Summary

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 46 / 73

Page 61: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Parametric synthesis for speech recognition Overview

Mask-based source separation: Noisy

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 47 / 73

Page 62: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Parametric synthesis for speech recognition Overview

Mask-based source separation: Masked

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 48 / 73

Page 63: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Parametric synthesis for speech recognition Overview

Disrupts speech features: Noisy MFCCs

“He said such products would be marketed by othercompanies with experience him at this month.”

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 49 / 73

Page 64: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Parametric synthesis for speech recognition Overview

Disrupts speech features: Masked MFCCs

“He said such products would be marketed by othercompanies with experience him at this month.”

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 50 / 73

Page 65: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Parametric synthesis for speech recognition Overview

Disrupts speech features: Clean MFCCs

“He said such products would be marketed by othercompanies with experience in that business.”

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 51 / 73

Page 66: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Parametric synthesis for speech recognition Overview

Estimate better features using a strong prior model

“He said such products would be marketed by othercompanies with experience in that business.”

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 52 / 73

Page 67: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Parametric synthesis for speech recognition Overview

Our approach: Analysis-by-synthesis

Synthesize speech signal so that itlooks like the observationlooks like speech

Itakura-Saito divergence compares prediction with noisy observation

Recognizer gives likelihood of speech-ness

Both easy to optimize using gradient descent

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 53 / 73

Page 68: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Parametric synthesis for speech recognition Overview

Speech recognizer includes lots of information

Large vocabulary continuous speech recognizer captures:

Acoustics of speech sounds

The e↵ect of neighboring speech sounds

Pronunciation of words

Order of words

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 54 / 73

Page 69: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Parametric synthesis for speech recognition Algorithm

Outline

1 Motivation: need for noise robustness

2 Non-parametric synthesis for speech enhancement

3 Parametric synthesis for speech recognitionOverviewAlgorithmResultsSummary

4 Summary

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 55 / 73

Page 70: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Parametric synthesis for speech recognition Algorithm

Optimization over speech features

x: optimization state: MFCCs, ⇠10,000 dimensions

y(x): ASR features derived from x

M: mask provided a priori by another source separator

minx

L(x;M) = minx

n

(1� ↵) LI (x;M) + ↵ LH(y(x))o

Total cost

Distance to noisy observation

Negative log likelihood under recognizer

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 56 / 73

Page 71: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Parametric synthesis for speech recognition Algorithm

Optimization over speech features

x: optimization state: MFCCs, ⇠10,000 dimensions

y(x): ASR features derived from x

M: mask provided a priori by another source separator

minx

L(x;M) = minx

n

(1� ↵) LI (x;M) + ↵ LH(y(x))o

Total cost

Distance to noisy observation

Negative log likelihood under recognizer

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 56 / 73

Page 72: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Parametric synthesis for speech recognition Algorithm

Optimization over speech features

x: optimization state: MFCCs, ⇠10,000 dimensions

y(x): ASR features derived from x

M: mask provided a priori by another source separator

minx

L(x;M) = minx

n

(1� ↵) LI (x;M) + ↵ LH(y(x))o

Total cost

Distance to noisy observation

Negative log likelihood under recognizer

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 56 / 73

Page 73: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Parametric synthesis for speech recognition Algorithm

Analysis of audio meets resynthesis of MFCCs at mask

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 57 / 73

Page 74: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Parametric synthesis for speech recognition Algorithm

LI (x;M): Distance to noisy observation

Resynthesize MFCCs to power spectrum, where mask was computed

Do mask-aware comparison in that domain: weighted Itakura-Saitobetween resynthesis, S!t(x), and noisy observation, Sweighted by mask, M

LI (x;M) = DM(S k S) =X

!,t

M!t

S!t

S!t(x)� log

S!t

S!t(x)� 1

Does not require modeling speech excitation

Numerically di↵erentiable with respect to x

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 58 / 73

Page 75: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Parametric synthesis for speech recognition Algorithm

LH(y(x)): Likelihood under recognizer

Large vocabulary continuous speech recognizerbig hidden Markov model (HMM)approximated by the lattice of likely paths

Closed form gradient with respect to x

Serves as a model of clean MFCC sequences

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 59 / 73

Page 76: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Parametric synthesis for speech recognition Algorithm

LH(y(x)): Likelihood under recognizer

Large vocabulary continuous speech recognizerbig hidden Markov model (HMM)approximated by the lattice of likely paths

Closed form gradient with respect to x

Serves as a model of clean MFCC sequences

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 59 / 73

Page 77: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Parametric synthesis for speech recognition Algorithm

Optimization

State space of approximately 13⇥ 800 ⇡10,000 dimensions

Quasi-Newton optimization, BFGSgradient plus approximate second-order information

Closed form gradient of HMM likelihoodusing a forward-backward algorithm

Numerical gradient of IS divergenceindependent costs and gradients for each frame

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 60 / 73

Page 78: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Parametric synthesis for speech recognition Results

Outline

1 Motivation: need for noise robustness

2 Non-parametric synthesis for speech enhancement

3 Parametric synthesis for speech recognitionOverviewAlgorithmResultsSummary

4 Summary

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 61 / 73

Page 79: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Parametric synthesis for speech recognition Results

Experiment

AURORA4 corpusread Wall Street Journal sentences (5000 word vocabulary)six environmental noise typesSNRs between 5 and 15 dB

Masks from ideal binary mask and estimated ratio mask7

7Arun Narayanan and DeLiang Wang. Ideal ratio mask estimation using deep neural networks for robust speech

recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 7092–7096.IEEE, May 2013

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 62 / 73

Page 80: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Parametric synthesis for speech recognition Results

Recognition results

Word error rate (%) averaged across noise type

Mask Direct A-by-S

Noisy 30.94Estimated 16.18 15.31Oracle 14.38 13.62Clean 9.54

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 63 / 73

Page 81: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Parametric synthesis for speech recognition Results

Reconstruction results

Itakura-Saito divergence between resynthesized speech and original

Mask Direct A-by-S �

Noisy 272301Estimated 276497 275224 �1273Oracle 273006 272506 �500

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 64 / 73

Page 82: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Parametric synthesis for speech recognition Results

Resynthesis gets closer to reliable regions

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 65 / 73

Page 83: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Parametric synthesis for speech recognition Results

Resynthesis gets closer to reliable regions

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 66 / 73

Page 84: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Parametric synthesis for speech recognition Results

Resynthesis gets closer to reliable regions

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 67 / 73

Page 85: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Parametric synthesis for speech recognition Results

Resynthesis gets closer to reliable regions

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 68 / 73

Page 86: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Parametric synthesis for speech recognition Summary

Outline

1 Motivation: need for noise robustness

2 Non-parametric synthesis for speech enhancement

3 Parametric synthesis for speech recognitionOverviewAlgorithmResultsSummary

4 Summary

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 69 / 73

Page 87: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Parametric synthesis for speech recognition Summary

Summary

Use a full recognizer as a prior model for clean speech

Synthesize from MFCCs to the domain of the mask

Adjust synthesis of speech signal so that itlooks like the observationlooks like speech

Reduces recognition errors, distance to clean utterance

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 70 / 73

Page 88: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Parametric synthesis for speech recognition Summary

Future directions

Apply to DNN-based acoustic models

Model speech excitation for full resynthesis of clean speech

Model multiple simultaneous speakers and estimate masks jointly

Combine with similar binaural model to include spatial clustering

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 71 / 73

Page 89: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Summary

Outline

1 Motivation: need for noise robustness

2 Non-parametric synthesis for speech enhancement

3 Parametric synthesis for speech recognition

4 Summary

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 72 / 73

Page 90: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Summary

Summary

Synthesizers provide strong prior information

Non-parametric synthesis models for high qualitylearned nonlinear matching function for perceptually motivated features

Parametric synthesis models for e�cient representationstrong, di↵erentiable prior model of speech

Thanks!

Any questions?

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 73 / 73

Page 91: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Summary

Summary

Synthesizers provide strong prior information

Non-parametric synthesis models for high qualitylearned nonlinear matching function for perceptually motivated features

Parametric synthesis models for e�cient representationstrong, di↵erentiable prior model of speech

Thanks!

Any questions?

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 73 / 73

Page 92: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Summary

Summary

Synthesizers provide strong prior information

Non-parametric synthesis models for high qualitylearned nonlinear matching function for perceptually motivated features

Parametric synthesis models for e�cient representationstrong, di↵erentiable prior model of speech

Thanks!

Any questions?

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 73 / 73

Page 93: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Parametric synthesis for separation

Outline

5 Parametric synthesis for separation

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 1 / 3

Page 94: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Parametric synthesis for separation

Re-estimate mask using resynthesis: Original

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 2 / 3

Page 95: Analysis-by-synthesis for source separation and speech recognition · 2015-09-10 · Motivation: need for noise robustness Outline 1 Motivation: need for noise robustness Need for

Parametric synthesis for separation

Re-estimate mask using resynthesis: Re-estimate

Michael Mandel (Brooklyn College) Analysis-by-synthesis Sept 8, 2015 3 / 3