complex-valued linear layers for deep neural network-based ... · 4/8/2016  · complex-valued...

54
Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Izhak Shafran Joint work with: Ehsan Variani, Arun Narayanan, and Tara Sainath Google Inc

Upload: others

Post on 14-Oct-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Complex-valued Linear Layers for Deep NeuralNetwork-based Acoustic Models for Speech

Recognition

Izhak Shafran

Joint work with: Ehsan Variani, Arun Narayanan, and Tara Sainath

Google Inc

Page 2: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Outline

BackgroundAutomatic Speech RecognitionDeep Neural Networks

Problem DefinitionMotivation: Replace Hand-Crafted Feature ExtractionProblem DefinitionOne Approach: Learnable Time Domain Convolutional Filters

Learnable Complex-valued Frequency Domain FiltersProperties of the ModelEmpirical Evaluations

Summary & Discussion

Page 3: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Automatic Speech Recognition

Page 4: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Automatic Speech Recognition: Language ModelsScores likelihood of utterances in a given language

P(”call home”) > P(”call roam”)

P(w1w2) = P(w1|begin)P(w2|w1)P(end |w1w2)

P(”call home”) = P(”call”|begin) . . .

Page 5: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Automatic Speech Recognition: Feature ExtractionRemove redundancy and irrelevant information

Page 6: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Automatic Speech Recognition: Feature ExtractionRemove redundancy and irrelevant information

Page 7: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Automatic Speech Recognition: Acoustic ModelsScores the conditional likelihood that the hypothesized words were spoken

Page 8: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Acoustic Models Illustration: Vowel classification task

Consider the (mag) FFT of a frame of (vowel) speech

Notice energy peaks (formants)Turns out, their positions are different for different vowels

Page 9: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Acoustic Models Illustration: Vowel classification taskValues of (F1,F2) for vowels in a training set

Page 10: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Acoustic Models Illustration: Vowel classification taskScore the conditional likelihood that a hypothesized vowel was observed

P(X |æ) > P(X |ε)

Page 11: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Automatic Speech Recognition: Decoder

return ”call home” if

P(X |”call home”)P(”call home”) > P(X |”call roam”)P(”call roam”)

Mathematical justification,

W = argmaxW

P(W |X )

= argmaxW

P(X |W )P(W )

P(X )

= argmaxW

P(X |W )P(W )

Page 12: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

(Deep) Neural NetworksIllustration with a simple feed forward network

Page 13: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

A Node in a Neural Network

Page 14: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Neural NetworksIllustration of Forward Propagation

Page 15: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Neural Networks: LearningIllustration of Back Propagation Algorithm

home.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html

Multistyle Training (MTR) improves robustness

Page 16: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Deep Neural Networks: Why are they so hot now?

I Universal approximatorSuperposition Theorem of Kolmogorov (1957)A Computable Kolmogorov S. T. (Brattka, 2000)Neural Network with Unbounded Activation Functions is aUniversal Approximator (Sonoda & Murata, 2015)

I Back propagation algorithm(Linnainmaa, 1970; Rummelhart et al, 1986)

I Efficient BackProp (LeCun 1998),Fast Learning Algorithms for Deep Belief Net (Hinton, 2006)

Page 17: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Outline

BackgroundAutomatic Speech RecognitionDeep Neural Networks

Problem DefinitionMotivation: Replace Hand-Crafted Feature ExtractionProblem DefinitionOne Approach: Learnable Time Domain Convolutional Filters

Learnable Complex-valued Frequency Domain FiltersProperties of the ModelEmpirical Evaluations

Summary & Discussion

Page 18: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Motivation

In most ASR systems, features extraction:

I uses 3 decades old hand-crafted rules

I mimics human perception

I may not be optimal for speech recognition

Since neural networks are universal approximators,why not let the network learn the optimal features too !?

Page 19: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Recall: Feature Extraction

I Most problematic stage is Mel-filterbank

I Loss of information

I Log-mel features: useless for dereverberation or beamforming,precludes integration into neural networks

Page 20: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Problem

Can we feed the speech frames directly as input to the neuralnetworks and learn the features jointly with acoustic models?

Page 21: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

One Approach: Learnable Time Domain Filters (Hoshen et al, 2015)

I Skip transforming the signal into frequency domain

I Replace mel-filters with a convolutional neural network layer

I Learn the parameters of the layer (filter) from the data

I Implementation inspired by gammatone features

Page 22: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Gammatone Features

All the above operations can be approximated in the neural network

Page 23: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Learnable Time Domain Filters (Hoshen et al, 2015)

Page 24: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Learned Time Domain Filters

Page 25: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Learnable Time Domain Filters (Hoshen et al, 2015)

But:

I convolution layer has several parameters to tune

I computationally expensive

Page 26: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Convolution

Page 27: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Convolutional Layer

I Convolution layers use only the ”valid” outputs

I Trade-off: Longer filters =⇒ fewer valid outputs

Page 28: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Time Domain Filters: Filter Length Dependencies

I For 64ms frame, there are still more settings to sweep!

I Typically, max pooling works better than average pooling

I Info loss, cannot cascade filters in the signal processing sense

Page 29: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Outline

BackgroundAutomatic Speech RecognitionDeep Neural Networks

Problem DefinitionMotivation: Replace Hand-Crafted Feature ExtractionProblem DefinitionOne Approach: Learnable Time Domain Convolutional Filters

Learnable Complex-valued Frequency Domain FiltersProperties of the ModelEmpirical Evaluations

Summary & Discussion

Page 30: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Convolution in Time ≡ Multiplication in Frequency Domain

Page 31: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Multiplication in Frequency Doman: Complex-Valued

X[k] H[k] Y[k]

I X[k] is complex-valued

I H[k] is complex-valued

I Need a linear layer with complex-valued neurons!

Page 32: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Complex-valued Neuron

Let X := Xr + jXi , W := Wr + jWi , B := Br + jBi

Z = g(WX − B)

Zr = g(WrXr −WiXi − Br )

Zi = g(WiXr + WrXi − Bi )

I Easily implemented in a neural net w/ shared weights

I How is this different from using g([WrWi ][XrXi ]T − B)?

Page 33: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Decision Boundaries are Orthogonal!

Easy to see from the norm to the boundary (Tohru Nitta, 2004)

Zr = g(WrXr −WiXi − Br )

Zi = g(WiXr + WrXi − Bi )

Qr =

[∂Zr

∂Xr

∂Zr

∂Xi

]=

[g ′(. . .) g ′(. . .)

] [ Wr

−Wi

]Qi =

[g ′(. . .) g ′(. . .)

] [ Wi

Wr

]QT

r Qi = 0 =⇒ Qr ⊥ Qi

Page 34: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Complex FFT Input & Complex-valued Linear Layers: The Appeal

I Orthogonal boundaries =⇒ higher modeling capacitye.g., XOR with one neuron!

I Gradient descent is faster

I Does not have trade-offs of convolutional layers

I L1 regularization is meaningful, unlike in time domain

I No max-pooling, allows cascading of filterse.g., Dereverberation + Beamforming + Feature extraction

I Possibility to learn hybrid filters that optimize both signalprocessing cost function and WER

Page 35: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Complex-valued Linear Layers: Single Filter

Page 36: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Complex-valued Linear Layers: Set of Filters

Page 37: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

A model is a lie that helps you see the truth.Howard Skipper

Page 38: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Summation in the Frequency Domain

LemmaSummation in the frequency domain ≡ weighted average poolingin the time domain.If X is the Fourier transform of the 2N point signal x , then

N∑k=0

Xk =2N−1∑n=0

αnxn

where,

αn =

N + 1 n = 0

coth(j πi2N

)mod (n, 2) = 1

1 mod (n, 2) = 0, n 6= 0

Page 39: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Complex Linear Projection

Proposition

The projection in the frequency domain is equivalent toconvolution followed by a weighted average pooling:

N∑j=0

WijXj ←→2N−1∑j=0

αj (wi ∗ x) [j ]

Page 40: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Complex Linear Projection (CLP)

Page 41: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

ASR Experimental Setup

I Training set: 2000 hrs, 3M anonymized voice search queriesI SNR between 0-20dB (≈12 dB)I Reverb T60 between 400ms-900ms (≈ 600ms)I 8 channel linear mic with spacing of 2cmI Noise and target speaker location constant per utterance

I Acoustic modelI A cascade of 3 LSTMs followed by a DNN layerI 13K HMM statesI Learning with asynchronized stochastic gradient descent

I Test set: Voice search w/ matching conditions

I All results reported here are after CE training

Page 42: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Experimental Settings & WER Results

I 25 ms, 32 ms, 64 ms

I 128 filters

System 25 ms 32 ms 64 ms

Log-mel 23.4 22.8 21.8Time-domain conv. 23.7 23.4 22.5CLP 23.2 22.8 22.0

I CLP is at least as good as the alternatives!

Page 43: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Learned CLP Filters: Before and After Applying L1 Regularization

Page 44: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Learned Filters: Histogram of Learned Filter Coefficients

Time-domain filter coefficients (in time & in freq)

CLF filter coefficients (w/o & w/ L1)

CLP filters are sparse!

Page 45: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Complex Linear Projection: Two Channels

Page 46: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Experimental Settings & WER Results

I Two channels from microphones, 14cm apart

I 25 ms, 32 ms, 64 ms

I 256 filters

System 25 ms 32 ms 64 ms

Log-mel 21.8 21.3 20.7Time-domain conv. 21.5 21.2 21.2CLP 21.5 20.9 20.5

I Again, CLP is at least as good as the other alternatives!

Page 47: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Multichannel CLP: Frame Size vs Projection Size

I Frame size: 32ms, 64ms

I Projection size: 128, 256, 512

I Without L1 regularization and stop bands

128 256 512

32ms 21.7 21.4 21.764ms 21.1 21.1 21.5

I Higher projections size =⇒ larger input to LSTM, slowerStep time is about 1.5 × slower going from 256 to 512

I Frame size has no impact for 32ms and 64ms

I Step time is about 2 × slower going from 1 to 8 channels

Page 48: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Learned CLP Filters: Two Channel

Page 49: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Learned CLP Filters: Time-delay between Real and Imaginary

Filter number vs time delay

Page 50: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Histogram of Time-delay between Real and Imaginary

Time delay vs frequency of occurrence

Page 51: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Language and Task Dependency

I English high-pitch voice search (en-us-hp)

I Taiwanese voice search (cmn-hant-tw)

System en-us-hp cmn-hant-tw

Log-mel 16.6 17.2Time-domain conv. 16.3 16.8CLP 16.4 16.6

I CLP as good or better than the alternatives!

Page 52: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Learned Filters: Sorted by Center Frequency

Page 53: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Computational Efficiency

I Time-domain filters ≈ O(Nkp + k2p))k strides, 2N window size, p filters

I CLP ≈ O(pN)

Model Time Filters CLP

num-params 45K 66Knum-add-mult 14.51 M 263.17 K

I Computation cost is a magnitude lower!

Page 54: Complex-valued Linear Layers for Deep Neural Network-based ... · 4/8/2016  · Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition Author:

Summary & Future Directions

I Introduced complex-valued linear layers for acoustic modeling

I Complex-valued linear projection (CLP) as one instance of it

I Showed a simple multi-channel extension for CLP

I Reported experimental comparisons with log-mel andtime-domain filters

I Demonstrated task-dependent advantage

I Discussed properties of the CLP model

I Future directions:More efficient multi-channel projections, hybrid systems withWER and beamforming-type cost functions, adding multiplecomplex layers with appropriate complex-valued activationfunctions, integrating de-reverberation, . . .