
Complex-valued Linear Layers for Deep Neural Network-based Acoustic Models for Speech Recognition

Izhak Shafran

Joint work with: Ehsan Variani, Arun Narayanan, and Tara Sainath

Google Inc

Outline

Background
  Automatic Speech Recognition
  Deep Neural Networks

Problem Definition
  Motivation: Replace Hand-Crafted Feature Extraction
  Problem Definition
  One Approach: Learnable Time Domain Convolutional Filters

Learnable Complex-valued Frequency Domain Filters
  Properties of the Model
  Empirical Evaluations

Summary & Discussion

Automatic Speech Recognition

Automatic Speech Recognition: Language Models
Scores the likelihood of utterances in a given language

P("call home") > P("call roam")

P(w1 w2) = P(w1 | begin) P(w2 | w1) P(end | w1 w2)

P("call home") = P("call" | begin) . . .
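As a toy illustration of this factorization, here is a minimal Python sketch; the bigram probabilities are made up for illustration, and conditioning the end-of-utterance token on only the previous word is a simplification of the formula above.

```python
# Hypothetical bigram probabilities (illustrative numbers, not from the talk).
bigram_prob = {
    ("<s>", "call"): 0.1,
    ("call", "home"): 0.05,
    ("call", "roam"): 0.001,
    ("home", "</s>"): 0.3,
    ("roam", "</s>"): 0.3,
}

def lm_score(words):
    """Product of bigram probabilities over <s> w1 ... wn </s>."""
    tokens = ["<s>"] + words + ["</s>"]
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= bigram_prob.get((prev, cur), 1e-9)  # tiny floor for unseen pairs
    return p

assert lm_score(["call", "home"]) > lm_score(["call", "roam"])
```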

Automatic Speech Recognition: Feature Extraction
Remove redundancy and irrelevant information

Automatic Speech Recognition: Acoustic Models
Scores the conditional likelihood that the hypothesized words were spoken

Acoustic Models Illustration: Vowel classification task

Consider the (magnitude) FFT of a frame of (vowel) speech

Notice the energy peaks (formants). It turns out their positions differ for different vowels.

Acoustic Models Illustration: Vowel classification task
Values of (F1, F2) for vowels in a training set

Acoustic Models Illustration: Vowel classification task
Score the conditional likelihood that a hypothesized vowel was observed

P(X | æ) > P(X | ε)

Automatic Speech Recognition: Decoder

Return "call home" if

P(X | "call home") P("call home") > P(X | "call roam") P("call roam")

Mathematical justification:

W* = argmax_W P(W | X)
   = argmax_W P(X | W) P(W) / P(X)
   = argmax_W P(X | W) P(W)
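A hedged sketch of this decoding rule in Python: pick the hypothesis maximizing P(X|W) P(W); P(X) is constant across hypotheses, so it drops out of the argmax. The scores below are made-up log-probabilities for illustration only.

```python
import math

# Hypothetical acoustic (log_am) and language-model (log_lm) scores.
hypotheses = {
    "call home": {"log_am": -12.0, "log_lm": math.log(0.005)},
    "call roam": {"log_am": -11.5, "log_lm": math.log(0.0001)},
}

# MAP decision: maximize log P(X|W) + log P(W).
best = max(hypotheses, key=lambda w: hypotheses[w]["log_am"] + hypotheses[w]["log_lm"])
print(best)  # "call home": the LM outweighs the slightly better acoustic score
```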

(Deep) Neural Networks
Illustration with a simple feed-forward network

A Node in a Neural Network

Neural Networks
Illustration of Forward Propagation

Neural Networks: Learning
Illustration of the Back Propagation Algorithm

home.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html
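As a stand-in for the figures, here is a minimal numpy sketch of forward and backward propagation through a single sigmoid node; all names and numbers are illustrative, not from the tutorial linked above.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = np.array([0.5, -1.0])   # inputs to the node
w = np.array([0.2, 0.7])    # weights
b = 0.1                     # bias
target = 1.0
lr = 0.5                    # learning rate

# Forward pass: weighted sum, then nonlinearity.
z = sigmoid(w @ x + b)

# Backward pass for squared error 0.5*(z - target)^2:
# chain rule gives dE/dw = (z - target) * z * (1 - z) * x.
delta = (z - target) * z * (1.0 - z)
w -= lr * delta * x
b -= lr * delta
```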

Multistyle Training (MTR) improves robustness

Deep Neural Networks: Why are they so hot now?

- Universal approximator
  Superposition Theorem of Kolmogorov (1957)
  A Computable Kolmogorov Superposition Theorem (Brattka, 2000)
  Neural Network with Unbounded Activation Functions is a Universal Approximator (Sonoda & Murata, 2015)

- Back propagation algorithm (Linnainmaa, 1970; Rumelhart et al., 1986)

- Efficient BackProp (LeCun, 1998); A Fast Learning Algorithm for Deep Belief Nets (Hinton, 2006)

Outline

Background
  Automatic Speech Recognition
  Deep Neural Networks

Problem Definition
  Motivation: Replace Hand-Crafted Feature Extraction
  Problem Definition
  One Approach: Learnable Time Domain Convolutional Filters

Learnable Complex-valued Frequency Domain Filters
  Properties of the Model
  Empirical Evaluations

Summary & Discussion

Motivation

In most ASR systems, feature extraction:

- uses hand-crafted rules that are three decades old

- mimics human perception

- may not be optimal for speech recognition

Since neural networks are universal approximators, why not let the network learn the optimal features too!?

Recall: Feature Extraction

- The most problematic stage is the Mel-filterbank

- It loses information

- Log-mel features are useless for dereverberation or beamforming, and they preclude integration into neural networks

Problem

Can we feed the speech frames directly as input to the neural networks and learn the features jointly with the acoustic models?

One Approach: Learnable Time Domain Filters (Hoshen et al., 2015)

- Skip transforming the signal into the frequency domain

- Replace mel-filters with a convolutional neural network layer

- Learn the parameters of the layer (filter) from the data

- Implementation inspired by gammatone features

Gammatone Features

All of these operations can be approximated within the neural network

Learnable Time Domain Filters (Hoshen et al., 2015)

Learned Time Domain Filters

Learnable Time Domain Filters (Hoshen et al., 2015)

But:

- the convolution layer has several parameters to tune

- it is computationally expensive

Convolution

Convolutional Layer

- Convolution layers use only the "valid" outputs

- Trade-off: longer filters ⟹ fewer valid outputs (see the sketch below)
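A one-loop illustration of that trade-off; the frame and filter lengths here are arbitrary examples (512 samples is, e.g., a 64 ms frame at 8 kHz).

```python
import numpy as np

# With an n-sample frame and a k-tap filter, "valid" convolution yields
# n - k + 1 outputs, so longer filters leave fewer outputs to pool over.
frame = np.random.randn(512)
for k in (25, 100, 400):
    out = np.convolve(frame, np.ones(k), mode="valid")
    print(k, len(out))  # 25 -> 488, 100 -> 413, 400 -> 113
```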

Time Domain Filters: Filter Length Dependencies

- For a 64 ms frame, there are still more settings to sweep!

- Typically, max pooling works better than average pooling

- Information is lost, so filters cannot be cascaded in the signal-processing sense

Outline

Background
  Automatic Speech Recognition
  Deep Neural Networks

Problem Definition
  Motivation: Replace Hand-Crafted Feature Extraction
  Problem Definition
  One Approach: Learnable Time Domain Convolutional Filters

Learnable Complex-valued Frequency Domain Filters
  Properties of the Model
  Empirical Evaluations

Summary & Discussion

Convolution in Time ≡ Multiplication in Frequency Domain

Multiplication in the Frequency Domain: Complex-Valued

Y[k] = H[k] X[k]

- X[k] is complex-valued

- H[k] is complex-valued

- We need a linear layer with complex-valued neurons! (A numerical check of the identity follows below.)
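A quick numerical check of the convolution theorem, stated here for circular convolution (linear convolution with sufficient zero-padding behaves the same way):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
h = rng.standard_normal(16)

# Circular convolution in time, via index arithmetic mod N.
y = np.array([sum(x[m] * h[(n - m) % 16] for m in range(16)) for n in range(16)])

# The DFT of the convolution equals the element-wise product of the DFTs.
assert np.allclose(np.fft.fft(y), np.fft.fft(h) * np.fft.fft(x))
```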

Complex-valued Neuron

Let X := X_r + j X_i, W := W_r + j W_i, B := B_r + j B_i

Z = g(WX − B)

Z_r = g(W_r X_r − W_i X_i − B_r)
Z_i = g(W_i X_r + W_r X_i − B_i)

- Easily implemented in a neural net with shared weights (see the sketch below)

- How is this different from using g([W_r W_i][X_r X_i]^T − B)?
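A minimal numpy sketch of this neuron implemented with shared real weights, exactly as the equations above prescribe; the function name and test values are ours. Note how W_r and W_i are reused (with one sign flip) across both outputs, unlike an unconstrained real layer g([W_r W_i][X_r X_i]^T − B) where all four weights would be independent.

```python
import numpy as np

def complex_neuron(Xr, Xi, Wr, Wi, Br, Bi, g=np.tanh):
    """Z = g(W X - B), computed separately for real and imaginary parts."""
    Zr = g(Wr * Xr - Wi * Xi - Br)
    Zi = g(Wi * Xr + Wr * Xi - Bi)
    return Zr, Zi

# Sanity check against numpy's native complex arithmetic (identity g for clarity).
X, W, B = 0.3 + 0.4j, 0.8 - 0.2j, 0.1 + 0.05j
Zr, Zi = complex_neuron(X.real, X.imag, W.real, W.imag, B.real, B.imag, g=lambda a: a)
assert np.allclose(Zr + 1j * Zi, W * X - B)
```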

Decision Boundaries are Orthogonal!

Easy to see from the normals to the boundaries (Tohru Nitta, 2004):

Z_r = g(W_r X_r − W_i X_i − B_r)
Z_i = g(W_i X_r + W_r X_i − B_i)

Q_r = [∂Z_r/∂X_r, ∂Z_r/∂X_i] = g'(·) [W_r, −W_i]
Q_i = [∂Z_i/∂X_r, ∂Z_i/∂X_i] = g'(·) [W_i, W_r]

Q_r^T Q_i = 0 ⟹ Q_r ⊥ Q_i
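A quick numeric confirmation of the result, dropping the common g'(·) factor since it only scales each gradient:

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(5):
    Wr, Wi = rng.standard_normal(2)
    Qr = np.array([Wr, -Wi])  # direction of grad Z_r w.r.t. (X_r, X_i)
    Qi = np.array([Wi, Wr])   # direction of grad Z_i w.r.t. (X_r, X_i)
    assert np.isclose(Qr @ Qi, 0.0)  # orthogonal for any weights
```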

Complex FFT Input & Complex-valued Linear Layers: The Appeal

- Orthogonal boundaries ⟹ higher modeling capacity, e.g., XOR with one neuron!

- Gradient descent is faster

- Does not have the trade-offs of convolutional layers

- L1 regularization is meaningful, unlike in the time domain

- No max pooling, which allows cascading of filters, e.g., dereverberation + beamforming + feature extraction

- Possibility to learn hybrid filters that optimize both a signal-processing cost function and WER

Complex-valued Linear Layers: Single Filter

Complex-valued Linear Layers: Set of Filters

"A model is a lie that helps you see the truth." (Howard Skipper)

Summation in the Frequency Domain

Lemma
Summation in the frequency domain ≡ weighted average pooling in the time domain.
If X is the Fourier transform of the 2N-point signal x, then

sum_{k=0}^{N} X_k = sum_{n=0}^{2N−1} α_n x_n

where

α_n = N + 1            if n = 0
α_n = coth(jπn / 2N)   if n is odd
α_n = 1                if n is even, n ≠ 0
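A numerical check of the lemma in numpy; N and the signal are arbitrary, and coth is computed as the reciprocal of the complex tanh.

```python
import numpy as np

N = 8
rng = np.random.default_rng(2)
x = rng.standard_normal(2 * N)
X = np.fft.fft(x)

# Build the weights alpha_n from the lemma.
alpha = np.ones(2 * N, dtype=complex)      # 1 for even n != 0
alpha[0] = N + 1
n_odd = np.arange(1, 2 * N, 2)
alpha[n_odd] = 1.0 / np.tanh(1j * np.pi * n_odd / (2 * N))  # coth(j*pi*n / 2N)

# Sum of the first N+1 DFT bins == weighted sum of time-domain samples.
assert np.allclose(X[: N + 1].sum(), (alpha * x).sum())
```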

Complex Linear Projection

Proposition
The projection in the frequency domain is equivalent to convolution followed by a weighted average pooling:

sum_{j=0}^{N} W_ij X_j ⟷ sum_{j=0}^{2N−1} α_j (w_i ∗ x)[j]

Complex Linear Projection (CLP)
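Putting the pieces together, here is a minimal sketch of a CLP layer under the proposition above: take the N+1 rFFT bins of a 2N-sample frame and apply a learnable complex projection W. The log-magnitude output nonlinearity is our assumption about how the features feed the next layer, and all shapes and names are illustrative.

```python
import numpy as np

def clp_layer(frame, W):
    """frame: (2N,) real samples; W: (p, N+1) complex filters."""
    X = np.fft.rfft(frame)           # N+1 complex frequency bins
    Y = W @ X                        # complex linear projection, one value per filter
    return np.log(np.abs(Y) + 1e-6)  # log-magnitude features, floored for stability

rng = np.random.default_rng(3)
frame = rng.standard_normal(512)     # e.g., a 64 ms frame at 8 kHz
W = rng.standard_normal((128, 257)) + 1j * rng.standard_normal((128, 257))
features = clp_layer(frame, W)       # shape (128,)
```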

ASR Experimental Setup

- Training set: 2000 hours, 3M anonymized voice-search queries
  - SNR between 0 and 20 dB (average ≈ 12 dB)
  - Reverb T60 between 400 ms and 900 ms (average ≈ 600 ms)
  - 8-channel linear microphone array with 2 cm spacing
  - Noise and target-speaker locations constant per utterance

- Acoustic model
  - A cascade of 3 LSTMs followed by a DNN layer
  - 13K HMM states
  - Learning with asynchronous stochastic gradient descent

- Test set: voice search with matching conditions

- All results reported here are after cross-entropy (CE) training

Experimental Settings & WER Results

- Frame sizes: 25 ms, 32 ms, 64 ms

- 128 filters

WER (%):

System               25 ms   32 ms   64 ms
Log-mel              23.4    22.8    21.8
Time-domain conv.    23.7    23.4    22.5
CLP                  23.2    22.8    22.0

- CLP is at least as good as the alternatives!

Learned CLP Filters: Before and After Applying L1 Regularization

Learned Filters: Histogram of Learned Filter Coefficients

Time-domain filter coefficients (in time & in frequency)

CLP filter coefficients (without & with L1)

CLP filters are sparse!

Complex Linear Projection: Two Channels

Experimental Settings & WER Results

- Two channels from microphones 14 cm apart

- Frame sizes: 25 ms, 32 ms, 64 ms

- 256 filters

WER (%):

System               25 ms   32 ms   64 ms
Log-mel              21.8    21.3    20.7
Time-domain conv.    21.5    21.2    21.2
CLP                  21.5    20.9    20.5

- Again, CLP is at least as good as the alternatives!

Multichannel CLP: Frame Size vs Projection Size

- Frame size: 32 ms, 64 ms

- Projection size: 128, 256, 512

- Without L1 regularization and stop bands

WER (%):

Frame size   128     256     512
32 ms        21.7    21.4    21.7
64 ms        21.1    21.1    21.5

- Higher projection size ⟹ larger input to the LSTM, hence slower: step time is about 1.5× slower going from 256 to 512

- Frame size has no impact for 32 ms and 64 ms

- Step time is about 2× slower going from 1 to 8 channels

Learned CLP Filters: Two Channel

Learned CLP Filters: Time-delay between Real and Imaginary

[Plot: filter number vs. time delay]

Histogram of Time-delay between Real and Imaginary

[Plot: time delay vs. frequency of occurrence]

Language and Task Dependency

- English high-pitch voice search (en-us-hp)

- Taiwanese Mandarin voice search (cmn-hant-tw)

WER (%):

System               en-us-hp   cmn-hant-tw
Log-mel              16.6       17.2
Time-domain conv.    16.3       16.8
CLP                  16.4       16.6

- CLP is as good as or better than the alternatives!

Learned Filters: Sorted by Center Frequency

Computational Efficiency

- Time-domain filters: ≈ O(Nkp + k²p), with k strides, a window size of 2N, and p filters

- CLP: ≈ O(pN)

Model           Time-domain filters   CLP
num-params      45K                   66K
num-add-mult    14.51 M               263.17 K

- The computation cost is more than an order of magnitude lower!

Summary & Future Directions

- Introduced complex-valued linear layers for acoustic modeling

- Complex-valued linear projection (CLP) as one instance of it

- Showed a simple multi-channel extension of CLP

- Reported experimental comparisons with log-mel and time-domain filters

- Demonstrated a task-dependent advantage

- Discussed properties of the CLP model

- Future directions: more efficient multi-channel projections, hybrid systems with WER and beamforming-type cost functions, adding multiple complex layers with appropriate complex-valued activation functions, integrating dereverberation, ...
