
Speech Signals Frequency Modulation Decoding via Deep Neural Networks

Dan Elbaz

Technion - Computer Science Department - M.Sc. Thesis MSC-2018-22 - 2018


Speech Signals Frequency Modulation Decoding via Deep Neural Networks

Research Thesis

Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

Dan Elbaz

Submitted to the Senate of the Technion — Israel Institute of Technology

Iyyar 5778, Haifa, May 2018


This research was carried out under the supervision of Adjunct Professor Michael Zibulevsky, in the Faculty of Computer Science.

Some results in this thesis have been published as a paper by the author and research collaborators in a conference during the course of the author's Master's research period, the most up-to-date version of which is:

Dan Elbaz and Michael Zibulevsky. End to end deep neural network frequency demodulation of speech signals. In Future of Information and Communication Conference (FICC) 2018. IEEE [In Press], Singapore, 2018. ISBN: 978-1-5386-2056-4.

Acknowledgements

This research thesis was done under the supervision of Professor Michael Zibulevsky in the Faculty of Computer Science. The generous financial help of the Technion is gratefully acknowledged. I would like to express my sincere gratitude to my advisor, Professor Michael Zibulevsky, for the continuous support of my study and research. The door to Professor Zibulevsky's office was always open whenever I ran into a trouble spot or had a question. Last but not least, I would like to thank my family and friends for the spiritual support they have given me.



Contents

List of Figures

Abstract 1

Abbreviations and Notations 3

1 Introduction 5

2 Scientific Background 7
  2.1 Signal modulation 7
    2.1.1 Frequency modulation 7
    2.1.2 Base band representation 8
    2.1.3 Noise model 9
    2.1.4 Emphasis 10
  2.2 Speech 10
    2.2.1 Human voice system 10
    2.2.2 Speech quality assessment 12
  2.3 Deep learning 13
    2.3.1 Supervised learning and function approximation 13
    2.3.2 Artificial Neural Networks 13
    2.3.3 Optimization: RMSProp 15
    2.3.4 Feed-forward neural networks and backpropagation 15
    2.3.5 Recurrent neural networks 17
    2.3.6 Training RNNs: backpropagation through time 18
    2.3.7 Bidirectional recurrent neural networks 20
    2.3.8 Stacked recurrent neural networks 20
    2.3.9 The vanishing gradient problem 22
    2.3.10 LSTM 23
    2.3.11 Deep bidirectional LSTM 25

3 Problem Formulation and Related Work 27
  3.1 Problem formulation and motivation 27
  3.2 Related work 28

4 Neural Network Demodulator 31
  4.1 Architecture 31
    4.1.1 Front end 31
    4.1.2 Neural network block 32
    4.1.3 Full system 34
  4.2 Dataset and training procedure 35

5 Experimental Results 37
  5.1 Experiments 37
  5.2 Results 37

6 Conclusions and Future Work 43
  6.1 Conclusions 43
  6.2 Future work 43

A Demodulator Software 45
  A.1 TensorFlow 45


List of Figures

2.1 Communication system with amplitude and phase noise sources 10
2.2 Pre-emphasis and de-emphasis in an FM system 11
2.3 Human voice system 11
2.4 Sigmoid activation function. Source: Isaac Changhau 14
2.5 Hyperbolic tangent activation function. Source: Isaac Changhau 15
2.6 Layer's inputs/outputs in back-propagation 16
2.7 Propagation of the gradients through the neural network 17
2.8 Recurrent neural network unrolled. Source: Nature 18
2.9 Backward pass for E3. Source: WILDML 19
2.10 Bidirectional RNN. Source: Stanford cs224d 20
2.11 Deep bidirectional RNN with three RNN layers. Source: Stanford cs224d 21
2.12 Repeating module in LSTM. Source: Chris Olah's blog, Understanding LSTM Networks 23
2.13 Interaction between the cell state c and access gates i, f, g 25

4.1 Block for extracting the I and Q components 33
4.2 Neural network decoder block 34
4.3 Full demodulation system 35

5.1 Speech reconstruction SNR, for various levels of amplitude noise. Conventional (with and without additional speech enhancement) vs. DNN-based demodulator 38
5.2 Speech reconstruction segmental SNR, for various levels of amplitude noise. Conventional (with and without additional speech enhancement) vs. DNN-based demodulator 39
5.3 Speech reconstruction PESQ score, for various levels of amplitude noise. Conventional (with and without additional speech enhancement) vs. DNN-based demodulator 39
5.4 Speech reconstruction SNR, for various levels of phase noise. Conventional (with and without additional speech enhancement) vs. DNN-based demodulator 40
5.5 Speech reconstruction segmental SNR, for various levels of phase noise. Conventional (with and without additional speech enhancement) vs. DNN-based demodulator 40
5.6 Speech reconstruction PESQ score, for various levels of phase noise. Conventional (with and without additional speech enhancement) vs. DNN-based demodulator 41
5.7 Spectrogram of the original audio signal and DNN demodulator reconstruction 41

6.1 Training perceptual loss function 44

A.1 Training loop 46


Abstract

Frequency modulation (FM) is a form of radio transmission that has been in wide use for almost a century. Its widest use is radio broadcasting, commonly employed for transmitting audio signals representing voice. Due to the effect of various distortions, noise conditions and other impairments imposed on the transmitted signal, detection reliability severely deteriorates. As a result, the intelligibility and quality of the detected speech decrease significantly. This phenomenon is known as the threshold effect.

End-to-end learning based approaches have been shown to be effective and have achieved excellent performance for many systems, even with limited training data. In this work we present an end-to-end learning approach for a novel application: a software defined radio (SDR) receiver for FM detection. By adopting an end-to-end learning based approach, the system utilizes prior information about the transmitted speech message in the demodulation process.

The receiver uses a multi-layered bidirectional Long Short-Term Memory (BLSTM) architecture to capture long-range dependencies and nonlinear dynamics of the speech signal. The receiver then uses the learned speech structure to detect and enhance speech from the in-phase and quadrature components of its baseband version.

The new system yields high-performance detection under both acoustical disturbances and communication channel noise, and outperforms the established methods in low signal-to-noise ratio (SNR) conditions. We compared the new system's performance with the conventional method using several speech quality assessment measures: SNR, segmental SNR, and the perceptual evaluation of speech quality (PESQ) score.


Abbreviations and Notations

ANN : Artificial Neural Network
AWGN : Additive White Gaussian Noise
BLSTM : Bidirectional Long Short-Term Memory
BPTT : Backpropagation Through Time
BRNN : Bidirectional Recurrent Neural Network
DBLSTM : Deep Bidirectional Long Short-Term Memory
DBRNN : Deep Bidirectional Recurrent Neural Network
fc : The carrier frequency
f∆ : The frequency deviation
f(t) : Instantaneous frequency of the modulated wave
FM : Frequency Modulation
GPU : Graphical Processing Unit
I(t) : In-phase component
LPF : Low Pass Filter
LSTM : Long Short-Term Memory
MOS : Mean Opinion Score
MSE : Mean Squared Error
OM-LSA : Optimally Modified Log Spectral Amplitude
PESQ : Perceptual Evaluation of Speech Quality
PSNR : Peak Signal to Noise Ratio
Q(t) : Quadrature component
RNN : Recurrent Neural Network
SDR : Software Defined Radio
SNR : Signal to Noise Ratio
xm(t) : The information message or speech signal
ϕ(t) : The modulation phase


Chapter 1

Introduction

Frequency modulation (FM) is a nonlinear encoding of information on a carrier wave. It can be used for interferometric applications [1], seismic prospecting [2], remote monitoring of vital signs [3] and many more applications, each with its own statistics, dominated by the underlying generating process. However, its widest use is radio broadcasting, commonly employed for transmitting audio signals representing voice.

The communication transmission channel is subject to various distortions, noise conditions and other impairments. These impairments severely degrade FM demodulator performance once a critical level is exceeded.

Long Short-Term Memory (LSTM) recurrent neural networks [4] are powerful models that can capture long-range dependencies and non-linear dynamics. In many signal estimation tasks, the advantage of recurrent neural networks becomes significant only when there is a statistical dependency between the examples. This work introduces an FM demodulator based on an LSTM recurrent neural network. The main contributions of this work are as follows:

• Utilizing the LSTM's ability to capture the temporal dynamics of speech signals, and taking advantage of the prior statistics of the speech to overcome transmission channel disturbances.

• Taking an end-to-end learning based approach for filtering both acoustical disturbances, modeled as phase noise, and transmission channel disturbances, modeled as amplitude noise. In this approach, the LSTM learns how to map directly from the modulated baseband signal to the modulating audio that had been applied at the transmitter, thus creating a baseband-to-speech mapping.

We demonstrate this method by applying it to FM decoding at varying levels of amplitude and phase noise, and show that it has superior performance over legacy reception systems in low SNR conditions.


Chapter 2

Scientific Background

In this chapter we provide the scientific background for this work. We start by surveying issues concerning frequency modulation and noise in the context of communication systems. We then describe speech structure and quality measures designed for speech quality estimation, and conclude with an overview of general and recurrent neural networks.

2.1 Signal modulation

Modulation involves two waveforms: a modulating signal that holds the information or the message to be transmitted, for example a speech signal, and a periodic waveform, called the carrier signal, that suits the particular application. In the modulation process the message signal is conveyed by varying one or more properties of the carrier signal. In other words, modulation changes the shape of a carrier wave, usually its amplitude, phase or frequency, to encode the speech or data information that we are interested in sending. Modulating a high-frequency sinusoidal carrier with a narrow-frequency-range baseband message transforms the message signal into a high-frequency-range passband signal, one that can pass through a communication channel and be physically transmitted.

The modulation process is carried out by a device called a modulator. The modulator performs the modulation by combining the carrier with the baseband data signal to obtain the transmitted signal. A demodulator is a device that performs demodulation, the inverse of modulation, i.e. extracting the information message from the modulated waveform. The modulation should be a reversible operation, so that the message can be retrieved in the demodulation process. The aim of analog modulation is to enable the transmission of an analog signal, for example an audio speech signal, over an analog bandpass channel.

2.1.1 Frequency modulation

In FM, the speech signal is encoded by varying the frequency of a carrier wave, i.e. the difference between the frequency of the transmitted wave and its carrier frequency, referred to as the instantaneous frequency, is proportional to the speech signal xm(t) in the following manner:


y(t) = Ac cos( 2π fc t + 2π f∆ ∫_0^t xm(τ) dτ )

Where Ac is the amplitude of the carrier, xc(t) = Ac cos(2π fc t) is the sinusoidal carrier, fc is the base frequency of the carrier, and f∆ is the frequency deviation, which represents the maximum shift away from the carrier's base frequency produced by the information signal xm(t). The information signal xm(t) is typically a speech signal. This signal must have a band-limited spectrum (a speech signal is typically band limited to 16 kHz). For mathematical convenience, we will also assume |xm(t)| ≤ 1.

In FM, the instantaneous frequency of the modulated wave f(t) varies in proportion to the modulating information signal and is defined as follows:

f(t) ≜ fc + f∆ xm(t)   (2.1)

The periodic nature of the carrier signal can cause ambiguity (for example, −180° equals +180°). To prevent this we need to ensure that f(t) ≥ 0, or equivalently f∆ ≤ fc. Most communication channels have a bandpass frequency response, hence any signal transmitted on such a channel must have a bandpass spectrum. Though the FM signal spectrum has components extending to infinity, it can be viewed as a band-limited signal. This is due to the fact that most of the energy of the signal is contained within fc ± f∆. This approximation is justified since the amplitude of frequencies outside this band decreases, and higher-order components are often neglected in practical design problems [5].

2.1.2 Base band representation

A sinusoid with frequency modulation can be decomposed into two amplitude-modulated sinusoids that are offset in phase by one-quarter cycle (π/2 rad). The amplitude-modulated sinusoids are known as the in-phase and quadrature components, or the I/Q components. We will derive the representation of the FM signal with its I/Q components, using the simple trigonometric relation:

cos(α + β) = cos(α) cos(β) − sin(α) sin(β)

The general expression representing the transmitted signal can be expressed as follows:

y(t) = Ac cos(2π fc t) cos( 2π f∆ ∫_0^t xm(τ) dτ ) − Ac sin(2π fc t) sin( 2π f∆ ∫_0^t xm(τ) dτ )


The I/Q components can be defined in the following manner:

I(t) = Ac cos( 2π f∆ ∫_0^t xm(τ) dτ )

Q(t) = Ac sin( 2π f∆ ∫_0^t xm(τ) dτ )

We can then represent the modulated signal with its I/Q components in the following way:

y(t) = I(t) cos(2π fc t) − Q(t) sin(2π fc t)

The I/Q components are band-limited, low-pass signals, and hence this signal has a bandpass spectrum centered around the carrier frequency fc. It is common to analyze communication systems by using the low-pass equivalents, also referred to as the baseband (or I/Q) components, of the original bandpass signals.
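The decomposition above can be verified numerically. A minimal sketch, with all signal parameters as illustrative assumptions: build I(t) and Q(t) from the integrated message, reassemble the passband signal, and compare with the direct FM expression:

```python
import numpy as np

fs, fc, f_delta, Ac = 200_000, 20_000, 5_000, 1.0   # illustrative values
t = np.arange(0, 0.01, 1 / fs)
x_m = np.sin(2 * np.pi * 440 * t)                   # toy message

theta = 2 * np.pi * f_delta * np.cumsum(x_m) / fs   # modulation phase
I = Ac * np.cos(theta)                              # in-phase component I(t)
Q = Ac * np.sin(theta)                              # quadrature component Q(t)

# Passband signal rebuilt from its baseband I/Q components:
y_iq = I * np.cos(2 * np.pi * fc * t) - Q * np.sin(2 * np.pi * fc * t)

# Direct FM expression, for comparison:
y_fm = Ac * np.cos(2 * np.pi * fc * t + theta)

# cos(a+b) = cos(a)cos(b) - sin(a)sin(b) holds numerically:
assert np.allclose(y_iq, y_fm)
```

The final assertion is exactly the trigonometric identity used in the derivation, checked sample by sample.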

2.1.3 Noise model

Noise refers to random and unwanted disturbance in the transmitted signal. This disturbance can be produced by processes either internal or external to the system. When such random undesired disturbances are imposed on the transmitted signal, the message may be corrupted. We can filter out or reduce some of the noise; however, generally we cannot eliminate all of it, and inevitably some noise remains. This remaining noise is responsible for one of the system's fundamental limitations.

It is customary to analyze the quality of analog communication systems in terms of the signal-to-noise power ratio, SNR, which measures noise relative to the information signal. The signal-to-noise ratio expresses, in decibels, the amount by which a signal level exceeds its corresponding noise. When measuring SNR against real noise, the actual measured quantity is (S + N)/N, where S is the signal power and N is the noise power. However, since we know the noise values added in the simulation, we will use the ratio S/N to measure the SNR.

Noise degrades the reliability of the communication channel. This problem is especially significant at low SNR values. In the transmission and reception process the signal is subject to several impairments. Those impairments degrade the quality of the transmitted signal and hence also the quality of the information signal. The receiver's role is to reconstruct the original signal from the received signal while overcoming those impairments. As mentioned, the message signal undergoes several distortions; the signal impairments due to those distortions can be divided into two categories:

1. Frequency noise: impairments due to environmental conditions, such as audio distortions. Through the operation of frequency modulation, these originally additive audio impairments are translated into frequency and become frequency noise:

r(t) = Ac cos( 2π fc t + 2π f∆ ∫_0^t (xm(τ) + n(τ)) dτ )


2. Amplitude noise: impairments due to communication channel distortions, such as convolution with the communication channel, multi-path, additive noise due to the propagation characteristics of the channel environment, etc. These impairments are translated into additive amplitude noise, r(t) = y(t) + n(t), where y(t) is the clean FM signal.

In communication systems, the statistical model for each of the above noise sources is usually assumed to be white Gaussian noise. For clarity, a diagram depicting the communication system and its elements is presented in Figure 2.1.

Figure 2.1: Communication system with amplitude and phase noise sources.
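Both noise categories can be simulated directly from their definitions. In the sketch below, the noise levels and all signal parameters are illustrative assumptions; frequency noise is white Gaussian noise added to the message before the phase integral, while amplitude noise is added to the clean FM waveform:

```python
import numpy as np

rng = np.random.default_rng(0)
fs, fc, f_delta, Ac = 200_000, 20_000, 5_000, 1.0   # illustrative values
t = np.arange(0, 0.01, 1 / fs)
x_m = np.sin(2 * np.pi * 440 * t)                   # toy message

def fm_modulate(msg):
    phase = 2 * np.pi * f_delta * np.cumsum(msg) / fs
    return Ac * np.cos(2 * np.pi * fc * t + phase)

# 1) Frequency noise: n(t) enters inside the phase integral,
#    r(t) = Ac cos(2 pi fc t + 2 pi f_delta * int_0^t (x_m + n) dtau)
r_freq = fm_modulate(x_m + 0.05 * rng.standard_normal(t.size))

# 2) Amplitude noise: additive on the clean FM signal, r(t) = y(t) + n(t)
y_clean = fm_modulate(x_m)
r_amp = y_clean + 0.05 * rng.standard_normal(t.size)
```

The key structural difference is visible in the code: frequency noise is integrated into the phase, while amplitude noise never passes through the modulator.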

2.1.4 Emphasis

FM amplifies high-frequency noise and degrades the overall signal-to-noise ratio. To compensate, FM broadcasters insert a pre-emphasis filter prior to FM modulation. This filter increases the amplitude of high frequency bands and decreases the amplitude of lower bands. It can be implemented as follows:

Hp(f) = 1 + j2πfτs (2.2)

Where τs is the filter time constant; the time constant is 75 µs in the United States. At the FM receiver this process should be reversed: the FM receiver has a reciprocal de-emphasis filter after the FM demodulator to attenuate high-frequency noise and restore a flat message signal frequency response. The low-pass de-emphasis filter is given by:

Hd(f) = 1 / (1 + j2πfτs)   (2.3)

Figure 2.2 shows the block diagrams of an FM transmitter with a pre-emphasis filter, Hp(f), and an FM receiver with a de-emphasis filter, Hd(f).
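A discrete-time sketch of the emphasis pair, under illustrative assumptions (the sampling rate, and the choice of a first-difference and one-pole discretization): pre-emphasis implements Hp(f) = 1 + j2πfτs as x(t) + τs·dx/dt, and de-emphasis is the matching one-pole low-pass. With these particular discretizations the cascade is, after a short initial transient, an identity:

```python
import numpy as np

fs = 48_000     # illustrative sampling rate [Hz]
tau = 75e-6     # 75 microsecond time constant (US standard)

def pre_emphasis(x):
    # Hp(f) = 1 + j*2*pi*f*tau  <->  y(t) = x(t) + tau * dx/dt,
    # with the derivative approximated by a backward difference.
    dx = np.diff(x, prepend=x[0]) * fs
    return x + tau * dx

def de_emphasis(x):
    # Hd(f) = 1 / (1 + j*2*pi*f*tau) is a one-pole RC low-pass;
    # discrete form: y[n] = y[n-1] + a * (x[n] - y[n-1]).
    a = 1.0 / (1.0 + tau * fs)
    y = np.empty_like(x)
    acc = 0.0
    for n, xn in enumerate(x):
        acc = acc + a * (xn - acc)
        y[n] = acc
    return y

t = np.arange(0, 0.01, 1 / fs)
x = np.sin(2 * np.pi * 300 * t)
round_trip = de_emphasis(pre_emphasis(x))
# After the initial-condition transient decays, the cascade restores x.
assert np.allclose(round_trip[200:], x[200:], atol=1e-6)
```

The exact cancellation is a property of this discretization pair; production receivers would typically use a standard IIR filter design instead.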

2.2 Speech

2.2.1 Human voice system

Normally, speech is created with pulmonary pressure provided by the lungs, which generates sound by passing through the glottis in the larynx and is then modified by the vocal tract into different vowels and consonants. The speech creation process can be broken down into different phases,


Figure 2.2: Pre-Emphasis and De-Emphasis in FM System

each originating from a different part of the human voice system. Figure 2.3 shows the human voice system.

The vocal folds, which are present at the top of the trachea, can allow the air to pass without interruption, or they can vibrate (open and close rapidly). This process produces quasi-periodic air pulses. The fundamental frequency of vibration of the vocal folds, set by the period between those pulses, is known as the pitch. The pitch period is affected by changes in the air pressure and the glottis tension.

Air flow then spreads from the glottis to the vocal tract; the vocal tract is everything from the nasal tract, tongue, teeth, lips, palate, etc. The air flow is shaped by the vocal tract, creating formant frequency resonances. The formant frequencies are created as a result of the frequency shaping of the signal from the vocal folds by the vocal tract. The particular configuration of the above organs (articulators) for every phoneme creates resonances of the vocal tract at specific frequencies called formants. Changing the shape of the vocal tract also changes the frequency of the formants. Formant frequencies vary mainly due to the phoneme being articulated, and less due to the speaker's identity.

Figure 2.3: Human voice system


2.2.2 Speech quality assessment

In order to assess our demodulator's performance in the speech reconstruction task, we require some objective quality measures. However, for the task of speech quality assessment, there is no single formula that can give a perfect estimate of speech quality. This is partly because speech quality assessment is affected by both psychological and environmental conditions, and is related to the subjective perception of the listener. Finding a formula for speech quality assessment is ongoing research and, as previously mentioned, no single formula approximates human perception. In this section we describe three quality measures: SNR, segmental SNR, and PESQ (perceptual evaluation of speech quality).

Signal to noise ratio

The signal to noise ratio (SNR) score is defined as the ratio of signal power to noise power:

SNR = P_signal / P_noise   (2.4)

Where P_signal is the power of the information signal and P_noise is the power of the noise signal. We can also write the SNR explicitly:

SNR = ( ∑_{n=1}^{N} x(n)² ) / ( ∑_{n=1}^{N} (x(n) − x̂(n))² )   (2.5)

Where x(n) is the information signal, x̂(n) is the reconstructed signal, and N is the signal length. It is common to express SNR using the logarithmic decibel scale as:

SNR_dB = 10 log₁₀ SNR = 10 log₁₀ ( ∑_{n=1}^{N} x(n)² / ∑_{n=1}^{N} (x(n) − x̂(n))² )   (2.6)
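Equation (2.6) translates directly into code; a minimal sketch (the function name is ours):

```python
import numpy as np

def snr_db(x, x_hat):
    """SNR in dB between information signal x and reconstruction x_hat, Eq. (2.6)."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum((x - x_hat) ** 2))

# A reconstruction scaled to 90% of the signal leaves a 10% residual,
# i.e. a power ratio of 100, which is 20 dB.
x = np.sin(np.linspace(0.0, 2.0 * np.pi, 1000))
print(snr_db(x, 0.9 * x))
```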

Segmental signal to noise ratio

The SNR measure is more affected by regions with high amplitude than by regions with low amplitude. However, small errors in regions with small amplitude may be significant for speech quality assessment. The segmental SNR score solves this problem by dividing the signal into small, possibly overlapping, segments and computing the SNR for each segment separately. The final segmental SNR score is the average score over those segments. By using this subdivision into small segments, the noise power is calculated relative to the signal power in the same segment. This increases the contribution of low-energy segments to the total score. The segmental SNR score is calculated as follows:

Segmental SNR_dB = (1/M) ∑_{m=0}^{M−1} 10 log₁₀ ( ∑_{l=0}^{L−1} x(mL + l)² / ( ∑_{l=0}^{L−1} (x(mL + l) − x̂(mL + l))² + ϵ ) )   (2.7)

Where x is the information signal, x̂ is the reconstructed signal, and ML is the signal length, divided into M subsections, each of length L. In this work L is set to correspond to a duration of 32 ms. In order to prevent small arguments in the log function, the summation is limited to subsections with a minimum value of −10 dB. To prevent very large arguments in the log function, caused by sections for which the reconstructed signal is equal to the information signal, ϵ is added to the denominator for numerical stability. We also address this problem by limiting the summation to subsections with a maximum value of 40 dB.
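A sketch of the segmental score in Eq. (2.7). Two details are our reading of the text above and should be taken as assumptions: each segment's score is clamped to the [−10, 40] dB range (rather than the segment being discarded), and the 32 ms segment length is converted to samples via an assumed 16 kHz sampling rate:

```python
import numpy as np

def segmental_snr_db(x, x_hat, fs=16_000, seg_ms=32, lo=-10.0, hi=40.0, eps=1e-10):
    """Segmental SNR, Eq. (2.7): average of per-segment SNR scores in dB.

    Assumptions (ours): per-segment scores are clamped to [lo, hi] dB,
    and fs=16 kHz makes a 32 ms segment L=512 samples long."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    L = int(fs * seg_ms / 1000)          # segment length in samples
    M = len(x) // L                      # number of full segments
    scores = []
    for m in range(M):
        s = x[m * L:(m + 1) * L]
        e = s - x_hat[m * L:(m + 1) * L]
        seg = 10.0 * np.log10(np.sum(s ** 2) / (np.sum(e ** 2) + eps))
        scores.append(float(np.clip(seg, lo, hi)))
    return float(np.mean(scores))
```

Note how ϵ in the denominator and the upper clamp together keep perfect-reconstruction segments from dominating the average.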

Perceptual evaluation of speech quality

In communication systems analysis, the SNR measure is a widely used objective measure and is considered a good indicator of the reconstruction quality between the estimator output and the quantity that we want to estimate. However, SNR may not correlate well with reliable subjective methods, such as the Mean Opinion Score (MOS) obtained from expert listeners. A more suitable speech quality assessment can be achieved by using tests that aim to achieve high correlation with MOS tests, such as PESQ.

PESQ, Perceptual Evaluation of Speech Quality [6], is a standard comprising a test methodology for automated assessment of the voice quality of speech as experienced by human beings. PESQ is a full-reference algorithm that analyzes the speech signal sample-by-sample after a temporal alignment of the reference signal (usually the clean signal) and the test signal (usually the degraded signal). The predictions of this objective measure should come as close as possible to subjective quality scores as judged in subjective listening tests. PESQ predicts a Mean Opinion Score covering a scale from −0.5 (bad) to 4.5 (excellent).

2.3 Deep learning

2.3.1 Supervised learning and function approximation

In supervised learning, the computer is presented with example inputs and their desired outputs, and the goal is to learn a general rule that maps inputs to outputs. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" manner. The kind of generalization we require is often called function approximation, because it takes examples from a desired function and attempts to generalize from them to construct an approximation of the entire function. An example of a non-linear function approximator that has gained much attention in recent years is artificial neural networks (ANNs), which we will discuss in greater detail in the next section.

2.3.2 Artificial Neural Networks

Artificial neural networks are a family of models that are loosely inspired by biological neural networks (the central nervous systems of animals, in particular the brain) and are used for estimation or approximation of functions that can depend on a large number of inputs and are generally unknown. Artificial neural networks are usually presented as systems of interconnected "neurons", which are modeled as nonlinear functions or activation functions, such as the sigmoid or hyperbolic tangent. These "neurons" exchange messages with one another. The connections have numeric weights that can be tuned based on experience, making neural nets adaptive to inputs and capable of learning. The weights are the parameters of the network, which are tuned during the learning phase in order to minimize some loss function that depends on the training set. The tuning of the network weights is done via some iterative optimization algorithm, usually an algorithm that depends on the gradient of the loss function. In addition, neural networks have hyper-parameters, which help define their behavior. The hyper-parameters are fixed and set at the construction of the network.

Activation functions

An activation function, or neuron, is an elementary unit in an artificial neural network. The activation function defines the output for a given input by applying a nonlinear function to the input. The role of the activation function is to introduce nonlinearity into the neural network, and thus enable the network to approximate any complex mapping function, linear or not. Activation functions usually have a sigmoid shape, but other nonlinear functions, such as piecewise linear functions or step functions, are also acceptable. Activation functions are usually monotonically increasing, continuous and differentiable. Next, we will discuss the two common forms of activations: the sigmoid function and the hyperbolic tangent function. These activation functions will act as basic building blocks in our model.

Sigmoid: The sigmoid function (used for hidden layer neuron output) is a special case of the logistic function, having a characteristic "S"-shaped curve. The activation function squashes its input to the range (0, 1). The logistic function and its derivative can be seen in figure 2.4 and are defined by the formulas:

• $\sigma(z) = \frac{1}{1+\exp(-z)}$

• $\frac{\partial \sigma(z)}{\partial z} = \sigma(z)(1-\sigma(z))$

Figure 2.4: Sigmoid activation function. Source: Isaac Changhau

Hyperbolic tangent: The hyperbolic tangent is a zero-centered activation function with an "S"-shaped curve. The activation function squashes its input to the range (−1, 1). The hyperbolic tangent function and its derivative can be seen in figure 2.5 and are defined by the formulas:

• $\tanh(z) = \frac{1-\exp(-2z)}{1+\exp(-2z)}$

• $\frac{\partial \tanh(z)}{\partial z} = 1 - \tanh^2(z)$


Figure 2.5: Hyperbolic tangent activation function. Source: Isaac Changhau
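As a quick sanity check of the two formulas above, the activations and their derivatives can be sketched in NumPy (an illustrative snippet, not part of the thesis; the function names are our own):

```python
import numpy as np

def sigmoid(z):
    # logistic function: squashes input to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def dsigmoid(z):
    # derivative: sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def dtanh(z):
    # derivative of the hyperbolic tangent: 1 - tanh^2(z)
    return 1.0 - np.tanh(z) ** 2

# check the analytic derivatives against central finite differences
z = np.linspace(-5.0, 5.0, 101)
eps = 1e-6
assert np.allclose((sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps), dsigmoid(z), atol=1e-8)
assert np.allclose((np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps), dtanh(z), atol=1e-8)
```

Both derivative identities follow from the chain rule applied to the closed-form expressions above.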

2.3.3 Optimization: RMSProp

In this section, we will describe the optimization method that we used in order to adjust the model weights to minimize the cost (phase 5 in the training loop A.1). We will start with Adagrad [7], which forms the basis for the optimization algorithm we will eventually use. Adagrad is a gradient-based optimization method that uses a separate adaptive learning rate for each weight in the network at every time step $t$. The learning rate is adapted component-wise, and results from the calculation of the square root of the sum of squares of the gradients from previous steps, taken element-wise. Define the gradient of the loss at time step $t$ w.r.t. the weight $w_i$ as $g_{t,i}$, the sum of the squared gradients of component $i$ up to time step $t$, $\sum_{k=1}^{t} g_{k,i}^2$, as $s_{t,i}$, and the global learning rate as $\alpha$. We obtain the following update rule:

$$w_{t+1,i} = w_{t,i} - \frac{\alpha}{\sqrt{s_{t,i} + \epsilon}}\, g_{t,i} \qquad (2.8)$$

The problem with Adagrad is that it accumulates the squared gradients in the denominator, which can make the denominator become very big, and thus causes the learning rate to become very small and slows down the learning process. In order to address this issue, RMSProp [8] was suggested. RMSProp is a refinement of Adagrad that limits the size of the denominator. This is done by dividing the learning rate by an exponentially decaying average of squared gradients, i.e., this time we divide each element of the gradient by $\sqrt{\sum_{k=1}^{t} \gamma^{t-k} g_{k,i}^2}$, where $\gamma \in (0, 1)$ is a leakiness factor that limits the size of the denominator. In order to use the above optimization method, we first need to compute the gradient of the loss function w.r.t. the network weights. In the next section, 2.3.4, we will describe the procedure needed in order to compute the gradient.
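The update can be sketched in a few lines of NumPy, using the common recursive form of the decaying average, $s_{t,i} = \gamma s_{t-1,i} + (1-\gamma) g_{t,i}^2$ (an illustrative sketch; the function name and hyper-parameter values are ours, not the thesis configuration):

```python
import numpy as np

def rmsprop_update(w, g, s, lr=0.01, gamma=0.9, eps=1e-8):
    """One RMSProp step: s holds the decaying average of squared gradients."""
    s = gamma * s + (1.0 - gamma) * g ** 2
    w = w - lr * g / (np.sqrt(s) + eps)
    return w, s

# minimize f(w) = w^2 (gradient 2w), starting from w = 5
w, s = 5.0, 0.0
for _ in range(2000):
    w, s = rmsprop_update(w, 2.0 * w, s)
```

Because the denominator tracks a decaying average rather than a full sum, the effective step size does not shrink to zero as training progresses.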

2.3.4 Feed-forward neural networks and backpropagation

We will start with a description of a feed-forward network and demonstrate the way to compute its gradient. As previously mentioned, the computation of the gradient is the basis of most optimization algorithms that are used in order to find the optimal set of weights. The feed-forward network is the most common type of network. This type of network forms the basis of the recurrent neural network (RNN), as an RNN can be seen as a deep neural network if we "unroll" the network with respect to time. Hence, we can use the same principles derived here in order to develop the gradients of the recurrent neural network. A feed-forward network is a network whose layers are connected sequentially, i.e., with no backward connections (loops) and no intra-layer connections. The network is defined by specifying each layer and the order of the layers. Each layer is described by three messages, where the superscript $k$ denotes the index of the layer:

• Forward Pass – This is a description of the layer output, the function that the layer computes, i.e., the layer output given its input.

$$z^{k+1} = f^k(z^k, w^k) \qquad (2.9)$$

• Backward Pass – This is what the layer passes backwards for the back-propagation algorithm: the derivative of the loss w.r.t. the input.

$$\delta^k = \frac{\partial E}{\partial z^k} = \frac{\partial E}{\partial z^{k+1}} \frac{\partial z^{k+1}}{\partial z^k} = \delta^{k+1} \underbrace{\frac{\partial f^k(z^k, w^k)}{\partial z^k}}_{\text{Jacobian}} \qquad (2.10)$$

• If the layer has parameters, we will also need the derivative of the loss w.r.t. the parameters of the layer. This is needed in order to use some gradient-based optimization method to find the optimal parameters.

$$\frac{\partial E}{\partial w^k} = \frac{\partial E}{\partial z^{k+1}} \frac{\partial z^{k+1}}{\partial w^k} = \delta^{k+1} \frac{\partial f^k(z^k, w^k)}{\partial w^k} \qquad (2.11)$$

The function $f^k$ for each layer is chosen by the designer of the network based on the application; the derivatives $\frac{\partial f^k}{\partial w^k}$ and $\frac{\partial f^k}{\partial z^k}$ can be calculated analytically for each layer based on the chosen $f^k$.

Figure 2.6: Layer’s inputs/outputs in back-propagation

Each layer communicates with the adjacent layers by passing the messages $f^k$ and $\delta^k$ to the following and previous layers, respectively. The gradient of the entire network is computed using the chain rule, by passing those messages between layers and by accumulating the derivative of the loss w.r.t. each building block's (layer's) parameters, for layers with parameters. The forward pass of training data and the corresponding backward pass of the gradient of the loss function w.r.t. the neural network weights are depicted in diagram 2.7.

Figure 2.7: Propagation of the gradients through the neural network
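To make the three messages concrete, a minimal linear layer can be written as follows (an illustrative NumPy sketch; the class and method names are our own, not the thesis implementation):

```python
import numpy as np

class Linear:
    """A layer exposing the messages of eqs. (2.9)-(2.11):
    forward output, backward delta, and the parameter gradient."""
    def __init__(self, n_in, n_out, rng):
        self.w = 0.1 * rng.standard_normal((n_in, n_out))

    def forward(self, z):
        self.z = z            # cache the input for the backward pass
        return z @ self.w     # z^{k+1} = f^k(z^k, w^k)

    def backward(self, delta_next):
        self.dw = self.z.T @ delta_next   # dE/dw^k, accumulated per layer
        return delta_next @ self.w.T      # delta^k = delta^{k+1} times the Jacobian

rng = np.random.default_rng(0)
layer = Linear(3, 2, rng)
out = layer.forward(rng.standard_normal((4, 3)))   # batch of 4 inputs
delta = np.ones_like(out)                          # stand-in for dE/dz^{k+1}
delta_prev = layer.backward(delta)                 # message to the previous layer
```

Chaining several such layers and calling `backward` in reverse order reproduces the message passing depicted in diagram 2.7.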

2.3.5 Recurrent neural networks

In a feed-forward neural network, we assume that all inputs are independent of each other. But for many tasks the input is a time series, for which the samples are dependent. A recurrent neural network [9] is basically similar to a feed-forward neural network. However, an RNN has a hidden state, which captures information about a sequence. This is implemented by sharing weights over time. Recurrent networks perform the same task for every element of a sequence, just with different inputs, with the output depending on the previous computations. This greatly reduces the total number of parameters we need to learn. Because this feedback loop occurs at every time step in the series, the hidden state acts as a "memory", which captures sequential dependencies between samples. Each hidden state contains information not only of the previous hidden state, but also of all those that preceded $h_{t-1}$. This process of carrying memory forward can be described as follows:

$$s_t = f(U x_t + W s_{t-1})$$
$$o_t = g(V s_t)$$

where $x_t$ is the input at time step $t$, $s_t$ is the hidden state at time step $t$ and $o_t$ is the output at time step $t$. The functions $f$ and $g$ are usually nonlinear functions such as sigmoid, tanh or ReLU. The weight matrices $U, V, W$ determine how much importance to accord to both the present input and the past hidden state. A description of an RNN unrolled through time (the unrolled RNN is the network for the complete sequence) is given in diagram 2.8.

Figure 2.8: Recurrent neural network unrolled. Source: Nature
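The recurrence above can be sketched directly (illustrative NumPy code with toy dimensions; not the thesis implementation):

```python
import numpy as np

def rnn_forward(x_seq, U, W, V, s0):
    """Vanilla RNN: s_t = tanh(U x_t + W s_{t-1}), o_t = V s_t.
    The same U, W, V are shared across all time steps."""
    s, outputs = s0, []
    for x_t in x_seq:
        s = np.tanh(U @ x_t + W @ s)   # hidden state carries the "memory"
        outputs.append(V @ s)
    return np.array(outputs), s

rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 2, 4, 1, 5
U = 0.5 * rng.standard_normal((d_h, d_in))
W = 0.5 * rng.standard_normal((d_h, d_h))
V = 0.5 * rng.standard_normal((d_out, d_h))
o, s_T = rnn_forward(rng.standard_normal((T, d_in)), U, W, V, np.zeros(d_h))
```

Note that the loop body is identical at every step; only the input and the carried state change, which is exactly the weight sharing discussed above.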

2.3.6 Training RNN: Backpropagation through time

In order to find the optimal set of weights, we first need to compute the gradient of the loss function w.r.t. the network weights. In feed-forward networks this is done by backpropagation 2.3.4. In recurrent networks we find the gradients through a process called backpropagation through time (BPTT), an extension of backpropagation; the algorithm was independently derived by [10], [11]. BPTT is similar to backpropagation. However, because the parameters are shared by all time steps in the network, the gradient at each output depends not only on the calculations of the current time step, but also on the previous time steps. In order to compute the gradient, we need to sum the gradients over all previous time steps. Since the forward/backward computational demands of full BPTT become very high if we compute the gradients over many time steps, we approximate BPTT by truncating it, taking into account only the last $T$ time steps when computing the gradient. We will now derive the gradients of the RNN. For one time step the error is

$$E_t(\hat{o}_t, o_t) \qquad (2.12)$$

For example, in the case of the mean square error (MSE) loss function:

$$E_t(\hat{o}_t, o_t) = (\hat{o}_t - o_t)^2 \qquad (2.13)$$


If we average over all relevant time steps we get:

$$E(\hat{o}, o) = \sum_t E_t(\hat{o}_t, o_t) = \sum_t (\hat{o}_t - o_t)^2$$

where $o_t$ is the target output at time step $t$ and $\hat{o}_t$ is the network prediction. Each training example is composed of a full sequence, hence the total error is the sum of the errors at each time step.

$$\frac{\partial E}{\partial w} = \sum_{t=1}^{T} \frac{\partial E_t}{\partial w} \qquad (2.14)$$

For each time step $t$, the gradient obtained using the backpropagation algorithm is given by:

$$\frac{\partial E_t}{\partial w} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial o_t} \frac{\partial o_t}{\partial s_t} \frac{\partial s_t}{\partial s_k} \frac{\partial s_k}{\partial w} \qquad (2.15)$$

Since

$$s_t = f(U x_t + W s_{t-1}) \qquad (2.16)$$

We need to use the chain rule again in order to get the gradients:

$$\frac{\partial s_t}{\partial s_k} = \prod_{i=k+1}^{t} \frac{\partial s_i}{\partial s_{i-1}} \qquad (2.17)$$

An illustration of the backward pass of the gradients from $t = 3$ through the network all the way to $t = 0$ is depicted in figure 2.9.

Figure 2.9: Backward pass for E3. Source: WILDML


2.3.7 Bidirectional recurrent neural networks

In a vanilla RNN 2.3.5, the current state does not depend on future input information. However, in some time series, both past samples and future samples are related to the current samples. Bidirectional recurrent neural networks (BRNN) [12] can exploit such time series dependencies, as the output at time $t$ may depend not only on the previous elements in the sequence, but also on future elements. Bidirectional RNNs are composed of two RNNs: one is trained on the original signal, i.e., the positive time direction (forward states), and the other is trained on the negative time direction (backward states), that is, the signal reversed in time. The output is then computed based on the hidden states of both RNNs. With this structure, the output layer can obtain information from past and future states. The general structure of the BRNN is depicted in figure 2.10.

Figure 2.10: Bidirectional RNN. Source: Stanford cs224d

We can transcribe the relation between the inputs and the outputs of the BRNN mathematically:

$$\overrightarrow{s}_t = f(\overrightarrow{U} x_t + \overrightarrow{W}\, \overrightarrow{s}_{t-1})$$
$$\overleftarrow{s}_t = f(\overleftarrow{U} x_t + \overleftarrow{W}\, \overleftarrow{s}_{t+1})$$
$$o_t = g(V s_t) = g(V[\overrightarrow{s}_t; \overleftarrow{s}_t])$$

where the right arrow denotes the network parameters related to the forward states and the left arrow denotes the network parameters related to the backward states. For a BRNN, $s_t$ represents both the past and the future around a single sample.
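A minimal sketch of the two passes and the concatenation (illustrative NumPy code with toy dimensions; the names are our own):

```python
import numpy as np

def brnn_forward(x_seq, Uf, Wf, Ub, Wb, V):
    """Bidirectional RNN: run one RNN in the positive time direction, one
    in the negative direction, then concatenate the two hidden states."""
    T, d_h = len(x_seq), Wf.shape[0]
    fwd, bwd = np.zeros((T, d_h)), np.zeros((T, d_h))
    s = np.zeros(d_h)
    for t in range(T):                         # forward states
        s = np.tanh(Uf @ x_seq[t] + Wf @ s)
        fwd[t] = s
    s = np.zeros(d_h)
    for t in reversed(range(T)):               # backward states
        s = np.tanh(Ub @ x_seq[t] + Wb @ s)
        bwd[t] = s
    return np.array([V @ np.concatenate([fwd[t], bwd[t]]) for t in range(T)])

rng = np.random.default_rng(1)
d_in, d_h, d_out, T = 2, 3, 1, 6
o = brnn_forward(rng.standard_normal((T, d_in)),
                 rng.standard_normal((d_h, d_in)), rng.standard_normal((d_h, d_h)),
                 rng.standard_normal((d_h, d_in)), rng.standard_normal((d_h, d_h)),
                 rng.standard_normal((d_out, 2 * d_h)))
```

The output at each step thus sees the whole sequence: the forward state summarizes the past, the backward state the future.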

2.3.8 Stacked recurrent neural networks

An RNN can also be made deeper by feeding each lower layer's output, which is a sequential representation of the layer's inputs, into the next layer. Deep bidirectional RNNs (DBRNN) are similar to bidirectional RNNs, only now we have multiple layers per time step. In practice, this provides us with a higher learning capacity. The DBRNN architecture is depicted in figure 2.11. We stick to our previous notation, where the right arrow represents the network parameters related to the forward states and the left arrow represents the network parameters related to the backward states.

Figure 2.11: Deep bidirectional RNN with three RNN layers. Source: Stanford cs224d

In this deep architecture, we will also denote the depth of the layer with index $i$ and the time step with index $t$. At time step $t$, each intermediate neuron in level $i$ receives:

1. The network state from the previous time step in the same layer: $\overrightarrow{s}^{\,i}_{t-1}$ for the forward network or $\overleftarrow{s}^{\,i}_{t+1}$ for the backward network.

2. The output of the RNN from the previous layer, layer $i-1$, at the same time step $t$: $s^{i-1}_t$, which is the concatenation of the forward and backward hidden states of the previous RNN layer.

Using the above notations, we can transcribe the relation between the inputs and the outputs of the deep BRNN mathematically:

$$\overrightarrow{s}^{\,i}_t = f(\overrightarrow{U}^i s^{i-1}_t + \overrightarrow{W}^i \overrightarrow{s}^{\,i}_{t-1})$$
$$\overleftarrow{s}^{\,i}_t = f(\overleftarrow{U}^i s^{i-1}_t + \overleftarrow{W}^i \overleftarrow{s}^{\,i}_{t+1})$$
$$o_t = s^{i+1}_t = g(V s^i_t) = g(V[\overrightarrow{s}^{\,i}_t; \overleftarrow{s}^{\,i}_t])$$

2.3.9 The vanishing gradient problem

In this part we will address the vanishing gradient problem [13] of RNNs. This will be the motivation for the LSTM architecture that we present next. We will derive the problem for the vanilla RNN, although a very similar derivation can be applied to the stacked BRNN. In 2.3.6 we computed the gradient of the RNN using BPTT and obtained the following result for the gradient of the error of the output at time step $t$:

$$\frac{\partial E_t}{\partial w} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial o_t} \frac{\partial o_t}{\partial s_t} \frac{\partial s_t}{\partial s_k} \frac{\partial s_k}{\partial w} \qquad (2.18)$$

Since

$$s_t = f(U x_t + W s_{t-1}) \qquad (2.19)$$

We need to use the chain rule again in order to get the gradients:

$$\frac{\partial s_t}{\partial s_k} = \prod_{i=k+1}^{t} \frac{\partial s_i}{\partial s_{i-1}} \qquad (2.20)$$

$$\frac{\partial E_t}{\partial w} = \sum_{k=1}^{t} \frac{\partial E_t}{\partial o_t} \frac{\partial o_t}{\partial s_t} \left( \prod_{i=k+1}^{t} \frac{\partial s_i}{\partial s_{i-1}} \right) \frac{\partial s_k}{\partial w} \qquad (2.21)$$

We will further develop this term using the explicit expression for the gradient:

$$\frac{\partial s_i}{\partial s_{i-1}} = \frac{\partial f(U x_i + W s_{i-1})}{\partial s_{i-1}} = W^T J_f \qquad (2.22)$$

where $J_f$ is the Jacobian matrix of $f$ w.r.t. its input vector. We can now use the Cauchy–Schwarz inequality for the Frobenius norms of the matrices $W$ and $J_f$ and get:

$$\left\| \frac{\partial s_i}{\partial s_{i-1}} \right\| \le \|W^T\| \, \|J_f\| \qquad (2.23)$$

Hence we can derive an upper bound on the norm of $\frac{\partial s_t}{\partial s_k}$:

$$\left\| \frac{\partial s_t}{\partial s_k} \right\| = \left\| \prod_{i=k+1}^{t} \frac{\partial s_i}{\partial s_{i-1}} \right\| \le \|W^T\|^{t-k} \|J_f\|^{t-k} \qquad (2.24)$$

In the above equation, $t-k$ relates to the error contribution of input samples from $k$ time steps back. If $t-k$ is big, the exponential term $\|W^T\|^{t-k}\|J_f\|^{t-k}$ can become a very small or a very large number, depending on whether the value of $\|W^T\| \|J_f\|$ is bigger or smaller than 1. If the value is smaller than 1, the gradient goes to zero, and thus the contribution of faraway inputs to predicting the output at time step $t$ cannot be accounted for. This problem is known as the vanishing gradient problem. To address the problem of vanishing gradients, we next introduce the LSTM.
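A short numeric illustration of the bound in (2.24), with a toy recurrent matrix chosen by us so that $\|W\| < 1$:

```python
import numpy as np

# With tanh, ||J_f|| <= 1, so the bound reduces to ||W^T||^(t-k).
# Take a toy recurrent matrix with spectral norm exactly 0.5:
W = 0.5 * np.eye(8)
w_norm = np.linalg.norm(W, 2)              # spectral norm = 0.5
bound = {k: w_norm ** k for k in (1, 10, 50)}
print(bound)
```

After 50 time steps the bound is on the order of $10^{-16}$: the contribution of inputs that far back is numerically invisible, which is the vanishing gradient problem in miniature.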

2.3.10 LSTM

The LSTM network [4] is a very popular RNN architecture that reduces the effect of the vanishing gradient problem. By reducing the effect of the long-term dependency problem (the vanishing gradient problem), the LSTM can take into account information over long periods of time and can achieve excellent performance on general sequence-to-sequence learning problems [14]. In this section we will introduce the LSTM model and explain how it avoids the vanishing gradient problem. Just like any other RNN, the LSTM can also be "unrolled" in time to the form of a recurrent chain of repeating modules. However, in an LSTM RNN, instead of having a single neural network layer in the repeating module, there are four. The LSTM RNN internal block diagram can be seen in figure 2.12. Notice that in figure 2.12 the hidden state is denoted as $h$ instead of $s$; both are conventional notations for the hidden state.

Figure 2.12: Repeating module in LSTM. Source: Chris Olah's blog: Understanding LSTM Networks

In diagram 2.12, each line carries an entire vector, from the output of one node to the inputs of others. The pink circles represent element-wise operations, like vector addition, while the yellow boxes are learned neural network layers. Lines merging denote concatenation, while a line forking denotes its content being copied and the copies going to different locations. In our work, we will use the LSTM version from [15], with the following update equations [15]:


$$i_t = \sigma(U^i x_t + W^i s_{t-1})$$
$$f_t = \sigma(U^f x_t + W^f s_{t-1})$$
$$o_t = \sigma(U^o x_t + W^o s_{t-1})$$
$$g_t = \tanh(U^g x_t + W^g s_{t-1})$$
$$c_t = f_t \odot c_{t-1} + i_t \odot g_t$$
$$s_t = o_t \odot \tanh(c_t)$$

where $\odot$ represents the element-wise product, known as the Hadamard product, and $\sigma$ is the logistic sigmoid function. Recall that the sigmoid function squashes the values of its input vector to between 0 and 1, and multiplying them element-wise with another vector defines how much of that vector will go through. $U^i, W^i, U^f, W^f, U^o, W^o, U^g, W^g$ are the network parameters, or the weight matrices, and $x_t$ is the input feature vector.

• $c_t$ – the LSTM cell state or memory cell. This vector acts as the internal memory of the unit and allows the states and the gradients to flow along it from time step to time step. The LSTM has the ability to read, write or remove information from the cell state (figure 2.13). Those operations are controlled by the $i$, $f$ and $g$ gates that we describe next.

• $i_t$ – input gate, handles the writing of data into the memory cell.

• $f_t$ – forget gate, handles the maintaining and modification of the data stored in the memory cell.

• $o_t$ – output gate, handles the sending of data from the memory cell back onto the LSTM.

• $g_t$ – has the same functionality as the hidden state of a vanilla RNN, if we take the activation function of the vanilla RNN to be tanh. However, instead of passing $g$ as the new hidden state, as is done in the RNN, the input gate $i$ selects how much of that hidden state will go through.

• $s_t$ – the output hidden state. Given the memory state $c_t$, we compute $s_t$ by multiplying the memory with the output gate $o_t$.
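The six update equations map directly to code; a single LSTM step can be sketched as follows (illustrative NumPy with toy sizes; the names are our own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, s_prev, c_prev, p):
    """One LSTM time step following the update equations above;
    p maps names to the weight matrices U^i, W^i, ..., U^g, W^g."""
    i = sigmoid(p["Ui"] @ x + p["Wi"] @ s_prev)   # input gate
    f = sigmoid(p["Uf"] @ x + p["Wf"] @ s_prev)   # forget gate
    o = sigmoid(p["Uo"] @ x + p["Wo"] @ s_prev)   # output gate
    g = np.tanh(p["Ug"] @ x + p["Wg"] @ s_prev)   # candidate update
    c = f * c_prev + i * g         # cell state: linear in c_{t-1}
    s = o * np.tanh(c)             # new hidden state
    return s, c

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
p = {name: 0.5 * rng.standard_normal((d_h, d_in if name.startswith("U") else d_h))
     for name in ("Ui", "Wi", "Uf", "Wf", "Uo", "Wo", "Ug", "Wg")}
s, c = lstm_step(rng.standard_normal(d_in), np.zeros(d_h), np.zeros(d_h), p)
```

Note that the line computing `c` contains no activation function around `c_prev`, which is the property exploited in the discussion below.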

We now return to the reason that enables the LSTM to handle the vanishing gradient problem. The LSTM architecture addresses the vanishing gradient problem by introducing a memory cell $c_t$ that is able to preserve state over long periods of time. A key observation is that the cell state $c_t$ is a linear function of $c_{t-1}$, with no activation function connecting the two.

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t \qquad (2.25)$$

As computed in the previous section for the vanilla RNN, when computing the gradient by applying the chain rule, we multiply by the derivative of $c_t$ w.r.t. $c_{t-1}$. We can think of the forget gate $f$ as the weights of the previous cell state $c_{t-1}$. When we differentiate the cell state w.r.t. the cell state at a previous time step, we multiply by a different $f_t$ at each time step, and not by a constant matrix as in the RNN case. We also avoid multiplying by tanh at each time step, and thus we reduce the vanishing gradient effect.

Figure 2.13: Interaction between the cell state $c$ and access gates $i, f, g$.

2.3.11 Deep bidirectional LSTM

Bidirectional LSTM (BLSTM) [16] combines the advantages of the BRNN with the advantages of the LSTM. This enables the network to exploit long-range context dependencies in the past as well as in future time steps when performing sequential modeling. The BLSTM can also be made deeper to form the deep BLSTM (DBLSTM) [17]. The DBLSTM combines the advantages of the BLSTM with the advantages of deep neural networks. This enables the network to exploit multiple levels of representation as well as long-range context in both input directions. The DBLSTM architecture can be described as follows: the input is fed into both the forward and the backward LSTM, which together form a layer. The output layer, or the next hidden layer (consisting of forward and backward LSTM layers as well), receives as input the concatenation of the outputs of the forward and backward LSTMs from the previous layer. There are no hidden-to-hidden connections between forward and backward layers. We will use a very similar architecture in our demodulator.


Chapter 3

Problem Formulation and Related Work

In this chapter we will describe the problem, present the motivation that led us to choose our solution and survey relevant works. We will start by surveying works on speech enhancement and on general sequence-to-sequence modeling with deep learning. Later on, we will discuss relevant works on machine learning and deep learning applied to problems from the communication field.

3.1 Problem formulation and motivation

Traditionally, radio transmission decoding and speech enhancement are considered two separate problems. However, optimal signal estimation algorithms are usually constructed on the basis of statistical measurement process properties and a prior model of the reconstructed signals. This is often a difficult problem with no analytic solution, which can only be approximately solved based on simplistic noise and signal models. On the other hand, any signal estimation can be considered as a non-linear mapping from input data to the desired output. Having a universal function approximation tool in hand, we can learn such a mapping using a set of training examples: pairs of input modulated baseband signals and the desired audio output signals. Natural speech is composed of features on several timescales, generated by the anatomic processes that control sound production. A typical segment of speech can be decomposed into sentences or words, which have a typical time scale of one second. On a smaller time scale, words can be decomposed into phonemes, which are units of sound that distinguish one word from another. Phonemes usually last less than $10^{-1}$ seconds. We can observe even smaller time scales, such as pitch ($10^{-2}$ s) and formants ($10^{-3}$ s). For an optimal reconstruction to take place, all these timescales need to be accounted for in the reconstruction task. While neural networks have previously been applied to the tasks of radio demodulation and speech enhancement separately, to the best of our knowledge, none has addressed the task of radio transmission decoding with the prior information of transmitted speech messages. In this sense, our work is entirely novel, as our network exploits the prior knowledge of the speech signal to overcome both acoustical disturbances and noise in the communication channel.

3.2 Related work

Apart from the traditional methods of radio transmission decoding and minimum mean square error (MMSE) based speech enhancement techniques, several neural network based methods have been proposed for each of the two problems separately. For example, for radio transmission decoding and channel noise estimation, [18], [19] and recently [20]. However, these works deal with digital communication, for which bit-streams are mapped to symbols. Moreover, traditionally the symbols are pre-coded and scrambled before being transmitted. Therefore, effectively, the coded data stream is uncorrelated from time-sample to time-sample [21], and use of the prior speech data to overcome the noise in the transmission channel is not possible. As for the analog communication problem, the fact that the modulating input is proportional only to the instantaneous frequency of the received FM signal has driven the development of FM demodulators to rely on very short time frame processing in order to extract the modulating signal, hence disregarding the long-range dependencies that are present in the transmitted speech. For example, [22] suggested a neural network based solution for the analog FM problem. However, the approach taken was to imitate the way a conventional FM demodulator works by implementing a different neural network for each building block separately. It used memoryless (or very short memory) feed-forward neural networks with only one input at some intermediate blocks. Therefore, it did not take into account the prior knowledge of the transmitted speech. Moreover, demodulation was performed directly on the high-frequency passband signal, which resulted in a very high sampling rate at the neural network input. This high rate was needed in order to detect the changes in the input, thus resulting in many samples, most of which are redundant, and a very large network for actual sampling rates.
Such a large network is very difficult to train and is not suitable for practical use. As opposed to the demodulation problem, in the speech processing domain, prior speech structure forms the basis for both traditional algorithms and neural network based algorithms. Consequently, in the field of speech modeling and speech enhancement, the BLSTM is used in many real-world sequence processing problems, such as phoneme classification [16], speech recognition [23] and speech synthesis [24]. For the problem of speech enhancement, several neural network based solutions were suggested and achieved good performance, for example [25], [26], and in [27] an LSTM based model was proposed. A suggestion to take the prior speech into account in the demodulation process was made in [28], [29], where it was proposed to use a demodulation process that can be shaped by user-specific prior information. In these works, demodulation is viewed as a Bayesian inference problem and an a priori assumption on the signal statistics is made. However, in this formulation, the carrier frequency is unknown and is estimated. For this reason, the approach is not directly comparable to the one taken here. A similar approach is proposed in our work. However, we assume the carrier is known, which is a reasonable assumption for radio transmission demodulation. By making this assumption, the demodulation can be framed as a time series problem that can be solved using a neural network framework.


Chapter 4

Neural Network Demodulator

This section presents our FM detector, a detector that introduces a new perspective on FM demodulation. In this perspective, demodulation is framed as a time series problem that can be solved using a deep learning framework. This new point of view guided us in the selection of the detector architecture: a stacked BLSTM DNN. We start by describing the architecture and then describe the training procedure.

4.1 Architecture

The demodulator is composed of two main building blocks. The first block is an analog or digital front end. This block, commonly used in conventional SDR, takes in the modulated signal and the carrier frequency and outputs the I/Q components. The second block is based on machine learning concepts for time series prediction. This block takes in the noisy I/Q samples from the previous block and reconstructs the speech. In this section we will give a full description of each block and describe its functionality.

4.1.1 Front end

In order to avoid manipulating the FM passband signal directly, the conversion from passband to baseband frequency is performed by a digital or analog hardware front end block. Here, we will give the motivation for and description of this front end block.

Motivation: The FM signal is composed of a high-frequency carrier and a low-frequency modulating signal. As a result, FM has a bandpass spectrum centered around the carrier frequency $f_c$. As seen in 2.1.2, we can represent such a signal with its I/Q components, which are band-limited low-pass signals. Since the carrier frequency is much higher than the bandwidth of the I/Q components, converting the high-frequency signal to a baseband signal enables more convenient processing at a lower sampling rate than the original carrier frequency. This alleviates the computational demands of the demodulator (either standard or DNN based).


Block description: In 2.1.2 we presented the FM passband signal

$$y(t) = A_c \cos\left(2\pi f_c t + 2\pi f_\Delta \int_0^t x_m(\tau)\, d\tau\right)$$

with its I/Q components in the following manner:

$$y(t) = I(t)\cos(2\pi f_c t) - Q(t)\sin(2\pi f_c t)$$

Focusing on the top branch of figure 4.1, by multiplying $y(t)$ with $2\cos(2\pi f_c t)$ and using the following trigonometric relations:

$$\cos(\alpha)\cos(\beta) = \frac{\cos(\alpha-\beta) + \cos(\alpha+\beta)}{2}$$
$$\cos(\alpha)\sin(\beta) = \frac{\sin(\beta-\alpha) + \sin(\alpha+\beta)}{2}$$

We get:

$$2\cos(2\pi f_c t)\left[I(t)\cos(2\pi f_c t) - Q(t)\sin(2\pi f_c t)\right] = I(t)\left[1 + \cos(4\pi f_c t)\right] - Q(t)\sin(4\pi f_c t)$$

$I(t)$ and $Q(t)$ are band-limited low-pass signals and their support in the frequency domain is much smaller than $f_c$; thus, after applying the low-pass filters (LPF), the high-frequency components are removed and we are left with $I(t)$. In a similar way, after multiplication with $-2\sin(2\pi f_c t)$ and applying an LPF, we are left with the $Q(t)$ component. A scheme of such an analog front end block is presented in figure 4.1.
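The two mixing branches and low-pass filters can be sketched numerically (an illustrative toy: the carrier, rates, tone frequencies and filter parameters are ours, not the broadcast values):

```python
import numpy as np

fs, fc = 48_000.0, 6_000.0             # toy sampling rate and carrier
t = np.arange(0, 0.02, 1 / fs)         # 20 ms of signal
I_true = np.cos(2 * np.pi * 300 * t)   # slowly varying I component
Q_true = np.sin(2 * np.pi * 200 * t)   # slowly varying Q component
y = I_true * np.cos(2 * np.pi * fc * t) - Q_true * np.sin(2 * np.pi * fc * t)

def lowpass(x, taps=301, cutoff=2_000.0):
    """Windowed-sinc FIR low-pass filter with unit DC gain."""
    n = np.arange(taps) - (taps - 1) / 2
    h = np.sinc(2 * cutoff / fs * n) * np.hamming(taps)
    h /= h.sum()
    return np.convolve(x, h, mode="same")

# top branch: mix with 2cos, LPF -> I(t); bottom: mix with -2sin, LPF -> Q(t)
I_hat = lowpass(2 * y * np.cos(2 * np.pi * fc * t))
Q_hat = lowpass(-2 * y * np.sin(2 * np.pi * fc * t))
```

Away from the filter edges, `I_hat` and `Q_hat` track `I_true` and `Q_true` closely: the $4\pi f_c t$ terms fall in the filter's stopband and are suppressed.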

4.1.2 Neural network block

This block is based on deep learning for time series prediction. The key idea that forms the basis of the architecture of this block stems from the observation that there is a direct mapping between the I/Q component samples and the underlying speech source. This block aims to utilize this connection in order to perform joint demodulation and de-noising of the speech signal.

Figure 4.1: Block for extracting the I and Q components

Motivation: The first consideration we made when choosing the detector architecture was utilizing the temporal dynamics of the underlying generating process. For this purpose, we chose to use the LSTM network as the basis of this block. By doing so, we utilize the abilities of the LSTM network to capture the temporal dynamics of speech signals. This allows us to efficiently estimate the speech source signal from noisy frequency-modulated measurements. As dictated by the underlying generating speech, future samples are also related to current samples. To exploit this dependency, we make the system slightly non-causal by inducing a small delay of 100 samples, about two milliseconds at an audio sampling rate of 48 kHz. This delay is almost imperceptible to the human ear; however, it enables us to use bidirectional LSTMs [12], which are trained using past input information as well as future information. For combining multiple representation levels of the modulated speech signal and giving the model more expressive power, we used a deep architecture. A deep LSTM is created by stacking multiple LSTM layers on top of each other, with the output sequence of one layer forming the input sequence for the next. The stacking of multiple recurrent hidden layers has proven to give state-of-the-art performance for acoustic modeling [17], [30]. For the above stated reasons, we decided to adopt a deep bidirectional LSTM architecture based on the architecture proposed in [17]. For regularization, we added a dropout layer [31]. We unrolled the network to a length of 100 time steps, using backpropagation through time [32] in the training phase.

Block description: In order to support high quality audio transmissions broadcast, FM sta-tions use large values of frequency deviation. The FM broadcast standards in the United Statesspecify a value of 75 kHz of peak deviation and 240 kHz sampling frequency of the output signal.The default value of the modulating audio signal is 48 kHz.For the above reasons, the training set was generated using Matlab FM modulation [33] withthe above stated standard specifications.The above system constraints dictate the number of baseband samples the modulator producesfor each audio sample on its input (five in-phase and five quadrature).We used bidirectional stacked LSTM architecture with two hidden layers. The first bidirectionalLSTM layer has an input size of 10 samples (five in-phase and five quadrature samples) and anoutput of 100 samples for each of the forward and backward LSTM cells. The output from thefirst layer is regularized with dropout (dropout probability of 0.2) and then fed into a secondbidirectional LSTM layer. The second LSTM has an output size of 200 samples for each of the


forward and backward LSTM cells. The output of this layer is then fed into a linear layer with one output.
A detailed diagram of the decoder block is shown in figure 4.2.

Figure 4.2: Neural network decoder block

4.1.3 Full system

The full demodulation process is performed as follows: the noisy FM signal and the carrier frequency are fed into the front-end block.
The front-end block outputs are fed into the neural network decoder, which outputs the audio


samples. A diagram of the full system is presented in figure 4.3.

Figure 4.3: Full demodulation system

4.2 Dataset and training procedure

The audio waveforms used in our experiments were downloaded from the TIMIT Acoustic-Phonetic Continuous Speech Corpus [34]. The TIMIT corpus includes a 16-bit, 16 kHz speech waveform file for each utterance. In our experiments we used male speakers. The speech material in the TIMIT corpus is subdivided into training and testing portions, and we used this subdivision in our experiments.
For compatibility with the standard United States specifications described above, the audio waveforms were up-sampled to 48 kHz. As input to the neural network we used two features: the samples of the in-phase and quadrature components of the baseband signal. The targets of the neural network are the clean audio waveforms.
The entire system was optimized with an MSE loss function and the RMSProp [8] optimization method.
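As a sketch of the optimization setup, the snippet below implements the MSE objective and a single RMSProp [8] update on one scalar parameter. The learning rate, decay factor, and numeric values are hypothetical; pure Python stands in for the actual TensorFlow implementation.

```python
def mse(pred, target):
    # Mean squared error between predicted and target waveform samples.
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def rmsprop_step(w, grad, cache, lr=0.001, decay=0.9, eps=1e-8):
    # Keep a running average of squared gradients, then scale the step
    # by its square root, as in the RMSProp lecture notes [8].
    cache = decay * cache + (1.0 - decay) * grad ** 2
    w = w - lr * grad / (cache ** 0.5 + eps)
    return w, cache

loss = mse([0.1, 0.2], [0.0, 0.0])          # toy prediction vs. clean target
w, cache = rmsprop_step(1.0, grad=2.0, cache=0.0)
```

In the real system the gradient of the MSE loss is obtained by backpropagation through time over the unrolled LSTM, and RMSProp updates every weight, not a single scalar.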


Chapter 5

Experimental Results

Here we present how we measured the performance of our SDR, and demonstrate simulation results validating its superior performance over traditional FM receivers.

5.1 Experiments

We compared the speech reconstruction quality of the proposed neural network demodulator against that of a conventional demodulator. For fairness, we applied an additional speech enhancement algorithm to the output of the conventional demodulator. The conventional demodulator implementation is from the Matlab communication toolbox and is based on [33]. The additional speech enhancement algorithm is based on OM-LSA [35].
In both cases, the DNN and the conventional demodulator, the modulated signal sample rate and the frequency deviation are set to the United States standard values. In order to boost the performance of the conventional FM receiver we used Matlab FM broadcasters: FM broadcasters insert a pre-emphasis filter prior to FM modulation, and the FM receiver has a reciprocal de-emphasis filter after the FM demodulator to attenuate high-frequency noise and restore a flat signal spectrum.
In order to evaluate and compare the speech reconstruction quality of the demodulators we used SNR, segmental SNR, and PESQ (section 2.2.2).
The speech reconstruction quality was tested in the presence of various levels of additive white Gaussian noise (AWGN). The noise corrupted both the amplitude and the phase of the transmitted signal.
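The global and segmental SNR measures can be sketched as follows. This is a simplified version: segment length, windowing, and clamping conventions vary between implementations, and the exact variant used in the experiments is not restated here.

```python
import math

def snr_db(clean, estimate):
    # Global SNR: ratio of clean-signal energy to reconstruction-error energy.
    sig = sum(c ** 2 for c in clean)
    err = sum((c - e) ** 2 for c, e in zip(clean, estimate))
    return 10.0 * math.log10(sig / err)

def segmental_snr_db(clean, estimate, seg_len=4):
    # Segmental SNR: average the per-segment SNR over short frames,
    # which tracks perceived quality better than a single global ratio.
    vals = [snr_db(clean[i:i + seg_len], estimate[i:i + seg_len])
            for i in range(0, len(clean) - seg_len + 1, seg_len)]
    return sum(vals) / len(vals)

clean = [1.0, -1.0, 1.0, -1.0]
noisy = [1.1, -0.9, 1.1, -0.9]   # constant 0.1 error on each sample
g = snr_db(clean, noisy)          # 20 dB: error energy is 1% of signal energy
```

PESQ, in contrast, is a standardized psychoacoustic model and cannot be reduced to a few lines.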

5.2 Results

The following figures present the experimental results comparing the proposed neural network demodulator and the conventional demodulator, with and without the additional speech enhancement algorithm.
Figure 5.1, figure 5.2 and figure 5.3 show the speech reconstruction SNR, segmental SNR, and PESQ score, respectively, for various levels of AWGN amplitude noise.


Figure 5.1: Speech reconstruction SNR, for various levels of amplitude noise. Conventional (with and without additional speech enhancement) vs. DNN-based demodulator.

Figure 5.4, figure 5.5 and figure 5.6 show the speech reconstruction SNR, segmental SNR, and PESQ score, respectively, for various levels of AWGN phase noise.

The main goal of our receiver is to perform audio reconstruction in low SNR conditions. However, for completeness, we also present the noise-free case, i.e., neither phase nor amplitude noise was added to the modulated signal. For the noise-free case we obtain an output SNR of 36.56 dB. These results indicate good reconstruction. Figure 5.7 shows the spectrogram of the original audio and the spectrogram of the LSTM demodulator reconstruction for the noise-free case.

The conducted experiments show that the proposed receiver has a clear advantage over the conventional receiver under both amplitude and phase noise conditions. This is mostly due to the fact that the proposed neural network demodulator takes advantage of the statistics of the generating speech signal. We demonstrate this point by limiting the memory of the network to only one time step, as in theory the FM signal can be mapped back to audio with almost no memory. However, under low SNR conditions of 0 dB amplitude noise, reconstruction was not possible and the demodulation failed. This experiment shows that for quality reconstruction to take place under noise conditions, the statistics of the generating speech signal must indeed be accounted for. Finally, we compared the reconstruction quality in the presence of both amplitude and phase noise. This was done by adding AWGN with an SNR of 0 dB to the modulating speech signal and to the frequency modulated signal separately. The experiment showed that the SNR of the signal reconstructed by the neural network demodulator is 12.32 dB, whereas the conventional demodulator obtained 4.54 dB. Again, there is a clear performance advantage for the LSTM demodulator in this case as well.
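The noise injection used throughout these experiments can be sketched as adding white Gaussian noise scaled to a prescribed SNR. This pure-Python version with a fixed seed is illustrative only and is not the Matlab routine used in the thesis.

```python
import math
import random

def add_awgn(signal, snr_db, seed=0):
    """Corrupt `signal` with AWGN at the prescribed SNR in dB."""
    rng = random.Random(seed)  # fixed seed for reproducibility of this sketch
    power = sum(s ** 2 for s in signal) / len(signal)
    # At 0 dB SNR the noise power equals the signal power.
    noise_power = power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in signal]

signal = [1.0, -1.0] * 100
noisy = add_awgn(signal, snr_db=0.0)
```

To corrupt both amplitude and phase as above, this routine would be applied separately to the modulating speech signal and to the frequency modulated signal.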


Figure 5.2: Speech reconstruction segmental SNR, for various levels of amplitude noise. Conventional (with and without additional speech enhancement) vs. DNN-based demodulator.

Figure 5.3: Speech reconstruction PESQ score, for various levels of amplitude noise. Conventional (with and without additional speech enhancement) vs. DNN-based demodulator.


Figure 5.4: Speech reconstruction SNR, for various levels of phase noise. Conventional (with and without additional speech enhancement) vs. DNN-based demodulator.

Figure 5.5: Speech reconstruction segmental SNR, for various levels of phase noise. Conventional (with and without additional speech enhancement) vs. DNN-based demodulator.


Figure 5.6: Speech reconstruction PESQ score, for various levels of phase noise. Conventional (with and without additional speech enhancement) vs. DNN-based demodulator.

[Two spectrogram panels, "Transmitted audio" and "DNN reconstruction": time 0–2 s on the horizontal axis, frequency 0–8000 Hz on the vertical axis.]

Figure 5.7: Spectrogram of the original audio signal and DNN demodulator reconstruction.


Chapter 6

Conclusions and Future Work

6.1 Conclusions

This work has introduced a new perspective on coherent FM demodulation, viewing it as a time series problem that can be solved within a neural network framework. This perspective led to a new approach to decoding FM transmissions of audio speech signals, based on stacked bidirectional LSTMs. In this approach we utilize the statistics of the information message, more specifically the long and short time-scale temporal structure of speech. As a result, the proposed receiver has a clear advantage over the conventional receiver: it yields much higher reconstruction quality and can overcome both distortions in the information message and distortions in the transmission channel. Moreover, we believe that the proposed detector can achieve even better results if trained on user-specific priors.
Though the proposed detector is computationally intensive compared to existing approaches, it can be implemented given sufficient computation power. Such computation power has become practical with the appearance of powerful graphics processing units (GPUs) and corresponding software, which enables the proposed receiver to be used as an extremely robust radio receiver.

6.2 Future work

In statistics, the MSE and the Peak Signal-to-Noise Ratio (PSNR) of an estimator are widely used objective measures and are good distortion indicators (loss functions) between the estimator's output and the quantity we want to estimate. These loss functions are used for many reconstruction tasks. However, PSNR and MSE may not correlate well with reliable subjective methods such as the Mean Opinion Score (MOS) obtained from expert listeners. A more suitable speech quality assessment can be achieved by using tests that aim for high correlation with MOS tests, such as PESQ [6] or POLQA [36]. However, those algorithms are hard to represent as a differentiable function such as MSE.
We propose training a neural network model that takes as input the clean and the degraded audio. The neural network is trained in a supervised way to predict the PESQ score of those two signals. This can be achieved by training the network, with an MSE loss function, on the results obtained from the full-reference PESQ algorithm; i.e., we train


the model to predict the full-reference PESQ score.
After learning the PESQ mapping, we can use it as a differentiable loss function in order to enhance speech; this way we can minimize:

P(x_clean, x_degraded) + λ · MSE(x_clean, x_degraded)

where x_clean is the clean audio, x_degraded is the degraded audio, λ is a number in [0, 1], and P is the differentiable loss function which was trained to learn the PESQ mapping. A diagram depicting the system for training and utilizing the PESQ loss is shown in figure 6.1.

Figure 6.1: Training perceptual loss function
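A minimal sketch of the proposed combined objective. Here `perceptual` is a dummy placeholder for the learned PESQ predictor P; the real P would be a trained neural network, not the toy function below.

```python
def mse(x, y):
    # Standard mean squared error term.
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

def perceptual(x, y):
    # Stand-in for the trained, differentiable PESQ predictor P.
    # A real P maps (clean, degraded) audio to a predicted PESQ score.
    return abs(sum(x) / len(x) - sum(y) / len(y))

def combined_loss(x_clean, x_degraded, lam=0.5):
    # The objective proposed above: P(clean, degraded) + lambda * MSE.
    return perceptual(x_clean, x_degraded) + lam * mse(x_clean, x_degraded)

loss = combined_loss([1.0, 0.0], [0.5, 0.5], lam=0.5)
```

The λ weight trades off the perceptual term against the sample-wise MSE term; its value would have to be tuned experimentally.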


Appendix A

Demodulator Software

The code of the neural network demodulator was implemented with TensorFlow [37].

A.1 TensorFlow

TensorFlow is an open source software library for machine learning in various kinds of perceptual and language understanding tasks. TensorFlow can run on multiple CPUs and GPUs (with optional CUDA extensions for general-purpose computing on graphics processing units). TensorFlow computations are expressed as stateful dataflow graphs. This library of algorithms originated from Google's need to instruct computer systems, known as neural networks, to learn and reason similarly to how humans do, so that new applications able to assume roles and functions previously reserved for capable humans can be derived. The name TensorFlow itself derives from the operations which such neural networks perform on multidimensional data arrays. These multidimensional arrays are referred to as "tensors", although this concept is not identical to the mathematical concept of tensors. Its purpose is to train neural networks to detect and decipher patterns and correlations.1 The general scheme for training a model with TensorFlow includes the following steps:

1. Initialize all model variables for the first time.

2. Feed in the training data.

3. Execute the inference model on the training data, so that it calculates the output with the current model parameters for each training input example.

4. Compute the cost.

5. Adjust the model parameters to minimize the cost.
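The five steps above can be sketched as a minimal training loop. Here a single parameter w of the toy model y = w·x is fitted by gradient descent in pure Python; TensorFlow would express the same loop as operations on a dataflow graph, with the gradient computed automatically.

```python
# Toy training data consistent with the target model y = 2 * x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = 0.0                                  # 1. initialize model variables
for _ in range(200):
    for x, y in data:                    # 2. feed in the training data
        pred = w * x                     # 3. run the inference model
        cost = (pred - y) ** 2           # 4. compute the cost
        grad = 2.0 * (pred - y) * x      #    gradient of the squared error
        w -= 0.05 * grad                 # 5. adjust parameters to minimize
```

After enough passes over the data, w converges to 2, the slope that generated the targets; the real system replaces the scalar update with RMSProp over all LSTM weights.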

The visualization of the training loop in figure A.1 is from the book [38]:

1From Wikipedia, the free encyclopedia


Figure A.1: Training loop


Bibliography

[1] Jesse Zheng, “Optical frequency-modulated continuous-wave interferometers,” Applied Optics, vol. 45, no. 12, pp. 2723–2730, 2006.

[2] H. N. Al-Sadi, Seismic Exploration: Technique and Processing, Lehrbücher und Monographien aus dem Gebiete der exakten Wissenschaften, Birkhäuser Basel, 2013.

[3] K. Mostov, E. Liptsen, and R. Boutchko, “Medical applications of shortwave FM radar:Remote monitoring of cardiac and respiratory motion,” Medical Physics, vol. 37, no. 3, pp.1332–1338, 2010.

[4] Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[5] T. G. Thomas and S. C. Sekhar, Communication Theory, Tata McGraw-Hill, 2005.

[6] John G. Beerends, Andries P. Hekstra, Antony W. Rix, and Michael P. Hollier, “Perceptual Evaluation of Speech Quality (PESQ), the new ITU standard for end-to-end speech quality assessment, Part II: Psychoacoustic model,” J. Audio Eng. Soc., vol. 50, no. 10, pp. 765–778, 2002.

[7] John Duchi, Elad Hazan, and Yoram Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, pp. 2121–2159, 2011.

[8] Tijmen Tieleman and Geoffrey Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural Networks for Machine Learning, 2012.

[9] J. J. Hopfield, “Neural networks and physical systems with emergent collective computational abilities,” Proceedings of the National Academy of Sciences, vol. 79, no. 8, pp. 2554–2558, 1982.

[10] Michael C. Mozer, “A focused backpropagation algorithm for temporal pattern recognition,” Complex Systems, vol. 3, pp. 349–381, 1989.

[11] Paul J. Werbos, “Generalization of backpropagation with application to a recurrent gas market model,” Neural Networks, vol. 1, no. 4, pp. 339–356, 1988.

[12] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.


[13] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio, “On the difficulty of training recurrent neural networks,” in International Conference on Machine Learning (ICML), 2013.

[14] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, “Sequence to sequence learning with neural networks,” in Advances in Neural Information Processing Systems (NIPS), pp. 3104–3112, 2014.

[15] Felix A. Gers, Nicol N. Schraudolph, and Jürgen Schmidhuber, “Learning precise timing with LSTM recurrent networks,” Journal of Machine Learning Research, vol. 3, no. 1, pp. 115–143, 2002.

[16] Alex Graves and Jürgen Schmidhuber, “Framewise phoneme classification with bidirectional LSTM networks,” in Proceedings of the International Joint Conference on Neural Networks, vol. 4, pp. 2047–2052, 2005.

[17] A. Graves, A. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6645–6649, 2013.

[18] M. Amini and E. Balarastaghi, “Universal neural network demodulator for software defined radio,” International Journal of Machine Learning and Computing, vol. 1, no. 3, 2011.

[19] Mürsel Önder, Aydin Akan, and Hakan Doǧan, “Advanced neural network receiver design to combat multiple channel impairments,” Turkish Journal of Electrical Engineering and Computer Sciences, vol. 24, no. 4, pp. 3066–3077, 2016.

[20] Meng Fan and Lenan Wu, in 2017 International Conference on Communication, Control, Computing and Electronics Engineering (ICCCCEE), 2017.

[21] Gregory W. Wornell, Efficient symbol-spreading strategies for wireless communication, Research Laboratory of Electronics, Massachusetts Institute of Technology, 1994.

[22] K. Rohani and M. T. Manry, “The design of multi-layer perceptrons using building blocks,” 1991.

[23] Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed, “Hybrid speech recognition with deep bidirectional LSTM,” in IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 273–278, 2013.

[24] Yuchen Fan, Yao Qian, Fenglong Xie, and Frank K. Soong, “TTS synthesis with bidirectional LSTM based recurrent neural networks,” in Proceedings of INTERSPEECH, pp. 1964–1968, 2014.

[25] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, “A regression approach to speech enhancement based on deep neural networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7–19, 2015.


[26] Tobias Goehring, Federico Bolner, Jessica J. M. Monaghan, Bas van Dijk, Andrzej Zarowski, and Stefan Bleeck, “Speech enhancement based on neural networks improves speech intelligibility in noise for cochlear implant users,” Hearing Research, vol. 344, pp. 183–194, 2016.

[27] Morten Kolbaek, Zheng-Hua Tan, and Jesper Jensen, “Speech enhancement using long short-term memory based recurrent neural networks for noise robust speaker verification,” in IEEE Workshop on Spoken Language Technology (SLT), pp. 305–311, 2016.

[28] Richard E. Turner and Maneesh Sahani, “Demodulation as probabilistic inference,” IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 8, pp. 2398–2411, 2011.

[29] Richard E. Turner and Maneesh Sahani, “Probabilistic amplitude and frequency demodulation,” Advances in Neural Information Processing Systems, vol. 24, pp. 981–989, 2011.

[30] Xiangang Li and Xihong Wu, “Constructing long short-term memory based deep recurrent neural networks for large vocabulary speech recognition,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4520–4524, 2015.

[31] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals, “Recurrent neural network regularization,” arXiv preprint arXiv:1409.2329, 2014.

[32] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088, pp. 533–536, 1986.

[33] Indranil Hatai and Indrajit Chakrabarti, “A new high-performance digital FM modulator and demodulator for software-defined radio and its FPGA implementation,” International Journal of Reconfigurable Computing, vol. 2011, 2011.

[34] John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, David S. Pallett, and Nancy L. Dahlgren, “DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM,” NASA STI/Recon Technical Report N, pp. 1–94, 1993.

[35] Ariel Hirszhorn, David Dov, Ronen Talmon, and Israel Cohen, “Transient interference suppression in speech signals based on the OM-LSA algorithm,” in 13th International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 4–6, 2012.

[36] Gaoxiong Yi and Wei Zhang, “The perceptual objective listening quality assessment algorithm in telecommunication: Introduction of ITU-T new metrics POLQA,” in 2012 1st IEEE International Conference on Communications in China (ICCC), pp. 351–355, 2012.

[37] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, Xiaoqiang Zheng, and Google


Brain, “TensorFlow: A system for large-scale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’16), pp. 265–284, 2016.

[38] Sam Abrahams, Danijar Hafner, Erik Erwitt, and Ariel Scarpinelli, TensorFlow for Machine Intelligence, Bleeding Edge Press, 2016.
