
Listening to Sounds of Silence for Speech Denoising

Ruilin Xu¹, Rundi Wu¹, Yuko Ishiwaka², Carl Vondrick¹, and Changxi Zheng¹

¹Columbia University, New York, USA   ²SoftBank Group Corp., Tokyo, Japan

Abstract

We introduce a deep learning model for speech denoising, a long-standing challenge in audio analysis arising in numerous applications. Our approach is based on a key observation about human speech: there is often a short pause between each sentence or word. In a recorded speech signal, those pauses introduce a series of time periods during which only noise is present. We leverage these incidental silent intervals to learn a model for automatic speech denoising given only mono-channel audio. Detected silent intervals over time expose not just pure noise but its time-varying features, allowing the model to learn noise dynamics and suppress it from the speech signal. Experiments on multiple datasets confirm the pivotal role of silent interval detection for speech denoising, and our method outperforms several state-of-the-art denoising methods, including those that accept only audio input (like ours) and those that denoise based on audiovisual input (and hence require more information). We also show that our method enjoys excellent generalization properties, such as denoising spoken languages not seen during training.

1 Introduction

Noise is everywhere. When we listen to someone speak, the audio signals we receive are never pure and clean, always contaminated by all kinds of noises—cars passing by, spinning fans in an air conditioner, barking dogs, music from a loudspeaker, and so forth. To a large extent, people in a conversation can effortlessly filter out these noises [42]. In the same vein, numerous applications, ranging from cellular communications to human-robot interaction, rely on speech denoising algorithms as a fundamental building block.

Despite its vital importance, algorithmic speech denoising remains a grand challenge. Provided an input audio signal, speech denoising aims to separate the foreground (speech) signal from its additive background noise. This separation problem is inherently ill-posed. Classic approaches such as spectral subtraction [7, 98, 6, 72, 79] and Wiener filtering [80, 40] conduct audio denoising in the spectral domain, and they are typically restricted to stationary or quasi-stationary noise. In recent years, the advance of deep neural networks has also inspired their use in audio denoising. While outperforming the classic denoising approaches, existing neural-network-based approaches use network structures developed for general audio processing tasks [56, 90, 100] or borrowed from other areas such as computer vision [31, 26, 3, 36, 32] and generative adversarial networks [70, 71]. Nevertheless, beyond reusing well-developed network models as a black box, a fundamental question remains: What natural structures of speech can we leverage to mold network architectures for better performance on speech denoising?

1.1 Key insight: time distribution of silent intervals

Motivated by this question, we revisit one of the most widely used audio denoising methods in practice, namely the spectral subtraction method [7, 98, 6, 72, 79]. Implemented in many commercial software packages such as Adobe Audition [39], this classical method requires the user to specify a time interval during which the foreground signal is absent. We call such an interval a silent interval. A silent interval is a time window that exposes pure noise.



Figure 1: Silent intervals over time. (top) A clean speech signal has many natural pauses. Without any noise, these pauses are exhibited as silent intervals (highlighted in red). (bottom) However, most speech signals are contaminated by noise. Even with mild noise, silent intervals become overwhelmed and hard to detect. If robustly detected, silent intervals can help to reveal the noise profile over time.

The algorithm then learns from the silent interval the noise characteristics, which are in turn used to suppress the additive noise of the entire input signal (through subtraction in the spectral domain).
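To make the classic pipeline concrete, the following is a minimal sketch of spectral subtraction, assuming the user supplies the silent interval as start/end times in seconds; the STFT parameters and the zero-flooring rule are illustrative choices, not values prescribed by any of the cited works.

```python
# A minimal sketch of classic spectral subtraction (illustrative, not the paper's method):
# estimate the noise magnitude spectrum from a user-specified silent interval and
# subtract it from the magnitude spectrum of the whole signal.
import numpy as np
import librosa

def spectral_subtraction(noisy, sr, silent_start, silent_end, n_fft=512, hop=128):
    spec = librosa.stft(noisy, n_fft=n_fft, hop_length=hop)       # complex (F, T)
    mag, phase = np.abs(spec), np.angle(spec)
    f0 = int(silent_start * sr / hop)                             # first frame of the silent interval
    f1 = int(silent_end * sr / hop)                               # last frame of the silent interval
    noise_mag = mag[:, f0:f1].mean(axis=1, keepdims=True)         # average noise magnitude spectrum
    clean_mag = np.maximum(mag - noise_mag, 0.0)                  # subtract and floor at zero
    return librosa.istft(clean_mag * np.exp(1j * phase), hop_length=hop)
```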

Yet, the spectral subtraction method suffers from two major shortcomings: i) it requires user specification of a silent interval, that is, it is not fully automatic; and ii) the single silent interval, although undemanding for the user, is insufficient in the presence of nonstationary noise—for example, background music. Ubiquitous in daily life, nonstationary noise has time-varying spectral features. The single silent interval reveals the noise spectral features only in that particular time span, and is thus inadequate for denoising the entire input signal. The success of spectral subtraction pivots on the concept of silent intervals; so do its shortcomings.

In this paper, we introduce a deep network for speech denoising that tightly integrates silent intervals, and thereby overcomes many of the limitations of classical approaches. Our goal is not just to identify a single silent interval, but to find as many silent intervals as possible over time. Indeed, silent intervals in speech appear in abundance: psycholinguistic studies have shown that there is almost always a pause after each sentence and even each word in speech [78, 21]. Each pause, however short, provides a silent interval revealing noise characteristics local in time. Altogether, these silent intervals assemble a time-varying picture of the background noise, allowing the neural network to better denoise speech signals, even in the presence of nonstationary noise (see Fig. 1).

In short, to interleave neural networks with established denoising pipelines, we propose a network structure consisting of three major components (see Fig. 2): i) one dedicated to silent interval detection, ii) another that aims to estimate the full noise from those revealed in silent intervals, akin to an inpainting process in computer vision [38], and iii) yet another for cleaning up the input signal.

Summary of results. Our neural-network-based denoising model accepts a single channel of audio signal and outputs the cleaned-up signal. Unlike some of the recent denoising methods that take as input audiovisual signals (i.e., both audio and video footage), our method can be applied in a wider range of scenarios (e.g., in cellular communication). We conducted extensive experiments, including ablation studies to show the efficacy of our network components and comparisons to several state-of-the-art denoising methods. We also evaluate our method under various signal-to-noise ratios—even under strong noise levels that are not tested against in previous methods. We show that, under a variety of denoising metrics, our method consistently outperforms those methods, including those that accept only audio input (like ours) and those that denoise based on audiovisual input.

The pivotal role of silent intervals for speech denoising is further confirmed by a few key results. Even without supervising on silent interval detection, the ability to detect silent intervals naturally emerges in our network. Moreover, while our model is trained on English speech only, with no additional training it can be readily used to denoise speech in other languages (such as Chinese, Japanese, and Korean). Please refer to the supplementary materials for listening to our denoising results.

2 Related Work

Speech denoising. Speech denoising [53] is a fundamental problem studied over several decades. Spectral subtraction [7, 98, 6, 72, 79] estimates the clean signal spectrum by subtracting an estimate of the noise spectrum from the noisy speech spectrum. This classic method was followed by spectrogram factorization methods [84]. Wiener filtering [80, 40] derives the enhanced signal by optimizing the mean-square error. Other methods exploit pauses in speech, forming segments of low acoustic energy where noise statistics can be more accurately measured [13, 57, 86, 15, 75, 10, 11]. Statistical model-based methods [14, 34] and subspace algorithms [12, 16] are also studied.



[Figure 2 diagram: the input waveform is converted by STFT into a spectrogram; (a) Silent Interval Detection (deep net D) produces a silent interval mask that, after thresholding and an element-wise product with the input, exposes a noise profile; (b) Noise Estimation (deep net N) predicts the estimated full noise; and (c) Noise Removal (deep net R) outputs a complex mask that yields the denoised spectrogram and, via ISTFT, the denoised waveform.]

Figure 2: Our audio denoising network. Our model has three components: (a) one that detects silent intervals over time and outputs a noise profile observed from detected silent intervals; (b) another that estimates the full noise profile; and (c) yet another that cleans up the input signal.

Applying neural networks to audio denoising dates back to the 80s [88, 69]. With increased computing power, deep neural networks are often used [104, 106, 105, 47]. Long short-term memory networks (LSTMs) [35] are able to preserve temporal context information of the audio signal [52], leading to strong results [56, 90, 100]. Leveraging generative adversarial networks (GANs) [33], methods such as [70, 71] have adopted GANs into the audio field and have also achieved strong performance.

Audio signal processing methods operate on either the raw waveform or the spectrogram produced by the Short-time Fourier Transform (STFT). Some work directly on the waveform [23, 68, 59, 55], and others use WaveNet [91] for speech denoising [74, 76, 30]. Many other methods such as [54, 94, 61, 99, 46, 107, 9] work on the audio signal's spectrogram, which contains both magnitude and phase information. There are works discussing how to use the spectrogram to its best potential [93, 67], although one of the disadvantages is that the inverse STFT needs to be applied. Meanwhile, there also exist works [51, 29, 28, 95, 19, 101, 60] investigating how to overcome artifacts from time aliasing.

Speech denoising has also been studied in conjunction with computer vision due to the relations between speech and facial features [8]. Methods such as [31, 26, 3, 36, 32] utilize different network structures to enhance the audio signal to the best of their ability. Adeel et al. [1] even utilize lip reading to filter out the background noise of a speech signal.

Deep learning for other audio processing tasks. Deep learning is widely used for lip reading, speech recognition, speech separation, and many audio processing or audio-related tasks, with the help of computer vision [64, 66, 5, 4]. Methods such as [50, 17, 65] are able to reconstruct speech from pure facial features. Methods such as [2, 63] take advantage of facial features to improve speech recognition accuracy. Speech separation is one of the areas where computer vision is best leveraged. Methods such as [25, 64, 18, 109] have achieved impressive results, making the previously impossible speech separation from a single audio signal possible. Recently, Zhang et al. [108] proposed a new operation called Harmonic Convolution to help networks distill audio priors, which is shown to even further improve the quality of speech separation.

3 Learning Speech Denoising

We present a neural network that harnesses the time distribution of silent intervals for speech denoising. The input to our model is a spectrogram of noisy speech [103, 20, 83], which can be viewed as a 2D image of size T × F with two channels, where T represents the time length of the signal and F is the number of frequency bins. The two channels store the real and imaginary parts of the STFT, respectively. After learning, the model produces another spectrogram of the same size with the noise suppressed.
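As a concrete illustration of this input representation, the sketch below converts a mono waveform into a two-channel (real/imaginary) spectrogram with PyTorch; the FFT size and hop length are placeholders rather than the paper's settings.

```python
# A minimal sketch (assumed parameters): build the (2, T, F) real/imaginary spectrogram
# described above from a mono waveform.
import torch

def complex_spectrogram(wave: torch.Tensor, n_fft: int = 510, hop: int = 158) -> torch.Tensor:
    """wave: (num_samples,) mono audio -> (2, T, F) spectrogram with real/imag channels."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(wave, n_fft=n_fft, hop_length=hop, window=window,
                      return_complex=True)          # (F, T) complex
    spec = torch.view_as_real(spec)                 # (F, T, 2)
    return spec.permute(2, 1, 0)                    # (2, T, F): channels, time, frequency
```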

We first train our proposed network structure in an end-to-end fashion, with only denoising supervision (Sec. 3.2); and it already outperforms the state-of-the-art methods that we compare against. Furthermore, we incorporate supervision on silent interval detection (Sec. 3.3) and obtain even better denoising results (see Sec. 4).

3.1 Network structure

Classic denoising algorithms work in three general stages: silent interval specification, noise feature estimation, and noise removal. We propose to interweave learning throughout this process: we rethink each stage with the help of a neural network, forming a new speech denoising model. Since we can chain these networks together and estimate gradients, we can efficiently train the model with large-scale audio data. Figure 2 illustrates this model, which we describe below.




Figure 3: Example of intermediate and final results. (a) The spectrogram of a noisy input signal, which is a superposition of a clean speech signal (b) and a noise (c). The black regions in (b) indicate ground-truth silent intervals. (d) The noise exposed by automatically emergent silent intervals, i.e., the output of the silent interval detection component when the entire network is trained without silent interval supervision (recall Sec. 3.2). (e) The noise exposed by detected silent intervals, i.e., the output of the silent interval detection component when the network is trained with silent interval supervision (recall Sec. 3.3). (f) The estimated noise profile using subfigures (a) and (e) as the input to the noise estimation component. (g) The final denoised spectrogram output.

Silent interval detection. The first component is dedicated to detecting silent intervals in the input signal. The input to this component is the spectrogram of the input (noisy) signal x. The spectrogram sx is first encoded by a 2D convolutional encoder into a 2D feature map, which is in turn processed by a bidirectional LSTM [35, 81] followed by two fully-connected (FC) layers (see network details in Appendix A). The bidirectional LSTM is suitable for processing the time-series features resulting from the spectrogram [58, 41, 73, 18], and the FC layers are applied to the features of each time sample to accommodate variable-length input. The output of this network component is a vector D(sx). Each element of D(sx) is a scalar in [0, 1] (after applying the sigmoid function), indicating the confidence score of a small time segment being silent. We choose each time segment to be 1/30 second long, small enough to capture short speech pauses and large enough to allow robust prediction.

The output vector D(sx) is then expanded to a longer mask, which we denote as m(x). Each element of this mask indicates the confidence of classifying the corresponding sample of the input signal x as pure noise (see Fig. 3-e). With this mask, the noise profile x̃ exposed by the silent intervals is estimated by an element-wise product, namely x̃ := x ⊙ m(x).
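A hedged sketch of this component is shown below: a small 2D convolutional encoder, a bidirectional LSTM, and per-time-step fully connected layers with a sigmoid output. The layer sizes are illustrative assumptions; the actual architecture is given in Appendix A of the paper.

```python
# A minimal sketch of a silent-interval detector in the spirit of component (a).
import torch
import torch.nn as nn

class SilentIntervalDetector(nn.Module):
    def __init__(self, freq_bins=256, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(                              # (B, 2, T, F) -> (B, 32, T, F/4)
            nn.Conv2d(2, 16, 3, stride=(1, 2), padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=(1, 2), padding=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(32 * (freq_bins // 4), hidden,
                            batch_first=True, bidirectional=True)
        self.head = nn.Sequential(nn.Linear(2 * hidden, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, spec):                                       # spec: (B, 2, T, F)
        z = self.encoder(spec)
        z = z.permute(0, 2, 1, 3).flatten(2)                       # (B, T, features)
        z, _ = self.lstm(z)
        return torch.sigmoid(self.head(z)).squeeze(-1)             # (B, T): per-frame silence confidence

# The per-frame confidences are upsampled to sample resolution to form the mask m(x);
# the exposed noise is then the element-wise product x ⊙ m(x).
```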

Noise estimation. The signal x̃ resulting from silent interval detection is a noise profile exposed only through a series of time windows (see Fig. 3-e)—but not a complete picture of the noise. However, since the input signal is a superposition of the clean speech signal and noise, having a complete noise profile would ease the denoising process, especially in the presence of nonstationary noise. Therefore, we also estimate the entire noise profile over time, which we do with a neural network.

Inputs to this component include both the noisy audio signal x and the incomplete noise profile x̃. Both are converted by STFT into spectrograms, denoted as sx and sx̃, respectively. We view the spectrograms as 2D images. And because the neighboring time-frequency pixels in a spectrogram are often correlated, our goal here is conceptually akin to the image inpainting task in computer vision [38]. To this end, we encode sx and sx̃ by two separate 2D convolutional encoders into two feature maps. The feature maps are then concatenated in a channel-wise manner and further decoded by a convolutional decoder to estimate the full noise spectrogram, which we denote as N(sx, sx̃). A result of this step is illustrated in Fig. 3-f.
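The sketch below illustrates this inpainting-style design under assumed layer sizes: two separate convolutional encoders, channel-wise concatenation, and a transposed-convolution decoder; it is not the architecture from Appendix A.

```python
# A minimal sketch of the noise-estimation idea in component (b).
import torch
import torch.nn as nn

def conv_encoder():
    return nn.Sequential(
        nn.Conv2d(2, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    )

class NoiseEstimator(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc_noisy = conv_encoder()                            # encodes s_x
        self.enc_partial = conv_encoder()                          # encodes s_x~ (partial noise)
        self.decoder = nn.Sequential(                              # decodes the full noise spectrogram
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 2, 4, stride=2, padding=1),     # 2 channels = real/imag
        )

    def forward(self, s_x, s_x_tilde):                             # both (B, 2, T, F), T and F divisible by 4
        z = torch.cat([self.enc_noisy(s_x), self.enc_partial(s_x_tilde)], dim=1)
        return self.decoder(z)                                     # estimated N(s_x, s_x~)
```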

Noise removal. Lastly, we clean up the noise from the input signal x. We use a neural network R that takes as input both the input audio spectrogram sx and the estimated full noise spectrogram N(sx, sx̃). The two input spectrograms are processed individually by their own 2D convolutional encoders. The two encoded feature maps are then concatenated together before passing to a bidirectional LSTM followed by three fully connected layers (see details in Appendix A). Like other audio enhancement models [18, 92, 96], the output of this component is a vector with two channels, which form the real and imaginary parts of a complex ratio mask c := R(sx, N(sx, sx̃)) in the frequency-time domain. In other words, the mask c has the same (temporal and frequency) dimensions as sx.

In the final step, we compute the denoised spectrogram s∗x through element-wise multiplication of the input audio spectrogram sx and the mask c (i.e., s∗x = sx ⊙ c). Finally, the cleaned-up audio signal is obtained by applying the inverse STFT to s∗x (see Fig. 3-g).
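For concreteness, the sketch below applies a predicted two-channel mask to the noisy spectrogram and inverts the STFT; it interprets the element-wise product of two-channel spectrograms as complex multiplication, which is one common reading of a complex ratio mask, and the STFT parameters are assumed placeholders.

```python
# A minimal sketch of the final noise-removal step: apply the complex ratio mask, then ISTFT.
import torch

def apply_mask_and_invert(s_x, mask, n_fft=510, hop=158):
    """s_x, mask: (2, T, F) real/imag tensors -> denoised waveform (num_samples,)."""
    spec = torch.view_as_complex(s_x.permute(1, 2, 0).contiguous())    # (T, F) complex
    cmask = torch.view_as_complex(mask.permute(1, 2, 0).contiguous())
    denoised = (spec * cmask).transpose(0, 1)                          # (F, T), as expected by istft
    window = torch.hann_window(n_fft)
    return torch.istft(denoised, n_fft=n_fft, hop_length=hop, window=window)
```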

3.2 Loss functions and training

Since a subgradient exists at every step, we are able to train our network in an end-to-end fashion with stochastic gradient descent. We optimize the following loss function:

$$L_0 = \mathbb{E}_{x\sim p(x)}\Big[\big\|N(s_x, s_{\tilde{x}}) - s^*_n\big\|_2 + \beta\,\big\|s_x \odot R\big(s_x, N(s_x, s_{\tilde{x}})\big) - s^*_x\big\|_2\Big], \qquad (1)$$

where the notations sx, sx̃, N(·, ·), and R(·, ·) are defined in Sec. 3.1; s∗x and s∗n denote the spectrograms of the ground-truth foreground signal and background noise, respectively. The first term penalizes the discrepancy between the estimated noise and the ground-truth noise, while the second term accounts for the estimation of the foreground signal. These two terms are balanced by the scalar β (β = 1.0 in our experiments).
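A hedged PyTorch rendering of this loss is sketched below; net_N and net_R stand for the noise estimation and removal networks, all spectrograms are (B, 2, T, F) real/imaginary tensors, and the complex multiplication helper reflects how a complex ratio mask is typically applied.

```python
# A minimal sketch of the loss in Eq. (1), with beta = 1.0 as in the paper.
import torch

def complex_mul(a, b):
    """Treat channel 0/1 of (B, 2, T, F) tensors as real/imag parts and multiply."""
    real = a[:, 0] * b[:, 0] - a[:, 1] * b[:, 1]
    imag = a[:, 0] * b[:, 1] + a[:, 1] * b[:, 0]
    return torch.stack([real, imag], dim=1)

def denoising_loss(s_x, s_x_tilde, s_n_true, s_x_true, net_N, net_R, beta=1.0):
    n_hat = net_N(s_x, s_x_tilde)                                  # estimated full noise spectrogram
    mask = net_R(s_x, n_hat)                                       # complex ratio mask (2 channels)
    s_x_hat = complex_mul(s_x, mask)                               # s_x ⊙ R(s_x, N(s_x, s_x~))
    noise_term = torch.linalg.vector_norm(n_hat - s_n_true, dim=(1, 2, 3))
    speech_term = torch.linalg.vector_norm(s_x_hat - s_x_true, dim=(1, 2, 3))
    return (noise_term + beta * speech_term).mean()                # expectation over the batch
```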

Natural emergence of silent intervals. While producing plausible denoising results (see Sec. 4.4), the end-to-end training process has no supervision on silent interval detection: the loss function (1) only accounts for the recovery of the noise and the clean speech signal. But somewhat surprisingly, the ability to detect silent intervals automatically emerges as the output of the first network component (see Fig. 3-d as an example, which visualizes sx̃). In other words, the network automatically learns to detect silent intervals for speech denoising without this supervision.

3.3 Silent interval supervision

As the model is learning to detect silent intervals on its own, we are able to directly supervise silent interval detection to further improve the denoising quality. Our first attempt was to add a term to (1) that penalizes the discrepancy between detected silent intervals and their ground truth. But our experiments show that this is not effective (see Sec. 4.4). Instead, we train our network in two sequential steps.

First, we train the silent interval detection component through the following loss function:

$$L_1 = \mathbb{E}_{x\sim p(x)}\big[\ell_{\mathrm{BCE}}\big(m(x),\, m^*_x\big)\big], \qquad (2)$$

where ℓBCE(·, ·) is the binary cross-entropy loss, m(x) is the mask resulting from the silent interval detection component, and m∗x is the ground-truth label of each signal sample being silent or not—the way of constructing m∗x and the training dataset is described in Sec. 4.1.

Next, we train the noise estimation and removal components through the loss function (1). This step starts by neglecting the silent interval detection component: in the loss function (1), instead of using sx̃, the noise spectrogram exposed by the estimated silent intervals, we use the noise spectrogram exposed by the ground-truth silent intervals (i.e., the STFT of x ⊙ m∗x). After training with this loss function, we fine-tune the network components by incorporating the already trained silent interval detection component. With the silent interval detection component fixed, this fine-tuning step optimizes the original loss function (1) and thereby updates the weights of the noise estimation and removal components.
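A minimal sketch of the first of these two steps is given below: the detector alone is supervised with binary cross-entropy against the energy-based silence labels of Sec. 4.1. The detector, data loader, and hyperparameters are assumed placeholders.

```python
# A minimal sketch of step one (Eq. 2): supervise the silent-interval detector with BCE.
import torch
import torch.nn as nn

def train_silence_detector(detector, loader, epochs=10, lr=1e-4):
    opt = torch.optim.Adam(detector.parameters(), lr=lr)
    bce = nn.BCELoss()
    for _ in range(epochs):
        for spec, silence_labels in loader:       # labels: (B, T) in {0, 1}, 1 = silent segment
            pred = detector(spec)                 # (B, T) confidences in [0, 1]
            loss = bce(pred, silence_labels.float())
            opt.zero_grad()
            loss.backward()
            opt.step()
```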

4 Experiments

This section presents the major evaluations of our method, comparisons to several baselines and prior works, and ablation studies. We also refer the reader to the supplementary materials (including a supplemental document and audio effects organized on an off-line webpage) for the full description of our network structure, implementation details, additional evaluations, as well as audio examples.

4.1 Experiment setup

Dataset construction. To construct training and testing data, we leverage publicly available audio datasets. We obtain clean speech signals using AVSPEECH [18], from which we randomly choose 2448 videos (4.5 hours of total length) and extract their speech audio channels. Among them, we use 2214 videos for training and 234 videos for testing, so the training and testing speeches are fully separate.



Figure 4: Noise gallery. We show four examples of noise from the noise datasets. Noise 1) is a stationary (white) noise, and the other three are not. Noise 2) is a monologue in a meeting. Noise 3) is party noise from people speaking and laughing with background noise. Noise 4) is street noise from people shouting and screaming with additional traffic noise such as vehicles driving and honking.

All these speech videos are in English, selected on purpose: as we show in the supplementary materials, our model trained on this dataset can readily denoise speech in other languages.

We use two datasets, DEMAND [89] and Google's AudioSet [27], as background noise. Both consist of environmental noise, transportation noise, music, and many other types of noise. DEMAND has been used in previous denoising works (e.g., [70, 30, 90]). Yet AudioSet is much larger and more diverse than DEMAND, and thus more challenging when used as noise. Figure 4 shows some noise examples. Our evaluations are conducted on both datasets, separately.

Due to the linearity of acoustic wave propagation, we can superimpose clean speech signals with noise to synthesize noisy input signals (similar to previous works [70, 30, 90]). When synthesizing a noisy input signal, we randomly choose a signal-to-noise ratio (SNR) from seven discrete values: -10dB, -7dB, -3dB, 0dB, 3dB, 7dB, and 10dB; and by mixing the foreground speech with properly scaled noise, we produce a noisy signal with the chosen SNR. For example, a -10dB SNR means that the power of the noise is ten times that of the speech (see Fig. S1 in the appendix). The SNR range in our evaluations (i.e., [-10dB, 10dB]) is significantly larger than those tested in previous works.
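The mixing step can be sketched as follows, deriving the noise scale from the definition SNR = 10·log10(P_speech / P_noise); looping the noise and the small epsilon are illustrative details.

```python
# A minimal sketch of mixing clean speech with noise at a chosen SNR.
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    noise = np.resize(noise, speech.shape)                 # loop/trim the noise to the speech length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise                          # noisy signal at the requested SNR

# e.g., mix_at_snr(clean, noise, -10) makes the noise power ten times the speech power.
```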

To supervise our silent interval detection (recall Sec. 3.3), we need ground-truth labels of silent intervals. To this end, we divide each clean speech signal into time segments, each of which lasts 1/30 seconds. We label a time segment as silent when the total acoustic energy in that segment is below a threshold. Since the speech is clean, this automatic labeling process is robust.
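This labeling procedure can be sketched as below; the energy threshold value is an assumed placeholder.

```python
# A minimal sketch of the energy-threshold labeling of silent segments on clean speech.
import numpy as np

def silence_labels(clean: np.ndarray, sr: int, energy_thresh: float = 1e-4) -> np.ndarray:
    seg = sr // 30                                         # samples per 1/30-second segment
    n_segs = len(clean) // seg
    segments = clean[: n_segs * seg].reshape(n_segs, seg)
    energy = np.sum(segments ** 2, axis=1)                 # total acoustic energy per segment
    return (energy < energy_thresh).astype(np.float32)     # 1 = silent segment, 0 = non-silent
```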

Remarks on creating our own datasets. Unlike many previous models, which are trained using existing datasets such as Valentini's VoiceBank-DEMAND [90], we choose to create our own datasets for two reasons. First, Valentini's dataset has a noise SNR level in [0dB, 15dB], much narrower than what we encounter in real-world recordings. Second, although Valentini's dataset provides several kinds of environmental noise, it lacks the richness of other types of structured noise such as music, making it less ideal for denoising real-world recordings (see the discussion in Sec. 4.6).

Method comparison. We compare our method with several existing methods that are also designed for speech denoising, including both classic approaches and recently proposed learning-based methods. We refer to these methods as follows: i) Ours, our model trained with silent interval supervision (recall Sec. 3.3); ii) Baseline-thres, a baseline method that uses an acoustic energy threshold to label silent intervals (the same as our automatic labeling approach in Sec. 4.1 but applied to noisy input signals) and then uses our trained noise estimation and removal networks for speech denoising; iii) Ours-GTSI, another reference method that uses our trained noise estimation and removal networks, but hypothetically uses the ground-truth silent intervals; iv) Spectral Gating, the classic speech denoising algorithm based on spectral subtraction [79]; v) Adobe Audition [39], one of the most widely used professional audio processing software packages, whose machine-learning-based noise reduction feature (provided in the latest Adobe Audition CC 2020, with default parameters) we use to batch process all our test data; vi) SEGAN [70], one of the state-of-the-art audio-only speech enhancement methods based on generative adversarial networks; vii) DFL [30], a recently proposed speech denoising method based on a loss function over deep network features;¹ and viii) VSE [26], a learning-based method that takes both video and audio as input and leverages both the audio signal and mouth motions (from video footage) for speech denoising. We could not compare with another audiovisual method [18] because no source code or executable is made publicly available.

For fair comparisons, we train all the methods (except Spectral Gating, which is not learning-based, and Adobe Audition, which is commercially shipped as a black box) using the same datasets.

¹This recent method is designed for high-noise-level input, trained in an end-to-end fashion, and, as their paper states, is "particularly pronounced for the hardest data with the most intrusive background noise".



[Figure 5 bar charts: denoising quality measured by PESQ, SSNR, STOI, CSIG, CBAK, and COVL (columns), with noise from DEMAND and from AudioSet (rows), for Noisy Input, Baseline-thres, Spectral Gating, Adobe Audition, SEGAN, DFL, VSE, Ours, and Ours-GTSI.]

Figure 5: Quantitative comparisons. We measure denoising quality under six metrics (corresponding to columns). The comparisons are conducted using noise from DEMAND and AudioSet separately. Ours-GTSI (in black) uses ground-truth silent intervals. Although not a practical approach, it serves as an upper-bound reference for all methods. Meanwhile, the green bar in each plot indicates the metric score of the noisy input without any processing.

[Figure 6 plots: PESQ as a function of input SNR (from -10dB to 10dB) on DEMAND and AudioSet, for Baseline-thres, SEGAN, Spectral Gating, Adobe Audition, DFL, VSE, Ours, and Ours-GTSI.]

Figure 6: Denoising quality w.r.t. input SNRs. Denoising results measured in PESQ for each method w.r.t. different input SNRs. Results measured in other metrics are shown in Fig. S2 in the Appendix.

For SEGAN, DFL, and VSE, we use the source code published by the authors. The audiovisual denoising method VSE also requires video footage, which is available in AVSPEECH.

4.2 Evaluation on speech denoising

Metrics. Due to the perceptual nature of audio processing tasks, there is no widely accepted single metric for quantitative evaluation and comparisons. We therefore evaluate our method under six different metrics, all of which have been frequently used for evaluating audio processing quality. Namely, these metrics are: i) Perceptual Evaluation of Speech Quality (PESQ) [77], ii) Segmental Signal-to-Noise Ratio (SSNR) [82], iii) Short-Time Objective Intelligibility (STOI) [87], iv) Mean opinion score (MOS) predictor of signal distortion (CSIG) [37], v) MOS predictor of background-noise intrusiveness (CBAK) [37], and vi) MOS predictor of overall signal quality (COVL) [37].
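As an illustration, two of these metrics can be computed with the third-party pesq and pystoi packages (pip-installable); these are not necessarily the implementations used in the paper, and the 16 kHz wide-band setting is an assumption.

```python
# A hedged example of computing PESQ and STOI for a clean/denoised pair of 16 kHz signals.
from pesq import pesq      # ITU-T P.862 PESQ
from pystoi import stoi    # short-time objective intelligibility

def speech_metrics(clean, denoised, sr=16000):
    return {
        "PESQ": pesq(sr, clean, denoised, "wb"),           # wide-band PESQ
        "STOI": stoi(clean, denoised, sr, extended=False),
    }
```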

Results. We train two separate models using the DEMAND and AudioSet noise datasets, respectively, and compare them with other models trained with the same datasets. We evaluate the average metric values and report them in Fig. 5. Under all metrics, our method consistently outperforms the others.

We break down the performance of each method with respect to SNR levels from -10dB to 10dB on both noise datasets. The results are reported in Fig. 6 for PESQ (see Fig. S2 in the appendix for all metrics). In the previous works that we compare to, no results under those low SNR levels (below 0dB) are reported. Nevertheless, across all input SNR levels, our method performs the best, showing that our approach is fairly robust to both light and extreme noise.

From Fig. 6, it is worth noting that the Ours-GTSI method performs even better. Recall that this is our model but provided with ground-truth silent intervals. While not practical (due to the need for ground-truth silent intervals), Ours-GTSI confirms the importance of silent intervals for denoising: high-quality silent interval detection helps to improve speech denoising quality.

4.3 Evaluation on silent interval detection

Due to the importance of silent intervals for speech denoising, we also evaluate the quality of our silent interval detection, in comparison to two alternatives: the baseline Baseline-thres and a Voice Activity Detector (VAD) [22].



Table 1: Results of silent interval detection. The metrics are measured using our test signals that have SNRs from -10dB to 10dB. Definitions of these metrics are summarized in Appendix C.1.

Noise Dataset   Method           Precision   Recall   F1      Accuracy
DEMAND          Baseline-thres   0.533       0.718    0.612   0.706
                VAD              0.797       0.432    0.558   0.783
                Ours             0.876       0.866    0.869   0.918
AudioSet        Baseline-thres   0.536       0.731    0.618   0.708
                VAD              0.736       0.227    0.338   0.728
                Ours             0.794       0.822    0.807   0.873

Table 2: Ablation studies. We alter network components and training loss, and evaluate the denoising quality under various metrics. Our proposed approach performs the best.

Noise Dataset   Method              PESQ    SSNR    STOI    CSIG    CBAK    COVL
DEMAND          Ours w/o SID comp   2.689   9.080   0.904   3.615   3.285   3.112
                Ours w/o NR comp    2.476   0.234   0.747   3.015   2.410   2.637
                Ours w/o SID loss   2.794   6.478   0.903   3.466   3.147   3.079
                Ours w/o NE loss    2.601   9.070   0.896   3.531   3.237   3.027
                Ours Joint loss     2.774   6.042   0.895   3.453   3.121   3.068
                Ours                2.795   9.505   0.911   3.659   3.358   3.186
AudioSet        Ours w/o SID comp   2.190   5.574   0.802   2.851   2.719   2.454
                Ours w/o NR comp    1.803   0.191   0.623   2.301   2.070   1.977
                Ours w/o SID loss   2.325   4.957   0.814   2.814   2.746   2.503
                Ours w/o NE loss    2.061   5.690   0.789   2.766   2.671   2.362
                Ours Joint loss     2.305   4.612   0.807   2.774   2.721   2.474
                Ours                2.304   5.984   0.816   2.913   2.809   2.543

The former is described above, while the latter classifies each time window of an audio signal as having human voice or not [48, 49]. We use an off-the-shelf VAD [102], which is developed by Google's WebRTC project and reported as one of the best available. Typically, VAD is designed to work with low-noise signals. Its inclusion here is merely to provide an alternative approach that can detect silent intervals in more ideal situations.

We evaluate these methods using four standard statistical metrics: precision, recall, F1 score, and accuracy. We follow the standard definitions of these metrics, which are summarized in Appendix C.1. These metrics are based on the definition of positive/negative conditions. Here, the positive condition indicates a time segment being labeled as a silent segment, and the negative condition indicates a non-silent label. Thus, the higher the metric values are, the better the detection approach.
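For reference, the four metrics reduce to simple counts over the binary silence labels, with silent segments as the positive class; a plain numpy sketch follows (sklearn.metrics provides equivalent functions).

```python
# A minimal sketch of precision, recall, F1, and accuracy for silent-interval detection.
import numpy as np

def detection_metrics(pred: np.ndarray, truth: np.ndarray):
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)          # predicted silent, truly silent
    fp = np.sum(pred & ~truth)         # predicted silent, truly non-silent
    fn = np.sum(~pred & truth)         # missed silent segments
    tn = np.sum(~pred & ~truth)        # correctly rejected non-silent segments
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy
```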

Table 1 shows that, under all metrics, our method is consistently better than the alternatives. Between VAD and Baseline-thres, VAD has higher precision and lower recall, meaning that VAD is overly conservative and Baseline-thres is overly aggressive when detecting silent intervals (see Fig. S3 in Appendix C.2). Our method reaches a better balance and thus detects silent intervals more accurately.

4.4 Ablation studies

In addition, we perform a series of ablation studies to understand the efficacy of individual network components and loss terms (see Appendix D.1 for more details). In Table 2, "Ours w/o SID loss" refers to the training method presented in Sec. 3.2 (i.e., without silent interval supervision). "Ours Joint loss" refers to the end-to-end training approach that optimizes the loss function (1) with the additional term (2). And "Ours w/o NE loss" uses our two-step training (in Sec. 3.3) but without the loss term on noise estimation—that is, without the first term in (1). In comparison to these alternative training approaches, our two-step training with silent interval supervision (referred to as "Ours") performs the best. We also note that "Ours w/o SID loss"—i.e., without supervision on silent interval detection—already outperforms the methods we compared to in Fig. 5, and "Ours" further improves the denoising quality. This shows the efficacy of our proposed training approach.

We also experimented with two variants of our network structure. The first one, referred to as "Ours w/o SID comp", turns off silent interval detection: the silent interval detection component always outputs a vector of all zeros. The second, referred to as "Ours w/o NR comp", uses simple spectral subtraction to replace our noise removal component.



Table 3: Comparisons on the VoiceBank-DEMAND corpus.

Method                 PESQ   CSIG   CBAK   COVL   STOI
Noisy Input            1.97   3.35   2.44   2.63   0.91
WaveNet [91]           –      3.62   3.24   2.98   –
SEGAN [70]             2.16   3.48   2.94   2.80   0.93
DFL [30]               2.51   3.79   3.27   3.14   –
MMSE-GAN [85]          2.53   3.80   3.12   3.14   0.93
MetricGAN [24]         2.86   3.99   3.18   3.42   –
SDR-PESQ [43]          3.01   4.09   3.54   3.55   –
T-GSA [44]             3.06   4.18   3.59   3.62   –
Self-adapt. DNN [45]   2.99   4.15   3.42   3.57   –
RDL-Net [62]           3.02   4.38   3.43   3.72   0.94
Ours                   3.16   3.96   3.54   3.53   0.98

Table 2 shows that, under all the tested metrics, both variants perform worse than our method, suggesting that our proposed network structure is effective.

Furthermore, we studied to what extent the accuracy of silent interval detection affects the speech denoising quality. We show that as the silent interval detection becomes less accurate, the denoising quality degrades. Presented in detail in Appendix D.2, these experiments reinforce our intuition that silent intervals are instructive for speech denoising tasks.

4.5 Comparison with state-of-the-art benchmark

Many state-of-the-art denoising methods, including MMSE-GAN [85], MetricGAN [24], SDR-PESQ [43], T-GSA [44], Self-adaptation DNN [45], and RDL-Net [62], are all evaluated on Valentini's VoiceBank-DEMAND [90]. We therefore compare ours with those methods on the same dataset. We note that this corpus consists of audio with SNRs in [0dB, 15dB], a range much narrower than what our method (and our training datasets) aims for (e.g., input signals with -10dB SNR). Nevertheless, trained and tested under the same setting, our method is highly competitive with the best of those methods under every metric, as shown in Table 3. The metric scores therein for the other methods are the numbers reported in their original papers.

4.6 Tests on real-world data

We also test our method on real-world data. Quantitative evaluation on real-world data, however, is not easy because the evaluation of nearly all metrics requires the corresponding ground-truth clean signal, which is not available in real-world scenarios. Instead, we collected a good number of real-world recordings, either by recording in daily environments or by downloading online (e.g., from YouTube). These real-world recordings cover diverse scenarios: in a driving car, a café, a park, on the street, in multiple languages (Chinese, Japanese, Korean, German, French, etc.), with different genders and accents, and even with singing. None of these recordings is cherry-picked. We refer the reader to our project website for the denoising results of all the collected real-world recordings, and for the comparison of our method with other state-of-the-art methods under real-world settings.

Furthermore, we use real-world data to test our model trained with different datasets, including our own dataset (recall Sec. 4.1) and the existing DEMAND [90]. We show that the network model trained on our own dataset leads to much better noise reduction (see details in Appendix E.2). This suggests that our dataset allows the denoising model to better generalize to many real-world scenarios.

5 Conclusion

Speech denoising has been a long-standing challenge. We present a new network structure that leverages the abundance of silent intervals in speech. Even without silent interval supervision, our network is able to denoise speech signals plausibly, and meanwhile, the ability to detect silent intervals automatically emerges. We reinforce this ability: our explicit supervision on silent intervals enables the network to detect them more accurately, thereby further improving the performance of speech denoising. As a result, under a variety of denoising metrics, our method consistently outperforms several state-of-the-art audio denoising models.

Acknowledgments. This work was supported in part by the National Science Foundation (1717178, 1816041, 1910839, 1925157) and SoftBank Group.



Broader Impact

High-quality speech denoising is desired in a myriad of applications: human-robot interaction, cellular communications, hearing aids, teleconferencing, music recording, filmmaking, news reporting, and surveillance systems, to name a few. Therefore, we expect our proposed denoising method—be it a system used in practice or a foundation for future technology—to find impact in these applications.

In our experiments, we train our model using English speech only, to demonstrate its generalization property—the ability to denoise spoken languages beyond English. Our demonstration of denoising Japanese, Chinese, and Korean speech is intentional: these languages are linguistically and phonologically distant from English (in contrast to other English "siblings" such as German and Dutch). Still, our model may be biased in favour of spoken languages and cultures that are closer to English or that have frequent pauses to reveal silent intervals. A deeper understanding of this potential bias requires future studies in tandem with linguistic and sociocultural insights.

Lastly, it is natural to extend our model for denoising audio signals in general or even signals beyond audio (such as gravitational wave denoising [97]). If successful, our model can bring even broader impacts. Pursuing this extension, however, requires a judicious definition of "silent intervals". After all, the notion of "noise" in a general context of signal processing depends on the specific application: noise in one application may be another's signal. To train a neural network that exploits a general notion of silent intervals, prudence must be taken to avoid biasing toward certain types of noise.

References

[1] A. Adeel, M. Gogate, A. Hussain, and W. M. Whitmer. Lip-reading driven deep learning approach for speech enhancement. IEEE Transactions on Emerging Topics in Computational Intelligence, pages 1–10, 2019. ISSN 2471-285X. doi: 10.1109/tetci.2019.2917039. URL http://dx.doi.org/10.1109/tetci.2019.2917039.

[2] T. Afouras, J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman. Deep audio-visual speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2018.

[3] T. Afouras, J. S. Chung, and A. Zisserman. The conversation: Deep audio-visual speech enhancement. In Proc. Interspeech 2018, pages 3244–3248, 2018. doi: 10.21437/Interspeech.2018-1400. URL http://dx.doi.org/10.21437/Interspeech.2018-1400.

[4] R. Arandjelovic and A. Zisserman. Objects that sound. In Proceedings of the European Conference on Computer Vision (ECCV), pages 435–451, 2018.

[5] Y. Aytar, C. Vondrick, and A. Torralba. SoundNet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems, pages 892–900, 2016.

[6] M. Berouti, R. Schwartz, and J. Makhoul. Enhancement of speech corrupted by acoustic noise. In ICASSP '79. IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 4, pages 208–211, 1979.

[7] S. Boll. Suppression of acoustic noise in speech using spectral subtraction. IEEE Transactions on Acoustics, Speech, and Signal Processing, 27(2):113–120, 1979.

[8] C. Busso and S. S. Narayanan. Interrelation between speech and facial gestures in emotional utterances: A single subject study. IEEE Transactions on Audio, Speech, and Language Processing, 15(8):2331–2347, 2007.

[9] J. Chen and D. Wang. Long short-term memory for speaker generalization in supervised speech separation. Acoustical Society of America Journal, 141(6):4705–4714, June 2017. doi: 10.1121/1.4986931.

[10] I. Cohen. Noise spectrum estimation in adverse environments: improved minima controlled recursive averaging. IEEE Transactions on Speech and Audio Processing, 11(5):466–475, 2003.

[11] I. Cohen and B. Berdugo. Noise estimation by minima controlled recursive averaging for robust speech enhancement. IEEE Signal Processing Letters, 9(1):12–15, 2002.



[12] M. Dendrinos, S. Bakamidis, and G. Carayannis. Speech enhancement from noise: A regenerative approach. Speech Communication, 10(1):45–67, Feb. 1991. ISSN 0167-6393. doi: 10.1016/0167-6393(91)90027-Q. URL https://doi.org/10.1016/0167-6393(91)90027-Q.

[13] G. Doblinger. Computationally efficient speech enhancement by spectral minima tracking in subbands. In Proc. Eurospeech, pages 1513–1516, 1995.

[14] Y. Ephraim. Statistical-model-based speech enhancement systems. Proceedings of the IEEE, 80(10):1526–1555, 1992.

[15] Y. Ephraim and D. Malah. Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Transactions on Acoustics, Speech, and Signal Processing, 33(2):443–445, 1985.

[16] Y. Ephraim and H. L. Van Trees. A signal subspace approach for speech enhancement. IEEE Transactions on Speech and Audio Processing, 3(4):251–266, 1995.

[17] A. Ephrat, T. Halperin, and S. Peleg. Improved speech reconstruction from silent video. In 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), pages 455–462, 2017.

[18] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. ACM Transactions on Graphics, 37(4):1–11, July 2018. ISSN 0730-0301. doi: 10.1145/3197517.3201357. URL http://dx.doi.org/10.1145/3197517.3201357.

[19] H. Erdogan, J. R. Hershey, S. Watanabe, and J. Le Roux. Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 708–712, 2015.

[20] J. L. Flanagan. Speech Analysis Synthesis and Perception. Springer-Verlag, 2nd edition, 1972. ISBN 9783662015629.

[21] K. L. Fors. Production and perception of pauses in speech. PhD thesis, Department of Philosophy, Linguistics, and Theory of Science, University of Gothenburg, 2015.

[22] D. K. Freeman, G. Cosier, C. B. Southcott, and I. Boyd. The voice activity detector for the pan-European digital cellular mobile telephone service. In International Conference on Acoustics, Speech, and Signal Processing, pages 369–372 vol. 1, 1989.

[23] S.-W. Fu, Y. Tsao, X. Lu, and H. Kawai. Raw waveform-based speech enhancement by fully convolutional networks. In 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Dec. 2017. doi: 10.1109/apsipa.2017.8281993. URL http://dx.doi.org/10.1109/APSIPA.2017.8281993.

[24] S.-W. Fu, C.-F. Liao, Y. Tsao, and S.-D. Lin. MetricGAN: Generative adversarial networks based black-box metric scores optimization for speech enhancement, 2019.

[25] A. Gabbay, A. Ephrat, T. Halperin, and S. Peleg. Seeing through noise: Visually driven speaker separation and enhancement, 2017.

[26] A. Gabbay, A. Shamir, and S. Peleg. Visual speech enhancement, 2017.

[27] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter. Audio Set: An ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017, New Orleans, LA, 2017.

[28] T. Gerkmann, M. Krawczyk-Becker, and J. Le Roux. Phase processing for single-channel speech enhancement: History and recent advances. IEEE Signal Processing Magazine, 32(2):55–66, 2015.

[29] F. G. Germain, G. J. Mysore, and T. Fujioka. Equalization matching of speech recordings in real-world environments. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 609–613, 2016.



[30] F. G. Germain, Q. Chen, and V. Koltun. Speech denoising with deep feature losses. In Proc. Interspeech 2019, pages 2723–2727, 2019. doi: 10.21437/Interspeech.2019-1924. URL http://dx.doi.org/10.21437/Interspeech.2019-1924.

[31] L. Girin, J.-L. Schwartz, and G. Feng. Audio-visual enhancement of speech in noise. The Journal of the Acoustical Society of America, 109(6):3007–3020, 2001. doi: 10.1121/1.1358887. URL https://doi.org/10.1121/1.1358887.

[32] M. Gogate, A. Adeel, K. Dashtipour, P. Derleth, and A. Hussain. AV speech enhancement challenge using a real noisy corpus, 2019.

[33] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, pages 2672–2680, Cambridge, MA, USA, 2014. MIT Press.

[34] H.-G. Hirsch and C. Ehrlicher. Noise estimation techniques for robust speech recognition. In 1995 International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 153–156, 1995.

[35] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9:1735–1780, Dec. 1997. doi: 10.1162/neco.1997.9.8.1735.

[36] J.-C. Hou, S.-S. Wang, Y.-H. Lai, Y. Tsao, H.-W. Chang, and H.-M. Wang. Audio-visual speech enhancement using multimodal deep convolutional neural networks. IEEE Transactions on Emerging Topics in Computational Intelligence, 2, Mar. 2018. doi: 10.1109/tetci.2017.2784878.

[37] Y. Hu and P. Loizou. Evaluation of objective quality measures for speech enhancement. IEEE Transactions on Audio, Speech, and Language Processing, 16:229–238, Feb. 2008. doi: 10.1109/tasl.2007.911054.

[38] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Globally and locally consistent image completion. ACM Trans. Graph., 36(4), July 2017. ISSN 0730-0301. doi: 10.1145/3072959.3073659. URL https://doi.org/10.1145/3072959.3073659.

[39] Adobe Inc. Adobe Audition, 2020. URL https://www.adobe.com/products/audition.html.

[40] J. Lim and A. Oppenheim. All-pole modeling of degraded speech. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(3):197–210, 1978.

[41] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu. Efficient neural audio synthesis, 2018.

[42] A. J. E. Kell and J. H. McDermott. Invariance to background noise as a signature of non-primary auditory cortex. Nature Communications, 10(1):3958, Sept. 2019. ISSN 2041-1723. doi: 10.1038/s41467-019-11710-y. URL https://doi.org/10.1038/s41467-019-11710-y.

[43] J. Kim, M. El-Khamy, and J. Lee. End-to-end multi-task denoising for joint SDR and PESQ optimization, 2019.

[44] J. Kim, M. El-Khamy, and J. Lee. T-GSA: Transformer with Gaussian-weighted self-attention for speech enhancement. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6649–6653, 2020.

[45] Y. Koizumi, K. Yatabe, M. Delcroix, Y. Masuyama, and D. Takeuchi. Speech enhancement using self-adaptation and multi-head self-attention, 2020.

[46] A. Kumar and D. Florencio. Speech enhancement in multiple-noise conditions using deep neural networks. In Interspeech 2016, Sept. 2016. doi: 10.21437/interspeech.2016-88. URL http://dx.doi.org/10.21437/Interspeech.2016-88.

[47] A. Kumar and D. A. F. Florêncio. Speech enhancement in multiple-noise conditions using deep neural networks. In Interspeech, 2016.



[48] R. Le Bouquin Jeannes and G. Faucon. Proposal of a voice activity detector for noise reduction. Electronics Letters, 30(12):930–932, 1994.

[49] R. Le Bouquin Jeannes and G. Faucon. Study of a voice activity detector and its influence on a noise reduction system. Speech Communication, 16(3):245–254, 1995. ISSN 0167-6393. doi: 10.1016/0167-6393(94)00056-G. URL http://www.sciencedirect.com/science/article/pii/016763939400056G.

[50] T. Le Cornu and B. Milner. Generating intelligible audio speech from visual speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(9):1751–1761, 2017.

[51] J. Le Roux and E. Vincent. Consistent Wiener filtering for audio source separation. IEEE Signal Processing Letters, 20(3):217–220, 2013.

[52] Z. C. Lipton, J. Berkowitz, and C. Elkan. A critical review of recurrent neural networks for sequence learning, 2015.

[53] P. C. Loizou. Speech Enhancement: Theory and Practice. CRC Press, Inc., USA, 2nd edition, 2013. ISBN 1466504218.

[54] X. Lu, Y. Tsao, S. Matsuda, and C. Hori. Speech enhancement based on deep denoising autoencoder. In Interspeech, 2013.

[55] Y. Luo and N. Mesgarani. Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(8):1256–1266, Aug. 2019. ISSN 2329-9290. doi: 10.1109/taslp.2019.2915167. URL https://doi.org/10.1109/TASLP.2019.2915167.

[56] A. L. Maas, Q. V. Le, T. M. O'Neil, O. Vinyals, P. Nguyen, and A. Y. Ng. Recurrent neural networks for noise reduction in robust ASR. In Interspeech, 2012.

[57] R. Martin. Noise power spectral density estimation based on optimal smoothing and minimum statistics. IEEE Transactions on Speech and Audio Processing, 9(5):504–512, 2001.

[58] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio. SampleRNN: An unconditional end-to-end neural audio generation model, 2016.

[59] M. Michelashvili and L. Wolf. Audio denoising with deep network priors, 2019.

[60] J. A. Moorer. A note on the implementation of audio processing by short-term Fourier transform. In 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pages 156–159, 2017.

[61] A. Narayanan and D. Wang. Ideal ratio mask estimation using deep neural networks for robust speech recognition. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 7092–7096, 2013.

[62] M. Nikzad, A. Nicolson, Y. Gao, J. Zhou, K. K. Paliwal, and F. Shang. Deep residual-dense lattice network for speech enhancement, 2020.

[63] K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata. Audio-visual speech recognition using deep learning. Applied Intelligence, 42(4):722–737, June 2015. ISSN 0924-669X. doi: 10.1007/s10489-014-0629-7. URL https://doi.org/10.1007/s10489-014-0629-7.

[64] A. Owens and A. A. Efros. Audio-visual scene analysis with self-supervised multisensory features. Lecture Notes in Computer Science, pages 639–658, 2018. ISSN 1611-3349. doi: 10.1007/978-3-030-01231-1_39. URL http://dx.doi.org/10.1007/978-3-030-01231-1%5F39.

[65] A. Owens, P. Isola, J. McDermott, A. Torralba, E. H. Adelson, and W. T. Freeman. Visually indicated sounds. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016. doi: 10.1109/cvpr.2016.264. URL http://dx.doi.org/10.1109/CVPR.2016.264.


[66] A. Owens, J. Wu, J. H. McDermott, W. T. Freeman, and A. Torralba. Ambient sound provides supervision for visual learning. In European conference on computer vision, pages 801–816. Springer, 2016.

[67] K. Paliwal, K. Wójcicki, and B. Shannon. The importance of phase in speech enhancement. Speech Commun., 53(4):465–494, Apr. 2011. ISSN 0167-6393. doi: 10.1016/j.specom.2010.12.003. URL https://doi.org/10.1016/j.specom.2010.12.003.

[68] A. Pandey and D. Wang. A new framework for supervised speech enhancement in the time domain. In Proc. Interspeech 2018, pages 1136–1140, 2018. doi: 10.21437/Interspeech.2018-1223. URL http://dx.doi.org/10.21437/Interspeech.2018-1223.

[69] S. Parveen and P. Green. Speech enhancement with missing data techniques using recurrent neural networks. In 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages I–733, 2004.

[70] S. Pascual, A. Bonafonte, and J. Serrà. Segan: Speech enhancement generative adversarial network. In Proc. Interspeech 2017, pages 3642–3646, 2017. doi: 10.21437/Interspeech.2017-1428. URL http://dx.doi.org/10.21437/Interspeech.2017-1428.

[71] S. Pascual, J. Serrà, and A. Bonafonte. Towards generalized speech enhancement with generative adversarial networks. In Proc. Interspeech 2019, pages 1791–1795, 2019. doi: 10.21437/Interspeech.2019-2688. URL http://dx.doi.org/10.21437/Interspeech.2019-2688.

[72] L.-P. Yang and Q.-J. Fu. Spectral subtraction-based speech enhancement for cochlear implant patients in background noise. The Journal of the Acoustical Society of America, 117(3 Pt 1):1001–1004, 2005.

[73] H. Purwins, B. Li, T. Virtanen, J. Schluter, S.-Y. Chang, and T. Sainath. Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing, 13(2):206–219, May 2019. ISSN 1941-0484. doi: 10.1109/jstsp.2019.2908700. URL http://dx.doi.org/10.1109/JSTSP.2019.2908700.

[74] K. Qian, Y. Zhang, S. Chang, X. Yang, D. Florêncio, and M. Hasegawa-Johnson. Speech enhancement using bayesian wavenet. In Proc. Interspeech 2017, pages 2013–2017, 2017. doi: 10.21437/Interspeech.2017-1672. URL http://dx.doi.org/10.21437/Interspeech.2017-1672.

[75] S. Rangachari, P. C. Loizou, and Y. Hu. A noise estimation algorithm with rapid adaptation for highly nonstationary environments. In 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages I–305, 2004.

[76] D. Rethage, J. Pons, and X. Serra. A wavenet for speech denoising. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5069–5073, 2018.

[77] A. Rix, J. Beerends, M. Hollier, and A. Hekstra. Perceptual evaluation of speech quality (pesq): A new method for speech quality assessment of telephone networks and codecs. In 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221), volume 2, pages 749–752 vol. 2, 2001. ISBN 0-7803-7041-4. doi: 10.1109/icassp.2001.941023.

[78] S. R. Rochester. The significance of pauses in spontaneous speech. Journal of Psycholinguistic Research, 2(1):51–81, 1973.

[79] T. Sainburg. Noise reduction in python using spectral gating. https://github.com/timsainb/noisereduce, 2019.

[80] P. Scalart and J. V. Filho. Speech enhancement based on a priori signal to noise estimation. In 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, volume 2, pages 629–632 vol. 2, 1996.

[81] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997. doi: 10.1109/78.650093.


[82] S. R. Quackenbush, T. P. Barnwell, and M. A. Clements. Objective Measures of Speech Quality. Prentice Hall, Englewood Cliffs, NJ, 1988. ISBN 9780136290568.

[83] E. Sejdic, I. Djurovic, and L. Stankovic. Quantitative performance analysis of scalogram as instantaneous frequency estimator. IEEE Transactions on Signal Processing, 56(8):3837–3845, 2008.

[84] P. Smaragdis, C. Févotte, G. J. Mysore, N. Mohammadiha, and M. Hoffman. Static and dynamic source separation using nonnegative factorizations: A unified view. IEEE Signal Processing Magazine, 31(3):66–75, 2014.

[85] M. H. Soni, N. Shah, and H. A. Patil. Time-frequency masking-based speech enhancement using generative adversarial network. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5039–5043, 2018.

[86] K. V. Sørensen and S. V. Andersen. Speech enhancement with natural sounding residual noise based on connected time-frequency speech presence regions. EURASIP J. Adv. Signal Process., 2005:2954–2964, Jan. 2005. ISSN 1110-8657. doi: 10.1155/asp.2005.2954. URL https://doi.org/10.1155/ASP.2005.2954.

[87] C. Taal, R. Hendriks, R. Heusdens, and J. Jensen. A short-time objective intelligibility measure for time-frequency weighted noisy speech. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 4214–4217, 2010. doi: 10.1109/icassp.2010.5495701.

[88] S. Tamura and A. Waibel. Noise reduction using connectionist models. In ICASSP-88, International Conference on Acoustics, Speech, and Signal Processing, pages 553–556 vol. 1, 1988.

[89] J. Thiemann, N. Ito, and E. Vincent. The diverse environments multi-channel acoustic noise database (demand): A database of multichannel environmental noise recordings. In 21st International Congress on Acoustics, Montreal, Canada, June 2013. Acoustical Society of America. doi: 10.5281/zenodo.1227120. URL https://hal.inria.fr/hal-00796707. The dataset itself is archived on Zenodo, with DOI 10.5281/zenodo.1227120.

[90] C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi. Investigating rnn-based speech enhancement methods for noise-robust text-to-speech. In 9th ISCA Speech Synthesis Workshop, pages 146–152, 2016. doi: 10.21437/ssw.2016-24. URL http://dx.doi.org/10.21437/SSW.2016-24.

[91] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu. Wavenet: A generative model for raw audio. ArXiv, abs/1609.03499, 2016.

[92] D. Wang and J. Chen. Supervised speech separation based on deep learning: An overview. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(10):1702–1726, Oct. 2018. ISSN 2329-9304. doi: 10.1109/taslp.2018.2842159. URL http://dx.doi.org/10.1109/TASLP.2018.2842159.

[93] D. Wang and J. Lim. The unimportance of phase in speech enhancement. IEEE Transactions on Acoustics, Speech, and Signal Processing, 30(4):679–681, 1982.

[94] Y. Wang and D. Wang. Cocktail party processing via structured prediction. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS’12, page 224–232, Red Hook, NY, USA, 2012. Curran Associates Inc.

[95] Y. Wang and D. Wang. A deep neural network for time-domain signal reconstruction. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4390–4394, 2015.

[96] Y. Wang, A. Narayanan, and D. Wang. On training targets for supervised speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):1849–1858, 2014.


[97] W. Wei and E. Huerta. Gravitational wave denoising of binary black hole mergers with deep learning. Physics Letters B, 800:135081, 2020.

[98] M. R. Weiss, E. Aschkenasy, and T. W. Parsons. Study and development of the intel technique for improving speech intelligibility. Technical report nsc-fr/4023, Nicolet Scientific Corporation, 1974.

[99] F. Weninger, J. R. Hershey, J. Le Roux, and B. Schuller. Discriminatively trained recurrent neural networks for single-channel speech separation. In 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pages 577–581, 2014.

[100] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Roux, J. R. Hershey, and B. Schuller. Speech enhancement with lstm recurrent neural networks and its application to noise-robust asr. In Proceedings of the 12th International Conference on Latent Variable Analysis and Signal Separation - Volume 9237, LVA/ICA 2015, page 91–99, Berlin, Heidelberg, 2015. Springer-Verlag. ISBN 9783319224817. doi: 10.1007/978-3-319-22482-4_11. URL https://doi.org/10.1007/978-3-319-22482-4%5F11.

[101] D. S. Williamson and D. Wang. Time-frequency masking in the complex domain for speech dereverberation and denoising. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(7):1492–1501, 2017.

[102] J. Wiseman. py-webrtcvad. https://github.com/wiseman/py-webrtcvad, 2019.

[103] L. Wyse. Audio spectrogram representations for processing with convolutional neural networks, 2017.

[104] Y. Xu, J. Du, L. Dai, and C. Lee. An experimental study on speech enhancement based on deep neural networks. IEEE Signal Processing Letters, 21(1):65–68, 2014.

[105] Y. Xu, J. Du, L. Dai, and C. Lee. A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(1):7–19, 2015.

[106] Y. Xu, J. Du, Z. Huang, L.-R. Dai, and C.-H. Lee. Multi-objective learning and mask-based post-processing for deep neural network based speech enhancement. In Interspeech, 2015.

[107] X. Zhang and D. Wang. A deep ensemble learning method for monaural speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(5):967–977, 2016.

[108] Z. Zhang, Y. Wang, C. Gan, J. Wu, J. B. Tenenbaum, A. Torralba, and W. T. Freeman. Deep audio priors emerge from harmonic convolutional networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rygjHxrYDB.

[109] H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott, and A. Torralba. The sound of pixels. In Proceedings of the European Conference on Computer Vision (ECCV), pages 570–586, 2018.


Supplementary Document: Listening to Sounds of Silence for Speech Denoising

A Network Structure and Training Details

We now present the details of our network structure and training configurations.

The silent interval detection component of our model is composed of 2D convolutional layers, a bidirectional LSTM, and two FC layers. The parameters of the convolutional layers are shown in Table S1. Each convolutional layer is followed by a batch normalization layer with a ReLU activation function. The hidden size of the bidirectional LSTM is 100. The two FC layers, interleaved with a ReLU activation function, have hidden sizes of 100 and 1, respectively.

Table S1: Convolutional layers in the silent interval detection component.

             conv1  conv2  conv3  conv4  conv5  conv6  conv7   conv8   conv9  conv10  conv11  conv12
Num Filters  48     48     48     48     48     48     48      48      48     48      48      8
Filter Size  (1,7)  (7,1)  (5,5)  (5,5)  (5,5)  (5,5)  (5,5)   (5,5)   (5,5)  (5,5)   (5,5)   (1,1)
Dilation     (1,1)  (1,1)  (1,1)  (2,1)  (4,1)  (8,1)  (16,1)  (32,1)  (1,1)  (2,2)   (4,4)   (1,1)
Stride       1      1      1      1      1      1      1       1       1      1       1       1
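For concreteness, the sketch below assembles the layers of Table S1 into a PyTorch module. Several details are our assumptions rather than part of the specification: a 2-channel input (real and imaginary parts of the spectrogram), "same" padding for every convolution, the orientation of the frequency/time axes relative to the kernel sizes in Table S1, and a sigmoid on the final FC layer that yields a per-frame silence probability.

```python
import torch
import torch.nn as nn

class SilentIntervalDetector(nn.Module):
    """A minimal sketch of the silent interval detection component (Table S1)."""

    def __init__(self, in_channels=2, freq_bins=256):
        super().__init__()
        filters   = [48] * 11 + [8]
        kernels   = [(1, 7), (7, 1)] + [(5, 5)] * 9 + [(1, 1)]
        dilations = [(1, 1), (1, 1), (1, 1), (2, 1), (4, 1), (8, 1),
                     (16, 1), (32, 1), (1, 1), (2, 2), (4, 4), (1, 1)]
        layers, prev = [], in_channels
        for f, k, d in zip(filters, kernels, dilations):
            pad = (d[0] * (k[0] - 1) // 2, d[1] * (k[1] - 1) // 2)  # keep spatial size
            layers += [nn.Conv2d(prev, f, k, stride=1, dilation=d, padding=pad),
                       nn.BatchNorm2d(f), nn.ReLU(inplace=True)]
            prev = f
        self.conv = nn.Sequential(*layers)
        self.lstm = nn.LSTM(input_size=8 * freq_bins, hidden_size=100,
                            batch_first=True, bidirectional=True)
        self.fc1 = nn.Linear(200, 100)   # 200 = 2 x 100 (bidirectional outputs)
        self.fc2 = nn.Linear(100, 1)

    def forward(self, spec):             # spec: (batch, 2, freq, time)
        x = self.conv(spec)              # (batch, 8, freq, time)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)  # one feature vector per frame
        x, _ = self.lstm(x)
        x = torch.relu(self.fc1(x))
        return torch.sigmoid(self.fc2(x))  # (batch, time, 1) silence probabilities
```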

The noise estimation component of our model is fully convolutional, consisting of two encoders and one decoder. The two encoders process the noisy signal and the incomplete noise profile, respectively; they have the same architecture (shown in Table S2) but different weights. The two feature maps resulting from the two encoders are concatenated channel-wise before being fed into the decoder. In Table S2, every layer except the last one is followed by a batch normalization layer together with a ReLU activation function. In addition, there are skip connections between the 2nd and 14th layers and between the 4th and 12th layers.

Table S2: Architecture of noise estimation component. ‘C’ indicates a convolutional layer, and ‘TC’ indicates a transposed convolutional layer. Layers 1–11 form the encoder; layers 12–16 form the decoder.

ID           1    2    3    4    5    6    7    8    9    10   11   12   13   14   15   16
Layer Type   C    C    C    C    C    C    C    C    C    C    C    TC   C    TC   C    C
Num Filters  64   128  128  256  256  256  256  256  256  256  256  128  128  64   64   2
Filter Size  5    5    5    3    3    3    3    3    3    3    3    3    3    3    3    3
Dilation     1    1    1    1    1    2    4    8    16   1    1    1    1    1    1    1
Stride       1    2    1    2    1    1    1    1    1    1    1    2    1    2    1    1

The noise removal component of our model is composed of two 2D convolutional encoders, a bidirectional LSTM, and three FC layers. The two convolutional encoders take as input the input audio spectrogram s_x and the estimated full noise spectrogram N(s_x, s_x̃), respectively. The first encoder has the network architecture listed in Table S3, and the second has the same architecture but with half the number of filters at each convolutional layer. Moreover, the bidirectional LSTM has a hidden size of 200, and the three FC layers have hidden sizes of 600, 600, and 2F, respectively, where F is the number of frequency bins in the spectrogram. In terms of the activation function, ReLU is used after each layer except the last one, which uses a sigmoid.

Table S3: Convolutional encoder for the noise removal component of our model. Each convolutional layer is followed by a batch normalization layer with ReLU as the activation function.

             C1     C2     C3     C4     C5     C6     C7      C8      C9     C10    C11    C12    C13      C14      C15
Num Filters  96     96     96     96     96     96     96      96      96     96     96     96     96       96       8
Filter Size  (1,7)  (7,1)  (5,5)  (5,5)  (5,5)  (5,5)  (5,5)   (5,5)   (5,5)  (5,5)  (5,5)  (5,5)  (5,5)    (5,5)    (1,1)
Dilation     (1,1)  (1,1)  (1,1)  (2,1)  (4,1)  (8,1)  (16,1)  (32,1)  (1,1)  (2,2)  (4,4)  (8,8)  (16,16)  (32,32)  (1,1)
Stride       1      1      1      1      1      1      1       1       1      1      1      1      1        1        1


[Figure S1 shows waveforms of the ground-truth clean input and of the constructed noisy audio at SNR = -10, -7, -3, 0, 3, 7, and 10 dB.]

Figure S1: Constructed noisy audio based on different SNR levels. The first row shows the waveform of the ground-truth clean input.

Training details. We use the PyTorch platform to implement our speech denoising model, which is then trained with the Adam optimizer. In our end-to-end training without silent interval supervision (referred to as “Ours w/o SID loss” in Sec. 4; also recall Sec. 3.2), we run the Adam optimizer for 50 epochs with a batch size of 20 and a learning rate of 0.001. When the silent interval supervision is incorporated (recall Sec. 3.3), we first train the silent interval detection component with the following setup: run the Adam optimizer for 100 epochs with a batch size of 15 and a learning rate of 0.001. Afterwards, we train the noise estimation and removal components using the same setup as the end-to-end training of “Ours w/o SID loss”.
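A minimal sketch of the two optimizer configurations described above is given below; the two nn.Linear modules are hypothetical stand-ins for the actual network components.

```python
import torch
import torch.nn as nn

# Hypothetical placeholders for the real components.
sid_net = nn.Linear(256, 1)        # stands in for the silent interval detector
denoise_net = nn.Linear(256, 256)  # stands in for noise estimation + removal

# Stage 1: silent interval detection, 100 epochs, batch size 15, learning rate 0.001.
sid_optimizer = torch.optim.Adam(sid_net.parameters(), lr=1e-3)

# Stage 2: noise estimation and removal, 50 epochs, batch size 20, learning rate 0.001
# (the same setup as the end-to-end training of "Ours w/o SID loss").
denoise_optimizer = torch.optim.Adam(denoise_net.parameters(), lr=1e-3)
```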

B Data Processing Details

Our model is designed to take as input a mono-channel audio clip of arbitrary length. However, when constructing the training dataset, we set each audio clip in the training dataset to the same 2-second length to enable batching at training time. To this end, we split each original audio clip from AVSPEECH, DEMAND, and AudioSet into 2-second-long clips. All audio clips are then downsampled to 16kHz before being converted into spectrograms using the STFT. To perform the STFT, the Fast Fourier Transform (FFT) size is set to 510, the Hann window size is set to 28ms, and the hop length is set to 11ms. As a result, each 2-second clip yields a (complex-valued) spectrogram with a resolution of 256 × 178, where 256 is the number of frequency bins and 178 is the temporal resolution. At inference time, our model can still accept audio clips of arbitrary length.
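A minimal sketch of this spectrogram computation, assuming torch.stft, is shown below. The conversion of the 28 ms window and 11 ms hop into sample counts is our own rounding, and the exact number of time frames also depends on the (unspecified) padding and centering.

```python
import torch

sr = 16000                                   # sampling rate after downsampling
n_fft = 510                                  # 510 // 2 + 1 = 256 frequency bins
win_length = int(0.028 * sr)                 # 28 ms Hann window -> 448 samples (assumed rounding)
hop_length = int(0.011 * sr)                 # 11 ms hop -> 176 samples (assumed rounding)

audio = torch.randn(2 * sr)                  # placeholder 2-second clip
spec = torch.stft(audio, n_fft=n_fft, hop_length=hop_length, win_length=win_length,
                  window=torch.hann_window(win_length), return_complex=True)
print(spec.shape)                            # (256, ~180) complex-valued spectrogram
```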

Both our clean speech dataset and noise datasets are first split into training and test sets, so that no audio clips in training and testing are from the same original audio source; they are fully separate.

To supervise our silent interval detection, we label the clean audio signals in the following way. We first normalize each audio clip so that its magnitude is in the range [-1, 1], that is, the largest waveform magnitude reaches -1 or 1. Then, the clean audio clip is divided into segments of length 1/30 seconds. We label a time segment as a “silent” segment (i.e., label 0) if the average waveform energy in that segment is below 0.08. Otherwise, it is labeled as a “non-silent” segment (i.e., label 1).
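A sketch of this labeling rule follows; interpreting the “average waveform energy” as the mean absolute amplitude of the segment is our assumption.

```python
import numpy as np

def label_silent_segments(audio, sr=16000, seg_dur=1.0 / 30, threshold=0.08):
    """Return one label per 1/30-second segment: 0 = silent, 1 = non-silent."""
    audio = audio / (np.abs(audio).max() + 1e-8)   # normalize to [-1, 1]
    seg_len = int(sr * seg_dur)
    n_seg = len(audio) // seg_len
    labels = np.ones(n_seg, dtype=np.int64)
    for i in range(n_seg):
        segment = audio[i * seg_len:(i + 1) * seg_len]
        if np.mean(np.abs(segment)) < threshold:   # assumed energy measure
            labels[i] = 0
    return labels
```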



[Figure S2 plots the STOI, CSIG, CBAK, COVL, PESQ, and SSNR scores on the AudioSet and DEMAND test sets, for input SNRs from -10 dB to 10 dB, comparing Baseline-thres, SEGAN, Spectral gating, Adobe Audition, DFL, VSE, Ours, and Ours-GTSI.]

Figure S2: Denoise quality under different input SNRs. Here we expand Fig. 6 of the main text, including the evaluations under all six metrics described in Sec. 4.2.

C Evaluation on Silent Interval Detection

C.1 Metrics

We now provide the details of the metrics used for evaluating our silent interval detection (i.e., the results in Table 1 of the main text). Detecting silent intervals is a binary classification task, one that classifies every time segment as being silent (i.e., a positive condition) or not (i.e., a negative condition). Recall that the confusion matrix in a binary classification task is as follows:

Table S4: Confusion matrix

                       Actual Positive        Actual Negative
Predicted Positive     True Positive (TP)     False Positive (FP)
Predicted Negative     False Negative (FN)    True Negative (TN)

In our case, we have the following conditions:

• A true positive (TP) sample is a correctly predicted silent segment.

• A true negative (TN) sample is a correctly predicted non-silent segment.

• A false positive (FP) sample is a non-silent segment predicted as silent.

• A false negative (FN) sample is a silent segment predicted as non-silent.

The four metrics used in Table 1 follow the standard definitions in statistics, which we review here:

precision = N_TP / (N_TP + N_FP),
recall    = N_TP / (N_TP + N_FN),
F1        = 2 · precision · recall / (precision + recall), and
accuracy  = (N_TP + N_TN) / (N_TP + N_TN + N_FP + N_FN),          (S1)

where N_TP, N_TN, N_FP, and N_FN indicate the numbers of true positive, true negative, false positive, and false negative predictions among all tests. Intuitively, recall indicates the ability to correctly find all true silent intervals, while precision measures what proportion of the predicted silent intervals are truly silent. The F1 score takes both precision and recall into account, producing their harmonic mean, and accuracy is the ratio of correct predictions among all predictions.
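For reference, the snippet below computes the four quantities of Eq. (S1) from per-segment binary labels, treating “silent” as the positive class.

```python
import numpy as np

def silent_detection_metrics(pred, true):
    """pred, true: arrays of 0/1 labels where 1 marks a silent segment."""
    pred, true = np.asarray(pred), np.asarray(true)
    tp = np.sum((pred == 1) & (true == 1))
    tn = np.sum((pred == 0) & (true == 0))
    fp = np.sum((pred == 1) & (true == 0))
    fn = np.sum((pred == 0) & (true == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy
```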

C.2 An Example of Silent Interval Detection

In Fig. S3, we present an example of silent interval detection results in comparison to two alternative methods. The two alternatives, described in Sec. 4.3, are referred to as Baseline-thres and VAD, respectively. Figure S3 echoes the quantitative results in Table 1: VAD tends to be overly conservative, even in the presence of mild noise, and many silent intervals are ignored. On the other hand, Baseline-thres tends to be too aggressive and produces many false intervals. In contrast, our silent interval detection maintains a better balance and thus predicts more accurately.

[Figure S3 shows four waveform panels labeled Ground Truth, Ours, Baseline-thres, and VAD.]

Figure S3: An example of silent interval detection results. Provided an input signal whose SNR is 0 dB (top left), we show the silent intervals (in red) detected by three approaches: our method, Baseline-thres, and VAD. We also show the ground-truth silent intervals in the top left.


D Ablation Studies and Analysis

D.1 Details of Ablation Studies

In Sec. 4.4 and Table 2, the ablation studies are set up in the following way.

• “Ours” refers to our proposed network structure and training method that incorporates silent interval supervision (recall Sec. 3.3). Details are described in Appendix A.

• “Ours w/o SID loss” refers to our proposed network structure but optimized by the training method in Sec. 3.2 (i.e., an end-to-end training without silent interval supervision). This ablation study is to confirm that silent interval supervision indeed helps to improve the denoising quality.

• “Ours Joint loss” refers to our proposed network structure optimized by the end-to-end training approach that optimizes the loss function (1) with the additional term (2). In this end-to-end training, silent interval detection is also supervised through the loss function. This ablation study is to confirm that our two-step training (Sec. 3.3) is more effective.

• “Ours w/o NE loss” uses our two-step training (in Sec. 3.3) but without the loss term on noise estimation, that is, without the first term in (1). This ablation study is to examine the necessity of the loss term on noise estimation for better denoising quality.

• “Ours w/o SID comp” turns off silent interval detection: the silent interval detection component always outputs a vector of all zeros. As a result, the input noise profile to the noise estimation component N is made precisely the same as the original noisy signal. This ablation study is to examine the effect of silent intervals on speech denoising.

• “Ours w/o NR comp” uses a simple spectral subtraction to replace our noise removal component; the other components remain unchanged. This ablation study is to examine the efficacy of our noise removal component.

D.2 The Influence of Silent Interval Detection on Denoising Quality

A key insight of our neural-network-based denoising model is leveraging the distribution of silent intervals over time. The experiments above have confirmed the efficacy of our silent interval detection for better speech denoising. We now report additional experiments, aiming to gain some empirical understanding of how the quality of silent interval prediction affects speech denoising quality.

Table S5: Results on how silent interval detection quality affects the speech denoising quality.

(a) Effect of shifting silent intervals.

Shift (seconds)   No Shift   1/30    1/10    1/5     1/2
PESQ              2.471      1.932   1.317   1.169   1.094

(b) Effect of shrinking silent intervals.

Shrink            No Shrink  20%     40%     60%     80%
PESQ              2.471      2.361   2.333   2.283   2.249

First, starting with ground-truth silent intervals, we shift them on the time axis by 1/30, 1/10, 1/5, and 1/2 seconds. As the shift amount increases, more time segments become incorrectly labeled: both the number of false positive labels (i.e., non-silent time segments labeled silent) and the number of false negative labels (i.e., silent time segments labeled non-silent) increase. After each shift, we feed the silent interval labels to our noise estimation and removal components and measure the denoising quality using the PESQ score.

In the second experiment, we again start with ground-truth silent intervals; but instead of shifting them, we shrink each silent interval toward its center by 20%, 40%, 60%, and 80%. As the silent intervals become more shrunken, fewer time segments are labeled as silent. In other words, only the number of false negative predictions increases. Similar to the previous experiment, after each shrink, we use the silent interval labels in our speech denoising pipeline and measure the PESQ score.
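The two perturbations can be sketched as follows, operating on one binary label per time segment (1 marks a silent segment here); the exact handling of interval boundaries is our assumption.

```python
import numpy as np

def shift_labels(labels, shift_seconds, seg_dur=1.0 / 30):
    """Shift silent-interval labels later in time by a given number of seconds."""
    k = int(round(shift_seconds / seg_dur))
    if k == 0:
        return labels.copy()
    return np.concatenate([np.zeros(k, dtype=labels.dtype), labels[:-k]])

def shrink_intervals(labels, ratio):
    """Shrink every contiguous silent interval toward its center by `ratio`."""
    out = labels.copy()
    padded = np.concatenate(([0], labels, [0]))
    edges = np.flatnonzero(np.diff(padded))          # boundaries of the silent runs
    for start, end in zip(edges[::2], edges[1::2]):  # each run covers [start, end)
        remove = int((end - start) * ratio / 2)      # trim this much from each side
        out[start:start + remove] = 0
        out[end - remove:end] = 0
    return out
```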

The results of both experiments are reported in Table S5. As we shrink the silent intervals, the denoising quality drops gently. In contrast, even a small amount of shift causes a clear drop in denoising quality. These results suggest that, in comparison to false negative predictions, false positive predictions affect the denoising quality more negatively. On the one hand, reasonably conservative predictions may leave certain silent time segments undetected (i.e., introducing some false negative labels), but the detected silent intervals indeed reveal the noise profile. On the other hand, even a small amount of false positive predictions causes certain non-silent time segments to be treated as silent segments, and thus the noise profile observed through the detected silent intervals would be tainted by foreground signals.

E Evaluation of Model Performance

E.1 Generalization Across Datasets

To evaluate the generalization ability of our model, we performed three cross-dataset tests, reported in Table S6. The experiments are set up in the following way.

• “Test i”: We train our model on our own AVSPEECH+Audioset (AA) dataset but evaluate on Valentini’s VoiceBank-DEMAND (VD) test set. The result is shown in the first row of the “Test i” section in Table S6. In comparison, the second row shows the result of training on VD and testing on VD.

• “Test ii”: We train our model on our own AA dataset but evaluate on our second AVSPEECH+DEMAND (AD) test set. The result is shown in the first row of the “Test ii” section in Table S6. In comparison, the second row shows the result of training on AD and testing on AD.

• “Test iii”: We train our model on our own AD dataset but evaluate on the AA test set. The result is shown in the first row of the “Test iii” section of the table. In comparison, the second row shows the result of training on AA and testing on AA.

The small degradation in each cross-dataset test demonstrates the strong generalization ability of our method. We could not directly compare the generalization ability of our model with existing methods, as no previous work reported cross-dataset evaluation results.

Table S6: Generalization across datasets.

Test      Trainset  Testset  PESQ  CSIG  CBAK  COVL  STOI
Test i    AA        VD       3.00  3.78  3.08  3.34  0.98
          VD        VD       3.16  3.96  3.54  3.53  0.98
Test ii   AA        AD       2.65  3.48  3.21  3.01  0.90
          AD        AD       2.80  3.66  3.36  3.17  0.91
Test iii  AD        AA       2.12  2.71  2.65  2.34  0.79
          AA        AA       2.30  2.91  2.81  2.54  0.82

E.2 Generalization on Real-world Data

We conduct experiments to understand the extent to which the model trained with different datasets can generalize to real-world data. We train two versions of our model using our AVSPEECH+Audioset dataset and Valentini’s VoiceBank-DEMAND, respectively (denoted as “Model AA” and “Model VD” in Table S7), and use them to denoise our collected real-world recordings. For the denoised real-world audios, we measure the noise level reduction in detected silent intervals. This measurement is doable, since it requires no knowledge of noise-free ground-truth audios. As shown in Table S7, in terms of noise reduction, the model trained with our own AA dataset outperforms the one trained with the public VoiceBank-DEMAND dataset by a significant amount for all tested real-world recordings. On average, it produces 22.3 dB noise reduction in comparison to 12.6 dB, suggesting that our dataset allows the denoising model to better generalize to many real-world scenarios.


Table S7: Real-world recording noise reduction comparison.

Real-world Recording   Model AA noise reduction (dB)   Model VD noise reduction (dB)
Song Excerpt 1         22.38                           9.22
Song Excerpt 2         21.16                           17.65
Chinese                18.99                           15.65
Japanese               16.95                           9.39
Korean                 25.90                           9.76
German                 14.54                           7.29
French                 21.95                           17.16
Spanish 1              33.34                           11.00
Spanish 2              31.66                           14.09
Female 1               17.64                           7.79
Female 2               30.21                           8.64
Female 3               19.15                           6.70
Male 1                 24.81                           14.44
Male 2                 24.92                           13.35
Male 3                 13.10                           10.99
Multi-person 1         32.00                           21.60
Multi-person 2         15.10                           15.07
Multi-person 3         18.71                           10.68
Street Interview       20.54                           18.94

AVERAGE                22.27                           12.60
