


END-TO-END MULTI-SPEAKER SPEECH RECOGNITION WITH TRANSFORMER

Xuankai Chang1, Wangyou Zhang2, Yanmin Qian2, Jonathan Le Roux3, Shinji Watanabe1

1Center for Language and Speech Processing, Johns Hopkins University, USA
2MoE Key Lab of Artificial Intelligence & SpeechLab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, China
3Mitsubishi Electric Research Laboratories (MERL), USA

[email protected], {wyz-97, yanminqian}@sjtu.edu.cn, [email protected], [email protected]

ABSTRACT

Recently, fully recurrent neural network (RNN) based end-to-end models have been proven to be effective for multi-speaker speech recognition in both the single-channel and multi-channel scenarios. In this work, we explore the use of Transformer models for these tasks by focusing on two aspects. First, we replace the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture. Second, in order to use the Transformer in the masking network of the neural beamformer in the multi-channel case, we modify the self-attention component to be restricted to a segment rather than the whole sequence in order to reduce computation. Besides the model architecture improvements, we also incorporate an external dereverberation preprocessing, the weighted prediction error (WPE), enabling our model to handle reverberated signals. Experiments on the spatialized wsj1-2mix corpus show that the Transformer-based models achieve 40.9% and 25.6% relative WER reduction, down to 12.1% and 6.4% WER, under the anechoic condition in single-channel and multi-channel tasks, respectively, while in the reverberant case, our methods achieve 41.5% and 13.8% relative WER reduction, down to 16.5% and 15.2% WER.

Index Terms— Transformer, end-to-end, overlapped speech recognition, neural beamforming, speech separation.

1. INTRODUCTION

Deep learning techniques have dramatically improved the performance of separation and automatic speech recognition (ASR) tasks related to the cocktail party problem [1], where the speech from multiple speakers overlaps. Two main scenarios are typically considered: single-channel and multi-channel. In single-channel speech separation, various methods have been proposed, among which deep clustering (DPCL) based methods [2] and permutation invariant training (PIT) based methods [3] are the dominant ones. For ASR, methods combining separation with single-speaker ASR, as well as methods skipping the explicit separation step and directly building a multi-speaker speech recognition system, have been proposed, using either the hybrid ASR framework [4–6] or the end-to-end ASR framework [7–9]. In the multi-channel condition, the spatial information derived from the inter-channel differences can help distinguish between speech sources from different directions, which makes the problem easier to solve. Several methods have been proposed for multi-channel speech separation, including DPCL-based methods using integrated beamforming [10] or inter-channel spatial features [11], and a PIT-based method using a multi-speaker mask-based beamformer [12].

Wangyou Zhang and Yanmin Qian were supported by the China NSFC project No. U1736202.

For multi-channel multi-speaker speech recognition, an end-to-end system was proposed in [13], called MIMO-Speech because of its multi-channel input (MI) and multi-speaker output (MO). This system consists of a mask-based neural beamformer frontend, which explicitly separates the multi-speaker speech via beamforming, and an end-to-end speech recognition model backend based on the joint CTC/attention-based encoder-decoder [14] to recognize the separated speech streams. This end-to-end architecture is optimized via only the connectionist temporal classification (CTC) and cross-entropy (CE) losses in the backend ASR, but nonetheless learns relatively good separation abilities.

Recently, Transformer models [15] have shown impressive performance in many tasks, such as pretrained language models [16, 17], end-to-end speech recognition [18, 19], and speaker diarization [20], surpassing long short-term memory recurrent neural network (LSTM-RNN) based models. One of the key components in the Transformer model is self-attention, which computes the contribution of the whole input sequence and maps the sequence into a vector at every time step. Even though the Transformer model is powerful, it is usually not computationally practical when the sequence length is very long. It also needs adaptation for specific tasks, such as the subsampling operation in encoder-decoder based end-to-end speech recognition. However, for signal-level processing tasks such as speech separation and enhancement, subsampling is usually not a good option, because these tasks need to maintain the original time resolution.

In this paper, we explore the use of Transformer models for end-to-end multi-speaker speech recognition in both the single-channel and multi-channel scenarios. First, we replace the LSTMs in the encoder-decoder network of the speech recognition module with Transformers for both scenarios. Second, in order to also apply Transformers in the masking network of the neural beamforming module in the multi-channel case, we modify the self-attention layers to reduce their memory consumption in a time-restricted (or local) manner, as used in [5, 21, 22]. To the best of our knowledge, this work is the first attempt to use the Transformer model for tasks such as speech enhancement/separation with such very long sequences. Another contribution of this paper is to improve the robustness of our model in reverberant environments. To do so, we incorporate an external dereverberation method, the weighted prediction error (WPE) [23], to preprocess the reverberated speech. The experiments show that this straightforward method can lead to a performance boost for reverberant speech.



Fig. 1. End-to-end single-channel multi-speaker model in the 2-speaker case. The speaker-differentiating encoder (EncSD), recognition encoder (EncRec), and decoder are either RNNs or Transformers.

Fig. 2. End-to-end multi-channel multi-speaker model in the 2-speaker case. The masking network and end-to-end ASR network are based on either RNNs or Transformers.

2. END-TO-END MULTI-SPEAKER ASR

In this section, we review the end-to-end speech recognition models for both the single-channel [8, 9] and multi-channel [13] tasks. For both tasks, we denote by J the number of speakers in the input speech mixture.

2.1. Single-channel Multi-speaker ASR

In this subsection, we briefly introduce the end-to-end single-channel multi-speaker speech recognition model proposed in [8, 9], shown in Fig. 1. The model is an extension of the joint CTC/attention-based encoder-decoder framework [14] to recognize multi-speaker speech. The input O = {o_1, . . . , o_T} is the single-channel mixed speech feature. In the encoder, the input feature is separated and encoded as hidden states G^j, j = 1, . . . , J, for each speaker. The computation of the encoder can be divided into three submodules:

H = \mathrm{Encoder}_{\mathrm{Mix}}(O), \quad (1)

H^j = \mathrm{Encoder}^j_{\mathrm{SD}}(H), \quad j = 1, \dots, J, \quad (2)

G^j = \mathrm{Encoder}_{\mathrm{Rec}}(H^j), \quad j = 1, \dots, J. \quad (3)

Encoder_Mix first maps the input O to a high-dimensional representation H. Then, J speaker-differentiating encoders Encoder^j_SD extract each speaker's speech H^j. Finally, Encoder_Rec transforms each H^j into the embeddings G^j = {g^j_1, . . . , g^j_L}, with L ≤ T due to subsampling. The attention-based decoder then takes these hidden representations to generate the corresponding output token sequences Y^j = {y^j_1, . . . , y^j_N}. For each embedding sequence G^j, the recognition process is formalized as follows:

c^j_n = \mathrm{Attention}(e^j_{n-1}, G^j), \quad (4)

e^j_n = \mathrm{Update}(e^j_{n-1}, c^j_{n-1}, y^j_{n-1}), \quad (5)

y^j_n \sim \mathrm{Decoder}(e^j_n, y^j_{n-1}), \quad (6)

in which c^j_n denotes the context vector and e^j_n is the hidden state of the decoder at step n. To determine the permutation of the reference sequences R^j, permutation invariant training (PIT) is performed on the CTC loss right after the encoder [8, 9]:

\hat{\pi} = \arg\min_{\pi \in \mathcal{P}} \sum_{j=1}^{J} \mathrm{Loss}_{\mathrm{ctc}}(Z^j, R^{\pi(j)}), \quad (7)

where Z^j is the sequence obtained from G^j by a linear transform to compute the label posterior distribution, \mathcal{P} is the set of all permutations on {1, . . . , J}, and π(i) is the i-th element of permutation π. The model is optimized with both CTC and cross-entropy losses:

\mathcal{L} = \sum_{j} \big( \lambda \, \mathrm{Loss}_{\mathrm{ctc}}(Z^j, R^{\hat{\pi}(j)}) + (1 - \lambda) \, \mathrm{Loss}_{\mathrm{att}}(Y^j, R^{\hat{\pi}(j)}) \big), \quad (8)

where 0 ≤ λ ≤ 1 is an interpolation factor, and Loss_att is the cross-entropy loss of the attention-decoder.
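To make the permutation assignment in Eqs. (7)-(8) concrete, the following is a minimal Python sketch. The callables ctc_loss_fn and att_loss_fn, and the variable names, are illustrative assumptions rather than the ESPnet implementation; since J = 2 here, brute-force search over permutations is cheap.

```python
# Minimal sketch of PIT over the CTC loss, Eqs. (7)-(8).
# ctc_loss_fn(z, r) and att_loss_fn(y, r) are assumed to return scalar losses
# for one hypothesis/reference pair; they are placeholders, not a real API.
from itertools import permutations

def pit_loss(Z, Y, R, ctc_loss_fn, att_loss_fn, lam=0.2):
    """Z, Y: lists of J encoder/decoder outputs; R: list of J references."""
    J = len(Z)
    # Eq. (7): pick the permutation minimizing the summed CTC loss.
    best_perm, best_ctc = None, None
    for perm in permutations(range(J)):
        ctc = sum(ctc_loss_fn(Z[j], R[perm[j]]) for j in range(J))
        if best_ctc is None or ctc < best_ctc:
            best_perm, best_ctc = perm, ctc
    # Eq. (8): interpolate CTC and attention (cross-entropy) losses
    # under the selected permutation.
    loss = sum(
        lam * ctc_loss_fn(Z[j], R[best_perm[j]])
        + (1 - lam) * att_loss_fn(Y[j], R[best_perm[j]])
        for j in range(J)
    )
    return loss, best_perm
```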

2.2. Multi-channel Multi-speaker ASR

In this subsection, we review the model architecture of the MIMO-Speech end-to-end multi-channel multi-speaker speech recognition system [13], shown in Fig. 2. The model takes as input the microphone-array signals from an arbitrary number C of sensors. The model can be roughly divided into two modules, namely the frontend and the backend. The frontend is a mask-based multi-source neural beamformer. For simplicity of notation, we denote the noise as the 0-th source in the mixture signals. First, the monaural masking network estimates the masks M^j_c for every source j = 0, 1, . . . , J on each channel c = 1, . . . , C from the complex STFT of the multi-channel mixture speech, X_c = (x_{t,f,c})_{t,f} ∈ C^{T×F}, where 1 ≤ t ≤ T and 1 ≤ f ≤ F represent the time and frequency indices, as follows:

M_c = \mathrm{MaskNet}(X_c), \quad (9)

where M_c = (m^j_{t,f,c})_{t,f,j} ∈ [0, 1]^{T×F×(J+1)}.

Second, the multi-source neural beamformer separates each source from the mixture based on the MVDR formalization [24]. The estimated masks of each source are used to compute the corresponding power spectral density (PSD) matrices Φ^j for j ∈ {0, . . . , J} [25–27]:

\Phi^j(f) = \frac{1}{\sum_{t=1}^{T} m^j_{t,f}} \sum_{t=1}^{T} m^j_{t,f} \, \mathbf{x}_{t,f} \mathbf{x}^{\mathsf{H}}_{t,f} \in \mathbb{C}^{C \times C}, \quad (10)

where x_{t,f} = (x_{t,f,c})_c ∈ C^C, m^j_{t,f} = \frac{1}{C} \sum_{c=1}^{C} m^j_{t,f,c}, and ^H represents the conjugate transpose. The time-invariant filter coefficients g^j(f) for each speaker j are then computed from the PSD matrices:

\mathbf{g}^j(f) = \frac{\big(\sum_{i \neq j} \Phi^i(f)\big)^{-1} \Phi^j(f)}{\mathrm{Tr}\big(\big(\sum_{i \neq j} \Phi^i(f)\big)^{-1} \Phi^j(f)\big)} \, \mathbf{u} \in \mathbb{C}^{C}, \quad (11)

where 1 ≤ j ≤ J, and u ∈ R^C is a vector representing the reference microphone, derived from an attention mechanism [28]. The beamforming filters g^j can be used to obtain the enhanced signal s^j_{t,f} for speaker j, which is further processed to get the log mel-filterbank features with global mean and variance normalization (LMF(·)):

s^j_{t,f} = \big(\mathbf{g}^j(f)\big)^{\mathsf{H}} \mathbf{x}_{t,f} \in \mathbb{C}, \quad (12)

O^j = \mathrm{LMF}(|S^j|), \quad (13)

where S^j is the short-time Fourier transform (STFT) of s^j. The backend ASR module maps the speech features O^j = {o^j_1, . . . , o^j_T} of each speaker j into the output token sequences Y^j = {y^j_1, . . . , y^j_N}. The computation of the speech recognition is very similar to the process for the single-channel case described in Sec. 2.1, except that the encoder is a single-path network and does not have to separate the input feature using Encoder_SD.
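As a rough illustration of Eqs. (10)-(12), the NumPy sketch below computes the mask-weighted PSD matrices, the MVDR filter, and the enhanced STFT for one target source. It is a simplified stand-in under assumed array shapes, not the actual MIMO-Speech frontend (which, for instance, derives the reference vector u from an attention mechanism rather than taking it as input).

```python
# Hedged NumPy sketch of Eqs. (10)-(12), not the MIMO-Speech implementation.
# X: complex STFT of the mixture, shape (T, F, C); masks: channel-averaged
# masks m^j_{t,f}, shape (T, F, J+1); u: reference-microphone vector, shape (C,).
import numpy as np

def psd_matrix(X, mask):
    """Eq. (10): mask-weighted spatial covariance for one source, shape (F, C, C)."""
    num = np.einsum('tf,tfc,tfd->fcd', mask, X, X.conj())   # sum_t m x x^H
    den = mask.sum(axis=0)[:, None, None] + 1e-10
    return num / den

def mvdr_enhance(X, masks, target_j, u):
    J1 = masks.shape[-1]
    phis = [psd_matrix(X, masks[..., j]) for j in range(J1)]
    # Eq. (11): interference PSD is the sum over all other sources (incl. noise).
    phi_int = sum(phis[i] for i in range(J1) if i != target_j)
    numer = np.linalg.solve(phi_int, phis[target_j])          # (F, C, C)
    g = numer @ u / (np.trace(numer, axis1=1, axis2=2)[:, None] + 1e-10)
    # Eq. (12): apply the filter, s_{t,f} = g(f)^H x_{t,f}.
    return np.einsum('fc,tfc->tf', g.conj(), X)
```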


Similar to the single-channel model, the permutation order of the reference sequences R^j is determined by (7). The whole MIMO-Speech model is optimized only with the ASR loss as in (8).

3. TRANSFORMER WITH TIME-RESTRICTED SELF-ATTENTION

In this section, we describe one of the key components of the Transformer architecture, the multi-head self-attention [15], and the time-restricted modification [22] for its application in the masking network of the frontend.

Transformers employ dot-product self-attention to map a variable-length input sequence to another sequence of the same length, making them different from RNNs. The input consists of queries Q, keys K, and values V of dimension d_att. The weights of the self-attention are obtained by computing the dot-product between the query and all keys and normalizing with a softmax. A scaling factor \sqrt{d_{\mathrm{att}}} is used to smooth the distribution:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\Big(\frac{Q K^{\mathsf{T}}}{\sqrt{d_{\mathrm{att}}}}\Big) V. \quad (14)

To capture information from different representation subspaces, multi-head attention (MHA) is used by multiplying the original queries, keys, and values by different weight matrices:

\mathrm{MHA}(Q, K, V) = \mathrm{Concat}\big([H_h]_{h=1}^{d_{\mathrm{head}}}\big) W^{\mathrm{head}}, \quad (15)

\text{where } H_h = \mathrm{Attention}(Q W^q_h, K W^k_h, V W^v_h), \quad (16)

where d_head is the number of heads, and W^head ∈ R^{(d_head · d_att) × d_att} and W^q_h, W^k_h, W^v_h ∈ R^{d_att × d_att} are learnable parameters.
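The PyTorch sketch below mirrors Eqs. (14)-(16) with explicit projection matrices. It is an illustrative re-implementation under assumed tensor shapes, not the Transformer code used in the experiments.

```python
# Hedged sketch of scaled dot-product attention (Eq. 14) and multi-head
# attention (Eqs. 15-16); shapes and the projection layout are illustrative.
import math
import torch

def attention(Q, K, V, d_att):
    """Eq. (14): scaled dot-product attention."""
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_att)   # (..., T_q, T_k)
    return torch.softmax(scores, dim=-1) @ V              # (..., T_q, d_att)

def multi_head_attention(Q, K, V, Wq, Wk, Wv, W_head, d_att):
    """Eqs. (15)-(16): per-head projections, attention, concat, output projection.
    Wq, Wk, Wv: lists of d_head matrices of shape (d_att, d_att);
    W_head: shape (d_head * d_att, d_att)."""
    heads = [attention(Q @ Wq[h], K @ Wk[h], V @ Wv[h], d_att)
             for h in range(len(Wq))]
    return torch.cat(heads, dim=-1) @ W_head
```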

In general, the speech sequence length can be quite long, making self-attention computationally difficult. For tasks like speech separation and enhancement, subsampling is not practical as it is in speech recognition. Inspired by [21, 22], we restrict the self-attention of the Transformers in the masking network to a local segment of the speech, since nearby frames have higher correlation. This time-restricted self-attention for the query at time step t is formalized as:

\mathrm{Attention}(Q, K', V') = \mathrm{softmax}\Big(\frac{Q K'^{\mathsf{T}}}{\sqrt{d_{\mathrm{att}}}}\Big) V', \quad (17)

where the corresponding keys and values are K' = K_{t−l:t+r} and V' = V_{t−l:t+r}, respectively, with l and r denoting the left and right context window sizes.
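A simple way to realize Eq. (17) is to mask the attention scores outside the [t − l, t + r] window, as in the hedged PyTorch sketch below. A memory-conscious implementation would instead slice K_{t−l:t+r} and V_{t−l:t+r} per query rather than form the full T × T score matrix; the masked version is shown here only because it is the shortest way to express the idea.

```python
# Hedged sketch of time-restricted self-attention (Eq. 17): each query at time t
# attends only to keys/values in [t - l, t + r]. Window sizes follow Sec. 4
# (e.g. l = 14, r = 15); this is not the masking-network code itself.
import math
import torch

def time_restricted_attention(Q, K, V, l, r, d_att):
    T = Q.shape[0]                                       # sequence length
    t = torch.arange(T)
    offsets = t[None, :] - t[:, None]                    # (T, T), key index - query index
    window = (offsets >= -l) & (offsets <= r)            # True inside the local window
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_att)  # (T, T)
    scores = scores.masked_fill(~window, float('-inf'))  # forbid out-of-window positions
    return torch.softmax(scores, dim=-1) @ V
```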

4. EXPERIMENTS

The proposed methods were evaluated on the same dataset as in [13], referred to as the spatialized wsj1-2mix dataset, where the number of speakers in an utterance is J = 2. The multi-channel speech signals were generated¹ from the monaural wsj1-2mix speech used in [8, 9]. The room impulse responses (RIR) for the spatialization were randomly generated², characterizing the room dimensions, speaker locations, and microphone geometry. The final spatialized dataset contains two different environment conditions, anechoic and reverberant.

¹ The spatialization toolkit is available at http://www.merl.com/demos/deep-clustering/spatialize_wsj0-mix.zip

² The RIR generator script is available online at https://github.com/ehabets/RIR-Generator

In the anechoic condition, the room is assumed to be anechoic and only the delays and decays due to the propagation are considered when generating the signals. In the reverberant condition, reverberation is also considered, with T60 values randomly drawn from [0.2, 0.6] s. In total, the spatialized corpus under each condition contains 98.5 hr, 1.3 hr, and 0.8 hr in the training, development, and evaluation sets, respectively.

In the single-channel multi-speaker speech recognition task, we used the 1st channel of the training, development, and evaluation sets to train, validate, and evaluate our model, respectively. The input features are 80-dimensional log mel-filterbank coefficients with pitch features and their delta and delta-delta features. In the multi-channel multi-speaker speech recognition task, we also followed [13] in including the WSJ train_si284 set in the training set to improve the performance. The model takes the raw waveform audio signal as input and converts it to its STFT using a 25 ms-long Hann window with a stride of 10 ms. The spectral feature dimension is F = 257 due to zero-padding. After the frontend computation, 80-dimensional log filterbank features are extracted for each separated speech signal and global mean-variance normalization is applied, using the statistics of the single-speaker WSJ1 training set. All the multi-channel experiments were performed with C = 2 channels. However, the model can be extended to an arbitrary number of input channels as described in [28].
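As a quick check of the stated feature dimension, the numbers above are consistent with 16 kHz audio and an FFT size of 512; both of these values are assumptions here, since the paper only states the window length, stride, and F.

```python
# Assumed sampling rate and FFT size; only the 25 ms window, 10 ms stride,
# and F = 257 are stated in the paper.
sample_rate = 16000
win_samples = int(0.025 * sample_rate)   # 400-sample Hann window
hop_samples = int(0.010 * sample_rate)   # 160-sample stride
n_fft = 512                              # window zero-padded to 512 points
F = n_fft // 2 + 1                       # 257 frequency bins, as in Sec. 4
print(win_samples, hop_samples, F)       # 400 160 257
```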

4.1. Experimental Setup

All the proposed end-to-end multi-speaker speech recognition models are implemented with the ESPnet framework [29] using the PyTorch backend. Some basic parts are the same for all the models. The interpolation factor λ of the loss function in (8) is set to 0.2. The word-level language model [30] used during decoding was trained with the official text data included in the WSJ corpus. The configurations of the RNN-based models are the same as in [9] and [13] for the single-channel and multi-channel experiments, respectively.

In the Transformer-based multi-speaker encoder-decoder ASR model, there are a total of 12 layers in the encoder and 6 layers in the decoder, as in [18]. Before the Transformer encoder, the log mel-filterbank features are encoded by two CNN blocks. The CNN layers have a kernel size of 3 × 3, and the number of feature maps is 64 in the first block and 128 in the second block. For the single-channel multi-speaker model in Sec. 2.1, Encoder_Mix is the same as the CNN embedding layer, and Encoder_SD and Encoder_Rec contain 4 and 8 Transformer layers, respectively. For all the tasks, the configuration of each encoder-decoder layer is d_att = 256, d_ff = 2048, d_head = 4. The masking network in the frontend has 3 layers similar to the encoder-decoder layers, except that d_ff = 768. The Transformer is trained with the Adam optimizer and Noam learning rate decay, as in [15]. Note that the backend ASR module is initialized with a model pretrained following the ESPnet recipe for the WSJ corpus and kept frozen for the first 15 epochs, for training stability.
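For orientation, the stated layer sizes can be written down with standard torch.nn modules as below. This is only a hedged sketch of the hyperparameters; the actual models are built within ESPnet and additionally include the CNN embedding, positional encodings, the CTC head, and the time-restricted attention.

```python
# Illustrative sketch of the Sec. 4.1 Transformer sizes, not the ESPnet models.
import torch

d_att, d_ff, n_heads = 256, 2048, 4

encoder_layer = torch.nn.TransformerEncoderLayer(
    d_model=d_att, nhead=n_heads, dim_feedforward=d_ff)
decoder_layer = torch.nn.TransformerDecoderLayer(
    d_model=d_att, nhead=n_heads, dim_feedforward=d_ff)

# 12 encoder layers and 6 decoder layers, as in [18].
encoder = torch.nn.TransformerEncoder(encoder_layer, num_layers=12)
decoder = torch.nn.TransformerDecoder(decoder_layer, num_layers=6)

# The frontend masking network uses 3 similar layers with d_ff = 768.
masking_layer = torch.nn.TransformerEncoderLayer(
    d_model=d_att, nhead=n_heads, dim_feedforward=768)
masking_net = torch.nn.TransformerEncoder(masking_layer, num_layers=3)
```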

4.2. Performance in Anechoic Condition

We first provide in Table 1 the performance in the anechoic condition of the single-channel multi-speaker end-to-end ASR models trained and evaluated on the original single-channel wsj1-2mix corpus used in [9, 30]. All the layers are randomly initialized. The results show that using the Transformer model leads to a 40.9% relative word error rate (WER) reduction on the evaluation set, decreasing from 20.43% to 12.08% compared with the RNN-based model in [9].
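As a sanity check, this relative reduction follows directly from the absolute WERs: (20.43 − 12.08) / 20.43 ≈ 0.409, i.e., a 40.9% relative reduction.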

The multi-channel multi-speaker speech recognition performance on the spatialized anechoic wsj1-2mix dataset is shown in Table 2.


Table 1. Performance in terms of average WER [%] on the single-channel anechoic wsj1-2mix corpus.

Model                                  dev     eval
RNN-based 1-channel Model [9]          24.90   20.43
Transformer-based 1-channel Model      17.11   12.08

Table 2. Performance in terms of average WER [%] on the spatialized two-channel anechoic wsj1-2mix corpus.

Model                                  dev     eval
RNN-based MIMO-Speech [13]             13.54    8.62
 + Transformer backend                 10.73    6.85
 ++ Transformer frontend               11.75    6.41

The baseline multi-channel system is the RNN-based model from our previous study [13]. Before we move to the fully Transformer-based MIMO-Speech model, we first replace the RNNs with Transformers in the backend ASR only. We see that using Transformers for the ASR backend achieves a 20.5% relative improvement against the RNN-based model in the anechoic condition.

We then also apply Transformers in the masking network of the frontend. For computational feasibility in this preliminary study, the left and right context window sizes of the self-attention are set to l = 14 and r = 15. The parameters of the frontend are randomly initialized. Compared with using a Transformer-based model only for the backend, the fully Transformer-based model leads to a further improvement, achieving a WER of 6.41%. Compared against the RNN-based model, which has access to the whole sequence, such a small context window greatly limits the power of our model, yet it still shows its potential. Overall, the proposed fully Transformer-based model achieves a 25.6% relative WER improvement against the RNN-based model in the multi-channel case. We also see that the multi-channel system is better than the single-channel system, thanks to the availability of spatial information.

4.3. Performance in Reverberant Condition

Even though our model performs very well in the anechoic condition, such ideal environments are rarely encountered in practice. It is thus crucial to investigate whether the model can be applied in more realistic environments. In this subsection, we describe preliminary efforts to process the reverberated signals.

We first used straightforward multi-conditioned training by adding reverberated utterances to the training set. The results of multi-speaker speech recognition on the multi-channel reverberant datasets are shown in Table 3. It can be observed that using the Transformers only for the backend is 6.6% better than the RNN-based model. In addition, the fully Transformer-based model achieves a 13.2% relative WER improvement on the evaluation set, which is consistent with the anechoic case. However, compared with the numbers for the anechoic condition in Table 2, a large performance degradation can be observed.

To alleviate this, we turned to an existing external dereverberation method to preprocess the input signals as a simple yet effective solution. Nara-WPE [31] is a widely used open-source software package for blind dereverberation of acoustic signals. The dereverberation is performed on the reverberated speech before it is added to the training set together with the anechoic data. Similarly, the reverberant test set is also preprocessed. The speech recognition performance on the multi-channel reverberant speech after Nara-WPE is shown in Table 4.

Table 3. Performance in terms of average WER [%] on the spatialized two-channel reverberant wsj1-2mix corpus.

Model                                  dev     eval
RNN-based MIMO-Speech [13]             34.98   29.99
 + Transformer backend                 32.95   28.01
 ++ Transformer frontend               31.93   26.02

Table 4. Performance in terms of average WER [%] on the spatialized two-channel reverberant wsj1-2mix corpus after Nara-WPE.

Model                                  dev     eval
RNN-based MIMO-Speech                  24.45   17.67
 + Transformer backend                 19.17   15.24
 ++ Transformer frontend               20.55   15.46

In general, the WERs decrease dramatically with the dereverberation method. For the RNN-based model, the WER on the evaluation set decreased by 41.1% relative, from 29.99% to 17.67%. Similar to the experiments under other conditions, the model with the Transformer backend only is better than the RNN-based baseline model on the reverberant evaluation set, by 13.8% relative WER. However, the Transformer-based frontend slightly degraded the performance. This may be due to the window size of the attention being too small, as it only covers about 0.3 s of speech. Note that our systems are not trained through Nara-WPE, which is left for future work.

Finally, we show results in the single-channel task using the 1st channel of the reverberated speech after Nara-WPE dereverberation in Table 5. With the RNN-based model, the WER on the evaluation set is high, at 28.21%, as it is greatly affected by the reverberation even after preprocessing with the dereverberation technique. However, the Transformer-based model reaches a final WER of 16.50%, a 41.5% relative reduction, showing that the Transformer-based model is more robust than the RNN-based model.

Table 5. Performance in terms of average WER [%] on the 1st channel of the spatialized reverberant wsj1-2mix corpus after Nara-WPE.

Model                                  dev     eval
RNN-based 1-channel Model              31.21   28.21
Transformer-based 1-channel Model      20.44   16.50

5. CONCLUSION

In this paper, we applied Transformer models to end-to-end multi-speaker ASR in both the single-channel and multi-channel scenarios, and observed consistent improvements when replacing the RNN-based ASR module with Transformers. To alleviate the prohibitive memory consumption when applying Transformers in the frontend, with its considerably longer sequences, we modified the self-attention in the Transformers of the masking network to use a local context window. Furthermore, by incorporating an external dereverberation method, we largely reduced the performance gap between the reverberant and anechoic conditions, and hope to further reduce it in the future through tighter integration of the dereverberation within our model.


6. REFERENCES

[1] E. C. Cherry, “Some experiments on the recognition of speech, with one and with two ears,” The Journal of the Acoustical Society of America, vol. 25, no. 5, pp. 975–979, 1953.

[2] J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in Proc. IEEE ICASSP, 2016, pp. 31–35.

[3] D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in Proc. IEEE ICASSP, Mar. 2017.

[4] D. Yu, X. Chang, and Y. Qian, “Recognizing multi-talker speech with permutation invariant training,” in Proc. ISCA Interspeech, 2017, pp. 2456–2460.

[5] X. Chang, Y. Qian, and D. Yu, “Monaural multi-talker speech recognition with attention mechanism and gated convolutional networks,” in Proc. ISCA Interspeech, 2018.

[6] T. Menne, I. Sklyar, R. Schlüter, and H. Ney, “Analysis of deep clustering as preprocessing for automatic speech recognition of sparsely overlapping speech,” in Proc. ISCA Interspeech, 2019, pp. 2638–2642.

[7] S. Settle, J. Le Roux, T. Hori, S. Watanabe, and J. R. Hershey, “End-to-end multi-speaker speech recognition,” in Proc. IEEE ICASSP, 2018, pp. 4819–4823.

[8] H. Seki, T. Hori, S. Watanabe, J. Le Roux, and J. R. Hershey, “A purely end-to-end system for multi-speaker speech recognition,” in Proc. ACL, Jul. 2018.

[9] X. Chang, Y. Qian, K. Yu, and S. Watanabe, “End-to-end monaural multi-speaker ASR system without pretraining,” in Proc. IEEE ICASSP, 2019, pp. 6256–6260.

[10] L. Drude and R. Haeb-Umbach, “Tight integration of spatial and spectral features for BSS with deep clustering embeddings,” in Proc. ISCA Interspeech, 2017, pp. 2650–2654.

[11] Z.-Q. Wang, J. Le Roux, and J. R. Hershey, “Multi-channel deep clustering: Discriminative spectral and spatial embeddings for speaker-independent speech separation,” in Proc. IEEE ICASSP, 2018, pp. 1–5.

[12] T. Yoshioka, H. Erdogan, Z. Chen, X. Xiao, and F. Alleva, “Recognizing overlapped speech in meetings: A multichannel separation approach using neural networks,” in Proc. ISCA Interspeech, 2018.

[13] X. Chang, W. Zhang, Y. Qian, J. Le Roux, and S. Watanabe, “MIMO-Speech: End-to-end multi-channel multi-speaker speech recognition,” arXiv preprint arXiv:1910.06522, 2019.

[14] S. Kim, T. Hori, and S. Watanabe, “Joint CTC-attention based end-to-end speech recognition using multi-task learning,” in Proc. IEEE ICASSP, Mar. 2017, pp. 4835–4839.

[15] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. NIPS, 2017.

[16] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pre-training,” 2018.

[17] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. NAACL, 2019.

[18] S. Karita, N. Enrique Yalta Soplin, S. Watanabe, M. Delcroix, A. Ogawa, and T. Nakatani, “Improving Transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration,” in Proc. ISCA Interspeech, 2019, pp. 1408–1412.

[19] S. Karita, N. Chen, T. Hayashi, T. Hori, H. Inaguma, Z. Jiang, M. Someki, N. E. Y. Soplin, R. Yamamoto, X. Wang et al., “A comparative study on Transformer vs RNN in speech applications,” arXiv preprint arXiv:1909.06317, 2019.

[20] Y. Fujita, N. Kanda, S. Horiguchi, Y. Xue, K. Nagamatsu, and S. Watanabe, “End-to-end neural speaker diarization with self-attention,” arXiv preprint arXiv:1909.06247, 2019.

[21] M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” in Proc. EMNLP, 2015, pp. 1412–1421.

[22] D. Povey, H. Hadian, P. Ghahremani, K. Li, and S. Khudanpur, “A time-restricted self-attention layer for ASR,” in Proc. IEEE ICASSP, 2018, pp. 5874–5878.

[23] T. Yoshioka and T. Nakatani, “Generalization of multi-channel linear prediction methods for blind MIMO impulse response shortening,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 20, no. 10, pp. 2707–2720, 2012.

[24] M. Souden, J. Benesty, and S. Affes, “On optimal frequency-domain multichannel linear filtering for noise reduction,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 18, no. 2, pp. 260–276, 2009.

[25] T. Yoshioka, N. Ito, M. Delcroix, A. Ogawa, K. Kinoshita, M. Fujimoto, C. Yu, W. J. Fabian, M. Espi, T. Higuchi et al., “The NTT CHiME-3 system: Advances in speech enhancement and recognition for mobile multi-microphone devices,” in Proc. IEEE ASRU, Dec. 2015, pp. 436–443.

[26] J. Heymann, L. Drude, and R. Haeb-Umbach, “Neural network based spectral mask estimation for acoustic beamforming,” in Proc. IEEE ICASSP, Mar. 2016.

[27] H. Erdogan, J. R. Hershey, S. Watanabe, M. I. Mandel, and J. Le Roux, “Improved MVDR beamforming using single-channel mask prediction networks,” in Proc. ISCA Interspeech, Sep. 2016, pp. 1981–1985.

[28] T. Ochiai, S. Watanabe, T. Hori, and J. R. Hershey, “Multichannel end-to-end speech recognition,” in Proc. ICML, 2017, pp. 2632–2641.

[29] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen et al., “ESPnet: End-to-end speech processing toolkit,” in Proc. ISCA Interspeech, 2018, pp. 2207–2211.

[30] T. Hori, J. Cho, and S. Watanabe, “End-to-end speech recognition with word-based RNN language models,” in Proc. IEEE SLT, Dec. 2018, pp. 389–396.

[31] L. Drude, J. Heymann, C. Boeddeker, and R. Haeb-Umbach, “NARA-WPE: A Python package for weighted prediction error dereverberation in Numpy and Tensorflow for online and offline processing,” in ITG Fachtagung Sprachkommunikation (ITG), 2018.