MITSUBISHI ELECTRIC RESEARCH LABORATORIES
http://www.merl.com

A Purely End-to-end System for Multi-speaker Speech Recognition

Seki, H.; Hori, T.; Watanabe, S.; Le Roux, J.; Hershey, J.

TR2018-058   July 10, 2018

Abstract

Recently, there has been growing interest in multi-speaker speech recognition, where the utterances of multiple speakers are recognized from their mixture. Promising techniques have been proposed for this task, but earlier works have required additional training data such as isolated source signals or senone alignments for effective learning. In this paper, we propose a new sequence-to-sequence framework to directly decode multiple label sequences from a single speech sequence by unifying source separation and speech recognition functions in an end-to-end manner. We further propose a new objective function to improve the contrast between the hidden vectors to avoid generating similar hypotheses. Experimental results show that the model is directly able to learn a mapping from a speech mixture to multiple label sequences, achieving 83.1% relative improvement compared to a model trained without the proposed objective. Interestingly, the results are comparable to those produced by previous end-to-end works featuring explicit separation and recognition modules.

arXiv

This work may not be copied or reproduced in whole or in part for any commercial purpose. Permission to copy in whole or in part without payment of fee is granted for nonprofit educational and research purposes provided that all such whole or partial copies include the following: a notice that such copying is by permission of Mitsubishi Electric Research Laboratories, Inc.; an acknowledgment of the authors and individual contributions to the work; and all applicable portions of the copyright notice. Copying, reproduction, or republishing for any other purpose shall require a license with payment of fee to Mitsubishi Electric Research Laboratories, Inc. All rights reserved.

Copyright © Mitsubishi Electric Research Laboratories, Inc., 2018
201 Broadway, Cambridge, Massachusetts 02139


A Purely End-to-end System for Multi-speaker Speech Recognition

Hiroshi Seki^{1,2}*, Takaaki Hori^1, Shinji Watanabe^3, Jonathan Le Roux^1, John R. Hershey^1

1 Mitsubishi Electric Research Laboratories (MERL)
2 Toyohashi University of Technology
3 Johns Hopkins University

Abstract

Recently, there has been growing interest in multi-speaker speech recognition, where the utterances of multiple speakers are recognized from their mixture. Promising techniques have been proposed for this task, but earlier works have required additional training data such as isolated source signals or senone alignments for effective learning. In this paper, we propose a new sequence-to-sequence framework to directly decode multiple label sequences from a single speech sequence by unifying source separation and speech recognition functions in an end-to-end manner. We further propose a new objective function to improve the contrast between the hidden vectors to avoid generating similar hypotheses. Experimental results show that the model is directly able to learn a mapping from a speech mixture to multiple label sequences, achieving 83.1% relative improvement compared to a model trained without the proposed objective. Interestingly, the results are comparable to those produced by previous end-to-end works featuring explicit separation and recognition modules.

1 Introduction

Conventional automatic speech recognition (ASR) systems recognize a single utterance given a speech signal, in a one-to-one transformation. However, restricting the use of ASR systems to situations with only a single speaker limits their applicability.

* This work was done while H. Seki, Ph.D. candidate at Toyohashi University of Technology, Japan, was an intern at MERL.

Recently, there has been growing interest in single-channel multi-speaker speech recognition, which aims at generating multiple transcriptions from a single-channel mixture of multiple speakers' speech (Cooke et al., 2009).

To achieve this goal, several previous works have considered a two-step procedure in which the mixed speech is first separated, and recognition is then performed on each separated speech signal (Hershey et al., 2016; Isik et al., 2016; Yu et al., 2017; Chen et al., 2017). Dramatic advances have recently been made in speech separation, via the deep clustering framework (Hershey et al., 2016; Isik et al., 2016), hereafter referred to as DPCL. DPCL trains a deep neural network to map each time-frequency (T-F) unit to a high-dimensional embedding vector such that the embeddings for the T-F unit pairs dominated by the same speaker are close to each other, while those for pairs dominated by different speakers are farther away. The speaker assignment of each T-F unit can thus be inferred from the embeddings by simple clustering algorithms, to produce masks that isolate each speaker. The original method using k-means clustering (Hershey et al., 2016) was extended to allow end-to-end training by unfolding the clustering steps using a permutation-free mask inference objective (Isik et al., 2016). An alternative approach is to perform direct mask inference using the permutation-free objective function with networks that directly estimate the labels for a fixed number of sources. Direct mask inference was first used in Hershey et al. (2016) as a baseline method, but without showing good performance. This approach was revisited in Yu et al. (2017) and Kolbæk et al. (2017) under the name permutation-invariant training (PIT). Combination of such single-channel speaker-independent multi-speaker speech separation systems with ASR was first considered in Isik et al. (2016) using a conventional Gaussian Mixture Model/Hidden Markov Model


(GMM/HMM) system. Combination with an end-to-end ASR system was recently proposed in (Settle et al., 2018). Both these approaches either trained or pre-trained the source separation and ASR networks separately, making use of mixtures and their corresponding isolated clean source references. While the latter approach could in principle be trained without references for the isolated speech signals, the authors found it difficult to train from scratch in that case. This ability can nonetheless be used when adapting a pre-trained network to new data without such references.

In contrast with this two-stage approach, Qian et al. (2017) considered direct optimization of a deep-learning-based ASR recognizer without an explicit separation module. The network is optimized based on a permutation-free objective defined using the cross-entropy between the system's hypotheses and reference labels. The best permutation between hypotheses and reference labels in terms of cross-entropy is selected and used for backpropagation. However, this method still requires reference labels in the form of senone alignments, which have to be obtained on the clean isolated sources using a single-speaker ASR system. As a result, this approach still requires the original separated sources. As a general caveat, generation of multiple hypotheses in such a system requires the number of speakers handled by the neural network architecture to be determined before training. However, Qian et al. (2017) reported that the recognition of two-speaker mixtures using a model trained for three-speaker mixtures showed almost identical performance with that of a model trained on two-speaker mixtures. Therefore, it may be possible in practice to determine an upper bound on the number of speakers.

Chen et al. (2018) proposed a progressive training procedure for a hybrid system with explicit separation motivated by curriculum learning. They also proposed self-transfer learning and multi-output sequence discriminative training methods for fully exploiting pairwise speech and preventing competing hypotheses, respectively.

In this paper, we propose to circumvent the need for the corresponding isolated speech sources when training on a set of mixtures, by using end-to-end multi-speaker speech recognition without an explicit speech separation stage. In separation-based systems, the spectrogram is segmented into complementary regions according to sources, which generally ensures that different utterances are recognized for each speaker. Without this complementarity constraint, our direct multi-speaker recognition system could be susceptible to redundant recognition of the same utterance. In order to prevent degenerate solutions in which the generated hypotheses are similar to each other, we introduce a new objective function that enhances the contrast between the network's representations of each source. We also propose a training procedure to provide permutation invariance with low computational cost, by taking advantage of the joint CTC/attention-based encoder-decoder network architecture proposed in (Hori et al., 2017a). Experimental results show that the proposed model is able to directly convert an input speech mixture into multiple label sequences without requiring any explicit intermediate representations. In particular, no frame-level training labels, such as phonetic alignments or corresponding unmixed speech, are required. We evaluate our model on spontaneous English and Japanese tasks and obtain comparable results to the DPCL-based method with explicit separation (Settle et al., 2018).

2 Single-speaker end-to-end ASR

2.1 Attention-based encoder-decoder network

An attention-based encoder-decoder network (Bahdanau et al., 2016) predicts a target label sequence Y = (y_1, . . . , y_N), without requiring intermediate representations, from a T-frame sequence of D-dimensional input feature vectors, O = (o_t ∈ R^D | t = 1, . . . , T), and the past label history. The probability of the n-th label y_n is computed by conditioning on the past history y_{1:n−1}:

p_att(Y|O) = ∏_{n=1}^{N} p_att(y_n | O, y_{1:n−1}).   (1)

The model is composed of two main sub-modules, an encoder network and a decoder network. The encoder network transforms the input feature vector sequence into a high-level representation H = (h_l ∈ R^C | l = 1, . . . , L). The decoder network emits labels based on the label history y and a context vector c calculated using an attention mechanism which weights and sums the C-dimensional sequence of representations H with attention weight a. A hidden state e of the decoder is updated based on the previous state, the previous context vector, and the emitted label. This mechanism is summarized as follows:

H = Encoder(O),   (2)
y_n ∼ Decoder(c_n, y_{n−1}),   (3)
c_n, a_n = Attention(a_{n−1}, e_n, H),   (4)
e_n = Update(e_{n−1}, c_{n−1}, y_{n−1}).   (5)

At inference time, the previously emitted labels are used. At training time, they are replaced by the reference label sequence R = (r_1, . . . , r_N) in a teacher-forcing fashion, leading to the conditional probability p_att(Y_R|O), where Y_R denotes the output label sequence variable in this condition. The detailed definitions of Attention and Update are described in Section A of the supplementary material. The encoder and decoder networks are trained to maximize the conditional probability of the reference label sequence R using backpropagation:

L_att = Loss_att(Y_R, R) ≜ −log p_att(Y_R = R | O),   (6)

where Loss_att is the cross-entropy loss function.
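To make the teacher-forced computation of Eq. (6) concrete, the following is a minimal sketch assuming the decoder has already produced one score vector per reference position under teacher forcing. The tensor names and the use of PyTorch are illustrative assumptions, not the authors' implementation (which is written in Chainer).

```python
# Hedged sketch of the attention loss in Eq. (6): cross-entropy between the
# teacher-forced decoder outputs and the reference labels R.
import torch
import torch.nn.functional as F

def attention_loss(decoder_logits: torch.Tensor, reference: torch.Tensor) -> torch.Tensor:
    """decoder_logits: (N, V) unnormalized scores, one row per reference position,
    obtained by feeding r_{n-1} back as the label history (teacher forcing).
    reference: (N,) reference label ids R = (r_1, ..., r_N).
    Returns -log p_att(Y_R = R | O), summed over label positions."""
    return F.cross_entropy(decoder_logits, reference, reduction="sum")

# toy usage: 5 reference labels over a 49-character vocabulary
loss = attention_loss(torch.randn(5, 49), torch.randint(0, 49, (5,)))
```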

2.2 Joint CTC/attention-based encoder-decoder network

The joint CTC/attention approach (Kim et al., 2017; Hori et al., 2017a) uses the connectionist temporal classification (CTC) objective function (Graves et al., 2006) as an auxiliary task to train the network. CTC formulates the conditional probability by introducing a framewise label sequence Z, consisting of a label set U and an additional blank symbol, defined as Z = (z_l ∈ U ∪ {blank} | l = 1, · · · , L):

p_ctc(Y|O) = ∑_Z ∏_{l=1}^{L} p(z_l | z_{l−1}, Y) p(z_l | O),   (7)

where p(z_l | z_{l−1}, Y) represents the monotonic alignment constraints in CTC and p(z_l | O) is the frame-level label probability computed by

p(z_l | O) = Softmax(Linear(h_l)),   (8)

where h_l is the hidden representation generated by an encoder network, here taken to be the encoder of the attention-based encoder-decoder network defined in Eq. (2), and Linear(·) is the final linear layer of the CTC to match the number of labels. Unlike the attention model, the forward-backward algorithm of CTC enforces monotonic alignment between the input speech and the output label sequences during training and decoding. We adopt the joint CTC/attention-based encoder-decoder network as the monotonic alignment helps the separation and extraction of high-level representation. The CTC loss is calculated as:

L_ctc = Loss_ctc(Y, R) ≜ −log p_ctc(Y = R | O).   (9)

The CTC loss and the attention-based encoder-decoder loss are combined with an interpolation weight λ ∈ [0, 1]:

L_mtl = λ L_ctc + (1 − λ) L_att.   (10)
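As an illustration of Eq. (10), the sketch below combines a CTC loss on the encoder output with the teacher-forced attention cross-entropy. PyTorch is used purely for illustration (the paper's implementation uses Chainer), and the tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

ctc_loss_fn = torch.nn.CTCLoss(blank=0, reduction="sum")

def joint_ctc_attention_loss(ctc_log_probs, att_logits, reference, lam=0.1):
    """ctc_log_probs: (L, 1, V) frame-level log-probabilities from Eq. (8)
    (log-softmax over the label set plus blank, batch size 1).
    att_logits: (N, V) teacher-forced decoder scores.
    reference: (N,) reference label ids; id 0 is reserved for the CTC blank."""
    input_len = torch.tensor([ctc_log_probs.size(0)])
    target_len = torch.tensor([reference.size(0)])
    l_ctc = ctc_loss_fn(ctc_log_probs, reference.unsqueeze(0), input_len, target_len)
    l_att = F.cross_entropy(att_logits, reference, reduction="sum")
    return lam * l_ctc + (1.0 - lam) * l_att  # Eq. (10)

# toy usage: 30 frames, 10 reference labels, 50 output symbols (49 characters + blank)
log_probs = F.log_softmax(torch.randn(30, 1, 50), dim=-1)
loss = joint_ctc_attention_loss(log_probs, torch.randn(10, 50), torch.randint(1, 50, (10,)))
```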

Both CTC and encoder-decoder networks are also used in the inference step. The final hypothesis is a sequence that maximizes a weighted conditional probability of CTC in Eq. (7) and of the attention-based encoder-decoder network in Eq. (1):

Y = arg max_Y { γ log p_ctc(Y|O) + (1 − γ) log p_att(Y|O) },   (11)

where γ ∈ [0, 1] is an interpolation weight.
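A hedged sketch of the joint decoding rule in Eq. (11) follows: candidate sequences are rescored with an interpolation of the CTC and attention log-probabilities. The scoring callables are placeholders, and in the actual decoder the two scores are combined inside the beam search rather than by rescoring a finished list.

```python
def joint_decode(hypotheses, score_ctc, score_att, gamma=0.4):
    """hypotheses: iterable of candidate label sequences Y.
    score_ctc(Y), score_att(Y): assumed callables returning log p(Y|O)."""
    return max(hypotheses,
               key=lambda y: gamma * score_ctc(y) + (1.0 - gamma) * score_att(y))
```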

3 Multi-speaker end-to-end ASR

3.1 Permutation-free training

In situations where the correspondence between the outputs of an algorithm and the references is an arbitrary permutation, neural network training faces a permutation problem. This problem was first addressed by deep clustering (Hershey et al., 2016), which circumvented it in the case of source separation by comparing the relationships between pairs of network outputs to those between pairs of labels. As a baseline for deep clustering, Hershey et al. (2016) also proposed another approach to address the permutation problem, based on an objective which considers all permutations of references when computing the error with the network estimates. This objective was later used in Isik et al. (2016) and Yu et al. (2017). In the latter, it was referred to as permutation-invariant training.

This permutation-free training scheme extends the usual one-to-one mapping of outputs and labels for backpropagation to one-to-many by selecting the proper permutation of hypotheses and references, thus allowing the network to generate multiple independent hypotheses from a single-channel speech mixture. When a speech mixture contains speech uttered by S speakers simultaneously, the network generates S label sequence variables Y^s = (y^s_1, . . . , y^s_{N_s}) with N_s labels from the T-frame sequence of D-dimensional input feature vectors, O = (o_t ∈ R^D | t = 1, . . . , T):

Y^s ∼ g^s(O),  s = 1, . . . , S,   (12)

where the transformations g^s are implemented as neural networks which typically share some components with each other. In the training stage, all possible permutations of the S sequences R^s = (r^s_1, . . . , r^s_{N′_s}) of N′_s reference labels are considered (considering permutations on the hypotheses would be equivalent), and the one leading to minimum loss is adopted for backpropagation. Let P denote the set of permutations on {1, . . . , S}. The final loss L is defined as

L = min_{π∈P} ∑_{s=1}^{S} Loss(Y^s, R^{π(s)}),   (13)

where π(s) is the s-th element of a permutation π. For example, for two speakers, P includes the two permutations (1, 2) and (2, 1), and the loss is defined as:

L = min( Loss(Y^1, R^1) + Loss(Y^2, R^2),
         Loss(Y^1, R^2) + Loss(Y^2, R^1) ).   (14)
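For concreteness, a minimal sketch of the permutation-free loss of Eqs. (13)-(14) is given below; loss_fn is an assumed per-pair loss (e.g., CTC or attention cross-entropy), and the exhaustive search over permutations is practical only for small S.

```python
from itertools import permutations

def permutation_free_loss(hyps, refs, loss_fn):
    """hyps, refs: length-S lists of per-speaker outputs and references.
    Returns min over permutations pi of sum_s loss_fn(hyps[s], refs[pi[s]])."""
    S = len(hyps)
    return min(sum(loss_fn(hyps[s], refs[pi[s]]) for s in range(S))
               for pi in permutations(range(S)))
```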

Figure 1 shows an overview of the proposed end-to-end multi-speaker ASR system. In the following Section 3.2, we describe an extension of the encoder network for the generation of multiple hidden representations. We further introduce a permutation assignment mechanism for reducing the computation cost in Section 3.3, and an additional loss function L_KL for promoting the difference between hidden representations in Section 3.4.

3.2 End-to-end permutation-free training

Figure 1: End-to-end multi-speaker speech recognition. We propose to use permutation-free training for the CTC and attention loss functions Loss_ctc and Loss_att, respectively.

To make the network output multiple hypotheses, we consider a stacked architecture that combines both shared and unshared (or specific) neural network modules. The particular architecture we consider in this paper splits the encoder network into three stages: the first stage, also referred to as the mixture encoder, processes the input mixture and outputs an intermediate feature sequence H; that sequence is then processed by S independent encoder sub-networks which do not share parameters, also referred to as speaker-differentiating (SD) encoders, leading to S feature sequences H^s; at the last stage, each feature sequence H^s is independently processed by the same network, also referred to as the recognition encoder, leading to S final high-level representations G^s.

Let u ∈ {1, . . . , S} denote an output index (corresponding to the transcription of the speech by one of the speakers), and v ∈ {1, . . . , S} denote a reference index. Denoting by Encoder_Mix the mixture encoder, Encoder_SD^u the u-th speaker-differentiating encoder, and Encoder_Rec the recognition encoder, an input sequence O corresponding to an input mixture can be processed by the encoder network as follows:

H = Encoder_Mix(O),   (15)
H^u = Encoder_SD^u(H),   (16)
G^u = Encoder_Rec(H^u).   (17)

The motivation for designing such an architecture can be explained as follows, following analogies with the architectures in (Isik et al., 2016) and (Settle et al., 2018) where separation and recognition are performed explicitly in separate steps: the first stage in Eq. (15) corresponds to a speech separation module which creates embedding vectors that can be used to distinguish between the multiple sources; the speaker-differentiating second stage in Eq. (16) uses the first stage's output to disentangle each speaker's speech content from the mixture, and prepare it for recognition; the final stage in Eq. (17) corresponds to an acoustic model that encodes the single-speaker speech for final decoding.
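The sketch below illustrates the three-stage encoder of Eqs. (15)-(17) with plain BLSTM stages; the layer counts and the use of LSTMs in place of the paper's VGG front-end are simplifying assumptions (see Section 3.5 and Table 2 for the actual configurations).

```python
import torch
import torch.nn as nn

class MultiSpeakerEncoder(nn.Module):
    def __init__(self, feat_dim=249, hidden=320, num_speakers=2):
        super().__init__()
        # Encoder_Mix: shared front-end over the mixture, Eq. (15)
        self.mix = nn.LSTM(feat_dim, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        # Encoder_SD^u: S speaker-differentiating encoders, unshared, Eq. (16)
        self.sd = nn.ModuleList([
            nn.LSTM(2 * hidden, hidden, num_layers=2,
                    bidirectional=True, batch_first=True)
            for _ in range(num_speakers)])
        # Encoder_Rec: shared recognition encoder, Eq. (17)
        self.rec = nn.LSTM(2 * hidden, hidden, num_layers=3,
                           bidirectional=True, batch_first=True)

    def forward(self, O):
        H, _ = self.mix(O)
        outputs = []
        for sd in self.sd:
            Hu, _ = sd(H)
            Gu, _ = self.rec(Hu)   # the same parameters are applied to every H^u
            outputs.append(Gu)
        return outputs             # [G^1, ..., G^S]

# toy usage: one 100-frame, 249-dimensional input sequence
G = MultiSpeakerEncoder()(torch.randn(1, 100, 249))
```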

The decoder network computes the conditional probabilities for each speaker from the S outputs of the encoder network. In general, the decoder network uses the reference label R as a history to generate the attention weights during training, in a teacher-forcing fashion. However, in the above permutation-free training scheme, the reference label to be attributed to a particular output is not determined until the loss function is computed, so here we need to run the attention decoder for all reference labels. We thus need to consider the conditional probability of the decoder output variable Y^{u,v} for each output G^u of the encoder network under the assumption that the reference label for that output is R^v:

p_att(Y^{u,v}|O) = ∏_n p_att(y^{u,v}_n | O, y^{u,v}_{1:n−1}),   (18)
c^{u,v}_n, a^{u,v}_n = Attention(a^{u,v}_{n−1}, e^{u,v}_n, G^u),   (19)
e^{u,v}_n = Update(e^{u,v}_{n−1}, c^{u,v}_{n−1}, r^v_{n−1}),   (20)
y^{u,v}_n ∼ Decoder(c^{u,v}_n, r^v_{n−1}).   (21)

The final loss is then calculated by considering all permutations of the reference labels as follows:

L_att = min_{π∈P} ∑_s Loss_att(Y^{s,π(s)}, R^{π(s)}).   (22)

3.3 Reduction of permutation cost

In order to reduce the computational cost, we fixed the permutation of the reference labels based on the minimization of the CTC loss alone, and used the same permutation for the attention mechanism as well. This is an advantage of using the joint CTC/attention-based end-to-end speech recognition. Permutation is performed only for the CTC loss by assuming synchronous output, where the permutation is decided by the output of CTC:

π̂ = arg min_{π∈P} ∑_s Loss_ctc(Y^s, R^{π(s)}),   (23)

where Y^u is the output sequence variable corresponding to encoder output G^u. Attention-based decoding is then performed on the same hidden representations G^u, using teacher forcing with the labels determined by the permutation π̂ that minimizes the CTC loss:

p_att(Y^{u,π̂(u)}|O) = ∏_n p_att(y^{u,π̂(u)}_n | O, y^{u,π̂(u)}_{1:n−1}),
c^{u,π̂(u)}_n, a^{u,π̂(u)}_n = Attention(a^{u,π̂(u)}_{n−1}, e^{u,π̂(u)}_n, G^u),
e^{u,π̂(u)}_n = Update(e^{u,π̂(u)}_{n−1}, c^{u,π̂(u)}_{n−1}, r^{π̂(u)}_{n−1}),
y^{u,π̂(u)}_n ∼ Decoder(c^{u,π̂(u)}_n, r^{π̂(u)}_{n−1}).

This corresponds to the "permutation assignment" in Fig. 1. In contrast with Eq. (18), we only need to run the attention-based decoding once for each output G^u of the encoder network. The final loss is defined as the sum of the two objective functions with interpolation weight λ:

L_mtl = λ L_ctc + (1 − λ) L_att,   (24)
L_ctc = ∑_s Loss_ctc(Y^s, R^{π̂(s)}),   (25)
L_att = ∑_s Loss_att(Y^{s,π̂(s)}, R^{π̂(s)}).   (26)
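The reduced-cost assignment of this section can be sketched as follows: the permutation is chosen with the CTC losses alone (Eq. (23)) and then reused for the attention losses (Eqs. (24)-(26)). The per-pair loss callables are assumptions standing in for Loss_ctc and Loss_att.

```python
from itertools import permutations

def permutation_by_ctc(enc_outs, refs, ctc_loss, att_loss, lam=0.1):
    """enc_outs: list of S encoder outputs G^u; refs: list of S reference sequences."""
    S = len(enc_outs)
    # Eq. (23): pick the permutation that minimizes the summed CTC loss
    pi = min(permutations(range(S)),
             key=lambda p: sum(float(ctc_loss(enc_outs[s], refs[p[s]])) for s in range(S)))
    l_ctc = sum(ctc_loss(enc_outs[s], refs[pi[s]]) for s in range(S))   # Eq. (25)
    l_att = sum(att_loss(enc_outs[s], refs[pi[s]]) for s in range(S))   # Eq. (26)
    return lam * l_ctc + (1.0 - lam) * l_att                            # Eq. (24)
```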

At inference time, because both CTC and attention-based decoding are performed on the same encoder output G^u and should thus pertain to the same speaker, their scores can be incorporated as follows:

Y^u = arg max_{Y^u} { γ log p_ctc(Y^u | G^u) + (1 − γ) log p_att(Y^u | G^u) },   (27)

where p_ctc(Y^u | G^u) and p_att(Y^u | G^u) are obtained with the same encoder output G^u.

3.4 Promoting separation of hidden vectors

A single decoder network is used to output multiple label sequences by independently decoding the multiple hidden vectors generated by the encoder network. In order for the decoder to generate multiple different label sequences, the encoder needs to generate sufficiently differentiated hidden vector sequences for each speaker. We propose to encourage this contrast among hidden vectors by introducing in the objective function a new term based on the negative symmetric Kullback-Leibler (KL) divergence. In the particular case of two-speaker mixtures, we consider the following additional loss function:

L_KL = −η ∑_l { KL(Ḡ^1(l) || Ḡ^2(l)) + KL(Ḡ^2(l) || Ḡ^1(l)) },   (28)

where η is a small constant value, and Ḡ^u = (softmax(G^u(l)) | l = 1, . . . , L) is obtained from the hidden vector sequence G^u at the output of the recognition encoder Encoder_Rec in Fig. 1 by applying an additional frame-wise softmax operation, in order to obtain a quantity amenable to a probability distribution.
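A minimal sketch of Eq. (28) for two speakers is shown below, assuming the recognition-encoder outputs are available as (L, C) tensors; the frame-wise softmax and the sign convention follow the text, while the epsilon for numerical stability is an added assumption.

```python
import torch
import torch.nn.functional as F

def negative_symmetric_kl(G1, G2, eta=0.1, eps=1e-8):
    """G1, G2: (L, C) hidden vector sequences from Encoder_Rec for the two streams.
    Returns -eta * sum_l [KL(p_l || q_l) + KL(q_l || p_l)] with p, q = softmax(G)."""
    p = F.softmax(G1, dim=-1)
    q = F.softmax(G2, dim=-1)
    kl_pq = (p * (torch.log(p + eps) - torch.log(q + eps))).sum(dim=-1)
    kl_qp = (q * (torch.log(q + eps) - torch.log(p + eps))).sum(dim=-1)
    # adding this (negative) term to L_mtl pushes the two hidden sequences apart
    return -eta * (kl_pq + kl_qp).sum()
```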

3.5 Split of hidden vector for multiple hypotheses

Since the network maps acoustic features to label sequences directly, we consider various architectures to perform implicit separation and recognition effectively. As a baseline system, we use the concatenation of a VGG-motivated CNN network (Simonyan and Zisserman, 2014) (referred to as VGG) and a bi-directional long short-term memory (BLSTM) network as the encoder network. For the splitting point in the hidden vector computation, we consider two architectural variations as follows:

• Split by BLSTM: The hidden vector is split at the level of the BLSTM network. 1) The VGG network generates a single hidden vector H; 2) H is fed into S independent BLSTMs whose parameters are not shared with each other; 3) the output H^u, u = 1, . . . , S, of each independent BLSTM is further separately fed into a unique BLSTM, the same for all outputs. These steps correspond to Eqs. (15), (16), and (17), respectively.

• Split by VGG: The hidden vector is split at the level of the VGG network. The number of filters at the last convolution layer is multiplied by the number of mixtures S in order to split the output into S hidden vectors (as in Eq. (16)). The layers prior to the last VGG layer correspond to the network in Eq. (15), while the subsequent BLSTM layers implement the network in Eq. (17).

4 Experiments

4.1 Experimental setup

We used English and Japanese speech corpora, WSJ (Wall Street Journal) (Consortium, 1994; Garofalo et al., 2007) and CSJ (Corpus of Spontaneous Japanese) (Maekawa, 2003). To show the effectiveness of the proposed models, we generated mixed speech signals from these corpora to simulate single-channel overlapped multi-speaker recordings, and evaluated the recognition performance using the mixed speech data. For WSJ, we used WSJ1 SI284 for training, Dev93 for development, and Eval92 for evaluation. For CSJ, we followed the Kaldi recipe (Moriya et al., 2015) and used the full set of academic and simulated presentations for training, and the standard test sets 1, 2, and 3 for evaluation.

Table 1: Duration (hours) of unmixed and mixed corpora. The mixed corpora are generated by Algorithm 1 in Section B of the supplementary material, using the training, development, and evaluation sets, respectively.

                  TRAIN   DEV.   EVAL
WSJ (UNMIXED)      81.5    1.1    0.7
WSJ (MIXED)        98.5    1.3    0.8
CSJ (UNMIXED)     583.8    6.6    5.2
CSJ (MIXED)       826.9    9.1    7.5

We created new corpora by mixing two utterances with different speakers sampled from the existing corpora. The detailed algorithm is presented in Section B of the supplementary material. The sampled pairs of utterances are mixed at various signal-to-noise ratios (SNR) between 0 dB and 5 dB with a random starting point for the overlap. The durations of the original unmixed and generated mixed corpora are summarized in Table 1.

4.1.1 Network architecture

As input features, we used 80-dimensional log Mel filterbank coefficients with pitch features and their delta and delta-delta features (83 × 3 = 249 dimensions) extracted using Kaldi tools (Povey et al., 2011). The input features are normalized to zero mean and unit variance. As a baseline system, we used a stack of a 6-layer VGG network and a 7-layer BLSTM as the encoder network. Each BLSTM layer has 320 cells in each direction, and is followed by a linear projection layer with 320 units to combine the forward and backward LSTM outputs. The decoder network has a 1-layer LSTM with 320 cells. As described in Section 3.5, we adopted two types of encoder architectures for multi-speaker speech recognition. The network architectures are summarized in Table 2. The split-by-VGG network had speaker-differentiating encoders with a convolution layer (and the following max-pooling layer). The split-by-BLSTM network had speaker-differentiating encoders with two BLSTM layers. The architectures were adjusted to have the same number of layers. We used characters as output labels. The number of characters for WSJ was set to 49, including alphabets and special tokens (e.g., characters for space and unknown). The number of characters for CSJ was set to 3,315, including Japanese Kanji/Hiragana/Katakana characters and special tokens.

Table 2: Network architectures for the encoder network. The number of layers is indicated in parentheses. Encoder_Mix, Encoder_SD^u, and Encoder_Rec correspond to Eqs. (15), (16), and (17).

SPLIT BY   Encoder_Mix   Encoder_SD^u   Encoder_Rec
NO         VGG (6)       —              BLSTM (7)
VGG        VGG (4)       VGG (2)        BLSTM (7)
BLSTM      VGG (6)       BLSTM (2)      BLSTM (5)

4.1.2 Optimization

The network was initialized randomly from a uniform distribution in the range −0.1 to 0.1. We used the AdaDelta algorithm (Zeiler, 2012) with gradient clipping (Pascanu et al., 2013) for optimization. We initialized the AdaDelta hyperparameters as ρ = 0.95 and ε = 10^{−8}. ε is decayed by half when the loss on the development set degrades. The networks were implemented with Chainer (Tokui et al., 2015) and ChainerMN (Akiba et al., 2017). The optimization of the networks was done by synchronous data parallelism with 4 GPUs for WSJ and 8 GPUs for CSJ.

The networks were first trained on single-speaker speech, and then retrained with mixed speech. When training on unmixed speech, only one side of the network (with a single speaker-differentiating encoder) is optimized to output the label sequence of the single speaker. Note that only character labels are used, and there is no need for a clean source reference corresponding to the mixed speech. When moving to mixed speech, the other speaker-differentiating encoders are initialized using the already trained one by copying the parameters with random perturbation, w′ = w × (1 + Uniform(−0.1, 0.1)) for each parameter w. The interpolation weight λ for the multiple objectives in Eqs. (10) and (24) was set to 0.1 for WSJ and to 0.5 for CSJ. Lastly, the model is retrained with the additional negative KL divergence loss in Eq. (28) with η = 0.1.
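The parameter-copy initialization with random perturbation described above, w′ = w × (1 + Uniform(−0.1, 0.1)), can be sketched as follows; the helper name and the use of PyTorch modules are assumptions.

```python
import copy
import torch

def perturbed_copy(trained_sd_encoder: torch.nn.Module, scale: float = 0.1) -> torch.nn.Module:
    """Clone an already-trained speaker-differentiating encoder and perturb
    every parameter multiplicatively: w' = w * (1 + U(-scale, scale))."""
    new_encoder = copy.deepcopy(trained_sd_encoder)
    with torch.no_grad():
        for w in new_encoder.parameters():
            w.mul_(1.0 + torch.empty_like(w).uniform_(-scale, scale))
    return new_encoder
```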

Table 3: Evaluation of unmixed speech without multi-speaker training.

TASK   AVG. CER (%)
WSJ        2.6
CSJ        7.8

4.1.3 Decoding

In the inference stage, we combined a pre-trained RNNLM (recurrent neural network language model) in parallel with the CTC and decoder network. Their label probabilities were linearly combined in the log domain during beam search to find the most likely hypothesis. For the WSJ task, we used both character- and word-level RNNLMs (Hori et al., 2017b), where the character model had a 1-layer LSTM with 800 cells and an output layer for 49 characters. The word model had a 1-layer LSTM with 1000 cells and an output layer for 20,000 words, i.e., the vocabulary size was 20,000. Both models were trained with the WSJ text corpus. For the CSJ task, we used a character-level RNNLM (Hori et al., 2017c), which had a 1-layer LSTM with 1000 cells and an output layer for 3,315 characters. The model parameters were trained with the transcripts of the training set in CSJ. We added language model probabilities with an interpolation factor of 0.6 for the character-level RNNLM and 1.2 for the word-level RNNLM.

The beam width for decoding was set to 20 in all the experiments. The interpolation weight γ in Eqs. (11) and (27) was set to 0.4 for WSJ and 0.5 for CSJ.

4.2 Results

4.2.1 Evaluation of unmixed speech

First, we examined the performance of the baseline joint CTC/attention-based encoder-decoder network with the original unmixed speech data. Table 3 shows the character error rates (CERs), where the baseline model showed 2.6% on WSJ and 7.8% on CSJ. Since the model was trained and evaluated with unmixed speech data, these CERs are considered lower bounds for the CERs in the succeeding experiments with mixed speech data.

4.2.2 Evaluation of mixed speech

Table 4 shows the CERs of the generated mixed speech from the WSJ corpus. The first column indicates the position of the split as mentioned in Section 3.5. The second, third, and fourth columns indicate the CERs of the high-energy speaker (HIGH E. SPK.), the low-energy speaker (LOW E. SPK.), and the average (AVG.), respectively. The baseline model has very high CERs because it was trained as a single-speaker speech recognizer without permutation-free training, and it can only output one hypothesis for each mixed speech signal. In this case, the CERs were calculated by duplicating the generated hypothesis and comparing the duplicated hypotheses with the corresponding references. The proposed models, i.e., the split-by-VGG and split-by-BLSTM networks, obtained significantly lower CERs than the baseline, with the split-by-BLSTM model in particular achieving 14.0% CER. This is an 83.1% relative reduction from the baseline model. The CER was further reduced to 13.7% by retraining the split-by-BLSTM model with the negative KL loss, a 2.1% relative reduction from the network without retraining. This result implies that the proposed negative KL loss provides better separation by actively improving the contrast between the hidden vectors of each speaker. Examples of recognition results are shown in Section C of the supplementary material. Finally, we profiled the computation time for the permutations based on the decoder network and on CTC. Permutation based on CTC was 16.3 times faster than that based on the decoder network, in terms of the time required to determine the best-matching permutation given the encoder network's output in Eq. (17).

Table 4: CER (%) of mixed speech for WSJ.

SPLIT            HIGH E. SPK.   LOW E. SPK.   AVG.
NO (BASELINE)        86.4           79.5      83.0
VGG                  17.4           15.6      16.5
BLSTM                14.6           13.3      14.0
  + KL LOSS          14.0           13.3      13.7

Table 5: CER (%) of mixed speech for CSJ.

SPLIT            HIGH E. SPK.   LOW E. SPK.   AVG.
NO (BASELINE)        93.3           92.1      92.7
BLSTM                11.0           18.8      14.9

Table 5 shows the CERs for the mixed speech from the CSJ corpus. Similarly to the WSJ experiments, our proposed model significantly reduced the CER from the baseline, where the average CER was 14.9% and the reduction ratio from the baseline was 83.9%.

4.2.3 Visualization of hidden vectors

We show a visualization of the encoder network's outputs in Fig. 2 to illustrate the effect of the negative KL loss function. Principal component analysis (PCA) was applied to the hidden vectors along the vertical axis. Figures 2(a) and 2(b) show the hidden vectors generated by the split-by-BLSTM model without the negative KL divergence loss for an example mixture of two speakers. We can observe different activation patterns showing that the hidden vectors were successfully separated into the individual utterances in the mixed speech, although some activity from one speaker can be seen leaking into the other. Figures 2(c) and 2(d) show the hidden vectors generated after retraining with the negative KL divergence loss. We can more clearly observe the different patterns and the boundaries of activation and deactivation of the hidden vectors. The negative KL loss appears to regularize the separation process, and even seems to help in finding the end-points of the speech.

4.2.4 Comparison with earlier work

We first compared the recognition performance with a hybrid (non end-to-end) system consisting of DPCL-based speech separation and a Kaldi-based ASR system. It was evaluated on the same evaluation data and with the same metric as in (Isik et al., 2016), based on the WSJ corpus. However, there are differences in the size of the training data and in the decoding options, so the conditions are not fully matched. Results are shown in Table 6. The word error rate (WER) reported in (Isik et al., 2016) is 30.8%, which was obtained with jointly trained DPCL and second-stage speech enhancement networks. The proposed end-to-end ASR gives an 8.4% relative reduction in WER even though our model does not require any explicit frame-level labels, such as phonetic alignments or clean signal references, and does not use a phonetic lexicon for training. Although this is not a strictly fair comparison, our purely end-to-end system outperformed a hybrid system for multi-speaker speech recognition.

Next, we compared our method with an end-to-end explicit separation and recognition network (Settle et al., 2018). We retrained our model, previously trained on our WSJ-based corpus, using the training data generated by Settle et al. (2018), because direct optimization from scratch on their data led to poor recognition performance due to the data size. Other experimental conditions are shared with the earlier work. Interestingly, our method showed performance comparable to that of the end-to-end explicit separation and recognition network, without having to pre-train using clean signal training references. It remains to be seen whether this parity of performance holds in other tasks and conditions.


Figure 2: Visualization of the two hidden vector sequences at the output of the split-by-BLSTM encoder on a two-speaker mixture. (a,b): Generated by the model without the negative KL loss. (c,d): Generated by the model with the negative KL loss.

Table 6: Comparison with conventional approaches.

METHOD                                                   WER (%)
DPCL + ASR (Isik et al., 2016)                              30.8
Proposed end-to-end ASR                                     28.2

METHOD                                                   CER (%)
End-to-end DPCL + ASR (char LM) (Settle et al., 2018)       13.2
Proposed end-to-end ASR (char LM)                           14.0

5 Related work

Several previous works have considered an explicit two-step procedure (Hershey et al., 2016; Isik et al., 2016; Yu et al., 2017; Chen et al., 2017, 2018). In contrast with our work, which uses a single objective function for ASR, they introduced an objective function to guide the separation of mixed speech.

Qian et al. (2017) trained a multi-speaker speech recognizer using permutation-free training without an explicit objective function for separation. In contrast with our work, which uses an end-to-end architecture, their objective function relies on senone posterior probabilities obtained by aligning unmixed speech and text using a model trained as a recognizer for single-speaker speech. Compared with (Qian et al., 2017), our method directly maps a speech mixture to multiple character sequences and eliminates the need for the corresponding isolated speech sources for training.

6 Conclusions

In this paper, we proposed an end-to-end multi-speaker speech recognizer based on permutation-free training and a new objective function promoting the separation of hidden vectors in order to generate multiple hypotheses. In an encoder-decoder network framework, teacher forcing at the decoder network under multiple references increases the computational cost if implemented naively. We avoided this problem by employing a joint CTC/attention-based encoder-decoder network.

Experimental results showed that the model is able to directly convert an input speech mixture into multiple label sequences under the end-to-end framework without the need for any explicit intermediate representation, including phonetic alignment information or pairwise unmixed speech. We also compared our model with a method based on explicit separation using deep clustering, and showed comparable results. Future work includes data collection and evaluation in a real-world scenario, since the data used in our experiments are simulated mixed speech, which is already extremely challenging but still leaves aside some acoustic aspects, such as Lombard effects and real room impulse responses, that need to be addressed for further performance improvement. In addition, further study is required in terms of increasing the number of speakers that can be simultaneously recognized, and further comparison with the separation-based approach.


References

Takuya Akiba, Keisuke Fukuda, and Shuji Suzuki. 2017. ChainerMN: Scalable distributed deep learning framework. In Proceedings of the Workshop on ML Systems at the Thirty-first Annual Conference on Neural Information Processing Systems (NIPS).

Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio. 2016. End-to-end attention-based large vocabulary speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4945–4949.

Zhehuai Chen, Jasha Droppo, Jinyu Li, and Wayne Xiong. 2018. Progressive joint modeling in unsupervised single-channel overlapped speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(1):184–196.

Zhuo Chen, Yi Luo, and Nima Mesgarani. 2017. Deep attractor network for single-microphone speaker separation. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 246–250.

Jan K. Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. 2015. Attention-based models for speech recognition. In Advances in Neural Information Processing Systems (NIPS), pages 577–585.

Linguistic Data Consortium. 1994. CSR-II (WSJ1) complete. Linguistic Data Consortium, Philadelphia, LDC94S13A.

Martin Cooke, John R. Hershey, and Steven J. Rennie. 2009. Monaural speech separation and recognition challenge. Computer Speech and Language, 24(1):1–15.

John Garofalo, David Graff, Doug Paul, and David Pallett. 2007. CSR-I (WSJ0) complete. Linguistic Data Consortium, Philadelphia, LDC93S6A.

Alex Graves, Santiago Fernandez, Faustino Gomez, and Jurgen Schmidhuber. 2006. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In International Conference on Machine Learning (ICML), pages 369–376.

John R. Hershey, Zhuo Chen, Jonathan Le Roux, and Shinji Watanabe. 2016. Deep clustering: Discriminative embeddings for segmentation and separation. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 31–35.

Takaaki Hori, Shinji Watanabe, and John R. Hershey. 2017a. Joint CTC/attention decoding for end-to-end speech recognition. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL): Human Language Technologies: long papers.

Takaaki Hori, Shinji Watanabe, and John R. Hershey. 2017b. Multi-level language modeling and decoding for open vocabulary end-to-end speech recognition. In IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

Takaaki Hori, Shinji Watanabe, Yu Zhang, and William Chan. 2017c. Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM. In Interspeech, pages 949–953.

Yusuf Isik, Jonathan Le Roux, Zhuo Chen, Shinji Watanabe, and John R. Hershey. 2016. Single-channel multi-speaker separation using deep clustering. In Interspeech, pages 545–549.

Suyoun Kim, Takaaki Hori, and Shinji Watanabe. 2017. Joint CTC-attention based end-to-end speech recognition using multi-task learning. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4835–4839.

Morten Kolbæk, Dong Yu, Zheng-Hua Tan, and Jesper Jensen. 2017. Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(10):1901–1913.

Kikuo Maekawa. 2003. Corpus of Spontaneous Japanese: Its design and evaluation. In ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition.

Takafumi Moriya, Takahiro Shinozaki, and Shinji Watanabe. 2015. Kaldi recipe for Japanese spontaneous speech recognition and its evaluation. In Autumn Meeting of ASJ, 3-Q-7.

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning (ICML), pages 1310–1318.

Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely. 2011. The Kaldi speech recognition toolkit. In IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

Yanmin Qian, Xuankai Chang, and Dong Yu. 2017. Single-channel multi-talker speech recognition with permutation invariant training. arXiv preprint arXiv:1707.06527.

Shane Settle, Jonathan Le Roux, Takaaki Hori, Shinji Watanabe, and John R. Hershey. 2018. End-to-end multi-speaker speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4819–4823.

Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Seiya Tokui, Kenta Oono, Shohei Hido, and Justin Clayton. 2015. Chainer: A next-generation open source framework for deep learning. In Proceedings of the Workshop on Machine Learning Systems (LearningSys) in NIPS.

Dong Yu, Morten Kolbæk, Zheng-Hua Tan, and Jesper Jensen. 2017. Permutation invariant training of deep models for speaker-independent multi-talker speech separation. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 241–245.

Matthew D. Zeiler. 2012. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.


A Architecture of the encoder-decoder network

In this section, we describe the details of the baseline encoder-decoder network, which is further extended for permutation-free training. The encoder network consists of a VGG network and bi-directional long short-term memory (BLSTM) layers. The VGG network has the following 6-layer CNN architecture at the bottom of the encoder network:

Convolution (# in = 3,   # out = 64,  filter = 3×3)
Convolution (# in = 64,  # out = 64,  filter = 3×3)
MaxPooling  (patch = 2×2, stride = 2×2)
Convolution (# in = 64,  # out = 128, filter = 3×3)
Convolution (# in = 128, # out = 128, filter = 3×3)
MaxPooling  (patch = 2×2, stride = 2×2)

The first 3 channels are static, delta, and delta-delta features. Multiple BLSTM layers with a projection layer Lin(·) are stacked after the VGG network. We define one BLSTM layer as the concatenation of a forward LSTM, LSTM_fwd(·), and a backward LSTM, LSTM_bwd(·):

H_fwd = LSTM_fwd(·),   (29)
H_bwd = LSTM_bwd(·),   (30)
H = [Lin(H_fwd); Lin(H_bwd)].   (31)

When the VGG network and the multiple BLSTM layers are represented as VGG(·) and BLSTM(·), the encoder network in Eq. (2) maps the input feature sequence O to the internal representation H as follows:

H = Encoder(O) = BLSTM(VGG(O)).   (32)
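For reference, a PyTorch rendering of the 6-layer VGG front-end listed above might look as follows; the ReLU activations and "same" padding are assumptions not stated in the listing, and the three input channels carry the static, delta, and delta-delta features.

```python
import torch.nn as nn

# illustrative sketch of the 6-layer VGG front-end (4 convolutions + 2 max-pooling layers)
vgg_frontend = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)
```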

The decoder network sequentially generates the n-th label y_n by taking the context vector c_n and the label history y_{1:n−1}:

y_n ∼ Decoder(c_n, y_{n−1}).   (33)

The context vector is calculated by a location-based attention mechanism (Chorowski et al., 2015), which weights and sums the C-dimensional sequence of representations H = (h_l ∈ R^C | l = 1, . . . , L) with attention weights a_{n,l}:

c_n = Attention(a_{n−1}, e_n, H)   (34)
    ≜ ∑_{l=1}^{L} a_{n,l} h_l.   (35)

Algorithm 1 Generation of the multi-speaker speech dataset

n_reuse ⇐ maximum number of times the same utterance can be used
U ⇐ utterance set of the corpus
C_k ⇐ n_reuse for each utterance U_k ∈ U
for U_k ∈ U do
    P(U_k) = C_k / ∑_l C_l
end for
for U_i ∈ U do
    Sample utterance U_j from P(U), ensuring the speakers of U_i and U_j are different
    Mix utterances U_i and U_j
    if C_j > 0 then
        C_j = C_j − 1
        for U_k ∈ U do
            P(U_k) = C_k / ∑_l C_l
        end for
    end if
end for

The location-based attention mechanism defines the weights a_{n,l} as follows:

a_{n,l} = exp(α k_{n,l}) / ∑_{l′=1}^{L} exp(α k_{n,l′}),   (36)
k_{n,l} = w^T tanh(V^E e_{n−1} + V^H h_l + V^F f_{n,l} + b),   (37)
f_n = F ∗ a_{n−1},   (38)

where w, V^E, V^H, V^F, b, and F are tunable parameters, α is a constant value called the inverse temperature, and ∗ is the convolution operation. We used 10 convolution filters of width 200, and set α to 2. The introduction of f_n makes the attention mechanism take into account the previous alignment information. The hidden state e is updated recursively by an updating LSTM function:

e_n = Update(e_{n−1}, c_{n−1}, y_{n−1})   (39)
    ≜ LSTM( Lin(e_{n−1}) + Lin(c_{n−1}) + Emb(y_{n−1}) ),   (40)

where Emb(·) is an embedding function.
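To make the location-based attention of Eqs. (36)-(38) concrete, a sketch is given below. The 10 filters of width 200 and the inverse temperature α = 2 follow the text; the hidden dimensions, the "same"-style padding of the location convolution, and the single-utterance (unbatched) interface are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationAttention(nn.Module):
    def __init__(self, enc_dim=320, dec_dim=320, att_dim=320,
                 n_filters=10, filter_width=200, alpha=2.0):
        super().__init__()
        self.V_E = nn.Linear(dec_dim, att_dim, bias=False)
        self.V_H = nn.Linear(enc_dim, att_dim, bias=False)
        self.V_F = nn.Linear(n_filters, att_dim, bias=False)
        self.w = nn.Linear(att_dim, 1, bias=True)        # w^T tanh(...) + b, Eq. (37)
        self.conv = nn.Conv1d(1, n_filters, filter_width,
                              padding=filter_width // 2)  # f_n = F * a_{n-1}, Eq. (38)
        self.alpha = alpha

    def forward(self, a_prev, e_prev, H):
        """a_prev: (L,) previous attention weights; e_prev: (dec_dim,); H: (L, enc_dim)."""
        L = H.size(0)
        f = self.conv(a_prev.view(1, 1, -1)).squeeze(0).transpose(0, 1)[:L]  # (L, n_filters)
        k = self.w(torch.tanh(self.V_E(e_prev) + self.V_H(H) + self.V_F(f))).squeeze(-1)
        a = F.softmax(self.alpha * k, dim=0)             # Eq. (36) with inverse temperature
        c = (a.unsqueeze(-1) * H).sum(dim=0)             # context vector, Eq. (35)
        return c, a

# toy usage: 50 encoder frames
att = LocationAttention()
c, a = att(torch.full((50,), 1.0 / 50), torch.zeros(320), torch.randn(50, 320))
```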

B Generation of mixed speech

Each utterance of the corpus is mixed with a randomly selected utterance, with a probability P(U_k) that moderates over-selection of specific utterances. P(U_k) is calculated in the first for-loop as a uniform probability. All utterances are used as one side of the mixture, and the other side is sampled from the distribution P(U_k) in the second for-loop. The selected pairs of utterances are mixed at various signal-to-noise ratios (SNR) between 0 dB and 5 dB. We randomized the starting point of the overlap by padding the shorter utterance with silence whose duration is sampled from the uniform distribution within the length difference between the two utterances. Therefore, the duration of the mixed utterance is equal to that of the longer of the two unmixed utterances. After the generation of the mixed speech, the count C_j of the selected utterance is decremented to prevent over-selection. All counts C are initialized to n_reuse, and we used n_reuse = 3.
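The mixing step itself (a random SNR between 0 and 5 dB and a random overlap start implemented by padding the shorter utterance with silence) might be sketched as follows in NumPy; this is an illustrative reimplementation, not the authors' script.

```python
import numpy as np

def mix_pair(x1, x2, rng=np.random):
    """x1, x2: 1-D waveforms of two different speakers (x1 is kept at its level)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    snr_db = rng.uniform(0.0, 5.0)
    # scale x2 so that 10*log10(P(x1)/P(x2)) equals the sampled SNR
    p1, p2 = np.mean(x1 ** 2), np.mean(x2 ** 2)
    x2 = x2 * np.sqrt(p1 / (p2 * 10.0 ** (snr_db / 10.0)))
    short, long_ = (x1, x2) if len(x1) < len(x2) else (x2, x1)
    offset = rng.randint(0, len(long_) - len(short) + 1)   # random overlap start
    padded = np.zeros_like(long_)
    padded[offset:offset + len(short)] = short
    return long_ + padded   # duration equals that of the longer utterance
```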


Table 7: Examples of recognition results. Errors are emphasized as capital letters. " " is a space character, and a special token "*" is inserted to pad deletion errors.

(1) Model w/ permutation-free training (CER of HYP1: 12.8%, HYP2: 0.9%)
HYP1: t h e s h u t t l e * * * I S I N t h e f i r s t t H E l i f E o * f s i n c e t h e n i n e t e e n e i g h t y s i x c h a l l e n g e r e x p l o s i o n
REF1: t h e s h u t t l e W O U L D B E t h e f i r s t t * O l i f T o F f s i n c e t h e n i n e t e e n e i g h t y s i x c h a l l e n g e r e x p l o s i o n
HYP2: t h e e x p a n d e d r e c a l l w a s d i s c l o s e d a t a m e e t i n g w i t h n . r . c . o f f i c i a l s a t a n a g e n c y o f f i c e o u t s i d e c h i c a g o
REF2: t h e e x p a n d e d r e c a l l w a s d i s c l o s e d a t a m e e t i n g w i t h n . r . c . o f f i c i a l s a t a n a g e n c y o f f i c e o u t s i d e c h i c a g o

(2) Model w/ permutation-free training (CER of HYP1: 91.7%, HYP2: 38.9%)
HYP1: I T W A S L a s t r * A I S e * D * I N J U N E N I N E t E e N e * I G h T Y f I V e T O * T H I R T Y
REF1: * * * * * * * * a s t * r O N O M e R S S A Y T H A T * * * * t H e * e A R T h ' S f A T e I S S E A L E D
HYP2: * * * * a N D * * s t * r O N G e R S S A Y T H A T * * * * t H e * e * A R t H f A T e I S t o f o r t y f i v e d o l l a r s f r o m t h i r t y f i v e d o l l a r s
REF2: I T W a * S L A s t r A I S e * D * I N J U N E N I N E t E e N e I G H t Y f I V e * * * t o f o r t y f i v e d o l l a r s f r o m t h i r t y f i v e d o l l a r s


C Examples of recognition results and error analysis

Table 7 shows examples of recognition results. The first example (1) is representative of a large portion of the evaluation set. The SNR of HYP1 is −1.55 dB and that of HYP2 is 1.55 dB. The network generates multiple hypotheses with a few substitution and deletion errors, but without any overlapped or swapped words. The second example (2) is one which leads to reduced performance. We can see that the network makes errors when there is a large difference in length between the two sequences. The word "thirty" of HYP2 is injected into HYP1, and there are deletion errors in HYP2. We added the negative KL divergence loss to mitigate such errors. However, there is further room to reduce errors by making the unshared modules more cooperative.