
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2442–2452, July 5–10, 2020. ©2020 Association for Computational Linguistics

Neural Generation of Dialogue Response Timings

Matthew Roddy and Naomi Harte
ADAPT Centre, School of Engineering
Trinity College Dublin, Ireland
{roddym,nharte}@tcd.ie

Abstract

The timings of spoken response offsets in human dialogue have been shown to vary based on contextual elements of the dialogue. We propose neural models that simulate the distributions of these response offsets, taking into account the response turn as well as the preceding turn. The models are designed to be integrated into the pipeline of an incremental spoken dialogue system (SDS). We evaluate our models using offline experiments as well as human listening tests. We show that human listeners consider certain response timings to be more natural based on the dialogue context. The introduction of these models into SDS pipelines could increase the perceived naturalness of interactions. [1]

1 Introduction

The components needed for the design of spoken dialogue systems (SDSs) that can communicate in a realistic human fashion have seen rapid advancements in recent years (e.g. Li et al. (2016); Zhou et al. (2018); Skerry-Ryan et al. (2018)). However, an element of natural spoken conversation that is often overlooked in SDS design is the timing of system responses. Many turn-taking components for SDSs are designed with the objective of avoiding interrupting the user while keeping the lengths of gaps and overlaps as low as possible, e.g. Raux and Eskenazi (2009). This approach does not emulate naturalistic response offsets, since in human-human conversation the distributions of response timing offsets have been shown to differ based on the context of the first speaker's turn and the context of the addressee's response (Sacks et al., 1974; Levinson and Torreira, 2015; Heeman and Lunsford, 2017). It has also been shown that listeners have different anticipations about upcoming responses based on the length of a silence before a response (Bögels et al., 2019). If we wish to realistically generate offset distributions in SDSs, we need to design response timing models that take into account the context of the user's speech and the upcoming system response. For example, offsets where the first speaker's turn is a backchannel occur in overlap more frequently (Levinson and Torreira, 2015). It has also been observed that dispreferred responses (responses that are not in line with the suggested action in the prior turn) are associated with longer delays (Kendrick and Torreira, 2015; Bögels et al., 2019).

[1] Our code is available at https://github.com/mattroddy/RTNets.

Figure 1: Overview of how our model generates the distribution of turn-switch offset timings using an encoding of a dialogue system response h_z, and features extracted from the user's speech x_n.


Overview We propose a neural model for generating these response timings in SDSs (shown in Fig. 1). The response timing network (RTNet) operates using both acoustic and linguistic features extracted from user and system turns. The two main components are an encoder, which encodes the system response h_z, and an inference network, which takes a concatenation of user features (x_n) and h_z. RTNet operates within an incremental SDS framework (Schlangen and Skantze, 2011) where information about upcoming system responses may be available before the user has finished speaking. RTNet also functions independently of higher-level turn-taking decisions that are traditionally made by the dialogue manager (DM) component. Typically, the DM decides when the system should take a turn and also supplies the natural language generation (NLG) component with a semantic representation of the system response (e.g. intents, dialogue acts, or an equivalent neural representation). Any of the system response representations that are downstream from the DM's output representation (e.g. lexical or acoustic features) can potentially be used to generate the response encoding. Therefore, we assume that the decision for the system to take a turn has already been made by the DM and our objective is to predict (on a frame-by-frame basis) the appropriate time to trigger the system turn.

It may be impractical in an incremental framework to generate a full system response and then re-encode it using the response encoder of RTNet. To address this issue, we propose an extension of RTNet that uses a variational autoencoder (VAE) (Kingma and Welling, 2014) to train an interpretable latent space which can be used to bypass the encoding process at inference time. This extension (RTNet-VAE) allows the benefit of having a data-driven neural representation of response encodings that can be manipulated without the overhead of the encoding process. This representation can be manipulated by the DM using vector algebra in a flexible manner to generate appropriate timings for a given response.

Our model's architecture is similar to VAEs with recurrent encoders and decoders proposed in Bowman et al. (2016); Ha and Eck (2018); Roberts et al. (2018). Our use of a VAE to cluster dialogue acts is similar to the approach used in Zhao et al. (2017). Our vector-based representation of dialogue acts takes inspiration from the 'attribute vectors' used in Roberts et al. (2018) for learning musical structure representations. Our model is also related to continuous turn-taking systems (Skantze, 2017) in that our model is trained to predict future speech behavior on a frame-by-frame basis. The encoder uses a multiscale RNN architecture similar to the one proposed in Roddy et al. (2018) to fuse information across modalities. Models that intentionally generate responsive overlap have been proposed in DeVault et al. (2011); Dethlefs et al. (2012), and other models have been proposed that generate appropriate response timings for fillers (Nakanishi et al., 2018; Lala et al., 2019) and backchannels (Morency et al., 2010; Meena et al., 2014; Lala et al., 2017).

This paper is structured as follows: First, we present how our dataset is structured and our training objective. Then, in Sections 2.1 and 2.2 we present details of our two models, RTNet and RTNet-VAE. Section 2.3 presents our input feature representations. In Section 2.4 we discuss our training and testing procedures. In Sections 3.1 and 3.2 we analyze the performance of both RTNet and RTNet-VAE. Finally, in Section 4 we present the results of a human listener test.

2 Methodology

Dataset Our dataset is extracted from the Switchboard-1 Release 2 corpus (Godfrey and Holliman, 1997). Switchboard has 2438 dyadic telephone conversations with a total length of approximately 260 hours. The dataset consists of pairs of adjacent turns by different speakers, which we refer to as turn pairs (shown in Fig. 2). Turn pairs are automatically extracted from orthographic annotations using the following procedure: We extract frame-based speech-activity labels for each speaker using a frame step-size of 50 ms. The frame-based representation is used to partition each person's speech signal into interpausal units (IPUs). We define IPUs as segments of speech by a person that are separated by pauses of 200 ms or greater. IPUs are then used to automatically extract turns, which we define as consecutive IPUs by a speaker in which there is no speech by the other speaker in the silence between the IPUs. A turn pair is then defined as being any two adjacent turns by different speakers. The earlier of the two turns in a pair is considered to be the user turn and the second is considered to be the system turn.
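To make the segmentation procedure concrete, the sketch below works from per-speaker binary voice-activity arrays at the 50 ms frame step described above; the function names and the final pairing step are illustrative and not taken from the released RTNets code.

```python
import numpy as np

FRAME_MS = 50                       # frame step of the speech-activity labels
PAUSE_FRAMES = 200 // FRAME_MS      # pauses of 200 ms or more split IPUs

def extract_ipus(vad):
    """Return (start, end) frame spans of interpausal units (IPUs) from a
    binary voice-activity array for one speaker (end is exclusive)."""
    ipus, start, silence = [], None, 0
    for i, active in enumerate(vad):
        if active:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= PAUSE_FRAMES:
                ipus.append((start, i - silence + 1))
                start, silence = None, 0
    if start is not None:
        ipus.append((start, len(vad)))
    return ipus

def ipus_to_turns(ipus, other_vad):
    """Merge consecutive IPUs into turns when the other speaker is silent
    during the gap between them; turns by different speakers can then be
    paired chronologically into (user turn, system turn) pairs."""
    turns = [list(ipus[0])] if ipus else []
    for s, e in ipus[1:]:
        prev_s, prev_e = turns[-1]
        if not np.any(other_vad[prev_e:s]):   # no interlocutor speech in the gap
            turns[-1][1] = e                  # extend the current turn
        else:
            turns.append([s, e])
    return [tuple(t) for t in turns]
```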

Figure 2: Segmentation of data into turn pairs, and how the inference LSTM makes predictions.

Training Objective Our training objective is to predict the start of the system turn one frame ahead of the ground truth start time. The target labels in each turn pair are derived from the ground truth speech activity labels as shown in Fig. 2. Each 50 ms frame has a label y ∈ {0, 1}, which consists of the ground truth voice activity shifted to the left by one frame. As shown in the figure, we only include frames in the span R in our training loss. We define the span R as the frames from the beginning of the last IPU in the user turn to the frame immediately prior to the start of the system turn.

We do not predict at earlier frames since we assume that at these mid-turn pauses the DM has not decided to take a turn yet, either because it expects the user to continue, or because it has not formulated one yet. As mentioned previously in Section 1, we design RTNet to be abstracted from the turn-taking decisions themselves. If we were to include pauses prior to the turn-final silence, our response generation system would be additionally burdened with making turn-taking decisions, namely, classifying between mid-turn pauses and end-of-turn silences. We therefore make the modelling assumption that the system's response is formulated at some point during the user's turn-final IPU. To simulate this assumption we sample an index R_START from the span of R using a uniform distribution. We then use the reduced set of frames from R_START to R_END in the calculation of our loss.
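A minimal sketch of how this loss region could be handled for a single turn pair, assuming per-frame probabilities and 0/1 labels; the function and argument names are hypothetical.

```python
import torch
import torch.nn.functional as F

def span_bce_loss(pred_probs, targets, ipu_start, system_start, training=True):
    """Binary cross-entropy restricted to the span R of one turn pair.

    pred_probs, targets: (T,) float tensors of per-frame probabilities / labels.
    ipu_start:    first frame of the user's turn-final IPU (start of span R).
    system_start: first frame of the system turn; R ends one frame before it.
    During training, R_START is drawn uniformly from R to simulate the DM
    formulating its response at some point during the turn-final IPU.
    """
    r_end = system_start - 1
    if training:
        r_start = torch.randint(ipu_start, r_end + 1, (1,)).item()
    else:
        r_start = ipu_start                      # fixed for reproducible test losses
    return F.binary_cross_entropy(pred_probs[r_start:r_end + 1],
                                  targets[r_start:r_end + 1])
```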

2.1 Response Timing Network (RTNet)

Encoder The encoder of RTNet (shown in Fig. 3) fuses the acoustic and linguistic modalities from a system response using three bi-directional LSTMs. Each modality is processed at independent timescales and then fused in a master Bi-LSTM which operates at the linguistic temporal rate. The output of the master Bi-LSTM is a sequence of encodings h_0, h_1, ..., h_I, where each encoding is a concatenation of the forward and backward hidden states of the master Bi-LSTM at each word index.

The linguistic Bi-LSTM takes as input the sequence of 300-dimensional embeddings of the tokenized system response. We use three special tokens: SIL, WAIT, and NONE. The SIL token is used whenever there is a gap between words that is greater than the frame-size (50 ms). The WAIT and NONE tokens are inserted as the first and last tokens of the system response sequence, respectively. The concatenation [h_0; h_1; h_I] is passed as input to a RELU layer (we refer to this layer as the reduction layer) which outputs the h_z encoding. The h_z encoding is used (along with user features) in the concatenated input to the inference network. Since the WAIT embedding corresponds to the h_0 output of the master Bi-LSTM and the NONE embedding corresponds to h_I, the two embeddings serve as "triggering" symbols that allow the linguistic and master Bi-LSTMs to output relevant information accumulated in their cell states.

The acoustic Bi-LSTM takes as input the sequence of acoustic features and outputs a sequence of hidden states at every 50 ms frame. As shown in Fig. 3, we select the acoustic hidden states that correspond to the starting frame of each linguistic token and concatenate them with the linguistic hidden states. Since there are no acoustic features available for the WAIT and NONE tokens, we train two embeddings to replace these acoustic LSTM states (shown in purple in Fig. 3). The use of acoustic embeddings results in there being no connection between the WAIT acoustic embedding and the first acoustic hidden state. For this reason we include h_1 in the [h_0; h_1; h_I] concatenation, in order to make it easier for information captured by the acoustic Bi-LSTM to be passed through to the final concatenation.
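The encoder described above could be wired up roughly as follows in PyTorch. LSTM sizes follow Section 2.4 where they are given; the size of h_z, the 57-dimensional acoustic input (40 log-mels plus 17 GeMAPS features), and the exact handling of the WAIT/NONE acoustic embeddings are assumptions of this sketch rather than a description of the released implementation.

```python
import torch
import torch.nn as nn

class ResponseEncoder(nn.Module):
    """Rough sketch of the three-Bi-LSTM response encoder."""
    def __init__(self, d_acoustic=57, d_embed=300, d_ac=256, d_ling=256,
                 d_master=512, d_z=1024):
        super().__init__()
        self.acoustic = nn.LSTM(d_acoustic, d_ac, bidirectional=True, batch_first=True)
        self.linguistic = nn.LSTM(d_embed, d_ling, bidirectional=True, batch_first=True)
        self.master = nn.LSTM(2 * d_ac + 2 * d_ling, d_master,
                              bidirectional=True, batch_first=True)
        self.reduction = nn.Sequential(nn.Linear(3 * 2 * d_master, d_z), nn.ReLU())
        # trainable stand-ins for the missing acoustic states of WAIT / NONE
        self.wait_acoustic = nn.Parameter(torch.zeros(2 * d_ac))
        self.none_acoustic = nn.Parameter(torch.zeros(2 * d_ac))

    def forward(self, frames, embeds, word_start_frames):
        # frames: (1, T, d_acoustic) 50 ms acoustic frames of the response
        # embeds: (1, I+1, d_embed) token embeddings: WAIT, words..., NONE
        # word_start_frames: start-frame index of each real word token
        ac_h, _ = self.acoustic(frames)                 # (1, T, 2*d_ac)
        ling_h, _ = self.linguistic(embeds)             # (1, I+1, 2*d_ling)
        ac_words = ac_h[:, word_start_frames, :]        # acoustic states at word starts
        ac_all = torch.cat([self.wait_acoustic.expand(1, 1, -1),
                            ac_words,
                            self.none_acoustic.expand(1, 1, -1)], dim=1)
        h, _ = self.master(torch.cat([ling_h, ac_all], dim=-1))
        h0, h1, hI = h[:, 0], h[:, 1], h[:, -1]
        return self.reduction(torch.cat([h0, h1, hI], dim=-1))   # h_z
```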

Figure 3: The encoder is three stacked Bi-LSTMs. We use special embeddings (shown in purple) to represent the acoustic states corresponding to the first and last tokens (WAIT and NONE) of the system's turn.

Figure 4: VAE.

Inference Network The aim of our inference network is to predict a sequence of output probabilities Y = [y_{R_START}, y_{R_START+1}, ..., y_N] using a response encoding h_z and a sequence of user features X = [x_0, x_1, ..., x_N]. We use a single-layer LSTM (shown in Fig. 2), followed by a sigmoid layer, to produce the output probabilities:

[h_n; c_n] = LSTM_inf([x_n; h_z], [h_{n-1}; c_{n-1}])
y_n = σ(W_h h_n + b_h)

Since there are only two possible output values in a generated sequence, {0, 1}, and the sequence ends once we predict 1, the inference network can be considered an autoregressive model where 0 is passed implicitly to the subsequent time-step. To generate an output sequence, we can sample from the distribution p(y_n = 1 | y_{R_START} = 0, y_{R_START+1} = 0, ..., y_{n-1} = 0, X_{0:n}, h_z) using a Bernoulli random trial at each time-step. For frames prior to R_START the output probability is fixed to 0, since R_START is the point where the DM has formulated the response. During training we minimize the binary cross-entropy loss (L_BCE) between our ground truth objective and our output predictions Y.
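A sketch of the inference network and the frame-by-frame Bernoulli sampling it supports; the class and function names are illustrative.

```python
import torch
import torch.nn as nn

class InferenceNet(nn.Module):
    """At each 50 ms frame, consume [x_n; h_z] and output p(y_n = 1),
    the probability of triggering the system response at that frame."""
    def __init__(self, d_user, d_z, d_hidden=1024):
        super().__init__()
        self.lstm = nn.LSTM(d_user + d_z, d_hidden, batch_first=True)
        self.out = nn.Linear(d_hidden, 1)

    def forward(self, x, h_z):
        # x: (1, N, d_user) user features; h_z: (1, d_z) response encoding
        z = h_z.unsqueeze(1).expand(-1, x.size(1), -1)
        h, _ = self.lstm(torch.cat([x, z], dim=-1))
        return torch.sigmoid(self.out(h)).squeeze(-1)      # (1, N) probabilities

@torch.no_grad()
def sample_trigger_frame(probs, r_start):
    """Bernoulli trial at each frame from R_START onward; frames before
    R_START are skipped, matching the fixed zero probability."""
    for n in range(r_start, probs.size(1)):
        if torch.bernoulli(probs[0, n]).item() == 1.0:
            return n                  # frame at which the response is triggered
    return None                       # no trigger within the (padded) window
```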

2.2 RTNet-VAE

Motivation A limitation of RTNet is that it may be impractical to encode system turns before triggering a response. For example, if we wish to apply RTNet using generated system responses, at run-time the RTNet component would have to wait for the full response to be generated by the NLG, which would result in a computational bottleneck. If the NLG system is incremental, it may also be desirable for the system to start speaking before the entirety of the system response has been generated.

VAE To address this, we bypass the encoding stage by directly using the semantic representation output from the DM to control the response timing encodings. We do this by replacing the reduction layer with a VAE (Fig. 4). To train the VAE, we use the same concatenation of encoder hidden states as in the RTNet reduction layer ([h_0; h_1; h_I]). We use a dimensionality-reduction RELU layer to calculate h_reduce, which is then split into μ and σ̂ components via two more RELU layers. σ̂ is passed through an exponential function to produce σ, a non-negative standard deviation parameter. We sample the latent variable z with the standard VAE method using μ, σ, and a random vector from the standard normal distribution N(0, I). A dimensionality-expansion RELU layer is used to transform z into the response encoding h_z, which has the same dimensionality as the output of the encoder:

h_reduce = RELU(W_reduce [h_0; h_1; h_I] + b_reduce)
μ = RELU(W_μ h_reduce + b_μ)
σ̂ = RELU(W_σ h_reduce + b_σ)
σ = exp(σ̂ / 2)
z = μ + σ ⊙ N(0, I)
h_z = RELU(W_expand z + b_expand)

We impose a Gaussian prior over the latent space using a Kullback-Leibler (KL) divergence loss term:

L_KL = −(1 / 2N_z) Σ (1 + σ̂ − μ² − exp(σ̂))

The L_KL loss measures the distance of the generated distribution from a Gaussian with zero mean and unit variance. L_KL is combined with L_BCE using a weighted sum:

L = L_BCE + w_KL · L_KL

As we increase the value of w_KL we increasingly enforce the Gaussian prior on the latent space. In doing so our aim is to learn a smooth latent space in which similar types of responses are organized in similar areas of the space.
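A compact sketch of this VAE bottleneck, mirroring the equations above (including the RELU layers applied to μ and σ̂ and σ = exp(σ̂/2)); the input and output sizes here are assumptions.

```python
import torch
import torch.nn as nn

class ReductionVAE(nn.Module):
    """VAE that stands in for the reduction layer of the encoder."""
    def __init__(self, d_in=3 * 1024, d_hidden=512, d_z=4, d_out=1024):
        super().__init__()
        self.reduce = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU())
        self.to_mu = nn.Sequential(nn.Linear(d_hidden, d_z), nn.ReLU())
        self.to_sigma_hat = nn.Sequential(nn.Linear(d_hidden, d_z), nn.ReLU())
        self.expand = nn.Sequential(nn.Linear(d_z, d_out), nn.ReLU())

    def forward(self, h_cat):                       # h_cat = [h_0; h_1; h_I]
        h_reduce = self.reduce(h_cat)
        mu = self.to_mu(h_reduce)
        sigma_hat = self.to_sigma_hat(h_reduce)
        sigma = torch.exp(sigma_hat / 2)            # non-negative std deviation
        z = mu + sigma * torch.randn_like(sigma)    # reparameterization trick
        h_z = self.expand(z)
        # L_KL = -(1 / 2 N_z) * sum(1 + sigma_hat - mu^2 - exp(sigma_hat))
        kl = -(1 + sigma_hat - mu ** 2 - torch.exp(sigma_hat)).sum(-1) / (2 * mu.size(-1))
        return h_z, z, kl

# combined objective, as in the text: L = L_BCE + w_KL * L_KL
```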

Latent Space During inference we can skip the encoding stage of RTNet-VAE and sample z directly from the latent space on the basis of the input semantic representation from the dialogue manager. Our sampling approach is to approximate the distribution of latent variables for a given response type using Gaussians. For example, if we have a collection of labelled backchannel responses (and their corresponding z encodings) we can approximate the distribution p(z | label = backchannel) using an isotropic Gaussian by simply calculating μ_backchannel and σ_backchannel, the maximum-likelihood mean and standard deviation of each of the z dimensions. These vectors can also be used to calculate directions in the latent space with different semantic characteristics and then interpolate between them.
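Fitting and using these per-dialogue-act Gaussians reduces to a few lines; the helper names below are illustrative.

```python
import numpy as np

def fit_dialogue_act(z_samples):
    """Isotropic Gaussian approximation of p(z | dialogue act) from the
    latent codes of labelled responses; z_samples has shape (N, d_z)."""
    return z_samples.mean(axis=0), z_samples.std(axis=0)

def sample_latent(mu, sigma, rng=None):
    """Draw a latent code for a dialogue act; it can then be expanded
    into h_z without running the response encoder."""
    rng = rng or np.random.default_rng()
    return rng.normal(mu, sigma)

def interpolate(params_a, params_b, alpha):
    """Linear interpolation between two dialogue-act parameterizations,
    e.g. between 'reject' and 'agree-accept' vectors (alpha in [0, 1])."""
    (mu_a, sg_a), (mu_b, sg_b) = params_a, params_b
    return (1 - alpha) * mu_a + alpha * mu_b, (1 - alpha) * sg_a + alpha * sg_b
```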

2.3 Input Feature Representations

Linguistic Features We use the word annotations from the ms-state transcriptions as linguistic features. These annotations give us the timing for the starts and ends of all words in the corpus. As our feature representation, we use 300-dimensional word embeddings that are initialized with GloVe vectors (Pennington et al., 2014) and then jointly optimized with the rest of the network. In total there are 30080 unique words in the annotations. We reduced the number of embeddings to 10000 by merging embeddings that had low word counts with the closest neighbouring embedding (calculated using cosine distance).

We also introduce four additional tokens that are specific to our task: SIL, WAIT, NONE, and UNSPEC. SIL is used whenever there is a silence. WAIT and NONE are used at the start and end of all the system encodings, respectively. The use of UNSPEC (unspecified) is shown in Fig. 5. UNSPEC was introduced to represent temporal information in the linguistic embeddings. We approximate the processing delay in ASR by delaying the annotation by 100 ms after the ground truth frame where the user's word ended. This 100 ms delay was proposed in Skantze (2017) as a necessary assumption for modelling linguistic features in offline continuous systems. However, since voice activity detection (VAD) can supply an estimate of when a word has started, we propose that we can use this information to supply the network with the UNSPEC embedding 100 ms after the word has started.
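One possible realization of this timing scheme is sketched below; how long a triggered word embedding persists before reverting to SIL is simplified here, and the names are illustrative.

```python
FRAME_MS = 50
ASR_DELAY_MS = 100      # assumed ASR processing delay

def user_frame_tokens(words, num_frames):
    """Per-frame linguistic tokens for the user's speech.

    words: list of (token, start_ms, end_ms) word annotations.
    A word's embedding is triggered 100 ms after the word ends; from 100 ms
    after the word starts until then, the frames carry UNSPEC. Remaining
    frames are SIL. (Persistence of a triggered embedding is simplified
    to a single frame in this sketch.)
    """
    frames = ["SIL"] * num_frames
    for token, start_ms, end_ms in words:
        unspec_from = (start_ms + ASR_DELAY_MS) // FRAME_MS
        trigger = (end_ms + ASR_DELAY_MS) // FRAME_MS
        for f in range(unspec_from, min(trigger, num_frames)):
            frames[f] = "UNSPEC"        # word being spoken, not yet recognized
        if trigger < num_frames:
            frames[trigger] = token     # embedding becomes available here
    return frames
```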

Acoustic Features We combine 40 log-mel filterbanks and 17 features from the GeMAPS feature set (Eyben et al., 2016). The GeMAPS features are the complete set excluding the MFCCs (e.g. pitch, intensity, spectral flux, jitter, etc.). Acoustic features were extracted using a 50 ms framestep.
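The filterbank part of this front end could be computed with librosa as below; the analysis window length is an assumption, and the 17 GeMAPS descriptors would come from a separate toolkit such as openSMILE (not shown).

```python
import librosa

def log_mel_features(wav_path, frame_ms=50, n_mels=40):
    """40 log-mel filterbank features at a 50 ms frame step. The 17 GeMAPS
    low-level descriptors would be appended separately; that step is omitted."""
    y, sr = librosa.load(wav_path, sr=8000)   # Switchboard is 8 kHz telephone speech
    hop = int(sr * frame_ms / 1000)           # 400 samples per 50 ms frame
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2 * hop,
                                         hop_length=hop, n_mels=n_mels)
    return librosa.power_to_db(mel).T         # shape: (num_frames, 40)
```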

2.4 Experimental Settings

Training and Testing Procedures The training, validation, and test sets consist of 1646, 150, and 642 conversations respectively, with 151595, 13910, and 58783 turn pairs. The test set includes all of the conversations from the NXT-format annotations (Calhoun et al., 2010), which include references to the Switchboard Dialog Act Corpus (SWDA) (Stolcke et al., 2000) annotations. We include the entirety of the NXT annotations in our test set so that we have enough labelled dialogue act samples to analyse the distributions.

We used the following hyperparameter settings in our experiments: the inference, acoustic, linguistic, and master LSTMs had hidden sizes of 1024, 256, 256, and 512, respectively. We used a latent variable size of 4, a batch size of 128, and L2 regularization of 1e-05. We used the Adam optimizer with an initial learning rate of 5e-04. We trained each model for 15000 iterations, with learning rate reductions by a factor of 0.1 after 9000, 11000, 13000, and 14000 iterations.
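In PyTorch these settings could be wired up roughly as follows; treating the L2 regularization as Adam weight decay and the `train_step` callable are assumptions of the sketch.

```python
import torch

def configure_training(model):
    """Optimizer and step-wise LR schedule matching the settings above."""
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, weight_decay=1e-5)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[9000, 11000, 13000, 14000], gamma=0.1)
    return optimizer, scheduler

def train(model, train_step, num_iterations=15000):
    """train_step is a placeholder callable that runs one forward/backward
    pass and an optimizer step on a batch, returning the loss."""
    optimizer, scheduler = configure_training(model)
    for _ in range(num_iterations):
        loss = train_step(model, optimizer)
        scheduler.step()              # milestones are counted in iterations here
    return model
```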

While we found that randomizing R_START during training was important for the reasons given in Section 2, it presented issues for the stability and reproducibility of our evaluation and test results for L_BCE and L_KL. We therefore randomize during training and sampling, but when calculating the test losses (reported in Table 1) we fix R_START to be the first frame of the user's turn-final IPU.

We also calculate the mean absolute error (MAE), given in seconds, from the ground truth response offsets to the generated output offsets. When sampling for the calculation of MAE, it is necessary to increase the length of the turn pair since the response time may be triggered by the sampling process after the ground truth time. We therefore pad the user's features with 80 extra frames in which we simulate silence artificially using acoustic features. During sampling, we use the same R_START randomization process that was used during training, rather than fixing it to the start of the user's turn-final IPU. For each model we perform the sampling procedure on the test set three times and report the mean error in Table 1.

Figure 5: The user's linguistic feature representation scheme. The embedding for each word is triggered 100 ms after the ground truth end of the word, to simulate ASR delay. The UNSPEC embedding begins 100 ms after a word's start frame and holds information about whether a word is being spoken (before it has been recognized) and the length of each word.

Best Fixed Probability To the best of our knowledge, there are no other published models that we can directly compare ours to. However, we can calculate the best performance that can be achieved using a fixed value for y. The best possible fixed y for a given turn pair is:

y_tp = 1 / ((R_END − R_START) / FrameLength)

The best fixed y for a set of turn pairs is given by the expected value of y_tp in that set: y_fixed = E[y_tp]. This represents the best performance that we could achieve if we did not have access to any user or system features. We can use the fixed probability model to put the performance of the rest of our models into context.
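The baseline reduces to a one-line computation over the spans R of the turn pairs; the argument layout is illustrative.

```python
import numpy as np

def best_fixed_probability(spans, frame_ms=50):
    """y_fixed = E[y_tp], with y_tp = 1 / ((R_END - R_START) / frame_length).
    `spans` is a list of (r_start_ms, r_end_ms) per turn pair."""
    y_tp = [1.0 / ((r_end - r_start) / frame_ms) for r_start, r_end in spans]
    return float(np.mean(y_tp))
```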

Figure 6: Generated offset distributions for the test set using (a) the full model and (b) the fixed probability (random) model.

3 Discussion

3.1 RTNet Discussion

#   Model               L_BCE    L_KL     MAE      Details
1   Full Model          0.1094   –        0.4539   No VAE
2   Fixed Probability   0.1295   –        1.4546   Fixed Probability
3   No Encoder          0.1183   –        0.4934   Encoder Ablation
4   Only Acoustic       0.1114   –        0.4627   Encoder Ablation
5   Only Linguistic     0.1144   –        0.4817   Encoder Ablation
6   Only Acoustic       0.1112   –        0.5053   Inference Ablation
7   Only Linguistic     0.1167   –        0.4923   Inference Ablation
8   w_KL = 0.0          0.1114   3.3879   0.4601   Inclusion of VAE
9   w_KL = 10^-4        0.1122   1.5057   0.4689   Inclusion of VAE
10  w_KL = 10^-3        0.1125   0.8015   0.4697   Inclusion of VAE
11  w_KL = 10^-2        0.1181   0.0000   0.5035   Inclusion of VAE
12  w_KL = 10^-1        0.1189   0.0000   0.5052   Inclusion of VAE

Table 1: Experimental results on our test set. Lower is better in all cases. Best results shown in bold.

Figure 7: Generated offset distributions for selected response dialogue acts using different model conditions. (a) BC/Statement; (b) Yes/No. Rows, top to bottom: true distribution, full model, random, no encoder, VAE, vector representation.

RTNet Performance The offset distribution for the full RTNet model is shown in Fig. 6a. This baseline RTNet model is better able to replicate many of the features of the true distribution in comparison with predicted offsets using the best possible fixed probability, shown in Fig. 6b. The differences between the baseline and the fixed probability distributions are reflected in the results of rows 1 and 2 in Table 1. In Fig. 6a, the model has the most trouble reproducing the distribution of offsets between -500 ms and 0 ms. This part of the distribution is the most demanding because it requires that the model anticipate the user's turn-ending. From the plots it is clear that our model is able to do that to a large degree. We observe that after the user has stopped speaking (from 0 seconds onward) the generated distribution follows the true distribution closely.

To look in more detail at how the system models the offset distribution, we can investigate the generated distributions of labelled response dialogue acts in our test set. Fig. 7 shows plots of backchannels vs. statements (Fig. 7a), and yes vs. no responses (Fig. 7b). In the second rows, we can see that the full model is able to accurately capture the differences in the contours of the true distributions. For example, in the no dialogue acts, the full model accurately generates a mode that is delayed (relative to yes dialogue acts).

Encoder Ablation The performance of the response encoder was analysed in an ablation study, with results in rows 3 through 5 of Table 1. Without the response encoder, there is a large decrease in performance, relative to the full model. From looking at the encoders with only acoustic and only linguistic modalities, we can see that the results benefit more from the acoustic modality than the linguistic modality. If we consider the impact of the encoder in more detail, we would expect that the network would not be able to model distributional differences between different types of DA responses without an encoder. This is confirmed in the fourth rows of Fig. 7, where we show the generated distributions without the encoder. We can see that without the encoder, the distributions of all of the dialogue act offsets are almost exactly the same.

Figure 8: Generated offset distributions for the inference network ablation: (a) only acoustic, (b) only linguistic.

Figure 9: t-SNE plots of z for four different dialogue acts using two different w_KL settings: (a) w_KL = 0.0, (b) w_KL = 10^-3.

Inference Network Ablation In rows 6 and 7 of Table 1 we present an ablation of the inference network. We can see that removing either the acoustic or linguistic features from the user's features is detrimental to the results. An interesting irregularity is observed in the results for the model that uses only acoustic features (row 6): the MAE is unusually high, relative to the L_BCE. In all other rows, lower L_BCE corresponds to lower MAE. However, row 6 has the second lowest L_BCE, while also having the second highest MAE.

In order to examine this irregularity in more detail, we look at the generated distributions from the inference ablation, shown in Fig. 8. We observe that the linguistic features are better for predicting the mode of the distribution, whereas the acoustic features are better at modelling the -100 ms to +150 ms region directly preceding the mode. Since word embeddings are triggered 100 ms after the end of the word, the linguistic features can be used to generate modal offsets in the 150 ms to 200 ms bin. We propose that, in the absence of linguistic features, there is more uncertainty about when the user's turn-end has occurred. Since the majority of all ground-truth offsets occur after the user has finished speaking, the unusually high MAE in row 6 could be attributed to this uncertainty about whether the user has finished speaking.

3.2 RTNet-VAE Discussion

Figure 10: Interpolated distributions (agree-accept, reject, and interpolated).

RTNet-VAE Performance In rows 8 through 12 of Table 1 we show the results of our experiments with RTNet-VAE with different settings of w_KL. As w_KL is increased, the L_BCE loss increases while the L_KL loss decreases. Examining some example distributions of dialogue acts generated by RTNet-VAE using w_KL = 10^-4 (shown in the fifth rows of Fig. 7), we can see that RTNet-VAE is capable of generating distributions that are of a similar quality to those generated by RTNet (shown in the second rows). We also observe that RTNet-VAE using w_KL = 10^-4 produces competitive results in comparison to the full model. These observations suggest that the inclusion of the VAE in the pipeline does not severely impact the overall performance.

In Fig. 9 we show the latent variable z generated using RTNet-VAE and plotted using t-SNE (van der Maaten and Hinton, 2008). To show the benefits of imposing the Gaussian prior, we show plots for w_KL = 0.0 and w_KL = 10^-3. The plots show the two-dimensional projection of four different types of dialogue act responses: statements (sd), no (nn), yes (ny), and backchannels (b). We can observe that for both settings, the latent space is able to organize the responses by dialogue act type, even though it is never explicitly trained on dialogue act labels. For example, in both cases, statements (shown in blue) are clustered at the opposite side of the distribution from backchannels (shown in red). However, in the case of w_KL = 0.0 there are "holes" in the latent space. For practical applications such as interpolation of vector representations of dialogue acts (discussed in the next paragraph), we would like a space that does not contain any of these holes, since they are less likely to have semantically meaningful interpretations. When the Gaussian prior is enforced (Fig. 9b) we can see that the space is smooth and the distinctions between dialogue acts are still maintained.

Latent Space Applications As mentioned in Section 2.2, part of the appeal of using the VAE in our model is that it enables us to discard the response encoding stage. We can exploit the smoothness of the latent space to skip the encoding stage by sampling directly from the trained latent space. We can approximate the distribution of latent variables for individual dialogue act response types using isotropic Gaussians. This enables us to efficiently represent the dialogue acts using mean and standard-deviation vectors, a pair for each dialogue act. The final rows of Fig. 7 show examples of distributions generated using Gaussian approximations of the latent space distributions. We can see that the generated outputs have similar properties to the true distributions.

We can use the same parameterized vector representations to interpolate between different dialogue act parameters to achieve intermediate distributions. This dimensional approach is flexible in that it gives the dialogue manager (DM) more control over the details of the distribution. For example, if the objective of the SDS was to generate an agree dialogue act, we could control the degree of agreement by interpolating between disagree and agree vectors. Figure 10 shows an example of a generated interpolated distribution. We can see that the properties of the interpolated distribution (e.g. mode, kurtosis) are perceptually "in between" the reject and accept distributions.

4 Listening Tests

It has been shown that response timings vary based on the semantic content of dialogue responses and the preceding turn (Levinson and Torreira, 2015), and that listeners are sensitive to these fluctuations in timing (Bögels and Levinson, 2017). However, the question of whether certain response timings within different contexts are considered more realistic than others has not been fully investigated. We design an online listening test to answer two questions: (1) Given a preceding turn and a response, are some response timings considered by listeners to be more realistic than others? (2) In cases where listeners are sensitive to the response timing, is our model more likely to generate responses that are considered realistic than a system that generates a modal response time?

Participants were asked to make A/B choices between two versions of a turn pair, where each version had a different response offset. Participants were asked: "Which response timing sounds like it was produced in the real conversation?" The turn pairs were drawn from our dataset and were limited to pairs where the response was either dispreferred or a backchannel. We limited the chosen pairs to those with ground truth offsets that were either classified as early or late. We classified offsets as early, modal, or late by segmenting the distribution of all of the offsets in our dataset into three partitions as shown in Fig. 11a. The cutoff points for the early and late offsets were estimated using a heuristic where we split the offsets in our dataset into two groups at the mode of the distribution (157 ms) and then used the median values of the upper (+367 ms) and lower (-72 ms) groups as the cutoff points. We selected eight examples of each dialogue act (four early and four late). We generated three different versions of each turn pair: true, modal, and opposite. If the true offset was late, the opposite offset was the mean of the early offsets (-316 ms). If the true offset was early, the opposite offset was the mean of the late offsets (+760 ms).

Figure 11: Listening test experiments. (a) Early, modal, and late regions (mode = +157 ms, with early/late cutoffs). (b) Generated distributions for six turn pairs; the highlighted regions indicate the region that was preferred by listeners, and the red line indicates the ground truth offset.
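The cutoff heuristic can be reproduced roughly as follows; estimating the mode with a fixed-width histogram is an assumption of this sketch.

```python
import numpy as np

def offset_cutoffs(offsets_s, bin_width=0.05):
    """Estimate the mode of the offset distribution, split the offsets at the
    mode, and take the medians of the lower/upper groups as the early/late
    cutoffs. The paper reports a mode of +157 ms and cutoffs of -72/+367 ms."""
    offsets = np.asarray(offsets_s, dtype=float)
    bins = np.arange(offsets.min(), offsets.max() + bin_width, bin_width)
    counts, edges = np.histogram(offsets, bins=bins)
    mode = edges[np.argmax(counts)] + bin_width / 2
    early_cutoff = np.median(offsets[offsets < mode])
    late_cutoff = np.median(offsets[offsets >= mode])
    return early_cutoff, mode, late_cutoff
```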

We had 25 participants (15 female, 10 male), all of whom wore headphones. We performed binomial tests for the significance of a given choice in each question. For the questions in the first half of the test, in which we compared true vs. opposite offsets, 10 of the 16 comparisons were found to be statistically significant (p < 0.05). In all of the significant cases the true offset was considered more realistic than the opposite. In reference to our first research question, this result supports the conclusion that some response timings are indeed considered to be more realistic than others. For the questions in the second half of the test, in which we compared true vs. modal offsets, six out of the 16 comparisons were found to be statistically significant. Of the six significant preferences, three were a preference for the true offset, and three were a preference for the modal offset. To investigate our second research question, we looked at the offset distributions generated by our model for each of the six significant preferences, shown in Fig. 11b. For the turn pairs where listeners preferred non-modal offsets (top row), the distributions generated by our system deviate from the mode into the preferred area (highlighted in yellow). In pairs where listeners preferred modal offsets (bottom row) the generated distributions tend to have a mode near the overall dataset mode (shown as the green line). We can conclude, in reference to our second question, that in instances where listeners are sensitive to response timings it is likely that our system will generate response timings that are more realistic than those of a system that simply generates the mode of the dataset.
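Each A/B comparison can be tested for significance with a standard binomial test, e.g. with SciPy; the counts below are made up for illustration.

```python
from scipy.stats import binomtest

# Illustrative only: suppose 19 of 25 listeners chose version A of a turn pair.
result = binomtest(k=19, n=25, p=0.5, alternative="two-sided")
print(f"p-value = {result.pvalue:.4f}")   # the paper's threshold is p < 0.05
```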

5 Conclusion

In this paper, we have presented models that can be used to generate the turn-switch offset distributions of SDS system responses. It has been shown in prior studies (e.g. Bögels et al. (2019)) that humans are sensitive to these timings and that they can impact how responses are perceived by a listener. We argue that they are an important element of producing naturalistic interactions that is often overlooked. With the advent of commercial SDS systems that attempt to engage users over extended multi-turn interactions (e.g. Zhou et al. (2018)), generating realistic response behaviors is a potentially desirable addition to the overall experience.

Acknowledgments

The ADAPT Centre for Digital Content Technology is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.


References

Sara Bögels, Kobin H. Kendrick, and Stephen C. Levinson. 2019. Conversational expectations get revised as response latencies unfold. Language, Cognition and Neuroscience, pages 1–14.

Sara Bögels and Stephen C. Levinson. 2017. The Brain Behind the Response: Insights Into Turn-taking in Conversation From Neuroimaging. Research on Language and Social Interaction, 50(1):71–89.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating Sentences from a Continuous Space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 10–21, Berlin, Germany. Association for Computational Linguistics.

Sasha Calhoun, Jean Carletta, Jason M. Brenier, Neil Mayo, Dan Jurafsky, Mark Steedman, and David Beaver. 2010. The NXT-format Switchboard Corpus: A rich resource for investigating the syntax, semantics, pragmatics and prosody of dialogue. Language Resources and Evaluation, 44(4):387–419.

Nina Dethlefs, Helen Hastie, Verena Rieser, and Oliver Lemon. 2012. Optimising incremental dialogue decisions using information density for interactive systems. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 82–93. Association for Computational Linguistics.

David DeVault, Kenji Sagae, and David Traum. 2011. Incremental interpretation and prediction of utterance meaning for interactive dialogue. Dialogue & Discourse, 2(1):143–170.

Florian Eyben, Klaus R. Scherer, Björn W. Schuller, Johan Sundberg, Elisabeth André, Carlos Busso, Laurence Y. Devillers, Julien Epps, Petri Laukka, Shrikanth S. Narayanan, and Khiet P. Truong. 2016. The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing. IEEE Transactions on Affective Computing, 7(2):190–202.

John J. Godfrey and Edward Holliman. 1997. Switchboard-1 Release 2. Linguistic Data Consortium, Philadelphia, 926:927.

David Ha and Douglas Eck. 2018. A neural representation of sketch drawings. In International Conference on Learning Representations.

Peter A. Heeman and Rebecca Lunsford. 2017. Turn-taking offsets and dialogue context. In Proc. Interspeech 2017, pages 1671–1675.

Kobin H. Kendrick and Francisco Torreira. 2015. The timing and construction of preference: A quantitative study. Discourse Processes, 52(4):255–289.

Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational bayes. In 2nd International Conference on Learning Representations.

Divesh Lala, Pierrick Milhorat, Koji Inoue, Masanari Ishida, Katsuya Takanashi, and Tatsuya Kawahara. 2017. Attentive listening system with backchanneling, response generation and flexible turn-taking. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 127–136, Saarbrücken, Germany. Association for Computational Linguistics.

Divesh Lala, Shizuka Nakamura, and Tatsuya Kawahara. 2019. Analysis of Effect and Timing of Fillers in Natural Turn-Taking. In Interspeech 2019, pages 4175–4179. ISCA.

Stephen C. Levinson and Francisco Torreira. 2015. Timing in turn-taking and its implications for processing models of language. Frontiers in Psychology, 6.

Jiwei Li, Michel Galley, Chris Brockett, Georgios Spithourakis, Jianfeng Gao, and Bill Dolan. 2016. A Persona-Based Neural Conversation Model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 994–1003, Berlin, Germany. Association for Computational Linguistics.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605.

Raveesh Meena, Gabriel Skantze, and Joakim Gustafson. 2014. Data-driven models for timing feedback responses in a Map Task dialogue system. Computer Speech & Language, 28(4):903–922.

Louis-Philippe Morency, Iwan de Kok, and Jonathan Gratch. 2010. A probabilistic multimodal approach for predicting listener backchannels. Autonomous Agents and Multi-Agent Systems, 20(1):70–84.

Ryosuke Nakanishi, Koji Inoue, Shizuka Nakamura, Katsuya Takanashi, and Tatsuya Kawahara. 2018. Generating Fillers based on Dialog Act Pairs for Smooth Turn-Taking by Humanoid Robot. IWSDS, page 11.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Antoine Raux and Maxine Eskenazi. 2009. A finite-state turn-taking model for spoken dialog systems. In HLT-NAACL, pages 629–637. ACL.

Adam Roberts, Jesse Engel, Colin Raffel, Curtis Hawthorne, and Douglas Eck. 2018. A hierarchical latent vector model for learning long-term structure in music. In ICML, pages 4361–4370.

Matthew Roddy, Gabriel Skantze, and Naomi Harte. 2018. Multimodal Continuous Turn-Taking Prediction Using Multiscale RNNs. In Proceedings of the 2018 International Conference on Multimodal Interaction - ICMI '18, pages 186–190, Boulder, CO, USA. ACM Press.

Harvey Sacks, Emanuel A. Schegloff, and Gail Jefferson. 1974. A Simplest Systematics for the Organization of Turn-Taking for Conversation. Language, 50(4):696.

David Schlangen and Gabriel Skantze. 2011. A general, abstract model of incremental dialogue processing. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 710–718.

Gabriel Skantze. 2017. Towards a General, Continuous Model of Turn-taking in Spoken Dialogue using LSTM Recurrent Neural Networks. In Proceedings of SIGdial, Saarbrücken, Germany.

RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron J. Weiss, Rob Clark, and Rif A. Saurous. 2018. Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron. Proceedings of the 35th International Conference on Machine Learning, page 10.

Andreas Stolcke, Klaus Ries, Noah Coccaro, Elizabeth Shriberg, Rebecca Bates, Daniel Jurafsky, Paul Taylor, Rachel Martin, Carol Van Ess-Dykema, and Marie Meteer. 2000. Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational Linguistics, 26(3):339–373.

Tiancheng Zhao, Ran Zhao, and Maxine Eskenazi. 2017. Learning Discourse-level Diversity for Neural Dialog Models using Conditional Variational Autoencoders. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 654–664, Vancouver, Canada. Association for Computational Linguistics.

Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung Shum. 2018. The Design and Implementation of XiaoIce, an Empathetic Social Chatbot. arXiv preprint arXiv:1812.08989.