

Hierarchical Recurrent Attention Network for Response Generation

Chen Xing 1,2,*, Wei Wu 3, Yu Wu 4, Ming Zhou 3, Yalou Huang 1,2, Wei-Ying Ma 3

1 College of Computer and Control Engineering, Nankai University, Tianjin, China
2 College of Software, Nankai University, Tianjin, China
3 Microsoft Research, Beijing, China
4 State Key Lab of Software Development Environment, Beihang University, Beijing, China
{v-chxing,wuwei,v-wuyu,mingzhou,wyma}@microsoft.com, [email protected]

Abstract

We study multi-turn response generation in chatbots where a response is generated according to a conversation context. Existing work has modeled the hierarchy of the context, but does not pay enough attention to the fact that words and utterances in the context are differentially important. As a result, they may lose important information in context and generate irrelevant responses. We propose a hierarchical recurrent attention network (HRAN) to model both aspects in a unified framework. In HRAN, a hierarchical attention mechanism attends to important parts within and among utterances with word level attention and utterance level attention respectively. With the word level attention, hidden vectors of a word level encoder are synthesized as utterance vectors and fed to an utterance level encoder to construct hidden representations of the context. The hidden vectors of the context are then processed by the utterance level attention and formed as context vectors for decoding the response. Empirical studies on both automatic evaluation and human judgment show that HRAN can significantly outperform state-of-the-art models for multi-turn response generation.

1 Introduction

Conversational agents include task-oriented dialog systems, which are built in vertical domains for specific tasks (Young et al., 2013; Boden, 2006; Wallace, 2009; Young et al., 2010), and non-task-oriented chatbots, which aim to realize natural and human-like conversations with people regarding a wide range of issues in open domains (Jafarpour et al., 2010). A common practice to build a chatbot is to learn a response generation model within an encoder-decoder framework from large scale message-response pairs (Shang et al., 2015; Vinyals and Le, 2015). Such models ignore conversation history when responding, which is contradictory to the nature of real conversation between humans. To resolve the problem, researchers have taken conversation history into consideration and proposed response generation for multi-turn conversation (Sordoni et al., 2015; Serban et al., 2015; Serban et al., 2016b; Serban et al., 2016c).

* The work was done when the first author was an intern at Microsoft Research Asia.

Figure 1: An example of multi-turn conversation

In this work, we study multi-turn response generation for open domain conversation in chatbots, in which we try to learn a response generation model from responses and their contexts. A context refers to a message and several utterances in its previous turns. In practice, when a message comes, the model takes the context as input and generates a response as the next turn. Multi-turn conversation requires a model to generate a response relevant to the whole context. The complexity of the task lies in two aspects: 1) a conversation context has a hierarchical structure (words form an utterance, and utterances form the context) with two levels of sequential relationships, among words and among utterances, within the structure; 2) not all parts of the context are equally important to response generation. Words are differentially informative and important, and so are the utterances. State-of-the-art methods such as HRED (Serban et al., 2016a) and VHRED (Serban et al., 2016c) focus on modeling the hierarchy of the context, whereas there is little exploration of how to select important parts from the context, although it is often a crucial step for generating a proper response. Without this step, existing models may lose important information in context and generate irrelevant responses1. Figure 1 gives an example from our data to illustrate the problem. The context is a conversation between two speakers about height and boyfriend; therefore, to respond to the context, words like "girl", "boyfriend" and numbers indicating height such as "160" and "175" are more important than "not good-looking". Moreover, u1 and u4 convey the main semantics of the context, and therefore are more important than the others for generating a proper response. Without modeling the word and utterance importance, the state-of-the-art model VHRED (Serban et al., 2016c) misses important points and gives the response "are you a man or a woman", which would be acceptable if only u3 were left, but is nonsense given the whole context. After paying attention to the important words and utterances, we can have a reasonable response like "No, I don't care much about height" (the response is generated by our model, as will be seen in the experiments).

1 Note that one can simply concatenate all utterances and employ the classic sequence-to-sequence with attention to model word importance in generation. This method, however, loses utterance relationships and results in bad generation quality, as will be seen in the experiments.

We aim to model the hierarchy and the important parts of contexts in a unified framework. Inspired by the success of the attention mechanism in single-turn response generation (Shang et al., 2015), we propose a hierarchical recurrent attention network (HRAN) for multi-turn response generation, in which we introduce a hierarchical attention mechanism to dynamically highlight important parts of word sequences and the utterance sequence when generating a response. Specifically, HRAN is built in a hierarchical structure. At the bottom of HRAN, a word level recurrent neural network (RNN) encodes each utterance into a sequence of hidden vectors. In the generation of each word in the response, a word level attention mechanism assigns a weight to each vector in the hidden sequence of an utterance and forms an utterance vector by a linear combination of the vectors. Important hidden vectors correspond to important parts of the utterance with respect to the generation of the word, and contribute more to the formation of the utterance vector. The utterance vectors are then fed to an utterance level RNN which constructs hidden representations of the context. Different from the classic attention mechanism, the word level attention mechanism in HRAN is dependent on both the decoder and the utterance level RNN. Thus, both the currently generated part of the response and the content of the context can help select important parts in utterances. At the third layer, an utterance attention mechanism attends to important utterances in the utterance sequence and summarizes the sequence as a context vector. Finally, at the top of HRAN, a decoder takes the context vector as input and generates the word in the response. HRAN mirrors the data structure in multi-turn response generation by growing from words to utterances and then from utterances to the output. It extends the architecture of current hierarchical response generation models with a hierarchical attention mechanism, which not only results in better generation quality, but also provides insight into which parts of an utterance and which utterances in the context contribute to response generation.

We conduct an empirical study on large scale open domain conversation data and compare our model with state-of-the-art models using both automatic evaluation and side-by-side human comparison. The results show that on both metrics our model can significantly outperform existing models for multi-turn response generation. We release our source code and data at https://github.com/LynetteXing1991/HRAN.

The contributions of the paper include (1) proposal of attending to important parts in contexts in multi-turn response generation; (2) proposal of a hierarchical recurrent attention network which models hierarchy of contexts, word importance, and utterance importance in a unified framework; (3) empirical verification of the effectiveness of the model by both automatic evaluation and human judgment.

2 Related Work

Most existing effort on response generation is devoted to single-turn conversation. Starting from the basic sequence to sequence model (Sutskever et al., 2014), various models (Shang et al., 2015; Vinyals and Le, 2015; Li et al., 2015; Xing et al., 2016; Li et al., 2016) have been proposed under an encoder-decoder framework to improve generation quality from different perspectives such as relevance, diversity, and personality. Recently, multi-turn response generation has drawn attention from academia. For example, Sordoni et al. (2015) proposed DCGM, where context information is encoded with a multi-layer perceptron (MLP). Serban et al. (2016a) proposed HRED, which models contexts in a hierarchical encoder-decoder framework. Under the architecture of HRED, more variants, including VHRED (Serban et al., 2016c) and MrRNN (Serban et al., 2016b), have been proposed in order to introduce latent and explicit variables into the generation process. In this work, we also study multi-turn response generation. Different from the existing models, which do not model word and utterance importance in generation, our hierarchical recurrent attention network simultaneously models the hierarchy of contexts and the importance of words and utterances in a unified framework.

The attention mechanism was first proposed for machine translation (Bahdanau et al., 2014; Cho et al., 2015), and was quickly applied to single-turn response generation afterwards (Shang et al., 2015; Vinyals and Le, 2015). Recently, Yang et al. (2016) proposed a hierarchical attention network for document classification in which two levels of attention mechanisms are used to model the contributions of words and sentences to the classification decision. Seo et al. (2016) proposed a hierarchical attention network to precisely attend to objects of different scales and shapes in images. Inspired by these works, we extend the attention mechanism for single-turn response generation to a hierarchical attention mechanism for multi-turn response generation. To the best of our knowledge, we are the first to apply the hierarchical attention technique to response generation in chatbots.

3 Problem Formalization

Suppose that we have a data set $\mathcal{D} = \{(U_i, Y_i)\}_{i=1}^{N}$. $\forall i$, $(U_i, Y_i)$ consists of a response $Y_i = (y_{i,1}, \ldots, y_{i,T_i})$ and its context $U_i = (u_{i,1}, \ldots, u_{i,m_i})$, with $y_{i,j}$ the $j$-th word, $u_{i,m_i}$ the message, and $(u_{i,1}, \ldots, u_{i,m_i-1})$ the utterances in previous turns. In this work, we require $m_i > 2$, and thus each context has at least one utterance as conversation history. $\forall j$, $u_{i,j} = (w_{i,j,1}, \ldots, w_{i,j,T_{i,j}})$, where $w_{i,j,k}$ is the $k$-th word. We aim to estimate a generation probability $p(y_1, \ldots, y_T \mid U)$ from $\mathcal{D}$, and thus, given a new conversation context $U$, we can generate a response $Y = (y_1, \ldots, y_T)$ according to $p(y_1, \ldots, y_T \mid U)$.
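To make the notation concrete, the following is a toy instance of the data format (a hypothetical example with invented tokens; the actual data described in Section 5.1 is Chinese):

```python
# One (U_i, Y_i) pair: the context U_i is a list of tokenized utterances, the last
# one being the message, and Y_i is the tokenized response (the next turn).
U_i = [
    ["how", "tall", "are", "you", "?"],       # u_{i,1}: earliest turn in the history
    ["about", "160", "."],                    # u_{i,2}
    ["do", "you", "mind", "that", "?"],       # u_{i,m_i}: the message
]
Y_i = ["no", ",", "i", "do", "not", "care", "much", "about", "height", "."]

assert len(U_i) > 2  # m_i > 2: the context carries conversation history besides the message
```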

In the following, we will elaborate how to construct $p(y_1, \ldots, y_T \mid U)$ and how to learn it.

4 Hierarchical Recurrent Attention Network

We propose a hierarchical recurrent attention network (HRAN) to model the generation probability $p(y_1, \ldots, y_T \mid U)$. Figure 2 gives the architecture of HRAN. Roughly speaking, before generation, HRAN employs a word level encoder to encode the information of every utterance in the context as hidden vectors. Then, when generating every word, a hierarchical attention mechanism attends to important parts within and among utterances with word level attention and utterance level attention respectively. With the two levels of attention, HRAN works in a bottom-up way: hidden vectors of utterances are processed by the word level attention and uploaded to an utterance level encoder to form hidden vectors of the context. Hidden vectors of the context are further processed by the utterance level attention as a context vector and uploaded to the decoder to generate the word.

In the following, we will describe the details and the learning objective of HRAN.

4.1 Word Level Encoder

Given $U = (u_1, \ldots, u_m)$, we employ a bidirectional recurrent neural network with gated recurrent units (BiGRU) (Bahdanau et al., 2014) to encode each $u_i$, $i \in \{1, \ldots, m\}$, as hidden vectors $(\mathbf{h}_{i,1}, \ldots, \mathbf{h}_{i,T_i})$. Formally, suppose that $u_i = (w_{i,1}, \ldots, w_{i,T_i})$; then $\forall k \in \{1, \ldots, T_i\}$, $\mathbf{h}_{i,k}$ is given by

$$\mathbf{h}_{i,k} = \mathrm{concat}(\overrightarrow{\mathbf{h}}_{i,k}, \overleftarrow{\mathbf{h}}_{i,k}), \qquad (1)$$

where $\mathrm{concat}(\cdot,\cdot)$ is an operation defined as concatenating the two arguments together, $\overrightarrow{\mathbf{h}}_{i,k}$ is the $k$-th hidden state of a forward GRU (Cho et al., 2014), and $\overleftarrow{\mathbf{h}}_{i,k}$ is the $k$-th hidden state of a backward GRU.


Figure 2: Hierarchical Recurrent Attention Network (the figure shows the word level encoder, word level attention, utterance level encoder, utterance level attention, context vector, and decoder)

The forward GRU reads $u_i$ in its order (i.e., from $w_{i,1}$ to $w_{i,T_i}$), and calculates $\overrightarrow{\mathbf{h}}_{i,k}$ as

$$\begin{aligned}
z_k &= \sigma(W_z \mathbf{e}_{i,k} + V_z \overrightarrow{\mathbf{h}}_{i,k-1}) \\
r_k &= \sigma(W_r \mathbf{e}_{i,k} + V_r \overrightarrow{\mathbf{h}}_{i,k-1}) \\
s_k &= \tanh(W_s \mathbf{e}_{i,k} + V_s(\overrightarrow{\mathbf{h}}_{i,k-1} \circ r_k)) \\
\overrightarrow{\mathbf{h}}_{i,k} &= (1 - z_k) \circ s_k + z_k \circ \overrightarrow{\mathbf{h}}_{i,k-1},
\end{aligned} \qquad (2)$$

where $\overrightarrow{\mathbf{h}}_{i,0}$ is initialized with an isotropic Gaussian distribution, $\mathbf{e}_{i,k}$ is the embedding of $w_{i,k}$, $z_k$ and $r_k$ are an update gate and a reset gate respectively, $\sigma(\cdot)$ is a sigmoid function, and $W_z, W_r, W_s, V_z, V_r, V_s$ are parameters. The backward GRU reads $u_i$ in its reverse order (i.e., from $w_{i,T_i}$ to $w_{i,1}$) and generates $\{\overleftarrow{\mathbf{h}}_{i,k}\}_{k=1}^{T_i}$ with a parameterization similar to the forward GRU.
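As an illustration of Equations (1)-(2), here is a minimal PyTorch sketch of the word level encoder. The paper's implementation uses Blocks; this is only an illustrative re-implementation. nn.GRU uses a closely related gate parameterization (it applies the reset gate after the hidden-to-hidden projection), the Gaussian initialization of $\overrightarrow{\mathbf{h}}_{i,0}$ is omitted (PyTorch defaults to zeros), and the layer sizes follow the experimental setup in Section 5.2.

```python
import torch
import torch.nn as nn

class WordLevelEncoder(nn.Module):
    """Encodes one utterance u_i into hidden vectors h_{i,1..T_i} (Eqs. 1-2, sketch)."""
    def __init__(self, vocab_size, emb_dim=620, hidden_dim=1000):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # Bidirectional GRU: plays the role of the forward and backward GRUs of Eq. (2).
        self.bigru = nn.GRU(emb_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, utterance_ids):
        # utterance_ids: (batch, T_i) word indices of utterance u_i
        emb = self.embedding(utterance_ids)   # (batch, T_i, emb_dim)
        h, _ = self.bigru(emb)                # (batch, T_i, 2 * hidden_dim)
        return h                              # h[:, k] = concat(forward, backward), as in Eq. (1)
```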

4.2 Hierarchical Attention and Utterance Encoder

Suppose that the decoder has generated $t-1$ words. At step $t$, the word level attention calculates a weight vector $(\alpha_{i,t,1}, \ldots, \alpha_{i,t,T_i})$ (details are described later) for $\{\mathbf{h}_{i,j}\}_{j=1}^{T_i}$ and represents utterance $u_i$ as a vector $\mathbf{r}_{i,t}$. $\forall i \in \{1, \ldots, m\}$, $\mathbf{r}_{i,t}$ is defined by

$$\mathbf{r}_{i,t} = \sum_{j=1}^{T_i} \alpha_{i,t,j} \mathbf{h}_{i,j}. \qquad (3)$$

$\{\mathbf{r}_{i,t}\}_{i=1}^{m}$ are then utilized as input to an utterance level encoder and transformed into $(\mathbf{l}_{1,t}, \ldots, \mathbf{l}_{m,t})$ as hidden vectors of the context. After that, the utterance level attention assigns a weight $\beta_{i,t}$ to $\mathbf{l}_{i,t}$ (details are described later) and forms a context vector $\mathbf{c}_t$ as

$$\mathbf{c}_t = \sum_{i=1}^{m} \beta_{i,t} \mathbf{l}_{i,t}. \qquad (4)$$

In both Equation (3) and Equation (4), the more important a hidden vector is, the larger weight it will have, and the more it will contribute to the higher level vector (i.e., the utterance vector and the context vector). This is how the two levels of attention attend to the important parts of utterances and the important utterances in generation.
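Given the attention weights, Equations (3) and (4) are plain weighted sums; a minimal sketch (the tensor shapes are assumptions):

```python
import torch

def utterance_vector(h_i, alpha_i_t):
    # h_i: (T_i, d) hidden vectors of utterance u_i; alpha_i_t: (T_i,) word weights
    return (alpha_i_t.unsqueeze(1) * h_i).sum(dim=0)    # r_{i,t}, Eq. (3)

def context_vector(l_t, beta_t):
    # l_t: (m, d') hidden vectors of the context; beta_t: (m,) utterance weights
    return (beta_t.unsqueeze(1) * l_t).sum(dim=0)       # c_t, Eq. (4)
```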

More specifically, the utterance level encoder is a backward GRU which processes $\{\mathbf{r}_{i,t}\}_{i=1}^{m}$ from the message $\mathbf{r}_{m,t}$ to the earliest history $\mathbf{r}_{1,t}$. Similar to Equation (2), $\forall i \in \{m, \ldots, 1\}$, $\mathbf{l}_{i,t}$ is calculated as

$$\begin{aligned}
z'_i &= \sigma(W_{zl} \mathbf{r}_{i,t} + V_{zl} \mathbf{l}_{i+1,t}) \\
r'_i &= \sigma(W_{rl} \mathbf{r}_{i,t} + V_{rl} \mathbf{l}_{i+1,t}) \\
s'_i &= \tanh(W_{sl} \mathbf{r}_{i,t} + V_{sl}(\mathbf{l}_{i+1,t} \circ r'_i)) \\
\mathbf{l}_{i,t} &= (1 - z'_i) \circ s'_i + z'_i \circ \mathbf{l}_{i+1,t},
\end{aligned} \qquad (5)$$

where $\mathbf{l}_{m+1,t}$ is initialized with an isotropic Gaussian distribution, $z'_i$ and $r'_i$ are the update gate and the reset gate of the utterance level GRU respectively, and $W_{zl}, V_{zl}, W_{rl}, V_{rl}, W_{sl}, V_{sl}$ are parameters.
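A sketch of the recursion in Equation (5), with nn.GRUCell in place of the hand-written gates. For simplicity the utterance vectors $\{\mathbf{r}_{i,t}\}$ are assumed to be precomputed, whereas in HRAN they are produced interleaved with the word level attention (see below); the sizes (2000-dimensional utterance vectors from the BiGRU outputs, 1000-dimensional context states) are inferred from Section 5.2 and are otherwise illustrative.

```python
import torch
import torch.nn as nn

utterance_gru = nn.GRUCell(input_size=2000, hidden_size=1000)  # stands in for Eq. (5)

def encode_utterances(r_t, l_init):
    # r_t: (m, 2000) utterance vectors r_{1,t..m,t}; l_init: (1, 1000) plays the role of l_{m+1,t}
    m = r_t.size(0)
    state, l_t = l_init, [None] * m
    for i in range(m - 1, -1, -1):                        # from the message back to u_1
        state = utterance_gru(r_t[i].unsqueeze(0), state)
        l_t[i] = state.squeeze(0)                         # l_{i,t}
    return torch.stack(l_t)                               # (m, 1000), ordered l_{1,t..m,t}
```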

Different from the classic attention mechanism, word level attention in HRAN depends on both the hidden states of the decoder and the hidden states of the utterance level encoder. It works in a reverse order, first weighting $\{\mathbf{h}_{m,j}\}_{j=1}^{T_m}$ and then moving towards $\{\mathbf{h}_{1,j}\}_{j=1}^{T_1}$ along the utterance sequence. $\forall i \in \{m, \ldots, 1\}$, $j \in \{1, \ldots, T_i\}$, weight $\alpha_{i,t,j}$ is calculated as

$$e_{i,t,j} = \eta(\mathbf{s}_{t-1}, \mathbf{l}_{i+1,t}, \mathbf{h}_{i,j}); \qquad \alpha_{i,t,j} = \frac{\exp(e_{i,t,j})}{\sum_{k=1}^{T_i} \exp(e_{i,t,k})}, \qquad (6)$$

where $\mathbf{l}_{m+1,t}$ is initialized with an isotropic Gaussian distribution, $\mathbf{s}_{t-1}$ is the $(t-1)$-th hidden state of the decoder, and $\eta(\cdot)$ is a multi-layer perceptron (MLP) with tanh as an activation function.
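A PyTorch sketch of Equation (6) for one utterance and one decoding step. The inputs are exactly those named in Eq. (6); the exact architecture of $\eta$ (one tanh hidden layer followed by a scalar score) and its width are assumptions.

```python
import torch
import torch.nn as nn

class WordLevelAttention(nn.Module):
    """Sketch of Eq. (6): e_{i,t,j} = eta(s_{t-1}, l_{i+1,t}, h_{i,j}), softmax over j."""
    def __init__(self, dec_dim, utt_dim, word_dim, att_dim=1000):
        super().__init__()
        self.eta = nn.Sequential(
            nn.Linear(dec_dim + utt_dim + word_dim, att_dim),
            nn.Tanh(),
            nn.Linear(att_dim, 1),
        )

    def forward(self, s_prev, l_next, h_i):
        # s_prev: (dec_dim,) decoder state s_{t-1}; l_next: (utt_dim,) l_{i+1,t};
        # h_i: (T_i, word_dim) hidden vectors of utterance u_i
        T_i = h_i.size(0)
        query = torch.cat([s_prev, l_next]).unsqueeze(0).expand(T_i, -1)
        e = self.eta(torch.cat([query, h_i], dim=1)).squeeze(1)   # (T_i,)
        return torch.softmax(e, dim=0)                            # alpha_{i,t,1..T_i}
```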


Note that the word level attention and the utterance level encoding are dependent on each other and are alternately conducted (first attention, then encoding). The motivation for establishing the dependency between $\alpha_{i,t,j}$ and $\mathbf{l}_{i+1,t}$ is that content from the context (i.e., $\mathbf{l}_{i+1,t}$) can help identify important information in utterances, especially when $\mathbf{s}_{t-1}$ is not informative enough (e.g., when the generated part of the response consists almost entirely of function words). We require the utterance encoder and the word level attention to work in reverse order because we believe that, compared to earlier history, the conversation that happened after an utterance in the context is more likely to be capable of identifying important information in that utterance for generating a proper response to the context.

With $\{\mathbf{l}_{i,t}\}_{i=1}^{m}$, the utterance level attention calculates a weight $\beta_{i,t}$ for $\mathbf{l}_{i,t}$ as

$$e'_{i,t} = \eta(\mathbf{s}_{t-1}, \mathbf{l}_{i,t}); \qquad \beta_{i,t} = \frac{\exp(e'_{i,t})}{\sum_{k=1}^{m} \exp(e'_{k,t})}. \qquad (7)$$
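Equation (7) is the same construction minus the word level argument; a compact sketch under the same assumptions about $\eta$:

```python
import torch
import torch.nn as nn

class UtteranceLevelAttention(nn.Module):
    """Sketch of Eq. (7): scores from s_{t-1} and l_{i,t}, softmax over the m utterances."""
    def __init__(self, dec_dim, utt_dim, att_dim=1000):
        super().__init__()
        self.eta = nn.Sequential(
            nn.Linear(dec_dim + utt_dim, att_dim), nn.Tanh(), nn.Linear(att_dim, 1))

    def forward(self, s_prev, l_t):
        # s_prev: (dec_dim,) decoder state; l_t: (m, utt_dim) hidden vectors of the context
        m = l_t.size(0)
        query = s_prev.unsqueeze(0).expand(m, -1)
        e = self.eta(torch.cat([query, l_t], dim=1)).squeeze(1)   # (m,)
        return torch.softmax(e, dim=0)                            # beta_{1,t..m,t}
```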

4.3 Decoding the Response

The decoder of HRAN is an RNN language model (Mikolov et al., 2010) conditioned on the context vectors $\{\mathbf{c}_t\}_{t=1}^{T}$ given by Equation (4). Formally, the probability distribution $p(y_1, \ldots, y_T \mid U)$ is defined as

$$p(y_1, \ldots, y_T \mid U) = p(y_1 \mid \mathbf{c}_1) \prod_{t=2}^{T} p(y_t \mid \mathbf{c}_t, y_1, \ldots, y_{t-1}), \qquad (8)$$

where $p(y_t \mid \mathbf{c}_t, y_1, \ldots, y_{t-1})$ is given by

$$\mathbf{s}_t = f(\mathbf{e}_{y_{t-1}}, \mathbf{s}_{t-1}, \mathbf{c}_t), \qquad p(y_t \mid \mathbf{c}_t, y_1, \ldots, y_{t-1}) = I_{y_t} \cdot \mathrm{softmax}(\mathbf{s}_t, \mathbf{e}_{y_{t-1}}), \qquad (9)$$

where $\mathbf{s}_t$ is the hidden state of the decoder at step $t$, $\mathbf{e}_{y_{t-1}}$ is the embedding of $y_{t-1}$, $f$ is a GRU, $I_{y_t}$ is the one-hot vector for $y_t$, and $\mathrm{softmax}(\mathbf{s}_t, \mathbf{e}_{y_{t-1}})$ is a $V$-dimensional vector with $V$ the response vocabulary size and each element the generation probability of a word. In practice, we employ the beam search (Tillmann and Ney, 2003) technique to generate the $n$-best responses.

Let us denote by $\Theta$ the parameter set of HRAN; then we estimate $\Theta$ from $\mathcal{D} = \{(U_i, Y_i)\}_{i=1}^{N}$ by minimizing the following objective function:

$$\Theta = \arg\min_{\Theta} \; -\sum_{i=1}^{N} \log\big(p(y_{i,1}, \ldots, y_{i,T_i} \mid U_i)\big). \qquad (10)$$
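A minimal PyTorch sketch of the decoding step in Equations (8)-(10). The paper's implementation uses Blocks; here nn.GRUCell plays the role of $f$, the output layer realizing $\mathrm{softmax}(\mathbf{s}_t, \mathbf{e}_{y_{t-1}})$ is an assumption (a single linear projection of the concatenated state and previous embedding), and dimensions follow Section 5.2 where stated.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Sketch of Eqs. (8)-(9): a GRU language model conditioned on the context vector c_t."""
    def __init__(self, vocab_size, emb_dim=620, hidden_dim=1000, ctx_dim=1000):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.gru_cell = nn.GRUCell(emb_dim + ctx_dim, hidden_dim)   # plays the role of f
        self.out = nn.Linear(hidden_dim + emb_dim, vocab_size)      # assumed form of softmax(s_t, e_{y_{t-1}})

    def step(self, y_prev, s_prev, c_t):
        # y_prev: (batch,) previous word ids; s_prev: (batch, hidden_dim); c_t: (batch, ctx_dim)
        e_prev = self.embedding(y_prev)
        s_t = self.gru_cell(torch.cat([e_prev, c_t], dim=1), s_prev)  # s_t = f(e_{y_{t-1}}, s_{t-1}, c_t)
        logits = self.out(torch.cat([s_t, e_prev], dim=1))            # unnormalized word distribution
        return logits, s_t

# Eq. (10) is the summed negative log-likelihood of the reference words, i.e. the
# cross-entropy of the logits at every step against the ground-truth next word.
loss_fn = nn.CrossEntropyLoss(reduction="sum")
```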

5 Experiments

We compared HRAN with state-of-the-art methods by both automatic evaluation and side-by-side human judgment.

5.1 Data Set

We built a data set from Douban Group2, a popular Chinese social networking service (SNS) that allows users to discuss a wide range of topics in groups through posting and commenting. In Douban Group, under a post on a specific topic, two persons can converse with each other by one posting a comment and the other quoting it and posting another comment. We crawled 20 million conversations between two persons, with an average number of turns of 6.32. The data covers many different topics and can be viewed as a simulation of open domain conversations in a chatbot. In each conversation, we treated the last turn as the response and the remaining turns as the context. As preprocessing, we first employed the Stanford Chinese word segmenter3 to tokenize each utterance in the data. Then we removed the conversations whose responses appear more than 50 times in the whole data to prevent them from dominating learning. We also removed conversations shorter than 3 turns and conversations with an utterance longer than 50 words. After the preprocessing, 1,656,652 conversations were left. From them, we randomly sampled 1 million conversations as training data, 10,000 conversations as validation data, and 1,000 conversations as test data, and made sure that there is no overlap among them. In the test data, the contexts were used to generate responses, and their responses were used as ground truth to calculate the perplexity of generation models. We kept the 40,000 most frequent words in the contexts of the training data to construct a context vocabulary. The vocabulary covers 98.8% of the words appearing in the contexts of the training data. Similarly, we constructed a response vocabulary that contains the 40,000 most frequent words in the responses of the training data, which covers 99.0% of the words appearing in the responses. Words outside the two vocabularies were treated as "UNK". The data will be publicly available.

2 https://www.douban.com/group/explore
3 http://nlp.stanford.edu/software/segmenter.shtml
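A sketch of the filtering and vocabulary construction described above, assuming `conversations` is a list of tokenized conversations, each a list of utterances (word lists) whose last element is the response; the response vocabulary would be built analogously from the last turns.

```python
from collections import Counter

def preprocess(conversations, max_response_count=50, min_turns=3, max_utt_len=50,
               vocab_size=40000):
    response_counts = Counter(tuple(conv[-1]) for conv in conversations)
    kept = [
        conv for conv in conversations
        if response_counts[tuple(conv[-1])] <= max_response_count  # drop responses that would dominate learning
        and len(conv) >= min_turns                                  # keep conversations with at least 3 turns
        and all(len(utt) <= max_utt_len for utt in conv)            # no utterance longer than 50 words
    ]
    # Context vocabulary: the 40,000 most frequent words over all turns except the response.
    context_counts = Counter(w for conv in kept for utt in conv[:-1] for w in utt)
    context_vocab = {w for w, _ in context_counts.most_common(vocab_size)}
    return kept, context_vocab
```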


Model    Validation Perplexity    Test Perplexity
S2SA     43.679                   44.508
HRED     46.279                   47.467
VHRED    44.548                   45.484
HRAN     40.257                   41.138

Table 1: Perplexity results

5.2 Baselines

We compared HRAN with the following models:

S2SA: we concatenated all utterances in a context as a long sequence and treated the sequence and the response as a message-response pair. By this means, we transformed the problem of multi-turn response generation into a problem of single-turn response generation, and employed the standard sequence to sequence model with attention (Shang et al., 2015) as a baseline.

HRED: the hierarchical encoder-decoder model proposed by Serban et al. (2016a).

VHRED: a modification of HRED (Serban et al., 2016c) where latent variables are introduced into generation.

In all models, we set the dimensionality of the hidden states of encoders and decoders as 1000, and the dimensionality of word embedding as 620. All models were initialized with isotropic Gaussian distributions $X \sim \mathcal{N}(0, 0.01)$ and trained with the AdaDelta algorithm (Zeiler, 2012) on an NVIDIA Tesla K40 GPU. The batch size was 128. We set the initial learning rate as 1.0 and reduced it by half if the perplexity on validation began to increase. We implemented the models with the open source deep learning tool Blocks4.

5.3 Evaluation Metrics

How to evaluate a response generation model is still an open problem, but it is not the focus of this paper. We followed the existing work and employed the following metrics:

Perplexity: following (Vinyals and Le, 2015), we employed perplexity as an evaluation metric. Perplexity is defined by Equation (11). It measures how well a model predicts human responses; lower perplexity generally indicates better generation performance. In our experiments, perplexity on validation was used to determine when to stop training: if the perplexity stops decreasing and the difference is smaller than 2.0 five times in validation, we regard the algorithm as having converged and terminate training. We tested the generation ability of different models by perplexity on the test data.

$$\mathrm{PPL} = \exp\left\{ -\frac{1}{N} \sum_{i=1}^{N} \log\big(p(Y_i \mid U_i)\big) \right\}. \qquad (11)$$

4 https://github.com/mila-udem/blocks

Models             Win    Loss   Tie    Kappa
HRAN v.s. S2SA     27.3   20.6   52.1   0.37
HRAN v.s. HRED     27.2   21.2   51.6   0.35
HRAN v.s. VHRED    25.2   20.4   54.4   0.34

Table 2: Human annotation results (in %)
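For concreteness, a direct transcription of Equation (11) as written above; it assumes the per-example log-likelihoods $\log p(Y_i \mid U_i)$ (natural log) have already been computed by the model on the test set.

```python
import math

def perplexity(log_likelihoods):
    # log_likelihoods: list of log p(Y_i | U_i) over the N test examples
    n = len(log_likelihoods)
    return math.exp(-sum(log_likelihoods) / n)
```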

Side-by-side human annotation: we also compared HRAN with every baseline model by side-by-side human comparison. Specifically, we recruited three native speakers with rich Douban Group experience as human annotators. To each annotator, we showed a context of a test example with two generated responses, one from HRAN and the other from a baseline model. Both responses are the top one results in beam search. The two responses were presented in random order. We then asked the annotator to judge which one is better. The criteria are: response A is better than response B if (1) A is relevant, logically consistent with the context, and fluent, while B is either irrelevant or logically contradictory to the context, or it is disfluent (e.g., with grammatical errors or UNKs); or (2) both A and B are relevant, consistent, and fluent, but A is more informative and interesting than B (e.g., B is a universal reply like "I see"). If the annotator could not tell which one is better, he/she was asked to label a "tie". Each annotator individually judged 1000 test examples for each HRAN-baseline pair; in total, each annotator judged 3000 examples (for three pairs). Agreements among the annotators were calculated using Fleiss' kappa (Fleiss and Cohen, 1973).

Note that we do not choose BLEU (Papineni et al., 2002) as an evaluation metric, because (1) Liu et al. (2016) have shown that BLEU is not a proper metric for evaluating conversation models, as there is weak correlation between BLEU and human judgment; and (2) different from the single-turn case, in multi-turn conversation one context usually has only one copy in the whole data. Thus, without any human effort like that of Sordoni et al. (2015), each context has only a single reference in the test set. This makes BLEU even more unreliable as a measurement of generation quality in open domain conversation, due to the diversity of responses.

5.4 Evaluation Results

Table 1 gives the results on perplexity. HRAN achieves the lowest perplexity on both validation and test. We conducted a t-test on test perplexity and the result shows that the improvement of HRAN over all baseline models is statistically significant (p-value < 0.01).

Figure 3: Case study (utterances between two persons in contexts are split by "⇒")

Table 2 shows the human annotation results. The ratios were calculated by combining the annotations from the three judges together. We can see that HRAN outperforms all baseline models, and all comparisons have relatively high kappa scores, which indicates that the annotators reached relatively high agreement in judgment. Compared with S2SA, HRED, and VHRED, HRAN achieves preference gains (win-loss) of 6.7%, 6%, and 4.8% respectively. Sign test results show that the improvement is statistically significant (p-value < 0.01 for HRAN v.s. S2SA and HRAN v.s. HRED, and p-value < 0.05 for HRAN v.s. VHRED). Among the three baseline models, S2SA is the worst one, because it loses the relationships among utterances in the context. VHRED is the best baseline model, which is consistent with the existing literature (Serban et al., 2016c). We checked the cases on which VHRED loses to HRAN and found that in 56% of the cases, VHRED generated irrelevant responses while the responses given by HRAN are relevant, logically consistent, and fluent.

5.5 Discussions

Case study: Figure 3 lists some cases from the test set to compare HRAN with the best baseline, VHRED. We can see that HRAN not only can answer the last turn in the context (i.e., the message) properly by understanding the context (e.g., case 2), but is also capable of starting a new topic according to the conversation history to keep the conversation going (e.g., case 1). In case 2, HRAN understands that the message is actually asking "why can't you come to have dinner with me?" and generates a proper response that gives a plausible reason. In case 1, HRAN properly brings up a new topic by asking the "brand" of the user's "lotion" when the current topic "how to exfoliate my skin" has come to an end. The new topic is based on the content of the context and thus naturally extends the conversation in the case.

Model               Win     Loss    Tie     PPL
No UD Att           22.3%   24.8%   52.9%   41.54
No Word Att         20.4%   25.0%   50.6%   43.24
No Utterance Att    21.1%   23.7%   55.2%   47.35

Table 3: Model ablation results

Visualization of attention: to further illustrate why HRAN can generate high quality responses, we visualize in Figure 4 the hierarchical attention for the cases in Figure 3. In every sub-figure, each line is an utterance, with blue color indicating word importance. The leftmost column of each sub-figure uses red color to indicate utterance importance. Darker color means more important words or utterances. The importance of a word or an utterance was calculated as the average weight assigned to it by attention when generating the response given at the bottom of each sub-figure. It reflects the overall contribution of the word or the utterance to generating the response. Above each line, we give the translation of the utterance, and below it, we translate the important words. Note that word-to-word translation may sometimes cause confusion, so we left some words (most of them function words) untranslated. We can see that the hierarchical attention mechanism in HRAN can attend to both important words and important utterances in contexts. For example, in Figure 4(c), words including "girl" and "boyfriend" and numbers including "160" and "175" are highlighted, and u1 and u4 are more important than the others. The result matches our intuition in the introduction. In Figure 4(b), HRAN assigns larger weights to u1, u4, and words like "dinner" and "why". This explains why the model can understand that the message is actually asking "why can't you come to have dinner with me?". The figures provide insight into how HRAN understands contexts in generation.

Figure 4: Attention visualization; sub-figures (a)-(d) visualize cases 1-4 (the importance of a word or an utterance is calculated as its average weight when generating the whole response)
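A small sketch of how the importance scores in Figure 4 can be computed from the attention weights recorded during decoding; the tensor shapes are assumptions (T decoding steps, m utterances, word positions zero-padded to a common length).

```python
import torch

def average_importance(word_weights, utterance_weights):
    # word_weights:      (T, m, T_max)  alpha_{i,t,j} recorded at every decoding step t
    # utterance_weights: (T, m)         beta_{i,t} recorded at every decoding step t
    word_importance = word_weights.mean(dim=0)            # (m, T_max): per-word average weight
    utterance_importance = utterance_weights.mean(dim=0)  # (m,): per-utterance average weight
    return word_importance, utterance_importance
```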

Model ablation: we then examine the effect of different components of HRAN by removing them one by one. We first removed $\mathbf{l}_{i+1,t}$ from $\eta(\mathbf{s}_{t-1}, \mathbf{l}_{i+1,t}, \mathbf{h}_{i,j})$ in Equation (6) (i.e., removing the utterance dependency from word level attention) and denote the model as "No UD Att"; then we removed word level attention and utterance level attention separately, and denote the models as "No Word Att" and "No Utterance Att" respectively. We conducted side-by-side human comparison between these models and the full HRAN on the test data, and also calculated their test perplexity (PPL). Table 3 gives the results. We can see that all the components are useful, because removing any of them causes a performance drop. Among them, word level attention is the most important one, as HRAN achieved the largest preference gain (4.6%) over No Word Att in the human comparison.

Error analysis: we finally investigate how to improve HRAN in the future by analyzing the cases on which HRAN loses to VHRED. The errors can be summarized as: 51.81% logic contradiction, 26.95% universal reply, 7.77% irrelevant response, and 13.47% others. Most bad cases come from universal replies and responses that are logically contradictory to contexts. This is easy to understand, as HRAN does not explicitly model the two issues. The result also indicates that (1) although contexts provide more information than single messages, multi-turn response generation still suffers from the "safe response" problem, like the single-turn case; and (2) although attending to important words and utterances in generation can lead to informative and logically consistent responses in many cases, like those in Figure 3, it is still not enough for fully understanding contexts, due to the complex nature of conversations. The irrelevant responses might be caused by wrong attention in generation. Although the analysis might not cover all bad cases (e.g., HRAN and VHRED may both give bad responses), it sheds light on our future directions: (1) improving response diversity, e.g., by introducing extra content into generation as Xing et al. (2016) and Mou et al. (2016) did for single-turn conversation; (2) modeling logic in contexts; (3) improving attention.

6 Conclusion

We propose a hierarchical recurrent attention network (HRAN) for multi-turn response generation in chatbots. Empirical studies on large scale conversation data show that HRAN can significantly outperform state-of-the-art models.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Margaret Ann Boden. 2006. Mind as machine: A history of cognitive science. Clarendon Press.

Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. Syntax, Semantics and Structure in Statistical Translation, page 103.

Kyunghyun Cho, Aaron Courville, and Yoshua Bengio. 2015. Describing multimedia content using attention-based encoder-decoder networks. Multimedia, IEEE Transactions on, 17(11):1875-1886.

Joseph L Fleiss and Jacob Cohen. 1973. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement.

Sina Jafarpour, Christopher JC Burges, and Alan Ritter. 2010. Filter, rank, and transfer the knowledge: Learning to chat. Advances in Ranking, 10.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2015. A diversity-promoting objective function for neural conversation models. arXiv preprint arXiv:1510.03055.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A persona-based neural conversation model. arXiv preprint arXiv:1603.06155.

Chia-Wei Liu, Ryan Lowe, Iulian V Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. arXiv preprint arXiv:1603.08023.

Tomas Mikolov, Martin Karafiat, Lukas Burget, Jan Cernocky, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, pages 1045-1048.

Lili Mou, Yiping Song, Rui Yan, Ge Li, Lu Zhang, and Zhi Jin. 2016. Sequence to backward and forward sequences: A content-introducing approach to generative short-text conversation. arXiv preprint arXiv:1607.00970.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311-318. Association for Computational Linguistics.

Paul Hongsuck Seo, Zhe Lin, Scott Cohen, Xiaohui Shen, and Bohyung Han. 2016. Hierarchical attention networks. arXiv preprint arXiv:1606.02393.

Iulian V Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2015. Building end-to-end dialogue systems using generative hierarchical neural network models. arXiv preprint arXiv:1507.04808.

Iulian V Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016a. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI-16).

Iulian Vlad Serban, Tim Klinger, Gerald Tesauro, Kartik Talamadupula, Bowen Zhou, Yoshua Bengio, and Aaron Courville. 2016b. Multiresolution recurrent neural networks: An application to dialogue response generation. arXiv preprint arXiv:1606.00776.

Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2016c. A hierarchical latent variable encoder-decoder model for generating dialogues. arXiv preprint arXiv:1605.06069.

Lifeng Shang, Zhengdong Lu, and Hang Li. 2015. Neural responding machine for short-text conversation. arXiv preprint arXiv:1503.02364.

Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, and Bill Dolan. 2015. A neural network approach to context-sensitive generation of conversational responses. arXiv preprint arXiv:1506.06714.

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104-3112.

Christoph Tillmann and Hermann Ney. 2003. Word reordering and a dynamic programming beam search algorithm for statistical machine translation. Computational Linguistics, 29(1):97-133.

Oriol Vinyals and Quoc Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.


Richard S Wallace. 2009. The Anatomy of ALICE. Springer.

Chen Xing, Wei Wu, Yu Wu, Jie Liu, Yalou Huang, Ming Zhou, and Wei-Ying Ma. 2016. Topic aware neural response generation. arXiv preprint arXiv:1606.08340.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Steve Young, Milica Gasic, Simon Keizer, Francois Mairesse, Jost Schatzmann, Blaise Thomson, and Kai Yu. 2010. The hidden information state model: A practical framework for POMDP-based spoken dialogue management. Computer Speech & Language, 24(2):150-174.

Steve Young, Milica Gasic, Blaise Thomson, and Jason D Williams. 2013. POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160-1179.

Matthew D Zeiler. 2012. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701.