

arXiv:1912.10852v1 [cs.LG] 20 Dec 2019

ET-USB: Transformer-Based Sequential Behavior Modeling for Inbound Customer Service

Ta-Chun Su 1, Guan-Ying Chen 1

1 Cathay Financial Holdings Lab
{bgg, amber}@cathayholdings.com.tw

Abstract

Deep-learning models with attention mechanisms have achieved exceptional results on many tasks, including language tasks and recommender systems. In this study, we focus on predicting inbound calls for customer service, whereas previous works put emphasis on phone-agent allocation. A common method for handling a user's historical behaviors is to extract various aggregated features over time; however, this method may fail to detect users' sequential patterns. We therefore incorporate users' sequential features along with non-sequential features and apply the encoder of the Transformer, a powerful self-attention network, to capture the information underlying those behavior sequences. We propose our approach, ET-USB, which is helpful for different business scenarios at Cathay Financial Holdings. In practice, we show that the proposed network structure for processing different dimensions of behavior data outperforms other deep-learning-based models.

1 Introduction

Inbound call-center customer service, which deals with vast numbers of customer complaints and queries, is closely related to customer satisfaction, corporate brand image, and even profitability. Thus, many companies invest significant resources to provide high-quality customer service through telephone, the internet, or even chatbots. Several customer-service applications are universal, including forecasting the number of phone calls in a period of time, telemarketing various products to customers, and predicting customers' call questions.

Here we focus on the scenario of predicting customers' questions at Cathay United Bank, a subsidiary of Taiwan's largest financial holding company, Cathay Financial Holdings, which receives almost a million calls from customers per month. Most of the time, customers make phone calls only when they encounter operational problems. However, due to miscommunication between call-center agents and customers, the average call lasts five minutes or even longer, leading

to high communication costs. Consequently, if we could precisely predict the questions that customers may ask, we could shorten call durations: there would be no need to spend a long time confirming customers' questions, and we could further raise customer satisfaction.

In this study, inspired by the success of the attention mechanism for machine-translation tasks in natural language processing (NLP) [Bahdanau et al., 2015], we propose a novel neural network, ET-USB, which uses the encoder of the Transformer [Vaswani et al., 2017] to learn a better representation of users' sequential behaviors. Our model utilizes the self-attention mechanism to capture the most important dependencies among the behaviors in a behavior sequence, and thus makes better predictions of customers' inbound calls. We found that most customer calls are triggered once a customer encounters difficulties on different channels or product services. In other words, customer behaviors in each channel, product, and service play an important role in analyzing inbound-call problems. By collecting user data from various subsidiary databases with HIPPO, an event-driven information-integration framework that aids workflows in our company, we can use these user data as the data source for our modeling task. We have user behavior data from more than 30 channels, named "Customer Journey Events" and abbreviated as "CJE" hereafter. Based on occurrence time, we transform these data into users' sequential behaviors, which enables powerful analyses for different scenarios, including inbound-call problems.

In our experiments, we also delve into the behavior granularity of CJE. In recommender-system scenarios, each item in a sequence is embedded without complex feature engineering and is easy to interpret, e.g., click, redeem, gift-sending. In our scenario, however, it is hard to describe a behavior in such a simple way. Consider a viewing event described by two sentences with the same meaning but different levels of detail: "I watched Avengers: Endgame in the movie theater" versus "I watched Avengers: Endgame at AMC in the evening." As this example shows, it is hard to define a universal rule for the best description of behaviors. Therefore, we introduce a statistical method that helps choose a better behavior granularity. Offline experiments show the superiority of our proposed approach. We have also deployed the model online, where it



Table 1: The Details of Customer Journey Data

customer id | action time | action type | channel type | object type | event type | attributes
A123456 | 1569859200 | 40 | merchant nbr | credit card acct | cc transaction | {key1: value1, ...}
A123456 | 1569911568 | 41 | merchant nbr | credit card acct | cc transaction | {key1: value1, ...}
A124789 | 1569981819 | 40 | merchant nbr | credit card acct | cc transaction | {key1: value1, ...}

predicts inbound calls for customer service every day.

2 Related Work

In recent years, unsupervised pre-trained models have dominated the field: embeddings are learned first and later fed into fine-tuning models. During pre-training, an encoder neural network is trained on large-scale unlabeled data to learn word embeddings; the parameters are then fine-tuned with labeled data for downstream tasks. For example, in text classification, each word in a sentence is embedded into a low-dimensional space as a vector representation learned from large quantities of unstructured text such as Wikipedia corpora, and the representations are then fed into fully connected layers to predict which category the article belongs to. Among these methods, Word2vec is the most common [Mikolov et al., 2013].

Because of the feasibility of Word2vec, several non-NLP fields have used pre-trained vectors as input to downstream models. For example, Item2Vec [Barkan and Koenigstein, 2017] adopts the idea of Word2vec to produce embeddings for items in a latent space; the method is capable of inferring item-to-item relations even when user information is not available. Because embedding vectors carry more information than one-hot encodings, "everything2vec" variants have appeared, e.g., node2vec [Grover and Leskovec, 2016] and graph2vec [Narayanan et al., 2017]. These embedding vectors can then be fed into a deep-learning (DL) architecture.

Attention mechanisms now play an important role in such architectures because they learn to pay attention to only the most important parts of the input, and they have been shown to be effective in various tasks such as computer vision and machine translation. Early on, vanilla attention mechanisms were integrated into DL models such as RNNs and CNNs to overcome their shortcomings [Sinha et al., 2018] [Zhou et al., 2016]. However, because of long-range dependency problems and the inability to parallelize, these approaches are inefficient in some scenarios. Later, self-attention [Vaswani et al., 2017] appeared: BERT [Devlin et al., 2019] uses self-attention with a Transformer model and achieved state-of-the-art performance and efficiency on the GLUE leaderboard, which had previously been dominated by RNN- and CNN-based approaches. The Transformer model relies heavily on the proposed self-attention mechanism to capture complex structures in sentences. Many researchers have applied these approaches to DL-based recommender systems. For example, DIN [Zhou et al., 2018b] adaptively

learns the representation of user interests from historical behaviors with respect to a given advertisement. Another example is ATRank [Zhou et al., 2018a], which proposes an attention-based user-behavior modeling framework for recommendation tasks. Alibaba also presented a Transformer-based recommender system recently [Chen et al., 2019].

3 Modeling Framework

3.1 Description of Customer Journey Data

To delineate an outline of our customers, we integrate the CJE occurring in various channels into "Customer Journey Data". Table 1 shows an example of raw customer-journey events; here, we take credit-card transactions as an example. Users' transaction behaviors are recorded along various dimensions, containing customer ID, action time, and attributes, including channel type, card type, and other external domain information. More specifically, we can know when a behavior took place, on which channel (online or offline store), with which credit card, and so on. Due to the company's secrecy policy, we do not disclose the customer-journey data in detail. In sum, each channel is collected with over 20 attribute elements for the analysis.

3.2 Model Architecture

Our model architecture is shown in Figure 1. We divide our features into two parts: non-sequential features and sequential behavior features. In the subsequent sections, we focus on extracting information from the sequential behavior features, since the non-sequential features are only concatenated with this information in the final layer.

Embedding Layer
We embed each behavior in a behavior sequence into a fixed-size low-dimensional vector. Given a vocabulary V of behaviors, we create an embedding matrix E ∈ R^{|V|×d}, where d is the dimension size; E is referred to as the behavior lookup table. The embedding size d is often set between 100 and 1000 [Bahdanau et al., 2015]. Here we set d = 128 and adopt the Xavier initialization method.

Since we use self-attention later, without any positional representation identical behaviors at different positions would produce the same output. We therefore account for the position of each behavior, which may carry different meanings and latent information within the behavior sequence, to capture the order information of the sequence. In our scenario, we simply adopt positional embedding by adding the position as an input feature of each item


Figure 1: The architecture of ET-USB. The model incorporates features in two parts: sequential features and non-sequential features. Specifically, the model embeds the sequential features into low-dimensional vectors, then adopts the self-attention mechanism inside encoder layers to capture the relations among the behaviors in a behavior sequence and learn a deeper representation. Finally, we concatenate this output with the non-sequential features, which are processed through two fully connected layers, and feed the result into the final layer with a softmax function to generate the final output.

in the input layer before it is projected as a low-dimensionalvector.
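For illustration, a minimal TensorFlow sketch of this embedding step is given below. The vocabulary size, sequence length, and all variable names are assumptions for the example; only d = 128 is taken from the paper.

```python
import tensorflow as tf

# Illustrative sizes; only d = 128 is stated in the paper.
VOCAB_SIZE = 10000   # |V|, number of distinct behaviors (assumed)
MAX_LEN = 50         # assumed maximum behavior-sequence length
D_MODEL = 128        # embedding dimension d

# Behavior lookup table E with "Xavier" (Glorot uniform) initialization,
# plus a learned positional embedding added at the input layer.
behavior_emb = tf.keras.layers.Embedding(
    VOCAB_SIZE, D_MODEL, embeddings_initializer="glorot_uniform")
position_emb = tf.keras.layers.Embedding(MAX_LEN, D_MODEL)

def embed_sequence(behavior_ids):
    """behavior_ids: int tensor of shape (batch, seq_len)."""
    positions = tf.range(tf.shape(behavior_ids)[1])        # 0 .. seq_len-1
    return behavior_emb(behavior_ids) + position_emb(positions)
```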

Transformer Encoder
Self-Attention Layer  We focus on the encoder architecture of a multi-layer Transformer [Vaswani et al., 2017]. The encoder consists of stacked encoder layers, each containing two sub-layers: a multi-head self-attention layer and a feed-forward network. The multi-head mechanism runs a scaled dot-product attention function, which can be formulated as querying a dictionary entry with key-value pairs [Miller et al., 2016]. The self-attention input consists of a query Q ∈ R^{l×d}, a key K ∈ R^{l×d}, and a value V ∈ R^{l×d}, where l is the length of the input sequence and d is the dimension of the embeddings for the query, key, and value. For subsequent layers, Q, K, and V come from the output of the previous layer. Self-attention allows capturing the dependencies between representation pairs regardless of their distance in the sequence. The scaled dot-product attention [Vaswani et al., 2017] is defined as follows:

Attention(Q, K, V) = softmax(QK^T / √d) · V = A · V    (1)

The output is the multiplication of the attention weights A and the value V, where A ∈ R^{l×l}. The attention weight A_{i,j} lets us understand the importance of the i-th key-value pair with respect to the j-th query in generating the output [Bahdanau et al., 2015].
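A minimal sketch of Eq. (1) in TensorFlow, assuming the inputs have already been projected to queries, keys, and values:

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v):
    """Eq. (1): softmax(Q K^T / sqrt(d)) V; q, k, v have shape (..., l, d)."""
    d = tf.cast(tf.shape(k)[-1], tf.float32)
    scores = tf.matmul(q, k, transpose_b=True) / tf.sqrt(d)   # (..., l, l)
    a = tf.nn.softmax(scores, axis=-1)                        # attention weights A
    return tf.matmul(a, v), a                                 # A·V and A
```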

Previous work has shown that it is beneficial to linearly project the queries, keys, and values h times with different learned projections, attending to different representation subspaces at different positions. Specifically, multi-head attention first linearly projects the input layer I into h subspaces with different linear projections, then performs the attention function in parallel, and finally concatenates the resulting output representations and projects them once again:

O^head = MultiHead(I) = Concat(O^head_1, O^head_2, ..., O^head_h) W^O    (2)

O^head_i = Attention(I W^Q_i, I W^K_i, I W^V_i)    (3)

where the projection matrices W^Q_i, W^K_i, W^V_i ∈ R^{d×(d/h)} for each head and W^O ∈ R^{d×d} are learnable parameters, and h is the number of heads.

Feed-Forward Network Layer  The self-attention sub-layer is mainly based on linear projections. To further enhance the model with non-linearity, we apply a feed-forward network after the self-attention layer, consisting of two linear transformations with a SELU activation in between. The output of the feed-forward layer is formulated as follows:


Table 2: Comparison of the training and prediction datasets. We adopt a validation set, 10% of the training dataset, for model evaluation.

Customer Journey Data | # Users | # Event Instances | # Question Categories
Training Datasets | 971,904 | 16,510,346 | 24
Prediction Datasets | 4,635,236 | 99,678,852 | 24

O^out = SELU(O^head W^(1) + b^(1)) W^(2) + b^(2)    (4)

Next, we employ a residual connection [He et al., 2015] by applying dropout to the output of each sub-layer before it is added to the sub-layer input, followed by layer normalization [Ba et al., 2016]:

L = LayerNorm(O^head + Dropout(O^out))    (5)

Stacking Encoder Layer  The first self-attention encoder layer aggregates all of the previous behavior embeddings. However, it may be helpful to learn more complex behavior-transition patterns by stacking encoder layers, each containing a self-attention layer and a feed-forward network. The b-th block is defined as:

S^(b) = SA(L^(b−1))    (6)

L^(b) = FFN(S^(b))    (7)

where SA is a self-attention layer, FFN is a feed-forward network, and b ≥ 1. The best value of b depends on the scenario; in this paper, our experiments show that b = 1 outperforms larger values.
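Putting the pieces together, the sketch below shows one encoder block (multi-head self-attention, the SELU feed-forward network, dropout, residual connection, and layer normalization, i.e., Eqs. (2)-(7) with b = 1) using the standard Keras layers. The head count, feed-forward width, and dropout rate are illustrative assumptions; only d = 128 is given in the paper.

```python
import tensorflow as tf

class EncoderBlock(tf.keras.layers.Layer):
    """One encoder block: S^(b) = SA(L^(b-1)), L^(b) = FFN(S^(b)), followed by
    the residual connection and layer normalization of Eqs. (4)-(5)."""

    def __init__(self, d_model=128, num_heads=8, d_ff=512, dropout_rate=0.1):
        super().__init__()
        self.mha = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=d_model // num_heads)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(d_ff, activation="selu"),  # W^(1), b^(1) + SELU
            tf.keras.layers.Dense(d_model),                  # W^(2), b^(2)
        ])
        self.dropout = tf.keras.layers.Dropout(dropout_rate)
        self.norm = tf.keras.layers.LayerNormalization()

    def call(self, x, training=False):
        o_head = self.mha(query=x, value=x, key=x)           # multi-head self-attention
        o_out = self.ffn(o_head)                             # Eq. (4)
        return self.norm(o_head + self.dropout(o_out, training=training))  # Eq. (5)
```

In the setting reported here, a single block is applied to the embedded behavior sequence.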

Output Layer  We use two fully connected layers to learn the interactions among the non-sequential features, and then concatenate their output with the output of the Transformer layer to form the final dense representation vector.

3.3 Objective Function

To predict which question categories a customer will ask about, we model the task as multi-class classification and use the softmax function as the output unit. The objective function is the categorical cross-entropy, defined as:

Loss = − (1/N) Σ_{i=1}^{N} Σ_{j=1}^{L} y_{ij} log(p̂_{ij})    (8)

p̂_{ij} = exp(f_j(x_i)) / Σ_{j'=1}^{L} exp(f_{j'}(x_i))    (9)

where N is the size of the training set, L is the number of question categories, y_{ij} is the j-th element of the one-hot encoded label of sample x_i, and f_j(x_i) denotes the j-th element of the model output f(x_i).
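A direct TensorFlow transcription of Eqs. (8)-(9) might look as follows; in practice the numerically stable built-in tf.nn.softmax_cross_entropy_with_logits computes the same quantity.

```python
import tensorflow as tf

def categorical_cross_entropy(y_onehot, logits):
    """Eqs. (8)-(9): softmax over the L question categories, then
    cross-entropy averaged over the N training samples."""
    p_hat = tf.nn.softmax(logits, axis=-1)                                   # Eq. (9)
    per_sample = -tf.reduce_sum(y_onehot * tf.math.log(p_hat + 1e-9), axis=-1)
    return tf.reduce_mean(per_sample)                                        # Eq. (8)
```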

4 Experiments

In this section, we present our experiments in detail, including the data description, data processing, model comparison, and the corresponding analysis. We evaluate the proposed approach, ET-USB, on a multi-class classification task. Our experiments demonstrate the effectiveness of our approach, which outperforms well-known methods from NLP tasks.

4.1 Setup

We implement the models using TensorFlow. All of the results in this paper are produced in no more than two hours of training on the training datasets. We train all of the models in the same computational environment, with an NVIDIA Tesla V100 graphics processing unit.

Training and Prediction Datasets
To form the training and prediction datasets, we cooperate with business units to decide the business logic of the datasets, which excludes meaningless information in noisy data. Furthermore, because several call questions occur only in specific months, the training datasets consist of inbound calls that happened in the last four months plus those specific months. Most importantly, we only predict call-question categories related to credit cards, which account for a total of 24 targets. The statistics of the datasets are shown in Table 2.

4.2 Data Preprocessing

The Granularity of Customer Journey Events
The raw data of customer journey events contains rich behavioral information, which has both advantages and disadvantages. We can seemingly understand each customer via these micro events; however, the information contains much noise at such a micro granularity, while it is also difficult to understand behaviors from too macro a perspective. For example, using only the coarsest granularity, the event type in Table 1, it is hard to extract information from user behavior sequences. We try to find the proper granularity to describe customer behavior events and to reveal the commonalities among all customers' behaviors; this concept is similar to stemming and lemmatization in natural language processing. We propose a statistical method to help reduce behavior granularity: we not only apply feature transformations to both numerical and categorical features, but also adjust the behavior granularity by cumulative percentage, taking financial domain knowledge into account. For instance, after discretizing continuous features into categorical features, we sort the values of each feature in descending order of frequency and then calculate


Table 3: Adjustment in the Granularity of Behavior by Cumulative Percent

event | txn currency code | count | event cnt | event percent
creditcard transaction | TWD | 72,761,143 | 75,177,802 | 0.967854
creditcard transaction | others | 2,416,659 | 75,177,802 | 0.032146

the cumulative percentage over the transformed categorical features. Given a threshold (such as 90 percent), we keep the top N distinct values and map the remaining values to an "others" type. Figure 2 shows the original, long-tailed distribution of the currency code, which has over 100 currency types occurring in credit-card transactions. Nevertheless, one currency, the New Taiwan dollar (TWD), accounts for an overwhelmingly high share. Therefore, we reduce the currency code to two values,

Figure 2: The Data Distribution of Currency Code

namely TWD and others, which makes it easier to discover the commonalities among all customers' behaviors. The result is shown in Table 3.
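A small pandas sketch of this cumulative-percentage reduction is shown below; the function name, column name, and usage are hypothetical, and only the 90-percent threshold comes from the text.

```python
import pandas as pd

def reduce_granularity(values: pd.Series, threshold: float = 0.9) -> pd.Series:
    """Keep the most frequent categories until their cumulative share reaches
    `threshold` (e.g. 90 percent) and map the long tail to "others"."""
    freq = values.value_counts(normalize=True)                  # sorted descending
    keep = freq.index[freq.cumsum().shift(fill_value=0.0) < threshold]
    return values.where(values.isin(keep), "others")

# Hypothetical usage on the customer-journey event table:
# events["txn_currency_code"] = reduce_granularity(events["txn_currency_code"])
```

Applied to the currency code, this keeps TWD and collapses the remaining currencies into "others", as in Table 3.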

Feature Extraction

We put emphasis on both sequential and non-sequential features extracted from customer journey events; here, we present part of the non-sequential features for simplicity. It is worth mentioning that we do not use demographic features such as gender and age, because we believe inbound calls have no relation to these features. Specifically, we care about the timing of the latest behavior event and whether this event occurred within 1, 3, ..., N days. Furthermore, we transform categorical variables with one-hot encoding and combine features across different dimensions: for example, we expand the unique values of each categorical field by crossing them with time features, e.g., whether the behavior occurred within 1, 3, ..., N days, and we also count the occurrences of each unique value. The input dataset of the model is shown in Figure 3.

Figure 3: Input Dataset of the Model
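An illustrative pandas sketch of such non-sequential features (recency flags crossed with event types, plus counts) follows. The column names, the reference time, and the window lengths are assumptions for the example, not the production schema.

```python
import pandas as pd

def non_sequential_features(events: pd.DataFrame, now: pd.Timestamp,
                            windows=(1, 3, 7, 30)) -> pd.DataFrame:
    """events: one row per customer-journey event with columns
    customer_id, event_type, action_time (Unix seconds)."""
    events = events.assign(
        action_time=pd.to_datetime(events["action_time"], unit="s"))
    grouped = events.groupby(["customer_id", "event_type"])["action_time"]
    last_seen = grouped.max().unstack()                      # customers x event types
    counts = grouped.size().unstack(fill_value=0).add_suffix("_cnt")
    blocks = [counts]
    for n in windows:                                        # "within N days" flags
        flag = ((now - last_seen) <= pd.Timedelta(days=n)).astype(int)
        blocks.append(flag.add_suffix(f"_within_{n}d"))
    return pd.concat(blocks, axis=1)
```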

4.3 Experimental Results

This subsection provides the experimental results of the baseline models and our proposed method. We employ a batch size of 128, a learning rate of 3e-4, one stacking layer, and 3 epochs over the data. We use only one stacking layer because we found that models with more stacked layers did not perform as well as a single layer, as was also observed in [Chen et al., 2019]; presumably a behavior in a sequence is not as informative as a word in a sentence, so a small number of layers is sufficient. The results are presented in Table 4, which shows the advantage of ET-USB over the other approaches. Compared with the base model without sequential behavior features, the subsequent models show the benefit of those features, and when comparing ways of incorporating behavior-sequence information across models, ET-USB with self-attention provides a significant improvement.

Table 4: Performance of different models on the training datasets, with MAP@3 and accuracy as the metrics.

Model | MAP@3 | Accuracy
DNN (w/o Seq) | 0.2345 | 0.1826
CNN | 0.2542 | 0.2468
Bi-LSTM | 0.2527 | 0.2472
Bi-LSTM+Att | 0.2546 | 0.2491
ET-USB | 0.2657 | 0.2562
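For reference, MAP@3 in this single-label multi-class setting can be computed as in the NumPy sketch below, assuming the standard definition (1/rank of the true category if it appears in the top three predictions, otherwise 0); the paper does not spell out its exact formulation.

```python
import numpy as np

def map_at_3(y_true, y_prob):
    """y_true: (N,) integer labels; y_prob: (N, L) predicted probabilities."""
    top3 = np.argsort(-y_prob, axis=1)[:, :3]        # top-3 predicted categories
    scores = []
    for label, preds in zip(y_true, top3):
        hits = np.where(preds == label)[0]
        scores.append(1.0 / (hits[0] + 1) if hits.size else 0.0)
    return float(np.mean(scores))
```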


Figure 4: Heatmap of Self-Attention scores using Transformer Encoder. In this work we employ n = 1 encoder layer.

We illustrate the effect of the self-attention mechanism in our model by visualizing one user's sequential behavior from the Customer Journey Data. Figure 4 shows how the self-attention scores are distributed over the behaviors in different heads. The behaviors on the axes are ordered by their occurrence time, and each heatmap matrix represents the self-attention scores between any two behaviors in a different semantic space; note that each score matrix is normalized with a row-wise softmax. Take the user in Figure 4, who eventually made a call about a "credit-card bill" question, as an example. Here we present the behaviors at the coarsest granularity so that they can be read clearly on the axes; below, we explain the phenomenon using the more detailed, original granularity.

Specifically, the user bought a cell phone with a "COSTCO Card" in one channel, then bought a plane ticket using a "KOKO COMBO Card", then performed a series of other behaviors, and eventually tried to pay his credit-card bill at a 7-11. It is apparent that he was confused by this latest credit-card payment behavior and thus made an inbound call related to the "credit-card bill". Here we can see how the model works: different semantic spaces focus on different behaviors, and the heat distributions vary considerably across latent spaces. For example, Heads 6, 7, and 8 differ from the remaining five heads. In those five heads, the relative trends of the attention-score vectors are similar regardless of their strength; these heads mainly consider the overall importance of a behavior. On the other hand, consider Heads 6 and 8, where we find that higher scores tend to cluster. In Head 6, the earlier behaviors form a group of credit-card-transaction-related behaviors with very

high attention scores; this space probably captures the similarities among transaction behaviors. Head 8, in contrast, places high attention scores on the latest behaviors; that is, this space focuses on the credit-card payment behaviors. From these heads, we can conclude that the self-attention mechanism captures how user behaviors relate to one another in different semantic spaces.

5 Conclusion

In this paper, we focus on sequential behavior modeling with ET-USB for the scenario of customer inbound-call prediction with abundant customer journey data. We also provide a statistical approach to adjust the granularity of customer behaviors. In our experiments, by using the self-attention mechanism to capture relations among behaviors, we show that ET-USB outperforms both the model without sequential behaviors and previous well-known sequence models from natural language processing. The results also reveal the distributions of self-attention scores over behaviors in different heads. The model architecture has already been deployed in the production environment at Cathay Financial Holdings, where it helps phone agents accurately predict customers' questions, thereby reducing transaction costs and helping to better allocate human resources for customer service.

References

[Ba et al., 2016] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. 2016.


[Bahdanau et al., 2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. ICLR, 2015.

[Barkan and Koenigstein, 2017] Oren Barkan and Noam Koenigstein. Item2vec: Neural item embedding for collaborative filtering. 2017.

[Chen et al., 2019] Qiwei Chen, Huan Zhao, Wei Li, Pipei Huang, and Wenwu Ou. Behavior sequence transformer for e-commerce recommendation in Alibaba. 2019.

[Devlin et al., 2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL, 2019.

[Grover and Leskovec, 2016] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. SIGKDD, 2016.

[He et al., 2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2015.

[Mikolov et al., 2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. 2013.

[Miller et al., 2016] Alexander H. Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. Key-value memory networks for directly reading documents. 2016.

[Narayanan et al., 2017] Annamalai Narayanan, Mahinthan Chandramohan, Rajasekar Venkatesan, Lihui Chen, Yang Liu, and Shantanu Jaiswal. graph2vec: Learning distributed representations of graphs. 2017.

[Sinha et al., 2018] Koustuv Sinha, Yue Dong, Jackie C.K. Cheung, and Derek Ruths. A hierarchical neural attention-based text classifier. ACL, 2018.

[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. 2017.

[Zhou et al., 2016] Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu. Attention-based bidirectional long short-term memory networks for relation classification. ACL, 2016.

[Zhou et al., 2018a] Chang Zhou, Jinze Bai, Junshuai Song, Xiaofei Liu, Zhengchao Zhao, Xiusi Chen, and Jun Gao. ATRank: An attention-based user behavior modeling framework for recommendation. 2018.

[Zhou et al., 2018b] Guorui Zhou, Chengru Song, Xiaoqiang Zhu, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. Deep interest network for click-through rate prediction. 2018.