

Supervised Transfer Learning for Product Information Question Answering

Tuan Manh Lai∗, Trung Bui†, Nedim Lipka†, Sheng Li‡
∗Purdue University, USA; †Adobe Research, USA; ‡University of Georgia, USA

Email: ∗[email protected], †{bui,lipka}@adobe.com, ‡[email protected]

Abstract—Popular e-commerce websites such as Amazon offer community question answering systems for users to pose product-related questions and experienced customers may provide answers voluntarily. In this paper, we show that the large volume of existing community question answering data can be beneficial when building a system for answering questions related to product facts and specifications. Our experimental results demonstrate that the performance of a model for answering questions related to products listed in the Home Depot website can be improved by a large margin via a simple transfer learning technique from an existing large-scale Amazon community question answering dataset. Transfer learning can result in an increase of about 10% in accuracy in the experimental setting where we restrict the size of the data of the target task used for training. As an application of this work, we integrate the best performing model trained in this work into a mobile-based shopping assistant and show its usefulness.

Index Terms—Natural Language Processing, Question Answering, Transfer Learning

I. INTRODUCTION

Customers ask many questions before buying products. They want to get adequate information to determine whether the product of interest is worth their money. Because the questions customers ask are diverse, developing a general question answering system to assist customers is challenging. In this paper, we are particularly interested in the task of answering questions regarding product facts and specifications. We formalize the task as follows: Given a question Q about a product P and the list of specifications (s_1, s_2, ..., s_M) of P, the goal is to identify the specification that is most relevant to Q. M is the number of specifications of P, and s_i is the i-th specification of P. In this formulation, the task is similar to the answer selection problem [1]–[5]; the "answers" in this case are individual product specifications. In production, given a question about a product, a possible approach is to first select the specification of the product that is most relevant to the question and then use the selected specification to generate the complete response sentence using predefined templates. Figure 1 illustrates the approach.
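To make the task definition concrete, the following minimal sketch shows the selection interface; the function name and the relevance_score argument are illustrative assumptions, with the baseline model of Section III later playing the role of the scorer.

```python
def select_specification(question, specifications, relevance_score):
    """Return the specification most relevant to the question.

    question: the question text Q.
    specifications: the list (s_1, ..., s_M) of specification names of product P.
    relevance_score: a function mapping (question, specification) to a real-valued score.
    """
    return max(specifications, key=lambda spec: relevance_score(question, spec))
```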

Many e-commerce websites offer community question answering (CQA) systems for users to pose product-related questions, and experienced customers may provide answers voluntarily. In popular websites such as Amazon or eBay, the amount of CQA data (i.e., questions and answers) is huge and growing over time. For example, in [6], the authors collected a QA dataset consisting of 800 thousand questions and over 3.1 million answers from the CQA platform of Amazon.

In this paper, we show that the large amount of CQA data of popular e-commerce websites can be used to improve the performance of models for answering questions related to product facts and specifications. Our experimental results demonstrate that the performance of a model for answering questions related to products listed in the Home Depot website can be improved by a large margin via a simple transfer learning technique from an existing large-scale Amazon community question answering dataset. Transfer learning can result in an increase of about 10% in accuracy in the experimental setting where we restrict the size of the data of the target task used for training. As an application of this work, we integrate the best performing model trained in this work into a mobile-based shopping assistant and show its usefulness.

II. RELATED WORK

A. Recurrent Neural Networks

Recurrent Neural Networks (RNNs) form an expressive model family for processing sequential data. They have been widely used in many tasks, including machine translation [7], [8], image captioning [9], and document classification [10]. The Long Short-Term Memory (LSTM) [11] is one of the most popular variations of the RNN. The main components of the LSTM are three gates: an input gate i_t to regulate the information flow from the input to the memory cell, a forget gate f_t to regulate the information flow from the previous time step's memory cell, and an output gate o_t that regulates how the model produces the outputs from the memory cell. Given an input sequence {x_1, x_2, ..., x_n} where x_t is typically a word embedding, the computations of the LSTM are as follows:

i_t = σ(W_i x_t + U_i h_{t−1} + b_i)
f_t = σ(W_f x_t + U_f h_{t−1} + b_f)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o)
C̃_t = tanh(W_c x_t + U_c h_{t−1} + b_c)
C_t = i_t ⊙ C̃_t + f_t ⊙ C_{t−1}
h_t = o_t ⊙ C_t        (1)
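For concreteness, here is a minimal NumPy sketch of one LSTM time step following Equation (1). The parameter container and shapes are illustrative choices rather than details from the paper, and the last line follows Eq. (1) literally (some LSTM variants additionally apply tanh to the cell state before the output gate).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following Equation (1).

    params maps each gate name ('i', 'f', 'o', 'c') to its (W, U, b) triple;
    shapes and initialization are left to the caller.
    """
    W_i, U_i, b_i = params["i"]
    W_f, U_f, b_f = params["f"]
    W_o, U_o, b_o = params["o"]
    W_c, U_c, b_c = params["c"]

    i_t = sigmoid(W_i @ x_t + U_i @ h_prev + b_i)      # input gate
    f_t = sigmoid(W_f @ x_t + U_f @ h_prev + b_f)      # forget gate
    o_t = sigmoid(W_o @ x_t + U_o @ h_prev + b_o)      # output gate
    c_tilde = np.tanh(W_c @ x_t + U_c @ h_prev + b_c)  # candidate memory cell
    c_t = i_t * c_tilde + f_t * c_prev                 # element-wise cell update
    h_t = o_t * c_t                                    # output, as written in Eq. (1)
    return h_t, c_t
```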

The standard single-direction LSTM processes the input in only one direction; it does not utilize contextual information from future inputs. In other words, the value of h_t does not depend on any element of {x_{t+1}, x_{t+2}, ..., x_n}. On the other hand, a bi-directional LSTM (biLSTM) utilizes both the previous and future context by processing the input sequence in two directions and generating two independent sequences of LSTM output vectors.



Fig. 1: An approach to answering questions related to product facts and specifications. In this work, we focus on the first stage of the approach, which is similar to the answer selection problem.

One LSTM processes the input sequence in the forward direction, while the other processes the input in the reverse direction. The output at each time step is the concatenation of the two output vectors from both directions, i.e., h_t = →h_t || ←h_t. In this case, the value of h_t depends on every element of {x_1, x_2, ..., x_n}.

B. Answer Selection

Answer selection is an important problem in natural language processing. Given a question and a set of candidate answers, the task is to identify which of the candidates contains the correct answer to the question. For example, given the question "Who won the Nobel Prize in Physics in 2016?" and the following candidate answers:

1) The Fields Medal is regarded as one of the highest honors a mathematician can receive, and has been described as the mathematician's Nobel Prize.

2) Neutron scattering played an important role in the experimental exploration of the theoretical ideas of Thouless, Haldane, and Kosterlitz, who won the Nobel Prize in Physics in 2016.

3) The Nobel Prize was established in the will of Alfred Nobel, a Swedish inventor of many inventions, most famously dynamite.

The second answer should be selected.

Previous work on answer selection typically relies on feature engineering, linguistic tools, or external resources [12]–[16]. Recently, many researchers have investigated employing deep learning for the task [1]–[4]. Most recently proposed deep learning models outperform traditional techniques. In addition, they do not need any feature-engineering effort or hand-coded resources beyond some large unlabeled corpus on which to learn the initial word embeddings, such as word2vec [17] or GloVe [18]. The authors in [5] provide a comprehensive review of deep learning methods applied to answer selection; the most popular datasets and evaluation metrics for answer selection are also described in that work.

C. Customer Service Chatbots

Developing customer service chatbots for e-commerce websites is an active area. For example, ShopBot^1 aims at helping consumers narrow down the best deals from eBay's over a billion listings. SuperAgent, introduced in [19], is a powerful chatbot designed to improve the online shopping experience. Given a specific product page and a customer question, SuperAgent selects the best answer from multiple data sources within the page, such as in-page product information, existing customer questions & answers, and customer reviews of the product. In [20], T. Lai et al. proposed a simple but effective deep learning model for answering questions regarding product facts and specifications.

D. Transfer Learning for Question Answering

Transfer learning [21] has been successfully applied to various domains such as speech recognition [22], computer vision [23], and natural language processing [24]. Its applicability to question answering and answer selection has recently been studied [25], [26]. In [25], the authors created SQuAD-T, a modification of the original large-scale SQuAD dataset [27] that allows for directly training and evaluating answer selection systems. Through a basic transfer learning technique from SQuAD-T, the authors achieve the state of the art on two well-studied QA datasets, WikiQA [28] and SemEval-2016 (Task 3A) [29]. In [26], the authors tackle the TOEFL listening comprehension test [30] and MCTest [31] with transfer learning from MovieQA [32] using two existing QA models. To the best of our knowledge, there is no published work on exploring transfer learning techniques for improving the performance of models for answering questions related to product facts and specifications.

III. APPROACH

A. Baseline Model

Our task of matching questions and product specifications is similar to the answer selection problem. "Answers" shall be individual product specifications.

1 https://shopbot.ebay.com


Fig. 2: Architecture of the baseline model.

Even though a common trait of a number of recent state-of-the-art methods for answer selection is the use of complicated deep learning models [1]–[4], T. Lai et al. showed in [20] that complicated models may not be needed in this case. Simple but well designed models may match the performance of complicated models for the task of selecting the most relevant specification to a question. Inspired by the results of [20], we propose a new simple baseline model for the task.

The baseline model takes a question and a specification name as input and outputs a score indicating their relevance. During inference, given a question, the model is used to assign a score to every candidate specification based on how relevant the specification is. After that, the top-ranked specification is selected. Figure 2 illustrates the overall architecture of the baseline model. Given a question Q and a specification name S, the model first transforms Q and S into two sequences Q_e = [e^Q_1, e^Q_2, ..., e^Q_m] and S_e = [e^S_1, e^S_2, ..., e^S_n] using word embeddings pre-trained with GloVe [18]. Here, e^Q_i is the embedding of the i-th word of the question and e^S_j is the embedding of the j-th word of the specification name. m and n are the lengths of Q and S, respectively. After that, we feed Q_e and S_e individually into a parameter-shared biLSTM model. For the question Q, we obtain a sequence of vectors [q_1, q_2, ..., q_m] from the biLSTM, where q_i = →q_i || ←q_i. To form a final fixed-size vector representation of the question Q, we select the maximum value over each dimension of the vectors [q_1, q_2, ..., q_m] (max pooling). We denote the final representation of the question as q. In a similar way, we obtain the final representation s of the specification. Finally, the probability of the specification being relevant is

p(y = 1 | q, s) = σ(q^T M s + b)        (2)

where the bias term b and the transformation matrix M are model parameters. The sigmoid function squashes the score to a probability between 0 and 1.
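A minimal PyTorch sketch of this architecture is given below. It is an illustration of the description above rather than the authors' released code; the hidden size, parameter initialization, and the class name BaselineRanker are assumptions.

```python
import torch
import torch.nn as nn

class BaselineRanker(nn.Module):
    """Shared-biLSTM encoder with max pooling and a bilinear relevance score,
    following the baseline model description and Equation (2)."""

    def __init__(self, embeddings, hidden_size=128):
        super().__init__()
        # embeddings: float tensor of pre-trained GloVe vectors (vocab_size x embed_dim)
        self.embed = nn.Embedding.from_pretrained(embeddings, freeze=False)
        self.bilstm = nn.LSTM(embeddings.size(1), hidden_size,
                              batch_first=True, bidirectional=True)
        enc_dim = 2 * hidden_size
        self.M = nn.Parameter(torch.randn(enc_dim, enc_dim) * 0.01)
        self.b = nn.Parameter(torch.zeros(1))

    def encode(self, token_ids):
        # token_ids: (batch, seq_len) word indices
        out, _ = self.bilstm(self.embed(token_ids))   # (batch, seq_len, 2*hidden)
        return out.max(dim=1).values                  # max pooling over time

    def forward(self, question_ids, spec_ids):
        q = self.encode(question_ids)                  # same encoder for both inputs
        s = self.encode(spec_ids)
        score = (q @ self.M * s).sum(dim=1) + self.b   # q^T M s + b, per example
        return torch.sigmoid(score)                    # probability of relevance
```

The parameter-shared biLSTM is realized simply by calling the same encode method on both inputs, so question and specification are embedded in the same space before the bilinear comparison.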

B. Transfer Learning

The transfer learning technique used in this work is simple and includes two steps. The first step is to pre-train the baseline model on a large source dataset. The second step is to fine-tune the same model on the target dataset, which typically contains much less data than the source dataset. The effectiveness of transfer learning is evaluated by the performance of the baseline model on the target dataset.
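The two steps can be sketched schematically as below, reusing the hypothetical BaselineRanker from Section III-A. The epoch counts, learning rates, loss choice, and data-loader interfaces are placeholders rather than values reported in the paper.

```python
import torch
import torch.nn as nn

def run_epochs(model, loader, epochs, lr):
    """Binary cross-entropy training over (question_ids, candidate_ids, label) batches."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCELoss()
    model.train()
    for _ in range(epochs):
        for question_ids, candidate_ids, labels in loader:
            optimizer.zero_grad()
            probs = model(question_ids, candidate_ids)
            loss = loss_fn(probs, labels.float())
            loss.backward()
            optimizer.step()

def transfer_learning(model, source_loader, target_loader):
    """Two-step procedure: pre-train on the source data, then fine-tune on the target data."""
    # Step 1: pre-train on the large source dataset (e.g., AmazonCQA question-answer pairs).
    run_epochs(model, source_loader, epochs=3, lr=1e-3)
    # Step 2: fine-tune the same model (same weights) on the smaller target dataset
    # (e.g., HomeDepotQA question-specification pairs); only the data changes.
    run_epochs(model, target_loader, epochs=10, lr=1e-4)
    return model
```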

C. Datasets

HomeDepotQA. The target dataset used for experiments is created using Amazon Mechanical Turk (MTurk)^2. MTurk connects requesters (i.e., people who have work to be done) and workers (i.e., people who work on tasks for money). Requesters can post small tasks for workers to complete for a fee. These small tasks are referred to as human intelligence tasks (HITs). We crawled the information of products listed in the Home Depot website^3. For each product, we create HITs where workers are asked to write questions regarding the specifications of the product. The final dataset consists of 7,119 correct question-specification pairs that are related to 153 different products in total. Table I shows some examples of correct question-specification pairs collected. We refer to the dataset as HomeDepotQA. We split up the dataset into a training set, a development set, and a test set so that the test set has no products in common with the training set or the development set.

2 https://www.mturk.com
3 https://www.homedepot.com


Product ID | Product Category | Question | Correct Specification
207025690 | Microwaves | What is the wattage? | Wattage (watts)
205209621 | Refrigerators | How many bottles can I place inside? | Bottle Capacity
301688014 | Smart TVs | Is the screen size at least 50 inches? | Screen Size (In.)
205867752 | Electric Ranges | Can I return the range if I change my mind? | Returnable

TABLE I: Examples of correct question-specification pairs.

Split | # Positive Pairs | # Negative Pairs | # Pairs
Training Set | 815484 | 812030 | 1627514
Development Set | 2482 | 2518 | 5000

TABLE II: Statistics of the AmazonCQA dataset.

For example, if the test set has a question about the product with ID 205148717, then there will be no questions about that product in the training set or the development set. We are interested in whether our proposed model can generalize to answer questions about new products. The proportions of the training set, development set, and test set are roughly 80%, 10%, and 10% of the total correct question-specification pairs, respectively.
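One way to realize such a product-disjoint split is sketched below; the 'product_id' field name and the split-by-product-count approximation of the stated proportions are assumptions, as the paper does not give its exact splitting procedure.

```python
import random

def split_by_product(pairs, train_frac=0.8, dev_frac=0.1, seed=0):
    """Split question-specification pairs so that no product appears in more
    than one split. Each pair is assumed to carry a 'product_id' field."""
    product_ids = sorted({p["product_id"] for p in pairs})
    random.Random(seed).shuffle(product_ids)
    n_train = int(train_frac * len(product_ids))
    n_dev = int(dev_frac * len(product_ids))
    train_ids = set(product_ids[:n_train])
    dev_ids = set(product_ids[n_train:n_train + n_dev])
    train = [p for p in pairs if p["product_id"] in train_ids]
    dev = [p for p in pairs if p["product_id"] in dev_ids]
    test = [p for p in pairs if p["product_id"] not in (train_ids | dev_ids)]
    return train, dev, test
```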

AmazonCQA. The source dataset used for experiments is a preprocessed version of the QA dataset collected in [6]. The original dataset consists of 800 thousand questions and over 3.1 million answers collected from the CQA platform of Amazon. Each answer is supposed to address its corresponding question. However, since the CQA data (i.e., questions and answers) is a resource created by a community of casual users, there is a lot of noise in addition to the complications of informal language use, typos, and grammatical mistakes. Below are the major preprocessing steps applied to the original dataset (a rough code sketch of these filters follows the list):

1) We remove questions or answers that contain URLs, because the target dataset does not have any questions or specification names that contain URLs.

2) We set the minimum length of questions to four tokens for filtering out poorly structured questions. There can be many examples where the questions are very short and not grammatically correct; for example, people might just ask: "Waterproof?". In the target dataset, a question is typically a complete sentence (e.g., "What happens if this laser level kit gets wet?").

3) Answers must also contain at least ten tokens, as the same problem can occur here; for example, the answer might be a single "Yes", which does not contain much semantic information.

4) We remove questions or answers that are too long. One reason is that most of the specification names in the target dataset are short. Therefore, the answers in the source dataset should not be too long.

5) We remove answers that contain phrases such as "I have no idea" or "I am not sure", because it is likely that those answers do not contain any information relevant to the question.

6) In order to be able to train the baseline model, we sample one or more negative answers for each question.
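A rough sketch of filters 1)–5), assuming plain-text question and answer strings; the threshold for "too long" and the list of uncertainty phrases are illustrative, as the paper does not specify them. Negative sampling (step 6) would then be applied to the pairs that survive these filters.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
UNCERTAIN_PHRASES = ("i have no idea", "i am not sure")  # illustrative, not exhaustive

def keep_pair(question, answer, max_tokens=100):
    """Return True if a (question, answer) pair passes filters 1)-5) above."""
    q_tokens, a_tokens = question.split(), answer.split()
    if URL_RE.search(question) or URL_RE.search(answer):        # 1) drop text with URLs
        return False
    if len(q_tokens) < 4:                                        # 2) very short questions
        return False
    if len(a_tokens) < 10:                                       # 3) very short answers
        return False
    if len(q_tokens) > max_tokens or len(a_tokens) > max_tokens: # 4) overly long text
        return False
    answer_lower = answer.lower()
    if any(p in answer_lower for p in UNCERTAIN_PHRASES):        # 5) non-committal answers
        return False
    return True
```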

We refer to the final preprocessed dataset as AmazonCQA.

We split up the dataset into a training set and a development set. We do not need a test set, as the effectiveness of transfer learning is evaluated only by the performance on the target dataset. The statistics of the AmazonCQA dataset are shown in Table II.

IV. EXPERIMENTS AND RESULTS

During the pre-training step, we trained the baseline model on the training set of AmazonCQA. We chose the hyperparameters of the model using the development set of AmazonCQA. After that, during fine-tuning, the model was further trained on the training set of the target dataset (i.e., HomeDepotQA) and tuned on its development set, and the performance on the test set of the target dataset was reported as the final result. We use mean reciprocal rank (MRR) and accuracy as the performance measurement metrics. In addition to fine-tuning the baseline model on the entire training set of HomeDepotQA, we conducted experiments where we restricted the amount of training data from HomeDepotQA used for fine-tuning the model. Table III shows the experimental results. Pre-training on the AmazonCQA dataset clearly helps. In the setting where only 10% of the correct question-specification pairs in the training set of HomeDepotQA are used, transfer learning results in an increase of about 10% in accuracy.
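For reference, a small sketch of how MRR and accuracy can be computed from ranked candidate lists; it assumes each question has exactly one correct specification, as in HomeDepotQA, and that candidates are already sorted by model score.

```python
def evaluate(ranked_lists):
    """ranked_lists: for each question, its candidate specifications sorted by
    model score (highest first), as (spec_id, is_correct) tuples."""
    reciprocal_ranks, hits = [], 0
    for candidates in ranked_lists:
        # Rank (1-based) of the single correct specification.
        rank = next(i for i, (_, is_correct) in enumerate(candidates, start=1) if is_correct)
        reciprocal_ranks.append(1.0 / rank)
        hits += int(rank == 1)                  # accuracy: correct spec ranked first
    mrr = sum(reciprocal_ranks) / len(ranked_lists)
    accuracy = hits / len(ranked_lists)
    return mrr, accuracy
```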

It is worth mentioning that when training on the training set of HomeDepotQA, we use all possible question-specification pairs. In other words, if there are k questions about a product and the product has h specifications, then there are h × k question-specification examples related to the product, and exactly k of them are positive examples.
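A short sketch of this pair construction, with the record layout assumed for illustration: for each product it yields h × k labeled pairs, exactly k of which are positive.

```python
def build_pairs(questions, specifications):
    """questions: list of (question_text, correct_spec_name) for one product;
    specifications: the h specification names of that product."""
    pairs = []
    for question_text, correct_spec in questions:
        for spec_name in specifications:
            label = 1 if spec_name == correct_spec else 0
            pairs.append((question_text, spec_name, label))
    return pairs
```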

V. APPLICATION

As an application of this work, we integrate the best performing model trained in this work into ISA, a mobile-based intelligent shopping assistant. ISA is designed to improve users' shopping experience in brick-and-mortar stores. First, an in-store user just needs to take a picture or scan the barcode of a product. ISA then retrieves the information of the product of interest from a database by using computer vision techniques. After that, the user can ask natural language questions about the product to ISA. The user can either type in the questions or directly speak out the questions using voice. ISA is integrated with both speech recognition and speech synthesis abilities, which allows users to ask questions without typing.


Pre-trained | Percentage of HomeDepotQA's training set used | MRR | Accuracy
No | 10% | 0.7636 | 0.6667
Yes | 10% | 0.8442 | 0.7656
No | 50% | 0.8636 | 0.7933
Yes | 50% | 0.9049 | 0.8486
No | 100% | 0.8815 | 0.8180
Yes | 100% | 0.9030 | 0.8443

TABLE III: Results of transfer learning on the target dataset.


The role of our model is to answer questions regarding the specifications of a product. Given a question about a product, the model is used to rank every specification of the product based on how relevant the specification is. We select the top-ranked specification and use it to generate the response sentence using predefined templates. Figure 3 shows example outputs generated using our model.
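A toy sketch of this ranking-plus-template step follows; the template string, tokenizer interface, and specification format are invented for illustration and are not taken from ISA.

```python
import torch

def answer_question(model, question_ids, product_specs, tokenize):
    """Rank every specification of a product against the question and fill a
    response template with the top-ranked one.

    product_specs: list of (spec_name, spec_value) pairs for the product.
    tokenize: maps a string to the tensor of word indices expected by the model.
    """
    scored = []
    for spec_name, spec_value in product_specs:
        with torch.no_grad():
            score = model(question_ids, tokenize(spec_name)).item()
        scored.append((score, spec_name, spec_value))
    _, best_name, best_value = max(scored)           # top-ranked specification
    return f"The {best_name} of this product is {best_value}."
```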

Fig. 3: Answering questions regarding product specifications.

VI. CONCLUSION

In this work, we show that the large volume of existing CQA data can be beneficial when building a system for answering questions related to product facts and specifications. Our experimental results demonstrate that the performance of a model for answering questions related to products listed in the Home Depot website can be improved by a large margin via a simple transfer learning technique from an existing large-scale Amazon CQA dataset. Transfer learning can result in an increase of about 10% in accuracy in the experimental setting where we restrict the size of the data of the target task used for training. In addition, we integrate the best performing model trained in this work into ISA, an intelligent shopping assistant designed with the goal of improving the shopping experience in physical stores. In the future, we plan to investigate more transfer learning techniques for utilizing the large volume of existing CQA data.

REFERENCES

[1] Z. Wang, W. Hamza, and R. Florian, "Bilateral multi-perspective matching for natural language sentences," in IJCAI, 2017.
[2] W. Bian, S. Li, Z. Yang, G. Chen, and Z. Lin, "A compare-aggregate model with dynamic-clip attention for answer selection," in CIKM, 2017.
[3] G. Shen, Y. Yang, and Z.-H. Deng, "Inter-weighted alignment network for sentence pair modeling," in EMNLP, 2017.
[4] Q. H. Tran, T. Lai, G. Haffari, I. Zukerman, T. Bui, and H. Bui, "The context-dependent additive recurrent neural net," in NAACL-HLT, 2018.
[5] T. M. Lai, T. Bui, and S. Li, "A review on deep learning techniques applied to answer selection," in Proceedings of the 27th International Conference on Computational Linguistics. Association for Computational Linguistics, 2018, pp. 2132–2144.
[6] M. Wan and J. McAuley, "Modeling ambiguity, subjectivity, and diverging viewpoints in opinion question answering systems," 2016 IEEE 16th International Conference on Data Mining (ICDM), pp. 489–498, 2016.
[7] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," in EMNLP, 2014.
[8] T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," in EMNLP, 2015.
[9] A. Karpathy and L. Fei-Fei, "Deep visual-semantic alignments for generating image descriptions," 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3128–3137, 2015.
[10] Z. Yang, D. Yang, C. Dyer, X. He, A. J. Smola, and E. H. Hovy, "Hierarchical attention networks for document classification," in HLT-NAACL, 2016.
[11] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[12] M. Wang, N. A. Smith, and T. Mitamura, "What is the Jeopardy model? A quasi-synchronous grammar for QA," in EMNLP-CoNLL, 2007.
[13] M. Wang and C. D. Manning, "Probabilistic tree-edit models with structured latent variables for textual entailment and question answering," in Proceedings of the 23rd International Conference on Computational Linguistics, ser. COLING '10. Stroudsburg, PA, USA: Association for Computational Linguistics, 2010, pp. 1164–1172.
[14] M. Heilman and N. A. Smith, "Tree edit models for recognizing textual entailments, paraphrases, and answers to questions," in Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, ser. HLT '10. Stroudsburg, PA, USA: Association for Computational Linguistics, 2010, pp. 1011–1019.
[15] S. W.-t. Yih, M.-W. Chang, C. Meek, and A. Pastusiak, "Question answering using enhanced lexical semantic models," in ACL. Association for Computational Linguistics, August 2013.
[16] X. Yao, B. V. Durme, C. Callison-Burch, and P. Clark, "Answer extraction as sequence tagging with tree edit distance," in North American Chapter of the Association for Computational Linguistics (NAACL), 2013.


[17] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, ser. NIPS'13. USA: Curran Associates Inc., 2013, pp. 3111–3119.
[18] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," in EMNLP, 2014.
[19] L. Cui, S. Huang, F. Wei, C. Tan, C. Duan, and M. Zhou, "SuperAgent: A customer service chatbot for e-commerce websites," in Proceedings of ACL 2017, System Demonstrations. Association for Computational Linguistics, 2017, pp. 97–102. [Online]. Available: http://www.aclweb.org/anthology/P17-4017
[20] T. Lai, T. Bui, S. Li, and N. Lipka, "A simple end-to-end question answering model for product information," in Proceedings of the First Workshop on Economics and Natural Language Processing. Association for Computational Linguistics, 2018, pp. 38–43. [Online]. Available: http://aclweb.org/anthology/W18-3105
[21] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, Oct 2010.
[22] J. T. Huang, J. Li, D. Yu, L. Deng, and Y. Gong, "Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, May 2013, pp. 7304–7308.
[23] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, "CNN features off-the-shelf: An astounding baseline for recognition," in Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops, ser. CVPRW '14. Washington, DC, USA: IEEE Computer Society, 2014, pp. 512–519. [Online]. Available: http://dx.doi.org/10.1109/CVPRW.2014.131
[24] Y. Zhang, R. Barzilay, and T. S. Jaakkola, "Aspect-augmented adversarial networks for domain adaptation," CoRR, vol. abs/1701.00188, 2017. [Online]. Available: http://arxiv.org/abs/1701.00188
[25] S. Min, M. J. Seo, and H. Hajishirzi, "Question answering through transfer learning from large fine-grained supervision data," in ACL, 2017.
[26] Y. Chung, H. Lee, and J. R. Glass, "Supervised and unsupervised transfer learning for question answering," CoRR, vol. abs/1711.05345, 2017. [Online]. Available: http://arxiv.org/abs/1711.05345
[27] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, "SQuAD: 100,000+ questions for machine comprehension of text," in EMNLP, 2016.
[28] Y. Yang, W.-t. Yih, and C. Meek, "WikiQA: A challenge dataset for open-domain question answering," in EMNLP, 2015.
[29] P. Nakov, L. Màrquez i Villodre, A. Moschitti, W. Magdy, H. Mubarak, A. A. Freihat, J. Glass, and B. Randeree, "SemEval-2016 Task 3: Community question answering," in SemEval@NAACL-HLT, 2016.
[30] B.-H. Tseng, S.-s. Shen, H.-y. Lee, and L.-S. Lee, "Towards machine comprehension of spoken content: Initial TOEFL listening comprehension test by machine," in INTERSPEECH, 2016.
[31] M. Richardson, C. J. C. Burges, and E. Renshaw, "MCTest: A challenge dataset for the open-domain machine comprehension of text," in EMNLP, 2013.
[32] M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. Fidler, "MovieQA: Understanding stories in movies through question-answering," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4631–4640, 2016.