
Predicting Emotional Reaction Distributions from BuzzFeed Articles

Ilan Goodman [email protected]

Vivek Jain [email protected]

Sunil Pai [email protected]

Abstract

In this paper, we explore how to predict the emotional response distribution of an article with text and images, based solely on the article’s text. In particular, we establish a new data corpus of BuzzFeed articles. We run Bayesian estimators with dependency parses and neural networks on this dataset and find that while dependency parses with linear models (Jensen-Shannon (J-S) distance 0.319, lower is better) do not improve over our baseline prediction (J-S distance 0.286), the recurrent neural network (RNN) is able to significantly outperform the baseline prediction (J-S distance 0.248). Since this is a novel dataset, more testing with this corpus is necessary to determine if our results are competitive.

1 Introduction

Predicting and understanding sentiment in response to content is valuable for deeper understanding of reviews, social networks, blogs, and news websites. Sentiment analysis has proven a valuable source of progress in natural language processing and natural language understanding. While there has been a wealth of research into predicting positive and negative sentiment, the picture is in reality more nuanced, and many emotions can come up in response to a particular piece of text. We wish to explore the full emotional reaction distribution in response to text articles.

Prior literature on the subject of sentiment analysis has found that although relatively simple approaches such as Naïve Bayes and support vector machines (SVM) with unigram and bigram (bag of words/bigrams, or BOW) features are reasonably accurate for predicting binary sentiment, BOW features are typically inadequate for more complex sentence constructions and difficult to improve upon (Pang et al., 2002). Accordingly, such superficial features will not suffice for our task: we must apply more sophisticated natural language understanding (NLU) techniques to successfully predict a more nuanced distribution. Similarly, simply using semantic orientation to predict binary sentiment performs comparably to the other BOW features, with the same issues in generalizing to more nuanced analyses of emotion (Turney, 2002). Neural networks are far more powerful, can encode more linguistic information, and typically perform better, but they come at the cost of added complexity. There are many different methods and architectures for creating neural nets, each of which has different characteristics and performance. Regardless of the approach, distinguishing between more than two classes (i.e., going beyond binary sentiment) is difficult, and accuracy drops significantly when doing so.

In this paper, we introduce a new dataset of BuzzFeed articles, annotated by readers with the emotional reaction distribution, and use this dataset to train and test our models. With this dataset, we explore a combination of supervised (Maximum Likelihood Estimate (MLE) and Maximum A Posteriori (MAP) estimators with dependency parses) and semi-supervised learning approaches (neural networks and unsupervised word vectors).

2 Dataset

We scrape BuzzFeed for data on 16,426 articles, which represents all the articles that we can retrieve from the website’s “Buzz” and “Community” sections. For each article, users can vote for up to three out of thirteen categories: “cute,” “drab,” “ew,” “fab,” “fail,” “hate,” “lol,” “love,” “omg,” “trashy,” “win,” “wtf,” and “yaaass.” While BuzzFeed content is centered around images, it is of interest to see how accurately we can predict the correct reactions based purely on the text in the article.

After filtering BuzzFeed articles for those that received at least 50 votes, we are left with 12,685 articles. Around 75% of these articles have “love” as their most popular emotion, and the most popular emotion for nearly every other article is “lol.” We notice that the scraped pages also contain British versions of the reaction categories, which are partially different (they include “amazing,” “blimey,” “splendid,” and “ohdear,” while excluding a few of the American labels), but the British labels have far fewer votes than the American labels; accordingly, we ignore these categories. Figure 1 shows an example of a web-scraped article.

{'reactions': {'lol': 1547, 'love': 988, 'omg': 192,
               'win': 161, 'fail': 62, 'wtf': 32,
               'cute': 10, 'trashy': 10, 'hate': 8,
               'yaaass': 6, 'ew': 3},
 'title': 'The 23 Most Painfully Awkward Things That Happened In 2014',
 'section': 'Culture',
 'body': "This guy mastering the art of the backup: Riff Raff's darkest moment:..."}

Figure 1: An example of a web-scraped BuzzFeed article. Note that no image data is gathered here, just text.

Importantly, the rated emotions have relationships between each other in terms of reactions, with some emotions (like “ew” and “trashy”) having positive correlations and others (like “lol” and “love”) having negative correlations. This relationship among common BuzzFeed emotions is shown in Figure 2.

3 Baseline Predictor Performance

3.1 Classification

We initially train classifiers to predict the top voted emotion for a particular article in order to explore the topology of our dataset. The two most popular top emotions are “love” and “lol.” In fact, we find these two emotions are strongly negatively correlated, so articles tend to receive one or the other. Accordingly, we try a variety of classifiers to distinguish between them. There are 1,880 articles with “lol” as the top emotion, and we select a random subset of 1,880 articles with “love” as the top emotion to achieve an even 50-50 split. We use a 66%-34% train-test split.

Figure 2: There is notable correlation among the emotions. Particularly striking is the negative correlation between “wtf” and “love” and the positive correlation between “ew” and “wtf.”

We use the scikit-learn (Pedregosa et al., 2011) implementations of the following algorithms; unless otherwise noted, default parameters are used.

3.1.1 Features: TF-IDF

The features used for all of the classifiers are a Term Frequency-Inverse Document Frequency (TF-IDF) reweighting of the token counts of the body of each article. TF-IDF computes the weight of each word as its count in the article (term frequency) scaled by the logarithm of the inverse of the fraction of documents in which that word appears (inverse document frequency). Intuitively, we down-weight words that occur in too many documents (such as “the”), since they are likely to carry less intrinsic meaning.
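As a minimal sketch of this feature pipeline (the bodies list is a hypothetical stand-in for the scraped article bodies; the paper only states that scikit-learn defaults are used):

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the scraped article bodies.
bodies = [
    "This guy mastering the art of the backup ...",
    "The 23 most painfully awkward things that happened in 2014 ...",
]

# TfidfVectorizer combines token counting with IDF reweighting;
# we keep scikit-learn's default parameters.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(bodies)  # sparse (n_articles, n_terms) matrix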

3.1.2 Gaussian Naïve Bayes

Naïve Bayes builds a Bayesian probability model of the features, assuming they are independent when conditioned on the class (i.e., top emotion) of each article. This achieves a test accuracy of 0.59 on our dataset. Although this is better than random guessing, it is not much better, suggesting that the Naïve Bayes assumption may not hold well for our dataset. The training accuracy is 0.98, strongly indicating that this model is overfitting.


3.1.3 Logistic Regression

Logistic regression models the probability of each class as a logistic function of a weighted sum of the features. This has a training accuracy of 0.90 and a test accuracy of 0.65. Although it is still overfitting, it is not as egregious as with Gaussian Naïve Bayes. More data would likely help, so that the model has better information about which features generalize to many articles and which do not.

3.1.4 Support Vector Classifier (SVC)

Support vector machine classification works by finding points that can be used as “support vectors” to define the classification boundary between classes. We found a linear kernel to work best. This has a training accuracy of 0.94 and a test accuracy of 0.67. Again, this is overfitting (although using a linear rather than Gaussian kernel reduces the amount of overfitting), and so more data would likely help, but this is the best model that we found.
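A minimal sketch of this baseline comparison, with synthetic stand-ins for the TF-IDF features and binary top-emotion labels (note that GaussianNB requires dense input, so a sparse TF-IDF matrix would first need .toarray()):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Synthetic stand-ins for the TF-IDF matrix and top-emotion labels
# (0 = "lol", 1 = "love"); in the paper these come from 3,760 articles.
rng = np.random.RandomState(0)
X = rng.rand(200, 50)
y = rng.randint(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.34, random_state=0)  # the paper's 66%-34% split

for clf in (GaussianNB(), LogisticRegression(), SVC(kernel="linear")):
    clf.fit(X_train, y_train)
    print(type(clf).__name__,
          "train:", clf.score(X_train, y_train),
          "test:", clf.score(X_test, y_test))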

We could of course have spent more time tuning these models to get better numbers, but we chose not to, since they are just our baseline and our primary goal is to predict the full distribution. That said, these scores are relatively low, indicating that this is a difficult task. Although the models clearly do better than random, and thus are able to learn something from the data, they do not perform well overall. We suspect this is partially because BuzzFeed focuses primarily on images and GIFs, with relatively little text for these models to learn from. To test this, we try a human oracle.

3.1.5 Human Oracle

Using all of the extracted information in a BuzzFeed article, a human subject predicts whether a particular article has “love” or “lol” as its top emotion, to see how well actual natural language understanding and higher-level reasoning perform on our dataset. Out of 100 articles, the subject achieves an accuracy of 0.73. This is better than the best machine learning model score of 0.67, but not significantly higher. While some articles are quite clearly destined to have “love” or “lol” as their top emotion based on the title alone (e.g., 13 Reasons Why “Caitlin’s Way” Was Actually The Best Show From Your Childhood is clearly “love,” as are most posts about TV shows, while This Guy Took A Joke With His Friend So Far That He Bought Himself A Billboard is clearly “lol”), most are rather unclear, and the top emotion seems almost arbitrary (e.g., 16 Hilarious One-Star Reviews Of Children’s Books has “hilarious” in its title; nevertheless, “love” was the top emotion for this article).

However, trying to predict the full distribution should actually help with this. The aforementioned article with “hilarious” in its title has “love” as its top emotion, but “lol” is quite a close second. As such, if a model predicts the full distribution and guesses that “lol” will be the top emotion, this will not be treated as a large error.

3.2 Predicting the Mean

Our baseline for the full distribution is simply to always predict the average probability distribution. Surprisingly, this performs quite well, achieving a J-S distance of 0.286.

3.2.1 Jensen-Shannon Distance (J-S Distance)

We adopt J-S distance as our evaluation metric, which provides a measurement of the difference between two probability distributions. The J-S distance $J$ between two probability distributions $P$ and $Q$ is given by the expression

$$J = \frac{1}{2} D(P \,\|\, M) + \frac{1}{2} D(Q \,\|\, M), \tag{1}$$

where $M = \frac{1}{2}(P + Q)$ and $D(P \,\|\, Q)$ is the K-L divergence of $P$ from $Q$:

$$D(P \,\|\, Q) = \sum_i P(i) \ln \frac{P(i)}{Q(i)}. \tag{2}$$

We choose J-S distance because it is symmetric, in contrast to the Kullback-Leibler divergence.
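A direct NumPy implementation of Equations 1 and 2 (the eps guard against log(0) is our addition; the paper does not discuss zero-probability handling):

import numpy as np

def js_distance(p, q, eps=1e-12):
    """J-S distance per Equations 1 and 2 (natural logarithm)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    def kl(a, b):
        # K-L divergence D(a || b), guarded against log(0).
        return np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Identical distributions give 0; disjoint ones give ln 2.
print(js_distance([0.5, 0.5], [0.5, 0.5]))  # ~0.0
print(js_distance([1.0, 0.0], [0.0, 1.0]))  # ~0.693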

4 Dependency Parse

A dependency parse aims to capture the directed connections between linguistic units (such as words) in a sentence to extract meaning beyond a simple parse tree. When creating the dependency parse of a sentence, one transforms the string of words into a list of directed word pairs annotated with a relationship. For example, the sentence “It is so beautiful” returns the parse

[['root', 'ROOT', 'beautiful'],
 ['nsubj', 'beautiful', 'It'],
 ['cop', 'beautiful', 'is'],
 ['advmod', 'beautiful', 'so']],

reflecting that ‘beautiful’ is the root of the sentence, ‘It’ is the nominal subject of ‘beautiful’ (since ‘is’ is a copular verb), ‘beautiful’ is the complement of the copular verb ‘is,’ and ‘so’ is the adverbial modifier of ‘beautiful’ (Manning et al., 2014). Our initial work processes the BuzzFeed article body and produces dependency parses, which we use as features in our predictions.
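For illustration, a parse of this form can be produced with an off-the-shelf parser. The paper uses Stanford CoreNLP (Manning et al., 2014); the sketch below uses spaCy purely as an accessible stand-in, and its label inventory differs from CoreNLP’s (for instance, spaCy makes the copular verb the head rather than emitting a 'cop' relation):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("It is so beautiful")

# Each token yields a (relation, head, dependent) triple.
parse = [(tok.dep_, tok.head.text, tok.text) for tok in doc]
print(parse)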

4.1 Dependency Parse Features

The dependency parses may be considered at several levels in a hierarchical manner, as in the sketch after this paragraph. At the root level, we consider the full dependency (for example, the tuple ('advmod', 'very', 'much')). However, this feature set is often too sparse, particularly when the data varies significantly. In such cases, dependencies such as ('appos', 'delilah', 'cat') can appear, which will have a negligible impact on our model due to their rarity. Therefore, we make the dataset more dense by considering additional features such as ('advmod', 'very', 1), ('advmod', 'much', 2), and ('very', 'much', 3). This level of the hierarchy (the latter two sets) tells us about the function of each word in the dependency and which two words are being related. The final level of the hierarchy is the densest, yet least descriptive: only single words (and their positions in the parse) comprise these features. The features demonstrate the tradeoff between feature density (as opposed to sparsity) and descriptiveness (as opposed to ambiguity). In our model, we count how often each of these features shows up in the body of a BuzzFeed article and weight the features by their frequency.
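As a sketch, the hierarchy can be expanded from each parse triple as follows. The exact feature scheme is our reconstruction of the description above; the positional markers 1, 2, and 3 follow the examples given:

from collections import Counter

def dependency_features(parse):
    # parse: list of (relation, governor, dependent) triples.
    feats = Counter()
    for rel, gov, dep in parse:
        feats[(rel, gov, dep)] += 1   # root level: the full dependency
        feats[(rel, gov, 1)] += 1     # relation plus word in position 1
        feats[(rel, dep, 2)] += 1     # relation plus word in position 2
        feats[(gov, dep, 3)] += 1     # the word pair without the relation
        feats[(gov, 1)] += 1          # densest level: single words
        feats[(dep, 2)] += 1          #   with their positions
    return feats

print(dependency_features([("advmod", "very", "much")]))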

4.2 Dependency Parse MAP (DPMAP)

The dependency parse MAP method works in a similar manner to multinomial Naïve Bayes classification. In particular, in the analysis of dependency features, we adopt the Naïve Bayes assumption that different dependency features in the same sentence are independent of each other. Consider a BuzzFeed article $x$, composed of sparse features representing dependency features from the article. Then the MLE of the multinomial distribution (with Laplace smoothing parameter $\lambda$) for the emotion $C_k$ takes the form

$$p(C_k \mid x) = \frac{p(x, C_k) + \lambda}{\sum_k p(x, C_k) + k\lambda}. \tag{3}$$

Applying the Naïve Bayes assumption that the dependency features $d$ belonging to the article $x$ are all independent, we obtain

$$p(x, C_k) \propto \sum_{d \in x} p(C_k, d). \tag{4}$$

To train the model, we estimate the distribution $p(C_k \mid d)$ by iterating through the articles and tallying votes for emotions and dependencies. We then normalize the probabilities such that $\sum_k p(C_k \mid d) = 1$ (i.e., each dependency has an emotion distribution attributed to it). To test, we apply Equations 3 and 4, which yield the emotional distributions for the articles.

To improve on the MLE algorithm, we can add a MAP estimate that biases the prediction toward the a priori distribution $p(C_k)$ derived in the training step. This gives the following:

$$p(C_k \mid x) = \frac{p(x, C_k) + p(C_k)\lambda}{\sum_k p(x, C_k) + \lambda}. \tag{5}$$

This is important, as it now enables us to determine how far predicted distributions fall from the original MAP estimate.
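A minimal reconstruction of this training and prediction procedure, assuming articles arrive as (feature counter, vote vector) pairs; the exact bookkeeping is our reading of the description above:

import numpy as np
from collections import defaultdict

K = 13  # number of emotion categories

def train_dpmap(articles):
    # articles: list of (feature_counter, vote_vector) pairs, where
    # vote_vector holds the K emotion vote counts for the article.
    dep_votes = defaultdict(lambda: np.zeros(K))
    prior = np.zeros(K)
    for feats, votes in articles:
        prior += votes
        for f, n in feats.items():
            dep_votes[f] += n * votes  # tally votes per dependency
    # normalize so that sum_k p(C_k | d) = 1 for each dependency d
    p_dep = {f: v / v.sum() for f, v in dep_votes.items()}
    return p_dep, prior / prior.sum()  # per-dependency dists, p(C_k)

def predict_dpmap(feats, p_dep, prior, lam=1.0):
    # Equation 4: sum the article's per-dependency distributions,
    # then Equation 5: MAP smoothing toward the prior p(C_k).
    joint = np.zeros(K)
    for f, n in feats.items():
        if f in p_dep:
            joint += n * p_dep[f]
    return (joint + lam * prior) / (joint.sum() + lam)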

5 Neural Networks

Neural networks are particularly appropriate for this task because of their ability to capture complex interactions and because an output layer with multiple dimensions can be used to predict the full emotion distribution in a single model. Each output node represents one emotion. When using a prediction from the neural net, any negative predictions are clipped to 0 and the remaining values are normalized to produce a probability distribution, as in the sketch below.
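A minimal sketch of this post-processing step (the uniform fallback for all-negative outputs is our assumption; the paper does not specify that edge case):

import numpy as np

def to_distribution(raw):
    """Clip negative outputs to 0 and renormalize to sum to 1."""
    p = np.clip(np.asarray(raw, float), 0.0, None)
    s = p.sum()
    # Fallback to uniform if every output was clipped (our assumption).
    return p / s if s > 0 else np.full(p.shape, 1.0 / p.size)

print(to_distribution([0.3, -0.1, 0.7]))  # -> [0.3, 0.0, 0.7]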

5.1 Shallow Neural Network (SNN)

A shallow neural network is one that has a single hidden layer. For the shallow neural net, we use unsupervised GloVe vectors (Pennington et al., 2014), which are continuous fixed-length vector representations of words that have been shown to encode some semantic meaning, as input features; the feature representation of the words in a phrase is the average of the corresponding word vectors. If we use multiple different texts as input to the neural network (e.g., both title and subtitle), the features for all of them are simply concatenated together.
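As a sketch of this feature construction, assuming a standard GloVe text file such as glove.6B.50d.txt (the loader and helper names are hypothetical):

import numpy as np

def load_glove(path):
    # Parse a pre-trained GloVe text file (e.g., glove.6B.50d.txt).
    vecs = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vecs[parts[0]] = np.asarray(parts[1:], dtype=float)
    return vecs

def phrase_features(text, vecs, dim=50):
    # Average the vectors of the in-vocabulary tokens of a phrase.
    words = [vecs[w] for w in text.lower().split() if w in vecs]
    return np.mean(words, axis=0) if words else np.zeros(dim)

# Title and subtitle features are concatenated, as described above:
# x = np.concatenate([phrase_features(title, vecs),
#                     phrase_features(subtitle, vecs)])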

We explore a variety of approaches to the shallow neural net:


• different activation functions

• different combinations of the title, subtitle, and body as input features

• different pre-trained GloVe vectors, of different dimensions

• instead of GloVe, using pre-trained word2vec vectors (Mikolov et al., 2013) of 300 dimensions that were trained on 100 billion words from a Google News dataset

• different numbers of hidden dimensions in the hidden layer

• different numbers of iterations

Most of these variations yield a J-S distance similar to the baseline’s 0.286. The best approach we find uses the tanh activation function, the title and subtitle as input features, 50-dimensional GloVe vectors pre-trained on Wikipedia 2014 and Gigaword 5, 30 hidden dimensions, and 300 iterations. This achieves a J-S distance of 0.277.
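The paper does not name its neural net implementation; as a minimal stand-in, scikit-learn’s MLPRegressor can mirror this best configuration. Synthetic data and the to_distribution helper from the sketch in Section 5 keep the example self-contained:

import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic stand-ins for averaged-GloVe features and target distributions.
rng = np.random.RandomState(0)
X_train = rng.rand(500, 100)               # title+subtitle features (2 x 50d)
Y_train = rng.dirichlet(np.ones(13), 500)  # 13-emotion distributions

# One hidden layer of 30 tanh units, 300 iterations, 13 outputs.
snn = MLPRegressor(hidden_layer_sizes=(30,), activation="tanh",
                   max_iter=300, random_state=0)
snn.fit(X_train, Y_train)
Y_pred = [to_distribution(row) for row in snn.predict(X_train[:5])]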

5.2 Recurrent Neural Network (RNN)

A recurrent neural network connects the output of some of the nodes to the input of nodes in a previous layer, forming a cycle. This allows it to exhibit some form of “memory” of the input, thus allowing the neural net to process data sequentially. We use the Passage library (IndicoDataSolutions/Passage) to train our RNN. Rather than using GloVe vectors, we employ a word embedding layer as part of the neural net so that the embeddings themselves can be modified during training. Every five epochs, the RNN’s performance is evaluated on a development set, and training is stopped once the J-S distance on the development set starts to increase. The model with the best development accuracy is used on the test set.

As above, we explore a variety of approaches to the recurrent neural net:

• different activation functions

• different thresholds for the minimum number of times a token has to appear to be included

• different combinations of the title, subtitle, and body as input features

• different sizes of all but the last output layer

• different regularizers

The RNN generally does better than the shallow neural net. Our best result uses an initial embedding layer of 256 dimensions, a gated recurrent layer (Cho et al., 2014) of 256 dimensions, followed by a final output layer of 13 dimensions for the 13 emotions, the sigmoid activation function with binary cross-entropy cost, using just the title, and using all tokens as input regardless of how many times they appear. This achieves a J-S distance of 0.248.
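A sketch of this configuration following the usage pattern in Passage’s README; the argument names and the hypothetical variables train_titles and train_dists are our reading of the library and may differ from the authors’ exact code:

from passage.preprocessing import Tokenizer
from passage.layers import Embedding, GatedRecurrent, Dense
from passage.models import RNN

# train_titles: list of title strings; train_dists: list of 13-dim
# emotion distributions (hypothetical names).
tokenizer = Tokenizer(min_df=1)  # keep all tokens, however rare
train_tokens = tokenizer.fit_transform(train_titles)

layers = [
    Embedding(size=256, n_features=tokenizer.n_features),
    GatedRecurrent(size=256),              # gated recurrent layer
    Dense(size=13, activation='sigmoid'),  # one output per emotion
]
model = RNN(layers=layers, cost='BinaryCrossEntropy')
model.fit(train_tokens, train_dists)
# Every five epochs we would re-check J-S distance on a development
# set and stop once it starts to rise, as described above.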

5.3 Alternative Ideas

One interesting idea we try is to subtract the average distribution from each data point, so that the values we are trying to predict are more closely clustered around 0, in the hope that the neural net can make more fine-tuned and accurate adjustments. However, for both the shallow neural net and the recurrent neural net, this seems to hurt performance.

6 Discussion of Results

A comparison of all results (baseline, DPMAP, SNN, RNN) is shown in Table 1. Notably, the table illustrates that the RNN has the best performance.

6.1 DPMAP Results

Our DPMAP results (computed over a test set spanning 30% of the web-scraped articles) demonstrate that there is too much sparsity in the dataset for the model to have significant predictive power. As a result, many of the predicted distributions show little deviation from the average emotions (baseline), and DPMAP fails to improve over the baseline. Some example results of DPMAP prediction are shown in Figure 3.

6.2 Shallow Neural Network

The shallow neural net is able to perform better than our baseline, but only minimally so. It is likely not a complex enough model to extract significant meaning from the BuzzFeed data. In addition, this suggests that averaging word vectors (which is what we use as input to our shallow neural net) does not adequately capture the meaning of a sentence or phrase (e.g., it is not able to capture compositionality information). While it is a relatively dense feature, allowing faster neural net training and presumably reducing variability in accuracy (since fewer weights have to be learned), more and better features are likely to help.

Figure 3: A comparison of well-performing (top) and poorly-performing (bottom) test examples after training with DPMAP. DPMAP tends to predict close to the average-emotions (baseline) method due to feature sparsity; notice the corresponding lack of change in the black (predicted) bar heights.

6.3 Recurrent Neural Network

The recurrent neural net is our best model. Although it performs relatively well, during training we notice that the number of iterations has a large effect on accuracy. At first, more iterations increase accuracy on a separate development set, as expected, but after a relatively small number of iterations (for us, typically around 20), training accuracy continues to improve while testing accuracy gets worse, implying the model begins to overfit the data. While we tried some regularization approaches, they did not seem to help. Having more data, however, would be beneficial, since then the training data would be more representative of general patterns and would give the model more information to learn from.

Figure 4: A comparison of randomly selected test examples showing the variation in J-S distance J after training with the RNN.

Figure 5: A comparison of model predictions (panels: Mean Emotion Baseline, Shallow Neural Network, Recurrent Neural Network, Actual), ordered by actual “love” emotion response. This heat map shows the prevalence of emotions in 2,537 test articles, with predictors tending toward the average emotion response. While not immediately noticeable, the RNN shows the most variability in prediction and, of all the models, most closely resembles the actual gradient in the “love” emotion.

Method                      J
Baseline: Average Emotion   0.283
Baseline: SNN (+ GloVe)     0.277
Approach: DPMAP**           0.317
Approach: RNN               0.248

Table 1: Compared to the baseline (average emotion), the RNN performs better, whereas the DPMAP method performs worse. **(70/30 split)

7 Conclusion

We present a novel dataset of BuzzFeed articles containing text and a corresponding distribution of emotional reactions. Our baseline predictor of just using the average emotional distribution actually works rather well, producing an average J-S distance of 0.286. Our best model, a recurrent neural network with a word embedding layer and a gated recurrent layer, achieves an improved J-S distance of 0.248. We have also evaluated dependency parses and shallow neural networks, determining that it is difficult to predict much better than the average distribution.

Future work would involve adding more training data and conducting a more detailed hyperparameter search. The disappointing results of the dependency method (DPMAP) suggest that further processing of raw dependency features (e.g., incorporating GloVe and word embeddings into the representation) may be necessary to improve that particular model. Another potential area for improvement would be to incorporate an “emotion manifold” (Kim et al., 2012) into our work, which probabilistically models how the different reactions interact with each other. A potentially bigger improvement, stepping beyond natural language understanding, would be to integrate article images into our models to get a better picture of an article’s content.

References

Cho, K., van Merriënboer, B., Bahdanau, D., & Bengio, Y. 2014. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259.

IndicoDataSolutions/Passage. GitHub. Web. 10 June 2015.

Kim, S., Li, F., Lebanon, G., & Essa, I. 2012. Beyond sentiment: The manifold of human emotions. arXiv preprint arXiv:1202.1568.

Le, Q. V., & Mikolov, T. 2014. Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053.

Lin, K. H. Y., & Chen, H. H. 2008, October. Ranking reader emotions using pairwise loss minimization and emotional distribution regression. Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 136-144). Association for Computational Linguistics.

Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J., & McClosky, D. 2014, June. The Stanford CoreNLP natural language processing toolkit. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations (pp. 55-60).

Mikolov, T., Chen, K., Corrado, G., & Dean, J. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Pang, B., Lee, L., & Vaithyanathan, S. 2002. Thumbs up? Sentiment classification using machine learning techniques. Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, Volume 10 (pp. 79-86). Association for Computational Linguistics.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Duchesnay, E. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825-2830.

Pennington, J., Socher, R., & Manning, C. D. 2014. GloVe: Global vectors for word representation. Proceedings of Empirical Methods in Natural Language Processing (EMNLP 2014) (pp. 1532-1543).

Rao, Y., Lei, J., Wenyin, L., Li, Q., & Chen, M. 2014. Building emotional dictionary for sentiment analysis of online news. World Wide Web, 17(4), 723-742.

Socher, R., Pennington, J., Huang, E. H., Ng, A. Y., & Manning, C. D. 2011, July. Semi-supervised recursive autoencoders for predicting sentiment distributions. Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 151-161). Association for Computational Linguistics.

Strapparava, C., & Mihalcea, R. 2008, March. Learning to identify emotions in text. Proceedings of the 2008 ACM Symposium on Applied Computing (pp. 1556-1560). ACM.

Sudhof, M., Gomez Emilsson, A., Maas, A. L., & Potts, C. 2014, August. Sentiment expression conditioned by affective transitions and social forces. Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1136-1145). ACM.

Turney, P. D. 2002, July. Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 417-424). Association for Computational Linguistics.

Wilson, T., Wiebe, J., & Hwa, R. 2004, July. Just how mad are you? Finding strong and weak opinion clauses. AAAI (Vol. 4, pp. 761-769).