
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 325–335, Baltimore, Maryland, USA, June 23-25 2014. ©2014 Association for Computational Linguistics

Context-aware Learning for Sentence-level Sentiment Analysis with Posterior Regularization

Bishan Yang
Department of Computer Science
Cornell University
[email protected]

Claire Cardie
Department of Computer Science
Cornell University
[email protected]

Abstract

This paper proposes a novel context-aware method for analyzing sentiment at the level of individual sentences. Most existing machine learning approaches suffer from limitations in the modeling of complex linguistic structures across sentences and often fail to capture non-local contextual cues that are important for sentiment interpretation. In contrast, our approach allows structured modeling of sentiment while taking into account both local and global contextual information. Specifically, we encode intuitive lexical and discourse knowledge as expressive constraints and integrate them into the learning of conditional random field models via posterior regularization. The context-aware constraints provide additional power to the CRF model and can guide semi-supervised learning when labeled data is limited. Experiments on standard product review datasets show that our method outperforms the state-of-the-art methods in both the supervised and semi-supervised settings.

1 Introduction

The ability to extract sentiment from text is crucial for many opinion-mining applications such as opinion summarization, opinion question answering and opinion retrieval. Accordingly, extracting sentiment at the fine-grained level (e.g. at the sentence or phrase level) has received increasing attention recently due to its challenging nature and its importance in supporting these opinion analysis tasks (Pang and Lee, 2008).

In this paper, we focus on the task of sentence-level sentiment classification in online reviews. Typical approaches to the task employ supervised machine learning algorithms with rich features and take into account the interactions between words to handle compositional effects such as polarity reversal (e.g. (Nakagawa et al., 2010; Socher et al., 2013)). Still, these methods can encounter difficulty when the sentence on its own does not contain strong enough sentiment signals (due to the lack of statistical evidence or the requirement for background knowledge). Consider the following review for example:

1. Hearing the music in real stereo is a true revelation.
2. You can feel that the music is no longer constrained by the mono recording.
3. In fact, it is more like the players are performing on a stage in front of you ...

Existing feature-based classifiers may be effective in identifying the positive sentiment of the first sentence due to the use of the word revelation, but they could be less effective on the last two sentences due to the lack of explicit sentiment signals. However, if we examine these sentences within the discourse context, we can see that the second sentence expresses sentiment towards the same aspect – the music – as the first sentence, and the third sentence expands the second sentence with the discourse connective In fact. These discourse-level relations help indicate that sentences 2 and 3 are likely to have positive sentiment as well.

The importance of discourse for sentiment analysis has become increasingly recognized. Most existing work considers discourse relations between adjacent sentences or clauses and incorporates them as constraints (Kanayama and Nasukawa, 2006; Zhou et al., 2011) or as features in classifiers (Trivedi and Eisenstein, 2013; Lazaridou et al., 2013). Very little work has explored long-distance discourse relations for sentiment analysis. Somasundaran et al. (2008) define coreference relations on opinion targets and apply them to constrain the polarity of sentences.


However, the discourse relations were obtained from fine-grained annotations and implemented as hard constraints on polarity.

Obtaining sentiment labels at the fine-grained level is costly. Semi-supervised techniques have been proposed for sentence-level sentiment classification (Täckström and McDonald, 2011a; Qu et al., 2012). However, they rely on a large amount of document-level sentiment labels that may not be naturally available in many domains.

In this paper, we propose a sentence-level sentiment classification method that can (1) incorporate rich discourse information at both local and global levels; (2) encode discourse knowledge as soft constraints during learning; and (3) make use of unlabeled data to enhance learning. Specifically, we use the Conditional Random Field (CRF) model as the learner for sentence-level sentiment classification, and incorporate rich discourse and lexical knowledge as soft constraints into the learning of CRF parameters via Posterior Regularization (PR) (Ganchev et al., 2010). As a framework for structured learning with constraints, PR has been successfully applied to many structural NLP tasks (Ganchev et al., 2009; Ganchev et al., 2010; Ganchev and Das, 2013). Our work is the first to explore PR for sentiment analysis. Unlike most previous work, we explore a rich set of structural constraints that cannot be naturally encoded in the feature-label form, and show that such constraints can improve the performance of the CRF model.

We evaluate our approach on the sentence-level sentiment classification task using two standard product review datasets. Experimental results show that our model outperforms state-of-the-art methods in both the supervised and semi-supervised settings. We also show that discourse knowledge is highly useful for improving sentence-level sentiment classification.

2 Related Work

There has been a large amount of work on sentiment analysis at various levels of granularity (Pang and Lee, 2008). In this paper, we focus on the study of sentence-level sentiment classification. Existing machine learning approaches for the task can be classified based on the use of two ideas. The first idea is to exploit sentiment signals at the sentence level by learning the relevance of sentiment and words while taking into account the context in which they occur: Nakagawa et al. (2010) use a tree-CRF to model word interactions based on dependency tree structures; Choi and Cardie (2008) apply compositional inference rules to handle polarity reversal; Socher et al. (2011) and Socher et al. (2013) compute compositional vector representations for words and phrases and use them as features in a classifier.

The second idea is to exploit sentiment signals at the inter-sentential level. Polanyi and Zaenen (2006) argue that discourse structure is important in polarity classification. Various attempts have been made to incorporate discourse relations into sentiment analysis: Pang and Lee (2004) explored the consistency of subjectivity between neighboring sentences; Mao and Lebanon (2007), McDonald et al. (2007), and Täckström and McDonald (2011a) developed structured learning models to capture sentiment dependencies between adjacent sentences; Kanayama and Nasukawa (2006) and Zhou et al. (2011) use discourse relations to constrain two text segments to have either the same polarity or opposite polarities; Trivedi and Eisenstein (2013) and Lazaridou et al. (2013) encode discourse connectors as model features in supervised classifiers. Very little work has explored long-distance discourse relations. Somasundaran et al. (2008) define opinion target relations and apply them to constrain the polarity of text segments annotated with target relations. Recently, Zhang et al. (2013) explored the use of explanatory discourse relations as soft constraints in a Markov Logic Network framework for extracting subjective text segments.

Leveraging both ideas, our approach exploits sentiment signals from both intra-sentential and inter-sentential context. It has the advantages of utilizing rich discourse knowledge at different levels of context and encoding it as soft constraints during learning.

Our approach is also semi-supervised. Compared to the existing work on semi-supervised learning for sentence-level sentiment classification (Täckström and McDonald, 2011a; Täckström and McDonald, 2011b; Qu et al., 2012), our work does not rely on a large amount of coarse-grained (document-level) labeled data; instead, distant supervision mainly comes from linguistically-motivated constraints.

Our work also relates to the study of posterior regularization (PR) (Ganchev et al., 2010). PR has been successfully applied to many structured NLP tasks such as dependency parsing, information extraction and cross-lingual learning tasks (Ganchev et al., 2009; Bellare et al., 2009; Ganchev et al., 2010; Ganchev and Das, 2013). Most previous work using PR mainly experiments with feature-label constraints. In contrast, we explore a rich set of linguistically-motivated constraints which cannot be naturally formulated in the feature-label form. We also show that constraints derived from the discourse context can be highly useful for disambiguating sentence-level sentiment.

3 Approach

In this section, we present the details of our proposed approach. We formulate the sentence-level sentiment classification task as a sequence labeling problem. The inputs to the model are sentence-segmented documents annotated with sentence-level sentiment labels (positive, negative or neutral), along with a set of unlabeled documents. During prediction, the model outputs sentiment labels for the sequence of sentences in the test document. We utilize conditional random fields and use Posterior Regularization (PR) to learn their parameters with a rich set of context-aware constraints.

In what follows, we first briefly describe the framework of Posterior Regularization. Then we introduce the context-aware constraints derived from intuitive discourse and lexical knowledge. Finally, we describe how to perform learning and inference with these constraints.

3.1 Posterior Regularization

PR is a framework for structured learning with constraints (Ganchev et al., 2010). In this work, we apply PR in the context of CRFs for sentence-level sentiment classification.

Denote x as a sequence of sentences within a document and y as a vector of sentiment labels associated with x. The CRF models the following conditional probabilities:

  p_\theta(y \mid x) = \frac{\exp(\theta \cdot f(x, y))}{Z_\theta(x)}

where f(x, y) are the model features, θ are the model parameters, and Z_\theta(x) = \sum_y \exp(\theta \cdot f(x, y)) is a normalization constant. The objective function for a standard CRF is to maximize the log-likelihood over a collection of labeled documents plus a regularization term:

  \max_\theta \mathcal{L}(\theta) = \max_\theta \sum_{(x, y)} \log p_\theta(y \mid x) - \frac{\|\theta\|_2^2}{2\delta^2}

PR makes the assumption that the labeled data we have is not enough for learning good model parameters, but that we have a set of constraints on the posterior distribution of the labels. We can define the set of desirable posterior distributions as

  Q = \{ q(Y) : E_q[\phi(X, Y)] = b \}   (1)

where φ is a constraint function and b is a vector of desired values of the expectations of the constraint functions under the distribution q.¹ Note that the distribution q is defined over a collection of unlabeled documents where the constraint functions apply, and we assume independence between documents.

1 In general, inequality constraints can also be used. We focus on equality constraints since we found them to express the sentiment-relevant constraints well.

The PR objective can be written as the original model objective penalized with a regularization term, which minimizes the KL-divergence between the desired model posteriors and the learned model posteriors with an L2 penalty² for the constraint violations:

  \max_\theta \mathcal{L}(\theta) - \min_{q \in Q} \left\{ KL(q(Y) \,\|\, p_\theta(Y \mid X)) + \beta \|E_q[\phi(X, Y)] - b\|_2^2 \right\}   (2)

2 Other convex functions can be used for the penalty. We use the L2 norm because it works well in practice; β is a regularization constant.

The objective can be optimized by an EM-like scheme that iteratively solves the minimization problem and the maximization problem. Solving the minimization problem is equivalent to solving its dual, since the objective is convex. The dual problem is

  \arg\max_\lambda \; \lambda \cdot b - \log Z_\lambda(X) - \frac{1}{4\beta} \|\lambda\|_2^2   (3)

We optimize objective (2) using stochastic projected gradient, and compute the learning rate using AdaGrad (Duchi et al., 2010).
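To make the optimization scheme concrete, the following is a minimal Python sketch of the EM-like loop, assuming the expectation routines (computed by forward-backward or sampling) are supplied by the caller; all names, signatures, and the per-document dual-ascent scheduling are illustrative assumptions, not the authors' implementation.

import numpy as np

def train_pr(docs, theta, lam, b, e_feat_gold, e_feat_q, e_constr_q,
             n_epochs=20, beta=0.1, lr=0.5):
    # e_feat_gold(doc, theta, lam): gold (or q-expected) feature counts
    # e_feat_q(doc, theta, lam):    feature expectations under q
    # e_constr_q(doc, theta, lam):  constraint expectations E_q[phi]
    G_theta = np.zeros_like(theta)  # AdaGrad accumulators
    G_lam = np.zeros_like(lam)
    for _ in range(n_epochs):
        for doc in docs:
            # Dual ascent step on objective (3), applied per document for
            # simplicity: gradient = b - E_q[phi] - lambda / (2 * beta).
            g = b - e_constr_q(doc, theta, lam) - lam / (2.0 * beta)
            G_lam += g * g
            lam = lam + lr * g / (np.sqrt(G_lam) + 1e-8)
            # Stochastic gradient step on theta: gold feature counts minus
            # feature expectations under q. Any projection required by the
            # constrained dual is omitted in this sketch.
            g = e_feat_gold(doc, theta, lam) - e_feat_q(doc, theta, lam)
            G_theta += g * g
            theta = theta + lr * g / (np.sqrt(G_theta) + 1e-8)
    return theta, lam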

3.2 Context-aware Posterior Constraints

We develop a rich set of context-aware posterior constraints for sentence-level sentiment analysis by exploiting lexical and discourse knowledge. Specifically, we construct the lexical constraints by extracting sentiment-bearing patterns within sentences, and construct the discourse-level constraints by extracting discourse relations that indicate sentiment coherence or sentiment changes both within and across sentences. Each constraint can be formulated as an equality between the expectation of a constraint function value and a desired value set by prior knowledge. The equality is not strictly enforced (due to the regularization in the PR objective (2)); therefore all the constraints are applied as soft constraints. Table 1 provides intuitive descriptions and examples for all the constraints used in our model.

Lexical Patterns. The existence of a polarity-carrying word alone may not correctly indicate the polarity of the sentence, as the polarity can be reversed by other polarity-reversing words. We extract lexical patterns that consist of polar words and negators³, and apply the heuristics based on compositional semantics (Choi and Cardie, 2008) to assign a sentiment value to each pattern.

We encode the extracted lexical patterns along with their sentiment values as feature-label constraints. The constraint function can be written as

  \phi_w(x, y) = \sum_i f_w(x_i, y_i)

where f_w(x_i, y_i) is a feature function which has value 1 when sentence x_i contains the lexical pattern w and its sentiment label y_i equals the expected sentiment value, and has value 0 otherwise. The constraint expectation value is set to be the prior probability of associating w with its sentiment value. Note that sentences with neutral sentiment can also contain such lexical patterns. Therefore we allow the lexical patterns to be assigned a neutral sentiment with a prior probability r_0 (we compute this value as the empirical probability of neutral sentiment in the training documents). Using the polarity indicated by lexical patterns to constrain the sentiment of sentences is quite aggressive. Therefore we only consider lexical patterns that are strongly discriminative (many opinion words in the lexicon only indicate sentiment with weak strength). The selected lexical patterns include a handful of seed patterns (such as "pros" and "cons") and the lexical patterns that have high precision (larger than 0.9) of predicting sentiment in the training data.

3 The polar words are identified using the MPQA lexicon, and the negators are identified using a handful of seed words extended by the General Inquirer dictionary and WordNet, as described in (Choi and Cardie, 2008).
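As an illustration, here is a minimal sketch of the lexical constraint function φ_w, assuming a simplified substring pattern matcher and sentences represented as strings; the function names are hypothetical.

def f_w(sentence, label, pattern, expected_label):
    # Feature function: 1 if sentence x_i contains lexical pattern w and
    # its label y_i equals the sentiment value expected for w, else 0.
    return 1.0 if pattern in sentence and label == expected_label else 0.0

def phi_w(sentences, labels, pattern, expected_label):
    # Constraint function: sum of f_w over the sentences of one document.
    # E_q[phi_w] is constrained to the prior probability of associating
    # the pattern w with its sentiment value.
    return sum(f_w(x, y, pattern, expected_label)
               for x, y in zip(sentences, labels))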

Discourse Connectives. Lexical patterns can be limited in capturing contextual information since they only look at interactions between words within an expression. To capture context at the clause or sentence level, we consider discourse connectives, which are cue phrases or words that indicate discourse relations between adjacent sentences or clauses. To identify discourse connectives, we apply a discourse tagger trained on the Penn Discourse Treebank (Prasad et al., 2008)⁴ to our data. Discourse connectives are tagged with four senses: Expansion, Contingency, Comparison, Temporal.

Discourse connectives can operate at both the intra-sentential and the inter-sentential level. For example, the word "although" is often used to connect two polar clauses within a sentence, while the word "however" is often used at the beginning of a sentence to connect two polar sentences. It is important to distinguish these two types of discourse connectives. We consider a discourse connective to be intra-sentential if it has the Comparison sense and connects two polar clauses with opposite polarities (determined by the lexical patterns). We construct a feature-label constraint for each intra-sentential discourse connective and set its expected sentiment value to be neutral.

Unlike the intra-sentential discourse connectives, the inter-sentential discourse connectives can indicate sentiment transitions between sentences. Intuitively, discourse connectives with the senses of Expansion (e.g. also, for example, furthermore) and Contingency (e.g. as a result, hence, because) are likely to indicate sentiment coherence; discourse connectives with the sense of Comparison (e.g. but, however, nevertheless) are likely to indicate sentiment changes. This intuition is reasonable, but it assumes that the two sentences connected by the discourse connective are both polar sentences. In general, discourse connectives can also be used to connect non-polar (neutral) sentences. Thus it is hard to directly constrain the posterior expectation for each type of sentiment transition using inter-sentential discourse connectives.

Instead, we impose constraints on the model posteriors by reducing constraint violations.

4 http://www.cis.upenn.edu/~epitler/discourse.html


Types | Description and Examples | Inter-sentential
Lexical patterns | The sentence containing a polar lexical pattern w tends to have the polarity indicated by w. Example lexical patterns: annoying, hate, amazing, not disappointed, no concerns, favorite, recommend. |
Discourse connectives (clause) | The sentence containing a discourse connective c which connects two of its clauses with opposite polarities indicated by the lexical patterns tends to have neutral sentiment. Example connectives: while, although, though, but. |
Discourse connectives (sentence) | Two adjacent sentences connected by a discourse connective c tend to have the same polarity if c indicates an Expansion or Contingency relation (e.g. also, for example, in fact, because), and opposite polarities if c indicates a Comparison relation (e.g. otherwise, nevertheless, however). | ✓
Coreference | Sentences which contain coreferential entities appearing as targets of opinion expressions tend to have the same polarity. | ✓
Listing patterns | A series of sentences connected via a listing tend to have the same polarity. | ✓
Global labels | The sentence-level polarity tends to be consistent with the document-level polarity. | ✓

Table 1: Summarization of Posterior Constraints for Sentence-level Sentiment Classification

We define the following constraint function:

  \phi_{c,s}(x, y) = \sum_i f_{c,s}(x_i, y_i, y_{i-1})

where c denotes a discourse connective, s indicates its sense, and f_{c,s} is a penalty function that takes value 1.0 when y_i and y_{i−1} form a contradictory sentiment transition, that is, y_i ≠_polar y_{i−1} if s ∈ {Expansion, Contingency}, or y_i =_polar y_{i−1} if s = Comparison. The desired value for the constraint expectation is set to 0, so that the model is encouraged to have fewer constraint violations.
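A minimal sketch of the transition penalty f_{c,s}, under the assumption (consistent with the paper's violation-based treatment of neutral sentences) that the penalty only fires when both adjacent sentences carry polar labels; all names are illustrative.

POLAR = {"positive", "negative"}

def f_cs(sense, y_i, y_prev):
    # Penalty 1.0 for a contradictory sentiment transition across a
    # connective with sense s; neutral labels are never penalized
    # (an assumption in this sketch).
    if y_i not in POLAR or y_prev not in POLAR:
        return 0.0
    if sense in ("Expansion", "Contingency"):  # coherence senses
        return 1.0 if y_i != y_prev else 0.0   # y_i !=polar y_{i-1}
    if sense == "Comparison":
        return 1.0 if y_i == y_prev else 0.0   # y_i =polar y_{i-1}
    return 0.0  # Temporal connectives impose no constraint

def phi_cs(senses, labels):
    # Sum of penalties over adjacent sentence pairs; the desired
    # expectation is 0. senses[i] is the sense of the connective opening
    # sentence i, or None when there is no connective.
    return sum(f_cs(senses[i], labels[i], labels[i - 1])
               for i in range(1, len(labels)) if senses[i] is not None)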

Opinion Coreference. Sentences in a discourse can be linked by many types of coherence relations (Jurafsky et al., 2000). Coreference is one of the commonly used relations in written text. In this work, we explore coreference in the context of sentence-level sentiment analysis. We consider a set of polar sentences to be linked by the opinion coreference relation if they contain coreferring opinion-related entities. For example, the following sentences express opinions towards "the speaker phone", "The speaker phone" and "it" respectively. As these opinion targets are coreferential (referring to the same entity "the speaker phone"), they are linked by the opinion coreference relation.⁵

  My favorite features are the speaker phone and the radio. The speaker phone is very functional. I use it in the car, very audible even with freeway noise.

5 In general, the opinion-related entities include both the opinion targets and the opinion holders. In this work, we only consider the targets since we experiment with single-author product reviews. The opinion holders can be included in a similar way as the opinion targets.

Our coreference relations indicated by opinion targets overlap with the same target relation introduced in (Somasundaran et al., 2009). The differences are: (1) we encode the coreference relations as soft constraints during learning instead of applying them as hard constraints at inference time; (2) our constraints can apply to both polar and non-polar sentences; (3) our identification of coreference relations is automatic, without any fine-grained annotations for opinion targets.

To extract coreferential opinion targets, we apply the Stanford coreference system (Lee et al., 2013) to extract coreferential mentions in the document, and then apply a set of syntactic rules to identify opinion targets from the extracted mentions. The syntactic rules correspond to the shortest dependency paths between an opinion word and an extracted mention. We consider the 10 most frequent dependency paths in the training data. Example dependency paths include nsubj(opinion, mention), dobj(opinion, mention), and amod(mention, opinion).

For sentences connected by the opinion coreference relation, we expect their sentiment to be consistent. To encode this intuition, we define the following constraint function:

  \phi_{coref}(x, y) = \sum_{i,\; ant(i)=j,\; j \geq 0} f_{coref}(x_i, x_j, y_i, y_j)

where ant(i) denotes the index of the sentence which contains an antecedent target of the target mentioned in sentence i (the antecedent relations over pairs of opinion targets can be constructed using the coreference resolver), and f_{coref} is a penalty function which takes value 1.0 when the expected sentiment coherency is violated, that is, y_i ≠_polar y_j. Similar to the inter-sentential discourse connectives, modeling opinion coreference via constraint violations allows the model to handle neutral sentiment. The expected value of the constraint function is set to 0.

Listing Patterns. Another type of coherence relation we observe in online reviews is listing, where a reviewer expresses his/her opinions by listing a series of statements marked by a sequence of numbers. For example, "1. It's smaller than the ipod mini .... 2. It has a removable battery ....". We expect sentences connected by a listing to have consistent sentiment. We implement this constraint in the same form as the coreference constraint (the antecedent assignments are constructed from the numberings).
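Since the coreference and listing constraints share the same violation-counting form, a single sketch covers both; the antecedent-array encoding (index j for ant(i), -1 for none) is an assumed representation, and neutral labels are taken as unpenalized as the paper indicates.

def phi_coherence(labels, antecedent):
    # Violation count for the coreference (and listing) constraints:
    # sentence i and its antecedent sentence ant(i) = j are expected to
    # agree in polarity; neutral labels are not penalized.
    # Desired expectation is 0; antecedent[i] = -1 means no antecedent.
    polar = {"positive", "negative"}
    violations = 0.0
    for i, j in enumerate(antecedent):
        if j >= 0 and labels[i] in polar and labels[j] in polar \
                and labels[i] != labels[j]:
            violations += 1.0
    return violations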

Global Sentiment. Previous studies have demonstrated the value of document-level sentiment in guiding the semi-supervised learning of sentence-level sentiment (Täckström and McDonald, 2011b; Qu et al., 2012). In this work, we also take this information into account and encode it as posterior constraints. Note that these constraints are not necessary for our model and can be applied when the document-level sentiment labels are naturally available.

Based on an analysis of the Amazon review data, we observe that sentence-level sentiment usually does not conflict with the document-level sentiment in terms of polarity. For example, the proportion of negative sentences in positive documents is very small compared to the proportion of positive sentences. To encode this intuition, we define the following constraint function:

  \phi_g(x, y) = \frac{1}{n} \sum_{i=1}^{n} \delta(y_i \neq_{polar} g)

where g ∈ {positive, negative} denotes the sentiment value of a polar document, n is the total number of sentences in x, and δ is an indicator function. We want the expectation of the constraint function to take a small value. In our experiments, we set the expected value to be the empirical estimate of the probability of "conflicting" sentiment in polar documents using the training data.
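A minimal sketch of φ_g, assuming (per the ≠_polar relation) that only sentences with the polarity opposite to the document label count as conflicts; the names are illustrative.

def phi_g(labels, doc_polarity):
    # Fraction of sentences whose polarity conflicts with the document
    # label g in {positive, negative}; neutral sentences do not conflict
    # (an assumption in this sketch). The desired value is the empirical
    # rate of conflicting sentiment estimated from training documents.
    opposite = {"positive": "negative", "negative": "positive"}
    return sum(1.0 for y in labels
               if y == opposite[doc_polarity]) / len(labels)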

3.3 Training and Inference

During training, we need to compute the constraint expectations and the feature expectations under the auxiliary distribution q at each gradient step. We can derive q by solving the dual problem in (3):

  q(y \mid x) = \frac{\exp(\theta \cdot f(x, y) + \lambda \cdot \phi(x, y))}{Z_{\lambda,\theta}(x)}   (4)

where Z_{\lambda,\theta}(x) is a normalization constant. Most of our constraints can be factorized in the same way as the model features in the first-order CRF model, and we can compute the expectations under q very efficiently using the forward-backward algorithm. However, some of our discourse constraints (opinion coreference and listing) can break the tractable structure of the model. For constraints with higher-order structures, we use Gibbs sampling (Geman and Geman, 1984) to approximate the expectations. Given a sequence x, we sample a label y_i at each position i by computing the unnormalized conditional probabilities p(y_i = l | y_{−i}) ∝ exp(θ · f(x, y_i = l, y_{−i}) + λ · φ(x, y_i = l, y_{−i})) and renormalizing them. Since the possible label assignments only differ at position i, we can make the computation efficient by maintaining the structure of the coreference clusters and precomputing the constraint function for different types of violations.
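The per-position resampling step can be sketched as follows; `score` stands for the unnormalized log-score θ·f(x, y) + λ·φ(x, y) and is assumed to be supplied by the model. A real implementation would rescore only the factors touching position i, as described above; the full rescore here is a simplification.

import math
import random

LABELS = ("positive", "negative", "neutral")

def gibbs_step(x, y, i, score):
    # Resample y_i from p(y_i = l | y_-i) proportional to
    # exp(theta . f(x, y_i=l, y_-i) + lambda . phi(x, y_i=l, y_-i)).
    weights = []
    for l in LABELS:
        y[i] = l
        weights.append(math.exp(score(x, y)))
    r, acc = random.random() * sum(weights), 0.0
    for l, w in zip(LABELS, weights):
        acc += w
        if r <= acc:
            y[i] = l
            break
    return y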

During inference, we find the best label assignment by computing argmax_y q(y|x). For documents where the higher-order constraints apply, we use the same Gibbs sampler described above to infer the most likely label assignment; otherwise, we use the Viterbi algorithm.

4 Experiments

We experimented with two product review datasets for sentence-level sentiment classification: the Customer Review (CR) dataset (Hu and Liu, 2004)⁶, which contains 638 reviews of 14 products such as cameras and cell phones, and the Multi-domain Amazon (MD) dataset from the test set of Täckström and McDonald (2011a), which contains 294 reviews from 5 different domains. As in Qu et al. (2012), we chose the books, electronics and music domains for evaluation. Each domain also comes with 33,000 extra reviews with only document-level sentiment labels.

We evaluated our method in two settings: supervised and semi-supervised. In the supervised setting, we treated the test data as unlabeled data and performed transductive learning. In the semi-supervised setting, our unlabeled data consists of both the available unlabeled data and the test data. For each domain in the MD dataset, we made use of no more than 100 unlabeled documents in which our posterior constraints apply. We adopted the evaluation schemes used in previous work: 10-fold cross validation for the CR dataset and 3-fold cross validation for the MD dataset. We report both two-way classification (positive vs. negative) and three-way classification (positive, negative or neutral) results. We use accuracy as the performance measure. In our tables, boldface numbers are statistically significant by a paired t-test at p < 0.05 against the best baseline developed in this paper.⁷

6 Available at http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html.

We trained our model using a CRF incorporating the proposed posterior constraints. For the CRF features, we include the tokens, the part-of-speech tags, the prior polarities of lexical patterns indicated by the opinion lexicon and the negator lexicon, the number of positive and negative tokens, and the output of the vote-flip algorithm (Choi and Cardie, 2009). In addition, we include the discourse connectives as local or transition features and the document-level sentiment labels as features (only available in the MD dataset).

We set the CRF regularization parameter σ = 1, and set the posterior regularization parameters β and γ (a trade-off parameter we introduce to balance the supervised objective and the posterior regularizer in (2)) using grid search.⁸ For approximate inference with higher-order constraints, we perform 2000 Gibbs sampling iterations, where the first 1000 iterations are burn-in iterations. To make the results more stable, we construct three Markov chains that run in parallel, and select the sample with the largest objective value.

All posterior constraints were developed using the training data on each training fold. For the MD dataset, we also used the dvd domain as additional labeled data for developing the constraints.

Baselines. We compared our method to a number of baselines: (1) CRF: a CRF with the same set of model features as in our method. (2) CRF-INF: CRF augmented with inference constraints. We can incorporate the proposed constraints (constraints derived from lexical patterns and discourse connectives) as hard constraints into the CRF during inference by manually setting λ in equation (4) to a large value.⁹ When λ is large enough, this is equivalent to adding hard constraints to the Viterbi inference. To better understand the different effects of lexical and discourse constraints, we report results for applying only the lexical constraints (CRF-INFlex) as well as results for applying only the discourse constraints (CRF-INFdisc). (3) PRlex: a variant of our PR model which applies only the lexical constraints. For the three-way classification task on the MD dataset, we also implemented the following baselines: (4) VOTEFLIP: a rule-based algorithm that leverages the positive, negative and neutral cues along with the effect of negation to determine the sentence sentiment (Choi and Cardie, 2009). (5) DOCORACLE: assigns each sentence the label of its corresponding document.

7 Significance tests were not conducted over the previous methods as we do not have their results for each fold.

8 We conducted 10-fold cross-validation on each training fold with the parameter space β : [0.01, 0.05, 0.1, 0.5, 1.0] and γ : [0.1, 0.5, 1.0, 5.0, 10.0].

9 We set λ to 1000 for the lexical constraints and -1000 for the discourse connective constraints in the experiments.

Methods | CR | MD
CRF | 81.1 | 67.0
CRF-INFlex | 80.9 | 66.4
CRF-INFdisc | 81.1 | 67.2
PRlex | 81.8 | 69.7
PR | 82.7 | 70.6
Previous work:
TreeCRF (Nakagawa et al., 2010) | 81.4 | -
Dropout LR (Wang and Manning, 2013) | 82.1 | -

Table 2: Accuracy results (%) for supervised sentiment classification (two-way)

Methods | Books | Electronics | Music | Avg
VoteFlip | 44.6 | 45.0 | 47.8 | 45.8
DocOracle | 53.6 | 50.5 | 63.0 | 55.7
CRF | 57.4 | 57.5 | 61.8 | 58.9
CRF-INFlex | 56.7 | 56.4 | 60.4 | 57.8
CRF-INFdisc | 57.2 | 57.6 | 62.1 | 59.0
PRlex | 60.3 | 59.9 | 63.2 | 61.1
PR | 61.6 | 61.0 | 64.4 | 62.3
Previous work:
HCRF | 55.9 | 61.0 | 58.7 | 58.5
MEM | 59.7 | 59.6 | 63.8 | 61.0

Table 3: Accuracy results (%) for semi-supervised sentiment classification (three-way) on the MD dataset

4.1 Results

We first report results on a binary (positive or negative) sentence-level sentiment classification task. For this task, we used the supervised setting and performed transductive learning for our model. Table 2 shows the accuracy results.

Methods | Books pos/neg/neu | Electronics pos/neg/neu | Music pos/neg/neu
VoteFlip | 43/42/47 | 45/46/44 | 50/46/46
DocOracle | 54/60/49 | 57/54/42 | 72/65/52
CRF | 47/51/64 | 60/61/52 | 67/60/58
CRF-INFlex | 46/52/63 | 59/61/50 | 65/59/57
CRF-INFdisc | 47/51/64 | 60/61/52 | 67/61/59
PRlex | 50/56/66 | 64/63/53 | 67/64/59
PR | 52/56/68 | 64/66/53 | 69/65/60

Table 4: F1 scores for each sentiment category (positive, negative and neutral) for semi-supervised sentiment classification on the MD dataset

We can see that PR significantly outperforms all other baselines on both the CR dataset and the MD dataset (average accuracy across domains is reported). The poor performance of CRF-INFlex indicates that directly applying lexical constraints as hard constraints during inference can only hurt the performance. CRF-INFdisc slightly outperforms CRF, but the improvement is not significant. In contrast, both PRlex and PR significantly outperform CRF, which implies that incorporating lexical and discourse constraints as posterior constraints is much more effective. The superior performance of PR over PRlex further suggests that the proper use of discourse information can significantly improve accuracy for sentence-level sentiment classification.

We also analyzed the model's performance on a three-way sentiment classification task. By introducing the "neutral" category, the sentiment classification problem becomes harder. Table 3 shows the results in terms of accuracy for each domain in the MD dataset. We can see that both PR and PRlex significantly outperform all other baselines in all domains. The rule-based baseline VOTEFLIP gave the weakest performance because it has no prediction power on sentences with no opinion words. DOCORACLE performs much better than VOTEFLIP, and performs especially well on the Music domain. This indicates that the document-level sentiment is a very strong indicator of the sentence-level sentiment label. For the CRF baseline and its variants, we observe a similar performance trend as in the two-way classification task: there is nearly no performance improvement from applying the lexical and discourse-connective-based constraints during CRF inference. In contrast, both PRlex and PR provide substantial improvements over CRF. This confirms that encoding lexical and discourse knowledge as posterior constraints allows the feature-based model to gain additional learning power for sentence-level sentiment prediction. In particular, incorporating discourse constraints leads to consistent improvements in our model. This demonstrates that our modeling of discourse information is effective and that taking into account the discourse context is important for improving sentence-level sentiment analysis. We also compare our results to previously published results on the same dataset. HCRF (Täckström and McDonald, 2011a) and MEM (Qu et al., 2012) are two state-of-the-art semi-supervised methods for sentence-level sentiment classification. We can see that our best model PR gives the best results in most domains.

Table 4 shows the results in terms of F1 scores for each sentiment category (positive, negative and neutral). We can see that the PR models are able to provide improvements over all the sentiment categories compared to all the baselines in general. We observe that the DOCORACLE baseline provides very strong F1 scores on the positive and negative categories, especially in the Books and Music domains, but very poor F1 on the neutral category. This is because it over-predicts the polar sentences in the polar documents, and predicts no polar sentences in the neutral documents. In contrast, our PR models provide more balanced F1 scores among all the sentiment categories. Compared to the CRF baseline and its variants, we found that the PR models can greatly improve the precision of predicting positive and negative sentences, resulting in a significant improvement on the positive/negative F1 scores. However, the improvement on the neutral category is modest. A plausible explanation is that most of our constraints focus on discriminating polar sentences. They can help reduce the errors of misclassifying polar sentences, but the model needs more constraints in order to distinguish neutral sentences from polar sentences. We plan to address this issue in future work.

4.2 Discussion

We analyze the errors to better understand the merits and limitations of the PR model. We found that the PR model is able to correct many CRF errors caused by the lack of labeled data. The first row in Table 5 shows an example of such errors.


Example Sentences | CRF | PR
Example 1: ⟨neg⟩ If I could, I would like to return it or exchange for something better. ⟨/neg⟩ | ⟨neu⟩ ✗ | ✓
Example 2: ⟨neg⟩ Things I wasn't a fan of – the ending was to cutesy for my taste. ⟨/neg⟩ ⟨neg⟩ Also, all of the side characters (particularly the mom, vee, and the teacher) were incredibly flat and stereotypical to me. ⟨/neg⟩ | ⟨neu⟩ ⟨pos⟩ ✗ | ✓
Example 3: ⟨neg⟩ I also have excessive noise when I talk and have phone in my pocket while walking. ⟨/neg⟩ ⟨neu⟩ But other models are no better. ⟨/neu⟩ | ⟨neg⟩ ⟨pos⟩ ✗ | ⟨neg⟩ ⟨pos⟩ ✗

Table 5: Example sentences where PR succeeds and fails to correct the mistakes of CRF

The lexical features return and exchange may be good indicators of negative sentiment for the sentence. However, with limited labeled data, the CRF learner can only associate very weak sentiment signals to these features. In contrast, the PR model is able to associate stronger sentiment signals to these features by leveraging unlabeled data for indirect supervision. A simple lexicon-based constraint during inference time may also correct this case. However, hard-constraint baselines can hardly improve the performance in general because the contributions of different constraints are not learned and their combination may not lead to better predictions. This is also demonstrated by the limited performance of CRF-INF in our experiments.

We also found that the discourse constraints play an important role in improving sentiment prediction. The lexical constraints alone are often not sufficient, since their coverage is limited by the sentiment lexicon and they can only constrain sentiment locally. On the contrary, discourse constraints do not depend on sentiment lexicons, and, more importantly, they can provide sentiment preferences on multiple sentences at the same time. When combining discourse constraints with features from different sentences, the PR model becomes more powerful in disambiguating sentiment. The second example in Table 5 shows that the PR model learned with discourse constraints correctly predicts the sentiment of two sentences where no lexical constraints apply.

However, discourse constraints are not always helpful. One reason is that they do not constrain the neutral sentiment. As a result, they cannot help disambiguate neutral sentiment from polar sentiment, as in the third example in Table 5. This is also a problem for most of our lexical constraints. In general, it is hard to learn reliable indicators for the neutral sentiment. In the MD dataset, a neutral label may be given because the sentence contains mixed sentiment or no sentiment, or because it is off-topic. We plan to explore more refined constraints that can deal with the neutral sentiment in future work. Another limitation of the discourse constraints is that they can be affected by errors of the discourse parser and the coreference resolver. A potential way to address this issue is to learn discourse constraints jointly with sentiment. We plan to study this in future research.

5 Conclusion

In this paper, we propose a context-aware approach for learning sentence-level sentiment. Our approach incorporates intuitive lexical and discourse knowledge as expressive constraints while training a conditional random field model via posterior regularization. We explore a rich set of context-aware constraints at both the intra- and inter-sentential levels, and demonstrate their effectiveness in the analysis of sentence-level sentiment. While we focus on the sentence-level task, our approach can be easily extended to handle sentiment analysis at finer levels of granularity. Our experiments show that our model achieves better accuracy than existing supervised and semi-supervised models for the sentence-level sentiment classification task.

Acknowledgments

This work was supported in part by DARPA-BAA-12-47 DEFT grant #12475008 and NSF grant BCS-0904822. We thank Igor Labutov for helpful discussion and suggestions; Oscar Täckström and Lizhen Qu for providing their Amazon review datasets; and the anonymous reviewers for helpful comments and suggestions.

References

Kedar Bellare, Gregory Druck, and Andrew McCallum. 2009. Alternating projections for learning with expectation constraints. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 43–50. AUAI Press.

Yejin Choi and Claire Cardie. 2008. Learning with compositional semantics as structural inference for subsentential sentiment analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 793–801. Association for Computational Linguistics.

Yejin Choi and Claire Cardie. 2009. Adapting a polarity lexicon using integer linear programming for domain-specific sentiment classification. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 590–598. Association for Computational Linguistics.

John Duchi, Elad Hazan, and Yoram Singer. 2010. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159.

Kuzman Ganchev and Dipanjan Das. 2013. Cross-lingual discriminative learning of sequence models with posterior regularization.

Kuzman Ganchev, Jennifer Gillenwater, and Ben Taskar. 2009. Dependency grammar induction via bitext projection constraints. In Proceedings of ACL-IJCNLP, pages 369–377.

Kuzman Ganchev, Joao Graca, Jennifer Gillenwater, and Ben Taskar. 2010. Posterior regularization for structured latent variable models. The Journal of Machine Learning Research, 99:2001–2049.

Stuart Geman and Donald Geman. 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):721–741.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168–177. ACM.

Dan Jurafsky, James H. Martin, Andrew Kehler, Keith Vander Linden, and Nigel Ward. 2000. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, volume 2. MIT Press.

Hiroshi Kanayama and Tetsuya Nasukawa. 2006. Fully automatic lexicon expansion for domain-oriented sentiment analysis. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 355–363. Association for Computational Linguistics.

Angeliki Lazaridou, Ivan Titov, and Caroline Sporleder. 2013. A Bayesian model for joint unsupervised induction of sentiment, aspect and discourse representations. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, Sofia, Bulgaria, August. Association for Computational Linguistics.

Heeyoung Lee, Angel Chang, Yves Peirsman, Nathanael Chambers, Mihai Surdeanu, and Dan Jurafsky. 2013. Deterministic coreference resolution based on entity-centric, precision-ranked rules.

Yi Mao and Guy Lebanon. 2007. Isotonic conditional random fields and local sentiment flow. Advances in Neural Information Processing Systems, 19:961.

Ryan McDonald, Kerry Hannan, Tyler Neylon, Mike Wells, and Jeff Reynar. 2007. Structured models for fine-to-coarse sentiment analysis. In Annual Meeting-Association for Computational Linguistics, volume 45, page 432.

Tetsuji Nakagawa, Kentaro Inui, and Sadao Kurohashi. 2010. Dependency tree-based sentiment classification using CRFs with hidden variables. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 786–794. Association for Computational Linguistics.

Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, page 271. Association for Computational Linguistics.

Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Now Publishers.

Livia Polanyi and Annie Zaenen. 2006. Contextual valence shifters. In Computing Attitude and Affect in Text: Theory and Applications, pages 1–10. Springer.

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind K. Joshi, and Bonnie L. Webber. 2008. The Penn Discourse TreeBank 2.0. In LREC.

Lizhen Qu, Rainer Gemulla, and Gerhard Weikum. 2012. A weakly supervised model for sentence-level semantic orientation analysis with multiple experts. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 149–159. Association for Computational Linguistics.

Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, and Christopher D. Manning. 2011. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 151–161. Association for Computational Linguistics.

Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP.

Swapna Somasundaran, Janyce Wiebe, and Josef Ruppenhofer. 2008. Discourse level opinion interpretation. In Proceedings of the 22nd International Conference on Computational Linguistics, pages 801–808. Association for Computational Linguistics.

Swapna Somasundaran, Galileo Namata, Janyce Wiebe, and Lise Getoor. 2009. Supervised and unsupervised methods in employing discourse relations for improving opinion polarity classification. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 170–179. Association for Computational Linguistics.

Oscar Täckström and Ryan McDonald. 2011a. Discovering fine-grained sentiment with latent variable structured prediction models. In Advances in Information Retrieval, pages 368–374. Springer.

Oscar Täckström and Ryan McDonald. 2011b. Semi-supervised latent variable models for sentence-level sentiment analysis.

Rakshit Trivedi and Jacob Eisenstein. 2013. Discourse connectors for latent subjectivity in sentiment analysis. In Proceedings of NAACL-HLT, pages 808–813.

Sida Wang and Christopher Manning. 2013. Fast dropout training. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 118–126.

Qi Zhang, Jin Qian, Huan Chen, Jihua Kang, and Xuanjing Huang. 2013. Discourse level explanatory relation extraction from product reviews using first-order logic.

Lanjun Zhou, Binyang Li, Wei Gao, Zhongyu Wei, and Kam-Fai Wong. 2011. Unsupervised discovery of discourse relations for eliminating intra-sentence polarity ambiguities. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 162–171. Association for Computational Linguistics.